YOLOv7 in Thermal Real-Time Applications For DS 4440

1. Introduction

In our exploration of object detection, YOLOv7 [1] stood out as a leading model in the "You Only Look Once" series, known for its balance of speed and accuracy. YOLOv7 is a strong choice for real-time applications such as surveillance systems and autonomous vehicles, where objects must be detected rapidly and reliably. It typically excels on standard RGB imagery, where it can capitalize on rich, color-based information to identify and classify diverse objects in varied environments. However, a notable limitation of RGB data is its reduced effectiveness under poor lighting or obscured conditions, which degrades the input available to the model and hinders accurate detection. Our project seeks to explore and assess how effectively a nano version of YOLOv7 adapts to and performs with thermal data, specifically using a public dataset from Teledyne FLIR that captures heat emissions instead of visible light. This approach offers a promising alternative for data collection and detection tasks in real-time applications that face visibility challenges.

2. Understanding the Paper

The paper "Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors" [1] proposes YOLOv7, a popular object detection algorithm based on the original YOLO architecture proposed in 2016 [2]. The paper primarily focuses on optimizing the architecture of previous YOLO iterations while introducing training optimizations that improve the accuracy of object detection without increasing inference cost. In YOLO models, features are extracted from input images by a backbone, optionally refined and enhanced in a neck, and passed into the detection head, which outputs bounding box predictions for objects detected in the input frame. The backbone is a CNN variant, commonly Darknet or CSPDarknet, with downsampling operations that reduce the spatial dimensions of the feature map while adding channels. The neck generally consists of convolutional layers, skip connections, or other modules aimed at refining the output of the backbone. Finally, the head applies a series of convolutional and fully connected layers, followed by any necessary post-processing to produce the final detections.
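This backbone/neck/head flow can be sketched with a toy PyTorch model. The layer sizes, activations, and downsampling schedule here are illustrative, not YOLOv7's actual configuration:

```python
import torch
import torch.nn as nn

class TinyBackbone(nn.Module):
    """Stacked strided convolutions: each stride-2 conv halves the
    feature map's spatial size while adding channels."""
    def __init__(self):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.SiLU(),
        )
    def forward(self, x):
        return self.stem(x)  # feature map at 1/4 input resolution

class TinyNeck(nn.Module):
    """Refines backbone features; the residual add is a skip connection."""
    def __init__(self):
        super().__init__()
        self.refine = nn.Conv2d(32, 32, 3, padding=1)
    def forward(self, f):
        return torch.relu(self.refine(f)) + f

class TinyHead(nn.Module):
    """Per-cell predictions: (x, y, w, h, objectness) + class scores."""
    def __init__(self, num_classes=4):
        super().__init__()
        self.pred = nn.Conv2d(32, 5 + num_classes, 1)
    def forward(self, f):
        return self.pred(f)

model = nn.Sequential(TinyBackbone(), TinyNeck(), TinyHead())
out = model(torch.randn(1, 3, 64, 64))
print(tuple(out.shape))  # (1, 9, 16, 16): 9 = 5 box terms + 4 classes
```

In a real YOLO model the head predicts at multiple scales and the raw outputs are decoded and filtered with non-maximum suppression; this sketch only shows how the three stages compose.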

The authors discuss several optimizations they made to the previous YOLO architecture and its training techniques while pursuing higher accuracy at equal or better inference speed. First, they propose layer aggregation in the backbone, choosing E-ELAN, an extended version of the ELAN module proposed anonymously in 2022 [3], to improve the efficiency of the backbone's convolutional layers. Second, they discuss re-parameterization, which they apply at the module level throughout the network: candidate modules are identified by analyzing gradient flow propagation paths, and planned re-parameterized convolution is applied to them. Finally, they propose a novel auxiliary head, where label assignments for both the lead and auxiliary heads are governed by the lead head's predictions. These optimizations result in a state-of-the-art object detection model trained on the Microsoft COCO dataset [4]. We seek to apply the YOLOv7 architecture to the FLIR thermal dataset, assessing its performance on data widely encountered in ADAS, CCTV, and military applications.
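The simplest instance of re-parameterization is folding a BatchNorm layer into its preceding convolution, so the train-time Conv+BN pair becomes a single equivalent conv at inference. YOLOv7's planned re-parameterized convolution is more involved, so this is only a minimal illustration of the general idea:

```python
import torch
import torch.nn as nn

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold BN into the preceding conv: BN(Wx + b) == W'x + b'
    with W' = W * gamma/sqrt(var + eps) and
    b' = (b - mean) * gamma/sqrt(var + eps) + beta."""
    fused = nn.Conv2d(conv.in_channels, conv.out_channels,
                      conv.kernel_size, conv.stride,
                      conv.padding, bias=True)
    with torch.no_grad():
        scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
        fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
        bias = conv.bias if conv.bias is not None \
            else torch.zeros(conv.out_channels)
        fused.bias.copy_((bias - bn.running_mean) * scale + bn.bias)
    return fused

conv = nn.Conv2d(3, 8, 3, padding=1)
bn = nn.BatchNorm2d(8)
with torch.no_grad():          # accumulate non-trivial running stats
    for _ in range(10):
        bn(conv(torch.randn(4, 3, 16, 16)))
bn.eval()                      # inference mode: use running stats

x = torch.randn(1, 3, 16, 16)
with torch.no_grad():
    y1 = bn(conv(x))           # train-time Conv+BN pair
    y2 = fuse_conv_bn(conv, bn)(x)  # single re-parameterized conv
print(torch.allclose(y1, y2, atol=1e-5))  # True
```

The fused layer produces the same outputs while skipping the BN computation entirely, which is the core appeal of re-parameterization: a richer train-time structure collapses into a cheaper inference-time one.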

3. Technical Implementation

This project utilized Teledyne FLIR's publicly available dataset on Roboflow, consisting of 10,495 thermal images split into 92% training, 6% validation, and 2% testing sets. The images are annotated with bounding boxes identifying one or more of the following classes: "bicycle", "car", "dog", and "person". Preprocessing included resizing, normalization based on dataset statistics, and augmentation techniques to prepare the images for effective learning. Below is an example of labels provided by the dataset, along with the corresponding predictions of the model.

Figure 1A: Example of Thermal Training Labels

Figure 1B: Example of Thermal Training Predictions
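The preprocessing steps described above can be sketched roughly as follows; the input size, mean/std values, and augmentation choice are placeholders, not the actual FLIR dataset statistics or our exact pipeline:

```python
import numpy as np

INPUT_SIZE = 640
MEAN, STD = 0.45, 0.22  # hypothetical dataset statistics

def preprocess(image: np.ndarray, train: bool = True) -> np.ndarray:
    """Resize a single-channel thermal image, normalize it with
    dataset statistics, and optionally apply a simple augmentation."""
    # nearest-neighbor resize to INPUT_SIZE x INPUT_SIZE
    h, w = image.shape[:2]
    rows = np.arange(INPUT_SIZE) * h // INPUT_SIZE
    cols = np.arange(INPUT_SIZE) * w // INPUT_SIZE
    resized = image[rows][:, cols]
    # scale to [0, 1], then standardize with dataset statistics
    x = resized.astype(np.float32) / 255.0
    x = (x - MEAN) / STD
    # simple augmentation: random horizontal flip (training only);
    # a real pipeline must mirror the bounding boxes as well
    if train and np.random.rand() < 0.5:
        x = x[:, ::-1]
    return x

img = np.random.randint(0, 256, (512, 640), dtype=np.uint8)
out = preprocess(img)
print(out.shape)  # (640, 640)
```

In practice a training framework handles resizing and box-aware augmentation together; the sketch only shows the shape of the transformation applied to each image.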

Using a pre-trained nano version of YOLOv7, initially trained on the comprehensive COCO dataset, we explored the trade-offs between inference time and performance when adapting to our specialized thermal image dataset. Training on a Tesla T4 GPU over 50 epochs, we fine-tuned the deeper layers to enhance the model's ability to recognize thermal-specific features, while keeping the early layers fixed to preserve their learned general object detection capabilities. We also experimented with various learning rates and batch sizes, settling on 0.003 and 64, respectively.
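A minimal sketch of this fine-tuning setup, with a generic stack of layers standing in for the actual YOLOv7 nano checkpoint (the freeze cutoff and momentum value are illustrative assumptions):

```python
import torch
import torch.nn as nn

# Stand-in for a loaded YOLOv7 nano model: ten sequential blocks.
model = nn.Sequential(*[nn.Conv2d(8, 8, 3, padding=1) for _ in range(10)])

# Freeze the early (general-feature) layers; train only deeper ones.
FREEZE_UP_TO = 4  # illustrative cutoff, not YOLOv7's layer indexing
for i, layer in enumerate(model):
    for p in layer.parameters():
        p.requires_grad = i >= FREEZE_UP_TO

# Optimize only the trainable parameters with our chosen settings.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.003, momentum=0.937)
BATCH_SIZE, EPOCHS = 64, 50

frozen = sum(not p.requires_grad for p in model.parameters())
print(frozen)  # 8: weight + bias for each of the 4 frozen convs
```

Freezing early layers both preserves the general features learned on COCO and reduces the number of gradients computed per step, which helps on a single T4 GPU.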

Figure 2: YOLOv7 Nano Architecture

The source code for this process is available in this Colab notebook.

4. Experimental Findings

We find that the nano version of YOLOv7 achieves 99.86% test accuracy on our 2% test set, with similar training and validation accuracy (99.97% and 100%, respectively). In Figure 3, mean average precision at a 50% IoU threshold (mAP50) plateaus at 78%, indicating strong precision-recall performance: the model reliably detects objects when a detection only needs to overlap the ground truth by 50%. However, under the stricter mAP50-95 metric, which averages precision over IoU thresholds from 50% to 95%, performance drops to around 45%. This reveals that the model struggles with precise localization significantly more than with broader localization and classification. We hypothesize that improving generalization via image augmentation, or via additional training data containing objects with diverse appearances and poses, would improve mAP50-95.
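The gap between the two metrics comes down to intersection-over-union: a prediction that comfortably clears a 50% overlap threshold can still fail a stricter one. A small worked example:

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

ground_truth = (0, 0, 100, 100)
prediction = (10, 10, 110, 110)  # correct class, box offset by 10 px

score = iou(ground_truth, prediction)
print(round(score, 3))  # 0.681
print(score >= 0.50)    # True: counts as a hit at the 50% threshold
print(score >= 0.95)    # False: rejected under the strict 95% threshold
```

A box shifted by just 10% of its width already drops to roughly 0.68 IoU, so detections with slightly loose localization inflate mAP50 while dragging down the high-threshold components of mAP50-95.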

Figure 3: YOLOv7 Nano Training Metrics

Figure 4: Labeled Multi-Class Test Sample

5. Conclusions

We find that the YOLOv7 nano model performs well on general localization and classification of common objects detected in public environments via thermal imaging. However, it struggles with precise localization, indicating unsuitability for tasks requiring highly accurate positioning or exact delineation of object boundaries. The data shows potential for applications in autonomous driving and robotics navigation, where thermal imaging can detect objects through conditions that defeat visible-light cameras, particularly glare, fog, and smoke. Navigation systems can benefit greatly from the detection and localization of common objects in their surroundings, particularly in robotics applications where those objects are directly relevant to the primary task. We are curious to explore the applicability of modern segmentation models to thermal imaging for precise localization; we believe that semantic segmentation could improve accuracy when detections are evaluated under strict criteria. Ultimately, we show that YOLOv7 nano, a model usable on modular devices suitable for autonomous driving and robotics applications, can perform accurate object detection on thermal imaging data, which has notable implications for computer vision systems deployed in public environments.

References

[1] Chien-Yao Wang, Alexey Bochkovskiy, and Hong-Yuan Mark Liao. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv preprint arXiv:2207.02696v1, 2022.

[2] Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi. You Only Look Once: Unified, Real-Time Object Detection. arXiv preprint arXiv:1506.02640v5, 2016.

[3] Anonymous. Designing Network Design Strategies. Anonymous submission, 2022.

[4] Tsung-Yi Lin, Michael Maire, et al. Microsoft COCO: Common Objects in Context. arXiv preprint arXiv:1405.0312, 2014.

Team Members

Ethan Alderson and Jai Amin