In the field of wildlife conservation and research, accurately detecting animals in difficult environments is critical. Traditional methods often struggle in low-light conditions or dense vegetation. Thermal imaging, while more costly, offers a promising alternative, allowing researchers to observe animals at night or through camouflage. Building on Faster R-CNN's high accuracy and fast training time, we aim to assess its effectiveness in complex ecological scenarios, identify its limitations, and recommend the image acquisition strategies best suited to the model. Specifically, we compare Faster R-CNN's ability to identify animals across datasets containing normal or thermal images, and containing either multiple animal types (multi-class classification) or a single animal type (binary classification, i.e. raccoon or not a raccoon). This comparison will reveal which environmental conditions and image setups enable the most reliable animal detection in real-world applications, and it will let us provide practical recommendations for researchers and conservationists on how best to capture images for detection tasks using Faster R-CNN or similar models. This project will thereby contribute to understanding the strengths and limitations of Faster R-CNN for real-world animal identification tasks.
Many object detection networks rely on region proposal algorithms to generate candidate regions and hypothesize object locations before classifying each proposal as a specific object or background. However, this process can create bottlenecks and run slowly, especially for real-time applications. The research paper "Faster R-CNN: Towards real-time object detection with region proposal networks" describes a framework that couples a Region Proposal Network (RPN), which generates high-quality region proposals, with Fast R-CNN to significantly reduce the runtime of object detection networks.
The RPN is a fully convolutional network that operates on the entire image and simultaneously predicts object bounding boxes and their likelihood of containing an object. By sharing the convolutional features extracted by the main detector network (e.g. VGG-16), the RPN carries out proposal generation and object classification with greater efficiency. The RPN is merged with Fast R-CNN into a single network, and training alternates between the RPN and Fast R-CNN. During this process, the RPN proposes regions based on image features and tells Fast R-CNN where to look when classifying those proposals as specific objects or background. Errors made by Fast R-CNN are fed back to the RPN during training, improving proposal quality and detection accuracy. The resulting architecture, named "Faster R-CNN", performs object detection and classification with high accuracy and greater efficiency. In our project, we implement the Faster R-CNN model from the paper through Meta AI Research's Detectron2 library, which includes state-of-the-art detection algorithms and provides the framework for training models such as Faster R-CNN. We train Faster R-CNN models on datasets with varied image types and environments (thermal vs. normal images, single vs. multiple animal classes), compare their performance, and use the results to provide informed recommendations for those looking to apply these algorithms in conservation research.
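To make the shared-backbone structure concrete, the sketch below builds Detectron2's Faster R-CNN from a model zoo config and prints its three main components. This is only an illustration of the two-stage design: Detectron2's baselines use a ResNet-50 FPN backbone rather than the VGG-16 discussed in the paper, and the config name here is an assumption on our part.

```python
# Minimal sketch: inspect the Faster R-CNN components exposed by Detectron2.
# Assumes the standard ResNet-50 FPN baseline config from the model zoo.
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.modeling import build_model

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.DEVICE = "cpu"  # build on CPU so the sketch runs without a GPU

model = build_model(cfg)  # GeneralizedRCNN: backbone + proposal generator + ROI heads

# The backbone features are computed once and shared by both stages.
print(type(model.backbone).__name__)            # feature extractor shared by RPN and Fast R-CNN
print(type(model.proposal_generator).__name__)  # RPN: candidate boxes + objectness scores
print(type(model.roi_heads).__name__)           # Fast R-CNN head: classifies each proposal
```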
We began our framework by setting up Detectron2, a library from Facebook AI Research that provides robust detection and segmentation architectures along with pretrained models. After ensuring all dependencies and required libraries were correctly installed in our notebook, we included a portion of the demo code provided by the Detectron2 team that showcases the library's capabilities: the pretrained model predicts and annotates the various objects present in an input image.
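A condensed version of that demo flow looks roughly like the sketch below. The image path and score threshold are placeholders, and we assume the COCO-pretrained Faster R-CNN weights from the model zoo; this is not the exact demo code, only its shape.

```python
# Minimal inference sketch with a COCO-pretrained Faster R-CNN (paths and thresholds are placeholders).
import cv2
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor
from detectron2.data import MetadataCatalog
from detectron2.utils.visualizer import Visualizer

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml")
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5  # only keep detections above 50% confidence
# runs on GPU by default; set cfg.MODEL.DEVICE = "cpu" if none is available

predictor = DefaultPredictor(cfg)
image = cv2.imread("demo_image.jpg")  # hypothetical input image (BGR, as Detectron2 expects)
outputs = predictor(image)            # boxes, classes, and scores for detected objects

# Draw the predicted boxes and class labels on the image.
viz = Visualizer(image[:, :, ::-1], MetadataCatalog.get(cfg.DATASETS.TRAIN[0]))
annotated = viz.draw_instance_predictions(outputs["instances"].to("cpu")).get_image()
cv2.imwrite("demo_annotated.jpg", annotated[:, :, ::-1])  # convert back to BGR for OpenCV
```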
Next, we tested whether the model could be adapted to custom images. We selected four datasets representing different categories: thermal or normal images, with either one animal type (binary class, i.e. raccoon or not a raccoon) or multiple animal types/classes, across varying degrees of visibility. The first dataset contains images of raccoons photographed outside, most frequently in grass. The raccoons are clearly depicted in the foreground, in broad daylight, against a contrasting background. This dataset serves as the baseline for our detector.
The second dataset is a set of images of cheetahs captured with thermal cameras at night. The cheetahs appear in more complicated poses and environments than the raccoons, often being too close, far away, or lying down, and these thermal images are in grayscale. The third dataset is another thermal dataset containing both dogs and people, walking on a street or in open environments. It differs from the others in two ways: first, its thermal images are captured in a color range (cold blue to red hot), which can potentially provide more information to the detection model; second, humans frequently appear in the frame, so multiple actors share the same image, and the dataset is configured so the model identifies humans as their own class. The last dataset consists of normal camera images of marine life in aquariums, with seven different animal classes. Although this is a more difficult task, we chose it because it more fairly represents recognizing animals in challenging environments (murky water, rocky surroundings, dark lighting), exactly the kind of setting where a thermal camera would be most needed (compared to the raccoon dataset, which is always captured close up in broad daylight).
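All four datasets came from Roboflow with COCO-format bounding-box annotations, so each one can be registered with Detectron2 in the same way. The sketch below shows the pattern for one split; the dataset names and file paths are placeholders for illustration (Roboflow COCO exports typically ship a _annotations.coco.json file per split).

```python
# Registering a Roboflow COCO-format export with Detectron2 (names and paths are placeholders).
from detectron2.data import MetadataCatalog
from detectron2.data.datasets import register_coco_instances

# Each split directory contains the images plus a COCO-style annotation file.
register_coco_instances("raccoon_train", {}, "raccoon/train/_annotations.coco.json", "raccoon/train")
register_coco_instances("raccoon_valid", {}, "raccoon/valid/_annotations.coco.json", "raccoon/valid")

print(MetadataCatalog.get("raccoon_train"))  # confirms the class names picked up from the annotations
```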
The model configurations were for the most part standardized: 300 training iterations, an 80%/20% train-test split, a batch size of 2, and SGD with momentum at a learning rate of 2.5e-4 as the optimizer. After training, each model is tested against its validation set and scored with average precision. The evaluation is carried out with the COCOEvaluator, which reads through all the outputs and measures the model's effectiveness across all object predictions. This is possible because each image comes with bounding-box annotations provided by contributors on the data-sharing platform Roboflow.com.
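A rough sketch of this configuration is shown below. The dataset names, class count, and output paths are placeholders; Detectron2's default solver is SGD with momentum, and MAX_ITER counts training iterations, so this is a reconstruction of the setup described above rather than the exact notebook code.

```python
# Sketch of the training and evaluation setup described above (placeholder names and paths).
import os
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultTrainer
from detectron2.data import build_detection_test_loader
from detectron2.evaluation import COCOEvaluator, inference_on_dataset

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml")
cfg.DATASETS.TRAIN = ("raccoon_train",)  # hypothetical registered 80% split
cfg.DATASETS.TEST = ("raccoon_valid",)   # hypothetical registered 20% split
cfg.SOLVER.IMS_PER_BATCH = 2             # batch size of 2 images
cfg.SOLVER.BASE_LR = 2.5e-4              # SGD with momentum is Detectron2's default solver
cfg.SOLVER.MAX_ITER = 300                # 300 training iterations (600 for the aquarium dataset)
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1      # e.g. the binary raccoon task; 7 for the aquarium dataset
os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)

trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()

# Evaluate average precision on the held-out split with the COCO metrics.
evaluator = COCOEvaluator("raccoon_valid", output_dir=cfg.OUTPUT_DIR)
val_loader = build_detection_test_loader(cfg, "raccoon_valid")
print(inference_on_dataset(trainer.model, val_loader, evaluator))  # AP, AP50, AP75, APs, APm, APl
```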
Our entire implementation from setup to evaluation is well documented within our working Google Colab notebook, which can be accessed here.
For this first dataset (examples above), the model outputs high confidence on most images containing raccoons. Most bounding-box scores range from 75% to as high as 100%, showing how well the model was trained. The images also span dramatically different environments, which shows that the model adapts to different settings. We should note, however, that the model got confused when there was too much noise in the image, leading it to miss raccoons or overcount them. This is likely because the model looks for the raccoons' distinctive striped features and fails when it cannot properly identify them.
The total loss plot for the raccoon model shows a steady decline, which means the model was able to learn and adapt during the 300 training iterations. The loss started as high as 1.7 and slowly converged to about 0.3, a promising sign; given more iterations, the loss would likely decline further.
Next, the classification accuracy and false negative plots above indicate a high accuracy of around 0.98 over the 300 training iterations, with some slight fluctuations along the way. The false negative rate also reflects a successful learning phase: the model initially misclassified raccoons due to its unfamiliarity with the data, but it soon began to classify correctly, producing a downward trend.
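These loss, accuracy, and false-negative curves come from the scalars Detectron2 logs during training. A small sketch like the one below is enough to reproduce them, assuming the default output/metrics.json location and scalar keys such as total_loss, fast_rcnn/cls_accuracy, and fast_rcnn/false_negative.

```python
# Plotting training curves from Detectron2's metrics.json (assumes the default output directory).
import json
import matplotlib.pyplot as plt

with open("output/metrics.json") as f:
    records = [json.loads(line) for line in f]

def series(key):
    # Collect (iteration, value) pairs for a logged scalar, skipping records that lack it.
    points = [(r["iteration"], r[key]) for r in records if key in r]
    return zip(*points)

for key in ["total_loss", "fast_rcnn/cls_accuracy", "fast_rcnn/false_negative"]:
    iterations, values = series(key)
    plt.figure()
    plt.plot(iterations, values)
    plt.xlabel("iteration")
    plt.ylabel(key)
    plt.title(key)
plt.show()
```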
The average precision score of 62.155 is very good for COCO-style bounding box problems. The AP50 score is incredibly high at 93.835, suggesting that the predicted raccoon boxes overlap the ground-truth boxes almost exactly; the same conclusion can be drawn from the AP75. The APl score is the same as the overall AP score, likely because every raccoon appears large in the foreground (unlike other datasets where the animals are far away and barely noticeable). Overall, the model demonstrates high performance on this dataset.
When trained with thermal images of cheetahs, Faster R-CNN effectively identifies single or multiple cheetahs within an image, even if they are far in the background. The images above show the effectiveness of the model identifying cheetahs with 97-99% confidence in thermal images depicting obscure monochromatic environments. This dataset is likely easier on the model than the thermal dogs dataset since there are no humans within the scenes to cause false positives.
The cheetah detection task showed a near-linear decrease in training loss across 300 training iterations with a batch size of 2, ending at a final loss value of 0.17. Although these plots suggest that better performance could be achieved with more iterations, we kept the training budget fixed to standardize and benchmark model performance.
The classification accuracy of the cheetah classification model followed logarithmic growth in its performance, leveling out at 98.6% accuracy in correctly detecting a cheetah in a thermal image. An exception to this is a small dip in steps 100 - 125; this alongside the false negative chart suggests that as the model begins to generalize and underfit the data, there is a period in training where it misses instances of cheetahs.
In our results, cheetah detection achieved an average precision (AP) score of 54.516, indicating a good degree of accuracy across the test images. The AP50 score, which measures precision at a 50 percent intersection-over-union threshold, is high at 70.636, meaning strong detection performance. Similarly, the AP75 score, the precision at a 75 percent intersection-over-union threshold, stood at a comparable 65.552, meaning consistent accuracy even under stricter box-matching conditions. The high APl score of 68.669 and moderate APm score of 47.063 suggest that although the model is most confident up close, it is fairly confident at a distance as well. The NaN APs value suggests the data did not contain any labels for very small (extremely distant) objects.
From an exploratory view of the detection outputs, we can see that the model successfully identifies dogs and people in thermal imagery. This remains true even when a dog is at a distance, and the model does not seem to get confused by background items. However, it does generate false positives when the subject is too close to the foreground.
The thermal dogs dataset went through a similar, almost linear decrease in loss, apart from a spike around iteration 25. The loss ended at a value of 0.37, likely with room for a further decrease given more training.
The classification accuracy of the thermal dogs model followed logarithmic growth, leveling out at 97.5% accuracy in correctly detecting a dog in a thermal image. While this model's false negative curve follows a similar trend, its accuracy increases more smoothly over time than the cheetah model's.
Thermal dog detection achieved an average precision (AP) score of 64.959, indicating a relatively high level of accuracy in detecting dogs and people. The AP50 score, measuring precision at a 50 percent intersection-over-union threshold, is extremely high at 88.156, meaning strong detection performance, and the AP75 score, precision at a 75 percent intersection-over-union threshold, is also strong at 70.756. However, APs, the average precision for small objects, is 23.267, which means the model does not perform well when the dogs or people are very small in the image. The example images show the model doing a decent job of recognizing people at a distance, though APs values tend to be low because of the inherent uncertainty on small objects. Nevertheless, for medium-sized subjects the model achieved an APm score of 53.937, which is respectable. Overall, this performance indicates that the color-channel thermal model is effective at differentiating subjects both up close and at a distance.
Moving on to our last dataset, the model trained to identify multiple underwater aquatic animals does not perform as well as the previous three. As described above, this dataset contains seven animal classes and represents the kind of challenging environment (murky water, rocky surroundings, dark lighting) where reliable detection would be most valuable.
For most images in the dataset, multiple fish are present, but the model was only able to identify one or two of them, and sometimes it could not identify anything at all. It does perform better on jellyfish, isolating some of them with around 25% uncertainty. This training dataset was significantly larger than the previous ones (638 vs. roughly 100 total samples), yet its performance is still lackluster. Part of the challenge may be that the aquarium dataset collapses many different species into the single label "fish", and different fish species can look drastically different from one another, like a shark compared to a clownfish.
The total loss on the aquarium dataset decreases fairly steadily throughout the 600 iterations. It is possible that, given enough training iterations, the loss could drop low enough to handle multi-class classification on normal images; however, for benchmarking we capped training at twice the number of iterations used for the other datasets. It is equally possible that the limiting factor is a lack of sufficient training data or too much variability within each aquarium category.
The graph above shows the classification accuracy for the aquarium set. Note that only for the aquarium dataset did we raise the training length to 600 iterations, because at 300 iterations there was almost no noticeable classification being performed, likely because no bounding box reached the 70% confidence threshold. The trend of the curve is nonetheless upward, which is a good sign that the model improves over time. Similarly, the false negative plot above follows the same pattern as the other three datasets: misclassifications peak within the first 200 iterations and then decline dramatically as the model adjusts to the complex aquatic environment. Despite this improvement over the longer run, the lowest false negative rate is about 0.3, which still leaves room for improvement and optimization. More training data and more iterations would likely yield better classification accuracy.
In terms of the average precision metrics for the aquarium dataset, the overall AP shows a low detection accuracy, with a score of 1.684. Per-category results indicate that the model was simply unable to detect penguins, starfish, puffins, stingrays, and sharks at all; it was only able to identify fish and jellyfish. These results illustrate the room for further improvement, particularly in recognizing species other than jellyfish and fish.
To evaluate how effectively Faster R-CNN detects animals across the different image types, we use Average Precision (AP). This metric balances the model's ability to correctly identify animals (precision) against how many of the actual animals it captures (recall), with a higher AP indicating better overall detection performance. After training and testing the model on the four datasets, normal raccoon images, thermal cheetah images, normal multi-class aquarium images, and thermal multi-class dog-and-human images, we combined their AP metrics into the bar chart above to compare Faster R-CNN's performance across image types. The AP50 and AP75 scores reflect the model's average performance when a detection box overlaps the ground-truth box by at least 50% and 75%, respectively, while APs, APm, and APl report AP for animals categorized by their size in the image (small, medium, large).
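As an illustration of how such a comparison chart can be assembled, the sketch below plots the overall AP scores reported in the previous sections; the per-size and per-threshold bars would be added the same way. This is a reconstruction for illustration, not the notebook's plotting code.

```python
# Rebuilding the overall-AP comparison bar chart from the scores reported above.
import matplotlib.pyplot as plt

datasets = ["Raccoon (normal)", "Cheetah (thermal)", "Dog/person (thermal)", "Aquarium (normal)"]
overall_ap = [62.155, 54.516, 64.959, 1.684]  # overall AP values from the per-dataset results above

plt.bar(datasets, overall_ap)
plt.ylabel("Average Precision (AP)")
plt.title("Faster R-CNN overall AP by dataset")
plt.xticks(rotation=20, ha="right")
plt.tight_layout()
plt.show()
```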
Overall, we found that the model performed well on the normal raccoon, thermal dog and human, and thermal cheetah datasets, with AP scores typically in the 60-70 range. However, the model performed significantly worse on the multi-class normal aquarium images, suggesting that images with fewer animal classes would produce better results. Furthermore, the AP score is consistently lower for small and medium objects than larger objects, indicating that the model struggles with detecting smaller animals, potentially due to factors like lower resolution or inability to capture fine details. While the binary-class raccoon dataset with normal images performs well, the multi-class thermal dataset surprisingly performs comparably or even slightly better than the binary-class thermal dataset using Faster R-CNN. Therefore, we conclude that for detection of a singular animal type in well-lit environments, normal images offer a simpler and potentially sufficient approach with Faster R-CNN. However, while visible-light cameras work well in controlled settings, thermal images offer a significant advantage in environments with multiple animal types and/or poor visibility.
From our results, we recommend that researchers hoping to implement image-based animal detection use a traditional camera for single animals in environments with clear visibility. On the other hand, as the environment becomes more cluttered and multi-animal classification becomes necessary, it is strongly worthwhile to invest in a thermal camera to gather thermal images for optimal results.
Utilizing object detection models to identify animals across different image types has the potential to transform several fields. To aid wildlife conservation efforts, this pipeline could enable automated monitoring of wildlife populations both day and night, providing crucial data for protecting endangered species. Detecting animals in restricted areas could also help deter poachers. The system could even minimize human-wildlife conflict by detecting animals approaching livestock or human settlements in rural areas, allowing for preventative measures. Furthermore, it could support research on animal behavior by giving scientists insight into nocturnal activity, hunting strategies, and animal interactions that were previously difficult to observe, especially during peak activity periods, which often occur at night. Overall, this project holds significant promise for wildlife conservation, agriculture, and research, offering a novel approach to understanding and interacting with wildlife.