Cityscape Semantic Segmentation For DS 4440

An Analysis of ShelfNet for Realtime Semantic Segmentation

Semantic segmentation, which identifies the objects within an image, is central to modern visual processing. Focusing on urban landscapes and autonomous cars, we explore how well this model can detect objects in real time with good accuracy.

Felix Yang (yang.fel@northeastern.edu), Jaeson Pyeon (pyeon.j@northeastern.edu)

Introduction

The constant demand for real-time applications, specifically in autonomous driving, mobile navigation, and interactive systems, has generated a need for fast and efficient image processing techniques. Semantic segmentation, the process of assigning each part of an image to a defined class, is one area relevant to advances in this field. This project explores the capabilities of ShelfNet, a neural network architecture designed specifically for real-time semantic segmentation. ShelfNet appeared to offer a balance between computational efficiency and segmentation accuracy that could make it suitable for embedded systems. This project asks whether the ShelfNet architecture can maintain a high level of segmentation performance on the Cityscapes dataset across different inference speeds. Additionally, it investigates how well the model adapts to varying constraints, in particular operational efficiency and speed, with the aim of validating ShelfNet's practicality for real-world applications.

Literature Review

The primary inspiration for our project comes from the paper "ShelfNet for Fast Semantic Segmentation" by Zhuang et al., published in 2018. The authors introduce ShelfNet, a lightweight convolutional neural network (CNN) designed for efficient semantic segmentation. The model leverages a distinctive architectural feature known as the "shelf" structure, which allows for fast feature propagation and effective deep feature fusion. This structure is pivotal in enabling real-time processing speeds while maintaining high accuracy.

ShelfNet's architecture is built upon the ResNet backbone, utilizing lateral connections similar to those seen in feature pyramid networks. This design ensures that high-level semantic information is efficiently combined with low-level features, a critical requirement for accurate pixel-level predictions. The paper reports significant improvements in inference speed over traditional segmentation networks, making it highly relevant for applications requiring low latency.

Our project draws on the methodologies and findings of Zhuang et al. to explore further enhancements and practical implementations of ShelfNet. We apply the architecture to the Cityscapes dataset, diverging from the datasets originally used in the study, to assess its robustness and effectiveness across different segmentation challenges. Additionally, our project extends the analysis by examining the model's performance with a reduced number of convolutional channels, contributing to a deeper understanding of its potential in embedded systems.

Methodology

In this project, we employed the ShelfNet architecture with reduced channels to improve inference speed. The aim is to see whether systems with varying compute power could run ShelfNet even closer to real time, especially on urban road scenes. We used the Cityscapes dataset, a segmentation benchmark containing 19 object classes across thousands of annotated images.

Our codebase is derived from the official ShelfNet repository, adapted for the specific use case of our project. We follow many of the methods of the original paper, including similar training procedures and the overall structure of the project. The neural network was instantiated with 19 classes to match the dataset, and its architecture was further explored using torchinfo to visualize the model's layers and parameters. With the channels simplified, we examine the difference in inference speed alongside the training loss and the predicted segmentations themselves. We compare three models: the fully trained (pretrained) ShelfNet, the original ShelfNet trained under the same schedule as our reduced-channel model, and the reduced-channel model itself. Each model we train runs for 2000 iterations over the images, using the Adam optimizer rather than the original optimizer, together with the custom OhemCELoss built into the original ShelfNet model. Specifically, we reduce the channels by a fixed amount: 1/4 of the channels. The architecture we implement is shown in Fig. 1.

Fig. 1: SimpleShelfNet Architecture

The channel reduction lowers the number of parameters from 14.6 million to 12.0 million.
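To see why narrowing channels shrinks a model, it helps to count convolution weights directly. The sketch below is illustrative only: the layer widths, the 3x3 kernel, and the 0.5 scaling factor are hypothetical values chosen for easy arithmetic, not the configuration used in this project. It shows that a convolution's weight count scales with the product of its input and output channels, so narrowing both by a factor s cuts that layer's weights by roughly s squared.

```python
def conv2d_params(in_ch: int, out_ch: int, kernel: int = 3, bias: bool = False) -> int:
    """Parameter count of a single 2D convolution layer."""
    weights = kernel * kernel * in_ch * out_ch
    return weights + (out_ch if bias else 0)


def scale_channels(channels, factor):
    """Scale every channel width by `factor`, keeping at least one channel."""
    return [max(1, int(c * factor)) for c in channels]


def stack_params(widths):
    """Total weights of one 3x3 conv per stage, each mapping width -> width."""
    return sum(conv2d_params(c, c) for c in widths)


# Hypothetical stage widths (an assumption, not taken from the ShelfNet code).
full = [64, 128, 256, 512]
reduced = scale_channels(full, 0.5)  # arbitrary example factor

print(stack_params(full), stack_params(reduced),
      stack_params(reduced) / stack_params(full))
# prints: 3133440 783360 0.25
```

Layers that are left at full width contribute nothing to the savings, which is one possible reason a whole-model parameter count can fall far less than the count of any single narrowed layer.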

Experimental Findings

First, we compare the inference speed of each model. The original ShelfNet model with pretrained weights takes on average 0.022 seconds per image. The Simple ShelfNet model with its reduced channels takes on average 0.012 seconds per image. This is about a 45% reduction in inference time per image, or roughly a 1.8x speedup.
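The two per-image times can be compared either as a relative time reduction or as a throughput ratio, and the two numbers differ. The sketch below shows one way such timings might be collected and compared; `mean_latency` and `compare_latency` are hypothetical helper names, and the timing loop is a generic wall-clock measurement rather than the procedure actually used in this project.

```python
import time
from statistics import mean


def mean_latency(fn, inputs, warmup=2):
    """Average wall-clock seconds per call of `fn` over `inputs`."""
    for x in inputs[:warmup]:
        fn(x)  # warm-up calls are not timed
    times = []
    for x in inputs:
        start = time.perf_counter()
        fn(x)
        times.append(time.perf_counter() - start)
    return mean(times)


def compare_latency(baseline_s, candidate_s):
    """Return (percent reduction in time per image, speedup factor)."""
    reduction_pct = 100.0 * (baseline_s - candidate_s) / baseline_s
    speedup = baseline_s / candidate_s
    return reduction_pct, speedup


# Per-image times reported in this section.
reduction, speedup = compare_latency(0.022, 0.012)
print(f"{reduction:.1f}% less time per image, {speedup:.2f}x throughput")
# prints: 45.5% less time per image, 1.83x throughput
```

When timing a model on a GPU, the device usually needs to be synchronized before each clock read (for example with `torch.cuda.synchronize()` in PyTorch), otherwise the measurement may only capture kernel launch time.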

Next, we compare the training loss of each model. Over 2000 iterations, the original ShelfNet architecture consistently reached lower losses than the Simple ShelfNet model.

Fig. 2: Training Losses Between Models
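The training losses above come from an OHEM (online hard example mining) cross-entropy loss. As a rough illustration of the idea, the sketch below averages the loss only over "hard" pixels: those whose loss exceeds a threshold, with a floor on how many pixels are kept. This is a common formulation of OHEM written in plain Python; the function names, the 0.7 probability threshold, and the `n_min` value are assumptions for illustration and may not match the OhemCELoss in the ShelfNet repository.

```python
import math


def cross_entropy(logits, label):
    """Cross-entropy of one pixel from raw class scores (log-sum-exp form)."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


def ohem_loss(per_pixel_losses, thresh, n_min):
    """Mean loss over hard pixels: loss > thresh, but never fewer than n_min pixels.

    Assumes a non-empty loss list and n_min >= 1.
    """
    ordered = sorted(per_pixel_losses, reverse=True)
    hard = [loss for loss in ordered if loss > thresh]
    if len(hard) < n_min:
        hard = ordered[:n_min]  # fall back to the n_min largest losses
    return sum(hard) / len(hard)


# Three toy pixels with two classes each: (class scores, true label).
pixels = [([4.0, 0.0], 0), ([0.0, 0.0], 0), ([0.0, 3.0], 0)]
losses = [cross_entropy(scores, label) for scores, label in pixels]

# Assumed threshold: keep pixels whose true-class probability is below 0.7.
thresh = -math.log(0.7)
print(ohem_loss(losses, thresh=thresh, n_min=1))
```

In this toy example the first pixel is already classified confidently, so it is excluded from the average and the loss concentrates on the two harder pixels.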

Then, we compare the mean Intersection over Union (mIoU) score of each model, computed on the validation dataset.

Model                  mIoU (%)
Pretrained ShelfNet    74.26
Training ShelfNet      39.27
Simple ShelfNet        19.25
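For reference, mIoU is commonly computed from a confusion matrix: for each class, IoU = TP / (TP + FP + FN), and the mean is taken over the classes that occur. The sketch below works on flat lists of predicted and ground-truth class ids; the helper names are hypothetical, the ignore label of 255 is an assumption about the label encoding, and the evaluation code used in this project may differ in details such as how absent classes are handled.

```python
def confusion_matrix(pred, target, num_classes, ignore_index=255):
    """cm[t][p] counts pixels with true class t predicted as class p."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue  # skip unlabeled pixels
        cm[t][p] += 1
    return cm


def mean_iou(cm):
    """Mean of per-class IoU, skipping classes with no pixels at all."""
    ious = []
    for c in range(len(cm)):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                  # true class c, predicted otherwise
        fp = sum(row[c] for row in cm) - tp   # predicted c, true class otherwise
        denom = tp + fp + fn
        if denom > 0:
            ious.append(tp / denom)
    return sum(ious) / len(ious)


# Toy example with 3 classes; the last pixel is unlabeled and ignored.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 255]
cm = confusion_matrix(pred, target, num_classes=3)
print(round(100 * mean_iou(cm), 2))  # mIoU in percent
# prints: 72.22
```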

Finally, we compare the visual outputs of each model:

Fig. 3: Pretrained ShelfNet Output
Fig. 4: Simple ShelfNet vs. ShelfNet Output

Conclusion

Our project explored the ShelfNet architecture for real-time semantic segmentation, focusing on its adaptability to varying constraints and its operational efficiency. We found that the ShelfNet model with reduced channels ran substantially faster at inference, making it a promising candidate for real-time applications. However, this came at the cost of segmentation accuracy, as evidenced by its lower mIoU score and higher training losses. The visual outputs of the Simple ShelfNet model also showed a noticeable decrease in segmentation quality compared to the original ShelfNet model. Our 1/4 channel reduction may not have been successful, but further training and research could test how well such reduced-channel models can perform. Additionally, future work could try different backbones, and freezing some of the weights might speed up training.



References

[1] Zhuang, J., Yang, J., Gu, L., & Dvornek, N. (2018). ShelfNet for Fast Semantic Segmentation. arXiv preprint arXiv:1811.11254.

[2] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., & Schiele, B. (2016). The Cityscapes Dataset for Semantic Urban Scene Understanding. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.