An Analysis of Activation Functions: Comparison of Trends in Practice and Research For Deep Learning

Introduction

This project aims to extend the foundational research on Activation Functions (AFs) by conducting a systematic comparison of ten different activation functions using two well-known architectures, ResNet and AlexNet. The existing body of work provides comprehensive insights into the characteristics and performance of various AFs but often limits their evaluation to theoretical analysis or to narrowly defined tasks. In contrast, this project takes an empirical approach, evaluating these functions under uniform criteria of network performance in broader applications. By implementing each of the ten AFs within ResNet and AlexNet, we aim to identify which activation functions enhance or hinder the learning process and overall performance of these models. The comparison focuses on quantifiable metrics such as accuracy, providing a nuanced understanding of how each AF performs in diverse scenarios.

Summary

The paper provides a thorough survey of activation functions (AFs) in deep learning, tracing their evolution from simpler AFs such as Sigmoid and Tanh to more advanced functions like ReLU, its variants, and adaptive AFs. It methodically highlights the transitions in AF development, emphasizing how each subsequent function was designed to overcome specific limitations of its predecessors. Sigmoid and Tanh, for instance, although valued for their differentiability and probabilistic outputs, often suffer from vanishing gradients, a critical hindrance as network depth increases. The paper introduces ReLU as a breakthrough due to its ability to facilitate faster convergence and maintain sparse activations, which has significantly influenced modern neural network design.

The paper then explores advanced variants such as Leaky ReLU and Parametric ReLU, which address ReLU's dead-neuron problem by ensuring gradient flow even for negative inputs. Additionally, the emergence of Swish and other adaptive AFs reflects a trend towards functions that adjust dynamically to the data, improving training dynamics and performance across varied tasks.

The significance of this paper to our project lies in its extensive analysis of how different AFs impact the performance of neural network architectures across diverse datasets and tasks. This foundational knowledge provides a critical backdrop for our investigation, in which we empirically compare ten different AFs within the ResNet and AlexNet frameworks.

Technical Structure

General Overview

Details

The full implementation can be viewed in the project's code repository.
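To give a sense of how such an experiment can be wired up, the sketch below swaps every ReLU in a torchvision ResNet-18 for another activation module so the same architecture can be retrained once per activation function. This is a minimal illustration, not the project's actual code: the choice of ResNet-18, the list of activation modules, and the use of PyTorch's built-in classes are assumptions.

```python
# Minimal sketch (assumed setup, not the project's actual code): replace every
# nn.ReLU in a torchvision ResNet-18 with another activation module.
import torch.nn as nn
from torchvision.models import resnet18

def replace_activations(module: nn.Module, act_factory):
    """Recursively replace nn.ReLU submodules with a freshly built activation."""
    for name, child in module.named_children():
        if isinstance(child, nn.ReLU):
            setattr(module, name, act_factory())
        else:
            replace_activations(child, act_factory)

# Activations discussed in this write-up (the project's exact list may differ);
# nn.SiLU is PyTorch's implementation of Swish.
activations = {
    "sigmoid": nn.Sigmoid, "tanh": nn.Tanh, "relu": nn.ReLU,
    "leaky_relu": nn.LeakyReLU, "prelu": nn.PReLU, "elu": nn.ELU,
    "selu": nn.SELU, "softplus": nn.Softplus, "mish": nn.Mish, "swish": nn.SiLU,
}

models = {}
for name, act in activations.items():
    model = resnet18(num_classes=10)  # CIFAR-10 has 10 classes
    replace_activations(model, act)
    models[name] = model
```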

Experimental Findings

Overview

This section presents a detailed analysis of the performance of the various activation functions tested within the ResNet architecture on the CIFAR-10 and CIFAR-100 datasets, along with a discussion of the potential reasons behind the observed performance differences.
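For context, the sketch below shows the kind of training-and-evaluation loop the reported accuracies assume: train a model on CIFAR-10 for a few epochs and report top-1 test accuracy. The optimizer, learning rate, batch sizes, and normalization values are illustrative assumptions, not the project's exact settings.

```python
# Sketch of an evaluation loop for one model/activation pair on CIFAR-10.
# Hyperparameters here are assumptions, not the project's exact configuration.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

def evaluate(model, epochs=5, device="cuda" if torch.cuda.is_available() else "cpu"):
    tfm = transforms.Compose([transforms.ToTensor(),
                              transforms.Normalize((0.5,) * 3, (0.5,) * 3)])
    train = datasets.CIFAR10("./data", train=True, download=True, transform=tfm)
    test = datasets.CIFAR10("./data", train=False, download=True, transform=tfm)
    train_loader = DataLoader(train, batch_size=128, shuffle=True)
    test_loader = DataLoader(test, batch_size=256)

    model = model.to(device)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    for _ in range(epochs):
        model.train()
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()

    # Top-1 accuracy on the held-out test set.
    model.eval()
    correct = 0
    with torch.no_grad():
        for x, y in test_loader:
            x, y = x.to(device), y.to(device)
            correct += (model(x).argmax(dim=1) == y).sum().item()
    return correct / len(test)
```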

Activation Function Performance in ResNet on CIFAR-10

Using ResNet on CIFAR-10 achieved much higher accuracy than AlexNet. The lowest accuracies came from Sigmoid and Tanh, at 0.66 and 0.75 respectively, owing to the vanishing gradient problem, which becomes more pronounced as network depth increases. ReLU achieved an accuracy of 0.8148; its effectiveness in deeper networks is likely due to its simplicity and the reduced likelihood of vanishing gradients, aided by ResNet's skip connections. LeakyReLU and PReLU showed even better accuracies of 0.8187 and 0.8201, respectively, which could be because their small adjustments to ReLU's formulation help maintain a more consistent gradient flow in deep networks, which is critical for learning complex patterns. The sketch below spells out those adjustments.
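The three functions differ only in how they treat negative inputs; the elementwise definitions below are a minimal sketch, with the slope parameters `a` and `alpha` named for illustration.

```python
import torch

def relu(x):                # zero for x < 0, so no gradient flows there ("dead" neurons)
    return torch.clamp(x, min=0)

def leaky_relu(x, a=0.01):  # small fixed negative slope keeps some gradient alive
    return torch.where(x >= 0, x, a * x)

def prelu(x, alpha):        # negative slope is a learnable parameter rather than a constant
    return torch.where(x >= 0, x, alpha * x)
```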

Activation Function Performance in ResNet on CIFAR-100

When scaling up to the more complex CIFAR-100, we chose not to use AlexNet at all due to its low accuracy on the similar but less complex CIFAR-10. The overall results were accuracies centered around 50%, which, considering the dataset has 100 classes, is a reasonable outcome after only 5 epochs. PReLU stood out with the highest accuracy of 0.5239. This success underscores PReLU's capability to dynamically adjust its parameters to the data, proving particularly useful on diverse and complex datasets. LeakyReLU and Mish also did better than ReLU, with accuracies of 0.4811 and 0.5105 respectively, reaffirming the value of addressing ReLU's limitations on datasets with extensive variability. SELU, ELU, and Softplus showed moderate results, indicating that while these functions are designed to address specific training issues (such as neuron death and non-zero-centered activations), this does not universally translate into superior performance across all settings.

In general, activation functions that adapt to input characteristics, such as Mish and PReLU, tend to excel on complex datasets like CIFAR-100. This adaptability helps manage the broader spread of feature types and the more complex correlations present in such datasets.
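As one concrete example, Mish is commonly defined as x · tanh(softplus(x)); the minimal sketch below shows that form. Its smooth, non-monotonic shape, in contrast to ReLU's hard cutoff at zero, is one plausible reason for the behaviour observed above.

```python
import torch
import torch.nn.functional as F

def mish(x):
    # Smooth and non-monotonic: stays slightly negative for small negative
    # inputs instead of cutting off at zero like ReLU.
    return x * torch.tanh(F.softplus(x))
```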

[Figures: activation functions]

Conclusion

This project's comprehensive evaluation of ten different activation functions within the AlexNet and ResNet architectures, using the CIFAR-10 and CIFAR-100 datasets, has led to insightful conclusions about the impact of activation functions on deep learning model performance. The findings indicate that while simpler activation functions like ReLU remain effective in standard scenarios, adaptive functions such as PReLU and Mish significantly enhance performance on more complex and varied datasets due to their ability to dynamically adjust to data characteristics. This adaptability is crucial for real-world applications involving complex, high-dimensional data, such as medical image analysis, autonomous driving, and speech recognition, where accurate and reliable predictions can greatly impact societal safety and efficiency.

Finally, it is important to discuss some of the limitations of the project. We used only one comparison metric, accuracy, which, while arguably the most important, is not the only factor to consider. In real-world scenarios, models are far more complex, and limited computational power always needs to be taken into account. This suggests an extension of our project that also evaluates the time taken to train the model with each activation function, as sketched below. Furthermore, future work could focus on developing new activation functions that combine the benefits of adaptability with computational efficiency, potentially using machine learning techniques to optimize their shapes for specific application needs.
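One rough way to carry out that extension would be to record wall-clock training time alongside accuracy; the wrapper below is a sketch only, and the `evaluate_fn` argument refers to a training-and-evaluation routine like the one sketched earlier rather than anything in the project code.

```python
import time

def timed_evaluate(model, evaluate_fn, **kwargs):
    """Return (accuracy, wall-clock seconds) for one training-plus-evaluation run."""
    start = time.perf_counter()
    accuracy = evaluate_fn(model, **kwargs)  # e.g. the evaluate() sketch above
    elapsed = time.perf_counter() - start
    return accuracy, elapsed
```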

References

Nwankpa, Chigozie Enyinna, et al. Activation Functions: Comparison of Trends in Practice and Research for Deep Learning. arXiv preprint arXiv:1811.03378, 2018.

Team Members

Nicholas Gjuraj, Oum Parikh