By: David Pogrebitskiy, Benjamin Wyant
Computer vision models have become very reliable classifiers of complex image data; however, they can be tricked into misclassifying slightly altered inputs with surprising ease. This paper tackles a critical challenge in deep learning: the vulnerability of neural networks to adversarial attacks. Adversarial examples are inputs that are almost indistinguishable from natural data, yet are classified incorrectly by the network with high confidence. These attacks not only pose security risks, but also reveal that our current models do not generalize as well as we would like.
We recreated the experiments by training five of our own models on the MNIST dataset and testing their adversarial robustness using PGD, BIM, and FGSM to generate adversarial examples. Our goal was to test a wider range of network architectures and determine how large a role the architecture plays in a model's ability to generalize, and thus in its robustness to adversarial attacks.
Achieving adversarial robustness is an important step towards making networks more secure and reliable, especially in high-stakes applications like autonomous driving and malware detection. The insights from this work could pave the way for a new generation of deep neural networks that are much more resistant to adversarial attacks.
The key challenge addressed in this paper is the vulnerability of deep neural networks to adversarial examples: inputs that are almost indistinguishable from natural data, yet are classified incorrectly.
Through a range of experiments, the authors demonstrate that finding adversarial examples is computationally tractable: simple first-order methods like projected gradient descent (PGD) reliably find local maxima with similar loss values. Leveraging these insights, the authors trained neural networks on the MNIST and CIFAR10 datasets that achieve high robustness against a wide range of adversarial attacks, including iterative methods like PGD. This represents a significant advance over prior work, which was often vulnerable to more sophisticated adversaries.
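For intuition, here is a minimal PyTorch sketch of an L∞-bounded PGD attack. This is not the authors' exact implementation; the step size `alpha`, the number of steps, and the assumption that pixel values lie in [0, 1] are illustrative choices. BIM is the same loop without the random start, and FGSM is a single step with `alpha = eps`.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=0.2, alpha=0.02, steps=40):
    """Illustrative L-inf PGD: random start, signed-gradient ascent, projection."""
    # Start from a random point inside the eps-ball around x.
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        # Step up the loss along the gradient sign, then project back into the eps-ball around x.
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv.detach()
```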
Aleksander Madry is the Cadence Design Systems Professor of Computing in the MIT EECS Department. He is the director of the MIT Center for Deployable Machine Learning and a faculty co-lead of the MIT AI Policy Forum.
Aleksandar Makelov is an independent researcher working on mechanistic interpretability of large language models. He received his PhD from MIT.
Ludwig Schmidt is an assistant professor of computer science at the University of Washington. He is also a research scientist on the AllenNLP team at AI2 and a member of LAION.
Dimitris Tsipras is currently a research scientist at OpenAI. Before that, he was a postdoc at Stanford CS and completed his PhD in CS at MIT.
Adrian Vladu is a permanent researcher at IRIF, affiliated with CNRS and Université Paris Cité. He received his PhD from MIT Math in 2017, followed by a postdoc at Boston University.
We picked these models because each architecture contributed substantially to the advancement of CNNs and computer vision, and we wanted to see which architecture might be the most resistant to adversarial attacks.
Accuracy and loss for VGG model trained on MNIST data
Each model was trained from scratch for 50 epochs on the MNIST dataset. We then tested each model's performance on the raw test set and on adversarial images generated by each of the three attack types (PGD, BIM, and FGSM).
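The evaluation loop can be sketched as follows. We use the `torchattacks` package here purely for illustration (the write-up does not specify the tooling), `model` and `test_loader` are assumed to already exist, and the `alpha` and `steps` values are placeholders rather than the exact settings used in the experiments.

```python
import torch
import torchattacks  # illustrative choice of attack library

def accuracy(model, loader, attack=None, device="cpu"):
    """Accuracy on the test set, optionally after attacking each batch."""
    model.eval()
    correct, total = 0, 0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        if attack is not None:
            images = attack(images, labels)  # craft perturbed copies (needs gradients)
        with torch.no_grad():
            preds = model(images).argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.size(0)
    return correct / total

# eps = 0.2 matches the L-inf budget reported in the tables below.
attacks = {
    "FGSM": torchattacks.FGSM(model, eps=0.2),
    "BIM":  torchattacks.BIM(model, eps=0.2, alpha=0.02, steps=10),
    "PGD":  torchattacks.PGD(model, eps=0.2, alpha=0.02, steps=10, random_start=True),
}
results = {"Original": accuracy(model, test_loader)}
results.update({name: accuracy(model, test_loader, atk) for name, atk in attacks.items()})
```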
After evaluating the models on the original and adversarial test sets, we trained each model for an additional 10 epochs on an augmented dataset. This new training data comprised a 50/50 split between original images and adversarial images generated by one of the three attacks (PGD, BIM, or FGSM), giving each of the 5 architectures 3 hardened variants, for a total of 15 hardened models with their own trained parameters.
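A rough sketch of this hardening step in PyTorch: each batch is split 50/50, with half the images replaced by adversarial versions before the usual update. The `attack` argument is any callable taking a batch of images and labels (for example, a `torchattacks` attack or `functools.partial(pgd_attack, model)` using the sketch above); the optimizer and learning rate are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def adversarial_finetune(model, loader, attack, epochs=10, lr=1e-3, device="cpu"):
    """Fine-tune on a 50/50 mix of clean and adversarial images (illustrative sketch)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            half = images.size(0) // 2

            # Craft adversarial copies of the second half of the batch.
            model.eval()  # avoid polluting BatchNorm statistics while attacking
            adv_half = attack(images[half:], labels[half:])

            # Standard training step on the mixed batch (labels stay aligned).
            model.train()
            optimizer.zero_grad()
            mixed = torch.cat([images[:half], adv_half], dim=0)
            loss = F.cross_entropy(model(mixed), labels)
            loss.backward()
            optimizer.step()
    return model
```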
PGD adversarial examples
After incorporating adversarial examples into our training process, we observed a slight decrease in the overall accuracy of our model on the original dataset. However, this reduction in accuracy was accompanied by a significant increase in adversarial robustness. By exposing the model to adversarial perturbations during training, it became more resilient to such attacks, demonstrating improved performance when faced with previously unseen adversarial inputs. This trade-off between accuracy on benign examples and robustness against adversarial attacks underscores the utility of incorporating adversarial training techniques.
ε (EPS) = 0.2, where ε is the maximum perturbation size under the L∞ norm.
| Model / Accuracy | Original Data | FGSM Adversarial | BIM Adversarial | PGD Adversarial |
|---|---|---|---|---|
| 2-Layer Net | 0.9884 | 0.5251 | 0.1897 | 0.1897 |
| VGG | 0.9951 | 0.7865 | 0.2252 | 0.2252 |
| LeNet | 0.9927 | 0.8184 | 0.3073 | 0.3073 |
| GoogLeNet | 0.9940 | 0.7823 | 0.3659 | 0.3659 |
| ResNet | 0.9934 | 0.7599 | 0.4163 | 0.4163 |
ResNet: accuracy of each trained variant (rows) on each test set (columns)

| Trained Model | Original | FGSM | BIM | PGD |
|---|---|---|---|---|
| Original | 0.9934 | 0.7559 | 0.4163 | 0.4163 |
| FGSM | 0.9930 | 0.9771 | 0.9675 | 0.9675 |
| BIM | 0.9944 | 0.9732 | 0.9662 | 0.9662 |
| PGD | 0.9933 | 0.9751 | 0.9712 | 0.9712 |
Accuracy of each hardened model on the original test set and on the corresponding adversarial test set

| Model / Accuracy | FGSM Hardened: Original Data | FGSM Hardened: FGSM Adversarial | BIM Hardened: Original Data | BIM Hardened: BIM Adversarial | PGD Hardened: Original Data | PGD Hardened: PGD Adversarial |
|---|---|---|---|---|---|---|
| 2-Layer Net | 0.9737 | 0.8335 | 0.9778 | 0.8168 | 0.9780 | 0.8150 |
| VGG | 0.9898 | 0.9634 | 0.9925 | 0.9703 | 0.9906 | 0.9555 |
| LeNet | 0.9903 | 0.9400 | 0.9898 | 0.9282 | 0.9895 | 0.9270 |
| GoogLeNet | 0.9936 | 0.9629 | 0.9918 | 0.9531 | 0.9905 | 0.9493 |
| ResNet | 0.9930 | 0.9713 | 0.9944 | 0.9618 | 0.9933 | 0.9663 |

Accuracy difference relative to the corresponding un-hardened model (hardened minus original)

| Model / Accuracy Difference | FGSM Hardened: Original Data | FGSM Hardened: FGSM Adversarial | BIM Hardened: Original Data | BIM Hardened: BIM Adversarial | PGD Hardened: Original Data | PGD Hardened: PGD Adversarial |
|---|---|---|---|---|---|---|
| 2-Layer Net | -0.0147 | 0.3084 | -0.0106 | 0.6271 | -0.0104 | 0.6253 |
| VGG | -0.0053 | 0.1769 | -0.0026 | 0.7451 | -0.0045 | 0.7303 |
| LeNet | -0.0024 | 0.1216 | -0.0029 | 0.6209 | -0.0032 | 0.6197 |
| GoogLeNet | -0.0004 | 0.1806 | -0.0022 | 0.5872 | -0.0035 | 0.5834 |
| ResNet | -0.0004 | 0.2114 | 0.0010 | 0.5455 | -0.0001 | 0.5800 |
Based on the tables above, a few key insights emerge.
The network architecture plays a crucial role in robustness to adversarial attacks. Deeper and more complex models like ResNet and GoogLeNet exhibit higher robustness than simpler architectures like the 2-Layer Net and LeNet, likely because their greater capacity and flexibility let them learn robust feature representations during adversarial training. Architectural elements such as residual connections in ResNet and the multi-branch Inception modules in GoogLeNet help maintain performance on clean examples while learning robustness against adversarial perturbations. In contrast, simple feed-forward architectures struggle to learn robustness without compromising accuracy on clean data. While depth helps, the VGG results suggest that architectural inductive biases beyond simply stacking convolutions may be necessary; these design choices facilitate learning robust representations during adversarial training.
In summary, the tables suggest that more complex models like ResNet and GoogLeNet, coupled with strong adversarial training methods like PGD hardening, can provide significant robustness against adversarial attacks while maintaining high accuracy on the original data.
The error analysis of the models revealed that the most common misclassifications occurred between visually similar digits, such as 4 and 9, 3 and 8, 7 and 1, and 5 and 6. These pairs of digits have similar structures and are easily confused, especially when the input is perturbed by an adversarial attack. We examined two types of errors: samples misclassified by all four variants of a model (the original and the three hardened versions), and samples correctly classified by the original model but misclassified by the hardened models. For both error types, the misclassified samples were often visually ambiguous and difficult to classify even for humans, suggesting an irreducible error inherent to the dataset itself.
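As a sketch of how this kind of analysis can be reproduced (assuming each model's predictions on the shared test set have already been collected as tensors), one can intersect the misclassified indices across models and tally the most frequent true/predicted digit pairs:

```python
import torch
from collections import Counter

def common_errors(predictions, labels):
    """Find samples misclassified by every model and count the confused digit pairs.

    `predictions` maps a model name to a 1-D tensor of predicted labels on the same test set.
    """
    wrong = torch.stack([preds != labels for preds in predictions.values()], dim=0)
    shared_idx = wrong.all(dim=0).nonzero(as_tuple=True)[0]  # wrong for all models

    # Tally (true digit, predicted digit) pairs, using one model's predictions as reference.
    ref = next(iter(predictions.values()))
    pairs = Counter((labels[i].item(), ref[i].item()) for i in shared_idx.tolist())
    return shared_idx, pairs.most_common(10)
```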
Adversarial training techniques could be extended to broader applications and larger, more diverse datasets beyond image classification tasks. As neural network architectures become more advanced, with innovations like transformers and diffusion models, it will be crucial to investigate their robustness against sophisticated adversarial attacks tailored to these new architectures. A particularly concerning area is the potential to apply adversarial attacks to large language models (LLMs) with the goal of tricking them into generating harmful or restricted content that they have been specifically trained not to produce. Developing robust LLMs resilient to such adversarial prompts will be critical for ensuring the safe and responsible deployment of these powerful AI systems across various domains.
[1] Madry et al. Towards Deep Learning Models Resistant to Adversarial Attacks. arXiv:1706.06083v4 [stat.ML], 4 Sep 2019.
[2] Moosavi-Dezfooli et al. DeepFool: a simple and accurate method to fool deep neural networks. arXiv:1511.04599 [cs.LG], 4 Jul 2016.