By: David Pogrebitskiy, Benjamin Wyant
Computer vision models have become very reliable classifiers of complex image data; however, they can be tricked into misclassifying slightly altered inputs with surprising ease. This paper tackles a critical challenge in deep learning: the vulnerability of neural networks to adversarial attacks. Adversarial examples are inputs that are almost indistinguishable from natural data, yet are classified incorrectly by the network with high confidence. These attacks not only pose security risks, but also reveal that our current models do not generalize as well as we would like.
We recreated the experiments by training five of our own models on the MNIST dataset and testing their adversarial robustness using PGD, BIM, and FGSM to generate adversarial examples. Our goal was to test a wider range of network architectures and determine how large a role the architecture plays in a model's ability to generalize, and thus in its robustness to adversarial attacks.
Achieving adversarial robustness is an important step towards making networks more secure and reliable, especially in high-stakes applications like autonomous driving and malware detection. The insights from this work could pave the way for a new generation of deep neural networks that are much more resistant to adversarial attacks.
The key challenge addressed in this paper is the vulnerability of deep neural networks to adversarial examples: inputs that are almost indistinguishable from natural data, yet are classified incorrectly.
Through a range of experiments, the authors demonstrate that finding adversarial examples is computationally tractable: simple first-order methods like projected gradient descent (PGD) reliably find local maxima with similar loss values. Leveraging these insights, the authors trained neural networks on the MNIST and CIFAR10 datasets that achieve high robustness against a wide range of adversarial attacks, including iterative methods like PGD. This represents a significant advance over prior work, which was often vulnerable to more sophisticated adversaries.
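For intuition, here is a minimal PyTorch sketch of an L∞-bounded PGD attack. This is not the authors' exact implementation; the step size `alpha`, the number of steps, and the assumption that pixel values lie in [0, 1] are illustrative choices. BIM is the same loop without the random start, and FGSM is a single step with `alpha = eps`.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=0.2, alpha=0.02, steps=40):
    """Illustrative L-inf PGD: random start, signed-gradient ascent, projection."""
    # Start from a random point inside the eps-ball around x.
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        # Step up the loss along the gradient sign, then project back into the eps-ball around x.
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv.detach()
```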
Aleksander Madry is the Cadence Design Systems Professor of Computing in the MIT EECS Department. He is the director of the MIT Center for Deployable Machine Learning and a faculty co-lead of the MIT AI Policy Forum.
Aleksandar Makelov is an independent researcher working on mechanistic interpretability of large language models. He received his PhD from MIT.
Ludwig Schmidt is an assistant professor of computer science at the University of Washington. He is also a research scientist on the AllenNLP team at AI2 and a member of LAION.
Dimitris Tsipras is currently a research scientist at OpenAI. Before that, he was a postdoc at Stanford CS and completed his PhD in CS at MIT.
Adrian Vladu is a permanent researcher at IRIF, affiliated with CNRS and Université Paris Cité. He received his PhD from MIT Math in 2017, followed by a postdoc at Boston University.
We picked these models because each architecture contributed substantially to the advancement of CNNs and computer vision, and we wanted to see which architecture might be the most resistant to adversarial attacks.
Accuracy and loss for VGG model trained on MNIST data
Each model was trained from scratch for 50 epochs on the MNIST dataset. We then tested each model's performance on the raw test set and on adversarial images generated by each of the three attack types (PGD, BIM, and FGSM).
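The evaluation loop can be sketched as follows. We use the `torchattacks` package here purely for illustration (the write-up does not specify the tooling), `model` and `test_loader` are assumed to already exist, and the `alpha` and `steps` values are placeholders rather than the exact settings used in the experiments.

```python
import torch
import torchattacks  # illustrative choice of attack library

def accuracy(model, loader, attack=None, device="cpu"):
    """Accuracy on the test set, optionally after attacking each batch."""
    model.eval()
    correct, total = 0, 0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        if attack is not None:
            images = attack(images, labels)  # craft perturbed copies (needs gradients)
        with torch.no_grad():
            preds = model(images).argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.size(0)
    return correct / total

# eps = 0.2 matches the L-inf budget reported in the tables below.
attacks = {
    "FGSM": torchattacks.FGSM(model, eps=0.2),
    "BIM":  torchattacks.BIM(model, eps=0.2, alpha=0.02, steps=10),
    "PGD":  torchattacks.PGD(model, eps=0.2, alpha=0.02, steps=10, random_start=True),
}
results = {"Original": accuracy(model, test_loader)}
results.update({name: accuracy(model, test_loader, atk) for name, atk in attacks.items()})
```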
After evaluating the models on the original and adversarial test sets, we trained each model for an additional 10 epochs on an augmented dataset. This new training data comprised a 50/50 split between original images and adversarial images generated by one of the three attacks (PGD, BIM, or FGSM), giving each of the 5 architectures 3 hardened variants, for a total of 15 hardened models with their own trained parameters.
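A rough sketch of this hardening step in PyTorch: each batch is split 50/50, with half the images replaced by adversarial versions before the usual update. The `attack` argument is any callable taking a batch of images and labels (for example, a `torchattacks` attack or `functools.partial(pgd_attack, model)` using the sketch above); the optimizer and learning rate are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def adversarial_finetune(model, loader, attack, epochs=10, lr=1e-3, device="cpu"):
    """Fine-tune on a 50/50 mix of clean and adversarial images (illustrative sketch)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            half = images.size(0) // 2

            # Craft adversarial copies of the second half of the batch.
            model.eval()  # avoid polluting BatchNorm statistics while attacking
            adv_half = attack(images[half:], labels[half:])

            # Standard training step on the mixed batch (labels stay aligned).
            model.train()
            optimizer.zero_grad()
            mixed = torch.cat([images[:half], adv_half], dim=0)
            loss = F.cross_entropy(model(mixed), labels)
            loss.backward()
            optimizer.step()
    return model
```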
PGD adversarial examples
After incorporating adversarial examples into our training process, we observed a slight decrease in the overall accuracy of our model on the original dataset. However, this reduction in accuracy was accompanied by a significant increase in adversarial robustness. By exposing the model to adversarial perturbations during training, it became more resilient to such attacks, demonstrating improved performance when faced with previously unseen adversarial inputs. This trade-off between accuracy on benign examples and robustness against adversarial attacks underscores the utility of incorporating adversarial training techniques.
ε (EPS) = 0.2, where ε is the maximum perturbation size under the L∞ norm.
| Model / Accuracy | Original Data | FGSM Adversarial | BIM Adversarial | PGD Adversarial |
|---|---|---|---|---|
| 2-Layer Net | 0.9884 | 0.5251 | 0.1897 | 0.1897 |
| VGG | 0.9951 | 0.7865 | 0.2252 | 0.2252 |
| LeNet | 0.9927 | 0.8184 | 0.3073 | 0.3073 |
| GoogLeNet | 0.9940 | 0.7823 | 0.3659 | 0.3659 |
| ResNet | 0.9934 | 0.7599 | 0.4163 | 0.4163 |
ResNet: accuracy of each trained variant (rows) on each test set (columns)

| Trained Model | Original | FGSM | BIM | PGD |
|---|---|---|---|---|
| Original | 0.9934 | 0.7559 | 0.4163 | 0.4163 |
| FGSM | 0.9930 | 0.9771 | 0.9675 | 0.9675 |
| BIM | 0.9944 | 0.9732 | 0.9662 | 0.9662 |
| PGD | 0.9933 | 0.9751 | 0.9712 | 0.9712 |
Accuracy of each hardened model on the original test set and on the corresponding adversarial test set

| Model / Accuracy | FGSM Hardened: Original Data | FGSM Hardened: FGSM Adversarial | BIM Hardened: Original Data | BIM Hardened: BIM Adversarial | PGD Hardened: Original Data | PGD Hardened: PGD Adversarial |
|---|---|---|---|---|---|---|
| 2-Layer Net | 0.9737 | 0.8335 | 0.9778 | 0.8168 | 0.9780 | 0.8150 |
| VGG | 0.9898 | 0.9634 | 0.9925 | 0.9703 | 0.9906 | 0.9555 |
| LeNet | 0.9903 | 0.9400 | 0.9898 | 0.9282 | 0.9895 | 0.9270 |
| GoogLeNet | 0.9936 | 0.9629 | 0.9918 | 0.9531 | 0.9905 | 0.9493 |
| ResNet | 0.9930 | 0.9713 | 0.9944 | 0.9618 | 0.9933 | 0.9663 |

Accuracy difference relative to the corresponding un-hardened model (hardened minus original)

| Model / Accuracy Difference | FGSM Hardened: Original Data | FGSM Hardened: FGSM Adversarial | BIM Hardened: Original Data | BIM Hardened: BIM Adversarial | PGD Hardened: Original Data | PGD Hardened: PGD Adversarial |
|---|---|---|---|---|---|---|
| 2-Layer Net | -0.0147 | 0.3084 | -0.0106 | 0.6271 | -0.0104 | 0.6253 |
| VGG | -0.0053 | 0.1769 | -0.0026 | 0.7451 | -0.0045 | 0.7303 |
| LeNet | -0.0024 | 0.1216 | -0.0029 | 0.6209 | -0.0032 | 0.6197 |
| GoogLeNet | -0.0004 | 0.1806 | -0.0022 | 0.5872 | -0.0035 | 0.5834 |
| ResNet | -0.0004 | 0.2114 | 0.0010 | 0.5455 | -0.0001 | 0.5800 |
Based on the tables above, a few key insights emerge.
The network architecture plays a crucial role in robustness to adversarial attacks. Deeper and more complex models like ResNet and GoogLeNet exhibit higher robustness than simpler architectures like the 2-Layer Net and LeNet, likely because their greater capacity and flexibility let them learn robust feature representations during adversarial training. Architectural elements such as residual connections in ResNet and the multi-branch Inception modules in GoogLeNet help maintain performance on clean examples while learning robustness against adversarial perturbations. In contrast, simple feed-forward architectures struggle to learn robustness without compromising accuracy on clean data. While depth helps, the VGG results suggest that architectural inductive biases beyond simply stacking convolutions may be necessary; these design choices facilitate learning robust representations during adversarial training.
In summary, the tables suggest that more complex models like ResNet and GoogLeNet, coupled with strong adversarial training methods like PGD hardening, can provide significant robustness against adversarial attacks while maintaining high accuracy on the original data.
The error analysis of the models revealed that the most common misclassifications occurred between visually similar digits, such as 4 and 9, 3 and 8, 7 and 1, and 5 and 6. These pairs of digits have similar structures and are easily confused, especially when the input is perturbed by an adversarial attack. We examined two types of errors: samples misclassified by all four variants of a model (the original and the three hardened versions), and samples correctly classified by the original model but misclassified by the hardened models. For both error types, the misclassified samples were often visually ambiguous and difficult to classify even for humans, suggesting an irreducible error inherent to the dataset itself.
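As a sketch of how this kind of analysis can be reproduced (assuming each model's predictions on the shared test set have already been collected as tensors), one can intersect the misclassified indices across models and tally the most frequent true/predicted digit pairs:

```python
import torch
from collections import Counter

def common_errors(predictions, labels):
    """Find samples misclassified by every model and count the confused digit pairs.

    `predictions` maps a model name to a 1-D tensor of predicted labels on the same test set.
    """
    wrong = torch.stack([preds != labels for preds in predictions.values()], dim=0)
    shared_idx = wrong.all(dim=0).nonzero(as_tuple=True)[0]  # wrong for all models

    # Tally (true digit, predicted digit) pairs, using one model's predictions as reference.
    ref = next(iter(predictions.values()))
    pairs = Counter((labels[i].item(), ref[i].item()) for i in shared_idx.tolist())
    return shared_idx, pairs.most_common(10)
```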
Adversarial training techniques could be extended to broader applications and larger, more diverse datasets beyond image classification tasks. As neural network architectures become more advanced, with innovations like transformers and diffusion models, it will be crucial to investigate their robustness against sophisticated adversarial attacks tailored to these new architectures. A particularly concerning area is the potential to apply adversarial attacks to large language models (LLMs) with the goal of tricking them into generating harmful or restricted content that they have been specifically trained not to produce. Developing robust LLMs resilient to such adversarial prompts will be critical for ensuring the safe and responsible deployment of these powerful AI systems across various domains.
[1] Madry et al. Towards Deep Learning Models Resistant to Adversarial Attacks. arXiv:1706.06083v4 [stat.ML], 4 Sep 2019.
[2] Moosavi-Dezfooli et al. DeepFool: a simple and accurate method to fool deep neural networks. arXiv:1511.04599 [cs.LG], 4 Jul 2016.