An Analysis of: Witches’ Brew: Industrial Scale Data Poisoning Via Gradient Matching

Analysis by Adrian Bisberg and Evan Rose

Introduction

Modern machine learning typically relies on collecting massive amounts of training data, often from untrusted sources, in order to train a large-scale model. For example, the Pile dataset (Gao et al., 2020), which is commonly used for language modeling tasks, consists of over 800 GB of text data scraped from the internet with only limited filtering. This process exposes machine learning systems to data poisoning attacks. In a data poisoning attack, an adversary injects malicious data into the training set in order to manipulate the behavior of the trained model.

This paper studies a stealthy type of poisoning attack in which the poisoned images must resemble clean training data. This means the attacker cannot simply inject incorrectly-labeled examples into the dataset, but rather must apply imperceptible perturbations on top of the original images. To the human eye, the poisoned dataset is indistinguishable from the original.

Figure: An overview of the poisoning procedure. The attack imperceptibly perturbs a small subset of the data (between 0.1% and 1% of the training set) using parameter information from a pretrained model. After the victim retrains from scratch on the poisoned data, the model misclassifies the target image as the adversarial label.

Methods and Results

Threat Model

Security papers typically outline a set of rules called a threat model describing the knowledge, capabilities, and goals of the attacker. The threat model used in Witches’ Brew is as follows:

  1. Goal - cause the victim model, after training from scratch on the poisoned dataset, to classify a chosen target image as an attacker-chosen adversarial label, while leaving overall validation accuracy essentially unchanged.
  2. Capabilities - the attacker may perturb only a small fraction of the training images (between 0.1% and 1%), each within an \(\ell_\infty\) budget of \(\varepsilon\); labels cannot be modified.
  3. Knowledge - the attacker can train a model of the same (or a similar) architecture on clean data to obtain gradient information, but does not control or observe the victim's training run.

Formally, this threat model can be described as a bi-level optimization problem in which the adversary wants to minimize an adversarial loss (the adversary's objective) subject to the model developer's training procedure (on poisoned data):

\begin{align} \min_{\Delta \in \mathcal{C}} &\sum_{i=1}^T \mathcal{L}(F(x_i^t, \theta(\Delta)), y_i^{\mathrm{adv}}) \\ &\mathrm{s.t.} \; \theta(\Delta) \in \arg \min_{\theta} \frac{1}{N} \sum_{i=1}^N \mathcal{L}(F(x_i + \Delta_i, \theta), y_i) \end{align}

In this formulation, the (bounded) perturbation \(\Delta\) is optimized to enforce the malicious label onto target data, while the solution to the inner problem still minimizes the original training objective of the model. A per-instance perturbation budget \(\varepsilon\) bounds the \(\ell_\infty\)-norm of the perturbation \(\Delta_i\), and this constraint is encoded inside the set \(\mathcal{C}\).
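For concreteness, here is a minimal PyTorch-style sketch of the projection onto \(\mathcal{C}\). The names (`project_perturbation`, `delta`, `x`) are our own rather than the paper's, and pixel values are assumed to lie in \([0, 1]\), so \(\varepsilon\) would be expressed on that scale (e.g., \(8/255\)).

```python
import torch

def project_perturbation(delta: torch.Tensor, x: torch.Tensor, eps: float) -> torch.Tensor:
    """Project a perturbation back into the l-infinity ball of radius eps and keep
    the poisoned image x + delta inside the valid pixel range [0, 1]."""
    delta = torch.clamp(delta, -eps, eps)          # enforce ||delta||_inf <= eps
    delta = torch.clamp(x + delta, 0.0, 1.0) - x   # keep x + delta a valid image
    return delta
```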

Key Idea: Gradient Alignment

The bi-level problem above is difficult to solve directly due to the expensive nature of the inner problem. Instead, the authors propose a method called gradient alignment to solve the problem efficiently. The key idea is to align the gradient of the adversarial loss with the gradient of the poisoned training loss. If the adversary could enforce similarity of the gradients:

\[ \nabla_\theta \mathcal{L}(F(x^t, \theta), y^{\mathrm{adv}}) \approx \frac{1}{N} \sum_{i=1}^N \nabla_{\theta} \mathcal{L}(F(x_i + \Delta_i, \theta), y_i) \]

for all \(\theta\) encountered during training, the adversarial loss would decrease alongside the training loss. Since enforcing this alignment at every \(\theta\) encountered during training is infeasible, the attack instead fixes the parameters of a clean pretrained model and chooses perturbations \(\Delta_i\) that minimize the cosine dissimilarity between the two gradients at those parameters:

\[ \min_{\Delta \in \mathcal{C}} \sum_{i=1}^T \left(1 - \frac{\nabla_\theta \mathcal{L}(F(x_i^t, \theta), y_i^{\mathrm{adv}}) \cdot \nabla_{\theta} \sum_{j=1}^P \mathcal{L}(F(x_j + \Delta_j, \theta), y_j)}{\left\|\nabla_\theta \mathcal{L}(F(x_i^t, \theta), y_i^{\mathrm{adv}})\right\|_2 \left\|\nabla_{\theta} \sum_{j=1}^P \mathcal{L}(F(x_j + \Delta_j, \theta), y_j)\right\|_2}\right) \]

To align the target and poison gradients, the attack minimizes this cosine dissimilarity using signed Adam updates with a decaying step size. Throughout this optimization the model parameters \(\theta\) are held fixed; only the perturbations \(\Delta_i\) are updated in search of poisons that force the misclassification. The method is efficient because its cost scales with the number of poisons \(P\) rather than with the size of the entire dataset.
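A simplified PyTorch sketch of this alignment objective is given below. It is our own reconstruction rather than the authors' implementation: `model` is an assumed pretrained network, `target_image` is a batch containing the target, `adv_label` its adversarial label, `poison_images`/`poison_labels` are the examples to be perturbed, and `project_perturbation` refers to the constraint sketch above.

```python
import torch
import torch.nn.functional as F

def alignment_loss(model, poison_images, delta, poison_labels, target_image, adv_label):
    """Cosine dissimilarity between the adversarial (target) gradient and the
    gradient of the training loss on the perturbed poison examples."""
    params = [p for p in model.parameters() if p.requires_grad]

    # Gradient of the adversarial loss: the target image paired with the malicious label.
    target_loss = F.cross_entropy(model(target_image), adv_label)
    target_grad = torch.autograd.grad(target_loss, params)

    # Gradient of the training loss on the poisons, kept differentiable w.r.t. delta.
    poison_loss = F.cross_entropy(model(poison_images + delta), poison_labels)
    poison_grad = torch.autograd.grad(poison_loss, params, create_graph=True)

    # 1 - cosine similarity over the flattened parameter gradients.
    dot = sum((tg * pg).sum() for tg, pg in zip(target_grad, poison_grad))
    target_norm = torch.sqrt(sum((tg ** 2).sum() for tg in target_grad))
    poison_norm = torch.sqrt(sum((pg ** 2).sum() for pg in poison_grad))
    return 1.0 - dot / (target_norm * poison_norm)

# One signed-gradient step on the perturbation; model weights stay frozen. The paper
# instead uses signed Adam with a decaying step size, plus restarts and augmentation.
# delta = torch.zeros_like(poison_images, requires_grad=True)
# loss = alignment_loss(model, poison_images, delta, poison_labels, target_image, adv_label)
# (grad_delta,) = torch.autograd.grad(loss, delta)
# with torch.no_grad():
#     delta = project_perturbation(delta - step_size * grad_delta.sign(), poison_images, eps)
```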

The authors additionally justify their method on theoretical grounds by invoking a classical result in numerical optimization, Zoutendijk's Theorem (Nocedal & Wright, 2006), to argue that if the alignment is maintained throughout training (i.e., the cosine similarity stays positive), then gradient descent on the poisoned training loss also drives the adversarial loss toward a stationary point.

Experimental Results

Figure: Cosine similarity between the adversarial (target) gradient and the training gradient across training epochs, comparing the poisoned dataset to the unmodified one. The red line, representing the poisoned data, maintains a positive similarity throughout training, illustrating the gradient alignment achieved by the optimization.

Because the threat model does not grant the attacker full knowledge of the victim's training setup, such as the exact architecture, dropout layers, hyperparameters, or data augmentation, the authors propose several techniques to make the attack more robust to these variations:

  1. Data augmentation - at each minimization step, randomly draw a translation/crop and possibly a flip for each poisoned image. This improves transferability and robustness to whatever augmentation the victim model may use.
  2. Restarts - run the cosine-dissimilarity minimization several times from different random starting perturbations, then keep the restart with the lowest alignment loss (see the sketch after this list).
  3. Model ensembles - optimize the attack over an ensemble of models with different architectures, improving generalization at a higher computational cost.
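The restart strategy is simple to sketch in the same hypothetical setting, reusing the `alignment_loss` and `project_perturbation` helpers from above; again, this is an illustrative reconstruction rather than the authors' code, and the step counts and step size are placeholder values.

```python
import torch

def brew_with_restarts(model, poison_images, poison_labels, target_image, adv_label,
                       eps, num_restarts=4, steps=250, step_size=0.1):
    """Run the alignment optimization from several random starts and keep the best one."""
    best_delta, best_loss = None, float("inf")
    for _ in range(num_restarts):
        # Each restart begins from a different random perturbation inside the eps-ball.
        delta = project_perturbation((torch.rand_like(poison_images) * 2 - 1) * eps,
                                     poison_images, eps)
        delta.requires_grad_(True)
        for _ in range(steps):
            loss = alignment_loss(model, poison_images, delta, poison_labels,
                                  target_image, adv_label)
            (grad_delta,) = torch.autograd.grad(loss, delta)
            with torch.no_grad():
                delta = project_perturbation(delta - step_size * grad_delta.sign(),
                                             poison_images, eps)
            delta.requires_grad_(True)
        if loss.item() < best_loss:                 # keep the restart with the lowest loss
            best_loss, best_delta = loss.item(), delta.detach()
    return best_delta
```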
Figure: Results on CIFAR-10 with a ResNet-18, showing attack success for varying values of R (number of restarts) and K (ensemble size).

The authors display results for the success rate of the proposed method in producing the targeted misclassification. Overall, the attack is very effective (success rate \(> 90\%\)) when the model architecture is known in advance and when the perturbation budget is sufficiently generous (\(\varepsilon \ge 16\)). When restricting the perturbation budget to \(\varepsilon = 8\), the attack success rate drops (45%-55% on CIFAR-10). When the poisons are generated using a surrogate model with a different architecture than the victim's, success decreases further, to as low as \(7\%\) when transferring from ResNet-18 to VGG11. However, attack generality can be improved by optimizing poisons over an ensemble of heterogeneous model architectures.

Figure: Poisoning ImageNet. Top: clean images from the original dataset. Bottom: the corresponding poisoned images after applying the optimized perturbations, generated with a perturbation budget of \(\varepsilon = 8\).
Table: Results on the benchmark proposed in Schwarzschild et al. (2020). The "Training From Scratch" category evaluates poisoned CIFAR-10 datasets with a poison budget of 1% of the training data and \(\varepsilon = 8\), averaged over 100 attack targets, and compares previously proposed poisoning methods in the same setting. Attacks crafted on ResNet-18 are evaluated against ResNet-18 as well as other architectures to demonstrate transferability.

History and Related Works

Data poisoning attacks against machine learning extend back at least as early as the 2000s, when security researchers were interested in manipulating email spam filters (Nelson et al., 2008). This was a time before deep learning had gained traction in practical applications, so these attacks typically targeted simple classifiers, such as logistic regression or SVM.

Between 2012 and 2014, a number of works developed gradient-based attacks against classification, including against neural networks. Biggio et al. (2012) used gradient methods to craft poisoned data that increases SVM test error. Around the same time, Biggio et al. (2013) and Szegedy et al. (2014) used gradient methods to evade machine learning classifiers by perturbing test instances in imperceptible ways. This early work sparked a flurry of work on understanding the robustness properties of machine learning systems, especially deep neural networks.

Modern work on poisoning has advanced to become much more effective and stealthy. For example, the early backdoor poisoning attack by Gu et al. (2017), which assumed an attacker with the ability to arbitrarily change training data, was soon refined into versions in which the attacker does not need to modify data labels (Turner et al., 2019) or reveal the backdoor trigger inside the training data (Saha et al., 2020). Other attacks targeted the transfer learning setting (Shafahi et al., 2018; Zhu et al., 2019) by crafting poisons that behave anomalously in the learned representation space.

Today, practitioners consider poisoning attacks to be among the most serious threats to machine learning security (Kumar et al., 2020). This is because modern machine learning typically relies on collecting massive amounts of training data, often from untrusted sources and on such a large scale that manual data cleaning is impossible.

Author Biography

Jonas Geiping

Education: MS in Mathematics from University of Münster; PhD in Computer Science from University of Siegen
Jonas was a Postdoctoral Associate at University of Maryland at the time this paper was published.
Research goals: Understanding data privacy and adversarial attacks in large-scale models, and how to define and develop "safer" models.

Liam Fowl

Education: PhD in Mathematics from University of Maryland
Research Interests: Clean-label data poisoning, metalearning, and federated learning

W Ronny Huang

Education: PhD in Electrical Engineering and Computer Science from MIT
Professional: Machine Learning Researcher at University of Maryland; Technical Staff at NASA; Research Scientist at Google

Wojciech Czaja

Professor in the Mathematics department at University of Maryland
Author on over 100 papers at UMD, several collaborations with Geiping and Fowl

Gavin Taylor

Education: PhD from Duke University
Professional: Professor of Computer Science at US Naval Academy, researcher in collaboration with UMD on the behaviors and vulnerabilities of neural networks

Michael Moeller

Education: PhD from University of Münster
Professional: Professor for Computer Vision at the University of Siegen, Germany
Research Interests: Model and learning based techniques in imaging and computer vision

Tom Goldstein

Education: PhD in Mathematics from UCLA
Professional: Research scientist at Rice University and Stanford University; Professor of Computer Science at University of Maryland
Research Interests: Computer vision, signal processing, platform optimization

Societal Impact

Data poisoning attacks pose a serious threat to the integrity of learning systems. The authors of Witches’ Brew demonstrate that poisoning attacks can be launched successfully using only imperceptible changes to images, minimal amounts of modified training data, and no significant effect on the validation accuracy of the system. In machine learning systems that rely on large amounts of web-scraped data, poisons like these can be injected across multiple training cycles over time, making the attack both hard to detect and difficult to undo.

There are always qualifications to the feasibility of an adversarial attack; Witches’ Brew depends on access to a trained model (or a surrogate of it) and on the ability to contribute training data. This makes systems that repeatedly retrain on newly collected data, such as continual learning or reinforcement learning pipelines, especially exposed.

Model integrity attacks have broader social implications because they provide a means to bypass safety measures the model is intended to enforce. For example, a large model might be tasked with identifying explicit or violent content, and a data poisoning attack could force it to classify such content as safe.

Other types of adversarial attacks pose additional threats to model security. In a model extraction attack, the attacker queries the model and uses the observed outputs to assemble a duplicate of a previously private model. Model inversion attacks can reveal information about the dataset used to train the model, potentially compromising private information.

Research into adversarial attacks is useful for identifying model vulnerabilities and for evaluating the effectiveness of various defensive strategies. When models are built with greater awareness of different attacks, security measures can be incorporated into their training and deployment. In recent years the field of adversarial machine learning has grown significantly, producing new standards for evaluating a model's robustness against attacks. Finding a completely general defense is difficult; however, creating well-documented and reproducible attacks will doubtless help unify and advance defensive strategies.

Industry Applications

The central mechanism used to generate the poisons, called gradient alignment, is a general-purpose relaxation of a difficult bi-level optimization problem. Such problems frequently appear in many different applications, even those not involving any security objective. We will briefly motivate one such example, dataset condensation, and describe how gradient alignment can be used to achieve it.

Consider the process of learning a new field of study. When the field is nascent, the task of understanding it falls to researchers, who explore the space inefficiently as they attempt to find what does and does not work. As the field matures, however, the knowledge of the field becomes more concentrated, and the task of learning it becomes easier. This is because the knowledge of the field is condensed into a smaller number of papers, which can be read and understood more quickly than the original research. By the time the information reaches the level of a textbook, the knowledge has been refined so that a new reader can understand decades of work in only a couple of months.

Dataset condensation (sometimes called dataset distillation) is the process of achieving this same effect in machine learning. The goal is to take a large dataset and condense it into a smaller one, while preserving the performance of a model trained on the original dataset. This is useful when developers wish to retrain a model (for example, on a new architecture) but do not want to incur the large cost of retraining on a large number of samples.

The process of dataset condensation can be formulated as a bi-level optimization problem, similar to the formulation used in Witches’ Brew. The outer problem, which relates to the condensation task, is to minimize the test error of a model trained on the condensed dataset. The inner problem, which relates to the training task, is to minimize the training error of that model on the condensed dataset (which is itself being optimized). The outer problem is difficult to solve directly because it requires differentiating through the entire inner training procedure. Gradient alignment provides an efficient relaxation: instead of minimizing the test error directly, we minimize the cosine dissimilarity between the gradient computed on the original training data and the gradient computed on the condensed data, over some sampling of model parameters.
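To make this concrete, below is a minimal PyTorch sketch of a single gradient-matching step for dataset condensation. All names (`condensation_step`, `syn_images`, and so on) are hypothetical, and a practical method would repeat such steps over many model initializations and training stages.

```python
import torch
import torch.nn.functional as F

def condensation_step(model, real_images, real_labels, syn_images, syn_labels, lr=0.1):
    """One gradient-matching update of a small synthetic (condensed) dataset.

    `syn_images` is a learnable tensor (requires_grad=True) holding the condensed
    images; the update nudges it so that the gradient it induces aligns with the
    gradient induced by a batch of real data.
    """
    params = [p for p in model.parameters() if p.requires_grad]

    # Gradient of the training loss on a batch of real data (treated as the target).
    real_grad = torch.autograd.grad(
        F.cross_entropy(model(real_images), real_labels), params)

    # Gradient of the training loss on the synthetic data, differentiable w.r.t. the images.
    syn_grad = torch.autograd.grad(
        F.cross_entropy(model(syn_images), syn_labels), params, create_graph=True)

    # Cosine dissimilarity between the two parameter gradients.
    dot = sum((rg * sg).sum() for rg, sg in zip(real_grad, syn_grad))
    real_norm = torch.sqrt(sum((rg ** 2).sum() for rg in real_grad))
    syn_norm = torch.sqrt(sum((sg ** 2).sum() for sg in syn_grad))
    loss = 1.0 - dot / (real_norm * syn_norm)

    # Update the synthetic images rather than the model weights.
    (grad_syn,) = torch.autograd.grad(loss, syn_images)
    with torch.no_grad():
        syn_images -= lr * grad_syn
    return loss.item()
```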

Academic Research Follow-Ups

The robustness of general machine learning algorithms to data poisoning attacks is still not well-understood. A valuable future research direction would be to characterize precisely in which scenarios and to what extent datasets and learning algorithms are susceptible to poisoning. Some recent work has begun to address this question (Wang et al., 2022; Lu et al., 2023), but much more work remains to be done.

Another important direction is to develop defenses against data poisoning attacks. Previous defenses, which treat poison detection as an outlier detection problem in some feature space, fail to detect poisons generated by gradient alignment. This is because the poisons are designed to be indistinguishable from clean data in the feature space. However, future defenses might be able to exploit the fact that the poisons are generated by a specific optimization procedure, and thus have a particular structure. Alternatively, imposing some regularity on the loss function landscape might limit the effectiveness of poisoning attacks generated according to such an optimization procedure.

Peer Review Ratings

Adrian:

Strengths:

Weaknesses:

Rating: 7: Good paper, accept.

Confidence: 4: The reviewer is confident but not absolutely certain that the evaluation is correct.

Evan:

Strengths:

Weaknesses:

Rating: 6: Accept.

Confidence: 4: The reviewer is confident but not absolutely certain that the evaluation is correct.

References