Stress-Testing Stable Diffusion Concept Erasure

Introduction

The issue of unwanted content is one that has plagued diffusion models since their inception. Recently, there have been a few different approaches to this, including a newer method by Rohit Gandikota, Joanna Materzy´nska, Jaden Fiotto-Kauffman and David Bau in their preprint called Erasing Concepts from Diffusion Models. Most methods used to solve the issue of unwanted content -- like undesirable image removal, image cloaking, or model editing -- rely on directly interfering with the model, whether it be through changing the train dataset or editing the output. The new angle that concept erasure provides is one that seeks to employ a model's learning against itself. In concept erasure, a diffusion model has its weights fine-tuned using negative guidance. Concept erasure is naturally void of many of the disadvantages that are present in other methods working to achieve a similar goal. The goal of this project is to attempt to uncover disadvantages that remain undiscovered. We will be stress testing this method by removing objects and artistic style from the model, then attempting to bring them back through prompt engineering. We hope to prove that the erased concepts are not fully gone from the model, and can be reintroduced through clever prompting.

Review

The training of the Erasing Stable Diffusion, or ESD model includes two diffusion models: One whos parameters θ, are updated during training, and a second whos parameters θ* are frozen. The fine tuning of the model weights occurs during denoising stage of image generation, where the image is only partially denoised. The unfrozen model θ will generate a partially denoised image, conditioned on the given concept to erase c. The frozen model θ* will then predict the noise, continuing the image generation process. The frozen model makes two predictions, once predicting the noise while conditioned by c, the second time predicting without conditioning. The images generated are then fed into the loss function, which tunes θ to reduce the probability that the output image will be classified as c.

Methodology

Our methodology will revolve around tweaking an initial prompt to attempt to restore the artistic style or object removed from the model. Our attacks will vary based on removed object or concept. For objects, we will attempt to find work arounds to the term in generation. For example, when the concept of 'car' has been erased from the model, we can attempt to generate it by telling the model to piece together several different concepts it already knows. We might use a prompt such as "Generate a vehicle with four tires driving on a road". This prompt with the combination of tires, driving, and road could cause the model to "remember" the concept of a car, generating it from untouched parts. For artistic style, our approach will be similar. We will specifically describe the image to be generated, and attempt to describe the style to be used. However, as artistic style is inherently less concrete than an object, this method may struggle more with artistic style.

Experimental Findings

Early Findings

The reintroduction of artistic style proved to be significantly more challenging than we anticipated. We began by loading the tuned model weights which removed Van Gogh's artistic style. Then, our approach was to take a base prompt and edit it while testing to effects of each edit. To generate our base prompt, we asked ChatGPT to describe Van Gogh's Starry Night. The results below illustrate the difference between the erased stable diffusion model, and the normal stable diffusion model.