Diffusion models' high-quality generations and ability to condition on text have brought praise and interest to this development. Examining the model's denoising process leads one to the traditional visualization, which passes each intermediate latent through the VAE, producing a series of images in which a form appears out of noise. This view leaves many questions about the model's behavior unanswered, especially because much of the first half of the visualization is too noisy to determine what actions the model is taking. The nature of the prompt's grammar and its impact is also unclear. We aim to answer these questions by probing the latent space of the model and visualizing the results.
To discover more about the diffusion process, we constructed four probes that reveal behavior of the model not observable in the traditional visualization. Probes 1 and 3 indicate the model takes only a few steps to determine the sketch of the final output, with probe 3 showing the model’s focus at various steps. Probe 2 investigates the impact of conditioning between generations with good and bad prompt synthesis.
Probe 4 performs a text-image attribution analysis on the Stable Diffusion model with the help of the diffusion attentive attribution maps (DAAM) method, which is based on interpreting the model using aggregated and upsampled cross attention scores [6]. We use this method to probe into the stable diffusion model’s visual reasoning abilities such as compositional reasoning, object recognition and counting, and spatial and object-attribute relationship understanding. This investigation indicates that the model demonstrates a higher accuracy in representing the correct actions and objects based on the prompts in the generated image, compared to their counts and relationships. In addition, when evaluating the model’s general performance after categorizing the attention scores for a word by its parts of speech, certain word types such as nouns and verbs were observed to have a higher attention score distribution over the generated images in comparison to adjectives and numerals.
Comparing the model's outputs for the same prompt over multiple experiments showed varying performance on several prompts. While some experiments produced an accurate heat map and a generated image that captured every object and relationship conveyed by the prompt, others were missing attributes or objects from the prompt, indicating weaker visual reasoning skills than the model's realistic image generation capabilities might suggest.
Diffusion models are generative models that learn to gradually remove noise from data. In inference mode, they generate data by iteratively denoising a random initial seed. These models gained popularity in the image generation space, where they can be trained to generate images conditioned on a text input, which is the kind of model we explore here.
The idea for this type of model started in 2015 with Sohl-Dickstein et al. and was improved upon in 2019 and 2020, when it gained enough capability and traction to break onto the image generation scene (2).
The following model is often presented to describe the noising process. An image starts at time t=0, and the forward process q gradually adds noise to it over a number of steps until the image is pure Gaussian noise at t=T. The reverse process p removes the noise at a particular step. (1)
Writing the q process is fairly simple. To obtain the latent at the next timestep, we sample a Gaussian whose mean and variance depend on the previous timestep and on a parameter beta. Beta increases with t and lies strictly between 0 and 1. The schedule of betas can be altered as a hyperparameter of the process. The key point is that at time t=T the image must be noised so much that it is indistinguishable from Gaussian noise; in other words, no "signal" remains.
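Written out, the standard single-step form from Ho et al., consistent with the description above, is

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\right), \qquad 0 < \beta_1 < \beta_2 < \dots < \beta_T < 1.$$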
The p process is the hard part; in fact, it is intractable, as it requires knowing the full distribution of all images. The diffusion network is trained to approximate the p process. In practice, Ho et al. find that predicting the added noise "performs approximately as well as predicting [the mean] when trained on the variational bound with fixed variances, but much better [performance] when trained with our simplified objective," alluding to theoretical simplifications made in their paper.
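For reference, the simplified objective from Ho et al. trains a network $\epsilon_\theta$ to predict the noise $\epsilon$ that was added:

$$L_{\text{simple}} = \mathbb{E}_{t,\,x_0,\,\epsilon}\!\left[\left\lVert \epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\ t\right)\right\rVert^2\right], \qquad \bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s).$$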
Architecturally, the noise predictor is a U-Net, with the timestep incorporated as a positional embedding. The HuggingFace implementation (whose weights are freely available) (3) is based on the Ho et al. implementation, which adapted Wide ResNets from Sergey Zagoruyko and Nikos Komodakis (4) with minimal changes.
The text to image model investigated in this report relies on a diffusion model, as described above, in combination with a text encoder to embed the prompt, and a VAE to transform the latent produced by the diffusion into image space. Training data comes from a set of images and associated texts.
In code, the process is slightly more complicated. To increase performance, the impact of the conditioning is boosted by a factor called the guidance scale. Each pass through the diffusion loop actually diffuses two latents: one conditioned on the prompt and one unconditioned, which in practice means conditioning on an empty string embedded by the text encoder. The difference between the two, calculated by subtracting the unconditioned latent (the empty-string one) from the conditioned latent (the prompt one), can be seen as the impact of the conditioning. This difference is multiplied by a scalar in the neighborhood of 5-10 and added to the unconditioned latent, increasing the impact of the prompt during the diffusion process. The code below is based on the "writing your own diffusion pipeline" section of citation 5.
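A minimal sketch of that loop, assuming the text embeddings have already been computed and the U-Net and scheduler are loaded via the diffusers library (variable names are illustrative, not the blog post's exact code):

```python
import torch

@torch.no_grad()
def guided_denoise(unet, scheduler, latent, text_emb, uncond_emb,
                   num_steps=100, guidance_scale=7.5):
    """Classifier-free guidance loop: each step diffuses two latents, one conditioned
    on the prompt embedding and one on the empty-string embedding, and boosts the
    unconditioned prediction by the scaled difference between the two."""
    scheduler.set_timesteps(num_steps)
    for t in scheduler.timesteps:
        # One U-Net call handles both the unconditioned and conditioned passes.
        latent_in = scheduler.scale_model_input(torch.cat([latent, latent]), t)
        noise_pred = unet(latent_in, t,
                          encoder_hidden_states=torch.cat([uncond_emb, text_emb])).sample
        noise_uncond, noise_cond = noise_pred.chunk(2)
        # Impact of conditioning = (conditioned - unconditioned), scaled by ~5-10.
        noise_pred = noise_uncond + guidance_scale * (noise_cond - noise_uncond)
        latent = scheduler.step(noise_pred, t, latent).prev_sample
    return latent
```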
With increasingly complex neural network models in use, attribution has become an important method for better understanding the basis of a model's predictions. Attribution characterizes a model's response by identifying the parts of an input that are responsible for its output [7]. Most attribution methods currently in use are perturbation-based or gradient-based [8, 9]. Gradient-based methods modify the backpropagation algorithm to backtrack the network activations from the output to the input; applied to vision models, they produce saliency maps highlighting important regions of the input image [9]. Perturbation-based methods analyze the effect of perturbing the network's input on the corresponding outputs, for example by selectively removing or occluding parts of the input [9, 10].
Although gradient-based attribution is used to visualize saliency maps for vision models, it is not feasible for diffusion models, as the costly backpropagation would be needed for each pixel across all timesteps, and perturbation-based attributions are challenging to interpret owing to the high sensitivity of diffusion models to even minor perturbations [6]. As a result, Tang et al. proposed a text-image cross-attention attribution method called diffusion attentive attribution maps (DAAM) to visualize the influence of each input word on the generated image. Their method produces a pixel-level attribution map for each word of the prompt. To obtain the maps, it upscales the multiscale word-pixel cross-attention scores computed for the latent image representations in the denoising sub-network, aggregates these scores across the spatio-temporal dimensions, and overlays each word's map on the generated image [6].
We therefore apply the DAAM attribution method to the stable diffusion model to visualize the spatial locality of attribution conditioned on each input word. This attribution technique also provides a way to qualitatively study the stable diffusion model's understanding of concepts and its behavior for input words grouped by part of speech.
The code for probe 1 is derived from the first homework of this class. The code for probes 2 and 3 comes from the "writing your own diffusion pipeline" section of the HuggingFace blog post "Stable Diffusion with Diffusers" (citation 5). Both use the model architecture and weights from HuggingFace's "Stable Diffusion v1.4" (citation 3); without these weights none of this analysis would have been possible. The probe 4 experiments use pretrained weights from HuggingFace's "Stable Diffusion v1.4" and "RunwayML v1.5" models and apply the code open-sourced by the authors of the DAAM paper [6] to implement further analysis and experiments. In addition, the words used for generating the prompts are sourced from the 1000 most frequently occurring words in the COCO caption data [10].
The standard visualization for latent diffusion takes the intermediate latents at various steps through the diffusion process. These intermediates then pass through a VAE to transform from latent space to image space. The effect does convey the principal idea behind diffusion, showing an image emerging from noise, but it is frustrating. In a diffusion of 100 steps, the first 20 or so appear to be nothing but noise. Then, rapidly and without warning, an image appears: the primary subject (as indicated by the prompt) and the background details both become distinguishable. We get no insight into how the model came to this decision, whether the generation ever deviated from its initial notion of the image, or where the model's focus lies at each step. Questions arise such as "What happens in the first 5 steps? Does the model randomly pick directions to push the image until one appears patterned enough to give the diffuser an image to move forward with noise reduction?" and "Does the model change its mind?"
I continued the idea explained in the proposal: trying to get the model to "finish fast." This can be done in two ways. The first is to alter the schedule mid-run to one that tells the model to finish the diffusion quickly; for example, at step 30 of 100, switch to a schedule with only 5 drastic steps that produce the final image. The second is to save a target intermediate latent from a diffusion run and then process it through a separate diffusion with the short target schedule, using that latent as the starting seed. I elected to do the latter.
The effective schedule is therefore the original schedule up to the target intermediate step, concatenated with the new schedule used in the second run. The charts below show the schedule for various targets along with the cumulative sums of the schedules. The top of the peak in "quick finish 10" is the start of the new schedule.
In practice, every intermediate latent was captured and then passed through the quick finish diffusion, which used 5 steps. The number of steps used to force the quick finish is a tunable parameter. The guidance scale in the second diffusion is also critical. The guidance scale is the factor that multiplies the difference between the unconditioned prediction (no prompt, just denoising) and the conditioned prediction (denoising with the prompt as input). A guidance scale of 0.0 means the generation uses no information from the prompt and simply removes noise. A scale of 1.0 means the diffuser works from the conditioned prediction alone. Scales above 1.0 amplify the difference between the unconditioned and conditioned predictions, which is added to the unconditioned latent, resulting in a larger prompt impact. We examined results with scales of 0.0 and 1.0.
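A sketch of this procedure, reusing the hypothetical guided_denoise loop sketched earlier (the VAE decode of each result is omitted):

```python
def quick_finish_all(unet, scheduler, saved_intermediates, text_emb, uncond_emb,
                     num_steps=5):
    """Quick-finish every saved intermediate latent with a fresh short schedule.

    A guidance scale of 0.0 only removes noise (no prompt information enters the
    second diffusion); 1.0 is equivalent to diffusing with the prompt alone."""
    results = {0.0: [], 1.0: []}
    for latent in saved_intermediates:      # latents captured during the original run
        for g in (0.0, 1.0):
            results[g].append(guided_denoise(unet, scheduler, latent.clone(),
                                             text_emb, uncond_emb,
                                             num_steps=num_steps, guidance_scale=g))
    return results                          # pass each result through the VAE to inspect it
```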
With 33 diffusion steps, we have 32 intermediate latents plus the final latent that generates the output. We ran the 5-step quick finish diffusion on each of these latents to "rip out" the idea that may be contained in the noise, producing an equal number of quick finish results (33). We focused our attention on the earlier latents, which appear as pure noise if passed through the VAE before the quick finish. The quick finish was run with guidance scales of 0.0 and 1.0.
If the quick finish uses a guidance of 0, then there is no conditioning from the prompt, meaning the vector representation of “Rhino in a street” never influences the second diffusion. If the idea in this prompt is not present in the results until the 10th latent, then we can conclude the latents before that have not made appreciable progress toward the end goal of that image. Its appearance in the guidance scale of 0.0 run shows us when the idea appears in the intermediate latents by removing the noise that permeates the early latents without adding any information. Since no prompt information is included in the second diffusion, we can be sure that the model is simply removing noise from the latent from the first diffusion.
We can see from the first latent’s quick finish that the model has put 0 concepts from the prompt into the latent at this point. It appears abstract and vaguely monocolored.
The fourth image appears to contain some sort of animal form, which morphs into the rhino in later latents. Since the transition from this animal amalgamation to an actual rhino form is gradual, we are hesitant to pinpoint a specific step where the idea of a rhino appears in the latent. At step 4, we see an animal form, and by step 9, it looks like a rhino.
The concept of a street also emerges in the same range. The background still appears wild, like a jungle or outdoor area, but the foreground is pavement.
We observe an interesting effect when a late-stage latent passes through the diffusion model again as the seed: it comes out looking less detailed and more abstract. We think the model is simply not trained to perform under these conditions, and its puzzling behavior is an artifact of this unusual situation. The model expects a very noisy image for most of the quick finish, so when a clean image is presented, it acts unpredictably.
Over the course of many runs, we can profile the model. With the unconditioned (guidance scale 0.0) quick finish strategy, we will record the step where the prompt manifests in a discernible way. A value of 3 for this score means that latent has passed through 3 diffusion steps as it normally would before the quick finish. While “manifesting” is by nature subjective, we will specify that it is when one major physical object from the prompt appears in the diffusion. The average of 10 runs across different prompts and seeds is 12.3 steps. The total process is 33 steps.
When a guidance scale of 1.0 is used, the rhino is visible starting from the first latent. A guidance scale of 1.0 is equivalent to diffusing with the prompt alone, and this observation is consistent with the results from the 0.0 guidance run. In the 1.0 case, the latent after step 0 has been diffused once with the prompt in the original run and 5 additional times with conditioning in the quick finish, giving it 6 total diffusion steps exposed to the prompt. In the 0.0 case, the 6th latent has been exposed to the prompt 6 times in the original run and 0 times in the quick finish, and the rhino is visible in the image generated from that 6th latent.
In the traditional visualization, the form described by the prompt is not visible until about 40-50% of the way through the diffusion. Seeing the rhino appear earlier in the quick finish without conditioning tells us that the model is performing useful actions that move it toward a clean, coherent image earlier than the traditional visualization would suggest. The interesting effect where the quick finish degrades quality in later runs motivated us to seek alternate visualizations or metrics that reveal the model's actions and progress throughout the diffusion process. The second attempt to answer this question is in probe 3.
Occasionally, the HuggingFace model fails to produce the desired result, which happens when the image does not match the text in the prompt or when the resulting image is incoherent. In the first case, where the prompt is not followed properly, the impact of the guidance may not be strong enough. Alternatively, if the conditioning's impact in the first part of the diffusion is not great enough, then the ideas in the latent may be "set" incorrectly or not firmly enough. We investigate to see which, if either, is the case.
To investigate the idea synthesis, we examined the magnitude of the difference between the conditioned and unconditioned latents. Three methods were devised to calculate this difference. The first method is the simplest: the L2 norm of the difference between the conditioned and unconditioned latent. The latents are of shape (4, 64, 64), and so is their difference; the L2 norm is computed over this flattened difference. The second method smooths out differences caused by small shifts in object borders. If the conditioning translated an object one pixel to the left, the first method would report a large difference, since there would be many mismatches between the object and its shifted position even though the only change is a one-pixel translation. To remedy this, the second method uses a 3D average pooling kernel of shape (4, 2, 2), where 4 is the channel dimension and (2, 2) are the spatial dimensions; after pooling, the L2 norm of the difference between the conditioned and unconditioned latent is measured. The third method keeps the channels independent in pooling, using a (2, 2) average pooling kernel on each channel, and again takes the L2 norm of the pooled difference. There is no padding in the second and third methods, and each kernel's stride is set to 2, its spatial dimension.
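A sketch of the three calculations, assuming the conditioned and unconditioned latents at a step are (4, 64, 64) tensors (the function name is illustrative, and the channel stride in method 2 is our assumption):

```python
import torch
import torch.nn.functional as F

def conditioning_difference(cond_latent, uncond_latent, method=3):
    """cond_latent, uncond_latent: (4, 64, 64) latents from one diffusion step."""
    diff = cond_latent - uncond_latent
    if method == 1:
        # Method 1: L2 norm of the flattened difference.
        return diff.flatten().norm(p=2)
    if method == 2:
        # Method 2: 3D average pooling with a (4, 2, 2) kernel (channels pooled together),
        # then the L2 norm of the pooled result.
        pooled = F.avg_pool3d(diff[None, None], kernel_size=(4, 2, 2), stride=(4, 2, 2))
        return pooled.flatten().norm(p=2)
    # Method 3: per-channel 2x2 average pooling (channels kept independent), then the L2 norm.
    pooled = F.avg_pool2d(diff[None], kernel_size=2, stride=2)
    return pooled.flatten().norm(p=2)
```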
The idea for this investigation came while examining a series of runs for a single prompt. A pattern emerged when plotting the difference between the conditioned and unconditioned latents, split by good or bad prompt synthesis (whether each thing specified in the prompt appears in the image). Runs with good prompt synthesis have a conditioning difference that rises and peaks midway through before abating and then dropping off. Poorly synthesized images have a conditioning difference that rises and plateaus until the end. The figure below shows this effect using the third strategy, which pools each layer independently.
To examine the pattern, many instances over many prompts must be run, and the results must be classified as good or bad. A third classification of neither is used for images that are somewhere in between. The below example shows one of our classifications for the prompt “Clowns in the US Capitol Building.” A good synthesis of ideas does not always indicate a high quality result; an image may include the whole prompt but look distorted.
The actual effect of the three difference calculation methods is slight. We prefer the last method: it provides the benefit of pooling over minor translation issues while keeping the channels separate. There is no reason to pool over channels, since each channel may differ from the others in ways that make pooling across them uninteresting or even misleading. An image example illustrates this. Suppose image A is all red and image B is all blue. The difference between these images is drastic: A - B is 255 in the red channel, -255 in the blue channel, and 0 in the green channel. A 3D 1x1 average pooling strategy would compute (255 + (-255) + 0) / 3 = 0, negating the clear color difference.
A total of 130 images were generated; each prompt was run 20 or 30 times with varying seeds. The analysis seeks to capture the trend in the "pyramids in Kansas" plot using a variety of mathematical strategies. The metrics are: the location of the maximum difference; the sum of the differences before and after the maximum index; the ratio of those two sums, calculated per diffusion; the pre and post slopes over the 10 preceding or trailing steps; and the mean difference for steps 60 to 80 and for steps 18 to 24. These statistics were computed for the raw differences and for the differences scaled so the maximum is 1.0.
The results show that, over many different prompts and seeds, the pattern observed in the several runs of the "pyramids in Kansas" prompt does not hold to a great extent. Only the index of the maximum difference, the scaled sum of all differences across all steps, and the scaled mean for steps 18-24 followed a trend in which the good and bad synthesis values sit at opposite extremes with the mediocre synthesis somewhere in between. We would expect more of the metrics above to follow this trend if it were significant.
The plot below shows the first method, the L2 norm of the difference, in solid lines, and the third method, the L2 norm of the difference between the conditioned and unconditioned latents after a 2x2 average pooling applied to each layer independently, in dashed lines.
We observe a pattern here different from the “pyramids in Kansas” run. The pattern is still distinct enough to be present in this plot where 58 good runs and 36 bad runs were averaged to produce the lines. The poorly synthesized lines peak early, dip, then slowly rise and fall, while the generations with good synthesis generally rise until the halfway point and then fall. While this may look promising, high variance between runs, even runs of the same prompt with different seeds, makes finding this pattern in an individual run difficult.
Examining the plot more broadly reveals a general trend in the diffusion. The difference between the conditioned and unconditioned latents stays high through the middle 80% of the diffusion process and declines steeply toward the end, indicating the model makes little use of the prompt when altering the fine details of the image. This observation matches the traditional visualization's insights, where the image loses some unnatural color and brightness noise that appears independent of the actual content.
The pattern visible in the "pyramids in Kansas" generations does not generalize. On most metrics shown in the table, the good and bad synthesis runs are similar, with a few exceptions. When aggregating over all runs, a pattern does appear, but the high variability of each independent run makes using it difficult. If good or bad latent diffusion generations had a distinct signature, one could use it to pre-filter results: generate a number of images from a prompt with different seeds, then use the signatures to pick the best, discard the worst, or sort the output to show the best images first.
While good and bad runs show up distinctly in the plot and have a few “tells” in the metrics from the table, the weakness of these indicators compared to the variability between runs negates their significance greatly. Examining the manifestation of ideas in diffusion is still a concept worth investigating, and we think the investigation should look elsewhere to gain insight into prompt impact on the latent’s diffusion.
Rather than examining aggregate differences between good and bad prompt synthesis, we next examined the difference between the conditioned and unconditioned latents at each step of diffusion. The magnitude of the difference at each position in the latent is used to create a difference map, which shows the impact of the conditioning on the diffusion once the model's pure denoising contribution is removed; in other words, the impact of the prompt conditioning alone.
When generating an image with latent diffusion, two latents of shape (4, 64, 64) are processed at each step. One is conditioned on the prompt and the other is unconditioned, in practice receiving an embedded empty string as its conditioning. The difference between these two latents is calculated, scaled by the guidance scale (usually around 5-10, a hyperparameter one can set), and then added to the unconditioned latent. Taking the difference of these latents lets us see the model's behavior in terms of the prompt impact alone. The model denoises and conditions on the prompt at the same time in each step, but by subtracting the unconditioned latent, we can remove the purely denoising changes. Technically, the model predicts the noise added at each step by the noising process, which is hypothetical when running the model in inference mode to generate new images. To calculate the difference between the latent after this predicted noise has been removed, for both the conditioned and unconditioned noise predictions, we start with the following:
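$$\Delta_t = \epsilon_\theta(z_t, t, c_{\text{prompt}}) - \epsilon_\theta(z_t, t, c_{\varnothing}),$$

where, in our notation, $\epsilon_\theta(z_t, t, c)$ is the U-Net's predicted noise for the latent $z_t$ at step $t$ under conditioning $c$, and $c_{\varnothing}$ is the embedded empty string.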
This is just the difference between the predicted noise tensors. Since we take the L2 norm of this difference down the channel dimension, the order of subtraction does not matter.
The L2 norm of the difference between the conditioned and unconditioned latents is recorded at each step of the diffusion process. We experimented with the other difference calculation methods explained in probe 2 but decided against using them: their pooling helps smooth over differences when computing broad trends from noisy data, but here we only look at one latent difference at a time, so the pooling is not needed. We display the difference as a map in which larger differences appear as darker colors. This image representation of the latent is sensible and easily interpretable without more advanced conditioned-unconditioned difference calculations, so we chose the simpler difference calculation scheme.
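A minimal sketch of the per-step map, assuming the two predicted-noise tensors from a step are available (function and variable names are illustrative):

```python
import matplotlib.pyplot as plt

def plot_difference_map(noise_cond, noise_uncond):
    """noise_cond, noise_uncond: (4, 64, 64) predicted-noise tensors from one step."""
    delta = noise_cond - noise_uncond
    # L2 norm down the channel dimension gives a (64, 64) map of conditioning impact.
    diff_map = delta.norm(p=2, dim=0)
    # Plot larger differences as darker colors.
    plt.imshow(diff_map.cpu(), cmap="gray_r")
    plt.axis("off")
    plt.show()
```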
Sixty generations were performed and the latents were stored: 3 prompts, each with 20 seeds for unique images. The intermediate steps (after the scheduler removed the noise) were captured as well. The differences and the intermediates passed through the VAE are displayed side by side or separately. The code has an option to display only every nth latent difference and intermediate, since displaying all 100 images for both methods can be excessive.
The figure above shows the traditional visualization and the difference map side by side. The traditional visualization shows the cumulative progress of the diffusion process, while the conditioned-unconditioned difference map shows what the model is focusing on in one particular step. At step 5, we can see the conditioning already drawing the forms of the pyramids, which do not become visible in the traditional visualization for roughly 20 more steps. The model also creates a large conditioned-unconditioned difference in the area of the image that will become the sky. At step 15, the difference map shows fields being emphasized as well as the pyramid forms; a roadway appears as the slanted line under the pyramids. Toward the end of the diffusion, differences are spread over the whole image, which may correspond to texture and details being drawn. In the last 5-10 steps, the total impact of conditioning lessens as the model is mainly removing noise from a nearly complete image.
In the figures above, we again see the ideas from the prompt manifest much more quickly in this new visualization than in the traditional one. After the first diffusion step (step 0), the outline of the Eiffel Tower is already visible. At step 15, details such as the deck at the top of the tower, the deck at the bottom, and the arches appear, while the traditional visualization is still uninterpretable noise. At step 25, the shadow to the left of the tower is brought about by the conditioning. As before, the conditioned-unconditioned difference diminishes as the diffusion reaches the final few steps.
This last figure is provided to illustrate the new visualization technique once more.
This new visualization technique pulls back the veil of the noise particularly visible in the first half of the traditional diffusion to show us what the model is working on at each diffusion step. We can see that the model rapidly starts using the conditioning to place objects from the embedded text prompt into the image. The traditional visualization does not clue us into this process as the forms of items and concepts from the text conditioning do not appear in it until much later.
This visualization lets us see that the model does not really change its mind. After just a few steps, the outlines of the objects that appear in the final image are visible, and these outlines do not change. Further, the model takes an intuitive approach toward drawing: shapes come first, then interior details. In all three prompts from this section, the general form appears in about 5 steps, and the details (e.g., the dog's eyes, the Eiffel Tower's decks) come after. The model focuses on details in the later steps, particularly the last half; this is seen in the concentration of changes. In the early stages, the differences between the conditioned and unconditioned latents are either strong or absent, as the model works on large features. In the later stages (particularly visible in the last example with the Lab), the impact of the conditioning is high across the whole image.
The inability to show color weakens this visualization. It prevents us from seeing where information from the conditioning such as "red" or "wooden" appears, as these details reside in the different channels of the latent. One potential strategy to improve the visualization is to take the conditioned-unconditioned difference, scale it, and add it to a latent that produces a plain gray image when passed through the VAE. It is unclear how best to scale the difference for this approach, or whether it would work at all.
Through this new visualization, we have found trends in latent diffusion generation. The model takes a logical (from a human perspective) approach to drawing the images. The beginning steps, where it appears little progress is being made in the traditional visualization, are where major decisions about the final form are made, indicating some of the most important work of the model occurs here.
Recently, models such as stable diffusion have marked a significant milestone in text-to-image generation and have gained immense popularity and intrigue by generating hyper-realistic images for prompts such as "A photo of an astronaut riding a horse on Mars". Such results indicate the model's ability to compose ideas and concepts that cannot simply be picked up from images in the training data distribution. However, due to the challenges in explainability and interpretability of these models, there has not been extensive research on the scope and limitations of these models' visual reasoning abilities, such as compositional reasoning, object recognition and counting, and spatial and object-attribute relationship understanding.
As a starting point to assess the visual reasoning abilities of the stable diffusion model, we implement the cross-attention based attribution method called DAAM to visualize the regions of most influence of each input word in a prompt on the generated image, to probe into what the model was “thinking” or paying attention to while generating the output. By assessing the patterns in attribution maps and attention distribution for different types of input words, we seek to gain a better insight into the extent of a model’s visual reasoning abilities.
To carry out this research, we implemented the DAAM method to visualize the attention activation maps for each word of the prompt (a sketch of this pipeline is shown below). In the initial experiment, we used this method to visualize the heat maps for a range of prompts that require some compositional reasoning, action and object recognition, and counting to render realistic images. To further focus the experiment on the model's visual reasoning abilities, the second experiment involved curating a list of 50-60 prompts sourced from the 1000 most frequently occurring words in the COCO image caption dataset. The 1000 most frequent words were labeled with six POS tags (NN, VB, JJ, DT, IN, PRP) and an "other" category, which also translate to different visual reasoning skills. For instance, we assume that, visually, nouns (NN) and verbs (VB) correspond to object and action recognition skills, adjectives (JJ) correspond to object-attribute relationship understanding, and the other categories (DT, PRP, IN, and other) correspond to spatial relation and counting understanding.
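A sketch of how these per-word heat maps are produced with the authors' open-source daam package [6], following its documented usage pattern (exact function names may differ across versions):

```python
import torch
from diffusers import StableDiffusionPipeline
from daam import trace  # open-source package released by the DAAM authors [6]

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4").to("cuda")
prompt = "Two young giraffes skating in a busy park"

with torch.no_grad(), trace(pipe) as tc:
    out = pipe(prompt, num_inference_steps=30)
    # Aggregate cross-attention scores over layers, heads, and timesteps,
    # then extract the per-word map and overlay it on the generated image.
    global_heat_map = tc.compute_global_heat_map()
    word_heat_map = global_heat_map.compute_word_heat_map("giraffes")
    word_heat_map.plot_overlay(out.images[0])
```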
By performing a quantitative analysis of the attention weight distributions for different POS tags, we seek to compare any global patterns in the attention activations for different word types. Characterizing these patterns also helps assess the model’s performance in terms of the visual reasoning skills and identify any existing gaps in the model’s learning that could be further investigated and improved.
The initial analysis involved qualitatively assessing the attention activation maps for each input word, using prompts that require the model to compose different ideas and concepts to render a realistic representation. For instance, Figure 2.8 shows some of the sample prompts and the resulting visualizations from the initial experiment.
From the initial experiment, we noticed the highest and most accurate attention activations for nouns and verbs, and lower activations for attributes and for words corresponding to numbers. We decided to focus on this behavior and characterize the attention activations for common verbs, nouns, adjectives, and other words among the 1000 most prevalent words in the COCO image caption dataset. We chose this dataset because it was used to train the stable diffusion model and could help streamline the experiment by designing prompts from words more familiar to the model. The words were tagged with six categories of POS tags to help compare the attention weight distributions across word types.
Using this approach, the second experiment involved visualizing the attention maps for 50-60 different prompts designed to test different visual reasoning abilities of the model. The results from these experiments are summarized below.
From the initial experiments, there were distinct differences between the attention heat maps for input words corresponding to nouns and verbs and those representing numbers and determiners. While the nouns and verbs showed the highest activations around the regions representing those words, the adjectives, numerals, and other word types mostly showed lower and more spread-out attention activations. This could be attributed to the fact that adjectives, numerals, and other word types represent more generic concepts than verbs and nouns, which might be easier for the model to recognize.
In the additional experiments, after running the "Comp Vis V1.4" and "Runway ML V1.5" stable diffusion models for 30 and 60 inference steps on the 50-60 prompts curated from the frequent COCO caption words, we found similar behavior from the two models overall. Some of the visualization results for different categories of prompts are summarized in tables 2.2 and 2.3 below.
In addition to the qualitative visual experiment, we computed the mean and standard deviation of the normalized attention scores for each input word across the 50-60 prompts, categorized them by POS tag, and plotted the distribution for each category. These plots are shown in Figure 2.9.
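A sketch of that aggregation, assuming per-word normalized attention scores have already been collected across the prompt set (the NLTK tagger and the grouping helper are our own assumptions, not part of DAAM):

```python
from collections import defaultdict
import statistics
import nltk  # assumes nltk.download("averaged_perceptron_tagger") has been run

CATEGORIES = ("NN", "VB", "JJ", "DT", "IN", "PRP")

def summarize_by_pos(word_scores):
    """word_scores: iterable of (word, normalized_attention_score) pairs
    accumulated over all prompts. Returns mean and std of scores per POS category."""
    groups = defaultdict(list)
    for word, score in word_scores:
        tag = nltk.pos_tag([word])[0][1]                       # e.g. 'NN', 'VBG', 'JJ'
        category = next((c for c in CATEGORIES if tag.startswith(c)), "other")
        groups[category].append(score)
    return {cat: (statistics.mean(s), statistics.pstdev(s)) for cat, s in groups.items()}
```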
Upon testing different visual skills of the two versions of the stable diffusion model, a pattern similar to the initial experiment was observed in the attention distribution characteristics. For noun- and verb-based input words, the models generally displayed strong, accurate attention maps aligning with the actions and objects described by the prompt, indicating relatively accurate object and action recognition skills. However, the attention weight patterns for numerals and other word types representing spatial and multi-object attribute relationships showed weaker alignment between the words and the heat map patterns.
Table 2.2: Prompts used to test object and action recognition (generating specific objects doing specific actions) and object counting (generating a specific number of objects). For each prompt, the generated images and per-word DAAM heat maps from Runway ML Stable Diffusion V1.5 and Com Vis Stable Diffusion V1.4 were compared (images not reproduced here).

| Object & Action Recognition prompts | Object Counting prompts |
|---|---|
| A dog running across the field | A new pair of red shoes tied to a tree branch |
| A graffiti of fruits | Two young giraffes skating in a busy park |
| A painting of children playing outdoors | Three horses running in water |
| A mouse wearing a bright clothing | Three kids flying a kite from the roof of a wooden house |
| A kitten wearing a wedding dress | A man carrying three bags |
| An officer riding a zebra in the zoo | Four smiling dogs sitting at a picnic table |
| A woman riding an elephant in the desert | Four professional men jumping in front of a commercial building |
| A bear carrying a suitcase | Five zebras in the middle of a tennis court |
Table 2.3: Prompts used to test spatial and object-attribute relationship understanding (generating specific objects with different spatial relations or attributes). For each prompt, the generated images and per-word DAAM heat maps from Runway ML Stable Diffusion V1.5 and Com Vis Stable Diffusion V1.4 were compared (images not reproduced here).

| Spatial and Object-Attribute Relationship prompts |
|---|
| A toy ship inside a glass bottle |
| An old man standing to the left of a young woman |
| A fish inside a silver cage |
| Green carrots surrounded by orange apples in a basket |
| Apple shaped chocolate served in a flat bowl |
| A pink cat playing soccer at the beach with a black sheep |
| Several purple birds swimming across a giant wooden hot tub |
| A skateboarder standing atop a square shaped rock |
| A giant mouse posing with a small cat |
In our experiments, we worked with a limited set of prompts and attempted some quantitative evaluation in addition to qualitative analysis, but additional experiments would be needed to validate and understand the patterns we observed. In addition, for probe 4 we used the 1000 most frequent words from the COCO caption dataset, whose POS tags are unevenly distributed, which may have limited our ability to assess the attention weight distribution patterns accurately. Future experiments could use an equal number of words from each tag to look for more general patterns in the attention weights.
Overall, through our experiments we probed the diffusion model's behavior and the diffusion process, explored the extent and limitations of its visual and compositional reasoning abilities, and characterized the qualitative and quantitative patterns we observed. To discover more about the diffusion process, we constructed four probes that reveal behavior of the model not observable in the traditional visualization.
In probe 1, we observed a general trend in the diffusion process by taking intermediate latents from a run and passing them through a short "quick finish" diffusion. Through this experiment, we observed that the model performs useful actions that move it toward a clean, coherent image earlier than the traditional visualization would suggest. The interesting effect where the quick finish degrades quality for late-stage latents motivated us to seek alternate visualizations and metrics that reveal the model's actions and progress throughout the diffusion process.
In probe 2, we examined the magnitude of the difference between the conditioned and unconditioned latents. We observed that the model may not use the prompt when altering the fine details of the image, which matches the traditional visualization's insight that, near the end, the image loses some unnatural color and brightness noise that appears independent of the actual content. In probe 3, we visualized the conditioned-unconditioned difference map to see what the model focuses on at each particular step. Together, probes 1 and 3 indicated that the model takes only a few steps to determine the sketch of the final output, with probe 3 showing the model's focus at various steps.
Apart from exploring the diffusion process, in probe 4 we implemented the diffusion attentive attribution maps (DAAM) method to visualize the stable diffusion model's regions of influence for each input word in the prompt and to quantify the attention weight distributions. Both quantitative and qualitative analyses indicated that the model represents the actions and objects in a prompt more faithfully than their counts and relationships.
In addition, the model's performance varied across cases: some generations accurately represented all components of the prompt, while others had missing objects or misaligned spatial and object-attribute relationships.