By: Riya Gurnani and Xinyu Wu
Paper Link: GAN Dissection: Visualizing and Understanding Generative Adversarial Networks
In this project, we delve into Generative Adversarial Networks (GANs) to better visualize and understand them. GAN Dissection: Visualizing and Understanding Generative Adversarial Networks by Bau, Strobelt, et al. (2019) introduces a framework for visualizing and understanding GANs at multiple levels of abstraction. The paper provides insight into how GANs internally represent and generate visual content, which is crucial given their widespread use in real-world applications. We aim to better understand and expand upon the framework from the paper, since gaining deeper insight into these models can inform how to improve and control them. Our goal is to implement parts of the code from the paper and test the framework on new objects to evaluate the behavior of the model across various images, which also provides further insight into its generalization capabilities. For example, the interactive demo from the paper focused on church images and the ability to remove or add various objects such as trees, grass, and doors. We expand on this by testing different types of scenes (conference room, restaurant, and dining room) and new objects not previously tested (windows and chairs) to see how well the framework can find the highest-activating units for those objects and generate them in new images.
While Generative Adversarial Networks (GANs) have shown a significant ability to produce realistic images, there is still a gap in understanding what these networks learn internally and how that internal structure relates to the images they produce. Therefore, this paper by Bau et al. looks at the internal representations of GANs and how they are structured, including units, objects, and the contextual relationships between objects (Bau, Strobelt, et al., 2019). Objects studied in the paper include trees, grass, doors, skies, clouds, bricks, and domes. To examine these objects, the paper considers the featuremap r output by an intermediate layer of the generator G, which maps a latent vector z to an image x, so that x = G(z) = f(r) with r = h(z). The paper analyzes the units (channels) of r in two phases: (1) dissection and (2) intervention. Dissection identifies the classes that are explicitly represented in r and measures the agreement between each individual unit and a class c. Intervention then uses the classes found through dissection to analyze causal relationships between units and object classes by turning sets of units on and off.
Building on previous literature, the paper uses an intersection-over-union (IoU) measure that quantifies the spatial agreement between a unit's thresholded featuremap and a concept's segmentation. This metric can therefore rank units and pair each one with the concept that matches it most closely. However, outputs generally depend on multiple parts of the representation, so a combination of units must be identified for a specific object. Thus, ablating and inserting sets of units in the featuremap and measuring how the generated objects change identifies the causal relationship between units and classes, quantified as the average causal effect. The set of units to intervene on can be optimized with stochastic gradient descent and an L2 regularization term to achieve a stronger causal effect. The paper has a number of key findings about units across different datasets, layers, and models: (1) individual units emerge as object detectors, (2) interpretable units appear for different scene categories such as living rooms and kitchens, (3) different layers contain varying numbers of interpretable units, and (4) these interpretable units can help differentiate between GAN models. For the purposes of our project, our goal is to identify interpretable units for objects not tested in the paper, such as windows and chairs, in hopes of replicating similar results. Overall, the paper provides a strong basis for understanding units, objects, and their relationships in GAN models by dissecting and intervening on individual layers.
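To make the dissection metric concrete, below is a minimal sketch (not the paper's actual code) of how the IoU between one unit's thresholded featuremap and a concept segmentation could be computed. The threshold, tensor shapes, and random placeholder data are our own assumptions; the paper instead picks each unit's threshold from its top activation quantile and averages the agreement over many images.

```python
import torch
import torch.nn.functional as F

def unit_concept_iou(featuremap, concept_mask, threshold):
    """IoU between a thresholded unit featuremap and a binary concept segmentation."""
    # Upsample the low-resolution featuremap to the segmentation's resolution.
    up = F.interpolate(featuremap[None, None], size=concept_mask.shape,
                       mode="bilinear", align_corners=False)[0, 0]
    unit_mask = up > threshold                      # binarize the unit's activation
    inter = (unit_mask & concept_mask).float().sum()
    union = (unit_mask | concept_mask).float().sum()
    return (inter / union.clamp(min=1)).item()

# Toy example: random data stands in for a real unit featuremap and segmentation.
fmap = torch.randn(8, 8)                            # one unit's 8x8 activation map
seg = torch.rand(256, 256) > 0.7                    # placeholder "window" segmentation
print(unit_concept_iou(fmap, seg, threshold=0.5))
```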
Our Implementation Link: DS4440 Final Project
In order to achieve our goal of testing different objects, our first step was to understand the framework detailed in the paper and its provided code. The code from the paper analyzes how objects are encoded within a GAN generator through dissection and intervention: by intervening in the network and observing the effects on the generated images, we gain insight into the causal relationships between units and objects. Due to limited GPU resources and some dependency issues, we did not use the entire codebase from the paper's GitHub repository. Instead, we examined how the authors created their interactive demo and tutorials. These pieces of code included the highest-activating units for trees. However, for the purposes of our project, we needed to find the highest-activating units for our chosen objects: windows and chairs.
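As a rough illustration of the intervention idea (a sketch under our own assumptions, not the repository's actual API), the snippet below forces a chosen set of featuremap channels to zero at one intermediate layer via a PyTorch forward hook and regenerates the image, so that the objects controlled by those units should disappear. The generator, layer name, and unit indices in the commented usage line are hypothetical.

```python
import torch

def ablate_units(generator, layer_name, unit_ids, z):
    """Regenerate G(z) with the given featuremap channels forced to zero."""
    def hook(module, inputs, output):
        output = output.clone()
        output[:, unit_ids] = 0.0                   # turn the selected units "off"
        return output                               # forward hooks may replace the output
    layer = dict(generator.named_modules())[layer_name]
    handle = layer.register_forward_hook(hook)
    try:
        with torch.no_grad():
            image = generator(z)
    finally:
        handle.remove()                             # restore the unmodified network
    return image

# Hypothetical usage, assuming G is a loaded church generator with a layer named "layer4":
# img = ablate_units(G, "layer4", unit_ids=[221, 198], z=torch.randn(1, 512))
```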
Another piece of code from the paper has the functionality to load an image, click and drag on it to select a certain part in order to trigger either an erasing or adding action. Thus, to find the highest activating units for our own objects, we utilized parts of this code and added onto it. We wanted to include a variety of testing so we loaded different GAN generators for the objects. For reference, here is the first five layers of the pretrained GAN model used that is then trained on the different image settings:
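For readers following along with the code, here is a small sketch (our own helper, not from the paper's repository) of how such a layer listing can be printed, assuming `generator` is an already-loaded `torch.nn.Module` such as one of the Progressive GAN generators used in the paper's demo:

```python
def show_first_layers(generator, n=5):
    """Print the first n top-level layers of a loaded generator module."""
    for name, module in list(generator.named_children())[:n]:
        print(name, "->", module)

# show_first_layers(generator)   # `generator` is assumed to be loaded beforehand
```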
For windows, we used the church and conference room generators, and for chairs, we used the restaurant and dining room generators. From here, we created a segment of code that randomly chooses one of the generated images, uses the select feature tool to select the target object in the image, and displays the highest-activating units. We then wrote a script to run this for 50 randomly selected images for each object and setting (window in church, window in conference room, chair in restaurant, and chair in dining room). Here is an example of finding the selected units when a window is selected in a conference room image:
We saved all of the selected units and image numbers to four files corresponding to the four object-setting combinations. We then analyzed this data to find the ten most commonly occurring units for each combination, giving us four sets of units across the two objects. Using these most commonly activated units, we ran tests to see how well those units select the objects, whether they can visualize the selected objects in new images, and whether they can generate the object when we run the select tool over a part of a new image.
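The snippet below is a simplified sketch of this unit-gathering and counting step, using random placeholder featuremaps and a hand-drawn region mask in place of the real layer outputs and the select tool; the layer size, number of units, and top-k value are assumptions for illustration.

```python
import torch
from collections import Counter

def top_units_in_region(featuremaps, region_mask, k=10):
    """featuremaps: (units, H, W) layer activations; region_mask: (H, W) bool."""
    region = featuremaps[:, region_mask]            # activations inside the selection
    scores = region.mean(dim=1)                     # mean activation per unit
    return torch.topk(scores, k).indices.tolist()

counts = Counter()
for _ in range(50):                                 # 50 randomly generated images
    fmaps = torch.randn(512, 8, 8)                  # placeholder for a real layer output
    mask = torch.zeros(8, 8, dtype=torch.bool)
    mask[2:6, 3:7] = True                           # placeholder for the selected region
    counts.update(top_units_in_region(fmaps, mask))

print(counts.most_common(10))                       # the ten most commonly occurring units
```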
For our testing of how well the chosen units recognize and generate the object in new images, we followed a process similar to the paper's tests of tree generation in church images (among other objects). We repeated the same process across all four object-setting combinations, with windows generally performing better for a number of reasons discussed below. The process involved taking the highest-activating units found with our object identification methodology and running our program to highlight places in the image where the given object should be inserted. From there, we recorded the resulting image to identify any visible changes and generated objects.
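Below is a simplified sketch of this insertion test under the same assumptions as the earlier snippets: inside a user-selected region (at the featuremap's resolution), the most common units are forced to a high activation value before the rest of the generator runs. The constant activation value is a hypothetical choice, whereas the paper learns per-unit insertion values.

```python
import torch

def insert_units(generator, layer_name, unit_ids, region_mask, z, value=10.0):
    """Regenerate G(z) with the chosen units forced high inside region_mask."""
    def hook(module, inputs, output):
        output = output.clone()
        for u in unit_ids:
            output[:, u][:, region_mask] = value    # paint the units into the region
        return output
    layer = dict(generator.named_modules())[layer_name]
    handle = layer.register_forward_hook(hook)
    try:
        with torch.no_grad():
            image = generator(z)
    finally:
        handle.remove()
    return image

# Hypothetical usage with assumed names:
# img = insert_units(G, "layer4", top_window_units, mask, z=torch.randn(1, 512))
```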
For our first object, windows, adding the object to church images led to varied results. Below is a visualization of before and after we tried to generate window objects in ten different church images. A few highlighted sections (in red) in the fourth and sixth images show a generated window. For the other images, we found that attempting to generate would sometimes blur parts of the image but not produce a window. In the first, seventh, and ninth images, we noticed slight changes but no new objects. Rather, for those particular images, if we selected a segment near an existing window, the model would slightly expand it and make the window larger.
Overall, our chosen units did not generate what we had aimed for. We believe this is largely due to the fact that churches contain many small windows that are difficult to recognize, even for human eyes, so it can be hard for the model to distinguish these areas. To further our understanding of the model's behavior, we also printed unit masks from the top-activating images for the highest-activating unit for windows in the church images, shown below. The visualization shows some success for unit 221. Additional tests of other units can be found at our implementation link:
We followed a similar testing process for window objects in the conference room images. We had more success generating windows in this context, likely due to the larger window sizes in these images as well as the greater contrast in brightness between the windows and the indoor conference rooms. The visualization below shows ten images with the before and after for when we selected different parts of each image. There are highlighted sections (in red) for the first four images as well as the last two. In those tests, we saw object generation in the form of a brighter patch of light resembling a window or a white canvas. In the other images, we saw small changes but no significant object generation.
This testing process saw more success in generating windows in conference rooms. We believe this is because conference rooms have fewer windows and their surface area is larger, which makes them easier to recognize. Further, the windows in conference rooms tend to be consistently rectangular, while church windows are often smaller and more varied in shape. Considering that a number of our generated objects appeared as regions of higher brightness (rather than the full structure of a window), we also believe that the strong brightness contrast from sunlight coming through the windows was a key factor associated with the highest-activating units, and therefore with the ability to generate windows in conference rooms. Below is a visualization showing the unit masks from the top-activating images for the highest-activating unit for windows in the conference room images (unit 198):
Following the same procedure as for windows, we ran tests for the generation of chairs for both dining room and restaurant settings. The success rate for generating chairs was lower than for windows, likely due to a number of factors: (1) chairs have more complex shapes, (2) the color and brightness contrast of chairs is lower, and (3) there is more variance in chair designs than window designs. Further, many of the dining room and restaurant photos that we ran the initial unit gathering methodology on were blurry and difficult to distinguish. Therefore, the units chosen were less precise.
For the dining room setting, three of ten test images generated something resembling a chair. For each of these (boxed in red below), our methods did not generate a full chair; rather, they generated parts of chairs, such as legs and chair backs. All other trials failed to generate anything resembling a chair. These results are not surprising, considering that legs and backs are the most distinguishable features of chairs and the most consistent across chair types.
Shown below is a visualization of the unit masks from the top activating images for the second highest activating unit for chairs in the dining room images (unit 325). The masks identify a number of chair legs but fail to identify the full chair. This is likely because the legs are more consistently apparent across multiple images.
Compared to the dining room setting, the restaurant images performed worse, with only two of ten trials showing visible image differences and none of the trials generating a chair. Both images with noticeable changes altered the material of the floor from carpet or tile to wood, which can be explained by the fact that most restaurant chairs are made of wood. However, this did not accomplish our goal of generating a chair using the most commonly identified activation units.
Below are the unit masks from the top-activating images for the second highest-activating unit for chairs in the restaurant images (unit 178). The masks select a very wide area of the image, suggesting that the unit fails to isolate the chair. We believe this is largely due to the low contrast between the chairs and other parts of each image.
In comparison to the original paper's objects such as trees, we had a more difficult time successfully generating windows and chairs. We attribute this to the fact that trees are more homogeneous across images, while windows and chairs vary widely: windows come in different shapes, sizes, colors, and brightness levels, and chairs differ in material, structure, size, color, and positioning. Further, both windows and chairs look very different from different angles, while trees are generally symmetrical and look similar from all angles. Overall, our chosen activation units had higher success rates for windows than for chairs, but did not perform as well as the paper's results.
While our results were not as strong as the original paper's, the tests we conducted across the four settings allowed us to identify a number of lessons and implications. We found that certain images have qualities that make it easier to identify activation units, such as higher brightness, greater contrast relative to surrounding objects, more consistent color, and less variability in shape. Thus, we believe that objects such as trees, which generally have similar colors and shapes and low variability, are easier to generate than objects with high variability such as chairs. Future work in this area can expand to other types of image settings and objects, such as erasing or adding facial features from images of human faces, or erasing and adding humans or animals in different settings. Further work to increase the performance of our project includes identifying activation units over a greater number of randomly generated images and trying more combinations of different activation units. This area of research has a significant impact on daily life and can help develop areas such as social media, digital marketing, security, journalism, and healthcare. For social media, journalism, and digital marketing, this sort of model makes editing images much easier, providing industry professionals with greater creative freedom. For security purposes, the model could help recreate crime scenes for criminal investigations. Healthcare applications include erasing unnecessary objects from X-rays, MRIs, and other scans in order to focus attention on critical parts. All in all, dissecting GAN models to erase and add objects in images is both interesting and relevant to many present-day issues.
[1] David Bau, Jun-Yan Zhu, Hendrik Strobelt, Bolei Zhou, Joshua B. Tenenbaum, William T. Freeman, Antonio Torralba. GAN Dissection: Visualizing and Understanding Generative Adversarial Networks, Proceedings of the International Conference on Learning Representations (ICLR), 2019.
1. Riya Gurnani (gurnani.r@northeastern.edu)
2. Xinyu Wu (wu.xinyu2@northeastern.edu)