MaskGIT: Masked Generative Image Transformer

An Analysis of MaskGIT: Masked Generative Image Transformer

MaskGIT: Where pixels play hide and seek, and the bidirectional decoder always wins the game!
MaskGIT: Masked Generative Image Transformer
Figure 1: Bidirectional sampling

The significance of this paper lies in its approach to image synthesis, which departs from the conventional token-by-token sequential generation of images. Instead, it introduces a bidirectional transformer called MaskGIT, which, during training, learns to predict randomly masked tokens by attending to information from all directions. This approach not only makes image synthesis more efficient but also substantially improves image quality.

The paper demonstrates that MaskGIT outperforms state-of-the-art transformer models on the ImageNet dataset. It also accelerates autoregressive decoding by up to 64x: because MVTM (Masked Visual Token Modeling) training uses bidirectional self-attention, all tokens in the image can be predicted simultaneously in parallel rather than one at a time. This efficiency gain could substantially reduce the cost of image generation. Moreover, the paper showcases the versatility of MaskGIT by demonstrating its application to various image editing tasks, including inpainting, extrapolation, and image manipulation.

Biography

Huiwen Chang (Google Research)
  • Computer science researcher focused on image processing, currently at OpenAI and previously worked at Google, Adobe, and Facebook.
  • Ph.D. from Princeton University.
Han Zhang (Google Research)
  • Research Scientist at Google DeepMind with a Ph.D. in Computer Science from Rutgers University.
  • Internships at Google Brain, OpenAI, Facebook, Philips Research North America, and the Lab of Media Search at NUS, Singapore.
  • Interests are computer vision, deep learning, and medical image analysis.
Lu Jiang (Google Research)
  • Research scientist and manager at Google Research. Adjunct faculty member at Carnegie Mellon University.
  • Known for his significant contributions to natural language processing and computer vision.
  • Chair of NeurIPS 2023.
Ce Liu (Microsoft Azure AI)
  • Chief Architect at Microsoft; formerly at Google.
  • IEEE Fellow.
  • Ph.D. from MIT.
William T. Freeman (MIT)
  • Professor of electrical engineering and computer science at MIT.
  • Research interests include machine learning applied to computer vision.
  • Bachelor's degree from Stanford University in 1979, and Ph.D. from MIT in 1992.

Literature Review/Previous Work

Transformer
Figure 2: Transformer Architecture
  • Transformers: A simple, novel network architecture based solely on attention mechanisms. Inspired by the success of the Transformer and GPT in NLP, generative transformer models have received growing interest for image synthesis.
    Reference paper: Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems 30 (2017).

  • BERT (Bidirectional Encoder Representations from Transformers): BERT was designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. Its masked modeling objective was later extended to image representation learning, with images quantized into discrete tokens.
    Reference paper: Devlin, Jacob, et al. "BERT: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805 (2018).

  • Mask-Predict: A masked language modeling objective used to train a model to predict any subset of the target words, conditioned on both the input text and a partially masked target translation. This approach allows for efficient iterative decoding. The MaskGIT authors were inspired by this bidirectional machine translation work; they explain that their novelty lies in the proposed new masking strategy and decoding algorithm.
    Reference paper: Ghazvininejad, Marjan, et al. "Mask-predict: Parallel decoding of conditional masked language models." arXiv preprint arXiv:1904.09324 (2019).

  • VQGAN (Vector-Quantized GAN): The paper proposes a convolutional VQGAN that combines the efficiency of convolutional approaches with the expressive power of transformers, and combines adversarial with likelihood training in a perceptually meaningful way. The VQGAN learns a codebook of context-rich visual parts, whose composition is then modeled with an autoregressive transformer.
    Reference paper: Esser, Patrick, Robin Rombach, and Bjorn Ommer. "Taming transformers for high-resolution image synthesis." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. (2021).

Methodology

This paper introduces a new bidirectional transformer for image synthesis called the Masked Generative Image Transformer (MaskGIT). During training, MaskGIT learns a proxy task similar to BERT's mask prediction. At inference time, it adopts a novel non-autoregressive decoding method that synthesizes an image in a constant number of steps.
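
To make the training-time masking concrete, below is a minimal NumPy sketch of how a random mask might be drawn for one training example, assuming the cosine schedule γ(r) = cos(πr/2) that the paper later identifies as best; `sample_training_mask` is a hypothetical helper name, not code from the authors.

```python
import numpy as np

def cosine_schedule(r: float) -> float:
    # gamma(r) = cos(pi * r / 2): mask ratio 1 at r = 0, falling to 0 at r = 1.
    return float(np.cos(np.pi * r / 2.0))

def sample_training_mask(num_tokens: int, rng: np.random.Generator) -> np.ndarray:
    # Draw a ratio r, then mask ceil(gamma(r) * N) uniformly chosen positions.
    r = rng.uniform(0.0, 1.0)
    num_masked = int(np.ceil(cosine_schedule(r) * num_tokens))
    mask = np.zeros(num_tokens, dtype=bool)
    mask[rng.choice(num_tokens, size=num_masked, replace=False)] = True
    return mask

mask = sample_training_mask(256, np.random.default_rng(0))  # 16x16 token grid
```

The transformer is then trained with cross-entropy loss to recover the original token ids at exactly the masked positions.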

Sequential decoding vs MaskGIT's scheduled parallel decoding

Sequential decoding
Figure 3: Sequential vs Parallel Decoding
  • Latent Masks (Rows 1 and 3): These rows show the latent masks used by the model. Initially, all codes are unknown (lighter gray), representing areas of the image yet to be filled.
  • Samples (Rows 2 and 4): Here, we see the images as they are progressively constructed. MaskGIT fills in the latent representation with more and more scattered predictions in parallel (darker gray areas), as opposed to the sequential, line-by-line approach.
A key takeaway is MaskGIT's speed: it completes decoding in just 8 iterations, whereas the sequential method takes 256 rounds. This efficiency comes from its non-autoregressive decoding, in which the model predicts all tokens simultaneously in parallel at each iteration but keeps only the most confident ones; the remaining tokens are masked out and re-predicted in the next iteration. A quick calculation below illustrates the step counts.
Unlike the sequential approach, MaskGIT uses bidirectional self-attention, enabling it to generate new tokens using context from all directions. This not only speeds up the process but also enhances the quality of image generation.
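
To see where the 8-versus-256 contrast comes from, the following back-of-the-envelope sketch (assuming the cosine mask schedule discussed below) prints how many of a 16x16 grid's 256 tokens are fixed after each of 8 parallel iterations:

```python
import numpy as np

N, T = 256, 8  # 16x16 token grid; 8 decoding iterations, as in Figure 3
for t in range(1, T + 1):
    ratio = np.cos(np.pi * (t / T) / 2.0)                    # fraction still masked
    still_masked = int(np.ceil(ratio * N)) if t < T else 0   # all fixed at the end
    print(f"iteration {t}: {N - still_masked:3d} of {N} tokens fixed")
```

A sequential decoder, by contrast, fixes exactly one token per forward pass and therefore needs all 256 steps.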

Architecture/Pipeline of MaskGIT: Masked Generative Image Transformer

MaskGIT Architecture
Figure 4: MaskGIT Architecture Pipeline
MaskGIT follows a two-stage design:
  • Tokenization: The first stage involves a tokenizer, similar to the one used in the VQGAN model. This tokenizer converts images into a series of visual tokens, essentially breaking down the image into smaller, manageable pieces for processing.
  • Masked Visual Token Modeling (MVTM): A bidirectional transformer predicts and refines these tokens in parallel, using a novel approach that iteratively masks and re-predicts parts of the image to generate the final result efficiently.
The decoding algorithm synthesizes an image in T steps. At each iteration, the model predicts all tokens simultaneously but only keeps the most confident ones. The remaining tokens are masked out and re-predicted in the next iteration. Various functions (linear, concave, convex) are considered for mask scheduling, with empirical results showing the cosine function as the most effective.
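
The loop below is a minimal NumPy sketch of this decoding procedure under the cosine schedule. The `predict_tokens` callable and the MASK sentinel are assumptions standing in for the bidirectional transformer; this is an illustration, not the authors' implementation.

```python
import numpy as np

MASK = -1  # hypothetical sentinel id for a masked position

def cosine_schedule(r: float) -> float:
    return float(np.cos(np.pi * r / 2.0))

def decode(predict_tokens, num_tokens: int = 256, T: int = 8) -> np.ndarray:
    # predict_tokens(tokens) stands in for the transformer: one forward pass
    # returning (predicted_ids, confidences) for every position.
    tokens = np.full(num_tokens, MASK, dtype=np.int64)  # start fully masked
    for t in range(1, T + 1):
        pred_ids, conf = predict_tokens(tokens)
        conf = np.where(tokens == MASK, conf, np.inf)   # never re-mask fixed tokens
        tokens = np.where(tokens == MASK, pred_ids, tokens)
        # The number of tokens to re-mask shrinks along the schedule; 0 at the end.
        num_to_mask = int(np.ceil(cosine_schedule(t / T) * num_tokens)) if t < T else 0
        if num_to_mask > 0:
            tokens[np.argsort(conf)[:num_to_mask]] = MASK  # drop least confident
    return tokens

# Dummy stand-in model: random ids from a 1024-entry codebook, random confidences.
rng = np.random.default_rng(0)
dummy = lambda toks: (rng.integers(0, 1024, toks.shape), rng.uniform(size=toks.shape))
image_tokens = decode(dummy)
```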

Masking Tokens

An intriguing aspect of MaskGIT's process is its method for deciding which tokens to mask in each iteration. This is given by the formula:
$$
m_i^{(t+1)} =
\begin{cases}
1, & \text{if } c_i < \text{threshold} \\
0, & \text{otherwise}
\end{cases}
$$
Formula 1: Mask generation function
  • m_i^{(t+1)}: whether the i-th token will be masked in the next iteration (t+1).
  • c_i: the confidence score for the i-th token in the current iteration, reflecting how certain the model is about its prediction for this token.
  • If the confidence score falls below the threshold, the token is masked again in the next iteration; if it is higher, the token remains unmasked (see the code sketch below).
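
A minimal NumPy translation of Formula 1, assuming the threshold is the k-th lowest confidence score, with k set by the mask schedule (`next_mask` is a hypothetical helper name):

```python
import numpy as np

def next_mask(confidences: np.ndarray, num_to_mask: int) -> np.ndarray:
    # m_i = 1 (True) for the num_to_mask tokens with the lowest confidence;
    # every token at or above the threshold stays unmasked.
    mask = np.zeros(confidences.shape, dtype=bool)
    if num_to_mask > 0:
        mask[np.argsort(confidences)[:num_to_mask]] = True
    return mask

conf = np.array([0.9, 0.2, 0.6, 0.1])
print(next_mask(conf, 2))  # -> [False  True False  True]
```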

Quantitative Comparison of MaskGIT with state-of-the-art generative models

Quantitative Comparison
Table 1: Quantitative comparison with state-of-the-art generative models on ImageNet 256×256 and 512×512.
The above table provides a detailed comparison of MaskGIT's performance against other models.
  • Fréchet Inception Distance (FID): MaskGIT achieves a lower FID than competing generative models; against VQGAN (6.16 vs. 15.78), the lower FID indicates better image quality.
  • Inception Score (IS): MaskGIT also leads in IS, scoring 182.1 against VQGAN's 78.3, indicating better diversity and clarity in the generated images.
  • Speed Comparison: The table also illustrates that MaskGIT requires far fewer steps (forward passes) to generate an image, accelerating VQGAN by 30-64x.
  • Classification Accuracy Score (CAS): MaskGIT sets a new benchmark on the ImageNet dataset for both 256x256 and 512x512 resolutions.
  • Precision and Recall: MaskGIT shows improved recall compared to BigGAN, indicating broader coverage in image generation.

Applications of MaskGIT

Applications of MaskGIT
Figure 5: Applications of MaskGIT
  • Synthesis (Left Column): Demonstrating MaskGIT's ability to generate detailed images from specific categories.
  • Class-Conditional Image Manipulation (Middle Column): Here, MaskGIT exhibits its skill in altering existing images, replacing objects within a bounding box with alternatives from chosen classes.
  • Image Extrapolation (Right Column): The model extends images beyond their original boundaries, adding coherent and contextually fitting content.

Ablation Study: Choices of Mask Scheduling Functions

Choices of Mask Scheduling Functions
Figure 6: Choices of Mask Scheduling Functions
The figure shows seven different functions considered for the mask scheduling function (γ). These functions are visually represented, providing an intuitive understanding of how each function determines the mask ratio during the model's training and decoding phases. Among these, the cosine function slightly outperforms others, making it the preferred choice in MaskGIT.
An observation is that more iterations do not always lead to better performance. Each function reaches a “sweet spot” in performance before declining again. Notably, the cosine function not only scores the best overall but also reaches its peak earlier, between 8 and 12 iterations, compared to the other functions.
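
For intuition, here is a small sketch of representative schedule shapes; the exact candidate set in the paper's figure may differ, but all are decreasing functions with γ(0) = 1 and γ(1) = 0:

```python
import numpy as np

# Representative mask-scheduling functions gamma(r) -> mask ratio (assumed forms).
schedules = {
    "linear": lambda r: 1.0 - r,
    "square": lambda r: 1.0 - r ** 2,            # concave
    "sqrt":   lambda r: 1.0 - np.sqrt(r),        # convex
    "cosine": lambda r: np.cos(np.pi * r / 2.0), # concave; best in the paper
}

for name, gamma in schedules.items():
    print(f"{name:>6}:", [round(float(gamma(t / 8)), 2) for t in range(9)])
```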

Social Impact

Positive Societal Impacts:

  • MaskGIT could have wide-ranging positive impacts, enhancing online experiences with personalized content, fostering creative expression, and advancing healthcare with more accurate diagnostics.
  • It can also benefit education and cultural heritage preservation, making digital content creation, creativity, and cultural engagement more accessible and meaningful in people's everyday lives.

Negative Societal Impacts:

  • The model has the potential for negative societal impacts, including the creation of deepfake content that spreads misinformation and threatens privacy; security vulnerabilities through the generation of fake biometric data; invasions of privacy through enhanced surveillance; manipulation of online content; and ethical and legal challenges around responsible use and regulation.
  • The psychological and mental health impacts of highly convincing fake content can lead to confusion and distress.

Advice for policymakers:

  • Develop ethical frameworks that outline the responsible use of these technologies.
  • Promote education, awareness and transparency about the capabilities and use of these technologies.
  • Enforce intellectual property rights and cybersecurity measures more strictly.

Industry Applications

Application of MaskGIT in Photoshop for Enhanced Image Editing

Application of MaskGIT in Photoshop for Enhanced Image Editing
Figure 7: Application of MaskGIT in Photoshop
  • MaskGIT could be integrated into Photoshop as a feature for automated image extrapolation. This would allow users to extend the borders of an image seamlessly, generating new content that matches the style and context of the original image.
  • With its class-conditional manipulation capabilities, MaskGIT could be used to intelligently replace or alter objects within an image. For instance, users could select an object and have MaskGIT replace it with a different object of the same class, maintaining the overall coherence of the image.
  • MaskGIT could enhance Photoshop's existing inpainting tools by providing more context-aware and coherent fill options. This would be particularly useful for restoring old photos or removing unwanted elements from images.
  • MaskGIT's ability to generate high-quality, high-resolution images could be utilized in Photoshop for tasks that require enlarging images without losing detail or for creating large-format designs.

Follow-on Research

Advanced Tokenization Techniques for Improved Image Synthesis

Advanced Tokenization Techniques for Improved Image Synthesis
Figure 8: Advanced Tokenization Technique using Vision Transformer (ViT)
  • The authors mention that they employ the same tokenization technique as VQGAN. VQGAN applies vector quantization to the encoder's output, mapping the continuous-valued image representation onto a fixed number of discrete codebook vectors (see the sketch after this list).
  • They also note that potential improvements to the tokenization step are left to future work.
  • Reference paper: Yu, Jiahui, et al. "Vector-quantized image modeling with improved VQGAN." arXiv preprint arXiv:2110.04627 (2021).
  • The authors of the above paper take this approach one step further by replacing both the CNN encoder and decoder with a Vision Transformer (ViT). In addition, they introduce a linear projection from the output of the encoder to a low-dimensional latent variable space for lookup of the integer tokens.
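
As a rough illustration of the quantization step both papers share, the sketch below maps continuous encoder outputs to discrete token ids via a nearest-neighbor codebook lookup; the shapes and names are assumptions for illustration, not either paper's code.

```python
import numpy as np

def quantize(latents: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    # latents: (N, D) continuous encoder outputs; codebook: (K, D) entries.
    # Each latent vector is replaced by the index of its nearest codebook entry.
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)  # (N,) integer visual tokens

rng = np.random.default_rng(0)
tokens = quantize(rng.normal(size=(256, 32)), rng.normal(size=(1024, 32)))
```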

Peer-Review - Rakshak

The paper introduces MaskGIT, a bidirectional transformer model for image synthesis. The model excels in various tasks, including class-conditional image synthesis, manipulation, and extrapolation. It showcases significant improvements over existing methods in terms of quality, speed, and diversity of generated images. The model's iterative decoding method enhances efficiency, requiring fewer steps compared to traditional autoregressive methods, and its bidirectional self-attention mechanism allows for richer context utilization.

Overall Score: 7.5/10 (Strong Accept)

Strengths

  • MaskGIT's parallel decoding approach is a groundbreaking improvement over traditional sequential methods, significantly accelerating the image generation process.
  • The paper presents a thorough analysis of the output quality and diversity, substantiated by metrics like FID and IS.
  • The paper explores the limitations of the proposed model, especially in complex or edge-case scenarios, which is crucial for a balanced understanding.

Weaknesses

  • The paper does not discuss the model's potential negative societal impacts, and it could provide more detail on how such impacts might be mitigated.
  • The discussion on the scalability of the model and its ability to generalize across various datasets is not adequately covered.

Peer-Review - Deepak

Overall Score: 8/10 (Strong Accept)

Strengths

  • The paper introduces a novel approach to image synthesis, particularly in its use of bidirectional transformers for MVTM and its efficient parallel decoding technique.
  • Technically sound with well-supported claims. The experiments demonstrate MaskGIT's superiority over existing models in terms of image quality and synthesis speed.
  • The paper is clearly written and well-organized, particularly in explaining complex concepts like MVTM and iterative decoding.
  • Demonstrated proficiency in a range of tasks (class-conditional synthesis, image manipulation, and extrapolation) indicates the model's versatility and wide applicability.

Weaknesses

  • There's a need for a more comprehensive exploration of the model's limitations and potential areas for improvement.
  • The impact of MaskGIT in practical applications beyond the scope of the experiments could be more deeply explored to underline its significance.

References

[1] Vaswani, Ashish, et al. "Attention is all you need." Advances in Neural Information Processing Systems 30 (2017).

[2] Devlin, Jacob, et al. "BERT: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805 (2018).

[3] Ghazvininejad, Marjan, et al. "Mask-Predict: Parallel decoding of conditional masked language models." arXiv preprint arXiv:1904.09324 (2019).

[4] Esser, Patrick, Robin Rombach, and Bjorn Ommer. "Taming transformers for high-resolution image synthesis." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2021).

[5] van den Oord, Aaron, Oriol Vinyals, and Koray Kavukcuoglu. "Neural discrete representation learning." NeurIPS (2017).

[6] Razavi, Ali, Aaron van den Oord, and Oriol Vinyals. "Generating diverse high-fidelity images with VQ-VAE-2." NeurIPS (2019).

[7] Yu, Jiahui, et al. "Vector-quantized image modeling with improved VQGAN." arXiv preprint arXiv:2110.04627 (2021).

Team Members

  1. Rakshak Kunchum

  2. Deepak Udayakumar