An Analysis of MaskGIT: Masked Generative Image Transformer
MaskGIT: Where pixels play hide and seek, and the bidirectional decoder always wins the game!
Figure 1: Bidirectional sampling
The significance of this paper lies in its approach to image synthesis, which departs from the conventional token-by-token sequential generation of images. Instead,
it introduces a bidirectional transformer decoder called MaskGIT, which, during training, learns to predict randomly masked tokens by attending to information from all directions.
This approach promises not only more efficient image synthesis but also significantly improved image quality.
The paper demonstrates that MaskGIT outperforms the state-of-the-art transformer models on the ImageNet dataset. It also accelerates autoregressive decoding by up to 64x:
at each decoding step, all tokens in the image are predicted simultaneously in parallel, which is possible because Masked Visual Token Modeling (MVTM) training uses
bidirectional self-attention. Moreover, the paper showcases the versatility of MaskGIT by demonstrating its application to various image editing tasks, including
inpainting, extrapolation, and image manipulation.
Biography
Huiwen Chang (Google Research): Computer science researcher focused on image processing; currently at OpenAI and previously worked at Google, Adobe, and Facebook. Ph.D. from Princeton University.
Han Zhang (Google Research): Research Scientist at Google DeepMind with a Ph.D. in Computer Science from Rutgers University. Internships at Google Brain, OpenAI, Facebook, Philips Research North America, and the Lab of Media Search at NUS, Singapore. Research interests include computer vision, deep learning, and medical image analysis.
Lu Jiang (Google Research): Research scientist and manager at Google Research and adjunct faculty member at Carnegie Mellon University. Known for significant contributions to natural language processing and computer vision. Chair of NeurIPS 2023.
Ce Liu (Microsoft Azure AI): Chief Architect at Microsoft, formerly at Google. IEEE Fellow. Ph.D. from MIT.
William T. Freeman (MIT): Professor of electrical engineering and computer science at MIT. Research interests include machine learning applied to computer vision. Degree from Stanford University in 1979 and Ph.D. from MIT in 1992.
Literature Review/Previous Work
Figure 2: Transformer Architecture
Transformers: A novel, simple network architecture based solely on an attention mechanism. Inspired by the success of the Transformer and GPT in NLP, generative transformer
models have received growing interest in image synthesis. Reference paper: Vaswani, Ashish, et al. "Attention is all you need." Advances in Neural Information Processing Systems 30 (2017).
BERT (Bidirectional Encoder Representations from Transformers): BERT was designed to pre-train deep bidirectional representations from unlabeled
text by jointly conditioning on both left and right context in all layers. The masked modeling objective in BERT was later extended to image representation learning, with images
quantized into discrete tokens. Reference paper: Devlin, Jacob, et al. "BERT: Pre-training of deep bidirectional
transformers for language understanding." arXiv preprint arXiv:1810.04805 (2018).
Mask-Predict: A masked language modeling objective used to train a model to predict any subset of the target words, conditioned on both the input text and a partially masked target
translation, which enables efficient iterative decoding. The MaskGIT authors were inspired by this bidirectional machine translation work in NLP and explain that their novelty
lies in the proposed masking strategy and decoding algorithm. Reference paper: Ghazvininejad, Marjan, et al. "Mask-Predict: Parallel decoding of
conditional masked language models." arXiv preprint arXiv:1904.09324 (2019).
VQGAN (Vector Quantized GAN): The paper proposes a convolutional VQGAN to combine the efficiency of convolutional approaches with the expressive power of transformers, and to combine
adversarial and likelihood training in a perceptually meaningful way. The VQGAN learns a codebook of context-rich visual parts, whose composition is then modeled with an autoregressive
transformer. Reference paper: Esser, Patrick, Robin Rombach, and Björn Ommer.
"Taming transformers for high-resolution image synthesis." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2021).
Methodology
This paper introduces a new bidirectional transformer for image synthesis called the Masked Generative Image Transformer (MaskGIT). During training, MaskGIT is trained on
a proxy task similar to the mask prediction in BERT. At inference time, MaskGIT adopts a novel non-autoregressive decoding method to synthesize an image in a constant number of steps.
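What one such masked-token training step could look like is sketched below (the transformer interface, mask token id, and PyTorch usage are illustrative assumptions, not the paper's actual training code):

```python
import math
import torch
import torch.nn.functional as F

MASK_ID = 1024  # hypothetical id reserved for the [MASK] token

def mvtm_training_step(transformer, tokens):
    """One Masked Visual Token Modeling step: mask a random subset of visual tokens
    and train the bidirectional transformer to predict them."""
    batch, seq_len = tokens.shape

    # Sample a mask ratio from the cosine schedule gamma(r), with r ~ U(0, 1].
    r = torch.rand(1).item()
    num_masked = max(1, int(math.cos(math.pi / 2.0 * r) * seq_len))

    # Randomly choose which positions to mask.
    ranks = torch.rand(batch, seq_len).argsort(dim=1).argsort(dim=1)
    masked = ranks < num_masked
    inputs = torch.where(masked, torch.full_like(tokens, MASK_ID), tokens)

    # Bidirectional prediction; the loss is applied only on the masked positions.
    logits = transformer(inputs)                       # (batch, seq_len, vocab_size)
    return F.cross_entropy(logits[masked], tokens[masked])
```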
Sequential decoding vs MaskGIT's scheduled parallel decoding
Figure 3: Sequential vs Parallel Decoding
Latent Masks (Rows 1 and 3): These rows show the latent masks used by the model. Initially, all codes are unknown (lighter gray), representing areas of the image yet to be filled.
Samples (Rows 2 and 4): Here, we see the images as they are progressively constructed. MaskGIT fills in the latent representation with more and more scattered predictions
in parallel (darker gray areas), as opposed to the sequential, line-by-line approach.
A key takeaway is MaskGIT's speed: it completes decoding in just 8 iterations, whereas the sequential method takes 256 rounds. This efficiency comes from its non-autoregressive decoding,
in which, at each iteration, the model predicts all tokens simultaneously in parallel but keeps only the most confident ones. The remaining tokens are masked out and re-predicted in
the next iteration.
Unlike the sequential approach, MaskGIT uses bidirectional self-attention, enabling it to generate new tokens from all directions. This not only speeds up the process but also enhances the
quality of image generation.
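A simplified sketch of this confidence-based parallel decoding loop is given below (greedy sampling is used for brevity; class conditioning, temperature, and the sampling noise used in the paper are omitted, and the transformer interface and mask token id are assumptions):

```python
import math
import torch

MASK_ID = 1024  # hypothetical [MASK] token id

@torch.no_grad()
def maskgit_decode(transformer, num_tokens=256, steps=8):
    """Non-autoregressive decoding: start fully masked, keep the most confident
    predictions at each step, and re-predict the rest."""
    tokens = torch.full((1, num_tokens), MASK_ID, dtype=torch.long)
    for t in range(steps):
        probs = transformer(tokens).softmax(dim=-1)        # (1, N, vocab_size)
        sampled = probs.argmax(dim=-1)                      # greedy choice per position
        confidence = probs.max(dim=-1).values
        # Tokens fixed in earlier iterations keep maximal confidence, so they stay.
        confidence = torch.where(tokens == MASK_ID, confidence, torch.ones_like(confidence))

        # Cosine schedule: how many tokens remain masked after this iteration.
        num_masked = math.floor(math.cos(math.pi / 2.0 * (t + 1) / steps) * num_tokens)
        if num_masked == 0:
            return torch.where(tokens == MASK_ID, sampled, tokens)

        # Keep the most confident predictions; re-mask the rest for the next iteration.
        threshold = confidence.sort(dim=-1).values[:, num_masked]
        keep = confidence >= threshold
        filled = torch.where(tokens == MASK_ID, sampled, tokens)
        tokens = torch.where(keep, filled, torch.full_like(tokens, MASK_ID))
    return tokens
```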
Architecture/Pipeline of MaskGIT: Masked Generative Image Transformer
Figure 4: MaskGIT Architecture Pipeline
MaskGIT follows a two-stage design:
Tokenization: The first stage uses a tokenizer, similar to the one in the VQGAN model, that converts an image into a grid of visual tokens, essentially breaking
down the image into smaller, manageable pieces for processing.
Masked Visual Token Modeling (MVTM): In the second stage, a bidirectional transformer predicts and refines these tokens in parallel, using a novel approach that iteratively masks and
re-predicts parts of the image to generate the final image efficiently.
The decoding algorithm synthesizes an image in T steps. At each iteration, the model predicts all tokens simultaneously but keeps only the most confident ones; the remaining tokens are
masked out and re-predicted in the next iteration. Various functions (linear, concave, convex) are considered for mask scheduling, with empirical results showing the cosine function to be
the most effective.
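The mask scheduling function γ maps decoding progress r = t/T to a mask ratio. A few representative choices might look like this (a sketch of the three families, not the paper's full list of candidates):

```python
import math

def gamma_linear(r): return 1.0 - r                       # linear
def gamma_cosine(r): return math.cos(math.pi / 2.0 * r)   # concave; the schedule MaskGIT uses
def gamma_square(r): return 1.0 - r ** 2                  # concave
def gamma_sqrt(r):   return 1.0 - math.sqrt(r)            # convex

# Every schedule starts at gamma(0) = 1 (all tokens masked) and ends at gamma(1) = 0.
```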
Masking Tokens
An intriguing aspect of MaskGIT's process is its method for deciding which tokens to mask in each iteration. This is given by the formula:
Formula 1: Mask generation function

m_i^(t+1) = 1 if c_i < sorted_j(c_j)[n], and 0 otherwise, where n = ⌈γ(t/T)·N⌉

m_i^(t+1): Whether the i-th token is masked in the next iteration (t+1); a value of 1 means the token is re-masked.
c_i: The confidence score for the i-th token in the current iteration, reflecting how certain the model is about its prediction for this token.
n = ⌈γ(t/T)·N⌉: The number of tokens to mask, given by the mask scheduling function γ evaluated at the decoding progress t/T and scaled by the total number of tokens N.
If a token's confidence score is lower than the threshold (the confidence of the n-th least certain token), it gets masked in the next iteration; if it is higher, the token remains unmasked.
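Putting the formula into code, a minimal sketch of the re-masking decision could look like this (the cosine schedule is hard-coded and the plain-Python implementation is an illustrative assumption):

```python
import math

def next_mask(confidences, t, T):
    """Return a boolean list: True means the token is re-masked at iteration t+1.

    confidences: per-token confidence scores c_i from the current prediction.
    The n = ceil(gamma((t+1)/T) * N) least-confident tokens are masked again.
    """
    N = len(confidences)
    ratio = math.cos(math.pi / 2.0 * (t + 1) / T)           # cosine schedule gamma
    n = math.ceil(ratio * N) if (t + 1) < T else 0           # nothing left to mask at the end
    if n <= 0:
        return [False] * N
    threshold = sorted(confidences)[n - 1]                   # confidence of the n-th least certain token
    return [c <= threshold for c in confidences]
```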
Quantitative Comparison of MaskGIT with state-of-the-art generative models
Table 1: Quantitative comparison with state-of-the-art generative models on ImageNet 256×256 and 512×512.
The above table provides a detailed comparison of MaskGIT's performance against other models.
Fréchet Inception Distance (FID): MaskGIT shows a lower FID than competing generative models; its lower FID against VQGAN (6.16 vs. 15.78) indicates better image quality.
Inception Score (IS): MaskGIT also leads in IS, scoring 182.1 against VQGAN's 78.3, indicating better diversity and clarity in the generated images.
Speed Comparison: The table also illustrates that MaskGIT requires far fewer decoding steps (forward passes) to generate an image, accelerating VQGAN by 30-64x.
Classification Accuracy Score (CAS): MaskGIT sets a new benchmark on the ImageNet dataset for both 256×256 and 512×512 resolutions.
Precision and Recall: MaskGIT shows improved recall compared to BigGAN, indicating broader coverage in image generation.
Applications of MaskGIT:
Figure 5: Applications of MaskGIT
Synthesis (Left Column): Demonstrating MaskGIT's ability to generate detailed images from specific categories.
Class-Conditional Image Manipulation (Middle Column): Here, MaskGIT exhibits its skill in altering existing images, replacing objects within a bounding box with alternatives from chosen
classes.
Image Extrapolation (Right Column): The model extends images beyond their original boundaries, adding coherent and contextually fitting content.
Ablation Study: Choices of Mask Scheduling Functions
Figure 6: Choices of Mask Scheduling Functions
The figure shows seven different functions considered for the mask scheduling function (γ). These functions are visually represented, providing an intuitive understanding of how each function
determines the mask ratio during the model's training and decoding phases. Among these, the cosine function slightly outperforms others, making it the preferred choice in MaskGIT.
An observation is that more iterations do not always lead to better performance. Each function reaches a "sweet spot" in performance before declining again. Notably, the cosine function
not only scores the best overall but also reaches its peak earlier, between 8 and 12 iterations, than the other functions.
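As a toy illustration of the cosine schedule at T = 8 on a 16×16 token grid (N = 256), the number of tokens still masked after each iteration can be computed as follows (these values come from the schedule itself, not from measurements in the paper):

```python
import math

N, T = 256, 8
for t in range(1, T + 1):
    remaining = math.floor(math.cos(math.pi / 2.0 * t / T) * N)
    print(f"after iteration {t}: {remaining} tokens still masked")
# The concave schedule unmasks only a few tokens early, when predictions are least
# reliable, and many tokens late, finishing with 0 masked after iteration 8.
```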
Social Impact
Positive Societal Impacts:
MaskGIT could have wide-ranging positive societal impacts, enhancing online experiences with personalized content, fostering creative expression,
and advancing healthcare with more accurate diagnostics.
It can also benefit education and cultural heritage preservation, making digital content, creativity, healthcare, and cultural
engagement more accessible and meaningful to people in their everyday lives.
Negative Societal Impacts:
This model has the potential for negative societal impacts, including the creation of deepfake content that spreads misinformation and threatens privacy,
security vulnerabilities through the generation of fake biometric data, invasions of privacy through enhanced surveillance, manipulation of online content, and ethical and legal
challenges regarding responsible use and regulation.
Highly convincing fake content can also take a psychological and mental-health toll, leading to confusion and distress.
Advice for policy makers:
Develop ethical frameworks that outline the responsible use of these technologies.
Promote education, awareness and transparency about the capabilities and use of these technologies.
Enforce intellectual property rights and cybersecurity measures more strictly.
Industry Applications
Application of MaskGIT in Photoshop for Enhanced Image Editing
Figure 7: Application of MaskGIT in Photoshop
MaskGIT could be integrated into Photoshop as a feature for automated image extrapolation. This would allow users to extend the borders of an image seamlessly, generating new
content that matches the style and context of the original image.
With its class-conditional manipulation capabilities, MaskGIT could be used to intelligently replace or alter objects within an image. For instance, users could select an object and
have MaskGIT replace it with a different object of the same class, maintaining the overall coherence of the image.
MaskGIT could enhance Photoshop's existing inpainting tools by providing more context-aware and coherent fill options. This would be particularly useful for restoring old photos or
removing unwanted elements from images.
MaskGIT's ability to generate high-quality, high-resolution images could be utilized in Photoshop for tasks that require enlarging images without losing detail or for creating large-format
designs.
Follow-on Research
Advanced Tokenization Techniques for Improved Image Synthesis
Figure 8: Advanced Tokenization Technique using Vision Transformer (ViT)
The authors mention that they employ the same tokenization technique as VQGAN. VQGAN applies vector quantization to the output of its convolutional encoder, mapping the continuous-valued
feature map to a fixed number of discrete codebook vectors.
They also note that potential improvements to the tokenization step are left to future work.
The authors of the above paper suggest taking this approach one step further by replacing both the CNN encoder and decoder with a Vision Transformer (ViT). In addition, they introduce a
linear projection from the output of the encoder to a low-dimensional latent variable space for looking up the integer tokens.
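A rough sketch of that token-lookup idea under the proposed ViT tokenizer is shown below (the projection dimension, codebook size, and cosine-similarity lookup are illustrative assumptions based on the description above, not the follow-up paper's implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LowDimCodebookLookup(nn.Module):
    """Project high-dimensional ViT encoder outputs into a small latent space
    and look up the nearest codebook entry there to obtain integer tokens."""
    def __init__(self, encoder_dim=768, code_dim=32, codebook_size=8192):
        super().__init__()
        self.proj = nn.Linear(encoder_dim, code_dim)        # linear projection to low-dim space
        self.codebook = nn.Embedding(codebook_size, code_dim)

    def forward(self, encoder_output):                       # (batch, num_patches, encoder_dim)
        z = F.normalize(self.proj(encoder_output), dim=-1)   # unit-norm latent vectors
        codes = F.normalize(self.codebook.weight, dim=-1)
        # Nearest codebook entry by cosine similarity gives the integer visual token.
        return torch.einsum("bnd,kd->bnk", z, codes).argmax(dim=-1)
```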
Peer-Review - Rakshak
The paper introduces MaskGIT, a bidirectional transformer model for image synthesis. The model excels in various tasks, including class-conditional image synthesis, manipulation, and
extrapolation. It showcases significant improvements over existing methods in terms of quality, speed, and diversity of generated images. The model's iterative decoding method enhances
efficiency, requiring fewer steps compared to traditional autoregressive methods, and its bidirectional self-attention mechanism allows for richer context utilization.
Overall Score: 7.5/10 (Strong Accept)
Strengths
MaskGIT's parallel decoding approach is a groundbreaking improvement over traditional sequential methods, significantly accelerating the image generation process.
The paper presents a thorough analysis of the output quality and diversity, substantiated by metrics like FID and IS.
The paper explores the limitations of the proposed model, especially in complex or edge-case scenarios, which is crucial for a balanced understanding.
Weaknesses
The paper does not discuss the model's potential negative societal impacts, and it could provide more detail on how such impacts might be mitigated.
The scalability of the model and its ability to generalize across various datasets are not adequately covered.
Peer-Review - Deepak
Overall Score: 8/10 (Strong Accept)
Strengths
The paper introduces a novel approach to image synthesis, particularly in its use of bidirectional transformers for MVTM and its efficient parallel decoding technique.
Technically sound with well-supported claims. The experiments demonstrate MaskGIT's superiority over existing models in terms of image quality and synthesis speed.
The paper is clearly written and well-organized, particularly in explaining complex concepts like MVTM and iterative decoding.
Demonstrated proficiency in a range of tasks (class-conditional synthesis, image manipulation, and extrapolation) indicates the model's versatility and wide applicability.
Weaknesses
There's a need for a more comprehensive exploration of the model's limitations and potential areas for improvement.
The impact of MaskGIT in practical applications beyond the scope of the experiments could be more deeply explored to underline its significance.