An Image is Worth 16x16 Words (for CS 7150)

An Analysis of AN IMAGE IS WORTH 16X16 WORDS

Figure 1 (source: Google AI blog)

Literature Review

The main motivation behind ViT was the success of Transformers on NLP tasks. The dominant approach of pre-training a model on a large corpus and fine-tuning it on small task-specific datasets (BERT, GPT) was limited to textual data: directly applying self-attention to pixels is computationally intractable because the input size (the number of pixels) is very large. As a result, the attention mechanism was usually considered only in conjunction with convolutional layers.

Some attempts had been made to make the self-attention mechanism work for images:

  • Restricting global self-attention to local neighborhoods of pixels:
    • Image Transformer (Parmar 2018) - "By restricting the self-attention mechanism to attend to local neighborhoods we significantly increase the size of images the model can process in practice, despite maintaining significantly larger receptive fields per layer than typical convolutional neural networks."
    • Stand-Alone Self-Attention in Vision Models (Ramachandran 2019) - "we verify that self-attention can indeed be an effective stand-alone layer. A simple procedure of replacing all instances of spatial convolutions with a form of self-attention applied to ResNet model produces a fully self-attentional model that outperforms the baseline on ImageNet classification with 12% fewer FLOPS and 29% fewer parameters."
    • Exploring Self-attention for Image Recognition (Zhao 2020) - "Our pairwise self-attention networks match or outperform their convolutional counterparts, and the patchwise models substantially outperform the convolutional baselines."
  • Approximating global self-attention, for example by applying attention only along individual rows/columns of the image:
    • Axial Attention in Multidimensional Transformers (Ho 2019) - "Our architecture, by contrast, maintains both full expressiveness over joint distributions over data and ease of implementation with standard deep learning frameworks, while requiring reasonable memory and computation and achieving state-of-the-art results on standard generative modeling benchmarks."

Many of these specialized attention architectures demonstrated promising results on computer vision tasks, but required complex engineering to be implemented efficiently on hardware accelerators.

The most significant and most similar prior work to ViT was On the Relationship between Self-Attention and Convolutional Layers (Cordonnier 2020). Its key finding: "we prove that a multi-head self-attention layer with sufficient number of heads is at least as expressive as any convolutional layer". However, that paper did not evaluate the method against state-of-the-art models and was limited to 2x2 image patches, which hindered scalability and restricted it to small-resolution images.

The authors built on top of their prior work on CNN transfer learning, Big Transfer (BiT) (Kolesnikov 2020; Djolonga 2020), comparing the Transformer architecture against the approach already established for ResNets.

Biography

This study is part of a sequence of studies that this research group at Google Research, Zurich has been conducting for several years. The authors have backgrounds in computer science, neuroscience, physics and biology and work on deep learning architectures for computer vision tasks. They like to scale networks in both directions: creating smaller, more efficient networks as well as large networks with state-of-the-art performance. Some members have since moved to different research groups but have collaborated frequently in the past.

| Name | Affiliation | Contribution |
|---|---|---|
| Alexey Dosovitskiy | Google Research | Ex-Intel; left Google for a year to set up Inceptive with Jakob Uszkoreit and recently returned. PhD in Mathematics from Lomonosov Moscow State University. 66,095 citations. |
| Lucas Beyer | Google Research | Co-author, involved in the exploration of and experimentation with Vision Transformers. French-German national from Belgium with a PhD from RWTH Aachen. 37,236 citations. |
| Alexander Kolesnikov | Google Research | Previously a PhD at IST Austria and an applied mathematics MSc at Moscow State University. 38,545 citations. |
| Dirk Weissenborn | Inceptive Inc. | Based in Germany; formerly at Google, Meta and DeepMind. 28,727 citations. |
| Xiaohua Zhai | Google Research | Senior Staff Researcher based in Zurich. PhD from Peking University, China. Interested in vision, representation learning and generative modelling. 35,628 citations. |
| Thomas Unterthiner | Google Research | ML Engineer, PhD from Johannes Kepler Universität Linz. Research interests include computational biology, understanding deep networks, activation functions and evaluation metrics. 51,478 citations. |
| Mostafa Dehghani | Google Research | Research Scientist, ex-Apple. PhD from the University of Amsterdam. 33,777 citations. |
| Matthias Minderer | Google Research | PhD from Harvard under Christopher Harvey; previously studied neuroscience at ETH Zurich and biochemistry at the University of Cambridge. Interested in representation learning for vision tasks. 27,314 citations. |
| Georg Heigold | Google Research | Diploma in Physics from ETH Zurich; ex-Apple. Research interests include automatic speech recognition, discriminative training and log-linear modelling. 33,475 citations. |
| Sylvain Gelly | Independent Researcher / Google Research | Deep learning researcher based in Zurich. Works on reinforcement learning and dynamic programming. 41,451 citations. |
| Jakob Uszkoreit | Co-founder @ Inceptive | One of the authors of the original Transformer paper and the most cited author on this panel. Ex-Googler. 145,054 citations. |
| Neil Houlsby | Google Research | PhD in Computational Biology from the University of Cambridge. Research interests include Bayesian ML, cognitive science and active learning. 37,994 citations. |

Novel Work by the Paper (Diagrammer)

Transformers had recently been proposed by Vaswani et al. [2] and had become the state of the art for language tasks. The authors of this paper wanted a solution that stayed as close as possible to the original Transformer, so that the techniques already developed for scaling language models could be reused out of the box. They therefore kept the same architecture setup as in [2]. However, since the original Transformer expects a 1-D sequence of tokens, the authors had to find a way to convert a 2-D image into a 1-D sequence of information.
To do this, an input image of height H and width W is reshaped into N patches of size P x P, where N = HW / P². Each flattened patch is linearly projected, and the outputs are called patch embeddings. Similarly to BERT's [class] token in the language setting, a learnable embedding is prepended to the sequence of patch embeddings; its state at the output of the encoder serves as the image representation.
The Transformer is well suited to language tasks because text is inherently sequential and the architecture encodes the position of each token via position embeddings, which are key to its performance. To mimic this characteristic in vision, position embeddings are added to the patch embeddings so that positional information about the patches is retained. The resulting sequence of embedding vectors is the input to the Transformer encoder.
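The whole input pipeline fits in a few lines of code. Below is a minimal sketch in PyTorch, not the authors' implementation; the hyperparameters (img_size, patch_size, embed_dim) are illustrative defaults matching ViT-Base.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into P x P patches, linearly project each patch,
    prepend a learnable [class] token and add learnable position embeddings."""
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2            # N = HW / P^2
        # A Conv2d with kernel = stride = P is equivalent to extracting
        # non-overlapping patches and applying a shared linear projection.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, embed_dim))

    def forward(self, x):                                # x: (B, C, H, W)
        x = self.proj(x)                                 # (B, D, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)                 # (B, N, D) patch embeddings
        cls = self.cls_token.expand(x.shape[0], -1, -1)  # (B, 1, D)
        x = torch.cat([cls, x], dim=1)                   # (B, N+1, D)
        return x + self.pos_embed                        # add position embeddings

# Example: a batch of two 224x224 RGB images becomes a sequence of 197 tokens.
tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 197, 768])
```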

The Vision Transformer architecture

Figure: Vision Transformer structure
Until this point, CNN-based architectures achieved state-of-the-art performance on vision tasks thanks to their built-in neighbourhood structure and translation equivariance; however, as networks grew deeper they were prone to information loss, which residual connections mitigated only to some extent. In the Transformer block, the MLP layers are local and translationally equivariant, while the self-attention layers capture global context. This ability to capture local as well as global context makes the Transformer architecture a good fit for vision. The encoder consists of alternating layers of multi-head self-attention (MSA) and MLP blocks, with LayerNorm applied before every block and residual connections after every block. Mathematically, the layers can be described as shown in Equation 1.
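With $\mathbf{z}_0$ denoting the input sequence and $\mathbf{z}_\ell$ the token sequence after layer $\ell$, the layers as given in the paper are:

$$\mathbf{z}_0 = [\mathbf{x}_{\text{class}};\, \mathbf{x}_p^1\mathbf{E};\, \mathbf{x}_p^2\mathbf{E};\, \ldots;\, \mathbf{x}_p^N\mathbf{E}] + \mathbf{E}_{pos}$$
$$\mathbf{z}'_\ell = \mathrm{MSA}(\mathrm{LN}(\mathbf{z}_{\ell-1})) + \mathbf{z}_{\ell-1}, \qquad \ell = 1, \ldots, L$$
$$\mathbf{z}_\ell = \mathrm{MLP}(\mathrm{LN}(\mathbf{z}'_\ell)) + \mathbf{z}'_\ell, \qquad \ell = 1, \ldots, L$$
$$\mathbf{y} = \mathrm{LN}(\mathbf{z}_L^0)$$

where $\mathbf{E}$ is the patch projection, $\mathbf{E}_{pos}$ the position embeddings, and $\mathbf{z}_L^0$ the final state of the [class] token.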

Equation 1: Mathematical expression for the Transformer encoder layers

Because the Transformer block contains no convolutional layers that would impose a neighbourhood structure, the model must instead learn the 2-D positions of the patches and their spatial relations from data, giving it much less image-specific inductive bias than a CNN. Furthermore, the authors propose an alternative way to generate the patch embeddings: a hybrid architecture in which the patches are extracted from a CNN feature map instead of the raw image.
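A minimal sketch of this hybrid variant is shown below, assuming torchvision's resnet50 as the backbone (the paper uses a modified BiT-style ResNet, so this is only illustrative): each spatial position of the feature map is treated as a 1x1 "patch" and projected to the Transformer dimension.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class HybridPatchEmbedding(nn.Module):
    """Hybrid variant: tokens come from a CNN feature map instead of raw pixel patches."""
    def __init__(self, embed_dim=768):
        super().__init__()
        backbone = resnet50()
        # Keep everything up to (and including) the last convolutional stage.
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])  # (B, 2048, H/32, W/32)
        self.proj = nn.Conv2d(2048, embed_dim, kernel_size=1)           # 1x1 "patches"

    def forward(self, x):                       # x: (B, 3, H, W)
        feat = self.backbone(x)                 # (B, 2048, H/32, W/32)
        feat = self.proj(feat)                  # (B, D, H/32, W/32)
        return feat.flatten(2).transpose(1, 2)  # (B, N, D) token sequence

tokens = HybridPatchEmbedding()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 49, 768])
```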


The variants of the ViT architecture follow the ones from BERT:
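For reference, the three model variants (Table 1 of the paper) are:

| Model | Layers | Hidden size D | MLP size | Heads | Params |
|---|---|---|---|---|---|
| ViT-Base | 12 | 768 | 3072 | 12 | 86M |
| ViT-Large | 24 | 1024 | 4096 | 16 | 307M |
| ViT-Huge | 32 | 1280 | 5120 | 16 | 632M |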

Comparison to state-of-the-art models on popular image classification benchmarks:

Pre-training data and computation requirements compared to BiT:

Some lower layer attention heads behave like convolutional layers, focusing on local patches, while others have a more global view:

The authors experimented with different positional embeddings; curiously, explicitly including 2-D information in them does not improve performance, as the model learns to encode it implicitly:


Using Attention Rollout to visualize the attention maps shows that the model learns to attend to the semantically relevant parts of the image:
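Attention Rollout (Abnar & Zuidema 2020) is straightforward to implement. The sketch below is a minimal version, assuming the per-layer attention matrices have already been collected from the model (e.g. with forward hooks); residual connections are accounted for by mixing each attention matrix with the identity before multiplying the layers together.

```python
import torch

def attention_rollout(attentions):
    """attentions: list of per-layer attention tensors of shape (num_heads, N, N).
    Returns an (N, N) rollout matrix; its first row tells how strongly the
    [class] token attends to every token through the whole network."""
    num_tokens = attentions[0].shape[-1]
    rollout = torch.eye(num_tokens)
    for attn in attentions:
        attn = attn.mean(dim=0)                          # average over heads -> (N, N)
        attn = 0.5 * attn + 0.5 * torch.eye(num_tokens)  # account for residual connections
        attn = attn / attn.sum(dim=-1, keepdim=True)     # re-normalize rows
        rollout = attn @ rollout                         # propagate attention across layers
    return rollout

# Example with random attention maps for a 12-layer, 12-head model and 197 tokens.
fake = [torch.softmax(torch.randn(12, 197, 197), dim=-1) for _ in range(12)]
cls_to_patches = attention_rollout(fake)[0, 1:]          # [class] token attention to patches
print(cls_to_patches.shape)  # torch.Size([196])
```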

Social Impact

The paper laid the foundation for the use of Transformers in vision tasks. This was a major breakthrough, as the well-established benefits of Transformers became available for vision as well.

Positive social impact:
  • Pre-trained models democratize the landscape of computer vision, as they can be fine-tuned with low computational resources for a wide range of downstream tasks, especially for those with limited datasets like medical imaging.
  • Improved the accuracy of image recognition systems, surpassing previous benchmarks in various visual tasks. This might boost the reliability and precision of AI systems in domains like autonomous vehicles, or medical imaging.
  • More efficient training and fine-tuning compared to CNNs, and therefore lower energy consumption.
Potential Negative social impact:
  • Transformers are known to be very data hungry, and the pre-training datasets have to be very large to be effective. This can lead to a greater incentive to collect more data, which can be a privacy concern, especially as image data is often more sensitive than text data.
  • The scaling benefits of the Transformer architecture to larger sizes can lead to a greater computational cost, which can be a barrier for many research groups and centralize the research to a few large companies.

Industry Applications

Due to the rich latent space that vision transformers learn, the architecture was quickly adopted in industry and has become one of the most widely used backbones for production-grade ML models. Some of the industries using vision Transformers are:

  • Autonomous vehicles: It is widely speculated that Tesla's Full Self-Driving software uses a vision-Transformer-based backbone, and Nvidia's self-driving platform is likewise rumored to rely on a Transformer backbone.
  • Medical imaging: Because Transformers inherently attend to global context, they have proven to work exceptionally well on medical imaging tasks such as detecting types of cancer from CT scans and restoring high-quality scans from low-quality images.
  • Augmented reality: ViTs can improve recognition and tracking in AR and VR applications, enabling more realistic and immersive experiences.
  • Document Analysis and OCR: ViTs can be used for optical character recognition (OCR) in document analysis. They can extract text and information from images or scanned documents, streamlining data entry and document processing workflows.

Follow-on Research

This study was the first to bring the Transformer architecture to vision, and the approach soon became one of the most popular. This led to various aspects of the model being improved in follow-up work. A few areas of research that have led to significant improvements in the Transformer architecture are:

Review

Overall Score: 8 (Strong Accept)
Summary

The paper challenges traditional convolutional neural networks (CNNs) with a novel architecture inspired by natural language processing. The result is a new approach to computer vision tasks that proves to be both efficient and scalable.

Strengths

The paper itself is very well written, with clear objectives and ample evidence to justify the motivation behind each proposed novelty. The authors also conduct extensive ablation studies, particularly on the embedding layers as well as on other architectures, to show that their method is an improvement over the compared approaches.

This research also changed the perspective of the vision community: since this paper, there has been a tremendous amount of interest in scaling and improving Transformers. One of the key strengths of this paper lies in its empirical validation. The authors demonstrate the superior performance of ViTs on benchmark datasets, outperforming state-of-the-art CNNs. The scalability of ViTs allows them to be trained on massive datasets, showcasing their potential for handling diverse and extensive visual information.

The paper not only highlights the success of ViTs in image classification but also demonstrates the value of pre-training on large datasets and fine-tuning on task-specific data. This transfer learning strategy proves instrumental in achieving remarkable results with limited labeled data. Moreover, the authors delve into the interpretability of ViTs, showcasing their ability to highlight relevant image regions. The attention maps generated by ViTs provide insight into the model's decision-making process, contributing to improved transparency and understanding, a critical aspect in real-world applications.

Weaknesses

The paper relies heavily on the authors' previous work on Big Transfer (BiT), and while the evaluation datasets are quite diverse, the comparison is made mostly against BiT rather than a broader set of state-of-the-art models.

While the paper explores in detail how much pre-training data is needed, these experiments are not highlighted in the main part of the paper (Table 6 in the Appendix). Since the Transformer architecture is known to be highly scalable, this deserved a more prominent discussion.

Some practical aspects of the architecture are not discussed, such as how to handle images of different sizes and shapes (e.g. non-square images). Should padding be used for image sizes that are not a multiple of the patch size? How should the patch size be chosen for larger resolutions? At first glance, the patch size should grow as the image size increases, otherwise each patch becomes too homogeneous.

The paper says very little about the limitations or drawbacks of ViTs. The authors present the approach as superior to CNNs in nearly every aspect, which is not necessarily true. For example, CNNs are known to be more robust to adversarial attacks, which is not discussed in the paper.

Food for thought:
  • Instead of just using a linear projection, can we do something better? Perhaps use pre-training to learn better representations of the input embeddings?
  • The main comparison done by the authors is against the BiT [3] paper. Other baselines, such as bag-of-visual-words approaches, could also be used for comparison.
  • Can we combine the best of CNNs and Transformers?

Code implementation (optional) can be found at: this GitHub repository

References

[1] Dosovitskiy, Alexey, et al. "An image is worth 16x16 words: Transformers for image recognition at scale." arXiv preprint arXiv:2010.11929 (2020).
[2] Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems 30 (2017).
[3] Kolesnikov, Alexander, et al. "Big transfer (bit): General visual representation learning." Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part V 16. Springer International Publishing, 2020.

Team Members

  • Aditya Varshney
  • Gega Darakhvelidze