The main motivation behind ViT was the success of Transformers on NLP tasks. The dominant approach there, pre-training a model on a large corpus and fine-tuning it on small task-specific datasets (as in BERT and GPT), was limited to textual data: directly applying self-attention to pixels is computationally intractable because the input (the number of image pixels) is very large. As a result, attention was usually considered only in conjunction with convolutional layers.
Some attempts had been made to make the self-attention mechanism work for images:
Many of these specialized attention architectures demonstrated promising results on computer vision tasks, but they require complex engineering to be implemented efficiently on hardware accelerators.
The most significant and most similar prior work to ViT was On the Relationship between Self-Attention and Convolutional Layers (Cordonnier et al., 2020). Its key finding: "we prove that a multi-head self-attention layer with sufficient number of heads is at least as expressive as any convolutional layer". However, that paper did not evaluate the method against state-of-the-art models and was limited to 2×2 image patches, which hindered scalability and restricted it to small-resolution images.
The authors built on their prior work on CNN transfer learning, Big Transfer (BiT) (Kolesnikov 2020 and Djolonga 2020), comparing the Transformer architecture against the approach already established for ResNets.
This study is part of a sequence of studies that this research group at Google Research, Zurich has been conducting for a couple of years. They are highly motivated researchers with backgrounds in Computer Science, Neuroscience, Physics and Biology, working on deep learning architectures for computer vision tasks. They also like to scale networks both ways: creating smaller and more efficient networks as well as big networks with state-of-the-art (SOTA) performance. Some members have since moved to different research groups but have collaborated frequently in the past.
Name | Affiliation | Contribution |
---|---|---|
Alexey Dosovitskiy | Google Research | Ex-Intel; left Google for a year to set up Inceptive with Jakob and recently returned. PhD in Mathematics from Lomonosov Moscow State University. 66,095 citations. |
Lucas Beyer | Google Research | Co-author, involved in the exploration and experimentation with Vision Transformers. French-German national from Belgium with a PhD from RWTH Aachen. 37,236 citations. |
Alexander Kolesnikov | Google Research | Previously a PhD at IST Austria and an applied mathematics MSc at Moscow State University. 38,545 citations. |
Dirk Weissenborn | Inceptive Inc. | Based in Germany; ex-Google, Meta and DeepMind. 28,727 citations. |
Xiaohua Zhai | Google Research | Senior Staff Researcher based in Zurich. PhD from Peking University, China. Interested in vision, representation learning and generative modelling. 35,628 citations. |
Thomas Unterthiner | Google Research | ML Engineer; PhD from Johannes Kepler Universität, Austria. Research interests include computational biology, understanding deep networks, activation functions and evaluation metrics. 51,478 citations. |
Mostafa Dehghani | Google Research | Research Scientist, ex-Apple. PhD from the University of Amsterdam. 33,777 citations. |
Matthias Minderer | Google Research | PhD from Harvard under Christopher Harvey; previously studied Neuroscience at ETH Zurich and Biochemistry at the University of Cambridge. Interested in representation learning for vision tasks. 27,314 citations. |
Georg Heigold | Google Research | Diploma in Physics from ETH Zurich. Research interests include automatic speech recognition, discriminative training and log-linear modelling. Ex-Apple. 33,475 citations. |
Sylvain Gelly | Independent Researcher / Google Research | Deep learning researcher based in Zurich. Works on reinforcement learning and dynamic programming. 41,451 citations. |
Jakob Uszkoreit | Co-founder @ Inceptive | One of the authors of the original Transformer paper and the most cited author on this panel. Ex-Googler. 145,054 citations. |
Neil Houlsby | Google Research | PhD in Computational Biology from the University of Cambridge. Research interests include Bayesian ML, cognitive science and active learning. 37,994 citations. |
Transformers had recently been proposed by Vaswani et al. in [2] and had become state of the art for language-related tasks. The authors of this paper wanted a solution that stayed as close as possible to the original Transformer, so that the scaling approaches already developed for language could be used out of the box. Hence, they chose the same architectural setup as the authors of [2].
However, since the original Transformer expects a 1-D sequence of tokens, the authors had to find a way to convert a 2-D image into such a sequence. To do this, an input image of height H and width W is split into N non-overlapping patches of size P × P, where N = HW / P². Each flattened patch is then mapped with a trainable linear projection to the model dimension; the resulting vectors are referred to as the patch embeddings.
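As a rough illustration of this step, here is a minimal NumPy sketch under assumed sizes (224×224 RGB image, 16×16 patches, model dimension 768); the random projection matrix stands in for the learned linear layer and this is not the authors' implementation:

```python
import numpy as np

# Minimal sketch of patch extraction (assumed sizes; not the paper's code).
H, W, C, P = 224, 224, 3, 16                 # image height/width, channels, patch size
D = 768                                      # model (embedding) dimension
image = np.random.rand(H, W, C)              # placeholder input image

N = (H * W) // (P * P)                       # number of patches: HW / P^2 = 196

# Cut the image into non-overlapping P x P patches and flatten each one.
patches = image.reshape(H // P, P, W // P, P, C)   # (14, 16, 14, 16, 3)
patches = patches.transpose(0, 2, 1, 3, 4)         # (14, 14, 16, 16, 3)
patches = patches.reshape(N, P * P * C)            # (196, 768) flattened patches

# A trainable linear projection maps each flattened patch to dimension D,
# producing the patch embeddings (random weights stand in for learned ones).
E = np.random.rand(P * P * C, D)
patch_embeddings = patches @ E                      # (N, D) = (196, 768)
```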
Similarly to its natural language counterpart BERT, which uses a [class] token, a learnable embedding is prepended to the sequence of patch embeddings; its state at the output of the encoder serves as the image representation.
The Transformer architecture suits language tasks because text is inherently sequential, and the architecture encodes the position of each token alongside its content via positional embeddings, which are key to the architecture. To mimic this characteristic for vision, positional embeddings are added to the patch embeddings, helping retain the position information of the patches. The resulting sequence of embedding vectors is used as input to the encoder block of the Transformer.
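Continuing the sketch above (again illustrative NumPy with random placeholders for what are learned parameters in the real model), the [class] token is prepended and positional embeddings are added to form the encoder input:

```python
import numpy as np

# Continues the patch-embedding sketch; same assumed shapes as before.
N, D = 196, 768
patch_embeddings = np.random.rand(N, D)      # stand-in for the projected patches

# Prepend a learnable [class] token (analogous to BERT); its state at the
# encoder output serves as the image representation.
cls_token = np.random.rand(1, D)
tokens = np.concatenate([cls_token, patch_embeddings], axis=0)   # (N + 1, D)

# Add positional embeddings (one per token, including [class]) so the
# model can retain the position of each patch.
pos_embedding = np.random.rand(N + 1, D)
encoder_input = tokens + pos_embedding       # (197, 768), fed to the Transformer encoder
```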
Because the Transformer block contains no convolutional layers that would bake in a neighbourhood structure, the model instead has to learn the 2-D positions and the spatial relations between patches itself, giving it far less image-specific inductive bias than a CNN. The authors also propose an alternative way to generate the patch embeddings: a hybrid model in which patches are extracted from a CNN feature map instead of from raw pixels.
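A rough sketch of that hybrid variant (illustrative only; the feature-map shape and channel count below are assumptions, and a random array stands in for a real ResNet stage output): each spatial position of the CNN feature map is treated as one token and projected to the model dimension.

```python
import numpy as np

# Hybrid variant sketch: tokens come from a CNN feature map instead of raw patches.
Hf, Wf, Cf = 14, 14, 1024                    # assumed feature-map height/width/channels
D = 768                                      # model dimension
feature_map = np.random.rand(Hf, Wf, Cf)     # placeholder for a real CNN (e.g. ResNet) output

# Flatten the spatial grid: each of the Hf * Wf positions becomes one "patch".
tokens = feature_map.reshape(Hf * Wf, Cf)    # (196, 1024)

# Project each position to the model dimension (random weights as placeholders).
hybrid_patch_embeddings = tokens @ np.random.rand(Cf, D)   # (196, 768)
```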
The authors experimented with different positional embeddings; curiously, explicitly including 2-D information in them does not improve performance, as the model learns to encode spatial position implicitly.
The paper laid the foundation for the use of Transformers in vision tasks. This was a major breakthrough, as the well-established benefits of Transformers became available for vision tasks as well.
Due to the rich latent space that vision transformers learn, the architecture was quickly adopted in industry and has become one of the most widely used architectures for production-grade ML models. Some of the industries which use vision-based transformers are:
This study was the first to successfully apply a pure Transformer architecture to vision at scale, and it soon became one of the most popular methods, which led to many aspects of the model being targeted for improvement. A few areas of research which have led to significant improvements in the transformer architecture are:
The paper challenges traditional convolutional neural networks (CNNs) with a novel architecture inspired by natural language processing, resulting in a new approach to computer vision tasks that proves to be both efficient and scalable.
The paper itself is well written, with clearly defined objectives and ample evidence to justify the motivation behind each proposed novelty. The authors also conduct extensive ablation studies, particularly on the embedding layers and on alternative architectures, to show that their proposed method is an improvement over the compared approaches.
This research also changed the perspective of the vision community: since this paper, there has been a tremendous amount of interest in scaling and improving Transformers. One of the key strengths of this paper lies in its empirical validation. The authors demonstrate the superior performance of ViTs on benchmark datasets, outperforming state-of-the-art CNNs. The scalability of ViTs allows them to be trained on massive datasets, showcasing their potential for handling diverse and extensive visual information.
The paper not only highlights the success of ViTs in image classification but also demonstrates the value of pre-training on large datasets and fine-tuning on task-specific data. This transfer learning strategy proves instrumental in achieving remarkable results with limited labeled data. Moreover, the authors delve into the interpretability of ViTs, showcasing their ability to highlight relevant image regions. The attention maps generated by ViTs provide insight into the model's decision-making process, contributing to improved transparency and understanding, a critical aspect in real-world applications.
The paper relies heavily on the authors' previous work on Big Transfer (BiT): while the evaluation datasets are quite diverse, the comparison is largely against their own BiT baselines rather than a broader set of state-of-the-art models.
While the paper explores the scaling of data needed for pre-training in detail, these experiments are not highlighted in the main part of the paper (Table 6 in the Appendix). Since the Transformer architecture is known to be highly scalable, this deserved a more prominent discussion.
Some practical aspects of the architecture are not discussed, such as how to handle images of different sizes and shapes (e.g. non-square images). Should we consider padding for image sizes that are not a multiple of the patch size? How should the patch size be chosen for larger resolutions? At first glance, the patch size should grow as the image size increases; otherwise each patch becomes too homogeneous.
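To make the padding question concrete, here is a small back-of-the-envelope sketch (our own example values, not from the paper): an image whose sides are not multiples of the patch size would have to be padded up to the next multiple before patchification.

```python
import math

# Illustrative arithmetic for a non-patch-multiple image (example values are ours).
H, W, P = 500, 333, 16
H_pad = math.ceil(H / P) * P     # 512
W_pad = math.ceil(W / P) * P     # 336
N = (H_pad * W_pad) // (P * P)   # 672 patches after padding
print(H_pad, W_pad, N)
```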
The paper says very little about the limitations or drawbacks of ViTs. The authors present the approach as superior to CNNs in nearly every aspect, which is not necessarily true; for example, the relative robustness of CNNs and ViTs to adversarial attacks is not discussed in the paper.