COLT5 For CS 7150

An Analysis of COLT5: Faster Long-Range Transformers with Conditional Computation

Optimizing Transformer models for long context windows has long been a challenge. In the original Transformer decoder, generating a sequence of length n has O(n²) time complexity and O(n) memory complexity. In this work, the authors propose a way to shrink the coefficient of the n² term, from 2 to 1/84.
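To see where such a reduction in the quadratic coefficient can come from, here is a back-of-the-envelope cost model (the local window w and routing ratio r below are illustrative placeholders, not the paper's exact constants): if every token only receives local attention of width w, while a routed subset of n/r tokens additionally attends within itself, then roughly

```latex
% Back-of-the-envelope cost model; w (local window) and r (routing ratio)
% are assumed placeholders, not COLT5's exact values.
C_{\text{full}}(n) = c\,n^{2},
\qquad
C_{\text{routed}}(n) \approx
  \underbrace{c\,n\,w}_{\text{light, local attention}}
  + \underbrace{c\,(n/r)^{2}}_{\text{heavy, routed attention}}
  = c\,n\,w + \frac{c}{r^{2}}\,n^{2}
```

Routing only a 1/r fraction of tokens through the heavy attention therefore shrinks the n² coefficient by a factor of r².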

Figure 1

The optimization is done with a "routing" layer: by adding a "light" attention and MLP branch alongside the original "heavy" attention and MLP, the model can spend more computation on important tokens and less on unimportant ones. The authors trained a model with a 64k-token context window that achieves a higher F1 score with less inference time than LongT5 on the NarrativeQA dataset.
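To make the idea concrete, here is a minimal sketch of a conditional-computation block in PyTorch. It is not the authors' implementation: the module names, dimensions, the simple linear router, and the top-k selection rule are all assumptions for illustration. Every token goes through a cheap "light" MLP, and only the k highest-scoring tokens also go through the expensive "heavy" MLP (the same pattern applies to the attention branches).

```python
import torch
import torch.nn as nn


class RoutedFeedForward(nn.Module):
    """Toy conditional-computation block (a sketch, not the exact COLT5 layer).

    Every token passes through a cheap "light" MLP; only the top-k routed
    tokens additionally pass through an expensive "heavy" MLP.
    """

    def __init__(self, d_model=512, d_light=1024, d_heavy=4096, k=64):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, 1)  # scalar routing score per token
        self.light = nn.Sequential(
            nn.Linear(d_model, d_light), nn.ReLU(), nn.Linear(d_light, d_model))
        self.heavy = nn.Sequential(
            nn.Linear(d_model, d_heavy), nn.ReLU(), nn.Linear(d_heavy, d_model))

    def forward(self, x):  # x: [batch, seq_len, d_model]
        scores = self.router(x).squeeze(-1)          # [batch, seq_len]
        out = x + self.light(x)                      # light path for every token
        topk = scores.topk(self.k, dim=-1).indices   # indices of "important" tokens
        idx = topk.unsqueeze(-1).expand(-1, -1, x.size(-1))
        heavy_out = self.heavy(torch.gather(x, 1, idx))
        # Scale the heavy update by the routing score so the router receives a
        # gradient signal through the (otherwise non-differentiable) selection.
        weights = torch.sigmoid(torch.gather(scores, 1, topk)).unsqueeze(-1)
        out = out.scatter_add(1, idx, weights * heavy_out)
        return out, scores
```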

Figure 2

This method also improves the interpretability of the Transformer model: the routing score can be read as an importance score for each token.
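Continuing the sketch above (and assuming the RoutedFeedForward class defined there), the router's scores can be read out directly as a per-token importance signal:

```python
import torch

layer = RoutedFeedForward(d_model=512, k=8)   # sketch class from the example above
x = torch.randn(1, 128, 512)                  # one sequence of 128 token embeddings
_, scores = layer(x)
importance = scores.softmax(dim=-1)[0]        # normalized per-token importance
print(importance.topk(8).indices)             # the 8 tokens the router deems most important
```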

Literature Review

Optimizing LLMs for long contexts has drawn great attention. Guo et al. (2022) proposed using local attention and transient global attention to adapt the T5 model to long contexts (LongT5). Following Evan Miller's "Attention Is Off by One" essay, Xiao et al. (2023) found that a large proportion of attention weight in fact falls on the initial tokens of the sequence; they then proposed keeping those tokens as an "attention sink" to enable streaming inference with Transformer models. Allowing LLMs to dynamically adjust their computation is also an interesting topic. Goyal et al. (2023) proposed inserting '[pause]' tokens during pre-training and fine-tuning, allowing the model to spend more computational power on important tokens and thus achieving better results on multiple datasets.

Biography

Joshua Ainslie

MS in Statistics at Stanford; Software Engineer at Google

Tao Lei

PhD at MIT; Research Scientist at Google Brain

Michiel de Jong

PhD at USC; Research Scientist at Stealth Startup; Former Googler

Santiago Ontanon

PhD at Autonomous University of Barcelona; Associate Professor at Drexel University

Siddhartha Brahma

PhD at EPFL; Research Scientist at Google DeepMind

Yury Zemlyanskiy

PhD student at USC

David Uthus

PhD at the University of Auckland; Software Engineer at Google

Mandy Guo

BS at Cornell; Software Engineer at Google

James Lee-Thorp

Google Researcher

Yi Tay

Senior Research Scientist at Google Brain

Yun-Hsuan Sung

PhD at Stanford; Senior Research Scientist at Google

Sumit Sanghai

Software Engineer at Google

Academic Impact

This work opens up a new research question: optimizing the Transformer at the token level. It assumes that each token has a different importance within the sequence, so less computational power can be allocated to the less important tokens, reducing the overall time complexity. Another interesting research question is whether we can give more computational power to specific tokens. Goyal et al. (2023), in "Think Before You Speak: Training Language Models With Pause Tokens", studied this direction and reported a significant performance gain when allowing more computation for important output tokens.
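As a toy illustration of the pause-token idea (the "<pause>" string and the token-list representation are hypothetical, not the paper's actual tokenizer), appending pause tokens simply buys the model extra forward passes before it must commit to an answer:

```python
# Hypothetical illustration of inserting pause tokens before the model's answer.
prompt_tokens = ["What", "is", "2", "+", "2", "?"]
num_pauses = 4                                     # extra "thinking" steps granted to the model
augmented = prompt_tokens + ["<pause>"] * num_pauses
print(augmented)
# ['What', 'is', '2', '+', '2', '?', '<pause>', '<pause>', '<pause>', '<pause>']
```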

Industry Impact

The cost of running LLMs is a major barrier to deploying them everywhere. For example, the "new Bing" chatbot serves multiple model sizes and places a small "router" in front of all incoming requests to estimate the "difficulty" of each question. However, this router performs poorly, so New Bing often gives silly responses to tricky questions.

So optimizing LLMs for longer contexts and less computation can help the industry a lot.

Review from Yuxiong Wu

Score: 8/10 (Strong Accept)

Pros:

Innovation: Implements a conditional computation mechanism for efficient processing of long texts.

Efficiency: Significantly improves training and inference speed, especially with extremely long inputs.

Scalability: Effectively handles long inputs up to 64k tokens.

Few-Shot Learning Capabilities: Performs well in few-shot learning tasks.

Cons:

Limited to Encoder: Conditional computation is only applied to the encoder, so it is not suitable for token-by-token generation in decoders.

Specialization & Adaptability: Primarily designed for long sequences, might not be suitable for short sequences or require training from scratch.

Decoder Integration: No exploration of how to integrate this model with decoders, limiting its application in decoder-only models.

Review from Yuxuan Lu

Score: 8/10 (Strong Accept)

Pros:

Novel idea: Optimizing Transformers with conditional computation is a really interesting and novel idea.

Results: Outperforms other methods while requiring less computational power.

Cons:

Can't be applied to Transformer decoders, which most recent models are based on.

Requires training from scratch; it can't be used "plug-and-play".

References

[1] Sachin Goyal, Ziwei Ji, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar, Vaishnavh Nagarajan. Think Before You Speak: Training Language Models With Pause Tokens. ICLR 2024.

[2] Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, Mike Lewis. Efficient Streaming Language Models with Attention Sinks. ICLR 2024.

[3] Iz Beltagy, Matthew E. Peters, Arman Cohan. Longformer: The Long-Document Transformer. 2020.

Team Members

Yuxiong WU & Yuxuan LU