Optimizing Transformer models for long context windows has long been a challenge. In the original Transformer decoder implementation, generating a sequence of length n takes O(n²) time and O(n) memory. In this work, the authors propose a way to reduce the coefficient of the n² term from 2 to 1/84.
The optimization is done with a "routing" layer: by adding a "light" attention and MLP branch alongside the original "heavy" attention and MLP, the model can spend more computation on important tokens and less on unimportant ones. The authors trained a model with a 64k context window, achieving a higher F1 score than LongT5 on the NarrativeQA dataset while requiring less inference time.
This method also improves the interpretability of the Transformer model: the routing score can be read as an importance score for each token.
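Below is a minimal sketch of the routing idea, not the authors' exact implementation: a cheap scorer assigns each token a routing score, every token goes through the light branch, and only the top-scoring fraction also goes through the heavy branch. The module sizes, routed fraction, and sigmoid gating are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RoutedFeedForward(nn.Module):
    """Every token passes through a cheap 'light' branch; only the top-k
    highest-scoring tokens also pass through the expensive 'heavy' branch."""

    def __init__(self, d_model=512, d_light=128, d_heavy=2048, heavy_frac=0.1):
        super().__init__()
        self.router = nn.Linear(d_model, 1)   # per-token routing score
        self.light = nn.Sequential(
            nn.Linear(d_model, d_light), nn.ReLU(), nn.Linear(d_light, d_model))
        self.heavy = nn.Sequential(
            nn.Linear(d_model, d_heavy), nn.ReLU(), nn.Linear(d_heavy, d_model))
        self.heavy_frac = heavy_frac

    def forward(self, x):                                # x: (batch, seq, d_model)
        scores = self.router(x).squeeze(-1)              # (batch, seq)
        k = max(1, int(self.heavy_frac * x.size(1)))
        top_idx = scores.topk(k, dim=-1).indices         # positions of "important" tokens
        out = self.light(x)                              # light path for every token
        idx = top_idx.unsqueeze(-1).expand(-1, -1, x.size(-1))
        heavy_in = x.gather(1, idx)                      # (batch, k, d_model)
        gate = torch.sigmoid(scores.gather(1, top_idx)).unsqueeze(-1)
        heavy_out = self.heavy(heavy_in) * gate          # gating keeps the selection differentiable
        out = out.scatter_add(1, idx, heavy_out)         # add heavy output back at routed positions
        return out, scores                               # scores double as token-importance estimates
```

In the authors' design, routing is applied to both the attention and MLP layers and is trained end to end; the sigmoid gate above is just one simple way to keep the selection trainable in this sketch.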
Optimizing LLMs for long contexts has drawn great attention. Guo et al. (2022) proposed combining local attention with transient global attention to adapt the T5 model to long contexts. Xiao et al. (2023), following Evan Miller's "Attention Is Off by One" essay, found that a large proportion of the attention weight in fact falls on the first few tokens of the sequence; they then proposed keeping these tokens as an "attention sink" to enable streaming inference with Transformer models. Allowing LLMs to dynamically adjust their computation is also an interesting topic. Goyal et al. (2023) proposed inserting '[pause]' tokens during pre-training and fine-tuning, giving the model extra computation before it commits to an answer, and thereby achieving better results on multiple datasets.
MS in Statistics at Stanford; Software Engineer at Google
PhD at MIT; Research Scientist at Google Brain
PhD at USC; Research Scientist at Stealth Startup; Former Googler
PhD at Autonomous University of Barcelona; Associate Professor at Drexel University
PhD at EPFL; Research Scientist at Google DeepMind
PhD student at USC
PhD at The University of Auckland; Software Engineer at Google
BS at Cornell; Software Engineer at Google
Google Researcher
Senior Research Scientist at Google Brain
PhD at Stanford; Senior Research Scientist at Google
Software Engineer at Google
This work opens a new research question: optimizing the Transformer at the token level. It assumes that tokens differ in importance within a sequence, so less computation can be allocated to the less important tokens, reducing overall time complexity. Another interesting question is the reverse: can we allocate more computation to specific tokens? "Think Before You Speak: Training Language Models With Pause Tokens" [1] studied this direction and reported significant performance gains when the model is given extra computation before producing its output tokens.
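As a rough illustration of the pause-token idea (the token id and count below are assumptions, not values from that paper): extra "[pause]" tokens are appended after the prompt, so the model gets additional forward-pass computation before its answer is read off.

```python
import torch

PAUSE_ID = 32000   # assumed id of a dedicated "[pause]" token added to the vocabulary
NUM_PAUSES = 10    # how many pause tokens to append; a tuning choice, not the paper's value

def append_pause_tokens(input_ids: torch.Tensor) -> torch.Tensor:
    """input_ids: (batch, seq) -> (batch, seq + NUM_PAUSES)."""
    pauses = torch.full((input_ids.size(0), NUM_PAUSES), PAUSE_ID,
                        dtype=input_ids.dtype, device=input_ids.device)
    return torch.cat([input_ids, pauses], dim=1)

# During training and decoding, the loss / answer extraction starts only after the
# last pause token, so the extra positions serve purely as additional computation.
```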
The cost of running LLMs is a major barrier to deploying them everywhere. For example, the "new Bing" chatbot reportedly serves several model sizes and places a small "router" in front of incoming requests to estimate the "difficulty" of each question. However, the accuracy of this router is poor, so new Bing often gives silly responses to tricky questions.
Optimizing LLMs for longer contexts and lower computational cost would therefore be of great value to the industry.
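As an illustration only (the internals of commercial chatbots are not public), such a setup might look like a lightweight classifier that routes each request to a small or a large model; the class name, embedding size, and threshold here are assumptions.

```python
import torch
import torch.nn as nn

class DifficultyRouter(nn.Module):
    """Toy request router: a linear head over a query embedding predicts 'hardness'
    and picks which model size should serve the request."""

    def __init__(self, d_embed=384):
        super().__init__()
        self.classifier = nn.Linear(d_embed, 1)  # would be trained on labeled easy/hard queries

    def forward(self, query_embedding: torch.Tensor) -> str:
        hardness = torch.sigmoid(self.classifier(query_embedding)).item()
        return "large_model" if hardness > 0.5 else "small_model"
```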
Score: 8/10 (Strong Accept)
Pros:
Innovation: Implements a conditional computation mechanism for efficient processing of long texts.
Efficiency: Significantly improves training and inference speed, especially with extremely long inputs.
Scalability: Effectively handles long inputs up to 64k tokens.
Few-Shot Learning Capabilities: Performs well in few-shot learning tasks.
Cons:
Limited to Encoder: Conditional computation is only applied to the encoder, not suitable for token-by-token generation in decoders.
Specialization & Adaptability: Primarily designed for long sequences; it may not suit short sequences and requires training from scratch.
Decoder Integration: No exploration of how to integrate this model with decoders, limiting its application in decoder-only models.
Score: 8/10 (Strong Accept)
Pros:
Novel idea: Optimizing Transformers with conditional computation is an interesting and novel idea
Results: Outperforms other methods while requiring less computational power
Cons:
Cannot be applied to Transformer decoders, which most recent models are based on
Requires training from scratch; cannot be used "plug-and-play" with existing models
[1] Sachin Goyal, Ziwei Ji, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar, Vaishnavh Nagarajan. Think Before You Speak: Training Language Models With Pause Tokens. ICLR 2024.
[2] Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, Mike Lewis. Efficient Streaming Language Models with Attention Sinks. ICLR 2024.
[3] Iz Beltagy, Matthew E. Peters, Arman Cohan. Longformer: The Long-Document Transformer. 2020.
Yuxiong WU & Yuxuan LU