Literature Review
"Attention Is All You Need" (Vaswani et al., 2017)
- Introduced the Transformer model, focusing solely on attention mechanisms, eliminating the need for recurrence and convolutions.
- Achieved state-of-the-art results in machine translation, proving more parallelizable and efficient in training.
"Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention" (Katharopoulos et al., 2020)
- Proposed linear transformers to reduce complexity from O(N^2) to O(N), making them significantly faster for long sequences.
- Maintained performance comparable to standard transformers while being up to 4000x faster in autoregressive prediction.
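To make the O(N^2) → O(N) claim concrete, here is a minimal NumPy sketch (not the authors' code; the function names and toy shapes are illustrative) of the kernel trick from the paper: replacing exp(q·k) with a feature map φ(q)·φ(k) lets the key-value summary be computed once, so cost grows linearly in sequence length. The non-causal case is shown; the causal, autoregressive case replaces the global sums with running sums, which is what yields the RNN view and the reported decoding speedups.

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard attention: the N x N score matrix makes this O(N^2) in sequence length."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    return (weights / weights.sum(axis=1, keepdims=True)) @ V

def linear_attention(Q, K, V):
    """Kernelized attention: with similarity phi(q) . phi(k), associativity gives
    (phi(Q) (phi(K)^T V)) normalized by phi(Q) sum_k phi(k), i.e. O(N) instead of O(N^2)."""
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))  # elu(x) + 1 feature map, as in the paper
    Qf, Kf = phi(Q), phi(K)
    KV = Kf.T @ V                  # (d, d_v) summary; never forms the N x N matrix
    Z = Qf @ Kf.sum(axis=0)        # per-query normalizer
    return (Qf @ KV) / Z[:, None]

# Toy usage: outputs differ from softmax attention (different similarity), but scale linearly in N.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(8, 4)) for _ in range(3))
print(linear_attention(Q, K, V).shape)  # (8, 4)
```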
"FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness" (Dao et al., 2022)
- Introduced FlashAttention, an IO-aware exact attention algorithm that reduces reads and writes between GPU high-bandwidth memory and on-chip SRAM.
- Enabled faster Transformer training and improved performance on various tasks, including long-sequence challenges.
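The algorithmic core behind the IO-aware kernel can be illustrated in plain NumPy: compute exact softmax attention over key/value blocks while carrying running max/sum statistics ("online softmax"), so the full N x N score matrix is never materialized at once. This is only an illustrative sketch under my own naming and block-size choices; the real speedups come from fusing these steps into a GPU kernel that keeps each block in SRAM, which NumPy cannot express.

```python
import numpy as np

def tiled_attention(Q, K, V, block=64):
    """Exact attention computed block-by-block over K/V with running (max, sum) statistics,
    the online-softmax idea FlashAttention builds on. Q, K, V have shape (N, d)."""
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros((N, V.shape[1]))
    row_max = np.full(N, -np.inf)   # running max of scores for each query row
    row_sum = np.zeros(N)           # running softmax denominator for each query row
    for start in range(0, N, block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        scores = (Q @ Kb.T) * scale                    # (N, b) block of scores
        new_max = np.maximum(row_max, scores.max(axis=1))
        rescale = np.exp(row_max - new_max)            # correct previously accumulated stats
        p = np.exp(scores - new_max[:, None])          # unnormalized block probabilities
        out = out * rescale[:, None] + p @ Vb
        row_sum = row_sum * rescale + p.sum(axis=1)
        row_max = new_max
    return out / row_sum[:, None]

# Sanity check against the naive formulation that materializes the full N x N matrix.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(130, 16)) for _ in range(3))
scores = Q @ K.T / 4.0
ref = np.exp(scores - scores.max(axis=1, keepdims=True))
ref = (ref / ref.sum(axis=1, keepdims=True)) @ V
assert np.allclose(tiled_attention(Q, K, V), ref)
```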
"Retentive Network: A Successor to Transformer for Large Language Models" (Sun et al., 2023)
- Proposed RetNet, a new architecture for large language models, achieving efficient training, low-cost inference, and strong performance.
- Introduced a retention mechanism supporting various computation paradigms, positioning it as a successor to the Transformer.
Biography
Yutao Sun
- Research intern at Microsoft Research Lab - Asia
- PhD student at Tsinghua University, Beijing
- Research interests: LLM backbones and applications, long-sequence modeling and inference

Li Dong
- Principal Researcher at Microsoft Research Lab - Asia
- PhD in Informatics from The University of Edinburgh, Scotland
- Research interests: Human Language Technologies and Natural Language Computing

Shaohan Huang
- Senior Researcher in the Natural Language Computing group at Microsoft Research Lab - Asia
- MS in Computer Science from Beihang University, Beijing
- Recently co-authored the paper "Language Is Not All You Need"

Shuming Ma
- Researcher in the Natural Language Computing group at Microsoft Research Lab - Asia
- Master's and Bachelor's degrees from Peking University, with a focus on natural language processing
- Published 30+ papers on large-scale pre-trained LMs at top conferences (ICML, ICLR, ACL, EMNLP)

Yuqing Xia
- Researcher in the Natural Language Computing group at Microsoft Research Lab - Asia
- PhD in Biology from Peking University (2019)
- Interested in using AI to empower natural-science research

Jilong Xue
- Principal Researcher in the System Research Group at Microsoft Research Asia (MSRA)
- PhD in Computer Science from Peking University
- Builds AI frameworks and compilers to bridge and optimize hardware for AI applications

Furu Wei
- Partner Research Manager at Microsoft Research Asia
- Previously a Staff Researcher at IBM Research - China
- PhD in Computer Science from Wuhan University (2009)
- Has worked with Li Dong on several Microsoft research papers

Jianyong Wang
- Professor in the Department of Computer Science and Technology at Tsinghua University
- Research, supported by various organizations, focuses on data mining algorithms
Methodology and Architecture
RetNet is a novel architecture designed to address some of the limitations of the Transformer model, particularly in the context of large language models. Here's an overview of its methodology and architecture:
Methodology:
- Objective: The primary goal of RetNet is to achieve training parallelism, efficient long-sequence modeling, Transformer-comparable performance, and low-cost inference simultaneously.
- Addressing the 'Impossible Triangle': The paper identifies the challenge of achieving training parallelism, good performance, and low inference cost simultaneously in Transformer models, referring to this challenge as the "impossible triangle."
- Comparison with Existing Models: RetNet is compared extensively with the Transformer and its variants, focusing on aspects like scaling curves, in-context learning, and inference cost.
Figure 1: Impossible Triangle
Figure 2: Comparison of Existing Models
Architecture:
- Multi-Scale Retention Mechanism: RetNet introduces a multi-scale retention mechanism as a substitute for the multi-head attention used in Transformers. This mechanism supports three computation paradigms (a minimal sketch of the first two follows the figures below):
- Parallel Representation: Facilitates training parallelism, fully utilizing GPU devices for efficient training.
- Recurrent Representation: Enables O(1) inference in terms of memory and per-step computation, significantly reducing deployment cost and latency. This representation also simplifies implementation by eliminating the need for key-value cache tricks.
- Chunkwise Recurrent Representation: Encodes each chunk in parallel while recurrently summarizing the chunks, enabling efficient long-sequence modeling with linear complexity.
Figure 3.1: Retention Equation
Figure 3.2: Retention Equation in Sequential Form
Figure 4: Parallel and Sequential Representation of Retention
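As a complement to the figures, here is a minimal NumPy sketch of single-head retention in both forms. It is my own simplification, not the paper's released code: it assumes a single head and omits the xPos-style rotation, the per-head decay values that make retention "multi-scale", and the GroupNorm/gating of the full retention block. The point it illustrates is that the parallel form used for training and the recurrent form used for O(1)-per-step decoding compute the same output.

```python
import numpy as np

def retention_parallel(Q, K, V, gamma):
    """Parallel (training) form: (Q K^T elementwise* D) V, with decay mask
    D[n, m] = gamma^(n - m) for n >= m and 0 otherwise."""
    N = Q.shape[0]
    n, m = np.arange(N)[:, None], np.arange(N)[None, :]
    D = np.where(n >= m, float(gamma) ** (n - m), 0.0)
    return ((Q @ K.T) * D) @ V

def retention_recurrent(Q, K, V, gamma):
    """Recurrent (inference) form: S_n = gamma * S_{n-1} + k_n^T v_n,  o_n = q_n S_n.
    The state S has fixed size, so per-token cost and memory are O(1) in sequence length."""
    S = np.zeros((K.shape[1], V.shape[1]))
    out = np.zeros((Q.shape[0], V.shape[1]))
    for t in range(Q.shape[0]):
        S = gamma * S + np.outer(K[t], V[t])
        out[t] = Q[t] @ S
    return out

# The two forms agree (up to floating-point error), which is what lets RetNet train in
# parallel yet decode recurrently without a growing key-value cache.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(6, 4)) for _ in range(3))
assert np.allclose(retention_parallel(Q, K, V, 0.9), retention_recurrent(Q, K, V, 0.9))
```

The chunkwise recurrent form mentioned above combines the two: tokens within a chunk use the parallel computation, while a compressed state is carried recurrently across chunks.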
Performance and Efficiency:
- Inference Efficiency: RetNet demonstrates length-invariant inference cost. For instance, with a 7B model and 8k sequence length, RetNet decodes 8.4 times faster and saves 70% of memory compared to Transformers.
- Training Efficiency: During training, RetNet achieves 25-50% memory saving and 7 times acceleration compared to the standard Transformer. It also shows advantages over highly-optimized variants like FlashAttention.
Figure 5: Inference Cost
Figure 6: Model Size
Figure 7: Performance Comparison between Different Transformers and RetNet
In summary, RetNet's architecture is designed to overcome the limitations of the Transformer by introducing a novel retention mechanism that supports multiple computation paradigms, leading to improvements in training efficiency, inference cost, and scalability.
Social Impact
Since RetNet functions in a manner similar to Transformers, its societal impact largely overlaps with theirs:
- Environmental Concerns: By one estimate, a series of roughly 25 prompts to ChatGPT consumes about 500 ml of water for data-center cooling. By reducing inference cost significantly, RetNet could lower this environmental footprint.
- Bias: Such models can inadvertently perpetuate biases present in their training data. Addressing these biases is crucial for fairness and the ethical use of AI.
- Inclusivity: These technologies can significantly enhance accessibility for people with disabilities, offering tools for communication, navigation, and interaction that were previously unavailable.
- Privacy Concerns: Most language models are trained on large-scale data scraped from the internet, which may include user data. There is a persistent risk of privacy violations, as adversarial prompts can lead to data leaks.
- Impact on Employment: Automation and AI are already displacing workers in certain sectors while creating new opportunities in tech and other industries; the full social impact of this shift is still unfolding.
Industry Applications
- Jobs in General: Using a language model as a knowledge base can substantially increase worker productivity across many roles.
- Finance: In the finance sector, these technologies can enhance fraud detection and improve customer service through intelligent chatbots.
- Retail and E-commerce: AI can personalize shopping experiences, summarize reviews, and recommend products to users.
These are only a few examples; efficient language models are likely to benefit almost every industry in some way.
Follow-on Research
- Model Efficiency and Validity: Follow-on research is needed on the scalability, efficiency, and validity of this architecture; the paper's comparisons center on a single RetNet model of roughly 7B parameters.
- Cross-disciplinary Applications: Exploring the application of these AI models in various fields such as climate science, astrophysics, and materials science is a burgeoning area of research.
- Broader Applications of Retention: Further work could explore additional applications of the retention mechanism and how it might affect areas currently dominated by Transformers.
Peer-Review
Summary: The Retentive Network (RetNet), introduced in this work, is proposed as a foundation architecture for large language models, claiming to achieve training parallelism, low-cost inference, and strong performance simultaneously. The paper explores the connection between recurrence and attention, leading to a retention mechanism for sequence modeling. Notably, the parallel representation enables training parallelism, while the recurrent representation enables low-cost O(1) inference, improving decoding throughput, latency, and GPU memory usage without compromising performance.
Strengths:
- The paper is well written and easy to understand.
- It introduces a novel concept with the potential to significantly enhance future large language models (LLMs).
- The research includes experiments comparing RetNet's performance with traditional transformers.
Weaknesses:
- The model has been tested and compared only for a single size, leaving some uncertainty about its scalability.
- The evaluation of model performance relies solely on perplexity, which may not fully capture the model's capabilities.
- Since the work comes from Microsoft, the authors could have released a working implementation or public demo for the community to test, as was done with ChatGPT.
Limitations:
- The paper does not explore the scalability of RetNet to significantly larger models. A broader range of model sizes, extending beyond 13B parameters, would provide a more comprehensive understanding of its performance at scale.
Overall Rating: 6 (Weak Accept)
The field is evolving rapidly, and competitors will seize any available advantage. If RetNet were truly revolutionary, we would already be seeing LLMs built on it; roughly five months after publication, it has not yet gained much traction.
References
[1] Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. 2022.
[2] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention Is All You Need. 2017.
[3] Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention. 2020.
[4] Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, and Furu Wei. Retentive Network: A Successor to Transformer for Large Language Models. 2023.
Team Members
Karan Mudaliar
Sashank Reddy