An Analysis of AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
- by Luv Verma, Rakesh Rathod (Hugging Face)
Introduction
Large Language Models (LLMs) based on transformers [1] have taken the world by storm; ChatGPT is one such example. However, with these advantages comes an astronomical hardware cost. For example, GPT-3 has 175B parameters, which is 350GB in FP16, while the latest H100 GPU has only 96GB of memory, let alone edge devices. LLMs such as GPT-3 (Generative Pre-trained Transformer 3) often have billions of parameters, typically represented using high-precision floating-point numbers. Quantization solves this problem to an extent. Quantization refers to the process of reducing the precision of the numerical values used to represent the model's parameters and activations. It aims to make the model more efficient for deployment on resource-constrained devices by reducing the bit-width of these numerical values. Instead of using 32-bit floating-point numbers, quantization may represent them in 16-bit floating-point or even lower-precision integer formats. This reduces memory requirements, allowing more compact model storage and faster inference. Hence, this blog analyzes a recently developed method called Activation-aware Weight Quantization (AWQ) [2].
Historical Background
In the early days of telecommunications during the 19th century, engineers grappled with the formidable challenge of transmitting analog signals over vast distances. The inherent infinite resolution of continuous analog signals posed a dilemma. Enter Harry Nyquist, an American engineer, who, in the early 20th century, formulated the Nyquist-Shannon sampling theorem. This groundbreaking idea, later popularized by Claude Shannon, asserted that to accurately reconstruct a continuous signal, the sampling rate must be at least twice the signal's maximum frequency. This theoretical revelation set the stage for practical implementation by Alec H. Reeves in the 1930s—the Pulse Code Modulation (PCM). PCM involves the systematic quantization of analog signals through regular sampling and discretization. With the advent of digital computers, Digital Signal Processing (DSP) techniques emerged, ushering in a new era of applying quantization to a broader range of signal-processing tasks. Claude Shannon's information theory further deepened the understanding of quantization's role in balancing information compression and loss. As technology progressed, quantization became a linchpin in compression methods for efficient data storage and transmission. Today, in the realm of machine learning, quantization continues to evolve, optimizing the deployment of large neural networks on resource-constrained devices. The story of quantization unfolds as a journey from the early challenges of telecommunications to a fundamental concept shaping the digital landscape.
As the demand for more sophisticated machine learning models grew, the limitations of hardware became evident, especially when dealing with large-scale models like GPT-3. With billions of parameters, these models strain memory and computational resources, making deployment on edge devices a formidable task.
In response to these challenges, researchers and practitioners sought ways to optimize the quantization process further. Traditional quantization techniques were effective but lacked a nuanced understanding of the dynamic nature of activations within the neural network during inference.
Enter Activation-aware Weight Quantization (AWQ). This method represents a paradigm shift in the quantization landscape by introducing a heightened awareness of activation patterns. Instead of treating weights and activations uniformly, AWQ takes into account the specific characteristics of the data flow during inference.
Literature Review
- Deep Compression [3]: Quantization reduces the bit-precision of deep learning models, which helps to reduce the model size and accelerate inference
- Post-Training Quantization (PTQ) [4, 5]: It quantizes the weights and activations of the model without necessitating any retraining. It fuses activations into preceding layers wherever possible and requires calibration with a representative dataset to determine optimal quantization parameters for activations. This is used when both memory bandwidth and compute savings are important with CNNs being the typical use case. However, in a low-bit setting, PTQ suffers from large accuracy degradation.
- Quantization Aware Training (QAT) [6]: QAT inserts fake quantization to all the weights and activations during the model training process and results in higher inference accuracy than PTQ methods. This is typically used for CNNs, however, the training cost is higher and also needs training data.
- Generative Pre-trained Transformer Quantization (GPTQ) [7]: Published at ICLR 2023, GPTQ uses second-order information to perform error compensation. However, it overfits the calibration set during reconstruction, distorting the learned features on out-of-distribution domains, which can be problematic since LLMs are generalist models.
- 4-bit Quantization for LLM Inference [8]: Published at ICML 2023, the researchers showed that 4-bit quantization for LLM inference yields nearly optimal performance for a fixed number of model bits across all model scales and model families.
- LLM.int8() [9]: Published in 2022, researchers developed a methodology to load a 175B-parameter transformer with 16- or 32-bit weights, convert the feed-forward and attention projection layers to 8-bit, and use the resulting model immediately for inference without any performance degradation. To do this, they solved two key challenges: the need for higher quantization precision at scales beyond 1B parameters, and the need to explicitly represent the sparse but systematic large-magnitude outlier features.
- Weight Magnitudes for Quantization [10]: In 2015, researchers observed that the most important weight channels can be identified by looking at the magnitudes of the weight channels or their L2-norm; however, the current research did not find this criterion helpful for the quantization procedure.
- W8A8 Quantization[11]: Both activation and weights are quantized to INT8. Industrial standard for CNNs. However, it has been observed that W8A8 quantization does not work as the number of parameters increases (for LLMs). The reason behind this is that systematic outliers emerge in activations when LLMs are scaled up beyond 6.7 billion parameters.
- W4A16 Quantization [12]: This is low-bit weight-only quantization, where only the weights are quantized into low-bit integers. It has been shown that this setting leads to higher throughput during inference. Specifically, W4A16 enables an efficient offloading strategy for high-throughput generative inference on a single commodity GPU: weights can be offloaded to secondary storage and computation performed part-by-part through partial loading, and the smaller 4-bit weights make this much cheaper. Therefore, the current research explores the W4A16 setting; a minimal round-to-nearest sketch of weight-only quantization follows below.
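To make the W4A16 setting concrete, here is a minimal sketch of round-to-nearest (RTN) group-wise weight-only quantization in PyTorch. This is an illustration rather than any paper's implementation; the group size of 128, the tensor shapes, and the function name `rtn_quantize_weights` are assumptions chosen for the example.

```python
import torch

def rtn_quantize_weights(w: torch.Tensor, n_bits: int = 4, group_size: int = 128):
    """Round-to-nearest (RTN) weight-only quantization, applied per group of
    `group_size` input channels; activations would stay in FP16 (W4A16-style)."""
    out_features, in_features = w.shape
    w_grouped = w.reshape(out_features, in_features // group_size, group_size)

    # Symmetric quantization: each group is scaled by its absolute maximum.
    q_max = 2 ** (n_bits - 1) - 1                       # 7 for INT4
    scales = w_grouped.abs().amax(dim=-1, keepdim=True).clamp(min=1e-5) / q_max

    # Quantize to integers, then dequantize back to get a "fake-quantized" copy.
    w_int = torch.clamp(torch.round(w_grouped / scales), -q_max - 1, q_max)
    w_dequant = (w_int * scales).reshape(out_features, in_features)
    return w_dequant, w_int, scales

# Toy usage: a 512x1024 weight matrix.
w = torch.randn(512, 1024) * 0.02
w_dq, _, _ = rtn_quantize_weights(w)
print("mean absolute quantization error:", (w - w_dq).abs().mean().item())
```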
Paper Authors

Song Han is an associate professor at MIT EECS. He received his PhD degree from Stanford University. He proposed the “Deep Compression” technique including pruning and quantization that is widely used for efficient AI computing, and “Efficient Inference Engine” that first brought weight sparsity to modern AI chips, which influenced NVIDIA’s Ampere GPU Architecture with Sparse Tensor Core. He pioneered the TinyML research that brings deep learning to IoT devices, enabling learning on the edge. He has been cited 50392 times.
Chuang Gan is a faculty member at UMass Amherst and a research manager at the MIT-IBM Watson AI Lab. He was a postdoc at MIT, working with Prof. Antonio Torralba, Prof. Daniela Rus, and Prof. Josh Tenenbaum. He has been cited 15131 times.
Ji Lin is a Final-year Ph.D. student at MIT EECS, advised by Prof. Song Han and a Member of Technical Staff at OpenAI. He has been cited 8070 times.
Haotian Tang is a third-year PhD student at the Han Lab of MIT EECS. He has been cited 1614 times.
Shang Yang is a first-year PhD student at the MIT Han Lab.
Xingyu Dang is an undergraduate research assistant at the MIT Han Lab.
Jiaming Tang is a Research Intern at MIT Han Lab.
Challenges with LLMs Inferencing
Post Training Quantization Method: SmoothQuant
The QAT methods cannot easily scale up to large models like LLMs. Therefore, people usually use PTQ methods to quantize LLMs.
Two settings for LLM Quantization [2]
- W8A8 Quantization: Both activation and weights are quantized to INT8. Industrial standard for CNNs.
- W4A16 Quantization: Weights are quantized to INT4. This setting leads to a reduction in hardware barrier (requiring a smaller memory size) and also speeds up the token generation (remedying memory-bound workload)

Figure 2 [13]: Accuracy degradation with W8A8 quantization with an increase in the number of parameters
From Figure 2, it has been observed that W8A8 quantization does not work as the number of parameters increases. The reason behind this is that systematic outliers emerge in activations when LLMs are scaled up beyond 6.7 billion parameters.

Figure 3 [13]: Systematic outliers in the activations of an LLM (OPT-13B).
In Figure 3, it can be seen that certain channels have much higher activation values (magnitudes greater than 70), whereas the weight distribution is flat and uniform (between 0 and 1). Such a high dynamic range makes the activations difficult to quantize, while the weights remain very easy to quantize.

Figure 4 [13]: (a): Scaling down activations and scaling up the weights, (b): Scaling factor and output calculations
Figure 4 shows how the SmoothQuant technique deals with this. It scales down the activations and scales up the weights so that the overall effect cancels out, which makes the activations easier to quantize. The scaling reduces the range of the activations relative to Figure 3, while the weight values increase slightly (for the same channels) but can still be quantized easily.
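Below is a minimal sketch of this migration idea, loosely following the per-channel smoothing rule from the SmoothQuant paper [11]. The smoothing strength alpha = 0.5, the tensor shapes, and the helper name `smooth_linear` are illustrative assumptions, not the authors' code.

```python
import torch

def smooth_linear(x: torch.Tensor, w: torch.Tensor, alpha: float = 0.5):
    """Migrate quantization difficulty from activations to weights.
    x: (tokens, in_features) calibration activations, w: (in_features, out_features)."""
    act_max = x.abs().amax(dim=0)                      # per-input-channel activation range
    w_max = w.abs().amax(dim=1)                        # per-input-channel weight range
    s = (act_max.pow(alpha) / w_max.pow(1 - alpha)).clamp(min=1e-5)

    x_smoothed = x / s                                 # activations scaled down per channel
    w_smoothed = w * s.unsqueeze(1)                    # weights scaled up on the same channels
    # Mathematically, x_smoothed @ w_smoothed == x @ w, so the layer output is unchanged.
    return x_smoothed, w_smoothed

x = torch.randn(16, 768)
x[:, 0] *= 70.0                                        # emulate one outlier activation channel
w = torch.randn(768, 768) * 0.02
xs, ws = smooth_linear(x, w)
print(torch.allclose(x @ w, xs @ ws, atol=1e-3))       # True: the output is preserved
```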

Figure 5 [13]: Green layers are quantized to INT8, yellow layers remain in FP16. All compute-intensive operators are quantized.
Figure 5 shows which layers are quantized by SmoothQuant (W8A8).

Figure 6 (a): Latency, (b): Memory footprint for OPT-175B
Figure 6 shows that the method worked well in terms of maintaining accuracy without fine-tuning, accelerating inference, and halving the memory footprint for the OPT-175B model. For the MT-NLG 530B model, it was observed that the number of GPUs was reduced from 16 (for FP16) to 8 (for INT8) at a similar latency.
Memory Bound Challenges

Figure 7 (a): Roofline Model (Performance (GFLOPs/sec) vs Operational Intensity (Flops/byte)), (b): Example showing the downside of SmoothQuant method for single-query (batch size = 1) LLM inference over A100 GPU, LLaMA-65B model (highly memory bounded) [16]
According to Figure 7(b), the technique works well for batch serving (tested with a batch size of 128 on the LLaMA-65B model); however, single-query LLM inference remains memory-bound: with 65 billion parameters in FP16, generating even a single token requires streaming roughly 130 GB of weights from memory.
To put things in perspective, for A100 GPU:
- There are 108 Streaming multiprocessors (SMs)
- Each SM can request 64 bytes of memory per clock cycle.
- Each SM can run at 1410 MHz.
- Therefore, in 1 second, the peak memory request can be: 64 bytes * 108 SMs * 1410 MHz = 9750 GB/sec.
- A100 has a memory bandwidth of 1555 GB/sec, which means the peak data request rate is roughly 6.3x the memory bandwidth.
- SmoothQuant is memory-bound for inference at a batch size of 1: when little computation is performed per byte fetched, as in single-batch token generation, the GPU spends most of its time waiting on memory rather than computing. This is why Figure 7(b) shows that the smaller the batch size, the more memory-bound the workload becomes (Figure 7(a)); the arithmetic is reproduced in the short calculation below.
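The numbers in the list above can be reproduced directly; the SM count, clock, and bandwidth figures below are simply the A100 values quoted in this section.

```python
# Peak memory-request rate of an A100 vs. its DRAM bandwidth (numbers quoted above).
sms = 108                     # streaming multiprocessors
bytes_per_clock_per_sm = 64   # bytes each SM can request per clock cycle
clock_hz = 1410e6             # 1410 MHz

peak_request_gb_per_s = sms * bytes_per_clock_per_sm * clock_hz / 1e9
dram_bandwidth_gb_per_s = 1555

print(f"peak request rate : {peak_request_gb_per_s:,.0f} GB/s")                       # ~9,746 GB/s
print(f"request/bandwidth : {peak_request_gb_per_s / dram_bandwidth_gb_per_s:.1f}x")  # ~6.3x
```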
Core Idea/Methodology
PTQ Method 2 for LLMs: Activation-aware Weight Quantization (AWQ)

Figure 8: (a): changing weights from FP16 to INT3. (b): Perplexity degraded on OPT-6.7B Wiki-2
In AWQ [2], weights are quantized to INT4 or INT3, while activations are left at FP16. This is because, for a single batch, the activations are much smaller (a vector) than the weights (a matrix), so it makes sense to quantize the weights. Quantizing the weights reduces memory requirements and accelerates the token-generation process. For example, fetching the weights of a 7-billion-parameter model stored in FP16 requires 7 billion * 2 bytes (2 bytes per FP16 value) = 14 GB of memory, whereas at INT4 this reduces to only 3.5 GB (INT4 is 0.5 bytes). However, it was observed that reducing all weights to INT4 increased the perplexity score (Figure 8(a & b); lower is better).
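A quick back-of-the-envelope check of these weight-memory numbers (the 7B parameter count and the bit-widths are taken from the paragraph above):

```python
# Weight memory of a 7B-parameter model at different bit-widths.
params = 7e9
for name, bits in [("FP16", 16), ("INT4", 4), ("INT3", 3)]:
    print(f"{name}: {params * bits / 8 / 1e9:.1f} GB")
# FP16: 14.0 GB, INT4: 3.5 GB, INT3: 2.6 GB
```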

Figure 9: (a): Mixed-precision weights (1% salient weights kept at FP16), (b): Perplexity reduced again after mixed precision.
How to Select Salient Weights?
The authors came up with the idea of mixed-precision quantization of weights, where only 1% of the weights were kept at FP16 and the rest were reduced to either INT4 or INT3. The 1% of weights kept at FP16 were the salient/most important weights, and the saliency was determined based on the value of the activations. For example, in the outlier channels (where the activations were outliers), the weights of the corresponding channels were important and should be included in the 1%; this is what the paper calls activation-aware weight quantization. This led to an immediate reduction in perplexity, as shown in Figure 9 (a & b). Therefore, the authors looked at the activation distribution to select the 1% of salient weight values.
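A minimal sketch of the selection step described above, assuming the top 1% of input channels ranked by average activation magnitude over a calibration set are kept in FP16; the shapes and the helper name `select_salient_channels` are illustrative, not the paper's code.

```python
import torch

def select_salient_channels(x_calib: torch.Tensor, fraction: float = 0.01):
    """Rank input channels by mean absolute activation over a calibration set
    and return the indices of the top `fraction` (kept in FP16 under mixed precision)."""
    channel_importance = x_calib.abs().mean(dim=0)          # (in_features,)
    k = max(1, int(fraction * channel_importance.numel()))
    return torch.topk(channel_importance, k).indices

x_calib = torch.randn(256, 4096)       # cached activations from a small calibration set
salient = select_salient_channels(x_calib)
print(f"keeping {salient.numel()} of {x_calib.shape[1]} channels in FP16")
```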
Can we get rid of mixed-precision?
Mixed precision is not hardware efficient, as it makes the system implementation difficult. Therefore, the requirement was to protect the important weights without actually keeping them as FP16 (Figure 9b).

Figure 9b: (a): Multiplying the salient channels by s > 1 to reduce the quantization error, (b): Perplexity reduced.
Protecting Salient Weights by Activation-aware Scaling (Figure 9b):
- Consider a linear operator $y = \mathbf{w}\mathbf{x}$.
- Its quantized counterpart is $y = Q(\mathbf{w})\mathbf{x}$.
- The quantization function is given as:

$$Q(\mathbf{w}) = \Delta \cdot \mathrm{Round}\!\left(\frac{\mathbf{w}}{\Delta}\right) \tag{1}$$

where

$$\Delta = \frac{\max(|\mathbf{w}|)}{2^{N-1}} \tag{2}$$

From equations 1 and 2, $N$ is the number of quantization bits, and $\Delta$ is the quantization scaler determined by the absolute maximum value. Now consider a weight element $w \in \mathbf{w}$. If, in equation 1, we multiply $w$ with $s > 1$ and inversely scale $x$, we will have:

$$Q(w \cdot s) \cdot \frac{x}{s} = \Delta' \cdot \mathrm{Round}\!\left(\frac{w s}{\Delta'}\right) \cdot x \cdot \frac{1}{s} \tag{3}$$

$\Delta'$ is the new quantization scaler after applying $s$. Empirically, the authors observed the following:
- The expected error from $\mathrm{Round}(\cdot)$ is always ~0.25, since the rounding error is uniformly distributed over [0, 0.5].
- Scaling up a single element $w$ usually does not change the extreme value of the group $\mathbf{w}$. Therefore $\Delta' \approx \Delta$.
- When $\Delta' \approx \Delta$ and $s > 1$, the relative quantization error of the scaled weight is reduced by the factor $1/s$ (from equation 3).
- From all the points above, scaling up the most salient weight channels is equivalent to preserving them (as mixed precision did, Figure 9). Therefore, instead of working with $Q(\mathbf{W})\mathbf{X}$, we can work with $Q(\mathbf{W} \cdot \mathrm{diag}(\mathbf{s}))(\mathrm{diag}(\mathbf{s})^{-1} \cdot \mathbf{X})$.
- Formally, the objective becomes:

$$\mathbf{s}^{*} = \arg\min_{\mathbf{s}} \mathcal{L}(\mathbf{s}), \qquad \mathcal{L}(\mathbf{s}) = \left\| Q(\mathbf{W} \cdot \mathrm{diag}(\mathbf{s}))\left(\mathrm{diag}(\mathbf{s})^{-1} \cdot \mathbf{X}\right) - \mathbf{W}\mathbf{X} \right\| \tag{4}$$

- $Q(\cdot)$: weight quantization function (e.g., INT3/INT4 quantization). Not directly differentiable during backpropagation.
- $\mathbf{W}$: original weights in FP16.
- $\mathbf{X}$: input features cached from a small calibration set.
- $\mathbf{s}$: per-channel scaling factor.
- $\mathrm{diag}(\mathbf{s})^{-1}$: to be fused into the previous operator.
In equation 4, the optimization objective is to find the scaling factor $\mathbf{s}$ that minimizes the difference between the output of the quantized model and that of the original model. Since the quantization function is non-differentiable, $\mathbf{s}$ cannot be optimized by backpropagation; however, a search space can be defined for the optimal scale by analyzing the factors that affect the choice of scaling factor. Therefore, $\mathbf{s}$ is defined as:

$$\mathbf{s} = \mathbf{s}_{\mathbf{X}}^{\alpha}, \qquad \alpha^{*} = \arg\min_{\alpha} \mathcal{L}\!\left(\mathbf{s}_{\mathbf{X}}^{\alpha}\right) \tag{5}$$

In equation 5, $\mathbf{s}_{\mathbf{X}}$ is the per-channel average magnitude of the activation, and $\alpha$ is a single hyper-parameter that balances the protection of salient and non-salient channels. The best $\alpha$ can be found by a fast grid search over the interval [0, 1] (0 means no scaling, 1 means the most aggressive scaling).
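Putting equations 1-5 together, the scale search can be sketched as below. This is a simplified per-layer illustration assuming symmetric per-output-channel INT4 quantization without grouping and an MSE loss; it is meant to convey the objective, not to reproduce the paper's implementation.

```python
import torch

def fake_quantize(w: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    """Symmetric per-output-channel quantize-dequantize (equations 1-2)."""
    q_max = 2 ** (n_bits - 1) - 1
    delta = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-5) / q_max
    return torch.clamp(torch.round(w / delta), -q_max - 1, q_max) * delta

def search_awq_scale(w: torch.Tensor, x: torch.Tensor, n_grid: int = 20):
    """Grid-search alpha in [0, 1] for s = s_x ** alpha, minimizing
    || Q(W * diag(s)) (diag(s)^-1 X) - W X ||  (equations 4-5).
    w: (out_features, in_features) weights, x: (tokens, in_features) calibration activations."""
    s_x = x.abs().mean(dim=0).clamp(min=1e-5)      # per-channel activation magnitude
    reference = x @ w.t()                          # original full-precision output
    best_alpha, best_loss, best_s = 0.0, float("inf"), torch.ones_like(s_x)
    for alpha in torch.linspace(0, 1, n_grid):
        s = s_x ** alpha
        w_q = fake_quantize(w * s)                 # scale salient channels up, then quantize
        loss = ((x / s) @ w_q.t() - reference).pow(2).mean().item()
        if loss < best_loss:
            best_alpha, best_loss, best_s = float(alpha), loss, s
    return best_alpha, best_s

# Toy usage with a few artificial outlier activation channels.
w = torch.randn(1024, 1024) * 0.02
x = torch.randn(128, 1024)
x[:, :8] *= 50.0
alpha, s = search_awq_scale(w, x)
print(f"best alpha found by grid search: {alpha:.2f}")
```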
Academic Impact

Figure 10 [2]: Effect of various scaling factors (s>1) on perplexity (OPT-6.7B model, Wiki-2 data)
As can be seen in Figure 10, the authors tried different scaling factors and found that a scaling factor of 2 reduced perplexity the most; increasing it further to 4 degraded perplexity again, because a very aggressive scale starts to inflate the quantization error of the non-salient channels.

Figure 11 [2]: AWQ improves (lowest perplexity) over all SOTA PTQ methods for different model sizes and INT3/INT4 low-bit precisions
From Figure 11, for both INT3 and INT4, AWQ performs best in terms of perplexity (on Llama-2 and LLaMA) in comparison to the previous state-of-the-art methods (RTN and GPTQ [14]).

Figure 12[2]: (a): Quantization results of a visual language model OpenFlamingo-9B [15] on COCO Captioning datasets, (b): Qualitative results of quantized OpenFlamingo-9B [15] on COCO captioning dataset (4-shot, INT4-g128 quantization).
Figure 12 (a & b) shows quantization and qualitative results for OpenFlamingo-9B on the COCO captioning dataset. AWQ outperforms existing methods under zero-shot and various few-shot settings, demonstrating its generalizability to different modalities and in-context learning workloads. It reduces the quantization degradation (32-shot) from 4.57 to 1.17 under INT4-g128, providing a 4x model-size reduction with negligible performance loss. Qualitatively (Figure 12(b)), AWQ gives better captions: for the image in 12(b), "two dogs walking on the street" is produced as the caption, whereas W4-RTN gives a wrong caption (it talks about bushes).
Speedup Evaluation

Figure 13: AWQ speed comparison (in tokens/sec) relative to the FP16 model (from Hugging Face) on three different devices. AWQ offers up to a 3.9x speedup.
The speedup of AWQ was evaluated on an RTX 4090 (desktop GPU), an RTX 4070 (laptop GPU), and a Jetson Orin (mobile GPU). For batch size = 1, inference was performed for all LLMs using a fixed prompt length of 4 tokens; 200 tokens were generated for each inference run and the median latency was reported. Figure 13 shows a 2.7-3.9x speedup for three families of LLMs (Llama-2, MPT, and Falcon) on the RTX 4090 compared with the Hugging Face FP16 implementation.
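The evaluation protocol (batch size 1, a ~4-token prompt, 200 generated tokens, median latency) can be approximated with a timing loop like the one below. The model name is a placeholder, the FP16 baseline is timed here, and the AWQ kernels themselves are not shown; this only sketches the measurement harness and assumes a CUDA GPU is available.

```python
import statistics
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-hf"   # placeholder; swap in an AWQ-quantized checkpoint to compare
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16).to("cuda")

inputs = tokenizer("The quick brown", return_tensors="pt").to("cuda")   # a ~4-token prompt
latencies = []
for _ in range(10):                   # repeat the run and report the median, as in the paper
    torch.cuda.synchronize()
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=200, do_sample=False)
    torch.cuda.synchronize()
    latencies.append(time.perf_counter() - start)

median_s = statistics.median(latencies)
print(f"median latency: {median_s:.2f} s  ({200 / median_s:.1f} tokens/s)")
```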
Industrial and Social Impact
AWQ has already been making an impact in the field of deep learning. Serving frameworks such as vLLM, FastChat, LMDeploy, and Replicate have already integrated it (links provided in the Platforms with AWQ Integration section).
Industrial Impact
- Hardware Efficiency: Quantization is crucial in optimizing hardware efficiency, especially in the development of electronic devices and integrated circuits. By representing data with fewer bits, the storage and transmission requirements are reduced, leading to more energy-efficient and cost-effective hardware solutions.
- Computational Performance: In industrial applications such as machine learning and signal processing, quantization plays a pivotal role in enhancing computational performance. By reducing the precision of numerical representations, algorithms can operate faster, making real-time processing feasible in various industries, including manufacturing, finance, and telecommunications.
- Embedded Systems: Quantization is essential in the design of embedded systems, which are prevalent in industrial automation and control systems. These systems often operate under resource constraints, and efficient quantization techniques enable the implementation of complex algorithms on resource-limited devices, contributing to the advancement of Industry 4.0.
- Communication Systems: In telecommunications and networking, quantization is instrumental in compressing and transmitting data efficiently. This is particularly relevant for the efficient utilization of bandwidth, reducing latency, and ensuring reliable communication in industrial applications.
Platforms with AWQ Integration
Social Impact
- Data Storage and Transmission: Quantization has a direct impact on data storage and transmission in social contexts. In everyday life, this is evident in multimedia applications, where compressed images, videos, and audio files utilize quantization techniques to reduce file sizes without significant loss of perceived quality. This, in turn, contributes to faster and more efficient sharing of information on social media platforms.
- Mobile and Wearable Devices: Quantization is crucial in the development of mobile and wearable technologies that have become integral parts of our daily lives. By employing quantization to optimize algorithms and reduce power consumption, these devices can offer longer battery life, making technology more accessible and convenient for a broader population.
- Healthcare: In medical imaging and diagnostics, quantization aids in the efficient storage and transmission of patient data. This is vital for the timely sharing of medical information among healthcare professionals and for the development of telemedicine solutions, improving access to healthcare services, especially in remote or underserved areas.
- Privacy and Security: Quantization plays a role in ensuring privacy and security in social and communication systems. By reducing the amount of information transmitted and stored, quantization contributes to protecting sensitive data and maintaining user privacy in various online and digital interactions.
In summary, quantization's industrial and social impact is broad, ranging from optimizing hardware and computational efficiency to facilitating faster and more efficient data storage, transmission, and communication across various sectors of society.
Review
Rakesh Rathod (score - 7/10, Accept)
I find the paper's innovative approach, introducing Activation-aware Weight Quantization (AWQ), to be a significant strength. AWQ provides a fresh perspective on addressing low-bit weight-only quantization challenges for Large Language Models (LLMs). The experimental results robustly support AWQ's effectiveness, consistently outperforming existing methods like round-to-nearest quantization (RTN) and GPTQ across a diverse range of LLM models. This innovation positions AWQ as a promising solution for enhancing the efficiency of language models.
However, a notable weakness is the paper's limited details on the implementation of AWQ. While it emphasizes the efficiency of the system implementation, the lack of specific technical details about optimizations and considerations during implementation raises questions about reproducibility and robustness. A more thorough exposition on the implementation would strengthen the paper's practical foundation.
Luv Verma (score - 7/10, Accept)
While the paper exhibits strengths, there are areas that warrant attention and improvement. One such strength is the inclusion of a practical system implementation alongside theoretical advancements. This feature sets AWQ apart by translating the achieved memory savings into substantial speedups, making it a practical solution for deployment on different GPUs. The widespread adoption of AWQ by various open-source LLM serving solutions further validates its utility within the research community.
However, a corresponding weakness lies in the assumption made in the paper regarding the relationship between activation magnitude and weight-channel saliency. While this assumption is foundational to AWQ, the paper lacks sufficient explanation or evidence to justify this choice. A more comprehensive exploration of this aspect would bolster the paper's theoretical underpinnings and strengthen its overall argument. Also, the method is evaluated on standard metrics such as perplexity and accuracy; however, other important LLM benchmarks such as robustness, fairness, bias, toxicity, and helpfulness are not evaluated.
Demo
Files are available at - HuggingFace
Demo of perplexity on a 1.3B-parameter model
Example of Llama-2 performance, non-quantized vs. quantized:

References
[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
[2] Lin, J., Tang, J., Tang, H., Yang, S., Dang, X., & Han, S. (2023). AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. arXiv preprint arXiv:2306.00978.
[3] Han, Song, Huizi Mao, and William J. Dally. "Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding." arXiv preprint arXiv:1510.00149 (2015).
[4] https://pytorch.org/docs/stable/quantization.html
[5] Migacz, Szymon. "8-bit inference with tensorrt." GPU technology conference. Vol. 2. No. 4. 2017.
[6] Krishnamoorthi, Raghuraman. "Quantizing deep convolutional networks for efficient inference: A whitepaper." arXiv preprint arXiv:1806.08342 (2018).
[7] Frantar, Elias, et al. "Gptq: Accurate post-training quantization for generative pre-trained transformers." arXiv preprint arXiv:2210.17323 (2022).
[8] Dettmers, Tim, and Luke Zettlemoyer. "The case for 4-bit precision: k-bit inference scaling laws." International Conference on Machine Learning. PMLR, 2023.
[9] Dettmers, Tim, et al. "Llm. int8 (): 8-bit matrix multiplication for transformers at scale." arXiv preprint arXiv:2208.07339 (2022).
[10] Han, Song, et al. "Learning both weights and connections for efficient neural network." Advances in neural information processing systems 28 (2015).
[11] Xiao, Guangxuan, et al. "Smoothquant: Accurate and efficient post-training quantization for large language models." International Conference on Machine Learning. PMLR, 2023.
[12] Sheng, Ying, et al. "High-throughput generative inference of large language models with a single gpu." arXiv preprint arXiv:2303.06865 (2023).
[13] Xiao, G., Lin, J., Seznec, M., Wu, H., Demouth, J., & Han, S. (2023, July). Smoothquant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning (pp. 38087-38099). PMLR.
[14] Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. Gptq: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323, 2022.
[15] Anas Awadalla, Irena Gao, Joshua Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Jenia Jitsev, Simon Kornblith, Pang Wei Koh, Gabriel Ilharco, Mitchell Wortsman, and Ludwig Schmidt. Openflamingo, March 2023.
[16] Williams, Samuel, Andrew Waterman, and David Patterson. "Roofline: An insightful visual performance model for multicore architectures." Communications of the ACM 52.4 (2009): 65-76.
[17] Jacob, Benoit, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. "Quantization and training of neural networks for efficient integer-arithmetic-only inference." In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2704-2713. 2018.
Appendix
Basics of Quantization
- Quantization: It is the process of constraining an input from a continuous or otherwise large set of values to a discrete set (Figure 1)

Figure 1: Quantized signal and Image
- Numerical Formats (Table 1): BF16 is very popular in the LLM community for standard pretraining; it is known to help training stability and is supported by NVIDIA A100 GPUs. For inference, models are typically quantized to INT8, INT4, or INT3. INT8 has a range from -128 to 127.
Table 1: Details of various commonly used numerical formats (1 bit is assigned to the sign)

| Format | Bits | Exponent | Fraction | Memory to store one value |
|---|---|---|---|---|
| FP32 | 32 | 8 | 23 | 4 bytes |
| FP16 | 16 | 5 | 10 | 2 bytes |
| BFLOAT16 | 16 | 8 | 7 | 2 bytes |
| INT8 | 8 | -/- | 7 | 1 byte |
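The entries in Table 1 can be checked directly in PyTorch; a small sketch, assuming a torch build that exposes bfloat16:

```python
import torch

# Floating-point formats: bit-width, byte size, largest value, and machine epsilon.
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(f"{str(dtype):15s} bits={info.bits:2d}  bytes={info.bits // 8}  "
          f"max={info.max:.3e}  eps={info.eps:.3e}")

# Integer format: INT8 spans [-128, 127].
i8 = torch.iinfo(torch.int8)
print(f"torch.int8      bits={i8.bits}  range=[{i8.min}, {i8.max}]")
```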
- Bytes required per parameter: Table 2 shows the bytes required per parameter during training for 1 billion parameters.
Table 2: Memory required per parameter during training

| | Bytes per parameter |
|---|---|
| Model parameters (weights) | 4 |
| Adam optimizer (2 states) | 8 |
| Gradients | 4 |
| Activations and temp memory (variable size) | 8 |
| Total memory per parameter | 4 bytes + 20 bytes |
| Memory for 1 billion parameters at FP32 | 24 bytes * 1 billion = 24 GB |
- Two's Complement Representation: The most commonly used integer representation.
- n-bit range: $-2^{n-1}$ to $2^{n-1} - 1$
- 000…00 represents 0
- 100…00 represents $-2^{n-1}$
- Generally used to convert FP32, FP16, or BF16 representations to binary integers or vice versa.
- The common choice for quantization in DL networks: weights and activations. Quantizing weights reduces storage/memory requirements. Methods like linear quantization can also lower the computational cost by using integer arithmetic during forward and backward propagation (quantization-aware training [17]).
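As a concrete illustration of the last point, here is a minimal sketch of symmetric linear quantization of an FP32 tensor to INT8 (the scale is taken from the absolute maximum; the tensor and function name are toy examples):

```python
import torch

def linear_quantize_int8(w: torch.Tensor):
    """Symmetric linear quantization: map FP32 values onto the INT8 range [-127, 127]."""
    scale = w.abs().max().clamp(min=1e-8) / 127.0
    w_int8 = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)
    return w_int8, scale

w = torch.randn(4, 4)
w_int8, scale = linear_quantize_int8(w)
w_dequant = w_int8.float() * scale          # integer arithmetic would operate on w_int8
print("max reconstruction error:", (w - w_dequant).abs().max().item())
```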