Diving Deeper into the Quantization Realm: Introduction to PTQ and QAT

Pratima Rathore
8 min read · Aug 30, 2023

Quantization, in simple terms, is a method for reducing the size of a model. It involves transforming model weights from a high-precision floating-point representation to a lower-precision one, such as 16-bit or 8-bit floating-point (FP) or integer (INT) representations. This conversion from high precision to lower precision can substantially shrink the model’s size and speed up inference, all while maintaining an acceptable level of accuracy.

Quantization techniques can reduce the size of Deep Neural Networks and improve inference latency and throughput by taking advantage of high throughput integer instructions.

In general, a model’s size is determined by multiplying the number of parameters by the precision of those values (the data type). To conserve memory, weights can be stored in reduced-precision data types through quantization.
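
As a back-of-the-envelope illustration of this formula (the 7-billion-parameter count below is just an example, not a model discussed in this article), here is how parameter count and data type translate into memory footprint:

```python
# Rough model-size estimate: number of parameters x bytes per parameter.
num_params = 7_000_000_000  # illustrative example

bytes_per_param = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}

for dtype, nbytes in bytes_per_param.items():
    size_gb = num_params * nbytes / 1024**3
    print(f"{dtype}: ~{size_gb:.1f} GB")
# fp32: ~26.1 GB, fp16: ~13.0 GB, int8: ~6.5 GB, int4: ~3.3 GB
```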

Quantization can be applied to all of the components of a neural network, including weights, biases, activations, layers, or channels.

This compression process involves estimating the range of values that the weights and activations take on; this estimation is carried out through a procedure referred to as calibration.

However, it’s essential to acknowledge that quantization has its trade-offs. The model’s performance might experience some decline due to reduced precision of weights and activations, which depends on factors like bit width and model complexity.

Floating Point Representation

The choice of data type dictates the quantity of computational resources required, affecting the speed and efficiency of the model. In deep learning applications, balancing precision and computational performance becomes a vital exercise as higher precision often implies greater computational demands.

A floating-point number is encoded using n bits, divided into three key parts:

Sign: Occupying a single bit, the sign bit distinguishes between positive (0) and negative (1) values.
Exponent: Comprising several bits, the exponent section represents the power to which the base (often 2 in binary) is raised. This allows for representation of large or small values through positive or negative exponents.

Significand/Mantissa: The remaining bits form the significand or mantissa, capturing the number’s significant digits. Precision hinges on the length of the significand, influencing the number’s accuracy.
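
To make this layout concrete, here is a small sketch (the value −6.25 is arbitrary) that extracts the sign, exponent, and mantissa bits of a standard 32-bit float, which uses 1 sign bit, 8 exponent bits, and 23 mantissa bits:

```python
import struct

x = -6.25  # arbitrary example value: -6.25 = -1.5625 * 2**2
bits = struct.unpack(">I", struct.pack(">f", x))[0]  # raw IEEE-754 bit pattern

sign     = bits >> 31                   # 1 bit
exponent = ((bits >> 23) & 0xFF) - 127  # 8 bits, stored with a bias of 127
mantissa = bits & ((1 << 23) - 1)       # 23 bits of the significand

print(f"sign={sign}, exponent={exponent}, mantissa={mantissa:023b}")
# sign=1, exponent=2, mantissa=10010000000000000000000 (i.e., 1.5625)
```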

Quantization involves approximating values, and a closer approximation typically leads to less performance degradation. Using float16 for quantization can halve memory usage without significant accuracy loss, but it might not offer a substantial speedup. Conversely, employing int8/int4 can significantly accelerate inference but often comes with accuracy trade-offs. In extreme scenarios, quantization-aware training becomes necessary, especially when int8/int4 post-training quantization alone fails.

Quantization techniques can be classified based on how and when we perform quantization.

How we do Quantization: Naive, Hybrid, and Selective

  • Naive quantization involves quantizing all operators to INT8 precision and calibrating them uniformly. However, this straightforward approach often leads to a notable decrease in model accuracy compared to the original floating-point model. The universal quantization method is applied to all operators without considering their individual sensitivity to quantization.
  • Hybrid quantization involves converting specific operators to INT8 precision, while keeping others as FP16 or FP32. This requires understanding the network’s structure and quantization-sensitive layers. Alternatively, you can perform a sensitivity analysis by excluding layers one by one and assessing changes in latency and accuracy.
  • Selective quantization involves quantizing certain operators to INT8 precision, using diverse calibration methods and granularity (either per channel or per tensor). Residuals, as well as sensitive and non-friendly layers, are also quantized to INT8. Meanwhile, FP16 precision is retained for specific layers. This approach allows users to modify entire model sections for improved quantization suitability. It offers maximum flexibility in selecting quantization parameters for various network types, aiming to optimize accuracy and minimize latency concurrently.
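
To give a rough feel for the hybrid/selective idea, here is a minimal PyTorch eager-mode sketch; the tiny model and the choice of “sensitive” layer are assumptions for demonstration, not taken from any specific recipe. Most of the network gets an INT8 qconfig, while one layer is kept in floating point by setting its qconfig to None:

```python
import torch
import torch.nn as nn

class HybridNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()      # FP32 -> INT8 boundary
        self.fc1 = nn.Linear(128, 64)
        self.relu = nn.ReLU()
        self.dequant = torch.quantization.DeQuantStub()  # INT8 -> FP32 boundary
        self.fc2 = nn.Linear(64, 10)                     # assume this layer is quantization-sensitive

    def forward(self, x):
        x = self.quant(x)
        x = self.relu(self.fc1(x))
        x = self.dequant(x)
        return self.fc2(x)                               # runs in floating point

model = HybridNet().eval()
model.qconfig = torch.quantization.get_default_qconfig("fbgemm")  # INT8 for everything...
model.fc2.qconfig = None                                           # ...except the sensitive layer

prepared = torch.quantization.prepare(model)      # insert observers
prepared(torch.randn(32, 128))                    # calibration pass on representative data
quantized = torch.quantization.convert(prepared)  # convert calibrated modules to INT8
```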

What are the reasons for employing selective quantization?
* Certain layers are sensitive, leading to a significant accuracy decline.
* Operator sequences are not INT8-friendly, potentially worsening latency instead of enhancing it.
* Specific blocks require special structures to smoothly convert with inference frameworks like TensorRT.
* Activation distributions differ, so different calibrators are needed for weights and activations.

Learn about the best practices for quantization.

When we do Quantization: Post-training methods (PTQ) and Quantization-aware training (QAT)

Quantization methods fall broadly into two categories:
* Post-training quantization (PTQ) is a technique where the model is quantized after it has been trained.

In post-training quantization, the model’s weights and activations are evaluated on a representative dataset to determine the range of values taken by these parameters. These ranges are then used to quantize the weights and activations to the desired integer precision.

The quantization process involves dividing the range of values into equal intervals and mapping the original values to the closest interval boundaries. Post-training quantization is typically performed using one of several algorithms, including dynamic range quantization (which quantizes the model’s weights and activations to a set number of bits), weight-only quantization (which quantizes only the model’s weights, leaving the activations in floating point), and per-channel quantization (which quantizes the model’s weights and activations per channel rather than globally).
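
A minimal sketch of that mapping, assuming simple asymmetric (affine) INT8 quantization with a scale and zero-point derived from the observed min/max range (NumPy is used purely for illustration):

```python
import numpy as np

def quantize(x, num_bits=8):
    qmin, qmax = 0, 2**num_bits - 1
    scale = (x.max() - x.min()) / (qmax - qmin)        # width of each interval
    zero_point = int(np.round(qmin - x.min() / scale)) # integer that represents 0.0
    q = np.clip(np.round(x / scale + zero_point), qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

weights = np.random.randn(4, 4).astype(np.float32)  # stand-in for real weights/activations
q, scale, zp = quantize(weights)
print("max reconstruction error:", np.abs(weights - dequantize(q, scale, zp)).max())
```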

The most popular PTQ methods are:
* GPTQ, a one-shot weight quantization method
* GGML
* QLoRA’s 4-bit quantization (bitsandbytes)
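
For instance, loading a model with bitsandbytes 4-bit (QLoRA-style NF4) quantization through Hugging Face transformers looks roughly like this; the checkpoint name is a placeholder, and argument names may vary across library versions:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4, as introduced by QLoRA
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls are carried out in bf16
)

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",                    # placeholder checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)
```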

To learn more, refer to my article Post-Training Magic (PTQ), which focuses on the most popular and efficient PTQ techniques.

  • Quantization-aware training (QAT) is a fine-tuning of the model (or of a PTQ model) in which the model is further trained with quantization in mind. The quantization process (scaling, clipping, and rounding) is incorporated into the training process, allowing the model to retain its accuracy even after quantization, which leads to benefits during deployment (lower latency, smaller model size, lower memory footprint).

Quantization-aware training (QAT) is a representative model compression method to leverage redundancy in weights and activations.

Behind the Scenes 🎪

The mechanism of quantization-aware training is simple: it places fake quantization modules, i.e., quantization and dequantization (Q/DQ) modules, at the points where quantization happens during the conversion from a floating-point model to a quantized integer model, in order to simulate the effects of the clamping and rounding brought by integer quantization. The fake quantization modules also monitor the scales and zero points of the weights and activations. Once quantization-aware training is finished, the floating-point model can be converted to a quantized integer model immediately using the information stored in the fake quantization modules. This training optimizes the model weights to preserve performance on downstream tasks by emulating inference-time quantization.

Essentially, during quantization-aware training, the forward pass emulates low-precision behavior, while the backward pass remains unchanged. This introduces quantization errors that accumulate in the model’s total loss, so the optimizer strives to minimize them by adjusting the parameters. This makes the parameters more robust to quantization, leading to a nearly lossless process.
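
A minimal sketch of such a fake-quantization step, assuming a fixed scale and zero-point and the usual straight-through estimator (STE) so that gradients ignore the rounding:

```python
import torch

class FakeQuant(torch.autograd.Function):
    """Simulate INT8 rounding/clamping in the forward pass; pass gradients through unchanged."""

    @staticmethod
    def forward(ctx, x, scale, zero_point):
        q = torch.clamp(torch.round(x / scale + zero_point), -128, 127)  # quantize
        return (q - zero_point) * scale                                   # dequantize immediately

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through estimator: treat round/clamp as the identity.
        return grad_output, None, None

x = torch.randn(4, requires_grad=True)
y = FakeQuant.apply(x, 0.05, 0.0)
y.sum().backward()
print(x.grad)  # all ones: the quantization error appears only in the forward pass
```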

There’s no need to worry about building this intricate mechanism from scratch; TensorFlow, PyTorch, and Hugging Face offer these APIs:
Tensorflow
Pytorch
Huggingface
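
For example, a bare-bones eager-mode QAT flow in PyTorch might look like the following; the tiny model and the training loop are placeholders for illustration only:

```python
import torch
import torch.nn as nn

class QATNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()
        self.fc = nn.Linear(128, 10)
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        return self.dequant(self.fc(self.quant(x)))

model = QATNet()
model.qconfig = torch.quantization.get_default_qat_qconfig("fbgemm")
model = torch.quantization.prepare_qat(model.train())   # inserts the fake Q/DQ modules

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
for _ in range(10):                                      # placeholder fine-tuning loop
    x, y = torch.randn(32, 128), torch.randint(0, 10, (32,))
    loss = nn.functional.cross_entropy(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

model_int8 = torch.quantization.convert(model.eval())    # real INT8 model for deployment
```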

Trade-off ⚖️

When training data is scarce and quantization needs to happen quickly, PTQ typically emerges as the top choice. However, this comes at the expense of precision, since PTQ generally mandates the use of at least 4 bits. Even when employing 4 or more bits, PTQ often falls short of QAT in accuracy, making it the less favored choice when accuracy matters. At equivalent bit precision, QAT consistently attains superior accuracy, making it the preferred option.

The primary trade-off with Quantization-Aware Training (QAT), which provides high accuracy, is the extended training duration, typically involving hundreds of epochs, along with the associated computational retraining expenses. Additionally, a substantial training period is essential to avoid overfitting. However, this extended training duration is frequently justified, especially for models intended for long-term deployment, where the benefits in terms of hardware and energy efficiency far outweigh the retraining costs.

Bonus⭐🌟

Noteworthy Related Research

Quantizing large language models (LLMs) through quantization-aware training (QAT) poses significant challenges in maintaining their zero-shot generalization, primarily because selecting an appropriate fine-tuning dataset is critical. Striking a balance between a dataset that aligns with the model’s pre-training distribution and one that isn’t too narrow in domain is essential. Additionally, replicating the original training setup for LLMs is complicated due to their scale and complexity.

In this work, the authors tackle this issue by using data generated by the LLM itself for knowledge distillation. This simple workaround, which they refer to as data-free knowledge distillation, is applicable to any generative model, independent of whether the original training data is available. In addition to quantizing weights and activations, they also quantize the KV cache.

If you liked the article, show your support by clapping for it. Follow me and let’s unravel the mysteries and unleash the potential of AI. Feel free to connect with me on LinkedIn as well!

Happy Parameter-Efficient Fine-Tuning!

