Diving Deeper into the Quantization Realm: Post-Training Magic (PTQ)

Pratima Rathore
6 min read · Aug 30, 2023


Part 2 — Introduction to Post-Training Quantization (PTQ) Methods — GPTQ, GGML, and QLoRA 4-bit Quantization

Crafting Compact Models

This article continues my Quantization series, following Part 1, which covered the fundamentals of PTQ and QAT.

As we discussed in the last article, quantization stands out as a highly efficient approach for compressing Large Language Models (LLMs). In practical terms, the primary objective of quantization is to lower the precision of an LLM’s weights (and sometimes other tensors such as biases or activations), usually moving from 16-bit to 8-bit, 4-bit, or in some cases even 3-bit, while keeping performance deterioration to a minimum.

The two most popular quantization methods for LLMs are GPTQ and 4/8-bit (bitsandbytes) quantization. We will discuss both in detail in this article, along with GGML, a CPU-oriented alternative.

GPTQ: Post-Training Quantization for Generative Pre-trained Transformers

GPTQ falls into the PTQ category, which is particularly interesting for massive models, for which full model training or even fine-tuning can be very expensive.

GPTQ is a one-shot weight quantization method based on approximate second-order information, that is both highly accurate and highly-efficient.

Let’s try to understand this statement, which is taken straight from the GPTQ paper (Frantar et al., 2023).

GPTQ is a post-training quantization (PTQ) method that makes the model smaller with the help of a calibration dataset. The idea behind GPTQ is simple: it quantizes each weight by finding a compressed version of that weight that yields the minimum mean squared error on the layer’s outputs. The GPTQ algorithm calibrates the quantized weights by running inference over a small set of calibration samples.

The effectiveness of quantization depends greatly on the calibration samples used to evaluate and refine its quality. These samples serve as the basis for comparing the outputs of the original and quantized models; using more samples allows for more precise and meaningful comparisons, which in turn improves the quality of the quantization.
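To make the core idea concrete, here is a tiny, illustrative NumPy sketch of the layer-wise objective being minimized: given calibration inputs X for a layer with weights W, find quantized weights W_q so that ||W·X − W_q·X||² stays small. This is not GPTQ itself; it only shows the objective, using naive round-to-nearest for contrast. GPTQ improves on round-to-nearest by using approximate second-order (Hessian) information to compensate for each rounding error with updates to the remaining weights.

```python
import numpy as np

# Toy illustration of the layer-wise objective GPTQ minimizes:
# find quantized weights W_q such that || W @ X - W_q @ X ||^2 is small,
# where X is a batch of calibration inputs for that layer.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 128)).astype(np.float32)   # original fp weights
X = rng.standard_normal((128, 32)).astype(np.float32)   # calibration activations

# Naive round-to-nearest 4-bit quantization (per-tensor), shown only for contrast;
# GPTQ instead uses second-order information to correct for each rounding error.
scale = np.abs(W).max() / 7.0                            # int4 range: [-8, 7]
W_q = np.clip(np.round(W / scale), -8, 7) * scale

err = np.mean((W @ X - W_q @ X) ** 2)
print(f"layer-wise reconstruction MSE: {err:.6f}")
```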

Under the hood 🧙🏽

GPTQ adopts a hybrid quantization scheme in which model weights are quantized to int4 while activations are retained in float16. Weights are dynamically dequantized during inference, and the actual computation is performed in float16.

The benefits of this scheme are twofold:
* Memory savings close to x4 for int4 quantization, as the dequantization happens close to the compute unit in a fused kernel, and not in the GPU global memory.
* Potential speedups thanks to the time saved on data communication due to the lower bitwidth used for weights.

In GPTQ, we apply quantization once, after training, and this results in both memory savings and inference speedups.

You may have come across the term AutoGPTQ 🪄

The AutoGPTQ library is the one-stop library for efficiently leveraging GPTQ for LLMs. Hugging Face has integrated AutoGPTQ into Transformers (via the Optimum library), making it possible for users to quantize and run models in 8, 4, 3, or even 2-bit precision using the GPTQ algorithm.
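As a quick illustration, here is a minimal sketch of what quantizing a model through this integration can look like, assuming transformers, optimum, auto-gptq, and accelerate are installed; the model id and calibration dataset below are just examples.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-125m"  # small example model, swap in your own
tokenizer = AutoTokenizer.from_pretrained(model_id)

# 4-bit GPTQ quantization, calibrated on the "c4" dataset supported by the integration
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

# Quantization happens while loading: weights are quantized layer by layer
# using the calibration data, after which the int4 model is ready for inference.
quantized_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=gptq_config,
    device_map="auto",
)

# Optionally save the quantized weights so you never have to re-calibrate.
quantized_model.save_pretrained("opt-125m-gptq")
tokenizer.save_pretrained("opt-125m-gptq")
```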

Want to try it out 👩🏽‍🏭- Colab Notebook - Implementation of GPTQ

You can find GPTQ quantized models here🤗

GGML — A CPU-Optimized Version

Big shoutout to The-Bloke, who has graciously quantized these models in GGML/GPTQ format to further serve the AI community.

GGML is a C library for machine learning. In addition to defining low-level machine learning primitives (like a tensor type), GGML defines a binary format for distributing LLMs.

GGML presents an alternative approach to quantization with a focus on optimizing CPU performance. While it follows the same core principles, it employs a distinct underlying implementation. As a general guideline, if your setup involves NVIDIA hardware and your entire model fits within VRAM, GPTQ is likely to offer better speed. On the other hand, for Apple or Intel hardware, GGML is more likely to deliver faster performance.

GGML models have been optimized to run well on CPUs, which allows for good model performance even without a GPU. The trade-off is that they load into system RAM rather than GPU VRAM, so expect higher RAM usage than a GPU-resident model.
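For example, here is a minimal sketch of running a GGML-format model on CPU with the llama-cpp-python bindings. The model file name is hypothetical; download a GGML build of a Llama-family model (e.g. from The-Bloke's repositories), and note that newer llama.cpp releases have since moved to the GGUF format.

```python
from llama_cpp import Llama

# Load a locally downloaded GGML model entirely on CPU.
llm = Llama(
    model_path="./llama-2-7b-chat.ggmlv3.q4_0.bin",  # hypothetical local file
    n_ctx=2048,     # context window size
    n_threads=8,    # number of CPU threads to use
)

# Simple completion call; output follows the OpenAI-style response layout.
output = llm(
    "Q: Explain post-training quantization in one sentence. A:",
    max_tokens=64,
    stop=["Q:"],
)
print(output["choices"][0]["text"])
```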

Want to try it out👨🏽‍🏭 - Colab Notebook -Implementation of GGML

You can find GGML quantized models here 🤗

QLoRA Quantization

Hugging Face collaborated with bitsandbytes to make models even more accessible to everyone.

QLoRA quantization comes from the QLoRA paper, which describes a very efficient method for fine-tuning pretrained LLMs with adapters.

But that doesn’t mean QLoRA is limited to fine-tuning. Certain elements of QLoRA can also be employed during inference, excluding the LoRA aspect, which introduces trainable adapters for individual layers and is not relevant when nothing is being trained. During inference, the remaining concepts introduced by QLoRA still apply to curbing the memory footprint of Large Language Models (LLMs):
* NF4 quantization
* Double quantization, which applies a second quantization to the first round’s quantization constants and saves an additional ~0.4 bits per parameter (see the quick arithmetic after this list)
* Computation with bf16, which dequantizes on the fly to bf16 without much impact on memory usage but may improve the results
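To see where the roughly 0.4 bits per parameter comes from, here is the back-of-the-envelope arithmetic from the QLoRA paper spelled out as a tiny script, using block sizes of 64 for the weights and 256 for the second-level quantization as in the paper.

```python
# Rough arithmetic behind double quantization's ~0.4 bits/parameter saving
# (block sizes taken from the QLoRA paper; this just spells out the calculation).

block_size_1 = 64    # weights are quantized in blocks of 64, each with one constant
block_size_2 = 256   # those constants are themselves quantized in blocks of 256

# Without double quantization: one fp32 constant per 64 weights.
bits_per_param_single = 32 / block_size_1                      # 0.5 bits/parameter

# With double quantization: constants stored in 8-bit, plus one fp32
# constant per block of 256 first-level constants.
bits_per_param_double = 8 / block_size_1 + 32 / (block_size_1 * block_size_2)

print(bits_per_param_single)                                     # 0.5
print(round(bits_per_param_double, 3))                           # ~0.127
print(round(bits_per_param_single - bits_per_param_double, 3))   # ~0.373 bits saved
```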

bitsandbytes offers two variants of 4-bit quantization: NF4 (NormalFloat4, the data type QLoRA is built around) and pure FP4. Based on theoretical considerations and empirical results from the paper, NF4 quantization is recommended for better performance.

You don’t need to fine-tune with QLoRA to do inference with QLoRA.
The convenient integration of the 4-bit NormalFloat (NF4) data type is the main advantage of bitsandbytes over GPTQ.

Also keep in mind that the computation is not done in 4-bit: the weights are stored in that compressed format, but they are dequantized at compute time and the actual computation is still carried out in the desired or native dtype.
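Putting the pieces together, here is a minimal sketch of loading a model with NF4 weights, double quantization, and bf16 compute via bitsandbytes and Transformers; the model id is just an example.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "facebook/opt-1.3b"  # example model, swap in your own

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4-bit
    bnb_4bit_quant_type="nf4",              # NormalFloat4 instead of plain FP4
    bnb_4bit_use_double_quant=True,         # quantize the quantization constants too
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to bf16 for the matmuls
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("Quantization lets us", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=30)[0]))
```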

Want to try it out👨🏽‍🏭 — Article with code snippets and detailed explanations, Colab Notebook - Implementation of 4-bit quantization

Conclusion

To sum up, these post-training quantization (PTQ) methods add significant value by improving memory utilization, speeding up computation, and making it possible to deploy language models on budget-friendly servers or even personal devices.

As for which PTQ method is the most optimal, I could not arrive at a definitive conclusion. For insights into this comparison, you can refer to the article GPTQ versus QLoRA, where both techniques are extensively evaluated on Llama.

In the next article under this series we will talk about quantization aware training (QAT) for LLMs to push quantization levels even further.

If you liked the article, show your support by clapping for it. Follow me and let’s unravel the mysteries and unleash the potential of AI. Feel free to connect with me on LinkedIn as well!

Happy Parameter-Efficient Fine-Tuning!

References

https://towardsdatascience.com/4-bit-quantization-with-gptq-36b0f4f02c34

https://huggingface.co/blog/gptq-integration

https://huggingface.co/blog/merve/quantization
