QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs

Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian Croci, Bo Li, Pashmina Cameron, Martin Jaggi, Dan Alistarh, Torsten Hoefler, James Hensman
Large language models (LLMs) often have billions of parameters, typically stored as 16-bit floating-point numbers. Quantising these weights to lower precision (e.g., 4-bit integers) offers significant advantages, including reduced memory usage, lower computational requirements and improved energy efficiency.
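To put rough numbers on the memory saving, here is a back-of-the-envelope sketch (assuming a hypothetical 70-billion-parameter model and counting weight storage only, ignoring activations and the KV cache):

```python
# Approximate weight-memory footprint of a 70B-parameter model at two precisions.
params = 70e9

for bits in (16, 4):
    gigabytes = params * bits / 8 / 1e9
    print(f"{bits:>2}-bit weights: ~{gigabytes:.0f} GB")

# 16-bit weights: ~140 GB
#  4-bit weights: ~35 GB  (a 4x reduction before any quantisation overheads)
```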
In the extreme case, models like BitNet [1] store weights as 1.58-bit values {-1, 0, 1}. However, the weights and activations of LLMs often contain large outliers, which makes low-bit quantisation challenging. In this paper, the authors propose a novel approach: they preprocess the weight matrices with randomised Hadamard transformations (rotations). These rotations effectively remove the outliers, enabling more efficient quantisation.
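As a minimal NumPy/SciPy sketch of the idea (not the authors' implementation), the snippet below builds a random-sign Hadamard rotation, shows that it spreads a few outlier channels across all dimensions, and confirms that folding the counter-rotation into the weights leaves the layer output unchanged; the hidden size and outlier pattern are invented for illustration:

```python
import numpy as np
from scipy.linalg import hadamard

rng = np.random.default_rng(0)
d = 256  # hidden size; Hadamard matrices exist for powers of two

# Randomised Hadamard rotation Q = H @ diag(signs) / sqrt(d); Q is orthogonal.
H = hadamard(d).astype(np.float64)
signs = rng.choice([-1.0, 1.0], size=d)
Q = (H * signs) / np.sqrt(d)

# A toy activation matrix with a few large outlier channels, as observed in LLMs.
X = rng.normal(size=(32, d))
X[:, :4] *= 50.0

X_rot = X @ Q
print("max |x| before rotation:", np.abs(X).max())      # dominated by the outliers
print("max |x| after  rotation:", np.abs(X_rot).max())  # spread across all channels

# The rotation can be absorbed into the next weight matrix, so outputs are unchanged:
W = rng.normal(size=(d, d))
y_ref = X @ W
y_rot = X_rot @ (Q.T @ W)   # rotated activations times counter-rotated weights
print("outputs match:", np.allclose(y_ref, y_rot))      # True
```

Because Q is orthogonal, inserting Q and its transpose around a linear layer is mathematically a no-op, which is why the rotation can be applied without retraining the model.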
Using the GPTQ algorithm [2], which quantises weights without requiring model retraining, the authors achieve end-to-end 4-bit quantisation. They demonstrate that their approach preserves model performance (e.g., only a minimal increase in text perplexity), with large models such as LLaMA2-70B exhibiting the smallest performance drop. Furthermore, their method delivers up to a 3x speedup during inference and a 3.5x reduction in peak memory usage.
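To see why the rotation helps quantisation, the toy example below compares naive 4-bit round-to-nearest error on a weight matrix with and without the rotation (round-to-nearest stands in for GPTQ purely for brevity; the matrix sizes and outlier pattern are invented):

```python
import numpy as np
from scipy.linalg import hadamard

def quantise_rtn(W, bits=4):
    """Symmetric per-tensor round-to-nearest quantisation (a simple stand-in for GPTQ)."""
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(W).max() / levels
    return np.clip(np.round(W / scale), -levels, levels) * scale

rng = np.random.default_rng(0)
d = 256
W = rng.normal(size=(d, d))
W[:4, :] *= 50.0                  # a few outlier rows force a very coarse scale

H = hadamard(d).astype(np.float64)
Q = (H * rng.choice([-1.0, 1.0], size=d)) / np.sqrt(d)
W_rot = Q.T @ W                   # the counter-rotated weights that actually get quantised

err_plain = np.linalg.norm(W - quantise_rtn(W))
err_rot = np.linalg.norm(W_rot - quantise_rtn(W_rot))
print(f"4-bit error, original basis: {err_plain:.0f}")
print(f"4-bit error, rotated basis:  {err_rot:.0f}")   # substantially smaller
```

Since the rotation is orthogonal, the quantisation error in the rotated basis carries over directly to the layer output, so the two numbers are a like-for-like comparison.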
[1] The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits (2024), https://arxiv.org/abs/2402.17764
[2] GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers (2022), https://arxiv.org/abs/2210.17323