Initialisation Determines the Basin: Efficient Codebook Optimisation for Extreme LLM Quantization
Abstract
Additive quantization for LLM compression fails at 2-bit precision largely because of poor codebook initialisation; OA-EM addresses this with an output-aware EM initialisation based on a Hessian-weighted Mahalanobis distance.
Additive quantization enables extreme LLM compression with O(1) lookup-table dequantization, making it attractive for edge deployment. Yet at 2-bit precision, it often fails catastrophically, even with extensive search and finetuning. We show that the dominant bottleneck is codebook initialisation. Greedy sequential initialisation frequently places the model in poor optimisation regions that subsequent beam search and PV-tuning struggle to overcome. We analyse this behaviour through the representational ratio ρ = N/KM, which characterises the relationship between weight groups and codebook capacity, and propose OA-EM, an output-aware EM initialisation method using Hessian-weighted Mahalanobis distance. Across compression rates, search budgets, and three architectures (Llama 3.2 3B, Llama 3.1 8B, Qwen 2.5 3B), OA-EM consistently produces better solutions after PV-tuning and dominates the quality-compute frontier. The severity of the bottleneck scales with ρ: moderate at 3 bpp but extreme at 2 bpp, where poor initialisation can degrade perplexity by orders of magnitude. More broadly, our results highlight the importance of optimisation geometry in compressed model spaces, where initialisation can dominate subsequent search and fine-tuning.
Community
Fixing catastrophic degradation in 2-bit LLMs (Llama 3.1, 3.2 & Qwen 2.5 2-bit weights inside)
Catastrophic degradation in 2-bit LLMs isn't a compute problem; it’s an initialisation problem. At 2 bits per parameter, additive quantization (like AQLM) hits an "undercomplete regime." Weight groups compete for starved codebook capacity, and standard greedy initialisation traps the model in a bad optimisation basin.
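To make the "starved codebook capacity" point concrete, here is a back-of-the-envelope sketch of the representational ratio ρ = N/KM from the abstract. The layer and codebook sizes below are hypothetical, and our reading of the symbols (N = weight groups, K = codewords per codebook, M = codebooks) is inferred from context:

```python
# Illustrative sketch (not the paper's code): the representational ratio
# rho = N / (K * M). Larger rho means more weight groups competing for
# the same codebook capacity -- the "undercomplete regime" at 2 bpp.

def representational_ratio(n_groups: int, codebook_size: int, n_codebooks: int) -> float:
    """rho = N / (K * M): weight groups per unit of codebook capacity."""
    return n_groups / (codebook_size * n_codebooks)

# Hypothetical example: a 4096x4096 weight matrix split into groups of 8
# consecutive weights, quantized with a single 2^16-entry codebook
# (16 bits per group of 8 weights, i.e. roughly 2 bits per parameter).
n_groups = 4096 * 4096 // 8  # N = 2,097,152 groups
rho = representational_ratio(n_groups, codebook_size=2**16, n_codebooks=1)
print(f"rho at ~2 bpp: {rho:.1f}")  # 32.0 -- many groups per codeword
```

At 3 bpp the same layer would get far more codebook capacity per group, which matches the paper's observation that the initialisation bottleneck is moderate at 3 bpp but extreme at 2 bpp.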
We introduce Output-Aware Expectation-Maximisation (OA-EM). On Llama 3.2 3B at 2 bpp, OA-EM reaches a post-PV-tuning perplexity of 11.53 in just 6.1 h, beating the greedy wide-beam baseline (12.01), which takes 16.9 h.
The best part: because we keep free-form codebooks, you get O(1) LUT dequantization with EXACTLY ZERO MAC operations. Pure memory reads are perfectly suited to edge deployment with a 4096-token context window.
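For readers unfamiliar with additive quantization, a minimal NumPy sketch of the lookup-table dequantization described above: stored indices select one codeword from each of M codebooks, and the gathered codewords are summed to reconstruct a weight group, so decoding is table reads plus additions, with no multiplies. All sizes (M, K, group length) are hypothetical, and this is our illustration, not the released kernels:

```python
import numpy as np

# Minimal sketch of additive-quantization dequantization: each weight
# group is reconstructed as the sum of M codewords selected by its stored
# indices -- pure table lookups and additions, no multiplications.

rng = np.random.default_rng(0)
M, K, group = 2, 256, 8  # hypothetical: 2 codebooks of 256 codewords, groups of 8
codebooks = rng.standard_normal((M, K, group)).astype(np.float32)
codes = rng.integers(0, K, size=(1024, M))  # per-group codeword indices

def dequantize(codes: np.ndarray, codebooks: np.ndarray) -> np.ndarray:
    # Gather one codeword per codebook for every group, then sum across
    # the M codebooks: codebooks[m, codes[g, m]] summed over m.
    return codebooks[np.arange(codebooks.shape[0]), codes].sum(axis=1)

weights = dequantize(codes, codebooks)
print(weights.shape)  # (1024, 8): 1024 reconstructed groups of 8 weights
```

At inference time this gather-and-sum can be fused into the matmul kernel, which is why the free-form codebook formulation stays memory-bound rather than compute-bound on edge hardware.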
We've open-sourced the code and the 2-bit weights on our Hugging Face profile. Happy to answer any questions about the Hessian-weighting or implementation!
This is an automated message from the Librarian Bot. I found the following similar papers, recommended by the Semantic Scholar API:
- RAMP: Reinforcement Adaptive Mixed Precision Quantization for Efficient On Device LLM Inference (2026)
- Bit-by-Bit: Progressive QAT Strategy with Outlier Channel Splitting for Stable Low-Bit LLMs (2026)
- SERQ: Saliency-Aware Low-Rank Error Reconstruction for LLM Quantization (2026)
- AutoQRA: Joint Optimization of Mixed-Precision Quantization and Low-rank Adapters for Efficient LLM Fine-Tuning (2026)
- pQuant: Towards Effective Low-Bit Language Models via Decoupled Linear Quantization-Aware Training (2026)
- 1-Bit Wonder: Improving QAT Performance in the Low-Bit Regime through K-Means Quantization (2026)
- Bielik-Q2-Sharp: A Comparative Study of Extreme 2-bit Quantization Methods for a Polish 11B Language Model (2026)
Models citing this paper: 3
Datasets citing this paper: 0
Spaces citing this paper: 0
Collections including this paper: 0