Initialisation Determines the Basin: Efficient Codebook Optimisation for Extreme LLM Quantization
Abstract
Additive quantization for LLM compression fails at 2-bit precision largely because of poor codebook initialisation; OA-EM addresses this with an output-aware EM initialisation based on a Hessian-weighted Mahalanobis distance.
Additive quantization enables extreme LLM compression with O(1) lookup-table dequantization, making it attractive for edge deployment. Yet at 2-bit precision, it often fails catastrophically, even with extensive search and finetuning. We show that the dominant bottleneck is codebook initialisation. Greedy sequential initialisation frequently places the model in poor optimisation regions that subsequent beam search and PV-tuning struggle to overcome. We analyse this behaviour through the representational ratio ρ = N/KM, which characterises the relationship between weight groups and codebook capacity, and propose OA-EM, an output-aware EM initialisation method using Hessian-weighted Mahalanobis distance. Across compression rates, search budgets, and three architectures (Llama 3.2 3B, Llama 3.1 8B, Qwen 2.5 3B), OA-EM consistently produces better solutions after PV-tuning and dominates the quality-compute frontier. The severity of the bottleneck scales with ρ: moderate at 3 bpp but extreme at 2 bpp, where poor initialisation can degrade perplexity by orders of magnitude. More broadly, our results highlight the importance of optimisation geometry in compressed model spaces, where initialisation can dominate subsequent search and fine-tuning.
Community
Fixing catastrophic degradation in 2-bit LLMs (Llama 3.1, 3.2 & Qwen 2.5 2-bit weights inside)
Catastrophic degradation in 2-bit LLMs isn't a compute problem; it’s an initialisation problem. At 2 bits per parameter, additive quantization (like AQLM) hits an "undercomplete regime." Weight groups compete for starved codebook capacity, and standard greedy initialisation traps the model in a bad optimisation basin.
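To make the "starved codebook capacity" point concrete, here is a back-of-the-envelope sketch of the representational ratio ρ = N/KM from the abstract. The layer and codebook sizes below are hypothetical, and our reading of the symbols (N = weight groups, K = codewords per codebook, M = codebooks) is inferred from context:

```python
# Illustrative sketch (not the paper's code): the representational ratio
# rho = N / (K * M). Larger rho means more weight groups competing for
# the same codebook capacity -- the "undercomplete regime" at 2 bpp.

def representational_ratio(n_groups: int, codebook_size: int, n_codebooks: int) -> float:
    """rho = N / (K * M): weight groups per unit of codebook capacity."""
    return n_groups / (codebook_size * n_codebooks)

# Hypothetical example: a 4096x4096 weight matrix split into groups of 8
# consecutive weights, quantized with a single 2^16-entry codebook
# (16 bits per group of 8 weights, i.e. roughly 2 bits per parameter).
n_groups = 4096 * 4096 // 8  # N = 2,097,152 groups
rho = representational_ratio(n_groups, codebook_size=2**16, n_codebooks=1)
print(f"rho at ~2 bpp: {rho:.1f}")  # 32.0 -- many groups per codeword
```

At 3 bpp the same layer would get far more codebook capacity per group, which matches the paper's observation that the initialisation bottleneck is moderate at 3 bpp but extreme at 2 bpp.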
We introduce Output-Aware Expectation-Maximisation (OA-EM). On Llama 3.2 3B at 2 bpp, OA-EM reaches a post-PV-tuning perplexity of 11.53 in just 6.1 h, beating the greedy wide-beam baseline (12.01), which takes 16.9 h.
The best part: because we keep free-form codebooks, you get O(1) LUT dequantization with EXACTLY ZERO MAC operations. Pure memory reads are perfectly suited to edge deployment with a 4096-token context window.
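For readers unfamiliar with additive quantization, a minimal NumPy sketch of the lookup-table dequantization described above: stored indices select one codeword from each of M codebooks, and the gathered codewords are summed to reconstruct a weight group, so decoding is table reads plus additions, with no multiplies. All sizes (M, K, group length) are hypothetical, and this is our illustration, not the released kernels:

```python
import numpy as np

# Minimal sketch of additive-quantization dequantization: each weight
# group is reconstructed as the sum of M codewords selected by its stored
# indices -- pure table lookups and additions, no multiplications.

rng = np.random.default_rng(0)
M, K, group = 2, 256, 8  # hypothetical: 2 codebooks of 256 codewords, groups of 8
codebooks = rng.standard_normal((M, K, group)).astype(np.float32)
codes = rng.integers(0, K, size=(1024, M))  # per-group codeword indices

def dequantize(codes: np.ndarray, codebooks: np.ndarray) -> np.ndarray:
    # Gather one codeword per codebook for every group, then sum across
    # the M codebooks: codebooks[m, codes[g, m]] summed over m.
    return codebooks[np.arange(codebooks.shape[0]), codes].sum(axis=1)

weights = dequantize(codes, codebooks)
print(weights.shape)  # (1024, 8): 1024 reconstructed groups of 8 weights
```

At inference time this gather-and-sum can be fused into the matmul kernel, which is why the free-form codebook formulation stays memory-bound rather than compute-bound on edge hardware.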
We've open-sourced the code and the 2-bit weights on our Hugging Face profile. Happy to answer any questions about the Hessian-weighting or implementation!
This is an automated message from the Librarian Bot. I found the following similar papers, recommended by the Semantic Scholar API:
- RAMP: Reinforcement Adaptive Mixed Precision Quantization for Efficient On Device LLM Inference (2026)
- Bit-by-Bit: Progressive QAT Strategy with Outlier Channel Splitting for Stable Low-Bit LLMs (2026)
- SERQ: Saliency-Aware Low-Rank Error Reconstruction for LLM Quantization (2026)
- AutoQRA: Joint Optimization of Mixed-Precision Quantization and Low-rank Adapters for Efficient LLM Fine-Tuning (2026)
- pQuant: Towards Effective Low-Bit Language Models via Decoupled Linear Quantization-Aware Training (2026)
- 1-Bit Wonder: Improving QAT Performance in the Low-Bit Regime through K-Means Quantization (2026)
- Bielik-Q2-Sharp: A Comparative Study of Extreme 2-bit Quantization Methods for a Polish 11B Language Model (2026)
Models citing this paper: 3
Datasets citing this paper: 0
Spaces citing this paper: 0
Collections including this paper: 0