---
license: apache-2.0
datasets:
- Rowan/hellaswag
- lighteval/piqa
- google/boolq
- omarmohamed/arc_easy
metrics:
- accuracy
tags:
- quantization
- kronecker
- second-order
- YAQA
- LLaMA
- Qwen
- efficient
---

# ⚡ FastKronQuantization

**Fast second-order Kronecker-factored quantization for LLMs**

---

## 🧠 Abstract

Quantization with second-order information has shown strong promise for preserving model quality under aggressive compression. Building on the recent [YAQA](https://arxiv.org/abs/2505.22988) framework (Tseng et al., 2025), which employs Kronecker-factored approximations of the Hessian computed via power iteration, we propose an alternative that replaces this step with the more efficient Kronecker decomposition method of Chekalina et al. (2025).

This formulation preserves the benefits of second-order, curvature-aware quantization while substantially reducing computational cost.

We apply our method to **LLaMA-2 7B**, **LLaMA-3 8B Instruct**, and **Qwen-3 8B Instruct** and show that it matches YAQA's post-quantization model quality at a fraction of the cost: the Kronecker factors required for target quality are estimated from **10× fewer tokens**, with approximately a **10× speedup** over the original work.
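
The Kronecker structure is what makes second-order information tractable at LLM scale: the full Hessian over a `d_out × d_in` weight matrix has `(d_out · d_in)²` entries, while the surrogate `A ⊗ B` stores only two small factors and never materializes the full matrix. The NumPy sketch below illustrates the standard identities such methods rely on; the factor shapes and names here are illustrative, not taken from either paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in = 4, 6
A = rng.standard_normal((d_in, d_in))    # input-side factor (illustrative)
B = rng.standard_normal((d_out, d_out))  # output-side factor (illustrative)
dW = rng.standard_normal((d_out, d_in))  # weight perturbation W - W_hat

vec = lambda M: M.flatten(order="F")     # column-stacking vec

# Hessian-vector product without materializing the (d_out*d_in)^2 matrix:
# (A kron B) vec(dW) == vec(B @ dW @ A.T)
assert np.allclose(np.kron(A, B) @ vec(dW), vec(B @ dW @ A.T))

# The quadratic proxy loss of curvature-aware quantization collapses to a
# cheap trace expression in the two factors:
quad_full = vec(dW) @ np.kron(A, B) @ vec(dW)
quad_fact = np.trace(dW.T @ B @ dW @ A.T)
assert np.allclose(quad_full, quad_fact)
```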

---

## 🧭 Checkpoints

| Model name | Architecture | Bits |
|--------------------------------|---------------------|-------|
| `FastKronQuant-LLaMA2-7B-4bit` | LLaMA-2-7B          | 4-bit |
| `FastKronQuant-LLaMA3-8B-4bit` | LLaMA-3-8B-Instruct | 4-bit |
| `FastKronQuant-Qwen3-8B-4bit`  | Qwen-3-8B           | 4-bit |
| `FastKronQuant-LLaMA2-7B-2bit` | LLaMA-2-7B          | 2-bit |
| `FastKronQuant-LLaMA3-8B-2bit` | LLaMA-3-8B-Instruct | 2-bit |
| `FastKronQuant-Qwen3-8B-2bit`  | Qwen-3-8B           | 2-bit |

Each checkpoint is fully compatible with Hugging Face `transformers` and can be loaded like any standard model.

---

## 🚀 Features

- ⚡ **Fast Kronecker decomposition**: up to 10× faster factor estimation (see the sketch after this list)
- 🧮 **Second-order quantization**: curvature-aware objective preserves model accuracy
- 🪶 Works with popular architectures: **LLaMA-2**, **LLaMA-3**, **Qwen-3**
- 🔌 Compatible with 🤗 `transformers` out of the box
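
The decomposition used in this repo follows Chekalina et al. (2025); as a self-contained illustration of what "Kronecker factor estimation" means, the sketch below uses the classical Van Loan–Pitsianis construction, which finds the nearest Kronecker product to a given matrix with a single rank-1 SVD. It is illustrative only, not the algorithm shipped here.

```python
import numpy as np

def nearest_kronecker(H, p, q, m, n):
    """Best A (p x q), B (m x n) minimizing ||H - kron(A, B)||_F,
    for H of shape (p*m, q*n)."""
    # Rearrange H so that kron(A, B) becomes the rank-1 matrix
    # outer(A.ravel(), B.ravel()), then take the top singular pair.
    R = H.reshape(p, m, q, n).transpose(0, 2, 1, 3).reshape(p * q, m * n)
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    A = np.sqrt(s[0]) * U[:, 0].reshape(p, q)
    B = np.sqrt(s[0]) * Vt[0].reshape(m, n)
    return A, B

# Sanity check: an exact Kronecker product is recovered.
rng = np.random.default_rng(0)
A0, B0 = rng.standard_normal((3, 3)), rng.standard_normal((5, 5))
A, B = nearest_kronecker(np.kron(A0, B0), 3, 3, 5, 5)
assert np.allclose(np.kron(A, B), np.kron(A0, B0))
```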

## 📖 Usage Example (LLaMA-2 7B)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "username/FastKronQuant-LLaMA2-7B-4bit"  # replace with the actual repo
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype="auto",
    device_map="auto",
)

prompt = "What is the capital of France?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## 🧪 Example: ARC-Easy evaluation

```python
from datasets import load_dataset
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM

# Load ARC-Easy (test split)
ds = load_dataset("allenai/ai2_arc", "ARC-Easy", split="test")

# Load the quantized model
repo_id = "username/FastKronQuant-LLaMA2-7B-4bit"  # replace with the actual repo
tok = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype="auto",
    device_map="auto",
)
pipe = pipeline("text-generation", model=model, tokenizer=tok)

# Quick qualitative check on a few questions
for i in range(3):
    q = ds[i]["question"]
    a = pipe(q, max_new_tokens=32)[0]["generated_text"]
    print(f"Q: {q}\nA: {a}\n---")
```
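
The loop above is only a smoke test. Accuracy numbers like those reported below are conventionally obtained by log-likelihood scoring over the answer choices (the `lm-evaluation-harness` protocol). A minimal sketch of that protocol, reusing the hypothetical `repo_id` from above and a small subsample for speed:

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "username/FastKronQuant-LLaMA2-7B-4bit"  # hypothetical, as above
tok = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id, torch_dtype="auto", device_map="auto"
)
model.eval()

ds = load_dataset("allenai/ai2_arc", "ARC-Easy", split="test")

@torch.no_grad()
def choice_logprob(question: str, choice: str) -> float:
    """Sum of the log-probs the model assigns to the answer tokens."""
    prompt = f"Question: {question}\nAnswer:"
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    ids = tok(prompt + " " + choice, return_tensors="pt").input_ids.to(model.device)
    logits = model(ids).logits[0, :-1]        # position t predicts token t+1
    logprobs = torch.log_softmax(logits.float(), dim=-1)
    targets = ids[0, 1:]
    start = prompt_len - 1                    # score only the answer tokens
    return logprobs[start:].gather(1, targets[start:, None]).sum().item()

n, correct = 50, 0                            # small subsample for a quick check
for ex in ds.select(range(n)):
    scores = [choice_logprob(ex["question"], c) for c in ex["choices"]["text"]]
    pred = ex["choices"]["label"][scores.index(max(scores))]
    correct += pred == ex["answerKey"]
print(f"ARC-Easy accuracy on {n} examples: {correct / n:.3f}")
```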

If you use these model checkpoints in your experiments, please cite:

```bibtex
@misc{chekalina2025gfwsvd,
  title={Generalized Fisher-Weighted SVD: Scalable Kronecker-Factored Fisher Approximation for Compressing Large Language Models},
  author={Viktoriia Chekalina and Daniil Moskovskiy and Daria Cherniuk and Maxim Kurkin and Andrey Kuznetsov and Evgeny Frolov},
  year={2025},
  eprint={2505.17974},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2505.17974},
}
```

## 📊 Zero-shot results: LLaMA-3 8B

### 💡 4-bit Quantization

| Method | Steps | ARC_c ↑ | BoolQ ↑ | PIQA ↑ | ARC_e ↑ | HSwag ↑ | AVG ↑ | GPU-h ↓ | Tokens ↓ |
|-------------------|-------|------------|------------|------------|------------|------------|------------|---------|----------|
| 16-bit (baseline) | –     | **0.5171** | **0.8409** | **0.7986** | **0.8177** | **0.5908** | **0.7131** | –       | –        |
| 4-bit Sketch A    | 4096  | **0.5136** | **0.8443** | 0.7997     | 0.8198     | **0.5865** | 0.7127     | 92      | 16 M     |
| 4-bit FastKron    | 75    | 0.5116     | 0.8438     | **0.8025** | **0.8207** | 0.5863     | **0.7129** | 9.5     | 712 K    |
| 4-bit No Hess     | –     | 0.5119     | 0.8415     | 0.7959     | 0.8097     | 0.5859     | 0.7112     | –       | –        |

### 🔋 2-bit Quantization

| Method | Steps | ARC_c ↑ | BoolQ ↑ | PIQA ↑ | ARC_e ↑ | HSwag ↑ | AVG ↑ | GPU-h ↓ | Tokens ↓ |
|-----------------|-------|------------|------------|------------|------------|------------|------------|---------|----------|
| 2-bit Sketch A  | 4096  | **0.4312** | 0.7567     | 0.7647     | 0.7391     | **0.5259** | 0.6435     | 92      | 16 M     |
| 2-bit FastKron  | 100   | 0.4277     | **0.7646** | **0.7661** | **0.7468** | 0.5159     | **0.6442** | 11.5    | 950 K    |
| 2-bit No Hess   | –     | 0.2363     | 0.6336     | 0.6554     | 0.5108     | 0.3620     | 0.5094     | –       | –        |

## 📊 Zero-shot results: Qwen-3 8B

### 💡 4-bit Quantization

| Method | Steps | ARC_c ↑ | BoolQ ↑ | PIQA ↑ | ARC_e ↑ | HSwag ↑ | AVG ↑ | GPU-h ↓ | Tokens ↓ |
|-------------------|-------|------------|------------|------------|------------|------------|------------|---------|----------|
| 16-bit (baseline) | –     | **0.5563** | **0.8682** | **0.7677** | **0.8354** | **0.5708** | **0.7197** | –       | –        |
| 4-bit Sketch A    | 4096  | **0.5503** | 0.8611     | 0.7612     | 0.8324     | 0.5601     | **0.7132** | 84      | 8 M      |
| 4-bit FastKron    | 150   | 0.5469     | 0.8667     | 0.7601     | **0.8287** | **0.5637** | **0.7132** | 42      | 712 K    |
| 4-bit No Hess     | –     | 0.5467     | **0.8675** | **0.7622** | 0.8312     | 0.5585     | **0.7132** | –       | –        |

### 🔋 2-bit Quantization

| Method | Steps | ARC_c ↑ | BoolQ ↑ | PIQA ↑ | ARC_e ↑ | HSwag ↑ | AVG ↑ | GPU-h ↓ | Tokens ↓ |
|-----------------|-------|------------|------------|------------|------------|------------|------------|---------|----------|
| 2-bit Sketch A  | 4096  | 0.4536     | 0.7782     | **0.7435** | **0.7797** | 0.4611     | 0.6432     | 84      | 8 M      |
| 2-bit FastKron  | 150   | **0.4616** | 0.8416     | 0.7334     | 0.7702     | **0.4853** | **0.6584** | 42      | 712 K    |
| 2-bit No Hess   | –     | 0.3993     | **0.8675** | 0.7743     | 0.7003     | 0.4758     | 0.6434     | –       | –        |

## 📊 Zero-shot results: LLaMA-2 7B

### 💡 4-bit Quantization

| Method | Steps | ARC_c ↑ | BoolQ ↑ | PIQA ↑ | ARC_e ↑ | HSwag ↑ | AVG ↑ | GPU-h ↓ | Tokens ↓ |
|-------------------|-------|------------|------------|------------|------------|------------|------------|---------|----------|
| 16-bit (baseline) | –     | **0.4325** | **0.7767** | **0.7774** | **0.7617** | **0.5721** | **0.6640** | –       | –        |
| 4-bit Sketch A    | 4096  | 0.4274     | 0.7688     | 0.7752     | **0.7613** | **0.5672** | 0.6599     | 50      | 16 M     |
| 4-bit FastKron    | 75    | 0.4283     | 0.7792     | **0.7802** | 0.7610     | 0.5660     | 0.6629     | 5       | 712 K    |
| 4-bit No Hess     | –     | **0.4352** | **0.7875** | 0.7742     | 0.7609     | 0.5628     | **0.6641** | –       | –        |

### 🔋 2-bit Quantization

| Method | Steps | ARC_c ↑ | BoolQ ↑ | PIQA ↑ | ARC_e ↑ | HSwag ↑ | AVG ↑ | GPU-h ↓ | Tokens ↓ |
|-----------------|-------|------------|------------|------------|------------|------------|------------|---------|----------|
| 2-bit Sketch A  | 4096  | 0.3805     | 0.7333     | 0.7562     | **0.7192** | **0.5227** | 0.6223     | 50      | 16 M     |
| 2-bit FastKron  | 150   | **0.3843** | **0.7510** | **0.7600** | 0.7112     | 0.5139     | **0.6240** | 6       | 1400 K   |
| 2-bit No Hess   | –     | 0.2210     | 0.6355     | 0.6306     | 0.5152     | 0.3422     | 0.4689     | –       | –        |