---
license: apache-2.0
datasets:
- Rowan/hellaswag
- lighteval/piqa
- google/boolq
- omarmohamed/arc_easy
metrics:
- accuracy
tags:
- quantization
- kronecker
- second-order
- YAQA
- LLaMA
- Qwen
- efficient
---

# ⚡ FastKronQuantization

**Fast second-order Kronecker-factored quantization for LLMs**

---

## 🧠 Abstract

Quantization with second-order information has shown strong promise for preserving model quality under aggressive compression. Building on the recent [YAQA](https://arxiv.org/abs/2505.22988) framework (Tseng et al., 2025), which employs Kronecker-factored approximations of the Hessian computed via power iteration, we propose an alternative that replaces this step with the more efficient Kronecker decomposition method of Chekalina et al. (2025).

This formulation preserves the benefits of second-order, curvature-aware quantization while substantially reducing computational cost.

We apply our method to **LLaMA-2 7B**, **LLaMA-3 8B Instruct**, and **Qwen-3 8B Instruct** and show that it matches YAQA's post-quantization model quality at a fraction of the cost: the Kronecker factors required for target quality are estimated from **10× fewer tokens**, with approximately a **10× speedup** over the original work.
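
The Kronecker structure is what makes second-order information tractable at LLM scale: the full Hessian over a `d_out × d_in` weight matrix has `(d_out · d_in)²` entries, while the surrogate `A ⊗ B` stores only two small factors and never materializes the full matrix. The NumPy sketch below illustrates the standard identities such methods rely on; the factor shapes and names here are illustrative, not taken from either paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in = 4, 6
A = rng.standard_normal((d_in, d_in))    # input-side factor (illustrative)
B = rng.standard_normal((d_out, d_out))  # output-side factor (illustrative)
dW = rng.standard_normal((d_out, d_in))  # weight perturbation W - W_hat

vec = lambda M: M.flatten(order="F")     # column-stacking vec

# Hessian-vector product without materializing the (d_out*d_in)^2 matrix:
# (A kron B) vec(dW) == vec(B @ dW @ A.T)
assert np.allclose(np.kron(A, B) @ vec(dW), vec(B @ dW @ A.T))

# The quadratic proxy loss of curvature-aware quantization collapses to a
# cheap trace expression in the two factors:
quad_full = vec(dW) @ np.kron(A, B) @ vec(dW)
quad_fact = np.trace(dW.T @ B @ dW @ A.T)
assert np.allclose(quad_full, quad_fact)
```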

---

## 🧭 Checkpoints

| Model name | Architecture | Bits |
|--------------------------------|---------------------|-------|
| `FastKronQuant-LLaMA2-7B-4bit` | LLaMA-2-7B          | 4-bit |
| `FastKronQuant-LLaMA3-8B-4bit` | LLaMA-3-8B-Instruct | 4-bit |
| `FastKronQuant-Qwen3-8B-4bit`  | Qwen-3-8B           | 4-bit |
| `FastKronQuant-LLaMA2-7B-2bit` | LLaMA-2-7B          | 2-bit |
| `FastKronQuant-LLaMA3-8B-2bit` | LLaMA-3-8B-Instruct | 2-bit |
| `FastKronQuant-Qwen3-8B-2bit`  | Qwen-3-8B           | 2-bit |

Each checkpoint is fully compatible with Hugging Face `transformers` and can be loaded like any standard model.

---

## 🚀 Features

- ⚡ **Fast Kronecker decomposition**: up to 10× faster factor estimation (see the sketch after this list)
- 🧮 **Second-order quantization**: curvature-aware objective preserves model accuracy
- 🪶 Works with popular architectures: **LLaMA-2**, **LLaMA-3**, **Qwen-3**
- 🔌 Compatible with 🤗 `transformers` out of the box
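
The decomposition used in this repo follows Chekalina et al. (2025); as a self-contained illustration of what "Kronecker factor estimation" means, the sketch below uses the classical Van Loan–Pitsianis construction, which finds the nearest Kronecker product to a given matrix with a single rank-1 SVD. It is illustrative only, not the algorithm shipped here.

```python
import numpy as np

def nearest_kronecker(H, p, q, m, n):
    """Best A (p x q), B (m x n) minimizing ||H - kron(A, B)||_F,
    for H of shape (p*m, q*n)."""
    # Rearrange H so that kron(A, B) becomes the rank-1 matrix
    # outer(A.ravel(), B.ravel()), then take the top singular pair.
    R = H.reshape(p, m, q, n).transpose(0, 2, 1, 3).reshape(p * q, m * n)
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    A = np.sqrt(s[0]) * U[:, 0].reshape(p, q)
    B = np.sqrt(s[0]) * Vt[0].reshape(m, n)
    return A, B

# Sanity check: an exact Kronecker product is recovered.
rng = np.random.default_rng(0)
A0, B0 = rng.standard_normal((3, 3)), rng.standard_normal((5, 5))
A, B = nearest_kronecker(np.kron(A0, B0), 3, 3, 5, 5)
assert np.allclose(np.kron(A, B), np.kron(A0, B0))
```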

## 📖 Usage Example (LLaMA-2 7B)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "username/FastKronQuant-LLaMA2-7B-4bit"  # replace with the actual repo
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype="auto",
    device_map="auto",
)

prompt = "What is the capital of France?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## 🧪 Example: ARC-Easy evaluation

```python
from datasets import load_dataset
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM

# Load ARC-Easy (test split)
ds = load_dataset("allenai/ai2_arc", "ARC-Easy", split="test")

# Load the quantized model
repo_id = "username/FastKronQuant-LLaMA2-7B-4bit"  # replace with the actual repo
tok = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype="auto",
    device_map="auto",
)
pipe = pipeline("text-generation", model=model, tokenizer=tok)

# Quick qualitative check on a few questions
for i in range(3):
    q = ds[i]["question"]
    a = pipe(q, max_new_tokens=32)[0]["generated_text"]
    print(f"Q: {q}\nA: {a}\n---")
```
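
The loop above is only a smoke test. Accuracy numbers like those reported below are conventionally obtained by log-likelihood scoring over the answer choices (the `lm-evaluation-harness` protocol). A minimal sketch of that protocol, reusing the hypothetical `repo_id` from above and a small subsample for speed:

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "username/FastKronQuant-LLaMA2-7B-4bit"  # hypothetical, as above
tok = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id, torch_dtype="auto", device_map="auto"
)
model.eval()

ds = load_dataset("allenai/ai2_arc", "ARC-Easy", split="test")

@torch.no_grad()
def choice_logprob(question: str, choice: str) -> float:
    """Sum of the log-probs the model assigns to the answer tokens."""
    prompt = f"Question: {question}\nAnswer:"
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    ids = tok(prompt + " " + choice, return_tensors="pt").input_ids.to(model.device)
    logits = model(ids).logits[0, :-1]        # position t predicts token t+1
    logprobs = torch.log_softmax(logits.float(), dim=-1)
    targets = ids[0, 1:]
    start = prompt_len - 1                    # score only the answer tokens
    return logprobs[start:].gather(1, targets[start:, None]).sum().item()

n, correct = 50, 0                            # small subsample for a quick check
for ex in ds.select(range(n)):
    scores = [choice_logprob(ex["question"], c) for c in ex["choices"]["text"]]
    pred = ex["choices"]["label"][scores.index(max(scores))]
    correct += pred == ex["answerKey"]
print(f"ARC-Easy accuracy on {n} examples: {correct / n:.3f}")
```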

If you use these model checkpoints in your experiments, please cite:

```bibtex
@misc{chekalina2025gfwsvd,
  title={Generalized Fisher-Weighted SVD: Scalable Kronecker-Factored Fisher Approximation for Compressing Large Language Models},
  author={Viktoriia Chekalina and Daniil Moskovskiy and Daria Cherniuk and Maxim Kurkin and Andrey Kuznetsov and Evgeny Frolov},
  year={2025},
  eprint={2505.17974},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2505.17974},
}
```

## 📊 Zero-shot results: LLaMA-3 8B

### 💡 4-bit Quantization

| Method | Steps | ARC_c ↑ | BoolQ ↑ | PIQA ↑ | ARC_e ↑ | HSwag ↑ | AVG ↑ | GPU-h ↓ | Tokens ↓ |
|-------------------|-------|------------|------------|------------|------------|------------|------------|---------|----------|
| 16-bit (baseline) | –     | **0.5171** | **0.8409** | **0.7986** | **0.8177** | **0.5908** | **0.7131** | –       | –        |
| 4-bit Sketch A    | 4096  | **0.5136** | **0.8443** | 0.7997     | 0.8198     | **0.5865** | 0.7127     | 92      | 16 M     |
| 4-bit FastKron    | 75    | 0.5116     | 0.8438     | **0.8025** | **0.8207** | 0.5863     | **0.7129** | 9.5     | 712 K    |
| 4-bit No Hess     | –     | 0.5119     | 0.8415     | 0.7959     | 0.8097     | 0.5859     | 0.7112     | –       | –        |

### 🔋 2-bit Quantization

| Method | Steps | ARC_c ↑ | BoolQ ↑ | PIQA ↑ | ARC_e ↑ | HSwag ↑ | AVG ↑ | GPU-h ↓ | Tokens ↓ |
|-----------------|-------|------------|------------|------------|------------|------------|------------|---------|----------|
| 2-bit Sketch A  | 4096  | **0.4312** | 0.7567     | 0.7647     | 0.7391     | **0.5259** | 0.6435     | 92      | 16 M     |
| 2-bit FastKron  | 100   | 0.4277     | **0.7646** | **0.7661** | **0.7468** | 0.5159     | **0.6442** | 11.5    | 950 K    |
| 2-bit No Hess   | –     | 0.2363     | 0.6336     | 0.6554     | 0.5108     | 0.3620     | 0.5094     | –       | –        |

## 📊 Zero-shot results: Qwen-3 8B

### 💡 4-bit Quantization

| Method | Steps | ARC_c ↑ | BoolQ ↑ | PIQA ↑ | ARC_e ↑ | HSwag ↑ | AVG ↑ | GPU-h ↓ | Tokens ↓ |
|-------------------|-------|------------|------------|------------|------------|------------|------------|---------|----------|
| 16-bit (baseline) | –     | **0.5563** | **0.8682** | **0.7677** | **0.8354** | **0.5708** | **0.7197** | –       | –        |
| 4-bit Sketch A    | 4096  | **0.5503** | 0.8611     | 0.7612     | 0.8324     | 0.5601     | **0.7132** | 84      | 8 M      |
| 4-bit FastKron    | 150   | 0.5469     | 0.8667     | 0.7601     | **0.8287** | **0.5637** | **0.7132** | 42      | 712 K    |
| 4-bit No Hess     | –     | 0.5467     | **0.8675** | **0.7622** | 0.8312     | 0.5585     | **0.7132** | –       | –        |

### 🔋 2-bit Quantization

| Method | Steps | ARC_c ↑ | BoolQ ↑ | PIQA ↑ | ARC_e ↑ | HSwag ↑ | AVG ↑ | GPU-h ↓ | Tokens ↓ |
|-----------------|-------|------------|------------|------------|------------|------------|------------|---------|----------|
| 2-bit Sketch A  | 4096  | 0.4536     | 0.7782     | **0.7435** | **0.7797** | 0.4611     | 0.6432     | 84      | 8 M      |
| 2-bit FastKron  | 150   | **0.4616** | 0.8416     | 0.7334     | 0.7702     | **0.4853** | **0.6584** | 42      | 712 K    |
| 2-bit No Hess   | –     | 0.3993     | **0.8675** | 0.7743     | 0.7003     | 0.4758     | 0.6434     | –       | –        |

## 📊 Zero-shot results: LLaMA-2 7B

### 💡 4-bit Quantization

| Method | Steps | ARC_c ↑ | BoolQ ↑ | PIQA ↑ | ARC_e ↑ | HSwag ↑ | AVG ↑ | GPU-h ↓ | Tokens ↓ |
|-------------------|-------|------------|------------|------------|------------|------------|------------|---------|----------|
| 16-bit (baseline) | –     | **0.4325** | **0.7767** | **0.7774** | **0.7617** | **0.5721** | **0.6640** | –       | –        |
| 4-bit Sketch A    | 4096  | 0.4274     | 0.7688     | 0.7752     | **0.7613** | **0.5672** | 0.6599     | 50      | 16 M     |
| 4-bit FastKron    | 75    | 0.4283     | 0.7792     | **0.7802** | 0.7610     | 0.5660     | 0.6629     | 5       | 712 K    |
| 4-bit No Hess     | –     | **0.4352** | **0.7875** | 0.7742     | 0.7609     | 0.5628     | **0.6641** | –       | –        |

### 🔋 2-bit Quantization

| Method | Steps | ARC_c ↑ | BoolQ ↑ | PIQA ↑ | ARC_e ↑ | HSwag ↑ | AVG ↑ | GPU-h ↓ | Tokens ↓ |
|-----------------|-------|------------|------------|------------|------------|------------|------------|---------|----------|
| 2-bit Sketch A  | 4096  | 0.3805     | 0.7333     | 0.7562     | **0.7192** | **0.5227** | 0.6223     | 50      | 16 M     |
| 2-bit FastKron  | 150   | **0.3843** | **0.7510** | **0.7600** | 0.7112     | 0.5139     | **0.6240** | 6       | 1400 K   |
| 2-bit No Hess   | –     | 0.2210     | 0.6355     | 0.6306     | 0.5152     | 0.3422     | 0.4689     | –       | –        |