---
license: apache-2.0
datasets:
  - Rowan/hellaswag
  - lighteval/piqa
  - google/boolq
  - omarmohamed/arc_easy
metrics:
  - accuracy
tags:
  - quantization
  - kronecker
  - second-order
  - YAQA
  - LLaMA
  - Qwen
  - efficient
---

# ⚡ FastKronQuantization

Fast second-order Kronecker-factored quantization for LLMs


## 🧠 Abstract

Quantization with second-order information has shown strong promise for preserving model quality under aggressive compression. Building on the recent YAQA framework (Tseng et al., 2025b), which approximates the Hessian with Kronecker factors estimated by power iteration, we propose an alternative that replaces this step with the more efficient Kronecker decomposition method of Chekalina et al. (2025).
This formulation preserves the benefits of second-order, curvature-aware quantization while substantially reducing computational cost.
We apply our method to LLaMA-2 7B, LLaMA-3 8B Instruct, and Qwen-3 8B Instruct and show that it matches YAQA's post-quantization quality at a fraction of the cost: the Kronecker factors required for the target quality are obtained with 10× fewer tokens and roughly a 10× speedup over the original work.
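
For intuition, the sketch below shows the core building block of such methods: approximating a Hessian-like matrix by a single Kronecker product using Van Loan's rearrangement trick and a rank-1 SVD. This is a generic, hypothetical illustration, not the exact decomposition routine of Chekalina et al. (2025) or the code used to produce these checkpoints.

```python
# Nearest Kronecker-product approximation: find A, B minimizing ||H - kron(A, B)||_F.
# Generic sketch only; the paper's decomposition method may differ in detail.
import numpy as np

def nearest_kron(H, shape_A, shape_B):
    m1, n1 = shape_A
    m2, n2 = shape_B
    assert H.shape == (m1 * m2, n1 * n2)
    # Van Loan rearrangement: the Kronecker structure of H becomes rank-1 structure in R.
    R = H.reshape(m1, m2, n1, n2).transpose(0, 2, 1, 3).reshape(m1 * n1, m2 * n2)
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    A = np.sqrt(s[0]) * U[:, 0].reshape(m1, n1)
    B = np.sqrt(s[0]) * Vt[0].reshape(m2, n2)
    return A, B

# Sanity check on a synthetic matrix with exact Kronecker structure.
A_true, B_true = np.random.randn(4, 4), np.random.randn(8, 8)
H = np.kron(A_true, B_true)
A, B = nearest_kron(H, (4, 4), (8, 8))
print(np.linalg.norm(H - np.kron(A, B)))  # ~0 up to floating-point error
```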


## 🧭 Checkpoints

| Model name | Architecture | Bits |
|---|---|---|
| FastKronQuant-LLaMA2-7B-4bit | LLaMA-2-7B | 4-bit |
| FastKronQuant-LLaMA3-8B-4bit | LLaMA-3-8B-Instruct | 4-bit |
| FastKronQuant-Qwen3-8B-4bit | Qwen-3-8B | 4-bit |
| FastKronQuant-LLaMA2-7B-2bit | LLaMA-2-7B | 2-bit |
| FastKronQuant-LLaMA3-8B-2bit | LLaMA-3-8B-Instruct | 2-bit |
| FastKronQuant-Qwen3-8B-2bit | Qwen-3-8B | 2-bit |

Each checkpoint is fully compatible with Hugging Face transformers and can be loaded like any standard model.


## 📌 Features

- ⚡ Fast Kronecker decomposition: up to 10× faster factor estimation
- 🧮 Second-order quantization: preserves model accuracy
- 🪶 Works with popular architectures: LLaMA-2, LLaMA-3, Qwen-3
- 🔸 Compatible with 🤗 `transformers` out of the box

## 🚀 Usage Example (LLaMA-2 7B)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "username/FastKronQuant-LLaMA2-7B"  # replace with the actual repo
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype="auto",
    device_map="auto",
)

prompt = "What is the capital of France?"
# Move inputs to the model's device so generation works with device_map="auto".
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## 🧪 Example: ARC-Easy evaluation

```python
from datasets import load_dataset
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM

# Load ARC-Easy
ds = load_dataset("ai2_arc", "ARC-Easy")["test"]

# Load the quantized model
repo_id = "username/FastKronQuant-LLaMA2-7B"  # replace with the actual repo
tok = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype="auto",
    device_map="auto",
)
pipe = pipeline("text-generation", model=model, tokenizer=tok)

# Quick qualitative check: generate answers for the first few questions
for i in range(3):
    q = ds[i]["question"]
    a = pipe(q, max_new_tokens=32)[0]["generated_text"]
    print(f"Q: {q}\nA: {a}\n---")
```
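
The loop above only prints generations. To reproduce accuracy numbers like those reported below, multiple-choice benchmarks are usually scored by the log-likelihood the model assigns to each answer choice, taking the highest-scoring choice as the prediction (as in lm-evaluation-harness). The following is a minimal sketch of that procedure, reusing `tok`, `model`, and `ds` from the block above; the exact evaluation setup behind the reported numbers may differ.

```python
import torch

def choice_loglik(question: str, choice: str) -> float:
    """Sum of log-probabilities the model assigns to `choice` given `question`."""
    prompt = f"Question: {question}\nAnswer:"
    n_prompt = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prompt + " " + choice, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        logits = model(full_ids).logits[0]
    # Position i predicts token i + 1: shift by one and keep only the answer tokens.
    log_probs = torch.log_softmax(logits[:-1], dim=-1)
    answer_tokens = full_ids[0, n_prompt:]
    return log_probs[n_prompt - 1:].gather(1, answer_tokens[:, None]).sum().item()

correct = 0
subset = ds.select(range(50))  # small subset for a quick check
for ex in subset:
    scores = [choice_loglik(ex["question"], c) for c in ex["choices"]["text"]]
    pred = ex["choices"]["label"][scores.index(max(scores))]
    correct += int(pred == ex["answerKey"])
print(f"Accuracy on {len(subset)} examples: {correct / len(subset):.3f}")
```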

If you use these model checkpoints in your experiments, please cite:

```bibtex
@misc{chekalina2025gfwsvd,
  title={Generalized Fisher-Weighted SVD: Scalable Kronecker-Factored Fisher Approximation for Compressing Large Language Models},
  author={Viktoriia Chekalina and Daniil Moskovskiy and Daria Cherniuk and Maxim Kurkin and Andrey Kuznetsov and Evgeny Frolov},
  year={2025},
  eprint={2505.17974},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2505.17974},
}
```

## 📊 Zero-shot results: LLaMA-3 8B

### 🟡 4-bit Quantization

| Method | Steps | ARC_c ↑ | BoolQ ↑ | PIQA ↑ | ARC_e ↑ | HSwag ↑ | AVG ↑ | GPU hours ↓ | Tokens ↓ |
|---|---|---|---|---|---|---|---|---|---|
| 16-bit (baseline) | – | 0.5171 | 0.8409 | 0.7986 | 0.8177 | 0.5908 | 0.7131 | – | – |
| 4-bit Sketch A | 4096 | 0.5136 | 0.8443 | 0.7997 | 0.8198 | 0.5865 | 0.7127 | 92 | 16 M |
| 4-bit FastKron | 75 | 0.5116 | 0.8438 | 0.8025 | 0.8207 | 0.5863 | 0.7129 | 9.5 | 712 K |
| 4-bit No Hess | – | 0.5119 | 0.8415 | 0.7959 | 0.8097 | 0.5859 | 0.7112 | – | – |

### 🟠 2-bit Quantization

| Method | Steps | ARC_c ↑ | BoolQ ↑ | PIQA ↑ | ARC_e ↑ | HSwag ↑ | AVG ↑ | GPU hours ↓ | Tokens ↓ |
|---|---|---|---|---|---|---|---|---|---|
| 2-bit Sketch A | 4096 | 0.4312 | 0.7567 | 0.7647 | 0.7391 | 0.5259 | 0.6435 | 92 | 16 M |
| 2-bit FastKron | 100 | 0.4277 | 0.7646 | 0.7661 | 0.7468 | 0.5159 | 0.6442 | 11.5 | 950 K |
| 2-bit No Hess | – | 0.2363 | 0.6336 | 0.6554 | 0.5108 | 0.3620 | 0.5094 | – | – |

## 📊 Zero-shot results: Qwen-3 8B

### 🟡 4-bit Quantization

| Method | Steps | ARC_c ↑ | BoolQ ↑ | PIQA ↑ | ARC_e ↑ | HSwag ↑ | AVG ↑ | GPU hours ↓ | Tokens ↓ |
|---|---|---|---|---|---|---|---|---|---|
| 16-bit (baseline) | – | 0.5563 | 0.8682 | 0.7677 | 0.8354 | 0.5708 | 0.7197 | – | – |
| 4-bit Sketch A | 4096 | 0.5503 | 0.8611 | 0.7612 | 0.8324 | 0.5601 | 0.7132 | 84 | 8 M |
| 4-bit FastKron | 150 | 0.5469 | 0.8667 | 0.7601 | 0.8287 | 0.5637 | 0.7132 | 42 | 712 K |
| 4-bit No Hess | – | 0.5467 | 0.8675 | 0.7622 | 0.8312 | 0.5585 | 0.7132 | – | – |

### 🟠 2-bit Quantization

| Method | Steps | ARC_c ↑ | BoolQ ↑ | PIQA ↑ | ARC_e ↑ | HSwag ↑ | AVG ↑ | GPU hours ↓ | Tokens ↓ |
|---|---|---|---|---|---|---|---|---|---|
| 2-bit Sketch A | 4096 | 0.4536 | 0.7782 | 0.7435 | 0.7797 | 0.4611 | 0.6432 | 84 | 8 M |
| 2-bit FastKron | 150 | 0.4616 | 0.8416 | 0.7334 | 0.7702 | 0.4853 | 0.6584 | 42 | 712 K |
| 2-bit No Hess | – | 0.3993 | 0.8675 | 0.7743 | 0.7003 | 0.4758 | 0.6434 | – | – |

## 📊 Zero-shot results: LLaMA-2 7B

### 🟡 4-bit Quantization

| Method | Steps | ARC_c ↑ | BoolQ ↑ | PIQA ↑ | ARC_e ↑ | HSwag ↑ | AVG ↑ | GPU hours ↓ | Tokens ↓ |
|---|---|---|---|---|---|---|---|---|---|
| 16-bit (baseline) | – | 0.4325 | 0.7767 | 0.7774 | 0.7617 | 0.5721 | 0.6640 | – | – |
| 4-bit Sketch A | 4096 | 0.4274 | 0.7688 | 0.7752 | 0.7613 | 0.5672 | 0.6599 | 50 | 16 M |
| 4-bit FastKron | 75 | 0.4283 | 0.7792 | 0.7802 | 0.7610 | 0.5660 | 0.6629 | 5 | 712 K |
| 4-bit No Hess | – | 0.4352 | 0.7875 | 0.7742 | 0.7609 | 0.5628 | 0.6641 | – | – |

### 🟠 2-bit Quantization

| Method | Steps | ARC_c ↑ | BoolQ ↑ | PIQA ↑ | ARC_e ↑ | HSwag ↑ | AVG ↑ | GPU hours ↓ | Tokens ↓ |
|---|---|---|---|---|---|---|---|---|---|
| 2-bit Sketch A | 4096 | 0.3805 | 0.7333 | 0.7562 | 0.7192 | 0.5227 | 0.6223 | 50 | 16 M |
| 2-bit FastKron | 150 | 0.3843 | 0.7510 | 0.7600 | 0.7112 | 0.5139 | 0.6240 | 6 | 1400 K |
| 2-bit No Hess | – | 0.2210 | 0.6355 | 0.6306 | 0.5152 | 0.3422 | 0.4689 | – | – |