@@ -41,4 +41,121 @@ This is an INT4 quantized version of [SmolLM3-3B](https://huggingface.co/Hugging
 ### Quantization Process
 ```python
 # Quantized using OpenVINO NNCF
-# INT4 symmetric quantization

+# INT4 symmetric quantization
+# Calibration dataset: [specify if used]
+```
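To make "INT4 symmetric quantization" concrete, here is a minimal pure-Python sketch of the idea: one shared scale, no zero point, and integers clamped to the signed 4-bit range. It is an illustration only, not NNCF's implementation; real weight compression typically works per channel or per group of weights, and the function names and toy values below are hypothetical.

```python
def quantize_int4_symmetric(weights):
    """Map floats to signed 4-bit integers using one shared scale and no zero point."""
    max_abs = max(abs(w) for w in weights)
    # Symmetric scheme: zero maps to zero, and the largest magnitude maps to 7.
    scale = max_abs / 7 if max_abs > 0 else 1.0
    # Round to the nearest integer, then clamp to the signed 4-bit range [-8, 7].
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from the 4-bit integers."""
    return [v * scale for v in q]


weights = [0.7, -0.42, 0.1, 0.0]  # toy values, not taken from the model
q, scale = quantize_int4_symmetric(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

The rounding error per weight is bounded by half the scale, which is why a smaller scale (for example, one scale per small group of weights) generally preserves quality better.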
+### Model Architecture
+- Same architecture as SmolLM3-3B
+- GQA and NoPE preserved
+- 64k context support (128k with YaRN)
+- Multilingual capabilities maintained
+## 📊 Performance (Experimental)
+> ⚠️ **Note:** This is an experimental quantization. Formal benchmarks pending.
+Expected characteristics:
+- **Model Size:** ~1GB (vs ~6GB fp16)
+- **Inference Speed:** 2-4x faster on CPU than the FP16 model (estimate, not yet measured)
+- **Quality Trade-off:** Minor degradation expected
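As a rough sanity check on the size figures, the weight storage can be estimated from the parameter count and the bits per weight. The sketch below assumes about 3 billion parameters (taken from the "3B" in the base model's name) and ignores quantization scales and any layers kept at higher precision, so the actual files in the repository may differ from these idealized numbers.

```python
def estimate_weight_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Return approximate weight storage in GB (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


NUM_PARAMS = 3e9  # assumption: roughly 3 billion parameters

fp16_gb = estimate_weight_size_gb(NUM_PARAMS, 16)
int4_gb = estimate_weight_size_gb(NUM_PARAMS, 4)
print(f"FP16: ~{fp16_gb:.1f} GB, INT4: ~{int4_gb:.1f} GB")
```

Under these assumptions the idealized INT4 footprint is about a quarter of the FP16 one; checking the real file sizes in the repository is the reliable way to confirm the figure.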
+## 🛠️ How to Use
+### Installation
+```bash
+pip install "optimum[openvino]" transformers
+```
+### Basic Usage
+```python
+from optimum.intel import OVModelForCausalLM
+from transformers import AutoTokenizer
+model_id = "dev-bjoern/smollm3-int4-ov"
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+model = OVModelForCausalLM.from_pretrained(model_id)
+# Generate text
+prompt = "Explain quantum computing in simple terms"
+inputs = tokenizer(prompt, return_tensors="pt")
+outputs = model.generate(**inputs, max_new_tokens=100)
+print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+```
+### With Extended Thinking
+```python
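+# Reuses `tokenizer` and `model` from the Basic Usage example above.
+# The "/think" flag in the system prompt enables extended thinking.
+messages = [
+    {"role": "system", "content": "/think"},
+    {"role": "user", "content": "Solve this step by step: 25 * 16"}
+]
+text = tokenizer.apply_chat_template(
+    messages,
+    tokenize=False,
+    add_generation_prompt=True
+)
+# Tokenize the templated prompt and generate; reasoning traces need a larger token budget.
+inputs = tokenizer(text, return_tensors="pt")
+outputs = model.generate(**inputs, max_new_tokens=512)
+print(tokenizer.decode(outputs[0], skip_special_tokens=True))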
+```
+## 🎯 Intended Use
+- **Edge AI applications**
+- **Local LLM deployment**
+- **Resource-constrained environments**
+- **Privacy-focused applications**
+- **Offline AI assistants**
+## ⚡ Optimization Tips
+1. **CPU Inference:** Use OpenVINO runtime for best performance
+2. **Batch Processing:** Leverage dynamic batching when possible
+3. **Memory:** Requires ~2GB RAM for comfortable operation
+## 🧪 Experimental Status
+This is my first experiment with OpenVINO INT4 quantization. Feedback and contributions are welcome!
+### Known Limitations
+- No formal benchmarks yet
+- Quantization settings not fully optimized
+- Some quality degradation vs full precision
+### Future Improvements
+- [ ] Comprehensive benchmarking
+- [ ] Mixed precision experiments
+- [ ] Model compression analysis
+- [ ] Calibration dataset optimization
+## 🤝 Contributing
+Found issues or have suggestions? Please open a discussion or issue!
+## 📚 Resources
+- [Original SmolLM3 Model](https://huggingface.co/HuggingFaceTB/SmolLM3-3B)
+- [OpenVINO Documentation](https://docs.openvino.ai/)
+- [Optimum Intel](https://huggingface.co/docs/optimum/intel/index)
+## 🙏 Acknowledgments
+- Hugging Face team for SmolLM3
+- Intel OpenVINO team for quantization tools
+- Community for feedback and support
+## 📝 Citation
+If you use this model, please cite both the original SmolLM3 model and this work. The entry below covers this work:
+```bibtex
+@misc{smollm3-int4-ov,
+  author = {Bjoern Bethge},
+  title = {SmolLM3 INT4 OpenVINO},
+  year = {2025},
+  publisher = {Hugging Face},
+  howpublished = {\url{https://huggingface.co/dev-bjoern/smollm3-int4-ov}}
+}
+```
+---
+**Status:** 🧪 Experimental | **Feedback:** Welcome | **License:** Apache 2.0