Instructions to use akratz/unsloth.Q8_0 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use akratz/unsloth.Q8_0 with Transformers:
# Load model directly from transformers import AutoModel model = AutoModel.from_pretrained("akratz/unsloth.Q8_0", dtype="auto") - llama-cpp-python
How to use akratz/unsloth.Q8_0 with llama-cpp-python:
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="akratz/unsloth.Q8_0", filename="unsloth.Q8_0.gguf", )
llm.create_chat_completion( messages = "No input example has been defined for this model task." )
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use akratz/unsloth.Q8_0 with llama.cpp:
Install from brew
brew install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf akratz/unsloth.Q8_0:Q8_0 # Run inference directly in the terminal: llama-cli -hf akratz/unsloth.Q8_0:Q8_0
Install from WinGet (Windows)
winget install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf akratz/unsloth.Q8_0:Q8_0 # Run inference directly in the terminal: llama-cli -hf akratz/unsloth.Q8_0:Q8_0
Use pre-built binary
# Download pre-built binary from: # https://github.com/ggerganov/llama.cpp/releases # Start a local OpenAI-compatible server with a web UI: ./llama-server -hf akratz/unsloth.Q8_0:Q8_0 # Run inference directly in the terminal: ./llama-cli -hf akratz/unsloth.Q8_0:Q8_0
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git cd llama.cpp cmake -B build cmake --build build -j --target llama-server llama-cli # Start a local OpenAI-compatible server with a web UI: ./build/bin/llama-server -hf akratz/unsloth.Q8_0:Q8_0 # Run inference directly in the terminal: ./build/bin/llama-cli -hf akratz/unsloth.Q8_0:Q8_0
Use Docker
docker model run hf.co/akratz/unsloth.Q8_0:Q8_0
- LM Studio
- Jan
- Ollama
How to use akratz/unsloth.Q8_0 with Ollama:
ollama run hf.co/akratz/unsloth.Q8_0:Q8_0
- Unsloth Studio
How to use akratz/unsloth.Q8_0 with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for akratz/unsloth.Q8_0 to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for akratz/unsloth.Q8_0 to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for akratz/unsloth.Q8_0 to start chatting
- Pi
How to use akratz/unsloth.Q8_0 with Pi:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf akratz/unsloth.Q8_0:Q8_0
Configure the model in Pi
# Install Pi: npm install -g @mariozechner/pi-coding-agent # Add to ~/.pi/agent/models.json: { "providers": { "llama-cpp": { "baseUrl": "http://localhost:8080/v1", "api": "openai-completions", "apiKey": "none", "models": [ { "id": "akratz/unsloth.Q8_0:Q8_0" } ] } } }Run Pi
# Start Pi in your project directory: pi
- Hermes Agent new
How to use akratz/unsloth.Q8_0 with Hermes Agent:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf akratz/unsloth.Q8_0:Q8_0
Configure Hermes
# Install Hermes: curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash hermes setup # Point Hermes at the local server: hermes config set model.provider custom hermes config set model.base_url http://127.0.0.1:8080/v1 hermes config set model.default akratz/unsloth.Q8_0:Q8_0
Run Hermes
hermes
- Docker Model Runner
How to use akratz/unsloth.Q8_0 with Docker Model Runner:
docker model run hf.co/akratz/unsloth.Q8_0:Q8_0
- Lemonade
How to use akratz/unsloth.Q8_0 with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/ lemonade pull akratz/unsloth.Q8_0:Q8_0
Run and chat with the model
lemonade run user.unsloth.Q8_0-Q8_0
List all available models
lemonade list
Model Card: Qwen2.5-3B (GRPO-GSM8K) — Reasoning Format
Model summary
- Name: Qwen2.5-3B (GRPO-GSM8K)
- Base model: unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit
- Developer/Maintainer: akratz
- License: Apache-2.0
- Languages: English
- Status: Experimental; trained to emit explicit reasoning traces
- Primary change: Reinforcement learning with GRPO on GSM8K to induce chain-of-thought style outputs
What's in the box
This is a small Qwen 2.5 3B-parameter instruct model fine-tuned with GRPO following Unsloth’s RL tutorial to produce responses in a two-block format:
> Tom’s cat is 8 years old. His rabbit is half the age of his cat. His dog is three times as old as his rabbit. How old is the
dog?
<reasoning>
To find out how old Tom's dog is, we need to follow these steps:
1. Determine the age of Tom's rabbit.
2. Use the rabbit's age to find out the dog's age.
First, we know that Tom's cat is 格格 (which I assume was supposed to be '8' years old).
Next, the rabbit is half the age of the cat. Therefore, the rabbit's age is:
\[ \text{Rabbit's age} = \frac{1}{2} \times \text{Cat's age} = \frac{1}{2} \times 8 = 4 \text{ years old} \]
Now, the dog is three times as old as the rabbit. So, the dog's age is:
\[ \text{Dog's age} = 3 \times \text{Rabbit's age} = 3 \times 4 = 12 \text{ years old} \]
Thus, the dog is 12 years old.
...
<answer>
12
</answer>
The goal was to validate the full RL pipeline end-to-end, not to achieve state-of-the-art accuracy. The model does emit structured reasoning, but overall quality is low compared with larger or better-trained models.
Intended use
- Intended: Educational and reproducibility demos of GRPO on a small model; sandbox experiments on formatting, reward shaping, and inference tooling.
- Not intended: Production use; safety-critical domains; factual or mathematical reliability; privacy-sensitive data.
Limitations and risks
- Trained only on GSM8K signals; narrow domain coverage.
- Reasoning traces can be verbose, inconsistent, or incorrect.
- Hallucinations and arithmetic mistakes are common.
- Safety and refusal behaviors were not a focus of training.
- Reward overfitting to the target format and short-term hacks is likely.
Training data
- Supervision signal: GSM8K math word problems (for rewards/formatting).
- Data scope: Math grade-school style problems only.
- Provenance: Public dataset (standard GSM8K release). No private data added.
Training procedure
- Method: GRPO (Group Relative Policy Optimization) via Hugging Face TRL + Unsloth.
- Starting weights: unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit
- Objective: Encourage the required output format and correct final answers on GSM8K.
- Notes: This run was sized and configured to “make it work,” not tuned for peak performance. Hyperparameters were kept close to the Unsloth tutorial defaults.
Inference
llama.cpp example
Invoke with a system prompt that enforces the format:
~/work/llama.cpp/build/bin/llama-cli \
-m ~/work/llama.cpp/models/unsloth.Q8_0.gguf \
-sys "Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>"
Output contract
- Always emits ... followed by ....
- Consumers should parse only the block as the final answer.
- The block is untrusted and may contain errors.
Practical tips
- Use deterministic decoding for evaluation (temp≈0.0–0.2, small top_p).
- Set a stop sequence on to avoid trailing chatter.
- For non-math tasks, expect degraded quality; re-prompt with very explicit instructions if you must try.
Evaluation
- Benchmarking: No formal leaderboard-quality evaluation was performed.
- Sanity checks: The model produces the requested two-block format on GSM8K-style prompts and can reach correct answers on simple instances.
- Known behavior: Accuracy is inconsistent; failure modes include arithmetic slips and brittle parsing.
Compatibility
- Quantization used for deployment: unsloth.Q8_0.gguf tested with llama.cpp.
- Libraries: Works with standard GGUF-compatible runtimes. Training used TRL + Unsloth.
Responsible use
- Do not deploy in any domain where mistakes cause harm.
- Do not rely on the text for truth; treat it as a by-product of training.
- Filter or redact sensitive inputs. Add your own safety layers if experimenting in the wild.
How to reproduce
- Tutorial followed: Unsloth “Train your own reasoning model with GRPO” (the GSM8K walkthrough).
- Replicate the pipeline, then enforce the same output format in the system prompt at inference time.
Versioning
- v0.1 (2025-08-09): Initial GRPO finetune from unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit on GSM8K; emits reasoning/answer blocks; experimental quality.
Citation
If you use this model in experiments, please cite:
- Qwen2.5 model family (Qwen team)
- TRL library (Hugging Face)
- Unsloth library and GRPO tutorial
- GSM8K dataset
@misc{akratz_qwen25_3b_grpo_gsm8k_2025, title = {Qwen2.5-3B (GRPO-GSM8K) — Reasoning Format}, author = {akratz}, year = {2025}, note = {Fine-tuned from unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit using GRPO on GSM8K; Apache-2.0} }
Card metadata (for Hub front-matter)
base_model: unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit tags:
- text-generation-inference
- transformers
- unsloth
- qwen2
- reasoning
- grpo license: apache-2.0 language:
- en
- Downloads last month
- 4
8-bit