Instructions to use akratz/unsloth.Q8_0 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use akratz/unsloth.Q8_0 with Transformers:

# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("akratz/unsloth.Q8_0", dtype="auto")

llama-cpp-python

How to use akratz/unsloth.Q8_0 with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="akratz/unsloth.Q8_0",
	filename="unsloth.Q8_0.gguf",
)

llm.create_chat_completion(
	messages = "No input example has been defined for this model task."
)

Notebooks
Google Colab
Kaggle
Local Apps

llama.cpp

How to use akratz/unsloth.Q8_0 with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf akratz/unsloth.Q8_0:Q8_0
# Run inference directly in the terminal:
llama-cli -hf akratz/unsloth.Q8_0:Q8_0

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf akratz/unsloth.Q8_0:Q8_0
# Run inference directly in the terminal:
llama-cli -hf akratz/unsloth.Q8_0:Q8_0

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf akratz/unsloth.Q8_0:Q8_0
# Run inference directly in the terminal:
./llama-cli -hf akratz/unsloth.Q8_0:Q8_0

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf akratz/unsloth.Q8_0:Q8_0
# Run inference directly in the terminal:
./build/bin/llama-cli -hf akratz/unsloth.Q8_0:Q8_0

Use Docker

docker model run hf.co/akratz/unsloth.Q8_0:Q8_0

LM Studio
Jan
Ollama
How to use akratz/unsloth.Q8_0 with Ollama:
```
ollama run hf.co/akratz/unsloth.Q8_0:Q8_0
```

Unsloth Studio

How to use akratz/unsloth.Q8_0 with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for akratz/unsloth.Q8_0 to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for akratz/unsloth.Q8_0 to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for akratz/unsloth.Q8_0 to start chatting

How to use akratz/unsloth.Q8_0 with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf akratz/unsloth.Q8_0:Q8_0

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "akratz/unsloth.Q8_0:Q8_0"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent new

How to use akratz/unsloth.Q8_0 with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf akratz/unsloth.Q8_0:Q8_0

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default akratz/unsloth.Q8_0:Q8_0

Run Hermes

hermes

Docker Model Runner
How to use akratz/unsloth.Q8_0 with Docker Model Runner:
```
docker model run hf.co/akratz/unsloth.Q8_0:Q8_0
```

Lemonade

How to use akratz/unsloth.Q8_0 with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull akratz/unsloth.Q8_0:Q8_0

Run and chat with the model

lemonade run user.unsloth.Q8_0-Q8_0

List all available models

lemonade list

Model Card: Qwen2.5-3B (GRPO-GSM8K) — Reasoning Format

Model summary

Name: Qwen2.5-3B (GRPO-GSM8K)
Base model: unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit
Developer/Maintainer: akratz
License: Apache-2.0
Languages: English
Status: Experimental; trained to emit explicit reasoning traces
Primary change: Reinforcement learning with GRPO on GSM8K to induce chain-of-thought style outputs

What's in the box

This is a small Qwen 2.5 3B-parameter instruct model fine-tuned with GRPO following Unsloth’s RL tutorial to produce responses in a two-block format:

> Tom’s cat is 8 years old.  His rabbit is half the age of his cat.  His dog is three times as old as his rabbit.  How old is the
 dog? 
<reasoning>
To find out how old Tom's dog is, we need to follow these steps:

1. Determine the age of Tom's rabbit.
2. Use the rabbit's age to find out the dog's age.

First, we know that Tom's cat is 格格 (which I assume was supposed to be '8' years old).

Next, the rabbit is half the age of the cat. Therefore, the rabbit's age is:
\[ \text{Rabbit's age} = \frac{1}{2} \times \text{Cat's age} = \frac{1}{2} \times 8 = 4 \text{ years old} \]

Now, the dog is three times as old as the rabbit. So, the dog's age is:
\[ \text{Dog's age} = 3 \times \text{Rabbit's age} = 3 \times 4 = 12 \text{ years old} \]

Thus, the dog is 12 years old.
...
<answer>
12
</answer>

The goal was to validate the full RL pipeline end-to-end, not to achieve state-of-the-art accuracy. The model does emit structured reasoning, but overall quality is low compared with larger or better-trained models.

Intended use

Intended: Educational and reproducibility demos of GRPO on a small model; sandbox experiments on formatting, reward shaping, and inference tooling.
Not intended: Production use; safety-critical domains; factual or mathematical reliability; privacy-sensitive data.

Limitations and risks

Trained only on GSM8K signals; narrow domain coverage.
Reasoning traces can be verbose, inconsistent, or incorrect.
Hallucinations and arithmetic mistakes are common.
Safety and refusal behaviors were not a focus of training.
Reward overfitting to the target format and short-term hacks is likely.

Training data

Supervision signal: GSM8K math word problems (for rewards/formatting).
Data scope: Math grade-school style problems only.
Provenance: Public dataset (standard GSM8K release). No private data added.

Training procedure

Method: GRPO (Group Relative Policy Optimization) via Hugging Face TRL + Unsloth.
Starting weights: unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit
Objective: Encourage the required output format and correct final answers on GSM8K.
Notes: This run was sized and configured to “make it work,” not tuned for peak performance. Hyperparameters were kept close to the Unsloth tutorial defaults.

Inference

llama.cpp example

Invoke with a system prompt that enforces the format:

~/work/llama.cpp/build/bin/llama-cli \
  -m ~/work/llama.cpp/models/unsloth.Q8_0.gguf \
  -sys "Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>"

Output contract

Always emits ... followed by ....
Consumers should parse only the block as the final answer.
The block is untrusted and may contain errors.

Practical tips

Use deterministic decoding for evaluation (temp≈0.0–0.2, small top_p).
Set a stop sequence on to avoid trailing chatter.
For non-math tasks, expect degraded quality; re-prompt with very explicit instructions if you must try.

Evaluation

Benchmarking: No formal leaderboard-quality evaluation was performed.
Sanity checks: The model produces the requested two-block format on GSM8K-style prompts and can reach correct answers on simple instances.
Known behavior: Accuracy is inconsistent; failure modes include arithmetic slips and brittle parsing.

Compatibility

Quantization used for deployment: unsloth.Q8_0.gguf tested with llama.cpp.
Libraries: Works with standard GGUF-compatible runtimes. Training used TRL + Unsloth.

Responsible use

Do not deploy in any domain where mistakes cause harm.
Do not rely on the text for truth; treat it as a by-product of training.
Filter or redact sensitive inputs. Add your own safety layers if experimenting in the wild.

How to reproduce

Tutorial followed: Unsloth “Train your own reasoning model with GRPO” (the GSM8K walkthrough).
Replicate the pipeline, then enforce the same output format in the system prompt at inference time.

Versioning

v0.1 (2025-08-09): Initial GRPO finetune from unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit on GSM8K; emits reasoning/answer blocks; experimental quality.

Citation

If you use this model in experiments, please cite:

Qwen2.5 model family (Qwen team)
TRL library (Hugging Face)
Unsloth library and GRPO tutorial
GSM8K dataset

@misc{akratz_qwen25_3b_grpo_gsm8k_2025, title = {Qwen2.5-3B (GRPO-GSM8K) — Reasoning Format}, author = {akratz}, year = {2025}, note = {Fine-tuned from unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit using GRPO on GSM8K; Apache-2.0} }

Card metadata (for Hub front-matter)

base_model: unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit tags:

text-generation-inference
transformers
unsloth
qwen2
reasoning
grpo license: apache-2.0 language:
en

Downloads last month: 4

GGUF

Model size

3B params

Architecture

qwen2

Hardware compatibility

8-bit

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support