Instructions to use Kukedlc/Slopus-9B with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use Kukedlc/Slopus-9B with Transformers:

# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("Kukedlc/Slopus-9B", dtype="auto")

llama-cpp-python

How to use Kukedlc/Slopus-9B with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="Kukedlc/Slopus-9B",
	filename="Slopus-9B-Q2_K.gguf",
)

llm.create_chat_completion(
	messages = "No input example has been defined for this model task."
)

Notebooks
Google Colab
Kaggle
Local Apps

llama.cpp

How to use Kukedlc/Slopus-9B with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf Kukedlc/Slopus-9B:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf Kukedlc/Slopus-9B:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf Kukedlc/Slopus-9B:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf Kukedlc/Slopus-9B:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf Kukedlc/Slopus-9B:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf Kukedlc/Slopus-9B:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf Kukedlc/Slopus-9B:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf Kukedlc/Slopus-9B:Q4_K_M

Use Docker

docker model run hf.co/Kukedlc/Slopus-9B:Q4_K_M

LM Studio
Jan
Ollama
How to use Kukedlc/Slopus-9B with Ollama:
```
ollama run hf.co/Kukedlc/Slopus-9B:Q4_K_M
```

Unsloth Studio new

How to use Kukedlc/Slopus-9B with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for Kukedlc/Slopus-9B to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for Kukedlc/Slopus-9B to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for Kukedlc/Slopus-9B to start chatting

Pi new

How to use Kukedlc/Slopus-9B with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf Kukedlc/Slopus-9B:Q4_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "Kukedlc/Slopus-9B:Q4_K_M"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent new

How to use Kukedlc/Slopus-9B with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf Kukedlc/Slopus-9B:Q4_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default Kukedlc/Slopus-9B:Q4_K_M

Run Hermes

hermes

Docker Model Runner
How to use Kukedlc/Slopus-9B with Docker Model Runner:
```
docker model run hf.co/Kukedlc/Slopus-9B:Q4_K_M
```

Lemonade

How to use Kukedlc/Slopus-9B with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull Kukedlc/Slopus-9B:Q4_K_M

Run and chat with the model

lemonade run user.Slopus-9B-Q4_K_M

List all available models

lemonade list

Slopus 9B

Experimental fine-tuning pipeline on top of Tesslate/OmniCoder-9B (Qwen3.5-9B base + 425K Opus agentic traces).

Two chained LoRA stages: a light coding-style adaptation first, then a heavier capacity injection of pure Opus 4.6 reasoning traces.

Pipeline overview

Summary
Pipeline in 5 steps
Phase 1 — LoRA r=8 on OmniCoder-9B
Phase 2 — LoRA r=128 on phase 1 merge
Costs
Available quants
Usage with llama-server
Limitations
Datasets & credits

Summary

	Phase 1	Phase 2
Dataset	`Kukedlc/omnicoder-train` (16K, agentic coding)	`Kukedlc/omnicoder-fase2-reasoning` (24K, Opus 4.6 reasoning, derived from `Gryphe/Opus-4.6-Reasoning-24k`)
Base	`Tesslate/OmniCoder-9B`	`Tesslate/OmniCoder-9B` + phase 1 LoRA merged (fp16)
LoRA rank / alpha	r=8, alpha=16	r=128, alpha=256
Targets	q,k,v,o,gate,up,down,out_proj (GDN)	same + much higher rank
Epochs	2	2
Total steps	506	758
Effective batch	16 (batch=8, GA=8)	64 (batch=16, GA=4)
LR	2e-4	1e-4 (lower, base already fine-tuned)
Max seq	2048	4096
Training hardware	RunPod H100 80GB SXM Secure	RunPod H100 80GB SXM Secure
Approximate time	~1.5h	~4.8h
Loss start → end	1.14 → 0.88 (-23.1%)	1.03 → 0.88 (-14.3%)

Pipeline in 5 steps

Base: Tesslate/OmniCoder-9B — Qwen3.5-9B with hybrid GDN (Gated Delta Networks) architecture, fine-tuned by Tesslate on 425K Opus agentic coding traces.
Phase 1 LoRA r=8 alpha=16 on top of the base, dataset Kukedlc/omnicoder-train. 2 epochs, 506 steps. Resulting adapter: Kukedlc/omnicoder-9b-lora (checkpoints 100/200/300/400/500/506).
Phase 1 merge: base + phase 1 adapter → ~18 GB HF fp16 model (not published, used only as internal base for phase 2).
Phase 2 LoRA r=128 alpha=256 on top of the merged model, dataset Kukedlc/omnicoder-fase2-reasoning (re-rendered from Gryphe Opus reasoning with <think> inline in the chat template). 2 epochs, 758 steps. Resulting adapter: Kukedlc/omnicoder-9b-fase2-lora (checkpoints 100/200/.../758).
Phase 2 merge + Quantize with llama.cpp tag b8292 (commit b54124110) — newer master and other versions do NOT fully support Qwen3.5's hybrid GDN architecture and fail with missing tensor 'blk.32.attn_norm.weight'. Generated quants: Q2_K, Q3_K_M, Q4_K_M, Q6_K, Q8_0.

Phase 1

A conservative LoRA (r=8 alpha=16) on top of OmniCoder-9B to introduce a first style adjustment in the target domain before the heavier injection in phase 2. The idea: don't break the base with a giant LoRA on the first pass.

Observation: loss drops MORE in epoch 2 than in epoch 1 (10% in epoch 1 vs 15% in epoch 2), suggesting the base still had headroom to learn — no overfitting.

Phase 2

High-capacity LoRA (r=128 alpha=256, 258M trainable params = 2.67% of the 9B) on top of the phase 1 merge, with a pure reasoning dataset from Opus 4.6 (24K chat examples with separate reasoning_content, re-rendered as <think>X</think>content inline for the Qwen3.5 chat template).

LR lowered to 1e-4 (vs 2e-4 in phase 1) since the model was already fine-tuned. Cosine scheduler with 20 warmup steps.

Recommended sampling parameters (from Qwen3.6 model card, valid for Qwen3.5):

Mode	temp	top_p	pres_pen	top_k
Thinking general	1.0	0.95	1.5	20
Thinking coding	0.6	0.95	0.0	20
Instruct general	0.7	0.80	1.5	20

Benchmarks

All benchmarks run locally on RTX 3090, against the Q4_K_M GGUF served via llama-server at b8292. Custom Python harness (_custom_gsm8k_bench.py and _custom_mc_bench.py in the dataset repo) — we send each question + choices to the model and extract the final letter/number with regex. No lm-evaluation-harness because of multiple bugs with the Qwen3.5 hybrid architecture + chat template + logprobs format from llama-server (documented in the issues).

GSM8K — Slopus 9B vs base OmniCoder 9B (100 problems, same Q&A set)

Setting	Slopus 9B Q4	OmniCoder 9B Q4	Δ
Greedy + no-thinking	80.0% (5.5 min)	81.0% (~5 min)	-1.0 pp
Thinking ON + Qwen sampling (temp=1.0, top_p=0.95, top_k=20, presence_penalty=1.5)	92.0% (5.5 min)	97.0% (65.6 min)	-5.0 pp

Speed: Slopus approximately 12x faster than OmniCoder with thinking on (5.5 min vs 65.6 min, same hardware, same --parallel 2). Slopus reasoning is more concise (approximately 300 tokens avg vs approximately 1500 tokens avg for OmniCoder).

Tradeoff: OmniCoder wins approximately 5pp accuracy on math by re-verifying answers multiple times (Step → Alternative Plan → Verification), Slopus wins approximately 12x throughput. For batch workloads (1000s of queries), Slopus processes 12x the volume per hour.

tinyBenchmarks (100 examples each, custom generative — letter pick, thinking ON + Qwen sampling)

Benchmark	Slopus 9B Q4	Time
tinyMMLU (knowledge)	76.0%	6.3 min
tinyArc (science Q&A)	92.0%	5.4 min
tinyHellaswag (common sense)	65.0%	5.1 min
tinyWinogrande (coreference)	76.0%	4.0 min
tinyTruthfulQA (truthfulness, mc1)	61.0%	5.2 min
Average	74.0%	total ~26 min

OmniCoder NOT benchmarked on tinyBenchmarks MC because at OmniCoder's pace (approximately 12x slower) it would have taken 5-6h instead of 26 min. The GSM8K comparison is enough signal of where each model stands.

Random baseline for 4-choice MC ≈ 25%. All results are well above baseline. ARC at 92% is excellent for a 9B Q4.

Costs

Item	Time	Cost
Phase 1 training, H100 SXM Secure ($3.29/h)	~1.5h	~$4.95
Phase 2 training, H100 SXM Secure ($3.29/h)	~4.8h	~$15.80
Intermediate merges + quantize on local 3090	~2h	$0 (local)
HF storage (XET turbo)	-	$0
Total paid compute	~6.3h cloud	~$20.75

Available quants

Quant	Size	Recommended use
`Slopus-9B-Q2_K.gguf`	~3.6 GB	Only if VRAM <6 GB. Noticeably lower quality.
`Slopus-9B-Q3_K_M.gguf`	~4.5 GB	VRAM ~6 GB. Tight compromise.
`Slopus-9B-Q4_K_M.gguf`	~5.5 GB	Sweet spot, default recommended. VRAM 8 GB+
`Slopus-9B-Q6_K.gguf`	~7.5 GB	Near-perfect quality. VRAM 10 GB+
`Slopus-9B-Q8_0.gguf`	~9.5 GB	Almost indistinguishable from fp16. VRAM 12 GB+

Usage with llama-server

IMPORTANT: you must use llama.cpp at tag b8292 (commit b54124110). New master does NOT load these GGUFs (it fails with a missing tensor error due to a bug in the converter+loader for Qwen3.5's hybrid GDN architecture).

# Install llama.cpp at b8292
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
git checkout b8292
cmake -B build -DGGML_CUDA=ON
cmake --build build --target llama-quantize llama-server -j$(nproc)

# Serve
export LLAMA_CHAT_TEMPLATE_KWARGS='{"enable_thinking":true}'
./build/bin/llama-server \
    --model Slopus-9B-Q4_K_M.gguf \
    -ngl 999 -fa on --no-mmap \
    -c 65536 \
    --parallel 4 \
    --jinja --reasoning-format deepseek \
    --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.0 \
    --presence-penalty 1.5 \
    --port 12345

Phase 2 LoRA adapter (without merge)

Both LoRA adapters are included in this same repo:

adapter_phase1/ — final phase 1 adapter (checkpoint-506, r=8 alpha=16), trained on Kukedlc/omnicoder-train. Loads on top of Tesslate/OmniCoder-9B.
adapter_phase2/ — final phase 2 adapter (checkpoint-758, r=128 alpha=256), trained on Kukedlc/omnicoder-fase2-reasoning. Loads on top of Tesslate/OmniCoder-9B + adapter_phase1 merged. The intermediate merge is NOT published (~18 GB, adds no value vs. the Q4_K_M GGUF directly).

To reproduce Slopus 9B from scratch:

Apply adapter_phase1 over Tesslate/OmniCoder-9B → merge to fp16
Apply adapter_phase2 over the merged phase 1 → merge to fp16
Convert + quantize with llama.cpp@b8292

Limitations

r=128 alpha=256 on 24K examples is borderline against the rule-of-thumb "high rank requires a large dataset". The model showed NO visible signs of overfitting (loss still going down at the end of epoch 2), but without an eval set it can't be fully confirmed. For production use ideally build an out-of-distribution benchmark.
GGUFs require llama.cpp b8292 specifically. Newer versions don't load them. This model is NOT compatible with stock Ollama (which embeds a newer llama.cpp) — you'd have to patch Ollama or use llama-server directly.
Basic arithmetic in Q4_K_M can be slightly off (the model lists the terms correctly but sometimes sums the final result wrong). Use Q6_K or Q8_0 for tasks requiring numerical precision.
No RLHF/DPO. SFT only in both phases.

Datasets & credits

Tesslate/OmniCoder-9B — base model. Apache 2.0.
Gryphe/Opus-4.6-Reasoning-24k — source of the reasoning dataset (re-rendered to Qwen3.5 chat template with inline <think>). Apache 2.0.
llama.cpp b8292 — GitHub ggml-org/llama.cpp, commit b54124110.
Unsloth — efficient bf16 LoRA training.
bartowski for publishing OmniCoder-9B GGUFs at b8292, which served as the reference to identify the correct llama.cpp version.

Reproducing the experiment

Scripts in the Kukedlc/omnicoder-train dataset repo:

train_omnicoder.py — phase 1
train_omnicoder_fase2.py — phase 2
setup_train.sh / setup_fase2.sh — RunPod pod launchers
watcher_upload.sh / watcher_upload_fase2.sh — automatic checkpoint upload

License

Apache 2.0 — inherited from the OmniCoder-9B base and the Gryphe dataset.

Downloads last month: 192

GGUF

Model size

9B params

Architecture

qwen35

Hardware compatibility

2-bit

3-bit

4-bit

6-bit

8-bit

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for Kukedlc/Slopus-9B

Base model

Qwen/Qwen3.5-9B-Base

Finetuned

Qwen/Qwen3.5-9B