Instructions to use Kukedlc/Slopus-9B with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use Kukedlc/Slopus-9B with Transformers:
# Load model directly from transformers import AutoModel model = AutoModel.from_pretrained("Kukedlc/Slopus-9B", dtype="auto") - llama-cpp-python
How to use Kukedlc/Slopus-9B with llama-cpp-python:
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="Kukedlc/Slopus-9B", filename="Slopus-9B-Q2_K.gguf", )
llm.create_chat_completion( messages = "No input example has been defined for this model task." )
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use Kukedlc/Slopus-9B with llama.cpp:
Install from brew
brew install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf Kukedlc/Slopus-9B:Q4_K_M # Run inference directly in the terminal: llama-cli -hf Kukedlc/Slopus-9B:Q4_K_M
Install from WinGet (Windows)
winget install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf Kukedlc/Slopus-9B:Q4_K_M # Run inference directly in the terminal: llama-cli -hf Kukedlc/Slopus-9B:Q4_K_M
Use pre-built binary
# Download pre-built binary from: # https://github.com/ggerganov/llama.cpp/releases # Start a local OpenAI-compatible server with a web UI: ./llama-server -hf Kukedlc/Slopus-9B:Q4_K_M # Run inference directly in the terminal: ./llama-cli -hf Kukedlc/Slopus-9B:Q4_K_M
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git cd llama.cpp cmake -B build cmake --build build -j --target llama-server llama-cli # Start a local OpenAI-compatible server with a web UI: ./build/bin/llama-server -hf Kukedlc/Slopus-9B:Q4_K_M # Run inference directly in the terminal: ./build/bin/llama-cli -hf Kukedlc/Slopus-9B:Q4_K_M
Use Docker
docker model run hf.co/Kukedlc/Slopus-9B:Q4_K_M
- LM Studio
- Jan
- Ollama
How to use Kukedlc/Slopus-9B with Ollama:
ollama run hf.co/Kukedlc/Slopus-9B:Q4_K_M
- Unsloth Studio new
How to use Kukedlc/Slopus-9B with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for Kukedlc/Slopus-9B to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for Kukedlc/Slopus-9B to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for Kukedlc/Slopus-9B to start chatting
- Pi new
How to use Kukedlc/Slopus-9B with Pi:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf Kukedlc/Slopus-9B:Q4_K_M
Configure the model in Pi
# Install Pi: npm install -g @mariozechner/pi-coding-agent # Add to ~/.pi/agent/models.json: { "providers": { "llama-cpp": { "baseUrl": "http://localhost:8080/v1", "api": "openai-completions", "apiKey": "none", "models": [ { "id": "Kukedlc/Slopus-9B:Q4_K_M" } ] } } }Run Pi
# Start Pi in your project directory: pi
- Hermes Agent new
How to use Kukedlc/Slopus-9B with Hermes Agent:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf Kukedlc/Slopus-9B:Q4_K_M
Configure Hermes
# Install Hermes: curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash hermes setup # Point Hermes at the local server: hermes config set model.provider custom hermes config set model.base_url http://127.0.0.1:8080/v1 hermes config set model.default Kukedlc/Slopus-9B:Q4_K_M
Run Hermes
hermes
- Docker Model Runner
How to use Kukedlc/Slopus-9B with Docker Model Runner:
docker model run hf.co/Kukedlc/Slopus-9B:Q4_K_M
- Lemonade
How to use Kukedlc/Slopus-9B with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/ lemonade pull Kukedlc/Slopus-9B:Q4_K_M
Run and chat with the model
lemonade run user.Slopus-9B-Q4_K_M
List all available models
lemonade list
Slopus 9B
Experimental fine-tuning pipeline on top of Tesslate/OmniCoder-9B (Qwen3.5-9B base + 425K Opus agentic traces).
Two chained LoRA stages: a light coding-style adaptation first, then a heavier capacity injection of pure Opus 4.6 reasoning traces.
Pipeline overview
Table of contents
- Summary
- Pipeline in 5 steps
- Phase 1 — LoRA r=8 on OmniCoder-9B
- Phase 2 — LoRA r=128 on phase 1 merge
- Costs
- Available quants
- Usage with llama-server
- Limitations
- Datasets & credits
Summary
| Phase 1 | Phase 2 | |
|---|---|---|
| Dataset | Kukedlc/omnicoder-train (16K, agentic coding) |
Kukedlc/omnicoder-fase2-reasoning (24K, Opus 4.6 reasoning, derived from Gryphe/Opus-4.6-Reasoning-24k) |
| Base | Tesslate/OmniCoder-9B |
Tesslate/OmniCoder-9B + phase 1 LoRA merged (fp16) |
| LoRA rank / alpha | r=8, alpha=16 | r=128, alpha=256 |
| Targets | q,k,v,o,gate,up,down,out_proj (GDN) | same + much higher rank |
| Epochs | 2 | 2 |
| Total steps | 506 | 758 |
| Effective batch | 16 (batch=8, GA=8) | 64 (batch=16, GA=4) |
| LR | 2e-4 | 1e-4 (lower, base already fine-tuned) |
| Max seq | 2048 | 4096 |
| Training hardware | RunPod H100 80GB SXM Secure | RunPod H100 80GB SXM Secure |
| Approximate time | ~1.5h | ~4.8h |
| Loss start → end | 1.14 → 0.88 (-23.1%) | 1.03 → 0.88 (-14.3%) |
Pipeline in 5 steps
- Base:
Tesslate/OmniCoder-9B— Qwen3.5-9B with hybrid GDN (Gated Delta Networks) architecture, fine-tuned by Tesslate on 425K Opus agentic coding traces. - Phase 1 LoRA r=8 alpha=16 on top of the base, dataset
Kukedlc/omnicoder-train. 2 epochs, 506 steps. Resulting adapter:Kukedlc/omnicoder-9b-lora(checkpoints 100/200/300/400/500/506). - Phase 1 merge: base + phase 1 adapter → ~18 GB HF fp16 model (not published, used only as internal base for phase 2).
- Phase 2 LoRA r=128 alpha=256 on top of the merged model, dataset
Kukedlc/omnicoder-fase2-reasoning(re-rendered from Gryphe Opus reasoning with<think>inline in the chat template). 2 epochs, 758 steps. Resulting adapter:Kukedlc/omnicoder-9b-fase2-lora(checkpoints 100/200/.../758). - Phase 2 merge + Quantize with
llama.cpptagb8292(commitb54124110) — newer master and other versions do NOT fully support Qwen3.5's hybrid GDN architecture and fail withmissing tensor 'blk.32.attn_norm.weight'. Generated quants: Q2_K, Q3_K_M, Q4_K_M, Q6_K, Q8_0.
Phase 1
A conservative LoRA (r=8 alpha=16) on top of OmniCoder-9B to introduce a first style adjustment in the target domain before the heavier injection in phase 2. The idea: don't break the base with a giant LoRA on the first pass.
Observation: loss drops MORE in epoch 2 than in epoch 1 (10% in epoch 1 vs 15% in epoch 2), suggesting the base still had headroom to learn — no overfitting.
Phase 2
High-capacity LoRA (r=128 alpha=256, 258M trainable params = 2.67% of the 9B) on top of the phase 1 merge, with a pure reasoning dataset from Opus 4.6 (24K chat examples with separate reasoning_content, re-rendered as <think>X</think>content inline for the Qwen3.5 chat template).
LR lowered to 1e-4 (vs 2e-4 in phase 1) since the model was already fine-tuned. Cosine scheduler with 20 warmup steps.
Recommended sampling parameters (from Qwen3.6 model card, valid for Qwen3.5):
| Mode | temp | top_p | pres_pen | top_k | min_p |
|---|---|---|---|---|---|
| Thinking general | 1.0 | 0.95 | 1.5 | 20 | 0 |
| Thinking coding | 0.6 | 0.95 | 0.0 | 20 | 0 |
| Instruct general | 0.7 | 0.80 | 1.5 | 20 | 0 |
Benchmarks
All benchmarks run locally on RTX 3090, against the Q4_K_M GGUF served via llama-server at b8292. Custom Python harness (_custom_gsm8k_bench.py and _custom_mc_bench.py in the dataset repo) — we send each question + choices to the model and extract the final letter/number with regex. No lm-evaluation-harness because of multiple bugs with the Qwen3.5 hybrid architecture + chat template + logprobs format from llama-server (documented in the issues).
GSM8K — Slopus 9B vs base OmniCoder 9B (100 problems, same Q&A set)
| Setting | Slopus 9B Q4 | OmniCoder 9B Q4 | Δ |
|---|---|---|---|
| Greedy + no-thinking | 80.0% (5.5 min) | 81.0% (~5 min) | -1.0 pp |
| Thinking ON + Qwen sampling (temp=1.0, top_p=0.95, top_k=20, presence_penalty=1.5) | 92.0% (5.5 min) | 97.0% (65.6 min) | -5.0 pp |
Speed: Slopus approximately 12x faster than OmniCoder with thinking on (5.5 min vs 65.6 min, same hardware, same --parallel 2). Slopus reasoning is more concise (approximately 300 tokens avg vs approximately 1500 tokens avg for OmniCoder).
Tradeoff: OmniCoder wins approximately 5pp accuracy on math by re-verifying answers multiple times (Step → Alternative Plan → Verification), Slopus wins approximately 12x throughput. For batch workloads (1000s of queries), Slopus processes 12x the volume per hour.
tinyBenchmarks (100 examples each, custom generative — letter pick, thinking ON + Qwen sampling)
| Benchmark | Slopus 9B Q4 | Time |
|---|---|---|
| tinyMMLU (knowledge) | 76.0% | 6.3 min |
| tinyArc (science Q&A) | 92.0% | 5.4 min |
| tinyHellaswag (common sense) | 65.0% | 5.1 min |
| tinyWinogrande (coreference) | 76.0% | 4.0 min |
| tinyTruthfulQA (truthfulness, mc1) | 61.0% | 5.2 min |
| Average | 74.0% | total ~26 min |
OmniCoder NOT benchmarked on tinyBenchmarks MC because at OmniCoder's pace (approximately 12x slower) it would have taken 5-6h instead of 26 min. The GSM8K comparison is enough signal of where each model stands.
Random baseline for 4-choice MC ≈ 25%. All results are well above baseline. ARC at 92% is excellent for a 9B Q4.
Costs
| Item | Time | Cost |
|---|---|---|
| Phase 1 training, H100 SXM Secure ($3.29/h) | ~1.5h | ~$4.95 |
| Phase 2 training, H100 SXM Secure ($3.29/h) | ~4.8h | ~$15.80 |
| Intermediate merges + quantize on local 3090 | ~2h | $0 (local) |
| HF storage (XET turbo) | - | $0 |
| Total paid compute | ~6.3h cloud | ~$20.75 |
Available quants
| Quant | Size | Recommended use |
|---|---|---|
Slopus-9B-Q2_K.gguf |
~3.6 GB | Only if VRAM <6 GB. Noticeably lower quality. |
Slopus-9B-Q3_K_M.gguf |
~4.5 GB | VRAM ~6 GB. Tight compromise. |
Slopus-9B-Q4_K_M.gguf |
~5.5 GB | Sweet spot, default recommended. VRAM 8 GB+ |
Slopus-9B-Q6_K.gguf |
~7.5 GB | Near-perfect quality. VRAM 10 GB+ |
Slopus-9B-Q8_0.gguf |
~9.5 GB | Almost indistinguishable from fp16. VRAM 12 GB+ |
Usage with llama-server
IMPORTANT: you must use llama.cpp at tag b8292 (commit b54124110). New master does NOT load these GGUFs (it fails with a missing tensor error due to a bug in the converter+loader for Qwen3.5's hybrid GDN architecture).
# Install llama.cpp at b8292
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
git checkout b8292
cmake -B build -DGGML_CUDA=ON
cmake --build build --target llama-quantize llama-server -j$(nproc)
# Serve
export LLAMA_CHAT_TEMPLATE_KWARGS='{"enable_thinking":true}'
./build/bin/llama-server \
--model Slopus-9B-Q4_K_M.gguf \
-ngl 999 -fa on --no-mmap \
-c 65536 \
--parallel 4 \
--jinja --reasoning-format deepseek \
--temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.0 \
--presence-penalty 1.5 \
--port 12345
Phase 2 LoRA adapter (without merge)
Both LoRA adapters are included in this same repo:
adapter_phase1/— final phase 1 adapter (checkpoint-506, r=8 alpha=16), trained onKukedlc/omnicoder-train. Loads on top ofTesslate/OmniCoder-9B.adapter_phase2/— final phase 2 adapter (checkpoint-758, r=128 alpha=256), trained onKukedlc/omnicoder-fase2-reasoning. Loads on top ofTesslate/OmniCoder-9B + adapter_phase1merged. The intermediate merge is NOT published (~18 GB, adds no value vs. the Q4_K_M GGUF directly).
To reproduce Slopus 9B from scratch:
- Apply
adapter_phase1overTesslate/OmniCoder-9B→ merge to fp16 - Apply
adapter_phase2over the merged phase 1 → merge to fp16 - Convert + quantize with
llama.cpp@b8292
Limitations
- r=128 alpha=256 on 24K examples is borderline against the rule-of-thumb "high rank requires a large dataset". The model showed NO visible signs of overfitting (loss still going down at the end of epoch 2), but without an eval set it can't be fully confirmed. For production use ideally build an out-of-distribution benchmark.
- GGUFs require llama.cpp b8292 specifically. Newer versions don't load them. This model is NOT compatible with stock
Ollama(which embeds a newer llama.cpp) — you'd have to patch Ollama or usellama-serverdirectly. - Basic arithmetic in Q4_K_M can be slightly off (the model lists the terms correctly but sometimes sums the final result wrong). Use Q6_K or Q8_0 for tasks requiring numerical precision.
- No RLHF/DPO. SFT only in both phases.
Datasets & credits
Tesslate/OmniCoder-9B— base model. Apache 2.0.Gryphe/Opus-4.6-Reasoning-24k— source of the reasoning dataset (re-rendered to Qwen3.5 chat template with inline<think>). Apache 2.0.- llama.cpp b8292 — GitHub ggml-org/llama.cpp, commit b54124110.
- Unsloth — efficient bf16 LoRA training.
- bartowski for publishing OmniCoder-9B GGUFs at b8292, which served as the reference to identify the correct llama.cpp version.
Reproducing the experiment
Scripts in the Kukedlc/omnicoder-train dataset repo:
train_omnicoder.py— phase 1train_omnicoder_fase2.py— phase 2setup_train.sh/setup_fase2.sh— RunPod pod launcherswatcher_upload.sh/watcher_upload_fase2.sh— automatic checkpoint upload
License
Apache 2.0 — inherited from the OmniCoder-9B base and the Gryphe dataset.
- Downloads last month
- 192
2-bit
3-bit
4-bit
6-bit
8-bit



