Slopus

Slopus 9B

Experimental fine-tuning pipeline on top of Tesslate/OmniCoder-9B (Qwen3.5-9B base + 425K Opus agentic traces).

Two chained LoRA stages: a light coding-style adaptation first, then a heavier capacity injection of pure Opus 4.6 reasoning traces.

Pipeline overview

pipeline

Table of contents

  1. Summary
  2. Pipeline in 5 steps
  3. Phase 1 — LoRA r=8 on OmniCoder-9B
  4. Phase 2 — LoRA r=128 on phase 1 merge
  5. Costs
  6. Available quants
  7. Usage with llama-server
  8. Limitations
  9. Datasets & credits

Summary

Phase 1 Phase 2
Dataset Kukedlc/omnicoder-train (16K, agentic coding) Kukedlc/omnicoder-fase2-reasoning (24K, Opus 4.6 reasoning, derived from Gryphe/Opus-4.6-Reasoning-24k)
Base Tesslate/OmniCoder-9B Tesslate/OmniCoder-9B + phase 1 LoRA merged (fp16)
LoRA rank / alpha r=8, alpha=16 r=128, alpha=256
Targets q,k,v,o,gate,up,down,out_proj (GDN) same + much higher rank
Epochs 2 2
Total steps 506 758
Effective batch 16 (batch=8, GA=8) 64 (batch=16, GA=4)
LR 2e-4 1e-4 (lower, base already fine-tuned)
Max seq 2048 4096
Training hardware RunPod H100 80GB SXM Secure RunPod H100 80GB SXM Secure
Approximate time ~1.5h ~4.8h
Loss start → end 1.14 → 0.88 (-23.1%) 1.03 → 0.88 (-14.3%)

Pipeline in 5 steps

  1. Base: Tesslate/OmniCoder-9B — Qwen3.5-9B with hybrid GDN (Gated Delta Networks) architecture, fine-tuned by Tesslate on 425K Opus agentic coding traces.
  2. Phase 1 LoRA r=8 alpha=16 on top of the base, dataset Kukedlc/omnicoder-train. 2 epochs, 506 steps. Resulting adapter: Kukedlc/omnicoder-9b-lora (checkpoints 100/200/300/400/500/506).
  3. Phase 1 merge: base + phase 1 adapter → ~18 GB HF fp16 model (not published, used only as internal base for phase 2).
  4. Phase 2 LoRA r=128 alpha=256 on top of the merged model, dataset Kukedlc/omnicoder-fase2-reasoning (re-rendered from Gryphe Opus reasoning with <think> inline in the chat template). 2 epochs, 758 steps. Resulting adapter: Kukedlc/omnicoder-9b-fase2-lora (checkpoints 100/200/.../758).
  5. Phase 2 merge + Quantize with llama.cpp tag b8292 (commit b54124110) — newer master and other versions do NOT fully support Qwen3.5's hybrid GDN architecture and fail with missing tensor 'blk.32.attn_norm.weight'. Generated quants: Q2_K, Q3_K_M, Q4_K_M, Q6_K, Q8_0.

Phase 1

phase 1

A conservative LoRA (r=8 alpha=16) on top of OmniCoder-9B to introduce a first style adjustment in the target domain before the heavier injection in phase 2. The idea: don't break the base with a giant LoRA on the first pass.

Observation: loss drops MORE in epoch 2 than in epoch 1 (10% in epoch 1 vs 15% in epoch 2), suggesting the base still had headroom to learn — no overfitting.

Phase 2

phase 2

High-capacity LoRA (r=128 alpha=256, 258M trainable params = 2.67% of the 9B) on top of the phase 1 merge, with a pure reasoning dataset from Opus 4.6 (24K chat examples with separate reasoning_content, re-rendered as <think>X</think>content inline for the Qwen3.5 chat template).

LR lowered to 1e-4 (vs 2e-4 in phase 1) since the model was already fine-tuned. Cosine scheduler with 20 warmup steps.

Recommended sampling parameters (from Qwen3.6 model card, valid for Qwen3.5):

Mode temp top_p pres_pen top_k min_p
Thinking general 1.0 0.95 1.5 20 0
Thinking coding 0.6 0.95 0.0 20 0
Instruct general 0.7 0.80 1.5 20 0

Benchmarks

All benchmarks run locally on RTX 3090, against the Q4_K_M GGUF served via llama-server at b8292. Custom Python harness (_custom_gsm8k_bench.py and _custom_mc_bench.py in the dataset repo) — we send each question + choices to the model and extract the final letter/number with regex. No lm-evaluation-harness because of multiple bugs with the Qwen3.5 hybrid architecture + chat template + logprobs format from llama-server (documented in the issues).

GSM8K — Slopus 9B vs base OmniCoder 9B (100 problems, same Q&A set)

Setting Slopus 9B Q4 OmniCoder 9B Q4 Δ
Greedy + no-thinking 80.0% (5.5 min) 81.0% (~5 min) -1.0 pp
Thinking ON + Qwen sampling (temp=1.0, top_p=0.95, top_k=20, presence_penalty=1.5) 92.0% (5.5 min) 97.0% (65.6 min) -5.0 pp

Speed: Slopus approximately 12x faster than OmniCoder with thinking on (5.5 min vs 65.6 min, same hardware, same --parallel 2). Slopus reasoning is more concise (approximately 300 tokens avg vs approximately 1500 tokens avg for OmniCoder).

Tradeoff: OmniCoder wins approximately 5pp accuracy on math by re-verifying answers multiple times (Step → Alternative Plan → Verification), Slopus wins approximately 12x throughput. For batch workloads (1000s of queries), Slopus processes 12x the volume per hour.

tinyBenchmarks (100 examples each, custom generative — letter pick, thinking ON + Qwen sampling)

Benchmark Slopus 9B Q4 Time
tinyMMLU (knowledge) 76.0% 6.3 min
tinyArc (science Q&A) 92.0% 5.4 min
tinyHellaswag (common sense) 65.0% 5.1 min
tinyWinogrande (coreference) 76.0% 4.0 min
tinyTruthfulQA (truthfulness, mc1) 61.0% 5.2 min
Average 74.0% total ~26 min

OmniCoder NOT benchmarked on tinyBenchmarks MC because at OmniCoder's pace (approximately 12x slower) it would have taken 5-6h instead of 26 min. The GSM8K comparison is enough signal of where each model stands.

Random baseline for 4-choice MC ≈ 25%. All results are well above baseline. ARC at 92% is excellent for a 9B Q4.

Costs

Item Time Cost
Phase 1 training, H100 SXM Secure ($3.29/h) ~1.5h ~$4.95
Phase 2 training, H100 SXM Secure ($3.29/h) ~4.8h ~$15.80
Intermediate merges + quantize on local 3090 ~2h $0 (local)
HF storage (XET turbo) - $0
Total paid compute ~6.3h cloud ~$20.75

Available quants

Quant Size Recommended use
Slopus-9B-Q2_K.gguf ~3.6 GB Only if VRAM <6 GB. Noticeably lower quality.
Slopus-9B-Q3_K_M.gguf ~4.5 GB VRAM ~6 GB. Tight compromise.
Slopus-9B-Q4_K_M.gguf ~5.5 GB Sweet spot, default recommended. VRAM 8 GB+
Slopus-9B-Q6_K.gguf ~7.5 GB Near-perfect quality. VRAM 10 GB+
Slopus-9B-Q8_0.gguf ~9.5 GB Almost indistinguishable from fp16. VRAM 12 GB+

Usage with llama-server

IMPORTANT: you must use llama.cpp at tag b8292 (commit b54124110). New master does NOT load these GGUFs (it fails with a missing tensor error due to a bug in the converter+loader for Qwen3.5's hybrid GDN architecture).

# Install llama.cpp at b8292
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
git checkout b8292
cmake -B build -DGGML_CUDA=ON
cmake --build build --target llama-quantize llama-server -j$(nproc)

# Serve
export LLAMA_CHAT_TEMPLATE_KWARGS='{"enable_thinking":true}'
./build/bin/llama-server \
    --model Slopus-9B-Q4_K_M.gguf \
    -ngl 999 -fa on --no-mmap \
    -c 65536 \
    --parallel 4 \
    --jinja --reasoning-format deepseek \
    --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.0 \
    --presence-penalty 1.5 \
    --port 12345

Phase 2 LoRA adapter (without merge)

Both LoRA adapters are included in this same repo:

  • adapter_phase1/ — final phase 1 adapter (checkpoint-506, r=8 alpha=16), trained on Kukedlc/omnicoder-train. Loads on top of Tesslate/OmniCoder-9B.
  • adapter_phase2/ — final phase 2 adapter (checkpoint-758, r=128 alpha=256), trained on Kukedlc/omnicoder-fase2-reasoning. Loads on top of Tesslate/OmniCoder-9B + adapter_phase1 merged. The intermediate merge is NOT published (~18 GB, adds no value vs. the Q4_K_M GGUF directly).

To reproduce Slopus 9B from scratch:

  1. Apply adapter_phase1 over Tesslate/OmniCoder-9B → merge to fp16
  2. Apply adapter_phase2 over the merged phase 1 → merge to fp16
  3. Convert + quantize with llama.cpp@b8292

Limitations

  • r=128 alpha=256 on 24K examples is borderline against the rule-of-thumb "high rank requires a large dataset". The model showed NO visible signs of overfitting (loss still going down at the end of epoch 2), but without an eval set it can't be fully confirmed. For production use ideally build an out-of-distribution benchmark.
  • GGUFs require llama.cpp b8292 specifically. Newer versions don't load them. This model is NOT compatible with stock Ollama (which embeds a newer llama.cpp) — you'd have to patch Ollama or use llama-server directly.
  • Basic arithmetic in Q4_K_M can be slightly off (the model lists the terms correctly but sometimes sums the final result wrong). Use Q6_K or Q8_0 for tasks requiring numerical precision.
  • No RLHF/DPO. SFT only in both phases.

Datasets & credits

  • Tesslate/OmniCoder-9B — base model. Apache 2.0.
  • Gryphe/Opus-4.6-Reasoning-24k — source of the reasoning dataset (re-rendered to Qwen3.5 chat template with inline <think>). Apache 2.0.
  • llama.cpp b8292 — GitHub ggml-org/llama.cpp, commit b54124110.
  • Unsloth — efficient bf16 LoRA training.
  • bartowski for publishing OmniCoder-9B GGUFs at b8292, which served as the reference to identify the correct llama.cpp version.

Reproducing the experiment

Scripts in the Kukedlc/omnicoder-train dataset repo:

  • train_omnicoder.py — phase 1
  • train_omnicoder_fase2.py — phase 2
  • setup_train.sh / setup_fase2.sh — RunPod pod launchers
  • watcher_upload.sh / watcher_upload_fase2.sh — automatic checkpoint upload

License

Apache 2.0 — inherited from the OmniCoder-9B base and the Gryphe dataset.

Downloads last month
192
GGUF
Model size
9B params
Architecture
qwen35
Hardware compatibility
Log In to add your hardware

2-bit

3-bit

4-bit

6-bit

8-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for Kukedlc/Slopus-9B

Finetuned
Qwen/Qwen3.5-9B
Adapter
(2)
this model