Libraries

How to use olka-fi/MiniMax-M2.7-MXFP4 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="olka-fi/MiniMax-M2.7-MXFP4", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("olka-fi/MiniMax-M2.7-MXFP4", trust_remote_code=True)
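# device_map="auto" (requires the accelerate package) spreads the weights across the available devices
model = AutoModelForCausalLM.from_pretrained("olka-fi/MiniMax-M2.7-MXFP4", trust_remote_code=True, device_map="auto")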
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use olka-fi/MiniMax-M2.7-MXFP4 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "olka-fi/MiniMax-M2.7-MXFP4" --trust-remote-code
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "olka-fi/MiniMax-M2.7-MXFP4",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

SGLang

How to use olka-fi/MiniMax-M2.7-MXFP4 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "olka-fi/MiniMax-M2.7-MXFP4" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "olka-fi/MiniMax-M2.7-MXFP4",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "olka-fi/MiniMax-M2.7-MXFP4" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "olka-fi/MiniMax-M2.7-MXFP4",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use olka-fi/MiniMax-M2.7-MXFP4 with Docker Model Runner:
```
docker model run hf.co/olka-fi/MiniMax-M2.7-MXFP4
```

MiniMax-M2.7-MXFP4

MXFP4 quantization of MiniMax-M2.7 (228B params, 62 layers, 256 experts/layer, top-8 sigmoid routing).

The weights of all 15,872 MoE experts (62 layers x 256 experts) are quantized to MXFP4. Attention, layer norms, embeddings, and router weights are kept at their original precision.

Metric	Base (FP8)	MXFP4
Size	215 GB	119 GB
Perplexity (WikiText-2)	4.997	5.063 (+1.34%)
KL divergence	--	0.174 nats/tok (mean), 0.031 (median)
Top-1 agreement	--	85.8%
Compression	1x	1.81x
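
The compression figure can be sanity-checked against the storage format. A minimal sketch, assuming each MXFP4 weight costs 4 bits plus its share of one 8-bit scale per 32-element block; the function name is hypothetical and only the sizes come from the table above:

```python
def mxfp4_bits_per_weight(block_size: int = 32, elem_bits: int = 4, scale_bits: int = 8) -> float:
    """Effective storage cost per weight, including the shared block scale."""
    return elem_bits + scale_bits / block_size


bits = mxfp4_bits_per_weight()
print(bits)  # 4.25

# Upper bound on compression if every weight had gone from 8 bits to MXFP4.
print(round(8 / bits, 2))

# Measured ratio from the table (215 GB -> 119 GB).
print(round(215 / 119, 2))
```

The measured ratio sits below the 8-bit-to-MXFP4 bound, as expected when attention, norms, embeddings, and router weights stay at their original precision.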

Quality Analysis

KLD is heavily right-skewed: the median is 0.031 nats/tok, 5.6x lower than the mean of 0.174. 96.6% of tokens have KLD < 1 nat; only 69 of the 2048 eval tokens (3.4%) exceed 1 nat -- these are low-confidence positions where the model is already distributing probability across many candidates.

Error is diffuse across experts: per-expert quantization error analysis of all 15,872 experts shows extremely uniform error (std=0.000271, range 0.110--0.116). The 256-expert top-8 architecture is inherently quantization-tolerant -- each of the 8 active experts contributes roughly 1/8 of the output, so MXFP4 errors tend to average out across the mixture.
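
The averaging argument can be illustrated with a toy simulation. This is only a sketch under strong assumptions that the card does not verify: each active expert's error is independent, zero-mean Gaussian noise, and the 8 routing weights are equal. Real routing weights are unequal and errors may be correlated. The function name is hypothetical.

```python
import numpy as np


def mixture_error_std(num_active: int = 8, sigma: float = 1.0,
                      trials: int = 200_000, seed: int = 0) -> float:
    """Std of the combined error when num_active independent errors are averaged."""
    rng = np.random.default_rng(seed)
    # One independent error sample per active expert and trial (assumption).
    errors = rng.normal(0.0, sigma, size=(trials, num_active))
    # Equal routing weights of 1/num_active (assumption).
    weights = np.full(num_active, 1.0 / num_active)
    combined = errors @ weights
    return float(combined.std())


# Under these assumptions the result is close to sigma / sqrt(8).
print(mixture_error_std(), 8 ** -0.5)
```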

Format

MXFP4 block-32 quantization in compressed-tensors format:

weight_packed: uint8 [out, in//2] -- two 4-bit values packed per byte (even=low nibble, odd=high nibble)
weight_scale: uint8 e8m0 [out, in//32] -- one shared exponent per block of 32 elements
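
To make the layout concrete, here is a minimal NumPy sketch of dequantizing one tensor. It assumes the 4-bit codes follow the E2M1 encoding from the OCP Microscaling (MX) specification (sign bit plus magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6) and that the e8m0 byte encodes a scale of 2**(e - 127). The function name is hypothetical, and this is not the loader that vLLM or compressed-tensors uses.

```python
import numpy as np

# E2M1 magnitudes indexed by the low 3 bits of each 4-bit code; bit 3 is the sign.
FP4_E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)


def dequantize_mxfp4(weight_packed: np.ndarray, weight_scale: np.ndarray,
                     block: int = 32) -> np.ndarray:
    """weight_packed: uint8 [out, in//2]; weight_scale: uint8 [out, in//block]."""
    out_dim, half_in = weight_packed.shape
    codes = np.empty((out_dim, half_in * 2), dtype=np.uint8)
    codes[:, 0::2] = weight_packed & 0x0F  # even elements: low nibble
    codes[:, 1::2] = weight_packed >> 4    # odd elements: high nibble

    magnitude = FP4_E2M1[codes & 0x07]
    sign = np.where(codes & 0x08, -1.0, 1.0).astype(np.float32)

    # E8M0 scale: a power of two with bias 127 (the NaN code 255 is not handled here).
    scale = np.exp2(weight_scale.astype(np.float32) - 127.0)
    return sign * magnitude * np.repeat(scale, block, axis=1)
```

For example, a packed byte of 0x21 with a scale byte of 128 decodes to 1.0 for the even element (0.5 x 2) and 2.0 for the odd element (1.0 x 2).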

Quantization is calibration-free: MXFP4 block-32 scaling is deterministic, with the shared exponent derived directly from the max magnitude in each block, so no calibration data is needed.
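
A minimal sketch of how a shared exponent can be derived from a block's max magnitude, following the convention in the OCP Microscaling (MX) specification (floor(log2(max)) minus the largest E2M1 exponent, which is 2). Whether quant4 uses exactly this rule and rounding mode is not stated here, so treat it as an illustration; the function name is hypothetical.

```python
import numpy as np

# Representable E2M1 magnitudes; the index is the 3-bit magnitude code.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float64)


def quantize_block(block: np.ndarray) -> tuple[int, np.ndarray]:
    """Return (e8m0 scale byte, 4-bit codes) for one block of values."""
    max_abs = float(np.max(np.abs(block)))
    if max_abs == 0.0:
        return 127, np.zeros(block.shape, dtype=np.uint8)

    # Shared exponent from the block max; 2 is the largest E2M1 exponent (6 = 1.5 * 2**2).
    shared_exp = int(np.floor(np.log2(max_abs))) - 2
    shared_exp = max(-127, min(127, shared_exp))
    scaled = block.astype(np.float64) / 2.0 ** shared_exp

    # Nearest grid point; magnitudes above 6 saturate to 6. Ties go to the
    # smaller magnitude here, which may differ from a real implementation.
    idx = np.abs(np.abs(scaled)[:, None] - FP4_GRID[None, :]).argmin(axis=1)
    codes = idx.astype(np.uint8) | (np.signbit(scaled).astype(np.uint8) << 3)
    return shared_exp + 127, codes
```

Because the scale depends only on the values in the block, the same weights always produce the same output, with no calibration set involved.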

Quantized with quant4.

Serving

vLLM

Requires vLLM with MXFP4 compressed-tensors support and the CUTLASS FP4xFP8 kernel for Blackwell GPUs:

vllm serve /path/to/MiniMax-M2.7-MXFP4 \
    --tensor-parallel-size 2 \
    --trust-remote-code \
    --max-num-seqs 512 \
    --enable-chunked-prefill \
    --max-num-batched-tokens 16384 \
    --kv-cache-dtype fp8
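
Once the server is up, it can be queried through the OpenAI-compatible API. A minimal standard-library sketch, assuming the default vLLM port 8000 and no --served-model-name override, in which case the model name in the request is the path passed to vllm serve. The helper names are hypothetical.

```python
import json
import urllib.request


def build_chat_request(prompt: str,
                       model: str = "/path/to/MiniMax-M2.7-MXFP4",
                       base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build a POST request for the /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str) -> str:
    """Send the request and return the assistant's reply (needs a running server)."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Calling ask("What is the capital of France?") sends the same request as the curl examples earlier on this page.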

Memory Budget

At 119 GB, the model fits on 2x DGX Spark (2x 120 GB = 240 GB total) with ~100 GB remaining for KV cache. That headroom enables long-context or multi-session serving that would be impossible with the 215 GB FP8 original, which leaves at most 25 GB on the same hardware.
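
The budget can be written out as simple arithmetic. A rough sketch: the 20 GB overhead for activations, runtime, and OS is a placeholder assumption chosen for illustration, not a measurement, and the function name is hypothetical.

```python
def kv_cache_headroom_gb(total_gb: float = 240.0, weights_gb: float = 119.0,
                         overhead_gb: float = 20.0) -> float:
    """Memory left for KV cache after weights and an assumed fixed overhead."""
    return total_gb - weights_gb - overhead_gb


print(kv_cache_headroom_gb())                  # 101.0 for the MXFP4 checkpoint
print(kv_cache_headroom_gb(weights_gb=215.0))  # 5.0 for the FP8 original
```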

Evaluation Details

Evaluated on 2048 tokens of the WikiText-2 test set using layer-by-layer streaming inference with MiniMaxLayerRunner. Both models run identical forward passes, and logits are compared token by token.

Metric	Value
Perplexity (ref)	4.997
Perplexity (MXFP4)	5.063
PPL degradation	+1.34%
KL(ref || target) mean	0.174 nats/tok
KL(ref || target) median	0.031 nats/tok
KL(ref || target) P95	0.824 nats/tok
KL(ref || target) P99	1.827 nats/tok
Top-1 agreement	85.8%
Tokens with KLD > 1 nat	69 / 2048 (3.4%)
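
The metrics in this table can be computed from two sets of logits over the same token positions. A generic NumPy sketch, not the MiniMaxLayerRunner code (which is not shown here); the function names and argument shapes are assumptions: logits are [tokens, vocab] and labels are the next-token ids.

```python
import numpy as np


def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def compare_logits(ref_logits: np.ndarray, tgt_logits: np.ndarray,
                   labels: np.ndarray) -> dict:
    """Perplexity, KL(ref || target) statistics, and top-1 agreement."""
    ref_lp = log_softmax(ref_logits.astype(np.float64))
    tgt_lp = log_softmax(tgt_logits.astype(np.float64))

    # Per-token KL(ref || target) in nats.
    kl = (np.exp(ref_lp) * (ref_lp - tgt_lp)).sum(axis=-1)
    rows = np.arange(len(labels))
    return {
        "ppl_ref": float(np.exp(-ref_lp[rows, labels].mean())),
        "ppl_target": float(np.exp(-tgt_lp[rows, labels].mean())),
        "kl_mean": float(kl.mean()),
        "kl_median": float(np.median(kl)),
        "kl_p95": float(np.percentile(kl, 95)),
        "kl_p99": float(np.percentile(kl, 99)),
        "top1_agreement": float((ref_lp.argmax(-1) == tgt_lp.argmax(-1)).mean()),
        "frac_kl_gt_1": float((kl > 1.0).mean()),
    }
```

Here ref_logits would come from the FP8 base model and tgt_logits from the MXFP4 checkpoint.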

Acknowledgments

Based on MiniMax-M2.7 by MiniMax. Original model license applies.

Base model: MiniMaxAI/MiniMax-M2.7