Instructions for using RedHatAI/Qwen3.5-9B-FP8-dynamic with libraries, inference providers, notebooks, and local apps. Follow the links below to get started.
- Libraries
- Transformers
How to use RedHatAI/Qwen3.5-9B-FP8-dynamic with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="RedHatAI/Qwen3.5-9B-FP8-dynamic")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("RedHatAI/Qwen3.5-9B-FP8-dynamic")
model = AutoModelForImageTextToText.from_pretrained("RedHatAI/Qwen3.5-9B-FP8-dynamic")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use RedHatAI/Qwen3.5-9B-FP8-dynamic with vLLM:
Install from pip and serve the model
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "RedHatAI/Qwen3.5-9B-FP8-dynamic"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "RedHatAI/Qwen3.5-9B-FP8-dynamic",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe this image in one sentence."},
          {"type": "image_url", "image_url": {"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"}}
        ]
      }
    ]
  }'
Use Docker
docker model run hf.co/RedHatAI/Qwen3.5-9B-FP8-dynamic
- SGLang
How to use RedHatAI/Qwen3.5-9B-FP8-dynamic with SGLang:
Install from pip and serve the model
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "RedHatAI/Qwen3.5-9B-FP8-dynamic" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "RedHatAI/Qwen3.5-9B-FP8-dynamic",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe this image in one sentence."},
          {"type": "image_url", "image_url": {"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"}}
        ]
      }
    ]
  }'
Use Docker images
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "RedHatAI/Qwen3.5-9B-FP8-dynamic" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "RedHatAI/Qwen3.5-9B-FP8-dynamic",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe this image in one sentence."},
          {"type": "image_url", "image_url": {"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"}}
        ]
      }
    ]
  }'
- Docker Model Runner
How to use RedHatAI/Qwen3.5-9B-FP8-dynamic with Docker Model Runner:
docker model run hf.co/RedHatAI/Qwen3.5-9B-FP8-dynamic
Qwen3.5-9B-FP8-dynamic
Model Overview
- Model Architecture: Qwen3_5ForConditionalGeneration (base model: Qwen/Qwen3.5-9B)
- Input: Text / Image
- Output: Text
- Model Optimizations:
- Weight quantization: FP8
- Activation quantization: FP8
- Model size: 14.0 GB (reduced from 19.3 GB in BF16)
- Release Date: 2026-05-11
- Version: 1.0
- Model Developers: RedHatAI
This model is a quantized version of Qwen/Qwen3.5-9B. Evaluation results and reproduction steps are provided below.
Model Optimizations
This model was obtained by quantizing the weights and activations of Qwen/Qwen3.5-9B to the FP8 data type, making it ready for inference with vLLM.
This optimization reduces the model weights from 19.3 GB to 14.0 GB on disk (~27% reduction). Activations are quantized dynamically at inference time using per-tensor scaling, requiring no calibration data.
Only the weights and activations of the linear operators within transformer blocks are quantized using LLM Compressor.
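The size reduction follows from the datatype change: FP8 stores one byte per weight versus two for BF16 in the quantized linear layers, while the ignored layers stay in the original precision. As a rough illustration of the dynamic activation quantization described above, here is a minimal PyTorch sketch; the scaling shown is illustrative only and is not the exact kernel vLLM uses:
import torch

def fp8_dynamic_quantize(x: torch.Tensor):
    """Quantize a tensor to FP8 (e4m3) with a dynamically computed scale.

    The scale is derived from the tensor's max absolute value at inference
    time, which is why no calibration data is needed.
    """
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3
    scale = x.abs().max().clamp(min=1e-12) / fp8_max
    x_fp8 = (x / scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
    return x_fp8, scale  # dequantize later as x_fp8.float() * scale

activations = torch.randn(4, 4096)
q, s = fp8_dynamic_quantize(activations)
print(q.dtype, s.item())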
Deployment
Use with vLLM
- Initialize the vLLM server:
Multimodal (vision + text):
vllm serve RedHatAI/Qwen3.5-9B-FP8-dynamic \
--reasoning-parser qwen3 \
--max-model-len 262144
Text-only (lower memory):
vllm serve RedHatAI/Qwen3.5-9B-FP8-dynamic \
--reasoning-parser qwen3 \
--max-model-len 262144 \
--language-model-only
- Send requests to the server:
from openai import OpenAI
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(
api_key=openai_api_key,
base_url=openai_api_base,
)
model = "RedHatAI/Qwen3.5-9B-FP8-dynamic"
messages = [
{"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
]
outputs = client.chat.completions.create(
model=model,
messages=messages,
)
generated_text = outputs.choices[0].message.content
print(generated_text)
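Because the model also accepts image input, the same OpenAI-compatible endpoint can serve multimodal requests when the server is launched in the multimodal configuration above (i.e., without --language-model-only). A minimal sketch that reuses the client and model from the previous snippet:
multimodal_messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in one sentence."},
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
                },
            },
        ],
    },
]
# Same chat-completions call as above, now with an image in the content list.
outputs = client.chat.completions.create(model=model, messages=multimodal_messages)
print(outputs.choices[0].message.content)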
Creation
This model was created by applying LLM Compressor using data-free FP8 dynamic quantization, as presented in the code snippet below.
from compressed_tensors.utils import save_mtp_tensors_to_checkpoint
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from transformers import AutoProcessor, AutoTokenizer, Qwen3_5ForConditionalGeneration
MODEL_ID = "Qwen/Qwen3.5-9B"
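# Keep these modules in the original precision: LM head, embeddings,
# vision tower, and linear attention layers.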
IGNORE_LAYERS = [
"re:.*lm_head",
"re:.*embed_tokens$",
"re:.*visual.*",
"re:.*model.visual.*",
"re:.*linear_attn.*",
]
model = Qwen3_5ForConditionalGeneration.from_pretrained(MODEL_ID, dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
processor = AutoProcessor.from_pretrained(MODEL_ID)
recipe = QuantizationModifier(
targets="Linear",
scheme="FP8_DYNAMIC",
ignore=IGNORE_LAYERS,
)
oneshot(model=model, recipe=recipe)
model.save_pretrained("Qwen3.5-9B-FP8-dynamic", save_compressed=True)
processor.save_pretrained("Qwen3.5-9B-FP8-dynamic")
save_mtp_tensors_to_checkpoint(source_model=MODEL_ID, dest_dir="Qwen3.5-9B-FP8-dynamic")
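As a quick sanity check on the saved checkpoint, one can confirm that a quantization config was written into it. A minimal sketch, assuming the output directory created above (the exact contents of quantization_config depend on the llm-compressor version):
import json

with open("Qwen3.5-9B-FP8-dynamic/config.json") as f:
    config = json.load(f)

# A compressed checkpoint should carry a quantization_config entry describing
# the FP8 weight/activation scheme; an empty result would suggest the save
# step did not compress the model.
print(json.dumps(config.get("quantization_config", {}), indent=2))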
Package versions
- llm-compressor==0.10.1.dev44+g437f8afe
- compressed-tensors==0.14.1a20260325
- transformers==5.3.0
- vllm==0.18.1
- lm-eval: neuralmagic/lm-evaluation-harness@741f1d8 (branch: mmlu-pro-chat-variant)
- lighteval: neuralmagic/lighteval@6f0f351 (branch: eldar-fix-litellm)
Evaluation
This model was evaluated on GSM8k-Platinum, MMLU-Pro, IFEval, Math 500, AIME 2025, and GPQA Diamond using lm-evaluation-harness and lighteval, with inference served via vLLM. Recovery reports the quantized model's score as a percentage of the BF16 baseline's score; for example, GSM8k-Platinum recovery is 94.5 / 94.7 ≈ 99.8%.
Accuracy
| Category | Benchmark | Qwen/Qwen3.5-9B | RedHatAI/Qwen3.5-9B-FP8-dynamic | Recovery |
|---|---|---|---|---|
| Instruction Following | GSM8k-Platinum (0-shot) | 94.7% | 94.5% | 99.8% |
| | MMLU-Pro (0-shot) | 82.5% | 82.4% | 99.9% |
| | IFEval — prompt strict (0-shot) | 90.3% | 88.9% | 98.4% |
| | IFEval — instruction strict (0-shot) | 92.9% | 92.0% | 99.0% |
| Reasoning | Math 500 (0-shot) | 85.0% | 84.7% | 99.7% |
| | AIME 2025 (0-shot) | 88.3% | 87.9% | 99.5% |
| | GPQA Diamond (0-shot) | 84.0% | 83.8% | 99.8% |
Reproduction
The results were obtained using the following commands. GSM8k-Platinum, MMLU-Pro, IFEval, Math 500, and GPQA Diamond were each run 3 times with different seeds and results averaged. AIME 2025 was run 8 times. The vLLM server was started with --language-model-only for all evaluations.
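A minimal sketch of the per-seed averaging step (the file names and JSON layout here are hypothetical; lm-eval's actual output paths and metric keys may differ):
import json
from statistics import mean

# Hypothetical per-seed output files from the GSM8k-Platinum runs below.
seed_files = [f"results_gsm8k_platinum_seed{seed}.json" for seed in (42, 1234, 4158)]

scores = []
for path in seed_files:
    with open(path) as f:
        results = json.load(f)
    # Assumed location of the metric inside each results file.
    scores.append(results["results"]["gsm8k_platinum_cot_llama"]["exact_match,strict-match"])

print(f"Average over {len(scores)} seeds: {mean(scores):.4f}")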
GSM8k-Platinum (lm-eval, 0-shot, 3 repetitions)
lm_eval --model local-chat-completions \
--tasks gsm8k_platinum_cot_llama \
--model_args "model=RedHatAI/Qwen3.5-9B-FP8-dynamic,max_length=96000,base_url=http://0.0.0.0:8000/v1/chat/completions,num_concurrent=100,max_retries=3,tokenized_requests=False,tokenizer_backend=None,timeout=3600" \
--num_fewshot 0 \
--apply_chat_template \
--output_path results_gsm8k_platinum.json \
--seed <SEED> \
--gen_kwargs "do_sample=True,temperature=1.0,top_p=0.95,top_k=20,min_p=0.0,presence_penalty=1.5,repetition_penalty=1.0,max_gen_toks=65536,seed=<SEED>"
Seeds used: 42, 1234, 4158
MMLU-Pro (lm-eval, 0-shot, 3 repetitions)
lm_eval --model local-chat-completions \
--tasks mmlu_pro_chat \
--model_args "model=RedHatAI/Qwen3.5-9B-FP8-dynamic,max_length=96000,base_url=http://0.0.0.0:8000/v1/chat/completions,num_concurrent=100,max_retries=3,tokenized_requests=False,tokenizer_backend=None,timeout=3600" \
--num_fewshot 0 \
--apply_chat_template \
--output_path results_mmlu_pro.json \
--seed <SEED> \
--gen_kwargs "do_sample=True,temperature=1.0,top_p=0.95,top_k=20,min_p=0.0,presence_penalty=1.5,repetition_penalty=1.0,max_gen_toks=65536,seed=<SEED>"
Seeds used: 42, 1234, 4158
IFEval (lm-eval, 0-shot, 3 repetitions)
lm_eval --model local-chat-completions \
--tasks ifeval \
--model_args "model=RedHatAI/Qwen3.5-9B-FP8-dynamic,max_length=96000,base_url=http://0.0.0.0:8000/v1/chat/completions,num_concurrent=100,max_retries=3,tokenized_requests=False,tokenizer_backend=None,timeout=3600" \
--num_fewshot 0 \
--apply_chat_template \
--output_path results_ifeval.json \
--seed <SEED> \
--gen_kwargs "do_sample=True,temperature=1.0,top_p=0.95,top_k=20,min_p=0.0,presence_penalty=1.5,repetition_penalty=1.0,max_gen_toks=65536,seed=<SEED>"
Seeds used: 42, 1234, 4158
Math 500 (lighteval, 0-shot, 3 repetitions)
lighteval endpoint litellm \
"model_name=hosted_vllm/RedHatAI/Qwen3.5-9B-FP8-dynamic,provider=hosted_vllm,base_url=http://0.0.0.0:8000/v1,timeout=3600,concurrent_requests=100,generation_parameters={temperature:1.0,max_new_tokens:65536,top_p:0.95,top_k:20,min_p:0.0,presence_penalty:1.5,repetition_penalty:1.0,seed:<SEED>}" \
"math_500@k=1@n=1|0" \
--output-dir results_math500 \
--save-details
Seeds used: 42, 1234, 4158
AIME 2025 (lighteval, 0-shot, 8 repetitions)
lighteval endpoint litellm \
"model_name=hosted_vllm/RedHatAI/Qwen3.5-9B-FP8-dynamic,provider=hosted_vllm,base_url=http://0.0.0.0:8000/v1,timeout=3600,concurrent_requests=100,generation_parameters={temperature:1.0,max_new_tokens:65536,top_p:0.95,top_k:20,min_p:0.0,presence_penalty:1.5,repetition_penalty:1.0,seed:<SEED>}" \
"aime25@k=1@n=1|0" \
--output-dir results_aime25 \
--save-details
Seeds used: 42, 1234, 1356, 3344, 4158, 5322, 5678, 9843
GPQA Diamond (lighteval, 0-shot, 3 repetitions)
lighteval endpoint litellm \
"model_name=hosted_vllm/RedHatAI/Qwen3.5-9B-FP8-dynamic,provider=hosted_vllm,base_url=http://0.0.0.0:8000/v1,timeout=3600,concurrent_requests=100,generation_parameters={temperature:1.0,max_new_tokens:65536,top_p:0.95,top_k:20,min_p:0.0,presence_penalty:1.5,repetition_penalty:1.0,seed:<SEED>}" \
"gpqa:diamond@k=1@n=1|0" \
--output-dir results_gpqa_diamond \
--save-details
Seeds used: 42, 1234, 4158