---
license: apache-2.0
pipeline_tag: audio-text-to-text
library_name: transformers
tags:
- audio-reasoning
- chain-of-thought
- multi-modal
- step-audio-r1
base_model:
- stepfun-ai/Step-Audio-R1
---
## Step-Audio-R1-NVFP4A16 (Quantized)

This is a **quantized version** of Step-Audio-R1 using NVFP4A16 quantization via [LLM Compressor](https://github.com/vllm-project/llm-compressor).

### Quantization Details

- **Scheme**: NVFP4A16 (FP4 weights with FP16 activations)
- **Target layers**: All Linear layers (except `encoder`, `adapter`, `lm_head`)
- **Group size**: 16
- **Method**: Post-Training Quantization (PTQ)

### Quantization Code
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "stepfun-ai/Step-Audio-R1"

# Load model
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Configure the quantization algorithm and scheme
# Quantize weights to FP4 with group size 16 via PTQ
recipe = QuantizationModifier(targets="Linear", scheme="NVFP4A16", ignore=["lm_head", "re:encoder.*", "re:adapter.*"])

# Apply quantization
oneshot(model=model, recipe=recipe)

# Save to disk in compressed-tensors format
SAVE_DIR = "Step-Audio-R1-NVFP4A16"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```
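
You can sanity-check the saved checkpoint by inspecting the `quantization_config` block that LLM Compressor writes into `config.json`. This is a minimal verification sketch (it only reads the config file; the key names follow the compressed-tensors format):

```python
import json
from pathlib import Path

SAVE_DIR = "Step-Audio-R1-NVFP4A16"

# Load the saved checkpoint's config and inspect the quantization metadata
# written by LLM Compressor in compressed-tensors format.
config = json.loads((Path(SAVE_DIR) / "config.json").read_text())
qcfg = config.get("quantization_config", {})

print("quant_method:", qcfg.get("quant_method"))  # expected: "compressed-tensors"
print("format:", qcfg.get("format"))              # NVFP4 packed-weight format
print("ignored modules:", qcfg.get("ignore"))     # should list lm_head / encoder / adapter
```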


## Step-Audio-R1

 ✨ [Demo Page](https://stepaudiollm.github.io/step-audio-r1/) 
| 🎮 [Playground](https://huggingface.co/spaces/stepfun-ai/Step-Audio-R1) 
| 🌟 [GitHub](https://github.com/stepfun-ai/Step-Audio-R1) 
| 📑 [Paper](https://arxiv.org/abs/2511.15848) 

Step-Audio-R1 is the **first audio language model to successfully unlock Chain-of-Thought (CoT) reasoning**. 
It solves the "inverted scaling" problem that plagues existing audio models, where performance degrades as 
reasoning chains grow longer, and it is the first model to demonstrate that in audio, as in text and vision, 
allocating more compute at test time predictably improves performance.

We found the root cause of this anomaly: models were engaging in **textual surrogate reasoning** 
(analyzing transcripts, not audio) due to a modality mismatch. To solve this, we introduce 
**Modality-Grounded Reasoning Distillation (MGRD)**, an iterative training framework that shifts the model's 
reasoning from textual abstractions to acoustic properties.

This new approach allows us to create **Step-Audio-R1**, which:
- Is the **first audio reasoning model** that successfully benefits from test-time compute scaling.
- Surpasses **Gemini 2.5 Pro** and is comparable to **Gemini 3** across major audio reasoning tasks.
- Transforms extended deliberation from a liability into a **powerful asset** for audio intelligence.

## Features
- **Chain-of-Thought (CoT) Reasoning**
  - First audio language model to successfully unlock Chain-of-Thought reasoning capabilities.
  - Generates audio-relevant reasoning chains that are genuinely grounded in acoustic features.
 
- **Modality-Grounded Reasoning Distillation (MGRD)**
  - Innovative iterative training framework that shifts reasoning from textual abstractions to acoustic properties.
  - Solves the modality mismatch problem that caused textual surrogate reasoning in previous models.
    
- **Superior Performance**
  - Surpasses **Gemini 2.5 Pro** across comprehensive audio understanding and reasoning benchmarks.
  - Comparable to **Gemini 3** across major audio reasoning tasks.
  - Surpasses **Qwen3** in textual reasoning.
  - Covers speech, environmental sounds, and music domains.


For more examples, see the [demo page](https://stepaudiollm.github.io/step-audio-r1/).

## Model Usage
### 📜 Requirements
- **GPU**: NVIDIA GPUs with CUDA support (tested on 4×L40S/H100/H800/H20).
- **Operating System**: Linux.
- **Python**: >= 3.10.0.

### ⬇️ Download Model
First, you need to download the Step-Audio-R1 model weights.

**Method A · Git LFS**
   ```bash
   git lfs install
   git clone https://huggingface.co/stepfun-ai/Step-Audio-R1
   ```

**Method B · Hugging Face CLI**
   ```bash
   hf download stepfun-ai/Step-Audio-R1 --local-dir ./Step-Audio-R1
   ```

### 🚀 Deployment and Execution
We provide two ways to serve the model: Docker (recommended) or compiling the customized vLLM backend.

#### 🐳 Method 1 · Run with Docker (Recommended)

A customized vLLM image is required.

1.  **Pull the image**:
```bash
docker pull stepfun2025/vllm:step-audio-2-v20250909
```
2.  **Start the service**:
    Assuming the model is downloaded in the `Step-Audio-R1` folder in the current directory.

    ```bash
    docker run --rm -ti --gpus all \
        -v $(pwd)/Step-Audio-R1:/Step-Audio-R1 \
        -p 9999:9999 \
        stepfun2025/vllm:step-audio-2-v20250909 \
        -- vllm serve /Step-Audio-R1 \
        --served-model-name Step-Audio-R1 \
        --port 9999 \
        --max-model-len 16384 \
        --max-num-seqs 32 \
        --tensor-parallel-size 4 \
        --chat-template '{%- macro render_content(content) -%}{%- if content is string -%}{{- content.replace("<audio_patch>\n", "<audio_patch>") -}}{%- elif content is mapping -%}{{- content['"'"'value'"'"'] if '"'"'value'"'"' in content else content['"'"'text'"'"'] -}}{%- elif content is iterable -%}{%- for item in content -%}{%- if item.type == '"'"'text'"'"' -%}{{- item['"'"'value'"'"'] if '"'"'value'"'"' in item else item['"'"'text'"'"'] -}}{%- elif item.type == '"'"'audio'"'"' -%}<audio_patch>{%- endif -%}{%- endfor -%}{%- endif -%}{%- endmacro -%}{%- if tools -%}{{- '"'"'<|BOT|>system\n'"'"' -}}{%- if messages[0]['"'"'role'"'"'] == '"'"'system'"'"' -%}{{- render_content(messages[0]['"'"'content'"'"']) + '"'"'<|EOT|>'"'"' -}}{%- endif -%}{{- '"'"'<|BOT|>tool_json_schemas\n'"'"' + tools|tojson + '"'"'<|EOT|>'"'"' -}}{%- else -%}{%- if messages[0]['"'"'role'"'"'] == '"'"'system'"'"' -%}{{- '"'"'<|BOT|>system\n'"'"' + render_content(messages[0]['"'"'content'"'"']) + '"'"'<|EOT|>'"'"' -}}{%- endif -%}{%- endif -%}{%- for message in messages -%}{%- if message["role"] == "user" -%}{{- '"'"'<|BOT|>human\n'"'"' + render_content(message["content"]) + '"'"'<|EOT|>'"'"' -}}{%- elif message["role"] == "assistant" -%}{{- '"'"'<|BOT|>assistant\n'"'"' + (render_content(message["content"]) if message["content"] else '"'"''"'"') -}}{%- set is_last_assistant = true -%}{%- for m in messages[loop.index:] -%}{%- if m["role"] == "assistant" -%}{%- set is_last_assistant = false -%}{%- endif -%}{%- endfor -%}{%- if not is_last_assistant -%}{{- '"'"'<|EOT|>'"'"' -}}{%- endif -%}{%- elif message["role"] == "function_output" -%}{%- else -%}{%- if not (loop.first and message["role"] == "system") -%}{{- '"'"'<|BOT|>'"'"' + message["role"] + '"'"'\n'"'"' + render_content(message["content"]) + '"'"'<|EOT|>'"'"' -}}{%- endif -%}{%- endif -%}{%- endfor -%}{%- if add_generation_prompt -%}{{- '"'"'<|BOT|>assistant\n<think>\n'"'"' -}}{%- endif -%}' \
        --enable-log-requests \
        --interleave-mm-strings \
        --trust-remote-code
    ```
After the service starts, it will listen on `localhost:9999`.

#### 🛠️ Method 2 · Run from Source (Compile vLLM)
Step-Audio-R1 requires a customized vLLM backend.

1.  **Download Source Code**:
    ```bash
    git clone https://github.com/stepfun-ai/vllm.git
    cd vllm
    ```

2.  **Prepare Environment**:
    ```bash
    python3 -m venv .venv
    source .venv/bin/activate
    ```

3.  **Install and Compile**:
    vLLM contains both C++ and Python code. We mainly modified the Python code, so the C++ part can use the pre-compiled version to speed up the process.
    
    ```bash
    # Use pre-compiled C++ extensions (Recommended)
    VLLM_USE_PRECOMPILED=1 pip install -e .
    ```

4.  **Switch Branch**:
    After compilation, switch to the branch that supports Step-Audio.
    ```bash
    git checkout step-audio-2-mini
    ```

5.  **Start the Service**:
    ```bash
    # Ensure you are in the vllm directory and the virtual environment is activated
    source .venv/bin/activate

    python3 -m vllm.entrypoints.openai.api_server \
        --model ../Step-Audio-R1 \
        --served-model-name Step-Audio-R1 \
        --port 9999 \
        --host 0.0.0.0 \
        --max-model-len 65536 \
        --max-num-seqs 128 \
        --tensor-parallel-size 4 \
        --gpu-memory-utilization 0.85 \
        --trust-remote-code \
        --enable-log-requests \
        --interleave-mm-strings \
        --chat-template '{%- macro render_content(content) -%}{%- if content is string -%}{{- content.replace("<audio_patch>\n", "<audio_patch>") -}}{%- elif content is mapping -%}{{- content['"'"'value'"'"'] if '"'"'value'"'"' in content else content['"'"'text'"'"'] -}}{%- elif content is iterable -%}{%- for item in content -%}{%- if item.type == '"'"'text'"'"' -%}{{- item['"'"'value'"'"'] if '"'"'value'"'"' in item else item['"'"'text'"'"'] -}}{%- elif item.type == '"'"'audio'"'"' -%}<audio_patch>{%- endif -%}{%- endfor -%}{%- endif -%}{%- endmacro -%}{%- if tools -%}{{- '"'"'<|BOT|>system\n'"'"' -}}{%- if messages[0]['"'"'role'"'"'] == '"'"'system'"'"' -%}{{- render_content(messages[0]['"'"'content'"'"']) + '"'"'<|EOT|>'"'"' -}}{%- endif -%}{{- '"'"'<|BOT|>tool_json_schemas\n'"'"' + tools|tojson + '"'"'<|EOT|>'"'"' -}}{%- else -%}{%- if messages[0]['"'"'role'"'"'] == '"'"'system'"'"' -%}{{- '"'"'<|BOT|>system\n'"'"' + render_content(messages[0]['"'"'content'"'"']) + '"'"'<|EOT|>'"'"' -}}{%- endif -%}{%- endif -%}{%- for message in messages -%}{%- if message["role"] == "user" -%}{{- '"'"'<|BOT|>human\n'"'"' + render_content(message["content"]) + '"'"'<|EOT|>'"'"' -}}{%- elif message["role"] == "assistant" -%}{{- '"'"'<|BOT|>assistant\n'"'"' + (render_content(message["content"]) if message["content"] else '"'"''"'"') -}}{%- set is_last_assistant = true -%}{%- for m in messages[loop.index:] -%}{%- if m["role"] == "assistant" -%}{%- set is_last_assistant = false -%}{%- endif -%}{%- endfor -%}{%- if not is_last_assistant -%}{{- '"'"'<|EOT|>'"'"' -}}{%- endif -%}{%- elif message["role"] == "function_output" -%}{%- else -%}{%- if not (loop.first and message["role"] == "system") -%}{{- '"'"'<|BOT|>'"'"' + message["role"] + '"'"'\n'"'"' + render_content(message["content"]) + '"'"'<|EOT|>'"'"' -}}{%- endif -%}{%- endif -%}{%- endfor -%}{%- if add_generation_prompt -%}{{- '"'"'<|BOT|>assistant\n<think>\n'"'"' -}}{%- endif -%}'
    ```

After the service starts, it will listen on `localhost:9999`.
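
Regardless of which method you used to start the server, you can confirm it is reachable by listing the served models through the OpenAI-compatible `/v1/models` endpoint. A minimal check using only the Python standard library, assuming the default port 9999 from the commands above:

```python
import json
import urllib.request

# Query vLLM's OpenAI-compatible /v1/models endpoint.
with urllib.request.urlopen("http://localhost:9999/v1/models", timeout=10) as resp:
    models = json.load(resp)

# The listed id should match --served-model-name, i.e. "Step-Audio-R1".
print([m["id"] for m in models.get("data", [])])
```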


### 🧪 Client Examples

Get the example code and run it:
```bash
# Clone the repository containing example scripts
git clone https://github.com/stepfun-ai/Step-Audio-R1.git r1-scripts

# Run the example
cd r1-scripts
python examples-vllm_r1.py
```
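
If you want to call the server directly instead of using the repository script, the service follows vLLM's OpenAI-compatible chat API. The sketch below is an illustration rather than the official client: it assumes the `openai` Python package is installed, that the server accepts `audio_url` content parts (vLLM's multimodal extension), and it uses a hypothetical local file `example.wav` sent as a base64 data URL.

```python
import base64
from openai import OpenAI

# Point the OpenAI client at the local vLLM server started above.
client = OpenAI(base_url="http://localhost:9999/v1", api_key="EMPTY")

# Encode a local audio clip (hypothetical example.wav) as a base64 data URL.
with open("example.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="Step-Audio-R1",
    messages=[{
        "role": "user",
        "content": [
            {"type": "audio_url",
             "audio_url": {"url": f"data:audio/wav;base64,{audio_b64}"}},
            {"type": "text", "text": "Describe what you hear and explain your reasoning."},
        ],
    }],
    max_tokens=2048,
)

print(response.choices[0].message.content)
```

For the canonical request format (sampling parameters, system prompt, audio preprocessing), refer to `examples-vllm_r1.py` in the repository.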


## Citation

```bibtex
@article{tian2025step,
  title={Step-Audio-R1 Technical Report},
  author={Tian, Fei and Zhang, Xiangyu Tony and Zhang, Yuxin and Zhang, Haoyang and Li, Yuxin and Liu, Daijiao and Deng, Yayue and Wu, Donghang and Chen, Jun and Zhao, Liang and others},
  journal={arXiv preprint arXiv:2511.15848},
  year={2025}
}
```