Add vLLM to supported inference engines
#3
by wzhao18 - opened
README.md
CHANGED
````diff
@@ -61,6 +61,7 @@ Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated sys
 ## Software Integration:
 **Runtime Engine(s):** <br>
 * SGLang <br>
+* vLLM <br>
 
 **Supported Hardware Microarchitecture Compatibility:** <br>
 * NVIDIA Blackwell <br>
@@ -95,7 +96,7 @@ The model is quantized with nvidia-modelopt **v0.43.0** <br>
 
 
 ## Inference:
-**Engine:** SGLang <br>
+**Engine:** SGLang, vLLM <br>
 **Test Hardware:** B200 <br>
 
 ## Post Training Quantization
@@ -109,6 +110,17 @@ To serve this checkpoint with [SGLang](https://github.com/sgl-project/sglang), y
 python3 -m sglang.launch_server --model nvidia/MiniMax-M2.5-NVFP4 --tensor-parallel-size 8 --quantization modelopt_fp4 --trust-remote-code --reasoning-parser minimax-append-think --tool-call-parser minimax-m2 --moe-runner-backend flashinfer_cutlass --attention-backend flashinfer
 ```
 
+To serve this checkpoint with [vLLM](https://github.com/vllm-project/vllm), you can launch the docker image `vllm/vllm-openai:latest` and run the sample command (for B200) below:
+
+```sh
+vllm serve nvidia/MiniMax-M2.5-NVFP4 \
+    --tensor-parallel-size 2 \
+    --tool-call-parser minimax_m2 \
+    --reasoning-parser minimax_m2_append_think \
+    --enable-auto-tool-choice \
+    --trust-remote-code
+```
+
 ### Evaluation
 The accuracy benchmark results are presented in the table below:
 <table>
````
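Both launch commands start an OpenAI-compatible HTTP server (vLLM's `vllm serve` listens on port 8000 by default). As a quick smoke test of a deployed checkpoint, here is a minimal client sketch using only the Python standard library; the base URL `http://localhost:8000` and the helper names `build_chat_request`/`chat` are illustrative assumptions, not part of the model card:

```python
import json
import urllib.request


def build_chat_request(model: str, prompt: str) -> dict:
    """Build a minimal OpenAI-compatible /v1/chat/completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }


def chat(base_url: str, model: str, prompt: str) -> str:
    """POST the payload to a running server and return the reply text."""
    payload = build_chat_request(model, prompt)
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # Standard chat-completions response shape: first choice's message content.
    return body["choices"][0]["message"]["content"]
```

With a server running locally, a call like `chat("http://localhost:8000", "nvidia/MiniMax-M2.5-NVFP4", "Hello!")` should return the model's reply; for the SGLang command, substitute that server's host and port.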