Add vLLM to supported inference engines

#3
by wzhao18 - opened
Files changed (1)
  1. README.md +13 -1
README.md CHANGED
@@ -61,6 +61,7 @@ Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated sys
 ## Software Integration:
 **Runtime Engine(s):** <br>
 * SGLang <br>
+* vLLM <br>
 
 **Supported Hardware Microarchitecture Compatibility:** <br>
 * NVIDIA Blackwell <br>
@@ -95,7 +96,7 @@ The model is quantized with nvidia-modelopt **v0.43.0** <br>
 
 
 ## Inference:
-**Engine:** SGLang <br>
+**Engine:** SGLang, vLLM <br>
 **Test Hardware:** B200 <br>
 
 ## Post Training Quantization
@@ -109,6 +110,17 @@ To serve this checkpoint with [SGLang](https://github.com/sgl-project/sglang), y
 python3 -m sglang.launch_server --model nvidia/MiniMax-M2.5-NVFP4 --tensor-parallel-size 8 --quantization modelopt_fp4 --trust-remote-code --reasoning-parser minimax-append-think --tool-call-parser minimax-m2 --moe-runner-backend flashinfer_cutlass --attention-backend flashinfer
 ```
 
+To serve this checkpoint with [vLLM](https://github.com/vllm-project/vllm), you can launch the docker image `vllm/vllm-openai:latest` and run the sample command (for B200) below:
+
+```sh
+vllm serve nvidia/MiniMax-M2.5-NVFP4 \
+  --tensor-parallel-size 2 \
+  --tool-call-parser minimax_m2 \
+  --reasoning-parser minimax_m2_append_think \
+  --enable-auto-tool-choice \
+  --trust-remote-code
+```
+
 ### Evaluation
 The accuracy benchmark results are presented in the table below:
 <table>
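
Once either launch command above is running, both SGLang and vLLM expose an OpenAI-compatible HTTP API, so the served checkpoint can be smoke-tested with a plain chat-completions request. Below is a minimal client sketch, assuming the vLLM default port 8000 on localhost (SGLang's `launch_server` defaults to port 30000); the `build_payload`/`ask` helper names and the `max_tokens` choice are illustrative, not part of either project:

```python
import json
from urllib import request

# Assumed endpoint: vLLM's default OpenAI-compatible route on port 8000.
# For the SGLang command above, swap in http://localhost:30000/v1/chat/completions.
BASE_URL = "http://localhost:8000/v1/chat/completions"

def build_payload(prompt: str, model: str = "nvidia/MiniMax-M2.5-NVFP4") -> dict:
    """Build a minimal OpenAI-style chat-completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }

def ask(prompt: str) -> str:
    """POST the prompt to the running server and return the first choice's text."""
    body = json.dumps(build_payload(prompt)).encode()
    req = request.Request(
        BASE_URL,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]
```

Usage: with the server up, `ask("Hello")` returns the model's reply; any OpenAI SDK pointed at the same base URL works equally well.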