lukealonso committed · Commit da3aa9e · verified · 1 Parent(s): 9abb935

Update README.md

Files changed (1): README.md (+1 -140)
README.md CHANGED
@@ -5,150 +5,11 @@ base_model:

 modelopt NVFP4 quantized MiniMax-M2

- Instructions from another user, running on an RTX Pro 6000 Blackwell:
-
- Running this model on vllm/vllm-openai:nightly has produced mixed results: sometimes it works, sometimes it does not.
-
- I could not run this model with just the base vLLM v0.12.0 image, because that image was built against CUDA 12.9.
- Other discussions on this model report that it works with vLLM v0.12.0 when built against CUDA 13.
-
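Before building, it is worth confirming that the host driver can actually run CUDA 13 containers; the commands below are only an illustrative sketch, not part of the original instructions:

```bash
# The "CUDA Version" field in the nvidia-smi banner must read 13.x or newer
# for a CUDA 13.0.2 container image to run on this host.
nvidia-smi | head -n 4

# Driver version and GPU model, for reference.
nvidia-smi --query-gpu=name,driver_version --format=csv
```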
- I stepped through the following instructions to reliably build vLLM and run this model with vLLM 0.12.0 and CUDA 13.0.2.
-
- # Instructions
- ## 1. Build the vLLM Image
-
- ```bash
- # Clone the vLLM repo
- if [[ ! -d vllm ]]; then
-     git clone https://github.com/vllm-project/vllm.git
- fi
-
- # Check out the v0.12.0 release of vLLM
- cd vllm
- git checkout releases/v0.12.0
-
- # Build with CUDA 13.0.2, the precompiled vLLM cu130 wheel, and an Ubuntu 22.04 base image
- DOCKER_BUILDKIT=1 \
- docker build . \
-   --target vllm-openai \
-   --tag vllm/vllm-openai:custom-vllm-0.12.0-cuda-13.0.2-py-3.12 \
-   --file docker/Dockerfile \
-   --build-arg max_jobs=64 \
-   --build-arg nvcc_threads=16 \
-   --build-arg CUDA_VERSION=13.0.2 \
-   --build-arg PYTHON_VERSION=3.12 \
-   --build-arg VLLM_USE_PRECOMPILED=true \
-   --build-arg VLLM_MAIN_CUDA_VERSION=130 \
-   --build-arg BUILD_BASE_IMAGE=nvidia/cuda:13.0.2-devel-ubuntu22.04 \
-   --build-arg RUN_WHEEL_CHECK=false \
- ;
- # Ubuntu 22.04 is required because there is no CUDA 13.0.2 variant of the 20.04 base image
- ```
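After the build completes, a quick sanity check is to confirm the image really carries a CUDA 13.x build of PyTorch; this is only a sketch, assuming the image tag used above:

```bash
# Image tag from the build command above.
IMAGE=vllm/vllm-openai:custom-vllm-0.12.0-cuda-13.0.2-py-3.12

# PyTorch inside the image should report a 13.x CUDA toolkit.
docker run --rm --entrypoint python3 "$IMAGE" \
  -c "import torch; print(torch.__version__, torch.version.cuda)"
```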
-
- ## 2. Run the Custom vLLM Image
-
- ```yaml
- services:
-   vllm:
-
-     # !!! Notice !!! our custom-built image from the build command above
-     image: vllm/vllm-openai:custom-vllm-0.12.0-cuda-13.0.2-py-3.12
-
-     environment:
-       # Optional
-       VLLM_NO_USAGE_STATS: "1"
-       DO_NOT_TRACK: "1"
-       CUDA_DEVICE_ORDER: PCI_BUS_ID
-       VLLM_LOGGING_LEVEL: INFO
-
-       # Required (I think)
-       VLLM_ATTENTION_BACKEND: FLASHINFER
-       VLLM_FLASHINFER_MOE_BACKEND: throughput
-       VLLM_USE_FLASHINFER_MOE_FP16: 1
-       VLLM_USE_FLASHINFER_MOE_FP8: 1
-       VLLM_USE_FLASHINFER_MOE_FP4: 1
-       VLLM_USE_FLASHINFER_MOE_MXFP4_BF16: 1
-       VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8_CUTLASS: 1
-       VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8: 1
-
-       # Required (I think)
-       VLLM_WORKER_MULTIPROC_METHOD: spawn
-       NCCL_P2P_DISABLE: 1
-
-     entrypoint: /bin/bash
-     command:
-       - -c
-       - |
-         vllm serve \
-           /root/.cache/huggingface/hub/models--lukealonso--MiniMax-M2-NVFP4/snapshots/d8993b15556ab7294530f1ba50a93ad130166174/ \
-           --served-model-name minimax-m2-fp4 \
-           --gpu-memory-utilization 0.95 \
-           --pipeline-parallel-size 1 \
-           --enable-expert-parallel \
-           --tensor-parallel-size 4 \
-           --max-model-len $(( 192 * 1024 )) \
-           --max-num-seqs 32 \
-           --enable-auto-tool-choice \
-           --reasoning-parser minimax_m2_append_think \
-           --tool-call-parser minimax_m2 \
-           --all2all-backend pplx \
-           --enable-prefix-caching \
-           --enable-chunked-prefill \
-           --max-num-batched-tokens $(( 64 * 1024 )) \
-           --dtype auto \
-           --kv-cache-dtype fp8 \
-         ;
- ```
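With the compose service defined, starting it and probing the OpenAI-compatible API might look like the sketch below; it assumes the service also publishes port 8000 and mounts the Hugging Face cache, which the snippet above does not show:

```bash
# Start the service (file name assumed to be compose.yaml).
docker compose up -d vllm

# Follow the logs until the model finishes loading.
docker compose logs -f vllm

# The server speaks the OpenAI API; list the served model name.
curl -s http://localhost:8000/v1/models
```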
-
- # vLLM CLI Arguments Explained
-
- ## Required Model Arguments
-
- 1. The path to the model (alternatively the Hugging Face ID `lukealonso/MiniMax-M2-NVFP4`)
-    - `/root/.cache/huggingface/hub/models--lukealonso--MiniMax-M2-NVFP4/snapshots/d8993b15556ab7294530f1ba50a93ad130166174/`
- 2. The minimax-m2 parsers & tool config (a sample request is sketched after this list)
-    - `--reasoning-parser minimax_m2_append_think`
-    - `--tool-call-parser minimax_m2`
-    - `--enable-auto-tool-choice`
-
- ## Required Compute Arguments
- 1. The parallelism mode (multi-GPU, 4x in this example; see the sketch after this list)
-    - `--enable-expert-parallel`
-    - `--pipeline-parallel-size 1`
-    - `--tensor-parallel-size 4`
-    - `--all2all-backend pplx`
- 2. The KV cache & layer data types
-    - `--kv-cache-dtype fp8`
-    - `--dtype auto`
-
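The tensor-parallel size must match the number of GPUs the server is allowed to see; a trivial sketch for deriving it:

```bash
# Tensor parallelism shards the model across GPUs, so the value should equal
# the number of visible devices (2 on the 2x RTX Pro 6000 setup mentioned below).
NUM_GPUS=$(nvidia-smi -L | wc -l)
echo "Visible GPUs: ${NUM_GPUS} -> use --tensor-parallel-size ${NUM_GPUS}"
```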
- ## Optional Model Arguments
- 1. The name of the model to present to API clients
-    - `--served-model-name minimax-m2-fp4`
- 2. The context length available to the model (192k max)
-    - `--max-model-len $(( 192 * 1024 ))`
- 3. The prompt chunking size (faster time-to-first-token on large prompts)
-    - `--enable-chunked-prefill`
-    - `--max-num-batched-tokens $(( 64 * 1024 ))`
-
- ## Optional Performance Arguments
- 1. The fraction of GPU VRAM vLLM may use
-    - `--gpu-memory-utilization 0.95`
- 2. The maximum number of sequences the server will process concurrently
-    - `--max-num-seqs 32`
- 3. Allow KV-cache sharing for overlapping prompt prefixes
-    - `--enable-prefix-caching`
-
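To verify these settings at runtime, nvidia-smi and vLLM's Prometheus /metrics endpoint give a quick read; the grep pattern below is an illustrative guess, not an exact metric name:

```bash
# GPU memory actually in use versus the 0.95 utilization target.
nvidia-smi --query-gpu=memory.used,memory.total --format=csv

# vLLM exposes Prometheus metrics; look for cache and scheduler counters.
curl -s http://localhost:8000/metrics | grep -iE "cache|running|waiting" | head -n 20
```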
 Tested (but not extensively validated) on *2x* RTX Pro 6000 Blackwell via:
- (note that these instructions no longer work due to nightly vLLM breaking NVFP4 support)
 
 ```
 inference:
-   image: vllm/vllm-openai:nightly
+   image: vllm/vllm-openai:nightly-96142f209453a381fcaf9d9d010bbf8711119a77
    container_name: inference
    ports:
      - "0.0.0.0:8000:8000"