Introduction
We trained Qwen2.5-VL-7B/32B/72B-Instruct EAGLE3 draft models on 95K randomly selected samples from the FreedomIntelligence/ALLaVA-4V dataset using SpecForge.
Usage
Inference
Run inference with SGLang.
Recommended configuration for H200: 3-4 speculative steps, top-k of 10, and 64 draft tokens.
Start Server
python -m sglang.launch_server \
--model-path Qwen/Qwen2.5-VL-72B-Instruct \
--speculative-draft-model-path Geraldxm/Qwen2.5-VL-72B-Instruct-eagle3-sgl \
--trust-remote-code \
--chat-template qwen2-vl \
--chunked-prefill-size -1 \
--cuda-graph-max-bs 1 \
--speculative-algorithm EAGLE3 \
--speculative-num-steps 4 \
--speculative-eagle-topk 10 \
--speculative-num-draft-tokens 64 \
--tp 4 \
--mem-fraction-static 0.7 \
--host 0.0.0.0 \
--port 3000
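Once the server is up, requests go through SGLang's OpenAI-compatible API; speculative decoding runs entirely on the server side, so the request format is the same as for the base model. The snippet below is a minimal client sketch against the server started above; the image URL, prompt, and sampling settings are placeholders.

```python
# Minimal client sketch for the server launched above (port 3000).
# Assumes the `openai` Python package; image URL and prompt are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:3000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-72B-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/sample.jpg"}},
                {"type": "text", "text": "Describe this image in one sentence."},
            ],
        }
    ],
    max_tokens=128,
)
print(response.choices[0].message.content)
```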
Benchmarking
Benchmark with the SpecForge benchmark scripts.
cd path/to/SpecForge/benchmarks
python bench_eagle3.py \
--model Qwen/Qwen2.5-VL-72B-Instruct \
--speculative-algorithm EAGLE3 \
--speculative-draft-model-path Geraldxm/Qwen2.5-VL-72B-Instruct-eagle3-sgl \
--cuda-graph-max-bs 1 \
--tp 4 \
--config-list 1,0,0,0 1,4,10,64 \
--mem-fraction-static 0.6 \
--port 30000 \
--dtype bfloat16 \
--benchmark-list "humaneval:50" "mmlu:50:all" "mmstar:100" "math500:50"
Performance Results
Speedup
- Qwen2.5-VL-7B-Instruct: up to 1.68x speedup on the MMStar benchmark (TP=1)
- Qwen2.5-VL-32B-Instruct: up to 1.52x speedup on the MMStar benchmark (TP=4)
- Qwen2.5-VL-72B-Instruct: up to 2.04x speedup on the MMStar benchmark (TP=4)

Analysis
Dataset Performance
We evaluated the models on four datasets representing different task types:
- HumanEval: Code generation tasks
- MMLU: General-knowledge and multitask language understanding tasks
- MMStar: Multimodal tasks
- Math500: Mathematical reasoning tasks
Key Findings:
MMStar: Achieved the best performance gains, with the accept length typically reaching ~3.5 and sometimes approaching 4.5. This is likely because the training dataset, ALLaVA-4V (multimodal), is close to MMStar's distribution.
HumanEval: Unexpectedly, none of the 7B/32B/72B models showed throughput improvements, and the reported accuracy was consistently 0. This appears to be a bug that requires further investigation.
MMLU/Math500: The accept length remained relatively consistent, typically ranging from 2 to 2.5.
Hyperparameter Analysis
Based on extensive experiments, we recommend the following configuration for H200 GPUs:
- Steps: 3-4
- TopK: 10
- Draft Tokens: 64
Performance Heatmap
Steps Impact Analysis
Draft Tokens Impact Analysis
Size Scaling Analysis
Larger models (72B) achieve better acceleration. The 32B model shows a smaller speedup than the 7B model because it runs with TP=4, while the 7B model runs with TP=1.
Training
We trained the draft model on 4×H200 GPUs following the train qwen2.5-vl eagle3 guide.
Training Curves
The training loss and accuracy curves are shown below:
Known Issues and Solutions
When using the training scripts from the PR directly, several issues may arise. Below we outline the problems we encountered and their solutions; we plan to submit a new PR to the SpecForge repository to address them.
Data Preprocessing
Issue: Data loading stalls at 0% (deadlock).
Solution: Set `export OMP_NUM_THREADS=1` in the launch script to force the underlying libraries to run single-threaded.
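If you prefer to pin this inside the training entry point rather than the launch script, the sketch below (an alternative, not part of the original scripts) sets the variable before any OpenMP-backed library is imported:

```python
# Hypothetical alternative to `export OMP_NUM_THREADS=1` in the launch script:
# set the variable at the very top of the training entry point, before torch
# (or any other OpenMP-backed library) is imported and initializes its threads.
import os

os.environ.setdefault("OMP_NUM_THREADS", "1")

import torch  # noqa: E402  -- imported only after the thread count is pinned
```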
OOM During FSDP Initialization
Issues Identified:
- Memory Spike: Calling `.cuda()` before FSDP sharding makes each GPU load the full model, doubling memory usage during initialization.
- Large Granularity: By default, the entire model is treated as a single FSDP unit. For 32B/72B models with massive parameter counts, execution is impossible without finer-grained parameter sharding.
- Full Parameter Aggregation: When saving, FSDP defaults to gathering the full parameters on GPU 0 (including the 72B frozen parameters), instantly exceeding memory limits.
Solutions (a minimal sketch follows below):
- Remove `.cuda()` during model loading (keep the model on CPU).
- Specify `device_id` during FSDP initialization to enable streaming sharded loading from CPU to GPU.
- Use `transformer_auto_wrap_policy` to wrap the model per decoder layer.
- Change the sharding strategy from `SHARD_GRAD_OP` to `FULL_SHARD` (ZeRO-3).
- Use `FullStateDictConfig(offload_to_cpu=True, rank0_only=True)` when saving, so parameter aggregation happens in CPU memory instead of GPU memory.
These solutions are verified to work for 32B/72B models.
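For reference, here is a minimal sketch of what these FSDP settings look like in plain PyTorch. It is not the SpecForge training code: the model and decoder-layer classes (`Qwen2_5_VLForConditionalGeneration`, `Qwen2_5_VLDecoderLayer`), the checkpoint path, and the process-group setup are illustrative assumptions; only the FSDP arguments mirror the fixes listed above.

```python
# Sketch only: FSDP wrapping with the fixes above (CPU load, per-layer wrapping,
# FULL_SHARD, device_id streaming, CPU-offloaded rank-0 saving).
import functools
import os

import torch
import torch.distributed as dist
from torch.distributed.fsdp import (
    FullStateDictConfig,
    FullyShardedDataParallel as FSDP,
    ShardingStrategy,
    StateDictType,
)
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from transformers import Qwen2_5_VLForConditionalGeneration
from transformers.models.qwen2_5_vl.modeling_qwen2_5_vl import Qwen2_5_VLDecoderLayer

dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# 1. Load on CPU -- do NOT call .cuda() before FSDP wraps the model.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-72B-Instruct", torch_dtype=torch.bfloat16
)

# 2. Wrap per decoder layer instead of treating the whole model as one FSDP unit.
wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={Qwen2_5_VLDecoderLayer},
)

# 3. FULL_SHARD (ZeRO-3); device_id streams shards from CPU to the local GPU.
model = FSDP(
    model,
    auto_wrap_policy=wrap_policy,
    sharding_strategy=ShardingStrategy.FULL_SHARD,
    device_id=local_rank,
)

# ... training loop ...

# 4. When saving, gather the full state dict in CPU memory, on rank 0 only.
save_cfg = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
with FSDP.state_dict_type(model, StateDictType.FULL_STATE_DICT, save_cfg):
    state_dict = model.state_dict()
if dist.get_rank() == 0:
    torch.save(state_dict, "checkpoint.pt")
```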