Introduction

We trained Qwen2.5-VL-7B/32B/72B-Instruct EAGLE3 draft models on 95K randomly selected samples from the FreedomIntelligence/ALLaVA-4V dataset using SpecForge.

Usage

Inference

Run inference with SGLang.

Recommended configuration on an H200: 3-4 speculative steps, top-k 10, and 64 draft tokens.

Start Server

python -m sglang.launch_server \
  --model-path Qwen/Qwen2.5-VL-72B-Instruct \
  --speculative-draft Geraldxm/Qwen2.5-VL-72B-Instruct-eagle3-sgl \
  --trust-remote-code \
  --chat-template qwen2-vl \
  --chunked-prefill-size -1 \
  --cuda-graph-max-bs 1 \
  --speculative-algo EAGLE3 \
  --speculative-num-steps 4 \
  --speculative-eagle-topk 10 \
  --speculative-num-draft-tokens 64 \
  --tp 4 \
  --mem-fraction-static 0.7 \
  --host 0.0.0.0 \
  --port 3000
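
Once the server is running, it can be queried through SGLang's OpenAI-compatible API. The Python sketch below is a minimal illustration: the image URL and prompt are placeholders, and the host/port match the launch command above.

import requests

# Minimal request against the OpenAI-compatible endpoint exposed by the
# server launched above; the image URL is a placeholder.
payload = {
    "model": "Qwen/Qwen2.5-VL-72B-Instruct",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/sample.jpg"}},
                {"type": "text", "text": "Describe this image in one sentence."},
            ],
        }
    ],
    "max_tokens": 128,
}
resp = requests.post("http://localhost:3000/v1/chat/completions", json=payload, timeout=300)
print(resp.json()["choices"][0]["message"]["content"])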

Benchmarking

Benchmark with the SpecForge benchmark scripts.

cd path/to/SpecForge/benchmarks
python bench_eagle3.py \
    --model Qwen/Qwen2.5-VL-72B-Instruct \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path Geraldxm/Qwen2.5-VL-72B-Instruct-eagle3-sgl \
    --cuda-graph-max-bs 1 \
    --tp 4 \
    --config-list 1,0,0,0 1,4,10,64 \
    --mem-fraction-static 0.6 \
    --port 30000 \
    --dtype bfloat16 \
    --benchmark-list "humaneval:50" "mmlu:50:all" "mmstar:100" "math500:50"

Performance Results

Speedup

  • Qwen2.5-VL-7B-Instruct: Up to 1.68x speedup on MMStar benchmark (TP=1)

    Qwen2.5-VL-7B Throughput Comparison
  • Qwen2.5-VL-32B-Instruct: Up to 1.52x speedup on MMStar benchmark (TP=4)

    Qwen2.5-VL-32B Throughput Comparison
  • Qwen2.5-VL-72B-Instruct: Up to 2.04x speedup on MMStar benchmark (TP=4)

    Qwen2.5-VL-72B Throughput Comparison

Analysis

Dataset Performance

We evaluated the models on four datasets representing different task types:

  • HumanEval: Code generation tasks
  • MMLU: Multitask language understanding tasks
  • MMStar: Multimodal tasks
  • Math500: Mathematical reasoning tasks

Key Findings:

  • MMStar: Achieved the best performance gains with accept_length typically reaching ~3.5, sometimes approaching 4.5. This is likely due to the similarity between the training dataset ALLaVA-4V (multimodal tasks) and MMStar's distribution.

  • HumanEval: Unexpectedly, none of the 7B/32B/72B models showed throughput improvements, and accuracy remained consistently at 0. This appears to be a bug that requires further investigation.

  • MMLU/Math500: accept_length remained relatively consistent, typically ranging from 2 to 2.5.

Accuracy Consistency

Hyperparameter Analysis

Based on extensive experiments, we recommend the following configuration for H200 GPUs:

  • Steps: 3-4
  • TopK: 10
  • Draft Tokens: 64

Performance Heatmap

Global Heatmap

Steps Impact Analysis

Steps Impact

Draft Tokens Impact Analysis

Draft Tokens Impact

Size Scaling Analysis

Larger models (72B) achieve better acceleration. The 32B model shows a smaller speedup than the 7B model because it runs with TP=4, whereas the 7B model runs with TP=1.

Size Comparison

Training

We trained the draft model on 4×H200 GPUs following SpecForge's train qwen2.5-vl eagle3 guide.

Training Curves

The training loss and accuracy curves are shown below:

Training Loss and Accuracy

Known Issues and Solutions

When directly using the training scripts from the PR, several issues may arise. Below we outline the problems encountered and their solutions. We plan to submit a new PR to address these issues in the SpecForge repository.

Data Preprocessing

Issue: Data loading stalls at 0% (Deadlock).

Solution: Set export OMP_NUM_THREADS=1 in the launch script to force the underlying library to run in single-threaded mode.
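
If editing the launch script is inconvenient, an equivalent workaround (an assumption on our part, not part of the original guide) is to set the variable inside the training entry point before the OpenMP-backed libraries are imported:

# Assumed in-script equivalent of `export OMP_NUM_THREADS=1`; it should run
# before torch, tokenizers, or numpy are imported so the setting takes effect.
import os
os.environ.setdefault("OMP_NUM_THREADS", "1")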

OOM During FSDP Initialization

Issues Identified:

  1. Memory Spike: Calling .cuda() before FSDP sharding makes each GPU load the full model, doubling memory usage during initialization.
  2. Coarse Wrapping Granularity: By default, the entire model is treated as a single FSDP unit, so the 32B/72B models' parameters are never actually sharded and cannot fit in GPU memory.
  3. Full Parameter Aggregation: When saving, FSDP defaults to gathering the full parameters on GPU 0 (including the frozen 72B target-model parameters), instantly exceeding GPU memory.

Solutions:

  1. Remove .cuda() during model loading (keep on CPU)
  2. Specify device_id during FSDP initialization to enable streaming sharded loading from CPU to GPU
  3. Use transformer_auto_wrap_policy to wrap by layer (Decoder Layer)
  4. Change strategy from SHARD_GRAD_OP to FULL_SHARD (ZeRO-3)
  5. Use FullStateDictConfig(offload_to_cpu=True, rank0_only=True) to move parameter aggregation from GPU memory to CPU memory

These fixes have been verified to work for the 32B/72B models; a minimal sketch of the resulting FSDP setup is shown below.
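
For reference, the sketch below shows roughly what these fixes look like with PyTorch FSDP. It is a minimal illustration rather than the actual SpecForge code: the model loading call, the Qwen2_5_VLDecoderLayer wrapping class, and the LOCAL_RANK handling are assumptions standing in for the corresponding pieces of the training script.

import functools
import os

import torch
import torch.distributed as dist
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    FullStateDictConfig,
    ShardingStrategy,
    StateDictType,
)
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from transformers import Qwen2_5_VLForConditionalGeneration
from transformers.models.qwen2_5_vl.modeling_qwen2_5_vl import Qwen2_5_VLDecoderLayer

dist.init_process_group("nccl")  # launched via torchrun, one process per GPU
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# (1) Load on CPU; do NOT call .cuda() before FSDP shards the model.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-72B-Instruct", torch_dtype=torch.bfloat16
)

# (3) Wrap each decoder layer separately instead of the whole model as one unit.
wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={Qwen2_5_VLDecoderLayer},
)

model = FSDP(
    model,
    auto_wrap_policy=wrap_policy,
    sharding_strategy=ShardingStrategy.FULL_SHARD,  # (4) ZeRO-3-style full sharding
    device_id=torch.device("cuda", local_rank),     # (2) stream shards from CPU to GPU
)

# (5) Gather the full state dict on CPU, rank 0 only, instead of on GPU 0.
save_cfg = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
with FSDP.state_dict_type(model, StateDictType.FULL_STATE_DICT, save_cfg):
    cpu_state_dict = model.state_dict()
if dist.get_rank() == 0:
    torch.save(cpu_state_dict, "draft_model_state.pt")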
