Introduction

We trained Qwen2.5-VL-7B/32B/72B-Instruct EAGLE3 draft models on 95K randomly selected samples from the FreedomIntelligence/ALLaVA-4V dataset using SpecForge.

Usage

Inference

Run inference with SGLang.

Recommended configuration on an H200: 3-4 speculative steps, top-k 10, and 64 draft tokens.

Start Server

python -m sglang.launch_server \
  --model-path Qwen/Qwen2.5-VL-72B-Instruct \
  --speculative-draft Geraldxm/Qwen2.5-VL-72B-Instruct-eagle3-sgl \
  --trust-remote-code \
  --chat-template qwen2-vl \
  --chunked-prefill-size -1 \
  --cuda-graph-max-bs 1 \
  --speculative-algo EAGLE3 \
  --speculative-num-steps 4 \
  --speculative-eagle-topk 10 \
  --speculative-num-draft-tokens 64 \
  --tp 4 \
  --mem-fraction-static 0.7 \
  --host 0.0.0.0 \
  --port 3000
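
Once the server is running, it can be queried through SGLang's OpenAI-compatible API. The Python sketch below is a minimal illustration: the image URL and prompt are placeholders, and the host/port match the launch command above.

import requests

# Minimal request against the OpenAI-compatible endpoint exposed by the
# server launched above; the image URL is a placeholder.
payload = {
    "model": "Qwen/Qwen2.5-VL-72B-Instruct",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/sample.jpg"}},
                {"type": "text", "text": "Describe this image in one sentence."},
            ],
        }
    ],
    "max_tokens": 128,
}
resp = requests.post("http://localhost:3000/v1/chat/completions", json=payload, timeout=300)
print(resp.json()["choices"][0]["message"]["content"])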

Benchmarking

Benchmark with the SpecForge benchmark scripts.

cd path/to/SpecForge/benchmarks
python bench_eagle3.py \
    --model Qwen/Qwen2.5-VL-72B-Instruct \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path Geraldxm/Qwen2.5-VL-72B-Instruct-eagle3-sgl \
    --cuda-graph-max-bs 1 \
    --tp 4 \
    --config-list 1,0,0,0 1,4,10,64 \
    --mem-fraction-static 0.6 \
    --port 30000 \
    --dtype bfloat16 \
    --benchmark-list "humaneval:50" "mmlu:50:all" "mmstar:100" "math500:50"

Performance Results

Speedup

  • Qwen2.5-VL-7B-Instruct: Up to 1.68x speedup on MMStar benchmark (TP=1)

    Qwen2.5-VL-7B Throughput Comparison
  • Qwen2.5-VL-32B-Instruct: Up to 1.52x speedup on MMStar benchmark (TP=4)

    Qwen2.5-VL-32B Throughput Comparison
  • Qwen2.5-VL-72B-Instruct: Up to 2.04x speedup on MMStar benchmark (TP=4)

    Qwen2.5-VL-72B Throughput Comparison

Analysis

Dataset Performance

We evaluated the models on four datasets representing different task types:

  • HumanEval: Code generation tasks
  • MMLU: Multitask language understanding tasks
  • MMStar: Multimodal tasks
  • Math500: Mathematical reasoning tasks

Key Findings:

  • MMStar: Achieved the best performance gains with accept_length typically reaching ~3.5, sometimes approaching 4.5. This is likely due to the similarity between the training dataset ALLaVA-4V (multimodal tasks) and MMStar's distribution.

  • HumanEval: Unexpectedly, none of the 7B/32B/72B models showed throughput improvements, and accuracy remained consistently at 0. This appears to be a bug that requires further investigation.

  • MMLU/Math500: accept_length remained relatively consistent, typically ranging from 2 to 2.5.

Accuracy Consistency

Hyperparameter Analysis

Based on extensive experiments, we recommend the following configuration for H200 GPUs:

  • Steps: 3-4
  • TopK: 10
  • Draft Tokens: 64

Performance Heatmap

Global Heatmap

Steps Impact Analysis

Steps Impact

Draft Tokens Impact Analysis

Draft Tokens Impact

Size Scaling Analysis

Larger models (72B) achieve better acceleration. The 32B model shows a smaller speedup than the 7B model because it runs with TP=4, whereas the 7B model runs with TP=1.

Size Comparison

Training

We trained the draft model on 4×H200 GPUs following SpecForge's train qwen2.5-vl eagle3 guide.

Training Curves

The training loss and accuracy curves are shown below:

Training Loss and Accuracy

Known Issues and Solutions

When directly using the training scripts from the PR, several issues may arise. Below we outline the problems encountered and their solutions. We plan to submit a new PR to address these issues in the SpecForge repository.

Data Preprocessing

Issue: Data loading stalls at 0% (Deadlock).

Solution: Set export OMP_NUM_THREADS=1 in the launch script to force the underlying library to run in single-threaded mode.
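
If editing the launch script is inconvenient, an equivalent workaround (an assumption on our part, not part of the original guide) is to set the variable inside the training entry point before the OpenMP-backed libraries are imported:

# Assumed in-script equivalent of `export OMP_NUM_THREADS=1`; it should run
# before torch, tokenizers, or numpy are imported so the setting takes effect.
import os
os.environ.setdefault("OMP_NUM_THREADS", "1")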

OOM During FSDP Initialization

Issues Identified:

  1. Memory Spike: Calling .cuda() before FSDP sharding makes each GPU load the full model, doubling memory usage during initialization.
  2. Coarse Wrapping Granularity: By default, the entire model is treated as a single FSDP unit, so the 32B/72B models' parameters are never actually sharded and cannot fit in GPU memory.
  3. Full Parameter Aggregation: When saving, FSDP defaults to gathering the full parameters on GPU 0 (including the frozen 72B target-model parameters), instantly exceeding GPU memory.

Solutions:

  1. Remove .cuda() during model loading (keep on CPU)
  2. Specify device_id during FSDP initialization to enable streaming sharded loading from CPU to GPU
  3. Use transformer_auto_wrap_policy to wrap by layer (Decoder Layer)
  4. Change strategy from SHARD_GRAD_OP to FULL_SHARD (ZeRO-3)
  5. Use FullStateDictConfig(offload_to_cpu=True, rank0_only=True) to move parameter aggregation from GPU memory to CPU memory

These fixes have been verified to work for the 32B/72B models; a minimal sketch of the resulting FSDP setup is shown below.
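
For reference, the sketch below shows roughly what these fixes look like with PyTorch FSDP. It is a minimal illustration rather than the actual SpecForge code: the model loading call, the Qwen2_5_VLDecoderLayer wrapping class, and the LOCAL_RANK handling are assumptions standing in for the corresponding pieces of the training script.

import functools
import os

import torch
import torch.distributed as dist
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    FullStateDictConfig,
    ShardingStrategy,
    StateDictType,
)
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from transformers import Qwen2_5_VLForConditionalGeneration
from transformers.models.qwen2_5_vl.modeling_qwen2_5_vl import Qwen2_5_VLDecoderLayer

dist.init_process_group("nccl")  # launched via torchrun, one process per GPU
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# (1) Load on CPU; do NOT call .cuda() before FSDP shards the model.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-72B-Instruct", torch_dtype=torch.bfloat16
)

# (3) Wrap each decoder layer separately instead of the whole model as one unit.
wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={Qwen2_5_VLDecoderLayer},
)

model = FSDP(
    model,
    auto_wrap_policy=wrap_policy,
    sharding_strategy=ShardingStrategy.FULL_SHARD,  # (4) ZeRO-3-style full sharding
    device_id=torch.device("cuda", local_rank),     # (2) stream shards from CPU to GPU
)

# (5) Gather the full state dict on CPU, rank 0 only, instead of on GPU 0.
save_cfg = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
with FSDP.state_dict_type(model, StateDictType.FULL_STATE_DICT, save_cfg):
    cpu_state_dict = model.state_dict()
if dist.get_rank() == 0:
    torch.save(cpu_state_dict, "draft_model_state.pt")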
