Scaling Video-Language Models to 10K Frames via Hierarchical Differential Distillation
Paper • 2504.02438
ViLAMP is a video-language model for understanding hour-long videos. It addresses the computational bottleneck of long-form processing through differential distillation, which relies on two mechanisms: (1) query-aware keyframe selection and (2) patch-level feature merging, which preserves salient details in non-keyframes. ViLAMP achieves state-of-the-art performance on long-video benchmarks while processing 10K-frame videos on a single GPU.
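
The two mechanisms can be illustrated with a small sketch. This is not the ViLAMP implementation: the function names (`select_keyframes`, `merge_patches`), the cosine-similarity scoring, the redundancy threshold `max_sim`, and the softmax `temperature` are all assumptions chosen for illustration. The sketch picks the frames most relevant to a query while skipping near-duplicates, then collapses the patches of a non-keyframe into a single query-weighted token.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_keyframes(frame_embs, query_emb, k, max_sim=0.9):
    """Greedy query-aware selection (hypothetical scoring rule).

    Frames are visited in order of decreasing similarity to the query.
    A frame is skipped if it is too similar to an already chosen keyframe.
    Returns the chosen frame indices in temporal order.
    """
    order = sorted(
        range(len(frame_embs)),
        key=lambda i: cosine(frame_embs[i], query_emb),
        reverse=True,
    )
    chosen = []
    for i in order:
        if len(chosen) == k:
            break
        if all(cosine(frame_embs[i], frame_embs[j]) < max_sim for j in chosen):
            chosen.append(i)
    return sorted(chosen)


def merge_patches(patch_embs, query_emb, temperature=0.1):
    """Merge one frame's patch features into a single token.

    Weights are a softmax over query similarity (an assumed weighting),
    so query-relevant patches dominate the merged token.
    """
    scores = [cosine(p, query_emb) / temperature for p in patch_embs]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(patch_embs[0])
    return [
        sum(w * p[d] for w, p in zip(weights, patch_embs)) / total
        for d in range(dim)
    ]


# Toy example: frame 1 nearly duplicates frame 0, so it is skipped.
frames = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.7, 0.7]]
query = [1.0, 0.0]
print(select_keyframes(frames, query, k=2))  # [0, 3]
```

Under this scheme, keyframes keep their full set of patch tokens while each non-keyframe contributes only one merged token, which is what keeps the token count manageable as the frame count grows.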