GR-RL: Going Dexterous and Precise for Long-Horizon Robotic Manipulation Paper • 2512.01801 • Published 10 days ago • 23
Do Vision-Language Models Measure Up? Benchmarking Visual Measurement Reading with MeasureBench Paper • 2510.26865 • Published Oct 30 • 11
Video-SafetyBench: A Benchmark for Safety Evaluation of Video LVLMs Paper • 2505.11842 • Published May 17 • 2
CMMU: A Benchmark for Chinese Multi-modal Multi-type Question Understanding and Reasoning Paper • 2401.14011 • Published Jan 25, 2024 • 1
FlagEval Findings Report: A Preliminary Evaluation of Large Reasoning Models on Automatically Verifiable Textual and Visual Questions Paper • 2509.17177 • Published Sep 21 • 13
Letting Large Models Debate: The First Multilingual LLM Debate Competition Article • Published Nov 20, 2024 • 33