---
license: apache-2.0
datasets:
- multimodal-reasoning-lab/Zebra-CoT
base_model:
- GAIR/Anole-7b
pipeline_tag: any-to-any
---

# Anole‑Zebra‑CoT

[![Paper on ArXiv](https://img.shields.io/badge/arxiv-2507.16746-red)](https://arxiv.org/abs/2507.16746)
[![Dataset on Hugging Face](https://img.shields.io/badge/huggingface-Zebra--CoT-lightblue)](https://huggingface.co/datasets/multimodal-reasoning-lab/Zebra-CoT)
[![Model on Hugging Face](https://img.shields.io/badge/huggingface-Anole--Zebra--CoT-green)](https://huggingface.co/multimodal-reasoning-lab/Anole-Zebra-CoT)

A vision–language model based on **Anole‑7B**, fine‑tuned on the **Zebra‑CoT** dataset to generate interleaved text–image reasoning traces.

---

## Table of Contents

* [Model Description](#model-description)
* [Usage](#usage)
* [Training Details](#training-details)
* [Dataset](#dataset)
* [Evaluation](#evaluation)
* [Citation](#citation)

---

## Model Description

**Anole‑Zebra‑CoT** is derived from the open‑source Anole‑7B model (Chern et al., 2024) and further fine‑tuned end‑to‑end on the Zebra‑CoT corpus, a large‑scale dataset of high‑quality interleaved text‑image reasoning traces covering four major categories (scientific, 2D, 3D, and logic/game tasks). After fine‑tuning, the model's in‑distribution test accuracy improved from 4.2% to 16.9%, an absolute gain of 12.7 percentage points, alongside consistent improvements on several challenging VLM benchmarks, demonstrating substantially enhanced visual reasoning capabilities.

---

## Usage

For usage examples and code snippets, please refer to the [inference directory of the Thinking with Generated Images repository](https://github.com/GAIR-NLP/thinking-with-generated-images/tree/main/inference). A minimal, unofficial inference sketch is also included at the end of this card.

---

## Training Details

* **Base model**: Anole‑7B (Chameleon Research License)

---

## Dataset

* **Zebra‑CoT**: 182,384 interleaved text‑image reasoning samples across 18 sub‑tasks in 4 categories (2D visual reasoning, 3D visual reasoning, scientific reasoning, and visual logic & strategic games). An unofficial loading sketch appears at the end of this card.

---

## Evaluation

| Benchmark  | Anole + CoT Prompting | Anole‑Zebra‑CoT |
| ---------- | --------------------: | --------------: |
| MathVision |                13.80% |          16.45% |
| MathVista  |                22.80% |          25.30% |
| VisuLogic  |                 8.50% |          21.80% |
| EMMA       |                12.80% |          15.02% |
| MMVP       |                10.00% |          15.33% |
| BLINK      |                26.46% |          31.25% |
| Vstar      |                23.60% |          27.20% |

---

## Citation

If you use this model, please cite:

```bibtex
@misc{li2025zebracot,
  title={Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning},
  author={Ang Li and Charles Wang and Kaiyu Yue and Zikui Cai and Ollie Liu and Deqing Fu and Peng Guo and Wang Bill Zhu and Vatsal Sharan and Robin Jia and Willie Neiswanger and Furong Huang and Tom Goldstein and Micah Goldblum},
  year={2025},
  eprint={2507.16746},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2507.16746},
}
```
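
---

## Minimal Inference Sketch (Unofficial)

The official interleaved text–image generation pipeline lives in the [Thinking with Generated Images repository](https://github.com/GAIR-NLP/thinking-with-generated-images/tree/main/inference) linked above; the stock `transformers` Chameleon classes decode text only. The sketch below is therefore just an assumption‑laden starting point for text‑side inference: it assumes this checkpoint is compatible with `ChameleonProcessor` / `ChameleonForConditionalGeneration` (available in `transformers >= 4.44`), and `puzzle.png` is a placeholder input path.

```python
import torch
from PIL import Image
from transformers import ChameleonForConditionalGeneration, ChameleonProcessor

MODEL_ID = "multimodal-reasoning-lab/Anole-Zebra-CoT"

# Load processor and model; bfloat16 keeps the 7B model within a single GPU.
# Assumption: the checkpoint is stored in transformers-compatible format.
processor = ChameleonProcessor.from_pretrained(MODEL_ID)
model = ChameleonForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

# "<image>" marks where the image tokens are spliced into the prompt.
image = Image.open("puzzle.png")  # placeholder input image
prompt = "Solve this puzzle step by step.<image>"

inputs = processor(text=prompt, images=image, return_tensors="pt").to(
    model.device, dtype=torch.bfloat16
)
output_ids = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```

For the model's signature capability, generating images inside the reasoning trace, follow the inference scripts in the repository above rather than this sketch.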
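
---

## Loading Zebra‑CoT (Unofficial)

A minimal sketch for browsing the training corpus with the `datasets` library. The repository id comes from the dataset card linked above; the split name and any config names are assumptions, so check the dataset card for the actual layout. Streaming avoids downloading all 182,384 multimodal samples up front.

```python
from datasets import load_dataset

# Streaming iteration over the corpus. If the dataset is sharded by
# category, pass a config name as the second argument (an assumption;
# see the dataset card for the actual layout).
ds = load_dataset(
    "multimodal-reasoning-lab/Zebra-CoT", split="train", streaming=True
)

sample = next(iter(ds))
print(sample.keys())  # inspect the interleaved text/image fields
```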