TRUST-VL Model Card
Model Details
TRUST-VL is a unified and explainable vision-language model for general multimodal misinformation detection. It incorporates a novel Question-Aware Visual Amplifier module designed to extract task-specific visual features. To support training, we also construct TRUST-Instruct, a large-scale instruction dataset of 198K samples with structured reasoning chains aligned with human fact-checking workflows. Extensive experiments on both in-domain and zero-shot benchmarks demonstrate that TRUST-VL achieves state-of-the-art performance while offering strong generalization and interpretability.
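The Question-Aware Visual Amplifier itself is defined in the paper; the snippet below is only an illustrative sketch of one plausible realization, as question-conditioned cross-attention with a gated residual over the visual tokens. The class name, dimensions, and layer choices are assumptions made for illustration and are not taken from the paper.

```python
import torch
import torch.nn as nn

class QuestionAwareVisualAmplifier(nn.Module):
    """Illustrative sketch (not the paper's exact module): amplify visual
    features conditioned on the task-specific question."""

    def __init__(self, vis_dim: int = 1024, txt_dim: int = 4096, num_heads: int = 8):
        super().__init__()
        # Project question-token embeddings into the visual feature space.
        self.q_proj = nn.Linear(txt_dim, vis_dim)
        self.cross_attn = nn.MultiheadAttention(vis_dim, num_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(vis_dim, vis_dim), nn.Sigmoid())

    def forward(self, vis_feats: torch.Tensor, question_feats: torch.Tensor) -> torch.Tensor:
        # vis_feats:      (B, N_patches, vis_dim)  visual tokens from the vision encoder
        # question_feats: (B, N_tokens,  txt_dim)  embeddings of the task-specific question
        q = self.q_proj(question_feats)
        # Visual tokens attend to the question to pick out task-relevant evidence.
        attended, _ = self.cross_attn(query=vis_feats, key=q, value=q)
        # Gated residual: amplify question-relevant features, keep the rest intact.
        return vis_feats + self.gate(attended) * attended

if __name__ == "__main__":
    qava = QuestionAwareVisualAmplifier()
    vis = torch.randn(2, 576, 1024)   # dummy visual tokens
    txt = torch.randn(2, 32, 4096)    # dummy question embeddings
    print(qava(vis, txt).shape)       # torch.Size([2, 576, 1024])
```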
TRUST-VL is trained in three stages, summarized in the sketch below:
- Stage 1 (feature alignment): We train the projection module for one epoch on 1.2 million image–text pairs (653K news samples from VisualNews and 558K samples from the LLaVA training corpus) to align visual features with the language model.
- Stage 2 (instruction tuning): We jointly train the LLM and the projection module for one epoch on 665K synthetic conversation samples from the LLaVA training corpus to improve the model's ability to follow complex instructions.
- Stage 3 (reasoning fine-tuning): We fine-tune the full model on 198K reasoning samples from TRUST-Instruct for three epochs to further strengthen its misinformation-specific reasoning capabilities.
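The following minimal sketch captures this schedule as a per-stage configuration plus a freezing helper. Only the epoch counts, dataset names and sizes, and which components are trained per stage come from the description above; the module attribute names, learning rates, and the helper function are placeholder assumptions, not the released training code.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class StageConfig:
    name: str
    datasets: List[str]
    trainable_modules: List[str]  # which parts of the model receive gradients
    epochs: int
    lr: float                     # assumed values; not specified in this card

STAGES = [
    StageConfig("stage1_projection_alignment",
                ["VisualNews-653K", "LLaVA-pretrain-558K"],
                ["projector"], epochs=1, lr=1e-3),
    StageConfig("stage2_instruction_tuning",
                ["LLaVA-instruct-665K"],
                ["projector", "llm"], epochs=1, lr=2e-5),
    StageConfig("stage3_trust_instruct_finetune",
                ["TRUST-Instruct-198K"],
                ["vision_encoder", "projector", "llm"], epochs=3, lr=2e-5),
]

def set_trainable(model, trainable_modules: List[str]) -> None:
    """Freeze every parameter, then unfreeze the modules listed for the stage."""
    for p in model.parameters():
        p.requires_grad = False
    for name in trainable_modules:
        for p in getattr(model, name).parameters():
            p.requires_grad = True
```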
Resources for More Information
Paper or resources for more information:
- 🔗 Paper: https://arxiv.org/abs/2509.04448 (to appear at EMNLP 2025)
- 🌐 Project: https://yanzehong.github.io/trust-vl/
- 📄 Dataset: TRUST-Instruct
Citation
If you find our paper and code useful in your research, please consider giving a star ⭐ and citation 📝 :)
@article{yan2025trustvl,
  title={{TRUST-VL}: An Explainable News Assistant for General Multimodal Misinformation Detection},
  author={Yan, Zehong and Qi, Peng and Hsu, Wynne and Lee, Mong Li},
  journal={arXiv preprint arXiv:2509.04448},
  year={2025}
}
