Communicating about Space: Language-Mediated Spatial Integration Across Partial Views
Abstract
MLLMs show limited capability on collaborative spatial communication tasks, reaching only 72% accuracy versus 95% for humans; unlike human dialogues, which become more specific as partners converge, model dialogues fail to build a consistent shared mental model.
Humans build shared spatial understanding by communicating partial, viewpoint-dependent observations. We ask whether Multimodal Large Language Models (MLLMs) can do the same, aligning distinct egocentric views through dialogue to form a coherent, allocentric mental model of a shared environment. To study this systematically, we introduce COSMIC, a benchmark for Collaborative Spatial Communication. In this setting, two static MLLM agents observe a 3D indoor environment from different viewpoints and exchange natural-language messages to solve spatial queries. COSMIC contains 899 diverse scenes and 1,250 question-answer pairs spanning five tasks. We find a consistent capability hierarchy: MLLMs are most reliable at identifying shared anchor objects across views, perform worse on relational reasoning, and largely fail at building globally consistent maps, where even frontier models perform near chance. Moreover, we find that thinking capability yields consistent gains in anchor grounding but is insufficient for higher-level spatial communication. To contextualize model behavior, we additionally collect 250 human-human dialogues. Humans achieve 95% aggregate accuracy, leaving significant room for improvement even for the best-performing model, Gemini-3-Pro-Thinking, which achieves 72% aggregate accuracy. Moreover, human conversations become increasingly specific as partners converge on a shared mental model, whereas model dialogues continue to explore new possibilities rather than converging, consistent with a limited ability to build and maintain a robust shared mental model. Our code and data are available at https://github.com/ankursikarwar/Cosmic
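The described setup lends itself to a simple dialogue harness: each agent sees only its own view plus the running transcript, turns alternate for a fixed budget, and one agent then commits to an answer. Below is a minimal Python sketch of such a loop; the function name `query_mllm`, the turn budget, and the choice of which agent gives the final answer are illustrative assumptions, not the benchmark's actual protocol.

```python
# Minimal sketch of a two-agent spatial-communication loop in the spirit of
# COSMIC's setup (two static viewers, message exchange, final answer).
# query_mllm, MAX_TURNS, and the answering convention are assumptions,
# not the benchmark's actual harness.

MAX_TURNS = 6  # assumed dialogue budget per question


def query_mllm(image_path: str, transcript: list[str], question: str, role: str) -> str:
    """Placeholder for a call to an MLLM given one egocentric view plus the
    dialogue so far. Swap in whichever multimodal chat API you use."""
    raise NotImplementedError


def run_dialogue(view_a: str, view_b: str, question: str) -> str:
    transcript: list[str] = []
    views = {"A": view_a, "B": view_b}
    speaker = "A"
    for _ in range(MAX_TURNS):
        # Each agent only ever sees its own view plus the shared transcript.
        message = query_mllm(views[speaker], transcript, question, role=speaker)
        transcript.append(f"{speaker}: {message}")
        speaker = "B" if speaker == "A" else "A"
    # After the exchange, one agent commits to an answer from its own view
    # plus the full transcript (who answers is an assumption of this sketch).
    return query_mllm(view_a, transcript, question, role="A (final answer)")
```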
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Reasoning over Video: Evaluating How MLLMs Extract, Integrate, and Reconstruct Spatiotemporal Evidence (2026)
- MA-EgoQA: Question Answering over Egocentric Videos from Multiple Embodied Agents (2026)
- VIEW2SPACE: Studying Multi-View Visual Reasoning from Sparse Observations (2026)
- CVT-Bench: Counterfactual Viewpoint Transformations Reveal Unstable Spatial Representations in Multimodal LLMs (2026)
- Learning Multi-View Spatial Reasoning from Cross-View Relations (2026)
- MultihopSpatial: Multi-hop Compositional Spatial Reasoning Benchmark for Vision-Language Model (2026)
- SpatialText: A Pure-Text Cognitive Benchmark for Spatial Understanding in Large Language Models (2026)
The most striking part is that even thinking-enabled models, while boosting anchor grounding, barely move the needle on global, allocentric mapping. COSMIC's two-static-agent setup with no parameter sharing isolates linguistic coordination, yet it makes clear that cross-view relational and map-level reasoning lag far behind anchor grounding. The arxivlens breakdown helped me parse the method details and spot where the bottlenecks lie: https://arxivlens.com/PaperView/Details/communicating-about-space-language-mediated-spatial-integration-across-partial-views-3429-793d3984. One practical direction could be explicit reference-frame agreement or sketch-based spatial hints to bootstrap a shared map, rather than relying on thinking alone (a rough sketch of this idea follows below). Do you think adding such cues would lift cognitive mapping more than pushing the model to "think" through longer dialogues?
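As a rough illustration of the reference-frame-agreement idea, here is a hypothetical variant of the sketch after the abstract that adds a frame-proposal phase before the main exchange. The prompt wording, the two-phase split, and the turn budget are all assumptions, and it reuses the `query_mllm` placeholder from the earlier sketch.

```python
# Hypothetical extension: before the main exchange, each agent proposes a
# shared anchor object and a direction convention, and the agreed frame is
# prepended to the transcript. Everything here is illustrative, not a method
# from the paper.

FRAME_PROMPT = (
    "Name one object you think both of us can see, and propose a direction "
    "convention relative to it (e.g., the side it faces is 'front'). "
    "One sentence only."
)
MAIN_TURNS = 4  # assumed budget for the query-solving phase


def run_with_frame_agreement(view_a: str, view_b: str, question: str) -> str:
    transcript: list[str] = []
    views = {"A": view_a, "B": view_b}
    # Phase 1: reference-frame agreement (one proposal per agent).
    for speaker in ("A", "B"):
        proposal = query_mllm(views[speaker], transcript, FRAME_PROMPT, role=speaker)
        transcript.append(f"{speaker} (frame proposal): {proposal}")
    # Phase 2: ordinary message exchange, now grounded in the agreed frame.
    speaker = "A"
    for _ in range(MAIN_TURNS):
        message = query_mllm(views[speaker], transcript, question, role=speaker)
        transcript.append(f"{speaker}: {message}")
        speaker = "B" if speaker == "A" else "A"
    return query_mllm(view_a, transcript, question, role="A (final answer)")
```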
