OS-Symphony: A Holistic Framework for Robust and Generalist Computer-Using Agent
Abstract
OS-Symphony is a comprehensive framework for computer-using agents that improves robustness in long-horizon tasks through reflection-driven memory and multimodal search.
While Vision-Language Models (VLMs) have significantly advanced Computer-Using Agents (CUAs), current frameworks struggle with robustness in long-horizon workflows and generalization to novel domains. These limitations stem from a lack of granular control over how historical visual context is curated and from the absence of visual-aware tutorial retrieval. To bridge these gaps, we introduce OS-Symphony, a holistic framework in which an Orchestrator coordinates two key components for robust automation: (1) a Reflection-Memory Agent that uses milestone-driven long-term memory to enable trajectory-level self-correction, mitigating visual context loss in long-horizon tasks; and (2) Versatile Tool Agents featuring a Multimodal Searcher that adopts the SeeAct paradigm to navigate a browser-based sandbox and synthesize live, visually aligned tutorials, resolving fidelity issues in unseen scenarios. Experimental results demonstrate that OS-Symphony delivers substantial performance gains across model scales, establishing new state-of-the-art results on three online benchmarks, notably 65.84% on OSWorld.
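The abstract describes an Orchestrator coordinating a Reflection-Memory Agent and a Multimodal Searcher but does not publish an interface. The Python sketch below is a minimal illustration of how such an orchestration loop might be wired; every name in it (`ReflectionMemory`, `MultimodalSearcher`, `run_episode`, the `env`/`vlm` objects and their `screenshot`/`execute`/`plan_action` methods, and the two-failure reflection threshold) is an assumption for illustration, not the paper's actual API.

```python
"""Hypothetical sketch of an OS-Symphony-style orchestration loop.

All class and method names are illustrative assumptions; the paper does
not specify an implementation. The sketch only shows how milestone-driven
reflection memory and on-demand tutorial search could plug into a
screenshot -> action loop.
"""

from dataclasses import dataclass, field


@dataclass
class Milestone:
    """Compact summary of a completed sub-goal, kept instead of raw screenshots."""
    description: str
    screenshot_digest: str  # e.g. a caption or hash, not the full image


@dataclass
class ReflectionMemory:
    """Milestone-driven long-term memory enabling trajectory-level self-correction."""
    milestones: list[Milestone] = field(default_factory=list)

    def record(self, milestone: Milestone) -> None:
        self.milestones.append(milestone)

    def reflect(self, recent_failures: list[str]) -> str | None:
        """Return a revised-plan hint if the trajectory appears stuck."""
        if len(recent_failures) >= 2:  # assumed threshold, not from the paper
            done = "; ".join(m.description for m in self.milestones)
            return f"Progress so far: {done}. Avoid repeating: {recent_failures[-1]}"
        return None


class MultimodalSearcher:
    """SeeAct-style searcher that browses a sandbox for live tutorials."""

    def synthesize_tutorial(self, task: str, screenshot: bytes) -> str:
        # Placeholder: a real implementation would drive a VLM through a
        # browser sandbox and ground each tutorial step in live screenshots.
        return f"(tutorial placeholder for task: {task})"


def run_episode(task: str, env, vlm, max_steps: int = 50) -> bool:
    """Orchestrator loop: act, record milestones, reflect, search when stuck."""
    memory = ReflectionMemory()
    searcher = MultimodalSearcher()
    failures: list[str] = []
    tutorial: str | None = None

    for _ in range(max_steps):
        screenshot = env.screenshot()
        hint = memory.reflect(failures)
        if hint and tutorial is None:
            tutorial = searcher.synthesize_tutorial(task, screenshot)

        action = vlm.plan_action(task, screenshot, memory.milestones, hint, tutorial)
        ok, info = env.execute(action)
        if ok:
            failures.clear()
            if info.get("milestone"):
                memory.record(Milestone(info["milestone"], info.get("digest", "")))
            if info.get("done"):
                return True
        else:
            failures.append(info.get("error", "unknown failure"))
    return False
```

One design point the sketch makes concrete: keeping only milestone digests rather than full screenshots is what keeps the prompt small as trajectories grow, which is the visual-context-loss problem the abstract highlights.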
Community
Despite VLM advances, current CUA frameworks remain brittle in long-horizon workflows and weak in novel domains, owing to coarse management of historical visual context and the lack of visual-aware tutorial retrieval. OS-Symphony addresses this with an orchestrated framework that combines milestone-driven reflection memory for trajectory-level self-correction with a SeeAct-style multimodal searcher that synthesizes visually aligned live tutorials, achieving new state-of-the-art results across three online benchmarks (65.84% on OSWorld).
The following similar papers were recommended by the Semantic Scholar API:
- OmniAgent: Audio-Guided Active Perception Agent for Omnimodal Audio-Video Understanding (2025)
- EchoTrail-GUI: Building Actionable Memory for GUI Agents via Critic-Guided Self-Exploration (2025)
- VLN-MME: Diagnosing MLLMs as Language-guided Visual Navigation agents (2025)
- WebCoach: Self-Evolving Web Agents with Cross-Session Memory Guidance (2025)
- Step-GUI Technical Report (2025)
- OS-Oracle: A Comprehensive Framework for Cross-Platform GUI Critic Models (2025)
- Mobile-Agent-RAG: Driving Smart Multi-Agent Coordination with Contextual Knowledge Empowerment for Long-Horizon Mobile Automation (2025)