Large language model agents that interact with PC applications often face limitations due to their singular mode of interaction with real-world environments, leading to restricted versatility and frequent hallucinations. To address this, we propose the Multi-Modal Agent Collaboration framework (MMAC-Copilot), a framework utilizes the collective expertise of diverse agents to enhance interaction ability with application. The framework introduces a team collaboration chain, enabling each participating agent to contribute insights based on their specific domain knowledge, effectively reducing the hallucination associated with knowledge domain gaps. We evaluate MMAC-Copilot using the GAIA benchmark and our newly introduced Visual Interaction Benchmark (VIBench). MMAC-Copilot achieved exceptional performance on GAIA, with an average improvement of 6.8\% over existing leading systems. VIBench focuses on non-API-interactable applications across various domains, including 3D gaming, recreation, and office scenarios. It also demonstrated remarkable capability on VIBench. We hope this work can inspire in this field and provide a more comprehensive assessment of Autonomous agents. The anonymo
使用 AI 将内容摘要翻译为中文,便于快速阅读
使用 AI 分析这篇文章的核心发现、关键要点和深度见解
由 DeepSeek AI 提供分析 · 首次使用需配置 API Key