Search - ResearchTracker

Computer-use agents (CUAs) operate graphical user interfaces (GUIs) through vision and control primitives, and their capabilities have advanced rapidly, driven in part by standardized online evaluation benchmarks such as OSWorld, which serve both as evaluation tools and as training environments for reinforcement learning. However, macOS remains underserved in this landscape: the only existing benchmark, macOSWorld, covers a narrow slice of first-party applications with simpler tasks, and runs on x86 virtual machines incompatible with Apple Silicon. We introduce MacArena, a benchmark of 421 manually verified tasks spanning 50 applications that combines a curated port of OSWorld tasks, content sourced from macOSWorld, and 49 new macOS-native tasks, all running on Apple's native Virtualization framework on Apple Silicon. We argue that macOS presents distinct GUI challenges beyond what Linux-based benchmarks capture, and our evaluation supports this claim: strong model performance on existing benchmarks can reflect familiarity with task distributions rather than genuine cross-platform GUI competence. Notably, model rankings invert between ported and macOS-native tasks, with a leading m

macOSWorld: A Multilingual Interactive Benchmark for GUI Agents

arXiv, 2025-06-04. Authors: Pei Yang, Hai Ci, Mike Zheng Shou

Graphical User Interface (GUI) agents show promising capabilities for automating computer-use tasks and facilitating accessibility, but existing interactive benchmarks are mostly English-only, covering web-use or Windows, Linux, and Android environments, but not macOS. macOS is a major OS with distinctive GUI patterns and exclusive applications. To bridge the gaps, we present macOSWorld, the first comprehensive benchmark for evaluating GUI agents on macOS. macOSWorld features 202 multilingual interactive tasks across 30 applications (28 macOS-exclusive), with task instructions and OS interfaces offered in 5 languages (English, Chinese, Arabic, Japanese, and Russian). As GUI agents are shown to be vulnerable to deception attacks, macOSWorld also includes a dedicated safety benchmarking subset. Our evaluation on six GUI agents reveals a dramatic gap: proprietary computer-use agents lead at above 30% success rate, while open-source lightweight research models lag at below 5%, highlighting the need for macOS domain adaptation. Multilingual benchmarks also expose common weaknesses, especially in Arabic, with a 28.8% average degradation compared to English. Results from safety benchmark

Search results: MacOS

MacArena: Benchmarking Computer Use Agents on an Online macOS Environment

macOSWorld: A Multilingual Interactive Benchmark for GUI Agents

The Role of Domain-Specific Features in Malware Detection: A macOS Case Study

Exposing Hidden Interfaces: LLM-Guided Type Inference for Reverse Engineering macOS Private Frameworks

Screen2AX: Vision-Based Approach for Automatic macOS Accessibility Generation

Crash-Consistent Checkpointing for AI Training on macOS/APFS

Position Paper: Think Globally, React Locally -- Bringing Real-time Reference-based Website Phishing Detection on macOS

Auditing Apple's DifferentialPrivacy.framework: Implementation Bugs, Misconfigurations, and Practical Risks

One (Thread) Can Keep a (PRNG) Secret, but not Two

GUIrilla: A Scalable Framework for Automated Desktop UI Exploration

MaCo-GAN: Manifold-Contrastive Adversarial Learning for Single Image Super-Resolution

Performance Evaluation of Bitstring Representations in a Linear Genetic Programming Framework

Linux for Everyone: Can Standardization Drive Mainstream Adoption?

FeynGame 3.0

Treemble: A Graphical Tool to Generate Newick Strings from Phylogenetic Tree Images

MACO: A Multi-Agent LLM Framework for Automated CGRA Hardware/Software Co-Design

"Are We Done Yet?": A Vision-Based Judge for Autonomous Task Completion of Computer Use Agents

The Illusion of Randomness: An Empirical Analysis of Address Space Layout Randomization Implementations

Complexation of a Thermoresponsive Brush-Type Polyelectrolyte with an Oppositely Charged Surfactant: Effect of Temperature and Surfactant Concentration

Portable-CELLxGENE: standalone executables of CELLxGENE for easy installation