Search - ResearchTracker

Recent advances in vision-language models (VLMs) and reinforcement learning (RL) have driven progress in GUI automation. However, most existing methods rely on static, one-shot visual inputs and passive perception, lacking the ability to adaptively determine when, whether, and how to observe the interface. We present GUI-Eyes, a reinforcement learning framework for active visual perception in GUI tasks. To acquire more informative observations, the agent learns to make strategic decisions on both whether and how to invoke visual tools, such as cropping or zooming, within a two-stage reasoning process. To support this behavior, we introduce a progressive perception strategy that decomposes decision-making into coarse exploration and fine-grained grounding, coordinated by a two-level policy. In addition, we design a spatially continuous reward function tailored to tool usage, which integrates both location proximity and region overlap to provide dense supervision and alleviate the reward sparsity common in GUI environments. On the ScreenSpot-Pro benchmark, GUI-Eyes-3B achieves 44.8% grounding accuracy using only 3k labeled samples, significantly outperforming both supervised and RL-b
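The abstract says the tool-usage reward combines location proximity and region overlap to give dense supervision. The exact formula is not given in this listing, so the following is only a minimal sketch under stated assumptions: the function names (`iou`, `proximity`, `tool_reward`), the exponential distance decay, the screen-diagonal normalization, and the equal weights are all hypothetical choices for illustration, not the paper's definition.

```python
import math

# Boxes are (x1, y1, x2, y2) in pixels. All names and constants here are
# illustrative assumptions, not taken from the GUI-Eyes paper.

def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def proximity(crop, target, diag):
    """Score in (0, 1] that decays with the distance between box centers."""
    cx, cy = (crop[0] + crop[2]) / 2, (crop[1] + crop[3]) / 2
    tx, ty = (target[0] + target[2]) / 2, (target[1] + target[3]) / 2
    return math.exp(-math.hypot(cx - tx, cy - ty) / diag)

def tool_reward(crop, target, screen_size, w_prox=0.5, w_iou=0.5):
    """Weighted sum of center proximity and region overlap (assumed form)."""
    diag = math.hypot(*screen_size)  # normalize distance by screen diagonal
    return w_prox * proximity(crop, target, diag) + w_iou * iou(crop, target)
```

Under this assumed form, a crop that misses the target entirely still earns a nonzero reward that grows as the crop moves closer, which is one way a continuous signal can ease the sparsity of a pure hit-or-miss reward.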

MAESTRO: Adapting GUIs and Guiding Navigation with User Preferences in Conversational Agents with GUIs

arXiv, 2026-04-07. Authors: Sangwook Lee, Sang Won Lee, Adnan Abbas

Modern task-oriented chatbots present GUI elements alongside natural-language dialogue, yet the agent's role has largely been limited to interpreting natural-language input as GUI actions and following a linear workflow. In preference-driven, multi-step tasks such as booking a flight or reserving a restaurant, earlier choices constrain later options and may force users to restart from scratch. User preferences serve as the key criteria for these decisions, yet existing agents do not systematically leverage them. We present MAESTRO, which extends the agent's role from execution to decision support. MAESTRO maintains a shared preference memory that extracts preferences from natural-language utterances with their strength, and provides two mechanisms. Preference-Grounded GUI Adaptation applies in-place operators (augment, sort, filter, and highlight) to the existing GUI according to preference strength, supporting within-stage comparison. Preference-Guided Workflow Navigation detects conflicts between preferences and available options, proposes backtracking, and records failed paths to avoid revisiting dead ends. We evaluated MAESTRO in a movie-booking Conversational Agent with GUI (C
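The workflow-navigation idea described above (detect a conflict, backtrack, remember failed paths) can be pictured as a depth-first search with dead-end memory. This is a rough sketch only: `navigate`, `satisfies`, the stage layout, and the toy booking data are hypothetical and are not MAESTRO's actual implementation or API.

```python
def navigate(stages, satisfies, failed=None):
    """Search staged choices depth-first, recording dead ends.

    stages:    list of option lists, one per workflow stage (assumed layout)
    satisfies: callable(path) -> bool; False signals a preference conflict
    failed:    set of partial paths already known to be dead ends
    """
    failed = set() if failed is None else failed

    def step(path):
        if len(path) == len(stages):
            return path  # every stage has a conflict-free choice
        for option in stages[len(path)]:
            candidate = path + (option,)
            if candidate in failed:
                continue  # skip a dead end recorded earlier
            if not satisfies(candidate):
                failed.add(candidate)  # conflict detected at this choice
                continue
            result = step(candidate)
            if result is not None:
                return result
            failed.add(candidate)  # all continuations failed: backtrack
        return None

    return step(()), failed

# Toy example: the user wants a 21:00 showing, but Friday only offers 18:00.
available = {("Fri", "18:00"), ("Sat", "18:00"), ("Sat", "21:00")}

def satisfies(path):
    if len(path) < 2:
        return True
    return path in available and path[1] == "21:00"

result, failed = navigate([["Fri", "Sat"], ["18:00", "21:00"]], satisfies)
# result == ("Sat", "21:00"); ("Fri",) is recorded in failed
```

Passing the returned `failed` set into a later call lets the search skip paths that were already ruled out, which mirrors the stated goal of not revisiting dead ends.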

Search results: Guys

GUI-Eyes: Tool-Augmented Perception for Visual Grounding in GUI Agents

MAESTRO: Adapting GUIs and Guiding Navigation with User Preferences in Conversational Agents with GUIs

DRS-GUI: Dynamic Region Search for Training-Free GUI Grounding

AdaZoom-GUI: Adaptive Zoom-based GUI Grounding with Instruction Refinement

GUI-GENESIS: Automated Synthesis of Efficient Environments with Verifiable Rewards for GUI Agent Post-Training

GUI-CEval: A Hierarchical and Comprehensive Chinese Benchmark for Mobile GUI Agents

GUI-AC: Enhancing Continual Learning in GUI Agents

Continual GUI Agents

POINTS-GUI-G: GUI-Grounding Journey

Towards GUI Agents: Vision-Language Diffusion Models for GUI Grounding

GUI-CIDER: Mid-training GUI Agents via Causal Internalization and Density-aware Exemplar Reselection

GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents

DiMo-GUI: Advancing Test-time Scaling in GUI Grounding via Modality-Aware Visual Reasoning

GUI-ARP: Enhancing Grounding with Adaptive Region Perception for GUI Agents

GUI-ReWalk: Massive Data Generation for GUI Agent via Stochastic Exploration and Intent-Aware Reasoning

GUI Knowledge Bench: Revealing the Knowledge Gap of VLMs in GUI Tasks

GUI-Shift: Enhancing VLM-Based GUI Agents through Self-supervised Reinforcement Learning

ScaleTrack: Scaling and back-tracking Automated GUI Agents

MEGA-GUI: Multi-stage Enhanced Grounding Agents for GUI Elements

Mobile-Agent-v3: Fundamental Agents for GUI Automation