Search - ResearchTracker

Multimodal Large Language Models (MLLMs) have demonstrated strong performance on the UI-to-code task, which aims to generate UI code from design mock-ups. However, when applied to long and complex websites, they often struggle with fragmented segmentation, redundant code generation for repetitive components, and frequent UI inconsistencies. To systematically investigate and address these challenges, we introduce ComUIBench, a new multi-page complex webpage benchmark with component annotations, designed to evaluate MLLMs' ability to generate reusable UI code in realistic website scenarios. Building upon this benchmark, we propose ComUICoder, a component-based UI code generation framework that emphasizes semantic-aware segmentation, code reuse, and fine-grained refinement. Specifically, ComUICoder incorporates (1) Hybrid Semantic-aware Block Segmentation for accurate detection of semantically coherent UI blocks, (2) Visual-aware Graph-based Block Merge to consolidate structurally similar components within and across webpages for reusable implementation, and (3) Priority-based Element-wise Feedback to refine generated code and reduce element-level inconsistencies. Extensive experiments demons

What's Missing in Screen-to-Action? Towards a UI-in-the-Loop Paradigm for Multimodal GUI Reasoning

arXiv, 2026-04-08. Authors: Songze Li, Xiaoke Guo, Tianqi Liu

Existing Graphical User Interface (GUI) reasoning tasks remain challenging, particularly in UI understanding. Current methods typically rely on direct screen-based decision-making, which lacks interpretability and overlooks a comprehensive understanding of UI elements, ultimately leading to task failure. To improve understanding of and interaction with UIs, we propose a GUI reasoning paradigm called UI-in-the-Loop (UILoop). Our approach treats the GUI reasoning task as a cyclic Screen-UI elements-Action process. By enabling Multimodal Large Language Models (MLLMs) to explicitly learn the localization, semantic functions, and practical usage of key UI elements, UILoop achieves precise element discovery and performs interpretable reasoning. Furthermore, we introduce a more challenging UI Comprehension task centered on UI elements, with three evaluation metrics. Correspondingly, we contribute a benchmark of 26K samples (UI Comprehension-Bench) to comprehensively evaluate existing methods' mastery of UI elements. Extensive experiments demonstrate that UILoop achieves state-of-the-art UI understanding performance while yielding superior results in GUI reasoning tasks.

Search results: Performative-UI

ComUICoder: Component-based Reusable UI Code Generation for Complex Websites via Semantic Segmentation and Element-wise Feedback

What's Missing in Screen-to-Action? Towards a UI-in-the-Loop Paradigm for Multimodal GUI Reasoning

UI-Venus-1.5 Technical Report

FocusUI: Efficient UI Grounding via Position-Preserving Visual Token Selection

UI Placement as a Critical Design Factor for Augmented Reality During Locomotion

UIS-Digger: Towards Comprehensive Research Agent Systems for Real-world Unindexed Information Seeking

Beyond Screenshots: Evaluating VLMs' Understanding of UI Animations

UI-Voyager: A Self-Evolving GUI Agent Learning via Failed Experience

UI-Genie: A Self-Improving Approach for Iteratively Boosting MLLM-based Mobile GUI Agents

UI-Vision: A Desktop-centric GUI Benchmark for Visual Perception and Interaction

UI-Venus Technical Report: Building High-performance UI Agents with RFT

UI2Code^N: UI-to-Code Generation as Interactive Visual Optimization

RWKV-UI: UI Understanding with Enhanced Perception and Reasoning

UI-R1: Enhancing Efficient Action Prediction of GUI Agents by Reinforcement Learning

Morae: Proactively Pausing UI Agents for User Choices

Magentic-UI: Towards Human-in-the-loop Agentic Systems

LLMs as Scalable, General-Purpose Simulators For Evolving Digital Agent Training

MLLM-Based UI2Code Automation Guided by UI Layout Information

MAI-UI Technical Report: Real-World Centric Foundation GUI Agents

Ferret-UI Lite: Lessons from Building Small On-Device GUI Agents