Existing Graphical User Interface (GUI) reasoning tasks remain challenging, particularly in UI understanding. Current methods typically rely on direct screen-based decision-making, which lacks interpretability and overlooks a comprehensive understanding of UI elements, ultimately leading to task failure. To enhance the understanding and interaction with UIs, we propose an innovative GUI reasoning paradigm called UI-in-the-Loop (UILoop). Our approach treats the GUI reasoning task as a cyclic Screen-UI elements-Action process. By enabling Multimodal Large Language Models (MLLMs) to explicitly learn the localization, semantic functions, and practical usage of key UI elements, UILoop achieves precise element discovery and performs interpretable reasoning. Furthermore, we introduce a more challenging UI Comprehension task centered on UI elements with three evaluation metrics. Correspondingly, we contribute a benchmark of 26K samples (UI Comprehension-Bench) to comprehensively evaluate existing methods' mastery of UI elements. Extensive experiments demonstrate that UILoop achieves state-of-the-art UI understanding performance while yielding superior results in GUI reasoning tasks.
Designing user interfaces that align with user preferences is a time-consuming process, which requires iterative cycles of prototyping, user testing, and refinement. Recent advancements in LLM-based UI generation have enabled efficient UI generation to assist the UI design process. We introduce AlignUI, a method that aligns LLM-generated UIs with user tasks and preferences by using a user preference dataset to guide the LLM's reasoning process. The dataset was crowdsourced from 50 general users (the target users of generated UIs) and contained 720 UI control preferences on eight image-editing tasks. We evaluated AlignUI by generating UIs for six unseen tasks and conducting a user study with 72 additional general users. The results showed that the generated UIs closely align with multiple dimensions of user preferences. We conclude by discussing the applicability of our method to support user-aligned UI design for multiple task domains and user groups, as well as personalized user needs.
In the mobile development process, creating the user interface (UI) is highly resource intensive. Consequently, numerous studies have focused on automating UI development, such as generating UI from screenshots or design specifications. However, they heavily rely on computer vision techniques for image recognition. Any recognition errors can cause invalid UI element generation, compromising the effectiveness of these automated approaches. Moreover, developing an app UI from scratch remains a time consuming and labor intensive task. To address this challenge, we propose a novel approach called GUIMIGRATOR, which enables the cross platform migration of existing Android app UIs to iOS, thereby automatically generating UI to facilitate the reuse of existing UI. This approach not only avoids errors from screenshot recognition but also reduces the cost of developing UIs from scratch. GUIMIGRATOR extracts and parses Android UI layouts, views, and resources to construct a UI skeleton tree. GUIMIGRATOR generates the final UI code files utilizing target code templates, which are then compiled and validated in the iOS development platform, i.e., Xcode. We evaluate the effectiveness of GUIMIGR
As personal agents evolve to handle complex, user-centric tasks, static plain-text chat is rapidly becoming a bottleneck. Generative UI emerges as the necessary new interface layer, dynamically synthesizing the right controls, options, and state from the interaction context in real time. We present Macaron-A2UI, a model for Generative UI in personal agents. Our goal is to move beyond text-only interaction by enabling agents to generate natural language together with lightweight, executable UI actions for information collection, preference refinement, confirmation, and multi-goal organization. We build a large-scale Generative UI corpus from heterogeneous dialogue sources, introduce A2UI-Bench for controlled evaluation, and train 30B, 235B and 754B models with parameter-efficient LoRA-based supervised fine-tuning followed by reward-driven reinforcement learning. The best Macaron-A2UI model reaches 75.6 overall on A2UI-Bench without explicit schema hints, surpassing the strongest full-schema frontier baseline. We release the models, benchmark, and evaluation protocol to support future work on Generative UI for personal agents.
Developing and testing user interfaces (UIs) and training AI agents to interact with them are challenging due to the dynamic and diverse nature of real-world mobile environments. Existing methods often rely on cumbersome physical devices or limited static analysis of screenshots, which hinders scalable testing and the development of intelligent UI agents. We introduce UISim, a novel image-based UI simulator that offers a dynamic and interactive platform for exploring mobile phone environments purely from screen images. Our system employs a two-stage method: given an initial phone screen image and a user action, it first predicts the abstract layout of the next UI state, then synthesizes a new, visually consistent image based on this predicted layout. This approach enables the realistic simulation of UI transitions. UISim provides immediate practical benefits for UI testing, rapid prototyping, and synthetic data generation. Furthermore, its interactive capabilities pave the way for advanced applications, such as UI navigation task planning for AI agents. Our experimental results show that UISim outperforms end-to-end UI generation baselines in generating realistic and coherent subse
Online Reinforcement Learning (RL) offers a promising paradigm for enhancing GUI agents through direct environment interaction. However, its effectiveness is severely hindered by inefficient credit assignment in long-horizon tasks and repetitive errors across tasks due to the lack of experience transfer. To address these challenges, we propose UI-Mem, a novel framework that enhances GUI online RL with a Hierarchical Experience Memory. Unlike traditional replay buffers, our memory accumulates structured knowledge, including high-level workflows, subtask skills, and failure patterns. These experiences are stored as parameterized templates that enable cross-task and cross-application transfer. To effectively integrate memory guidance into online RL, we introduce Stratified Group Sampling, which injects varying levels of guidance across trajectories within each rollout group to maintain outcome diversity, driving the unguided policy toward internalizing guided behaviors. Furthermore, a Self-Evolving Loop continuously abstracts novel strategies and errors to keep the memory aligned with the agent's evolving policy. Experiments on online GUI benchmarks demonstrate that UI-Mem significant
In this paper, we introduce UI-Genie, a self-improving framework addressing two key challenges in GUI agents: verification of trajectory outcome is challenging and high-quality training data are not scalable. These challenges are addressed by a reward model and a self-improving pipeline, respectively. The reward model, UI-Genie-RM, features an image-text interleaved architecture that efficiently pro- cesses historical context and unifies action-level and task-level rewards. To sup- port the training of UI-Genie-RM, we develop deliberately-designed data genera- tion strategies including rule-based verification, controlled trajectory corruption, and hard negative mining. To address the second challenge, a self-improvement pipeline progressively expands solvable complex GUI tasks by enhancing both the agent and reward models through reward-guided exploration and outcome verification in dynamic environments. For training the model, we generate UI- Genie-RM-517k and UI-Genie-Agent-16k, establishing the first reward-specific dataset for GUI agents while demonstrating high-quality synthetic trajectory gen- eration without manual annotation. Experimental results show that UI-Genie achieves
Humans can learn to operate the user interface (UI) of an application by reading an instruction manual or how-to guide. Along with text, these resources include visual content such as UI screenshots and images of application icons referenced in the text. We explore how to leverage this data to learn generic visio-linguistic representations of UI screens and their components. These representations are useful in many real applications, such as accessibility, voice navigation, and task automation. Prior UI representation models rely on UI metadata (UI trees and accessibility labels), which is often missing, incompletely defined, or not accessible. We avoid such a dependency, and propose Lexi, a pre-trained vision and language model designed to handle the unique features of UI screens, including their text richness and context sensitivity. To train Lexi we curate the UICaption dataset consisting of 114k UI images paired with descriptions of their functionality. We evaluate Lexi on four tasks: UI action entailment, instruction-based UI image retrieval, grounding referring expressions, and UI entity recognition.
Large Language Model (LLM)-based UI agents show great promise for UI automation but often hallucinate in long-horizon tasks due to their lack of understanding of the global UI transition structure. To address this, we introduce AGENT+P, a novel framework that leverages symbolic planning to guide LLM-based UI agents. Specifically, we model an app's UI transition structure as a UI Transition Graph (UTG), which allows us to reformulate the UI automation task as a pathfinding problem on the UTG. This further enables an off-the-shelf symbolic planner to generate a provably correct and optimal high-level plan, preventing the agent from redundant exploration and guiding the agent to achieve the automation goals. AGENT+P is designed as a plug-and-play framework to enhance existing UI agents. Evaluation on the AndroidWorld benchmark demonstrates that AGENT+P improves the success rates of state-of-the-art UI agents by up to 14.31% and reduces the action steps by 37.70%.
Texts, widgets, and images on a UI page do not work separately. Instead, they are partitioned into groups to achieve certain interaction functions or visual information. Existing studies on UI elements grouping mainly focus on a specific single UI-related software engineering task, and their groups vary in appearance and function. In this case, we propose our semantic component groups that pack adjacent text and non-text elements with similar semantics. In contrast to those task-oriented grouping methods, our semantic component group can be adopted for multiple UI-related software tasks, such as retrieving UI perceptual groups, improving code structure for automatic UI-to-code generation, and generating accessibility data for screen readers. To recognize semantic component groups on a UI page, we propose a robust, deep learning-based vision detector, UISCGD, which extends the SOTA deformable-DETR by incorporating UI element color representation and a learned prior on group distribution. The model is trained on our UI screenshots dataset of 1988 mobile GUIs from more than 200 apps in both iOS and Android platforms. The evaluation shows that our UISCGD achieves 6.1\% better than the
AI models excel at creating content, but typically render it with static, predefined interfaces. Specifically, the output of LLMs is often a markdown "wall of text". Generative UI is a long standing promise, where the model generates not just the content, but the interface itself. Until now, Generative UI was not possible in a robust fashion. We demonstrate that when properly prompted and equipped with the right set of tools, a modern LLM can robustly produce high quality custom UIs for virtually any prompt. When ignoring generation speed, results generated by our implementation are overwhelmingly preferred by humans over the standard LLM markdown output. In fact, while the results generated by our implementation are worse than those crafted by human experts, they are at least comparable in 50% of cases. We show that this ability for robust Generative UI is emergent, with substantial improvements from previous models. We also create and release PAGEN, a novel dataset of expert-crafted results to aid in evaluating Generative UI implementations, as well as the results of our system for future comparisons. Interactive examples can be seen at https://generativeui.github.io
We present UI-Venus, a native UI agent that takes only screenshots as input based on a multimodal large language model. UI-Venus achieves SOTA performance on both UI grounding and navigation tasks using only several hundred thousand high-quality training samples through reinforcement finetune (RFT) based on Qwen2.5-VL. Specifically, the 7B and 72B variants of UI-Venus obtain 94.1% / 50.8% and 95.3% / 61.9% on the standard grounding benchmarks, i.e., Screenspot-V2 / Pro, surpassing the previous SOTA baselines including open-source GTA1 and closed-source UI-TARS-1.5. To show UI-Venus's summary and planing ability, we also evaluate it on the AndroidWorld, an online UI navigation arena, on which our 7B and 72B variants achieve 49.1% and 65.9% success rate, also beating existing models. To achieve this, we introduce carefully designed reward functions for both UI grounding and navigation tasks and corresponding efficient data cleaning strategies. To further boost navigation performance, we propose Self-Evolving Trajectory History Alignment & Sparse Action Enhancement that refine historical reasoning traces and balances the distribution of sparse but critical actions, leading to more
The importance of computational modeling of mobile user interfaces (UIs) is undeniable. However, these require a high-quality UI dataset. Existing datasets are often outdated, collected years ago, and are frequently noisy with mismatches in their visual representation. This presents challenges in modeling UI understanding in the wild. This paper introduces a novel approach to automatically mine UI data from Android apps, leveraging Large Language Models (LLMs) to mimic human-like exploration. To ensure dataset quality, we employ the best practices in UI noise filtering and incorporate human annotation as a final validation step. Our results demonstrate the effectiveness of LLMs-enhanced app exploration in mining more meaningful UIs, resulting in a large dataset MUD of 18k human-annotated UIs from 3.3k apps. We highlight the usefulness of MUD in two common UI modeling tasks: element detection and UI retrieval, showcasing its potential to establish a foundation for future research into high-quality, modern UIs.
UI-to-code aims to translate UI screenshots into executable front-end code. Despite progress with vision-language models (VLMs), most existing methods formulate UI-to-code as a single-pass generation, which mismatches real-world UI development that is inherently iterative and feedback-driven. We reformulate UI-to-code as an interactive visual optimization problem, where code generation is embedded in a closed-loop process of execution, visual inspection, and iterative refinement driven by rendered visual feedback. To address the non-differentiability of visual objectives and the noise of absolute visual evaluators, we propose Relative Visual Policy Optimization (RVPO), a preference-based reinforcement learning method that optimizes relative visual rankings among rendered candidates under execution feedback. We instantiate this paradigm in UI2Code^N, an open-source 9B model trained via continual pre-training, supervised fine-tuning, and reinforcement learning. Experiments demonstrate state-of-the-art performance on UI drafting, UI polishing, and UI editing benchmarks, even outperforming larger models, with performance consistently improving through iterative visual optimization. Our
Bug report management is a costly software maintenance process comprised of several challenging tasks. Given the UI-driven nature of mobile apps, bugs typically manifest through the UI, hence the identification of buggy UI screens and UI components (Buggy UI Localization) is important to localizing the buggy behavior and eventually fixing it. However, this task is challenging as developers must reason about bug descriptions (which are often low-quality), and the visual or code-based representations of UI screens. This paper is the first to investigate the feasibility of automating the task of Buggy UI Localization through a comprehensive study that evaluates the capabilities of one textual and two multi-modal deep learning (DL) techniques and one textual unsupervised technique. We evaluate such techniques at two levels of granularity, Buggy UI Screen and UI Component localization. Our results illustrate the individual strengths of models that make use of different representations, wherein models that incorporate visual information perform better for UI screen localization, and models that operate on textual screen information perform better for UI component localization -- highligh
Large Language Models (LLMs) have demonstrated remarkable potential across various design domains, including user interface (UI) generation. However, current LLMs for UI generation tend to offer generic solutions that lack a nuanced understanding of task context and user preferences. We present CrowdGenUI, a framework that enhances LLM-based UI generation with crowdsourced user preferences. This framework addresses the limitations by guiding LLM reasoning with real user preferences, enabling the generation of UI widgets that reflect user needs and task-specific requirements. We evaluate our framework in the image editing domain by collecting a library of 720 user preferences from 50 participants, covering preferences such as predictability, efficiency, and explorability of various UI widgets. A user study (N=78) demonstrates that UIs generated with our preference-guided framework can better match user intentions compared to those generated by LLMs alone, highlighting the effectiveness of our proposed framework. We further discuss the study findings and present insights for future research on LLM-based user-centered UI generation.
We present AutoOptimization, a novel multi-objective optimization framework for adapting user interfaces. From a user's verbal preferences for changing a UI, our framework guides a prioritization-based Pareto frontier search over candidate layouts. It selects suitable objective functions for UI placement while simultaneously parameterizing them according to the user's instructions to define the optimization problem. A solver then generates a series of optimal UI layouts, which our framework validates against the user's instructions to adapt the UI with the final solution. Our approach thus overcomes the previous need for manual inspection of layouts and the use of population averages for objective parameters. We integrate multiple agents sequentially within our framework, enabling the system to leverage their reasoning capabilities to interpret user preferences, configure the optimization problem, and validate optimization outcomes.
With the fast-growing GUI development workload in the Internet industry, some work on intelligent methods attempted to generate maintainable front-end code from UI screenshots. It can be more suitable for utilizing UI design drafts that contain UI metadata. However, fragmented layers inevitably appear in the UI design drafts which greatly reduces the quality of code generation. None of the existing GUI automated techniques detects and merges the fragmented layers to improve the accessibility of generated code. In this paper, we propose UI Layers Merger (UILM), a vision-based method, which can automatically detect and merge fragmented layers into UI components. Our UILM contains Merging Area Detector (MAD) and a layers merging algorithm. MAD incorporates the boundary prior knowledge to accurately detect the boundaries of UI components. Then, the layers merging algorithm can search out the associated layers within the components' boundaries and merge them into a whole part. We present a dynamic data augmentation approach to boost the performance of MAD. We also construct a large-scale UI dataset for training the MAD and testing the performance of UILM. The experiment shows that the p
While generative AI enables high-fidelity UI generation from text prompts, users struggle to articulate design intent and evaluate or refine results-creating gulfs of execution and evaluation. To understand the information needed for UI generation, we conducted a thematic analysis of UI prompting guidelines, identifying key design semantics and discovering that they are hierarchical and interdependent. Leveraging these findings, we developed a system that enables users to specify semantics, visualize relationships, and extract how semantics are reflected in generated UIs. By making semantics serve as an intermediate representation between human intent and AI output, our system bridges both gulfs by making requirements explicit and outcomes interpretable. A comparative user study suggests that our approach enhances users' perceived control over intent expression and outcome interpretation, and facilitates more predictable iterative refinement. Our work demonstrates how explicit semantic representation enables systematic and explainable exploration of design possibilities in AI-driven UI design.
Background: User interface (UI) testing, which is used to verify the behavior of interactive elements in applications, plays an important role in software development and quality assurance. However, little is known about the adoption of UI testing frameworks in continuous integration and continuous delivery (CI/CD) workflows and their impact on open-source software development processes. Objective: We aim to investigate the current usage of popular UI testing frameworks-Selenium, Playwright and Cypress-in CI/CD pipelines among GitHub repositories. Our goal is to understand how UI testing tools are used in CI/CD processes and assess their potential impacts on open-source development activity and CI/CD workflows. Method: We propose an empirical study to examine GitHub repositories that incorporate UI testing in CI/CD workflows. Our exploratory evaluation will collect repositories that implement UI testing frameworks in configuration files for GitHub Actions workflows to inspect UI testing-related and non-UI testing-related workflows. Moreover, we further plan to collect metrics related to repository development activity and GitHub Actions workflows to conduct comparative and time ser