Search - ResearchTracker

Agentic benchmarks increasingly rely on LLM-simulated users to scalably evaluate agent performance, yet the robustness, validity, and fairness of this approach remain unexamined. Through a user study with participants across the United States, India, Kenya, and Nigeria, we investigate whether LLM-simulated users serve as reliable proxies for real human users in evaluating agents on τ-Bench retail tasks. We find that user simulation lacks robustness, with agent success rates varying up to 9 percentage points across different user LLMs. Furthermore, evaluations using simulated users exhibit systematic miscalibration, underestimating agent performance on challenging tasks and overestimating it on moderately difficult ones. African American Vernacular English (AAVE) speakers experience consistently worse success rates and calibration errors than Standard American English (SAE) speakers, with disparities compounding significantly with age. We also find simulated users to be a differentially effective proxy for different populations, performing worst for AAVE and Indian English speakers. Additionally, simulated users introduce conversational artifacts and surface different failure patter

Analysis of User Experience Evaluation Methods for Deaf users: A Case Study on a mobile App

arXiv, 2025-07-30. Authors: A. E. Fuentes-Cortázar, A. Rivera-Hernández, J. R. Rojano-Cáceres

User Experience (UX) evaluation methods that are commonly used with hearing users may not be functional or effective for Deaf users. This is because these methods are primarily designed for users with hearing abilities, which can create limitations in the interaction, perception, and understanding of the methods for Deaf individuals. Furthermore, traditional UX evaluation approaches often fail to address the unique accessibility needs of Deaf users, resulting in an incomplete or biased assessment of their user experience. This research focused on analyzing a set of UX evaluation methods recommended for use with Deaf users, with the aim of validating the accessibility of each method through findings and limitations. The results indicate that, although these evaluation methods presented here are commonly recommended in the literature for use with Deaf users, they present various limitations that must be addressed in order to better adapt to the communication skills specific to the Deaf community. This research concludes that evaluation methods must be adapted to ensure accessible software evaluation for Deaf individuals, enabling the collection of data that accurately reflects their

Search results: Users

Lost in Simulation: LLM-Simulated Users are Unreliable Proxies for Human Users in Agentic Evaluations

Analysis of User Experience Evaluation Methods for Deaf users: A Case Study on a mobile App

User-Driven Value Alignment: Understanding Users' Perceptions and Strategies for Addressing Biased and Discriminatory Statements in AI Companions

Recommending to Strategic Users

Users volatility on Reddit and Voat

Balanced Co-Clustering of Users and Items for Embedding Table Compression in Recommender Systems

Multi-User Diversity with Random Number of Users

Accessible Capacity of Secondary Users

Dedicating Cellular Infrastructure for Aerial Users: Advantages and Potential Impact on Ground Users

Characterizing and Detecting Hateful Users on Twitter

Multi-User MultiWOZ: Task-Oriented Dialogues among Multiple Users

User-to-User Interference Mitigation in Dynamic TDD MIMO Systems with Multi-Antenna Users

Precoding Design for Multi-user MIMO Systems with Delay-Constrained and -Tolerant Users

Show Me My Users: A Dashboard Visualizing User Interaction Logs

Leave No User Behind: Towards Improving the Utility of Recommender Systems for Non-mainstream Users

Spectrum Sharing Scheme Between Cellular Users and Ad-hoc Device-to-Device Users

User Tracking in the Post-cookie Era: How Websites Bypass GDPR Consent to Track Users

Learning to Augment for Casual User Recommendation

Measuring Individual User Fairness with User Similarity and Effectiveness Disparity

Information Theoretic Secure Aggregation with User Dropouts