Agentic benchmarks increasingly rely on LLM-simulated users to scalably evaluate agent performance, yet the robustness, validity, and fairness of this approach remain unexamined. Through a user study with participants across the United States, India, Kenya, and Nigeria, we investigate whether LLM-simulated users serve as reliable proxies for real human users in evaluating agents on τ-Bench retail tasks. We find that user simulation lacks robustness, with agent success rates varying up to 9 percentage points across different user LLMs. Furthermore, evaluations using simulated users exhibit systematic miscalibration, underestimating agent performance on challenging tasks and overestimating it on moderately difficult ones. African American Vernacular English (AAVE) speakers experience consistently worse success rates and calibration errors than Standard American English (SAE) speakers, with disparities compounding significantly with age. We also find simulated users to be a differentially effective proxy for different populations, performing worst for AAVE and Indian English speakers. Additionally, simulated users introduce conversational artifacts and surface different failure patter
User Experience (UX) evaluation methods that are commonly used with hearing users may not be functional or effective for Deaf users. This is because these methods are primarily designed for users with hearing abilities, which can create limitations in the interaction, perception, and understanding of the methods for Deaf individuals. Furthermore, traditional UX evaluation approaches often fail to address the unique accessibility needs of Deaf users, resulting in an incomplete or biased assessment of their user experience. This research focused on analyzing a set of UX evaluation methods recommended for use with Deaf users, with the aim of validating the accessibility of each method through findings and limitations. The results indicate that, although these evaluation methods presented here are commonly recommended in the literature for use with Deaf users, they present various limitations that must be addressed in order to better adapt to the communication skills specific to the Deaf community. This research concludes that evaluation methods must be adapted to ensure accessible software evaluation for Deaf individuals, enabling the collection of data that accurately reflects their
Large language model-based AI companions are increasingly viewed by users as friends or romantic partners, leading to deep emotional bonds. However, they can generate biased, discriminatory, and harmful outputs. Recently, users are taking the initiative to address these harms and re-align AI companions. We introduce the concept of user-driven value alignment, where users actively identify, challenge, and attempt to correct AI outputs they perceive as harmful, aiming to guide the AI to better align with their values. We analyzed 77 social media posts about discriminatory AI statements and conducted semi-structured interviews with 20 experienced users. Our analysis revealed six common types of discriminatory statements perceived by users, how users make sense of those AI behaviors, and seven user-driven alignment strategies, such as gentle persuasion and anger expression. We discuss implications for supporting user-driven value alignment in future AI systems, where users and their communities have greater agency.
Recommendation systems are pervasive in the digital economy. An important assumption in many deployed systems is that user consumption reflects user preferences in a static sense: users consume the content they like with no other considerations in mind. However, as we document in a large-scale online survey, users do choose content strategically to influence the types of content they get recommended in the future. We model this user behavior as a two-stage noisy signalling game between the recommendation system and users: the recommendation system initially commits to a recommendation policy and presents content to the users during a cold start phase, which the users choose to strategically consume in order to affect the types of content they will be recommended in a recommendation phase. We show that in equilibrium, users engage in behaviors that accentuate their differences from users of different preference profiles. In addition, (statistical) minorities, out of fear of losing their exposure to minority content, may not consume content that is liked by mainstream users. We next propose three interventions that may improve recommendation quality (both on average and for minorities) when t
Social media platforms are like giant arenas where users can rely on different content and express their opinions through likes, comments, and shares. However, do users welcome different perspectives or only listen to their preferred narratives? This paper examines how users explore the digital space and allocate their attention among communities on two social networks, Voat and Reddit. By analysing a massive dataset of about 215 million comments posted by about 16 million users on Voat and Reddit in 2019 we find that most users tend to explore new communities at a decreasing rate, meaning they have a limited set of preferred groups they visit regularly. Moreover, we provide evidence that preferred communities of users tend to cover similar topics throughout the year. We also find that communities have a high turnover of users, meaning that users come and go frequently showing a high volatility that strongly departs from a null model simulating users' behaviour.
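The notion of exploring new communities at a decreasing rate can be illustrated with a minimal sketch. It assumes comments are available as `(user, community, timestamp)` tuples; this record layout, the function names, and the toy data are hypothetical and are not taken from the Voat or Reddit datasets or from the paper's actual analysis.

```python
from collections import defaultdict


def exploration_curve(comments):
    """Return, per user, the running count of distinct communities visited.

    `comments` is an iterable of (user, community, timestamp) tuples.
    This layout is an assumption made for illustration only.
    """
    seen = defaultdict(set)
    curves = defaultdict(list)
    # Process comments in chronological order.
    for user, community, _timestamp in sorted(comments, key=lambda c: c[2]):
        seen[user].add(community)
        curves[user].append(len(seen[user]))
    return dict(curves)


def new_community_rate(curve, window):
    """Fraction of comments in each consecutive window that reach a new community."""
    rates = []
    previous = 0
    for start in range(0, len(curve), window):
        chunk = curve[start:start + window]
        rates.append((chunk[-1] - previous) / len(chunk))
        previous = chunk[-1]
    return rates


# Toy activity for one hypothetical user.
toy = [
    ("u1", "a", 1), ("u1", "b", 2), ("u1", "c", 3),
    ("u1", "a", 4), ("u1", "b", 5), ("u1", "d", 6),
    ("u1", "a", 7), ("u1", "a", 8), ("u1", "b", 9),
]
curve = exploration_curve(toy)["u1"]
rates = new_community_rate(curve, window=3)
```

In this toy example the per-window rate falls from one window to the next, which is the pattern the abstract describes: a user settles into a limited set of preferred communities.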
Recommender systems have advanced markedly over the past decade by transforming each user/item into a dense embedding vector with deep learning models. At industrial scale, embedding tables constituted by such vectors of all users/items demand a vast amount of parameters and impose heavy compute and memory overhead during training and inference, hindering model deployment under resource constraints. Existing solutions towards embedding compression either suffer from severely compromised recommendation accuracy or incur considerable computational costs. To mitigate these issues, this paper presents BACO, a fast and effective framework for compressing embedding tables. Unlike traditional ID hashing, BACO is built on the idea of exploiting collaborative signals in user-item interactions for user and item groupings, such that similar users/items share the same embeddings in the codebook. Specifically, we formulate a balanced co-clustering objective that maximizes intra-cluster connectivity while enforcing cluster-volume balance, and unify canonical graph clustering techniques into the framework through rigorous theoretical analyses. To produce effective groupings while averting codeboo
Multi-user diversity is considered when the number of users in the system is random. The complete monotonicity of the error rate as a function of the (deterministic) number of users is established and it is proved that randomization of the number of users always leads to deterioration of average system performance at any average SNR. Further, using stochastic ordering theory, a framework for comparison of system performance for different user distributions is provided. For Poisson distributed users, the difference in error rate of the random and deterministic number of users cases is shown to asymptotically approach zero as the average number of users goes to infinity for any fixed average SNR. In contrast, for a finite average number of users and high SNR, it is found that randomization of the number of users deteriorates performance significantly, and the diversity order under fading is dominated by the smallest possible number of users. For Poisson distributed users communicating over Rayleigh faded channels, further closed-form results are provided for average error rate, and the asymptotic scaling law for ergodic capacity is also provided. Simulation results are provided to co
A new problem formulation is presented for the Gaussian interference channels (GIFC) with two pairs of users, which are distinguished as primary users and secondary users, respectively. The primary users employ a pair of encoder and decoder that were originally designed to satisfy a given error performance requirement under the assumption that no interference exists from other users. In the scenario when the secondary users attempt to access the same medium, we are interested in the maximum transmission rate (defined as {\em accessible capacity}) at which secondary users can communicate reliably without affecting the error performance requirement by the primary users under the constraint that the primary encoder (not the decoder) is kept unchanged. By modeling the primary encoder as a generalized trellis code (GTC), we are then able to treat the secondary link and the cross link from the secondary transmitter to the primary receiver as finite state channels (FSCs). Based on this, upper and lower bounds on the accessible capacity are derived. The impact of the error performance requirement by the primary users on the accessible capacity is analyzed by using the concept of interferen
A new generation of aerial vehicles is expected to be the next frontier for the transportation of people and goods, and aerial users may become as important as ground users in communication systems. To enhance the coverage of aerial users, appropriate adjustments should be made to the existing cellular networks, which mainly serve ground users through the down-tilted antennas of the terrestrial base stations (BSs). It is promising to up-tilt the antennas of a subset of BSs for serving aerial users through the mainlobe. With this motivation, in this work, we use tools from stochastic geometry to analyze the coverage performance of the adjusted cellular network (consisting of the up-tilted BSs and the down-tilted BSs). Correspondingly, we present exact and approximate expressions of the signal-to-interference ratio (SIR)-based coverage probabilities for users in the sky and on the ground, respectively. Numerical results verify the accuracy of the analysis and clarify the advantages of up-tilting BS antennas for the communication connectivity of aerial users without a potential adverse impact on the quality of service (QoS) of ground users.
Most current approaches to characterize and detect hate speech focus on \textit{content} posted in Online Social Networks (OSNs). They face difficulties in collecting and annotating hateful speech due to the incompleteness and noisiness of OSN text and the subjectivity of hate speech. These limitations are often addressed with constraints that oversimplify the problem, such as considering only tweets containing hate-related words. In this work we partially address these issues by shifting the focus towards \textit{users}. We develop and employ a robust methodology to collect and annotate hateful users which does not depend directly on lexicon and where the users are annotated given their entire profile. This results in a sample of Twitter's retweet graph containing $100,386$ users, out of which $4,972$ were annotated. We also collect the users who were banned in the three months that followed the data collection. We show that hateful users differ from normal ones in terms of their activity patterns, word usage, and network structure. We obtain similar results comparing the neighbors of hateful vs. neighbors of normal users and also suspended users vs. active users, increasing the robustn
While most task-oriented dialogues assume conversations between the agent and one user at a time, dialogue systems are increasingly expected to communicate with multiple users simultaneously who make decisions collaboratively. To facilitate development of such systems, we release the Multi-User MultiWOZ dataset: task-oriented dialogues among two users and one agent. To collect this dataset, each user utterance from MultiWOZ 2.2 was replaced with a small chat between two users that is semantically and pragmatically consistent with the original user utterance, thus resulting in the same dialogue state and system response. These dialogues reflect interesting dynamics of collaborative decision-making in task-oriented scenarios, e.g., social chatter and deliberation. Supported by this data, we propose the novel task of multi-user contextual query rewriting: to rewrite a task-oriented chat between two users as a concise task-oriented query that retains only task-relevant information and that is directly consumable by the dialogue system. We demonstrate that in multi-user dialogues, using predicted rewrites substantially improves dialogue state tracking without modifying existing dialogue
We propose a novel method for user-to-user interference (UUI) mitigation in dynamic time-division duplex multiple-input multiple-output communication systems with multi-antenna users. Specifically, we consider the downlink data transmission in the presence of UUI caused by a user that simultaneously transmits in uplink. Our method introduces an overhead for estimation of the user-to-user channels by transmitting pilots from the uplink user to the downlink users. Each downlink user obtains a channel estimate that is used to design a combining matrix for UUI mitigation. We analytically derive an achievable spectral efficiency for the downlink transmission in the presence of UUI with our mitigation technique. Through numerical simulations, we show that our method can significantly improve the spectral efficiency performance in cases of heavy UUI.
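As a rough illustration of how a channel estimate can be turned into a combiner that suppresses UUI, the sketch below uses a simple null-space projection. This is a swapped-in, zero-forcing style technique under strong assumptions (a single-antenna uplink user, so the user-to-user channel is a vector, and a noiseless estimate); it is not the combining matrix design or the spectral efficiency analysis of the paper. The names `null_projector`, `apply_matrix`, and `g_hat` are hypothetical.

```python
def null_projector(g):
    """Return P = I - g g^H / ||g||^2 as nested lists of complex numbers.

    `g` is the estimated user-to-user channel vector (assumed rank-one case).
    """
    norm_sq = sum(abs(x) ** 2 for x in g)
    n = len(g)
    return [
        [(1 if i == j else 0) - g[i] * g[j].conjugate() / norm_sq
         for j in range(n)]
        for i in range(n)
    ]


def apply_matrix(matrix, vector):
    """Multiply a matrix (nested lists) by a vector (list)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


# Hypothetical channel estimate at a downlink user with three antennas.
g_hat = [1 + 1j, 2 - 1j, 0.5j]
P = null_projector(g_hat)

# Interference arriving along the estimated channel is removed by P.
uplink_symbol = 0.7 - 0.2j
interference = [x * uplink_symbol for x in g_hat]
residual = apply_matrix(P, interference)
```

With a perfect estimate the residual interference is zero up to floating-point error. In practice the estimate is noisy and the projection also removes part of the desired signal, which is why the trade-off against pilot overhead matters.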
In both academia and industry, multi-user multiple-input multiple-output (MU-MIMO) techniques have shown enormous gains in spectral efficiency by exploiting spatial degrees of freedom. So far, an underlying assumption in most existing MU-MIMO designs has been that all users use infinite blocklength, so that they can achieve the Shannon capacity. This setup, however, is not suitable for delay-constrained users, whose blocklength tends to be finite. In this paper, we consider a heterogeneous setting in MU-MIMO systems where delay-constrained users and delay-tolerant users coexist, called a DCTU-MIMO network. To maximize the sum spectral efficiency in this system, we present the spectral efficiency for delay-tolerant users and provide a lower bound on the spectral efficiency for delay-constrained users. We consider an optimization problem that maximizes the sum spectral efficiency of delay-tolerant users while satisfying the latency constraint of delay-constrained users, and propose a generalized power iteration (GPI) precoding algorithm that finds a principal precoding vector. Furthermore, we extend the DCTU-MIMO network to the multiple-time-slot scenario and propose
This paper describes the design of a dashboard and analysis pipeline to monitor users of visualization tools in the wild. Our pipeline describes how to extract analytical KPIs from extensive log event data involving a mix of user types. The resulting three-page dashboard displays live KPIs, helping analysts understand users, detect exploratory behaviors, plan education interventions, and improve tool features. We propose this case study as a motivation to use the dashboard approach for a more `casual' monitoring of users and building carer mindsets for visualization tools.
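To make the idea of extracting KPIs from log events concrete, here is a minimal sketch. The event fields (`user`, `user_type`, `action`) and the two KPIs computed are assumptions chosen for illustration; the abstract does not specify the log schema or the KPIs shown on the dashboard.

```python
from collections import Counter, defaultdict


def compute_kpis(events):
    """Aggregate log events into simple per-user-type KPIs.

    Each event is a dict with hypothetical keys: user, user_type, action.
    """
    users_by_type = defaultdict(set)
    events_by_type = Counter()
    for event in events:
        users_by_type[event["user_type"]].add(event["user"])
        events_by_type[event["user_type"]] += 1
    return {
        user_type: {
            "active_users": len(users),
            "events_per_user": events_by_type[user_type] / len(users),
        }
        for user_type, users in users_by_type.items()
    }


# Toy log with two hypothetical user types.
sample_log = [
    {"user": "u1", "user_type": "analyst", "action": "filter"},
    {"user": "u1", "user_type": "analyst", "action": "export"},
    {"user": "u2", "user_type": "analyst", "action": "filter"},
    {"user": "u3", "user_type": "viewer", "action": "open"},
]
kpis = compute_kpis(sample_log)
```

Splitting the aggregation by user type keeps a mix of user populations from being averaged into a single figure, which matches the abstract's emphasis on logs involving several kinds of users.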
In a collaborative-filtering recommendation scenario, biases in the data will likely propagate in the learned recommendations. In this paper we focus on the so-called mainstream bias: the tendency of a recommender system to provide better recommendations to users who have a mainstream taste, as opposed to non-mainstream users. We propose NAECF, a conceptually simple but effective idea to address this bias. The idea consists of adding an autoencoder (AE) layer when learning user and item representations with text-based Convolutional Neural Networks. The AEs, one for the users and one for the items, serve as adversaries to the process of minimizing the rating prediction error when learning how to recommend. They enforce that the specific unique properties of all users and items are sufficiently well incorporated and preserved in the learned representations. These representations, extracted as the bottlenecks of the corresponding AEs, are expected to be less biased towards mainstream users, and to provide more balanced recommendation utility across all users. Our experimental results confirm these expectations, significantly improving the recommendations for non-mainstream users while
In an attempt to utilize spectrum resources more efficiently, protocols sharing licensed spectrum with unlicensed users are receiving increased attention. From the perspective of cellular networks, spectrum underutilization makes spatial reuse a feasible complement to existing standards. Interference management is a major component in designing these schemes as it is critical that licensed users maintain their expected quality of service. We develop a distributed dynamic spectrum protocol in which ad-hoc device-to-device users opportunistically access the spectrum actively in use by cellular users. First, channel gain estimates are used to set feasible transmit powers for device-to-device users that keep the interference they cause within the allowed interference temperature. Then, network information is distributed by route discovery packets in a random access manner to help establish either a single-hop or multi-hop route between two device-to-device users. We show that network information in the discovery packet can decrease the failure rate of the route discovery and reduce the number of necessary transmissions to find a route. Using the found route, we show that two device-to-
During the past few years, mostly as a result of the GDPR and the CCPA, websites have started to present users with cookie consent banners. These banners are web forms where users can state their preference and declare which cookies they would like to accept, if such an option exists. Although requesting consent before storing any identifiable information is a good start towards respecting user privacy, previous research has shown that websites do not always respect user choices. Furthermore, considering the ever decreasing reliance of trackers on cookies and the actions browser vendors take by blocking or restricting third-party cookies, we anticipate a world where stateless tracking emerges, either because trackers or websites do not use cookies, or because users simply refuse to accept any. In this paper, we explore whether websites use more persistent and sophisticated forms of tracking in order to track users who said they do not want cookies. Such forms of tracking include first-party ID leaking, ID synchronization, and browser fingerprinting. Our results suggest that websites do use such modern forms of tracking even before users had the opportunity to register their ch
Users who come to recommendation platforms are heterogeneous in activity levels. There usually exists a group of core users who visit the platform regularly and consume a large body of content upon each visit, while others are casual users who tend to visit the platform occasionally and consume less each time. As a result, consumption activities from core users often dominate the training data used for learning. As core users can exhibit different activity patterns from casual users, recommender systems trained on historical user activity data usually achieve much worse performance on casual users than core users. To bridge the gap, we propose a model-agnostic framework L2Aug to improve recommendations for casual users through data augmentation, without sacrificing core user experience. L2Aug is powered by a data augmentor that learns to generate augmented interaction sequences, in order to fine-tune and optimize the performance of the recommendation system for casual users. On four real-world public datasets, L2Aug outperforms other treatment methods and achieves the best sequential recommendation performance for both casual and core users. We also test L2Aug in an online simulati
Individual user fairness is commonly understood as treating similar users similarly. In Recommender Systems (RSs), several evaluation measures exist for quantifying individual user fairness. These measures evaluate fairness via either: (i) the disparity in RS effectiveness scores regardless of user similarity, or (ii) the disparity in items recommended to similar users regardless of item relevance. Both disparity in recommendation effectiveness and user similarity are very important in fairness, yet no existing individual user fairness measure simultaneously accounts for both. In brief, current user fairness evaluation measures implement a largely incomplete definition of fairness. To fill this gap, we present Pairwise User unFairness (PUF), a novel evaluation measure of individual user fairness that considers both effectiveness disparity and user similarity. PUF is the only measure that can express this important distinction. We empirically validate that PUF does this consistently across 4 datasets and 7 rankers, and robustly when varying user similarity or effectiveness. In contrast, all other measures are either almost insensitive to effectiveness disparity or completely insensi
In the robust secure aggregation problem, a server wishes to learn and only learn the sum of the inputs of a number of users while some users may drop out (i.e., may not respond). The identity of the dropped users is not known a priori and the server needs to securely recover the sum of the remaining surviving users. We consider the following minimal two-round model of secure aggregation. Over the first round, any set of no fewer than $U$ users out of $K$ users respond to the server and the server wants to learn the sum of the inputs of all responding users. The remaining users are viewed as dropped. Over the second round, any set of no fewer than $U$ users of the surviving users respond (i.e., dropouts are still possible over the second round) and from the information obtained from the surviving users over the two rounds, the server can decode the desired sum. The security constraint is that even if the server colludes with any $T$ users and the messages from the dropped users are received by the server (e.g., delayed packets), the server is not able to infer any additional information beyond the sum in the information theoretic sense. For this information theoretic secure aggrega