LLM-as-a-Judge has become the dominant evaluation paradigm for many natural language generation tasks, due to shortcomings of conventional metrics and high correlations with human judgment, albeit mostly in English. There are now attempts to extend LLM-as-a-Judge to multilingual settings, including low-resource languages. However, LLMs have limited proficiency in low-resource languages, and there is often no adequate human validation in these settings. To highlight the scope of the problem and current practices, we explore the use of LLM-as-a-Judge evaluators in ACL Anthology papers focusing on multilingual settings and low-resource languages across a diverse set of tasks. Out of 650 papers mentioning LLM-as-a-Judge, only 33 focus on low-resource or multilingual settings. Our in-depth analysis of these papers indicates inconsistent evaluation outcomes, a tendency to overtrust LLM judgments in multilingual settings, and widespread reliance on a single judge model per study. To help the NLP community further, we conclude with recommendations on how to use LLM-as-a-Judge in multilingual and low-resource settings.
Designing for sufficiency is one of many approaches that could foster more moderate and sustainable digital practices. Based on the Sustainable Information and Communication Technologies (ICT) and Human-Computer Interaction (HCI) literature, we identify five categories of environmental settings. However, our analysis of three mobile operating systems and nine representative applications shows an overall lack of environmental concerns in settings design, leading us to identify six pervasive anti-patterns. Environmental settings, where they exist, are set to the most intensive option by default. They are not presented as environmental settings, are not easily accessible, and offer little explanation of their impact. Instead, they encourage more intensive use. Based on these findings, we create a design workbook that explores design principles for environmental settings: presenting the environmental potential of settings; shifting to environmentally neutral states; previewing effects to encourage moderate use; rethinking defaults; facilitating settings access; and exploring more frugal settings. Building upon this workbook, we discuss how settings can tie individual behaviors to systemic factors.
We extend the restrictiveness measure of Fudenberg, Gao & Liang (2026) to functional and structural econometric settings using Gaussian process priors. We find that models evaluated over continuum domains appear more restrictive than when evaluated over finite sets of observations. We also extend the restrictiveness framework to structural models with endogeneity, instrumental variables, multiple equilibria, and nonparametric nuisance components. We explain why the choice of discrepancy function is a substantive modeling decision, and why the Rademacher complexity and GMM criterion functions are unsuitable as discrepancies. We further show that restrictiveness equals the normalized limit of the noise-free average-case learning curve. In applications to preferences under risk, and multinomial choice under exogenous and endogenous settings, we find that the same models exhibit uniformly higher restrictiveness when evaluated over continuum domains than based on their predictions on finite sets, and that moment restrictions from endogeneity substantially increase restrictiveness and alter model rankings.
Learning algorithms can be significantly improved by routing complex or uncertain inputs to specialized experts, balancing accuracy with computational cost. This approach, known as learning to defer, is essential in domains like natural language generation, medical diagnosis, and computer vision, where an effective deferral can reduce errors at low extra resource consumption. However, the two-stage learning to defer setting, which leverages existing predictors such as a collection of LLMs or other classifiers, often faces challenges due to an expert imbalance problem. This imbalance can lead to suboptimal performance, with deferral algorithms favoring the majority expert. We present a comprehensive study of two-stage learning to defer in expert imbalance settings. We cast the deferral loss optimization as a novel cost-sensitive learning problem over the input-expert domain. We derive new margin-based loss functions and guarantees tailored to this setting, and develop novel algorithms for cost-sensitive learning. Leveraging these results, we design principled deferral algorithms, MILD (Margin-based Imbalanced Learning to Defer), specifically suited for expert imbalance settings. Ext
This study evaluates the effectiveness of child-robot interactions with the HaKsh-E social robot in India, examining both individual and group interaction settings. The research centers on game-based interactions designed to teach hand hygiene to children aged 7-11. Using video analysis, rubric assessments, and post-study questionnaires, the study gathered data from 36 participants. Findings indicate that children in both settings developed positive perceptions of the robot's trustworthiness, closeness, and social support. The significant difference in interaction level scores suggests that group settings foster higher levels of interaction, potentially due to peer influence and collaborative dynamics. While both settings showed significant improvements in learning outcomes, the individual setting had more pronounced learning gains. This suggests that personal interactions with the robot might lead to deeper or more effective learning experiences. Consequently, this study concludes that individual interaction settings are more conducive to focused learning gains, while group settings enhance interaction and engagement.
Current evaluation frameworks and benchmarks for LLM-powered agents focus on text-chat-driven agents. These frameworks do not expose the user's persona to the agent and thus operate in a user-agnostic environment. Importantly, in the customer experience management domain, the agent's behaviour evolves as the agent learns about the user's personality. With the proliferation of real-time TTS and multi-modal language models, LLM-based agents will gradually become multi-modal. To this end, we propose the MM-tau-p$^2$ benchmark, with metrics for evaluating the robustness of multi-modal agents in a dual-control setting with and without adaptation to the user's persona, while also taking user inputs into account in the planning process to resolve a user query. In particular, our work shows that even with state-of-the-art frontier LLMs such as GPT-5 and GPT 4.1, introducing multi-modality into LLM-based agents raises additional considerations, which we measure using metrics such as multi-modal robustness and turn overhead. Overall, MM-tau-p$^2$ builds on our prior work FOCAL and provides a holistic, automated way of evaluating multi-modal agents by introducing 12 novel metrics. We also provide estimates of these metrics
Vision- and hearing-threatening diseases cause preventable disability, especially in resource-constrained settings (RCS) with few specialists and limited screening infrastructure. Large-scale AI-assisted screening and telehealth have the potential to expand early detection, but practical deployment is challenging in paper-based workflows, and little documented field experience exists to build upon. We provide insights on the challenges and ways forward, from development to adoption, for scalable AI-assisted telehealth and screening in such settings. Specifically, we find that iterative, interdisciplinary collaboration through early prototyping, shadow deployment, and continuous feedback is important for building shared understanding and reducing usability hurdles when transitioning from paper-based to AI-ready workflows. We find public datasets and AI models highly useful despite their poor performance due to domain shift. In addition, we find a need for automated AI-based image quality checks to capture gradable images for robust screening in high-volume camps. Our field learnings stress the importance of treating AI development and workflow digitization as an end-to-end, iterative co-design process. By documen
An original serious game prototype named 'Puzzlegram' is created for elderly players in group settings. Puzzlegram is designed to exercise memory, auditory interaction, and haptic response to visual signals through the use of music. Music is a key component of the game design: familiar music from the past provides meaningful context for the game mechanics, which facilitated the serious game design process. The discussion topics raised include the need to design serious games that foster meaningful interactions, and the need for a thorough framework for purposeful serious game design. A potential integration of artificial intelligence into Puzzlegram may add a novel dimension to its existing problem-solving task by adapting to varying states of cognitive function, for monitoring purposes, based on an individual's interaction with the game.
Spatio-temporal forecasting is crucial in transportation, logistics, and supply chain management. However, current methods struggle with large, complex datasets. We propose a dynamic, multi-modal approach that integrates the strengths of traditional forecasting methods and instruction tuning of small language models for time series trend analysis. This approach utilizes a mixture of experts (MoE) architecture with parameter-efficient fine-tuning (PEFT) methods, tailored for consumer hardware to scale up AI solutions in low resource settings while balancing performance and latency tradeoffs. Additionally, our approach leverages related past experiences for similar input time series to efficiently handle both intra-series and inter-series dependencies of non-stationary data with a time-then-space modeling approach, using grouped-query attention, while mitigating the limitations of traditional forecasting techniques in handling distributional shifts. Our approach models predictive uncertainty to improve decision-making. Our framework enables on-premises customization with reduced computational and memory demands, while maintaining inference speed and data privacy/security. Extensive e
We introduce a version of skein categories of surfaces which depends on a tensor ideal in a linear ribbon category, thereby extending the existing theory to the setting of non-semisimple TQFTs. We obtain modified notions of skein algebras of surfaces and skein modules of 3-cobordisms for non-semisimple ribbon categories. We prove that these skein categories built from ideals coincide with factorization homology, shedding new light on the similarities and differences between the semisimple and non-semisimple settings. The essential difference is the need to work with profunctors in the non-semisimple setting. Doing so produces a ``distinguished presheaf'' which plays the role of the distinguished object in skein categories in semisimple settings. As a consequence, we get a skein-theoretic description of factorization homology for a large class of balanced braided presentable categories, precisely all those which are expected to induce oriented categorified 3-TQFTs.
To secure computer infrastructure, we need to configure all security-relevant settings. We need security experts to identify security-relevant settings, but this process is time-consuming and expensive. Our proposed solution uses state-of-the-art natural language processing to classify settings as security-relevant based on their description. Our evaluation shows that our trained classifiers do not perform well enough to replace the human security experts but can help them classify the settings. By publishing our labeled data sets and the code of our trained model, we want to help security experts analyze configuration settings and enable further research in this area.
In compact settings, the convergence rate of the empirical optimal transport cost to its population value is well understood for a wide class of spaces and cost functions. In unbounded settings, however, hitherto available results require strong assumptions on the ground costs and the concentration of the involved measures. In this work, we pursue a decomposition-based approach to generalize the convergence rates found in compact spaces to unbounded settings under generic moment assumptions that are sharp up to an arbitrarily small $\varepsilon > 0$. Hallmark properties of empirical optimal transport on compact spaces, like the recently established adaptation to lower complexity, are shown to carry over to the unbounded case.
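As a small illustration of what an empirical optimal transport cost is, the sketch below covers only the one-dimensional case with equally many samples on each side and ground cost $|x - y|^p$, $p \ge 1$, where the optimal coupling simply matches sorted samples. This is an assumption made for illustration and is far narrower than the general spaces and costs treated in the abstract. The function names `empirical_ot_cost_1d` and `two_sample_costs` are hypothetical, and the Gaussian samples are just one example of an unbounded distribution.

```python
import random
from typing import List, Sequence


def empirical_ot_cost_1d(xs: Sequence[float], ys: Sequence[float], p: float = 2.0) -> float:
    """Empirical OT cost between two equal-size samples on the real line.

    Assumes ground cost |x - y|**p with p >= 1, for which matching the
    sorted samples is an optimal coupling.
    """
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("samples must be non-empty and of equal size")
    total = sum(abs(a - b) ** p for a, b in zip(sorted(xs), sorted(ys)))
    return total / len(xs)


def two_sample_costs(sizes: Sequence[int] = (100, 1000, 10000), seed: int = 0) -> List[float]:
    """Cost between two independent standard Gaussian samples for each size.

    Both samples come from the same (unbounded) distribution, so the
    population cost is zero and the returned values show the sampling error.
    """
    rng = random.Random(seed)
    costs = []
    for n in sizes:
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
        ys = [rng.gauss(0.0, 1.0) for _ in range(n)]
        costs.append(empirical_ot_cost_1d(xs, ys, p=2.0))
    return costs


# Tiny deterministic check: sorted matching pairs 0 with 1 and 2 with 3.
small = empirical_ot_cost_1d([0.0, 2.0], [3.0, 1.0], p=1.0)  # 1.0
```

Calling `two_sample_costs()` gives a rough feel for how the sampling error behaves as the sample size grows. It says nothing about the rates or moment conditions established in the work itself.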
Detecting out-of-distribution examples is important for safety-critical machine learning applications such as detecting novel biological phenomena and self-driving cars. However, existing research mainly focuses on simple small-scale settings. To set the stage for more realistic out-of-distribution detection, we depart from small-scale settings and explore large-scale multiclass and multi-label settings with high-resolution images and thousands of classes. To make future work in real-world settings possible, we create new benchmarks for three large-scale settings. To test ImageNet multiclass anomaly detectors, we introduce the Species dataset containing over 700,000 images and over a thousand anomalous species. We leverage ImageNet-21K to evaluate PASCAL VOC and COCO multilabel anomaly detectors. Third, we introduce a new benchmark for anomaly segmentation by introducing a segmentation benchmark with road anomalies. We conduct extensive experiments in these more realistic settings for out-of-distribution detection and find that a surprisingly simple detector based on the maximum logit outperforms prior methods in all the large-scale multi-class, multi-label, and segmentation tasks,
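A minimal sketch of a maximum-logit anomaly score, next to the common maximum-softmax-probability score for comparison. The function names and the sign convention (higher score means more anomalous) are assumptions for illustration, and the logits are made-up numbers rather than model outputs.

```python
import math
from typing import Sequence


def max_logit_score(logits: Sequence[float]) -> float:
    """Anomaly score from the largest unnormalized logit (negated)."""
    return -max(logits)


def max_softmax_score(logits: Sequence[float]) -> float:
    """Anomaly score from the largest softmax probability (negated)."""
    top = max(logits)
    # Subtracting the maximum keeps the exponentials numerically stable.
    denom = sum(math.exp(z - top) for z in logits)
    return -(1.0 / denom)


# Two hypothetical logit vectors that differ only by a constant shift.
confident = [10.0, 9.9, 0.0]
hesitant = [1.0, 0.9, -9.0]

# Softmax is shift-invariant, so both vectors get the same softmax score,
# while the max-logit score still separates them.
softmax_gap = abs(max_softmax_score(confident) - max_softmax_score(hesitant))
logit_gap = abs(max_logit_score(confident) - max_logit_score(hesitant))
```

The example is meant to show one property only: normalizing with softmax discards the overall magnitude of the logits, which the maximum logit retains.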
Expert decision-makers (DMs) in high-stakes AI-assisted decision-making (AIaDM) settings receive and reconcile recommendations from AI systems before making their final decisions. We identify distinct properties of these settings which are key to developing AIaDM models that effectively benefit team performance. First, DMs incur reconciliation costs from exerting decision-making resources (e.g., time and effort) when reconciling AI recommendations that contradict their own judgment. Second, DMs in AIaDM settings exhibit algorithm discretion behavior (ADB), i.e., an idiosyncratic tendency to imperfectly accept or reject algorithmic recommendations for any given decision task. The human's reconciliation costs and imperfect discretion behavior introduce the need to develop AI systems which (1) provide recommendations selectively, (2) leverage the human partner's ADB to maximize the team's decision accuracy while regularizing for reconciliation costs, and (3) are inherently interpretable. We refer to the task of developing AI to advise humans in AIaDM settings as learning to advise and we address this task by first introducing the AI-assisted Team (AIaT)-Learning Framework. We instanti
Recent developments to encrypt the Domain Name System (DNS) have resulted in major browser and operating system vendors deploying encrypted DNS functionality, often enabling various configurations and settings by default. In many cases, default encrypted DNS settings have implications for performance and privacy; for example, Firefox's default DNS setting sends all of a user's DNS queries to Cloudflare, potentially introducing new privacy vulnerabilities. In this paper, we confirm that most users are unaware of these developments -- with respect to the rollout of these new technologies, the changes in default settings, and the ability to customize encrypted DNS configuration to balance user preferences between privacy and performance. Our findings suggest several important implications for the designers of interfaces for encrypted DNS functionality in both browsers and operating systems, to help improve user awareness concerning these settings, and to ensure that users retain the ability to make choices that allow them to balance tradeoffs concerning DNS privacy and performance.
We consider retarded settings in the context of a Bell-type experiment. The retarded setting is defined as the value the setting would have taken were it not for some external intervention (for example, by a human). We derive retarded Bell inequalities that explicitly take into account the retarded settings. These inequalities are not violated by Quantum Theory (or any other theory) when the retarded settings are equal to the actual settings. We construct a simple model that reproduces Quantum Theory when the retarded and actual settings are equal, but violates it when they are not. We discuss using humans to choose the settings in this type of experiment and the implications of a violation of Quantum Theory (in agreement with the retarded Bell inequalities) in this context.
Standard training datasets for deep learning often contain objects in common settings (e.g., "a horse on grass" or "a ship in water") since they are usually collected by randomly scraping the web. Uncommon and rare settings (e.g., "a plane on water", "a car in snowy weather") are thus severely under-represented in the training data. This can lead to an undesirable bias in model predictions towards common settings and create a false sense of accuracy. In this paper, we introduce FOCUS (Familiar Objects in Common and Uncommon Settings), a dataset for stress-testing the generalization power of deep image classifiers. By leveraging the power of modern search engines, we deliberately gather data containing objects in common and uncommon settings in a wide range of locations, weather conditions, and time of day. We present a detailed analysis of the performance of various popular image classifiers on our dataset and demonstrate a clear drop in performance when classifying images in uncommon settings. By analyzing deep features of these models, we show that such errors can be due to the use of spurious features in model predictions. We believe that our dataset will aid researchers in unde
The estimation of causal effects using quasi-experiments often relies on the use of unusual or serendipitous sources of exogenous variation. When the goal is estimating the same causal effects across many different settings, the same unusual exogenous variation often does not exist in all settings, and the only available form of identification is selection-on-observables, which relies on a conditional independence assumption. Partial identification is especially valuable in this context, as it allows conditional independence to not hold perfectly. This paper proposes a method that sharpens the jointly identified set of causal effects across many settings by making use of unobserved relationships between omitted variable biases across settings.
Set partitions are arrangements of distinct objects into groups. The problem of listing all set partitions arises in a variety of settings, in particular in combinatorial optimization tasks. After a brief review, we give practical approximate formulas for determining the number of set partitions, both for small and large set sizes. Several algorithms for enumerating all set partitions are reviewed and benchmarked. The algorithm of Djokic et al. is recommended for practical use.
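For readers unfamiliar with the objects involved, here is a minimal sketch of counting and listing set partitions. It is not the algorithm of Djokic et al. recommended above, and it uses the exact count rather than the approximate formulas. It is a plain recursive enumeration (each item either joins an existing block or opens a new one) alongside the exact count from the Bell triangle. The function names `bell_number` and `set_partitions` are hypothetical.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def bell_number(n: int) -> int:
    """Exact number of set partitions of an n-element set (Bell triangle)."""
    row = [1]
    for _ in range(n):
        # Each new row starts with the last entry of the previous row.
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[0]


def set_partitions(items: Sequence[T]) -> Iterator[List[List[T]]]:
    """Yield every partition of `items` as a list of blocks."""

    def place(index: int, blocks: List[List[T]]) -> Iterator[List[List[T]]]:
        if index == len(items):
            # Copy the blocks, since they are mutated during backtracking.
            yield [list(block) for block in blocks]
            return
        item = items[index]
        # Option 1: add the item to one of the existing blocks.
        for block in blocks:
            block.append(item)
            yield from place(index + 1, blocks)
            block.pop()
        # Option 2: open a new block containing only this item.
        blocks.append([item])
        yield from place(index + 1, blocks)
        blocks.pop()

    yield from place(0, [])


partitions_of_three = list(set_partitions(["a", "b", "c"]))
# The three-element set has 5 partitions, matching bell_number(3).
```

The count grows quickly (`bell_number(10)` is 115975), which is why the choice of enumeration algorithm matters in practice.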
We introduce the Graph Set Transformer (GST), a neural network architecture for learning on sets of graphs, designed for tasks in which per-element predictions depend on set-wide context as well as local structure. Existing architectures, including DeepSets and SetTransformer, require pre-encoded graph embeddings from a separate GNN, creating a bottleneck between feature extraction and set-level contextualisation. In contrast, GST interleaves node-level feature propagation and cross-graph contextual modelling at every layer, fusing the two levels of information through a gating mechanism. We evaluate GST on a controlled synthetic suite designed to isolate set-conditional structural reasoning and on three real-data benchmarks spanning per-atom reaction-centre identification, reaction yield prediction, and image classification. Under matched parameter budgets, GST performs better than the baselines across these settings. An architectural ablation strongly suggests that the interleaving of local and set context contributes substantially to this advantage.