Leveraging external tools is a key feature for modern Language Models (LMs) to expand their capabilities and integrate them into existing systems. However, existing benchmarks primarily focus on the accuracy of tool calling -- whether the correct tool is called with the correct parameters -- and less on evaluating when LMs should (not) call tools. We develop a new benchmark, When2Call, which evaluates tool-calling decision-making: when to generate a tool call, when to ask follow-up questions and when to admit the question can't be answered with the tools provided. We find that state-of-the-art tool-calling LMs show significant room for improvement on When2Call, indicating the importance of this benchmark. We also develop a training set for When2Call and leverage the multiple-choice nature of the benchmark to develop a preference optimization training regime, which shows considerably more improvement than traditional fine-tuning. We release the benchmark and training data as well as evaluation scripts at https://github.com/NVIDIA/When2Call.
AI systems can generate outputs at scale, but most outputs require human approval before release. This creates a bottleneck: humans cannot keep pace with AI-generated volume. A natural response is to insert an LLM-judge that screens outputs before they reach humans, filtering errors and amplifying effective review capacity. But judges are imperfect. False rejections send correct outputs back for unnecessary rework; false acceptances consume judge capacity without relieving humans. When should outputs be routed through the judge, and when should they bypass it directly to human review? We model this workflow as a queueing network with three resource pools and use a fluid approximation to characterize optimal judge allocation. The analysis reveals that optimal allocation depends critically on which resource is the current bottleneck: screening amplifies human capacity when reviewers are scarce, yet generates a rework trap that crowds out new production when workers are stretched thin. For heterogeneous task classes with different error profiles, optimal priority can reverse across operating regimes, and classes with complementary error structures can be mixed to achieve throughput th
We document a fundamental paradox in AI transparency: explanations improve decisions when algorithms are correct but systematically worsen them when algorithms err. In an experiment with 257 medical students making 3,855 diagnostic decisions, we find explanations increase accuracy by 6.3 percentage points when the AI is correct (73% of cases) but decrease it by 4.9 points when it is incorrect (27% of cases). This asymmetry arises because modern AI systems generate equally persuasive explanations regardless of recommendation quality: physicians cannot distinguish helpful from misleading guidance. We show physicians treat explained AI as 15.2 percentage points more accurate than it really is, with over-reliance persisting even for erroneous recommendations. Competent physicians with appropriate uncertainty suffer most from the AI transparency paradox (-12.4pp when the AI errs), while overconfident novices benefit most (+9.9pp net). Welfare analysis reveals that selective transparency generates \$2.59 billion in annual healthcare value, 43% more than the \$1.82 billion from mandated universal transparency.
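Taken together, the reported figures imply a positive average effect that conceals the asymmetry. A quick back-of-the-envelope check (assuming the case mix is independent of the explanation effect):

```python
# Expected net effect of showing AI explanations, using the figures
# reported in the abstract (illustrative averaging only).
p_correct, p_incorrect = 0.73, 0.27   # fraction of cases where the AI is right/wrong
gain_pp, loss_pp = 6.3, -4.9          # percentage-point accuracy change in each case

net_pp = p_correct * gain_pp + p_incorrect * loss_pp
print(f"net accuracy change: {net_pp:+.2f} pp")  # positive on average,
# which is exactly why blanket transparency can still hide the harm on errors
```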
Continual learning systems must strike a balance between plasticity, the ability to acquire new knowledge, and stability, the preservation of previously learned representations. This stability-plasticity dilemma affects how representations can be reused across tasks: shared structure enables transfer when tasks are similar but may also induce interference when new learning disrupts existing representations. However, it remains unclear when and why structural separation influences this trade-off. In this study, we examine how network architecture, task similarity, and representational dimensionality jointly shape learning in a sequential task paradigm inspired by transfer-interference studies. We compare a task-partitioned modular recurrent network with a single-module baseline by systematically varying task similarity (low, medium, high) and the scale of weight initialization, which induces different learning regimes that we empirically characterize through the effective dimensionality of the learned representations. We find that architecture has minimal impact in high-dimensional regimes where representations are sufficiently unconstrained to accommodate multiple tasks without strong inte
We study a long-run persuasion problem where a long-lived Sender repeatedly interacts with a sequence of short-lived Receivers who may adopt a misspecified model for belief updating. The Sender commits to a stationary information structure, but suspicious Receivers compare it to an uninformative alternative and may switch based on the Bayes factor rule. We characterize when the one-shot Bayesian Persuasion-optimal (BP-optimal) structure remains optimal in the long run despite this switching risk. In particular, when Receivers cannot infer the state from the Sender's preferred action, they never switch, and the BP-optimal structure maximizes the Sender's lifetime utility. In contrast, when such inference is possible, full disclosure may outperform BP-optimal. Our findings highlight the strategic challenges of information design when the Receivers' interpretation of signals evolves over time.
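The Bayes-factor comparison that drives switching can be sketched as follows; the binary signal distributions, the observation history, and the threshold logic are hypothetical placeholders, not the paper's model:

```python
import math

def log_bayes_factor(signals, p_informative, p_uninformative):
    """Log Bayes factor comparing the Sender's committed (informative)
    signal distribution against an uninformative alternative.
    `signals` is a list of observed signal realizations; each
    distribution maps a signal to its probability."""
    lbf = 0.0
    for s in signals:
        lbf += math.log(p_informative[s]) - math.log(p_uninformative[s])
    return lbf

# Hypothetical binary-signal example: a suspicious Receiver abandons the
# Sender's structure once the log Bayes factor falls below a threshold.
p_inf = {"high": 0.8, "low": 0.2}
p_unif = {"high": 0.5, "low": 0.5}
history = ["high", "low", "low", "low"]
lbf = log_bayes_factor(history, p_inf, p_unif)
threshold = 0.0
switched = lbf < threshold  # evidence currently favors the uninformative model
```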
When a model knows when it does not know, many possibilities emerge. The first question is how to enable a model to recognize that it does not know. A promising approach is to use confidence, computed from the model's internal signals, to reflect its ignorance. Prior work in specific domains has shown that calibration can provide reliable confidence estimates. In this work, we propose a simple, effective, and universal training-free method that applies to both vision and language models, performing model calibration, cascading, and data cleaning to better exploit a model's ability to recognize when it does not know. We first highlight two key empirical observations: higher confidence corresponds to higher accuracy within a single model, and models calibrated on the validation set remain calibrated on a held-out test set. These findings empirically establish the reliability and comparability of calibrated confidence. Building on this, we introduce two applications: (1) model cascading with calibrated advantage routing and (2) data cleaning based on model ensemble. Using the routing signal derived from the comparability of calibrated confidences, we cascade large and small models to
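The calibrate-then-cascade recipe can be sketched as follows; temperature scaling and max-probability confidence are standard training-free choices, but the function names and threshold here are illustrative, not the paper's exact routing rule:

```python
import numpy as np

def temperature_scale(logits, T):
    """Calibrate logits with a single temperature T (fit on a
    validation set; a standard training-free calibration)."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    p = np.exp(z)
    return p / p.sum(axis=-1, keepdims=True)

def cascade(small_logits, large_logits, T_small, T_large, threshold):
    """Route to the large model only when the small model's calibrated
    confidence falls below `threshold` (illustrative routing signal)."""
    p_small = temperature_scale(small_logits, T_small)
    conf = p_small.max(axis=-1)
    use_large = conf < threshold
    p_large = temperature_scale(large_logits, T_large)
    preds = np.where(use_large, p_large.argmax(-1), p_small.argmax(-1))
    return preds, use_large
```

In a deployed cascade, the large model would only be invoked on the routed subset; here both are evaluated for simplicity.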
Appropriate decisions depend on information gathered beforehand, yet such information is often obtained through intermediaries with biased preferences. Motivated by settings such as testing and recertification in organ transplantation, we study the problem faced by a decision-maker who can only access costly information through an agent with misaligned preferences. In a dynamic framework with exogenous decision timing, we ask how requests for verifiable information (evidence) should be scheduled and their implications for the quality of attained choices. When the agent's incentives are ignored, evidence requests do not condition on previously reported information. However, such policies may be susceptible to strategic manipulation by the agent. We show that, in these cases, optimal requests should be biased: additional evidence is more likely to be sought when previous reports favor the agent's preferred outcome.
This article is a continuation of [6], where a classification was given of when the space of minimal prime subgroups of a given lattice-ordered group, equipped with the inverse topology, has a clopen $\pi$-base. For nice $\ell$-groups (e.g., W-objects), this occurs precisely when the space of maximal $d$-subgroups (under the hull-kernel topology) has a clopen $\pi$-base. It occurred to us that there is presently no classification of when the space of maximal $d$-subgroups of a W-object is zero-dimensional, except in the case of $C(X)$, the ring of real-valued continuous functions on a topological space $X$, considered in [5].
When reading books, humans focus primarily on the current page, flipping back to recap prior context only when necessary. Similarly, we demonstrate that Large Language Models (LLMs) can learn to dynamically determine when to attend to global context. We propose All-or-Here Attention (AHA), which utilizes a binary router per attention head to dynamically toggle between full attention and local sliding window attention for each token. Our results indicate that with a window size of 256 tokens, up to 93\% of the original full attention operations can be replaced by sliding window attention without performance loss. Furthermore, by evaluating AHA across various window sizes, we identify a long-tail distribution in context dependency, where the necessity for full attention decays rapidly as the local window expands. By decoupling local processing from global access, AHA reveals that full attention is largely redundant, and that efficient inference requires only on-demand access to the global context.
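A minimal sketch of a single AHA-style head follows. Assumptions: the router decision is supplied externally per head, whereas the paper learns a binary router and toggles per token; names and the toy setup are illustrative:

```python
import numpy as np

def sliding_window_mask(n, w):
    """Causal sliding-window mask: token i attends to positions [i-w+1, i]."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (j > i - w)

def aha_head(q, k, v, use_full, window=256):
    """One attention head under All-or-Here Attention (AHA): a binary
    decision toggles between full causal attention and a local sliding
    window. Sketch only; the learned per-token router is omitted."""
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    causal = np.tril(np.ones((n, n), dtype=bool))
    mask = causal if use_full else sliding_window_mask(n, window)
    scores = np.where(mask, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```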
Causal forests estimate how treatment effects vary across individuals, guiding personalized interventions in areas like marketing, operations, and public policy. A standard modeling practice with this method is honest estimation: dividing the data into two samples, one to define subgroups and another to estimate treatment effects within them. This is intended to reduce overfitting and is the default in many software packages. But is it the right choice? In this paper, we show that honest estimation can reduce the accuracy of individual-level treatment effect estimates, especially when there are substantial differences in how individuals respond to treatment, and the data is rich enough to uncover those differences. The core issue is a classic bias-variance trade-off: honesty lowers the risk of overfitting but increases the risk of underfitting, because it limits the data available to detect and model heterogeneity. Across 7,500 benchmark datasets, we find that the cost of using honesty by default can be as high as requiring 25% more data to match the performance of models trained without it. We argue that honesty is best understood as a form of regularization and its use should be
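The honest split described above can be sketched for a single tree; `fit_tree` and the leaf-effect construction below are illustrative placeholders, not the causal-forest implementation:

```python
import numpy as np

def honest_tree_effects(X, w, y, fit_tree, honest=True, seed=0):
    """Honest estimation for one causal tree: one half of the data
    defines the leaves (subgroups), the other half estimates the
    treatment effect within each leaf. With honest=False, the same
    data does both jobs (the 'adaptive' alternative).
    Assumes each leaf contains treated and control units in the
    estimation sample."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    if honest:
        split_idx, est_idx = idx[: len(y) // 2], idx[len(y) // 2:]
    else:
        split_idx = est_idx = idx  # adaptive: reuse all data for both steps
    leaf_of = fit_tree(X[split_idx], w[split_idx], y[split_idx])
    leaves = leaf_of(X[est_idx])
    effects = {}
    for leaf in np.unique(leaves):
        m = leaves == leaf
        treated = y[est_idx][m & (w[est_idx] == 1)]
        control = y[est_idx][m & (w[est_idx] == 0)]
        effects[leaf] = treated.mean() - control.mean()
    return leaf_of, effects
```

The trade-off in the abstract is visible in the code: with `honest=True`, only half the sample is available at each step, lowering overfitting risk but also the resolution for detecting heterogeneity.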
Centrality metrics aim to identify the most relevant nodes in a network. The literature offers a broad set of metrics, measuring either local or global centrality characteristics. Nevertheless, when networks exhibit a high spectral gap, the usual global centrality measures typically add little information beyond the degree, i.e., the simplest local metric. To extract new information from this class of networks, we propose the use of the GENeralized Economic comPlexitY index (GENEPY). Despite its original definition within the economic field, the GENEPY can be readily applied and interpreted on a wide range of networks characterized by a high spectral gap, including monopartite and bipartite systems. Tests on synthetic and real-world networks show that the GENEPY can shed new light on node centrality, carrying information generally poorly correlated with a node's number of direct connections (its degree).
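As a small helper for identifying the regime discussed above, a normalized spectral gap of an adjacency matrix can be computed directly; this is a generic sketch, not part of the GENEPY definition:

```python
import numpy as np

def spectral_gap(A):
    """Normalized spectral gap of an adjacency matrix: the relative
    difference between the two largest eigenvalue magnitudes. Networks
    with a gap near 1 are the high-spectral-gap regime where global
    centrality measures tend to collapse onto the degree."""
    ev = np.sort(np.abs(np.linalg.eigvals(A)))[::-1]
    return (ev[0] - ev[1]) / ev[0]

# Complete graph K4: eigenvalues {3, -1, -1, -1}, so the gap is 2/3.
A = np.ones((4, 4)) - np.eye(4)
```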
For a one dimensional analytically unramified Cohen-Macaulay local ring $R$, the blowup algebra of the canonical ideal is a module finite birational extension. The conductor of this extension always contains the conductor of $R$. We study the case when there is equality. This is the case where $R$ is far from being almost Gorenstein. We study this property within the landscape of numerical semigroup rings and local Arf rings.
Robots often localize to reduce navigation errors and to facilitate downstream, high-level tasks. However, a robot may want to localize only selectively when localization is costly (as with resource-constrained robots) or inefficient (for example, submersibles that need to surface), especially when navigating environments with variable numbers of hazards such as obstacles and shipping lanes. In this study, we propose a method that helps a robot determine ``when to localize'' so as to 1) minimize such actions and 2) not exceed a given probability of failure (such as surfacing within high-traffic shipping lanes). We formulate our method as a Constrained Partially Observable Markov Decision Process and use the Cost-Constrained POMCP solver to plan the robot's actions. The solver simulates failure probabilities to decide whether the robot moves toward its goal or localizes to prevent failure. We performed numerical experiments with multiple baselines.
The ability to predict the attention of expert pathologists could lead to decision support systems for better pathology training. We developed methods to predict the spatio-temporal (where and when) movements of pathologists' attention as they grade whole slide images (WSIs) of prostate cancer. We characterize a pathologist's attention trajectory by their x, y, and m (magnification) movements of a viewport as they navigate WSIs using a digital microscope. This information was obtained from 43 pathologists across 123 WSIs, and we consider the task of predicting the pathologist attention scanpaths constructed from the viewport centers. We introduce a fixation extraction algorithm that simplifies an attention trajectory by extracting fixations in the pathologist's viewing while preserving semantic information, and we use these pre-processed data to train and test a two-stage model to predict the dynamic (scanpath) allocation of attention during WSI reading via intermediate attention heatmap prediction. In the first stage, a transformer-based sub-network predicts the attention heatmaps (static attention) across different magnifications. In the second stage, we predict the attention sca
Ensuring the reliability and safety of automated decision-making is crucial. It is well-known that data distribution shifts in machine learning can produce unreliable outcomes. This paper proposes a new approach for measuring the reliability of predictions under distribution shifts. We analyze how the outputs of a trained neural network change using clustering to measure distances between outputs and class centroids. We propose this distance as a metric to evaluate the confidence of predictions under distribution shifts. We assign each prediction to a cluster with centroid representing the mean softmax output for all correct predictions of a given class. We then define a safety threshold for a class as the smallest distance from an incorrect prediction to the given class centroid. We evaluate the approach on the MNIST and CIFAR-10 datasets using a Convolutional Neural Network and a Vision Transformer, respectively. The results show that our approach is consistent across these data sets and network models, and indicate that the proposed metric can offer an efficient way of determining when automated predictions are acceptable and when they should be deferred to human operators given
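The centroid-and-threshold construction described above reads directly as code; the following is a sketch of that procedure with hypothetical toy inputs:

```python
import numpy as np

def fit_safety_thresholds(softmax_out, preds, labels):
    """Per-class centroids of softmax outputs over correct predictions,
    and a per-class safety threshold: the smallest distance from any
    *incorrect* prediction of that class to the class centroid."""
    classes = np.unique(labels)
    centroids, thresholds = {}, {}
    for c in classes:
        correct = softmax_out[(preds == c) & (labels == c)]
        centroids[c] = correct.mean(axis=0)
        wrong = softmax_out[(preds == c) & (labels != c)]
        d = np.linalg.norm(wrong - centroids[c], axis=1)
        thresholds[c] = d.min() if len(d) else np.inf
    return centroids, thresholds

def accept(softmax_vec, centroids, thresholds):
    """Accept a prediction only if it lies closer to its class centroid
    than the nearest known incorrect prediction; otherwise defer to a
    human operator."""
    c = int(np.argmax(softmax_vec))
    return bool(np.linalg.norm(softmax_vec - centroids[c]) < thresholds[c])
```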
Many biological systems are governed by difference equations and exhibit discrete-time dynamics. Examples include the size of a population when generations are non-overlapping, and the incidence of a disease when infections are recorded at fixed intervals. For discrete-time systems lacking exact solutions, continuous-time approximations are frequently employed when small changes occur between discrete time steps. Here, we present an approach motivated by exactly soluble discrete-time problems. We show that such systems have continuous-time descriptions (governed by differential equations) whose solutions precisely agree, at the discrete times, with the discrete-time solutions, irrespective of the size of changes that occur. For discrete-time systems lacking exact solutions, we develop approximate continuous-time models that can, to high accuracy, capture rapid growth and decay. Our approach employs mappings between difference and differential equations, generating functional solutions that exactly or closely preserve the original discrete-time behaviour. It uncovers fundamental structural parallels and also distinctions between the difference equation and the `equivalent' differential equation.
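An exactly soluble instance of such a mapping, offered as a standard example rather than one of the paper's specific systems: geometric growth $x_{t+1} = r x_t$ and the ODE $dx/dt = (\ln r)\,x$ agree exactly at every integer time, however large $r$ is.

```python
import math

def discrete(x0, r, t):
    """Iterate the difference equation x_{t+1} = r * x_t for t steps."""
    x = x0
    for _ in range(t):
        x = r * x
    return x

def continuous(x0, r, t):
    """Solution of dx/dt = ln(r) * x, i.e. x0 * exp(t * ln r) = x0 * r**t."""
    return x0 * math.exp(t * math.log(r))

# The two descriptions coincide at the discrete times even though r = 3
# is far from the "small changes per step" regime.
for t in range(6):
    assert math.isclose(discrete(2.0, 3.0, t), continuous(2.0, 3.0, t))
```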
The belief that numbers offer a single, objective description of reality overlooks a crucial truth: data does not speak for itself. Every dataset results from choices (what to measure, how, when, and with whom) which inevitably reflect implicit, and sometimes ideological, assumptions about what is worth quantifying. Moreover, in any analysis, what remains unmeasured can be just as significant as what is captured. When a key variable is omitted, whether by neglect, design, or ignorance, it can distort the observed relationships between other variables. This phenomenon, known as omitted variable bias, may produce misleading correlations or conceal genuine effects. In some cases, accounting for this hidden factor can completely overturn the conclusions drawn from a superficial analysis. This is precisely the mechanism behind Simpson's paradox.
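The reversal is easy to exhibit numerically. The figures below are the classic kidney-stone success counts often used to illustrate Simpson's paradox, not data from this article:

```python
def rate(success, total):
    return success / total

# Treatment A vs B, stratified by the omitted variable (stone size).
A_small, B_small = (81, 87), (234, 270)
A_large, B_large = (192, 263), (55, 80)

# Within each stratum, A outperforms B...
assert rate(*A_small) > rate(*B_small)
assert rate(*A_large) > rate(*B_large)

# ...yet pooling over the omitted variable reverses the ranking.
A_all = (A_small[0] + A_large[0], A_small[1] + A_large[1])
B_all = (B_small[0] + B_large[0], B_small[1] + B_large[1])
assert rate(*A_all) < rate(*B_all)
```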
Language models (LMs) may appear insensitive to word order changes in natural language understanding (NLU) tasks. In this paper, we propose that linguistic redundancy can explain this phenomenon, whereby word order and other linguistic cues such as case markers provide overlapping and thus redundant information. Our hypothesis is that models exhibit insensitivity to word order when the order provides redundant information, and that the degree of insensitivity varies across tasks. We quantify how informative word order is using mutual information (MI) between unscrambled and scrambled sentences. Our results show that the less informative word order is, the more consistent the model's predictions are between unscrambled and scrambled sentences. We also find that the effect varies across tasks: for some tasks, like SST-2, LMs' predictions are almost always consistent with the original ones even when the Pointwise-MI (PMI) changes, while for others, like RTE, the consistency is near random when the PMI is lower, i.e., when word order is highly informative.
We argue that on-shell excitations with large negative energies are created rapidly when the string coupling increases with time. This does not indicate an inconsistency in string theory since the negative energy on-shell excitation is always entangled with an on-shell excitation with a positive energy. The total energy of this energy-EPR state vanishes. We discuss the reason the energy-EPR states appear in string theory and the role they might play in black hole physics.
We characterize, in terms of the defining graph, when a twisted right-angled Artin group (a group whose only relations among pairs of generators are either commuting or Klein-bottle type relations) is left-orderable.