20 results found
The all-terminal reliability of a graph $G$ is the probability that $G$ remains connected when each edge fails independently with probability $p$. For fixed $n$ and $m$, the uniformly most reliable problem asks which graph with $n$ vertices and $m$ edges maximizes reliability for all $p \in [0,1]$. Although such graphs do not always exist, optimal graphs in the regime $p \to 0$ always do and are determined by the structure of their minimal cut sets. We establish a structural characterization of graphs that are most reliable near $p=0$. Our results partially resolve a conjecture of Bourel et al., showing that, under suitable conditions, regular graphs with maximal girth are optimal. Extending this analysis to graphs with fixed redundancy $r=m-(n-1)$ and sufficiently large $n$, we show that the most reliable graphs are obtained by subdividing the most reliable cubic graphs with $2(r-1)$ vertices. The general conjecture remains open. Unlike previous results, which resolved only small redundancy cases or very dense regimes, our approach yields a substantial extension of the known range. We determine the unique cubic candidates for uniformly most reliable graphs for all redundancy level
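The definition of all-terminal reliability in the abstract above can be checked on small graphs by brute force. The following is a minimal sketch, not the paper's method: it enumerates every subset of surviving edges, which is only feasible for a handful of edges. The function names and the edge-list representation (vertices labelled `0..n-1`) are assumptions made for illustration.

```python
from itertools import combinations


def is_connected(n, edges):
    """Return True if the graph on vertices 0..n-1 with these edges is connected."""
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    components = n
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            components -= 1
    return components == 1


def all_terminal_reliability(n, edges, p):
    """Probability that the graph stays connected when each edge
    fails independently with probability p (exhaustive enumeration)."""
    m = len(edges)
    total = 0.0
    for k in range(m + 1):
        for surviving in combinations(edges, k):
            if is_connected(n, surviving):
                # k edges survive, m - k edges fail.
                total += (1 - p) ** k * p ** (m - k)
    return total


triangle = [(0, 1), (1, 2), (0, 2)]
path = [(0, 1), (1, 2)]
print(all_terminal_reliability(3, triangle, 0.1))
print(all_terminal_reliability(3, path, 0.1))
```

For the triangle, the graph stays connected when at least two edges survive, so the value is $(1-p)^3 + 3p(1-p)^2$; for the two-edge path both edges must survive, giving $(1-p)^2$.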
RL-based agentic search enables LLMs to solve complex questions via dynamic planning and external search. While this approach significantly enhances accuracy with agent policies optimized via large-scale reinforcement learning, we identify a critical gap in reliability: these agents fail to recognize their reasoning boundaries and rarely admit ``I DON'T KNOW'' (IDK) even when evidence is insufficient or reasoning reaches its limit. The lack of reliability often leads to plausible but unreliable answers, introducing significant risks in many real-world scenarios. To this end, we propose Boundary-Aware Policy Optimization (BAPO), a novel RL framework designed to cultivate reliable boundary awareness without compromising accuracy. BAPO introduces two key components: (i) a group-based boundary-aware reward that encourages an IDK response only when the reasoning reaches its limit, and (ii) an adaptive reward modulator that strategically suspends this reward during early exploration, preventing the model from exploiting IDK as a shortcut. Extensive experiments on four benchmarks demonstrate that BAPO substantially enhances the overall reliability of agentic search.
This paper addresses the topic of robustness under sensing noise, ambiguous instructions, and human-robot interaction. We take a radically different tack to the issue of reliable embodied AI: instead of focusing on formal verification methods aimed at achieving model predictability and robustness, we emphasise the dynamic, ambiguous and subjective nature of human-robot interactions that requires embodied AI systems to perceive, interpret, and respond to human intentions in a manner that is consistent, comprehensible and aligned with human expectations. We argue that when embodied agents operate in human environments that are inherently social, multimodal, and fluid, reliability is contextually determined and only has meaning in relation to the goals and expectations of humans involved in the interaction. This calls for a fundamentally different approach to achieving reliable embodied AI that is centred on building and updating an accessible "explicit world model" representing the common ground between human and AI, that is used to align robot behaviours with human expectations.
Reliable inference requires that artificial intelligence (AI) models provide trustworthy uncertainty estimates, not merely accurate predictions. Recent advances in Bayesian learning have made significant progress toward this goal, and growing concerns about computational overhead have jointly shifted the design criterion from reliability alone to the co-design of reliability and efficiency, i.e., reducing computational overhead while preserving trustworthy uncertainty quantification. This thesis develops a unified framework from two perspectives to address the central question: can we efficiently perform reliable inference?
As video language models (VLMs) gain more applications in various scenarios, the need for robust and scalable evaluation of their performance becomes increasingly critical. The traditional human expert-based evaluation of VLMs has limitations in consistency and scalability, which sparked interest in automatic methods such as employing VLMs to evaluate VLMs. However, the reliability of VLMs as judges remains underexplored. Existing methods often rely on a single VLM as the evaluator, but this approach can be unreliable or biased because such a model may lack the ability to fully understand the content and may have inherent biases, ultimately compromising evaluation reliability. A remedy is to apply the principle of collective thoughts, aggregating evaluations from multiple VLMs to enhance reliability. This study investigates the efficacy of such approaches, particularly when the pool of judges includes both reliable and unreliable models. Our findings reveal that incorporating collective judgments from such a mixed pool does not necessarily improve the accuracy of the final evaluation. The inclusion of less reliable judges can introduce noise, undermining the overall reliabilit
In this work, we address the problem of determining reliable policies in reinforcement learning (RL), with a focus on optimization under uncertainty and the need for performance guarantees. While classical RL algorithms aim at maximizing the expected return, many real-world applications - such as routing, resource allocation, or sequential decision-making under risk - require strategies that ensure not only high average performance but also a guaranteed probability of success. To this end, we propose a novel formulation in which the objective is to maximize the probability that the cumulative return exceeds a prescribed threshold. We demonstrate that this reliable RL problem can be reformulated, via a state-augmented representation, into a standard RL problem, thereby allowing the use of existing RL and deep RL algorithms without the need for entirely new algorithmic frameworks. Theoretical results establish the equivalence of the two formulations and show that reliable strategies can be derived by appropriately adapting well-known methods such as Q-learning or Dueling Double DQN. To illustrate the practical relevance of the approach, we consider the problem of reliable routing, wh
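The state-augmented reformulation described in the abstract above can be illustrated with a small wrapper. This is a sketch under stated assumptions, not the authors' construction: the accumulated return is appended to the state, and the only reward is an indicator paid at the end of the episode, so the expected return of the wrapped problem equals the probability that the cumulative return reaches the threshold. `CoinFlipEnv`, `ThresholdAugmentedEnv`, `run_episode`, and the `reset()`/`step()` interface are hypothetical names introduced for this example.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy episodic environment: each step pays 1 or 0."""

    def __init__(self, horizon=3, seed=0):
        self.horizon = horizon
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        # The action is ignored in this toy example.
        self.t += 1
        reward = 1.0 if self.rng.random() < 0.5 else 0.0
        done = self.t >= self.horizon
        return self.t, reward, done


class ThresholdAugmentedEnv:
    """Wraps an environment so that maximizing expected return is the same
    as maximizing P(cumulative return >= threshold)."""

    def __init__(self, env, threshold):
        self.env = env
        self.threshold = threshold
        self.acc = 0.0

    def reset(self):
        self.acc = 0.0
        return (self.env.reset(), self.acc)

    def step(self, action):
        state, reward, done = self.env.step(action)
        self.acc += reward
        # Indicator reward, paid only when the episode ends.
        indicator = 1.0 if done and self.acc >= self.threshold else 0.0
        return (state, self.acc), indicator, done


def run_episode(env, policy=lambda state: 0):
    """Run one episode and return the total (augmented) reward."""
    state = env.reset()
    total, done = 0.0, False
    while not done:
        state, reward, done = env.step(policy(state))
        total += reward
    return total


env = ThresholdAugmentedEnv(CoinFlipEnv(horizon=3, seed=1), threshold=2.0)
print(run_episode(env), env.acc)
```

Because the wrapped environment exposes the same interface as the original one, a standard algorithm such as Q-learning could in principle be run on it unchanged, which matches the claim in the abstract that no new algorithmic framework is needed.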
A two-terminal graph is a graph equipped with two distinguished vertices, called terminals. Let $T_{n,m}$ be the set of all nonisomorphic connected simple two-terminal graphs on $n$ vertices and $m$ edges. Let $G$ be any two-terminal graph in $T_{n,m}$. For every number $p$ in $[0,1]$ we let each of the edges in $G$ be independently deleted with probability $1-p$. The split reliability $SR_{G}(p)$ is the probability that the resulting spanning subgraph has precisely $2$ connected components, each one including one terminal. The two-terminal graph $G$ is uniformly most split reliable if $SR_G(p)\geq SR_{H}(p)$ for each $H$ in $T_{n,m}$ and every $p$ in $[0,1]$. We say $G$ is locally most split reliable if there exists $δ>0$ such that $SR_G(p)\geq SR_{H}(p)$ for each $H$ in $T_{n,m}$ and every $p$ in $(1-δ,1)$. Brown and McMullin showed that there exist uniformly most split reliable graphs in each class $T_{n,m}$ such that $m=n-1$, $m=\binom{n}{2}$, or $m=\binom{n}{2}-1$. The authors also proved that there is no uniformly most split reliable two-terminal graph in $T_{n,n}$ when $n\geq 6$ and specified in which classes $T_{n,m}$ such that $n\leq 7$ there exist uniformly most split
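The split reliability $SR_G(p)$ defined above can be evaluated by exhaustive enumeration on very small graphs. The sketch below is for illustration only and assumes vertices labelled `0..n-1`, an edge-list representation, and the hypothetical function names shown. Note that here $p$ is the probability that an edge is kept, since each edge is deleted with probability $1-p$.

```python
from itertools import combinations


def component_labels(n, edges):
    """Return a list mapping each vertex 0..n-1 to a component representative."""
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in edges:
        parent[find(u)] = find(v)
    return [find(v) for v in range(n)]


def split_reliability(n, edges, s, t, p):
    """Probability that the surviving spanning subgraph has exactly two
    components, one containing terminal s and the other terminal t."""
    m = len(edges)
    total = 0.0
    for k in range(m + 1):
        for kept in combinations(edges, k):
            labels = component_labels(n, kept)
            # Exactly two components with the terminals separated means
            # each component contains exactly one terminal.
            if len(set(labels)) == 2 and labels[s] != labels[t]:
                total += p ** k * (1 - p) ** (m - k)
    return total


# Path s - a - t: exactly one of the two edges must survive.
print(split_reliability(3, [(0, 1), (1, 2)], 0, 2, 0.5))
```

For the three-vertex path with the terminals at its ends, exactly one edge must survive, so $SR_G(p) = 2p(1-p)$.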
In recent years, explaining decisions made by complex machine learning models has become essential in high-stakes domains such as energy systems, healthcare, finance, and autonomous systems. However, the reliability of these explanations, namely, whether they remain stable and consistent under realistic, non-adversarial changes, remains largely unmeasured. Widely used methods such as SHAP and Integrated Gradients (IG) are well-motivated by axiomatic notions of attribution, yet their explanations can vary substantially even under system-level conditions, including small input perturbations, correlated representations, and minor model updates. Such variability undermines explanation reliability, as reliable explanations should remain consistent across equivalent input representations and small, performance-preserving model changes. We introduce the Explanation Reliability Index (ERI), a family of metrics that quantifies explanation stability under four reliability axioms: robustness to small input perturbations, consistency under feature redundancy, smoothness across model evolution, and resilience to mild distributional shifts. For each axiom, we derive formal guarantees, including
Recent advancements in large audio language models (LALMs) have demonstrated impressive results and promising prospects in universal understanding and reasoning across speech, music, and general sound. However, these models still lack the ability to recognize their knowledge boundaries and proactively refuse to answer questions they don't know. While there have been successful attempts to enhance the reliability of LLMs, reliable LALMs remain largely unexplored. In this paper, we systematically investigate various approaches towards reliable LALMs, including training-free methods such as multi-modal chain-of-thought (MCoT), and training-based methods such as supervised fine-tuning (SFT). Besides, we identify the limitations of previous evaluation metrics and propose a new metric, the Reliability Gain Index (RGI), to assess the effectiveness of different reliable methods. Our findings suggest that both training-free and training-based methods enhance the reliability of LALMs to different extents. Moreover, we find that awareness of reliability is a "meta ability", which can be transferred across different audio modalities, although significant structural and content differences exis
Financial forecasting increasingly uses large neural network models, but their opacity raises challenges for trust and regulatory compliance. We present several approaches to explainable and reliable AI in finance. \emph{First}, we describe how Time-LLM, a time series foundation model, uses a prompt to avoid a wrong directional forecast. \emph{Second}, we show that combining foundation models for time series forecasting with a reliability estimator can filter out unreliable predictions. \emph{Third}, we argue for symbolic reasoning encoding domain rules for transparent justification. These approaches shift the emphasis to executing only forecasts that are both reliable and explainable. Experiments on equity and cryptocurrency data show that the architecture reduces false positives and supports selective execution. By integrating predictive performance with reliability estimation and rule-based reasoning, our framework advances transparent and auditable financial AI systems.
We consider a network consisting of $n$ nodes that aim to track a continually updating process or event. To disseminate updates about the event to the network, two sources are available, such that information obtained from one source is considered more reliable than the other source. The nodes wish to have access to information about the event that is not only the latest but also more reliable, and prefer a reliable packet over an unreliable packet even when the former is a bit outdated with respect to the latter. We study how such preference affects the fraction of users with reliable information in the network and their version age of information. We derive the analytical equations to characterize the two quantities, the long-term expected fraction of nodes with reliable packets and their long-term expected version age, using stochastic hybrid systems (SHS) modelling, and study their properties. We also compare these results with the case where nodes give more preference to freshness of information than to its reliability. Finally, we show simulation results to verify the theoretical results and shed further light on the behavior of the above quantities with respect to dependent variables.
Active Learning (AL) promises to reduce annotation cost by prioritizing informative samples, yet its reliability is undermined when labels are noisy or when the data distribution shifts. In practice, annotators make mistakes, rare categories are ambiguous, and conventional AL heuristics (uncertainty, diversity) often amplify such errors by repeatedly selecting mislabeled or redundant samples. We propose Reliable Active Learning via Neural Collapse Geometry (NCAL-R), a framework that leverages the emergent geometric regularities of deep networks to counteract unreliable supervision. Our method introduces two complementary signals: (i) a Class-Mean Alignment Perturbation score, which quantifies how candidate samples structurally stabilize or distort inter-class geometry, and (ii) a Feature Fluctuation score, which captures temporal instability of representations across training checkpoints. By combining these signals, NCAL-R prioritizes samples that both preserve class separation and highlight ambiguous regions, mitigating the effect of noisy or redundant labels. Experiments on ImageNet-100 and CIFAR100 show that NCAL-R consistently outperforms standard AL baselines, achieving higher
There is a flurry of recent research papers proposing novel differentially private machine learning (DPML) techniques. These papers claim to achieve new state-of-the-art (SoTA) results and offer empirical results as validation. However, there is no consensus on which techniques are most effective or if they genuinely meet their stated claims. Complicating matters, heterogeneity in codebases, datasets, methodologies, and model architectures makes direct comparisons of different approaches challenging. In this paper, we conduct a reproducibility and replicability (R+R) experiment on 11 different SoTA DPML techniques from the recent research literature. Results of our investigation are varied: while some methods stand up to scrutiny, others falter when tested outside their initial experimental conditions. We also discuss challenges unique to the reproducibility of DPML, including additional randomness due to DP noise, and how to address them. Finally, we derive insights and best practices to obtain scientifically valid and reliable results.
Large language models (LLMs) are increasingly being used as decision aids. However, users have diverse values and preferences that can affect their decision-making, which requires novel methods for LLM alignment and personalization. Existing LLM comparison tools largely focus on benchmarking tasks, such as knowledge-based question answering. In contrast, our proposed ALIGN system focuses on dynamic personalization of LLM-based decision-makers through prompt-based alignment to a set of fine-grained attributes. Key features of our system include robust configuration management, structured output generation with reasoning, and several algorithm implementations with swappable LLM backbones, enabling different types of analyses. Our user interface enables a qualitative, side-by-side comparison of LLMs and their alignment to various attributes, with a modular backend for easy algorithm integration. Additionally, we perform a quantitative analysis comparing alignment approaches in two different domains: demographic alignment for public opinion surveys and value alignment for medical triage decision-making. The entire ALIGN framework is open source and will enable new research on reliable,
A \emph{$ν$-reliable spanner} of a metric space $(X,d)$, is a (dominating) graph $H$, such that for any possible failure set $B\subseteq X$, there is a set $B^+$ just slightly larger $|B^+|\le(1+ν)\cdot|B|$, and all distances between pairs in $X\setminus B^+$ are (approximately) preserved in $H\setminus B$. Recently, there have been several works on sparse reliable spanners in various settings, but so far, the weight of such spanners has not been analyzed at all. In this work, we initiate the study of \emph{light} reliable spanners, whose weight is proportional to that of the Minimum Spanning Tree (MST) of $X$. We first observe that unlike sparsity, the lightness of any deterministic reliable spanner is huge, even for the metric of the simple path graph. Therefore, randomness must be used: an \emph{oblivious} reliable spanner is a distribution over spanners, and the bound on $|B^+|$ holds in expectation. We devise an oblivious $ν$-reliable $(2+\frac{2}{k-1})$-spanner for any $k$-HST, whose lightness is $\approx ν^{-2}$. We demonstrate a matching $Ω(ν^{-2})$ lower bound on the lightness (for any finite stretch). We also note that any stretch below 2 must incur linear lightness. For
Large language models (LLMs) have demonstrated significant potential in text-to-SQL reasoning tasks, yet a substantial performance gap persists between existing open-source models and their closed-source counterparts. In this paper, we introduce SQLForge, a novel approach for synthesizing reliable and diverse data to enhance text-to-SQL reasoning in LLMs. We improve data reliability through SQL syntax constraints and SQL-to-question reverse translation, ensuring data logic at both structural and semantic levels. We also propose an SQL template enrichment and iterative data domain exploration mechanism to boost data diversity. Building on the augmented data, we fine-tune a variety of open-source models with different architectures and parameter sizes, resulting in a family of models termed SQLForge-LM. SQLForge-LM achieves state-of-the-art performance on the widely recognized Spider and BIRD benchmarks among open-source models. Specifically, SQLForge-LM achieves EX accuracy of 85.7% on Spider Dev and 59.8% on BIRD Dev, significantly narrowing the performance gap with closed-source methods.
Within 3GPP, the campus network architecture has evolved as a deployment option for industries and can be provisioned using network slicing over already installed 5G public network infrastructure. In campus networks, the ultra-reliable low latency communication (URLLC) service category is of major interest for applications with strict latency and high-reliability requirements. One way to achieve high reliability in a shared infrastructure is through resource isolation, whereby network slicing can be optimized to adequately reserve computation and transmission capacity. This paper proposes an approach for vertically slicing the radio access network (RAN) to enable the deployment of multiple and isolated campus networks to accommodate URLLC services. To this end, we model RAN function placement as a mixed integer linear programming problem with URLLC-related constraints. We demonstrate that our approach can find optimal solutions in real-world scenarios. Furthermore, unlike existing solutions, our model considers the user traffic flow from a known source node on the network's edge to an unknown \textit{a priori} destination node. This flexibility could be explored in industrial campus
To bridge the gaps between topology-aware Graph Neural Networks (GNNs) and inference-efficient Multi-Layer Perceptrons (MLPs), GLNN proposes to distill knowledge from a well-trained teacher GNN into a student MLP. Despite their great progress, comparatively little work has been done to explore the reliability of different knowledge points (nodes) in GNNs, especially their roles played during distillation. In this paper, we first quantify the knowledge reliability in GNN by measuring the invariance of their information entropy to noise perturbations, from which we observe that different knowledge points (1) show different distillation speeds (temporally); (2) are differentially distributed in the graph (spatially). To achieve reliable distillation, we propose an effective approach, namely Knowledge-inspired Reliable Distillation (KRD), that models the probability of each node being an informative and reliable knowledge point, based on which we sample a set of additional reliable knowledge points as supervision for training student MLPs. Extensive experiments show that KRD improves over the vanilla MLPs by 12.62% and outperforms its corresponding teacher GNNs by 2.16% averaged over 7
We study several questions in the reliable agnostic learning framework of Kalai et al. (2009), which captures learning tasks in which one type of error is costlier than others. A positive reliable classifier is one that makes no false positive errors. The goal in the positive reliable agnostic framework is to output a hypothesis with the following properties: (i) its false positive error rate is at most $ε$, (ii) its false negative error rate is at most $ε$ more than that of the best positive reliable classifier from the class. A closely related notion is fully reliable agnostic learning, which considers partial classifiers that are allowed to predict "unknown" on some inputs. The best fully reliable partial classifier is one that makes no errors and minimizes the probability of predicting "unknown", and the goal in fully reliable learning is to output a hypothesis that is almost as good as the best fully reliable partial classifier from a class. For distribution-independent learning, the best known algorithms for PAC learning typically utilize polynomial threshold representations, while the state of the art agnostic learning algorithms use point-wise polynomial approximations. We
Process Reward Models (PRMs) provide step-level feedback for reasoning, but current PRMs usually output only a single reward score for each step. Downstream methods must therefore treat imperfect step-level reward predictions as reliable decision signals, with no indication of when these predictions should be trusted. We propose BetaPRM, a distributional PRM that predicts both a step-level success probability and the reliability of that prediction. Given step-success supervision from Monte Carlo continuations, BetaPRM learns a Beta belief that explains the observed number of successful continuations through a Beta-Binomial likelihood, rather than regressing to the finite-sample success ratio as a point target. This learned reliability signal indicates when a step reward should be trusted, enabling downstream applications to distinguish reliable rewards from uncertain ones. As one application, we introduce Adaptive Computation Allocation (ACA) for PRM-guided Best-of-N reasoning. ACA uses the learned reliability signal to stop when a high-reward solution is reliable and to spend additional computation on uncertain candidate prefixes. Experiments across four backbones and four reasoni
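The Beta-Binomial likelihood mentioned in the abstract above has a closed form: for $k$ successes out of $n$ continuations under a $\mathrm{Beta}(\alpha,\beta)$ belief, $P(k \mid n,\alpha,\beta) = \binom{n}{k}\,B(k+\alpha,\,n-k+\beta)/B(\alpha,\beta)$. The sketch below evaluates its logarithm with the standard library only. It is not BetaPRM's training code; the function names are hypothetical, and reading the Beta variance as a reliability signal is an assumption made here for illustration.

```python
from math import exp, lgamma


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def beta_binomial_log_likelihood(k, n, alpha, beta):
    """Log-probability of k successes in n trials when the success
    probability follows a Beta(alpha, beta) distribution."""
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return log_choose + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta)


def beta_mean_and_variance(alpha, beta):
    """Mean (point estimate of step success) and variance of Beta(alpha, beta).
    A smaller variance is read here, as an assumption, as a more reliable estimate."""
    total = alpha + beta
    mean = alpha / total
    variance = alpha * beta / (total ** 2 * (total + 1))
    return mean, variance


# Two beliefs with the same mean but different concentration.
print(exp(beta_binomial_log_likelihood(2, 4, 1.0, 1.0)))
print(beta_mean_and_variance(2.0, 2.0))
print(beta_mean_and_variance(20.0, 20.0))
```

With $\alpha=\beta=1$ the belief is uniform, so every count $k \in \{0,\dots,n\}$ has probability $1/(n+1)$. The two printed beliefs share the mean $0.5$ but the more concentrated one has the smaller variance, which is the kind of distinction a single scalar reward cannot express.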