Search - ResearchTracker

Safety guarantees are a prerequisite to the deployment of reinforcement learning (RL) agents in safety-critical tasks. Often, deployment environments exhibit non-stationary dynamics or are subject to changing performance goals, requiring updates to the learned policy. This leads to a fundamental challenge: how to update an RL policy while preserving its safety properties on previously encountered tasks? The majority of current approaches either do not provide formal guarantees or verify policy safety only a posteriori. We propose a novel a priori approach to safe policy updates in continual RL by introducing the Rashomon set: a region in policy parameter space certified to meet safety constraints within the demonstration data distribution. We then show that one can provide formal, provable guarantees for arbitrary RL algorithms used to update a policy by projecting their updates onto the Rashomon set. Empirically, we validate this approach across grid-world navigation environments (Frozen Lake and Poisoned Apple) where we guarantee an a priori provably deterministic safety on the source task during downstream adaptation. In contrast, we observe that regularisation-based baselines e

Policy Iteration Achieves Regularized Equilibrium under Time Inconsistency

arXiv, 2026-03-06. Authors: Yu-Jui Huang, Xiang Yu, Keyu Zhang

For a general entropy-regularized time-inconsistent stochastic control problem, we propose a policy iteration algorithm (PIA) and establish its convergence to an equilibrium policy with an exponential convergence rate. The design of the PIA is based on a coupled system of non-local partial differential equations, called the exploratory equilibrium Hamilton–Jacobi–Bellman (EEHJB) equation. As opposed to the standard time-consistent case, policy improvement fails in general and the target value function (now an equilibrium value function) is not even known to exist a priori. To overcome these, we prove that the value functions generated by the PIA form a Cauchy sequence in a specialized Banach space, hence admit a limit, and the rate of convergence is exponential, on the strength of the Bismut–Elworthy–Li formula of stochastic representation. The limiting value function is shown to fulfill the EEHJB equation, which induces an equilibrium policy in a Gibbs form. Such convergence in value additionally implies uniform convergence of the generated policies to the equilibrium policy, again with an exponential rate. As a byproduct, the PIA gives a constructive proof of the global exist

Search results: Policy

SafeAdapt: Provably Safe Policy Updates in Deep Reinforcement Learning

Policy Iteration Achieves Regularized Equilibrium under Time Inconsistency

Crowdsourcing: a new tool for policy-making?

Bayesian Adaptive Trials for Social Policy

Research on Diamond Open Access in the Long Shadow of Science Policy

The California Report on Frontier AI Policy

LegiGPT: Party Politics and Transport Policy with Large Language Model

Agricultural Policy in Ukraine

Cross-Domain Policy Transfer by Representation Alignment via Multi-Domain Behavioral Cloning

Time-Varying Identification of Monetary Policy Shocks

Instant Policy: In-Context Imitation Learning via Graph Diffusion

Generative AI Policy and Governance Considerations for Health Security in Southeast Asia

Should Policymakers be Involved? Understanding the Opinions and Needs for Independent Food Delivery Platforms in the United States regarding Public Policy

Policy as Code, Policy as Type

Case Studies of AI Policy Development in Africa

PAnDR: Fast Adaptation to New Environments from Offline Experiences via Decoupling Policy and Environment Representations

Experiments on Crowdsourcing Policy Assessment

Plan Your Target and Learn Your Skills: Transferable State-Only Imitation Learning via Decoupled Policy Optimization

Conformal Policy Learning for Sensorimotor Control Under Distribution Shifts

The Fundamentals of Policy Crowdsourcing