Safety guarantees are a prerequisite to the deployment of reinforcement learning (RL) agents in safety-critical tasks. Often, deployment environments exhibit non-stationary dynamics or are subject to changing performance goals, requiring updates to the learned policy. This leads to a fundamental challenge: how to update an RL policy while preserving its safety properties on previously encountered tasks? The majority of current approaches either do not provide formal guarantees or verify policy safety only a posteriori. We propose a novel a priori approach to safe policy updates in continual RL by introducing the Rashomon set: a region in policy parameter space certified to meet safety constraints within the demonstration data distribution. We then show that one can provide formal, provable guarantees for arbitrary RL algorithms used to update a policy by projecting their updates onto the Rashomon set. Empirically, we validate this approach across grid-world navigation environments (Frozen Lake and Poisoned Apple), where we provide a priori, provably deterministic safety guarantees on the source task during downstream adaptation. In contrast, we observe that regularisation-based baselines e
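The projection step described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the certified region is an axis-aligned box in parameter space, so that projection reduces to clipping each coordinate; the abstract does not say how the Rashomon set is actually represented, and the bounds, parameters, and function names below are hypothetical.

```python
from typing import List, Sequence


def project_onto_box(params: Sequence[float],
                     lower: Sequence[float],
                     upper: Sequence[float]) -> List[float]:
    """Euclidean projection onto an axis-aligned box: clip each coordinate."""
    return [min(max(p, lo), hi) for p, lo, hi in zip(params, lower, upper)]


def projected_update(params: Sequence[float],
                     step: Sequence[float],
                     lower: Sequence[float],
                     upper: Sequence[float]) -> List[float]:
    """Apply an update proposed by any RL algorithm, then project the result
    back into the (assumed) certified box."""
    proposed = [p + s for p, s in zip(params, step)]
    return project_onto_box(proposed, lower, upper)


# Hypothetical certified bounds, current parameters, and proposed step.
lower = [-1.0, 0.0]
upper = [1.0, 0.5]
theta = [0.5, 0.25]
step = [1.0, -0.5]

new_theta = projected_update(theta, step, lower, upper)
```

Because the projection is applied after the update, the base algorithm that produces `step` is left unchanged; only its output is constrained to stay inside the region.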
For a general entropy-regularized time-inconsistent stochastic control problem, we propose a policy iteration algorithm (PIA) and establish its convergence to an equilibrium policy with an exponential convergence rate. The design of the PIA is based on a coupled system of non-local partial differential equations, called the exploratory equilibrium Hamilton--Jacobi--Bellman (EEHJB) equation. As opposed to the standard time-consistent case, policy improvement fails in general and the target value function (now an equilibrium value function) is not even known to exist a priori. To overcome these, we prove that the value functions generated by the PIA form a Cauchy sequence in a specialized Banach space, hence admit a limit, and the rate of convergence is exponential, on the strength of the Bismut--Elworthy--Li formula of stochastic representation. The limiting value function is shown to fulfill the EEHJB equation, which induces an equilibrium policy in a Gibbs form. Such convergence in value additionally implies uniform convergence of the generated policies to the equilibrium policy, again with an exponential rate. As a byproduct, the PIA gives a constructive proof of the global exist
Crowdsourcing is rapidly evolving and applied in situations where ideas, labour, opinion or expertise of large groups of people are used. Crowdsourcing is now used in various policy-making initiatives; however, this use has usually focused on open collaboration platforms and specific stages of the policy process, such as agenda-setting and policy evaluations. Other forms of crowdsourcing have been neglected in policy-making, with a few exceptions. This article examines crowdsourcing as a tool for policy-making, and explores the nuances of the technology and its use and implications for different stages of the policy process. The article addresses questions surrounding the role of crowdsourcing and whether it can be considered as a policy tool or as a technological enabler and investigates the current trends and future directions of crowdsourcing. Keywords: Crowdsourcing, Public Policy, Policy Instrument, Policy Tool, Policy Process, Policy Cycle, Open Collaboration, Virtual Labour Markets, Tournaments, Competition.
This paper proposes Bayesian Adaptive Trials (BAT) as both an efficient method to conduct trials and a unifying framework for evaluating social policy interventions, addressing limitations inherent in traditional methods such as Randomized Controlled Trials (RCT). Recognizing the crucial need for evidence-based approaches in public policy, the proposal aims to lower barriers to the adoption of evidence-based methods and align evaluation processes more closely with the dynamic nature of policy cycles. BATs, grounded in decision theory, offer a dynamic, ``learning as we go'' approach, enabling the integration of diverse information types and facilitating a continuous, iterative process of policy evaluation. BATs' adaptive nature is particularly advantageous in policy settings, allowing for more timely and context-sensitive decisions. Moreover, BATs' ability to value potential future information sources positions them as an optimal strategy for sequential data acquisition during policy implementation. While acknowledging the assumptions and models intrinsic to BATs, such as prior distributions and likelihood functions, the paper argues that these are advantageous for decision-makers in
This paper reviews research literature on Diamond Open Access (DOA) journals - sometimes also called Platinum Open Access - that was produced after this journal segment started to become a priority in European research policy around 2020. It contextualizes the current science policy debate, critically examines different understandings of DOA, and reviews studies on the role of such journals in scholarly communication. Most existing research consists of quantitative studies focusing on aspects such as the number of DOA journals, their publication output, the diversity of the landscape in terms of subject areas, languages, publishing entities, indexing in major databases, awareness and perception among scholars, cost analyses, as well as insights into the internal operations of DOA journals. The review shows that research on DOA journals is partly influenced by the science policy discourse in at least two ways: first, through the normativity inherent in that discourse, and second, through the temporality of policy-driven research of practical relevance, which leaves important aspects of the phenomenon understudied. Moreover, research on the DOA journal landscape has implications beyo
The innovations emerging at the frontier of artificial intelligence (AI) are poised to create historic opportunities for humanity but also raise complex policy challenges. Continued progress in frontier AI carries the potential for profound advances in scientific discovery, economic productivity, and broader social well-being. As the epicenter of global AI innovation, California has a unique opportunity to continue supporting developments in frontier AI while addressing substantial risks that could have far reaching consequences for the state and beyond. This report leverages broad evidence, including empirical research, historical analysis, and modeling and simulations, to provide a framework for policymaking on the frontier of AI development. Building on this multidisciplinary approach, this report derives policy principles that can inform how California approaches the use, assessment, and governance of frontier AI: principles rooted in an ethos of trust but verify. This approach takes into account the importance of innovation while establishing appropriate strategies to reduce material risks.
Given the significant influence of lawmakers' political ideologies on legislative decision-making, analyzing their impact on transportation-related policymaking is of critical importance. This study introduces a novel framework that integrates a large language model (LLM) with explainable artificial intelligence (XAI) to analyze transportation-related legislative proposals. Legislative bill data from South Korea's 21st National Assembly were used to identify key factors shaping transportation policymaking. These include political affiliations and sponsor characteristics. The LLM was employed to classify transportation-related bill proposals through a stepwise filtering process based on keywords, sentences, and contextual relevance. XAI techniques were then applied to examine the relationships between political party affiliation and associated attributes. The results revealed that the number and proportion of conservative and progressive sponsors, along with district size and electoral population, were critical determinants shaping legislative outcomes. These findings suggest that both parties contributed to bipartisan legislation through different forms of engagement, such as initi
Building upon the theory and methodology of agricultural policy developed in the previous chapter, in Chapter 2 we analyse and assess agricultural policy making in Ukraine from the breakup of the Soviet Union to the present. Proceeding from the top down, we begin by describing the evolution of state policy in the agri-food sector. We first describe the major milestones of agricultural policy making since independence, paving the way to the political economy of modern agricultural policy in Ukraine. We then describe the role of the agri-food sector in the national economy and in ensuring global food security. Next, we examine the performance of the agricultural sector in detail, looking at farm structures, their land use, and the overall and sector-wise untapped productivity potential. Modern agricultural policy and its institutional set-up are analysed in detail in Section 2.4. A review of the agricultural up- and downstream sectors wraps up the chapter.
Transferring learned skills across diverse situations remains a fundamental challenge for autonomous agents, particularly when agents are not allowed to interact with an exact target setup. While prior approaches have predominantly focused on learning domain translation, they often struggle with handling significant domain gaps or out-of-distribution tasks. In this paper, we present a simple approach for cross-domain policy transfer that learns a shared latent representation across domains and a common abstract policy on top of it. Our approach leverages multi-domain behavioral cloning on unaligned trajectories of proxy tasks and employs maximum mean discrepancy (MMD) as a regularization term to encourage cross-domain alignment. The MMD regularization better preserves structures of latent state distributions than commonly used domain-discriminative distribution matching, leading to higher transfer performance. Moreover, our approach involves training only one multi-domain policy, which makes extension easier than existing methods. Empirical evaluations demonstrate the efficacy of our method across various domain shifts, especially in scenarios where exact domain translation is chal
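As a rough illustration of the alignment term mentioned above, the following sketch computes a biased estimate of the squared maximum mean discrepancy between two small batches of latent vectors. The Gaussian kernel, the bandwidth, and the toy latent batches are assumptions made for this example; the abstract does not specify the kernel or estimator used.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def gaussian_kernel(x: Vector, y: Vector, bandwidth: float = 1.0) -> float:
    """Gaussian (RBF) kernel between two vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mean_kernel(xs: Sequence[Vector], ys: Sequence[Vector],
                bandwidth: float) -> float:
    """Average kernel value over all pairs (x, y)."""
    total = sum(gaussian_kernel(x, y, bandwidth) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(xs: Sequence[Vector], ys: Sequence[Vector],
                bandwidth: float = 1.0) -> float:
    """Biased (V-statistic) estimate of the squared MMD between two samples."""
    return (mean_kernel(xs, xs, bandwidth)
            + mean_kernel(ys, ys, bandwidth)
            - 2.0 * mean_kernel(xs, ys, bandwidth))


# Hypothetical latent states from a source domain and two candidate target domains.
latents_source = [[0.0, 0.0], [0.1, -0.1]]
latents_near = [[0.05, 0.0], [0.1, -0.05]]
latents_far = [[3.0, 3.0], [3.1, 2.9]]

mmd_near = mmd_squared(latents_source, latents_near)
mmd_far = mmd_squared(latents_source, latents_far)
```

Used as a regulariser, a quantity like `mmd_squared` would be added with some weight to the behavioural cloning loss, so that latent distributions from different domains are pulled together during training.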
We propose a new Bayesian heteroskedastic Markov-switching structural vector autoregression with data-driven time-varying identification. The model selects alternative exclusion restrictions over time and, as a condition for the search, allows identification through heteroskedasticity to be verified within each regime. Based on four alternative monetary policy rules, we show that a monthly six-variable system supports time variation in US monetary policy shock identification. In the sample-dominating first regime, systematic monetary policy follows a Taylor rule extended by the term spread, effectively curbing inflation. In the second regime, occurring after 2000 and gaining more persistence after the global financial and COVID crises, it is characterized by a money-augmented Taylor rule. This regime's unconventional monetary policy provides economic stimulus, features the liquidity effect, and is complemented by a pure term spread shock. Absent the specific monetary policy of the second regime, inflation would be over one percentage point higher on average after 2008.
Following the impressive capabilities of in-context learning with large transformers, In-Context Imitation Learning (ICIL) is a promising opportunity for robotics. We introduce Instant Policy, which learns new tasks instantly (without further training) from just one or two demonstrations, achieving ICIL through two key components. First, we introduce inductive biases through a graph representation and model ICIL as a graph generation problem with a learned diffusion process, enabling structured reasoning over demonstrations, observations, and actions. Second, we show that such a model can be trained using pseudo-demonstrations - arbitrary trajectories generated in simulation - as a virtually infinite pool of training data. Simulated and real experiments show that Instant Policy enables rapid learning of various everyday robot tasks. We also show how it can serve as a foundation for cross-embodiment and zero-shot transfer to language-defined tasks. Code and videos are available at https://www.robot-learning.uk/instant-policy.
Southeast Asia is a geopolitically and socio-economically significant region with unique challenges and opportunities. Intensifying progress in generative AI against a backdrop of existing health security threats makes applications of AI to mitigate such threats attractive but also risky if done without due caution. This paper provides a brief sketch of some of the applications of AI for health security and the regional policy and governance landscape. I focus on policy and governance activities of the Association of Southeast Asian Nations (ASEAN), an international body whose member states represent 691 million people. I conclude by identifying sustainability as an area of opportunity for policymakers and recommend priority areas for generative AI researchers to make the most impact with their work.
Mainstream food delivery platforms, like DoorDash and Uber Eats, have been the locus of fierce policy debates about their unfair business and labor practices. At the same time, hundreds of independent food delivery services provide alternative opportunities to many communities across the U.S. We surveyed operators of independent food delivery platforms to learn about their perception of the role of public policy. We found conflicting opinions on whether and how policy should interact with their businesses, ranging from not wanting policymakers to interfere to articulating specific policies that would curtail mainstream platforms' business practices. We provide insights for technologists and policymakers interested in the sociotechnical challenges of local marketplaces.
Policies are designed to distinguish between correct and incorrect actions; they are types. But badly typed actions may cause not compile errors but financial and reputational harm. We demonstrate how even the most complex ABAC policies can be expressed as types in dependently typed languages such as Agda and Lean, providing a single framework to express, analyze, and implement policies. We then go head-to-head with Rego, the popular and powerful open-source ABAC policy language. We show the superior safety that comes with a powerful type system and built-in proof assistant. In passing, we discuss various access control models, sketch how to integrate in a future when attributes are distributed and signed (as discussed at the W3C), and show how policies can be communicated using just the syntax of the language. Our examples are in Agda.
Artificial Intelligence (AI) requires new ways of evaluating national technology use and strategy for African nations. We conduct a survey of existing 'readiness' assessments both for general digital adoption and for AI policy in particular. We conclude that existing global readiness assessments do not fully capture African states' progress in AI readiness and lay the groundwork for how assessments can be better used for the African context. We consider the extent to which these indicators map to the African context and what these indicators miss in capturing African states' on-the-ground work in meeting AI capability. Through case studies of four African nations of diverse geographic and economic dimensions, we identify nuances missed by global assessments and offer high-level policy considerations for how states can best improve their AI readiness standards and prepare their societies to capture the benefits of AI.
Deep Reinforcement Learning (DRL) has been a promising solution to many complex decision-making problems. Nevertheless, the notorious weakness in generalization among environments prevents widespread application of DRL agents in real-world scenarios. Although advances have been made recently, most prior works assume sufficient online interaction on training environments, which can be costly in practical cases. To this end, we focus on an offline-training-online-adaptation setting, in which the agent first learns from offline experiences collected in environments with different dynamics and then performs online policy adaptation in environments with new dynamics. In this paper, we propose Policy Adaptation with Decoupled Representations (PAnDR) for fast policy adaptation. In the offline training phase, the environment representation and policy representation are learned through contrastive learning and policy recovery, respectively. The representations are further refined by mutual information optimization to make them more decoupled and complete. With learned representations, a Policy-Dynamics Value Function (PDVF) [Raileanu et al., 2020] network is trained to approximate the values for
Can Crowds serve as useful allies in policy design? How do non-expert Crowds perform relative to experts in the assessment of policy measures? Does the geographic location of non-expert Crowds, with relevance to the policy context, alter the performance of non-expert Crowds in the assessment of policy measures? In this work, we investigate these questions by undertaking experiments designed to replicate expert policy assessments with non-expert Crowds recruited from Virtual Labor Markets. We use a set of ninety-six climate change adaptation policy measures previously evaluated by experts in the Netherlands as our control condition to conduct experiments using two discrete sets of non-expert Crowds recruited from Virtual Labor Markets. We vary the composition of our non-expert Crowds along two conditions: participants recruited from a geographical location directly relevant to the policy context and participants recruited at-large. We discuss our research methods in detail and provide the findings of our experiments.
Recent progress in state-only imitation learning extends the scope of applicability of imitation learning to real-world settings by relieving the need for observing expert actions. However, existing solutions only learn to extract a state-to-action mapping policy from the data, without considering how the expert plans to the target. This hinders the ability to leverage demonstrations and limits the flexibility of the policy. In this paper, we introduce Decoupled Policy Optimization (DePO), which explicitly decouples the policy as a high-level state planner and an inverse dynamics model. With embedded decoupled policy gradient and generative adversarial training, DePO enables knowledge transfer to different action spaces or state transition dynamics, and can generalize the planner to out-of-demonstration state regions. Our in-depth experimental analysis shows the effectiveness of DePO on learning a generalized target state planner while achieving the best imitation performance. We demonstrate the appealing usage of DePO for transferring across different tasks by pre-training, and the potential for co-training agents with various skills.
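The decoupling described above can be sketched on a toy one-dimensional chain. In this illustration the state planner and the inverse dynamics models are hand-coded rather than learned, and all function names are hypothetical; the point is only to show how one planner can be paired with different inverse dynamics models to act in different action spaces.

```python
from typing import Callable, TypeVar

Action = TypeVar("Action")


def make_planner(goal: int) -> Callable[[int], int]:
    """High-level state planner: proposes the next state, one step toward the goal."""
    def planner(state: int) -> int:
        if state < goal:
            return state + 1
        if state > goal:
            return state - 1
        return state
    return planner


def inverse_dynamics_numeric(state: int, next_state: int) -> int:
    """Inverse dynamics for an action space of integer displacements."""
    return next_state - state


def inverse_dynamics_named(state: int, next_state: int) -> str:
    """Inverse dynamics for a different action space with named actions."""
    return {-1: "left", 0: "stay", 1: "right"}[next_state - state]


def decoupled_policy(planner: Callable[[int], int],
                     inverse_dynamics: Callable[[int, int], Action]
                     ) -> Callable[[int], Action]:
    """Compose the planner with an inverse dynamics model into a policy."""
    def policy(state: int) -> Action:
        return inverse_dynamics(state, planner(state))
    return policy


# The same planner is reused with two different action spaces.
planner = make_planner(goal=5)
policy_numeric = decoupled_policy(planner, inverse_dynamics_numeric)
policy_named = decoupled_policy(planner, inverse_dynamics_named)
```

Swapping only the inverse dynamics model while keeping the planner fixed mirrors, in miniature, the transfer across action spaces that the abstract attributes to the decoupled structure.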
This paper focuses on the problem of detecting and reacting to changes in the distribution of a sensorimotor controller's observables. The key idea is the design of switching policies that can take conformal quantiles as input, which we define as conformal policy learning and which allows robots to detect distribution shifts with formal statistical guarantees. We show how to design such policies by using conformal quantiles to switch between base policies with different characteristics, e.g. safety or speed, or directly augmenting a policy observation with a quantile and training it with reinforcement learning. Theoretically, we show that such policies achieve the formal convergence guarantees in finite time. In addition, we thoroughly evaluate their advantages and limitations on two compelling use cases: simulated autonomous driving and active perception with a physical quadruped. Empirical results demonstrate that our approach outperforms five baselines. It is also the simplest of the baseline strategies besides one ablation. Being easy to use, flexible, and with formal guarantees, our work demonstrates how conformal prediction can be an effective tool for sensorimotor learning under
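One simple way to picture switching on a conformal quantile is sketched below. It uses the standard split-conformal quantile of calibration scores as a threshold and selects a cautious base policy when a new nonconformity score exceeds it. The scores, the threshold rule, and the policy labels are illustrative assumptions, not the paper's actual design.

```python
import math
from typing import Sequence


def conformal_quantile(scores: Sequence[float], alpha: float) -> float:
    """Split-conformal quantile: the ceil((n + 1) * (1 - alpha))-th smallest
    calibration score, or infinity if that rank exceeds n."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1.0 - alpha))
    if rank > n:
        return math.inf
    return sorted(scores)[rank - 1]


def select_policy(score: float, threshold: float) -> str:
    """Switch to the cautious base policy when the new score looks atypical."""
    return "safe" if score > threshold else "fast"


# Hypothetical nonconformity scores collected under nominal conditions.
calibration_scores = [0.3, 0.1, 0.9, 0.5, 0.7, 0.2, 0.8, 0.4, 1.0, 0.6]
threshold = conformal_quantile(calibration_scores, alpha=0.2)

choice_nominal = select_policy(0.4, threshold)
choice_shifted = select_policy(0.95, threshold)
```

With exchangeable calibration and test scores, a new score exceeds this threshold with probability at most `alpha`, which is what makes the quantile a principled signal for detecting a shift.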
What is the state of the research on crowdsourcing for policy making? This article begins to answer this question by collecting, categorizing, and situating an extensive body of the extant research investigating policy crowdsourcing, within a new framework built on fundamental typologies from each field. We first define seven universal characteristics of the three general crowdsourcing techniques (virtual labor markets, tournament crowdsourcing, open collaboration), to examine the relative trade-offs of each modality. We then compare these three types of crowdsourcing to the different stages of the policy cycle, in order to situate the literature spanning both domains. We finally discuss research trends in crowdsourcing for public policy, and highlight the research gaps and overlaps in the literature. KEYWORDS: crowdsourcing, policy cycle, crowdsourcing trade-offs, policy processes, policy stages, virtual labor markets, tournament crowdsourcing, open collaboration