We stress test monitoring systems for detecting covert misbehavior in autonomous LLM agents (e.g., secretly sharing private information). To this end, we systematize a monitor red teaming (MRT) workflow that incorporates: (1) varying levels of agent and monitor situational awareness; (2) distinct adversarial strategies to evade the monitor, such as prompt injection; and (3) two datasets and environments -- SHADE-Arena for tool-calling agents and our new CUA-SHADE-Arena, which extends TheAgentCompany, for computer-use agents. We run MRT on existing LLM monitor scaffoldings, which orchestrate LLMs and parse agent trajectories, alongside a new hybrid hierarchical-sequential scaffolding proposed in this work. Our empirical results yield three key findings. First, agent awareness dominates monitor awareness: an agent's knowledge that it is being monitored substantially degrades the monitor's reliability. In contrast, providing the monitor with more information about the agent is less helpful than expected. Second, monitor scaffolding matters more than monitor awareness: the hybrid scaffolding consistently outperforms baseline monitor scaffolding, and can enable weaker models to reli
Observability into the decision making of modern AI systems may be required to safely deploy increasingly capable agents. Monitoring the chain-of-thought (CoT) of today's reasoning models has proven effective for detecting misbehavior. However, this "monitorability" may be fragile under different training procedures, data sources, or even continued system scaling. To measure and track monitorability, we propose three evaluation archetypes (intervention, process, and outcome-property) and a new monitorability metric, and introduce a broad evaluation suite. We demonstrate that these evaluations can catch simple model organisms trained to have obfuscated CoTs, and that CoT monitoring is more effective than action-only monitoring in practical settings. We compare the monitorability of various frontier models and find that most models are fairly, but not perfectly, monitorable. We also evaluate how monitorability scales with inference-time compute, reinforcement learning optimization, and pre-training model size. We find that longer CoTs are generally more monitorable and that RL optimization does not materially decrease monitorability even at the current frontier scale. Notably, we fin
Quantum Network Tomography (QNT) offers a framework for end-to-end quantum channel characterization by strategically placing monitor nodes within the network. Building upon prior work on single-monitor placement, we study optimal monitor placement and measurement assignments for channel parameter estimation in arbitrary quantum networks. Using an n-node star network as a baseline, we analyze multi-monitor configurations and show that distributing monitors across end nodes can achieve estimation performance comparable to a monitor placed at the hub. Estimation precision is quantified using the Quantum Fisher Information Matrix (QFIM), with channel parameters inferred via Maximum Likelihood Estimation (MLE) and benchmarked against the Quantum Cramér-Rao Bound (QCRB). To generalize, we develop two Integer Linear Program (ILP) formulations: one maximizing estimation accuracy (QF), and another jointly optimizing accuracy and monitoring overhead (QMF). Unlike QF, QMF prevents monitor overloading, enabling scalability and parallelism. We prove optimality for star networks and analyze applicability to tree-structured quantum networks.
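For reference, the bound against which the MLE is benchmarked can be written in its standard multiparameter form. The symbols below are notation chosen here for illustration and are not taken from the paper: $\theta$ is the vector of channel parameters, $\hat{\theta}$ an unbiased estimator of it, $N$ the number of independent probe copies, and $F_Q(\theta)$ the QFIM.

$$
\mathrm{Cov}(\hat{\theta}) \;\succeq\; \frac{1}{N}\, F_Q(\theta)^{-1}
$$

Under this reading, a monitor placement whose QFIM is larger (in the positive semidefinite order) permits a smaller achievable estimation covariance.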
We introduce a red-teaming methodology that exposes harder-to-catch attacks for coding-agent monitors, suggesting that current practices may under-elicit attacks and overstate monitor performance. We identify three challenges with current red-teaming. First, mode collapse in attack generation, which we reduce with a novel attack taxonomy for broader coverage. Second, a conceive-execute gap: frontier LLMs can propose strong attack ideas or execute them, but not both at once. We mitigate this by decomposing attack construction into strategy generation, execution, and post-hoc trajectory refinement. Third, manual elicitation is costly to scale, which we address with our semi-automated red-teaming pipeline. Applied to BashArena, an AI control setting for tool-using coding agents, this pipeline produces MonitoringBench, a benchmark of 2,644 attack trajectories for evaluating monitor capabilities and failure modes. Our pipeline produces more diverse and stronger attacks: the Opus-4.5 monitor's catch rate falls from 94.9\% on elicited-only Opus attacks to 60.3\% on our best refined attacks, with larger drops for several mid-tier monitors. Attacks optimized against three development monitors ge
This paper reports on the integration of runtime monitoring into fully-electric aircraft designed by Volocopter, a German aircraft manufacturer of electric multi-rotor helicopters. The runtime monitor recognizes hazardous situations and system faults. Since the correct operation of the monitor is critical for the safety of the aircraft, the development of the monitor must follow strict aeronautical standards. This includes the integration of the monitor into different development environments, such as log-file analysis, hardware/software-in-the-loop testing, and test flights. We have used the stream-based monitoring framework RTLola to generate monitors for a range of requirements. In this paper, we present representative monitoring specifications and our lessons learned from integrating the generated monitors. Our main finding is that the specification and the integration need to be decoupled, because the specification remains stable throughout the development process, whereas the different development stages require a separate integration of the monitor into each environment. We achieve this decoupling with a novel abstraction layer in the monitoring framework that adapts the mon
Formal verification provides assurances that a probabilistic system satisfies its specification--conditioned on the system model being aligned with reality. We propose alignment monitoring to watch that this assumption is justified. We consider a probabilistic model well aligned if it accurately predicts the behaviour of an uncertain system in advance. An alignment score measures this by quantifying the similarity between the model's predicted and the system's (unknown) actual distributions. An alignment monitor observes the system at runtime; at each point in time it uses the current state and the model to predict the next state. After the next state is observed, the monitor updates the verdict, which is a high-probability interval estimate for the true alignment score. We utilize tools from sequential forecasting to construct our alignment monitors. Besides a monitor for measuring the expected alignment score, we introduce a differential alignment monitor, designed for comparing two models, and a weighted alignment monitor, which permits task-specific alignment monitoring. We evaluate our monitors experimentally on the PRISM benchmark suite. They are fast, memory-efficient, and d
Stream-based monitoring is a real-time safety assurance mechanism for complex cyber-physical systems such as unmanned aerial vehicles. The monitor aggregates streams of input data from sensors and other sources to give real-time statistics and assessments of the system's health. Since the monitor is a safety-critical component, it is mandatory to ensure the absence of runtime errors in the monitor. Providing such guarantees is particularly challenging when the monitor must handle unbounded data domains, like an unlimited number of airspace participants, requiring the use of dynamic data structures. This paper provides a type-safe integration of parameterized streams into the stream-based monitoring framework RTLola. Parameterized streams generalize individual streams to sets of an unbounded number of stream instances and provide a systematic mechanism for memory management. We show that the absence of runtime errors is, in general, undecidable but can be effectively ensured with a refinement type system that guarantees all memory references are either successful or backed by a default value. We report on the performance of the type analysis on example specifications from a range of
In formal verification, runtime monitoring consists of observing the execution of a system in order to decide as quickly as possible whether or not it satisfies a given property. We consider monitoring in a distributed setting, for properties given as reachability timed automata. In such a setting, the system is made of several components, each equipped with its own local clock and monitor. The monitors observe events occurring on their associated component, and receive timestamped events from other monitors through FIFO channels. Since clocks are local, they cannot be perfectly synchronized, resulting in imprecise timestamps. Consequently, they must be seen as intervals, leading monitors to consider possible reorderings of events. In this context, each monitor aims to provide, as early as possible, a verdict on the property it is monitoring, based on its potentially incomplete and imprecise knowledge of the current execution. In this paper, we propose an on-line monitoring algorithm for timed properties, robust to time imprecision and partial information from distant components. We first identify the date at which a monitor can safely compute a verdict based on received events. We
Monitoring modifications to the kernel objects of a virtual machine is widely used by virtual-machine-introspection-based security monitors to protect virtual machines in cloud computing, for example by monitoring dentry objects to intercept file operations. However, most current virtual machine monitors, such as KVM and Xen, only support page-level monitoring, because Intel EPT can only monitor page privileges. If out-of-virtual-machine security tools want to monitor particular kernel objects, they need to intercept operations on the whole memory page. Since other objects are also stored in the monitored pages, modifying them also triggers the monitor. Therefore, page-level memory monitoring usually introduces overhead to related kernel services of the target virtual machine. In this paper, we propose a low-overhead kernel object monitoring approach to reduce the overhead caused by page-level monitoring. The core idea is to migrate the target kernel objects to a protected memory area and then to monitor the corresponding new memory pages. Since the new pages only contain the kernel objects to be monitored, other kernel objects will not trigger our monitor. T
Neutron monitors are a standard tool for high-precision continuous monitoring of galactic cosmic ray flux variations arising from variations in heliospheric conditions and solar activity for space weather applications. These measurements form the basis for solving the inverse problem of determining the cosmic ray anisotropy vector beyond the magnetosphere. To support such studies, periodic latitude measurements are necessary to determine the coupling functions of primary and secondary cosmic ray variations. The aim of this work is to develop and characterize a modernized standard neutron monitor based on a CHM-15 boron thermal neutron counter and a data acquisition system designed for marine expeditionary studies of cosmic ray variations. Modern nuclear physics experimental methods and the principles of microprocessor-based data acquisition systems were used to solve this problem. The results of test trials and of continuous monitoring showed that the characteristics of the upgraded and standard neutron monitors are similar, and the ease of use, compactness, and stability allow us to conclude that the mobile neutron detector can be used in expeditionary conditions with limited acce
This paper focuses on the runtime verification of hyperproperties expressed in Hyper-recHML, an expressive yet simple logic for describing properties of sets of traces. To this end, we consider a simple language of monitors that observe sets of system executions and report verdicts w.r.t. a given Hyper-recHML formula. We first employ a unique omniscient monitor that centrally observes all system traces. Since centralized monitors are not ideal for distributed settings, we also provide a language for decentralized monitors, where each trace has a dedicated monitor; these monitors yield a unique verdict by communicating their observations to one another. For both the centralized and the decentralized settings, we provide a synthesis procedure that, given a formula, yields a monitor that is correct (i.e., sound and violation complete). A key step in proving the correctness of the synthesis for decentralized monitors is a result showing that, for each formula, the synthesized centralized monitor and its corresponding decentralized one are weakly bisimilar for a suitable notion of weak bisimulation.
Runtime monitors assess whether a system is in an unsafe state based on a stream of observations. We study the problem where the system is subject to probabilistic uncertainty and described by a hidden Markov model. A stream of observations is then unsafe if the probability of being in an unsafe state is above a threshold. A correct monitor recognizes the set of unsafe observations. The key contribution of this paper is the first correct-by-construction synthesis method for such monitors, represented as finite automata. The contribution combines four ingredients: First, we establish the coNP-hardness of checking whether an automaton is a correct monitor, i.e., a monitor without misclassifications. Second, we provide a reduction that reformulates the search for misclassifications into a standard probabilistic system synthesis problem. Third, we integrate the verification routine into an active automata learning routine to synthesize correct monitors. Fourth, we provide a prototypical implementation that shows the feasibility and limitations of the approach on a series of benchmarks.
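The unsafety criterion above (the probability of being in an unsafe state exceeds a threshold) can be illustrated with plain forward filtering over the hidden Markov model. This is only a minimal sketch of the specification a correct monitor must match; it is not the paper's synthesis method, which produces finite automata. The function name, the argument layout, and the convention that an observation is emitted before each transition are assumptions made for this example.

```python
from typing import List, Sequence, Set


def filter_unsafe(
    init: Sequence[float],
    trans: Sequence[Sequence[float]],
    emit: Sequence[Sequence[float]],
    unsafe: Set[int],
    observations: Sequence[int],
    threshold: float,
) -> List[bool]:
    """Return, per observation, whether P(unsafe state | observations so far) > threshold.

    init[s]     : probability of starting in state s
    trans[s][t] : probability of moving from state s to state t
    emit[s][o]  : probability of observing o while in state s
    """
    n = len(init)
    belief = list(init)
    verdicts: List[bool] = []
    for obs in observations:
        # Condition the current belief on the new observation (Bayes update).
        weighted = [belief[s] * emit[s][obs] for s in range(n)]
        total = sum(weighted)
        if total == 0.0:
            raise ValueError("observation has probability zero under the model")
        belief = [w / total for w in weighted]
        # Verdict: is the posterior mass on unsafe states above the threshold?
        verdicts.append(sum(belief[s] for s in unsafe) > threshold)
        # Predict the state distribution after one transition.
        belief = [sum(belief[s] * trans[s][t] for s in range(n)) for t in range(n)]
    return verdicts


# Hypothetical two-state model: state 0 is safe, state 1 is unsafe and absorbing.
INIT = [1.0, 0.0]
TRANS = [[0.5, 0.5], [0.0, 1.0]]
EMIT = [[0.9, 0.1], [0.2, 0.8]]
print(filter_unsafe(INIT, TRANS, EMIT, {1}, [0, 1], 0.5))  # [False, True]
```

A belief tracker like this stores a full probability vector, whereas the monitors in the abstract are finite automata over observations; the sketch only shows which observation streams such an automaton has to flag.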
Online monitoring is an effective validation approach for hybrid systems that checks, at runtime, whether the (partial) signals of a system satisfy a specification in, e.g., Signal Temporal Logic (STL). Classic STL monitoring is performed by computing a robustness interval that specifies, at each instant, how far the monitored signals are from violating and satisfying the specification. However, since a robustness interval monotonically shrinks during monitoring, classic online monitors may fail to report new violations or to precisely describe the system evolution at the current instant. In this paper, we tackle these issues by considering the causation of violation or satisfaction, instead of directly using the robustness. We first introduce a Boolean causation monitor that decides whether each instant is relevant to the violation or satisfaction of the specification. We then extend this monitor to a quantitative causation monitor that tells how far an instant is from being relevant to the violation or satisfaction. We further show that classic monitors can be derived from our proposed ones. Experimental results show that the two proposed monitors are able to provide mo
The observable behavior of a system usually carries useful information about its internal state, properties, and potential future behaviors. In this paper, we introduce configuration monitoring to determine an unknown configuration of a running system based on observations of its behavior. We develop a modular and generic pipeline to synthesize automata-theoretic configuration monitors from a featured transition system model of the configurable system to be monitored. The pipeline further allows synthesis under partial observability and network-induced losses as well as predictive configuration monitors taking the potential future behavior of a system into account. Beyond the novel application of configuration monitoring, we show that our approach also generalizes and unifies existing work on runtime monitoring and fault diagnosis, which aim at detecting the satisfaction or violation of properties and the occurrence of faults, respectively. We empirically demonstrate the efficacy of our approach with a case study on configuration monitors synthesized from configurable systems community benchmarks.
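The core idea of narrowing down an unknown configuration from observed behavior can be sketched as follows. This is a simplified illustration, not the paper's pipeline: it keeps a set of (configuration, state) candidates directly instead of synthesizing an automaton, and it ignores partial observability and losses. The transition encoding, the function name, and the vending-machine example are all hypothetical.

```python
from typing import Callable, FrozenSet, Iterable, List, Set, Tuple

Config = FrozenSet[str]
# (source state, observable action, feature guard, target state)
Transition = Tuple[str, str, Callable[[Config], bool], str]


def consistent_configs(
    transitions: List[Transition],
    initial: str,
    configs: Iterable[Config],
    observed: Iterable[str],
) -> Set[Config]:
    """Return the configurations under which the observed actions are possible."""
    # Start with every configuration paired with the initial state.
    candidates: Set[Tuple[Config, str]] = {(c, initial) for c in configs}
    for action in observed:
        # Keep only candidates that can take a matching, enabled transition.
        candidates = {
            (c, dst)
            for (c, state) in candidates
            for (src, act, guard, dst) in transitions
            if src == state and act == action and guard(c)
        }
    return {c for c, _ in candidates}


def always(_: Config) -> bool:
    """Guard for transitions present in every configuration."""
    return True


# Hypothetical featured transition system: tea is only served if the
# optional feature "tea" is part of the configuration.
TRANSITIONS: List[Transition] = [
    ("idle", "coin", always, "paid"),
    ("paid", "coffee", always, "idle"),
    ("paid", "tea", lambda c: "tea" in c, "idle"),
]
PLAIN: Config = frozenset()
WITH_TEA: Config = frozenset({"tea"})
```

In this toy model, observing `["coin", "tea"]` leaves only `WITH_TEA` as a candidate, while `["coin", "coffee"]` is consistent with both configurations and therefore reveals nothing.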
Continuous Integration (CI) encompasses a set of widely adopted practices that enhance software development. However, there are indications that developers may not adequately monitor CI practices. Hence, this paper explores developers' perceptions regarding the monitoring of CI practices. To achieve this, we first perform a Document Analysis to assess developers' expressed need for practice monitoring in pull request comments generated by developers during the development process. After that, we conduct a survey among developers from 121 open-source projects to understand their perception of the significance of monitoring seven CI practices in their projects. Finally, we triangulate the emergent themes from our survey by performing a second Document Analysis to understand the extent of monitoring features supported by existing CI services. Our key findings indicate that: 1) the most frequently mentioned CI practice during the development process is ``Test Coverage'' (> 80\%), while ``Build Health'' and ``Time to Fix a Broken Build'' present notable opportunities for monitoring CI practices; 2) developers do not adequately monitor all CI practices and express interest in monitoring addit
Indirect reciprocity based on reputation is a leading mechanism driving human cooperation, where monitoring of behaviour and sharing reputation-related information are crucial. Because collecting information is costly, a tragedy of the commons can arise, with some individuals free-riding on information supplied by others. This can be overcome by organising monitors that aggregate information, supported by fees from their information users. We analyse a co-evolutionary model of individuals playing a social dilemma game and monitors watching them; monitors provide information and players vote for a more beneficial monitor. We find that (1) monitors that simply rate defection badly cannot stabilise cooperation---they have to overlook defection against ill-reputed players; (2) such overlooking monitors can stabilise cooperation if players vote for monitors rather than to change their own strategy; (3) STERN monitors, who rate cooperation with ill-reputed players badly, stabilise cooperation more easily than MILD monitors, who do not do so; (4) a STERN monitor wins if it competes with a MILD monitor; and (5) STERN monitors require a high level of surveillance and achieve only lower leve
Current LLM safety defenses fail under decomposition attacks, where a malicious goal is decomposed into benign subtasks that circumvent refusals. The challenge lies in the existing shallow safety alignment techniques: they only detect harm in the immediate prompt and do not reason about long-range intent, leaving them blind to malicious intent that emerges over a sequence of seemingly benign instructions. We therefore propose adding an external monitor that observes the conversation at a higher granularity. To facilitate our study of monitoring decomposition attacks, we curate the largest and most diverse dataset to date, including question-answering, text-to-image, and agentic tasks. We verify our datasets by testing them on frontier LLMs and show an 87% attack success rate on average on GPT-4o. This confirms that decomposition attacks are broadly effective. Additionally, we find that random tasks can be injected into the decomposed subtasks to further obfuscate malicious intents. To defend in real time, we propose a lightweight sequential monitoring framework that cumulatively evaluates each subtask. We show that a carefully prompt-engineered lightweight monitor achieves a 93% defe
Unequal exposure to air pollution by race and socioeconomic status is well-documented in the U.S. However, there has been relatively little research on inequities in the collection of PM2.5 data, creating a critical gap in understanding which neighborhood exposures are represented in these datasets. In this study, we use multilevel models with random intercepts by county and state, stratified by urbanicity, to investigate the association between six key environmental justice (EJ) attributes (%AIAN, %Asian, %Black, %Hispanic, %White, %Poverty) and proximity to the nearest regulatory monitor at the census-tract level across the contiguous 48 states. We also separately stratify our models by EPA region. Our results show that most EJ attributes exhibit weak or statistically insignificant associations with monitor proximity, except in rural areas where higher poverty levels are significantly linked to greater monitor distances ($β$ = 0.6, 95%CI = [0.49, 0.71]). While the US EPA's siting criteria may be effective in ensuring equitable monitor distribution in some contexts, the low density of monitors in rural areas may impact the accuracy of national-level air pollution monitoring.
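A multilevel model with random intercepts by county and state can be written generically as below. The abstract does not give the exact specification (covariates, transformations, or distributional choices), so this is an illustrative form under assumed notation: $d_{ijk}$ is the distance from census tract $i$ in county $j$ of state $k$ to the nearest regulatory monitor, $x_{ijk}$ is one EJ attribute, and the Gaussian terms are a conventional assumption.

$$
d_{ijk} = \beta_0 + \beta_1 x_{ijk} + u_{jk} + v_k + \varepsilon_{ijk},
\qquad
u_{jk} \sim \mathcal{N}(0, \sigma_u^2),\;
v_k \sim \mathcal{N}(0, \sigma_v^2),\;
\varepsilon_{ijk} \sim \mathcal{N}(0, \sigma^2)
$$

In this form, the reported coefficient corresponds to $\beta_1$ for %Poverty in the rural stratum, with $u_{jk}$ and $v_k$ absorbing county-level and state-level differences in baseline monitor distance.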
In online monitoring, we first synthesize a monitor from a formal specification, which later runs in tandem with the system under study, incrementally receiving its progress and evolving with the system. In offline monitoring, the trace is logged as the system progresses, so that post-mortem analysis can be performed after the system has finished executing. In this paper we propose retroactive dynamic parametrization, a technique that allows a monitor to revisit the past log as it progresses, while still executing in an online manner. This feature allows new monitors to be incorporated into a running system and to revisit the past for particular behaviors based on new information discovered. Retroactive parametrization also allows a monitor to lazily ignore events and revisit and process them later, when it discovers that it should have followed those events. We showcase the use of retroactive dynamic parametrization to monitor denial of service attacks on a network using network logs.
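The idea of lazily ignoring events and revisiting the log once a parameter becomes interesting can be sketched in a few lines. This is an illustrative toy rather than the paper's formalism: the class name, the event kinds `"request"` and `"suspect"`, and the fixed request limit are all hypothetical choices for this example.

```python
from typing import Dict, List, Tuple

Event = Tuple[str, str]  # (kind, source address)


class RetroactiveMonitor:
    """Counts requests only for sources that were flagged as suspects.

    Requests from unflagged sources are logged but not processed. When a
    source is flagged, the monitor replays the past log for that source,
    then continues to count its requests online.
    """

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.log: List[Event] = []
        self.counts: Dict[str, int] = {}

    def step(self, event: Event) -> List[str]:
        """Process one event and return the sources currently over the limit."""
        kind, src = event
        self.log.append(event)
        if kind == "suspect" and src not in self.counts:
            # Retroactive instantiation: revisit the past log for this source.
            self.counts[src] = sum(
                1 for k, s in self.log if k == "request" and s == src
            )
        elif kind == "request" and src in self.counts:
            # Online processing for sources that are already tracked.
            self.counts[src] += 1
        return sorted(s for s, c in self.counts.items() if c > self.limit)
```

With `limit=2`, feeding two requests from `"a"`, one from `"b"`, a `"suspect"` event for `"a"`, and then a third request from `"a"` raises an alarm for `"a"` only at the last step; `"b"` is never instantiated, so its events cost nothing beyond logging.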
The nanometer beam size in future linear colliders requires a very high-resolution beam position monitor (BPM), since higher resolution allows more accurate position measurement at the interaction point. We developed and tested a low-Q C-band beam position monitor with nanometer-level position resolution. The C-band BPM was tested for the fast beam feedback system at the interaction point of ATF2 at KEK, where it is called the IPBPM (Interaction Point Beam Position Monitor). The average position resolution of the developed IPBPMs was measured to be 10.1 nm at a beam charge of $87\%$ of the nominal ATF2 charge. From the measured resolution, we expect a beam position resolution of around 8.8 nm and 4.4 nm under nominal ATF2 and ILC beam charge conditions, respectively, where the position resolution is below the vertical beam size in the ILC. In this paper, we describe the development of the IPBPM and the beam test results at the nanometer level of beam position resolution.
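The extrapolated figures are consistent with a resolution that scales inversely with beam charge. That scaling is an assumption inferred here from the quoted numbers, not a statement taken from the abstract; $\sigma$ denotes position resolution and $q$ beam charge, both notation chosen for this note.

$$
\sigma(q) \approx \sigma_{\text{meas}} \, \frac{q_{\text{meas}}}{q},
\qquad
\sigma(q_{\text{ATF2}}) \approx 10.1\,\text{nm} \times 0.87 \approx 8.8\,\text{nm}
$$

Under the same assumption, the 4.4 nm value corresponds to an ILC nominal charge of about twice the ATF2 nominal charge, since $8.8\,\text{nm} / 2 = 4.4\,\text{nm}$.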