Search results: Monitor

20 results found

Reliable Weak-to-Strong Monitoring of LLM Agents

arXiv

We stress test monitoring systems for detecting covert misbehavior in autonomous LLM agents (e.g., secretly sharing private information). To this end, we systematize a monitor red teaming (MRT) workflow that incorporates: (1) varying levels of agent and monitor situational awareness; (2) distinct adversarial strategies to evade the monitor, such as prompt injection; and (3) two datasets and environments -- SHADE-Arena for tool-calling agents and our new CUA-SHADE-Arena, which extends TheAgentCompany, for computer-use agents. We run MRT on existing LLM monitor scaffoldings, which orchestrate LLMs and parse agent trajectories, alongside a new hybrid hierarchical-sequential scaffolding proposed in this work. Our empirical results yield three key findings. First, agent awareness dominates monitor awareness: an agent's knowledge that it is being monitored substantially degrades the monitor's reliability. On the contrary, providing the monitor with more information about the agent is less helpful than expected. Second, monitor scaffolding matters more than monitor awareness: the hybrid scaffolding consistently outperforms baseline monitor scaffolding, and can enable weaker models to reli

Monitoring Monitorability

arXiv, 2025-12-20. Authors: Melody Y. Guan, Miles Wang, Micah Carroll

Observability into the decision making of modern AI systems may be required to safely deploy increasingly capable agents. Monitoring the chain-of-thought (CoT) of today's reasoning models has proven effective for detecting misbehavior. However, this "monitorability" may be fragile under different training procedures, data sources, or even continued system scaling. To measure and track monitorability, we propose three evaluation archetypes (intervention, process, and outcome-property) and a new monitorability metric, and introduce a broad evaluation suite. We demonstrate that these evaluations can catch simple model organisms trained to have obfuscated CoTs, and that CoT monitoring is more effective than action-only monitoring in practical settings. We compare the monitorability of various frontier models and find that most models are fairly, but not perfectly, monitorable. We also evaluate how monitorability scales with inference-time compute, reinforcement learning optimization, and pre-training model size. We find that longer CoTs are generally more monitorable and that RL optimization does not materially decrease monitorability even at the current frontier scale. Notably, we fin

Reliable Weak-to-Strong Monitoring of LLM Agents

Monitoring Monitorability

An Optimization Framework for Monitor Placement in Quantum Network Tomography

MonitoringBench: Semi-Automated Red-Teaming for Agent Monitoring

Monitoring Unmanned Aircraft: Specification, Integration, and Lessons-learned

Alignment Monitoring

Type-safe Monitoring of Parameterized Streams

Distributed Monitoring of Timed Properties

A Low-overhead Kernel Object Monitoring Approach for Virtual Machine Introspection

Mobile neutron monitor for latitude cosmic ray monitoring

Centralized vs Decentralized Monitors for Hyperproperties

Learning Verified Monitors for Hidden Markov Models

Online Causation Monitoring of Signal Temporal Logic

Configuration Monitor Synthesis

On the Need to Monitor Continuous Integration Practices -- An Empirical Study

Voting by Hands Promotes Institutionalised Monitoring in Indirect Reciprocity

Monitoring Decomposition Attacks in LLMs with Lightweight Sequential Monitors

Equity in the Distribution of Regulatory PM2.5 Monitors

Retroactive Parametrized Monitoring

The Development of Low-Q Cavity Type Beam Position Monitor with a Position Resolution of Nanometer for Future Colliders