Search - ResearchTracker

Reward hacking in code generation, where models exploit evaluation loopholes to obtain full reward without correctly solving the tasks, poses a critical challenge for Reinforcement Learning (RL) and the deployment of reasoning models. Existing studies have been conducted primarily on synthetic hacking trajectories. However, whether these synthetic behaviors faithfully represent naturally emerging hacking in the wild remains unclear. In this work, we present a systematic analysis of the synthetic vs. in-the-wild discrepancy in reward hacking. We examine to what extent hacking behaviors induced by prompting resemble those emerging during RL training, and whether monitors trained on synthetic trajectories generalize to naturally arising but previously unseen hacking. To scale up the curation of in-the-wild reward hacking trajectories, we modified Group Relative Policy Optimization (GRPO) by injecting conflicting unit tests as tracers and applying a "resampling-until-hack" mechanism. Through controlled comparisons between monitors trained on synthetic versus in-the-wild data, we find that (1) synthetic-data-trained monitors fail to generalize to "in-the-wild" hacking, and (2) monitors

Adversarial Reward Auditing for Active Detection and Mitigation of Reward Hacking

arXiv, 2026-02-02. Authors: Mohammad Beigi, Ming Jin, Junshan Zhang

Reinforcement Learning from Human Feedback (RLHF) remains vulnerable to reward hacking, where models exploit spurious correlations in learned reward models to achieve high scores while violating human intent. Existing mitigations rely on static defenses that cannot adapt to novel exploitation strategies. We propose Adversarial Reward Auditing (ARA), a framework that reconceptualizes reward hacking as a dynamic, competitive game. ARA operates in two stages: first, a Hacker policy discovers reward model vulnerabilities while an Auditor learns to detect exploitation from latent representations; second, Auditor-Guided RLHF (AG-RLHF) gates reward signals to penalize detected hacking, transforming reward hacking from an unobservable failure into a measurable, controllable signal. Experiments across three hacking scenarios demonstrate that ARA achieves the best alignment-utility tradeoff among all baselines: reducing sycophancy to near-SFT levels while improving helpfulness, decreasing verbosity while achieving the highest ROUGE-L, and suppressing code gaming while improving Pass@1. Beyond single-domain evaluation, we show that reward hacking, detection, and mitigation all generalize acro

Search results: hack

Do Synthetic Trajectories Reflect Real Reward Hacking? A Systematic Study on Monitoring In-the-Wild Hacking in Code Generation

Adversarial Reward Auditing for Active Detection and Mitigation of Reward Hacking

Countdown-Code: A Testbed for Studying The Emergence and Generalization of Reward Hacking in RLVR

School of Reward Hacks: Hacking harmless tasks generalizes to misaligned behavior in LLMs

Benchmarking Reward Hack Detection in Code Environments via Contrastive Analysis

Detecting and Suppressing Reward Hacking with Gradient Fingerprints

EvilGenie: A Reward Hacking Benchmark

Natural Emergent Misalignment from Reward Hacking in Production RL

Understanding Reward Hacking in Text-to-Image Reinforcement Learning

Fairness Hacking: The Malicious Practice of Shrouding Unfairness in Algorithms

HACK NDSU: A Real-world Event to Promote Student Interest in Cybersecurity

Ethical conundrums: Hacked data in the study of far-right violent extremism

Can LLMs Hack Enterprise Networks? -- Replicated Computational Results (RCR) Report

Spontaneous Reward Hacking in Iterative Self-Refinement

SoK: A Review of Cross-Chain Bridge Hacks in 2023

HACK: Homomorphic Acceleration via Compression of the Key-Value Cache for Disaggregated LLM Inference

On Teacher Hacking in Language Model Distillation

Prompt-Hacking: The New p-Hacking?

X Hacking: The Threat of Misguided AutoML

Ethical Hacking and its role in Cybersecurity