Search - ResearchTracker

Ethical conduct in digital research is full of grey areas. Disciplinary, institutional and individual norms and conventions developed to support research are challenged, often leaving scholars with a sense of unease or lack of clarity. The growing availability of hacked data is one such area. Discussions and debates around the use of these datasets in research are extremely limited. The history, culture, and morality of hacking have attracted some scholarly attention. However, how to undertake research with this data is less examined and provides an opportunity for the generation of reflexive ethical practice. This article presents a case study outlining the ethical debates that arose when considering the use of hacked data to examine online far-right violent extremism. It argues that under certain circumstances, researchers can do ethical research with hacked data. However, to do so we must proactively and continually engage deeply with ethical quandaries and dilemmas.

HARVE: Hacking-Aware Reward-Head Vector Editing for Robust Reward Models

arXiv, 2026-06-02. Authors: Shuang Liu, Yuxuan Bo, Qiuyang Zhao

Reward models are central to large language model (LLM) alignment, but they remain vulnerable to reward hacking. To evaluate reward-model robustness, we introduce RewardHackBench, containing 13 reward-hacking patterns covering real-life high-stakes domains and general settings, and we find severe failures on specific subcategories across eight reward models. To mitigate these failures, we propose HARVE, a training-free reward-head editing method for scalar reward models. Instead of fine-tuning the reward model, HARVE identifies a multi-directional hacking subspace from residual stream directions associated with selected hacking subcategories, and removes the component of the reward-head vector aligned with that subspace. This directly reduces the reward head's sensitivity to hacking-related features using only a small set of contrastive gold-hacked examples, without gradient updates or fine-tuning. Comprehensive experiments across eight reward models indicate that HARVE improves hacking robustness, outperforms fine-tuning baselines, and preserves the reward models' general capability. Further analyses suggest that reward hacking is better captured as a multidimensional residual-space

Search results: Hacked

Ethical conundrums: Hacked data in the study of far-right violent extremism

HARVE: Hacking-Aware Reward-Head Vector Editing for Robust Reward Models

Understanding Reward Hacking in Text-to-Image Reinforcement Learning

Likelihood hacking in probabilistic program synthesis

Monitoring Emergent Reward Hacking During Generation via Internal Activations

On Benchmark Hacking in ML Contests: Modeling, Insights and Design

Detecting and Suppressing Reward Hacking with Gradient Fingerprints

Countdown-Code: A Testbed for Studying The Emergence and Generalization of Reward Hacking in RLVR

Directional Alignment Mitigates Reward Hacking in Reinforcement Learning for Language Models

Large Language Models Hack Rewards, and Society

Benchmarking Reward Hack Detection in Code Environments via Contrastive Analysis

When Reward Hacking Rebounds: Understanding and Mitigating It with Representation-Level Signals

Reproducing, Analyzing, and Detecting Reward Hacking in Rubric-Based Reinforcement Learning

From Reward-Hack Activations to Agentic Risk States: Context-Calibrated Mechanistic Monitoring in LLM Agents

Proxy Reward Internalization and Mechanistic Exploitation: A Learned Precursor to Reward Hacking and Its Generalization

Do Synthetic Trajectories Reflect Real Reward Hacking? A Systematic Study on Monitoring In-the-Wild Hacking in Code Generation

HACK NDSU: A Real-world Event to Promote Student Interest in Cybersecurity

Hack-Verifiable Environments: Towards Evaluating Reward Hacking at Scale

Alleviating Attention Hacking in Discriminative Reward Modeling through Interaction Distillation

Adversarial Reward Auditing for Active Detection and Mitigation of Reward Hacking