Search - ResearchTracker

Model internals encode rich information about how a large language model (LLM) processes its training data; however, post-training data engineering largely relies on external signals and ignores rich intrinsic signals lying in model internals. We propose SAERL, a data engineering framework for LLM reinforcement learning (RL). It models three intrinsic data properties: diversity, difficulty, and quality, using model internals extracted with Sparse Autoencoder (SAE), an advanced mechanistic interpretability tool. Each property grounds a concrete data engineering operation: SAE-space clustering with moderate batch mixing for batch diversity control, a difficulty proxy for easy-to-hard curriculum ordering, and a quality probe for data filtering. SAERL improves average accuracy by 3.00% over vanilla GRPO and reaches target accuracy with 20% fewer training steps on Qwen2.5-Math-1.5B, with consistent gains across model scales and RL algorithms. Experiments show that SAE transfers effectively across model families and scales, serving as a lightweight and reusable data engineering tool. These results demonstrate that model internals are a powerful and practical source of signals for post-tr

LLM Microscope: What Model Internals Reveal About Answer Correctness and Context Utilization

arXiv, 2025-10-05. Authors: Jiarui Liu, Jivitesh Jain, Mona Diab

Although large language models (LLMs) have tremendous utility, trustworthiness is still a chief concern: models often generate incorrect information with high confidence. While contextual information can help guide generation, identifying when a query would benefit from retrieved context and assessing the effectiveness of that context remains challenging. In this work, we operationalize interpretability methods to ascertain whether we can predict the correctness of model outputs from the model's activations alone. We also explore whether model internals contain signals about the efficacy of external context. We consider correct, incorrect, and irrelevant context and introduce metrics to distinguish amongst them. Experiments on six different models reveal that a simple classifier trained on intermediate layer activations of the first output token can predict output correctness with about 75% accuracy, enabling early auditing. Our model-internals-based metric significantly outperforms prompting baselines at distinguishing between correct and incorrect context, guarding against inaccuracies introduced by polluted context. These findings offer a lens to better understand the underlying

Search results: internals

Guiding LLM Post-training Data Engineering with Model Internals from Sparse Autoencoders

LLM Microscope: What Model Internals Reveal About Answer Correctness and Context Utilization

Aligned Probing: Relating Toxic Behavior and Model Internals

A Pipeline to Assess Merging Methods via Behavior and Internals

vLLM Hook v0: A Plug-in for Programming Model Internals on vLLM

NNsight and NDIF: Democratizing Access to Open-Weight Foundation Model Internals

Model Internals-based Answer Attribution for Trustworthy Retrieval-Augmented Generation

Analysis of Polkadot: Architecture, Internals, and Contradictions

LLM Safety From Within: Detecting Harmful Content with Internal Representations

Learning through Internalization

MICE for CATs: Model-Internal Confidence Estimation for Calibrating Agents with Tools

International Tourism and Global Biodiversity Risks

The Internal Logic and Finite Colimits

International vulnerability of inflation

Internal tides in the Mediterranean Sea

Generative Prompt Internalization

STEGR in Internal-Space Formulation: Formalisms, Primary Constraints, and Possible Internal Symmetries

On internally projective sheaves of groups

An Internal Model Principle For Robots

Towards Physics of Internal Observers: Exploring the Roles of External and Internal Observers