Model internals encode rich information about how a large language model (LLM) processes its training data; however, post-training data engineering largely relies on external signals and ignores the intrinsic signals that lie in model internals. We propose SAERL, a data engineering framework for LLM reinforcement learning (RL). It models three intrinsic data properties (diversity, difficulty, and quality) using model internals extracted with a Sparse Autoencoder (SAE), an advanced mechanistic interpretability tool. Each property grounds a concrete data engineering operation: SAE-space clustering with moderate batch mixing for batch diversity control, a difficulty proxy for easy-to-hard curriculum ordering, and a quality probe for data filtering. SAERL improves average accuracy by 3.00% over vanilla GRPO and reaches target accuracy with 20% fewer training steps on Qwen2.5-Math-1.5B, with consistent gains across model scales and RL algorithms. Experiments show that SAE transfers effectively across model families and scales, serving as a lightweight and reusable data engineering tool. These results demonstrate that model internals are a powerful and practical source of signals for post-tr
Although large language models (LLMs) have tremendous utility, trustworthiness is still a chief concern: models often generate incorrect information with high confidence. While contextual information can help guide generation, identifying when a query would benefit from retrieved context and assessing the effectiveness of that context remains challenging. In this work, we operationalize interpretability methods to ascertain whether we can predict the correctness of model outputs from the model's activations alone. We also explore whether model internals contain signals about the efficacy of external context. We consider correct, incorrect, and irrelevant context and introduce metrics to distinguish amongst them. Experiments on six different models reveal that a simple classifier trained on intermediate layer activations of the first output token can predict output correctness with about 75% accuracy, enabling early auditing. Our model-internals-based metric significantly outperforms prompting baselines at distinguishing between correct and incorrect context, guarding against inaccuracies introduced by polluted context. These findings offer a lens to better understand the underlying
We introduce aligned probing, a novel interpretability framework that aligns the behavior of language models (LMs), based on their outputs, and their internal representations (internals). Using this framework, we examine over 20 OLMo, Llama, and Mistral models, bridging behavioral and internal perspectives for toxicity for the first time. Our results show that LMs strongly encode information about the toxicity level of inputs and subsequent outputs, particularly in lower layers. Focusing on how unique LMs differ offers both correlative and causal evidence that they generate less toxic output when strongly encoding information about the input toxicity. We also highlight the heterogeneity of toxicity, as model behavior and internals vary across unique attributes such as Threat. Finally, four case studies analyzing detoxification, multi-prompt evaluations, model quantization, and pre-training dynamics underline the practical impact of aligned probing with further concrete insights. Our findings contribute to a more holistic understanding of LMs, both within and beyond the context of toxicity.
Merging methods combine the weights of multiple language models (LMs) to leverage their capacities, such as for domain adaptation. While existing studies investigate merged models from a solely behavioral perspective, we offer the first comprehensive view by assessing and connecting their behavior and internals. We present a novel evaluation pipeline that first merges multiple parent LMs, and then evaluates the merged models in comparison to the initial ones based on their behavior on downstream tasks, like MMLU, and their internally encoded linguistic competence. We showcase this pipeline by assessing the merging of instruction fine-tuned with math- and code-adapted LMs from the Qwen2.5 family. Our results show that merging methods impact behavior and internals differently. While the performance of merged models is typically between that of the two parent models, their encoded information about linguistic phenomena, particularly in morphology and syntax, can surpass the parent models. Moreover, we find weak ranking correlation between the behavioral and internal evaluations. With our pipeline and initial results, we emphasize the need for more comprehensive evaluations of model merging
Modern artificial intelligence (AI) models are deployed on inference engines to optimize runtime efficiency and resource allocation, particularly for transformer-based large language models (LLMs). The vLLM project is a major open-source library to support model serving and inference. However, the current implementation of vLLM limits programmability of the internal states of deployed models. This prevents the use of popular test-time model alignment and enhancement methods. For example, it prevents the detection of adversarial prompts based on attention patterns or the adjustment of model responses based on activation steering. To bridge this critical gap, we present vLLM Hook, an open-source plug-in to enable the programming of internal states for vLLM models. Based on a configuration file specifying which internal states to capture, vLLM Hook provides seamless integration to vLLM and supports two essential features: passive programming and active programming. For passive programming, vLLM Hook probes the selected internal states for subsequent analysis, while keeping the model generation intact. For active programming, vLLM Hook enables efficient intervention of model generation
We introduce NNsight and NDIF, technologies that work in tandem to enable scientific study of the representations and computations learned by very large neural networks. NNsight is an open-source system that extends PyTorch to introduce deferred remote execution. The National Deep Inference Fabric (NDIF) is a scalable inference service that executes NNsight requests, allowing users to share GPU resources and pretrained models. These technologies are enabled by the Intervention Graph, an architecture developed to decouple experimental design from model runtime. Together, this framework provides transparent and efficient access to the internals of deep neural networks such as very large language models (LLMs) without imposing the cost or complexity of hosting customized models individually. We conduct a quantitative survey of the machine learning literature that reveals a growing gap in the study of the internals of large-scale AI. We demonstrate the design and use of our framework to address this gap by enabling a range of research methods on huge models. Finally, we conduct benchmarks to compare performance with previous approaches. Code, documentation, and tutorials are available
Ensuring the verifiability of model answers is a fundamental challenge for retrieval-augmented generation (RAG) in the question answering (QA) domain. Recently, self-citation prompting was proposed to make large language models (LLMs) generate citations to supporting documents along with their answers. However, self-citing LLMs often struggle to match the required format, refer to non-existent sources, and fail to faithfully reflect LLMs' context usage throughout the generation. In this work, we present MIRAGE -- Model Internals-based RAG Explanations -- a plug-and-play approach using model internals for faithful answer attribution in RAG applications. MIRAGE detects context-sensitive answer tokens and pairs them with retrieved documents contributing to their prediction via saliency methods. We evaluate our proposed approach on a multilingual extractive QA dataset, finding high agreement with human answer attribution. On open-ended QA, MIRAGE achieves citation quality and efficiency comparable to self-citation while also allowing for a finer-grained control of attribution parameters. Our qualitative evaluation highlights the faithfulness of MIRAGE's attributions and underscores the
Polkadot is a network protocol launched in 2020 with the ambition of unlocking the full potential of blockchain technologies. Its novel multi-chain protocol allows arbitrary data to be transferred across heterogeneous blockchains, enabling the implementation of a wide range of novel use cases. The Polkadot architecture is based on the principles of sharding, which promises to solve scalability and interoperability shortcomings that encumber many existing blockchain-based systems. Lured by these impressive features, investors immediately appreciated the Polkadot project, which is now firmly ranked among the top 10 cryptocurrencies by capitalization (around 20 billion USD). However, Polkadot has not received the same level of attention from academia that other proposals in the crypto domain have received so far, like Bitcoin, Ethereum, and Algorand, to cite a few. The Polkadot architecture is described and discussed only in the grey literature, and very little is known about its internals. In this paper, we provide the first systematic study on the Polkadot environment, detailing its protocols, governance, and economic model. Then, we identify several limitations -- supported by an empi
Guard models are widely used to detect harmful content in user prompts and LLM responses. However, state-of-the-art guard models rely solely on terminal-layer representations and overlook the rich safety-relevant features distributed across internal layers. We present SIREN, a lightweight guard model that harnesses these internal features. By identifying safety neurons via linear probing and combining them through an adaptive layer-weighted strategy, SIREN builds a harmfulness detector from LLM internals without modifying the underlying model. Our comprehensive evaluation shows that SIREN substantially outperforms state-of-the-art open-source guard models across multiple benchmarks while using 250 times fewer trainable parameters. Moreover, SIREN exhibits superior generalization to unseen benchmarks, naturally enables real-time streaming detection, and significantly improves inference efficiency compared to generative guard models. Overall, our results highlight LLM internal states as a promising foundation for practical, high-performance harmfulness detection.
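As a rough illustration of the idea of combining per-layer linear probes into one detector, here is a minimal Python sketch. It is not SIREN's implementation: the abstract does not specify the weighting scheme, so the softmax over learnable layer logits, and the names `probe_logit`, `harmfulness_score`, and `layer_logits`, are assumptions made only for this example. Neuron selection and training are omitted.

```python
import math
from typing import List, Sequence, Tuple


def softmax(xs: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def probe_logit(activation: Sequence[float],
                weights: Sequence[float],
                bias: float) -> float:
    """Linear probe: dot product of one layer's activation with probe weights."""
    return sum(a * w for a, w in zip(activation, weights)) + bias


def harmfulness_score(layer_activations: Sequence[Sequence[float]],
                      probes: Sequence[Tuple[Sequence[float], float]],
                      layer_logits: Sequence[float]) -> float:
    """Combine per-layer probe logits with softmax layer weights (an assumed
    weighting scheme) and squash the result to a probability."""
    alphas = softmax(layer_logits)
    combined = sum(
        alpha * probe_logit(act, w, b)
        for alpha, act, (w, b) in zip(alphas, layer_activations, probes)
    )
    return 1.0 / (1.0 + math.exp(-combined))


# Toy usage with two layers and hand-picked (hypothetical) probe weights.
acts = [[1.0, 0.0], [0.0, 1.0]]
probes = [([1.0, 0.0], 0.0), ([0.0, -1.0], 0.0)]
score_equal = harmfulness_score(acts, probes, [0.0, 0.0])
score_first = harmfulness_score(acts, probes, [10.0, -10.0])
```

With equal layer weights the two opposing probes cancel out, while a strong weight on the first layer lets its probe dominate the score.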
We study internalization processes, by which neural-network-based systems absorb an explicit computational procedure into their own weights, and how they facilitate learning. We investigate how transformers internalize the simulation of semiautomata by internalizing chain-of-thought (CoT) tokens, which classes of semiautomata are harder to internalize, and expose the flip side of internalization, that is, a progressive degradation of out-of-distribution performance. We then provide the first provable analysis of successful internalization: for the task of learning parities, we show that a simplified one-layer transformer provably first learns the target with explicit CoT supervision and then internalizes the autoregressive generation as CoT tokens are progressively removed, learning to directly compute the parity. This task is computationally hard to learn from data without CoT supervision. Finally, we discuss how learning through internalization relates to the \textit{Positive Distribution Shift} phenomenon recently introduced by~\citet{Med+26}.
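The parity task can be made concrete with a small Python sketch of what chain-of-thought supervision and its progressive removal might look like. The encoding is an assumption for illustration only: CoT tokens are taken to be running parities, and removal is assumed to drop the earliest intermediate tokens first. The names `running_parities` and `target_sequence` are hypothetical and do not come from the paper.

```python
from typing import List


def running_parities(bits: List[int]) -> List[int]:
    """Running XOR of the input bits; the last entry is the parity of all bits."""
    out = []
    acc = 0
    for b in bits:
        acc ^= b
        out.append(acc)
    return out


def parity_direct(bits: List[int]) -> int:
    """Parity computed in one step, with no intermediate tokens."""
    return sum(bits) % 2


def target_sequence(bits: List[int], keep: int) -> List[int]:
    """Training target with only the last `keep` intermediate CoT tokens
    retained, followed by the final answer. keep=0 means no CoT at all."""
    if not bits:
        raise ValueError("bits must be non-empty")
    cot = running_parities(bits)
    intermediate, answer = cot[:-1], cot[-1]
    start = len(intermediate) - min(max(keep, 0), len(intermediate))
    return intermediate[start:] + [answer]


bits = [1, 0, 1, 1]
full_cot = target_sequence(bits, keep=3)     # all intermediate steps + answer
partial_cot = target_sequence(bits, keep=2)  # fewer intermediate steps
no_cot = target_sequence(bits, keep=0)       # answer only
```

Lowering `keep` over the course of training mimics the progressive removal of CoT tokens; at `keep=0` the target is the bare parity, which the model must then compute directly.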
Tool-using agents that act in the world need to be both useful and safe. Well-calibrated model confidences can be used to weigh the risk versus reward of potential actions, but prior work shows that many models are poorly calibrated. Inspired by interpretability literature exploring the internals of models, we propose a novel class of model-internal confidence estimators (MICE) to better assess confidence when calling tools. MICE first decodes from each intermediate layer of the language model using logitLens and then computes similarity scores between each layer's generation and the final output. These features are fed into a learned probabilistic classifier to assess confidence in the decoded output. On the simulated trial and error (STE) tool-calling dataset using Llama3 models, we find that MICE beats or matches the baselines on smoothed expected calibration error. Using MICE confidences to determine whether to call a tool significantly improves over strong baselines on a new metric, expected tool-calling utility. Further experiments show that MICE is sample-efficient, can generalize zero-shot to unseen APIs, and results in higher tool-calling utility in scenarios with varying
The impact of international tourism on biodiversity risks has received considerable attention, yet quantitative research in this field remains relatively limited. This study constructs a biodiversity risk index for 155 countries and regions spanning the years 2001 to 2019, analysing how international tourism influences biodiversity risks in destination countries. The results indicate that the growth of international tourism significantly elevates biodiversity risks, with these effects displaying both lagging and cumulative characteristics. Furthermore, spatial analysis shows that international tourism also intensifies biodiversity risks in neighbouring countries. The extent of its impact varies according to the tourism model and destination. In addition, government regulations and international financial assistance play a crucial role in mitigating the biodiversity risks associated with international tourism.
We describe how finite colimits can be described using the internal language, also known as the Mitchell-Bénabou language, of a topos, provided the topos admits countably infinite colimits. This description is based on the set-theoretic definitions of colimits and coequalisers. However, the translation is not direct, owing to the differences between set theory and the internal language; these differences are described as internal versus external. Solutions to the hurdles that thus arise are given.
In a globalised world, inflation in a given country may be becoming less responsive to domestic economic activity, while being increasingly determined by international conditions. Consequently, understanding the international sources of vulnerability of domestic inflation is becoming fundamental for policy makers. In this paper, we propose the construction of Inflation-at-risk and Deflation-at-risk measures of vulnerability obtained using factor-augmented quantile regressions estimated with international factors extracted from a multi-level Dynamic Factor Model with overlapping blocks of inflations corresponding to economies grouped either in a given geographical region or according to their development level. The methodology is applied to inflation observed monthly from 1999 to 2022 for over 115 countries. We conclude that, in a large number of developed countries, international factors are relevant to explain the right tail of the distribution of inflation, and, consequently, they are more relevant for the vulnerability related to high inflation than for average or low inflation. However, while inflation of developing low-income countries is hardly affected by international co
The generation and propagation sites of internal tides in the Mediterranean Sea are mapped through a comprehensive high-resolution numerical study. Two ocean general circulation models were used for this: NEMO v3.6 and ICON-O, both hydrostatic ocean models based on primitive equations with Boussinesq approximation, where NEMO is a regional Mediterranean Sea model with an Atlantic box, and ICON a global model. Internal tides are widespread in the Mediterranean Sea. The primary generation sites (the Gibraltar Strait, Sicily Strait/Malta Bank, and Hellenic Arc) are mapped through analysis of the tidal barotropic to baroclinic energy conversion. Semidiurnal internal tides can propagate for hundreds of kilometres from these generation sites into the Algerian Sea, Tyrrhenian Sea, and Ionian Sea, respectively. Diurnal internal tides remain trapped along the bathymetry, and are generated in the central Mediterranean Sea and southeastern coasts of the basin. The total energy used for internal tide generation in the Mediterranean Sea is 2.89 GW in NEMO and 1.36 GW in ICON. Wavelengths of the first baroclinic modes of the M2 tide are calculated in various regions of the Mediterranean Sea wher
Prompts used in recent large language model based applications are often fixed and lengthy, leading to significant computational overhead. To address this challenge, we propose Generative Prompt Internalization (GenPI), a lightweight method that employs a joint training approach. GenPI not only replicates the behavior of models with prompt inputs but also generates the content of the prompt along with reasons for why the model's behavior should change accordingly. We demonstrate that our approach effectively internalizes complex prompts across various agent-based application scenarios. For effective training without interactions with the dedicated environments, we introduce a data synthesis technique that autonomously collects conversational datasets by swapping the roles of the agent and environment. This method is especially useful in scenarios where only a predefined prompt is available without a corresponding training dataset. By internalizing complex prompts, Generative Prompt Internalization enables high performance and efficient inference without the need for explicit prompts.
We establish the theories of Symmetric Teleparallel Equivalent to General Relativity (STEGR) in the internal space and investigate possible internal-space symmetries among primary constraint densities in the theories. First, we revisit STEGR in terms of the gauge approach to gravity and formulate it in the internal-space set-up. We find three possible formalisms according to the vanishing-torsion property. Then, we investigate possible internal-space symmetries in each formalism. We find that in our formulation there are two possible symmetries. One satisfies the translation symmetry but is broken in the local symmetry provided by the general linear group, which contains the local Lorentz symmetry. The other satisfies the latter symmetry but lacks the former. Finally, we conclude this work and discuss future perspectives.
A sheaf of modules on a site is said to be internally projective if sheaf hom with the module preserves epimorphisms. In this note, we give an example showing that internally projective sheaves of abelian groups are not in general stable under base change to a slice. This shows that internal projectivity is weaker than projectivity in the internal logic of the topos, as expressed for example in terms of Shulman's stack semantics. The sheaf of groups that we use as a counterexample comes from recent work by Clausen and Scholze on light condensed sets.
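For reference, the definition used above can be written out as follows. This is a sketch in assumed notation: $\mathcal{H}om$ for the sheaf hom, $\mathcal{C}$ for the site, and $P|_U$ for the restriction of $P$ to the slice over an object $U$ are chosen here and do not come from the abstract.

```latex
% Internal projectivity of a sheaf of modules P:
% the sheaf hom functor out of P preserves epimorphisms.
P \text{ is internally projective}
\iff
\text{for every epimorphism } M \twoheadrightarrow N,\quad
\mathcal{H}om(P, M) \longrightarrow \mathcal{H}om(P, N)
\text{ is an epimorphism.}

% Stability under base change to a slice would additionally require,
% for every object U of the site \mathcal{C}:
P|_U \text{ is internally projective over } \mathcal{C}/U .
```

The counterexample described in the note shows that the first condition does not imply the second in general.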
When designing a robot's internal system, one often makes assumptions about the structure of the intended environment of the robot. One may even assign meaning to various internal components of the robot in terms of expected environmental correlates. In this paper we want to make the distinction between the robot's internal and external worlds clear-cut. Can the robot learn about its environment, relying only on internally available information, including the sensor data? Are there mathematical conditions on the internal robot system which can be internally verified and make the robot's internal system mirror the structure of the environment? We prove that sufficiency is such a mathematical principle, and mathematically describe the emergence of the robot's internal structure isomorphic or bisimulation equivalent to that of the environment. A connection to the free-energy principle is established, when sufficiency is interpreted as a limit case of surprise minimization. As such, we show that surprise minimization leads to having an internal model isomorphic to the environment. This also parallels the Good Regulator Principle which states that controlling a system sufficiently well mean
In both quantum mechanics and relativity theory, the concept of the observer plays a critical role. However, there is no consensus on the definition of the observer in these theories. Following Einstein's thought experiments, one could ask: What would it look like to sit inside a photon or to be a photon? And what type of observer could represent this more global perspective of the photon's interior? To address these questions, we introduce the concepts of internal and external observers with a focus on their relationship in quantum theory and relativity theory. The internal observer, associated with the internal observables super-algebra, glues the external interactions. Drawing inspiration from the advancements in abstract algebraic topology, we propose a mathematical representation of the internal observer. We also outline principles for ensuring the consistency of observers in terms of information theory. It becomes evident, through the analysis of the introduced hierarchy of observers, that entanglement is a primitive of space-time causal relationships. While external observers must abide by the relativistic causality linked with the no-signaling principle in quantum mechanics, the