20 results found
No abstract available (see the original article for the full content)
In this paper, we investigate whether refusal behavior can be predicted from LLM intermediate activations before decoding using linear probes trained on residual stream activations at each transformer block. We find that refusal is linearly decodable well before the final layer, indicating that safety-relevant behavior is represented in intermediate activations before output generation. To test whether this signal is actionable, we introduce Mechanistic AutoDAN, a probe-guided variant of AutoDAN that replaces full-model fitness evaluation with partial forward passes and probe-based scoring inside a genetic prompt search loop. Across the evaluated models, our method achieves attack success rates competitive with vanilla AutoDAN while reducing per-iteration search time by up to 72%, and probe-guided prompts match or exceed AutoDAN's cross-model transfer in several configurations. We further find that the usefulness of probe guidance increases with model scale. Our results show that refusal is not only observable at the output level, but is encoded as a structured and actionable signal in intermediate LLM activations.
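The probing idea in the abstract above can be illustrated with a minimal sketch. It is not the authors' code: the "activations" are synthetic Gaussian vectors, the per-layer `separation` values are invented for illustration, and `make_activations`, `train_probe`, and `accuracy` are hypothetical helper names. The sketch only shows what "linearly decodable" means in practice: a logistic-regression probe is fit per layer, and its held-out accuracy rises where the two classes are more separated along some direction.

```python
import math
import random


def sigmoid(t):
    # Numerically stable logistic function.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def make_activations(rng, n, dim, separation):
    """Synthetic stand-in for residual-stream activations (assumption).

    Label 1 plays the role of "refusal"; the classes differ only along
    coordinate 0, shifted by +/- separation.
    """
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x[0] += separation if label == 1 else -separation
        data.append((x, label))
    return data


def train_probe(data, dim, epochs=100, lr=0.1):
    """Fit a linear probe (logistic regression) with full-batch gradient descent."""
    w, b = [0.0] * dim, 0.0
    n = len(data)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in data:
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * gi / n for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def accuracy(w, b, data):
    """Fraction of examples whose predicted class matches the label."""
    hits = 0
    for x, y in data:
        predicted = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
        hits += int(predicted == y)
    return hits / len(data)


rng = random.Random(0)
dim = 8
# Hypothetical per-layer class separations: none early, large late.
layer_separations = [0.0, 1.0, 3.0]
layer_accuracies = []
for separation in layer_separations:
    train = make_activations(rng, 200, dim, separation)
    held_out = make_activations(rng, 200, dim, separation)
    w, b = train_probe(train, dim)
    layer_accuracies.append(accuracy(w, b, held_out))

for layer, acc in enumerate(layer_accuracies):
    print(f"layer {layer}: held-out probe accuracy = {acc:.2f}")
```

With real models, the synthetic vectors would be replaced by activations captured at each transformer block, which is outside the scope of this sketch.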
Existing language model benchmarks provide contradictory model rankings, even for benchmarks that aim to capture similar skills. This dilemma of conflicting rankings hampers model selection, clouds model comparisons, and adds confusion to a growing ecosystem of competing models. In this paper, we take a different perspective on model comparison: instead of relying on out-of-the-box performance via direct evaluation, we compare model potential by providing each model with identical benchmark-specific fine-tuning before evaluation. We call this approach train-before-test. Our primary contribution is a comprehensive empirical evaluation of model potential across 24 benchmarks and 61 models. First, we demonstrate that model potential rankings obtained through train-before-test exhibit remarkable consistency across all benchmarks. Whereas traditional rankings demonstrate little external validity under direct evaluation, they enjoy a significant degree of external validity when applying train-before-test: model potential rankings transfer gracefully from one benchmark to another. Second, train-before-test restores the connection between perplexity and downstream task performance, lost un
Using an individual-level panel dataset from Japan covering the period 2016-2024, we examined how the COVID-19 pandemic, as an unanticipated public crisis, affected preferences for income redistribution. Furthermore, we investigated how the association between redistribution preferences and trust in government changed before and after COVID-19. The major findings are as follows: (1) individuals in the high-income group are less likely to prefer redistribution after COVID-19 than before it; (2) the degree of decline in redistribution preference is lower when trust in government is higher; and (3) generalised trust and reciprocity did not influence the decline in preference.
Public debate links worsening job prospects for AI-exposed occupations to the release of ChatGPT in late 2022. Using monthly U.S. unemployment insurance records, we measure occupation- and location-specific unemployment risk and find that risk rose in AI-exposed occupations beginning in early 2022, months before ChatGPT. Analyzing millions of LinkedIn profiles, we show that graduate cohorts from 2021 onward entered AI-exposed jobs at lower rates than earlier cohorts, with gaps opening before late 2022. Finally, from millions of university syllabi, we find that graduates taking more AI-exposed curricula had higher first-job pay and shorter job searches after ChatGPT. Together, these results point to forces pre-dating generative AI and to the ongoing value of LLM-relevant education.
We construct a minimal task that isolates conditional learning in neural networks: a surjective map with K-fold ambiguity, resolved by a selector token z, so H(A | B) = log K while H(A | B, z) = 0. The model learns the marginal P(A | B) first, producing a plateau at exactly log K, before acquiring the full conditional in a sharp, collective transition. The plateau has a clean decomposition: height = log K (set by ambiguity), duration = f(D) (set by dataset size D, not K). Gradient noise stabilizes the marginal solution: higher learning rates monotonically slow the transition (3.6× across a 7× η range at fixed throughput), and batch-size reduction delays escape, consistent with an entropic force opposing departure from the low-gradient marginal. Internally, a selector-routing head assembles during the plateau, leading the loss transition by ~50% of the waiting time. This is the Type 2 directional asymmetry of Papadopoulos et al. [2024], measured dynamically: we track the excess risk from log K to zero and characterize what stabilizes it, what triggers its collapse, and how long it takes.
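The two entropy values quoted in the abstract above can be checked on a toy instance. The construction below is an assumption made for illustration, not the paper's dataset: each of `num_b` values of B has exactly K preimages, encoded as `a = b * K + z`, and all (b, z) pairs are equally likely. `conditional_entropy` is a hypothetical helper name.

```python
import math
from collections import Counter


def conditional_entropy(pairs):
    """H(Y | X) in nats, treating each (x, y) sample as equally likely."""
    joint = Counter(pairs)
    marginal = Counter()
    for (x, _y), count in joint.items():
        marginal[x] += count
    total = sum(joint.values())
    entropy = 0.0
    for (x, _y), count in joint.items():
        # p(x, y) * log p(y | x), summed with a minus sign.
        entropy -= (count / total) * math.log(count / marginal[x])
    return entropy


K, num_b = 4, 8
# Illustrative encoding: A = b * K + z, so the map A -> B = A // K is
# surjective with exactly K preimages per value of B.
samples = [(b, z, b * K + z) for b in range(num_b) for z in range(K)]

h_a_given_b = conditional_entropy([(b, a) for b, _z, a in samples])
h_a_given_bz = conditional_entropy([((b, z), a) for b, z, a in samples])

print(f"H(A | B)    = {h_a_given_b:.4f} nats (log K = {math.log(K):.4f})")
print(f"H(A | B, z) = {h_a_given_bz:.4f} nats")
```

A model that has learned only the marginal P(A | B) therefore cannot do better than a cross-entropy of log K, which is the plateau height the abstract describes.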
In the Frontier AI Safety Commitments, sixteen companies committed to "Assess the risks posed by their frontier models or systems across the AI lifecycle, including [...] as appropriate, before and during training" (I) and to "Provide public transparency on the implementation of the above (I-VI), except insofar as doing so would increase risk or divulge sensitive commercial information to a degree disproportionate to the societal benefit. They should still share more detailed information which cannot be shared publicly with trusted actors, including their respective home governments or appointed body, as appropriate" (VII). This short paper considers what information should be shared with whom before training begins. What information should be shared publicly and what only with trusted actors such as home governments? Sharing such information before a frontier training run can build shared awareness and preparedness, can improve risk assessment and management, and can contribute to greater predictability and accountability. Companies could share certain information before a training run including: Expected dates of beginning and end of training; Expected compute used (in FLOP); Des
Orthogonalized-update optimizers such as Muon improve training of matrix-valued parameters, but existing extensions typically either rescale updates after orthogonalization or use heavier whitening-based preconditioners before it. We introduce {\method}, a lightweight family of pre-orthogonalization equilibration schemes for Muon with three forms: two-sided row/column normalization (RC), row normalization (R), and column normalization (C). By rebalancing the momentum matrix before finite-step Newton--Schulz orthogonalization, {\method} improves the geometry seen by orthogonalization. We show that finite-step orthogonalization is governed by the input spectrum, especially stable rank and condition number, and that row/column normalization acts as a zeroth-order surrogate for whitening. For hidden matrix weights, R is the default variant. Theoretically, {\method} (R) retains the standard $\widetilde{\mathcal O}(T^{-1/4})$ Muon-type nonconvex stationarity guarantee with decoupled weight decay and a horizon-free diminishing learning-rate schedule, and extends it to finite-step NS5 up to an explicit inexactness constant. In LLaMA2 pretraining on C4, {\method} (R) consistently outperform
We propose a method to predict toxicity and other textual attributes through the use of natural language processing (NLP) techniques for two recent events: the Ukraine-Russia and Hamas-Israel conflicts. This article provides a basis for exploration in future conflicts, with the hope of mitigating risk through the analysis of social media before and after a conflict begins. Our work compiles several datasets from Twitter and Reddit for both conflicts, separated into before and after periods, with the aim of predicting a future state of social media for avoidance. More specifically, we show that: (1) there is a noticeable difference in social media discussion leading up to and following a conflict and (2) social media discourse on platforms like Twitter and Reddit is useful in identifying future conflicts before they arise. Our results show that through the use of advanced NLP techniques (both supervised and unsupervised), toxicity and other attributes of language before and after a conflict are predictable with a low error of nearly 1.2 percent for both conflicts.
We investigate whether the success of a zero-shot Chain-of-Thought (CoT) process can be predicted before completion. We discover that a probing classifier, based on LLM representations, performs well \emph{even before a single token is generated}, suggesting that crucial information about the reasoning process is already present in the initial steps representations. In contrast, a strong BERT-based baseline, which relies solely on the generated tokens, performs worse, likely because it depends on shallow linguistic cues rather than deeper reasoning dynamics. Surprisingly, using later reasoning steps does not always improve classification. When additional context is unhelpful, earlier representations resemble later ones more, suggesting LLMs encode key information early. This implies reasoning can often stop early without loss. To test this, we conduct early stopping experiments, showing that truncating CoT reasoning still improves performance over not using CoT at all, though a gap remains compared to full reasoning. However, approaches like supervised learning or reinforcement learning designed to shorten CoT chains could leverage our classifier's guidance to identify when early s
Using the norm operator method, which extends and corrects the conventional boson expansion theories, we investigate two boson mappings of the boson expansion theory, the so-called mapping after truncation and the mapping before truncation. The difference between them stems from the treatment of the phonon excitation modes; those not adopted as boson excitation modes in the former mapping are first all adopted as boson excitation modes and then truncated later in the latter mapping. If and only if the commutation relations among the phonon operators are closed among the excitation modes adopted as the boson excitation modes in the mapping after truncation, the mapping after and the mapping before truncation coincide, not depending on the types, Hermitian and non-Hermitian. We also investigate the Park operator, which judges whether a boson state vector is physical, and reveal that the conventional claim, which claims that it is applicable only when the mapping is that of the whole fermion space, is incorrect.
Understanding stars and their evolution is a key goal of astronomical research and has long been a focus of human interest. In recent years, theorists have paid much attention to the final interior processes within massive stars, as they can be essential for revealing neutrino-driven supernova mechanisms and other potential transients of massive star collapse. However, it is challenging to observe directly the last hours of a massive star before explosion, since it is the supernova event that triggers the start of intense observational study. Here we report evidence for a final phase of stellar activity known as a ``shell merger'', an intense shell burning in which the O-burning shell swallows its outer C-/Ne-burning shell, deep within the progenitor's interior moments before the supernova explosion. In the violent convective layer created by the shell merger, Ne, which is abundant in the stellar O-rich layer, is burned as it is pulled inward, and Si, which is synthesized inside, is transported outward. The remnant still preserves some traces of such Ne-rich downflows and Si-rich upflows in the O-rich layer, suggesting that inhomogeneous shell-merger mixing began just hours ($\less
Life is an exergonic chemical reaction. Many individual reactions in metabolism entail slightly endergonic steps that are coupled to free energy release, typically as ATP hydrolysis, in order to go forward. ATP is almost always supplied by the rotor-stator ATP synthase, which harnesses chemiosmotic ion gradients. Because the ATP synthase is a protein, it arose after the ribosome did. What was the energy currency of metabolism before the origin of the ATP synthase and how (and why) did ATP come to be the universal energy currency? About 27% of a cell's energy budget is consumed as GTP during translation. The universality of GTP-dependence in ribosome function indicates that GTP was the ancestral energy currency of protein synthesis. The use of GTP in translation and ATP in small molecule synthesis are conserved across all lineages, representing energetic compartments that arose in the last universal common ancestor, LUCA. And what came before GTP? Recent findings indicate that the energy supporting the origin of LUCA's metabolism stemmed from H2-dependent CO2 reduction along routes that strongly resemble the reactions and transition metal catalysts of the acetyl-CoA pathway.
We have monitored the Didymos-Dimorphos binary asteroid in spectropolarimetric mode in the optical range before and after the DART impact. The ultimate goal was to obtain constraints on the characteristics of the ejected dust for modelling purposes. Before impact, Didymos exhibited a linear polarization rapidly increasing with phase angle, reaching a level of about 5% in the blue and about 4.5% in the red. The shape of the polarization spectrum was anti-correlated with that of its reflectance spectrum, which appeared typical of an S-class asteroid. After impact, the level of polarization dropped by about 1 percentage point (pp) in the blue band and about 0.5 pp in the red band, then continued to linearly increase with phase angle, with a slope similar to that measured prior to impact. The polarization spectra, once normalised by their values at an arbitrary wavelength, show very little or no change over the course of all observations, before and after impact. The lack of any remarkable change in the shape of the polarization spectrum after impact suggests that the way in which polarization varies with wavelength depends on the composition of the scattering material, rather than on i
Shared spaces aim to reduce the dominance of motor vehicles by promoting pedestrian and cyclist activity and minimizing segregation between road users. Although they are intended to improve the safety of vulnerable road users, only a few works in the literature have focused on before-after safety evaluations, mainly analyzing changes in users' trajectories and speeds, traffic volumes, and conflict counts, which, while useful, cannot unambiguously quantify road safety. Here, we propose a more advanced methodology, based on surrogate measures of safety and Extreme Value Theory, to assess road safety before and after the implementation of a shared space. The aim is to produce a crash risk estimation in different scenarios, obtaining a quantitative and comprehensive indicator, useful to practitioners for evaluating the safety of urban design solutions. A real-world case study illustrates the proposed procedure. Video data were collected on two separate days, before and after a shared space implementation, and were semiautomatically processed to extract road users' trajectories. Analysis of traffic volumes, trajectories, speeds and yield ratios made it possible to understand the spatial behavior of road use
We investigate the Brout-Englert-Higgs mechanism of spontaneous symmetry breaking and show that, before symmetry breaking, the interaction of Higgs fields with massless gauge fields leads to the production of particles with negative squared mass.
Flares are major explosive events in our solar system. They are often followed by coronal mass ejections, which have the potential to trigger geomagnetic storms. Various studies aim to predict when and where flares are likely to occur. Most of these studies mainly discuss the photospheric and chromospheric activity before the flare onset. In this paper we study the coronal features before the well-known large flare of December 13th, 2006. Using data from the Hinode/EUV Imaging Spectrometer (EIS), X-Ray Telescope (XRT), and Solar and Heliospheric Observatory (SOHO)/Extreme ultraviolet Imaging Telescope (EIT), we discuss the large-scale coronal features (~ a few 100 arcsec) before the flare onset. Our findings are as follows: 1) The upflows in and around the active region start growing from ~10 to 30 km/s a day before the flare. 2) Expanding coronal loops are clearly observed a few hours before the flare. 3) Soft X-ray and EUV intensities are gradually reduced. 4) The upflows are further enhanced after the flare. From these observed signatures, we conclude that the outer part of active region loops with low density were expanding a day before the flare o
It is argued that recent experiments refuting nonlocal realism can also be considered as experiments refuting time-ordered nonlocality and, hence, confirming the result of the before-before experiment. However, the before-before experiment provides a broader refutation because it also falsifies the testable relativistic version of Bohm's nonlocal model. All this stresses the interest of a new before-before experiment demonstrating together the failure of time-ordered nonlocality and the violation of Leggett's inequality.
Physical phenomena observed before strong earthquakes have been reported for centuries. Radon anomalies, electrical signals, water level changes, and earthquake lights near the epicenter are recognized as pre-earthquake signals that can support earthquake prediction. Anomalous negative signals observed by ground-based atmospheric electric field instruments under fair weather open up a new approach to earthquake prediction. In theory, abnormal heat radiation before an earthquake brings fair weather around the epicenter. To determine the weather conditions around the epicenter before earthquakes, 213 global earthquake events of magnitude 6 or above from 2013 to 2020 were collected. Based on our definition of fair weather, the weather before the earthquake was fair in 96.7% of the events. In addition, the fair state before the earthquake lasted more than 7 hours, leaving enough early warning time.
Pruning neural networks before training not only compresses the original models but also accelerates the network training phase, which has substantial application value. Current work focuses on fine-grained pruning, which uses metrics to calculate weight scores for weight screening, and extends from the initial single-order pruning to iterative pruning. Based on these works, we argue that network pruning can be summarized as an expressive force transfer process of weights, where the retained weights take on the expressive force of the removed ones in order to maintain the performance of the original networks. To achieve optimal expressive force scheduling, we propose a pruning-before-training scheme called Neural Network Panning, which guides expressive force transfer through multi-index and multi-process steps, and we design a panning agent based on reinforcement learning to automate the process. Experimental results show that Panning performs better than various available pruning-before-training methods.