Self-improvement at scale has been a longstanding goal for reasoning models, and there are two natural places to do it: at test time, through verification-refinement (V-R) loops; and at training time, through self-training methods. Both are gated by the same bottleneck: the verifier. V-R loops stall when verifier scores inflate while accuracy stagnates, and when feedback is too generic to act on; self-training fails similarly when bad self-generated data are added to training. Better verification would unlock both, but the capability we want to train, i.e., catching self-generated errors, lacks training signal. To address this challenge, we propose self-trained verification (STV). Our key observation is that, while a model cannot catch these errors alone, it can when shown the reference solution. We turn this asymmetry into a supervision target and train the verifier to imitate a more informed version of itself. At test time, STV substantially improves V-R loops on hard problems, while alternatives (e.g., SFT, RL on verifier scores, and even meta-verifiers) do not. STV roughly doubles accuracy on hard math and lifts it 14x on scientific reasoning tasks (1.5% to 21%). At training ti
While post-training quantization is widely adopted for efficient deployment of large language models, the mechanisms underlying quantization robustness remain unclear. We conduct a comprehensive analysis of quantization degradation across open-source language model training trajectories up to 32B parameters and 15T training tokens to accurately assess the relationship between training dynamics and quantization performance. Our key finding is that quantization errors in large-scale training runs are driven by a complex interplay between learning rate and other training hyperparameters. Specifically, once learning rates decay, validation loss and quantization error diverge, largely independent of training data scale. To investigate interventions on the training dynamics and identify specific configurations that can modulate quantization robustness favorably, we train our own models in controlled experiments up to 100B tokens. Our results challenge the assumption that increasing dataset scale inherently compromises quantization effectiveness, demonstrating instead that strategic training hyperparameter interventions can improve quantization quality at scale.
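As a rough illustration of what a "quantization error" can look like, the sketch below applies symmetric round-to-nearest quantization to a small list of weights and reports the mean squared error at two bit widths. The helper names (`quantize_rtn`, `quantization_mse`), the per-tensor scaling, the bit widths, and the toy weights are all assumptions made for illustration; the abstract does not state which quantization scheme or error metric the analysis uses.

```python
def quantize_rtn(weights, bits=4):
    """Symmetric per-tensor round-to-nearest quantization (hypothetical helper).

    Returns the de-quantized weights, i.e. the values the model would
    effectively use after quantization.
    """
    qmax = 2 ** (bits - 1) - 1              # e.g. 7 for 4-bit signed integers
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)                # all zeros: nothing to quantize
    scale = max_abs / qmax                  # one scale for the whole tensor
    dequantized = []
    for w in weights:
        q = round(w / scale)                # nearest integer level
        q = max(-qmax, min(qmax, q))        # clamp to the representable range
        dequantized.append(q * scale)
    return dequantized


def quantization_mse(weights, bits=4):
    """Mean squared error between original and de-quantized weights."""
    dequantized = quantize_rtn(weights, bits)
    return sum((w - d) ** 2 for w, d in zip(weights, dequantized)) / len(weights)


# Toy weights, chosen only for illustration.
weights = [0.5, -1.0, 0.25, 0.75, -0.1]
mse_4bit = quantization_mse(weights, bits=4)
mse_8bit = quantization_mse(weights, bits=8)
print(f"4-bit MSE: {mse_4bit:.6f}, 8-bit MSE: {mse_8bit:.6f}")
```

Tracking a quantity like this across training checkpoints, alongside validation loss, is one plausible way to observe the kind of divergence the abstract describes.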
Transporting hazardous materials (hazmats) using tank cars has more significant economic benefits than other transportation modes. Although railway transportation is roughly four times more fuel-efficient than roadway transportation, a train derailment has greater potential to cause more disastrous consequences than a truck incident. Train types, such as unit train or manifest train (also called mixed train), can influence transport risks in several ways. For example, unit trains only experience risks on mainlines and when arriving at or departing from terminals, while manifest trains experience additional switching risks in yards. Based on prior studies and various data sources covering the years 1996-2018, this paper constructs event chains for line-haul risks on mainlines (for both unit trains and manifest trains), arrival/departure risks in terminals (for unit trains) and yards (for manifest trains), and yard switching risks for manifest trains using various probabilistic models, and finally determines expected casualties as the consequences of a potential train derailment and release incident. This is the first analysis to quantify the total risks a train may encounter through
We present methods for constructing Taylor series surrogate models for covariance preconditioned high dimensional mappings that depend implicitly on the solution of a system of nonlinear equations, e.g., the solution of a partial differential equation. Taylor series are traditionally considered intractable for such mappings because the derivative tensors are enormous, and are only accessible through ``probing'' (contraction of the tensor with vectors in all but one index). We overcome these challenges using a ``Tucker tensor train Taylor series'' (T4S) surrogate model, in which each derivative tensor is approximated by a Tucker decomposition composed with a tensor train. After an initial dimension reduction, Tucker tensor trains are fit to directionally symmetric tensor probes using Riemannian manifold optimization within a rank continuation scheme. The optimization is enabled by fast sweeping methods for applying the Riemannian Jacobian (the Jacobian for the Tucker tensor train fitting problem) and its transpose to vectors. We justify the T4S model theoretically, and provide numerical evidence for the effectiveness of the proposed methods.
The TRAIN code, developed in 1995 as a post-processor for second-order transport maps from MAD, has been used extensively at the LEP and the LHC to study self-consistent closed orbits, tunes and chromaticities of bunch trains in the presence of beam-beam long-range (BBLR) and PACMAN effects. This paper presents a modern re-implementation of the TRAIN concept in Python using well-known numeric libraries (numpy, scipy) and an optional link to MAD-X via cpymad. This greatly improves the usability, maintainability and extensibility of the code. New functionality includes the support for arbitrary particle types, an arbitrary number and distribution of beam-beam interaction points, and the extrapolation of the beam-beam induced closed-orbit effects to arbitrary points in the machine. The code is benchmarked against the classic TRAIN code, and simulation results are compared to observations from LHC physics operation.
Nowadays, cloud-based services are widely favored over the traditional approach of locally training a Neural Network (NN) model. Oftentimes, a cloud service processes multiple requests from users--thus training multiple NN models concurrently. However, training NN models concurrently is a challenging process, which typically requires significant amounts of available computing resources and takes a long time to complete. In this paper, we present UnifiedNN to effectively train multiple NN models concurrently on the cloud. UnifiedNN effectively "combines" multiple NN models and features several memory and time conservation mechanisms to train multiple NN models simultaneously without impacting the accuracy of the training process. Specifically, UnifiedNN merges multiple NN models and creates a large singular unified model in order to efficiently train all models at once. We have implemented a prototype of UnifiedNN in PyTorch and we have compared its performance with relevant state-of-the-art frameworks. Our experimental results demonstrate that UnifiedNN can reduce memory consumption by up to 53% and training time by up to 81% when compared with vanilla PyTorch without impacting the
The train unit scheduling problem (TUSP) is an important part of the scheduling process for passenger railway operators. Scholars in various countries have proposed a variety of optimization models based on specific local railway situations and scheduling needs. This research investigates the train unit scheduling problem in the UK. We propose an Enhanced Train Unit Scheduling Problem with Unit Ordering based on existing integer multicommodity flow models. We introduce unit ordering variables, which represent the order in which train units are coupled to serve the same trip, as well as train direction parameters, so that our model can provide unit order information and avoid unit blockage in stations. We present experimental results based on three different sizes of artificial data, as well as real-world data based on the Trans Pennine Express' Anglo-Scottish route. The experimental results show that our model is able to provide the ordering information corresponding to each train unit and to prevent blockage in the station.
Railway networks have become increasingly important in recent times, especially in moving freight and public transportation from road traffic and planes to more environmentally friendly trains. Since expanding the global railway network is time- and resource-consuming, maximizing the rail capacity of the existing infrastructure is desirable. However, simply running more trains is infeasible as certain constraints enforced by the train control system must be satisfied. The capacity of a network depends (amongst others) on the distance between trains allowed by this safety system. While most signaling systems rely on fixed blocks defined by costly hardware, new specifications provided by Level 2 with Hybrid Train Detection of the European Train Control System (ETCS L2 HTD), formerly known as ETCS Hybrid Level 3, allow the usage of virtual subsections. This additional degree of freedom allows for shorter train following times and, thus, more trains on existing railway tracks. On the other hand, new design tasks arise on which automated methods might be helpful for designers of modern railway networks. However, although first approaches exist that solve design problems arising within E
What happens if we train a new Large Language Model (LLM) using data that are at least partially generated by other LLMs? The explosive success of LLMs means that a substantial amount of content online will be generated by LLMs rather than humans, which will inevitably enter the training datasets of next-generation LLMs. We evaluate the implications of such "regurgitative training" on LLM performance. Through fine-tuning GPT-3.5 with data generated either by itself or by other LLMs in a machine translation task, we find strong evidence that regurgitative training clearly handicaps the performance of LLMs. The same performance loss of regurgitative training is observed on transformer models that we train from scratch. We find suggestive evidence that the performance disadvantage of regurgitative training can be attributed to at least two mechanisms: (1) higher error rates and (2) lower lexical diversity in LLM-generated data as compared to real data. Based on these mechanisms, we propose and evaluate three different strategies to mitigate the performance loss of regurgitative training. First, we devise data-driven metrics to gauge the quality of each LLM-generated data instance, and
We describe a numerical many-body technique that is based on both tensor networks and quantum Monte Carlo. The variational ansatz is a tensor network that can harvest volume-law entanglement. It is constructed from a tensor train to which one applies a set of non-local operators that force several indices of the tensor train to represent the same physical index, hence its name -- replica tensor train (RTT). From the tensor network toolbox, it inherits the possibility to make linear combinations of these states and apply a certain class of operators. We can therefore find the ground-state of a local Hamiltonian in a purely algebraic way as in standard tensor network algorithms -- i.e. without using gradient descent methods. On the other hand, the volume-law structure forbids calculating physical observables directly. In much the same way as on a quantum computer where one can prepare a state but can only sample it at the end, here we have to use Markov Chain Monte Carlo to compute the observables. We further show that the approach can be extended to build Krylov-subspace ground-state methods within the variational manifold. We illustrate the different algorithms on a two-dimensional
Data augmentation is key to the training of neural networks for image classification. This paper first shows that existing augmentations induce a significant discrepancy between the typical size of the objects seen by the classifier at train and test time. We experimentally validate that, for a target test resolution, using a lower train resolution offers better classification at test time. We then propose a simple yet effective and efficient strategy to optimize the classifier performance when the train and test resolutions differ. It involves only a computationally cheap fine-tuning of the network at the test resolution. This enables training strong classifiers using small training images. For instance, we obtain 77.1% top-1 accuracy on ImageNet with a ResNet-50 trained on 128x128 images, and 79.8% with one trained on 224x224 images. In addition, if we use extra training data we get 82.5% with the ResNet-50 trained with 224x224 images. Conversely, when training a ResNeXt-101 32x48d pre-trained in weakly-supervised fashion on 940 million public images at resolution 224x224 and further optimizing for test resolution 320x320, we obtain a test top-1 accuracy of 86.4% (top-5: 98.0%) (sin
We identify a tradeoff curve between the number of wheels on a train car and the amount of track that must be installed in order to ensure that the train car is supported by the track at all times. The goal is to build an elevated track that covers some large distance $\ell$, but that consists primarily of gaps, so that the total number of feet of train track actually installed is only a small fraction of $\ell$. For the track to support the train at all points, the requirement is that, as the train drives across the track, at least one set of wheels from the rear quarter and at least one set of wheels from the front quarter of the train must be touching the track at all times. We show that, if a train car has $n$ sets of wheels evenly spaced apart in its rear and $n$ sets of wheels evenly spaced apart in its front, then it is possible to build a train track that supports the train car but uses only $\Theta(\ell / n)$ feet of track. We then consider what happens if the wheels on the train car are not evenly spaced (and may even be configured adversarially). We show that for any configuration of the train car, with $n$ wheels in each of the front and rear qu
Tensors with unit Frobenius norm are fundamental objects in many fields, including scientific computing and quantum physics, where they represent normalized eigenvectors and pure quantum states. While the tensor train decomposition provides a powerful low-rank format for tackling high-dimensional problems, it does not intrinsically enforce the unit-norm constraint. To address this, we introduce the normalized tensor train (NTT) decomposition, which aims to approximate a tensor by unit-norm tensors in tensor train format. The low-rank structure of the NTT decomposition not only saves storage and computational cost but also preserves the underlying unit-norm structure. We prove that the set of fixed-rank NTT tensors forms a smooth manifold, and we derive the corresponding Riemannian geometry, paving the way for geometric methods. We propose NTT-based methods for low-rank tensor recovery, high-dimensional eigenvalue problems, estimation of stabilizer rank, and calculation of the minimum output Rényi 2-entropy of quantum channels. Numerical experiments demonstrate the superior efficiency and scalability of the proposed NTT-based methods.
We explore the capability of evolution strategies to train an agent with a policy based on a transformer architecture in a reinforcement learning setting. We performed experiments using OpenAI's highly parallelizable evolution strategy to train Decision Transformer in the MuJoCo Humanoid locomotion environment and in the environment of Atari games, testing the ability of this black-box optimization technique to train even such relatively large and complicated models (compared to those previously tested in the literature). The examined evolution strategy proved to be, in general, capable of achieving strong results and managed to produce high-performing agents, showcasing evolution's ability to tackle the training of even such complex models.
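For readers unfamiliar with the optimizer, the sketch below shows the basic shape of an OpenAI-style evolution strategy: perturb the parameters with Gaussian noise, score each perturbation, and move the parameters along the score-weighted average of the noise. It is a minimal sketch under several assumptions: the toy quadratic fitness stands in for an episode return, a three-element parameter list stands in for the Decision Transformer weights, plain score standardization replaces any fitness shaping, and the hyperparameters and the names `evolution_strategy` and `toy_fitness` are hypothetical. None of these choices are taken from the experiments described above.

```python
import random


def evolution_strategy(fitness, theta, iterations=200, population=50,
                       sigma=0.1, lr=0.02, seed=0):
    """Black-box maximization of `fitness` using Gaussian perturbations."""
    rng = random.Random(seed)
    theta = list(theta)
    dim = len(theta)
    for _ in range(iterations):
        # Sample one noise vector per population member.
        noises = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                  for _ in range(population)]
        # Evaluate each perturbed parameter vector (parallelizable in practice).
        scores = [fitness([t + sigma * e for t, e in zip(theta, eps)])
                  for eps in noises]
        # Standardize scores so the step size does not depend on reward scale
        # (an assumption of this sketch, not a detail from the abstract).
        mean = sum(scores) / population
        std = (sum((s - mean) ** 2 for s in scores) / population) ** 0.5
        if std == 0.0:
            continue  # all candidates scored the same: no search direction
        advantages = [(s - mean) / std for s in scores]
        # Gradient estimate: score-weighted average of the noise vectors.
        for j in range(dim):
            grad_j = sum(a * eps[j] for a, eps in zip(advantages, noises))
            theta[j] += lr * grad_j / (population * sigma)
    return theta


# Toy objective standing in for an episode return: higher is better.
target = [1.0, -2.0, 0.5]


def toy_fitness(params):
    return -sum((p - t) ** 2 for p, t in zip(params, target))


start = [0.0, 0.0, 0.0]
result = evolution_strategy(toy_fitness, start)
print(toy_fitness(start), toy_fitness(result))
```

Because each candidate only needs a scalar score, the evaluations inside the loop are independent of one another, which is what makes this family of methods straightforward to parallelize.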
Points 2.1.4(b), 2.4.2(b) and 2.4.3(b) in Annex I of Implementing Regulation (EU) No. 402/2013 allow a simplified approach for the safety approval of computer vision systems for driverless trains, if they have 'similar' functions and interfaces as the replaced human driver. The human driver is not replaced one-to-one by a technical system - only a limited set of cognitive functions are replaced. However, performance in the most challenging function, obstacle detection, is difficult to quantify due to the deficiency of published measurement results. This article summarizes the data published so far. This article also goes a long way to remedy this situation by providing a new public and anonymized dataset of 711 train driver performance measurements from controlled experiments. The measurements are made for different speeds, obstacle sizes, train protection systems and obstacle color contrasts respectively. The measured values are reaction time and distance to the obstacle. The goal of this paper is an unbiased and exhaustive description of the presented dataset for research, standardization and regulation. The dataset with supplementing information and literature is published on ht
A deeper understanding of pedestrian dynamics is essential to improve crowd flows in public spaces such as train stations. It is essential to understand both the physical and the psychological processes present in this context. However, current research on train boarding behavior is limited in scope and mainly focuses on how group level variables such as number of boarders/deboarders influence train boarding efficiency. Viewing pedestrian dynamics through a psychological lens is important for a detailed understanding of the train boarding context and to recognize target areas for improving crowd flows. At Dutch train stations, boarders follow a social norm of waiting at the train door until deboarding is complete. Although people generally adhere to this norm, the way it is executed may not be optimal for deboarding efficiency. We investigate how waiting boarders form a deboarding channel (a corridor where deboarders exit the train) which is a macroscopic structure formed by pedestrians, and how this channel in turn influences the efficiency of deboarding. Analyzing a dataset with 3278 boarding events at Utrecht Centraal Station in the Netherlands from 2017 - 2020 (a subset of a tr
Despite its effectiveness in improving the robustness of neural networks, adversarial training suffers from the natural accuracy degradation problem, i.e., accuracy on natural samples is reduced significantly. In this study, we show through quantitative and qualitative experiments that natural accuracy degradation is highly related to the disruption of the natural sample topology in the representation space. Based on this observation, we propose Topology-pReserving Adversarial traINing (TRAIN) to alleviate the problem by preserving, during adversarial training, the topology structure of natural samples from a standard model trained only on natural samples. As an additional regularization, our method can be combined with various popular adversarial training algorithms, taking advantage of both sides. Extensive experiments on CIFAR-10, CIFAR-100, and Tiny ImageNet show that our proposed method achieves consistent and significant improvements over various strong baselines in most cases. Specifically, without additional data, TRAIN achieves up to 8.86% improvement in natural accuracy and 6.33% improvement in robust accuracy.
Sparse neural networks have been widely applied to reduce the computational demands of training and deploying over-parameterized deep neural networks. For inference acceleration, methods that discover a sparse network from a pre-trained dense network (dense-to-sparse training) work effectively. Recently, dynamic sparse training (DST) has been proposed to train sparse neural networks without pre-training a dense model (sparse-to-sparse training), so that the training process can also be accelerated. However, previous sparse-to-sparse methods mainly focus on Multilayer Perceptron Networks (MLPs) and Convolutional Neural Networks (CNNs), failing to match the performance of dense-to-sparse methods in the Recurrent Neural Networks (RNNs) setting. In this paper, we propose an approach to train intrinsically sparse RNNs with a fixed parameter count in one single run, without compromising performance. During training, we allow RNN layers to have a non-uniform redistribution across cell gates for better regularization. Further, we propose SNT-ASGD, a novel variant of the averaged stochastic gradient optimizer, which significantly improves the performance of all sparse training methods for R
A seemingly simple, yet widely applicable subroutine in automated train scheduling is the insertion of a new train path to a timetable in a railway network. We believe it to be the first step towards a new train-rerouting framework in case of large disturbances or maintenance works. Other applications include handling ad-hoc requests and modifying train paths upon request from railway undertakings. We propose a fast and scalable path-insertion algorithm based on dynamic programming that is able to output multiple suitable paths. Our algorithm uses macroscopic data and can run on railway networks with any number of tracks. We apply the algorithm on the line from Göteborg Sävenäs to the Norwegian border at Kornsjö. For a time window of seven hours, we obtain eight suitable paths for a freight train within 0.3 seconds after preprocessing.
We present here a real-time control model for the train dynamics in a linear metro line system. The model describes the train dynamics taking into account average passenger arrival rates on platforms, including control laws for train dwell and run times, based on the feedback of the train dynamics. The model extends a recently developed Max-plus linear traffic model with demand-dependent dwell times and a run time control. The extension permits the elimination of eventual irregularities on the train time-headway. The resulting train dynamics are interpreted as a dynamic programming system of a stochastic optimal control problem of a Markov chain. The train dynamics still admit a stable stationary regime with a unique average growth rate interpreted as the asymptotic average train time-headway. Moreover, beyond the transient regime of the train dynamics, our extension guarantees uniformity in time of the train time-headways at every platform.
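To make the notion of an "average growth rate" concrete, the sketch below iterates a plain Max-plus linear recursion x(k+1) = A ⊗ x(k), where (A ⊗ x)_i = max_j (A_ij + x_j), and estimates how fast the state grows per step. It only illustrates the underlying Max-plus linear setting, not the extended model with demand-dependent dwell times and run time control. The 2x2 matrix, its entries, and the function names `maxplus_matvec` and `average_growth_rate` are hypothetical and chosen purely for illustration.

```python
def maxplus_matvec(A, x):
    """Max-plus matrix-vector product: (A ⊗ x)_i = max_j (A[i][j] + x[j])."""
    return [max(a + xj for a, xj in zip(row, x)) for row in A]


def average_growth_rate(A, x0, steps=100):
    """Estimate the asymptotic growth rate of x(k+1) = A ⊗ x(k).

    Uses the first component only; `steps` should be large enough (and, for
    periodic regimes, a multiple of the period) for the estimate to settle.
    """
    x = list(x0)
    for _ in range(steps):
        x = maxplus_matvec(A, x)
    return (x[0] - x0[0]) / steps


# Hypothetical two-node system; entries play the role of time delays.
A = [[2.0, 5.0],
     [3.0, 1.0]]
x0 = [0.0, 0.0]

print(maxplus_matvec(A, x0))          # first iterate: [5.0, 3.0]
print(average_growth_rate(A, x0))     # growth per step: 4.0
```

In this toy example the growth rate 4.0 equals the largest mean cycle weight of the matrix, here (5 + 3) / 2 for the two-node cycle; in a traffic interpretation, that quantity would correspond to the asymptotic average time-headway.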