Using a Monte Carlo method on a lattice model of a vicinal surface with short-range step-step attraction, we show that, at low temperature and near equilibrium, there is an inhibition of the motion of macro-steps. This inhibition leads to a pinning of steps without defects, adsorbates, or impurities (self-pinning of steps). We show that this inhibition of the macro-step motion is caused by faceted steps, which are macro-steps that have a smooth side surface. The faceted steps result from discontinuities in the anisotropic surface tension (the surface free energy per area). The discontinuities are brought into the surface tension by the short-range step-step attraction. The short-range step-step attraction also originates `step-droplets', which are locally merged steps, at higher temperatures. We derive an analytic equation of the surface stiffness tensor for the vicinal surface around the (001) surface. Using the surface stiffness tensor, we show that step-droplets roughen the vicinal surface. Contrary to what we expected, the step-droplets slow down the step velocity due to the diminishment of kinks in the merged steps (smoothing of the merged steps).
Procedural activities follow well-defined structures: whether we consider a cooking recipe or a mechanic repairing a car, these activities naturally decompose in a hierarchy of steps and sub-steps. Traditional approaches for step grounding require extensive annotations and scale poorly. Instead, we argue that such hierarchical structure can emerge naturally from uncurated videos of human activities through recurring patterns of co-occurring actions and activities. Our approach builds on HiERO, a weakly-supervised representation learning approach that maps close in the feature space actions that are functionally related to each other, leveraging only fine-grained action-level narrations. In this feature space, procedure steps can be detected by a simple clustering, with no additional task-specific fine-tuning. For the Ego4D Step Grounding challenge, we augment this approach by ensuring fine and coarse level agreement in step assignments, enforcing strict temporal monotonicity of the grounded steps and post-processing the detected steps to reduce the impact of noisy predictions. We call this approach HiERO-StepG and it achieves 56.27 % on the R@1 (IoU = 0.3) metric on the global lead
Solutions to math word problems (MWPs) with step-by-step explanations are valuable, especially in education, to help students better comprehend problem-solving strategies. Most existing approaches only focus on obtaining the final correct answer. A few recent approaches leverage intermediate solution steps to improve final answer correctness but often cannot generate coherent steps with a clear solution strategy. Contrary to existing work, we focus on improving the correctness and coherence of the intermediate solutions steps. We propose a step-by-step planning approach for intermediate solution generation, which strategically plans the generation of the next solution step based on the MWP and the previous solution steps. Our approach first plans the next step by predicting the necessary math operation needed to proceed, given history steps, then generates the next step, token-by-token, by prompting a language model with the predicted math operation. Experiments on the GSM8K dataset demonstrate that our approach improves the accuracy and interpretability of the solution on both automatic metrics and human evaluation.
Chain-of-Thought (CoT) is a technique that guides Large Language Models (LLMs) to decompose complex tasks into multi-step reasoning through intermediate steps in natural language form. Briefly, CoT enables LLMs to think step by step. However, although many Natural Language Understanding (NLU) tasks also require thinking step by step, LLMs perform less well than small-scale Masked Language Models (MLMs). To migrate CoT from LLMs to MLMs, we propose Chain-of-Thought Tuning (CoTT), a two-step reasoning framework based on prompt tuning, to implement step-by-step thinking for MLMs on NLU tasks. From the perspective of CoT, CoTT's two-step framework enables MLMs to implement task decomposition; CoTT's prompt tuning allows intermediate steps to be used in natural language form. Thereby, the success of CoT can be extended to NLU tasks through MLMs. To verify the effectiveness of CoTT, we conduct experiments on two NLU tasks: hierarchical classification and relation extraction, and the results show that CoTT outperforms baselines and achieves state-of-the-art performance.
This paper presents Step-Audio 2, an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation. By integrating a latent audio encoder and reasoning-centric reinforcement learning (RL), Step-Audio 2 achieves promising performance in automatic speech recognition (ASR) and audio understanding. To facilitate genuine end-to-end speech conversation, Step-Audio 2 incorporates the generation of discrete audio tokens into language modeling, significantly enhancing its responsiveness to paralinguistic information such as speaking styles and emotions. To effectively leverage the rich textual and acoustic knowledge in real-world data, Step-Audio 2 integrates retrieval-augmented generation (RAG) and is able to call external tools such as web search to mitigate hallucination and audio search to switch timbres. Trained on millions of hours of speech and audio data, Step-Audio 2 delivers intelligence and expressiveness across diverse conversational scenarios. Evaluation results demonstrate that Step-Audio 2 achieves state-of-the-art performance on various audio understanding and conversational benchmarks compared to other open-source and
Breakthrough progress in vision-based navigation through unknown environments has been achieved by using multimodal large language models (MLLMs). These models can plan a sequence of motions by evaluating the current view at each time step against the task and goal given to the agent. However, current zero-shot Vision-and-Language Navigation (VLN) agents powered by MLLMs still tend to drift off course, halt prematurely, and achieve low overall success rates. We propose Three-Step Nav to counteract these failures with a three-view protocol: First, "look forward" to extract global landmarks and sketch a coarse plan. Then, "look now" to align the current visual observation with the next sub-goal for fine-grained guidance. Finally, "look backward" audits the entire trajectory to correct accumulated drift before stopping. Requiring no gradient updates or task-specific fine-tuning, our planner drops into existing VLN pipelines with minimal overhead. Three-Step Nav achieves state-of-the-art zero-shot performance on the R2R-CE and RxR-CE dataset. Our code is available at https://github.com/ZoeyZheng0/3-step-Nav.
Step adjustment for humanoid robots has been shown to improve robustness in gaits. However, step duration adaptation is often neglected in control strategies. In this paper, we propose an approach that combines both step location and timing adjustment for generating robust gaits. In this approach, step location and step timing are decided, based on feedback from the current state of the robot. The proposed approach is comprised of two stages. In the first stage, the nominal step location and step duration for the next step or a previewed number of steps are specified. In this stage which is done at the start of each step, the main goal is to specify the best step length and step duration for a desired walking speed. The second stage deals with finding the best landing point and landing time of the swing foot at each control cycle. In this stage, stability of the gaits is preserved by specifying a desired offset between the swing foot landing point and the Divergent Component of Motion (DCM) at the end of current step. After specifying the landing point of the swing foot at a desired time, the swing foot trajectory is regenerated at each control cycle to realize desired landing prop
Bunching of steps at the surface of growing crystals can be induced by both directions of the driving force: step up and step down. The processes happen in different adatom concentrations and differ in character. In this study we show how the overall picture of the bunching process depends on the strength of short range step-step repulsion. The repulsive interaction between steps, controlled by an additional parameter, is introduced into the recently studied atomistic scale model of vicinal crystal growth, based on cellular automata. It is shown that the repulsion modifies bunching process in a different way, depending on the direction of the destabilizing force. In particular, bunch profiles, stability diagrams and time-scaling dependences of various bunch properties are affected when the step-step repulsion increases. The repulsion between steps creates a competition between two characteristic sizes - bunch width and macrostep height, playing the role of the second length scale that describes the step bunching phenomenon. A new characteristic time scale dependent on the step-step repulsion parameter emerges as an effect of interplay between (01) faceted macrosteps and (11) facete
We present a multirate method that is particularly suited for integrating the systems of Ordinary Differential Equations (ODEs) that arise in step models of surface evolution. The surface of a crystal lattice, that is slightly miscut from a plane of symmetry, consists of a series of terraces separated by steps. Under the assumption of axisymmetry, the step radii satisfy a system of ODEs that reflects the steps' response to step line tension and step-step interactions. Two main problems arise in the numerical solution of these equations. First, the trajectory of the innermost step can become singular, resulting in a divergent step velocity. Second, when a step bunching instability arises, the motion of steps within a bunch becomes very strongly stable, resulting in "local stiffness". The multirate method introduced in this paper ensures that small time steps are taken for singular and locally stiff components, while larger time steps are taken for the remaining ones. Special consideration is given to the construction of high order interpolants during run time which ensures fourth order accuracy of scheme for components of the solution sufficiently far away from singular trajectories
We present a cost-effective two-step authentication system that integrates face identification and speaker verification using only a camera and microphone available on common devices. The pipeline first performs face recognition to identify a candidate user from a small enrolled group, then performs voice recognition only against the matched identity to reduce computation and improve robustness. For face recognition, a pruned VGG-16 based classifier is trained on an augmented dataset of 924 images from five subjects, with faces localized by MTCNN; it achieves 95.1% accuracy. For voice recognition, a CNN speaker-verification model trained on LibriSpeech (train-other-360) attains 98.9% accuracy and 3.456% EER on test-clean. Source code and trained models are available at https://github.com/NCUE-EE-AIAL/Two-step-Authentication-Multi-biometric-System.
We introduce Step 3.5 Flash, a sparse Mixture-of-Experts (MoE) model that bridges frontier-level agentic intelligence and computational efficiency. We focus on what matters most when building agents: sharp reasoning and fast, reliable execution. Step 3.5 Flash pairs a 196B-parameter foundation with 11B active parameters for efficient inference. It is optimized with interleaved 3:1 sliding-window/full attention and Multi-Token Prediction (MTP-3) to reduce the latency and cost of multi-round agentic interactions. To reach frontier-level intelligence, we design a scalable reinforcement learning framework that combines verifiable signals with preference feedback, while remaining stable under large-scale off-policy training, enabling consistent self-improvement across mathematics, code, and tool use. Step 3.5 Flash demonstrates strong performance across agent, coding, and math tasks, achieving 85.4% on IMO-AnswerBench, 86.4% on LiveCodeBench-v6 (2024.08-2025.05), 88.2% on tau2-Bench, 69.0% on BrowseComp (with context management), and 51.0% on Terminal-Bench 2.0, comparable to frontier models such as GPT-5.2 xHigh and Gemini 3.0 Pro. By redefining the efficiency frontier, Step 3.5 Flash
For computation, there existed Turing machine and later-matured automata theory. For low-level parallel computation, there existed variants of Turing machine, such as two-tapes Turing machine and multi-tapes Turing machine. In the literature, the combination of computation and concurrency is still active, such the combination of automata and processes, and the introduction of concurrency into automata: the so-called pomset automata and branch automata. But the linkage of Turing machine and concurrent automaton is still absent. In this paper, we propose the concepts of step automaton and step Turing machine (STM), which a natural extension to traditional automaton and classical Turing machine just allowing an automaton or Turing machine to execute a step of atomic actions (without partial orders pairwise).
Scaling up network depth is a fundamental pursuit in neural architecture design, as theory suggests that deeper models offer exponentially greater capability. Benefiting from the residual connections, modern neural networks can scale up to more than one hundred layers and enjoy wide success. However, as networks continue to deepen, current architectures often struggle to realize their theoretical capacity improvements, calling for more advanced designs to further unleash the potential of deeper networks. In this paper, we identify two key barriers that obstruct residual models from scaling deeper: shortcut degradation and limited width. Shortcut degradation hinders deep-layer learning, while the inherent depth-width trade-off imposes limited width. To mitigate these issues, we propose a generalized residual architecture dubbed Step by Step Network (StepsNet) to bridge the gap between theoretical potential and practical performance of deep models. Specifically, we separate features along the channel dimension and let the model learn progressively via stacking blocks with increasing width. The resulting method mitigates the two identified problems and serves as a versatile macro desi
By taking account of the alternation of structural parameters, we study bunching of impermeable steps induced by drift of adatoms on a vicinal face of Si(001). With the alternation of diffusion coefficient, the step bunching occurs irrespective of the direction of the drift if the step distance is large. Like the bunching of permeable steps, the type of large terraces is determined by the drift direction. With step-down drift, step bunches grows faster than those with step-up drift. The ratio of the growth rates is larger than the ratio of the diffusion coefficients. Evaporation of adatoms, which does not cause the step bunching, decreases the difference. If only the alternation of kinetic coefficient is taken into account, the step bunching occurs with step-down drift. In an early stage, the initial fluctuation of the step distance determines the type of large terraces, but in a late stage, the type of large terraces is opposite to the case of alternating diffusion coefficient.
The Natural Language Inference (NLI) task often requires reasoning over multiple steps to reach the conclusion. While the necessity of generating such intermediate steps (instead of a summary explanation) has gained popular support, it is unclear how to generate such steps without complete end-to-end supervision and how such generated steps can be further utilized. In this work, we train a sequence-to-sequence model to generate only the next step given an NLI premise and hypothesis pair (and previous steps); then enhance it with external knowledge and symbolic search to generate intermediate steps with only next-step supervision. We show the correctness of such generated steps through automated and human verification. Furthermore, we show that such generated steps can help improve end-to-end NLI task performance using simple data augmentation strategies, across multiple public NLI datasets.
This paper analyzes multi-step temporal difference (TD)-learning algorithms within the ``deadly triad'' scenario, characterized by linear function approximation, off-policy learning, and bootstrapping. In particular, we prove that $n$-step TD-learning algorithms converge to a solution as the sampling horizon $n$ increases sufficiently. The paper is divided into two parts. In the first part, we comprehensively examine the fundamental properties of their model-based deterministic counterparts, including projected value iteration, gradient descent algorithms, which can be viewed as prototype deterministic algorithms whose analysis plays a pivotal role in understanding and developing their model-free reinforcement learning counterparts. In particular, we prove that these algorithms converge to meaningful solutions when $n$ is sufficiently large. Based on these findings, in the second part, two $n$-step TD-learning algorithms are proposed and analyzed, which can be seen as the model-free reinforcement learning counterparts of the model-based deterministic algorithms.
Mathematical reasoning presents a significant challenge for Large Language Models (LLMs) due to the extensive and precise chain of reasoning required for accuracy. Ensuring the correctness of each reasoning step is critical. To address this, we aim to enhance the robustness and factuality of LLMs by learning from human feedback. However, Direct Preference Optimization (DPO) has shown limited benefits for long-chain mathematical reasoning, as models employing DPO struggle to identify detailed errors in incorrect answers. This limitation stems from a lack of fine-grained process supervision. We propose a simple, effective, and data-efficient method called Step-DPO, which treats individual reasoning steps as units for preference optimization rather than evaluating answers holistically. Additionally, we have developed a data construction pipeline for Step-DPO, enabling the creation of a high-quality dataset containing 10K step-wise preference pairs. We also observe that in DPO, self-generated data is more effective than data generated by humans or GPT-4, due to the latter's out-of-distribution nature. Our findings demonstrate that as few as 10K preference data pairs and fewer than 500
Weight Average (WA) is an active research topic due to its simplicity in ensembling deep networks and the effectiveness in promoting generalization. Existing weight average approaches, however, are often carried out along only one training trajectory in a post-hoc manner (i.e., the weights are averaged after the entire training process is finished), which significantly degrades the diversity between networks and thus impairs the effectiveness. In this paper, inspired by weight average, we propose Lookaround, a straightforward yet effective SGD-based optimizer leading to flatter minima with better generalization. Specifically, Lookaround iterates two steps during the whole training period: the around step and the average step. In each iteration, 1) the around step starts from a common point and trains multiple networks simultaneously, each on transformed data by a different data augmentation, and 2) the average step averages these trained networks to get the averaged network, which serves as the starting point for the next iteration. The around step improves the functionality diversity while the average step guarantees the weight locality of these networks during the whole training,
Sharpness-aware minimization (SAM) methods have gained increasing popularity by formulating the problem of minimizing both loss value and loss sharpness as a minimax objective. In this work, we increase the efficiency of the maximization and minimization parts of SAM's objective to achieve a better loss-sharpness trade-off. By taking inspiration from the Lookahead optimizer, which uses multiple descent steps ahead, we propose Lookbehind, which performs multiple ascent steps behind to enhance the maximization step of SAM and find a worst-case perturbation with higher loss. Then, to mitigate the variance in the descent step arising from the gathered gradients across the multiple ascent steps, we employ linear interpolation to refine the minimization step. Lookbehind leads to a myriad of benefits across a variety of tasks. Particularly, we show increased generalization performance, greater robustness against noisy weights, as well as improved learning and less catastrophic forgetting in lifelong learning settings. Our code is available at https://github.com/chandar-lab/Lookbehind-SAM.
A phase diagram for the step faceting phase, the step droplet phase, and the Gruber-Mullins-Pokrovsky-Talapov (GMPT) phase on a crystal surface is obtained by calculating the surface tension with the density matrix renormalization group method. The model based on the calculations is the restricted solid-on-solid (RSOS) model with a point-contact-type step-step attraction (p-RSOS model) on a square lattice. The point-contact-type step-step attraction represents the energy gain obtained by forming a bonding state with orbital overlap at the meeting point of the neighbouring steps. Owing to the sticky character of steps, there are two phase transition temperatures, $T_{f,1}$ and $T_{f,2}$. At temperatures $T < T_{f,1}$, the anisotropic surface tension has a disconnected shape around the (111) surface. At $T<T_{f,2}<T_{f,1}$, the surface tension has a disconnected shape around the (001) surface. On the (001) facet edge in the step droplet phase, the shape exponent normal to the mean step running direction $θ_n=2$ at $T$ near $T_{f,2}$, which is different from the GMPT universal value $θ_n=3/2$. On the (111) facet edge, $θ_n=4/3$ only on $T_{f,1}$. To understand how the system