Despite significant achievements in improving the instruction-following capabilities of large language models (LLMs), the ability to process multiple, potentially entangled or conflicting instructions remains a considerable challenge. Real-world scenarios often require consistency across multiple instructions over time, such as maintaining the privacy of secrets, honoring personal preferences, and respecting prioritization, which demands sophisticated abilities to integrate information across multiple turns and to carefully balance competing objectives when instructions intersect or conflict. This work presents a systematic investigation of LLMs' capabilities in handling multiple turns of instructions, covering three levels of difficulty: (1) retrieving information from instructions, (2) tracking and reasoning across turns, and (3) resolving conflicts among instructions. We construct MultiTurnInstruct, a dataset of $\sim$1.1K high-quality multi-turn conversations built through a human-in-the-loop approach and organized into nine capability categories, including statics and dynamics, reasoning, and multitasking. Our findings reveal an intriguing trade-off between different capabilities. While GPT models demonstrate superior memorization, they show reduced effectiveness
Conversation is a subject of increasing interest in the social, cognitive, and computational sciences. Yet as conversational datasets continue to grow in size and complexity, researchers lack scalable methods to segment speech-to-text transcripts into conversational "turns", the basic building blocks of social interaction. We discuss this challenge and then introduce "NaturalTurn," a turn-segmentation algorithm designed to accurately capture the dynamics of conversational exchange. NaturalTurn operates by distinguishing speakers' primary conversational turns from listeners' secondary utterances, such as backchannels, brief interjections, and other forms of parallel speech that characterize human conversation. Using data from a large conversation corpus, we show that NaturalTurn captures conversational turns more accurately than a baseline model. For example, it produces turns with durations and gaps that match the empirical literature, reveals stronger linguistic alignment patterns between speakers, and uncovers otherwise hidden relationships between turn-taking and affective outcomes. NaturalTurn thus represents a pragmatic development in machine-generated transcript-processing methods.
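The separation of primary turns from secondary utterances described above can be illustrated with a toy sketch. The thresholds, utterance format, and function names here are hypothetical assumptions for illustration, not the published NaturalTurn algorithm: an utterance that is brief and fully overlapped by another speaker's utterance is treated as secondary (e.g., a backchannel) rather than as a primary turn.

```python
# Toy sketch of the primary-vs-secondary distinction (hypothetical
# thresholds, not the published NaturalTurn algorithm).
BACKCHANNEL_MAX_SEC = 1.5  # assumed cutoff for a "brief" utterance

def classify_utterances(utterances):
    """utterances: list of (speaker, start, end) tuples, times in seconds.

    Returns a label per utterance: "secondary" for brief utterances fully
    contained inside another speaker's utterance, else "primary".
    """
    labels = []
    for i, (spk, start, end) in enumerate(utterances):
        short = (end - start) <= BACKCHANNEL_MAX_SEC
        # Fully overlapped by some other speaker's utterance?
        inside = any(s2 <= start and end <= e2
                     for j, (spk2, s2, e2) in enumerate(utterances)
                     if j != i and spk2 != spk)
        labels.append("secondary" if short and inside else "primary")
    return labels
```

A real segmenter would additionally merge a speaker's consecutive utterances into turns and handle partial overlaps; this sketch only shows the core filtering idea.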
Recent advances in large multimodal models have leveraged image-based tools with reinforcement learning to tackle visual problems. However, existing open-source approaches often exhibit monotonous reasoning patterns and allow only a limited number of interaction turns, making them inadequate for difficult tasks that require trial-and-error exploration. In this work, we address this limitation by scaling up tool-based interactions and introduce Mini-o3, a system that executes deep, multi-turn reasoning -- spanning tens of steps -- and achieves state-of-the-art performance on challenging visual search tasks. Our recipe for reproducing OpenAI o3-style behaviors comprises three key components. First, we construct the Visual Probe Dataset, a collection of thousands of challenging visual search problems designed for exploratory reasoning. Second, we develop an iterative data collection pipeline to obtain cold-start trajectories that exhibit diverse reasoning patterns, including depth-first search, trial-and-error, and goal maintenance. Third, we propose an over-turn masking strategy that prevents penalization of over-turn responses (those that hit the maximum number of turns) during reinforcement learning.
As LLM reasoning performance plateaus, improving inference-time compute efficiency is crucial to mitigate overthinking and long thinking traces even for simple queries. Prior approaches, including length regularization, adaptive routing, and difficulty-based budget allocation, primarily focus on single-turn settings and fail to address the sequential dependencies inherent in multi-turn reasoning. In this work, we formulate multi-turn reasoning as a sequential compute allocation problem and model it as a multi-objective Markov Decision Process. We propose TAB: Turn-Adaptive Budgets, a budget allocation policy trained via Group Relative Policy Optimization (GRPO) that learns to maximize task accuracy while respecting global per-problem token constraints. Consequently, TAB takes the conversation history as input and learns to adaptively allocate smaller budgets to easier turns and reserve an appropriate number of tokens for the crucial, harder reasoning steps. Our experiments on mathematical reasoning benchmarks demonstrate that TAB achieves a superior accuracy-token trade-off, saving up to 35% of tokens while maintaining accuracy relative to static and off-the-shelf LLM budget baselines. Further, for sys
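The allocation behavior TAB learns can be caricatured with a simple hand-written rule. Everything here (the function, the difficulty score, the 0.5 offset, and the floor of 64 tokens) is an illustrative assumption of ours, not the paper's trained policy; it only shows the shape of the problem: easy turns get less than a fair share of the remaining global budget, hard turns get more, and the global cap is never exceeded.

```python
# Hand-written caricature of turn-adaptive budget allocation
# (illustrative assumptions, not TAB's learned GRPO policy).
def allocate_budget(difficulty, remaining_tokens, turns_left, min_budget=64):
    """difficulty in [0, 1]; returns a token budget for the current turn.

    Easy turns (difficulty ~ 0) get half the fair share; hard turns
    (difficulty ~ 1) get 1.5x, clamped to [min_budget, remaining_tokens].
    """
    fair_share = remaining_tokens / max(turns_left, 1)
    budget = int(fair_share * (0.5 + difficulty))
    return max(min_budget, min(budget, remaining_tokens))
```

A learned policy would condition on the full conversation history rather than a scalar difficulty, but the same global-constraint structure applies.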
Using a minimal aggregation-based model, we address the efficient information transfer observed in natural flocks during collective turns. Specifically, we demonstrate that this feature can arise solely from the non-reciprocal nature of local interactions. Through a perturbative analysis, moreover, we find that velocity fluctuations (in the continuum) can be described by a Born approximation. We then show that a wave propagating across the flock undergoes scattering. Our model provides testable predictions and can be extended to study other physical contexts exhibiting polar order.
The recent wave of audio foundation models (FMs) could provide new capabilities for conversational modeling. However, there have been limited efforts to evaluate these audio FMs comprehensively on their ability to hold natural and interactive conversations. To engage in meaningful conversation with the end user, we would want the FMs to additionally perform a fluent succession of turns without too much overlapping speech or long stretches of silence. Inspired by this, we ask whether the recently proposed audio FMs can understand, predict, and perform turn-taking events. To answer this, we propose a novel evaluation protocol that assesses a spoken dialog system's turn-taking capabilities using, as a judge, a supervised model trained to predict turn-taking events in human-human conversations. Using this protocol, we present the first comprehensive user study evaluating existing spoken dialogue systems on their ability to perform turn-taking events, revealing many interesting insights, such as that they sometimes do not understand when to speak up, can interrupt too aggressively, and rarely backchannel. We further evaluate multiple open-source and proprietary audio FMs acces
We have shown recently that the notion of poking pairwise interactions along a chain provides a unifying framework for understanding the formation of both secondary and tertiary protein structure based on symmetry and geometry. $α$-helices and $β$-sheets are found to be special geometries that have systematic poking contacts in a repetitive manner, with the contacts being local along the $α$-helix and non-local along a pair of adjacent strands within a $β$-sheet. Pairwise poking interactions also govern tertiary structure formation, but they are weaker and there are no special geometrical constraints as in secondary structure formation. Here we demonstrate that protein turns, the most prevalent non-repetitive structural element in proteins, are instances of local (as in $α$-helices) and isolated (non-repetitive) poking pairwise contacts for which the geometrical constraints are partially relaxed. This simple and purely geometrical definition of protein turns (also sometimes known as reverse turns, $β$-turns, $β$-bends, hairpin bends, $3_{10}$ bends, kinks, widgets, ...) provides a simple framework for unifying them. We present the results of a systematic analysis and identify th
Kinetic inductances of superconducting nanostrips with a meander pattern are theoretically investigated based on the London model, taking into account the current crowding at the turns of the nanostrips. The complex current approach is developed for analytical investigation of the kinetic inductance of nanostrips with turns for thin ($d < λ$) and narrow ($w \ll λ^2/d$) superconducting strips, where $d$ is the strip thickness, $w$ is the strip width, and $λ$ is the London penetration depth. We show that the current distribution in superconducting nanostrips with $wd \ll λ^2$ is identical to that in normal conducting nanostrips with $wd \ll δ^2/2$, where $δ$ is the skin depth, and that the dependence of the kinetic inductance on the nanostrip geometry is identical to that of the normal resistance. Effects of edge defects in superconducting strips on the kinetic inductance are also considered.
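The stated correspondence between kinetic inductance and normal resistance can be made concrete for a straight strip segment of length $\ell$ using the standard thin-strip expressions (textbook results quoted for illustration, not taken from the abstract): both quantities depend on geometry only through the same factor $\ell/(wd)$.

```latex
% Straight thin, narrow strip (uniform current density):
% both quantities share the geometric factor \ell/(w d).
L_k = \mu_0 \lambda^2 \, \frac{\ell}{w d}, \qquad
R   = \rho \, \frac{\ell}{w d}
```

This shared geometric dependence is what allows results for the resistance of normal strips with turns to be carried over to the kinetic inductance of superconducting ones.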
Spin-wave computing, a potential successor to CMOS-based technologies, relies on the efficient manipulation of spin waves for information processing. While basic logic devices like magnon transistors, gates, and adders have been experimentally demonstrated, the challenge for complex magnonic circuits lies in steering spin waves through sharp turns. In this study we demonstrate, with micromagnetic simulations and Brillouin light scattering microscopy experiments, that dipolar spin waves can propagate through 90-degree turns without distortion. The key lies in carefully designed in-plane magnetization landscapes that address the challenges posed by anisotropic dispersion. The experimental realization of the required magnetization landscape is enabled by spatial manipulation of the uniaxial anisotropy using corrugated magnonic waveguides. The findings presented in this work should be considered in any magnonic circuit design dealing with anisotropic dispersion and spin-wave turns.
We argue that field trajectories, which lead to cosmic acceleration and feature rapid turns near the boundary of the moduli space, are in the Swampland. We obtain this result by assuming the validity of the Swampland Distance Conjecture (SDC) in the presence of a positive scalar potential and by focusing on hyperbolic spaces, as prototype geometries of infinite-distance limits of Calabi-Yau compactifications. We find that, in a quasi-de Sitter space with Hubble rate $H$ and acceleration parameter $ε$, the turning rate $Ω$ is bounded from above as $Ω/H < \mathcal{O}(\sqrt{ε})$. Therefore, field trajectories consistent with the SDC can only have a negligible deviation from geodesics. This has direct implications for the realization and consistency of multi-field scenarios in string theory. Moreover, it implies a tension between asymptotic accelerating expansion, consistent with observations, and the de Sitter conjecture.
A turn in a computation of a pushdown automaton is a switch from a phase in which the height of the pushdown store increases to a phase in which it decreases. Given a pushdown or one-counter automaton, we consider, for each string in its language, the minimum number of turns made in accepting computations. We prove that it cannot be decided whether this number is bounded by any constant. Furthermore, we obtain a non-recursive trade-off between pushdown and one-counter automata accepting in a finite number of turns and finite-turn pushdown automata, which are defined by requiring that the constant bound be satisfied by each accepting computation. We prove that there are languages accepted in a sublinear but not constant number of turns with respect to the input length. Furthermore, there exists an infinite proper hierarchy of complexity classes, with the number of turns bounded by different sublinear functions. In addition, there is a language requiring a number of turns which is not constant but grows more slowly than each of the functions defining the above hierarchy.
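The definition of a turn in the first sentence can be made concrete with a small sketch (an illustration of the definition, not code from the paper): given the sequence of pushdown-store heights along a computation, count the switches from an increasing phase to a decreasing one.

```python
# Count turns in a pushdown computation, given the sequence of
# store heights: a turn is a switch from an increasing phase to a
# decreasing phase (plateaus do not end a phase).
def count_turns(heights):
    turns = 0
    direction = 0  # +1 while increasing, -1 while decreasing
    for prev, cur in zip(heights, heights[1:]):
        if cur > prev:
            direction = 1
        elif cur < prev:
            if direction == 1:
                turns += 1  # phase switched from up to down
            direction = -1
    return turns
```

For example, the height profile 0,1,2,1,2,3,2,1,0 has two peaks and therefore two turns.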
Large Language Models (LLMs) increasingly rely on multi-turn reasoning and interaction, such as adaptive retrieval-augmented generation (RAG) and ReAct-style agents, to answer difficult questions. These methods improve accuracy by iteratively retrieving information, reasoning, or acting, but introduce a key challenge: \textbf{When should the model stop?} Existing approaches rely on heuristic stopping rules or fixed turn budgets and provide no formal guarantees that the final prediction still contains the correct answer. This limitation is particularly problematic in high-stakes domains such as finance and healthcare, where unnecessary turns increase cost and latency, while stopping too early risks incorrect decisions. Conformal prediction (CP) provides formal coverage guarantees, but existing LLM-CP methods only apply to a single model output and cannot handle multi-turn pipelines with adaptive stopping. To address this gap, we propose Multi-Turn Language Models with Conformal Prediction (MiCP), the first CP framework for multi-turn reasoning. MiCP allocates different error budgets across turns, enabling the model to stop early while maintaining an overall coverage guarantee. We de
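One simple way to allocate error budgets across turns, consistent with the abstract's description but with a schedule of our own choosing (the geometric halving below is an illustrative assumption, not MiCP's actual allocation), is to split the overall miscoverage level $\alpha$ so that the per-turn levels sum to $\alpha$; a union bound over turns then preserves the overall $1-\alpha$ coverage guarantee regardless of when the pipeline stops.

```python
# Illustrative per-turn miscoverage schedule (our assumption, not
# MiCP's allocation): alpha/2, alpha/4, ..., remainder to last turn,
# so that sum(alpha_t) == alpha and a union bound gives 1 - alpha
# overall coverage.
def turn_alphas(alpha, max_turns):
    alphas = [alpha / 2 ** t for t in range(1, max_turns)]
    alphas.append(alpha - sum(alphas))  # assign the rest to the final turn
    return alphas
```

Front-loading the budget like this lets early turns use larger conformal sets (and thus stop early more often) while later turns are held to tighter levels; other schedules trade this off differently.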
Instruction-tuned language models increasingly rely on large multi-turn dialogue corpora, but these datasets are often noisy and structurally inconsistent, with topic drift, repetitive chitchat, and mismatched answer formats across turns. We address this from a data selection perspective and propose \textbf{MDS} (Multi-turn Dialogue Selection), a dialogue-level framework that scores whole conversations rather than isolated turns. MDS combines a global coverage stage that performs bin-wise selection in the user-query trajectory space to retain representative yet non-redundant dialogues, with a local structural stage that evaluates within-dialogue reliability through entity-grounded topic grounding and information progress, together with query-answer form consistency for functional alignment. MDS outperforms strong single-turn selectors, dialogue-level LLM scorers, and heuristic baselines on three multi-turn benchmarks and an in-domain Banking test set, achieving the best overall rank across reference-free and reference-based metrics, and is more robust on long conversations under the same training budget. Code and resources are included in the supplementary materials.
Multi-turn jailbreak attacks are effective against text-only large language models (LLMs) by gradually introducing malicious content across turns. When extended to large vision-language models (LVLMs), we find that naively adding visual inputs can cause existing multi-turn jailbreaks to be easily defended against. For example, overly malicious visual input will easily trigger the defense mechanism of safety-aligned LVLMs, making the response more conservative. To address this, we propose MAPA: a multi-turn adaptive prompting attack that 1) at each turn, alternates text-vision attack actions to elicit the most malicious response; and 2) across turns, adjusts the attack trajectory through iterative back-and-forth refinement to gradually amplify response maliciousness. This two-level design enables MAPA to consistently outperform state-of-the-art methods, improving attack success rates by 11-35% on recent benchmarks against LLaVA-V1.6-Mistral-7B, Qwen2.5-VL-7B-Instruct, Llama-3.2-Vision-11B-Instruct and GPT-4o-mini.
Adherent cells have long been known to display two modes during migration: a faster mode that is persistent in direction and a slower one in which they turn. Compared to the persistent mode, the turns are less studied. Here we develop a simple yet effective protocol to isolate the turns quantitatively. With the protocol, we study different adherent cells in different morphological states and find that, during turns, the cells behave as rotors with constant turning rates but random turning directions. To perform tactic motion, the cells bias the sign of turning towards the stimuli. Our results clarify the bimodal kinematics of adherent cell migration. Compared to the widely implemented rotational-diffusion-based turning dynamics, our data reveal a distinct picture, in which turns are governed by a deterministic angular velocity.
LLM-powered coding agents, which operate in iterative loops (turns) to solve software engineering tasks, are becoming increasingly powerful. However, their practical deployment is hindered by significant and unpredictable costs. This challenge arises from a combination of factors: quadratically growing token counts with each turn, the high price of models, the large number of turns required for real-world tasks, and the tendency of agents to take inefficient or unnecessary actions. While existing research focuses on optimizing individual turns, the strategic control of the total number of turns remains an underexplored area for managing agent performance and cost. To address this gap, we conduct a comprehensive empirical study on SWE-bench using three state-of-the-art models and evaluate the impact of three distinct turn-control strategies: an unrestricted baseline, a fixed-turn limit with reminders, and a novel dynamic-turn strategy that grants extensions on-demand. Our findings first reveal a fundamental trade-off in the unrestricted setting, where no single model excels across performance, cost, and turn efficiency. We then show that a fixed-turn limit, specifically at the 75th
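A dynamic-turn strategy of the kind described, granting extensions on demand, can be sketched as a control loop. The `step` callback, the base limit, and the extension sizes below are hypothetical choices for illustration, not the study's actual setup:

```python
# Sketch of an on-demand turn-extension loop (hypothetical API: `step`
# runs one agent turn and reports (done, wants_more_turns)).
def run_agent(step, base_limit=10, extension=5, max_extensions=2):
    """Run the agent until done or the (possibly extended) limit is hit.

    Returns the number of turns consumed.
    """
    limit, granted, turn = base_limit, 0, 0
    while turn < limit:
        done, wants_more = step(turn)
        turn += 1
        if done:
            return turn
        # Grant an extension only at the limit, and only a bounded number
        # of times, so total cost stays predictable.
        if wants_more and turn == limit and granted < max_extensions:
            limit += extension
            granted += 1
    return turn
```

The appeal of this shape is that cost is bounded by `base_limit + max_extensions * extension` while agents that genuinely need more turns are not cut off at a hard wall.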
A finite impartial game is a two-player game in which the players take turns making moves and the game ends after finitely many moves. In this paper, we study a class of finite impartial games introduced by H.~Lenstra, which we call coin turning games. We focus on two typical classes of coin turning games, namely the order ideal games and the rulers, distinguished by their choices of turning sets. For several posets arising from enumerative combinatorics, we determine the Sprague-Grundy functions. In particular, we determine the Sprague-Grundy function of the order ideal game on the ASM poset, introduced by J.~Striker in connection with the alternating sign matrices.
While LLMs demonstrate strong reasoning capabilities when provided with full information in a single turn, they exhibit substantial vulnerability in multi-turn interactions. Specifically, when information is revealed incrementally or requires updates, models frequently fail to integrate new constraints, leading to a collapse in performance compared to their single-turn baselines. We term the root cause \emph{Contextual Inertia}: a phenomenon in which models rigidly adhere to previous reasoning traces. Even when users explicitly provide corrections or new data in later turns, the model ignores them, preferring to maintain consistency with its previous (incorrect) reasoning path. To address this, we introduce \textbf{R}einforcement \textbf{L}earning with \textbf{S}ingle-\textbf{T}urn \textbf{A}nchors (\textbf{RLSTA}), a generalizable training approach designed to stabilize multi-turn interaction across diverse scenarios and domains. RLSTA leverages the model's superior single-turn capabilities as stable internal anchors that provide reward signals. By aligning multi-turn responses with these anchors, RLSTA empowers models to break contextual inertia and self-calibrate their reasoning b
Large language models (LLMs) achieve high accuracy in medical diagnosis when all clinical information is provided in a single turn, yet how they behave under multi-turn evidence accumulation, closer to real clinical reasoning, remains unexplored. We introduce MINT (Medical Incremental N-Turn Benchmark), a high-fidelity, multi-turn medical diagnosis benchmark comprising 1,035 cases with clinically labeled evidence shards, controlled turn granularity, and information-preserving decomposition. Through systematic evaluation of 11 LLMs on MINT, we uncover three persistent behavioral patterns that significantly impact diagnostic decisions: (1) intent to answer: models rush to answer before sufficient evidence has been observed, with over 55% of answers committed within the first two turns; (2) self-correction: incorrect-to-correct answer revisions occur at up to 10.6 times the rate of correct-to-incorrect flips, revealing a latent capacity for self-correction that premature commitment forecloses; and (3) strong lures: clinically salient information such as laboratory results triggers premature answering even when models are explicitly instructed to wait. We translate these findings into cli
Training Large Language Models (LLMs) for multi-turn Tool-Integrated Reasoning (TIR) - where models iteratively reason, generate code, and verify through execution - remains challenging for existing reinforcement learning (RL) approaches. Current RL methods, exemplified by Group Relative Policy Optimization (GRPO), suffer from coarse-grained, trajectory-level rewards that provide insufficient learning signals for complex multi-turn interactions, leading to training stagnation. To address this issue, we propose Group Turn Policy Optimization (GTPO), a novel RL algorithm specifically designed for training LLMs on multi-turn TIR tasks. GTPO introduces three key innovations: (1) turn-level reward assignment that provides fine-grained feedback for individual turns, (2) return-based advantage estimation where normalized discounted returns are calculated as advantages, and (3) self-supervised reward shaping that exploits self-supervision signals from generated code to densify sparse binary outcome-based rewards. Our comprehensive evaluation demonstrates that GTPO outperforms GRPO by 3.0% across diverse math reasoning benchmarks, establishing its effectiveness. GTPO also improves GRPO by 3
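The return-based advantage estimation in innovation (2) can be sketched as follows. This is a minimal illustration of discounted turn-level returns normalized within a group of sampled trajectories; the function names, the discount factor, and the normalization details are our assumptions, following the abstract's description rather than the paper's exact formulation:

```python
# Sketch of turn-level return-based advantages (illustrative, not the
# paper's exact formulation).
def turn_returns(rewards, gamma=0.99):
    """Per-turn rewards of one trajectory -> discounted returns G_t."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return returns[::-1]

def group_advantages(group_returns):
    """Normalize returns across a group of trajectories (pooled over turns)."""
    flat = [g for traj in group_returns for g in traj]
    mean = sum(flat) / len(flat)
    var = sum((x - mean) ** 2 for x in flat) / len(flat)
    std = var ** 0.5 or 1.0  # guard against a degenerate group
    return [[(g - mean) / std for g in traj] for traj in group_returns]
```

Compared with a single trajectory-level reward, each turn here receives its own return, so early turns that set up a successful final answer get credit even when only the last turn is directly rewarded.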