Existing temporal QA benchmarks focus on simple fact-seeking queries from news corpora, while reasoning-intensive retrieval benchmarks lack temporal grounding. However, real-world information needs often require reasoning about temporal evolution and synthesizing evidence across time periods. We introduce TEMPO, the first benchmark combining temporal reasoning with reasoning-intensive retrieval across 13 domains. TEMPO features: (1) 1,730 complex queries requiring deep temporal reasoning such as tracking changes, identifying trends, or comparing cross-period evidence; (2) step-wise retrieval planning with 3,976 decomposed steps and gold documents mapped to each step for multi-hop evaluation; and (3) novel temporal metrics including Temporal Coverage@k and Temporal Precision@k measuring whether results span required time periods. Evaluation of 12 retrieval systems reveals substantial challenges: the best model (DiVeR) achieves only 32.0 NDCG@10 and 71.4\% Temporal Coverage@10, demonstrating difficulty in retrieving temporally complete evidence. We believe TEMPO provides a challenging benchmark for improving temporal reasoning in retrieval and RAG systems. Our code and data are avail
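The abstract names Temporal Coverage@k and Temporal Precision@k but does not give their formulas. The following is a minimal sketch under an assumed reading: coverage is the share of required time periods that appear among the top-k results, and precision is the share of top-k results that fall in a required period. The function names and the representation of periods as string labels are hypothetical, not taken from the TEMPO code.

```python
from typing import Sequence, Set


def temporal_coverage_at_k(retrieved_periods: Sequence[str],
                           required_periods: Set[str], k: int) -> float:
    """Assumed definition: fraction of required periods seen in the top-k results."""
    if not required_periods:
        return 1.0  # nothing to cover
    seen = set(retrieved_periods[:k])
    return len(seen & required_periods) / len(required_periods)


def temporal_precision_at_k(retrieved_periods: Sequence[str],
                            required_periods: Set[str], k: int) -> float:
    """Assumed definition: fraction of top-k results that lie in a required period."""
    top = list(retrieved_periods[:k])
    if not top:
        return 0.0
    return sum(p in required_periods for p in top) / len(top)


# Hypothetical ranking: each entry is the time period of one retrieved document.
ranking = ["2019", "2019", "2021", "2015"]
required = {"2019", "2020", "2021"}
coverage = temporal_coverage_at_k(ranking, required, k=3)
precision = temporal_precision_at_k(ranking, required, k=3)
```

Under this reading, a ranking can be fully precise (every top-k document is in a required period) while still missing one of the required periods, which is the gap the coverage metric is meant to expose.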
Event-Based Models (EBMs) infer biomarker progression from cross-sectional data but typically only as ordinal sequences and rely on rigid model assumptions. We propose \textsc{Tempo}, a Transformer architecture that learns both ordinal and continuous event sequences through simulation-based supervised learning. \textsc{Tempo} uses two Transformer modules: one treats biomarkers as tokens to infer event sequencing; the other treats patients as tokens, representing each by their per-biomarker abnormality profile, to infer patients' disease stages. On synthetic benchmarks, \textsc{Tempo} reduces normalized Kendall's Tau distance by 52.89\% and staging MAE by 25.33\% compared to state-of-the-art SA-EBM, with larger reductions in high-dimensional settings (58.88\% and 61.10\%). Applied to ADNI, \textsc{Tempo} recovers a biologically plausible Alzheimer's progression: early medial temporal atrophy, followed by amyloid accumulation and cognitive decline, and late-stage tau pathology with terminal acceleration of global neurodegeneration -- broadly consistent with established disease models. \textsc{Tempo} also eliminates the need to derive custom inference algorithms and enables rapid empi
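As an illustration of the ordering metric mentioned above, here is a minimal sketch of a normalized Kendall's Tau distance between two event sequences: the number of biomarker pairs ordered differently in the two sequences, divided by the total number of pairs. This is the textbook definition; whether the paper normalizes in exactly this way is an assumption, and the biomarker labels are placeholders.

```python
from itertools import combinations
from typing import Hashable, Sequence


def normalized_kendall_tau_distance(order_a: Sequence[Hashable],
                                    order_b: Sequence[Hashable]) -> float:
    """Fraction of item pairs whose relative order differs between two rankings."""
    if len(order_a) != len(order_b) or set(order_a) != set(order_b):
        raise ValueError("both orderings must contain the same items")
    n = len(order_a)
    if n < 2:
        return 0.0
    pos_b = {item: i for i, item in enumerate(order_b)}
    # combinations() yields (x, y) with x before y in order_a;
    # the pair is discordant if x comes after y in order_b.
    discordant = sum(1 for x, y in combinations(order_a, 2) if pos_b[x] > pos_b[y])
    return discordant / (n * (n - 1) / 2)


# Placeholder event sequences over three hypothetical biomarkers.
true_order = ["A", "B", "C"]
estimated_order = ["A", "C", "B"]
distance = normalized_kendall_tau_distance(true_order, estimated_order)
```

A value of 0 means the two sequences agree on every pair, and 1 means one sequence is the exact reverse of the other.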
Music to 3D dance generation aims to synthesize realistic and rhythmically synchronized human dance from music. While existing methods often rely on additional genre labels to further improve dance generation, such labels are typically noisy, coarse, unavailable, or insufficient to capture the diversity of real-world music, which can result in rhythm misalignment or stylistic drift. In contrast, we observe that tempo, a core property reflecting musical rhythm and pace, remains relatively consistent across datasets and genres, typically ranging from 60 to 200 BPM. Based on this finding, we propose TempoMoE, a hierarchical tempo-aware Mixture-of-Experts module that enhances the diffusion model and its rhythm perception. TempoMoE organizes motion experts into tempo-structured groups for different tempo ranges, with multi-scale beat experts capturing fine- and long-range rhythmic dynamics. A Hierarchical Rhythm-Adaptive Routing dynamically selects and fuses experts from music features, enabling flexible, rhythm-aligned generation without manual genre labels. Extensive experiments demonstrate that TempoMoE achieves state-of-the-art results in dance quality and rhythm alignment.
Backtesting large language models on historical events requires reasoning exclusively from information available before a specified cutoff date. Yet models routinely leak post-cutoff knowledge from pre-training into their reasoning, inflating apparent accuracy and undermining evaluation validity. Prompt-based constraints fail when suppressed content is causally related to the prediction, and knowledge unlearning cannot address this problem because temporal compliance is instance-specific: the same fact may be legitimate evidence for one cutoff date and a violation for another. Rather than erasing knowledge, the model must learn temporal discipline: selecting evidence conditioned on each instance's cutoff date. We propose TEMPO (Temporal Enforcement via Mode-separated Policy Optimization), which trains this discipline via two contributions: (1) a two-mode reward where a leakage mode drives post-cutoff claims to zero as a hard prerequisite before a performance mode optimizes task performance; and (2) a GRPO-based training pipeline that enables the model to discover temporally valid reasoning strategies. We prove that training monotonically decreases leakage, converges to the leak-fre
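The two-mode reward described above can be pictured with a small sketch. The abstract does not give the functional form, so everything here is an assumption: the function name, the linear leak penalty, and the convention that the task score lies in [0, 1] are all hypothetical. The only property taken from the text is that leakage acts as a hard prerequisite, so task performance is rewarded only once post-cutoff claims reach zero.

```python
def two_mode_reward(num_post_cutoff_claims: int, task_score: float,
                    leak_penalty: float = 1.0) -> float:
    """Hypothetical gated reward.

    Leakage mode: any post-cutoff claim yields a negative reward that
    grows with the number of claims, and the task score is ignored.
    Performance mode: with zero leaks, the reward is the task score
    (assumed to lie in [0, 1]).
    """
    if num_post_cutoff_claims < 0:
        raise ValueError("claim count cannot be negative")
    if num_post_cutoff_claims > 0:
        return -leak_penalty * num_post_cutoff_claims
    return task_score


# A leaking but accurate response scores below a leak-free, weaker one.
leaky = two_mode_reward(num_post_cutoff_claims=2, task_score=0.9)
clean = two_mode_reward(num_post_cutoff_claims=0, task_score=0.4)
```

With this gating, no amount of task accuracy can compensate for a leak, which is one simple way to encode "leak-free first, performance second".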
Deep learning (DL) algorithms are often defined in terms of temporal relationships: a tensor at one timestep may depend on tensors from earlier or later timesteps. Such dynamic dependencies (and corresponding dynamic tensor shapes) are difficult to express and optimize: while eager DL systems support such dynamism, they cannot apply compiler-based optimizations; graph-based systems require static tensor shapes, which forces users to pad tensors or break up programs into multiple static graphs. We describe Tempo, a new DL system that combines the dynamism of eager execution with the whole-program optimizations of graph-based compilation. Tempo achieves this through a declarative programming model with recurrent tensors, which include explicit temporal dimensions. Temporal dimensions can be indexed using symbolic expressions to express dynamic dependencies on past and future tensors. Based on this, Tempo constructs a symbolic dependence graph, which concisely encodes dynamic dependencies between operators, and applies whole-program optimizations, such as algebraic simplifications, vectorization, tiling, and fusion. By tiling dynamic dependencies into static-size blocks, Tempo can als
Sonification offers a non-visual way to understand data, with pitch-based encodings being the most common. Yet how well people perceive slope and acceleration, key features of data trends, remains poorly understood. Drawing on people's natural abilities to perceive tempo, we introduce a novel sampling method for pitch-based sonification to enhance the perception of slope and acceleration in univariate functions. While traditional sonification methods often sample data at uniform x-spacing, yielding notes played at a fixed tempo with variable pitch intervals (Variable Pitch Interval), our approach samples at uniform y-spacing, producing notes with consistent pitch intervals but variable tempo (Variable Tempo). We conducted psychoacoustic experiments to understand slope and acceleration perception across three sampling methods: Variable Pitch Interval, Variable Tempo, and a Continuous (no sampling) baseline. In slope comparison tasks, Variable Tempo was more accurate than the other methods when modulated by the magnitude ratio between slopes. For acceleration perception, just-noticeable differences under Variable Tempo were over 13 times finer than with other methods. Participants als
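The difference between the two sampling schemes can be made concrete with a short sketch. It assumes a strictly increasing function so that each target y value has a unique x, and it uses bisection to invert the function; both the helper names and the bisection approach are illustrative choices, not the authors' implementation. Here x plays the role of note onset time and y the role of pitch.

```python
from typing import Callable, List, Tuple

Point = Tuple[float, float]  # (onset time x, pitch value y)


def sample_uniform_x(f: Callable[[float], float], x0: float, x1: float,
                     n: int) -> List[Point]:
    """Variable Pitch Interval: equal time steps, pitch steps follow the data."""
    xs = [x0 + i * (x1 - x0) / (n - 1) for i in range(n)]
    return [(x, f(x)) for x in xs]


def sample_uniform_y(f: Callable[[float], float], x0: float, x1: float,
                     n: int, tol: float = 1e-9) -> List[Point]:
    """Variable Tempo: equal pitch steps, time steps follow the data.

    Assumes f is strictly increasing on [x0, x1] (an assumption of this sketch).
    """
    y0, y1 = f(x0), f(x1)
    points = []
    for i in range(n):
        target = y0 + i * (y1 - y0) / (n - 1)
        lo, hi = x0, x1
        while hi - lo > tol:  # bisection: find x with f(x) == target
            mid = (lo + hi) / 2
            if f(mid) < target:
                lo = mid
            else:
                hi = mid
        points.append(((lo + hi) / 2, target))
    return points


def square(x: float) -> float:
    return x * x


fixed_tempo = sample_uniform_x(square, 0.0, 2.0, 5)
variable_tempo = sample_uniform_y(square, 0.0, 2.0, 5)
```

For an accelerating function such as x squared, the uniform-y onsets bunch together as the curve steepens, so a steeper slope is heard as a faster tempo rather than as a larger pitch jump.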
Video-to-audio generation has made significant progress in achieving semantic consistency and temporal alignment from silent videos. However, audio contains rich stylistic attributes such as timbre and tempo that are difficult to infer from visual and textual inputs alone. While reference audio can serve as additional conditioning, it is typically treated as a holistic signal, limiting fine-grained style control. We propose AudioIM, an attribute-aware framework that explicitly models timbre and tempo as separate control factors rather than relying on holistic prompt conditioning. Dual encoders extract complementary timbre-related and tempo-related representations, which are injected through global conditioning. A masking-based training strategy enables effective latent prompt conditioning at inference. Experiments on VGGSound show improved style similarity while preserving semantic alignment and synchronization. Audio samples are available at: https://anonymousdemo757.github.io/.
This paper examines the evolving performance practices of Ludwig van Beethoven's cello sonatas, with a particular focus on tempo and portamento between 1930 and 2012. It integrates analyses of 22 historical recordings and advancements in recording technology to shed light on changes in interpretative approaches. By comparing Beethoven's metronome markings, as understood through contemporaries such as Czerny and Moscheles, with their application in modern performances, my research highlights notable deviations. These differences illustrate the challenges performers face in reconciling historical tempos with the demands of contemporary performance practice. My study pays special attention to the diminishing use of audible portamento in the latter half of the 20th century, contrasted with a gradual increase in tempo after 1970. This development is linked to broader cultural and pedagogical shifts, including the adoption of fingering techniques that reduce hand shifts, thereby facilitating greater technical precision at faster tempos. Nonetheless, my study identifies the persistence of 'silent portamento' as an expressive device, allowing performers to retain stylistic expression without compr
Action visual tempo characterizes the dynamics and the temporal scale of an action, which helps distinguish human actions that share high similarities in visual dynamics and appearance. Previous methods capture the visual tempo either by sampling raw videos at multiple rates, which requires a costly multi-layer network to handle each rate, or by hierarchically sampling backbone features, which relies heavily on high-level features that miss fine-grained temporal dynamics. In this work, we propose a Temporal Correlation Module (TCM), which can be easily embedded into current action recognition backbones in a plug-and-play manner, to extract action visual tempo from low-level backbone features at a single layer. Specifically, our TCM contains two main components: a Multi-scale Temporal Dynamics Module (MTDM) and a Temporal Attention Module (TAM). MTDM applies a correlation operation to learn pixel-wise fine-grained temporal dynamics for both fast-tempo and slow-tempo. TAM adaptively emphasizes expressive features and suppresses inessential ones by analyzing the global information across various tempos. Extensive experiments conducted on several action recognit
Robotic policy learning for complex real-world manipulation tasks has seen rapid recent progress, enabled in large part by the ability to collect demonstrations through human operation. However, policies trained from such demonstrations often execute tasks far more slowly than the robot's physical capabilities, as demonstration data is collected under practical constraints that favor conservative, success-oriented trajectories over execution speed. Existing policy acceleration methods determine execution tempo through data preprocessing or heuristic rules, rather than learning execution speed optimized for the task. In this paper, we propose SpeedAug, a policy acceleration framework that enables policies to learn task-optimal execution tempo via reinforcement learning (RL). SpeedAug first learns a tempo-enriched prior policy from speed-augmented demonstrations that captures diverse execution tempos. Building on this tempo-enriched prior, RL fine-tuning guides exploration to refine action trajectories and optimize execution tempo efficiently. Experiments on robotic manipulation benchmarks demonstrate that SpeedAug substantially improves the sample efficiency of policy acceleration w
We present TEMPO, a global, temporally resolved dataset of building density and height derived from high-resolution satellite imagery using deep learning models. We pair building footprint and height data from existing datasets with quarterly PlanetScope basemap satellite images to train a multi-task deep learning model that predicts building density and building height at a 37.6-meter per pixel resolution. We apply this model to global PlanetScope basemaps from Q1 2018 through Q2 2025 to create global, temporal maps of building density and height. We validate these maps by comparing against existing building footprint datasets. Our estimates achieve an F1 score between 85% and 88% on different hand-labeled subsets, and are temporally stable, with a 0.96 five-year trend-consistency score. TEMPO captures quarterly changes in built settlements at a fraction of the computational cost of comparable approaches, unlocking large-scale monitoring of development patterns and climate impacts essential for global resilience and adaptation efforts.
Empirical studies of recorded performance have conventionally modelled tempo change as a unidirectional historical process, fitting linear regression lines to tempo data plotted against recording year. This paper argues that such approaches impose a false narrative of uniform stylistic evolution on what is, in fact, a plurality of coexisting interpretive traditions. Applying k-means clustering (k=3) to bar-level BPM data from over one hundred recordings of Beethoven's five piano and cello sonatas (Op. 5 Nos. 1 and 2; Op. 69; Op. 102 Nos. 1 and 2) spanning 1930-2012, this study reveals that every movement supports at least two, and usually three, discrete tempo traditions (slow, mid-range, and fast), whose internal regression slopes are negligible (R-squared <= 0.25 in all but one case), demonstrating that each tradition is independently stable across eight decades. The mid-range cluster dominates in all movements, typically comprising 55-70% of recordings. A slow cluster is absent from fast-character movements (Op. 5 Rondos, Op. 69 Scherzo), reflecting a shared rhetorical consensus about their character. The single case of significant intra-cluster drift (Op. 102 No. 1 Allegro c
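To illustrate the clustering step, here is a minimal one-dimensional k-means sketch (Lloyd's algorithm) with k=3. The BPM values are hypothetical placeholders, not data from the study, and the deterministic initialization from the sorted values is a choice made for this sketch; the paper does not say which implementation or initialization it used.

```python
from typing import List, Sequence, Tuple


def kmeans_1d(values: Sequence[float], k: int = 3,
              max_iters: int = 100) -> Tuple[List[float], List[List[float]]]:
    """Cluster scalar values into k groups; returns (centroids, clusters)."""
    data = sorted(values)
    if len(data) < k:
        raise ValueError("need at least k values")
    # Deterministic start: centroids at evenly spaced positions of the sorted data.
    centroids = [data[(2 * i + 1) * len(data) // (2 * k)] for i in range(k)]

    def assign(cents: List[float]) -> List[List[float]]:
        groups: List[List[float]] = [[] for _ in range(k)]
        for v in data:
            nearest = min(range(k), key=lambda j: abs(v - cents[j]))
            groups[nearest].append(v)
        return groups

    for _ in range(max_iters):
        clusters = assign(centroids)
        # Keep the old centroid if a cluster happens to be empty.
        updated = [sum(c) / len(c) if c else centroids[j]
                   for j, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(centroids)


# Hypothetical mean tempi (BPM) for recordings of a single movement.
bpm = [58, 60, 62, 72, 74, 75, 76, 78, 90, 92, 94]
centroids, clusters = kmeans_1d(bpm, k=3)
```

Each resulting cluster can then be regressed against recording year on its own, which is how a flat within-cluster slope is distinguished from an apparent overall trend.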
This paper addresses the problem of global tempo estimation in musical audio. Given that annotating tempo is time-consuming and requires musical expertise, few publicly available data sources exist to train machine learning models for this task. Towards alleviating this issue, we propose a fully self-supervised approach that does not rely on any human-labeled data. Our method builds on the fact that generic (music) audio embeddings already encode a variety of properties, including information about tempo, making them easily adaptable for downstream tasks. While recent work in self-supervised tempo estimation aimed to learn a tempo-specific representation that was subsequently used to train a supervised classifier, we reformulate the task into the binary classification problem of predicting whether a target track has the same or a different tempo compared to a reference. While the former still requires labeled training data for the final classification model, our approach uses arbitrary unlabeled music data in combination with time-stretching for model training as well as a small set of synthetically created reference samples for predicting the final tempo. Evaluation of our
Wearable sensing devices, such as Holter monitors, will play a crucial role in the future of digital health. Unsupervised learning frameworks such as Self-Supervised Learning (SSL) are essential to map these single-lead electrocardiogram (ECG) signals to their anticipated clinical outcomes. These signals are characterized by a tempo-variant component whose patterns evolve through the recording and an invariant component with patterns that remain unchanged. However, existing SSL methods only drive the model to encode the invariant attributes, leading the model to neglect tempo-variant information which reflects subject-state changes through time. In this paper, we present Parallel-Learning of Invariant and Tempo-variant Attributes (PLITA), a novel SSL method designed for capturing both invariant and tempo-variant ECG attributes. The latter are captured by requiring representations to be closer in space for inputs that are closer in time. We evaluate both the capability of the method to learn the attributes of these two distinct kinds, as well as PLITA's performance compared to existing SSL methods for ECG analysis. PLITA performs significantly better in the set-ups where tempo-variant attributes
Recently, automatic music transcription has made it possible to convert musical audio into accurate MIDI. However, the resulting MIDI lacks musical notation such as tempo, which hinders its conversion into sheet music. In this paper, we investigate state-of-the-art tempo estimation techniques and evaluate their performance on solo instrumental music. These include temporal convolutional network (TCN) and recurrent neural network (RNN) models that are pretrained on massive amounts of mixed vocal and instrumental music, as well as TCN models trained specifically with solo instrumental performances. Through evaluations on drum, guitar, and classical piano datasets, our TCN models with the new training scheme achieved the best performance. Our newly trained TCN model increases the Acc1 metric by 38.6% for guitar tempo estimation, compared to the pretrained TCN model with an Acc1 of 61.1%. Although our trained TCN model is twice as accurate as the pretrained TCN model in estimating classical piano tempo, its Acc1 is only 50.9%. To improve the performance of deep learning models, we investigate their combinations with various post-processing methods. These post-processing techniques effectively
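For readers unfamiliar with Acc1, the sketch below assumes the conventional definition from the tempo estimation literature: an estimate counts as correct when it lies within 4% of the reference tempo, with no credit for octave errors such as double or half tempo. The abstract itself does not define the metric, so the tolerance and the example values are assumptions.

```python
from typing import Sequence


def acc1(estimates: Sequence[float], references: Sequence[float],
         tolerance: float = 0.04) -> float:
    """Share of estimates within +/- tolerance (relative) of the reference tempo."""
    if len(estimates) != len(references) or not references:
        raise ValueError("need two non-empty sequences of equal length")
    hits = sum(abs(est - ref) <= tolerance * ref
               for est, ref in zip(estimates, references))
    return hits / len(references)


# Hypothetical tempi in BPM; the second estimate is a double-tempo (octave) error.
reference_bpm = [120.0, 90.0, 140.0, 100.0]
estimated_bpm = [121.0, 180.0, 139.0, 110.0]
score = acc1(estimated_bpm, reference_bpm)
```

Because octave errors count as misses under this definition, a model that consistently reports double tempo can score poorly on Acc1 even when its beat tracking is otherwise sound.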
Temporal predictive models have the potential to improve decisions in health care, public services, and other domains, yet they often fail to effectively support decision-makers. Prior literature shows that many misalignments between model behavior and decision-makers' expectations stem from issues of model specification, namely how, when, and for whom predictions are made. However, model specifications for predictive tasks are highly technical and difficult for non-data-scientist stakeholders to interpret and critique. To address this challenge we developed Tempo, an interactive system that helps data scientists and domain experts collaboratively iterate on model specifications. Using Tempo's simple yet precise temporal query language, data scientists can quickly prototype specifications with greater transparency about pre-processing choices. Moreover, domain experts can assess performance within data subgroups to validate that models behave as expected. Through three case studies, we demonstrate how Tempo helps multidisciplinary teams quickly prune infeasible specifications and identify more promising directions to explore.
Test-time training (TTT) adapts model parameters on unlabeled test instances during inference time, which continuously extends capabilities beyond the reach of offline training. Despite initial gains, existing TTT methods for LRMs plateau quickly and do not benefit from additional test-time compute. Without external calibration, the self-generated reward signal increasingly drifts as the policy model evolves, leading to both performance plateaus and diversity collapse. We propose TEMPO, a TTT framework that interleaves policy refinement on unlabeled questions with periodic critic recalibration on a labeled dataset. By formalizing this alternating procedure through the Expectation-Maximization (EM) algorithm, we reveal that prior methods can be interpreted as incomplete variants that omit the crucial recalibration step. Reintroducing this step tightens the evidence lower bound (ELBO) and enables sustained improvement. Across diverse model families (Qwen3 and OLMO3) and reasoning tasks, TEMPO improves OLMO3-7B on AIME 2024 from 33.0% to 51.1% and Qwen3-14B from 42.3% to 65.8%, while maintaining high diversity.
The launch of the Tropospheric Emissions: Monitoring of Pollution (TEMPO) mission in 2023 marked a new era in air quality monitoring by providing high-frequency, geostationary observations of column NO2 across North America. In this study, we present the first implementation of a TEMPO NO2 data assimilation system using the Joint Effort for Data assimilation Integration (JEDI) framework. Leveraging a four-dimensional ensemble variational (4DEnVar) approach and an Ensemble of Data Assimilations (EDA), we demonstrate a novel capability to assimilate hourly NO2 retrievals from TEMPO alongside polar-orbiting TROPOMI data into NASA's GEOS Composition Forecast (GEOS-CF) model. The system is evaluated over the CONUS region for August 2023, using a suite of independent measurements including Pandora spectrometers, AirNow surface stations, and aircraft-based observations from AEROMMA and STAQS field campaigns. Results show that the assimilation system successfully integrates geostationary NO2 observations, improves model performance in the column, and captures diurnal variability. However, assimilation also leads to systematic reductions in surface NO2 levels, improving agreement with some
Historical metronome indications for Beethoven's five piano and cello sonatas (as transmitted by Czerny, Moscheles, and Kolisch) have long been regarded as problematic by performers and scholars alike. This paper presents the first systematic empirical assessment of those indications against a corpus of over one hundred movement-level recordings spanning 1930--2012, encompassing first, second, and third movements across all five sonatas (Op.~5 Nos.~1 and~2; Op.~69; Op.~102 Nos.~1 and~2). The core findings are threefold. First, Czerny's and Moscheles's markings are consistently and substantially exceeded by the entire recording corpus: gaps of 15--39\% are documented across movements, with the largest divergences in slow Adagio movements and the smallest in fast Allegro finales. Second, Kolisch's 1943 markings align considerably more closely with recorded practice than either Czerny's or Moscheles's, a striking result given that Kolisch was reasoning without corpus data. Third, the central Allegro tempo traditions for each movement are stable across eight decades, not because all performers play alike, but because three coexisting slow, mid-range, and fast traditions persist simult
Audio embeddings enable large-scale comparisons of the similarity of audio files for applications such as search and recommendation. Due to the subjectivity of audio similarity, it can be desirable to design systems that answer not only whether audio is similar, but similar in what way (e.g., with respect to tempo, mood, or genre). Previous works have proposed disentangled embedding spaces where subspaces representing specific, yet possibly correlated, attributes can be weighted to emphasize those attributes in downstream tasks. However, no research has been conducted into the independence of these subspaces, nor their manipulation, in order to retrieve tracks that are similar but different in a specific way. Here, we explore the manipulation of tempo in embedding spaces as a case study towards this goal. We propose tempo translation functions that allow for efficient manipulation of tempo within a pre-existing embedding space whilst maintaining other properties such as genre. As this translation is specific to tempo, it enables retrieval of tracks that are similar but have specifically different tempi. We show that such a function can be used as an efficient data augmentation strategy for both