Search - ResearchTracker

Head Forcing: Long Autoregressive Video Generation via Head Heterogeneity

Autoregressive (AR) video diffusion models support real-time synthesis but suffer from error accumulation and context loss over long horizons. We discover that attention heads in AR video diffusion transformers serve functionally distinct roles: local heads for detail refinement, anchor heads for structural stabilization, and memory heads for long-range context aggregation. Existing methods nevertheless treat all heads uniformly, leading to suboptimal KV cache allocation. We propose Head Forcing, a training-free framework that assigns each head type a tailored KV cache strategy: local and anchor heads retain only essential tokens, while memory heads employ a hierarchical memory system with dynamic episodic updates for long-range consistency. A head-wise RoPE re-encoding scheme further ensures that positional encodings remain within the pretrained range. Without additional training, Head Forcing extends generation from 5 seconds to minute-level duration, supports multi-prompt interactive synthesis, and consistently outperforms existing baselines. Project Page: https://jiahaotian-sjtu.github.io/headforcing.github.io/.

One Adaptive Trailing Head Can Outperform Many Oblivious Trailing Heads

arXiv, 2026-05-28. Authors: Julianne Cruz, Sho Glashausser, Neil Lutz

In the setting of multi-head finite-state dimensions, trailing heads lag behind a leading head, accessing past data to aid a finite-state gambler placing bets on successive bits read by the leading head. Cruz, Glashausser, Li, and Lutz (2026) proved that, for any fixed number of trailing heads, adaptive (data-dependent) movement rules can strictly outperform oblivious (data-independent) movement schedules. In this paper we strengthen that separation by proving that a single trailing head with adaptive movements can outperform, by a large and uniform margin, arbitrarily many trailing heads with oblivious movements. Formally, our main theorem states that there is a binary sequence whose adaptive two-head finite-state strong dimension is less than its oblivious multi-head finite-state dimension, and that the gap is greater than 0.3.

Search results: head

Head Forcing: Long Autoregressive Video Generation via Head Heterogeneity

One Adaptive Trailing Head Can Outperform Many Oblivious Trailing Heads

Interleaved Head Attention

The Anxiety of Influence: Bloom Filters in Transformer Attention Heads

Adaptive Head Budgeting for Efficient Multi-Head Attention

OSM-Net: One-to-Many One-shot Talking Head Generation with Spontaneous Head Motions

From Blurry to Believable: Enhancing Low-quality Talking Heads with 3D Generative Priors

Location-guided Head Pose Estimation for Fisheye Image

Surgical Repair of Collapsed Attention Heads in ALiBi Transformers

Few-Shot Head Swapping in the Wild

Data-driven Head Motion Generation through Natural Gaze-Head Coordination

From Neck to Head: Bio-Impedance Sensing for Head Pose Estimation

Borrowed Geometry: Cross-Distribution Head-Importance Fingerprints of Frozen Pretrained Gemma 4 31B

Improving Vision Transformers by Overlapping Heads in Multi-Head Self-Attention

FONT: Flow-guided One-shot Talking Head Generation with Natural Head Motions

The Eye-Head Mover Spectrum: Modelling Individual and Population Head Movement Tendencies in Virtual Reality

RPEE-HEADS: A Novel Benchmark for Pedestrian Head Detection in Crowd Videos

Semi-Supervised Unconstrained Head Pose Estimation in the Wild

SVG-Head: Hybrid Surface-Volumetric Gaussians for High-Fidelity Head Reconstruction and Real-Time Editing

SyncTalk: The Devil is in the Synchronization for Talking Head Synthesis