Search - ResearchTracker

Transformers have become the standard solution for various AI tasks, with the query, key, and value (QKV) attention formulation playing a central role. However, the individual contributions of these three projections, and the impact of omitting some of them, remain poorly understood. We systematically evaluate three projection sharing constraints: a) Q-K=V (shared key-value), b) Q=K-V (shared query-key), and c) Q=K=V (single projection). The last two variants produce symmetric attention maps; to address this, we also explore asymmetric attention via 2D positional encodings. Through experiments spanning synthetic tasks, vision (MNIST, CIFAR, TinyImageNet, anomaly), and language modeling (300M and 1.2B parameter models on 10B tokens), we find that the projection-sharing transformers perform on par with, or occasionally better than, the QKV transformer. In language modeling, Q-K=V projection sharing achieves a 50% KV cache reduction with only 3.1% perplexity degradation. Crucially, projection sharing is complementary to head sharing (GQA/MQA): combining Q-K=V with GQA-4 yields an 87.5% cache reduction, while Q-K=V + MQA achieves 96.9%, enabling practical on-device inference. We show that Q-K=V preserves quality because
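The cache-reduction figures quoted in this abstract can be reproduced with simple counting. The sketch below is an illustration, not the authors' code: it assumes a baseline with 16 attention heads (a value chosen here because it is consistent with the 96.9% figure; the abstract does not state it), that GQA-4 leaves 4 key-value heads, and that sharing K=V halves the tensors stored per key-value head. The function name `kv_cache_reduction` is hypothetical.

```python
from fractions import Fraction


def kv_cache_reduction(num_heads: int, num_kv_heads: int, share_kv: bool) -> Fraction:
    """Fraction of the baseline multi-head KV cache that is saved.

    Baseline: every head caches one K tensor and one V tensor per token.
    num_kv_heads: heads that actually cache state (GQA/MQA reduce this).
    share_kv: if True, K and V are the same tensor, so one is cached per head.
    """
    baseline = 2 * num_heads
    tensors_per_kv_head = 1 if share_kv else 2
    reduced = tensors_per_kv_head * num_kv_heads
    return 1 - Fraction(reduced, baseline)


HEADS = 16  # assumption, not stated in the abstract

for label, kv_heads in [("K=V only", HEADS), ("K=V + GQA-4", 4), ("K=V + MQA", 1)]:
    saved = kv_cache_reduction(HEADS, kv_heads, share_kv=True)
    print(f"{label}: {float(saved):.1%}")
```

Under these assumptions the three cases give 1/2, 7/8, and 31/32 of the cache saved, which match the 50%, 87.5%, and 96.9% figures after rounding.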

Parity, Sensitivity, and Transformers

arXiv, 2026-02-05. Authors: Alexander Kozachinskiy, Tomasz Steifer, Przemysław Wałęga

Understanding what neural architectures can and cannot compute is a central challenge in the theory of AI. One of the fundamental problems in this context is the PARITY task, which asks whether the number of 1s in a binary input sequence is even or odd. PARITY is one of the central tasks studied in the theory of computation, yet it remains surprisingly unclear under which conditions transformers can or cannot solve it. In this paper, we show that the minimal number of layers a transformer needs to compute PARITY is two. In particular, we solve the open problem asking whether a one-layer transformer can compute PARITY. We answer it negatively by showing that the average sensitivity of a one-layer transformer grows more slowly than that of PARITY. Furthermore, we show a new construction of a transformer that computes PARITY, which improves on the existing constructions by removing a number of impractical assumptions. In particular, the existing transformers for PARITY rely on such impractical assumptions as length-dependent positional encoding, hardmax, layernorm without a regularisation parameter, or incompatibility with causal masking. We show that these assumptions can be removed, at the co
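To make the sensitivity argument concrete, the sketch below computes average sensitivity by brute force, using the standard definition: the expected number of input positions whose flip changes the output, over a uniformly random input. It only illustrates the quantity for small Boolean functions and does not reproduce the paper's proof; the helper names `parity`, `logical_or`, and `average_sensitivity` are chosen here for illustration.

```python
from itertools import product


def parity(bits):
    """Return 1 if the number of 1s is odd, else 0."""
    return sum(bits) % 2


def logical_or(bits):
    """Return 1 if any bit is set; included only as a low-sensitivity contrast."""
    return int(any(bits))


def average_sensitivity(f, n):
    """Average, over all 2**n inputs, of how many single-bit flips change f."""
    total = 0
    for x in product((0, 1), repeat=n):
        fx = f(x)
        for i in range(n):
            flipped = x[:i] + (1 - x[i],) + x[i + 1:]
            total += f(flipped) != fx
    return total / 2 ** n


for n in range(1, 6):
    print(n, average_sensitivity(parity, n), average_sensitivity(logical_or, n))
```

Every single-bit flip changes PARITY, so its average sensitivity equals the input length n, the largest value possible. A function class whose average sensitivity grows more slowly than n therefore cannot contain PARITY at every length.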

Search results: transformers

Do Transformers Need Three Projections? Systematic Study of QKV Variants

Parity, Sensitivity, and Transformers

Graph Tokenization for Bridging Graphs and Transformers

Transformers are Graph Neural Networks

Attention with Markov: A Framework for Principled Analysis of Transformers via Markov Chains

Continuous-Depth Transformers with Learned Control Dynamics

Krause Synchronization Transformers

Clustering in pure-attention hardmax transformers and its role in sentiment analysis

Task-Conditioned Routing Signatures in Sparse Mixture-of-Experts Transformers

Compute Only 16 Tokens in One Timestep: Accelerating Diffusion Transformers with Cluster-Driven Feature Caching

Learning Syntax Without Planting Trees: Understanding Hierarchical Generalization in Transformers

Solving Empirical Bayes via Transformers

No Other Representation Component Is Needed: Diffusion Transformers Can Provide Representation Guidance by Themselves

What comes after transformers? -- A selective survey connecting ideas in deep learning

On the Surprising Effectiveness of Transformers in Low-Labeled Video Recognition

Transformers in Time Series: A Survey

SCSC: Spatial Cross-scale Convolution Module to Strengthen both CNNs and Transformers

Intra-Layer Recurrence in Transformers for Language Modeling

Local Attention Transformers for High-Detail Optical Flow Upsampling

Scalable High-Resolution Pixel-Space Image Synthesis with Hourglass Diffusion Transformers