In this work, we revisit Transformer optimization through the lens of second-order geometry and establish a direct connection between architectural design, activation scale, the Hessian matrix, and the maximum tolerable learning rate. We introduce a simple normalization strategy, termed SimpleNorm, which stabilizes intermediate activation scales by construction. Then, by analyzing the Hessian of the loss with respect to network activations, we theoretically show that SimpleNorm significantly reduces the spectral norm of the Hessian, thereby permitting larger stable learning rates. We validate our theoretical findings through extensive experiments on large GPT models at parameter scales 1B, 1.4B, 7B and 8B. Empirically, SimpleGPT, our SimpleNorm-based network, tolerates learning rates 3$\times$-10$\times$ larger than standard convention, consistently demonstrates strong optimization stability, and achieves substantially better performance than well-established baselines. Specifically, when training 7B-scale models for 60K steps, SimpleGPT achieves a training loss that is 0.08 lower than that of LLaMA2 with QKNorm, reducing the loss from 2.290 to 2.208. Our source code will be releas
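The link between the Hessian's spectral norm and the largest stable learning rate can be illustrated with the textbook quadratic model. This is a standard result included for orientation only; it is not the paper's derivation, and the symbols $L$, $H$, $x_t$, and $\eta$ are assumptions introduced here.

```latex
% Textbook quadratic model (an illustrative assumption, not the paper's analysis):
% H is a symmetric positive-definite Hessian and \eta is the learning rate.
\[
L(x) = \tfrac{1}{2}\, x^\top H x, \qquad
x_{t+1} = x_t - \eta \nabla L(x_t) = (I - \eta H)\, x_t .
\]
\[
x_t \to 0 \ \text{for every } x_0
\iff \max_i \bigl| 1 - \eta\, \lambda_i(H) \bigr| < 1
\iff 0 < \eta < \frac{2}{\lambda_{\max}(H)} = \frac{2}{\|H\|_2}.
\]
```

Under this model, halving $\|H\|_2$ doubles the admissible learning rate, which is the qualitative effect the abstract attributes to a smaller Hessian spectral norm.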
Recent streaming video understanding methods increasingly rely on complex memory mechanisms to handle long video streams. We challenge this trend with a simple finding: a sliding-window baseline that feeds only the most recent N frames to an off-the-shelf VLM already matches or surpasses published streaming models. We formalize this baseline as SimpleStream and evaluate it against 13 major offline and online video LLM baselines on OVO-Bench and StreamingBench. Despite its simplicity, SimpleStream delivers consistently strong performance. With only 4 recent frames, it reaches 67.7% average accuracy on OVO-Bench and 80.59% on StreamingBench. Controlled ablations further show that the value of longer context is backbone-dependent rather than uniformly increasing with model scale, and reveal a consistent perception-memory trade-off: adding more historical context can improve recall, but often weakens real-time perception. This suggests that stronger memory, retrieval, or compression modules should not be taken as evidence of progress unless they clearly outperform SimpleStream under the same protocol. We therefore argue that future streaming benchmarks should separate recent-scene perc
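The sliding-window idea described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function name `sliding_window_stream` is hypothetical, frames are treated as opaque objects, and the call to the VLM itself is left out because the abstract does not specify it.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, List, TypeVar

Frame = TypeVar("Frame")


def sliding_window_stream(frames: Iterable[Frame], n: int) -> Iterator[List[Frame]]:
    """After each incoming frame, yield the most recent n frames (oldest first).

    Hypothetical helper: the yielded list is what would be handed to an
    off-the-shelf VLM at that time step; older frames are simply dropped.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    window: Deque[Frame] = deque(maxlen=n)  # deque discards the oldest item itself
    for frame in frames:
        window.append(frame)
        yield list(window)


if __name__ == "__main__":
    # Integers stand in for decoded video frames.
    for context in sliding_window_stream(range(6), n=4):
        print(context)
```

No memory, retrieval, or compression module appears in the sketch; the only state is the fixed-size window.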
This paper presents a new stochastic relay-based extremum-seeking controller (ESC) for multi-input-single-output (MISO) systems. The goal of this work was to create an algorithm that is much simpler to configure than alternative approaches, making deployment to real-world problems easier. A solution is developed first for a static map and then adapted for a general class of dynamic systems. The number of configurable parameters is one per input channel for the static case, and only one additional parameter is needed for the dynamic version. The problem of gradient identification is solved via the use of stochastic relay gains, and a simple stability proof for the static case is presented. Simulation tests demonstrate the performance of the strategy for optimizing both static and dynamic systems.
We define simple tilings in the general context of a $G$-tiling on a Riemannian homogeneous space $M$ to be tilings by Riemannian simplices. As evidence that this definition is natural, we prove that a large class of tilings of $M$ are MLD to simple ones. We demonstrate the utility of this definition by generalizing previously known results about simple tilings of Euclidean space. In particular, it is shown that a simple tiling space of a rational, connected, simply connected, nilpotent Lie group is homeomorphic to a rational tiling space, that is, a tiling space for which displacements between vertices take on rational values. Hence, such a tiling space is a fiber bundle over a nilmanifold. We further sketch a proof of the fact that there is an isomorphism between Čech cohomology and pattern equivariant cohomology of simple tilings in connected, simply connected, nilpotent Lie groups.
In this paper, we introduce a Ketonen-type Gentzen-style classical simple type theory $\bf KCT$. The tableau system $\bf KCTT$ corresponding to $\bf KCT$ is also introduced. Further, an inference-preserving Gentzen system $\bf KCT_h$ (equivalent to $\bf KCT$) and a tableau system $\bf KCTT_h$ (equivalent to $\bf KCTT$) are introduced. We introduce the notion of Hintikka sequents for $\bf KCTT_h$. The completeness theorem and Takahashi-Prawitz's theorem are proved for $\bf KCTT_h$.
"Metaphorical maps" or "contact representations" are visual representations of vertex-weighted graphs that rely on the geographic map metaphor. The vertices are represented by countries, the weights by the areas of the countries, and the edges by contacts/ boundaries among them. The accuracy with which the weights are mapped to areas and the simplicity of the polygons representing the countries are the two classical optimization goals for metaphorical maps. Mchedlidze and Schnorr [Metaphoric Maps for Dynamic Vertex-weighted Graphs, EuroVis 2022] presented a force-based algorithm that creates metaphorical maps that balance between these two optimization goals. Their maps look visually simple, but the accuracy of the maps is far from optimal - the countries' areas can vary up to 30% compared to required. In this paper, we provide a multi-fold extension of the algorithm in [Metaphoric Maps for Dynamic Vertex-weighted Graphs, EuroVis 2022]. More specifically: 1. Towards improving accuracy: We introduce the notion of region stiffness and suggest a technique for varying the stiffness based on the current pressure of map regions. 2. Towards maintaining simplicity: We introduce a weight co
Given an undirected graph $G=(V,E,w)$, a Gomory-Hu tree $T$ (Gomory and Hu, 1961) is a tree on $V$ that preserves all-pairs mincuts of $G$ exactly. We present a simple and efficient randomized reduction from Gomory-Hu trees to polylog maxflow computations. On unweighted graphs, our reduction reduces to maxflow computations on graphs of total instance size $\tilde{O}(m)$ and the algorithm requires only $\tilde{O}(m)$ additional time. Our reduction is the first that is tight up to polylog factors. The reduction also seamlessly extends to weighted graphs; however, instance sizes and runtime increase to $\tilde{O}(n^2)$. Finally, we show how to extend our reduction to reduce Gomory-Hu trees for unweighted hypergraphs to maxflow in hypergraphs. Again, our reduction is the first that is tight up to polylog factors.
In this article, we define quasiprimitive quandles and describe them with the help of quasiprimitive permutation groups. As a consequence, we enumerate finite non-affine simple quandles up to order $4096$.
Most current neural networks for molecular dynamics (MD) include physical inductive biases, resulting in specialized and complex architectures. This is in contrast to most other machine learning domains, where specialist approaches are increasingly replaced by general-purpose architectures trained on vast datasets. In line with this trend, several recent studies have questioned the necessity of architectural features commonly found in MD models, such as built-in rotational equivariance or energy conservation. In this work, we contribute to the ongoing discussion by evaluating the performance of an MD model with as few specialized architectural features as possible. We present a recipe for MD using an Edge Transformer, an ``off-the-shelf'' transformer architecture that has been minimally modified for the MD domain, termed MD-ET. Our model implements neither built-in equivariance nor energy conservation. We use a simple supervised pre-training scheme on $\sim$30 million molecular structures from the QCML database. Using this ``off-the-shelf'' approach, we show state-of-the-art results on several benchmarks after fine-tuning for a small number of steps. Additionally, we examine the ef
Simultaneous variable selection and robust data fitting are important aspects of many mathematical modelling projects and a wide array of optimisation tools and techniques exist to support them. When the intention is to embed this capability in run-time interactive decision support tools running hundreds of such modelling tasks simultaneously on a GPU, the choices of implementation approach are more limited. Recently, simple and fast Coordinate Descent algorithms have been proposed which can implement the LASSO approach to variable selection in conjunction with ordinary least squares (OLS) data fitting. However, extending this to use the more robust Least Absolute Deviation (LAD) data fitting has been hampered by the multiple axis-wise local minima that occur in the objective function for this LAD-LASSO approach. This paper suggests that these multiple axis-wise local minima form a locus which is monotonic in all the axes and that this locus has a convex objective function. This allows the locus to be searched using a ternary chop algorithm that uses Coordinate Descent to identify multiple local minima (points on this locus) as required to find the global minimum. The resulting a
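For orientation, the OLS-based Coordinate Descent LASSO that the abstract takes as its starting point can be sketched as follows. This is the standard soft-thresholding update for the objective $\frac{1}{2}\sum_i (y_i - \sum_j x_{ij}\beta_j)^2 + \lambda \sum_j |\beta_j|$, not the paper's LAD-LASSO method or its ternary chop search; the function names and the fixed sweep count are assumptions made for illustration.

```python
from typing import List, Sequence


def soft_threshold(value: float, lam: float) -> float:
    """Shrink value towards zero by lam, returning 0.0 inside [-lam, lam]."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def lasso_coordinate_descent(
    X: Sequence[Sequence[float]], y: Sequence[float], lam: float, n_sweeps: int = 100
) -> List[float]:
    """Minimise 0.5 * ||y - X beta||^2 + lam * ||beta||_1 one coordinate at a time."""
    n_rows, n_cols = len(X), len(X[0])
    beta = [0.0] * n_cols
    residual = list(y)  # equals y - X beta, since beta starts at zero
    for _ in range(n_sweeps):
        for j in range(n_cols):
            col = [X[i][j] for i in range(n_rows)]
            z = sum(c * c for c in col)
            if z == 0.0:
                continue  # an all-zero column carries no information
            # Correlation of column j with the residual that excludes coordinate j.
            rho = sum(c * (r + c * beta[j]) for c, r in zip(col, residual))
            new_beta = soft_threshold(rho, lam) / z
            delta = new_beta - beta[j]
            if delta != 0.0:
                residual = [r - c * delta for c, r in zip(col, residual)]
                beta[j] = new_beta
    return beta


if __name__ == "__main__":
    # Orthogonal toy design: the second coefficient is shrunk exactly to zero.
    print(lasso_coordinate_descent([[1.0, 0.0], [0.0, 1.0]], [3.0, 0.5], lam=1.0))
```

Each coordinate update has a closed form because the squared-error term is smooth in every axis; replacing it with absolute deviations removes that property, which is the difficulty the abstract describes.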
Integrating visible and infrared images into one high-quality image, also known as visible and infrared image fusion, is a challenging yet critical task for many downstream vision tasks. Most existing works utilize pretrained deep neural networks or design sophisticated frameworks with strong priors for this task, which may be unsuitable or lack flexibility. This paper presents SimpleFusion, a simple yet effective framework for visible and infrared image fusion. Our framework follows the decompose-and-fusion paradigm, where the visible and the infrared images are decomposed into reflectance and illumination components via Retinex theory, followed by the fusion of the corresponding components. The whole framework is designed with two plain convolutional neural networks without downsampling, which can perform image decomposition and fusion efficiently. Moreover, we introduce a decomposition loss and a detail-to-semantic loss to preserve the complementary information between the two modalities for fusion. We conduct extensive experiments on challenging benchmarks, verifying the superiority of our method over previous state-of-the-art methods. Code is available at \href{https://github.co
Oblivious RAM (ORAM) is a well-researched primitive to hide the memory access pattern of a RAM computation; it has a variety of applications in trusted computing, outsourced storage, and multiparty computation. In this paper, we study the so-called offline ORAM in which the sequence of memory access locations to be hidden is known in advance. Apart from their theoretical significance, offline ORAMs can be used to construct efficient oblivious algorithms. We obtain the first optimal offline ORAM with perfect security from oblivious priority queues via time-forward processing. For this, we present a simple construction of an oblivious priority queue with perfect security. Our construction achieves an asymptotically optimal (amortized) runtime of $Θ(\log N)$ per operation for a capacity of $N$ elements and is of independent interest. Building on our construction, we additionally present efficient external-memory instantiations of our oblivious, perfectly-secure construction: For the cache-aware setting, we match the optimal I/O complexity of $Θ(\frac{1}{B} \log \frac{N}{M})$ per operation (amortized), and for the cache-oblivious setting we achieve a near-optimal I/O complexity of $O(\
Using a simplified model for a non-Brownian suspension, we numerically study the response of athermal, overdamped, frictionless disks in two dimensions to isotropic and uniaxial compression, as well as to pure and simple shearing, all at finite constant strain rates $\dot{ε}$. We show that isotropic and uniaxial compression result in the same jamming packing fraction $φ_J$, while pure shear and simple shear induced jamming occurs at a slightly higher $φ_J^*$, consistent with that found previously for simple shearing. A critical scaling analysis of pure shearing gives critical exponents consistent with those previously found for both isotropic compression and simple shearing. Using orientational order parameters for contact bond directions, we compare the anisotropy of the force and contact networks at both the lowest nematic order and higher $2n$-fold order.
We present OpenSeeD, a simple Open-vocabulary Segmentation and Detection framework that jointly learns from different segmentation and detection datasets. To bridge the gap of vocabulary and annotation granularity, we first introduce a pre-trained text encoder to encode all the visual concepts in two tasks and learn a common semantic space for them. This gives us reasonably good results compared with the counterparts trained on the segmentation task only. To further reconcile them, we locate two discrepancies: $i$) task discrepancy -- segmentation requires extracting masks for both foreground objects and background stuff, while detection merely cares about the former; $ii$) data discrepancy -- box and mask annotations have different spatial granularity, and thus are not directly interchangeable. To address these issues, we propose a decoupled decoding to reduce the interference between foreground/background and a conditioned mask decoding to assist in generating masks for given boxes. To this end, we develop a simple encoder-decoder model encompassing all three techniques and train it jointly on COCO and Objects365. After pre-training, our model exhibits competitive or stronger zero-sh
Data attribution methods play a crucial role in understanding machine learning models, providing insight into which training data points are most responsible for model outputs during deployment. However, current state-of-the-art approaches require a large ensemble of as many as 300,000 models to accurately attribute model predictions. These approaches therefore come at a high computational cost, are memory intensive, and are hard to scale to large models or datasets. In this work, we focus on a minimalist baseline, utilizing the feature space of a backbone pretrained via self-supervised learning to perform data attribution. Our method is model-agnostic and scales easily to large datasets. We show results on CIFAR-10 and ImageNet, achieving strong performance that rivals or outperforms state-of-the-art approaches at a fraction of the compute or memory cost. Contrary to prior work, our results reinforce the intuition that a model's prediction on one image is most impacted by visually similar training samples. Our approach serves as a simple and efficient baseline for data attribution on images.
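The feature-space baseline can be sketched as a nearest-neighbour ranking. This is a minimal sketch under loud assumptions: the abstract does not state the similarity measure, so cosine similarity is an illustrative choice, the function names are hypothetical, and the feature vectors are assumed to come from some pretrained self-supervised backbone that is not shown.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def attribute_by_similarity(
    train_features: Sequence[Sequence[float]],
    query_feature: Sequence[float],
    top_k: int,
) -> List[Tuple[int, float]]:
    """Return (training index, score) pairs for the top_k most similar samples.

    Assumption: similarity in feature space is used as the attribution score.
    """
    scored = [
        (index, cosine_similarity(feature, query_feature))
        for index, feature in enumerate(train_features)
    ]
    # Highest score first; ties are broken by the lower training index.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_k]


if __name__ == "__main__":
    train = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy stand-ins for embeddings
    print(attribute_by_similarity(train, [1.0, 0.1], top_k=2))
```

The sketch needs one forward pass per image to obtain features and no model ensemble, which is the source of the compute and memory savings the abstract claims.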
A simple-triangle graph (also known as a PI graph) is the intersection graph of a family of triangles defined by a point on a horizontal line and an interval on another horizontal line. The recognition problem for simple-triangle graphs was a longstanding open problem, and recently a polynomial-time algorithm has been given [G. B. Mertzios, The Recognition of Simple-Triangle Graphs and of Linear-Interval Orders is Polynomial, SIAM J. Discrete Math., 29(3):1150--1185, 2015]. Building on the approach of that paper, we show a simpler recognition algorithm for simple-triangle graphs. To do this, we provide a polynomial-time algorithm to solve the following problem: Given a bipartite graph $G$ and a set $F$ of edges of $G$, find a 2-chain subgraph cover of $G$ such that one of two chain subgraphs has no edges in $F$.
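The definition in the first sentence can be made concrete by building the intersection graph from a given family of triangles. This sketch covers only the definition, not the recognition algorithm; the encoding of a triangle as `(apex, left, right)` and the function names are assumptions. It uses the fact that two such triangles are disjoint exactly when one lies entirely to the left of the other, that is, when both its apex and the right end of its interval precede the other's apex and left end.

```python
from typing import List, Set, Tuple

# (apex, left, right): apex is a point on the upper line,
# [left, right] is an interval on the lower line.
Triangle = Tuple[float, float, float]


def triangles_intersect(a: Triangle, b: Triangle) -> bool:
    """True unless one triangle lies strictly to the left of the other."""
    apex_a, left_a, right_a = a
    apex_b, left_b, right_b = b
    a_left_of_b = apex_a < apex_b and right_a < left_b
    b_left_of_a = apex_b < apex_a and right_b < left_a
    return not (a_left_of_b or b_left_of_a)


def simple_triangle_graph(triangles: List[Triangle]) -> Set[Tuple[int, int]]:
    """Edges (i, j) with i < j for every pair of intersecting triangles."""
    edges: Set[Tuple[int, int]] = set()
    for i in range(len(triangles)):
        for j in range(i + 1, len(triangles)):
            if triangles_intersect(triangles[i], triangles[j]):
                edges.add((i, j))
    return edges


if __name__ == "__main__":
    family = [(0.0, 0.0, 1.0), (2.0, 3.0, 4.0), (1.0, 0.5, 3.5)]
    print(sorted(simple_triangle_graph(family)))
```

Recognition is the inverse problem: given only the graph, decide whether some family of triangles produces it.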
Let $X$ be a closed smooth manifold, $G$ be a simple connected compact real Lie group, $M (G)$ be the group of all smooth maps from $X$ to $G$, and $M_0 (G)$ be its connected component for the $\mathcal C^\infty$-compact open topology. It is shown that maximal normal subgroups of $M_0 (G)$ are precisely the inverse images of the centre $Z(G)$ of $G$ by the evaluation homomorphisms $M_0 (G) \to G, \hskip.1cm γ\mapsto γ(a)$, for $a \in X$. This in turn is a consequence of a result on the group $\mathcal C^\infty_{n, G}$ of germs at the origin $O$ of $\mathbf R^n$ of smooth maps $\mathbf R^n \to G$: this group has a unique maximal normal subgroup, which is the inverse image of $Z(G)$ by the evaluation homomorphism $\mathcal C^\infty_{n, G} \to G, \hskip.1cm \underline γ\mapsto \underline γ(O)$. This article provides corrections for part of an earlier article [Harp--88].
Given any polytope $P$ and any generic linear functional ${\bf c} $, one obtains a directed graph $G(P,{\bf c})$ from the 1-skeleton of $P$ by orienting each edge $e(u,v)$ from $u$ to $v$ for ${\bf c} (u) < {\bf c} ( v)$. For $P$ a simple polytope and $G(P,{\bf c})$ the Hasse diagram of a lattice $L$, the join of any collection $S$ of elements which all cover a common element $u$ in $L$ is proven to equal the sink of the smallest face of $P$ containing $u$ and all of the elements of $S$. The author conjectures for such $G(P,{\bf c})$ that no directed path in $G(P,{\bf c})$ ever revisits any facet of $P$. This would imply for such $P$ and ${\bf c}$ that the simplex method for linear programming is efficient under all possible pivot rules. This conjecture is proven for 3-polytopes and for spindles. For simple polytopes in which $G(P,{\bf c})$ is the Hasse diagram of a lattice $L$, the order complex of each open interval in $L$ is proven homotopy equivalent to a ball or a sphere. Applications are given to the weak Bruhat order, the Tamari lattice, and the Cambrian lattices. This paper concludes with an appendix by Dominik Preuß proving the monotone Hirsch conjecture for $P$ a simple
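The construction of $G(P,{\bf c})$ in the first sentence can be illustrated on the 3-dimensional cube, which is a simple polytope. This is a sketch of the orientation step only; the choice of the cube, the functional $(1,2,4)$, and all function names are assumptions made for illustration.

```python
from itertools import product
from typing import List, Sequence, Tuple

Vertex = Tuple[int, ...]
Edge = Tuple[Vertex, Vertex]


def cube_skeleton(dim: int) -> Tuple[List[Vertex], List[Edge]]:
    """Vertices and edges of the 0/1 cube; edges join vertices differing in one coordinate."""
    vertices = list(product((0, 1), repeat=dim))
    edges = [
        (u, v)
        for u in vertices
        for v in vertices
        if u < v and sum(a != b for a, b in zip(u, v)) == 1
    ]
    return vertices, edges


def orient_by_functional(edges: Sequence[Edge], c: Sequence[float]) -> List[Edge]:
    """Orient each edge from the endpoint with smaller c-value to the larger one."""

    def value(vertex: Vertex) -> float:
        return sum(ci * vi for ci, vi in zip(c, vertex))

    directed = []
    for u, v in edges:
        cu, cv = value(u), value(v)
        if cu == cv:
            raise ValueError("functional is not generic: it is constant on an edge")
        directed.append((u, v) if cu < cv else (v, u))
    return directed


def sinks(vertices: Sequence[Vertex], directed: Sequence[Edge]) -> List[Vertex]:
    """Vertices with no outgoing edge."""
    has_outgoing = {tail for tail, _ in directed}
    return [v for v in vertices if v not in has_outgoing]


if __name__ == "__main__":
    verts, undirected = cube_skeleton(3)
    graph = orient_by_functional(undirected, (1, 2, 4))
    print(len(graph), sinks(verts, graph))
```

On the cube with this functional the directed graph has a single sink, the vertex maximizing ${\bf c}$, and every directed path is a possible run of the simplex method towards it.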
Simple crystallizations are edge-coloured graphs representing PL 4-manifolds with the property that the 1-skeleton of the associated triangulation equals the 1-skeleton of a 4-simplex. In the present paper, we prove that any (simply-connected) PL $4$-manifold $M$ admitting a simple crystallization admits a special handlebody decomposition, too; equivalently, $M$ may be represented by a framed link yielding $\mathbb S^3$, with exactly $β_2(M)$ components ($β_2(M)$ being the second Betti number of $M$). As a consequence, the regular genus of $M$ is proved to be the double of $β_2(M)$. Moreover, the characterization of any such PL $4$-manifold by $k(M)= 3 β_2(M)$, where $k(M)$ is the gem-complexity of $M$ (i.e. the non-negative number $p-1$, $2p$ being the minimum order of a crystallization of $M$) implies that both PL invariants gem-complexity and regular genus turn out to be additive within the class of all PL $4$-manifolds admitting simple crystallizations (in particular: within the class of all "standard" simply-connected PL 4-manifolds).
We present a simple proof of the existence of $L^1$-flat analytic polynomials with coefficients $0,1$ on the circle and on the real line, and we give an example of a conservative ergodic map and flow whose unitary operators admit a simple Lebesgue spectrum. Among other results, we obtain an answer to Bourgain's question on the supremum of the $L^1$-norm of such polynomials and to a question inspired by Lehmer's problem on the supremum of the Mahler measures of those polynomials.