Multimodal instruction tuning is the de facto recipe for adapting vision language models (VLMs), yet instruction data are highly redundant, making data selection critical for training efficiency. Existing methods derive selection signals from a specific model or dataset, so whenever the target model or candidate pool changes, the criteria must be recomputed from scratch at substantial cost. To address this, we propose OFA, a data selection framework that trains a reusable selector once and applies it to any dataset or model without recomputation. OFA clusters multimodal instructions in a frozen CLIP space, derives pseudo labels from the cluster structure, and trains a lightweight selector for only a few epochs; samples on which this selector is least confident are selected as the most informative. Once trained, the frozen selector transfers directly across datasets and model scales. The selector is trained once on LLaVA-665K and applied both to LLaVA-665K itself and, without any retraining, to the unseen Vision-Flan-186K. Selecting only 15% of the data, OFA achieves 98.3% of full data performance across 10 downstream benchmarks; on the smaller Vision-Flan-186K, the transferred sele
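The final selection step described above (keep the samples on which the selector is least confident) can be sketched as follows. This is only an illustration under stated assumptions: confidence is taken to be the largest predicted pseudo-label probability, and the function name `select_least_confident` and the `fraction` parameter are hypothetical, not taken from the paper.

```python
def select_least_confident(probabilities, fraction):
    """Return sorted indices of the samples with the lowest top-class probability.

    probabilities: one list of class probabilities per sample (assumed to be
                   the selector's output over the cluster pseudo labels).
    fraction:      share of the pool to keep, e.g. 0.15 for a 15% budget.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must lie in (0, 1]")
    # Assumption: confidence = maximum predicted probability.
    confidences = [max(p) for p in probabilities]
    budget = max(1, int(fraction * len(probabilities)))
    # Rank samples from least to most confident and keep the first `budget`.
    ranked = sorted(range(len(probabilities)), key=lambda i: confidences[i])
    return sorted(ranked[:budget])
```

Other confidence measures (entropy, margin between the top two classes) would slot into the same ranking step.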
We study once-excited random walks on general trees, modeled by placing a single "cookie" at each vertex. Each cookie acts as a metaphorical reward that is consumed upon the first visit to the vertex where the cookie is placed. On that initial visit, the walk is in an excited state and behaves like a biased random walk. Once the cookie is consumed, the process reverts to a symmetric random walk on all subsequent visits. We consider a random environment in which the bias parameters are independent random variables. We prove that the process exhibits a sharp phase transition between transience and recurrence on general trees with polynomial growth, where the critical threshold is determined by the branching-ruin number of the tree.
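The dynamics described above can be simulated in a few lines. The sketch below makes several assumptions that are not from the abstract: the tree is a regular tree in which every vertex has `branching` children (with `branching=1` giving the half-line), the excited step gives weight 1 to the parent edge and weight λ to each child edge, and λ is drawn uniformly from [0.5, 2]. All names are hypothetical.

```python
import random

def once_excited_walk(steps, branching=2, bias_sampler=None, seed=0):
    """Simulate a once-excited walk; vertices are tuples of child indices."""
    rng = random.Random(seed)
    if bias_sampler is None:
        # Assumed environment: i.i.d. bias parameters, uniform on [0.5, 2].
        bias_sampler = lambda r: r.uniform(0.5, 2.0)
    position = ()          # the root
    eaten = set()          # vertices whose cookie has been consumed
    path = [position]
    for _ in range(steps):
        if position in eaten:
            lam = 1.0      # no cookie left: symmetric step
        else:
            lam = bias_sampler(rng)  # first visit: excited, biased step
            eaten.add(position)
        children = [position + (i,) for i in range(branching)]
        if position == ():
            # The root has no parent, so the walk moves to a child.
            position = rng.choice(children)
        else:
            # Assumed bias convention: weight 1 up, weight lam per child.
            total = 1.0 + lam * branching
            if rng.random() < 1.0 / total:
                position = position[:-1]
            else:
                position = rng.choice(children)
        path.append(position)
    return path, eaten
```

Sampling each bias at the first visit is equivalent to fixing an independent environment in advance, since each cookie is used exactly once.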
A goal-conditioned reinforcement learning agent exploring an environment will see a wealth of information throughout a trajectory, most of which is discarded when only performing on-policy updates with respect to the commanded goal. All-goals learning, where each transition is used for learning off-policy with respect to every goal, allows agents to extract maximal information; however, it is usually computationally infeasible when done via naive relabelling. This can be overcome by jointly outputting values and actions for every goal at once, allowing for efficient, parallel all-goals updates with a single pass through the network, in a process we call Learning Everything all at Once (LEO). We show that this approach significantly outperforms other methods on goal-conditioned Craftax and is competitive with existing baselines on continuous control environments, while achieving a >250x speed-up compared to all-goals relabelling. We then go on to show that this approach can be made even more powerful by using LEO as a teacher network, rather than a direct actor. We hope that, by unlocking all-goals learning at scale, LEO can serve as a useful tool for RL practitioners in complex e
It is known that a positive Boolean function f depending on n variables has at least n + 1 extremal points, i.e. minimal ones and maximal zeros. We show that f has exactly n + 1 extremal points if and only if it is linear read-once. The class of linear read-once functions is known to be the intersection of the classes of read-once and threshold functions. Generalizing this result we show that the class of linear read-once functions is the intersection of read-once and Chow functions. We also find the set of minimal read-once functions which are not linear read-once and the set of minimal threshold functions which are not linear read-once. In other words, we characterize the class of linear read-once functions by means of minimal forbidden subfunctions within the universe of read-once and the universe of threshold functions. Within the universe of threshold functions the importance of linear read-once functions is due to the fact that they attain the minimum value of the specification number, which is n + 1 for functions depending on n variables. In 1995 Anthony et al. conjectured that for all other threshold functions the specification number is strictly greater than n + 1. We
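The count of extremal points in the first two sentences can be checked by brute force on small examples. The sketch below assumes f is positive (monotone), so that testing single-coordinate flips is enough; the function name `extremal_points` is hypothetical. The example x1 ∨ (x2 ∧ x3) is a nested formula of the kind usually called linear read-once, which is an assumption about the definition rather than something stated in the abstract.

```python
from itertools import product

def extremal_points(f, n):
    """Return (minimal ones, maximal zeros) of a positive Boolean function f.

    f takes a 0/1 tuple of length n. Because f is assumed monotone, a one is
    minimal when lowering any single 1-coordinate yields a zero, and a zero
    is maximal when raising any single 0-coordinate yields a one.
    """
    minimal_ones, maximal_zeros = [], []
    for x in product((0, 1), repeat=n):
        if f(x):
            lowered = (x[:i] + (0,) + x[i + 1:] for i in range(n) if x[i] == 1)
            if all(not f(y) for y in lowered):
                minimal_ones.append(x)
        else:
            raised = (x[:i] + (1,) + x[i + 1:] for i in range(n) if x[i] == 0)
            if all(f(y) for y in raised):
                maximal_zeros.append(x)
    return minimal_ones, maximal_zeros

# x1 OR (x2 AND x3): a nested read-once formula on n = 3 variables.
nested = lambda x: bool(x[0] or (x[1] and x[2]))
# (x1 AND x2) OR (x3 AND x4): read-once, but not of the nested form.
two_pairs = lambda x: bool((x[0] and x[1]) or (x[2] and x[3]))
# Majority of three: a threshold function that is not read-once.
majority = lambda x: sum(x) >= 2
```

On these inputs the nested formula has 2 + 2 = 4 = n + 1 extremal points, while the other two functions have 6 each, which exceeds n + 1 in both cases.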
We study $η$-correction terms in the Kauffman bracket skein algebra of the once-punctured torus $K_t(Σ_{1,1})$. While the Frohman--Gelca product-to-sum rule gives an explicit multiplication formula on the closed torus, the once-punctured torus introduces correction terms in the ideal $(η)$. We give a closed formula for the Chebyshev-threaded family generated by the primitive determinant-two pair \[ P_n=T_n((1,2))\cdot(1,0). \] The correction $ε_n$ has an explicit Chebyshev expansion whose coefficients factor as geometric sums in $t^{\pm4}$ and whose terms are governed by a parity pattern arising from the Chebyshev recurrence. We also treat a primitive maximal-thread regime, in which one Frohman--Gelca summand is fully threaded and the other is simple or doubly covered. In this case the discrepancy is an explicit $η$-linear cascade with Chebyshev $S$-coefficients, lowering the thread degree by two at each step. These formulas recover the relevant low-determinant behavior and give compact closed multiplication rules for structured threaded families in $K_t(Σ_{1,1})$.
The Pascal matrix, which is related to Pascal's triangle, appears in many places in the theory of uniform distribution and in many other areas of mathematics. Examples are the construction of low-discrepancy sequences as well as normal numbers or the binomial transforms of Hankel matrices. Hankel matrices which are defined by Catalan numbers and related to the paperfolding sequence are interesting objects in number theory. Therefore, matrices that share many properties with the Pascal matrix or such Hankel matrices are of interest. In this note we collect common features of the Pascal matrix and its reduction modulo $2$, as well as of the Hankel matrix defined by Catalan numbers, once unreduced and once modulo $2$, in the ring of integers. Hankel matrices with only $0$ and $1$ entries in, e.g., finite fields recently gave access to counterexamples to the so-called $X$-adic Liouville conjecture. This both justifies and motivates our consideration of further matrices with $0$ and $1$ entries.
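The four matrices mentioned above are easy to build for small sizes. The sketch below assumes the lower-triangular convention for the Pascal matrix (entry C(i, j) in row i, column j) and the Hankel matrix with entry C_{i+j}, the Catalan number of index i + j; other conventions exist, and all function names are hypothetical.

```python
from fractions import Fraction
from math import comb

def pascal_matrix(n, modulus=None):
    """Lower-triangular Pascal matrix with entries C(i, j), optionally reduced."""
    rows = [[comb(i, j) for j in range(n)] for i in range(n)]
    if modulus is not None:
        rows = [[v % modulus for v in row] for row in rows]
    return rows

def catalan(k):
    """k-th Catalan number, C_k = C(2k, k) / (k + 1)."""
    return comb(2 * k, k) // (k + 1)

def catalan_hankel(n, modulus=None):
    """Hankel matrix with entry C_{i+j} in row i, column j, optionally reduced."""
    rows = [[catalan(i + j) for j in range(n)] for i in range(n)]
    if modulus is not None:
        rows = [[v % modulus for v in row] for row in rows]
    return rows

def determinant(matrix):
    """Exact determinant by Gaussian elimination over the rationals."""
    a = [[Fraction(v) for v in row] for row in matrix]
    n = len(a)
    result = Fraction(1)
    for col in range(n):
        pivot = next((r for r in range(col, n) if a[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            result = -result          # a row swap flips the sign
        result *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return result
```

Two classical facts can be checked with these helpers: the Pascal matrix modulo 2 has a 1 in position (i, j) exactly when the binary digits of j are a subset of those of i, and the Catalan Hankel matrix modulo 2 has a 1 in position (i, j) exactly when i + j + 1 is a power of two. The unreduced Catalan Hankel matrix has determinant 1 for every size.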
Millimeter-wave (mmWave) frequencies promise multi-gigabit connectivity for vehicle-to-everything (V2X) networks, but face challenges in terms of severe path loss and mobility-related beam misalignment. Reliable V2X connectivity requires fast, double-directional beam alignment. However, existing methods suffer from high training overhead and limited generalization to unseen scenarios. This paper presents VIsion-based BEamforming (VIBE), a hybrid model-based, closed-loop, learning architecture for real-time double-directional mmWave beam management primed by camera sensing. VIBE fuses machine learning, model-based reasoning, and closed-loop RF feedback to balance beam-pair establishment latency with link quality. VIBE bypasses exhaustive training overhead and accelerates link establishment by leveraging camera observations to reduce the beam-search space. Lightweight beam refinement and offset tracking mechanisms adaptively refine beams in response to dynamic application requirements. VIBE is implemented and evaluated across online indoor/outdoor testbeds, public datasets, and real-time vehicular experiments, demonstrating strong generalization capabilities, making it suitable for re
The title of this paper is perhaps an overclaim. Of course, the process of creating and optimizing a learned model inevitably involves multiple training runs which potentially feature different architectural designs, input and output encodings, and losses. However, our method, You Only Train Once (YOTO), indeed contributes to limiting training to one shot for the latter aspect of loss selection and weighting. We achieve this by automatically optimizing loss weight hyperparameters of learned models in one shot via standard gradient-based optimization, treating these hyperparameters as regular parameters of the networks and learning them. To this end, we leverage the differentiability of the composite loss formulation which is widely used for optimizing multiple empirical losses simultaneously and model it as a novel layer which is parameterized with a softmax operation that satisfies the inherent positivity constraints on loss hyperparameters while avoiding degenerate empirical gradients. We complete our joint end-to-end optimization scheme by defining a novel regularization loss on the learned hyperparameters, which models a uniformity prior among the employed losses while ensuri
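The softmax parameterization of loss weights can be illustrated with a small, framework-free sketch. It assumes one free logit per loss term and covers only the weighted sum and its gradient with respect to the logits; the regularization loss mentioned above is not reproduced here, and all names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax; outputs are positive and sum to one."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def composite_loss(losses, logits):
    """Weighted sum of the individual losses with softmax weights."""
    weights = softmax(logits)
    return sum(w * l for w, l in zip(weights, losses))

def grad_wrt_logits(losses, logits):
    """Analytic gradient of the composite loss with respect to each logit.

    For weights w = softmax(theta) and total T = sum_i w_i * L_i,
    dT/dtheta_j = w_j * (L_j - T).
    """
    weights = softmax(logits)
    total = composite_loss(losses, logits)
    return [w * (l - total) for w, l in zip(weights, losses)]
```

The gradient formula shows why the weights alone are not enough: a descent step lowers the logit of every loss that is above the weighted average, so without an additional term the weight would drift toward the smallest loss.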
Picking up multiple objects at once is a grasping skill that makes a human worker efficient in many domains. This paper presents a system to pick a requested number of objects by only picking once (OPO). The proposed Only-Pick-Once System (OPOS) contains several graph-based algorithms that convert the layout of objects into a graph, cluster nodes in the graph, and rank and select candidate clusters based on their topology. OPOS also has a multi-object picking predictor based on a convolutional neural network for estimating how many objects would be picked up with a given gripper location and orientation. This paper presents four evaluation metrics and three protocols to evaluate the proposed OPOS. The results show that OPOS has very high success rates for two and three objects when only picking once. OPOS is significantly more efficient than performing single-object picking two or three times. The results also show that OPOS can generalize to objects of unseen sizes and shapes.
Once-for-All (OFA) is a Neural Architecture Search (NAS) framework designed to address the problem of searching efficient architectures for devices with different resource constraints by decoupling the training and the searching stages. The computationally expensive process of training the OFA neural network is done only once, and then it is possible to perform multiple searches for subnetworks extracted from this trained network according to each deployment scenario. In this work we aim to go one step further in the search for efficiency by explicitly conceiving the search stage as a multi-objective optimization problem. A Pareto frontier is then populated with efficient, and already trained, neural architectures exhibiting distinct trade-offs among the conflicting objectives. This could be achieved by using any multi-objective evolutionary algorithm during the search stage, such as NSGA-II and SMS-EMOA. In other words, the neural network is trained once, the search for subnetworks considering different hardware constraints is also done one single time, and then the user can choose a suitable neural network according to each deployment scenario. The conjugation of OFA and an
CNNs are increasingly deployed across different hardware, dynamic environments, and low-power embedded devices. This has led to the design and training of CNN architectures with the goal of maximizing accuracy subject to such variable deployment constraints. As the number of deployment scenarios grows, there is a need to find scalable solutions to design and train specialized CNNs. Once-for-all training has emerged as a scalable approach that jointly co-trains many models (subnets) at once with a constant training cost and finds specialized CNNs later. The scalability is achieved by training the full model and simultaneously reducing it to smaller subnets that share model weights (weight-shared shrinking). However, existing once-for-all training approaches incur huge training costs reaching 1200 GPU hours. We argue this is because they either start the process of shrinking the full model too early or too late. Hence, we propose Delayed $ε$-Shrinking (D$ε$pS) that starts the process of shrinking the full model when it is partially trained (~50%) which leads to training cost improvement and better in-place knowledge distillation to smaller models. The proposed approach also consists
Large-scale self-supervised pre-training has paved the way for one foundation model to handle many different vision tasks. Most pre-training methodologies train a single model of a certain size at one time. Nevertheless, various computation or storage constraints in real-world scenarios require substantial efforts to develop a series of models with different sizes to deploy. Thus, in this study, we propose a novel tri-branch self-supervised training framework, termed POA (Pre-training Once for All), to tackle this issue. Our approach introduces an innovative elastic student branch into a modern self-distillation paradigm. At each pre-training step, we randomly sample a sub-network from the original student to form the elastic student and train all branches in a self-distilling fashion. Once pre-trained, POA allows the extraction of pre-trained models of diverse sizes for downstream tasks. Remarkably, the elastic student facilitates the simultaneous pre-training of multiple models with different sizes, which also acts as an additional ensemble of models of various sizes to enhance representation learning. Extensive experiments, including k-nearest neighbors, linear
In this paper we provide a means of certifying infinitesimal projective rigidity relative to the cusp for hyperbolic once-punctured torus bundles in terms of twisted Alexander polynomials of representations associated to the holonomy. We also relate this polynomial to an induced action on the tangent space of the character variety of the free group of rank 2 into PGL(4,R) that arises from the holonomy of a hyperbolic once-punctured torus bundle. We prove that the induced action on the tangent space of the character variety is the same as the group-theoretic action that arises in the Lyndon-Hochschild-Serre spectral sequence on cohomology.
This paper shows the benefits and fruitfulness of evaluating LLMs with multiple problems at once, a paradigm we call multi-problem evaluation (MPE). Unlike conventional single-problem evaluation, where a prompt presents a single problem and expects one specific answer, MPE places multiple problems together in a single prompt and assesses how well an LLM answers all these problems in a single output. Leveraging 6 classification and 12 reasoning benchmarks that already exist, we introduce a new benchmark called ZeMPE (Zero-shot Multi-Problem Evaluation), comprising 53,100 zero-shot multi-problem prompts. We experiment with a total of 13 LLMs from 5 model families on ZeMPE to present a comprehensive and systematic MPE. Our results show that LLMs can handle multiple problems from a single data source as well as they handle them separately, but there are conditions under which this multi-problem handling capability falls short. In addition, we perform further in-depth analyses and explore model-level factors that may enable multi-problem handling capabilities in LLMs. We release our corpus and code to facilitate future research.
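The multi-problem setup described above can be sketched as a prompt builder plus an answer parser. The numbered-list prompt layout, the answer format, and every function name below are assumptions made for illustration; they are not the ZeMPE prompt templates.

```python
import re

def build_multi_problem_prompt(instruction, problems):
    """Place several problems in one prompt as a numbered list."""
    lines = [instruction, ""]
    for number, problem in enumerate(problems, start=1):
        lines.append(f"{number}. {problem}")
    lines.append("")
    lines.append("Answer every problem on its own line as '<number>. <answer>'.")
    return "\n".join(lines)

def parse_answers(output, problem_count):
    """Extract one answer per problem; unanswered problems stay None."""
    answers = [None] * problem_count
    pattern = r"^\s*(\d+)\.\s*(.+?)\s*$"
    for match in re.finditer(pattern, output, flags=re.MULTILINE):
        index = int(match.group(1)) - 1
        # Keep only the first answer given for each valid problem number.
        if 0 <= index < problem_count and answers[index] is None:
            answers[index] = match.group(2)
    return answers

def accuracy(predicted, gold):
    """Share of problems answered correctly; missing answers count as wrong."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)
```

Scoring the parsed answers against the same problems posed one per prompt gives the kind of multi-problem versus single-problem comparison the abstract describes.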
The in-context learning ability of Transformer models has brought new possibilities to visual navigation. In this paper, we focus on the video navigation setting, where an in-context navigation policy needs to be learned purely from videos in an offline manner, without access to the actual environment. For this setting, we propose Navigate Only Look Once (NOLO), a method for learning a navigation policy that possesses the in-context ability and adapts to new scenes by taking corresponding context videos as input without finetuning or re-training. To enable learning from videos, we first propose a pseudo action labeling procedure using optical flow to recover the action label from egocentric videos. Then, offline reinforcement learning is applied to learn the navigation policy. Through extensive experiments on different scenes both in simulation and the real world, we show that our algorithm outperforms baselines by a large margin, which demonstrates the in-context learning ability of the learned policy. For videos and more information, visit https://sites.google.com/view/nol0.
Grounding objects in images using visual cues is a well-established approach in computer vision, yet the potential of audio as a modality for object recognition and grounding remains underexplored. We introduce YOSS, "You Only Speak Once to See," to leverage audio for grounding objects in visual scenes, termed Audio Grounding. By integrating pre-trained audio models with visual models using contrastive learning and multi-modal alignment, our approach captures speech commands or descriptions and maps them directly to corresponding objects within images. Experimental results indicate that audio guidance can be effectively applied to object grounding, suggesting that incorporating audio guidance may enhance the precision and robustness of current object grounding methods and improve the performance of robotic systems and computer vision applications. This finding opens new possibilities for advanced object recognition, scene understanding, and the development of more intuitive and capable robotic systems.
Artificial-intelligence tools in research like ChatGPT are playing an increasingly transformative role in revolutionizing scientific publishing and re-shaping its economic background. They can help academics to tackle such issues as limited space in academic journals, accessibility of knowledge, delayed dissemination, or the exponential growth of academic output. Moreover, AI tools could potentially change scientific communication and the academic publishing market as we know them. They can help to promote Open Access (OA) in the form of preprints, dethrone the entrenched journals and publishers, as well as introduce novel approaches to the assessment of research output. It is also imperative that they should do just that, once and for all.
Quantum formulas, defined by Yao [FOCS '93], are the quantum analogs of classical formulas, i.e., classical circuits in which all gates have fanout one. We show that any read-once quantum formula over a gate set that contains all single-qubit gates is equivalent to a read-once classical formula of the same size and depth over an analogous classical gate set. For example, any read-once quantum formula over Toffoli and single-qubit gates is equivalent to a read-once classical formula over Toffoli and NOT gates. We then show that the equivalence does not hold if the read-once restriction is removed. To show the power of quantum formulas without the read-once restriction, we define a new model of computation called the one-qubit model and show that it can compute all boolean functions. This model may also be of independent interest.
A Boolean function is called read-once over a basis B if it can be expressed by a formula over B where no variable appears more than once. A checking test for a read-once function f over B depending on all its variables is a set of input vectors distinguishing f from all other read-once functions of the same variables. We show that every read-once function f over B has a checking test containing O(n^l) vectors, where n is the number of relevant variables of f and l is the largest arity of functions in B. For some functions, this bound cannot be improved by more than a constant factor. The employed technique involves reconstructing f from its l-variable projections and provides a stronger form of Kuznetsov's classic theorem on read-once representations.
Recent work of Chinburg, Reid, and Stover has shown that certain arithmetic and algebro-geometric properties of the character variety of a hyperbolic knot complement in the 3-sphere $M=S^3\setminus K$ yield topological and number theoretic information about Dehn fillings of M. Specifically, they show how the study of a certain extension problem for quaternion Azumaya algebras is related to topological invariants associated to these fillings. In this paper, we extend their work to the setting of hyperbolic once punctured torus bundles. Along the way, we exhibit new phenomena in the relevant extension problem not visible in the case of a hyperbolic knot complement, which are related to the more complicated non-abelian reducible representation theory of hyperbolic once punctured torus bundles. We then apply these results to a series of examples from the literature and list some remaining questions from both works.