Found 20 results
Deploying modern Speech Language Models (SpeechLMs) in streaming settings requires systems that provide low latency, high throughput, and strong guarantees of streamability. Existing systems fall short of supporting diverse models flexibly and efficiently. We present VoxServe, a unified serving system for SpeechLMs that optimizes streaming performance. VoxServe introduces a model-execution abstraction that decouples model architecture from system-level optimizations, thereby enabling support for diverse SpeechLM architectures within a single framework. Building on this abstraction, VoxServe implements streaming-aware scheduling and an asynchronous inference pipeline to improve end-to-end efficiency. Evaluations across multiple modern SpeechLMs show that VoxServe achieves 10-20x higher throughput than existing implementations at comparable latency while maintaining high streaming viability. The code of VoxServe is available at https://github.com/vox-serve/vox-serve.
We introduce Vox-Profile, a comprehensive benchmark to characterize rich speaker and speech traits using speech foundation models. Unlike existing works that focus on a single dimension of speaker traits, Vox-Profile provides holistic and multi-dimensional profiles that reflect both static speaker traits (e.g., age, sex, accent) and dynamic speech properties (e.g., emotion, speech flow). This benchmark is grounded in speech science and linguistics, developed with domain experts to accurately index speaker and speech characteristics. We report benchmark experiments using over 15 publicly available speech datasets and several widely used speech foundation models that target various static and dynamic speaker and speech properties. In addition to benchmark experiments, we showcase several downstream applications supported by Vox-Profile. First, we show that Vox-Profile can augment existing speech recognition datasets to analyze ASR performance variability. Vox-Profile is also used as a tool to evaluate the performance of speech generation systems. Finally, we assess the quality of our automated profiles through comparison with human evaluation and show convergent validity. Vox-Profile
Virtual content creation and interaction play an important role in modern 3D applications such as AR and VR. Recovering detailed 3D models from real scenes can significantly expand the scope of these applications and has been studied for decades in the computer vision and computer graphics communities. We propose Vox-Surf, a voxel-based implicit surface representation. Vox-Surf divides the space into finite bounded voxels. Each voxel stores geometry and appearance information in its corner vertices. Vox-Surf is suitable for almost any scenario thanks to the sparsity inherited from the voxel representation and can be easily trained from multi-view images. We leverage a progressive training procedure to extract important voxels gradually for further optimization so that only valid voxels are preserved, which greatly reduces the number of sampling points and increases rendering speed. The fine voxels can also be considered as the bounding volume for collision detection. The experiments show that the Vox-Surf representation can learn delicate surface details and accurate color with less memory and faster rendering speed than other methods. We also show that Vox-Surf can be more practical in scen
Recent advances in zero-shot text-to-speech (TTS), driven by language models, diffusion models and masked generation, have achieved impressive naturalness in speech synthesis. Nevertheless, stability and fidelity remain key challenges, manifesting as mispronunciations, audible noise, and quality degradation. To address these issues, we introduce Vox-Evaluator, a multi-level evaluator designed to guide the correction of erroneous speech segments and preference alignment for TTS systems. It is capable of identifying the temporal boundaries of erroneous segments and providing a holistic quality assessment of the generated speech. Specifically, to refine erroneous segments and enhance the robustness of the zero-shot TTS model, we propose to automatically identify acoustic errors with the evaluator, mask the erroneous segments, and finally regenerate speech conditioned on the correct portions. In addition, the fine-grained information obtained from Vox-Evaluator can guide preference alignment for the TTS model, thereby reducing the bad cases in speech synthesis. Due to the lack of suitable training datasets for the Vox-Evaluator, we also constructed a synthesized text-speech dataset ann
Large Language Models' capacity to reason in natural language makes them uniquely promising for 4X and grand strategy games, enabling more natural human-AI gameplay interactions such as collaboration and negotiation. However, these games present unique challenges due to their complexity and long-horizon nature, while latency and cost factors may hinder LLMs' real-world deployment. Working on a classic 4X strategy game, Sid Meier's Civilization V with the Vox Populi mod, we introduce Vox Deorum, a hybrid LLM+X architecture. Our layered technical design empowers LLMs to handle macro-strategic reasoning, delegating tactical execution to subsystems (e.g., algorithmic AI or reinforcement learning AI in the future). We validate our approach through 2,327 complete games, comparing two open-source LLMs with a simple prompt against Vox Populi's enhanced AI. Results show that LLMs achieve competitive end-to-end gameplay while exhibiting play styles that diverge substantially from algorithmic AI and from each other. Our work establishes a viable architecture for integrating LLMs in commercial 4X games, opening new opportunities for game design and agentic AI research.
Altermagnets represent a novel class of magnetic materials that integrate the advantages of both ferromagnets and antiferromagnets, providing a rich platform for exploring the physical properties of multiferroic materials. This work demonstrates that $\mathrm{VOX_2}$ monolayers ($\mathrm{X = Cl, Br, I}$) are two-dimensional ferroelectric altermagnets, as confirmed by symmetry analysis and first-principles calculations. The $\mathrm{VOI_2}$ monolayer exhibits a strong magnetoelectric coupling coefficient ($α_S \approx 1.208 \times 10^{-6}~\mathrm{s/m}$), with spin splitting in the electronic band structure tunable by both electric and magnetic fields. Additionally, the absence of inversion symmetry in noncentrosymmetric crystals enables significant nonlinear optical effects, such as shift current (SC). The $x$-direction component of SC exhibits a ferroicity-driven switching behavior. Moreover, the $σ^{yyy}$ component exhibits an exceptionally large spin SC of $330.072~\mathrm{μA/V^2}$. These findings highlight the intricate interplay between magnetism and ferroelectricity, offering versatile tunability of electronic and optical properties. $\mathrm{VOX_2}$ monolayers provide a promising
VOXES is a Von Hamos X-ray spectrometer developed at the INFN National Laboratories of Frascati for high-resolution laboratory X-ray spectroscopy in the 5--20~keV range. It uses curved mosaic crystals and motorized positioning stages to perform wavelength-dispersive X-ray fluorescence (WD-XRF) with sub-10~eV tunable resolution for extended and dilute samples. Recent developments include the integration of an energy-dispersive X-ray fluorescence (ED-XRF) line based on a silicon pin-diode detector, which enables flux monitoring and simultaneous ED and WD measurements. In addition, a dedicated liquid-sample holder has been introduced, and a Y-shaped support geometry, crucial for switching to a transmission layout, provides mechanical compatibility with laboratory XAS, now under implementation. These upgrades expand the versatility and automation of VOXES, strengthening its role as a table-top platform for laboratory X-ray spectroscopy.
Vanadium oxide (VOx) is a material of significant interest due to its metal-insulator transition (MIT) properties as well as its diverse stable antiferromagnetism depending on the valence states of V and O with distinct MIT transitions and Néel temperatures. Although several studies have reported ferromagnetism in VOx, it was mostly associated with impurities or defects, and pure VOx has rarely been reported as ferromagnetic. Our research presents clear evidence of ferromagnetism in VOx thin films, which exhibit a saturation magnetization of approximately 14 kA/m at 300 K. We fabricated 20-nm-thick VOx thin films via reactive sputtering from a metallic vanadium target in various oxygen atmospheres. The oxidation states of the ferromagnetic VOx films show an ill-defined stoichiometry of V2O3+p, where p = 0.05, 0.23, 0.49, with predominantly disordered microstructures. The ferromagnetic nature of these VOx films is confirmed through a strong antiferromagnetic exchange coupling with the neighboring ferromagnetic layer in the VOx/Co bilayers, in which the spin configuration of the Co layer is strongly influenced by the additional anisotropy introduced by the VOx layer. The present study highli
In this paper, we introduce Vox-Fusion++, a multi-maps-based robust dense tracking and mapping system that seamlessly fuses neural implicit representations with traditional volumetric fusion techniques. Building upon the concept of implicit mapping and positioning systems, our approach extends its applicability to real-world scenarios. Our system employs a voxel-based neural implicit surface representation, enabling efficient encoding and optimization of the scene within each voxel. To handle diverse environments without prior knowledge, we incorporate an octree-based structure for scene division and dynamic expansion. To achieve real-time performance, we propose a high-performance multi-process framework. This ensures the system's suitability for applications with stringent time constraints. Additionally, we adopt the idea of multi-maps to handle large-scale scenes, and leverage loop detection and hierarchical pose optimization strategies to reduce long-term pose drift and remove duplicate geometry. Through comprehensive evaluations, we demonstrate that our method outperforms previous methods in terms of reconstruction quality and accuracy across various scenarios. We also show th
We present a multimodal fusion framework that bridges pre-trained decoder-based large language models (LLM) and acoustic encoder-decoder architectures such as Whisper, with the aim of building speech-enabled LLMs. Instead of directly using audio embeddings, we explore an intermediate audio-conditioned text space as a more effective mechanism for alignment. Our method operates fully in continuous text representation spaces, fusing Whisper's hidden decoder states with those of an LLM through cross-modal attention, and supports both offline and streaming modes. We introduce \textit{VoxKrikri}, the first Greek speech LLM, and show through analysis that our approach effectively aligns representations across modalities. These results highlight continuous space fusion as a promising path for multilingual and low-resource speech LLMs, while achieving state-of-the-art results for Automatic Speech Recognition in Greek, providing an average $\sim20\%$ relative improvement across benchmarks.
In this work, we present a dense tracking and mapping system named Vox-Fusion, which seamlessly fuses neural implicit representations with traditional volumetric fusion methods. Our approach is inspired by the recently developed implicit mapping and positioning system and further extends the idea so that it can be freely applied to practical scenarios. Specifically, we leverage a voxel-based neural implicit surface representation to encode and optimize the scene inside each voxel. Furthermore, we adopt an octree-based structure to divide the scene and support dynamic expansion, enabling our system to track and map arbitrary scenes without knowing the environment in advance, as previous works require. Moreover, we propose a high-performance multi-process framework to speed up the method, thus supporting some applications that require real-time performance. The evaluation results show that our method can achieve better accuracy and completeness than previous methods. We also show that Vox-Fusion can be used in augmented reality and virtual reality applications. Our source code is publicly available at https://github.com/zju3dv/Vox-Fusion.
This paper presents a fast lidar-inertial odometry (LIO) that is robust to aggressive motion. To achieve robust tracking in aggressive motion scenes, we exploit the continuous scanning property of lidar to adaptively divide the full scan into multiple partial scans (named sub-frames) according to the motion intensity. And to avoid the degradation of sub-frames resulting from insufficient constraints, we propose a robust state estimation method based on a tightly-coupled iterated error state Kalman smoother (ESKS) framework. Furthermore, we propose a robocentric voxel map (RC-Vox) to improve the system's efficiency. The RC-Vox allows efficient maintenance of map points and k nearest neighbor (k-NN) queries by mapping local map points into a fixed-size, two-layer 3D array structure. Extensive experiments are conducted on 27 sequences from 4 public datasets and our own dataset. The results show that our system can achieve stable tracking in aggressive motion scenes (angular velocity up to 21.8 rad/s) that cannot be handled by other state-of-the-art methods, while our system can achieve competitive performance with these methods in general scenes. Furthermore, thanks to the RC-Vox, our
Multifunctional two-dimensional (2D) multiferroics are promising materials for designing low-dimensional multipurpose devices. The key to multifunctionality in these materials is breaking the space-inversion and the time-reversal symmetry, which results in spontaneous electric polarization and magnetization in the same phase. A new class of 2D materials, Janus 2D materials, has emerged, which works on a similar principle of breaking out-of-plane symmetry to invoke new exciting functionalities in the 2D materials, such as an out-of-plane piezoelectric polarization. In this work, a new group of 2D multiferroic Janus monolayers VOXY (X/Y = F, Cl, Br, or I, and X $\neq$ Y) is derived by breaking the out-of-plane symmetry in the parent multiferroics VOX$_2$ (X = F, Cl, Br, or I). The structural, magnetic, and ferroelectric properties of multiferroics VOX$_2$ are compared with their Janus derivatives. We calculated in-plane and out-of-plane piezoelectric polarization for VOX$_2$ and VOXY series, where VOFCl, VOFBr, VOFI, and VOClI are found to have significant out-of-plane piezoelectric polarization. Our theoretical work predicts a new series of 2D multiferroic materials and encourages the
Cryo-Electron Tomography (cryo-ET) is a 3D imaging technology facilitating the study of macromolecular structures at near-atomic resolution. Recent volumetric segmentation approaches on cryo-ET images have drawn widespread interest in the biological sector. However, existing methods rely heavily on manually labeled data, which requires highly professional skills, thereby hindering the adoption of fully-supervised approaches for cryo-ET images. Some unsupervised domain adaptation (UDA) approaches have been designed to enhance segmentation network performance using unlabeled data. However, applying these methods directly to cryo-ET image segmentation tasks remains challenging due to two main issues: 1) the source data, usually obtained through simulation, contain a certain level of noise, while the target data, collected directly as raw data in real-world scenarios, have unpredictable noise levels; 2) the source data used for training typically consist of known macromolecules, while the target domain data are often unknown, causing the model's segmenter to be biased towards these known macromolecules, leading to a domain shift problem. To address these challenges, in this work,
Incorrect information poses significant challenges by disrupting content veracity and integrity, yet most detection approaches struggle to jointly balance textual content verification with external knowledge modification under collapsed attention geometries. To address this issue, we propose a dual-head reasoning framework, BiMind, which disentangles content-internal reasoning from knowledge-augmented reasoning. In BiMind, we introduce three core innovations: (i) an attention geometry adapter that reshapes attention logits via token-conditioned offsets and mitigates attention collapse; (ii) a self-retrieval knowledge mechanism, which constructs an in-domain semantic memory through kNN retrieval and injects retrieved neighbors via feature-wise linear modulation; (iii) the uncertainty-aware fusion strategies, including entropy-gated fusion and a trainable agreement head, stabilized by a symmetric Kullback-Leibler agreement regularizer. To quantify the knowledge contributions, we define a novel metric, Value-of-eXperience (VoX), to measure instance-wise logit gains from knowledge-augmented reasoning. Experiment results on public datasets demonstrate that our BiMind model outperforms a
Determining the oxidation states of metals is of great importance in various applications because a variation in the oxidation number can drastically influence the material properties. As an example, this becomes evident in edible liquids like wine and oil, where a change in the oxidation states of the contained metals can significantly modify both the overall quality and taste. To this end, here we present the MITIQO project, which aims to identify oxidation states of metals in edible liquids utilizing X-ray emission with Bragg spectroscopy. This is achieved using the VOXES crystal spectrometer, developed at INFN National Laboratories of Frascati (LNF), employing mosaic crystals (HAPG) in the Von Hamos configuration. This combination allows us to work with effective source sizes of up to a few millimeters and improves the typically low efficiency of Bragg spectroscopy, a crucial aspect when studying liquids with low metal concentration. Here we showcase the concept behind MITIQO for a liquid solution containing oxidized iron. We performed several high-resolution emission spectra measurements, for the liquid and for different powdered samples containing oxidized and pure iron. By lo
The formation of VO2 crystalline domains in amorphous substoichiometric nanocolumnar VOx thin films subjected to an oxidation process at temperatures below 300°C has been studied. The results show that values of [O]/[V] above 1.9 lead to the sole formation of V2O5 after oxidation, while values below 1.9 favor the formation of VO2, V3O7 and V2O5 crystalline domains for temperatures as low as 260°C. Moreover, it is found that the adsorption of oxygen and its incorporation into the film network produce a relevant volume expansion in a so-called swelling mechanism that makes pores shrink. Under some specific conditions, the low-temperature oxidation not only triggers the formation of VO2 domains but also drastically reduces the oxygen-deficient amorphous VOx in the films, which clearly improves the overall transparency and thermochromic modulation capabilities. The changes in the optical and electrical properties of these films during the metal-insulator transition have been studied, finding the best performance when the stoichiometry of the film before oxidation is [O]/[V]=1.5 and the oxidation temperature is 280°C. These conditions yield a relatively transparent coating that presents a
The recent development of large language models (LLMs) has spurred discussions about whether LLM-generated "synthetic samples" could complement or replace traditional surveys, considering their training data potentially reflects attitudes and behaviors prevalent in the population. A number of mostly US-based studies have prompted LLMs to mimic survey respondents, with some of them finding that the responses closely match the survey data. However, several contextual factors related to the relationship between the respective target population and LLM training data might affect the generalizability of such findings. In this study, we investigate the extent to which LLMs can estimate public opinion in Germany, using the example of vote choice. We generate a synthetic sample of personas matching the individual characteristics of the 2017 German Longitudinal Election Study respondents. We ask the LLM GPT-3.5 to predict each respondent's vote choice and compare these predictions to the survey-based estimates on the aggregate and subgroup levels. We find that GPT-3.5 does not predict citizens' vote choice accurately, exhibiting a bias towards the Green and Left parties. While the LLM captu
In the era of generative AI and specifically large language models (LLMs), exemplified by ChatGPT, the intersection of artificial intelligence and human reasoning has become a focal point of global attention. Unlike conventional search engines, LLMs go beyond mere information retrieval, entering into the realm of discourse culture. Their outputs mimic well-considered, independent opinions or statements of facts, presenting a pretense of wisdom. This paper explores the potential transformative impact of LLMs on democratic societies. It delves into the concerns regarding the difficulty in distinguishing ChatGPT-generated texts from human output. The discussion emphasizes the essence of authorship, rooted in the unique human capacity for reason - a quality indispensable for democratic discourse and successful collaboration within free societies. Highlighting the potential threats to democracy, this paper presents three arguments: the Substitution argument, the Authenticity argument, and the Facts argument. These arguments highlight the potential risks that are associated with an overreliance on LLMs. The central thesis posits that widespread deployment of LLMs may adversely affect the f
Our quality audit for three widely used public multilingual speech datasets - Mozilla Common Voice 17.0, FLEURS, and Vox Populi - shows that in some languages, these datasets suffer from significant quality issues, which may obfuscate downstream evaluation results while creating an illusion of success. We divide these quality issues into two categories: micro-level and macro-level. We find that macro-level issues are more prevalent in less institutionalized, often under-resourced languages. We provide a case analysis of Taiwanese Southern Min (nan_tw) that highlights the need for proactive language planning (e.g. orthography prescriptions, dialect boundary definition) and enhanced data quality control in the dataset creation process. We conclude by proposing guidelines and recommendations to mitigate these issues in future dataset development, emphasizing the importance of sociolinguistic awareness and language planning principles. Furthermore, we encourage research into how this creation process itself can be leveraged as a tool for community-led language planning and revitalization.