Large Language Models (LLMs) demonstrate remarkable fluency across high-resource languages yet consistently fail to generate coherent text in Kashmiri, a language spoken by approximately seven million people. This performance disparity stems not from inherent model limitations but from a critical scarcity of high-quality training data. Decades of Kashmiri literature remain inaccessible to modern NLP pipelines due to their encoding in the proprietary InPage desktop publishing format. This paper introduces KS-LIT-3M, a curated corpus of 3.1 million words (16.4 million characters) specifically designed for pretraining language models on Kashmiri. The dataset is structured as a single continuous linear text stream, optimized for causal language model training where models learn to predict subsequent tokens from preceding context. The corpus was constructed through the development of a specialized InPage-to-Unicode converter, followed by rigorous preprocessing including English contamination removal, character normalization, and quality validation. Encompassing 131,607 unique words drawn from diverse genres including literary works, journalistic writing, academic texts, and religious sc
Over the last years, Ethereum has evolved into a public platform that safeguards the savings of hundreds of millions of people and secures more than $650 billion in assets, placing it among the top 25 stock exchanges worldwide in market capitalization, ahead of Singapore, Mexico, and Thailand. As such, the performance and security of the Ethereum blockchain are not only of theoretical interest, but also carry significant global economic implications. At the time of writing, the Ethereum platform is collectively secured by almost one million validators highlighting its decentralized nature and underlining its economic security guarantees. However, due to this large validator set, the protocol takes around 15 minutes to finalize a block which is prohibitively slow for many real world applications. This delay is largely driven by the cost of aggregating and disseminating signatures across a validator set of this scale. Furthermore, as we show in this paper, the existing protocol that is used to aggregate and disseminate the signatures has several shortcomings that can be exploited by adversaries to shift stake proportion from honest to adversarial nodes. In this paper, we introduce Wo
Conventional integrated circuits (ICs) struggle to meet the escalating demands of artificial intelligence (AI). This has sparked a renewed interest in an unconventional computing paradigm: neuromorphic (brain-inspired) computing. However, current neuromorphic systems face significant challenges in delivering a large number of parameters (i.e., weights) required for large-scale AI models. As a result, most neuromorphic hardware is limited to basic benchmark demonstrations, hindering its application to real-world AI challenges. Here, we present a large-scale optical neural network (ONN) for machine learning acceleration, featuring over 41 million photonic neurons. This system not only surpasses digital electronics in speed and energy efficiency but more importantly, closes the performance gap with large-scale AI models. Our ONN leverages an innovative optical metasurface device featuring numerous spatial modes. This device integrates over 41 million meta-atoms on a 10 mm$^2$ metasurface chip, enabling the processing of tens of millions of weights in a single operation. For the first time, we demonstrate that an ONN, utilizing a single-layer metasurface, can match the performance of d
We present an updated catalog of stellar parameters, including effective temperature, luminosity classification, and metallicity, for over fifty million stars from the SkyMapper Southern Survey (SMSS) DR4 and Gaia DR3. The accuracy of the derived parameters remains consistent with those achieved with SMSS DR2 using the same methods. Thanks to the advancements in SMSS DR4, photometric-metallicity estimates are now available for an unprecedented number of metal-poor stars. The catalog includes over 13 million metal-poor (MP; [Fe/H] $\leq -1$) stars, nearly three million very metal-poor (VMP; [Fe/H] $\leq -2.0$) stars, and approximately 120,000 extremely metal-poor (EMP; [Fe/H] $\leq -3.0$) stars - representing an increase by a factor of 4-6 compared to SMSS DR2. This catalog, combined with other stellar parameters obtained through our efforts, will be made available at https://nadc.china-vo.org/data/sports/ and https://zenodo.org/records/15108911.
We present KS-PRET-5M, the largest publicly available pretraining dataset for the Kashmiri language, comprising 5,090,244 (5.09M) words, 27,692,959 (27.6M) characters, and a vocabulary of 295,433 (295.4K) unique word types. We assembled the dataset from two source classes: digitized archival and literary material, encompassing literature, news, biographies, novels, poetry, religious scholarship, and academic writing, recovered from the proprietary InPage desktop-publishing format using the converter of Malik~\cite{malik2024inpage}, and Unicode-native text collected from Kashmiri-language web sources. All text was processed through an eleven-stage cleaning pipeline that achieves a mean Kashmiri script ratio of 0.9965, reducing Devanagari contamination to 146 characters across the full dataset. We tokenized the dataset empirically using google/muril-base-cased, yielding a subword ratio of 2.383 tokens per word and a total of approximately 12.13 million subword tokens, substantially higher than prior estimates derived from non-Kashmiri Perso-Arabic analogues. KS-PRET-5M is released as a single continuous text stream under CC~BY~4.0 to support language model pretraining, tokenizer trai
We report on our tool, Pulse Infinite, that uses proof techniques to show non-termination (divergence) in large programs. Pulse Infinite works compositionally and under-approximately: the former supports scale, and the latter ensures soundness for proving divergence. Prior work focused on small benchmarks in the tens or hundreds of lines of code (LoC), and scale limits their practicality: a single company may have tens of millions, or even hundreds of millions of LoC or more. We report on applying Pulse Infinite to over a hundred million lines of open-source and proprietary software written in C, C++, and Hack, identifying over 30 previously unknown issues, establishing a new state of the art for detecting divergence in real-world codebases.
We used 3.1 million spectroscopically labelled sources from the Sloan Digital Sky Survey (SDSS) to train an optimised random forest classifier using photometry from the SDSS and the Widefield Infrared Survey Explorer (WISE). We applied this machine learning model to 111 million previously unlabelled sources from the SDSS photometric catalogue which did not have existing spectroscopic observations. Our new catalogue contains 50.4 million galaxies, 2.1 million quasars, and 58.8 million stars. We provide individual classification probabilities for each source, with 6.7 million galaxies (13%), 0.33 million quasars (15%), and 41.3 million stars (70%) having classification probabilities greater than 0.99; and 35.1 million galaxies (70%), 0.72 million quasars (34%), and 54.7 million stars (93%) having classification probabilities greater than 0.9. Precision, Recall, and F1 score were determined as a function of selected features and magnitude error. We investigate the effect of class imbalance on our machine learning model and discuss the implications of transfer learning for populations of sources at fainter magnitudes than the training set. We used a non-linear dimension reduction techn
Large language models (LLMs) are increasingly utilized for complex tasks requiring longer context lengths, with some models supporting up to 128K or 1M tokens. This trend, however, presents significant challenges in inference speed and memory management. Quantization emerges as a promising approach to address the widening gap between LLM size and memory capacity. However, traditional quantization schemes often yield suboptimal compression results for KV caches due to two key factors: i) On-the-fly quantization and de-quantization, causing significant performance overhead; ii) Prevalence of outliers in KV values, challenging low-bitwidth uniform quantization. To this end, we propose MILLION, a novel quantization framework achieving low-bitwidth KV cache through product quantization. First, we conduct a thorough analysis of KV cache distribution, revealing the limitations of existing quantization schemes. Second, we introduce a non-uniform quantization algorithm based on product quantization, which efficiently compresses data while preserving accuracy. Third, we develop a high-performance GPU inference framework with efficient attention kernel and pipeline design for MILLION that lev
During the 20th Century, aerial surveys captured hundreds of millions of high-resolution photographs of the earth's surface. These images, the precursors to modern satellite imagery, represent an extraordinary visual record of the environmental and social upheavals of the 20th Century. However, most of these images currently languish in physical archives where retrieval is difficult and costly. Digitization could revolutionize access, but manual scanning is slow and expensive. Automated scanning could make at-scale digitization feasible, unlocking this visual record of the 20th Century for the digital era. Here, we describe and validate a novel robot-assisted pipeline that increases worker productivity in scanning 30-fold, applied at scale to digitize an archive of 1.7 million historical aerial photographs from 65 countries.
Efforts over the past three decades have produced web archives containing billions of webpage snapshots and petabytes of data. The End of Term Web Archive alone contains, among other file types, millions of PDFs produced by the federal government. While preservation with web archives has been successful, significant challenges for access and discoverability remain. For example, current affordances for browsing the End of Term PDFs are limited to downloading and browsing individual PDFs, as well as performing basic keyword search across them. In this paper, we introduce GovScape, a public search system that supports multimodal searches across 10,015,993 federal government PDFs from the 2020 End of Term crawl (70,958,487 total PDF pages) - to our knowledge, all renderable PDFs in the 2020 crawl that are 50 pages or under. GovScape supports four primary forms of search over these 10 million PDFs: in addition to providing (1) filter conditions over metadata facets including domain and crawl date and (2) exact text search against the PDF text, we provide (3) semantic text search and (4) visual search against the PDFs across individual pages, enabling users to structure queries such as "
Simulating water droplets made up of millions of molecules and on timescales as needed in biological and technological applications is challenging due to the difficulty of balancing accuracy with computational capabilities. Most detailed descriptions, such as ab initio, polarizable, or rigid models, are typically constrained to a few hundred (for ab initio) or thousands of molecules (for rigid models). Recent machine learning approaches allow for the simulation of up to 4 million molecules with ab initio accuracy but only for tens of nanoseconds, even if parallelized across hundreds of GPUs. In contrast, coarse-grained models permit simulations on a larger scale but at the expense of accuracy or transferability. Here, we consider the CVF molecular model of fluid water, which bridges the gap between accuracy and efficiency for free-energy and thermodynamic quantities due to i) a detailed calculation of the hydrogen bond contributions at the molecular level, including cooperative effects, and ii) coarse-graining of the translational and rotational degrees of freedom of the molecules. The CVF model can reproduce the experimental equation of state and fluctuations of fluid water across
We present the main-sequence binary (MSMS) Catalog derived from Gaia Data Release 3 BP/RP (XP) spectra. Leveraging the vast sample of low-resolution Gaia XP spectra, we develop a forward modeling approach that maps stellar mass and photometric metallicity to XP spectra using a neural network. Our methodology identifies binary systems through statistical comparison of single- and binary-star model fits, enabling detection of binaries with mass ratios between 0.4 and 1.0 and flux ratios larger than 0.1. From an initial sample of 35 million stars within 1 kpc, we identify 14 million binary candidates and define a high-confidence "golden sample" of 1 million binary systems. This large, homogeneous sample enables detailed statistical analysis of binary properties across diverse Galactic environments, providing new insights into binary star formation and evolution. In addition, the $χ^2$ comparison allows us to distinguish stars with luminous companions from single stars or binaries with dark companions, such as white dwarfs, neutron stars and black hole candidates, improving our understanding of compact object populations.
Currently, there is an urgent demand for scalable multivariate and high-dimensional false discovery rate (FDR)-controlling variable selection methods to ensure the repro-ducibility of discoveries. However, among existing methods, only the recently proposed Terminating-Random Experiments (T-Rex) selector scales to problems with millions of variables, as encountered in, e.g., genomics research. The T-Rex selector is a new learning framework based on early terminated random experiments with computer-generated dummy variables. In this work, we propose the Big T-Rex, a new implementation of T-Rex that drastically reduces its Random Access Memory (RAM) consumption to enable solving FDR-controlled sparse regression problems with millions of variables on a laptop. We incorporate advanced memory-mapping techniques to work with matrices that reside on solid-state drive and two new dummy generation strategies based on permutations of a reference matrix. Our nu-merical experiments demonstrate a drastic reduction in memory demand and computation time. We showcase that the Big T-Rex can efficiently solve FDR-controlled Lasso-type problems with five million variables on a laptop in thirty minutes
Menopause reshapes female physiology, yet its full temporal footprint is obscured by uncertainty in the age of the final menstrual period (FMP). Here we analyse cross-sectional data on 300 million laboratory tests from more than a million women in two population-scale cohorts (Israel-Clalit and US-NHANES). We apply a deconvolution algorithm inspired by astronomical image "de-blurring" to align each test to time-from-FMP rather than chronological age. Nearly every assay - spanning endocrine, bone, hepatic, lipid, osmolality, inflammatory and muscular systems - exhibits a jump at FMP that is absent in males and highly concordant between cohorts. Jumps were largest in the sex hormones, followed by bone, toxins, red blood cells, liver, iron, lipids, kidney, and muscle. Changes are mostly detrimental except iron indices and anemia that improve post-menopause, and depression scores that spike only transiently. Hormone-replacement therapy attenuates many of the step-like changes. Sex hormone dysregulation occurs more than 10 years prior to FMP. These findings reveal the step-like dysregulation across physiology caused by loss of sex hormones and establish deconvolution as a general strate
Small molecules are essential to drug discovery, and graph-language models hold promise for learning molecular properties and functions from text. However, existing molecule-text datasets are limited in scale and informativeness, restricting the training of generalizable multimodal models. We present MolTextNet, a dataset of 2.5 million high-quality molecule-text pairs designed to overcome these limitations. To construct it, we propose a synthetic text generation pipeline that integrates structural features, computed properties, bioactivity data, and synthetic complexity. Using GPT-4o-mini, we create structured descriptions for 2.5 million molecules from ChEMBL35, with text over 10 times longer than prior datasets. MolTextNet supports diverse downstream tasks, including property prediction and structure retrieval. Pretraining CLIP-style models with Graph Neural Networks and ModernBERT on MolTextNet yields improved performance, highlighting its potential for advancing foundational multimodal modeling in molecular science. Our dataset is available at https://huggingface.co/datasets/liuganghuggingface/moltextnet.
The feedforward (FFW) layers in standard transformer architectures incur a linear increase in computational costs and activation memory as the hidden layer width grows. Sparse mixture-of-experts (MoE) architectures have emerged as a viable approach to address this issue by decoupling model size from computational cost. The recent discovery of the fine-grained MoE scaling law shows that higher granularity leads to better performance. However, existing MoE models are limited to a small number of experts due to computational and optimization challenges. This paper introduces PEER (parameter efficient expert retrieval), a novel layer design that utilizes the product key technique for sparse retrieval from a vast pool of tiny experts (over a million). Experiments on language modeling tasks demonstrate that PEER layers outperform dense FFWs and coarse-grained MoEs in terms of performance-compute trade-off. By enabling efficient utilization of a massive number of experts, PEER unlocks the potential for further scaling of transformer models while maintaining computational efficiency.
Supersonic magnetohydrodynamic (MHD) turbulence is a ubiquitous state for many astrophysical plasmas. However, even the basic statistics for this type of turbulence remains uncertain. We present results from supersonic MHD turbulence simulations at unparalleled resolutions, with plasma Reynolds numbers of over a million. In the kinetic energy spectrum we find a break between the scales that are dominated by kinetic energy, with spectral index $-2$, and those that become strongly magnetized, with spectral index $-3/2$. By analyzing the Helmholtz decomposed kinetic energy spectrum, we find that the compressible modes are not passively mixed through the cascade of the incompressible modes. At high magnetic Reynolds number, above $10^5$, we find a power law in the magnetic energy spectrum with spectral index $-9/5$. On the strongly magnetized, subsonic scales the plasma tends to self-organize into locally relaxed regions, where there is strong alignment between the current density, magnetic field, velocity field and vorticity field, depleting both the nonlinearities and magnetic terms in the MHD equations, which we attribute to plasma relaxation on scales where the magnetic fluctuation
To foster the development of new models for collaborative AI-assisted report generation, we introduce MegaWika, consisting of 13 million Wikipedia articles in 50 diverse languages, along with their 71 million referenced source materials. We process this dataset for a myriad of applications, going beyond the initial Wikipedia citation extraction and web scraping of content, including translating non-English articles for cross-lingual applications and providing FrameNet parses for automated semantic analysis. MegaWika is the largest resource for sentence-level report generation and the only report generation dataset that is multilingual. We manually analyze the quality of this resource through a semantically stratified sample. Finally, we provide baseline results and trained models for crucial steps in automated report generation: cross-lingual question answering and citation retrieval.
Most Self-Admitted Technical Debt (SATD) research utilizes explicit SATD features such as 'TODO' and 'FIXME' for SATD detection. A closer look reveals several SATD research uses simple SATD ('Easy to Find') code comments without the contextual data (preceding and succeeding source code context). This work addresses this gap through PENTACET (or 5C dataset) data. PENTACET is a large Curated Contextual Code Comments per Contributor and the most extensive SATD data. We mine 9,096 Open Source Software Java projects with a total of 435 million LOC. The outcome is a dataset with 23 million code comments, preceding and succeeding source code context for each comment, and more than 250,000 comments labeled as SATD, including both 'Easy to Find' and 'Hard to Find' SATD. We believe PENTACET data will further SATD research using Artificial Intelligence techniques.
The ability to understand and generate similes is an imperative step to realize human-level AI. However, there is still a considerable gap between machine intelligence and human cognition in similes, since deep models based on statistical distribution tend to favour high-frequency similes. Hence, a large-scale symbolic knowledge base of similes is required, as it contributes to the modeling of diverse yet unpopular similes while facilitating additional evaluation and reasoning. To bridge the gap, we propose a novel framework for large-scale simile knowledge base construction, as well as two probabilistic metrics which enable an improved understanding of simile phenomena in natural language. Overall, we construct MAPS-KB, a million-scale probabilistic simile knowledge base, covering 4.3 million triplets over 0.4 million terms from 70 GB corpora. We conduct sufficient experiments to justify the effectiveness and necessity of the methods of our framework. We also apply MAPS-KB on three downstream tasks to achieve state-of-the-art performance, further demonstrating the value of MAPS-KB.