Found 20 results
A recent study [1] introduced an advanced method for extracting beams from a circular particle accelerator over millions of turns using stable islands and a bent crystal. This technique leverages non-linear beam dynamics, with adiabatic trapping and transport within stable islands, in combination with a bent silicon crystal, to provide efficient extraction of hadron beams over millions of turns. The encouraging results of the comprehensive numerical simulations were validated through a successful proof-of-principle experiment at the CERN Super Proton Synchrotron, which demonstrated the feasibility of the technique. This paper presents a detailed discussion and analysis of the experimental results.
Scopus is increasingly regarded as a high-quality and reliable data source for research and evaluation of scientific and scholarly activity. However, a puzzling phenomenon has been discovered: millions of records with author affiliation information collected in Scopus are oddly labeled as "country-undefined" by Scopus, a label rarely seen in its counterpart, Web of Science. This huge number of "homeless" records in Scopus is unacceptable for a widely used high-quality bibliographic database. Using data from the past 124 years, this brief communication probes these affiliated but country-undefined records in Scopus. Our analysis identifies four primary causes for these "homeless" records: incomplete author affiliation addresses, Scopus' inability to recognize different variants of country/territory names, misspelled country/territory names in author affiliation addresses, and Scopus' inability to correctly split and identify the clean affiliation addresses. To address this pressing issue, we put forward several recommendations to relevant stakeholders, with the aim of resettling millions of "homeless" records in Scopus and reducing its potential
Leveraging the in-context learning (ICL) capability of Large Language Models (LLMs) for tabular classification has gained significant attention for its training-free adaptability across diverse datasets. Recent advancements, like TabPFN, excel in small-scale tabular datasets but struggle to scale for large and complex datasets. Our work enhances the efficiency and scalability of TabPFN for larger datasets by incorporating linear attention mechanisms as a scalable alternative to quadratic-complexity self-attention. Our model, TabFlex, efficiently handles tabular datasets with thousands of features and hundreds of classes, scaling seamlessly to millions of samples. For instance, TabFlex processes the poker-hand dataset with over a million samples in just 5 seconds. Our extensive evaluations demonstrate that TabFlex can achieve over a 2x speedup compared to TabPFN and a 1.5x speedup over XGBoost, outperforming 25 tested baselines in terms of efficiency across a diverse range of datasets. Furthermore, TabFlex remains highly effective on large-scale datasets, delivering strong performance with significantly reduced computational costs, especially when combined with data-efficient techni
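To illustrate the contrast between linear and quadratic attention mentioned above, here is a minimal NumPy sketch of a generic kernelized (non-causal) linear attention. It is not TabFlex's implementation: the abstract does not specify the mechanism, so the `elu(x) + 1` feature map and the function names `feature_map`, `linear_attention`, and `quadratic_reference` are assumptions chosen for illustration. The point is only that reordering the matrix products avoids forming the n-by-n attention matrix.

```python
import numpy as np

def feature_map(x):
    # Assumed positive feature map: elu(x) + 1 (x + 1 for x > 0, exp(x) otherwise).
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(q, k, v):
    # q, k: (n, d); v: (n, d_v). Cost grows linearly in n.
    qf, kf = feature_map(q), feature_map(k)
    kv = kf.T @ v                 # (d, d_v) summary of keys and values
    z = kf.sum(axis=0)            # (d,) normalizer summary
    return (qf @ kv) / (qf @ z)[:, None]

def quadratic_reference(q, k, v):
    # Same computation, but materializing the (n, n) weight matrix.
    qf, kf = feature_map(q), feature_map(k)
    weights = qf @ kf.T
    return (weights @ v) / weights.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
q = rng.normal(size=(128, 16))
k = rng.normal(size=(128, 16))
v = rng.normal(size=(128, 8))
out = linear_attention(q, k, v)
```

Both functions return the same values up to floating-point error; only the order of multiplication, and therefore the memory and time scaling in the number of rows, differs.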
Simulating water droplets made up of millions of molecules and on timescales as needed in biological and technological applications is challenging due to the difficulty of balancing accuracy with computational capabilities. Most detailed descriptions, such as ab initio, polarizable, or rigid models, are typically constrained to a few hundred (for ab initio) or thousands of molecules (for rigid models). Recent machine learning approaches allow for the simulation of up to 4 million molecules with ab initio accuracy but only for tens of nanoseconds, even if parallelized across hundreds of GPUs. In contrast, coarse-grained models permit simulations on a larger scale but at the expense of accuracy or transferability. Here, we consider the CVF molecular model of fluid water, which bridges the gap between accuracy and efficiency for free-energy and thermodynamic quantities due to i) a detailed calculation of the hydrogen bond contributions at the molecular level, including cooperative effects, and ii) coarse-graining of the translational and rotational degrees of freedom of the molecules. The CVF model can reproduce the experimental equation of state and fluctuations of fluid water across
A single-photon source (SPS) is a key component required by quantum communication devices. We report the finding of bright diamond-based SPSs created by nature millions of years ago. It is shown that narrow ($\leq$ 2 nm) lines observed within the 500-800 nm range in photoluminescence (PL) spectra of the surface layer of untreated Yakut diamonds rich in nitrogen and hydrogen belong to SPSs. Moreover, unknown narrow-line PL observed earlier in nitrogen- and hydrogen-rich diamonds from various deposits around the world is thought to be associated with SPSs. Thus, the diamond rim, which until now was sent to the dumps or, at best, used as an abrasive powder, turned out to be a valuable material suitable for use in quantum technologies.
MGen is a dataset of over 4 million naturally occurring generic and quantified sentences extracted from diverse textual sources. Sentences in the dataset have long context documents, corresponding to websites and academic papers, and cover 11 different quantifiers. We analyze the features of generic sentences in the dataset and find that generics can be long sentences (averaging over 16 words) and that speakers often use them to express generalisations about people. MGen is the biggest and most diverse dataset of naturally occurring generic sentences, opening the door to large-scale computational research on genericity. It is publicly available at https://gustavocilleruelo.com/mgen
The only known example of an almost perfect nonlinear (APN) permutation in even dimension was obtained by applying CCZ-equivalence to a specific quadratic APN function. Motivated by this result, there have been numerous recent attempts to construct new quadratic APN functions. Currently, 32,892 quadratic APN functions in dimension 8 are known and two recent conjectures address their possible total number. The first, proposed by Y. Yu and L. Perrin (Cryptogr. Commun. 14(6): 1359-1369, 2022), suggests that there are more than 50,000 such functions. The second, by A. Polujan and A. Pott (Proc. 7th Int. Workshop on Boolean Functions and Their Applications, 2022), argues that their number exceeds that of inequivalent quadratic (8,4)-bent functions, which is 92,515. We computationally construct 3,775,599 inequivalent quadratic APN functions in dimension 8 and estimate the total number to be about 6 million.
We present the main-sequence binary (MSMS) Catalog derived from Gaia Data Release 3 BP/RP (XP) spectra. Leveraging the vast sample of low-resolution Gaia XP spectra, we develop a forward modeling approach that maps stellar mass and photometric metallicity to XP spectra using a neural network. Our methodology identifies binary systems through statistical comparison of single- and binary-star model fits, enabling detection of binaries with mass ratios between 0.4 and 1.0 and flux ratios larger than 0.1. From an initial sample of 35 million stars within 1 kpc, we identify 14 million binary candidates and define a high-confidence "golden sample" of 1 million binary systems. This large, homogeneous sample enables detailed statistical analysis of binary properties across diverse Galactic environments, providing new insights into binary star formation and evolution. In addition, the $χ^2$ comparison allows us to distinguish stars with luminous companions from single stars or binaries with dark companions, such as white dwarfs, neutron stars and black hole candidates, improving our understanding of compact object populations.
Transformer-based recommender systems, such as BERT4Rec or SASRec, achieve state-of-the-art results in sequential recommendation. However, it is challenging to use these models in production environments with catalogues of millions of items: scaling Transformers beyond a few thousand items is problematic for several reasons, including high model memory consumption and slow inference. In this respect, RecJPQ is a state-of-the-art method of reducing the models' memory consumption; RecJPQ compresses item catalogues by decomposing item IDs into a small number of shared sub-item IDs. Despite reporting the reduction of memory consumption by a factor of up to 50x, the original RecJPQ paper did not report inference efficiency improvements over the baseline Transformer-based models. Upon analysing RecJPQ's scoring algorithm, we find that its efficiency is limited by its use of score accumulators for each item, which prevents parallelisation. In contrast, LightRec (a non-sequential method that uses a similar idea of sub-ids) reported large inference efficiency improvements using an algorithm we call PQTopK. We show that it is also possible to improve RecJPQ-based models' inference efficiency
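The idea of scoring items through shared sub-item IDs can be sketched as follows. This is a minimal illustration based only on the description in the abstract, not the paper's PQTopK algorithm verbatim: it assumes an item's score is the sum of the scores of its sub-item IDs, and the names `pq_topk`, `sub_scores`, and `item_codes` are hypothetical. All item scores are computed in one vectorized gather-and-sum, with no per-item accumulator loop.

```python
import numpy as np

def pq_topk(sub_scores, item_codes, k):
    # sub_scores: (m, b) score of each of b sub-item ids in each of m splits.
    # item_codes: (n_items, m) sub-item id assigned to each item in each split.
    m = sub_scores.shape[0]
    # Gather sub-item scores for every item at once, then sum over splits.
    item_scores = sub_scores[np.arange(m), item_codes].sum(axis=1)
    # Select the k best items, then sort just those k by descending score.
    top = np.argpartition(-item_scores, k - 1)[:k]
    top = top[np.argsort(-item_scores[top])]
    return top, item_scores[top]

# Toy catalogue: 4 items, 2 splits, 2 sub-item ids per split.
sub_scores = np.array([[0.1, 0.9],
                       [0.5, 0.2]])
item_codes = np.array([[0, 0],
                       [1, 0],
                       [1, 1],
                       [0, 1]])
top_items, top_scores = pq_topk(sub_scores, item_codes, k=2)
```

In this toy example the sub-item scores are fixed numbers; in a real model they would come from comparing the sequence embedding with the sub-item embeddings.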
The recent advance in vision-language models is largely attributed to the abundance of image-text data. We aim to replicate this success for video-language models, but there simply is not enough human-curated video-text data available. We thus resort to fine-tuning a video-language model from a strong image-language baseline with synthesized instructional data. The resulting video model by video-instruction-tuning (VIIT) is then used to auto-label millions of videos to generate high-quality captions. We show the adapted video-language model performs well on a wide range of video-language benchmarks. For instance, it surpasses the best prior result on open-ended NExT-QA by 2.8%. Besides, our model generates detailed descriptions for previously unseen videos, which provide better textual supervision than existing methods. Experiments show that a video-language dual-encoder model contrastively trained on these auto-generated captions is 3.8% better than the strongest baseline that also leverages vision-language models. Our best model outperforms state-of-the-art methods on MSR-VTT zero-shot text-to-video retrieval by 6%. As a side product, we generate the largest video caption dataset
Accurate day-ahead electricity price forecasting is essential for residential welfare, yet current methods often fall short in forecast accuracy. We observe that commonly used time series models struggle to utilize the prior correlation between price and demand-supply, which, we found, can contribute substantially to a reliable electricity price forecaster. Leveraging this prior, we propose a simple piecewise linear model that significantly enhances forecast accuracy by directly deriving prices from readily forecastable demand-supply values. Experiments in the day-ahead electricity markets of Shanxi province and ISO New England reveal that such forecasts could potentially save residents millions of dollars a year compared to existing methods. Our findings underscore the value of suitably integrating time series modeling with economic priors for enhanced electricity price forecasting accuracy.
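A piecewise linear mapping from a demand-supply value to a price can be sketched with hinge features and ordinary least squares. The abstract does not give the model's exact form, so everything here is an assumption for illustration: the single input variable, the fixed knot positions, and the function names `hinge_features`, `fit_piecewise_linear`, and `predict_price`. The synthetic data are generated from a known piecewise linear curve, not taken from any market.

```python
import numpy as np

def hinge_features(x, knots):
    # Columns: intercept, x, and one hinge max(x - t, 0) per knot t.
    x = np.asarray(x, dtype=float)
    cols = [np.ones_like(x), x] + [np.maximum(x - t, 0.0) for t in knots]
    return np.column_stack(cols)

def fit_piecewise_linear(x, y, knots):
    # Least-squares fit of a continuous piecewise linear curve.
    design = hinge_features(x, knots)
    coef, *_ = np.linalg.lstsq(design, np.asarray(y, dtype=float), rcond=None)
    return coef

def predict_price(x, knots, coef):
    return hinge_features(x, knots) @ coef

# Synthetic example: slope 2 below the knot at 6, slope 7 above it.
x = np.linspace(0.0, 10.0, 50)
y = 20.0 + 2.0 * x + 5.0 * np.maximum(x - 6.0, 0.0)
knots = [6.0]
coef = fit_piecewise_linear(x, y, knots)
price_at_8 = predict_price([8.0], knots, coef)[0]
```

With fixed knots the fit is linear in its coefficients, which keeps the model easy to inspect; choosing or learning the knots is a separate question that this sketch does not address.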
Nanoparticles, exhibiting functionally relevant structural heterogeneity, are at the forefront of cutting-edge research. Now, high-throughput single-particle imaging (SPI) with x-ray free-electron lasers (XFELs) creates unprecedented opportunities for recovering the shape distributions of millions of particles that exhibit functionally relevant structural heterogeneity. To realize this potential, three challenges have to be overcome: (1) simultaneous parametrization of structural variability in real and reciprocal spaces; (2) efficiently inferring the latent parameters of each SPI measurement; (3) scaling up comparisons between $10^5$ structural models and $10^6$ XFEL-SPI measurements. Here, we describe how we overcame these three challenges to resolve the non-equilibrium shape distributions within millions of gold nanoparticles imaged at the European XFEL. These shape distributions allowed us to quantify the degree of asymmetry in these particles, discover a relatively stable `shape envelope' amongst nanoparticles, discern finite-size effects related to shape-controlling surfactants, and extrapolate nanoparticles' shapes to their idealized thermodynamic limit. Ultimately, these de
Popularized by the Differentiable Search Index, the emerging paradigm of generative retrieval re-frames the classic information retrieval problem into a sequence-to-sequence modeling task, forgoing external indices and encoding an entire document corpus within a single Transformer. Although many different approaches have been proposed to improve the effectiveness of generative retrieval, they have only been evaluated on document corpora on the order of 100k in size. We conduct the first empirical study of generative retrieval techniques across various corpus scales, ultimately scaling up to the entire MS MARCO passage ranking task with a corpus of 8.8M passages and evaluating model sizes up to 11B parameters. We uncover several findings about scaling generative retrieval to millions of passages; notably, the central importance of using synthetic queries as document representations during indexing, the ineffectiveness of existing proposed architecture modifications when accounting for compute cost, and the limits of naively scaling model parameters with respect to retrieval performance. While we find that generative retrieval is competitive with state-of-the-art dual encoders on sma
Announcing release v2 of the MORX (Millions of Optical-Radio/X-ray Associations) catalogue, which presents probable (40%-100% likelihood) radio/X-ray associations, including double radio lobes, to optical objects over the whole sky. Detections from all the largest radio/X-ray surveys to June 2023 are evaluated, those surveys being the VLASS, LoTSS, RACS, FIRST, NVSS, and SUMSS radio surveys, and the Chandra, XMM-Newton, Swift, and ROSAT X-ray surveys. In total, 3,115,575 optical objects of all classifications (or unclassified) are so associated. The MORX v2 catalogue is available on multiple sites.
To foster the development of new models for collaborative AI-assisted report generation, we introduce MegaWika, consisting of 13 million Wikipedia articles in 50 diverse languages, along with their 71 million referenced source materials. We process this dataset for a myriad of applications, going beyond the initial Wikipedia citation extraction and web scraping of content, including translating non-English articles for cross-lingual applications and providing FrameNet parses for automated semantic analysis. MegaWika is the largest resource for sentence-level report generation and the only report generation dataset that is multilingual. We manually analyze the quality of this resource through a semantically stratified sample. Finally, we provide baseline results and trained models for crucial steps in automated report generation: cross-lingual question answering and citation retrieval.
Treasure Data is processing millions of distributed SQL queries every day on the cloud. Upgrading the query engine service at this scale is challenging because we need to migrate all of the production queries of the customers to a new version while preserving the correctness and performance of the data processing pipelines. To ensure the quality of the query engines, we utilize our query logs to build customer-specific benchmarks and replay these queries with real customer data in a secure pre-production environment. To simulate millions of queries, we need effective minimization of test query sets and better reporting of the simulation results to proactively find incompatible changes and performance regression of the new version. This paper describes the overall design of our system and shares various challenges in maintaining the quality of the query engine service on the cloud.
BACKGROUND: The Danish Civil Registration System (CRS) was established in 1968, when all persons alive and living in Denmark were registered. Among many other variables, it includes individual information on personal identification number, gender, date of birth, place of birth, place of residence, citizenship, continuously updated information on vital status, and the identity of parents and spouses. METHODS: To evaluate the quality and completeness of the information recorded on persons in the CRS, we considered all persons registered on November 4, 2005, i.e. all persons who were alive and resident in Denmark at least one day from April 2, 1968 to November 4, 2005, or in Greenland from May 1, 1972 to November 4, 2005. RESULTS: A total of 8,176,097 persons were registered. On November 4, 2005, 5,427,687 (66.4%) were alive and resident in Denmark, 56,920 (0.7%) were alive and resident in Greenland, 2,141,373 (26.2%) were dead, 21,160 (0.3%) had disappeared, and 528,957 (6.5%) had emigrated. Among persons born in Denmark in 1960 or later, the CRS contains complete information on maternal identity. Among persons born in Denmark in 1970 or later, the CRS contains complete information on paternal identity. Among women born in Denmark in April 1935 or later, the CRS contains complete information on all their children. Among males born in Denmark in April 1945 or later, the CRS contains complete information on all their children. The CRS contains complete information on: a) immigrations and emigrations from 1971 onwards, b) permanent residence in a Danish municipality from 1971 onwards, c) permanent residence in a municipality in Greenland from May 1972 onwards, and d) full address in Denmark from 1977 onwards. CONCLUSION: Data from the CRS is an important research tool in epidemiological research, which enables Danish researchers to carry out representative population-based studies on e.g. the potential clustering of disease and death in families and the potential association between residence and disease and death.
Over recent years, Ethereum has evolved into a public platform that safeguards the savings of hundreds of millions of people and secures more than $650 billion in assets, placing it among the top 25 stock exchanges worldwide in market capitalization, ahead of Singapore, Mexico, and Thailand. As such, the performance and security of the Ethereum blockchain are not only of theoretical interest, but also carry significant global economic implications. At the time of writing, the Ethereum platform is collectively secured by almost one million validators, highlighting its decentralized nature and underlining its economic security guarantees. However, due to this large validator set, the protocol takes around 15 minutes to finalize a block, which is prohibitively slow for many real-world applications. This delay is largely driven by the cost of aggregating and disseminating signatures across a validator set of this scale. Furthermore, as we show in this paper, the existing protocol that is used to aggregate and disseminate the signatures has several shortcomings that can be exploited by adversaries to shift stake proportion from honest to adversarial nodes. In this paper, we introduce Wo
Tock began 10 years ago as a research operating system developed by academics to help other academics build urban sensing applications. By leveraging a new language (Rust) and new hardware protection mechanisms, Tock enabled Multiprogramming a 64 kB Computer Safely and Efficiently. Today, it is an open-source project with a vibrant community of users and contributors. It is deployed on root-of-trust hardware in data center servers and on millions of laptops; it is used to develop automotive and space products, wearable electronics, and hardware security tokens, all while remaining a platform for operating systems research. This paper focuses on the impact of Tock's technical design on its adoption, the challenges and unexpected benefits of using a type-safe language (Rust), particularly in security-sensitive settings, and the experience of supporting a production open-source operating system from academia.
We report on our tool, Pulse Infinite, that uses proof techniques to show non-termination (divergence) in large programs. Pulse Infinite works compositionally and under-approximately: the former supports scale, and the latter ensures soundness for proving divergence. Prior work focused on small benchmarks in the tens or hundreds of lines of code (LoC), and scale limits their practicality: a single company may have tens of millions, or even hundreds of millions of LoC or more. We report on applying Pulse Infinite to over a hundred million lines of open-source and proprietary software written in C, C++, and Hack, identifying over 30 previously unknown issues, establishing a new state of the art for detecting divergence in real-world codebases.