Generalist Large Language Models (LLMs), such as GPT-4, have shown considerable promise in various domains, including medical diagnosis. Rare diseases, affecting approximately 300 million people worldwide, often have unsatisfactory clinical diagnosis rates primarily due to a lack of experienced physicians and the complexity of differentiating among many rare diseases. In this context, recent news such as "ChatGPT correctly diagnosed a 4-year-old's rare disease after 17 doctors failed" underscores LLMs' potential, yet underexplored, role in clinically diagnosing rare diseases. To bridge this research gap, we introduce RareBench, a pioneering benchmark designed to systematically evaluate the capabilities of LLMs on 4 critical dimensions within the realm of rare diseases. Meanwhile, we have compiled the largest open-source dataset on rare disease patients, establishing a benchmark for future studies in this domain. To facilitate differential diagnosis of rare diseases, we develop a dynamic few-shot prompt methodology, leveraging a comprehensive rare disease knowledge graph synthesized from multiple knowledge bases, significantly enhancing LLMs' diagnostic performance. Moreover, we pres
Understanding rare events is critical across domains ranging from signal processing to reliability and structural safety, extreme-weather forecasting, and insurance. The analysis of rare events is a computationally challenging problem, particularly in high dimensions $d$. In this work, we develop the first asymptotic high-dimensional theory of rare events. First, we exploit asymptotic integral methods recently developed by the first author to provide an asymptotic expansion of rare event probabilities. The expansion employs the geometry of the rare event boundary and the local behavior of the log probability density. Generically, the expansion is valid if $d^2\llλ$, where $λ$ characterizes the extremity of the event. We prove this condition is necessary by constructing an example in which the first-order remainder is bounded above and below by $d^2/λ$. We also provide a nonasymptotic remainder bound which specifies the precise dependence of the remainder on $d$, $λ$, the density, and the boundary, and which shows that in certain cases, the condition $d^2\ll λ$ can be relaxed. As an application of the theory, we derive asymptotic approximations to rare probabilities under the standa
This study proposes a systematic non-kinetic deterrence path modeling framework based on strategic rare earth supply cut-off, aiming to assess the strategic effects of China's export control policy against the United States at the military system level. The model adopts a four-layer structure of "policy input -- resource node -- equipment system -- capability output" and integrates path dependency modeling, degradation function design, and capability lag prediction mechanisms to form a strategic simulation system. The study incorporates graph neural networks and LSTM-based time series methods to dynamically evaluate the impact of rare earth supply disruption on key U.S. military platforms such as the F-35 fighter, nuclear submarines, and AI combat systems, identifying critical path nodes and strategic timing windows. Results indicate that a ten-year zero-tolerance policy on rare earth exports would lead to a significant technological disconnect between years 3 and 5 and a systemic capability lag between years 8 and 12, with an estimated average annual economic impact of 35 to 40 billion USD. These findings demonstrate that rare earth export cut-offs can serve as a structural strategi
Rare and very rare decays of third-generation particles, including $b$-hadrons and $τ$ leptons, provide sensitive probes of physics beyond the Standard Model (SM). Unlike direct searches limited by collider energies, they probe new physics at much higher energy scales. Many of these decays have SM-predicted branching fractions below the sensitivity of current detectors. These proceedings report on recent LHCb searches, including several first searches and results setting the most stringent limits to date. In particular, searches for $b \to s τ^+τ^-$, $b \to s τ^\pm e^\mp$, $b \to s μ^\pm e^\mp$, and $τ^- \to μ^-μ^+μ^-$ are presented, alongside searches for lepton-number-violating processes and loop-suppressed annihilation decays.
Rare diseases are collectively common, affecting approximately one in twenty individuals worldwide. In recent years, rapid progress has been made in rare disease diagnostics due to advances in DNA sequencing, development of new computational and experimental approaches to prioritize genes and genetic variants, and increased global exchange of clinical and genetic data. However, more than half of individuals suspected to have a rare disease lack a genetic diagnosis. The Genomics Research to Elucidate the Genetics of Rare Diseases (GREGoR) Consortium was initiated to study thousands of challenging rare disease cases and families and apply, standardize, and evaluate emerging genomics technologies and analytics to accelerate their adoption in clinical practice. Further, all data generated, currently representing ~7500 individuals from ~3000 families, is rapidly made available to researchers worldwide via the Genomic Data Science Analysis, Visualization, and Informatics Lab-space (AnVIL) to catalyze global efforts to develop approaches for genetic diagnoses in rare diseases (https://gregorconsortium.org/data). The majority of these families have undergone prior clinical genetic testing
Domain-specific intelligence demands specialized knowledge and sophisticated reasoning for problem-solving, posing significant challenges for large language models (LLMs) that struggle with knowledge hallucination and inadequate reasoning capabilities under constrained parameter budgets. Inspired by Bloom's Taxonomy in educational theory, we propose Retrieval-Augmented Reasoning Modeling (RARE), a novel paradigm that decouples knowledge storage from reasoning optimization. RARE externalizes domain knowledge to retrievable sources and internalizes domain-specific reasoning patterns during training. Specifically, by injecting retrieved knowledge into training prompts with masked losses, RARE transforms learning objectives from rote memorization to contextualized reasoning. It enables models to bypass parameter-intensive memorization and prioritize the development of higher-order cognitive processes. Extensive experiments demonstrate that lightweight RARE-trained models (e.g., Llama-3.1-8B) can achieve state-of-the-art performance, surpassing retrieval-augmented GPT-4 and DeepSeek-R1 by up to approximately 20\% in accuracy. RARE establishes a paradigm shift where maintainable external kno
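The abstract above mentions injecting retrieved knowledge into training prompts "with masked losses" but does not spell out the masking. The following is a minimal sketch of one plausible reading, assuming the loss is averaged only over reasoning/answer tokens while retrieved-knowledge and prompt tokens are masked out. The function name `masked_cross_entropy` and the mask convention are hypothetical, not taken from the paper.

```python
import numpy as np

def masked_cross_entropy(logits, targets, loss_mask):
    """Mean token-level cross-entropy over positions where loss_mask == 1.

    logits:    (T, V) unnormalized scores per position.
    targets:   (T,) target token ids.
    loss_mask: (T,) 1 for tokens that contribute to the loss (assumed here:
               reasoning/answer tokens), 0 for tokens that are only context
               (assumed here: retrieved knowledge and prompt tokens).
    """
    logits = np.asarray(logits, dtype=float)
    targets = np.asarray(targets)
    loss_mask = np.asarray(loss_mask, dtype=float)

    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

    # Negative log-likelihood of the target token at each position.
    token_nll = -log_probs[np.arange(len(targets)), targets]

    kept = loss_mask.sum()
    if kept == 0:
        return 0.0  # nothing to learn from in this sequence
    return float((token_nll * loss_mask).sum() / kept)
```

Under this convention, the model still conditions on the retrieved passage but is never penalized for failing to reproduce it, which is one way to shift the objective away from memorization.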
State-of-the-art text-to-image (T2I) diffusion models often struggle to generate rare compositions of concepts, e.g., objects with unusual attributes. In this paper, we show that the compositional generation power of diffusion models on such rare concepts can be significantly enhanced by Large Language Model (LLM) guidance. We start with empirical and theoretical analysis, demonstrating that exposing frequent concepts relevant to the target rare concepts during the diffusion sampling process yields more accurate concept composition. Based on this, we propose a training-free approach, R2F, that plans and executes the overall rare-to-frequent concept guidance throughout the diffusion inference by leveraging the abundant semantic knowledge in LLMs. Our framework is flexible across any pre-trained diffusion models and LLMs, and can be seamlessly integrated with the region-guided diffusion approaches. In extensive experiments on three datasets, including our newly proposed benchmark, RareBench, which contains various prompts with rare compositions of concepts, R2F significantly surpasses existing models including SD3.0 and FLUX by up to 28.1%p in T2I alignment. Code is available at https:/
By adopting a hadron-structure-oriented approach, we present and discuss the release of the novel OMG3Q1.0 set of collinear fragmentation functions for fully charmed, rare $Ω$ baryons. Our methodology combines diquark-like proxy model inputs for both charm-quark and gluon channels, calculated at the initial energy scales, with a DGLAP evolution that ensures a consistent treatment of heavy-quark thresholds, following directly from the HF-NRevo scheme. We complement our work with a phenomenological study of NLL/NLO$^+$ resummed $Ω_{3c}$ plus jet distributions using (sym)JETHAD at the HL-LHC and the future FCC. Unraveling the production mechanisms of rare, yet-unobserved hadrons, as provided by the OMG3Q1.0 functions, stands as a key asset for deepening our understanding of QCD at future high-energy hadron colliders.
Rare event prediction involves identifying and forecasting events with a low probability using machine learning (ML) and data analysis. Due to imbalanced data distributions, where the frequency of common events vastly outweighs that of rare events, it requires using specialized methods within each step of the ML pipeline, i.e., from data processing to algorithms to evaluation protocols. Predicting the occurrences of rare events is important for real-world applications, such as Industry 4.0, and is an active research area in statistics and ML. This paper comprehensively reviews the current approaches for rare event prediction along four dimensions: rare event data, data processing, algorithmic approaches, and evaluation approaches. Specifically, we consider 73 datasets from different modalities (i.e., numerical, image, text, and audio), four major categories of data processing, five major algorithmic groupings, and two broader evaluation approaches. This paper aims to identify gaps in the current literature and highlight the challenges of predicting rare events. It also suggests potential research directions, which can help guide practitioners and researchers.
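To illustrate why imbalanced data calls for specialized evaluation protocols, the sketch below compares accuracy with precision, recall, and F1 on a binary problem where label 1 is the rare event. It is a generic illustration, not a method from the survey; the function name `rare_event_metrics` is hypothetical.

```python
def rare_event_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1, treating label 1 as the rare event."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    # Guard against empty denominators when no positives are predicted/present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# A classifier that never predicts the rare event on 2 positives out of 100:
y_true = [1] * 2 + [0] * 98
always_common = [0] * 100
metrics = rare_event_metrics(y_true, always_common)
```

In this example the trivial classifier reaches 98% accuracy while its recall and F1 on the rare class are both zero, which is why accuracy alone is a poor yardstick for rare event prediction.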
This study constructs a quantifiable modelling framework to simulate non-kinetic strategic deterrence pathways in rare earth supply disruption scenarios, based on structured responses from expert interviews led by Dr. Daniel O'Connor, CEO of the Rare Earth Exchange (REE). Focusing on disruption impacts on national security systems, the study proposes four core modelling components: Security Critical Zones (SCZ), Strategic Signal Injection Function (SSIF), System-Capability Migration Function (SCIF), and Policy-Capability Transfer Function (PCTF). The framework integrates parametric ODEs, segmented function modelling, path-overlapping covariance matrices, and LSTM networks to simulate nonlinear suppression trajectories triggered by regime signals. Data is derived from expert interviews and scenario analyses centered on U.S.-China dynamics in ISR, electronic warfare, and rare earth control. Results show institutional signals have strong tempo and path-coupling effects, capable of causing rapid degradation of strategic capabilities. The model is adaptable across national resource frameworks and extendable to AI sandbox engines for situational simulation and counterfactual reasoning. T
Increasing design complexity and reduced time-to-market have motivated manufacturers to outsource some parts of the System-on-Chip (SoC) design flow to third-party vendors. This provides an opportunity for attackers to introduce hardware Trojans by constructing stealthy triggers consisting of rare events (e.g., rare signals, states, and transitions). There are promising test generation-based hardware Trojan detection techniques that rely on the activation of rare events. In this paper, we investigate rareness reduction as a design-for-trust solution that makes it harder for an adversary to hide Trojans and, in turn, easier to detect them. Specifically, we analyze different avenues to reduce the potential rare trigger cases, including design diversity and area optimization. While there is a good understanding of the relationship between area, power, energy, and performance, this research provides a better insight into the dependency between area and security. Our experimental evaluation demonstrates that area reduction leads to a reduction in rareness. It also reveals that reducing rareness leads to faster Trojan detection as well as improved coverage by Trojan detection methods.
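The abstract does not say how rareness is measured. A common way to think about it is the probability that an internal signal takes a given value under uniformly random inputs. The sketch below assumes that definition and a toy netlist format (both hypothetical) and enumerates all input patterns of a small combinational circuit to list signal values whose probability falls below a threshold.

```python
from itertools import product

# Gate evaluators for the toy netlist format used in this sketch.
GATES = {
    "AND": lambda vals: int(all(vals)),
    "OR": lambda vals: int(any(vals)),
    "NOT": lambda vals: 1 - vals[0],
}

def rare_events(inputs, netlist, threshold):
    """Return (signal, value, probability) triples with probability < threshold.

    inputs:  list of primary input names.
    netlist: dict mapping signal name -> (gate type, list of fan-in names),
             listed in topological order (fan-ins defined before use).
    """
    patterns = list(product([0, 1], repeat=len(inputs)))
    ones = {name: 0 for name in netlist}

    # Exhaustively simulate every input pattern (feasible only for tiny circuits).
    for pattern in patterns:
        values = dict(zip(inputs, pattern))
        for name, (gate, fanin) in netlist.items():
            values[name] = GATES[gate]([values[s] for s in fanin])
            ones[name] += values[name]

    rare = []
    for name, count in ones.items():
        p_one = count / len(patterns)
        if p_one < threshold:
            rare.append((name, 1, p_one))
        if 1 - p_one < threshold:
            rare.append((name, 0, 1 - p_one))
    return rare


netlist = {
    "n1": ("AND", ["a", "b"]),
    "n2": ("AND", ["c", "d"]),
    "n3": ("AND", ["n1", "n2"]),
    "n4": ("OR", ["a", "b"]),
}
candidates = rare_events(["a", "b", "c", "d"], netlist, threshold=0.1)
```

For this toy circuit the only candidate is `n3 = 1`, which occurs for one of the sixteen input patterns (probability 0.0625). Real designs are far too large for exhaustive enumeration, so random simulation would typically replace the full product over inputs.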
Language models learn rare syntactic phenomena, but the extent to which this is attributable to generalization vs. memorization is a major open question. To that end, we iteratively trained transformer language models on systematically manipulated corpora which were human-scale in size, and then evaluated their learning of a rare grammatical phenomenon: the English Article+Adjective+Numeral+Noun (AANN) construction (``a beautiful five days''). We compared how well this construction was learned on the default corpus relative to a counterfactual corpus in which AANN sentences were removed. We found that AANNs were still learned better than systematically perturbed variants of the construction. Using additional counterfactual corpora, we suggest that this learning occurs through generalization from related constructions (e.g., ``a few days''). An additional experiment showed that this learning is enhanced when there is more variability in the input. Taken together, our results provide an existence proof that LMs can learn rare grammatical phenomena by generalization from less rare phenomena. Data and code: https://github.com/kanishkamisra/aannalysis.
Rare event simulation and rare event probability estimation are important tasks within the analysis of systems subject to uncertainty and randomness. At the same time, accurately estimating rare event probabilities is an inherently difficult task that calls for dedicated tools and methods. One way to improve estimation efficiency on difficult rare event estimation problems is to leverage gradients of the computational model representing the system under consideration, e.g., to explore the rare event faster and more reliably. We present a novel approach for estimating rare event probabilities using such model gradients by drawing on a technique to generate samples from non-normalized posterior distributions in Bayesian inference: Stein variational gradient descent. We propagate samples generated from a tractable input distribution towards a near-optimal rare event importance sampling distribution by exploiting a similarity of the latter with Bayesian posterior distributions. Sample propagation takes the shape of passing samples through a sequence of invertible transforms such that their densities can be tracked and used to construct an unbiased importance sampling estimate of the ra
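For readers unfamiliar with importance sampling, the sketch below estimates the tail probability P(X > c) for a standard normal X using a fixed Gaussian proposal centered at the threshold. This is plain importance sampling with a hand-picked proposal, not the Stein variational gradient descent scheme described in the abstract; the function names and the choice of proposal are assumptions made for illustration.

```python
import math
import numpy as np

def importance_sampling_tail_prob(threshold, n_samples, seed=0):
    """Estimate P(X > threshold) for X ~ N(0, 1) by importance sampling.

    Samples are drawn from the proposal N(threshold, 1), so roughly half of
    them land in the rare event region; each is reweighted by the density
    ratio between the nominal N(0, 1) and the proposal.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(loc=threshold, scale=1.0, size=n_samples)

    # log of phi(x) / phi(x - threshold); the normalizing constants cancel.
    log_weights = -0.5 * x**2 + 0.5 * (x - threshold) ** 2
    indicator = x > threshold
    return float(np.mean(indicator * np.exp(log_weights)))

def exact_tail_prob(threshold):
    """Closed-form P(X > threshold) for X ~ N(0, 1), used as a reference."""
    return 0.5 * math.erfc(threshold / math.sqrt(2.0))


estimate = importance_sampling_tail_prob(4.0, 100_000)
reference = exact_tail_prob(4.0)
```

Crude Monte Carlo with the same budget would see only a handful of samples beyond the threshold, since the reference probability is of order 1e-5, whereas the shifted proposal places about half of its samples in the rare event region.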
We study the problem of learning generative adversarial networks (GANs) for a rare class of an unlabeled dataset subject to a labeling budget. This problem is motivated by practical applications in domains including security (e.g., synthesizing packets for DNS amplification attacks), systems and networking (e.g., synthesizing workloads that trigger high resource usage), and machine learning (e.g., generating images from a rare class). Existing approaches are unsuitable, either requiring fully-labeled datasets or sacrificing the fidelity of the rare class for that of the common classes. We propose RareGAN, a novel synthesis of three key ideas: (1) extending conditional GANs to use labeled and unlabeled data for better generalization; (2) an active learning approach that requests the most useful labels; and (3) a weighted loss function to favor learning the rare class. We show that RareGAN achieves a better fidelity-diversity tradeoff on the rare class than prior work across different applications, budgets, rare class fractions, GAN losses, and architectures.
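The abstract does not give the form of the weighted loss in idea (3). As a rough illustration of the general principle only, the sketch below shows a class-weighted binary cross-entropy in which samples from the rare class (label 1) count more heavily. It is a generic classifier loss, not RareGAN's GAN objective, and the name `weighted_bce` and the `rare_weight` parameter are hypothetical.

```python
import numpy as np

def weighted_bce(probs, labels, rare_weight):
    """Class-weighted binary cross-entropy.

    probs:       predicted probability of the rare class for each sample.
    labels:      1 for rare-class samples, 0 for common-class samples.
    rare_weight: multiplier applied to rare-class samples (> 1 favors them).
    """
    probs = np.clip(np.asarray(probs, dtype=float), 1e-12, 1 - 1e-12)
    labels = np.asarray(labels, dtype=float)

    # Per-sample weights: rare_weight for the rare class, 1 otherwise.
    weights = np.where(labels == 1, rare_weight, 1.0)
    losses = -(labels * np.log(probs) + (1 - labels) * np.log(1 - probs))

    # Normalize by total weight so the loss scale stays comparable.
    return float((weights * losses).sum() / weights.sum())
```

With `rare_weight` greater than 1, a misclassified rare sample contributes more to the average than a misclassified common sample, which pushes the model toward fitting the rare class.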
It is common in modern prediction problems for many predictor variables to be counts of rarely occurring events. This leads to design matrices in which many columns are highly sparse. The challenge posed by such "rare features" has received little attention despite its prevalence in diverse areas, ranging from natural language processing (e.g., rare words) to biology (e.g., rare species). We show, both theoretically and empirically, that not explicitly accounting for the rareness of features can greatly reduce the effectiveness of an analysis. We next propose a framework for aggregating rare features into denser features in a flexible manner that creates better predictors of the response. Our strategy leverages side information in the form of a tree that encodes feature similarity. We apply our method to data from TripAdvisor, in which we predict the numerical rating of a hotel based on the text of the associated review. Our method achieves high accuracy by making effective use of rare words; by contrast, the lasso is unable to identify highly predictive words if they are too rare. A companion R package, called rare, implements our new estimator, using the alternating direction met
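The idea of aggregating rare features into denser ones can be sketched as summing sparse count columns that share an ancestor in the similarity tree. In the sketch below the groups are fixed by hand, whereas the abstract describes a flexible, data-driven aggregation; the function name `aggregate_features` and the grouping format are hypothetical.

```python
import numpy as np

def aggregate_features(X, groups):
    """Sum the count columns in each group into one denser feature.

    X:      (n_samples, n_features) matrix of counts.
    groups: list of lists of column indices; each inner list holds the leaves
            under one tree node chosen (here by hand) for aggregation.
    """
    X = np.asarray(X)
    return np.column_stack([X[:, cols].sum(axis=1) for cols in groups])


# Columns 0-2 stand for three rare near-synonyms, column 3 for a common word.
X = np.array([
    [1, 0, 0, 2],
    [0, 1, 0, 0],
    [0, 0, 1, 1],
])
X_agg = aggregate_features(X, groups=[[0, 1, 2], [3]])
```

Each of the three rare columns is nonzero in only one row, but their sum is nonzero in every row, giving a regression on the aggregated feature much more signal to work with.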
The rare and forbidden processes within the Standard Model offer an opportunity to explore potential new physics beyond the SM. We summarize the research method and the recent results of rare charm decays at BESIII based on the extensive data samples in the $τ-c$ energy region, many of which impose stringent constraints on the new physics.
In the quest to overcome language barriers, encoder-decoder models like NLLB have expanded machine translation to rare languages, with some models (e.g., NLLB 1.3B) even trainable on a single GPU. While general-purpose LLMs perform well in translation, open LLMs prove highly competitive when fine-tuned for specific tasks involving unknown corpora. We introduce LYRA (Language verY Rare for All), a novel approach that combines open LLM fine-tuning, retrieval-augmented generation (RAG), and transfer learning from related high-resource languages. This study is exclusively focused on single-GPU training to facilitate ease of adoption. Our study focuses on two-way translation between French and Monégasque, a rare language unsupported by existing translation tools due to limited corpus availability. Our results demonstrate LYRA's effectiveness, frequently surpassing and consistently matching state-of-the-art encoder-decoder models in rare language translation.
This is the Snowmass 2021 Rare and Precision Frontier Report. The Rare Processes and Precision Measurements Frontier, referred to as the ``Rare and Precision Frontier'', or RPF, encompasses searches for extremely rare processes or tiny deviations from the Standard Model (SM) that can be studied with intense sources and high-precision detectors. Our community studies have identified several unique research opportunities that may pin down the scales associated with New Physics (NP) interactions and constrain the couplings of possible new degrees of freedom. Searches for rare flavor transition processes and precision measurements are indispensable probes of flavor and fundamental symmetries, and provide insights into physics that manifests itself at higher energy or through weaker interactions than those directly accessible at high-energy colliders.
We report the synthesis and magnetic characterization of stuffed rare earth gallium garnets, RE$_{3+x}$Ga$_{5-x}$O$_{12}$ (RE=Lu, Yb, Er, Dy, Gd), for x up to 0.5. The excess rare earth ions partly fill the octahedral sites normally fully occupied by Ga$^{3+}$, forming disordered pairs of corner-shared face-sharing magnetic tetrahedra. The Curie-Weiss constants and observed effective moments per rare earth are smaller than are seen for the unstuffed gallium garnets. No significant change in the field-dependent magnetization is observed but missing entropy is seen when integrating the low-temperature heat capacity to 0.5 K.
The investigation of rare phenomena requires an effective suppression of all the background components entangling the expected signal. This has compelled the development of a wide range of low radioactivity techniques and background mitigation strategies. Some examples of those applied to Large Time Projection Chambers (TPCs) will be discussed here, including the operation of experiments deep underground, the exhaustive control of material radiopurity and the implementation of discrimination techniques.