While imaging-genetics holds great promise for unraveling the complex interplay between brain structure and genetic variation in neurological disorders, traditional methods are limited to simplistic linear models or to black-box techniques that lack interpretability. In this paper, we present NeuroPathX, an explainable deep learning framework that uses an early fusion strategy powered by cross-attention mechanisms to capture meaningful interactions between structural variations in the brain derived from MRI and established biological pathways derived from genetics data. To enhance interpretability and robustness, we introduce two loss functions over the attention matrix - a sparsity loss that focuses on the most salient interactions and a pathway similarity loss that enforces consistent representations across the cohort. We validate NeuroPathX on both autism spectrum disorder and Alzheimer's disease. Our results demonstrate that NeuroPathX outperforms competing baseline approaches and reveals biologically plausible associations linked to the disorder. These findings underscore the potential of NeuroPathX to advance our understanding of complex brain disorders. Code is available at
In this, the first of an anticipated four-paper series, fundamental results of quantitative genetics are presented from a first-principles approach. While none of these results are in any sense new, they are presented in extended detail to precisely distinguish between definition and assumption, with a further emphasis on distinguishing quantities from their usual approximations. Terminology frequently encountered in the field of human genetic disease studies will be defined in terms of its quantitative genetics form. Methods for estimation of both quantitative genetics and the related human genetics quantities will be demonstrated. While practitioners in the field of human quantitative disease studies may find this work pedantic in detail, the principal target audience for this work is trainees reasonably familiar with population genetics theory, but with less experience in its application to human disease studies. We introduce much of this formalism because in later papers in this series, we demonstrate that common areas of confusion in human disease studies can be resolved by appealing directly to these formal definitions. The second paper in this series will discuss polygenic ri
Population genetics lies at the heart of evolutionary theory. This topic forms part of many biological science curricula but is rarely taught to physics students. Since physicists are becoming increasingly interested in biological evolution, we aim to provide a brief introduction to population genetics, written for physicists. We start with two background chapters: chapter 1 provides a brief historical introduction to the topic, while chapter 2 provides some essential biological background. We begin our main content with chapter 3 which discusses the key concepts behind Darwinian natural selection and Mendelian inheritance. Chapter 4 covers the basics of how variation is maintained in populations, while chapter 5 discusses mutation and selection. In chapter 6 we discuss stochastic effects in population genetics using the Wright-Fisher model as our example, and finally we offer concluding thoughts and references to excellent textbooks in chapter 7.
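As an illustration of the stochastic effects the Wright-Fisher model describes, the following is a minimal sketch of neutral genetic drift in a haploid population of fixed size. The function names (`wright_fisher`, `fixation_fraction`) and all parameter values are assumptions chosen for illustration and are not taken from the text; selection and mutation are deliberately left out.

```python
import random


def wright_fisher(pop_size, init_freq, generations, seed=None):
    """Return the allele-frequency trajectory under neutral Wright-Fisher drift.

    Assumes a haploid population of constant size with non-overlapping
    generations, no mutation and no selection (illustrative simplifications).
    """
    rng = random.Random(seed)
    count = round(init_freq * pop_size)
    trajectory = [count / pop_size]
    for _ in range(generations):
        freq = count / pop_size
        # Each of the pop_size offspring picks its parent uniformly at random,
        # so the new allele count is a binomial draw with success prob. freq.
        count = sum(1 for _ in range(pop_size) if rng.random() < freq)
        trajectory.append(count / pop_size)
    return trajectory


def fixation_fraction(pop_size, init_freq, generations, replicates, seed=0):
    """Fraction of independent replicates in which the allele reaches frequency 1."""
    rng = random.Random(seed)
    fixed = 0
    for _ in range(replicates):
        run = wright_fisher(pop_size, init_freq, generations,
                            seed=rng.randrange(2**32))
        if run[-1] == 1.0:
            fixed += 1
    return fixed / replicates


if __name__ == "__main__":
    print(wright_fisher(pop_size=20, init_freq=0.5, generations=10, seed=1))
    print(fixation_fraction(pop_size=20, init_freq=0.5,
                            generations=200, replicates=200))
```

Under these neutral assumptions the probability that an allele eventually fixes equals its starting frequency, so the fraction reported by `fixation_fraction` should lie close to `init_freq` once enough generations have passed for every replicate to lose or fix the allele.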
Imaging genetics is a growing field that employs structural or functional neuroimaging techniques to study individuals with genetic risk variants potentially linked to specific illnesses. This area presents considerable challenges to statisticians due to the heterogeneous information and different data forms it involves. In addition, both imaging and genetic data are typically high-dimensional, creating a "big data squared" problem. Moreover, brain imaging data contains extensive spatial information. Simply vectorizing tensor images and treating voxels as independent features can lead to computational issues and disregard spatial structure. This paper presents a novel statistical method for imaging genetics modeling while addressing all these challenges. We explore a Canonical Correlation Analysis based linear model for the joint modeling of brain imaging, genetic information, and clinical phenotype, enabling the simultaneous detection of significant brain regions and selection of important genetic variants associated with the phenotype outcome. Scalable algorithms are developed to tackle the "big data squared" issue. We apply the proposed method to explore the reaction speed, an i
Genome-wide association studies (GWAS) are used to identify relationships between genetic variations and specific traits. When applied to high-dimensional medical imaging data, a key step is to extract lower-dimensional, yet informative representations of the data as traits. Representation learning for imaging genetics is largely under-explored due to the unique challenges posed by GWAS in comparison to typical visual representation learning. In this study, we tackle this problem from the mutual information (MI) perspective by identifying key limitations of existing methods. We introduce a trans-modal learning framework Genetic InfoMax (GIM), including a regularized MI estimator and a novel genetics-informed transformer to address the specific challenges of GWAS. We evaluate GIM on human brain 3D MRI data and establish standardized evaluation protocols to compare it to existing approaches. Our results demonstrate the effectiveness of GIM and a significantly improved performance on GWAS.
The problem of inferring unknown graph edges from numerical data at a graph's nodes appears in many forms across machine learning. We study a version of this problem that arises in the field of \emph{landscape genetics}, where genetic similarity between organisms living in a heterogeneous landscape is explained by a weighted graph that encodes the ease of dispersal through that landscape. Our main contribution is an efficient algorithm for \emph{inverse landscape genetics}, which is the task of inferring this graph from measurements of genetic similarity at different locations (graph nodes). Inverse landscape genetics is important in discovering impediments to species dispersal that threaten biodiversity and long-term species survival. In particular, it is widely used to study the effects of climate change and human development. Drawing on influential work that models organism dispersal using graph \emph{effective resistances} (McRae 2006), we reduce the inverse landscape genetics problem to that of inferring graph edges from noisy measurements of these resistances, which can be obtained from genetic similarity data. Building on the NeurIPS 2018 work of Hoskins et al. 2018 on learn
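To make the notion of graph effective resistance concrete, here is a small sketch of the forward computation only: given a weighted graph, it returns the effective resistance between two nodes. It is not the inverse algorithm the abstract describes. The function name `effective_resistance` and the convention that edge weights are conductances (higher weight means easier dispersal) are assumptions made for this illustration.

```python
def effective_resistance(n, edges, i, j):
    """Effective resistance between nodes i and j of a connected weighted graph.

    n     : number of nodes, labelled 0..n-1
    edges : list of (u, v, w) with w > 0 interpreted as a conductance
            (an assumption of this sketch)
    """
    if i == j:
        return 0.0
    # Build the weighted graph Laplacian.
    lap = [[0.0] * n for _ in range(n)]
    for u, v, w in edges:
        lap[u][u] += w
        lap[v][v] += w
        lap[u][v] -= w
        lap[v][u] -= w
    # Ground node j (voltage 0) by dropping its row and column, then inject
    # one unit of current at node i. The voltage at i is the resistance.
    keep = [k for k in range(n) if k != j]
    m = len(keep)
    aug = [[lap[r][c] for c in keep] + [1.0 if r == i else 0.0] for r in keep]
    # Gauss-Jordan elimination with partial pivoting on the augmented matrix.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("graph is not connected")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(m):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                for c in range(col, m + 1):
                    aug[r][c] -= factor * aug[col][c]
    voltages = {keep[r]: aug[r][m] / aug[r][r] for r in range(m)}
    return voltages[i]


if __name__ == "__main__":
    path = [(0, 1, 1.0), (1, 2, 1.0)]
    triangle = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0)]
    print(effective_resistance(3, path, 0, 2))
    print(effective_resistance(3, triangle, 0, 1))
```

The dense elimination used here costs cubic time in the number of nodes and is only meant for toy graphs; it behaves like series and parallel resistor rules, so adding an alternative route between two nodes lowers the resistance between them.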
Non-invasive measurements of the human brain using magnetic resonance imaging (MRI) have significantly improved our understanding of the brain's network organization by enabling measurement of anatomical connections between brain regions (structural connectivity) and their coactivation (functional connectivity). Heritability analyses have established that genetics account for considerable intersubject variability in structural and functional connectivity. However, characterizing how genetics shape the relationship between structural and functional connectomes remains challenging, since this association is obscured by unique environmental exposures in observed data. To address this, we develop a regression analysis framework that enables characterization of the relationship between latent genetic contributions to structural and functional connectivity. Implementing the proposed framework requires estimating genetic covariance matrices in multivariate random effects models, which is computationally intractable for high-dimensional connectome data using existing methods. We introduce a constrained method-of-moments estimator that is several orders of magnitude faster than existing method
A common sample descriptor in human genomics studies is that of 'genetic ancestry group', with terms such as 'European genetic ancestry' or 'East Asian genetic ancestry' frequently used in publications to describe the genetics of groups of individuals based on the analysis of their genotypes. In this Perspective, I argue that these terms are imprecise and potentially misleading and that, for most applications, simple statements of genetic similarity represent a more accurate description.
This report documents the development and evaluation of domain-specific language models for neurology. Initially focused on building a bespoke model, the project adapted to rapid advances in open-source and commercial medical LLMs, shifting toward leveraging retrieval-augmented generation (RAG) and representational models for secure, local deployment. Key contributions include the creation of neurology-specific datasets (case reports, QA sets, textbook-derived data), tools for multi-word expression extraction, and graph-based analyses of medical terminology. The project also produced scripts and Docker containers for local hosting. Performance metrics and graph community results are reported, with possible future work on multimodal models using open-source architectures like phi-4.
Adenosine receptors are G-protein-coupled receptors involved in a wide range of physiological and pathological phenomena in most mammalian systems. All four receptors are widely expressed in the central nervous system, where they modulate neurotransmitter release and neuronal plasticity. A large number of gene association studies have shown that common genetic variants of the adenosine receptors (encoded by the ADORA1, ADORA2A, ADORA2B and ADORA3 genes) have a neuroprotective or neurodegenerative role in neurologic/psychiatric diseases. New genetic studies of rare variants and a few novel associations with depression or epilepsy subtypes have recently been reported. Here, we review the literature on the genetics of adenosine receptors in neurologic and/or psychiatric diseases in humans, and discuss perspectives for further genetic research. We also provide an update on the genetic structures of the four human adenosine receptor genes and their regulation - a topic that has not been extensively addressed. Our review emphasizes the importance of (i) better characterizing the genetics of adenosine receptor genes and (ii) understanding how these genes are regulated.
Large language models (LLMs) have shown promise in medical domains, but their ability to handle specialized neurological reasoning requires systematic evaluation. We developed a comprehensive benchmark using 305 questions from Israeli Board Certification Exams in Neurology, classified along three complexity dimensions: factual knowledge depth, clinical concept integration, and reasoning complexity. We evaluated ten LLMs using base models, retrieval-augmented generation (RAG), and a novel multi-agent system. Results showed significant performance variation. OpenAI-o1 achieved the highest base performance (90.9% accuracy), while specialized medical models performed poorly (52.9% for Meditron-70B). RAG provided modest benefits but limited effectiveness on complex reasoning questions. In contrast, our multi-agent framework, decomposing neurological reasoning into specialized cognitive functions including question analysis, knowledge retrieval, answer synthesis, and validation, achieved dramatic improvements, especially for mid-range models. The LLaMA 3.3-70B-based agentic system reached 89.2% accuracy versus 69.5% for its base model, with substantial gains on level 3 complexity questio
Rule-based explanation methods offer rigorous and globally interpretable insights into neural network behavior. However, existing approaches are mostly limited to small fully connected networks and depend on costly layerwise rule extraction and substitution processes. These limitations hinder their generalization to more complex architectures such as Transformers. Moreover, existing methods produce shallow, decision-tree-like rules that fail to capture rich, high-level abstractions in complex domains like computer vision and natural language processing. To address these challenges, we propose NEUROLOGIC, a novel framework that extracts interpretable logical rules directly from deep neural networks. Unlike previous methods, NEUROLOGIC can construct logic rules over hidden predicates derived from neural representations at any chosen layer, in contrast to costly layerwise extraction and rewriting. This flexibility enables broader architectural compatibility and improved scalability. Furthermore, NEUROLOGIC supports richer logical constructs and can incorporate human prior knowledge to ground hidden predicates back to the input space, enhancing interpretability. We validate NEUROLOGIC
Patients with rare neurological diseases report cognitive symptoms ("brain fog") that are invisible to traditional tests. We propose continuous neurocognitive monitoring via smartphone speech analysis integrated with Relational Graph Transformer (RELGT) architectures. A proof-of-concept in phenylketonuria (PKU) shows that speech-derived "Proficiency in Verbal Discourse" correlates with blood phenylalanine (ρ = -0.50, p < 0.005) but not with standard cognitive tests (all |r| < 0.35). RELGT could overcome information bottlenecks in heterogeneous medical data (speech, labs, assessments), enabling predictive alerts weeks before decompensation. Key challenges: multi-disease validation, clinical workflow integration, equitable multilingual deployment. Success would transform episodic neurology into continuous personalized monitoring for millions globally.
It is widely accepted that population genetics theory is the cornerstone of evolutionary analyses. Empirical tests of the theory, however, are challenging because of the complex relationships between space, dispersal, and evolution. Critically, we lack quantitative validation of the spatial models of population genetics. Here we combine analytics, on- and off-lattice simulations, and experiments with bacteria to perform quantitative tests of the theory. We study two bacterial species, the gut microbe Escherichia coli and the opportunistic pathogen Pseudomonas aeruginosa, and show that spatio-genetic patterns in colony biofilms of both species are accurately described by an extension of the one-dimensional stepping-stone model. We use one empirical measure, genetic diversity at the colony periphery, to parameterize our models and show that we can then accurately predict another key variable: the degree of short-range cell migration along an edge. Moreover, the model allows us to estimate other key parameters including effective population size (density) at the expansion frontier. While our experimental system is a simplification of natural microbial communities, we argue it is a proof
Mathematical population genetics is only one of Kingman's many research interests. Nevertheless, his contribution to this field has been crucial, and moved it in several important new directions. Here we outline some aspects of his work which have had a major influence on population genetics theory.
Rare diseases are collectively common, affecting approximately one in twenty individuals worldwide. In recent years, rapid progress has been made in rare disease diagnostics due to advances in DNA sequencing, development of new computational and experimental approaches to prioritize genes and genetic variants, and increased global exchange of clinical and genetic data. However, more than half of individuals suspected to have a rare disease lack a genetic diagnosis. The Genomics Research to Elucidate the Genetics of Rare Diseases (GREGoR) Consortium was initiated to study thousands of challenging rare disease cases and families and apply, standardize, and evaluate emerging genomics technologies and analytics to accelerate their adoption in clinical practice. Further, all data generated, currently representing ~7500 individuals from ~3000 families, is rapidly made available to researchers worldwide via the Genomic Data Science Analysis, Visualization, and Informatics Lab-space (AnVIL) to catalyze global efforts to develop approaches for genetic diagnoses in rare diseases (https://gregorconsortium.org/data). The majority of these families have undergone prior clinical genetic testing
Algebraic properties of the genetic code are analyzed. Investigations of the genetic code on the basis of matrix approaches ("matrix genetics") are described. The degeneracy of the vertebrate mitochondrial genetic code is reflected in the black-and-white mosaic of the (8*8)-matrix of 64 triplets, 20 amino acids and stop-signals. This mosaic genetic matrix is connected with the matrix form of presentation of the special 8-dimensional Yin-Yang-algebra and of its particular 4-dimensional case. A special algorithm, based on features of genetic molecules, transforms the mosaic genomatrix into the matrices of these algebras. Two new numeric systems are defined by these 8-dimensional and 4-dimensional algebras: genetic Yin-Yang-octaves and genetic tetrions. A comparison of these systems with Hamilton's quaternions is presented. Elements of a new "genovector calculation" and ideas of "genetic mechanics" are discussed. These algebras are considered as models of the genetic code and as its possible pre-code basis. They are related to binary oppositions of the Yin-Yang type and they offer new opportunities to investigate the evolution of the genetic code. The revealed fact of the relati
These lecture notes introduce key concepts of mathematical population genetics within the most elementary setting and describe a few recent applications to microbial evolution experiments. Pointers to the literature for further reading are provided, and some of the derivations are left as exercises for the reader.
The set of known dialects of the genetic code (GC) is analyzed from the viewpoint of the genetic octave Yin-Yang-algebra. This algebra was described in the author's previous publications. The algebra was discovered on the basis of structural features of the GC in the matrix form of its presentation ("matrix genetics"). The octave Yin-Yang-algebra is considered as the pre-code or as the model of the GC. From the viewpoint of this algebraic model, for example, the sets of 20 amino acids and of 64 triplets consist of sub-sets of "male", "female" and "androgynous" molecules, etc. This algebra makes it possible to reveal hidden peculiarities of the structure and evolution of the GC and to propose the conception of "sexual" relationships among genetic molecules. The first results of the analysis of the GC systems from this algebraic viewpoint point to a close connection between the evolution of the GC and this algebra. They include 8 evolutionary rules of the dialects of the GC. The evolution of the GC appears as a struggle between male and female beginnings. A hypothesis about a new biophysical factor of "sexual" interactions among genetic molecules is put forward. The matrix forms of presen
Migrations have played an important role in shaping the genetic diversity of human populations. Understanding genomic data thus requires careful modeling of historical gene flow. Here we consider the effect of relatively recent population structure and gene flow, and interpret genomes of individuals that have ancestry from multiple source populations as mosaics of segments originating from each population. We propose general and tractable models for describing the evolution of these patterns of local ancestry and their impact on genetic diversity. We focus on the length distribution of continuous ancestry tracts, and the variance in total ancestry proportions among individuals. The proposed models offer improved agreement with Wright-Fisher simulation data when compared to state-of-the-art models, and can be used to infer various demographic parameters in gene flow models. Considering HapMap African-American (ASW) data, we find that a model with two distinct phases of `European' gene flow significantly improves the modeling of both tract lengths and ancestry variances.
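To give a feel for what a tract length distribution looks like, the sketch below simulates ancestry tracts under a deliberately simplified Markov picture of local ancestry. It is not the model proposed in the abstract. The assumptions are ours: breakpoints fall along an infinitely long chromosome as a Poisson process with a hypothetical rate `breaks_per_morgan`, and each segment between breakpoints independently carries migrant ancestry with probability `migrant_fraction`. Under these assumptions, migrant tract lengths are exponentially distributed with mean `1 / (breaks_per_morgan * (1 - migrant_fraction))`.

```python
import random


def simulate_tract_lengths(migrant_fraction, breaks_per_morgan, n_tracts, seed=0):
    """Simulate lengths (in Morgans) of contiguous migrant-ancestry tracts.

    Simplifying assumptions of this sketch: Poisson breakpoints, independent
    ancestry draws per segment, and no chromosome-end effects.
    """
    rng = random.Random(seed)
    lengths = []
    for _ in range(n_tracts):
        length = 0.0
        while True:
            # Distance to the next breakpoint is exponentially distributed.
            length += rng.expovariate(breaks_per_morgan)
            # The tract ends when the next segment is not of migrant ancestry.
            if rng.random() >= migrant_fraction:
                break
        lengths.append(length)
    return lengths


def expected_mean_length(migrant_fraction, breaks_per_morgan):
    """Mean tract length implied by the simplified model above."""
    return 1.0 / (breaks_per_morgan * (1.0 - migrant_fraction))


if __name__ == "__main__":
    tracts = simulate_tract_lengths(0.2, 10.0, 20000)
    print(sum(tracts) / len(tracts), expected_mean_length(0.2, 10.0))
```

Comparing the simulated mean with `expected_mean_length` is a quick consistency check. Real data depart from this picture through finite chromosome lengths, multiple migration phases, and drift, which is where more general models such as those described above come in.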