Computer simulations of complex population genetic models are an essential tool for making sense of the large-scale datasets of multiple genome sequences from a single species that are becoming increasingly available. A widely used approach for reducing computing time is to simulate populations that are much smaller than the natural populations that they are intended to represent, by using parameters such as selection coefficients and mutation rates whose products with the population size correspond to those of the natural populations. This approach has come to be known as rescaling, and is justified by the theory of the genetics of finite populations. Recently, however, there have been criticisms of this practice, which have brought to light situations in which it can lead to erroneous conclusions. This paper reviews the theoretical basis for rescaling, and relates it to current practice in population genetics simulations. It shows that some population genetic statistics are scaleable while others are not. Additionally, it shows that there are likely to be problems with rescaling when simulating large chromosomal regions, due to the non-linear relation between the physical distanc
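The rescaling idea described in the abstract above can be illustrated with a minimal sketch. It assumes only what the abstract states: the population size is reduced while the mutation rate and selection coefficient are increased so that their products with the population size are unchanged. The function name `rescale`, the parameter names, and the numerical values are hypothetical and chosen for illustration only.

```python
def rescale(n, mu, s, factor):
    """Shrink the population size by `factor`, keeping n*mu and n*s constant.

    n      -- population size of the natural population
    mu     -- mutation rate per site per generation
    s      -- selection coefficient
    factor -- how many times smaller the simulated population should be
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    return {
        "n": n / factor,      # smaller simulated population
        "mu": mu * factor,    # larger mutation rate, so n*mu is unchanged
        "s": s * factor,      # larger selection coefficient, so n*s is unchanged
    }


# Hypothetical parameter values, for illustration only.
natural = {"n": 1_000_000, "mu": 1e-8, "s": 1e-5}
scaled = rescale(natural["n"], natural["mu"], natural["s"], factor=100)
```

Note that a large `factor` pushes the rescaled `mu` and `s` toward values that are no longer small, which is one simple way to see why rescaling cannot be applied without limit.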
Population genetics lies at the heart of evolutionary theory. This topic forms part of many biological science curricula but is rarely taught to physics students. Since physicists are becoming increasingly interested in biological evolution, we aim to provide a brief introduction to population genetics, written for physicists. We start with two background chapters: chapter 1 provides a brief historical introduction to the topic, while chapter 2 provides some essential biological background. We begin our main content with chapter 3 which discusses the key concepts behind Darwinian natural selection and Mendelian inheritance. Chapter 4 covers the basics of how variation is maintained in populations, while chapter 5 discusses mutation and selection. In chapter 6 we discuss stochastic effects in population genetics using the Wright-Fisher model as our example, and finally we offer concluding thoughts and references to excellent textbooks in chapter 7.
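For readers who have not met the Wright-Fisher model mentioned above, the following is a minimal sketch of neutral genetic drift at a single locus with two alleles. It is not taken from the work being described. The function name `wright_fisher`, the use of `n` as the number of gene copies, and the seeded generator are all assumptions made for illustration.

```python
import random


def wright_fisher(n, p0, generations, seed=0):
    """Simulate neutral drift and return the allele frequency per generation.

    n           -- number of gene copies in the population (assumed constant)
    p0          -- initial frequency of the focal allele
    generations -- number of generations to simulate
    seed        -- seed for the random number generator (for reproducibility)
    """
    rng = random.Random(seed)
    counts = [round(n * p0)]
    for _ in range(generations):
        p = counts[-1] / n
        # Each of the n copies in the next generation is drawn independently
        # from the current generation (binomial sampling).
        counts.append(sum(rng.random() < p for _ in range(n)))
    return [c / n for c in counts]


trajectory = wright_fisher(n=100, p0=0.5, generations=20, seed=1)
```

Frequencies 0 and 1 are absorbing in this sketch: once the allele is lost or fixed, it stays that way, because no mutation is included.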
In this, the first of an anticipated four-paper series, fundamental results of quantitative genetics are presented from a first-principles approach. While none of these results is in any sense new, they are presented in extended detail to distinguish precisely between definition and assumption, with a further emphasis on distinguishing quantities from their usual approximations. Terms frequently encountered in the field of human genetic disease studies will be defined in terms of their quantitative genetics form. Methods for estimating both quantitative genetics and the related human genetics quantities will be demonstrated. While practitioners in the field of human quantitative disease studies may find this work pedantic in detail, the principal target audience for this work is trainees reasonably familiar with population genetics theory, but with less experience in its application to human disease studies. We introduce much of this formalism because in later papers in this series, we demonstrate that common areas of confusion in human disease studies can be resolved by appealing directly to these formal definitions. The second paper in this series will discuss polygenic ri
Many commonly studied species now have more than one chromosome-scale genome assembly, revealing a large amount of genetic diversity previously missed by approaches that map short reads to a single reference. However, many species still lack multiple reference genomes and correctly aligning references to build pangenomes is challenging, limiting our ability to study this missing genomic variation in population genetics. Here, we argue that $k$-mers are a crucial stepping stone to bridging the reference-focused paradigms of population genetics with the reference-free paradigms of pangenomics. We review current literature on the uses of $k$-mers for performing three core components of most population genetics analyses: identifying, measuring, and explaining patterns of genetic variation. We also demonstrate how different $k$-mer-based measures of genetic variation behave in population genetic simulations according to the choice of $k$, depth of sequencing coverage, and degree of data compression. Overall, we find that $k$-mer-based measures of genetic diversity scale consistently with pairwise nucleotide diversity ($π$) up to values of about $π= 0.025$ ($R^2 = 0.97$) for neutrally ev
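To make the notion of a $k$-mer-based measure of variation concrete, here is a minimal sketch that extracts the set of $k$-mers from two sequences and compares them with a Jaccard dissimilarity. The Jaccard dissimilarity is one common choice, used here as an assumption; the abstract does not say which measures the authors evaluate. The function names and the toy sequences are hypothetical.

```python
def kmers(seq, k):
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_dissimilarity(seq_a, seq_b, k):
    """1 - |A & B| / |A | B| for the k-mer sets A and B of two sequences."""
    set_a, set_b = kmers(seq_a, k), kmers(seq_b, k)
    union = set_a | set_b
    if not union:
        # Both sequences are shorter than k, so there is nothing to compare.
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)


# Toy sequences that differ by a single substitution at the last position.
d = jaccard_dissimilarity("ACGT", "ACGA", k=2)
```

A single substitution can change up to $k$ overlapping $k$-mers, so the choice of $k$ directly affects how such a measure responds to nucleotide differences.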
The approval success rate of drug candidates is very low, with the majority of failures due to safety and efficacy. Increasingly available high-dimensional information on targets, drug molecules and indications provides an opportunity for ML methods to integrate multiple data modalities and better predict clinically promising drug targets. Notably, drug targets with human genetics evidence are shown to have better odds of success. However, a recent tensor factorization-based approach found that additional information on targets and indications might not necessarily improve the predictive accuracy. Here we revisit this approach by integrating different types of human genetics evidence collated from publicly available sources to support each target-indication pair. We use Bayesian tensor factorization to show that models incorporating all available human genetics evidence (rare disease, gene burden, common disease) modestly improve clinical outcome prediction over models using a single line of genetics evidence. We provide additional insight into the relative predictive power of different types of human genetics evidence for predicting the success of clinical outcomes.
Genome-wide association studies (GWAS) are used to identify relationships between genetic variations and specific traits. When applied to high-dimensional medical imaging data, a key step is to extract lower-dimensional, yet informative representations of the data as traits. Representation learning for imaging genetics is largely under-explored due to the unique challenges posed by GWAS in comparison to typical visual representation learning. In this study, we tackle this problem from the mutual information (MI) perspective by identifying key limitations of existing methods. We introduce a trans-modal learning framework Genetic InfoMax (GIM), including a regularized MI estimator and a novel genetics-informed transformer to address the specific challenges of GWAS. We evaluate GIM on human brain 3D MRI data and establish standardized evaluation protocols to compare it to existing approaches. Our results demonstrate the effectiveness of GIM and a significantly improved performance on GWAS.
Imaging genetics is a growing field that employs structural or functional neuroimaging techniques to study individuals with genetic risk variants potentially linked to specific illnesses. This area presents considerable challenges to statisticians due to the heterogeneous information and different data forms it involves. In addition, both imaging and genetic data are typically high-dimensional, creating a "big data squared" problem. Moreover, brain imaging data contains extensive spatial information. Simply vectorizing tensor images and treating voxels as independent features can lead to computational issues and disregard spatial structure. This paper presents a novel statistical method for imaging genetics modeling while addressing all these challenges. We explore a Canonical Correlation Analysis based linear model for the joint modeling of brain imaging, genetic information, and clinical phenotype, enabling the simultaneous detection of significant brain regions and selection of important genetic variants associated with the phenotype outcome. Scalable algorithms are developed to tackle the "big data squared" issue. We apply the proposed method to explore the reaction speed, an i
The problem of inferring unknown graph edges from numerical data at a graph's nodes appears in many forms across machine learning. We study a version of this problem that arises in the field of \emph{landscape genetics}, where genetic similarity between organisms living in a heterogeneous landscape is explained by a weighted graph that encodes the ease of dispersal through that landscape. Our main contribution is an efficient algorithm for \emph{inverse landscape genetics}, which is the task of inferring this graph from measurements of genetic similarity at different locations (graph nodes). Inverse landscape genetics is important in discovering impediments to species dispersal that threaten biodiversity and long-term species survival. In particular, it is widely used to study the effects of climate change and human development. Drawing on influential work that models organism dispersal using graph \emph{effective resistances} (McRae 2006), we reduce the inverse landscape genetics problem to that of inferring graph edges from noisy measurements of these resistances, which can be obtained from genetic similarity data. Building on the NeurIPS 2018 work of Hoskins et al. 2018 on learn
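The effective resistance that the abstract above builds on can be computed for a small graph as follows. This is a sketch of the forward computation only (graph to resistance), not of the inverse algorithm the authors propose. It assumes that edge weights act as conductances (higher weight means easier dispersal) and uses the standard approach of grounding one node and solving the reduced Laplacian system. The function name and the example graphs are hypothetical.

```python
def effective_resistance(n, edges, s, t):
    """Effective resistance between nodes s and t of a weighted graph.

    n     -- number of nodes, labelled 0..n-1
    edges -- list of (u, v, weight); weight is treated as a conductance
    """
    if s == t:
        return 0.0
    # Build the graph Laplacian.
    lap = [[0.0] * n for _ in range(n)]
    for u, v, w in edges:
        lap[u][u] += w
        lap[v][v] += w
        lap[u][v] -= w
        lap[v][u] -= w
    # Ground node t: drop its row and column, inject unit current at s.
    keep = [i for i in range(n) if i != t]
    m = len(keep)
    a = [[lap[i][j] for j in keep] + [1.0 if i == s else 0.0] for i in keep]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("graph is disconnected")
        a[col], a[pivot] = a[pivot], a[col]
        for row in range(m):
            if row != col:
                f = a[row][col] / a[col][col]
                for j in range(col, m + 1):
                    a[row][j] -= f * a[col][j]
    idx = keep.index(s)
    # The potential at s (with t at zero) equals the effective resistance.
    return a[idx][m] / a[idx][idx]


# Path 0 - 1 - 2 with unit conductances: two unit resistors in series.
r_path = effective_resistance(3, [(0, 1, 1.0), (1, 2, 1.0)], 0, 2)
```

For the path graph the result matches the series rule (resistances add), and for a triangle with unit conductances it matches the parallel rule, which gives a quick sanity check on any implementation.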
Many forensic genetics problems can be handled using structured systems of discrete variables, for which Bayesian networks offer an appealing practical modeling framework and allow inferences to be computed by probability propagation methods. However, when standard assumptions are violated--for example, when allele frequencies are unknown, there is identity by descent or the population is heterogeneous--dependence is generated among founding genes, which makes exact calculation of conditional probabilities by propagation methods less straightforward. Here we illustrate different methodologies for assessing sensitivity to assumptions about founders in forensic genetics problems. These include constrained steepest descent, linear fractional programming and representing dependence by structure. We illustrate these methods on several forensic genetics examples involving criminal identification, simple and complex disputed paternity and DNA mixtures.
In the context of population genetics, active information can be extended to measure the change of information of a given event (e.g., fixation of an allele) from a neutral model in which only genetic drift is taken into account to a non-neutral model that includes other sources of frequency variation (e.g., selection and mutation). In this paper we illustrate active information in population genetics through the Wright-Fisher model.
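As a rough illustration of the comparison described above, the sketch below contrasts the fixation probability of an allele under pure drift with its fixation probability under selection. Several things here are assumptions not stated in the abstract: active information is taken to be the base-2 logarithm of the ratio of the two probabilities, the fixation probability under selection uses Kimura's diffusion approximation for a haploid population of `n` gene copies, and all function names are hypothetical.

```python
import math


def fixation_probability(n, p0, s):
    """Approximate fixation probability of an allele at initial frequency p0.

    n  -- number of gene copies (haploid population size, an assumption)
    s  -- selection coefficient; s == 0 is the neutral model
    Uses Kimura's diffusion approximation, not an exact Wright-Fisher result.
    """
    if s == 0:
        # Under pure drift the fixation probability is the initial frequency.
        return p0
    # (1 - exp(-2*n*s*p0)) / (1 - exp(-2*n*s)), written with expm1 for accuracy.
    return math.expm1(-2.0 * n * s * p0) / math.expm1(-2.0 * n * s)


def active_information(n, p0, s):
    """log2 of P(fixation | selection) over P(fixation | neutral), in bits."""
    neutral = fixation_probability(n, p0, 0.0)
    return math.log2(fixation_probability(n, p0, s) / neutral)


bits_favoured = active_information(n=100, p0=0.1, s=0.05)
bits_disfavoured = active_information(n=100, p0=0.1, s=-0.05)
```

Under these assumptions the quantity is positive when selection makes fixation more likely than drift alone, negative when it makes fixation less likely, and zero for the neutral model itself.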
Using brain imaging quantitative traits (QTs) to identify genetic risk factors is an important research topic in imaging genetics. Many efforts have been made to build linear models, e.g. linear regression (LR), to extract the association between imaging QTs and genetic factors such as single nucleotide polymorphisms (SNPs). However, to the best of our knowledge, these linear models cannot fully uncover the complicated relationship, owing to the loci's elusive and diverse impacts on imaging QTs. Although deep learning models can extract the nonlinear relationship, they cannot select relevant genetic factors. In this paper, we propose a novel multi-task deep feature selection (MTDFS) method for brain imaging genetics. MTDFS first adds a multi-task one-to-one layer and imposes a hybrid sparsity-inducing penalty to select relevant SNPs making significant contributions to abnormal imaging QTs. It then builds a multi-task deep neural network to model the complicated associations between imaging QTs and SNPs. MTDFS not only extracts the nonlinear relationship but also arms the deep neural network with the feature selection capability. We compared MTDFS to both LR and single-
A common sample descriptor in human genomics studies is that of 'genetic ancestry group', with terms such as 'European genetic ancestry' or 'East Asian genetic ancestry' frequently used in publications to describe the genetics of groups of individuals based on the analysis of their genotypes. In this Perspective, I argue that these terms are imprecise and potentially misleading and that, for most applications, simple statements of genetic similarity represent a more accurate description.
Mathematical population genetics is only one of Kingman's many research interests. Nevertheless, his contribution to this field has been crucial, and moved it in several important new directions. Here we outline some aspects of his work which have had a major influence on population genetics theory.
It is widely accepted that population genetics theory is the cornerstone of evolutionary analyses. Empirical tests of the theory, however, are challenging because of the complex relationships between space, dispersal, and evolution. Critically, we lack quantitative validation of the spatial models of population genetics. Here we combine analytics, on- and off-lattice simulations, and experiments with bacteria to perform quantitative tests of the theory. We study two bacterial species, the gut microbe Escherichia coli and the opportunistic pathogen Pseudomonas aeruginosa, and show that spatio-genetic patterns in colony biofilms of both species are accurately described by an extension of the one-dimensional stepping-stone model. We use one empirical measure, genetic diversity at the colony periphery, to parameterize our models and show that we can then accurately predict another key variable: the degree of short-range cell migration along an edge. Moreover, the model allows us to estimate other key parameters including effective population size (density) at the expansion frontier. While our experimental system is a simplification of a natural microbial community, we argue it is a proof
Standard neutral population genetics theory with a strictly fixed population size has important limitations. An alternative model that allows independently fluctuating population sizes and reproduces the standard neutral evolution is reviewed. We then study a situation such that the competing species are neutral at the equilibrium population size but population size fluctuations nevertheless favor fixation of one species over the other. In this case, a separation of timescales emerges naturally and allows adiabatic elimination of a fast population size variable to deduce the fluctuations-induced selection dynamics near the equilibrium population size. The results highlight the incompleteness of the standard population genetics with a strictly fixed population size.
These lecture notes introduce key concepts of mathematical population genetics within the most elementary setting and describe a few recent applications to microbial evolution experiments. Pointers to the literature for further reading are provided, and some of the derivations are left as exercises for the reader.
The growth and evolution of microbial populations are often subject to advection by fluid flows in spatially extended environments, with immediate consequences for questions of spatial population genetics in marine ecology, planktonic diversity and origin of life scenarios. Here, we review recent progress made in understanding this rich problem in the simplified setting of two competing genetic microbial strains subjected to fluid flows. As a pedagogical example we focus on antagonism, i.e., two killer microorganism strains, each secreting toxins that impede the growth of their competitors (competitive exclusion), in the presence of stationary fluid flows. By solving two coupled reaction-diffusion equations that include advection by simple steady cellular flows composed of characteristic flow motifs in two dimensions (2d), we show how local flow shear and compressibility effects can interact with selective advantage to have a dramatic influence on genetic competition and fixation in spatially distributed populations. We analyze several 1d and 2d flow geometries including sources, sinks, vortices and saddles, and show how simple analytical models of the dynamics of the genetic inte
We give an overview of stochastic models of evolution that have found applications in genetics, ecology and linguistics for an audience of nonspecialists, especially statistical physicists. In particular, we focus mostly on neutral models in which no intrinsic advantage is ascribed to a particular type of the variable unit, for example a gene, appearing in the theory. In many cases these models are exactly solvable and furthermore go some way to describing observed features of genetic, ecological and linguistic systems.
In this paper, we propose a framework for automatic classification of patients from multimodal genetic and brain imaging data by optimally combining them. Additive models with unadapted penalties (such as the classical group lasso penalty or $L_1$-multiple kernel learning) treat all modalities in the same manner and can result in undesirable elimination of specific modalities when their contributions are unbalanced. To overcome this limitation, we introduce a multilevel model that combines imaging and genetics and that considers joint effects between these two modalities for diagnosis prediction. Furthermore, we propose a framework that allows several penalties to be combined, taking into account the structure of the different types of data, such as a group lasso penalty over the genetic modality and an $L_2$-penalty on imaging modalities. Finally, we propose a fast optimization algorithm, based on a proximal gradient method. The model has been evaluated on genetic (single nucleotide polymorphisms-SNP) and imaging (anatomical MRI measures) data from the ADNI database, and compared to additive models. It exhibits good performance in AD diagnosis and, at the same time, reveals relationships
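The combination of a group lasso penalty with a proximal gradient method, mentioned in the abstract above, rests on one small building block: the proximal operator of the group lasso penalty, which shrinks each group of coefficients as a block. The sketch below shows that operator and a single proximal gradient step. It is a generic textbook construction, not the authors' algorithm; the function names, the grouping, and the numbers are hypothetical.

```python
import math


def group_soft_threshold(v, threshold):
    """Proximal operator of threshold * ||v||_2 (block soft-thresholding)."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm <= threshold:
        # The whole group is set to zero, i.e. the group is eliminated.
        return [0.0] * len(v)
    scale = 1.0 - threshold / norm
    return [scale * x for x in v]


def proximal_gradient_step(w, grad, groups, step, lam):
    """One step: gradient descent on the smooth loss, then group shrinkage.

    w      -- current coefficients
    grad   -- gradient of the smooth loss at w
    groups -- list of index lists; indices in no group are left unpenalized
    step   -- step size
    lam    -- group lasso regularization strength
    """
    z = [wi - step * gi for wi, gi in zip(w, grad)]
    out = list(z)
    for idx in groups:
        block = group_soft_threshold([z[i] for i in idx], step * lam)
        for i, b in zip(idx, block):
            out[i] = b
    return out


# Two groups: a strong one (indices 0, 1) and a weak one (index 2).
w_new = proximal_gradient_step(
    [3.0, 4.0, 0.1], [0.0, 0.0, 0.0], [[0, 1], [2]], step=1.0, lam=1.0
)
```

In this example the weak group is removed entirely while the strong group is only shrunk, which is the behaviour that makes an unadapted group penalty capable of eliminating a whole modality.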
Non-invasive measurements of the human brain using magnetic resonance imaging (MRI) have significantly improved our understanding of the brain's network organization by enabling measurement of anatomical connections between brain regions (structural connectivity) and their coactivation (functional connectivity). Heritability analyses have established that genetics account for considerable intersubject variability in structural and functional connectivity. However, characterizing how genetics shape the relationship between structural and functional connectomes remains challenging, since this association is obscured by unique environmental exposures in observed data. To address this, we develop a regression analysis framework that enables characterization of the relationship between latent genetic contributions to structural and functional connectivity. Implementing the proposed framework requires estimating genetic covariance matrices in multivariate random effects models, which is computationally intractable for high-dimensional connectome data using existing methods. We introduce a constrained method-of-moments estimator that is several orders of magnitude faster than existing method