We discuss foundational issues of quantum information biology (QIB) -- one of the most successful applications of the quantum formalism outside of physics. QIB provides a multi-scale model of information processing in bio-systems: from proteins and cells to cognitive and social systems. This theory has to be sharply distinguished from "traditional quantum biophysics". The latter is about quantum bio-physical processes, e.g., in cells or brains. QIB models the dynamics of information states of bio-systems. It is based on the quantum-like paradigm: complex bio-systems process information in accordance with the laws of quantum information and probability. This paradigm is supported by plenty of statistical bio-data collected at all scales, from molecular biology and genetics/epigenetics to cognitive psychology and behavioral economics. We argue that the information interpretation of quantum mechanics (its various forms were elaborated by Zeilinger and Brukner, Fuchs and Mermin, and D' Ariano) is the most natural interpretation of QIB. We also point out that QBIsm (Quantum Bayesianism) can serve to find a proper interpretation of bio-quantum probabilities. Biologically QIB is based on
Understanding the biological mechanisms of disease is crucial for medicine, and in particular, for drug discovery. AI-powered analysis of genome-scale biological data holds great potential in this regard. The increasing availability of single-cell RNA sequencing data has enabled the development of large foundation models for disease biology. However, existing foundation models only modestly improve over task-specific models in downstream applications. Here, we explored two avenues for improving single-cell foundation models. First, we scaled the pre-training data to a diverse collection of 116 million cells, which is larger than those used by previous models. Second, we leveraged the availability of large-scale biological annotations as a form of supervision during pre-training. We trained the \model family of models comprising six transformer-based state-of-the-art single-cell foundation models with 70 million, 160 million, and 400 million parameters. We vetted our models on several downstream evaluation tasks, including identifying the underlying disease state of held-out donors not seen during training, distinguishing between diseased and healthy cells for disease conditions and
High throughput sequencing (HTS)-based technology enables identifying and quantifying non-culturable microbial organisms in all environments. Microbial sequences have enhanced our understanding of the human microbiome, the soil and plant environment, and the marine environment. All molecular microbial data pose statistical challenges due to contamination sequences from reagents, batch effects, unequal sampling, and undetected taxa. Technical biases and heteroscedasticity have the strongest effects, but different strains across subjects and environments also make direct differential abundance testing unwieldy. We provide an introduction to a few statistical tools that can overcome some of these difficulties and demonstrate those tools on an example. We show how standard statistical methods, such as simple hierarchical mixture and topic models, can facilitate inferences on latent microbial communities. We also review some nonparametric Bayesian approaches that combine visualization and uncertainty quantification. The intersection of molecular microbial biology and statistics is an exciting new venue. Finally, we list some of the important open problems that would benefit from more ca
Graphical modelling has a long history in statistics as a tool for the analysis of multivariate data, starting from Wright's path analysis and Gibbs' applications to statistical physics at the beginning of the last century. In its modern form, it was pioneered by Lauritzen and Wermuth and Pearl in the 1980s, and has since found applications in fields as diverse as bioinformatics, customer satisfaction surveys and weather forecasts. Genetics and systems biology are unique among these fields in the dimension of the data sets they study, which often contain several hundreds of variables and only a few tens or hundreds of observations. This raises problems in both computational complexity and the statistical significance of the resulting networks, collectively known as the "curse of dimensionality". Furthermore, the data themselves are difficult to model correctly due to the limited understanding of the underlying mechanisms. In the following, we will illustrate how such challenges affect practical graphical modelling and some possible solutions.
In a recent paper, Wilmes et al. demonstrated a qualitative integration of omics data streams to gain a mechanistic understanding of cyclosporine A toxicity. One of their major conclusions was that cyclosporine A strongly activates the nuclear factor (erythroid-derived 2)-like 2 pathway (Nrf2) in renal proximal tubular epithelial cells exposed in vitro. We pursue here the analysis of those data with a quantitative integration of omics data with a differential equation model of the Nrf2 pathway. That was done in two steps: (i) Modeling the in vitro pharmacokinetics of cyclosporine A (exchange between cells, culture medium and vial walls) with a minimal distribution model. (ii) Modeling the time course of omics markers in response to cyclosporine A exposure at the cell level with a coupled PK-systems biology model. Posterior statistical distributions of the parameter values were obtained by Markov chain Monte Carlo sampling. Data were well simulated, and the known in vitro toxic effect EC50 was well matched by model predictions. The integration of in vitro pharmacokinetics and systems biology modeling gives us a quantitative insight into mechanisms of cyclosporine A oxidative-stress
The understanding of molecular cell biology requires insight into the structure and dynamics of networks that are made up of thousands of interacting molecules of DNA, RNA, proteins, metabolites, and other components. One of the central goals of systems biology is the unraveling of the as yet poorly characterized complex web of interactions among these components. This work is made harder by the fact that new species and interactions are continuously discovered in experimental work, necessitating the development of adaptive and fast algorithms for network construction and updating. Thus, the "reverse-engineering" of networks from data has emerged as one of the central concern of systems biology research. A variety of reverse-engineering methods have been developed, based on tools from statistics, machine learning, and other mathematical domains. In order to effectively use these methods, it is essential to develop an understanding of the fundamental characteristics of these algorithms. With that in mind, this chapter is dedicated to the reverse-engineering of biological systems. Specifically, we focus our attention on a particular class of methods for reverse-engineering, namely th
This article frames the relation between biology and physics by characterizing the former as a subdiscipline rather than a special case of the latter. To do this, we posit biological physics as the science of living matter in contrast to classic biophysics, the study of organismal properties by physical techniques. At the scale of the individual cell, living matter is nonunitary, i.e., not composed of aggregated subunits, and has features (e.g., intracellular organizational arrangements and biomolecular condensates) that are unlike any materials of the nonliving world. In transiently or constitutively multicellular forms (social microorganisms, animals, plants), living matter sustains physical processes that are generic (shared with nonliving matter, e.g., subunit communication by molecular diffusion in cellular slime molds), biogeneric (analogous to nonliving matter but realized through cellular activities, e.g., subunit demixing in animal embryos) or nongeneric (pertaining to sui generis materials, e.g., budding of active solids in plants). This "forms of matter" perspective is philosophically situated in the dialectical materialism of Engels and Hessen and the multilevel physica
Recent tumor genome sequencing confirmed that one tumor often consists of multiple cell subpopulations (clones) which bear different, but related, genetic profiles such as mutation and copy number variation profiles. Thus far, one tumor has been viewed as a whole entity in cancer functional studies. With the advances of genome sequencing and computational analysis, we are able to quantify and computationally dissect clones from tumors, and then conduct clone-based analysis. Emerging technologies such as single-cell genome sequencing and RNA-Seq could profile tumor clones. Thus, we should reconsider how to conduct cancer systems biology studies in the genome sequencing era. We will outline new directions for conducting cancer systems biology by considering that genome sequencing technology can be used for dissecting, quantifying and genetically characterizing clones from tumors. Topics discussed in Part 1 of this review include computationally quantifying of tumor subpopulations; clone-based network modeling, cancer hallmark-based networks and their high-order rewiring principles and the principles of cell survival networks of fast-growing clones.
Dynamical systems modeling, particularly via systems of ordinary differential equations, has been used to effectively capture the temporal behavior of different biochemical components in signal transduction networks. Despite the recent advances in experimental measurements, including sensor development and '-omics' studies that have helped populate protein-protein interaction networks in great detail, modeling in systems biology lacks systematic methods to estimate kinetic parameters and quantify associated uncertainties. This is because of multiple reasons, including sparse and noisy experimental measurements, lack of detailed molecular mechanisms underlying the reactions, and missing biochemical interactions. Additionally, the inherent nonlinearities with respect to the states and parameters associated with the system of differential equations further compound the challenges of parameter estimation. In this study, we propose a comprehensive framework for Bayesian parameter estimation and complete quantification of the effects of uncertainties in the data and models. We apply these methods to a series of signaling models of increasing mathematical complexity. Systematic analysis o
Systems biology relies on mathematical models that often involve complex and intractable likelihood functions, posing challenges for efficient inference and model selection. Generative models, such as normalizing flows, have shown remarkable ability in approximating complex distributions in various domains. However, their application in systems biology for approximating intractable likelihood functions remains unexplored. Here, we elucidate a framework for leveraging normalizing flows to approximate complex likelihood functions inherent to systems biology models. By using normalizing flows in the Simulation-based inference setting, we demonstrate a method that not only approximates a likelihood function but also allows for model inference in the model selection setting. We showcase the effectiveness of this approach on real-world systems biology problems, providing practical guidance for implementation and highlighting its advantages over traditional computational methods.
A number of models in mathematical epidemiology have been developed to account for control measures such as vaccination or quarantine. However, COVID-19 has brought unprecedented social distancing measures, with a challenge on how to include these in a manner that can explain the data but avoid overfitting in parameter inference. We here develop a simple time-dependent model, where social distancing effects are introduced analogous to coarse-grained models of gene expression control in systems biology. We apply our approach to understand drastic differences in COVID-19 infection and fatality counts, observed between Hubei (Wuhan) and other Mainland China provinces. We find that these unintuitive data may be explained through an interplay of differences in transmissibility, effective protection, and detection efficiencies between Hubei and other provinces. More generally, our results demonstrate that regional differences may drastically shape infection outbursts. The obtained results demonstrate the applicability of our developed method to extract key infection parameters directly from publically available data so that it can be globally applied to outbreaks of COVID-19 in a number
The last decade has witnessed a rapid growth in understanding of the pivotal roles of mechanical stresses and physical forces in cell biology. As a result an integrated view of cell biology is evolving, where genetic and molecular features are scrutinized hand in hand with physical and mechanical characteristics of cells. Physics of liquid crystals has emerged as a burgeoning new frontier in cell biology over the past few years, fueled by an increasing identification of orientational order and topological defects in cell biology, spanning scales from subcellular filaments to individual cells and multicellular tissues. Here, we provide an account of most recent findings and developments together with future promises and challenges in this rapidly evolving interdisciplinary research direction.
Quantum computers can in principle solve certain problems exponentially more quickly than their classical counterparts. We have not yet reached the advent of useful quantum computation, but when we do, it will affect nearly all scientific disciplines. In this review, we examine how current quantum algorithms could revolutionize computational biology and bioinformatics. There are potential benefits across the entire field, from the ability to process vast amounts of information and run machine learning algorithms far more efficiently, to algorithms for quantum simulation that are poised to improve computational calculations in drug discovery, to quantum algorithms for optimization that may advance fields from protein structure prediction to network analysis. However, these exciting prospects are susceptible to "hype", and it is also important to recognize the caveats and challenges in this new technology. Our aim is to introduce the promise and limitations of emerging quantum computing technologies in the areas of computational molecular biology and bioinformatics.
Molecular traits, such as gene expression levels or protein binding affinities, are increasingly accessible to quantitative measurement by modern high-throughput techniques. Such traits measure molecular functions and, from an evolutionary point of view, are important as targets of natural selection. We review recent developments in evolutionary theory and experiments that are expected to become building blocks of a quantitative genetics of molecular traits. We focus on universal evolutionary characteristics: these are largely independent of a trait's genetic basis, which is often at least partially unknown. We show that universal measurements can be used to infer selection on a quantitative trait, which determines its evolutionary mode of conservation or adaptation. Furthermore, universality is closely linked to predictability of trait evolution across lineages. We argue that universal trait statistics extends over a range of cellular scales and opens new avenues of quantitative evolutionary systems biology.
We developed a theory showing that under appropriate normalizations and rescalings, temperature response curves show a remarkably regular behavior and follow a general, universal law. The impressive universality of temperature response curves remained hidden due to various curve-fitting models not well-grounded in first principles. In addition, this framework has the potential to explain the origin of different scaling relationships in thermal performance in biology, from molecules to ecosystems. Here, we summarize the background, principles and assumptions, predictions, implications, and possible extensions of this theory.
We survey and introduce concepts and tools located at the intersection of information theory and network biology. We show that Shannon's information entropy, compressibility and algorithmic complexity quantify different local and global aspects of synthetic and biological data. We show examples such as the emergence of giant components in Erdos-Renyi random graphs, and the recovery of topological properties from numerical kinetic properties simulating gene expression data. We provide exact theoretical calculations, numerical approximations and error estimations of entropy, algorithmic probability and Kolmogorov complexity for different types of graphs, characterizing their variant and invariant properties. We introduce formal definitions of complexity for both labeled and unlabeled graphs and prove that the Kolmogorov complexity of a labeled graph is a good approximation of its unlabeled Kolmogorov complexity and thus a robust definition of graph complexity.
This article introduces the physics of information in the context of molecular biology and genomics. Entropy and information, the two central concepts of Shannon's theory of information and communication, are often confused with each other but play transparent roles when applied to statistical ensembles (i.e., identically prepared sets) of symbolic sequences. Such an approach can distinguish between entropy and information in genes, predict the secondary structure of ribozymes, and detect the covariation between residues in folded proteins. We also review applications to molecular sequence and structure analysis, and introduce new tools in the characterization of resistance mutations, and in drug design.
Modern statistical inference tasks often require iterative optimization methods to compute the solution. Convergence analysis from an optimization viewpoint only informs us how well the solution is approximated numerically but overlooks the sampling nature of the data. In contrast, recognizing the randomness in the data, statisticians are keen to provide uncertainty quantification, or confidence, for the solution obtained using iterative optimization methods. This paper makes progress along this direction by introducing the moment-adjusted stochastic gradient descents, a new stochastic optimization method for statistical inference. We establish non-asymptotic theory that characterizes the statistical distribution for certain iterative methods with optimization guarantees. On the statistical front, the theory allows for model mis-specification, with very mild conditions on the data. For optimization, the theory is flexible for both convex and non-convex cases. Remarkably, the moment-adjusting idea motivated from "error standardization" in statistics achieves a similar effect as acceleration in first-order optimization methods used to fit generalized linear models. We also demonstrat
We aim to characterize the U-band variability of young brown dwarfs in the Taurus Molecular Cloud and discuss its origin. We used the XMM-Newton Extended Survey of the Taurus Molecular Cloud, where a sample of 11 young bona fide brown dwarfs (spectral type later than M6) were observed simultaneously in X-rays with XMM-Newton and in the U-band with the XMM-Newton Optical/UV Monitor (OM). We obtained upper limits to the U-band emission of 10 brown dwarfs (U>19.6-20.6 mag), whereas 2MASSJ04141188+2811535 was detected in the U-band. Remarkably, the magnitude of this brown dwarf increased regularly from U~19.5 mag at the beginning of the observation, peaked 6h later at U~18.4 mag, and then decreased to U~18.65 mag in the next 2h. The first OM U-band measurement is consistent with the quiescent level observed about one year later thanks to ground follow-up observations. This brown dwarf was not detected in X-rays by XMM-Newton during the OM observation. We discuss the possible sources of U-band variability for this young brown dwarf, namely a magnetic flare, non-steady accretion onto the substellar surface, and rotational modulation of a hot spot. We conclude that this event is relate
Systems Biology has emerged in the last years as a new holistic approach based on the global understanding of cells instead of only being focused on their individual parts (genes or proteins), to better understand the complexity of human cells. Since the Systems Biology still does not provide the most accurate answers to our questions due to the complexity of cells and the limited quality of available information to perform a good gene/protein map analysis, we have created simpler models to ensure easier analysis of the map that represents the human cell. Therefore, a virtual organism has been designed according to the main physiological rules for humans in order to replicate the human organism and its vital functions. This toy model was constructed by defining the topology of its genes/proteins and the biological functions associated to it. There are several examples of these toy models that emulate natural processes to perform analysis of the virtual life in order to design the best strategy to understand real life. The strategy applied in this study combines topological and functional analysis integrating the knowledge about the relative position of a node among the others in th