Natural Language Processing (NLP) has transformed various fields beyond linguistics by applying techniques originally developed for human language to the analysis of biological sequences. This review explores the application of NLP methods to biological sequence data, focusing on genomics, transcriptomics, and proteomics. We examine how various NLP methods, from classic approaches like word2vec to advanced models employing transformers and Hyena operators, are being adapted to analyze DNA, RNA, protein sequences, and entire genomes. We also discuss tokenization strategies and model architectures, evaluating their strengths, limitations, and suitability for different biological tasks. We further cover recent advances in NLP applications for biological data, such as structure prediction, gene expression, and evolutionary analysis, highlighting the potential of these methods for extracting meaningful insights from large-scale genomic data. As language models continue to advance, their integration into bioinformatics holds immense promise for deepening our understanding of biological processes in all domains of life.
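To make the tokenization discussion concrete, below is a minimal sketch of one common strategy for DNA sequences, overlapping k-mer tokenization; the function names, the choice of k, and the toy sequence are illustrative assumptions rather than details of any specific model covered in the review.

```python
# Illustrative sketch: overlapping k-mer tokenization of a DNA sequence,
# one common way of turning genomic data into "words" for a language model.
def kmer_tokenize(sequence: str, k: int = 6) -> list[str]:
    """Split a DNA sequence into overlapping k-mer tokens."""
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

def build_vocab(tokens: list[str]) -> dict[str, int]:
    """Map each distinct k-mer to an integer id for model input."""
    return {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}

if __name__ == "__main__":
    seq = "ATGCGTACGTTAG"
    tokens = kmer_tokenize(seq, k=6)          # ['ATGCGT', 'TGCGTA', ...]
    vocab = build_vocab(tokens)
    ids = [vocab[t] for t in tokens]          # integer ids fed to an embedding layer
    print(tokens)
    print(ids)
```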
The effective visualization of genomic data is crucial for exploring and interpreting complex relationships within and across genes and genomes. Despite advances in developing dedicated bioinformatics software, common visualization tools often fail to efficiently integrate the diverse datasets produced in comparative genomics, lack intuitive interfaces for constructing complex plots, and miss functionalities for inspecting the underlying data iteratively and at scale. Here, we introduce gggenomes, a versatile R package designed to overcome these challenges by extending the widely used ggplot2 framework for comparative genomics. gggenomes is available from CRAN and GitHub, accompanied by detailed and user-friendly documentation (https://thackl.github.io/gggenomes).
Biology is perhaps the most complex of the sciences, given the incredible variety of chemical species that are interconnected in spatial and temporal pathways that are daunting to understand. Their interconnections lead to emergent properties such as memory, consciousness, and recognition of self and non-self. To understand how these interconnected reactions lead to cellular life characterized by activation, inhibition, regulation, homeostasis, and adaptation, computational analyses and simulations are essential, a fact recognized by the biological community. At the same time, students struggle to understand and apply binding and kinetic analyses even for the simplest reactions, such as the irreversible first-order conversion of a single reactant to a product. This likely results from cognitive difficulties in combining structural, chemical, mathematical, and textual descriptions of binding and catalytic reactions. To help students better understand dynamic reactions and their analyses, we have introduced two kinds of interactive graphs and simulations into the online educational resource Fundamentals of Biochemistry, a multivolume biochemistry textbook that is part of the LibreTexts collection.
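As a small worked example of the kind of kinetic analysis students are asked to master, the sketch below evaluates the irreversible first-order conversion A -> P, for which d[A]/dt = -k[A] and [A](t) = [A]0 e^(-kt); the rate constant and initial concentration are arbitrary illustrative values, and the code is not part of the textbook's interactive materials.

```python
# Minimal sketch of irreversible first-order kinetics A -> P:
# d[A]/dt = -k[A], with analytic solution [A](t) = [A]0 * exp(-k t).
import numpy as np

k = 0.5          # first-order rate constant (1/s), illustrative value
A0 = 1.0         # initial reactant concentration (mM), illustrative value
t = np.linspace(0.0, 10.0, 101)

A = A0 * np.exp(-k * t)      # reactant decays exponentially
P = A0 - A                   # product accumulates as A is consumed

half_life = np.log(2) / k    # t_1/2 = ln(2)/k for first-order kinetics
print(f"half-life = {half_life:.2f} s")
```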
Artificial intelligence (AI) is transforming scientific research, including proteomics. Advances in mass spectrometry (MS)-based proteomics data quality, diversity, and scale, combined with groundbreaking AI techniques, are opening up new challenges and opportunities in biological discovery. Here, we highlight key areas where AI is driving innovation, from data analysis to new biological insights. These include developing an AI-friendly ecosystem for proteomics data generation, sharing, and analysis; improving peptide and protein identification and quantification; characterizing protein-protein interactions and protein complexes; advancing spatial and perturbation proteomics; integrating multi-omics data; and ultimately enabling AI-empowered virtual cells.
Large-scale comparison of whole genome sequences has not been achieved with conventional methods based on pairwise base-to-base comparison; moreover, little attention has been paid to handling in one sitting a set of genomes that cross genetic categories (chromosome, plasmid, and phage), span wide divergences (with little or no homology), and cover large size ranges (from kbp to Mbp). We created a new method, GenomeFingerprinter, to unambiguously produce three-dimensional coordinates from a sequence, followed by one three-dimensional plot and six two-dimensional trajectory projections to illustrate whole genome fingerprints. We further developed a set of concepts and tools and thereby established a new method, universal genome fingerprint analysis. We demonstrated their applications through case studies on over a hundred genome sequences. In particular, we defined the total genetic component configuration (TGCC) (i.e., chromosome, plasmid, and phage) for describing a strain as a system, the universal genome fingerprint map (UGFM) of TGCC for differentiating a strain as a universal system, and systematic comparative genomics (SCG) for comparing a number of genomes in one sitting.
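The abstract does not specify how GenomeFingerprinter maps bases to coordinates, so the following is only a hypothetical sketch of the general idea: assign each nucleotide a fixed three-dimensional displacement, accumulate the displacements along the sequence, and project the resulting trajectory onto coordinate planes. The displacement vectors and function names are invented for illustration and do not reproduce the published algorithm.

```python
# Hypothetical sketch (not the published GenomeFingerprinter method):
# each base contributes a fixed 3D step; the cumulative sum of steps turns a
# genome into a 3D trajectory that can be plotted or projected onto 2D planes.
import numpy as np

BASE_STEP = {
    "A": np.array([1.0, 0.0, 0.0]),
    "C": np.array([0.0, 1.0, 0.0]),
    "G": np.array([0.0, 0.0, 1.0]),
    "T": np.array([-1.0, -1.0, -1.0]),
}

def fingerprint_coordinates(sequence: str) -> np.ndarray:
    """Return an (N, 3) array of cumulative 3D coordinates for a sequence."""
    steps = [BASE_STEP.get(b, np.zeros(3)) for b in sequence.upper()]
    return np.cumsum(np.array(steps), axis=0)

coords = fingerprint_coordinates("ATGCGTACGTTAGCCGTA")
xy_projection = coords[:, :2]   # one possible 2D trajectory projection
```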
Introduction: While the origin and evolution of proteins remain mysterious, advances in evolutionary genomics and systems biology are facilitating the historical exploration of the structure, function and organization of proteins and proteomes. Molecular chronologies are series of time events describing the history of biological systems and subsystems and the rise of biological innovations. Together with time-varying networks, these chronologies provide a window into the past. Areas covered: Here, we review molecular chronologies and networks built with modern methods of phylogeny reconstruction. We discuss how chronologies of structural domain families uncover the explosive emergence of metabolism, the late rise of translation, the co-evolution of ribosomal proteins and rRNA, and the late development of the ribosomal exit tunnel; events that coincided with a tendency to shorten folding time. Evolving networks described the early emergence of domains and a late big bang of domain combinations. Expert opinion: Two processes, folding and recruitment, appear central to the evolutionary progression. The former increases protein persistence. The latter fosters diversity. Chronologically,
Understanding the spatial architecture of the tumor microenvironment (TME) is critical to advance precision oncology. We present ProteinPNet, a novel framework based on prototypical part networks that discovers TME motifs from spatial proteomics data. Unlike traditional post-hoc explainability models, ProteinPNet directly learns discriminative, interpretable, and faithful spatial prototypes through supervised training. We validate our approach on synthetic datasets with ground truth motifs, and further test it on a real-world lung cancer spatial proteomics dataset. ProteinPNet consistently identifies biologically meaningful prototypes aligned with different tumor subtypes. Through graphical and morphological analyses, we show that these prototypes capture interpretable features pointing to differences in immune infiltration and tissue modularity. Our results highlight the potential of prototype-based learning to reveal interpretable spatial biomarkers within the TME, with implications for mechanistic discovery in spatial omics.
Rare diseases are collectively common, affecting approximately one in twenty individuals worldwide. In recent years, rapid progress has been made in rare disease diagnostics due to advances in DNA sequencing, development of new computational and experimental approaches to prioritize genes and genetic variants, and increased global exchange of clinical and genetic data. However, more than half of individuals suspected to have a rare disease lack a genetic diagnosis. The Genomics Research to Elucidate the Genetics of Rare Diseases (GREGoR) Consortium was initiated to study thousands of challenging rare disease cases and families and apply, standardize, and evaluate emerging genomics technologies and analytics to accelerate their adoption in clinical practice. Further, all data generated, currently representing ~7500 individuals from ~3000 families, is rapidly made available to researchers worldwide via the Genomic Data Science Analysis, Visualization, and Informatics Lab-space (AnVIL) to catalyze global efforts to develop approaches for genetic diagnoses in rare diseases (https://gregorconsortium.org/data). The majority of these families have undergone prior clinical genetic testing.
Proteomics is the large-scale study of protein structure and function in biological systems through protein identification and quantification. "Shotgun proteomics" or "bottom-up proteomics" is the prevailing strategy, in which proteins are hydrolyzed into peptides that are analyzed by mass spectrometry. Proteomics can be applied to diverse questions, ranging from simple protein identification to studies of proteoforms, protein-protein interactions, protein structural alterations, absolute and relative protein quantification, post-translational modifications, and protein stability. To enable this range of different experiments, there are diverse strategies for proteome analysis. The nuances of how proteomic workflows differ may be challenging to understand for new practitioners. Here, we provide a comprehensive overview of different proteomics methods to aid the novice and experienced researcher. We cover topics ranging from biochemistry basics and protein extraction to biological interpretation and orthogonal validation. We expect this work to serve as a basic resource for new practitioners in the field of shotgun or bottom-up proteomics.
The molecular responses of macrophages to copper-based nanoparticles have been investigated via a combination of proteomic and biochemical approaches, using the RAW264.7 cell line as a model. Both metallic copper and copper oxide nanoparticles have been tested, with copper ion and zirconium oxide nanoparticles used as controls. Proteomic analysis highlighted changes in proteins implicated in oxidative stress responses (superoxide dismutases and peroxiredoxins), glutathione biosynthesis, the actomyosin cytoskeleton, and mitochondrial proteins (especially oxidative phosphorylation complex subunits). Validation studies employing functional analyses showed that the increases in glutathione biosynthesis and in mitochondrial complexes observed in the proteomic screen were critical to cell survival upon stress with copper-based nanoparticles; pharmacological inhibition of these two pathways enhanced cell vulnerability to copper-based nanoparticles, but not to copper ions. Furthermore, functional analyses using primary macrophages derived from bone marrow showed a decrease in reduced glutathione levels, a decrease in the mitochondrial transmembrane potential, and inhibition of phagocytosis.
The applications of large language models (LLMs) are promising for biomedical and healthcare research. Despite the availability of open-source LLMs trained using a wide range of biomedical data, current research on the applications of LLMs to genomics and proteomics is still limited. To fill this gap, we propose a collection of finetuned LLMs and multimodal LLMs (MLLMs), known as Geneverse, for three novel tasks in genomic and proteomic research. The models in Geneverse are trained and evaluated based on domain-specific datasets, and we use advanced parameter-efficient finetuning techniques to achieve model adaptation for tasks including the generation of descriptions of gene functions, protein function inference from structure, and marker gene selection from spatial transcriptomic data. We demonstrate that adapted LLMs and MLLMs perform well for these tasks and may outperform closed-source large-scale models in our evaluations, which focus on both truthfulness and structural correctness. All of the training strategies and base models we used are freely accessible.
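As a hedged illustration of parameter-efficient finetuning of the kind mentioned above, the sketch below wraps an open-source causal language model with LoRA adapters via the Hugging Face peft library; the base model name, target modules, and hyperparameters are placeholders and are not claimed to match the Geneverse configuration.

```python
# Hedged sketch of parameter-efficient finetuning with LoRA (peft library).
# The base model and hyperparameters are placeholders, not Geneverse's setup.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

base_model_name = "facebook/opt-350m"  # placeholder open-source base model
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
model = AutoModelForCausalLM.from_pretrained(base_model_name)

# LoRA injects small trainable low-rank matrices into the attention projections,
# so only a small fraction of parameters are updated during adaptation.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# The wrapped model could then be finetuned on domain-specific prompt/response
# pairs (e.g., gene symbols paired with curated function descriptions).
```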
Two-dimensional gel electrophoresis has been instrumental in the birth and development of proteomics, although it is no longer the exclusive separation tool used in the field. In this review, a historical perspective is provided, starting from the days when two-dimensional gels were used and the word proteomics did not even exist. The events that led to the birth of proteomics are also recalled, ending with a description of the now well-known limitations of two-dimensional gels in proteomics. However, the often-underestimated advantages of two-dimensional gels are also underlined, leading to a description of how and when to use two-dimensional gels to best effect in a proteomics approach. Building on these advantages (robustness, resolution, and the ability to separate entire, intact proteins), possible future applications of this technique in proteomics are also mentioned.
Taking the opportunity of the 20th anniversary of the word "proteomics", this young adult age is a good time to remember how proteomics came from enormous progress in protein separation and protein microanalysis techniques, and from the combination of these advances into a high-performance, streamlined working setup. However, in the history of the almost three decades that span from the first attempts to perform large-scale analysis of proteins to the current high-throughput proteomics that we can enjoy now, it is also interesting to underline and to recall how difficult the first decade was. Indeed, when the word was cast, the battle was already won. This recollection is mostly devoted to the almost forgotten period during which proteomics was being conceived and brought to birth, as this collective scientific work will never appear when searched through the keyword "proteomics". BIOLOGICAL SIGNIFICANCE: The significance of this manuscript is to recall and review the two decades that separated the first attempts at performing large-scale analysis of proteins from the solid technical corpus that existed when the word "proteomics" was coined twenty years ago. This recollection is made within
In anticipation of the completion of the High-Luminosity Large Hadron Collider (HL-LHC) programme by the end of 2041, CERN is preparing to launch a new major facility in the mid-2040s. According to the 2020 update of the European Strategy for Particle Physics (ESPP), the highest-priority next collider is an electron-positron Higgs factory, followed in the longer term by a hadron-hadron collider at the highest achievable energy. The CERN directorate established a Future Colliders Comparative Evaluation working group in June 2023. This group brings together project leaders and domain experts to conduct a consistent evaluation of the Future Circular Collider (FCC) and alternative scenarios based on shared assumptions and standardized criteria. This report presents a comparative evaluation of proposed future collider projects submitted as input for the Update of the European Strategy for Particle Physics. These proposals are compared considering main performance parameters, environmental impact and sustainability, technical maturity, cost of construction and operation, required human resources, and realistic implementation timelines. An overview of the international collider projects w
Traditionally, studies in experimental physiology have been conducted in small groups of human participants, animal models or cell lines. Identifying optimal study designs that achieve sufficient power for drawing proper statistical inferences to detect group-level effects with small sample sizes has been challenging. Moreover, average effects derived from traditional group-level inference do not necessarily apply to individual participants. Here, we introduce N-of-1 trials as an innovative study design that can be used to draw valid statistical inference about the effects of interventions on individual participants and can be aggregated across multiple study participants to provide population-level inferences more efficiently than standard group randomized trials. N-of-1 trials have been used since the late 1980s, but without large-scale adoption and with few applications in experimental physiology research settings. In this manuscript, we introduce the key components and design features of N-of-1 trials, describe statistical analysis and interpretation of the results, and present some available digital tools that facilitate their use, drawing on examples from experimental physiology.
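A minimal sketch of how measurements from a single N-of-1 trial might be analyzed is shown below, assuming an alternating control/treatment (ABAB) design with simulated daily outcomes; the simple two-sample t-test is only one of several possible analyses and is not drawn from the manuscript itself.

```python
# Hedged sketch: analyze one N-of-1 (single-participant) trial with an ABAB
# design by comparing outcomes under treatment vs. control periods within the
# same individual. Data are simulated purely for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Four alternating periods (A = control, B = treatment), 7 daily measurements each.
periods = ["A", "B", "A", "B"]
true_effect = 1.5  # simulated treatment benefit in outcome units
data = {
    "A": np.concatenate([rng.normal(10.0, 1.0, 7) for p in periods if p == "A"]),
    "B": np.concatenate([rng.normal(10.0 + true_effect, 1.0, 7) for p in periods if p == "B"]),
}

t_stat, p_value = stats.ttest_ind(data["B"], data["A"])
print(f"within-person treatment effect: {data['B'].mean() - data['A'].mean():.2f}")
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```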
Long-range dependencies are critical for understanding genomic structure and function, yet most conventional methods struggle with them. Widely adopted transformer-based models, while excelling at short-context tasks, are limited by the attention module's quadratic computational complexity and inability to extrapolate to sequences longer than those seen in training. In this work, we explore State Space Models (SSMs) as a promising alternative by benchmarking two SSM-inspired architectures, Caduceus and Hawk, on long-range genomics modeling tasks under conditions parallel to a 50M parameter transformer baseline. We discover that SSMs match transformer performance and exhibit impressive zero-shot extrapolation across multiple tasks, handling contexts 10 to 100 times longer than those seen during training, indicating more generalizable representations better suited for modeling the long and complex human genome. Moreover, we demonstrate that these models can efficiently process sequences of 1M tokens on a single GPU, allowing for modeling entire genomic regions at once, even in labs with limited compute. Our findings establish SSMs as efficient and scalable for long-context genomic analysis.
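To illustrate why state space models scale linearly with context length, here is a toy diagonal linear recurrence of the kind that underlies SSM layers; it is not the Caduceus or Hawk architecture, and all parameter values are arbitrary illustrative choices.

```python
# Toy diagonal state-space recurrence: the hidden state is updated once per
# token, so cost grows linearly with sequence length (vs. quadratic attention).
import numpy as np

def ssm_scan(x: np.ndarray, a: np.ndarray, b: np.ndarray, c: np.ndarray) -> np.ndarray:
    """Run h_t = a * h_{t-1} + b * x_t ; y_t = c . h_t over a 1D input signal.

    x: (T,) input; a, b, c: (D,) diagonal state parameters. Returns y: (T,).
    """
    h = np.zeros_like(a)
    y = np.empty(x.shape[0])
    for t, x_t in enumerate(x):          # O(T) in sequence length
        h = a * h + b * x_t              # elementwise (diagonal) state update
        y[t] = np.dot(c, h)              # readout
    return y

T, D = 100_000, 16                        # long toy sequence, small state size
rng = np.random.default_rng(0)
x = rng.standard_normal(T)
a = np.full(D, 0.999)                     # decay close to 1 retains long-range memory
b = rng.standard_normal(D) * 0.01
c = rng.standard_normal(D)
y = ssm_scan(x, a, b, c)
```

Real SSM implementations replace this Python loop with parallel scans or fused GPU kernels, which is what makes million-token contexts practical on a single GPU.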
Two-dimensional electrophoresis of proteins has preceded, and accompanied, the birth of proteomics. Although it is no longer the only experimental scheme used in modern proteomics, it still has distinct features and advantages. The purpose of this tutorial paper is to guide the reader through the history of the field, then through the main steps of the process, from sample preparation to in-gel detection of proteins, commenting on the constraints and caveats of the technique. Then the limitations and positive features of two-dimensional electrophoresis are discussed (e.g. its unique ability to separate complete proteins and its easy interfacing with immunoblotting techniques), so that the optimal types of applications of this technique in current and future proteomics can be identified. This is illustrated by an example taken from the literature, which is commented on in detail. This Tutorial is part of the International Proteomics Tutorial Programme (IPTP 2).
Meloidogyne root knot nematodes (RKN) can infect most of the world's agricultural crop species and are among the most important of all plant pathogens. As yet, however, we have little understanding of their origins or the genomic basis of their extreme polyphagy. The most damaging pathogens reproduce by mitotic parthenogenesis and are suggested to originate from interspecific hybridizations between unknown parental taxa. We sequenced the genome of the diploid meiotic parthenogen Meloidogyne floridensis and used a comparative genomic approach to test the hypothesis that it was involved in the hybrid origin of the tropical mitotic parthenogen M. incognita. Phylogenomic analysis of gene families from M. floridensis, M. incognita and an outgroup species, M. hapla, was used to trace the evolutionary history of these species' genomes, demonstrating that M. floridensis was one of the parental species in the hybrid origins of M. incognita. Analysis of the M. floridensis genome revealed many gene loci present in divergent copies, as they are in M. incognita, indicating that it too had a hybrid origin. The triploid M. incognita is shown to be a complex double-hybrid between M. floridensis and a third species.
The combination of multiple classifiers using ensemble methods is increasingly important for making progress in a variety of difficult prediction problems. We present a comparative analysis of several ensemble methods through two case studies in genomics, namely the prediction of genetic interactions and protein functions, to demonstrate their efficacy on real-world datasets and draw useful conclusions about their behavior. These methods include simple aggregation, meta-learning, cluster-based meta-learning, and ensemble selection using heterogeneous classifiers trained on resampled data to improve the diversity of their predictions. We present a detailed analysis of these methods across four genomics datasets and find that the best of these methods offer statistically significant improvements over the state of the art in their respective domains. In addition, we establish a novel connection between ensemble selection and meta-learning, demonstrating how both of these disparate methods establish a balance between ensemble diversity and performance.
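A minimal sketch of the simplest of these strategies, simple aggregation by averaging predicted probabilities from heterogeneous classifiers, is given below; the synthetic dataset and the particular base learners are illustrative assumptions, not the genomics benchmarks analyzed in the study.

```python
# Hedged sketch of simple aggregation: average class probabilities from
# heterogeneous base classifiers trained on the same (synthetic) data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=50, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

base_models = [
    LogisticRegression(max_iter=1000),
    RandomForestClassifier(n_estimators=200, random_state=0),
    GradientBoostingClassifier(random_state=0),
]

# Train each heterogeneous classifier and average their class-1 probabilities.
probs = []
for model in base_models:
    model.fit(X_tr, y_tr)
    probs.append(model.predict_proba(X_te)[:, 1])
ensemble_prob = np.mean(probs, axis=0)

print("ensemble AUC:", roc_auc_score(y_te, ensemble_prob))
```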
Classifying genome sequences based on metadata has been an active area of research in comparative genomics for decades, with many important applications across the life sciences. Established methods for classifying genomes can be broadly grouped into sequence alignment-based and alignment-free models. Conventional alignment-based models rely on genome similarity measures calculated based on local sequence alignments or consistent ordering among sequences. However, such methods are computationally expensive when dealing with large ensembles of even moderately sized genomes. In contrast, alignment-free (AF) approaches measure genome similarity based on summary statistics in an unsupervised setting and are efficient enough to analyze large datasets. However, both alignment-based and AF methods typically assume fixed scoring rubrics that lack the flexibility to assign varying importance to different parts of the sequences based on prior knowledge. In this study, we integrate AI and network science approaches to develop a comparative genomic analysis framework that addresses these limitations. Our approach, termed the Genome Misclassification Network Analysis (GMNA), simultaneously leverages AI and network science approaches.
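To ground the alignment-free idea, the sketch below compares two toy genomes through their k-mer frequency profiles (a typical summary statistic) with cosine similarity; this illustrates generic AF comparison, not the GMNA scoring itself, and all names and sequences are invented for illustration.

```python
# Hedged sketch of alignment-free genome comparison: represent each genome by
# a normalized k-mer frequency vector and compare genomes by cosine similarity.
from collections import Counter
from itertools import product

import numpy as np

def kmer_profile(sequence: str, k: int = 4) -> np.ndarray:
    """Normalized k-mer frequency vector over the fixed DNA k-mer alphabet."""
    alphabet = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    vec = np.array([counts.get(kmer, 0) for kmer in alphabet], dtype=float)
    return vec / vec.sum() if vec.sum() > 0 else vec

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy "genomes"; real analyses would load assembled sequences from FASTA files.
genome_a = "ATGCGTACGTTAGCCGTAATGCGT" * 50
genome_b = "ATGCGTACGTTGGCCGTAATGCGA" * 50
print(cosine_similarity(kmer_profile(genome_a), kmer_profile(genome_b)))
```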