As the human population expands and global temperatures rise, species, populations, and biodiversity decline at unprecedented rates, while the frequency of infectious disease emergence increases. Therefore, it is more vital than ever to accurately understand the current state of natural habitats and their constituent species. We assess the feasibility of a single assay: long-read shotgun metagenomic sequencing of environmental DNA (eDNA), to monitor species from across the tree of life, from viruses to complex multicellular organisms, across a representative Irish river system (Avoca River, Co. Wicklow). We conducted aquatic eDNA sampling and long-read shotgun metagenomic sequencing from a mountain tributary through to the sea. This approach could detect and quantify organismal DNA present in environmental samples, from microbes (including DNA viruses) to mammals. Rather than the traditional siloing of microbial and multicellular studies of DNA recovered from environmental samples, simultaneously considering viruses, microbes, and eukaryotes (animals, plants, and fungi) can provide deeper insights. This single assay can simultaneously quantify differences in DNA abundance for a broad range of species and pathogens across sites and sample types, enabling wide-ranging biodiversity assessments. This included human, wildlife, plant, and microbial pathogens and parasites with health, agricultural, and economic importance. The environmental genomic data enabled animal phylogeny and transmissible cancer analysis (blue mussel, Mytilus edulis) even from natural complex community settings. Oxford Nanopore sequencing provides a quantitative approach for river biodiversity, pollution, and environmental health monitoring. Long-read shotgun metagenomic sequencing of environmental samples offers the means to assess whole ecosystems and the ecological, trophic, and host-pathogen interactions occurring within them.
Although the genetic code is degenerate, codon selection is nonrandom and reflects significant functional constraints. Codon-usage bias (CUB) acts as a layer of post-transcriptional regulation, influencing messenger RNA (mRNA) stability, translation kinetics, and co-translational protein folding. While CUB is well-characterized in unicellular organisms, its regulatory scope and functional consequences in humans remain complex and less defined. Our study offers a comprehensive evaluation of human codon usage. We report that genes exhibiting the strongest codon bias are enriched in high-stoichiometry biological processes, such as skin development and oxygen/carbon dioxide transport, and harbor significantly fewer synonymous variants than expected (ρ = -0.24, P < 2.2 × 10⁻¹⁶). Furthermore, we find that codon optimization is structurally distinct: it is significantly more pronounced in structured protein domains than in intrinsically disordered regions (IDRs) (Cliff's Δ = 0.26, P < 2.2 × 10⁻¹⁶). Consistent with translational selection, the most frequently used codons are supported by higher transfer RNA (tRNA) gene copy numbers (ρ = 0.49, P < 6.4 × 10⁻⁴). Finally, by correcting for GC3 content, we reveal that the apparent correlation between the effective number of codons and adaptation indices (CAI/tAI) vanishes, allowing us to disentangle mutational pressure from translational selection. Collectively, our findings position CUB as a central, evolutionarily conserved regulator of translation and protein folding in humans. Our results provide a comprehensive and integrated view of intergenic and intragenic CUB in humans, reinforcing the biological relevance of synonymous codon choice in shaping translational dynamics and protein biogenesis. This provides a refined framework for interpreting synonymous variation and guiding functional genomics.
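As a rough illustration of two quantities that codon-usage studies of this kind rely on, the sketch below computes GC3 (taken here simply as the fraction of codons ending in G or C) and relative synonymous codon usage (RSCU) for a single coding sequence. The function names and the toy sequence are hypothetical, and this is not the analysis pipeline of the study above; published GC3 definitions sometimes exclude Met, Trp, and stop codons, which this sketch does not.

```python
from collections import Counter

# Standard genetic code, laid out in the conventional TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def codons(cds):
    """Split a coding sequence into codons, dropping any trailing partial codon."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return [cds[i:i + 3] for i in range(0, usable, 3)]


def gc3(cds):
    """Fraction of recognised codons whose third base is G or C."""
    valid = [c for c in codons(cds) if c in CODON_TABLE]
    if not valid:
        return 0.0
    return sum(c[2] in "GC" for c in valid) / len(valid)


def rscu(cds):
    """RSCU per codon: observed count divided by the count expected
    if all synonymous codons of that amino acid were used equally."""
    counts = Counter(c for c in codons(cds) if c in CODON_TABLE)
    families = {}
    for codon, aa in CODON_TABLE.items():
        families.setdefault(aa, []).append(codon)
    result = {}
    for aa, family in families.items():
        total = sum(counts[c] for c in family)
        if aa == "*" or total == 0:
            continue  # skip stop codons and amino acids absent from the sequence
        for c in family:
            result[c] = counts[c] * len(family) / total
    return result


if __name__ == "__main__":
    toy_cds = "ATGGCCGCCGCTTAA"  # hypothetical sequence: Met-Ala-Ala-Ala-stop
    print(gc3(toy_cds))
    print(rscu(toy_cds)["GCC"])
```

An RSCU above 1 means a codon is used more often than expected under equal synonymous usage. Correlating per-gene bias measures against GC3 is one simple way to see how much of the apparent bias tracks base composition.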
The main challenges preventing mainstream use of RNA-sequencing (RNA-seq) in genome diagnostics are sources of biological and technical variation, typically caused by intrinsic differences in gene expression between tissue types, cellular conditions, and environmental factors. While machine learning methods may partially correct unwanted variation, interpreting RNA-seq data generated by different sources over time, a realistic scenario in healthcare, remains challenging. We developed a complete RNA-guided workflow that handles such variation and is therefore able to identify gene-disease associations in the context of genomic, phenotypic, and segregation analysis of rare disease patients. The result is a streamlined implementation of OUTRIDER and FRASER, complemented with Borzoi and MOLGENIS VIP. This novel workflow paves the way for pinpointing rare variants affecting gene expression and splicing using self-contained interactive reports that visualize outlier genes and prioritized patient-level variants for immediate clinical interpretation. We analysed 144 cases from different centres, a cohort representative of centres that are likely to depend on background cohorts. We demonstrate that RNA outlier analysis, despite its limitations, already aids clinical variant interpretation. Our workflow accelerates the prioritization of coding and non-coding variants, and the reclassification of clinically relevant variants of unknown significance.
While graph-based pangenomes have become a standard and interoperable foundation for comparisons across multiple reference genomes, integrating protein-coding gene annotations across pangenomes in a single 'pangene set' remains challenging, both because of methodological inconsistency and biological presence-absence variation (PAV). Here, we review and experimentally evaluate the root of genome annotation and pangene set inconsistency using two polyploid plant pangenomes: cotton and soybean, which were chosen because of their existing diverse high-quality genomic resources and the known importance of gene PAV in their respective breeding programs. We first demonstrate that building pangene sets across different genome resources is highly error prone: PAV calculated directly from the genome annotations hosted on public repositories recapitulates structure in annotation methods and not biological sequence differences. Re-annotation of all genomes with a single identical pipeline largely resolves the broadest stroke issues; however, substantial challenges remain, including a surprisingly common case where exactly identical sequences have different gene model structural annotations. Combined, these results clearly show that pangenome gene model annotations must be carefully integrated before any biological inference can be made regarding sequence evolution, gene copy-number, or PAV.
Workflow management systems (WMS) are essential for creating and automating multi-step data analyses and ensuring the reproducibility of biological insights. Although numerous WMS solutions exist, few provide deep integration of command-line software with the R and Bioconductor ecosystems, where a substantial portion of statistical modeling and downstream scientific analysis is performed by a large user base. systemPipeR addresses this gap by offering a unified environment that links R-based analytical steps with command-line tools through a standardized workflow specification. It enables the design and execution of reproducible workflows on both local and high-performance computing systems, while allowing users to select the most appropriate R or command-line tool for each analysis step. The latest version introduces a fully redesigned architecture that streamlines workflow construction, execution, monitoring, and reporting. Key enhancements include a flexible workflow management class object, integration of the Common Workflow Language (CWL), formal declaration and standardized execution of both R and command-line steps, utilities for metadata management, and automated generation of scientific and technical reports. Together, these advances establish systemPipeR as a general-purpose R-based WMS for building and executing end-to-end workflows for reproducible analysis of complex data in genomics and other data-intensive fields. The software is distributed as a free open-source Bioconductor package (https://bioconductor.org/packages/systemPipeR).
Quality control constitutes a critical component of any next-generation sequencing (NGS) pipeline; however, most existing pipelines emphasize technical quality assessment (e.g. read quality, alignment metrics, duplication rates) while overlooking other equally important dimensions, such as sample identity verification, contamination detection, kinship analysis, and metadata concordance. Detecting issues like cross-sample contamination and sample swaps is essential to control data integrity. Here, we present NGSTroubleFinder, a novel tool to detect cross-sample contamination in human whole-genome and whole-transcriptome sequencing data, sample swaps, and mismatches between the reported and the inferred genetic and transcriptomic sexes. It can be run directly on BAM/CRAM files without requiring additional variant-calling steps and offers an integrated pipeline for ensuring quality control on NGS data, generated particularly within the context of clinical studies or research projects involving family members. It produces a detailed report that combines the results of its multiple analyses, including kinship, sex prediction, and contamination metrics. The tool reports extensive information on the samples, both in textual and HTML formats, including key plots for easy interpretation of the results. NGSTroubleFinder is written in Python and incorporates a custom-built parallelized pileup engine written in C, and it can be easily installed with pip. The tool source code and the models are freely available on GitHub (https://github.com/STALICLA-RnD/NGSTroubleFinder), and a containerized version is available on Docker Hub (https://hub.docker.com/r/staliclarnd/ngstroublefinder).
Characterizing genomic properties such as genome size, ploidy level, heterozygosity, and repetitive DNA proportion and composition without relying on genome assembly is crucial for profiling the genomes of non-model species. This study compares genome profiles of Epidendrum anisatum and Epidendrum marmoratum, using flow cytometry and k-mer analysis approaches, as well as bioinformatic ploidy-level estimation and repeatome characterization. Multiple depths of coverage, k values, and software tools for genome size estimation were explored and contrasted with cytometry genome size estimations. Cytometry and k-mer analyses yielded a consistently higher genome size for E. anisatum (2.59 Gb) than E. marmoratum (1.13 Gb), a 2.3-fold genome size difference. Both species were identified as diploid with no evidence of partial endoreplication. The genomes of both species were found to be highly repetitive (63%-73%) and heavily dominated by Ogre Ty3-gypsy retrotransposons. Additionally, the genome of E. anisatum was characterized by the presence of a 172 bp satellite (AniS1), which represented 11% of the genome size. Together, both Ty3-gypsy transposons and AniS1 shape the genome size difference between the two genomes. This study highlights the importance of using flow cytometry, cytogenetic approaches, and bioinformatics techniques in conjunction for genome profiling.
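The k-mer approach to genome size rests on a simple relation: genome size is approximately the total number of genomic k-mers divided by the depth of the main (homozygous) coverage peak. The sketch below applies that relation to a k-mer histogram given as a mapping from depth to the number of distinct k-mers at that depth. The function name, the valley heuristic for discarding error k-mers, and the toy histogram are assumptions for illustration; dedicated tools fit statistical models instead, and in a highly heterozygous genome the tallest peak may be the heterozygous one.

```python
def estimate_genome_size(hist):
    """Estimate genome size from a k-mer depth histogram.

    hist maps depth -> number of distinct k-mers observed at that depth.
    Returns (estimated_size, peak_depth).
    """
    depths = sorted(hist)
    # Find the valley between the low-depth error k-mers and the genomic peak:
    # the last depth before the counts start rising again.
    valley = depths[0]
    for prev, cur in zip(depths, depths[1:]):
        if hist[cur] > hist[prev]:
            valley = prev
            break
    # Keep depths at or beyond the valley; lower depths are treated as errors.
    kept = [d for d in depths if d >= valley]
    # The tallest remaining bin is taken as the main coverage peak.
    peak = max(kept, key=lambda d: hist[d])
    # Total k-mer occurrences divided by peak depth approximates genome size.
    total_kmers = sum(d * hist[d] for d in kept)
    return total_kmers / peak, peak


if __name__ == "__main__":
    # Hypothetical toy histogram: error k-mers at depths 1-4, peak near 20.
    toy = {1: 1000, 2: 200, 3: 50, 4: 10,
           18: 100, 19: 300, 20: 600, 21: 300, 22: 100}
    size, peak = estimate_genome_size(toy)
    print(size, peak)
```

Because the estimate depends on where the peak is called, comparing results across k values and coverage depths, and against flow cytometry, is a sensible cross-check.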
Over thousands of years, selective breeding of domestic dogs (Canis lupus familiaris) has led to extensive genetic diversity, underscoring the need for refined genomic and immunogenetic investigations. This study presents a comprehensive analysis and biocuration of the immunoglobulin kappa light chain locus (IGK) across multiple dog breeds to identify breed-specific variations and assess their relevance to canine immunology and veterinary diagnostics. We examined nine canine genome assemblies to characterize structural variations, polymorphisms, and gene diversity, with the goal of enriching the IMGT® reference database and expanding its representativeness across breeds. Through in-depth annotation of seven breeds (Bernese Mountain Dog, Boxer, Cairn Terrier, Labrador Retriever, Great Dane, Basenji, and German Shepherd), we identified 40 genes and 97 alleles, highlighting both conserved regions and unique breed-specific variants. Variants were validated in silico against Sanger sequencing data. Importantly, discrepancies were observed in the CanFam3.1 Boxer reference genome, indicating possible sequencing or assembly artifacts, challenges in gene and allele nomenclature standardization, and a low-density genomic segment within the IGK locus. These findings refine current knowledge of IGK locus diversity and enhance IMGT® database accuracy, supporting future studies on immunogenetic variability, somatic hypermutation, and immune response mechanisms in canine health and disease.
With the current speed of sequencing, there is a desire for standardized and automated genome assembly and annotation to produce high-quality genomes as input for comparative (pan)genomics. Therefore, we created a convenience pipeline using existing tools that creates annotated genome assemblies from HiFi (and optionally ultra-long ONT and/or Hi-C) reads for a set of related individuals as well as a related reference genome. Our pipeline is species-agnostic and generates an extensive quality assessment report that can be used for manual filtering and refinement of the assembly and annotation. It includes statistics for individual completeness and contamination assessments as well as a concise pangenome view. The pipeline is implemented in Snakemake and available with a GPLv3 licence at GitHub under github.com/dirkjanvw/MoGAAAP, at Zenodo under doi.org/10.5281/zenodo.14833021, and can be installed through Bioconda.
Although an increasing number of long-read genome assemblies have been created from a diverse collection of dogs and wolves, most published assemblies represent the diploid genome as a single primary sequence. Here, we generate and analyze phase-resolved diploid dual assemblies from five canines. The most contiguous assemblies represent over half of the canine chromosomes as single contigs, permitting an assessment of the sequence and structure of canine chromosomes. Consistent with a telocentric classification, we find that the centromeres of canine autosomes begin an average of 59 kb from the start of the chromosome and are flanked by a 35 kb subtelomeric segment that is repeat-rich and shared across autosomes. Analysis of a pangenome graph constructed from the 10 haplotype-resolved assemblies shows that short tandem repeat loci are three times more common than variable number tandem repeat loci and that the landscape of canine structural variation features extensive allelic heterogeneity. The pangenome graph includes examples of complex, nested allelic variation involving SINEC (a carnivore-specific SINE) and LINE-1 mobile elements. Analysis of 3' transductions implicates an uncharacterized source element with high activity and demonstrates the presence of full-length LINE-1s capable of retrotransposition that are segregating among canines.
Protein translation is a highly regulated process influenced by multiple factors at the initiation, elongation, and termination stages. One notable regulatory element of the ribosome is the CAR interaction surface, a three-residue motif in the structure of the ribosome composed of C1274 and A1427 of Saccharomyces cerevisiae 18S rRNA (corresponding to C1054 and A1196 in Escherichia coli 16S rRNA) and R146 of ribosomal protein Rps3. CAR is highly conserved and positioned adjacent to the amino-acyl (A site) decoding center. It establishes hydrogen bonds with the +1 codon next in line to enter the ribosome A site, acting as an extension of the transfer RNA (tRNA) anticodon and forming base-stacking interactions with nucleotide 34 of the tRNA. However, despite CAR's enzymatically strategic positioning within the ribosome, its functional relationship with the A site remains poorly characterized. Using molecular dynamics simulations, we examined the interplay between the A site and CAR site, revealing sequence-dependent modulation of H-bonding and π-stacking interactions within and between the two sites. These findings highlight the interplay between the A site and CAR site, suggesting a structural and functional connection between these two regions of the ribosome that may contribute to messenger RNA sequence-specific tuning of translation elongation.
Codon usage bias is a fundamental feature of the genetic code, yet its impact on messenger RNA translation is incompletely defined. Here, we integrate comparative genomics, human tissue proteomes, and large cancer cell line and patient cancer datasets to reveal a conserved codon-bias axis. Across mammals, we show that GC-biased gene conversion drives human-specific drifts in GC3 (GC content at the third codon position), yet the functional dichotomy is maintained: A/T-ending codons associate with proliferation and RNA processing, while G/C-ending codons (third nucleotide guanine or cytosine) associate with differentiation and neuronal functions. At the isoacceptor level, synonymous codons segregate into distinct functional categories. To mechanistically connect codon usage to cancer, we introduce the ANN- and m7G-indices, capturing codons decoded by transfer RNAs (tRNAs) carrying the modifications t6A and m7G. Both indices negatively correlate with GC3 and enrich for pro-oncogenic proliferative pathways. Human tissue proteomes reveal strong codon bias discordance between RNA and protein levels, with nervous system tissues enriched for G/C-ending codons while proliferative organs are A/T-biased. Analysis of 2600 cancer cell lines and 21 cancer types revealed heterogeneous codon preferences in cancer cell lines but a global A/T-ending shift in human cancer-upregulated proteins. These findings establish synonymous codon divergence and tRNA modification indices as key determinants of translational reprogramming in health and cancer.
Identifying regulatory relationships between transcription factors (TFs) and genes is essential to understand diverse biological phenomena related to gene expression. Recently, deep learning-based models to predict TFs that bind to genes from nucleotide sequences of the target genes have been developed, yet these models are trained to predict known TFs only. Here, we developed a deep learning model, GReNIMJA (Gene Regulatory Network Inference by Mixing and Jointing features of Amino acid and nucleotide sequences), to predict gene regulation even by unknown TFs. Our model is designed to mix the features of the TF amino acid sequences and nucleotide sequences of the target genes using a 2D Long Short-Term Memory architecture and to perform binary classification with the aim of determining the presence or absence of a regulatory relationship. By explicitly modeling interactions between TFs and genes, our model can predict gene regulation for unknown TFs. The accuracy of our model in predicting regulatory relationships was 84.4% for known TFs (higher than those of conventional models) and 68.5% for unknown TFs; the latter is an unsolved task for conventional deep learning-based models. We expect our model to advance identification of unknown gene regulatory networks and contribute to the understanding of diverse biological phenomena.
This review examines the current landscape of federated learning frameworks to evaluate their long-term sustainability, flexibility, and usability in biomedical research, where strict data regulations limit data sharing across institutions. Through a systematic literature analysis, the study assesses these frameworks against findability, accessibility, interoperability, and reusability for research software principles and compares reported use cases to framework functionalities to identify gaps in usability and scalability. The findings reveal that while most frameworks perform well in findability and reusability, they exhibit limited interoperability both among themselves and with specific software libraries. Although often developed for particular use cases, the technical foundations of these frameworks suggest potential for broader applicability. However, the scarce integration of privacy-preserving techniques and a predominant reliance on horizontal architectures may constrain their scalability in more complex federated learning scenarios. Ultimately, this analysis highlights the necessity for federated learning frameworks to evolve toward greater interoperability, flexibility, and privacy-awareness.
RNA structure critically governs biological function in both physiological and pathological contexts, making high-resolution structural maps essential for RNA-targeted therapeutics. Yet, despite recent advances, well-validated structural targets for drug design remain limited. To help bridge this gap, we generated the first genome-scale map of the human RNA structurome by applying ScanFold to >230 000 annotated human pre-mRNA transcripts, identifying sequences likely evolved to form highly stable and functional secondary structures. We also performed a global analysis of regions with z-scores ≤ -2 and statistically characterized their two-dimensional folding patterns. In addition, we developed the RNA-Annotator Pipeline to integrate 20 diverse biological annotations, such as tissue-specific expression and protein interactions, with the structural data. Our results reveal local folding propensities and unusually stable structures with high-confidence architectures, providing insights for prioritizing RNA targets and guiding therapeutic design, including antisense oligonucleotides and small molecules. All ScanFold results are publicly available through RNAStructuromeDB. Using the RNA-Annotator Pipeline, analysis of SMN1 and SMN2 pre-mRNAs showed that a single C-to-T transition in SMN2 induces structural rearrangements that disrupt a critical splicing enhancer. This toolkit establishes an integrated workflow that enables researchers to explore RNA structure-function relationships and accelerate advances in RNA-targeted drug discovery and RNA biology.
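The z-score threshold mentioned above can be made concrete with the commonly used thermodynamic z-score: the minimum free energy (MFE) of the native window compared with the MFEs of shuffled versions of the same sequence. The sketch below assumes the MFE values have already been computed by an external folding program. The function names and the example energies are hypothetical, and simple mononucleotide shuffling is used here as an assumption about the null model, not as a statement of how ScanFold is implemented.

```python
import random
import statistics


def shuffle_sequence(seq, rng):
    """Return a mononucleotide-shuffled copy of seq (composition preserved)."""
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)


def thermodynamic_z_score(native_mfe, shuffled_mfes):
    """z = (native MFE - mean of shuffled MFEs) / sample std dev of shuffled MFEs.

    Negative values mean the native sequence folds more stably than
    composition-matched random sequences.
    """
    mean = statistics.mean(shuffled_mfes)
    sd = statistics.stdev(shuffled_mfes)
    if sd == 0:
        raise ValueError("shuffled MFEs have zero variance; z-score undefined")
    return (native_mfe - mean) / sd


if __name__ == "__main__":
    rng = random.Random(0)
    print(shuffle_sequence("GGCAUUGCC", rng))  # one composition-matched control
    # Hypothetical MFE values in kcal/mol, for illustration only.
    z = thermodynamic_z_score(-30.0, [-20.0, -22.0, -18.0, -20.0])
    print(z, z <= -2)
```

Under this definition, a window with z ≤ -2 is at least two standard deviations more stable than its shuffled controls, which is why such regions are treated as candidates for ordered, potentially functional structure.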
The growing availability of high-dimensional, complex datasets demands analysis methods that are both interpretable and flexible. We introduce FlowSets, a novel framework for identifying and analysing flows (fuzzy, interpretable patterns) across multiple, diverse, and heterogeneous datasets. Views are constructed as an interpretation of biological features of the datasets by grouping absolute or relative values, summary statistics (such as fold changes or P-values), or higher-order comparisons into fuzzy, linguistically defined categories based on their underlying distributions. FlowSets builds on these fuzzy categorizations and then tracks how features transition between categories across different conditions or data types, uncovering structural patterns that conventional methods often overlook. The FlowSets framework enables users to define, analyse, and manipulate complex patterns across heterogeneous datasets in a flexible and interpretable manner. With FlowSets, users can visualize feature flows, quantify pattern memberships, and perform enrichment analysis explicitly designed for sets with gradual memberships. This approach offers a robust and customizable alternative to rigid clustering or hard thresholding, allowing for a more transparent and insightful interpretation of multidimensional biological data.
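To make the idea of fuzzy, linguistically defined categories and gradual pattern membership more tangible, here is a minimal generic sketch. It is not the FlowSets API: the triangular membership functions, the category boundaries, and the product rule for combining memberships along a flow are all assumptions chosen for illustration.

```python
def triangular(x, left, center, right):
    """Triangular membership: 0 outside (left, right), 1 at center."""
    if x <= left or x >= right:
        return 0.0
    if x == center:
        return 1.0
    if x < center:
        return (x - left) / (center - left)
    return (right - x) / (right - center)


# Hypothetical categories for a score such as a log fold change.
CATEGORIES = {
    "low": (-1.0, 0.0, 1.0),
    "medium": (0.0, 1.0, 2.0),
    "high": (1.0, 2.0, 3.0),
}


def memberships(x):
    """Degree to which value x belongs to each linguistic category."""
    return {name: triangular(x, *params) for name, params in CATEGORIES.items()}


def flow_membership(values, path):
    """Membership of one feature in a flow, i.e. a sequence of categories
    across conditions. Assumption: memberships combine by product."""
    degree = 1.0
    for value, category in zip(values, path):
        degree *= memberships(value)[category]
    return degree


if __name__ == "__main__":
    print(memberships(0.5))
    # A feature measured in two conditions, scored against the flow low -> high.
    print(flow_membership([0.5, 1.5], ["low", "high"]))
```

Unlike a hard threshold, a value of 0.5 here belongs partly to "low" and partly to "medium", so a feature can contribute to several flows with different weights. A practical implementation would also need shoulder-shaped functions so that extreme values keep full membership in the outermost categories.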
Large Language Models (LLMs) are increasingly applied to genomic tasks, yet core challenges remain concerning tokenization, evaluation, and data scarcity. This study focuses on promoter classification and systematically evaluates four tokenization methods: non-overlapping 6-mer, overlapping 6-mer, Byte Pair Encoding (BPE), and WordPiece (WPC). We show that the commonly used k-mer approach, specifically the non-overlapping variant, outperforms BPE and WPC across eight organisms, challenging assumptions derived from natural language processing. To ensure robustness, we evaluated performance under two distinct negative data strategies: positive-promoter-shuffled and random-non-promoter-fragments. Using a positional SHAP framework, we demonstrate that the model learns biologically plausible positional patterns rather than exploiting artifacts from these negative data generation processes. Furthermore, evolutionary-informed transfer learning experiments and external validation on an unseen organism reveal that training on phylogenetically related species significantly improves performance, particularly in low-data regimes. These findings underscore the significant impact of tokenization and negative data design, providing practical guidance for refining genomic classifiers.
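The difference between the two k-mer tokenization schemes compared above comes down to the stride used when sliding over the sequence. The sketch below shows both with a single hypothetical helper; it covers only the splitting step and says nothing about vocabulary construction or how the study's models consume the tokens.

```python
def kmer_tokens(seq, k=6, overlapping=False):
    """Split a DNA sequence into k-mers.

    overlapping=False uses a stride of k (non-overlapping tokens);
    overlapping=True uses a stride of 1. Trailing bases that do not
    fill a complete k-mer are dropped.
    """
    step = 1 if overlapping else k
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, step)]


if __name__ == "__main__":
    seq = "ACGTACGTACGT"  # hypothetical 12-base sequence
    print(kmer_tokens(seq))                    # 2 tokens
    print(kmer_tokens(seq, overlapping=True))  # 7 tokens
```

For a sequence of length n, the non-overlapping variant yields about n/k tokens while the overlapping variant yields n - k + 1, so the choice also changes how much sequence fits in a fixed model context window.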
Noncoding RNAs <200 nucleotides (nt) in length are referred to as short noncoding RNAs (sncRNAs) and include microRNAs (miRNAs), piwi-interacting RNAs, small nucleolar RNAs, and transfer RNAs, among others. One striking example of the regulatory capabilities of sncRNAs comes from a group of small yet potent RNAs called miRNAs. MiRNAs are ∼20-nt RNAs excised from longer pre-miRNA hairpins, and to date, thousands of miRNAs have been identified across an array of species, with specific roles for miRNAs defined in virtually every cellular activity (e.g. growth, differentiation, apoptosis, and disease). Importantly, studies aimed at evaluating the transcriptomic changes of miRNAs have now revealed the existence of miRNA-like fragments derived from other types of sncRNAs and suggest that similar regulatory capacities may be associated with these novel sncRNA fragments. Unfortunately, many biologically relevant sncRNA-excised fragments remain uncharacterized, both because they are routinely excluded during initial miRNA characterizations as "sncRNA degradation products" and because nearly all sncRNA informatic analyses continue to assess only annotated miRNA expression. To address this, several platforms aimed at identifying novel sncRNA fragments have recently been developed. That said, the principal analytical tools currently employed to characterize novel sncRNA fragments often require significant computational expertise, hindering their widespread use. As such, the development of a user-friendly platform that requires minimal programming experience yet can identify and characterize RNA fragments excised from any sncRNA from any species is highly desirable and potentially impactful. In light of this, we have developed FragmentFinder, an intuitive, Windows-executable resource designed to require absolutely no computational background and capable of accurately characterizing all (annotated and unknown) sncRNA-derived RNAs within a raw small RNA sequencing file in real time.
Assay for transposase-accessible chromatin using sequencing (ATAC-seq) is a cornerstone for epigenomic profiling, yet its potential for genomic characterization remains poorly explored. Here, we systematically benchmarked bulk ATAC-seq against whole-genome sequencing (WGS) to assess its capacity for detecting small variants, copy number variations (CNVs), telomere-associated repeat content, and mitochondrial single-nucleotide polymorphisms in cancer cells. Using paired datasets from patient-derived melanoma cell lines and from TCGA primary brain tumors, we demonstrated that ATAC-seq achieves high precision in small-variant detection within accessible regions (supporting cohort-scale genotyping and genetic stratification), robustly resolves CNVs in the nuclear genome, and supports high-coverage mitogenome profiling, with strong concordance to WGS at standard sequencing depths. Notably, we present the first systematic evaluation of telomere-associated repeat content by ATAC-seq, revealing its untapped potential for studying genome stability. By bridging genomic and epigenomic insights into a single genome-wide approach, bulk ATAC-seq emerges as a cost-effective and versatile tool poised to transform cancer research and to support integrative molecular profiling in clinical settings.
Copy number variations (CNVs) are genomic alterations that can cause rare genetic diseases. Their interpretation requires consulting multiple databases and following classification guidelines, such as those from the American College of Medical Genetics (ACMG) and the Clinical Genome Resource (ClinGen). In France, the Achro-Puce working group has also established recommendations to support CNV interpretation. Despite these resources, CNV analysis remains time-consuming, as it requires reviewing gene content, regulatory elements, and associated syndromes. To address this challenge, we developed CNV-Hub, a web-based platform that streamlines CNV classification and interpretation. CNV-Hub integrates five algorithms: two based on ACMG recommendations (AnnotSV, ClassifyCNV), two using machine learning (X-CNV, ISV), and one specifically developed according to French guidelines. In addition to automated pathogenicity predictions, CNV-Hub provides annotations for each CNV, including gene dosage sensitivity scores (pHaplo, pTriplo), syndrome associations, and direct links to databases such as OMIM and PubMed. The platform's user-friendly interface enables rapid, evidence-based CNV evaluation. By incorporating machine learning among its classification algorithms, CNV-Hub improves the interpretation of uncertain variants by integrating additional parameters. This tool reduces the time required for CNV analysis while maintaining accuracy and reliability, representing a significant advance in molecular cytogenetics and supporting geneticists in clinical decision-making.