Establishing best practice recommendations helps to increase consistency, equity and innovation in clinical genomics services. Bioinformatics approaches are a core component of clinical genomics services that use high-throughput genomic sequencing applied in the diagnosis of rare disorders and cancer. While a broad range of international recommendations exist for genomic diagnostic testing and genetic variant classification, the current UK-specific best practice recommendations for bioinformatics approaches applied in this context are outdated. We assembled a team of bioinformaticians and scientists with diverse expertise in rare disease and cancer genomics applied in clinical diagnostics within the UK National Health Service. Through structured discussion, polls and surveys, we developed an updated set of best practice recommendations for bioinformatics approaches applied to high-throughput genomic sequencing in clinical genomic testing. We provide best practice recommendations across the spectrum of activities within a clinical genomics bioinformatics pipeline, including quality control, primary, secondary and tertiary analysis approaches and shared knowledge bases. We also comment on issues related to software development and maintenance. The recommendations can be applied to multiple sequencing technologies and encompass both targeted and whole genome sequencing approaches applied to germline and tumour DNA samples. The best practice recommendations outlined in this study provide a national framework for adoption and innovation of bioinformatics approaches across diverse clinical genomic testing strategies in the UK National Health Service.
Breakthrough advancements in protein tertiary and quaternary structure prediction have accelerated structural bioinformatics research activity and drug development processes. However, many biological mechanisms involve more complicated interactions, such as those between amino and nucleic acids. Predicting the structure of protein-RNA complexes is highly relevant and challenging due to data scarcity and experimental difficulties. Understanding and interpreting these interactions can yield crucial insights into various human diseases and biological phenomena. Thus, quality assessment methods that specifically evaluate protein-RNA complex models can provide significant utility in this emerging area of protein-RNA structural bioinformatics research. We propose a novel graph transformer-based approach named CARP (complex quality assessment of RNA and protein) to infer multiple quality perspectives of protein-RNA complex models. For a single protein-RNA complex model, in one shot, CARP simultaneously predicts multiple overall fold, overall interface, and per-protein-RNA interface quality estimates. When evaluated against a non-redundant protein-RNA docking benchmark, our method demonstrated clearly improved performance compared to almost all existing scoring tools, particularly when ordering and selecting the highest quality decoys. Furthermore, CARP consistently selected higher quality models relative to other predictors when tested on CASP16 targets. Specifically, CARP-predicted global interface and global protein-RNA interface qualities were ranked first and second, respectively, based on the selected top-3 models over all ten CASP16 protein-RNA complex targets. CARP also showed a strong ability, compared to both existing tools and AlphaFold3 self-estimates, to select high quality AlphaFold3 models. CARP is freely available at github.com/zwang-bioinformatics/CARP/. Supplementary information and data are available at Bioinformatics online.
Drug combinations are crucial for overcoming resistance in cancer therapy. Although deep learning has achieved strong performance in synergy prediction, existing models often treat cell-specific features and paired drugs as a static background and fail to capture how the specific cell-drug environment dynamically modulates drug representations, thereby hindering the modeling of environment-specific synergistic effects. We propose Env-Syn, a framework for modeling drug-drug-cell interactions through Environment-Conditioned Feature Modulation, which incorporates a Residual Feature-wise Linear Modulation (R-FiLM) module to perform precise affine transformations on drug representations conditioned on paired drugs and cellular environments. Benchmark evaluations show that Env-Syn consistently outperforms state-of-the-art methods. Notably, the model exhibits exceptional generalization performance in rigorous inductive scenarios. It maintains high predictive accuracy for unseen drugs, with AUROC and AUPRC exceeding 0.81 in the Leave-drug-out setting, and further demonstrates strong cross-dataset reliability by surpassing a recall of 0.7 on an independent test set. Furthermore, among 15 novel predicted drug combinations, eight are directly supported by literature evidence. These results demonstrate that Env-Syn is an effective computational tool for drug synergy discovery. The source code is available at https://github.com/AnQi-87/Env-Syn. Supplementary data are available at Bioinformatics online.
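To make the idea of a conditioned affine transformation concrete, the following is a minimal sketch of feature-wise linear modulation with a residual connection. Standard FiLM computes gamma(c) * h + beta(c) from a conditioning vector c; the residual form used here (h + gamma * h + beta), the single linear layers, and all names are illustrative assumptions rather than the Env-Syn implementation.

```python
from typing import List, Sequence

Vector = List[float]


def linear(weights: Sequence[Sequence[float]], bias: Sequence[float],
           x: Sequence[float]) -> Vector:
    """Dense layer y = W x + b, written with plain lists."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def residual_film(drug: Sequence[float], context: Sequence[float],
                  w_gamma, b_gamma, w_beta, b_beta) -> Vector:
    """Modulate a drug embedding by a context vector (hypothetical R-FiLM form).

    context stands for features of the paired drug and the cell line.
    gamma and beta are per-feature scale and shift predicted from context;
    the input is added back as a residual (an assumption of this sketch).
    """
    gamma = linear(w_gamma, b_gamma, context)
    beta = linear(w_beta, b_beta, context)
    return [h + g * h + b for h, g, b in zip(drug, gamma, beta)]


if __name__ == "__main__":
    drug = [1.0, -2.0]
    context = [0.5, 0.5, 1.0]
    zeros = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
    # With zero weights, gamma and beta reduce to their biases.
    print(residual_film(drug, context, zeros, [1.0, 1.0], zeros, [0.5, 0.5]))
```

With all modulation parameters at zero the function returns the drug embedding unchanged, which is the usual motivation for a residual formulation.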
Multiple sequence alignment (MSA) remains a core problem in bioinformatics, yet most widely used alignment methods still rely on static amino acid substitution matrices that cannot adapt to sequence-specific context. BABAPPAlign is a progressive MSA engine that replaces static substitution scoring with a trained residue-level scorer operating on fixed protein-language-model embeddings, while retaining exact affine-gap dynamic programming. It also provides an integrated codon-aware alignment mode. Using BAliBASE as the primary inferential benchmark, with supporting external validation on deterministic subsets of PREFAB and HOMSTRAD, the learned backend outperformed matched in-engine EBA-style cosine and BLOSUM62 controls, and also exceeded MAFFT. BABAPPAlign is implemented in Python and distributed as an open-source command-line package through PyPI; the source code is available at https://github.com/sinhakrishnendu/BABAPPAlign, the archived software release is available at https://doi.org/10.5281/zenodo.17934124, and the pretrained model weights are available at https://doi.org/10.5281/zenodo.18053200. Supplementary material is available at Bioinformatics online.
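The combination of a pluggable residue scorer with exact affine-gap dynamic programming can be sketched as follows. This is a generic three-state (Gotoh-style) global alignment score, not BABAPPAlign's code; the cosine scorer stands in for the learned scorer, and the gap penalties and function names are assumptions chosen for illustration.

```python
import math
from typing import Callable, Sequence

NEG_INF = float("-inf")


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two residue embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def affine_gap_score(x, y, score: Callable, gap_open=-2.0, gap_extend=-0.5) -> float:
    """Optimal global alignment score; a gap of length k costs
    gap_open + (k - 1) * gap_extend."""
    n, m = len(x), len(y)
    # M: x[i-1] aligned to y[j-1]; X: x[i-1] against a gap; Y: y[j-1] against a gap.
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(1, n + 1):
        X[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = score(x[i - 1], y[j - 1])
            M[i][j] = s + max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            X[i][j] = max(M[i - 1][j] + gap_open,
                          X[i - 1][j] + gap_extend,
                          Y[i - 1][j] + gap_open)
            Y[i][j] = max(M[i][j - 1] + gap_open,
                          Y[i][j - 1] + gap_extend,
                          X[i][j - 1] + gap_open)
    return max(M[n][m], X[n][m], Y[n][m])


if __name__ == "__main__":
    a = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
    b = [[1.0, 0.0], [1.0, 0.0]]
    print(affine_gap_score(a, b, cosine))
```

Because the scorer is passed in as a function, a substitution-matrix lookup or a learned model could be swapped in without touching the recurrence.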
Allele typing for Human Leukocyte Antigen (HLA) genes has many important clinical applications. Popular short-read typing can only accurately distinguish alleles at the coding sequence level, which potentially limits our understanding of the effect of variants in non-coding regions. Long-read data have proven useful for typing HLA alleles at full resolution, but only a few tools are publicly available, and they have significant limitations in practical application. We developed FuFiHLA, a lightweight open-source tool for typing HLA alleles. It currently supports typing alleles of six HLA genes (HLA-A, HLA-B, HLA-C, HLA-DRB1, HLA-DQA1, and HLA-DQB1) from long reads. Evaluation using 233 PacBio HiFi WGS samples from HPRC shows that FuFiHLA achieves 99.6% accuracy in full-field allele typing and a QV of 51.8 for consensus allele sequence construction. Additional testing on four Nanopore R10 reads demonstrates slightly reduced accuracy in the fourth field. FuFiHLA is available at https://github.com/jingqing-hu/FuFiHLA under the MIT License. Supplementary data are available at Bioinformatics online.
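For readers less familiar with quality values, the reported QV can be translated into a per-base error rate. The sketch below assumes the QV is Phred-scaled (QV = -10 log10 of the error probability), which is the common convention but is not stated explicitly in the abstract.

```python
import math


def qv_to_error_rate(qv: float) -> float:
    """Convert a Phred-scaled quality value to an error probability."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate: float) -> float:
    """Convert an error probability back to a Phred-scaled quality value."""
    return -10 * math.log10(error_rate)


if __name__ == "__main__":
    rate = qv_to_error_rate(51.8)
    print(f"error rate: {rate:.2e}, about one error per {1 / rate:,.0f} bases")
```

Under this assumption, a QV of 51.8 corresponds to roughly one consensus error per 150,000 bases.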
Proteins change shape as they work, and these changing states control whether binding sites are exposed, signals are relayed, and catalysis proceeds. Most protein language models pair a sequence with a single structural snapshot, which can miss state-dependent features central to interaction, localization, and enzyme activity. Studies also indicate that many proteins assume multiple, functionally relevant shapes, motivating approaches that learn from this variability. We present DynamicsPLM, a protein language model conditioned on ensembles of computationally generated conformations to derive state-aware representations. DynamicsPLM improves predictive performance across protein-protein interaction, subcellular localization, enzyme classification, and metal-ion binding. On a widely used protein-protein interaction benchmark, it achieves a four-point accuracy gain over the strongest baseline. On a curated test set enriched for proteins with multiple conformational states, the margin increases to eleven points. These findings argue for a shift from static to dynamics-aware modeling, in which conformational variability is treated as informative. By elevating conformational state to a central element of machine learning in protein biology, this work advances modeling toward mechanisms that better reflect how proteins operate in cells and provides a route to actionable hypotheses about when and how binding, signaling, and catalysis occur. Code, model weights, and inference scripts are available at https://github.com/kalifadan/DynamicsPLM (DOI: https://doi.org/10.5281/zenodo.17668302). Supplementary data are available at Bioinformatics online.
Accurate prediction of drug response remains a major challenge in precision oncology, particularly at the single-cell level and in clinical settings, due to significant distribution shifts between preclinical models and real-world patient data. Existing approaches often rely on transfer learning from cell lines to target domains, but typically require access to target-domain data during training, which is frequently unavailable in practice. We propose FourierDrug, a novel domain generalization framework for robust drug response prediction. Given gene expression profiles, the model performs Fourier transformation to project features into the frequency domain and introduces an asymmetric attention mechanism that encourages drug-sensitive samples to form compact clusters while driving resistant samples to be more dispersed. This design facilitates the learning of domain-invariant yet task-relevant representations. Extensive experiments demonstrate that FourierDrug effectively leverages diverse source domains and generalizes well to unseen cancer types. Notably, when evaluated on single-cell and patient-level prediction tasks, our method, trained solely on in vitro cell line data without access to target-domain data, consistently outperforms or matches state-of-the-art approaches. The source code and processed datasets are available at: https://github.com/hliulab/FourierDrug. Supplementary data are available at Bioinformatics online.
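As a rough illustration of projecting an expression profile into the frequency domain, the sketch below applies a plain discrete Fourier transform to a feature vector and separates amplitude from phase. How FourierDrug orders features, and which spectral components it uses downstream, is not described in the abstract; the function names and the toy profile are assumptions.

```python
import cmath
import math
from typing import List, Sequence, Tuple


def dft(values: Sequence[float]) -> List[complex]:
    """Naive O(n^2) discrete Fourier transform of a real-valued vector."""
    n = len(values)
    return [sum(v * cmath.exp(-2j * math.pi * k * t / n)
                for t, v in enumerate(values))
            for k in range(n)]


def amplitude_phase(values: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Split the spectrum into amplitude and phase components."""
    spectrum = dft(values)
    return [abs(c) for c in spectrum], [cmath.phase(c) for c in spectrum]


if __name__ == "__main__":
    expression = [2.0, 0.5, 1.5, 0.0]  # toy expression profile
    amplitude, phase = amplitude_phase(expression)
    print(amplitude)
    print(phase)
```

A production implementation would use a fast Fourier transform from a numerical library; the naive version is kept here only to avoid dependencies.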
The assessment of aberrant transcription events in rare disease patients holds great promise for enhancing the prioritization of causative genes, a strategy already widely adopted in clinical settings to improve diagnostic accuracy. Nevertheless, the accurate identification of causal genes remains a substantial challenge. We propose AXOLOTL, a novel ensemble method for identifying aberrant gene expression events in RNA expression matrices. AXOLOTL effectively accounts for gene correlation by incorporating coexpression constraints. We demonstrated the superior performance of AXOLOTL on representative RNA-seq datasets, including those from the GTEx healthy cohort, mitochondrial disease cohorts, and collagen VI-related dystrophy cohorts. Furthermore, we applied AXOLOTL to real-world cases of neurological disorders and demonstrated its ability to accurately identify aberrant gene expression and facilitate the prioritization of pathogenic variants. AXOLOTL is freely available on GitHub (https://github.com/xuwenjian85/axolotl) and Zenodo (https://doi.org/10.5281/zenodo.17940844). Supplementary data are available at Bioinformatics online.
Dimensionality reduction for single-cell RNA-sequencing (scRNA-seq) data involving multiple biological samples presents a significant analytical challenge. We introduce MUlti-Sample Trajectory-Assisted Reduction of Dimensions (MUSTARD), an innovative trajectory-guided dimensionality reduction method specifically designed for multi-sample, multi-condition scRNA-seq data. By integrating pseudotemporal information, MUSTARD provides a comprehensive unsupervised approach that simultaneously captures major gene expression variation patterns along pseudotime trajectories and across multiple samples, facilitating the discovery of biologically meaningful sample heterogeneity, endotypes, and associated gene markers and modules. In data-driven simulations, MUSTARD outperformed existing methods in distinguishing sample groups, achieving superior out-of-sample prediction accuracy. In two COVID-19 datasets and a tuberculosis dataset, MUSTARD identified components linked to symptom severity, batch effect, and other known biological variations, with notable overlap in immune response genes across the two independent COVID-19 datasets. These results underscore MUSTARD's flexibility and power in identifying biologically relevant sample heterogeneity across diverse datasets. The R package MUSTARD with a detailed user manual is publicly available at https://github.com/haotian-zhuang/MUSTARD and Zenodo (DOI: 10.5281/zenodo.18293392). The source code to reproduce the results in this paper is available at https://github.com/haotian-zhuang/MUSTARD_Paper and Zenodo (DOI: 10.5281/zenodo.18293392). Supplementary data are available at Bioinformatics online.
T cell receptor (TCR)-peptide interactions (TPI) are among the most important components of T cell immunity. Experimental identification of TPI is time-consuming and labor-intensive; therefore, it is necessary to develop computational prediction methods that exploit existing data to predict TPI. We pre-train two language models (∼152M parameters) on large collections of TCR and peptide sequences, respectively, and integrate them into a sequence-only prediction framework, RoBERTcr, with supervised fine-tuning (SFT). Visualization of amino acid embeddings from the pre-trained language models (PLMs) shows clusters that reflect biochemical properties, and our PLMs outperform existing protein language models (i.e., ESM and ProtTrans) under the same conditions. RoBERTcr achieved higher performance than other state-of-the-art methods based on structures or sequences without dataset bias. The visualization of attention from our framework captures valuable spatial information, suggesting that TCR residues contacting the peptide are key to the interaction. RoBERTcr is freely available at https://fca_icdb.mpu.edu.mo/robertcr/ and https://zenodo.org/records/19042627. Supplementary data are available at Bioinformatics online.
The biological functions of RNAs are tightly connected to their specific RNA structures. As experimental techniques to determine high-accuracy structures are costly and time-consuming, computational prediction approaches have become indispensable for biological RNA research; most notably, the prediction of minimum free energy secondary structures. Pseudoknots are prevalent, highly significant structural motifs, yet they are commonly ignored to achieve acceptable efficiency. Existing reliable pseudoknot prediction methods typically have prohibitive complexity. A route to fast scalable pseudoknot prediction was suggested with HFold following the hierarchical folding hypothesis. Recent successful sparsification of the CCJ pseudoknot prediction algorithm in Knotty promises a further boost by introducing this technique to hierarchical folding. We introduce Spark, a sparsified algorithm for predicting pseudoknotted RNA structures. Spark predicts exactly the same minimum-energy structures as its predecessor HFold in the accurate HotKnots 2.0 energy model for pseudoknots. While sparsification maintains exact energy minimization and theoretical complexity, it strongly improves the time and space consumption over HFold. We benchmarked the performance of Spark against HFold and, as a pseudoknot-free baseline, RNAfold. Compared with HFold, Spark substantially reduces both run time and memory usage, while achieving run times close to RNAfold. Across all tested sequence lengths, Spark used the least memory and consistently ran faster than HFold. Combining sparsification and hierarchical folding in Spark results in a remarkably fast and memory-efficient tool for the accurate prediction of pseudoknotted RNA structures. Consequently, Spark practically enables pseudoknot prediction at large scale and even for very long RNA sequences.
Spark software is available on Github (https://github.com/TheCOBRALab/Spark), with a permanent archive of the software and results deposited on Zenodo (https://doi.org/10.5281/zenodo.19073315). Supplementary data are available at Bioinformatics online.
The AGENT project established a network of actively cooperating European genebanks, integrating genomic and phenotypic data from accessions of wheat and barley. Due to specific storage demands for phenotypic and genotypic data, the project used separate database instances and backend technologies to manage integrated phenotypic and genotypic data. We discuss the challenges encountered when integrating dispersed data to serve through a single interface such as the Plant Breeding Application Programming Interface, BrAPI. We examine how the consistent mappability of genebank data to the BrAPI model can enable the implementation of effective services. The advantages of BrAPI in transparently linking distributed data entities through embedded, unique identifiers are highlighted. We present a technical solution involving a BrAPI proxy, which combines and merges separate BrAPI endpoints. Finally, we demonstrate the AGENT BrAPI implementation with an illustrative example that validates a suggested SNP for a trait from the literature by linking phenotypic, genotypic and passport data. The BrAPI proxy implementation and documentation are available at the Python Package Index (https://pypi.org/project/brapi-proxy) and archived in Zenodo (DOI: 10.5281/zenodo.19436445). Supplementary data are available at Bioinformatics online.
Understanding chemical reactions requires bridging fine-grained molecular edits with broader semantic context. Reaction mechanisms are determined not only by local atom-bond transformations but also by the global reaction class. However, most existing approaches treat these tasks separately or rely on external atom-mapping tools, introducing noise and limiting end-to-end learnability. We introduce MARCC (Mapping-Assisted Reaction Center and Classification), a multi-task graph neural network that jointly predicts atom mappings, reaction centers, and reaction classes within a unified architecture. MARCC integrates three key innovations: (i) a mapping-guided cross-attention mechanism that aligns reactants and products for local edit detection, (ii) a dual-graph design that explicitly reasons about bond-level transformations, and (iii) pooled product embeddings for global reaction classification. On the USPTO-50K benchmark, MARCC achieves state-of-the-art results when trained with both reactants and products, including 98.2% atom mapping accuracy, 99.1% Top-1 edit localization accuracy, and 97.2% reaction classification accuracy. Even under the products-only setting, MARCC delivers competitive performance comparable to specialized baselines. Ablation studies confirm the value of mapping-guided attention and multi-task supervision, which enhance both predictive accuracy and interpretability. By unifying atom-level alignment, local reactivity, and global classification, MARCC provides a structured and interpretable framework for reaction understanding. Beyond benchmarks, MARCC has the potential to support applications in reaction annotation, template discovery, and mechanism inference; with additional domain-specific modeling and data, it could be extended to biochemical domains such as enzyme-catalyzed transformations and metabolic pathway modeling. 
The source code and implementation details are available at https://github.com/maryamastero/MARCC and archived at https://doi.org/10.5281/zenodo.18500230. Supplementary data are available at Bioinformatics online.
Genome-scale metabolic network (GSMN) models enable flux-based metabolite fate discovery, metabolic engineering, drug target identification, and multi-omics integration. However, programming requirements, architectural complexity, and limited visualization support impede their adoption by the broader scientific community. Existing tools specialize exclusively in either GSMN analysis or visualization and lack important features such as pathway-specific views, database-integrated refinement, and comprehensive enrichment and perturbation analyses. Here, we present NAViFluX (metabolic Network Analysis and Visualization of Flux), a visualization-centric, web browser-based tool that unifies native pathway/subsystem map generation, interactive model refinement via KEGG/BiGG, pathway merging and modules for flux computations, topology, and functional enrichment, all within network views. Using three independent case studies on Escherichia coli, we demonstrate the utility of NAViFluX for characterizing nutrient-specific metabolic adaptations, enhancing gene essentiality predictions and interpretability, and rationally designing an optimized carbon-fixing metabolic state. All source code and supplementary files associated with the case studies are publicly available via Zenodo at https://zenodo.org/records/19107831. NAViFluX can be easily installed as standalone software through https://github.com/bnsb-lab-iith/NAViFluX. Supplementary data are available at Bioinformatics online.
Predicting the thermodynamic stability of proteins upon single-point mutations is a pivotal step in both protein engineering and medicine. Computational methods for this task extract features at either the local or the global level, and each approach has its own advantages and limitations. To leverage the advantages of both feature types, we developed MuFaDDG, a novel sequence-based method that integrates multiscale feature fusion for improved prediction of protein stability changes (ΔΔG). MuFaDDG achieves performance comparable to existing methods on the S669 benchmark and performs strongly on stabilizing mutations. Notably, it shows a significant advantage in the ACC metric, with values of 0.75, 0.88, and 0.81 on the direct, reverse, and overall datasets of the CAGI5 Challenge's Frataxin, respectively. Furthermore, our method outperforms leading sequence-based approaches including THPLM, DDGemb, DDGun, and INPS-Seq on protein Myoglobin stability prediction. Additionally, MuFaDDG demonstrates exceptional predictive performance, with higher PCC and ACC, on the protein ThreeFoil, which is not curated in the FireProtDB and ProThermDB databases. The source code and data are available at https://github.com/PengjiaMa23/MuFaDDG. Supplementary data are available at Bioinformatics online.
How individuals with conditions, disabilities or abnormalities were treated gives us valuable insights into past societies. Chromosomal aneuploidies, the presence of an abnormal number of copies of the chromosomes, represent the most common large-scale chromosomal abnormalities in human populations. Chromosomal aneuploidies can affect autosomal chromosomes (e.g. Down syndrome) as well as the sex chromosomes (e.g. Klinefelter syndrome), with physical manifestations ranging from mild to severe. While simple to identify genetically, chromosomal aneuploidies are difficult to diagnose from skeletal remains alone, as they present skeletal pathologies consistent with many other conditions. Here we present ChASM (Chromosomal Aneuploidy Screening Methodology), a statistically rigorous Bayesian method for detecting full autosomal and sex chromosomal aneuploidies. The method leverages chromosome-wise read counts and takes into account differences in sequencing methodology, genetic coverage and condition rarity to produce posterior probability estimates for the screening of small and large databases of sequence data. To facilitate ease of use, ChASM has been implemented in R as the package RChASM. RChASM is available under the MIT license on the Comprehensive R Archive Network. Supplementary data are available at Bioinformatics online.
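The general logic of turning chromosome-wise read counts and condition rarity into posterior probabilities can be illustrated with a deliberately simplified model. The sketch below uses a binomial likelihood for reads on one chromosome and a prior over copy number; the expected read fraction, the prior values, and all function names are assumptions, and ChASM's actual model (which also accounts for sequencing methodology and coverage) is not reproduced here.

```python
import math
from typing import Dict


def log_binom_pmf(k: int, n: int, p: float) -> float:
    """Log probability of k successes in n trials with success rate p."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))


def expected_fraction(base_fraction: float, copies: int) -> float:
    """Expected share of reads on a chromosome present in `copies` copies,
    given its share under the usual two copies."""
    scaled = base_fraction * copies / 2
    return scaled / (scaled + (1 - base_fraction))


def copy_number_posterior(chrom_reads: int, total_reads: int,
                          base_fraction: float,
                          priors: Dict[int, float]) -> Dict[int, float]:
    """Posterior probability of each copy number via Bayes' rule."""
    log_post = {c: math.log(prior) + log_binom_pmf(
                    chrom_reads, total_reads, expected_fraction(base_fraction, c))
                for c, prior in priors.items()}
    top = max(log_post.values())  # subtract the maximum for numerical stability
    weights = {c: math.exp(v - top) for c, v in log_post.items()}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}


if __name__ == "__main__":
    priors = {1: 0.001, 2: 0.998, 3: 0.001}  # hypothetical rarity prior
    print(copy_number_posterior(2233, 100_000, 0.015, priors))
```

The prior encodes how rare each condition is, so a modest excess of reads is not enough to overturn the two-copy hypothesis, while a large and consistent excess is.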
Drug repurposing leverages existing drugs for new indications, accelerating drug development. Computational methods integrating diverse biological and chemical data can systematically prioritize repurposing candidates, but standardized benchmarks for deep learning evaluation are lacking. We present KG-Bench, a benchmarking framework designed to systematically compare the performance of different graph neural network (GNN) architectures on drug-disease association prediction using the Open Targets dataset. We constructed a knowledge graph (KG) of drugs, diseases, and targets, including annotations such as therapeutic area and molecular pathway, and ensured retrospective validation by leveraging regular dataset updates. To avoid data leakage, we removed redundant entities across splits. Benchmarking six GNN architectures, RGCN achieved the highest ranking performance (AUC: 0.91), while TransformerConv showed superior robustness under class imbalance (F1: 0.28 at a 1:100 positive:negative ratio), characteristic of real drug repurposing datasets. KG-Bench also assesses bias and node/feature importance, and uses GNNExplainer for interpretability. Our open-source framework enables fair, reproducible evaluation of graph-based drug repurposing algorithms. Data and code are available at https://github.com/cmbi/Benchmark_GNN_OpenTargets. Supplementary data are available at Bioinformatics online.
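The effect of class imbalance on F1 can be shown with a short calculation: for a classifier with fixed true-positive and false-positive rates, precision (and therefore F1) drops as negatives outnumber positives. The rates used below are hypothetical and are not taken from the KG-Bench results.

```python
from typing import Tuple


def precision_recall_f1(tpr: float, fpr: float,
                        negatives_per_positive: float) -> Tuple[float, float, float]:
    """Precision, recall and F1 for given rates and class ratio.

    Counts are expressed per positive example, so true positives equal tpr
    and false positives equal fpr * negatives_per_positive.
    """
    tp = tpr
    fp = fpr * negatives_per_positive
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tpr
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    for ratio in (1, 10, 100):
        p, r, f1 = precision_recall_f1(tpr=0.8, fpr=0.05, negatives_per_positive=ratio)
        print(f"1:{ratio}  precision={p:.3f}  recall={r:.3f}  F1={f1:.3f}")
```

This is why an F1 that looks low in absolute terms can still indicate useful ranking ability at a 1:100 ratio, and why AUC and F1 can favor different architectures.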
GSEA is a standard approach for pathway interpretation, yet Python ecosystems lack a high-performance implementation aligned with the fgseaMultilevel rare-event estimator target, especially for trajectory-aware rolling-window analysis. Under matched inputs, PyFgsea remains near-identical for normalized enrichment scores (NES; Pearson r>0.999), machine-precision identical for enrichment scores (ES), and statistically faithful for nominal p values relative to the R fgseaMultilevel reference. Its stateful rolling-window engine further reduces repeated trajectory-analysis overhead, yielding approximately 1.9-fold end-to-end wall-time speedup in a conservative stress test and, in a narrower 100-window component benchmark, up to 7.47-fold acceleration. Rolling-window significance is controlled only by within-window Benjamini-Hochberg correction across pathways rather than by trajectory-wide global error control, so these profiles are intended primarily for local trend exploration and candidate-pathway prioritization. Source code is available at https://github.com/shayuanxukuang/pyfgsea and via PyPI (pip install pyfgsea). An archival snapshot of the code and benchmark data is available on Zenodo (DOI: 10.5281/zenodo.19446446). Supplementary data are available at Bioinformatics online.
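The distinction between within-window and trajectory-wide correction can be made concrete with a small sketch of the Benjamini-Hochberg procedure applied separately to each window. This is a generic implementation written for illustration; it does not use the PyFgsea API, and the window labels and p values are invented.

```python
from typing import Dict, List, Sequence


def benjamini_hochberg(pvalues: Sequence[float]) -> List[float]:
    """Return BH-adjusted p values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def adjust_per_window(windows: Dict[str, Sequence[float]]) -> Dict[str, List[float]]:
    """Correct across pathways inside each window independently.

    No correction is applied across windows, so the number of windows
    tested does not enter the adjustment.
    """
    return {name: benjamini_hochberg(p) for name, p in windows.items()}


if __name__ == "__main__":
    windows = {
        "window_1": [0.01, 0.04, 0.03, 0.005],
        "window_2": [0.2, 0.001, 0.5, 0.04],
    }
    print(adjust_per_window(windows))
```

Because each window is adjusted on its own, the false discovery rate is controlled per window only, which is consistent with treating the resulting profiles as exploratory.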
Pedigrees reconstructed from biologically related ancient genomes have revealed many insights into (pre)history. To our knowledge, all reported ancient pedigrees have been primarily manually reconstructed, as existing pedigree reconstruction methods are ill-suited for the quality and nature of ancient DNA data. We introduce repare, an open-source software method to automatically reconstruct pedigrees from inferred pairwise kinship relations, which are readily obtainable from ancient genomes. This method reconstructs pedigrees by iteratively incorporating pairwise kinship relations into a set of candidate pedigrees, with pruning and sampling to reduce its search space. It optionally considers supporting information such as haplogroups and skeletal age-at-death estimates. We evaluate this method on a variety of simulated pedigrees with varying error rates and missingness. We also use this method to reconstruct several published pedigrees that were originally manually reconstructed; for one, we present a potential alternative topology. repare optionally incorporates user-inferred pedigree constraints, enabling "human-in-the-loop" reconstruction workflows. Especially when used with these user-inferred constraints, we find that repare represents a powerful and flexible tool for ancient pedigree reconstruction. repare is freely available at https://github.com/ehuangc/repare. In addition, source code, benchmark scripts, and benchmark results used in this work are archived at https://doi.org/10.5281/zenodo.19716772. Supplementary data are available at Bioinformatics online.
Duplex-Indel is a novel Snakemake workflow for detecting somatic small insertions and deletions (Indels) from Tn5 transposase-based duplex sequencing data. Duplex-Indel enhances the accuracy of mutation calling at the single-molecule level by requiring consensus support from both DNA strands for each somatic Indel, minimizing confounding from technical artifacts. Duplex-Indel extends somatic mutation calling in Tn5 transposase-based duplex sequencing data to include Indels. We have demonstrated the accuracy and robustness of Duplex-Indel using cancer cell lines. Source code and documentation are available under the MIT license on GitHub at https://github.com/ealee-lab/duplex-indel and archived on Zenodo at https://doi.org/10.5281/zenodo.19228799. Supplementary data are available at Bioinformatics online.
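The requirement of consensus support from both DNA strands can be sketched as a simple filter over per-read observations. The input format, the strand labels, and the threshold parameter below are hypothetical and chosen for illustration; they do not describe the Duplex-Indel workflow's actual data structures or rules.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple

Indel = Tuple[str, int, str, str]        # (chromosome, position, ref, alt)
Observation = Tuple[str, str, Indel]     # (molecule id, strand "+" or "-", indel)


def duplex_supported_indels(observations: Iterable[Observation],
                            min_reads_per_strand: int = 1) -> List[Tuple[str, Indel]]:
    """Keep indels seen on both strands of the same original molecule."""
    counts = defaultdict(lambda: {"+": 0, "-": 0})
    for molecule_id, strand, indel in observations:
        counts[(molecule_id, indel)][strand] += 1
    return sorted(key for key, c in counts.items()
                  if c["+"] >= min_reads_per_strand
                  and c["-"] >= min_reads_per_strand)


if __name__ == "__main__":
    deletion = ("chr1", 1000, "AT", "A")
    insertion = ("chr2", 500, "G", "GC")
    reads = [
        ("mol1", "+", deletion),
        ("mol1", "-", deletion),   # supported on both strands: kept
        ("mol2", "+", insertion),  # single-strand only: likely artifact, dropped
    ]
    print(duplex_supported_indels(reads))
```

An error introduced on one strand during library preparation or sequencing would appear only in that strand's reads, which is why a two-strand requirement suppresses such artifacts.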