Rare diseases (RDs) are a highly heterogeneous and underserved group of conditions. Most RDs have a strong genetic basis but their causal pathophysiological mechanisms remain poorly understood, limiting the development of targeted therapies. We systematically characterised the cell type-specific mechanisms underlying all genetically defined RD phenotypes by integrating the Human Phenotype Ontology (HPO) with whole-body single-cell transcriptomic atlases from embryonic, foetal, and adult samples. Associations were validated against orthogonal biomedical knowledge graphs and then prioritised by strength of supporting evidence, clinical severity, and gene-therapy compatibility. We identified significant associations between 201 cell types and 9,575/11,028 (86.7%) phenotypes across 8,628 RDs, substantially expanding knowledge of phenotype-cell type links. Prioritisation by severity (e.g. lethality, motor or mental impairment) and gene-therapy compatibility (e.g. cell type specificity, postnatal treatability) identified candidate phenotypes and cell types for therapeutic targeting. We present a scalable, reproducible framework for phenome-wide, cell type-specific mechanism prediction in rare diseases, providing a major step toward systematic therapeutic development for patients across a broad spectrum of serious RDs. Interactive web portal: https://neurogenomics-ukdri.dsi.ic.ac.uk/. R packages introduced in this study: KGExplorer (https://github.com/neurogenomics/KGExplorer), HPOExplorer (https://github.com/neurogenomics/HPOExplorer), and MSTExplorer (https://github.com/neurogenomics/MSTExplorer). Manuscript analyses and reproducibility code: https://github.com/neurogenomics/rare_disease_celltyping.
Single-molecule assays such as NOMe-seq, dSMF, and Nanopore sequencing have an advantage over DNase-seq and ATAC-seq because they do not destroy DNA. They therefore enable quantification of all three binding states: protein-free, transcription factor-bound, and histone-complex-bound. However, a user-friendly tool to visualize and quantify these states is lacking. Here, we present SMTrackR, an R/Bioconductor package to visualize protein-DNA binding states on individual sequenced DNA molecules. SMTrackR queries the single-molecule footprint database that we built and hosted on a Galaxy server. The database comprises BigBed files generated from NOMe-seq, dSMF, and Nanopore (SMAC-seq) datasets. SMTrackR uses the UCSC REST API to query a BigBed file, plots a footprint heatmap categorized by binding state, and reports the occupancy of each state. Additionally, the package generates a Gviz-enabled script to visualize these single molecules on gene tracks. SMTrackR is implemented in the statistical programming language R and is available as a Bioconductor package (https://bioconductor.org/packages/3.23/bioc/html/SMTrackR.html). The GitHub repository at https://github.com/satyanarayan-rao/SMTrackR has the latest updates. Installation takes less than five minutes, provided the dependent packages are already installed. The tool is also available as a web version at https://smtrackrest.iitr.ac.in/. A function for loading a local BigBed file is provided for users who wish to use unpublished data. A fully automated pipeline to generate such BigBed files is available at https://github.com/satyanarayan-rao/SMF_for_SMThub and https://github.com/satyanarayan-rao/dSMF_for_SMThub.
Male infertility has emerged as a significant concern in modern society, with genetic defects being one of its major underlying causes. Such defects impair sperm motility and morphology, leading to conditions such as Asthenozoospermia (reduced sperm motility), Teratozoospermia (abnormal sperm morphology) and sometimes Asthenoteratozoospermia (both motility and morphology defects). Assisted reproductive technologies (ART), such as in-vitro fertilization (IVF), offer a potential solution for such cases, but with a low success rate. Classical semen analysis provides only a phenotypic snapshot without revealing the fertilizing potential of the sperm. Hence, to screen the functional sperm population and to gain deeper insight into the reasons underlying the aberrant sperm population, it is important to study their genetic profile. In this work, we performed a meta-analysis comparing transcriptomic data of infertile sperm from Asthenozoospermia and Teratozoospermia patients with those of fertile sperm from normal individuals. We then screened a signature gene set and used it to develop a prediction model, named Explainable Infertility Test (E-InfertilityTest), that classifies sperm as fertile or infertile at a preliminary level. For each prediction, the model also reports the set of genes that play a dominant role in that prediction, thereby providing a patient-specific profile of the dominant gene expression responsible for the aberration. Overall, this AI-based framework serves as a proof of concept for predicting the genetic basis of male infertility. Users can access E-InfertilityTest as a standalone version on GitHub: https://github.com/zglabDIB/einfertility.git.
Ancestral recombination graphs (ARGs) are increasingly central to modern population genetics, yet ARG-based methods for spatiotemporal demographic inference remain underutilized in empirical settings due to fragmented workflows and a lack of exploratory tools. ARGscape addresses this by providing a unified framework, seamlessly integrating established and novel tools for ARG simulation, manipulation, and spatiotemporal inference into both graphical and command-line interfaces. ARGscape features dynamic 2- and 3-dimensional visualizations and a novel "spatial diff" visualization for quantitative comparison of ARG-based geographic inference methods. By integrating these various functionalities, ARGscape facilitates novel data exploration and hypothesis generation, bridging the gap between methods development and empirical adoption, and enabling educational uses. ARGscape is available as a Python package on PyPI and as a live website for educational and simple demonstrative purposes at https://www.argscape.com. The source code and documentation are available on GitHub at https://github.com/chris-a-talbot/argscape. Supplementary data is available at Bioinformatics online.
Weighted Quantile Sum (WQS) regression is a statistical method for quantifying the association between multiple, possibly correlated predictors and a health outcome, estimating both the joint effect of the predictors and their individual contributions to the total effect. WQS has become one of the most popular and widely used approaches for investigating complex mixtures in environmental epidemiology, yet its implementation has been largely restricted to R users. In this paper we present wqsreg, the first Stata command for WQS regression, implemented for continuous, binary and count outcomes. We describe the command's architecture and present an application of the command to exposome data, exploring the association between 38 exposures and a continuous outcome. Wqsreg provides a user-friendly command for WQS regression that integrates several flexible components of the framework, such as bootstrap, training/validation splitting, and repeated holdout procedures. Wqsreg returns regression estimates as well as graphical displays of the individual weights. It requires Stata version 11 or higher and is freely available on GitHub [ https://github.com/PonzanoMarta/wqsreg ]. Given the increasing importance of appropriately exploring complex multidimensional exposures, this contribution will further promote the use of appropriate statistical methods in epidemiological settings with multiple correlated predictors.
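As a rough illustration of the quantity at the heart of WQS regression, the following Python sketch scores each exposure into quantile bins and combines the scores into a single weighted index. It is not the wqsreg implementation: the function names, the nearest-rank binning rule, and the fixed weights are assumptions made for this example. In an actual WQS analysis the weights are estimated from the data (non-negative and constrained to sum to one), and the outcome is then regressed on the resulting index.

```python
from bisect import bisect_right


def quantile_scores(values, n_quantiles=4):
    """Score each value 0..n_quantiles-1 according to its empirical quantile bin.

    Uses a simple nearest-rank rule for the cut points (an assumption of this
    sketch; statistical packages may use other quantile definitions).
    """
    ordered = sorted(values)
    n = len(ordered)
    cuts = [ordered[min(n - 1, (k * n) // n_quantiles)] for k in range(1, n_quantiles)]
    return [bisect_right(cuts, v) for v in values]


def wqs_index(exposures, weights, n_quantiles=4):
    """Return one weighted quantile sum per subject.

    exposures: dict mapping exposure name -> list of measurements (one per subject)
    weights:   dict mapping exposure name -> non-negative weight; weights sum to 1
    """
    if any(w < 0 for w in weights.values()):
        raise ValueError("weights must be non-negative")
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    scored = {name: quantile_scores(vals, n_quantiles) for name, vals in exposures.items()}
    n_subjects = len(next(iter(scored.values())))
    return [
        sum(weights[name] * scored[name][i] for name in scored)
        for i in range(n_subjects)
    ]


if __name__ == "__main__":
    # Hypothetical data: two exposures measured on eight subjects.
    exposures = {"a": [1, 2, 3, 4, 5, 6, 7, 8], "b": [8, 7, 6, 5, 4, 3, 2, 1]}
    weights = {"a": 0.75, "b": 0.25}
    print(wqs_index(exposures, weights))
```

Because the weights sum to one, the index stays on the same 0 to 3 scale as the quartile scores, so a regression coefficient on the index can be read as the effect of a one-quantile increase in the mixture as a whole.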
Low-field (LF) magnetic resonance imaging (MRI) plays a crucial role in assisting clinicians with rapid stroke diagnosis. However, its inherent limitations, such as low signal-to-noise ratio (SNR) and suboptimal image quality, make accurate stroke lesion identification challenging. To address this, we propose a Difference-Guided Conditional Diffusion Model (DGCD-3D) to enhance image quality of LF MRI while preserving structural integrity of stroke lesions. Specifically, the model incorporates a difference-adaptive forward diffusion process that guides the diffusion dynamics based on differences. During training, multi-scale intrinsic features from LF MRI and prior spatial information of stroke lesions are explicitly encoded into the generative process. Furthermore, a time-adaptive multi-loss optimization strategy dynamically balances pixel-wise and perceptual losses at different timesteps. DGCD-3D was evaluated on a large-scale, clinically scarce paired LF-HF MRI dataset (n = 974) acquired within a mean interval of 18.5 min. Experimental results demonstrate that DGCD-3D substantially improves LF MRI image quality (PSNR = 28.26, SSIM = 0.896) and achieves significantly higher consistency with HF MRI in stroke lesion assessment (Spearman's correlation coefficient ρ = 0.732) compared with the LF MRI (ρ = 0.680). Furthermore, a clinical authenticity assessment conducted by six experienced radiologists yielded a confusion rate of 51.6% and a confusion score of 5.64, further confirming the clinical reliability and broad applicability of the proposed approach. The codes and trained models will be released on GitHub: https://github.com/lihao9056/DGCD-3D.
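For readers unfamiliar with the reported image-quality metric, the sketch below computes the peak signal-to-noise ratio (PSNR) in its standard form, 10·log10(MAX² / MSE). It is a generic illustration and not code from the DGCD-3D repository; the flat-list image representation and the `max_value` default of 1.0 (intensities normalised to [0, 1]) are assumptions of this example.

```python
import math


def psnr(reference, candidate, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized intensity lists.

    reference, candidate: flat sequences of voxel intensities
    max_value: largest possible intensity (assumed 1.0 for normalised images)
    """
    if len(reference) != len(candidate) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - c) ** 2 for r, c in zip(reference, candidate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    # Hypothetical three-voxel "images".
    print(psnr([0.0, 0.5, 1.0], [0.0, 0.5, 0.9]))
```

Higher values indicate that the enhanced image is closer to the reference; PSNR is sensitive to the assumed intensity range, so values are only comparable when `max_value` is chosen consistently.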
Conflicting phylogenetic signals are common in plant phylogenomics and often reflect evolutionary histories shaped by processes like hybridization, incomplete lineage sorting, and whole-genome duplication (WGD). We aimed to identify and assess these complex processes in the hyper-diverse family Asteraceae to offer insight into the underlying causes of phylogenetic discordance. We used new and existing Hyb-Seq and transcriptome data to explore phylogenetic discordance by testing for nuclear/plastid incongruences, WGD, and reticulation. We present a tutorial detailing the execution of complex bioinformatic analyses to increase transparency, facilitate reproducibility, and support advancements in the field of plant evolution (https://github.com/erika-r-moore/Ellestad_etal_2025_APPS_Hybridizations). We uncovered extensive discordance among nuclear gene trees and deep reticulation events, particularly among South American lineages. Signals of WGD were found across the family but were often difficult to interpret, likely due to variation in data completeness, the complexity of the events, and their ancient origins. Our study and tutorial, along with a growing body of phylogenomic research, emphasize the role of reticulation and WGD in the evolution of large, diverse clades, while also underscoring the challenges. We anticipate continued advancements in theoretical approaches that will further enhance empirical studies in reticulate evolution.
Identifying gene-disease associations (GDAs) remains a fundamental challenge in biomedical research due to the enormous combinatorial space of candidate gene-disease pairs and the limited scalability of experimental validation. As wet lab studies cannot keep pace with rapidly expanding omics data, computational approaches have become essential for prioritizing plausible GDAs and accelerating biological discovery. Recent advances in artificial intelligence (AI), particularly graph neural networks (GNNs) and large language models (LLMs), are transforming this field by enabling richer biological representations and more accurate predictive modeling. In this survey, we provide a unified and up-to-date overview of AI-driven GDA prediction. We first summarize major public resources containing gene, disease, and auxiliary biological information that underpin computational studies. We then review methodological developments ranging from traditional network-based methods to machine learning, deep learning, and the emerging integration of GNNs and LLMs, which has received limited attention in previous GDA-focused surveys. Representative applications in gene prioritization, drug repurposing, and clinical research are also discussed to demonstrate the practical impact of these approaches. Finally, we outline current challenges and promising future directions. By integrating data resources, methodological advances, and translational applications, this survey provides a comprehensive overview of modern AI techniques for GDA prediction and aims to support the development of more robust, interpretable, and clinically actionable computational tools. All curated resources and reviewed literature are publicly available in our GitHub repository (last updated September 2025; including peer-reviewed publications and preprints on AI-driven GDA prediction published through September 2025): https://github.com/linyaoyang/gene disease-association-prediction-papers.
Organisms within ecological systems often engage in molecular interactions that mediate key biological processes, such as protein-protein interactions involved in host-pathogen recognition and symbiosis. Characterization of these interactions at a molecular level is essential for understanding the mechanistic, evolutionary, and functional basis of interspecies interactions, as well as for informing potential therapeutic interventions. However, progress in this field is significantly impeded by the lack of a comprehensive database of interacting species at molecular resolution and the limited availability of experimental data. We introduce the Interacting Species Database (ISDB), a comprehensive resource that catalogs interspecies interactions, annotated with NCBI taxonomic identifiers, interaction types and known molecular interactions. The ISDB encompasses 858,229 interacting species pairs and 171,713 interspecies protein-protein interactions within 261,287 organisms. ISDB is designed to support researchers in searching for, downloading, and depositing interspecies interaction data, which facilitates the study of ecological dynamics across diverse research domains. The ISDB is available via a web interface (https://www.elhabashylab.org/isdb), open-source code on GitHub (https://github.com/ElhabashyLab/ISDB) under the MIT license and is archived on Zenodo (Version v1.0.1, DOI: 10.5281/zenodo.20162385).
Protein dynamics are central to function, but experiments and molecular dynamics (MD) simulations remain costly, low-throughput, and difficult to compare across protocols. Scalable structure-based methods are needed to infer dynamics from static protein structures. We present a deep learning framework that predicts protein dynamics from 30-dimensional Gaussian integral (GI) descriptors of Cα backbone topology. Using 1,374 ATLAS protein chains with MD-derived RMSF, GI stratified proteins into fold-relevant clusters enriched for secondary structure, sequence homology, and ECOD families. An attention-based 1D-CNN classified flexible versus non-flexible proteins with test AUC = 0.772 and separated slow-mode- from fast-mode-dominated dynamics with AUC = 0.91. Regression models recovered mean RMSF (Pearson r = 0.72; R² = 0.46) and slow-mode RMSF more accurately (Pearson r = 0.83; R² = 0.62), supporting rapid inference of flexibility and collective-motion bias. Code and data are available on GitHub at: https://github.com/fvilicich/gaussian_integral/blob/main/gaussian_integral_classification.ipynb. Supplementary data are available at Bioinformatics online.
Research publications on various aspects of functional genomics are growing steadily, creating both an opportunity and a challenge for mining information of interest, such as the assignment of proteins to specific functional classes or the degeneracy of their functions. Knowledge-driven approaches to functional annotation rest on a mechanistic basis, and data-driven predictive models rest on deterministic features, but neither harnesses what has already been reported in the literature in different contexts. Natural language processing (NLP) combined with machine learning aims to bridge this gap. We previously developed a method that uses protein features to predict, with reasonable accuracy, an intriguing functional property of DNA-binding proteins called moonlighting. However, the development of training data and the harnessing of functional information available in the literature have not been well addressed so far for this problem. Because moonlighting is a problem of functional redundancy, literature mining may provide a cross-study perspective and hence better prediction performance. Here we present an NLP-based model for mining the literature and identifying moonlighting behaviour of proteins. A high-performing PubMed BERT model pre-trained on PubMed publications was further optimized through retraining on particular data sets, allowing accurate identification of moonlighting function in proteins. We show that this approach identifies moonlighting proteins with high accuracy and outperforms the first-principles approaches reported earlier. The methods presented here address moonlighting behaviour of proteins but are scalable to any literature-mining problem in the biological domain. Data sets and code used in this work are provided in the GitHub repository https://github.com/Sciwhylab/moonlighting_nlp.
A-liner is a flexible command-line tool for linear visualization of genome-scale sequence alignments, supporting outputs from multiple aligners and integrated visualization of annotations, highlights, quantitative tracks, and coordinate scales. It is applicable to a wide range of organisms, from bacteria to large eukaryotic genomes, and facilitates efficient generation of publication-ready comparative genome visualizations. The source code and example output files for a-liner are available in the GitHub repository: https://github.com/mokuno3430/a-liner. A-liner v1.1.0 has been archived on Zenodo at https://doi.org/10.5281/zenodo.19702001. Supplementary data are available at Bioinformatics online.
Accurate segmentation of multiple organs is essential for the diagnosis and treatment of head and neck cancer. However, the intricate anatomical structure and dense organ distribution in the head and neck region pose significant challenges for existing automated segmentation models, which predominantly target single organs and rely on single-modality imaging. Achieving comprehensive, one-step segmentation of organs-at-risk (OARs) remains challenging. To this end, we propose a Point-cloud Matrix Fusion-based Segmentation Model (PMFM) that leverages an improved multi-modal data fusion strategy for the automated full segmentation of OARs in head and neck cancer. The proposed PMFM involves three core modules: (1) a camera model-based 3D feature mapping and point-cloud extraction module (PEM) that enables vertical decoupling of modalities and objects; (2) a Point Cloud Matrix Module (PMM) utilizing PointNet and a virtual point cloud-based attention mechanism to facilitate horizontal association and global feature learning across modalities; and (3) a Cross Fusion Module (CFM) based on virtual point clouds to achieve deep intermodal object fusion and enhance inter-organ correlation. PMFM effectively integrates multi-modal image information, transforming it into a unified virtual point cloud matrix, and enables precise, comprehensive segmentation of OARs in head and neck cancer. Extensive validation and comparative experiments on the HaNSeg dataset demonstrate that PMFM significantly outperforms state-of-the-art methods, achieving an average Dice coefficient of 79.8% and an average Hausdorff distance of 2.47 mm. The source code for this study will be publicly available on GitHub at https://github.com/zhouxinyu1028/PMFM.
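The Dice coefficient reported above measures the overlap between a predicted and a reference segmentation as 2|A∩B| / (|A| + |B|). The sketch below is a generic illustration on flat binary masks, not code from the PMFM repository; the function name and the convention of returning 1.0 when both masks are empty are assumptions of this example.

```python
def dice_coefficient(predicted, reference):
    """Dice overlap between two binary masks given as equally long 0/1 sequences."""
    if len(predicted) != len(reference):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, r in zip(predicted, reference) if p and r)
    total = sum(1 for p in predicted if p) + sum(1 for r in reference if r)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / total


if __name__ == "__main__":
    # Hypothetical four-voxel masks for a single organ.
    print(dice_coefficient([1, 1, 0, 0], [1, 0, 1, 0]))
```

For multi-organ segmentation, the score is typically computed per organ label and then averaged, which is one common way to arrive at an "average Dice coefficient".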
We develop a freely available Python package, Vardetector (https://github.com/julijselb/vardetector/tree/main/vardetector), for detecting DNA-called mutations in aligned RNA reads. We benchmark it against an industry-standard variant caller (GATK HaplotypeCaller; r = 0.88896/0.88859 (supporting-reads/all-reads)) and demonstrate its functionality by comparing two RNA-seq library preparation protocols for formalin-fixed paraffin-embedded (FFPE) tumor samples. One protocol relies on exome-capture and the other on ribosome-depletion (ribodepletion) chemistry. We call somatic mutations from DNA of tumor/normal samples of two individuals with non-small cell lung cancer and test the difference between the two protocols by quantifying all RNA reads (all-reads) and somatic mutation-supporting RNA reads (supporting-reads) over the positions of the DNA-called mutations. We show that the ribodepletion protocol produces a significantly higher number of all (p < 0.001) and of supporting (p < 0.001) reads over the mutations of interest. Moreover, the ribodepletion protocol produces a significantly (p < 0.001) wider breadth of somatic mutation position coverage. The Vardetector software package and our results show that the approach has meaningful potential to improve neoantigen prioritisation pipelines.
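The distinction between all-reads and supporting-reads at a DNA-called mutation can be sketched in a few lines of Python. This is a simplified illustration and not Vardetector's API: reads are represented as hypothetical `(start, sequence)` pairs of ungapped alignments on a single chromosome, whereas real aligned reads come from BAM files and involve CIGAR operations, base qualities, and strand information.

```python
def count_reads_at(reads, position, alt_base):
    """Count reads covering a position and, among them, reads carrying the alt base.

    reads:    iterable of (start, sequence) pairs; `start` is the reference
              coordinate of the first base, and alignments are assumed ungapped
    position: reference coordinate of the DNA-called mutation
    alt_base: mutant base expected at that position
    Returns (all_reads, supporting_reads).
    """
    all_reads = 0
    supporting_reads = 0
    for start, sequence in reads:
        offset = position - start
        if 0 <= offset < len(sequence):  # the read overlaps the mutation site
            all_reads += 1
            if sequence[offset] == alt_base:
                supporting_reads += 1
    return all_reads, supporting_reads


if __name__ == "__main__":
    # Hypothetical reads around a mutation at coordinate 102 (alt base "T").
    reads = [(100, "ACGT"), (102, "TTAA"), (90, "ACG")]
    print(count_reads_at(reads, 102, "T"))
```

Summing these two counts over every DNA-called mutation gives per-protocol totals of the kind compared in the abstract, and the fraction of mutation positions with at least one covering read is one possible notion of coverage breadth.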
Reconstructing gene regulatory networks (GRNs) with directionality and regulatory types is an important challenge in computational biology. Existing methods often struggle to effectively capture complex topological structures in highly skewed GRNs due to imbalances between local and global information and to the collapse of representation dimensionality. To address these challenges, we propose BMGRN, a unified framework that reconstructs directional GRNs with regulation types by integrating bidirectional state space modeling with dual contrastive representation learning. Drawing inspiration from sequence modeling, BMGRN employs an enhanced bidirectional Mamba2 architecture to efficiently capture long-range dependencies and asymmetric regulatory interactions between genes. This design enables global information propagation while maintaining directional specificity. Furthermore, a dual contrastive learning mechanism is introduced to alleviate oversmoothing and dimensional collapse, enforcing representation uniformity and discriminability in low-connectivity scenarios. By coupling these representations with a KAN-based convolutional predictor, BMGRN adaptively learns nonlinear dependencies and regulatory modes, thereby improving its modeling capacity for GRN inference. Experiments on multiple benchmark data sets show that BMGRN attains superior performance, demonstrating great potential for large-scale GRN inference. The code is available at https://github.com/KanZh/BMGRN.
Spatial sequencing technologies enable the single-cell-level study of molecular organization in tissues. Revealing such spatial patterns relies on accurate cell segmentation. In complex tissues with dense cell packing, segmentation based solely on nuclear staining is insufficient for accurate cell boundary detection. This limitation arises because accurate segmentation necessitates the delineation of cell morphology, which is driven by molecular activities such as cytoskeletal dynamics, cell-cell adhesion, and intercellular signaling. Thus, integrating molecular information, including gene or protein expression, has the potential to improve segmentation, but remains computationally challenging. To address this, we developed SegJointGene, a deep learning framework that jointly performs cell segmentation and spatial gene prioritization by integrating nuclei-based images with spatial gene or protein expression data. SegJointGene designs an information-entropy-guided convolutional neural network together with a computational information discarding score to identify genes that are important for cell-type-specific segmentation. The model iteratively refines gene prioritization and cell boundaries, producing convergent segmentation results along with prioritized spatial genes or proteins across cell types. We applied and benchmarked SegJointGene on both simulation and real spatial datasets, including spatial transcriptomics from the mouse hippocampus and distinct regions of the whole mouse brain, as well as spatial proteomics data from human tonsil. Across datasets, SegJointGene outperformed existing methods by 5-20% in accurately assigning molecular signals to cell boundaries. Robustness analyses further demonstrated stable performance across varying gene numbers and imaging resolutions. 
In addition, the genes prioritized by SegJointGene were enriched for structural, developmental, and synaptic signaling pathways, supporting their relevance to spatial tissue organization. The source code and data are available at https://github.com/daifengwanglab/segjointgene. Supplementary figures, notes and data descriptions are available in Supplementalmaterials.pdf.
The early detection and classification of skin cancer are pivotal in improving patient outcomes and reducing healthcare burdens. However, traditional deep learning models in dermatological diagnostics often struggle with the nuanced differentiation of skin lesions. This paper introduces an approach that integrates an Advanced Heat Flow Layer into deep learning architectures for skin cancer classification. The method is centered on the principles of anisotropic diffusion and differs from conventional image processing techniques by selectively smoothing image areas while preserving critical edge details, which are essential for accurate lesion identification. We used the HAM10000 dataset, enriched with data augmentation to simulate real-world variability, and conducted a comprehensive comparison of our model, featuring the Advanced Heat Flow Layer, against several benchmark deep learning models, including a Sobel Edge Detection Layer. Our model, integrated with various layers of DenseNet121, consistently outperformed these benchmarks across key metrics such as accuracy, precision, recall, F1 score, and AUC, particularly with augmented data. This indicates a significant enhancement in the model's ability to generalize and maintain critical diagnostic features under diverse conditions. Our code is available at https://github.com/sanadv/SkinCancerClassificationModels/blob/main/Models.ipynb.
Learning a high-order connectional brain template (CBT) endowed with cognitive capacities such as visual or auditory memory is crucial for identifying cognition-related biomarkers and distinguishing between control and clinical populations. Higher-order CBTs provide a population-level representation that captures not only structural or topological regularities but also the multi-regional interactions and cognitive processes that conventional pairwise models fail to reflect. Because the brain operates through complex, coordinated dynamics, estimating CBTs that incorporate such higher-order and cognitively meaningful organization is essential for advancing our understanding of neural function and dysfunction. While recent machine-learning and graph-neural-network approaches have improved CBT estimation, they remain limited by their focus on pairwise interactions and purely structural features, overlooking both higher-order organization and cognitive properties. This gap raises a central question: How can we learn a high-order CBT that is well-centered at the population level and also endowed with cognitive capacities? We tackle this challenge using reservoir computing (RC), a biologically inspired framework that mimics how the brain processes information. RC exhibits dynamic properties similar to those of the prefrontal cortex, an area associated with working memory and features a fading memory mechanism, known as the Echo State Property (ESP), which mirrors the brain's short-term memory function. Building on these properties, we introduce HyperCOCO, a novel framework for generating high-order cognitively enhanced CBTs in two stages. First, BOLD signals are processed through a random reservoir to generate high-order individual functional connectomes, which are then aggregated into a population-level template. Second, this template is instantiated into a hyper-cognitive reservoir and stimulated with multi-sensory inputs (visual, auditory, and linguistic). 
Finally, we measure the memory capacity of the resulting CBT as a proxy for its ability to encode and retain cognitive information. Our source code is available at https://github.com/basiralab/HyperCOCO.
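The fading-memory behaviour referred to above as the Echo State Property can be illustrated with a small random reservoir: when the recurrent weights are contractive, two copies of the reservoir started from different states and driven by the same input converge to the same trajectory. The sketch below is a generic demonstration and not code from the HyperCOCO repository. The reservoir size, the sinusoidal input, and the row-sum scaling of 0.9 are assumptions of this example; the scaling is a deliberately simple sufficient condition for contraction (tanh is 1-Lipschitz), whereas practical reservoirs are more often tuned through the spectral radius.

```python
import math
import random


def make_reservoir(n_units, max_row_sum=0.9, seed=0):
    """Build random recurrent and input weights for a small reservoir.

    Each row of the recurrent matrix is rescaled so that its absolute values
    sum to max_row_sum (< 1), which makes the state update a contraction.
    """
    rng = random.Random(seed)
    recurrent = []
    for _ in range(n_units):
        row = [rng.uniform(-1.0, 1.0) for _ in range(n_units)]
        scale = max_row_sum / sum(abs(w) for w in row)
        recurrent.append([w * scale for w in row])
    input_weights = [rng.uniform(-1.0, 1.0) for _ in range(n_units)]
    return recurrent, input_weights


def step(state, u, recurrent, input_weights):
    """One reservoir update: x(t+1) = tanh(W x(t) + w_in * u(t))."""
    return [
        math.tanh(sum(w * s for w, s in zip(row, state)) + w_in * u)
        for row, w_in in zip(recurrent, input_weights)
    ]


def state_gap(n_steps, n_units=20, seed=0):
    """Largest per-unit difference between two runs started from opposite states."""
    recurrent, input_weights = make_reservoir(n_units, seed=seed)
    state_a = [1.0] * n_units
    state_b = [-1.0] * n_units
    for t in range(n_steps):
        u = math.sin(0.3 * t)  # the same input drives both copies
        state_a = step(state_a, u, recurrent, input_weights)
        state_b = step(state_b, u, recurrent, input_weights)
    return max(abs(a - b) for a, b in zip(state_a, state_b))


if __name__ == "__main__":
    for n in (0, 10, 50, 200):
        print(n, state_gap(n))
```

The shrinking gap shows that the influence of the initial state washes out, so the reservoir state depends only on recent input history; memory capacity measures how far back that history can still be linearly read out.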
Formalin-fixed paraffin-embedded (FFPE) tissues are widely used in clinical and research settings, yet their use for detecting somatic mutations from RNA sequencing (RNA-seq) is hindered by artefactual mutations introduced by cytosine deamination and strand-specific damage. Existing FFPE noise-filtering tools are tailored to DNA-seq and rely on strand bias, rendering them unsuitable for RNA-seq. Here, we present FFixR, a machine learning-based framework that filters FFPE-induced artefacts from RNA-seq data without requiring matched-normal samples. Trained on FFPE melanoma samples with matched DNA, FFixR leverages allele-specific read counts, variant features, and mutational signature probabilities. FFixR removed up to 98% of artefactual mutations while maintaining ∼92% recall of true variants. SHAP analysis revealed key feature interactions guiding model decisions. When applied to independent cohorts, FFixR restored the correlation between RNA- and DNA-derived tumor mutational burden (R² = 0.881) and recovered biologically meaningful mutational signatures. FFixR enables accurate somatic variant calling from FFPE RNA-seq data, expanding the utility of archival samples for research and clinical applications. The FFixR tool is freely available on the web at https://github.com/yizhak-lab-ccg/FFixR and https://doi.org/10.6084/m9.figshare.31998315. The repository also includes a readme file describing the inputs, outputs and the entire pipeline. The results presented here were produced using v1.0.0. Supplementary data are available at Bioinformatics online.
Despite the increased availability of electronic health records, open-source, standardized data collection that supports high-resolution data during extracorporeal life support (ECLS) is lacking. This project aimed to assess whether the Observational Medical Outcomes Partnership Common Data Model (OMOP CDM), used for interoperability, can sufficiently store data generated in the context of ECLS, and to develop a custom data model expansion in case the OMOP CDM proved insufficient. The OMOP CDM was analyzed qualitatively by expert consensus for its capability to capture ECLS-related data and for the presence of fitting ECLS-related concepts. The database entries necessary to store information about primary ECLS components were compared between the OMOP CDM and the custom data model expansion. Analysis of the data elements required to capture ECLS data within the OMOP CDM revealed a paucity of suitable concepts within the OHDSI Standardized Vocabularies, limiting capture of ECLS circuit-derived data. Custom ECLS-specific database tables and novel concepts were introduced as part of a custom expansion, the ECLS Common Data Model (ECLS CDM). The number of database entries necessary to store ECLS use cases was reduced by up to 90%. The ECLS CDM was released as an open-source project on GitHub and placed in the public domain. With the first iteration of the ECLS CDM, we introduce a data model that improves interoperability for data describing ECLS and elevates data quality, enabling multi-center research and quality initiatives.