Single-molecule assays such as NOMe-seq, dSMF, and Nanopore sequencing are superior to DNase-seq and ATAC-seq because they do not destroy DNA. They therefore enable quantification of all three binding states: protein-free, transcription factor-bound, and histone-complex-bound. However, a user-friendly tool to visualize and quantify these states is lacking. Here, we present SMTrackR, an R/Bioconductor package to visualize protein-DNA binding states on individual sequenced DNA molecules. SMTrackR queries the single-molecule footprint database that we built and host on the Galaxy Server. The database comprises BigBed files generated from NOMe-seq, dSMF, and Nanopore (SMAC-seq) datasets. SMTrackR uses the UCSC REST API to query a BigBed file, plots a footprint heatmap categorized by binding state, and reports the occupancy of each state. Additionally, the package generates a Gviz-enabled script to visualize these single molecules on gene tracks. SMTrackR is implemented in the statistical programming language R and is available as a Bioconductor package (https://bioconductor.org/packages/3.23/bioc/html/SMTrackR.html). The GitHub repository at https://github.com/satyanarayan-rao/SMTrackR has the latest updates. Installation takes less than five minutes, provided the dependencies are already installed. The tool is also available as a web version at https://smtrackrest.iitr.ac.in/. A function is provided for users who wish to analyze unpublished data from a local BigBed file. A fully automated pipeline to generate such BigBed files is available at https://github.com/satyanarayan-rao/SMF_for_SMThub and https://github.com/satyanarayan-rao/dSMF_for_SMThub.
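As a rough illustration of what "reporting occupancies" can mean, the sketch below counts molecules per binding state and converts the counts to percentages. It is not SMTrackR's implementation: the function name `state_occupancies` and the string labels are hypothetical, and the sketch assumes each molecule has already been assigned exactly one of the three states named in the abstract.

```python
from collections import Counter

# State labels follow the wording of the abstract; the exact strings are an assumption.
STATES = ("protein-free", "TF-bound", "histone-complex-bound")


def state_occupancies(labels):
    """Return the percentage of molecules assigned to each binding state.

    `labels` is an iterable with one state label per sequenced molecule.
    """
    labels = list(labels)
    unknown = set(labels) - set(STATES)
    if unknown:
        raise ValueError(f"unknown state labels: {sorted(unknown)}")
    if not labels:
        raise ValueError("no molecules supplied")
    counts = Counter(labels)
    return {state: 100.0 * counts[state] / len(labels) for state in STATES}


if __name__ == "__main__":
    molecules = ["protein-free", "protein-free", "TF-bound", "histone-complex-bound"]
    print(state_occupancies(molecules))
```

The percentages always sum to 100 because every molecule contributes to exactly one state.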
Rare diseases (RDs) are a highly heterogeneous and underserved group of conditions. Most RDs have a strong genetic basis but their causal pathophysiological mechanisms remain poorly understood, limiting the development of targeted therapies. We systematically characterised the cell type-specific mechanisms underlying all genetically defined RD phenotypes by integrating the Human Phenotype Ontology (HPO) with whole-body single-cell transcriptomic atlases from embryonic, foetal, and adult samples. Associations were validated against orthogonal biomedical knowledge graphs and then prioritised by strength of supporting evidence, clinical severity, and gene-therapy compatibility. We identified significant associations between 201 cell types and 9,575/11,028 (86.7%) phenotypes across 8,628 RDs, substantially expanding knowledge of phenotype-cell type links. Prioritisation by severity (e.g. lethality, motor or mental impairment) and gene-therapy compatibility (e.g. cell type specificity, postnatal treatability) identified candidate phenotypes and cell types for therapeutic targeting. We present a scalable, reproducible framework for phenome-wide, cell type-specific mechanism prediction in rare diseases, providing a major step toward systematic therapeutic development for patients across a broad spectrum of serious RDs. Interactive web portal: https://neurogenomics-ukdri.dsi.ic.ac.uk/. R packages introduced in this study: KGExplorer (https://github.com/neurogenomics/KGExplorer), HPOExplorer (https://github.com/neurogenomics/HPOExplorer), and MSTExplorer (https://github.com/neurogenomics/MSTExplorer). Manuscript analyses and reproducibility code: https://github.com/neurogenomics/rare_disease_celltyping.
Male infertility has emerged as a significant concern in modern society, with genetic defects among its major underlying causes. These defects impair sperm motility and morphology, leading to conditions such as asthenozoospermia (reduced sperm motility), teratozoospermia (abnormal sperm morphology), and sometimes asthenoteratozoospermia (both motility and morphology defects). Assisted reproductive technologies (ART), such as in-vitro fertilization (IVF), offer a potential solution in such cases, but with a low success rate. Classical semen analysis provides only a phenotypic snapshot without revealing the fertilizing potential of the sperm. Hence, to screen for the functional sperm population and to gain deeper insight into the causes of the aberrant sperm population, it is important to study their genetic profile. In this work, we performed a meta-analysis comparing transcriptomic data of infertile sperm from asthenozoospermia and teratozoospermia patients with data from fertile sperm of normal individuals. We then screened a signature gene set and used it to develop a prediction model named Explainable Infertility Test (E-InfertilityTest), which classifies sperm as fertile or infertile at a preliminary level. For each prediction, the model also reports the set of genes that play a dominant role in that prediction. It thus provides a patient-specific profile of the dominant gene expression responsible for the aberration. Overall, this AI-based framework serves as a proof of concept for predicting the genetic basis of male infertility. Users can access E-InfertilityTest as a standalone version on GitHub: https://github.com/zglabDIB/einfertility.git.
Weighted Quantile Sum (WQS) regression is a statistical method for quantifying the association between multiple, possibly correlated predictors and a health outcome. It estimates both the joint effect of the predictors and their individual contributions to the total effect. WQS has become one of the most widely used approaches for investigating complex mixtures in environmental epidemiology, yet its implementation has been largely restricted to R users. In this paper we present wqsreg, the first Stata command for WQS regression, implemented for continuous, binary, and count outcomes. We describe the command's architecture and present an application to exposome data exploring the association between 38 exposures and a continuous outcome. The wqsreg command offers a user-friendly implementation of WQS regression that integrates several flexible components of the framework, such as bootstrap, training/validation splitting, and repeated holdout procedures. It returns regression estimates as well as graphical displays of the individual weights. It requires Stata version 11 or higher and is freely available on GitHub (https://github.com/PonzanoMarta/wqsreg). Given the increasing importance of appropriately exploring complex multidimensional exposures, this contribution will further promote the use of appropriate statistical methods in epidemiological settings with multiple correlated predictors.
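To make the idea of a weighted quantile sum concrete, the following sketch scores each predictor into quantiles and combines the scores with non-negative weights that sum to one. It is an illustration in Python rather than Stata and is not the wqsreg code: the function names are hypothetical, the weights are supplied by hand instead of being estimated by bootstrap, and quartile scoring is assumed as the default.

```python
import numpy as np


def quantile_scores(X, n_quantiles=4):
    """Score each column of X as 0..n_quantiles-1 using its empirical quantiles."""
    X = np.asarray(X, dtype=float)
    scores = np.empty(X.shape, dtype=int)
    # Interior cut probabilities, e.g. 0.25, 0.5, 0.75 for quartiles.
    probs = np.linspace(0.0, 1.0, n_quantiles + 1)[1:-1]
    for j in range(X.shape[1]):
        cuts = np.quantile(X[:, j], probs)
        scores[:, j] = np.searchsorted(cuts, X[:, j], side="right")
    return scores


def wqs_index(X, weights, n_quantiles=4):
    """Return the weighted quantile sum index for each observation.

    The weights must be non-negative and sum to 1, as in the WQS framework.
    """
    w = np.asarray(weights, dtype=float)
    if np.any(w < 0) or not np.isclose(w.sum(), 1.0):
        raise ValueError("weights must be non-negative and sum to 1")
    return quantile_scores(X, n_quantiles) @ w


if __name__ == "__main__":
    # Two toy exposures measured on eight subjects.
    X = np.column_stack([np.arange(1, 9), np.arange(8, 0, -1)])
    print(wqs_index(X, [0.7, 0.3]))
```

In a full analysis the index would enter a regression model as a single covariate, and each weight would indicate the relative contribution of its predictor to the joint effect.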
Organisms within ecological systems often engage in molecular interactions that mediate key biological processes, such as protein-protein interactions involved in host-pathogen recognition and symbiosis. Characterization of these interactions at a molecular level is essential for understanding the mechanistic, evolutionary, and functional basis of interspecies interactions, as well as for informing potential therapeutic interventions. However, progress in this field is significantly impeded by the lack of a comprehensive database of interacting species at molecular resolution and the limited availability of experimental data. We introduce the Interacting Species Database (ISDB), a comprehensive resource that catalogs interspecies interactions, annotated with NCBI taxonomic identifiers, interaction types and known molecular interactions. The ISDB encompasses 858,229 interacting species pairs and 171,713 interspecies protein-protein interactions within 261,287 organisms. ISDB is designed to support researchers in searching for, downloading, and depositing interspecies interaction data, which facilitates the study of ecological dynamics across diverse research domains. The ISDB is available via a web interface (https://www.elhabashylab.org/isdb), open-source code on GitHub (https://github.com/ElhabashyLab/ISDB) under the MIT license and is archived on Zenodo (Version v1.0.1, DOI: 10.5281/zenodo.20162385).
Protein dynamics are central to function, but experiments and molecular dynamics (MD) simulations remain costly, low-throughput, and difficult to compare across protocols. Scalable structure-based methods are needed to infer dynamics from static protein structures. We present a deep learning framework that predicts protein dynamics from 30-dimensional Gaussian integral (GI) descriptors of Cα backbone topology. Using 1,374 ATLAS protein chains with MD-derived RMSF, GI stratified proteins into fold-relevant clusters enriched for secondary structure, sequence homology, and ECOD families. An attention-based 1D-CNN classified flexible versus non-flexible proteins with test AUC = 0.772 and separated slow-mode- from fast-mode-dominated dynamics with AUC = 0.91. Regression models recovered mean RMSF (Pearson r = 0.72; R² = 0.46) and slow-mode RMSF more accurately (Pearson r = 0.83; R² = 0.62), supporting rapid inference of flexibility and collective-motion bias. Code and data are available on GitHub at: https://github.com/fvilicich/gaussian_integral/blob/main/gaussian_integral_classification.ipynb. Supplementary data are available at Bioinformatics online.
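For readers unfamiliar with the prediction target, the sketch below shows the standard definition of per-atom root-mean-square fluctuation (RMSF) computed from a trajectory. It is a generic illustration, not the authors' pipeline: the function name `rmsf` is hypothetical, and the trajectory is assumed to be already superposed on a reference structure and stored as an array of shape (frames, atoms, 3).

```python
import numpy as np


def rmsf(trajectory):
    """Per-atom RMSF for an aligned trajectory of shape (n_frames, n_atoms, 3)."""
    traj = np.asarray(trajectory, dtype=float)
    if traj.ndim != 3 or traj.shape[2] != 3:
        raise ValueError("expected an array of shape (n_frames, n_atoms, 3)")
    mean_positions = traj.mean(axis=0)                       # (n_atoms, 3)
    sq_dev = ((traj - mean_positions) ** 2).sum(axis=2)      # (n_frames, n_atoms)
    return np.sqrt(sq_dev.mean(axis=0))                      # (n_atoms,)


if __name__ == "__main__":
    # Atom 0 oscillates along x; atom 1 does not move.
    traj = np.array([
        [[0.0, 0.0, 0.0], [5.0, 5.0, 5.0]],
        [[2.0, 0.0, 0.0], [5.0, 5.0, 5.0]],
    ])
    print(rmsf(traj))
```

Averaging these per-atom values over a chain gives a mean RMSF of the kind a regression model could be trained to recover.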
Identifying gene-disease associations (GDAs) remains a fundamental challenge in biomedical research due to the enormous combinatorial space of candidate gene-disease pairs and the limited scalability of experimental validation. As wet lab studies cannot keep pace with rapidly expanding omics data, computational approaches have become essential for prioritizing plausible GDAs and accelerating biological discovery. Recent advances in artificial intelligence (AI), particularly graph neural networks (GNNs) and large language models (LLMs), are transforming this field by enabling richer biological representations and more accurate predictive modeling. In this survey, we provide a unified and up-to-date overview of AI-driven GDA prediction. We first summarize major public resources containing gene, disease, and auxiliary biological information that underpin computational studies. We then review methodological developments ranging from traditional network-based methods to machine learning, deep learning, and the emerging integration of GNNs and LLMs, which has received limited attention in previous GDA-focused surveys. Representative applications in gene prioritization, drug repurposing, and clinical research are also discussed to demonstrate the practical impact of these approaches. Finally, we outline current challenges and promising future directions. By integrating data resources, methodological advances, and translational applications, this survey provides a comprehensive overview of modern AI techniques for GDA prediction and aims to support the development of more robust, interpretable, and clinically actionable computational tools. All curated resources and reviewed literature are publicly available in our GitHub repository (last updated September 2025; including peer-reviewed publications and preprints on AI-driven GDA prediction published through September 2025): https://github.com/linyaoyang/gene-disease-association-prediction-papers.
Ancestral recombination graphs (ARGs) are increasingly central to modern population genetics, yet ARG-based methods for spatiotemporal demographic inference remain underutilized in empirical settings due to fragmented workflows and a lack of exploratory tools. ARGscape addresses this by providing a unified framework, seamlessly integrating established and novel tools for ARG simulation, manipulation, and spatiotemporal inference into both graphical and command-line interfaces. ARGscape features dynamic 2- and 3-dimensional visualizations and a novel "spatial diff" visualization for quantitative comparison of ARG-based geographic inference methods. By integrating these various functionalities, ARGscape facilitates novel data exploration and hypothesis generation, bridging the gap between methods development and empirical adoption, and enabling educational uses. ARGscape is available as a Python package on PyPI and as a live website for educational and simple demonstrative purposes at https://www.argscape.com. The source code and documentation are available on GitHub at https://github.com/chris-a-talbot/argscape. Supplementary data is available at Bioinformatics online.
Conflicting phylogenetic signals are common in plant phylogenomics and often reflect evolutionary histories shaped by processes like hybridization, incomplete lineage sorting, and whole-genome duplication (WGD). We aimed to identify and assess these complex processes in the hyper-diverse family Asteraceae to offer insight into the underlying causes of phylogenetic discordance. We used new and existing Hyb-Seq and transcriptome data to explore phylogenetic discordance by testing for nuclear/plastid incongruences, WGD, and reticulation. We present a tutorial detailing the execution of complex bioinformatic analyses to increase transparency, facilitate reproducibility, and support advancements in the field of plant evolution (https://github.com/erika-r-moore/Ellestad_etal_2025_APPS_Hybridizations). We uncovered extensive discordance among nuclear gene trees and deep reticulation events, particularly among South American lineages. Signals of WGD were found across the family but were often difficult to interpret, likely due to variation in data completeness, the complexity of the events, and their ancient origins. Our study and tutorial, along with a growing body of phylogenomic research, emphasize the role of reticulation and WGD in the evolution of large, diverse clades, while also underscoring the challenges. We anticipate continued advancements in theoretical approaches that will further enhance empirical studies in reticulate evolution.
A-liner is a flexible command-line tool for linear visualization of genome-scale sequence alignments, supporting outputs from multiple aligners and integrated visualization of annotations, highlights, quantitative tracks, and coordinate scales. It is applicable to a wide range of organisms, from bacteria to large eukaryotic genomes, and facilitates efficient generation of publication-ready comparative genome visualizations. The source code and example output files for a-liner are available in the GitHub repository: https://github.com/mokuno3430/a-liner. A-liner v1.1.0 has been archived on Zenodo at https://doi.org/10.5281/zenodo.19702001. Supplementary data are available at Bioinformatics online.
Accurate segmentation of multiple organs is essential for the diagnosis and treatment of head and neck cancer. However, the intricate anatomical structure and dense organ distribution in the head and neck region pose significant challenges for existing automated segmentation models, which predominantly target single organs and rely on single-modality imaging. Achieving comprehensive, one-step segmentation of organs-at-risk (OARs) remains challenging. To this end, we propose a Point-cloud Matrix Fusion-based Segmentation Model (PMFM) that leverages an improved multi-modal data fusion strategy for the automated full segmentation of OARs in head and neck cancer. The proposed PMFM involves three core modules: (1) a camera model-based 3D feature mapping and point-cloud extraction module (PEM) that enables vertical decoupling of modalities and objects; (2) a Point Cloud Matrix Module (PMM) utilizing PointNet and a virtual point cloud-based attention mechanism to facilitate horizontal association and global feature learning across modalities; and (3) a Cross Fusion Module (CFM) based on virtual point clouds to achieve deep intermodal object fusion and enhance inter-organ correlation. PMFM effectively integrates multi-modal image information, transforming it into a unified virtual point cloud matrix, and enables precise, comprehensive segmentation of OARs in head and neck cancer. Extensive validation and comparative experiments on the HaNSeg dataset demonstrate that PMFM significantly outperforms state-of-the-art methods, achieving an average Dice coefficient of 79.8% and an average Hausdorff distance of 2.47 mm. The source code for this study will be publicly available on GitHub at https://github.com/zhouxinyu1028/PMFM.
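The Dice coefficient reported above measures the overlap between a predicted mask and a reference mask. The sketch below shows the standard definition for one binary mask pair. It is a generic illustration and not taken from the PMFM code: the function name is hypothetical, and returning 1.0 when both masks are empty is a convention assumed here.

```python
import numpy as np


def dice_coefficient(pred, truth):
    """Dice overlap 2|A ∩ B| / (|A| + |B|) between two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    if pred.shape != truth.shape:
        raise ValueError("masks must have the same shape")
    denom = pred.sum() + truth.sum()
    if denom == 0:
        # Both masks empty: treated as perfect agreement (assumed convention).
        return 1.0
    return float(2.0 * np.logical_and(pred, truth).sum() / denom)


if __name__ == "__main__":
    pred = np.array([[1, 1], [0, 0]])
    truth = np.array([[1, 0], [0, 0]])
    print(dice_coefficient(pred, truth))
```

For multi-organ segmentation, a score of this kind is typically computed per organ label and then averaged.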
Publications on functional genomics are growing steadily, which creates both an opportunity and a challenge for mining information of interest, such as the annotation of proteins into specific functional classes or the degeneracy of their functions. Knowledge-driven approaches to functional annotation rest on mechanistic reasoning, and data-driven predictive models rest on deterministic features, but neither harnesses what has already been reported in the literature in different contexts. Natural language processing (NLP) combined with machine learning aims to bridge this gap. We earlier developed a method that uses protein features to predict, with reasonable accuracy, an intriguing functional property of DNA-binding proteins called moonlighting. However, the construction of training data and the use of functional information available in the literature have not been addressed well for this problem so far. Because moonlighting is a problem of functional redundancy, literature mining may provide a cross-study perspective and hence better prediction performance. Here we present an NLP-based model for literature mining that identifies moonlighting behaviour of proteins. A high-performing PubMed BERT model pre-trained on PubMed publications was further optimized by retraining on specific data sets, allowing accurate identification of moonlighting function in proteins. We show that this approach identifies moonlighting proteins with high accuracy and outperforms the first-principles approaches reported earlier. The methods are presented here for moonlighting behaviour of proteins but are scalable to any literature-mining problem in the biological domain. The data sets and code used in this work are provided in the GitHub repository https://github.com/Sciwhylab/moonlighting_nlp.
Low-field (LF) magnetic resonance imaging (MRI) plays a crucial role in assisting clinicians with rapid stroke diagnosis. However, its inherent limitations, such as low signal-to-noise ratio (SNR) and suboptimal image quality, make accurate stroke lesion identification challenging. To address this, we propose a Difference-Guided Conditional Diffusion Model (DGCD-3D) to enhance image quality of LF MRI while preserving structural integrity of stroke lesions. Specifically, the model incorporates a difference-adaptive forward diffusion process that guides the diffusion dynamics based on differences. During training, multi-scale intrinsic features from LF MRI and prior spatial information of stroke lesions are explicitly encoded into the generative process. Furthermore, a time-adaptive multi-loss optimization strategy dynamically balances pixel-wise and perceptual losses at different timesteps. DGCD-3D was evaluated on a large-scale, clinically scarce paired LF-HF MRI dataset (n = 974) acquired within a mean interval of 18.5 min. Experimental results demonstrate that DGCD-3D substantially improves LF MRI image quality (PSNR = 28.26, SSIM = 0.896) and achieves significantly higher consistency with HF MRI in stroke lesion assessment (Spearman's correlation coefficient ρ = 0.732) compared with the LF MRI (ρ = 0.680). Furthermore, a clinical authenticity assessment conducted by six experienced radiologists yielded a confusion rate of 51.6% and a confusion score of 5.64, further confirming the clinical reliability and broad applicability of the proposed approach. The codes and trained models will be released on GitHub: https://github.com/lihao9056/DGCD-3D.
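The PSNR figure quoted above follows the usual definition based on mean squared error. The sketch below computes it for a pair of images. It is a generic illustration rather than the DGCD-3D evaluation code: the function name is hypothetical, and intensities are assumed to be scaled to a known `data_range` (1.0 by default).

```python
import numpy as np


def psnr(reference, test, data_range=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(data_range^2 / MSE)."""
    reference = np.asarray(reference, dtype=float)
    test = np.asarray(test, dtype=float)
    if reference.shape != test.shape:
        raise ValueError("images must have the same shape")
    mse = np.mean((reference - test) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(data_range ** 2 / mse))


if __name__ == "__main__":
    hf = np.zeros((4, 4))            # stand-in for a high-field reference slice
    enhanced = np.full((4, 4), 0.1)  # stand-in for an enhanced low-field slice
    print(psnr(hf, enhanced))
```

Higher values indicate that the enhanced image is closer to the reference; the same function applies unchanged to 3-D volumes.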
Identifying robust microbial biomarkers is crucial for disease diagnosis and prediction, elucidation of biological mechanisms, and development of targeted therapies. Machine learning-based approaches, particularly the random forest model, have been widely used for biomarker identification during sample stratification. However, the biomarkers they identify often vary considerably across studies of the same disease, limiting their practical applicability. A robust framework for reliable biomarker identification in microbiome research is therefore needed. To address this gap, we propose a prevalence-aware feature selection framework (ParSlet) that incorporates a universal scaling relationship between taxon prevalence and selection frequency. We first identified a universal exponential scaling law linking the probability that a taxon is consistently recognized as a biomarker to its prevalence. We then integrated this scaling law and taxon prevalence into random forest-based biomarker identification. We systematically evaluated this approach on both simulated and real-world microbiome datasets and compared it with existing methods, finding that our integrated approach generally improved feature stability and the reproducibility of biomarker identification. In colorectal cancer (CRC) datasets, our method robustly identified well-established microbial biomarkers such as Ruminococcus, Clostridium_XVIII, and Faecalibacterium. Integrating a prevalence-based scaling adjustment into feature importance enhances the stability of microbiome biomarker identification. This approach holds promise for enabling more reliable disease diagnostics, uncovering generalizable microbial signatures across cohorts, and guiding the development of targeted microbiome-based interventions. ParSlet is available at https://github.com/KelabatOSU/Feature_selection. Supplementary data are available at Bioinformatics online.
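The two quantities related by the scaling law, taxon prevalence and selection frequency, can be computed as in the sketch below. This is an illustration under stated assumptions, not the ParSlet code: the function names are hypothetical, "selected" is assumed to mean ranking among the `top_k` most important features in a model fit, and the exponential adjustment itself is not reproduced because the abstract does not give its form.

```python
import numpy as np


def taxon_prevalence(abundance):
    """Fraction of samples in which each taxon is present (abundance > 0).

    `abundance` has shape (n_samples, n_taxa).
    """
    return (np.asarray(abundance) > 0).mean(axis=0)


def selection_frequency(importances, top_k):
    """Fraction of repeated model fits in which each taxon ranks in the top_k.

    `importances` has shape (n_runs, n_taxa), e.g. random forest feature
    importances from repeated fits on resampled data.
    """
    imp = np.asarray(importances, dtype=float)
    # Column indices of the top_k most important taxa in every run.
    top = np.argsort(-imp, axis=1)[:, :top_k]
    selected = np.zeros(imp.shape, dtype=bool)
    np.put_along_axis(selected, top, True, axis=1)
    return selected.mean(axis=0)


if __name__ == "__main__":
    abundance = np.array([[0, 1, 2], [0, 0, 3], [1, 0, 4], [0, 0, 5]])
    importances = np.array([[0.1, 0.2, 0.7], [0.3, 0.1, 0.6]])
    print(taxon_prevalence(abundance))
    print(selection_frequency(importances, top_k=2))
```

Plotting selection frequency against prevalence for many taxa is one way to inspect whether a relationship of the kind described above holds in a given dataset.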
Hybrid capture sequencing (Hyb-Seq) is a widely used approach in phylogenomics, providing efficient access to targeted genomic regions. However, deriving high-quality phylogenetic trees from raw sequencing reads requires extensive bioinformatics processing, which increases complexity, the risk of errors, and challenges in file management, especially for users unfamiliar with bioinformatics workflows. We developed HybSuite, a streamlined Bash-based bioinformatics pipeline built upon mainstream tools such as HybPiper 2, designed to simplify Hyb-Seq phylogenomic analysis from raw reads to species trees. Compared to existing tools (e.g., HybPiper 2, CAPTUS), it offers a modular yet integrated workflow covering all key steps, from data download from the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA), adapter removal, data assembly, and paralog handling to species tree inference and extensive in-depth analysis. We validated HybSuite by reconstructing a robust phylogeny for the Elaeagnaceae family, using the Angiosperms353 probe set and a dataset of 100 single-copy nuclear loci from Arabidopsis. HybSuite provides a flexible and user-friendly pipeline for Hyb-Seq phylogenomic analyses, and its high accuracy and efficiency were demonstrated through benchmarking with two empirical datasets. HybSuite is freely available at https://github.com/Yuxuanliu-HZAU/HybSuite. The pipeline is compatible with both Linux and macOS.
Major advances in Plasmodium sequencing approaches, bioinformatic pipelines, and data analysis tools have provided valuable insights into malaria epidemiology from parasite genomic data. However, translating genetic data into actionable information for decision-makers remains a challenge. Significant barriers limit the integration of these advances into a functional data analysis ecosystem that produces standardized, interpretable results for use by national malaria control programs. The Plasmodium Genomic Epidemiology network convened 18 subject matter experts across 15 institutions at the Reproducibility, Accessibility, Documentation, and Interoperability Standards Hackathon in 2023 to identify available analysis tools, evaluate software standards, improve documentation, and outline workflows. Eight use cases for genomic data were identified, and a subset was developed into analysis workflows comprising a series of connected functionalities. Software tools were then mapped against functionalities to outline a modular approach to data analysis for these use cases. In addition to outlining workflows, a set of objective criteria was developed for evaluating software standards. A total of 40 Plasmodium genomic analysis tools were identified, 22 of which were prioritized for software standards evaluation. Additional tutorials were developed for 10 tools in the form of reproducible code applied to shared datasets. These resources are available on PGEforge (mrc-ide.github.io/PGEforge), a new community resource that serves as a central, open repository for current and future resources for malaria genomic data analysis.
Functional magnetic resonance imaging (fMRI) provides a crucial window for understanding brain functional connectivity (FC) in psychiatric disorders, yet its complex spatiotemporal dynamics pose substantial challenges for modeling. Existing methods often rely on static FC, which makes it difficult to capture the dynamic plasticity of the brain, and they generally ignore structural differences across functional networks or discard informative weak connections through excessive sparsification. Here, we propose SPSGL, a biologically inspired deep learning framework designed to construct novel brain connectivity patterns from fMRI signals. SPSGL transforms voxel-wise time series into frequency-domain, feature-driven functional brain graphs and employs a biologically inspired gated edge-update mechanism to capture dynamic changes in connectivity strength. On this basis, core functional networks and whole-brain patterns are mapped as structural priors to explicitly guide multi-head attention in forming complementary subspace foci that emphasize neurobiologically meaningful connections. Combined with Orthonormal Clustering Readout (OCRead), our model achieves adaptive learning of multi-scale brain graph representations and functional parcellations. Across five psychiatry-related computational tasks, SPSGL demonstrates superior performance compared with existing approaches. Moreover, it identifies task-relevant functional connections and hub regions associated with aberrant coupling among the default mode, sensorimotor, and subcortical networks, highlighting potential neuroimaging biomarkers and uncovering brain network factors shared across diverse psychiatric conditions. Overall, SPSGL provides a unified, interpretable, and high-performing framework for fMRI-based brain connectivity analysis, advancing mechanistic understanding and potential clinical translation in mental health research. Our code is publicly available at https://github.com/zhaoqi106/SPSGL.
The online version contains supplementary material available at 10.1007/s13755-026-00467-6.
This work studies the challenging problem of acquiring high-quality underwater images via 4-D light field (LF) imaging. To this end, we propose GeoDiff-LF, a novel diffusion-based framework built upon SD-Turbo to enhance underwater 4-D LF imaging by leveraging its spatial-angular structure. GeoDiff-LF consists of three key adaptations: (1) a modified U-Net architecture with convolutional and attention adapters to model geometric cues, (2) a geometry-guided loss function using tensor decomposition and progressive weighting to regularize global structure, and (3) an optimized sampling strategy with noise prediction to improve efficiency. By integrating diffusion priors and LF geometry, GeoDiff-LF effectively mitigates color distortion in underwater scenes. Extensive experiments demonstrate that our framework outperforms existing methods across both visual fidelity and quantitative performance, advancing the state-of-the-art in enhancing underwater imaging. The code will be publicly available at https://github.com/linlos1234/GeoDiff-LF.
The early detection and classification of skin cancer are pivotal in improving patient outcomes and reducing healthcare burdens. However, traditional deep learning models in dermatological diagnostics often struggle with the nuanced differentiation of skin lesions. This paper introduces an approach that integrates an Advanced Heat Flow Layer into deep learning architectures for skin cancer classification. The method is centered on the principles of anisotropic diffusion and differs from conventional image processing techniques by selectively smoothing image areas while preserving critical edge details, which are essential for accurate lesion identification. We used the Ham10000 dataset, enriched with data augmentation to simulate real-world variability, and conducted a comprehensive comparison of our model, featuring the Advanced Heat Flow Layer, against several benchmark deep learning models, including one with a Sobel Edge Detection Layer. Our model, integrated with various layers of DenseNet121, consistently outperformed these benchmarks across key metrics such as accuracy, precision, recall, F1 score, and AUC, particularly with augmented data. This indicates a significant enhancement in the model's ability to generalize and to maintain critical diagnostic features under diverse conditions. Our code is available at https://github.com/sanadv/SkinCancerClassificationModels/blob/main/Models.ipynb.
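The edge-preserving smoothing described above can be illustrated with one explicit step of classical Perona-Malik anisotropic diffusion. This sketch is not the authors' Advanced Heat Flow Layer, whose details are not given in the abstract: the function names, the exponential conductance function, and the parameters `kappa` and `dt` are assumptions chosen for illustration.

```python
import numpy as np


def _conductance(diff, kappa):
    """Edge-stopping function: close to 1 for small differences, close to 0 at strong edges."""
    return np.exp(-((diff / kappa) ** 2))


def perona_malik_step(image, kappa=0.1, dt=0.2):
    """One explicit Perona-Malik update on a 2-D image with 4-neighbour differences.

    dt should not exceed 0.25 for the explicit 4-neighbour scheme to stay stable.
    """
    img = np.asarray(image, dtype=float)
    padded = np.pad(img, 1, mode="edge")  # replicated border gives zero flux at the boundary
    north = padded[:-2, 1:-1] - img
    south = padded[2:, 1:-1] - img
    west = padded[1:-1, :-2] - img
    east = padded[1:-1, 2:] - img
    flux = sum(_conductance(d, kappa) * d for d in (north, south, west, east))
    return img + dt * flux


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    noisy = 0.5 + 0.01 * rng.standard_normal((8, 8))
    smoothed = perona_malik_step(noisy)
    print(noisy.std(), smoothed.std())
```

Because the conductance vanishes where neighbouring intensities differ strongly, sharp lesion boundaries are left nearly untouched while low-contrast noise is averaged out.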
We introduce DruGUI 2.0, a drug discovery tool for assessing the druggability of proteins, integrated into the ProDy application programming interface (API). DruGUI 2.0 was developed to facilitate the search for druggable sites while accounting for proteins' conformational flexibility. Simulations in explicit solvent, with an option to include a membrane, are carried out in the presence of probe molecules selected from an expanded library of small molecules containing drug-like fragments. Druggable sites beyond orthosteric sites can be identified, as can the probes that bind to those sites with high affinity. Characterization of the composition and position of the probes helps build pharmacophore models and estimate relative binding affinities. As a Python module with enhanced visualization features, DruGUI 2.0 complements, and benefits from, the vast collection of protein sequence, structure, and dynamics analysis modules accessible in ProDy. Case studies in the Supplemental Material showcase the utility of DruGUI 2.0 applied to both soluble targets and membrane proteins. ProDy is open source and freely available under the MIT License from https://github.com/prody/ProDy. The code version of DruGUI 2.0 used for simulations is available on Zenodo (DOI: 10.5281/zenodo.20511357). Supplementary data are available at Bioinformatics online. A tutorial is available at http://www.bahargroup.org/prody/tutorials/drugui2_tutorial/index.html.