Idiopathic pulmonary fibrosis (IPF) is an intractable lung disease that belongs to idiopathic interstitial pneumonia (IIP) with limited therapeutic options. Conventional patient stratification approaches often fail to integrate diverse data modalities, particularly heterogeneous electronic medical records (EMR) containing mixed discrete and continuous values, with omics data, or fail to extract the interpretable many-to-many relationships crucial for precision medicine. We introduce subset binding (SB), a novel unsupervised algorithm that extends fuzzy association rule mining to robustly integrate heterogeneous clinical data (EMR) and omics data. This framework is uniquely designed to identify clinically meaningful patient subgroup patterns and discover associated molecular signatures based on observable symptoms rather than relying on ambiguous conventional diagnostic categories, such as IIPs. Applying SB to a dataset including 602 samples (from 403 IIPs including IPF patients and 39 healthy controls), we successfully identified 20 proteins linked with key IPF clinical features. Network-based pathway analysis nominated tyrosine kinases as critical drug target candidates, leading to the proposal of ponatinib, a multi-kinase inhibitor, as a candidate therapeutic. Functional validation using a TGF-β-induced epithelial-mesenchymal transition (EMT) model confirmed ponatinib's ability to at least partially suppress TGF-β-induced EMT. This inhibitory effect is consistent with the anti-fibrotic mechanism of the existing IPF drug, nintedanib, and reinforces prior evidence supporting ponatinib's anti-fibrotic property. This study demonstrates that SB enables transparent, reproducible, and robust, molecularly defined patient stratification from multimodal patient data. 
By establishing a data-driven framework that focuses on observation-based rules, this work lays the critical foundation for future prognostic validation and tailored treatment strategies, offering clinically actionable insights and therapeutic discovery in diagnostically ambiguous diseases like IPF, with ponatinib emerging as a compelling repurposing candidate.
Significance statement
Idiopathic pulmonary fibrosis (IPF) is a progressive lung disease with limited therapeutic options. IPF is classified as an idiopathic interstitial pneumonia (IIP), but distinguishing it from other IIPs is not straightforward. This ambiguity necessitates the identification of molecules associated with specific clinical features, rather than relying solely on diagnosis. Existing methods for multi-omics data analysis often fail to effectively integrate heterogeneous data, such as EMR (containing mixed discrete and continuous values) and omics, or to extract many-to-many molecular-phenotypic relationships. We developed subset binding (SB), a novel, interpretable unsupervised machine learning method that addresses these technical limitations by integrating EMR and omics data. Our approach successfully detected proteins in serum extracellular vesicles associated with IPF-related features, highlighted several tyrosine kinases as potential drug targets, and proposed the multi-kinase inhibitor ponatinib as a compelling candidate for drug repurposing. This data-driven framework establishes a scalable and interpretable foundation for biomarker and drug target discovery for intractable diseases whose mechanisms are not fully understood.
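The subset binding abstract above builds on fuzzy association rule mining. As a rough illustration of that underlying idea only (not the SB algorithm itself, whose details are not given here), the sketch below computes the standard fuzzy support and confidence of a rule, assuming each sample carries membership degrees in [0, 1] and that the minimum is used as the fuzzy AND. The feature names and toy values are hypothetical.

```python
def fuzzy_support(memberships, items):
    """Average, over samples, of the minimum membership across the given items."""
    total = 0.0
    for sample in memberships:
        total += min(sample[item] for item in items)
    return total / len(memberships)


def fuzzy_confidence(memberships, antecedent, consequent):
    """Fuzzy confidence of the rule antecedent -> consequent."""
    joint = fuzzy_support(memberships, antecedent + consequent)
    base = fuzzy_support(memberships, antecedent)
    return joint / base if base > 0 else 0.0


# Hypothetical toy data: one dict of membership degrees per patient sample.
samples = [
    {"feature_A": 0.9, "protein_X_high": 0.8},
    {"feature_A": 0.7, "protein_X_high": 0.9},
    {"feature_A": 0.1, "protein_X_high": 0.2},
    {"feature_A": 0.3, "protein_X_high": 0.1},
]
support = fuzzy_support(samples, ["feature_A", "protein_X_high"])
confidence = fuzzy_confidence(samples, ["feature_A"], ["protein_X_high"])
print(f"support={support:.3f} confidence={confidence:.3f}")
```

Using graded memberships rather than hard thresholds is what lets mixed discrete and continuous clinical values enter the same rule; how SB extends this to many-to-many relationships is beyond this sketch.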
Pulmonary fibrosis (PF) following severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection is a life-threatening complication. Despite growing concerns about PF after SARS-CoV-2 infection, early recognition remains challenging. Additionally, the role of changes in respiratory and intestinal microbiota in PF progression remains insufficiently understood. To address this gap, this study uses a multi-omics approach to analyze microbiota and clinical changes in PF patients following SARS-CoV-2 infection, developing a predictive model for PF progression with risk stratification to enable early interventions and improve outcomes. A total of 68 patients with confirmed SARS-CoV-2 infection were included in the study, divided into two subgroups: patients with PF (COVID-PF) and patients without PF (COVID-non PF). Metagenomic sequencing of bronchoalveolar lavage fluid (BALF) and fecal specimens was performed to profile respiratory and intestinal microbiota. Peripheral blood mononuclear cells (PBMCs) were collected for transcriptome sequencing. A random forest classifier was developed to predict PF risk based on integrated respiratory-intestinal microbiota profiles as well as clinical indicators. Our findings suggest that there are significant differences in the respiratory and intestinal microbiota between COVID-non PF and COVID-PF patients. Transcriptomic analysis of PBMCs revealed significant activation of immunomodulatory pathways associated with PF development. The machine learning model further allowed early PF risk stratification, demonstrating that changes in both microbiomes, along with clinical indicators, can predict the progression and prognosis of PF. Overall, these results offer new insights into disease and suggest options for early detection and personalized treatment strategies for PF in SARS-CoV-2-infected patients.
Drug combination therapy is a key strategy in cancer treatment, and accurately predicting synergistic drug pairs is crucial for improving therapeutic efficacy. While machine learning methods have advanced this task, their performance is often limited by two challenges: the strong cell line specificity of drug synergy and the poor generalization of models to unseen cellular contexts. Existing approaches tend to emphasize cell line-specific modeling but struggle to generalize across diverse biological domains. We propose PDTSyn, a domain generalization-driven framework that addresses these limitations through disentangled representation learning. PDTSyn treats each cell line as a distinct domain and separates drug representations into domain-invariant and domain-specific components. The parameter-decomposed transformer dynamically generates cell line-adaptive attention parameters from cell features, enabling flexible modeling of cell-specific drug-drug interactions while preserving shared pharmacological structure. To further enhance generalization, we introduce a dual regularization strategy: a cross-domain Kullback-Leibler-divergence loss that aligns invariant embeddings across cell lines, and a cell-line discriminative loss that enforces the specificity of domain-dependent representations. Comprehensive experiments on the O'Neil and NCI-ALMANAC datasets demonstrate that PDTSyn consistently outperforms state-of-the-art baselines under standard evaluation protocols. Moreover, PDTSyn maintains strong performance in challenging unseen cell line, unseen drug, and unseen drug pair settings, highlighting its robustness to distribution shifts. These results indicate that explicitly disentangling invariant and specific mechanisms provides an effective and generalizable solution for drug synergy prediction in heterogeneous biological environments.
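To make the cross-domain alignment idea in the PDTSyn abstract more concrete, here is a minimal sketch of a Kullback-Leibler-based penalty between the invariant embeddings of the same drug in two cell lines. It assumes the embeddings are turned into distributions with a softmax and that the divergence is symmetrized; both choices, and all function names, are illustrative assumptions rather than the published formulation.

```python
import math


def softmax(values):
    """Convert a real-valued embedding into a probability distribution."""
    shift = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def cross_domain_kl(embedding_a, embedding_b):
    """Symmetrized KL penalty; small when the two embeddings agree."""
    p = softmax(embedding_a)
    q = softmax(embedding_b)
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))


# Hypothetical invariant embeddings of one drug in two cell lines.
same = cross_domain_kl([0.2, 1.0, -0.5], [0.2, 1.0, -0.5])
different = cross_domain_kl([0.2, 1.0, -0.5], [1.5, -1.0, 0.3])
print(same, different)
```

Minimizing such a term pushes the invariant component to look the same in every cell line, which is the role the abstract assigns to the alignment loss; the discriminative loss on the domain-specific component would be a separate term.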
The B-cell receptor (BCR) repertoire encodes not only antigen-binding specificity but also intrinsic signatures reflecting B-cell functional states and differentiation trajectories. Deciphering the intricate sequence semantics embedded within these repertoires is pivotal for elucidating immune dynamics and expediting antibody discovery. Although single-cell sequencing provides high-resolution insights, its scalability and cost remain major obstacles, leaving population-level repertoire data underexploited. Furthermore, conventional bioinformatics approaches struggle to model the high-order, non-linear semantic dependencies inherent in antibody sequences. To address these challenges, we present BCRInsight, an antibody-specific pretrained language model that integrates a Transformer architecture with phenotype-aware contrastive learning. Pretrained on 80 million human BCR sequences, BCRInsight learns biologically meaningful contextual representations that encode subtle signatures of B-cell activation, maturation, and clonal evolution. Extensive benchmarking demonstrates that BCRInsight achieves state-of-the-art performance across multiple downstream tasks, particularly in paratope prediction. Further evaluation on diverse single-cell immune cohorts, including healthy, neoplastic, and viral infection states, reveals cross-scenario robustness and superior generalization relative to existing methods. Notably, attention-based analyses show that high-attention regions correspond closely to physical antigen-contact residues, highlighting emergent structural interpretability derived solely from self-supervised learning. Collectively, BCRInsight establishes a new paradigm for decoding the "language" of antibodies, offering a scalable and interpretable framework for computational immunology and rational antibody engineering.
Therapeutic antibody discovery is central to modern drug development, yet conventional methods such as hybridoma and phage display remain slow, inefficient, and costly. Computational approaches including site-saturation mutagenesis often yield limited affinity gains and expression liabilities, while deep learning and generative models expand sequence diversity but suffer from low validation rates. Here, we present a multi-scale computational screening pipeline inspired by key principles of in vivo immune selection. The framework integrates structure-based docking (ZDock), graph neural network-based interaction prediction, and accelerated molecular dynamics (MD) with metadynamics free-energy profiling to enable high-throughput in silico prioritization of structure-resolved antibodies. Applied to Activin A, a pleiotropic cytokine implicated in fibrosis, oncology, and muscle-wasting disorders, the platform screened ~5000 antibody structures and identified 11 candidates. Experimental validation confirmed two binders, with Ab4 exhibiting sub-nanomolar affinity (KD = 0.38 nM) and potent neutralizing activity, underscoring therapeutic potential in fibrodysplasia ossificans progressiva (FOP) and related diseases. Rather than performing full iterative affinity maturation, the present study focuses on the screening and repurposing stage, with affinity maturation positioned as a prospective extension. This work demonstrates the feasibility of integrating AI-driven interaction prediction with physics-based simulations to accelerate structure-guided antibody screening and repurposing, while conceptually paralleling selected stages of immune selection rather than fully recapitulating immune evolution.
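For readers who want to relate the reported dissociation constant to the free-energy scale used in simulations, the sketch below applies the standard relation ΔG = RT ln(KD / 1 M). The temperature of 298.15 K is an assumption, since the abstract does not state the measurement conditions, and the function name is illustrative.

```python
import math

GAS_CONSTANT_KCAL = 1.987204e-3  # gas constant in kcal / (mol * K)


def binding_free_energy(kd_molar, temperature_kelvin=298.15):
    """Standard binding free energy (kcal/mol) from a dissociation constant in mol/L."""
    return GAS_CONSTANT_KCAL * temperature_kelvin * math.log(kd_molar)


# KD = 0.38 nM as reported for Ab4; the temperature is assumed, not reported.
delta_g = binding_free_energy(0.38e-9)
print(f"{delta_g:.1f} kcal/mol")
```

Under these assumptions the value comes out at roughly -12.9 kcal/mol, which gives a sense of the magnitude a free-energy profile would need to resolve to rank sub-nanomolar binders.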
Atopic dermatitis (AD) involves complex metabolic-immune dysregulation, but the molecular links remain unclear. This study integrates a multilevel analytical framework to systematically investigate the metabolic-immune crosstalk in AD. Using linkage disequilibrium score regression and a two-step Mendelian randomization approach, we established genetic correlations and inferred causal relationships between plasma metabolites and inflammatory proteins, identifying 1-palmitoyl-2-arachidonoyl-GPC (PA-GPC) as a protective metabolite that exerts its effect primarily through downregulation of interleukin-18 receptor 1 (IL-18R1). Integration of single-cell transcriptomic data further revealed elevated IL-18R1 expression in T cells within the AD microenvironment and enabled stratification of T cells based on PA-GPC-associated metabolic activity, identifying 33 differentially expressed genes. Subsequent least absolute shrinkage and selection operator (LASSO) regression, combined with machine learning models and SHapley Additive exPlanations analysis, consistently prioritized CD9 as a key regulator. Functional validation showed that PA-GPC attenuates tumor necrosis factor-alpha (TNF-α)/interferon-gamma (IFN-γ)-induced inflammatory responses in human immortalized keratinocyte (HaCaT) cells and suppresses Th2 cytokine production in T cells. IL-18R1 knockdown reduced CD9 expression and Th2 cytokine production in T cells, whereas CD9 knockdown did not affect IL-18R1 expression, indicating that IL-18R1 acts upstream of CD9. Moreover, CD9 knockdown impaired T-cell viability, activation, and Th2 cytokine production. Collectively, these findings characterize metabolic-immune crosstalk in AD and identify a PA-GPC-IL-18R1-CD9 regulatory axis with potential therapeutic implications.
Disease-gene prediction (DGP) plays a pivotal role in understanding the genetic underpinnings of various diseases, offering insights for disease diagnosis, treatment, and prevention. Accurate identification of disease-related genes can enhance personalized medicine and the development of targeted therapies. While numerous methods for DGP have been proposed, a significant challenge remains in effectively capturing and modeling the complex relationships among biological entities, such as diseases, symptoms, genes, and pathways. These intricate interactions are essential for learning robust representations of phenotypes and genotypes, which are critical for accurate DGP. In this study, we introduce MELGene, a knowledge-enhanced multimodel ensemble learning framework for DGP. MELGene leverages an adaptive integration of multiple pretrained knowledge inference models based on knowledge graphs, effectively integrating the collective intelligence of diverse models to achieve more accurate gene predictions. The framework incorporates Model-aware Importance Learning, which dynamically adjusts the contributions of individual models, and introduces a dynamic ensemble mechanism to obtain robust consensus predictions. Finally, we conducted comprehensive experiments, including performance comparisons, which demonstrated the excellent performance of MELGene. Ablation experiments highlighted the positive impact of each module, while case studies on gastric, lung, and liver cancers showcased the reliability and biological relevance of the predictions, as supported by network medicine analysis, functional enrichment, and literature mining. MELGene offers a flexible framework for DGP through knowledge enhancement and adaptive ensemble learning, with broad potential for decoding disease mechanisms.
Artificial intelligence (AI) integrated with high-throughput assays offers a powerful route to accelerate discovery in relevant biological models. Functional cardiac imaging is a prime application, where deep learning (DL) and explainable AI (xAI) can overcome limitations of traditional phenotyping methods, such as manual analysis, subjective interpretation, and low scalability. In cardiovascular research, the zebrafish model is highly valuable due to its translational relevance and accessibility for high-throughput applications. Here, we present ZeCardioAI, a computational platform combining zebrafish experimental advantages with DL and xAI methodologies. The platform automatically extracts comprehensive cardiac phenotypes from live imaging, achieving high precision while maintaining interpretability, critical for mechanistic insight and translational validation. ZeCardioAI, when applied to zebrafish models of dilated and hypertrophic cardiomyopathy (CM), detected subtle yet clinically relevant phenotypic differences. Machine learning classifiers achieved robust separation of disease from healthy phenotypes, and xAI revealed discriminative features aligning with established clinical markers. Our developments should prove valuable in addressing the unmet medical need in CMs to find new, specific treatments. The platform's modular architecture supports future adaptation to diverse disease contexts beyond CMs, enabling large-scale, fully automated phenotyping at a throughput unattainable by manual approaches. ZeCardioAI establishes a new standard for AI-powered biological research, offering transformative potential for accelerating drug discovery, advancing precision medicine approaches, and deepening fundamental understanding of complex biological systems across multiple therapeutic areas.
Influenza A virus (IAV) poses a persistent threat to global public health due to its broad host adaptability, frequent antigenic variation, and potential for cross-species transmission. Accurate identification of IAV subtypes is essential for effective epidemic surveillance and precise disease control. Here, we present Influ-BERT, a domain-adaptive pretrained model based on the Transformer architecture. Optimized from DNABERT-2, Influ-BERT was developed using a dedicated corpus of ~900 000 influenza genome sequences. We constructed a custom Byte Pair Encoding tokenizer and employed a two-stage training strategy involving domain-adaptive pretraining followed by task-specific fine-tuning. This approach significantly enhanced identification performance for IAV subtypes. Experimental results demonstrate that Influ-BERT outperforms both traditional machine learning approaches and general genomic language models, such as DNABERT-2, Nucleotide Transformer, and MegaDNA, in the task of IAV subtype identification. The model consistently achieved F1-scores above 97% across five subtype classification tasks and exhibited stable performance gains for subtypes that are underrepresented in sequencing data, including H5N8, H1N2, and H13N6. Beyond subtype identification, Influ-BERT was successfully applied to additional tasks including respiratory virus identification, IAV pathogenicity prediction, and identification of IAV genomic fragments and functional genes, demonstrating robust performance throughout. Further interpretability analysis using sliding window perturbation confirmed that the model focuses on biologically significant genomic regions, providing insight into its improved predictive capability.
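The custom Byte Pair Encoding tokenizer mentioned in the Influ-BERT abstract can be pictured with a toy version of the generic BPE training loop: start from single nucleotides and repeatedly merge the most frequent adjacent pair. This is a minimal sketch of the textbook procedure on two made-up sequences, not the authors' tokenizer; the function names and the number of merges are assumptions.

```python
from collections import Counter


def most_frequent_pair(tokenized):
    """Return the most common adjacent token pair across all sequences."""
    counts = Counter()
    for tokens in tokenized:
        counts.update(zip(tokens, tokens[1:]))
    return counts.most_common(1)[0][0] if counts else None


def merge_pair(tokens, pair):
    """Replace every non-overlapping occurrence of pair with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def train_bpe(sequences, num_merges):
    """Learn up to num_merges merge rules, starting from single characters."""
    tokenized = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(tokenized)
        if pair is None:
            break
        merges.append(pair)
        tokenized = [merge_pair(tokens, pair) for tokens in tokenized]
    return merges, tokenized


# Hypothetical toy corpus; a real corpus would hold full genome segments.
merges, tokenized = train_bpe(["ATGATGCC", "ATGCC"], num_merges=2)
print(merges, tokenized)
```

On a domain corpus, frequent merges tend to capture recurring sequence motifs, which is one plausible reason a domain-specific vocabulary could help compared with a general genomic one.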
Spatial transcriptomics (ST) technologies have transformed our ability to examine gene expression within intact tissues, yet accurately identifying spatially variable genes (SVGs) remains challenging due to spatial heterogeneity, data sparsity, and incomplete modeling of domain-level dependencies. To address these limitations, we propose MLN2SVG, a domain-aware framework that integrates contrastive variational autoencoding with a multi-level neighbor (MLN) search algorithm to jointly learn tissue domains and SVGs. MLN2SVG constructs a weighted spatial graph to capture both local and long-range spatial relationships, employing a deep contrastive variational autoencoder to align augmented and original data representations while preserving biological diversity. The MLN algorithm dynamically expands neighborhood connectivity to mitigate sparsity and enhance domain coherence. Across multiple human and mouse ST datasets, including dorsolateral prefrontal cortex, breast cancer, and brain tissues, MLN2SVG consistently outperformed existing methods in clustering accuracy, robustness, and biological interpretability. Notably, in breast cancer tissues, MLN2SVG uncovers fine-grained spatial organization of tertiary lymphoid structures, delineating region-specific immune architectures spanning intratumoral, tumor-edge, and extratumoral compartments. Through the integration of spatial domain discovery and SVG detection, MLN2SVG delivers a robust and biologically interpretable framework for uncovering the molecular and structural complexity of tissue organization.
Quantifying how biological and chemical exposures reshape spatial gene regulation across tissues remains challenging due to technical and statistical constraints. Moreover, spatial transcriptomic comparisons are often hindered by tissue misalignment between conditions and the pervasive zero inflation of single-cell gene expression data. Existing differential expression approaches typically ignore spatial dependencies and fail to capture differential gene activation. We present Spatial-ZEDNet, a hierarchical Gaussian random field framework that jointly detects spatially differentially expressed genes (DEGs) and differentially activated genes (DAGs) while explicitly modeling zero inflation. Unlike previous tools, Spatial-ZEDNet aligns biological signals across conditions without requiring spatial coordinate matching, improving spatial inference robustness. In both simulations and real biological applications, Spatial-ZEDNet demonstrates superior power and specificity relative to standard methods and is robust in distinguishing DEGs from spatially variable genes. Applied to colitis and Plasmodium infection datasets, the method identified spatially localized expression and activation of immune genes, including Mmp7, Olr1, Ifitm3, and Gbp3, several of which correspond to known inflammatory disease loci, highlighting coordinated tissue-specific responses often missed by conventional methods. These findings demonstrate that explicitly modeling excess zeros improves the detection of spatially regulated activation states. Spatial-ZEDNet provides a statistically rigorous, interpretable framework for integrating spatial transcriptomic data across environmental and therapeutic exposures, advancing mechanistic understanding of exposure-induced tissue remodeling.
Advances in single-cell sequencing and Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR) technologies have enabled detailed case-control comparisons and experimental perturbations at single-cell resolution. However, uncovering causal relationships in observational genomic data remains challenging due to selection bias and inadequate adjustment for unmeasured confounders, particularly in heterogeneous datasets. To address these challenges, we introduce causarray, a robust causal inference framework for analyzing array-based genomic data at both pseudo-bulk and single-cell levels under unmeasured confounding. causarray integrates a generalized confounder adjustment method to account for unmeasured confounders and employs semiparametric inference with flexible machine learning techniques to ensure robust statistical estimation of treatment effects. Benchmarking results show that causarray robustly separates treatment effects from confounders while preserving biological signals across diverse settings. We also apply causarray to two single-cell genomic studies: (i) an in vivo Perturb-seq study of autism risk genes in developing mouse brains and (ii) a case-control study of Alzheimer's disease (AD) using three human brain transcriptomic datasets. In these applications, causarray identifies clustered causal effects of multiple autism risk genes and consistent causally affected genes across AD datasets, uncovering biologically relevant pathways directly linked to neuronal development and synaptic functions that are critical for understanding disease pathology.
Microbial communities function as dynamic societies where intercellular communication governs collective behaviors. However, mapping these interaction networks has remained a fundamental challenge in microbiology. This study aims to decode the social networks of complex bacterial communities at single-cell resolution by developing BACON, a computational framework that infers quorum sensing-mediated communication from single-microbe transcriptomic data. The approach combines a curated database of signaling systems with a statistical model that quantifies communication strength through coordinated expression of signal synthesis and receptor genes. Validation in model systems demonstrated BACON's precision in reconstructing density-dependent communication trajectories in Bacillus subtilis and capturing rapid network reorganization in Escherichia coli under antibiotic stress, revealing distinct sender-receiver subpopulations. Applied to human gut microbiomes, BACON unveiled diurnal fluctuations in cross-species signaling that transcend enterotype boundaries and uncovered conserved metabolic specialization in signal-responsive bacteria. In a clinical context, analysis of an ICU patient's gut microbiome revealed how Pseudomonas aeruginosa establishes a self-reinforcing communication circuit that upregulates virulence pathways. This work provides a unified framework for analyzing bacterial social interactions across diverse ecosystems. It opens new avenues for understanding microbial sociology, combating antimicrobial resistance, and engineering synthetic communities.
ABCB1, a polyspecific efflux transporter, mediates multidrug resistance in cancer by interacting with diverse substrates and inhibitors, yet its recognition mechanisms remain elusive. Here, we introduce an integrated framework that synergistically combines biophysics and computational biology to predict ABCB1 allocrite interactions and elucidate their mechanisms. We curated hierarchical-confidence bioactivity datasets from multi-source assays and developed MolMM, a convolutional neural network leveraging meta-learning on noisy data and multi-task learning on refined data, achieving AUC-ROC scores of 83.33% for inhibitors and 81.26% for substrates. SHapley Additive exPlanations (SHAP) analysis revealed key molecular features, highlighting competitive polar and hydrophobic motifs distinguishing substrates from inhibitors. Building on these ML insights, coarse-grained umbrella sampling simulations mapped these features onto free energy landscapes, proposing an amphiphilic model for substrate binding via a flip-flop process through the transmembrane pore and an inhibitory mechanism stabilizing ABCB1 in transitional conformations at the cavity's gate. This machine learning-molecular dynamics synergy offers mechanistic insights into ABCB1 polyspecificity, facilitating rational design of inhibitors to overcome multidrug resistance.
Cells regulate their functions through gene expression, driven by a complex interplay of transcription factors (TFs) and other regulatory mechanisms that together can be modeled as gene regulatory networks (GRNs). While the advent of single-cell sequencing has revolutionized our understanding of these networks, current GRN inference methods rely predominantly on expression data alone, overlooking the sequence semantic context of target genes and the intrinsic physicochemical properties of TFs. Consequently, the reconstructed networks are often riddled with false-positive connections, significantly compromising their reliability. To address these challenges, we propose CaHoT-GRN, a context-aware high-order topology learning framework for robust single-cell GRN inference. First, we leverage pretrained biological large language models to extract deep semantic embeddings from gene and protein sequences. This allows the model to explore the potential TF-target binding affinity within a latent semantic space. Second, to model cooperative regulatory mechanisms and capture high-order gene interactions, we construct a heterogeneous information network (HIN) via meta-path generation constrained by protein-protein interactions. Furthermore, we propose a similarity co-attention module to model the topological consistency between the prior GRNs and the HIN, thereby capturing long-range associations among genes. On single-cell transcriptomic datasets across four types of networks, CaHoT-GRN yielded an average AUC of 0.846 and an AUPR of 0.420, matching or outperforming existing methods. Moreover, downstream case studies, pathway analyses, and motif matching confirmed its high biological relevance. CaHoT-GRN is publicly available at https://github.com/ydkvictory/CaHoT-GRN.
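Because the CaHoT-GRN abstract reports both AUC and AUPR, it may help to see how the two metrics are computed for a ranked list of candidate TF-target edges. The sketch below uses the pairwise-ranking definition of AUC and average precision as a common estimator of AUPR; the toy labels and scores are hypothetical and unrelated to the reported results.

```python
def auroc(labels, scores):
    """Probability that a random true edge outranks a random false edge."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


def average_precision(labels, scores):
    """Mean of the precision values at the rank of each true edge."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, precision_sum = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


# Hypothetical predictions for five candidate TF-target edges (1 = true edge).
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.2]
auc_value = auroc(labels, scores)
ap_value = average_precision(labels, scores)
print(f"AUC={auc_value:.3f} AP={ap_value:.3f}")
```

In real GRN benchmarks true edges are usually a small fraction of all candidate pairs, so AUPR values are typically much lower than AUC values for the same ranking, which is consistent with the gap between the two figures quoted in the abstract.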
Bioinformatics tools are increasingly important for diagnostics in clinical care and precision medicine, but despite a very active bioinformatics research community, implementation and adaptation are slow. Drawing on multidisciplinary expertise, we have identified key systemic barriers on the journey from research to implementation, using the Danish healthcare ecosystem as the example. We find the main obstacles to be regulatory uncertainty, fragmented data access, and limited infrastructure for implementation. Cultural resistance to commercialization and workforce gaps further impede progress. We believe that these challenges reflect broader international trends and could be generally applicable. Consensus recommendations include centralized data and regulatory resources, cross-sector collaboration models, and pilot initiatives to support scalable implementation. These findings offer a roadmap for translating bioinformatics innovation into clinical practice.
Gene expression analysis has evolved substantially over the past 25 years, from early transcript surveys using expressed sequence tags and microarrays to RNA sequencing, and more recently to single-cell and spatial transcriptomics. These successive waves have expanded measurement scale and resolution, enabling systematic discovery of transcriptional programmes, inference of gene regulatory networks, and increasingly direct links between transcriptomic insight and therapeutic strategies that modulate gene expression. In this Perspective, we synthesize major methodological milestones with bibliometric trends in leading bioinformatics journals to describe four revolutions that redefined gene expression analysis. We also map widely used computational tools onto a common timeline by analysing 70 78 831 open-access full-text articles, illustrating how enduring statistical frameworks coexist with rapidly growing end-to-end analysis ecosystems. We highlight current challenges and emerging directions in core bioinformatics approaches for gene expression analysis. Looking ahead, we argue that the next era will be defined less by generating new datasets and more by organizing, searching, and reusing transcriptomic and multimodal information at scale. We propose three future directions: consortium-scale searchable transcriptomic knowledgebases, foundation models for gene expression analysis, and programmable regulatory design for engineered control of gene expression. The landscape of gene expression analysis is shifting from descriptive measurement towards queryable, predictive, and programmable gene expression biology.
Understanding how transcription factors (TFs) recognize DNA motifs is central to deciphering gene regulation. However, integrating multi-omics data, particularly DNA methylation, which can variably influence TF binding, remains a significant challenge. To address this, we developed BayesPI-Feature Learning Yard (BayesPI-FLY), a Bayesian neural network for de novo motif discovery that integrates DNA sequence information with DNA methylation status data. Building upon the classical biophysical model of TF-DNA interactions, BayesPI-FLY employs a two-layer inference architecture to jointly estimate model parameters and hyperparameters within a Bayesian framework. The core algorithms are implemented in C and parallelized through Python, ensuring computational efficiency. BayesPI-FLY quantitatively characterizes methylation effects at both single-nucleotide and motif levels, and generates position weight matrices and sequence logos to facilitate motif interpretation. Validation using synthetic and high-throughput sequencing datasets, including whole-genome bisulfite sequencing data, demonstrates that the framework can recapitulate known methylation-associated TF-binding patterns and infer strand-specific associations within the modeling framework. Collectively, BayesPI-FLY offers a versatile and extensible computational platform for characterizing methylation-related TF-DNA binding patterns across complex epigenetic contexts.
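The BayesPI-FLY abstract mentions position weight matrices as one of its outputs. As a generic illustration of what such a matrix encodes, the sketch below builds a log-odds PWM from a handful of aligned binding sites using simple counts with a pseudocount. This is the textbook count-based construction, not the Bayesian biophysical inference described in the abstract, and it ignores methylation; the example sites, the uniform background, and the function names are assumptions.

```python
import math

ALPHABET = "ACGT"


def position_weight_matrix(sites, pseudocount=1.0, background=0.25):
    """Log2-odds PWM from equal-length aligned sites (one dict per position)."""
    length = len(sites[0])
    total = len(sites) + pseudocount * len(ALPHABET)
    pwm = []
    for position in range(length):
        column = [site[position] for site in sites]
        row = {}
        for base in ALPHABET:
            frequency = (column.count(base) + pseudocount) / total
            row[base] = math.log2(frequency / background)
        pwm.append(row)
    return pwm


def score_sequence(pwm, sequence):
    """Sum of per-position log-odds; higher means closer to the motif."""
    return sum(row[base] for row, base in zip(pwm, sequence))


# Hypothetical aligned binding sites for an unnamed TF.
sites = ["ACGT", "ACGA", "ACGT", "TCGT"]
pwm = position_weight_matrix(sites)
print(score_sequence(pwm, "ACGT"), score_sequence(pwm, "TTTT"))
```

A methylation-aware model would need additional terms or an extended alphabet for methylated cytosines; how BayesPI-FLY handles this is not specified in the abstract, so it is left out here.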
Drug discovery is a time-consuming, expensive, and high-risk process. Recent advances in artificial intelligence (AI) have enabled major breakthroughs in small-molecule and protein therapeutics. However, AI-driven design of aptamer drugs remains largely unexplored. Aptamers are short (15-100 nt) single-stranded DNAs or RNAs that exhibit high binding affinity, high specificity, and low immunogenicity, making them promising candidates for disease (such as cancer) therapeutics. Compared with protein-ligand or protein-protein systems, protein-aptamer complexes are under-represented in public structural databases, and aptamers themselves are highly flexible and relatively large molecules. These characteristics present distinct challenges for AI-based structural modeling. Here, we systematically evaluate recent AI frameworks, including AlphaFold3, Chai-1, Boltz-2, and RoseTTAFold2NA, along with a template-based approach, in predicting protein-aptamer complex structures and estimating binding free energies. We establish an independent benchmark to assess their performance in structural accuracy, stability, and energetic consistency. This study provides a foundation for the application of AI in aptamer drug design and offers a reference framework for future research in nucleic-acid therapeutics and biomolecular modeling.
Chromatin looping, which facilitates the three-dimensional (3D) organization of the genome, is essential for the regulation of gene expression. This process relies on the interaction of numerous transcription factors (TFs), particularly CCCTC-binding factor (CTCF) and Cohesin, whose dynamic binding patterns orchestrate loop formation. Current computational methods for prediction of CTCF-mediated chromatin loops struggle to perform genome-wide predictions, primarily due to the extreme imbalance between positive and negative samples in training datasets. Existing DNA-sequence-based models often fail to capture the complex dynamics of TF binding and the regulatory code behind chromatin looping. To address these challenges, we present TF-loop, a novel TF regulatory language framework designed to predict chromatin loops. This framework conceptualizes TF sequences, defined by the binding positions and orientations of five key TFs, as a structured "TF language." Using the BERT model, TF-loop decodes the latent linguistic patterns embedded in these sequences, facilitating accurate predictions of chromatin loops. Comparative analysis with state-of-the-art models demonstrates that TF-loop significantly improves prediction accuracy across diverse cell types, even when faced with highly imbalanced datasets. The results highlight the potential of TF-loop to offer a new perspective on decoding the 3D structure of chromatin using natural language processing techniques.
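One way to picture the "TF language" idea in the TF-loop abstract is to turn the TF binding sites in a genomic window into an ordered list of tokens that encode which factor binds and in which orientation. The sketch below is a hypothetical encoding for illustration only: the token format, the window handling, and the placeholder factor name TF_B are assumptions, and the actual TF-loop vocabulary is not described in the abstract.

```python
def tf_sentence(binding_sites, region_start, region_end):
    """Order the sites inside [region_start, region_end) and emit one token per site.

    Each site is a (position, tf_name, strand) tuple; a token is the TF name
    followed by its strand, e.g. "CTCF+".
    """
    in_region = [s for s in binding_sites if region_start <= s[0] < region_end]
    in_region.sort(key=lambda s: s[0])
    return [f"{name}{strand}" for _, name, strand in in_region]


# Hypothetical binding sites; positions are arbitrary genomic coordinates.
sites = [
    (1200, "CTCF", "+"),
    (300, "TF_B", "-"),
    (5000, "CTCF", "-"),
    (800, "CTCF", "+"),
]
tokens = tf_sentence(sites, 0, 2000)
print(tokens)
```

A token sequence of this kind could then be fed to a BERT-style model in the same way as a sentence of words, which is the analogy the abstract draws.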