Protein-protein interactions (PPIs) are essential to understanding cellular mechanisms, signaling networks, disease processes, and drug development, as they represent the physical contacts and functional associations between proteins. Recent years have seen artificial intelligence (AI) methods achieve notable success in predicting PPIs. However, these approaches often handle the intricate web of relationships and mechanisms among proteins, drugs, diseases, ribonucleic acid (RNA), and protein structures in a fragmented or superficial manner. This is typically due to the limitations of non-end-to-end learning frameworks, which can lead to sub-optimal feature extraction and fusion, thereby compromising prediction accuracy. To address these deficiencies, this paper introduces a novel end-to-end learning model, the Knowledge Graph Fused Graph Neural Network (KGF-GNN). This model comprises three integral components: (1) Protein Associated Network (PAN) Construction: We begin by constructing a PAN that extensively captures the diverse relationships and mechanisms linking proteins with drugs, diseases, RNA, and protein structures. (2) Graph Neural Network for Feature Extraction: A Graph Neural Network (GNN) is then employed to distill both topological and semantic features from the PAN, alongside another GNN designed to extract topological features directly from observed PPI networks. (3) Multi-layer Perceptron for Feature Fusion: Finally, a multi-layer perceptron integrates these varied features through end-to-end learning, ensuring that the feature extraction and fusion processes are both comprehensive and optimized for PPI prediction. Extensive experiments conducted on real-world PPI datasets validate the effectiveness of our proposed KGF-GNN approach, which not only achieves high accuracy in predicting PPIs but also significantly surpasses existing state-of-the-art models.
This work not only enhances our ability to predict PPIs with a higher precision but also contributes to the broader application of AI in Bioinformatics, offering profound implications for biological research and therapeutic development.
As a central organizing principle of biology, bacteria and archaea are classified into a hierarchical structure across taxonomic ranks from kingdom to subspecies. Traditionally, this organization was based on observable characteristics of form and chemistry but recently, bacterial taxonomy has been robustly quantified using comparisons of sequenced genomes, as exemplified in the Genome Taxonomy Database (GTDB). Such genome-based taxonomies resolve genomes down to genera and species and are useful in many contexts yet lack the flexibility and resolution of a fine-grained approach. The Life Identification Number (LIN) approach is a common, quantitative framework to tie existing (and future) bacterial taxonomies together, increase the resolution of genome-based discrimination of taxa, and extend taxonomic identification below the species level in a principled way. Utilizing LINgroup as an organizational concept helps resolve some of the confusion and unforeseen negative effects resulting from nomenclature changes of microorganisms that are closely related by overall genomic similarity (often due to genome-based reclassification). Our experimental results demonstrate the value of LINs and LINgroups in mapping between taxonomies, translating between different nomenclatures, and integrating them into a single taxonomic framework. They also reveal the robustness of LIN assignment to hyper-parameter changes when considering within-species taxonomic groups.
Three-dimensional (3D) reconstruction in single-particle cryo-electron microscopy (cryo-EM) is a critical technique for recovering and studying the fine 3D structure of proteins and other biological macromolecules, where the primary issue is to determine the orientations of projection images with high levels of noise. This paper proposes a method to determine the orientations of cryo-EM projection images using reliable common lines and spherical embeddings. First, the reliability of common lines between projection images is evaluated using a weighted voting algorithm based on an iterative improvement technique and binarized weighting. Then, the reliable common lines are used to calculate the normal vectors and local -axis vectors of projection images after two spherical embeddings. Finally, the orientations of projection images are determined by aligning the results of the two spherical embeddings using an orthogonal constraint. Experimental results on both synthetic and real cryo-EM projection image datasets demonstrate that the proposed method can achieve higher accuracy in estimating the orientations of projection images and higher resolution in reconstructing preliminary 3D structures than some common line-based methods, indicating that the proposed method is effective in single-particle cryo-EM 3D reconstruction.
The multiple longest common subsequence (MLCS) problem, that is, finding the longest common subsequence of multiple sequences, is computationally intensive and challenging, with significant applications in fields such as text comparison, pattern recognition, and gene diagnosis. Dominant point-based MLCS algorithms are currently popular and extensively studied. In general, they construct a directed acyclic graph (DAG) of matching points and convert the MLCS problem into a search for the longest paths in the DAG. Several improvements have been made that focus on decreasing model size and reducing redundant computation, including 1) hash methods for eliminating duplicated nodes, 2) dynamic structures for supporting a smaller DAG, and 3) path pruning strategies. However, these algorithms remain limited on large-scale MLCS problems because 1) the dynamic structures are too time-consuming to maintain and 2) path pruning relies heavily on the tightness of the lower and upper bounds of the MLCS. As a result, the large-scale MLCS problem remains a challenge. We propose a novel algorithm for the large-scale MLCS problem, named dwMLCS. It is based on two models. The first is a dynamic DAG model that is both space and time efficient and can decrease the size of the DAG significantly. The second is a weighted DAG model with new successor strategies, with which we design an algorithm for finding a tighter lower bound of the MLCS. Path pruning is then conducted to further reduce the size of the DAG and eliminate redundant computation. Additionally, we propose an upper bound method for improving the efficiency of the path pruning strategy. The experimental results demonstrate that the proposed models and algorithms are more effective and efficient than state-of-the-art algorithms.
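The matching-point formulation described above can be illustrated with a small sketch. This is not dwMLCS: it is a naive successor-table search with memoization and no dominance or bound-based pruning, and the function and variable names are hypothetical. Each state is a tuple of positions (one per sequence), its successors are the next matching points for each shared character, and the MLCS length is the longest path through the resulting DAG.

```python
from functools import lru_cache


def mlcs_length(seqs):
    """Length of the longest subsequence common to every string in seqs."""
    # Only characters present in every sequence can appear in a common subsequence.
    alphabet = sorted(set(seqs[0]).intersection(*(set(s) for s in seqs[1:])))

    def successor_table(seq):
        # table[c][k] = smallest index j >= k with seq[j] == c, or None if absent.
        table = {c: [None] * (len(seq) + 1) for c in alphabet}
        for c in alphabet:
            nxt = None
            for k in range(len(seq) - 1, -1, -1):
                if seq[k] == c:
                    nxt = k
                table[c][k] = nxt
        return table

    tables = [successor_table(s) for s in seqs]

    @lru_cache(maxsize=None)
    def longest_from(point):
        # point[i] is the first position of seqs[i] that may still be used.
        best = 0
        for c in alphabet:
            successor = []
            for table, start in zip(tables, point):
                j = table[c][start]
                if j is None:
                    break  # c no longer occurs in one sequence: no matching point
                successor.append(j + 1)
            else:
                # A matching point exists for c; extend the path through it.
                best = max(best, 1 + longest_from(tuple(successor)))
        return best

    return longest_from(tuple(0 for _ in seqs))
```

The number of reachable states grows quickly with the number and length of the sequences, which is why the abstract emphasizes smaller DAGs and tighter bounds for pruning.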
The prediction of drug-target affinity (DTA) plays a crucial role in drug development and the identification of potential drug targets. In recent years, computer-assisted DTA prediction has emerged as a significant approach in this field. In this study, we propose a multi-modal deep learning framework called MMD-DTA for predicting drug-target binding affinity and binding regions. The model can predict DTA while simultaneously learning the binding regions of drug-target interactions through unsupervised learning. To achieve this, MMD-DTA first uses graph neural networks and a target structural feature extraction network to extract multi-modal information from the sequences and structures of drugs and targets. It then utilizes feature interaction and fusion modules to generate interaction descriptors for predicting DTA and interaction strengths for binding region prediction. Our experimental results demonstrate that MMD-DTA outperforms existing models on key evaluation metrics. Furthermore, external validation results indicate that MMD-DTA enhances the generalization capability of the model by integrating sequence and structural information of drugs and targets. The model trained on the benchmark dataset can effectively generalize to independent virtual screening tasks. The visualization of drug-target binding region prediction showcases the interpretability of MMD-DTA, providing valuable insights into the functional regions of drug molecules that interact with proteins.
Predicting biomolecular interactions is significant for understanding biological systems. Most existing methods for link prediction are based on graph convolution. Although graph convolution methods are advantageous in extracting structure information of biomolecular interactions, two key challenges still remain. One is how to consider both the immediate and high-order neighbors. Another is how to reduce noise when aggregating high-order neighbors. To address these challenges, we propose a novel method, called mixed high-order graph convolution with filter network via LSTM and channel attention (HGLA), to predict biomolecular interactions. Firstly, the basic and high-order features are extracted respectively through the traditional graph convolutional network (GCN) and the two-layer Higher-Order Graph Convolutional Architectures via Sparsified Neighborhood Mixing (MixHop). Secondly, these features are mixed and input into the filter network composed of LayerNorm, SENet and LSTM to generate filtered features, which are concatenated and used for link prediction. The advantages of HGLA are: 1) HGLA processes high-order features separately, rather than simply concatenating them; 2) HGLA better balances the basic features and high-order features; 3) HGLA effectively filters the noise from high-order neighbors. It outperforms state-of-the-art networks on four benchmark datasets.
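The idea of mixing immediate and high-order neighbors can be sketched in plain Python. This is only an illustration of concatenating features propagated by successive powers of a normalized adjacency matrix, in the spirit of MixHop; it omits the learned weights, the nonlinearity, and the LayerNorm/SENet/LSTM filter network of HGLA. Row normalization with self-loops is an assumption made for simplicity, and all names are hypothetical.

```python
def row_normalised_adjacency(edges, n):
    """Adjacency matrix with self-loops, each row scaled to sum to 1."""
    a = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for u, v in edges:
        a[u][v] = a[v][u] = 1.0
    return [[x / sum(row) for x in row] for row in a]


def matmul(a, b):
    """Plain-Python matrix product of two list-of-lists matrices."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def mixed_order_features(edges, features, max_power=2):
    """Concatenate X, A_hat X, ..., A_hat^max_power X for every node."""
    n = len(features)
    a_hat = row_normalised_adjacency(edges, n)
    blocks = [features]
    for _ in range(max_power):
        blocks.append(matmul(a_hat, blocks[-1]))
    return [[x for block in blocks for x in block[i]] for i in range(n)]


# Path graph 0 - 1 - 2 where only node 0 carries a non-zero feature.
mixed = mixed_order_features([(0, 1), (1, 2)], [[1.0], [0.0], [0.0]])
```

In this toy graph, node 2 receives a non-zero value only in the second-order block, because node 0 is two hops away.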
This review article delves deeply into the various machine learning (ML) methods and algorithms employed in discerning protein functions. Each method discussed is assessed for its efficacy, limitations, potential improvements, and future prospects. We present an innovative hierarchical classification system that arranges algorithms into intricate categories and unique techniques. This taxonomy is based on a tri-level hierarchy, starting with the methodology category and narrowing down to specific techniques. Such a framework allows for a structured and comprehensive classification of algorithms, assisting researchers in understanding the interrelationships among diverse algorithms and techniques. The study incorporates both empirical and experimental evaluations to differentiate between the techniques. The empirical evaluation ranks the techniques based on four criteria. The experimental assessments rank: (1) individual techniques under the same methodology sub-category, (2) different sub-categories within the same category, and (3) the broad categories themselves. Integrating the innovative methodological classification, empirical findings, and experimental assessments, the article offers a well-rounded understanding of ML strategies in protein function identification. The paper also explores techniques for multi-task and multi-label detection of protein functions, in addition to focusing on single-task methods. Moreover, the paper sheds light on the future avenues of ML in protein function determination.
Accurate prediction of drug-drug interactions (DDIs) plays an important role in improving the efficiency of drug development and ensuring the safety of combination therapy. Most existing models rely on a single source of information to predict DDIs, and few models can perform tasks on biomedical knowledge graphs. This paper proposes a new hybrid method, namely Knowledge Graph Representation Learning and Feature Fusion (KGRLFF), to fully exploit the information from the biomedical knowledge graph and molecular structure of drugs to better predict DDIs. KGRLFF first uses a Bidirectional Random Walk sampling method based on the PageRank algorithm (BRWP) to obtain higher-order neighborhood information of drugs in the knowledge graph, including neighboring nodes, semantic relations, and higher-order information associated with triple facts. Then, an embedded representation learning model named Knowledge Graph-based Cyclic Recursive Aggregation (KGCRA) is used to learn the embedded representations of drugs by recursively propagating and aggregating messages with drugs as both the source and destination. In addition, the model learns the molecular structures of the drugs to obtain the structured features. Finally, a Feature Representation Fusion Strategy (FRFS) was developed to integrate embedded representations and structured feature representations. Experimental results showed that KGRLFF is feasible for predicting potential DDIs.
Owing to their broad-spectrum and highly efficient antibacterial activity, antimicrobial peptides (AMPs) and their functions have been widely studied in the field of drug discovery. Using biological experiments to detect AMPs and their corresponding activities is costly, whereas computational technologies do so for much less. Currently, most computational methods treat the identification of AMPs and of their activities as two independent tasks, which ignores the relationship between them. Therefore, combining and sharing patterns between the two tasks is a crucial problem that needs to be addressed. In this study, we propose a deep learning model, called DMAMP, that detects AMPs and their activities simultaneously by means of multi-task learning. The first stage utilizes convolutional neural network models and residual blocks to extract shared hidden features from the two related tasks. The next stage uses two fully connected layers to learn the task-specific information of each task. Meanwhile, the original evolutionary features from the peptide sequence are also fed to the predictor of the second task to compensate for forgotten information. Experiments on the independent test dataset demonstrate that our method outperforms the single-task model by 4.28% in Matthews Correlation Coefficient (MCC) on the first task, and achieves an average MCC of 0.2627 over five activities on the second task, which is higher than that of the single-task model and two existing methods. To understand whether the features derived from the convolutional layers capture the differences between target classes, we visualize these high-dimensional features by projecting them into 3D space. In addition, we show that our predictor is able to identify peptides that are active against Severe Acute Respiratory Syndrome Coronavirus-2 (SARS-CoV-2). We hope that our proposed method can give new insights into the discovery of novel antiviral peptide drugs.
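For reference, the Matthews Correlation Coefficient used above can be computed from a binary confusion matrix as in the following sketch. The function name is hypothetical, and returning 0.0 when the denominator vanishes is a common convention assumed here rather than something stated in the abstract.

```python
from math import sqrt


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC from the four cells of a binary confusion matrix."""
    numerator = tp * tn - fp * fn
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Assumed convention: MCC is undefined here, so report 0.0.
        return 0.0
    return numerator / denominator
```

MCC ranges from -1 (total disagreement) through 0 (no better than chance) to +1 (perfect prediction), and it stays informative when the positive and negative classes are imbalanced.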
Recent advancements in spatial transcriptomics (ST) technologies have enabled the comprehensive measurement of gene expression profiles while preserving the spatial information of cells. Combining gene expression profiles and spatial information has been the most commonly used approach to identifying spatial functional domains and genes. However, most existing spatial domain deciphering methods focus mainly on spatially neighboring structures and fail to balance the self-characteristics of spots against their spatial structure dependency. Therefore, we propose a novel model called SpaGCAC, which recognizes spatial domains with the help of an adaptive feature-spatial balanced graph convolutional network named AFSBGCN. The AFSBGCN can dynamically learn the relationship between spatial local topology structures and the self-characteristics of spots by adaptively increasing or decreasing the weight on the self-characteristics during message aggregation. Moreover, to better capture the local structures of spots, SpaGCAC exploits a local topology structure contrastive learning strategy. Meanwhile, SpaGCAC utilizes a probability distribution contrastive learning strategy to increase the similarity of probability distributions for points belonging to the same category. We validate the performance of SpaGCAC for spatial domain identification on four spatial transcriptomic datasets. In comparison with seven spatial domain recognition methods, SpaGCAC achieved the highest NMI median of 0.683 and the second highest ARI median of 0.559 on the multi-slice DLPFC dataset. SpaGCAC achieved the best results on all three other single-slice datasets. These results show that SpaGCAC outperforms most existing methods, providing enhanced insights into tissue heterogeneity.
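As a reminder of what the ARI values above measure, the following sketch computes the adjusted Rand index between a reference labeling and a predicted clustering using the standard pair-counting formula. The function name is hypothetical, at least two spots are assumed, and returning 1.0 in the degenerate case where the denominator vanishes is an assumed convention.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Pair-counting ARI between two labelings of the same items (n >= 2)."""
    n = len(labels_true)
    # Pairs of items that share a cluster in both labelings.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    # Pairs that share a cluster in each labeling taken separately.
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    maximum = (sum_rows + sum_cols) / 2
    if maximum == expected:
        # Degenerate case, e.g. every item in one cluster in both labelings.
        return 1.0
    return (sum_cells - expected) / (maximum - expected)
```

The score is invariant to how cluster labels are named, so a perfect clustering scores 1.0 even when its label ids differ from the reference annotation.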
CircRNA is closely related to human disease, so it is important to predict circRNA-disease associations (CDAs). However, traditional biological detection methods are difficult and have low accuracy, and computational methods represented by deep learning neglect the ability of the model to explicitly extract local depth information of CDAs. We propose a model based on a knowledge graph with recursion and attention aggregation for circRNA-disease association prediction (KGRACDA). This model combines explicit structural features and implicit embedding information of graphs, optimizing the graph embedding vectors. First, we build large-scale, multi-source heterogeneous datasets and construct a knowledge graph of multiple RNAs and diseases. We then use a recursive method to build multi-hop subgraphs and optimize the graph attention mechanism with a gating mechanism, mining local depth information. At the same time, the model uses a multi-head attention mechanism to balance the global and local depth features of graphs and to generate CDA prediction scores. KGRACDA surpasses other methods by capturing local and global depth information related to CDAs. We also update the interactive web platform HNRBase v2.0, which visualizes circRNA data and allows users to download data and predict CDAs using the model.
Investigating the associations between circRNA and diseases is vital for comprehending the underlying mechanisms of diseases and formulating effective therapies. Computational prediction methods often rely solely on known circRNA-disease data, indirectly incorporating other biomolecules' effects by computing circRNA and disease similarities based on these molecules. However, this approach is limited, as other biomolecules also play significant roles in circRNA-disease interactions. To address this, we construct a comprehensive heterogeneous network incorporating data on human circRNAs, diseases, and other biomolecule interactions to develop a novel computational model, circ2DGNN, which is built upon a heterogeneous graph neural network. circ2DGNN directly takes heterogeneous networks as inputs and obtains the embedded representation of each node for downstream link prediction through graph representation learning. circ2DGNN employs a Transformer-like architecture, which computes a heterogeneous attention score for each edge and performs message propagation and aggregation, using a residual connection to enhance the representation vector. It uniquely applies the same parameter matrix only to identical meta-relationships, reflecting diverse parameter spaces for different relationship types. After fine-tuning hyperparameters via five-fold cross-validation, evaluation conducted on a test dataset shows that circ2DGNN outperforms existing state-of-the-art (SOTA) methods.
Categorical attributes are common in many classification tasks, presenting certain challenges as the number of categories grows. This situation can affect data handling, negatively impacting the building time of models, their complexity and, ultimately, their classification performance. In order to mitigate these issues, this research proposes a novel preprocessing technique for grouping attribute categories in classification datasets. This approach combines the exact representation of the association between categorical values in a Euclidean space, clustering methods and attribute quality metrics to group similar attribute categories based on their contribution to the classification task. To estimate its effectiveness, the proposal is evaluated within the context of HIV-1 protease cleavage site prediction, where each attribute represents an amino acid that can take multiple possible values. The results obtained on HIV-1 real-world datasets show a significant reduction in the number of categories per attribute, with an average reduction percentage ranging from 74% to 81%. This reduction leads to simplified data representations and improved classification performances compared to not preprocessing. Specifically, improvements of up to 0.07 in accuracy and 0.19 in geometric mean are observed across different datasets and classification algorithms. Additionally, extensive simulations on synthetic datasets with varied characteristics are carried out, providing consistent and reliable results that validate the robustness of the proposal. These findings highlight the capability of the developed method to enhance cleavage prediction, which could potentially contribute to understanding viral processes and developing targeted therapeutic strategies.
In drug discovery, identifying drug-target interactions (DTIs) via experimental approaches is a tedious and expensive procedure. Computational methods efficiently predict DTIs and recommend a small part of potential interacting pairs for further experimental confirmation, accelerating the drug discovery process. Although fusing heterogeneous drug and target similarities can improve the prediction ability, the existing similarity combination methods ignore the interaction consistency for neighbour entities. Furthermore, area under the precision-recall curve (AUPR) and area under the receiver operating characteristic curve (AUC) are two widely used evaluation metrics in DTI prediction. However, the two metrics are seldom considered as losses within existing DTI prediction methods. We propose a local interaction consistency (LIC) aware similarity integration method to fuse vital information from diverse views for DTI prediction models. Furthermore, we propose two matrix factorization (MF) methods that optimize AUPR and AUC using convex surrogate losses respectively, and then develop an ensemble MF approach that takes advantage of the two area under the curve metrics by combining the two single metric based MF models. Experimental results under different prediction settings show that the proposed methods outperform various competitors in terms of the metric(s) they optimize and are reliable in discovering potential new DTIs.
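To make the relation between AUC and a convex surrogate loss concrete, the sketch below computes AUC as the fraction of correctly ordered (interacting, non-interacting) pairs, together with a pairwise hinge loss over the same pairs. The hinge loss is only an assumed example of a convex surrogate; the abstract does not say which surrogates the proposed MF methods use, and the function names are hypothetical.

```python
def pairwise_auc(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def pairwise_hinge_loss(scores, labels, margin=1.0):
    """Convex upper bound on the pairwise ranking error, i.e. on 1 - AUC (when margin >= 1)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    total = sum(max(0.0, margin - (p - q)) for p in pos for q in neg)
    return total / (len(pos) * len(neg))
```

Because the hinge term is at least 1 for every misordered pair, driving this loss down pushes AUC up, while the loss itself stays convex in the scores and is therefore amenable to gradient-based optimization.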
MicroRNAs (miRNAs) play a significant role in cell differentiation, biological development as well as the occurrence and growth of diseases. Although many computational methods contribute to predicting the association between miRNAs and diseases, they do not fully explore the attribute information contained in associated edges between miRNAs and diseases. In this study, we propose a new method, Hierarchical Hypergraph learning in Association-Weighted heterogeneous network for MiRNA-Disease association identification (HHAWMD). HHAWMD first adaptively fuses multi-view similarities based on channel attention and distinguishes the relevance of different associated relationships according to changes in expression levels of disease-related miRNAs, miRNA similarity information, and disease similarity information. Then, HHAWMD assigns edge weights and attribute features according to the association level to construct an association-weighted heterogeneous graph. Next, HHAWMD extracts the subgraph of the miRNA-disease node pair from the heterogeneous graph and builds the hyperedge (a kind of virtual edge) between the node pair to generate the hypergraph. Finally, HHAWMD proposes a hierarchical hypergraph learning approach, including node-aware attention and hyperedge-aware attention, which aggregates the abundant semantic information contained in deep and shallow neighborhoods to the hyperedge in the hypergraph. Our experiment results suggest that HHAWMD has better performance and can be used as a powerful tool for miRNA-disease association identification.
AutoDock Vina and its derivatives have established themselves as a prevailing pipeline for virtual screening in contemporary drug discovery. Our Vina-GPU method leverages the parallel computing power of GPUs to accelerate AutoDock Vina, and Vina-GPU 2.0 further enhances the speed of AutoDock Vina and its derivatives. Given the prevalence of large virtual screens in modern drug discovery, the improvement of speed and accuracy in virtual screening has become a longstanding challenge. In this study, we propose Vina-GPU 2.1, aimed at enhancing the docking speed and precision of AutoDock Vina and its derivatives through the integration of novel algorithms to facilitate improved docking and virtual screening outcomes. Building upon the foundations laid by Vina-GPU 2.0, we introduce a novel algorithm, namely Reduced Iteration and Low Complexity BFGS (RILC-BFGS), designed to expedite the most time-consuming operation. Additionally, we implement grid cache optimization to further enhance the docking speed. Furthermore, we employ optimal strategies to individually optimize the structures of ligands, receptors, and binding pockets, thereby enhancing the docking precision. To assess the performance of Vina-GPU 2.1, we conduct extensive virtual screening experiments on three prominent targets, utilizing two fundamental compound libraries and seven docking tools. Our results demonstrate that Vina-GPU 2.1 achieves an average 4.97-fold acceleration in docking speed and an average 342% improvement in EF1% compared to Vina-GPU 2.0.
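The EF1% figure quoted above is an enrichment factor: the hit rate among the top 1% of ranked compounds divided by the hit rate in the whole library. A minimal sketch follows, assuming that lower (more negative) docking scores rank first, as is conventional for Vina-style scores, and that the top fraction is rounded to at least one compound. The function name and the rounding rule are assumptions, not details from the paper.

```python
def enrichment_factor(scores, is_active, fraction=0.01):
    """Enrichment factor of known actives in the top-ranked fraction of a screen."""
    n = len(scores)
    n_top = max(1, round(n * fraction))  # assumed rounding rule
    # Rank compounds so that the most negative docking score comes first.
    ranked = sorted(range(n), key=lambda i: scores[i])
    hits_top = sum(1 for i in ranked[:n_top] if is_active[i])
    total_actives = sum(1 for flag in is_active if flag)
    if total_actives == 0:
        raise ValueError("enrichment factor is undefined without any actives")
    return (hits_top / n_top) / (total_actives / n)
```

An EF1% of 1 means the top 1% is no richer in actives than a random selection, and larger values indicate better early recognition.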
Relation extraction, a crucial task in understanding the intricate relationships between entities in biomedical domains, has predominantly focused on binary relations within single sentences. However, in practical biomedical scenarios, relationships often extend across multiple sentences, leading to extraction errors with potential impacts on clinical decision-making and medical diagnosis. To overcome this limitation, we present a novel cross-sentence relation extraction framework that integrates and enhances coreference resolution and relation extraction models. Coreference resolution serves as the foundation, breaking sentence boundaries and linking entities across sentences. Our framework incorporates pre-trained deep language representations and leverages graph LSTMs to effectively model cross-sentence entity mentions. The use of a self-attentive Transformer architecture and external semantic information further enhances the modeling of intricate relationships. Comprehensive experiments conducted on two standard datasets, namely the BioNLP dataset and THYME dataset, demonstrate the state-of-the-art performance of our proposed approach.
Time series RNASeq studies can enable understanding of the dynamics of disease progression and treatment response in patients. They also provide information on biomarkers, activated and repressed pathways, and more. While useful, data from multiple patients are challenging to integrate because of the heterogeneity in treatment response among patients and the small number of time points that are usually profiled. Because of this heterogeneity, relying on the sampled time points to integrate data across individuals does not lead to a correct reconstruction of the response patterns. To address these challenges, we developed a new constraint-based pseudo-time ordering method for analyzing transcriptomics data in clinical and response studies. Our method assigns samples to their correct placement on the response curve while respecting the individual patient order. We use polynomials to represent gene expression over the duration of the study and an EM algorithm to determine parameters and locations. Application to four treatment response datasets shows that our method improves on prior methods and leads to accurate orderings that provide new biological insight into the disease and response.
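The order constraint described above can be illustrated with a simplified sketch for one gene and one patient. Given a polynomial response curve evaluated on a grid of pseudo-times, dynamic programming places the patient's samples on the grid so that the squared error is minimal and the placements never go backwards in time. This is only an illustration of the constraint, not the authors' EM procedure; the names, the grid, and the single-gene setting are all assumptions.

```python
def evaluate_polynomial(coeffs, t):
    """Horner evaluation; coeffs are ordered from the constant term upwards."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * t + c
    return result


def assign_ordered(samples, curve):
    """Non-decreasing grid indices for the samples that minimise squared error."""
    m = len(curve)
    cost = [(samples[0] - curve[t]) ** 2 for t in range(m)]
    back = []
    for x in samples[1:]:
        new_cost, pointers = [], []
        best_t, best_c = 0, float("inf")
        for t in range(m):
            # Best placement of the previous sample at any grid index <= t.
            if cost[t] < best_c:
                best_c, best_t = cost[t], t
            new_cost.append(best_c + (x - curve[t]) ** 2)
            pointers.append(best_t)
        cost = new_cost
        back.append(pointers)
    # Trace the optimal placements back from the last sample.
    t = min(range(m), key=lambda i: cost[i])
    path = [t]
    for pointers in reversed(back):
        t = pointers[t]
        path.append(t)
    return path[::-1]


# Hypothetical linear response curve on the grid 0..4.
curve = [evaluate_polynomial([0.0, 1.0], t) for t in range(5)]
placement = assign_ordered([0.1, 2.2, 3.9], curve)
```

In a full method the curve parameters and the placements would be re-estimated alternately across many genes and patients, which is the role the abstract assigns to the EM algorithm.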
Biomedical coreference resolution focuses on identifying coreferences in biomedical texts and normally consists of two parts: (i) mention detection, which identifies the textual representations of biological entities, and (ii) finding their coreference links. Recently, a popular approach to enhancing the task has been to embed a knowledge base into deep neural networks. However, the way in which these methods integrate knowledge means that it may play a larger role in mention detection than in coreference resolution. Specifically, they tend to integrate knowledge prior to mention detection, as part of the embeddings. Moreover, they focus primarily on mention-dependent knowledge (KBase), i.e., knowledge entities directly related to mentions, while ignoring the correlated knowledge (K+) between the mentions in a mention pair. For mentions with significant differences in word form, this may limit their ability to extract potential correlations between those mentions. This paper therefore develops a novel model that integrates both KBase and K+ entities and achieves state-of-the-art performance on the BioNLP and CRAFT-CR datasets. Empirical studies on detecting mentions of different lengths reveal the effectiveness of the KBase entities. The evaluation on cross-sentence and match/mismatch coreference further demonstrates the superiority of the K+ entities in extracting potential background correlations between mentions.
In the past decade, Artificial Intelligence (AI) driven drug design and discovery has been a hot research topic in the AI area, where an important branch is molecule generation by generative models, from GAN-based models and VAE-based models to the latest diffusion-based models. However, most existing models mainly pursue basic properties of the generated molecules, such as validity and uniqueness, and only a few go further to explicitly optimize a single important molecular property (e.g., QED or PlogP), which leaves most generated molecules of little practical use. In this paper, we present a novel approach to generating molecules with desirable properties, which expands the diffusion model framework with multiple innovative designs. The novelty is two-fold. On the one hand, considering that the structures of molecules are complex and diverse, and that molecular properties are usually determined by certain substructures (e.g., pharmacophores), we propose to perform diffusion on two structural levels, molecules and molecular fragments, from which a mixed Gaussian distribution is obtained for the reverse diffusion process. To obtain desirable molecular fragments, we develop a novel electronic effect based fragmentation method. On the other hand, we introduce two ways to explicitly optimize multiple molecular properties under the diffusion model framework. First, as potential drug molecules must be chemically valid, we optimize molecular validity with an energy-guidance function. Second, since potential drug molecules should be desirable in various properties, we employ a multi-objective mechanism to optimize multiple molecular properties simultaneously. Extensive experiments on two benchmark datasets, QM9 and ZINC250k, show that the molecules generated by our proposed method have better validity, uniqueness, novelty, Fréchet ChemNet Distance (FCD), QED, and PlogP than those generated by current SOTA models.