We cloned, expressed and purified the Escherichia coli yhbO gene product, which is a homolog of the Bacillus subtilis general stress protein 18 (the yfkM gene product), the Pyrococcus furiosus intracellular protease PfpI, and the human Parkinson disease protein DJ-1. The gene coding for YhbO was generated by amplifying the yhbO gene from E. coli by polymerase chain reaction. It was inserted into the expression plasmid pET-21a, under the transcriptional control of the bacteriophage T7 promoter and lac operator. A BL21(DE3) E. coli strain transformed with the YhbO-expression vector pET-21a-yhbO accumulates large amounts of a soluble protein that migrates at 20 kDa in SDS-PAGE, matching the expected YhbO molecular weight. YhbO was purified to homogeneity by HPLC DEAE ion exchange chromatography and hydroxylapatite chromatography, and its identity was confirmed by N-terminal sequencing and mass spectrometry analysis. The native protein exists in monomeric, trimeric and hexameric forms.
The intrinsic stochasticity of gene expression can lead to large variability in protein levels for genetically identical cells. Such variability in protein levels can arise from infrequent synthesis of mRNAs which in turn give rise to bursts of protein expression. Protein expression occurring in bursts has indeed been observed experimentally and recent studies have also found evidence for transcriptional bursting, i.e. production of mRNAs in bursts. Given that there are distinct experimental techniques for quantifying the noise at different stages of gene expression, it is of interest to derive analytical results connecting experimental observations at different levels. In this work, we consider stochastic models of gene expression for which mRNA and protein production occurs in independent bursts. For such models, we derive analytical expressions connecting protein and mRNA burst distributions which show how the functional form of the mRNA burst distribution can be inferred from the protein burst distribution. Additionally, if gene expression is repressed such that observed protein bursts arise only from single mRNAs, we show how observations of protein burst distributions (repres
Biological AI models increasingly predict complex cellular responses, yet their learned representations remain disconnected from the molecular processes they aim to capture. We present CDT-III, which extends mechanism-oriented AI across the full central dogma: DNA, RNA, and protein. Its two-stage Virtual Cell Embedder architecture mirrors the spatial compartmentalization of the cell: VCE-N models transcription in the nucleus and VCE-C models translation in the cytosol. On five held-out genes, CDT-III achieves per-gene RNA r=0.843 and protein r=0.969. Adding protein prediction improves RNA performance (r=0.804 to 0.843), demonstrating that downstream tasks regularize upstream representations. Protein supervision sharpens DNA-level interpretability, increasing CTCF enrichment by 30%. Analysis of experimentally measured mRNA and protein responses reveals that the majority of genes with observable mRNA changes show opposite protein-level changes (66.7% at |log2FC|>0.01, rising to 87.5% at |log2FC|>0.02), exposing a fundamental limitation of RNA-only perturbation models. Despite this pervasive direction discordance, CDT-III correctly predicts both mRNA and protein responses. Appli
Proteins congregate into complexes to perform fundamental cellular functions. Phenotypic outcomes, in health and disease, are often mechanistically driven by the remodeling of protein complexes by protein-coding mutations or cellular signaling changes in response to molecular cues. Here, we present an affinity purification mass spectrometry (APMS) proteomics protocol to quantify and visualize global changes in protein-protein interaction (PPI) networks between pairwise conditions. We describe steps for expressing affinity-tagged bait proteins in mammalian cells, identifying purified protein complexes, quantifying differential PPIs, and visualizing differential PPI networks. Specifically, this protocol details steps for designing affinity-tagged bait gene constructs, transfection, affinity purification, mass spectrometry sample preparation, data acquisition, database search, data quality control, PPI confidence scoring, cross-run normalization, statistical data analysis, and differential PPI visualization. We also discuss caveats and limitations of the protocol, which is applicable across cell types and biological areas.
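As a rough illustration of what "quantifying differential PPIs" between two conditions can mean at its simplest, the sketch below computes a log2 fold change of prey intensity for one bait. It is not the scoring or statistics used in the protocol: the function names, the pseudocount, the threshold and the prey labels are all hypothetical, and a real analysis would add the normalization, confidence scoring and statistical testing listed above.

```python
import math
from statistics import mean


def log2_fold_change(intensities_a, intensities_b, pseudocount=1.0):
    """log2 ratio of mean prey intensity in condition B over condition A.

    The pseudocount (an assumption of this sketch) avoids division by zero
    when a prey is absent in one condition.
    """
    return math.log2((mean(intensities_b) + pseudocount) /
                     (mean(intensities_a) + pseudocount))


def differential_ppis(preys_a, preys_b, threshold=1.0):
    """Return {prey: log2FC} for preys whose |log2FC| reaches the threshold.

    preys_a and preys_b map a prey name to its replicate intensities
    for the same bait in conditions A and B.
    """
    changed = {}
    for prey in sorted(set(preys_a) | set(preys_b)):
        # A prey missing from one condition is treated as zero intensity.
        fc = log2_fold_change(preys_a.get(prey, [0.0]),
                              preys_b.get(prey, [0.0]))
        if abs(fc) >= threshold:
            changed[prey] = fc
    return changed


# Made-up replicate intensities, for illustration only.
condition_a = {"PREY1": [3.0, 3.0], "PREY2": [5.0, 5.0]}
condition_b = {"PREY1": [7.0, 7.0], "PREY2": [5.0, 5.0]}
print(differential_ppis(condition_a, condition_b))
```

A fold change alone says nothing about significance, which is why the sketch should be read only as the arithmetic core of the comparison.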
The integration of spatial multi-omics data from single tissues is crucial for advancing biological research. However, a significant data imbalance impedes progress: while spatial transcriptomics data is relatively abundant, spatial proteomics data remains scarce due to technical limitations and high costs. To overcome this challenge we propose STProtein, a novel framework leveraging graph neural networks with a multi-task learning strategy. STProtein is designed to accurately predict unknown spatial protein expression using more accessible spatial multi-omics data, such as spatial transcriptomics. We believe that STProtein can effectively address the scarcity of spatial proteomics, accelerating the integration of spatial multi-omics and potentially catalyzing transformative breakthroughs in life sciences. This tool enables scientists to accelerate discovery by identifying complex and previously hidden spatial patterns of proteins within tissues, uncovering novel relationships between different marker genes, and exploring the biological "Dark Matter".
Gene expression is a noisy process and several mechanisms, both transcriptional and posttranscriptional, can stabilize protein levels in cells. Much work has focused on the role of miRNAs, showing in particular that miRNA-mediated regulation can buffer expression noise for lowly expressed genes. Here, using in silico simulations and mathematical modeling, we demonstrate that miRNAs can exert a much broader influence on protein levels by orchestrating competition-induced crosstalk between mRNAs. Most notably, we find that miRNA-mediated crosstalk (i) can stabilize protein levels across the full range of gene expression rates, and (ii) modifies the correlation pattern of co-regulated interacting proteins, changing the sign of correlations from negative to positive. The latter feature may constitute a robust signature of RNA crosstalk induced by endogenous competition for miRNAs in standard cellular conditions.
We study the combined influence of amino acid composition and chain length on the thermal stability of protein structures. We consider a new parameterization of the internal free energy as the sum of hydrophobic-effect, hydrogen-bond and de-hydration energy terms. We divided a non-redundant selection of protein structures from the Protein Data Bank into three groups: i) rich in order-promoting residues (OPR proteins); ii) rich in disorder-promoting residues (DPR proteins); iii) belonging to a twilight zone (TZ proteins). We observe a partition of the PDB into several groups with different internal free energies, amino acid compositions and protein lengths. The internal free energy of 96% of the proteins analyzed ranges from -2 to -6.5 kJ/mol/res. We found many DPR and OPR proteins with the same relative thermal stability. Only OPR proteins with internal energy between -4 and -6.5 kJ/mol/res are observed to have chains longer than 200 residues, with a high de-hydration energy compensated by the hydrophobic effect. DPR and TZ proteins are shorter than 200 residues and they have an internal energy above -4 kJ/mol/res, with a few exceptions among TZ proteins. Hydrogen-bonds play an important
While many good textbooks are available on Protein Structure, Molecular Simulations, Thermodynamics and Bioinformatics methods in general, there is no good introductory level book for the field of Structural Bioinformatics. This book aims to give an introduction to Structural Bioinformatics, which is where the previous topics meet to explore three-dimensional protein structures through computational analysis. We provide an overview of existing computational techniques to validate, simulate, predict and analyse protein structures. More importantly, the book aims to provide practical knowledge about how and when to use such techniques. We will consider proteins from three major vantage points: Protein structure quantification, Protein structure prediction, and Protein simulation & dynamics. In this chapter we explore basic physical and chemical concepts required to understand protein folding. We introduce major (de)stabilising factors of folded protein structures such as the hydrophobic effect and backbone entropy. In addition, we consider different states along the folding pathway, as well as natively disordered proteins and aggregated protein states. In this chapter, an intuit
Large-scale surveys in mammalian tissue culture cells suggest that the protein expressed at the median abundance is present at 8,000-16,000 molecules per cell and that differences in mRNA expression between genes explain only 10-40% of the differences in protein levels. We find, however, that these surveys have significantly underestimated protein abundances and the relative importance of transcription. Using individual measurements for 61 housekeeping proteins to rescale whole proteome data from Schwanhausser et al., we find that the median protein detected is expressed at 170,000 molecules per cell and that our corrected protein abundance estimates show a higher correlation with mRNA abundances than do the uncorrected protein data. In addition, we estimated the impact of further errors in mRNA and protein abundances, showing that mRNA levels explain at least 56% of the differences in protein abundance for the genes detected by Schwanhausser et al., though because one major source of error could not be estimated the true percent contribution could be higher. We also employed a second, independent strategy to determine the contribution of mRNA levels to protein expression. We sho
Protein splicing is a post-translational autocatalytic excision of an internal protein sequence (intein) with the subsequent ligation of the flanking polypeptides (exteins). The high specificity of excision ensured by the intein makes it possible to use protein splicing for biotechnological purposes. The aim of this work was to optimize the production and purification of recombinant human growth hormone using protein splicing. It was experimentally demonstrated that the use of a modified intein as a self-removing affinity tag makes it possible to isolate the recombinant protein hGH rapidly and cheaply. Furthermore, this approach makes it possible to obtain human growth hormone with a native N-terminus, without formyl-methionine. Key words: intein, human growth hormone, protein splicing
The identification, classification and curation of tandem repeats in proteins is a complex process that requires manual processing by experts, computing power and time. Recent advances in applying machine learning to protein structure prediction and repeat classification are useful for this process. However, no service integrates the databases and software required to support research on repeat proteins. In this publication we present Daisy, an integrated repeat protein curation web service. This service can process Protein Data Bank (PDB) and AlphaFold Database entries for tandem repeat identification. In addition, it uses an algorithm to search a sequence against a library of Pfam hidden Markov models (HMMs). Repeat classifications are associated with the identified families through RepeatsDB. This prediction is used to improve the execution of the ReUPred algorithm and to speed up the identification of repeat units. The service can also process every PDB and AlphaFold structure associated with a UniProt proteome entry. Availability: The Daisy web service is freely accessible at daisy.bioinformatica.org.
Protein binding and function often involve conformational changes. Advanced NMR experiments indicate that these conformational changes can occur in the absence of ligand molecules (or with bound ligands), and that the ligands may 'select' protein conformations for binding (or unbinding). In this review, we argue that this conformational selection requires transition times for ligand binding and unbinding that are small compared to the dwell times of proteins in different conformations, which is plausible for small ligand molecules. Such a separation of timescales leads to a decoupling and temporal ordering of binding/unbinding events and conformational changes. We propose that conformational-selection and induced-change processes (such as induced fit) are two sides of the same coin, because the temporal ordering is reversed in binding and unbinding direction. Conformational-selection processes can be characterized by a conformational excitation that occurs prior to a binding or unbinding event, while induced-change processes exhibit a characteristic conformational relaxation that occurs after a binding or unbinding event. We discuss how the ordering of events can be determined fro
During the last decade, network approaches have become a powerful tool to describe protein structure and dynamics. Here we review the links between disordered proteins and the associated networks, and describe the consequences of local, mesoscopic and global network disorder on changes in protein structure and dynamics. We introduce a new classification of protein networks into cumulus-type, i.e., those similar to puffy (white) clouds, and stratus-type, i.e., those similar to flat, dense (dark) low-lying clouds, and relate these network types to protein disorder dynamics and to differences in energy transmission processes. In the first class, there is limited overlap between the modules, which implies higher rigidity of the individual units; there the conformational changes can be described by an energy transfer mechanism. In the second class, the topology presents a compact structure with significant overlap between the modules; there the conformational changes can be described by multi-trajectories; that is, multiple highly populated pathways. We further propose that disordered protein regions evolved to help other protein segments reach rarely visited but functionally related states.
While many good textbooks are available on Protein Structure, Molecular Simulations, Thermodynamics and Bioinformatics methods in general, there is no good introductory level book for the field of Structural Bioinformatics. This book aims to give an introduction to Structural Bioinformatics, which is where the previous topics meet to explore three-dimensional protein structures through computational analysis. We provide an overview of existing computational techniques to validate, simulate, predict and analyse protein structures. More importantly, the book aims to provide practical knowledge about how and when to use such techniques. We will consider proteins from three major vantage points: Protein structure quantification, Protein structure prediction, and Protein simulation & dynamics. In the previous chapter, "Introduction to Protein Folding", we introduced the concept of free energy and the protein folding landscape. Here, we provide a deeper, more formal underpinning of free energy in terms of entropy and enthalpy; to this end, we will first need to better define the meaning of equilibrium, entropy and enthalpy. When we understand these concepts, we will come back fo
We explore the interplay between the protein-protein interaction network and the expression of the interacting proteins. We show that interacting proteins are expressed at significantly more similar cellular concentrations. This is largely due to interacting pairs that are part of protein complexes. We solve a generic model of complex formation and show explicitly that complexes form most efficiently when their members have roughly the same concentrations. Therefore, the observed similarity in interacting protein concentrations could be attributed to optimization for efficiency of complex formation.
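The claim that complexes form most efficiently at matched concentrations can be checked numerically in the simplest possible case. The sketch below is not the generic model solved by the authors: it assumes a single heterodimer A + B <-> AB at mass-action equilibrium with dissociation constant k_d, and the function names and parameter values are hypothetical. Writing c for the complex concentration, k_d * c = (a_total - c)(b_total - c), which is a quadratic in c; the code scans all ways of splitting a fixed total amount of protein between A and B.

```python
import math


def complex_concentration(a_total, b_total, k_d):
    """Equilibrium [AB] for A + B <-> AB with dissociation constant k_d.

    Solves c^2 - (a_total + b_total + k_d) * c + a_total * b_total = 0.
    """
    s = a_total + b_total + k_d
    # The smaller root is the physical one (c cannot exceed a_total or b_total).
    return (s - math.sqrt(s * s - 4.0 * a_total * b_total)) / 2.0


def best_split(total, k_d, steps=100):
    """Scan fractions f of a fixed total assigned to A (the rest goes to B).

    Returns the fraction that maximizes [AB] and the corresponding [AB].
    """
    best_fraction, best_c = 0.0, 0.0
    for i in range(steps + 1):
        f = i / steps
        c = complex_concentration(f * total, (1.0 - f) * total, k_d)
        if c > best_c:
            best_fraction, best_c = f, c
    return best_fraction, best_c


fraction, amount = best_split(total=2.0, k_d=0.1)
print(f"best fraction of A: {fraction:.2f}, complex formed: {amount:.3f}")
```

For a fixed total, the term under the square root shrinks as the product a_total * b_total grows, so the amount of complex is largest when the two totals are equal, which is what the scan recovers in this two-component case.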
While many good textbooks are available on Protein Structure, Molecular Simulations, Thermodynamics and Bioinformatics methods in general, there is no good introductory level book for the field of Structural Bioinformatics. This book aims to give an introduction to Structural Bioinformatics, which is where the previous topics meet to explore three-dimensional protein structures through computational analysis. We provide an overview of existing computational techniques to validate, simulate, predict and analyse protein structures. More importantly, the book aims to provide practical knowledge about how and when to use such techniques. We will consider proteins from three major vantage points: Protein structure quantification, Protein structure prediction, and Protein simulation & dynamics. In the previous chapter, "Molecular Dynamics", we considered protein simulations from a dynamical point of view, using Newton's laws. In the current chapter, we first take a step back and return to the bare minimum needed to simulate proteins, and show that proteins may be simulated in a simpler fashion, using the partition function directly. This means we do not have to calculate expl
Glycans are structurally diverse and flexible biomolecules that play key roles in many biological processes. Their conformational variability makes the modeling of their interactions with proteins particularly challenging. This chapter presents a step-by-step protocol for modeling protein-glycan interactions using HADDOCK3, an integrative modeling platform that supports the inclusion of experimental or predicted interaction restraints and allows for flexible refinement of the solutions. The workflow is illustrated using the interaction between a linear homopolymer glycan, 4-beta-glucopyranose, and the catalytic domain of the Humicola grisea Cel12A enzyme, for which an experimental X-ray structure is available as a reference. Detailed instructions are provided for input structure preparation, restraint definition, docking setup, execution, and result analysis. Application of the protocol starting from unbound structures yields models of acceptable to medium quality, with interface-ligand RMSD values below 3 angstroms. Although illustrated on a specific system, the protocol has been optimized and benchmarked on multiple protein-glycan complexes and is broadly applicable to similar sy
While many good textbooks are available on Protein Structure, Molecular Simulations, Thermodynamics and Bioinformatics methods in general, there is no good introductory level book for the field of Structural Bioinformatics. This book aims to give an introduction to Structural Bioinformatics, which is where the previous topics meet to explore three-dimensional protein structures through computational analysis. We provide an overview of existing computational techniques to validate, simulate, predict and analyse protein structures. More importantly, the book aims to provide practical knowledge about how and when to use such techniques. We will consider proteins from three major vantage points: Protein structure quantification, Protein structure prediction, and Protein simulation & dynamics. Within the living cell, protein molecules perform specific functions, typically by interacting with other proteins, DNA, RNA or small molecules. They take on a specific three-dimensional structure, encoded by their amino acid sequence, which allows them to function within the cell. Hence, the understanding of a protein's function is tightly coupled to its sequence and its three dimensional stru
This paper focuses on three critical problems in protein classification. Firstly, carbohydrate-active enzyme (CAZyme) classification can help to understand the properties of enzymes. However, one CAZyme may belong to several classes, which leads to multi-label CAZyme classification. Secondly, to capture information from the secondary structure of a protein, protein classification is modeled as a graph classification problem. Thirdly, compound-protein interaction prediction employs graph learning for the compound together with a sequential embedding for the protein. This can be seen as a classification task for compound-protein pairs. This paper proposes three models for protein classification. Firstly, this paper proposes a multi-label CAZyme classification model using a CNN-LSTM with an attention mechanism. Secondly, this paper proposes a variational graph autoencoder based subspace learning model for protein graph classification. Thirdly, this paper proposes graph isomorphism networks (GIN) and an attention-based CNN-LSTM for compound-protein interaction prediction, and compares GIN with graph convolution networks (GCN) and graph attention networks (GAT) in this task. The proposed models are effe
Much recent work has explored molecular and population-genetic constraints on the rate of protein sequence evolution. The best predictor of evolutionary rate is expression level, for reasons which have remained unexplained. Here, we hypothesize that selection to reduce the burden of protein misfolding will favor protein sequences with increased robustness to translational missense errors. Pressure for translational robustness increases with expression level and constrains sequence evolution. Using several sequenced yeast genomes, global expression and protein abundance data, and sets of paralogs traceable to an ancient whole-genome duplication in yeast, we rule out several confounding effects and show that expression level explains roughly half the variation in Saccharomyces cerevisiae protein evolutionary rates. We examine causes for expression's dominant role and find that genome-wide tests favor the translational robustness explanation over existing hypotheses that invoke constraints on function or translational efficiency. Our results suggest that proteins evolve at rates largely unrelated to their functions, and can explain why highly expressed proteins evolve slowly across th