Spatial proteomics maps protein distributions in tissues, providing transformative insights for the life sciences. However, current sequencing-based technologies suffer from low spatial resolution, and substantial inter-tissue variability in protein expression further compromises the performance of existing molecular data prediction methods. In this work, we introduce the novel task of spatial super-resolution for sequencing-based spatial proteomics (seq-SP) and, to the best of our knowledge, propose the first deep learning model for this task: Neural Proteomics Fields (NPF). NPF formulates seq-SP as a protein reconstruction problem in continuous space by training a dedicated network for each tissue. The model comprises a Spatial Modeling Module, which learns tissue-specific protein spatial distributions, and a Morphology Modeling Module, which extracts tissue-specific morphological features. Furthermore, to facilitate rigorous evaluation, we establish an open-source benchmark dataset, Pseudo-Visium SP, for this task. Experimental results demonstrate that NPF achieves state-of-the-art performance with fewer learnable parameters, underscoring its potential for advancing spatial proteomics.
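To make the neural-field formulation concrete, here is a minimal sketch of a per-tissue coordinate network: an MLP over Fourier-encoded (x, y) positions fused with precomputed morphology features, queried on a denser grid for super-resolution. The layer sizes, encoding, and additive fusion are illustrative assumptions, not NPF's published architecture.

```python
# Hypothetical per-tissue neural field for seq-SP super-resolution; layer sizes,
# the Fourier encoding, and the additive fusion are assumptions for illustration.
import torch
import torch.nn as nn

class ProteomicsField(nn.Module):
    def __init__(self, n_proteins: int, morph_dim: int = 64, hidden: int = 256):
        super().__init__()
        self.register_buffer("freqs", 2.0 ** torch.arange(6))  # positional bands
        # Spatial Modeling Module stand-in: MLP over Fourier features of (x, y)
        self.spatial = nn.Sequential(
            nn.Linear(4 * 6, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Morphology Modeling Module stand-in: projects precomputed image features
        self.morph = nn.Linear(morph_dim, hidden)
        self.head = nn.Linear(hidden, n_proteins)

    def forward(self, xy: torch.Tensor, morph_feat: torch.Tensor) -> torch.Tensor:
        # xy: (B, 2) normalized coordinates; morph_feat: (B, morph_dim)
        ang = xy[:, None, :] * self.freqs[None, :, None]            # (B, 6, 2)
        enc = torch.cat([ang.sin(), ang.cos()], dim=-1).flatten(1)  # (B, 24)
        h = self.spatial(enc) + self.morph(morph_feat)
        return self.head(h)  # predicted protein abundances at (x, y)
```

Training fits one such network to a tissue's observed spots; querying it on a denser coordinate grid then produces the super-resolved map.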
Database search and clustering are fundamental components of many data analytics problems, such as mass spectrometry-driven proteomics. Traditional full clustering and search algorithms suffer from high resource usage and long latencies. We introduce HERP, a lightweight incremental clustering method and a highly parallelizable database (DB) search platform that utilizes 3T2MTJ SOT-MRAM based CAM in 7nm technology for in-memory acceleration. A single hardware initialization using pre-clustered proteomics data allows for continuous DB searching and local re-clustering, providing a more practical and efficient alternative to clustering from scratch. Heuristics derived from the initial pre-clustered data guide the incremental process, accelerating clustering by 20x at the cost of a 0.3% increase in clustering error, while DB search results overlap those of SOTA algorithms by 96%, validating search quality. For a 131 GB human genome proteomics dataset, HERP setup requires 1.19 mJ for 2M spectra, while a 1000-query search consumes only 1.1 uJ at SOTA accuracy. Bucket-wise parallelization and query scheduling provide an additional 100x speedup.
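The incremental idea can be illustrated in software: starting from pre-clustered centroids, each new spectrum is matched to its nearest cluster when similarity clears a threshold, otherwise it seeds a new cluster. The cosine similarity, threshold, and running-mean update below are placeholder heuristics; the CAM hardware and HERP's actual heuristics are not modeled.

```python
# Illustrative software analogue of the incremental step: match each new spectrum
# to its nearest pre-clustered centroid if similarity clears a threshold, else
# seed a new cluster. Cosine similarity, the threshold, and the running-mean
# update are placeholder heuristics; the CAM hardware is not modeled.
import numpy as np

def incremental_cluster(new_vecs, centroids, counts, sim_threshold=0.9):
    centroids = [c.astype(float) for c in centroids]
    counts = list(counts)
    labels = []
    for v in new_vecs:
        sims = [v @ c / (np.linalg.norm(v) * np.linalg.norm(c)) for c in centroids]
        best = int(np.argmax(sims)) if sims else -1
        if best >= 0 and sims[best] >= sim_threshold:
            counts[best] += 1
            centroids[best] += (v - centroids[best]) / counts[best]  # local re-fit
            labels.append(best)
        else:
            centroids.append(v.astype(float))  # new cluster seeded by this spectrum
            counts.append(1)
            labels.append(len(centroids) - 1)
    return labels, centroids, counts
```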
Single-cell transcriptomics and proteomics have become a rich source of data-driven insights into biology, enabling the use of advanced deep learning methods to understand cellular heterogeneity and gene expression at the single-cell level. With the advent of spatial-omics data, we have the promise of characterizing cells within their tissue context, as it provides both spatial coordinates and intra-cellular transcriptional or protein counts. Proteomics offers a complementary view by directly measuring proteins, which are the primary effectors of cellular function and key therapeutic targets. However, existing models either ignore the spatial information or the complex genetic and proteomic programs within cells, and thus cannot infer how a cell's internal regulation adapts to microenvironmental cues. Furthermore, these models often utilize fixed gene vocabularies, hindering their generalizability to unseen genes. In this paper, we introduce HEIST, a hierarchical graph transformer foundation model for spatial transcriptomics and proteomics. HEIST models tissues as hierarchical graphs: the higher-level graph is a spatial cell graph, and each cell, in turn, is represented by its lower-level gene regulatory network graph.
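As a data-structure illustration of the hierarchy, the sketch below builds the higher-level spatial cell graph with a k-nearest-neighbor rule; the per-cell lower-level graphs and the transformer itself are omitted, and k-NN is an assumed construction, not necessarily HEIST's.

```python
# Purely illustrative construction of the higher-level spatial cell graph via
# k-nearest neighbors; the per-cell lower-level graphs and the transformer are
# omitted, and k-NN is an assumed (not necessarily HEIST's) construction rule.
import numpy as np
from scipy.spatial import cKDTree

def spatial_cell_graph(coords: np.ndarray, k: int = 6) -> list[tuple[int, int]]:
    """Edge list connecting each cell to its k nearest spatial neighbors."""
    tree = cKDTree(coords)
    _, idx = tree.query(coords, k=k + 1)  # nearest neighbor 0 is the cell itself
    return [(i, int(j)) for i in range(len(coords)) for j in idx[i, 1:]]

edges = spatial_cell_graph(np.random.default_rng(0).uniform(size=(100, 2)))
```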
Quantitative proteomics plays a central role in uncovering regulatory mechanisms, identifying disease biomarkers, and guiding the development of precision therapies. These insights are often obtained through complex Bayesian models, whose inference procedures are computationally intensive, especially when applied at scale to biological datasets. This limits the accessibility of advanced modelling techniques needed to fully exploit proteomics data. Although Sequential Monte Carlo (SMC) methods offer a parallelisable alternative to traditional Markov Chain Monte Carlo, their high-performance implementations often rely on specialised hardware, increasing both financial and energy costs. We address these challenges by introducing an opportunistic computing framework for SMC samplers, tailored to the demands of large-scale proteomics inference. Our approach leverages idle compute resources at the University of Liverpool via HTCondor, enabling scalable Bayesian inference without dedicated high-performance computing infrastructure. Central to this framework is a novel Coordinator-Manager-Follower architecture that reduces synchronisation overhead and supports robust operation in heterogeneous environments.
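For readers unfamiliar with SMC samplers, the sketch below shows the reweight -> resample -> move cycle that such frameworks parallelise, as a toy tempered sampler from a N(0,1) prior; the Coordinator-Manager-Follower distribution layer and the proteomics models themselves are not represented.

```python
# Toy tempered SMC sampler (reweight -> resample -> move) from a N(0,1) prior to
# prior x likelihood; each cycle over particles is what gets distributed.
import numpy as np

def smc_sampler(log_lik, n_particles=1000, n_steps=20, step_size=0.5, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n_particles)             # particles drawn from the prior
    betas = np.linspace(0.0, 1.0, n_steps + 1)   # tempering schedule
    for b0, b1 in zip(betas[:-1], betas[1:]):
        logw = (b1 - b0) * log_lik(x)            # incremental importance weights
        w = np.exp(logw - logw.max()); w /= w.sum()
        x = x[rng.choice(n_particles, n_particles, p=w)]    # multinomial resampling
        lp = lambda z: -0.5 * z ** 2 + b1 * log_lik(z)      # tempered log-density
        prop = x + step_size * rng.normal(size=n_particles) # random-walk MH move
        accept = np.log(rng.uniform(size=n_particles)) < lp(prop) - lp(x)
        x = np.where(accept, prop, x)
    return x

samples = smc_sampler(lambda x: -0.5 * (x - 3.0) ** 2)  # toy Gaussian likelihood
```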
Artificial intelligence (AI) is transforming scientific research, including proteomics. Advances in mass spectrometry (MS)-based proteomics data quality, diversity, and scale, combined with groundbreaking AI techniques, are opening up new challenges and opportunities in biological discovery. Here, we highlight key areas where AI is driving innovation, from data analysis to new biological insights. These include developing an AI-friendly ecosystem for proteomics data generation, sharing, and analysis; improving peptide and protein identification and quantification; characterizing protein-protein interactions and protein complexes; advancing spatial and perturbation proteomics; integrating multi-omics data; and ultimately enabling AI-empowered virtual cells.
In clinical proteomics, available input material is often limited. In addition, phospho-proteomics is of particular interest, since dysregulation of these post-translational modifications (PTMs) has been implicated in various diseases such as cancer. We therefore assessed the feasibility of low-input phospho-proteomics via phospho-bulk titration and low-input starting material. We found that phospho-bulk titration identified more phospho-peptides, because of sample loss during preparation of low-input starting material. Additionally, we explored various lysis buffers and boiling times for efficient decrosslinking of formalin-fixed cells, since cells and tissues are often fixed for preservation and for sorting via FACS. We found that boiling in 0.05 M Tris pH 7.6 with 5% SDS for 60 min yielded the highest number of phospho-peptides. Lastly, we applied Evotips Pure and phospho-bulk titration to treated Jurkat cells and identified 7 phospho-sites involved in T-cell stimulation.
Understanding the spatial architecture of the tumor microenvironment (TME) is critical to advance precision oncology. We present ProteinPNet, a novel framework based on prototypical part networks that discovers TME motifs from spatial proteomics data. Unlike traditional post-hoc explainability models, ProteinPNet directly learns discriminative, interpretable, and faithful spatial prototypes through supervised training. We validate our approach on synthetic datasets with ground truth motifs, and further test it on a real-world lung cancer spatial proteomics dataset. ProteinPNet consistently identifies biologically meaningful prototypes aligned with different tumor subtypes. Through graphical and morphological analyses, we show that these prototypes capture interpretable features pointing to differences in immune infiltration and tissue modularity. Our results highlight the potential of prototype-based learning to reveal interpretable spatial biomarkers within the TME, with implications for mechanistic discovery in spatial omics.
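The prototype mechanism can be sketched in the style of the original ProtoPNet scoring head: local embeddings are compared to learned prototype vectors, and the best match per prototype drives the class logits. This is the generic prototypical-part pattern, not ProteinPNet's exact spatial formulation.

```python
# Generic prototypical-part scoring head in the style of ProtoPNet; ProteinPNet's
# exact spatial/graph formulation differs, this shows only the scoring pattern.
import torch
import torch.nn as nn

class PrototypeHead(nn.Module):
    def __init__(self, n_prototypes: int, dim: int, n_classes: int):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(n_prototypes, dim))
        self.classifier = nn.Linear(n_prototypes, n_classes, bias=False)

    def forward(self, patch_emb: torch.Tensor) -> torch.Tensor:
        # patch_emb: (B, P, D) local embeddings, e.g. cell-neighborhood features
        protos = self.prototypes[None].expand(patch_emb.size(0), -1, -1)
        d2 = torch.cdist(patch_emb, protos) ** 2
        sim = torch.log((d2 + 1) / (d2 + 1e-4))  # ProtoPNet similarity transform
        scores = sim.max(dim=1).values           # best-matching patch per prototype
        return self.classifier(scores)           # prototype evidence -> class logits
```

Because each logit is a linear combination of prototype similarities, the learned prototypes themselves serve as the interpretable motifs.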
Single-cell proteomics (SCP) is transforming our understanding of biological complexity by shifting from bulk proteomics, where signals are averaged over thousands of cells, to the proteome analysis of individual cells. This granular perspective reveals distinct cell states, population heterogeneity, and the underpinnings of disease pathogenesis that bulk approaches may obscure. However, SCP demands exceptional sensitivity, precise cell handling, and robust data processing to overcome the inherent challenges of analyzing picogram-level protein samples without amplification. Recent innovations in sample preparation, separations, data acquisition strategies, and specialized mass spectrometry instrumentation have substantially improved proteome coverage and throughput. Approaches that integrate complementary omics, streamline multi-step sample processing, and automate workflows through microfluidics and specialized platforms promise to further push SCP boundaries. Advances in computational methods, especially for data normalization and imputation, address the pervasive issue of missing values, enabling more reliable downstream biological interpretations. Despite these strides, higher
Foundation models have begun to transform image analysis by acting as pretrained generalist backbones that can be adapted to many tasks even when post-training data are scarce, yet their impact on spatial proteomics (imaging that maps proteins at single-cell resolution) remains limited. Here, we introduce KRONOS, a foundation model built for spatial proteomics. KRONOS was trained in a self-supervised manner on over 47 million image patches covering 175 protein markers, 16 tissue types, and 8 fluorescence-based imaging platforms. We introduce key architectural adaptations to address the high-dimensional, multi-channel, and heterogeneous nature of multiplex imaging. We demonstrate that KRONOS learns biologically meaningful representations across multiple scales, ranging from cellular and microenvironment to tissue levels, enabling it to address diverse downstream tasks, including cell phenotyping, region classification, and patient stratification. Evaluated across 11 independent cohorts, KRONOS achieves state-of-the-art performance across cell phenotyping, treatment response prediction, and retrieval tasks, and is highly data-efficient. KRONOS also introduces the paradigm of segment
With the development of artificial intelligence, its contribution to science is evolving from simulating complex problems to automating entire research processes and producing novel discoveries. Achieving this advancement requires both specialized general models grounded in real-world scientific data and iterative, exploratory frameworks that mirror human scientific methodologies. In this paper, we present PROTEUS, a fully automated system for scientific discovery from raw proteomics data. PROTEUS uses large language models (LLMs) to perform hierarchical planning, execute specialized bioinformatics tools, and iteratively refine analysis workflows to generate high-quality scientific hypotheses. The system takes proteomics datasets as input and produces a comprehensive set of research objectives, analysis results, and novel biological hypotheses without human intervention. We evaluated PROTEUS on 12 proteomics datasets collected from various biological samples (e.g. immune cells, tumors) and different sample types (single-cell and bulk), generating 191 scientific hypotheses. These were assessed using both automatic LLM-based scoring on 5 metrics and detailed reviews from human experts.
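The plan-execute-refine cycle can be caricatured in a few lines; the llm and run_tool callables below are trivial stand-ins, not PROTEUS's actual interfaces.

```python
# Minimal, purely illustrative plan -> execute -> refine loop in the spirit of
# the system described above; llm and run_tool are trivial stand-in callables,
# not PROTEUS's actual interfaces.
from typing import Callable

def discovery_loop(data_summary: str,
                   llm: Callable[[str], str],
                   run_tool: Callable[[str], str],
                   max_rounds: int = 3) -> list[str]:
    hypotheses = []
    plan = llm(f"Propose an analysis plan for: {data_summary}")
    for _ in range(max_rounds):
        result = run_tool(plan)                          # e.g. a DE or enrichment step
        hypotheses.append(llm(f"State a hypothesis supported by: {result}"))
        plan = llm(f"Refine the plan given: {result}")   # iterative refinement
    return hypotheses

hyps = discovery_loop("bulk tumor proteomics dataset",
                      llm=lambda prompt: f"[LLM response to: {prompt[:30]}...]",
                      run_tool=lambda plan: f"[tool output for: {plan[:30]}...]")
```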
Proteomics is the large-scale study of protein structure and function in biological systems through protein identification and quantification. "Shotgun proteomics" or "bottom-up proteomics" is the prevailing strategy, in which proteins are hydrolyzed into peptides that are analyzed by mass spectrometry. Proteomics can be applied to diverse questions, ranging from simple protein identification to studies of proteoforms, protein-protein interactions, protein structural alterations, absolute and relative protein quantification, post-translational modifications, and protein stability. To enable this range of different experiments, there are diverse strategies for proteome analysis. The nuances of how proteomic workflows differ may be challenging to understand for new practitioners. Here, we provide a comprehensive overview of different proteomics methods to aid the novice and the experienced researcher. We cover everything from biochemistry basics and protein extraction to biological interpretation and orthogonal validation. We expect this work to serve as a basic resource for new practitioners in the field of shotgun or bottom-up proteomics.
Mass spectrometry is the dominant technology in the field of proteomics, enabling high-throughput analysis of the protein content of complex biological samples. Due to the complexity of the instrumentation and resulting data, sophisticated computational methods are required for the processing and interpretation of acquired mass spectra. Machine learning has shown great promise for improving the analysis of mass spectrometry data, with numerous purpose-built methods for improving specific steps in the data acquisition and analysis pipeline reaching widespread adoption. Here, we propose unifying various spectrum prediction tasks under a single foundation model for mass spectra. To this end, we pre-train a spectrum encoder using de novo sequencing as a pre-training task. We then show that using these pre-trained spectrum representations improves our performance on the four downstream tasks of spectrum quality prediction, chimericity prediction, phosphorylation prediction, and glycosylation status prediction. Finally, we perform multi-task fine-tuning and find that this approach improves the performance on each task individually. Overall, our work demonstrates that a foundation model for mass spectra can improve performance across diverse downstream tasks.
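The multi-task fine-tuning setup can be sketched as a shared pretrained encoder with one lightweight head per downstream task and a summed loss; the dummy encoder, head shapes, and task losses below are illustrative assumptions, not the paper's model.

```python
# Hedged sketch of multi-task fine-tuning: a shared pretrained spectrum encoder
# with one lightweight head per task and a summed loss; shapes and the dummy
# encoder are illustrative assumptions.
import torch
import torch.nn as nn

class MultiTaskSpectrumModel(nn.Module):
    def __init__(self, encoder: nn.Module, emb_dim: int, tasks: list[str]):
        super().__init__()
        self.encoder = encoder  # pretrained, e.g. via de novo sequencing
        self.heads = nn.ModuleDict({t: nn.Linear(emb_dim, 1) for t in tasks})

    def forward(self, spectra: torch.Tensor) -> dict[str, torch.Tensor]:
        z = self.encoder(spectra)  # (B, emb_dim) spectrum representations
        return {t: head(z).squeeze(-1) for t, head in self.heads.items()}

def multitask_loss(outputs, labels):
    # Sum binary cross-entropy across tasks, fine-tuning all heads jointly
    bce = nn.BCEWithLogitsLoss()
    return sum(bce(outputs[t], labels[t].float()) for t in labels)

model = MultiTaskSpectrumModel(nn.Sequential(nn.Linear(128, 64), nn.ReLU()), 64,
                               ["quality", "chimeric", "phospho", "glyco"])
```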
Background: Platelet proteomics offers valuable insights for clinical research, yet isolating high-purity platelets remains a challenge. Current methods often lead to contamination or platelet loss, compromising data quality and reproducibility. Objectives: This study aimed to optimize a platelet isolation technique that yields high-purity samples with minimal loss and to identify the most effective mass spectrometry-based proteomic method for analyzing platelet proteins with optimal coverage and sensitivity. Methods: We refined an isolation protocol by adjusting centrifugation time to reduce blood volume requirements while preserving platelet yield and purity. Using this optimized method, we evaluated three proteomic approaches: Label-free Quantification with Data-Independent Acquisition (LFQ-DIA), Label-free Quantification with Data-Dependent Acquisition (LFQ-DDA), and Tandem Mass Tag labeling with DDA (TMT-DDA). Results: LFQ-DIA demonstrated superior protein coverage and sensitivity compared to LFQ-DDA and TMT-DDA. The refined isolation protocol effectively minimized contamination and platelet loss. Additionally, age-related differences in platelet protein composition were observed.
Spatial summary statistics based on point process theory are widely used to quantify the spatial organization of cell populations in single-cell spatial proteomics data. Among these, Ripley's $K$ is a popular metric for assessing whether cells are spatially clustered or randomly dispersed. However, the key assumption of spatial homogeneity is frequently violated in spatial proteomics data, leading to overestimates of cell clustering and colocalization. To address this, we propose a novel $K$-based method, termed \textit{KAMP} (\textbf{K} adjustment by \textbf{A}nalytical \textbf{M}oments of the \textbf{P}ermutation distribution), for quantifying the spatial organization of cells in spatial proteomics samples. \textit{KAMP} leverages background cells in each sample along with a new closed-form representation of the first and second moments of the permutation distribution of Ripley's $K$ to estimate an empirical null model. Our method is robust to inhomogeneity, computationally efficient even in large datasets, and provides approximate $p$-values for testing spatial clustering and colocalization. Methodological developments are motivated by a spatial proteomics study of 103 women
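The quantities involved can be made concrete with an unadjusted Ripley's $K$ estimator (no edge correction) and a Monte Carlo version of the permutation null that relabels cells among all observed locations; KAMP's contribution is to replace this permutation loop with closed-form first and second moments of the same distribution.

```python
# Unadjusted Ripley's K (no edge correction) plus an empirical permutation null;
# KAMP computes the null's first and second moments analytically instead.
import numpy as np

def ripley_k(points: np.ndarray, r: float, area: float) -> float:
    """K(r) = area / (n (n - 1)) * #(ordered pairs within distance r)."""
    n = len(points)
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    pairs = (d <= r).sum() - n  # drop the n self-pairs on the diagonal
    return area * pairs / (n * (n - 1))

def kamp_style_stat(all_cells, is_focal, r, area, n_perm=999, seed=0):
    """Center K of the focal cell type on a null that relabels the same number
    of cells uniformly among all observed (focal + background) locations."""
    rng = np.random.default_rng(seed)
    k_obs = ripley_k(all_cells[is_focal], r, area)
    n_focal = int(is_focal.sum())
    k_null = np.array([
        ripley_k(all_cells[rng.choice(len(all_cells), n_focal, replace=False)],
                 r, area)
        for _ in range(n_perm)
    ])
    p_value = (1 + (k_null >= k_obs).sum()) / (n_perm + 1)
    return k_obs - k_null.mean(), p_value  # positive values indicate clustering
```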
Missing values are a notable challenge when analysing mass spectrometry-based proteomics data. While the field is still actively debating best practices, the challenge has grown with the emergence of mass spectrometry-based single-cell proteomics and the accompanying dramatic increase in missing values. A popular approach to dealing with missing values is imputation. Imputation has several drawbacks for which alternatives exist, but it is currently still a practical solution widely adopted in single-cell proteomics data analysis. This perspective discusses the advantages and drawbacks of imputation. We also highlight five main challenges linked to missing value management in single-cell proteomics. Future developments should aim to solve these challenges, whether through imputation or data modelling. The perspective concludes with recommendations for reporting missing values, for reporting methods that deal with missing values, and for proper encoding of missing values.
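On the encoding recommendation, a minimal illustration: keep missing intensities as NA rather than silently zero-filling, so that "not quantified" stays distinguishable from "measured as zero". The identifiers and the median fill below are purely illustrative choices.

```python
# Keep missing intensities as NA so "not quantified" remains distinguishable
# from "measured as zero"; identifiers and the median fill are illustrative only.
import numpy as np
import pandas as pd

intensities = pd.DataFrame(
    {"cell_1": [1.2e6, np.nan, 3.4e5], "cell_2": [9.8e5, 5.1e4, np.nan]},
    index=["PROT_A", "PROT_B", "PROT_C"],  # hypothetical protein identifiers
)
print(intensities.isna().sum())                     # report missingness per cell
imputed = intensities.fillna(intensities.median())  # one simple imputation choice
```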
Two-dimensional gel electrophoresis has been instrumental in the birth and development of proteomics, although it is no longer the exclusive separation tool used in the field. This review takes a historical perspective, starting from the days when two-dimensional gels were used and the word proteomics did not even exist. The events that led to the birth of proteomics are also recalled, ending with a description of the now well-known limitations of two-dimensional gels in proteomics. However, the often-underestimated advantages of two-dimensional gels are also underlined, leading to a description of how and when to use two-dimensional gels to best effect in a proteomics approach. Building on these advantages (robustness, resolution, and the ability to separate entire, intact proteins), possible future applications of this technique in proteomics are also outlined.
Deep learning is an advanced technology that relies on large-scale data and complex models for feature extraction and pattern recognition. It has been widely applied across various fields, including computer vision, natural language processing, and speech recognition. In recent years, deep learning has demonstrated significant potential in the realm of proteomics informatics, particularly in deciphering complex biological information. The introduction of this technology not only accelerates the processing speed of protein data but also enhances the accuracy of predictions regarding protein structure and function. This provides robust support for both fundamental biology research and applied biotechnological studies. Currently, deep learning is primarily focused on applications such as protein sequence analysis, three-dimensional structure prediction, functional annotation, and the construction of protein interaction networks. These applications offer numerous advantages to proteomic research. Despite its growing prevalence in this field, deep learning faces several challenges, including data scarcity, insufficient model interpretability, and computational complexity; these factors hinder its broader application.
Summary: Mass spectrometry coupled to liquid chromatography (LC-MS/MS) is a powerful technique for the characterisation of proteomes. However, the diverse software platforms available for processing raw proteomics data each produce their own output format, making the extraction of meaningful and interpretable results a difficult task. We present TraianProt, a web-based, user-friendly proteomics data analysis platform that enables the analysis of both label-free and labeled data from Data-Dependent or Data-Independent Acquisition mass spectrometry modes, supporting computational platforms such as MaxQuant, MSFragger, DIA-NN, ProteoScape and Proteome Discoverer output formats. TraianProt provides a dynamic framework that includes several processing modules, allowing the user to perform a complete downstream analysis covering the stages of data pre-processing, differential expression analysis, functional analysis and protein-protein interaction analysis. Data output includes a wide range of high-quality, customisable graphs such as heatmaps, volcano plots, boxplots and barplots. This allows users to extract biological insights from proteomic data without any programming skills.
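As a flavor of the downstream output, here is a generic volcano plot over simulated differential-expression results; the column names and thresholds are placeholders, not TraianProt's actual schema.

```python
# Generic volcano plot over differential-expression results, simulated here;
# column names and thresholds are placeholders, not TraianProt's actual schema.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
de = pd.DataFrame({"log2fc": rng.normal(0, 1, 500),
                   "pval": rng.uniform(1e-6, 1, 500)})
sig = (de.pval < 0.05) & (de.log2fc.abs() > 1)  # illustrative significance cutoffs
plt.scatter(de.log2fc, -np.log10(de.pval), c=np.where(sig, "red", "grey"), s=8)
plt.xlabel("log2 fold change"); plt.ylabel("-log10 p-value")
plt.title("Volcano plot of differential expression")
plt.show()
```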
Taking the opportunity of the 20th anniversary of the word "proteomics", this young adult age is a good time to remember how proteomics arose from enormous progress in protein separation and protein microanalysis techniques, and from the combination of these advances into a high-performance, streamlined working setup. However, in the history of the almost three decades that separate the first attempts to perform large-scale analysis of proteins from the high-throughput proteomics that we can enjoy now, it is also interesting to recall how difficult the first decade was. Indeed, when the word was coined, the battle was already won. This recollection is mostly devoted to the almost forgotten period when proteomics was being conceived and brought to birth, as this collective scientific work will never appear when searched through the keyword "proteomics". BIOLOGICAL SIGNIFICANCE: The significance of this manuscript is to recall and review the two decades that separated the first attempts at performing large-scale analysis of proteins from the solid technical corpus that existed when the word "proteomics" was coined twenty years ago. This recollection is made within
Intracellular compartmentalization of proteins underpins their function and the metabolic processes they sustain. Various mass spectrometry-based proteomics methods (subcellular spatial proteomics) now allow high-throughput subcellular protein localization. Yet, the curation, analysis and interpretation of these data remain challenging, particularly in non-model organisms where establishing reliable marker proteins is difficult, and in contexts where experimental replication and subcellular fractionation are constrained. Here, we develop FSPmix, a semi-supervised functional clustering method implemented as an open-source R package, which leverages partial annotations from a subset of marker proteins to predict protein subcellular localization from subcellular spatial proteomics data. This method explicitly assumes that protein signatures vary smoothly across subcellular fractions, enabling more robust inference under low signal-to-noise data regimes. We applied FSPmix to a subcellular proteomics dataset from a marine diatom, allowing us to assign probabilistic localizations to proteins and uncover potentially new protein functions. Altogether, this work lays the foundation for more
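A heavily simplified stand-in for the semi-supervised idea, shown in Python even though FSPmix itself is an R package: fit one Gaussian per compartment from marker-protein fraction profiles, then score unlabeled proteins by posterior probability. FSPmix's functional model of smooth variation across fractions is richer than this sketch.

```python
# Heavily simplified stand-in: one Gaussian per compartment fitted on marker
# proteins' fraction profiles, posterior scores for the rest; FSPmix's smooth
# functional model across fractions is richer than this sketch.
import numpy as np
from scipy.stats import multivariate_normal

def localize(profiles, marker_profiles, marker_labels, compartments):
    """profiles: (N, F) fraction profiles; returns (N, C) posterior probabilities
    of each protein belonging to each compartment."""
    logp = np.zeros((len(profiles), len(compartments)))
    for j, comp in enumerate(compartments):
        x = marker_profiles[marker_labels == comp]
        mu = x.mean(axis=0)
        cov = np.cov(x, rowvar=False) + 1e-6 * np.eye(x.shape[1])  # regularized
        logp[:, j] = multivariate_normal(mu, cov).logpdf(profiles)
    logp -= logp.max(axis=1, keepdims=True)  # stabilize before normalizing
    post = np.exp(logp)
    return post / post.sum(axis=1, keepdims=True)
```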