Antibodies play a central role in the immune response by specifically recognizing and neutralizing antigens, and therapeutic antibodies have become major drugs for cancer and autoimmune diseases. However, their discovery still relies on extensive in vitro screening, and accurate computational modeling of antibody structures and antibody-antigen interactions can prioritize candidates, reduce experimental burden, and accelerate rational design. Despite recent advances in high-accuracy protein and complex prediction, a persistent performance gap remains for antibody-related tasks compared with general protein-protein interactions, limiting downstream design. This thesis investigates why antibody-related tasks are harder and proposes improvements along two complementary directions. First, we investigate protein language model (PLM)-based methods for antibody and antibody-antigen structure prediction. Using embeddings from multiple PLMs, our approach achieves the best CDR-H3 accuracy among compared PLM-based methods on antibody monomer prediction. Extending it to complex prediction does not generalize: without co-evolutionary signals between antibody and antigen, single-sequence PLM rep
With the growing prominence of antibody-based therapeutics, antibody engineering has gained increasing attention as a critical area of research and development. Recent progress in transformer-based protein large language models (LLMs) has demonstrated promising applications in protein sequence design and structural prediction. Moreover, the availability of large-scale antibody datasets such as the Observed Antibody Space (OAS) database has opened new avenues for the development of LLMs specialized for processing antibody sequences. Among these, RoBERTa has demonstrated improved performance relative to BERT, while maintaining a smaller parameter count (125M) compared to the BERT-based protein model, ProtBERT (420M). This reduced model size enables more efficient deployment in antibody-related applications. However, despite the numerous advantages of the RoBERTa architecture, antibody-specific foundational models built upon it have remained inaccessible to the research community. In this study, we introduce Ab-RoBERTa, a RoBERTa-based antibody-specific LLM, which is publicly available at https://huggingface.co/mogam-ai/Ab-RoBERTa. This resource is intended to support a wide range of
Ligand-receptor interactions are fundamental to many biological processes. For example in antibody-based immunotherapies, the dynamics of an antibody binding with its target antigen directly influence the potency and efficacy of monoclonal antibody (mAb) therapies. In this paper, we present an asymptotic analysis of an ordinary differential equation (ODE) model of bivalent antibody-antigen binding in the context of mAb cancer therapies, highlighting the added complexity associated with bivalency of the antibody. To understand what drives the complex temporal dynamics of bivalent antibody-antigen binding, we construct asymptotic approximations to the model's solutions at different timescales and antibody concentrations that are in good agreement with numerical simulations of the full model. We show how the dynamics differ between two scenarios; a region where unbound antigens are abundant, and one where the number of unbound antigens is small such that the dominant balance within the model equations changes. Of particular importance to the potency and efficacy of mAb treatments are the values of quantities such as antigen occupancy and bound antibody number. We use the results of ou
Antibody-based therapeutics-including antibody-drug conjugates (ADCs), bispecific antibodies, and novel formats-are reshaping oncology, yet key determinants of efficacy, safety, and manufacturability frequently emerge after conjugation and formulation. We argue that computational biophysics provides an underexploited framework to address this gap by connecting molecular interactions to biological outcomes. We highlight how molecular dynamics, coarse-grained simulations, and free energy calculations reveal how conjugation site, linker chemistry, and drug-antibody ratio reshape conformational landscapes. We emphasize structural coupling between antibody, linker, and payload, with implications for antigen binding, internalization, and developability. We propose that integrating physics-based modeling into development pipelines-alongside experimental validation-can reduce empirical iteration and de-risk translation. As force fields, and hybrid physics-machine-learning methods improve, this field is poised to become a central driver of next-generation ADC design.
There are complex interactions between antibody levels and epidemic propagation, the antibody level of an individual influences the probability of infection, and the spread of the virus influences the antibody level of each individual. There exist some viruses that, in their natural state, cause antibody levels in an infected individual to gradually decay. When these antibody levels decay to a certain point, the individual can be reinfected, such as with COVID 19. To describe their interaction, we introduce a novel mathematical model that incorporates the presence of an antibody retention rate to investigate the infection patterns of individuals who survive multiple infections. The model is composed of a system of stochastic differential equations to derive the equilibrium point and threshold of the model and presents rich experimental results of numerical simulations to further elucidate the propagation properties of the model. We find that the antibody decay rate strongly affects the propagation process, and also that different network structures have different sensitivities to the antibody decay rate, and that changes in the antibody decay rate cause stronger changes in the prop
We study a simplified model of monoclonal antibodies confined in a patchy random porous medium. Antibodies are represented as Y-shaped particles composed of seven tangential hard spheres with attractive patches on the terminal beads, while the matrix consists of randomly distributed hard-sphere obstacles bearing adhesive sites. The model captures antibody behavior in crowded biological environments with strong short-range antibody-matrix attractions. The theoretical approach combines Wertheim's multidensity thermodynamic perturbation theory, the Flory-Stockmayer theory of polymerization, and scaled particle theory for fluids in porous media. We analyze thermodynamic properties, percolation thresholds, and phase behavior, and compare the selected results with new computer simulations. The interplay between antibody-antibody and antibody-matrix interactions produces a complex phase behavior, including re-entrant phase separation with a closed-loop coexistence region at higher temperatures and conventional liquid-gas separation at lower temperatures.
Large language models (LLMs) have significantly advanced protein representation learning. However, their capacity to interpret and design antibodies through natural language remains limited. To address this challenge, we present AFD-Instruction, the first large-scale instruction dataset with functional annotations tailored to antibodies. This dataset encompasses two key components: antibody understanding, which infers functional attributes directly from sequences, and antibody design, which enables de novo sequence generation under functional constraints. These components provide explicit sequence-function alignment and support antibody design guided by natural language instructions. Extensive instruction-tuning experiments on general-purpose LLMs demonstrate that AFD-Instruction consistently improves performance across diverse antibody-related tasks. By linking antibody sequences with textual descriptions of function, AFD-Instruction establishes a new foundation for advancing antibody modeling and accelerating therapeutic discovery.
Antibodies are proteins produced by the immune system that can identify and neutralise a wide variety of antigens with high specificity and affinity, and constitute the most successful class of biotherapeutics. With the advent of next-generation sequencing, billions of antibody sequences have been collected in recent years, though their application in the design of better therapeutics has been constrained by the sheer volume and complexity of the data. To address this challenge, we present IgBert and IgT5, the best performing antibody-specific language models developed to date which can consistently handle both paired and unpaired variable region sequences as input. These models are trained comprehensively using the more than two billion unpaired sequences and two million paired sequences of light and heavy chains present in the Observed Antibody Space dataset. We show that our models outperform existing antibody and protein language models on a diverse range of design and regression tasks relevant to antibody engineering. This advancement marks a significant leap forward in leveraging machine learning, large scale data sets and high-performance computing for enhancing antibody des
Antibody-facilitated immune responses are central to the body's defense against pathogens, viruses, and other foreign invaders. The ability of antibodies to specifically bind and neutralize antigens is vital for maintaining immunity. Over the past few decades, bioengineering advancements have significantly accelerated therapeutic antibody development. These antibody-derived drugs have shown remarkable efficacy, particularly in treating cancer, SARS-CoV-2, autoimmune disorders, and infectious diseases. Traditionally, experimental methods for affinity measurement have been time-consuming and expensive. With the advent of artificial intelligence, in silico medicine has been revolutionized; recent developments in machine learning, particularly the use of large language models (LLMs) for representing antibodies, have opened up new avenues for AI-based design and improved affinity prediction. Herein, we present an advanced antibody-antigen binding affinity prediction model (LlamaAffinity), leveraging an open-source Llama 3 backbone and antibody sequence data sourced from the Observed Antibody Space (OAS) database. The proposed approach shows significant improvement over existing state-of
Therapeutic antibody development has become an increasingly popular approach for drug development. To date, antibody therapeutics are largely developed using large scale experimental screens of antibody libraries containing hundreds of millions of antibody sequences. The high cost and difficulty of developing therapeutic antibodies create a pressing need for computational methods to predict antibody properties and create bespoke designs. However, the relationship between antibody sequence and activity is a complex physical process and traditional iterative design approaches rely on large scale assays and random mutagenesis. Deep learning methods have emerged as a promising way to learn antibody property predictors, but predicting antibody properties and target-specific activities depends critically on the choice of antibody representations and data linking sequences to properties is often limited. Existing works have not yet investigated the value, limitations and opportunities of these methods in application to antibody-based drug discovery. In this paper, we present results on a novel SARS-CoV-2 antibody binding dataset and an additional benchmark dataset. We compare three classe
Antibody engineering is essential for developing therapeutics and advancing biomedical research. Traditional discovery methods often rely on time-consuming and resource-intensive experimental screening. To enhance and streamline this process, we introduce a production-grade, high-throughput platform built on HelixFold3, HelixDesign-Antibody, which utilizes the high-accuracy structure prediction model, HelixFold3. The platform facilitates the large-scale generation of antibody candidate sequences and evaluates their interaction with antigens. Integrated high-performance computing (HPC) support enables high-throughput screening, addressing challenges such as fragmented toolchains and high computational demands. Validation on multiple antigens showcases the platform's ability to generate diverse and high-quality antibodies, confirming a scaling law where exploring larger sequence spaces increases the likelihood of identifying optimal binders. This platform provides a seamless, accessible solution for large-scale antibody design and is available via the antibody design page of PaddleHelix platform.
Modern therapeutic antibody design often involves composing multi-part assemblages of individual functional domains, each of which may be derived from a different source or engineered independently. While these complex formats can expand disease applicability and improve safety, they present a significant engineering challenge: the function and stability of individual domains are not guaranteed in the novel format, and the entire molecule may no longer be synthesizable. To address these challenges, we develop a machine learning framework to predict "reformatting success" -- whether converting an antibody from one format to another will succeed or not. Our framework incorporates both antibody sequence and structural context, incorporating an evaluation protocol that reflects realistic deployment scenarios. In experiments on a real-world antibody reformatting dataset, we find the surprising result that large pretrained protein language models (PLMs) fail to outperform simple, domain-tailored, multimodal representations. This is particularly evident in the most difficult evaluation setting, where we test model generalization to a new starting antibody. In this challenging "new antibod
Antibody co-design represents a critical frontier in drug development, where accurate prediction of both 1D sequence and 3D structure of complementarity-determining regions (CDRs) is essential for targeting specific epitopes. Despite recent advances in equivariant graph neural networks for antibody design, current approaches often fall short in capturing the intricate interactions that govern antibody-antigen recognition and binding specificity. In this work, we present Igformer, a novel end-to-end framework that addresses these limitations through innovative modeling of antibody-antigen binding interfaces. Our approach refines the inter-graph representation by integrating personalized propagation with global attention mechanisms, enabling comprehensive capture of the intricate interplay between local chemical interactions and global conformational dependencies that characterize effective antibody-antigen binding. Through extensive validation on epitope-binding CDR design and structure prediction tasks, Igformer demonstrates significant improvements over existing methods, suggesting that explicit modeling of multi-scale residue interactions can substantially advance computational a
Understanding the relationship between antibody sequence, structure and function is essential for the design of antibody-based therapeutics and research tools. Recently, machine learning (ML) models mostly based on the application of large language models to sequence information have been developed to predict antibody properties. Yet there are open directions to incorporate structural information, not only to enhance prediction but also to offer insights into the underlying molecular mechanisms. This chapter provides an overview of these approaches and describes two ML frameworks that integrate structural data (via graph representations) with neural networks to predict properties of antibodies: ANTIPASTI predicts binding affinity (a global property) whereas INFUSSE predicts residue flexibility (a local property). We survey the principles underpinning these models; the ways in which they encode structural knowledge; and the strategies that can be used to extract biologically relevant statistical signals that can help discover and disentangle molecular determinants of the properties of interest.
Therapeutic antibody candidates often require extensive engineering to improve key functional and developability properties before clinical development. This can be achieved through iterative design, where starting molecules are optimized over several rounds of in vitro experiments. While protein structure can provide a strong inductive bias, it is rarely used in iterative design due to the lack of structural data for continually evolving lead molecules over the course of optimization. In this work, we propose a strategy for iterative antibody optimization that leverages both sequence and structure as well as accumulating lab measurements of binding and developability. Building on prior work, we first train a sequence-structure diffusion generative model that operates on antibody-antigen complexes. We then outline an approach to use this model, together with carefully predicted antibody-antigen complexes, to optimize lead candidates throughout the iterative design process. Further, we describe a guided sampling approach that biases generation toward desirable properties by integrating models trained on experimental data from iterative design. We evaluate our approach in multiple in
Therapeutic antibodies have been extensively studied in drug discovery and development in the past decades. Antibodies are specialized protective proteins that bind to antigens in a lock-to-key manner. The binding strength/affinity between an antibody and a specific antigen is heavily determined by the complementarity-determining regions (CDRs) on the antibodies. Existing machine learning methods cast in silico development of CDRs as either sequence or 3D graph (with a single chain) generation tasks and have achieved initial success. However, with CDR loops having specific geometry shapes, learning the 3D geometric structures of CDRs remains a challenge. To address this issue, we propose AntibodyFlow, a 3D flow model to design antibody CDR loops. Specifically, AntibodyFlow first constructs the distance matrix, then predicts amino acids conditioned on the distance matrix. Also, AntibodyFlow conducts constraint learning and constrained generation to ensure valid 3D structures. Experimental results indicate that AntibodyFlow outperforms the best baseline consistently with up to 16.0% relative improvement in validity rate and 24.3% relative reduction in geometric graph level error (roo
Antibody design is an essential yet challenging task in various domains like therapeutics and biology. There are two major defects in current learning-based methods: 1) tackling only a certain subtask of the whole antibody design pipeline, making them suboptimal or resource-intensive. 2) omitting either the framework regions or side chains, thus incapable of capturing the full-atom geometry. To address these pitfalls, we propose dynamic Multi-channel Equivariant grAph Network (dyMEAN), an end-to-end full-atom model for E(3)-equivariant antibody design given the epitope and the incomplete sequence of the antibody. Specifically, we first explore structural initialization as a knowledgeable guess of the antibody structure and then propose shadow paratope to bridge the epitope-antibody connections. Both 1D sequences and 3D structures are updated via an adaptive multi-channel equivariant encoder that is able to process protein residues of variable sizes when considering full atoms. Finally, the updated antibody is docked to the epitope via the alignment of the shadow paratope. Experiments on epitope-binding CDR-H3 design, complex structure prediction, and affinity optimization demonstra
Accurate prediction of antibody-antigen (Ab-Ag) interfaces is critical for vaccine design, immunodiagnostics, and therapeutic antibody development. However, achieving reliable predictions from sequences alone remains a challenge. In this paper, we present ABCONFORMER, a model based on the Conformer backbone that captures both local and global features of a biosequence. To accurately capture Ab-Ag interactions, we introduced the physics-inspired sliding attention, enabling residue-level contact recovery without relying on three-dimensional structural data. ABConformer can accurately predict paratopes and epitopes given the antibody and antigen sequence, and predict pan-epitopes on the antigen without antibody information. In comparison experiments, ABCONFORMER achieves state-of-the-art performance on a recent SARS-CoV-2 Ab-Ag dataset, and surpasses widely used sequence-based methods for antibody-agnostic epitope prediction. Ablation studies further quantify the contribution of each component, demonstrating that, compared to conventional cross-attention, sliding attention significantly enhances the precision of epitope prediction. To facilitate reproducibility, we will release the co
Motivation: The clinical efficacy of antibody therapeutics critically depends on high-affinity target engagement, yet laboratory affinity-maturation campaigns are slow and costly. In computational settings, most protein language models (PLMs) are not trained to favor high-affinity antibodies, and existing preference optimization approaches introduce substantial computational overhead without clear affinity gains. Therefore, this work proposes SimBinder-IF, which converts the inverse folding model ESM-IF into an antibody sequence generator by freezing its structure encoder and training only its decoder to prefer experimentally stronger binders through preference optimization. Results: On the 11-assay AbBiBench benchmark, SimBinder-IF achieves a 55 percent relative improvement in mean Spearman correlation between log-likelihood scores and experimentally measured binding affinity compared to vanilla ESM-IF (from 0.264 to 0.410). In zero-shot generalization across four unseen antigen-antibody complexes, the correlation improves by 156 percent (from 0.115 to 0.294). SimBinder-IF also outperforms baselines in top-10 precision for ten-fold or greater affinity improvements. A case study re
Antibodies are vital proteins offering robust protection for the human body from pathogens. The development of general protein and antibody-specific pre-trained language models both facilitate antibody prediction tasks. However, there have been limited studies that comprehensively explore the representation capability of distinct pre-trained language models on different antibody tasks. To investigate the problem, we aim to answer several key questions in this paper, such as how pre-trained language models perform in antibody tasks with different specificity and how introducing specific biological mechanisms to the pre-training process can benefit the model. Additionally, we evaluate if the learned antibody pre-trained representations can be applied to real-world antibody problems, like drug discovery and immune process understanding. Previously, no benchmark available largely hindered the study to answer these questions. To aid in our investigation, we provide an AnTibody Understanding Evaluation (ATUE) benchmark. We comprehensively evaluate the performance of protein pre-trained language models by empirical study along with conclusions and new insights. Our ATUE and code are relea