Therapeutic mRNA design requires coordinating multiple interacting sequence features across the full transcript, where codon usage, untranslated regions (UTRs), and their coupling jointly determine stability, translation efficiency, and protein expression. Here, we present mRNA generation via unrolled trajectories and informed latent updates (mRNAutilus), a framework for simultaneous codon optimization and de novo UTR design directly from sequence. mRNAutilus combines a masked discrete diffusion model trained on millions of full-length mRNAs with Monte Carlo Tree Guidance to generate Pareto-efficient sequences under multiple functional objectives, using lightweight regressors over model embeddings to predict half-life, translation efficiency, and protein abundance. Unlike recent methods that design coding sequences and UTRs separately or rely on post hoc assembly and screening, mRNAutilus generates complete transcripts in a single process optimized across properties. Across diverse targets, zero-shot mRNAs encoding P. pyralis luciferase achieve over 400-fold higher expression than wild-type and outperform commercial and machine learning-designed baselines, including zero-shot gener
mRNA-based vaccines have become a major focus in the pharmaceutical industry. The coding sequence as well as the untranslated regions (UTRs) of an mRNA can strongly influence translation efficiency, stability, degradation, and other factors that collectively determine a vaccine's effectiveness. However, optimizing mRNA sequences for those properties remains a complex challenge. Existing deep learning models often focus solely on coding region optimization, overlooking the UTRs. We present Helix-mRNA, a hybrid structured state-space and attention model, to address these challenges. In addition to an initial pre-training stage, a second pre-training stage allows us to specialise the model with high-quality data. We employ single nucleotide tokenization of mRNA sequences with codon separation, ensuring prior biological and structural information from the original mRNA sequence is not lost. Our model, Helix-mRNA, outperforms existing methods in analysing both UTRs and coding region properties. It can process sequences 6x longer than current approaches while using only 10% of the parameters of existing foundation models. Its predictive capabilities extend to all mRNA regions. We open-source
Messenger RNA (mRNA)-based vaccines are accelerating the discovery of new drugs and revolutionizing the pharmaceutical industry. However, selecting particular mRNA sequences for vaccines and therapeutics from extensive mRNA libraries is costly. Effective mRNA therapeutics require carefully designed sequences with optimized expression levels and stability. This paper proposes a novel contextual language model (LM)-based embedding method: mRNA2vec. In contrast to existing mRNA embedding approaches, our method is based on the self-supervised teacher-student learning framework of data2vec. We jointly use the 5' untranslated region (UTR) and coding sequence (CDS) region as the input sequences. We adapt our LM-based approach specifically to mRNA by 1) considering the importance of location on the mRNA sequence with probabilistic masking, 2) using Minimum Free Energy (MFE) prediction and Secondary Structure (SS) classification as additional pretext tasks. mRNA2vec demonstrates significant improvements in translation efficiency (TE) and expression level (EL) prediction tasks in UTR compared to SOTA methods such as UTR-LM. It also gives a competitive performance in mRNA stability and protei
mRNA-protein assemblies play a fundamental role in forming membraneless compartments within cells, whose functions may include activating, inhibiting, and localizing reactions. Recruitment of proteins into droplets can diminish cell-to-cell variability in protein abundance. However, the extent to which mRNA-protein assemblies may also buffer noise arising from transcription is not understood. Study of this question is complicated by the fact that kinetic models typically treat assembly as a phase separation process, even though mRNA-protein assemblies can contain as few as 2 mRNA transcripts, far below the thermodynamic thresholds for phase separation. Here, through stochastic simulations and asymptotic analysis, we quantify noise suppression by mRNA-protein assemblies as a function of gene expression kinetic parameters, and show that assemblies formed from just a handful of mRNAs effectively regulate transcript abundances and suppress fluctuations. We place particular emphasis on how this mechanism can facilitate regulated transcription by reducing noise even in the context of infrequent bursts of transcription. We investigate two biologically relevant models in which mRNA assembly acts to either "
Large genomic foundation models have recently achieved remarkable results and in vivo translation capabilities. However, these models quickly grow to several billion parameters and are expensive to run when compute is limited. To overcome this challenge, we present a distillation framework for transferring mRNA representations from a state-of-the-art genomic foundation model into a much smaller model specialized for mRNA sequences, reducing the size by 200-fold. Embedding-level distillation worked better than logit-based methods, which we found unstable. Benchmarking on mRNA-bench demonstrates that the distilled model achieves state-of-the-art performance among models of comparable size and competes with larger architectures on mRNA-related tasks. Our results highlight embedding-based distillation of mRNA sequences as an effective training strategy for biological foundation models. This enables similarly efficient and scalable sequence modelling in genomics, particularly when large models are computationally challenging or infeasible.
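A minimal sketch of what an embedding-level distillation objective can look like, for illustration only. The abstract does not state the loss that was used, so the mean-squared-error form, the linear projection from the student dimension to the teacher dimension, and the names `project` and `embedding_distillation_loss` are all assumptions rather than the authors' method.

```python
def project(student_vec, weights):
    """Map a student embedding into the teacher's dimension.

    `weights` is a hypothetical projection matrix given as a list of rows,
    one row per teacher dimension, each row as long as `student_vec`.
    """
    return [sum(w * x for w, x in zip(row, student_vec)) for row in weights]


def embedding_distillation_loss(teacher_vec, student_vec, weights):
    """Mean squared error between the teacher embedding and the projected
    student embedding (an assumed loss, not taken from the abstract)."""
    projected = project(student_vec, weights)
    if len(projected) != len(teacher_vec):
        raise ValueError("projection output must match the teacher dimension")
    squared_errors = [(t - p) ** 2 for t, p in zip(teacher_vec, projected)]
    return sum(squared_errors) / len(teacher_vec)


# Toy example: a 2-dimensional student matched against a 3-dimensional teacher.
teacher = [1.0, 0.0, 2.0]
student = [1.0, 2.0]
perfect_projection = [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]]
loss_when_matched = embedding_distillation_loss(teacher, student, perfect_projection)
```

In this sketch the student is trained to reproduce the teacher's representation directly, in contrast to logit-based distillation, which would match output distributions over tokens.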
Membraneless droplets formed through liquid-liquid phase separation (LLPS) play a crucial role in mRNA storage, enabling organisms to swiftly respond to environmental changes. However, the mechanisms underlying mRNA integration and protection within droplets remain unclear. Here, we unravel the role of bacterial aggresomes as stress granules (SGs) in safeguarding mRNA during stress. We discovered that upon stress onset, mobile mRNA molecules selectively incorporate into individual proteinaceous SGs based on length-dependent enthalpic gain over entropic loss. As stress prolongs, SGs undergo compaction facilitated by stronger non-specific RNA-protein interactions, thereby promoting recruitment of shorter RNA chains. Remarkably, mRNA ribonucleases are repelled from bacterial SGs, due to the influence of protein surface charge. This exclusion mechanism ensures the integrity and preservation of mRNA within SGs during stress conditions, explaining how mRNA can be stored and protected from degradation. Following stress removal, SGs facilitate mRNA translation, thereby enhancing cell fitness in changing environments. These droplets maintain mRNA physiological activity during storage, makin
Tight regulation of messenger RNA (mRNA) stability is essential to ensure accurate gene expression in response to developmental and environmental cues. mRNA stability is controlled by mRNA decay pathways, which have traditionally been proposed to occur independently of translation. However, the recent discovery of a co-translational mRNA decay pathway (also known as CTRD) reveals that mRNA translation and decay can be coupled: an mRNA can be targeted for degradation while it is being translated. This pathway was first described in yeast and rapidly identified in several plant species. This review explores recent advances in our understanding of CTRD in plants, emphasizing its regulation and its importance for development and stress response. The different metrics used to assess CTRD activity are also presented. Furthermore, this review outlines future directions to explore the importance of mRNA decay in maintaining mRNA homeostasis in plants.
Research in the life sciences often employs messenger ribonucleic acid (mRNA) quantification as a standalone approach for functional analysis. However, although the correlation between the measured levels of mRNA and proteins is positive, the correlation coefficients observed empirically are imperfect, necessitating caution in making agnostic inferences. This essay provides a statistical reflection and caveat on the concept of correlation strength in the context of transcriptomics-proteomics studies. It highlights the variability in possible protein levels at given empirical correlation values, even for a precisely known mRNA amount, and underscores the notable proportion of mRNA-protein pairs with abundances at opposite ends of their respective distributions. Cell biologists, data scientists, and biostatisticians should recognise that mRNA-protein correlation alone is insufficient to justify using a single mRNA quantification to infer the amount or function of its corresponding protein.
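The statistical point can be made concrete with a small worked example. The sketch below assumes, purely for illustration, that standardized mRNA and protein abundances follow a bivariate normal distribution with correlation r; the essay's own model and numbers are not given in the abstract, and the function names are hypothetical. Under that assumption, the spread of protein level at a fixed mRNA level is sqrt(1 - r^2) standard deviations, and the probability that a pair falls on opposite sides of the two medians is arccos(r) / pi.

```python
import math


def variance_explained(r):
    """Fraction of protein variance accounted for by mRNA level (r squared)."""
    return r * r


def conditional_sd(r):
    """Standard deviation of standardized protein level at a fixed mRNA level,
    assuming a bivariate normal distribution with correlation r."""
    return math.sqrt(1.0 - r * r)


def prob_opposite_sides(r):
    """Probability that mRNA and protein fall on opposite sides of their
    medians under the same bivariate normal assumption."""
    return math.acos(r) / math.pi


# Tabulate a few illustrative correlation values.
for r in (0.4, 0.6, 0.8):
    print(
        f"r={r:.1f}  variance explained={variance_explained(r):.2f}  "
        f"conditional SD={conditional_sd(r):.2f}  "
        f"opposite sides={prob_opposite_sides(r):.2f}"
    )
```

Even at r = 0.6, the residual spread in protein level is 80% of the unconditional standard deviation under this assumption, which is one way to see why a single mRNA measurement constrains the protein level only weakly.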
mRNA optimization is critical for therapeutic and biotechnological applications, since sequence features directly govern protein expression levels and efficacy. However, current methods face significant challenges in simultaneously achieving three key objectives: (1) fidelity (preventing unintended amino acid changes), (2) computational efficiency (speed and scalability), and (3) the scope of optimization variables considered (multi-objective capability). Furthermore, existing methods often fall short of comprehensively incorporating the factors related to the mRNA lifecycle and translation process, including intrinsic mRNA sequence properties, secondary structure, translation elongation kinetics, and tRNA availability. To address these limitations, we introduce RNop, a novel deep learning-based method for mRNA optimization. We collect a large-scale dataset containing over 3 million sequences and design four specialized loss functions, the GPLoss, CAILoss, tAILoss, and MFELoss, which simultaneously enable explicit control over sequence fidelity while optimizing species-specific codon adaptation, tRNA availability, and desirable mRNA secondary structure features. Then,
Language models are increasingly applied to biological sequences like proteins and mRNA, yet their default Euclidean geometry may mismatch the hierarchical structures inherent to biological data. While hyperbolic geometry provides a better alternative for accommodating hierarchical data, it has yet to find its way into language modeling for mRNA sequences. In this work, we introduce HyperHELM, a framework that implements masked language model pre-training in hyperbolic space for mRNA sequences. Using a hybrid design with hyperbolic layers atop a Euclidean backbone, HyperHELM aligns learned representations with the biological hierarchy defined by the relationship between mRNA and amino acids. Across multiple multi-species datasets, it outperforms Euclidean baselines on 9 out of 10 tasks involving property prediction, with a 10% improvement on average, and excels in out-of-distribution generalization to long and low-GC content sequences; for antibody region annotation, it surpasses hierarchy-aware Euclidean models by 3% in annotation accuracy. Our results highlight hyperbolic geometry as an effective inductive bias for hierarchical language modeling of mRNA sequences.
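To illustrate why hyperbolic space suits hierarchies, the sketch below computes geodesic distance in the Poincaré ball, one standard model of hyperbolic geometry. The abstract does not say which hyperbolic model HyperHELM uses, so the choice of the Poincaré ball with curvature -1 and the function name `poincare_distance` are assumptions for illustration.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit Poincaré ball
    (curvature -1). Both points must have Euclidean norm below 1."""
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    arg = 1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v))
    # Guard against rounding that could push the argument just below 1.
    return math.acosh(max(1.0, arg))


origin = (0.0, 0.0)
near_center = (0.5, 0.0)
near_boundary = (0.95, 0.0)

# Equal Euclidean steps cover far more hyperbolic distance near the boundary,
# which leaves room for the many leaves of a tree-like hierarchy.
d_center = poincare_distance(origin, near_center)
d_boundary = poincare_distance(near_center, near_boundary)
```

Here `d_center` equals ln 3, while `d_boundary` is larger even though the Euclidean step (0.45) is shorter than the first one (0.5).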
Designing mRNA sequences is a major challenge in developing next-generation therapeutics, since it involves exploring a vast space of possible nucleotide combinations while optimizing sequence properties like stability, translation efficiency, and protein expression. While Generative Flow Networks are promising for this task, their training is hindered by sparse, long-horizon rewards and multi-objective trade-offs. We propose Curriculum-Augmented GFlowNets (CAGFN), which integrate curriculum learning with multi-objective GFlowNets to generate de novo mRNA sequences. CAGFN uses a length-based curriculum that progressively adapts the maximum sequence length, guiding exploration from easier to harder subproblems. We also provide a new mRNA design environment for GFlowNets which, given a target protein sequence and a combination of biological objectives, allows for the training of models that generate plausible mRNA candidates. This provides a biologically motivated setting for applying and advancing GFlowNets in therapeutic sequence design. On different mRNA design tasks, CAGFN improves Pareto performance and biological plausibility, while maintaining diversity. Moreover, CAGFN r
The growing importance of mRNA therapeutics and synthetic biology highlights the need for models that capture the latent structure of synonymous codon usage (different triplets encoding the same amino acid), which subtly modulates translation efficiency and gene expression. While recent efforts incorporate codon-level inductive biases through auxiliary objectives, they often fall short of explicitly modeling the structured relationships that arise from the genetic code's inherent symmetries. We introduce Equi-mRNA, the first codon-level equivariant mRNA language model that explicitly encodes synonymous codon symmetries as cyclic subgroups of the 2D special orthogonal group SO(2). By combining group-theoretic priors with an auxiliary equivariance loss and symmetry-aware pooling, Equi-mRNA learns biologically grounded representations that outperform vanilla baselines across multiple axes. On downstream property-prediction tasks including expression, stability, and riboswitch switching, Equi-mRNA delivers up to approximately 10% improvements in accuracy. In sequence generation, it produces mRNA constructs that are up to approximately 4x more realistic under Frechet BioDistance metrics a
Messenger RNA (mRNA) sequences used as therapeutics require optimized design to ensure efficient translation, structural stability, and minimal immunogenicity. This study presents a two-stage in-silico framework that integrates deep learning and evolutionary computation for rational mRNA optimization, as an alternative to existing state-of-the-art models. In the first stage, a pretrained CodonTransformer (BERT-like Large Language Model) generates biologically coherent mRNA sequences encoding the target antigen. In the second stage, a genetic algorithm (GA) evolves these candidate sequences through codon-aware crossover and synonymous mutation guided by human codon usage preferences. Fitness functions for evaluation combined translation-related metrics (CAI, tAI, codon-pair bias), mRNA structural stability (local and global MFE via RNAfold, GC content), and reduced immunogenicity (CpG/UpA motif frequency). Over successive generations (38th, 40th, and 42nd), the GA improved CAI and tAI by over 6% (achieving CAI values of 0.73 to 0.74 and tAI values of 0.63 to 0.64), kept codon-pair bias high and consistent (0.97), and improved ribosomal accessibility at the 5' end, with an unpaired_30 fraction reach
The ac4C modification on mRNA has been demonstrated to be associated with various diseases; however, its molecular mechanism remains unclear. Wet-lab experiments have produced relatively coarse data that lack precise ac4C modification sites, and extracting valuable information from such data remains a challenge. In this study, we integrate linguistics, traditional machine learning, and deep learning, establishing a link between the understanding of mRNA data and natural language processing (NLP). Through our analysis, we reveal key information about ac4C in mRNA and uncover a mechanism by which ac4C information is stored redundantly on a single sequence. This redundant information storage in mRNA facilitates the transmission of ac4C information and promotes the enrichment of ac4C.
Accurate prediction of mRNA secondary structure is critical for understanding gene expression, translation efficiency, and advancing mRNA-based therapeutics. However, the combinatorial complexity of possible foldings, especially in long sequences, poses significant computational challenges for classical algorithms. In this work, we propose a scalable, quantum-centric optimization framework that integrates quantum sampling with classical post-processing to tackle this problem. Building on a Quadratic Unconstrained Binary Optimization (QUBO) formulation of the mRNA folding task, we develop two complementary workflows: a Conditional Value at Risk (CVaR)-based variational quantum algorithm enhanced with gauge transformations and local search, and an Instantaneous Quantum Polynomial (IQP) circuit-based scheme where training is done classically and sampling is delegated to quantum hardware. We demonstrate the effectiveness of these approaches using IBM quantum processors, solving problem instances with up to 156 qubits and circuits containing up to 950 nonlocal gates, corresponding to mRNA sequences of up to 60 nucleotides. Additionally, we validate scalability of the CVaR algorithm on a
This study examines the roles of public and private sector actors in the development of mRNA vaccines, a breakthrough innovation in modern medicine. Using a dataset of 151 core patent families and 2,416 antecedent (cited) patents, we analyze the structure and dynamics of the mRNA vaccine knowledge network through network theory. Our findings highlight the central role of biotechnology firms, such as Moderna and BioNTech, alongside the crucial contributions of universities and public research organizations (PROs) in providing foundational knowledge. We develop a novel credit allocation framework, showing that universities, PROs, government, and research centers account for at least 27% of the external technological knowledge base behind mRNA vaccine breakthroughs, a figure that represents a minimum threshold of their overall contribution. Our study offers new insights into pharmaceutical and biotechnology innovation dynamics, emphasizing how Moderna's and BioNTech's mRNA technologies have benefited from academic institutions, with notable differences in their institutional knowledge sources.
mRNA therapy is gaining worldwide attention as an emerging therapeutic approach. The widespread use of mRNA vaccines during the COVID-19 outbreak has demonstrated the potential of mRNA therapy. As mRNA-based drugs have expanded and their indications have broadened, more patents for mRNA innovations have emerged. The global patent landscape for mRNA therapy, from new technology to productization, has not yet been analyzed, leaving a research gap. This study uses social network analysis together with patent quality assessment to investigate the temporal trends, citation relationships, and significant litigation for 16,101 mRNA therapy patents, and summarizes the hot topics and potential future directions for this industry. The information obtained in this study can serve researchers as a comprehensive and integrated knowledge resource and can also provide inspiration for efficient production methods for mRNA drugs. This study shows that infectious diseases and cancer are currently the primary applications for mRNA drugs. Emerging patent activity and lawsuits in this field are demonstrating that delivery technology remains one of the key challe
Messenger RNA (mRNA) plays a crucial role in protein synthesis, with its codon structure directly impacting biological properties. While Language Models (LMs) have shown promise in analyzing biological sequences, existing approaches fail to account for the hierarchical nature of mRNA's codon structure. We introduce Hierarchical Encoding for mRNA Language Modeling (HELM), a novel pre-training strategy that incorporates codon-level hierarchical structure into language model training. HELM modulates the loss function based on codon synonymity, aligning the model's learning process with the biological reality of mRNA sequences. We evaluate HELM on diverse mRNA datasets and tasks, demonstrating that HELM outperforms standard language model pre-training as well as existing foundation model baselines on seven diverse downstream property prediction tasks and an antibody region annotation task by around 8% on average. Additionally, HELM enhances the generative capabilities of the language model, producing diverse mRNA sequences that better align with the underlying true data distribution compared to non-hierarchical baselines.
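One way to picture a loss that is modulated by codon synonymity is sketched below. This is not HELM's actual formulation, which the abstract does not specify. It is an assumed two-level cross-entropy that mixes a codon-level term with an amino-acid-level term, so that predicting a synonymous codon is penalized less than predicting a codon for a different amino acid. The five-codon vocabulary, the weight `alpha`, and the function name `hierarchical_loss` are hypothetical.

```python
import math

# Small excerpt of the standard genetic code, used as a toy vocabulary.
CODON_TO_AA = {
    "GCU": "Ala",
    "GCC": "Ala",
    "AAA": "Lys",
    "AAG": "Lys",
    "UGG": "Trp",
}


def hierarchical_loss(probs, true_codon, alpha=0.5):
    """Assumed two-level loss: (1 - alpha) * codon cross-entropy plus
    alpha * amino-acid cross-entropy, where the amino-acid probability
    is the summed probability of all its synonymous codons."""
    true_aa = CODON_TO_AA[true_codon]
    p_codon = probs[true_codon]
    p_aa = sum(p for codon, p in probs.items() if CODON_TO_AA[codon] == true_aa)
    return -(1.0 - alpha) * math.log(p_codon) - alpha * math.log(p_aa)


# Two predictions give the true codon GCU the same probability (0.1), but one
# puts most of the remaining mass on the synonymous codon GCC, the other on AAA.
synonymous_miss = {"GCU": 0.1, "GCC": 0.8, "AAA": 0.05, "AAG": 0.03, "UGG": 0.02}
nonsynonymous_miss = {"GCU": 0.1, "GCC": 0.05, "AAA": 0.8, "AAG": 0.03, "UGG": 0.02}

loss_syn = hierarchical_loss(synonymous_miss, "GCU")
loss_nonsyn = hierarchical_loss(nonsynonymous_miss, "GCU")
```

With `alpha=0` the sketch reduces to ordinary cross-entropy, which treats both mistakes identically; with `alpha > 0` the synonymous miss receives the lower loss.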
mRNA technology has revolutionized vaccine development, protein replacement therapies, and cancer immunotherapies, offering rapid production and precise control over sequence and efficacy. However, the inherent instability of mRNA poses significant challenges for drug storage and distribution, particularly in resource-limited regions. Co-optimizing RNA structure and codon choice has emerged as a promising strategy to enhance mRNA stability while preserving efficacy. Given the vast sequence and structure design space, specialized algorithms are essential to achieve these qualities. Recently, several effective algorithms have been developed to tackle this challenge that all use similar underlying principles. We call these specialized algorithms "mRNA folding" algorithms as they generalize classical RNA folding algorithms. A comprehensive analysis of their underlying principles, performance, and limitations is lacking. This review aims to provide an in-depth understanding of these algorithms, identify opportunities for improvement, and benchmark existing software implementations in terms of scalability, correctness, and feature support.
Designing therapeutic messenger RNA (mRNA) requires creating full-length transcripts that carefully balance stability, translation efficiency, and immune safety. To address this challenge, we propose ProMORNA, a multi-objective generation framework that produces complete mRNA transcripts de novo directly from a target protein sequence. Our approach begins by training a BART-style encoder-decoder model on over 6 million natural protein-mRNA pairs. We then introduce Multi-Objective Group Relative Policy Optimization (MO-GRPO) to simultaneously optimize for various biological objectives in a unified way. As a case study, we evaluated ProMORNA on the widely used firefly luciferase target, excluding it from both our supervised training data and the prompt pool. The results indicate that ProMORNA improves the in silico Pareto frontier for predicted half-life and translation efficiency relative to standard supervised baselines. Additionally, it achieves higher predicted functional scores than a state-of-the-art baseline under the same evaluation pipeline. These computational findings demonstrate the feasibility of using multi-objective reinforcement learning for full-len