Training mRNA Language Models Across 25 Species for $165

You have a protein concept in your head. By end of day, you want a synthesis-ready, codon-optimized DNA sequence. That’s the pipeline OpenMed set out to build, and training the codon models at its core cost $165 in compute.

This isn’t another polished corporate launch. It’s a transparent account of building a protein AI pipeline from scratch: structure prediction, sequence design, and mRNA optimization. The folding and design parts lean on established tools: ESMFold from Meta and ProteinMPNN from the Baker Lab. The codon optimization piece is all theirs: new models, new training infrastructure, new evaluation metrics.

What They Built

Three stages. First, predict the 3D structure (ESMFold on 30 protein chains, average pTM of 0.79). Second, design amino acid sequences that fold into that structure (ProteinMPNN on scaffold 7K00, 42% sequence recovery). Third—and this is where the real work went—optimize the DNA codons so the protein actually expresses in the target organism.

The mRNA optimization piece is the standout. They trained multiple transformer variants on 250k coding sequences, then scaled to 381k sequences across 25 species. Four production models, 55 GPU-hours total, $165 all-in. That’s absurdly cheap for what they got.

Architecture Exploration

Codon sequences are weird. Each token is a nucleotide triplet drawn from a 64-codon vocabulary, with strong positional dependencies and species-specific usage biases. BERT variants dominate protein modeling (ESM-2, ProtTrans), but nobody had really tested which architecture works best at the codon level.
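To make the 64-token vocabulary concrete, here is a minimal sketch of codon-level tokenization. It is not OpenMed's tokenizer: the token ordering, the RNA alphabet, and the `tokenize_cds` name are assumptions chosen for illustration.

```python
from itertools import product

BASES = "ACGU"
# All 4^3 = 64 codons in a fixed order; the list index is the token id.
# (This ordering is an illustrative assumption, not OpenMed's vocabulary.)
CODONS = ["".join(p) for p in product(BASES, repeat=3)]
CODON_TO_ID = {codon: i for i, codon in enumerate(CODONS)}


def tokenize_cds(seq: str) -> list[int]:
    """Split a coding sequence into in-frame codons and map them to ids."""
    seq = seq.upper().replace("T", "U")  # accept DNA or RNA input
    if len(seq) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    return [CODON_TO_ID[seq[i:i + 3]] for i in range(0, len(seq), 3)]


# Start codon, alanine codon, stop codon.
ids = tokenize_cds("ATGGCCTAA")
```

A real tokenizer would also reserve ids for special tokens such as padding and masking, and would need a policy for ambiguous bases like N.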

They started with a tiny CodonBERT baseline (6M params, following Sanofi’s published architecture) and scaled up through ModernBERT and RoBERTa families. The hypothesis was straightforward: RoBERTa, the same architecture behind Meta’s ESM-2, should handle codon patterns well. Turns out that hypothesis was right.

The contenders:

  • CodonBERT baseline: 6M params, BERT-tiny, just to establish floor performance
  • ModernBERT-base: 90M params, 22 layers, all the latest efficiency innovations
  • CodonRoBERTa-base: 92M params, 12 layers
  • CodonRoBERTa-large: 312M params, 24 layers
  • CodonRoBERTa-large-v2: 312M params, same architecture, better hyperparameters

CodonRoBERTa-large-v2 won decisively: perplexity 4.10 and a Spearman correlation of 0.40 with the Codon Adaptation Index (CAI). ModernBERT couldn’t touch it. That’s not surprising in retrospect: ModernBERT was designed for long-context NLP tasks, not for learning codon usage biases from short coding sequences. RoBERTa’s proven MLM architecture, already battle-tested on protein sequences, translated naturally to the codon domain.
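For readers less familiar with the metric, perplexity is the exponential of the mean per-token negative log-likelihood. The sketch below assumes natural-log losses collected at the evaluated positions; the `perplexity` helper is hypothetical and is not taken from OpenMed's evaluation scripts.

```python
import math


def perplexity(token_nlls: list[float]) -> float:
    """exp(mean negative log-likelihood), with losses in natural log units."""
    if not token_nlls:
        raise ValueError("need at least one token loss")
    return math.exp(sum(token_nlls) / len(token_nlls))


# A model that guesses uniformly over all 64 codons has perplexity 64.
uniform = perplexity([math.log(64)] * 10)

# Going the other way: a perplexity of 4.10 corresponds to this mean loss.
mean_loss = math.log(4.10)
```

Read this way, a perplexity of 4.10 means the model is, on average, about as uncertain as a choice among roughly four equally likely codons, against 64 for a uniform guess.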

Scaling to Multi-Species

This is where it gets interesting. Most codon optimization tools are species-specific. You want E. coli? Here’s a frequency table. You want human? Here’s another. OpenMed built a species-conditioned system that handles 25 organisms simultaneously. No other open-source project offers this.
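The frequency-table approach is simple enough to sketch, which is useful as a baseline for what a learned model has to beat. The code below builds per-codon weights from reference coding sequences, scores a sequence with the Codon Adaptation Index, and back-translates a protein by always picking the most used codon. The reference sequence is toy data invented for illustration, and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codons(seq):
    """In-frame codons of a DNA string; a trailing partial codon is dropped."""
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def relative_adaptiveness(reference_cds):
    """w[codon] = count(codon) / count(most used synonymous codon)."""
    counts = Counter(c for seq in reference_cds for c in codons(seq))
    best = defaultdict(int)
    for codon, n in counts.items():
        aa = CODON_TO_AA[codon]
        best[aa] = max(best[aa], n)
    return {codon: n / best[CODON_TO_AA[codon]] for codon, n in counts.items()}


def cai(seq, w):
    """Geometric mean of w over scored codons (Met, Trp, stops skipped).

    Codons absent from the reference are skipped here; real implementations
    usually assign them a small pseudo-weight instead.
    """
    logs = [math.log(w[c]) for c in codons(seq)
            if c in w and CODON_TO_AA[c] not in "MW*"]
    return math.exp(sum(logs) / len(logs)) if logs else float("nan")


def most_frequent_backtranslate(protein, w):
    """Naive baseline: use the top-weighted codon for every residue."""
    best = {CODON_TO_AA[c]: c for c, v in w.items() if v == 1.0}
    return "".join(best[aa] for aa in protein)


# Toy reference set (made up); a real table comes from many genes per species.
toy_reference = ["ATGGCTGCTGCCAAAAAGAAATAA"]
w = relative_adaptiveness(toy_reference)
dna = most_frequent_backtranslate("MAK", w)
```

The limitation is visible in the code: the table treats every position independently, so it cannot capture the positional dependencies a language model is meant to learn.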

They trained four production models: a general one trained on all 25 species, and three specialized variants. The species-conditioned approach means you can optimize a single gene sequence for expression in E. coli, yeast, CHO cells, or human cells without retraining. The model learns the usage bias implicitly from the training data.
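One common way to condition a single model on species is to prepend a species tag token to every sequence. Whether OpenMed does exactly this is an assumption on my part; the tags, special tokens, and `encode` function below are hypothetical and only illustrate the idea.

```python
from itertools import product

SPECIAL = ["<pad>", "<cls>", "<eos>", "<mask>"]
# Hypothetical species tags; the real system covers 25 organisms.
SPECIES = ["<e_coli>", "<s_cerevisiae>", "<cho>", "<h_sapiens>"]
CODONS = ["".join(p) for p in product("ACGT", repeat=3)]
VOCAB = {tok: i for i, tok in enumerate(SPECIAL + SPECIES + CODONS)}


def encode(cds: str, species: str) -> list[int]:
    """Encode a coding sequence as [<cls>, <species>, codon ids..., <eos>]."""
    tag = f"<{species}>"
    if tag not in VOCAB:
        raise KeyError(f"unknown species: {species}")
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    body = [VOCAB[cds[i:i + 3]] for i in range(0, len(cds), 3)]
    return [VOCAB["<cls>"], VOCAB[tag], *body, VOCAB["<eos>"]]


# Same gene, two target organisms: only the species token differs.
ecoli_ids = encode("ATGGCC", "e_coli")
human_ids = encode("ATGGCC", "h_sapiens")
```

With this layout, switching the target organism at inference time is a one-token change, which is what makes retraining unnecessary.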

Total training cost: 55 GPU-hours on consumer hardware. At current cloud pricing, that’s around $165, which implies a rate of roughly $3 per GPU-hour. I’ve seen single experiments cost more than that in wasted GPU time.
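The arithmetic behind that figure is easy to check and to reuse for your own budget. The only inputs taken from the write-up are the two totals; the `training_cost` helper and the 10-hour example are illustrative assumptions.

```python
TOTAL_COST_USD = 165.0  # reported total
GPU_HOURS = 55.0        # reported total

# Implied hourly rate: 165 / 55 = 3.0 dollars per GPU-hour.
usd_per_gpu_hour = TOTAL_COST_USD / GPU_HOURS


def training_cost(hours: float, rate: float = usd_per_gpu_hour) -> float:
    """Estimated cost in USD for a run of the given length at a flat rate."""
    return hours * rate


# Hypothetical example: a 10 GPU-hour fine-tune at the same implied rate.
small_run = training_cost(10)
```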

The End-to-End Workflow

Take a protein idea. Run ESMFold to predict its 3D structure. Use ProteinMPNN to design sequences that fold into that structure. Then feed those sequences into the codon optimization model, conditioned on your target species. Output: a DNA sequence ready for synthesis.
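The three stages compose naturally as functions. The skeleton below only shows the data flow; it does not call ESMFold, ProteinMPNN, or OpenMed's models. The `Pipeline` class, its field names, and the toy stand-ins are all hypothetical and exist only so the sketch runs.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Pipeline:
    fold: Callable[[str], str]           # protein sequence -> structure text
    design: Callable[[str], list[str]]   # structure -> candidate proteins
    optimize: Callable[[str, str], str]  # (protein, species) -> DNA

    def run(self, protein: str, species: str) -> list[str]:
        structure = self.fold(protein)
        candidates = self.design(structure)
        return [self.optimize(p, species) for p in candidates]


# Toy stand-ins: they do no real modelling and ignore biology entirely.
TOY_CODONS = {"M": "ATG", "K": "AAA"}
toy = Pipeline(
    fold=lambda protein: f"STRUCTURE({protein})",
    design=lambda structure: [structure[len("STRUCTURE("):-1]],
    optimize=lambda protein, species: "".join(TOY_CODONS[aa] for aa in protein),
)
dna = toy.run("MK", "e_coli")
```

In a real setup each callable would wrap the corresponding tool, and the species argument would select the conditioning for the codon model.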

The whole thing runs in an afternoon. That’s not theoretical—they provide runnable code for every step.

Where This Stands

This isn’t going to replace commercial codon optimization services for large-scale production. But for research labs, small biotechs, and open science projects, this is huge. You can now train your own species-specific codon optimization model for pocket change. You can iterate on designs without burning through grant money.

The folding and design components are well-established. The codon optimization piece is genuinely new. OpenMed is releasing everything: model weights, training code, evaluation scripts. Complete results and architectural decisions are documented.

What I’d like to see next: validation data. Perplexity and CAI correlation are good metrics, but they don’t tell you how well these sequences actually express in cells. A wet-lab validation study would be the real test. Also, the 25 species set is heavy on model organisms—I’d love to see coverage expand to more industrial and pathogenic species.

But for a $165 experiment that runs in 55 GPU-hours? This is impressive work. It is exactly what open-source AI for biology should look like: transparent, reproducible, and cheap enough that anyone can build on it.
