

Sergey Ovchinnikov
3.8K posts

@sokrypton
Scientist, Assistant Professor @MITBiology, #FirstGen, ProteinBERTologist, 🇺🇦 No Human is illegal. Moving to: https://t.co/sow6IRD3jj

A few updates: We compare the categorical Jacobian to explicitly computing the pseudo-likelihood for all single & double mutations, allowing one to compute epistasis via ΔE(double) - ΔE(singleA) - ΔE(singleB), as proposed by @JeannefaustineT et al. We see strong correlations. (1/5)
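The ΔE arithmetic in the tweet above is simple to sketch. A minimal illustration, assuming per-variant pseudo-likelihood changes ΔE relative to wild type; the function name and all numbers below are made up, not from the thread:

```python
def epistasis(dE_double, dE_single_a, dE_single_b):
    """Deviation of the double mutant's effect from additivity:
    ΔE(double) - ΔE(singleA) - ΔE(singleB)."""
    return dE_double - dE_single_a - dE_single_b

# Made-up scores: the additive expectation is -1.2 + -0.5 = -1.7,
# the observed double-mutant effect is -2.5, so the interaction
# (epistasis) term is -0.8.
eps = epistasis(-2.5, -1.2, -0.5)
print(round(eps, 2))  # -0.8
```

A value near zero means the double mutant behaves additively; large positive or negative values indicate an epistatic interaction.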


Beyond additivity: zero-shot methods cannot predict impact of epistasis on protein properties and function

1. The study reveals a critical blind spot in modern protein AI: while 95 state-of-the-art zero-shot models can predict single mutations well, they systematically fail when mutations interact epistatically, i.e. when the combined effect of mutations deviates from simple additivity.

2. Using 53 MAVE datasets from ProteinGym, the researchers identified epistatic genotypes by comparing observed effects against expected additive effects, accounting for experimental error. For GFP fluorescence and protein thermostability, epistasis is widespread and biologically genuine, not a measurement artifact.

3. The performance gap is stark. Top models like ESCOTT, PoET, and MSA-Transformer achieve Spearman correlations above 0.6 for all genotypes, but collapse to near-zero or negative correlations for epistatic genotypes. Simple linear regression baselines often match or exceed complex deep learning models on epistatic combinations.

4. This exposes a fundamental limitation: protein language models learn evolutionary plausibility from natural sequences, but natural selection only explores functional sequence space. Epistatic combinations, which often traverse fitness valleys, lie outside this training distribution, leaving models blind to higher-order mutational interactions.

5. The work highlights that clever feature engineering (evolutionary conservation, structural information) outperforms architectural complexity for epistasis prediction. Yet even structure-aware models like ProSST and ESM-IF1, while top performers on stability, show no consistent advantage across datasets.

6. The implications are profound for protein design and directed evolution. Current zero-shot methods cannot reliably navigate rugged fitness landscapes or predict functional variants along evolutionary paths requiring epistatic mutations.
The field urgently needs models trained on multi-mutational data and architectures that explicitly model non-linear interactions.
💻 Code: github.com/kalininalab/ep…
📜 Paper: biorxiv.org/content/10.648…
#ProteinEngineering #Epistasis #MachineLearning #ProteinGym #VariantEffectPrediction #ComputationalBiology #Bioinformatics #ProteinEvolution #AIforScience #StructuralBiology
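Point 2 of the thread describes flagging epistatic genotypes by comparing the observed effect of a multi-mutant against the additive expectation, allowing for experimental error. A minimal sketch of one such criterion; the z-score threshold and all values here are illustrative assumptions, and the paper's exact error model may differ:

```python
def is_epistatic(observed, single_effects, sigma, z=2.0):
    """Flag a multi-mutant as epistatic when its observed effect deviates
    from the sum of its single-mutant effects by more than z standard
    errors of the measurement. (Illustrative criterion, not the paper's.)"""
    expected = sum(single_effects)
    return abs(observed - expected) > z * sigma

# Made-up MAVE-style example: two singles each shift fluorescence by -0.4
# (additive expectation -0.8), but the double drops by -2.0 with a
# measurement error of 0.3 -> deviation 1.2 > 2 * 0.3, so epistatic.
print(is_epistatic(-2.0, [-0.4, -0.4], sigma=0.3))  # True
```

Genotypes that fail this test are treated as additive; only the flagged subset is used to probe whether models capture higher-order interactions.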





Welcome to the Lab of the Future! 🧬🤖 Excited to share LUMI-lab, out today in @CellCellPress: a self-driving platform that pairs an AI foundation model with a robotic lab to autonomously discover ionizable lipids (LNPs) for mRNA delivery.

The core problem: Designing lipid nanoparticles (LNPs) is hard. The chemical space of ionizable lipids is vast, experimental cycles are slow, and, critically, historical LNP datasets are far too small to train a predictive model from scratch. Most AI approaches in this space hit a wall immediately: not enough data to learn from.

Our solution: lab-in-the-loop foundation model learning. Instead of training on LNP data alone, LUMI starts as a transformer-based foundation model pretrained across broad chemical space, building rich molecular representations before it ever sees a single LNP experiment. Then it enters a closed loop with a robotic synthesis platform: predict → synthesize → assay → update. Each round of real wet-lab experiments fine-tunes the model, which then proposes smarter candidates for the next round. The lab isn't just validating AI predictions; it's actively teaching the model, continuously.

What happened when we let it run: LUMI-lab autonomously synthesized and screened 1,700+ ionizable lipids in human bronchial epithelial cells. The top candidate, LUMI-6, features a brominated lipid tail, a structural motif that had been largely overlooked in LNP design. LUMI found it without being told where to look. When formulated into LNPs and delivered intratracheally to mice, LUMI-6 achieved 20.3% gene editing efficiency in lung epithelial cells, a compelling result for one of the hardest-to-reach therapeutic targets, directly relevant to diseases like cystic fibrosis and alpha-1 antitrypsin deficiency.
Why this matters beyond LNPs: This is a proof of concept for a broader thesis: that foundation model pretraining + active learning + robotic experimentation can overcome the data scarcity bottleneck that plagues AI-driven discovery in biology. You don't need a massive domain-specific dataset to start. You need a model that can generalize, a lab that can generate the right data, and a loop that connects them.

Huge congratulations to first authors Yue Xu, @HAOTIANCUI1, and Kuan Pang, and to the entire @BowenLi_Lab team. Grateful to our collaborators at @UHN and @UofTPharmacy, and to Princess Margaret Cancer Centre Research @PMResearch_UHN.

📄 Paper: cell.com/cell/fulltext/…
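The predict → synthesize → assay → update cycle described above can be sketched as a schematic loop. Everything here is a placeholder, not the LUMI-lab API: `model` is assumed to expose `predict` and `finetune`, and `run_assay` stands in for the robotic synthesize-and-assay step.

```python
def lab_in_the_loop(model, candidate_pool, run_assay, rounds=5, batch=96):
    """Schematic active-learning cycle: in each round, score candidates,
    send the top batch to the (robotic) lab, then fine-tune on all
    wet-lab results gathered so far. Illustrative only."""
    history = []
    for _ in range(rounds):
        # 1. Predict: rank remaining candidates with the current model.
        ranked = sorted(candidate_pool, key=model.predict, reverse=True)
        # 2. Synthesize & assay: measure the top batch experimentally.
        picks = ranked[:batch]
        results = [(c, run_assay(c)) for c in picks]
        history.extend(results)
        # 3. Update: fine-tune the model on the accumulated lab data,
        #    then drop tested candidates from the pool.
        model.finetune(history)
        candidate_pool = [c for c in candidate_pool if c not in picks]
    return history
```

The key design choice the thread emphasizes is step 3: the model is updated after every round, so each batch of wet-lab data reshapes which candidates are proposed next, rather than the lab merely validating a fixed set of predictions.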


