Sergey Ovchinnikov

3.8K posts

@sokrypton

Scientist, Assistant Professor @MITBiology, #FirstGen, ProteinBERTologist, 🇺🇦 No Human is illegal. Moving to: https://t.co/sow6IRD3jj

Cambridge, MA · Joined December 2014

3.7K Following · 17.8K Followers

Pinned Tweet

Sergey Ovchinnikov@sokrypton·17 Jul

I'm excited to share that I'll be joining @MITBiology as an Asst Prof. in Jan 2024! Come join us! 🤓🧪🖥️🧬

169

148

216.4K

Sergey Ovchinnikov retweeted

Silvi Rouskin@silvirouskin·6d

New preprint from the lab! In short: nobody told the model what a stem-loop was. We gave it 50,000 IRES sequences and one job: predict masked nucleotides. That's it! Albatross is an RNA language model that taught itself the structural logic of viral RNA the same way LLMs learn grammar: from sequence alone. No base-pairing rules, no thermodynamics, no structure labels. Check it out: biorxiv.org/content/10.648… and albatrossrna.org

193

30.1K

Sergey Ovchinnikov retweeted

Anindyadeep@anindyadeeps·3d

We fine-tuned Protenix on RNA data using @try_litefold Tune (our multimodal fine-tuning engine) and got a 20% jump in pLDDT and a 10% jump in the average TM-score. Currently SOTA on RNA structure prediction. More announcements on this. Stay tuned.

Anindyadeep@anindyadeeps

So our fine-tuning engine is giving us a ~20% increase (full evaluation) on the RNA folding task. Now starting a second round of continual tuning.

17.4K

Sergey Ovchinnikov@sokrypton·4d

@DmitryRybin1

QME

185

10.2K

Dmitry Rybin@DmitryRybin1·4d

Some time ago mathematicians proposed that the first thing we should share with alien intelligence is this image:

137

1.4K

400.9K

Sergey Ovchinnikov@sokrypton·4d

@kyr_dreamer @ChoYehlin @grocklin @KotaroTsuboyama See here for comparison with ESM baseline or when only finetuned on megascale data. x.com/ChoYehlin/stat…

Yehlin Cho@ChoYehlin

We fine-tuned sequence- and structure-pretrained models on our large-scale stability data to predict absolute folding stability. The resulting SaProtΔG and ESM3ΔG achieved Spearman r = 0.88 and 0.87, with RMSE = 0.80 kcal mol⁻¹.

BlakeTheCoder@kyr_dreamer·5d

@ChoYehlin @grocklin @KotaroTsuboyama @sokrypton Really great work. I'm especially impressed with the dataset scale. 1.8M measurements change the game for stability prediction. How's it compare to existing ESM benchmarks?

298

Sergey Ovchinnikov retweeted

Yehlin Cho@ChoYehlin·5d

🚀 Excited to share our new work: Absolute Stability Predictor! 📊: forms.gle/4ZnXZSnTBvaykk… Built the MGnify Stability Dataset (1.8M+ measurements) and developed stability prediction models, together with @grocklin, @KotaroTsuboyama, @sokrypton, and teams.

GIF

219

29.4K

Sergey Ovchinnikov@sokrypton·5d

@biocheMichael @design_proteins @ChoYehlin The model built on SaProt should be available for both commercial and non-commercial use.

211

Michael - Protein Thx and Biologics@biocheMichael·5d

@design_proteins @ChoYehlin Free for academic use - requires licensing for commercial use fwiw

349

Corey Howe@design_proteins·5d

New binder scoring metric just dropped: absolute stability prediction with sequence-based (ESM3dG) and structure-based (SaProtdG) PLMs. Adding stability prediction boosts ipSAE performance in discriminating binders from non-binders. Congrats @ChoYehlin!

Yehlin Cho@ChoYehlin

118

11.3K

Sergey Ovchinnikov retweeted

Yehlin Cho@ChoYehlin·5d

We measured stability for 1.8M diverse protein domains (60–80 aa) from the MGnify metagenomic database, spanning 200k+ sequence families, and created the MGnify Stability Dataset.

4.3K

Sergey Ovchinnikov@sokrypton·6d

@janekm @mbeisen Yeah... That wasn't part of the prompt... 😅 Not sure why it (Gemini) added all the "edible" disclaimers, I just asked it to make a slide comparing the sizes between rice and fly eggs.

173

Janek Mann@janekm·6d

@sokrypton @mbeisen I love that it’s the “edible varieties” of fly egg, that’s reassuring 🤣

129

Michael 英泉 Eisen@mbeisen·20 May

How do you publish this tree and claim it's a success???!!!

Leandro von Werra@lvwerra

We are releasing Carbon: a crazy fast DNA model. Carbon is 275x faster than the next best model. So fast you can process the whole human genome on a single GPU in <2 days. Here are the tricks we used: When modelling DNA sequences, a lot of the performance comes down to tokenizing the sequences in a smart way. BPE tokenizers struggle because there are no whitespaces, and character-level tokenizers (a character is called a base in DNA) waste a lot of compute on too many tokens. Carbon is built with a unique tokenizer: we split sequences into chunks of 6 bases, but during both training and inference we can work at single-base resolution. That's similar to having word tokens but resolving them at the character level. All possible thanks to the DNA tokens' unique structure. The architecture combined with the tokenizer makes the model 275x faster than the previous SoTA (Evo2) at this size. We built an interactive demo so you can explore how the model can generate DNA sequences, investigate the structure of genes, predict the effect of mutations, generate and fold proteins, and even reconstruct parts of the tree of life. huggingface.co/spaces/Hugging…

398

87.3K

Chris Hayduk@ChrisHayduk·17 May

GPT 5.5 is an effective autoresearcher in structural biology! I've had goal mode running for over 150 hours straight, looking for topologically inspired architectural changes to improve the performance of AlphaFold2. Performance is strong and improving!

135

1.4K

131.9K

Sergey Ovchinnikov@sokrypton·17 May

@ChrisHayduk Pretty cool! Though be careful... I tried something like this with Claude Code a while back, and after it ran for a couple of days, I woke up to an RMSD of zero! That's when I realized it had started feeding the correct answer in as input to the model 😅

Sergey Ovchinnikov retweeted

Protein Data Bank@PDBeurope·14 May

PDBe-SIFTS is now open source 🎉 Developed in collaboration with @PDBeurope and @uniprot, it enables fast, accurate residue-level mapping between protein sequences and structures, achieving >93% agreement with curated mappings. Get started: github.com/PDBeurope/SIFTS

191

26.4K

Sergey Ovchinnikov retweeted

Állan Ferrari@ajrferrari·13 May

Now available as its peer-reviewed version in Nature: nature.com/articles/s4158…

Állan Ferrari@ajrferrari

Thrilled to share our new paper where we introduce a multiplexed hydrogen–deuterium exchange MS (mHDX‑MS) method that can measure hundreds of protein domains’ conformational energy landscapes—all in a single experiment! biorxiv.org/content/10.110…

246

55.7K

Sergey Ovchinnikov retweeted

Andrew Savinov@bioSavinov·13 May

Interested in genetically encodable inhibitors of your favorite biomolecular condensate? Excited to announce our latest work, w/ @jibin_sadasivan, @GeneWeiLiLab, & @LindsayCase19, on protein fragments as generalizable regulators of phase separation. (1/n) biorxiv.org/content/10.648…

134

20.4K

Sergey Ovchinnikov@sokrypton·12 May

@julian_englert Do you mean "requires zero ai knowledge"? 🙃 Seems you still need to know what a protein is?

150

Julian Englert@julian_englert·29 Nov

We just made an app that walks you through designing a novel protein with AI from scratch. Takes about 5 minutes, requires zero biology knowledge. ➡️ design-a-protein.com The best part: we will actually synthesize 1000 of those protein designs in the lab and test their real world function as a therapeutic.

183

971

122.7K

Sergey Ovchinnikov@sokrypton·10 May

@shae_mcl You also need to factor in the cost of sequence databases: isolate genomes (e.g. UniProt) and metagenomic databases (e.g. MGnify, JGI). And the billions of years of natural selection needed to produce these 🙃.

681

Shae McLaughlin@shae_mcl·9 May

It’s estimated that the Protein Data Bank (PDB) cost around $13B to create. AlphaFold was only possible because of it. If we want ML to solve biology, we should be funding the creation of databases and the development of new assay technologies. ML is nothing without data.

176

1.3K

156.7K

Sergey Ovchinnikov@sokrypton·10 May

@nlarusstone @jboysen0 Sidenote: Pre-AlphaFold, it was thought you needed ~1000 diverse sequences (no two more than 90% identical to each other) related to the sequence whose structure you wanted to compute (with simple linear algebra). Post-AlphaFold (and DL), that dropped to ~100.

Sergey Ovchinnikov@sokrypton·10 May

@nlarusstone @jboysen0 If anything, it might be more interesting to quantify the cost of resequencing the UniProt and metagenomic (MGnify/JGI) DBs at a 90% clustering threshold. That being said, if someone solves a structure where sequence databases lack diversity, structure will definitely help (2/2).

Nicholas Larus-Stone@nlarusstone·9 May

I see this claim a lot, but the really interesting question here is: what is the smallest version of the PDB that would have allowed us to get AlphaFold2-level performance?

Shae McLaughlin@shae_mcl

20.9K

Sergey Ovchinnikov retweeted

Yeqing Lin@lin_yeqing·8 May

Introducing Genie 3, a generative protein model that substantially advances the state-of-the-art for binder design, increasing in silico success rates by up to 20x on hard multimeric targets. It also debuts a form of inference-time scaling unobserved in other design models. 🧵1/8

110

435

73.7K
