Sebastian Ibarraran

29 posts

@s_ibarraran

Theoretical Chemistry PhD Candidate, Rotskoff Lab @ Stanford | Yale ’23

Palo Alto, CA · Joined March 2024
41 Following · 108 Followers
Sebastian Ibarraran retweeted
Grant Rotskoff @grantrotskoff
Protein design has been dominated by diffusion models due to a "structure-first" perspective. What about intrinsically disordered proteins? We scale language-based design using the modern RL stack and our model IDiom. Paper: biorxiv.org/content/10.648… Try it: idiom-designer.vercel.app
3 replies · 23 reposts · 142 likes · 7.4K views
Sebastian Ibarraran retweeted
Frank Hu @FrankWho1050502
Excited to be in Rio for ICLR with @s_ibarraran where we are presenting our work on fine-tuning protein language models using energy rank alignment for efficient directed evolution! Stop by our poster at the GEM workshop on Monday, and feel free to reach out if interested!
0 replies · 1 repost · 2 likes · 45 views
Sebastian Ibarraran retweeted
Jason Liu @JasonLiu1044858
We show that IDiom generates IDRs that recapitulate the sequence features of natural disordered regions, and we additionally use reinforcement learning to steer generation towards sequences with compartment-specific localization features.
1 reply · 1 repost · 2 likes · 34 views
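The "sequence features" referenced here are quantified later in this thread as composition and charge-patterning statistics (FCR, κ, and related descriptors). As a rough illustration only, here is a minimal Python sketch of two standard descriptors plus a simple charge-blockiness proxy; FCR and NCPR follow their usual definitions, but the blockiness score is my own stand-in and is not the κ statistic of Das and Pappu used in such analyses.

```python
# Minimal sketch: simple sequence features used to characterize IDRs.
# FCR/NCPR follow their standard definitions; "charge_blockiness" is an
# illustrative proxy, not the kappa statistic used in the paper.

POSITIVE = set("KR")   # lysine, arginine (histidine often treated separately)
NEGATIVE = set("DE")   # aspartate, glutamate

def fcr(seq: str) -> float:
    """Fraction of charged residues."""
    charged = sum(aa in POSITIVE or aa in NEGATIVE for aa in seq)
    return charged / len(seq)

def ncpr(seq: str) -> float:
    """Net charge per residue."""
    net = sum((aa in POSITIVE) - (aa in NEGATIVE) for aa in seq)
    return net / len(seq)

def charge_blockiness(seq: str, window: int = 5) -> float:
    """Illustrative patterning proxy: variance of NCPR over sliding windows.
    Higher values mean charges cluster into blocks (cf. kappa)."""
    vals = [ncpr(seq[i:i + window]) for i in range(len(seq) - window + 1)]
    mean = sum(vals) / len(vals)
    return sum((v - mean) ** 2 for v in vals) / len(vals)

if __name__ == "__main__":
    idr_like = "SPSPERKRSESPSRKDDEEKKRRSPSP"  # made-up IDR-like sequence
    print(f"FCR={fcr(idr_like):.2f}  NCPR={ncpr(idr_like):+.2f}  "
          f"blockiness={charge_blockiness(idr_like):.4f}")
```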
Sebastian Ibarraran retweeted
Jason Liu @JasonLiu1044858
Intrinsically disordered protein regions have remained largely out of reach for computational design. We curate 37M IDR sequences from the AlphaFold Database and train IDiom, a 122M-parameter autoregressive model, as a general platform for intrinsically disordered protein design.
1 reply · 1 repost · 3 likes · 109 views
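For intuition about the curation step (low pLDDT as a disorder proxy, detailed below), here is a minimal sketch of extracting putative IDR spans from per-residue pLDDT scores, assuming the scores are already parsed from an AlphaFold DB entry. The threshold of 50 and the 30-residue minimum length are placeholder choices, not the paper's actual settings.

```python
# Minimal sketch of pLDDT-based IDR extraction. The threshold (50) and
# minimum span length (30) are illustrative assumptions, not the paper's.
from typing import List, Tuple

def extract_idr_spans(plddt: List[float],
                      threshold: float = 50.0,
                      min_len: int = 30) -> List[Tuple[int, int]]:
    """Return [start, end) indices of contiguous low-pLDDT (putatively
    disordered) regions at least min_len residues long."""
    spans, start = [], None
    for i, score in enumerate(plddt):
        if score < threshold:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_len:
                spans.append((start, i))
            start = None
    if start is not None and len(plddt) - start >= min_len:
        spans.append((start, len(plddt)))
    return spans

# Usage: a folded core (high pLDDT) followed by a long disordered tail.
plddt = [90.0] * 100 + [35.0] * 40
print(extract_idr_spans(plddt))  # [(100, 140)]
```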
Sebastian Ibarraran retweeted
Biology+AI Daily @BiologyAIDaily
Generative design of intrinsically disordered protein regions with IDiom

1. The paper introduces IDiom, a 122M-parameter autoregressive (decoder-only) protein language model trained specifically for intrinsically disordered regions (IDRs), aiming to make rational design possible in a regime where structure-based generative methods do not apply.
2. Key technical idea: fill-in-the-middle training for proteins with explicit special tokens that separate N-terminal context, C-terminal context, and the IDR span. This enables conditional generation of an IDR that fits into a chosen structured protein context, not just unconditional sampling.
3. Training data scale: 37 million IDRs curated from AlphaFold DB v4 using low pLDDT as a disorder proxy (plus filtering and clustering at 90% identity). They augment to 74 million sequences by also creating "context-deleted" records to train unprompted generation of fully disordered proteins (IDPs).
4. Generated sequences are diverse yet IDR-like: maximum identity to the training IDR set broadly peaks around ~60% (not memorized), length distributions match natural IDRs (mostly <100 aa with a tail to ~300 aa), and amino-acid composition recapitulates known disorder biases (e.g., enriched Pro/Ser; depleted bulky hydrophobics and aromatics vs folded CATH domains).
5. Disorder is maintained by structure-prediction checks: ColabFold/AlphaFold pLDDT distributions for generated sequences closely resemble curated AFDB IDRs and experimentally validated DisProt IDRs, both for standalone IDPs and for generated IDRs evaluated within full-protein context.
6. IDiom learns "IDR grammar", not just composition: generations reproduce natural distributions of (i) fraction of charged residues (FCR), (ii) charge patterning/blockiness (κ), (iii) hydropathy patterning (SHD), and (iv) low complexity (SEG). These metrics separate generated IDRs from folded CATH domains and align them with DisProt statistics.
7. Conditioning matters: DisProt-context-prompted generations are consistently closer to DisProt IDRs than unprompted IDPs across multiple metrics (quantified via Wasserstein-1 distances), supporting in-context learning of context-appropriate IDR features.
8. Case study (NPM1): when prompted with NPM1 flanks, IDiom generates many low-identity IDRs that still reproduce the functional charge-block architecture (κ near WT; alternating NCPR blocks), suggesting it can preserve biophysically relevant patterning without copying sequence.
9. Post-training via reinforcement learning: the authors steer IDiom with GRPO (with the DAPO modification) using ProtGPS as a reward model for subcellular localization (nucleolus, chromosomes/chromatin, P-bodies, stress granules). Regularization includes a KL penalty to the base model, a target entropy (to avoid collapse), and a target length.
10. RL-induced features are biologically interpretable while staying disordered: nucleolus-targeting sequences become Lys/Arg-rich and show higher κ; chromosome-targeting sequences become Ser/Thr-rich and show strong enrichment of ELM PTM motifs; P-body- and stress-granule-targeting sequences enrich RNA-interaction motifs (RG/RGG, F/YGG, SYG). Importantly, generated sequences remain low-pLDDT, indicating the policy does not drift toward folded-domain priors.

💻 Code: github.com/rotskoff-group…
📜 Paper: biorxiv.org/content/10.648…

#ComputationalBiology #ProteinDesign #IntrinsicallyDisorderedProteins #ProteinLanguageModels #Transformers #ReinforcementLearning #PhaseSeparation #SubcellularLocalization #SyntheticBiology
1 reply · 4 reposts · 11 likes · 1.5K views
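Point 2's fill-in-the-middle scheme can be made concrete with a small sketch. The sentinel names below are hypothetical placeholders invented for illustration; the paper's actual special tokens were lost in this page's rendering and are not reproduced here.

```python
# Minimal sketch of fill-in-the-middle (FIM) prompt construction for
# conditional IDR generation. <NTERM>, <CTERM>, and <IDR> are hypothetical
# sentinel names standing in for the paper's special tokens.

NTERM_TOK, CTERM_TOK, IDR_TOK = "<NTERM>", "<CTERM>", "<IDR>"

def fim_prompt(n_context: str, c_context: str) -> str:
    """Condition on both structured flanks; the model then autoregressively
    fills in the disordered span after the final sentinel."""
    return f"{NTERM_TOK}{n_context}{CTERM_TOK}{c_context}{IDR_TOK}"

def idp_prompt() -> str:
    """A "context-deleted" record (point 3): no flanks, so sampling yields
    a fully disordered protein rather than an IDR in context."""
    return IDR_TOK

# Usage with made-up flanking sequences:
print(fim_prompt("MKVLAQGDE", "KLVWSTNQ"))
# -> <NTERM>MKVLAQGDE<CTERM>KLVWSTNQ<IDR>
```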
Sebastian Ibarraran retweeted
Grant Rotskoff @grantrotskoff
Machine learning has improved transferable coarse-grained models, but the best architectures are nearly impossible to train due to noisy objectives. A simple strategy called mean force matching dramatically reduces data costs (↓87%) and enables scaling learned representations.
4 replies · 22 reposts · 115 likes · 12.6K views
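A toy sketch of the variance-reduction idea behind averaging forces before fitting, under my own simplifying assumptions (a linear coarse-grained force field and synthetic data): instantaneous atomistic forces are noisy regression targets, and pre-averaging several force samples per CG configuration shrinks the noise floor of the objective. This illustrates the effect the tweet describes, not the paper's actual method or data.

```python
# Toy numpy sketch contrasting instantaneous force matching with
# mean force matching. Assumes several atomistic force samples per CG
# configuration (e.g., from restrained sampling); averaging them before
# regression reduces the variance of the matching objective.
import numpy as np

rng = np.random.default_rng(0)

def cg_force_model(theta: float, R: np.ndarray) -> np.ndarray:
    """Toy linear CG force field: F(R) = -theta * R (harmonic well)."""
    return -theta * R

# Synthetic data: true mean force plus per-sample atomistic noise.
theta_true = 2.0
R = rng.normal(size=(200, 1))                     # CG configurations
noise = rng.normal(scale=5.0, size=(200, 10, 1))  # 10 force samples each
F_samples = cg_force_model(theta_true, R)[:, None, :] + noise

def fm_loss(theta: float, F_target: np.ndarray) -> float:
    """Mean-squared force-matching objective."""
    return float(np.mean((cg_force_model(theta, R) - F_target) ** 2))

# Instantaneous force matching: one noisy sample per configuration.
inst = fm_loss(theta_true, F_samples[:, 0, :])
# Mean force matching: average the samples first, then regress.
mean = fm_loss(theta_true, F_samples.mean(axis=1))
print(f"objective at true parameters: instantaneous={inst:.2f}, mean={mean:.2f}")
```

Even evaluated at the true parameters, the instantaneous objective stays near the per-sample noise variance, while the pre-averaged objective is roughly ten times smaller here, which is why noisy objectives make the best architectures so hard to train.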
Sebastian Ibarraran retweeted
Biology+AI Daily @BiologyAIDaily
Efficient, Few-shot Directed Evolution with Energy Rank Alignment

1. A new method called Energy Rank Alignment (ERA) enables highly efficient protein engineering by adapting large pre-trained protein language models using minimal experimental data.
2. Unlike previous approaches that rely on simple models due to sparse-data constraints, ERA leverages the strong inductive biases of ESM3-1.4B, a 1.4-billion-parameter protein language model, to navigate complex fitness landscapes.
3. The key innovation is using quantitative experimental rankings rather than just binary preferences, allowing the model to preserve relative fitness magnitudes while learning from small batches of just 96 sequences per round.
4. ERA outperforms existing methods including MLDE, ALDE, EVOLVEpro, and Direct Preference Optimization across five diverse combinatorial fitness landscapes involving antibiotic resistance, antibody binding, and enzymatic activity.
5. The method achieves state-of-the-art performance with only 384 total samples across four rounds, successfully finding global optima even in landscapes with strong epistatic effects and rugged topography.
6. Surprisingly, adding structural conditioning or thermostability pre-training did not improve performance, suggesting that pure sequence-based adaptation is sufficient for effective directed evolution.
7. The adapted models maintain sequence diversity while shifting probability mass toward high-fitness regions by several orders of magnitude, making them interpretable and useful for understanding biophysical requirements.
8. This work establishes a compelling interface between foundation models and experimental design, demonstrating how post-training algorithms from statistical physics can solve real biological optimization problems.

💻 Code: github.com/rotskoff-group…
📜 Paper: biorxiv.org/content/10.648…

#ProteinEngineering #DirectedEvolution #MachineLearning #Bioinformatics #ProteinDesign #ESM3 #ComputationalBiology #FewShotLearning
2 replies · 12 reposts · 52 likes · 2.9K views
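To make point 3 concrete, here is a schematic PyTorch loss in the spirit of rank alignment: the policy's log-probability gap between two assayed sequences is pushed toward a scaled gap in measured fitness, so relative magnitudes shape the update, not just which sequence won. This is an illustration under my own assumptions (the squared-error form and the β scale are mine), not the exact ERA objective from the paper.

```python
# Schematic rank-alignment-style objective: regress the model's log-ratio
# onto a Boltzmann-style target set by the fitness (negative energy) gap.
# Illustrative only; beta and the squared-error form are assumptions.
import torch

def rank_alignment_loss(logp_x: torch.Tensor,
                        logp_y: torch.Tensor,
                        fitness_x: torch.Tensor,
                        fitness_y: torch.Tensor,
                        beta: float = 1.0) -> torch.Tensor:
    """Penalize mismatch between the policy's log-probability gap and the
    scaled quantitative fitness gap for each sequence pair."""
    model_gap = logp_x - logp_y                  # log pi(x) - log pi(y)
    target_gap = beta * (fitness_x - fitness_y)  # quantitative ranking signal
    return torch.mean((model_gap - target_gap) ** 2)

# Usage with dummy per-sequence log-likelihoods and assay fitness values:
logp_x = torch.tensor([-120.3, -98.7])
logp_y = torch.tensor([-119.8, -101.2])
fit_x = torch.tensor([1.4, 0.9])
fit_y = torch.tensor([0.6, 1.1])
print(rank_alignment_loss(logp_x, logp_y, fit_x, fit_y).item())
```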
Sebastian Ibarraran @s_ibarraran
Across various fitness landscapes, ERA:
• Achieves state-of-the-art or competitive performance with small amounts of experimental data
• Outperforms active-learning and regression-based ML-assisted DE baselines
• Avoids mode collapse and remains interpretable at the residue level
1 reply · 1 repost · 4 likes · 149 views
Sebastian Ibarraran @s_ibarraran
Excited to share our preprint on using energy rank alignment (ERA), our physics-inspired preference optimization algorithm, to guide protein language models for directed evolution! Check out our project page with links to code, data, and the paper: rotskoff-group.github.io/era-directed-e…
1 reply · 3 reposts · 15 likes · 2K views
Sebastian Ibarraran retweeted
Frank Hu @FrankWho1050502
Excited to share that our most recent preprint on automating NMR structure elucidation is now available on arXiv! Check it out at arxiv.org/abs/2512.18531!
1 reply · 5 reposts · 12 likes · 1.4K views
Sebastian Ibarraran retweeted
Grant Rotskoff @grantrotskoff
I'm hiring a postdoc with a flexible start, sooner is better! Come work with us at the interface of machine learning, biophysics, and nonequilibrium physics. Interested? Send me a CV and a short summary of why you think you'd be a good fit. statmech.stanford.edu
3 replies · 33 reposts · 108 likes · 13.9K views
Sebastian Ibarraran retweeted
Frank Hu @FrankWho1050502
Had a great time presenting our work on chemical language model alignment at NeurIPS 2025! It was a wonderful week of science and fascinating conversations, and if you'd like to learn more about what we're up to, please reach out!
0 replies · 4 reposts · 4 likes · 558 views
Sebastian Ibarraran @s_ibarraran
Had an awesome time presenting our chemical and protein language model alignment work at NeurIPS! Really appreciated all the insightful feedback and new connections, and I’m looking forward to continuing our efforts in this and other directions for molecular design
0 replies · 2 reposts · 8 likes · 1.3K views
Sebastian Ibarraran retweeted
Shriram @shriramc1
I'll be at NeurIPS this week! Our team at Prescient Design/Genentech is working on post-training LLMs for small-molecule drug discovery applications. I'll also be opening an internship for next summer related to this effort. If you're interested in learning more, I'd love to chat
1 reply · 3 reposts · 11 likes · 1.7K views
Sebastian Ibarraran @s_ibarraran
While reinforcement learning has been shown to improve LLM performance on mathematical reasoning tasks, there is currently far less evidence of performant scientific reasoning models. Using Tinker by @thinkymachines, we were able to rapidly train a variety of models on -
5 replies · 19 reposts · 196 likes · 35K views
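RL fine-tuning runs like the ones described here commonly build on GRPO, the algorithm named explicitly upthread for steering IDiom. Its core trick is a group-relative advantage that needs no learned critic: sample several completions per prompt, score each with the reward model, and standardize rewards within the group. A minimal sketch, with the regularizers mentioned upthread (KL to the base model, entropy and length targets) left out:

```python
# Minimal sketch of the group-relative advantage at the heart of GRPO.
# Rewards for samples drawn from the same prompt are standardized within
# the group; KL/entropy/length regularizers would be added to the loss.
import torch

def group_relative_advantages(rewards: torch.Tensor,
                              eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, group_size) reward-model scores.
    Returns per-sample advantages standardized within each group."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Usage: 2 prompts, 4 sampled completions each, scored by a reward model.
rewards = torch.tensor([[0.1, 0.7, 0.4, 0.9],
                        [0.2, 0.2, 0.8, 0.5]])
print(group_relative_advantages(rewards))
```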