Sebastian Ibarraran

29 posts

@s_ibarraran

Theoretical Chemistry PhD Candidate, Rotskoff Lab @ Stanford | Yale ’23

Palo Alto, CA · Joined March 2024
41 Following · 108 Followers
Sebastian Ibarraran retweeted
Grant Rotskoff @grantrotskoff
Protein design has been dominated by diffusion models due to a "structure-first" perspective. What about intrinsically disordered proteins? We scale language-based design using the modern RL stack and our model IDiom. Paper: biorxiv.org/content/10.648… Try it: idiom-designer.vercel.app
3 replies · 23 reposts · 142 likes · 7.4K views
Sebastian Ibarraran retweeted
Frank Hu @FrankWho1050502
Excited to be in Rio for ICLR with @s_ibarraran where we are presenting our work on fine-tuning protein language models using energy rank alignment for efficient directed evolution! Stop by our poster at the GEM workshop on Monday, and feel free to reach out if interested!
0 replies · 1 repost · 2 likes · 45 views
Sebastian Ibarraran retweeted
Jason Liu @JasonLiu1044858
We show that IDiom generates IDRs that recapitulate the sequence features of natural disordered regions, and we additionally use reinforcement learning to steer generation towards sequences with compartment-specific localization features.
1 reply · 1 repost · 2 likes · 34 views
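The "sequence features" referenced here are quantified later in this thread as composition and charge-patterning statistics (FCR, κ, and related descriptors). As a rough illustration only, here is a minimal Python sketch of two standard descriptors plus a simple charge-blockiness proxy; FCR and NCPR follow their usual definitions, but the blockiness score is my own stand-in and is not the κ statistic of Das and Pappu used in such analyses.

```python
# Minimal sketch: simple sequence features used to characterize IDRs.
# FCR/NCPR follow their standard definitions; "charge_blockiness" is an
# illustrative proxy, not the kappa statistic used in the paper.

POSITIVE = set("KR")   # lysine, arginine (histidine often treated separately)
NEGATIVE = set("DE")   # aspartate, glutamate

def fcr(seq: str) -> float:
    """Fraction of charged residues."""
    charged = sum(aa in POSITIVE or aa in NEGATIVE for aa in seq)
    return charged / len(seq)

def ncpr(seq: str) -> float:
    """Net charge per residue."""
    net = sum((aa in POSITIVE) - (aa in NEGATIVE) for aa in seq)
    return net / len(seq)

def charge_blockiness(seq: str, window: int = 5) -> float:
    """Illustrative patterning proxy: variance of NCPR over sliding windows.
    Higher values mean charges cluster into blocks (cf. kappa)."""
    vals = [ncpr(seq[i:i + window]) for i in range(len(seq) - window + 1)]
    mean = sum(vals) / len(vals)
    return sum((v - mean) ** 2 for v in vals) / len(vals)

if __name__ == "__main__":
    idr_like = "SPSPERKRSESPSRKDDEEKKRRSPSP"  # made-up IDR-like sequence
    print(f"FCR={fcr(idr_like):.2f}  NCPR={ncpr(idr_like):+.2f}  "
          f"blockiness={charge_blockiness(idr_like):.4f}")
```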
Sebastian Ibarraran retweeted
Jason Liu @JasonLiu1044858
Intrinsically disordered protein regions have remained largely out of reach for computational design. We curate 37M IDR sequences from the AlphaFold Database and train IDiom, a 122M-parameter autoregressive model, as a general platform for intrinsically disordered protein design.
1 reply · 1 repost · 3 likes · 109 views
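For intuition about the curation step (low pLDDT as a disorder proxy, detailed below), here is a minimal sketch of extracting putative IDR spans from per-residue pLDDT scores, assuming the scores are already parsed from an AlphaFold DB entry. The threshold of 50 and the 30-residue minimum length are placeholder choices, not the paper's actual settings.

```python
# Minimal sketch of pLDDT-based IDR extraction. The threshold (50) and
# minimum span length (30) are illustrative assumptions, not the paper's.
from typing import List, Tuple

def extract_idr_spans(plddt: List[float],
                      threshold: float = 50.0,
                      min_len: int = 30) -> List[Tuple[int, int]]:
    """Return [start, end) indices of contiguous low-pLDDT (putatively
    disordered) regions at least min_len residues long."""
    spans, start = [], None
    for i, score in enumerate(plddt):
        if score < threshold:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_len:
                spans.append((start, i))
            start = None
    if start is not None and len(plddt) - start >= min_len:
        spans.append((start, len(plddt)))
    return spans

# Usage: a folded core (high pLDDT) followed by a long disordered tail.
plddt = [90.0] * 100 + [35.0] * 40
print(extract_idr_spans(plddt))  # [(100, 140)]
```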
Sebastian Ibarraran retweeted
Biology+AI Daily @BiologyAIDaily
Generative design of intrinsically disordered protein regions with IDiom

1. The paper introduces IDiom, a 122M-parameter autoregressive (decoder-only) protein language model trained specifically for intrinsically disordered regions (IDRs), aiming to make rational design possible in a regime where structure-based generative methods do not apply.
2. Key technical idea: fill-in-the-middle training for proteins with explicit special tokens that separate N-terminal context, C-terminal context, and the IDR span. This enables conditional generation of an IDR that fits into a chosen structured protein context, not just unconditional sampling.
3. Training data scale: 37 million IDRs curated from AlphaFold DB v4 using low pLDDT as a disorder proxy (plus filtering and clustering at 90% identity). They augment to 74 million sequences by also creating "context-deleted" records to train unprompted generation of fully disordered proteins (IDPs).
4. Generated sequences are diverse yet IDR-like: maximum identity to the training IDR set broadly peaks around ~60% (not memorized), length distributions match natural IDRs (mostly <100 aa with a tail to ~300 aa), and amino-acid composition recapitulates known disorder biases (e.g., enriched Pro/Ser; depleted bulky hydrophobics and aromatics vs folded CATH domains).
5. Disorder is maintained by structure-prediction checks: ColabFold/AlphaFold pLDDT distributions for generated sequences closely resemble curated AFDB IDRs and experimentally validated DisProt IDRs, both for standalone IDPs and for generated IDRs evaluated within full-protein context.
6. IDiom learns "IDR grammar", not just composition: generations reproduce natural distributions of (i) fraction of charged residues (FCR), (ii) charge patterning/blockiness (κ), (iii) hydropathy patterning (SHD), and (iv) low complexity (SEG). These metrics separate generated IDRs from folded CATH domains and align them with DisProt statistics.
7. Conditioning matters: DisProt-context-prompted generations are consistently closer to DisProt IDRs than unprompted IDPs across multiple metrics (quantified via Wasserstein-1 distances), supporting in-context learning of context-appropriate IDR features.
8. Case study (NPM1): when prompted with NPM1 flanks, IDiom generates many low-identity IDRs that still reproduce the functional charge-block architecture (κ near WT; alternating NCPR blocks), suggesting it can preserve biophysically relevant patterning without copying sequence.
9. Post-training via reinforcement learning: the authors steer IDiom with GRPO (with the DAPO modification) using ProtGPS as a reward model for subcellular localization (nucleolus, chromosomes/chromatin, P-bodies, stress granules). Regularization includes a KL penalty to the base model, a target entropy (to avoid collapse), and a target length.
10. RL-induced features are biologically interpretable while staying disordered: nucleolus-targeting sequences become Lys/Arg-rich and show higher κ; chromosome-targeting sequences become Ser/Thr-rich and show strong enrichment of ELM PTM motifs; P-body- and stress-granule-targeting sequences enrich RNA-interaction motifs (RG/RGG, F/YGG, SYG). Importantly, generated sequences remain low-pLDDT, indicating the policy does not drift toward folded-domain priors.

💻 Code: github.com/rotskoff-group…
📜 Paper: biorxiv.org/content/10.648…

#ComputationalBiology #ProteinDesign #IntrinsicallyDisorderedProteins #ProteinLanguageModels #Transformers #ReinforcementLearning #PhaseSeparation #SubcellularLocalization #SyntheticBiology
1 reply · 4 reposts · 11 likes · 1.5K views
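Point 2's fill-in-the-middle scheme can be made concrete with a small sketch. The sentinel names below are hypothetical placeholders invented for illustration; the paper's actual special tokens were lost in this page's rendering and are not reproduced here.

```python
# Minimal sketch of fill-in-the-middle (FIM) prompt construction for
# conditional IDR generation. <NTERM>, <CTERM>, and <IDR> are hypothetical
# sentinel names standing in for the paper's special tokens.

NTERM_TOK, CTERM_TOK, IDR_TOK = "<NTERM>", "<CTERM>", "<IDR>"

def fim_prompt(n_context: str, c_context: str) -> str:
    """Condition on both structured flanks; the model then autoregressively
    fills in the disordered span after the final sentinel."""
    return f"{NTERM_TOK}{n_context}{CTERM_TOK}{c_context}{IDR_TOK}"

def idp_prompt() -> str:
    """A "context-deleted" record (point 3): no flanks, so sampling yields
    a fully disordered protein rather than an IDR in context."""
    return IDR_TOK

# Usage with made-up flanking sequences:
print(fim_prompt("MKVLAQGDE", "KLVWSTNQ"))
# -> <NTERM>MKVLAQGDE<CTERM>KLVWSTNQ<IDR>
```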
Sebastian Ibarraran retweeted
Grant Rotskoff @grantrotskoff
Machine learning has improved transferable coarse-grained models, but the best architectures are nearly impossible to train due to noisy objectives. A simple strategy called mean force matching dramatically reduces data costs (↓87%) and enables scaling learned representations.
4 replies · 22 reposts · 115 likes · 12.6K views
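A toy sketch of the variance-reduction idea behind averaging forces before fitting, under my own simplifying assumptions (a linear coarse-grained force field and synthetic data): instantaneous atomistic forces are noisy regression targets, and pre-averaging several force samples per CG configuration shrinks the noise floor of the objective. This illustrates the effect the tweet describes, not the paper's actual method or data.

```python
# Toy numpy sketch contrasting instantaneous force matching with
# mean force matching. Assumes several atomistic force samples per CG
# configuration (e.g., from restrained sampling); averaging them before
# regression reduces the variance of the matching objective.
import numpy as np

rng = np.random.default_rng(0)

def cg_force_model(theta: float, R: np.ndarray) -> np.ndarray:
    """Toy linear CG force field: F(R) = -theta * R (harmonic well)."""
    return -theta * R

# Synthetic data: true mean force plus per-sample atomistic noise.
theta_true = 2.0
R = rng.normal(size=(200, 1))                     # CG configurations
noise = rng.normal(scale=5.0, size=(200, 10, 1))  # 10 force samples each
F_samples = cg_force_model(theta_true, R)[:, None, :] + noise

def fm_loss(theta: float, F_target: np.ndarray) -> float:
    """Mean-squared force-matching objective."""
    return float(np.mean((cg_force_model(theta, R) - F_target) ** 2))

# Instantaneous force matching: one noisy sample per configuration.
inst = fm_loss(theta_true, F_samples[:, 0, :])
# Mean force matching: average the samples first, then regress.
mean = fm_loss(theta_true, F_samples.mean(axis=1))
print(f"objective at true parameters: instantaneous={inst:.2f}, mean={mean:.2f}")
```

Even evaluated at the true parameters, the instantaneous objective stays near the per-sample noise variance, while the pre-averaged objective is roughly ten times smaller here, which is why noisy objectives make the best architectures so hard to train.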
Sebastian Ibarraran retweeted
Biology+AI Daily @BiologyAIDaily
Efficient, Few-shot Directed Evolution with Energy Rank Alignment

1. A new method called Energy Rank Alignment (ERA) enables highly efficient protein engineering by adapting large pre-trained protein language models using minimal experimental data.
2. Unlike previous approaches that rely on simple models due to sparse-data constraints, ERA leverages the strong inductive biases of ESM3-1.4B, a 1.4-billion-parameter protein language model, to navigate complex fitness landscapes.
3. The key innovation is using quantitative experimental rankings rather than just binary preferences, allowing the model to preserve relative fitness magnitudes while learning from small batches of just 96 sequences per round.
4. ERA outperforms existing methods including MLDE, ALDE, EVOLVEpro, and Direct Preference Optimization across five diverse combinatorial fitness landscapes involving antibiotic resistance, antibody binding, and enzymatic activity.
5. The method achieves state-of-the-art performance with only 384 total samples across four rounds, successfully finding global optima even in landscapes with strong epistatic effects and rugged topography.
6. Surprisingly, adding structural conditioning or thermostability pre-training did not improve performance, suggesting that pure sequence-based adaptation is sufficient for effective directed evolution.
7. The adapted models maintain sequence diversity while shifting probability mass toward high-fitness regions by several orders of magnitude, making them interpretable and useful for understanding biophysical requirements.
8. This work establishes a compelling interface between foundation models and experimental design, demonstrating how post-training algorithms from statistical physics can solve real biological optimization problems.

💻 Code: github.com/rotskoff-group…
📜 Paper: biorxiv.org/content/10.648…

#ProteinEngineering #DirectedEvolution #MachineLearning #Bioinformatics #ProteinDesign #ESM3 #ComputationalBiology #FewShotLearning
2 replies · 12 reposts · 52 likes · 2.9K views
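To make point 3 concrete, here is a schematic PyTorch loss in the spirit of rank alignment: the policy's log-probability gap between two assayed sequences is pushed toward a scaled gap in measured fitness, so relative magnitudes shape the update, not just which sequence won. This is an illustration under my own assumptions (the squared-error form and the β scale are mine), not the exact ERA objective from the paper.

```python
# Schematic rank-alignment-style objective: regress the model's log-ratio
# onto a Boltzmann-style target set by the fitness (negative energy) gap.
# Illustrative only; beta and the squared-error form are assumptions.
import torch

def rank_alignment_loss(logp_x: torch.Tensor,
                        logp_y: torch.Tensor,
                        fitness_x: torch.Tensor,
                        fitness_y: torch.Tensor,
                        beta: float = 1.0) -> torch.Tensor:
    """Penalize mismatch between the policy's log-probability gap and the
    scaled quantitative fitness gap for each sequence pair."""
    model_gap = logp_x - logp_y                  # log pi(x) - log pi(y)
    target_gap = beta * (fitness_x - fitness_y)  # quantitative ranking signal
    return torch.mean((model_gap - target_gap) ** 2)

# Usage with dummy per-sequence log-likelihoods and assay fitness values:
logp_x = torch.tensor([-120.3, -98.7])
logp_y = torch.tensor([-119.8, -101.2])
fit_x = torch.tensor([1.4, 0.9])
fit_y = torch.tensor([0.6, 1.1])
print(rank_alignment_loss(logp_x, logp_y, fit_x, fit_y).item())
```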
Sebastian Ibarraran @s_ibarraran
Across various fitness landscapes, ERA:
• Achieves state-of-the-art or competitive performance with small amounts of experimental data
• Outperforms active-learning and regression-based ML-assisted DE baselines
• Avoids mode collapse and remains interpretable at the residue level
1 reply · 1 repost · 4 likes · 149 views
Sebastian Ibarraran @s_ibarraran
Excited to share our preprint on using energy rank alignment (ERA), our physics-inspired preference optimization algorithm, to guide protein language models for directed evolution! Check out our project page with links to code, data, and the paper: rotskoff-group.github.io/era-directed-e…
1 reply · 3 reposts · 15 likes · 2K views
Sebastian Ibarraran retweeted
Frank Hu @FrankWho1050502
Excited to share that our most recent preprint on automating NMR structure elucidation is now available on arXiv! Check it out at arxiv.org/abs/2512.18531!
1 reply · 5 reposts · 12 likes · 1.4K views
Sebastian Ibarraran retweeted
Grant Rotskoff @grantrotskoff
I'm hiring a postdoc with a flexible start, sooner is better! Come work with us at the interface of machine learning, biophysics, and nonequilibrium physics. Interested? Send me a CV and a short summary of why you think you'd be a good fit. statmech.stanford.edu
3 replies · 33 reposts · 108 likes · 13.9K views
Sebastian Ibarraran retweeted
Frank Hu @FrankWho1050502
Had a great time presenting our work on chemical language model alignment at NeurIPS 2025! It was a wonderful week of science and fascinating conversations, and if you'd like to learn more about what we're up to, please reach out!
0 replies · 4 reposts · 4 likes · 558 views
Sebastian Ibarraran @s_ibarraran
Had an awesome time presenting our chemical and protein language model alignment work at NeurIPS! Really appreciated all the insightful feedback and new connections, and I’m looking forward to continuing our efforts in this and other directions for molecular design
0 replies · 2 reposts · 8 likes · 1.3K views
Sebastian Ibarraran retweeted
Shriram @shriramc1
I'll be at NeurIPS this week! Our team at Prescient Design/Genentech is working on post-training LLMs for small-molecule drug discovery applications. I'll also be opening an internship for next summer related to this effort. If you're interested in learning more, I'd love to chat
1 reply · 3 reposts · 11 likes · 1.7K views
Sebastian Ibarraran @s_ibarraran
While reinforcement learning has been shown to improve LLM performance on mathematical reasoning tasks, there is currently far less evidence of performant scientific reasoning models. Using Tinker by @thinkymachines, we were able to rapidly train a variety of models on -
5 replies · 19 reposts · 196 likes · 35K views
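RL fine-tuning runs like the ones described here commonly build on GRPO, the algorithm named explicitly upthread for steering IDiom. Its core trick is a group-relative advantage that needs no learned critic: sample several completions per prompt, score each with the reward model, and standardize rewards within the group. A minimal sketch, with the regularizers mentioned upthread (KL to the base model, entropy and length targets) left out:

```python
# Minimal sketch of the group-relative advantage at the heart of GRPO.
# Rewards for samples drawn from the same prompt are standardized within
# the group; KL/entropy/length regularizers would be added to the loss.
import torch

def group_relative_advantages(rewards: torch.Tensor,
                              eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, group_size) reward-model scores.
    Returns per-sample advantages standardized within each group."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Usage: 2 prompts, 4 sampled completions each, scored by a reward model.
rewards = torch.tensor([[0.1, 0.7, 0.4, 0.9],
                        [0.2, 0.2, 0.8, 0.5]])
print(group_relative_advantages(rewards))
```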