Jason Liu
@JasonLiu1044858

19 posts
Joined August 2023
3 Following · 9 Followers
Jason Liu reposted
Grant Rotskoff @grantrotskoff
Protein design has been dominated by diffusion models, reflecting a "structure-first" perspective. What about intrinsically disordered proteins? We scale language-based design using the modern RL stack and our model IDiom. Paper: biorxiv.org/content/10.648… Try it: idiom-designer.vercel.app
Jason Liu @JasonLiu1044858
We show that IDiom generates IDRs that recapitulate the sequence features of natural disordered regions, and we additionally use reinforcement learning to steer generation toward sequences with compartment-specific localization features.
Jason Liu @JasonLiu1044858
Intrinsically disordered protein regions have remained largely out of reach for computational design. We curate 37M IDR sequences from the AlphaFold Database and train IDiom, a 122M-parameter autoregressive model, as a general platform for intrinsically disordered protein design.
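The summary reposted below explains that this curation uses low pLDDT as a disorder proxy. As a rough illustration of that filtering step, here is a minimal Python sketch; the threshold and minimum span length are illustrative assumptions, not the paper's exact settings.

```python
# Minimal sketch of low-pLDDT IDR curation: extract contiguous low-confidence
# spans from an AlphaFold-style per-residue pLDDT track as putative IDRs.
# The threshold (50) and minimum span length (30) are illustrative assumptions.
from typing import List, Tuple

def extract_idr_spans(plddt: List[float],
                      threshold: float = 50.0,
                      min_len: int = 30) -> List[Tuple[int, int]]:
    """Return (start, end) indices of runs where pLDDT stays below threshold."""
    spans, start = [], None
    for i, score in enumerate(plddt):
        if score < threshold:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_len:
                spans.append((start, i))
            start = None
    if start is not None and len(plddt) - start >= min_len:
        spans.append((start, len(plddt)))
    return spans

# Example: a 60-residue low-confidence stretch flanked by confident structure.
plddt = [90.0] * 40 + [35.0] * 60 + [88.0] * 40
print(extract_idr_spans(plddt))  # [(40, 100)]
```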
Jason Liu reposted
Biology+AI Daily @BiologyAIDaily
Generative design of intrinsically disordered protein regions with IDiom

1. The paper introduces IDiom, a 122M-parameter autoregressive (decoder-only) protein language model trained specifically for intrinsically disordered regions (IDRs), aiming to make rational design possible in a regime where structure-based generative methods do not apply.

2. Key technical idea: fill-in-the-middle training for proteins, with explicit special tokens separating the N-terminal context, the C-terminal context, and the IDR span. This enables conditional generation of an IDR that fits into a chosen structured protein context, not just unconditional sampling (see the prompt-assembly sketch after this post).

3. Training data scale: 37 million IDRs curated from AlphaFold DB v4 using low pLDDT as a disorder proxy (plus filtering and clustering at 90% identity). They augment to 74 million sequences by also creating "context-deleted" records to train unprompted generation of fully disordered proteins (IDPs).

4. Generated sequences are diverse yet IDR-like: maximum identity to the training IDR set broadly peaks around ~60% (not memorized), length distributions match natural IDRs (mostly <100 aa with a tail to ~300 aa), and amino-acid composition recapitulates known disorder biases (e.g., enriched Pro/Ser; depleted bulky hydrophobics and aromatics vs. folded CATH domains).

5. Disorder is maintained by structure-prediction checks: ColabFold/AlphaFold pLDDT distributions for generated sequences closely resemble curated AFDB IDRs and experimentally validated DisProt IDRs, both for standalone IDPs and for generated IDRs evaluated within full-protein context.

6. IDiom learns "IDR grammar", not just composition: generations reproduce natural distributions of (i) fraction of charged residues (FCR), (ii) charge patterning/blockiness (κ), (iii) hydropathy patterning (SHD), and (iv) low complexity (SEG). These metrics separate generated IDRs from folded CATH domains and align them with DisProt statistics (see the metric sketch after this post).

7. Conditioning matters: DisProt-context-prompted generations are consistently closer to DisProt IDRs than unprompted IDPs across multiple metrics (quantified via Wasserstein-1 distances), supporting in-context learning of context-appropriate IDR features.

8. Case study (NPM1): when prompted with NPM1 flanks, IDiom generates many low-identity IDRs that still reproduce the functional charge-block architecture (κ near WT; alternating NCPR blocks), suggesting it can preserve biophysically relevant patterning without copying sequence.

9. Post-training via reinforcement learning: the authors steer IDiom with GRPO (with the DAPO modification), using ProtGPS as a reward model for subcellular localization (nucleolus, chromosomes/chromatin, P-bodies, stress granules). Regularization includes a KL penalty to the base model, a target entropy (to avoid collapse), and a target length (see the GRPO sketch after this post).

10. RL-induced features are biologically interpretable while staying disordered: nucleolus-targeting sequences become Lys/Arg-rich and show higher κ; chromosome-targeting sequences become Ser/Thr-rich and show strong enrichment of ELM PTM motifs; P-body- and stress-granule-targeting sequences enrich RNA-interaction motifs (RG/RGG, F/YGG, SYG). Importantly, generated sequences remain low-pLDDT, indicating the policy does not drift toward folded-domain priors.

💻 Code: github.com/rotskoff-group…
📜 Paper: biorxiv.org/content/10.648…

#ComputationalBiology #ProteinDesign #IntrinsicallyDisorderedProteins #ProteinLanguageModels #Transformers #ReinforcementLearning #PhaseSeparation #SubcellularLocalization #SyntheticBiology
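Point 2 above describes fill-in-the-middle conditioning, but the page scrape dropped the sentinel token names. The sketch below illustrates how such a prompt might be assembled; the token names <NTERM>, <CTERM>, <IDR> are hypothetical stand-ins, and the commented generation call is generic Hugging Face-style usage, not IDiom's published interface.

```python
# Illustrative fill-in-the-middle prompt assembly for conditional IDR design.
# The sentinel tokens <NTERM>, <CTERM>, <IDR> are hypothetical stand-ins (the
# original token names did not survive the page scrape).

def build_fim_prompt(n_context: str, c_context: str) -> str:
    # Both structured flanks are given up front; the model then generates the
    # disordered span autoregressively after the <IDR> sentinel.
    return f"<NTERM>{n_context}<CTERM>{c_context}<IDR>"

prompt = build_fim_prompt(n_context="MKVLAAGE",   # toy N-terminal flank
                          c_context="GDEKLRRQ")   # toy C-terminal flank
print(prompt)  # <NTERM>MKVLAAGE<CTERM>GDEKLRRQ<IDR>

# Generic Hugging Face-style sampling call, not IDiom's API:
# generated = model.generate(**tokenizer(prompt, return_tensors="pt"),
#                            do_sample=True, temperature=1.0)
```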
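To make points 6 and 8 concrete, here is a small self-contained sketch of two of the charge quantities named above: FCR and sliding-window NCPR (the quantity behind the "alternating NCPR blocks" observation). κ and SHD involve fuller patterning definitions and are omitted; the toy sequence and window size are illustrative choices.

```python
# FCR (fraction of charged residues) and per-window NCPR (net charge per
# residue). Histidine's charge is ambiguous at physiological pH; excluded here.
POSITIVE, NEGATIVE = set("KR"), set("DE")

def fcr(seq: str) -> float:
    charged = sum(aa in POSITIVE or aa in NEGATIVE for aa in seq)
    return charged / len(seq)

def ncpr_windows(seq: str, window: int = 5) -> list:
    """Sliding-window net charge per residue; sign flips reveal charge blocks."""
    out = []
    for i in range(len(seq) - window + 1):
        w = seq[i:i + window]
        net = sum((aa in POSITIVE) - (aa in NEGATIVE) for aa in w)
        out.append(net / window)
    return out

toy_idr = "KKKKKEEEEEKKKKKEEEEE"  # strongly blocky charge pattern
print(round(fcr(toy_idr), 2))           # 1.0: every residue is charged
print(ncpr_windows(toy_idr, window=5))  # sweeps between +1.0 and -1.0 across blocks
```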
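Point 9 names GRPO with a KL penalty to the base model. The snippet below sketches only the group-relative advantage computation at GRPO's core, with the reward model stubbed out (a placeholder standing in for a ProtGPS localization score); the policy-update, KL, entropy, and length regularization plumbing is omitted.

```python
# Simplified sketch of GRPO's core step: sample a group of sequences per
# prompt, score them with a reward model, and convert rewards into
# group-relative advantages (no value network needed).
import random
import statistics
from typing import Callable, List

def group_relative_advantages(rewards: List[float]) -> List[float]:
    """GRPO advantage: (r - mean(group)) / std(group)."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero spread
    return [(r - mu) / sigma for r in rewards]

def grpo_step(sample: Callable[[], str],
              reward_fn: Callable[[str], float],
              group_size: int = 8):
    group = [sample() for _ in range(group_size)]
    advantages = group_relative_advantages([reward_fn(s) for s in group])
    # In the full algorithm these advantages weight a clipped policy-gradient
    # loss, with a KL penalty to the frozen base model plus entropy and length
    # regularizers, as the summary above describes.
    return list(zip(group, advantages))

# Toy usage: reward lysine-rich sequences (stand-in for a nucleolus score).
fake_sample = lambda: "".join(random.choice("KRSGDE") for _ in range(20))
fake_reward = lambda s: s.count("K") / len(s)
for seq, adv in grpo_step(fake_sample, fake_reward):
    print(f"{seq}  adv={adv:+.2f}")
```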
Jason Liu reposted
Thinking Machines @thinkymachines
LoRA makes fine-tuning more accessible, but it's unclear how it compares to full fine-tuning. We find that the performance often matches closely, more often than you might expect. In our latest Connectionism post, we share our experimental results and recommendations for LoRA. thinkingmachines.ai/blog/lora/
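For context on what is being compared: LoRA freezes the pretrained weight matrix and learns a low-rank additive update, so only a small fraction of parameters train. A minimal generic PyTorch sketch follows; this is a standard textbook formulation, not Thinking Machines' code.

```python
# Minimal LoRA linear layer: frozen base weight W plus a trainable low-rank
# update scaled by alpha/r. Generic illustration of the method.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # freeze pretrained weights
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: update starts at 0
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(512, 512))
print(layer(torch.randn(2, 512)).shape)  # torch.Size([2, 512])
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 8192 trainable params vs 262,656 in the base layer
```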
Jason Liu reposted
Thinking Machines @thinkymachines
Introducing Tinker: a flexible API for fine-tuning language models. Write training loops in Python on your laptop; we'll run them on distributed GPUs. Private beta starts today. We can't wait to see what researchers and developers build with cutting-edge open models! thinkingmachines.ai/tinker
Jason Liu @JasonLiu1044858
of the 8B-Instruct model using 1k reasoning traces generated by the 70B model, accuracy improves to ~25% (green). The highly scalable training infrastructure of Tinker will enable further studies on the learning of scientific reasoning tasks under different
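The recipe this thread describes is distillation by supervised fine-tuning: sample reasoning traces from the larger teacher, then fine-tune the smaller student on them. A generic sketch using trl's SFTTrainer follows; the dataset path and hyperparameters are placeholders, not the authors' setup, and the model id is one plausible 8B-Instruct checkpoint.

```python
# Generic distillation-by-SFT sketch: fine-tune a small student on traces
# sampled from a large teacher. Dataset path and hyperparameters are
# placeholders; each JSONL record is assumed to hold a "text" field
# containing the prompt plus the teacher's reasoning trace.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

traces = load_dataset("json", data_files="traces_from_70b.jsonl")["train"]  # ~1k teacher traces

trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-8B-Instruct",  # student checkpoint (assumption)
    train_dataset=traces,
    args=SFTConfig(output_dir="sft-8b-distilled",
                   num_train_epochs=3,
                   per_device_train_batch_size=2),
)
trainer.train()
```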
Jason Liu @JasonLiu1044858
While reinforcement learning has been demonstrated to improve LLM performance on mathematical reasoning tasks, there is currently far less evidence of performant scientific reasoning models. Using Tinker by @thinkymachines, we were able to rapidly train a variety of models on