Jason Liu
@JasonLiu1044858

19 posts
Joined August 2023
3 Following · 9 Followers
Jason Liu reposted
Grant Rotskoff @grantrotskoff
Protein design has been dominated by diffusion models, reflecting a "structure-first" perspective. What about intrinsically disordered proteins? We scale language-based design using the modern RL stack and our model IDiom. Paper: biorxiv.org/content/10.648… Try it: idiom-designer.vercel.app
Jason Liu @JasonLiu1044858
We show that IDiom generates IDRs that recapitulate the sequence features of natural disordered regions, and we additionally use reinforcement learning to steer generation toward sequences with compartment-specific localization features.
Jason Liu @JasonLiu1044858
Intrinsically disordered protein regions have remained largely out of reach for computational design. We curate 37M IDR sequences from the AlphaFold Database and train IDiom, a 122M-parameter autoregressive model, as a general platform for intrinsically disordered protein design.
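The summary reposted below explains that this curation uses low pLDDT as a disorder proxy. As a rough illustration of that filtering step, here is a minimal Python sketch; the threshold and minimum span length are illustrative assumptions, not the paper's exact settings.

```python
# Minimal sketch of low-pLDDT IDR curation: extract contiguous low-confidence
# spans from an AlphaFold-style per-residue pLDDT track as putative IDRs.
# The threshold (50) and minimum span length (30) are illustrative assumptions.
from typing import List, Tuple

def extract_idr_spans(plddt: List[float],
                      threshold: float = 50.0,
                      min_len: int = 30) -> List[Tuple[int, int]]:
    """Return (start, end) indices of runs where pLDDT stays below threshold."""
    spans, start = [], None
    for i, score in enumerate(plddt):
        if score < threshold:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_len:
                spans.append((start, i))
            start = None
    if start is not None and len(plddt) - start >= min_len:
        spans.append((start, len(plddt)))
    return spans

# Example: a 60-residue low-confidence stretch flanked by confident structure.
plddt = [90.0] * 40 + [35.0] * 60 + [88.0] * 40
print(extract_idr_spans(plddt))  # [(40, 100)]
```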
Jason Liu reposted
Biology+AI Daily @BiologyAIDaily
Generative design of intrinsically disordered protein regions with IDiom

1. The paper introduces IDiom, a 122M-parameter autoregressive (decoder-only) protein language model trained specifically for intrinsically disordered regions (IDRs), aiming to make rational design possible in a regime where structure-based generative methods do not apply.

2. Key technical idea: fill-in-the-middle training for proteins, with explicit special tokens separating the N-terminal context, the C-terminal context, and the IDR span. This enables conditional generation of an IDR that fits into a chosen structured protein context, not just unconditional sampling (see the prompt-assembly sketch after this post).

3. Training data scale: 37 million IDRs curated from AlphaFold DB v4 using low pLDDT as a disorder proxy (plus filtering and clustering at 90% identity). They augment to 74 million sequences by also creating "context-deleted" records to train unprompted generation of fully disordered proteins (IDPs).

4. Generated sequences are diverse yet IDR-like: maximum identity to the training IDR set broadly peaks around ~60% (not memorized), length distributions match natural IDRs (mostly <100 aa with a tail to ~300 aa), and amino-acid composition recapitulates known disorder biases (e.g., enriched Pro/Ser; depleted bulky hydrophobics and aromatics vs. folded CATH domains).

5. Disorder is maintained by structure-prediction checks: ColabFold/AlphaFold pLDDT distributions for generated sequences closely resemble curated AFDB IDRs and experimentally validated DisProt IDRs, both for standalone IDPs and for generated IDRs evaluated within full-protein context.

6. IDiom learns "IDR grammar", not just composition: generations reproduce natural distributions of (i) fraction of charged residues (FCR), (ii) charge patterning/blockiness (κ), (iii) hydropathy patterning (SHD), and (iv) low complexity (SEG). These metrics separate generated IDRs from folded CATH domains and align them with DisProt statistics (see the metric sketch after this post).

7. Conditioning matters: DisProt-context-prompted generations are consistently closer to DisProt IDRs than unprompted IDPs across multiple metrics (quantified via Wasserstein-1 distances), supporting in-context learning of context-appropriate IDR features.

8. Case study (NPM1): when prompted with NPM1 flanks, IDiom generates many low-identity IDRs that still reproduce the functional charge-block architecture (κ near WT; alternating NCPR blocks), suggesting it can preserve biophysically relevant patterning without copying sequence.

9. Post-training via reinforcement learning: the authors steer IDiom with GRPO (with the DAPO modification), using ProtGPS as a reward model for subcellular localization (nucleolus, chromosomes/chromatin, P-bodies, stress granules). Regularization includes a KL penalty to the base model, a target entropy (to avoid collapse), and a target length (see the GRPO sketch after this post).

10. RL-induced features are biologically interpretable while staying disordered: nucleolus-targeting sequences become Lys/Arg-rich and show higher κ; chromosome-targeting sequences become Ser/Thr-rich and show strong enrichment of ELM PTM motifs; P-body- and stress-granule-targeting sequences enrich RNA-interaction motifs (RG/RGG, F/YGG, SYG). Importantly, generated sequences remain low-pLDDT, indicating the policy does not drift toward folded-domain priors.

💻 Code: github.com/rotskoff-group…
📜 Paper: biorxiv.org/content/10.648…

#ComputationalBiology #ProteinDesign #IntrinsicallyDisorderedProteins #ProteinLanguageModels #Transformers #ReinforcementLearning #PhaseSeparation #SubcellularLocalization #SyntheticBiology
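Point 2 above describes fill-in-the-middle conditioning, but the page scrape dropped the sentinel token names. The sketch below illustrates how such a prompt might be assembled; the token names <NTERM>, <CTERM>, <IDR> are hypothetical stand-ins, and the commented generation call is generic Hugging Face-style usage, not IDiom's published interface.

```python
# Illustrative fill-in-the-middle prompt assembly for conditional IDR design.
# The sentinel tokens <NTERM>, <CTERM>, <IDR> are hypothetical stand-ins (the
# original token names did not survive the page scrape).

def build_fim_prompt(n_context: str, c_context: str) -> str:
    # Both structured flanks are given up front; the model then generates the
    # disordered span autoregressively after the <IDR> sentinel.
    return f"<NTERM>{n_context}<CTERM>{c_context}<IDR>"

prompt = build_fim_prompt(n_context="MKVLAAGE",   # toy N-terminal flank
                          c_context="GDEKLRRQ")   # toy C-terminal flank
print(prompt)  # <NTERM>MKVLAAGE<CTERM>GDEKLRRQ<IDR>

# Generic Hugging Face-style sampling call, not IDiom's API:
# generated = model.generate(**tokenizer(prompt, return_tensors="pt"),
#                            do_sample=True, temperature=1.0)
```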
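To make points 6 and 8 concrete, here is a small self-contained sketch of two of the charge quantities named above: FCR and sliding-window NCPR (the quantity behind the "alternating NCPR blocks" observation). κ and SHD involve fuller patterning definitions and are omitted; the toy sequence and window size are illustrative choices.

```python
# FCR (fraction of charged residues) and per-window NCPR (net charge per
# residue). Histidine's charge is ambiguous at physiological pH; excluded here.
POSITIVE, NEGATIVE = set("KR"), set("DE")

def fcr(seq: str) -> float:
    charged = sum(aa in POSITIVE or aa in NEGATIVE for aa in seq)
    return charged / len(seq)

def ncpr_windows(seq: str, window: int = 5) -> list:
    """Sliding-window net charge per residue; sign flips reveal charge blocks."""
    out = []
    for i in range(len(seq) - window + 1):
        w = seq[i:i + window]
        net = sum((aa in POSITIVE) - (aa in NEGATIVE) for aa in w)
        out.append(net / window)
    return out

toy_idr = "KKKKKEEEEEKKKKKEEEEE"  # strongly blocky charge pattern
print(round(fcr(toy_idr), 2))           # 1.0: every residue is charged
print(ncpr_windows(toy_idr, window=5))  # sweeps between +1.0 and -1.0 across blocks
```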
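Point 9 names GRPO with a KL penalty to the base model. The snippet below sketches only the group-relative advantage computation at GRPO's core, with the reward model stubbed out (a placeholder standing in for a ProtGPS localization score); the policy-update, KL, entropy, and length regularization plumbing is omitted.

```python
# Simplified sketch of GRPO's core step: sample a group of sequences per
# prompt, score them with a reward model, and convert rewards into
# group-relative advantages (no value network needed).
import random
import statistics
from typing import Callable, List

def group_relative_advantages(rewards: List[float]) -> List[float]:
    """GRPO advantage: (r - mean(group)) / std(group)."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero spread
    return [(r - mu) / sigma for r in rewards]

def grpo_step(sample: Callable[[], str],
              reward_fn: Callable[[str], float],
              group_size: int = 8):
    group = [sample() for _ in range(group_size)]
    advantages = group_relative_advantages([reward_fn(s) for s in group])
    # In the full algorithm these advantages weight a clipped policy-gradient
    # loss, with a KL penalty to the frozen base model plus entropy and length
    # regularizers, as the summary above describes.
    return list(zip(group, advantages))

# Toy usage: reward lysine-rich sequences (stand-in for a nucleolus score).
fake_sample = lambda: "".join(random.choice("KRSGDE") for _ in range(20))
fake_reward = lambda s: s.count("K") / len(s)
for seq, adv in grpo_step(fake_sample, fake_reward):
    print(f"{seq}  adv={adv:+.2f}")
```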
Jason Liu reposted
Thinking Machines @thinkymachines
LoRA makes fine-tuning more accessible, but it's unclear how it compares to full fine-tuning. We find that the performance often matches closely, more often than you might expect. In our latest Connectionism post, we share our experimental results and recommendations for LoRA. thinkingmachines.ai/blog/lora/
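For context on what is being compared: LoRA freezes the pretrained weight matrix and learns a low-rank additive update, so only a small fraction of parameters train. A minimal generic PyTorch sketch follows; this is a standard textbook formulation, not Thinking Machines' code.

```python
# Minimal LoRA linear layer: frozen base weight W plus a trainable low-rank
# update scaled by alpha/r. Generic illustration of the method.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # freeze pretrained weights
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: update starts at 0
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(512, 512))
print(layer(torch.randn(2, 512)).shape)  # torch.Size([2, 512])
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 8192 trainable params vs 262,656 in the base layer
```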
Jason Liu reposted
Thinking Machines @thinkymachines
Introducing Tinker: a flexible API for fine-tuning language models. Write training loops in Python on your laptop; we'll run them on distributed GPUs. Private beta starts today. We can't wait to see what researchers and developers build with cutting-edge open models! thinkingmachines.ai/tinker
Jason Liu @JasonLiu1044858
of the 8B-Instruct model using 1k reasoning traces generated by the 70B model, accuracy improves to ~25% (green). The highly scalable training infrastructure of Tinker will enable further studies on the learning of scientific reasoning tasks under different
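The recipe this thread describes is distillation by supervised fine-tuning: sample reasoning traces from the larger teacher, then fine-tune the smaller student on them. A generic sketch using trl's SFTTrainer follows; the dataset path and hyperparameters are placeholders, not the authors' setup, and the model id is one plausible 8B-Instruct checkpoint.

```python
# Generic distillation-by-SFT sketch: fine-tune a small student on traces
# sampled from a large teacher. Dataset path and hyperparameters are
# placeholders; each JSONL record is assumed to hold a "text" field
# containing the prompt plus the teacher's reasoning trace.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

traces = load_dataset("json", data_files="traces_from_70b.jsonl")["train"]  # ~1k teacher traces

trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-8B-Instruct",  # student checkpoint (assumption)
    train_dataset=traces,
    args=SFTConfig(output_dir="sft-8b-distilled",
                   num_train_epochs=3,
                   per_device_train_batch_size=2),
)
trainer.train()
```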
Jason Liu @JasonLiu1044858
While reinforcement learning has been demonstrated to improve LLM performance on mathematical reasoning tasks, there is currently far less evidence of performant scientific reasoning models. Using Tinker by @thinkymachines, we were able to rapidly train a variety of models on