Jyo Pari

157 posts

Jyo Pari

@jyo_pari

Working on continual learning | PhD @MIT

Boston · Joined December 2021
903 Following · 2.8K Followers
Pinned Tweet
Jyo Pari @jyo_pari
What if an LLM could update its own weights? Meet SEAL🦭: a framework where LLMs generate their own training data (self-edits) to update their weights in response to new inputs. Self-editing is learned via RL, using the updated model’s downstream performance as reward.
[media]
133 replies · 511 reposts · 3.2K likes · 664.1K views
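For concreteness, here is a minimal sketch of the loop the tweet describes. The interface is assumed throughout: `generate_self_edit`, `finetune`, and `evaluate` are hypothetical helpers for illustration, not SEAL's actual API.

```python
import copy

def seal_step(model, new_input, eval_task, finetune, evaluate, n_candidates=4):
    """One outer RL step of self-editing (hypothetical API throughout)."""
    scored_edits = []
    for _ in range(n_candidates):
        # The model writes its own training data for the new input.
        self_edit = model.generate_self_edit(new_input)
        # Apply that self-edit as a weight update on a copy of the model.
        updated = finetune(copy.deepcopy(model), self_edit)
        # Reward = the updated model's downstream performance.
        scored_edits.append((self_edit, evaluate(updated, eval_task)))
    # Reinforce edit generation toward high-reward edits, e.g. by
    # fine-tuning `model` on the best-scoring self-edits.
    return scored_edits
```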
Jyo Pari @jyo_pari
Hard problems require more than bigger models; they require effective exploration at test time. 💡 @aviral_kumar2 will present new approaches for training LMs to scale test-time exploration, including solving IMO-level math problems. 🏅 🗓️ March 19, 4pm ET @scaleml
[media]
2 replies · 5 reposts · 94 likes · 8.1K views
Jyo Pari @jyo_pari
As context windows grow 📈, continual learning matters more! @tianyuanzhang99 will present how to scale test-time training for effectively infinite context ♾ 🗓️ Feb 19, 3pm ET @scaleml
[media]
8 replies · 15 reposts · 177 likes · 26K views
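As background on the idea (not necessarily the talk's exact method): test-time training absorbs a long context into the weights via gradient steps, so memory stays constant as the context grows. A sketch assuming an HF-style causal LM with a `.logits` output:

```python
import torch
import torch.nn.functional as F

def absorb_context(model, context_tokens, lr=1e-4, chunk=2048):
    """Absorb a long context into the weights at test time (sketch).

    Chunks the context and takes next-token-prediction gradient steps,
    so information lives in (fast) weights instead of an unbounded
    attention window. Assumes `model(x).logits` (HF-style).
    """
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for start in range(0, context_tokens.numel() - 1, chunk):
        x = context_tokens[start : start + chunk + 1].unsqueeze(0)
        logits = model(x[:, :-1]).logits
        loss = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)), x[:, 1:].reshape(-1)
        )
        opt.zero_grad()
        loss.backward()
        opt.step()  # constant memory regardless of total context length
    return model
```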
Jyo Pari retweeted
Locke Cai @couplefire12
RL for reasoning often relies on verifiers — great for math, but tricky for creative writing or open-ended research. Meet RARO: a new paradigm that teaches LLMs to reason via adversarial games instead of verification. No verifiers. No environments. Just demonstrations. 🧵👇
[media]
24 replies · 78 reposts · 611 likes · 177K views
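The thread doesn't spell out the training loop, but the shape of "adversarial games instead of verification" is GAIL-like: a critic learns to distinguish demonstrations from policy outputs, and its score becomes the reward. A sketch with hypothetical `generate`, `score`, and `log_prob` interfaces; RARO's actual formulation may differ:

```python
import torch
import torch.nn.functional as F

def adversarial_step(policy, critic, prompts, demos, policy_opt, critic_opt):
    """One adversarial update: no verifier, only demonstrations (sketch)."""
    samples = [policy.generate(p) for p in prompts]  # hypothetical API

    # Critic step: push demonstrations toward 1, policy samples toward 0.
    d_loss = (
        F.binary_cross_entropy(critic.score(demos), torch.ones(len(demos)))
        + F.binary_cross_entropy(critic.score(samples), torch.zeros(len(samples)))
    )
    critic_opt.zero_grad()
    d_loss.backward()
    critic_opt.step()

    # Policy step: the critic's score is the reward (policy-gradient surrogate).
    rewards = critic.score(samples).detach()
    logps = torch.stack([policy.log_prob(p, s) for p, s in zip(prompts, samples)])
    p_loss = -(rewards * logps).mean()
    policy_opt.zero_grad()
    p_loss.backward()
    policy_opt.step()
```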
Jyo Pari @jyo_pari
Next Tuesday, @shannonzshen will present hybrid chain-of-thought, a method that mixes latent and discrete tokens during decoding 🔥 🗓️ Nov 25, 3pm ET @scaleml
[media]
1 reply · 7 reposts · 51 likes · 6.4K views
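One plausible reading of "mixing latent and discrete tokens" (in the spirit of continuous chain-of-thought work; the talk's method may differ): on latent steps, feed the last hidden state straight back in as the next input embedding instead of sampling a token. A sketch assuming an HF-style model whose hidden size matches its embedding size:

```python
import torch

@torch.no_grad()
def hybrid_decode(model, input_embeds, steps, latent_mask):
    """Alternate latent and discrete steps during decoding (sketch)."""
    embeds, tokens = input_embeds, []
    for t in range(steps):
        out = model(inputs_embeds=embeds, output_hidden_states=True)
        if latent_mask[t]:
            # Latent step: recycle the final hidden state as a continuous token.
            nxt = out.hidden_states[-1][:, -1:]
        else:
            # Discrete step: pick a token and embed it as usual.
            tok = out.logits[:, -1].argmax(-1)
            tokens.append(tok)
            nxt = model.get_input_embeddings()(tok).unsqueeze(1)
        embeds = torch.cat([embeds, nxt], dim=1)
    return tokens
```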
Jyo Pari @jyo_pari
Why do deep learning optimizers make progress even in the edge-of-stability regime? 🤔 @alex_damian_ will present theory that can describe the dynamics of optimization in this regime! 🗓️ Nov 17, 3pm ET @scaleml
[media]
0 replies · 10 reposts · 73 likes · 7.8K views
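The quantity at stake here is the sharpness: the top Hessian eigenvalue of the loss, which in the edge-of-stability regime rises to roughly 2/η (η the step size) and hovers there while the loss keeps decreasing. A standard way to measure it, via power iteration on Hessian-vector products (illustrative, not the speaker's code):

```python
import torch

def sharpness(loss_fn, params, iters=20):
    """Estimate |lambda_max| of the loss Hessian by power iteration."""
    v = [torch.randn_like(p) for p in params]
    for _ in range(iters):
        norm = torch.sqrt(sum((x * x).sum() for x in v))
        v = [x / norm for x in v]
        # Hessian-vector product via double backprop.
        grads = torch.autograd.grad(loss_fn(), params, create_graph=True)
        hv = torch.autograd.grad(grads, params, grad_outputs=v)
        v = [h.detach() for h in hv]
    # v now holds H @ (unit vector), so its norm approximates |lambda_max|.
    return torch.sqrt(sum((x * x).sum() for x in v))
```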
Jyo Pari retweeted
idan shenfeld @IdanShenfeld
Everyone’s talking about Kimi K2 Thinking and its impressive performance. No full report yet, but judging from the Kimi K2 and K1.5 reports, it likely uses Policy Mirror Descent, an RL trick that’s quietly becoming standard in frontier labs. Let’s break down what it is:
[media]
12 replies · 46 reposts · 477 likes · 58.8K views
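For reference, Policy Mirror Descent in its simplest discrete form: each step maximizes expected value minus a KL penalty to the previous policy, which has the closed-form multiplicative update below. This is the generic textbook version, not Kimi's implementation.

```python
import numpy as np

def pmd_update(pi, q, eta):
    """One PMD step: argmax_p <p, q> - (1/eta) * KL(p || pi).

    Closed form: pi_new(a) ∝ pi(a) * exp(eta * q(a)).  The KL anchor
    keeps each policy update conservative. Assumes pi > 0 everywhere.
    """
    logits = np.log(pi) + eta * q
    logits -= logits.max()          # numerical stability
    pi_new = np.exp(logits)
    return pi_new / pi_new.sum()

# Example: a uniform 4-action policy nudged toward action 0.
print(pmd_update(np.ones(4) / 4, np.array([1.0, 0.0, 0.0, 0.0]), eta=0.5))
```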
Jyo Pari retweeted
Kevin Lu @_kevinlu
In our new post, we walk through great prior work from @agarwl_ and the @Alibaba_Qwen team exploring on-policy distillation using an open-source recipe: you can run our experiments on Tinker today! github.com/thinking-machi… I'm especially excited by the use of on-policy distillation to enable new "test-time training" personalization methods, allowing the model to learn new domain knowledge without regressing on post-training capabilities.
Thinking Machines @thinkymachines

Our latest post explores on-policy distillation, a training approach that unites the error-correcting relevance of RL with the reward density of SFT. When training models for math reasoning and as an internal chat assistant, we find that on-policy distillation can outperform other approaches at a fraction of the cost. thinkingmachines.ai/blog/on-policy…

14 replies · 29 reposts · 370 likes · 95.3K views
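The core mechanic, per the linked post: the student samples its own sequences (on-policy, like RL) and the teacher grades every token (dense, like SFT). A common instantiation is a per-token reverse KL on the student's samples; sketch below, with a hypothetical `generate` call and an HF-style `.logits` output:

```python
import torch
import torch.nn.functional as F

def on_policy_distill_loss(student, teacher, prompts):
    """Per-token reverse KL on student-sampled sequences (sketch)."""
    seqs = student.generate(prompts)                 # hypothetical API
    s_logits = student(seqs).logits[:, :-1]
    with torch.no_grad():
        t_logits = teacher(seqs).logits[:, :-1]     # teacher grades each token
    s_logp = F.log_softmax(s_logits, dim=-1)
    t_logp = F.log_softmax(t_logits, dim=-1)
    # KL(student || teacher), evaluated on states the student actually visits.
    return (s_logp.exp() * (s_logp - t_logp)).sum(-1).mean()
```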
Jyo Pari retweeted
Moritz Reuss @moritz_reuss
VLAs have become the fastest-growing subfield in robot learning. So where are we now? After reviewing ICLR 2026 submissions and conversations at CoRL, I wrote an overview of the current state of VLA research with some personal takes: is.gd/1pqw9w
11 replies · 106 reposts · 533 likes · 53K views
Jyo Pari @jyo_pari
After weeks of learning about systems at @scaleml, we’re shifting gears to video foundation models. Thrilled to have @cloneofsimo sharing how to train them from scratch next Tuesday — no better person to learn from 🔥
[media]
5 replies · 11 reposts · 127 likes · 30.5K views
Jyo Pari @jyo_pari
Next Tuesday, @scaleml hosts @kavnwang & Kristine Lu for a tutorial based on jax-ml.github.io/scaling-book/ 🚀 They'll cover distributed training/inference of large models, plus the math & tradeoffs of latency, throughput, and model size in GPU comms!
[media]
2 replies · 14 reposts · 127 likes · 12.2K views
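A taste of the kind of back-of-envelope math the tutorial covers. The hardware numbers below are illustrative (roughly an H100-class HBM bandwidth), not measurements:

```python
# Batch-1 decode must stream every weight from HBM once per token,
# so it is memory-bandwidth-bound rather than compute-bound.
params = 70e9                   # 70B-parameter model
bytes_per_param = 2             # bf16 weights
hbm_bandwidth = 3.35e12         # bytes/s, illustrative H100-class figure

t_per_token = params * bytes_per_param / hbm_bandwidth
print(f"{t_per_token * 1e3:.1f} ms/token, {1 / t_per_token:.0f} tok/s")
# ~41.8 ms/token, ~24 tok/s. Batching amortizes the weight reads,
# which is exactly the latency/throughput tradeoff in question.
```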
Jyo Pari @jyo_pari
@BlackHC @abeirami @IdanShenfeld This is a great question: we find that simply adding KL regularization to SFT isn’t enough. This is likely because the two objectives oppose each other, and we posit that there should be more principled ways of incorporating the KL regularization.
0 replies · 0 reposts · 4 likes · 99 views
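For clarity, the naive combination being discussed looks something like the sketch below (HF-style model assumed; an illustration, not the paper's code): cross-entropy pulls the model toward the new data while the KL term pulls it back toward the reference on the same tokens, so the two gradients can directly oppose each other.

```python
import torch
import torch.nn.functional as F

def sft_with_kl_loss(model, ref_model, input_ids, beta=0.1):
    """SFT cross-entropy plus a KL anchor to the pre-finetuning model."""
    logits = model(input_ids).logits[:, :-1]
    with torch.no_grad():
        ref_logits = ref_model(input_ids).logits[:, :-1]
    labels = input_ids[:, 1:]
    ce = F.cross_entropy(logits.reshape(-1, logits.size(-1)), labels.reshape(-1))
    logp = F.log_softmax(logits, dim=-1)
    ref_logp = F.log_softmax(ref_logits, dim=-1)
    kl = (logp.exp() * (logp - ref_logp)).sum(-1).mean()  # KL(model || ref)
    return ce + beta * kl                                 # opposing pulls
```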
Ahmad Beirami @abeirami
This great work co-led by @IdanShenfeld and @jyo_pari shows that online RL leads to less forgetting because it inherently converges to a solution with a small reverse KL divergence! I'll try to discuss the significance of the result: 🧵
Jyo Pari @jyo_pari

For agents to improve over time, they can’t afford to forget what they’ve already mastered. We found that supervised fine-tuning forgets more than RL when training on a new task! Want to find out why? 👇

2 replies · 4 reposts · 32 likes · 7.5K views
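For readers outside the thread, these are the two divergences at play (π the fine-tuned policy, π₀ the base): SFT's cross-entropy objective fits a forward KL from the data, while the result says online RL implicitly lands at a small reverse KL to the base model.

```latex
\text{SFT: } \min_{\pi} \; D_{\mathrm{KL}}\!\left(\pi_{\text{data}} \,\middle\|\, \pi\right)
\qquad
\text{online RL: small } D_{\mathrm{KL}}\!\left(\pi \,\middle\|\, \pi_{0}\right)
= \mathbb{E}_{x \sim \pi}\!\left[\log \tfrac{\pi(x)}{\pi_{0}(x)}\right]
```

Because the reverse KL's expectation is taken under π itself, the model is only penalized for drifting on outputs it actually produces, which is, roughly, the mechanism for the reduced forgetting described above.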