Varad Pimpalkhute

322 posts

@varad0309

RS @ IFM | Prev @Articul8_AI @AmazonScience @allen_ai | MS CS @UMassAmherst. Towards super intelligence, One Algorithm at a Time.

Sunnyvale, CA · Joined January 2021
672 Following · 106 Followers
Varad Pimpalkhute retweeted
Mingkai Deng (@mdeng34)
Frontier LLMs are converging on efficient, adaptive reasoning. Opus 4.7 lets the model decide how deeply to reason. GPT-5.5 achieves strong results with fewer reasoning tokens. We study a related but more structural question: what 𝗸𝗶𝗻𝗱 𝗼𝗳 𝗿𝗲𝗮𝘀𝗼𝗻𝗶𝗻𝗴 should we adapt? Last year in SiRA (upper figure), we showed that simulative reasoning (System II), which uses a 𝘄𝗼𝗿𝗹𝗱 𝗺𝗼𝗱𝗲𝗹 to evaluate consequences of actions, yields up to 124% improvement over reactive baselines (System I), and that strong reasoning models (o1, o3-mini) fail as planners without this structure. In our new paper SR²AM (lower figure), we add a learned 𝗰𝗼𝗻𝗳𝗶𝗴𝘂𝗿𝗮𝘁𝗼𝗿 (System III) that self-regulates when to simulate, how far ahead, and when to skip planning entirely. Efficient reasoning is not just shorter reasoning: it is better allocation of simulation.
3
45
273
58.7K
Varad Pimpalkhute retweeted
Cameron R. Wolfe, Ph.D. (@cwolferesearch)
New blog on RL scaling laws coming out tomorrow morning. Scaling is one of the most impactful concepts in the history of AI research, but "scaling laws" are an overloaded (sometimes confusing) concept. Scaling laws for pretraining and RL are entirely different concepts. Scaling laws for pretraining are well-defined and have undergone extensive empirical validation, whereas scaling for RL is messy, bespoke, and full of intricate / evolving details. I hope this writeup provides a little clarity to this complex topic.
10
29
217
35K
Varad Pimpalkhute retweeted
Vaidehi Patil (@vaidehi_patil_)
AI Double Agents: Can a defender steer the attacker towards wrong info while making them think they won? We show that RL training to proactively build and use a theory-of-mind (ToM) of the attacker results in effective double agents. We use this as a lens to study+improve ToM in LLMs – even strong LLMs struggle to build/use ToM, and we analyse how RL in our env improves them 📈 Key takeaways: 1️⃣ We introduce ToM-SB: a long-horizon dialogue-based ToM environment where defender LLMs must fool attackers trying to extract sensitive pieces of information, but attackers can have some prior knowledge about their targets. Frontier models like Gemini 3 Pro (34%) and GPT-5.4 (27%) struggle on this task even against a baseline attacker. 2️⃣ We improve via AI Double Agents 🕵️: We train LLMs to act as “Double Agents” via RL by rewarding fooling the attacker and ToM modeling behaviors, matching and surpassing the performance of frontier models. 3️⃣ We demonstrate bidirectional emergence 🔄: When training on ToM-SB, rewarding for fooling the attacker leads to emergent improvement in ToM ability, and vice versa. Further, ToM ability and fooling performance are correlated on all methods we test, suggesting ToM-SB is a good testbed for functional ToM. 🧵👇
2
40
91
27.2K
Dirhousssi Amine (@DirhousssiAmine)
Been going down a massive rabbit hole with numerical stability in RL training lately.🕵️‍♂️🕵️ Take a look at these two GRPO sanity runs. Exact same model, identical task. One climbs perfectly, the other completely flatlines. The only difference? The dead run is in bf16, the successful one is fp32. What do you think the problem is with these runs? Drop your best guesses below !
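One frequently suggested candidate for a gap like this is rounding. This is an assumption and not something the tweet confirms. bf16 keeps only 8 bits of significand precision, so small updates or small differences between nearby values can vanish entirely. The sketch below emulates bf16 rounding with the standard library only; `to_bf16` is a hypothetical helper name introduced for illustration.

```python
import struct

def to_bf16(x: float) -> float:
    """Round a Python float to the nearest bfloat16 value (round-to-nearest-even).

    Emulation only: bfloat16 is the top 16 bits of an IEEE float32.
    NaN/inf handling is omitted for brevity.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest even on the 16 low bits that get dropped.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + rounding_bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

# Near 1.0 the spacing between bf16 values is 2**-7 = 0.0078125,
# so a relative change of 0.1% is rounded away entirely.
ratio = to_bf16(1.0 + 1e-3)
print(ratio)  # 1.0

# The smallest step above 1.0 that bf16 can represent.
print(to_bf16(1.0 + 2**-7))  # 1.0078125
```

In fp32 the same 0.1% change survives, which is one way two otherwise identical runs could diverge.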
13
10
160
33.3K
Varad Pimpalkhute (@varad0309)
We should have more of these events in churches, honestly very cool!
0
0
2
43
Varad Pimpalkhute (@varad0309)
@GXiming Curious to know your thoughts on MCQ-type tasks. I feel we can always get a positive signal with a sufficiently high number of rollouts in this setting.
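A quick arithmetic check of that intuition, under the simplifying assumption (made here, not in the thread) that each rollout independently picks the correct option with probability p: the chance that a group of n rollouts contains at least one correct answer is 1 - (1 - p)^n. The function name `p_any_correct` is hypothetical.

```python
def p_any_correct(p: float, n: int) -> float:
    """Probability that at least one of n independent rollouts is correct."""
    return 1.0 - (1.0 - p) ** n

# Uniform guessing on a 4-option question (p = 0.25), varying group size.
for n in (1, 4, 8, 16, 32):
    print(n, round(p_any_correct(0.25, n), 4))
```

With p = 0.25 and n = 16 the value is roughly 0.99, so even pure guessing almost always yields at least one rewarded rollout per group.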
0
0
0
7
Ximing Lu (@GXiming)
There’s growing excitement around scaling up RLVR to get continuous gains with more compute. But in practice, improvements saturate on finite training data. 😱 Introducing Golden Goose 🦢✨, a simple trick to synthesize unlimited RLVR tasks 😎 from unverifiable internet text. 🌐
13
66
397
108.9K
Varad Pimpalkhute retweeted
Zhihu Frontier (@ZhihuFrontier)
🚀 @MiniMax_AI M2.5 is getting attention, but what actually changed under the hood? Zhihu contributor MM faker (RL framework & algorithm engineer at MiniMax) shared a deep dive into the training system behind the breakthrough:
💡 Forge: a large-scale native Agent RL system.
When running RL in real-world, complex Agent environments, you always face the same triangle: Throughput | Stability | Agent Flexibility
👉 Forge formalizes the objective as maximizing effective training return J:
J ≈ Throughput × Sample Efficiency × Stability
• Throughput = raw token processing rate (Rollout + Training + Data + I/O)
• Sample efficiency = average performance gain per trajectory (data quality, distribution, algorithm, off-policy level)
• Stability = monitored convergence under long-horizon optimization
🔥 The 3 Core Challenges
1️⃣ Agent Scalability: Most RL frameworks assume white-box agents and tightly couple with tokenizer logic (TITO). This limits complex setups like dynamic context management or multi-agent loops.
2️⃣ System Efficiency: Rollout latency ranges from seconds to hours.
• Strict FIFO → blocked by long-tail samples.
• Pure Greedy → distribution shift & RL collapse.
• Meanwhile, multi-turn agents share massive prefix overlap, wasting compute.
3️⃣ Credit Assignment & Stability: Long trajectories (thousands of steps) + sparse rewards → high gradient variance. Long CoT boosts benchmarks, but can hurt real-world latency.
🏗 Forge Architecture (Fig 1)
Forge fully decouples Agent logic from the training engine:
• Agent Layer → pure trajectory producer
• Middleware (Gateway + Data Pool) → physical isolation between agents and engines with async buffering & protocol standardization
• Rollout Engine + Train Engine → high-throughput generation + scheduled policy updates
This enables training across hundreds of frameworks and thousands of tool formats without modifying Agent internals.
◽️ For white-box agents, Context Management (CM) is modeled as an action inside RL. Context shifts become part of state transitions, solving long-horizon attention dilution & train-infer mismatch.
◼️ For black-box agents, Forge integrates non-intrusively via Gateway. Even opaque agent loops benefit from RL optimization (Fig 2).
⚙️ Key Engineering Innovations
1️⃣ Windowed FIFO Scheduling: Balances strict FIFO and Greedy, preserving throughput while controlling off-policy drift (Fig 3).
2️⃣ Prefix Tree Merging: Transforms linear samples into tree structures, eliminating redundant prefix computation (Fig 4).
→ ~40× training acceleration
→ Significant memory reduction
3️⃣ Inference Acceleration
• Dynamic MTP with Top-K KL alignment
• PD separation for MoE scheduling
• Global L3 KV cache pool for long-context reuse
🧠 Algorithm & Reward Design
M2 continues using CISPO as baseline, adapted for 200k-context agent scenarios. Multi-domain mixed training (Reasoning, QA, Code, General Agent) improves robustness and reduces forgetting.
Composite reward includes:
• Process reward (dense mid-step supervision)
• Completion-time reward (optimize execution path)
• Reward-to-Go (variance reduction for long trajectories)
It's not just an RL system; it's scalable infrastructure for real-world Agents. M2.5 is a milestone, not the endpoint. RL is still running internally. Reward is still climbing. M2.7 might arrive stronger than expected 👀
🔗 Original article (Chinese): zhuanlan.zhihu.com/p/200574271625…
#Forge #Agent #RL #MiniMax #M25 #LLM #Training #AI #Tech
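To illustrate the idea behind the prefix tree merging mentioned above, here is a toy sketch. It is not Forge's implementation: the `PrefixTree` class and the "count tokens processed" cost model are assumptions made for this example. Samples that share a prefix are stored in a trie, so each shared token position is represented once instead of once per sample.

```python
from typing import Dict, List

class PrefixTree:
    """Toy trie over token IDs; each non-root node stands for one token position."""

    def __init__(self) -> None:
        self.children: Dict[int, "PrefixTree"] = {}

    def insert(self, tokens: List[int]) -> None:
        node = self
        for tok in tokens:
            node = node.children.setdefault(tok, PrefixTree())

    def num_nodes(self) -> int:
        """Count token nodes below this one (the root itself holds no token)."""
        return sum(1 + child.num_nodes() for child in self.children.values())

# Three samples from one multi-turn episode that share a long prefix.
samples = [
    [1, 2, 3, 4, 10],
    [1, 2, 3, 4, 11, 12],
    [1, 2, 3, 5],
]
tree = PrefixTree()
for s in samples:
    tree.insert(s)

linear_tokens = sum(len(s) for s in samples)  # token positions without merging
merged_tokens = tree.num_nodes()              # token positions after merging
print(linear_tokens, merged_tokens)
```

The saving grows with the number of samples per episode and the length of the shared prefix, which is the regime multi-turn agents tend to be in.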
2
10
86
5.4K
Varad Pimpalkhute retweeted
λux (@novasarc01)
we got three cool papers on self-distillation in the same week!
1/ Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models - arxiv.org/abs/2601.18734
2/ Self-Distillation Enables Continual Learning - arxiv.org/abs/2601.19897
3/ Reinforcement Learning via Self-Distillation - arxiv.org/abs/2601.20802
17
108
704
70.1K
Varad Pimpalkhute retweeted
Fahim Tajwar (@FahimTajwar10)
Are we done with new RL algorithms? Turns out we might have been optimizing the wrong objective. Introducing MaxRL, a framework to bring maximum likelihood optimization to RL settings. Paper + code + project website: zanette-labs.github.io/MaxRL/ 🧵 1/n
14
161
808
207.3K
Varad Pimpalkhute retweeted
Mikhail Yurochkin (@Yurochkin_M)
Nice way to scale synthetic data 🙃 If people still believe in the AI "data wall," here is a spicy take: I don’t think it is a real problem. There are so many ways to generate 10s of trillions of diverse tokens with today's open LLMs (with permissive licenses).
Andrej Karpathy (@karpathy)

I'm being accused of overhyping the [site everyone heard too much about today already]. People's reactions varied very widely, from "how is this interesting at all" all the way to "it's so over". To add a few words beyond just memes in jest - obviously when you take a look at the activity, it's a lot of garbage - spams, scams, slop, the crypto people, highly concerning privacy/security prompt injection attacks wild west, and a lot of it is explicitly prompted and fake posts/comments designed to convert attention into ad revenue sharing. And this is clearly not the first time LLMs were put in a loop to talk to each other. So yes it's a dumpster fire and I also definitely do not recommend that people run this stuff on their computers (I ran mine in an isolated computing environment and even then I was scared), it's way too much of a wild west and you are putting your computer and private data at a high risk.
That said - we have never seen this many LLM agents (150,000 atm!) wired up via a global, persistent, agent-first scratchpad. Each of these agents is individually quite capable now, they have their own unique context, data, knowledge, tools, instructions, and the network of all that at this scale is simply unprecedented. This brings me again to a tweet from a few days ago "The majority of the ruff ruff is people who look at the current point and people who look at the current slope.", which imo again gets to the heart of the variance. Yes clearly it's a dumpster fire right now. But it's also true that we are well into uncharted territory with bleeding edge automations that we barely even understand individually, let alone a network thereof reaching in numbers possibly into ~millions. With increasing capability and increasing proliferation, the second order effects of agent networks that share scratchpads are very difficult to anticipate.
I don't really know that we are getting a coordinated "skynet" (though it clearly type checks as early stages of a lot of AI takeoff scifi, the toddler version), but certainly what we are getting is a complete mess of a computer security nightmare at scale. We may also see all kinds of weird activity, e.g. viruses of text that spread across agents, a lot more gain of function on jailbreaks, weird attractor states, highly correlated botnet-like activity, delusions/psychosis both agent and human, etc. It's very hard to tell, the experiment is running live. TLDR sure maybe I am "overhyping" what you see today, but I am not overhyping large networks of autonomous LLM agents in principle, that I'm pretty sure.

0
2
4
615
Varad Pimpalkhute (@varad0309)
If your large-scale MoE RL run is "mysteriously" unstable, check the router: expert routing during training can differ from routing during inference, which makes the update effectively off-policy and chaotic. Rollout Routing Replay (R3) reuses the inference-time routing during training to align the two and prevent collapse. Both Verl and @radixark support this! Check out the paper: arxiv.org/pdf/2510.11370
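A minimal sketch of the replay idea, assuming a top-k softmax router. The function names `route_topk` and `replay_gates` are hypothetical and are not Verl APIs, and the logits are made up. The expert indices chosen at rollout time are recorded and reused in the training forward pass, while the gate weights are still computed from the training-side logits.

```python
import math
from typing import List, Tuple

def route_topk(logits: List[float], k: int) -> List[int]:
    """Indices of the k largest router logits (ties broken by lower index)."""
    return sorted(range(len(logits)), key=lambda i: (-logits[i], i))[:k]

def replay_gates(train_logits: List[float], experts: List[int]) -> List[Tuple[int, float]]:
    """Softmax over training logits restricted to the replayed experts."""
    m = max(train_logits[i] for i in experts)
    exps = [math.exp(train_logits[i] - m) for i in experts]
    z = sum(exps)
    return [(i, e / z) for i, e in zip(experts, exps)]

# Router logits for one token. Small numerical differences between the
# inference and training engines flip which experts land in the top-2.
infer_logits = [2.00, 1.01, 1.00, -1.0]
train_logits = [2.00, 1.00, 1.01, -1.0]

rollout_experts = route_topk(infer_logits, k=2)  # recorded at rollout time
naive_experts = route_topk(train_logits, k=2)    # what training would pick alone
gates = replay_gates(train_logits, rollout_experts)
print(rollout_experts, naive_experts)
print(gates)
```

Without replay, the token would be processed by a different expert set in training than the one that generated it; with replay, the expert set matches and only the gate weights reflect the training-side logits.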
1
0
2
36
Varad Pimpalkhute retweeted
Siyan Zhao (@siyan_zhao)
Introducing 💡On-Policy Self-Distillation💡, a simple method that enables an LLM to teach itself with dense per-token feedback on its own on-policy generations, achieving 4-8x more token efficiency vs. GRPO and outperforming both GRPO and SFT/Off-Policy Distillation. Key insight: like a student reviewing solutions, rationalizing them, and correcting prior mistakes, an LLM can be conditioned on privileged info (e.g., correct solution or a reasoning trace) and supervise its weaker self (the version without such access) by matching the privileged-info-induced distribution from itself. 🌐Blog: siyan-zhao.github.io/blog/2026/opsd/ 🧵👇
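A minimal sketch of the per-token signal described above, using made-up logits over a three-token vocabulary. The choice of KL(teacher || student) is an assumption for illustration; the blog post is the place to check which divergence the method actually uses.

```python
import math
from typing import List

def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kl(p: List[float], q: List[float]) -> float:
    """KL(p || q) for two distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Same model, same sampled response, two contexts per token position:
# the student sees only the prompt; the teacher also sees privileged info.
student_logits = [[1.0, 0.5, 0.0], [0.2, 0.1, 0.0]]
teacher_logits = [[3.0, 0.5, 0.0], [0.2, 0.1, 0.0]]

per_token_loss = [
    kl(softmax(t), softmax(s)) for t, s in zip(teacher_logits, student_logits)
]
print(per_token_loss)
```

Every position contributes its own loss term, which is what makes the feedback dense compared with a single scalar reward per response.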
31
157
923
133.6K
Xu Zou (@xz_keg)
@BanghuaZ This is not counterintuitive.
1
0
0
112
Varad Pimpalkhute retweeted
Hector Liu (@waterluffy)
While reading papers, I often notice that many methods are validated on just one base model. This may be inevitable given experimental costs. What are some practical approaches to this? Just spend more? Or are there better ways to analyze generalizability across models?
2
3
8
1.5K
Varad Pimpalkhute retweeted
Cody (Yingquan) Wu (@CodyWueqs)
I’ve been benchmarking GPT against state-of-the-art topics in algebraic coding theory. Three months ago, GPT-5 Thinking correctly described the Berlekamp–Massey algorithm, but stumbled on the Berlekamp algorithm. I was surprised to see that GPT-5.2 Thinking can now properly describe the Berlekamp algorithm: chatgpt.com/s/t_6979532b7c… It even covered the lesser-known Koetter–Horiguchi formula: chatgpt.com/s/t_6979580a4b… One important caveat: this success relied on web search and verification, rather than deriving directly from first principles.
1
2
3
327