Varad Pimpalkhute

322 posts


@varad0309

RS @ IFM | Prev @Articul8_AI @AmazonScience @allen_ai | MS CS @UMassAmherst. Towards super intelligence, One Algorithm at a Time.

Sunnyvale, CA · Joined January 2021

672 Following · 106 Followers

Varad Pimpalkhute retweeted

Mingkai Deng@mdeng34·2d

Frontier LLMs are converging on efficient, adaptive reasoning. Opus 4.7 lets the model decide how deeply to reason. GPT-5.5 achieves strong results with fewer reasoning tokens. We study a related but more structural question: what 𝗸𝗶𝗻𝗱 𝗼𝗳 𝗿𝗲𝗮𝘀𝗼𝗻𝗶𝗻𝗴 should we adapt? Last year in SiRA (upper figure), we showed that simulative reasoning (System II), which uses a 𝘄𝗼𝗿𝗹𝗱 𝗺𝗼𝗱𝗲𝗹 to evaluate consequences of actions, yields up to 124% improvement over reactive baselines (System I), and that strong reasoning models (o1, o3-mini) fail as planners without this structure. In our new paper SR²AM (lower figure), we add a learned 𝗰𝗼𝗻𝗳𝗶𝗴𝘂𝗿𝗮𝘁𝗼𝗿 (System III) that self-regulates when to simulate, how far ahead, and when to skip planning entirely. Efficient reasoning is not just shorter reasoning: it is better allocation of simulation.


273

58.7K

Varad Pimpalkhute retweeted

Cameron R. Wolfe, Ph.D.@cwolferesearch·20 Apr

New blog on RL scaling laws coming out tomorrow morning. Scaling is one of the most impactful concepts in the history of AI research, but "scaling laws" are an overloaded (sometimes confusing) concept. Scaling laws for pretraining and RL are entirely different concepts. Scaling laws for pretraining are well-defined and have undergone extensive empirical validation, whereas scaling for RL is messy, bespoke, and full of intricate / evolving details. I hope this writeup provides a little clarity to this complex topic.


217

35K

Varad Pimpalkhute retweeted

Vaidehi Patil@vaidehi_patil_·14 Apr

AI Double Agents: Can a defender steer the attacker towards wrong info while making them think they won? We show that RL training to proactively build and use a theory-of-mind (ToM) of the attacker results in effective double agents. We use this as a lens to study+improve ToM in LLMs – even strong LLMs struggle to build/use ToM, and we analyse how RL in our env improves them 📈
Key takeaways:
1️⃣ We introduce ToM-SB: a long-horizon dialogue-based ToM environment where defender LLMs must fool attackers trying to extract sensitive pieces of information, but attackers can have some prior knowledge about their targets. Frontier models like Gemini 3 Pro (34%) and GPT-5.4 (27%) struggle on this task even against a baseline attacker.
2️⃣ We improve via AI Double Agents 🕵️: We train LLMs to act as “Double Agents” via RL by rewarding fooling the attacker and ToM modeling behaviors, matching and surpassing the performance of frontier models.
3️⃣ We demonstrate bidirectional emergence 🔄: When training on ToM-SB, rewarding for fooling the attacker leads to emergent improvement in ToM ability, and vice versa. Further, ToM ability and fooling performance are correlated on all methods we test, suggesting ToM-SB is a good testbed for functional ToM.
🧵👇


27.2K

Varad Pimpalkhute@varad0309·29 Mar

@DirhousssiAmine @_lewtun Sampling and trainer dtypes are mismatched?


Dirhousssi Amine@DirhousssiAmine·28 Mar

Been going down a massive rabbit hole with numerical stability in RL training lately.🕵️‍♂️🕵️ Take a look at these two GRPO sanity runs. Exact same model, identical task. One climbs perfectly, the other completely flatlines. The only difference? The dead run is in bf16, the successful one is fp32. What do you think the problem is with these runs? Drop your best guesses below !


160

33.3K
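One way to see why a sampler/trainer dtype mismatch is a plausible guess: the sketch below rounds the same logits to an emulated bfloat16 on the "sampler" side and keeps full precision on the "trainer" side, then compares the log-probability of one token. The logits are made up for illustration, `to_bf16` is a hypothetical helper that truncates rather than rounds to nearest, and Python floats stand in for fp32. It does not claim to reproduce the runs in the post.

```python
import math
import struct


def to_bf16(x: float) -> float:
    """Emulate bfloat16 by keeping only the top 16 bits of the float32 encoding.

    Assumption: plain truncation; real hardware usually rounds to nearest even.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]


# Illustrative logits for a 4-token vocabulary (not taken from any real run).
logits = [12.3456, 11.9876, 3.1415, -0.5772]

trainer_logp = log_softmax(logits)                        # higher-precision side
sampler_logp = log_softmax([to_bf16(v) for v in logits])  # emulated bf16 side

# Importance ratio for token 0: exactly 1.0 only if both sides agree.
token = 0
ratio = math.exp(trainer_logp[token] - sampler_logp[token])
print(f"trainer logp={trainer_logp[token]:.6f} "
      f"sampler logp={sampler_logp[token]:.6f} ratio={ratio:.6f}")
```

Even with identical weights, the ratio drifts away from 1, so a policy-gradient update that assumes on-policy samples is in effect slightly off-policy.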

Varad Pimpalkhute@varad0309·15 Mar

We should have more of these events in churches, honestly very cool!


Varad Pimpalkhute@varad0309·12 Mar

Cool work!!

Seungwook Han@seungwookh

Can language models learn useful priors without ever seeing language? We pre-pre-train transformers on neural cellular automata — fully synthetic, zero language. This improves language modeling by up to 6%, speeds up convergence by 40%, and strengthens downstream reasoning. Surprisingly, it even beats pre-pre-training on natural text! Blog: hanseungwook.github.io/blog/nca-pre-p… (1/n)


Varad Pimpalkhute@varad0309·3 Mar

@GXiming Curious to know your thoughts on MCQ-type tasks? I feel we can always get a positive signal with a sufficiently high number of rollouts in this setting.


Ximing Lu@GXiming·3 Feb

There’s growing excitement around scaling up RLVR to get continuous gains with more compute. But in practice, improvements saturate on finite training data. 😱 Introducing Golden Goose 🦢✨, a simple trick to synthesize unlimited RLVR tasks 😎 from unverifiable internet text. 🌐


397

108.9K
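The intuition in the MCQ question above can be put into numbers. Assuming independent rollouts that each pick the correct option with probability p (for pure guessing on a 4-option question, p = 0.25), the chance that at least one of n rollouts is correct is 1 - (1 - p)^n. The function name is hypothetical and the independence assumption is a simplification.

```python
def p_any_correct(p: float, n: int) -> float:
    """Probability that at least one of n independent rollouts is correct,
    when each rollout is correct with probability p."""
    return 1.0 - (1.0 - p) ** n


# Pure guessing on a 4-option multiple-choice question.
for n in (1, 4, 8, 16, 32):
    print(n, round(p_any_correct(0.25, n), 4))
```

Under these assumptions a group of 16 rollouts almost always contains a rewarded sample even when the policy is guessing, which is the concern raised in the reply.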

Varad Pimpalkhute retweeted

Zhihu Frontier@ZhihuFrontier·14 Feb

🚀 @MiniMax_AI M2.5 is getting attention — but what actually changed under the hood? Zhihu contributor MM faker (RL framework & algorithm engineer at MiniMax) shared a deep dive into the training system behind the breakthrough:
💡 Forge — a large-scale native Agent RL system.
When running RL in real-world, complex Agent environments, you always face the same triangle: Throughput | Stability | Agent Flexibility
👉 Forge formalizes the objective as maximizing effective training return J:
J ≈ Throughput × Sample Efficiency × Stability
• Throughput = raw token processing rate (Rollout + Training + Data + I/O)
• Sample efficiency = average performance gain per trajectory (data quality, distribution, algorithm, off-policy level)
• Stability = monitored convergence under long-horizon optimization
🔥 The 3 Core Challenges
1️⃣ Agent Scalability
Most RL frameworks assume white-box agents and tightly couple with tokenizer logic (TITO). This limits complex setups like dynamic context management or multi-agent loops.
2️⃣ System Efficiency
Rollout latency ranges from seconds to hours.
• Strict FIFO → blocked by long-tail samples.
• Pure Greedy → distribution shift & RL collapse.
• Meanwhile, multi-turn agents share massive prefix overlap — wasting compute.
3️⃣ Credit Assignment & Stability
Long trajectories (thousands of steps) + sparse rewards → high gradient variance. Long CoT boosts benchmarks, but can hurt real-world latency.
🏗 Forge Architecture (Fig 1)
Forge fully decouples Agent logic from the training engine:
• Agent Layer → pure trajectory producer
• Middleware (Gateway + Data Pool) → Physical isolation between agents and engines with async buffering & protocol standardization
• Rollout Engine + Train Engine → high-throughput generation + scheduled policy updates
This enables training across hundreds of frameworks and thousands of tool formats — without modifying Agent internals.
◽️ For white-box agents, Context Management (CM) is modeled as an action inside RL. Context shifts become part of state transitions — solving long-horizon attention dilution & train-infer mismatch.
◼️ For black-box agents, Forge integrates non-intrusively via Gateway. Even opaque agent loops benefit from RL optimization (Fig 2).
⚙️ Key Engineering Innovations
1️⃣ Windowed FIFO Scheduling
Balances strict FIFO and Greedy — preserving throughput while controlling off-policy drift (Fig 3).
2️⃣ Prefix Tree Merging
Transforms linear samples into tree structures, eliminating redundant prefix computation (Fig 4).
→ ~40× training acceleration
→ Significant memory reduction
3️⃣ Inference Acceleration
• Dynamic MTP with Top-K KL alignment
• PD separation for MoE scheduling
• Global L3 KV cache pool for long-context reuse
🧠 Algorithm & Reward Design
M2 continues using CISPO as baseline, adapted for 200k-context agent scenarios. Multi-domain mixed training (Reasoning, QA, Code, General Agent) improves robustness and reduces forgetting.
Composite reward includes:
• Process reward (dense mid-step supervision)
• Completion-time reward (optimize execution path)
• Reward-to-Go (variance reduction for long trajectories)
It's not just an RL system — it's scalable infrastructure for real-world Agents. M2.5 is a milestone — not the endpoint. RL is still running internally. Reward is still climbing. M2.7 might arrive stronger than expected 👀
🔗 Original article (Chinese): zhuanlan.zhihu.com/p/200574271625…
#Forge #Agent #RL #MiniMax #M25 #LLM #Training #AI #Tech


5.4K
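The prefix tree merging idea described in the post above can be illustrated with a toy trie. This is a minimal sketch, not Forge's implementation: it only counts how many token positions remain once shared prefixes are stored once, and the sample trajectories and the function name are invented for the example.

```python
def count_prefix_tree_tokens(samples):
    """Insert token sequences into a trie and return the number of trie nodes,
    i.e. the number of token positions left after shared prefixes are merged."""
    root = {}
    nodes = 0
    for seq in samples:
        node = root
        for tok in seq:
            if tok not in node:
                node[tok] = {}
                nodes += 1
            node = node[tok]
    return nodes


# Three hypothetical multi-turn samples that share a common prefix.
samples = [
    ["sys", "task", "a1", "obs1"],
    ["sys", "task", "a1", "obs1", "a2"],
    ["sys", "task", "b1"],
]

linear_tokens = sum(len(s) for s in samples)       # 12 positions if processed one by one
merged_tokens = count_prefix_tree_tokens(samples)  # 6 positions after merging prefixes
print(linear_tokens, merged_tokens)
```

The real saving depends on how much prefix overlap the trajectories have. The ~40× figure is the post's claim and is not something this toy example demonstrates.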

Varad Pimpalkhute retweeted

Sydney He@helansydney·13 Feb

The Bitter Lesson Behind Building Agentic RL in Terminal Environments This blog post summarizes our practical experience over the past three months working on Agentic RL. For more details, please refer to: faithful-almanac-add.notion.site/The-Bitter-Les… #LLM #RL #Agent #AgenticRL


184

11.4K

Varad Pimpalkhute retweeted

λux@novasarc01·29 Jan

we got three cool papers on self-distillation in the same week! 1/ Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models - arxiv.org/abs/2601.18734 2/ Self-Distillation Enables Continual Learning - arxiv.org/abs/2601.19897 3/ Reinforcement Learning via Self-Distillation - arxiv.org/abs/2601.20802


108

704

70.1K

Varad Pimpalkhute retweeted

Fahim Tajwar@FahimTajwar10·5 Feb

Are we done with new RL algorithms? Turns out we might have been optimizing the wrong objective. Introducing MaxRL, a framework to bring maximum likelihood optimization to RL settings. Paper + code + project website: zanette-labs.github.io/MaxRL/ 🧵 1/n


161

808

207.3K

Varad Pimpalkhute retweeted

Mikhail Yurochkin@Yurochkin_M·1 Feb

Nice way to scale synthetic data 🙃 If people still believe in the AI "data wall," here is a spicy take: I don’t think it is a real problem. There are so many ways to generate 10s of trillions of diverse tokens with today's open LLMs (with permissive licenses).

Andrej Karpathy@karpathy

I'm being accused of overhyping the [site everyone heard too much about today already]. People's reactions varied very widely, from "how is this interesting at all" all the way to "it's so over". To add a few words beyond just memes in jest - obviously when you take a look at the activity, it's a lot of garbage - spams, scams, slop, the crypto people, highly concerning privacy/security prompt injection attacks wild west, and a lot of it is explicitly prompted and fake posts/comments designed to convert attention into ad revenue sharing. And this is clearly not the first time LLMs were put in a loop to talk to each other. So yes it's a dumpster fire and I also definitely do not recommend that people run this stuff on their computers (I ran mine in an isolated computing environment and even then I was scared), it's way too much of a wild west and you are putting your computer and private data at a high risk.
That said - we have never seen this many LLM agents (150,000 atm!) wired up via a global, persistent, agent-first scratchpad. Each of these agents is fairly individually quite capable now, they have their own unique context, data, knowledge, tools, instructions, and the network of all that at this scale is simply unprecedented.
This brings me again to a tweet from a few days ago "The majority of the ruff ruff is people who look at the current point and people who look at the current slope.", which imo again gets to the heart of the variance. Yes clearly it's a dumpster fire right now. But it's also true that we are well into uncharted territory with bleeding edge automations that we barely even understand individually, let alone a network thereof reaching in numbers possibly into ~millions. With increasing capability and increasing proliferation, the second order effects of agent networks that share scratchpads are very difficult to anticipate.
I don't really know that we are getting a coordinated "skynet" (though it clearly type checks as early stages of a lot of AI takeoff scifi, the toddler version), but certainly what we are getting is a complete mess of a computer security nightmare at scale. We may also see all kinds of weird activity, e.g. viruses of text that spread across agents, a lot more gain of function on jailbreaks, weird attractor states, highly correlated botnet-like activity, delusions/psychosis both agent and human, etc. It's very hard to tell, the experiment is running live.
TLDR sure maybe I am "overhyping" what you see today, but I am not overhyping large networks of autonomous LLM agents in principle, that I'm pretty sure.


615

Varad Pimpalkhute@varad0309·1 Feb

We are tackling all of these fun challenges at IFM.AI ;)


Varad Pimpalkhute@varad0309·1 Feb

If your large-scale MoE RL run is "mysteriously" unstable, check the router: expert routing during training can differ from routing during inference, which makes the updates effectively off-policy and leads to chaos. Rollout Routing Replay (R3) reuses the inference routing during training to align the two and prevent collapse. Both Verl and @radixark support this! Check out the paper: arxiv.org/pdf/2510.11370
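A toy illustration of the routing mismatch and of replaying the rollout's routing. It assumes a top-2 router over four experts, made-up router logits that differ only by small numerical noise, and gate weights recomputed from the trainer's own logits over the replayed experts. That last detail and all function names are assumptions for the sketch, not the API of Verl or any other framework.

```python
import math


def top_k_indices(logits, k):
    """Indices of the k largest router logits."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]


def gate_weights(logits, experts):
    """Softmax of the given logits restricted to the selected experts."""
    m = max(logits[i] for i in experts)
    exps = {i: math.exp(logits[i] - m) for i in experts}
    z = sum(exps.values())
    return {i: e / z for i, e in exps.items()}


# Hypothetical router logits for one token: the inference and training engines
# disagree slightly on experts 2 and 3 because of numerical differences.
infer_logits = [0.50, 2.00, 1.01, 1.00]
train_logits = [0.50, 2.00, 1.00, 1.01]

infer_choice = top_k_indices(infer_logits, k=2)  # experts used during rollout
train_choice = top_k_indices(train_logits, k=2)  # experts the trainer would pick

# Replay: keep the rollout's expert choice, but compute gate weights from the
# trainer's logits so the weights stay tied to the trainable router.
replayed = gate_weights(train_logits, infer_choice)
print(infer_choice, train_choice, replayed)
```

Without replay, the token would pass through a different expert in training than it did when it was sampled, so the log-probability being optimized no longer matches the policy that generated the data.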


Varad Pimpalkhute@varad0309·30 Jan

@DhruvBatra_ Wow very cool !! Congrats on the release


Dhruv Batra@DhruvBatra_·30 Jan

We trained a good model. You can now try it.

Devi Parikh@deviparikh

Introducing n1 — Yutori’s browser-use model. Available today via our API. If you’ve been using Claude, Gemini or OpenAI’s computer use models for browser automation — you should switch to Yutori’s n1. It is more accurate, significantly cheaper, and a drop-in replacement.


6.5K

Varad Pimpalkhute retweeted

Siyan Zhao@siyan_zhao·22 Jan

Introducing 💡On-Policy Self-Distillation💡, a simple method that enables LLM to teach itself with dense per-token feedback on its own on-policy generations—achieving 4-8x more token efficiency vs. GRPO and outperforming both GRPO and SFT/Off-Policy Distillation. Key insight: like a student reviewing solutions, rationalizing them, and correcting prior mistakes, an LLM can be conditioned on privileged info (e.g., correct solution or a reasoning trace) and supervise its weaker self—the version without such access—by matching the privileged-info-induced distribution from itself. 🌐Blog: siyan-zhao.github.io/blog/2026/opsd/ 🧵👇


157

923

133.6K

Varad Pimpalkhute@varad0309·30 Jan

@xz_keg @BanghuaZ Why so?


Xu Zou@xz_keg·29 Jan

@BanghuaZ This is not counterintuitive.


112

Banghua Zhu@BanghuaZ·28 Jan

Kimi-K2-like INT4 QAT training is now integrated and tested on slime & Miles, with experiment details on Qwen 235B and Kimi K2 Thinking! Counterintuitively, BF16 train–INT4 infer actually can achieve lower train-inference logprob gap than BF16 train–FP8 infer because of QAT on the training side. More exciting exps and features to be shared soon from LMSYS & SGLang RL community!

LMSYS Org@lmsysorg

🚀 New Blog: INT4 Quantization-Aware Training (QAT) is fired up! Inspired by the Kimi K2 team, our SGLang RL team shipped an end-to-end INT4 Quantization-Aware Training (QAT) pipeline that achieved BF16-level stability & train–infer consistency with: 📙Fake quant during training + real W4A16 at inference 💻INT4 compression shrinks ~1TB-scale models to fit on a single H200 GPU 💡Single-node rollout: no cross-node synchronization and communication overhead, so we have faster, more stable RL sampling More to come: speeding up QAT on the training side, exploring FP4 RL on NVIDIA Blackwell and beyond.


105

12.7K
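"Fake quant during training" in the post above refers to rounding weights to the low-precision grid and mapping them straight back to floats, so the forward pass sees quantization error while the stored weights stay in high precision. The sketch below uses symmetric per-tensor INT4 scaling, which is an assumption made for illustration; the blog's actual grouping and scaling scheme is not described in the post.

```python
def fake_quant_int4(weights):
    """Quantize to the signed 4-bit range [-8, 7], then dequantize back to floats.

    Assumption: one symmetric scale for the whole tensor, chosen so that the
    largest magnitude maps to 7.
    """
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)
    scale = max_abs / 7.0
    out = []
    for w in weights:
        q = max(-8, min(7, round(w / scale)))  # integer code an INT4 kernel would store
        out.append(q * scale)                  # value the forward pass actually uses
    return out


weights = [7.0, -3.2, 1.4, 0.0]
print(fake_quant_int4(weights))
```

In quantization-aware training the rounding step is typically treated as the identity in the backward pass (a straight-through estimator), so gradients still reach the full-precision weights.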

Varad Pimpalkhute retweeted

Hector Liu@waterluffy·29 Jan

While reading papers, I often notice that many methods are validated on just one base model. It may be inevitable due to experimental costs. What are some practical approaches to this? Just spend more? Or are there better ways to analyze generalizability across models?


1.5K

Varad Pimpalkhute retweeted

Rupesh Srivastava@rupspace·28 Jan

Most attractive quadrant is right! Congrats @varad0309 @tw_killian for a new milestone in fully open source AI.

LLM360@llm360

Please welcome K2 Think V2, our first fully sovereign 70B reasoning model. Built on the K2-V2 base, this release bridges the gap between community-owned AI and proprietary models. About K2 Think V2: 🧠 70B parameters, RLVR-tuned 🛡️ 100% Sovereign (IFM-curated data only) 🔓 Fully Open (Pre-training to Post-training) 💡 Top-tier Openness & Intelligence


1.2K

Varad Pimpalkhute retweeted

Cody (Yingquan) Wu@CodyWueqs·28 Jan

I’ve been benchmarking GPT against state-of-the-art topics in algebraic coding theory. Three months ago, GPT-5 Thinking correctly described the Berlekamp–Massey algorithm, but stumbled on the Berlekamp algorithm. I was surprised to see that GPT-5.2 Thinking can now properly describe the Berlekamp algorithm: chatgpt.com/s/t_6979532b7c… It even covered the lesser-known Koetter–Horiguchi formula: chatgpt.com/s/t_6979580a4b… One important caveat: this success relied on web search and verification, rather than deriving directly from first principles.


327
