Arun Iyer

897 posts

@AIonGradFlow

Researcher

Bangalore, India · Joined June 2009
263 Following · 103 Followers
Arun Iyer retweeted
Rulin Shao (@RulinShao)
DR Tulu is now accepted for an oral presentation at #ICML2026 🙏

Updated paper: arxiv.org/abs/2511.19399

📥 We added more ablations including using Qwen3-8B as the rubric generator&judge, showing evolving rubrics work with a weak model too; spurious rewards sanity check, etc.

Live demo: dr-tulu.org
Code&models: github.com/rlresearch/dr-…

Quoted post by Rulin Shao (@RulinShao):
Happy to share that DR Tulu has been accepted to ICML as a ✨Spotlight✨! We believe that co-evolving the agent and its reward metric can lead to more capable intelligence. DR Tulu is a team effort. Huge thanks and congrats to all my amazing collaborators and mentors!

3 replies · 29 reposts · 198 likes · 15.8K views
Arun Iyer retweeted
Alex Smola (@smolix)
The LLM benchmark zoo keeps growing: MMLU, MTEB, HELM, BigCodeBench, AlpacaEval, LiveBench, Arena-Hard, MT-Bench... days of GPU time per release. But the columns are wildly correlated. The real question isn't "which benchmark" but "which subset."
[image]
5 replies · 1 repost · 10 likes · 1.8K views
Arun Iyer retweeted
Yifan Yang (@Yif_Yang)
🚀 Introducing SkillOpt, an optimizer for agent skills.

Instead of finetuning model weights, we treat a natural-language skill as a trainable external parameter. Think of it as deep learning for the frontier-model + agent era: learning rate, LR schedule, mini-batch, batch size, epoch, momentum, all in text-space optimization.

SkillOpt enables stable, controllable skill updates through bounded edits, allowing the optimizer to summarize “gradient directions” from agent experience and continuously improve procedural capability.

We evaluate SkillOpt across 6 benchmarks and 7 models, under both direct model calls and real agent execution loops with Codex + Claude Code. SkillOpt achieves best or tied-best results in 52/52 settings.

Train the skill, not the model. 🛠️🤖

🌐 aka.ms/skillopt
📄 huggingface.co/papers/2605.23…
49 replies · 102 reposts · 820 likes · 80.3K views
Arun Iyer retweeted
Soheil Feizi (@FeiziSoheil)
🚨 New paper alert: LLM agents increasingly need to decide when to answer directly and when to use a tool. But tool use is not one-size-fits-all.

In our new paper, “Model-Adaptive Tool Necessity Reveals the Knowing-Doing Gap in LLM Tool Use,” we argue that tool necessity should depend on the model itself. A question that GPT-level models can answer reliably may still require a calculator, search engine, or database call for a weaker model. Treating tool necessity as model-agnostic misses this important reality.

We introduce a model-adaptive view of tool necessity, grounded in each model’s empirical capabilities, and compare when a tool is actually needed with when models choose to call one. Across arithmetic and factual QA settings, we find substantial mismatches: models often either call tools when they do not need them, or fail to call tools when they do.

The key finding is a knowing-doing gap in LLM tool use. Models often contain internal signals about whether a tool is needed, but those signals do not reliably translate into the final tool-call action. This suggests that improving agent reliability is not only about teaching models to recognize when tools are useful, but also about making sure that recognition is converted into action.

As LLMs become more agentic, tool-use reliability will be central to their safety, efficiency, and trustworthiness. Our work points to a more model-aware way of evaluating and improving when agents should rely on themselves versus external tools.

Paper: arxiv.org/abs/2605.14038
Code & Data: github.com/chengez/Tool-C…

Joint work with @chengez1114, Chenrui Fan, Mahdi JafariRaviz, @RezaeiKeivan
[image]
3 replies · 5 reposts · 23 likes · 1.7K views
Arun Iyer retweeted
Souradip Chakraborty (@SOURADIPCHAKR18)
🚨Typical RL algorithms and on-policy distillation methods are blind samplers: they use privileged info to score rollouts, but not to *find* them. We ask: can we use privileged info to *actively sample* the rollouts RL wishes it can stumble upon with compute? ⤵️ Pedagogical RL
[image]
15 replies · 86 reposts · 483 likes · 111.5K views
Arun Iyer retweeted
Duy Nguyen (@duynguyen772)
Sparse binary rewards bottleneck LLM RL, motivating the use of privileged information in self-distillation as dense teachers. How can we use and balance multiple types of privileged info: leveraging stable cross-view info, while preserving view-specific info?

Current on-policy self-distillation methods often condition the teacher on only one type of privileged view: full solution, partial rationale, answer-only, reference code, feedback, etc. This can be suboptimal:
1️⃣ No single privileged view consistently performs best when used as a teacher.
2️⃣ Views can introduce teacher-specific artifacts from information unavailable to the student.

🧠 Adaptive-View Self-Distillation (AVSD) considers multiple privileged views jointly as a teacher family, balancing cross-view consensus and view-specific signals through a token-level gate to construct better dense learning signals. 🧵👇
[image]
4 replies · 35 reposts · 84 likes · 25.3K views
Arun Iyer retweeted
Yuxiang Huang (@yxyxyyy6)
[1/n] Can a model learn *where* and *how much* information it should attend to, and do so efficiently? We introduce DashAttention: Differentiable and Adaptive Sparse Hierarchical Attention! This pushes the accuracy-efficiency frontier in LLMs.
[GIF]
2 replies · 19 reposts · 119 likes · 30.2K views
Arun Iyer retweeted
Ming Li @ UMD PhD (@Ming_Liiii)
Excited to share that our paper: “Schoenfeld’s Anatomy of Mathematical Reasoning by Language Models” has been selected as an ACL 2026 Oral 🎉 @aclmeeting

Mathematical reasoning has become one of the key frontiers for evaluating and improving large language models. Yet we still lack a clear picture of how these models organize their reasoning internally through natural language traces.

In this work, we propose ThinkARM, a framework for analyzing LLM mathematical reasoning from a cognitive science perspective. Building on Schoenfeld’s episode theory of mathematical problem solving, ThinkARM segments model reasoning into interpretable functional episodes such as Reading, Analysis, Exploration, Implementation, and Verification. This allows us to ask not only whether a model reaches the right answer, but how it moves through the reasoning process.

We believe this provides a useful step toward more interpretable analysis of LLM reasoning, and toward building models that reason not only more, but better.

Congrats to the collaborators Chenrui, @chengez1114, @FeiziSoheil and @zhoutianyi at UMD @umdcs

Paper: arxiv.org/pdf/2512.19995
Repo: github.com/MingLiiii/Thin…
1 reply · 4 reposts · 19 likes · 36.8K views
Arun Iyer retweeted
Emmy Liu (@_emliu)
Copying → morphology/translation → basic arithmetic → complex reasoning & math. Across every model family we tested, LLMs acquire skills in roughly the same order during pretraining. Can we use this to predict what a model will learn next, just from its internals? 🧵
[image]
16 replies · 62 reposts · 477 likes · 52K views
Arun Iyer retweeted
Konstantin Mishchenko (@konstmish)
That's a nice paper, very neat.
[image]
2 replies · 33 reposts · 181 likes · 22.5K views
Arun Iyer retweeted
Sungjin Ahn (@SungjinAhn_)
🧠 We introduce "Generative Recursive Reasoning"!

Recursive Reasoning Models like HRM, TRM, and Looped Transformers are deterministic: same input, same reasoning, every time. They collapse the entire space of plausible reasoning paths into a single attractor.

Our model GRAM (Generative Recursive reAsoning Models) turns recursion itself into a stochastic latent trajectory. Multiple hypotheses, alternative solution strategies, and inference-time scaling not just by depth, but by width: parallel trajectory sampling.

And here's the kicker: the same formulation that gives us conditional reasoning p(y|x) also makes GRAM a general generative model p(x).

With only 10M params:
• Sudoku-Extreme: 97.0% (TRM 87.4%)
• ARC-AGI-1: 52.0%
• ARC-AGI-2: 11.1%
• N-Queens coverage: 90%+

📄 Paper: arxiv.org/abs/2605.19376
🌐 Project page: ahn-ml.github.io/gram-website

w/ Junyeob Baek @JunyeobB (KAIST), Mingyu Jo @pyross0000 (KAIST), Minsu Kim @minsuuukim (KAIST & Mila), Mengye Ren @mengyer (NYU), Yoshua Bengio @Yoshua_Bengio (Mila), Sungjin Ahn @SungjinAhn_ (KAIST)
[3 images]
31 replies · 208 reposts · 1.5K likes · 179.1K views
Arun Iyer retweeted
Yuda Song (@yus167)
Exciting work! But in our February paper, "Reinforcement Learning with Text Feedback", we proposed the same methodology: predicting environment feedback on top of the RL loss. Nice to see this idea specialized to agentic terminal tasks, and the new insight this brings 💡. [1/2]
[image]
Quoted post by Dimitris Papailiopoulos (@DimitrisPapail):
x.com/i/article/2056…

4 replies · 22 reposts · 225 likes · 29.7K views
Arun Iyer retweeted
Jeonghye Kim (@beanie0__0)
Great to see RL with self-distillation (w/ text feedback) in agent setups being scaled to a production Cursor model!

If you're interested in this regime, I highly recommend "Exploratory Memory-Augmented LLM Agent via Hybrid On- and Off-Policy Optimization" (ICLR'26). In multi-turn agents interacting with external environments, it shows how agents can distill self-generated textual tips during RL training to correct past failures and explore more efficiently, achieving up to a 128.6% performance improvement 🚀

📄 Paper: arxiv.org/abs/2602.23008
📝 Blog: agent-lightning.github.io/posts/empo2/
💻 Code: github.com/microsoft/agen…
[GIF]
Quoted post by Cursor (@cursor_ai):
Introducing Composer 2.5, our most powerful model yet. It's more intelligent, better at sustained work on long-running tasks, and more reliable at following complex instructions. For the next week, we’re doubling the included usage of the model.

0 replies · 9 reposts · 46 likes · 4.7K views
Arun Iyer retweeted
Satwik Bhattamishra (@satwik1729)
Given black-box access to a Transformer's output, can we efficiently recover its parameters? We analyse the learnability of attention-based models with query access in our new work. Accepted at #ICML2026 🎉 Work done with @shahkulin98, @mhahn29 and Varun Kanade. 🧵
[image]
7 replies · 23 reposts · 163 likes · 22.1K views
Arun Iyer retweeted
Paria Rashidinejad (@paria_rd)
Looped Transformers: the dream was right. But there was trouble in paradise. The loop made them unstable, expensive, and memory-hungry, with gains hard to scale.

So we asked: Can we reap the rewards without paying the loop tax?

Introducing Attractor Models for Language and Reasoning:
• A Backbone proposes an initial “guess” output embedding;
• An Attractor refines it: a fixed-point solver lets the model “think” before each token.

Implicit differentiation trains the model stably, with constant memory and without BPTT.

Training also revealed a surprising phenomenon: Equilibrium Internalization. Over the course of training, the Backbone learns to propose latents close to the equilibrium itself, making the Attractor almost unnecessary at inference.

Results:
• Pareto improvement on language modeling: up to 46.6% lower perplexity and 19.7% better downstream accuracy. A 770M Attractor Model beats a 1.3B Transformer, despite being trained on half as many tokens.
• Significant gains on hard reasoning tasks: a 27M Attractor Model trained on only 1K examples achieves 91.4% on Sudoku-Extreme and 93.1% on Maze-Hard, while Transformers and frontier models like Claude and GPT o3 score 0%.

📝 arxiv.org/pdf/2605.12466
🧵 1/10
[image]
19 replies · 90 reposts · 590 likes · 64.2K views
Arun Iyer retweeted
Yuwei Zhang (@YuweiZh49446108)
On-policy self-distillation is a promising direction for learning from rich textual feedback. But can it really learn from failed trajectories? Our answer: not quite -- unless we let the model actively interpret them. 🧵1/N
[image]
10 replies · 59 reposts · 478 likes · 519.2K views
Arun Iyer retweeted
Linlu Qiu (@linluqiu)
Language is discrete. Language models don’t have to be. 🧚Introducing ELF🧚‍♀️: Embedded Language Flows—a class of diffusion models in continuous embedding space based on continuous-time Flow Matching 🧵
[image]
15 replies · 131 reposts · 806 likes · 135.2K views