Atula Tejaswi

334 posts

@atu_tej

CS PhD @UTCompSci | Currently working on Discrete Diffusion.

Joined October 2016
530 Following · 247 Followers
Pinned Tweet
Hongli Zhan @HongliZhan
PhD defended at UT Austin today.🤘 The best thing was having an advisor who believed in me before I believed in myself. Jessy taught me how to write, how to think, and how to chase research ideas. Then the rest followed. Thank you, @jessyjli
Atula Tejaswi @atu_tej
ICML folks: what do you do when a reviewer selects “(b) Partially satisfied - I have more questions for the authors,” keeps the score the same, but doesn’t actually ask any questions? We only get one more author response - how would you handle it?
Atula Tejaswi reposted
Kilian Weinberger @KilianQW
My talk at the IFML Symposium in Austin Texas about "Beyond Parametric Knowledge and Next Token Prediction" youtu.be/ivFViMCmWls?si…
Atula Tejaswi reposted
Rohan Jha @Robro612
Poster at LIR workshop next week: we compared training-free multi-vector compression methods head-to-head across some BEIR and CoIR sets. The takeaway is clear — pooling > pruning at all compression ratios for text. Not a huge surprise, but nice to have the controlled comparison.
Atula Tejaswi reposted
Justin T Chiu @justintchiu
First blog post in a while: How to differentiate through optimizers! One of my pet interests from grad school. justintchiu.com/blog/ift
Atula Tejaswi @atu_tej
It's time we switched to Diffusion, then :) diffusion-scaling.github.io
Samip @industriaalist

here's @JeffDean talking about how labs will do multi-epoch pretraining with heavy regularization to keep scaling even with limited data. no wonder slowrun gets so much attention from pretraining teams at big labs. pretraining is about to look very very different.

Atula Tejaswi reposted
Neel Guha @NeelGuha
I wrote a blogpost about writing machine learning research papers (e.g., NeurIPS, ICML, ICLR, etc.). The core idea is that most papers follow one of a predetermined set of templates. The post talks about each template, describes their rules, and offers examples...
Atula Tejaswi reposted
Oussama Zekri @oussamazekri_
What if discrete diffusion didn’t have to be stuck with mask or uniform noise? 🤔 In our new paper, we show how to go beyond them, unlocking much richer noising processes. And the empirical results are surprisingly strong! 🚀 🌐 Project Page: oussamazekri.fr/gdds 📑 Paper: arxiv.org/pdf/2603.21342 💻 Code: github.com/ozekri/gdds Thread below 🧵
Atula Tejaswi reposted
Shankar Padmanabhan @shankarpad8
1/5 How do we update a model trained in 2025 with new world knowledge from 2026? ⚠️Continued training will undo skills learned by LLMs during post-training, e.g. instruction-following/math/code. 🤝Our method DiSC updates LLMs with new knowledge while preserving existing skills!
Atula Tejaswi reposted
Tanya Goyal @tanyaagoyal
Check out @shankarpad8's new work on continual training for learning new **factual** knowledge. Tons of recent papers show that RL can mitigate forgetting, but what about settings where RL is not an option?
Shankar Padmanabhan @shankarpad8

1/5 How do we update a model trained in 2025 with new world knowledge from 2026? ⚠️Continued training will undo skills learned by LLMs during post-training, e.g. instruction-following/math/code. 🤝Our method DiSC updates LLMs with new knowledge while preserving existing skills!

Atula Tejaswi reposted
Ofir Press @OfirPress
If you work in AI you have to watch this talk by Moritz Hardt on the science of benchmarking. It talks about a lot of unexpected properties of benchmarks that I don't think most people are aware of; e.g. benchmarks can be incredibly noisy/imprecise and still be useful. 🔗⬇️
Atula Tejaswi @atu_tej
@zhuokaiz Interesting insights about Entropy for dLLMs. We show that you can directly use feedback from reward models to optimize generations at test time, and entropy of the dLLM plays a huge role there! arxiv.org/abs/2602.05000
Zhuokai Zhao @zhuokaiz
I wish someone had told me this when I started digging into diffusion language models (dLLMs) from an LLM post-training background. I've spent the last few weeks reading across both the dLLM RL literature (d1, EGSPO, MDPO, LLaDA 1.5) and the older robotics literature on diffusion policies + RL (DPPO, Diffusion-QL, and follow-up work). What surprised me most wasn't the algorithms themselves — it was realizing that the robotics community had already worked through several of the same problems the dLLM community is hitting now.

The robotics insight — structured exploration — doesn't transfer to discrete dLLMs as directly as I initially thought, but the broader lesson does. The multi-step denoising process isn't just an expensive way to generate tokens. It gives RL tools that autoregressive models don't have — intermediate evaluations, entropy signals, a natural coarse-to-fine hierarchy — and understanding how to use (and not break) these tools is probably one of the key challenges.

This post is me organizing what I've learned — how RL post-training works (or doesn't) with diffusion language models, what carries over from the autoregressive world, what's genuinely new, and where I'm still confused.

A Quick Intro to How dLLMs Generate

Autoregressive LLMs generate left-to-right, one token at a time, and each token choice is irreversible during generation. The probability of a sequence factorizes as a product of conditional distributions: p(x₁)·p(x₂|x₁)·p(x₃|x₁,x₂)·…

Diffusion language models generate through iterative denoising. The mainstream approach right now — masked diffusion (LLaDA, Dream, MDLM) — starts with the entire response masked, then over T denoising steps, progressively unmasks tokens. At each step, the model predicts all masked positions simultaneously using bidirectional attention, and selectively reveals the most confident predictions. The process repeats until all tokens are unmasked.
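The decoding loop described above can be sketched in a few lines. This is a toy sketch, not any particular model's implementation: the `propose` callable is a hypothetical stand-in for the model's forward pass, returning a (token, confidence) pair for each masked position.

```python
MASK = "<mask>"

def masked_diffusion_decode(length, propose, steps):
    """Start fully masked; at each step reveal the most confident
    proposals, in any order (coarse-to-fine), until nothing is masked."""
    tokens = [MASK] * length
    per_step = max(1, length // steps)
    for _ in range(steps):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        proposals = propose(tokens, masked)  # {pos: (token, confidence)}
        # Reveal the highest-confidence positions first -- note this is
        # confidence order, not left-to-right order.
        for i in sorted(masked, key=lambda i: proposals[i][1], reverse=True)[:per_step]:
            tokens[i] = proposals[i][0]
    return tokens
```

The key contrast with autoregressive decoding is visible in the inner loop: the reveal order is chosen by confidence, so the "skeleton" of a response can appear anywhere in the sequence first.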
Three properties of this process matter a lot for RL:

(a) No fixed generation order. Tokens can be revealed in any order — high-confidence tokens first, uncertain ones later. This means the model can lay down the skeleton of a response early and refine details later. Think of it as coarse-to-fine generation rather than left-to-right.

(b) Complete generations at every intermediate step. Unlike autoregressive models, where you have a partial sequence mid-generation, a dLLM produces a full (noisy) output at every denoising step. This turns out to be very useful for RL — you can evaluate intermediate states cheaply.

(c) No cheap exact autoregressive-style sequence log-probability. Autoregressive models give you log p(sequence) for free via the chain rule. dLLMs don't have an equally convenient sequence-level factorization for standard RL objectives, so exact likelihood-style updates become awkward and expensive. Practical methods usually rely on approximations, surrogates, or stepwise reformulations. This is one of the core obstacles for applying standard RL algorithms directly.

The field has moved fast over the last year or so. Notable models include LLaDA 8B (trained from scratch, reported by its authors as competitive with LLaMA 3 8B), Dream 7B (adapted from Qwen2.5, notably strong on planning tasks), Mercury 2 (Inception, focused on inference speed), and LLaDA 2.0 (scaled to 100B).

Where the Standard RL Pipeline Breaks

The standard RL post-training pipeline for autoregressive models is straightforward: (1) sample a response, (2) get a reward, (3) compute the log-probability of the response under the current policy, (4) estimate the advantage, (5) update with a policy gradient. The log-probability computation is trivial since you just sum per-token log-probs from the forward pass.

With dLLMs, this pipeline breaks at step 3. You can sample responses and get rewards just fine.
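The autoregressive side of this comparison is easy to make concrete. A minimal sketch of what the chain rule buys you, and of the likelihood ratio that PPO/GRPO-style objectives then build on (function names are illustrative, not from any library):

```python
import math

def ar_sequence_logprob(step_logprobs):
    """Chain rule for autoregressive models:
    log p(x_1..x_T) = sum_t log p(x_t | x_<t).
    step_logprobs holds the per-token conditional log-probs, all available
    from a single forward pass, so the sequence log-prob is just a sum."""
    return sum(step_logprobs)

def likelihood_ratio(logp_new, logp_old):
    """The ratio pi_new(x) / pi_old(x) that PPO/GRPO-style updates need.
    For dLLMs neither log-prob is exactly available in this convenient
    form, which is where the standard pipeline breaks."""
    return math.exp(logp_new - logp_old)
```

Everything downstream (clipping, advantage weighting) assumes these two quantities are cheap and exact, which is precisely the assumption masked diffusion violates.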
But you can't recover an exact autoregressive-style response log-probability with the same convenience, because there's no left-to-right chain-rule factorization. So RL methods that rely on likelihood ratios or preference-style likelihood comparisons (PPO, GRPO, DPO-style objectives) need some workaround. So far, a few approaches have emerged.

(a) Mean-field approximation (d1 / diffu-GRPO). Since the exact autoregressive-style sequence likelihood is unavailable in a convenient form, approximate it by treating token positions as roughly independent and summing per-token terms — similar in spirit to autoregressive likelihood computation, but ignoring some within-step dependencies. This is cheap and works surprisingly well in practice, but it is still an approximation, especially in early denoising steps where token predictions can be strongly correlated.

(b) ELBO-based estimates with variance reduction (LLaDA 1.5 / VRPO). Instead of computing the exact likelihood, these approaches use a tractable surrogate based on the ELBO, which is already central to diffusion-model training. The problem is that these estimates can be noisy — high variance makes preference-style updates unstable. LLaDA 1.5's key contribution is VRPO, which analyzes this variance explicitly and introduces variance-reduction techniques that make this route much more practical.

(c) Treat denoising as an MDP (EGSPO, MDPO, DiFFPO). This is the approach most analogous to DPPO in robotics. Formulate the T-step denoising process as a finite-horizon MDP where the state is the current partially denoised sequence, the action is the denoising decision at that step, and the reward is often sparse at the end, though some methods also use intermediate rewards. Each denoising step has tractable local transition probabilities. Then apply a policy gradient across the denoising chain.
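The mean-field surrogate in (a) above is simple enough to sketch. This is a rough illustration of the idea, not the d1/diffu-GRPO implementation: mask the whole response, take one forward pass, and score positions as if they were independent.

```python
import math

def mean_field_logprob(position_dists, response_tokens):
    """Mean-field surrogate (sketch):
    log p(x) ~= sum_i log p_theta(x_i | prompt, response fully masked).
    position_dists[i] is the (hypothetical) model distribution over the
    vocabulary at response position i, from a single forward pass with
    every response position masked. Cheap, but it ignores dependencies
    between positions, which are strongest early in denoising."""
    return sum(math.log(position_dists[i][tok])
               for i, tok in enumerate(response_tokens))
```

The appeal is cost: one forward pass per likelihood estimate, just like the autoregressive case. The price is the independence assumption, which is exactly what the text above flags as its weak point.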
A Parallel Story from Robotics

In robotics, from-scratch online RL for diffusion policies has proven challenging and often unstable or sample-inefficient enough to motivate alternatives and architectural workarounds. But in the fine-tuning regime — pretrain a diffusion policy from demonstrations, then improve with RL — the results are much better. DPPO reports strong gains over alternative fine-tuning baselines, including standard Gaussian PPO-style policies, especially in sim-to-real transfer. On the Furniture-Bench assembly task, DPPO achieves 80% real-robot success zero-shot from simulation, while a Gaussian PPO baseline achieves 88% in simulation and 0% on hardware.

The explanation offered by this line of work is structured, on-manifold exploration. In continuous action spaces, a pretrained diffusion policy denoises noisy actions back toward the data manifold. Each denoising step adds stochasticity (exploration) while also restoring structure, so the exploration stays in the neighborhood of plausible behavior rather than scattering across the full action space. This is why RL fine-tuning works despite the long denoising horizon — most sampled trajectories are still "reasonable," so even coarse credit assignment can produce useful gradients.

Now, this specific geometric mechanism doesn't transfer cleanly to dLLMs. In masked diffusion, the "actions" are discrete token predictions, not continuous vectors. There's no continuous score field pulling tokens back toward a manifold in the same way. But the broader principle does transfer — the denoising process is sequential structure that RL can exploit.

What the Denoising Structure Gives dLLM RL

The denoising chain gives dLLM RL methods specific tools that don't exist in the autoregressive setting.

(a) Iterative self-correction. dLLMs can revise tokens across denoising steps. d1 observed "aha moments" — the model initially commits to a wrong reasoning path, then during later denoising steps, corrects itself.
Autoregressive models can do chain-of-thought, but they can't go back and change earlier tokens. For RL, this means the policy has a built-in error-correction mechanism that RL doesn't need to learn from scratch.

(b) Free intermediate evaluations. Because dLLMs produce complete outputs at every denoising step, you can evaluate quality at intermediate steps without extra rollouts. MDPO exploits this directly — it checks whether the answer is correct at each denoising step and uses these intermediate rewards for credit assignment. The authors also discovered something interesting — over-denoising, where models sometimes get the right answer at an intermediate step, then "refine" it into a wrong answer. This is probably the dLLM version of RL over-optimization destroying a good pretrained policy.

(c) Entropy-guided compute allocation. EGSPO uses the model's entropy at each denoising step to decide where to spend training compute. High-entropy steps (where the model is most uncertain) get more gradient signal; low-entropy steps (where the model is confident) get less. The intuition is that you're directing optimization pressure where decisions are most consequential. My interpretation of this, in the structured-exploration framing, is that high entropy often marks denoising steps where the model has not yet committed to a stable solution, so optimization matters more there. Low-entropy steps are more settled and may offer less room for improvement.

(d) Denoising discount as an implicit regularizer. DPPO in robotics uses a denoising discount that downweights earlier (noisier) denoising steps in the policy gradient. My read is that this plays a role similar to regularization — it discourages RL from aggressively modifying the early, structure-establishing denoising steps, while allowing more freedom in later refinement steps. The same principle may apply to dLLMs — you want to preserve the coarse structure and optimize the fine-grained details more aggressively.
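The entropy weighting in (c) and the denoising discount in (d) both amount to per-step weights on a policy-gradient loss over the denoising chain. A minimal sketch of that combination follows; the function names and parameters are illustrative, not the published EGSPO or DPPO objectives.

```python
import math

def step_entropy(dist):
    """Entropy of one step's predictive distribution; high values mark
    steps where the model has not yet committed to a solution."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0.0)

def weighted_pg_loss(step_logprobs, advantage, gamma=0.9, entropy_weights=None):
    """REINFORCE-style loss over the T-step denoising chain:
    - gamma is a DPPO-style denoising discount that downweights earlier
      (noisier, structure-establishing) steps; the final step has weight 1;
    - entropy_weights optionally concentrates gradient signal on the
      uncertain steps, in the spirit of EGSPO's compute allocation."""
    T = len(step_logprobs)
    loss = 0.0
    for t, lp in enumerate(step_logprobs):
        w = gamma ** (T - 1 - t)          # earlier step => smaller weight
        if entropy_weights is not None:
            w *= entropy_weights[t]        # uncertain step => more signal
        loss += -w * advantage * lp
    return loss
```

Note the two mechanisms pull in compatible directions: the discount protects early coarse structure from aggressive updates, while entropy weighting spends the remaining gradient budget where decisions are still open.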
The Failure Modes We're Seeing

The robotics literature warns about specific failure modes, and we're already seeing some of the analogues in dLLMs.

(a) Mode collapse. This is a recurring concern in RL fine-tuning of diffusion models more broadly, including image-generation work and policy fine-tuning. RL optimization can collapse multimodal distributions toward a smaller set of reward-favored modes. dLLMs' ability to represent multiple valid responses (different reasoning paths, different coding styles) is a key advantage — but RL will try to compress this diversity. The DPPO paper argues that its specific setup is relatively robust to catastrophic collapse, but the broader diffusion-RL literature suggests this risk is real.

(b) Data/manifold bias. The pretrained distribution is bounded by pretraining + SFT data. If your SFT data only demonstrates one reasoning style, RL can optimize that style but can't easily discover fundamentally different approaches. The denoising process may make this harder to escape, since it actively pulls generations back toward the pretrained distribution.

(c) Over-denoising / over-optimization. MDPO's finding that models get correct answers at intermediate steps and then "refine" them into wrong final answers is the dLLM-specific version of RLHF over-optimization. The iterative structure that provides self-correction can also provide self-destruction if RL pushes too hard.

What This Suggests

If this framing is roughly right, then maybe we should:

(a) Invest heavily in pretraining and SFT quality, not just fancier RL. My current read is that the quality of the pretrained dLLM and SFT data may matter more than the choice between diffu-GRPO, EGSPO, or MDPO. The pretrained distribution appears to be doing a lot of the heavy lifting. If your pretrained model doesn't cover the relevant solution space, no amount of RL sophistication will find what isn't there.

(b) Exploit denoising structure for credit assignment.
The intermediate evaluations that dLLMs offer for free might be under-appreciated. MDPO and EGSPO are pointing the way. Use entropy-guided step selection. Use intermediate rewards. The denoising chain gives you structure that autoregressive models don't have, so why not use it?

(c) Be careful with early denoising steps. The early steps establish coarse structure — the overall shape of the response. Aggressively optimizing these risks destroying the pretrained distribution. Consider denoising discounting, or only fine-tuning later denoising steps, or using larger clipping ratios for early steps. DPPO in robotics found that fine-tuning only the last K' of K denoising steps can work well — the same principle likely applies.

(d) Monitor for over-denoising. Track performance at intermediate denoising steps, not just the final output. If intermediate steps consistently outperform the final output after RL, you're over-optimizing. This is a dLLM-specific early warning system for reward hacking.

(e) Take mode collapse seriously. If the task has multiple valid solution strategies, check that RL preserves them. Measure output diversity, not just reward. KL from the reference model is necessary but probably not sufficient.

What I Still Don't Know

1. Does the denoising structure actually help RL quantitatively? The robotics evidence is strong — DPPO clearly outperforms Gaussian PPO in the fine-tuning regime. For dLLMs, the comparison would be whether diffu-GRPO on a dLLM produces more stable or efficient RL fine-tuning than standard GRPO on an equivalently pretrained autoregressive model. I haven't seen this head-to-head comparison done cleanly. d1 shows diffu-GRPO works, but doesn't compare against autoregressive GRPO with matched pretraining quality.

2. Is the planning advantage real? Dream 7B reports substantially stronger results than Qwen2.5 7B on several planning-style tasks (for example, Countdown 16.0 vs 6.2 and Sudoku 81.0 vs 21.0 in the paper's evaluation).
Is this because the non-autoregressive generation structure is genuinely better for constraint satisfaction, or is it an artifact of evaluation methodology? If it's real, it suggests dLLMs + RL could be particularly powerful for agentic tasks that require planning.

3. How far does this scale? DPPO in robotics works for 7-DOF manipulation but hasn't been tested on truly high-dimensional action spaces. dLLMs operate in vocabulary-size action spaces (32K+). Do the denoising-structure advantages hold at this scale?

4. Can you escape the pretrained distribution when you need to? The denoising process constrains RL to stay near the pretrained distribution, which helps stability but limits what RL can discover. For genuinely novel reasoning, not just refinement of existing patterns, you may need to break free. What's the dLLM equivalent of off-distribution exploration?

What I keep coming back to is that when you move from autoregressive to diffusion generation, the denoising chain provides exploitable structure for RL, but it also constrains what RL can do. The methods that seem to work best are the ones that take both sides of this seriously — exploiting the structure where it helps, and being careful not to destroy it where it matters.
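The over-denoising monitor suggested in (d) above reduces to a simple batch statistic. This is a sketch with an illustrative interface: for each rollout you record whether any intermediate denoising step already had a correct answer and whether the final output is correct.

```python
def over_denoising_rate(intermediate_correct, final_correct):
    """Fraction of rollouts where a correct intermediate answer was
    'refined' into a wrong final one. intermediate_correct[k] is True if
    any intermediate denoising step of rollout k produced a correct
    answer; final_correct[k] is True if the final output is correct.
    A value that rises during RL is the over-optimization warning sign
    described above."""
    regressed = sum(1 for mid, fin in zip(intermediate_correct, final_correct)
                    if mid and not fin)
    return regressed / len(final_correct)
```

Tracking this per training epoch costs nothing extra, since dLLMs already produce the complete intermediate outputs at every denoising step.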
Litu Rout @litu_rout_
Excited to share that I've joined Google DeepMind as a Senior Research Scientist, working on Gemini! @isro ➡️ PhD @UTAustin ➡️ @GoogleDeepMind. Industry to academia is a leap many hesitate to take. For me, it felt natural. Enjoyed every moment. Looking forward to what lies ahead!
Atula Tejaswi reposted
Yifan Zhang @yifan_zhang_
After 18 months of hard work by Tomas and Zhen, we cooked it! 🚀 Thanks to all friends who give constructive feedback! Deep Learning 2.0, Rethinking every fundamental cornerstone of Modern Foundation Models. It's just the beginning, Hyped! 🚀 github.com/FlashSampling/…
Atula Tejaswi @atu_tej
@anirudhg9119 I'm really looking forward to the LLM version of Global Workspace :), my prediction is we'll see it positioned as some form of memory
Diego del Alamo @DdelAlamo
So I can't say I've ever seen residual cross-attention before (where the final representations attend to earlier representations of the input data); is there any literature on when and where to use this?
Rishabh Anand @rishabh16_

🚨 New preprint!!! Introducing Zatom-1, a multi-modal generative foundation model for 3D small molecules and materials that operates fully in ambient space. Its embeddings are also useful for downstream molecular predictive tasks (properties, MLIPs, etc). 1/n
