Atula Tejaswi

334 posts

@atu_tej

CS PhD @UTCompSci | Currently working on Discrete Diffusion.

Joined October 2016
530 Following · 247 Followers
Pinned Tweet
Hongli Zhan @HongliZhan
PhD defended at UT Austin today.🤘 The best thing was having an advisor who believed in me before I believed in myself. Jessy taught me how to write, how to think, and how to chase research ideas. Then the rest followed. Thank you, @jessyjli
Atula Tejaswi @atu_tej
ICML folks: what do you do when a reviewer selects “(b) Partially satisfied - I have more questions for the authors,” keeps the score the same, but doesn’t actually ask any questions? We only get one more author response - how would you handle it?
Atula Tejaswi reposted
Kilian Weinberger @KilianQW
My talk at the IFML Symposium in Austin Texas about "Beyond Parametric Knowledge and Next Token Prediction" youtu.be/ivFViMCmWls?si…
Atula Tejaswi reposted
Rohan Jha @Robro612
Poster at LIR workshop next week: we compared training-free multi-vector compression methods head-to-head across some BEIR and CoIR sets. The takeaway is clear — pooling > pruning at all compression ratios for text. Not a huge surprise, but nice to have the controlled comparison.
Atula Tejaswi reposted
Justin T Chiu @justintchiu
First blog post in a while: How to differentiate through optimizers! One of my pet interests from grad school. justintchiu.com/blog/ift
Atula Tejaswi @atu_tej
It's time we switched to Diffusion, then :) diffusion-scaling.github.io
Samip @industriaalist

here's @JeffDean talking about how labs will do multi-epoch pretraining with heavy regularization to keep scaling even with limited data. no wonder slowrun gets so much attention from pretraining teams at big labs. pretraining is about to look very very different.

Atula Tejaswi reposted
Neel Guha @NeelGuha
I wrote a blogpost about writing machine learning research papers (e.g., NeurIPS, ICML, ICLR, etc.). The core idea is that most papers follow one of a predetermined set of templates. The post talks about each template, describes their rules, and offers examples...
Atula Tejaswi reposted
Oussama Zekri @oussamazekri_
What if discrete diffusion didn’t have to be stuck with mask or uniform noise? 🤔 In our new paper, we show how to go beyond them, unlocking much richer noising processes. And the empirical results are surprisingly strong! 🚀 🌐 Project Page: oussamazekri.fr/gdds 📑 Paper: arxiv.org/pdf/2603.21342 💻 Code: github.com/ozekri/gdds Thread below 🧵
Atula Tejaswi reposted
Shankar Padmanabhan @shankarpad8
1/5 How do we update a model trained in 2025 with new world knowledge from 2026? ⚠️Continued training will undo skills learned by LLMs during post-training, e.g. instruction-following/math/code. 🤝Our method DiSC updates LLMs with new knowledge while preserving existing skills!
Atula Tejaswi reposted
Tanya Goyal @tanyaagoyal
Check out @shankarpad8's new work on continual training for learning new **factual** knowledge. Tons of recent papers show that RL can mitigate forgetting, but what about settings where RL is not an option?
Shankar Padmanabhan @shankarpad8

1/5 How do we update a model trained in 2025 with new world knowledge from 2026? ⚠️Continued training will undo skills learned by LLMs during post-training, e.g. instruction-following/math/code. 🤝Our method DiSC updates LLMs with new knowledge while preserving existing skills!

Atula Tejaswi reposted
Ofir Press @OfirPress
If you work in AI you have to watch this talk by Moritz Hardt on the science of benchmarking. It talks about a lot of unexpected properties of benchmarks that I don't think most people are aware of; e.g. benchmarks can be incredibly noisy/imprecise and still be useful. 🔗⬇️
Atula Tejaswi @atu_tej
@zhuokaiz Interesting insights about Entropy for dLLMs. We show that you can directly use feedback from reward models to optimize generations at test time, and entropy of the dLLM plays a huge role there! arxiv.org/abs/2602.05000
Zhuokai Zhao @zhuokaiz
I wish someone had told me this when I started digging into diffusion language models (dLLMs) from an LLM post-training background. I've spent the last few weeks reading across both the dLLM RL literature (d1, EGSPO, MDPO, LLaDA 1.5) and the older robotics literature on diffusion policies + RL (DPPO, Diffusion-QL, and follow-up work). What surprised me most wasn't the algorithms themselves — it was realizing that the robotics community had already worked through several of the same problems the dLLM community is hitting now.

The robotics insight — structured exploration — doesn't transfer to discrete dLLMs as directly as I initially thought, but the broader lesson does. The multi-step denoising process isn't just an expensive way to generate tokens. It gives RL tools that autoregressive models don't have — intermediate evaluations, entropy signals, a natural coarse-to-fine hierarchy — and understanding how to use (and not break) these tools is probably one of the key challenges.

This post is me organizing what I've learned — how RL post-training works (or doesn't) with diffusion language models, what carries over from the autoregressive world, what's genuinely new, and where I'm still confused.

A Quick Intro to How dLLMs Generate

Autoregressive LLMs generate left-to-right, one token at a time, and each token choice is irreversible during generation. The probability of a sequence factorizes as a product of conditional distributions: p(x₁)·p(x₂|x₁)·p(x₃|x₁,x₂)·…

Diffusion language models generate through iterative denoising. The mainstream approach right now — masked diffusion (LLaDA, Dream, MDLM) — starts with the entire response masked, then over T denoising steps, progressively unmasks tokens. At each step, the model predicts all masked positions simultaneously using bidirectional attention, and selectively reveals the most confident predictions. The process repeats until all tokens are unmasked.
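The decoding loop described above can be sketched in a few lines. This is a toy sketch, not any particular model's implementation: the `propose` callable is a hypothetical stand-in for the model's forward pass, returning a (token, confidence) pair for each masked position.

```python
MASK = "<mask>"

def masked_diffusion_decode(length, propose, steps):
    """Start fully masked; at each step reveal the most confident
    proposals, in any order (coarse-to-fine), until nothing is masked."""
    tokens = [MASK] * length
    per_step = max(1, length // steps)
    for _ in range(steps):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        proposals = propose(tokens, masked)  # {pos: (token, confidence)}
        # Reveal the highest-confidence positions first -- note this is
        # confidence order, not left-to-right order.
        for i in sorted(masked, key=lambda i: proposals[i][1], reverse=True)[:per_step]:
            tokens[i] = proposals[i][0]
    return tokens
```

The key contrast with autoregressive decoding is visible in the inner loop: the reveal order is chosen by confidence, so the "skeleton" of a response can appear anywhere in the sequence first.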
Three properties of this process matter a lot for RL:

(a) No fixed generation order. Tokens can be revealed in any order — high-confidence tokens first, uncertain ones later. This means the model can lay down the skeleton of a response early and refine details later. Think of it as coarse-to-fine generation rather than left-to-right.

(b) Complete generations at every intermediate step. Unlike autoregressive models, where you have a partial sequence mid-generation, a dLLM produces a full (noisy) output at every denoising step. This turns out to be very useful for RL — you can evaluate intermediate states cheaply.

(c) No cheap exact autoregressive-style sequence log-probability. Autoregressive models give you log p(sequence) for free via the chain rule. dLLMs don't have an equally convenient sequence-level factorization for standard RL objectives, so exact likelihood-style updates become awkward and expensive. Practical methods usually rely on approximations, surrogates, or stepwise reformulations. This is one of the core obstacles for applying standard RL algorithms directly.

The field has moved fast over the last year or so. Notable models include LLaDA 8B (trained from scratch, reported by its authors as competitive with LLaMA 3 8B), Dream 7B (adapted from Qwen2.5, notably strong on planning tasks), Mercury 2 (Inception, focused on inference speed), and LLaDA 2.0 (scaled to 100B).

Where the Standard RL Pipeline Breaks

The standard RL post-training pipeline for autoregressive models is straightforward: (1) sample a response, (2) get a reward, (3) compute the log-probability of the response under the current policy, (4) estimate the advantage, (5) update with a policy gradient. The log-probability computation is trivial since you just sum per-token log-probs from the forward pass.

With dLLMs, this pipeline breaks at step 3. You can sample responses and get rewards just fine.
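The autoregressive side of this comparison is easy to make concrete. A minimal sketch of what the chain rule buys you, and of the likelihood ratio that PPO/GRPO-style objectives then build on (function names are illustrative, not from any library):

```python
import math

def ar_sequence_logprob(step_logprobs):
    """Chain rule for autoregressive models:
    log p(x_1..x_T) = sum_t log p(x_t | x_<t).
    step_logprobs holds the per-token conditional log-probs, all available
    from a single forward pass, so the sequence log-prob is just a sum."""
    return sum(step_logprobs)

def likelihood_ratio(logp_new, logp_old):
    """The ratio pi_new(x) / pi_old(x) that PPO/GRPO-style updates need.
    For dLLMs neither log-prob is exactly available in this convenient
    form, which is where the standard pipeline breaks."""
    return math.exp(logp_new - logp_old)
```

Everything downstream (clipping, advantage weighting) assumes these two quantities are cheap and exact, which is precisely the assumption masked diffusion violates.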
But you can't recover an exact autoregressive-style response log-probability with the same convenience, because there's no left-to-right chain-rule factorization. So RL methods that rely on likelihood ratios or preference-style likelihood comparisons (PPO, GRPO, DPO-style objectives) need some workaround. So far, a few approaches have emerged.

(a) Mean-field approximation (d1 / diffu-GRPO). Since the exact autoregressive-style sequence likelihood is unavailable in a convenient form, approximate it by treating token positions as roughly independent and summing per-token terms — similar in spirit to autoregressive likelihood computation, but ignoring some within-step dependencies. This is cheap and works surprisingly well in practice, but it is still an approximation, especially in early denoising steps where token predictions can be strongly correlated.

(b) ELBO-based estimates with variance reduction (LLaDA 1.5 / VRPO). Instead of computing the exact likelihood, these approaches use a tractable surrogate based on the ELBO, which is already central to diffusion-model training. The problem is that these estimates can be noisy — high variance makes preference-style updates unstable. LLaDA 1.5's key contribution is VRPO, which analyzes this variance explicitly and introduces variance-reduction techniques that make this route much more practical.

(c) Treat denoising as an MDP (EGSPO, MDPO, DiFFPO). This is the approach most analogous to DPPO in robotics. Formulate the T-step denoising process as a finite-horizon MDP where the state is the current partially denoised sequence, the action is the denoising decision at that step, and the reward is often sparse at the end, though some methods also use intermediate rewards. Each denoising step has tractable local transition probabilities. Then apply a policy gradient across the denoising chain.
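The mean-field surrogate in (a) above is simple enough to sketch. This is a rough illustration of the idea, not the d1/diffu-GRPO implementation: mask the whole response, take one forward pass, and score positions as if they were independent.

```python
import math

def mean_field_logprob(position_dists, response_tokens):
    """Mean-field surrogate (sketch):
    log p(x) ~= sum_i log p_theta(x_i | prompt, response fully masked).
    position_dists[i] is the (hypothetical) model distribution over the
    vocabulary at response position i, from a single forward pass with
    every response position masked. Cheap, but it ignores dependencies
    between positions, which are strongest early in denoising."""
    return sum(math.log(position_dists[i][tok])
               for i, tok in enumerate(response_tokens))
```

The appeal is cost: one forward pass per likelihood estimate, just like the autoregressive case. The price is the independence assumption, which is exactly what the text above flags as its weak point.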
A Parallel Story from Robotics

In robotics, from-scratch online RL for diffusion policies has proven challenging and often unstable or sample-inefficient enough to motivate alternatives and architectural workarounds. But in the fine-tuning regime — pretrain a diffusion policy from demonstrations, then improve with RL — the results are much better. DPPO reports strong gains over alternative fine-tuning baselines, including standard Gaussian PPO-style policies, especially in sim-to-real transfer. On the Furniture-Bench assembly task, DPPO achieves 80% real-robot success zero-shot from simulation, while a Gaussian PPO baseline achieves 88% in simulation and 0% on hardware.

The explanation offered by this line of work is structured, on-manifold exploration. In continuous action spaces, a pretrained diffusion policy denoises noisy actions back toward the data manifold. Each denoising step adds stochasticity (exploration) while also restoring structure, so the exploration stays in the neighborhood of plausible behavior rather than scattering across the full action space. This is why RL fine-tuning works despite the long denoising horizon — most sampled trajectories are still "reasonable," so even coarse credit assignment can produce useful gradients.

Now, this specific geometric mechanism doesn't transfer cleanly to dLLMs. In masked diffusion, the "actions" are discrete token predictions, not continuous vectors. There's no continuous score field pulling tokens back toward a manifold in the same way. But the broader principle does transfer — the denoising process is sequential structure that RL can exploit.

What the Denoising Structure Gives dLLM RL

The denoising chain gives dLLM RL methods specific tools that don't exist in the autoregressive setting.

(a) Iterative self-correction. dLLMs can revise tokens across denoising steps. d1 observed "aha moments" — the model initially commits to a wrong reasoning path, then during later denoising steps, corrects itself.
Autoregressive models can do chain-of-thought, but they can't go back and change earlier tokens. For RL, this means the policy has a built-in error-correction mechanism that RL doesn't need to learn from scratch.

(b) Free intermediate evaluations. Because dLLMs produce complete outputs at every denoising step, you can evaluate quality at intermediate steps without extra rollouts. MDPO exploits this directly — it checks whether the answer is correct at each denoising step and uses these intermediate rewards for credit assignment. The authors also discovered something interesting — over-denoising, where models sometimes get the right answer at an intermediate step, then "refine" it into a wrong answer. This is probably the dLLM version of RL over-optimization destroying a good pretrained policy.

(c) Entropy-guided compute allocation. EGSPO uses the model's entropy at each denoising step to decide where to spend training compute. High-entropy steps (where the model is most uncertain) get more gradient signal; low-entropy steps (where the model is confident) get less. The intuition is that you're directing optimization pressure where decisions are most consequential. My interpretation of this, in the structured-exploration framing, is that high entropy often marks denoising steps where the model has not yet committed to a stable solution, so optimization matters more there. Low-entropy steps are more settled and may offer less room for improvement.

(d) Denoising discount as an implicit regularizer. DPPO in robotics uses a denoising discount that downweights earlier (noisier) denoising steps in the policy gradient. My read is that this plays a role similar to regularization — it discourages RL from aggressively modifying the early, structure-establishing denoising steps, while allowing more freedom in later refinement steps. The same principle may apply to dLLMs — you want to preserve the coarse structure and optimize the fine-grained details more aggressively.
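The entropy weighting in (c) and the denoising discount in (d) both amount to per-step weights on a policy-gradient loss over the denoising chain. A minimal sketch of that combination follows; the function names and parameters are illustrative, not the published EGSPO or DPPO objectives.

```python
import math

def step_entropy(dist):
    """Entropy of one step's predictive distribution; high values mark
    steps where the model has not yet committed to a solution."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0.0)

def weighted_pg_loss(step_logprobs, advantage, gamma=0.9, entropy_weights=None):
    """REINFORCE-style loss over the T-step denoising chain:
    - gamma is a DPPO-style denoising discount that downweights earlier
      (noisier, structure-establishing) steps; the final step has weight 1;
    - entropy_weights optionally concentrates gradient signal on the
      uncertain steps, in the spirit of EGSPO's compute allocation."""
    T = len(step_logprobs)
    loss = 0.0
    for t, lp in enumerate(step_logprobs):
        w = gamma ** (T - 1 - t)          # earlier step => smaller weight
        if entropy_weights is not None:
            w *= entropy_weights[t]        # uncertain step => more signal
        loss += -w * advantage * lp
    return loss
```

Note the two mechanisms pull in compatible directions: the discount protects early coarse structure from aggressive updates, while entropy weighting spends the remaining gradient budget where decisions are still open.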
The Failure Modes We're Seeing

The robotics literature warns about specific failure modes, and we're already seeing some of the analogues in dLLMs.

(a) Mode collapse. This is a recurring concern in RL fine-tuning of diffusion models more broadly, including image-generation work and policy fine-tuning. RL optimization can collapse multimodal distributions toward a smaller set of reward-favored modes. dLLMs' ability to represent multiple valid responses (different reasoning paths, different coding styles) is a key advantage — but RL will try to compress this diversity. The DPPO paper argues that its specific setup is relatively robust to catastrophic collapse, but the broader diffusion-RL literature suggests this risk is real.

(b) Data/manifold bias. The pretrained distribution is bounded by pretraining + SFT data. If your SFT data only demonstrates one reasoning style, RL can optimize that style but can't easily discover fundamentally different approaches. The denoising process may make this harder to escape, since it actively pulls generations back toward the pretrained distribution.

(c) Over-denoising / over-optimization. MDPO's finding that models get correct answers at intermediate steps and then "refine" them into wrong final answers is the dLLM-specific version of RLHF over-optimization. The iterative structure that provides self-correction can also provide self-destruction if RL pushes too hard.

What This Suggests

If this framing is roughly right, then maybe we should:

(a) Invest heavily in pretraining and SFT quality, not just fancier RL. My current read is that the quality of the pretrained dLLM and SFT data may matter more than the choice between diffu-GRPO, EGSPO, or MDPO. The pretrained distribution appears to be doing a lot of the heavy lifting. If your pretrained model doesn't cover the relevant solution space, no amount of RL sophistication will find what isn't there.

(b) Exploit denoising structure for credit assignment.
The intermediate evaluations that dLLMs offer for free might be under-appreciated. MDPO and EGSPO are pointing the way. Use entropy-guided step selection. Use intermediate rewards. The denoising chain gives you structure that autoregressive models don't have, so why not use it?

(c) Be careful with early denoising steps. The early steps establish coarse structure — the overall shape of the response. Aggressively optimizing these risks destroying the pretrained distribution. Consider denoising discounting, or only fine-tuning later denoising steps, or using larger clipping ratios for early steps. DPPO in robotics found that fine-tuning only the last K' of K denoising steps can work well — the same principle likely applies.

(d) Monitor for over-denoising. Track performance at intermediate denoising steps, not just the final output. If intermediate steps consistently outperform the final output after RL, you're over-optimizing. This is a dLLM-specific early warning system for reward hacking.

(e) Take mode collapse seriously. If the task has multiple valid solution strategies, check that RL preserves them. Measure output diversity, not just reward. KL from the reference model is necessary but probably not sufficient.

What I Still Don't Know

1. Does the denoising structure actually help RL quantitatively? The robotics evidence is strong — DPPO clearly outperforms Gaussian PPO in the fine-tuning regime. For dLLMs, the comparison would be whether diffu-GRPO on a dLLM produces more stable or efficient RL fine-tuning than standard GRPO on an equivalently pretrained autoregressive model. I haven't seen this head-to-head comparison done cleanly. d1 shows diffu-GRPO works, but doesn't compare against autoregressive GRPO with matched pretraining quality.

2. Is the planning advantage real? Dream 7B reports substantially stronger results than Qwen2.5 7B on several planning-style tasks (for example, Countdown 16.0 vs 6.2 and Sudoku 81.0 vs 21.0 in the paper's evaluation).
Is this because the non-autoregressive generation structure is genuinely better for constraint satisfaction, or is it an artifact of evaluation methodology? If it's real, it suggests dLLMs + RL could be particularly powerful for agentic tasks that require planning.

3. How far does this scale? DPPO in robotics works for 7-DOF manipulation but hasn't been tested on truly high-dimensional action spaces. dLLMs operate in vocabulary-size action spaces (32K+). Do the denoising-structure advantages hold at this scale?

4. Can you escape the pretrained distribution when you need to? The denoising process constrains RL to stay near the pretrained distribution, which helps stability but limits what RL can discover. For genuinely novel reasoning, not just refinement of existing patterns, you may need to break free. What's the dLLM equivalent of off-distribution exploration?

What I keep coming back to is that when you move from autoregressive to diffusion generation, the denoising chain provides exploitable structure for RL, but it also constrains what RL can do. The methods that seem to work best are the ones that take both sides of this seriously — exploiting the structure where it helps, and being careful not to destroy it where it matters.
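The over-denoising monitor suggested in (d) above reduces to a simple batch statistic. This is a sketch with an illustrative interface: for each rollout you record whether any intermediate denoising step already had a correct answer and whether the final output is correct.

```python
def over_denoising_rate(intermediate_correct, final_correct):
    """Fraction of rollouts where a correct intermediate answer was
    'refined' into a wrong final one. intermediate_correct[k] is True if
    any intermediate denoising step of rollout k produced a correct
    answer; final_correct[k] is True if the final output is correct.
    A value that rises during RL is the over-optimization warning sign
    described above."""
    regressed = sum(1 for mid, fin in zip(intermediate_correct, final_correct)
                    if mid and not fin)
    return regressed / len(final_correct)
```

Tracking this per training epoch costs nothing extra, since dLLMs already produce the complete intermediate outputs at every denoising step.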
Litu Rout @litu_rout_
Excited to share that I've joined Google DeepMind as a Senior Research Scientist, working on Gemini! @isro ➡️ PhD @UTAustin ➡️ @GoogleDeepMind. Industry to academia is a leap many hesitate to take. For me, it felt natural. Enjoyed every moment. Looking forward to what lies ahead!
Atula Tejaswi reposted
Yifan Zhang @yifan_zhang_
After 18 months of hard work by Tomas and Zhen, we cooked it! 🚀 Thanks to all friends who give constructive feedback! Deep Learning 2.0, Rethinking every fundamental cornerstone of Modern Foundation Models. It's just the beginning, Hyped! 🚀 github.com/FlashSampling/…
Atula Tejaswi @atu_tej
@anirudhg9119 I'm really looking forward to the LLM version of Global Workspace :), my prediction is we'll see it positioned as some form of memory
Diego del Alamo @DdelAlamo
So I can't say I've ever seen residual cross-attention before (where the final representations attend to earlier representations of the input data); is there any literature on when and where to use this?
Rishabh Anand @rishabh16_

🚨 New preprint!!! Introducing Zatom-1, a multi-modal generative foundation model for 3D small molecules and materials that operates fully in ambient space. Its embeddings are also useful for downstream molecular predictive tasks (properties, MLIPs, etc). 1/n
