
Molei Tao
@MoleiTaoMath
Georgia Tech Prof; Tsinghua, Caltech, NYU Courant · deep learning theory · (diffusion) generative models, probabilistic ML · AI4Science · applied & computational math








Join us this Tuesday for a talk by Ye He: “Diffusion Model’s Generalization via Data-Dependent Ridge Manifolds”. The talk gives a geometric view of what a learned diffusion model generates, why ridge manifolds matter, and how this helps explain inference dynamics.







RL is the engine behind reasoning in AR-LLMs. But for diffusion LLMs? Existing methods mostly port AR algorithms over with some modifications, ignoring what makes dLLMs special and paying the price in speed. We propose 𝗗𝗠𝗣𝗢 (𝗗𝗶𝘀𝘁𝗿𝗶𝗯𝘂𝘁𝗶𝗼𝗻 𝗠𝗮𝘁𝗰𝗵𝗶𝗻𝗴 𝗣𝗼𝗹𝗶𝗰𝘆 𝗢𝗽𝘁𝗶𝗺𝗶𝘇𝗮𝘁𝗶𝗼𝗻): an efficient, effective RL method designed for dLLMs from the ground up. Forward-only, off-policy, theoretically grounded ⚡ 🔗 arxiv.org/abs/2510.08233
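
Since the post only names the ingredients (distribution matching, forward-only, off-policy), here is a minimal hedged sketch of what that combination can look like in general: fit the policy to a reward-tilted target distribution by weighted maximum likelihood on samples drawn once from a frozen reference. This is a generic reward-weighted-regression-style toy, not the DMPO objective from the paper; the categorical policy, `beta`, and the reward are illustrative assumptions.

```python
import torch

# Hedged sketch: a forward-only, off-policy distribution-matching update.
# NOT the DMPO loss from the paper; toy policy and reward are ours.

torch.manual_seed(0)
vocab = 10
logits = torch.zeros(vocab, requires_grad=True)   # toy policy parameters

# Off-policy data: sampled once from a frozen reference, then reused
ref = torch.distributions.Categorical(logits=torch.zeros(vocab))
y = ref.sample((256,))                            # fixed rollouts
reward = (y == 3).float()                         # toy reward: prefer token 3

beta = 0.5
# Target distribution ∝ pi_ref(y) * exp(r(y)/beta); with samples from
# pi_ref, self-normalized weights exp(r/beta)/sum realize the tilt
w = torch.softmax(reward / beta, dim=0).detach()

opt = torch.optim.Adam([logits], lr=0.1)
for _ in range(200):
    # Forward-only: just log-probs of the fixed samples, no new rollouts
    logp = torch.distributions.Categorical(logits=logits).log_prob(y)
    loss = -(w * logp).sum()      # weighted MLE toward the tilted target
    opt.zero_grad(); loss.backward(); opt.step()

print(torch.softmax(logits, dim=0)[3])  # probability mass shifts to token 3
```

The off-policy, forward-only character comes from reusing the frozen-reference samples and needing only likelihood evaluations of them, never fresh generations from the current policy.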



[1/D] 🤔 What are drifting models really connected to?
📢 Our new paper, A Unified View of Drifting and Score-Based Models, shows that the bridge to score-based models is clear and precise (w/ team and @mittu1204, @StefanoErmon, @MoleiTaoMath)!
✍️ Main takeaway: drifting is more closely connected to score-based (diffusion) modeling than it may first appear!
🔗 arxiv.org/abs/2603.07514
🎯 Here’s why: drifting’s mean-shift moves a sample toward the kernel-weighted average of nearby samples, while the score function points toward regions of higher density. Both describe local directions that push samples toward where data is denser. We show that this link is exact for Gaussian kernels (Section 4.1):
📌 Drifting’s mean-shift = a rescaled score-matching field between the Gaussian-smoothed data and model distributions, i.e., the vector field underlying score matching (Tweedie!).
📌 This also clarifies the bridge to Distribution Matching Distillation (DMD): both use score-based transport directions and differ only in how the score is realized. Drifting realizes it nonparametrically through kernel neighborhoods, whereas DMD relies on a pretrained diffusion teacher.
🤔 So what happens for the default Laplace kernel used in drifting models? Let’s look below 👇
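
The first 📌 is easy to sanity-check numerically on the data side: with a Gaussian kernel of bandwidth h, the mean-shift vector at a point x equals h² times the score of the Gaussian KDE built on the samples (a Tweedie-style identity). A minimal sketch on toy data; all names here are ours:

```python
import numpy as np

# Check: mean-shift with a Gaussian kernel == h^2 * score of the
# Gaussian-smoothed (KDE) sample distribution, evaluated at x.

rng = np.random.default_rng(0)
data = rng.normal(size=(500, 2))     # toy samples y_i in R^2
h = 0.7                              # kernel bandwidth
x = np.array([0.3, -1.2])            # query point

# Normalized Gaussian kernel weights w_i ∝ exp(-||x - y_i||^2 / (2 h^2))
d = data - x                         # displacements y_i - x
w = np.exp(-np.sum(d**2, axis=1) / (2 * h**2))
w /= w.sum()

# Mean-shift: kernel-weighted average of the samples, minus x
mean_shift = w @ data - x

# Score of p_h(x) = (1/n) sum_i N(x; y_i, h^2 I):
# grad_x log p_h(x) = (1/h^2) sum_i w_i (y_i - x)
score = (w @ d) / h**2

print(np.allclose(mean_shift, h**2 * score))  # True
```

The same weights w_i appear in both quantities, which is why the two directions coincide exactly up to the factor h²; repeating this on the model samples gives the difference-of-scores field described above.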






We proudly present “Rethinking the Design Space of RL for Diffusion Models”, showing that ELBO-based likelihood estimation (from the final sample) is the dominant driver of stable, efficient RL fine-tuning. On SD3.5-Medium, we boost GenEval 0.24 → 0.95 in ~90 GPU hours, beating FlowGRPO (4.6× more efficient) and DiffusionNFT (2× more efficient). Great collab with @YongxinChen1 @YuchenZhu_ZYC @WeiGuo01 @MoleiTaoMath Petr Molodyk, Bo Yuan, Jinbin Bai, Yi Xin. 🎥 Video attached 📍 Link: arxiv.org/abs/2602.04663 #DiffusionModels #ReinforcementLearning #TextToImage #GenAI
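
To make the recipe concrete, here is a hedged sketch of a REINFORCE-style update in which the exact log-likelihood of the final sample is replaced by a single-timestep ELBO estimate (a denoising loss) on x0. The `TinyDenoiser`, the noise schedule, and the toy reward are our assumptions for illustration, not the paper's implementation:

```python
import torch
import torch.nn as nn

# Hedged sketch: policy-gradient step where log p_theta(x0) is
# approximated by a one-sample ELBO (denoising) estimate on the
# final sample. Toy model/schedule/reward, not the paper's code.

class TinyDenoiser(nn.Module):
    def __init__(self, d=2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d + 1, 64), nn.SiLU(), nn.Linear(64, d))

    def forward(self, x, t):
        return self.net(torch.cat([x, t], dim=-1))   # predicts the noise eps

model = TinyDenoiser()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def elbo_estimate(x0):
    # One-sample ELBO surrogate: negative denoising error at a random t
    t = torch.rand(x0.shape[0], 1)
    eps = torch.randn_like(x0)
    a, s = torch.cos(0.5 * torch.pi * t), torch.sin(0.5 * torch.pi * t)
    xt = a * x0 + s * eps                            # noised final sample
    return -((model(xt, t) - eps) ** 2).sum(dim=-1)

# One RL step: raise the ELBO of high-reward final samples
x0 = torch.randn(128, 2)      # stand-in for samples generated by the model
reward = -x0.norm(dim=-1)     # toy reward: prefer samples near the origin
adv = (reward - reward.mean()).detach()              # centered advantage
loss = -(adv * elbo_estimate(x0)).mean()
opt.zero_grad(); loss.backward(); opt.step()
```

The appeal of this kind of design is cost: the ELBO surrogate needs one denoiser forward pass per sample and is differentiable in the model parameters, so high-reward samples get their estimated likelihood pushed up directly.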





