David

16 posts

@LiDavid2002

Joined March 2026
43 Following · 21 Followers
David@LiDavid2002·
So eager to see this!
Discrete Diffusion Reading Group@diffusion_llms

📢 May 25 (Mon): Language Modeling with Spherical Geometry 📷

💡 Join us to hear Justin (@jdeschena) and Jannis (@JChemseddine) present their recent work on (Hyper)spherical language modeling!

⚖️ Discrete Diffusion and Continuous Flow Language Models (DLMs / FLMs) have emerged as interesting alternatives to autoregressive models. Yet they face fundamental tensions: discrete diffusion samples from a factorized distribution that is strictly less expressive than AR. FLMs avoid factorized sampling but typically add Gaussian noise on one-hot vectors or embeddings. It is far from clear that this kind of noise is well suited to text generation.

🤔 Both papers ask the same question: what if the natural geometry for language flows isn't Euclidean space or the probability simplex, but the sphere? The hint has been there for a while: prior work like CDCD (@sedielem et al.) already operates on normalized vectors, and empirically, the cosine distance outperforms the Euclidean one for comparing word embeddings (think of word2vec, GloVe, or retrieval systems).

🧭 By lifting tokens onto Sᵈ⁻¹, the authors develop tools for spherical language modeling via SLERP and vMF paths. The vMF path has the added benefit of a closed-form score, enabling principled predictor–corrector samplers on the sphere.

📈 Working with the sphere leads to concrete performance improvements: on code generation with TinyGSM, prior FLMs reach roughly 0% accuracy, while flows on the hypersphere reach 12–18% 🚀. This still lags behind the AR and discrete diffusion baselines, but it strongly suggests that spherical embedding geometry is a natural noise model for tokens. At matched NFE, a properly tuned PC sampler with vMF paths clearly improves the accuracy on Sudoku. And as a bonus, training with rotations avoids materializing one-hot vectors, making it cheaper than standard FLM training ⚡️.

🔗 Language Modeling with Hyperspherical Flows: arxiv.org/abs/2605.11125
🔗 Spherical Flows for Sampling Categorical Data: arxiv.org/abs/2605.05629
🤝 Joint work with Caglar Gulcehre (@caglarml), Gregor Kornhardt (@gregorkornhardt), and Gabriele Steidl (page.math.tu-berlin.de/~steidl/)
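
For readers unfamiliar with SLERP, the interpolation named in the announcement can be sketched roughly as follows. This is only an illustration of spherical linear interpolation between two unit vectors, not code from either paper: the names `noise` and `token_embedding`, the direction of the path, and the 3-dimensional toy vectors are all assumptions, and the vMF path is not shown.

```python
import math


def normalize(x):
    """Project a nonzero vector onto the unit sphere."""
    norm = math.sqrt(sum(xi * xi for xi in x))
    return [xi / norm for xi in x]


def slerp(u, v, t):
    """Spherical linear interpolation between unit vectors u and v, t in [0, 1].

    Not defined for exactly antipodal points (u == -v), where the
    great-circle path is not unique.
    """
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(u, v))))
    omega = math.acos(dot)  # angle between u and v
    if omega < 1e-8:
        # Nearly identical points: linear interpolation plus renormalization.
        return normalize([(1 - t) * a + t * b for a, b in zip(u, v)])
    s = math.sin(omega)
    w_u = math.sin((1 - t) * omega) / s
    w_v = math.sin(t * omega) / s
    return [w_u * a + w_v * b for a, b in zip(u, v)]


# Hypothetical endpoints: a random-looking noise direction and a token embedding.
noise = normalize([0.3, -1.2, 0.5])
token_embedding = normalize([1.0, 0.2, -0.4])

for t in (0.0, 0.25, 0.5, 0.75, 1.0):
    x_t = slerp(noise, token_embedding, t)
    norm = math.sqrt(sum(c * c for c in x_t))
    print(f"t={t:.2f} norm={norm:.6f}")  # every intermediate point stays on the sphere
```

Unlike a straight-line (Euclidean) interpolation, every intermediate point here has unit norm, which is the property the spherical formulation relies on.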

David reposted
Oscar Davis@osclsd·
We were all wondering whether Categorical Flow Maps (CFMs) could scale... 🤔 I couldn't help trying it out... So we scaled CFMs to 1.7B parameters over 2.1T tokens 🚀🔥 Short summary 🧵⬇️
David@LiDavid2002·
@jdeschena huge thanks for the recording!
David reposted
Justin Deschenaux@jdeschena·
📢 Missed the talk? Check out the recording on YouTube: youtu.be/RZ6_huata1Y
Discrete Diffusion Reading Group@diffusion_llms

📢 May 18 (Mon): IDLM: Inverse-distilled Diffusion Language Models

🤔 Diffusion Language Models (DLMs) have recently achieved strong results in text generation. However, their multi-step sampling leads to slow inference, limiting practical use.

💡 To address this, the authors extend Inverse Distillation, a technique originally developed to accelerate continuous diffusion models, to the discrete setting. However, this extension introduces both theoretical and practical challenges.

🔧 To overcome these challenges, the authors first provide a theoretical result demonstrating that their inverse formulation admits a unique solution, thereby ensuring valid optimization. They then introduce gradient-stable relaxations to support effective training.

📊 As a result, experiments on multiple DLMs show that their method, Inverse-distilled Diffusion Language Models (IDLM), reduces the number of inference steps by 4×–64×, while preserving the teacher model’s entropy and generative perplexity.

This Monday, David Li (scholar.google.com/citations?user…) and Nikita Gushchin (scholar.google.com/citations?user…) will present their jointly led paper, which was recently accepted at ICML 2026.

Collaborators of this work include: Dmitry Abulkhanov (@dabulkhanov_), Eric Moulines (scholar.google.com/citations?user…), Ivan Oseledets (@oseledetsivan), Maxim Panov (@maxim_panov), Alexander Korotin (akorotin.netlify.app)

Paper link: arxiv.org/abs/2602.19066

David reposted
Zhihan Yang@zhihanyang_·
📢 Excited to share our new paper: Continuous Diffusion Scales Competitively with Discrete Diffusion for Language

We introduce RePlaid 🌊, a continuous diffusion language model (DLM) with
🏅 Discrete likelihood bound
🏅 Scaling laws competitive with SOTA discrete DLMs

How? Dive in 👇 [🧵1/12]

Paper: arxiv.org/abs/2605.18530
Work done with my amazing collaborators: @WeiGuo01 @ShuibaiZ69721 @ssahoo_ @YongxinChen1 @ArashVahdat @MardaniMorteza @jwthickstun
David@LiDavid2002·
I also shared a brief overview of the core idea in my Telegram channel. If you're curious, you can check it out here: t.me/LiSearch. We’ll be happy to see everyone interested in diffusion LMs, generative modeling, or just curious about where this field is heading.
David@LiDavid2002·
🚀 Today, @iNikitaGushchin and I will be presenting our work on IDLM as part of the dLLM Reading Group! We’ll discuss the main ideas behind the paper, what motivated this direction, and why we believe it is an exciting step for diffusion-based language modeling.
David reposted
Emiel Hoogeboom@emiel_hoogeboom·
Since I'm between jobs, I've been having a lot of fun vibe-coding with public tooling. First drop: a clean PyTorch impl of the Gradient Moment metric from our recent paper (arXiv:2603.20155). github.com/ehoogeboom/gra…
David reposted
Justin Deschenaux@jdeschena·
🔥 New paper: Language Modeling with Hyperspherical Flows

Recent flow language models (FLMs) all use Gaussian noise. Makes sense for images, but not necessarily for text 🫠

We propose to add noise by rotating embeddings on 𝕊^{d−1} instead 🌐

w/ @caglarml (1/9)
David@LiDavid2002·
@Sam_Acqua Yeah, I think evaluating gen. ppl with matching entropy would provide a fairer comparison. Otherwise, gen. ppl can always be reduced by lowering the entropy of the output distribution, for example through temperature scaling
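
The temperature-scaling point can be illustrated with a small sketch. It shows only the entropy half of the argument: dividing logits by a temperature below 1 makes the softmax distribution sharper and lowers its entropy. The logits are made-up values, and no generative perplexity is computed, since that would require an evaluator model.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax (numerically stabilized)."""
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical next-token logits over a 4-token vocabulary.
logits = [2.0, 1.0, 0.5, -1.0]

for temperature in (1.0, 0.7, 0.4):
    h = entropy(softmax(logits, temperature))
    print(f"T={temperature}: entropy={h:.3f} nats")
```

Because the entropy drops as the temperature drops, two samplers compared at different entropies are not being compared on equal footing, which is the motivation for reporting results at matched entropy.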
Sam Acquaviva@Sam_Acqua·
@LiDavid2002 Fully agree on all counts! So when ppl bounds are not an option, I think papers should report the gen. ppl vs entropy frontier, or at least refrain from point estimates without matching entropy.
Sam Acquaviva@Sam_Acqua·
Flow models are a promising alternative to autoregression. But the current standard of evaluating flow models is broken. The reported 3x improvement in 1024-step PPL since 2023 is closer to 1.1x if you control for sample entropy. (1/12)
David@LiDavid2002·
@Sam_Acqua Also, reporting perplexity bounds is not always straightforward. Some models, like distillation-based ones, are not trained with an ELBO or a clear likelihood objective, so it is unclear how to compare perplexity fairly.
David@LiDavid2002·
@Sam_Acqua I agree with the criticism of these metrics, but I guess we just do not have any options. The main issue is that academia often cannot train large models, so benchmark results on small models may be less meaningful, even if downstream metrics are better in principle.
David@LiDavid2002·
@emiel_hoogeboom Thank you for citing our work, we really appreciate it. I think our works are quite closely related. We already have some ideas for pushing this further and would be excited to explore a collaboration.
Emiel Hoogeboom@emiel_hoogeboom·
Also want to highlight very relevant concurrent work: arxiv.org/abs/2602.19066 You could say that if our work is discrete "mmd", theirs is discrete "dmd" from the continuous world.
Emiel Hoogeboom@emiel_hoogeboom·
You may think discrete distillation is fundamentally flawed, you are (surprisingly) wrong. 🤯 Meet Discrete Moment Distillation (D-MMD). It is a new method that brings fast, few-step sampling to discrete diffusion models! 🧵👇