David (@LiDavid2002) - Twitter Profile

David@LiDavid2002·3d

So eager to see this!

Discrete Diffusion Reading Group@diffusion_llms

📢 May 25 (Mon): Language Modeling with Spherical Geometry

💡 Join us to hear Justin (@jdeschena) and Jannis (@JChemseddine) present their recent work on (Hyper)spherical language modeling!

⚖️ Discrete Diffusion and Continuous Flow Language Models (DLMs / FLMs) have emerged as interesting alternatives to autoregressive models. Yet they face fundamental tensions: discrete diffusion samples from a factorized distribution that is strictly less expressive than AR. FLMs avoid factorized sampling but typically add Gaussian noise on one-hot vectors or embeddings. It is far from clear that this kind of noise is well suited to text generation.

🤔 Both papers ask the same question: what if the natural geometry for language flows isn't Euclidean space or the probability simplex, but the sphere? The hint has been there for a while: prior work like CDCD (@sedielem et al.) already operates on normalized vectors, and empirically, the cosine distance outperforms the Euclidean one for comparing word embeddings (think of word2vec, GloVe, or retrieval systems).

🧭 By lifting tokens onto Sᵈ⁻¹, the authors develop tools for spherical language modeling via SLERP and vMF paths. The vMF path has the added benefit of a closed-form score, enabling principled predictor–corrector samplers on the sphere.

📈 Working with the sphere leads to concrete performance improvements: on code generation with TinyGSM, prior FLMs reach roughly 0% accuracy, while flows on the hypersphere reach 12–18% 🚀. This still lags behind the AR and discrete diffusion baselines, but it strongly suggests that spherical embedding geometry is a natural noise model for tokens. At matched NFE, a properly tuned PC sampler with vMF paths clearly improves the accuracy on Sudoku. And as a bonus, training with rotations avoids materializing one-hot vectors, making it cheaper than standard FLM training ⚡️.

🔗 Language Modeling with Hyperspherical Flows: arxiv.org/abs/2605.11125
🔗 Spherical Flows for Sampling Categorical Data: arxiv.org/abs/2605.05629

🤝 Joint work with Caglar Gulcehre (@caglarml), Gregor Kornhardt (@gregorkornhardt), and Gabriele Steidl (page.math.tu-berlin.de/~steidl/)
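For readers unfamiliar with SLERP, the spherical interpolation named in the announcement, here is a minimal pure-Python sketch. It is not the authors' implementation: the helper names `normalize` and `slerp`, the 3-dimensional toy vectors, and the choice to treat one endpoint as "noise" and the other as a "token embedding" are all assumptions made for illustration. It only shows the standard SLERP formula, which moves along the great-circle arc between two unit vectors and so stays on the sphere, unlike a straight Euclidean line.

```python
import math

def normalize(v):
    """Project a nonzero vector onto the unit sphere."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def slerp(x0, x1, t):
    """Point at fraction t along the great-circle arc from x0 to x1.

    Both inputs are assumed to be unit vectors. Antipodal inputs
    (angle close to pi) are not handled in this sketch.
    """
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(x0, x1))))
    omega = math.acos(dot)          # angle between the two points
    if omega < 1e-8:                # nearly identical points: nothing to interpolate
        return list(x1)
    s = math.sin(omega)
    w0 = math.sin((1.0 - t) * omega) / s
    w1 = math.sin(t * omega) / s
    return [w0 * a + w1 * b for a, b in zip(x0, x1)]

# Hypothetical toy example: a "noise" point and a "token embedding" on S^2.
noise = normalize([1.0, 0.0, 0.0])
token = normalize([0.0, 1.0, 0.0])
mid = slerp(noise, token, 0.5)
mid_norm = math.sqrt(sum(x * x for x in mid))  # stays on the unit sphere
```

The midpoint of a straight line between the same two vectors would have norm below 1, which is one way to see why a spherical path is a different noise model from a Euclidean one.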

David@LiDavid2002·6d

Amazing work and interesting results 🔥. The next step is to evaluate them on benchmarks.

Oscar Davis@osclsd

We were all wondering whether Categorical Flow Maps (CFMs) could scale... 🤔 I couldn't help trying it out... So we scaled CFMs to 1.7B parameters over 2.1T tokens 🚀🔥 Short summary 🧵⬇️

David@LiDavid2002·19 May

@jdeschena huge thanks for the recording!

David retweeted

Justin Deschenaux@jdeschena·19 May

📢 Missed the talk? Check out the recording on YouTube: youtu.be/RZ6_huata1Y

Discrete Diffusion Reading Group@diffusion_llms

📢 May 18 (Mon): IDLM: Inverse-distilled Diffusion Language Models

🤔 Diffusion Language Models (DLMs) have recently achieved strong results in text generation. However, their multi-step sampling leads to slow inference, limiting practical use.

💡 To address this, the authors extend Inverse Distillation, a technique originally developed to accelerate continuous diffusion models, to the discrete setting. However, this extension introduces both theoretical and practical challenges.

🔧 To overcome these challenges, the authors first provide a theoretical result demonstrating that their inverse formulation admits a unique solution, thereby ensuring valid optimization. They then introduce gradient-stable relaxations to support effective training.

📊 As a result, experiments on multiple DLMs show that their method, Inverse-distilled Diffusion Language Models (IDLM), reduces the number of inference steps by 4×–64×, while preserving the teacher model’s entropy and generative perplexity.

This Monday, David Li (scholar.google.com/citations?user…) and Nikita Gushchin (scholar.google.com/citations?user…) will present their jointly led paper, which was recently accepted at ICML 2026.

Collaborators of this work include: Dmitry Abulkhanov (@dabulkhanov_), Eric Moulines (scholar.google.com/citations?user…), Ivan Oseledets (@oseledetsivan), Maxim Panov (@maxim_panov), Alexander Korotin (akorotin.netlify.app)

Paper link: arxiv.org/abs/2602.19066

David retweeted

Zhihan Yang@zhihanyang_·19 May

📢 Excited to share our new paper: Continuous Diffusion Scales Competitively with Discrete Diffusion for Language

We introduce RePlaid 🌊, a continuous diffusion language model (DLM) with
🏅 Discrete likelihood bound
🏅 Scaling laws competitive with SOTA discrete DLMs

How? Dive in 👇 [🧵1/12]

Paper: arxiv.org/abs/2605.18530

Work done with my amazing collaborators: @WeiGuo01 @ShuibaiZ69721 @ssahoo_ @YongxinChen1 @ArashVahdat @MardaniMorteza @jwthickstun

David@LiDavid2002·18 May

📄 Paper: arxiv.org/abs/2602.19066
💻 Code: github.com/David-cripto/I…
📢 Invitation post: x.com/diffusion_llms…

Many thanks to the organizers for the invitation, @jdeschena, @ssahoo_, @zhihanyang_! 🙌

#ICML2026 #DiffusionModels #LanguageModels #GenerativeAI #MachineLearning

David@LiDavid2002·18 May

I also shared a brief overview of the core idea in my Telegram channel. If you are curious, you can check it out here: t.me/LiSearch We’ll be happy to see everyone interested in diffusion LMs, generative modeling, or just curious about where this field is heading.

David@LiDavid2002·18 May

🚀 Today, @iNikitaGushchin and I will be presenting our work on IDLM as part of the dLLM Reading Group! We’ll discuss the main ideas behind the paper, what motivated this direction, and why we believe it is an exciting step for diffusion-based language modeling.

David retweeted

Emiel Hoogeboom@emiel_hoogeboom·16 May

Since I'm between jobs, I've been having a lot of fun vibe-coding with public tooling. First drop: a clean PyTorch impl of the Gradient Moment metric from our recent paper (arXiv:2603.20155). github.com/ehoogeboom/gra…

David retweeted

Justin Deschenaux@jdeschena·14 May

🔥 New paper: Language Modeling with Hyperspherical Flows Recent flow language models (FLMs) all use Gaussian noise. Makes sense for images, but not necessarily for text 🫠 We propose to add noise by rotating embeddings on 𝕊^{d−1} instead 🌐 w/ @caglarml (1/9)

David@LiDavid2002·13 May

@Sam_Acqua Yeah, I think evaluating gen. ppl with matching entropy would provide a fairer comparison. Otherwise, gen. ppl can always be reduced by lowering the entropy of the output distribution, for example through temperature scaling
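The point about temperature scaling can be illustrated with a small, self-contained sketch. The logits below are made-up numbers and the helpers `softmax` and `entropy` are hypothetical names, not taken from any paper in this thread. The sketch only shows that dividing logits by a temperature below 1 sharpens the output distribution and lowers its entropy, which is the mechanism by which generative perplexity can be pushed down without a better model.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature (numerically stabilized)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Hypothetical next-token logits over a 4-token vocabulary.
logits = [2.0, 1.0, 0.5, 0.0]

h_full = entropy(softmax(logits, temperature=1.0))
h_sharp = entropy(softmax(logits, temperature=0.5))

# Lower temperature concentrates mass on the top token, so h_sharp < h_full.
print(f"entropy at T=1.0: {h_full:.3f} nats")
print(f"entropy at T=0.5: {h_sharp:.3f} nats")
```

Comparing two models at a single temperature each therefore says little unless their sample entropies are matched.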

Sam Acquaviva@Sam_Acqua·13 May

@LiDavid2002 Fully agree on all counts! So when ppl bounds are not an option, I think papers should report the gen. ppl vs entropy frontier, or at least refrain from point estimates without matching entropy.

Sam Acquaviva@Sam_Acqua·13 May

Flow models are a promising alternative to autoregression. But the current standard of evaluating flow models is broken. The reported 3x improvement in 1024-step PPL since 2023 is closer to 1.1x if you control for sample entropy. (1/12)

David@LiDavid2002·13 May

@Sam_Acqua Also, reporting perplexity bounds is not always straightforward. Some models, like distillation-based ones, are not trained with an ELBO or a clear likelihood objective, so it is unclear how to compare perplexity fairly.

David@LiDavid2002·13 May

@Sam_Acqua I agree with the criticism of these metrics, but I guess we just do not have any options. The main issue is that academia often cannot train large models, so benchmark results on small models may be less meaningful, even if downstream metrics are better in principle.

David@LiDavid2002·25 Mar

@emiel_hoogeboom Thank you for citing our work, we really appreciate it. I think our works are quite closely related. We already have some ideas for pushing this further and would be excited to explore a collaboration.

Emiel Hoogeboom@emiel_hoogeboom·23 Mar

Also want to highlight very relevant concurrent work: arxiv.org/abs/2602.19066 You could say that if our work is discrete "mmd", theirs is discrete "dmd" from the continuous world.

Emiel Hoogeboom@emiel_hoogeboom·23 Mar

You may think discrete distillation is fundamentally flawed. You are (surprisingly) wrong. 🤯 Meet Discrete Moment Distillation (D-MMD). It is a new method that brings fast, few-step sampling to discrete diffusion models! 🧵👇
