
Chumeng Liang (@lowerbad)
First-year CS PhD student at @UofIllinois. Diffusion Language Models. Representation Learning.



Continuous diffusion dominates image and video generation, but it was long believed to inherently lag behind its discrete counterparts in language modeling. Today we challenge this belief with LangFlow: the first continuous diffusion language model that rivals, and even beats, discrete diffusion. (1/7)
Blog: caradryanl.github.io/blog/2026/lang…
GitHub: github.com/nealchen2003/L…
Arxiv: arxiv.org/abs/2604.11748

(1/n) Tiny-A2D: An Open Recipe to Turn Any AR LM into a Diffusion LM
Code (dLLM): github.com/ZHZisZZ/dllm
Checkpoints: huggingface.co/collections/dl…

With dLLM, you can turn ANY autoregressive LM into a diffusion LM (parallel generation + infilling) with minimal compute. Using this recipe, we built a 🤗 collection of the smallest diffusion LMs that work well in practice.

Key takeaways:
1. Finetuned from Qwen3-0.6B, we obtain the strongest small (~0.5/0.6B) diffusion LMs to date.
2. The base AR LM matters: investing compute in improving the base AR model is potentially more efficient than scaling compute during adaptation.
3. Block diffusion (BD3LM) generally outperforms vanilla masked diffusion (MDLM), especially on math-reasoning and coding tasks.
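The thread doesn't spell out the training objective, but MDLM-style adaptation typically works by sampling a noise level, masking tokens at that rate, and scoring the (now bidirectional) model only on masked positions with a 1/t weight. A minimal sketch under those assumptions; the function names and the `logprob_fn` interface are hypothetical, not dLLM's actual API:

```python
import math
import random

MASK = -1  # hypothetical id for the [MASK] token


def masked_diffusion_loss(tokens, logprob_fn, rng=random):
    """One MDLM-style training step (illustrative sketch).

    tokens:     list of ground-truth token ids
    logprob_fn: model callable (corrupted_tokens, position, target) -> log p(target),
                assumed to attend bidirectionally over corrupted_tokens
    Returns the 1/t-weighted cross-entropy averaged over masked positions.
    """
    t = rng.uniform(1e-3, 1.0)  # noise level ~ U(0, 1]
    corrupted = [MASK if rng.random() < t else tok for tok in tokens]
    loss, n_masked = 0.0, 0
    for i, (noisy, clean) in enumerate(zip(corrupted, tokens)):
        if noisy == MASK:
            loss += -logprob_fn(corrupted, i, clean) / t  # 1/t weighting
            n_masked += 1
    return loss / max(n_masked, 1)
```

At t close to 1 nearly everything is masked (unconditional generation); at small t the model sees mostly clean context, which is the infilling regime the thread mentions.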

Introducing Multi-Agent Evolve 🧠

A new paradigm beyond RLHF and RLVR: more compute → closer to AGI, with no need for expensive data or handcrafted rewards. We show that an LLM can self-evolve, improving itself through co-evolution among three roles (Proposer, Solver, Judge) via RL, all without external supervision. On Qwen2.5-3B-Instruct, Multi-Agent Evolve boosts average accuracy from 55% to 58% across 22 benchmarks.

Remarkably, the model automatically learns to balance the roles:
- The Proposer first generates easy tasks.
- The Judge refines the difficulty metric.
- The Proposer then raises the challenge, forcing the Solver to improve.
- The system co-evolves until reaching equilibrium.

Multi-Agent Evolve: LLMs self-improve through co-evolution.
📄 Paper: arxiv.org/abs/2510.23595
💻 Code (coming soon): github.com/ulab-uiuc/Mult…
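The RL details live in the paper, but the three-role loop described above can be sketched as a toy control flow: one model plays all three roles, rewards are self-generated, and difficulty ratchets up as the Solver keeps pace. Everything here (the `llm` interface, the reward values, the difficulty schedule) is a hypothetical illustration, not the paper's algorithm:

```python
def multi_agent_evolve(llm, steps):
    """Toy co-evolution loop: one model plays Proposer, Solver, and Judge.

    llm: object with .generate(role, prompt) -> str and
         .update(reward_by_role: dict) performing an RL step
         (both interfaces are hypothetical).
    """
    difficulty = 1.0
    for _ in range(steps):
        task = llm.generate("proposer", f"Propose a task, difficulty={difficulty:.2f}")
        answer = llm.generate("solver", task)
        verdict = llm.generate("judge", f"Task: {task}\nAnswer: {answer}\nCorrect? yes/no")
        solved = verdict.strip().lower().startswith("yes")
        # Self-generated rewards, no external supervision: the Proposer is
        # rewarded for tasks near the Solver's frontier, the Solver for
        # solving, the Judge with a fixed placeholder here.
        rewards = {
            "proposer": 0.2 if solved else 1.0,  # too-easy tasks earn less
            "solver": 1.0 if solved else 0.0,
            "judge": 0.5,
        }
        llm.update(rewards)
        difficulty *= 1.1 if solved else 0.95  # raise the bar when the Solver keeps up
    return difficulty
```

The equilibrium the thread mentions corresponds to the point where difficulty stops drifting: tasks are hard enough that the Solver succeeds only part of the time.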


(1/n) 🚨 BERTs that chat: turn any BERT into a chatbot with diffusion

hi @karpathy, we just trained a few BERTs to chat with diffusion, and we are releasing all the model checkpoints, training curves, and recipes! Hopefully this spares you the side quest into training nanochat with diffusion for now 🙂. It's both a hands-on tutorial for beginners and an example of how to use our complete toolkit (dLLM) for deeper projects.

Code: github.com/ZHZisZZ/dllm
Report: api.wandb.ai/links/asap-zzh…
Checkpoints: huggingface.co/collections/dl…

Motivation: I couldn't find a good "Hello World" example for training a minimally working yet useful diffusion language model, a class of bidirectional language models capable of parallel token generation in arbitrary order. So I tried finetuning BERTs to chat with discrete diffusion, and it turned out more fun than I expected.

TLDR: With a small amount of open-source instruction-following data, a standard BERT can gain conversational ability through diffusion. Specifically, a finetuned ModernBERT-large, with a similar number of parameters, performs close to Qwen1.5-0.5B.
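For chat finetuning, the natural move (and presumably what a recipe like this does, though the thread doesn't say) is to keep the prompt clean and noise only the response, so the model learns to denoise a reply conditioned on the conversation. A minimal sketch of that data-corruption step; the function name and layout are hypothetical:

```python
import random

MASK_ID = 103  # [MASK] id in the standard bert-base-uncased vocabulary


def corrupt_chat_example(prompt_ids, response_ids, rng=random):
    """Build one diffusion-SFT training example from a chat pair (sketch).

    The prompt stays clean, since it is always fully visible at inference
    time; only response tokens are masked, at a shared noise level t.
    """
    t = rng.uniform(1e-3, 1.0)  # noise level for this example
    noisy_response = [MASK_ID if rng.random() < t else tok for tok in response_ids]
    input_ids = prompt_ids + noisy_response
    # Compute loss only where we actually masked.
    loss_mask = [False] * len(prompt_ids) + [tok == MASK_ID for tok in noisy_response]
    targets = prompt_ids + response_ids
    return input_ids, targets, loss_mask
```

The same bidirectional encoder that BERT already is then just predicts the masked response tokens, which is why the jump from masked-LM pretraining to diffusion chat is so short.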

Nice, short post illustrating how simple text (discrete) diffusion can be. Diffusion (i.e. parallel, iterated denoising; top of the figure) is the pervasive generative paradigm in image/video, but autoregression (i.e. go left to right; bottom) is the dominant paradigm in text. For audio I've seen a bit of both.

A lot of diffusion papers look a bit dense, but if you strip the mathematical formalism you end up with simple baseline algorithms: something a lot closer to flow matching in the continuous case, or something like this in the discrete case. It's your vanilla transformer, but with bidirectional attention, where you iteratively re-sample and re-mask all tokens in your "token canvas" based on a noise schedule until you get the final sample at the last step. (Bidirectional attention is a lot more powerful, and you get much stronger autoregressive language models if you train with it; unfortunately it makes training a lot more expensive because you can no longer parallelize across the sequence dimension.)

So autoregression does an `.append(token)` to the token canvas while only attending backwards, while diffusion refreshes the entire token canvas with a `.setitem(idx, token)` while attending bidirectionally.

Human thought naively feels a bit more like autoregression, but it's hard to say there aren't more diffusion-like components in some latent space of thought. It feels quite possible to interpolate further between them, or generalize them further; this is a component of the LLM stack that still feels a bit fungible. Now I must resist the urge to side quest into training nanochat with diffusion.
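The iterative re-sample-and-re-mask procedure described above fits in a few lines. A toy sketch, assuming a model that returns a (token, confidence) guess for every canvas position under bidirectional attention; the interface and the linear schedule are illustrative choices, not any particular paper's:

```python
def diffusion_sample(model, length, num_steps, mask_token="[M]"):
    """Iterative denoising over a token canvas (toy sketch).

    model(canvas) -> list of (token, confidence) pairs, one per position,
    assumed to be computed with bidirectional attention over the canvas.
    Each step re-samples every position, then re-masks the least
    confident ones according to a linear noise schedule.
    """
    canvas = [mask_token] * length                  # start fully masked
    for step in range(num_steps):
        guesses = model(canvas)                     # re-sample all positions
        frac_masked = 1.0 - (step + 1) / num_steps  # schedule: 1 -> 0
        k = int(frac_masked * length)               # how many stay masked
        order = sorted(range(length), key=lambda i: guesses[i][1], reverse=True)
        canvas = [mask_token] * length
        for i in order[: length - k]:               # commit top-confidence tokens
            canvas[i] = guesses[i][0]
    return canvas
```

Contrast with autoregression, which would instead call the model once per token and `.append` the result while attending only backwards.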
