Cunxiao Du
@ducx_du
Research Scientist @ seed
55 posts · Joined March 2024 · 182 Following · 462 Followers

Cunxiao Du reposted
Fengzhuo Zhang@FengzhuoZhang·
Large Language Models (LLMs) exhibit “slash patterns” in attention maps, a key mechanism behind prefilling acceleration. We take a first step toward understanding why they emerge. Main findings:
▶️ Slash patterns are OOD-generalizable.
▶️ Queries and keys on these heads are near rank-one and carry little contextual information.
▶️ RoPE is the primary source of the slash pattern.
Blog link: fengzhuo.notion.site/Demystifying-t… A thread 🧵
Replies 2 · Reposts 24 · Likes 74 · Views 7.1K
Cunxiao Du@ducx_du·
Robo PhD @YouJiacheng is so professional on LLMs 🤣. I had never checked the annotation guidelines of OpenAI's PRM800k, although I have read that paper so many times. But the idea did come from when @mavenlin and I were using Codex: we just felt Codex is so human-like, so we should also annotate some data efficiently.
You Jiacheng@YouJiacheng

This is similar to what OpenAI did in PRM800k (2023). I think this is the right way to collect data. github.com/openai/prm800k…

Replies 1 · Reposts 0 · Likes 23 · Views 4.5K
Cunxiao Du@ducx_du·
A real example on MLE. The left side shows the terminal execution, while the right side shows the interaction between the agent and the human annotator. At each step, the annotator confirms whether to proceed with the LLM-generated command (p) or edit it (e). At the 30s mark, an edit occurs: the human annotator only needs to rewrite, in first-person form, the desired behavior at a high level, and all subsequent tokens can be completed by the LLM, just like Cursor. This significantly accelerates the annotation process and removes the need for annotators to memorize complex terminal commands (which we find the LLM rarely gets wrong). As a result, the annotation barrier is substantially lowered.
Replies 0 · Reposts 0 · Likes 5 · Views 789
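A minimal sketch of the confirm/edit loop described above. This is purely illustrative: the `propose_command` helper, the prompts, and the injected `llm` / `run_shell` callables are assumptions, not the actual annotation tooling.

```python
# Illustrative confirm/edit annotation loop: the LLM proposes the next terminal
# command; the human either proceeds (p) or writes a short high-level intent (e),
# from which the LLM regenerates the command. All names here are hypothetical.

def propose_command(llm, history, intent=None):
    """Ask the LLM for the next shell command, optionally conditioned on a
    first-person intent written by the annotator."""
    prompt = "\n".join(history)
    if intent:
        prompt += f"\n# I want to: {intent}"
    return llm.complete(prompt)  # assumed LLM client with a .complete() method

def annotate_episode(llm, run_shell, max_steps=50):
    history, trajectory = [], []
    for _ in range(max_steps):
        command = propose_command(llm, history)
        choice = input(f"$ {command}\n(p)roceed / (e)dit / (q)uit: ").strip()
        if choice == "q":
            break
        if choice == "e":
            # The annotator only describes the desired behavior at a high level;
            # the LLM fills in the concrete command tokens.
            intent = input("Describe what you want to do: ")
            command = propose_command(llm, history, intent=intent)
        output = run_shell(command)          # execute in the sandboxed terminal
        history.extend([f"$ {command}", output])
        trajectory.append({"command": command, "output": output,
                           "edited": choice == "e"})
    return trajectory  # mostly LLM-generated tokens, lightly steered by the human
```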
Cunxiao Du@ducx_du·
A simple and fast method for high-quality data annotation: On-Policy Annotation. Humans lightly edit LLM outputs, then let the LLM continue from the edited prefix—rather than labeling from scratch. Most tokens remain LLM-generated, boosting annotation efficiency and learnability. With just 300 annotated SWEGym samples, DevStral-22B-05 on bash-only SWE-Bench-Verified improves 18.6% → 32.8%. BLOG: terminal-agent.github.io/blog/annotatio…
Replies 1 · Reposts 14 · Likes 78 · Views 15.9K
Cunxiao Du@ducx_du·
@Xinyu2ML semantic parallelism 👈🏻❤️❤️❤️
Replies 0 · Reposts 0 · Likes 1 · Views 319
Xinyu Yang@Xinyu2ML·
This blog post is pretty nice! My main concern with discrete dLLMs these days is that the sampling process effectively replaces the true joint distribution with a collection of independent marginals. In other words, what we end up modeling is not the true joint distribution, but an independently factorized approximation. This results in the Trilemma among efficiency, fluency, and diversity. Since independent factorization is fundamentally weaker than autoregressive factorization, discrete dLLMs are, in principle, less expressive than standard autoregressive LLMs when using the same computation (forward*1).

Moreover, even with continuous dLLMs, several fundamental challenges remain in the learning objective, as we discussed in our recent blog post: notion.so/Understanding-…

Related to Planned Diffusion, we previously explored a similar idea in our NeurIPS Spotlight paper, Multiverse (arxiv.org/pdf/2506.09991). The key concept is still semantic parallelism: splitting a task into multiple independent subtasks and then merging them based on an autoregressively generated plan. This provides a principled way to increase parallelism while retaining semantic coherence. However, our approach always utilizes autoregressive generation on every branch to support recursive decomposition and bypass the inherent limitations of independently factorized diffusion sampling. This naturally raises an interesting open question: once we achieve semantic parallelism, do we still need token parallelism at all?
Daniel Israel@danielmisrael

I wrote a blog post: The Parallel Decoding Trilemma. Parallel decoding research has tried to optimize the speed-quality tradeoff, but I believe quality should be decomposed into fluency and diversity.

Replies 1 · Reposts 7 · Likes 79 · Views 18.2K
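A toy numerical illustration of the point about independent marginals above. The two-token distribution is made up purely for illustration.

```python
import itertools

# Toy joint over two strongly correlated tokens: the pair is either
# ("New", "York") or ("Los", "Angeles"), each with probability 0.5.
joint = {("New", "York"): 0.5, ("Los", "Angeles"): 0.5}

# Per-position marginals, as a one-step parallel unmasking effectively sees them.
first = {"New": 0.5, "Los": 0.5}
second = {"York": 0.5, "Angeles": 0.5}

# Sampling the two positions independently replaces the joint with a product of
# marginals, which puts mass on incoherent pairs the true joint never emits.
for a, b in itertools.product(first, second):
    p_indep = first[a] * second[b]
    p_true = joint.get((a, b), 0.0)
    print(f"{a} {b}: independent={p_indep:.2f}, true joint={p_true:.2f}")
# "New Angeles" and "Los York" each get probability 0.25 under the independent
# factorization but probability 0 under the true distribution.
```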
Cunxiao Du@ducx_du·
@danielmisrael This picture from DA-Transformer also explains why one-turn parallel decoding cannot be fluent.
[image]
Replies 1 · Reposts 0 · Likes 1 · Views 409
Daniel Israel@danielmisrael·
I wrote a blog post: The Parallel Decoding Trilemma. Parallel decoding research has tried to optimize the speed-quality tradeoff, but I believe quality should be decomposed into fluency and diversity.
Replies 3 · Reposts 19 · Likes 61 · Views 14.6K
Cunxiao Du@ducx_du·
I completely agree. As I wrote at the end of my blog post and in tweet 11 of the thread: humans don't reason in absolute token slots (“what's the 25th word from now?”). Mask diffusion does. We think in latent plans: functions before code, structure before wording, ideas before tokens. That's what a better diffusion model should capture. This is very much in line with your perspective, and there are already some papers exploring such latent plans. Personally, I highly recommend Skeleton of Thought and Multiverse @Xinyu2ML . However, these methods are closer to post-training or inference-time algorithms than to something that can be applied at the pre-training level. If we want to truly beat left-to-right autoregressive models, I believe we need to figure out how to realize what these methods are doing directly at the pre-training level.
John F. Wu@jwuphysics

@ducx_du @sedielem I'm obviously being naive here, but I'm thinking something like an outer loop "big picture/outlining" diffusion-like approach, with an inner loop "detail oriented" autoregressive model. Just basing this on how I read (skim + focus) and also write (outline + details).

Replies 1 · Reposts 1 · Likes 17 · Views 3.6K
Cunxiao Du@ducx_du·
I think this is strongly related to the prior of natural language. I do not deny that there are many meaningful conditional distributions in any-order training: for example, fill-in-the-middle patterns like “a, b, [mask], c, d” are clearly valuable. However, I believe that for the vast majority of predictions that lack locality, the resulting gradient will have a very low signal-to-noise ratio. Humans, for instance, do not naturally try to predict the token 20 positions ahead, yet diffusion-style training is filled with many similarly non-local prediction targets.
Replies 0 · Reposts 0 · Likes 1 · Views 18
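A quick simulation of how non-local the prediction targets get under mask-diffusion-style masking. The uniform mask rate, sequence length, and trial count below are arbitrary choices for illustration, not numbers from the blog.

```python
import random

def nearest_context_distances(masked):
    """For each masked position, distance to the nearest unmasked token,
    computed with two linear sweeps; fully masked sequences yield nothing."""
    n, INF = len(masked), float("inf")
    dist = [INF] * n
    last = None
    for i in range(n):                       # left-to-right sweep
        if not masked[i]:
            last = i
        elif last is not None:
            dist[i] = i - last
    last = None
    for i in reversed(range(n)):             # right-to-left sweep
        if not masked[i]:
            last = i
        elif last is not None:
            dist[i] = min(dist[i], last - i)
    return [d for i, d in enumerate(dist) if masked[i] and d != INF]

total, count = 0.0, 0
for _ in range(2000):
    rate = random.random()                   # mask rate ~ Uniform(0, 1)
    masked = [random.random() < rate for _ in range(512)]
    ds = nearest_context_distances(masked)
    total += sum(ds)
    count += len(ds)
print("average distance from a masked target to its nearest context token:",
      total / count)
# Left-to-right training always predicts at distance 1 from observed context;
# under uniformly sampled mask rates, a sizable share of diffusion-style targets
# sit much farther from any observed token.
```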
Ryan Chesler@ryan_chesler·
@TrelisResearch This is assuming that on every additional pass of the data you have another chance at seeing a different remasking of the data
Replies 2 · Reposts 0 · Likes 1 · Views 40
Trelis Research@TrelisResearch·
Important thread for diffusion language models. It appears that masking outputs results in poor performance because locality (info of one token informing a neighbour) is significantly hampered... [i.e. training on "The MASKED MASKED animal" is quite ineffective, as the two masked tokens depend a lot on each other, and you are wasting forward and backward passes compared to training on "The big wild MASKED"...] This doesn't happen in diffusion image models, as they add noise to the image and feed the full noised image to the model. There is a big difference between a noised pixel (still has info) and a zeroed-out pixel (masked). So this implies masked diffusion models are fundamentally just limited versus autoregressive models...? So perhaps we go back to training diffusion LLMs where diffusion operates in latent space (in which case you still need a latent-to-language decoder)...?
Cunxiao Du@ducx_du

Diffusion LLMs (DLLM) can do “any-order” generation, in principle, more flexible than left-to-right (L2R) LLM. Our main finding is uncomfortable: ➡️ In real language, this flexibility backfires: DLLMs become worse probabilistic models than the L2R / R2L AR LMs. This thread is about why “any order” turns into a curse. (Work with Xinyu Yang @Xinyu2ML , Min Lin @mavenlin , Chao Du @duchao0726 and the team.) Blog Link: notion.so/Understanding-…

Replies 4 · Reposts 4 · Likes 58 · Views 8.8K
Cunxiao Du@ducx_du·
@xordrew There is another rather old-school yet highly promising model that I personally believe in: the HMM (hidden Markov model).
Replies 0 · Reposts 0 · Likes 0 · Views 19
Andrew Dickson@xordrew·
@ducx_du Do you have proposals you like for autoregressive models in latent space? I'd love to see anything like that, but defining the log-likelihood objective (without some horrifying VAE style latents) is intimidating.
Replies 2 · Reposts 0 · Likes 0 · Views 50
Cunxiao Du@ducx_du·
Diffusion LLMs (DLLM) can do “any-order” generation, in principle, more flexible than left-to-right (L2R) LLM. Our main finding is uncomfortable: ➡️ In real language, this flexibility backfires: DLLMs become worse probabilistic models than the L2R / R2L AR LMs. This thread is about why “any order” turns into a curse. (Work with Xinyu Yang @Xinyu2ML , Min Lin @mavenlin , Chao Du @duchao0726 and the team.) Blog Link: notion.so/Understanding-…
Replies 16 · Reposts 73 · Likes 456 · Views 134.8K
Cunxiao Du@ducx_du·
@xordrew Personally, I am more inclined to recommend Skeleton of Thought and Multiverse @Xinyu2ML . However, the problem is that there does not yet seem to be a clear path for applying them at pre-training scale.
Replies 0 · Reposts 0 · Likes 2 · Views 45
Cunxiao Du@ducx_du·
In fact, I have read the insightful Block Diffusion, and around the same time I independently came up with a similar solution to address the compatibility issue with the KV cache, but @mariannearr already did it beautifully. However, Block Diffusion does not change the conclusions of this paper; it merely reduces the length used in Mask Diffusion from the full sequence length to the block length. I have tested cases where the block size is small, and the prediction gap between any-order generation and L2R/R2L is not significant. Therefore, this appears to be a promising direction.
Replies 0 · Reposts 0 · Likes 2 · Views 123
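A rough sketch of the factorization at play in the tweet above (my reading of the Block Diffusion setup; the notation with B blocks of length K is illustrative, not taken from the paper):

```latex
% L2R autoregression vs. block diffusion with B blocks of length K (K << T).
\text{L2R AR:}\quad p(x) \;=\; \prod_{t=1}^{T} p\!\left(x_t \mid x_{<t}\right)
\qquad\qquad
\text{Block diffusion:}\quad p(x) \;=\; \prod_{b=1}^{B} p\!\left(x^{(b)} \mid x^{(<b)}\right)
% where each block-conditional p(x^{(b)} | x^{(<b)}) is modeled by masked
% diffusion over a length-K block rather than over the full sequence length T,
% which is what "reduces the length used in Mask Diffusion" above refers to.
```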
Josh Cason@TheGrizztronic·
@ducx_du If you haven't read about block diffusion, seems like something you'd want to consider for further experiments. x.com/mariannearr/st…
Marianne Arriola@mariannearr

🚨Announcing our #ICLR2025 Oral! 🔥Diffusion LMs are on the rise for parallel text generation! But unlike autoregressive LMs, they struggle with quality, fixed-length constraints & lack of KV caching. 🚀Introducing Block Diffusion—combining autoregressive and diffusion models for the best of both worlds! 👇1/7

Replies 1 · Reposts 0 · Likes 0 · Views 185
Cunxiao Du@ducx_du·
In fact, this is a rather sad story. I started following masked diffusion language models back in 2019, when they were still called mask-predict non-autoregressive transformers (arxiv.org/abs/1904.09324). I was deeply moved by the “any-order” perspective. But during my five-year PhD journey, I slowly realized that masked diffusion models were consistently weaker than autoregressive models for language. It wasn't until six years later that I suddenly understood: the root cause was likely the inductive bias inherent in the data itself.
Replies 1 · Reposts 0 · Likes 2 · Views 117
John F. Wu@jwuphysics·
@sedielem Agree - especially that the inductive biases for arbitrary ordering are quite different than those for natural language!
Replies 1 · Reposts 0 · Likes 2 · Views 512
Sander Dieleman@sedielem·
This is a nice thread on the flip side of modern diffusion language models, most of which make use of a discrete corruption process based on masking. These shortcomings are also why I was originally so excited about continuous diffusion, which allows intermediate noisy embeddings to represent superpositions of possible outcomes (as in CDCD arxiv.org/abs/2211.15089 and many other works). This potentially enables them to go far beyond the "any-order autoregression" of masked discrete diffusion models. Unfortunately, as @PatrickPyn35903, @thjashin and @ruqi_zhang recently convincingly showed in the CANDI paper (arxiv.org/abs/2510.22510), this is not without its problems either. It's starting to look like either a hybrid continuous-discrete or a latent-based approach is the way to go in the longer term.
Cunxiao Du@ducx_du

Diffusion LLMs (DLLM) can do “any-order” generation, in principle, more flexible than left-to-right (L2R) LLM. Our main finding is uncomfortable: ➡️ In real language, this flexibility backfires: DLLMs become worse probabilistic models than the L2R / R2L AR LMs. This thread is about why “any order” turns into a curse. (Work with Xinyu Yang @Xinyu2ML , Min Lin @mavenlin , Chao Du @duchao0726 and the team.) Blog Link: notion.so/Understanding-…

Replies 7 · Reposts 19 · Likes 160 · Views 24.5K
Cunxiao Du@ducx_du·
A slightly different point of view: according to our analysis, both L2R and R2L orders break translation invariance and essentially increase the learning difficulty. Even if we believe that removing inductive bias would help the model learn better, the training objective should still follow a log-sum formulation rather than a sum-of-log one. We cannot expect a single piece of data to be perfectly explained by all possible orders.
Replies 1 · Reposts 0 · Likes 3 · Views 174
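A sketch of the log-sum vs. sum-of-log distinction in symbols (my notation, not a formula from the blog): σ ranges over generation orders and p_σ(x) is the likelihood of x under order σ.

```latex
% Sum-of-log: every order is asked to explain the data.
\mathcal{L}_{\text{sum-of-log}}(x) \;=\; \mathbb{E}_{\sigma}\big[\log p_{\sigma}(x)\big]
% Log-sum: the order is treated as a latent variable and marginalized out.
\mathcal{L}_{\text{log-sum}}(x) \;=\; \log \mathbb{E}_{\sigma}\big[p_{\sigma}(x)\big]
% Jensen's inequality gives
\mathbb{E}_{\sigma}\big[\log p_{\sigma}(x)\big] \;\le\; \log \mathbb{E}_{\sigma}\big[p_{\sigma}(x)\big],
% so the sum-of-log objective penalizes a sample whenever any single order
% explains it poorly, while the log-sum objective only requires that some
% mixture of orders explains it, matching the point made above.
```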
Michael Luo@michaelzluo·
Transformers without positional embeddings are functionally the same as dLLMs and have better scaling laws than transformers with positional embeddings. It is also much easier and more stable to train a transformer than a dLLM, due to the much larger body of prior work. I would position dLLMs as a "cost arbitrage" over LLMs, i.e. faster generation with much higher token throughput.
Cunxiao Du@ducx_du

Diffusion LLMs (DLLM) can do “any-order” generation, in principle, more flexible than left-to-right (L2R) LLM. Our main finding is uncomfortable: ➡️ In real language, this flexibility backfires: DLLMs become worse probabilistic models than the L2R / R2L AR LMs. This thread is about why “any order” turns into a curse. (Work with Xinyu Yang @Xinyu2ML , Min Lin @mavenlin , Chao Du @duchao0726 and the team.) Blog Link: notion.so/Understanding-…

Replies 9 · Reposts 10 · Likes 158 · Views 28.5K
Cunxiao Du@ducx_du·
Thanks a lot for the kind words! I fully agree with your point: whether L2R or R2L dominates is clearly data-dependent. As I wrote in my blog, we can do a simple thought experiment: if we swap the 5th token with the 0th token in all training data, then the optimal order becomes something like 5, 1, 2, 3, 4, 0, 6, …. So for Sudoku, there might not even exist a single “best” order. But I still believe the optimal form of any-order learning should be something like a log-sum over orders, treating the order as a latent variable.
Replies 0 · Reposts 0 · Likes 1 · Views 345
Ashwinee Panda @ICLR2026@PandaAshwinee·
@ducx_du Really nice post, I liked it a lot. I think the two directions forward are latent diffusion, and DLLMs not for “language” but for Sudoku or other settings where “any-order” is actually BiS. What do you think?
Replies 1 · Reposts 0 · Likes 0 · Views 437
oso@osoleve·
@kalomaze @teortaxesTex I'm not speaking about the architecture of general language models here; I'm drawing a counterpoint to the comfy assumption we like to make that everything about language is inherently Markovian. That's an artifact of what we're doing and how we measure it, not a prescription.
Replies 1 · Reposts 0 · Likes 0 · Views 28
Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)
Extremely powerful narrative. Diffusion models by default make no sense because language is Markovian and L2R *or* *R2L* orders are strictly superior. It appears that the only sane way to train DLLMs is with log-sum loss.
Cunxiao Du@ducx_du

Diffusion LLMs (DLLM) can do “any-order” generation, in principle, more flexible than left-to-right (L2R) LLM. Our main finding is uncomfortable: ➡️ In real language, this flexibility backfires: DLLMs become worse probabilistic models than the L2R / R2L AR LMs. This thread is about why “any order” turns into a curse. (Work with Xinyu Yang @Xinyu2ML , Min Lin @mavenlin , Chao Du @duchao0726 and the team.) Blog Link: notion.so/Understanding-…

Replies 6 · Reposts 6 · Likes 105 · Views 10.8K