Cunxiao Du
@ducx_du
Research Scientist @ seed
55 posts · Joined March 2024 · 182 Following · 462 Followers

Cunxiao Du reposted
Fengzhuo Zhang@FengzhuoZhang·
Large Language Models (LLMs) exhibit “slash patterns” in attention maps, a key mechanism behind prefilling acceleration. We take a first step toward understanding why they emerge. Main findings:
▶️ Slash patterns are OOD-generalizable.
▶️ Queries and keys on these heads are near rank-one and carry little contextual information.
▶️ RoPE is the primary source of the slash pattern.
Blog link: fengzhuo.notion.site/Demystifying-t… A thread 🧵
Replies 2 · Reposts 24 · Likes 74 · Views 7.1K
Cunxiao Du@ducx_du·
Robo PhD @YouJiacheng is so professional on LLMs 🤣. I had never checked the annotation guidelines of OpenAI's PRM800k, although I have read that paper so many times. But the idea did come from when @mavenlin and I were using Codex: we just felt Codex is so human-like, so we should also annotate some data efficiently.
You Jiacheng@YouJiacheng

This is similar to what OpenAI did in PRM800k (2023). I think this is the right way to collect data. github.com/openai/prm800k…

Replies 1 · Reposts 0 · Likes 23 · Views 4.5K
Cunxiao Du@ducx_du·
A real example on MLE. The left side shows the terminal execution, while the right side shows the interaction between the agent and the human annotator. At each step, the annotator confirms whether to proceed with the LLM-generated command (p) or edit it (e). At the 30s mark, an edit occurs: the human annotator only needs to rewrite, in first-person form, the desired behavior at a high level, and all subsequent tokens can be completed by the LLM, just like Cursor. This significantly accelerates the annotation process and removes the need for annotators to memorize complex terminal commands (which we find the LLM rarely gets wrong). As a result, the annotation barrier is substantially lowered.
Replies 0 · Reposts 0 · Likes 5 · Views 789
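A minimal sketch of the confirm/edit loop described above. This is purely illustrative: the `propose_command` helper, the prompts, and the injected `llm` / `run_shell` callables are assumptions, not the actual annotation tooling.

```python
# Illustrative confirm/edit annotation loop: the LLM proposes the next terminal
# command; the human either proceeds (p) or writes a short high-level intent (e),
# from which the LLM regenerates the command. All names here are hypothetical.

def propose_command(llm, history, intent=None):
    """Ask the LLM for the next shell command, optionally conditioned on a
    first-person intent written by the annotator."""
    prompt = "\n".join(history)
    if intent:
        prompt += f"\n# I want to: {intent}"
    return llm.complete(prompt)  # assumed LLM client with a .complete() method

def annotate_episode(llm, run_shell, max_steps=50):
    history, trajectory = [], []
    for _ in range(max_steps):
        command = propose_command(llm, history)
        choice = input(f"$ {command}\n(p)roceed / (e)dit / (q)uit: ").strip()
        if choice == "q":
            break
        if choice == "e":
            # The annotator only describes the desired behavior at a high level;
            # the LLM fills in the concrete command tokens.
            intent = input("Describe what you want to do: ")
            command = propose_command(llm, history, intent=intent)
        output = run_shell(command)          # execute in the sandboxed terminal
        history.extend([f"$ {command}", output])
        trajectory.append({"command": command, "output": output,
                           "edited": choice == "e"})
    return trajectory  # mostly LLM-generated tokens, lightly steered by the human
```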
Cunxiao Du@ducx_du·
A simple and fast method for high-quality data annotation: On-Policy Annotation. Humans lightly edit LLM outputs, then let the LLM continue from the edited prefix—rather than labeling from scratch. Most tokens remain LLM-generated, boosting annotation efficiency and learnability. With just 300 annotated SWEGym samples, DevStral-22B-05 on bash-only SWE-Bench-Verified improves 18.6% → 32.8%. BLOG: terminal-agent.github.io/blog/annotatio…
Replies 1 · Reposts 14 · Likes 78 · Views 15.9K
Cunxiao Du@ducx_du·
@Xinyu2ML semantic parallelism 👈🏻❤️❤️❤️
Replies 0 · Reposts 0 · Likes 1 · Views 319
Xinyu Yang@Xinyu2ML·
This blog post is pretty nice! My main concern with discrete dLLMs these days is that the sampling process effectively replaces the true joint distribution with a collection of independent marginals. In other words, what we end up modeling is not the true joint distribution, but an independently factorized approximation. This results in the Trilemma among efficiency, fluency, and diversity. Since independent factorization is fundamentally weaker than autoregressive factorization, discrete dLLMs are, in principle, less expressive than standard autoregressive LLMs when using the same computation (forward*1).

Moreover, even with continuous dLLMs, several fundamental challenges remain in the learning objective, as we discussed in our recent blog post: notion.so/Understanding-…

Related to Planned Diffusion, we previously explored a similar idea in our NeurIPS Spotlight paper, Multiverse (arxiv.org/pdf/2506.09991). The key concept is still semantic parallelism: splitting a task into multiple independent subtasks and then merging them based on an autoregressively generated plan. This provides a principled way to increase parallelism while retaining semantic coherence. However, our approach always utilizes autoregressive generation on every branch to support recursive decomposition and bypass the inherent limitations of independently factorized diffusion sampling. This naturally raises an interesting open question: once we achieve semantic parallelism, do we still need token parallelism at all?
Daniel Israel@danielmisrael

I wrote a blog post: The Parallel Decoding Trilemma. Parallel decoding research has tried to optimize the speed-quality tradeoff, but I believe quality should be decomposed into fluency and diversity.

Replies 1 · Reposts 7 · Likes 79 · Views 18.2K
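A toy numerical illustration of the point about independent marginals above. The two-token distribution is made up purely for illustration.

```python
import itertools

# Toy joint over two strongly correlated tokens: the pair is either
# ("New", "York") or ("Los", "Angeles"), each with probability 0.5.
joint = {("New", "York"): 0.5, ("Los", "Angeles"): 0.5}

# Per-position marginals, as a one-step parallel unmasking effectively sees them.
first = {"New": 0.5, "Los": 0.5}
second = {"York": 0.5, "Angeles": 0.5}

# Sampling the two positions independently replaces the joint with a product of
# marginals, which puts mass on incoherent pairs the true joint never emits.
for a, b in itertools.product(first, second):
    p_indep = first[a] * second[b]
    p_true = joint.get((a, b), 0.0)
    print(f"{a} {b}: independent={p_indep:.2f}, true joint={p_true:.2f}")
# "New Angeles" and "Los York" each get probability 0.25 under the independent
# factorization but probability 0 under the true distribution.
```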
Cunxiao Du@ducx_du·
@danielmisrael This picture from DA-Transformer also explains why one-turn parallel decoding cannot be fluent.
[image]
Replies 1 · Reposts 0 · Likes 1 · Views 409
Daniel Israel@danielmisrael·
I wrote a blog post: The Parallel Decoding Trilemma. Parallel decoding research has tried to optimize the speed-quality tradeoff, but I believe quality should be decomposed into fluency and diversity.
Replies 3 · Reposts 19 · Likes 61 · Views 14.6K
Cunxiao Du@ducx_du·
I completely agree. As I wrote at the end of my blog post and in tweet 11 of the thread: humans don't reason in absolute token slots (“what's the 25th word from now?”). Mask diffusion does. We think in latent plans: functions before code, structure before wording, ideas before tokens. That's what a better diffusion model should capture. This is very much in line with your perspective, and there are already some papers exploring such latent plans. Personally, I highly recommend Skeleton of Thought and Multiverse @Xinyu2ML . However, these methods are closer to post-training or inference-time algorithms than to something that can be applied at the pre-training level. If we want to truly beat left-to-right autoregressive models, I believe we need to figure out how to realize what these methods are doing directly at the pre-training level.
John F. Wu@jwuphysics

@ducx_du @sedielem I'm obviously being naive here, but I'm thinking something like an outer loop "big picture/outlining" diffusion-like approach, with an inner loop "detail oriented" autoregressive model. Just basing this on how I read (skim + focus) and also write (outline + details).

Replies 1 · Reposts 1 · Likes 17 · Views 3.6K
Cunxiao Du@ducx_du·
I think this is strongly related to the prior of natural language. I do not deny that there are many meaningful conditional distributions in any-order training: for example, fill-in-the-middle patterns like “a, b, [mask], c, d” are clearly valuable. However, I believe that for the vast majority of predictions that lack locality, the resulting gradient will have a very low signal-to-noise ratio. Humans, for instance, do not naturally try to predict the token 20 positions ahead, yet diffusion-style training is filled with many similarly non-local prediction targets.
Replies 0 · Reposts 0 · Likes 1 · Views 18
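A quick simulation of how non-local the prediction targets get under mask-diffusion-style masking. The uniform mask rate, sequence length, and trial count below are arbitrary choices for illustration, not numbers from the blog.

```python
import random

def nearest_context_distances(masked):
    """For each masked position, distance to the nearest unmasked token,
    computed with two linear sweeps; fully masked sequences yield nothing."""
    n, INF = len(masked), float("inf")
    dist = [INF] * n
    last = None
    for i in range(n):                       # left-to-right sweep
        if not masked[i]:
            last = i
        elif last is not None:
            dist[i] = i - last
    last = None
    for i in reversed(range(n)):             # right-to-left sweep
        if not masked[i]:
            last = i
        elif last is not None:
            dist[i] = min(dist[i], last - i)
    return [d for i, d in enumerate(dist) if masked[i] and d != INF]

total, count = 0.0, 0
for _ in range(2000):
    rate = random.random()                   # mask rate ~ Uniform(0, 1)
    masked = [random.random() < rate for _ in range(512)]
    ds = nearest_context_distances(masked)
    total += sum(ds)
    count += len(ds)
print("average distance from a masked target to its nearest context token:",
      total / count)
# Left-to-right training always predicts at distance 1 from observed context;
# under uniformly sampled mask rates, a sizable share of diffusion-style targets
# sit much farther from any observed token.
```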
Ryan Chesler@ryan_chesler·
@TrelisResearch This is assuming that on every additional pass of the data you have another chance at seeing a different remasking of the data
Replies 2 · Reposts 0 · Likes 1 · Views 40
Trelis Research@TrelisResearch·
Important thread for diffusion language models. It appears that masking outputs results in poor performance because locality (info of one token informing a neighbour) is significantly hampered... [i.e. training on "The MASKED MASKED animal" is quite ineffective, as the two masked tokens depend a lot on each other, and you are wasting forward and backward passes compared to training on "The big wild MASKED"...] This doesn't happen in diffusion image models, as they add noise to the image and feed the full noised image to the model. There is a big difference between a noised pixel (still has info) and a zeroed-out pixel (masked). So this implies masked diffusion models are fundamentally just limited versus autoregressive models...? So perhaps we go back to training diffusion LLMs where diffusion operates in latent space (in which case you still need a latent-to-language decoder)...?
Cunxiao Du@ducx_du

Diffusion LLMs (DLLM) can do “any-order” generation, in principle, more flexible than left-to-right (L2R) LLM. Our main finding is uncomfortable: ➡️ In real language, this flexibility backfires: DLLMs become worse probabilistic models than the L2R / R2L AR LMs. This thread is about why “any order” turns into a curse. (Work with Xinyu Yang @Xinyu2ML , Min Lin @mavenlin , Chao Du @duchao0726 and the team.) Blog Link: notion.so/Understanding-…

Replies 4 · Reposts 4 · Likes 58 · Views 8.8K
Cunxiao Du@ducx_du·
@xordrew There is another rather old-school yet highly promising model that I personally believe in: the HMM (hidden Markov model).
Replies 0 · Reposts 0 · Likes 0 · Views 19
Andrew Dickson@xordrew·
@ducx_du Do you have proposals you like for autoregressive models in latent space? I'd love to see anything like that, but defining the log-likelihood objective (without some horrifying VAE style latents) is intimidating.
Replies 2 · Reposts 0 · Likes 0 · Views 50
Cunxiao Du@ducx_du·
Diffusion LLMs (DLLM) can do “any-order” generation, in principle, more flexible than left-to-right (L2R) LLM. Our main finding is uncomfortable: ➡️ In real language, this flexibility backfires: DLLMs become worse probabilistic models than the L2R / R2L AR LMs. This thread is about why “any order” turns into a curse. (Work with Xinyu Yang @Xinyu2ML , Min Lin @mavenlin , Chao Du @duchao0726 and the team.) Blog Link: notion.so/Understanding-…
Replies 16 · Reposts 73 · Likes 456 · Views 134.8K
Cunxiao Du@ducx_du·
@xordrew Personally, I am more inclined to recommend Skeleton of Thought and Multiverse @Xinyu2ML . However, the problem is that there does not yet seem to be a clear path for applying them at pre-training scale.
Replies 0 · Reposts 0 · Likes 2 · Views 45
Cunxiao Du@ducx_du·
In fact, I have read the insightful Block Diffusion, and around the same time I independently came up with a similar solution to address the compatibility issue with the KV cache, but @mariannearr already did it beautifully. However, Block Diffusion does not change the conclusions of this paper; it merely reduces the length used in Mask Diffusion from the full sequence length to the block length. I have tested cases where the block size is small, and the prediction gap between any-order generation and L2R/R2L is not significant. Therefore, this appears to be a promising direction.
Replies 0 · Reposts 0 · Likes 2 · Views 123
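A rough sketch of the factorization at play in the tweet above (my reading of the Block Diffusion setup; the notation with B blocks of length K is illustrative, not taken from the paper):

```latex
% L2R autoregression vs. block diffusion with B blocks of length K (K << T).
\text{L2R AR:}\quad p(x) \;=\; \prod_{t=1}^{T} p\!\left(x_t \mid x_{<t}\right)
\qquad\qquad
\text{Block diffusion:}\quad p(x) \;=\; \prod_{b=1}^{B} p\!\left(x^{(b)} \mid x^{(<b)}\right)
% where each block-conditional p(x^{(b)} | x^{(<b)}) is modeled by masked
% diffusion over a length-K block rather than over the full sequence length T,
% which is what "reduces the length used in Mask Diffusion" above refers to.
```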
Josh Cason@TheGrizztronic·
@ducx_du If you haven't read about block diffusion, seems like something you'd want to consider for further experiments. x.com/mariannearr/st…
Marianne Arriola@mariannearr

🚨Announcing our #ICLR2025 Oral! 🔥Diffusion LMs are on the rise for parallel text generation! But unlike autoregressive LMs, they struggle with quality, fixed-length constraints & lack of KV caching. 🚀Introducing Block Diffusion—combining autoregressive and diffusion models for the best of both worlds! 👇1/7

Replies 1 · Reposts 0 · Likes 0 · Views 185
Cunxiao Du@ducx_du·
In fact, this is a rather sad story. I started following masked diffusion language models back in 2019, when they were still called mask-predict non-autoregressive transformers (arxiv.org/abs/1904.09324). I was deeply moved by the “any-order” perspective. But during my five-year PhD journey, I slowly realized that masked diffusion models were consistently weaker than autoregressive models for language. It wasn't until six years later that I suddenly understood: the root cause was likely the inductive bias inherent in the data itself.
Replies 1 · Reposts 0 · Likes 2 · Views 117
John F. Wu@jwuphysics·
@sedielem Agree - especially that the inductive biases for arbitrary ordering are quite different than those for natural language!
Replies 1 · Reposts 0 · Likes 2 · Views 512
Sander Dieleman@sedielem·
This is a nice thread on the flip side of modern diffusion language models, most of which make use of a discrete corruption process based on masking. These shortcomings are also why I was originally so excited about continuous diffusion, which allows intermediate noisy embeddings to represent superpositions of possible outcomes (as in CDCD arxiv.org/abs/2211.15089 and many other works). This potentially enables them to go far beyond the "any-order autoregression" of masked discrete diffusion models. Unfortunately, as @PatrickPyn35903, @thjashin and @ruqi_zhang recently convincingly showed in the CANDI paper (arxiv.org/abs/2510.22510), this is not without its problems either. It's starting to look like either a hybrid continuous-discrete or a latent-based approach is the way to go in the longer term.
Cunxiao Du@ducx_du

Diffusion LLMs (DLLM) can do “any-order” generation, in principle, more flexible than left-to-right (L2R) LLM. Our main finding is uncomfortable: ➡️ In real language, this flexibility backfires: DLLMs become worse probabilistic models than the L2R / R2L AR LMs. This thread is about why “any order” turns into a curse. (Work with Xinyu Yang @Xinyu2ML , Min Lin @mavenlin , Chao Du @duchao0726 and the team.) Blog Link: notion.so/Understanding-…

Replies 7 · Reposts 19 · Likes 160 · Views 24.5K
Cunxiao Du@ducx_du·
A slightly different point of view: according to our analysis, both L2R and R2L orders break translation invariance and essentially increase the learning difficulty. Even if we believe that removing inductive bias would help the model learn better, the training objective should still follow a log-sum formulation rather than a sum-of-log one. We cannot expect a single piece of data to be perfectly explained by all possible orders.
Replies 1 · Reposts 0 · Likes 3 · Views 174
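A sketch of the log-sum vs. sum-of-log distinction in symbols (my notation, not a formula from the blog): σ ranges over generation orders and p_σ(x) is the likelihood of x under order σ.

```latex
% Sum-of-log: every order is asked to explain the data.
\mathcal{L}_{\text{sum-of-log}}(x) \;=\; \mathbb{E}_{\sigma}\big[\log p_{\sigma}(x)\big]
% Log-sum: the order is treated as a latent variable and marginalized out.
\mathcal{L}_{\text{log-sum}}(x) \;=\; \log \mathbb{E}_{\sigma}\big[p_{\sigma}(x)\big]
% Jensen's inequality gives
\mathbb{E}_{\sigma}\big[\log p_{\sigma}(x)\big] \;\le\; \log \mathbb{E}_{\sigma}\big[p_{\sigma}(x)\big],
% so the sum-of-log objective penalizes a sample whenever any single order
% explains it poorly, while the log-sum objective only requires that some
% mixture of orders explains it, matching the point made above.
```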
Michael Luo@michaelzluo·
Transformers without positional embeddings are functionally the same as dLLMs and have better scaling laws than transformers with positional embeddings. It is also much easier and more stable to train a transformer than a dLLM, due to the much larger body of prior work. I would position dLLMs as a "cost arbitrage" over LLMs, i.e. faster generation with much higher token throughput.
Cunxiao Du@ducx_du

Diffusion LLMs (DLLM) can do “any-order” generation, in principle, more flexible than left-to-right (L2R) LLM. Our main finding is uncomfortable: ➡️ In real language, this flexibility backfires: DLLMs become worse probabilistic models than the L2R / R2L AR LMs. This thread is about why “any order” turns into a curse. (Work with Xinyu Yang @Xinyu2ML , Min Lin @mavenlin , Chao Du @duchao0726 and the team.) Blog Link: notion.so/Understanding-…

Replies 9 · Reposts 10 · Likes 158 · Views 28.5K
Cunxiao Du@ducx_du·
Thanks a lot for the kind words! I fully agree with your point: whether L2R or R2L dominates is clearly data-dependent. As I wrote in my blog, we can do a simple thought experiment: if we swap the 5th token with the 0th token in all training data, then the optimal order becomes something like 5, 1, 2, 3, 4, 0, 6, …. So for Sudoku, there might not even exist a single “best” order. But I still believe the optimal form of any-order learning should be something like a log-sum over orders, treating the order as a latent variable.
Replies 0 · Reposts 0 · Likes 1 · Views 345
Ashwinee Panda @ICLR2026@PandaAshwinee·
@ducx_du Really nice post, I liked it a lot. I think the two directions forward are latent diffusion, and DLLMs not for “language” but for Sudoku or other settings where “any-order” is actually BiS. What do you think?
Replies 1 · Reposts 0 · Likes 0 · Views 437
oso@osoleve·
@kalomaze @teortaxesTex I'm not speaking about the architecture of general language models here; I'm drawing a counterpoint to the comfy assumption we like to make that everything about language is inherently Markovian. That's an artifact of what we're doing and how we measure it, not a prescription.
Replies 1 · Reposts 0 · Likes 0 · Views 28
Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)
Extremely powerful narrative. Diffusion models by default make no sense because language is Markovian and L2R *or* *R2L* orders are strictly superior. It appears that the only sane way to train DLLMs is with log-sum loss.
Cunxiao Du@ducx_du

Diffusion LLMs (DLLM) can do “any-order” generation, in principle, more flexible than left-to-right (L2R) LLM. Our main finding is uncomfortable: ➡️ In real language, this flexibility backfires: DLLMs become worse probabilistic models than the L2R / R2L AR LMs. This thread is about why “any order” turns into a curse. (Work with Xinyu Yang @Xinyu2ML , Min Lin @mavenlin , Chao Du @duchao0726 and the team.) Blog Link: notion.so/Understanding-…

Replies 6 · Reposts 6 · Likes 105 · Views 10.8K