Nathan Barry (@nathanrs) - Twitter Profile

Pinned Tweet

Rewrote tiny-diffusion to be 3x smaller! Went from 951 lines to just 364, all contained in one file. As simple as possible, but not simpler. I also added a tiny GPT implementation as a comparison (312 lines, inspired by @karpathy). The two implementations are ~80% identical. The model architecture, training loop, tokenization, etc., only differ in 19 lines of code. The main differences are contained within two functions (generate and get_batch). The reason to include the GPT implementation was to show how similar autoregressive LMs are to diffusion LMs on an architectural level. Only *1* line of code in the architecture needs to be modified to support masked language diffusion instead of next-token prediction (by disabling causal masking). Link to the repo is in the comments.
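
To make the "one line" point concrete, here is a minimal pure-Python sketch of attention weights with a causal switch. It is not code from the tiny-diffusion repo; the function name `attention_weights` and the `causal` flag are hypothetical, and a real implementation would use a tensor library.

```python
import math

def attention_weights(scores, causal):
    """Row-wise softmax over a square matrix of attention scores.

    causal=True  -> query i only sees keys j <= i (next-token prediction).
    causal=False -> every query sees every key (masked language diffusion).
    """
    n = len(scores)
    weights = []
    for i in range(n):
        # The single architectural switch: which keys are visible to query i.
        visible = [j for j in range(n) if (j <= i or not causal)]
        top = max(scores[i][j] for j in visible)  # subtract max for stability
        exps = {j: math.exp(scores[i][j] - top) for j in visible}
        total = sum(exps.values())
        weights.append([exps.get(j, 0.0) / total for j in range(n)])
    return weights

# Uniform scores keep the effect of the mask easy to read.
scores = [[0.0] * 4 for _ in range(4)]
gpt_weights = attention_weights(scores, causal=True)
diffusion_weights = attention_weights(scores, causal=False)
```

With uniform scores, the first row of `gpt_weights` puts all weight on position 0, while every row of `diffusion_weights` spreads weight evenly over all four positions.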

Nathan Barry@nathanrs

Playing around with training a tiny 11M parameter character-level text diffusion model! It's a WIP but the code is currently a heavily modified nanochat gpt implementation (to change from autoregressive decoding to diffusion) and trained on the Tiny Shakespeare dataset. The naive implementation of a masking schedule is having a uniform masking probability for each token for each iteration. Newer approaches mask in block chunks from left to right, which improves output quality and allows some KVCache reuse. I realized you can actually apply masking in any arbitrary manner during the generation process. Below you can see I applied masking based on the rules of Conway's Game of Life. I wonder if there are any unusual masking strategies like this which provide benefits. Regardless, this is a very interesting and mesmerizing way to corrupt and deform text.
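
The three masking strategies described above can be sketched in a few lines of pure Python. This is an illustration under assumptions, not the author's code: the helper names, the `_` mask symbol, and the non-wrapping grid edges are all hypothetical choices.

```python
import random

MASK = "_"  # hypothetical display symbol for a masked character

def uniform_mask(length, p, rng):
    """Naive schedule: mask each position independently with probability p."""
    return [rng.random() < p for _ in range(length)]

def block_mask(length, block_size, blocks_done):
    """Left-to-right blocks: the first blocks_done blocks are revealed, the rest masked."""
    return [i >= blocks_done * block_size for i in range(length)]

def life_step(grid):
    """One Game of Life generation on a 2D boolean grid (True = masked)."""
    rows, cols = len(grid), len(grid[0])

    def neighbours(r, c):
        # Count live cells in the 3x3 window around (r, c), excluding the cell itself.
        return sum(
            grid[rr][cc]
            for rr in range(max(0, r - 1), min(rows, r + 2))
            for cc in range(max(0, c - 1), min(cols, c + 2))
            if (rr, cc) != (r, c)
        )

    return [
        [neighbours(r, c) == 3 or (grid[r][c] and neighbours(r, c) == 2)
         for c in range(cols)]
        for r in range(rows)
    ]

def apply_mask(text, mask):
    """Replace every masked character of text with the MASK symbol."""
    return "".join(MASK if m else ch for ch, m in zip(text, mask))

# A "blinker": a horizontal bar of masked cells that flips to a vertical bar.
lines = ["abc", "def", "ghi"]
grid = [[False, False, False], [True, True, True], [False, False, False]]
before = [apply_mask(line, row) for line, row in zip(lines, grid)]
after = [apply_mask(line, row) for line, row in zip(lines, life_step(grid))]
blocks = apply_mask("abcdef", block_mask(6, block_size=2, blocks_done=1))
noisy = apply_mask("abcdef", uniform_mask(6, 0.5, random.Random(0)))
```

Here `before` masks the middle row of the text grid and `after` masks the middle column, so feeding each generation's grid back into the sampler gives a mask pattern that evolves between denoising steps.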

24

103

1.2K

160.9K

Nathan Barry@nathanrs·25 Mar

@texheavy Nice job @fedpoasts and gang!

1

0

6

140

Tex Software@texheavy·25 Mar

Intelligence for the heavy equipment industry. Made in the USA 🇺🇸

42

38

252

66.5K

Nathan Barry@nathanrs·24 Mar

@oussamazekri_ @theo_uscidda @Korba_Anna @CNRS @JulesSamaran @LucaEyring @ssahoo @ChenyuW64562111 @olivierhenaff @Pierrot_Clavier @WeiGuo01 @Jaeyeon_Kim_0 @YuchenZhu_ZYC @AlanNawzadAmin @RosieZ0512 @dvruette @jdeschena @zhihanyang_ @SchiffYair @Guanghan__Wang @mariannearr @vincentpaulinef @sansa19739319 @aaron_lou @AndrewC_ML @thjashin @ArnaudDoucet1 Some previous papers (like the original D3PM) tried semantic noising via embedding distance, which performed worse than masking and uniform. What are the specific additions that substantially improved the performance? And what were these other formulations missing?

1

0

2

151

Oussama Zekri@oussamazekri_·24 Mar

@theo_uscidda @Korba_Anna @CNRS @JulesSamaran @LucaEyring @ssahoo @ChenyuW64562111 @olivierhenaff @Pierrot_Clavier @WeiGuo01 @Jaeyeon_Kim_0 @YuchenZhu_ZYC @AlanNawzadAmin @RosieZ0512 And tagging a few people who I think might be especially interested in this work: @dvruette @jdeschena @zhihanyang_ @SchiffYair @Guanghan__Wang @mariannearr @vincentpaulinef @nathanrs @sansa19739319 @aaron_lou @AndrewC_ML @thjashin @ArnaudDoucet1

2

0

5

402

Oussama Zekri@oussamazekri_·24 Mar

What if discrete diffusion didn’t have to be stuck with mask or uniform noise? 🤔 In our new paper, we show how to go beyond them, unlocking much richer noising processes. And the empirical results are surprisingly strong! 🚀 🌐 Project Page: oussamazekri.fr/gdds 📑 Paper: arxiv.org/pdf/2603.21342 💻 Code: github.com/ozekri/gdds Thread below 🧵

5

20

98

9.6K

Nathan Barry@nathanrs·24 Mar

@0xSero Let's say 16 GB. Want to see how viable it is to run edit prediction locally (for most people).

0

76

0xSero@0xSero·24 Mar

@nathanrs How much memory does your Mac have?

1

0

265

0xSero@0xSero·23 Mar

Composer-2 in Zed, going to see if I can somehow get it into Droid lol.

6

1

85

8.8K

Nathan Barry@nathanrs·13 Mar

@sdand Great looking site!

0

1

123

Nathan Barry@nathanrs·28 Feb

@josesaezmerino @ElevenYellow Wow, this reminds me of my TreeHacks project! Did the same thing, but we built a physical camera that printed the image on printer paper.

1

0

7

617

Jose@josesaezmerino·27 Feb

This is TIMEBOY, your personal time travel device. Go anywhere in the world and see what it was like across different eras thanks to AI recreations. Timeboy even uses location data for historical accuracy. Last month I joined the amazing team at @ElevenYellow and this is my first project with them.

57

71

1.2K

107.5K

Nathan Barry@nathanrs·27 Feb

@haha_whatsgood @arnie_hacker @karpathy I named it tiny diffusion just in case @karpathy wanted to make nano diffusion!

0

66

devler@haha_whatsgood·27 Feb

@arnie_hacker @karpathy @nathanrs has a pretty good one github.com/nathan-barry/t…

1

0

11

546

Arnie Ramesh@arnie_hacker·26 Feb

when is nanodiffusion coming out @karpathy 👀

8

1

89

14.2K

Nathan Barry@nathanrs·25 Feb

@StefanoErmon Can I interview for the MLE position? Doing my master's thesis on optimizing dLLM inference with novel KVCache approximation methods, and am pretty familiar with the area. Currently interviewing at a bunch of places and will probably wrap up my job search in the next 2-3 weeks.

1

0

3

330

Stefano Ermon@StefanoErmon·24 Feb

@nathanrs We are excited to find out!

1

0

7

576

Nathan Barry@nathanrs·24 Feb

Diffusion LLMs are becoming very competitive architectures. But recently, there's also been a lot of progress in flow-based LLMs, which are conceptually similar. Both learn to transport samples from a noise distribution to a data distribution. Image generation used to be dominated by diffusion models but the leading models have since shifted to flow matching, largely because flow produces straighter trajectories that are easier to traverse in fewer steps without degrading quality. Categorical data (language) is certainly harder than continuous data (image latents) for flow. It'll be interesting to see whether language ends up following the same trajectory as images (pun intended).
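
A toy illustration of why straight, constant-velocity paths are cheap to traverse. This is a sketch under assumptions, not any model's sampler: it uses a single noise/data pair with the linear interpolation path common in flow matching, and the function names are hypothetical. A learned velocity field over a whole dataset is generally not this simple.

```python
def interpolate(x0, x1, t):
    """Point at time t on the straight line from noise x0 to data x1."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]

def euler_integrate(x0, velocity_fn, steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed-size Euler steps."""
    x, dt = list(x0), 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

noise, data = [0.0, 0.0], [2.0, -4.0]
delta = [b - a for a, b in zip(noise, data)]

# Straight path at constant speed: the velocity is always x1 - x0.
straight = lambda x, t: delta
# Same endpoints, but the speed varies over time: x_t = x0 + t^2 * (x1 - x0).
uneven = lambda x, t: [2 * t * d for d in delta]

straight_1 = euler_integrate(noise, straight, steps=1)
uneven_1 = euler_integrate(noise, uneven, steps=1)
uneven_4 = euler_integrate(noise, uneven, steps=4)
midpoint = interpolate(noise, data, 0.5)
```

With the constant velocity a single Euler step lands exactly on `data`, whereas the time-varying field has not moved at all after one step and has covered only three quarters of the distance after four.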

Stefano Ermon@StefanoErmon

Mercury 2 is live 🚀🚀 The world’s first reasoning diffusion LLM, delivering 5x faster performance than leading speed-optimized LLMs. Watching the team turn years of research into a real product never gets old, and I’m incredibly proud of what we’ve built. We’re just getting started on what diffusion can do for language.

4

8

130

13.1K

Nathan Barry retweeted

Nicholas Boffi@nmboffi·20 Feb

We just brought flow maps to language modeling for one-step sequence generation 💥 Discrete diffusion is not necessary -- continuous flows over one-hot encodings achieve SoTA performance and ≥8.3× faster generation 🔥 We believe this is a major step forward for discrete generative modeling and language modeling alike. 🚀 Full thread from first author @chandavidlee: x.com/chandavidlee/s…

4

45

250

41.9K

Nathan Barry@nathanrs·17 Feb

Our project won both @neo’s prize and the Most Creative Prize at @hackwithtrees. Was fun working with @alexkranias, Pranav, and Lainey!

Nathan Barry@nathanrs

Built a camera that transforms your photos with diffusion models and prints them instantly on receipt paper

9

1

58

5.1K

Nathan Barry@nathanrs·17 Feb

Link to project: diffuji.com

0

6

546

Nathan Barry@nathanrs·17 Feb

More photos

1

0

4

615

Nathan Barry@nathanrs·17 Feb

Built a camera that transforms your photos with diffusion models and prints them instantly on receipt paper

7

1

34

6.7K

Nathan Barry@nathanrs·11 Feb

LLaDA 2.1 was released, a 100B parameter diffusion language model with self-correction capabilities. They are able to fix previous tokens by adopting a mixture of masking/state-absorption and uniform diffusion, similar to GIDD. In a previous post, I mentioned that Google Gemini and Inception Labs’ Mercury might have done something similar. A few people in the comments suggested that they use masking + re-masking instead (so masking without the state-absorption property). I wonder how these two approaches compare. They both allow for self-correction and (thus) allow for more progress per step in the diffusion process (by being more robust to taking larger steps through the diffusion process). Masking + re-masking might have some benefits like a simplified training objective, stronger inductive bias (which is arguably a good thing), and easier use with KVCache approximation (due to fewer tokens changing per step). My only question is: how does the departure from state-absorption change things? The simplified training objective from masking (which reduces to a weighted MLM training objective) comes from this state-absorption property. But does re-masking actually change this? The state-absorption property is just that each token undergoes one transition only ([MASK] -> predicted token, and never changes). Re-masking a token, of course, causes it to go through multiple transitions. But does it really? Re-masking seems like you are just “jumping” to a more likely trajectory to account for the accumulation of errors. So instead of causing multiple state transitions, it could be just viewed as jumping to a better trajectory where that “poor” transition wasn’t made. Will be interesting to see more formalization of this and how it compares to GIDD at scale.
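
The "weighted MLM" reduction mentioned above can be written out as a worked formula. This is a sketch of a commonly used form, assuming a linear schedule in which each token of a clean sequence x_0 of length L is independently replaced by [MASK] with probability t; the symbols are illustrative and are not taken from the LLaDA 2.1 release.

```latex
\mathcal{L}(\theta)
  = -\,\mathbb{E}_{t \sim \mathcal{U}(0,1],\; x_0,\; x_t}
    \left[ \frac{1}{t} \sum_{i=1}^{L}
      \mathbf{1}\!\left[ x_t^{i} = \texttt{[MASK]} \right]
      \log p_\theta\!\left( x_0^{i} \mid x_t \right) \right]
```

Because an absorbed (masked) position is never noised further, only masked positions contribute to the sum, and each contributes an ordinary masked-language-modeling cross-entropy term weighted by 1/t.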

Ant Open Source@ant_oss

What if an LLM could EDIT its own tokens in real-time, not just generate them? 🤯 Introducing LLaDA2.1 — a diffusion model that breaks from autoregressive dominance. It drafts fast, then fixes its own mistakes on the fly with Token-to-Token editing. The result? 892 tokens/sec on a 100B model. 🔥 ⚡ 892 TPS on HumanEval+ (coding) ⚡ 801 TPS on BigCodeBench 🧠 Real-time self-correction via T2T editing ✅ @lmsysorg SGLang Day 0 support — production-ready now A "non-consensus" architecture now challenging the mainstream. Open-sourced TODAY. 👇 #LLaDA #TokenEditing #OpenSource #LLM #dLLM

0

3

34

3.3K

Nathan Barry@nathanrs·6 Feb

@Wayframe Looks great @michaelgold3n

0

1

258

Wayframe@Wayframe·6 Feb

Introducing Wayframe. Make new designs from words. Wayframe.com

28

17

136

44.4K

Nathan Barry@nathanrs·2 Feb

Was doing a deeper literature review and found one of my new favorite paper titles ever: “BERT has a Mouth, and It Must Speak”. It was one of the earliest papers to do something akin to state-absorption diffusion language modeling.

0

6

94

5.3K

Nathan Barry@nathanrs·21 Jan

@dvruette In what ways do you think? There's been some work integrating these models to help guide AR transformer generation. N-grams work on sequences, which makes autoregression natural. It's not apparent to me how it would work with dLLMs due to the out-of-order generation.

1

0

5

456

Dimitri von Rütte@dvruette·21 Jan

@nathanrs i think there’s a good chance that some day in the not too distant future, infinigram will power SOTA (diffusion-based) language models

1

0

10

631

Nathan Barry@nathanrs·20 Jan

Created tiny-infini-gram, a training-free language model which can generate Shakespeare 250x faster than nanoGPT! Last year, I read about unbounded n-gram language models, which solve the exponential space problem for classical n-grams that made using large n intractable. By using suffix arrays, we can simulate any arbitrary-sized n-gram lookup table in logarithmic time. Since I’ve been testing different small language models recently, I decided to implement this n-gram variant, and was surprised at how good the results were. Previous papers (to my knowledge) haven’t used this for language generation due to previous sampling methods causing infinite perplexity and verbatim copying. I solved these issues by creating Selective Back-off Interpolation Sampling, which mixes probability distributions from multiple n-gram levels to balance quality and novelty. A detailed write-up is linked in the comments.
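
The suffix-array lookup described above can be sketched as follows. This is a minimal illustration, not the tiny-infini-gram code: the function names are hypothetical, the construction is a naive sort that only suits small texts, and the plain longest-match back-off shown here is a simpler stand-in for the author's Selective Back-off Interpolation Sampling, which is not reproduced.

```python
from collections import Counter

def build_suffix_array(text):
    """Start positions of all suffixes in sorted order (naive construction)."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_range(text, sa, pattern):
    """Binary-search the half-open range of sa whose suffixes start with pattern."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-char prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:  # first suffix whose m-char prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return start, lo

def next_char_counts(text, sa, context):
    """Count the characters that follow each occurrence of context in text."""
    lo, hi = find_range(text, sa, context)
    counts = Counter()
    for k in range(lo, hi):
        nxt = sa[k] + len(context)
        if nxt < len(text):
            counts[text[nxt]] += 1
    return counts

def longest_match_counts(text, sa, context):
    """Back off to shorter suffixes of context until some continuation exists."""
    for start in range(len(context) + 1):
        counts = next_char_counts(text, sa, context[start:])
        if counts:
            return context[start:], counts
    return "", Counter()

text = "abracadabra"
sa = build_suffix_array(text)
lo, hi = find_range(text, sa, "abra")
after_a = next_char_counts(text, sa, "a")
matched, counts = longest_match_counts(text, sa, "xbra")
```

The number of occurrences of any context is just `hi - lo`, so the same array answers queries for every n without storing a separate table per n-gram order. The unseen context "xbra" backs off to "bra", whose only observed continuation in this text is "c".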

14

26

288

14.6K

Nathan Barry@nathanrs·20 Jan

@dadabots The goal in life is to be high signal

1

0

1

25

dadabots@dadabots·20 Jan

@nathanrs you animal, that’s two home runs in a row

Nathan Barry@nathanrs

Rewrote tiny-diffusion to be 3x smaller! Went from 951 lines to just 364, all contained in one file. As simple as possible, but not simpler. I also added a tiny GPT implementation as a comparison (312 lines, inspired by @karpathy). The two implementations are ~80% identical. The model architecture, training loop, tokenization, etc., only differ in 19 lines of code. The main differences are contained within two functions (generate and get_batch). The reason to include the GPT implementation was to show how similar autoregressive LMs are to diffusion LMs on an architectural level. Only *1* line of code in the architecture needs to be modified to support masked language diffusion instead of next-token prediction (by disabling causal masking). Link to the repo is in the comments.

1

0

4

455

dadabots@dadabots·20 Jan

good ol fashioned autoregression making leaps —> who wants to try this on music?

Nathan Barry@nathanrs

Created tiny-infini-gram, a training-free language model which can generate Shakespeare 250x faster than nanoGPT! Last year, I read about unbounded n-gram language models, which solve the exponential space problem for classical n-grams that made using large n intractable. By using suffix arrays, we can simulate any arbitrary-sized n-gram lookup table in logarithmic time. Since I’ve been testing different small language models recently, I decided to implement this n-gram variant, and was surprised at how good the results were. Previous papers (to my knowledge) haven’t used this for language generation due to previous sampling methods causing infinite perplexity and verbatim copying. I solved these issues by creating Selective Back-off Interpolation Sampling, which mixes probability distributions from multiple n-gram levels to balance quality and novelty. A detailed write-up is linked in the comments.

2

0

10

716
