Nathan Barry

301 posts


@nathanrs

Man in the Arena Allocator. Prev @Apple, CS + Math @UTAustin, @zfellows

Austin, TX · Joined June 2020
354 Following · 2.8K Followers
Pinned Tweet
Nathan Barry
Nathan Barry@nathanrs·
Rewrote tiny-diffusion to be 3x smaller! Went from 951 lines to just 364, all contained in one file. As simple as possible, but not simpler. I also added a tiny GPT implementation as a comparison (312 lines, inspired by @karpathy). The two implementations are ~80% identical. The model architecture, training loop, tokenization, etc., only differ in 19 lines of code. The main differences are contained within two functions (generate and get_batch). The reason to include the GPT implementation was to show how similar autoregressive LMs are to diffusion LMs on an architectural level. Only *1* line of code in the architecture needs to be modified to support masked language diffusion instead of next-token prediction (by disabling causal masking). Link to the repo is in the comments.
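The one-line architectural difference described above (disabling the causal mask) can be sketched as a toggle in a single-head attention function. This is a minimal NumPy illustration with made-up shapes, not the actual tiny-diffusion code:

```python
import numpy as np

def attention(q, k, v, causal: bool):
    # q, k, v: (seq, head_dim); single head for clarity
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if causal:
        # GPT-style: position i may only attend to positions <= i
        T = q.shape[0]
        mask = np.triu(np.ones((T, T), dtype=bool), k=1)
        scores = np.where(mask, -np.inf, scores)
    # masked-diffusion-style: causal=False, full bidirectional attention
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v
```

Everything else in the block (projections, MLP, residuals) is identical between the two architectures; only this flag changes.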
Nathan Barry@nathanrs

Playing around with training a tiny 11M parameter character-level text diffusion model! It's a WIP but the code is currently a heavily modified nanochat GPT implementation (changed from autoregressive decoding to diffusion) and trained on the Tiny Shakespeare dataset. The naive masking schedule gives every token a uniform masking probability at each iteration. Newer approaches mask in block chunks from left to right, which improves output quality and allows some KVCache reuse. I realized you can actually apply masking in any arbitrary manner during the generation process. Below you can see I applied masking based on the rules of Conway's Game of Life. I wonder if there are any unusual masking strategies like this that provide benefits. Regardless, this is a very interesting and mesmerizing way to corrupt and deform text.
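The naive uniform masking schedule described above can be sketched as follows (a minimal illustration; `MASK_ID` and the shapes are assumptions, not the actual code). Any other strategy, including the Game of Life one, just swaps the Bernoulli draw for an arbitrary boolean grid:

```python
import numpy as np

MASK_ID = 0  # hypothetical id reserved for the [MASK] token

def uniform_mask(tokens: np.ndarray, t: float, rng=None):
    """Naive masking schedule: every token is independently replaced
    by [MASK] with probability t (the noise level for this step)."""
    rng = rng or np.random.default_rng()
    mask = rng.random(tokens.shape) < t  # any boolean grid works here
    noisy = np.where(mask, MASK_ID, tokens)
    return noisy, mask
```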

Tex Software
Tex Software@texheavy·
Intelligence for the heavy equipment industry. Made in the USA 🇺🇸
Nathan Barry
Nathan Barry@nathanrs·
@oussamazekri_ @theo_uscidda @Korba_Anna @CNRS @JulesSamaran @LucaEyring @ssahoo @ChenyuW64562111 @olivierhenaff @Pierrot_Clavier @WeiGuo01 @Jaeyeon_Kim_0 @YuchenZhu_ZYC @AlanNawzadAmin @RosieZ0512 @dvruette @jdeschena @zhihanyang_ @SchiffYair @Guanghan__Wang @mariannearr @vincentpaulinef @sansa19739319 @aaron_lou @AndrewC_ML @thjashin @ArnaudDoucet1 Some previous papers (like the original D3PM) tried semantic noising via embedding distance, which performed worse than masking and uniform. What are the specific additions that substantially improved the performance? And what were these other formulations missing?
Nathan Barry tweet media
Oussama Zekri
Oussama Zekri@oussamazekri_·
What if discrete diffusion didn’t have to be stuck with mask or uniform noise? 🤔 In our new paper, we show how to go beyond them, unlocking much richer noising processes. And the empirical results are surprisingly strong! 🚀 🌐 Project Page: oussamazekri.fr/gdds 📑 Paper: arxiv.org/pdf/2603.21342 💻 Code: github.com/ozekri/gdds Thread below 🧵
Nathan Barry
Nathan Barry@nathanrs·
@0xSero Let's say 16 GB. Want to see whether edit prediction can be made viable to run locally (for most people).
0xSero
0xSero@0xSero·
@nathanrs How much memory does your Mac have?
0xSero
0xSero@0xSero·
Composer-2 in Zed, going to see if I can somehow get it into Droid lol.
Nathan Barry
Nathan Barry@nathanrs·
@josesaezmerino @ElevenYellow Wow, this reminds me of my TreeHacks project! Did the same thing, but we built a physical camera that printed the image on printer paper.
Nathan Barry tweet media (two images)
Jose
Jose@josesaezmerino·
This is TIMEBOY, your personal time travel device. Go anywhere in the world and see what it was like across different eras thanks to AI recreations. Timeboy even uses location data for historical accuracy. Last month I joined the amazing team at @ElevenYellow and this is my first project with them.
Nathan Barry
Nathan Barry@nathanrs·
@StefanoErmon Can I interview for the MLE position? I'm doing my master's thesis on optimizing dLLM inference with novel KVCache approximation methods, so I'm pretty familiar with the area. Currently interviewing at a bunch of places and will probably wrap up my job search in the next 2-3 weeks.
Nathan Barry
Nathan Barry@nathanrs·
Diffusion LLMs are becoming very competitive architectures. But recently, there's also been a lot of progress in flow-based LLMs, which are conceptually similar. Both learn to transport samples from a noise distribution to a data distribution. Image generation used to be dominated by diffusion models but the leading models have since shifted to flow matching, largely because flow produces straighter trajectories that are easier to traverse in fewer steps without degrading quality. Categorical data (language) is certainly harder than continuous data (image latents) for flow. It'll be interesting to see whether language ends up following the same trajectory as images (pun intended).
Stefano Ermon@StefanoErmon

Mercury 2 is live 🚀🚀 The world’s first reasoning diffusion LLM, delivering 5x faster performance than leading speed-optimized LLMs. Watching the team turn years of research into a real product never gets old, and I’m incredibly proud of what we’ve built. We’re just getting started on what diffusion can do for language.

Nathan Barry reposted
Nicholas Boffi
Nicholas Boffi@nmboffi·
We just brought flow maps to language modeling for one-step sequence generation 💥 Discrete diffusion is not necessary -- continuous flows over one-hot encodings achieve SoTA performance and ≥8.3× faster generation 🔥 We believe this is a major step forward for discrete generative modeling and language modeling alike. 🚀 Full thread from first author @chandavidlee: x.com/chandavidlee/s…
Nathan Barry
Nathan Barry@nathanrs·
Built a camera that transforms your photos with diffusion models and prints them instantly on receipt paper
Nathan Barry tweet media
Nathan Barry
Nathan Barry@nathanrs·
LLaDA 2.1 was released, a 100B parameter diffusion language model with self-correction capabilities. It can fix previous tokens by adopting a mixture of masking/state-absorption and uniform diffusion, similar to GIDD. In a previous post, I mentioned that Google Gemini and Inception Labs' Mercury might have done something similar. A few people in the comments suggested that they use masking + re-masking instead (so masking without the state-absorption property). I wonder how these two approaches compare. Both allow for self-correction and thus more progress per step, by being more robust to taking larger steps through the diffusion process. Masking + re-masking might have some benefits like a simplified training objective, a stronger inductive bias (which is arguably a good thing), and easier use with KVCache approximation (since fewer tokens change per step). My only question is: how does the departure from state-absorption change things?

The simplified training objective from masking (which reduces to a weighted MLM objective) comes from the state-absorption property. But does re-masking actually change this? The state-absorption property just means each token undergoes a single transition ([MASK] -> predicted token, which then never changes). Re-masking a token, of course, causes it to go through multiple transitions.

But does it really? Re-masking seems like you are just "jumping" to a more likely trajectory to account for the accumulation of errors. So instead of causing multiple state transitions, it could be viewed as jumping to a better trajectory where that "poor" transition was never made. It will be interesting to see more formalization of this and how it compares to GIDD at scale.
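One way the masking + re-masking idea could look inside a decode step, as a rough sketch (hypothetical function over a 1-D sequence with confidence-based re-masking; not taken from LLaDA 2.1 or GIDD):

```python
import numpy as np

MASK_ID = 0  # hypothetical id reserved for [MASK]

def remask_step(logits, tokens, k):
    """One hypothetical decode step with re-masking: fill every masked
    position with its argmax prediction, then re-mask the k filled
    positions the model was least confident about so a later step can
    revise them. Assumes k <= number of currently masked positions."""
    # softmax over the vocab axis
    p = np.exp(logits - logits.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)
    pred = p.argmax(axis=-1)
    conf = p.max(axis=-1)
    masked = tokens == MASK_ID
    out = np.where(masked, pred, tokens)
    # only freshly filled positions are candidates for re-masking
    conf = np.where(masked, conf, np.inf)
    out[np.argsort(conf)[:k]] = MASK_ID
    return out
```

Under pure state-absorption, the last two lines would simply be dropped: a filled token would never transition back to [MASK].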
Ant Open Source@ant_oss

What if an LLM could EDIT its own tokens in real-time, not just generate them? 🤯 Introducing LLaDA2.1 — a diffusion model that breaks from autoregressive dominance. It drafts fast, then fixes its own mistakes on the fly with Token-to-Token editing. The result? 892 tokens/sec on a 100B model. 🔥 ⚡ 892 TPS on HumanEval+ (coding) ⚡ 801 TPS on BigCodeBench 🧠 Real-time self-correction via T2T editing ✅ @lmsysorg SGLang Day 0 support — production-ready now A "non-consensus" architecture now challenging the mainstream. Open-sourced TODAY. 👇 #LLaDA #TokenEditing #OpenSource #LLM #dLLM

Wayframe
Wayframe@Wayframe·
Introducing Wayframe. Make new designs from words. Wayframe.com
Nathan Barry
Nathan Barry@nathanrs·
Was doing a deeper literature review and found one of my new favorite paper titles ever: "BERT has a Mouth, and It Must Speak". It was one of the earliest papers to do something akin to state-absorption diffusion language modeling.
Nathan Barry tweet media
Nathan Barry
Nathan Barry@nathanrs·
@dvruette In what ways do you think? There's been some work integrating these models to help guide AR transformer generation. N-grams work on contiguous sequences, which makes autoregression natural. It's not apparent to me how this would work with dLLMs, given their out-of-order generation.
Dimitri von Rütte
Dimitri von Rütte@dvruette·
@nathanrs i think there’s a good chance that some day in the not too distant future, infinigram will power SOTA (diffusion-based) language models
Nathan Barry
Nathan Barry@nathanrs·
Created tiny-infini-gram, a training-free language model which can generate Shakespeare 250x faster than nanoGPT! Last year, I read about unbounded n-gram language models, which solve the exponential space problem for classical n-grams that made using large n intractable. By using suffix arrays, we can simulate any arbitrary-sized n-gram lookup table in logarithmic time.

Since I’ve been testing different small language models recently, I decided to implement this n-gram variant, and was surprised at how good the results were.
Previous papers (to my knowledge) haven't used this for language generation because earlier sampling methods caused infinite perplexity and verbatim copying. I solved these issues by creating Selective Back-off Interpolation Sampling, which mixes probability distributions from multiple n-gram levels to balance quality and novelty.
A detailed write-up is linked in the comments.
dadabots
dadabots@dadabots·
@nathanrs you animal, that’s two home runs in a row
