jere:D 고수
3.2K posts


Introducing DiffusionBlocks: Block-wise Neural Network Training via Diffusion Interpretation pub.sakana.ai/diffusionblocks
What if we didn’t have to hold an entire neural network in memory to train it? Standard neural net training optimizes all parameters jointly. As a result, the memory required during training grows linearly with the depth of the network.
In our #ICLR2026 paper, we propose DiffusionBlocks, a principled framework to train networks one block at a time, drastically reducing memory requirements while matching end-to-end performance. With DiffusionBlocks, we split the network into blocks and train them one at a time, so you only need memory for a single block.
How? We explicitly assign each block a role: to move the representation a little closer to the target than the block before it did. That role turns out to be precisely what a diffusion model does, step by step. Each block only needs to optimize its own objective and can be trained independently.
We validated this across five different architectures:
• ViT
• DiT
• Masked diffusion
• Autoregressive transformers
• Recurrent-depth transformers
In each case, performance is competitive with end-to-end training while using a fraction of the memory.
This perspective also extends naturally to recurrent-depth (Looped) transformers, which apply the same network iteratively and normally require expensive backpropagation through time (BPTT). Viewed through DiffusionBlocks, we can replace those multiple iterations with a single forward pass during training.
Read our paper and code to learn more.
Paper: arxiv.org/abs/2506.14202
GitHub: github.com/SakanaAI/Diffu… 🐟
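The idea in the post (each block gets its own denoising objective and is trained independently) can be illustrated with a toy sketch. This is not the paper's implementation: the 1-D regression task, the scalar linear "blocks", the noise levels in `sigmas`, the denoise-then-re-noise sampler, and every function name here are assumptions made up for illustration. It only shows the shape of the scheme: a block sees targets corrupted at its own range of noise levels, learns to denoise them, and never needs gradients from any other block.

```python
import random

def make_data(n, rng):
    # Hypothetical toy task: inputs x in [-1, 1], targets y = 2x + 1.
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, 2.0 * x + 1.0) for x in xs]

def train_block(data, sigma_lo, sigma_hi, rng, steps=20000, lr=0.005):
    # One "block" is a tiny linear denoiser (w_z, w_x, b), trained alone.
    # Only this block's three parameters are held while it trains.
    w_z = w_x = b = 0.0
    for _ in range(steps):
        x, y = rng.choice(data)
        sigma = rng.uniform(sigma_lo, sigma_hi)   # this block's noise range
        z = y + sigma * rng.gauss(0.0, 1.0)       # corrupted target
        err = (w_z * z + w_x * x + b) - y         # denoising residual
        # Plain SGD step on the squared denoising error.
        w_z -= lr * err * z
        w_x -= lr * err * x
        b -= lr * err
    return w_z, w_x, b

def run_blocks(blocks, sigmas, x, rng):
    # Inference: start from noise, let each block denoise, then re-noise
    # to the next (lower) level. The last level is 0, so the final value
    # is the last block's clean estimate.
    z = sigmas[0] * rng.gauss(0.0, 1.0)
    for (w_z, w_x, b), s_next in zip(blocks, sigmas[1:]):
        y_hat = w_z * z + w_x * x + b
        z = y_hat + s_next * rng.gauss(0.0, 1.0)
    return z

def demo(seed=0):
    rng = random.Random(seed)
    data = make_data(200, rng)
    sigmas = [2.0, 4.0 / 3.0, 2.0 / 3.0, 0.0]     # assumed noise schedule
    # Each block is trained independently on its own [lo, hi] interval.
    blocks = [train_block(data, lo, hi, rng)
              for hi, lo in zip(sigmas, sigmas[1:])]
    errs = [abs(run_blocks(blocks, sigmas, x, rng) - y) for x, y in data]
    return blocks, sum(errs) / len(errs)
```

Calling `demo()` returns the three trained blocks and the mean absolute error of the chained prediction. Because the blocks are trained in separate calls, they could in principle be trained on different machines or one after another with the previous block's training state discarded, which is the memory argument the post makes.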

Huh, BiternionNet finished training in just one minute.

"I was definitely the first prompt engineer at Anthropic. Might have been the first in the world." Alex Albert just spent 35 minutes explaining how they train Claude's personality from the inside. 35 minutes. free. by the person who invented the role. most people think Claude's character is a system prompt. it's not. you'll never look at Claude the same way.

In Oct last year, Representation Autoencoders provided an elegant solution to unified tokenization for understanding and generation. Today we make them a bit simpler and a bit more general. Result: >10x faster convergence, better reconstruction, better generation. And yes, we test them on T2I and world models :) Introducing RAEv2


Many VLM papers propose new connectors claiming improved performance, but then every follow-up paper just replaces them with a super simple MLP (1 or 2 layers). Simplicity wins again.


Who has the deepest network of them all?