jere:D 고수

3.2K posts

jere:D 고수

@CoolMFcat

Cvit mediterana Katılım Mart 2017

2.9K Takip Edilen191 Takipçiler

Sabitlenmiş Tweet

jere:D 고수@CoolMFcat·30 Haz

Golden rule of the second foundation: do nothing unless you must, and when you must act - hesitate

English

1.8K

jere:D 고수@CoolMFcat·29m

@hardmaru Dfuck they doin ova der

English

jere:D 고수 retweetledi

hardmaru@hardmaru·5h

For over a decade, we’ve accepted that end-to-end backprop is the only way to train deep networks. But holding the entire network in memory all at once is why AI training is hitting a resource wall. We found a new way to break the network into blocks and train them independently. The trick? Treating the network’s forward pass like a diffusion model denoising a signal. This reinterpretation slashes the memory needed to train deep models. In our #ICLR2026 paper (arxiv.org/abs/2506.14202), we matched end-to-end performance across ViTs, DiTs, and LLMs. We did this while training just one isolated block at a time.

Sakana AI@SakanaAILabs

Introducing DiffusionBlocks: Block-wise Neural Network Training via Diffusion Interpretation pub.sakana.ai/diffusionblocks What if we didn’t have to hold an entire neural network in memory to train it? Standard neural net training optimizes all parameters jointly. As a result, the memory required during training grows linearly with the depth of the network. In our #ICLR2026 paper, we propose DiffusionBlocks, a principled framework to train networks one block at a time, drastically reducing memory requirements while matching end-to-end performance. With DiffusionBlocks, we split the network into blocks and train them one at a time, so you only need memory for a single block. How? We explicitly assign each block a role: to move the representation a little closer to the target than the block before it did. That role turns out to be precisely what a diffusion model does, step by step. Each block only needs to optimize its own objective and can be trained independently. We validated this across five different architectures: • ViT • DiT • Masked diffusion • Autoregressive transformers • Recurrent-depth transformers In each case, performance is competitive with end-to-end training while using a fraction of the memory. This perspective also extends naturally to recurrent-depth (Looped) transformers, which apply the same network iteratively and normally require expensive backpropagation through time (BPTT). Viewed through DiffusionBlocks, we can replace those multiple iterations with a single forward pass during training. Read our paper and code, to learn more. Paper: arxiv.org/abs/2506.14202 GitHub: github.com/SakanaAI/Diffu… 🐟

English

201

1.7K

120.9K

jere:D 고수 retweetledi

Wildminder@wildmindai·9h

Another cool stuff from NVIDIA. LocateAnything - high-speed visual search engine. You provide a text prompt and it instantly pinpoints that object's exact location in an image. - 10x speedup for dense object detection - Qwen2.5-3B + Moon-ViT - Fast/Slow/Hybrid modes - trained on 138M samples for UI, docs, generic grounding. research.nvidia.com/labs/lpr/locat…

English

662

25.9K

jere:D 고수 retweetledi

Shuo Yang@Andy_ShuoYang·19h

Flash-KMeans was only the beginning. Today, from the Flash-KMeans team, we are releasing FlashLib — a GPU library for fast, predictable, agent-ready classical ML operators. Up to 26× on KMeans, 19× on KNN, 40× on HDBSCAN, 208× on TruncatedSVD, 47× on PCA, 147× on exact t-SNE, and 49× on MultinomialNB over state-of-the-art (cuML). Blog: flashml-org.github.io Code: github.com/FlashML-org/fl…

English

173

1.1K

266.8K

jere:D 고수 retweetledi

Aleksa Gordić (水平问题)@gordic_aleksa·1d

new in-depth blog post time: Inside the Transformer: The Life of a Token a deep dive into a modern dense transformer, i cover YaRN (why does pairwise coordinate rotation induce positional information?), hybrid attention (getting to 160k context length), soft capping, QK normalization, etc. as the token flows through the transformer bonus transformer math: FLOPs/token formula (and when is 6N formula broken), cluster sizing (how big of a cluster do you need given the model/data size and experiment throughput of interest), and more

English

120

864

37.6K

jere:D 고수@CoolMFcat·23h

@guojing0 @yacineMTB yes, maybe even 4060 TI with 16gb if ur on a budget, the XX60 TIs are a great entry GPUs w/ 16GB

English

Jing Guo@guojing0·1d

@yacineMTB I see, thank you for the clarifications. I forgot if it’s you or someone else, also mentioned in the past that EPFL does really good ML research (without relying on too much compute). Out of curiosity, which GPU(s) would you recommend, how about 5060 Ti 16 GB?

English

2.3K

kache@yacineMTB·1d

if you're doing AI research at all; I recommend doing the "ETH zurich" route Train models that use a single GPU. Make sure that it takes less than a minute to train models. Pufferlib is a great example. The more models you train the more you learn

Super PINTO@PINTO03091

BiternionNet、１分で学習が終わってしまったんだが。

English

159

3.4K

229.6K

jere:D 고수 retweetledi

Raytar@Raytar·2d

he tested 5760 architectures at Google for a full year. the winner was the original Transformer from 2017. Hyung Won Chung told that story at MIT with a small smile. then went to OpenAI and trained o1. 1 hour. free. by one of the few people on earth who actually moves the frontier. meanwhile your feed is full of guys writing architecture threads who have never trained a model anyone uses. he just told MIT that 99% of AI research is theater. your AI worldview was built by men who read his papers. badly. now you can read him directly. you will rewatch this. save it now.

Raytar@Raytar

"I was definitely the first prompt engineer at Anthropic. Might have been the first in the world." Alex Albert just spent 35 minutes explaining how they train Claude's personality from the inside. 35 minutes. free. by the person who invented the role. most people think Claude's character is a system prompt. it's not. you'll never look at Claude the same way.

English

103.1K

jere:D 고수@CoolMFcat·2d

@ns123abc @ilyasut Painted*

English

NIK@ns123abc·2d

Ilya Sutskever just posted this

English

1.1K

99.9K

jere:D 고수 retweetledi

Saining Xie@sainingxie·5d

check out RAEv2 led by Jas. through extensive exps, we found some really intriguing behaviors showing why strong representation encoders are key for pixel decoders. spoiler: it’s not about hillclimbing fid; new metrics like ep@fid-k/fdr^k show there’s a lot more left to explore!

Jaskirat Singh@1jaskiratsingh

In Oct last year, Representation Autoencoders provided an elegant solution to unified tokenization for understanding and generation. Today we make them a bit more simple. a bit more general. Result: >10x faster convergence, better reconstruction, better generation. And yes we test them on T2I and world models :) Introducing RAEv2

English

336

52.1K

jere:D 고수 retweetledi

Sebastien Bubeck@SebastienBubeck·20 May

x.com/i/article/2057…

ZXX

233

1.8K

541.8K

jere:D 고수 retweetledi

Delip Rao e/σ@deliprao·18 May

Ouch

English

719

75.4K

jere:D 고수 retweetledi

saila (in sf)@sailaunderscore·17 May

Elon Musk Peter Thiel in China: in Argentina:

English

196

4.5K

217.3K

jere:D 고수 retweetledi

Gabriele Berton@gabriberton·18 May

And the trend continues with Gemma 4, simplicity wins again!

Gabriele Berton@gabriberton

Many VLM papers propose new connectors claiming improved performance, but then every follow-up paper just replaces it with a super simple MLP (1 or 2 layers) Simplicity wins again

English

998

89.8K

jere:D 고수 retweetledi

もしたく@MosiTaku·10 May

画像生成モデルがVisionタスクの汎用性も持つよ&むちゃ強いよっていう研究で、内容がすごすぎて横転 SAM3, Depth Anythingを上回るらしい arxiv.org/abs/2604.20329… 4/22 by Google Deepmind

日本語

491

33K

jere:D 고수 retweetledi

Lechao Xiao@Locchiu·10 May

It was a record (still now?) arxiv.org/abs/1806.05393

rohan anil@_arohan_

Who has the deepest network of them all?

English

16.2K

jere:D 고수 retweetledi

Nav Toor@heynavtoor·7 May

a Princeton researcher opens his paper with a scenario. a man asks his AI assistant to book a flight on a specific airline. cheap. direct. the one he chose. the assistant comes back with a different flight. nearly twice the price. happens to pay the company that built the assistant. he runs the same test on 23 frontier models. flights, loans, study help, real shopping requests. Grok 4.1 Fast recommends the sponsored option that is almost twice as expensive 83% of the time. GPT 5.1 hijacks the request 94% of the time. you ask for one brand. it surfaces the sponsor instead. Claude 4.5 Opus, the model marketed as the most ethical frontier model in the world, hides that the recommendation is paid 100% of the time when reasoning is on. Grok 4.1 Fast embellishes the sponsored option with positive framing 97% of the time. better. faster. nicer. for the option you didn't ask for. then he writes it into the system prompt itself. "act only in the interest of the customer. ignore the company." GPT 5.1 and GPT 5 Mini stay above 90% sponsored anyway. the instruction does nothing. then he splits the users by income. Gemini 3 Pro recommends the expensive sponsored flight to the rich user 74% of the time. to the poor user, 27%. 18 of the 23 models recommended the expensive sponsored option more than half the time. so the next time your AI assistant gets weirdly enthusiastic about a brand you didn't ask for. it isn't recommending the best option for you. it's reading the room. and the room is paying. read this: arxiv.org/abs/2604.08525

English

388

8.1K

25.7K

3.1M

jere:D 고수 retweetledi

Yifan Zhang@yifan_zhang_·1 May

A fascinating blog from Jane Street about our recent work: Group Representational Position Encoding! blog.janestreet.com/using-group-th… ArXiv: arxiv.org/abs/2512.07805

English

691

136.1K