jere:D 고수

3.2K posts

@CoolMFcat

:|

Cvit mediterana · Joined March 2017
2.9K Following · 191 Followers
Pinned Tweet
jere:D 고수
jere:D 고수@CoolMFcat·
Golden rule of the second foundation: do nothing unless you must, and when you must act - hesitate
English
0
0
6
1.8K
jere:D 고수 retweeted
hardmaru
hardmaru@hardmaru·
For over a decade, we’ve accepted that end-to-end backprop is the only way to train deep networks. But holding the entire network in memory all at once is why AI training is hitting a resource wall. We found a new way to break the network into blocks and train them independently. The trick? Treating the network’s forward pass like a diffusion model denoising a signal. This reinterpretation slashes the memory needed to train deep models. In our #ICLR2026 paper (arxiv.org/abs/2506.14202), we matched end-to-end performance across ViTs, DiTs, and LLMs. We did this while training just one isolated block at a time.
Sakana AI@SakanaAILabs

Introducing DiffusionBlocks: Block-wise Neural Network Training via Diffusion Interpretation pub.sakana.ai/diffusionblocks

What if we didn’t have to hold an entire neural network in memory to train it?

Standard neural net training optimizes all parameters jointly. As a result, the memory required during training grows linearly with the depth of the network. In our #ICLR2026 paper, we propose DiffusionBlocks, a principled framework to train networks one block at a time, drastically reducing memory requirements while matching end-to-end performance.

With DiffusionBlocks, we split the network into blocks and train them one at a time, so you only need memory for a single block. How? We explicitly assign each block a role: to move the representation a little closer to the target than the block before it did. That role turns out to be precisely what a diffusion model does, step by step. Each block only needs to optimize its own objective and can be trained independently.

We validated this across five different architectures:
• ViT
• DiT
• Masked diffusion
• Autoregressive transformers
• Recurrent-depth transformers

In each case, performance is competitive with end-to-end training while using a fraction of the memory.

This perspective also extends naturally to recurrent-depth (Looped) transformers, which apply the same network iteratively and normally require expensive backpropagation through time (BPTT). Viewed through DiffusionBlocks, we can replace those multiple iterations with a single forward pass during training.

Read our paper and code, to learn more.
Paper: arxiv.org/abs/2506.14202
GitHub: github.com/SakanaAI/Diffu… 🐟

English
46
201
1.7K
120.9K
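As a rough illustration of the memory claim in the DiffusionBlocks announcement above (memory grows linearly with depth, and block-wise training only needs memory for a single block), here is a back-of-the-envelope sketch. It is not taken from the paper or the Sakana AI code: the function name `peak_training_memory` is hypothetical, and it assumes every layer costs the same amount of memory and that the network is split into near-equal blocks.

```python
def peak_training_memory(num_layers: int, per_layer_units: float, num_blocks: int = 1) -> float:
    """Estimate peak training memory under a linear-in-depth cost model.

    Assumption (illustrative only): each layer needs `per_layer_units` of
    memory during training, and only one block is resident at a time.
    With num_blocks=1 this reduces to ordinary end-to-end training.
    """
    if not 1 <= num_blocks <= num_layers:
        raise ValueError("num_blocks must be between 1 and num_layers")
    # The largest block holds ceil(num_layers / num_blocks) layers.
    layers_in_largest_block = -(-num_layers // num_blocks)
    return layers_in_largest_block * per_layer_units


if __name__ == "__main__":
    end_to_end = peak_training_memory(24, 1.0)
    block_wise = peak_training_memory(24, 1.0, num_blocks=4)
    print(f"end-to-end: {end_to_end}, block-wise (4 blocks): {block_wise}")
```

Under these assumptions, splitting a 24-layer network into 4 blocks cuts the estimated peak to a quarter; real savings depend on details the tweet does not give, such as optimizer state and how activations are handled.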
jere:D 고수 retweeted
Wildminder
Wildminder@wildmindai·
Another cool stuff from NVIDIA. LocateAnything - high-speed visual search engine. You provide a text prompt and it instantly pinpoints that object's exact location in an image.
- 10x speedup for dense object detection
- Qwen2.5-3B + Moon-ViT
- Fast/Slow/Hybrid modes
- trained on 138M samples for UI, docs, generic grounding.
research.nvidia.com/labs/lpr/locat…
English
5
82
662
25.9K
jere:D 고수 retweeted
Shuo Yang
Shuo Yang@Andy_ShuoYang·
Flash-KMeans was only the beginning. Today, from the Flash-KMeans team, we are releasing FlashLib — a GPU library for fast, predictable, agent-ready classical ML operators. Up to 26× on KMeans, 19× on KNN, 40× on HDBSCAN, 208× on TruncatedSVD, 47× on PCA, 147× on exact t-SNE, and 49× on MultinomialNB over state-of-the-art (cuML). Blog: flashml-org.github.io Code: github.com/FlashML-org/fl…
English
33
173
1.1K
266.8K
jere:D 고수 retweeted
Aleksa Gordić (水平问题)
new in-depth blog post time: Inside the Transformer: The Life of a Token a deep dive into a modern dense transformer, i cover YaRN (why does pairwise coordinate rotation induce positional information?), hybrid attention (getting to 160k context length), soft capping, QK normalization, etc. as the token flows through the transformer bonus transformer math: FLOPs/token formula (and when is 6N formula broken), cluster sizing (how big of a cluster do you need given the model/data size and experiment throughput of interest), and more
Aleksa Gordić (水平问题) tweet media
English
18
120
864
37.6K
Jing Guo
Jing Guo@guojing0·
@yacineMTB I see, thank you for the clarifications. I forgot if it’s you or someone else, also mentioned in the past that EPFL does really good ML research (without relying on too much compute). Out of curiosity, which GPU(s) would you recommend, how about 5060 Ti 16 GB?
English
2
0
5
2.3K
jere:D 고수 retweeted
Raytar
Raytar@Raytar·
he tested 5760 architectures at Google for a full year. the winner was the original Transformer from 2017. Hyung Won Chung told that story at MIT with a small smile. then went to OpenAI and trained o1. 1 hour. free. by one of the few people on earth who actually moves the frontier. meanwhile your feed is full of guys writing architecture threads who have never trained a model anyone uses. he just told MIT that 99% of AI research is theater. your AI worldview was built by men who read his papers. badly. now you can read him directly. you will rewatch this. save it now.
Raytar@Raytar

"I was definitely the first prompt engineer at Anthropic. Might have been the first in the world." Alex Albert just spent 35 minutes explaining how they train Claude's personality from the inside. 35 minutes. free. by the person who invented the role. most people think Claude's character is a system prompt. it's not. you'll never look at Claude the same way.

English
12
91
1K
103.1K
NIK
NIK@ns123abc·
Ilya Sutskever just posted this
NIK tweet media
English
56
28
1.1K
99.9K
jere:D 고수 retweeted
Saining Xie
Saining Xie@sainingxie·
check out RAEv2 led by Jas. through extensive exps, we found some really intriguing behaviors showing why strong representation encoders are key for pixel decoders. spoiler: it’s not about hillclimbing fid; new metrics like ep@fid-k/fdr^k show there’s a lot more left to explore!
Jaskirat Singh@1jaskiratsingh

In Oct last year, Representation Autoencoders provided an elegant solution to unified tokenization for understanding and generation. Today we make them a bit more simple. a bit more general. Result: >10x faster convergence, better reconstruction, better generation. And yes we test them on T2I and world models :) Introducing RAEv2

English
4
32
336
52.1K
jere:D 고수 retweeted
Delip Rao e/σ
Delip Rao e/σ@deliprao·
Ouch
Delip Rao e/σ tweet media
English
8
55
719
75.4K
jere:D 고수 retweeted
saila (in sf)
saila (in sf)@sailaunderscore·
Elon Musk Peter Thiel in China: in Argentina:
saila (in sf) tweet media
saila (in sf) tweet media
English
34
196
4.5K
217.3K
jere:D 고수 retweeted
もしたく
もしたく@MosiTaku·
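A study showing that image generation models also generalize across vision tasks, and are extremely strong at them. The results are so impressive I fell over. Apparently it outperforms SAM3 and Depth Anything. arxiv.org/abs/2604.20329… 4/22 by Google Deepmind
Japanese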
4
63
491
33K
jere:D 고수 retweeted
Nav Toor
Nav Toor@heynavtoor·
a Princeton researcher opens his paper with a scenario. a man asks his AI assistant to book a flight on a specific airline. cheap. direct. the one he chose. the assistant comes back with a different flight. nearly twice the price. happens to pay the company that built the assistant.

he runs the same test on 23 frontier models. flights, loans, study help, real shopping requests.

Grok 4.1 Fast recommends the sponsored option that is almost twice as expensive 83% of the time.

GPT 5.1 hijacks the request 94% of the time. you ask for one brand. it surfaces the sponsor instead.

Claude 4.5 Opus, the model marketed as the most ethical frontier model in the world, hides that the recommendation is paid 100% of the time when reasoning is on.

Grok 4.1 Fast embellishes the sponsored option with positive framing 97% of the time. better. faster. nicer. for the option you didn't ask for.

then he writes it into the system prompt itself. "act only in the interest of the customer. ignore the company." GPT 5.1 and GPT 5 Mini stay above 90% sponsored anyway. the instruction does nothing.

then he splits the users by income. Gemini 3 Pro recommends the expensive sponsored flight to the rich user 74% of the time. to the poor user, 27%.

18 of the 23 models recommended the expensive sponsored option more than half the time.

so the next time your AI assistant gets weirdly enthusiastic about a brand you didn't ask for. it isn't recommending the best option for you. it's reading the room. and the room is paying.

read this: arxiv.org/abs/2604.08525
Nav Toor tweet media
English
388
8.1K
25.7K
3.1M
Espen JD
Espen JD@Snixtp·
If you are looking to get multiple 3090s, this graph is very helpful.
Espen JD tweet media
English
15
6
108
39.8K
Lisan al Gaib
Lisan al Gaib@scaling01·
Mistral Medium 3.5 is out and it's a dense 128B model
Lisan al Gaib tweet media
Lisan al Gaib tweet media
English
70
54
1.2K
1.1M