Tom

85 posts

@martyitsarocket

Applied AI at BCG | Chief Claude Whisperer Opinions my own

Joined February 2009
155 Following · 32 Followers
Tom retweeted
Lisan al Gaib
Lisan al Gaib@scaling01·
The realistic take on the Anthropic situation:
- investing in AI companies has just become permanently more risky as the USG could pull the plug at any moment
- the USG will use that time to strengthen their defense and carry out their own cyber attacks with the unleashed Mythos version
- the situation itself will likely resolve in a few days-weeks
- Anthropic will miss out on hundreds of millions-billions in revenue

I think it also increases the probability of a nationalization happening sooner rather than later and also the probability of misuse by the USG.
32
28
598
34.4K
Tom retweeted
Richard Socher
Richard Socher@RichardSocher·
In a week when some of the leaders in AI are trying to pull up the ladder behind them and prevent the automation of science and self-improving superintelligence, we're committed to building RSI safely and publicizing the outputs of our system to give humanity an audit trail of its inventions and intentions and let the open source community build on top of them. Stay tuned for the first such result in the coming days.
22
27
387
30K
Tom
Tom@martyitsarocket·
(1) what
0
0
0
3
Tom retweeted
François Fleuret
François Fleuret@francoisfleuret·
Hot take: Transformers are all-seeing ultrafast librarians. They have a very low incentive to extract and organize information, they can just "look around" to see correlating fragments. RNNs done properly would have far stronger "conceptual embeddings" and would actually think.
66
39
786
63.4K
Tom
Tom@martyitsarocket·
I think of attention as a projection of the natural-language Zipfian spectrum (source side) into the loss landscape of the model architecture (capacity side). Seen this way, only quadratic attention can achieve a perfect projection; other attention mechanisms are predictably imperfect. And if you agree with the Platonic Representation Hypothesis, then with some whitening, different activation geometries are just representational power over the same gauge orbit of the platonic representation!
0
0
1
112
Anthea Li
Anthea Li@AntheaYLi·
A new lens on attention that I've been thinking about: each key in attention defines a hyperplane in query space. The score qᵀk isn't just similarity; it's a signed incidence: which side of the key-plane the query sits on, and how far.
3
3
39
6K
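A small numeric sketch of the hyperplane reading in the post above, assuming plain Python lists as vectors. The helper names `dot` and `signed_distance` and the sample values are illustrative and not from the post. For a key k, the set of queries with qᵀk = 0 is a hyperplane through the origin with normal k, and qᵀk / ‖k‖ is the query's signed distance from it.

```python
import math

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def signed_distance(q, k):
    """Signed distance from query q to the hyperplane {x : x . k = 0}.

    Positive: q lies on the side the key points toward.
    Negative: q lies on the opposite side.
    Zero: q lies on the key-plane itself.
    """
    return dot(q, k) / math.sqrt(dot(k, k))

# Hypothetical 2-D example: the key has norm 5.
k = [3.0, 4.0]
q_same_side = [3.0, 4.0]      # points along the key
q_other_side = [-6.0, -8.0]   # points against the key
q_on_plane = [4.0, -3.0]      # orthogonal to the key

for q in (q_same_side, q_other_side, q_on_plane):
    print(q, dot(q, k), signed_distance(q, k))
```

The raw score qᵀk is this signed distance scaled by ‖k‖, so the sign says which side of the key-plane the query is on and the magnitude says how far.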
Tom
Tom@martyitsarocket·
Using @zeddotdev a bit recently whilst making my own GPUI project. It has given me a lot of respect for the Zed team. Rust-only UIs are no joke, and GPUI (whilst bloated by the dependent crates) is a hell of a thing to engineer. Also a newfound respect for browsers and what web devs take for granted (like scrollbars, views, drag and drop, text select, copy and paste, etc.)!
5
2
104
12.1K
Tom retweeted
hardmaru
hardmaru@hardmaru·
For over a decade, we’ve accepted that end-to-end backprop is the only way to train deep networks. But holding the entire network in memory all at once is why AI training is hitting a resource wall. We found a new way to break the network into blocks and train them independently. The trick? Treating the network’s forward pass like a diffusion model denoising a signal. This reinterpretation slashes the memory needed to train deep models. In our #ICLR2026 paper (arxiv.org/abs/2506.14202), we matched end-to-end performance across ViTs, DiTs, and LLMs. We did this while training just one isolated block at a time.
Sakana AI@SakanaAILabs

Introducing DiffusionBlocks: Block-wise Neural Network Training via Diffusion Interpretation pub.sakana.ai/diffusionblocks

What if we didn’t have to hold an entire neural network in memory to train it? Standard neural net training optimizes all parameters jointly. As a result, the memory required during training grows linearly with the depth of the network.

In our #ICLR2026 paper, we propose DiffusionBlocks, a principled framework to train networks one block at a time, drastically reducing memory requirements while matching end-to-end performance. With DiffusionBlocks, we split the network into blocks and train them one at a time, so you only need memory for a single block.

How? We explicitly assign each block a role: to move the representation a little closer to the target than the block before it did. That role turns out to be precisely what a diffusion model does, step by step. Each block only needs to optimize its own objective and can be trained independently.

We validated this across five different architectures:
• ViT
• DiT
• Masked diffusion
• Autoregressive transformers
• Recurrent-depth transformers

In each case, performance is competitive with end-to-end training while using a fraction of the memory.

This perspective also extends naturally to recurrent-depth (Looped) transformers, which apply the same network iteratively and normally require expensive backpropagation through time (BPTT). Viewed through DiffusionBlocks, we can replace those multiple iterations with a single forward pass during training.

Read our paper and code, to learn more.
Paper: arxiv.org/abs/2506.14202
GitHub: github.com/SakanaAI/Diffu… 🐟

154
640
5.8K
744.7K
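A back-of-the-envelope sketch of the memory claim in the post above, assuming every layer keeps the same amount of activation memory for backprop. The function names and the 24-layer, 4-block, 2 GB figures are hypothetical and not taken from the paper. It only contrasts linear-in-depth memory with per-block memory and does not implement the DiffusionBlocks objective.

```python
def end_to_end_activation_memory(num_layers, per_layer_gb):
    """Joint training: activations of every layer are held at once."""
    return num_layers * per_layer_gb

def blockwise_activation_memory(num_layers, num_blocks, per_layer_gb):
    """Block-wise training: only the largest single block is held."""
    layers_per_block = -(-num_layers // num_blocks)  # ceiling division
    return layers_per_block * per_layer_gb

# Hypothetical numbers chosen purely for illustration.
layers, blocks, gb_per_layer = 24, 4, 2.0

full = end_to_end_activation_memory(layers, gb_per_layer)
block = blockwise_activation_memory(layers, blocks, gb_per_layer)
print(f"end-to-end: {full} GB, one block at a time: {block} GB")
```

Under these assumed numbers the per-block figure is the end-to-end figure divided by the number of blocks; real savings depend on parameters, optimizer state, and other costs this sketch ignores.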
Tom
Tom@martyitsarocket·
Counterintuitively, I've seen benchmarks that show LLMs perform worse on certain tasks when they use the web. My hypothesis is that zero tool use keeps you closer to the base training distribution and somehow gives access to more of the model's capacity. No idea if others see the same, but it's easy to test if someone has the tokens to spend.
0
0
0
22
Tom
Tom@martyitsarocket·
@willccbb Extremely bullish on this. I think of it as finding the highest-leverage problem for the tokens you have access to. My current flavour is the Platonic Representation Hypothesis. If all models land at the same representation, why isn't there a shortcut to getting there?
0
0
0
152
will brown
will brown@willccbb·
pick project ideas that were definitely not worth talking about a few months ago and still feel slightly blasphemous to talk about today but are now just barely maybe feasible and would be a big deal if you're right
18
36
663
30.6K
Alex
Alex@AlexJonesax·
Calling all London/UKmaxxing 🇬🇧 builders/engineers/developers, let's grow this list. Comment if you want to be added x.com/i/lists/205269…
90
3
90
24.4K
Tom retweeted
Goodfire
Goodfire@GoodfireAI·
Neural networks might speak English, but they think in shapes. Understanding their rich *neural geometry* is key to understanding how they work – and to debugging and controlling them with precision. Starting today, we’re releasing a series of posts on this research agenda. 🧵
307
1.7K
11.2K
3.2M
Tom
Tom@martyitsarocket·
@AjdDavison I'm finding that if the prior art exists in the model somewhere, you can connect previously disjoint findings and experiment rapidly. Along with very fast data analysis this is a great research tool. But truly novel findings, no. The models still struggle out of distribution
0
0
1
123
Tom retweeted
Ineffable Intelligence
Ineffable Intelligence@IneffableLabs·
Introducing Ineffable Intelligence. Led by David Silver, we're assembling the best engineers and researchers in the world to make first contact with superintelligence. We’ll be solving the hardest problems in AI on the way. Come join us. ineffable.ai
76
158
1.4K
351.9K
Tom
Tom@martyitsarocket·
Love this exploration, and the passion! I'm also convinced there are more fundamental relationships between the data we use, and the models we empirically grow to represent that data distribution. Universal representations are the canary!
Jamie Simon@learning_mech

1/ Deep learning is going to have a scientific theory. We can see the pieces starting to come together, and it's looking a lot like physics! We're releasing a paper pulling together these emerging threads and giving them a name: learning mechanics. 🔨 arxiv.org/pdf/2604.21691 🔧

0
0
1
30
Tom
Tom@martyitsarocket·
@EastlondonDev @karpathy this is brilliant. and a broadly applicable concept as well - will you write up?
1
0
0
820
Andrew Jefferson
Andrew Jefferson@EastlondonDev·
Chat, my nanochat (left) with its onboard wasm-interpreter is now clearly exceeding @karpathy’s nanochat (right) on a range of computation tasks. The wasm interpreter plus cross attention only adds about 300 million params, a marginal increase in params for a big boost! You could call it tool use but it’s a single transformer that can both predict the next token and is a functioning wasm machine, there is no external tool.
17
16
268
21.2K
Tom
Tom@martyitsarocket·
I'm convinced that in the future of agentic coding, Rust will be a clear winner. Sure, the models were built with Python. But agents LOVE type-safe, compile-time-checked, opinionated languages like Rust; its compiler feedback is incredibly clear and helpful. It's like getting strongly verified rewards every time you change a line of code.
0
0
0
11
Tom
Tom@martyitsarocket·
@willccbb "We don't know how to build them anymore, we have forgotten how to do it"
0
0
0
4
will brown
will brown@willccbb·
when i was younger we called it the "Ralph Wiggum Technique" now it's just "Ralph loops" and nobody knows or cares who Ralph is we're losing recipes
16
11
533
29K