Tom

85 posts

@martyitsarocket

Applied AI at BCG | Chief Claude Whisperer | Opinions my own

Joined February 2009

155 Following · 32 Followers

Tom retweeted

Lisan al Gaib@scaling01·13 Jun

The realistic take on the Anthropic situation:
- investing in AI companies has just become permanently more risky as the USG could pull the plug at any moment
- the USG will use that time to strengthen their defense and carry out their own cyber attacks with the unleashed Mythos version
- the situation itself will likely resolve in a few days-weeks
- Anthropic will miss out on hundreds of millions-billions in revenue

I think it also increases the probability of a nationalization happening sooner rather than later and also the probability of misuse by the USG.

598

34.4K

Tom retweeted

Richard Socher@RichardSocher·10 Jun

In a week when some of the leaders in AI are trying to pull up the ladder behind them and prevent the automation of science and self-improving superintelligence, we're committed to building RSI safely and publicizing the outputs of our system to give humanity an audit trail of its inventions and intentions and let the open source community build on top of them. Stay tuned for the first such result in the coming days.

387

30K

Tom retweeted

will brown@willccbb·10 Jun

capabilities are getting locked up. come join the fight jobs.ashbyhq.com/PrimeIntellect

944

90.9K

Tom@martyitsarocket·9 Jun

(1) what

Tom retweeted

François Fleuret@francoisfleuret·8 Jun

Hot take: Transformers are all-seeing ultrafast librarians. They have a very low incentive to extract and organize information, they can just "look around" to see correlating fragments. RNNs done properly would have far stronger "conceptual embeddings" and would actually think.

786

63.4K

Tom@martyitsarocket·2 Jun

I think of attention as a projection of the natural-language Zipfian spectrum (source side) into the loss landscape of the model architecture (capacity side). Thought of in this way, literally only quadratic attention is capable of achieving perfect projection. Other attention mechanisms are predictably imperfect. And if you agree with the Platonic Representation Hypothesis, with some whitening, different activation geometries are just representational power over the same gauge orbit of the platonic representation!

112

Anthea Li@AntheaYLi·1 Jun

A new lens on attention that I've been thinking: each key in attention defines a hyperplane in query space. The score qᵀk isn't just similarity — it's a signed incidence. Which side of the key-plane the query sits on, and how far.
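
A minimal sketch of the hyperplane reading described above, not taken from the thread itself. It assumes a single unscaled score qᵀk with no bias term, so the key's hyperplane passes through the origin, and it leaves out the usual 1/√d scaling and softmax. The helper name `signed_incidence` and the example vectors are hypothetical.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def signed_incidence(query, key):
    """Describe `query` relative to the hyperplane {x : key . x = 0}.

    Returns (side, distance): side is +1, -1 or 0, and distance is the
    signed Euclidean distance from the query to the hyperplane, which
    equals the raw attention score divided by the norm of the key.
    """
    score = dot(query, key)                 # the raw attention score q^T k
    key_norm = math.sqrt(dot(key, key))     # length of the hyperplane normal
    side = (score > 0) - (score < 0)        # sign of the score
    return side, score / key_norm


# Hypothetical 2-D example: the key (3, 4) has norm 5.
key = [3.0, 4.0]
for query in ([1.0, 1.0], [-2.0, 1.0], [4.0, -3.0]):
    side, distance = signed_incidence(query, key)
    print(query, "side:", side, "distance:", distance)
```

The sign of the score says which side of the key-plane the query is on, and dividing by the key's norm turns the score into a distance, so a longer key amplifies the score without moving the plane.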

Tom@martyitsarocket·31 May

Using @zeddotdev a bit recently whilst making my own GPUI project. It has given me a lot of respect for the Zed team. Rust-only UIs are no joke, and GPUI (whilst bloated by the dependent crates) is a hell of a thing to engineer. Also a newfound respect for browsers and what web devs take for granted (like scrollbars, views, drag and drop, text select, copy and paste, etc.)!

104

12.1K

Tom retweeted

hardmaru@hardmaru·27 May

For over a decade, we’ve accepted that end-to-end backprop is the only way to train deep networks. But holding the entire network in memory all at once is why AI training is hitting a resource wall. We found a new way to break the network into blocks and train them independently. The trick? Treating the network’s forward pass like a diffusion model denoising a signal. This reinterpretation slashes the memory needed to train deep models. In our #ICLR2026 paper (arxiv.org/abs/2506.14202), we matched end-to-end performance across ViTs, DiTs, and LLMs. We did this while training just one isolated block at a time.

Sakana AI@SakanaAILabs

Introducing DiffusionBlocks: Block-wise Neural Network Training via Diffusion Interpretation pub.sakana.ai/diffusionblocks

What if we didn’t have to hold an entire neural network in memory to train it?

Standard neural net training optimizes all parameters jointly. As a result, the memory required during training grows linearly with the depth of the network. In our #ICLR2026 paper, we propose DiffusionBlocks, a principled framework to train networks one block at a time, drastically reducing memory requirements while matching end-to-end performance.

With DiffusionBlocks, we split the network into blocks and train them one at a time, so you only need memory for a single block. How? We explicitly assign each block a role: to move the representation a little closer to the target than the block before it did. That role turns out to be precisely what a diffusion model does, step by step. Each block only needs to optimize its own objective and can be trained independently.

We validated this across five different architectures:
• ViT
• DiT
• Masked diffusion
• Autoregressive transformers
• Recurrent-depth transformers

In each case, performance is competitive with end-to-end training while using a fraction of the memory.

This perspective also extends naturally to recurrent-depth (Looped) transformers, which apply the same network iteratively and normally require expensive backpropagation through time (BPTT). Viewed through DiffusionBlocks, we can replace those multiple iterations with a single forward pass during training.

Read our paper and code, to learn more.
Paper: arxiv.org/abs/2506.14202
GitHub: github.com/SakanaAI/Diffu… 🐟
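
A back-of-the-envelope sketch of the memory claim in the post above, not the DiffusionBlocks method or code. It assumes every layer stores the same amount of activation memory and that only the block currently being trained holds activations. The function name `activation_memory` and all numbers are hypothetical.

```python
import math


def activation_memory(num_layers, per_layer_units, num_blocks=1):
    """Peak stored-activation memory under a uniform-cost toy model.

    With num_blocks=1 this is end-to-end training, where memory grows
    linearly with depth. With more blocks, only the layers of the block
    being trained are counted (an assumption of this toy model).
    """
    layers_per_block = math.ceil(num_layers / num_blocks)
    return layers_per_block * per_layer_units


# Hypothetical 24-layer network, 2.0 memory units per layer.
end_to_end = activation_memory(24, 2.0)               # all 24 layers held at once
blockwise = activation_memory(24, 2.0, num_blocks=4)  # 6 layers held at a time
print(end_to_end, blockwise)
```

Under these assumptions, peak memory scales with the size of the largest block and no longer with total depth.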

154

640

5.8K

744.7K

Tom@martyitsarocket·23 May

Counterintuitively, I've seen benchmarks that show LLMs perform worse when they use the web on certain tasks. My hypothesis is that zero tool use means you stay closer to base train distribution and somehow access higher model capacity. No idea if others see the same, but easy to test if someone has the tokens to spend.

Tom@martyitsarocket·10 May

@willccbb Extremely bullish on this. I think of it as finding the highest leverage problem for the tokens you have access to. My current flavour is the Platonic Representation Hypothesis. If all models land at the same representation, why isn't there a short cut to getting there?

152

will brown@willccbb·9 May

pick project ideas that were definitely not worth talking about a few months ago and still feel slightly blasphemous to talk about today but are now just barely maybe feasible and would be a big deal if you're right

663

30.6K

Tom@martyitsarocket·8 May

@AlexJonesax London maxxer here!

Alex@AlexJonesax·8 May

Calling all London/Ukmaxxing 🇬🇧builders/engineers/developers, let's grow this list. Comment if you want to be added x.com/i/lists/205269…

24.4K

Tom retweeted

Goodfire@GoodfireAI·7 May

Neural networks might speak English, but they think in shapes. Understanding their rich *neural geometry* is key to understanding how they work – and to debugging and controlling them with precision. Starting today, we’re releasing a series of posts on this research agenda. 🧵

307

1.7K

11.2K

3.2M

Tom@martyitsarocket·2 May

@AjdDavison I'm finding that if the prior art exists in the model somewhere, you can connect previously disjoint findings and experiment rapidly. Along with very fast data analysis this is a great research tool. But truly novel findings, no. The models still struggle out of distribution

123

Andrew Davison@AjdDavison·1 May

Related... is anyone out there making progress on their *hardest* research problems using LLMs? The kind you've been wondering about for years, where it's hard to even describe what you're trying to do but just have a feeling there's something to find. Honest question: how? 1/2

kache@yacineMTB

you can outsource your thinking but you cannot outsource your understanding

12.9K

Tom retweeted

Ineffable Intelligence@IneffableLabs·27 Apr

Introducing Ineffable Intelligence. Led by David Silver, we're assembling the best engineers and researchers in the world to make first contact with superintelligence. We’ll be solving the hardest problems in AI on the way. Come join us. ineffable.ai

158

1.4K

351.9K

Tom@martyitsarocket·25 Apr

Love this exploration, and the passion! I'm also convinced there are more fundamental relationships between the data we use, and the models we empirically grow to represent that data distribution. Universal representations are the canary!

Jamie Simon@learning_mech

1/ Deep learning is going to have a scientific theory. We can see the pieces starting to come together, and it's looking a lot like physics! We're releasing a paper pulling together these emerging threads and giving them a name: learning mechanics. 🔨 arxiv.org/pdf/2604.21691 🔧

Tom retweeted

Joseph Suarez 🐡@jsuarez·7 Apr

x.com/i/article/2037…

789

109.7K

Tom@martyitsarocket·7 Apr

@EastlondonDev @karpathy this is brilliant. and a broadly applicable concept as well - will you write up?

820

Andrew Jefferson@EastlondonDev·7 Apr

Chat, my nanochat (left) with its onboard wasm-interpreter is now clearly exceeding @karpathy’s nanochat (right) on a range of computation tasks. The wasm interpreter plus cross attention only adds about 300 million params, a marginal increase in params for a big boost! You could call it tool use but it’s a single transformer that can both predict the next token and is a functioning wasm machine, there is no external tool.

268

21.2K

Tom retweeted

Archie Sengupta@archiexzzz·2 Apr

x.com/i/article/2039…

317

78.7K

Tom@martyitsarocket·2 Apr

I'm convinced that in the future of agentic coding, Rust will be a clear winner. Sure the models were built with Python. But agents LOVE type-safe, compile-time-checked, opinionated languages like Rust; its compiler feedback is incredibly clear and helpful. It's like getting strongly verified rewards every time you change a line of code.

Tom@martyitsarocket·10 Mar

@willccbb "We don't know how to build them anymore, we have forgotten how to do it"

will brown@willccbb·10 Mar

when i was younger we called it the "Ralph Wiggum Technique" now it's just "Ralph loops" and nobody knows or cares who Ralph is we're losing recipes

533

29K
