Grégoire Delétang

298 posts

@gregdeletang

Intelligence is good statistics

London, England · Joined January 2016
73 Following · 312 Followers
Grégoire Delétang@gregdeletang·
Claude Code is actually an insane tool, I can only say congrats to the team
0 replies · 0 reposts · 0 likes · 65 views
Dimitris Papailiopoulos@DimitrisPapail·
512 parameters: a new top scorer for 10-digit addition with transformers! Who can beat it?
Yinglun Zhu@yinglun122

Hey @DimitrisPapail we now have a 512-parameter model that does the job. I instructed Opus 4.6 to explore along the direction of low-rankness.

8 replies · 15 reposts · 239 likes · 34.5K views
Grégoire Delétang@gregdeletang·
@DimitrisPapail Cool stuff! I think the challenge would be even more interesting with a scaling law of parameter count versus number of digits (at iso ~100% accuracy)
0 replies · 0 reposts · 6 likes · 745 views
Grégoire Delétang reposted
Google DeepMind@GoogleDeepMind·
Genie 3 🤝 @Waymo The Waymo World Model generates photorealistic, interactive environments to train autonomous vehicles. This helps the cars navigate rare, unpredictable events before encountering them in reality. 🧵
[GIF]
80 replies · 260 reposts · 1.7K likes · 426.8K views
Grégoire Delétang@gregdeletang·
I find that LLMs have gone much further than I initially thought 3-4 years ago. My impression is that pretraining enables many non-trivial things that would never work otherwise (test-time "thinking", RLHF, fine-tuning on non-trivial targets, interaction with tools, etc.). Am I the only one to be positively surprised?
0 replies · 0 reposts · 4 likes · 157 views
Grégoire Delétang@gregdeletang·
@jxmnop Language modelling is compression. Lossy (weights) and lossless (sequence).
0 replies · 0 reposts · 0 likes · 2.7K views
dr. jack morris@jxmnop·
this post is complete misinformation. LLMs are lossy compressors! of *training data*. LLMs losslessly compress *prompts*, internally; that's what this paper shows. source: i am the author of "Language Model Inversion", the original paper on this
90 replies · 212 reposts · 4.1K likes · 381K views
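Delétang's compression framing can be made concrete: a language model's next-token probabilities define a lossless code in which a sequence costs -Σ log2 p(x_t | x_<t) bits, and arithmetic coding realizes that cost to within roughly two bits. A minimal sketch using a toy bigram model (the data, function names, and Laplace smoothing scheme are all illustrative, not from either thread):

```python
import math
from collections import defaultdict

def train_bigram(text):
    """Count bigram statistics with Laplace smoothing over the observed alphabet."""
    counts = defaultdict(lambda: defaultdict(int))
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1
    alphabet = sorted(set(text))
    def prob(ctx, sym):
        total = sum(counts[ctx].values())
        return (counts[ctx][sym] + 1) / (total + len(alphabet))
    return prob, alphabet

def code_length_bits(text, prob):
    """Ideal lossless code length under the model: -sum log2 p(x_t | x_{t-1}).
    Arithmetic coding achieves this within ~2 bits overall."""
    return -sum(math.log2(prob(a, b)) for a, b in zip(text, text[1:]))

data = "abababababababab"
prob, alphabet = train_bigram(data)
bits = code_length_bits(data, prob)
raw_bits = (len(data) - 1) * math.log2(len(alphabet))  # uniform-code baseline
print(f"model: {bits:.1f} bits  vs  raw: {raw_bits:.1f} bits")
```

A better model assigns higher probability to the sequence, which is the same thing as a shorter lossless code; that equivalence is what both tweets are gesturing at, applied to training data (lossy, in the weights) versus a given sequence (lossless, via the code).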
Grégoire Delétang@gregdeletang·
@zhaisf Back to 1995... How many people still discover NNs for the first time?
0 replies · 0 reposts · 0 likes · 299 views
Shuangfei Zhai@zhaisf·
A good way of understanding the inductive bias of neural nets is to train an MLP to regress to the sin(x) function. Below is training on 10K points in [-20, 20], predicting over [-30, 30] after 10 and 100 epochs of training. The implications shown here are surprisingly general.
[image]
48 replies · 71 reposts · 911 likes · 112.7K views
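The behaviour in Zhai's plot follows from what a ReLU MLP is: a piecewise-affine function. Past its outermost hinge, every unit is frozen on or off, so the extrapolation is a straight line and cannot keep oscillating like sin(x). A hand-built toy network (weights chosen purely for illustration, not taken from the tweet) makes the point without any training:

```python
# A 1-hidden-layer ReLU MLP with hand-picked weights (illustrative only):
# hinges at x = 0, 1, 2, 3, so the function zigzags on [0, 3].
relu = lambda z: max(0.0, z)

def mlp(x):
    return relu(x) - 2 * relu(x - 1) + 2 * relu(x - 2) - 2 * relu(x - 3)

def second_diff(x, h=1.0):
    """Discrete curvature: zero wherever the function is affine."""
    return mlp(x - h) - 2 * mlp(x) + mlp(x + h)

print(second_diff(1.0))   # -2.0: nonzero curvature inside the hinge region
print(second_diff(10.0))  # 0.0: past the last hinge, the function is a straight line
```

The same argument applies to a trained network: whatever it fits inside [-20, 20], beyond its outermost hinge it must continue affinely, which is exactly the flat or linear tails visible when predicting over [-30, 30].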
Grégoire Delétang@gregdeletang·
If you pass a sequence of constant tokens to a transformer with rotary encodings, it will return a constant output distribution, because the attention values are constant. Idk if someone has spotted this "bug" before?
2 replies · 0 reposts · 0 likes · 388 views
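The mechanism here is simple: RoPE rotates queries and keys only, so identical tokens still share a single value vector, and each attention output is a convex combination of identical values, i.e. that vector itself, no matter what the (position-dependent) attention weights are. A pure-Python sketch, where the value vector and logits are arbitrary illustrative numbers:

```python
import math
import random

random.seed(0)
T, D = 6, 4                     # sequence length, head dimension
v = [0.3, -1.2, 0.7, 0.05]      # the one value vector shared by all identical tokens

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [x / s for x in e]

outputs = []
for i in range(T):
    # RoPE makes these logits vary with position, but it doesn't matter:
    # any softmax row sums to 1.
    logits = [random.gauss(0, 1) for _ in range(i + 1)]  # causal: attend to 0..i
    attn = softmax(logits)
    out = [sum(a * v[d] for a in attn) for d in range(D)]  # = v exactly
    outputs.append(out)

# Every position's output equals v: a convex combination of identical vectors.
print(outputs[0], outputs[-1])
```

With learned absolute position embeddings this breaks, because the embeddings make the token representations, and hence the values, differ per position.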
Grégoire Delétang@gregdeletang·
We slowly realize that humans do not, in fact, "generalize"
0 replies · 0 reposts · 1 like · 304 views
Grégoire Delétang@gregdeletang·
Interesting effect, I think I saw you mention it in one of your tweets indeed. Yes, my point is that ultimately, even with CoT, the model must be pretrained on a sufficiently large context length L, which contains at most L*log2(vocab_size) bits. That's enough to solve most problems, but it's still a hard limit.
0 replies · 0 reposts · 0 likes · 87 views
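The bound in the tweet is a one-line computation; the numbers below (a 4096-token window and a GPT-2-sized vocabulary of 50257) are illustrative, not any specific model's spec:

```python
import math

def context_capacity_bits(context_len, vocab_size):
    """Upper bound on information carried by one context window: L * log2(V) bits."""
    return context_len * math.log2(vocab_size)

# Illustrative GPT-2-style numbers: ~64K bits, i.e. a few kilobytes per window.
bits = context_capacity_bits(4096, 50257)
print(f"{bits:.0f} bits  (~{bits / 8 / 1024:.1f} KiB)")
```

A few kilobytes of state per window is plenty for most tasks, which is the point above, but it is a hard ceiling on how much problem state a single forward context can encode.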
Dimitris Papailiopoulos@DimitrisPapail·
this effect is especially magnified when doing self-improvement on top of pretrained models. Why would pretraining help? Because for a model to "learn an algorithm" it has to be anchored on an idea of a number system, having seen other examples of deduction, algorithms, etc. This is impossible to have without pretraining.
1 reply · 0 reposts · 1 like · 396 views
Dimitris Papailiopoulos@DimitrisPapail·
one of the few times in my career I feel good about calling a problem "solved", for an approximately satisfying value of "solved": length generalization challenges can be overcome with iterative self-improvement
Dimitris Papailiopoulos@DimitrisPapail

o3 can't multiply beyond a few digits... But I think multiplication, addition, maze solving and easy-to-hard generalization are actually solvable on standard transformers... with recursive self-improvement. Below is the accuracy of a tiny model teaching itself how to add.

6 replies · 5 reposts · 105 likes · 7.9K views
Grégoire Delétang@gregdeletang·
@DimitrisPapail Your work is still interesting for sure, but what I mean is that it doesn't solve the fundamental problem of the transformer architecture, which is that you must train it with a large enough context window.
1 reply · 0 reposts · 0 likes · 113 views
Grégoire Delétang@gregdeletang·
@DimitrisPapail I mean you could train with a longer context window, yes, even with just some fine-tuning. But testing with a longer context window than you trained on doesn't work experimentally; it's what people have been struggling with for a long time (i.e. length generalization).
1 reply · 0 reposts · 0 likes · 112 views
AK@_akhaliq·
Google DeepMind presents Agency Is Frame-Dependent
[image]
28 replies · 148 reposts · 1.1K likes · 137.3K views
Grégoire Delétang@gregdeletang·
@nouhadziri @DimitrisPapail It also makes me think about universality: is it really the right concept? Couldn't we get, say, a proof of the Riemann Hypothesis in a few years, from a model that can't add two numbers?
0 replies · 0 reposts · 0 likes · 30 views
Nouha Dziri@nouhadziri·
📢 DeepSeek R1 still cannot solve multiplication with 100% accuracy🫠😬 Though it can achieve high scores on hard math questions (AIME, MATH-500), extremely difficult physics, biology, and chemistry problems (GPQA Diamond), and coding challenges (LiveCode, CodeForces), problems that require advanced problem-solving skills, it struggles with a simple multiplication algorithm [1/8].
[image]
77 replies · 139 reposts · 1.3K likes · 306.9K views
Grégoire Delétang reposted
François Fleuret@francoisfleuret·
Here you know you should pull up the logit of the class you proposed when you get +1 and push it down with -1, but that's it. You have to "poke around" until you get a +1. So in the classification setup you get ~log_2(N) bits of information, and in the RL case 1 bit. 4/4
10 replies · 5 reposts · 114 likes · 7.9K views
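Fleuret's comparison can be written down with the binary entropy function: a correct N-way label is worth log2(N) bits under a uniform prior, while a +1/-1 reward is worth at most H(p) <= 1 bit, and far less when the success probability p is near 0 or 1. The numbers below are illustrative:

```python
import math

def label_bits(num_classes):
    """A correct N-way label conveys log2(N) bits (uniform prior)."""
    return math.log2(num_classes)

def reward_bits(p_success):
    """Binary entropy H(p): the most a +1/-1 reward can convey is 1 bit,
    and it shrinks as successes become rare or near-certain."""
    p = p_success
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

print(label_bits(1000))    # ~10 bits per labelled example
print(reward_bits(0.5))    # 1.0 bit: the best case for binary feedback
print(reward_bits(0.01))   # well under 0.1 bits: sparse-success RL learns slowly
```

This is why "poking around" for a +1 is so much slower than supervised classification: each trial leaks at most one bit about where the right logit is, versus roughly log2(N) bits per labelled example.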
Grégoire Delétang@gregdeletang·
Does anyone know, or can guess, how they made fp8 training actually work? That's probably the most impressive part to me. This would mean sparse training with gradients is indeed possible.
0 replies · 0 reposts · 1 like · 231 views