Grégoire Delétang

298 posts

@gregdeletang

Intelligence is good statistics

London, England · Joined January 2016
73 Following · 312 Followers
Grégoire Delétang@gregdeletang·
Claude Code is actually an insane tool, I can only say congrats to the team
0 replies · 0 reposts · 0 likes · 65 views
Dimitris Papailiopoulos@DimitrisPapail·
512 parameters: a new top scorer for 10-digit addition with transformers! Who can beat it?
Yinglun Zhu@yinglun122

Hey @DimitrisPapail we now have a 512-parameter model that does the job. I instructed Opus 4.6 to explore along the direction of low-rankness.

8 replies · 15 reposts · 239 likes · 34.5K views
Grégoire Delétang@gregdeletang·
@DimitrisPapail Cool stuff! I think the challenge would be even more interesting with a scaling law of parameter count versus number of digits (at iso ~100% accuracy)
0 replies · 0 reposts · 6 likes · 745 views
Grégoire Delétang reposted
Google DeepMind@GoogleDeepMind·
Genie 3 🤝 @Waymo The Waymo World Model generates photorealistic, interactive environments to train autonomous vehicles. This helps the cars navigate rare, unpredictable events before encountering them in reality. 🧵
[GIF]
80 replies · 260 reposts · 1.7K likes · 426.8K views
Grégoire Delétang@gregdeletang·
I find that LLMs have gone much further than I initially thought 3-4 years ago. My impression is that pretraining enables many non-trivial things that would never work otherwise (test-time "thinking", RLHF, fine-tuning on non-trivial targets, interaction with tools, etc.). Am I the only one to be positively surprised?
0 replies · 0 reposts · 4 likes · 157 views
Grégoire Delétang@gregdeletang·
@jxmnop Language modelling is compression. Lossy (weights) and lossless (sequence).
0 replies · 0 reposts · 0 likes · 2.7K views
dr. jack morris@jxmnop·
this post is complete misinformation. LLMs are lossy compressors! of *training data*. LLMs losslessly compress *prompts*, internally; that's what this paper shows. source: i am the author of "Language Model Inversion", the original paper on this
90 replies · 212 reposts · 4.1K likes · 381K views
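Delétang's compression framing can be made concrete: a language model's next-token probabilities define a lossless code in which a sequence costs -Σ log2 p(x_t | x_<t) bits, and arithmetic coding realizes that cost to within roughly two bits. A minimal sketch using a toy bigram model (the data, function names, and Laplace smoothing scheme are all illustrative, not from either thread):

```python
import math
from collections import defaultdict

def train_bigram(text):
    """Count bigram statistics with Laplace smoothing over the observed alphabet."""
    counts = defaultdict(lambda: defaultdict(int))
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1
    alphabet = sorted(set(text))
    def prob(ctx, sym):
        total = sum(counts[ctx].values())
        return (counts[ctx][sym] + 1) / (total + len(alphabet))
    return prob, alphabet

def code_length_bits(text, prob):
    """Ideal lossless code length under the model: -sum log2 p(x_t | x_{t-1}).
    Arithmetic coding achieves this within ~2 bits overall."""
    return -sum(math.log2(prob(a, b)) for a, b in zip(text, text[1:]))

data = "abababababababab"
prob, alphabet = train_bigram(data)
bits = code_length_bits(data, prob)
raw_bits = (len(data) - 1) * math.log2(len(alphabet))  # uniform-code baseline
print(f"model: {bits:.1f} bits  vs  raw: {raw_bits:.1f} bits")
```

A better model assigns higher probability to the sequence, which is the same thing as a shorter lossless code; that equivalence is what both tweets are gesturing at, applied to training data (lossy, in the weights) versus a given sequence (lossless, via the code).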
Grégoire Delétang@gregdeletang·
@zhaisf Back to 1995... How many people still discover NNs for the first time?
0 replies · 0 reposts · 0 likes · 299 views
Shuangfei Zhai@zhaisf·
A good way of understanding the inductive bias of neural nets is to train an MLP to regress to the sin(x) function. Below is training on 10K points in [-20, 20], predicting over [-30, 30] after 10 and 100 epochs of training. The implications shown here are surprisingly general.
[image]
48 replies · 71 reposts · 911 likes · 112.7K views
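The behaviour in Zhai's plot follows from what a ReLU MLP is: a piecewise-affine function. Past its outermost hinge, every unit is frozen on or off, so the extrapolation is a straight line and cannot keep oscillating like sin(x). A hand-built toy network (weights chosen purely for illustration, not taken from the tweet) makes the point without any training:

```python
# A 1-hidden-layer ReLU MLP with hand-picked weights (illustrative only):
# hinges at x = 0, 1, 2, 3, so the function zigzags on [0, 3].
relu = lambda z: max(0.0, z)

def mlp(x):
    return relu(x) - 2 * relu(x - 1) + 2 * relu(x - 2) - 2 * relu(x - 3)

def second_diff(x, h=1.0):
    """Discrete curvature: zero wherever the function is affine."""
    return mlp(x - h) - 2 * mlp(x) + mlp(x + h)

print(second_diff(1.0))   # -2.0: nonzero curvature inside the hinge region
print(second_diff(10.0))  # 0.0: past the last hinge, the function is a straight line
```

The same argument applies to a trained network: whatever it fits inside [-20, 20], beyond its outermost hinge it must continue affinely, which is exactly the flat or linear tails visible when predicting over [-30, 30].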
Grégoire Delétang@gregdeletang·
If you pass a sequence of constant tokens to a transformer with rotary encodings, it will return a constant output distribution, because the attention values are constant. Idk if someone has spotted this "bug" before?
2 replies · 0 reposts · 0 likes · 388 views
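The mechanism here is simple: RoPE rotates queries and keys only, so identical tokens still share a single value vector, and each attention output is a convex combination of identical values, i.e. that vector itself, no matter what the (position-dependent) attention weights are. A pure-Python sketch, where the value vector and logits are arbitrary illustrative numbers:

```python
import math
import random

random.seed(0)
T, D = 6, 4                     # sequence length, head dimension
v = [0.3, -1.2, 0.7, 0.05]      # the one value vector shared by all identical tokens

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [x / s for x in e]

outputs = []
for i in range(T):
    # RoPE makes these logits vary with position, but it doesn't matter:
    # any softmax row sums to 1.
    logits = [random.gauss(0, 1) for _ in range(i + 1)]  # causal: attend to 0..i
    attn = softmax(logits)
    out = [sum(a * v[d] for a in attn) for d in range(D)]  # = v exactly
    outputs.append(out)

# Every position's output equals v: a convex combination of identical vectors.
print(outputs[0], outputs[-1])
```

With learned absolute position embeddings this breaks, because the embeddings make the token representations, and hence the values, differ per position.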
Grégoire Delétang@gregdeletang·
We slowly realize that humans do not, in fact, "generalize"
0 replies · 0 reposts · 1 like · 304 views
Grégoire Delétang@gregdeletang·
Interesting effect, I think I saw you mention it in one of your tweets indeed. Yes, my point is that ultimately, even with CoT, the model must be pretrained on a sufficiently large context length L, which contains at most L*log2(vocab_size) bits. That's enough to solve most problems, but it's still a hard limit.
0 replies · 0 reposts · 0 likes · 87 views
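The bound in the tweet is a one-line computation; the numbers below (a 4096-token window and a GPT-2-sized vocabulary of 50257) are illustrative, not any specific model's spec:

```python
import math

def context_capacity_bits(context_len, vocab_size):
    """Upper bound on information carried by one context window: L * log2(V) bits."""
    return context_len * math.log2(vocab_size)

# Illustrative GPT-2-style numbers: ~64K bits, i.e. a few kilobytes per window.
bits = context_capacity_bits(4096, 50257)
print(f"{bits:.0f} bits  (~{bits / 8 / 1024:.1f} KiB)")
```

A few kilobytes of state per window is plenty for most tasks, which is the point above, but it is a hard ceiling on how much problem state a single forward context can encode.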
Dimitris Papailiopoulos@DimitrisPapail·
this effect is especially magnified when doing self-improvement on top of pretrained models. Why would pretraining help? Because for a model to "learn an algorithm" it has to be anchored on an idea of a number system, having seen other examples of deduction, algorithms, etc. This is impossible to have without pretraining.
1 reply · 0 reposts · 1 like · 396 views
Dimitris Papailiopoulos@DimitrisPapail·
one of the few times in my career I feel good about calling a problem "solved", for an approximately satisfying value of "solved": length generalization challenges can be overcome with iterative self-improvement
Dimitris Papailiopoulos@DimitrisPapail

o3 can't multiply beyond a few digits... But I think multiplication, addition, maze solving and easy-to-hard generalization are actually solvable on standard transformers... with recursive self-improvement. Below is the accuracy of a tiny model teaching itself how to add.

6 replies · 5 reposts · 105 likes · 7.9K views
Grégoire Delétang@gregdeletang·
@DimitrisPapail Your work is still interesting for sure, but what I mean is that it doesn't solve the fundamental problem of the transformer architecture, which is that you must train it with a large enough context window.
1 reply · 0 reposts · 0 likes · 113 views
Grégoire Delétang@gregdeletang·
@DimitrisPapail I mean you could train with a longer context window, yes, even with just some fine-tuning. But testing with a longer context window than you trained on doesn't work experimentally; it's what people have been struggling with for a long time (i.e. length generalization).
1 reply · 0 reposts · 0 likes · 112 views
AK@_akhaliq·
Google DeepMind presents Agency Is Frame-Dependent
[image]
28 replies · 148 reposts · 1.1K likes · 137.3K views
Grégoire Delétang@gregdeletang·
@nouhadziri @DimitrisPapail It also makes me think about universality: is it really the right concept? Couldn't we get, say, a proof of the Riemann Hypothesis in a few years, from a model that can't add two numbers?
0 replies · 0 reposts · 0 likes · 30 views
Nouha Dziri@nouhadziri·
📢 DeepSeek R1 still cannot solve multiplication with 100% accuracy🫠😬 Though it can achieve high scores on hard math questions (AIME, MATH-500), extremely difficult physics, biology, and chemistry problems (GPQA Diamond), and coding challenges (LiveCode, CodeForces), problems that require advanced problem-solving skills, it struggles with a simple multiplication algorithm [1/8].
[image]
77 replies · 139 reposts · 1.3K likes · 306.9K views
Grégoire Delétang reposted
François Fleuret@francoisfleuret·
Here you know you should pull up the logit of the class you proposed when you get +1 and push it down with -1, but that's it. You have to "poke around" until you get a +1. So in the classification setup you get ~log_2(N) bits of information, and in the RL case 1 bit. 4/4
10 replies · 5 reposts · 114 likes · 7.9K views
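Fleuret's comparison can be written down with the binary entropy function: a correct N-way label is worth log2(N) bits under a uniform prior, while a +1/-1 reward is worth at most H(p) <= 1 bit, and far less when the success probability p is near 0 or 1. The numbers below are illustrative:

```python
import math

def label_bits(num_classes):
    """A correct N-way label conveys log2(N) bits (uniform prior)."""
    return math.log2(num_classes)

def reward_bits(p_success):
    """Binary entropy H(p): the most a +1/-1 reward can convey is 1 bit,
    and it shrinks as successes become rare or near-certain."""
    p = p_success
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

print(label_bits(1000))    # ~10 bits per labelled example
print(reward_bits(0.5))    # 1.0 bit: the best case for binary feedback
print(reward_bits(0.01))   # well under 0.1 bits: sparse-success RL learns slowly
```

This is why "poking around" for a +1 is so much slower than supervised classification: each trial leaks at most one bit about where the right logit is, versus roughly log2(N) bits per labelled example.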
Grégoire Delétang@gregdeletang·
Does anyone know, or can guess, how they made fp8 training actually work? That's probably the most impressive part to me. This would mean sparse training with gradients is indeed possible.
0 replies · 0 reposts · 1 like · 231 views