Grégoire Delétang
298 posts

Grégoire Delétang
@gregdeletang
Intelligence is good statistics
London, England · Joined January 2016
73 Following · 312 Followers

512 parameters: a new top scorer for 10-digit addition with transformers!
Who can beat it?
Yinglun Zhu@yinglun122
Hey @DimitrisPapail we now have a 512-parameter model that does the job. I instructed opus 4.6 to explore along the direction of low-rankness.
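As a hedged illustration of the low-rankness direction mentioned above (the shapes below are hypothetical, not the 512-parameter model's actual architecture): factoring a dense d x d weight matrix into a rank-r product cuts its parameter count from d^2 to 2dr.

```python
# Toy parameter count: a dense d x d matrix vs. a rank-r factorization
# W ≈ A @ B with A of shape (d, r) and B of shape (r, d).
# Shapes are illustrative only, not the 512-parameter model's architecture.

def dense_params(d: int) -> int:
    return d * d

def low_rank_params(d: int, r: int) -> int:
    return 2 * d * r

print(dense_params(16))        # 256 parameters for a full 16x16 matrix
print(low_rank_params(16, 2))  # 64 parameters at rank 2
```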

@DimitrisPapail Cool stuff! I think the challenge would be even more interesting with a scaling law: number of parameters versus number of digits (at iso ~100% accuracy)
Grégoire Delétang retweeted

Genie 3 🤝 @Waymo
The Waymo World Model generates photorealistic, interactive environments to train autonomous vehicles.
This helps the cars navigate rare, unpredictable events before encountering them in reality. 🧵

I find that LLMs have gone much further than I initially thought 3-4 years ago. My impression is that pretraining enables many non-trivial things to work that would never work otherwise (test-time "thinking", RLHF, fine-tuning on non-trivial targets, interaction with tools, etc.). Am I the only one who is positively surprised?

@jxmnop Language modelling is compression. Lossy (weights) and lossless (sequence).
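A minimal sketch of the lossless side of that claim, assuming an autoregressive model that assigns a probability to each next token: arithmetic coding can compress a sequence to roughly its negative log-likelihood in bits, so a better predictor gives a shorter code. The per-token probabilities below are made up for illustration.

```python
import math

# Ideal lossless code length of a sequence under a language model:
# -sum(log2 p(token | context)) bits, which arithmetic coding approaches.
# The per-token probabilities below are made-up placeholders.

def code_length_bits(token_probs: list[float]) -> float:
    return -sum(math.log2(p) for p in token_probs)

confident_model = [0.9, 0.8, 0.95, 0.7]   # predicts the sequence well
uncertain_model = [0.1, 0.05, 0.2, 0.1]   # predicts it poorly

print(code_length_bits(confident_model))  # ~1.1 bits
print(code_length_bits(uncertain_model))  # ~13.3 bits
```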

@zhaisf Back to 1995... How many people still discover NNs for the first time?

Interesting effect, I think I saw you mention it in one of your tweets indeed. Yes, my point is that ultimately, even with CoT, the model must be pretrained on a sufficiently large context length L, which contains at most L*log2(vocab_size) bits. That's enough to solve most problems, but it's still a hard limit.
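To make the L*log2(vocab_size) bound concrete, here is a back-of-the-envelope calculation; the context length and vocabulary size are illustrative values, not figures from the thread.

```python
import math

# Upper bound on the information a context window can carry:
# L tokens over a vocabulary of size V hold at most L * log2(V) bits.
# L and V below are assumed values for illustration.
L = 2048
V = 32_000

capacity_bits = L * math.log2(V)
print(f"{capacity_bits:.0f} bits ≈ {capacity_bits / 8 / 1024:.1f} KiB")
# ~30650 bits ≈ 3.7 KiB
```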

This effect is especially magnified when doing self-improvement on top of pretrained models. Why would pretraining help? Because for a model to "learn an algorithm", it has to be anchored in an idea of a number system, having seen other examples of deduction, algorithms, etc. This is impossible to have without pretraining.

one of the few times in my career I feel good about calling a problem "solved", for an approximately satisfying value of "solved"
length generalization challenges can be overcome with iterative self-improvement
Dimitris Papailiopoulos@DimitrisPapail
o3 can't multiply beyond a few digits... But I think multiplication, addition, maze solving, and easy-to-hard generalization are actually solvable on standard transformers... with recursive self-improvement. Below is the accuracy of a tiny model teaching itself how to add.
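The recipe in the quoted tweet, as I read it, is: train on short problems, have the model label slightly longer ones, keep only the self-labels that pass a cheap check, and retrain on them. Below is a hedged toy sketch of that loop structure, not the authors' implementation; the dictionary-backed model_answer is a stand-in for a small transformer, and the commutativity filter is an assumed consistency check.

```python
import random

# Toy sketch of a recursive self-improvement loop for addition
# (not the authors' implementation). A "model" trained on n-digit sums
# labels (n+1)-digit problems for itself; self-labels that pass a cheap
# consistency check become new training data, and the loop repeats.

def model_answer(memory: dict, a: int, b: int) -> int:
    # Stand-in for a trained model's prediction; a real run would query
    # a small transformer here.
    return memory.get((a, b), a + b)

def self_improve(max_digits: int, examples_per_digit: int = 100) -> dict:
    memory: dict = {}
    for n in range(1, max_digits + 1):          # curriculum over length
        for _ in range(examples_per_digit):
            a = random.randrange(10 ** (n - 1), 10 ** n)
            b = random.randrange(10 ** (n - 1), 10 ** n)
            y = model_answer(memory, a, b)
            # Assumed consistency filter: accept a self-label only if the
            # model agrees with itself under commutativity.
            if y == model_answer(memory, b, a):
                memory[(a, b)] = y              # "retrain" on the label
    return memory

learned = self_improve(max_digits=3)
print(len(learned), "self-labeled examples accepted")
```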

@DimitrisPapail Your work is still interesting for sure, but what I mean is that it doesn't solve the fundamental problem of the transformer architecture, which is that you must train it with a large enough context window.

@DimitrisPapail I mean you could train with a longer context window, yes, even with just some fine-tuning. But testing with a longer context window (than trained on) doesn't work experimentally; it's what people have been struggling with for a long time (i.e., length generalization).

@_akhaliq @mhutter42 haven't checked it but mmh it smells a lot like Solomonoff's reference UTM

@nouhadziri @DimitrisPapail It also makes me think about universality: is it really the right concept? Couldn't we get, say, a proof of the RH in a few years, with a model that can't even add two numbers?

@nouhadziri @DimitrisPapail that comes back regularly. But when will it finally be solved?

📢 DeepSeek R1 still cannot solve multiplication with 100% accuracy 🫠😬
Though it can achieve high scores on hard math questions (AIME, MATH-500), extremely difficult physics, biology, and chemistry problems (GPQA Diamond), and coding challenges (LiveCode, CodeForces), all problems that require advanced problem-solving skills, it struggles with a simple multiplication algorithm [1/8].

Grégoire Delétang retweeted