Mathieu Van Vyve

11.6K posts

@MathieuVVyve

Prof of Operations Research at LSM (UCLouvain) and philosopher. Avid runner, sailor, climber... anything outdoors. The ecological catastrophe breaks my heart.

Joined September 2015
344 Following · 1.2K Followers
Pinned Tweet
Mathieu Van Vyve
Mathieu Van Vyve@MathieuVVyve·
Eradicate the optimist who takes the easy view that human values will persist no matter what we do. Annihilate the pessimist whose ineffectual cry is that the goal’s already missed however hard we try. - Piet Hein.
English
2
0
24
0
Mathieu Van Vyve
Mathieu Van Vyve@MathieuVVyve·
@raiyan006 @teostealth @scaling01 I do not agree. The mechanics in some of them are easy, but some others I find really hard to grasp. I had the same thing to say about ARC-AGI-2, by the way. But maybe it's only me.
English
1
0
0
7
Raiyan
Raiyan@raiyan006·
@teostealth @scaling01 It’s not that hard. Try them out. Can figure out most of the puzzles in under 30s. I think it’s difficult for ai maybe because there are no instructions and it’s a minimal setup
English
1
0
2
116
Mathieu Van Vyve
Mathieu Van Vyve@MathieuVVyve·
@ptremblay @teostealth @scaling01 I do not agree. The mechanics in some of them are easy, but some others I find really hard to grasp. I had the same thing to say about ARC-AGI-2, by the way. But maybe it's only me.
English
1
0
0
13
Eva Gelisch
Eva Gelisch@_EvaG_·
@pmagn @MathieuVVyve Because it unnecessarily exposes an attack surface for your credibility. Situation is dipshlt enough, no need to add artificial drama.
English
1
0
1
22
Mathieu Van Vyve
Mathieu Van Vyve@MathieuVVyve·
@GregKamradt Tried g50t, but after fifteen minutes and several game overs I only got to Level 3 and still do not really understand what the mechanics of the game are. I am a prof of mathematical optimization. I believe these are not "easy for humans, hard for AI". It's harder than this.
English
0
0
0
14
Greg Kamradt
Greg Kamradt@GregKamradt·
The 25 public ARC-AGI-3 games
On average, they are easier for humans and AI
However, the difficulty ranges, there are very easy games and games which are more difficult
Easy for AI: arcprize.org/tasks/vc33
Hard for AI: arcprize.org/tasks/g50t
Greg Kamradt@GregKamradt

Today we're launching ARC-AGI-3
135 Novel Environments (nearly 1K levels) we build by hand
It is the only unsaturated agent benchmark in the world
Each game is 100% human solvable, AI scores <1%
This gap between human and AI performance proves we do not have AGI
Agents today need human handholding. Agents that beat V3 will prove they don’t need that level of supervision.
Agents that beat V3 will demonstrate:
* Continual learning - Each level builds on top of each other. You can’t beat level 3 without carrying forward what you learned in levels 1 and 2.
* World modeling - Many of the environments require planning actions many actions ahead. AI will have no choice but to build an internal world model for how the environment works, run simulations “in its head” and proceed with an action
In our early testing, we’ve seen a few clear failure modes of AI:
* Anticipation of future events - If an environment requires that AI set up a scene, and then carry out a scenario (like in sp80), it starts to break down.
* Anchoring on early hypothesis - Early in a game it comes up with a hypothesis (even if wrong) and refuses to update its beliefs later.
* Thinking it’s playing another game - AI thinks it’s playing chess, pacman. The training data holds hard!
One major problem is there is too much data to carry forward in a single context. Models must learn what to remember and what to forget
The agent that beats ARC-AGI-3 will have demonstrated the most authoritative evidence of progress towards general intelligence to date
We're excited to get this out and excited to see what you think

English
4
4
40
4.8K
Mathieu Van Vyve
Mathieu Van Vyve@MathieuVVyve·
@ChristosTzamos This is really clever. I like it very much. But, even in principle, what could possibly be a situation where this makes sense, rather than just executing the standard program itself?
English
0
0
0
235
Christos Tzamos
Christos Tzamos@ChristosTzamos·
1/4 Want to build a computer inside a transformer? Given the wide interest in our project, we are releasing the code and the weights so that others can build on our construction.
Christos Tzamos tweet media
English
17
92
621
53.5K
Mathieu Van Vyve
Mathieu Van Vyve@MathieuVVyve·
@scaling01 Given the scoring, I believe the hard part will be to go from ~0% to 10% (i.e. succeeding at all tasks with 3x the number of actions of the second best human). From there then it will go very fast to 80%.
English
0
0
0
39
Lisan al Gaib
Lisan al Gaib@scaling01·
my prediction is that the benchmark will be mostly useless and will miss most progress in 2026
only at the end of 2026 or in 2027 will we see scores ramp up from like 20 to 80% within a few months
right now all but 3 models score literally 0 and the scores are meaningless. Gemini 3.1 Pro clearly doesn't belong to the same group as GPT-5.4 and Opus 4.6 and there's 0 signal for all other models
Lisan al Gaib@scaling01

notice how they also gave higher weight to later levels?
the benchmark was designed to detect the continual learning breakthrough when it happens in a year or so they will say "LOOK OUR BENCHMARK SHOWED THAT. WE WERE THE ONLY ONES"

English
13
5
117
11K
Mathieu Van Vyve
Mathieu Van Vyve@MathieuVVyve·
@chatgpt21 Seems reasonable as a prediction. Also: 30% might be similar to what a median educated human would achieve (i.e. succeeding at all tests with ~double the number of actions of the second best human).
English
0
0
0
63
Mathieu Van Vyve
Mathieu Van Vyve@MathieuVVyve·
@mikeknoop
1. Very limited harness: OK, but it should include the scoring methodology.
2. No tools: I do not get it. In the real world, AI will always have tools available.
English
0
0
0
30
Mike Knoop
Mike Knoop@mikeknoop·
When we introduced ARC-AGI-2, we switched from only reporting accuracy (%) to include efficiency ($). This was in response to AI progress - reasoning models necessitated this change as you can always buy more performance for more test-time compute. To understand AGI progress you need both.
For ARC-AGI-3, we're again adapting to AI progress. We've introduced a "stateless client" scoring philosophy for our official Verified leaderboard. The idea is that future AGI will not require special state management by the test-giver as this can introduce accidental or intentional bias.
Modern reasoning models have gotten so good that you can now achieve domain-specific progress (as Codex and Claude Code demonstrate) when humans craft harnesses around base model intelligence. ARC is interested not in testing how well humans can use AI, we want to test the AI directly.
This is an AGI-pilled move to give humans and AI the closest practical testing experience to reveal true progress. I expect other benchmarks which care about testing generalization to adopt a similar viewpoint.
English
12
6
80
6.7K
Mathieu Van Vyve
Mathieu Van Vyve@MathieuVVyve·
@GregKamradt The harness should at a very minimum specify how the score is computed, and in particular convey the message that the number of actions taken should be minimized. Otherwise a model might solve many games but score very low, without being able to understand why.
English
1
0
1
31
Greg Kamradt
Greg Kamradt@GregKamradt·
If you're sufficiently AGI pilled, no harness is the best harness
Opus 4.5 needed heavy harnessing, Opus 4.6 needed less
In the limit, the only "harness" AGI will need is context to the outside world
No thinking tricks, no prompts with human intelligence baked in
If you really want to know if AGI is here we need to be a direct pass through for model performance
Lisan al Gaib@scaling01

this is pretty much worst case performance
no harness at all and very simplistic prompt

English
46
14
261
59.2K
Mathieu Van Vyve
Mathieu Van Vyve@MathieuVVyve·
@scaling01 At a minimum the harness should specify exactly how the model will be scored, in particular that the scoring is very dependent on the number of actions taken.
English
0
0
0
81
Lisan al Gaib
Lisan al Gaib@scaling01·
ARC-AGI-3 scores for GPT-5.4, Gemini 3.1 Pro and Opus 4.6
Gemini 3.1 Pro: 0.37%
GPT-5.4: 0.26%
Opus 4.6: 0.25%
Grok 4.2: 0%
Lisan al Gaib tweet media
Indonesia
138
190
3.1K
414.1K
Mathieu Van Vyve
Mathieu Van Vyve@MathieuVVyve·
@MinuteofZombie Initially thought the same, but when I listened again to the complete sequence, I realized he might not really talk about recovered stuff, but the missile tech they were discussing just prior.
English
0
0
1
76
Mathieu Van Vyve
Mathieu Van Vyve@MathieuVVyve·
@UAPDr It's hard to understand how it is not possible to follow these drones with helicopters equipped with radars to see where they go afterwards. I am sure they know so much more about them than they publicly say.
English
0
0
1
45
Dr. Dan
Dr. Dan@UAPDr·
When 12–15 “drones” come in waves, can’t be identified, resist jamming, the label starts to wear thin. That’s why I call them #UAP
English
15
35
146
9K
Jean-Ri gol
Jean-Ri gol@CentreJeanRiGol·
Does an evening event organized by a political party belong in a municipal school?
GIF
Français
3
3
18
925
Mathieu Van Vyve
Mathieu Van Vyve@MathieuVVyve·
@ZoharKo Also I believe that people might be more or less competent at eliciting useful answers from these systems.
English
0
0
4
393
Zohar Komargodski
Zohar Komargodski@ZoharKo·
@MathieuVVyve But people like T. Tao claim that it’s actually truly useful for their research. Much beyond just using it as a super-powered search, which all of us do.
English
1
1
1
760
Zohar Komargodski
Zohar Komargodski@ZoharKo·
An interesting dichotomy: while the internet is flooded with news of math problems solved or nearly solved by AI, two young mathematicians I know—both top contenders for the Fields Medal this year—maintain that AI remains only marginally useful for their research.
English
7
6
101
8K
Mathieu Acher
Mathieu Acher@acherm·
A few days ago I shared a chess engine built from scratch in TeX. Now I pushed the experiment further: I asked a coding agent to build one in #Brainfuck. Yes: a #chess engine in a language with 8 characters and almost no abstractions. Never been done. Thread and blog post ⤵️↘️♟️
Mathieu Acher tweet media
English
3
9
59
4.5K
Mathieu Van Vyve
Mathieu Van Vyve@MathieuVVyve·
@AcerFur AI math will accelerate math research by quickly knocking down all easy-ish questions (which was time-consuming until recently).
English
0
0
0
76
Acer
Acer@AcerFur·
It’s funny looking through r/singularity or r/accelerate and seeing people hype up our result more than we do
English
7
3
127
5.3K