Mathieu Van Vyve

11.6K posts

@MathieuVVyve

Prof of Operations Research at LSM (UCLouvain) and philosopher. Avid runner, sailor, climber... anything outdoors. The ecological catastrophe breaks my heart.

Joined September 2015

344 Following · 1.2K Followers

Pinned Tweet

Mathieu Van Vyve@MathieuVVyve·Feb 3

Eradicate the optimist who takes the easy view that human values will persist no matter what we do. Annihilate the pessimist whose ineffectual cry is that the goal’s already missed however hard we try. - Piet Hein.

Mathieu Van Vyve@MathieuVVyve·2d

@raiyan006 @teostealth @scaling01 I do not agree. The mechanics of some of them are easy, but others I find really hard to grasp. I would say the same of ARC-AGI-2, by the way. But maybe it's only me.

Raiyan@raiyan006·3d

@teostealth @scaling01 It’s not that hard. Try them out. You can figure out most of the puzzles in under 30s. I think it’s difficult for AI, maybe because there are no instructions and it’s a minimal setup.

Lisan al Gaib@scaling01·3d

They went with a minimal setup ... and ARC-AGI-3 is very challenging for humans

Lisan al Gaib@scaling01

ARC-AGI-3 scores for GPT-5.4, Gemini 3.1 Pro and Opus 4.6:
Gemini 3.1 Pro: 0.37%
GPT-5.4: 0.26%
Opus 4.6: 0.25%
Grok 4.2: 0%

Mathieu Van Vyve@MathieuVVyve·2d

@ptremblay @teostealth @scaling01 I do not agree. The mechanics of some of them are easy, but others I find really hard to grasp. I would say the same of ARC-AGI-2, by the way. But maybe it's only me.

Philippe Tremblay@ptremblay·3d

@teostealth @scaling01 I did a few. It's very obvious and boring. The standard ARC-AGI tasks are more challenging in my experience.

Mathieu Van Vyve@MathieuVVyve·2d

@_EvaG_ @pmagn Exactly, this makes it unnecessarily easy for a climate denier to attack you.

Eva Gelisch@_EvaG_·2d

@pmagn @MathieuVVyve Because it unnecessarily exposes an attack surface for your credibility. Situation is dipshlt enough, no need to add artificial drama.

Climate Watcher 🔥🇨🇦🇬🇧 🇯🇲🌺@pmagn·4d

Folks this can't be happening 🔥👀

Mathieu Van Vyve@MathieuVVyve·2d

@GregKamradt I tried g50t, but after fifteen minutes and several game overs I only got to Level 3 and still do not really understand what the mechanics of the game are. I am a prof of mathematical optimization. I believe these are not "easy for humans, hard for AI". It's harder than this.

Greg Kamradt@GregKamradt·3d

The 25 public ARC-AGI-3 games
On average, they are easier for humans and AI
However, the difficulty ranges, there are very easy games and games which are more difficult
Easy for AI: arcprize.org/tasks/vc33
Hard for AI: arcprize.org/tasks/g50t

Greg Kamradt@GregKamradt

Today we're launching ARC-AGI-3
135 Novel Environments (nearly 1K levels) we build by hand
It is the only unsaturated agent benchmark in the world
Each game is 100% human solvable, AI scores <1%
This gap between human and AI performance proves we do not have AGI
Agents today need human handholding. Agents that beat V3 will prove they don’t need that level of supervision.
Agents that beat V3 will demonstrate:
* Continual learning - Each level builds on top of each other. You can’t beat level 3 without carrying forward what you learned in levels 1 and 2.
* World modeling - Many of the environments require planning actions many actions ahead. AI will have no choice but to build an internal world model for how the environment works, run simulations “in its head” and proceed with an action
In our early testing, we’ve seen a few clear failure modes of AI:
* Anticipation of future events - If an environment requires that AI set up a scene, and then carry out a scenario (like in sp80), it starts to break down.
* Anchoring on early hypothesis - Early in a game it comes up with a hypothesis (even if wrong) and refuses to update its beliefs later.
* Thinking it’s playing another game - AI thinks it’s playing chess, pacman. The training data holds hard!
One major problem is there is too much data to carry forward in a single context. Models must learn what to remember and what to forget
The agent that beats ARC-AGI-3 will have demonstrated the most authoritative evidence of progress towards general intelligence to date
We're excited to get this out and excited to see what you think

Mathieu Van Vyve@MathieuVVyve·2d

@ChristosTzamos This is really clever. I like it very much. But, even in principle, what could possibly be a situation where this makes sense, rather than just executing the standard program itself?

Christos Tzamos@ChristosTzamos·3d

1/4 Want to build a computer inside a transformer? Given the wide interest in our project, we are releasing the code and the weights so that others can build on our construction.

Mathieu Van Vyve@MathieuVVyve·2d

@scaling01 Given the scoring, I believe the hard part will be to go from ~0% to 10% (i.e. succeeding at all tasks with 3x the number of actions of the second best human). From there then it will go very fast to 80%.

Lisan al Gaib@scaling01·3d

my prediction is that the benchmark will be mostly useless and will miss most progress in 2026
only at the end of 2026 or in 2027 will we see scores ramp up from like 20 to 80% within a few months
right now all but 3 models score literally 0 and the scores are meaningless. Gemini 3.1 Pro clearly doesn't belong to the same group as GPT-5.4 and Opus 4.6 and there's 0 signal for all other models

Lisan al Gaib@scaling01

notice how they also gave higher weight to later levels?
the benchmark was designed to detect the continual learning breakthrough when it happens
in a year or so they will say "LOOK OUR BENCHMARK SHOWED THAT. WE WERE THE ONLY ONES"

Mathieu Van Vyve@MathieuVVyve·2d

@chatgpt21 Seems reasonable as a prediction. Also : 30% might be similar to what a median educated human would achieve (i.e. succeed at all tests with ~double the number of actions of the second best human).
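The two figures quoted in this thread (about 10% when using 3x the baseline number of actions, about 30% when using 2x) are roughly what one would get if a level's score were the square of the ratio between baseline actions and the agent's actions. The sketch below only illustrates that arithmetic. The squared-ratio rule and the name `level_score` are assumptions made for this example, not the official ARC-AGI-3 scoring formula, and the extra weight on later levels mentioned elsewhere in the thread is ignored.

```python
def level_score(agent_actions: int, baseline_actions: int) -> float:
    """Hypothetical per-level score: squared action-efficiency ratio, capped at 1.

    Assumption for illustration only; not taken from the ARC-AGI-3 documentation.
    """
    if agent_actions <= 0 or baseline_actions <= 0:
        raise ValueError("action counts must be positive")
    # An agent that needs fewer actions than the baseline does not score above 1.
    ratio = min(1.0, baseline_actions / agent_actions)
    return ratio ** 2


if __name__ == "__main__":
    baseline = 100  # hypothetical action count of the reference human
    for multiple in (1, 2, 3):
        score = level_score(multiple * baseline, baseline)
        print(f"{multiple}x the baseline actions -> {100 * score:.1f}%")
```

Under this assumed rule, solving every level with 3x the baseline actions gives about 11%, and 2x gives 25%, which is in the same range as the ~10% and ~30% mentioned in the replies.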

Chris@chatgpt21·3d

Deepmind fellow confident that ARC AGI 3 won’t last long
I predict frontier models will get 30% by Sep-Oct

Samuel Albanie 🇬🇧@SamuelAlbanie

i don't think arc-agi-3 will last very long
environments are fun tho

Mathieu Van Vyve@MathieuVVyve·3d

@mikeknoop 1. A very limited harness is OK, but it should include the scoring methodology. 2. No tools: I do not get it. In the real world, AI will always have tools available.

Mike Knoop@mikeknoop·3d

When we introduced ARC-AGI-2, we switched from only reporting accuracy (%) to include efficiency ($). This was in response to AI progress - reasoning models necessitated this change as you can always buy more performance for more test-time compute. To understand AGI progress you need both. For ARC-AGI-3, we're again adapting to AI progress. We've introduced a "stateless client" scoring philosophy for our official Verified leaderboard. The idea is that future AGI will not require special state management by the test-giver as this can introduce accidental or intentional bias. Modern reasoning models have gotten so good that you can now achieve domain-specific progress (as Codex and Claude Code demonstrate) when humans craft harnesses around base model intelligence. ARC is interested not in testing how well humans can use AI, we want to test the AI directly. This is an AGI-pilled move to give humans and AI the closest practical testing experience to reveal true progress. I expect other benchmarks which care about testing generalization to adopt a similar viewpoint.

Mathieu Van Vyve@MathieuVVyve·3d

@GregKamradt The harness should at a very minimum specify how the score is computed, and in particular convey the message that the number of actions taken should be minimized. Otherwise a model might solve many games but score very low, without being able to understand why.

Greg Kamradt@GregKamradt·3d

If you're sufficiently AGI pilled, no harness is the best harness
Opus 4.5 needed heavy harnessing, Opus 4.6 needed less
In the limit, the only "harness" AGI will need is context to the outside world
No thinking tricks, no prompts with human intelligence baked in
If you really want to know if AGI is here we need to be a direct pass through for model performance

Lisan al Gaib@scaling01

this is pretty much worst case performance no harness at all and very simplistic prompt

Mathieu Van Vyve@MathieuVVyve·3d

@scaling01 At a minimum the harness should specify exactly how the model will be scored, in particular that the scoring is very dependent on the number of actions taken.

Lisan al Gaib@scaling01·3d

ARC-AGI-3 scores for GPT-5.4, Gemini 3.1 Pro and Opus 4.6:
Gemini 3.1 Pro: 0.37%
GPT-5.4: 0.26%
Opus 4.6: 0.25%
Grok 4.2: 0%

Mathieu Van Vyve@MathieuVVyve·3d

@MinuteofZombie I initially thought the same, but when I listened again to the complete sequence, I realized he might not really be talking about recovered stuff, but about the missile tech they were discussing just prior.

Missileman@MinuteofZombie·4d

We are just casually talking about UFO technology exploitation on Fox News with a retired assistant FBI director that’s not even connected to UFOs at all?

Dr. Dan@UAPDr

The striking thing in this Fox News clip featuring former Assistant FBI Director Chris Swecker is not the speculation around the general. It is how casually the conversation moves through recovered UFOs, reverse engineering, and an arms race as if all three are already settled.

Mathieu Van Vyve@MathieuVVyve·4d

@UAPDr It's hard to understand why it is not possible to follow these drones with helicopters equipped with radars to see where they go afterwards. I am sure they know much more about them than they publicly say.

Dr. Dan@UAPDr·4d

When 12–15 “drones” come in waves, can’t be identified, resist jamming, the label starts to wear thin. That’s why I call them #UAP

Mathieu Van Vyve@MathieuVVyve·4d

@CentreJeanRiGol Rixensart !

Jean-Ri gol@CentreJeanRiGol·4d

Does an evening event organized by a political party have its place in a municipal school?

Mathieu Van Vyve@MathieuVVyve·4d

@ZoharKo Also I believe that people might be more or less competent at eliciting useful answers from these systems.

Zohar Komargodski@ZoharKo·4d

@MathieuVVyve But people like T.Tao claim that it’s actually truly useful for their research. Much beyond just using it as super powered search which all of us do.

Zohar Komargodski@ZoharKo·4d

An interesting dichotomy: while the internet is flooded with news of math problems solved or nearly solved by AI, two young mathematicians I know—both top contenders for the Fields Medal this year—maintain that AI remains only marginally useful for their research.

Mathieu Van Vyve@MathieuVVyve·4d

@picharbonnier about the turing test arxiv.org/abs/2503.23674

Mathieu Van Vyve@MathieuVVyve·4d

@picharbonnier The Turing test was passed many years ago. So that is not what Jensen is talking about here (I disagree with him on that). On the other hand, these systems now genuinely reason, e.g.: blog.mathieuacher.com/BFChessChessEn… epoch.ai/frontiermath/o…

Pierre Charbonnier@picharbonnier·4d

I have never believed in the Turing test. I can see very well how a machine can pass itself off as an intelligence without really having one.

Polymarket@Polymarket

BREAKING: NVIDIA CEO announces “we’ve achieved AGI”

Mathieu Van Vyve@MathieuVVyve·4d

@acherm WT(B)F ?! Insane.

Mathieu Acher@acherm·5d

A few days ago I shared a chess engine built from scratch in TeX. Now I pushed the experiment further: I asked a coding agent to build one in #Brainfuck. Yes: a #chess engine in a language with 8 characters and almost no abstractions. Never been done. Thread and blog post ⤵️↘️♟️

Mathieu Van Vyve@MathieuVVyve·4d

@AcerFur AI math will accelerate math research by quickly knocking down all easy-ish questions (which was time-consuming until recently).

Acer@AcerFur·5d

It’s funny looking through r/singularity or r/accelerate and seeing people hype up our result more than we do
