Jay Frydman

51 posts

Jay Frydman

@JayFrydman

CPO at @symbolica Building @agenticasdk

San Francisco, CA Katılım Aralık 2025

17 Takip Edilen16 Takipçiler

Jay Frydman@JayFrydman·5h

@realdrewcarson @agenticasdk You will soon see frontier labs putting similar RLM harnesses behind the API & achieve similar performance gains on tasks like ARC-3

English

Drew Carson@realdrewcarson·6h

@JayFrydman @agenticasdk is the harness behind the API? if so, great, because that's what they're testing - performance on generic API endpoints. But this seems like a task-specific harness for the ARC-AGI-3 challenge.

English

Agentica@agenticasdk·21h

We scored 36.08% on ARC-AGI-3 in one day using the Agentica SDK.

English

131

1.4K

371.7K

Jay Frydman@JayFrydman·6h

@MickeySteamboat @mark_k x.com/agenticasdk/st…

Agentica@agenticasdk

We scored 36.08% on ARC-AGI-3 in one day using the Agentica SDK.

QME

Satoshi Nakamoto, Andrew Rulnick@MickeySteamboat·1d

@mark_k this will be knocked out of the water in no time. end of 2026

English

Mark Kretschmann@mark_k·2d

ARC-AGI 3 is here, and all existing AI models are below 1% on the benchmark. It's gonna take a while until this one is saturated. How it measures intelligence: - 100% human-solvable environments - Skill-acquisition efficiency over time - Long-horizon planning with sparse feedback - Experience-driven adaptation across multiple steps "As long as there is a gap between AI and human learning, we do not have AGI."

English

196

15.9K

Jay Frydman@JayFrydman·6h

@mark_k @fchollet x.com/agenticasdk/st…

Agentica@agenticasdk

We scored 36.08% on ARC-AGI-3 in one day using the Agentica SDK.

QME

Jay Frydman@JayFrydman·7h

@sd_marlow @FakePsyho The Agentica SDK agent beat 113/182 games in one day without using vision

English

Steven Marlow@sd_marlow·1d

@FakePsyho It bothers me that Chollet won't refer to these as visual challenges. It's also a failure for the benchmark to not measure for exploration vs understanding. They have a default assumption that min # of action on 1st try = intelligence (it doesn't).

English

391

Psyho@FakePsyho·2d

Some early non-trivial thoughts about ARC-AGI-3 after reading the paper: - Due to quadratic scaling of steps (see attached image) & linear weight scaling of levels, benchmark scores will follow much more "steep" sigmoid than usual (think sigmoid x quadratic function); I'm not hating this design, but this means that the score on this benchmark has to be interpreted differently than literally any other popular benchmark - expect really silly takes from your local AI influencers - Each game was tested with 10 players and the human baseline is equal to second-best performance; this also means that in order to include the game in the benchmark, it was enough for only two human players to finish the game (out of 10); the public games are said to be significantly easier than private set; now, the potential problem here is that there's a big difference between puzzle being hard (tricky observations, complex logical path) and obtuse (multiple potential interpretations, unclued "random" game mechanic). Success% under heavy time constraint doesn't differentiate between those two. If we have obtuse games in the private set then getting near 100% might be realistically impossible (assuming very strict limit on actions, etc) - The games are similar (especially in terms of aesthetics) and I'd expect for human performance to drastically improve with the number of games that they have played. This is especially true for people that had no prior exposition to puzzle games. In many games the hardest part is figuring out that the specific part of the screen is clickable and you do get better at identifying such spots with each consecutive game you're playing. This is true because ARC-AGI-3 visual language is somewhat distant from your usual puzzle game visuals. AI is going to be severely disadvantaged until public games are added to the training data, so expect big initial jump in the benchmark scores (assuming the labs care about that). Overall, I feel that this is a solid release, but I'm a bit worried that there might be some rather obtuse games in the private set that will heavily skew the results.

English

166

22.2K

Jay Frydman@JayFrydman·7h

@rawantitmc @agenticasdk 🇦🇷

QME

Saffron Warlord@rawantitmc·18h

@agenticasdk Is argentina based on neurosymbolic reasoning?

English

253

Jay Frydman@JayFrydman·7h

@InfiniteHexx @agenticasdk You can call it a harness, no problem.

English

InfiniteHexx@InfiniteHexx·18h

@agenticasdk They used a "general purpose symbolic exoskeleton" to accomplish this, but don't you dare call it a harness! It's totally not a semantic dodge!

English

442

Jay Frydman@JayFrydman·7h

@chillMSR @agenticasdk Chain-of-thought models directly served through the API also use a harness

English

omg is ed!!@chillMSR·20h

@agenticasdk cheating in a benchmark does not make your ai model better share the results without a harness

English

1.3K

Jay Frydman@JayFrydman·7h

@x1f4r @agenticasdk Chain-of-thought models are a harness

English

x1f4r@x1f4r·20h

@agenticasdk The point of ARC-3 is to allow no harnesses for the competition. It makes it actually hard to beat by letting the model approach it exactly like a human would.

English

3.6K

Jay Frydman@JayFrydman·7h

@realdrewcarson @agenticasdk Chain-of-though models are a harness.

English

Drew Carson@realdrewcarson·20h

@agenticasdk Cool, now do it without a harness.

English

3.6K

Jay Frydman@JayFrydman·7h

@flowersslop @agenticasdk Chain-of-thought models are a harness

English

Flowers ☾@flowersslop·20h

@agenticasdk wasnt the point of arc agi 3 exaclty NO harness?

English

5.3K

Jay Frydman@JayFrydman·7h

@chatgpt21 @agenticasdk @spicey_lemonade This is a reflection of the scoring methodology of the challenge. The agent only used 1.68x the action count of the human baseline, which is the score of the second-best human player. If the ARC foundation provides the human score distribution we can compare directly to avg human

English

Chris@chatgpt21·20h

@agenticasdk @spicey_lemonade Significantly worse than a human

English

4.8K

Jay Frydman@JayFrydman·7h

@chatgpt21 @vr4300 @symbolica @scaling01 Not training on anything - we used the same harness we open sourced 3 weeks ago github.com/symbolica-ai/A…

English

Chris@chatgpt21·12h

@vr4300 @symbolica @scaling01 Are you training off their public set?

English

154

George Morgan@vr4300·21h

Unbelievable result from the @symbolica research team! In just 1 day since the launch of ARC-AGI-3 we have been able to raise the score from 0% to 36%, and within a very close margin to human baseline as well. See our blog post linked in this thread for a detailed score breakdown and replays.

Agentica@agenticasdk

We scored 36.08% on ARC-AGI-3 in one day using the Agentica SDK.

English

231

84.5K

Jay Frydman@JayFrydman·8h

@jdlichtman 👀

QME

Jared Duker Lichtman@jdlichtman·15h

Not long for this world

Agentica@agenticasdk

We scored 36.08% on ARC-AGI-3 in one day using the Agentica SDK.

English

1.6K

Jay Frydman@JayFrydman·8h

@leothecurious Our approach did not use vision and solved 113/182 games with 1.68x action count of human baseline in 1 day. x.com/agenticasdk/st…

Agentica@agenticasdk

We scored 36.08% on ARC-AGI-3 in one day using the Agentica SDK.

English

144

davinci@leothecurious·13h

for the purposes of AI benchmarking, this is serious cope. it may be fair in a purely inquisitive question as in "what can a truly alien intelligence figure out?", but it poorly applies to the purposes of evaluating the true automation potential of AIs built to function in a human society. AIs share our world, they share our culture, and they share our jobs. they aren't mean to be alien, and we aren't actively trying to make them more alien to humans. besides, almost every single LLM/VLM benchmark out there is overwhelmingly knowledge-dependent, counting on prior knowledge about human language (often english in particular), human mathematical notation and concepts, human culture and man-made objects, and all sorts of details that have almost no overlap with intelligence per se. this benchmark goes further than almost any i've sedn in decoupling the measurement of knowledge from the measurement of intelligence. this benchmark (imperfectly but commendably) tries to not only measure how well AIs answer questions, but how well they ask the right questions under novelty, to then uncover answers they don't already know when walking into the test.

Simo Ryu@cloneofsimo

This is pure perception task with heavy cultural prior: you need geometical intuition, some prior on solving mazes, what typical game interaction feels like. Like if alien intelligence that doesnt have visual perception like ours, and doesnt know what nintendo is, how are they supposed to solve this? (And yes, there are animals without visual perception) (Oh and guess what other intelligence dont have visual perception and geometrical prior like ours)

English

5.6K

Jay Frydman@JayFrydman·8h

@JonhernandezIA 👀

QME

Jon Hernandez@JonhernandezIA·18h

That was quick...

Agentica@agenticasdk

We scored 36.08% on ARC-AGI-3 in one day using the Agentica SDK.

English

1.9K

Jay Frydman@JayFrydman·8h

@daniel_mac8 👀

QME

Dan McAteer@daniel_mac8·19h

arc-agi-3 will be solved by Saturday.

Agentica@agenticasdk

We scored 36.08% on ARC-AGI-3 in one day using the Agentica SDK.

English

6.3K

Jay Frydman@JayFrydman·8h

@DeryaTR_ 👀

QME

Derya Unutmaz, MD@DeryaTR_·20h

Ouch!

Agentica@agenticasdk

We scored 36.08% on ARC-AGI-3 in one day using the Agentica SDK.

English

188

20.9K

Jay Frydman@JayFrydman·8h

@jmbollenbacher The Agentica SDK agent only used 1.68x the actions of the human baseline (the score of the second best human player). Same harness as we open sourced three weeks ago.

English

JMB 🧙‍♂️@jmbollenbacher·19h

Called it. Double digit scores very soon after launch. ARC-AGI-3 isnt as hard as the ARC Foundation people want you to think. They just rigged the evaluation system to make it look harder than it is so it wouldnt insta-saturate on launch.

Agentica@agenticasdk

We scored 36.08% on ARC-AGI-3 in one day using the Agentica SDK.

English

826

Jay Frydman@JayFrydman·8h

@cloneofsimo Our approach does not use vision, used the same harness as we open sourced three weeks ago before the public announcement on March 25th, and solved 113/182 of the playable levels in 1 day for a fraction of the cost (~9x cheaper vs. Opus 4.6 baseline). x.com/agenticasdk/st…

Agentica@agenticasdk

We scored 36.08% on ARC-AGI-3 in one day using the Agentica SDK.

English

Simo Ryu@cloneofsimo·16h

François Chollet@fchollet

ARC-AGI-3 is out now! We've designed the benchmark to evaluate agentic intelligence via interactive reasoning environments. Beating ARC-AGI-3 will be achieved when an AI system matches or exceeds human-level action efficiency on all environments, upon seeing them for the first time. We've done extensive human testing that shows 100% of these environments are solvable by humans, upon first contact, with no prior training and no instructions. Meanwhile, all frontier AI reasoning models do under 1% at this time.

English

24.7K

Jay Frydman@JayFrydman·8h

@xlr8harder @arcprize We added a multiple of human performance in our announcement for this reason. Our agent built with the Agentica SDK used an average of 1.68x of the human baseline action count. x.com/agenticasdk/st…

Agentica@agenticasdk

We scored 36.08% on ARC-AGI-3 in one day using the Agentica SDK.

English

101

xlr8harder@xlr8harder·14h

The biggest ARC-AGI-3 issue (and an easy change @arcprize could still make) is presenting a score as a percentage of human score, while the scoring method ensures that it does not mean what a percentage implies to most readers. I'm already seeing people confused by this.

English

2.8K

Keşfet

@realdrewcarson @agenticasdk @MickeySteamboat @mark_k @fchollet @sd_marlow @FakePsyho @rawantitmc