Jay Frydman
@JayFrydman

51 posts

CPO at @symbolica. Building @agenticasdk.

San Francisco, CA · Joined December 2025
17 Following · 16 Followers
Jay Frydman @JayFrydman
@realdrewcarson @agenticasdk You will soon see frontier labs put similar RLM harnesses behind the API and achieve similar performance gains on tasks like ARC-3
1 reply · 0 reposts · 0 likes · 20 views
Drew Carson @realdrewcarson
@JayFrydman @agenticasdk Is the harness behind the API? If so, great, because that's what they're testing: performance on generic API endpoints. But this seems like a task-specific harness for the ARC-AGI-3 challenge.
1 reply · 0 reposts · 0 likes · 32 views
Agentica @agenticasdk
We scored 36.08% on ARC-AGI-3 in one day using the Agentica SDK.
71 replies · 131 reposts · 1.4K likes · 371.7K views
Mark Kretschmann @mark_k
ARC-AGI-3 is here, and all existing AI models are below 1% on the benchmark. It's gonna take a while until this one is saturated.

How it measures intelligence:
- 100% human-solvable environments
- Skill-acquisition efficiency over time
- Long-horizon planning with sparse feedback
- Experience-driven adaptation across multiple steps

"As long as there is a gap between AI and human learning, we do not have AGI."
[image attached]
31 replies · 17 reposts · 196 likes · 15.9K views
Steven Marlow @sd_marlow
@FakePsyho It bothers me that Chollet won't refer to these as visual challenges. It's also a failure of the benchmark not to measure exploration vs. understanding. They have a default assumption that min # of actions on 1st try = intelligence (it doesn't).
1 reply · 0 reposts · 5 likes · 391 views
Psyho @FakePsyho
Some early non-trivial thoughts about ARC-AGI-3 after reading the paper:

- Due to quadratic scaling of steps (see attached image) and linear weight scaling of levels, benchmark scores will follow a much "steeper" sigmoid than usual (think sigmoid × quadratic function). I'm not hating this design, but it means the score on this benchmark has to be interpreted differently than literally any other popular benchmark. Expect really silly takes from your local AI influencers.

- Each game was tested with 10 players and the human baseline is equal to the second-best performance. This also means that for a game to be included in the benchmark, it was enough for only two human players (out of 10) to finish it. The public games are said to be significantly easier than the private set. The potential problem here is that there's a big difference between a puzzle being hard (tricky observations, complex logical path) and obtuse (multiple potential interpretations, an unclued "random" game mechanic). Success rate under a heavy time constraint doesn't differentiate between those two. If there are obtuse games in the private set, then getting near 100% might be realistically impossible (assuming a very strict limit on actions, etc.).

- The games are similar (especially in terms of aesthetics), and I'd expect human performance to improve drastically with the number of games played. This is especially true for people with no prior exposure to puzzle games. In many games the hardest part is figuring out that a specific part of the screen is clickable, and you get better at identifying such spots with each consecutive game you play. This is true because the ARC-AGI-3 visual language is somewhat distant from your usual puzzle-game visuals. AI is going to be severely disadvantaged until the public games are added to the training data, so expect a big initial jump in benchmark scores (assuming the labs care about that).

Overall, I feel this is a solid release, but I'm a bit worried that there might be some rather obtuse games in the private set that will heavily skew the results.
[image attached]
7 replies · 14 reposts · 166 likes · 22.2K views
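One concrete rule in Psyho's tweet is worth pinning down: each game is tested with 10 players, the human baseline equals the second-best performance, and two finishers are enough for a game to be included. Below is a minimal sketch of that rule, assuming "performance" means the action count of players who finished (lower is better); the function name and numbers are hypothetical illustrations, not taken from the ARC-AGI-3 paper.

```python
# Toy model of the human-baseline rule described above: each game is
# tested with 10 players, and the baseline is the second-best result.
# Assumes "performance" = action count of finishers (lower is better);
# names and numbers are hypothetical, not from the ARC-AGI-3 paper.

def human_baseline(finisher_action_counts: list[int]) -> int:
    """Second-best (second-lowest) action count among finishers."""
    if len(finisher_action_counts) < 2:
        raise ValueError("a game needs at least two finishers to be included")
    return sorted(finisher_action_counts)[1]

# Only two of ten players finished: still enough for inclusion,
# but the baseline is then simply the worse of the two finishers.
print(human_baseline([41, 57]))          # -> 57
print(human_baseline([38, 44, 52, 90]))  # -> 44
```

Note how thin the sample gets in the two-finisher case: the baseline collapses to the worse of two runs, which is part of why Psyho worries that obtuse private-set games could skew results.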
InfiniteHexx @InfiniteHexx
@agenticasdk They used a "general purpose symbolic exoskeleton" to accomplish this, but don't you dare call it a harness! It's totally not a semantic dodge!
1 reply · 0 reposts · 2 likes · 442 views
omg is ed!! @chillMSR
@agenticasdk Cheating on a benchmark does not make your AI model better. Share the results without a harness.
2 replies · 0 reposts · 3 likes · 1.3K views
x1f4r @x1f4r
@agenticasdk The point of ARC-3 is to allow no harnesses in the competition. That makes it actually hard to beat, because the model has to approach it exactly as a human would.
4 replies · 0 reposts · 32 likes · 3.6K views
Jay Frydman @JayFrydman
@chatgpt21 @agenticasdk @spicey_lemonade This is a reflection of the scoring methodology of the challenge. The agent only used 1.68x the action count of the human baseline, which is the score of the second-best human player. If the ARC Foundation provides the human score distribution, we can compare directly to the average human.
0 replies · 0 reposts · 0 likes · 24 views
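To make the 1.68x figure concrete: under a naive action-efficiency reading (baseline actions divided by agent actions), 1.68x the baseline would come out to roughly 59.5%, which does not match Agentica's reported 36.08%; that gap is consistent with Psyho's point above that the real scoring is nonlinear. The sketch below implements only this naive reading, not the official ARC-AGI-3 formula.

```python
# Naive "action efficiency" reading of the 1.68x figure: baseline
# actions divided by agent actions, capped at 100%. This is NOT the
# official ARC-AGI-3 scoring formula (which, per the thread, involves
# quadratic step scaling and level weights); numbers are illustrative.

def naive_action_efficiency(agent_actions: float, baseline_actions: float) -> float:
    """Fraction of human-baseline efficiency, where matching = 100%."""
    return min(1.0, baseline_actions / agent_actions)

baseline_actions = 100.0                  # hypothetical baseline action count
agent_actions = 1.68 * baseline_actions   # ratio reported in the thread

print(f"{naive_action_efficiency(agent_actions, baseline_actions):.1%}")  # -> 59.5%
```

The mismatch between this naive 59.5% and the reported 36.08% is exactly the interpretation trap xlr8harder describes further down the thread.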
davinci @leothecurious
for the purposes of AI benchmarking, this is serious cope. it may be fair as a purely inquisitive question, as in "what can a truly alien intelligence figure out?", but it poorly applies to the purpose of evaluating the true automation potential of AIs built to function in a human society.

AIs share our world, they share our culture, and they share our jobs. they aren't meant to be alien, and we aren't actively trying to make them more alien to humans.

besides, almost every single LLM/VLM benchmark out there is overwhelmingly knowledge-dependent, counting on prior knowledge of human language (often english in particular), human mathematical notation and concepts, human culture and man-made objects, and all sorts of details that have almost no overlap with intelligence per se. this benchmark goes further than almost any i've seen in decoupling the measurement of knowledge from the measurement of intelligence.

this benchmark (imperfectly but commendably) tries to measure not only how well AIs answer questions, but how well they ask the right questions under novelty, to then uncover answers they don't already know when walking into the test.
Simo Ryu @cloneofsimo

This is a pure perception task with heavy cultural priors: you need geometrical intuition, some prior on solving mazes, and a sense of what typical game interaction feels like. Like, if an alien intelligence doesn't have visual perception like ours and doesn't know what Nintendo is, how is it supposed to solve this? (And yes, there are animals without visual perception.) (Oh, and guess what other intelligence doesn't have visual perception and geometrical priors like ours.)
4 replies · 1 repost · 54 likes · 5.6K views
Jay Frydman @JayFrydman
@jmbollenbacher The Agentica SDK agent only used 1.68x the actions of the human baseline (the score of the second-best human player). Same harness as the one we open-sourced three weeks ago.
0 replies · 0 reposts · 0 likes · 15 views
Simo Ryu @cloneofsimo
This is a pure perception task with heavy cultural priors: you need geometrical intuition, some prior on solving mazes, and a sense of what typical game interaction feels like. Like, if an alien intelligence doesn't have visual perception like ours and doesn't know what Nintendo is, how is it supposed to solve this? (And yes, there are animals without visual perception.) (Oh, and guess what other intelligence doesn't have visual perception and geometrical priors like ours.)
[image attached]
François Chollet @fchollet

ARC-AGI-3 is out now! We've designed the benchmark to evaluate agentic intelligence via interactive reasoning environments. Beating ARC-AGI-3 will be achieved when an AI system matches or exceeds human-level action efficiency on all environments, upon seeing them for the first time. We've done extensive human testing that shows 100% of these environments are solvable by humans, upon first contact, with no prior training and no instructions. Meanwhile, all frontier AI reasoning models do under 1% at this time.

17 replies · 1 repost · 84 likes · 24.7K views
xlr8harder @xlr8harder
The biggest ARC-AGI-3 issue (and an easy change @arcprize could still make) is presenting scores as a percentage of the human score, while the scoring method ensures that it does not mean what a percentage implies to most readers. I'm already seeing people confused by this.
4 replies · 0 reposts · 75 likes · 2.8K views