

Guanghan Ning
48 posts

@quietnning
Research @fleet_ai, formerly @ ByteDance-Seed (LLM) https://t.co/FL2xvayzCt · views my own

We just posted scores for GPT-5.5 and Opus 4.7 on ARC-AGI-3. Neither model made material progress, but the more interesting story is *why* they didn't make progress.

We reviewed every session they played to find common failure modes, and studied what this tells us about real-world tasks. Three surfaced:
- True Local Effect, False World Model: the models understand which action produced a change, but they fail to translate that effect into a global rule.
- Wrong Level of Abstraction From Training Data: the models mistake an ARC-AGI-3 environment for another game.
- Solved The Level, Didn’t Learn The Game: even when a model beats a level, it is unable to use that reward signal to reinforce the correct actions.

This analysis matters because aggregate scores (as with most other benchmarks) mask a model's thought process.

Wow, this tweet went very viral! I wanted to share a possibly slightly improved version of the tweet in an "idea file". The point of an idea file is that in this era of LLM agents, there is less need to share the specific code/app: you just share the idea, and the other person's agent customizes and builds it for your specific needs. So here's the idea in gist format: gist.github.com/karpathy/442a6… You can give this to your agent, and it can build you your own LLM wiki and guide you on how to use it, etc. It's intentionally kept a little abstract/vague because there are so many directions to take this in. And ofc, people can adjust the idea or contribute their own in the Discussion, which is cool.


Francois Chollet + Sam Altman Fireside

@fchollet and @sama fireside during the ARC-AGI-3 Launch Party, moderated by @deedydas. They discuss:
- Social contracts evolving
- AGI views as a parent
- When will labs score >85% on ARC-AGI-3?

People struggle to differentiate fluid intelligence from knowledge because, given enough preparation, memorized templates become a solid substitute for on-the-fly adaptation.