Guanghan Ning

48 posts

@quietnning

Research @fleet_ai, formerly @ ByteDance-Seed (LLM) https://t.co/FL2xvayzCt · views my own

San Francisco, CA · Joined July 2015
212 Following · 200 Followers
Pinned Tweet
Guanghan Ning
Guanghan Ning@quietnning·
Personal news: After ByteDance Seed and a stint as an independent researcher, I'm joining @fleet_ai as a Member of Technical Staff on the research team 🚀 Building Witness-inspired puzzle environments for ARC-AGI-3 convinced me that RL environments are one of the most under-explored bottlenecks on the path to truly capable agents. The team, the technical vision, and the open problems are a perfect fit. It’s exactly where I want to spend my next chapter. If you're interested in building and scaling dynamic, complex environments that push the boundaries of agentic reasoning, would love to connect.
Guanghan Ning tweet media
10
11
137
11.9K
Guanghan Ning
Guanghan Ning@quietnning·
My agent has components (memory, rule discovery, drawing from recent papers + OpenClaw) designed to be domain-general, but the perception/decision layer is still Witness-specific. That's the more interesting finding: even scaffolding designed to be domain-general inherits domain-specific perception once shipped. "Wrong level of abstraction" isn't just a model failure mode; it propagates through every layer humans add on top. Caveats: single seed, 5k actions/game budget.
0
0
2
86
Guanghan Ning
Guanghan Ning@quietnning·
Great point! This decomposition is exactly what makes Witness-style envs interesting. arc-witness-envs tries to separate the two by varying rule complexity while holding spatial priors constant; if a model solves simple-rule puzzles but fails on complex-rule ones in the same layout, that points to rule formation rather than exploration. But yeah, it doesn't fully isolate them. Adding a state-space coverage metric (unique states explored / branching factor) alongside success rate seems like the obvious next step (rough sketch below). Currently this exists in my agent harness but not in the envs. Seen any good ablation designs for this? Would love to learn from prior work.
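Rough sketch of the coverage metric I have in mind (illustrative only, not something arc-witness-envs computes today; branching_factor and max_depth are crude stand-ins for the true reachable-state count):

```python
# Sketch of a state-space coverage metric to report alongside success rate.
# Assumes each visited state can be hashed (e.g., a tuple of the grid contents);
# branching_factor and max_depth only approximate the true reachable set size.

def coverage(visited_states, branching_factor: int, max_depth: int) -> float:
    """Fraction of the (estimated) reachable state space the agent actually touched."""
    unique_states = len(set(visited_states))
    estimated_reachable = sum(branching_factor ** d for d in range(max_depth + 1))
    return min(1.0, unique_states / estimated_reachable)

# Example: an agent that saw 140 distinct board states in a puzzle with roughly
# 4 legal moves per state and solutions within ~5 steps.
print(coverage(range(140), branching_factor=4, max_depth=5))  # ~0.10
```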
0
0
0
23
Bnaf.OG | 🟧
Bnaf.OG | 🟧@bnafOg·
Two failure modes worth separating: LLMs fail at *exploration* before rule induction. In Witness-style grids they guess paths from visual priors, not state-space mapping — the env never teaches the rule. How do you tell exploration failure from rule formation failure in evals?
1
0
0
182
Guanghan Ning
Guanghan Ning@quietnning·
Been doing something very similar since 2025. Claude Code + Obsidian as my go-to setup for building structured knowledge on any topic. Started with an AI knowledge base, then specific research projects, comprehensive tutorials with code/problems/tests, personal growth, etc. Quick notes still go to Notion, but anything I want to actually understand now lives in this LLM-maintained wiki workflow.

What's cool from Andrej's post:
- The "compile, don't retrieve" mindset. External sources are ground truth, the LLM compiles them into a wiki. Way more traceable, way less hallucination risk.
- Filing query results back into the wiki. Best explorations become new wiki pages, and the whole thing compounds.
- Lint passes. Never occurred to me to systematically health-check my KB for contradictions, orphan pages, stale claims. Basically code review for the knowledge base.
- Also the small things: PDF → Markdown pipeline (marker), a flat index.md for LLM navigation (way faster than my layered MOC maps).
Andrej Karpathy@karpathy

Wow, this tweet went very viral! I wanted to share a possibly slightly improved version of the tweet in an "idea file". The idea of the idea file is that in this era of LLM agents, there is less of a point/need of sharing the specific code/app, you just share the idea, then the other person's agent customizes & builds it for your specific needs. So here's the idea in a gist format: gist.github.com/karpathy/442a6… You can give this to your agent and it can build you your own LLM wiki and guide you on how to use it etc. It's intentionally kept a little bit abstract/vague because there are so many directions to take this in. And ofc, people can adjust the idea or contribute their own in the Discussion which is cool.
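To make the lint-pass idea concrete, here's a toy health check I'd run over an Obsidian-style vault. A minimal sketch assuming [[wikilink]]-style links in a local folder of .md files; not from Andrej's gist, and the vault path is made up:

```python
# Toy "lint pass" over a knowledge base: flag orphan pages (no inbound links)
# in a folder of Markdown notes that link to each other with [[wikilinks]].
import re
from pathlib import Path

def find_orphans(vault_dir: str) -> list[str]:
    pages = {p.stem: p.read_text(encoding="utf-8") for p in Path(vault_dir).glob("**/*.md")}
    linked = set()
    for text in pages.values():
        # [[Target]] or [[Target|alias]]; keep only the link target.
        linked.update(m.group(1).split("|")[0].strip()
                      for m in re.finditer(r"\[\[([^\]]+)\]\]", text))
    # A page is an orphan if no other page links to it.
    return sorted(name for name in pages if name not in linked)

if __name__ == "__main__":
    print(find_orphans("my-vault"))  # hypothetical vault path
```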

1
0
1
338
Guanghan Ning
Guanghan Ning@quietnning·
Was in the room for this. The moment that stuck with me was how differently they approach the same problem. Sam leans on scaling test-time compute for economic acceleration and new scientific discoveries, while François leans on better abstractions and efficiency. Different bets on what matters most, but probably both needed.
0
0
1
25
Greg Kamradt
Greg Kamradt@GregKamradt·
To close out the ARC-AGI-3 launch we had @fchollet and @sama together for a fireside chat moderated by @deedydas. 29 min of AGI progress discussion. Deedy opens with a question about viewing AGI through the lens of a parent.
ARC Prize@arcprize

Francois Chollet + Sam Altman Fireside
@fchollet and @sama fireside during ARC-AGI-3 Launch Party moderated by @deedydas
They discuss:
- Social contracts evolving
- AGI views as a parent
- When will labs score >85% on ARC-AGI-3?

4
0
32
3K
Chao Huang
Chao Huang@huang_chao4969·
Introducing OpenHarness, an ultra-lightweight, pure Python alternative to Claude Code that delivers approximately 80% of essential agent functionality using just 3% of the code lines. With a single command, you can launch OpenHarness and unlock seamless integration with popular CLI agents including OpenClaw, nanobot, Cursor, and more.

OpenHarness is an open-source Python implementation designed for researchers, builders, and the community:
- 🔍 Understand how production AI agents work under the hood
- 🧪 Experiment with cutting-edge tools, skills, and agent coordination patterns
- 🔧 Extend the harness with custom plugins, providers, and domain knowledge
- 🏗️ Build specialized agents on top of proven architecture

Try OpenHarness: github.com/HKUDS/OpenHarn…
Chao Huang tweet media
24
110
500
195.9K
Guanghan Ning
Guanghan Ning@quietnning·
But as Greg says, there will be ARC-AGI-4, ARC-AGI-5, and so on. The ultimate goal is AGI: by his definition, a general AI that can do everything at least as efficiently as a human. Until we get there, continued effort will go into surfacing and closing the remaining gaps.
Guanghan Ning tweet media
0
0
0
125
Guanghan Ning
Guanghan Ning@quietnning·
Just back from the ARC-AGI-3 launch party at YC. Humans: 100%. Frontier AI: 0.37%.

The most interesting tension from the Chollet × Altman fireside:

Sam's view: long-term memory and continual learning are the missing pieces, everything else is pretty close to AGI. Test-time scaling is cool as long as it brings economic acceleration and new scientific discovery.

Francois's view: program synthesis as world model. Focus on closing the gap between what humans can do and AI can't (efficiently do), without brute-force scaling.

Two very different bets on how we get there. Or just a matter of priority.

One thing they did agree on: this new benchmark may be saturated in 1-3 years. Closer to 1 if frontier labs invest heavily.
Guanghan Ning tweet media
Guanghan Ning tweet media
1
0
1
231
Guanghan Ning
Guanghan Ning@quietnning·
The failure modes you listed are spot on. Anchoring is the one I keep hitting: the agent commits to a rule too early, then interprets all new evidence as confirming it instead of revising. My current workaround (sketched below): high-confidence rules that hit contradicting evidence get their boundaries widened rather than deleted. Helps the agent update beliefs without losing what it already learned. Excited to chat with the community at the launch party tonight.
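Roughly what that workaround looks like, as a minimal sketch (the Rule shape and the confidence deltas are illustrative, not my actual harness):

```python
# Sketch of "widen instead of delete": when a high-confidence rule meets a
# counterexample, relax its scope and discount confidence rather than dropping it.
from dataclasses import dataclass, field

@dataclass
class Rule:
    description: str
    confidence: float
    scope: set = field(default_factory=set)        # contexts where the rule has held
    exceptions: set = field(default_factory=set)   # contexts where it was contradicted

def update_rule(rule: Rule, context: str, held: bool) -> Rule:
    if held:
        rule.scope.add(context)
        rule.confidence = min(1.0, rule.confidence + 0.05)
    else:
        # Contradiction: widen the boundary (record an exception) instead of deleting,
        # so previously verified knowledge survives the belief update.
        rule.exceptions.add(context)
        rule.confidence = max(0.1, rule.confidence - 0.2)
    return rule
```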
0
1
8
1.5K
Greg Kamradt
Greg Kamradt@GregKamradt·
Today we're launching ARC-AGI-3.

135 Novel Environments (nearly 1K levels) we built by hand. It is the only unsaturated agent benchmark in the world. Each game is 100% human solvable, AI scores <1%. This gap between human and AI performance proves we do not have AGI. Agents today need human handholding. Agents that beat V3 will prove they don't need that level of supervision.

Agents that beat V3 will demonstrate:
* Continual learning - Each level builds on top of each other. You can't beat level 3 without carrying forward what you learned in levels 1 and 2.
* World modeling - Many of the environments require planning many actions ahead. AI will have no choice but to build an internal world model for how the environment works, run simulations "in its head" and proceed with an action.

In our early testing, we've seen a few clear failure modes of AI:
* Anticipation of future events - If an environment requires that AI set up a scene, and then carry out a scenario (like in sp80), it starts to break down.
* Anchoring on early hypothesis - Early in a game it comes up with a hypothesis (even if wrong) and refuses to update its beliefs later.
* Thinking it's playing another game - AI thinks it's playing chess, pacman. The training data holds hard!

One major problem is there is too much data to carry forward in a single context. Models must learn what to remember and what to forget.

The agent that beats ARC-AGI-3 will have demonstrated the most authoritative evidence of progress towards general intelligence to date. We're excited to get this out and excited to see what you think.
74
87
820
175.5K
Guanghan Ning
Guanghan Ning@quietnning·
Happy to see ARC-AGI-3 finally launched! Been building warm-up training & validation environments for this: 13 puzzle games, 1,872 levels, designed so agents have to discover rules through interaction alone. No instructions, no shortcuts. Most benchmarks let you pattern-match your way through. This one doesn't. github.com/Guanghan/arc-w…
0
1
9
2.2K
ARC Prize
ARC Prize@arcprize·
Announcing ARC-AGI-3
The only unsaturated agentic intelligence benchmark in the world
Humans score 100%, AI <1%
This human-AI gap demonstrates we do not yet have AGI
Most benchmarks test what models already know; ARC-AGI-3 tests how they learn
GIF
247
588
4.3K
728.7K
Guanghan Ning
Guanghan Ning@quietnning·
Been thinking about what fluid intelligence actually breaks down to. It's not one single ability. It's more like a "factory configuration" of human beings, a coordinated system of sub-capabilities that evolution spent millions of years tuning:
- entity perception (objects, space, colors, quantities)
- concept abstraction (representing entities at different levels; parts and wholes can swap depending on needs)
- analogical reasoning (spotting similarities between abstracted entities depending on needs)
- pattern discovery (inferring rules, identifying goals)
- memory storage and update (reinforcing, revising, or erasing knowledge as new evidence comes in)

What makes it "fluid" is being able to deploy all of this on novel problems. For AI agents this changes how I think about the problem. It's less about training one monolithic "reasoning" capability and more about making sure the right primitives are installed, then getting them to coordinate under novelty. High Gf = strong individual primitives (fast & accurate) + broad primitive coverage + strong coordination.
François Chollet@fchollet

People struggle to differentiate fluid intelligence from knowledge because, given enough preparation, memorized templates become a solid substitute for on-the-fly adaptation

0
0
1
150
Guanghan Ning
Guanghan Ning@quietnning·
The way I see it, there's room for both:
- premium enterprise-grade RL environments (deep, curated, B2B)
- open-source RL task hubs with community contributions (broad, standardized)

They're like AWS and Kubernetes. Together they make the ecosystem healthier.
0
0
2
93
Guanghan Ning
Guanghan Ning@quietnning·
@OpenReward is trying to build an open standard for RL environments. This is great for open-source developers. It bridges existing ecosystems: harbor2or (converts Fleet AI tasks) + verifiers2or (converts Prime Intellect tasks). Training framework support includes Slime and even Tinker, which makes life a lot easier for individual developers too. It could become the HuggingFace of RL environments. Let's see how it plays out.
1
0
3
117
Guanghan Ning
Guanghan Ning@quietnning·
@GenReasoning This could become the HuggingFace of RL environments. The bridge converters (harbor2or, verifiers2or) are smart. Solving interface fragmentation is probably the biggest unlock for independent RL developers right now.
0
0
2
326
General Reasoning
General Reasoning@GenReasoning·
Introducing OpenReward.
🌍 330+ RL environments through one API
⚡ Autoscaled sandbox compute
🍒 4.5M+ unique RL tasks
🚂 Works like magic with Tinker, Miles, Slime
Link and thread below.
General Reasoning tweet media
25
191
1.3K
241.7K
Guanghan Ning
Guanghan Ning@quietnning·
The 10 hours / $100 per environment number is wild. Huge improvement over manual curation. One thing I'm wondering: how do you handle reward signals for RL training on these? Browser tasks usually only give you a pass/fail at the end. Any step-level verification built into the generated environments, or is that still manual? Been dealing with the same problem in a different setting (puzzle environments for ARC-AGI-3) and sparse rewards are by far the hardest part.
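For context, the shape of the fix I keep reaching for on my side, sketched minimally: combine the terminal pass/fail with optional step-level checks. The checker callables are hypothetical, not something WebArena-Infinity exposes as far as I know:

```python
# Sketch: densify a sparse terminal reward with step-level verifier checks.
# Each checker inspects the current state and pays a small bonus the first time it passes.
from typing import Callable

def shaped_reward(state: dict,
                  done: bool,
                  task_passed: bool,
                  checkers: list[Callable[[dict], bool]],
                  satisfied: set[int],
                  step_bonus: float = 0.1) -> float:
    reward = 0.0
    for i, check in enumerate(checkers):
        if i not in satisfied and check(state):
            satisfied.add(i)          # pay each milestone only once
            reward += step_bonus
    if done:
        reward += 1.0 if task_passed else 0.0   # the original sparse signal stays at the end
    return reward
```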
0
0
0
217
Shuyan Zhou
Shuyan Zhou@shuyanzh36·
In 2023, WebArena took 7 grad students more than 6 months to build just 5 environments with 812 variable browser-use tasks. Now, it takes under 10 hours and less than $100 per environment, with easy support for parallel generation. Excited to introduce WebArena-Infinity: a scalable approach for automatically generating high-authenticity, high-complexity browser environments with verifiable tasks suitable for RL training and benchmarking. Even strong open-source models that already achieve 60%+ success rates on WebArena and OSWorld complete fewer than 50% of tasks here. Project page: webarena.dev/webarena-infin… Repo: github.com/web-arena-x/we… 🧵 (1/n)
GIF
12
48
327
42.9K
Guanghan Ning
Guanghan Ning@quietnning·
Another thing from that thread: Semantic ASCII encoding as an intermediate representation for grid-based reasoning. Instead of feeding raw 64x64 pixel grids to LLMs, map them to compact ASCII boards (e.g., @=agent, #=wall, .=path). Noticeably improved the quality of LLM-based rule discovery in my experiments; the model reasons much better over symbolic spatial layouts than raw color indices. Not solved. Main failure mode is when role inference gets the initial mapping wrong (e.g., labeling an interactive element as wall). But way more interpretable and debuggable than staring at pixel arrays. Anyone else working on grid-based RL thinking about similar intermediate abstractions? Full discussion (some really good back-and-forth): reddit.com/r/reinforcemen…
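A minimal sketch of the encoding (the role mapping is hard-coded here for illustration; in my agent it comes from the fallible role-inference step mentioned above):

```python
# Map a grid of integer color indices to a compact ASCII board for the LLM.
ROLES = {0: ".", 1: "#", 2: "@"}   # 0=path, 1=wall, 2=agent -- illustrative mapping only

def grid_to_ascii(grid: list[list[int]]) -> str:
    return "\n".join("".join(ROLES.get(cell, "?") for cell in row) for row in grid)

print(grid_to_ascii([
    [1, 1, 1, 1],
    [1, 2, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
]))
# ####
# #@.#
# #..#
# ####
```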
0
1
1
99
Guanghan Ning
Guanghan Ning@quietnning·
Posted arc-witness-envs (13 puzzle games, 1,872 levels for ARC-AGI-3) on r/reinforcementlearning. The discussion taught me more than I expected. One idea that stuck: trajectory stitching via Decision Transformers. An agent might discover half the solution in one attempt and the other half in another. A DT can stitch those partial trajectories into a complete solve. What I didn't realize until this exchange: my agent's retry mechanism (carry verified rules across attempts, reset exploration state) is already doing stitching, just at the knowledge level instead of the trajectory level. Same principle, different abstraction. Probably need both eventually. Excited to explore this direction.
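Knowledge-level stitching, roughly, as a minimal sketch of the retry loop (the agent/env method names are placeholders, not the actual harness API):

```python
# Sketch of knowledge-level "stitching": verified rules persist across attempts,
# exploration state resets, so partial discoveries from different runs accumulate.
def solve_with_retries(env, agent, max_attempts: int = 5, budget: int = 5000):
    verified_rules = set()                      # carried across attempts
    for attempt in range(max_attempts):
        agent.reset(rules=verified_rules)       # fresh exploration state, old knowledge
        trajectory = agent.run(env, max_actions=budget)
        verified_rules |= agent.verified_rules(trajectory)   # keep what was confirmed
        if trajectory.solved:
            return trajectory, verified_rules
    return None, verified_rules
```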
1
0
1
82