clay

30 posts

@deforestpeg

building agent eval rigs. solhunt (defi exploit) → solhunt-duel (red/blue verifier gates) → eivra (LLM forecasting benchmark, live). https://t.co/RnQm475Q5E

Joined May 2022

1.9K Following · 880 Followers

Pinned Tweet

clay@deforestpeg·31m

Codex is now playing Pokémon Red. Started pathing back to Brock with full decision logs. Watching agents fail in public is still the best benchmark.

clay@deforestpeg·21s

@cryptochad215 haha thanks!

Heisenberg 🇺🇸🦅@cryptochad215·59s

@deforestpeg Your updates make my day. Keep going!

clay@deforestpeg·22m

@wpursell_dev I really hope it does. Lots of fine tuning but i’ll be posting all the hiccups along the way. thanks for the feedback

Will Pursell@wpursell_dev·25m

@deforestpeg This is the coolest thing I’ve seen on the internet today 😂 Wonder how it will do post game? Think it’ll kill or catch the legendaries?

clay@deforestpeg·31m

codexplays.games/pokemon-red

clay@deforestpeg·3h

@jasperyield not even following me :(

jasper@jasperyield·9h

@deforestpeg Mind sharing the prompt for the frontend canvas aspect? How did you build the map?

clay@deforestpeg·22h

agent's walked itself to pewter city. brock's gym is next. first badge attempt incoming.

clay@deforestpeg

Claude is playing pokémon red live, on its own. you can see every move it makes + the reason for each one. Link below.

clay@deforestpeg·10h

@OpenRouter congratulations!

OpenRouter@OpenRouter·10h

Today we’re announcing our $113M Series B led by @CapitalGVC. Over the last 6 months, weekly volume on OpenRouter grew from 5T to 25T tokens as AI rapidly shifts from experimentation into production. We’re excited for what comes next.

clay@deforestpeg·13h

@cryptochad215 thanks appreciate that

Heisenberg 🇺🇸🦅@cryptochad215·20h

@deforestpeg This is objectively cool

clay@deforestpeg·13h

@zhomag really great idea tbh. my Claude usage limits are currently the only blocker.

andy@zhomag·20h

@deforestpeg just saying, if your harness can find actual exploits in the real world, you should let it run wild and instead of benchmarks show cves and praises from contract maintainers. the project will get popular very quickly

clay@deforestpeg·18 May

I built an autonomous AI agent that finds and exploits smart contract vulnerabilities. It reads Solidity source, writes Foundry exploit tests, runs them on a forked chain, and iterates on compiler errors until the exploit passes. No human in the loop. 67.7% exploit rate on 31 real DeFi hacks (Claude Sonnet 4). Anthropic’s SCONE-bench: 51.1% on the same task. Beanstalk ($182M): 1m 44s, $0.65. Poly Network ($611M): $0.72. github.com/claygeo/solhunt The LLM is the engine. The harness is the product.
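The write, run, iterate loop described above can be sketched roughly as follows. This is only an illustration, not solhunt's actual code: `write_test` and `run_test` are hypothetical callables standing in for the LLM call and for a Foundry run on a forked chain, and `max_rounds` is an assumed budget.

```python
from typing import Callable, Optional, Tuple


def iterate_until_pass(
    write_test: Callable[[str, str], str],
    run_test: Callable[[str], Tuple[bool, str]],
    source: str,
    max_rounds: int = 5,
) -> Optional[str]:
    """Return the first exploit test that passes, or None if the budget runs out."""
    feedback = ""  # compiler or test output from the previous round
    for _ in range(max_rounds):
        # Hypothetical LLM call: contract source plus last errors in, test code out.
        test_code = write_test(source, feedback)
        # Hypothetical runner: returns (passed, captured output).
        passed, output = run_test(test_code)
        if passed:
            return test_code
        feedback = output  # feed the errors back into the next attempt
    return None
```

Both callables are injected so the loop can be exercised with stubs before wiring in a real model or test runner.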

clay@deforestpeg·1d

claudeplays.games/pokemon-red

clay@deforestpeg·1d

Claude is playing pokémon red live, on its own. you can see every move it makes + the reason for each one. Link below.

clay@deforestpeg·2d

@GregCook2011 the legal team agrees

Greg Cook@GregCook2011·2d

@deforestpeg Fun for any bureaucratic process

clay@deforestpeg·5d

new requirement: beat DOOM to register a new account

clay@deforestpeg·4d

solhunt scored 67.7% on a curated 32-contract benchmark. Then 13% on a random 95-contract sample. I published both numbers. That 54-point distribution-shift gap is the post.

What it changed in how I build:
- No more cherry-picked eval sets
- Randomized sampling before claiming competence
- Repro speed + dollar cost tracked as first-class metrics
- Publish the deltas, especially when they're ugly

What it looks like in practice:
- solhunt-duel: autonomous red/blue agent system, server-side Forge-verified gates that agents cannot see or modify. Reproduced the Dexible exploit ($2M) in 17.6 min over 3 rounds.
- solhunt (predecessor): reproduced Beanstalk's $182M flash-loan exploit in 1m 44s for $0.65 in compute.
- eivra: LLM forecasting benchmark, live.

By day I build and maintain a multi-state pricing data platform (100K+ products on 6hr cycles, used daily). Same rigor, different domain.

Looking for AI engineering teams. DMs open if you're building agent evals or smart contract security infra.
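A minimal sketch of "randomized sampling before claiming competence", plus the gap arithmetic. The contract pool, its size, and the seed are made up for illustration. Only the 95-contract sample size and the two rates (67.7% and 13%) come from the post.

```python
import random


def sample_eval_set(population, k, seed):
    """Draw k items uniformly at random, reproducibly, without replacement."""
    rng = random.Random(seed)  # fixed seed so the sample can be re-derived later
    return rng.sample(list(population), k)


def gap_points(curated_rate, random_rate):
    """Difference between two success rates, in percentage points."""
    return round((curated_rate - random_rate) * 100, 1)


# Hypothetical pool of candidate contracts (names and size are assumptions).
contracts = [f"contract_{i}" for i in range(500)]
subset = sample_eval_set(contracts, 95, seed=0)

print(len(subset))
print(gap_points(0.677, 0.13))
```

With the post's numbers the gap comes out at about 54.7 points, in line with the "54-point" figure quoted.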

clay@deforestpeg·5d

@NumanThabit had to do it

Numan@NumanThabit·5d

@deforestpeg lmfao

clay@deforestpeg·5d

eivra.xyz

clay@deforestpeg·5d

Built eivra: 6 AI agents lock probability forecasts on open Polymarket + Manifold markets every 12h. Scored when each market resolves, with no look-ahead. 125 live forecasts in flight. Hawk leads at 0.015 Brier across 89 resolutions.
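For readers unfamiliar with the metric: for binary markets the Brier score is the mean squared difference between the forecast probability and the 0/1 outcome, so lower is better and always answering 50% scores 0.25. A minimal sketch, with made-up forecasts rather than eivra data, and not necessarily how eivra computes it:

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between probabilities in [0, 1] and 0/1 outcomes."""
    if not forecasts or len(forecasts) != len(outcomes):
        raise ValueError("need equal-length, non-empty forecasts and outcomes")
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)


# Illustrative values only: three locked forecasts and how the markets resolved.
example = brier_score([0.9, 0.2, 0.7], [1, 0, 1])
baseline = brier_score([0.5, 0.5], [1, 0])  # the "always 50%" reference point
```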

clay@deforestpeg·6d

everyone asks if solhunt's red team is smart enough. red writes working exploits in 41 seconds. blue spends 9 minutes failing to patch. blue ran out of budget on 5/10 contracts last week. exploit writing is easy. patching without regressions is the actual problem.

clay@deforestpeg·19 May

@JE4NVRG solhunt’s verifier rejects any “exploit” that uses vm.prank or writes storage directly. only real attacker address value transfer counts. evidence quality means the harness has to gatekeep the agent’s own claims.
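One naive way to approximate the first half of that gate is a text scan of the submitted Foundry test for cheatcodes. This is a sketch under assumptions, not solhunt's verifier: the post names only `vm.prank` and direct storage writes, so including `vm.startPrank` and `vm.store` here is my assumption, and the real check on attacker-address value transfer is not modeled.

```python
import re
from typing import List

# Assumed deny-list: vm.prank comes from the post; startPrank and store are
# Foundry cheatcodes added here as guesses at what a gate like this would cover.
_FORBIDDEN = re.compile(r"\bvm\s*\.\s*(prank|startPrank|store)\s*\(")


def find_forbidden_cheatcodes(test_source: str) -> List[str]:
    """Return the sorted names of forbidden cheatcodes called in the test source."""
    return sorted({m.group(1) for m in _FORBIDDEN.finditer(test_source)})


def passes_cheatcode_gate(test_source: str) -> bool:
    """True only if no forbidden cheatcode call appears in the source text."""
    return not find_forbidden_cheatcodes(test_source)
```

A plain text scan like this also flags calls inside comments and misses aliased calls, so it is at best a cheap pre-filter.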

Jean Vargas@JE4NVRG·19 May

@deforestpeg Exactly. The harness is where most “AI auditor” demos die. If the model can’t turn suspicion into a Foundry test that moves attacker-controlled value without prank/storage cheats, it’s not a finding yet — just a hypothesis.
