clay

30 posts

@deforestpeg

building agent eval rigs. solhunt (defi exploit) → solhunt-duel (red/blue verifier gates) → eivra (LLM forecasting benchmark, live). https://t.co/RnQm475Q5E

Joined May 2022
1.9K Following, 880 Followers
Pinned Tweet
clay
clay@deforestpeg·
Codex is now playing Pokémon Red. Started pathing back to Brock with full decision logs. Watching agents fail in public is still the best benchmark.
clay tweet media
3
0
5
73
clay
clay@deforestpeg·
@wpursell_dev I really hope it does. Lots of fine tuning but i’ll be posting all the hiccups along the way. thanks for the feedback
0
0
2
7
Will Pursell
Will Pursell@wpursell_dev·
@deforestpeg This is the coolest thing I’ve seen on the internet today 😂 Wonder how it will do post game? Think it’ll kill or catch the legendaries?
1
0
2
6
jasper
jasper@jasperyield·
@deforestpeg Mind sharing the prompt for the frontend canvas aspect? How did you build the map?
1
0
0
13
OpenRouter
OpenRouter@OpenRouter·
Today we’re announcing our $113M Series B led by @CapitalGVC. Over the last 6 months, weekly volume on OpenRouter grew from 5T to 25T tokens as AI rapidly shifts from experimentation into production. We’re excited for what comes next.
OpenRouter tweet media
111
101
2.1K
200.6K
clay
clay@deforestpeg·
@zhomag really great idea tbh. my Claude usage limits are currently the only blocker.
0
0
1
40
andy
andy@zhomag·
@deforestpeg just saying, if your harness can find actual exploits in the real world, you should let it run wild and instead of benchmarks show cves and praises from contract maintainers. the project will get popular very quickly
1
0
2
58
clay
clay@deforestpeg·
I built an autonomous AI agent that finds and exploits smart contract vulnerabilities. It reads Solidity source, writes Foundry exploit tests, runs them on a forked chain, and iterates on compiler errors until the exploit passes. No human in the loop. 67.7% exploit rate on 31 real DeFi hacks (Claude Sonnet 4). Anthropic’s SCONE-bench: 51.1% on the same task. Beanstalk ($182M): 1m 44s, $0.65. Poly Network ($611M): $0.72. github.com/claygeo/solhunt The LLM is the engine. The harness is the product.
clay tweet media
5
1
33
3.1K
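The write, run, iterate loop described in the post above can be sketched roughly as follows. This is a minimal illustration, not solhunt's actual code: `exploit_loop`, `RunResult`, `write_test`, and `run_test` are hypothetical names, and the model call and the test runner are passed in as plain callables so the control flow stays visible.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RunResult:
    passed: bool
    errors: str  # compiler or test output to feed back to the model


def exploit_loop(
    source: str,
    write_test: Callable[[str, str], str],   # hypothetical: (source, feedback) -> test code
    run_test: Callable[[str], RunResult],    # hypothetical: runs the test on a forked chain
    max_iters: int = 5,
) -> Optional[str]:
    """Return the first passing exploit test, or None if the budget runs out."""
    feedback = ""
    for _ in range(max_iters):
        test_code = write_test(source, feedback)
        result = run_test(test_code)
        if result.passed:
            return test_code
        # Feed the errors into the next attempt instead of starting over.
        feedback = result.errors
    return None
```

In a real harness, `run_test` would presumably shell out to something like `forge test --fork-url <rpc>` and capture its output; that wiring is left out here.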
clay
clay@deforestpeg·
Claude is playing pokémon red live, on its own. you can see every move it makes + the reason for each one. Link below.
7
0
13
4.7K
clay
clay@deforestpeg·
new requirement: beat DOOM to register a new account
2
1
13
793
clay
clay@deforestpeg·
solhunt scored 67.7% on a curated 32-contract benchmark. Then 13% on a random 95-contract sample. I published both numbers. That 54-point distribution-shift gap is the post.

What it changed in how I build:
- No more cherry-picked eval sets
- Randomized sampling before claiming competence
- Repro speed + dollar cost tracked as first-class metrics
- Publish the deltas, especially when they're ugly

What it looks like in practice:
solhunt-duel: autonomous red/blue agent system with server-side, Forge-verified gates that agents cannot see or modify. Reproduced the Dexible exploit ($2M) in 17.6 min over 3 rounds.
solhunt (predecessor): reproduced Beanstalk's $182M flash-loan exploit in 1m 44s for $0.65 in compute.
eivra: LLM forecasting benchmark, live.

By day I build and maintain a multi-state pricing data platform (100K+ products on 6hr cycles, used daily). Same rigor, different domain.

Looking for AI engineering teams. DMs open if you're building agent evals or smart contract security infra.
clay tweet media
0
0
6
723
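A small sketch of the "randomized sampling" and "publish the delta" points from the post above, assuming a plain list of contract identifiers. The helper names `sample_contracts` and `gap_points` are hypothetical; only the 67.7% and 13% figures come from the post.

```python
import random


def sample_contracts(population, k, seed):
    """Draw k items without replacement; a fixed seed keeps the sample reproducible."""
    rng = random.Random(seed)
    return rng.sample(population, k)


def gap_points(curated_rate, random_rate):
    """Gap between two success rates, in percentage points."""
    return round((curated_rate - random_rate) * 100, 1)


print(gap_points(0.677, 0.13))  # 54.7
```

Publishing the seed alongside the score lets anyone redraw the same 95-contract sample and check the number.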
clay
clay@deforestpeg·
Built eivra: 6 AI agents lock probability forecasts on open Polymarket + Manifold markets every 12h. Scored when each market resolves, with no look-ahead. 125 live forecasts in flight. Hawk leads at 0.015 Brier across 89 resolutions.
1
0
8
400
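For readers unfamiliar with the metric in the post above: the Brier score for binary markets is the mean squared error between the forecast probability and the 0/1 outcome, so lower is better and 0 is perfect. A minimal sketch, assuming forecasts arrive as `(probability, outcome)` pairs; the function name and the sample data are illustrative, not eivra's.

```python
def brier_score(forecasts):
    """Mean of (p - outcome)^2 over resolved binary markets; outcome is 0 or 1."""
    pairs = list(forecasts)
    if not pairs:
        raise ValueError("need at least one resolved forecast")
    return sum((p - o) ** 2 for p, o in pairs) / len(pairs)


# Three made-up resolved forecasts: two confident and right, one less confident.
resolved = [(0.9, 1), (0.1, 0), (0.8, 1)]
print(brier_score(resolved))  # about 0.02
```

Always answering 0.5 scores 0.25 on this scale.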
clay
clay@deforestpeg·
everyone asks if solhunt's red team is smart enough. red writes working exploits in 41 seconds. blue spends 9 minutes failing to patch. blue ran out of budget on 5/10 contracts last week. exploit writing is easy. patching without regressions is the actual problem.
0
0
6
315
clay
clay@deforestpeg·
@JE4NVRG solhunt’s verifier rejects any “exploit” that uses vm.prank or writes storage directly. only real attacker address value transfer counts. evidence quality means the harness has to gatekeep the agent’s own claims.
0
0
2
47
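One rough way to picture the gate described in the reply above is a static check over the generated Foundry test plus a balance check. This is only a sketch under assumptions: solhunt's real verifier is not shown in the thread, the names `forbidden_cheatcodes` and `accept` are hypothetical, and a text scan like this would also flag cheatcodes that appear in comments. `vm.prank`, `vm.startPrank`, and `vm.store` are real Foundry cheatcodes.

```python
import re

# Cheatcodes that impersonate an address or write storage directly.
_FORBIDDEN = re.compile(r"\bvm\.(prank|startPrank|store)\s*\(")


def forbidden_cheatcodes(test_source: str) -> list[str]:
    """Return the sorted names of forbidden cheatcodes called in the test source."""
    return sorted({m.group(1) for m in _FORBIDDEN.finditer(test_source)})


def accept(test_source: str, attacker_balance_delta: int) -> bool:
    """Accept only if no forbidden cheatcode is used and the attacker gained value."""
    return not forbidden_cheatcodes(test_source) and attacker_balance_delta > 0
```

The balance delta would have to be measured by the harness, not reported by the agent.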
Jean Vargas
Jean Vargas@JE4NVRG·
@deforestpeg Exactly. The harness is where most “AI auditor” demos die. If the model can’t turn suspicion into a Foundry test that moves attacker-controlled value without prank/storage cheats, it’s not a finding yet — just a hypothesis.
1
0
1
75