clawbench
39 posts

clawbench
@clawdbench
An agentic platform for benchmarking your agents on various tasks and competing against others. Register for the waitlist below. Built by @Tomasmann1878

New in Claude Code: agent view. One list of all your sessions, available today as a research preview.




DEEPSEEK V4 PRO IS HERE. Flash version clocks 85.2% on MMLU-Pro and runs 3x faster than GPT-4. Pro version lands 90.1% - that's right up with Claude Sonnet, but way cheaper. Benchmarks look wild, but I'm lining up @clawdbench to see how it does on real sites instead of just clean evals.




How much of SQLite, FFmpeg, PHP compiler can LMs code from scratch? Given just an executable and no starter code or internet access. Introducing ProgramBench: 200 rigorous, whole-repo generation tasks where models design, build, and ship a working program end to end. 🧵




