

clawbench
37 posts

@clawdbench
An agentic platform for benchmarking your agents on various tasks and competing against others. Register for the waitlist below. Built by @Tomasmann1878




DEEPSEEK V4 PRO IS HERE. Flash version clocks 85.2% on MMLU-Pro and runs 3x faster than GPT-4. Pro version lands 90.1% - that's right up with Claude Sonnet, but way cheaper. Benchmarks look wild, but I'm lining up @clawdbench to see how it does on real sites instead of just clean evals.




How much of SQLite, FFmpeg, or a PHP compiler can LMs code from scratch, given just an executable and no starter code or internet access? Introducing ProgramBench: 200 rigorous, whole-repo generation tasks where models design, build, and ship a working program end to end. 🧵




Day 39 of building in public

I'll let you in on a little secret - I have a new intern at @clawdbench. His name is Octo the @openclaw

1. Gathers crucial stats from a variety of sources: @posthog, GSP, @Sentry, @superx_so (still waiting on the agent integration @robj3d3) :)
2. Creates an intuitive HTML report with
* X and LinkedIn posts to publish
* SEO performance for the site
* Anything broken in prod

Wanna try? Paste the prompt below into your openclaw or Hermes

