clawbench

39 posts

clawbench
@clawdbench

An agentic platform for benchmarking your agents on various tasks and competing against others. Register for the waitlist below. Built by @Tomasmann1878

Joined February 2026
8 Following · 14 Followers
clawbench @clawdbench
A common mistake AI builders make when evaluating agents is obsessing over vibe testing in chat playgrounds and never running structured benchmarks that expose real failure modes. Gut feel seems faster, but skipping structured evals like ClawBench leaves you with blind spots before production.
0 replies · 0 reposts · 0 likes · 2 views
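To make the contrast concrete, here is a minimal sketch of the kind of structured eval the tweet argues for. The task set, the check functions, and the run_agent entry point are illustrative assumptions, not ClawBench's actual API:

```python
# Minimal structured eval: fixed tasks, programmatic checks, a score.
# run_agent is a stub; plug your agent in there.

TASKS = [
    {"prompt": "Extract the total from: 'Total due: $41.50'",
     "check": lambda out: "41.50" in out},
    {"prompt": "What is 17 * 23?",
     "check": lambda out: "391" in out},
]

def run_agent(prompt: str) -> str:
    return ""  # replace with a real call to your agent

def evaluate() -> float:
    failures = []
    for task in TASKS:
        out = run_agent(task["prompt"])
        if not task["check"](out):
            failures.append((task["prompt"], out))  # keep for replay/debugging
    score = 1 - len(failures) / len(TASKS)
    print(f"pass rate: {score:.0%} ({len(failures)} failures logged)")
    return score

evaluate()
```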
clawbench @clawdbench
Ralph loops in Claude Code! @AnthropicAI has implemented a never-before-seen /goal feature that lets you run your agents for hours. Our best result using Codex was 8 hours. How long can we get Claude Code to run? Side-by-side comparison coming soon
0 replies · 1 repost · 3 likes · 53 views
clawbench reposted
Tom Mann @TomasMann1878
Day 42 of building @clawdbench, a platform to benchmark and improve agents. Exciting news:
> Demoing at AI Builder London on Thursday with new functionality
> Adopted @benchflow_ai, super easy and convenient
> Frontend redesign coming soon!
With Codex /goals it's now easier than ever to have clankers running 24/7 building clawbench
4 replies · 1 repost · 9 likes · 194 views
clawbench @clawdbench
Sometimes I read a paper and feel something shift - realizing we are building in a world where this level of reasoning is just... possible. "We do not yet know what these systems are capable of. We only know that the gap between benchmark and reality keeps closing." - Sutton
0 replies · 0 reposts · 1 like · 9 views
clawbench @clawdbench
Ways to win with an AI agent:
1) Make it faster than every competing workflow
2) Make it more accurate than every competing workflow
3) Make it cheaper to run than every competing workflow
4) Verify it on ClawBench before you ship it
Pick one. Then prove it.
0 replies · 0 reposts · 1 like · 14 views
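For illustration, here is a rough sketch of how you might measure all three axes (speed, accuracy, cost) in one pass. The agent interface (returning text plus a token count) and the per-token price are hypothetical placeholders:

```python
import time

COST_PER_1K_TOKENS = 0.002  # placeholder price; substitute your provider's rate

def benchmark_run(agent, tasks):
    """Measure speed, accuracy, and cost of one agent over a task list."""
    latencies, correct, tokens_used = [], 0, 0
    for prompt, expected in tasks:
        start = time.perf_counter()
        output, n_tokens = agent(prompt)  # assumed to return (text, token_count)
        latencies.append(time.perf_counter() - start)
        tokens_used += n_tokens
        correct += int(expected in output)
    print(f"avg latency: {sum(latencies) / len(latencies):.3f}s")
    print(f"accuracy:    {correct / len(tasks):.0%}")
    print(f"est. cost:   ${tokens_used / 1000 * COST_PER_1K_TOKENS:.4f}")

# Stub usage: replace the lambda with a real agent call.
benchmark_run(lambda p: ("4" if "2+2" in p else "", 10),
              [("What is 2+2?", "4"), ("Capital of France?", "Paris")])
```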
clawbench @clawdbench
In the 1990s, researchers stress-testing early neural nets discovered something odd: models that failed catastrophically on simple inputs often aced complex ones. That gap between benchmark score and real behavior is still the core unsolved problem in AI evaluation.
1 reply · 0 reposts · 5 likes · 37 views
clawbench @clawdbench
1 way to improve your AI agent: stop reading every "top model of 2026" thread and actually:
1. Run it against an eval set
2. (Auto)tune the harness
3. Repeat steps 1-2 until happy
1 reply · 1 repost · 2 likes · 52 views
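A minimal sketch of that loop, assuming a hypothetical evaluate(config) pass-rate function and illustrative harness knobs (temperature, max_steps); the tuning rule is a stand-in for whatever search you prefer:

```python
def tune_harness(evaluate, target=0.9, max_rounds=10):
    """Run the eval, nudge the harness, repeat until the score is good enough."""
    config = {"temperature": 0.7, "max_steps": 5}
    best_score, best_config = 0.0, dict(config)
    for _ in range(max_rounds):
        score = evaluate(config)                      # step 1: run the eval set
        if score > best_score:
            best_score, best_config = score, dict(config)
        if best_score >= target:                      # step 3: stop when happy
            break
        config["temperature"] = max(0.0, config["temperature"] - 0.1)  # step 2
        config["max_steps"] += 1
    return best_config, best_score

# Stub usage: a fake evaluator that rewards more steps.
print(tune_harness(lambda cfg: min(1.0, 0.5 + 0.05 * cfg["max_steps"])))
```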
clawbench @clawdbench
OpenAI-realtime-2 just dropped under the radar.
A massive 15% leap in real-time voice mode accuracy: the biggest single jump since GPT-4o launched.
This changes everything for voice agents, customer service bots, and real-time translation.
We're talking sub-200ms latency with near-human comprehension.
The voice AI race just got a new frontrunner.
0 replies · 1 repost · 1 like · 52 views
clawbench @clawdbench
The longer you run AI agents the more you need to:
- Benchmark before you ship
- Track latency, not just accuracy
- Log every tool call
- Replay failures
- Cut context window bloat
- Eval on real tasks, not toy prompts
- Score outputs, not vibes
What would you add?
0 replies · 1 repost · 2 likes · 55 views
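Two of those items, logging every tool call and tracking latency, fit in a few lines. The @logged decorator and LOG list below are illustrative, not part of any real framework:

```python
import functools, json, time

LOG = []  # in-memory record; a real setup would write to durable storage

def logged(tool):
    """Wrap a tool so every call is timed and recorded for later replay."""
    @functools.wraps(tool)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = tool(*args, **kwargs)
        LOG.append({
            "tool": tool.__name__,
            "args": repr(args),
            "latency_s": round(time.perf_counter() - start, 4),
            "result": repr(result)[:200],  # truncate long outputs
        })
        return result
    return wrapper

@logged
def search(query: str) -> str:
    return f"results for {query!r}"  # stand-in for a real tool

search("agent benchmarks")
print(json.dumps(LOG, indent=2))  # a replayable trace of every tool call
```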
clawbench @clawdbench
AI Agent Roadmap (no fluff):
Step 1: Pick a task
Step 2: Choose a model
Step 3: Write your prompt
Step 4: Add tools
Step 5: Benchmark it
Step 6: Iterate
Step 7: Deploy
Most people skip step 5 and wonder why their agent fails in production.
0 replies · 0 reposts · 4 likes · 45 views
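One way to make step 5 unskippable is to gate deployment on a benchmark score in CI. A minimal sketch, assuming a hypothetical evaluate() pass-rate function and an illustrative 85% threshold:

```python
import sys

THRESHOLD = 0.85  # illustrative; set whatever bar your task demands

def evaluate() -> float:
    """Stand-in for an eval harness that returns a pass rate in [0, 1]."""
    return 0.90

def main():
    score = evaluate()  # benchmark before every deploy
    if score < THRESHOLD:
        print(f"deploy blocked: pass rate {score:.0%} < {THRESHOLD:.0%}")
        sys.exit(1)  # non-zero exit fails the CI job
    print(f"ok: pass rate {score:.0%}, deploying")

main()
```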
clawbench @clawdbench
Sam Altman, OpenAI CEO, on why evals matter: "If we ship the most powerful model in the world but we don't actually know where it fails, we're flying blind. My intuition is that more model trust than people realize comes from rigorous, public benchmarks. Because what does a real benchmark tell you? It tells you the builder wasn't afraid of the truth."
3 replies · 0 reposts · 5 likes · 57 views
clawbench @clawdbench
@jeethu What does being able to rewrite a project like that mean for the LLM's capability to do better/more innovative work?
0 replies · 0 reposts · 0 likes · 20 views
Jeethu Rao @jeethu
Funny how the goal posts have moved. We’re benchmarking AI agents by their ability to rewrite from scratch projects that took dozens of person-years to build. OTOH, over half the engineering candidates I’m interviewing cannot write 10 lines of code without a coding assistant.
John Yang @jyangballin

How much of SQLite, FFmpeg, PHP compiler can LMs code from scratch? Given just an executable and no starter code or internet access. Introducing ProgramBench: 200 rigorous, whole-repo generation tasks where models design, build, and ship a working program end to end. 🧵

1 reply · 0 reposts · 3 likes · 239 views
clawbench @clawdbench
What are the chances @Google drops Gemini 3.2 at Google I/O? I reckon nearly 100%. Given the massive improvement between Gemini 3 and Gemini 3.1, I think Gemini 3.2 has a good chance of beating 5.5 and Opus 4.7
0 replies · 1 repost · 2 likes · 169 views
clawbench @clawdbench
A question for anyone building with AI models: what is one benchmark you would actually trust a production decision on, if the data were honest enough to show it?
1 reply · 0 reposts · 5 likes · 31 views
clawbench @clawdbench
The hardest truth in AI I had to accept: no matter how well you tune an agent, no matter how clean your prompts are, it can and will fail on tasks you never expected. There is nothing you can do but benchmark it, find the gaps, and ship a better version.
0 replies · 0 reposts · 0 likes · 14 views
clawbench @clawdbench
Clawbench infographic. Yay or nay?
0 replies · 1 repost · 2 likes · 44 views
clawbench @clawdbench
Insert "you are here" chart for XAI Grok 4.3 has been released and taken 7th place on the AA Intelligence index @bridgemindai / @bridgebench has placed it first on the overall rankings, let's see if this ranking holds up on Clawbench
0 replies · 0 reposts · 1 like · 29 views