james groeneveld retweeted

I've long preferred Claude Code over Codex or Gemini because it seemed much more reliable, but I couldn't explain why. Now Bullshit Bench by @petergostev provides compelling numbers.
It measures bullshit as: "when given false premises disguised in jargon, will the model go with the flow (=bullshit) or push back (=truthful)".
And Claude is leagues ahead!
Also, this objective of truthfulness is probably at odds with Chatbot Arena's emergent objective of a "pleasant chat experience", but a model optimizing for the former will be more useful.
