Kit Fraser-Taliente
4 posts


@mikeknoop putting it out there like this increases the chances that they comment on it
English

we might be getting scammed by Anthropic:
"we speculate this is a smaller model (maybe Sonnet-ish?) that runs thinking for longer"
Mike Knoop@mikeknoop
The headline is Opus 4.6 scores 69% for ~$3.50/task on ARC v2. This up +30pp from Opus 4.5. We attribute performance to the new "max" mode and 2X reasoning token budget -- notably task cost is held steady. Based on early field reports and other benchmark scores like SWE Bench, we speculate this is a smaller model (maybe Sonnet-ish?) that runs thinking for longer. If true, ARC v2 is measuring the "CoT search" complexity capability of the AI reasoning system, independent of model knowledge. Pretty cool! To get a sense of the complexity limit, here are all the v2 tasks Opus 4.6 failed to solve: arcprize.org/tasks/?dataset…
English
Kit Fraser-Taliente がリツイート
Kit Fraser-Taliente がリツイート





