Angehefteter Tweet
Vals AI
1.1K posts

Vals AI
@ValsAI
Public LLM Evaluation // https://t.co/FjWabQY2jk @8vc @BloombergBeta @pearvc
San Francisco, CA Beigetreten Mart 2024
253 Folgt9.6K Follower
Vals AI retweetet

After reaching out, we were able to confirm with OpenAI that “tool_choice”: “none” injects an additional steering instruction into the model system prompt, in a way that tools: [] does not.
This instruction seemingly hurts the model’s ability to use the Terminus 2 harness effectively, which, despite not using native-tool-calling, is still agentic.
English

After the discovery below, GPT 5.5 is now the #1 model on Terminal Bench 2, improving by +11%. It is still the #2 model on the Vals Index.
See full results on vals.ai/models/openai_…

Vals AI@ValsAI
We noticed discrepancies between our Terminal Bench scores and those reported by others. After a detailed investigation, we determined the cause of the delta was a surprising, undocumented behavior of the tool_choice parameter. See our full findings below.
English

Full results and domain specific benchmarks can be found at vals.ai/models/grok_gr…
English

K2.6 placed #6 on CorpFin and #12 (overall) on Finance Agent — these were also contributors to its #1 open-weight performance.
Its weaknesses are domain-specific: legal is uneven (#10 LegalBench, #21 CaseLaw v2), the medical benchmarks are weaker (#29 MedCode, #24 MedScribe), and on tax, it sits middle-of-the-pack.
English





