Sabitlenmiş Tweet
Vals AI
1.1K posts

Vals AI
@ValsAI
Public LLM Evaluation // https://t.co/FjWabQY2jk @8vc @BloombergBeta @pearvc
San Francisco, CA Katılım Mart 2024
253 Takip Edilen9.6K Takipçiler

Full results and benchmark breakdown can be found on vals.ai/models/mistral…
English
Vals AI retweetledi

After reaching out, we were able to confirm with OpenAI that “tool_choice”: “none” injects an additional steering instruction into the model system prompt, in a way that tools: [] does not.
This instruction seemingly hurts the model’s ability to use the Terminus 2 harness effectively, which, despite not using native-tool-calling, is still agentic.
English

After the discovery below, GPT 5.5 is now the #1 model on Terminal Bench 2, improving by +11%. It is still the #2 model on the Vals Index.
See full results on vals.ai/models/openai_…

Vals AI@ValsAI
We noticed discrepancies between our Terminal Bench scores and those reported by others. After a detailed investigation, we determined the cause of the delta was a surprising, undocumented behavior of the tool_choice parameter. See our full findings below.
English

Full results and domain specific benchmarks can be found at vals.ai/models/grok_gr…
English





