Vals AI

1.1K posts

Vals AI

@ValsAI

Public LLM Evaluation // https://t.co/FjWabQY2jk @8vc @BloombergBeta @pearvc

San Francisco, CA Beigetreten Mart 2024

253 Folgt9.6K Follower

Angehefteter Tweet

Vals AI@ValsAI·2d

Today @xai just rearranged our leaderboards… Grok 4.3 jumped 25 points to take #1 on CaseLaw v2 and climbed 21 spots to lead CorpFin at 68.5%. Congrats @xai @elonmusk 🚀

English

342

973

3.3K

644.4K

Vals AI retweetet

Elon Musk@elonmusk·1d

Grok #1 in law

Arthur MacWaters@ArthurMacwaters

Grok 4.3 release > #1 in caselaw > #1 in corpfin > impressive given significantly lower cost per 1m tokens (5-10x less than opus 4.7 and openai 5.5) Very exciting to see the massive jump in performance in highly detail-oriented applied fields

English

2.6K

4.6K

27.2K

8.1M

Vals AI@ValsAI·1d

We have updated the results of GPT 5.5 on our site - it is now the #1 model on Terminal Bench 2. Its ranking on the Vals Index has not changed.

English

7.2K

Vals AI@ValsAI·1d

After reaching out, we were able to confirm with OpenAI that “tool_choice”: “none” injects an additional steering instruction into the model system prompt, in a way that tools: [] does not. This instruction seemingly hurts the model’s ability to use the Terminus 2 harness effectively, which, despite not using native-tool-calling, is still agentic.

English

13K

Vals AI@ValsAI·1d

We noticed discrepancies between our Terminal Bench scores and those reported by others. After a detailed investigation, we determined the cause of the delta was a surprising, undocumented behavior of the tool_choice parameter. See our full findings below.

English

147

29.9K

Vals AI@ValsAI·1d

After the discovery below, GPT 5.5 is now the #1 model on Terminal Bench 2, improving by +11%. It is still the #2 model on the Vals Index. See full results on vals.ai/models/openai_…

Vals AI@ValsAI

English

125

13.3K

Vals AI@ValsAI·2d

@scaling01 It's here and we evaluated it- x.com/ValsAI/status/…

Vals AI@ValsAI

Grok 4.3 has launched at #13 on the Vals Index. It ranks #1 on CaseLaw and #1 on CorpFin but it struggles on general coding benchmarks.

English

295

Lisan al Gaib@scaling01·2d

looks like we are getting grok 4.3 api very soon

English

171

8.7K

Vals AI@ValsAI·2d

Full results and domain specific benchmarks can be found at vals.ai/models/grok_gr…

English

932

Vals AI@ValsAI·2d

We evaluated 4.3 with the default xAI hyperparameters: temperature=0.7, Top P: 0.95, and Top K= default. This model has 1M context window and output tokens. Latency is 584.24 seconds per test and costs $0.38/test on the index.

English

Vals AI@ValsAI·2d

Grok 4.3 has launched at #13 on the Vals Index. It ranks #1 on CaseLaw and #1 on CorpFin but it struggles on general coding benchmarks.

English

137

13.2K

Vals AI@ValsAI·4d

K2.6 placed #6 on CorpFin and #12 (overall) on Finance Agent — these were also contributors to its #1 open-weight performance. Its weaknesses are domain-specific: legal is uneven (#10 LegalBench, #21 CaseLaw v2), the medical benchmarks are weaker (#29 MedCode, #24 MedScribe), and on tax, it sits middle-of-the-pack.

English

1.6K

Vals AI@ValsAI·4d

Coding is where K2.6 excels. It took #1 among open-weight models on Terminal Bench 2 (+17 points over K2.5) and SWE-Bench (+8 points), and broke into the top 5 overall on both SWE-Bench (#4) and LiveCodeBench (#5), rivaling frontier closed-source models.

English

1.8K

Vals AI@ValsAI·4d

Results on the remainder of our benchmarks are in for Kimi K2.6, the new #1 open-weight model on the Vals Index (#8 overall) Kimi K2.6 is competitive with many closed-weight models, at a fraction of their price.

English

181

21.7K

Entdecken

@scaling01 @elonmusk @BarackObama @taylorswift13 @cristiano @BillGates @NASA @nikifrancismediavine