Vals AI

1.1K posts

Vals AI banner
Vals AI

Vals AI

@ValsAI

Public LLM Evaluation // https://t.co/FjWabQY2jk @8vc @BloombergBeta @pearvc

San Francisco, CA Beigetreten Mart 2024
253 Folgt9.6K Follower
Angehefteter Tweet
Vals AI
Vals AI@ValsAI·
Today @xai just rearranged our leaderboards… Grok 4.3 jumped 25 points to take #1 on CaseLaw v2 and climbed 21 spots to lead CorpFin at 68.5%. Congrats @xai @elonmusk 🚀
Vals AI tweet media
English
342
973
3.3K
644.4K
Vals AI retweetet
Elon Musk
Elon Musk@elonmusk·
Grok #1 in law
Arthur MacWaters@ArthurMacwaters

Grok 4.3 release > #1 in caselaw > #1 in corpfin > impressive given significantly lower cost per 1m tokens (5-10x less than opus 4.7 and openai 5.5) Very exciting to see the massive jump in performance in highly detail-oriented applied fields

English
2.6K
4.6K
27.2K
8.1M
Vals AI
Vals AI@ValsAI·
We have updated the results of GPT 5.5 on our site - it is now the #1 model on Terminal Bench 2. Its ranking on the Vals Index has not changed.
English
0
4
79
7.2K
Vals AI
Vals AI@ValsAI·
After reaching out, we were able to confirm with OpenAI that “tool_choice”: “none” injects an additional steering instruction into the model system prompt, in a way that tools: [] does not. This instruction seemingly hurts the model’s ability to use the Terminus 2 harness effectively, which, despite not using native-tool-calling, is still agentic.
English
3
2
53
13K
Vals AI
Vals AI@ValsAI·
We noticed discrepancies between our Terminal Bench scores and those reported by others. After a detailed investigation, we determined the cause of the delta was a surprising, undocumented behavior of the tool_choice parameter. See our full findings below.
English
5
4
147
29.9K
Lisan al Gaib
Lisan al Gaib@scaling01·
looks like we are getting grok 4.3 api very soon
Lisan al Gaib tweet media
English
7
4
171
8.7K
Vals AI
Vals AI@ValsAI·
We evaluated 4.3 with the default xAI hyperparameters: temperature=0.7, Top P: 0.95, and Top K= default. This model has 1M context window and output tokens. Latency is 584.24 seconds per test and costs $0.38/test on the index.
English
1
0
5
1K
Vals AI
Vals AI@ValsAI·
Grok 4.3 has launched at #13 on the Vals Index. It ranks #1 on CaseLaw and #1 on CorpFin but it struggles on general coding benchmarks.
Vals AI tweet media
English
8
12
137
13.2K
Vals AI
Vals AI@ValsAI·
K2.6 placed #6 on CorpFin and #12 (overall) on Finance Agent — these were also contributors to its #1 open-weight performance. Its weaknesses are domain-specific: legal is uneven (#10 LegalBench, #21 CaseLaw v2), the medical benchmarks are weaker (#29 MedCode, #24 MedScribe), and on tax, it sits middle-of-the-pack.
English
0
0
6
1.6K
Vals AI
Vals AI@ValsAI·
Coding is where K2.6 excels. It took #1 among open-weight models on Terminal Bench 2 (+17 points over K2.5) and SWE-Bench (+8 points), and broke into the top 5 overall on both SWE-Bench (#4) and LiveCodeBench (#5), rivaling frontier closed-source models.
English
1
1
14
1.8K
Vals AI
Vals AI@ValsAI·
Results on the remainder of our benchmarks are in for Kimi K2.6, the new #1 open-weight model on the Vals Index (#8 overall) Kimi K2.6 is competitive with many closed-weight models, at a fraction of their price.
Vals AI tweet media
English
6
16
181
21.7K