Vals AI

1.1K posts

Vals AI banner
Vals AI

Vals AI

@ValsAI

Public LLM Evaluation // https://t.co/FjWabQY2jk @8vc @BloombergBeta @pearvc

San Francisco, CA Katılım Mart 2024
253 Takip Edilen9.6K Takipçiler
Sabitlenmiş Tweet
Vals AI
Vals AI@ValsAI·
Today @xai just rearranged our leaderboards… Grok 4.3 jumped 25 points to take #1 on CaseLaw v2 and climbed 21 spots to lead CorpFin at 68.5%. Congrats @xai @elonmusk 🚀
Vals AI tweet media
English
347
1K
3.3K
665.1K
Vals AI
Vals AI@ValsAI·
Medium 3.5 has a 262k context window. For all benchmarks, it was run with 80k output tokens, temperature=0.7, top P =0.95. On the index, it costs $2.06/per test, which is expensive compared to other open-weight models. .
English
1
0
0
76
Vals AI
Vals AI@ValsAI·
This week Mistral dropped Medium 3.5 model and it landed at #10 on open-weight Vals Index.
Vals AI tweet media
English
1
1
6
231
Vals AI retweetledi
Elon Musk
Elon Musk@elonmusk·
Grok #1 in law
Arthur MacWaters@ArthurMacwaters

Grok 4.3 release > #1 in caselaw > #1 in corpfin > impressive given significantly lower cost per 1m tokens (5-10x less than opus 4.7 and openai 5.5) Very exciting to see the massive jump in performance in highly detail-oriented applied fields

English
2.7K
5.1K
29.6K
8.8M
Vals AI
Vals AI@ValsAI·
We have updated the results of GPT 5.5 on our site - it is now the #1 model on Terminal Bench 2. Its ranking on the Vals Index has not changed.
English
0
4
80
8K
Vals AI
Vals AI@ValsAI·
After reaching out, we were able to confirm with OpenAI that “tool_choice”: “none” injects an additional steering instruction into the model system prompt, in a way that tools: [] does not. This instruction seemingly hurts the model’s ability to use the Terminus 2 harness effectively, which, despite not using native-tool-calling, is still agentic.
English
3
2
54
13.4K
Vals AI
Vals AI@ValsAI·
We noticed discrepancies between our Terminal Bench scores and those reported by others. After a detailed investigation, we determined the cause of the delta was a surprising, undocumented behavior of the tool_choice parameter. See our full findings below.
English
5
5
151
30.5K
Lisan al Gaib
Lisan al Gaib@scaling01·
looks like we are getting grok 4.3 api very soon
Lisan al Gaib tweet media
English
7
4
171
8.8K
Vals AI
Vals AI@ValsAI·
We evaluated 4.3 with the default xAI hyperparameters: temperature=0.7, Top P: 0.95, and Top K= default. This model has 1M context window and output tokens. Latency is 584.24 seconds per test and costs $0.38/test on the index.
English
1
0
5
1K
Vals AI
Vals AI@ValsAI·
Grok 4.3 has launched at #13 on the Vals Index. It ranks #1 on CaseLaw and #1 on CorpFin but it struggles on general coding benchmarks.
Vals AI tweet media
English
8
12
137
13.4K