Xeophon

32.8K posts

@xeophon

evals @PrimeIntellect | open models @interconnectsai

Joined July 2015
993 Following · 12K Followers
Xeophon@xeophon·
@creet_z @afurgs No way??? Alex saved my old grandma from an angry dog and then gave me an H100 node to calm me down, too!!
Christian@creet_z·
@afurgs Thanks for letting me borrow your Ferrari and saving my cat from a fire and selling me those 8xH100 nodes Alex
Xeophon@xeophon·
@HaoliYin not doubting that, mostly speaking about raw capabilities at the very frontier
Haoli Yin@HaoliYin·
@xeophon Older evals can be recycled to provide more signal! You can also do this across multiple evals to distill signal for a specific capability as we've done here in DatBench x.com/HaoliYin/statu…
Haoli Yin@HaoliYin

We cut VLM eval compute by >10× while INCREASING signal. The secret? Most benchmark samples are noise:
→ 70% solvable without the image
→ 42% mislabeled or ambiguous
→ MCQ formats hide 35-point capability gaps
Presenting: DatBench 🧵 1/n

Xeophon@xeophon·
Sometimes, entire eval categories just die. Example: No one asks trivia questions anymore, multiple choice knowledge benches are also dead. I increasingly feel the same with coding benchmarks.
Zach Mueller @ GTC2026@TheZachMueller·
@xeophon “Claude, clean codex’s mess” “Codex, fix the bugs Claude just made” “Gemini… write a decent commit message k thx”
Xeophon@xeophon·
requesting a "with how few lines and small changes can you solve this swe-bench problem" eval so openai can hill climb the shit out of it
my job these days is just to delete like 60% from codex' outputs 😭
Xeophon@xeophon·
@Miles_Brundage would love to read (or see) how you do work with all those apps tbh
Miles Brundage@Miles_Brundage·
Cowork has (mostly) been fine lately though
Miles Brundage@Miles_Brundage·
From a competitive perspective, Codex getting better as an app in the past month was perfectly timed with Claude Code being broken all the time
Manish Shetty@slimshetty_·
1/ Thrilled to share that I’m joining @METR_Evals after finishing my PhD at Berkeley!
Xeophon@xeophon·
@_TobiasLee what are yall doing to your claws over there 😆
Lei Li@_TobiasLee·
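🦞 Opus 4.6 passes only 25.7%?! We added 35 multimodal agentic tasks to Claw-Eval.
The previous version mainly evaluated what an agent can do in a text-only world; this time we push one step further.
From multimodal source material to multimodal works, this set of tasks evaluates an agent's end-to-end creative ability:
🏠 Watch a room-tour video → draw an architectural floor plan annotated with spatial relationships
📊 Cross-extract experimental data from multiple academic papers → automatically generate comparison visualizations
🏸 Watch an entire badminton match → plot the score progression
These tasks require the agent not only to understand multimodal material but also to retrieve information autonomously, gather resources, and orchestrate toolchains, ultimately delivering a complete work.
The evaluation itself has also been updated: the agent's outputs are rendered and sampled into frames, and a vision model reviews the final deliverable dimension by dimension.
Perception → reasoning → creation → visual review, an end-to-end closed loop.
Currently Opus 4.6 has a stable pass rate of 25.7%, while K2.5 and Gemini 3 Flash both reach only 20%. There is still some way to go before the ideal form. claw-eval.github.io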
Xeophon@xeophon·
@fujikanaeda I cut the caveat ("as long as you do it in realistic settings + unlimited tokens") before I sent the tweet, oh well
Eric W. Tramel@fujikanaeda·
@xeophon oh almost everything is solvable with web search, web search is really good. most of these i think about for pretrain and where we want to understand what knowledge/capabilities are being "soaked up" in the weights themselves
Xeophon@xeophon·
@fujikanaeda
> you have to search hard to find the cases that aren't
i find it really hard (impossible?) to come up with knowledge categories that aren't solvable in a "fair" setting, i.e., with web search
Eric W. Tramel@fujikanaeda·
Agree, but I also think some of it is that no one has revisited the actual questions being asked. I agree some things like MCQ can be a little fraught, but if you ask interesting contextual questions and filter out the easy stuff, you can get more mileage here. At issue is using the *same* benchmark for a few years. The instantiation is stale, but even if you take something like knowledge categories: easy things are conquered for a while, and you have to search hard to find the cases that aren't. However, when you find the specific question and domains that aren't conquered in the format, it's actually a pretty interesting failure analysis and leads to some good directions for improvement.
Tim Kostolansky@thkostolansky·
why’s the monolith in sf now
Tenobrus@tenobrus·
have you ever dyed your hair? // do you consider yourself more upwardly or downwardly mobile relative to your parents socioeconomic status?
Cliff Pickover@pickover·
Mathematics. Free book PDF. "Introduction to Probability," 2nd edition, by Charles M. Grinstead and J. Laurie Snell. "Probability theory began in seventeenth century France when the two great French mathematicians, Blaise Pascal and Pierre de Fermat, corresponded over two problems from games of chance. Problems like those Pascal and Fermat solved continued to influence such early researchers as Huygens, Bernoulli, and DeMoivre in establishing a mathematical theory of probability. Today, probability theory is a well-established branch of mathematics that finds applications in every area of scholarly activity from music to physics, and in daily experience from weather prediction to predicting the risks of new medical treatments." Link: open.umn.edu/opentextbooks/…