Xeophon
32.8K posts

Xeophon
@xeophon
evals @PrimeIntellect | open models @interconnectsai
Joined July 2015
993 Following 12K Followers

@xeophon @AdinaYakup Almost ruined my day, still MIT, unless you are talking about something else?

Another OCR model just dropped on @huggingface (so many OCRs lately!)
dots.mocr from @xiaohongshu Hi Lab looks really impressive on the benchmarks.
-Model: huggingface.co/collections/re…
-Paper: huggingface.co/papers/2603.13…
✨ 3B
✨ Multilingual support
✨ Converts charts, diagrams, and UI layouts directly into SVG code


Hi Suhail, my name is Alex. I don't think we've had the chance to meet
Suhail @Suhail
I am now at 5 GPU providers being completely sold out for a single node of 8xH100s. I don’t think people understand the gravity of what is about to come.

@xeophon Older evals can be recycled to provide more signal!
You can also do this across multiple evals to distill signal for a specific capability as we've done here in DatBench
x.com/HaoliYin/statu…
Haoli Yin @HaoliYin
We cut VLM eval compute by >10× while INCREASING signal. The secret? Most benchmark samples are noise:
→ 70% solvable without the image
→ 42% mislabeled or ambiguous
→ MCQ formats hide 35-point capability gaps
Presenting: DatBench 🧵 1/n

@xeophon “Claude, clean codex’s mess”
“Codex, fix the bugs Claude just made”
“Gemini… write a decent commit message k thx”

@Miles_Brundage would love to read (or see) how you do work with all those apps tbh

1/ Thrilled to share that I’m joining @METR_Evals after finishing my PhD at Berkeley!

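🦞 Opus 4.6 passes only 25.7%?! We added 35 multimodal agentic tasks to Claw-Eval
The previous version mainly evaluated what agents can do in a text-only world; this time we take it a step further
From multimodal source material to multimodal output, this set of tasks evaluates an agent's end-to-end creative ability
🏠 Watch a room tour video → draw an architectural floor plan annotated with spatial relationships
📊 Cross-extract experimental data from multiple academic papers → automatically generate comparison visualizations
🏸 Watch an entire badminton match → plot the score progression
These tasks require the agent not only to understand multimodal material but also to autonomously retrieve information, gather resources, and orchestrate toolchains, ultimately delivering a complete piece of work
The evaluation itself has also been updated: the agent's outputs are rendered and captured as frames, and a vision model reviews the final deliverable dimension by dimension.
Perception → reasoning → creation → visual review, an end-to-end closed loop.
Currently Opus 4.6 has a stable pass rate of 25.7%, while K2.5 and Gemini 3 Flash are both at only 20%. There is still a way to go before reaching the ideal form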
claw-eval.github.io


@fujikanaeda I cut the caveat („as long as you do it in realistic settings + unlimited tokens“) before I sent the tweet, oh well

@xeophon oh almost everything is solvable with web search, web search is really good.
most of these i think about for pretrain and where we want to understand what knowledge/capabilities are being "soaked up" in the weights themselves

@fujikanaeda > you have to search hard to find the cases that aren't
i find it really hard (impossible?) to come up with knowledge categories that aren't solvable in a "fair" setting, i.e., with web search

Agree, but I also think some of it is that no one has revisited the actual questions being asked. I agree some things like MCQ can be a little fraught, but if you ask interesting contextual questions and filter out the easy stuff, you can get more mileage here.
At issue is using the *same* benchmark for a few years. The instantiation is stale, but even if you take something like knowledge categories: easy things are conquered for a while, and you have to search hard to find the cases that aren't.
However, when you find the specific question and domains that aren't conquered in the format, it's actually a pretty interesting failure analysis and leads to some good directions for improvement.

Imagine being this particular Claude 😭

Sen. Bernie Sanders @SenSanders
I spoke to Anthropic’s AI agent Claude about AI collecting massive amounts of personal data and how that information is being used to violate our privacy rights. What an AI agent says about the dangers of AI is shocking and should wake us up.

Mathematics.
Free book PDF. "Introduction to Probability," 2nd edition, by Charles M. Grinstead and J. Laurie Snell.
"Probability theory began in seventeenth century France when the two great French mathematicians, Blaise Pascal and Pierre de Fermat, corresponded over two problems from games of chance. Problems like those Pascal and Fermat solved continued to influence such early researchers as Huygens, Bernoulli, and DeMoivre in establishing a mathematical theory of probability. Today, probability theory is a well-established branch of mathematics that finds applications in every area of scholarly activity from music to physics, and in daily experience from weather prediction to predicting the risks of new medical treatments."
Link: open.umn.edu/opentextbooks/…
