Thoughtful

46 posts

@thoughtfullab

Decide on purpose.

Joined December 2025
4 Following · 1.7K Followers
Pinned Tweet
Thoughtful @thoughtfullab
Model shaping is still a craft of a few. That's what AI agents are for: learning it and doing it for everyone else. As part of the FrontierSWE benchmark, we built a 20-hour post-training task on @tinkerapi and found that the real bottleneck is research intuition.
11 replies · 53 reposts · 517 likes · 215K views
Thoughtful retweeted
Mersad Abbasi @Mersad_Abbasi
Models are becoming better at post-training. Fable 5 specifically solves many of the shortcomings we discussed in our previous report: more mature decision making, questioning the default approaches, and better calibration. It feels like a model that "gets it," as @karpathy mentioned. There is still room for improvement, specifically in understanding the big picture and in calibration, but FrogsGame is not a good benchmark for that. We need better real-world post-training tasks. Read more about what we found here.
Thoughtful @thoughtfullab

x.com/i/article/2066…

0 replies · 3 reposts · 7 likes · 1.4K views
Thoughtful @thoughtfullab
That said, we dislike FrogsGame as a task internally. The frogs know what they did. We're now sprinting toward adding more useful, real-world post-training tasks, partly out of ambition, partly to put some distance between us and the frogs 🐸
3 replies · 0 reposts · 104 likes · 17.5K views
Thoughtful @thoughtfullab
Fable 5 is doing something wild on our FrogsGame post-training task. It trains a weaker model to solve the puzzle, peaks at 68%, and produces the only ~10x improvement we see across the benchmark. It spent 17 hours and 25M tokens without a human in sight. 34% pass@1, while every other frontier model averages under 4%. We will publish a more detailed analysis soon.
Thoughtful @thoughtfullab

Model shaping is still a craft of a few. That's what AI agents are for: learning it and doing it for everyone else. As part of the FrontierSWE benchmark, we built a 20-hour post-training task on @tinkerapi and found that the real bottleneck is research intuition.

19 replies · 62 reposts · 1.1K likes · 485.7K views
Thoughtful @thoughtfullab
New #1 on PostTrainBench: Opus 4.8 (max reasoning) hits 37.23% — up from 28.56% for 4.7, the largest single improvement we've seen. Fable 5 runs underway now that AI research behavior is no longer silently degraded. PostTrainBench asks how well frontier AI can train weaker language models. That makes it one of the first benchmarks for recursive self-improvement: AI improving AI, with progress measured in the loop itself.
4 replies · 9 reposts · 87 likes · 20K views
Thoughtful retweeted
faisal @faisal_sayed05
Opus 4.8 outperforms every other model on AttuneBench
- best at picking the response humans actually preferred
- biggest MSCEIT four-branch jump of any Opus generation
- entire pairwise top-4 is now Anthropic models; non-Anthropic frontier models stall at ~50%
Claude @claudeai

Introducing Claude Opus 4.8: it builds on Opus 4.7 with sharper judgment, more honesty about its own progress, and the ability to work independently for longer than its predecessors. Available today at the same price.

0 replies · 5 reposts · 41 likes · 7.2K views
Thoughtful retweeted
faisal @faisal_sayed05
We tested 11 frontier LLMs on 200 real human–AI conversations to measure emotional intelligence.
The result that surprised us: EQ doesn't scale with size or recency. Claude Haiku 4.5 beats Sonnet 4.6. Opus 4.6 performs better than 4.7.
It's an orthogonal capability, and labs aren't optimizing for it.
9 replies · 10 reposts · 117 likes · 11.4K views
Thoughtful retweeted
Karina @karinanguyen
In post-training, we've learned that once a behavior is measurable, you can train AI to excel at it. EQ is one of the hardest things to verify. AttuneBench makes it measurable through observable signals: whether a model notices distress, tracks shifting preferences, adapts to context, and responds in a way people experience as helpful.
Thoughtful @thoughtfullab

Introducing AttuneBench! We built this benchmark on a simple premise: for self-improving AI to reach its full usefulness to humanity, it needs high EQ. We decomposed EQ into distinct skills and evaluated 11 frontier models across 50 real-life topics, from relationships and marriage to school and job stress, using 50,000+ first-person annotations.

6 replies · 8 reposts · 80 likes · 12.8K views
Thoughtful @thoughtfullab
4/ Other key insights
- The perspective gap is persistent: all 11 models were better at predicting what the model did than what the participant wanted, with gaps of 3.0 to 7.6 percentage points.
- Multi-turn conversations expose drift: 9 of 11 models became less accurate at reading behavior in the last third of a conversation than in the first.
- Preference is the deeper signal: emotion labeling is useful, but the harder problem is predicting what kind of response a specific person needs in context.
- Models struggle most where affective accuracy may matter most.
1 reply · 0 reposts · 10 likes · 594 views
Thoughtful @thoughtfullab
Introducing AttuneBench! We built this benchmark on a simple premise: for self-improving AI to reach its full usefulness to humanity, it needs high EQ. We decomposed EQ into distinct skills and evaluated 11 frontier models across 50 real-life topics, from relationships and marriage to school and job stress, using 50,000+ first-person annotations.
4 replies · 11 reposts · 58 likes · 18.4K views
Thoughtful retweeted
Phoebe Yao @phoebeyao
1/ Today we're releasing AttuneBench, the first open EQ benchmark grounded in real multi-turn human-model conversations, scored against what the person actually felt and wanted at each turn. Built by the research team at @pareto_ai in collaboration with @thoughtfullab.
Most existing EQ benchmarks rely on:
- synthetic prompts
- single-turn interactions
- third-party annotation
None directly measure how a model reads and responds to a real person across a full conversation.
We evaluated 11 leading models from major providers, across 200 conversations and 50,000+ first-person annotations.
14 replies · 26 reposts · 151 likes · 20.4K views
Thoughtful retweeted
Jack Clark @jackclarkSF
I've spent the past few weeks reading 100s of public data sources about AI development. I now believe that recursive self-improvement has a 60% chance of happening by the end of 2028. In other words, AI systems might soon be capable of building themselves.
288 replies · 499 reposts · 3.5K likes · 1.7M views
Thoughtful retweeted
Jack Clark @jackclarkSF
Another nice example is PostTrainBench from @karinanguyen et al., where you need to autonomously have powerful models (e.g., Opus 4.6) finetune weaker open-weight models to improve performance on some benchmarks. This is an important subset of the overall task of AI R&D.
1 reply · 12 reposts · 272 likes · 46.8K views
Thoughtful retweeted
Hardik Bhatnagar @hrdkbhatnagar
GPT 5.5 results are out on PostTrainBench!
With reprompting: 28.35% (#2, just behind Opus 4.7 at 28.56%)
Without reprompting: 25.02% (#4)
The top 3 are now separated by less than 0.4 points: Opus 4.7, GPT 5.5, and GPT 5.4.
Reprompting continues to matter: a 13% relative gain for GPT 5.5, similar to what we saw with GPT 5.4. Near-perfect BFCL score too (99.25%).
posttrainbench.com
4 replies · 10 reposts · 107 likes · 13.9K views
Thoughtful @thoughtfullab
Decide on purpose
9 replies · 7 reposts · 64 likes · 9.6K views