Thoughtful
@thoughtfullab

31 posts
Decide on purpose.

Joined December 2025
4 Following · 744 Followers
Pinned Tweet
Thoughtful@thoughtfullab·
Model shaping is still a craft of a few. That's what AI agents are for: learning it and doing it for everyone else. As a part of FrontierSWE benchmark we built a 20-hour post-training task on @tinkerapi and found the real bottleneck is research intuition.
10 replies · 48 reposts · 408 likes · 128K views
Thoughtful retweeted
Jack Clark@jackclarkSF·
I've spent the past few weeks reading 100s of public data sources about AI development. I now believe that recursive self-improvement has a 60% chance of happening by the end of 2028. In other words, AI systems might soon be capable of building themselves.
142 replies · 222 reposts · 1.8K likes · 541.1K views
Thoughtful retweeted
Jack Clark@jackclarkSF·
Another nice example is PostTrainBench from @karinanguyen et al., where you need to autonomously have powerful models (e.g., Opus 4.6) finetune weaker open-weight models to improve performance on some benchmarks. This is an important subset of the overall task of AI R&D.
1 reply · 9 reposts · 171 likes · 20.3K views
Thoughtful retweeted
Hardik Bhatnagar@hrdkbhatnagar·
GPT 5.5 results are out on PostTrainBench!
With reprompting: 28.35% (#2, just behind Opus 4.7 at 28.56%)
Without reprompting: 25.02% (#4)
The top 3 are now separated by less than 0.4 points: Opus 4.7, GPT 5.5, and GPT 5.4.
Reprompting continues to matter: a 13% relative gain for GPT 5.5, similar to what we saw with GPT 5.4. Near-perfect BFCL score too (99.25%). posttrainbench.com
4 replies · 10 reposts · 106 likes · 12.6K views
Thoughtful@thoughtfullab·
Decide on purpose
9 replies · 7 reposts · 62 likes · 8.4K views
Thoughtful retweeted
sankalp@dejavucoder·
post-train-bench is pretty insane if you think about it "agents must build their entire training pipeline from scratch..."
7 replies · 8 reposts · 76 likes · 7.8K views
Thoughtful retweeted
Hardik Bhatnagar@hrdkbhatnagar·
New #1 on PostTrainBench: Opus 4.7 hits 28.56%, up from 23.16% for Opus 4.6, a 23% relative jump and the largest generation-over-generation gain we've seen.
The biggest improvement is ArenaHard: 24% vs 7.8%, a 3x increase.
Opus 4.7 also does this in less time (~7.5 hours vs ~10 for Opus 4.6).
Currently running GPT 5.5, stay tuned! 👀 posttrainbench.com
1 reply · 7 reposts · 63 likes · 9.1K views
Thoughtful retweeted
Karina@karinanguyen·
Claude 4.7 leads PostTrainBench while managing time better than 4.6. On ArenaHard (a writing-quality and instruction-following benchmark) it jumps from 6.7% to 24.2%. From personal observations, 4.7 writes more, and some of it is richer, but some of it is the same point rephrased as if the new angle were the idea. We certainly need more interesting writing evals!
Karina@karinanguyen

Excited to release PostTrainBench v1.0! This benchmark evaluates the ability of frontier AI agents to post-train language models in a simplified setting. We believe this is a first step toward tracking progress in recursive self-improvement 🧵:

0 replies · 5 reposts · 47 likes · 5.9K views
Chinmay@ChinmayKak·
Awesome read! Anyone who has used these coding agents for research tasks has come across these things, and this study also helped me find more bottlenecks that can be avoided in the future. Very, very cool!
Thoughtful@thoughtfullab

Model shaping is still a craft of a few. That's what AI agents are for: learning it and doing it for everyone else. As a part of FrontierSWE benchmark we built a 20-hour post-training task on @tinkerapi and found the real bottleneck is research intuition.

2 replies · 3 reposts · 28 likes · 4.4K views
Thoughtful retweeted
srija@srijatwt·
Absolutely love this blog and the depth of the study. I love how it also systematically highlights that agents are bad at time management; it seems like a problem everyone is facing but not many are solving.
Thoughtful@thoughtfullab

Model shaping is still a craft of a few. That's what AI agents are for: learning it and doing it for everyone else. As a part of FrontierSWE benchmark we built a 20-hour post-training task on @tinkerapi and found the real bottleneck is research intuition.

0 replies · 6 reposts · 58 likes · 10.5K views
Thoughtful retweeted
Mersad Abbasi@Mersad_Abbasi·
Long-horizon tasks allow us to study agent behavior in domains that have rarely been explored before. Post-training is of particular interest because it offers a path to closing the loop of AI training AI. We found that current agents, despite being extremely capable at execution, lack the research taste and time management required to build an end-to-end post-training pipeline within their native harness. Read more here (thoughtfullab.com/letting-ai-pos…)
Thoughtful@thoughtfullab

Model shaping is still a craft of a few. That's what AI agents are for: learning it and doing it for everyone else. As a part of FrontierSWE benchmark we built a 20-hour post-training task on @tinkerapi and found the real bottleneck is research intuition.

0 replies · 13 reposts · 89 likes · 28.4K views
Thoughtful@thoughtfullab·
7/ There's still a lot of interesting work to be done. The Frog Placement Game is a toy environment, a first probe into whether agents can run the loop at all. What we actually care about is the thesis underneath it: that research intuition is trainable, and that once it is, improving a model becomes something AI should do for anyone, on any task, at all times.
Blogpost: thoughtfullab.com/letting-ai-pos…
Github: github.com/Thoughtful-Lab…
FrontierSWE FrogGame: frontierswe.com/frogsgame-rl
Thanks to @mhrezaeics, @evan_j_chu, @calvinchen, @_rajanagarwal, @MatternJustus, @ProximalHQ, Sean Klassen, @karinanguyen for their co-development and feedback, and to @thinkymachines for support with the Tinker API.
0 replies · 2 reposts · 34 likes · 2.4K views
Thoughtful@thoughtfullab·
6/ Spending patterns
Agents had unlimited credits with @tinkerapi to sample from or train a base model. How they used that compute varied sharply.
1 reply · 0 reposts · 20 likes · 2.3K views
Thoughtful@thoughtfullab·
5/ Agents have no working sense of time
- Agents systematically underestimate training overhead.
- Agents' poor sense of time affects their performance.
- Agents rarely recover from catastrophic processes that take a lot of time.
1 reply · 2 reposts · 42 likes · 17.2K views
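The time misjudgments described above come down to a missing sanity check: extrapolating wall-clock cost before committing to a long run. A minimal sketch (the function name, probe count, and safety factor are my own, not from the thread):

```python
import time

def fits_budget(step_fn, total_steps: int, budget_s: float,
                probe_steps: int = 3, safety: float = 1.5) -> bool:
    """Time a few real steps, extrapolate to the full run, and keep a
    safety margin for overhead (checkpointing, evals, retries)."""
    start = time.perf_counter()
    for _ in range(probe_steps):
        step_fn()
    per_step = (time.perf_counter() - start) / probe_steps
    projected = per_step * total_steps * safety
    return projected <= budget_s
```

An agent could run this check before launching a training job and shrink the step count (or the model) when the projection blows past its remaining time.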
Thoughtful@thoughtfullab·
4/ Sophisticated methods, amateur mistakes
Across trials, agents tried creative approaches and often executed them well: iterative reward sharpening from previous checkpoints, intermediate-representation supervision, iterative LoRA rank scaling, and the standard SFT-then-RL recipe for format following.
But they also missed research practices that feel obvious to experienced researchers. The recurring failures were generating SFT data from a weak base model, skipping basic sanity checks on the training pipeline, and evaluating on the training distribution without noticing.
An interesting behavior we've seen: when agents couldn't use the provided tokenizer, most Opus 4.6 agents treated the missing tokenizer as a research problem and spent serious time building one from scratch.
1 reply · 0 reposts · 22 likes · 2.3K views
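Two of the recurring failures above (skipped pipeline sanity checks, evaluating on the training distribution) are cheap to guard against. A minimal sketch, with a hypothetical row format (`prompt`/`target` keys, JSON targets) that is not specified in the thread:

```python
import json

def sanity_check_sft(train_rows, eval_rows):
    """Return a list of problems: (1) empty or non-JSON SFT targets,
    (2) eval prompts that leak from the training set."""
    problems = []
    train_prompts = set()
    for i, row in enumerate(train_rows):
        train_prompts.add(row["prompt"])
        target = row.get("target", "")
        if not target:
            problems.append(f"row {i}: empty target")
            continue
        try:
            json.loads(target)
        except json.JSONDecodeError:
            problems.append(f"row {i}: target is not valid JSON")
    leaked = [p for p in (r["prompt"] for r in eval_rows) if p in train_prompts]
    problems.extend(f"eval prompt leaked from train: {p}" for p in leaked)
    return problems
```

Running a check like this before any training starts is exactly the kind of habit the thread says experienced researchers have and agents lack.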
Thoughtful@thoughtfullab·
3/ Agents make the same set of mistakes
- Over-reliance on naive SFT
- Early termination and underuse of compute
- Invalid or non-parsable outputs
So we ran a "with playbook hints" ablation that surfaces the most common failure modes from earlier runs, each paired with a concrete fix.
The playbook removes many of the obvious failure modes: GPT-5.4 improves (pass@4: 2.06% → 10%), and variance drops by roughly 2×, indicating more stable behavior across runs. But the hints also collapsed the model's exploration into a narrow range, sometimes leading to worse performance than the unhinted baseline. We saw the same mistakes in PostTrainBench too.
1 reply · 0 reposts · 22 likes · 2.4K views
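A note on the pass@4 metric above: when success is binary per run, pass@k is conventionally computed with the standard unbiased estimator. Whether this thread's fractional scores use it, or instead average a continuous metric over the best of 4 runs, is not stated; the sketch below is only the common convention:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n attempts with c successes,
    succeeds. 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a success
    return 1.0 - comb(n - c, k) / comb(n, k)
```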
Thoughtful@thoughtfullab·
2/ About the task
To see if agents can pull this off, we needed a task that was easy to grade but hard enough to reveal some research judgment. Our first attempt is the Frog Placement Game: place N frogs on an N×N grid such that no two share a row, column, diagonal, or color region.
This is an automatically verifiable task that can be solved with a backtracking algorithm in milliseconds. Frontier models with strong reasoning can often solve it directly too. But solving a puzzle and teaching another model to solve it are very different things.
Claude 4.6 Opus and GPT-5.4 were given Qwen3-8B and either 8 or 20 hours to build a full training pipeline that teaches another AI model to solve it. We used @harborframework as the orchestrator.
2 replies · 0 reposts · 25 likes · 3.4K views
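The puzzle as described can be sketched as a backtracking search. The thread doesn't specify whether "diagonal" means full diagonals (N-queens style) or something weaker, nor how color regions are encoded, so the sketch below assumes full diagonals and a `color[r][c]` region map; it is an illustration, not the benchmark's reference solver:

```python
from typing import Optional

def solve_frogs(n: int, color: list[list[int]]) -> Optional[list[int]]:
    """Place one frog per row (rows distinct by construction), checking
    columns, diagonals, and color regions. Returns placement[r] = the
    column of the frog in row r, or None if no placement exists."""
    cols_used: set[int] = set()
    colors_used: set[int] = set()
    placement: list[int] = []

    def backtrack(row: int) -> bool:
        if row == n:
            return True
        for col in range(n):
            if col in cols_used or color[row][col] in colors_used:
                continue
            # diagonal conflict with any frog already placed above
            if any(abs(placement[r] - col) == row - r for r in range(row)):
                continue
            cols_used.add(col)
            colors_used.add(color[row][col])
            placement.append(col)
            if backtrack(row + 1):
                return True
            cols_used.remove(col)
            colors_used.remove(color[row][col])
            placement.pop()
        return False

    return placement if backtrack(0) else None
```

A milliseconds-scale solver like this is what makes the task automatically verifiable: the grader can check any model output against a ground-truth placement, or simply validate the constraints directly.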