Thoughtful

3

190

Thoughtful retweetledi

Jack Clark@jackclarkSF·7h

Another nice example is PostTrainBench from @karinanguyen et al, where you need to autonomously have powerful models (e.g, Opus 4.6) finetune weaker open weight models to improve perf on some benchmarks. This is an important subset of the overall task of AI R&D.

English

9

171

20.3K

Thoughtful retweetledi

Hardik Bhatnagar@hrdkbhatnagar·1d

GPT 5.5 results are out on PostTrainBench! With reprompting: 28.35% (#2, just behind Opus 4.7 at 28.56%) Without reprompting: 25.02% (#4) The top 3 are now separated by less than 0.4 points - Opus 4.7, GPT 5.5, and GPT 5.4 Reprompting continues to matter: a 13% relative gain for GPT 5.5, similar to what we saw with GPT 5.4. Near-perfect BFCL score too (99.25%). posttrainbench.com

English

4

10

106

12.6K

Thoughtful@thoughtfullab·5d

Decide on purpose

English

9

7

62

8.4K

Thoughtful retweetledi

sankalp@dejavucoder·26 Nis

post-train-bench is pretty insane if you think about it "agents must build their entire training pipeline from scratch..."

English

7

8

76

7.8K

Thoughtful@thoughtfullab·25 Nis

@mhrezaeics @tinkerapi It was wonderful to collaborate!

English

1

71

MohammadHossein Rezaei@mhrezaeics·23 Nis

@thoughtfullab @tinkerapi Thanks for the opportunity! It was really fun.

English

6

858

Thoughtful@thoughtfullab·23 Nis

Model shaping is still a craft of a few. That's what AI agents are for: learning it and doing it for everyone else. As a part of FrontierSWE benchmark we built a 20-hour post-training task on @tinkerapi and found the real bottleneck is research intuition.

English

10

48

408

128K

Thoughtful retweetledi

Hardik Bhatnagar@hrdkbhatnagar·24 Nis

New #1 on PostTrainBench: Opus 4.7 hits 28.56%, up from 23.16% for Opus 4.6 - a 23% relative jump and the largest generation over generation gain we've seen. The biggest improvement is ArenaHard: 24% vs 7.8%, a 3x increase. Opus 4.7 also does this in less time (~7.5 hours vs ~10 for Opus 4.6). Currently running GPT 5.5, stay tuned! 👀 posttrainbench.com

English

7

63

9.1K

Thoughtful retweetledi

Karina@karinanguyen·24 Nis

Claude 4.7 leads PostTrainBench while managing time better than 4.6 On ArenaHard (a writing quality and instruction-following benchmark) it jumps from 6.7% to 24.2%. From personal observations, 4.7 writes more, and some of it is richer, but some of it is the same point rephrased as if the new angle were the idea. We certainly need more interesting writing evals!

Karina@karinanguyen

Excited to release PostTrainBench v1.0! This benchmark evaluates the ability of frontier AI agents to post-train language models in a simplified setting. We believe this is a first step toward tracking progress in recursive self-improvement 🧵:

English

5

47

5.9K

Chinmay@ChinmayKak·23 Nis

awesome read! anyone who has used these coding agents for research tasks has often come across these things, and this study also helped me find more bottlenecks that can be avoided in the future, very very cool!

Model shaping is still a craft of a few. That's what AI agents are for: learning it and doing it for everyone else. As a part of FrontierSWE benchmark we built a 20-hour post-training task on @tinkerapi and found the real bottleneck is research intuition.

English

2

3

28

4.4K

Thoughtful@thoughtfullab·23 Nis

@ChinmayKak Glad it's useful!

English

2

41

Thoughtful retweetledi

srija@srijatwt·23 Nis

absolutely love this blog and the depth of the study i love how it systematically also highlights how agents are bad at time management, seems like a problem everyone is facing but not many are solving

Model shaping is still a craft of a few. That's what AI agents are for: learning it and doing it for everyone else. As a part of FrontierSWE benchmark we built a 20-hour post-training task on @tinkerapi and found the real bottleneck is research intuition.

English

6

58

10.5K

Thoughtful retweetledi

Calvin Chen@calvinchen·23 Nis

Was great to work with @thoughtfullab and @karinanguyen on their task for FrontierSWE!

Model shaping is still a craft of a few. That's what AI agents are for: learning it and doing it for everyone else. As a part of FrontierSWE benchmark we built a 20-hour post-training task on @tinkerapi and found the real bottleneck is research intuition.

English

22

2.6K

Thoughtful retweetledi

Mersad Abbasi@Mersad_Abbasi·23 Nis

Long horizon tasks allow us to study agent behavior in domains that have rarely been explored before. Post-training is one that is of great interest as it offers the path to close the loop for AI training AI. We found that current agents, despite being extremely capable at execution, lack the research taste and time management required to build an end-to-end post-training pipeline within their native harness. Read more here (thoughtfullab.com/letting-ai-pos…)

Model shaping is still a craft of a few. That's what AI agents are for: learning it and doing it for everyone else. As a part of FrontierSWE benchmark we built a 20-hour post-training task on @tinkerapi and found the real bottleneck is research intuition.

English

13

89

28.4K

Thoughtful@thoughtfullab·23 Nis

7/ There’s still a lot more and interesting work to be done. The Frog Placement Game is a toy environment, a first probe into whether agents can run the loop at all. What we actually care about is the thesis underneath it: that research intuition is trainable, and that once it is, improving a model becomes something AI should do for anyone, on any task, at all times. Blogpost: thoughtfullab.com/letting-ai-pos… Github: github.com/Thoughtful-Lab… FrontierSWE FrogGame: frontierswe.com/frogsgame-rl Thanks to @mhrezaeics, @evan_j_chu, @calvinchen, @_rajanagarwal, @MatternJustus, @ProximalHQ, Sean Klassen, @karinanguyen for their co-development and feedback, and to @thinkymachines for support with the Tinker API.

English

2

34

2.4K

Thoughtful@thoughtfullab·23 Nis

6/ Spending patterns Agents had infinite credits with @tinkerapi to sample or train a base model. How they used that compute varied sharply.

English

0

20

2.3K

Thoughtful@thoughtfullab·23 Nis

5/ Agents have no working sense of time - Agents systematically underestimate training overhead. - Agents' poor sense of time affects their performance. - Agents rarely recover from catastrophic processes that take a lot of time.

English

2

42

17.2K

Thoughtful@thoughtfullab·23 Nis

4/ Sophisticated methods, amateur mistakes Across trials, agents tried creative approaches and often executed them well: iterative reward sharpening from previous checkpoints, intermediate representation supervision, iterative LoRA rank scaling, and the standard SFT-then-RL recipe for format following. But they also miss the research practices that feel obvious to experienced researchers. The recurring failures were generating SFT data from a weak base model, skipping basic sanity checks on the training pipeline, and evaluating on the training distribution without noticing. An interesting behavior we’ve seen: when agents couldn’t use the provided tokenizer, most Opus 4.6 agents treated the missing tokenizer as a research problem and spent serious time building one from scratch

English

0

22

2.3K

Thoughtful@thoughtfullab·23 Nis

3/ Agents make the same set of mistakes - Over-reliance on naive SFT - Early termination and underuse of compute - Invalid or non-parsable outputs So we ran “with playbook hints” ablation that surfaces the most common failure modes from earlier runs, each paired with a concrete fix. The playbook successfully removes many of the obvious failure modes. GPT-5.4 improves (pass@4: 2.06% → 10%), and variance drops by roughly 2×, indicating more stable behavior across runs. But despite these changes, hints collapsed the model's exploration into a narrow range, leading to worse performance compared to base models. Those mistakes were seen in PostTrainBench too.

English