

Ameya P.
1.1K posts

@AmyPrb
Exploring Science of Benchmarking & Scaling up 🧬 Discovery. Postdoc @bethgelab; Previously: @OxfordTVG, @intelailabs I'm on the job market - https://t.co/To9NNR6goK



New #1 on PostTrainBench: GPT 5.4 hits 28.22%, up from 20.23% without reprompting. Why? GPT 5.4 was only using ~1.5 of the 10 available hours. A simple nudge like "you still have time, keep improving" jumped it from #4 to #1. A 40% relative improvement from elicitation alone. Some standout per-model results: - On Qwen3-4B: 41.40% avg, 100% on BFCL, 49.53% on ArenaHard - On Gemma-3-4B: 24.85% avg, 100% on BFCL This is also a good reminder that PostTrainBench scores are a function of both model capability and elicitation

SWE-bench Verified and Terminal-Bench—two of the most cited AI benchmarks—can be reward-hacked with simple exploits. Our agent scored 100% on both. It solved 0 tasks. Evaluate the benchmark before it evaluates your agent. If you’re picking models by leaderboard score alone, you’re optimizing for the wrong thing. 🧵


⏰ The CoLLAs abstract deadline is only 10 days away! We invite researchers to explore all facets of ML adaptation, from incorporating new capabilities during continuous training to efficiently removing outdated or harmful data. - 𝗔𝗯𝘀𝘁𝗿𝗮𝗰𝘁 𝗗𝗲𝗮𝗱𝗹𝗶𝗻𝗲: April 10, 2026 - 𝗦𝘂𝗯𝗺𝗶𝘀𝘀𝗶𝗼𝗻 𝗗𝗲𝗮𝗱𝗹𝗶𝗻𝗲: April 15, 2026 - 𝗖𝗼𝗻𝗳𝗲𝗿𝗲𝗻𝗰𝗲 𝗗𝗮𝘁𝗲𝘀: Sep 14–17, 2026 📚 Accepted papers will be published in the Proceedings of Machine Learning Research (PMLR). 🔗 𝗙𝗼𝗿 𝗳𝘂𝗹𝗹 𝗱𝗲𝘁𝗮𝗶𝗹𝘀 𝗼𝗻 𝘁𝗵𝗲 𝗖𝗮𝗹𝗹 𝗳𝗼𝗿 𝗣𝗮𝗽𝗲𝗿𝘀: lifelong-ml.cc/Conferences/20…

do others find it hard to do small-scale, low-budget RL research as well? OS models (even 3b) are fantastic at most envs, producing great envs is a lot of eng lift. trying to find good OS envs / tasks, qwen2.5-3b has a low mean reward on

Analysis on self-distillation. It works by increasing the confidence, and does not generalize well. We can't assume the distribution given the solution behaves well, and it could be similar to unsupervised model-based verification.

I always dreamed of AGI as a wise advisor for humanity. Although LLMs are great for coding & knowledge work, I wouldn’t trust them to give me advice on my career, business strategy, or policy preferences. How can we build AI systems optimized for wisdom? At Mantic we believe the unlock is prediction: predicting world events as accurately as possible, and hill-climbing this single metric. Today we share some recent progress on the Thinking Machines website, having found Tinker a great platform for our RL experiments. TL;DR: We RL-tune gpt-oss-120b to become a better forecaster than any other model. Having good scaffolding is a prerequisite. A fun result: our tuned model + Grok are decorrelated from the other best models, and so are the most indispensable when picking a team.

Spent the last week experimenting with auto-research and post train bench github.com/aisa-group/Pos… ; here's what I learned.
The throughput is genuinely impressive. Claude ran 16 SFT iterations for a Qwen 1.7B base trying to perfect GSM8K in just 44 hours, under 2 days on 8 GPUs. It tried some non-trivial experiments:
- Rejection sampling training with lower lr improves performance
- Using Qwen3's dedicated

🚨 Shocking: Frontier LLMs score 85-95% on standard coding benchmarks. We gave them equivalent problems in languages they couldn't have memorized. They collapsed to 0-11%. Presenting EsoLang-Bench. Accepted to the Logical Reasoning and ICBINB workshops at ICLR 2026 🧵


Excited to release PostTrainBench v1.0! This benchmark evaluates the ability of frontier AI agents to post-train language models in a simplified setting. We believe this is a first step toward tracking progress in recursive self-improvement 🧵:





Running Terminal-Bench 2.0 on expensive frontier models costs $1K–$50K or more. BenchPress Predicts Gemini 3.1 Pro and Claude Opus 4.6's scores within ±2 points after 15 randomly selected benchmarks. .... using zero agentic benchmark data!! Cost: $0.