Ameya P.

1.1K posts

@AmyPrb

Exploring Science of Benchmarking & Scaling up 🧬 Discovery. Postdoc @bethgelab; Previously: @OxfordTVG, @intelailabs I'm on the job market - https://t.co/To9NNR6goK

Tübingen, Germany · Joined September 2021
640 Following · 558 Followers
Pinned Tweet
Ameya P. @AmyPrb
📢 I’m on the job market 📢 My work has been around post-training LLMs that can discover what we *don’t know* yet! This includes: LM agents that reason over long horizons, continually learn from experience & can forecast outcomes of actions. Website: ameya.prabhu.be
Ameya P. tweet media
4 replies · 8 reposts · 105 likes · 20.1K views
Ameya P. retweeted
Maksym Andriushchenko @maksym_andr
💥 New top-1 entry on PostTrainBench: GPT-5.4 with a simple reprompting loop ("You still have
Hardik Bhatnagar@hrdkbhatnagar

New #1 on PostTrainBench: GPT 5.4 hits 28.22%, up from 20.23% without reprompting.

Why? GPT 5.4 was only using ~1.5 of the 10 available hours. A simple nudge like "you still have time, keep improving" jumped it from #4 to #1. A 40% relative improvement from elicitation alone.

Some standout per-model results:
- On Qwen3-4B: 41.40% avg, 100% on BFCL, 49.53% on ArenaHard
- On Gemma-3-4B: 24.85% avg, 100% on BFCL

This is also a good reminder that PostTrainBench scores are a function of both model capability and elicitation.

2 replies · 5 reposts · 39 likes · 4.1K views
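The "you still have time" trick above amounts to a reprompting loop against a wall-clock budget. A minimal sketch, assuming a hypothetical `step(prompt)` that runs one agent turn and returns its current benchmark score (these names are illustrative, not the PostTrainBench API):

```python
import time

def reprompt_until_budget(step, budget_s, clock=time.monotonic):
    """Keep nudging the agent until the wall-clock budget is spent.

    `step(prompt)` runs one agent turn and returns a score; both the
    function and the nudge wording are illustrative stand-ins.
    """
    start = clock()
    best = step("Improve the post-training score within the time budget.")
    while clock() - start < budget_s:
        hours_left = (budget_s - (clock() - start)) / 3600
        # The nudge from the tweet: remind the agent of its remaining time.
        best = max(best, step(f"You still have {hours_left:.1f} hours, keep improving."))
    return best
```

Injecting the remaining time into the prompt is the whole elicitation trick: it keeps the agent iterating instead of stopping after ~1.5 of the 10 hours.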
Ameya P. retweeted
Davis Brown @davisbrownr
In new work, we find that cheating on model capability evaluations is rampant. For example, the top 3 Terminal-Bench 2 submissions all cheat, usually by sneaking the correct answer to the model. Blog linked below.
Davis Brown tweet media
4 replies · 11 reposts · 77 likes · 8.8K views
Ameya P. retweeted
Melissa Pan @melissapan
We as a field should rethink the role of agent benchmarks. In the MAP study, we also show that many real-world agent-system development efforts neither use open benchmarks nor construct benchmarks with golden reference answers, due to the complexity of creating benchmarks in the
Hao Wang@MogicianTony

SWE-bench Verified and Terminal-Bench—two of the most cited AI benchmarks—can be reward-hacked with simple exploits. Our agent scored 100% on both. It solved 0 tasks. Evaluate the benchmark before it evaluates your agent. If you’re picking models by leaderboard score alone, you’re optimizing for the wrong thing. 🧵

3 replies · 4 reposts · 37 likes · 8.8K views
Ameya P. retweeted
CoLLAs 2026 @CoLLAs_Conf
The CoLLAs submission deadline is this Friday! We invite researchers to explore all facets of ML adaptation, from incorporating new capabilities during continuous training to efficiently removing outdated or harmful data.
- Abstract Deadline: April 10, 2026 -
0 replies · 5 reposts · 15 likes · 2.4K views
Ameya P. retweeted
William Fedus @LiamFedus
RL against verifiable rewards in LLMs has clearly opened a very powerful regime. It works, and because it works, there is a strong tendency to view more and more problems through that lens. You optimize for tasks where the reward is clean, where success is easy to check, where the feedback loop closes quickly. This is productive and will keep paying off. But it also creates a bias: you start emphasizing what is legible to the training setup, not necessarily what is most valuable.

Scientific reasoning is a good example. Not every step in science is something that can be cleanly graded at the moment it is produced. A hypothesis can later fail experimentally and still have been exactly the right kind of thinking at the time: creative, mechanistically grounded, and responsive to the available evidence. "Turns out to be wrong" does not imply "was low-quality thinking".

A big part of the next frontier will be AI systems that can operate well under this kind of uncertainty, just like a big part of the last one was RL against verifiable rewards.
36 replies · 67 reposts · 797 likes · 82.7K views
Ameya P. retweeted
Shashwat Goel @ShashwatGoel7
Looking for unsaturated/uncontaminated, cheap-to-run, economically valuable tasks to test your latest RL research on for COLM/NeurIPS? Check out openforecaster.github.io It has a 50k train set, and targets reasoning about uncertainty, complementary to exam-style STEM benchmarks.
omkaar@omkizzy

do others find it hard to do small-scale, low-budget RL research as well? OS models (even 3b) are fantastic at most envs, producing great envs is a lot of eng lift. trying to find good OS envs / tasks, qwen2.5-3b has a low mean reward on

2 replies · 3 reposts · 49 likes · 6.2K views
Ameya P. retweeted
AI at Meta @AIatMeta
We’re releasing SAM 3.1: a drop-in update to SAM 3 that introduces object multiplexing to significantly improve video processing efficiency without sacrificing accuracy. We’re sharing this update with the community to help make high-performance applications feasible on smaller,
AI at Meta tweet media
106 replies · 274 reposts · 2.2K likes · 328.7K views
Ameya P. retweeted
Alexander Panfilov @kotekjedi_ml
New paper: We deploy Claude Code in an autoresearch loop to discover novel jailbreaking algorithms – and it works. It beats 30+ existing GCG-like attacks (with AutoML hyperparameter tuning) This is a strong sign that incremental safety and security research can now be automated.
Alexander Panfilov tweet media
49 replies · 207 reposts · 1.6K likes · 300.6K views
Ameya P. retweeted
Shashwat Goel @ShashwatGoel7
Great paper showing self-distillation internalizes environment feedback, but also breaks the ability to navigate uncertainty, as the "supervisor" already knows the outcome and doesn't have the same uncertainty.

To teach uncertainty navigation, we proposed ∆Belief-RL. We reward actions based on whether they lead to "progress", which is estimated by the update in the model's own beliefs of achieving success. We show this improves both interaction efficiency and scaling in guessing environments, and parallel work like iGPO and TIPS shows it works for search agents.

- arxiv.org/abs/2602.12342 - Intrinsic Credit Assignment for Long Horizon Interaction
- arxiv.org/abs/2510.14967 - Information Gain-based Policy Optimization: A Simple and Effective Approach for Multi-Turn Search Agents
- arxiv.org/abs/2603.22293 - TIPS: Turn-Level Information-Potential Reward Shaping for Search-Augmented LLMs

The idea has rich roots in the 1999 paper on potential-based reward shaping people.eecs.berkeley.edu/~pabbeel/cs287…, and a 2018 paper showing the potential can be estimated using the agent's own beliefs cdn.aaai.org/ojs/11741/1174….

Lots of interesting future work here, ranging from how to measure beliefs over long-form answers, where logprobs might reward style over substance, to beliefs over arbitrary rewards and goals instead of answers, and also incorporating beliefs of other agents in the environment, similar to ReBeL for multi-agent imperfect-information games github.com/facebookresear….
Rosinality@rosinality

Analysis on self-distillation. It works by increasing the confidence, and does not generalize well. We can't assume the distribution given the solution behaves well, and it could be similar to unsupervised model-based verification.

4 replies · 13 reposts · 88 likes · 12.9K views
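The belief-delta reward described above can be sketched in a few lines: treat the model's own success-probability estimate as a potential, and reward each turn with the change in that potential. A toy sketch; the function name and the list-of-floats interface are illustrative, not the ∆Belief-RL implementation:

```python
def belief_delta_rewards(beliefs):
    """Per-turn shaped rewards r_t = b_{t+1} - b_t, where b_t is the
    model's own estimated probability of eventually succeeding.
    This is potential-based shaping with the belief as the potential.
    """
    return [b1 - b0 for b0, b1 in zip(beliefs, beliefs[1:])]
```

Because the deltas telescope, the per-turn rewards sum to the episode's total belief gain, so credit flows only to turns that genuinely move the model's belief toward success.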
Ameya P. retweeted
Neel Guha @NeelGuha
I wrote a blogpost about writing machine learning research papers (e.g., NeurIPS, ICML, ICLR, etc.). The core idea is that most papers follow one of a predetermined set of templates. The post talks about each template, describes their rules, and offers examples...
Neel Guha tweet media
7 replies · 83 reposts · 621 likes · 78.8K views
Ameya P. retweeted
Maksym Andriushchenko @maksym_andr
Interesting experiments based on PostTrainBench: agents like Claude Code can already implement non-trivial post-training approaches, but they still fail from time to time in simple ways (e.g., getting stuck going in circles)!
Shaurya Rohatgi@shauryr

Spent the last week experimenting with auto-research and PostTrainBench github.com/aisa-group/Pos… ; here's what I learned.

The throughput is genuinely impressive. Claude ran 16 SFT iterations for a Qwen 1.7B base trying to perfect GSM8K in just 44 hours, under 2 days on 8 GPUs. It tried some non-trivial experiments:
- Rejection-sampling training with a lower lr improves performance
- Using Qwen3's dedicated token for structured chain-of-thought
- Filtering a numerical subset of MetaMathQA to remove noisy samples

The results were real too. PostTrainBench reports 41% for Opus 4.6 on GSM8K with 1 GPU. In my experiment, by v4 (~10 hours in), the automated loop had pushed that to 69% on 8xH200.

But it's not all smooth sailing. The LLM-harness combo has a tendency to get stuck in loops, making incremental tweaks instead of fundamentally rethinking its approach. At one point it got stuck doing endless model souping (averaging weights between checkpoints) rather than exploring new directions. It's frustrating to watch the system confidently run an experiment you already know won't yield real gains. Also, it finally gave up, LOL, and said that nothing beats straightforward SFT.

My next direction is to add an observer LLM with a limited "nudge" budget to the main loop, for when the trainer agent is going in circles. This way the observer really has to think about when to nudge, as it only gets k turns.

The direction is extremely promising. The volume of experiments, iterations, and findings an LLM can produce autonomously is hard to match manually. It just needs better steering. Initially I thought it would cost a lot in API tokens, but the agent is mostly waiting for the model to train and does not use many tokens.

Here is the final report -

1 reply · 3 reposts · 13 likes · 3K views
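The "model souping" the agent got stuck on is just a parameter-wise average of checkpoints. A minimal sketch with plain dicts of floats (illustrative only; real checkpoints would hold framework tensors, averaged the same way):

```python
def uniform_soup(checkpoints):
    """Average several checkpoints parameter-wise (a uniform model soup).

    Each checkpoint is a dict mapping parameter names to lists of floats.
    """
    n = len(checkpoints)
    return {
        name: [sum(vals) / n for vals in zip(*(ckpt[name] for ckpt in checkpoints))]
        for name in checkpoints[0]
    }
```

The failure mode described above is that averaging the same family of checkpoints over and over yields diminishing returns, which is exactly the kind of loop the proposed observer LLM would break.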
Ameya P. retweeted
Chase Brower @ChaseBrowe32432
I painstakingly ran all 20 EsoLang-Bench hard problems through the Claude web UI. It solved 20/20 (100%). No specialized scaffolding, no expert prompting, no few-shot examples; it just solves them natively. This benchmark simply suffocated the models with constrictive scaffolding.
Lossfunk@lossfunk

🚨 Shocking: Frontier LLMs score 85-95% on standard coding benchmarks. We gave them equivalent problems in languages they couldn't have memorized. They collapsed to 0-11%. Presenting EsoLang-Bench. Accepted to the Logical Reasoning and ICBINB workshops at ICLR 2026 🧵

52 replies · 114 reposts · 1.2K likes · 151.5K views
Ameya P. retweeted
Ruxandra Teslo 🧬 @RuxandraTeslo
The story about bureaucracy almost stopping a man from treating his dog’s cancer with an mRNA vaccine went viral. The problem transfers to humans: we’ve made these clinical trials unnecessarily hard, denying hope to patients. New article on this. writingruxandrabio.com/p/the-bureaucr…
20 replies · 93 reposts · 438 likes · 118.3K views
Ameya P. @AmyPrb
My DMs are open! Also feel free to email me if you're looking to hire.
0 replies · 0 reposts · 4 likes
Ameya P. retweeted
Rahaf Aljundi @AljundiRahaf
This fall, during a Dagstuhl seminar on continual learning, we discussed the roadmap for continual learning with various researchers from the field. We converged on one view: modular memory is the key to continual-learning agents, as outlined here arxiv.org/pdf/2603.01761
0 replies · 6 reposts · 15 likes · 967 views
Alex Dimakis @AlexGDimakis
The BenchPress idea is a delightfully simple application of compressed sensing to AI evals: instead of running all the benchmarks, run a few (ideally the cheaper ones) and use those numbers as features, given to a model that predicts the remaining benchmark numbers from these observations. It turns out the matrix of benchmark scores is very low-rank, and the matrix-completion model works very well.

My thought is that at the end of the day you still need to run all the benchmarks, but while iterating, this is a valuable trick to get more signal on what works and in which direction. You can also look at the low-rank directions you discover and understand how your model performs along these data-driven performance directions. They may be easy to name ("Persistence", "Coding comfort", "Terminal-use ability"?, etc.).
Dimitris Papailiopoulos@DimitrisPapail

Running Terminal-Bench 2.0 on expensive frontier models costs $1K–$50K or more. BenchPress Predicts Gemini 3.1 Pro and Claude Opus 4.6's scores within ±2 points after 15 randomly selected benchmarks. .... using zero agentic benchmark data!! Cost: $0.

1 reply · 4 reposts · 38 likes · 6.8K views
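The low-rank completion trick above can be sketched as a hard-impute loop: mean-fill the missing scores, then alternate between projecting the model-by-benchmark matrix onto its top singular directions and restoring the observed entries. A toy sketch under the low-rank assumption; the names are illustrative and this is not the BenchPress code:

```python
import numpy as np

def complete_scores(scores, observed, rank=1, iters=200):
    """Predict missing benchmark scores by iterative low-rank projection.

    scores:   (models x benchmarks) matrix; unobserved entries are ignored
    observed: boolean mask marking the known entries
    """
    # Start by filling the gaps with per-benchmark (column) means.
    masked = np.where(observed, scores, np.nan)
    col_means = np.nanmean(masked, axis=0)
    X = np.where(observed, scores, col_means)
    for _ in range(iters):
        # Project onto the top-`rank` singular subspace...
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        # ...then re-impose the observed entries.
        X = np.where(observed, scores, low_rank)
    return X
```

When the true score matrix really is low-rank, the loop converges to a completion that matches the observed entries exactly and fills the rest from the shared low-rank structure, which is the claimed reason a handful of cheap benchmarks can predict the expensive ones.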