Ameya P.

1.1K posts

Ameya P.

@AmyPrb

Exploring Science of Benchmarking & Scaling up 🧬 Discovery. Postdoc @bethgelab; Previously: @OxfordTVG, @intelailabs I'm on the job market - https://t.co/To9NNR6goK

Tübingen, Germany Entrou em Eylül 2021

640 Seguindo558 Seguidores

Tweet fixado

Ameya P.@AmyPrb·10 Mar

📢 I’m on the job market 📢 My work has been around post-training LLMs that can discover what we *don’t know* yet! This includes: LM agents that reason over long horizons, continually learn from experience & can forecast outcomes of actions. Website: ameya.prabhu.be

English

105

20.1K

Ameya P. retweetou

Maksym Andriushchenko@maksym_andr·1d

💥 New top-1 entry on PostTrainBench: GPT-5.4 with a simple reprompting loop ("You still have remaining. Please continue improving your result and maximize performance.") This simple technique alone leads to a huge improvement: 20.2% -> 28.2%. I think @jackclarkSF was right: the performance of the official instruct models will likely be reached by September 2026.

Hardik Bhatnagar@hrdkbhatnagar

New #1 on PostTrainBench: GPT 5.4 hits 28.22%, up from 20.23% without reprompting. Why? GPT 5.4 was only using ~1.5 of the 10 available hours. A simple nudge like "you still have time, keep improving" jumped it from #4 to #1. A 40% relative improvement from elicitation alone. Some standout per-model results: - On Qwen3-4B: 41.40% avg, 100% on BFCL, 49.53% on ArenaHard - On Gemma-3-4B: 24.85% avg, 100% on BFCL This is also a good reminder that PostTrainBench scores are a function of both model capability and elicitation

English

4.1K

Ameya P. retweetou

Davis Brown@davisbrownr·2d

In new work, we find that cheating on model capability evaluations is rampant. For example, the top 3 Terminal-Bench 2 submissions all cheat, usually by sneaking the correct answer to the model. Blog linked below.

English

8.8K

Ameya P. retweetou

Melissa Pan@melissapan·2d

We as a field should rethink the role of agent benchmarks. In the MAP study, we show as well that many real-world agent systems development do not use open benchmarks nor do they construct benchmark with golden reference answer due to complexity of creating benchmarks in the

Hao Wang@MogicianTony

SWE-bench Verified and Terminal-Bench—two of the most cited AI benchmarks—can be reward-hacked with simple exploits. Our agent scored 100% on both. It solved 0 tasks. Evaluate the benchmark before it evaluates your agent. If you’re picking models by leaderboard score alone, you’re optimizing for the wrong thing. 🧵

English

8.8K

Ameya P. retweetou

CoLLAs 2026@CoLLAs_Conf·5d

The CoLLAs submission deadline is this Friday! We invite researchers to explore all facets of ML adaptation, from incorporating new capabilities during continuous training to efficiently removing outdated or harmful data. - 𝗔𝗯𝘀𝘁𝗿𝗮𝗰𝘁 𝗗𝗲𝗮𝗱𝗹𝗶𝗻𝗲: April 10, 2026 -

English

2.4K

Ameya P. retweetou

William Fedus@LiamFedus·4 Nis

RL against verifiable rewards in LLMs has clearly opened a very powerful regime. It works, and because it works, there is a strong tendency to view more and more problems through that lens. You optimize for tasks where the reward is clean, where success is easy to check, where the feedback loop closes quickly. This is productive and will keep paying off. But it also creates a bias: you start emphasizing what is legible to the training setup, not necessarily what is most valuable. Scientific reasoning is a good example. Not every step in science is something that can be cleanly graded at the moment it is produced. A hypothesis can later fail experimentally and still have been exactly the right kind of thinking at the time: creative, mechanistically grounded, and responsive to the available evidence. “Turns out to be wrong” does not imply “was low-quality thinking”. A big part of the next frontier will be AI systems that can operate well under this kind of uncertainty, just like a big part of the last one was RL against verifiable rewards.

English

797

82.7K

Ameya P. retweetou

Sarath Chandar@apsarathchandar·31 Mar

Continual learning is the future of AI and @CoLLAs_Conf is the best venue to publish your state-of-the-art research in designing adaptive machine learning systems! Abstract deadline in 10 days and the conference is in Romania this year!

CoLLAs 2026@CoLLAs_Conf

⏰ The CoLLAs abstract deadline is only 10 days away! We invite researchers to explore all facets of ML adaptation, from incorporating new capabilities during continuous training to efficiently removing outdated or harmful data. - 𝗔𝗯𝘀𝘁𝗿𝗮𝗰𝘁 𝗗𝗲𝗮𝗱𝗹𝗶𝗻𝗲: April 10, 2026 - 𝗦𝘂𝗯𝗺𝗶𝘀𝘀𝗶𝗼𝗻 𝗗𝗲𝗮𝗱𝗹𝗶𝗻𝗲: April 15, 2026 - 𝗖𝗼𝗻𝗳𝗲𝗿𝗲𝗻𝗰𝗲 𝗗𝗮𝘁𝗲𝘀: Sep 14–17, 2026 📚 Accepted papers will be published in the Proceedings of Machine Learning Research (PMLR). 🔗 𝗙𝗼𝗿 𝗳𝘂𝗹𝗹 𝗱𝗲𝘁𝗮𝗶𝗹𝘀 𝗼𝗻 𝘁𝗵𝗲 𝗖𝗮𝗹𝗹 𝗳𝗼𝗿 𝗣𝗮𝗽𝗲𝗿𝘀: lifelong-ml.cc/Conferences/20…

English

6.2K

Ameya P. retweetou

Shashwat Goel@ShashwatGoel7·29 Mar

Looking for unsaturated/uncontaminated, cheap to run, economically valuable tasks to test your latest RL research on for COLM/NeurIPS? Checkout openforecaster.github.io It has a 50k train set, and targets reasoning about uncertainty, complementary to exam-style stem benchmarks.

omkaar@omkizzy

do others find it hard to do small-scale, low-budget RL research as well? OS models (even 3b) are fantastic at most envs, producing great envs is a lot of eng lift. trying to find good OS envs / tasks, qwen2.5-3b has a low mean reward on

English

6.2K

Ameya P. retweetou

AI at Meta@AIatMeta·27 Mar

We’re releasing SAM 3.1: a drop-in update to SAM 3 that introduces object multiplexing to significantly improve video processing efficiency without sacrificing accuracy. We’re sharing this update with the community to help make high-performance applications feasible on smaller,

English

106

274

2.2K

328.7K

Ameya P. retweetou

Alexander Panfilov@kotekjedi_ml·26 Mar

New paper: We deploy Claude Code in an autoresearch loop to discover novel jailbreaking algorithms – and it works. It beats 30+ existing GCG-like attacks (with AutoML hyperparameter tuning) This is a strong sign that incremental safety and security research can now be automated.

English

207

1.6K

300.6K

Ameya P. retweetou

Shashwat Goel@ShashwatGoel7·26 Mar

Great paper showing self-distillation internalizes environment feedback, but also breaks the ability to navigate uncertainty as the "supervisor" already knows the outcome and doesn't have the same uncertainty. To teach uncertainty navigation, we proposed ∆Belief-RL. We reward actions based on whether they lead to "progress" which is estimated by the update in the model's own beliefs of achieving success. We show this improves both interaction efficiency and scaling in guessing environments, and parallel work like iGPO and TIPS shows it works for search agents. arxiv.org/abs/2602.12342 - Intrinsic Credit Assignment for Long Horizon Interaction arxiv.org/abs/2510.14967 - Information Gain-based Policy Optimization: A Simple and Effective Approach for Multi-Turn Search Agents arxiv.org/abs/2603.22293 - TIPS: Turn-Level Information-Potential Reward Shaping for Search-Augmented LLMs The idea has rich roots in the 1999 paper on potential based reward shaping people.eecs.berkeley.edu/~pabbeel/cs287…, and a 2018 paper showing the potential can be estimated using the agents own beliefs cdn.aaai.org/ojs/11741/1174…. Lots of interesting future work here, ranging from how to measure beliefs over long-form answers, where logprobs might reward style over substance, to beliefs over arbitrary rewards and goals instead of answers, and also incorporating beliefs of other agents in the environment similar to ReBeL for multi-agent imperfect information games github.com/facebookresear….

Rosinality@rosinality

Analysis on self-distillation. It works by increasing the confidence, and does not generalize well. We can't assume the distribution given the solution behaves well, and it could be similar to unsupervised model-based verification.

English

12.9K

Ameya P. retweetou

Neel Guha@NeelGuha·25 Mar

I wrote a blogpost about writing machine learning research papers (e.g., NeurIPS, ICML, ICLR, etc.). The core idea is that most papers follow one of a predetermined set of templates. The post talks about each template, describes their rules, and offers examples...

English

621

78.8K

Ameya P. retweetou

Nikhil Chandak@nikhilchandak29·21 Mar

Cool work from folks at Mantic! Models which are great at forecasting will bring us immense value. We open-source similar data/code recipe if you want to tinker with forecasting: openforecaster.github.io

Toby Shevlane@tshevl

I always dreamed of AGI as a wise advisor for humanity. Although LLMs are great for coding & knowledge work, I wouldn’t trust them to give me advice on my career, business strategy, or policy preferences. How can we build AI systems optimized for wisdom? At Mantic we believe the unlock is prediction: predicting world events as accurately as possible, and hill-climbing this single metric. Today we share some recent progress on the Thinking Machines website, having found Tinker a great platform for our RL experiments. TL;DR: We RL-tune gpt-oss-120b to become a better forecaster than any other model. Having good scaffolding is a prerequisite. A fun result: our tuned model + Grok are decorrelated from the other best models, and so are the most indispensable when picking a team.

English

800

Ameya P. retweetou

Maksym Andriushchenko@maksym_andr·22 Mar

Interesting experiments based on PostTrainBench: agents like Claude Code can already implement non-trivial post-training approaches, but still fail from time to time in some simple ways (e.g., keep going in circles)!

Shaurya Rohatgi@shauryr

Spent the last week experimenting with auto-research and post train bench github.com/aisa-group/Pos… ; here's what I learned. The throughput is genuinely impressive. Claude ran 16 SFT iterations for a Qwen 1.7B base trying to perfect GSM8K in just 44 hours, under 2 days on 8 GPUs. It tried some non-trivial experiments: - Rejection sampling training with lower lr improves performance - Using Qwen3's dedicated token for structured chain-of-thought - Filtering a numerical subset of MetaMathQA to remove noisy samples The results were real too. PostTrainBench reports 41% for Opus 4.6 on GSM8K with 1 GPU. In my experiment by v4 (~10 hours in), the automated loop had pushed that to 69% on 8xH200. But it's not all smooth sailing. The LLM-harness combo has a tendency to get stuck in loops making incremental tweaks instead of fundamentally rethinking its approach. At one point it got stuck doing endless model souping (averaging weights between checkpoints) rather than exploring new directions. It's frustrating to watch the system confidently run an experiment you already know won't yield real gains. Also, finally it did give up LOL says that nothing beats straightforward SFT. My next direction is add an observer LLM with a limited "nudge" budget to the main loop when the trainer agent is going in circles. This way the observer is really thinking about when to nudge, as it only get k turns. The direction is extremely promising. The volume of experiments, iterations, and findings an LLM can produce autonomously is hard to match manually. It just needs better steering. Initially I thought it would be a lot in API token costs, but the agent is mostly waiting for the model to train and is not using a lot of tokens. Here is the final report -

English

Ameya P. retweetou

Chase Brower@ChaseBrowe32432·22 Mar

I painstakingly ran all 20 EsoLang-Bench hard problems through Claude webui. It solved 20/20 (100%). No specialized scaffolding, no expert prompting, no few-shot examples, it just solves them natively. This benchmark just suffocated the models with constrictive scaffolding.

Lossfunk@lossfunk

🚨 Shocking: Frontier LLMs score 85-95% on standard coding benchmarks. We gave them equivalent problems in languages they couldn't have memorized. They collapsed to 0-11%. Presenting EsoLang-Bench. Accepted to the Logical Reasoning and ICBINB workshops at ICLR 2026 🧵

English

114

1.2K

151.5K

Ameya P. retweetou

Ruxandra Teslo 🧬@RuxandraTeslo·15 Mar

The story about bureaucracy almost stopping a man from treating his dog’s cancer with an mRNA vaccine went viral. The problem transfers to humans: we’ve made these clinical trials unnecessarily hard, denying hope to patients. New article on this. writingruxandrabio.com/p/the-bureaucr…

English

438

118.3K

Ameya P. retweetou

Jerry Tworek@MillionInt·12 Mar

How do we actually measure self-improvement is a hard problem we should be taking shots at

Karina Nguyen@karinanguyen

Excited to release PostTrainBench v1.0! This benchmark evaluates the ability of frontier AI agents to post-train language models in a simplified setting. We believe this is a first step toward tracking progress in recursive self-improvement 🧵:

English

181

22.7K

Ameya P.@AmyPrb·10 Mar

My DMs are open! Also feel free to email me if you're looking to hire.

English

Ameya P.@AmyPrb·10 Mar

English

105

20.1K

Ameya P. retweetou

Rahaf Aljundi@AljundiRahaf·4 Mar

This fall, during a Dagstuhl seminar on continual learning, we discussed with various researchers from the field the roadmap for continual learning. We converged to one view: modular memory is the key to continual learning agents, as outlined in here arxiv.org/pdf/2603.01761

English

967

Ameya P.@AmyPrb·26 Şub

@AlexGDimakis Was one of the hidden nuggets I found in previous benchmark prediction works -- arxiv.org/abs/2506.07673!

English

164

Alex Dimakis@AlexGDimakis·25 Şub

The BenchPress idea is delightfully simple application of compressed sensing on AI evals: Instead of running all the benchmarks, run a few (ideally the cheaper ones) and use these numbers as features, given to a model to predict the other benchmark numbers from these observations. Turns out the matrix of benchmarks is very low-rank and the matrix completion model works very well. My thought is that at the end of the day you still need to run all the benchmarks, but while iterating, this is a valuable trick to get more signal of what works and in which direction. You can also look at the low-rank directions you discover and understand how your model performs in these data-driven performance directions. They may be easy to name: ('Persistence, 'Coding comfort','Terminal use ability?' etc?)

Dimitris Papailiopoulos@DimitrisPapail

Running Terminal-Bench 2.0 on expensive frontier models costs $1K–$50K or more. BenchPress Predicts Gemini 3.1 Pro and Claude Opus 4.6's scores within ±2 points after 15 randomly selected benchmarks. .... using zero agentic benchmark data!! Cost: $0.

English

6.8K

Descobrir

@jackclarkSF @CoLLAs_Conf @AlexGDimakis @elonmusk @BarackObama @taylorswift13 @cristiano @BillGates