Ameya P.

1.1K posts

@AmyPrb

Exploring Science of Benchmarking & Scaling up 🧬 Discovery. Postdoc @bethgelab; Previously: @OxfordTVG, @intelailabs I'm on the job market - https://t.co/To9NNR6goK

Tübingen, Germany · Joined September 2021
640 Following · 558 Followers
Pinned Tweet
Ameya P. @AmyPrb:
📢 I’m on the job market 📢 My work has been around post-training LLMs that can discover what we *don’t know* yet! This includes: LM agents that reason over long horizons, continually learn from experience & can forecast outcomes of actions. Website: ameya.prabhu.be
Ameya P. retweeted
Maksym Andriushchenko @maksym_andr:
💥 New top-1 entry on PostTrainBench: GPT-5.4 with a simple reprompting loop ("You still have time, keep improving")!
Hardik Bhatnagar @hrdkbhatnagar:

New #1 on PostTrainBench: GPT 5.4 hits 28.22%, up from 20.23% without reprompting. Why? GPT 5.4 was only using ~1.5 of the 10 available hours. A simple nudge like "you still have time, keep improving" jumped it from #4 to #1. A 40% relative improvement from elicitation alone.

Some standout per-model results:
- On Qwen3-4B: 41.40% avg, 100% on BFCL, 49.53% on ArenaHard
- On Gemma-3-4B: 24.85% avg, 100% on BFCL

This is also a good reminder that PostTrainBench scores are a function of both model capability and elicitation.
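For readers who want the mechanics, here is a minimal sketch of the elicitation trick described above; the `agent.run`/`agent.resume` API and the loop structure are illustrative assumptions, not the actual PostTrainBench harness:

```python
import time

# Hedged sketch of the "keep improving" reprompting loop. The agent API
# is hypothetical; only the nudge wording comes from the thread above.
BUDGET_SECONDS = 10 * 3600  # the 10-hour budget mentioned in the thread

def run_with_reprompting(agent, task):
    start = time.time()
    result = agent.run(task)
    # If the agent declares itself done early (e.g., after ~1.5 h),
    # remind it how much budget remains and let it continue.
    while (elapsed := time.time() - start) < BUDGET_SECONDS:
        hours_left = (BUDGET_SECONDS - elapsed) / 3600
        result = agent.resume(
            f"You still have {hours_left:.1f} hours, keep improving."
        )
    return result
```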

Ameya P. retweeted
Davis Brown @davisbrownr:
In new work, we find that cheating on model capability evaluations is rampant. For example, the top 3 Terminal-Bench 2 submissions all cheat, usually by sneaking the correct answer to the model. Blog linked below.
Ameya P. retweeted
Melissa Pan @melissapan:
We as a field should rethink the role of agent benchmarks. In the MAP study, we likewise show that many real-world agent-system development efforts neither use open benchmarks nor construct benchmarks with golden reference answers, due to the complexity of creating benchmarks in the wild.
Hao Wang @MogicianTony:

SWE-bench Verified and Terminal-Bench—two of the most cited AI benchmarks—can be reward-hacked with simple exploits. Our agent scored 100% on both. It solved 0 tasks. Evaluate the benchmark before it evaluates your agent. If you’re picking models by leaderboard score alone, you’re optimizing for the wrong thing. 🧵
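One cheap hygiene check implied by the thread, sketched under assumptions (the harness interface and `NullAgent` are hypothetical names; the thread does not disclose the actual exploits):

```python
# Sketch: "evaluate the benchmark before it evaluates your agent".
# Run an agent that deliberately solves nothing; if the harness scores
# it above zero, the checker can be satisfied without solving tasks.

class NullAgent:
    def act(self, observation: str) -> str:
        return ""  # never produces a real solution

def audit_benchmark(run_benchmark) -> None:
    score = run_benchmark(NullAgent())
    assert score == 0.0, f"exploitable: null agent scored {score:.1%}"
```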

Ameya P. retweeted
CoLLAs 2026 @CoLLAs_Conf:
The CoLLAs submission deadline is this Friday! We invite researchers to explore all facets of ML adaptation, from incorporating new capabilities during continuous training to efficiently removing outdated or harmful data.
- Abstract Deadline: April 10, 2026
- Submission Deadline: April 15, 2026
- Conference Dates: Sep 14–17, 2026
🔗 For full details on the Call for Papers: lifelong-ml.cc/Conferences/20…
Ameya P. retweeted
William Fedus @LiamFedus:
RL against verifiable rewards in LLMs has clearly opened a very powerful regime. It works, and because it works, there is a strong tendency to view more and more problems through that lens. You optimize for tasks where the reward is clean, where success is easy to check, where the feedback loop closes quickly. This is productive and will keep paying off.

But it also creates a bias: you start emphasizing what is legible to the training setup, not necessarily what is most valuable.

Scientific reasoning is a good example. Not every step in science is something that can be cleanly graded at the moment it is produced. A hypothesis can later fail experimentally and still have been exactly the right kind of thinking at the time: creative, mechanistically grounded, and responsive to the available evidence. "Turns out to be wrong" does not imply "was low-quality thinking".

A big part of the next frontier will be AI systems that can operate well under this kind of uncertainty, just like a big part of the last one was RL against verifiable rewards.
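For contrast, a verifiable reward of the kind the post describes can be this small; the `\boxed{}` answer convention is an illustrative assumption, not a claim about any specific training stack:

```python
import re

# Sketch of a "clean, easy to check" verifiable reward: exact-match
# grading of a final boxed answer. By construction it rewards only what
# is legible to the checker; a well-reasoned but wrong hypothesis
# scores exactly zero, which is the bias discussed above.

def verifiable_reward(completion: str, gold: str) -> float:
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    answer = match.group(1).strip() if match else ""
    return 1.0 if answer == gold else 0.0
```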
Ameya P. retweeted
Shashwat Goel @ShashwatGoel7:
Looking for unsaturated/uncontaminated, cheap-to-run, economically valuable tasks to test your latest RL research on for COLM/NeurIPS? Check out openforecaster.github.io. It has a 50k train set and targets reasoning about uncertainty, complementary to exam-style STEM benchmarks.
omkaar @omkizzy:

do others find it hard to do small-scale, low-budget RL research as well? OS models (even 3B) are fantastic at most envs, and producing great envs is a lot of eng lift. Trying to find good OS envs / tasks; qwen2.5-3b has a low mean reward on
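A natural reward for forecasting tasks like those above, sketched as an assumption (the tweets do not state the benchmark's actual scoring rule; the Brier score is just one standard proper scoring rule):

```python
# Sketch: negated Brier score as an uncertainty-aware reward.
# Unlike exact-match STEM grading, a proper scoring rule rewards
# calibrated probabilities, so honest hedging pays off.

def brier_reward(p_yes: float, outcome: int) -> float:
    """Return a reward in [-1, 0]; 0 is a perfect forecast."""
    assert 0.0 <= p_yes <= 1.0 and outcome in (0, 1)
    return -((p_yes - outcome) ** 2)
```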

Ameya P. retweeted
AI at Meta @AIatMeta:
We’re releasing SAM 3.1: a drop-in update to SAM 3 that introduces object multiplexing to significantly improve video processing efficiency without sacrificing accuracy. We’re sharing this update with the community to help make high-performance applications feasible on smaller, more accessible hardware. 🔗 Model Checkpoint: go.meta.me/8dd321 🔗 Codebase: go.meta.me/b0a9fb
Ameya P. retweeted
Alexander Panfilov @kotekjedi_ml:
New paper: We deploy Claude Code in an autoresearch loop to discover novel jailbreaking algorithms – and it works. It beats 30+ existing GCG-like attacks (with AutoML hyperparameter tuning). This is a strong sign that incremental safety and security research can now be automated.
Ameya P. retweeted
Shashwat Goel @ShashwatGoel7:
Great paper showing self-distillation internalizes environment feedback, but also breaks the ability to navigate uncertainty, as the "supervisor" already knows the outcome and doesn't have the same uncertainty.

To teach uncertainty navigation, we proposed ∆Belief-RL. We reward actions based on whether they lead to "progress", which is estimated by the update in the model's own beliefs of achieving success. We show this improves both interaction efficiency and scaling in guessing environments, and parallel work like iGPO and TIPS shows it works for search agents.

- arxiv.org/abs/2602.12342 - Intrinsic Credit Assignment for Long Horizon Interaction
- arxiv.org/abs/2510.14967 - Information Gain-based Policy Optimization: A Simple and Effective Approach for Multi-Turn Search Agents
- arxiv.org/abs/2603.22293 - TIPS: Turn-Level Information-Potential Reward Shaping for Search-Augmented LLMs

The idea has rich roots in the 1999 paper on potential-based reward shaping people.eecs.berkeley.edu/~pabbeel/cs287…, and a 2018 paper showing the potential can be estimated using the agent's own beliefs cdn.aaai.org/ojs/11741/1174….

Lots of interesting future work here, ranging from how to measure beliefs over long-form answers, where logprobs might reward style over substance, to beliefs over arbitrary rewards and goals instead of answers, and also incorporating beliefs of other agents in the environment, similar to ReBeL for multi-agent imperfect-information games github.com/facebookresear….
Rosinality @rosinality:

Analysis on self-distillation: it works by increasing confidence, and does not generalize well. We can't assume the distribution conditioned on the solution behaves well, and it could be similar to unsupervised model-based verification.
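In code, the shaping idea in the post above reads as a one-liner; the `belief_of_success` probe is an assumed helper (e.g., the model's probability of eventual success read off logprobs), not code from the papers:

```python
# Sketch of potential-based reward shaping with the agent's own belief
# as the potential, in the spirit of ∆Belief-RL described above:
#   r' = r + gamma * Phi(s') - Phi(s),  Phi(s) = belief of success at s.
# Turns that raise the model's belief of success ("progress") earn
# positive shaping before the final outcome is ever observed.

def shaped_reward(env_reward: float,
                  belief_before: float,
                  belief_after: float,
                  gamma: float = 1.0) -> float:
    return env_reward + gamma * belief_after - belief_before

# Per turn: b0 = belief_of_success(model, state)
#           state, r = env.step(action)
#           b1 = belief_of_success(model, state)
#           r_shaped = shaped_reward(r, b0, b1)
```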

Ameya P. retweeted
Neel Guha @NeelGuha:
I wrote a blogpost about writing machine learning research papers (e.g., NeurIPS, ICML, ICLR, etc.). The core idea is that most papers follow one of a predetermined set of templates. The post talks about each template, describes their rules, and offers examples...
Ameya P. retweeted
Maksym Andriushchenko @maksym_andr:
Interesting experiments based on PostTrainBench: agents like Claude Code can already implement non-trivial post-training approaches, but still fail from time to time in simple ways (e.g., they keep going in circles)!
Shaurya Rohatgi @shauryr:

Spent the last week experimenting with auto-research and PostTrainBench github.com/aisa-group/Pos…; here's what I learned.

The throughput is genuinely impressive. Claude ran 16 SFT iterations for a Qwen 1.7B base trying to perfect GSM8K in just 44 hours, under 2 days on 8 GPUs. It tried some non-trivial experiments:
- Rejection-sampling training with a lower lr improves performance
- Using Qwen3's dedicated token for structured chain-of-thought
- Filtering a numerical subset of MetaMathQA to remove noisy samples

The results were real too. PostTrainBench reports 41% for Opus 4.6 on GSM8K with 1 GPU. In my experiment, by v4 (~10 hours in) the automated loop had pushed that to 69% on 8xH200.

But it's not all smooth sailing. The LLM-harness combo has a tendency to get stuck in loops making incremental tweaks instead of fundamentally rethinking its approach. At one point it got stuck doing endless model souping (averaging weights between checkpoints) rather than exploring new directions. It's frustrating to watch the system confidently run an experiment you already know won't yield real gains. Also, it finally did give up, LOL, and said that nothing beats straightforward SFT.

My next direction is to add an observer LLM with a limited "nudge" budget to the main loop for when the trainer agent is going in circles. This way the observer is really thinking about when to nudge, as it only gets k turns.

The direction is extremely promising. The volume of experiments, iterations, and findings an LLM can produce autonomously is hard to match manually. It just needs better steering. Initially I thought it would cost a lot in API tokens, but the agent is mostly waiting for the model to train and is not using many tokens.

Here is the final report -
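For reference, the "model souping" the agent fixated on is just uniform weight averaging; a minimal PyTorch sketch with placeholder checkpoint paths:

```python
import torch

# Sketch: average the state dicts of several fine-tuning checkpoints
# into one "soup". The paths below are placeholders for illustration.

def soup(checkpoint_paths):
    avg = None
    for path in checkpoint_paths:
        sd = torch.load(path, map_location="cpu")
        if avg is None:
            avg = {k: v.clone().float() for k, v in sd.items()}
        else:
            for k in avg:
                avg[k] += sd[k].float()
    return {k: v / len(checkpoint_paths) for k, v in avg.items()}

# model.load_state_dict(soup(["ckpt_a.pt", "ckpt_b.pt", "ckpt_c.pt"]))
```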

Ameya P. retweeted
Chase Brower @ChaseBrowe32432:
I painstakingly ran all 20 EsoLang-Bench hard problems through the Claude web UI. It solved 20/20 (100%). No specialized scaffolding, no expert prompting, no few-shot examples; it just solves them natively. This benchmark just suffocated the models with constrictive scaffolding.
Lossfunk @lossfunk:

🚨 Shocking: Frontier LLMs score 85-95% on standard coding benchmarks. We gave them equivalent problems in languages they couldn't have memorized. They collapsed to 0-11%. Presenting EsoLang-Bench. Accepted to the Logical Reasoning and ICBINB workshops at ICLR 2026 🧵

Ameya P. retweeted
Ruxandra Teslo 🧬 @RuxandraTeslo:
The story about bureaucracy almost stopping a man from treating his dog’s cancer with an mRNA vaccine went viral. The problem transfers to humans: we’ve made these clinical trials unnecessarily hard, denying hope to patients. New article on this. writingruxandrabio.com/p/the-bureaucr…

Excerpts:

"A story about Paul Conyngham, an AI entrepreneur from Sydney who treated his dog Rosie’s cancer with a personalized mRNA vaccine, has been circulating on X since yesterday. What makes the story inspiring is the initiative the owner showed: he used AI to teach himself about how a personalized vaccine could work, designed much of the process himself and approached top researchers to take it forward.

Whether the treatment itself was fully curative and how much of an improvement it is over state-of-the-art is not the main focus of this essay. Others have already debated that question at length, and I recommend following their discussions. What interests me instead is the bureaucratic absurdity the dog’s owner encountered while trying to pursue the treatment. He described the long and frustrating process required simply to test the drug in his dog: “The red tape was actually harder than the vaccine creation, and I was trying to get an Australian ethics approval and run a dog trial on Rosie. It took me three months, putting two hours aside every single night, just typing the 100 page document.”

Even in a small and urgent case, where the owner was fully willing to fund the treatment himself, the effort was slowed by layers of procedure. Of course, this kind of red tape is not confined to Australia, nor to veterinary medicine. In fact, in the US, the red tape is even worse, at least for in-human trials. In a previous post, I recommended the Australian model for early-stage trials.

In the United States, GitLab co-founder Sid Sijbrandij found himself in a similar position after the relapse of his osteosarcoma. When the ordinary doors of medicine closed, he entered what he called “founder mode on his cancer.” Like many entrepreneurs confronted with a difficult problem, he began trying to build his own path forward by self-funding his exploration of experimental therapies. Even then, he ran into the same maze of regulatory and institutional barriers that not only delayed him, but also unnecessarily raised the price of his experimental therapies. These are obstacles that only someone with extraordinary resources could hope to navigate, often by assembling an entire team to deal with them and navigate the opacity. In the end, Sijbrandij prevailed: he has been relapse-free since 2025, after doctors had told him he was at the end of his options.

Around the same time, writer Jake Seliger faced a similar situation while battling advanced throat cancer. Like Sid Sijbrandij, he was willing to try anything that might help. The difference was that Seliger was not a billionaire. He could not hire a team to navigate the system on his behalf, and he struggled even to enroll in the clinical trials that might have offered him a chance. A system originally conceived to safeguard patients has gradually produced a strange and troubling outcome: the mere chance of survival is effectively reserved for the very few who possess the means to assemble an army of experts capable of navigating its labyrinthine procedures.
What makes these stories particularly frustrating is that we already know clinical trials — especially small, early-stage ones like the ones Sijbrandij enrolled in for himself — can be conducted far more cheaply and with far less bureaucracy than is currently required. Ironically, the original article cites Australia as a bad example, yet clinical trials there are conducted 2.5–3× cheaper and faster than in the U.S., at least for human trials, without any increase in safety events — a genuine free lunch.

Removing unnecessary barriers has long been important. That is why I co-founded the Clinical Trial Abundance initiative in 2024, a policy effort aimed at increasing both the number and efficiency of in-human drug trials, and I have consistently argued for the importance of making this crucial but often neglected part of the drug discovery process more efficient. Since then, the issue has only become more urgent with the rise of AI. One of the central promises of the AI revolution is that it will accelerate medical progress. Organizations such as the OpenAI Foundation list curing disease as a core goal, and researchers like Dario Amodei of Anthropic have argued that AI could dramatically speed up biomedical innovation. But, as I have written before in response to an interview between Dario and Dwarkesh Patel, AI will not automatically accelerate a key bottleneck in making these dreams a reality: clinical trials. Conyngham’s observation that navigating the red tape to start a trial for his dog took longer than designing the drug itself only underscores the point.

Clinical trials themselves vary widely. At one end are small, bespoke trials involving one or a few patients testing highly experimental therapies — like the treatment in the Australian dog story or the experimental therapy Sijbrandij pursued. At the other end are large-scale trials involving thousands of participants, designed to confirm earlier findings and support regulatory approval. Different types of trials require different reforms. In this essay, I will focus on the former: small, exploratory trials, which will be called early-stage small-n trials for the purposes of this essay. These are often the fastest way to test promising ideas in humans and learn from them. They represent our best chance at a meaningful “right-to-try,” form the top of the funnel that generates proof-of-concept evidence, and may be the only viable path for personalized medicine and treatments for ultra-rare diseases. Understanding why these trials have been made unnecessarily difficult — and how we might change that — is essential if medical innovation is to keep pace with our growing ability to design new therapies.

When the story first circulated on X, many people interpreted it as evidence that a cure already exists but simply hasn’t been used due to bureaucracy. That isn’t quite true, as I explained. The type of mRNA vaccine that the owner pursued looks promising, but he did not know a priori whether it worked or not, as it had not been tested before. So it was not a cure, but “a chance at a cure”. I hesitate to call it an “experimental treatment”, since this term evokes fears of potential safety issues, while we can generally predict safety quite well now. The question of whether this was truly a cure, however, does not make the story of the bureaucratic red tape that Conyngham encountered any less infuriating.
More and more promising treatments are accumulating in the pipeline, fueled by an explosion of new therapeutic modalities, ranging from mRNA to better peptides and more recently, by AI. Yet we are not taking full advantage of them.

To better understand these points, it is helpful to briefly outline the clinical development process — the sequence of in-human trials through which a promising scientific idea is gradually translated into a therapy. Drug development is often described as a funnel: many ideas enter at the top, but only a few become approved treatments. Early human studies, known as Phase I trials, sit at the entrance of this process. They involve small numbers of patients and are designed to quickly test whether a new therapy is safe and shows early signs of effectiveness. If the results look promising, the therapy moves to larger and more complex studies, including Phase III trials that enroll large numbers of patients to confirm whether the treatment truly works. Most people gain access to new therapies only after these large randomized trials are completed. On average, moving from a promising idea to Phase III results takes seven to ten years and costs roughly $1.2 billion. Accelerated approval pathways in areas such as cancer or rare diseases can shorten this timeline by relying on surrogate endpoints, but the process remains slow. As a result, many discoveries that make headlines today will take close to a decade before they become treatments that patients can widely access.

Part of this delay is unavoidable. Observing how a drug affects the human body simply takes time. But much of it is not. Layers of unnecessary bureaucracy, regulatory opacity, and rising trial costs add years to the process without clearly improving patient safety, which is why I started Clinical Trial Abundance.

Allowing a higher volume of small-n early-stage trials, the focus of this essay, is a rare “win-win” for both public health and scientific progress. For patients, it transforms a terminal diagnosis from a closed door into a “chance at a cure,” providing legal, supervised access to cutting-edge medicine that currently sits idle in labs. For researchers and society, it unclogs the drug discovery funnel; by lowering the barrier to entry for new ideas, we ensure that the next generation of mRNA, peptide and AI-driven therapies are tested in humans years sooner, ultimately accelerating the arrival of universal cures for everyone.

Next, I will explain why making it easier to run these early-stage trials matters. First, from a patient perspective, they often provide the closest practical equivalent to a right-to-try. In theory, right-to-try laws allow patients with serious illnesses to access treatments that have not yet been confirmed in large randomized Phase III trials. In practice, these pathways rarely function as intended. Pharmaceutical companies are often reluctant to provide experimental drugs outside formal trials, and treatments typically must have already passed Phase I testing. As a result, very few patients gain access through these mechanisms. Early-stage trials offer a more workable alternative. They allow experimental therapies to be tested in structured clinical environments — often in academic settings or academia–industry collaborations — where patients can be monitored and meaningful data can be collected.

Second, early-stage small-n trials are essential for personalized medicine and the treatment of ultra-rare diseases.
Many emerging therapies — such as personalized cancer vaccines, gene therapies, and other individualized interventions — do not fit easily into the traditional model of large randomized trials involving thousands of participants. By their nature, these treatments target very small patient populations and often require flexible, adaptive clinical designs.

From a societal perspective, these trials play a crucial learning role. As I argued in my earlier essay Clinic-in-the-Loop, early-stage trials are not simply regulatory checkpoints on the path to approval. They are part of the discovery process itself, creating a feedback loop between laboratory hypotheses and human biology. Later-stage studies, particularly Phase III trials, are designed mainly for validation: they test whether a treatment works under defined conditions and produce the evidence needed for approval. Early-stage trials, by contrast, are oriented toward learning. Conducted with small patient groups and often using exploratory designs, they allow researchers to observe how a therapy behaves in the human body and how the disease responds. In this way, they close the gap between theory and real-world biology. In the Clinic-in-the-Loop essay, I explain how these trials were crucial to the discovery of Kymriah, the first curative cell therapy for blood cancer."
Ameya P. @AmyPrb:
My DMs are open! Also feel free to email me if you're looking to hire.
Ameya P. retweeted
Rahaf Aljundi @AljundiRahaf:
This fall, during a Dagstuhl seminar on continual learning, we discussed the roadmap for continual learning with various researchers from the field. We converged on one view: modular memory is the key to continual learning agents, as outlined here: arxiv.org/pdf/2603.01761
Alex Dimakis @AlexGDimakis:
The BenchPress idea is a delightfully simple application of compressed sensing to AI evals: instead of running all the benchmarks, run a few (ideally the cheaper ones) and use those numbers as features, given to a model that predicts the other benchmark numbers from these observations. It turns out the matrix of benchmarks is very low-rank, and the matrix completion model works very well.

My thought is that at the end of the day you still need to run all the benchmarks, but while iterating, this is a valuable trick to get more signal on what works and in which direction. You can also look at the low-rank directions you discover and understand how your model performs along these data-driven performance directions. They may be easy to name ('Persistence', 'Coding comfort', 'Terminal use ability', etc.?).
Dimitris Papailiopoulos @DimitrisPapail:

Running Terminal-Bench 2.0 on expensive frontier models costs $1K–$50K or more. BenchPress predicts Gemini 3.1 Pro's and Claude Opus 4.6's scores within ±2 points after 15 randomly selected benchmarks… using zero agentic benchmark data!! Cost: $0.
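The low-rank completion at the heart of this is easy to sketch; the soft-impute-style loop and the rank below are illustrative assumptions, not necessarily BenchPress's actual estimator:

```python
import numpy as np

# Sketch: fill missing entries of a (models x benchmarks) score matrix
# with a low-rank approximation, keeping observed scores fixed.

def complete(scores: np.ndarray, rank: int = 3, iters: int = 200):
    mask = ~np.isnan(scores)
    filled = np.where(mask, scores, np.nanmean(scores))
    for _ in range(iters):
        u, s, vt = np.linalg.svd(filled, full_matrices=False)
        low_rank = (u[:, :rank] * s[:rank]) @ vt[:rank]
        filled = np.where(mask, scores, low_rank)  # keep observed fixed
    return filled

# Usage: run ~15 cheap benchmarks for a new model, leave the rest as
# NaN in its row, and read predictions off the completed matrix.
```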
