Karen Hambardzumyan

83 posts

@mahnerak

AI Research Agents @AIatMeta (FAIR), PhD Student @ucl_nlp

London, United Kingdom · Joined June 2010
1.1K Following · 349 Followers
Karen Hambardzumyan retweeted
Bhavul Gauri
Bhavul Gauri@BhavulGauri·
Introducing AIRS Bench, a benchmark for "AI Researcher Agents": agents attempt 20 open ML problems starting from zero code (the full research loop). And yes, they beat SOTA in a few cases (read more below!) arxiv.org/abs/2602.06855
4 replies · 12 reposts · 73 likes · 16.2K views
Karen Hambardzumyan retweeted
Rohan Paul
Rohan Paul@rohanpaul_ai·
New @AIatMeta paper shows AI agents do better when their first ideas are diverse, not similar. The authors studied 11,000 agent runs on Kaggle tasks and measured diversity in the first 5 plans. Here diversity means trying different model families, not tweaks of one model.

Agents whose first ideas covered many types scored higher across many agent setups. When the system prompt was changed to push similar ideas, benchmark scores fell by 7 to 8 points and valid submissions dropped. This suggests that having several distinct plans gives the agent more chances to find a solution it can build.

Scaffold settings like sibling memory, staged complexity hints, and explicit variety requests raise diversity, while temperature changes do little, and better agents spend more time on successful code.

Paper: arxiv.org/abs/2511.15593
Paper title: "What Does It Take to Be a Good AI Research Agent? Studying the Role of Ideation Diversity"
Rohan Paul tweet media
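A minimal sketch of the kind of diversity measure the summary describes, counting distinct model families among an agent's first few plans. The family labels, toy runs, and exact metric form here are illustrative assumptions, not taken from the paper:

```python
# Hypothetical sketch of "ideation diversity": the number of distinct model
# families among an agent's first k proposed plans. The family labels and
# runs below are invented for illustration.

def ideation_diversity(plans, k=5):
    """Count distinct model families among the first k plans."""
    return len({family for family, _variant in plans[:k]})

# Two toy runs: one whose first plans span four model families, and one
# that only proposes tweaks of a single gradient-boosting model.
diverse_run = [("gbdt", "lightgbm"), ("nn", "mlp"), ("linear", "ridge"),
               ("gbdt", "xgboost"), ("ensemble", "stacking")]
narrow_run = [("gbdt", "lightgbm"), ("gbdt", "xgboost"), ("gbdt", "catboost"),
              ("gbdt", "lightgbm-tuned"), ("gbdt", "dart")]

print(ideation_diversity(diverse_run))  # 4
print(ideation_diversity(narrow_run))   # 1
```

Under this reading, the paper's prompt intervention (pushing similar ideas) would drive this count toward 1, which is where the 7-to-8-point score drop was observed.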
4 replies · 12 reposts · 72 likes · 6.8K views
Karen Hambardzumyan retweeted
Yeskendir 🇰🇿
Yeskendir 🇰🇿@yeskendir_k·
🧵 New paper from FAIR (Meta) on recursion + latent reasoning: "Encode, Think, Decode (ETD): Scaling reasoning through recursive latent thoughts". ETD improves the reasoning of the base model by training it to iterate over a subset of reasoning-critical layers during mid-training. (1/n)
Yeskendir 🇰🇿 tweet media
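The recursion idea the tweet describes, reusing a subset of layers several times between an encode stage and a decode stage, can be sketched structurally. This is a toy, framework-free illustration; the layer names, the three-way split, and the iteration count are invented, not Meta's implementation:

```python
# Toy structural sketch of the ETD split: encode / think / decode stages,
# where the "think" subset is applied recursively. Layers just append
# their names so the execution order is visible; in the real model these
# would be transformer blocks.

def make_layer(name):
    def layer(trace):
        return trace + [name]
    return layer

encode = [make_layer("enc1"), make_layer("enc2")]
think = [make_layer("think1"), make_layer("think2")]  # reasoning-critical subset
decode = [make_layer("dec1")]

def forward(n_iters=3):
    trace = []
    for blk in encode:
        trace = blk(trace)
    for _ in range(n_iters):  # iterate latent "thoughts" over the same layers
        for blk in think:
            trace = blk(trace)
    for blk in decode:
        trace = blk(trace)
    return trace

print(forward())
# ['enc1', 'enc2', 'think1', 'think2', 'think1', 'think2', 'think1', 'think2', 'dec1']
```

The point of the structure is that depth is reused rather than added: raising `n_iters` deepens the computation without adding parameters.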
9 replies · 41 reposts · 286 likes · 28.4K views
Karen Hambardzumyan retweeted
Perceptron AI
Perceptron AI@perceptroninc·
1/ Introducing Isaac 0.1, our first perceptive-language model: 2B params, open weights. It matches or beats significantly larger models on core perception. We are pushing the efficient frontier for physical AI. perceptron.inc/blog/introduci…
Perceptron AI tweet media
24 replies · 116 reposts · 604 likes · 195K views
Karen Hambardzumyan retweeted
Jakob Foerster
Jakob Foerster@j_foerst·
AIRA strikes again! This time we conduct an in-depth study of research agents on MLE-Bench (i.e., Kaggle competitions). We find that while exploration and search matter, the biggest delta is due to our more robust software stack. We are open-sourcing all of this to allow YOU to push this work further.
Yoram Bachrach@yorambac

AI Research Agents are becoming proficient at machine learning tasks, but how can we help them search the space of candidate solutions and codebases? Read our new paper looking at MLE-Bench: arxiv.org/pdf/2507.02554 #LLM #Agents #MLEBench

0 replies · 4 reposts · 56 likes · 6.1K views
Karen Hambardzumyan retweeted
Martin Josifoski
Martin Josifoski@MartinJosifoski·
Shoutout to the MLE-bench authors for creating an awesome testbed of real-world problems. While contamination might be an issue, it does a great job of exposing agents to key challenges (e.g., problem difficulty, effective resource management, generalization gap). @lilianweng @aleks_madry @tejalpatwardhan @ChowdhuryNeil @evanon0ping @JaffeOliver

As a baseline, we use AIDE from @WecoAI; their version of greedy search, which performs greedy selection over the full population, is highly effective and difficult to beat in practice. @zhengyaojiang @YuxiangJWu @DhruvSrikanth @schmidtdominik_

@SakanaAILabs's recently proposed version of MCTS is similar in spirit, and I expect it to offer a significant improvement over the more standard versions of MCTS in this domain. @iwiwi
3 replies · 1 repost · 11 likes · 679 views
Karen Hambardzumyan retweeted
Martin Josifoski
Martin Josifoski@MartinJosifoski·
Scaling AI research agents is key to tackling some of the toughest challenges in the field. But what's required to scale effectively? It turns out that simply throwing more compute at the problem isn't enough.

We break down an agent into four fundamental components that shape its behavior, regardless of specific design or implementation choices:
- Environment: the context (infrastructure) in which the agent operates
- Search Policy: how the agent allocates resources
- Operator Set and Policy: the available actions the agent can take and how it chooses among them
- Evaluation Mechanism: how the agent determines whether a particular direction is promising

We specifically focus on ML research agents tasked with real-world machine learning challenges from Kaggle competitions (MLE-bench). What we found is that factors like the environment, the agents' core capabilities (the operator set), and overfitting emerge as critical bottlenecks long before computational limitations come into play.

Here are our key insights:

🔹 Environment: Agents can't scale without a robust environment that offers flexible and efficient access to computational resources. For instance, simply running the baseline agents in the (open-sourced) AIRA-dojo environment boosts performance by 10% absolute (30% relative), highlighting just how crucial the environment is.

🔹 Agent design and core capabilities: Resource allocation optimization only matters if agents can actually make good use of those resources. Our analysis shows that the agents' operator set, the core actions they perform, can limit performance gains from more advanced search methods like evolutionary search and MCTS. We achieve SoTA performance by designing an improved operator set that better manages context and encourages exploration, and coupling it with the search policies.

🔹 Evaluation: Accurate evaluation of the solution space is critical and reveals a significant challenge: overfitting. Ironically, agents that are highly effective at optimizing perceived values tend to be more vulnerable to overfitting, a problem that intensifies with increased compute resources. We observe up to 13% performance loss due to suboptimal selection of final solutions caused by this issue.

🔹 Compute: Providing agents with sufficient compute resources is essential to avoid introducing an additional limitation and bias into evaluations. We demonstrate this through experiments in which we scale the runtime from 24 to 120 hours.

In summary, successfully scaling AI research agents requires careful attention to these foundational aspects. Ignoring them risks turning scaling efforts into, at best, exercises in overfitting. These insights set the stage for exciting developments ahead!
Martin Josifoski tweet media
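The "greedy selection over the full population" baseline mentioned in these threads can be sketched in a few lines. The mutation operator and scoring function below are toy stand-ins (a one-dimensional hill climb), not the actual operators from the paper:

```python
import random

# Illustrative sketch of greedy search over a population of candidate
# solutions: at every step, pick the best-scoring candidate found so far
# (anywhere in the population, not just the latest one) and try to
# improve it. Candidates here are toy one-parameter dicts.

def mutate(solution, rng):
    """Toy 'improve' operator: randomly perturb the candidate's parameter."""
    child = dict(solution)
    child["param"] = solution["param"] + rng.uniform(-0.5, 0.5)
    return child

def score(solution):
    """Toy validation score, peaked at param = 2.0."""
    return -abs(solution["param"] - 2.0)

def greedy_search(steps=50, seed=0):
    rng = random.Random(seed)
    population = [{"param": 0.0}]
    for _ in range(steps):
        best = max(population, key=score)   # greedy: expand the global best
        population.append(mutate(best, rng))
    return max(population, key=score)

best = greedy_search()
print(round(best["param"], 2))
```

The key design choice the thread highlights is the selection rule: because the best candidate is chosen over the *full* population, a bad mutation never derails the search; the agent simply falls back to the previous best on the next step.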
5 replies · 31 reposts · 156 likes · 18.7K views
Karen Hambardzumyan retweeted
Jakob Foerster
Jakob Foerster@j_foerst·
The AIRA team @metaai has the ambitious goal of building/training an agent that can do frontier AI research to help the open-source ecosystem leapfrog closed-source LLMs. As a relatively small team we cannot succeed in this mission without the support of the community, so we'll be open-sourcing our tools, methods, and benchmarks along the way.

🚨 Meet our LLM Speedrunning Benchmark 🚨 which probes the ability of LLM agents to do LLM engineering in the "GPT-2 speedrun", which is fast enough for efficient, high-signal evals. Crucially, past human records provide an existence proof for higher performance and allow us to test where the limiting factors for performance are (ideation vs. implementation). Spoiler: both are currently a problem!

Stay tuned, we are just getting started, and (even better) join the journey!
Minqi Jiang@MinqiJiang

Recently, there has been a lot of talk of LLM agents automating ML research itself. If Llama 5 can create Llama 6, then surely the singularity is just around the corner. How can we get a pulse check on whether current LLMs are capable of driving this kind of total self-improvement?

Well, we know humans are pretty good at improving LLMs. In the NanoGPT speedrun challenge, created by @kellerjordan0, human researchers iteratively improved @karpathy's GPT-2 replication, slashing the training time (to the same target validation loss) from 45 minutes to under 3 minutes in just under a year (!).

Surely, a necessary (but not sufficient) ability for an LLM that can automatically improve frontier techniques is the ability to *reproduce* known innovations on GPT-2, a tiny language model from over 5 years ago. 🤔

So we took several of the top models and combined them with various search scaffolds to create *LLM speedrunner agents*. We then asked these agents to reproduce each of the NanoGPT speedrun records, starting from the previous record, while providing them access to different forms of hints that revealed the exact changes needed to reach the next record.

The results were surprising: not because we thought these agents would ace the benchmark, but because even the best agent failed to recover even half of the speed-up of human innovators on average in the easiest hint mode, where we show the agent the full pseudocode of the changes to the next record.

We believe The Automated LLM Speedrunning Benchmark provides a simple eval for measuring the lower bound of LLM agents' ability to reproduce scientific findings close to the frontier of ML. Beyond scientific reproducibility, this benchmark can also be run without hints, transforming into an automated *scientific innovation* benchmark. When run in "innovation mode," this benchmark effectively extends the NanoGPT speedrun to AI participants!

While initial results here indicate that current agents seriously struggle to match human innovators beyond just a couple of records, benchmarks have a tendency to fall. This one is particularly exciting to watch, as new state-of-the-art here by definition implies a form of *superhuman innovation*.

1 reply · 10 reposts · 102 likes · 12.8K views
Karen Hambardzumyan retweeted
Minqi Jiang
Minqi Jiang@MinqiJiang·
Recently, there has been a lot of talk of LLM agents automating ML research itself. If Llama 5 can create Llama 6, then surely the singularity is just around the corner. How can we get a pulse check on whether current LLMs are capable of driving this kind of total self-improvement?

Well, we know humans are pretty good at improving LLMs. In the NanoGPT speedrun challenge, created by @kellerjordan0, human researchers iteratively improved @karpathy's GPT-2 replication, slashing the training time (to the same target validation loss) from 45 minutes to under 3 minutes in just under a year (!).

Surely, a necessary (but not sufficient) ability for an LLM that can automatically improve frontier techniques is the ability to *reproduce* known innovations on GPT-2, a tiny language model from over 5 years ago. 🤔

So we took several of the top models and combined them with various search scaffolds to create *LLM speedrunner agents*. We then asked these agents to reproduce each of the NanoGPT speedrun records, starting from the previous record, while providing them access to different forms of hints that revealed the exact changes needed to reach the next record.

The results were surprising: not because we thought these agents would ace the benchmark, but because even the best agent failed to recover even half of the speed-up of human innovators on average in the easiest hint mode, where we show the agent the full pseudocode of the changes to the next record.

We believe The Automated LLM Speedrunning Benchmark provides a simple eval for measuring the lower bound of LLM agents' ability to reproduce scientific findings close to the frontier of ML. Beyond scientific reproducibility, this benchmark can also be run without hints, transforming into an automated *scientific innovation* benchmark. When run in "innovation mode," this benchmark effectively extends the NanoGPT speedrun to AI participants!

While initial results here indicate that current agents seriously struggle to match human innovators beyond just a couple of records, benchmarks have a tendency to fall. This one is particularly exciting to watch, as new state-of-the-art here by definition implies a form of *superhuman innovation*.
Minqi Jiang tweet media
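The thread's headline comparison, the fraction of a human record's speed-up that an agent recovers, can be made concrete with a small helper. The exact metric form here is an assumption for illustration, not the paper's definition:

```python
def speedup_recovered(t_prev, t_next, t_agent):
    """Fraction of the human record's improvement recovered by the agent.

    t_prev:  training time of the previous record (the agent's starting point)
    t_next:  training time of the next human record (the target)
    t_agent: training time achieved by the agent's reproduction attempt
    (Assumed metric form, for illustration only.)
    """
    return (t_prev - t_agent) / (t_prev - t_next)

# Hypothetical numbers: a human record improved training time from 45 to
# 30 minutes; an agent starting from the 45-minute code reaches 39 minutes.
print(speedup_recovered(45.0, 30.0, 39.0))  # 0.4, i.e. 40% of the speed-up
```

On a measure like this, "failed to recover even half of the speed-up" means the average across records stays below 0.5 even in the easiest hint mode.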
40 replies · 196 reposts · 1.2K likes · 568.1K views
Karen Hambardzumyan retweeted
Andrei Lupu
Andrei Lupu@_andreilupu·
Theory of Mind (ToM) is crucial for next-gen LLM agents, yet current benchmarks suffer from multiple shortcomings. Enter 💽 Decrypto, an interactive benchmark for multi-agent reasoning and ToM in LLMs! Work done with @TimonWilli & @j_foerst at @AIatMeta & @FLAIR_Ox 🧵👇
4 replies · 27 reposts · 104 likes · 23.2K views
Karen Hambardzumyan retweeted
Wassim (Wes) Bouaziz
Wassim (Wes) Bouaziz@_Vassim·
🚨 New AI Security paper alert: Winter Soldier 🥶🚨 In our latest paper, we show:
- how to backdoor a LM _without_ training it on the backdoor behavior
- how to use that to detect if a black-box LM has been trained on your protected data
Yes, indirect data poisoning is real and powerful!
Wassim (Wes) Bouaziz tweet media
1 reply · 19 reposts · 52 likes · 6.6K views
Karen Hambardzumyan retweeted
Sonia Joseph
Sonia Joseph@soniajoseph_·
Our paper Prisma: An Open Source Toolkit for Mechanistic Interpretability in Vision and Video received an Oral at the Mechanistic Interpretability for Vision Workshop at CVPR 2025! 🎉 We’ll be in Nashville next week. Come say hi 👋 @CVPR @miv_cvpr2025
Sonia Joseph tweet media
3 replies · 30 reposts · 272 likes · 19.7K views
Karen Hambardzumyan retweeted
Virginie Do
Virginie Do@gini_do·
I am at #ICLR and honored to present this work on Saturday afternoon at the poster session. Thanks @jade_lei_yu @mahnerak @nicola_cancedda for this wonderful collaboration! I am also happy to chat about Llama / agents / safety 👋
Lei Yu@jade_lei_yu

New paper! 🎊 We are delighted to announce our new paper "Robust LLM Safeguarding via Refusal Feature Adversarial Training"! There is a common mechanism behind LLM jailbreaking, and it can be leveraged to make models safer!

0 replies · 6 reposts · 28 likes · 2.5K views
Karen Hambardzumyan retweeted
Wassim (Wes) Bouaziz
Wassim (Wes) Bouaziz@_Vassim·
Want to know if an ML model was trained on your dataset with 1 API call? See you at conferences 🙌 Excited to share that our paper Data Taggants for image data was accepted at ICLR 2025 🎉 Our follow-up on audio data was accepted at ICASSP 2025! 🎉 Check out the details below 👇
Wassim (Wes) Bouaziz tweet media
Wassim (Wes) Bouaziz@_Vassim

Want to know if an ML model was trained on your dataset? Introducing ✨Data Taggants✨! We use data poisoning to leave a harmless and stealthy signature on your dataset that radiates through trained models. Learn how to protect your dataset from unauthorized use... A 🧵

1 reply · 13 reposts · 34 likes · 6.3K views
Karen Hambardzumyan retweeted
Eduardo Sánchez
Eduardo Sánchez@eduardosg_ai·
Happy to see that Linguini, our benchmark for language-agnostic linguistic reasoning, has been included in DeepMind’s BIG-Bench Extra Hard (BBEH). Linguini remains challenging for reasoning models, being one of only two (hard) tasks where o3-mini doesn't show massive gains.
Eduardo Sánchez tweet media
1 reply · 3 reposts · 16 likes · 3.1K views
Karen Hambardzumyan retweeted
vmoens
vmoens@VincentMoens·
A few tips I share when I talk about perf with @PyTorch in eager mode (with a focus on small models): 🪢
1 reply · 2 reposts · 42 likes · 2.6K views