Karen Hambardzumyan

83 posts

@mahnerak

AI Research Agents @AIatMeta (FAIR), PhD Student @ucl_nlp

London, United Kingdom · Joined June 2010
1.1K Following · 349 Followers
Karen Hambardzumyan retweeted
Bhavul Gauri
Bhavul Gauri@BhavulGauri·
Introducing AIRS Bench, a benchmark for "AI Researcher Agents": agents attempt 20 open ML problems starting from zero code (the full research loop). And yes, they beat SOTA in a few cases (read more below!) arxiv.org/abs/2602.06855
4 replies · 12 reposts · 73 likes · 16.2K views
Karen Hambardzumyan retweeted
Rohan Paul
Rohan Paul@rohanpaul_ai·
New @AIatMeta paper shows AI agents do better when their first ideas are diverse, not similar. The authors studied 11,000 agent runs on Kaggle tasks and measured diversity in the first 5 plans. Here diversity means trying different model families, not tweaks of one model.

Agents whose first ideas covered many types scored higher across many agent setups. When the system prompt was changed to push similar ideas, benchmark scores fell by 7 to 8 points and valid submissions dropped. This suggests that having several distinct plans gives the agent more chances to find a solution it can build.

Scaffold settings like sibling memory, staged complexity hints, and explicit variety requests raise diversity, while temperature changes do little, and better agents spend more time on successful code.

Paper: arxiv.org/abs/2511.15593
Paper title: "What Does It Take to Be a Good AI Research Agent? Studying the Role of Ideation Diversity"
Rohan Paul tweet media
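A minimal sketch of the kind of diversity measure the summary describes, counting distinct model families among an agent's first few plans. The family labels, toy runs, and exact metric form here are illustrative assumptions, not taken from the paper:

```python
# Hypothetical sketch of "ideation diversity": the number of distinct model
# families among an agent's first k proposed plans. The family labels and
# runs below are invented for illustration.

def ideation_diversity(plans, k=5):
    """Count distinct model families among the first k plans."""
    return len({family for family, _variant in plans[:k]})

# Two toy runs: one whose first plans span four model families, and one
# that only proposes tweaks of a single gradient-boosting model.
diverse_run = [("gbdt", "lightgbm"), ("nn", "mlp"), ("linear", "ridge"),
               ("gbdt", "xgboost"), ("ensemble", "stacking")]
narrow_run = [("gbdt", "lightgbm"), ("gbdt", "xgboost"), ("gbdt", "catboost"),
              ("gbdt", "lightgbm-tuned"), ("gbdt", "dart")]

print(ideation_diversity(diverse_run))  # 4
print(ideation_diversity(narrow_run))   # 1
```

Under this reading, the paper's prompt intervention (pushing similar ideas) would drive this count toward 1, which is where the 7-to-8-point score drop was observed.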
4 replies · 12 reposts · 72 likes · 6.8K views
Karen Hambardzumyan retweeted
Yeskendir 🇰🇿
Yeskendir 🇰🇿@yeskendir_k·
🧵 New paper from FAIR (Meta) on recursion + latent reasoning: "Encode, Think, Decode (ETD): Scaling reasoning through recursive latent thoughts". ETD improves the reasoning of the base model by training it to iterate over a subset of reasoning-critical layers during mid-training. (1/n)
Yeskendir 🇰🇿 tweet media
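The recursion idea the tweet describes, reusing a subset of layers several times between an encode stage and a decode stage, can be sketched structurally. This is a toy, framework-free illustration; the layer names, the three-way split, and the iteration count are invented, not Meta's implementation:

```python
# Toy structural sketch of the ETD split: encode / think / decode stages,
# where the "think" subset is applied recursively. Layers just append
# their names so the execution order is visible; in the real model these
# would be transformer blocks.

def make_layer(name):
    def layer(trace):
        return trace + [name]
    return layer

encode = [make_layer("enc1"), make_layer("enc2")]
think = [make_layer("think1"), make_layer("think2")]  # reasoning-critical subset
decode = [make_layer("dec1")]

def forward(n_iters=3):
    trace = []
    for blk in encode:
        trace = blk(trace)
    for _ in range(n_iters):  # iterate latent "thoughts" over the same layers
        for blk in think:
            trace = blk(trace)
    for blk in decode:
        trace = blk(trace)
    return trace

print(forward())
# ['enc1', 'enc2', 'think1', 'think2', 'think1', 'think2', 'think1', 'think2', 'dec1']
```

The point of the structure is that depth is reused rather than added: raising `n_iters` deepens the computation without adding parameters.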
9 replies · 41 reposts · 286 likes · 28.4K views
Karen Hambardzumyan retweeted
Perceptron AI
Perceptron AI@perceptroninc·
1/ Introducing Isaac 0.1, our first perceptive-language model: 2B params, open weights. It matches or beats significantly larger models on core perception. We are pushing the efficient frontier for physical AI. perceptron.inc/blog/introduci…
Perceptron AI tweet media
24 replies · 116 reposts · 604 likes · 195K views
Karen Hambardzumyan retweeted
Jakob Foerster
Jakob Foerster@j_foerst·
AIRA strikes again! This time we conduct an in-depth study of research agents on MLE-Bench (i.e., Kaggle competitions). We find that while exploration and search matter, the biggest delta is due to our more robust software stack. We are open-sourcing all of this to allow YOU to push this work further.
Yoram Bachrach@yorambac

AI Research Agents are becoming proficient at machine learning tasks, but how can we help them search the space of candidate solutions and codebases? Read our new paper looking at MLE-Bench: arxiv.org/pdf/2507.02554 #LLM #Agents #MLEBench

0 replies · 4 reposts · 56 likes · 6.1K views
Karen Hambardzumyan retweeted
Martin Josifoski
Martin Josifoski@MartinJosifoski·
Shoutout to the MLE-bench authors for creating an awesome testbed of real-world problems. While contamination might be an issue, it does a great job of exposing agents to key challenges (e.g., problem difficulty, effective resource management, generalization gap). @lilianweng @aleks_madry @tejalpatwardhan @ChowdhuryNeil @evanon0ping @JaffeOliver

As a baseline, we use AIDE from @WecoAI; their version of greedy search, which performs greedy selection over the full population, is highly effective and difficult to beat in practice. @zhengyaojiang @YuxiangJWu @DhruvSrikanth @schmidtdominik_

@SakanaAILabs's recently proposed version of MCTS is similar in spirit, and I expect it to offer a significant improvement over the more standard versions of MCTS in this domain. @iwiwi
3 replies · 1 repost · 11 likes · 679 views
Karen Hambardzumyan retweeted
Martin Josifoski
Martin Josifoski@MartinJosifoski·
Scaling AI research agents is key to tackling some of the toughest challenges in the field. But what's required to scale effectively? It turns out that simply throwing more compute at the problem isn't enough.

We break down an agent into four fundamental components that shape its behavior, regardless of specific design or implementation choices:
- Environment: the context (infrastructure) in which the agent operates
- Search Policy: how the agent allocates resources
- Operator Set and Policy: the available actions the agent can take and how it chooses among them
- Evaluation Mechanism: how the agent determines whether a particular direction is promising

We specifically focus on ML research agents tasked with real-world machine learning challenges from Kaggle competitions (MLE-bench). What we found is that factors like the environment, the agents' core capabilities (the operator set), and overfitting emerge as critical bottlenecks long before computational limitations come into play.

Here are our key insights:

🔹 Environment: Agents can't scale without a robust environment that offers flexible and efficient access to computational resources. For instance, simply running the baseline agents in the (open-sourced) AIRA-dojo environment boosts performance by 10% absolute (30% relative), highlighting just how crucial the environment is.

🔹 Agent design and core capabilities: Resource allocation optimization only matters if agents can actually make good use of those resources. Our analysis shows that the agents' operator set, the core actions they perform, can limit performance gains from more advanced search methods like evolutionary search and MCTS. We achieve SoTA performance by designing an improved operator set that better manages context and encourages exploration, and coupling it with the search policies.

🔹 Evaluation: Accurate evaluation of the solution space is critical and reveals a significant challenge: overfitting. Ironically, agents that are highly effective at optimizing perceived values tend to be more vulnerable to overfitting, a problem that intensifies with increased compute resources. We observe up to 13% performance loss due to suboptimal selection of final solutions caused by this issue.

🔹 Compute: Providing agents with sufficient compute resources is essential to avoid introducing an additional limitation and bias into evaluations. We demonstrate this through experiments in which we scale the runtime from 24 to 120 hours.

In summary, successfully scaling AI research agents requires careful attention to these foundational aspects. Ignoring them risks turning scaling efforts into, at best, exercises in overfitting. These insights set the stage for exciting developments ahead!
Martin Josifoski tweet media
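The "greedy selection over the full population" baseline mentioned in these threads can be sketched in a few lines. The mutation operator and scoring function below are toy stand-ins (a one-dimensional hill climb), not the actual operators from the paper:

```python
import random

# Illustrative sketch of greedy search over a population of candidate
# solutions: at every step, pick the best-scoring candidate found so far
# (anywhere in the population, not just the latest one) and try to
# improve it. Candidates here are toy one-parameter dicts.

def mutate(solution, rng):
    """Toy 'improve' operator: randomly perturb the candidate's parameter."""
    child = dict(solution)
    child["param"] = solution["param"] + rng.uniform(-0.5, 0.5)
    return child

def score(solution):
    """Toy validation score, peaked at param = 2.0."""
    return -abs(solution["param"] - 2.0)

def greedy_search(steps=50, seed=0):
    rng = random.Random(seed)
    population = [{"param": 0.0}]
    for _ in range(steps):
        best = max(population, key=score)   # greedy: expand the global best
        population.append(mutate(best, rng))
    return max(population, key=score)

best = greedy_search()
print(round(best["param"], 2))
```

The key design choice the thread highlights is the selection rule: because the best candidate is chosen over the *full* population, a bad mutation never derails the search; the agent simply falls back to the previous best on the next step.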
5 replies · 31 reposts · 156 likes · 18.7K views
Karen Hambardzumyan retweeted
Jakob Foerster
Jakob Foerster@j_foerst·
The AIRA team @metaai has the ambitious goal of building/training an agent that can do frontier AI research to help the open-source ecosystem leapfrog closed-source LLMs. As a relatively small team we cannot succeed in this mission without the support of the community, so we'll be open-sourcing our tools, methods, and benchmarks along the way.

🚨 Meet our LLM Speedrunning Benchmark 🚨 which probes the ability of LLM agents to do LLM engineering in the "GPT-2 speedrun", which is fast enough for efficient, high-signal evals. Crucially, past human records provide an existence proof for higher performance and allow us to test where the limiting factors for performance are (ideation vs. implementation). Spoiler: both are currently a problem!

Stay tuned, we are just getting started, and (even better) join the journey!
Minqi Jiang@MinqiJiang

Recently, there has been a lot of talk of LLM agents automating ML research itself. If Llama 5 can create Llama 6, then surely the singularity is just around the corner. How can we get a pulse check on whether current LLMs are capable of driving this kind of total self-improvement?

Well, we know humans are pretty good at improving LLMs. In the NanoGPT speedrun challenge, created by @kellerjordan0, human researchers iteratively improved @karpathy's GPT-2 replication, slashing the training time (to the same target validation loss) from 45 minutes to under 3 minutes in just under a year (!).

Surely, a necessary (but not sufficient) ability for an LLM that can automatically improve frontier techniques is the ability to *reproduce* known innovations on GPT-2, a tiny language model from over 5 years ago. 🤔

So we took several of the top models and combined them with various search scaffolds to create *LLM speedrunner agents*. We then asked these agents to reproduce each of the NanoGPT speedrun records, starting from the previous record, while providing them access to different forms of hints that revealed the exact changes needed to reach the next record.

The results were surprising: not because we thought these agents would ace the benchmark, but because even the best agent failed to recover even half of the speed-up of human innovators on average in the easiest hint mode, where we show the agent the full pseudocode of the changes to the next record.

We believe The Automated LLM Speedrunning Benchmark provides a simple eval for measuring the lower bound of LLM agents' ability to reproduce scientific findings close to the frontier of ML. Beyond scientific reproducibility, this benchmark can also be run without hints, transforming into an automated *scientific innovation* benchmark. When run in "innovation mode," this benchmark effectively extends the NanoGPT speedrun to AI participants!

While initial results here indicate that current agents seriously struggle to match human innovators beyond just a couple of records, benchmarks have a tendency to fall. This one is particularly exciting to watch, as new state-of-the-art here by definition implies a form of *superhuman innovation*.

1 reply · 10 reposts · 102 likes · 12.8K views
Karen Hambardzumyan retweeted
Minqi Jiang
Minqi Jiang@MinqiJiang·
Recently, there has been a lot of talk of LLM agents automating ML research itself. If Llama 5 can create Llama 6, then surely the singularity is just around the corner. How can we get a pulse check on whether current LLMs are capable of driving this kind of total self-improvement?

Well, we know humans are pretty good at improving LLMs. In the NanoGPT speedrun challenge, created by @kellerjordan0, human researchers iteratively improved @karpathy's GPT-2 replication, slashing the training time (to the same target validation loss) from 45 minutes to under 3 minutes in just under a year (!).

Surely, a necessary (but not sufficient) ability for an LLM that can automatically improve frontier techniques is the ability to *reproduce* known innovations on GPT-2, a tiny language model from over 5 years ago. 🤔

So we took several of the top models and combined them with various search scaffolds to create *LLM speedrunner agents*. We then asked these agents to reproduce each of the NanoGPT speedrun records, starting from the previous record, while providing them access to different forms of hints that revealed the exact changes needed to reach the next record.

The results were surprising: not because we thought these agents would ace the benchmark, but because even the best agent failed to recover even half of the speed-up of human innovators on average in the easiest hint mode, where we show the agent the full pseudocode of the changes to the next record.

We believe The Automated LLM Speedrunning Benchmark provides a simple eval for measuring the lower bound of LLM agents' ability to reproduce scientific findings close to the frontier of ML. Beyond scientific reproducibility, this benchmark can also be run without hints, transforming into an automated *scientific innovation* benchmark. When run in "innovation mode," this benchmark effectively extends the NanoGPT speedrun to AI participants!

While initial results here indicate that current agents seriously struggle to match human innovators beyond just a couple of records, benchmarks have a tendency to fall. This one is particularly exciting to watch, as new state-of-the-art here by definition implies a form of *superhuman innovation*.
Minqi Jiang tweet media
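The thread's headline comparison, the fraction of a human record's speed-up that an agent recovers, can be made concrete with a small helper. The exact metric form here is an assumption for illustration, not the paper's definition:

```python
def speedup_recovered(t_prev, t_next, t_agent):
    """Fraction of the human record's improvement recovered by the agent.

    t_prev:  training time of the previous record (the agent's starting point)
    t_next:  training time of the next human record (the target)
    t_agent: training time achieved by the agent's reproduction attempt
    (Assumed metric form, for illustration only.)
    """
    return (t_prev - t_agent) / (t_prev - t_next)

# Hypothetical numbers: a human record improved training time from 45 to
# 30 minutes; an agent starting from the 45-minute code reaches 39 minutes.
print(speedup_recovered(45.0, 30.0, 39.0))  # 0.4, i.e. 40% of the speed-up
```

On a measure like this, "failed to recover even half of the speed-up" means the average across records stays below 0.5 even in the easiest hint mode.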
40 replies · 196 reposts · 1.2K likes · 568.1K views
Karen Hambardzumyan retweeted
Andrei Lupu
Andrei Lupu@_andreilupu·
Theory of Mind (ToM) is crucial for next-gen LLM agents, yet current benchmarks suffer from multiple shortcomings. Enter 💽 Decrypto, an interactive benchmark for multi-agent reasoning and ToM in LLMs! Work done with @TimonWilli & @j_foerst at @AIatMeta & @FLAIR_Ox 🧵👇
4 replies · 27 reposts · 104 likes · 23.2K views
Karen Hambardzumyan retweeted
Wassim (Wes) Bouaziz
Wassim (Wes) Bouaziz@_Vassim·
🚨 New AI Security paper alert: Winter Soldier 🥶🚨 In our latest paper, we show:
- how to backdoor a LM _without_ training it on the backdoor behavior
- how to use that to detect if a black-box LM has been trained on your protected data
Yes, indirect data poisoning is real and powerful!
Wassim (Wes) Bouaziz tweet media
1 reply · 19 reposts · 52 likes · 6.6K views
Karen Hambardzumyan retweeted
Sonia Joseph
Sonia Joseph@soniajoseph_·
Our paper Prisma: An Open Source Toolkit for Mechanistic Interpretability in Vision and Video received an Oral at the Mechanistic Interpretability for Vision Workshop at CVPR 2025! 🎉 We’ll be in Nashville next week. Come say hi 👋 @CVPR @miv_cvpr2025
Sonia Joseph tweet media
3 replies · 30 reposts · 272 likes · 19.7K views
Karen Hambardzumyan retweeted
Virginie Do
Virginie Do@gini_do·
I am at #ICLR and honored to present this work on Saturday afternoon at the poster session. Thanks @jade_lei_yu @mahnerak @nicola_cancedda for this wonderful collaboration! I am also happy to chat about Llama / agents / safety 👋
Lei Yu@jade_lei_yu

New paper! 🎊 We are delighted to announce our new paper "Robust LLM Safeguarding via Refusal Feature Adversarial Training"! There is a common mechanism behind LLM jailbreaking, and it can be leveraged to make models safer!

0 replies · 6 reposts · 28 likes · 2.5K views
Karen Hambardzumyan retweeted
Wassim (Wes) Bouaziz
Wassim (Wes) Bouaziz@_Vassim·
Want to know if an ML model was trained on your dataset with 1 API call? See you at conferences 🙌 Excited to share that our paper Data Taggants for image data was accepted at ICLR 2025 🎉 Our follow-up on audio data was accepted at ICASSP 2025! 🎉 Check out the details below 👇
Wassim (Wes) Bouaziz tweet media
Wassim (Wes) Bouaziz@_Vassim

Want to know if an ML model was trained on your dataset? Introducing ✨Data Taggants✨! We use data poisoning to leave a harmless and stealthy signature on your dataset that radiates through trained models. Learn how to protect your dataset from unauthorized use... A 🧵

1 reply · 13 reposts · 34 likes · 6.3K views
Karen Hambardzumyan retweeted
Eduardo Sánchez
Eduardo Sánchez@eduardosg_ai·
Happy to see that Linguini, our benchmark for language-agnostic linguistic reasoning, has been included in DeepMind’s BIG-Bench Extra Hard (BBEH). Linguini remains challenging for reasoning models, being one of only two (hard) tasks where o3-mini doesn't show massive gains.
Eduardo Sánchez tweet media
1 reply · 3 reposts · 16 likes · 3.1K views
Karen Hambardzumyan retweeted
vmoens
vmoens@VincentMoens·
A few tips I share when I talk about perf with @PyTorch in eager mode (with a focus on small models): 🪢
1 reply · 2 reposts · 42 likes · 2.6K views