Despoina Magka

2.1K posts


@MarlaMagka

Software engineer at Meta London, @MetaAI. PhD Artificial Intelligence, Oxford. Tweets in English, Greek, French, German, Spanish. From Athens.

Joined November 2012

655 Following · 633 Followers

Pinned Tweet
Despoina Magka @MarlaMagka
(🧵) Happy to release AIRS-Bench, a benchmark that tests the autonomous machine learning abilities of AI research agents 🤖

AIRS-Bench includes 20 tasks sourced from machine learning papers that assess the autonomous research abilities of LLM agents throughout the full research lifecycle, from hypothesis generation 💡 and implementation 🛠️ to experimentation 🧪 and analysis 📊

Each task is extracted from a paper with a state-of-the-art result and consists of: 📝 a problem description (e.g. text similarity), 🗂️ a dataset (e.g. SICK) and 📏 a metric (e.g. Spearman correlation) to optimise over. The agent is then given a GPU and 24 hours to develop and submit a Python solution that matches or exceeds the paper's SOTA 📈

Read on for baseline results and examples of agents surpassing human SOTA 👀

🌱 We open-source the AIRS-Bench task definitions and evaluation code to accelerate autonomous scientific research:
💻 GitHub: github.com/facebookresear…
📜 ArXiv: arxiv.org/pdf/2602.06855
🤗 HF paper: huggingface.co/papers/2602.06…
📊 Meta AI website: ai.meta.com/research/publi…

Huge shoutout to the team from Meta FAIR who painstakingly crafted, debugged and inspected every single one of these tasks and its runs across more than a dozen agents: @alisia_lupidi, @_tomwithanh, @BhavulGauri, @basselralomari, @albertomariape, Alexis Audran-Reiss, Muna Aghamelu, Nicolas Baldwin, @LuciaCKun, @GagnonAudet, Chee Hau Leow, Sandra Lefdal, Abhinav Moudgil, Saba Nazir, Emanuel Tewolde, Isabel Urrego, @mahnerak, @ishitamed, @EdanToledo, and @rybolos, @alex_h_miller, @j_foerst, @yorambac for their leadership and support
4 replies · 12 retweets · 49 likes · 12.3K views
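To make the task format concrete, here is a minimal sketch of what an AIRS-Bench-style task definition could look like in Python. The TaskSpec structure, its field names and the numeric values are illustrative assumptions for exposition, not the actual schema or numbers from the repo:

```python
from dataclasses import dataclass

# Hypothetical AIRS-Bench-style task definition (field names are assumed,
# not the repo's real schema).
@dataclass
class TaskSpec:
    problem: str        # natural-language problem description
    dataset: str        # dataset the submitted solution is evaluated on
    metric: str         # metric to optimise over
    human_sota: float   # best published result from the source paper
    time_budget_h: int  # wall-clock budget given to the agent
    gpus: int           # compute budget given to the agent

text_similarity = TaskSpec(
    problem="Predict sentence-pair relatedness scores",
    dataset="SICK",
    metric="Spearman correlation",
    human_sota=0.92,    # placeholder value, not the actual paper number
    time_budget_h=24,
    gpus=1,
)
```

The agent's submitted Python solution is then scored on the metric and compared against human_sota.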
Despoina Magka retweeted
Fabian Gloeckle @FabianGloeckle
Worried about Anthropic's Mythos? Fully formally verified code generation is the defense. Combining Lean, frontier models, multi-agent scaffolds, and inference scaling, we show benchmark scores jumping from 20% to 70% in under 12 months. Real-world verification is here. facebookresearch.github.io/wybecoder/ 1/
8 replies · 44 retweets · 211 likes · 26.8K views
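For a flavour of what "fully formally verified code" means, here is a minimal Lean 4 example of my own (not taken from the project, and assuming a recent toolchain where the omega tactic is available): a function shipped together with a machine-checked proof that it meets its specification.

```lean
-- Minimal illustration of formally verified code (not from the project).
def double (n : Nat) : Nat := n + n

-- Machine-checked proof that the implementation meets its specification;
-- `omega` discharges the linear-arithmetic goal n + n = 2 * n.
theorem double_spec (n : Nat) : double n = 2 * n := by
  unfold double
  omega
```

Generated code that ships with such proofs cannot silently violate its stated contract, which is the defense the tweet alludes to.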
Despoina Magka @MarlaMagka
🚀 Happy to see AIRS-Bench, an AI R&D benchmark that Meta open-sourced earlier this year (x.com/MarlaMagka/sta…), being used in the Muse Spark Safety & Preparedness Report to assess loss-of-control risks stemming from the acceleration of AI development.

AIRS-Bench (github.com/facebookresear…) measures the ability of AI agents to execute end-to-end AI R&D across the full research lifecycle, from idea generation 💡 and implementation 🛠️ to experiment analysis 🧪 and iterative refinement 📈

Along with SWE-Bench and MLE-Bench, AIRS-Bench was used to assess the risk of models automating AI R&D work and outpacing governance mechanisms. Our findings suggest that Muse Spark does not substantially contribute to this threat, as it achieves performance superior to human researchers in only 5 out of 20 tasks, and only for a fraction of its attempts 🔍 This is in line with results from comparison models and highlights the models' limitations in executing the complete research lifecycle consistently across a wide range of domains 🤖

Head over to the 158-page report for more detailed results and a wide range of assessments and mitigations under Meta's Advanced AI Scaling Framework 👇
Summer Yue @summeryue0

🚀 Muse Spark Safety & Preparedness Report for Meta AI is out. We start with our pre-deployment assessment under Meta's Advanced AI Scaling Framework, covering chemical and biological, cybersecurity, and loss of control risks. Our assessment flagged potentially elevated chem/bio risk, so we implemented safeguards and validated mitigations before deployment - bringing residual risk to within acceptable levels. Beyond the Framework, we also share findings and early explorations of model behavior (honesty, intent understanding, etc.), jailbreak robustness, eval awareness, and more. We're sharing this report to give a closer look at how we evaluate advanced AI safety. Always more work to do, and we welcome feedback from the community. ai.meta.com/static-resourc…

0 replies · 4 retweets · 6 likes · 1.8K views
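As a rough illustration of how a "beats human SOTA in only 5 out of 20 tasks, and only in some attempts" tally could be computed, here is a small sketch; the task names, scores and the normalized-score convention (human SOTA = 1) are made up for exposition:

```python
# Hypothetical per-task normalized scores across seeds (made-up numbers);
# a normalized score > 1 means that attempt beat the human SOTA.
runs = {
    "time_series_forecasting": [0.82, 1.10, 0.91],
    "text_similarity":         [0.70, 0.75, 0.72],
}

beaten = [task for task, scores in runs.items() if max(scores) > 1.0]
print(f"{len(beaten)}/{len(runs)} tasks beat human SOTA in at least one attempt")
```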
Despoina Magka retweeted
Martin Josifoski @MartinJosifoski
Excited to share AIRA₂, our next-generation AI Research Agents for ML that address key bottlenecks to scaling.

AIRA₂ achieves SoTA on real-world ML tasks from MLE-bench-30 (81.5% vs 72.7%) and exceeds human SoTA on 6/20 diverse AI research tasks from AIRS-Bench (and hacks another 5), while exhibiting strong, predictable scaling properties.

To push the frontier of AI research, we need systems that scale well. Developing AIRA₂, we learned a lot about the bottlenecks and what it takes to resolve them; these insights are already driving our next iteration: 1/
5 replies · 35 retweets · 175 likes · 31.3K views
Despoina Magka retweeted
Jason Weston @jaseweston
🏋️ Thinking Mid-training: RL of Interleaved Reasoning 🎗️

We address the gap between pretraining (no explicit reasoning) and post-training (reasoning-heavy) with an intermediate SFT+RL mid-training phase that teaches models how to think:
- Annotate pretraining data with interleaved thoughts
- SFT mid-training to learn when/what to think alongside the original content
- RL mid-training to optimize reasoning generation, with a grounded reward from future token prediction

Result: a 3.2x improvement on reasoning benchmarks compared to direct RL post-training on base Llama-3-8B, and gains over prior SFT-only training as well. Introducing reasoning earlier makes models better prepared for post-training!

Read more in the blog post: facebookresearch.github.io/RAM/blogs/thin…
9 replies · 71 retweets · 557 likes · 67.6K views
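A rough sketch of what "annotate pretraining data with interleaved thoughts" could look like as a data-formatting step; the <think> delimiters and the one-thought-per-segment pairing are my assumptions, not necessarily the paper's actual format:

```python
# Hypothetical formatting: weave an annotated thought before each original
# text segment, so the model learns when/what to think mid-stream.
def interleave_thoughts(segments, thoughts):
    parts = []
    for thought, segment in zip(thoughts, segments):
        parts.append(f"<think>{thought}</think>")  # inserted reasoning
        parts.append(segment)                      # original content preserved
    return "".join(parts)

doc = interleave_thoughts(
    ["The capital of France is Paris.", "It lies on the Seine."],
    ["Recall basic geography.", "Link the city to its river."],
)
print(doc)
```

SFT mid-training would then train on such documents, and RL mid-training would reward thoughts by how much they improve prediction of the tokens that follow them.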
Despoina Magka retweeted
Rui Hou @magpie_rayhou
We are releasing our new model Muse Spark today, our first step towards personal superintelligence, after nine months of great team effort! Please try it out and tell us what you think!
Alexandr Wang @alexandr_wang

1/ today we're releasing muse spark, the first model from MSL. nine months ago we rebuilt our ai stack from scratch. new infrastructure, new architecture, new data pipelines. muse spark is the result of that work, and now it powers meta ai. 🧵

10 replies · 12 retweets · 93 likes · 12.1K views
Despoina Magka retweeted
Alexandr Wang @alexandr_wang
1/ today we're releasing muse spark, the first model from MSL. nine months ago we rebuilt our ai stack from scratch. new infrastructure, new architecture, new data pipelines. muse spark is the result of that work, and now it powers meta ai. 🧵
727 replies · 1.2K retweets · 10.3K likes · 4.5M views
Despoina Magka retweeted
Andrej Karpathy @karpathy
I packaged up the "autoresearch" project into a new self-contained minimal repo if people would like to play over the weekend. It's basically the nanochat LLM training core stripped down to a single-GPU, one-file version of ~630 lines of code, then:
- the human iterates on the prompt (.md)
- the AI agent iterates on the training code (.py)

The goal is to engineer your agents to make the fastest research progress indefinitely and without any of your own involvement. In the image, every dot is a complete LLM training run that lasts exactly 5 minutes. The agent works in an autonomous loop on a git feature branch and accumulates git commits to the training script as it finds better settings (lower validation loss by the end) of the neural network architecture, the optimizer, all the hyperparameters, etc. You can imagine comparing the research progress of different prompts, different agents, etc.

github.com/karpathy/autor…

Part code, part sci-fi, and a pinch of psychosis :)
1.1K replies · 3.7K retweets · 28.4K likes · 11M views
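The loop Karpathy describes is easy to picture in code; this sketch is my own paraphrase (agent_edit_script and run_training are hypothetical stand-ins, not the repo's API), keeping the commit-on-improvement policy from the tweet:

```python
import subprocess

def autoresearch_loop(agent_edit_script, run_training, iterations=100):
    """Agent autonomously iterates on train.py on a git feature branch,
    committing whenever a ~5-minute run improves validation loss."""
    best_val_loss = float("inf")
    for i in range(iterations):
        agent_edit_script("train.py")         # agent rewrites the training code
        val_loss = run_training("train.py")   # one complete short training run
        if val_loss < best_val_loss:          # keep only improvements
            best_val_loss = val_loss
            subprocess.run(
                ["git", "commit", "-am", f"iter {i}: val loss {val_loss:.4f}"],
                check=True,
            )
    return best_val_loss
```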
Despoina Magka @MarlaMagka
(Last/🧵) Overall, our results indicate high variability in task performance, with both the LLM and the scaffold of the agent playing a role. For most tasks, even the best-performing agent is still significantly behind the human SOTA, showing that AIRS-Bench is far from saturated. We hope AIRS-Bench will help identify gaps and accelerate progress in the development of AI research agents. We welcome further agent submissions from the agentic AI research community, especially work built on open components (both the scaffold and the LLM) that can be inspected and extended end to end 🌍🤝🧑‍💻
0 replies · 0 retweets · 4 likes · 183 views
Despoina Magka @MarlaMagka
(7/🧵) Among our runs, we encountered cases where the agent's performance, at least for some of the seeds, was higher than the reported human SOTA, i.e. had a normalized score greater than 1. Here you can see one such success, achieved by the CWM-Greedy agent on a time series forecasting task
1 reply · 0 retweets · 2 likes · 177 views
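For context on "normalized score greater than 1": one plausible normalization (my assumption; the paper defines the exact formula) scales the agent's raw metric so that the human SOTA maps to 1, inverting the ratio for metrics where lower is better:

```python
# Assumed normalization for illustration: human SOTA maps to 1, and a
# value > 1 means the agent exceeded the human result.
def normalized_score(agent_raw, human_sota, higher_is_better=True):
    if higher_is_better:
        return agent_raw / human_sota
    return human_sota / agent_raw  # invert for losses/error metrics

# e.g. an agent reaching 0.95 Spearman against a 0.92 human SOTA:
print(normalized_score(0.95, 0.92))  # ≈ 1.03, i.e. above human SOTA
```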
Despoina Magka retweeted
Peter O'Hearn @PeterOHearn12
LLMs vs the Halting Problem. (Why, what, and where this is going.) We recently released a paper on this; link to follow. A few comments here for context.

Why? With the excitement around LLM "reasoning", we thought: why not try LLMs on the first-ever code reasoning task, the halting problem. Turing's proof of undecidability established fundamental limits. Fun bit: no matter how "superintelligent" AI becomes, this is a problem it can never perfectly solve.

Where to get data to measure? SV-COMP. Verification researchers have, through their insight and hard work, curated several thousand example C programs. They run dedicated tools over this dataset in an annual competition. This is, in a sense, the home turf of symbolic methods.

We didn't know how LLMs would do, and in particular were aware of results from @rao2z, @RishiHazra95 and others showing that LLMs trail symbolic methods on "easier" decidable problems (SAT, propositional planning).

The surprise: LLMs are competitive on halting, even though they often trail on those "easier" problems. Why? Hypothesis: LLMs are heuristic approximators; for undecidable problems, heuristic approximation isn't just a workaround, it's often the only way forward.

Broader context: Penrose claimed undecidability proved AI is impossible (but didn't show humans can solve the undecidable). Turning the tables: undecidability is an ideal target for heuristic LLMs. Instead of using "already crushed" logic problems to show LLM limits, let's look at uncrushed problems where LLMs might actually help.
4 replies · 12 retweets · 55 likes · 5.4K views
Despoina Magka retweeted
Jakob Foerster @j_foerst
🚨 TL;DR: Benchmarking for AI Scientists just got better! 🚨

Everyone is excited about AI Scientists, but we don't have a large-scale benchmark that evaluates automated (or augmented) AI research systems on the home turf of the machine learning community: machine learning benchmarks. Meet AIRS-Bench, our attempt at filling this gap.

We hope AIRS-Bench will help the community improve the signal-to-noise ratio in the era of research agents; it is an important step towards turning ML benchmarks into standardised tasks for AI research agents. This has implications beyond AI Scientists and will also help address the replication crisis in ML.

The team has invested countless hours (human and GPU) selecting/constructing the tasks, running baseline agents, analysing the outcomes, and hardening the benchmark. We are excited for the community to both expand on our initial task set and benchmark new agentic systems!
Bhavul Gauri @BhavulGauri

Introducing AIRS-Bench, a benchmark for "AI Researcher Agents". Agents attempt 20 open ML problems starting from zero code (the full research loop). And yes, they beat SOTA in a few cases (read more below!) arxiv.org/abs/2602.06855

2 replies · 10 retweets · 70 likes · 12.2K views
Despoina Magka retweeted
Bhavul Gauri @BhavulGauri
Introducing AIRS-Bench, a benchmark for "AI Researcher Agents". Agents attempt 20 open ML problems starting from zero code (the full research loop). And yes, they beat SOTA in a few cases (read more below!) arxiv.org/abs/2602.06855
4 replies · 12 retweets · 73 likes · 16.4K views
Despoina Magka retweeted
Yoram Bachrach @yorambac
Can AI agents beat humans at frontier ML research? We are introducing AIRS-Bench, asking agents to beat human SOTA on 20 research tasks from recent ML papers. Check out the results: arxiv.org/abs/2602.06855
5 replies · 19 retweets · 106 likes · 9K views