Despoina Magka

2.1K posts


@MarlaMagka

Software engineer at Meta London, @MetaAI. PhD Artificial Intelligence, Oxford. Tweets in English, Greek, French, German, Spanish. From Athens.

Joined November 2012

655 Following · 633 Followers

Pinned Tweet
Despoina Magka @MarlaMagka
(🧵) Happy to release AIRS-Bench, a benchmark that tests the autonomous machine learning abilities of AI research agents 🤖

AIRS-Bench includes 20 tasks sourced from machine learning papers that assess the autonomous research abilities of LLM agents throughout the full research lifecycle, from hypothesis generation 💡 and implementation 🛠️ to experimentation 🧪 and analysis 📊

Each task is extracted from a paper with a state-of-the-art result and consists of: 📝 a problem description (e.g. text similarity), 🗂️ a dataset (e.g. SICK) and 📏 a metric (e.g. Spearman correlation) to optimise over. The agent is then given a GPU and 24 hours to develop and submit a Python solution that matches or exceeds the paper's SOTA 📈

Read on for baseline results and examples of agents surpassing human SOTA 👀

🌱 We open-source the AIRS-Bench task definitions and evaluation code to accelerate autonomous scientific research:
💻 GitHub: github.com/facebookresear…
📜 ArXiv: arxiv.org/pdf/2602.06855
🤗 HF paper: huggingface.co/papers/2602.06…
📊 Meta AI website: ai.meta.com/research/publi…

Huge shoutout to the team from Meta FAIR who painstakingly crafted, debugged and inspected every single one of these tasks and its runs across more than a dozen agents: @alisia_lupidi, @_tomwithanh, @BhavulGauri, @basselralomari, @albertomariape, Alexis Audran-Reiss, Muna Aghamelu, Nicolas Baldwin, @LuciaCKun, @GagnonAudet, Chee Hau Leow, Sandra Lefdal, Abhinav Moudgil, Saba Nazir, Emanuel Tewolde, Isabel Urrego, @mahnerak, @ishitamed, @EdanToledo, and @rybolos, @alex_h_miller, @j_foerst, @yorambac for their leadership and support
4 replies · 12 retweets · 49 likes · 12.3K views
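To make the task format concrete, here is a minimal sketch of what an AIRS-Bench-style task definition could look like in Python. The TaskSpec structure, its field names and the numeric values are illustrative assumptions for exposition, not the actual schema or numbers from the repo:

```python
from dataclasses import dataclass

# Hypothetical AIRS-Bench-style task definition (field names are assumed,
# not the repo's real schema).
@dataclass
class TaskSpec:
    problem: str        # natural-language problem description
    dataset: str        # dataset the submitted solution is evaluated on
    metric: str         # metric to optimise over
    human_sota: float   # best published result from the source paper
    time_budget_h: int  # wall-clock budget given to the agent
    gpus: int           # compute budget given to the agent

text_similarity = TaskSpec(
    problem="Predict sentence-pair relatedness scores",
    dataset="SICK",
    metric="Spearman correlation",
    human_sota=0.92,    # placeholder value, not the actual paper number
    time_budget_h=24,
    gpus=1,
)
```

The agent's submitted Python solution is then scored on the metric and compared against human_sota.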
Despoina Magka retweeted
Fabian Gloeckle @FabianGloeckle
Worried about Anthropic's Mythos? Fully formally verified code generation is the defense. Combining Lean, frontier models, multi-agent scaffolds, and inference scaling, we show benchmark scores jumping from 20% to 70% in under 12 months. Real-world verification is here. facebookresearch.github.io/wybecoder/ 1/
8 replies · 44 retweets · 211 likes · 26.8K views
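For a flavour of what "fully formally verified code" means, here is a minimal Lean 4 example of my own (not taken from the project, and assuming a recent toolchain where the omega tactic is available): a function shipped together with a machine-checked proof that it meets its specification.

```lean
-- Minimal illustration of formally verified code (not from the project).
def double (n : Nat) : Nat := n + n

-- Machine-checked proof that the implementation meets its specification;
-- `omega` discharges the linear-arithmetic goal n + n = 2 * n.
theorem double_spec (n : Nat) : double n = 2 * n := by
  unfold double
  omega
```

Generated code that ships with such proofs cannot silently violate its stated contract, which is the defense the tweet alludes to.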
Despoina Magka @MarlaMagka
🚀 Happy to see AIRS-Bench, an AI R&D benchmark that Meta open-sourced earlier this year (x.com/MarlaMagka/sta…), being used in the Muse Spark Safety & Preparedness Report to assess loss-of-control risks stemming from the acceleration of AI development.

AIRS-Bench (github.com/facebookresear…) measures the ability of AI agents to execute end-to-end AI R&D across the full research lifecycle, from idea generation 💡 and implementation 🛠️ to experiment analysis 🧪 and iterative refinement 📈

Along with SWE-Bench and MLE-Bench, AIRS-Bench was used to assess the risk of models automating AI R&D work and outpacing governance mechanisms. Our findings suggest that Muse Spark does not substantially contribute to this threat, as it achieves performance superior to human researchers in only 5 out of 20 tasks, and only for a fraction of its attempts 🔍 This is in line with results from comparison models and highlights the models' limitations in executing the complete research lifecycle consistently across a wide range of domains 🤖

Head over to the 158-page report for more detailed results and a wide range of assessments and mitigations under Meta's Advanced AI Scaling Framework 👇
Summer Yue @summeryue0

🚀 Muse Spark Safety & Preparedness Report for Meta AI is out. We start with our pre-deployment assessment under Meta's Advanced AI Scaling Framework, covering chemical and biological, cybersecurity, and loss of control risks. Our assessment flagged potentially elevated chem/bio risk, so we implemented safeguards and validated mitigations before deployment - bringing residual risk to within acceptable levels. Beyond the Framework, we also share findings and early explorations of model behavior (honesty, intent understanding, etc.), jailbreak robustness, eval awareness, and more. We're sharing this report to give a closer look at how we evaluate advanced AI safety. Always more work to do, and we welcome feedback from the community. ai.meta.com/static-resourc…

0 replies · 4 retweets · 6 likes · 1.8K views
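As a rough illustration of how a "beats human SOTA in only 5 out of 20 tasks, and only in some attempts" tally could be computed, here is a small sketch; the task names, scores and the normalized-score convention (human SOTA = 1) are made up for exposition:

```python
# Hypothetical per-task normalized scores across seeds (made-up numbers);
# a normalized score > 1 means that attempt beat the human SOTA.
runs = {
    "time_series_forecasting": [0.82, 1.10, 0.91],
    "text_similarity":         [0.70, 0.75, 0.72],
}

beaten = [task for task, scores in runs.items() if max(scores) > 1.0]
print(f"{len(beaten)}/{len(runs)} tasks beat human SOTA in at least one attempt")
```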
Despoina Magka retweeted
Martin Josifoski @MartinJosifoski
Excited to share AIRA₂, our next-generation AI Research Agents for ML that address key bottlenecks to scaling.

AIRA₂ achieves SoTA on real-world ML tasks from MLE-bench-30 (81.5% vs 72.7%) and exceeds human SoTA on 6/20 diverse AI research tasks from AIRS-Bench (and hacks another 5), while exhibiting strong, predictable scaling properties.

To push the frontier of AI research, we need systems that scale well. Developing AIRA₂, we learned a lot about the bottlenecks and what it takes to resolve them; these insights are already driving our next iteration: 1/
5 replies · 35 retweets · 175 likes · 31.3K views
Despoina Magka retweeted
Jason Weston @jaseweston
🏋️ Thinking Mid-training: RL of Interleaved Reasoning 🎗️

We address the gap between pretraining (no explicit reasoning) and post-training (reasoning-heavy) with an intermediate SFT+RL mid-training phase that teaches models how to think:
- Annotate pretraining data with interleaved thoughts
- SFT mid-training to learn when/what to think alongside the original content
- RL mid-training to optimize reasoning generation, with a grounded reward from future token prediction

Result: a 3.2x improvement on reasoning benchmarks compared to direct RL post-training on base Llama-3-8B, and gains over prior SFT-only training as well. Introducing reasoning earlier makes models better prepared for post-training!

Read more in the blog post: facebookresearch.github.io/RAM/blogs/thin…
9 replies · 71 retweets · 557 likes · 67.6K views
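A rough sketch of what "annotate pretraining data with interleaved thoughts" could look like as a data-formatting step; the <think> delimiters and the one-thought-per-segment pairing are my assumptions, not necessarily the paper's actual format:

```python
# Hypothetical formatting: weave an annotated thought before each original
# text segment, so the model learns when/what to think mid-stream.
def interleave_thoughts(segments, thoughts):
    parts = []
    for thought, segment in zip(thoughts, segments):
        parts.append(f"<think>{thought}</think>")  # inserted reasoning
        parts.append(segment)                      # original content preserved
    return "".join(parts)

doc = interleave_thoughts(
    ["The capital of France is Paris.", "It lies on the Seine."],
    ["Recall basic geography.", "Link the city to its river."],
)
print(doc)
```

SFT mid-training would then train on such documents, and RL mid-training would reward thoughts by how much they improve prediction of the tokens that follow them.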
Despoina Magka retweeted
Rui Hou @magpie_rayhou
We are releasing our new model Muse Spark today, our first step towards personal superintelligence, after nine months of great team effort! Please try it out and tell us what you think!
Alexandr Wang @alexandr_wang

1/ today we're releasing muse spark, the first model from MSL. nine months ago we rebuilt our ai stack from scratch. new infrastructure, new architecture, new data pipelines. muse spark is the result of that work, and now it powers meta ai. 🧵

10 replies · 12 retweets · 93 likes · 12.1K views
Despoina Magka retweeted
Alexandr Wang @alexandr_wang
1/ today we're releasing muse spark, the first model from MSL. nine months ago we rebuilt our ai stack from scratch. new infrastructure, new architecture, new data pipelines. muse spark is the result of that work, and now it powers meta ai. 🧵
727 replies · 1.2K retweets · 10.3K likes · 4.5M views
Despoina Magka retweeted
Andrej Karpathy @karpathy
I packaged up the "autoresearch" project into a new self-contained minimal repo if people would like to play over the weekend. It's basically the nanochat LLM training core stripped down to a single-GPU, one-file version of ~630 lines of code, then:
- the human iterates on the prompt (.md)
- the AI agent iterates on the training code (.py)

The goal is to engineer your agents to make the fastest research progress indefinitely and without any of your own involvement. In the image, every dot is a complete LLM training run that lasts exactly 5 minutes. The agent works in an autonomous loop on a git feature branch and accumulates git commits to the training script as it finds better settings (lower validation loss by the end) of the neural network architecture, the optimizer, all the hyperparameters, etc. You can imagine comparing the research progress of different prompts, different agents, etc.

github.com/karpathy/autor…

Part code, part sci-fi, and a pinch of psychosis :)
1.1K replies · 3.7K retweets · 28.4K likes · 11M views
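The loop Karpathy describes is easy to picture in code; this sketch is my own paraphrase (agent_edit_script and run_training are hypothetical stand-ins, not the repo's API), keeping the commit-on-improvement policy from the tweet:

```python
import subprocess

def autoresearch_loop(agent_edit_script, run_training, iterations=100):
    """Agent autonomously iterates on train.py on a git feature branch,
    committing whenever a ~5-minute run improves validation loss."""
    best_val_loss = float("inf")
    for i in range(iterations):
        agent_edit_script("train.py")         # agent rewrites the training code
        val_loss = run_training("train.py")   # one complete short training run
        if val_loss < best_val_loss:          # keep only improvements
            best_val_loss = val_loss
            subprocess.run(
                ["git", "commit", "-am", f"iter {i}: val loss {val_loss:.4f}"],
                check=True,
            )
    return best_val_loss
```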
Despoina Magka @MarlaMagka
(Last/🧵) Overall, our results indicate high variability in task performance, with both the LLM and the scaffold of the agent playing a role. For most tasks, even the best-performing agent is still significantly behind the human SOTA, showing that AIRS-Bench is far from saturated. We hope AIRS-Bench will help identify gaps and accelerate progress in the development of AI research agents. We welcome further agent submissions from the agentic AI research community, especially work built on open components (both the scaffold and the LLM) that can be inspected and extended end to end 🌍🤝🧑‍💻
0 replies · 0 retweets · 4 likes · 183 views
Despoina Magka @MarlaMagka
(7/🧵) Among our runs, we encountered cases where the agent's performance, at least for some of the seeds, was higher than the reported human SOTA, i.e. had a normalized score greater than 1. Here you can see one such success, achieved by the CWM-Greedy agent on a time series forecasting task
1 reply · 0 retweets · 2 likes · 177 views
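For context on "normalized score greater than 1": one plausible normalization (my assumption; the paper defines the exact formula) scales the agent's raw metric so that the human SOTA maps to 1, inverting the ratio for metrics where lower is better:

```python
# Assumed normalization for illustration: human SOTA maps to 1, and a
# value > 1 means the agent exceeded the human result.
def normalized_score(agent_raw, human_sota, higher_is_better=True):
    if higher_is_better:
        return agent_raw / human_sota
    return human_sota / agent_raw  # invert for losses/error metrics

# e.g. an agent reaching 0.95 Spearman against a 0.92 human SOTA:
print(normalized_score(0.95, 0.92))  # ≈ 1.03, i.e. above human SOTA
```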
Despoina Magka retweeted
Peter O'Hearn @PeterOHearn12
LLMs vs the Halting Problem. (Why, what, and where this is going.) We recently released a paper on this; link to follow. A few comments here for context.

Why? With the excitement around LLM "reasoning", we thought: why not try LLMs on the first-ever code reasoning task, the halting problem. Turing's proof of undecidability established fundamental limits. Fun bit: no matter how "superintelligent" AI becomes, this is a problem it can never perfectly solve.

Where to get data to measure? SV-COMP. Verification researchers have, through their insight and hard work, curated several thousand example C programs. They run dedicated tools over this dataset in an annual competition. This is, in a sense, the home turf of symbolic methods.

We didn't know how LLMs would do, and in particular were aware of results from @rao2z, @RishiHazra95 and others showing that LLMs trail symbolic methods on "easier" decidable problems (SAT, propositional planning).

The surprise: LLMs are competitive on halting, even though they often trail on those "easier" problems. Why? Hypothesis: LLMs are heuristic approximators; for undecidable problems, heuristic approximation isn't just a workaround, it's often the only way forward.

Broader context: Penrose claimed undecidability proved AI is impossible (but didn't show humans can solve the undecidable). Turning the tables: undecidability is an ideal target for heuristic LLMs. Instead of using "already crushed" logic problems to show LLM limits, let's look at uncrushed problems where LLMs might actually help.
4 replies · 12 retweets · 55 likes · 5.4K views
Despoina Magka retweeted
Jakob Foerster @j_foerst
🚨 TL;DR: Benchmarking for AI Scientists just got better! 🚨

Everyone is excited about AI Scientists, but we don't have a large-scale benchmark that evaluates automated (or augmented) AI research systems on the home turf of the machine learning community: machine learning benchmarks. Meet AIRS-Bench, our attempt at filling this gap.

We hope AIRS-Bench will help the community improve the signal-to-noise ratio in the era of research agents; it is an important step towards turning ML benchmarks into standardised tasks for AI research agents. This has implications beyond AI Scientists and will also help address the replication crisis in ML.

The team has invested countless hours (human and GPU) selecting/constructing the tasks, running baseline agents, analysing the outcomes, and hardening the benchmark. We are excited for the community to both expand on our initial task set and benchmark new agentic systems!
Bhavul Gauri @BhavulGauri

Introducing AIRS-Bench, a benchmark for "AI Researcher Agents". Agents attempt 20 open ML problems starting from zero code (the full research loop). And yes, they beat SOTA in a few cases (read more below!) arxiv.org/abs/2602.06855

2 replies · 10 retweets · 70 likes · 12.2K views
Despoina Magka retweeted
Bhavul Gauri @BhavulGauri
Introducing AIRS-Bench, a benchmark for "AI Researcher Agents". Agents attempt 20 open ML problems starting from zero code (the full research loop). And yes, they beat SOTA in a few cases (read more below!) arxiv.org/abs/2602.06855
4 replies · 12 retweets · 73 likes · 16.4K views
Despoina Magka retweeted
Yoram Bachrach @yorambac
Can AI agents beat humans at frontier ML research? We are introducing AIRS-Bench, asking agents to beat human SOTA on 20 research tasks from recent ML papers. Check out the results: arxiv.org/abs/2602.06855
5 replies · 19 retweets · 106 likes · 9K views