Thom Foster

@_tomwithanh

AI Scientists at Oxford and Meta

Joined October 2022

112 Following · 24 Followers

Thom Foster retweeted

Despoina Magka@MarlaMagka·Feb 11

(🧵) Happy to release AIRS-Bench, a benchmark to test the autonomous machine learning abilities of AI research agents 🤖
AIRS-Bench includes 20 tasks sourced from machine learning papers that assess the autonomous research abilities of LLM agents throughout the full research lifecycle, from hypothesis generation 💡 and implementation 🛠️ to experimentation 🧪 and analysis 📊
Each task is extracted from a paper with a state-of-the-art result and consists of: 📝 a problem description (e.g. text similarity), 🗂️ a dataset (e.g. SICK) and 📏 a metric (e.g. Spearman correlation) to optimise over.
The agent is then given a GPU and 24 hours to develop and submit a Python solution that matches or exceeds the paper SOTA 📈
Read on for baseline results and examples of agents surpassing human SOTA 👀
🌱 We open-source the AIRS-Bench task definitions and evaluation code to accelerate progress in autonomous scientific research:
💻 GitHub: github.com/facebookresear…
📜 ArXiv: arxiv.org/pdf/2602.06855
🤗 HF paper: huggingface.co/papers/2602.06…
📊 Meta AI website: ai.meta.com/research/publi…
Huge shoutout to the team from Meta FAIR who painstakingly crafted, debugged and inspected every single one of these tasks and their runs across more than a dozen agents: @alisia_lupidi, @_tomwithanh, @BhavulGauri, @basselralomari, @albertomariape, Alexis Audran-Reiss, Muna Aghamelu, Nicolas Baldwin, @LuciaCKun, @GagnonAudet, Chee Hau Leow, Sandra Lefdal, Abhinav Moudgil, Saba Nazir, Emanuel Tewolde, Isabel Urrego, @mahnerak, @ishitamed, @EdanToledo, and to @rybolos, @alex_h_miller, @j_foerst, @yorambac for their leadership and support

Thom Foster retweeted

Alberto Maria Pepe@albertomariape·Feb 9

AIRS-Bench is out! AIRS-Bench is a suite of 20 challenging ML tasks designed to evaluate LLM agents as AI Research Scientists spanning the full scientific method: from hypothesis generation and experimental design to result validation. Paper: arxiv.org/pdf/2602.06855

Islington, London 🇬🇧

Thom Foster retweeted

Jakob Foerster@j_foerst·Feb 9

🚨TL;DR: Benchmarking for AI Scientists just got better!🚨 Everyone is excited about AI Scientists, but we don't have a large-scale benchmark that evaluates automated (or augmented) AI research systems on the home turf of the machine learning community: Machine Learning benchmarks. Meet AIRS-Bench, our attempt at filling this gap. We hope AIRS-Bench will help the community improve the signal-to-noise ratio in the era of research agents, and that it is an important step towards turning ML benchmarks into standardised tasks for AI research agents. This has implications beyond AI scientists and will also help address the replication crisis in ML. The team has invested countless hours (human and GPU) selecting/constructing the tasks, running baseline agents, analysing the outcomes, and hardening the benchmark. We are excited for the community to both expand on our initial task set and benchmark new agentic systems!

Bhavul Gauri@BhavulGauri

Introducing AIRS-Bench, a benchmark for “AI Researcher Agents”. Agents attempt 20 open ML problems starting from zero code (full research loop). And yes, they beat SOTA in a few cases (read more below!) arxiv.org/abs/2602.06855

Thom Foster retweeted

Bhavul Gauri@BhavulGauri·Feb 9

Thom Foster@_tomwithanh·Jan 25

@thesullivan @ylecun @sama IMO it’s not *just* about funding & brand - the feedback and RLHF data they get from millions of users is also useful

✦Mr. Sullivan@thesullivan·Jan 25

@ylecun @sama It does seem like Google and Meta have been working on large language models for longer than OpenAI. Google invented BERT. Isn't this just OpenAI hurrying a technology to market to pick up momentum/funding?

Sam Altman@sama·Jan 25

can’t we all just get along 🥹

Yann LeCun@ylecun

To be clear: I'm not criticizing OpenAI's work nor their claims. I'm trying to correct a *perception* by the public & the media who see chatGPT as this incredibly new, innovative, & unique technological breakthrough that is far ahead of everyone else. It's just not.
