Alex Ratner

1.8K posts

@ajratner

@SnorkelAI @uwcse / prev @StanfordAILab – Interested in data management systems for machine learning, weak supervision, and impactful applications.

Menlo Park, CA · Joined November 2013

691 Following · 6.7K Followers

Pinned Tweet

Alex Ratner@ajratner·16 Feb

This week we launched the Open Benchmarks Grant with a $3M initial commitment from @SnorkelAI + partner support from @huggingface @togethercompute @PrimeIntellect @PyTorch @harborframework & others, in order to close the evaluation gap in AI. Our ability to measure AI has been outpaced by our ability to develop it - and open benchmarks are one of several critical, complementary tools to fix this. We're particularly interested in novel benchmarks that push and probe the frontier along three key vectors: (1) Environment complexity --> E.g. complex, domain-specific context and tool/action spaces, human interaction, world modeling (2) Autonomy horizon --> E.g. long horizon, non-stationary goals (3) Output complexity --> E.g. complex outputs with nuanced, rubric-based evaluation / reward signals. Check out more detail + link to apply here! benchmarks.snorkel.ai

7.6K

Alex Ratner retweeted

Snorkel AI@SnorkelAI·4d

Live from MLSys 2026! Thanks to everyone who joined @pham_derek's talk yesterday on RLVR in low-data, low-compute regimes and swung by our poster session. Paper: arxiv.org/abs/2604.18381 Around tonight? Unwind after the conference with drinks, swing suites, and the team behind the paper. Last chance to RSVP ⛳: luma.com/mlsys2026-snor… @vincentsunnchen @ArminPCM @realjustinbauer

1.3K

Alex Ratner@ajratner·5d

Congratulations @ravirajjain @ravi_lsvp !!! They have been incredible partners to @SnorkelAI from day one, and at every stage after that. Well deserved recognition!!

Lightspeed@lightspeedvp

Congratulations to @ravi_lsvp, @ravirajjain, and @buckymoore on their recognition in the Seed 100 List! The Seed 100 List from @businessinsider highlights early-stage investors with a unique ability to scout the tech stars of tomorrow. Amid the AI boom, the competitiveness and speed of investors getting in before the “seed stage” as we know it have been reinforced. This is the Seed 100’s sixth year, and it is an honor to have 3 Lightspeed team members acknowledged on the list. Early-stage investing has been wired into our team’s DNA for over 26 years. And we are incredibly proud to have backed many teams from their Seed rounds and beyond. As Ravi puts it: "The founders Lightspeed backs don't extrapolate from the present; they derive from first principles and arrive at futures others haven't thought to look for.”

1.6K

Alex Ratner@ajratner·5d

Extremely excited for Terminal-Bench Science, which we're proud to support via our Open Benchmarks Grants @SnorkelAI !

Steven Dillmann@StevenDillmann

📣 Announcing Terminal-Bench Science: benchmarking AI agents on real scientific workflows – now open for task contributions👇 tbench.ai/news/tb-scienc… @AnthropicAI, @OpenAI, and @GoogleDeepMind use Terminal-Bench to evaluate AI on coding tasks. We're now extending it to scientific workflows. 1/6🧵

2.4K

Alex Ratner retweeted

Steven Dillmann@StevenDillmann·5d

111

479

898.6K

Alex Ratner retweeted

Chris Sniffen@SniffenOutAI·15 May

Sat down with my friend Rezaur @intellgenc (CIO / CISO / CAIO at the @usachp) for a long conversation on building frontier AI for federal infrastructure. Among his projects: "We're working with Google Public Sector and @SnorkelAI on a geospatial deep-research, AI-native system. And since that wasn't challenging enough, a world-simulation system to model real-world impacts on large infrastructure projects." We get into the limits of frontier models, mechanistic interpretability for applied AI, and why Rezaur wants more entropy from his models (not less). 00:26 Building AI-native, not bolt-on 05:18 Why one model can't do geospatial AI 08:23 "I want more entropy, not less" 10:28 Inside the model: mechanistic interpretability 15:32 Externalizing memory: context, files, graphs 26:01 The RGB-pixel trick for sensor data 28:24 The geospatial benchmark gap 31:11 When a frontier model hallucinated an Iran-backed attack

394

Alex Ratner retweeted

Armin Parchami@ArminPCM·14 May

🚀 We're #hiring a Research Scientist – RL Training @SnorkelAI. We need someone who's actually RLFT'd agents using complex environments (e.g. SWE-Bench/Terminal-Bench). Deep hands-on experience with GRPO, RLHF, DPO, reward modeling & frameworks like verl/SkyRL. 30B+ scale and deep expertise in RL algorithms. Come build SOTA coding agents with us! 📍 RWC / SF / NYC / Remote - US #ML #ReinforcementLearning #PostTraining

165

24.2K

Alex Ratner retweeted

Snorkel AI@SnorkelAI·14 May

Our thanks to Carter Wendelken with @GoogleDeepMind (introduced by our own @qi_zhengyang) and everyone who came out to learn about "Code Synthesis for Agentic Decision-Making: Code World Models and Autoharness"

2.5K

Alex Ratner@ajratner·12 May

@sh_reya Congratulations @sh_reya !!

310

Shreya Shankar@sh_reya·12 May

I'm joining Carnegie Mellon's CS Department (and HCII by courtesy) as an assistant professor in Fall 2027! I'll be recruiting PhD students next cycle. If you're interested in AI systems or human-AI collaboration, list me in your application. Stay tuned for more about my new lab!

120

108

209.9K

Alex Ratner retweeted

Snorkel AI@SnorkelAI·12 May

We’ll be at @AICouncilConf in San Francisco this week (May 12–14). Come say hi if you’re attending 👋 Join us tomorrow from 10:15–11 AM for a workshop with Charles Dickens: Towards Reliable Financial Agents: How a 4B Model Outsmarted a 235B Giant. See you there: snorkel.ai/ai-council.

Alex Ratner@ajratner·10 May

One major factor distorting our perception of AI capabilities: benchmark development now lags behind model development for the first time in AI history. In traditional AI/ML: The rate of benchmark advancement (i.e. labeling a small-to-mid sized dataset) exceeded that of model development - and so benchmarks gave a pretty useful view of frontier capabilities. This made them canonical measures of AI progress. Today: it's very difficult to create benchmarks that properly measure *real world* environments, scenarios, and tasks at the jagged frontier of AI capabilities - which itself has become an exponentially bigger space to measure - and are robust to rapid overfitting. Benchmarks show near-saturated performance - even though models still have real capability gaps in practice. One more reason why accelerating the pace of benchmark development - and doing so with the full power of open, academic communities - is so important!

2.3K

Alex Ratner retweeted

Justin Bauer@realjustinbauer·11 May

Coding agents have moved from tab-complete to teammate. A model suggesting code one line at a time is easy to review. An agent autonomously refactoring your repo and testing its own changes is much harder. Agentic evaluation is the critical bottleneck now. Benchmarks have to be agentic too — multi-step, executable, and scored across the whole trajectory. New blog: snorkel.ai/coding-agents-…

636

Alex Ratner@ajratner·9 May

Come say hi to the @SnorkelAI team at MLSys - and chat with us about the importance of data distributional precision!!

Armin Parchami@ArminPCM

The right data mix can deliver 5x better sample efficiency for RLVR. Our paper "Learning from Less" just got an oral at MLSys '26 — we show that how you compose training data (task complexity, diversity) matters more than how much you throw at the model. The @SnorkelAI research team is presenting next week in Bellevue. If you're at MLSys, come hang! We're hosting an after-hours social on May 21.

1.2K

Alex Ratner retweeted

Kelly Buchanan@ekellbuch·7 May

Very excited to release Terminal-Bench 2.1! Coding agents are among the most economically consequential deployments of LLMs to date. As agents improve, benchmark reliability matters more. We audited TB2.0 and found and corrected issues in 28/89 tasks. 30% of the benchmark! But the rankings survived, absolute scores moved up to 12pp!

768

84.3K

Alex Ratner@ajratner·6 May

Proud to support our partner @harvey 's awesome new open source benchmark LAB!! LAB pushes the frontier of agentic legal evaluation - promoting transparency, safety, and core progress for AI usage in real world legal work. Excited for what's ahead! 🚀

Gabe Pereyra@gabepereyra

x.com/i/article/2051…

1.1K

Alex Ratner retweeted

John Yang@jyangballin·5 May

How much of SQLite, FFmpeg, PHP compiler can LMs code from scratch? Given just an executable and no starter code or internet access. Introducing ProgramBench: 200 rigorous, whole-repo generation tasks where models design, build, and ship a working program end to end. 🧵

103

246

1.6K

718.6K

Alex Ratner retweeted

Gabe Pereyra@gabepereyra·6 May

x.com/i/article/2051…

371

676K

Alex Ratner retweeted

Armin Parchami@ArminPCM·6 May

Huge congrats to @harvey on open-sourcing LAB, a serious benchmark for long-horizon #legal #agents (1,200+ tasks, 24 practice areas). Proud our team @SnorkelAI got to contribute as a research and expert data partner, helping bring practice-area experts into every part of the benchmark. More to come from our team on open #benchmarks soon! Harvey's Blog Post: harvey.ai/blog/introduci…

1.7K

Alex Ratner retweeted

Alex Ratner@ajratner·4 May

Continual learning has been gaining heat as a buzzword of late... but what you can't measure, you can't properly study or improve. Excited to collaborate w/ @pgasawa @matei_zaharia @profjoeyg & team on one of the first benchmarks of *continual learning* across task sequences!

Parth Asawa@pgasawa

Today, we’re releasing Continual Learning Bench 1.0: the first, realistic benchmark for measuring how AI systems can improve in online settings. Benchmarks today assume models are stateless. Each example is independent, and once a system finishes a task, it moves on as if nothing happened. But deployed AI systems should learn from experience. We tested 10+ frontier systems against novel, expert-validated tasks and find there’s still plenty of headroom for learning. (1/n)

2.1K

Alex Ratner retweeted

Parth Asawa@pgasawa·4 May

We’d like to especially thank @SnorkelAI for their support via the Open Benchmarks Grants program and @LaudeInstitute for their support via the Laude Slingshots program. (12/n)

3.5K
