Alex Ratner

1.8K posts

@ajratner

@SnorkelAI @uwcse / prev @StanfordAILab – Interested in data management systems for machine learning, weak supervision, and impactful applications.

Menlo Park, CA · Joined November 2013
691 Following · 6.7K Followers
Pinned Tweet
Alex Ratner @ajratner
This week we launched the Open Benchmarks Grant with a $3M initial commitment from @SnorkelAI + partner support from @huggingface @togethercompute @PrimeIntellect @PyTorch @harborframework & others, in order to close the evaluation gap in AI.
Our ability to measure AI has been outpaced by our ability to develop it - and open benchmarks are one of several critical, complementary tools to fix this.
We're particularly interested in novel benchmarks that push and probe the frontier along three key vectors:
(1) Environment complexity --> E.g. complex, domain-specific context and tool/action spaces, human interaction, world modeling
(2) Autonomy horizon --> E.g. long horizon, non-stationary goals
(3) Output complexity --> E.g. complex outputs with nuanced, rubric-based evaluation / reward signals
Check out more detail + link to apply here! benchmarks.snorkel.ai
1
7
45
7.6K
Alex Ratner @ajratner
Congratulations @ravirajjain @ravi_lsvp !!! They have been incredible partners to @SnorkelAI from day one, and at every stage after that. Well deserved recognition!!
Lightspeed @lightspeedvp

Congratulations to @ravi_lsvp, @ravirajjain, and @buckymoore on their recognition in the Seed 100 List! The Seed 100 List from @businessinsider highlights early-stage investors with a unique ability to scout the tech stars of tomorrow. Amid the AI boom, the competitiveness and speed of investors getting in before the “seed stage” as we know it have been reinforced. This is the Seed 100’s sixth year, and it is an honor to have 3 Lightspeed team members acknowledged on the list. Early-stage investing has been wired into our team’s DNA for over 26 years. And we are incredibly proud to have backed many teams from their Seed rounds and beyond. As Ravi puts it: “The founders Lightspeed backs don't extrapolate from the present; they derive from first principles and arrive at futures others haven't thought to look for.”

1
0
12
1.6K
Alex Ratner @ajratner
Extremely excited for Terminal-Bench Science, which we're proud to support via our Open Benchmarks Grants @SnorkelAI !
Steven Dillmann @StevenDillmann

📣 Announcing Terminal-Bench Science: benchmarking AI agents on real scientific workflows – now open for task contributions👇 tbench.ai/news/tb-scienc… @AnthropicAI, @OpenAI, and @GoogleDeepMind use Terminal-Bench to evaluate AI on coding tasks. We're now extending it to scientific workflows. 1/6🧵

1
3
20
2.4K
Alex Ratner retweeted
Steven Dillmann @StevenDillmann
📣 Announcing Terminal-Bench Science: benchmarking AI agents on real scientific workflows – now open for task contributions👇 tbench.ai/news/tb-scienc… @AnthropicAI, @OpenAI, and @GoogleDeepMind use Terminal-Bench to evaluate AI on coding tasks. We're now extending it to scientific workflows. 1/6🧵
15
111
479
898.6K
Alex Ratner retweeted
Chris Sniffen @SniffenOutAI
Sat down with my friend Rezaur @intellgenc (CIO / CISO / CAIO at the @usachp) for a long conversation on building frontier AI for federal infrastructure. Among his projects: "We're working with Google Public Sector and @SnorkelAI on a geospatial deep-research, AI-native system. And since that wasn't challenging enough, a world-simulation system to model real-world impacts on large infrastructure projects." We get into the limits of frontier models, mechanistic interpretability for applied AI, and why Rezaur wants more entropy from his models (not less).
00:26 Building AI-native, not bolt-on
05:18 Why one model can't do geospatial AI
08:23 "I want more entropy, not less"
10:28 Inside the model: mechanistic interpretability
15:32 Externalizing memory: context, files, graphs
26:01 The RGB-pixel trick for sensor data
28:24 The geospatial benchmark gap
31:11 When a frontier model hallucinated an Iran-backed attack
0
2
8
394
Alex Ratner retweeted
Armin Parchami @ArminPCM
🚀 We're #hiring a Research Scientist – RL Training @SnorkelAI. We need someone who's actually RLFT'd agents using complex environments (e.g. SWE-Bench/Terminal-Bench). Deep hands-on experience with GRPO, RLHF, DPO, reward modeling & frameworks like verl/SkyRL. 30B+ scale and deep expertise in RL algorithms. Come build SOTA coding agents with us! 📍 RWC / SF / NYC / Remote - US #ML #ReinforcementLearning #PostTraining
2
14
165
24.2K
Alex Ratner retweeted
Snorkel AI @SnorkelAI
Our thanks to Carter Wendelken with @GoogleDeepMind (introduced by our own @qi_zhengyang) and everyone who came out to learn about "Code Synthesis for Agentic Decision-Making: Code World Models and Autoharness"
1
4
21
2.5K
Shreya Shankar @sh_reya
I'm joining Carnegie Mellon's CS Department (and HCII by courtesy) as an assistant professor in Fall 2027! I'll be recruiting PhD students next cycle. If you're interested in AI systems or human-AI collaboration, list me in your application. Stay tuned for more about my new lab!
120
108
2K
209.9K
Alex Ratner retweeted
Snorkel AI @SnorkelAI
We’ll be at @AICouncilConf in San Francisco this week (May 12–14). Come say hi if you’re attending 👋 Join us tomorrow from 10:15–11 AM for a workshop with Charles Dickens: Towards Reliable Financial Agents: How a 4B Model Outsmarted a 235B Giant. See you there: snorkel.ai/ai-council.
0
4
13
1K
Alex Ratner @ajratner
One major factor distorting our perception of AI capabilities: benchmark development now lags behind model development for the first time in AI history.
In traditional AI/ML: The rate of benchmark advancement (i.e. labeling a small-to-mid-sized dataset) exceeded that of model development - and so benchmarks gave a pretty useful view of frontier capabilities. This made them canonical measures of AI progress.
Today: it's very difficult to create benchmarks that properly measure *real world* environments, scenarios, and tasks at the jagged frontier of AI capabilities - which itself has become an exponentially bigger space to measure - and are robust to rapid overfitting. Benchmarks show near-saturated performance - even though models still have real capability gaps in practice.
One more reason why accelerating the pace of benchmark development - and doing so with the full power of open, academic communities - is so important!
2
9
37
2.3K
Alex Ratner retweeted
Justin Bauer @realjustinbauer
Coding agents have moved from tab-complete to teammate. A model suggesting code one line at a time is easy to review. An agent autonomously refactoring your repo and testing its own changes is much harder. Agentic evaluation is the critical bottleneck now. Benchmarks have to be agentic too — multi-step, executable, and scored across the whole trajectory. New blog: snorkel.ai/coding-agents-…
3
4
16
636
Alex Ratner @ajratner
Come say hi to the @SnorkelAI team at MLSys - and chat with us about the importance of data distributional precision!!
Armin Parchami @ArminPCM

The right data mix can deliver 5x better sample efficiency for RLVR. Our paper "Learning from Less" just got an oral at MLSys '26 — we show that how you compose training data (task complexity, diversity) matters more than how much you throw at the model. The @SnorkelAI research team is presenting next week in Bellevue. If you're at MLSys, come hang! We're hosting an after-hours social on May 21.

2
2
10
1.2K
Alex Ratner retweeted
Kelly Buchanan @ekellbuch
Very excited to release Terminal-Bench 2.1! Coding agents are among the most economically consequential deployments of LLMs to date. As agents improve, benchmark reliability matters more. We audited TB2.0 and found and corrected issues in 28/89 tasks. 30% of the benchmark! But the rankings survived, absolute scores moved up to 12pp!
27
74
768
84.3K
Alex Ratner @ajratner
Proud to support our partner @harvey's awesome new open source benchmark LAB!! LAB pushes the frontier of agentic legal evaluation - promoting transparency, safety, and core progress for AI usage in real-world legal work. Excited for what's ahead! 🚀
Gabe Pereyra @gabepereyra

x.com/i/article/2051…

0
3
16
1.1K
Alex Ratner retweeted
John Yang @jyangballin
How much of SQLite, FFmpeg, PHP compiler can LMs code from scratch? Given just an executable and no starter code or internet access. Introducing ProgramBench: 200 rigorous, whole-repo generation tasks where models design, build, and ship a working program end to end. 🧵
103
246
1.6K
718.6K
Alex Ratner retweeted
Armin Parchami @ArminPCM
Huge congrats to @harvey on open-sourcing LAB, a serious benchmark for long-horizon #legal #agents (1,200+ tasks, 24 practice areas). Proud our team @SnorkelAI got to contribute as a research and expert data partner, helping bring practice-area experts into every part of the benchmark. More to come from our team on open #benchmarks soon! Harvey's blog post: harvey.ai/blog/introduci…
0
9
47
1.7K
Alex Ratner retweeted
Parth Asawa @pgasawa
We’d like to especially thank @SnorkelAI for their support via the Open Benchmarks Grants program and @LaudeInstitute for their support via the Laude Slingshots program. (12/n)
1
5
32
3.5K