Snorkel AI (@SnorkelAI) - Twitter profile

Snorkel AI retweeted

ICYMI - How can we build the benchmark factory? I'm very excited about the infra approach from @harborframework, because @alexgshaw @ryanmart3n & team obsess over researcher/developer UX (e.g. quality guardrails, low friction to RL/scaled rollouts)!


Snorkel AI@SnorkelAI·3d

@marchattonhere @swyx @steveruizok @aiDotEngineer Here's ours: luma.com/SnorkelVIPHapp…


Marc Hatton@marchattonhere·3d

@swyx @steveruizok @aiDotEngineer Pls can you share links to free events? :)


swyx@swyx·3d

so AIE Europe is completely taking over 🇬🇧London next week! very very hyped to showcase the best companies, research, and AI engineers in Europe!

3 COMPLETELY FREE ways to join in:
- there are a dozen side events around town! from Snorkel to GitHub to Arize to ClawCon and Claude Code meetups!
- subscribe on YouTube! everything will be livestreamed and published for free youtube.com/@aidotengineer
- we are releasing 20 more volunteer slots here ai.engineer/associates meant for local, early career folks who otherwise could not afford a ticket!

join in/see you in london town!


Snorkel AI retweeted

vincent sunn chen@vincentsunnchen·3d

We'll be in London next week for AIE. Come say hi (DMs open)!! 🇬🇧



Snorkel AI@SnorkelAI·3d

See you in London 🇬🇧 Snorkel AI is hosting a happy hour at Bantof on April 7 for folks working on AI agents, evals, datasets, and open source. Great chance to meet others building in the space (plus food & drinks 🍻) Request an invite: luma.com/SnorkelVIPHapp…



Snorkel AI@SnorkelAI·3d

Watch on YouTube: youtube.com/watch?v=UCn5gG…


Snorkel AI@SnorkelAI·3d

“We need a thousand times more benchmarks than we have right now” is @alexgshaw of @LaudeInstitute's take on the current moment. “Coding is an extremely broad domain, 89 tasks isn’t nearly enough.” Full Benchtalks interview posted by @vincentsunnchen and YouTube in the replies


Snorkel AI@SnorkelAI·4d

🎧 Full episode: snorkel.ai/blog/benchtalk… 💰 Building an open benchmark? Apply for Snorkel's Open Benchmark Grants ↓ snorkel.ai/open-benchmark…


Snorkel AI@SnorkelAI·4d

Top scores on Terminal-Bench 2 went from ~25% → 75-80% in just 4 months.

For Benchtalks #1, @vincentsunnchen sat down with @alexgshaw to dig into what happens when your benchmark gets solved before you're ready for the next one.

Key takes:
→ The terminal is the right abstraction for agentic AI
→ Harbor exists because benchmarking and RL at scale are infra problems
→ "Benchmaxxing" is real; the defense is shipping harder tasks faster
→ TB3 is coming, and they want your hardest unsolvable problems

"We need 1000x more benchmarks than we have right now" - @alexgshaw


Snorkel AI retweeted

vincent sunn chen@vincentsunnchen·4d

Terminal-Bench 2.0 went from ~25% → 80% in four months and became the standard eval for frontier CLI agents. Now, TB3 is in the works.

I talked to @alexgshaw about what happens when model capabilities climb faster than we can measure them. His answer: the benchmark factory (@harborframework): infrastructure to develop hard, representative evals at the pace that the frontier moves.

As Alex put it: "we need a thousand times more benchmarks than we have right now."

00:23 - How quickly models hill-climbed TB2
01:46 - What rapid progress reveals about benchmarks vs. real-world capability
03:28 - What made Terminal-Bench stick
04:58 - Why the terminal is the right abstraction for agentic AI
07:14 - How TB2 maintains task quality at scale
09:23 - Managing benchmark integrity in a benchmaxxing world
10:47 - Harbor: from experiment to benchmark factory
12:19 - What Harbor does that nothing else did
14:37 - The invariants: what won't change as agent evals evolve
16:55 - The benchmark Alex most wants to see built
18:18 - The ideal human-in-the-loop task creation flywheel
20:32 - How to contribute to Terminal-Bench 3.0


Snorkel AI retweeted

ACM Conference on AI and Agentic Systems@CAISconf·5d

The first @TheOfficialACM conference on agentic AI systems just got a boost. @SnorkelAI is joining as a sponsor of @CAISconf this May in San Jose. Stanford AI Lab roots, production AI focus, and a shared belief that this community needs a rigorous home. caisconf.org


Snorkel AI retweeted

Armin Parchami@ArminPCM·5d

We just open-sourced FinQA — an #RL environment for financial reasoning agents. Real SEC 10-K data, multi-step reasoning + tool use, constrained SQL, binary rewards. The whole 9 yards! The kicker: a 4B model fine-tuned with FinQA outperformed a 235B model from the same family on finance reasoning: 58x smaller!


Snorkel AI@SnorkelAI·5d

In the FinQA env, a 4B model was fine-tuned to outperform a 235B model from the same family on our Finance Reasoning benchmark. What did we teach the 4B model? Tool discipline. Learn more: snorkel.ai/blog/building-…


Snorkel AI@SnorkelAI·5d

Our FinQA environment is available on OpenEnv (s/o @huggingface + @PyTorch)

FinQA is an open RL environment with:
• 290 expert-curated questions
• Real SEC 10-K data
• Tasks requiring multi-step tool use

RL proof point on FinQA: make a 4B model > 235B model 👇


Snorkel AI@SnorkelAI·27 Mar

London 🇬🇧 We’re heading to @aiDotEngineer Europe (April 8–10). @vincentsunnchen will be presenting on the art and science of benchmarking agents, including learnings from Open Benchmark Grants. Find us at Booth G9: snorkel.ai/ai-engineer-lo…


Snorkel AI@SnorkelAI·27 Mar

Congrats on the release 🚀 Proud to support research like this that moves the needle on evals and real-world agent performance.

Gabe Orlanski@GOrlanski

We found that agents generate progressively worse code with each iteration. Real developers do not. SlopCodeBench is the only eval that faithfully measures quality degradation on iterative, long-horizon coding tasks. arxiv.org/abs/2603.24755 scbench.ai 🧵


Snorkel AI retweeted

Fred Sala@fredsala·27 Mar

Really excited to have the SlopCodeBench paper out---awesome work from @GOrlanski!



Snorkel AI retweeted

vincent sunn chen@vincentsunnchen·25 Mar

Congratulations to @gregkamradt @fchollet @mikeknoop and the @arcprize team on a massive benchmark!! We @SnorkelAI are excited to support and will be around the launch event— please come say hi! 👋

ARC Prize@arcprize

Announcing ARC-AGI-3
The only unsaturated agentic intelligence benchmark in the world
Humans score 100%, AI <1%
This human-AI gap demonstrates we do not yet have AGI
Most benchmarks test what models already know, ARC-AGI-3 tests how they learn


Snorkel AI retweeted

Armin Parchami@ArminPCM·25 Mar

Scaling RL training for agentic models is one of the hardest infra problems in ML right now and honestly, one of the most exciting jobs🔥 Our research team @SnorkelAI is deep in RLFT (data valuation, curriculum learning, and more). We're #hiring an ML Training Infra engineer who's actually done this at scale with complex environments and medium sized models. If that sounds like you (or someone you know), DM me or drop a comment 👇 #MLJobs | Link in thread


Snorkel AI@SnorkelAI·24 Mar

Snorkel was just named one of @FastCompany’s Most Innovative AI Companies of 2026. We’re helping to design and pressure test the datasets and evaluations that make AI models and agents work in the real world. Join us: snorkel.ai/join-us/


Snorkel AI@SnorkelAI·24 Mar

Coming soon: BenchTalks—a candid podcast series by Snorkel AI on benchmarks, AI evals, and frontier research. 👀🎙️

