BenchFlow
@benchflow_ai
frontier evals and rl environments
37 posts

The zero benefit from self-generated skills is a tough reality check for autonomous loops. We're still relying on manual curation to unlock that 16.2pp boost. Also wild to see the disparity between domains: Healthcare jumped +51.9pp while Software Engineering saw only +4.5pp, which suggests current models already saturate on coding context but starve for domain-specific workflows.

Nice paper studying whether agents can generate their own procedural knowledge. This is very important for building more reliable self-improving agents.

The new benchmark evaluates how well Skills help LLM agents across 86 tasks and 11 domains. Findings from over 7,300 agent trajectories:

Curated Skills improved agent pass rates by 16.2 percentage points on average. But the gains varied wildly, from +4.5pp in Software Engineering to +51.9pp in Healthcare.

The most surprising finding is that self-generated Skills provide no benefit on average. Models struggle to create the procedural knowledge that actually helps them.

Focused, concise skills outperformed comprehensive documentation. And smaller models with Skills matched larger models without them.

If agents can't reliably create their own procedural knowledge, the curation and design of Skills becomes a critical bottleneck for agent systems.

Paper: arxiv.org/abs/2602.12670

Learn to build effective AI agents in our academy: academy.dair.ai
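
For readers who haven't used them: a minimal sketch of what an Agent Skill looks like on disk, assuming the SKILL.md-with-YAML-frontmatter layout used by Claude Code. The skill itself (an ICD-10 coding workflow, a nod to the Healthcare result) is entirely hypothetical.

```python
# Minimal sketch of an Agent Skill on disk, assuming the SKILL.md convention
# (a directory with a Markdown file carrying YAML frontmatter) used by
# Claude Code. The skill content is hypothetical, for illustration only.
from pathlib import Path

SKILL_MD = """\
---
name: icd10-coding
description: Map clinical notes to ICD-10 codes via the local index workflow.
---
1. Extract diagnosis phrases from the note.
2. Look each phrase up in the local ICD-10 index before proposing a code.
3. Return code + justification; never emit a code absent from the index.
"""

def load_skill(path: Path) -> dict:
    """Split SKILL.md into frontmatter fields plus the instruction body."""
    _, frontmatter, body = path.read_text().split("---", 2)
    fields = dict(line.split(":", 1) for line in frontmatter.strip().splitlines())
    return {k.strip(): v.strip() for k, v in fields.items()} | {"body": body.strip()}

skill_dir = Path("skills/icd10-coding")
skill_dir.mkdir(parents=True, exist_ok=True)
(skill_dir / "SKILL.md").write_text(SKILL_MD)
print(load_skill(skill_dir / "SKILL.md")["name"])  # -> icd10-coding
```

Note how the paper's finding maps onto this format: keeping the body short and procedural beat dumping comprehensive documentation into it.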


Agent Skills are everywhere: Claude Code, Gemini CLI, Codex all support them. But do they actually work? 105 domain experts from Stanford, CMU, Berkeley, Oxford, Amazon, ByteDance & more built SkillsBench to find out. 86 tasks. 11 domains. 7,308 trajectories. 🧵👇
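
To make the headline metric concrete, a rough sketch of the uplift arithmetic: pass rate with Skills minus pass rate without, per domain, in percentage points. The trajectory records below are fabricated; only the computation mirrors how numbers like +16.2pp are produced.

```python
# Sketch of the per-domain pass-rate uplift in percentage points.
# Trajectory records are fabricated for illustration; the arithmetic
# (pass% with Skills minus pass% without) is the point.
from collections import defaultdict

# (domain, used_skills, passed) -- hypothetical trajectory outcomes
trajectories = [
    ("Healthcare", True, True), ("Healthcare", True, True),
    ("Healthcare", False, False), ("Healthcare", False, True),
    ("Software Engineering", True, True), ("Software Engineering", True, True),
    ("Software Engineering", False, True), ("Software Engineering", False, True),
]

counts = defaultdict(lambda: {"pass": 0, "total": 0})
for domain, used_skills, passed in trajectories:
    bucket = counts[(domain, used_skills)]
    bucket["total"] += 1
    bucket["pass"] += passed  # bool counts as 0/1

for domain in sorted({d for d, _, _ in trajectories}):
    with_s, without = counts[(domain, True)], counts[(domain, False)]
    uplift_pp = 100 * (with_s["pass"] / with_s["total"]
                       - without["pass"] / without["total"])
    print(f"{domain}: {uplift_pp:+.1f}pp")
```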




Given the most effective skills, can agents correctly implement the DeepSeek mHC paper from scratch and train nanoGPT with FineWeb? We made a benchmark, and we found Claude Code was able to not only achieve the target loss but also replicate the main conclusion of the paper ->
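
A hedged sketch of what a "hit the target loss" pass check could look like in such a harness; the JSONL log format and the numeric threshold are assumptions for illustration, not the benchmark's actual spec.

```python
# Sketch of a "did the run reach the target loss" check for a nanoGPT-style
# training run. Log format and threshold are assumptions, not the real spec.
import json
from pathlib import Path

TARGET_VAL_LOSS = 3.28  # hypothetical target for nanoGPT on FineWeb

def run_passed(log_path: Path, target: float = TARGET_VAL_LOSS) -> bool:
    """Read JSONL logs ({'step': ..., 'val_loss': ...}) and check whether
    any evaluated checkpoint reached the target validation loss."""
    losses = [
        json.loads(line)["val_loss"]
        for line in log_path.read_text().splitlines()
        if line.strip()
    ]
    return bool(losses) and min(losses) <= target

# Toy log where the second checkpoint clears the bar.
Path("train_log.jsonl").write_text(
    '{"step": 1000, "val_loss": 3.60}\n{"step": 2000, "val_loss": 3.25}\n'
)
print(run_passed(Path("train_log.jsonl")))  # -> True
```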





