BenchFlow

37 posts

@benchflow_ai

frontier evals and rl environments

San Francisco · Joined October 2024
21 Following · 258 Followers
BenchFlow retweeted
Xiangyi Li @xdotli
the first update on the SkillsBench paper has been made available on arXiv
3
2
20
1.1K
BenchFlow retweeted
Xiangyi Li @xdotli
Introducing EnvDash: Agent that automatically tests and improves Skills.
The head of alignment at Meta screwed up with OpenClaw. We made high-resolution environments, tasks, and datasets to keep that from happening to you.
7
16
178
17K
BenchFlow retweeted
FounderCoHo @FounderCoHo
🔥 Just wrapped an incredible panel at Stanford University on "Beyond the Prompt: The Infrastructure of Autonomous Agents."
Featuring:
• Vedran Jukic (@JukicVedran), CTO & Co-Founder of @daytonaio
• Xiangyi Li (@xdotli), Founder of @benchflow_ai, creator of skillsbench ai
• Moderated by Jing Wang (@jingconan), Founder of @DeepVistaAI and @FounderCoHo
Key takeaway? The future of AI agents isn't about better models; it's about RL infrastructure that can scale.
🎯 The real bottleneck: High-throughput sandboxes. Teams using Daytona compressed 8,000 trials from weeks → 1.5 days. In AI, that's a massive competitive edge.
💡 Real-world deployment challenges:
• Trust gap in high-stakes sectors (finance, supply chain)
• Security paradigm shift: protecting systems FROM agents, not just for them
• Benchmarks evolving faster than ever (MMLU → SkillsBench)
Looking to 2026: Agent organizations (agents managing agents) + whoever cracks the data flywheel wins.
"The model is the product. Distribution is the moat." 🎯
1
3
11
1.1K
BenchFlow retweeted
FounderCoHo @FounderCoHo
How do you boost agent accuracy by 51%? 📈 It's not better prompting; it's better infrastructure.
We're skipping the theory and showing you the Founder's Playbook for 2026 at Stanford on March 3.
Speakers:
• Ivan Burazin (@ivanburazin) (CEO, @daytonaio): 4X Founder building Massively Parallel RL Infrastructure.
• Jing Conan Wang (@jingconan) (Founder, @DeepVistaAI & @FounderCoHo): Ex-DeepMind & PhD in RL bridging research to execution.
• Xiangyi Li (@xdotli) (Founder, @benchflow_ai): Author of SkillsBench; pioneer in agentic evaluation.
• Lovre Pesut (AI Engineer, @daytonaio): Expert in spinning 10k+ sandboxed VMs for RL pipelines.
What's on the menu:
• Scaling RL rollouts in minutes
• Sandboxed VMs (Win/Mac/Linux/Android)
• Modular "Skills" vs. Generalist LLMs
Limited spots for builders. 🛠️ Register here: luma.com/eswm2omv
0
2
3
184
BenchFlow retweeted
AK @_akhaliq
SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks
paper: huggingface.co/papers/2602.12…
3
17
64
8.2K
BenchFlow retweeted
DailyPapers @HuggingPapers
Top AI Papers of The Week (Feb 16-22)
- Less is Enough: Synthesizing Diverse Data in Feature Space of LLMs
- SQuTR: A Robustness Benchmark for Spoken Query to Text Retrieval under Acoustic Noise
- GLM-5: from Vibe Coding to Agentic Engineering by @zhipuAI
- Experiential Reinforcement Learning
- MedXIAOHE: A Comprehensive Recipe for Building Medical MLLMs
- Zooming without Zooming: Region-to-Image Distillation by @InclusionAI
- Sanity Checks for Sparse Autoencoders: Do SAEs Beat Random Baselines?
- DeepImageSearch: Benchmarking Multimodal Agents for Context-Aware Image Retrieval
- SLA2: Sparse-Linear Attention with Learnable Routing and QAT
- SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks
Find them below:
11
35
207
35.3K
BenchFlow retweeted
Xiangyi Li @xdotli
20+ Anthropic Default Skills, 200k+ community skills on skillsmp. People talk about skills without knowing how well they work. We're hosting the largest Agent Skills hackathon at Founders, Inc. (March 7 - 8) from our lessons learned at SkillsBench 🪜 No sims. No slides. No flops.
2
2
14
645
BenchFlow retweeted
Xiangyi Li @xdotli
Introducing SkillsBench, the first benchmark that measures agent skills and how well agents use them. 86 tasks from 105 domain experts across 11 domains; every task is verifiable, human-created, and has verified Skills. SOTA models score ~30% without skills. 🧵👇
2
11
40
2.6K
BenchFlow retweeted
Xiangyi Li @xdotli
Agent Skills are everywhere - Claude Code, Gemini CLI, Codex all support them. But do they actually work? 105 domain experts from Stanford, CMU, Berkeley, Oxford, Amazon, ByteDance & more built SkillsBench to find that out. 86 tasks. 11 domains. 7,308 trajectories. 🧵👇
24
92
668
76.4K
BenchFlow retweeted
elvis @omarsar0
Nice paper studying whether agents can generate their own procedural knowledge. This is very important to build more reliable self-improving agents.
The new benchmark evaluates how well Skills help LLM agents across 86 tasks and 11 domains.
Findings from over 7,300 agent trajectories: Curated Skills improved agent pass rates by 16.2 percentage points on average. But the gains varied wildly, from +4.5pp in Software Engineering to +51.9pp in Healthcare.
The most surprising finding is that self-generated Skills provide no benefit on average. Models struggle to create the procedural knowledge that actually helps them.
Focused, concise skills outperformed comprehensive documentation. And smaller models with Skills matched larger models without them.
If agents can't reliably create their own procedural knowledge, the curation and design of Skills becomes a critical bottleneck for agent systems.
Paper: arxiv.org/abs/2602.12670
Learn to build effective AI agents in our academy: academy.dair.ai
23
85
460
70.9K
BenchFlow retweeted
Chris Barber @chrisbarber
I drafted a list of RL environment startups. Calling it Pavlov's List (1/2):
AfterQuery. Code & Finance. @CarlosGeorgescu, @spencermateega, Danny Tang
BenchFlow. Code. @xdotli
Bespoke Labs. Enterprise. @madiator, @AlexGDimakis
Calaveras. Code. @cis_female, Alana Xiang
Cua. Code & Computer Use. @francedot, @ddupont808
Collinear. Enterprise & Code. @nazneenrajani, @soumyadeepb_
d_model. ML & Alignment. @dlbydq, @dmooooon
Datacurve. Code. @serenaa_ge, @charleyslee
Deeptune. Enterprise. @timlup
Fleet AI. Enterprise. @nicoup
General Reasoning. Long Horizon. @rosstaylor90, @ChengxiTaylor, Kip Parker, Thomas Grady
Halluminate. Long Horizon & Finance. @Jerr_Wu, @wgm752
Habitat. Code & Computer Use. @maxim_enis, @maxkan_, @AndrewMegalaa
Haladir. Code & Math. @jibranhutch, @quanmhuynh, @preston281s, @josephtso914
Hillclimb. Math. @jparkjmc, @agithief
10
21
303
45.5K