Ruochen Zhou retweeted

📍 Can LLMs discover, abstract, and reuse higher-level tool skills across tasks?
Existing tool-use benchmarks test whether models can solve tasks with a fixed set of tools. But real workflows contain recurring structures where efficiency comes from reusable tool compositions, not isolated calls.
We introduce SkillCraft: 126 tasks across 6 domains designed to test whether LLM agents can acquire compositional skills, not just call atomic tools.
We also propose Skill Mode, a lightweight protocol with four MCP primitives that let agents compose, verify, cache, and reuse tool chains at test time.
Our key findings from evaluating 8 SOTA models:
⚡ Skill Mode enables agents to self-discover and reuse skills, leading to higher success rates and greater efficiency than agents without it. The gains are larger for stronger models.
🧠 Stronger models (e.g., Claude) discover more generalizable skills, which transfer across tasks and even across models.
🔍 Deeper composition ≠ better: shallow, well-tested skills generalize best.
🔗 Paper: arxiv.org/abs/2603.00718
💻 Code: github.com/shiqichen17/Sk…
🏠 Page: skillcraft-website.github.io/page
(1/7)