Amitoj Singh retweeted

🔥 At ICML 2025, we’re delighted to introduce BFCL V4 Agentic. Because function-calling (also called tool-calling) forms the bedrock of agentic systems, the BFCL V4 Agentic benchmark focuses on tool-calling in real-world agentic settings, including:
🔍 Web search with multi-hop reasoning and error recovery
🧠 Evaluating Tool-Calling for Memory
⚠️ Evaluating Format Sensitivity
As always, BFCL prioritizes realism. For example, in the web-search track, we evaluate not just multi-hop reasoning ability but also how models handle real-world failures. In BFCL V4, we introduce randomized injection of six common programmatic access errors, including 503 Server Error, 429 Too Many Requests, and 403 Forbidden.
Which models recover gracefully? Which ones fail silently?
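To make the idea concrete, here is a minimal sketch of randomized error injection around a search tool, plus a caller that retries and reports failure explicitly rather than failing silently. This is not BFCL's implementation: `flaky`, `call_with_recovery`, `fake_search`, and `SimulatedHTTPError` are hypothetical names, the error rate and retry count are arbitrary, and only the three errors named above are modeled.

```python
import random

# Only the three errors named in the post; the remaining three are not listed there.
SIMULATED_ERRORS = {
    503: "Server Error",
    429: "Too Many Requests",
    403: "Forbidden",
}


class SimulatedHTTPError(Exception):
    """Hypothetical exception standing in for a failed web request."""

    def __init__(self, status):
        super().__init__(f"{status} {SIMULATED_ERRORS[status]}")
        self.status = status


def flaky(tool, error_rate, rng):
    """Wrap a tool so each call fails with probability error_rate."""
    def wrapped(*args, **kwargs):
        if rng.random() < error_rate:
            # Pick one of the modeled errors at random.
            raise SimulatedHTTPError(rng.choice(sorted(SIMULATED_ERRORS)))
        return tool(*args, **kwargs)
    return wrapped


def call_with_recovery(tool, query, max_attempts=5):
    """Retry on simulated failures; surface the last error instead of hiding it."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return {"ok": True, "result": tool(query)}
        except SimulatedHTTPError as err:
            last_error = err
    return {"ok": False, "error": str(last_error)}


def fake_search(query):
    """Placeholder search tool; a real harness would call a search backend."""
    return f"results for {query!r}"


if __name__ == "__main__":
    search = flaky(fake_search, error_rate=0.5, rng=random.Random(0))
    print(call_with_recovery(search, "BFCL V4"))
```

Passing a seeded `random.Random` keeps the injected failures reproducible across runs, which matters when comparing models on the same failure pattern.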
All this and more! Check out the BFCL V4 Agentic blogs:
Web-search: gorilla.cs.berkeley.edu/blogs/15_bfcl_…
Memory: gorilla.cs.berkeley.edu/blogs/16_bfcl_…
Format Sensitivity: gorilla.cs.berkeley.edu/blogs/17_bfcl_…
As always, everything is open-sourced at BFCL V4 PR: github.com/ShishirPatil/g…
🏃‍♂️ Who's the overall #1? We're currently sprinting to integrate all models into the new benchmark. Once generations are complete, the leaderboard will migrate from v3 to v4. Hang tight, big updates incoming!
