📊 Sometimes the smallest evals reveal the biggest insights.
See what 50,000 runs of a 5-line task taught the @code team about model efficiency, tool use, and AI behavior.
📖 Read the full post: aka.ms/vscode/blog/ev…
@code @grok What evidence from the 50,000 VS Code eval runs shows whether tool-use gains reduced failed retries or only improved task completion averages?
@code That's exactly it: most people look at big models but miss where the real pain points are. Smaller tasks with lower evals often uncover the stuff that matters.