Benchmarks like SWE-Bench and SWE-Bench Pro focus on evaluating coding agents on code generation tasks; few evaluate code review. To close this gap, we at @foundryyai have published SWE-PRBench.
A dataset of 350 pull requests where the ground truth is the review written by human engineers. The benchmark measures whether frontier LLMs catch the same issues that reviewers catch in production code.
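As a rough illustration of what the two headline metrics mean, here is a minimal sketch. The matching logic (naive set overlap) and the function name are our own for illustration; the benchmark's actual issue-matching criteria are not described in this post.

```python
# Hypothetical sketch of detection vs. hallucination rates for a code review
# benchmark. Issues are represented as strings; real matching would be fuzzier.

def review_metrics(ground_truth: set[str], model_findings: set[str]) -> tuple[float, float]:
    """Return (detection_rate, hallucination_rate).

    detection_rate: fraction of human-flagged issues the model also flagged.
    hallucination_rate: fraction of model findings with no human counterpart.
    """
    if not ground_truth or not model_findings:
        return 0.0, 0.0
    caught = ground_truth & model_findings        # issues both sides flagged
    hallucinated = model_findings - ground_truth  # model-only findings
    return len(caught) / len(ground_truth), len(hallucinated) / len(model_findings)

# Toy example: human review flagged 3 issues, the model flagged 2.
human = {"sql injection in handler", "missing null check", "race in cache"}
model = {"missing null check", "unused import"}
det, hall = review_metrics(human, model)
print(f"detection={det:.1%}, hallucination={hall:.1%}")
# → detection=33.3%, hallucination=50.0%
```

Note that the two rates are independent: a model can score high detection while also hallucinating heavily, which is why both numbers are reported per model below.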
The results of top models on SWE-PRBench:
- Claude Sonnet 4.6 by Anthropic - 29.7% detection, 22.7% hallucination
- DeepSeek V3 by DeepSeek AI - 31.2% detection, 31.5% hallucination
- Mistral Large 3 by Mistral AI - 30.5% detection, 35.3% hallucination
- GPT-4o by OpenAI - 22.0% detection, 19.3% hallucination
The best model still missed 7 in 10 issues that human reviewers caught.
And the finding that surprised us most: adding more context made every model worse. All 8 models degraded monotonically as context expanded. When we added execution context and surrounding file content, Type2_Contextual issue detection collapsed by 50–55% across top models.
The gap between code generation benchmarks and code review benchmarks tells us something important: progress on SWE-Bench does not predict progress on code review. These are different capabilities, and the field has been measuring only one of them.
This is a generalisation gap that has received far less attention.
Full results below!