Benchmarks like SWE-Bench and SWE-Bench Pro focus on evaluating coding agents on code generation tasks; few evaluate code review. To close this gap, we at @foundryyai have published SWE-PRBench.
A dataset of 350 pull requests where the ground truth is the review written by human engineers. The benchmark measures whether frontier LLMs catch the same issues that reviewers catch in production code.
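As a rough illustration of what the two headline metrics mean, here is a minimal sketch. The matching logic (naive set overlap) and the function name are our own for illustration; the benchmark's actual issue-matching criteria are not described in this post.

```python
# Hypothetical sketch of detection vs. hallucination rates for a code review
# benchmark. Issues are represented as strings; real matching would be fuzzier.

def review_metrics(ground_truth: set[str], model_findings: set[str]) -> tuple[float, float]:
    """Return (detection_rate, hallucination_rate).

    detection_rate: fraction of human-flagged issues the model also flagged.
    hallucination_rate: fraction of model findings with no human counterpart.
    """
    if not ground_truth or not model_findings:
        return 0.0, 0.0
    caught = ground_truth & model_findings        # issues both sides flagged
    hallucinated = model_findings - ground_truth  # model-only findings
    return len(caught) / len(ground_truth), len(hallucinated) / len(model_findings)

# Toy example: human review flagged 3 issues, the model flagged 2.
human = {"sql injection in handler", "missing null check", "race in cache"}
model = {"missing null check", "unused import"}
det, hall = review_metrics(human, model)
print(f"detection={det:.1%}, hallucination={hall:.1%}")
# → detection=33.3%, hallucination=50.0%
```

Note that the two rates are independent: a model can score high detection while also hallucinating heavily, which is why both numbers are reported per model below.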
The results of top models on SWE-PRBench:
- Claude Sonnet 4.6 by Anthropic - 29.7% detection, 22.7% hallucination
- DeepSeek V3 by DeepSeek AI - 31.2% detection, 31.5% hallucination
- Mistral Large 3 by Mistral AI - 30.5% detection, 35.3% hallucination
- GPT-4o by OpenAI - 22.0% detection, 19.3% hallucination
The best model still missed 7 in 10 issues that human reviewers caught.
And the finding that surprised us most: adding more context made every model worse. All 8 models degraded monotonically as context expanded. When we added execution context and surrounding file content, Type2_Contextual issue detection collapsed by 50–55% across top models.
The gap between code generation benchmarks and code review benchmarks tells us something important: progress on SWE-Bench does not predict progress on code review. These are different capabilities, and the field has been measuring only one of them.
This is a generalisation gap that has received far less attention.
Full results below!