SWE-bench

26 posts

SWE-bench

@SWEbench

Official SWE-bench Account. Follow for updates to the SWE-universe

Joined May 2025
17 Following · 248 Followers
SWE-bench@SWEbench·
Join us in SWE-bench slack if you're interested in contributing and using these new datasets! (bottom left of swebench.com) Expect a lot more to come in the following weeks :)
SWE-bench@SWEbench·
SWE-bench blog site launched! Check out our content + expect more SWE-bench/agent/smith content soon!
SWE-bench retweeted
John Yang@jyangballin·
New eval! Code duels for LMs ⚔️
Current evals test LMs on *tasks*: "fix this bug," "write a test." But we code to achieve *goals*: maximize revenue, cut costs, win users.
Meet CodeClash: LMs compete via their codebases across multi-round tournaments to achieve high-level goals.
SWE-bench retweeted
Chunyang Chen@chun_yang_chen·
🏆 Glad to know that our #ASE25 paper on automated bug repair using multimodal LLMs just got the ACM SIGSOFT Distinguished Paper Award 🎉 And it is still ranked #1 on the @SWEbench Multimodal track! Thanks to Kai, Xiaofei @xfxie312, and Jian for the great work!
Chunyang Chen@chun_yang_chen

Excited to announce Kai's latest ASE'25 work, let LLMs not only see bugs, but also fix them: 📄 “Seeing is Fixing: Cross-Modal Reasoning with Multimodal LLMs for Visual Software Issue Repair” 🔗arxiv.org/abs/2506.16136 Ranked #1 on @SWEbench Multimodal!

SWE-bench retweeted
Ofir Press@OfirPress·
3 out of the top 6 most downloaded datasets on @huggingface are SWE-bench related. Thanks!!! ♥️
SWE-bench retweeted
carlos@_carlosejimenez·
Recent open model scores on SWE-bench Bash Only:
🥇 Qwen3-Coder 480B/A35B Instruct - 55.40%
🥈 Kimi-K2-Instruct - 43.80%
🥉 gpt-oss-120b - 26.00%
See the full leaderboard below! 👇
SWE-bench retweeted
Kilian Lieret@KLieret·
What if your agent uses a different LM at every turn? We let mini-SWE-agent randomly switch between GPT-5 and Sonnet 4 and it scored higher on SWE-bench than with either model separately. Read more in the SWE-bench blog 🧵
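The post above doesn't include the switching code, but the per-turn idea is easy to sketch. A minimal illustration, assuming a litellm-style completion call; the model identifiers, termination signal, and loop structure are placeholders, not the actual mini-SWE-agent implementation:

```python
# Sketch: re-sample the backing LM on every agent turn (illustrative only).
import random

from litellm import completion  # any OpenAI-compatible client would work

MODEL_POOL = ["gpt-5", "anthropic/claude-sonnet-4"]  # placeholder identifiers

def run_episode(task: str, max_turns: int = 50) -> list[dict]:
    """Drive a simple agent loop, sampling a fresh model at each turn."""
    messages = [
        {"role": "system", "content": "You are a software engineering agent."},
        {"role": "user", "content": task},
    ]
    for _ in range(max_turns):
        model = random.choice(MODEL_POOL)  # the core idea: re-roll every turn
        reply = completion(model=model, messages=messages).choices[0].message.content
        messages.append({"role": "assistant", "content": reply})
        if "SUBMIT" in reply:  # stand-in termination signal
            break
        # ...execute the proposed action here, then append the observation...
        messages.append({"role": "user", "content": "<observation goes here>"})
    return messages
```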
SWE-bench retweeted
Kilian Lieret@KLieret·
Deepseek v3.1 chat scores 53.8% on SWE-bench Verified with mini-SWE-agent. It tends to take more steps to solve problems than other models (flattens out after some 125 steps). As a result, effective cost is somewhere near GPT-5 mini. Details in 🧵
SWE-bench retweeted
Ofir Press@OfirPress·
GPT-5 gets 74.9 on SWE-bench. Wonder what the budget per task is.
SWE-bench retweeted
carlos@_carlosejimenez·
What happens if you compare LMs on SWE-bench without the fancy scaffolds? Our new leaderboard “SWE-bench (bash only)” shows you which LMs are the best at getting the job done with just bash. More on why this is important 👇
SWE-bench retweeted
Ofir Press@OfirPress·
Super exciting to have 3 new open-weight models that all obtain more than 60 on SWE-bench Verified! Looking forward to the results on SWE-bench Multimodal when these models obtain vision capabilities :)
SWE-bench retweeted
Kilian Lieret@KLieret·
Releasing mini, a radically simple SWE-agent: 100 lines of code, 0 special tools, and gets 65% on SWE-bench verified! Made for benchmarking, fine-tuning, RL, or just for use from your terminal. It’s open source, simple to hack, and compatible with any LM! Link in 🧵
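mini's own source isn't reproduced here; the sketch below only illustrates the bash-only loop the post describes. The prompt wording, the <bash> tag convention for extracting commands, and the stop condition are assumptions, and the litellm call stands in for whatever client the real agent uses:

```python
# Illustrative bash-only agent loop in the spirit of mini-SWE-agent (not its code).
import subprocess

from litellm import completion

SYSTEM = (
    "Solve the task by replying with exactly one bash command per turn, "
    "wrapped in <bash>...</bash> tags. Reply with <bash>echo DONE</bash> when finished."
)

def extract_command(reply: str) -> str:
    """Pull the command out of the <bash>...</bash> tags in the model reply."""
    start = reply.index("<bash>") + len("<bash>")
    return reply[start:reply.index("</bash>", start)].strip()

def run(task: str, model: str = "gpt-5", max_turns: int = 30) -> list[dict]:
    messages = [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": task},
    ]
    for _ in range(max_turns):
        reply = completion(model=model, messages=messages).choices[0].message.content
        messages.append({"role": "assistant", "content": reply})
        command = extract_command(reply)
        if command == "echo DONE":
            break
        # Every action is plain bash -- no special edit/search/navigation tools.
        result = subprocess.run(command, shell=True, capture_output=True, text=True)
        observation = (result.stdout + result.stderr)[-4000:]  # keep context bounded
        messages.append({"role": "user", "content": observation})
    return messages
```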
SWE-bench@SWEbench·
@Alibaba_Qwen Congratulations on amazing SWE-bench Verified + Multilingual performance!
Qwen@Alibaba_Qwen·
>>> Qwen3-Coder is here! ✅

We’re releasing Qwen3-Coder-480B-A35B-Instruct, our most powerful open agentic code model to date. This 480B-parameter Mixture-of-Experts model (35B active) natively supports 256K context and scales to 1M context with extrapolation. It achieves top-tier performance across multiple agentic coding benchmarks among open models, including SWE-bench-Verified!!! 🚀

Alongside the model, we're also open-sourcing a command-line tool for agentic coding: Qwen Code. Forked from Gemini Code, it includes custom prompts and function call protocols to fully unlock Qwen3-Coder’s capabilities. Qwen3-Coder works seamlessly with the community’s best developer tools. As a foundation model, we hope it can be used anywhere across the digital world — Agentic Coding in the World!

💬 Chat: chat.qwen.ai
📚 Blog: qwenlm.github.io/blog/qwen3-cod…
🤗 Model: hf.co/Qwen/Qwen3-Cod…
🤖 Qwen Code: github.com/QwenLM/qwen-co…