ank R

@Llllink2000

Joined November 2022

16 Following · 2 Followers

ank R@Llllink2000·1d

@MangQiuyang Nice work!

ank R retweeted

Qiuyang Mang@MangQiuyang·2d

Open-ended coding training data may no longer be the bottleneck: AI can scale open-ended tasks, and even outperform human-expert curation.

The FrontierCS team is releasing FrontierSmith: a system for synthesizing open-ended coding problems at scale. Starting from closed-ended coding tasks, FrontierSmith mutates, filters, and builds runnable optimization environments for long-horizon coding agents.

In our experiments, FrontierSmith data trains stronger models than human-curated open-ended data on FrontierCS and ALE-bench.

Blog: frontier-cs.org/blog/frontiers…
Paper: arxiv.org/abs/2605.14445
Code: github.com/FrontierCS/Fro…
Model: huggingface.co/runyuanhe/qwen…

ank R@Llllink2000·5 May

@BoshCavendish @icmlconf @OpenAI @SWEbench Nice work!

ank R retweeted

Boxi Yu@BoshCavendish·5 May

🔥 SWE-ABS accepted by ICML 2026 @icmlconf 🔥

OpenAI @OpenAI showed that SWE-Bench @SWEbench tests reject correct patches. We reveal the other side: they also accept wrong ones.

SWE-ABS strengthens SWE-Bench (Verified & Pro) via coverage-driven tests + mutation-based attacks.

Key results:
• All top-30 rankings shift (#1 → #5)
• 19.78% of “solved” patches are actually wrong
• 50.2% Verified strengthened
• 64.7% Pro subset strengthened

👉 Test quality, not benchmark difficulty, is the real bottleneck.

Links 👇
