ank R

4 posts

ank R banner
ank R

ank R

@Llllink2000

Katılım Kasım 2022
16 Takip Edilen2 Takipçiler
ank R retweetledi
Qiuyang Mang
Qiuyang Mang@MangQiuyang·
Open-ended coding training data may no longer be the bottleneck: AI can scale open-ended tasks—and even outperform human-expert curation. FrontierCS team is releasing FrontierSmith: a system for synthesizing open-ended coding problems at scale. Starting from closed-ended coding tasks, FrontierSmith mutates, filters, and builds runnable optimization environments for long-horizon coding agents. In our experiments, FrontierSmith data trains stronger models than human-curated open-ended data on FrontierCS and ALE-bench. Blog: frontier-cs.org/blog/frontiers… Paper: arxiv.org/abs/2605.14445 Code: github.com/FrontierCS/Fro… Model: huggingface.co/runyuanhe/qwen…
English
11
65
296
76.2K
ank R retweetledi
Boxi Yu
Boxi Yu@BoshCavendish·
🔥 SWE-ABS accepted by ICML2026 @icmlconf 🔥 OpenAI @OpenAI showed SWE-Bench @SWEbench tests reject correct patches. We reveal the other side: they also accept wrong ones. SWE-ABS strengthens SWE-Bench (Verified & Pro) via: coverage-driven tests + mutation-based attacks. Key results: • All top-30 rankings shift (#1#5) • 19.78% “solved” patches are actually wrong • 50.2% Verified strengthened • 64.7% Pro subset strengthened 👉 Test quality—not benchmark difficulty—is the real bottleneck. Links 👇
Boxi Yu tweet media
English
3
8
15
604