Nicholas Edwards

@nedwards99

Joined October 2023

46 Following · 12 Followers

Nicholas Edwards@nedwards99·17h

Check out the original RExBench announcement for more details about the benchmark: x.com/yukyunglee_/st…

Yukyung Lee@yukyunglee_

Can coding agents autonomously implement AI research extensions? We introduce RExBench, a benchmark that tests if a coding agent can implement a novel experiment based on existing research and code. Finding: Most agents we tested had a low success rate, but there is promise!

Nicholas Edwards@nedwards99·17h

Thanks to @Mike_A_Merrill and @alexgshaw for early discussions, and to @LinShi592021 and the Adapters team for help with integration!

Nicholas Edwards@nedwards99·17h

RExBench is now available in Terminal Bench (@harborframework)! 🎉 We integrate 2 tasks (cogs, othello) along with a local testing framework so you can test if your agents can autonomously implement novel AI research extensions.

Nicholas Edwards@nedwards99·Apr 1

The interactive SWE-bench Verified setting is adapted from Vijayvargiya et al. (2026): arxiv.org/abs/2502.13069

Nicholas Edwards@nedwards99·Apr 1

This was work done with @sebschu. Check out the paper for more:
Paper: arxiv.org/abs/2603.26233
Code: github.com/nedwards99/ask…

Nicholas Edwards@nedwards99·Apr 1

🧵 Do coding agents know when to ask for help? Real-world coding tasks are rarely fully specified, yet most agents are optimized to execute autonomously rather than clarify.

Nicholas Edwards reposted

Sarah Breckner@hieristSarah·Mar 13

Diffusion LLMs can think EoS-by-EoS! The higher the generation length, the better the performance of Masked Diffusion LLMs, even though they generate the same amount of words and only augment them with more and more EoS tokens 👀
