Shang Zhou reposted

LiveCodeBench Pro remains one of the most challenging code benchmarks, but its evaluation and verification process has been a black box.
We introduce AutoCode, which democratizes evaluation by letting anyone run verification locally and perform RL training!
For the first time, we also show that an LLM can act as a problem setter, transforming a simple problem into a harder variant, sometimes one harder than the LLM itself can solve.
In other words, LLMs can generate problems they can’t yet solve, opening the door to true self-play.
Moreover, through an agentic framework, we find that LLMs can automatically generate test cases, achieving 98.7% evaluation consistency, accuracy that is already practical for an RL verifier.
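As a rough illustration of what "evaluation consistency" measures, here is a minimal sketch: the fraction of submissions on which verdicts from auto-generated tests agree with verdicts from the official test suite. The function name and the verdict data are hypothetical, not taken from the AutoCode release.

```python
# Hypothetical sketch of an "evaluation consistency" metric:
# agreement rate between official verdicts and verdicts produced
# by auto-generated test cases, over the same set of submissions.

def evaluation_consistency(official_verdicts, generated_verdicts):
    """Fraction of submissions where the two verdict lists agree."""
    assert len(official_verdicts) == len(generated_verdicts)
    agree = sum(o == g for o, g in zip(official_verdicts, generated_verdicts))
    return agree / len(official_verdicts)

# Illustrative data: 3 of 4 submissions judged identically.
official = ["AC", "WA", "TLE", "AC"]
generated = ["AC", "WA", "AC", "AC"]
print(evaluation_consistency(official, generated))  # 0.75
```

A high agreement rate is what makes generated tests usable as a reward signal for RL, since the verifier rarely disagrees with ground truth.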
