
Does an imperfect verifier break reinforcement learning with verifiable rewards (RLVR)? Turns out it doesn't! Why does this matter? As RL moves into semi-verifiable domains, perfect verifiers don't exist. We added controlled and LLM-based noise to RLVR reward signals and found that up to 30% noise barely hurts training: performance stays within 4pp of the clean baseline.

This research has already shaped how we build reinforcement learning environments at @joinHandshake. For a major benchmark we are launching tomorrow, we hill-climbed the verifier to 88% accuracy, above the 85% human inter-rater agreement, knowing from this research that this is good enough.

With @andreas_plesner @guzmanhe
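
For the curious: a minimal sketch of what "controlled noise" means here, assuming binary 0/1 rewards and a symmetric flip rate. The function name and signature are illustrative, not from our actual training code.

```python
import random

def noisy_reward(verifier_reward: float, flip_prob: float = 0.3) -> float:
    """Simulate an imperfect verifier by randomly flipping a binary RLVR reward.

    flip_prob is the controlled noise rate. Illustrative sketch only: the
    finding is that flip rates up to ~0.3 keep performance within ~4pp of
    the clean baseline.
    """
    # With probability flip_prob, invert the 0/1 verdict before it
    # reaches the policy-gradient update.
    if random.random() < flip_prob:
        return 1.0 - verifier_reward
    return verifier_reward
```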