
We train LLMs for code at an enterprise lab. Codeforces Elo instability was burning us during post-training eval, so we ran a systematic study, wrote a paper, and submitted it to the FSE industry track. Got rejected. All three reviewers said "weak connection to industry." 🤡 We built this from a real production pain point. 13,691 test cases, 37 contests. Sometimes the reviewers and the problem exist in different universes.





















