
🚨 New paper 🚨 J1: Incentivizing Thinking in LLM-as-a-Judge via RL

- Converts the judgment task into a verifiable one for both verifiable and non-verifiable prompts, using only synthetic pairwise data
- Optimizes thoughts, scores, and judgments using GRPO
- Outperforms all baselines at 8B & 70B scale, beats o1-mini, and on some benchmarks even R1
- We find J1 uses varied thought strategies: outlining evaluation criteria, comparing against self-generated reference answers, and re-evaluating correctness

📝: arxiv.org/abs/2505.10320
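A rough sketch of the core idea (my own illustrative code, not the paper's implementation): with synthetic pairs where the better response is known, the judge's verdict yields a verifiable 0/1 reward, and GRPO then computes advantages by normalizing rewards within a group of sampled judgments, so no learned value model is needed. Function and variable names here are assumptions for illustration.

```python
# Illustrative sketch (not the paper's code): a verifiable pairwise
# reward for an LLM judge, plus GRPO-style group-normalized advantages.
import statistics

def pairwise_reward(verdict: str, gold_better: str) -> float:
    """Reward 1.0 if the judge's verdict ('A' or 'B') matches the
    known-better response in a synthetic pair, else 0.0."""
    return 1.0 if verdict == gold_better else 0.0

def grpo_advantages(rewards: list[float]) -> list[float]:
    """GRPO: normalize each sampled judgment's reward by the group's
    mean and standard deviation."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero std
    return [(r - mean) / std for r in rewards]

# Example: 4 sampled judgments for one pair where 'A' is known better.
verdicts = ["A", "B", "A", "A"]
rewards = [pairwise_reward(v, "A") for v in verdicts]
advs = grpo_advantages(rewards)
```

Correct verdicts get positive advantages and incorrect ones negative, which is the signal that pushes the judge's thinking, scores, and final verdicts toward accuracy.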
