
@siddarthv66 @SkanderMoalla We do not estimate the partition function, but calculate the exact value using an analytical expression. Therefore, we do not need any tricks to improve the estimate, since the value is already exact
English
Simon Matrenok
4 posts




🚀 Big time! We can finally do LLM RL fine-tuning with rewards and leverage offline/off-policy data! ❌ You want rewards, but GRPO only works online? ❌ You want offline, but DPO is limited to preferences? ✅ QRPO can do both! 🧵Here's how we do it:



