Simon Matrenok
@matrs01
4 posts
Joined March 2024
53 Following · 15 Followers
Simon Matrenok @matrs01
@siddarthv66 @SkanderMoalla We do not estimate the partition function; we compute its exact value from an analytical expression. Therefore we need no tricks to improve the estimate, since the value is already exact.
Siddarth Venkatraman @siddarthv66
I don’t think estimating the partition function through MC integration (as an expectation) is a novel idea; this has been done in many prior works, including the Trajectory Balance paper I linked above. The only difference is that we also find estimating log Z with the current policy works better than with p_ref, and this is still a fine thing to do for the squared regression objective, since the optimal policy is the unique optimum at 0 loss. I suggest trying it: run the same experiments in your paper, but estimate log Z in batch with trajectories from your current policy instead of from the reference policy.
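The batch estimate Siddarth describes boils down to a logsumexp of scaled rewards over sampled trajectories. A minimal sketch in Python (the helper name `mc_log_partition` and the Gaussian toy rewards are illustrative, not from any cited codebase):

```python
import math
import random

def mc_log_partition(rewards, beta):
    """Monte-Carlo estimate of log Z = log E[exp(r / beta)].

    `rewards` are rewards of completions sampled from the chosen policy
    (reference or current); a numerically stable logsumexp is used.
    """
    scaled = [r / beta for r in rewards]
    m = max(scaled)
    return m + math.log(sum(math.exp(s - m) for s in scaled) / len(scaled))

# Toy usage: Gaussian stand-in rewards instead of real policy rollouts.
rng = random.Random(0)
rewards = [rng.gauss(0.0, 1.0) for _ in range(1024)]
log_z = mc_log_partition(rewards, beta=1.0)
```

With r ~ N(0, 1) and beta = 1 the true value is log E[e^r] = 0.5, so the estimate should land nearby; the variance of such estimates is exactly what the thread's "noisy estimate" discussion is about.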
Skander Moalla @SkanderMoalla
🚀 Big time! We can finally do LLM RL fine-tuning with rewards and leverage offline/off-policy data!
❌ You want rewards, but GRPO only works online?
❌ You want offline, but DPO is limited to preferences?
✅ QRPO can do both!
🧵 Here's how we do it:
Simon Matrenok @matrs01
I couldn’t be prouder to share this. 🎉 Our work on Quantile Reward Policy Optimization (QRPO) for LLM RL fine-tuning bridged deep theory and large-scale practice:
* Theory first. We cracked the partition-function “intractable” myth, reframing it with moment-generating functions and a family of tractable reward transformations.
* Scale matters. Thousands of controlled chat & code runs put QRPO up against DPO, REBEL, and SimPO, testing length bias, sample-probability shifts, β-sensitivity, and more.
I hope QRPO sparks new directions for aligning LLMs straight to reward signals, while staying simple, stable, and effective in both offline/off-policy and online RL. There’s still plenty to explore (appendix lovers, rejoice!), but the results speak for themselves.
Huge thanks to my brilliant collaborators @SkanderMoalla and @caglarml. Couldn’t have done it without you. 🙏
📰 Paper: arxiv.org/abs/2507.08068
🧑‍💻 Code: github.com/CLAIRE-Labo/qu…
🌐 Blog: claire-labo.github.io/quantile-rewar…
Star the repo, share feedback, and let’s keep pushing LLM alignment forward! 🚀
[Quoted tweet: Skander Moalla @SkanderMoalla, shown above]
Simon Matrenok @matrs01
Exactly! We aim for the same goal, but as you highlight, we take a different approach, which allows for a simple regression objective by employing the closed-form expression for the partition function Z. Note, however, that SPO used 800 samples per prompt to estimate the value function, while QRPO requires only 1–3 samples for chat tasks and up to 20 for the code generation task, making it much more compute-efficient.
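The few-samples regime mentioned above can be pictured as a plug-in quantile estimate: rank a completion's reward against a handful of reference-policy rewards for the same prompt. A hedged sketch (the helper name `empirical_quantile` is mine and this is not the authors' implementation):

```python
def empirical_quantile(reward, ref_rewards):
    """Fraction of reference-policy rewards at or below `reward`.

    With only a few reference samples per prompt, this plug-in estimate
    of a quantile-transformed reward is coarse but cheap, which is the
    compute trade-off discussed in the thread.
    """
    return sum(r <= reward for r in ref_rewards) / len(ref_rewards)

# Toy example: 3 reference completions for one prompt.
q = empirical_quantile(1.2, [0.4, 0.9, 1.5])  # 2 of 3 are below 1.2
```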
Goliath @zero_goliath
Both optimize a KL-regularized reward and allow for off-policy RL. The differences seem to be:
- QRPO sidesteps value estimation by making the partition function tractable via quantiles; SPO embeds value in its cumulative Q-parameterization.
- QRPO uses a single regression loss on quantile-transformed rewards; SPO mixes terminal Q-regression, (optional) intermediate targets, and importance-weighted policy updates.
Simon Matrenok @matrs01
We do indeed share the same regression-style objective, as many others such as DRO (Richemond et al., 2024) and SPO (Cohen et al., 2025) do. So the novelty is not the loss itself but the tractable Z, which turns an idea that’s elegant in theory into something practical for LLMs. As we discuss in the paper, the squared loss is the most straightforward and elegant objective to optimize. However, everyone else seems to rely on learning or estimating Z, which is either complex or noisy, while we instead take a novel direction: we free ourselves from estimation and derive a closed-form expression. That’s what turns a neat theoretical idea into something that actually scales on LLMs and opens the door to a new family of methods.
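The closed form is easy to sanity-check numerically. If the quantile-transformed reward is Uniform(0, 1) under the reference policy (the probability-integral-transform property a quantile/CDF transformation gives for a continuous reward), then Z = E[exp(q/β)] = β(e^{1/β} − 1), and a Monte-Carlo estimate should agree. A sketch under that assumption (variable and function names are mine, not from the paper's code):

```python
import math
import random

BETA = 0.1  # illustrative KL-regularization strength

def closed_form_log_z(beta):
    # If q(y) ~ Uniform(0, 1) under pi_ref, then
    # Z = E[exp(q / beta)] = integral_0^1 e^(u/beta) du = beta * (e^(1/beta) - 1).
    return math.log(beta) + math.log(math.expm1(1.0 / beta))

def mc_log_z(beta, n, seed=0):
    # Monte-Carlo check: average exp(u / beta) over uniform draws.
    rng = random.Random(seed)
    total = sum(math.exp(rng.random() / beta) for _ in range(n))
    return math.log(total / n)

exact = closed_form_log_z(BETA)     # ≈ 7.70 for beta = 0.1
approx = mc_log_z(BETA, 100_000)    # should agree to a couple of decimals
```

The point of the comparison is the thread's claim in miniature: the analytical value needs zero samples and has zero variance, while the MC estimate needs many samples to get close.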