
Raphael Avalos
@raphael_avalos
Writing the PhD thesis @aibrussels | ex-Cohere and FWO Fellow

I'm excited to share our new pre-print ShiQ: Bringing back Bellman to LLMs! arxiv.org/abs/2505.11081 In this work, we propose a new, Q-learning inspired RL algorithm for finetuning LLMs 🎉 (1/n)
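For readers unfamiliar with the reference, a minimal sketch of the classic tabular Q-learning (Bellman) update the title alludes to. This is generic textbook Q-learning, not the ShiQ algorithm from the paper; in the LLM setting one would think of states as token prefixes and actions as next tokens.

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99, terminal=False):
    """One Bellman backup: Q(s,a) += alpha * (target - Q(s,a)).

    The target bootstraps from the best action in the next state,
    unless the transition is terminal.
    """
    target = r if terminal else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

# Toy 2-state, 2-action example: action 1 in state 0 yields reward 1
# and ends the episode, so Q[0, 1] should converge toward 1.0.
Q = np.zeros((2, 2))
for _ in range(100):
    Q = q_update(Q, s=0, a=1, r=1.0, s_next=1, terminal=True)
print(Q[0, 1])
```

With a constant step size of 0.1, each update closes a tenth of the remaining gap to the target, so the estimate approaches 1.0 geometrically.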

📢 After months of work, I can finally share our latest research. Couldn’t be more thrilled! 🎉 We unify a policy 🤖 and a world model 🌍 into a single LLM, so no external dynamics model is needed. Why does this matter? Because the policy can now plan based on its internal world model, and this planning boosts tool-use success rates to >90%, on top of SFT + RL. 📄: arxiv.org/abs/2506.02918 🧵[1/8]

How come people don’t do Q-learning on LLMs?

Today (two weeks after model launch 🔥) we're releasing a technical report on how we made Command A and R7B 🚀! It has detailed breakdowns of our training process and evaluations per capability (tools, multilingual, code, reasoning, safety, enterprise, long context) 🧵 1/3.

Still 8 days to submit your work to the ALA workshop at AAMAS! We welcome full papers, work in progress, and 2-page abstracts of recently published journal papers. All the info is available at ala-workshop.github.io.

Excited to announce the 17th Adaptive and Learning Agents (ALA) workshop at @AAMASconf in May! We welcome full papers, work in progress, and 2-page abstracts of recently published journal papers. Find out more at our website: ala-workshop.github.io. Deadline for submissions: February 4th.

RLHF gains are largely determined by the quality of the underlying reward model. How can we improve reward model quality without collecting more data? Introducing a novel approach to augmenting human feedback data with synthetic preferences! 🧵 arxiv.org/abs/2401.12086
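As background for this post, a minimal sketch of the standard Bradley-Terry preference loss used to train RLHF reward models; this is the generic objective, not the paper's synthetic-preference augmentation method.

```python
import math

def preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry loss: -log sigmoid(r_chosen - r_rejected).

    Small when the reward model scores the preferred response higher,
    large when it misranks the pair.
    """
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A correctly ranked pair incurs a smaller loss than a misranked one.
print(preference_loss(2.0, 0.0))
print(preference_loss(0.0, 2.0))
```

At zero margin the loss is log 2, and it decreases monotonically as the reward gap in favor of the chosen response grows, which is what pushes the model to separate preferred from rejected completions.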

In clinical early warning systems (EWS), can we go beyond the model's estimate of event occurrence and leverage its belief about the time to the event to improve our alarm policy? Introducing “Dynamic Survival Analysis for Early Event Prediction” with @ToManuelBurger and @gxr. 🧶

Arrived at #ICLR2024 with @f_delgrange to present our work "The Wasserstein Believer: Learning Belief Updates for Partially Observable MDPs through Reliable Latent Space Models".