Alexandre L.-Piché
@alexpiche_
146 posts

taking REINFORCE seriously

Montreal, QC · Joined October 2011
4.8K Following · 1.4K Followers

Pinned Tweet
Alexandre L.-Piché @alexpiche_ ·
It’s my last week at @ServiceNowRSRCH after joining ElementAI as an intern in 2018. Grateful for the incredible mentors and collaborators over the years. Ending things on a high note: PipelineRL won Best Paper at nowAI earlier this month! Hi @lawrennd 👋
Alexandre L.-Piché retweeted
finbarr @finbarrtimbers ·
With in-flight updates (PipelineRL, from @alexpiche_, @DBahdanau et al.), we update our actors in the middle of generation. The system is much faster because we don't have to drain the generation queues to update the weights (the same problem as static batching).
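The mid-generation update described above can be sketched in a few lines. This is a toy model, not the actual PipelineRL or open-instruct code; all class and method names here are illustrative assumptions. The point is that a generator keeps decoding while the trainer swaps its weights, so a single sequence can contain tokens produced by several policy versions.

```python
# Toy sketch of an in-flight weight update (illustrative, not PipelineRL's API).
from dataclasses import dataclass, field

@dataclass
class ToyGenerator:
    version: int = 0                               # currently loaded policy version
    token_versions: list = field(default_factory=list)

    def load_weights(self, new_version: int) -> None:
        # In-flight update: swap weights without draining ongoing sequences.
        self.version = new_version

    def step(self) -> None:
        # Decode one token under whatever weights are currently loaded.
        self.token_versions.append(self.version)

gen = ToyGenerator()
for t in range(6):
    if t == 3:
        gen.load_weights(1)    # trainer pushes new weights mid-generation
    gen.step()

# The finished sequence mixes tokens from policy v0 and v1:
print(gen.token_versions)      # [0, 0, 0, 1, 1, 1]
```

In the drain-based alternative, `load_weights` could only run between sequences, so every generator slot would sit idle until the longest sequence in the batch finished.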
Alexandre L.-Piché retweeted
Hamish Ivison @hamishivi ·
to continue the PipelineRL glazing, @finbarrtimbers implemented PipelineRL for open-instruct a little bit ago and it ended up being probably the single biggest speedup to our overall pipeline. We went from 2-week long RL runs to 5-day runs, without sacrificing performance (combined with some other threading etc. updates). Here's IFEval perf for an internal model (same data, same starting model, same bsz). Same number of training steps, same end perf, but PipelineRL is much faster.
Quoting Rishabh Agarwal (@agarwl_): "Don't sleep on PipelineRL -- this is one of the biggest jumps in compute efficiency of RL setups that we found in the ScaleRL paper…" (full tweet below)

Alexandre L.-Piché retweeted
Rishabh Agarwal @agarwl_ ·
Don't sleep on PipelineRL -- this is one of the biggest jumps in compute efficiency of RL setups that we found in the ScaleRL paper (also validated by Magistral & others before)! What's the problem PipelineRL solves? In RL for LLMs, we need to send weight updates from the trainer to the generator (to generate data from the latest policy being trained).

(Conventional PPO-off-policy) A naive approach is to start generators on a batch, wait for all sequences to complete, update the model weights for both trainers and generators, and repeat. Unfortunately, this leads to idle generators and low pipeline efficiency due to heterogeneous completion times.

(PipelineRL) Instead, we simply let the generators keep generating tokens, without discarding or finishing ongoing generations, whenever we need to do a weight update -- an "in-flight" weight update. The KV caches for these generations become stale, since they were computed under earlier copies of the weights, but this is OK (see below).
Quoting Alexandre L.-Piché (@alexpiche_): "In-flight weight updates have gone from a 'weird trick' to a must…" (full tweet below)

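The two regimes contrasted above can be compared with a back-of-the-envelope timing model. The numbers and functions below are illustrative assumptions, not measurements from the ScaleRL paper; they only show why heterogeneous completion lengths hurt the drain-based approach.

```python
# Toy timing model: drain-based updates vs. in-flight updates.

def drain_based_time(batch_lengths, num_updates):
    # Each update waits for the longest sequence in the batch; generators
    # holding shorter sequences sit idle until it finishes.
    return num_updates * max(batch_lengths)

def in_flight_time(batch_lengths, num_updates):
    # Generators never stall for weight updates, so total time is roughly
    # total tokens divided by the number of generator slots.
    slots = len(batch_lengths)
    total_tokens = num_updates * sum(batch_lengths)
    return total_tokens / slots

lengths = [100, 400, 1000]   # heterogeneous completion lengths (tokens)
print(drain_based_time(lengths, 10))  # 10000 decode steps
print(in_flight_time(lengths, 10))    # 5000.0 decode steps
```

The more skewed the length distribution, the larger the gap: drain-based time is set entirely by the longest sequence, while in-flight time depends only on the average.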
Lewis Tunstall @_lewtun ·
In the Smol Training Playbook, I tried to survey the state of popular post-training frameworks. Let me know if I missed any and I'll add them to the list!
Alexandre L.-Piché @alexpiche_ ·
In-flight weight updates have gone from a “weird trick” to a must to train LLMs with RL in the last few weeks. If you want to understand the on-policy and throughput benefits here’s the CoLM talk @DBahdanau and I gave: youtu.be/Z1uEuRKACRs
Alexandre L.-Piché retweeted
🇺🇦 Dzmitry Bahdanau @DBahdanau ·
We did lots of good work since PipelineRL release in May: ⚙️ higher throughput, seq parallel training, multimodal, agentic RL 📜 white paper with great explanations and results: arxiv.org/pdf/2509.19128… We'll present today at CoLM EXPO, room 524C, 1pm!
Alexandre L.-Piché retweeted
Torsten Scholak @tscholak ·
🧠 Call for Interns – ServiceNow AI Research (Montreal) Our Foundation Models Lab is recruiting interns for 2026! We train & optimize LLMs, from diffusion-based generation to state-space hybrids. If you care about efficient LLMs, diffusion or reasoning → this is for you. 🧵👇
Alexandre L.-Piché retweeted
vLLM @vllm_project ·
🚀 The RL community keeps pushing boundaries — from better on-policy data and partial rollouts to in-flight weight updates that mix KV caches across models during inference. Continuing inference while weights change and KV states stay stale sounds wild — but that’s exactly what PipelineRL makes work. vLLM is proud to power this kind of modular, cutting-edge RL innovation. Give it a try and share your thoughts!
🇺🇦 Dzmitry Bahdanau@DBahdanau

I am excited to open-source PipelineRL - a scalable async RL implementation with in-flight weight updates. Why wait until your bored GPUs finish all sequences? Just update the weights and continue inference! Code: github.com/ServiceNow/Pip… Blog: huggingface.co/blog/ServiceNo…

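One way to reason about the stale KV states the vLLM post mentions: if each token is tagged with the policy version that produced it, the trainer can measure how off-policy every token is. This is a sketch under that assumption; `staleness` is a hypothetical helper for illustration, not part of PipelineRL's or vLLM's API.

```python
# Per-token staleness tracking (hypothetical helper, illustrative only).

def staleness(token_versions, trainer_version):
    # Lag between the trainer's current policy and the policy that
    # generated each token. 0 means fully on-policy; larger means staler.
    return [trainer_version - v for v in token_versions]

# Tokens of one sequence, produced across two in-flight updates (v3 -> v4 -> v5):
tags = [3, 3, 4, 4, 5]
print(staleness(tags, 5))    # [2, 2, 1, 1, 0]
```

With in-flight updates the maximum lag stays bounded by how often weights are pushed, which is one way to see why the stale KV caches remain tolerable in practice.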
Alexandre L.-Piché retweeted
Sai Rajeswar @RajeswarSai ·
💡So far, I have been sharing our multimodal AI research at @ServiceNow focused on reasoning over pixels. Today, we share a new chapter with an open-source release of our big initiative in the voice and speech domain.🚀 🎧 AU-Harness: Holistic Evaluation of Audio LLM Responses
Alexandre L.-Piché @alexpiche_ ·
Glad to see OpenAI prioritizing abstention responses in their paper! That's a great intro to our TMLR paper, in which we developed an iterative self-reflection method for LLMs to know when to abstain, without ground truth and at no additional cost at test time. openreview.net/pdf?id=SvKPfch…
Adam Tauman Kalai@adamfungi

New research explains why LLMs hallucinate, through a connection between supervised and self-supervised learning. We also describe a key obstacle that can be removed to reduce them. 🧵openai.com/index/why-lang…
