DSPy

3.2K posts

DSPy

@DSPyOSS

An open-source declarative framework for building modular AI software. Programming—not prompting—LLMs via higher-level abstractions & optimizers.

Joined April 2025
61 Following · 13.8K Followers
DSPy reposted
Noah Ziems @NoahZiems
Your RL algorithm makes *LLMs* smarter. Our RL algorithm makes *LLM Training* smarter. These are not the same
[image]
3 replies · 3 reposts · 41 likes · 2K views
DSPy reposted
Omar Khattab @lateinteraction
ICYMI: read the blog on Pedagogical RL. Instead of sampling blindly from your LLM, leverage the label used for RLVR! Learn to directly approximate the distribution of your LLM's plausible rollouts that are actually correct. Then sample from *that*! noahziems.com/pedagogical-rl
Souradip Chakraborty @SOURADIPCHAKR18

🚨Typical RL algorithms and on-policy distillation methods are blind samplers: they use privileged info to score rollouts, but not to *find* them. We ask: can we use privileged info to *actively sample* the rollouts RL wishes it could stumble upon with compute? ⤵️ Pedagogical RL

7 replies · 12 reposts · 83 likes · 6.5K views
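A rough sketch of the idea above, assuming hypothetical names throughout (the `llm` object, its `generate` method, the prompt format, and the toy `verify` check are all stand-ins, not the paper's or DSPy's API). The blog describes learning to approximate the distribution of correct rollouts; the crudest version of that is to condition generation on the privileged RLVR label and keep only verified rollouts:

```python
# Hedged sketch, not the paper's method: use the privileged label to
# *find* rollouts (by conditioning generation on it), not just to score
# them afterward. All names here are hypothetical stand-ins.

def verify(rollout: str, gold_label: str) -> bool:
    # Toy stand-in for the RLVR reward check.
    return rollout.strip().endswith(gold_label)

def pedagogical_sample(llm, problem: str, gold_label: str, n: int = 8) -> list[str]:
    # Privileged info goes into the *sampling* prompt, not only the reward.
    prompt = (
        f"Problem: {problem}\n"
        f"The final answer is known to be {gold_label}.\n"
        "Write a step-by-step solution that arrives at this answer."
    )
    rollouts = [llm.generate(prompt) for _ in range(n)]
    # Correct rollouts sampled this way approximate the distribution the
    # tweet describes: plausible rollouts that are actually correct.
    return [r for r in rollouts if verify(r, gold_label)]
```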
DSPy reposted
Souradip Chakraborty @SOURADIPCHAKR18
If the model can’t already stumble into success or at least useful states, on-policy methods stall. Instead, we argue that an ideal sampler would give our LLM its “nearest successes”: rollouts that are correct, but where every step makes sense to the LLM and thus can be learned
[image]
3 replies · 4 reposts · 24 likes · 4.4K views
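One plausible way to make "every step makes sense to the LLM" concrete, sketched under assumptions (the `student.step_logprob` interface and the threshold are invented for illustration; the paper's actual criterion may differ): among rollouts already known to be correct, keep only those whose least-likely step is still reasonably probable under the student's current policy.

```python
# Hedged sketch of a "nearest successes" filter: keep correct rollouts in
# which no step is a leap the student can't follow.
# `student.step_logprob(context, step)` is an assumed interface returning
# the mean token log-prob of `step` given `context`; -2.0 is an arbitrary
# threshold chosen for illustration.

def nearest_successes(student, prompt: str,
                      correct_rollouts: list[list[str]],
                      threshold: float = -2.0) -> list[list[str]]:
    keep = []
    for steps in correct_rollouts:
        context, worst = prompt, float("inf")
        for step in steps:
            # Track the least plausible step under the student's policy.
            worst = min(worst, student.step_logprob(context, step))
            context += step
        if worst > threshold:  # every step is plausible to the student
            keep.append(steps)
    return keep
```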
DSPy @DSPyOSS
@MaximeRivest wow, one year of Maxime in this community :D, we have been so lucky to have you!
0 replies · 0 reposts · 8 likes · 228 views
DSPy reposted
Noah Ziems @NoahZiems
Extremely excited about our recent work in Pedagogical RL. I’m optimistic approaches like this are going to completely shift how data collection is done for hard agentic tasks like coding
Souradip Chakraborty @SOURADIPCHAKR18

🚨Typical RL algorithms and on-policy distillation methods are blind samplers: they use privileged info to score rollouts, but not to *find* them. We ask: can we use privileged info to *actively sample* the rollouts RL wishes it could stumble upon with compute? ⤵️ Pedagogical RL

5 replies · 13 reposts · 49 likes · 5.2K views
DSPy reposted
will brown @willccbb
@lateinteraction constant factor, close enough :) veeeery nice approach, really excited to dig into it further!
3 replies · 3 reposts · 34 likes · 6.3K views
DSPy reposted
Amrit Singh Bedi @amritsinghbedi3
Are we really using all the available information when we post-train with RL on hard tasks? Maybe not. 🧠 New framework — Pedagogical RL: teaching models to teach themselves to sample trajectories that are both correct AND actually learnable. Up to 40% gains over GRPO and OPD 🚀 👇
Souradip Chakraborty @SOURADIPCHAKR18

🚨Typical RL algorithms and on-policy distillation methods are blind samplers: they use privileged info to score rollouts, but not to *find* them. We ask: can we use privileged info to *actively sample* the rollouts RL wishes it could stumble upon with compute? ⤵️ Pedagogical RL

1 reply · 6 reposts · 25 likes · 3.4K views
DSPy reposted
Braden Hancock @bradenjhancock
In other words: Humans are teaching teacher models how to teach other models the way good human teachers teach other humans, so we can make smarter models that can teach humans to be smarter.

Intuition: A good teacher model will not only lead to the right answer--it will do so following a sequence of steps that the student can follow. Teacher models are penalized for taking leaps that feel like they came out of nowhere.

More cool work out of @lateinteraction's CSAIL lab!
Souradip Chakraborty @SOURADIPCHAKR18

🚨Typical RL algorithms and on-policy distillation methods are blind samplers: they use privileged info to score rollouts, but not to *find* them. We ask: can we use privileged info to *actively sample* the rollouts RL wishes it could stumble upon with compute? ⤵️ Pedagogical RL

2 replies · 15 reposts · 55 likes · 6.8K views
DSPy reposted
Omar Khattab @lateinteraction
End the tyranny of on-policy algorithms in LLM post-training! Maybe the key thing isn't whether your rollouts are purely "on-policy" or not, but the extent to which they’re pedagogically useful. Early explorations into newer paradigms for RL by @SOURADIPCHAKR18* @NoahZiems*:
Souradip Chakraborty @SOURADIPCHAKR18

🚨Typical RL algorithms and on-policy distillation methods are blind samplers: they use privileged info to score rollouts, but not to *find* them. We ask: can we use privileged info to *actively sample* the rollouts RL wishes it could stumble upon with compute? ⤵️ Pedagogical RL

7 replies · 16 reposts · 130 likes · 10.9K views
DSPy reposted
Souradip Chakraborty @SOURADIPCHAKR18
🚨Typical RL algorithms and on-policy distillation methods are blind samplers: they use privileged info to score rollouts, but not to *find* them. We ask: can we use privileged info to *actively sample* the rollouts RL wishes it could stumble upon with compute? ⤵️ Pedagogical RL
[image]
15 replies · 75 reposts · 421 likes · 90.8K views
DSPy reposted
Databricks @databricks
Databricks is proud to be a Founding Gold Sponsor of @TheOfficialACM Conference on AI and Agentic Systems—the first ACM conference dedicated to compound AI and agentic systems, with our co-founder @matei_zaharia on the organizing committee. Join us May 26–29 in San Jose for the premier event for rigorous, reproducible research in compound AI architectures, optimization, and deployment. Register today: caisconf.org
[image]
2 replies · 9 reposts · 37 likes · 1.9K views
DSPy reposted
Lakshya A Agrawal @LakshyAAAgrawal
In San Diego / UCSD today & tomorrow — would love to grab a coffee or say hi to anyone around 👋 please reach out!
0 replies · 2 reposts · 18 likes · 1.8K views