Or Rivlin @or_rivlin
Reinforcement Learning engineer
59 posts · Joined September 2021 · 256 Following · 60 Followers
Or Rivlin@or_rivlin·
@ChenTessler @KyleMorgenstein I remember this paper: arxiv.org/abs/1711.06006 There may be others as well. I think when you learn online via simulation and sample complexity is not a huge issue, PG is typically a better option. But Q enables offline learning and stitching, which is powerful.
Chen Tessler@ChenTessler·
@KyleMorgenstein On the other hand, there's some stuff you can only do with Q. Which sucks. All this hindsight relabeling and FB. Haven't seen it formalized through PG yet
Lior.Finkelshtein.13/07/2024🔺🇮🇱✡︎📟
@or_rivlin @david_lisovtsev "Effective" is something that turned out not to be true in the Isfahan case. According to reports, the bomb was not dropped there because it is not good enough. At Fordow it took 12 of them. I think it is possible, and I also recommended this a week ago, to use a Jericho 3/4 missile based on Shavit 2 with a 5-ton depleted-uranium warhead containing many shaped charges and a sharp tip. At most 2 of those would be needed.
David Lisovtsev@david_lisovtsev·
Everyone knows the main drawback of missiles in general, and a missile corps in particular: the cost per kilogram of explosive delivered onto the enemy. A major factor driving that cost is the reliability of the platform; with fighter jets and simple bombs, for example, we have almost no cases of aircraft failing to reach the target or missing, relatively speaking. The statistics from the Iranian fire are interesting: according to the published data, Iran launched 631 ballistic missiles at Israel, of which roughly 500 reached the area of Israel [the rest broke apart en route], and of those, 243 were headed toward open areas, Palestinian territories, or the sea and therefore did not require interception. In other words, Iran fired 631 ballistic missiles at us and we needed to intercept only 257, about 40%. Only 40% of the Iranian missiles actually got to the point of threatening targets or populated areas! Those are especially shocking percentages; in other words, to actually hit the target you want, even without the enemy's kinetic interception attempts, you need on average 2.5 ballistic missiles.
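A quick back-of-the-envelope check of the figures quoted above; the numbers are taken directly from the tweet, and this is just arithmetic, not an independent source:

```python
# Figures quoted in the tweet above.
launched = 631
reached_israel = 500            # the rest reportedly broke apart en route
headed_to_open_areas = 243      # open areas, Palestinian territories, or the sea

required_interceptions = reached_israel - headed_to_open_areas   # 257
threatening_fraction = required_interceptions / launched         # ~0.41, i.e. roughly 40%
missiles_per_threat = launched / required_interceptions          # ~2.5

print(required_interceptions, round(threatening_fraction, 2), round(missiles_per_threat, 1))
```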
Or Rivlin@or_rivlin·
@Liorfink @david_lisovtsev "Fast" is a relative term; the bomb the Americans used is really not fast compared to a ballistic missile, yet it is still effective for this mission.
Lior.Finkelshtein.13/07/2024🔺🇮🇱✡︎📟
@or_rivlin @david_lisovtsev When you are talking about a missile that reaches hypersonic speed in space (like any medium-range ballistic missile), the guidance system will be very expensive, partly because of the heat. It has to be fast in order to crack bunkers.
Or Rivlin@or_rivlin·
@Liorfink @david_lisovtsev Today we can do super-accurate electro-optical navigation that is also relatively cheap. Assuming such a missile operates after the enemy's air defenses have been neutralized, and therefore does not need to be very fast, electro-optical navigation can be used from a relatively early stage of the dive.
Or Rivlin@or_rivlin·
@seohong_park Great work! I find simplicity to be a major appeal when considering algorithms for practical use.
Or Rivlin@or_rivlin·
@natolambert Can you write a post about how RL is used in O1 style training?
Nathan Lambert@natolambert·
There's a lot of confusion about o1's RL training and the emergence of RL as a popular post-training loss function. Yes, these are the same loss functions and similar data. BUT, the amount of compute used for o1's RL training is much more in line with pretraining. The words we use to describe training are strained already, but o1 may be better viewed as next-token pretraining, rl pretraining, and then some normal post-training.
Or Rivlin@or_rivlin·
@mitsuhiko_nm Also, do you have any intuition on why shifting the reward helped? And did you use a Sigmoid output activation with BCE loss (as was done in previous papers from Sergey's group)?
Mitsuhiko Nakamoto@mitsuhiko_nm·
Many generalist robot policies have been released, but they're not perfect. How can we make them better? Introducing V-GPS🚀: Value Guided Policy Steering, a simple approach to improve any off-the-shelf generalist policy at deployment time.🧵#CoRL2024 🌐nakamotoo.github.io/V-GPS
Or Rivlin@or_rivlin·
@mitsuhiko_nm Great work! Did you only train the value function on expert trajectories, or did you need to "relabel" trajectories to create failures? Past attempts to train on expert demos alone got me divergence, is that why you trained IQL for only 200K steps?
Or Rivlin@or_rivlin·
@svlevine Nice work! It seems intuitive that this approach would improve BC-trained foundation models, but what would be the mechanism to improve RL-trained foundation models, which extract similar value functions?
Sergey Levine@svlevine·
Combining robotic foundation models (Octo, OpenVLA, etc.) with offline RL trained value functions makes them better! A great thing about value functions is that we can plug them into any policy as a filter on samples, providing a lightweight and general improvement mechanism.
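As a rough illustration of the "value function as a filter on samples" idea above, here is a minimal sketch; the `policy` and `q_function` objects and their interfaces are hypothetical placeholders, not the actual V-GPS code:

```python
import numpy as np

def value_filtered_action(policy, q_function, observation, num_samples=10):
    """Sample several candidate actions from a generalist policy, score them with an
    offline-RL-trained Q-function, and execute the highest-value candidate."""
    # Draw candidates from the (stochastic) base policy.
    candidates = [policy.sample_action(observation) for _ in range(num_samples)]
    # Score each candidate with the learned value function.
    scores = np.array([q_function(observation, action) for action in candidates])
    # Act with the candidate the value function prefers.
    return candidates[int(np.argmax(scores))]
```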
Or Rivlin@or_rivlin·
@seohong_park Very cool work! Several points: 1. Representation being more important makes sense, and has echoes in RL transfer as well. 2. Some problems can never have online finetuning, so offline RL still has much merit. 3. Have you considered unsupervised to offline finetuning?
Seohong Park@seohong_park·
We call this framework unsupervised-to-online RL (U2O RL). The recipe is straightforward: In offline-to-online RL, simply replace offline RL (w/ task reward) with unsupervised offline RL (w/ intrinsic reward), such as HILP or offline goal-conditioned RL. That's it!
Seohong Park@seohong_park·
Is "offline RL" in offline-to-online RL really necessary? Surprisingly, we find that replacing offline RL with *unsupervised* offline RL often leads to better online fine-tuning performance -- even for the *same* task! Paper: arxiv.org/abs/2408.14785 🧵↓
Or Rivlin@or_rivlin·
@seohong_park Regarding the constraint in DDPG, it seems like a "distribution" constraint that might inhibit performance (the data has both left and right turns from a state, and we constrain to both); can we get "support" constraints instead? (Maybe AWR as the constraint?)
Seohong Park@seohong_park·
@or_rivlin That's a very good question, and I do think values generalize better than policies in general. I suspect it's because value learning is somewhat "harder" (due to moving targets, TD, more inputs, etc.) and thus is better regularized, but it's still an open question to me.
Seohong Park@seohong_park·
Most works in offline RL focus on learning better value functions. So value learning is the main bottleneck in offline RL... right? In our new paper, we show that this is *not* the case in general! Paper: arxiv.org/abs/2406.09329 Blog post: seohong.me/projects/offrl… A thread ↓
Or Rivlin@or_rivlin·
@seohong_park Thanks for the reply. Why does the value function generalize better than the policy? It would make sense with AWR (data is discarded) but DDPG makes full use of the data
Seohong Park@seohong_park·
@or_rivlin Regarding pessimism, yes it is indeed designed to prevent this, but it is impossible to *completely* prevent visiting OOD states at test time in practice, and we show that policy accuracy on such OOD states heavily affects performance. So I believe pessimism is not enough.
Or Rivlin@or_rivlin·
@aviral_kumar2 @svlevine @seohong_park @kvfrans Very interesting paper! The point about generalization is surprising though: why are BC policies able to remain in-distribution during evaluation (hence the good performance) while offline RL algorithms are not? Aren't pessimism and the various constraints supposed to prevent this?
Aviral Kumar@aviral_kumar2·
There's deeper analysis of this in the paper (re why AWR is bad, what DDPG does). TL;DR: DDPG, which takes the first-order derivative of the value function w.r.t. the policy, is much better than training with weighted SFT.
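To make the contrast concrete, here is a minimal sketch of the two policy-extraction losses being compared; the toy networks and batch are made up for illustration and this is not the paper's implementation:

```python
import torch

obs_dim, act_dim, batch = 4, 2, 8
policy = torch.nn.Sequential(torch.nn.Linear(obs_dim, 32), torch.nn.ReLU(),
                             torch.nn.Linear(32, act_dim))
q_net = torch.nn.Sequential(torch.nn.Linear(obs_dim + act_dim, 32), torch.nn.ReLU(),
                            torch.nn.Linear(32, 1))

obs = torch.randn(batch, obs_dim)            # dataset states
data_actions = torch.randn(batch, act_dim)   # dataset actions
advantages = torch.randn(batch)              # critic's advantage estimates

# DDPG-style extraction: backpropagate the learned Q through the policy's action
# (the first-order derivative of the value function w.r.t. the policy output).
ddpg_loss = -q_net(torch.cat([obs, policy(obs)], dim=-1)).mean()

# AWR-style extraction: advantage-weighted supervised regression onto dataset actions
# ("weighted SFT"); the policy never receives the gradient of Q directly.
weights = advantages.exp().clamp(max=100.0)
awr_loss = (weights * ((policy(obs) - data_actions) ** 2).sum(dim=-1)).mean()
```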
Or Rivlin@or_rivlin·
@AvivTamar1 Regarding 2, I think we should use the embeddings of VLMs rather than their text outputs, as these capture semantics and correspondences, while allowing us fine-grained control of the algorithm's output. I discussed such an idea with Erez Karpas, and he liked it (seeking a student).
Or Rivlin@or_rivlin·
@AvivTamar1 Is it a shortcoming of the algorithm, or of our expectation that policies should somehow generalize in a very human manner? I think if we can inject prior knowledge about the world (such as in LLMs) then we might observe generalization with our RL algorithms.
Or Rivlin@or_rivlin·
@AvivTamar1 I see two paths: 1. Purely using data, as in the recent surge of foundation models for decision making, and several unsupervised RL methods (VIP and the likes of it). 2. Integrating VLMs in the training process. This is less generic and more problem specific.
Or Rivlin@or_rivlin·
@EugeneVinitsky I like the papers by Scott Fujimoto, always full of profound understanding
Eugene Vinitsky 🦋@EugeneVinitsky·
Q1: Who is your favorite RL researcher that you think should be more widely known? Why? 1/2
Or Rivlin@or_rivlin·
@svlevine Do you envision offline RL to be a core tool on your path?
Sergey Levine@svlevine·
Since cat is out of the bag, it’s time I share: I’ll be starting a new adventure with an incredible team of friends and long-time collaborators to take on the big challenge of robot learning at scale! It's called Physical Intelligence (Pi… or π, like the symbol). 🧵👇