Or Rivlin @or_rivlin
Reinforcement Learning engineer
59 posts · Joined September 2021 · 256 Following · 60 Followers
Or Rivlin@or_rivlin·
@ChenTessler @KyleMorgenstein I remember this paper: arxiv.org/abs/1711.06006 There may be others as well. I think when you learn online via simulation and sample complexity is not a huge issue, PG is typically a better option. But Q enables offline learning and stitching, which is powerful.
Chen Tessler@ChenTessler·
@KyleMorgenstein On the other hand, there's some stuff you can only do with Q. Which sucks. All this hindsight relabeling and FB. Haven't seen it formalized through PG yet
Lior.Finkelshtein.13/07/2024🔺🇮🇱✡︎📟
@or_rivlin @david_lisovtsev "Effective" is something that turned out not to be true in the Isfahan case. According to reports, the bomb was not dropped there because it is not good enough. At Fordow it took 12 of them. I think it is possible, and I also recommended this a week ago, to use a Jericho 3/4 missile based on Shavit 2 with a 5-ton depleted-uranium warhead containing many shaped charges and a sharp tip. At most 2 of those would be needed.
David Lisovtsev@david_lisovtsev·
Everyone knows the main drawback of missiles in general, and a missile corps in particular: the cost per kilogram of explosive delivered onto the enemy. A major factor driving that cost is the reliability of the platform; with fighter jets and simple bombs, for example, we have almost no cases of aircraft failing to reach the target or missing, relatively speaking. The statistics from the Iranian fire are interesting: according to the published data, Iran launched 631 ballistic missiles at Israel, of which roughly 500 reached the area of Israel [the rest broke apart en route], and of those, 243 were headed toward open areas, Palestinian territories, or the sea and therefore did not require interception. In other words, Iran fired 631 ballistic missiles at us and we needed to intercept only 257, about 40%. Only 40% of the Iranian missiles actually got to the point of threatening targets or populated areas! Those are especially shocking percentages; in other words, to actually hit the target you want, even without the enemy's kinetic interception attempts, you need on average 2.5 ballistic missiles.
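A quick back-of-the-envelope check of the figures quoted above; the numbers are taken directly from the tweet, and this is just arithmetic, not an independent source:

```python
# Figures quoted in the tweet above.
launched = 631
reached_israel = 500            # the rest reportedly broke apart en route
headed_to_open_areas = 243      # open areas, Palestinian territories, or the sea

required_interceptions = reached_israel - headed_to_open_areas   # 257
threatening_fraction = required_interceptions / launched         # ~0.41, i.e. roughly 40%
missiles_per_threat = launched / required_interceptions          # ~2.5

print(required_interceptions, round(threatening_fraction, 2), round(missiles_per_threat, 1))
```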
Or Rivlin@or_rivlin·
@Liorfink @david_lisovtsev "Fast" is a relative term; the bomb the Americans used is really not fast compared to a ballistic missile, yet it is still effective for this mission.
Lior.Finkelshtein.13/07/2024🔺🇮🇱✡︎📟
@or_rivlin @david_lisovtsev When you are talking about a missile that reaches hypersonic speed in space (like any medium-range ballistic missile), the guidance system will be very expensive, partly because of the heat. It has to be fast in order to crack bunkers.
Or Rivlin@or_rivlin·
@Liorfink @david_lisovtsev Today we can do super-accurate electro-optical navigation that is also relatively cheap. Assuming such a missile operates after the enemy's air defenses have been neutralized, and therefore does not need to be very fast, electro-optical navigation can be used from a relatively early stage of the dive.
Or Rivlin@or_rivlin·
@seohong_park Great work! I find simplicity to be a major appeal when considering algorithms for practical use.
Or Rivlin@or_rivlin·
@natolambert Can you write a post about how RL is used in O1 style training?
Nathan Lambert@natolambert·
There's a lot of confusion about o1's RL training and the emergence of RL as a popular post-training loss function. Yes, these are the same loss functions and similar data. BUT, the amount of compute used for o1's RL training is much more in line with pretraining. The words we use to describe training are strained already, but o1 may be better viewed as next-token pretraining, rl pretraining, and then some normal post-training.
Or Rivlin@or_rivlin·
@mitsuhiko_nm Also, do you have any intuition on why shifting the reward helped? And did you use a Sigmoid output activation with BCE loss (as was done in previous papers from Sergey's group)?
Mitsuhiko Nakamoto@mitsuhiko_nm·
Many generalist robot policies have been released, but they're not perfect. How can we make them better? Introducing V-GPS🚀: Value Guided Policy Steering, a simple approach to improve any off-the-shelf generalist policy at deployment time.🧵#CoRL2024 🌐nakamotoo.github.io/V-GPS
Or Rivlin@or_rivlin·
@mitsuhiko_nm Great work! Did you only train the value function on expert trajectories, or did you need to "relabel" trajectories to create failures? Past attempts to train on expert demos alone got me divergence, is that why you trained IQL for only 200K steps?
Or Rivlin@or_rivlin·
@svlevine Nice work! It seems intuitive that this approach would improve BC-trained foundation models, but what would be the mechanism to improve RL-trained foundation models, which extract similar value functions?
Sergey Levine@svlevine·
Combining robotic foundation models (Octo, OpenVLA, etc.) with offline RL trained value functions makes them better! A great thing about value functions is that we can plug them into any policy as a filter on samples, providing a lightweight and general improvement mechanism.
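As a rough illustration of the "value function as a filter on samples" idea above, here is a minimal sketch; the `policy` and `q_function` objects and their interfaces are hypothetical placeholders, not the actual V-GPS code:

```python
import numpy as np

def value_filtered_action(policy, q_function, observation, num_samples=10):
    """Sample several candidate actions from a generalist policy, score them with an
    offline-RL-trained Q-function, and execute the highest-value candidate."""
    # Draw candidates from the (stochastic) base policy.
    candidates = [policy.sample_action(observation) for _ in range(num_samples)]
    # Score each candidate with the learned value function.
    scores = np.array([q_function(observation, action) for action in candidates])
    # Act with the candidate the value function prefers.
    return candidates[int(np.argmax(scores))]
```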
Or Rivlin@or_rivlin·
@seohong_park Very cool work! Several points: 1. Representation being more important makes sense, and has echoes in RL transfer as well. 2. Some problems can never have online finetuning, so offline RL still has much merit. 3. Have you considered unsupervised to offline finetuning?
Seohong Park@seohong_park·
We call this framework unsupervised-to-online RL (U2O RL). The recipe is straightforward: In offline-to-online RL, simply replace offline RL (w/ task reward) with unsupervised offline RL (w/ intrinsic reward), such as HILP or offline goal-conditioned RL. That's it!
Seohong Park@seohong_park·
Is "offline RL" in offline-to-online RL really necessary? Surprisingly, we find that replacing offline RL with *unsupervised* offline RL often leads to better online fine-tuning performance -- even for the *same* task! Paper: arxiv.org/abs/2408.14785 🧵↓
Or Rivlin@or_rivlin·
@seohong_park Regarding the constraint in DDPG, it seems like a "distribution" constraint that might inhibit performance (the data has both left and right turns from a state, and we constrain to both); can we get "support" constraints instead? (Maybe AWR as the constraint?)
Seohong Park@seohong_park·
@or_rivlin That's a very good question, and I do think values generalize better than policies in general. I suspect it's because value learning is somewhat "harder" (due to moving targets, TD, more inputs, etc.) and thus is better regularized, but it's still an open question to me.
Seohong Park@seohong_park·
Most works in offline RL focus on learning better value functions. So value learning is the main bottleneck in offline RL... right? In our new paper, we show that this is *not* the case in general! Paper: arxiv.org/abs/2406.09329 Blog post: seohong.me/projects/offrl… A thread ↓
Or Rivlin@or_rivlin·
@seohong_park Thanks for the reply. Why does the value function generalize better than the policy? It would make sense with AWR (data is discarded) but DDPG makes full use of the data
Seohong Park@seohong_park·
@or_rivlin Regarding pessimism, yes it is indeed designed to prevent this, but it is impossible to *completely* prevent visiting OOD states at test time in practice, and we show that policy accuracy on such OOD states heavily affects performance. So I believe pessimism is not enough.
Or Rivlin@or_rivlin·
@aviral_kumar2 @svlevine @seohong_park @kvfrans Very interesting paper! The point about generalization is surprising though: why are BC policies able to remain in-distribution during evaluation (hence the good performance) while offline RL algorithms are not? Aren't pessimism and the various constraints supposed to prevent this?
Aviral Kumar@aviral_kumar2·
There's deeper analysis of this in the paper (re why AWR is bad, what DDPG does). TL;DR: DDPG, which takes the first-order derivative of the value function w.r.t. the policy, is much better than training with weighted SFT.
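To make the contrast concrete, here is a minimal sketch of the two policy-extraction losses being compared; the toy networks and batch are made up for illustration and this is not the paper's implementation:

```python
import torch

obs_dim, act_dim, batch = 4, 2, 8
policy = torch.nn.Sequential(torch.nn.Linear(obs_dim, 32), torch.nn.ReLU(),
                             torch.nn.Linear(32, act_dim))
q_net = torch.nn.Sequential(torch.nn.Linear(obs_dim + act_dim, 32), torch.nn.ReLU(),
                            torch.nn.Linear(32, 1))

obs = torch.randn(batch, obs_dim)            # dataset states
data_actions = torch.randn(batch, act_dim)   # dataset actions
advantages = torch.randn(batch)              # critic's advantage estimates

# DDPG-style extraction: backpropagate the learned Q through the policy's action
# (the first-order derivative of the value function w.r.t. the policy output).
ddpg_loss = -q_net(torch.cat([obs, policy(obs)], dim=-1)).mean()

# AWR-style extraction: advantage-weighted supervised regression onto dataset actions
# ("weighted SFT"); the policy never receives the gradient of Q directly.
weights = advantages.exp().clamp(max=100.0)
awr_loss = (weights * ((policy(obs) - data_actions) ** 2).sum(dim=-1)).mean()
```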
Or Rivlin@or_rivlin·
@AvivTamar1 Regarding 2, I think we should use the embeddings of VLMs rather than their text outputs, as these capture semantics and correspondences, while allowing us fine-grained control of the algorithm's output. I discussed such an idea with Erez Karpas, and he liked it (seeking a student).
Or Rivlin@or_rivlin·
@AvivTamar1 Is it a shortcoming of the algorithm, or of our expectation that policies should somehow generalize in a very human manner? I think if we can inject prior knowledge about the world (such as in LLMs) then we might observe generalization with our RL algorithms.
Or Rivlin@or_rivlin·
@AvivTamar1 I see two paths: 1. Purely using data, as in the recent surge of foundation models for decision making, and several unsupervised RL methods (VIP and the likes of it). 2. Integrating VLMs in the training process. This is less generic and more problem specific.
Or Rivlin@or_rivlin·
@EugeneVinitsky I like the papers by Scott Fujimoto, always full of profound understanding
Eugene Vinitsky 🦋@EugeneVinitsky·
Q1: Who is your favorite RL researcher that you think should be more widely known? Why? 1/2
Or Rivlin@or_rivlin·
@svlevine Do you envision offline RL to be a core tool on your path?
Sergey Levine@svlevine·
Since cat is out of the bag, it’s time I share: I’ll be starting a new adventure with an incredible team of friends and long-time collaborators to take on the big challenge of robot learning at scale! It's called Physical Intelligence (Pi… or π, like the symbol). 🧵👇