David Tao

173 posts

@Taodav

PhD candidate @BrownBigAI. MSc from the @rlai_lab.

Providence, Rhode Island · Joined January 2010
186 Following · 454 Followers
Pinned Tweet
David Tao@Taodav·
What does it mean to be “better at” partial observability in RL? Existing benchmarks don't always provide a clear signal for progress. We fix that. Our new work (at RLC 2025 🤖) introduces a new property that ensures your gains are from learning better memory vs other factors. AND we provide a new JAX benchmark with environments that all have this property! 🧵1/5
[image]
David Tao retweeted
Dan Haramati@DanHrmti·
Learning accurate World Models for long horizon planning is hard. So what minimal aspect of world dynamics must a model capture to achieve complex goals? We find a simple and effective solution in our #ICLR2026 paper, which we will present as an Oral at @worldmodel_26. (1/n)
David Tao retweeted
Patrick@dramaticirony·
the scariest thing of all, a disappointing romantic and academic life
[two images]
David Tao@Taodav·
That being said, the current model at top tier conferences is unsustainable too. I’m not sure what the correct answer is, but we shouldn’t ignore visibility.
David Tao@Taodav·
I’ve heard similar takes before, and people like to bring up the incentive systems behind publishing at top-tier conferences (jobs, positions, etc.). I would argue that this often ignores one of the biggest upsides of submitting to a top-tier conference: visibility. This was one of the dangers of the RL community starting our own conference as well. We lose visibility from the wider ML community, which I still think is very important.
Hieu Pham@hyhieu226

AI/ML publication venues are broken beyond fixing. I genuinely believe the only way to fix them is to completely devalue them (best to do that immediately, but perhaps slowly over time since people have inertia). Then, start something new that encourages quality over quantity.

David Tao retweeted
Prithviraj (Raj) Ammanabrolu@rajammanabrolu·
The correct answer to "what online RL algo should you use" has always been and will always be "whatever you know how to tune the hyperparameters for best"
[image]
Arash Ahmadian@aahmadian_

PPO has been cemented as the de facto RL algorithm for RLHF. But… is this reputation + complexity merited?🤔 Our new work revisits PPO from first principles🔎 📜arxiv.org/abs/2402.14740 w @chriscremer_ @mgalle @mziizm @KreutzerJulia Olivier Pietquin @ahmetustun89 @sarahookr

David Tao@Taodav·
@MAghajohari Yes! Currently working on something to help stabilize PPO with LLMs :)
David Tao@Taodav·
THIS! I work with LLM folks, and there seems to be a deep misunderstanding of how reinforcement learning works. I suspect it’s because of the simplified Monte Carlo algorithms (like GRPO) that have become so prevalent, where credit assignment over time isn’t even under consideration.
Khurram Javed@kjaved_

The issue in the first paragraph is real when learning without bootstrapping (e.g., with REINFORCE). TD learning methods can already learn along the way and figure out what went well and what didn't if the value function has a good understanding of the world. This works even if rewards are delayed by hours. Adding planning updates to the mix allows agents to reason about actions they did not take and could try in the future.

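Khurram's point about bootstrapping can be made concrete with a tiny sketch (my own toy code, not from any of the papers discussed): given an accurate value function, the TD error flags a bad transition the moment it happens, even though the actual reward is still many steps away.

```python
# Toy chain: states 0..4 step right toward a +1 terminal reward, so the
# exact value of state s is gamma^(4 - s). A "slip" back to state 0 is bad.
GAMMA = 0.9
V = {s: GAMMA ** (4 - s) for s in range(5)}   # exact values of the chain

def td_error(s, r, s_next):
    # One-step TD error: reward plus discounted bootstrap minus current value.
    return r + GAMMA * V[s_next] - V[s]

# Expected transition from state 2 (moves toward the reward):
print(td_error(2, 0.0, 3))   # ~0: nothing surprising, no update needed
# Bad transition from state 2 (slips back to the start):
print(td_error(2, 0.0, 0))   # clearly negative: the mistake is detected now,
                             # long before the delayed reward fails to arrive
```

A Monte Carlo method like GRPO would assign no signal to this step until the episode's return is in hand.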
David Tao@Taodav·
A huge shoutout to my co-authors @KaichengGuo27, @camall3n and George Konidaris. POBAX is available on GitHub. If you’re curious to learn more, check out our paper (openreview.net/forum?id=HUTCb…) or come chat with us at RLC 2025! We’ll be presenting this at the Track 4: Evaluation, Benchmarks session on August 6th. Come say hi! 🧵5/5
David Tao@Taodav·
We introduce POBAX: an open-source benchmark on partial observability that includes a diverse range of memory-improvable environments. POBAX is entirely written in JAX for extremely fast, GPU-scalable hyperparameter sweeping and experimentation. 🧵4/5
[image]
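The GPU-scalability claim comes from JAX's `jit`/`vmap` transforms over pure functions. Below is a minimal stdlib-Python sketch of the sweep pattern only (the function name and toy objective are hypothetical, not POBAX's API); in JAX you would jit-compile the evaluation and replace the comprehension with `jax.vmap`, so every hyperparameter setting runs in parallel on the accelerator.

```python
# Sketch of a hyperparameter sweep as a map over a grid (names hypothetical).
# In JAX: evaluate = jax.jit(evaluate); scores = jax.vmap(evaluate)(lrs, lams)
import itertools

def evaluate(lr, lam):
    """Stand-in for a full training run returning a score. Being a pure
    function of its inputs is what makes jit/vmap batching possible."""
    return -(lr - 3e-4) ** 2 - (lam - 0.95) ** 2   # toy objective

grid = list(itertools.product([1e-4, 3e-4, 1e-3], [0.9, 0.95, 0.99]))
scores = [evaluate(lr, lam) for lr, lam in grid]   # vmap would batch this
best = max(zip(scores, grid))
print(best)   # best (score, (lr, lambda)) pair in the grid
```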
David Tao retweeted
kyle r seibel@kylerseibel·
got my sticker
[image]
David Tao@Taodav·
@jonas_eschmann @camall3n @samlakig With TD(lambda) algorithms you normally see instability if you set lambda = 1. Given that we're using truncated TD(lambda) with a truncation length of only 16, a small difference in lambdas close to 1 will probably not change things too much!
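For readers following along, here is a hedged sketch of a truncated λ-return (my own toy code, not POBAX's implementation): with truncation length T = 16, both λ = 0.95 and λ = 1.0 mix the same sixteen n-step returns, with λ = 1 simply putting all its weight on the 16-step return, so the two targets land close together.

```python
# Truncated lambda-return: geometric mix of 1..T-step bootstrapped returns,
# with the leftover weight lam^(T-1) placed on the T-step return.
GAMMA = 0.99

def n_step_return(rewards, values, n):
    """n-step return from time 0, bootstrapping with values[n]."""
    g = sum(GAMMA ** i * rewards[i] for i in range(n))
    return g + GAMMA ** n * values[n]

def truncated_lambda_return(rewards, values, lam, T):
    g = (1 - lam) * sum(lam ** (n - 1) * n_step_return(rewards, values, n)
                        for n in range(1, T))
    return g + lam ** (T - 1) * n_step_return(rewards, values, T)

rewards = [0.0] * 16 + [1.0]   # the reward falls just past the 16-step window
values = [0.5] * 18            # placeholder value estimates carry the signal
for lam in (0.95, 1.0):
    print(lam, truncated_lambda_return(rewards, values, lam, 16))
```

With this short window the two targets differ by only a few hundredths, which is the intuition behind not sweeping λ all the way to 1.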
Jonas Eschmann@jonas_eschmann·
Thanks for the response! Any reason not to include a straight 1? I think 0.95 vs. 1 could make a big difference! In my experience, a discount factor (gamma) of 0.99 vs. 0.995 can make a big difference, for example, and I don’t see why that couldn’t be the case with lambda as well 🤔
Cam Allen@camall3n·
RL in POMDPs is hard because you need memory. Remembering *everything* is expensive, and applied naively, RNNs can only get you so far. New paper: 🎉 we introduce a theory-backed loss function that greatly improves RNN performance! 🧵 1/n
[GIF]
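As I understand the idea from this thread, the core quantity is a discrepancy between value estimates computed with different λ: in a fully observable MDP they should agree, so a persistent gap signals hidden state, and penalizing it pushes the RNN toward better memory. A hedged toy sketch (my own code and numbers, not the paper's exact loss):

```python
# Lambda-return via the backward recursion
#   G_t = r_t + gamma * ((1 - lam) * V(s_{t+1}) + lam * G_{t+1}),
# so lam = 0 gives the one-step TD target and lam = 1 the Monte Carlo return.
GAMMA = 0.99

def lambda_return(rewards, values, lam):
    g = values[-1]                       # bootstrap at the end of the segment
    for t in reversed(range(len(rewards))):
        g = rewards[t] + GAMMA * ((1 - lam) * values[t + 1] + lam * g)
    return g

rewards = [0.0, 0.0, 1.0]
values = [0.6, 0.2, 0.2, 0.0]   # aliased: two distinct states share V = 0.2

td0_target = lambda_return(rewards, values, 0.0)
mc_target = lambda_return(rewards, values, 1.0)
discrepancy = (td0_target - mc_target) ** 2   # candidate auxiliary penalty
print(td0_target, mc_target, discrepancy)
```

Under the aliased values the two targets disagree sharply; driving that gap to zero is what forces the recurrent state to disambiguate where the agent has been.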
David Tao retweeted
Cam Allen@camall3n·
We also trained a probe to reconstruct the PacMan dots from the agent’s memory. Guess which agent had an easier time with this… Yep! The λ-discrepancy agent knows where it has been, while the normal RNN agent basically has no idea. 9/n
[GIF]
David Tao@Taodav·
@KhurramJaved_96 Thanks Khurram! Looking forward to seeing you and discussing at RLC 😀