David Tao

173 posts

@Taodav

PhD candidate @BrownBigAI. MSc from the @rlai_lab.

Providence, Rhode Island · Joined January 2010

186 Following · 454 Followers

Pinned Tweet

David Tao @Taodav · Aug 1

What does it mean to be “better at” partial observability in RL? Existing benchmarks don't always provide a clear signal for progress. We fix that. Our new work (at RLC 2025 🤖) introduces a new property that ensures your gains are from learning better memory vs other factors. AND we provide a new JAX benchmark with environments that all have this property! 🧵1/5

154

12.4K

David Tao retweeted

Dan Haramati @DanHrmti · Feb 4

Learning accurate World Models for long horizon planning is hard. So what minimal aspect of world dynamics must a model capture to achieve complex goals? We find a simple and effective solution in our #ICLR2026 paper, which we will present as an Oral at @worldmodel_26. (1/n)

292

29.1K

David Tao retweeted

Patrick @dramaticirony · Nov 1

the scariest thing of all, a disappointing romantic and academic life

1.1K

13.2K

1.3M

David Tao retweeted

Elai @elaifresh · Oct 11

We must become more Chinese

Elai @elaifresh

Incredible things are happening in China

145

10.2K

94.2K

4.1M

David Tao @Taodav · Aug 31

That being said, the current model at top tier conferences is unsustainable too. I’m not sure what the correct answer is, but we shouldn’t ignore visibility.

190

David Tao @Taodav · Aug 31

I’ve heard similar takes before, and people like to bring up the incentive systems behind publishing at top tier conferences (jobs, positions, etc.). I would argue that this often ignores one of the biggest upsides of submitting to a top tier conference: visibility. This was one of the dangers of the RL community starting our own conference as well. We lose visibility from the wider ML community, which I still think is very important.

Hieu Pham @hyhieu226

AI/ML publication venues are broken beyond fixable. I genuinely believe the only way to fix them is to completely devalue them (best to do that immediately, but perhaps slowly over time since people have inertia). Then, start something new that encourages quality over quantity.

513

David Tao retweeted

Prithviraj (Raj) Ammanabrolu @rajammanabrolu · Feb 24

The correct answer to "what online RL algo should you use" has always been and will always be "whatever you know how to tune the hyperparameters for best"

Arash Ahmadian @aahmadian_

PPO has been cemented as the de facto RL algorithm for RLHF. But… is this reputation + complexity merited?🤔 Our new work revisits PPO from first principles🔎 📜arxiv.org/abs/2402.14740 w @chriscremer_ @mgalle @mziizm @KreutzerJulia Olivier Pietquin @ahmetustun89 @sarahookr

8.1K

David Tao @Taodav · Aug 15

@MAghajohari Yes! Currently working on something to help stabilize PPO with LLMs :)

Milad Aghajohari @MAghajohari · Aug 15

@Taodav We studied the effect of accurate credit assignment in VinePPO. I think one culprit is that training an accurate value function is hard. arxiv.org/abs/2410.01679

185

David Tao @Taodav · Aug 15

THIS! Working with LLM folk and there seems to be a deep misunderstanding of how reinforcement learning works. I suspect it’s because of the simplified Monte Carlo algorithms (like GRPO) that have become so prevalent, where credit assignment over time isn’t even under consideration.

Khurram Javed @kjaved_

The issue in the first paragraph is real when learning without bootstrapping (e.g., with REINFORCE). TD learning methods can already learn along the way and figure out what went well and what didn't if the value function has a good understanding of the world. This works even if rewards are delayed by hours. Adding planning updates to the mix allows agents to reason about actions that they did not take and could try in the future.

David Tao @Taodav · Aug 15

@alperahmetoglu HAHAHAHA this is hilarious

Alper Ahmetoglu @alperahmetoglu · Aug 15

@Taodav x.com/rajammanabrolu…

Prithviraj (Raj) Ammanabrolu @rajammanabrolu

The correct answer to "what online RL algo should you use" has always been and will always be "whatever you know how to tune the hyperparameters for best"

175

David Tao @Taodav · Aug 1

A huge shoutout to my co-authors @KaichengGuo27, @camall3n and George Konidaris. POBAX is available on GitHub. If you’re curious to learn more, check out our paper (openreview.net/forum?id=HUTCb…) or come chat with us at RLC 2025! We’ll be presenting this at the Track 4: Evaluation, Benchmarks session on August 6th. Come say hi! 🧵5/5

562

David Tao @Taodav · Aug 1

We introduce POBAX: an open-source benchmark on partial observability that includes a diverse range of memory-improvable environments. POBAX is entirely written in JAX for extremely fast, GPU-scalable hyperparameter sweeping and experimentation. 🧵4/5

507

David Tao @Taodav · Mar 5

About time!!! Well deserved to the founders of our field

Association for Computing Machinery @TheOfficialACM

Meet the recipients of the 2024 ACM A.M. Turing Award, Andrew G. Barto and Richard S. Sutton! They are recognized for developing the conceptual and algorithmic foundations of reinforcement learning. Please join us in congratulating the two recipients! bit.ly/4hpdsbD

340

David Tao retweeted

kyle r seibel @kylerseibel · Nov 5

got my sticker

240

2.4K

183K

David Tao @Taodav · Jul 14

@jonas_eschmann @camall3n @samlakig With TD(lambda) algorithms you normally see instability if you set lambda = 1. Given that we're using truncated TD(lambda) with a truncation length of only 16, a small difference in lambdas close to 1 will probably not change things too much!
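For readers who want to check the effect of lambda under truncation themselves, here is a minimal sketch of a truncated lambda-return computation. It uses the standard backward recursion G_t = r_t + gamma * ((1 - lambda) * V(s_{t+1}) + lambda * G_{t+1}), bootstrapping from the value estimate at the truncation horizon. The function name `truncated_lambda_returns`, the toy rewards, and the constant value estimates are hypothetical and chosen for illustration; this is not the code from the paper, and it ignores episode termination.

```python
def truncated_lambda_returns(rewards, values, gamma, lam):
    """Lambda-returns over one truncated rollout (illustrative sketch).

    rewards: r_0 .. r_{T-1} collected over T steps.
    values:  V(s_0) .. V(s_T), i.e. one more entry than rewards; the last
             entry is the bootstrap value at the truncation horizon.
    Assumes no episode terminates inside the rollout.
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have exactly one more entry than rewards")

    horizon = len(rewards)
    returns = [0.0] * horizon
    next_return = values[horizon]  # bootstrap from the value at the horizon
    for t in reversed(range(horizon)):
        # Blend the one-step bootstrap target with the longer return.
        blended = (1.0 - lam) * values[t + 1] + lam * next_return
        next_return = rewards[t] + gamma * blended
        returns[t] = next_return
    return returns


# Hypothetical 16-step rollout: a single reward at the last step and a
# constant (made-up) value estimate everywhere.
rewards = [0.0] * 15 + [1.0]
values = [0.5] * 17
for lam in (0.9, 0.95, 1.0):
    first = truncated_lambda_returns(rewards, values, gamma=0.99, lam=lam)[0]
    print(f"lambda={lam}: return at t=0 is {first:.4f}")
```

With lam = 0 the recursion reduces to the one-step TD target, and with lam = 1 it reduces to the 16-step return bootstrapped at the horizon. How much nearby lambdas differ in practice depends on the rewards and on the accuracy of the value estimates.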

136

Jonas Eschmann @jonas_eschmann · Jul 13

Thanks for the response! Any reason not to include a straight 1? I think 0.95 vs. 1 could make a big difference! In my experience, a discount factor (gamma) of 0.99 vs. 0.995, for example, can make a big difference, and I don’t see why that couldn’t be the case with lambda as well 🤔

Cam Allen @camall3n · Jul 12

RL in POMDPs is hard because you need memory. Remembering *everything* is expensive, and RNNs can only get you so far applied naively. New paper: 🎉 we introduce a theory-backed loss function that greatly improves RNN performance! 🧵 1/n

320

45.4K

David Tao retweeted

Cam Allen @camall3n · Jul 12

We also trained a probe to reconstruct the PacMan dots from the agent’s memory. Guess which agent had an easier time with this… Yep! The λ-discrepancy agent knows where it has been, while the normal RNN agent basically has no idea. 9/n

1.6K

David Tao @Taodav · Jul 12

@KhurramJaved_96 Thanks Khurram! Looking forward to seeing you and discussing at RLC 😀

Khurram Javed @kjaved_ · Jul 12

@Taodav Very cool!

114

David Tao @Taodav · Jul 12

A work that I'm super proud of, two years in the making: we tackle partial observability in reinforcement learning through value function discrepancies. Check it out!

Cam Allen @camall3n

3.7K
