
Or Rivlin

OK, I'm probably gonna get some flak for this, but... re the classic school(s) of RL: they are detached from reality. The whole "you can turn anything into an MDP, just fold stuff into the state until it's Markov, so algorithms that optimally solve MDPs lead to general intelligence" might be true in infinite theory, but in finite reality it makes no sense at all. That's why it's mostly OK for simple games, and the music stops there, with policy gradient taking over. That's why I (after some implementing, playing around, and thinking about both) quickly abandoned the Bellman school and became an adept of the Williams '92 church.
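
For reference, the "Williams '92 church" is REINFORCE, the score-function policy-gradient estimator from Williams (1992). Below is a minimal sketch of that update; the environment (Gymnasium's CartPole), network size, and hyperparameters are all illustrative assumptions, not anything from the original post:

```python
# Minimal REINFORCE (Williams, 1992) sketch with PyTorch.
# Env choice, architecture, and hyperparameters are illustrative.
import torch
import torch.nn as nn
import gymnasium as gym

env = gym.make("CartPole-v1")
policy = nn.Sequential(
    nn.Linear(env.observation_space.shape[0], 64),
    nn.Tanh(),
    nn.Linear(64, env.action_space.n),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
gamma = 0.99

for episode in range(500):
    obs, _ = env.reset()
    log_probs, rewards = [], []
    done = False
    while not done:
        logits = policy(torch.as_tensor(obs, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        obs, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(reward)
        done = terminated or truncated

    # Discounted returns-to-go G_t, computed backwards over the episode.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    returns = torch.tensor(returns)
    # Normalizing returns is a common variance-reduction trick.
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)

    # Policy-gradient loss: -sum_t log pi(a_t | s_t) * G_t
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Note that nothing here bootstraps from a learned value function: the gradient uses only sampled returns, which is exactly the contrast with Bellman-style methods the post is drawing.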

Many generalist robot policies have been released, but they're not perfect. How can we make them better? Introducing V-GPS🚀: Value-Guided Policy Steering, a simple approach to improve any off-the-shelf generalist policy at deployment time. 🧵 #CoRL2024 🌐nakamotoo.github.io/V-GPS
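
The mechanism this thread describes is deployment-time re-ranking: keep the generalist policy frozen, sample several candidate actions from it at each step, and execute the one a learned value function scores highest. A rough sketch of that idea follows; `policy.sample_action` and `q_function` are hypothetical stand-ins, not the actual V-GPS API:

```python
# Sketch of value-guided policy steering at deployment time:
# sample K candidate actions from a frozen generalist policy and
# execute the one a learned Q-function scores highest.
# `policy.sample_action` and `q_function` are hypothetical stand-ins.
import numpy as np

def steered_action(policy, q_function, observation, num_samples=16):
    """Re-rank candidate actions from the base policy with a value function."""
    # K action proposals from the off-the-shelf generalist policy.
    candidates = [policy.sample_action(observation) for _ in range(num_samples)]
    # Score each candidate with the learned Q-function.
    scores = np.array([q_function(observation, a) for a in candidates])
    # Execute the highest-value candidate; the base policy is never updated.
    return candidates[int(np.argmax(scores))]
```

The appeal of this pattern is that it needs no fine-tuning of the generalist policy itself: the same value function can steer any base policy you can sample from.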

So here's a story of, by far, the weirdest bug I've encountered in my CS career. @maciejwolczyk and I have been training a neural network that learns how to play NetHack, an old roguelike game that looks like the screenshot. Recently, something unexpected happened.
