OpTI-BFM bears similarities to LinUCB for bandits, which we leverage to prove sublinear regret in episodic settings under mild assumptions.
Because it's online, OpTI-BFM can also adapt to time-varying (non-stationary) rewards by decaying the weight on older observations (sketch below 👇).
(4/5)
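Roughly, the update machinery is discounted LinUCB. A minimal sketch, assuming a linear reward model r ≈ z·φ(s); names are illustrative, not the paper's released code:

```python
import numpy as np

class DiscountedLinUCB:
    """LinUCB-style optimistic estimator with a forgetting factor.

    Sketch only: rewards are modeled as r ≈ z·phi(s); gamma < 1 decays
    old observations so the estimate can track a drifting task vector z.
    """

    def __init__(self, dim, alpha=1.0, gamma=0.99, reg=1.0):
        self.alpha = alpha          # width of the optimism bonus
        self.gamma = gamma          # forgetting factor on past data
        self.A = reg * np.eye(dim)  # discounted design matrix
        self.b = np.zeros(dim)      # discounted reward-weighted features

    def update(self, phi, r):
        # Decay past statistics, then fold in the new observation.
        self.A = self.gamma * self.A + np.outer(phi, phi)
        self.b = self.gamma * self.b + r * phi

    def ucb(self, phi):
        # Optimistic value: ridge estimate of z plus an uncertainty bonus.
        A_inv = np.linalg.inv(self.A)
        z_hat = A_inv @ self.b
        return float(z_hat @ phi + self.alpha * np.sqrt(phi @ A_inv @ phi))
```

Acting on the most optimistic candidate is what lets the LinUCB-style regret analysis go through.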
Excited to share that our paper "Optimistic Task Inference for Behavior Foundation Models" was accepted for ICLR 2026.
BFMs are great at zero-shot RL, but task inference still requires a dataset with reward labels. Our method, OpTI-BFM, offers an online alternative (see the sketch below).
(1/5)
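For context, the standard offline recipe in FB-style BFMs infers the task vector by regressing reward labels onto learned (backward) features. A minimal sketch under that linear-reward assumption (`B_feats` is a placeholder name):

```python
import numpy as np

def infer_task_vector(B_feats, rewards, reg=1e-3):
    """Offline task inference: ridge-regress reward labels onto features
    B(s), giving z with r(s) ~ z . B(s). Needs a reward-labeled dataset.

    B_feats: (N, d) features of the labeled states.
    rewards: (N,) reward labels.
    """
    d = B_feats.shape[1]
    return np.linalg.solve(B_feats.T @ B_feats + reg * np.eye(d),
                           B_feats.T @ rewards)
```

OpTI-BFM replaces this batch regression with online, optimistic updates.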
Last week I presented our latest work, 🐝“Epistemically-guided forward backward exploration (FBEE)”🐝, at @RL_Conference.
TLDR: Active learning for unsupervised RL
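One generic way to make "active learning" concrete here (not FBEE's exact objective, just the common pattern): explore where epistemic uncertainty is largest, e.g. where an ensemble of learned models disagrees most.

```python
import numpy as np

def disagreement_bonus(ensemble, state, action):
    """Epistemic exploration bonus: prediction variance of an ensemble.

    `ensemble` is any list of prediction functions (e.g. forward models);
    high disagreement ~ high epistemic uncertainty ~ worth exploring.
    """
    preds = np.stack([f(state, action) for f in ensemble])  # (K, d)
    return float(preds.var(axis=0).mean())

# Toy usage with three slightly different linear "models":
ensemble = [lambda s, a, w=w: w * np.concatenate([s, a]) for w in (0.9, 1.0, 1.1)]
print(disagreement_bonus(ensemble, np.zeros(3), np.ones(2)))
```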
When multiple tasks need improvement, fine-tuning a generalist policy becomes tricky. How do we allocate a demonstration budget across a set of tasks of varied difficulty and familiarity?
We are presenting a possible solution at ICML on Wednesday! (A naive baseline is sketched below for contrast.)
(1/3)
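To make the allocation question concrete, here is a deliberately naive baseline (hypothetical, not the method in the paper): weight tasks by how often the current policy fails them.

```python
import numpy as np

def allocate_budget(failure_rates, budget, temperature=1.0):
    """Naive strawman: softmax demonstration counts over estimated
    per-task failure rates, so harder/less familiar tasks get more demos.

    failure_rates: (T,) estimated failure probability per task.
    budget: total number of demonstrations to hand out.
    """
    w = np.exp(np.asarray(failure_rates, dtype=float) / temperature)
    p = w / w.sum()
    counts = np.floor(p * budget).astype(int)
    counts[np.argmax(p)] += budget - counts.sum()  # hand out the remainder
    return counts

print(allocate_budget([0.1, 0.5, 0.9], budget=20))
```

The actual solution is in the talk; this strawman just frames the problem.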
Our method tackles the occupancy-matching objective directly at test time: we estimate the agent's occupancy with samples from a learned world model and match it to the expert's occupancy using Optimal Transport (sketch 👇).
(2/3)
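A minimal sketch of the matching step with the POT library, assuming we already have rollout states from the world model and expert states as arrays (names are illustrative):

```python
import numpy as np
import ot  # POT: Python Optimal Transport

def occupancy_ot_cost(agent_states, expert_states, reg=0.05):
    """Entropy-regularized OT cost between empirical state occupancies.

    agent_states:  (N, d) states sampled from the learned world model.
    expert_states: (M, d) states from the (sparse) demonstration.
    """
    a = ot.unif(len(agent_states))            # uniform weights, agent
    b = ot.unif(len(expert_states))           # uniform weights, expert
    M = ot.dist(agent_states, expert_states)  # pairwise sq. Euclidean cost
    return ot.sinkhorn2(a, b, M, reg)         # Sinkhorn transport cost
```

Minimizing a cost like this over candidate behaviors is what pulls the agent's occupancy toward the expert's.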
Zero-shot imitation from just a single sparse demonstration is hard. Goal-conditioned methods tend to “greedily” move from one state to the next and lose the big picture.
We're presenting an alternative approach on Tuesday at #ICML2025.
(1/3)