Max Tensor retweeted

I was surprised by how many didn't know that (1) per-token MLE is the same as whole-sequence MLE, and (2) PG at the token level is the same as PG at the sequence level (optimizing one big combinatorial action).
The story is different if you introduce a fitted critic/Q-values or intermediate resets.
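A quick numerical sanity check of claim (1): by the chain rule, the sequence log-likelihood factorizes into a sum of per-token log-likelihoods, so maximizing per-token MLE and whole-sequence MLE are the same objective; and since the gradient of a sum is the sum of gradients, the same holds for the REINFORCE/PG term R * grad log p(y|x). The toy autoregressive parameterisation below (tiny vocabulary, logits conditioned on the previous token) is a hypothetical illustration, not anyone's actual model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy autoregressive model: vocab of 3 tokens, length-2 sequences.
# Row 0 of theta conditions on a start symbol, rows 1..V on the previous token.
V, T = 3, 2
theta = rng.normal(size=(V + 1, V))

def log_softmax(z):
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

def token_logprobs(seq):
    """Per-token log p(y_t | y_<t) under the toy model."""
    lps = []
    prev = 0  # start symbol
    for y in seq:
        lps.append(log_softmax(theta[prev])[y])
        prev = y + 1
    return lps

# (1) Chain rule: log p(y_1..T) = sum_t log p(y_t | y_<t), so the summed
# per-token log-probs define a valid distribution over whole sequences.
all_seqs = [(a, b) for a in range(V) for b in range(V)]
total = sum(np.exp(sum(token_logprobs(s))) for s in all_seqs)
print(total)  # ≈ 1.0

# (2) follows for free: grad[ R * sum_t log p_t ] = sum_t R * grad log p_t,
# i.e. sequence-level REINFORCE is exactly the sum of token-level PG terms.
```

The print confirms the sequence probabilities sum to one, i.e. exponentiating the summed per-token log-probs recovers the whole-sequence likelihood.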
Nando de Freitas @NandoDF
Most RL for LLMs involves only 1 step of RL. It’s a contextual bandit problem and there’s no covariate shift because the state (question, instruction) is given. This has many implications, e.g. DAgger becomes SFT, and it is trivial to design Expectation Maximisation (EM) maximum likelihood solutions that do exactly the same as RL. Of course, RL and multiagent systems will be needed as the picture illustrates.
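The EM/maximum-likelihood point can be checked numerically: for a one-step (contextual bandit) softmax policy, the exact REINFORCE gradient E_a[R(a) grad log pi(a)] coincides with the gradient of a reward-weighted log-likelihood (the M-step objective of an EM-style update), evaluated at the current parameters. The sketch below uses a made-up 4-action reward vector purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
theta = rng.normal(size=4)          # softmax logits over 4 actions
R = np.array([0.1, 0.9, 0.4, 0.2])  # hypothetical per-action rewards

def pi(t):
    z = np.exp(t - t.max())
    return z / z.sum()

p = pi(theta)

# REINFORCE gradient, computed exactly over the action distribution:
# grad_j log pi(a) = 1[j == a] - pi(j)
pg = sum(p[a] * R[a] * (np.eye(4)[a] - p) for a in range(4))

# EM-style M-step objective: reward-weighted log-likelihood under the
# current policy, J(theta') = E_{a ~ pi_theta}[ R(a) * log pi_theta'(a) ].
def J(tp):
    return float(np.dot(p * R, np.log(pi(tp))))

# Numerical gradient of J at theta' = theta (central differences).
eps = 1e-6
em_grad = np.array([
    (J(theta + eps * np.eye(4)[j]) - J(theta - eps * np.eye(4)[j])) / (2 * eps)
    for j in range(4)
])

print(np.allclose(pg, em_grad, atol=1e-5))  # True: identical gradients
```

Because the two gradients agree at the current parameters, one gradient step of reward-weighted MLE moves the policy exactly as one step of policy gradient would, which is the sense in which EM-style maximum likelihood "does the same as RL" in the one-step setting.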