
@rasbt I don't have a good understanding of RL, so I was explaining it to myself as follows: annotation work for SL is slow and expensive, which makes it difficult to cover tasks and domains horizontally. RL is 'cheap' in comparison: a thumbs up/down or an answer ranking is enough, which allows covering general language.

@Pythonner I would say it's similarly expensive in terms of annotation cost. The cost here is getting the labels to train the reward model in the first place. Once you have these labels, you can use the reward model for either RL or SL (or weakly supervised learning).
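A minimal sketch of that point, assuming a toy linear reward model over hypothetical two-dimensional features (helpfulness, toxicity) instead of a neural network over text. The same pairwise preference labels train one reward model. That model then either picks supervised targets (best-of-n filtering) or supplies the reward for a policy-gradient update. The Bradley-Terry pairwise loss is a common choice for reward models but is an assumption here, and every name (`reward`, `train_reward_model`, `best_of_n`, `policy_gradient`, the feature values) is hypothetical.

```python
import math

# Toy, hypothetical setup: a "response" is just a feature vector
# [helpfulness, toxicity]. Annotators supply pairwise preferences
# (chosen, rejected); these are the costly labels in either pipeline.
PAIRS = [
    ([1.0, 0.0], [0.0, 1.0]),
    ([0.8, 0.1], [0.2, 0.0]),
    ([0.5, 0.0], [0.5, 0.9]),
]
CANDIDATES = [
    ("helpful answer", [0.9, 0.1]),
    ("toxic answer", [0.7, 0.9]),
    ("vague answer", [0.2, 0.0]),
]


def reward(w, x):
    """Linear reward model r(x) = w . x (a simplifying assumption)."""
    return sum(wi * xi for wi, xi in zip(w, x))


def train_reward_model(pairs, dim, lr=0.1, epochs=200):
    """Fit w by SGD on the Bradley-Terry loss -log sigmoid(r(chosen) - r(rejected))."""
    w = [0.0] * dim
    for _ in range(epochs):
        for chosen, rejected in pairs:
            margin = reward(w, chosen) - reward(w, rejected)
            # Loss gradient w.r.t. w is -(1 - sigmoid(margin)) * (chosen - rejected),
            # so a descent step adds scale * (chosen - rejected).
            scale = 1.0 - 1.0 / (1.0 + math.exp(-margin))
            for i in range(dim):
                w[i] += lr * scale * (chosen[i] - rejected[i])
    return w


def best_of_n(w, candidates):
    """SL / weak supervision: keep the top-scoring candidate as a training target."""
    return max(candidates, key=lambda c: reward(w, c[1]))[0]


def policy_gradient(w, candidates, lr=0.5, steps=300):
    """RL: ascend the expected reward of a softmax policy over the candidates.

    Uses the exact gradient p_j * (r_j - E[r]) instead of sampled
    REINFORCE/PPO updates, so the toy result is deterministic.
    """
    rewards = [reward(w, x) for _, x in candidates]
    logits = [0.0] * len(candidates)
    for _ in range(steps):
        z = max(logits)  # subtract the max for a numerically stable softmax
        exps = [math.exp(logit - z) for logit in logits]
        total = sum(exps)
        probs = [e / total for e in exps]
        expected = sum(p * r for p, r in zip(probs, rewards))
        logits = [logit + lr * p * (r - expected)
                  for logit, p, r in zip(logits, probs, rewards)]
    return candidates[max(range(len(logits)), key=lambda j: logits[j])][0]


w = train_reward_model(PAIRS, dim=2)
print(best_of_n(w, CANDIDATES))        # target for supervised fine-tuning
print(policy_gradient(w, CANDIDATES))  # response the RL-tuned policy prefers
```

Both routes pick the same response here. The expensive part, the preference pairs, is identical either way; only what you do with the trained reward model differs.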