Discussion about this post

Chris Chang:

A major component of RL training is that the model is exposed to its own generations during training. This allows it to encounter behavior that it hasn't directly observed in the training data. The learned reward model then reinforces certain behaviors discovered through exploration, which can improve generalization and reduce overfitting to the training distribution.
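To make that concrete, here is a minimal sketch of one on-policy update (assuming PyTorch; the tiny policy, the toy stand-in for a learned reward model, and all hyperparameters are illustrative, not from the post). The point is that the prefixes the model is trained on come from its own sampling, not from a fixed dataset:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
VOCAB, HIDDEN, SEQ_LEN, BATCH = 16, 32, 8, 4

class TinyPolicy(nn.Module):
    """Toy autoregressive LM: embedding -> GRU cell -> next-token logits."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, HIDDEN)
        self.rnn = nn.GRUCell(HIDDEN, HIDDEN)
        self.head = nn.Linear(HIDDEN, VOCAB)

    def step(self, token, h):
        h = self.rnn(self.embed(token), h)
        return self.head(h), h

policy = TinyPolicy()
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

def toy_reward(seq):
    # Stand-in for a learned reward model: rewards sequences containing token 3.
    return (seq == 3).float().mean(dim=1)

# One on-policy update: the model conditions on its own samples,
# visiting states that never appear in any supervised dataset.
h = torch.zeros(BATCH, HIDDEN)
token = torch.zeros(BATCH, dtype=torch.long)       # BOS = token 0
log_probs, tokens = [], []
for _ in range(SEQ_LEN):
    logits, h = policy.step(token, h)
    dist = torch.distributions.Categorical(logits=logits)
    token = dist.sample()                           # explore: sample, don't argmax
    log_probs.append(dist.log_prob(token))
    tokens.append(token)

seq = torch.stack(tokens, dim=1)                    # (BATCH, SEQ_LEN)
reward = toy_reward(seq)
baseline = reward.mean()                            # simple variance-reduction baseline
# REINFORCE: raise the log-prob of sampled sequences in proportion to their advantage.
loss = -((reward - baseline).detach() * torch.stack(log_probs, dim=1).sum(dim=1)).mean()
opt.zero_grad()
loss.backward()
opt.step()
```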

Exposure bias in SFT stems from the fact that the model only ever conditions on prefixes drawn from a limited distribution (the training dataset), so when it later has to condition on its own generations, it can drift into states it never saw and fail to generalize. RL mitigates this bias because the agent is exposed to a wider variety of states during training through its exploration, which can help the model generalize better to unseen scenarios.
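Here is a minimal sketch of that gap (again assuming PyTorch; the toy model and data are illustrative): during SFT the model always conditions on ground-truth prefixes (teacher forcing), while at inference it must condition on its own predictions, a prefix distribution it was never trained on, where early mistakes can compound:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
VOCAB, HIDDEN, SEQ_LEN = 16, 32, 8
embed = nn.Embedding(VOCAB, HIDDEN)
rnn = nn.GRUCell(HIDDEN, HIDDEN)
head = nn.Linear(HIDDEN, VOCAB)

gold = torch.randint(1, VOCAB, (1, SEQ_LEN))        # one toy "ground-truth" sequence

def forward_step(token, h):
    h = rnn(embed(token), h)
    return head(h), h

# --- SFT (teacher forcing): every prefix the model sees comes from the data.
h = torch.zeros(1, HIDDEN)
token = torch.zeros(1, dtype=torch.long)            # BOS
sft_loss = 0.0
for t in range(SEQ_LEN):
    logits, h = forward_step(token, h)
    sft_loss = sft_loss + nn.functional.cross_entropy(logits, gold[:, t])
    token = gold[:, t]                               # next input is the *gold* token

# --- Inference (free-running): the model conditions on its own outputs,
# a prefix distribution it was never trained on.
h = torch.zeros(1, HIDDEN)
token = torch.zeros(1, dtype=torch.long)
generated = []
for t in range(SEQ_LEN):
    logits, h = forward_step(token, h)
    token = logits.argmax(dim=-1)                    # next input is the model's own guess
    generated.append(token)

print("teacher-forced loss:", float(sft_loss))
print("free-run sample:", torch.stack(generated, dim=1).tolist())
```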

Chris Chang:

Another point is that a significant portion of the gains from RL might come from mitigating exposure bias, which is a well-acknowledged challenge in training models with SFT.
