Should you even do GRPO?
My goal with this video is to describe how supervised fine-tuning with rejection sampling is very similar to GRPO. Both involve reinforcement learning - i.e. eliciting correct answers via sampling, and using those correct answers to improve performance. The difference - and there is a clear theoretical one, although a less clear practical one - is in the objective function:
SFT involves minimising cross entropy loss.
GRPO - loosely - involves taking a group of answers and moving the model's probabilities towards the better answers in that group and away from the worse ones. This is a more complex objective that - theoretically - can have some benefits, but that in practice can be hard to beat SFT with (a rough sketch of both objectives follows below).
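To make that contrast concrete, here is a rough sketch of my own (not code from the video), assuming per-token log-probabilities for each sampled answer have already been computed; the GRPO version keeps only the group-relative advantage weighting and leaves out the clipping and KL terms described further down:

```python
# Sketch only: contrasts the shape of the two objectives on pre-computed
# per-token log-probabilities. Tensor names and shapes are illustrative.
import torch

def sft_loss(logprobs_correct: list[torch.Tensor]) -> torch.Tensor:
    """Cross-entropy on the kept (correct) answers:
    simply maximise their average token log-likelihood."""
    return -torch.stack([lp.mean() for lp in logprobs_correct]).mean()

def grpo_style_loss(logprobs_group: list[torch.Tensor], rewards: list[float]) -> torch.Tensor:
    """Group-relative weighting: scale each answer's log-likelihood by its
    advantage (reward minus the group mean, divided by the group std), so the
    model moves towards above-average answers and away from below-average ones."""
    r = torch.tensor(rewards)
    adv = (r - r.mean()) / (r.std() + 1e-4)    # group-relative advantage
    per_answer = torch.stack([lp.mean() for lp in logprobs_group])
    return -(adv * per_answer).mean()          # clipping and KL terms omitted here
```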
The more important point here - whether for SFT with rejection sampling or for GRPO - is that reinforcement learning works if you have a way to sample new answers, some of which are correct, and then make the model behave more like those correct answers.
In short though, any form of reinforcement learning is most useful if you are at the frontier for a new application. If there exists a big reasoning model, like R1, that can provide decent answers (even if only via sampling), then it will be better to generate and keep the best answers from that strong model and do SFT on your smaller model. Trying to do GRPO (or SFT with rejection sampling) on a model that does not have ingrained knowledge for your application is likely to perform worse.
Let me know your comments or questions. It’s a difficult topic to communicate clearly, and there is still much for me to learn.
Cheers, Ronan
Comparing GRPO vs SFT for Mathematical Reasoning on Llama 3.2 1B
The video compares three approaches for improving mathematical reasoning capabilities in small language models: supervised fine-tuning (SFT), odds ratio preference optimization (ORPO), and group relative policy optimization (GRPO).
Study Setup
- Base model: Llama 3.2 1B (1 billion parameters)
- Dataset: GSM8K (grade school math problems)
- Training set: ~7,000 samples
- Test set: ~1,000 samples
Evaluation metrics (see the sketch after this list):
- Pass@8: Percentage of problems solved with at least 1 correct answer out of 8 attempts
- Majority@8: Percentage of problems with 5+ correct answers out of 8 attempts
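A small sketch of how these two metrics can be computed, assuming you have a dictionary of per-question correctness flags for the 8 sampled attempts (the data structure is illustrative):

```python
# Sketch: compute Pass@8 and Majority@8 from per-question correctness flags.
# `results` maps each question to a list of 8 booleans (one per sampled attempt).

def pass_at_8(results: dict[str, list[bool]]) -> float:
    """Fraction of questions with at least one correct answer out of 8."""
    return sum(any(attempts) for attempts in results.values()) / len(results)

def majority_at_8(results: dict[str, list[bool]]) -> float:
    """Fraction of questions where a majority (5+) of the 8 answers are correct."""
    return sum(sum(attempts) >= 5 for attempts in results.values()) / len(results)

example = {"q1": [True] * 6 + [False] * 2,   # counts for both metrics
           "q2": [True] + [False] * 7}       # counts for Pass@8 only
print(pass_at_8(example), majority_at_8(example))   # -> 1.0 0.5
```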
Baseline Performance
The base model achieved:
- Pass@8: 79%
- Majority@8: 35%
Methods Compared
Supervised Fine-Tuning (SFT)
1. Generate 8 answers per training question (~56,000 total)
2. Filter for correct answers (~14,000)
3. Train for one epoch at a constant learning rate
4. Can be repeated iteratively with the improved model (see the sketch after this list)
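A hedged sketch of this pipeline; `generate_answers` and `is_correct` are hypothetical helpers standing in for whatever sampling backend and GSM8K answer check you use:

```python
# Sketch of rejection sampling for SFT data: sample 8 answers per question,
# keep only the correct ones, and format them as supervised training pairs.
# `generate_answers` and `is_correct` are hypothetical helpers, not a real API.

def build_sft_dataset(questions, references, generate_answers, is_correct,
                      n_samples: int = 8):
    sft_examples = []
    for question, reference in zip(questions, references):
        answers = generate_answers(question, n=n_samples)        # ~56k draws in total
        kept = [a for a in answers if is_correct(a, reference)]  # ~14k survive the filter
        sft_examples.extend({"prompt": question, "completion": a} for a in kept)
    return sft_examples  # train one epoch of standard SFT on this, then optionally repeat
```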
ORPO
1. Generate answers as with SFT
2. Create pairs of correct/incorrect answers
3. Train the model to increase the probability of correct answers while decreasing that of incorrect ones
4. Uses explicit positive/negative rewards (see the pairing sketch after this list)
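A sketch of the pairing step, reusing the same hypothetical `is_correct` check as in the SFT pipeline; questions where all samples are correct (or all incorrect) yield no pair:

```python
# Sketch: build chosen/rejected pairs for ORPO from sampled answers.
# Questions with only correct or only incorrect samples cannot form a pair.

def build_orpo_pairs(question, answers, reference, is_correct):
    correct = [a for a in answers if is_correct(a, reference)]
    incorrect = [a for a in answers if not is_correct(a, reference)]
    if not correct or not incorrect:
        return []  # no contrastive signal available for this question
    # Pair each correct answer with one incorrect answer (truncated to the shorter list).
    return [{"prompt": question, "chosen": c, "rejected": r}
            for c, r in zip(correct, incorrect)]
```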
GRPO
1. Generate 8 answers per question
2. Assign rewards based on:
- Correctness (0 or 2 points)
- Format adherence (four components worth 0.5 points each)
3. Uses a KL divergence term to stabilize training
4. Clips model updates, likewise to avoid divergence (see the reward and objective sketch after this list)
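A sketch combining the reward scheme and the update described above. The four format checks (reasoning/answer tags) are placeholders for whatever components the actual reward uses, and `clip_eps` / `kl_coef` are illustrative values, not the ones from the video:

```python
# Sketch of a GRPO-style reward and a clipped, KL-regularised per-answer loss.
# The format checks below are placeholder tags, not the video's exact rules.
import torch

def reward(answer: str, reference: str, is_correct) -> float:
    r = 2.0 if is_correct(answer, reference) else 0.0   # correctness: 0 or 2 points
    format_checks = [                                    # four x 0.5-point components (placeholders)
        answer.strip().startswith("<reasoning>"),
        "</reasoning>" in answer,
        "<answer>" in answer,
        answer.strip().endswith("</answer>"),
    ]
    return r + 0.5 * sum(format_checks)

def grpo_update_loss(logprob_new, logprob_old, logprob_ref, advantage,
                     clip_eps: float = 0.2, kl_coef: float = 0.04) -> torch.Tensor:
    """Clipped surrogate for one answer's token log-probs, plus a KL penalty
    towards the reference model to keep the policy from diverging."""
    ratio = torch.exp(logprob_new - logprob_old)              # policy ratio per token
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)  # keep updates small
    surrogate = torch.minimum(ratio * advantage, clipped * advantage)
    # Per-token KL estimator used in GRPO-style training (always non-negative).
    kl = torch.exp(logprob_ref - logprob_new) - (logprob_ref - logprob_new) - 1.0
    return -(surrogate - kl_coef * kl).mean()
```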
Results
After training:
SFT:
- Pass@8: 81% (+2 points)
- Majority@8: 40% (+5 points)
ORPO:
- Pass@8: 79% (no change)
- Majority@8: 37% (+2 points)
GRPO:
- Pass@8: 75% (-4 points)
- Majority@8: 30% (-5 points)
Key Findings
1. SFT provided the most robust improvements
2. GRPO performance degraded despite optimizing format rewards
3. Format optimization may have come at the expense of mathematical accuracy
4. KL divergence term may have restricted model improvement
5. Sample efficiency was lower for GRPO because poor examples are not dropped
Practical Recommendations
1. Use SFT with rejection sampling when a stronger model is available for generating training data
2. Ensure adequate pre-training on domain-specific data before attempting reinforcement learning
3. Consider computational efficiency - SFT allows use of faster inference engines
4. Monitor format vs correctness optimization to avoid reward hacking
5. Consider implementing sample filtering in GRPO to improve efficiency (see the sketch below)
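One possible form of that filtering (a hypothetical sketch, not something from the video): drop groups whose answers all receive the same reward, since identical rewards give zero group-relative advantage and hence no gradient signal:

```python
# Sketch: drop GRPO groups with no reward spread, since identical rewards give
# zero group-relative advantage and therefore no learning signal.

def filter_groups(groups):
    """`groups` is a list of dicts like {"prompt": ..., "answers": [...], "rewards": [...]}."""
    kept = []
    for group in groups:
        rewards = group["rewards"]
        if max(rewards) > min(rewards):   # some spread -> non-zero advantages
            kept.append(group)
    return kept
```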
A major component of RL training is that the model is exposed to its own generations during training. This lets the model encounter behaviour that it hasn't directly observed in the training data. The reward signal then reinforces certain behaviours found during this exploration, which can improve generalization and reduce overfitting to the training distribution.
The exposure bias in SFT stems from the fact that the model is only trained on data from a limited distribution (the training dataset); at inference time it must condition on its own generations, which can drift away from that distribution, and it can fail to generalize. RL mitigates this bias because the model is exposed to a wider variety of states during training through its own exploration, which can help it generalize better to unseen scenarios.
Another point is that a significant portion of the gains from RL might come from mitigating this exposure bias, which is a well-acknowledged challenge when training models with SFT.