Understanding GRPO
1. I compare GRPO with SFT and ORPO; GRPO is quite similar to ORPO.
2. I trace the history: GRPO comes from PPO, which comes from TRPO, which was designed to tackle the problem of RL training diverging.
3. The "magic" of GRPO is the sampling and verification (via rewards). This "magic" isn't unique to GRPO - you can do it with other methods too (SFT, ORPO, DPO).
Incidentally, GRPO can be a somewhat inefficient way to train because it back-propagates through a lot of bad responses.
Still have questions? Just click reply OR let me know in the comments.
Cheers, Ronan
Trelis Research
Understanding Group Relative Policy Optimization
Core Mechanism
GRPO processes samples in groups, for example 8 generated answers per group. For each group:
1. Model generates multiple completions for a prompt
2. Each completion receives a reward score based on:
- Accuracy (numerical comparison with ground truth)
- Format (regex validation of required elements like "think" tags)
3. The model updates based on each completion's reward delta versus the group baseline
4. Baseline = average reward of the samples in the group (see the sketch below)
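Here's a minimal sketch of that baseline calculation, assuming a group of 8 completions and made-up reward values. (The original GRPO formulation also divides each delta by the group's reward standard deviation; I omit that here for simplicity.)

```python
# Illustrative rewards for 8 completions of the same prompt (accuracy + format per completion)
rewards = [2.0, 0.0, 1.0, 2.0, 0.0, 1.0, 0.0, 2.0]

baseline = sum(rewards) / len(rewards)        # group baseline = mean reward of the group
advantages = [r - baseline for r in rewards]  # each completion's delta vs. the baseline

print(baseline)    # 1.0
print(advantages)  # positive for above-average completions, negative for the rest
```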
Comparison to Previous Methods
Supervised Fine-Tuning (SFT)
- Uses cross-entropy loss
- Updates probabilities based on ground truth vs. predictions
- Indirectly reduces probability of incorrect answers
Example: For "Is Paris the capital of France?"
- Forward pass: Yes (40%), Maybe (20%), No (40%)
- After update: Yes (50%), Maybe (15%), No (35%) (a toy version of this step is sketched below)
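As a rough picture of that step, here is a toy cross-entropy update over three candidate answers. The logits, learning rate, and resulting probabilities are illustrative, not from a real model.

```python
import torch
import torch.nn.functional as F

# Toy logits chosen so softmax gives roughly Yes 40%, Maybe 20%, No 40%
logits = torch.log(torch.tensor([0.40, 0.20, 0.40])).requires_grad_(True)
target = torch.tensor([0])  # index 0 = "Yes", the ground-truth answer

loss = F.cross_entropy(logits.unsqueeze(0), target)  # the loss only references the correct answer
loss.backward()

with torch.no_grad():
    updated = logits - 0.3 * logits.grad  # one small gradient step
print(F.softmax(updated, dim=-1))  # ~[0.47, 0.18, 0.35]: "Yes" rises, the others fall indirectly
```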
Odds Ratio Preference Optimization (ORPO)
- Combines cross-entropy loss with odds ratio term
- Explicitly increases chosen answer probability
- Explicitly decreases rejected answer probability
- Works with pairs of responses (chosen vs. rejected); a sketch of the odds-ratio term follows this list
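For concreteness, here is a hand-rolled sketch of that odds-ratio term, not the reference ORPO implementation. The log-probability values and the weight `lam` are assumptions.

```python
import torch
import torch.nn.functional as F

def odds_ratio_loss(logp_chosen: torch.Tensor, logp_rejected: torch.Tensor) -> torch.Tensor:
    # odds(y) = P(y) / (1 - P(y)), computed in log space for numerical stability
    log_odds_chosen = logp_chosen - torch.log1p(-torch.exp(logp_chosen))
    log_odds_rejected = logp_rejected - torch.log1p(-torch.exp(logp_rejected))
    # Penalizes the model when the chosen response's odds do not beat the rejected one's
    return -F.logsigmoid(log_odds_chosen - log_odds_rejected)

logp_chosen = torch.tensor(-0.5)   # illustrative length-averaged log-probabilities
logp_rejected = torch.tensor(-1.2)
lam = 0.1                          # assumed weight on the odds-ratio term

sft_loss = -logp_chosen            # cross-entropy part, on the chosen response
total_loss = sft_loss + lam * odds_ratio_loss(logp_chosen, logp_rejected)
print(total_loss)                  # cross-entropy plus the explicit chosen-vs-rejected push
```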
Technical Advantages of GRPO Over PPO
GRPO simplifies Proximal Policy Optimization (PPO) by:
1. Eliminating separate value model
2. Using group sampling to estimate baseline
3. Reducing computational overhead
4. Requiring less VRAM (see the contrast sketch below)
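A rough contrast of points 1 and 2, leaving out clipping, KL terms, and advantage-estimation details; the model width and function names are made up for illustration.

```python
import torch
import torch.nn as nn

hidden_size = 4096  # assumed model width, for illustration only

# PPO: a separate value model (or value head) is trained to predict the baseline,
# which means extra parameters sitting in VRAM and an extra set of updates.
value_model = nn.Linear(hidden_size, 1)

def ppo_advantage(reward: float, prompt_hidden: torch.Tensor) -> torch.Tensor:
    return reward - value_model(prompt_hidden).squeeze(-1)

# GRPO: the baseline is just the mean reward of the sampled group; no value model at all.
def grpo_advantages(rewards: list[float]) -> list[float]:
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]

print(ppo_advantage(1.0, torch.randn(hidden_size)))
print(grpo_advantages([2.0, 0.0, 1.0, 1.0]))
```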
Key Implementation Details
Reward Structure:
- Binary components: accuracy + format
- Discrete rather than continuous rewards
- No neural reward model required
- Direct comparison to ground truth (sketched below)
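Here's a hedged sketch of those two components; the function names, the number-extraction regex, and the exact <think> tag format are assumptions, not a spec.

```python
import re

def accuracy_reward(completion: str, ground_truth: str) -> float:
    """1.0 if the last number in the completion matches the ground truth, else 0.0."""
    numbers = re.findall(r"-?\d+\.?\d*", completion)
    return 1.0 if numbers and float(numbers[-1]) == float(ground_truth) else 0.0

def format_reward(completion: str) -> float:
    """1.0 if the completion contains the required <think>...</think> structure."""
    return 1.0 if re.search(r"<think>.*?</think>", completion, re.DOTALL) else 0.0

completion = "<think>7 * 6 = 42</think> The answer is 42"
print(accuracy_reward(completion, "42") + format_reward(completion))  # 2.0: discrete, no reward model
```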
Group Processing:
- Typically 8 samples per group
- Baseline calculated from group average
- Updates based on delta from baseline (see the update sketch after this list)
- Requires sufficient VRAM for group inference
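To connect the delta to an actual parameter update, here is a simplified REINFORCE-style objective over one small group; real GRPO additionally clips probability ratios and adds a KL penalty against a reference model, and the numbers here are illustrative.

```python
import torch

# Summed log-probabilities of 4 sampled completions under the current policy (made-up values)
logps = torch.tensor([-12.0, -15.0, -11.0, -14.0], requires_grad=True)
rewards = torch.tensor([1.0, 0.0, 2.0, 0.0])  # e.g. accuracy + format per completion

advantages = rewards - rewards.mean()  # delta from the group baseline
loss = -(advantages * logps).mean()    # weight each completion's log-prob by its advantage
loss.backward()                        # one backward pass spans the whole group, good and bad samples alike
print(logps.grad)                      # negative where advantage > 0, so those log-probs rise under gradient descent
```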
Potential Limitations
1. All samples in a group could be low quality
2. Format rewards may plateau early
3. No guarantee of improvement if all samples score similarly (the deltas from the baseline are then near zero, so the update signal vanishes)
4. Group size affects VRAM requirements and update granularity
Historical Context
GRPO evolved from:
1. Trust Region Policy Optimization (TRPO)
2. Proximal Policy Optimization (PPO)
3. Direct Preference Optimization (DPO), although GRPO is better seen as parallel to DPO and ORPO than as a descendant of them.
Each iteration simplified the approach while maintaining performance benefits.