Understanding GRPO
1. I compare GRPO with SFT and ORPO; GRPO is quite similar to ORPO.
2. I trace the history: GRPO comes from PPO, which comes from TRPO, which was designed to tackle the problem of RL training diverging.
3. The "magic" of GRPO is the sampling and verification (via rewards). This "magic" isn't unique to GRPO - you can do it with other methods too (SFT, ORPO, DPO).
Incidentally, GRPO can be a somewhat inefficient way to train because it back-propagates through a lot of bad responses.
Still have questions? Just click reply OR let me know in the comments.
Cheers, Ronan
Trelis Research
Understanding Group Relative Policy Optimization
Core Mechanism
GRPO processes samples in groups, for example 8 generated answers per group. For each group:
1. Model generates multiple completions for a prompt
2. Each completion receives a reward score based on:
- Accuracy (numerical comparison with ground truth)
- Format (regex validation of required elements like "think" tags)
3. The model updates based on each completion's reward delta versus the group baseline
4. Baseline = average reward of the samples in the group (see the sketch below)
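Here's a minimal sketch of that baseline calculation, assuming a group of 8 completions and made-up reward values. (The original GRPO formulation also divides each delta by the group's reward standard deviation; I omit that here for simplicity.)

```python
# Illustrative rewards for 8 completions of the same prompt (accuracy + format per completion)
rewards = [2.0, 0.0, 1.0, 2.0, 0.0, 1.0, 0.0, 2.0]

baseline = sum(rewards) / len(rewards)        # group baseline = mean reward of the group
advantages = [r - baseline for r in rewards]  # each completion's delta vs. the baseline

print(baseline)    # 1.0
print(advantages)  # positive for above-average completions, negative for the rest
```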
Comparison to Previous Methods
Supervised Fine-Tuning (SFT)
- Uses cross-entropy loss
- Updates probabilities based on ground truth vs. predictions
- Indirectly reduces probability of incorrect answers
Example: For "Is Paris the capital of France?"
- Forward pass: Yes (40%), Maybe (20%), No (40%)
- After update: Yes (50%), Maybe (15%), No (35%) (a toy version of this step is sketched below)
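As a rough picture of that step, here is a toy cross-entropy update over three candidate answers. The logits, learning rate, and resulting probabilities are illustrative, not from a real model.

```python
import torch
import torch.nn.functional as F

# Toy logits chosen so softmax gives roughly Yes 40%, Maybe 20%, No 40%
logits = torch.log(torch.tensor([0.40, 0.20, 0.40])).requires_grad_(True)
target = torch.tensor([0])  # index 0 = "Yes", the ground-truth answer

loss = F.cross_entropy(logits.unsqueeze(0), target)  # the loss only references the correct answer
loss.backward()

with torch.no_grad():
    updated = logits - 0.3 * logits.grad  # one small gradient step
print(F.softmax(updated, dim=-1))  # ~[0.47, 0.18, 0.35]: "Yes" rises, the others fall indirectly
```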
Odds Ratio Preference Optimization (ORPO)
- Combines cross-entropy loss with odds ratio term
- Explicitly increases chosen answer probability
- Explicitly decreases rejected answer probability
- Works with pairs of responses (chosen vs. rejected); a sketch of the odds-ratio term follows this list
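For concreteness, here is a hand-rolled sketch of that odds-ratio term, not the reference ORPO implementation. The log-probability values and the weight `lam` are assumptions.

```python
import torch
import torch.nn.functional as F

def odds_ratio_loss(logp_chosen: torch.Tensor, logp_rejected: torch.Tensor) -> torch.Tensor:
    # odds(y) = P(y) / (1 - P(y)), computed in log space for numerical stability
    log_odds_chosen = logp_chosen - torch.log1p(-torch.exp(logp_chosen))
    log_odds_rejected = logp_rejected - torch.log1p(-torch.exp(logp_rejected))
    # Penalizes the model when the chosen response's odds do not beat the rejected one's
    return -F.logsigmoid(log_odds_chosen - log_odds_rejected)

logp_chosen = torch.tensor(-0.5)   # illustrative length-averaged log-probabilities
logp_rejected = torch.tensor(-1.2)
lam = 0.1                          # assumed weight on the odds-ratio term

sft_loss = -logp_chosen            # cross-entropy part, on the chosen response
total_loss = sft_loss + lam * odds_ratio_loss(logp_chosen, logp_rejected)
print(total_loss)                  # cross-entropy plus the explicit chosen-vs-rejected push
```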
Technical Advantages of GRPO Over PPO
GRPO simplifies Proximal Policy Optimization (PPO) by:
1. Eliminating separate value model
2. Using group sampling to estimate baseline
3. Reducing computational overhead
4. Requiring less VRAM (see the contrast sketch below)
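A rough contrast of points 1 and 2, leaving out clipping, KL terms, and advantage-estimation details; the model width and function names are made up for illustration.

```python
import torch
import torch.nn as nn

hidden_size = 4096  # assumed model width, for illustration only

# PPO: a separate value model (or value head) is trained to predict the baseline,
# which means extra parameters sitting in VRAM and an extra set of updates.
value_model = nn.Linear(hidden_size, 1)

def ppo_advantage(reward: float, prompt_hidden: torch.Tensor) -> torch.Tensor:
    return reward - value_model(prompt_hidden).squeeze(-1)

# GRPO: the baseline is just the mean reward of the sampled group; no value model at all.
def grpo_advantages(rewards: list[float]) -> list[float]:
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]

print(ppo_advantage(1.0, torch.randn(hidden_size)))
print(grpo_advantages([2.0, 0.0, 1.0, 1.0]))
```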
Key Implementation Details
Reward Structure:
- Binary components: accuracy + format
- Discrete rather than continuous rewards
- No neural reward model required
- Direct comparison to ground truth (sketched below)
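Here's a hedged sketch of those two components; the function names, the number-extraction regex, and the exact <think> tag format are assumptions, not a spec.

```python
import re

def accuracy_reward(completion: str, ground_truth: str) -> float:
    """1.0 if the last number in the completion matches the ground truth, else 0.0."""
    numbers = re.findall(r"-?\d+\.?\d*", completion)
    return 1.0 if numbers and float(numbers[-1]) == float(ground_truth) else 0.0

def format_reward(completion: str) -> float:
    """1.0 if the completion contains the required <think>...</think> structure."""
    return 1.0 if re.search(r"<think>.*?</think>", completion, re.DOTALL) else 0.0

completion = "<think>7 * 6 = 42</think> The answer is 42"
print(accuracy_reward(completion, "42") + format_reward(completion))  # 2.0: discrete, no reward model
```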
Group Processing:
- Typically 8 samples per group
- Baseline calculated from group average
- Updates based on delta from baseline (see the update sketch after this list)
- Requires sufficient VRAM for group inference
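To connect the delta to an actual parameter update, here is a simplified REINFORCE-style objective over one small group; real GRPO additionally clips probability ratios and adds a KL penalty against a reference model, and the numbers here are illustrative.

```python
import torch

# Summed log-probabilities of 4 sampled completions under the current policy (made-up values)
logps = torch.tensor([-12.0, -15.0, -11.0, -14.0], requires_grad=True)
rewards = torch.tensor([1.0, 0.0, 2.0, 0.0])  # e.g. accuracy + format per completion

advantages = rewards - rewards.mean()  # delta from the group baseline
loss = -(advantages * logps).mean()    # weight each completion's log-prob by its advantage
loss.backward()                        # one backward pass spans the whole group, good and bad samples alike
print(logps.grad)                      # negative where advantage > 0, so those log-probs rise under gradient descent
```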
Potential Limitations
1. All samples in a group could be low quality
2. Format rewards may plateau early
3. No guarantee of improvement if all samples score similarly (the deltas from the baseline are then near zero, so the update signal vanishes)
4. Group size affects VRAM requirements and update granularity
Historical Context
GRPO evolved from:
1. Trust Region Policy Optimization (TRPO)
2. Proximal Policy Optimization (PPO)
3. Direct Preference Optimization (DPO), although GRPO is better seen as parallel to DPO and ORPO than as a descendant of them.
Each iteration simplified the approach while maintaining performance benefits.