Reinforcement Learning for LLMs in 2025
SFT, ORPO, GRPO and Verifiable rewards
Reinforcement Learning is big as of late 2024 and early 2025 - with big progress made by o1/o3 and DeepSeek R1, most notably on hard reasoning challenges like ARC.
One of the key questions is - how to elicit improved reasoning from models?
Is reasoning innately in pre-training datasets and just needs the right examples to be brought out?
Why does GRPO make sense, as opposed to Supervised Fine-tuning with the right examples?
My general sense is that GRPO (or PPO or ORPO) may not offer all that much benefit over SFT. In fact, they are generally more complex. What really matters is how the fine-tuning data is created.
This is the first video in a series on Reinforcement Learning. Maybe you’re looking to dig straight into GRPO - but I think that’s the wrong way to look at things. A better, ground-up approach is to start with careful performance measurement (there are gotchas even in how one marks answers correct or not), then think carefully about data preparation, then do Supervised Fine-tuning, and only then look at preference and reward methods.
Definitely leave comments if a) you see things that can be improved or I’ve made mistakes on, or b) you have a specific reasoning dataset in mind that would be useful to see a demo on in the future.
Cheers, Ronan
🛠 Explore Fine-tuning, Inference, Vision, Audio, and Evaluation Tools
💡 Consulting (Technical Assistance OR Market Insights)
Weekly Poll (NEW)
Video Summary: Improving Maths Performance of Llama 3.2 1B: A Study of SFT and ORPO Training
The video above works on improving mathematical reasoning in Llama 3.2 1B using the GSM8K dataset, comparing supervised fine-tuning (SFT) and odds ratio preference optimization (ORPO) approaches.
Baseline Performance
- Initial model: Llama-1B with instruction tuning
- Test dataset: 100 questions from GSM8K
Metrics:
- Pass@8 (at least 1 correct out of 8 samples): 79%
- Majority@8 (5+ correct out of 8 samples): 28%
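These two metrics are easy to compute from per-question correctness flags. A minimal sketch (the 5-of-8 threshold follows the Majority@8 definition above):

```python
def pass_at_k(samples: list[list[bool]]) -> float:
    """Fraction of questions with at least one correct sample."""
    return sum(any(s) for s in samples) / len(samples)

def majority_at_k(samples: list[list[bool]], threshold: int = 5) -> float:
    """Fraction of questions where >= threshold of the k samples are correct."""
    return sum(sum(s) >= threshold for s in samples) / len(samples)

# Two questions, 8 samples each: the first clears both metrics,
# the second clears Pass@8 only.
results = [
    [True, True, True, True, True, False, False, False],   # 5/8 correct
    [True, False, False, False, False, False, False, False],  # 1/8 correct
]
print(pass_at_k(results))      # 1.0
print(majority_at_k(results))  # 0.5
```

Note how Pass@8 measures "reach" (can the model ever get it right) while Majority@8 measures consistency - the distinction the findings below turn on.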
Data Generation and Verification
- Generated training data using model sampling on 7,473 GSM8K training questions
- Each question sampled 8 times with temperature > 0
Answers verified through:
1. Exact match with ground truth
2. Gemini API backup verification (~1% of cases)
- Training data required proper formatting with "think" tags
Supervised Fine-Tuning Results
Training parameters:
- Learning rate: 1e-4
- Batch size: 32
- LoRA rank: 64
- One epoch training
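For reference, a roughly equivalent setup using TRL and PEFT. The video uses Unsloth, so this is a sketch under that substitution; the four hyperparameters above are from the video, everything else (model id, alpha, target modules, batch split) is illustrative:

```python
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

peft_config = LoraConfig(r=64, lora_alpha=64, target_modules="all-linear")

training_args = SFTConfig(
    learning_rate=1e-4,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,  # effective batch size of 32
    num_train_epochs=1,
    gradient_checkpointing=True,
)

trainer = SFTTrainer(
    model="meta-llama/Llama-3.2-1B-Instruct",  # assumed base model id
    train_dataset=dataset,  # the verified completions generated above
    args=training_args,
    peft_config=peft_config,
)
trainer.train()
```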
Performance after SFT:
- Pass@8: ~79% (no significant change)
- Majority@8: ~34% (6% improvement)
ORPO Training Results
- Used same training data but paired correct/incorrect answers
- Loss function: Cross-entropy + beta * odds ratio term
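The ORPO objective can be written out explicitly: with length-normalised sequence probability p, the odds are p / (1 - p), and the preference term is -log sigmoid(log-odds of the chosen answer minus log-odds of the rejected one). A sketch with scalar stand-ins (the beta value is illustrative):

```python
import math

def log_odds(logp: float) -> float:
    """log(p / (1 - p)) for a length-normalised sequence log-probability."""
    p = math.exp(logp)
    return math.log(p / (1 - p))

def orpo_loss(nll_chosen: float, logp_chosen: float,
              logp_rejected: float, beta: float = 0.1) -> float:
    """Cross-entropy on the chosen answer plus beta * odds-ratio term."""
    ratio = log_odds(logp_chosen) - log_odds(logp_rejected)
    odds_ratio_term = -math.log(1 / (1 + math.exp(-ratio)))  # -log(sigmoid)
    return nll_chosen + beta * odds_ratio_term

# Chosen answer already more likely than rejected -> small added penalty.
loss = orpo_loss(nll_chosen=0.5, logp_chosen=-0.5, logp_rejected=-2.0)
```

Because the odds-ratio term only nudges the model when chosen and rejected answers have similar likelihood, ORPO behaves much like SFT plus a mild preference penalty - consistent with the similar results below.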
Performance metrics:
- Pass@8: ~75%
- Majority@8: ~35.5%
Basically, it’s hard to outperform SFT (more on ORPO and then GRPO in the next video).
Key Findings
1. Both SFT and ORPO improved consistency (Majority@8) but not reach (Pass@8)
2. SFT achieved similar results with simpler implementation
3. Model size may be limiting factor for further improvements
4. Format enforcement through verification proved effective
5. Gradient norms are high in ORPO, perhaps suggesting the hyperparameters can be improved
Technical Implementation
- Used Unsloth for faster training compared to transformers
- Implemented LoRA adaptation for efficient fine-tuning
- Employed reentrant gradient checkpointing
- Used SGLang for efficient batch inference
- Set up robust answer verification pipeline with API fallback
The study demonstrates that while small language models can be made more consistent through fine-tuning, improving their absolute capability may require larger model architectures or more sophisticated training approaches. To be continued!!!