Synthetic Data Generation and Fine-tuning

Oct 09, 2024

I’ve previously made videos covering dataset generation - notably Q&A generation. I’ve picked up quite a few new tricks over the last half year, so it was time for a new video on Synthetic Data Creation…

To Generate Diverse Questions:

Use higher temperature (1-1.5) for variety
Employ top P of 0.9 to avoid nonsensical outputs / bad tokens
Generate multiple questions in the same LLM call for increased diversity

To Generate Accurate, Detailed Answers:

Use low temperature (0.1-0.25) for best quality
Request step-by-step reasoning in answers
Longer, well-reasoned responses lead to better trained models

Generating Augmented Answers:

Incorporate hints, feedback, or ground truth data along with the questions
Condition responses on background information (if present) for improved quality

🛠️ Raw Data Types:

Documents: Generate ~1 question per 200 characters (increase if needed)
Q&A Datasets: Expand terse answers into detailed explanations
Customer Chats: Upgrade weak model responses with stronger models

🔬 Case Study: Touch Rugby Rules

Base GPT-4o-mini: 8/22 correct | Fine-tuned GPT-4o-mini: 14-15/22 correct
Base Llama 3.1 8B: 3/22 correct | Fine-tuned Llama 3.1 8B: 10-11/22 correct
Base Llama 3.2 1B: 0/22 correct | Fine-tuned Llama 3.2 1B: 3/22 correct

💡 Bonus: Improving Math Performance

Raw data fine-tuning: Minimal improvement
Synthetic data from stronger model: Measurable boost
Augmented synthetic data: Further enhancement is possible if the dataset was larger

Cheers, Ronan

More resources at Trelis.com/About

Trelis Research

Discussion about this post