Synthetic Data Generation and Fine-tuning
I’ve previously made videos covering dataset generation - notably Q&A generation. I’ve picked up quite a few new tricks over the last half year, so it was time for a new video on Synthetic Data Creation…
To Generate Diverse Questions:
Use higher temperature (1-1.5) for variety
Employ top P of 0.9 to avoid nonsensical outputs / bad tokens
Generate multiple questions in the same LLM call for increased diversity
To Generate Accurate, Detailed Answers:
Use low temperature (0.1-0.25) for best quality
Request step-by-step reasoning in answers
Longer, well-reasoned responses lead to better trained models
Generating Augmented Answers:
Incorporate hints, feedback, or ground truth data along with the questions
Condition responses on background information (if present) for improved quality
🛠️ Raw Data Types:
Documents: Generate ~1 question per 200 characters (increase if needed)
Q&A Datasets: Expand terse answers into detailed explanations
Customer Chats: Upgrade weak model responses with stronger models
🔬 Case Study: Touch Rugby Rules
Base GPT-4o-mini: 8/22 correct | Fine-tuned GPT-4o-mini: 14-15/22 correct
Base Llama 3.1 8B: 3/22 correct | Fine-tuned Llama 3.1 8B: 10-11/22 correct
Base Llama 3.2 1B: 0/22 correct | Fine-tuned Llama 3.2 1B: 3/22 correct
💡 Bonus: Improving Math Performance
Raw data fine-tuning: Minimal improvement
Synthetic data from stronger model: Measurable boost
Augmented synthetic data: Further enhancement is possible if the dataset was larger
Cheers, Ronan
More resources at Trelis.com/About

