I go deep into synthetic data preparation techniques for fine-tuning LLMs, including:
Which library to use for converting documents to markdown?
How best to create chunks from document text?
How to create high-quality, comprehensive question-answer pairs?
How to visualise your synthetic dataset via embeddings or tags?
How to create a high-quality evaluation dataset?
Cheers, Ronan
P.S. The scripts are in the ADVANCED-fine-tuning repo:
Trelis Links:
🤝 Are you a talented developer? Work for Trelis
💡 Need Technical or Market Assistance? Book a Consult Here
💸 Starting a New Project/Venture? Apply for a Trelis Grant
Fine-Tuning Data Preparation: High-Quality Synthetic Datasets
Creating high-quality training data is essential for fine-tuning large language models. This article outlines a systematic approach to preparing synthetic question-answer datasets, covering document ingestion, chunking, question generation, and evaluation set creation.
Key Goals for Data Quality
Coverage: Questions must comprehensively cover document content across topics, formats, and difficulty levels
Contextualization: Questions need proper context to be unambiguous and answerable
Representative Evaluation: Eval datasets must reflect the distribution of topics in training data
Consistent Grading: Clear rubrics for evaluating answer correctness
Document Ingestion Methods
Three main approaches for converting PDFs to text:
Marker PDF: Most accurate; uses OCR and layout-detection models. ~19s to process 24 pages locally on a Mac M4, $3 per 1,000 pages
Microsoft's MarkItDown: Fast, CPU-only conversion; essentially free, but lower-quality output
Gemini Vision: ~20s processing time, $0.50 per 1,000 pages; handles text embedded in images
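As a concrete example, here is a minimal sketch of the MarkItDown route. The `MarkItDown` class and `convert` call follow the library's public README, though behaviour can vary by version, and the file path is a placeholder.

```python
# pip install markitdown
from markitdown import MarkItDown

# Convert a PDF (or docx, pptx, html, ...) to markdown text.
# CPU-only and fast, but lower quality than OCR/layout-model pipelines.
converter = MarkItDown()
result = converter.convert("paper.pdf")  # placeholder path

markdown_text = result.text_content
print(markdown_text[:500])  # preview the first 500 characters
```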
Chunking Strategy
Text is split into chunks using:
Sentence detection via regex or NLTK
Table extraction for structured data
Chunk size parameters:
Minimum length (configurable)
Maximum length (default 5000 tokens)
Tables preserved as single chunks
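To make the chunking parameters above concrete, here is a minimal sketch of greedy, sentence-based chunking with regex splitting. It measures length in characters rather than tokens and omits the separate table-extraction step, so treat it as an illustration rather than the repo's implementation.

```python
import re

def chunk_text(text: str, min_len: int = 200, max_len: int = 5000) -> list[str]:
    """Greedy sentence-based chunking: pack sentences until max_len,
    and avoid emitting a trailing chunk shorter than min_len."""
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())

    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_len:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()

    if current:
        # Merge a too-short final chunk into the previous one.
        if chunks and len(current) < min_len:
            chunks[-1] = f"{chunks[-1]} {current}"
        else:
            chunks.append(current)
    return chunks
```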
Question Generation Pipeline
The process involves:
Generate document summary for context
For each chunk:
Pass chunk + summary to LLM
Generate multiple QA pairs iteratively until coverage is complete
Include evaluation criteria and difficulty rating
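A sketch of the per-chunk generation step is below, using the OpenAI Python client. The model name, prompt wording, and JSON keys are illustrative assumptions rather than the repo's exact prompts, and real code would parse the response more defensively.

```python
import json
from openai import OpenAI  # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_qa_pairs(chunk: str, summary: str, model: str = "gpt-4o-mini") -> list[dict]:
    """Ask the LLM for QA pairs grounded in one chunk, with the document
    summary supplied as context so questions are unambiguous."""
    prompt = (
        "Document summary (for context):\n" + summary + "\n\n"
        "Text chunk:\n" + chunk + "\n\n"
        "Write question-answer pairs that fully cover the chunk. "
        "Return JSON: a list of objects with keys "
        "'question', 'answer', 'evaluation_criteria', 'difficulty' (1-5)."
    )
    response = client.chat.completions.create(
        model=model,  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    # Sketch only: a robust pipeline would validate and repair the JSON.
    return json.loads(response.choices[0].message.content)
```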
Visualization and Quality Assessment
Two methods for analyzing question distribution:
Embedding-based:
Plot questions in 2D using dimensionality reduction
Compare coverage across different models
Identify potential gaps in coverage
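The sketch below shows one way to do this: embed each model's questions with sentence-transformers and project them to 2D with PCA. The embedding model name and the choice of PCA (rather than, say, t-SNE or UMAP) are assumptions for illustration.

```python
# pip install sentence-transformers scikit-learn matplotlib
import matplotlib.pyplot as plt
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA

def plot_question_embeddings(question_sets: dict[str, list[str]]) -> None:
    """Embed each set of questions and scatter-plot them in 2D,
    one colour per set, to eyeball coverage and gaps."""
    model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model
    pca = PCA(n_components=2)

    all_questions = [q for qs in question_sets.values() for q in qs]
    coords = pca.fit_transform(model.encode(all_questions))

    start = 0
    for label, questions in question_sets.items():
        end = start + len(questions)
        plt.scatter(coords[start:end, 0], coords[start:end, 1], label=label, alpha=0.6)
        start = end
    plt.legend()
    plt.title("2D projection of question embeddings")
    plt.show()
```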
Tag-based:
Generate consistent tags across datasets
Create histograms showing topic distribution
Compare tag coverage between models
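A minimal version of the tag comparison might look like this. It assumes each QA pair already carries a `tags` list; generating consistent tags with an LLM is not shown.

```python
from collections import Counter
import matplotlib.pyplot as plt

def plot_tag_histogram(datasets: dict[str, list[dict]]) -> None:
    """Count tags per dataset and plot side-by-side bars so topic
    coverage can be compared across models."""
    counts = {name: Counter(tag for ex in data for tag in ex.get("tags", []))
              for name, data in datasets.items()}
    all_tags = sorted({tag for c in counts.values() for tag in c})

    width = 0.8 / max(len(counts), 1)
    for i, (name, counter) in enumerate(counts.items()):
        xs = [j + i * width for j in range(len(all_tags))]
        plt.bar(xs, [counter[t] for t in all_tags], width=width, label=name)
    plt.xticks(range(len(all_tags)), all_tags, rotation=45, ha="right")
    plt.legend()
    plt.tight_layout()
    plt.show()
```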
Evaluation Dataset Creation
Three approaches for creating evaluation sets:
Random Split: Simple, but may leave gaps in training coverage
Balanced Clone: Maintains the topic distribution, but primarily measures verbatim recall of training examples
Balanced Clone with Rephrasing: Best practice; maintains the distribution while testing generalization
The recommended approach uses:
20% split or 32 examples (whichever is lower)
Clustering via elbow method for balanced sampling
Optional mirrored set for measuring overfitting
Question rephrasing to test generalization
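The sketch below illustrates the balanced sampling step under some assumptions: questions are already embedded (e.g. as in the visualization section), k for KMeans is chosen with a simple inertia-based elbow heuristic, and the 20%-or-32-examples cap from the list above is applied. The rephrasing and mirrored-set steps are left out.

```python
import numpy as np
from sklearn.cluster import KMeans

def balanced_eval_indices(embeddings: np.ndarray, max_k: int = 10, seed: int = 0) -> list[int]:
    """Pick eval examples spread across clusters of the question embeddings."""
    n = len(embeddings)
    # 20% split or 32 examples, whichever is lower (at least 1).
    eval_size = max(1, min(int(0.2 * n), 32))

    # Simple elbow heuristic: pick k where the inertia curve bends most sharply.
    ks = list(range(1, min(max_k, n) + 1))
    inertias = [KMeans(n_clusters=k, random_state=seed, n_init=10).fit(embeddings).inertia_
                for k in ks]
    drops = np.diff(inertias)
    best_k = int(np.argmax(np.diff(drops))) + 2 if len(drops) > 1 else 1

    labels = KMeans(n_clusters=best_k, random_state=seed, n_init=10).fit_predict(embeddings)

    # Sample from each cluster in proportion to its size.
    rng = np.random.default_rng(seed)
    chosen = []
    for c in range(best_k):
        members = np.where(labels == c)[0]
        take = max(1, round(eval_size * len(members) / n))
        chosen.extend(rng.choice(members, size=min(take, len(members)), replace=False))
    return [int(i) for i in chosen[:eval_size]]
```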
This systematic approach to data preparation enables comprehensive coverage while maintaining high quality standards for fine-tuning. The resulting datasets can be evaluated objectively through embedding visualizations and tag distributions.