Wondering what LLM is best for your custom application?
The principled approach is to create an application-specific benchmark!
I explain how using:
- YourBench to create Q&As from your documents
- LightEval to evaluate the performance of different LLMs
- Trelis ADVANCED-evals for data inspection.
Cheers, Ronan
Trelis Links:
📋Trelis Evals (hosted solution) - Waitlist
🤝 Are you a talented developer? Work for Trelis
💡 Need Technical or Market Assistance? Book a Consult Here
💸 Starting a New Project/Venture? Apply for a Trelis Grant
TIMESTAMPS:
0:00 Creating a custom benchmarking dataset
0:31 Video Overview
1:06 Quick-start with YourBench from HuggingFace
7:47 Running YourBench locally to create a benchmark
20:59 Advanced data generation notes (pdf conversion, estimating difficulty, citations, chunking, multi-hop, filtering)
29:23 Evaluating a custom dataset using LightEval
36:29 Evaluation and Data Inspection with Trelis ADVANCED-evals
46:01 Conclusion
Choosing and Evaluating LLMs with Custom Benchmarks
A systematic approach to selecting the right large language model (LLM) involves creating and running custom benchmarks tailored to your specific use case. The video explains how to build and evaluate custom benchmarks using two key tools: YourBench for dataset creation and LightEval for model evaluation.
Creating Custom Benchmarks with YourBench
YourBench is a HuggingFace library that generates question-answer pairs from input documents. The process works in several stages:
Document Processing
Converts PDFs to markdown text using Microsoft's markitdown library
Generates document summaries to provide context
Chunks text into segments based on configurable length parameters (sketched below)
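As a rough illustration of the chunking step, here is a minimal length-based chunker in Python. It is a sketch, not YourBench's actual implementation; the min_len and max_len parameters are illustrative stand-ins for the configurable length settings mentioned above.

```python
# Minimal sketch of length-based chunking (illustrative, not YourBench's code).
# min_len / max_len stand in for the configurable length parameters.
def chunk_text(text: str, min_len: int = 500, max_len: int = 1500) -> list[str]:
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        if len(current) + len(paragraph) <= max_len:
            current += paragraph + "\n\n"
        elif len(current) >= min_len:
            # Current chunk is big enough: flush it and start a new one.
            chunks.append(current.strip())
            current = paragraph + "\n\n"
        else:
            # Current chunk is too small: extend it anyway, then flush.
            current += paragraph + "\n\n"
            chunks.append(current.strip())
            current = ""
    if current.strip():
        chunks.append(current.strip())
    return chunks
```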
Question Generation
Creates single-chunk questions from individual text segments
Generates multi-hop questions combining information from multiple chunks
Produces questions with difficulty ratings and citation information
Uses LLMs to dynamically determine the appropriate number of questions per chunk (a question-generation sketch follows this list)
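The sketch below shows what single-chunk question generation can look like using the OpenAI Python client. The prompt wording, the model name, and the JSON fields (question, answer, difficulty, citations) mirror the outputs described above but are assumptions for illustration, not YourBench's actual prompts.

```python
# Illustrative single-chunk question generation via an OpenAI-compatible API.
# Prompt and field names are assumptions, not YourBench's actual prompts.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_questions(chunk: str, summary: str) -> list[dict]:
    prompt = (
        "Document summary:\n" + summary + "\n\n"
        "Text chunk:\n" + chunk + "\n\n"
        "Generate as many high-quality question-answer pairs as this chunk supports. "
        "Return a JSON list of objects with keys: question, answer, "
        "difficulty (1-10), citations (verbatim quotes from the chunk)."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable generation model works here
        messages=[{"role": "user", "content": prompt}],
    )
    # Assumes the model returns valid JSON; a production pipeline would validate this.
    return json.loads(response.choices[0].message.content)
```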
Evaluation with LightEval
LightEval enables systematic testing of multiple models against the benchmark dataset:
Supports both HuggingFace inference providers and OpenAI-compatible APIs
Uses an LLM-as-judge approach to score model responses (see the judging sketch after this list)
Reports accuracy metrics across the question set
Allows configuration of evaluation parameters like concurrent requests
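To make the LLM-as-judge idea concrete, here is a minimal binary-correctness judge in Python. It is a sketch under assumed prompt wording and judge model, not LightEval's internal implementation.

```python
# Minimal LLM-as-judge sketch with binary correctness scoring
# (illustrative only, not LightEval's internal code).
from openai import OpenAI

client = OpenAI()

def judge(question: str, reference: str, candidate: str,
          judge_model: str = "gpt-4o") -> int:
    prompt = (
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {candidate}\n\n"
        "Does the candidate answer agree with the reference answer? "
        "Reply with exactly 1 (correct) or 0 (incorrect)."
    )
    reply = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content.strip()
    return 1 if reply.startswith("1") else 0

# Benchmark accuracy is then simply the mean of the per-question scores.
```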
Key Technical Details
Question Generation Parameters:
Chunk size: Configurable min/max length
Multi-hop depth: 2-5 chunks combined
Sampling: Can select a percentage of chunk combinations (sketched after this list)
Output: Question, answer, difficulty rating, citations
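The following sketch shows one way to enumerate and sample multi-hop chunk combinations. The function name and parameters are hypothetical; it only illustrates the "2-5 chunks, sampled by percentage" idea above.

```python
# Illustrative sampling of multi-hop chunk combinations (2-5 chunks each).
# Names and defaults are hypothetical; this is not YourBench's implementation.
import itertools
import random

def sample_combinations(chunks: list[str], min_hops: int = 2, max_hops: int = 5,
                        sample_fraction: float = 0.1, seed: int = 0) -> list[tuple[str, ...]]:
    all_combos = []
    for k in range(min_hops, max_hops + 1):
        # Note: the number of combinations grows quickly with the chunk count,
        # which is why sampling only a fraction is useful.
        all_combos.extend(itertools.combinations(chunks, k))
    random.seed(seed)
    n = max(1, int(len(all_combos) * sample_fraction))
    return random.sample(all_combos, n)
```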
Evaluation Configuration:
Judge model: Default GPT-4
Metrics: Binary correctness scoring
API Integration: Supports HuggingFace providers, OpenRouter
Parallel evaluation: Configurable number of concurrent requests (see the concurrency sketch below)
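Below is a minimal sketch of capping concurrent API requests with an asyncio semaphore, which is the general pattern behind a "concurrent requests" setting. It is illustrative and does not reproduce LightEval's flags or internals.

```python
# Illustrative concurrency cap for parallel evaluation requests
# (general pattern only, not LightEval's internals or flag names).
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()
MAX_CONCURRENT = 10  # analogous to a "concurrent requests" setting

async def answer(question: str, model: str, sem: asyncio.Semaphore) -> str:
    async with sem:  # at most MAX_CONCURRENT requests in flight
        response = await client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": question}],
        )
        return response.choices[0].message.content

async def answer_all(questions: list[str], model: str) -> list[str]:
    sem = asyncio.Semaphore(MAX_CONCURRENT)
    return await asyncio.gather(*(answer(q, model, sem) for q in questions))

# Usage: results = asyncio.run(answer_all(questions, "gpt-4o-mini"))
```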
Performance Results Example
Testing on a Touch Rugby rules dataset:
Qwen 72B: 27% accuracy
Llama 70B: 20% accuracy
Mistral Small: 17% accuracy
Claude Sonnet: 54% accuracy (34/62 questions)
Gemini Flash: 40% accuracy (25/62 questions)
Implementation Considerations
Data Quality:
Manual inspection of generated questions recommended
No automatic deduplication or filtering (yet)
Question quality depends heavily on generation model strength
Technical Limitations:
Currently no cross-document question generation
Semantic chunking feature needs improvement
Citation verification not implemented