
Build Custom LLM Benchmarks for your Application

YourBench, LightEval and Trelis ADVANCED-evals

Wondering what LLM is best for your custom application?

The principled approach is to create an application-specific benchmark!

I explain how using:

- YourBench to create Q&As from your documents

- LightEval to evaluate the performance of different LLMs

- Trelis ADVANCED-evals for data inspection.

Cheers, Ronan

Purchase Repo Access


Trelis Links:

📋Trelis Evals (hosted solution) - Waitlist

🤝 Are you a talented developer? Work for Trelis

💡 Need Technical or Market Assistance? Book a Consult Here

💸 Starting a New Project/Venture? Apply for a Trelis Grant

Video Links:

- YourBench

- LightEval


TIMESTAMPS:

0:00 Creating a custom benchmarking dataset

0:31 Video Overview

1:06 Quick-start with YourBench from HuggingFace

7:47 Running YourBench locally to create a benchmark

20:59 Advanced data generation notes (PDF conversion, estimating difficulty, citations, chunking, multi-hop, filtering)

29:23 Evaluating a custom dataset using LightEval

36:29 Evaluation and Data Inspection with Trelis ADVANCED-evals

46:01 Conclusion


Choosing and Evaluating LLMs with Custom Benchmarks

A systematic approach to selecting the right large language model (LLM) is to create and run a custom benchmark tailored to your specific use case. The video explains how to do this with two key tools: YourBench for dataset creation and LightEval for model evaluation, with Trelis ADVANCED-evals used to inspect the results.

Creating Custom Benchmarks with YourBench

YourBench is a HuggingFace library that generates question-answer pairs from input documents. The process works in several stages (a conceptual sketch follows the list):

  1. Document Processing

    1. Converts PDFs to Markdown text using Microsoft's MarkItDown library

    2. Generates document summaries to provide context

    3. Chunks text into segments based on configurable length parameters

  2. Question Generation

    1. Creates single-chunk questions from individual text segments

    2. Generates multi-hop questions combining information from multiple chunks

    3. Produces questions with difficulty ratings and citation information

    4. Uses LLMs to dynamically determine an appropriate number of questions per chunk
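
A minimal Python sketch of these two stages, assuming an OpenAI-compatible endpoint; the model name, prompt wording, and length bounds are placeholder assumptions, not YourBench's actual API:

```python
# Illustrative sketch of the document-processing and question-generation
# stages (not YourBench's actual API). The model name, prompt wording,
# and length bounds are placeholder assumptions.
from openai import OpenAI  # works with any OpenAI-compatible endpoint

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def chunk_text(text: str, min_chars: int = 500, max_chars: int = 1500) -> list[str]:
    """Greedy paragraph-based chunking bounded by min/max character counts."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if len(current) >= min_chars and len(current) + len(para) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += para + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks

def questions_for_chunk(chunk: str, summary: str, model: str = "gpt-4o-mini") -> str:
    """Ask the generation model for Q&A pairs grounded in a single chunk."""
    prompt = (
        f"Document summary (for context):\n{summary}\n\n"
        f"Text chunk:\n{chunk}\n\n"
        "Write as many standalone question-answer pairs as this chunk supports. "
        "For each pair, include a difficulty rating from 1-10 and a verbatim citation."
    )
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content
```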

Evaluation with LightEval

LightEval enables systematic testing of multiple models against the benchmark dataset (see the judging sketch after this list):

  1. Supports both HuggingFace inference providers and OpenAI-compatible APIs

  2. Uses LLM-as-judge approach to score model responses

  3. Reports accuracy metrics across the question set

  4. Allows configuration of evaluation parameters like concurrent requests
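
As a rough illustration of the judging step (not LightEval's internal implementation; the judge prompt and default judge model are placeholder assumptions), binary correctness scoring reduces to asking a strong model to compare the candidate answer against the reference:

```python
# Illustrative sketch of LLM-as-judge scoring (not LightEval's internal code).
# The judge prompt and default judge model are placeholder assumptions.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = (
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Model answer: {candidate}\n\n"
    "Does the model answer convey the same information as the reference? "
    "Reply with exactly one word: CORRECT or INCORRECT."
)

def judge(question: str, reference: str, candidate: str, judge_model: str = "gpt-4o") -> int:
    """Return 1 if the judge marks the candidate answer correct, else 0."""
    resp = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, reference=reference, candidate=candidate)}],
    )
    verdict = resp.choices[0].message.content.strip().upper()
    return int(verdict.startswith("CORRECT"))

# Accuracy over the benchmark is then the mean of these 0/1 scores.
```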

Key Technical Details

Question Generation Parameters (mirrored in the hypothetical config below):

  1. Chunk size: Configurable min/max length

  2. Multi-hop depth: 2-5 chunks combined

  3. Sampling: Can select percentage of chunk combinations

  4. Output: Question, answer, difficulty rating, citations
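
A hypothetical settings object that mirrors these knobs; field names and default values are illustrative, not YourBench's real schema:

```python
# Hypothetical settings object mirroring the generation knobs above.
# Names and default values are illustrative, not YourBench's real schema.
from dataclasses import dataclass

@dataclass
class GenerationSettings:
    min_chunk_chars: int = 500         # lower bound on chunk length
    max_chunk_chars: int = 1500        # upper bound on chunk length
    multi_hop_min_chunks: int = 2      # fewest chunks combined per multi-hop question
    multi_hop_max_chunks: int = 5      # most chunks combined per multi-hop question
    chunk_sampling_rate: float = 0.3   # fraction of chunk combinations actually used
    output_fields: tuple[str, ...] = (
        "question", "answer", "difficulty", "citations",
    )
```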

Evaluation Configuration (see the concurrency sketch after this list):

  1. Judge model: Default GPT-4

  2. Metrics: Binary correctness scoring

  3. API Integration: Supports HuggingFace providers, OpenRouter

  4. Parallel evaluation: Configurable concurrent requests
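
A minimal sketch of querying a candidate model through an OpenAI-compatible endpoint with a configurable concurrency cap; the base URL, model ID, and semaphore size are placeholder assumptions, and this is not LightEval's internal code:

```python
# Illustrative sketch of parallel evaluation against an OpenAI-compatible
# endpoint such as OpenRouter (not LightEval's internal code). The base URL,
# model ID, and concurrency limit are placeholder assumptions.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="https://openrouter.ai/api/v1", api_key="YOUR_KEY")

async def answer(question: str, model: str, sem: asyncio.Semaphore) -> str:
    async with sem:  # cap the number of in-flight requests
        resp = await client.chat.completions.create(
            model=model, messages=[{"role": "user", "content": question}]
        )
        return resp.choices[0].message.content

async def run_all(questions: list[str], model: str, max_concurrent: int = 8) -> list[str]:
    sem = asyncio.Semaphore(max_concurrent)
    return await asyncio.gather(*(answer(q, model, sem) for q in questions))

# Example: answers = asyncio.run(run_all(questions, "meta-llama/llama-3.3-70b-instruct"))
```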

Performance Results Example

Testing on a Touch Rugby rules dataset:

  1. Qwen 72B: 27% accuracy

  2. Llama 70B: 20% accuracy

  3. Mistral Small: 17% accuracy

  4. Claude Sonnet: 54% accuracy (34/62 questions)

  5. Gemini Flash: 40% accuracy (25/62 questions)

Implementation Considerations

Data Quality:

  1. Manual inspection of generated questions recommended

  2. No automatic deduplication or filtering (yet); a manual cleanup sketch follows this list

  3. Question quality depends heavily on generation model strength
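
Until such filtering is built in, a simple cleanup pass can be run over the generated records before evaluation. A sketch, assuming each record is a dict with hypothetical question and citations fields:

```python
# Sketch of a manual cleanup pass over generated records (hypothetical field
# names): drop near-duplicate questions and items with no citations.
from difflib import SequenceMatcher

def clean(records: list[dict], sim_threshold: float = 0.9) -> list[dict]:
    kept: list[dict] = []
    for rec in records:
        question = rec["question"].strip()
        if not rec.get("citations"):
            continue  # unverifiable: no citation back to the source text
        if any(
            SequenceMatcher(None, question, k["question"]).ratio() > sim_threshold
            for k in kept
        ):
            continue  # near-duplicate of a question already kept
        kept.append(rec)
    return kept
```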

Technical Limitations:

  1. Currently no cross-document question generation

  2. Semantic chunking feature needs improvement

  3. Citation verification not implemented
