Wondering what LLM is best for your custom application?
The principled approach is to create an application-specific benchmark!
I explain how using:
- YourBench to create Q&As from your documents
- LightEval to evaluate the performance of different LLMs
- Trelis ADVANCED-evals for data inspection.
Cheers, Ronan
Trelis Links:
📋Trelis Evals (hosted solution) - Waitlist
🤝 Are you a talented developer? Work for Trelis
💡 Need Technical or Market Assistance? Book a Consult Here
💸 Starting a New Project/Venture? Apply for a Trelis Grant
TIMESTAMPS:
0:00 Creating a custom benchmarking dataset
0:31 Video Overview
1:06 Quick-start with YourBench from HuggingFace
7:47 Running YourBench locally to create a benchmark
20:59 Advanced data generation notes (pdf conversion, estimating difficulty, citations, chunking, multi-hop, filtering)
29:23 Evaluating a custom dataset using LightEval
36:29 Evaluation and Data Inspection with Trelis ADVANCED-evals
46:01 Conclusion
Choosing and Evaluating LLMs with Custom Benchmarks
A systematic approach to selecting the right large language model (LLM) involves creating and running custom benchmarks tailored to your specific use case. The video explains how to build and evaluate custom benchmarks using two key tools: YourBench for dataset creation and LightEval for model evaluation.
Creating Custom Benchmarks with YourBench
YourBench is a HuggingFace library that generates question-answer pairs from input documents. The process works in several stages:
Document Processing
Converts PDFs to markdown text using Microsoft's markitdown library
Generates document summaries to provide context
Chunks text into segments based on configurable length parameters (sketched below)
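As a rough illustration of the chunking step, here is a minimal length-based chunker in Python. It is a sketch, not YourBench's actual implementation; the min_len and max_len parameters are illustrative stand-ins for the configurable length settings mentioned above.

```python
# Minimal sketch of length-based chunking (illustrative, not YourBench's code).
# min_len / max_len stand in for the configurable length parameters.
def chunk_text(text: str, min_len: int = 500, max_len: int = 1500) -> list[str]:
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        if len(current) + len(paragraph) <= max_len:
            current += paragraph + "\n\n"
        elif len(current) >= min_len:
            # Current chunk is big enough: flush it and start a new one.
            chunks.append(current.strip())
            current = paragraph + "\n\n"
        else:
            # Current chunk is too small: extend it anyway, then flush.
            current += paragraph + "\n\n"
            chunks.append(current.strip())
            current = ""
    if current.strip():
        chunks.append(current.strip())
    return chunks
```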
Question Generation
Creates single-chunk questions from individual text segments
Generates multi-hop questions combining information from multiple chunks
Produces questions with difficulty ratings and citation information
Uses LLMs to dynamically determine the appropriate number of questions per chunk (a question-generation sketch follows this list)
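The sketch below shows what single-chunk question generation can look like using the OpenAI Python client. The prompt wording, the model name, and the JSON fields (question, answer, difficulty, citations) mirror the outputs described above but are assumptions for illustration, not YourBench's actual prompts.

```python
# Illustrative single-chunk question generation via an OpenAI-compatible API.
# Prompt and field names are assumptions, not YourBench's actual prompts.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_questions(chunk: str, summary: str) -> list[dict]:
    prompt = (
        "Document summary:\n" + summary + "\n\n"
        "Text chunk:\n" + chunk + "\n\n"
        "Generate as many high-quality question-answer pairs as this chunk supports. "
        "Return a JSON list of objects with keys: question, answer, "
        "difficulty (1-10), citations (verbatim quotes from the chunk)."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable generation model works here
        messages=[{"role": "user", "content": prompt}],
    )
    # Assumes the model returns valid JSON; a production pipeline would validate this.
    return json.loads(response.choices[0].message.content)
```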
Evaluation with LightEval
LightEval enables systematic testing of multiple models against the benchmark dataset:
Supports both HuggingFace inference providers and OpenAI-compatible APIs
Uses an LLM-as-judge approach to score model responses (see the judging sketch after this list)
Reports accuracy metrics across the question set
Allows configuration of evaluation parameters like concurrent requests
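To make the LLM-as-judge idea concrete, here is a minimal binary-correctness judge in Python. It is a sketch under assumed prompt wording and judge model, not LightEval's internal implementation.

```python
# Minimal LLM-as-judge sketch with binary correctness scoring
# (illustrative only, not LightEval's internal code).
from openai import OpenAI

client = OpenAI()

def judge(question: str, reference: str, candidate: str,
          judge_model: str = "gpt-4o") -> int:
    prompt = (
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {candidate}\n\n"
        "Does the candidate answer agree with the reference answer? "
        "Reply with exactly 1 (correct) or 0 (incorrect)."
    )
    reply = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content.strip()
    return 1 if reply.startswith("1") else 0

# Benchmark accuracy is then simply the mean of the per-question scores.
```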
Key Technical Details
Question Generation Parameters:
Chunk size: Configurable min/max length
Multi-hop depth: 2-5 chunks combined
Sampling: Can select a percentage of chunk combinations (sketched after this list)
Output: Question, answer, difficulty rating, citations
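The following sketch shows one way to enumerate and sample multi-hop chunk combinations. The function name and parameters are hypothetical; it only illustrates the "2-5 chunks, sampled by percentage" idea above.

```python
# Illustrative sampling of multi-hop chunk combinations (2-5 chunks each).
# Names and defaults are hypothetical; this is not YourBench's implementation.
import itertools
import random

def sample_combinations(chunks: list[str], min_hops: int = 2, max_hops: int = 5,
                        sample_fraction: float = 0.1, seed: int = 0) -> list[tuple[str, ...]]:
    all_combos = []
    for k in range(min_hops, max_hops + 1):
        # Note: the number of combinations grows quickly with the chunk count,
        # which is why sampling only a fraction is useful.
        all_combos.extend(itertools.combinations(chunks, k))
    random.seed(seed)
    n = max(1, int(len(all_combos) * sample_fraction))
    return random.sample(all_combos, n)
```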
Evaluation Configuration:
Judge model: Default GPT-4
Metrics: Binary correctness scoring
API Integration: Supports HuggingFace providers, OpenRouter
Parallel evaluation: Configurable number of concurrent requests (see the concurrency sketch below)
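Below is a minimal sketch of capping concurrent API requests with an asyncio semaphore, which is the general pattern behind a "concurrent requests" setting. It is illustrative and does not reproduce LightEval's flags or internals.

```python
# Illustrative concurrency cap for parallel evaluation requests
# (general pattern only, not LightEval's internals or flag names).
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()
MAX_CONCURRENT = 10  # analogous to a "concurrent requests" setting

async def answer(question: str, model: str, sem: asyncio.Semaphore) -> str:
    async with sem:  # at most MAX_CONCURRENT requests in flight
        response = await client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": question}],
        )
        return response.choices[0].message.content

async def answer_all(questions: list[str], model: str) -> list[str]:
    sem = asyncio.Semaphore(MAX_CONCURRENT)
    return await asyncio.gather(*(answer(q, model, sem) for q in questions))

# Usage: results = asyncio.run(answer_all(questions, "gpt-4o-mini"))
```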
Performance Results Example
Testing on a Touch Rugby rules dataset:
Qwen 72B: 27% accuracy
Llama 70B: 20% accuracy
Mistral Small: 17% accuracy
Claude Sonnet: 54% accuracy (34/62 questions)
Gemini Flash: 40% accuracy (25/62 questions)
Implementation Considerations
Data Quality:
Manual inspection of generated questions recommended
No automatic deduplication or filtering (yet)
Question quality depends heavily on generation model strength
Technical Limitations:
Currently no cross-document question generation
Semantic chunking feature needs improvement
Citation verification not implemented