Improving AI Performance (using Evals)
Part 2 of the LLM Evals series is out, and it covers the topic of...
"How do I Improve AI Performance for my Use Case?”
In short:
Making AI work for custom applications is all about creating high quality data.
That data is often IN YOUR HEAD or IN YOUR CUSTOMER'S HEAD.
You need to make that information explicit!
And to do that, it helps to have a clear framework for creating high quality examples of tasks for the AI to learn from (via prompting).
That's what I explain here.
Cheers, Ronan
Trelis.com
Summary: Improving LLM Performance Through Systematic Evaluation and Few-Shot Learning
Large Language Model (LLM) evaluations serve two key purposes: measuring performance and testing improvements. This article describes a systematic approach to creating high-quality examples and measuring their impact on model performance.
Creating the Initial Pipeline
The first step is establishing a baseline pipeline with minimal configuration:
- Model: Claude 3.5 Sonnet
- No system prompt
- No few-shot examples
This simple configuration provides a reference point for measuring improvements.
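As a rough sketch (assuming the Anthropic Python SDK), the baseline pipeline can be a single call with no system prompt and no examples. The model identifier, token limit, and prompt layout below are illustrative assumptions, not details from the article.

```python
# A minimal baseline pipeline: one call to Claude 3.5 Sonnet with no system
# prompt and no few-shot examples. Model identifier and prompt layout are
# assumptions for illustration.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def baseline_answer(question: str, document: str) -> str:
    """Send the source document plus question and return the model's raw text."""
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",  # assumed model identifier
        max_tokens=1024,
        messages=[
            {"role": "user", "content": f"{document}\n\nQuestion: {question}"}
        ],
    )
    return response.content[0].text
```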
Building an Evaluation Dataset
The evaluation dataset requires:
- Clear input questions
- Specific evaluation criteria, for example:
  - The correct factual answer
  - A requirement for verbatim citations from source documents
  - Specific formatting (e.g. a brief answer followed by a citation)
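A minimal sketch of how one evaluation row could be stored, assuming a plain list-of-dicts layout; the field names, question, and criteria wording are illustrative rather than a fixed schema.

```python
# One evaluation row: the input question plus the criteria an automated judge
# will check. Field names and wording are illustrative.
eval_dataset = [
    {
        "question": "How many players from each team are on the field in touch rugby?",
        "criteria": [
            "Gives the factually correct answer",
            "Includes a verbatim citation from the source document",
            "Uses the required format: a brief answer followed by the citation",
        ],
    },
]
```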
Generating Training Examples
Training examples can be created through a systematic process, much like building the evaluation dataset:
1. Write question and evaluation criteria
2. Include relevant reference documents
3. Generate sample answer using baseline model
4. Validate answer meets all criteria using automated judge
5. Mark validated examples as training data to prevent evaluation contamination
Key principle: Training examples should be similar to but not identical to evaluation examples to avoid overfitting.
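The sketch below covers steps 3-5 of that process, assuming an LLM-as-judge that returns a binary PASS/FAIL verdict. The judge prompt, the `baseline_answer` helper from the earlier sketch, and the `eval_candidates` list (rows shaped like the evaluation rows above, plus a `document` field) are assumptions for illustration.

```python
# Steps 3-5: generate a candidate answer with the baseline model, check it
# against the criteria with an automated judge, and keep it as training data
# only if it passes. All prompts and field names here are illustrative.
import anthropic

client = anthropic.Anthropic()

def judge_passes(answer: str, criteria: list[str]) -> bool:
    """Ask a judge model for a binary PASS/FAIL verdict against all criteria."""
    criteria_text = "\n".join(f"- {c}" for c in criteria)
    verdict = client.messages.create(
        model="claude-3-5-sonnet-20241022",  # assumed judge model
        max_tokens=16,
        messages=[{
            "role": "user",
            "content": (
                "Does the answer below satisfy ALL of these criteria?\n"
                f"{criteria_text}\n\nAnswer:\n{answer}\n\n"
                "Reply with exactly PASS or FAIL."
            ),
        }],
    )
    return verdict.content[0].text.strip().upper().startswith("PASS")

training_examples = []
for row in eval_candidates:  # candidate rows written specifically for training
    answer = baseline_answer(row["question"], row["document"])
    if judge_passes(answer, row["criteria"]):
        row["answer"] = answer
        row["split"] = "train"  # marked as training data, never used for evaluation
        training_examples.append(row)
```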
Enhanced Pipeline with Few-Shot Learning
Once you have some training examples, you can create an improved pipeline with:
- System prompt identifying the model as a domain expert
- 2-3 high-quality few-shot examples demonstrating the desired format
- XML tags to clearly delineate example components:
  - Document context
  - Question
  - Evaluation criteria
  - Ground truth answer
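A sketch of how such a prompt could be assembled from validated training examples; the tag names, the system prompt wording, and the cap of three examples are assumptions for illustration, not the article's exact setup.

```python
# Assemble the enhanced prompt: a domain-expert system prompt plus up to three
# validated training examples, each wrapped in XML tags so the document,
# question, criteria, and answer stay clearly delineated.
import anthropic

client = anthropic.Anthropic()

def build_few_shot_prompt(training_examples: list[dict], question: str, document: str):
    system_prompt = "You are an expert on the rules of touch rugby."  # assumed wording
    shots = []
    for ex in training_examples[:3]:
        shots.append(
            "<example>\n"
            f"<document>{ex['document']}</document>\n"
            f"<question>{ex['question']}</question>\n"
            f"<criteria>{'; '.join(ex['criteria'])}</criteria>\n"
            f"<answer>{ex['answer']}</answer>\n"
            "</example>"
        )
    user_prompt = (
        "\n\n".join(shots)
        + f"\n\n<document>{document}</document>\n<question>{question}</question>"
    )
    return system_prompt, user_prompt

def enhanced_answer(training_examples: list[dict], question: str, document: str) -> str:
    """Run the enhanced pipeline: few-shot prompt plus system prompt."""
    system_prompt, user_prompt = build_few_shot_prompt(training_examples, question, document)
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",  # assumed model identifier
        max_tokens=1024,
        system=system_prompt,  # the Messages API takes the system prompt separately
        messages=[{"role": "user", "content": user_prompt}],
    )
    return response.content[0].text
```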
Results and Observations
Testing on a touch rugby Q&A dataset:
- Baseline pipeline (no examples): 0/2 correct
- Enhanced pipeline with few-shot examples: 2/2 correct
Key improvements when using few-shot examples:
- Consistent answer formatting
- Proper citation inclusion
- Accurate content aligned with source documents
Implementation Notes
When building evaluation systems:
1. Keep evaluation separate from training data
2. Use binary scoring (pass/fail) initially for clearer results (don’t overcomplicate)
3. Manually select training examples rather than random splits
4. Consider context length limitations (you can sometimes truncate the middle of long documents in the few-shot examples; see the sketch after this list), especially with reasoning models
5. Structure prompts with clear delineation between components
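For note 4, a small illustration of one way to shorten long documents inside few-shot examples: keep the head and tail and drop the middle. The character budget here is an arbitrary placeholder.

```python
# Keep the start and end of a long document and drop the middle, so few-shot
# examples fit within the context window. The 8,000-character budget is arbitrary.
def truncate_middle(text: str, max_chars: int = 8000) -> str:
    if len(text) <= max_chars:
        return text
    half = max_chars // 2
    return text[:half] + "\n[... middle of document truncated ...]\n" + text[-half:]
```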
Special Considerations for Reasoning Models
Models like o1 respond differently to few-shot examples:
- May perform poorly with lengthy context
- Often confused by irrelevant examples
- Work better with direct, simple prompts
Overall, you may require hybrid approaches combining reasoning and traditional (non-reasoning) models, especially if dealing with a lot of context.