Improving AI Performance (using Evals)
Part 2 of the LLM Evals series is out, and it covers the topic of...
"How do I Improve AI Performance for my Use Case?”
In short:
Making AI work for custom applications is all about creating high quality data.
That data is often IN YOUR HEAD or IN YOUR CUSTOMER'S HEAD.
You need to make that information explicit!
And to do that, it helps to have a clear framework for creating high quality examples of tasks for the AI to learn from (via prompting).
That's what I explain here.
Cheers, Ronan
Trelis.com
Summary: Improving LLM Performance Through Systematic Evaluation and Few-Shot Learning
Large Language Model (LLM) evaluations serve two key purposes: measuring performance and testing improvements. This article describes a systematic approach to creating high-quality examples and measuring their impact on model performance.
Creating the Initial Pipeline
The first step is establishing a baseline pipeline with minimal configuration:
- Model: Claude 3.5 Sonnet
- No system prompt
- No few-shot examples
This simple configuration provides a reference point for measuring improvements.
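As a rough sketch (assuming the Anthropic Python SDK), the baseline pipeline can be a single call with no system prompt and no examples. The model identifier, token limit, and prompt layout below are illustrative assumptions, not details from the article.

```python
# A minimal baseline pipeline: one call to Claude 3.5 Sonnet with no system
# prompt and no few-shot examples. Model identifier and prompt layout are
# assumptions for illustration.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def baseline_answer(question: str, document: str) -> str:
    """Send the source document plus question and return the model's raw text."""
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",  # assumed model identifier
        max_tokens=1024,
        messages=[
            {"role": "user", "content": f"{document}\n\nQuestion: {question}"}
        ],
    )
    return response.content[0].text
```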
Building an Evaluation Dataset
The evaluation dataset requires:
- Clear input questions
- Specific evaluation criteria, for example:
  - The correct factual answer
  - A requirement for verbatim citations from source documents
  - Specific formatting (e.g. a brief answer followed by a citation)
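A minimal sketch of how one evaluation row could be stored, assuming a plain list-of-dicts layout; the field names, question, and criteria wording are illustrative rather than a fixed schema.

```python
# One evaluation row: the input question plus the criteria an automated judge
# will check. Field names and wording are illustrative.
eval_dataset = [
    {
        "question": "How many players from each team are on the field in touch rugby?",
        "criteria": [
            "Gives the factually correct answer",
            "Includes a verbatim citation from the source document",
            "Uses the required format: a brief answer followed by the citation",
        ],
    },
]
```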
Generating Training Examples
Training examples can be created through a systematic process, much like building the evaluation dataset:
1. Write question and evaluation criteria
2. Include relevant reference documents
3. Generate sample answer using baseline model
4. Validate answer meets all criteria using automated judge
5. Mark validated examples as training data to prevent evaluation contamination
Key principle: Training examples should be similar to but not identical to evaluation examples to avoid overfitting.
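The sketch below covers steps 3-5 of that process, assuming an LLM-as-judge that returns a binary PASS/FAIL verdict. The judge prompt, the `baseline_answer` helper from the earlier sketch, and the `eval_candidates` list (rows shaped like the evaluation rows above, plus a `document` field) are assumptions for illustration.

```python
# Steps 3-5: generate a candidate answer with the baseline model, check it
# against the criteria with an automated judge, and keep it as training data
# only if it passes. All prompts and field names here are illustrative.
import anthropic

client = anthropic.Anthropic()

def judge_passes(answer: str, criteria: list[str]) -> bool:
    """Ask a judge model for a binary PASS/FAIL verdict against all criteria."""
    criteria_text = "\n".join(f"- {c}" for c in criteria)
    verdict = client.messages.create(
        model="claude-3-5-sonnet-20241022",  # assumed judge model
        max_tokens=16,
        messages=[{
            "role": "user",
            "content": (
                "Does the answer below satisfy ALL of these criteria?\n"
                f"{criteria_text}\n\nAnswer:\n{answer}\n\n"
                "Reply with exactly PASS or FAIL."
            ),
        }],
    )
    return verdict.content[0].text.strip().upper().startswith("PASS")

training_examples = []
for row in eval_candidates:  # candidate rows written specifically for training
    answer = baseline_answer(row["question"], row["document"])
    if judge_passes(answer, row["criteria"]):
        row["answer"] = answer
        row["split"] = "train"  # marked as training data, never used for evaluation
        training_examples.append(row)
```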
Enhanced Pipeline with Few-Shot Learning
Once you have some training examples, you can create an improved pipeline with:
- System prompt identifying the model as a domain expert
- 2-3 high-quality few-shot examples demonstrating the desired format
- XML tags to clearly delineate example components:
  - Document context
  - Question
  - Evaluation criteria
  - Ground truth answer
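A sketch of how such a prompt could be assembled from validated training examples; the tag names, the system prompt wording, and the cap of three examples are assumptions for illustration, not the article's exact setup.

```python
# Assemble the enhanced prompt: a domain-expert system prompt plus up to three
# validated training examples, each wrapped in XML tags so the document,
# question, criteria, and answer stay clearly delineated.
import anthropic

client = anthropic.Anthropic()

def build_few_shot_prompt(training_examples: list[dict], question: str, document: str):
    system_prompt = "You are an expert on the rules of touch rugby."  # assumed wording
    shots = []
    for ex in training_examples[:3]:
        shots.append(
            "<example>\n"
            f"<document>{ex['document']}</document>\n"
            f"<question>{ex['question']}</question>\n"
            f"<criteria>{'; '.join(ex['criteria'])}</criteria>\n"
            f"<answer>{ex['answer']}</answer>\n"
            "</example>"
        )
    user_prompt = (
        "\n\n".join(shots)
        + f"\n\n<document>{document}</document>\n<question>{question}</question>"
    )
    return system_prompt, user_prompt

def enhanced_answer(training_examples: list[dict], question: str, document: str) -> str:
    """Run the enhanced pipeline: few-shot prompt plus system prompt."""
    system_prompt, user_prompt = build_few_shot_prompt(training_examples, question, document)
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",  # assumed model identifier
        max_tokens=1024,
        system=system_prompt,  # the Messages API takes the system prompt separately
        messages=[{"role": "user", "content": user_prompt}],
    )
    return response.content[0].text
```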
Results and Observations
Testing on a touch rugby Q&A dataset:
- Baseline pipeline (no examples): 0/2 correct
- Enhanced pipeline with few-shot examples: 2/2 correct
Key improvements when using few-shot examples:
- Consistent answer formatting
- Proper citation inclusion
- Accurate content aligned with source documents
Implementation Notes
When building evaluation systems:
1. Keep evaluation separate from training data
2. Use binary scoring (pass/fail) initially for clearer results (don’t overcomplicate)
3. Manually select training examples rather than random splits
4. Consider context length limitations (you can sometimes truncate the middle of long documents in the few-shot examples; see the sketch after this list), especially with reasoning models
5. Structure prompts with clear delineation between components
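For note 4, a small illustration of one way to shorten long documents inside few-shot examples: keep the head and tail and drop the middle. The character budget here is an arbitrary placeholder.

```python
# Keep the start and end of a long document and drop the middle, so few-shot
# examples fit within the context window. The 8,000-character budget is arbitrary.
def truncate_middle(text: str, max_chars: int = 8000) -> str:
    if len(text) <= max_chars:
        return text
    half = max_chars // 2
    return text[:half] + "\n[... middle of document truncated ...]\n" + text[-half:]
```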
Special Considerations for Reasoning Models
Models like o1 respond differently to few-shot examples:
- May perform poorly with lengthy context
- Often confused by irrelevant examples
- Work better with direct, simple prompts
Overall, you may require hybrid approaches combining reasoning and traditional (non-reasoning) models, especially if dealing with a lot of context.