LLM Evaluation Frameworks — Measuring What Matters
Evaluating LLMs requires standardized benchmarks, reproducible pipelines, and comprehensive task coverage. This guide covers the major evaluation frameworks and how to use them for rigorous model assessment.
- lm-eval-harness — EleutherAI's framework for standardized evaluation
- OpenCompass — Comprehensive evaluation with 100+ benchmarks
- HELM — Holistic evaluation across many dimensions
- Evaluation Design — Choosing the right metrics and tasks
If you can't measure it, you can't improve it.
LLM Evaluation Frameworks
Evaluating LLMs is challenging because they perform many different tasks. Evaluation frameworks provide standardized pipelines for running benchmarks, ensuring fair comparison across models, and covering diverse capabilities from reasoning to safety.
Definition: Evaluation Framework
An evaluation framework is a standardized system for assessing LLM capabilities across multiple tasks, providing reproducible results, and enabling fair comparison between models.
lm-eval-harness
Overview
Definition: lm-eval-harness
lm-eval-harness (EleutherAI) is the most widely used open-source framework for evaluating language models. It provides 200+ tasks, standardized prompts, and consistent evaluation metrics across models.
Installation and Usage
# Install lm-eval-harness
pip install lm-eval

# Run evaluation on a model
lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-2-7b-hf,dtype=float16 \
    --tasks hellaswag,arc_challenge,mmlu \
    --device cuda:0 \
    --batch_size 8

# Run MMLU with 5 few-shot examples and write results to disk
lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-2-7b-hf \
    --tasks mmlu \
    --num_fewshot 5 \
    --output_path results/
Task Categories
| Category | Example Tasks | What It Measures |
|---|---|---|
| Knowledge | MMLU, ARC, TriviaQA | Factual knowledge |
| Reasoning | GSM8K, MATH, LogiQA | Mathematical/logical reasoning |
| Commonsense | HellaSwag, WinoGrande | Commonsense inference, coreference resolution |
| Code | HumanEval, MBPP | Code generation |
| Safety | TruthfulQA, BBQ | Truthfulness, social bias |
| Instruction | MT-Bench, AlpacaEval | Following instructions |
Custom Task Implementation
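# Sketch against the v0.4-style Python task API. Class and method signatures
# differ between lm-eval releases, so check them against the installed version.
from lm_eval.api.instance import Instance
from lm_eval.api.metrics import mean
from lm_eval.api.registry import register_task
from lm_eval.api.task import Task


@register_task("custom_qa")
class CustomQA(Task):
    """Custom QA task scored by the log-likelihood of the reference answer."""

    VERSION = 0
    OUTPUT_TYPE = "loglikelihood"

    def download(self, data_dir=None, cache_dir=None, download_mode=None):
        """Load evaluation data."""
        # Replace with your own loader; each doc needs "question" and "answer".
        self.dataset = {"validation": [], "test": []}

    def has_training_docs(self):
        return False

    def has_validation_docs(self):
        return True

    def has_test_docs(self):
        return True

    def validation_docs(self):
        return self.dataset["validation"]

    def test_docs(self):
        return self.dataset["test"]

    def fewshot_examples(self, k, rnd):
        """Provide k few-shot examples, sampled from the validation docs."""
        return rnd.sample(list(self.validation_docs()), k)

    def doc_to_text(self, doc):
        """Convert document to prompt text."""
        return f"Question: {doc['question']}\nAnswer:"

    def doc_to_target(self, doc):
        """Convert document to target answer (leading space separates it from the prompt)."""
        return " " + doc["answer"]

    def construct_requests(self, doc, ctx, **kwargs):
        """Build one log-likelihood request for the reference answer."""
        return Instance(
            request_type="loglikelihood",
            doc=doc,
            arguments=(ctx, self.doc_to_target(doc)),
            idx=0,
            **kwargs,
        )

    def process_results(self, doc, results):
        """Count a doc as correct when the answer is the greedy continuation."""
        _log_prob, is_greedy = results[0]
        return {"acc": float(is_greedy)}

    def aggregation(self):
        return {"acc": mean}

    def higher_is_better(self):
        return {"acc": True}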
lm-eval-harness uses a standardized evaluation protocol: it formats prompts consistently, handles few-shot examples automatically, and computes metrics in a reproducible way. This ensures fair comparison between models.
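To make "handles few-shot examples automatically" concrete, here is a minimal sketch of how a k-shot prompt can be assembled from solved examples plus one unsolved question. It is an illustration only, not lm-eval-harness's actual implementation; the function name `build_fewshot_prompt` and the `question`/`answer` field names are assumptions chosen to match the custom task above.

```python
def build_fewshot_prompt(fewshot_docs, test_doc, k):
    """Join k solved examples and one unsolved question into a single prompt."""

    def render(doc, with_answer):
        text = f"Question: {doc['question']}\nAnswer:"
        return f"{text} {doc['answer']}" if with_answer else text

    # Solved demonstrations come first, separated by blank lines.
    shots = [render(d, with_answer=True) for d in fewshot_docs[:k]]
    # The test question is rendered without its answer so the model completes it.
    return "\n\n".join(shots + [render(test_doc, with_answer=False)])


demos = [
    {"question": "2 + 2 = ?", "answer": "4"},
    {"question": "Capital of France?", "answer": "Paris"},
]
prompt = build_fewshot_prompt(demos, {"question": "3 + 5 = ?", "answer": "8"}, k=2)
print(prompt)
```

Because every model sees exactly the same rendered string for a given task and value of k, score differences reflect the models rather than prompt wording.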
OpenCompass
Overview
Definition: OpenCompass
OpenCompass (2023) is a comprehensive evaluation platform with 100+ benchmarks covering 50+ capabilities. It supports both English and Chinese evaluation, with a web-based leaderboard for comparing models.
Configuration
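# Sketch of an OpenCompass config file. Dataset module paths and runner
# options vary between releases, so check them against the installed version.
from mmengine.config import read_base
from opencompass.models import HuggingFaceCausalLM
from opencompass.partitioners import SizePartitioner
from opencompass.runners import SlurmRunner
from opencompass.tasks import OpenICLInferTask

with read_base():
    # Predefined dataset configs shipped with OpenCompass
    from .datasets.mmlu.mmlu_gen import mmlu_datasets
    from .datasets.hellaswag.hellaswag_gen import hellaswag_datasets
    from .datasets.ARC_c.ARC_c_gen import ARC_c_datasets

# Model configuration
models = [
    dict(
        type=HuggingFaceCausalLM,
        abbr='llama-2-7b',
        path='meta-llama/Llama-2-7b-hf',
        model_kwargs=dict(
            torch_dtype='auto',
            device_map='auto',
        ),
        max_out_len=100,
        max_seq_len=2048,
        batch_size=8,
        run_cfg=dict(num_gpus=1),
    )
]

# Dataset configuration
datasets = [*mmlu_datasets, *hellaswag_datasets, *ARC_c_datasets]

# Inference configuration: split the work by size and submit it to Slurm
infer = dict(
    partitioner=dict(type=SizePartitioner, max_task_size=20000),
    runner=dict(
        type=SlurmRunner,
        max_num_workers=16,
        task=dict(type=OpenICLInferTask),
        retry=2,
    ),
)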
Benchmark Coverage
| Domain | Benchmarks | Tasks |
|---|---|---|
| Knowledge | MMLU, C-Eval, CMMLU | 57+ subjects |
| Reasoning | GSM8K, MATH, BBH | Math, logic |
| Language | HellaSwag, WinoGrande | Language understanding |
| Code | HumanEval, MBPP | Code generation |
| Safety | SafetyBench, TruthfulQA | Safety evaluation |
| Multilingual | XL-BEN, MGSM | Cross-lingual |
HELM (Holistic Evaluation)
Overview
Definition: HELM
HELM (Liang et al., 2023) is a holistic evaluation framework from Stanford that assesses LLMs across 42 scenarios and 7 metric categories, reporting calibration, robustness, fairness, bias, toxicity, and efficiency alongside accuracy.
Evaluation Dimensions
| Dimension | Metrics | Purpose |
|---|---|---|
| Accuracy | F1, Exact Match, ROUGE | Task performance |
| Calibration | ECE, Brier Score | Confidence accuracy |
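| Robustness | Accuracy under input perturbations (e.g., typos) | Stability under input changes |
| Fairness | Accuracy under dialect and demographic perturbations | Performance gaps across groups |
| Bias | Demographic representation, stereotypical associations | Social bias in generations |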
| Toxicity | Toxicity score | Harmful outputs |
| Efficiency | Latency, throughput | Cost-effectiveness |
HELM Scenarios
# Illustrative scenario description (simplified; HELM's own run specs use a different format)
scenario = dict(
    name="openbookqa",
    description="OpenBookQA dataset for commonsense reasoning",
    dataset=dict(
        name="openbookqa",
        split="test",
    ),
    # Metrics to report for this scenario
    metric_list=[
        dict(name="exact_match", args={"normalize": True}),
        dict(name="calibration_macro_avg", args={"num_bins": 20}),
    ],
    noise_args=dict(
        open_book_missing_fraction=0.5,
    ),
    # Input perturbations used to probe robustness
    perturbation_params=["misspelling", "synonym"],
)
HELM's strength is its comprehensive coverage. It evaluates not just accuracy but also calibration (how well the model's confidence matches its correctness), fairness (across demographic groups), and robustness (against perturbations).
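Calibration can be made concrete with Expected Calibration Error (ECE): predictions are grouped into confidence bins, and the gap between each bin's accuracy and its mean confidence is averaged, weighted by bin size. The following is a minimal sketch assuming equal-width bins; it is not HELM's implementation, and the sample confidences are made up for illustration.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """ECE: bin-size-weighted mean of |accuracy - mean confidence| per bin."""
    n = len(confidences)
    bins = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins over [0, 1]; confidence 1.0 falls into the last bin.
        idx = min(int(conf * num_bins), num_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += (len(items) / n) * abs(accuracy - avg_conf)
    return ece


# Hypothetical model outputs: four answers at 90% confidence, two at 60%.
confidences = [0.9, 0.9, 0.9, 0.9, 0.6, 0.6]
correct = [True, True, False, False, True, False]
ece = expected_calibration_error(confidences, correct)
print(f"ECE = {ece:.3f}")
```

In this example the model is right only half the time in both bins, so it is overconfident, most severely at the 90% level. A perfectly calibrated model would score 0.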
Evaluation Methodology
Few-Shot Evaluation
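Few-Shot Accuracy
$$\mathrm{Acc}_k = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left[\hat{y}_i^{(k)} = y_i\right]$$
Here,
- $\mathrm{Acc}_k$ = Accuracy with $k$ few-shot examples
- $N$ = Number of test examples
- $\hat{y}_i^{(k)}$ = Model prediction for test example $i$, given $k$ few-shot examples in the prompt
- $y_i$ = Ground truth for test example $i$
- $k$ = Number of few-shot examples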
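Few-shot accuracy is simply the fraction of test examples whose prediction matches the ground truth, computed separately for each number of few-shot examples. A minimal sketch follows; the predictions are hypothetical and only illustrate comparing two values of k on the same test items.

```python
def accuracy(predictions, references):
    """Fraction of predictions that exactly match their reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("need at least one test example")
    matches = sum(p == r for p, r in zip(predictions, references))
    return matches / len(references)


# Hypothetical predictions on the same four test items with k = 0 and k = 5.
references = ["B", "A", "D", "C"]
preds_by_k = {
    0: ["B", "C", "D", "A"],
    5: ["B", "A", "D", "A"],
}
for k, preds in preds_by_k.items():
    print(f"{k}-shot accuracy: {accuracy(preds, references):.2f}")
```

Keeping the test items fixed while varying only k isolates the effect of the in-context examples.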
Perplexity
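Perplexity
$$\mathrm{PPL} = \exp\left(-\frac{1}{T} \sum_{t=1}^{T} \log p(x_t \mid x_{<t})\right)$$
Here,
- $T$ = Sequence length
- $p(x_t \mid x_{<t})$ = Model probability for token $t$ given the preceding tokens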
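Perplexity is the exponential of the average negative log-probability the model assigns to each token. The sketch below assumes the per-token natural-log probabilities have already been obtained from a model; the values used are synthetic.

```python
import math


def perplexity(token_log_probs):
    """exp of the mean negative log-probability over a token sequence."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


# Sanity check: a model that assigns probability 1/4 to every token
# is exactly as uncertain as a uniform choice among 4 options.
log_probs = [math.log(0.25)] * 8
ppl = perplexity(log_probs)
print(f"perplexity = {ppl:.2f}")
```

Lower is better: a perplexity of 1 means the model predicted every token with certainty.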
Inference Cost
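Cost per Token
$$\text{Cost per token} = \frac{C_{\mathrm{GPU}}}{3600 \cdot R}$$
Here,
- $C_{\mathrm{GPU}}$ = Hourly cost of GPU rental
- $R$ = Inference throughput, taken here to be in tokens per second (hence the 3600 seconds per hour)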
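As a worked example, cost per token is the hourly GPU price divided by the number of tokens generated in an hour. The sketch assumes throughput is measured in tokens per second; the price and throughput below are hypothetical, not measurements.

```python
def cost_per_token(gpu_cost_per_hour, tokens_per_second):
    """Hourly GPU cost divided by tokens generated per hour."""
    if tokens_per_second <= 0:
        raise ValueError("throughput must be positive")
    return gpu_cost_per_hour / (tokens_per_second * 3600)


# Hypothetical setup: a $2.00/hour GPU sustaining 1000 tokens per second.
cost = cost_per_token(gpu_cost_per_hour=2.0, tokens_per_second=1000.0)
print(f"${cost * 1_000_000:.2f} per million tokens")
```

Reporting the figure per million tokens keeps the numbers readable and makes it easy to compare against hosted API pricing.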
Practical Evaluation
Running a Complete Evaluation
from lm_eval import evaluator
from lm_eval.models.huggingface import HFLM


def evaluate_model(model_path, task_list, num_fewshot=0):
    """Run comprehensive model evaluation."""
    # Load model
    model = HFLM(
        pretrained=model_path,
        device="cuda:0",
        batch_size=8,
        dtype="float16",
    )
    # Run evaluation
    results = evaluator.simple_evaluate(
        model=model,
        tasks=task_list,
        num_fewshot=num_fewshot,
        log_samples=True,
    )
    # Format results: round float metrics, pass other values through unchanged
    formatted = {}
    for task, metrics in results["results"].items():
        formatted[task] = {
            k: f"{v:.4f}" if isinstance(v, float) else v
            for k, v in metrics.items()
        }
    return formatted


# Example usage
task_list = ["mmlu", "hellaswag", "arc_challenge", "gsm8k", "truthfulqa"]
results = evaluate_model("meta-llama/Llama-2-7b-hf", task_list, num_fewshot=5)
for task, metrics in results.items():
    print(f"{task}: {metrics}")
Building an Evaluation Suite
class EvaluationSuite:
    """Custom evaluation suite for specific use cases."""

    def __init__(self, name, tasks, weights=None):
        self.name = name
        self.tasks = tasks
        # Default to equal weights when none are given
        self.weights = weights or {t: 1.0 for t in tasks}

    def evaluate(self, model):
        """Run all tasks and compute weighted score."""
        scores = {task: self._run_task(model, task) for task in self.tasks}
        # Compute weighted average
        weighted_sum = sum(scores[t] * self.weights[t] for t in self.tasks)
        total_weight = sum(self.weights[t] for t in self.tasks)
        overall = weighted_sum / total_weight
        return {
            "overall": overall,
            "per_task": scores,
        }

    def _run_task(self, model, task):
        """Run a single task and return its score as a float."""
        # Task-specific evaluation logic goes here
        raise NotImplementedError(f"No runner implemented for task {task!r}")


# Example: Code-focused evaluation suite
code_suite = EvaluationSuite(
    name="code_evaluation",
    tasks=["humaneval", "mbpp", "apps", "code_contests"],
    weights={"humaneval": 0.4, "mbpp": 0.3, "apps": 0.2, "code_contests": 0.1},
)
Design your evaluation suite based on your specific use case. A code assistant should weight code tasks higher, while a chatbot should prioritize conversational quality and safety metrics.
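To see how much the weighting matters, the sketch below applies the code-focused weights and equal weights to the same per-task scores. The scores are hypothetical placeholders, not real benchmark results, and `weighted_score` is an illustrative helper rather than part of any framework.

```python
def weighted_score(scores, weights):
    """Weighted average of per-task scores."""
    total_weight = sum(weights[t] for t in scores)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(scores[t] * weights[t] for t in scores) / total_weight


# Hypothetical per-task scores for one model (fraction of problems solved).
scores = {"humaneval": 0.30, "mbpp": 0.40, "apps": 0.10, "code_contests": 0.05}

code_weights = {"humaneval": 0.4, "mbpp": 0.3, "apps": 0.2, "code_contests": 0.1}
equal_weights = {task: 1.0 for task in scores}

weighted = weighted_score(scores, code_weights)
unweighted = weighted_score(scores, equal_weights)
print(f"code-focused weights: {weighted:.4f}")
print(f"equal weights:        {unweighted:.4f}")
```

The same model gets a different headline number under each weighting, so the weights should be fixed before any model is evaluated and reported alongside the overall score.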
Common Pitfalls
Evaluation Anti-Patterns
| Anti-Pattern | Problem | Solution |
|---|---|---|
| Overfitting to benchmarks | Good scores but poor real-world performance | Include held-out tasks |
| Ignoring calibration | Confident but wrong answers | Report ECE alongside accuracy |
| Single metric | Misses important failure modes | Report multiple metrics |
| No diversity | Only tests narrow capabilities | Cover 10+ task categories |
| Inconsistent prompting | Unfair comparison across models | Use standardized frameworks |
The "benchmark saturation" problem occurs when models achieve near-human performance on standard benchmarks but still fail on real-world tasks. This is why HELM and OpenCompass include diverse tasks beyond traditional NLP benchmarks.
Practice Exercises
- Conceptual: Explain why evaluating LLMs is harder than evaluating traditional ML models. What challenges are unique to generative models?
- Practical: Use lm-eval-harness to evaluate a 7B model on 5 different tasks. Create a radar chart comparing its strengths and weaknesses.
- Analysis: Compare the evaluation results of two models (e.g., LLaMA-2 vs Mistral) across knowledge, reasoning, and safety tasks. Which model is better for which use case?
- Research: Design an evaluation suite for a medical LLM. What tasks and metrics would you include? How would you ensure the evaluation is fair and comprehensive?
Key Takeaways:
- lm-eval-harness is the standard open-source framework with 200+ tasks
- OpenCompass provides comprehensive coverage with 100+ benchmarks including Chinese
- HELM evaluates across 7 dimensions including fairness and safety
- Evaluation should cover knowledge, reasoning, code, safety, and efficiency
- Few-shot evaluation measures how well a model learns from k in-context examples; zero-shot evaluation measures performance with no examples in the prompt
- Always report multiple metrics (accuracy, calibration, robustness)
- Design evaluation suites tailored to your specific use case
What to Learn Next
- Human Evaluation of LLMs: Chatbot Arena, preference studies, and annotation.
- Automated LLM Evaluation: LLM-as-judge, G-Eval, and automatic metrics.
- LLM Evaluation Benchmarks: Understanding standard benchmarks for LLMs.
- LLM Safety and Red Teaming: Testing and hardening LLMs against attacks.
- Building Production LLM Apps: From prototype to production deployment.
- Constitutional AI: Training safe and aligned language models.