
LLM Evaluation Frameworks — Measuring What Matters

Evaluating LLMs requires standardized benchmarks, reproducible pipelines, and comprehensive task coverage. This guide covers the major evaluation frameworks and how to use them for rigorous model assessment.

lm-eval-harness — EleutherAI's framework for standardized evaluation
OpenCompass — Comprehensive evaluation with 100+ benchmarks
HELM — Holistic evaluation across many dimensions
Evaluation Design — Choosing the right metrics and tasks

If you can't measure it, you can't improve it.

LLM Evaluation Frameworks

Evaluating LLMs is challenging because they perform many different tasks. Evaluation frameworks provide standardized pipelines for running benchmarks, ensuring fair comparison across models, and covering diverse capabilities from reasoning to safety.

Definition: Evaluation Framework

An evaluation framework is a standardized system for assessing LLM capabilities across multiple tasks, providing reproducible results, and enabling fair comparison between models.

lm-eval-harness

Overview

Definition: lm-eval-harness

lm-eval-harness (EleutherAI) is the most widely used open-source framework for evaluating language models. It provides 200+ tasks, standardized prompts, and consistent evaluation metrics across models.

Installation and Usage

# Install lm-eval-harness
pip install lm-eval

# Run evaluation on a model
lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-2-7b-hf,dtype=float16 \
    --tasks hellaswag,arc_challenge,mmlu \
    --device cuda:0 \
    --batch_size 8

# Run MMLU with 5 few-shot examples and save the results
lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-2-7b-hf \
    --tasks mmlu \
    --num_fewshot 5 \
    --output_path results/

Task Categories

Category	Example Tasks	What It Measures
Knowledge	MMLU, ARC, TriviaQA	Factual knowledge
Reasoning	GSM8K, MATH, LogiQA	Mathematical/logical reasoning
Commonsense	HellaSwag, WinoGrande	Commonsense reasoning, coreference
Code	HumanEval, MBPP	Code generation
Safety	TruthfulQA, BBQ	Truthfulness, social bias
Instruction	IFEval (MT-Bench and AlpacaEval are judge-based and run with their own tooling)	Following instructions

Custom Task Implementation

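from lm_eval.api.instance import Instance
from lm_eval.api.metrics import mean
from lm_eval.api.registry import register_task
from lm_eval.api.task import Task

# Sketch against the class-based Task API of lm-eval 0.4.x. The API has
# changed between releases, so check it against your installed version.
@register_task("custom_qa")
class CustomQA(Task):
    """Custom QA task scored by whether greedy decoding yields the answer."""

    VERSION = 0
    OUTPUT_TYPE = "loglikelihood"

    def download(self, data_dir=None, cache_dir=None, download_mode=None):
        """Load evaluation data (called once by Task.__init__)."""
        # Load your custom dataset here. Each split should be a sequence of
        # dicts with "question" and "answer" string fields.
        self.dataset = {"validation": [], "test": []}

    def has_training_docs(self):
        return False

    def has_validation_docs(self):
        return True

    def has_test_docs(self):
        return True

    def validation_docs(self):
        return self.dataset["validation"]

    def test_docs(self):
        return self.dataset["test"]

    def fewshot_docs(self):
        """Provide the pool that few-shot examples are drawn from."""
        return self.validation_docs()

    def doc_to_text(self, doc):
        """Convert document to prompt text."""
        return f"Question: {doc['question']}\nAnswer:"

    def doc_to_target(self, doc):
        """Convert document to target answer (leading space separates it from the prompt)."""
        return " " + doc["answer"]

    def construct_requests(self, doc, ctx, **kwargs):
        """Build one loglikelihood request for the gold answer."""
        return Instance(
            request_type="loglikelihood",
            doc=doc,
            arguments=(ctx, self.doc_to_target(doc)),
            idx=0,
            **kwargs,
        )

    def process_results(self, doc, results):
        """Each loglikelihood result is a (log-probability, is_greedy) pair."""
        _, is_greedy = results[0]
        return {"acc": float(is_greedy)}

    def aggregation(self):
        return {"acc": mean}

    def higher_is_better(self):
        return {"acc": True}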

lm-eval-harness uses a standardized evaluation protocol: it formats prompts consistently, handles few-shot examples automatically, and computes metrics in a reproducible way. This ensures fair comparison between models.
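To make "handles few-shot examples automatically" concrete, here is a minimal sketch of how a k-shot prompt can be assembled reproducibly. It is not lm-eval-harness's own code: the function name `build_fewshot_prompt`, the seed value, and the toy documents are assumptions made for illustration. The key idea is that a fixed seed and a fixed template give every model exactly the same prompt.

```python
import random

def build_fewshot_prompt(train_docs, test_doc, k, seed=1234):
    """Return a k-shot prompt: k solved examples followed by the open question."""
    rng = random.Random(seed)          # fixed seed: same examples on every run
    shots = rng.sample(train_docs, k)  # sample k examples without replacement
    blocks = [f"Question: {d['question']}\nAnswer: {d['answer']}" for d in shots]
    blocks.append(f"Question: {test_doc['question']}\nAnswer:")
    return "\n\n".join(blocks)

# Toy documents, invented for this example
train = [
    {"question": "2 + 2?", "answer": "4"},
    {"question": "Capital of France?", "answer": "Paris"},
    {"question": "Opposite of hot?", "answer": "cold"},
]
prompt = build_fewshot_prompt(train, {"question": "3 + 5?"}, k=2)
print(prompt)
```

With `k=0` the same function returns only the open question, which is the zero-shot prompt.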

OpenCompass

Overview

Definition: OpenCompass

OpenCompass (2023) is a comprehensive evaluation platform with 100+ benchmarks covering 50+ capabilities. It supports both English and Chinese evaluation, with a web-based leaderboard for comparing models.

Configuration

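from mmengine.config import read_base
from opencompass.models import HuggingFaceCausalLM
from opencompass.partitioners import SizePartitioner
from opencompass.runners import SlurmRunner
from opencompass.tasks import OpenICLInferTask

# Dataset configuration: reuse the dataset configs that ship with OpenCompass.
# The relative import paths assume this file lives in the repository's
# configs/ directory; exact module names vary between OpenCompass versions.
with read_base():
    from .datasets.mmlu.mmlu_ppl import mmlu_datasets
    from .datasets.hellaswag.hellaswag_ppl import hellaswag_datasets
    from .datasets.ARC_c.ARC_c_ppl import ARC_c_datasets

datasets = [*mmlu_datasets, *hellaswag_datasets, *ARC_c_datasets]

# Model configuration
models = [
    dict(
        type=HuggingFaceCausalLM,
        abbr='llama-2-7b',
        path='meta-llama/Llama-2-7b-hf',
        model_kwargs=dict(
            torch_dtype='auto',
            device_map='auto',
        ),
        max_out_len=100,
        max_seq_len=2048,
        batch_size=8,
        run_cfg=dict(num_gpus=1),
    )
]

# Inference configuration: split the work by size and run it on Slurm
infer = dict(
    partitioner=dict(type=SizePartitioner, max_task_size=20000),
    runner=dict(
        type=SlurmRunner,
        max_num_workers=16,
        task=dict(type=OpenICLInferTask),
        retry=2,
    ),
)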

Benchmark Coverage

Domain	Benchmarks	Tasks
Knowledge	MMLU, C-Eval, CMMLU	57+ subjects
Reasoning	GSM8K, MATH, BBH	Math, logic
Language	HellaSwag, WinoGrande	Language understanding
Code	HumanEval, MBPP	Code generation
Safety	SafetyBench, TruthfulQA	Safety evaluation
Multilingual	XL-BEN, MGSM	Cross-lingual

HELM (Holistic Evaluation)

Overview

Definition: HELM

HELM (Liang et al., 2023) is a holistic evaluation framework from Stanford. It assesses LLMs on 42 scenarios using 7 metric categories, so that fairness, bias, and safety are measured alongside accuracy.

Evaluation Dimensions

Dimension	Metrics	Purpose
Accuracy	F1, Exact Match, ROUGE	Task performance
Calibration	ECE, Brier Score	Confidence accuracy
Robustness	Perturbation accuracy	Adversarial robustness
Fairness	Accuracy gaps across demographic groups and dialect perturbations	Disparate performance
Bias	Demographic representation, stereotypical associations	Social bias in generations
Toxicity	Toxicity score	Harmful outputs
Efficiency	Latency, throughput	Cost-effectiveness
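The robustness row can be illustrated with a small sketch: perturb each input with a typo and compare accuracy before and after. This is only a toy under stated assumptions, not HELM's implementation. The perturbation (`swap_adjacent_chars`), the keyword classifier (`keyword_sentiment`), and the four examples are all hypothetical stand-ins for a real model and dataset.

```python
import random

def swap_adjacent_chars(text, rng):
    """Introduce one typo by swapping two neighbouring characters."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def accuracy(predict, examples):
    """Fraction of (input, label) pairs that predict() gets right."""
    return sum(predict(x) == y for x, y in examples) / len(examples)

def robustness_gap(predict, examples, seed=0):
    """Return (clean accuracy, perturbed accuracy, drop between them)."""
    rng = random.Random(seed)  # fixed seed so the perturbations are repeatable
    perturbed = [(swap_adjacent_chars(x, rng), y) for x, y in examples]
    clean = accuracy(predict, examples)
    noisy = accuracy(predict, perturbed)
    return clean, noisy, clean - noisy

def keyword_sentiment(text):
    """Hypothetical stand-in for a model: brittle keyword matching."""
    return "positive" if "good" in text else "negative"

examples = [
    ("a good film", "positive"),
    ("really good acting", "positive"),
    ("a dull plot", "negative"),
    ("bad pacing", "negative"),
]
clean, noisy, gap = robustness_gap(keyword_sentiment, examples)
print(f"clean={clean:.2f} perturbed={noisy:.2f} drop={gap:.2f}")
```

A large drop signals that the model depends on surface form rather than meaning, which is what perturbation-based robustness metrics are meant to expose.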

HELM Scenarios

# Illustrative scenario description (a simplified sketch, not HELM's actual run-spec syntax)
scenario = dict(
    name="openbookqa",
    description="OpenBookQA dataset for commonsense reasoning",
    dataset=dict(
        name="openbookqa",
        split="test",
    ),
    metric_list=[
        dict(name="exact_match", args={"normalize": True}),
        dict(name="calibration_macro_avg", args={"num_bins": 20}),
    ],
    noise_args=dict(
        open_book_missing_fraction=0.5,
    ),
    perturbation_params=["misspelling", "synonym"],
)

HELM's strength is its comprehensive coverage. It evaluates not just accuracy but also calibration (how well the model's confidence matches its correctness), fairness (across demographic groups), and robustness (against perturbations).
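Calibration is easier to reason about with a worked example. The sketch below computes expected calibration error (ECE) with equal-width confidence bins: within each bin it compares average confidence to accuracy, then averages the gaps weighted by bin size. The function name, the default of 10 bins, and the sample numbers are assumptions for illustration, not HELM's code.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """ECE: bin-size-weighted average of |accuracy - mean confidence| per bin."""
    n = len(confidences)
    bins = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * num_bins), num_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in members) / len(members)
        acc = sum(ok for _, ok in members) / len(members)
        ece += len(members) / n * abs(acc - avg_conf)
    return ece

# Invented predictions: four answers at 90% confidence (half correct),
# two answers at 60% confidence (half correct).
confidences = [0.9, 0.9, 0.9, 0.9, 0.6, 0.6]
correct = [1, 1, 0, 0, 1, 0]
ece = expected_calibration_error(confidences, correct)
print(f"ECE = {ece:.2f}")  # ECE = 0.30
```

Here the model is 50% accurate in both bins, so the gaps are 0.4 and 0.1, and the weighted average is (4/6)(0.4) + (2/6)(0.1) = 0.30. A perfectly calibrated model would score 0.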

Evaluation Methodology

Few-Shot Evaluation

Few-Shot Accuracy

$$\text{Acc}_k = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left[\hat{y}_i = y_i \mid k \text{ examples}\right]$$

Here,

$\text{Acc}_k$ = accuracy with $k$ few-shot examples
$N$ = number of test examples
$\hat{y}_i$ = model prediction
$y_i$ = ground truth
$k$ = number of few-shot examples
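As a quick numerical check of the accuracy formula, the sketch below computes the score for two prompt settings on the same four test items. The predictions are invented for illustration and do not come from any real model.

```python
def accuracy(predictions, targets):
    """Fraction of predictions that exactly match the ground truth."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    return sum(p == t for p, t in zip(predictions, targets)) / len(targets)

targets = ["B", "A", "D", "C"]

# Hypothetical model outputs for k = 0 and k = 5 few-shot examples
preds_by_k = {
    0: ["B", "C", "D", "A"],
    5: ["B", "A", "D", "A"],
}

acc_by_k = {k: accuracy(preds, targets) for k, preds in preds_by_k.items()}
print(acc_by_k)  # {0: 0.5, 5: 0.75}
```

Reporting the score together with $k$ matters: the same model can look quite different at 0-shot and 5-shot, so numbers are only comparable when $k$ matches.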

Perplexity

$$\text{PPL} = \exp\left(-\frac{1}{T} \sum_{t=1}^{T} \log P(x_t \mid x_{<t})\right)$$

Here,

$T$ = sequence length
$P(x_t \mid x_{<t})$ = model probability for token $t$
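The perplexity formula translates directly into a few lines of Python once per-token log-probabilities are available. In this sketch the log-probabilities are made up: every token is given probability 0.25, so the result should be close to 4, matching the intuition that perplexity is the effective number of equally likely choices per token.

```python
import math

def perplexity(token_logprobs):
    """exp of the negative mean natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Hypothetical sequence of T = 4 tokens, each assigned probability 0.25
logprobs = [math.log(0.25)] * 4
ppl = perplexity(logprobs)
print(f"PPL = {ppl:.2f}")  # PPL = 4.00
```

Perplexities are only comparable between models that share a tokenizer, because $T$ counts tokens and different tokenizers split the same text differently.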

Inference Cost

Cost per Token

$$\text{Cost} = \frac{\text{GPU cost/hour}}{\text{tokens/hour}}$$

Here,

$\text{GPU cost/hour}$ = hourly cost of GPU rental
$\text{tokens/hour}$ = inference throughput
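A short worked example of the cost formula, using assumed numbers: a GPU rented at $2.00 per hour that sustains 1,000 tokens per second. Both figures are hypothetical. Throughput is usually measured per second, so the sketch converts it to tokens per hour first.

```python
def cost_per_token(gpu_cost_per_hour, tokens_per_second):
    """Cost of generating one token, in the same currency as the GPU price."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_cost_per_hour / tokens_per_hour

# Hypothetical figures: $2.00/hour GPU, 1000 tokens/second throughput
cost = cost_per_token(gpu_cost_per_hour=2.0, tokens_per_second=1000)
print(f"${cost * 1_000_000:.2f} per million tokens")  # $0.56 per million tokens
```

Quoting the result per million tokens keeps it readable and makes it easy to compare against hosted API pricing.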

Practical Evaluation

Running a Complete Evaluation

from lm_eval import evaluator
from lm_eval.models.huggingface import HFLM

def evaluate_model(model_path, task_list, num_fewshot=0):
    """Run comprehensive model evaluation."""

    # Load model
    model = HFLM(
        pretrained=model_path,
        device="cuda:0",
        batch_size=8,
        dtype="float16"
    )

    # Run evaluation (num_fewshot applies to every task in task_list)
    results = evaluator.simple_evaluate(
        model=model,
        tasks=task_list,
        num_fewshot=num_fewshot,
        log_samples=True
    )

    # Format results: round floats, pass other values (e.g. aliases) through
    formatted = {}
    for task, metrics in results["results"].items():
        formatted[task] = {
            k: f"{v:.4f}" if isinstance(v, float) else v
            for k, v in metrics.items()
        }

    return formatted

# Example usage
task_list = ["mmlu", "hellaswag", "arc_challenge", "gsm8k", "truthfulqa"]
results = evaluate_model("meta-llama/Llama-2-7b-hf", task_list, num_fewshot=5)

for task, metrics in results.items():
    print(f"{task}: {metrics}")
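When comparing models it is often handier to pull one headline metric per task out of `results["results"]` than to print every entry. The sketch below assumes metric keys of the form `"acc,none"` (metric name, then filter name), which is how recent lm-eval releases label them; treat that key format as an assumption and check it against your own output. The helper name `primary_scores` and the numbers in `example` are invented for illustration and are not real model scores.

```python
def primary_scores(results, metric="acc"):
    """Pick one metric per task from an lm-eval style results dict."""
    scores = {}
    for task, metrics in results["results"].items():
        for key, value in metrics.items():
            name = key.split(",")[0]  # "acc,none" -> "acc"
            if name == metric and isinstance(value, (int, float)):
                scores[task] = value
                break
    return scores

# Hand-written dict in the assumed output shape (made-up numbers)
example = {
    "results": {
        "hellaswag": {"alias": "hellaswag", "acc,none": 0.57, "acc_stderr,none": 0.005},
        "arc_challenge": {"alias": "arc_challenge", "acc,none": 0.43, "acc_norm,none": 0.46},
    }
}

print(primary_scores(example))
print(primary_scores(example, metric="acc_norm"))
```

Tasks that do not report the requested metric are simply left out, so a missing key in the returned dict is a prompt to look at which metrics that task actually defines.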

Building an Evaluation Suite

class EvaluationSuite:
    """Custom evaluation suite for specific use cases."""
    
    def __init__(self, name, tasks, weights=None):
        self.name = name
        self.tasks = tasks
        self.weights = weights or {t: 1.0 for t in tasks}
    
    def evaluate(self, model):
        """Run all tasks and compute weighted score."""
        scores = {}
        
        for task in self.tasks:
            score = self._run_task(model, task)
            scores[task] = score
        
        # Compute weighted average
        weighted_sum = sum(scores[t] * self.weights[t] for t in self.tasks)
        total_weight = sum(self.weights[t] for t in self.tasks)
        
        overall = weighted_sum / total_weight
        
        return {
            "overall": overall,
            "per_task": scores
        }
    
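    def _run_task(self, model, task):
        """Run a single task and return its score as a float."""
        # Task-specific evaluation logic goes here. Raising (instead of
        # silently returning None) keeps evaluate() from failing later
        # with a confusing TypeError in the weighted sum.
        raise NotImplementedError(f"No runner implemented for task {task!r}")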

# Example: Code-focused evaluation suite
code_suite = EvaluationSuite(
    name="code_evaluation",
    tasks=["humaneval", "mbpp", "apps", "code_contests"],
    weights={"humaneval": 0.4, "mbpp": 0.3, "apps": 0.2, "code_contests": 0.1}
)

Design your evaluation suite based on your specific use case. A code assistant should weight code tasks higher, while a chatbot should prioritize conversational quality and safety metrics.
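A small numerical sketch shows why the weighting matters: the same per-task scores can rank two models differently depending on the use case. The model names, scores, and weights below are all hypothetical and chosen only to illustrate the arithmetic.

```python
def weighted_score(scores, weights):
    """Weighted average of per-task scores, normalized by total weight."""
    missing = set(scores) - set(weights)
    if missing:
        raise KeyError(f"no weight given for tasks: {sorted(missing)}")
    total_weight = sum(weights[t] for t in scores)
    return sum(scores[t] * weights[t] for t in scores) / total_weight

# Hypothetical per-category scores for two models
model_a = {"code": 0.60, "chat": 0.50, "safety": 0.70}
model_b = {"code": 0.40, "chat": 0.70, "safety": 0.80}

# Two use cases with different priorities
code_assistant = {"code": 0.7, "chat": 0.2, "safety": 0.1}
chatbot = {"code": 0.1, "chat": 0.5, "safety": 0.4}

for name, weights in [("code assistant", code_assistant), ("chatbot", chatbot)]:
    a = weighted_score(model_a, weights)
    b = weighted_score(model_b, weights)
    winner = "model_a" if a > b else "model_b"
    print(f"{name}: model_a={a:.2f} model_b={b:.2f} -> {winner}")
```

Under the code-assistant weights model_a comes out ahead (0.59 vs 0.50), while under the chatbot weights model_b does (0.71 vs 0.59), so the weights should be fixed before looking at results rather than tuned afterwards.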

Common Pitfalls

Evaluation Anti-Patterns

Anti-Pattern	Problem	Solution
Overfitting to benchmarks	Good scores but poor real-world performance	Include held-out tasks
Ignoring calibration	Confident but wrong answers	Report ECE alongside accuracy
Single metric	Misses important failure modes	Report multiple metrics
No diversity	Only tests narrow capabilities	Cover 10+ task categories
Inconsistent prompting	Unfair comparison across models	Use standardized frameworks

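The "benchmark saturation" problem occurs when models score near the ceiling (often at or above human level) on a standard benchmark, so the benchmark can no longer separate strong models from each other, even though those models still fail on real-world tasks. This is why HELM and OpenCompass include diverse tasks beyond traditional NLP benchmarks.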

Practice Exercises

Conceptual: Explain why evaluating LLMs is harder than evaluating traditional ML models. What challenges are unique to generative models?
Practical: Use lm-eval-harness to evaluate a 7B model on 5 different tasks. Create a radar chart comparing its strengths and weaknesses.
Analysis: Compare the evaluation results of two models (e.g., LLaMA-2 vs Mistral) across knowledge, reasoning, and safety tasks. Which model is better for which use case?
Research: Design an evaluation suite for a medical LLM. What tasks and metrics would you include? How would you ensure the evaluation is fair and comprehensive?

Key Takeaways:

lm-eval-harness is the standard open-source framework with 200+ tasks
OpenCompass provides comprehensive coverage with 100+ benchmarks including Chinese
HELM evaluates across 7 dimensions including fairness and safety
Evaluation should cover knowledge, reasoning, code, safety, and efficiency
Few-shot evaluation measures how well a model learns a task from k in-context examples; zero-shot evaluation measures what it can do from the instruction alone
Always report multiple metrics (accuracy, calibration, robustness)
Design evaluation suites tailored to your specific use case

What to Learn Next

-> Human Evaluation of LLMs: Chatbot Arena, preference studies, and annotation.

-> Automated LLM Evaluation: LLM-as-judge, G-Eval, and automatic metrics.

-> LLM Evaluation Benchmarks: Understanding standard benchmarks for LLMs.

-> LLM Safety and Red Teaming: Testing and hardening LLMs against attacks.

-> Building Production LLM Apps: From prototype to production deployment.

-> Constitutional AI: Training safe and aligned language models.
