LLM Evaluation Benchmarks

Evaluating large language models is one of the most challenging problems in AI. Unlike traditional ML tasks with clear metrics, LLMs are general-purpose systems whose capabilities span reasoning, creativity, knowledge, and more. This tutorial covers the major benchmarks and evaluation methodologies used to assess LLM performance.

Why Evaluation is Hard

LLMs exhibit emergent capabilities that are difficult to measure with simple metrics:

  • Open-ended generation has no single correct answer
  • Reasoning chains require evaluating intermediate steps
  • Safety requires testing for harms that may not appear in standard benchmarks
  • Alignment involves subjective qualities like helpfulness and honesty

Perplexity

Perplexity is the most fundamental metric for language models, measuring how well the model predicts the next token.

Perplexity (PPL) is the exponentiated average negative log-likelihood of a sequence, measuring how "surprised" the model is by the test data. Lower perplexity indicates better predictive performance.

Perplexity

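\text{PPL}(\mathbf{x}) = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log P_\theta(x_i \mid x_{<i})\right)

Here,

  • \mathbf{x} = the evaluated token sequence (x_1, ..., x_N)
  • N = number of tokens in the sequence
  • x_{<i} = the tokens preceding position i (the context)
  • P_\theta(x_i \mid x_{<i}) = probability that the model with parameters \theta assigns to token x_i given that context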

Cross-Entropy Loss

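H(\mathbf{x}, P_\theta) = -\frac{1}{N} \sum_{i=1}^{N} \log P_\theta(x_i \mid x_{<i})

Here,

  • H(\mathbf{x}, P_\theta) = average negative log-likelihood per token, in nats when \log is the natural logarithm
  • N = number of tokens in the sequence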

The relationship between perplexity and cross-entropy:

Perplexity is simply the exponentiated cross-entropy: PPL = exp(H). A perplexity of 10 means the model is, on average, as uncertain as if it were choosing uniformly among 10 possibilities.
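
As a quick numerical check of PPL = exp(H), the sketch below computes cross-entropy and perplexity from a list of per-token probabilities. The function names and the toy probabilities are illustrative assumptions, not part of any library.

```python
import math

def cross_entropy(token_probs: list[float]) -> float:
    """Average negative log-likelihood (in nats) of the observed tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def perplexity(token_probs: list[float]) -> float:
    """Perplexity is the exponentiated cross-entropy: PPL = exp(H)."""
    return math.exp(cross_entropy(token_probs))

# A model that assigns probability 1/10 to every observed token is as
# uncertain as a uniform choice among 10 options.
uniform_ppl = perplexity([0.1] * 20)

# Hypothetical per-token probabilities for a more confident model.
confident_ppl = perplexity([0.9, 0.8, 0.95, 0.7])

print(f"uniform: {uniform_ppl:.2f}, confident: {confident_ppl:.2f}")
```

The uniform case recovers a perplexity of 10 (up to floating-point error), matching the interpretation above, while the more confident model lands close to 1.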

Computing Perplexity

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

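def compute_perplexity(model_name: str, text: str, stride: int = 512, max_length: int = 2048) -> float:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()
    
    # Tokenize the full text; the sliding window below handles long inputs.
    # max_length must not exceed the model's context size, and stride should
    # be at most max_length so that no tokens are skipped.
    encodings = tokenizer(text, return_tensors="pt")
    seq_len = encodings.input_ids.size(1)
    
    nll_sum = 0.0
    n_scored = 0
    prev_end_loc = 0
    
    for begin_loc in range(0, seq_len, stride):
        end_loc = min(begin_loc + max_length, seq_len)
        trg_len = end_loc - prev_end_loc  # tokens not scored by an earlier window
        input_ids = encodings.input_ids[:, begin_loc:end_loc]
        target_ids = input_ids.clone()
        
        # Mask the overlapping prefix so it serves as context only
        target_ids[:, :-trg_len] = -100
        
        with torch.no_grad():
            outputs = model(input_ids, labels=target_ids)
        
        # outputs.loss is the mean NLL over scored tokens (labels are shifted
        # by one internally), so weight it by the number of scored tokens
        n_tokens = (target_ids[:, 1:] != -100).sum().item()
        nll_sum += outputs.loss.item() * n_tokens
        n_scored += n_tokens
        prev_end_loc = end_loc
        
        if end_loc == seq_len:
            break
    
    return torch.exp(torch.tensor(nll_sum / n_scored)).item()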

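Perplexity is useful for comparing models of similar size on the same test set, provided they share a tokenizer (the metric is normalized per token, so values computed under different tokenizations are not directly comparable). It is not a reliable indicator of downstream task performance: a model with lower perplexity may still perform worse on reasoning tasks.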

MMLU (Massive Multitask Language Understanding)

MMLU measures knowledge across 57 subjects spanning STEM, humanities, social sciences, and more.

Benchmark Structure

| Category        | Example subjects          | Questions |
|-----------------|---------------------------|-----------|
| STEM            | Physics, Math, CS         | 14,042    |
| Humanities      | History, Philosophy, Law  | 11,039    |
| Social Sciences | Economics, Psychology     | 8,302     |
| Other           | Misc, Professional        | 7,530     |

Evaluation Protocol

MMLU uses 5-shot evaluation with multiple-choice questions:

def format_mmlu_prompt(question, options, examples=None):
    prompt = "Answer the following multiple-choice question.\n\n"
    
    if examples:
        for ex in examples:
            prompt += f"Question: {ex['question']}\n"
            for i, opt in enumerate(ex['options']):
                prompt += f"({chr(65+i)}) {opt}\n"
            prompt += f"Answer: {ex['answer']}\n\n"
    
    prompt += f"Question: {question}\n"
    for i, opt in enumerate(options):
        prompt += f"({chr(65+i)}) {opt}\n"
    prompt += "Answer:"
    
    return prompt

def evaluate_mmlu(model, tokenizer, dataset, k=5):
    correct = 0
    total = 0
    
    for question in dataset:
        examples = question['few_shot_examples'][:k]
        prompt = format_mmlu_prompt(
            question['question'],
            question['options'],
            examples
        )
        
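        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        with torch.no_grad():
            # Greedy decoding of a single token: the predicted answer letter
            outputs = model.generate(**inputs, max_new_tokens=1, do_sample=False)
        
        # Decode only the new token and strip the leading space many tokenizers emit
        predicted = tokenizer.decode(outputs[0][-1:], skip_special_tokens=True).strip()
        if predicted == question['answer']:
            correct += 1
        total += 1
    
    return correct / total if total else 0.0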
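
evaluate_mmlu returns one overall accuracy, but a per-category breakdown is often more informative. The sketch below shows one way to aggregate results, assuming each prediction has been recorded as a (category, is_correct) pair. That record format and the function name are assumptions for illustration, not part of the MMLU release.

```python
from collections import defaultdict

def accuracy_by_category(records):
    """Aggregate (category, is_correct) pairs into accuracy statistics."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for category, is_correct in records:
        total[category] += 1
        correct[category] += int(is_correct)

    per_category = {c: correct[c] / total[c] for c in total}
    # Micro average: every question counts equally.
    micro = sum(correct.values()) / sum(total.values())
    # Macro average: every category counts equally, regardless of size.
    macro = sum(per_category.values()) / len(per_category)
    return per_category, micro, macro

# Hypothetical results for six questions in two categories.
records = [
    ("STEM", True), ("STEM", False), ("STEM", False), ("STEM", True),
    ("Humanities", True), ("Humanities", True),
]
per_category, micro, macro = accuracy_by_category(records)
print(per_category, round(micro, 4), round(macro, 4))
```

Micro and macro averages diverge when categories differ in size, so it helps to state which one a reported score uses.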

HumanEval (Code Generation)

HumanEval evaluates a model's ability to generate correct Python functions from docstrings.

Pass@k (Code Generation)

\text{Pass@k} = \mathbb{E}_{\text{problems}} \left[ 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \right]

Here,

  • n = number of samples generated per problem (n ≥ k)
  • c = number of those samples that pass all unit tests
  • k = number of samples the metric allows per problem

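For a single problem, the term inside the expectation is an unbiased estimate of Pass@k. The code below evaluates the same quantity as a running product, which avoids computing large binomial coefficients.

Unbiased Pass@k Estimator

\widehat{\text{Pass@k}} = 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}

Here,

  • \widehat{\text{Pass@k}} = estimated probability that at least one of k samples is correct
  • n = number of samples generated for the problem
  • c = number of correct samples among the n
  • k = sample budget, with k ≤ n

import math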

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0
    return 1.0 - math.prod(1.0 - k / (n - i) for i in range(c))

def evaluate_humaneval(model, tokenizer, problems, n_samples=200, k_values=(1, 10, 100)):
    results = {}
    
    for problem in problems:
        prompt = f"def {problem['function_name']}({problem['signature']}):\n    \"\"\"{problem['docstring']}\"\"\"\n"
        
        samples = []
        for _ in range(n_samples):
            inputs = tokenizer(prompt, return_tensors="pt")
            with torch.no_grad():
                output = model.generate(
                    **inputs,
                    max_new_tokens=512,
                    temperature=0.8,
                    do_sample=True
                )
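            # Keep only the newly generated tokens (the completion)
            code = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
            samples.append(code)
        
        # run_test_cases is an assumed helper (not defined in this lesson) that
        # executes prompt + completion against the problem's unit tests in a
        # sandbox and returns True only if every test passes
        c = sum(1 for s in samples if run_test_cases(prompt + s, problem['test_cases']))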
        
        results[problem['task_id']] = {
            k: pass_at_k(n_samples, c, k) for k in k_values
        }
    
    return results
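
evaluate_humaneval returns one Pass@k value per problem, while the benchmark score is the expectation over problems. The sketch below evaluates the closed-form estimator with math.comb and averages it across problems. The task names and sample counts are made up for illustration, and the per_problem dictionary is assumed to have the same shape as the results returned above.

```python
import math

def pass_at_k_binomial(n: int, c: int, k: int) -> float:
    """Pass@k from the closed form 1 - C(n-c, k) / C(n, k)."""
    # math.comb returns 0 when k > n - c, so the estimate is 1.0 in that case.
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

def mean_pass_at_k(per_problem: dict, k: int) -> float:
    """Average per-problem Pass@k, i.e. the expectation over problems."""
    return sum(scores[k] for scores in per_problem.values()) / len(per_problem)

# Hypothetical results: 5 samples per problem, with 2 and 0 passing samples.
per_problem = {
    "task_a": {k: pass_at_k_binomial(5, 2, k) for k in (1, 2)},
    "task_b": {k: pass_at_k_binomial(5, 0, k) for k in (1, 2)},
}

for k in (1, 2):
    print(f"pass@{k} = {mean_pass_at_k(per_problem, k):.3f}")
```

For small n the closed form is convenient for checking the product-based pass_at_k; for large n the product form avoids very large intermediate integers.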

GSM8K (Math Reasoning)

GSM8K tests grade-school math reasoning with multi-step word problems.

Evaluation Metrics

| Metric             | Description                            | Calculation        |
|--------------------|----------------------------------------|--------------------|
| Exact Match        | Final answer matches exactly           | Binary per problem |
| Execution Accuracy | Code execution produces correct answer | Binary per problem |
| Reasoning Score    | Intermediate steps are valid           | 0-1 per problem    |

def extract_answer(response: str) -> str:
    """Extract final numerical answer from response."""
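    import re
    # Drop thousands separators so "1,000" and "1000" compare equal
    text = response.replace(',', '')
    # GSM8K reference solutions end with "#### <answer>"
    match = re.search(r'####\s*(-?\d+(?:\.\d+)?)', text)
    if match:
        return match.group(1)
    # Fallback: last number in the response (a trailing period is not captured)
    numbers = re.findall(r'-?\d+(?:\.\d+)?', text)
    return numbers[-1] if numbers else ""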

def evaluate_gsm8k(model, tokenizer, dataset):
    correct = 0
    total = 0
    
    for problem in dataset:
        prompt = f"Question: {problem['question']}\n\nAnswer:"
        
        inputs = tokenizer(prompt, return_tensors="pt")
        with torch.no_grad():
            output = model.generate(
                **inputs,
                max_new_tokens=512,
                do_sample=False  # greedy decoding; temperature only applies when sampling
            )
        
        response = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:])
        predicted = extract_answer(response)
        ground_truth = extract_answer(problem['answer'])
        
        if predicted == ground_truth:
            correct += 1
        total += 1
    
    return correct / total
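
The string comparison predicted == ground_truth in evaluate_gsm8k treats "18" and "18.0" as different answers. One possible refinement, sketched below, compares answers numerically when both parse as numbers and falls back to exact text match otherwise. The helper name answers_match and the tolerance value are assumptions for illustration.

```python
import math

def answers_match(predicted: str, ground_truth: str, tol: float = 1e-6) -> bool:
    """Compare two answer strings numerically, falling back to exact match."""
    try:
        # Remove thousands separators and a trailing sentence period.
        pred = float(predicted.replace(",", "").strip().rstrip("."))
        truth = float(ground_truth.replace(",", "").strip().rstrip("."))
    except ValueError:
        # At least one side is not a plain number; compare the raw text.
        return predicted.strip() == ground_truth.strip()
    return math.isclose(pred, truth, rel_tol=tol, abs_tol=tol)

print(answers_match("18.0", "18"))     # numerically equal
print(answers_match("1,000", "1000"))  # separators ignored
print(answers_match("18", "19"))       # different values
```

Whether numeric tolerance is appropriate depends on the benchmark's rules; a strict exact-match protocol would keep the original string comparison.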

MT-Bench and Chatbot Arena

MT-Bench

MT-Bench evaluates multi-turn conversation quality using GPT-4 as a judge:

MT_BENCH_CATEGORIES = [
    "writing", "roleplay", "reasoning", "math",
    "coding", "extraction", "stem", "humanities"
]

MT_BENCH_JUDGE_PROMPT = """[System]
Please act as an impartial judge and evaluate the quality of the response provided by an AI assistant to the user question displayed below. Rate on a scale of 1 to 10.

[Question]
{question}

[Assistant Response]
{response}

[Rating]
Provide your rating on a scale of 1 to 10. Explain your rating briefly."""

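Chatbot Arena (Elo Rating)

Chatbot Arena uses blind pairwise comparisons with human voting to compute Elo ratings:

Elo Rating Update

R_{\text{new}} = R_{\text{old}} + K(S - E)

Here,

  • R_{\text{new}} = the model's rating after the comparison
  • R_{\text{old}} = the model's rating before the comparison
  • K = update step size (K-factor), which controls how far one result moves the rating
  • S = actual outcome: 1 for a win, 0.5 for a tie, 0 for a loss
  • E = expected score given both models' current ratings; in the standard Elo formulation, E = 1 / (1 + 10^{(R_{\text{opp}} - R_{\text{old}}) / 400})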
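
The update rule can be sketched in a few lines of Python. This is a minimal illustration under the standard Elo formulation; the K-factor of 32 and the scale of 400 are conventional chess defaults assumed here, and Chatbot Arena's actual settings may differ.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one comparison.

    score_a is 1.0 if A wins, 0.5 for a tie, and 0.0 if A loses.
    """
    e_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - e_a)
    # B's actual and expected scores are the complements of A's.
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b

# Two equally rated models; A wins the human vote.
new_a, new_b = elo_update(1000.0, 1000.0, score_a=1.0)
print(new_a, new_b)
```

Because the two expected scores sum to 1, each update moves the same number of points from one model to the other, so the total rating in the pool is conserved.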

LLM-as-Judge Evaluation

Using GPT-4 or other strong models as evaluation judges:

def llm_judge(
    question: str,
    response: str,
    criteria: dict,
    judge_model,
    judge_tokenizer
) -> dict:
    judge_prompt = f"""You are an expert evaluator. Rate the following response on these criteria:

{chr(10).join(f'- {k}: {v}' for k, v in criteria.items())}

Question: {question}
Response: {response}

Provide ratings (1-10) for each criterion and an overall score.
Format your response as JSON."""
    
    inputs = judge_tokenizer(judge_prompt, return_tensors="pt")
    with torch.no_grad():
        output = judge_model.generate(**inputs, max_new_tokens=256)
    
    judgment = judge_tokenizer.decode(output[0][inputs["input_ids"].shape[1]:])
    return parse_judgment(judgment)
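
llm_judge calls parse_judgment, which is not defined in this lesson. A minimal sketch of what such a helper might look like follows. It assumes the judge embeds a single JSON object somewhere in its output, and the error-dictionary format is an illustrative choice.

```python
import json
import re

def parse_judgment(judgment: str) -> dict:
    """Parse the JSON object embedded in a judge model's raw output."""
    # Judges often surround the JSON with prose, so take the span from the
    # first "{" to the last "}" instead of parsing the whole string.
    match = re.search(r"\{.*\}", judgment, flags=re.DOTALL)
    if match is None:
        return {"error": "no JSON object found", "raw": judgment}
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return {"error": "invalid JSON", "raw": judgment}

example = 'Here is my evaluation: {"accuracy": 8, "clarity": 7, "overall": 7}'
print(parse_judgment(example))
```

Returning an error dictionary instead of raising keeps a long evaluation run going when the judge occasionally produces malformed output; the failures can then be counted and reported separately.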

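Strong LLM judges such as GPT-4 have been reported to reach over 80% agreement with human evaluators on response-quality comparisons. Judges have known biases, however: they can favor their own outputs or those of similar models, prefer longer answers, and be swayed by the order in which responses are presented.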

Evaluation Pitfalls and Limitations

Common Pitfalls

  1. Data Contamination: Test data may appear in training data
  2. Benchmark Gaming: Models can overfit to specific benchmark formats
  3. Metric Sensitivity: Small changes in evaluation protocol can change rankings
  4. Missing Capabilities: Benchmarks may not capture important real-world skills
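
Data contamination can be screened for, at least roughly, by checking whether test examples share long word sequences with the training corpus. The sketch below is one simple heuristic, not a standard tool: the function names, the whitespace tokenization, and the choice of n = 8 are all assumptions for illustration.

```python
def ngrams(text: str, n: int) -> set:
    """Set of word-level n-grams in text (lowercased, whitespace-split)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_rate(test_examples, training_text, n=8):
    """Fraction of test examples sharing at least one n-gram with training_text."""
    train_ngrams = ngrams(training_text, n)
    flagged = sum(1 for ex in test_examples if ngrams(ex, n) & train_ngrams)
    return flagged / len(test_examples) if test_examples else 0.0

# Toy data: the first test example appears verbatim in the training text.
training_text = "the quick brown fox jumps over the lazy dog near the river bank"
test_examples = [
    "quick brown fox jumps over the lazy dog",
    "a completely different sentence about evaluating language models today",
]
rate = contamination_rate(test_examples, training_text, n=8)
print(f"{rate:.0%} of test examples flagged")
```

Exact n-gram overlap misses paraphrased or reformatted test items, so a low rate from a check like this does not rule out contamination.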

Evaluation Anti-Patterns

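# BAD: Only evaluating on one benchmark
score = evaluate_gsm8k(model, tokenizer, test_set)

# GOOD: Comprehensive evaluation
# Note: evaluate_safety and evaluate_helpfulness are placeholders that this
# lesson does not define; the latter would aggregate llm_judge scores over a
# test set. held_out_text is a single string of held-out text.
evaluation_suite = {
    "perplexity": compute_perplexity(model_name, held_out_text),
    "mmlu": evaluate_mmlu(model, tokenizer, mmlu_test),
    "humaneval": evaluate_humaneval(model, tokenizer, humaneval),
    "gsm8k": evaluate_gsm8k(model, tokenizer, gsm8k_test),
    "safety": evaluate_safety(model, tokenizer, safety_test),
    "helpfulness": evaluate_helpfulness(model, tokenizer, helpfulness_test),
}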

Limitations of Automatic Metrics

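| Metric      | Strengths         | Weaknesses                |
|-------------|-------------------|---------------------------|
| Perplexity  | Fast, consistent  | Doesn't measure reasoning |
| Exact Match | Clear, objective  | Misses partial credit     |
| Pass@k      | Task-specific     | Expensive to compute      |
| LLM Judge   | Flexible, nuanced | Expensive, potential bias |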

Always evaluate LLMs on multiple benchmarks across different capability dimensions. No single benchmark captures all aspects of model quality. Combine automatic metrics with human evaluation for critical applications.

Building an Evaluation Pipeline

class LLMEvaluator:
    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer
        self.results = {}
    
    def evaluate(self, benchmarks: dict):
        for name, (eval_fn, dataset) in benchmarks.items():
            print(f"Evaluating {name}...")
            self.results[name] = eval_fn(self.model, self.tokenizer, dataset)
        
        return self.results
    
    def compare_with_baseline(self, baseline_results: dict) -> dict:
        comparison = {}
        for benchmark in self.results:
            if benchmark in baseline_results:
                comparison[benchmark] = {
                    "current": self.results[benchmark],
                    "baseline": baseline_results[benchmark],
                    "delta": self.results[benchmark] - baseline_results[benchmark]
                }
        return comparison
    
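    def generate_report(self) -> str:
        report = "# LLM Evaluation Report\n\n"
        for benchmark, score in self.results.items():
            report += f"## {benchmark}\n"
            # Some evaluators (e.g. evaluate_humaneval) return a dict, not a float
            if isinstance(score, (int, float)):
                report += f"- Score: {score:.4f}\n\n"
            else:
                report += f"- Score: {score}\n\n"
        return report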

Summary

  • Perplexity measures next-token prediction quality: PPL = exp(H)
  • MMLU evaluates knowledge across 57 academic subjects
  • HumanEval measures code generation capability via Pass@k
  • GSM8K tests multi-step mathematical reasoning
  • MT-Bench scores multi-turn conversations with an LLM judge; Chatbot Arena ranks models by blind human pairwise comparison
  • LLM-as-judge provides scalable evaluation with >80% human agreement
  • Always use multiple benchmarks and combine automatic with human evaluation
  • Watch for data contamination and benchmark gaming

Practice Exercises

  1. Perplexity Comparison: Compute perplexity for 3 different models on the WikiText-2 dataset. How does perplexity correlate with model size?

  2. Benchmark Implementation: Implement a 5-shot MMLU evaluation for a small language model. Report accuracy by subject category.

  3. Code Generation: Evaluate a model on HumanEval with k=1, k=10, and k=100. How does Pass@k improve with more samples?

  4. LLM-as-Judge: Use GPT-4 to evaluate 50 responses from different models on the MT-Bench dataset. Compare the rankings with your own judgments.

  5. Evaluation Pipeline: Build a complete evaluation pipeline that runs 4 different benchmarks and generates a comparison report.

