LLM Evaluation Frameworks

LLM Evaluation Frameworks — Measuring What Matters

Evaluating LLMs requires standardized benchmarks, reproducible pipelines, and comprehensive task coverage. This guide covers the major evaluation frameworks and how to use them for rigorous model assessment.

  • lm-eval-harness — EleutherAI's framework for standardized evaluation
  • OpenCompass — Comprehensive evaluation with 100+ benchmarks
  • HELM — Holistic evaluation across many dimensions
  • Evaluation Design — Choosing the right metrics and tasks

If you can't measure it, you can't improve it.

LLM Evaluation Frameworks

Evaluating LLMs is challenging because they perform many different tasks. Evaluation frameworks provide standardized pipelines for running benchmarks, ensuring fair comparison across models, and covering diverse capabilities from reasoning to safety.

Definition: Evaluation Framework

An evaluation framework is a standardized system for assessing LLM capabilities across multiple tasks, providing reproducible results, and enabling fair comparison between models.

lm-eval-harness

Overview

Definition: lm-eval-harness

lm-eval-harness (EleutherAI) is the most widely used open-source framework for evaluating language models. It provides 200+ tasks, standardized prompts, and consistent evaluation metrics across models.

Installation and Usage

# Install lm-eval-harness
pip install lm-eval

# Run evaluation on a model
lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-2-7b-hf,dtype=float16 \
    --tasks hellaswag,arc_challenge,mmlu \
    --device cuda:0 \
    --batch_size 8

# Run 5-shot MMLU and write the results to disk
lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-2-7b-hf \
    --tasks mmlu \
    --num_fewshot 5 \
    --output_path results/

Task Categories

Category | Example Tasks | What It Measures
Knowledge | MMLU, ARC, TriviaQA | Factual knowledge
Reasoning | GSM8K, MATH, LogiQA | Mathematical/logical reasoning
Commonsense | HellaSwag, WinoGrande | Common sense, coreference
Code | HumanEval, MBPP | Code generation
Safety | TruthfulQA, BBQ | Truthfulness, social bias
Instruction | MT-Bench, AlpacaEval | Following instructions

Custom Task Implementation

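from lm_eval.api.instance import Instance
from lm_eval.api.metrics import mean
from lm_eval.api.registry import register_task
from lm_eval.api.task import Task

# Follows the class-based Task interface of lm-eval 0.4.x; signatures differ
# in other releases (and most built-in tasks are now YAML configs), so check
# the installed version before copying.
@register_task("custom_qa")
class CustomQA(Task):
    """Custom QA task: is the reference answer the model's greedy continuation?"""

    VERSION = 0
    OUTPUT_TYPE = "loglikelihood"

    def download(self, data_dir=None, cache_dir=None, download_mode=None):
        """Load evaluation data (called by Task.__init__)."""
        # Replace with your own loader. Each doc needs "question" and "answer".
        self.dataset = {"validation": [], "test": []}

    def has_training_docs(self):
        return False

    def has_validation_docs(self):
        return True

    def has_test_docs(self):
        return True

    def validation_docs(self):
        # With no training split, few-shot examples default to this split.
        return self.dataset["validation"]

    def test_docs(self):
        return self.dataset["test"]

    def doc_to_text(self, doc):
        """Convert document to prompt text."""
        return f"Question: {doc['question']}\nAnswer:"

    def doc_to_target(self, doc):
        """Convert document to target answer (leading space follows 'Answer:')."""
        return " " + doc["answer"]

    def construct_requests(self, doc, ctx, **kwargs):
        """Build one log-likelihood request for the reference answer."""
        return Instance(
            request_type="loglikelihood",
            doc=doc,
            arguments=(ctx, self.doc_to_target(doc)),
            idx=0,
            **kwargs,
        )

    def process_results(self, doc, results):
        """Each loglikelihood result is a (log_prob, is_greedy) pair."""
        _log_prob, is_greedy = results[0]
        return {"acc": float(is_greedy)}

    def aggregation(self):
        return {"acc": mean}

    def higher_is_better(self):
        return {"acc": True}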

lm-eval-harness uses a standardized evaluation protocol: it formats prompts consistently, handles few-shot examples automatically, and computes metrics in a reproducible way. This ensures fair comparison between models.
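To make the idea of consistent prompt formatting concrete, here is a minimal sketch of how a k-shot prompt might be assembled. It is not lm-eval-harness's own implementation; the function names `format_example` and `build_fewshot_prompt`, the template, and the sample documents are all assumptions chosen for illustration.

```python
from typing import Dict, List


def format_example(doc: Dict[str, str], include_answer: bool) -> str:
    """Render one document with the same template for every model."""
    text = f"Question: {doc['question']}\nAnswer:"
    if include_answer:
        text += f" {doc['answer']}"
    return text


def build_fewshot_prompt(
    fewshot_docs: List[Dict[str, str]], test_doc: Dict[str, str], k: int
) -> str:
    """Prepend the first k solved examples to the unanswered test question."""
    shots = [format_example(d, include_answer=True) for d in fewshot_docs[:k]]
    return "\n\n".join(shots + [format_example(test_doc, include_answer=False)])


# Hypothetical data, for illustration only.
train = [
    {"question": "What is 2 + 2?", "answer": "4"},
    {"question": "What is the capital of France?", "answer": "Paris"},
]
test = {"question": "What is 3 + 5?", "answer": "8"}
print(build_fewshot_prompt(train, test, k=2))
```

Because every model sees exactly the same string, differences in score can be attributed to the model rather than to prompt wording.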

OpenCompass

Overview

Definition: OpenCompass

OpenCompass (2023) is a comprehensive evaluation platform with 100+ benchmarks covering 50+ capabilities. It supports both English and Chinese evaluation, with a web-based leaderboard for comparing models.

Configuration

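from mmengine.config import read_base
from opencompass.models import HuggingFaceCausalLM
from opencompass.partitioners import SizePartitioner
from opencompass.runners import SlurmRunner
from opencompass.tasks import OpenICLInferTask

# Dataset configuration: reuse the dataset configs shipped with OpenCompass.
# The relative paths assume this file lives in the repository's configs/
# directory; exact module names can differ between OpenCompass releases.
with read_base():
    from .datasets.mmlu.mmlu_gen import mmlu_datasets
    from .datasets.hellaswag.hellaswag_gen import hellaswag_datasets
    from .datasets.ARC_c.ARC_c_gen import ARC_c_datasets

datasets = [*mmlu_datasets, *hellaswag_datasets, *ARC_c_datasets]

# Model configuration
models = [
    dict(
        type=HuggingFaceCausalLM,
        abbr='llama-2-7b',
        path='meta-llama/Llama-2-7b-hf',
        model_kwargs=dict(
            torch_dtype='auto',
            device_map='auto',
        ),
        max_out_len=100,
        max_seq_len=2048,
        batch_size=8,
        run_cfg=dict(num_gpus=1),
    )
]

# Inference configuration: partition the work by size and dispatch to Slurm
infer = dict(
    partitioner=dict(type=SizePartitioner, max_task_size=20000),
    runner=dict(
        type=SlurmRunner,
        max_num_workers=16,
        task=dict(type=OpenICLInferTask),
        retry=2,
    ),
)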

Benchmark Coverage

Domain | Benchmarks | Tasks
Knowledge | MMLU, C-Eval, CMMLU | 57+ subjects
Reasoning | GSM8K, MATH, BBH | Math, logic
Language | HellaSwag, WinoGrande | Language understanding
Code | HumanEval, MBPP | Code generation
Safety | SafetyBench, TruthfulQA | Safety evaluation
Multilingual | XL-BEN, MGSM | Cross-lingual

HELM (Holistic Evaluation)

Overview

Definition: HELM

HELM (Liang et al., 2023) is a holistic evaluation framework from Stanford that assesses LLMs across 42 scenarios and 7 metric categories, measuring fairness, bias, and toxicity alongside accuracy.

Evaluation Dimensions

Dimension | Metrics | Purpose
Accuracy | F1, Exact Match, ROUGE | Task performance
Calibration | ECE, Brier Score | Confidence accuracy
Robustness | Perturbation accuracy | Adversarial robustness
Fairness | Demographic parity | Bias measurement
Bias | Sentiment bias, toxicity | Social bias
Toxicity | Toxicity score | Harmful outputs
Efficiency | Latency, throughput | Cost-effectiveness
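The robustness row can be illustrated with a small sketch that compares accuracy on clean inputs against accuracy on perturbed copies. This is not HELM's perturbation code: the character-swap typo, the function names, and the toy keyword classifier are all assumptions made for the example.

```python
import random
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (input text, gold label)


def swap_adjacent_chars(text: str, rng: random.Random) -> str:
    """Introduce one typo by swapping two adjacent characters."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def accuracy(predict: Callable[[str], str], examples: List[Example]) -> float:
    """Fraction of examples whose prediction equals the gold label."""
    return sum(predict(x) == y for x, y in examples) / len(examples)


def robustness_gap(
    predict: Callable[[str], str], examples: List[Example], seed: int = 0
) -> Tuple[float, float]:
    """Return (clean accuracy, accuracy on typo-perturbed inputs)."""
    rng = random.Random(seed)
    perturbed = [(swap_adjacent_chars(x, rng), y) for x, y in examples]
    return accuracy(predict, examples), accuracy(predict, perturbed)


# Toy stand-in for a model: a keyword classifier (hypothetical).
def toy_predict(text: str) -> str:
    return "pos" if "good" in text else "neg"


data = [("a good film", "pos"), ("a bad film", "neg"), ("good fun", "pos")]
clean_acc, perturbed_acc = robustness_gap(toy_predict, data)
print(f"clean={clean_acc:.2f} perturbed={perturbed_acc:.2f}")
```

A large drop from clean to perturbed accuracy suggests the model relies on surface form rather than meaning.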

HELM Scenarios

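# Illustrative scenario description. This is a simplified schema for
# exposition, not HELM's actual configuration format; HELM itself is
# driven by run entries passed to its helm-run command.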
scenario = dict(
    name="openbookqa",
    description="OpenBookQA dataset for commonsense reasoning",
    dataset=dict(
        name="openbookqa",
        split="test",
    ),
    metric_list=[
        dict(name="exact_match", args={"normalize": True}),
        dict(name="calibration_macro_avg", args={"num_bins": 20}),
    ],
    noise_args=dict(
        open_book_missing_fraction=0.5,
    ),
    perturbation_params=["misspelling", "synonym"],
)

HELM's strength is its comprehensive coverage. It evaluates not just accuracy but also calibration (how well the model's confidence matches its correctness), fairness (across demographic groups), and robustness (against perturbations).
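Calibration is the least familiar of these dimensions, so a short sketch of expected calibration error (ECE) may help. It uses equal-width confidence bins, which is one common formulation and not necessarily the exact binning HELM applies; the function name and the sample numbers are assumptions.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[bool], num_bins: int = 10
) -> float:
    """Weighted average of |accuracy - mean confidence| over equal-width bins."""
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")

    bins = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        # Confidence 1.0 falls into the last bin rather than overflowing.
        idx = min(int(conf * num_bins), num_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        acc = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(acc - avg_conf)
    return ece


# Hypothetical model outputs: 90% confident on every item, right 3 times in 4.
confs = [0.9, 0.9, 0.9, 0.9]
hits = [True, True, True, False]
print(f"ECE = {expected_calibration_error(confs, hits):.3f}")
```

An ECE of zero means stated confidence matches observed accuracy in every bin; here the model is overconfident by roughly 15 points.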

Evaluation Methodology

Few-Shot Evaluation

Few-Shot Accuracy

$$\text{Acc}_k = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left[\hat{y}_i = y_i \mid k \text{ examples}\right]$$

Here,

  • $\text{Acc}_k$ = Accuracy with k few-shot examples
  • $N$ = Number of test examples
  • $\hat{y}_i$ = Model prediction
  • $y_i$ = Ground truth
  • $k$ = Number of few-shot examples
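The accuracy formula translates directly into code. The sketch below also reports the binomial standard error, which is an addition beyond the formula above and is included only as one simple way to show how much a score could move with a different test sample; the function name and sample labels are assumptions.

```python
import math
from typing import Sequence, Tuple


def accuracy_with_stderr(
    predictions: Sequence[str], references: Sequence[str]
) -> Tuple[float, float]:
    """Return (accuracy, binomial standard error) over N test examples."""
    n = len(references)
    if n == 0 or n != len(predictions):
        raise ValueError("inputs must be non-empty and of equal length")
    # Indicator 1[y_hat_i == y_i], averaged over the test set.
    acc = sum(p == r for p, r in zip(predictions, references)) / n
    stderr = math.sqrt(acc * (1.0 - acc) / n)
    return acc, stderr


# Hypothetical multiple-choice predictions made with k few-shot examples.
preds = ["A", "B", "C", "D"]
golds = ["A", "B", "C", "A"]
acc, se = accuracy_with_stderr(preds, golds)
print(f"accuracy = {acc:.2f} +/- {se:.2f}")
```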

Perplexity

Perplexity

$$\text{PPL} = \exp\left(-\frac{1}{T} \sum_{t=1}^{T} \log P(x_t \mid x_{<t})\right)$$

Here,

  • $T$ = Sequence length
  • $P(x_t \mid x_{<t})$ = Model probability for token t
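As a worked instance of the perplexity formula, the sketch below computes PPL from a list of per-token natural-log probabilities. Where those log probabilities come from is left out; the function name and the sample values are assumptions.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """exp of the negative mean log-probability over T tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)


# Hypothetical sequence: the model gives each of 3 tokens probability 0.25,
# so it is as uncertain as a uniform choice among 4 options per token.
logprobs = [math.log(0.25)] * 3
print(f"PPL = {perplexity(logprobs):.2f}")
```

Lower perplexity means the model assigned higher probability to the observed text.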

Inference Cost

Cost per Token

$$\text{Cost} = \frac{\text{GPU cost/hour}}{\text{tokens/hour}}$$

Here,

  • GPU cost/hour = Hourly cost of GPU rental
  • tokens/hour = Inference throughput
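A quick arithmetic sketch of the cost formula follows. Throughput is usually measured in tokens per second, so the function converts to tokens per hour first and reports cost per million tokens for readability. The rental price and throughput are made-up numbers, not measurements.

```python
def cost_per_million_tokens(gpu_cost_per_hour: float, tokens_per_second: float) -> float:
    """Cost = (GPU cost/hour) / (tokens/hour), scaled to one million tokens."""
    if tokens_per_second <= 0:
        raise ValueError("throughput must be positive")
    tokens_per_hour = tokens_per_second * 3600
    cost_per_token = gpu_cost_per_hour / tokens_per_hour
    return cost_per_token * 1_000_000


# Hypothetical figures: a GPU rented at 2.00 per hour serving 1,000 tokens/s.
print(f"{cost_per_million_tokens(2.00, 1000):.3f} per million tokens")
```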

Practical Evaluation

Running a Complete Evaluation

from lm_eval import evaluator
from lm_eval.models.huggingface import HFLM

def evaluate_model(model_path, task_list, num_fewshot=0):
    """Run comprehensive model evaluation."""
    
    # Load model
    model = HFLM(
        pretrained=model_path,
        device="cuda:0",
        batch_size=8,
        dtype="float16"
    )
    
    # Run evaluation
    results = evaluator.simple_evaluate(
        model=model,
        tasks=task_list,
        num_fewshot=num_fewshot,
        log_samples=True
    )
    
    # Format results
    formatted = {}
    for task, metrics in results["results"].items():
        formatted[task] = {
            k: f"{v:.4f}" if isinstance(v, float) else v
            for k, v in metrics.items()
        }
    
    return formatted

# Example usage
task_list = ["mmlu", "hellaswag", "arc_challenge", "gsm8k", "truthfulqa"]
results = evaluate_model("meta-llama/Llama-2-7b-hf", task_list, num_fewshot=5)

for task, metrics in results.items():
    print(f"{task}: {metrics}")
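For charts or comparisons it helps to reduce each task to a single number. The sketch below assumes the raw per-task metrics dictionary uses keys of the form `metric,filter` (for example `acc,none`), as recent lm-eval releases do; that key format, the preference order, the function name `primary_scores`, and the sample values are all assumptions, so verify them against your installed version.

```python
from typing import Dict, Sequence


def primary_scores(
    results: Dict[str, Dict[str, object]],
    preferred: Sequence[str] = ("acc_norm", "acc", "exact_match"),
) -> Dict[str, float]:
    """Pick one headline metric per task from raw (unformatted) results."""
    scores: Dict[str, float] = {}
    for task, metrics in results.items():
        by_name: Dict[str, float] = {}
        for key, value in metrics.items():
            name = key.split(",")[0]  # drop the filter suffix
            is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
            if is_number and not name.endswith("_stderr"):
                by_name.setdefault(name, float(value))  # keep first filter seen
        for name in preferred:
            if name in by_name:
                scores[task] = by_name[name]
                break
    return scores


# Hypothetical numbers shaped like results["results"]; not real model scores.
sample = {
    "hellaswag": {"acc,none": 0.57, "acc_stderr,none": 0.005,
                  "acc_norm,none": 0.76, "alias": "hellaswag"},
    "gsm8k": {"exact_match,strict-match": 0.14, "alias": "gsm8k"},
}
print(primary_scores(sample))
```

Note that this operates on the numeric values, so apply it before any string formatting of the metrics.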

Building an Evaluation Suite

class EvaluationSuite:
    """Custom evaluation suite for specific use cases."""
    
    def __init__(self, name, tasks, weights=None):
        self.name = name
        self.tasks = tasks
        self.weights = weights or {t: 1.0 for t in tasks}
    
    def evaluate(self, model):
        """Run all tasks and compute weighted score."""
        scores = {}
        
        for task in self.tasks:
            score = self._run_task(model, task)
            scores[task] = score
        
        # Compute weighted average
        weighted_sum = sum(scores[t] * self.weights[t] for t in self.tasks)
        total_weight = sum(self.weights[t] for t in self.tasks)
        
        overall = weighted_sum / total_weight
        
        return {
            "overall": overall,
            "per_task": scores
        }
    
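    def _run_task(self, model, task):
        """Run a single task and return its score as a float."""
        # Task-specific evaluation logic goes here; fail loudly until it is
        # implemented so evaluate() never averages over missing scores.
        raise NotImplementedError(f"No evaluation logic defined for task '{task}'")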

# Example: Code-focused evaluation suite
code_suite = EvaluationSuite(
    name="code_evaluation",
    tasks=["humaneval", "mbpp", "apps", "code_contests"],
    weights={"humaneval": 0.4, "mbpp": 0.3, "apps": 0.2, "code_contests": 0.1}
)

Design your evaluation suite based on your specific use case. A code assistant should weight code tasks higher, while a chatbot should prioritize conversational quality and safety metrics.
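The effect of weighting can be seen with a small worked example. The sketch below is a standalone weighted average, not tied to any framework; the two models, their per-category scores, and both weight profiles are invented for illustration.

```python
from typing import Dict


def weighted_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of per-category scores, normalized by total weight."""
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[name] * weights[name] for name in scores) / total_weight


# Hypothetical per-category scores for two models.
model_a = {"code": 0.8, "chat": 0.5, "safety": 0.6}
model_b = {"code": 0.5, "chat": 0.8, "safety": 0.8}

# Hypothetical weight profiles for two products.
code_assistant = {"code": 0.7, "chat": 0.2, "safety": 0.1}
chatbot = {"code": 0.1, "chat": 0.5, "safety": 0.4}

for profile_name, profile in [("code assistant", code_assistant), ("chatbot", chatbot)]:
    a = weighted_score(model_a, profile)
    b = weighted_score(model_b, profile)
    print(f"{profile_name}: model A = {a:.2f}, model B = {b:.2f}")
```

With these made-up numbers, the same two models swap ranks depending on the weight profile, which is why the weights should be fixed from the use case before looking at results.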

Common Pitfalls

Evaluation Anti-Patterns

Anti-Pattern | Problem | Solution
Overfitting to benchmarks | Good scores but poor real-world performance | Include held-out tasks
Ignoring calibration | Confident but wrong answers | Report ECE alongside accuracy
Single metric | Misses important failure modes | Report multiple metrics
No diversity | Only tests narrow capabilities | Cover 10+ task categories
Inconsistent prompting | Unfair comparison across models | Use standardized frameworks

The "benchmark saturation" problem occurs when models reach near-ceiling, often near-human, scores on a standard benchmark, so the benchmark no longer separates strong models from weak ones even though those models still fail on real-world tasks. This is why HELM and OpenCompass include diverse tasks beyond traditional NLP benchmarks.

Practice Exercises

  1. Conceptual: Explain why evaluating LLMs is harder than evaluating traditional ML models. What challenges are unique to generative models?

  2. Practical: Use lm-eval-harness to evaluate a 7B model on 5 different tasks. Create a radar chart comparing its strengths and weaknesses.

  3. Analysis: Compare the evaluation results of two models (e.g., LLaMA-2 vs Mistral) across knowledge, reasoning, and safety tasks. Which model is better for which use case?

  4. Research: Design an evaluation suite for a medical LLM. What tasks and metrics would you include? How would you ensure the evaluation is fair and comprehensive?

Key Takeaways:

  • lm-eval-harness is the standard open-source framework with 200+ tasks
  • OpenCompass provides comprehensive coverage with 100+ benchmarks including Chinese
  • HELM evaluates across 7 dimensions including fairness and safety
  • Evaluation should cover knowledge, reasoning, code, safety, and efficiency
  • Few-shot evaluation measures how well a model uses k in-context examples; zero-shot measures performance from the prompt alone
  • Always report multiple metrics (accuracy, calibration, robustness)
  • Design evaluation suites tailored to your specific use case

What to Learn Next

-> Human Evaluation of LLMs: Chatbot Arena, preference studies, and annotation.

-> Automated LLM Evaluation: LLM-as-judge, G-Eval, and automatic metrics.

-> LLM Evaluation Benchmarks: Understanding standard benchmarks for LLMs.

-> LLM Safety and Red Teaming: Testing and hardening LLMs against attacks.

-> Building Production LLM Apps: From prototype to production deployment.

-> Constitutional AI: Training safe and aligned language models.
