Human Evaluation of LLMs — The Gold Standard

Automated metrics capture some aspects of LLM quality, but human judgment remains the gold standard for evaluating fluency, helpfulness, and safety. This guide covers Chatbot Arena, preference studies, and annotation methodologies.

Chatbot Arena — Crowdsourced ELO ratings through blind comparison
Preference Studies — Side-by-side comparisons and rating scales
Annotation Design — Creating reliable human evaluation protocols
Inter-Annotator Agreement — Ensuring evaluation consistency

The ultimate test of a language model is whether humans find it useful.

Human Evaluation of LLMs

While automated benchmarks provide consistent metrics, human evaluation captures aspects that automated metrics miss: helpfulness, creativity, safety, and real-world utility. Human evaluation is essential for evaluating open-ended generation, dialogue quality, and alignment with human preferences.

Definition: Human Evaluation

Human evaluation is the process of assessing LLM outputs using human judgment. It measures qualities like helpfulness, fluency, safety, and alignment with human preferences that automated metrics often fail to capture.

Chatbot Arena

Overview

Definition: Chatbot Arena

Chatbot Arena (LMSYS) is a crowdsourced platform for evaluating LLMs through blind side-by-side comparisons. Users interact with two anonymous models and vote for the better response, and the votes are turned into Elo ratings similar to chess rankings.

Elo Rating System

Elo Rating Update

R_i' = R_i + K(S_i - E_i)

Here,

$R_i$ = Current rating of model i
$K$ = Update factor (typically 32)
$S_i$ = Actual score (1 for win, 0.5 for tie, 0 for loss)
$E_i$ = Expected score

Expected Score

E_i = \frac{1}{1 + 10^{(R_j - R_i)/400}}

Here,

$E_i$ = Expected score for model i
$R_j$ = Rating of opponent model j
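
As a worked example of the two formulas above, the sketch below applies a single update when a lower-rated model beats a higher-rated one. The ratings 1000 and 1200 and the function names are illustrative assumptions, not values from any real leaderboard.

```python
def expected_score(r_i: float, r_j: float) -> float:
    """Expected score of a model rated r_i against an opponent rated r_j."""
    return 1.0 / (1.0 + 10 ** ((r_j - r_i) / 400))


def elo_update(r_i: float, r_j: float, s_i: float, k: float = 32.0):
    """Return (r_i', r_j') after one comparison; s_i is 1, 0.5, or 0 for model i."""
    e_i = expected_score(r_i, r_j)
    e_j = 1.0 - e_i
    new_i = r_i + k * (s_i - e_i)
    new_j = r_j + k * ((1.0 - s_i) - e_j)
    return new_i, new_j


# Hypothetical ratings: the underdog (1000) beats the favorite (1200)
underdog, favorite = 1000.0, 1200.0
e_underdog = expected_score(underdog, favorite)
new_underdog, new_favorite = elo_update(underdog, favorite, s_i=1.0)

print(f"Expected score of underdog: {e_underdog:.3f}")   # about 0.240
print(f"Underdog: {underdog:.0f} -> {new_underdog:.1f}")  # about 1024.3
print(f"Favorite: {favorite:.0f} -> {new_favorite:.1f}")  # about 1175.7
```

The upset moves about 24 points from the favorite to the underdog. Had the favorite won, only about 8 points would have moved, because that outcome was already expected.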

Chatbot Arena Architecture

class ChatbotArena:
    """Simplified Chatbot Arena backend."""
    
    def __init__(self):
        self.models = {}
        self.battles = []
        self.k_factor = 32
    
    def register_model(self, model_id, model_handler):
        """Register a model for evaluation."""
        self.models[model_id] = {
            "handler": model_handler,
            "rating": 1000,  # Initial ELO
            "battles": 0,
            "wins": 0,
            "losses": 0,
            "ties": 0
        }
    
    def start_battle(self, user_query, model_a_id=None, model_b_id=None):
        """Start a blind comparison battle."""
        import random
        
        # Randomly select two distinct models if either one is unspecified
        if model_a_id is None or model_b_id is None:
            available = list(self.models.keys())
            model_a_id, model_b_id = random.sample(available, 2)
        
        # Get responses from both models
        response_a = self.models[model_a_id]["handler"](user_query)
        response_b = self.models[model_b_id]["handler"](user_query)
        
        # Randomly swap display positions (A or B) to counter position bias
        if random.random() > 0.5:
            model_a_id, model_b_id = model_b_id, model_a_id
            response_a, response_b = response_b, response_a
        
        battle = {
            "battle_id": len(self.battles),
            "model_a": {"id": model_a_id, "response": response_a},
            "model_b": {"id": model_b_id, "response": response_b},
            "query": user_query
        }
        # Store the battle so record_vote can look it up by battle_id.
        # The model ids must stay hidden from the voter until the vote is cast.
        self.battles.append(battle)
        return battle
    
    def record_vote(self, battle_id, winner):
        """Record user vote ("a", "b", or anything else for a tie) and update ratings."""
        battle = self.battles[battle_id]
        
        model_a_id = battle["model_a"]["id"]
        model_b_id = battle["model_b"]["id"]
        
        # _update_ratings adjusts both models, so call it exactly once per vote
        if winner == "a":
            self._update_ratings(model_a_id, model_b_id, 1)
        elif winner == "b":
            self._update_ratings(model_b_id, model_a_id, 1)
        else:
            self._update_ratings(model_a_id, model_b_id, 0.5)
    
    def _update_ratings(self, winner_id, loser_id, score):
        """Update ELO ratings for both models."""
        winner = self.models[winner_id]
        loser = self.models[loser_id]
        
        # Expected scores
        exp_winner = 1 / (1 + 10 ** ((loser["rating"] - winner["rating"]) / 400))
        exp_loser = 1 - exp_winner
        
        # Update ratings
        winner["rating"] += self.k_factor * (score - exp_winner)
        loser["rating"] += self.k_factor * ((1 - score) - exp_loser)
        
        # Update statistics
        winner["battles"] += 1
        loser["battles"] += 1
        
        if score == 1:
            winner["wins"] += 1
            loser["losses"] += 1
        elif score == 0:
            winner["losses"] += 1
            loser["wins"] += 1
        else:
            winner["ties"] += 1
            loser["ties"] += 1
    
    def get_leaderboard(self):
        """Get current model rankings."""
        return sorted(
            self.models.items(),
            key=lambda x: x[1]["rating"],
            reverse=True
        )

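Chatbot Arena has collected over 1 million human votes as of 2024, making it one of the largest human evaluation datasets for LLMs. Elo ratings on their own are point estimates that depend on the order of the battles; confidence intervals require an additional step, such as bootstrapping over the recorded battles.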
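
One way to attach uncertainty to Elo ratings is a percentile bootstrap: resample the recorded battles with replacement, replay the Elo updates, and look at the spread of the final ratings. The sketch below is an illustration under stated assumptions, not a description of the platform's actual pipeline; the battle format `(model_a, model_b, score_a)`, the model names, and the 70/30 win split are hypothetical.

```python
import random


def replay_elo(battles, k=32.0, initial=1000.0):
    """Replay (model_a, model_b, score_a) records in order; return final ratings."""
    ratings = {}
    for model_a, model_b, score_a in battles:
        r_a = ratings.setdefault(model_a, initial)
        r_b = ratings.setdefault(model_b, initial)
        e_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
        # The update is zero-sum: what model_a gains, model_b loses
        ratings[model_a] = r_a + k * (score_a - e_a)
        ratings[model_b] = r_b - k * (score_a - e_a)
    return ratings


def bootstrap_intervals(battles, n_rounds=200, alpha=0.05, seed=0):
    """Percentile bootstrap interval for each model's final rating."""
    rng = random.Random(seed)
    samples = {}
    for _ in range(n_rounds):
        # Resampling with replacement also randomizes the battle order
        resampled = rng.choices(battles, k=len(battles))
        for model, rating in replay_elo(resampled).items():
            samples.setdefault(model, []).append(rating)

    intervals = {}
    for model, values in samples.items():
        values.sort()
        lower = values[int((alpha / 2) * (len(values) - 1))]
        upper = values[int((1 - alpha / 2) * (len(values) - 1))]
        intervals[model] = (lower, upper)
    return intervals


# Hypothetical data: model_x beats model_y in 70 of 100 battles
battles = [("model_x", "model_y", 1.0)] * 70 + [("model_x", "model_y", 0.0)] * 30
intervals = bootstrap_intervals(battles)
for model, (lower, upper) in intervals.items():
    print(f"{model}: 95% interval [{lower:.0f}, {upper:.0f}]")
```

With only 100 battles and K = 32 the intervals come out wide, which is the point: a gap between two point estimates means little until the intervals are reported alongside it.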

Preference Studies

Bradley-Terry Model

Definition: Bradley-Terry Model

The Bradley-Terry model is a statistical model for pairwise comparisons. It estimates the probability that one model is preferred over another based on their latent "quality" parameters.

Bradley-Terry Probability

P(i \succ j) = \frac{\pi_i}{\pi_i + \pi_j}

Here,

$P(i \succ j)$ = Probability model i is preferred over j
$\pi_i$ = Quality parameter for model i
$\pi_j$ = Quality parameter for model j
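
The Bradley-Terry probability and the Elo expected score are the same function under a change of variables: setting $\pi_i = 10^{R_i/400}$ turns one into the other. The short check below verifies this numerically; the two ratings and the function names are arbitrary illustrative choices.

```python
def bradley_terry_prob(pi_i: float, pi_j: float) -> float:
    """P(i preferred over j) under the Bradley-Terry model."""
    return pi_i / (pi_i + pi_j)


def elo_expected(r_i: float, r_j: float) -> float:
    """Elo expected score of a model rated r_i against one rated r_j."""
    return 1.0 / (1.0 + 10 ** ((r_j - r_i) / 400))


# Arbitrary example ratings
r_i, r_j = 1100.0, 1000.0

# Map each Elo rating to a Bradley-Terry quality parameter
pi_i = 10 ** (r_i / 400)
pi_j = 10 ** (r_j / 400)

p_bt = bradley_terry_prob(pi_i, pi_j)
p_elo = elo_expected(r_i, r_j)
print(f"Bradley-Terry: {p_bt:.6f}, Elo: {p_elo:.6f}")
```

The practical difference is in how the parameters are estimated: Elo updates ratings one battle at a time, while Bradley-Terry fits all comparisons jointly by maximum likelihood, so its result does not depend on battle order.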

import numpy as np
from scipy.optimize import minimize

class BradleyTerryModel:
    """Bradley-Terry model for pairwise comparisons."""
    
    def __init__(self, n_models):
        self.n_models = n_models
        self.params = np.zeros(n_models)  # Log-quality parameters
    
    def fit(self, comparisons):
        """Fit model from pairwise comparisons.
        
        comparisons: list of (winner_id, loser_id) tuples
        """
        def neg_log_likelihood(params):
            ll = 0
            for winner, loser in comparisons:
                # P(winner > loser)
                prob = 1 / (1 + np.exp(params[loser] - params[winner]))
                ll += np.log(prob + 1e-10)
            return -ll
        
        result = minimize(
            neg_log_likelihood,
            self.params,
            method='L-BFGS-B'
        )
        
        self.params = result.x
        return self
    
    def predict(self, model_i, model_j):
        """Predict probability that model i is preferred."""
        return 1 / (1 + np.exp(self.params[model_j] - self.params[model_i]))
    
    def get_rankings(self):
        """Get model rankings by quality parameter."""
        return np.argsort(-self.params)

Rating Scales

Scale Type	Description	Use Case
Likert	1-5 or 1-7 rating	Overall quality
Comparison	Side-by-side preference	A/B testing
Ranking	Order multiple outputs	Multi-model comparison
Binary	Acceptable/Not acceptable	Quality thresholds

Designing Rating Scales

class RatingScale:
    """Design and validate rating scales for LLM evaluation."""
    
    LIKERT_5 = {
        1: "Very Poor",
        2: "Poor", 
        3: "Acceptable",
        4: "Good",
        5: "Excellent"
    }
    
    HELPFULNESS_SCALE = {
        1: "Not helpful at all",
        2: "Slightly helpful",
        3: "Moderately helpful",
        4: "Very helpful",
        5: "Extremely helpful"
    }
    
    SAFETY_SCALE = {
        1: "Unsafe - harmful content",
        2: "Somewhat unsafe - questionable content",
        3: "Neutral - no safety concerns",
        4: "Safe - appropriate content",
        5: "Very safe - helpful safety information"
    }
    
    @staticmethod
    def create_task_specific_scale(task_type):
        """Create appropriate scale for task type."""
        if task_type == "summarization":
            return {
                "dimensions": ["accuracy", "completeness", "conciseness", "fluency"],
                "scale": RatingScale.LIKERT_5
            }
        elif task_type == "code_generation":
            return {
                "dimensions": ["correctness", "efficiency", "readability", "documentation"],
                "scale": RatingScale.LIKERT_5
            }
        elif task_type == "conversation":
            return {
                "dimensions": ["helpfulness", "coherence", "engagement", "safety"],
                "scale": RatingScale.HELPFULNESS_SCALE
            }
        # Fail loudly instead of silently returning None for an unknown task
        raise ValueError(f"Unknown task type: {task_type!r}")

Annotation Methodologies

Single Annotation

Definition: Single Annotation

Single annotation assigns one annotator per example. It's fast and cheap but lacks reliability measures. Use only for initial screening or when inter-annotator agreement is expected to be high.

Multiple Annotations

Definition: Multiple Annotations

Multiple annotations assign multiple annotators (typically 3-5) per example. The majority vote or average rating provides more reliable estimates and enables computing inter-annotator agreement.

import statistics
from collections import Counter

class AnnotationAggregator:
    """Aggregate multiple annotations into final scores."""
    
    @staticmethod
    def majority_vote(annotations):
        """Use majority vote for categorical labels."""
        counter = Counter(annotations)
        return counter.most_common(1)[0][0]
    
    @staticmethod
    def weighted_average(annotations, weights=None):
        """Compute weighted average for ordinal ratings."""
        if weights is None:
            weights = [1] * len(annotations)
        
        total = sum(a * w for a, w in zip(annotations, weights))
        return total / sum(weights)
    
    @staticmethod
    def compute_agreement(annotations_list):
        """Compute inter-annotator agreement (Fleiss' Kappa).
        
        annotations_list: one list of labels per rater, aligned by item.
        Requires at least two raters.
        """
        n_raters = len(annotations_list)
        n_items = len(annotations_list[0])
        
        # Observed agreement: for each item, the fraction of rater pairs
        # that chose the same category, averaged over all items
        category_totals = Counter()
        p_o = 0
        for item_idx in range(n_items):
            counts = Counter(ann[item_idx] for ann in annotations_list)
            category_totals.update(counts)
            agreeing_pairs = sum(c * (c - 1) for c in counts.values())
            p_o += agreeing_pairs / (n_raters * (n_raters - 1))
        p_o /= n_items
        
        # Expected agreement from the overall category proportions
        p_e = sum((total / (n_items * n_raters)) ** 2
                  for total in category_totals.values())
        
        if p_e == 1:
            # Kappa is undefined when every rating falls in one category
            return float("nan")
        return (p_o - p_e) / (1 - p_e)

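Fleiss' Kappa measures inter-annotator agreement beyond chance. Under the commonly cited Landis and Koch guidelines, values above 0.8 indicate almost perfect agreement, 0.6-0.8 substantial agreement, and 0.4-0.6 moderate agreement. These cutoffs are conventions rather than strict rules, but a value below about 0.6 usually suggests the annotation scheme or guidelines need improvement.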
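
To make the definition concrete, here is a small worked example that computes Fleiss' Kappa from scratch with the per-item agreement written out. It is a minimal sketch: the function name, the item-major input layout, and the toy labels are assumptions made for illustration.

```python
from collections import Counter


def fleiss_kappa(items):
    """items: one list of labels per item, each rated by the same number of raters."""
    n_items = len(items)
    n_raters = len(items[0])

    category_totals = Counter()
    per_item_agreement = []
    for labels in items:
        counts = Counter(labels)
        category_totals.update(counts)
        # Fraction of rater pairs that agree on this item
        agreeing_pairs = sum(c * (c - 1) for c in counts.values())
        per_item_agreement.append(agreeing_pairs / (n_raters * (n_raters - 1)))

    p_o = sum(per_item_agreement) / n_items
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in category_totals.values())
    return (p_o - p_e) / (1 - p_e)


# Toy data: 4 items, 3 raters each, binary labels
items = [
    ["good", "good", "good"],  # full agreement, P_i = 1
    ["bad", "bad", "bad"],     # full agreement, P_i = 1
    ["good", "good", "bad"],   # one of three pairs agrees, P_i = 1/3
    ["bad", "bad", "good"],    # one of three pairs agrees, P_i = 1/3
]
kappa = fleiss_kappa(items)
print(f"Fleiss' Kappa: {kappa:.3f}")  # 0.333
```

Here the observed agreement is 2/3 and the chance agreement is 1/2 (both labels are used equally often), so kappa is (2/3 - 1/2) / (1 - 1/2) = 1/3. A raw agreement figure of 67% sounds respectable, yet the chance-corrected value is low.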

Annotation Protocols

Protocol	Description	Trade-off
Blind	Annotators don't know model identity	Reduces bias, more expensive
Randomized	Random order of outputs	Reduces position bias
Calibration	Include gold-standard examples	Detects low-quality annotators
Training	Practice examples before evaluation	Improves consistency
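
The blind and randomized protocols can be combined in a small task builder: shuffle the outputs, show them under neutral labels, and keep the label-to-model mapping out of the annotator's view. The sketch below assumes a hypothetical data layout (a dict from model id to response text); none of these names come from an existing tool.

```python
import random


def build_blind_task(prompt, outputs, rng):
    """Return (task shown to the annotator, private label-to-model key).

    outputs: dict mapping model id -> response text.
    """
    model_ids = list(outputs)
    rng.shuffle(model_ids)  # Randomized order reduces position bias
    labels = [chr(ord("A") + i) for i in range(len(model_ids))]

    # The task contains only neutral labels, never model ids (blind protocol)
    task = {
        "prompt": prompt,
        "responses": {label: outputs[m] for label, m in zip(labels, model_ids)},
    }
    key = dict(zip(labels, model_ids))  # Stored server-side only
    return task, key


def decode_vote(vote_label, key):
    """Map the annotator's chosen label back to the model id."""
    return key[vote_label]


rng = random.Random(42)  # Fixed seed only so the example is reproducible
outputs = {"model_x": "Response one.", "model_y": "Response two."}
task, key = build_blind_task("Explain recursion.", outputs, rng)

print(task["responses"])      # Labels A and B, no model ids
print(decode_vote("A", key))  # The model that was shown as A
```

Keeping the key separate from the task object makes it harder to leak model identity by accident, for example by serializing the whole task to the annotation interface.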

Quality Control

Annotator Screening

class AnnotatorScreening:
    """Screen annotators for quality."""
    
    def __init__(self, gold_standards, min_accuracy=0.8):
        self.gold_standards = gold_standards
        self.min_accuracy = min_accuracy
    
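    def screen_annotator(self, annotations):
        """Check if annotator meets quality threshold on the gold items."""
        # zip stops at the shorter sequence, so score only the compared pairs
        pairs = list(zip(annotations, self.gold_standards))
        if not pairs:
            return False
        correct = sum(1 for ann, gold in pairs if ann == gold)
        accuracy = correct / len(pairs)
        
        return accuracy >= self.min_accuracy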
    
    def detect_biased_annotators(self, all_annotations):
        """Identify annotators with systematic biases."""
        # Check for position bias (always preferring first/second option)
        position_biases = []
        
        for annotator_id, annotations in all_annotations.items():
            first_wins = sum(1 for ann in annotations if ann["preferred"] == "first")
            bias = abs(first_wins / len(annotations) - 0.5)
            position_biases.append((annotator_id, bias))
        
        # Flag annotators with extreme biases
        return [aid for aid, bias in position_biases if bias > 0.3]

Attention Checks

Include attention check questions where the correct answer is obvious. Annotators who fail these checks are likely not reading carefully and their annotations should be excluded.
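
A minimal sketch of this filter follows. It assumes a hypothetical layout in which each annotator's labels are stored in a dict keyed by item id and the attention-check items have known expected answers; the ids and labels are made up.

```python
def filter_by_attention_checks(responses, attention_answers, max_failures=0):
    """Drop annotators who fail more than max_failures attention checks.

    responses: {annotator_id: {item_id: label}}
    attention_answers: {item_id: expected_label} for the attention-check items
    Returns the kept annotators' labels with the check items removed.
    """
    kept = {}
    for annotator_id, labels in responses.items():
        # A skipped attention check counts as a failure
        failures = sum(
            1
            for item_id, expected in attention_answers.items()
            if labels.get(item_id) != expected
        )
        if failures <= max_failures:
            kept[annotator_id] = {
                item_id: label
                for item_id, label in labels.items()
                if item_id not in attention_answers
            }
    return kept


# Hypothetical data: item "check_1" has the obvious answer "b"
attention_answers = {"check_1": "b"}
responses = {
    "ann_1": {"q1": "a", "q2": "b", "check_1": "b"},  # passes
    "ann_2": {"q1": "a", "q2": "a", "check_1": "a"},  # fails the check
}
kept = filter_by_attention_checks(responses, attention_answers)
print(kept)  # {'ann_1': {'q1': 'a', 'q2': 'b'}}
```

Removing the check items from the returned labels keeps them out of later agreement and preference statistics, where they would inflate agreement.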

Reporting Results

Statistical Significance

McNemar's Test

\chi^2 = \frac{(b - c)^2}{b + c}

Here,

$b$ = Cases where model A wins and B loses
$c$ = Cases where model B wins and A loses

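from scipy.stats import chi2

def mcnemar_test(b, c):
    """McNemar's test for paired comparisons (b and c are the discordant counts)."""
    if b + c == 0:
        raise ValueError("b + c must be positive")
    chi2_stat = (b - c) ** 2 / (b + c)
    # Survival function: same as 1 - cdf, but more accurate for tiny p-values
    p_value = chi2.sf(chi2_stat, df=1)
    return chi2_stat, p_value

# Example: Compare two models
# Model A wins 120 times, Model B wins 80 times; ties are not counted
b, c = 120, 80
# Use a name other than chi2 so the scipy.stats import is not overwritten
stat, p = mcnemar_test(b, c)
print(f"Chi-squared: {stat:.2f}, p-value: {p:.4f}")
# If p < 0.05, the difference is statistically significant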
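
The chi-squared form of McNemar's test is an approximation that becomes unreliable when b + c is small. In that case an exact two-sided binomial test on the discordant counts is a common alternative. The sketch below uses only the standard library and compares the two on small, made-up counts; the function names are illustrative.

```python
import math


def mcnemar_chi2_p(b: int, c: int) -> float:
    """p-value from the chi-squared approximation (df = 1)."""
    stat = (b - c) ** 2 / (b + c)
    # For one degree of freedom, the chi-squared survival function
    # equals erfc(sqrt(x / 2))
    return math.erfc(math.sqrt(stat / 2))


def mcnemar_exact_p(b: int, c: int) -> float:
    """Exact two-sided p-value: under H0 each discordant pair is a fair coin flip."""
    n = b + c
    k = min(b, c)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Made-up small sample: model A preferred 12 times, model B 4 times
b, c = 12, 4
p_approx = mcnemar_chi2_p(b, c)
p_exact = mcnemar_exact_p(b, c)
print(f"Chi-squared approximation: p = {p_approx:.4f}")  # about 0.0455
print(f"Exact binomial test:       p = {p_exact:.4f}")   # about 0.0768
```

With 16 discordant pairs the approximation falls below 0.05 while the exact test does not, so the choice of test changes the conclusion. With hundreds of comparisons the two agree closely.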

Reporting Checklist

Element	Description
Sample size	Number of examples evaluated
Annotator count	Number of annotators per example
Inter-annotator agreement	Fleiss' Kappa or similar
Statistical significance	p-values for model comparisons
Confidence intervals	Uncertainty in ratings
Demographics	Annotator background (if relevant)

Always report confidence intervals alongside point estimates. A model rated 4.2 ± 0.1 is very different from 4.2 ± 0.5, even though the point estimate is the same.
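
As an illustration, the sketch below computes a mean rating with an approximate 95% interval from the standard error. The ten ratings are invented, and the normal critical value z = 1.96 is an assumption that is only rough for a sample this small; a t-based or bootstrap interval would be somewhat wider.

```python
import math
import statistics


def mean_with_ci(ratings, z=1.96):
    """Return (mean, half_width) of an approximate 95% confidence interval."""
    mean = statistics.fmean(ratings)
    # Standard error of the mean, using the sample standard deviation
    sem = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, z * sem


# Invented 1-5 Likert ratings from ten annotations
ratings = [4, 5, 4, 3, 5, 4, 4, 5, 4, 4]
mean, half_width = mean_with_ci(ratings)
print(f"Mean rating: {mean:.1f} ± {half_width:.2f}")  # 4.2 ± 0.39
```

The half-width shrinks with the square root of the sample size, so cutting it from about 0.4 to about 0.1 would take roughly sixteen times as many ratings.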

Practice Exercises

Conceptual: Explain why Chatbot Arena's blind comparison design reduces bias. What biases would be introduced if annotators knew which model they were evaluating?
Mathematical: Given 100 pairwise comparisons where Model A wins 55 times and Model B wins 45 times, compute the p-value using McNemar's test. Is the difference statistically significant?
Practical: Design an annotation protocol for evaluating code generation quality. What dimensions should annotators rate, and how will you ensure inter-annotator agreement?
Research: Compare human preferences on Chatbot Arena with automated metrics (BLEU, BERTScore). How often do human preferences disagree with automated metrics, and what explains the disagreements?

Key Takeaways:

Human evaluation is the gold standard for assessing LLM quality
Chatbot Arena provides crowdsourced Elo ratings through blind comparisons
Bradley-Terry model estimates pairwise preference probabilities
Inter-annotator agreement (Fleiss' Kappa) measures annotation reliability
Multiple annotations (3-5) with majority vote improve reliability
Always include attention checks and screening questions
Report confidence intervals and statistical significance
Blind evaluation reduces bias in model comparison

What to Learn Next

-> Automated LLM Evaluation: LLM-as-judge, G-Eval, and automatic metrics.

-> LLM Evaluation Frameworks: lm-eval-harness, OpenCompass, and HELM.

-> LLM Evaluation Benchmarks: Understanding standard benchmarks for LLMs.

-> RLHF and Alignment: Training models to align with human preferences.

-> Constitutional AI: Training safe and aligned language models.

-> DPO and Preference Optimization: Direct preference optimization for alignment.
