Human Evaluation of LLMs

Human Evaluation of LLMs — The Gold Standard

Automated metrics capture some aspects of LLM quality, but human judgment remains the gold standard for evaluating fluency, helpfulness, and safety. This guide covers Chatbot Arena, preference studies, and annotation methodologies.

  • Chatbot Arena: Crowdsourced Elo ratings through blind comparison
  • Preference Studies: Side-by-side comparisons and rating scales
  • Annotation Design: Creating reliable human evaluation protocols
  • Inter-Annotator Agreement: Ensuring evaluation consistency

The ultimate test of a language model is whether humans find it useful.

Human Evaluation of LLMs

While automated benchmarks provide consistent metrics, human evaluation captures aspects that automated metrics miss: helpfulness, creativity, safety, and real-world utility. Human evaluation is essential for evaluating open-ended generation, dialogue quality, and alignment with human preferences.

Definition: Human Evaluation

Human evaluation is the process of assessing LLM outputs using human judgment. It measures qualities like helpfulness, fluency, safety, and alignment with human preferences that automated metrics often fail to capture.

Chatbot Arena

Overview

Definition: Chatbot Arena

Chatbot Arena (LMSYS) is a crowdsourced platform for evaluating LLMs through blind side-by-side comparisons. Users send a prompt to two anonymous models and vote for the better response. The votes are converted into Elo ratings, the same rating scheme used for chess rankings.

Elo Rating System

Elo Rating Update

$$R_i' = R_i + K(S_i - E_i)$$

Here,

  • $R_i$ = Current rating of model i
  • $K$ = Update factor (typically 32)
  • $S_i$ = Actual score (1 for win, 0.5 for tie, 0 for loss)
  • $E_i$ = Expected score

Expected Score

$$E_i = \frac{1}{1 + 10^{(R_j - R_i)/400}}$$

Here,

  • $E_i$ = Expected score for model i
  • $R_j$ = Rating of opponent model j
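
To make the two formulas concrete, here is a minimal sketch of a single rating update. The function names `expected_score` and `elo_update` are illustrative choices, not part of any Arena codebase, and K = 32 is simply the typical value quoted above.

```python
def expected_score(rating_i, rating_j):
    """Expected score of model i against model j under the Elo formula."""
    return 1.0 / (1.0 + 10 ** ((rating_j - rating_i) / 400))


def elo_update(rating_i, rating_j, score_i, k=32):
    """Return the updated (rating_i, rating_j) after one battle.

    score_i is 1 for a win by model i, 0.5 for a tie, 0 for a loss.
    """
    e_i = expected_score(rating_i, rating_j)
    e_j = 1.0 - e_i
    new_i = rating_i + k * (score_i - e_i)
    new_j = rating_j + k * ((1.0 - score_i) - e_j)
    return new_i, new_j


# Two equally rated models: the expected score is 0.5, so a win moves 16 points.
print(elo_update(1000, 1000, 1))  # (1016.0, 984.0)

# An upset: the lower-rated model wins and gains more than 16 points.
print(elo_update(1000, 1200, 1))
```

The update is zero-sum: whatever one model gains, the other loses, so the total of the two ratings stays the same.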

Chatbot Arena Architecture

class ChatbotArena:
    """Simplified Chatbot Arena backend."""
    
    def __init__(self):
        self.models = {}
        self.battles = []
        self.k_factor = 32
    
    def register_model(self, model_id, model_handler):
        """Register a model for evaluation."""
        self.models[model_id] = {
            "handler": model_handler,
            "rating": 1000,  # Initial ELO
            "battles": 0,
            "wins": 0,
            "losses": 0,
            "ties": 0
        }
    
    def start_battle(self, user_query, model_a_id=None, model_b_id=None):
        """Start a blind comparison battle."""
        import random

        # Randomly select two models if not specified
        if model_a_id is None or model_b_id is None:
            available = list(self.models.keys())
            model_a_id, model_b_id = random.sample(available, 2)

        # Get responses from both models
        response_a = self.models[model_a_id]["handler"](user_query)
        response_b = self.models[model_b_id]["handler"](user_query)

        # Randomly assign to positions (A or B)
        if random.random() > 0.5:
            model_a_id, model_b_id = model_b_id, model_a_id
            response_a, response_b = response_b, response_a

        battle = {
            "battle_id": len(self.battles),
            "model_a": {"id": model_a_id, "response": response_a},
            "model_b": {"id": model_b_id, "response": response_b},
            "query": user_query
        }
        # Store the battle so record_vote can look it up by battle_id
        self.battles.append(battle)
        # A real frontend would show only the responses and keep the
        # model ids hidden until the vote has been cast
        return battle
    
    def record_vote(self, battle_id, winner):
        """Record user vote ("a", "b", or anything else for a tie) and update ratings."""
        battle = self.battles[battle_id]

        model_a_id = battle["model_a"]["id"]
        model_b_id = battle["model_b"]["id"]

        # Update ratings
        if winner == "a":
            self._update_ratings(model_a_id, model_b_id, 1)
        elif winner == "b":
            self._update_ratings(model_b_id, model_a_id, 1)
        else:
            # Tie: one call already updates both models with score 0.5;
            # calling it twice would double-count the battle
            self._update_ratings(model_a_id, model_b_id, 0.5)
    
    def _update_ratings(self, winner_id, loser_id, score):
        """Update ELO ratings for both models."""
        winner = self.models[winner_id]
        loser = self.models[loser_id]
        
        # Expected scores
        exp_winner = 1 / (1 + 10 ** ((loser["rating"] - winner["rating"]) / 400))
        exp_loser = 1 - exp_winner
        
        # Update ratings
        winner["rating"] += self.k_factor * (score - exp_winner)
        loser["rating"] += self.k_factor * ((1 - score) - exp_loser)
        
        # Update statistics
        winner["battles"] += 1
        loser["battles"] += 1
        
        if score == 1:
            winner["wins"] += 1
            loser["losses"] += 1
        elif score == 0:
            winner["losses"] += 1
            loser["wins"] += 1
        else:
            winner["ties"] += 1
            loser["ties"] += 1
    
    def get_leaderboard(self):
        """Get current model rankings."""
        return sorted(
            self.models.items(),
            key=lambda x: x[1]["rating"],
            reverse=True
        )

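Chatbot Arena had collected over 1 million human votes by 2024, making it one of the largest human preference datasets for LLMs. Online Elo updates depend on the order in which battles arrive, so published rankings are usually accompanied by confidence intervals, typically estimated by bootstrapping over the recorded battles.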
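
One common way to attach confidence intervals to Elo ratings is a percentile bootstrap: resample the battle log with replacement, recompute the ratings, and read off the spread. The sketch below assumes a simple `(model_a, model_b, winner)` record format and hypothetical model names; it is not the Arena's actual pipeline.

```python
import random


def compute_elo(battles, k=32, initial=1000):
    """Replay (model_a, model_b, winner) records and return final ratings.

    winner is "a", "b", or "tie".
    """
    ratings = {}
    for model_a, model_b, winner in battles:
        r_a = ratings.setdefault(model_a, initial)
        r_b = ratings.setdefault(model_b, initial)
        e_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
        s_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
        ratings[model_a] = r_a + k * (s_a - e_a)
        ratings[model_b] = r_b + k * ((1 - s_a) - (1 - e_a))
    return ratings


def bootstrap_elo_intervals(battles, n_rounds=200, alpha=0.05, seed=0):
    """Percentile bootstrap intervals for each model's Elo rating."""
    rng = random.Random(seed)
    samples = {}
    for _ in range(n_rounds):
        # Resample the battle log with replacement, keeping its size
        resampled = rng.choices(battles, k=len(battles))
        for model, rating in compute_elo(resampled).items():
            samples.setdefault(model, []).append(rating)
    intervals = {}
    for model, values in samples.items():
        values.sort()
        lower = values[int((alpha / 2) * (len(values) - 1))]
        upper = values[int((1 - alpha / 2) * (len(values) - 1))]
        intervals[model] = (lower, upper)
    return intervals


# Hypothetical battle log: model "x" beats "y" in about 70% of 200 battles.
rng = random.Random(42)
battles = [("x", "y", "a" if rng.random() < 0.7 else "b") for _ in range(200)]
for model, (lower, upper) in bootstrap_elo_intervals(battles).items():
    print(f"{model}: [{lower:.0f}, {upper:.0f}]")
```

Two models whose intervals overlap heavily should not be reported as clearly ranked, even if their point estimates differ.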

Preference Studies

Bradley-Terry Model

Definition: Bradley-Terry Model

The Bradley-Terry model is a statistical model for pairwise comparisons. It estimates the probability that one model is preferred over another based on their latent "quality" parameters.

Bradley-Terry Probability

$$P(i \succ j) = \frac{\pi_i}{\pi_i + \pi_j}$$

Here,

  • $P(i \succ j)$ = Probability that model i is preferred over model j
  • $\pi_i$ = Quality parameter for model i
  • $\pi_j$ = Quality parameter for model j

import numpy as np
from scipy.optimize import minimize

class BradleyTerryModel:
    """Bradley-Terry model for pairwise comparisons."""
    
    def __init__(self, n_models):
        self.n_models = n_models
        self.params = np.zeros(n_models)  # Log-quality parameters
    
    def fit(self, comparisons):
        """Fit model from pairwise comparisons.
        
        comparisons: list of (winner_id, loser_id) tuples
        """
        def neg_log_likelihood(params):
            ll = 0
            for winner, loser in comparisons:
                # P(winner > loser)
                prob = 1 / (1 + np.exp(params[loser] - params[winner]))
                ll += np.log(prob + 1e-10)
            return -ll
        
        result = minimize(
            neg_log_likelihood,
            self.params,
            method='L-BFGS-B'
        )
        
        self.params = result.x
        return self
    
    def predict(self, model_i, model_j):
        """Predict probability that model i is preferred."""
        return 1 / (1 + np.exp(self.params[model_j] - self.params[model_i]))
    
    def get_rankings(self):
        """Get model rankings by quality parameter."""
        return np.argsort(-self.params)
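
The Elo expected-score formula and the Bradley-Terry probability are the same model in different units: setting the quality parameter to 10^(R/400) turns one into the other. The short sketch below checks this numerically and converts a natural-log quality parameter, such as the ones fitted above, to an Elo-like scale. The helper names and the base rating of 1000 are assumptions made for illustration.

```python
import math


def bt_probability(pi_i, pi_j):
    """Bradley-Terry probability that model i is preferred over model j."""
    return pi_i / (pi_i + pi_j)


def elo_expected(rating_i, rating_j):
    """Elo expected score of model i against model j."""
    return 1 / (1 + 10 ** ((rating_j - rating_i) / 400))


def rating_to_quality(rating):
    """Map an Elo-style rating to a Bradley-Terry quality parameter."""
    return 10 ** (rating / 400)


def log_quality_to_rating(theta, scale=400, base=1000):
    """Map a natural-log quality parameter to an Elo-like scale."""
    return base + scale * theta / math.log(10)


r_i, r_j = 1100, 1000
p_bt = bt_probability(rating_to_quality(r_i), rating_to_quality(r_j))
p_elo = elo_expected(r_i, r_j)
print(round(p_bt, 4), round(p_elo, 4))  # both about 0.64
```

Because only differences between log-quality parameters affect the probabilities, the fitted parameters are determined only up to an additive constant; the choice of base rating is a convention.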

Rating Scales

| Scale Type | Description | Use Case |
| --- | --- | --- |
| Likert | 1-5 or 1-7 rating | Overall quality |
| Comparison | Side-by-side preference | A/B testing |
| Ranking | Order multiple outputs | Multi-model comparison |
| Binary | Acceptable/Not acceptable | Quality thresholds |
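
Ranking data can be reduced to the pairwise format used by the Bradley-Terry `fit` method shown earlier, which takes `(winner_id, loser_id)` tuples. A minimal sketch, assuming each ranking is a list ordered from best to worst (the model names are placeholders, and the earlier class would need integer ids instead):

```python
from itertools import combinations


def ranking_to_pairs(ranking):
    """Expand a best-to-worst ranking into (winner, loser) pairs."""
    return [(better, worse) for better, worse in combinations(ranking, 2)]


print(ranking_to_pairs(["model_b", "model_a", "model_c"]))
# [('model_b', 'model_a'), ('model_b', 'model_c'), ('model_a', 'model_c')]
```

Pairs derived from one ranking are not independent observations, so treating them as separate votes tends to overstate the amount of evidence.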

Designing Rating Scales

class RatingScale:
    """Design and validate rating scales for LLM evaluation."""
    
    LIKERT_5 = {
        1: "Very Poor",
        2: "Poor", 
        3: "Acceptable",
        4: "Good",
        5: "Excellent"
    }
    
    HELPFULNESS_SCALE = {
        1: "Not helpful at all",
        2: "Slightly helpful",
        3: "Moderately helpful",
        4: "Very helpful",
        5: "Extremely helpful"
    }
    
    SAFETY_SCALE = {
        1: "Unsafe - harmful content",
        2: "Somewhat unsafe - questionable content",
        3: "Neutral - no safety concerns",
        4: "Safe - appropriate content",
        5: "Very safe - helpful safety information"
    }
    
    @staticmethod
    def create_task_specific_scale(task_type):
        """Create appropriate scale for task type."""
        if task_type == "summarization":
            return {
                "dimensions": ["accuracy", "completeness", "conciseness", "fluency"],
                "scale": RatingScale.LIKERT_5
            }
        elif task_type == "code_generation":
            return {
                "dimensions": ["correctness", "efficiency", "readability", "documentation"],
                "scale": RatingScale.LIKERT_5
            }
        elif task_type == "conversation":
            return {
                "dimensions": ["helpfulness", "coherence", "engagement", "safety"],
                "scale": RatingScale.HELPFULNESS_SCALE
            }
        # Fail loudly instead of silently returning None
        raise ValueError(f"Unknown task type: {task_type}")

Annotation Methodologies

Single Annotation

Definition: Single Annotation

Single annotation assigns one annotator per example. It's fast and cheap but lacks reliability measures. Use only for initial screening or when inter-annotator agreement is expected to be high.

Multiple Annotations

Definition: Multiple Annotations

Multiple annotations assign multiple annotators (typically 3-5) per example. The majority vote or average rating provides more reliable estimates and enables computing inter-annotator agreement.

import statistics
from collections import Counter

class AnnotationAggregator:
    """Aggregate multiple annotations into final scores."""
    
    @staticmethod
    def majority_vote(annotations):
        """Use majority vote for categorical labels."""
        counter = Counter(annotations)
        return counter.most_common(1)[0][0]
    
    @staticmethod
    def weighted_average(annotations, weights=None):
        """Compute weighted average for ordinal ratings."""
        if weights is None:
            weights = [1] * len(annotations)
        
        total = sum(a * w for a, w in zip(annotations, weights))
        return total / sum(weights)
    
    @staticmethod
    def compute_agreement(annotations_list):
        """Compute inter-annotator agreement (Fleiss' Kappa).

        annotations_list: one list of labels per rater, all covering the
        same items in the same order (at least two raters).
        """
        n_items = len(annotations_list[0])
        n_raters = len(annotations_list)

        # Per-item agreement: share of rater pairs that chose the same label
        p_items = []
        category_totals = Counter()
        for item_idx in range(n_items):
            counts = Counter(ann[item_idx] for ann in annotations_list)
            category_totals.update(counts)
            agreeing_pairs = sum(c * (c - 1) for c in counts.values())
            p_items.append(agreeing_pairs / (n_raters * (n_raters - 1)))

        # Observed agreement: mean of the per-item agreements
        p_o = sum(p_items) / n_items

        # Expected agreement: sum of squared category proportions
        p_e = sum((count / (n_items * n_raters)) ** 2
                  for count in category_totals.values())

        if p_e == 1:
            return 1.0  # every rating is in the same category
        kappa = (p_o - p_e) / (1 - p_e)
        return kappa

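Fleiss' Kappa measures inter-annotator agreement beyond chance. Under the commonly cited Landis and Koch guidelines, values above 0.8 indicate almost perfect agreement, 0.6-0.8 substantial agreement, and 0.4-0.6 moderate agreement. Values below that suggest the annotation scheme or the annotator training needs improvement.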
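
For reference, the standard definition of Fleiss' Kappa can be written out in the same style as the other formulas in this lesson, followed by a small worked example. The example data (4 items, 3 raters, two categories) is made up for illustration.

```latex
% N items, n raters per item, n_{ij} = number of raters assigning item i to category j
P_i = \frac{1}{n(n-1)} \sum_{j} n_{ij}\,(n_{ij} - 1), \qquad
\bar{P} = \frac{1}{N} \sum_{i=1}^{N} P_i

p_j = \frac{1}{N n} \sum_{i=1}^{N} n_{ij}, \qquad
\bar{P}_e = \sum_{j} p_j^2, \qquad
\kappa = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e}

% Worked example: N = 4, n = 3, per-item counts (good, bad) = (3,0), (2,1), (0,3), (1,2)
P_1 = 1, \quad P_2 = \tfrac{1}{3}, \quad P_3 = 1, \quad P_4 = \tfrac{1}{3}
\;\Rightarrow\; \bar{P} = \tfrac{2}{3}

p_{\text{good}} = p_{\text{bad}} = \tfrac{6}{12}
\;\Rightarrow\; \bar{P}_e = \tfrac{1}{2}, \qquad
\kappa = \frac{2/3 - 1/2}{1 - 1/2} = \tfrac{1}{3}
```

Here the raters agree fully on two of the four items, yet kappa is only about 0.33, because half of that raw agreement would be expected by chance with two equally frequent categories.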

Annotation Protocols

| Protocol | Description | Trade-off |
| --- | --- | --- |
| Blind | Annotators don't know model identity | Reduces bias, more expensive |
| Randomized | Random order of outputs | Reduces position bias |
| Calibration | Include gold-standard examples | Detects low-quality annotators |
| Training | Practice examples before evaluation | Improves consistency |
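
The blind and randomized protocols can be combined in a few lines: shuffle the outputs, show them under neutral labels, and keep the label-to-model key away from annotators. This is a sketch under assumed data shapes; `build_blind_task` and the model names are hypothetical.

```python
import random


def build_blind_task(item_id, outputs, rng):
    """Shuffle model outputs and hide model names behind neutral labels.

    outputs: dict mapping model name -> response text.
    Returns (task, key): the task is shown to the annotator, the key is
    stored separately and used to unblind the votes afterwards.
    """
    models = list(outputs)
    rng.shuffle(models)  # random presentation order reduces position bias
    labels = [chr(ord("A") + i) for i in range(len(models))]
    task = {
        "item_id": item_id,
        "responses": {label: outputs[m] for label, m in zip(labels, models)},
    }
    key = dict(zip(labels, models))
    return task, key


outputs = {"model_x": "Response from X", "model_y": "Response from Y"}
rng = random.Random(0)  # fixed seed so the assignment is reproducible
task, key = build_blind_task("q1", outputs, rng)

vote = "A"                      # label chosen by the annotator
print("preferred:", key[vote])  # unblinded model name
```

Logging the seed or the key per item also makes it possible to audit position bias later, since the presentation order of every comparison can be reconstructed.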

Quality Control

Annotator Screening

class AnnotatorScreening:
    """Screen annotators for quality."""
    
    def __init__(self, gold_standards, min_accuracy=0.8):
        self.gold_standards = gold_standards
        self.min_accuracy = min_accuracy
    
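    def screen_annotator(self, annotations):
        """Check if annotator meets quality threshold.

        annotations must be aligned with self.gold_standards
        (same items, same order).
        """
        if not annotations or len(annotations) != len(self.gold_standards):
            raise ValueError("annotations must match the gold standards in length")
        correct = sum(1 for ann, gold in zip(annotations, self.gold_standards)
                     if ann == gold)
        accuracy = correct / len(annotations)

        return accuracy >= self.min_accuracy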
    
    def detect_biased_annotators(self, all_annotations):
        """Identify annotators with systematic biases."""
        # Check for position bias (always preferring first/second option)
        position_biases = []
        
        for annotator_id, annotations in all_annotations.items():
            first_wins = sum(1 for ann in annotations if ann["preferred"] == "first")
            bias = abs(first_wins / len(annotations) - 0.5)
            position_biases.append((annotator_id, bias))
        
        # Flag annotators with extreme biases
        return [aid for aid, bias in position_biases if bias > 0.3]

Attention Checks

Include attention check questions where the correct answer is obvious. Annotators who fail these checks are likely not reading carefully and their annotations should be excluded.
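
A minimal sketch of this filtering step, assuming responses are stored per annotator as a mapping from item id to answer. The function name, the data layout, and the zero-tolerance default are illustrative assumptions.

```python
def failed_attention_checks(responses, checks, max_failures=0):
    """Return annotator ids whose attention-check failures exceed max_failures.

    responses: dict annotator_id -> dict item_id -> answer
    checks: dict item_id -> the obvious correct answer
    A missing answer to a check item counts as a failure.
    """
    flagged = []
    for annotator_id, answers in responses.items():
        failures = sum(
            1 for item_id, expected in checks.items()
            if answers.get(item_id) != expected
        )
        if failures > max_failures:
            flagged.append(annotator_id)
    return flagged


checks = {"check_1": "b", "check_2": "a"}
responses = {
    "ann_1": {"q1": "a", "check_1": "b", "check_2": "a"},
    "ann_2": {"q1": "b", "check_1": "a", "check_2": "a"},
}
excluded = failed_attention_checks(responses, checks)
kept = {a: r for a, r in responses.items() if a not in excluded}
print(excluded)  # ['ann_2']
```

Reporting how many annotators were excluded, and by which rule, keeps the filtering step transparent.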

Reporting Results

Statistical Significance

McNemar's Test

$$\chi^2 = \frac{(b - c)^2}{b + c}$$

Here,

  • $b$ = Cases where model A wins and B loses
  • $c$ = Cases where model B wins and A loses

from scipy.stats import chi2

def mcnemar_test(b, c):
    """McNemar's test for paired comparisons."""
    chi2_stat = (b - c) ** 2 / (b + c)
    p_value = 1 - chi2.cdf(chi2_stat, df=1)
    return chi2_stat, p_value

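# Example: Compare two models
# Out of 300 comparisons, Model A wins 120, Model B wins 80, and 100 are ties
# (ties do not enter the test statistic)
b, c = 120, 80
# Use a new name for the statistic: assigning to chi2 would shadow the scipy import
chi2_stat, p = mcnemar_test(b, c)
print(f"Chi-squared: {chi2_stat:.2f}, p-value: {p:.4f}")
# If p < 0.05, the difference is statistically significant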
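
The chi-squared form is a large-sample approximation. When b + c is small, an exact two-sided binomial test on the same counts is a common alternative. The sketch below uses only the standard library; `exact_sign_test` is an illustrative name, not a SciPy function.

```python
from math import comb


def exact_sign_test(b, c):
    """Two-sided exact binomial p-value for b wins vs c wins (ties excluded).

    Under the null hypothesis each non-tied comparison is a fair coin flip.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


print(exact_sign_test(9, 1))  # 0.021484375
```

With 9 wins against 1, the exact p-value is 22/1024, so the difference is significant at the 0.05 level even though only ten non-tied comparisons were observed.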

Reporting Checklist

| Element | Description |
| --- | --- |
| Sample size | Number of examples evaluated |
| Annotator count | Number of annotators per example |
| Inter-annotator agreement | Fleiss' Kappa or similar |
| Statistical significance | p-values for model comparisons |
| Confidence intervals | Uncertainty in ratings |
| Demographics | Annotator background (if relevant) |

Always report confidence intervals alongside point estimates. A model rated 4.2 ± 0.1 is very different from 4.2 ± 0.5, even though the point estimate is the same.
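
A minimal sketch of where the "±" comes from, using a normal approximation (mean ± 1.96 standard errors). It treats 1-5 ratings as interval data, which is a simplification, and the ratings below are made up for illustration.

```python
import math
import statistics


def mean_with_interval(ratings, z=1.96):
    """Return (mean, half_width) of an approximate 95% interval for the mean."""
    mean = statistics.fmean(ratings)
    std_err = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, z * std_err


# Hypothetical 1-5 ratings for one model
small_sample = [5, 4, 4, 5, 3, 4, 5, 4, 2, 5, 4, 4]
large_sample = small_sample * 4  # same distribution, four times as many ratings

for name, ratings in [("n=12", small_sample), ("n=48", large_sample)]:
    mean, half_width = mean_with_interval(ratings)
    print(f"{name}: {mean:.2f} ± {half_width:.2f}")
```

Both samples have the same mean, but the interval narrows roughly with the square root of the sample size, which is why the number of ratings behind a score matters as much as the score itself.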

Practice Exercises

  1. Conceptual: Explain why Chatbot Arena's blind comparison design reduces bias. What biases would be introduced if annotators knew which model they were evaluating?

  2. Mathematical: Given 100 pairwise comparisons where Model A wins 55 times and Model B wins 45 times, compute the p-value using McNemar's test. Is the difference statistically significant?

  3. Practical: Design an annotation protocol for evaluating code generation quality. What dimensions should annotators rate, and how will you ensure inter-annotator agreement?

  4. Research: Compare human preferences on Chatbot Arena with automated metrics (BLEU, BERTScore). How often do human preferences disagree with automated metrics, and what explains the disagreements?

Key Takeaways:

  • Human evaluation is the gold standard for assessing LLM quality
  • Chatbot Arena provides crowdsourced Elo ratings through blind comparisons
  • Bradley-Terry model estimates pairwise preference probabilities
  • Inter-annotator agreement (Fleiss' Kappa) measures annotation reliability
  • Multiple annotations (3-5) with majority vote improve reliability
  • Always include attention checks and screening questions
  • Report confidence intervals and statistical significance
  • Blind evaluation reduces bias in model comparison

What to Learn Next

-> Automated LLM Evaluation: LLM-as-judge, G-Eval, and automatic metrics.

-> LLM Evaluation Frameworks: lm-eval-harness, OpenCompass, and HELM.

-> LLM Evaluation Benchmarks: Understanding standard benchmarks for LLMs.

-> RLHF and Alignment: Training models to align with human preferences.

-> Constitutional AI: Training safe and aligned language models.

-> DPO and Preference Optimization: Direct preference optimization for alignment.
