Model Evaluation
Evaluation Metrics
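```python
# BLEU score
from nltk.translate.bleu_score import sentence_bleu, corpus_bleu

def calculate_bleu(references, hypotheses):
    # references: for each hypothesis, a list of tokenized reference sentences
    # hypotheses: one tokenized sentence (a list of tokens) per example
    return corpus_bleu(references, hypotheses)
```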
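To make concrete what `corpus_bleu` builds on, here is a minimal sketch of the clipped (modified) n-gram precision at the core of BLEU. It is an illustration only, not NLTK's implementation: the helper names `ngrams` and `modified_precision` are made up for this example, and the brevity penalty and the geometric mean over n-gram orders are left out.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(references, hypothesis, n):
    """Clipped n-gram precision of one hypothesis against its references."""
    hyp_counts = ngrams(hypothesis, n)
    if not hyp_counts:
        return 0.0
    # For each n-gram, keep the largest count seen in any single reference.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    # A hypothesis n-gram is credited at most as often as a reference contains it.
    clipped = sum(min(count, max_ref_counts[gram]) for gram, count in hyp_counts.items())
    return clipped / sum(hyp_counts.values())

references = [["the", "cat", "is", "on", "the", "mat"]]
hypothesis = ["the", "the", "the", "the"]
print(modified_precision(references, hypothesis, 1))  # 0.5
```

Clipping is what stops a degenerate hypothesis from scoring well by repeating a common word: "the" appears four times in the hypothesis but only twice in the reference, so only two of the four matches count.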
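```python
# ROUGE score
from rouge_score import rouge_scorer

def calculate_rouge(reference, hypothesis):
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    # score(target, prediction) returns a dict mapping each ROUGE type
    # to a Score tuple with precision, recall, and fmeasure fields
    return scorer.score(reference, hypothesis)
```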
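As a rough illustration of what the `rougeL` entry measures, the sketch below computes ROUGE-L from the longest common subsequence (LCS) of the two token sequences. It assumes plain whitespace tokenization and no stemming, so its numbers can differ from the `rouge_score` output; `lcs_length` and `rouge_l` are names invented for this example.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            # Extend the diagonal on a match, otherwise carry the best neighbor.
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(reference, hypothesis):
    """Return (precision, recall, f1) for ROUGE-L on whitespace tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    lcs = lcs_length(ref, hyp)
    if lcs == 0:
        return 0.0, 0.0, 0.0
    precision = lcs / len(hyp)
    recall = lcs / len(ref)
    return precision, recall, 2 * precision * recall / (precision + recall)

p, r, f1 = rouge_l("the cat sat on the mat", "the cat is on the mat")
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Because the LCS only requires tokens to appear in the same order, not contiguously, ROUGE-L rewards sentence-level structure that ROUGE-1 and ROUGE-2 cannot see.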
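```python
# BERTScore
from bert_score import score

def calculate_bertscore(references, hypotheses):
    # score() takes candidates first, then references, and returns
    # precision, recall, and F1 tensors with one entry per pair
    P, R, F1 = score(hypotheses, references, lang="en", verbose=True)
    return F1.mean().item()
```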
LLM-as-Judge
```python
def llm_judge(question, response_a, response_b, judge_model):
    # judge_model is expected to expose generate(prompt) and return the reply text
    prompt = f"""You are an expert judge evaluating AI responses.

Question: {question}

Response A: {response_a}

Response B: {response_b}

Which response is better? Consider:
1. Accuracy and factual correctness
2. Completeness and thoroughness
3. Clarity and readability
4. Helpfulness

Output your judgment as "A" or "B" followed by a brief explanation."""
    return judge_model.generate(prompt)
```
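Pairwise judges can favor whichever response is shown first, so a common precaution is to ask twice with the order swapped and accept a verdict only when both runs agree. The sketch below shows one way to do that. It assumes the reply starts with "A" or "B", as the prompt requests; `judge_fn`, `parse_verdict`, `judge_both_orders`, and the stub judge are hypothetical names used only for illustration.

```python
def parse_verdict(text):
    """Return 'A' or 'B' from the first character of a judge reply, else None."""
    verdict = text.strip()[:1].upper()
    return verdict if verdict in ("A", "B") else None

def judge_both_orders(question, response_a, response_b, judge_fn):
    """Ask the judge twice with the responses swapped; return 'A', 'B', or 'tie'.

    judge_fn(question, first, second) is a hypothetical callable that returns
    the judge's raw reply text, beginning with "A" or "B".
    """
    first = parse_verdict(judge_fn(question, response_a, response_b))
    second = parse_verdict(judge_fn(question, response_b, response_a))
    # In the second call the labels are swapped, so map them back.
    second = {"A": "B", "B": "A"}.get(second)
    if first is not None and first == second:
        return first
    return "tie"  # the two runs disagree, or a reply could not be parsed

# Stub judge for illustration only: it prefers the longer response.
def longer_is_better(question, first, second):
    return "A: longer answer" if len(first) >= len(second) else "B: longer answer"

print(judge_both_orders("What is 2 + 2?", "4", "The answer is 4.", longer_is_better))  # B
```

With a real judge, `judge_fn` could be a thin wrapper such as `lambda q, a, b: llm_judge(q, a, b, judge_model)`. A judge that always answers "A" regardless of content ends up as a tie under this scheme, which is the intended behavior.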
Benchmarks
| Benchmark | Task | Metric |
|---|---|---|
| MMLU | Knowledge | Accuracy |
| HumanEval | Code | Pass@k |
| GSM8K | Math | Accuracy |
| TruthfulQA | Honesty | % Truthful |
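
The Pass@k metric in the table is usually estimated rather than measured directly: generate n samples per problem, count the c samples that pass the unit tests, and compute 1 - C(n-c, k) / C(n, k), the unbiased estimator introduced alongside HumanEval. A minimal sketch for a single problem follows; the function name `pass_at_k` and the input validation are choices made for this example.

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimate pass@k from n samples of which c passed: 1 - C(n-c, k) / C(n, k)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)

# 10 samples for one problem, 3 of which pass the tests.
for k in (1, 5, 10):
    print(k, round(pass_at_k(10, 3, k), 4))
```

The benchmark-level score is the mean of this per-problem estimate over all problems.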
Summary
Effective evaluation combines automatic metrics such as BLEU, ROUGE, and BERTScore with human or LLM-based judgment and standard benchmarks, because no single score captures model quality on its own.
Next: We'll explore hallucination detection techniques.