Data Quality and Curation — The Hidden Driver of LLM Performance
Data quality often matters more than data quantity: a smaller, high-quality dataset can outperform a larger, noisy one. This guide covers the science of data curation for LLMs.
- Deduplication — Removing duplicates can improve performance by 10%+ on downstream tasks
- Data Mixing — The ratio of code, math, books, and web text dramatically affects capabilities
- Quality Filtering — Perplexity-based and classifier-based filtering separate signal from noise
The quality of what you feed the model determines the quality of what it produces.
Data Quality and Curation for LLMs
Data quality is arguably the most important factor in LLM performance, yet it receives far less attention than model architecture or training algorithms. The "Scaling Data-Constrained Language Models" paper found that repeating data for up to roughly 4 epochs is nearly as effective as training on fresh data, with returns diminishing quickly beyond that point. Carefully curated data, by contrast, can reach comparable performance with far fewer tokens, by some accounts as much as 10x fewer.
Definition: Data Quality
Data quality in the context of LLM training encompasses multiple dimensions: factual accuracy, linguistic quality, deduplication, diversity, recency, and domain coverage. High-quality training data directly translates to better downstream performance, fewer hallucinations, and more reliable reasoning.
The Data Quality Hierarchy
Level 1: Basic Filtering
Remove clearly low-quality data:
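def basic_quality_filter(document):
    """Return True if the document passes basic length and character heuristics."""
    # Reject documents that are too short or too long (in characters).
    if len(document) < 100 or len(document) > 100000:
        return False
    # Require a minimum number of whitespace-separated words.
    words = document.split()
    if len(words) < 20:
        return False
    # Alphabetic ratio over non-whitespace characters; counting spaces can
    # push ordinary prose below the 0.8 threshold.
    non_space = [c for c in document if not c.isspace()]
    alpha_ratio = sum(c.isalpha() for c in non_space) / len(non_space)
    if alpha_ratio < 0.8:
        return False
    # Average occurrences per distinct word; high values indicate repetition.
    word_repetition = len(words) / len(set(words))
    if word_repetition > 3.0:
        return False
    return True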
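When tuning heuristics like these, it helps to see which rule rejected a document. The sketch below is a hypothetical diagnostic helper (`rejection_reason` is not part of the original filter) that applies the same four checks and reports the first one that fails. It assumes the alphabetic ratio is measured over non-whitespace characters, and the sample texts are made up for illustration.

```python
from typing import Optional

def rejection_reason(document: str) -> Optional[str]:
    """Return the name of the first heuristic that rejects `document`, or None."""
    if not 100 <= len(document) <= 100_000:
        return "length"
    words = document.split()
    if len(words) < 20:
        return "too_few_words"
    # Assumption: the ratio is taken over non-whitespace characters.
    non_space = [c for c in document if not c.isspace()]
    alpha_ratio = sum(c.isalpha() for c in non_space) / len(non_space)
    if alpha_ratio < 0.8:
        return "low_alpha_ratio"
    if len(words) / len(set(words)) > 3.0:
        return "high_repetition"
    return None

# Made-up sample documents, one per expected outcome.
samples = {
    "short": "Too short.",
    "spam": "buy now " * 40,
    "numeric": "1234 5678 " * 30,
    "prose": "Careful curation of training text removes boilerplate, spam, and "
             "markup so that the remaining documents read like natural language "
             "written by people for other people to read and understand.",
}
for name, text in samples.items():
    print(name, "->", rejection_reason(text))
```

Counting rejections per reason over a sample of the corpus is a quick way to check whether one threshold is doing most of the filtering.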
Level 2: Perplexity-Based Filtering
Definition: Perplexity Filtering
Use a pre-trained language model to score document quality by perplexity. Documents with very high perplexity (noisy, incoherent) or very low perplexity (repetitive, templated) are filtered out.
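Document Perplexity
$$\mathrm{PPL}(d) = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log p(x_i \mid x_{<i})\right)$$
Here,
- $d$ = Document
- $N$ = Number of tokens in the document
- $p(x_i \mid x_{<i})$ = Language model probability of token $i$ given context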
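As a worked example, perplexity is the exponential of the average negative log-probability per token. The sketch below computes it directly from per-token probabilities. The probability lists are made-up illustrative values, not real model outputs.

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the average negative log-probability per token."""
    n = len(token_probs)
    avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / n
    return math.exp(avg_neg_log_prob)

# Hypothetical per-token probabilities p(x_i | x_<i) for three short documents.
fluent = [0.25, 0.5, 0.25, 0.5]       # moderately predictable text
templated = [0.99, 0.99, 0.99, 0.99]  # almost fully predictable (boilerplate)
noisy = [0.001, 0.002, 0.001, 0.002]  # very surprising tokens (garbled text)

for name, probs in [("fluent", fluent), ("templated", templated), ("noisy", noisy)]:
    print(f"{name}: {perplexity(probs):.2f}")
```

A model that assigns probability 0.25 to every token has perplexity 4, which is why perplexity is often read as the effective number of equally likely choices per token.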
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def compute_perplexity(doc, model, tokenizer, max_length=512):
    """Perplexity of the first `max_length` tokens of `doc` under `model`."""
    inputs = tokenizer(doc, return_tensors="pt", truncation=True, max_length=max_length)
    with torch.no_grad():
        # Passing labels makes the model return the mean token cross-entropy.
        outputs = model(**inputs, labels=inputs["input_ids"])
    return torch.exp(outputs.loss).item()

def perplexity_filter(documents, model_name="gpt2", low_thresh=10, high_thresh=1000):
    """Keep documents whose perplexity lies in [low_thresh, high_thresh]."""
    # Load the scoring model and tokenizer once, then score every document.
    model = AutoModelForCausalLM.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    filtered = []
    for doc in documents:
        ppl = compute_perplexity(doc, model, tokenizer)
        if low_thresh <= ppl <= high_thresh:
            filtered.append(doc)
    return filtered
Level 3: Classifier-Based Filtering
Definition: Classifier-Based Quality Filtering
Train a binary classifier on a small labeled dataset of high-quality (Wikipedia, books) vs. low-quality (web crawl) documents. Use the classifier to score and filter the full training corpus.
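def classifier_quality_score(document, classifier):
    # Probability of the positive ("high quality") class for one document.
    prob = classifier.predict_proba([document])[0][1]
    return prob

# `corpus` is the list of documents to filter; `quality_classifier` is a trained
# text classifier exposing a scikit-learn-style predict_proba.
high_quality_docs = [
    doc for doc in corpus
    if classifier_quality_score(doc, quality_classifier) > 0.7
]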
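The GPT-3 paper used a classifier of this kind: curated sources (WebText, Wikipedia, and a books corpus) served as positive examples, and unfiltered Common Crawl pages served as negative examples. This "WebText-like" filtering kept the Common Crawl documents that most resembled the curated sources, with the aim of raising the average quality of the web portion of the training data.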
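A minimal sketch of how a quality classifier of this kind could be trained, using only the standard library so that it stays self-contained. `TinyQualityClassifier` is a hypothetical stand-in (bag-of-words logistic regression without a bias term, trained by SGD). The toy documents and hyperparameters are assumptions, and a real pipeline would use an established library and far more data. The class exposes a scikit-learn-style `predict_proba`, so it could be passed to `classifier_quality_score` above.

```python
import math
from collections import Counter

class TinyQualityClassifier:
    """Bag-of-words logistic regression (hypothetical, for illustration only)."""

    def __init__(self, lr=0.5, epochs=50):
        self.lr = lr
        self.epochs = epochs
        self.weights = {}  # word -> weight

    @staticmethod
    def _features(doc):
        counts = Counter(doc.lower().split())
        total = sum(counts.values()) or 1
        return {w: c / total for w, c in counts.items()}  # term frequencies

    def _prob(self, feats):
        z = sum(self.weights.get(w, 0.0) * v for w, v in feats.items())
        return 1.0 / (1.0 + math.exp(-z))

    def fit(self, docs, labels):
        data = [(self._features(d), y) for d, y in zip(docs, labels)]
        for _ in range(self.epochs):
            for feats, y in data:
                error = y - self._prob(feats)  # gradient of the log-likelihood
                for w, v in feats.items():
                    self.weights[w] = self.weights.get(w, 0.0) + self.lr * error * v
        return self

    def predict_proba(self, docs):
        # Mirrors scikit-learn's layout: [P(low quality), P(high quality)] per doc.
        probs = [self._prob(self._features(d)) for d in docs]
        return [[1.0 - p, p] for p in probs]

# Toy training data (assumed): label 1 = high quality, 0 = low quality.
high_quality = [
    "the theorem follows from the lemma by induction",
    "the river flows north through the valley toward the sea",
]
low_quality = [
    "click here free prizes click here win now",
    "buy cheap pills online free shipping click now",
]
clf = TinyQualityClassifier().fit(high_quality + low_quality, [1, 1, 0, 0])
score = clf.predict_proba(["the lemma follows by induction"])[0][1]
spam = clf.predict_proba(["free prizes click now"])[0][1]
print(f"prose-like: {score:.2f}, spam-like: {spam:.2f}")
```

With only four training documents the scores mean little in absolute terms. The point is the workflow: label a small set, fit, then score the full corpus against a threshold.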
Deduplication
Why Deduplication Matters
Definition: Training Data Deduplication
Deduplication removes exact and near-duplicate documents from the training corpus. Duplicated data causes: (1) memorization instead of generalization, (2) biased gradients toward repeated content, (3) wasted compute on redundant examples.
| Deduplication Method | Type | Precision | Speed |
|---|---|---|---|
| Exact hash | Exact | 100% | Very fast |
| MinHash LSH | Near-exact | ~95% | Fast |
| SimHash | Approximate | ~90% | Very fast |
| SemDeDup | Semantic | ~85% | Slow |
Exact Deduplication
import hashlib

def exact_dedup(documents):
    """Keep the first occurrence of each byte-identical document."""
    seen_hashes = set()
    unique_docs = []
    for doc in documents:
        doc_hash = hashlib.sha256(doc.encode()).hexdigest()
        if doc_hash not in seen_hashes:
            seen_hashes.add(doc_hash)
            unique_docs.append(doc)
    return unique_docs
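Exact hashing only catches byte-identical documents, so trivial differences in case or whitespace slip through. One common mitigation, sketched here as an assumption rather than as part of the original pipeline, is to normalize text before hashing. The rules in `normalize` are illustrative choices, and both function names are hypothetical.

```python
import hashlib
import re
import unicodedata

def normalize(text: str) -> str:
    """Illustrative normalization: Unicode NFKC, lowercase, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    return re.sub(r"\s+", " ", text).strip()

def normalized_dedup(documents):
    """Keep the first document for each normalized SHA-256 hash."""
    seen, unique_docs = set(), []
    for doc in documents:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique_docs.append(doc)
    return unique_docs

docs = ["Hello   World\n", "hello world", "Hello, world"]
print(normalized_dedup(docs))
```

The first two strings collapse to the same normalized form, so only the first is kept. The third differs by punctuation and survives, which is the kind of case near-duplicate methods are meant to handle.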
Near-Duplicate Detection with MinHash
Definition: MinHash LSH
MinHash creates compact fingerprints of sets (e.g., shingle sets of documents). Locality-Sensitive Hashing (LSH) groups similar MinHash signatures into buckets, enabling efficient near-duplicate detection.
Jaccard Similarity
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
Here,
- $A$ = Shingle set of document 1
- $B$ = Shingle set of document 2
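A small worked example of Jaccard similarity on shingle sets. The sketch assumes word-level shingles of length 3. The `get_shingles` and `jaccard` helpers and the two sentences are illustrative and not taken from a specific library.

```python
def get_shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word shingles (contiguous word n-grams) in `text`."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a: set, b: set) -> float:
    """Size of the intersection over size of the union (1.0 if both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the quick brown fox leaps over the lazy dog"
A, B = get_shingles(doc1), get_shingles(doc2)
print(f"{len(A & B)} shared of {len(A | B)} total -> J = {jaccard(A, B):.2f}")
```

Changing a single word alters three of the seven shingles in each document, leaving 4 shared shingles out of 10 distinct ones, so J = 0.4. Longer shingles make the measure more sensitive to small edits.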
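from datasketch import MinHash, MinHashLSH

def get_shingles(doc, k=5):
    # Word-level shingles are an assumption; the original helper is not shown.
    # Documents shorter than k words yield a single shingle.
    words = doc.split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash_dedup(documents, threshold=0.8, num_perm=128):
    """Keep a document only if no previously kept document is a near-duplicate."""
    lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
    unique_docs = []
    for i, doc in enumerate(documents):
        m = MinHash(num_perm=num_perm)
        for shingle in get_shingles(doc, k=5):
            m.update(shingle.encode())
        # Any hit in the index means a likely near-duplicate was already kept.
        result = lsh.query(m)
        if not result:
            lsh.insert(str(i), m)
            unique_docs.append(doc)
    return unique_docs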
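To see why MinHash signatures can stand in for Jaccard similarity, the sketch below implements a small MinHash by hand using only the standard library. It is illustrative and much slower than `datasketch`. The seeded BLAKE2b construction, the 256-slot signature, and the synthetic shingle names are assumptions of this example.

```python
import hashlib

def _hash(token: str, seed: int) -> int:
    """Seeded 64-bit hash built from BLAKE2b (one hash function per seed)."""
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")

def minhash_signature(items: set, num_perm: int = 256) -> list:
    """For each seed, keep the minimum hash value over a non-empty set."""
    return [min(_hash(x, seed) for x in items) for seed in range(num_perm)]

def estimate_jaccard(sig_a: list, sig_b: list) -> float:
    """The fraction of signature positions that agree estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)

# Two synthetic shingle sets that share 60 of 100 distinct items.
A = {f"shingle-{i}" for i in range(0, 80)}
B = {f"shingle-{i}" for i in range(20, 100)}
true_j = len(A & B) / len(A | B)
est_j = estimate_jaccard(minhash_signature(A), minhash_signature(B))
print(f"true={true_j:.2f} estimated={est_j:.2f}")
```

For any one hash function, the two sets share the same minimum with probability equal to their Jaccard similarity, so averaging over many hash functions gives an estimate whose error shrinks as the signature grows.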
Semantic Deduplication
Definition: Semantic Deduplication
Semantic deduplication removes documents that are semantically similar (same meaning) even if they use different words. This goes beyond surface-level duplication to remove truly redundant training examples.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import DBSCAN

def semantic_dedup(documents, model_name="all-MiniLM-L6-v2", eps=0.3):
    """Keep one document per cluster of semantically similar documents."""
    model = SentenceTransformer(model_name)
    embeddings = model.encode(documents, show_progress_bar=True)
    # eps is a cosine distance: points within 0.3 of each other are clustered.
    clustering = DBSCAN(eps=eps, min_samples=2, metric="cosine")
    clusters = clustering.fit_predict(embeddings)
    unique_docs = []
    seen_clusters = set()
    for doc, cluster_id in zip(documents, clusters):
        # Label -1 marks noise (no near neighbors), so those are always kept.
        if cluster_id == -1 or cluster_id not in seen_clusters:
            unique_docs.append(doc)
            if cluster_id != -1:
                seen_clusters.add(cluster_id)
    return unique_docs
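The `eps` parameter is a cosine distance, so `eps=0.3` links documents whose embeddings have a cosine similarity of at least 0.7. The sketch below illustrates the distance itself with hand-written toy vectors standing in for sentence embeddings. The vectors and the greedy `dedup_by_cosine` helper are hypothetical, and greedy filtering is a simpler alternative to DBSCAN, not the same algorithm.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def dedup_by_cosine(docs, embeddings, eps=0.3):
    """Greedy filter: keep a doc only if it is farther than eps from every kept doc."""
    kept_docs, kept_vecs = [], []
    for doc, vec in zip(docs, embeddings):
        if all(cosine_distance(vec, k) > eps for k in kept_vecs):
            kept_docs.append(doc)
            kept_vecs.append(vec)
    return kept_docs

docs = ["How do I reset my password?",
        "What are the steps to reset a password?",
        "Best hiking trails near Denver"]
# Toy 3-d vectors standing in for real sentence embeddings.
embeddings = [[0.9, 0.1, 0.0], [0.8, 0.3, 0.1], [0.0, 0.2, 0.9]]
kept = dedup_by_cosine(docs, embeddings)
print(kept)
```

The two password questions point in nearly the same direction (distance about 0.04), so the second is dropped, while the hiking document is almost orthogonal to both and is kept.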
Data Mixing
Domain Mixing Ratios
Definition: Data Mixing
Data mixing determines the proportion of training data drawn from each domain (web, code, books, academic papers, etc.). The mixing ratio significantly affects model capabilities: more code data tends to improve coding ability, and more book data tends to improve long-form reasoning.
| Mixing Ratio (Web:Code:Books:Academic) | Coding | Reasoning | Knowledge | Instruction Following |
|---|---|---|---|---|
| 100:0:0:0 | Baseline | Baseline | Baseline | Baseline |
| 80:10:5:5 | +5% | +3% | +2% | +4% |
| 60:20:10:10 | +15% | +8% | +5% | +10% |
| 40:30:20:10 | +25% | +12% | +8% | +15% |
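The "Scaling Data-Constrained Language Models" paper found that code data is a useful supplement when natural-language data is limited: mixing code into the training set stretched the effective data budget without hurting natural-language performance. Practitioners also commonly report that even modest shares of code (around 5-10%) help reasoning and logical thinking, although the size of that effect varies by model and benchmark.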
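In practice, a mixing ratio is turned into sampling probabilities or per-domain token budgets. A minimal sketch, assuming the 60:20:10:10 row from the table and a hypothetical budget of 1B tokens (the function names and the fixed seed are illustrative):

```python
import random

def normalize_ratio(ratio: dict) -> dict:
    """Convert raw ratio parts (e.g. 60:20:10:10) into probabilities summing to 1."""
    total = sum(ratio.values())
    return {domain: part / total for domain, part in ratio.items()}

def token_budgets(ratio: dict, total_tokens: int) -> dict:
    """Split a total token budget across domains in proportion to the ratio."""
    probs = normalize_ratio(ratio)
    return {domain: round(total_tokens * p) for domain, p in probs.items()}

def sample_domains(ratio: dict, n: int, seed: int = 0) -> list:
    """Draw the source domain for each of n training sequences."""
    probs = normalize_ratio(ratio)
    rng = random.Random(seed)
    return rng.choices(list(probs), weights=list(probs.values()), k=n)

ratio = {"web": 60, "code": 20, "books": 10, "academic": 10}
budgets = token_budgets(ratio, 1_000_000_000)
print(budgets)
print(sample_domains(ratio, 8))
```

Token budgets suit offline dataset construction, while per-sequence sampling suits streaming data loaders. Both realize the same ratio in expectation.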
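Optimal Mixing with DoReMi
Definition: DoReMi
DoReMi (Domain Reweighting with Minimax Optimization) learns domain weights automatically. It trains a small proxy model with a distributionally robust objective and upweights the domains where the proxy's loss most exceeds that of a small reference model. The resulting weights are then used to resample data for training the full-size model.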
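def doremi_weights(domain_losses, target_loss):
    """Simplified one-shot heuristic in the spirit of DoReMi, not the full algorithm.

    domain_losses: proxy-model loss per domain; target_loss: reference loss.
    """
    weights = {}
    for domain, loss in domain_losses.items():
        # Excess loss: upweight domains where the proxy is worse than the target.
        weights[domain] = max(0, loss - target_loss)
    total = sum(weights.values())
    if total > 0:
        weights = {k: v / total for k, v in weights.items()}
    else:
        # No domain exceeds the target: fall back to uniform weights.
        weights = {k: 1.0 / len(weights) for k in weights}
    return weights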
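The published DoReMi procedure updates domain weights iteratively during proxy-model training rather than in one shot. The following is a simplified sketch of a single update step in the multiplicative-weights form described in the DoReMi paper: each domain's weight is scaled by the exponential of its excess loss (proxy loss minus reference loss, clipped at zero), renormalized, and smoothed toward uniform. The function name, the step size `eta`, the smoothing constant, and all loss values are illustrative assumptions.

```python
import math

def doremi_update(weights, proxy_losses, reference_losses, eta=1.0, smoothing=1e-3):
    """One multiplicative-weights step on domain weights (simplified sketch)."""
    # Excess loss: how much worse the proxy model is than the reference, per domain.
    excess = {d: max(0.0, proxy_losses[d] - reference_losses[d]) for d in weights}
    # Upweight domains with large excess loss, then renormalize.
    scaled = {d: weights[d] * math.exp(eta * excess[d]) for d in weights}
    total = sum(scaled.values())
    uniform = 1.0 / len(weights)
    # Mix in a little of the uniform distribution so no weight collapses to zero.
    return {d: (1 - smoothing) * scaled[d] / total + smoothing * uniform
            for d in weights}

weights = {"web": 0.25, "code": 0.25, "books": 0.25, "academic": 0.25}
proxy = {"web": 3.0, "code": 2.4, "books": 3.2, "academic": 2.9}      # hypothetical
reference = {"web": 2.9, "code": 1.8, "books": 3.2, "academic": 3.0}  # hypothetical
new_weights = doremi_update(weights, proxy, reference)
print({d: round(w, 3) for d, w in new_weights.items()})
```

In this toy run the code domain has the largest excess loss and therefore gains the most weight. Repeating the step over training and averaging the weights yields the final mixture in the full method.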
Data Attribution
Measuring Data Influence
Definition: Data Attribution
Data attribution methods measure the influence of individual training examples on model behavior. This enables understanding which data points are most valuable, identifying harmful data, and debugging model failures.
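Influence Function
$$\mathcal{I}(z_i, z_{\text{test}}) = -\nabla_\theta L(z_{\text{test}}, \hat{\theta})^{\top} H_{\hat{\theta}}^{-1} \nabla_\theta L(z_i, \hat{\theta})$$
Here,
- $\mathcal{I}(z_i, z_{\text{test}})$ = Influence of training example $z_i$ on test loss
- $z_{\text{test}}$ = Test example
- $z_i$ = Training example
- $H_{\hat{\theta}}$ = Hessian of the loss, evaluated at the trained parameters $\hat{\theta}$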
For LLMs, exact influence functions are intractable. Practical approximations include: gradient similarity, TracIn, and data Shapley values computed on smaller proxy models.
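As a rough illustration of the gradient-similarity idea, the sketch below scores training examples by the dot product between their loss gradient and the test example's loss gradient, using a toy linear model with squared loss. The model, parameters, and data are hypothetical. For an LLM the gradients would come from backpropagation, typically on a proxy model or a subset of parameters.

```python
def loss_gradient(w, x, y):
    """Gradient of 0.5 * (w.x - y)^2 with respect to w."""
    residual = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [residual * xi for xi in x]

def gradient_similarity(w, train_example, test_example):
    """Dot product of train and test loss gradients at parameters w.

    Positive: a gradient step on the training example would also reduce the
    test loss (to first order). Negative: it would increase it.
    """
    g_train = loss_gradient(w, *train_example)
    g_test = loss_gradient(w, *test_example)
    return sum(a * b for a, b in zip(g_train, g_test))

w = [0.5, -0.2]                       # hypothetical current parameters
test_example = ([1.0, 2.0], 1.0)
train_set = {
    "similar": ([1.0, 1.8], 1.0),     # close to the test example
    "opposing": ([1.0, 2.0], -1.0),   # same input, conflicting label
    "unrelated": ([0.0, 0.0], 3.0),   # zero input gives a zero gradient
}
scores = {name: gradient_similarity(w, ex, test_example)
          for name, ex in train_set.items()}
for name, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {s:.3f}")
```

TracIn extends this idea by summing such dot products, scaled by the learning rate, over several training checkpoints.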
Data Quality Metrics
Comprehensive Quality Score
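Composite Quality Score
$$Q(d) = w_1 Q_{\text{ppl}}(d) + w_2 Q_{\text{clf}}(d) + w_3 Q_{\text{div}}(d) + w_4 Q_{\text{len}}(d)$$
Here,
- $Q(d)$ = Composite quality score for document $d$
- $Q_{\text{ppl}}(d)$ = Perplexity-based quality score
- $Q_{\text{clf}}(d)$ = Classifier-based quality score
- $Q_{\text{div}}(d)$ = Diversity score (novelty vs corpus)
- $Q_{\text{len}}(d)$ = Length/complexity score
- $w_1, \dots, w_4$ = Weights for each component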
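A minimal sketch of a weighted composite score, assuming that every component has already been scaled to [0, 1] and that the weights sum to 1. The component names, weights, and scores are illustrative values, not recommendations.

```python
def composite_quality(scores: dict, weights: dict) -> float:
    """Weighted sum of component quality scores (components assumed in [0, 1])."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same components")
    return sum(weights[name] * scores[name] for name in weights)

# Illustrative weights and per-document component scores.
weights = {"perplexity": 0.3, "classifier": 0.4, "diversity": 0.2, "length": 0.1}
doc_scores = {"perplexity": 0.8, "classifier": 0.9, "diversity": 0.5, "length": 0.7}
quality = composite_quality(doc_scores, weights)
print(round(quality, 3))
```

Here the score is 0.3·0.8 + 0.4·0.9 + 0.2·0.5 + 0.1·0.7 = 0.77. Raw perplexity is unbounded, so it needs to be mapped to a bounded score before it can be combined this way.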
Practice Exercises
- Deduplication Analysis: Given a corpus of 1M documents, estimate the size reduction after exact deduplication, MinHash LSH (threshold=0.8), and semantic deduplication. What is the typical deduplication rate for web crawl data?
- Data Mixing Experiment: Design a data mixing strategy for an LLM that must excel at (a) conversational AI, (b) code generation, and (c) mathematical reasoning. Justify your domain proportions.
- Quality Filter Design: Design a multi-stage quality filtering pipeline for Common Crawl data. What filters would you apply, in what order, and what retention rates would you target?
- Data Attribution: How would you identify training examples that cause a specific hallucination in an LLM? Describe a practical approach using influence functions or similar methods.
Key Takeaways
Summary: Data Quality and Curation
- Data quality > data quantity — curated datasets outperform larger, noisy ones
- Deduplication prevents memorization and improves generalization (10%+ improvement)
- Perplexity filtering removes noisy and repetitive documents effectively
- Classifier-based filtering learns quality standards from labeled examples
- Semantic deduplication removes meaning duplicates, not just exact copies
- Data mixing ratios directly determine model capabilities (code, reasoning, knowledge)
- DoReMi learns domain weights automatically using a small proxy model
- Data attribution identifies which training examples influence specific behaviors
- Multi-stage filtering pipelines are standard practice for web-scale data
What to Learn Next
-> Synthetic Data Generation: Using LLMs to create high-quality training data for themselves.
-> Pretraining Language Models: The fundamentals of training language models on large corpora.
-> Distributed Training for LLMs: Scaling training across hundreds of GPUs with parallelism strategies.
-> Scaling Laws and Chinchilla: How data quantity and quality interact with model scale.
-> Curriculum Learning for LLMs: Strategic ordering of training data for improved learning.
-> Knowledge Distillation for LLMs: Using teacher models to generate quality training signals.