Advanced Training
Synthetic Data Generation — Teaching LLMs to Teach Themselves
Synthetic data lets LLMs generate their own training data, reducing dependence on the size and cost of human-written datasets. This guide covers Self-Instruct, Evol-Instruct, and quality-controlled synthesis.
- Self-Instruct — Seed tasks generate new instructions automatically
- Evol-Instruct — Iteratively complexify instructions for capability growth
- Quality Control — Filtering, verification, and diversity enforcement
The best data for training an AI is the data that AI itself creates — if properly curated.
Synthetic Data Generation for LLMs
Synthetic data generation has become one of the most widely used techniques for improving LLM capabilities. Alpaca and WizardLM were fine-tuned on model-generated instruction data, and Vicuna on shared ChatGPT conversations; all three showed that such data can substantially improve instruction-following ability. Model-generated reasoning traces, code examples, and mathematical proofs have likewise been used as training data.
Definition: Synthetic Data Generation
Synthetic data generation is the process of using a language model (often a larger, more capable one) to generate training examples — instructions, responses, reasoning traces, or code — that are then used to train or fine-tune a target model.
The Self-Instruct Pipeline
Definition: Self-Instruct
Self-Instruct (Wang et al., 2023) generates new instruction-following data by using a small set of seed tasks to prompt an LLM to generate new instructions, then generates input-output pairs for each instruction.
Seed Tasks (175 examples)
|
v
[Instruction Generation] -> New Instructions
|
v
[Classification] -> Classify as classification/generation
|
v
[Input Generation] -> Generate inputs for each instruction
|
v
[Output Generation] -> Generate outputs using LLM
|
v
[Filtering] -> Remove low-quality, duplicates, near-matches
|
v
Final Synthetic Dataset
Implementation
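import difflib
import json
import random

import openai


class SelfInstructPipeline:
    def __init__(self, model="gpt-4", seed_tasks_path="seed_tasks.json"):
        self.model = model
        # Assumes the openai>=1.0 client (reads OPENAI_API_KEY from the environment);
        # the older openai.ChatCompletion.create call was removed in 1.0.
        self.client = openai.OpenAI()
        # Each seed task is expected to have "task" and "instruction" keys.
        with open(seed_tasks_path) as f:
            self.seed_tasks = json.load(f)

    def _chat(self, prompt, temperature=0.7):
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            temperature=temperature,
        )
        return response.choices[0].message.content.strip()

    def generate_instruction(self, seed_examples):
        # Built outside the prompt f-string: backslashes inside f-string
        # expressions are a syntax error before Python 3.12.
        examples = "\n".join(
            f"Task: {t['task']}\nInstruction: {t['instruction']}" for t in seed_examples
        )
        prompt = f"""Below are some example tasks and their instructions:
{examples}
Generate a new, creative task and instruction that is different from the above:
Task: """
        content = self._chat(prompt)
        # The reply continues "<task>\nInstruction: <text>"; keep only the instruction.
        return content.split("Instruction:", 1)[-1].strip()

    def classify_task(self, instruction):
        # Minimal stand-in: ask the model whether this is a classification task.
        answer = self._chat(
            "Is the following instruction a classification task? "
            f"Answer yes or no.\nInstruction: {instruction}",
            temperature=0.0,
        )
        return "classification" if answer.lower().startswith("yes") else "generation"

    def generate_input_output(self, instruction, task_type):
        prompt = f"""Generate a detailed input and the correct output for this instruction:
Instruction: {instruction}
Task Type: {task_type}
Input:"""
        return self._parse_input_output(self._chat(prompt))

    def _parse_input_output(self, content):
        # The prompt ends with "Input:", so the reply is "<input>\nOutput: <output>".
        input_text, _, output_text = content.partition("Output:")
        return {"input": input_text.strip(), "output": output_text.strip()}

    def _compute_similarity(self, a, b):
        # difflib ratio as a simple stand-in; the Self-Instruct paper uses ROUGE-L.
        return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

    def filter_duplicates(self, new_tasks, existing_tasks, threshold=0.8):
        # Keep a task only if it is not too similar to any existing instruction.
        return [
            task for task in new_tasks
            if all(
                self._compute_similarity(task["instruction"], existing["instruction"]) <= threshold
                for existing in existing_tasks
            )
        ]

    def generate_dataset(self, num_samples=10000, max_attempts=None):
        dataset = list(self.seed_tasks)
        # Cap attempts so a run of duplicates cannot loop forever.
        max_attempts = max_attempts or 3 * num_samples
        for _ in range(max_attempts):
            if len(dataset) >= num_samples:
                break
            seed_examples = random.sample(self.seed_tasks, min(5, len(self.seed_tasks)))
            new_instruction = self.generate_instruction(seed_examples)
            task_type = self.classify_task(new_instruction)
            input_output = self.generate_input_output(new_instruction, task_type)
            new_task = {
                "instruction": new_instruction,
                "input": input_output["input"],
                "output": input_output["output"],
                "task_type": task_type,
                "source": "self_instruct",
            }
            dataset.extend(self.filter_duplicates([new_task], dataset))
        return dataset[:num_samples]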
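The near-duplicate filter above hinges on its similarity function. The Self-Instruct paper uses ROUGE-L overlap and keeps a new instruction only when its score against every existing instruction is below 0.7. The following is a minimal sketch of that check, assuming simple lowercase whitespace tokenization; `lcs_length`, `rouge_l_f1`, and `is_novel` are hypothetical helper names, not the paper's code.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 over lowercase whitespace tokens (a simplification)."""
    c, r = candidate.lower().split(), reference.lower().split()
    if not c or not r:
        return 0.0
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def is_novel(instruction, pool, threshold=0.7):
    """True if the instruction scores below the threshold against every pool entry."""
    return all(rouge_l_f1(instruction, existing) < threshold for existing in pool)


pool = ["Write a poem about the ocean.", "Translate the sentence into French."]
print(is_novel("Write a short poem about the ocean.", pool))        # False: near-duplicate
print(is_novel("List three prime numbers greater than 50.", pool))  # True: no overlap
```

A word-level metric like this is stricter about paraphrases than character-level matching, which is one reason the threshold usually needs tuning per dataset.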
Evol-Instruct: Evolutionary Complexity
Definition: Evol-Instruct
Evol-Instruct (Xu et al., 2023) iteratively evolves instructions to increase complexity. Starting from simple instructions, it applies "evolution" operators (deepening, widening, concretizing, reasoning) to create progressively more challenging tasks.
Evolution Operators
import random

import openai


class EvolInstruct:
    DEEPEN_PROMPT = """Take the following instruction and make it more complex by adding additional constraints or requirements:
Original: {instruction}
Make it more complex:"""

    WIDEN_PROMPT = """Take the following instruction and broaden its scope to require more comprehensive knowledge:
Original: {instruction}
Broaden the scope:"""

    CONCRETIZE_PROMPT = """Take the following abstract instruction and make it more specific and concrete:
Original: {instruction}
Make it more specific:"""

    REASONING_PROMPT = """Take the following instruction and modify it to require multi-step reasoning:
Original: {instruction}
Require multi-step reasoning:"""

    def __init__(self, model="gpt-4"):
        self.model = model
        # Assumes the openai>=1.0 client; openai.ChatCompletion.create was removed in 1.0.
        self.client = openai.OpenAI()

    def evolve(self, instruction, operator="deepen"):
        prompts = {
            "deepen": self.DEEPEN_PROMPT,
            "widen": self.WIDEN_PROMPT,
            "concretize": self.CONCRETIZE_PROMPT,
            "reasoning": self.REASONING_PROMPT,
        }
        prompt = prompts[operator].format(instruction=instruction)
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.7,
        )
        return response.choices[0].message.content.strip()

    def evolve_dataset(self, initial_instructions, num_rounds=5):
        current = list(initial_instructions)
        all_evolved = list(current)
        for _ in range(num_rounds):
            evolved = []
            for instruction in current:
                # Pick one operator at random for each instruction.
                operator = random.choice(["deepen", "widen", "concretize", "reasoning"])
                evolved.append(self.evolve(instruction, operator))
            all_evolved.extend(evolved)
            # The next round evolves this round's output, so complexity compounds.
            current = evolved
        return all_evolved
WizardLM used Evol-Instruct to create about 250K complex instructions, starting from the Alpaca instruction set. Its authors report that models trained on this data handled complex instructions noticeably better than models trained on simpler Alpaca-style data.
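To see how the dataset grows across rounds without calling an API, the `evolve_dataset` loop can be run with a stand-in operator. The sketch below is a toy under stated assumptions: `fake_evolve` is a hypothetical stub that only tags the instruction with the operator name, so it illustrates the bookkeeping (each round evolves the previous round's output, and every generation is kept), not real instruction quality.

```python
import random


def fake_evolve(instruction, operator):
    """Hypothetical stand-in for an LLM call: append the operator tag."""
    return f"{instruction} [{operator}]"


def evolve_dataset(initial_instructions, num_rounds=5, evolve=fake_evolve, seed=0):
    rng = random.Random(seed)
    operators = ["deepen", "widen", "concretize", "reasoning"]
    current = list(initial_instructions)
    all_evolved = list(current)
    for _ in range(num_rounds):
        # Each round evolves only the previous round's output.
        current = [evolve(inst, rng.choice(operators)) for inst in current]
        all_evolved.extend(current)
    return all_evolved


seeds = ["Sort a list of numbers.", "Summarize a news article."]
result = evolve_dataset(seeds, num_rounds=3)
print(len(result))  # 2 seeds * (1 + 3 rounds) = 8
print(result[-1])   # a seed instruction carrying three operator tags
```

With n seed instructions and k rounds the output holds n * (k + 1) instructions, so generation cost grows linearly with the number of rounds.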
Synthetic Reasoning Data
Chain-of-Thought Synthesis
Definition: Synthetic CoT Data
Synthetic CoT data consists of chain-of-thought reasoning traces generated for use as training data. A capable model produces step-by-step reasoning for each problem, and smaller models are then trained on these traces to distill its reasoning ability.
import openai


def generate_cot_data(problems, model="gpt-4"):
    # Assumes the openai>=1.0 client; openai.ChatCompletion.create was removed in 1.0.
    client = openai.OpenAI()
    cot_data = []
    for problem in problems:
        prompt = f"""Solve this problem step by step, showing your reasoning:
Problem: {problem['question']}
Solution:"""
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            # A low temperature keeps the reasoning focused and repeatable.
            temperature=0.3,
        )
        reasoning = response.choices[0].message.content
        cot_data.append({
            "instruction": problem["question"],
            "output": reasoning,
            "task_type": "reasoning",
            "source": "synthetic_cot",
        })
    return cot_data
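Traces generated this way are not guaranteed to be correct, so one common safeguard is to keep only the traces whose final answer matches a known reference. The sketch below assumes each record carries a hypothetical `reference_answer` field and that the last number in the trace is the model's answer; both are assumptions made for illustration and are not part of the function above.

```python
import re

NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def final_number(text):
    """Return the last number in the text as a float, or None if there is none."""
    # Commas are dropped so "1,000" reads as 1000 (treats commas as thousands separators).
    matches = NUMBER.findall(text.replace(",", ""))
    return float(matches[-1]) if matches else None


def keep_correct_traces(records, tolerance=1e-6):
    """Keep records whose trace ends in the reference answer."""
    kept = []
    for record in records:
        predicted = final_number(record["output"])
        if predicted is not None and abs(predicted - record["reference_answer"]) <= tolerance:
            kept.append(record)
    return kept


records = [
    {"instruction": "What is 12 * 7?",
     "output": "12 * 7 = 84. The answer is 84.",
     "reference_answer": 84},
    {"instruction": "What is 15 + 27?",
     "output": "15 + 27 = 41. The answer is 41.",
     "reference_answer": 42},
]
kept = keep_correct_traces(records)
print([r["instruction"] for r in kept])  # only the first record survives
```

Matching the final answer does not prove the intermediate steps are sound, so this works best as a cheap first filter ahead of the quality checks described later.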
Math Problem Synthesis
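import json

import openai


def generate_math_problems(num_problems, difficulty_level="medium"):
    # "\\n" keeps a literal \n escape in the example so that it stays valid JSON.
    prompt = f"""Generate {num_problems} math problems at {difficulty_level} difficulty.
For each problem, provide:
1. The problem statement
2. Step-by-step solution
3. Final answer
4. Difficulty rating (1-5)
Format as JSON array.
Example:
[
  {{
    "problem": "If f(x) = 2x^2 + 3x - 5, find f'(2).",
    "solution": "f'(x) = 4x + 3\\nf'(2) = 4(2) + 3 = 11",
    "answer": 11,
    "difficulty": 3
  }}
]
Generate problems:"""
    # Assumes the openai>=1.0 client; openai.ChatCompletion.create was removed in 1.0.
    client = openai.OpenAI()
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.8,
    )
    content = response.choices[0].message.content
    try:
        return json.loads(content)
    except json.JSONDecodeError as err:
        # Models sometimes wrap JSON in prose or code fences; fail loudly rather than guess.
        raise ValueError(f"Model reply was not valid JSON: {content[:200]!r}") from err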
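Model output does not always parse as clean JSON, and individual items may be missing fields. The following is a minimal validation sketch for the schema requested in the prompt above (`problem`, `solution`, `answer`, and `difficulty` from 1 to 5). The helper names are hypothetical, and the handling of a Markdown code fence around the reply is an assumption about typical model behavior.

```python
import json

REQUIRED_FIELDS = ("problem", "solution", "answer", "difficulty")
FENCE = "`" * 3  # a Markdown code fence marker


def strip_code_fence(text):
    """Remove a surrounding Markdown code fence if the reply is wrapped in one."""
    text = text.strip()
    if text.startswith(FENCE):
        lines = text.splitlines()[1:]  # drop the opening fence (may carry a language tag)
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]
        text = "\n".join(lines)
    return text


def parse_problems(raw_reply):
    """Parse a reply into a list of well-formed problem dicts, skipping the rest."""
    try:
        items = json.loads(strip_code_fence(raw_reply))
    except json.JSONDecodeError:
        return []
    if not isinstance(items, list):
        return []
    valid = []
    for item in items:
        if not isinstance(item, dict):
            continue
        if any(field not in item for field in REQUIRED_FIELDS):
            continue
        difficulty = item["difficulty"]
        if not isinstance(difficulty, int) or not 1 <= difficulty <= 5:
            continue
        valid.append(item)
    return valid


reply = FENCE + "json\n" + json.dumps([
    {"problem": "If f(x) = 2x^2 + 3x - 5, find f'(2).",
     "solution": "f'(x) = 4x + 3, so f'(2) = 11", "answer": 11, "difficulty": 3},
    {"problem": "Missing the other fields"},
    {"problem": "p", "solution": "s", "answer": 1, "difficulty": 9},
]) + "\n" + FENCE
print(len(parse_problems(reply)))  # 1: only the first item is well formed
```

This checks structure only; whether the stated answer is actually correct still needs a separate verification step.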
Quality Control for Synthetic Data
Multi-Stage Verification
Definition: Synthetic Data Quality Control
A pipeline of verification steps to ensure synthetic data is accurate, diverse, and useful. Includes: (1) format validation, (2) factual verification, (3) diversity checks, (4) human spot-checks, (5) model-based quality scoring.
import difflib

import openai


class SyntheticDataQualityControl:
    def __init__(self):
        # Assumes the openai>=1.0 client; openai.ChatCompletion.create was removed in 1.0.
        self.client = openai.OpenAI()

    def _chat(self, prompt, model):
        response = self.client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.1,
        )
        return response.choices[0].message.content.strip()

    def validate_format(self, sample):
        required_fields = ["instruction", "output", "task_type"]
        for field in required_fields:
            if field not in sample or not sample[field]:
                return False, f"Missing field: {field}"
        if len(sample["instruction"]) < 10:
            return False, "Instruction too short"
        if len(sample["output"]) < 20:
            return False, "Output too short"
        return True, "Valid"

    def check_factual_accuracy(self, sample, verifier_model="gpt-4"):
        prompt = f"""Verify the factual accuracy of this response:
Question: {sample['instruction']}
Response: {sample['output']}
Is the response factually accurate? If not, what errors exist?
Answer with: ACCURATE, MINOR_ERRORS, or MAJOR_ERRORS"""
        verdict = self._chat(prompt, verifier_model)
        # Tolerate trailing punctuation or an explanation after the label.
        return verdict.upper().startswith("ACCURATE")

    def check_diversity(self, new_sample, existing_samples, threshold=0.7):
        # difflib ratio is a simple stand-in for whichever similarity metric you prefer.
        for existing in existing_samples:
            similarity = difflib.SequenceMatcher(
                None, new_sample["instruction"], existing["instruction"]
            ).ratio()
            if similarity > threshold:
                return False
        return True

    def quality_score(self, sample, scoring_model="gpt-4"):
        prompt = f"""Rate the quality of this training example on a scale of 1-10:
Instruction: {sample['instruction']}
Output: {sample['output']}
Consider:
- Clarity of instruction
- Completeness of response
- Educational value
- Accuracy
Score (1-10):"""
        try:
            score = int(self._chat(prompt, scoring_model))
        except ValueError:
            # Unparseable reply: fall back to a neutral score.
            return 5
        return min(max(score, 1), 10)

    def filter_dataset(self, dataset, min_score=7):
        filtered = []
        for sample in dataset:
            valid, reason = self.validate_format(sample)
            if not valid:
                continue
            # Run the cheap local check before the checks that call the API.
            if not self.check_diversity(sample, filtered):
                continue
            if not self.check_factual_accuracy(sample):
                continue
            score = self.quality_score(sample)
            if score >= min_score:
                sample["quality_score"] = score
                filtered.append(sample)
        return filtered
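Scoring models often reply with more than a bare integer (for example `Score: 8/10`), which makes a direct `int(...)` conversion fail and fall back to the default. The following is a hedged sketch of a more tolerant parser. `parse_score` is a hypothetical helper, and returning `None` instead of a neutral default is a design choice that lets the caller drop or re-score unparseable samples.

```python
import re

# A standalone integer from 1 to 10; "10" is tried before single digits.
SCORE_PATTERN = re.compile(r"\b(10|[1-9])\b")


def parse_score(reply):
    """Return the first standalone integer from 1 to 10 in the reply, else None."""
    match = SCORE_PATTERN.search(reply)
    return int(match.group(1)) if match else None


for reply in ["8", "Score: 8/10", "I would rate this 10 out of 10.", "Excellent"]:
    print(repr(reply), "->", parse_score(reply))
```

Taking the first match assumes the score appears before any "out of 10" denominator, which holds for the prompt format above but is worth checking against real replies.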
Model Collapse and Diversity
Definition: Model Collapse
Model collapse occurs when a model is trained on its own outputs iteratively. Over generations, the model's outputs become less diverse, lose tail knowledge, and converge to a narrow distribution. This is a critical risk in synthetic data generation.
Training on purely synthetic data with no fresh human data can lead to model collapse. As a rule of thumb, keep a share of human-generated data (at least 10-20%) in the training mix.
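One simple way to watch for narrowing output is to track a lexical diversity metric across training generations. The sketch below computes distinct-n, the ratio of unique n-grams to total n-grams over a set of samples. It is a rough proxy chosen for illustration (it does not measure loss of tail knowledge), and the example texts are made up.

```python
def distinct_n(texts, n=2):
    """Unique n-grams divided by total n-grams across all texts (0.0 if none)."""
    total = 0
    unique = set()
    for text in texts:
        tokens = text.lower().split()
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0


generation_0 = ["the cat sat on the mat", "a dog ran across the park", "birds sing at dawn"]
generation_5 = ["the cat sat on the mat", "the cat sat on the rug", "the cat sat on the mat"]

print(round(distinct_n(generation_0), 3))  # 1.0: every bigram is unique
print(round(distinct_n(generation_5), 3))  # 0.4: 6 unique bigrams out of 15
```

A steady drop in this ratio from one generation to the next, measured on outputs for a fixed prompt set, is a warning sign worth investigating.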
Preventing Model Collapse
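import random


def maintain_diversity(synthetic_data, human_data, synthetic_ratio=0.8):
    # Target mix size is len(synthetic_data): synthetic_ratio of it synthetic, the rest human.
    n_synthetic = int(len(synthetic_data) * synthetic_ratio)
    n_human = len(synthetic_data) - n_synthetic
    if n_human > len(human_data):
        # Not enough human data: shrink the synthetic share so the ratio still holds.
        n_human = len(human_data)
        n_synthetic = min(
            len(synthetic_data),
            int(n_human * synthetic_ratio / (1 - synthetic_ratio)),
        )
    sampled_synthetic = random.sample(synthetic_data, n_synthetic)
    sampled_human = random.sample(human_data, n_human)
    mixed_data = sampled_synthetic + sampled_human
    random.shuffle(mixed_data)
    return mixed_data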
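If the goal is instead to use all of the synthetic data and work out how much human data that requires, the count follows from the ratio: with synthetic fraction r, the human share is n_synthetic * (1 - r) / r. The following is a small sketch of that arithmetic; `human_examples_needed` is a hypothetical helper, not part of the function above.

```python
import math


def human_examples_needed(n_synthetic, synthetic_ratio=0.8):
    """Human examples required so synthetic data makes up synthetic_ratio of the mix."""
    if not 0 < synthetic_ratio <= 1:
        raise ValueError("synthetic_ratio must be in (0, 1]")
    exact = n_synthetic * (1 - synthetic_ratio) / synthetic_ratio
    # Round first so floating-point noise (e.g. 1999.9999999) does not distort the ceiling.
    return math.ceil(round(exact, 6))


print(human_examples_needed(8000, 0.8))  # 2000 human examples for an 80/20 mix
print(human_examples_needed(9000, 0.9))  # 1000 human examples for a 90/10 mix
```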
Cost-Effective Synthetic Data
Using Smaller Models for Generation
Definition: Student-Teacher Synthesis
Use a large, capable model (teacher) to generate high-quality examples, then use those examples to train a smaller model (student). The student can then generate its own synthetic data for self-improvement, at a fraction of the cost.
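def student_teacher_synthesis(teacher_model, student_model, seed_data, num_rounds=3):
    # Pseudocode: generate_with_model, train_model, verify_with_model, and mix_data
    # are placeholders for your own generation, training, and filtering code.
    current_data = seed_data
    for round_num in range(num_rounds):
        # 1. The teacher produces a small, high-quality batch.
        teacher_examples = generate_with_model(teacher_model, current_data, n=1000)
        # 2. The student is trained on the teacher's examples.
        train_model(student_model, teacher_examples)
        # 3. The student generates a larger, cheaper batch.
        student_examples = generate_with_model(student_model, current_data, n=5000)
        # 4. The teacher verifies the student's output.
        verified = verify_with_model(teacher_model, student_examples)
        # 5. Verified examples are mixed into the pool for the next round.
        current_data = mix_data(current_data, verified, ratio=0.7)
    return current_data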
Practice Exercises
- Self-Instruct Design: Design a Self-Instruct pipeline for generating medical QA data. What seed tasks would you use? How would you ensure medical accuracy?
- Evol-Instruct Strategy: Create an Evol-Instruct pipeline that evolves simple math problems into multi-step reasoning challenges. What evolution operators would you design?
- Quality Control: Design a 5-stage quality control pipeline for synthetic code generation data. What verification steps would you include?
- Model Collapse Analysis: If you had a model trained on 100% synthetic data, what metrics would you use to detect model collapse? Design an experiment to measure collapse severity.
Key Takeaways
Summary: Synthetic Data Generation
- Self-Instruct generates new instructions from seed tasks automatically
- Evol-Instruct iteratively complexifies instructions for capability growth
- Synthetic CoT data distills reasoning from large models to smaller ones
- Quality control requires multi-stage verification (format, facts, diversity, quality)
- Model collapse is a real risk — always mix in human-generated data
- Student-teacher synthesis enables cost-effective iterative improvement
- Diversity enforcement prevents the synthetic data distribution from narrowing
- Verification is essential — synthetic does not mean automatically correct
What to Learn Next
-> Data Quality and Curation for LLMs: The foundations of data quality, deduplication, and filtering.
-> Instruction Tuning: Training models to follow instructions effectively.
-> Curriculum Learning for LLMs: Strategic ordering of training data for improved learning.
-> Fine-Tuning LLMs: Customizing models for specific tasks and domains.
-> Chain-of-Thought Reasoning: Teaching models to reason step by step.
-> Constitutional AI: Using AI feedback for self-improvement and alignment.