
Instruction Tuning — Teaching LLMs to Follow Instructions

Instruction tuning bridges the gap between pre-training on raw text and aligning with user intent through structured instruction-response pairs.

Loss on Responses Only — Training computes loss only on response tokens, not instruction tokens
Self-Instruct and Evol-Instruct — Automated methods for generating diverse, high-quality instruction datasets
Multi-Task vs Single-Task — Multi-task improves generalization; single-task excels in specialized domains

"FLAN showed that instruction tuning on diverse tasks improves zero-shot performance on unseen tasks."

Instruction Tuning

Instruction tuning fine-tunes a language model on (instruction, response) pairs so that it follows human instructions more reliably.

What is Instruction Tuning?

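The key insight: pre-training teaches the model to continue arbitrary text by predicting the next token. Instruction tuning keeps the same next-token objective but applies it to (instruction, response) pairs, so the model learns to answer an instruction rather than merely continue it.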

Mathematical Formulation
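
One standard way to write the response-only objective is given here as a sketch. The notation is an assumption introduced for illustration: $x$ is the sequence of instruction tokens, $y = (y_1, \dots, y_T)$ is the sequence of response tokens, and $p_\theta$ is the model being fine-tuned.

$$
\mathcal{L}(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \log p_\theta\left(y_t \mid x, y_{<t}\right)
$$

The instruction tokens appear only in the conditioning context and never as prediction targets, which is the effect of setting their labels to -100 in the code that follows.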

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

def compute_instruction_loss(model, batch, tokenizer):
    """Compute loss only on response tokens."""
    input_ids = batch["input_ids"]
    attention_mask = batch["attention_mask"]
    labels = batch["labels"]
    
    # Find where response starts (after instruction)
    # Labels are -100 for instruction tokens
    outputs = model(
        input_ids=input_ids,
        attention_mask=attention_mask,
        labels=labels
    )
    
    return outputs.loss
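
To make the effect of the -100 labels concrete, here is a minimal pure-Python sketch of the averaging that happens inside the loss. The log-probabilities are made-up numbers, and `masked_nll` is a hypothetical helper, not a library function. (Hugging Face causal-LM heads also shift labels by one position internally so that each position predicts the next token; that shift is left out here.)

```python
import math

IGNORE_INDEX = -100  # label value that marks positions excluded from the loss


def masked_nll(token_log_probs, labels):
    """Average negative log-likelihood over positions whose label is not IGNORE_INDEX.

    token_log_probs[i] is the log-probability the model assigned to the
    correct token at position i (illustrative numbers in the example below).
    """
    kept = [-lp for lp, lab in zip(token_log_probs, labels) if lab != IGNORE_INDEX]
    if not kept:
        raise ValueError("every position is masked; nothing to train on")
    return sum(kept) / len(kept)


# Two instruction positions (masked) followed by two response positions.
log_probs = [math.log(0.1), math.log(0.2), math.log(0.5), math.log(0.25)]
labels = [IGNORE_INDEX, IGNORE_INDEX, 42, 7]

loss = masked_nll(log_probs, labels)
print(f"loss over response tokens only: {loss:.4f}")
```

Only the last two positions contribute, so the poorly predicted instruction tokens (probabilities 0.1 and 0.2) have no influence on the gradient.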

def prepare_instruction_data(
    instruction: str,
    response: str,
    tokenizer,
    max_length: int = 512
) -> dict:
    """Prepare instruction-response pair with proper masking."""
    # Format: <s> Instruction: {instruction} \n Response: {response} </s>
    instruction_text = f"Instruction: {instruction}\nResponse:"
    # Append EOS explicitly so the model learns where a response ends
    # (many causal-LM tokenizers do not add it on their own).
    full_text = f"{instruction_text} {response}{tokenizer.eos_token or ''}"
    
    full_encoded = tokenizer(full_text, truncation=True, max_length=max_length)
    instruction_encoded = tokenizer(instruction_text, truncation=True, max_length=max_length)
    
    # Create labels: mask instruction tokens with -100.
    # Clamp the prefix length: after truncation it must not exceed the sequence
    # length, or the slice assignment would make labels longer than input_ids.
    # Note: if the tokenizer appends EOS to every encoding, subtract one here.
    labels = list(full_encoded["input_ids"])
    instruction_length = min(len(instruction_encoded["input_ids"]), len(labels))
    labels[:instruction_length] = [-100] * instruction_length
    
    full_encoded["labels"] = labels
    return full_encoded
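
The masking logic can be checked without downloading a model. The sketch below uses a hypothetical `WhitespaceTokenizer` (one id per word) as a stand-in for a real tokenizer, and a hypothetical `build_labels` helper that mirrors the prefix-masking idea. Real subword tokenizers may split text differently at the prompt/response boundary, so treat this as an illustration of the bookkeeping only.

```python
IGNORE_INDEX = -100


class WhitespaceTokenizer:
    """Hypothetical stand-in: maps each whitespace-separated word to an integer id."""

    def __init__(self):
        self.vocab = {}

    def __call__(self, text, max_length=None):
        ids = [self.vocab.setdefault(w, len(self.vocab)) for w in text.split()]
        return {"input_ids": ids[:max_length]}


def build_labels(prompt, response, tokenizer, max_length=512):
    """Return (input_ids, labels) with the prompt positions masked out."""
    full_ids = tokenizer(f"{prompt} {response}", max_length=max_length)["input_ids"]
    prompt_len = len(tokenizer(prompt, max_length=max_length)["input_ids"])
    prompt_len = min(prompt_len, len(full_ids))  # guard against truncation
    labels = [IGNORE_INDEX] * prompt_len + full_ids[prompt_len:]
    return full_ids, labels


tok = WhitespaceTokenizer()
ids, labels = build_labels("Instruction: Add 2 and 3. Response:", "The sum is 5.", tok)
print(ids)
print(labels)

# With aggressive truncation the response is cut off entirely,
# but labels and input_ids still have the same length.
short_ids, short_labels = build_labels(
    "Instruction: Add 2 and 3. Response:", "The sum is 5.", tok, max_length=4
)
```

In the truncated case every label is masked, so such an example contributes nothing to training and is usually better dropped from the dataset.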

FLAN (Fine-tuned LAnguage Net)

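FLAN demonstrated the power of multi-task instruction tuning: the original model was fine-tuned on dozens of NLP datasets rephrased as natural-language instructions, and follow-up work scaled the same recipe to well over a thousand tasks. Each dataset is rendered through templates such as the following: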

FLAN Task Formats

Task Type	Format Example
Classification	"Is the following sentence positive or negative? {text}"
Summarization	"Summarize the following article: {article}"
QA	"Based on the context, answer: {question}\nContext: {context}"
Translation	"Translate the following from English to French: {text}"

flan_task_templates = {
    "sentiment": "Is the following review positive or negative? Review: {input}\nAnswer:",
    "summarization": "Summarize the following text in one sentence: {input}\nSummary:",
    "question_answering": "Based on the following passage, answer the question.\nPassage: {passage}\nQuestion: {input}\nAnswer:",
    "translation": "Translate the following sentence to French: {input}\nTranslation:",
    "classification": "Classify the following text into one of these categories: {labels}\nText: {input}\nCategory:",
    "reasoning": "Answer the following question step by step.\nQuestion: {input}\nSolution:"
}
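
As a small usage sketch, a template is filled with `str.format` and the gold answer becomes the response. The `render` helper and the example texts are hypothetical; only two of the templates are repeated here to keep the block self-contained.

```python
templates = {
    "sentiment": "Is the following review positive or negative? Review: {input}\nAnswer:",
    "question_answering": (
        "Based on the following passage, answer the question.\n"
        "Passage: {passage}\nQuestion: {input}\nAnswer:"
    ),
}


def render(task, **fields):
    """Fill a task template; raises KeyError if a placeholder is missing."""
    return templates[task].format(**fields)


prompt = render("sentiment", input="The plot dragged, but the acting was superb.")
qa_prompt = render(
    "question_answering",
    passage="The Nile flows north into the Mediterranean Sea.",
    input="Which sea does the Nile flow into?",
)
print(prompt)
print(qa_prompt)
```

Ending every template with an answer cue such as `Answer:` gives the model a consistent point at which the response begins, which also makes the instruction/response boundary easy to locate for loss masking.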

InstructGPT

InstructGPT (Ouyang et al., 2022) established the three-stage training paradigm:

Stage 1: Supervised Fine-Tuning (SFT)

Train on human-written demonstrations:

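def train_sft_stage(model, dataset, tokenizer, epochs=3):
    """Stage 1: Supervised Fine-Tuning on human demonstrations."""
    from transformers import DataCollatorForSeq2Seq

    training_args = TrainingArguments(
        output_dir="./sft_model",
        num_train_epochs=epochs,
        per_device_train_batch_size=8,
        learning_rate=2e-5,
        weight_decay=0.01,
        warmup_ratio=0.1,
        fp16=True  # requires a CUDA GPU
    )
    
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=dataset,
        tokenizer=tokenizer,  # recent transformers releases name this argument processing_class
        # Pads input_ids and pads labels with -100 so padding never contributes
        # to the loss (the tokenizer must define a pad token).
        data_collator=DataCollatorForSeq2Seq(tokenizer, label_pad_token_id=-100),
        # compute_metrics is omitted: no metric function or eval_dataset is defined here.
    )
    
    trainer.train()
    return trainer.model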

Stage 2: Reward Model Training

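Train a reward model on human preference comparisons:

class RewardModelTrainer:
    def __init__(self, reward_model, tokenizer):
        self.model = reward_model
        self.tokenizer = tokenizer
    
    def compute_reward_loss(self, chosen, rejected):
        """Train reward model to rank chosen > rejected.

        Assumes chosen / rejected are tokenized batches (dicts with input_ids
        and attention_mask) and that the model has a single-logit head,
        e.g. a sequence-classification model with num_labels=1.
        """
        chosen_rewards = self.model(**chosen).logits.squeeze(-1)
        rejected_rewards = self.model(**rejected).logits.squeeze(-1)
        
        # Pairwise ranking loss; logsigmoid is numerically safer than log(sigmoid(x))
        loss = -torch.nn.functional.logsigmoid(chosen_rewards - rejected_rewards)
        return loss.mean()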
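
A few hand-picked numbers show how the pairwise loss behaves. This is a pure-Python sketch with a hypothetical `pairwise_loss` helper; the reward values are illustrative, not model outputs.

```python
import math


def pairwise_loss(chosen_reward, rejected_reward):
    """-log(sigmoid(chosen - rejected)), written in a numerically stable form."""
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(m)) = log(1 + exp(-m)) = max(-m, 0) + log(1 + exp(-|m|))
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


tie = pairwise_loss(0.0, 0.0)        # equals log(2): the model is undecided
correct = pairwise_loss(2.0, -1.0)   # small loss: chosen is ranked higher
wrong = pairwise_loss(-1.0, 2.0)     # large loss: the ranking is inverted

print(f"tie={tie:.4f} correct={correct:.4f} wrong={wrong:.4f}")
```

The loss depends only on the difference between the two rewards, so the reward model's absolute scale is not pinned down by this objective.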

Stage 3: PPO Optimization

Optimize the policy against the reward model, with a KL penalty that keeps it close to the SFT model.
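
The quantity being maximized can be sketched as the reward-model score minus a scaled log-ratio between the policy and the frozen SFT model, roughly r(x, y) - beta * log(pi(y|x) / pi_SFT(y|x)). The snippet below computes that shaped reward for one sampled response. It is not a PPO implementation; `kl_penalized_reward`, the value of `beta`, and all numbers are assumptions for illustration.

```python
def kl_penalized_reward(reward, policy_logprobs, ref_logprobs, beta=0.1):
    """Sequence-level reward minus beta times the sampled log-ratio.

    policy_logprobs / ref_logprobs: per-token log-probabilities of the sampled
    response under the policy being trained and under the frozen SFT model.
    beta=0.1 is an illustrative value, not one taken from the text.
    """
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("log-prob sequences must have equal length")
    log_ratio = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return reward - beta * log_ratio


# Policy identical to the reference: no penalty is applied.
same = kl_penalized_reward(1.5, [-0.7, -1.2], [-0.7, -1.2])
# Policy assigns the response more probability than the reference: penalized.
drifted = kl_penalized_reward(1.5, [-0.2, -0.3], [-0.7, -1.2])

print(same, round(drifted, 4))
```

Without the penalty term the policy can drift toward outputs that exploit weaknesses of the reward model; the log-ratio term makes such drift cost reward.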

Dataset Construction

Self-Instruct

Self-Instruct generates instruction-following data using the model itself:

import random
from typing import Dict, List

class SelfInstructGenerator:
    def __init__(self, model, tokenizer, seed_tasks: List[str]):
        self.model = model
        self.tokenizer = tokenizer
        self.seed_tasks = seed_tasks
    
    def generate_instructions(self, num_instructions: int = 1000) -> List[Dict]:
        """Generate instruction-following data."""
        generated = []
        
        for _ in range(num_instructions):
            # Sample seed tasks for context (at most 3, fewer if the pool is small)
            context = random.sample(self.seed_tasks, k=min(3, len(self.seed_tasks)))
            
            prompt = f"""Here are some example instructions:
{chr(10).join(f'- {t}' for t in context)}

Generate a new, different instruction that is similar in complexity:"""
            
            instruction = self._generate(prompt)
            
            # Generate response
            response_prompt = f"Instruction: {instruction}\nResponse:"
            response = self._generate(response_prompt, max_tokens=256)
            
            if self._is_valid(instruction, response):
                generated.append({
                    "instruction": instruction,
                    "response": response
                })
        
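        return generated
    
    def _generate(self, prompt: str, max_tokens: int = 64) -> str:
        """Sample a continuation of the prompt and return only the new text.

        The sampling settings (top_p=0.9, 64 new tokens by default) are illustrative.
        """
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
        output_ids = self.model.generate(
            **inputs, max_new_tokens=max_tokens, do_sample=True, top_p=0.9
        )
        new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
        return self.tokenizer.decode(new_tokens, skip_special_tokens=True).strip()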
    
    def _is_valid(self, instruction: str, response: str) -> bool:
        """Validate generated data."""
        # Check length
        if len(instruction) < 10 or len(response) < 20:
            return False
        
        # Check for repetition
        if instruction == response:
            return False
        
        # Check for coherence
        if response.count(".") < 1:
            return False
        
        return True
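
The checks above look at one pair in isolation. The Self-Instruct recipe also filters near-duplicates by comparing each new instruction with the existing pool and keeping it only when its ROUGE-L similarity to every pooled instruction is below 0.7. The sketch below uses a simplified ROUGE-L F1 on whitespace tokens as a stand-in for a full ROUGE implementation; the helper names and example instructions are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Simplified ROUGE-L F1 on lowercased whitespace tokens."""
    c, r = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def is_novel(instruction, pool, threshold=0.7):
    """Keep an instruction only if it is not too similar to anything in the pool."""
    return all(rouge_l_f1(instruction, existing) < threshold for existing in pool)


pool = ["Translate the sentence into French.", "Summarize the article in one sentence."]
near_duplicate_kept = is_novel("Translate the sentence into German.", pool)
fresh_kept = is_novel("List three prime numbers greater than 50.", pool)

print(near_duplicate_kept, fresh_kept)
```

The first candidate shares four of five tokens with a pooled instruction (F1 of 0.8) and is rejected; the second shares none and is kept.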

Evol-Instruct

Evol-Instruct progressively makes instructions more complex:

from typing import Dict, List

class EvolInstructGenerator:
    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer
        
        self.evolution_prompts = {
            "deepen": "Add more constraints or requirements to this instruction: {instruction}",
            "widen": "Add more context or background information to this instruction: {instruction}",
            "concretize": "Make this instruction more specific and concrete: {instruction}",
            "reasoning": "Modify this instruction to require multi-step reasoning: {instruction}"
        }
    
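    def evolve(self, instruction: str, evolution_type: str) -> str:
        """Evolve instruction using specified type."""
        prompt = self.evolution_prompts[evolution_type].format(instruction=instruction)
        
        evolved = self._generate(prompt)
        return evolved
    
    def _generate(self, prompt: str, max_tokens: int = 128) -> str:
        """Sample a continuation of the prompt and return only the new text.

        The sampling settings (top_p=0.9, 128 new tokens by default) are illustrative.
        """
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
        output_ids = self.model.generate(
            **inputs, max_new_tokens=max_tokens, do_sample=True, top_p=0.9
        )
        new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
        return self.tokenizer.decode(new_tokens, skip_special_tokens=True).strip()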
    
    def generate_dataset(
        self,
        seed_instructions: List[str],
        num_evolutions: int = 3,
        evolution_factor: int = 2
    ) -> List[Dict]:
        """Generate evolved instruction dataset."""
        dataset = []
        
        for seed in seed_instructions:
            current = [seed]
            
            for level in range(num_evolutions):
                next_level = []
                for instr in current:
                    for evo_type in self.evolution_prompts:
                        evolved = self.evolve(instr, evo_type)
                        next_level.append(evolved)
                        dataset.append({
                            "instruction": evolved,
                            "evolution_type": evo_type,
                            "level": level  # 0-based evolution round
                        })
                
                current = next_level[:evolution_factor]
        
        return dataset
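
It helps to know how many instructions the loop above produces before spending generation budget. The hypothetical helper below mirrors the loop's bookkeeping: every surviving instruction is evolved once per evolution type, and only the first `evolution_factor` results carry into the next round.

```python
def count_evolved(num_types=4, num_evolutions=3, evolution_factor=2):
    """Number of evolved instructions produced per seed by the loop above."""
    total, current = 0, 1  # start from a single seed instruction
    for _ in range(num_evolutions):
        produced = current * num_types  # every survivor gets every evolution type
        total += produced
        current = min(produced, evolution_factor)  # only the first few carry forward
    return total


per_seed = count_evolved()  # 4 + 8 + 8 with the defaults
print(per_seed)
```

Because the survivors are taken with a plain slice, they are always the first two variants of the first instruction in each round (the "deepen" and "widen" variants with the dictionary order shown). Sampling the survivors at random instead would be one way to spread the later rounds across evolution types.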

Multi-Task vs Single-Task Tuning

Multi-Task Instruction Tuning

from datasets import load_dataset

multi_task_dataset = {
    "classification": load_dataset("glue", "sst2"),
    "summarization": load_dataset("cnn_dailymail", "3.0.0"),
    "qa": load_dataset("squad"),
    "translation": load_dataset("wmt14", "fr-en"),
    "code": load_dataset("code_x_glue_tc_nl_code_search_adv")
}

# Combine all tasks with instruction format
def format_multi_task_dataset(datasets, task_formats):
    combined = []
    
    for task_name, dataset in datasets.items():
        format_fn = task_formats[task_name]
        for example in dataset["train"]:
            instruction, response = format_fn(example)
            combined.append({
                "instruction": instruction,
                "response": response,
                "task": task_name
            })
    
    return combined
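
The `task_formats` argument is a mapping from task name to a function that turns one raw example into an (instruction, response) pair. A minimal sketch for two of the tasks follows. The function names are hypothetical, and the rows are hand-written in the shape of the GLUE SST-2 and SQuAD records so that nothing needs to be downloaded.

```python
def format_sst2(example):
    """GLUE SST-2 rows have a "sentence" string and a "label" (0 = negative, 1 = positive)."""
    instruction = (
        "Is the following sentence positive or negative? " + example["sentence"].strip()
    )
    response = "positive" if example["label"] == 1 else "negative"
    return instruction, response


def format_squad(example):
    """SQuAD rows carry "context", "question", and "answers" with a list under "text"."""
    instruction = (
        f"Based on the context, answer: {example['question']}\n"
        f"Context: {example['context']}"
    )
    return instruction, example["answers"]["text"][0]


task_formats = {"classification": format_sst2, "qa": format_squad}

# Hand-written rows in the same shape as the datasets.
sst2_row = {"sentence": "a charming and often affecting journey ", "label": 1}
squad_row = {
    "question": "Which sea does the Nile flow into?",
    "context": "The Nile flows north into the Mediterranean Sea.",
    "answers": {"text": ["the Mediterranean Sea"], "answer_start": [26]},
}
pairs = [task_formats["classification"](sst2_row), task_formats["qa"](squad_row)]
for instruction, response in pairs:
    print(repr(instruction), "->", repr(response))
```

Mapping integer labels to words matters: the model should be trained to emit "positive" or "negative", not a class index that only makes sense inside one dataset.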

Single-Task Instruction Tuning

Focus on one domain for specialized performance:

single_task_configs = {
    "medical": {
        "dataset": "medical_dialog",
        "instruction_format": "Based on the patient's symptoms, provide a diagnosis: {symptoms}",
        "response_format": "Diagnosis: {diagnosis}\nReasoning: {reasoning}"
    },
    "legal": {
        "dataset": "legal_contract_qa",
        "instruction_format": "Analyze the following legal clause: {clause}\nQuestion: {question}",
        "response_format": "Analysis: {analysis}\nAnswer: {answer}"
    },
    "code": {
        "dataset": "code_instruct",
        "instruction_format": "Write code to: {task_description}\nLanguage: {language}",
        "response_format": "```{language}\n{code}\n```\nExplanation: {explanation}"
    }
}

Training Best Practices

instruction_training_config = {
    "learning_rate": 2e-5,
    "batch_size": 32,
    "epochs": 3,
    "warmup_ratio": 0.03,
    "weight_decay": 0.1,
    "max_seq_length": 512,
    "gradient_accumulation_steps": 4,
    "fp16": True,
    "logging_steps": 100,
    "save_strategy": "epoch",
    "evaluation_strategy": "epoch"
}
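
Several of these settings interact. The sketch below derives the effective batch size and step counts from the config, under the assumptions that `batch_size` is per device and that warmup steps are rounded up. The `schedule_summary` helper and the dataset size of 50,000 examples are hypothetical.

```python
import math


def schedule_summary(config, num_examples, num_devices=1):
    """Derive step counts from the config, treating batch_size as per-device."""
    effective_batch = (
        config["batch_size"] * config["gradient_accumulation_steps"] * num_devices
    )
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * config["epochs"]
    warmup_steps = math.ceil(total_steps * config["warmup_ratio"])
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


config = {"batch_size": 32, "gradient_accumulation_steps": 4, "epochs": 3, "warmup_ratio": 0.03}
summary = schedule_summary(config, num_examples=50_000)
print(summary)
```

With these numbers each optimizer step sees 128 examples, so the whole run is only about 1,200 updates. On small instruction datasets a warmup ratio of 0.03 can amount to very few steps, which is worth checking before training.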

Common Pitfalls

Pitfall	Symptom	Solution
Overfitting	Train loss ↓, eval loss ↑	Early stopping, dropout
Underfitting	Both losses plateau high	Increase epochs, reduce regularization
Catastrophic forgetting	Good on target, bad on general	Lower learning rate, mix general data
Data contamination	High eval accuracy, poor real-world performance	Deduplicate training data against eval sets; use held-out test sets
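
Early stopping, the first remedy in the table, can be sketched in a few lines. The helper below is hypothetical and the eval losses are made-up numbers; when training with the transformers `Trainer`, the library's `EarlyStoppingCallback` serves the same purpose.

```python
def early_stopping_epoch(eval_losses, patience=2, min_delta=0.0):
    """Return the index of the epoch at which training would stop.

    Stops once eval loss has failed to improve by more than min_delta for
    `patience` consecutive epochs; returns the last index if that never happens.
    """
    best = float("inf")
    bad_epochs = 0
    for epoch, loss in enumerate(eval_losses):
        if loss < best - min_delta:
            best, bad_epochs = loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch
    return len(eval_losses) - 1


# Illustrative eval losses: improving, then rising as the model overfits.
stop_at = early_stopping_epoch([1.20, 0.95, 0.90, 0.93, 0.97, 1.05])
print(stop_at)
```

The best checkpoint here is the one from epoch 2 (loss 0.90), not the one at which training stops, so the checkpoint with the lowest eval loss should be the one that is kept.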

Summary

Practice Exercises

Instruction Dataset: Create an instruction dataset with 100 examples covering 5 different task types.
SFT Training: Fine-tune a small language model (e.g., GPT-2) on instruction data. Compare before and after on held-out instructions.
Self-Instruct: Implement Self-Instruct to generate 500 instruction-response pairs. Evaluate quality.
Evol-Instruct: Use Evol-Instruct to create a dataset with increasing complexity levels. Analyze difficulty distribution.
Multi-Task Evaluation: Train on 3 tasks and evaluate zero-shot on 5 unseen tasks. Compare with single-task training.

What to Learn Next

RLHF and Alignment: The next step after instruction tuning, aligning the model with human preferences.

Constitutional AI: An alternative to RLHF that replaces most human feedback with AI feedback guided by written principles.

Fine-Tuning LLMs: The broader fine-tuning techniques that instruction tuning builds upon.

LLM Safety and Red Teaming: Ensuring instruction-tuned models don't learn to follow harmful instructions.

Pretraining Language Models: Understanding the pre-training phase that comes before instruction tuning.

Building Production LLM Applications: Deploying instruction-tuned models in production environments.
