LLM Training

Fine-Tuning LLMs — Customizing Language Models for Your Task

Fine-tuning adapts a pre-trained language model to specific tasks or domains by continuing training on task-specific data. This guide covers full fine-tuning, instruction tuning, chat formats, and practical HuggingFace examples.

Full Fine-Tuning — Updates all parameters for maximum task performance
Instruction Tuning — Teaches models to follow complex multi-step instructions
Evaluation — Balancing target task performance with general capabilities

Fine-tuning is where general intelligence meets specific purpose.

Fine-tuning LLMs

Fine-tuning adapts a pre-trained language model to specific tasks or domains by continuing training on task-specific data. This tutorial covers the methods, objectives, and practical considerations.

Full Fine-tuning

Full fine-tuning updates all model parameters on the target dataset.

Learning Rate Schedule

Instruction Tuning

Chat Format

Modern instruction-tuned models use a structured chat format with role tokens. The system message sets behavior, user messages provide instructions, and assistant messages contain the model's responses.

Training Datasets

Dataset	Size	Source	Quality
Alpaca	52K	Self-instruct (GPT-3.5)	Medium
ShareGPT	90K	User-shared conversations	High
OpenAssistant	161K	Human annotations	High
Dolly	15K	Databricks employees	High
FLAN Collection	1.8M	Aggregated NLP tasks	Medium

Full Fine-tuning Example

from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling
)
from datasets import load_dataset

model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

dataset = load_dataset("tatsu-lab/alpaca", split="train")

def format_prompt(example):
    if example["input"]:
        return f"### Instruction:\n{example['instruction']}\n\n### Input:\n{example['input']}\n\n### Response:\n{example['output']}"
    return f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}"

def tokenize(examples):
    texts = [format_prompt(e) for e in examples]
    return tokenizer(texts, truncation=True, max_length=2048)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

training_args = TrainingArguments(
    output_dir="./alpaca-finetuned",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=2e-5,
    warmup_steps=100,
    max_steps=5000,
    fp16=True,
    logging_steps=50,
    save_steps=500,
    optim="adamw_torch",
    weight_decay=0.1,
    lr_scheduler_type="cosine",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

trainer.train()

When to Fine-tune

Fine-tune when:

You have 100+ high-quality examples
The task requires specific formatting or domain knowledge
Prompt engineering alone is insufficient
You need lower latency (no few-shot examples in prompt)

Prompt instead when:

Few labeled examples are available
The task is general-purpose
You need rapid iteration
Compute budget is limited

Practice Exercises

Mathematical: Calculate the total VRAM required to fine-tune a 7B parameter model in FP16 with gradient checkpointing. Assume sequence length 2048 and batch size 4.
Implementation: Fine-tune Llama-2-7B on a small custom dataset using the Alpaca format. Evaluate the before/after performance on 10 held-out examples.
Analysis: Compare the training curves (loss, learning rate) of fine-tuning with learning rates of 1e-5, 2e-5, and 5e-5. Which converges fastest without overfitting?
Research: What are the failure modes of instruction tuning? Investigate cases where fine-tuning degrades the base model's capabilities.

Advanced Fine-tuning Techniques

Data Quality and Curation

The quality of fine-tuning data can be quantified by measuring diversity, accuracy, and relevance. Always prioritize data quality over quantity for instruction tuning.

Hyperparameter Sensitivity

Fine-tuning is highly sensitive to hyperparameters. The most critical are learning rate, batch size, and number of epochs. Always perform a learning rate sweep.

Recommended hyperparameter ranges:

Parameter	Recommended Range	Notes
Learning rate	1e-6 to 5e-5	Start with 2e-5
Batch size	4-32	Larger is more stable
Epochs	1-5	Monitor validation loss
Warmup ratio	0.03-0.1	5-10% of total steps
Weight decay	0.0-0.2	Regularization

Common Failure Modes

Catastrophic forgetting: The model loses pre-trained knowledge. Mitigation: lower learning rate, fewer epochs, use LoRA.
Overfitting: Model memorizes training data. Mitigation: more data, dropout, weight decay, early stopping.
Mode collapse: Model produces the same output for all inputs. Mitigation: diverse training data, label smoothing.
Alignment tax: Fine-tuning improves one task but degrades others. Mitigation: multi-task training, elastic weight consolidation.

Evaluation During Fine-tuning

Monitor both training and validation metrics throughout fine-tuning. Key metrics include:

Training loss: Should decrease steadily
Validation loss: Should decrease then plateau (watch for overfitting)
Task-specific metrics: BLEU, ROUGE, accuracy, F1 depending on the task
Perplexity: Lower is better for language modeling tasks

Use early stopping when validation loss stops improving to prevent overfitting.

What to Learn Next

-> LoRA and PEFT Efficient fine-tuning without full retraining using low-rank adaptation.

-> QLoRA and Quantization Running LLMs on consumer hardware with INT4 quantization.

-> RLHF and Alignment Making LLMs safe and helpful through reinforcement learning from human feedback.

-> Constitutional AI Reducing dependence on human annotation through AI self-alignment.

-> Instruction Tuning Teaching models to follow complex multi-step instructions reliably.

-> Building Production LLM Apps From prototype to production: deploying LLMs at scale.

Fine-tuning LLMs

Fine-Tuning LLMs — Customizing Language Models for Your Task

Fine-tuning LLMs

Full Fine-tuning

Learning Rate Schedule

Instruction Tuning

Chat Format

Training Datasets

Full Fine-tuning Example

When to Fine-tune

Practice Exercises

Advanced Fine-tuning Techniques

Data Quality and Curation

Hyperparameter Sensitivity

Common Failure Modes

Evaluation During Fine-tuning

What to Learn Next

Need Expert LLM Help?