Transfer Learning: Fine-Tuning, Feature Extraction, Domain Adaptation — Asked at NVIDIA & Amazon

🎯 The Interview Question

"Explain the different strategies for transfer learning, including feature extraction vs fine-tuning. What is domain adaptation and how does it address the domain shift problem? How do you decide which layers to freeze when fine-tuning? What are the best practices for fine-tuning large pre-trained models like BERT or GPT?"

This question is crucial for practical ML work — transfer learning is the standard approach in modern deep learning.

📚 Detailed Answer

Transfer Learning: The Core Idea

Transfer learning leverages knowledge from a source task to improve performance on a target task.

Formal definition: Given source domain $\mathcal{D}_S$ with task $\mathcal{T}_S$ and target domain $\mathcal{D}_T$ with task $\mathcal{T}_T$ , transfer learning uses $\mathcal{D}_S, \mathcal{T}_S$ to improve learning of $\mathcal{T}_T$ in $\mathcal{D}_T$ .

Why it works: Features learned on large datasets (ImageNet, Wikipedia) capture general patterns that transfer to new tasks.

Strategy 1: Feature Extraction

Use pre-trained model as fixed feature extractor:

# Freeze all layers
model = models.resnet50(pretrained=True)
for param in model.parameters():
    param.requires_grad = False

# Replace classifier
model.fc = nn.Linear(2048, num_classes)

When to use:

Small target dataset (< 10K samples)
Similar domain to pre-training
Limited compute budget

Advantages:

Fast training (only train classifier head)
No risk of catastrophic forgetting
Low memory footprint

Strategy 2: Fine-Tuning

Update all or part of the pre-trained model:

# Fine-tune all layers with small learning rate
model = models.resnet50(pretrained=True)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

When to use:

Large target dataset
Different domain from pre-training
Need maximum performance

Risks:

Catastrophic forgetting (overwriting useful features)
Overfitting on small datasets
Requires careful hyperparameter tuning

Layer Freezing Strategies

Gradual Unfreezing

Start with all layers frozen, unfreeze one layer at a time:

# Stage 1: Train only classifier
for param in model.parameters():
    param.requires_grad = False
model.fc = nn.Linear(2048, num_classes)
train(model, epochs=5)

# Stage 2: Unfreeze last block
for param in model.layer4.parameters():
    param.requires_grad = True
train(model, epochs=5, lr=1e-5)

# Stage 3: Unfreeze all
for param in model.parameters():
    param.requires_grad = True
train(model, epochs=10, lr=1e-6)

💡

Use different learning rates for different layers: lower for early layers (general features), higher for later layers (task-specific features). This is called "discriminative fine-tuning."

Domain Adaptation

Addressing distribution shift between source and target domains.

Maximum Mean Discrepancy (MMD)

Minimize distance between feature distributions:

\mathcal{L}_{MMD} = \|\mu_S - \mu_T\|^2 + \|\Sigma_S - \Sigma_T\|^2

where $\mu, \Sigma$ are mean and covariance of source/target features.

Adversarial Domain Adaptation

Use a domain classifier to make features domain-invariant:

\min_G \max_D \mathcal{L}_{task}(G) - \lambda \mathcal{L}_{domain}(G, D)

where $G$ is feature extractor, $D$ is domain discriminator.

DANN (Domain-Adversarial Neural Network)

Architecture Diagram

Input → Feature Extractor → Task Classifier → Task Loss
                ↓
        Domain Classifier → Domain Loss

Gradient reversal layer ensures features are domain-invariant.

Fine-Tuning Large Language Models

Parameter-Efficient Fine-Tuning (PEFT)

LoRA (Low-Rank Adaptation):

\mathbf{W}' = \mathbf{W} + \mathbf{B}\mathbf{A}

where $\mathbf{A} \in \mathbb{R}^{r \times d}$ , $\mathbf{B} \in \mathbb{R}^{d \times r}$ , $r \ll d$ .

Only train $\mathbf{A}$ and $\mathbf{B}$ (typically $r=8-64$ ), keeping $\mathbf{W}$ frozen.

QLoRA: Quantize frozen weights to 4-bit, train LoRA adapters.

Adapters: Insert small bottleneck layers between frozen layers:

y = x + f(xW_{down})W_{up}

where $W_{down} \in \mathbb{R}^{d \times r}$ , $W_{up} \in \mathbb{R}^{r \times d}$ .

Best Practices

Follow-Up Questions

Q: How do you detect catastrophic forgetting? A: Monitor performance on source task during fine-tuning. If it drops significantly, use lower learning rates, freeze more layers, or use regularization like EWC.

Q: When is domain adaptation necessary? A: When source and target distributions differ significantly. Examples: synthetic to real images, different camera angles, different text styles.

Q: How does few-shot learning differ from transfer learning? A: Transfer learning uses many target examples; few-shot uses very few (1-10). Techniques like Prototypical Networks and MAML are designed for few-shot scenarios.