Transfer Learning: Fine-Tuning, Feature Extraction, Domain Adaptation — Asked at NVIDIA & Amazon

🎯 The Interview Question

"Explain the different strategies for transfer learning, including feature extraction vs fine-tuning. What is domain adaptation and how does it address the domain shift problem? How do you decide which layers to freeze when fine-tuning? What are the best practices for fine-tuning large pre-trained models like BERT or GPT?"

This question matters for practical ML work because transfer learning is the standard starting point in modern deep learning.


📚 Detailed Answer

Transfer Learning: The Core Idea

Transfer learning leverages knowledge from a source task to improve performance on a target task.

Formal definition: Given a source domain $\mathcal{D}_S$ with task $\mathcal{T}_S$ and a target domain $\mathcal{D}_T$ with task $\mathcal{T}_T$, transfer learning uses knowledge from $\mathcal{D}_S$ and $\mathcal{T}_S$ to improve learning of $\mathcal{T}_T$ in $\mathcal{D}_T$.

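Why it works: Features learned on large datasets (ImageNet, Wikipedia) capture general patterns, such as edges and textures in early CNN layers or syntax and word meaning in language models, and these patterns transfer to new tasks.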

Strategy 1: Feature Extraction

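Use the pre-trained model as a fixed feature extractor and train only a new classifier head:

import torch.nn as nn
from torchvision import models

# Load ImageNet weights (torchvision >= 0.13; older releases use pretrained=True)
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)

# Freeze all pre-trained layers
for param in model.parameters():
    param.requires_grad = False

# Replace the classifier; the new layer is trainable by default
num_classes = 10  # placeholder: set to the number of target classes
model.fc = nn.Linear(model.fc.in_features, num_classes)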

When to use:

  • Small target dataset (< 10K samples)
  • Similar domain to pre-training
  • Limited compute budget

Advantages:

  • Fast training (only train classifier head)
  • No risk of catastrophic forgetting
  • Low memory footprint
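
To make the "only train the classifier head" point concrete, here is a minimal sketch that counts trainable and frozen parameters after freezing. The small backbone is a hypothetical stand-in, chosen so that nothing needs to be downloaded; the same counting loop should apply to a real pre-trained ResNet-50.

```python
import torch.nn as nn

# Stand-in backbone (hypothetical); a real run would load pre-trained weights.
backbone = nn.Sequential(
    nn.Linear(32, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
)
head = nn.Linear(64, 5)  # new classifier head for 5 target classes
model = nn.Sequential(backbone, head)

# Freeze the backbone; the head stays trainable.
for param in backbone.parameters():
    param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in model.parameters() if not p.requires_grad)
print(f"trainable={trainable}, frozen={frozen}")
```

Only the trainable parameters need gradients and optimizer state, which is where the speed and memory savings come from.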

Strategy 2: Fine-Tuning

Update all or part of the pre-trained model:

import torch
from torchvision import models

# Fine-tune all layers with a small learning rate
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

When to use:

  • Large target dataset
  • Different domain from pre-training
  • Need maximum performance

Risks:

  • Catastrophic forgetting (overwriting useful features)
  • Overfitting on small datasets
  • Requires careful hyperparameter tuning

Layer Freezing Strategies

Gradual Unfreezing

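Start with all pre-trained layers frozen, then unfreeze them in stages, from the last block back toward the input:

# `train` is a placeholder for your own training loop. Rebuild the optimizer
# at each stage so it picks up the newly unfrozen parameters.

# Stage 1: Train only the new classifier
for param in model.parameters():
    param.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, num_classes)  # trainable by default
train(model, epochs=5)

# Stage 2: Unfreeze the last residual block
for param in model.layer4.parameters():
    param.requires_grad = True
train(model, epochs=5, lr=1e-5)

# Stage 3: Unfreeze everything and lower the learning rate further
for param in model.parameters():
    param.requires_grad = True
train(model, epochs=10, lr=1e-6)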

💡 Use different learning rates for different layers: lower for early layers (general features), higher for later layers (task-specific features). This is called "discriminative fine-tuning."
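
Per-layer learning rates can be expressed with optimizer parameter groups. The sketch below assumes a small stand-in network whose submodule names (early, late, head) are hypothetical; the learning-rate values are illustrative, not tuned recommendations.

```python
import torch
import torch.nn as nn

# Stand-in network (hypothetical): "early", "late", and "head" mimic depth.
model = nn.ModuleDict({
    "early": nn.Linear(16, 16),
    "late": nn.Linear(16, 16),
    "head": nn.Linear(16, 4),
})

# One parameter group per depth level; values are illustrative, not tuned.
optimizer = torch.optim.Adam([
    {"params": model["early"].parameters(), "lr": 1e-6},
    {"params": model["late"].parameters(), "lr": 1e-5},
    {"params": model["head"].parameters(), "lr": 1e-3},
])

lrs = [group["lr"] for group in optimizer.param_groups]
print(lrs)
```

With a ResNet, the same pattern would group parameters by block (for example model.layer1 through model.layer4 and model.fc).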

Domain Adaptation

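Domain adaptation addresses distribution shift between the source and target domains: the model is typically trained on labeled source data but must perform well on target data drawn from a different distribution.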

Maximum Mean Discrepancy (MMD)

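Minimize the distance between the source and target feature distributions. A simple moment-matching form penalizes differences in the first two moments:

$$\mathcal{L}_{MMD} = \|\mu_S - \mu_T\|^2 + \|\Sigma_S - \Sigma_T\|_F^2$$

where $\mu$ and $\Sigma$ are the mean and covariance of the source ($S$) and target ($T$) features. Strictly, MMD compares the mean embeddings of the two distributions in a kernel feature space; the expression above matches only the first two moments and serves as a simple approximation.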
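
The mean-and-covariance loss above can be sketched in a few lines of PyTorch. This is a minimal illustration under stated assumptions: the function name moment_matching_loss is hypothetical, the features are random tensors, and the target batch is shifted by a constant to simulate domain shift.

```python
import torch

def moment_matching_loss(source: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Squared distance between feature means plus squared Frobenius
    distance between feature covariances. Inputs are (batch, features)."""
    mean_s, mean_t = source.mean(dim=0), target.mean(dim=0)
    cov_s = torch.cov(source.T)  # torch.cov expects (variables, observations)
    cov_t = torch.cov(target.T)
    return (mean_s - mean_t).pow(2).sum() + (cov_s - cov_t).pow(2).sum()

torch.manual_seed(0)
source_feats = torch.randn(256, 8)
target_feats = torch.randn(256, 8) + 2.0  # shifted mean simulates domain shift

same = moment_matching_loss(source_feats, source_feats)
shifted = moment_matching_loss(source_feats, target_feats)
print(same.item(), shifted.item())
```

In training, a term like this would be added to the task loss and computed on features from a shared extractor, so that gradients pull the two feature distributions together.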

Adversarial Domain Adaptation

Use a domain classifier to make features domain-invariant:

$$\min_G \max_D \; \mathcal{L}_{task}(G) - \lambda \mathcal{L}_{domain}(G, D)$$

where $G$ is the feature extractor and $D$ is the domain discriminator.

DANN (Domain-Adversarial Neural Network)

Architecture Diagram
Input → Feature Extractor → Task Classifier → Task Loss
                ↓
        Domain Classifier → Domain Loss

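A gradient reversal layer sits between the feature extractor and the domain classifier. It acts as the identity in the forward pass and multiplies the gradient by $-\lambda$ in the backward pass, so the feature extractor is pushed toward features the domain classifier cannot separate, that is, toward domain invariance.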
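
A gradient reversal layer can be written as a custom autograd function. The sketch below is one common way to do it, not necessarily the reference DANN code; the names GradReverse and grad_reverse are illustrative.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; scales gradients by -lambda backward."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # One gradient per forward input; lambd is a plain float, so None.
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

features = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
reversed_out = grad_reverse(features, lambd=0.5)
reversed_out.sum().backward()
print(features.grad)
```

In a DANN-style model, the domain classifier would receive grad_reverse(features) while the task classifier receives the features directly, so a single backward pass trains both objectives.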

Fine-Tuning Large Language Models

Parameter-Efficient Fine-Tuning (PEFT)

LoRA (Low-Rank Adaptation):

$$\mathbf{W}' = \mathbf{W} + \mathbf{B}\mathbf{A}$$

where $\mathbf{A} \in \mathbb{R}^{r \times d}$, $\mathbf{B} \in \mathbb{R}^{d \times r}$, and $r \ll d$.

Only $\mathbf{A}$ and $\mathbf{B}$ are trained (typically with $r$ between 8 and 64), while $\mathbf{W}$ stays frozen.
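
As a rough illustration of the update above, here is a minimal LoRA-style linear layer. It is a sketch under assumptions, not a library implementation: the class name LoRALinear is hypothetical, the weight is square ($d \times d$) to match the formula, B is zero-initialized so training starts from the frozen model, and the scaling factor used in practice is omitted.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update B @ A."""

    def __init__(self, d: int, r: int):
        super().__init__()
        self.base = nn.Linear(d, d, bias=False)
        self.base.weight.requires_grad = False  # W stays frozen
        self.A = nn.Parameter(torch.randn(r, d) * 0.01)  # A in R^{r x d}
        self.B = nn.Parameter(torch.zeros(d, r))  # B in R^{d x r}, zero init

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Equivalent to x @ (W + B A)^T without forming the d x d update.
        return self.base(x) + (x @ self.A.T) @ self.B.T

torch.manual_seed(0)
layer = LoRALinear(d=512, r=8)
x = torch.randn(4, 512)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total_frozen = layer.base.weight.numel()
print(f"trainable={trainable}, frozen={total_frozen}")
```

With $d = 512$ and $r = 8$, the adapter trains $2dr$ parameters against $d^2$ frozen ones, which is the source of the memory savings.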

QLoRA: Quantize the frozen base weights to 4-bit and train LoRA adapters on top, which reduces the memory needed to hold the base model during fine-tuning.

Adapters: Insert small bottleneck layers between frozen layers:

$$y = x + f(x W_{down}) W_{up}$$

where $W_{down} \in \mathbb{R}^{d \times r}$, $W_{up} \in \mathbb{R}^{r \times d}$, and $f$ is a nonlinearity.
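
The bottleneck formula maps directly to a small module. The sketch below makes several assumptions that go beyond the formula: f is ReLU, the projections carry bias terms (nn.Linear defaults), and the up-projection is zero-initialized so the adapter starts as the identity. The class name Adapter and the sizes are illustrative.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: y = x + f(x W_down) W_up, with f = ReLU here."""

    def __init__(self, d: int, r: int):
        super().__init__()
        self.down = nn.Linear(d, r)  # W_down: d -> r
        self.up = nn.Linear(r, d)    # W_up: r -> d
        # Zero-init the up-projection so the adapter starts as the identity.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(torch.relu(self.down(x)))

torch.manual_seed(0)
adapter = Adapter(d=768, r=64)
hidden = torch.randn(2, 10, 768)  # (batch, sequence, hidden)
out = adapter(hidden)
n_params = sum(p.numel() for p in adapter.parameters())
print(out.shape, n_params)
```

The residual connection means a freshly inserted adapter leaves the frozen model's outputs unchanged until training moves the up-projection away from zero.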

Best Practices

Follow-Up Questions

Q: How do you detect catastrophic forgetting? A: Monitor performance on the source task during fine-tuning. If it drops significantly, use a lower learning rate, freeze more layers, or add regularization such as Elastic Weight Consolidation (EWC).

Q: When is domain adaptation necessary? A: When source and target distributions differ significantly. Examples: synthetic to real images, different camera angles, different text styles.

Q: How does few-shot learning differ from transfer learning? A: Standard transfer learning assumes enough labeled target examples to fine-tune on; few-shot learning must adapt from very few (1-10 per class). Techniques like Prototypical Networks and MAML are designed for few-shot scenarios.
