BERT & Encoder Models — Complete Guide
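BERT (Bidirectional Encoder Representations from Transformers) is a Transformer encoder in which each token's representation is conditioned on both its left and its right context. When it was released in 2018 it set new state-of-the-art results on a wide range of NLP benchmarks.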
How BERT Works
Pre-training (self-supervised, on unlabeled text):
Task 1: Masked Language Modeling (MLM)
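├─ Select 15% of tokens at random (of these, 80% become [MASK], 10% a random token, 10% stay unchanged)
├─ Predict the original token at each selected position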
├─ Forces bidirectional understanding
└─ "The cat [MASK] on the mat" → "sat"
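The corruption step of MLM can be sketched in a few lines of plain Python. This is a simplified illustration, not BERT's actual data pipeline: the function name `mask_tokens`, its parameters, and the toy vocabulary are hypothetical, and it flips an independent coin per token rather than selecting an exact 15% of positions. The 80/10/10 split follows the original BERT recipe.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, vocab, select_prob=0.15, rng=None):
    """Return (corrupted, labels). labels[i] is None where no prediction is required."""
    rng = rng or random.Random()
    corrupted, labels = [], []
    for token in tokens:
        if rng.random() >= select_prob:
            # Not selected: the token passes through and contributes no loss.
            corrupted.append(token)
            labels.append(None)
            continue
        labels.append(token)  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            corrupted.append(MASK_TOKEN)         # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))  # 10%: replace with a random token
        else:
            corrupted.append(token)              # 10%: keep the token unchanged
    return corrupted, labels

tokens = "the cat sat on the mat".split()
corrupted, labels = mask_tokens(tokens, vocab=sorted(set(tokens)), rng=random.Random(0))
print(corrupted)
print(labels)
```

Keeping some selected tokens unchanged or swapping in random ones means the model cannot rely on seeing [MASK] at every position it has to predict, which matters because [MASK] never appears during fine-tuning.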
Task 2: Next Sentence Prediction (NSP)
├─ Does sentence B actually follow sentence A in the source text?
├─ 50% positive (the true next sentence), 50% negative (a random sentence from the corpus)
└─ Learns sentence relationships
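Building NSP training pairs can be sketched as follows. This is a minimal illustration under stated assumptions: `make_nsp_pairs` is a hypothetical helper, documents are given as lists of already-split sentences, and negatives are drawn from a different document so that they cannot accidentally be the true next sentence.

```python
import random

def make_nsp_pairs(documents, rng=None):
    """documents: list of documents, each a non-empty list of sentences.
    Returns (sentence_a, sentence_b, is_next) triples."""
    if len(documents) < 2:
        raise ValueError("need at least two documents to sample negatives")
    rng = rng or random.Random()
    pairs = []
    for doc_index, sentences in enumerate(documents):
        other_docs = [d for i, d in enumerate(documents) if i != doc_index]
        for sent_a, true_next in zip(sentences, sentences[1:]):
            if rng.random() < 0.5:
                # Positive example: B really follows A.
                pairs.append((sent_a, true_next, 1))
            else:
                # Negative example: B is a random sentence from another document.
                random_sentence = rng.choice(rng.choice(other_docs))
                pairs.append((sent_a, random_sentence, 0))
    return pairs

docs = [
    ["The cat sat on the mat.", "It purred quietly.", "Then it fell asleep."],
    ["Stocks rose on Monday.", "Analysts were surprised."],
]
for pair in make_nsp_pairs(docs, rng=random.Random(0)):
    print(pair)
```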
Fine-tuning (supervised):
├─ Add task-specific head
├─ Train on labeled data
└─ Needs far less labeled data than training a model from scratch
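For sentence classification, the task-specific head is typically a single linear layer applied to the encoder's vector for the [CLS] token, followed by a softmax. The sketch below shows only that arithmetic in plain Python; the function name, the 3-dimensional vector, and the weights are made-up illustrative values (a real BERT-base vector has 768 dimensions, and library implementations usually add pooling and dropout in front of the linear layer).

```python
import math

def classification_head(cls_vector, weights, bias):
    """Linear layer plus softmax over one [CLS] vector.
    weights has one row per class; bias has one entry per class."""
    logits = [
        sum(w * x for w, x in zip(row, cls_vector)) + b
        for row, b in zip(weights, bias)
    ]
    # Subtract the max logit before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy values: 3-dimensional [CLS] vector, 2 classes.
cls_vector = [0.2, -1.0, 0.5]
weights = [[0.1, 0.3, -0.2], [-0.4, 0.2, 0.6]]
bias = [0.0, 0.1]
print(classification_head(cls_vector, weights, bias))
```

During fine-tuning, both the head's weights and the pre-trained encoder weights are updated, which is why comparatively little labeled data is enough.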
BERT Variants
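BERT-base: 12 layers, 768 hidden size, 12 attention heads, 110M params
BERT-large: 24 layers, 1024 hidden size, 16 attention heads, 340M params
RoBERTa: Same architecture, better training recipe (more data, longer training, dynamic masking, no NSP)
ALBERT: Parameter-efficient BERT (cross-layer parameter sharing, factorized embeddings)
DistilBERT: Smaller, faster BERT trained by knowledge distillation
DeBERTa: Disentangled attention over content and position; outperforms BERT and RoBERTa on benchmarks such as GLUE and SuperGLUE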
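The parameter counts quoted for BERT-base and BERT-large can be roughly reproduced from the layer count and hidden size. The sketch below is a back-of-the-envelope calculation, assuming the `bert-base-uncased` vocabulary of 30,522 tokens, 512 positions, 2 segment types, a feed-forward size of 4x the hidden size, and a pooler layer on top; `bert_param_count` is a hypothetical helper, not a library function.

```python
def bert_param_count(layers, hidden, vocab_size=30522, max_positions=512, type_vocab=2):
    """Approximate parameter count of a BERT encoder (no task-specific head)."""
    ffn = 4 * hidden          # feed-forward inner size
    layer_norm = 2 * hidden   # scale + shift
    # Token, position, and segment embeddings, followed by a LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value, and output projections (weights + biases), then a LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two dense layers (hidden -> ffn -> hidden), then a LayerNorm.
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden) + layer_norm
    # Dense layer applied to the [CLS] vector.
    pooler = hidden * hidden + hidden
    return embeddings + layers * (attention + feed_forward) + pooler

print(bert_param_count(layers=12, hidden=768))    # 109482240
print(bert_param_count(layers=24, hidden=1024))   # 335141888
```

Under these assumptions the totals come to about 109.5M and 335.1M, which are usually rounded to the 110M and 340M figures. The number of attention heads does not affect the count, because the heads split the same projection matrices.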
Fine-Tuning BERT
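from transformers import BertTokenizer, BertForSequenceClassification
from transformers import Trainer, TrainingArguments
# Load the pre-trained tokenizer and encoder; num_labels=2 adds a new binary classification head
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
# Tokenize (`texts` is assumed to be a list of strings defined elsewhere)
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
# Training configuration (illustrative values; tune them for your task)
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
)
# Fine-tune (train_dataset and eval_dataset are assumed to be tokenized datasets
# that yield input_ids, attention_mask, and labels)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
trainer.train()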
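To make the effect of `padding=True` and `truncation=True` in the tokenizer call concrete, here is a simplified stand-in written in plain Python. It is not the tokenizer's implementation: `pad_and_truncate` is a hypothetical helper, and the ids 7, 8, 9 are placeholders for word pieces. The ids 101, 102, and 0 are the [CLS], [SEP], and [PAD] ids in `bert-base-uncased`. The real tokenizer also returns `token_type_ids` and keeps the closing [SEP] when it truncates.

```python
def pad_and_truncate(batch_ids, max_length=512, pad_id=0):
    """Clip each sequence to max_length, then pad all sequences to the longest one."""
    clipped = [ids[:max_length] for ids in batch_ids]
    longest = max(len(ids) for ids in clipped)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in clipped]
    # 1 marks a real token, 0 marks padding that attention should ignore.
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in clipped]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_and_truncate([[101, 7, 8, 9, 102], [101, 7, 102]])
print(batch["input_ids"])       # [[101, 7, 8, 9, 102], [101, 7, 102, 0, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```

Padding lets sequences of different lengths share one rectangular tensor, and the attention mask keeps the padded positions from influencing the real tokens.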
Key Takeaways
- BERT is bidirectional — understands context from both sides
- Pre-training + fine-tuning paradigm for NLP
- MLM and NSP are pre-training objectives
- BERT excels at classification and token-level tasks
- RoBERTa and DeBERTa are improved versions
- DistilBERT for faster inference (retains about 97% of BERT's performance while being 40% smaller and 60% faster)
- BERT is an encoder-only model (not designed for open-ended text generation)
- For text generation, use a decoder-only model such as GPT
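The encoder-only versus decoder-only distinction comes down to which positions each token may attend to. The sketch below prints both patterns for a three-token sequence; `attention_pattern` is a hypothetical helper for illustration, where row i lists the positions that token i can see.

```python
def attention_pattern(seq_len, causal):
    """Return a seq_len x seq_len grid: 1 if the query (row) may attend to the key (column)."""
    return [
        [1 if (not causal or key <= query) else 0 for key in range(seq_len)]
        for query in range(seq_len)
    ]

print("Encoder (BERT-style, bidirectional):")
for row in attention_pattern(3, causal=False):
    print(row)
# [1, 1, 1]
# [1, 1, 1]
# [1, 1, 1]

print("Decoder (GPT-style, causal):")
for row in attention_pattern(3, causal=True):
    print(row)
# [1, 0, 0]
# [1, 1, 0]
# [1, 1, 1]
```

Because every BERT token sees the whole sequence, the model suits classification and token labeling. A causal model only sees earlier tokens, which is what allows it to generate text one token at a time.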