Embedding Models
What are Embeddings?
Embeddings are dense vector representations of text, images, or other data in a continuous vector space. Items with similar meaning map to nearby vectors, which enables semantic search and similarity comparisons.
How Embeddings Work
from sentence_transformers import SentenceTransformer
import numpy as np
# Load an embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')
# Generate embeddings
sentences = [
    "The cat sat on the mat",
    "A kitten rested on the rug",
    "The car drove down the road"
]
embeddings = model.encode(sentences)
# Calculate similarity
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
# Similar sentences have high similarity
sim_01 = cosine_similarity(embeddings[0], embeddings[1]) # High
sim_02 = cosine_similarity(embeddings[0], embeddings[2]) # Low
print(f"Cat-Kitten similarity: {sim_01:.3f}")
print(f"Cat-Car similarity: {sim_02:.3f}")
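The same cosine comparison extends naturally to semantic search: score every stored vector against a query vector and keep the best matches. The sketch below is one minimal way to do that with NumPy alone. The helper name `top_k_similar` and the 3-dimensional toy vectors are assumptions made up for illustration; in practice the vectors would come from `model.encode`.

```python
import numpy as np

def top_k_similar(query_vec, corpus_vecs, k=2):
    """Return (index, cosine similarity) pairs for the k corpus rows closest to query_vec."""
    query = np.asarray(query_vec, dtype=float)
    corpus = np.asarray(corpus_vecs, dtype=float)
    # Normalize to unit length so that a dot product equals cosine similarity
    query = query / np.linalg.norm(query)
    corpus = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    scores = corpus @ query
    # argsort is ascending, so negate the scores to rank the highest first
    order = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in order]

# Toy vectors standing in for real embeddings (hypothetical values)
corpus_vecs = np.array([
    [0.9, 0.1, 0.0],   # stands in for "A kitten rested on the rug"
    [0.0, 0.2, 0.9],   # stands in for "The car drove down the road"
    [0.8, 0.3, 0.1],   # stands in for "The cat sat on the mat"
])
query_vec = np.array([1.0, 0.2, 0.0])

results = top_k_similar(query_vec, corpus_vecs, k=2)
for idx, score in results:
    print(f"corpus[{idx}] similarity: {score:.3f}")
```

Normalizing once up front turns the search into a single matrix-vector product, which is usually adequate for small collections; larger ones tend to rely on a dedicated vector index instead.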
Popular Embedding Models
Fine-tuning Embeddings
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

def fine_tune_embeddings(train_pairs, model_name='all-MiniLM-L6-v2'):
    model = SentenceTransformer(model_name)
    # Prepare training data: each pair is a dict with 'text1', 'text2', and 'similarity'
    train_examples = [
        InputExample(texts=[pair['text1'], pair['text2']],
                     label=float(pair['similarity']))  # the loss expects float labels
        for pair in train_pairs
    ]
    train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
    # Train so that the cosine similarity of each pair approaches its label
    train_loss = losses.CosineSimilarityLoss(model)
    model.fit(
        train_objectives=[(train_dataloader, train_loss)],
        epochs=3,
        warmup_steps=100
    )
    return model
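The function above expects `train_pairs` to be a list of dicts with `text1`, `text2`, and `similarity` keys. As a rough sketch of how such a list might be built, the helper below rescales human ratings into the [0, 1] range. The name `make_train_pairs`, the 0 to 5 rating scale, and the sample rows are all assumptions for illustration, not part of the sentence-transformers library.

```python
def make_train_pairs(rows, max_score=5.0):
    """Convert (text1, text2, score) tuples into the dicts fine_tune_embeddings reads."""
    pairs = []
    for text1, text2, score in rows:
        # Reject ratings outside the assumed 0..max_score scale
        if not 0.0 <= score <= max_score:
            raise ValueError(f"score {score} is outside [0, {max_score}]")
        pairs.append({
            "text1": text1,
            "text2": text2,
            # Rescale to [0, 1] so the label is comparable to a cosine similarity target
            "similarity": float(score) / max_score,
        })
    return pairs

# Hypothetical ratings on a 0 to 5 scale
rated_rows = [
    ("The cat sat on the mat", "A kitten rested on the rug", 4.0),
    ("The cat sat on the mat", "The car drove down the road", 0.5),
]
train_pairs = make_train_pairs(rated_rows)
for pair in train_pairs:
    print(pair["text1"], "|", pair["text2"], "|", pair["similarity"])
```

The resulting list can then be passed as `fine_tune_embeddings(train_pairs)`. Two pairs are far too few for real training; they only show the expected shape of the data.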
Summary
Embeddings are foundational to many AI applications. Understanding their properties and how to use them effectively is essential for building semantic search and retrieval systems.
Next: We'll explore text generation models.