Vector Databases
What are Vector Databases?
Vector databases are specialized data stores designed to efficiently store, index, and query high-dimensional vectors (embeddings). Their core operation is similarity search: given a query vector, return the stored vectors closest to it in the embedding space, as measured by a metric such as cosine similarity or Euclidean distance.
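As a minimal sketch of what similarity search means, the snippet below runs an exhaustive (brute-force) nearest-neighbor search with NumPy. The function name `brute_force_search` and the toy vectors are illustrative assumptions, not part of any library. Real vector databases typically add index structures so that not every stored vector has to be scanned on each query.

```python
import numpy as np


def brute_force_search(vectors, query, k=3):
    """Return (index, distance) pairs for the k vectors closest to query (Euclidean)."""
    vectors = np.asarray(vectors, dtype="float32")
    query = np.asarray(query, dtype="float32")
    # Distance from the query to every stored vector: O(n * d) work per query.
    distances = np.linalg.norm(vectors - query, axis=1)
    # Indices of the k smallest distances, nearest first.
    nearest = np.argsort(distances)[:k]
    return [(int(i), float(distances[i])) for i in nearest]


# Toy 2-dimensional "embeddings" (hypothetical data for illustration).
stored = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]]
hits = brute_force_search(stored, query=[0.9, 0.1], k=2)
print(hits)  # nearest first: index 1, then index 0
```

The cost of this approach grows linearly with the number of stored vectors, which is the main reason dedicated indexes exist.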
Index Algorithms
Using Vector Databases
```python
# Using FAISS
import faiss
import numpy as np


class FAISSVectorStore:
    """In-memory store backed by an exact (brute-force) L2 index."""

    def __init__(self, dimension=768):
        self.dimension = dimension
        self.index = faiss.IndexFlatL2(dimension)
        self.documents = []

    def add_documents(self, embeddings, documents):
        # FAISS expects a 2-D float32 array of shape (n, dimension).
        self.index.add(np.asarray(embeddings, dtype="float32"))
        self.documents.extend(documents)

    def search(self, query_embedding, k=5):
        distances, indices = self.index.search(
            np.asarray([query_embedding], dtype="float32"), k
        )
        results = []
        for distance, idx in zip(distances[0], indices[0]):
            if idx < 0:  # FAISS pads with -1 when fewer than k vectors exist
                continue
            results.append({
                "document": self.documents[idx],
                # IndexFlatL2 reports squared L2 distances.
                "distance": float(distance),
            })
        return results
```
```python
# Using ChromaDB
import chromadb


class ChromaVectorStore:
    def __init__(self, collection_name="documents"):
        # chromadb.Client() is an in-memory client; data is lost on exit.
        self.client = chromadb.Client()
        # get_or_create_collection avoids an error if the collection already exists.
        self.collection = self.client.get_or_create_collection(collection_name)

    def add_documents(self, ids, documents, embeddings):
        self.collection.add(
            ids=ids,
            documents=documents,
            embeddings=embeddings,
        )

    def search(self, query_embedding, k=5):
        # Returns a dict of lists of lists, one inner list per query embedding.
        return self.collection.query(
            query_embeddings=[query_embedding],
            n_results=k,
        )
```
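Unlike the FAISS wrapper, the Chroma `search` method returns the raw query result: a dictionary whose values are lists of lists, with one inner list per query embedding. A small helper can flatten that into per-hit records. The sketch below assumes the result carries the `ids`, `documents`, and `distances` keys; the helper name `flatten_query_result` and the sample dictionary are hypothetical and hand-written for illustration, not real query output.

```python
def flatten_query_result(results, query_index=0):
    """Convert one query's nested Chroma-style result lists into a list of dicts."""
    ids = results["ids"][query_index]

    def column(key):
        # A field may be missing or None if it was not requested in the query.
        values = results.get(key)
        return values[query_index] if values else [None] * len(ids)

    return [
        {"id": i, "document": doc, "distance": dist}
        for i, doc, dist in zip(ids, column("documents"), column("distances"))
    ]


# Hand-written sample in the nested shape described above.
sample = {
    "ids": [["doc-2", "doc-7"]],
    "documents": [["FAISS is a library.", "Chroma is a database."]],
    "distances": [[0.12, 0.48]],
}
for hit in flatten_query_result(sample):
    print(hit["id"], hit["distance"])
```

Flattening this way gives both stores the same return shape, which makes it easier to swap one backend for the other.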
Comparison: Vector Databases
| Database | Type | Scale | Performance | Features |
|---|---|---|---|---|
| Pinecone | Cloud | Very Large | High | Managed, real-time |
| Weaviate | Self/Cloud | Large | High | GraphQL, hybrid search |
| Qdrant | Self/Cloud | Large | Very High | Filtering, quantization |
| ChromaDB | Local | Small-Medium | Good | Simple, developer-friendly |
| FAISS | Library | Very Large | Very High | Low-level, GPU support |
Similarity Metrics
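```python
import numpy as np

def cosine_similarity(a, b):
    # Range [-1, 1]; undefined (division by zero) if either vector is all zeros.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def euclidean_distance(a, b):
    # np.asarray lets plain Python lists work with the subtraction.
    return np.linalg.norm(np.asarray(a) - np.asarray(b))

def dot_product(a, b):
    return np.dot(a, b)
```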
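These three metrics are closely related once vectors are scaled to unit length: cosine similarity equals the dot product, and the squared Euclidean distance equals `2 - 2 * cosine`. The sketch below checks this numerically on two small vectors. The `normalize` helper and the example values are assumptions made for illustration.

```python
import numpy as np


def normalize(v):
    """Scale a vector to unit length (assumes v is not the zero vector)."""
    v = np.asarray(v, dtype="float64")
    return v / np.linalg.norm(v)


a = normalize([3.0, 4.0])
b = normalize([4.0, 3.0])

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
dot = np.dot(a, b)
squared_l2 = np.linalg.norm(a - b) ** 2

# For unit vectors: cosine equals the dot product,
# and ||a - b||^2 = 2 - 2 * cosine.
print(np.isclose(cosine, dot))
print(np.isclose(squared_l2, 2 - 2 * cosine))
```

In practice this means that if embeddings are normalized before indexing, all three metrics produce the same nearest-neighbor ranking, so the choice can follow whichever metric the index supports.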
Summary
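Vector databases are essential infrastructure for AI applications requiring similarity search. Choose based on your scale, performance, and feature requirements: for example, ChromaDB for simple local development, a library such as FAISS for low-level control and GPU support, or a managed service such as Pinecone for very large deployments.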
Next: We'll explore embedding models in detail.