K-Nearest Neighbors — Complete Guide
KNN is the simplest ML algorithm — it classifies a point by looking at its K closest neighbors.
How KNN Works
Algorithm:
1. Store all training data
2. For a new point:
a. Calculate distance to ALL training points
b. Find K closest points
c. Majority vote → predicted class
Example (K=3):
New point: (2, 3)
Neighbors: (1,2)→Class A, (2,4)→Class B, (3,3)→Class A
Vote: A=2, B=1
Prediction: Class A
Distance Metrics
Euclidean Distance (most common):
d(x, y) = √(Σ(xᵢ - yᵢ)²)
Manhattan Distance:
d(x, y) = Σ|xᵢ - yᵢ|
Minkowski Distance (generalization):
d(x, y) = (Σ|xᵢ - yᵢ|ᵖ)^(1/p)
p=1: Manhattan, p=2: Euclidean
Cosine Similarity:
cos(θ) = (x·y) / (||x|| × ||y||)
Good for text classification
Choosing K
K=1: Very flexible, noisy, overfits
K=3-5: Usually good starting point
K=10+: Smoother boundaries, may underfit
Rules of thumb:
├─ Use odd K for binary classification (avoid ties)
├─ Start with K = √n (where n = training size)
├─ Use cross-validation to find optimal K
└─ Larger K = smoother decision boundary
K too small: K too large:
Overfitting Underfitting
Complex boundaries Simple boundaries
Sensitive to noise Misses local patterns
Python Implementation
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Generate data
X, y = make_moons(n_samples=500, noise=0.2, random_state=42)
# Split and scale
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Try different K values
for k in [1, 3, 5, 10, 20]:
model = KNeighborsClassifier(n_neighbors=k)
model.fit(X_train, y_train)
score = model.score(X_test, y_test)
print(f"K={k}: Accuracy={score:.3f}")
From Scratch
class KNN:
def __init__(self, k=5):
self.k = k
def fit(self, X, y):
self.X_train = X
self.y_train = y
def predict(self, X):
predictions = [self._predict(x) for x in X]
return np.array(predictions)
def _predict(self, x):
# Calculate distances
distances = [np.sqrt(np.sum((x - x_train)**2))
for x_train in self.X_train]
# Get K nearest
k_indices = np.argsort(distances)[:self.k]
k_labels = self.y_train[k_indices]
# Majority vote
most_common = np.bincount(k_labels).argmax()
return most_common
Weighted KNN
Standard KNN: Equal votes from all K neighbors
Weighted KNN: Closer neighbors get more vote
weight_i = 1 / distance_i
Example:
Neighbor 1: distance=0.5, vote=Class A
Neighbor 2: distance=2.0, vote=Class B
Standard: A=1, B=1 (tie)
Weighted: A=2.0, B=0.5 → A wins
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=5, weights='distance')
Curse of Dimensionality
Problem: As dimensions increase, distances become meaningless
In 2D: Points are spread out
In 100D: All points are approximately equidistant
Distance between random points converges to:
d_max / d_min → 1 as dimensions → ∞
Solutions:
├─ Dimensionality reduction (PCA)
├─ Feature selection
├─ Use algorithms that handle high dimensions (trees, SVMs)
└─ Increase training data exponentially with dimensions
Key Takeaways
- KNN is a lazy learner — no training, all computation at prediction
- Scale your features — KNN is distance-based
- Choose K with cross-validation (usually 3-15)
- Weighted KNN often outperforms standard KNN
- KNN suffers from the curse of dimensionality
- Slow for large datasets — must compute all distances
- KNN is a great baseline and for small datasets
- Consider KD-tree or Ball tree for faster search