K-Nearest Neighbors — Complete Guide

ML FoundationsClassificationFree Lesson

Advertisement

K-Nearest Neighbors — Complete Guide

KNN is the simplest ML algorithm — it classifies a point by looking at its K closest neighbors.


How KNN Works

Algorithm:
1. Store all training data
2. For a new point:
   a. Calculate distance to ALL training points
   b. Find K closest points
   c. Majority vote → predicted class

Example (K=3):
New point: (2, 3)
Neighbors: (1,2)→Class A, (2,4)→Class B, (3,3)→Class A
Vote: A=2, B=1
Prediction: Class A

Distance Metrics

Euclidean Distance (most common):
d(x, y) = √(Σ(xᵢ - yᵢ)²)

Manhattan Distance:
d(x, y) = Σ|xᵢ - yᵢ|

Minkowski Distance (generalization):
d(x, y) = (Σ|xᵢ - yᵢ|ᵖ)^(1/p)
p=1: Manhattan, p=2: Euclidean

Cosine Similarity:
cos(θ) = (x·y) / (||x|| × ||y||)
Good for text classification

Choosing K

K=1: Very flexible, noisy, overfits
K=3-5: Usually good starting point
K=10+: Smoother boundaries, may underfit

Rules of thumb:
├─ Use odd K for binary classification (avoid ties)
├─ Start with K = √n (where n = training size)
├─ Use cross-validation to find optimal K
└─ Larger K = smoother decision boundary

K too small:          K too large:
Overfitting           Underfitting
Complex boundaries    Simple boundaries
Sensitive to noise    Misses local patterns

Python Implementation

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Generate data
X, y = make_moons(n_samples=500, noise=0.2, random_state=42)

# Split and scale
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Try different K values
for k in [1, 3, 5, 10, 20]:
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)
    score = model.score(X_test, y_test)
    print(f"K={k}: Accuracy={score:.3f}")

From Scratch

class KNN:
    def __init__(self, k=5):
        self.k = k

    def fit(self, X, y):
        self.X_train = X
        self.y_train = y

    def predict(self, X):
        predictions = [self._predict(x) for x in X]
        return np.array(predictions)

    def _predict(self, x):
        # Calculate distances
        distances = [np.sqrt(np.sum((x - x_train)**2))
                     for x_train in self.X_train]

        # Get K nearest
        k_indices = np.argsort(distances)[:self.k]
        k_labels = self.y_train[k_indices]

        # Majority vote
        most_common = np.bincount(k_labels).argmax()
        return most_common

Weighted KNN

Standard KNN: Equal votes from all K neighbors
Weighted KNN: Closer neighbors get more vote

weight_i = 1 / distance_i

Example:
Neighbor 1: distance=0.5, vote=Class A
Neighbor 2: distance=2.0, vote=Class B

Standard: A=1, B=1 (tie)
Weighted: A=2.0, B=0.5 → A wins

from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=5, weights='distance')

Curse of Dimensionality

Problem: As dimensions increase, distances become meaningless

In 2D: Points are spread out
In 100D: All points are approximately equidistant

Distance between random points converges to:
d_max / d_min → 1 as dimensions → ∞

Solutions:
├─ Dimensionality reduction (PCA)
├─ Feature selection
├─ Use algorithms that handle high dimensions (trees, SVMs)
└─ Increase training data exponentially with dimensions

Key Takeaways

  1. KNN is a lazy learner — no training, all computation at prediction
  2. Scale your features — KNN is distance-based
  3. Choose K with cross-validation (usually 3-15)
  4. Weighted KNN often outperforms standard KNN
  5. KNN suffers from the curse of dimensionality
  6. Slow for large datasets — must compute all distances
  7. KNN is a great baseline and for small datasets
  8. Consider KD-tree or Ball tree for faster search

Advertisement

Need Expert Machine Learning Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement