CW

Anomaly Detection: Isolation Forest, LOF and Autoencoders

Module 9: Unsupervised LearningFree Lesson

Advertisement

1. What Is Anomaly Detection?

Anomaly detection identifies data points that deviate significantly from the majority of the data. These outliers can indicate fraud, system faults, scientific discoveries, or data quality issues.

Types of Anomalous Points

Types of AnomaliesPoint AnomalySingle outlier farfrom the clustere.g. fraudulent transactionContextual AnomalyanomalyNormal in isolation,abnormal in contexte.g. summer jacket in winterCollective AnomalyA subsequence is anomalousas a groupe.g. network intrusion pattern

Anomaly Categories

TypeDefinitionExample
PointA single observation far from othersCredit card fraud
ContextualAbnormal given its context90°F in December
CollectiveA group of points anomalous togetherCoordinated bot attack
GlobalOutlier vs. entire datasetSensor malfunction reading
LocalOutlier only within its neighborhoodNormal in global sense but rare locally

2. Statistical Methods

Z-Score Method

The Z-score measures how many standard deviations a point is from the mean:

zi=xixˉsz_i = \frac{x_i - \bar{x}}{s}

A point is flagged as anomalous if zi>τ|z_i| > \tau (commonly τ=3\tau = 3).

import numpy as np

def zscore_anomalies(data, threshold=3):
    z = np.abs((data - data.mean()) / data.std())
    return z > threshold

Limitations: Assumes Gaussian distribution; sensitive to extreme values in xˉ\bar{x} and ss (masking effect).

Modified Z-Score (MAD)

Uses median absolute deviation for robustness:

MAD=median(ximedian(x))\text{MAD} = \text{median}(|x_i - \text{median}(x)|)
z~i=0.6745(ximedian(x))MAD\tilde{z}_i = \frac{0.6745 \cdot (x_i - \text{median}(x))}{\text{MAD}}

Points with z~i>3.5|\tilde{z}_i| > 3.5 are flagged.

IQR Method

The interquartile range defines fences:

Q1=25th percentile,Q3=75th percentileQ_1 = \text{25th percentile}, \quad Q_3 = \text{75th percentile}
IQR=Q3Q1IQR = Q_3 - Q_1
Lower fence=Q11.5IQR\text{Lower fence} = Q_1 - 1.5 \cdot IQR
Upper fence=Q3+1.5IQR\text{Upper fence} = Q_3 + 1.5 \cdot IQR
def iqr_anomalies(data):
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    return (data < lower) | (data > upper)

Grubbs' Test

Tests whether the most extreme value is an outlier under normality:

G=maxixixˉsG = \frac{\max_i |x_i - \bar{x}|}{s}

Reject H0H_0 if G>Gcritical(α,n)G > G_{\text{critical}}(\alpha, n), where:

Gcritical=n1ntα/(2n),n22n2+tα/(2n),n22G_{\text{critical}} = \frac{n-1}{\sqrt{n}} \sqrt{\frac{t_{\alpha/(2n),\,n-2}^2}{n - 2 + t_{\alpha/(2n),\,n-2}^2}}

3. Isolation Forest

Core Insight

Anomalies are few and different — they are easier to isolate than normal points. Isolation Forest explicitly isolates anomalies by random recursive partitioning.

Algorithm

  1. Build an Isolation Tree (iTree):

    • Randomly select a feature.
    • Randomly select a split value between the feature's min and max.
    • Recurse on left and right partitions until isolation or depth limit.
  2. Build the forest: Repeat step 1 for tt trees.

  3. Score each point:

s(x,n)=2E[h(x)]c(n)s(x, n) = 2^{-\frac{E[h(x)]}{c(n)}}

where E[h(x)]E[h(x)] is the average path length of xx across all trees, and:

c(n)=2H(n1)2(n1)nc(n) = 2H(n-1) - \frac{2(n-1)}{n}

is the average path length of unsuccessful search in a BST, with H(i)ln(i)+0.5772H(i) \approx \ln(i) + 0.5772 (Euler–Mascheroni constant).

  • s1s \to 1: highly anomalous
  • s0.5s \to 0.5: normal
Isolation Tree — Splitting ProcessX1 ∈ [0, 100]X2 ∈ [0, 80]X1 ∈ [50, 100]X3 ∈ [0, 40]ANOMALYPath = 2(isolated quickly)X2 ∈ [30, 70]X3 ∈ [60, 90]X1 ∈ [60, 85]NORMALAnomaly (short path → isolated fast)Normal (longer path → harder to isolate)

Implementation

from sklearn.ensemble import IsolationForest

clf = IsolationForest(n_estimators=300, contamination=0.05, random_state=42)
preds = clf.fit_predict(X)  # 1 = normal, -1 = anomaly
scores = clf.decision_function(X)  # lower = more anomalous

Key Hyperparameters

ParameterDescription
n_estimatorsNumber of trees (more = more stable)
max_samplesSubsample size per tree (min(256,n)\min(256, n))
contaminationExpected proportion of anomalies
max_featuresFeatures per tree (default 1.0)

4. Local Outlier Factor (LOF)

LOF detects local outliers by comparing the density of a point to its neighbors' densities.

Distance to k-th Nearest Neighbor

k-distance(A)=distance from A to its k-th nearest neighbor\text{k-distance}(A) = \text{distance from } A \text{ to its } k\text{-th nearest neighbor}

Reachability Distance

rdk(A,B)=max(k-distance(B),  d(A,B))\text{rd}_k(A, B) = \max(\text{k-distance}(B), \; d(A, B))

This smoothing prevents fluctuations from close neighbors.

Local Reachability Density

lrdk(A)=11Nk(A)BNk(A)rdk(A,B)\text{lrd}_k(A) = \frac{1}{\frac{1}{|N_k(A)|} \sum_{B \in N_k(A)} \text{rd}_k(A, B)}

LOF Score

LOFk(A)=1Nk(A)BNk(A)lrdk(B)lrdk(A)\text{LOF}_k(A) = \frac{\frac{1}{|N_k(A)|} \sum_{B \in N_k(A)} \text{lrd}_k(B)}{\text{lrd}_k(A)}
  • LOF1\text{LOF} \approx 1: point is in a region of similar density
  • LOF1\text{LOF} \gg 1: point is in a sparser region than neighbors → anomalous
LOF — Local Density Comparisonlrd ≈ HIGH(dense neighborhood)?lrd ≈ LOW(sparse neighborhood)LOF ≫ 1Normal: LOF ≈ 1 (similar local density to neighbors)Anomaly: LOF ≫ 1 (sparser than neighbors)
from sklearn.neighbors import LocalOutlierFactor

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
preds = lof.fit_predict(X)
scores = lof.negative_outlier_factor_  # more negative = more anomalous

5. DBSCAN for Anomaly Detection

DBSCAN labels points as core, border, or noise. Noise points are natural anomaly candidates.

Definitions

TermDefinition
ε-neighborhoodNε(p)={qd(p,q)ε}N_\varepsilon(p) = \{q \mid d(p,q) \leq \varepsilon\}
Core pointNε(p)MinPts|N_\varepsilon(p)| \geq \text{MinPts}
Border pointNot core, but within ε of a core point
Noise pointNeither core nor border → anomaly
DBSCAN Point ClassificationCore Points|N_ε(p)| ≥ MinPtsCore + BorderBorder: in ε of core, not core itselfNoise Points (Anomalies)Neither core nor borderAnomalies = DBSCAN noise label (-1). Adjust ε and MinPts to control sensitivity.
from sklearn.cluster import DBSCAN

db = DBSCAN(eps=0.5, min_samples=10)
labels = db.fit_predict(X)
anomalies = X[labels == -1]  # noise points

Parameter Selection

  • eps (ε): Use k-distance graph — sort distances to k-th neighbor and look for the "elbow"
  • MinPts: Rule of thumb: MinPtsd+1\text{MinPts} \geq d + 1 (where dd is feature dimensionality)

6. Autoencoder-Based Detection

Autoencoders learn a compressed representation of normal data. Anomalies produce high reconstruction error because the model has not learned to encode them.

Architecture

Autoencoder Architecture for Anomaly DetectionEncoderCompresses inputInput (d)Hidden₁Latent (k ≪ d)DecoderReconstructs inputHidden₂Output (d)ReconstructionErrorL = ||x - x̂||²Low error → NormalHigh error → Anomaly

Training Objective

L(θ)=1ni=1nxix^i2=1ni=1nxifθ(xi)2\mathcal{L}(\theta) = \frac{1}{n}\sum_{i=1}^{n} \|x_i - \hat{x}_i\|^2 = \frac{1}{n}\sum_{i=1}^{n} \|x_i - f_\theta(x_i)\|^2

Train only on normal data (or assume most data is normal). At inference:

anomaly score(x)=xx^2\text{anomaly score}(x) = \|x - \hat{x}\|^2

Variants

VariantKey Idea
Vanilla AEStandard reconstruction error
Variational AE (VAE)Use xx^2+DKL(q(zx)p(z))\|x - \hat{x}\|^2 + D_{KL}(q(z\|x) \| p(z))
Denoising AETrain to reconstruct from corrupted input
LSTM-AETemporal autoencoder for time-series anomalies

Implementation

import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim, latent_dim=8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 32),
            nn.ReLU(),
            nn.Linear(32, 64),
            nn.ReLU(),
            nn.Linear(64, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z)

# Train on normal data
model = Autoencoder(input_dim=X_train.shape[1])
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(100):
    x_hat = model(X_train)
    loss = nn.MSELoss()(x_hat, X_train)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Score test data
with torch.no_grad():
    x_hat = model(X_test)
    scores = ((X_test - x_hat) ** 2).mean(dim=1)
    threshold = scores.quantile(0.95)
    anomalies = X_test[scores > threshold]

7. Evaluation Metrics

Anomaly detection is inherently imbalanced — standard accuracy is misleading.

Confusion Matrix for Anomalies

Predicted NormalPredicted Anomaly
Actual NormalTNFP
Actual AnomalyFNTP

Key Metrics

Precision and Recall:

Precision=TPTP+FPRecall=TPTP+FN\text{Precision} = \frac{TP}{TP + FP} \quad \text{Recall} = \frac{TP}{TP + FN}

F1 Score:

F1=2PrecisionRecallPrecision+RecallF_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}

AUROC (Area Under ROC Curve): Threshold-independent metric; ranks anomalous points higher than normal points.

AUROC=P(score(x+)>score(x))\text{AUROC} = P(\text{score}(x^+) > \text{score}(x^-))

where x+x^+ is a random anomaly and xx^- is a random normal point.

AUPRC (Area Under Precision-Recall Curve): More informative than AUROC when anomalies are extremely rare.

Metrics That Don't Require Labels

MetricDescription
Local Outlier FactorAvg LOF score of flagged points
Silhouette ScoreSeparation of anomaly vs. normal clusters
Mass-basedFraction of total mass assigned to anomalies
from sklearn.metrics import precision_recall_fscore_support, roc_auc_score

precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average='binary')
auroc = roc_auc_score(y_true, anomaly_scores)
print(f"Precision: {precision:.3f}, Recall: {recall:.3f}, F1: {f1:.3f}, AUROC: {auroc:.3f}")

8. Complete Python Implementation

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, roc_auc_score

# Generate synthetic data with anomalies
np.random.seed(42)
X_normal, y_normal = make_blobs(n_samples=500, centers=2, cluster_std=1.0, random_state=42)
X_anomaly = np.random.uniform(low=-8, high=8, size=(30, 2))
X = np.vstack([X_normal, X_anomaly])
y_true = np.array([0]*500 + [1]*30)

# Scale
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# --- Isolation Forest ---
iso = IsolationForest(n_estimators=200, contamination=0.06, random_state=42)
iso_preds = iso.fit_predict(X_scaled)
iso_preds_binary = (iso_preds == -1).astype(int)

# --- LOF ---
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.06)
lof_preds = lof.fit_predict(X_scaled)
lof_preds_binary = (lof_preds == -1).astype(int)

# --- DBSCAN ---
db = DBSCAN(eps=0.5, min_samples=10)
db_labels = db.fit_predict(X_scaled)
db_preds_binary = (db_labels == -1).astype(int)

# --- Evaluate ---
for name, preds in [("Isolation Forest", iso_preds_binary),
                     ("LOF", lof_preds_binary),
                     ("DBSCAN", db_preds_binary)]:
    print(f"\n=== {name} ===")
    print(classification_report(y_true, preds, target_names=["Normal", "Anomaly"]))
    try:
        print(f"AUROC: {roc_auc_score(y_true, preds):.3f}")
    except ValueError:
        print("AUROC: N/A (no positive predictions)")

# --- Visualization ---
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
titles = ["Isolation Forest", "LOF", "DBSCAN"]
predictions = [iso_preds_binary, lof_preds_binary, db_preds_binary]

for ax, title, preds in zip(axes, titles, predictions):
    ax.scatter(X[preds==0, 0], X[preds==0, 1], c='steelblue', s=10, label='Normal', alpha=0.6)
    ax.scatter(X[preds==1, 0], X[preds==1, 1], c='red', s=30, marker='x', label='Anomaly')
    ax.set_title(title)
    ax.legend()
    ax.set_xlim(-10, 10)
    ax.set_ylim(-10, 10)

plt.tight_layout()
plt.savefig("anomaly_comparison.png", dpi=150)
plt.show()

Summary

MethodStrengthsWeaknessesBest For
Z-Score / IQRSimple, interpretableGaussian assumption1-D, univariate data
Isolation ForestScalable, no distance computationRandom splits reduce precisionHigh-dimensional data
LOFCaptures local structureO(n2)O(n^2) without indexingVarying-density clusters
DBSCANNo distribution assumptionSensitive to ε, MinPtsSpatial data, known density
AutoencoderNon-linear, powerfulNeeds training data, tuningComplex high-dim, images, sequences

Key takeaways:

  • No single method dominates — ensemble multiple detectors for robustness
  • Feature engineering (domain-specific features) often matters more than algorithm choice
  • Threshold selection is critical — use precision-recall tradeoffs aligned with business costs
  • For time-series, use temporal models (LSTM-AE, SR) rather than static methods

Advertisement

Need Expert Data Science Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement