Anomaly Detection: Isolation Forest, LOF and Autoencoders

1. What Is Anomaly Detection?

Anomaly detection identifies data points that deviate significantly from the majority of the data. These outliers can indicate fraud, system faults, scientific discoveries, or data quality issues.

Types of Anomalous Points

Anomaly Categories

Type	Definition	Example
Point	A single observation far from others	Credit card fraud
Contextual	Abnormal given its context	90°F in December
Collective	A group of points anomalous together	Coordinated bot attack
Global	Outlier vs. entire dataset	Sensor malfunction reading
Local	Outlier only within its neighborhood	Normal in global sense but rare locally

2. Statistical Methods

Z-Score Method

The Z-score measures how many standard deviations a point is from the mean:

$$z_i = \frac{x_i - \mu}{\sigma}$$

A point is flagged as anomalous if $|z_i| > \tau$ (commonly $\tau = 3$).

import numpy as np

def zscore_anomalies(data, threshold=3):
    z = np.abs((data - data.mean()) / data.std())
    return z > threshold

Limitations: Assumes a Gaussian distribution; extreme values inflate $\mu$ and $\sigma$ themselves, so outliers can hide one another (masking effect).

Modified Z-Score (MAD)

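Uses the median absolute deviation (MAD) for robustness:

$$M_i = \frac{0.6745\,(x_i - \tilde{x})}{\text{MAD}}, \qquad \text{MAD} = \text{median}\left(|x_i - \tilde{x}|\right)$$

where $\tilde{x}$ is the sample median. Points with $|M_i| > 3.5$ are flagged.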
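The MAD-based score can be sketched in a few lines. This is a minimal illustration that assumes the conventional 0.6745 scaling constant and 3.5 cutoff; the function name `modified_zscore_anomalies` and the sample readings are made up for the example.

```python
import numpy as np

def modified_zscore_anomalies(data, threshold=3.5):
    """Flag points whose MAD-based modified Z-score exceeds the threshold."""
    data = np.asarray(data, dtype=float)
    median = np.median(data)
    mad = np.median(np.abs(data - median))
    if mad == 0:
        # Degenerate case (more than half the values are identical):
        # the score is undefined, so nothing is flagged in this sketch.
        return np.zeros(data.shape, dtype=bool)
    m = 0.6745 * (data - median) / mad
    return np.abs(m) > threshold

# Illustrative sample: nine ordinary readings and one extreme value
readings = np.array([10, 11, 9, 10, 12, 10, 11, 9, 10, 1000], dtype=float)

# Plain Z-score with the usual cutoff of 3 (population standard deviation)
z = np.abs((readings - readings.mean()) / readings.std())
plain_flags = z > 3

# Robust alternative based on the median and MAD
robust_flags = modified_zscore_anomalies(readings)
```

In this sample the extreme reading inflates the standard deviation so much that its own Z-score stays just below 3 (with ten points and the population standard deviation, $|z|$ cannot exceed $\sqrt{9} = 3$), so `plain_flags` is all `False`, while `robust_flags` marks only the last reading.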

IQR Method

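The interquartile range defines fences:

$$\text{IQR} = Q_3 - Q_1, \qquad \text{lower} = Q_1 - 1.5 \cdot \text{IQR}, \qquad \text{upper} = Q_3 + 1.5 \cdot \text{IQR}$$

Points outside $[\text{lower}, \text{upper}]$ are flagged.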

def iqr_anomalies(data):
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    return (data < lower) | (data > upper)

Grubbs' Test

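Tests whether the most extreme value is an outlier under normality:

$$G = \frac{\max_i |x_i - \bar{x}|}{s}$$

Reject $H_0$ (no outliers) if $G > G_{\text{crit}}$, where:

$$G_{\text{crit}} = \frac{N - 1}{\sqrt{N}} \sqrt{\frac{t^2_{\alpha/(2N),\,N-2}}{N - 2 + t^2_{\alpha/(2N),\,N-2}}}$$

Here $\bar{x}$ and $s$ are the sample mean and standard deviation, $N$ is the sample size, and $t_{\alpha/(2N),\,N-2}$ is the upper critical value of the $t$-distribution with $N-2$ degrees of freedom.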
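A two-sided Grubbs' test can be sketched as follows. This assumes SciPy is available for the $t$-distribution quantile; the function name `grubbs_test`, the significance level of 0.05, and the sample values are illustrative choices rather than part of the method's definition.

```python
import numpy as np
from scipy import stats

def grubbs_test(data, alpha=0.05):
    """Two-sided Grubbs' test for a single outlier (assumes normal data).

    Returns the statistic G, the critical value, the index of the most
    extreme point, and whether that point is declared an outlier.
    """
    x = np.asarray(data, dtype=float)
    n = x.size
    mean = x.mean()
    sd = x.std(ddof=1)                      # sample standard deviation
    idx = int(np.argmax(np.abs(x - mean)))  # most extreme observation
    g = abs(x[idx] - mean) / sd
    # Upper critical value of the t-distribution at level alpha / (2N)
    t = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    g_crit = (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))
    return g, g_crit, idx, bool(g > g_crit)

sample = [10, 11, 9, 10, 12, 10, 11, 9, 10, 25]
g, g_crit, idx, is_outlier = grubbs_test(sample)
print(f"G = {g:.3f}, critical = {g_crit:.3f}, outlier at index {idx}: {is_outlier}")
```

The test checks only one point per run. Applying it repeatedly after removing detected points is a common extension, but it inherits the normality assumption at every step.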

3. Isolation Forest

Core Insight

Anomalies are few and different – they are easier to isolate than normal points. Isolation Forest explicitly isolates anomalies by random recursive partitioning.

Algorithm

1. Build an Isolation Tree (iTree):
- Randomly select a feature.
- Randomly select a split value between the feature's min and max.
- Recurse on left and right partitions until isolation or depth limit.
2. Build the forest: Repeat step 1 for $t$ trees.
3. Score each point:

$$s(x, n) = 2^{-\frac{E[h(x)]}{c(n)}}$$

where $E[h(x)]$ is the average path length of $x$ across all trees, and:

$$c(n) = 2H(n-1) - \frac{2(n-1)}{n}$$

is the average path length of unsuccessful search in a BST, with $H(i) \approx \ln(i) + \gamma$ and $\gamma \approx 0.5772$ (Euler–Mascheroni constant).

$s \to 1$: highly anomalous
$s < 0.5$: normal
[Figure: Isolation Tree splitting process. The anomaly is isolated quickly (path length 2), while a normal point needs a longer path and is harder to isolate.]

Implementation

from sklearn.ensemble import IsolationForest

clf = IsolationForest(n_estimators=300, contamination=0.05, random_state=42)
preds = clf.fit_predict(X)  # 1 = normal, -1 = anomaly
scores = clf.decision_function(X)  # lower = more anomalous

Key Hyperparameters

Parameter	Description
`n_estimators`	Number of trees (more = more stable)
`max_samples`	Subsample size per tree (default `'auto'`, i.e. $\min(256, n)$)
`contamination`	Expected proportion of anomalies
`max_features`	Features per tree (default 1.0)

4. Local Outlier Factor (LOF)

LOF detects local outliers by comparing the density of a point to its neighbors' densities.

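Distance to k-th Nearest Neighbor

$$k\text{-distance}(p) = d(p, o_k)$$

where $o_k$ is the $k$-th nearest neighbor of $p$, and $N_k(p)$ denotes the set of points within $k\text{-distance}(p)$ of $p$.

Reachability Distance

$$\text{reach-dist}_k(p, o) = \max\{k\text{-distance}(o),\; d(p, o)\}$$

This smoothing reduces statistical fluctuations in $d(p, o)$ for points that lie close to $o$.

Local Reachability Density

$$\text{lrd}_k(p) = \left( \frac{1}{|N_k(p)|} \sum_{o \in N_k(p)} \text{reach-dist}_k(p, o) \right)^{-1}$$

LOF Score

$$\text{LOF}_k(p) = \frac{1}{|N_k(p)|} \sum_{o \in N_k(p)} \frac{\text{lrd}_k(o)}{\text{lrd}_k(p)}$$

$\text{LOF} \approx 1$: point is in a region of similar density
$\text{LOF} \gg 1$: point is in a sparser region than neighbors → anomalous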
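The four quantities above (k-distance, reachability distance, local reachability density, LOF) can be computed directly with NumPy. The sketch below is a brute-force illustration with a hypothetical function name `lof_scores`; it assumes Euclidean distance, takes exactly `k` neighbors per point (ignoring ties at the k-th distance), and assumes no duplicate points, which would make a density infinite.

```python
import numpy as np

def lof_scores(X, k=5):
    """Brute-force LOF: values near 1 are inliers, values well above 1 are outliers."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    # Pairwise Euclidean distances; the diagonal is set to infinity so that
    # a point is never counted as its own neighbor
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)

    neighbors = np.argsort(dist, axis=1)[:, :k]        # k nearest neighbors
    rows = np.arange(n)
    k_distance = dist[rows, neighbors[:, -1]]          # distance to k-th neighbor

    # reach-dist_k(p, o) = max(k-distance(o), d(p, o)) for each neighbor o of p
    reach = np.maximum(k_distance[neighbors], dist[rows[:, None], neighbors])

    lrd = 1.0 / reach.mean(axis=1)                     # local reachability density
    return lrd[neighbors].mean(axis=1) / lrd           # mean neighbor lrd / own lrd

# One isolated point appended to a Gaussian cluster
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(size=(50, 2)), [[8.0, 8.0]]])
scores = lof_scores(X_demo, k=5)
print(f"isolated point: {scores[-1]:.1f}, cluster median: {np.median(scores[:-1]):.2f}")
```

The pairwise distance matrix makes this quadratic in memory and time, so it is only suitable for small datasets.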

from sklearn.neighbors import LocalOutlierFactor

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
preds = lof.fit_predict(X)
scores = lof.negative_outlier_factor_  # more negative = more anomalous

5. DBSCAN for Anomaly Detection

DBSCAN labels points as core, border, or noise. Noise points are natural anomaly candidates.

Definitions

Term	Definition
ε-neighborhood	$N_\varepsilon(p) = \{q : d(p, q) \leq \varepsilon\}$
Core point	$|N_\varepsilon(p)| \geq \text{MinPts}$
Border point	Not core, but within ε of a core point
Noise point	Neither core nor border → anomaly
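The three point types can be derived directly from pairwise distances. The sketch below is a brute-force illustration with a hypothetical function name `classify_points`; it assumes Euclidean distance and counts a point as a member of its own ε-neighborhood, which matches how scikit-learn interprets `min_samples`.

```python
import numpy as np

def classify_points(X, eps, min_pts):
    """Label each point as 'core', 'border', or 'noise' for given eps and MinPts."""
    X = np.asarray(X, dtype=float)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    within = dist <= eps                      # eps-neighborhood membership (includes self)
    core = within.sum(axis=1) >= min_pts      # enough neighbors -> core point
    # Border: not core, but inside the eps-neighborhood of at least one core point
    border = ~core & (within & core[None, :]).any(axis=1)
    return np.where(core, "core", np.where(border, "border", "noise"))

points = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],  # dense group
          [0.35, 0.1],                                     # edge of the group
          [5.0, 5.0]]                                      # isolated
labels = classify_points(points, eps=0.3, min_pts=4)
print(labels.tolist())
```

Here the four tightly packed points are core, the point at the edge has too few neighbors to be core but lies within ε of a core point, and the isolated point is noise.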

from sklearn.cluster import DBSCAN

db = DBSCAN(eps=0.5, min_samples=10)
labels = db.fit_predict(X)
anomalies = X[labels == -1]  # noise points

Parameter Selection

eps (ε): Use k-distance graph – sort distances to k-th neighbor and look for the "elbow"
MinPts: Rule of thumb: $\text{MinPts} \geq d + 1$, with $2d$ a common choice (where $d$ is feature dimensionality)
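The k-distance graph can be computed as follows. The helper names `k_distance_curve` and `elbow_eps` are hypothetical, and the elbow rule used here (the point on the normalized curve farthest below the straight line joining its endpoints) is only one simple heuristic, assumed for illustration; visual inspection of the plotted curve is the more common practice.

```python
import numpy as np

def k_distance_curve(X, k):
    """Sorted distances from every point to its k-th nearest neighbor."""
    X = np.asarray(X, dtype=float)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    dist.sort(axis=1)            # column 0 is each point's distance to itself (0)
    return np.sort(dist[:, k])   # k-th neighbor distance, ascending

def elbow_eps(curve):
    """Crude elbow estimate: largest gap between the chord and the curve."""
    n = len(curve)
    x = np.linspace(0.0, 1.0, n)
    span = curve[-1] - curve[0]
    y = (curve - curve[0]) / span if span > 0 else np.zeros(n)
    # A flat-then-steep curve lies below the chord y = x; the elbow is
    # taken where it is farthest below
    return float(curve[np.argmax(x - y)])

# Two tight clusters plus a few scattered points (synthetic data)
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(loc=(-2, -2), scale=0.3, size=(100, 2)),
    rng.normal(loc=(2, 2), scale=0.3, size=(100, 2)),
    rng.uniform(-6, 6, size=(5, 2)),
])
curve = k_distance_curve(X_demo, k=4)
eps_guess = elbow_eps(curve)
print(f"suggested eps: {eps_guess:.3f}")
```

Plotting `curve` against its index gives the usual k-distance graph; the suggested value is a starting point for `eps`, not a substitute for checking the resulting clusters.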

6. Autoencoder-Based Detection

Autoencoders learn a compressed representation of normal data. Anomalies produce high reconstruction error because the model has not learned to encode them.

Architecture

[Figure: Autoencoder architecture for anomaly detection. The encoder compresses the input: Input (d) → Hidden₁ → Latent (k ≪ d). The decoder reconstructs the input: Latent → Hidden₂ → Output (d). Reconstruction error L = ‖x − x̂‖²: low error → normal, high error → anomaly.]

Training Objective

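Train only on normal data (or assume most data is normal), minimizing the reconstruction loss $L = \|x - \hat{x}\|^2$. At inference, the reconstruction error serves as the anomaly score:

$$\text{score}(x) = \|x - \hat{x}\|^2, \qquad x \text{ is anomalous if } \text{score}(x) > \tau$$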

Variants

Variant	Key Idea
Vanilla AE	Standard reconstruction error
Variational AE (VAE)	Use a probabilistic score, such as the negative ELBO or reconstruction probability
Denoising AE	Train to reconstruct from corrupted input
LSTM-AE	Temporal autoencoder for time-series anomalies

Implementation

import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim, latent_dim=8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 32),
            nn.ReLU(),
            nn.Linear(32, 64),
            nn.ReLU(),
            nn.Linear(64, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z)

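# Train on normal data (X_train and X_test are assumed to be float32 tensors
# of shape [n_samples, input_dim], already scaled)
model = Autoencoder(input_dim=X_train.shape[1])
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

model.train()
for epoch in range(100):
    x_hat = model(X_train)            # full-batch forward pass
    loss = criterion(x_hat, X_train)  # mean squared reconstruction error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Score test data by per-sample reconstruction error
model.eval()
with torch.no_grad():
    x_hat = model(X_test)
    scores = ((X_test - x_hat) ** 2).mean(dim=1)
    # This flags the top 5% of test points by construction; with a held-out
    # set of normal data, the threshold could come from its errors instead
    threshold = scores.quantile(0.95)
    anomalies = X_test[scores > threshold]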

7. Evaluation Metrics

Anomaly detection is inherently imbalanced – standard accuracy is misleading.

Confusion Matrix for Anomalies

	Predicted Normal	Predicted Anomaly
Actual Normal	TN	FP
Actual Anomaly	FN	TP

Key Metrics

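Precision and Recall:

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}$$

F1 Score:

$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$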
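A small worked example shows why accuracy misleads on imbalanced data. The confusion-matrix counts below are invented for illustration (1000 points, 10 of them anomalies), and `detection_metrics` is a hypothetical helper name.

```python
def detection_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical detector: flags 5 points, 2 of which are true anomalies,
# and misses the other 8 anomalies
m = detection_metrics(tp=2, fp=3, fn=8, tn=987)
print(m)
```

Accuracy comes out at 0.989 even though the detector finds only 2 of the 10 anomalies (recall 0.2, precision 0.4, F1 about 0.27).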

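AUROC (Area Under ROC Curve): Threshold-independent metric; the probability that a randomly chosen anomaly is scored higher than a randomly chosen normal point:

$$\text{AUROC} = P\left(s(x^+) > s(x^-)\right)$$

where $x^+$ is a random anomaly and $x^-$ is a random normal point.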
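The probabilistic reading of AUROC can be checked by brute force: compare every anomaly score with every normal score and count how often the anomaly ranks higher. The function name `auroc_pairwise` is hypothetical, ties are counted as half a win (the usual convention), and higher scores are assumed to mean "more anomalous".

```python
import numpy as np

def auroc_pairwise(y_true, scores):
    """AUROC as the fraction of (anomaly, normal) pairs ranked correctly."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]            # scores of anomalies
    neg = scores[y_true == 0]            # scores of normal points
    diff = pos[:, None] - neg[None, :]   # every anomaly vs. every normal point
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())

y_demo = [0, 0, 1, 1]
s_demo = [0.1, 0.4, 0.35, 0.8]
auc = auroc_pairwise(y_demo, s_demo)
print(auc)  # 3 of the 4 pairs are ordered correctly -> 0.75
```

This pairwise form is quadratic in the number of points, so it is useful for understanding the metric rather than for large datasets.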

AUPRC (Area Under Precision-Recall Curve): More informative than AUROC when anomalies are extremely rare.

Metrics That Don't Require Labels

Metric	Description
Local Outlier Factor	Avg LOF score of flagged points
Silhouette Score	Separation of anomaly vs. normal clusters
Mass-based	Fraction of total mass assigned to anomalies

from sklearn.metrics import precision_recall_fscore_support, roc_auc_score

precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average='binary')
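# roc_auc_score expects higher scores for the positive (anomaly) class,
# e.g. anomaly_scores = -clf.decision_function(X) for IsolationForest
auroc = roc_auc_score(y_true, anomaly_scores)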
print(f"Precision: {precision:.3f}, Recall: {recall:.3f}, F1: {f1:.3f}, AUROC: {auroc:.3f}")

8. Complete Python Implementation

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, roc_auc_score

# Generate synthetic data with anomalies
np.random.seed(42)
X_normal, y_normal = make_blobs(n_samples=500, centers=2, cluster_std=1.0, random_state=42)
X_anomaly = np.random.uniform(low=-8, high=8, size=(30, 2))
X = np.vstack([X_normal, X_anomaly])
y_true = np.array([0]*500 + [1]*30)

# Scale
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# --- Isolation Forest ---
iso = IsolationForest(n_estimators=200, contamination=0.06, random_state=42)
iso_preds = iso.fit_predict(X_scaled)
iso_preds_binary = (iso_preds == -1).astype(int)
iso_scores = -iso.decision_function(X_scaled)  # negated: higher = more anomalous

# --- LOF ---
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.06)
lof_preds = lof.fit_predict(X_scaled)
lof_preds_binary = (lof_preds == -1).astype(int)
lof_scores = -lof.negative_outlier_factor_  # negated: higher = more anomalous

# --- DBSCAN ---
db = DBSCAN(eps=0.5, min_samples=10)
db_labels = db.fit_predict(X_scaled)
db_preds_binary = (db_labels == -1).astype(int)  # noise points; no continuous score

# --- Evaluate ---
# AUROC uses continuous scores where available; DBSCAN only has hard labels
for name, preds, scores in [("Isolation Forest", iso_preds_binary, iso_scores),
                            ("LOF", lof_preds_binary, lof_scores),
                            ("DBSCAN", db_preds_binary, db_preds_binary)]:
    print(f"\n=== {name} ===")
    print(classification_report(y_true, preds, target_names=["Normal", "Anomaly"],
                                zero_division=0))
    print(f"AUROC: {roc_auc_score(y_true, scores):.3f}")

# --- Visualization ---
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
titles = ["Isolation Forest", "LOF", "DBSCAN"]
predictions = [iso_preds_binary, lof_preds_binary, db_preds_binary]

for ax, title, preds in zip(axes, titles, predictions):
    ax.scatter(X[preds==0, 0], X[preds==0, 1], c='steelblue', s=10, label='Normal', alpha=0.6)
    ax.scatter(X[preds==1, 0], X[preds==1, 1], c='red', s=30, marker='x', label='Anomaly')
    ax.set_title(title)
    ax.legend()
    ax.set_xlim(-10, 10)
    ax.set_ylim(-10, 10)

plt.tight_layout()
plt.savefig("anomaly_comparison.png", dpi=150)
plt.show()

Summary

Method	Strengths	Weaknesses	Best For
Z-Score / IQR	Simple, interpretable	Gaussian assumption	1-D, univariate data
Isolation Forest	Scalable, no distance computation	Random splits reduce precision	High-dimensional data
LOF	Captures local structure	$O(n^2)$ without indexing	Varying-density clusters
DBSCAN	No distribution assumption	Sensitive to ε, MinPts	Spatial data, known density
Autoencoder	Non-linear, powerful	Needs training data, tuning	Complex high-dim, images, sequences

Key takeaways:

No single method dominates – ensemble multiple detectors for robustness
Feature engineering (domain-specific features) often matters more than algorithm choice
Threshold selection is critical – use precision-recall tradeoffs aligned with business costs
For time-series, use temporal models (LSTM-AE, Spectral Residual) rather than static methods
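Cost-aligned threshold selection can be sketched as a simple search over candidate thresholds. The function name `best_threshold`, the cost values, and the toy labels and scores are all assumptions made for illustration; in practice the costs come from the application (for example, the cost of a missed fraud case versus a manual review).

```python
import numpy as np

def best_threshold(y_true, scores, cost_fp=1.0, cost_fn=10.0):
    """Pick the score threshold that minimizes total misclassification cost.

    Points with score >= threshold are flagged as anomalies.
    """
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    best_t, best_cost = None, np.inf
    for t in np.unique(scores):                 # candidate thresholds
        flagged = scores >= t
        fp = np.sum(flagged & (y_true == 0))    # false alarms
        fn = np.sum(~flagged & (y_true == 1))   # missed anomalies
        cost = cost_fp * fp + cost_fn * fn
        if cost < best_cost:
            best_t, best_cost = float(t), float(cost)
    return best_t, best_cost

y_demo = [0, 0, 0, 1, 0, 1]
s_demo = [0.1, 0.2, 0.3, 0.6, 0.7, 0.9]

# Missing an anomaly is 10x worse than a false alarm: lower threshold
t_recall, cost_recall = best_threshold(y_demo, s_demo, cost_fp=1, cost_fn=10)
# False alarms are 10x worse than a miss: higher threshold
t_precise, cost_precise = best_threshold(y_demo, s_demo, cost_fp=10, cost_fn=1)
print(t_recall, t_precise)
```

With the same scores, swapping the cost ratio moves the chosen threshold from 0.6 to 0.9, which is the precision-recall tradeoff expressed in cost terms.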
