1. What Is Anomaly Detection?
Anomaly detection identifies data points that deviate significantly from the majority of the data. These outliers can indicate fraud, system faults, scientific discoveries, or data quality issues.
Types of Anomalous Points
Anomaly Categories
| Type | Definition | Example |
|---|---|---|
| Point | A single observation far from others | Credit card fraud |
| Contextual | Abnormal given its context | 90°F in December |
| Collective | A group of points anomalous together | Coordinated bot attack |
| Global | Outlier vs. entire dataset | Sensor malfunction reading |
| Local | Outlier only within its neighborhood | Normal in global sense but rare locally |
2. Statistical Methods
Z-Score Method
The Z-score measures how many standard deviations a point is from the mean:
$$z_i = \frac{x_i - \mu}{\sigma}$$
A point is flagged as anomalous if $|z_i| > \tau$ (commonly $\tau = 3$).
import numpy as np
def zscore_anomalies(data, threshold=3):
    z = np.abs((data - data.mean()) / data.std())
    return z > threshold
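Limitations: Assumes an approximately Gaussian distribution; $\mu$ and $\sigma$ are themselves pulled by extreme values, so a large outlier can inflate $\sigma$ enough to hide itself or other outliers (masking effect).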
Modified Z-Score (MAD)
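Uses the median absolute deviation (MAD) for robustness:
$$M_i = \frac{0.6745\,(x_i - \tilde{x})}{\mathrm{MAD}}, \qquad \mathrm{MAD} = \operatorname{median}_i\,|x_i - \tilde{x}|$$
where $\tilde{x}$ is the sample median. Points with $|M_i| > 3.5$ are flagged.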
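A minimal sketch of the MAD-based rule, assuming the conventional scaling constant 0.6745 and cutoff 3.5. The function name `modified_zscore_anomalies` and the sample values are illustrative assumptions, not part of the original text.

```python
import numpy as np

def modified_zscore_anomalies(data, threshold=3.5):
    """Flag points whose modified Z-score exceeds `threshold` in absolute value."""
    data = np.asarray(data, dtype=float)
    median = np.median(data)
    mad = np.median(np.abs(data - median))
    if mad == 0:
        # More than half the values equal the median, so the score is undefined.
        return np.zeros(data.shape, dtype=bool)
    m = 0.6745 * (data - median) / mad
    return np.abs(m) > threshold

# Illustrative sample: six similar readings and one extreme value.
sample = np.array([10.0, 10.2, 9.9, 10.1, 9.8, 10.0, 25.0])
flags = modified_zscore_anomalies(sample)

# Plain Z-score on the same sample, for comparison.
plain_z = np.abs((sample - sample.mean()) / sample.std())
```

On this sample the plain Z-score of 25.0 stays below 3, because the outlier inflates the standard deviation it is measured against, while the median-based score flags it.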
IQR Method
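The interquartile range $\mathrm{IQR} = Q_3 - Q_1$ defines fences:
$$\text{lower} = Q_1 - 1.5\,\mathrm{IQR}, \qquad \text{upper} = Q_3 + 1.5\,\mathrm{IQR}$$
Points outside $[\text{lower}, \text{upper}]$ are flagged.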
def iqr_anomalies(data):
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    return (data < lower) | (data > upper)
Grubbs' Test
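Tests whether the most extreme value is an outlier under normality:
$$G = \frac{\max_i |x_i - \bar{x}|}{s}$$
Reject $H_0$ (no outliers) at significance level $\alpha$ if $G > G_{\text{crit}}$, where:
$$G_{\text{crit}} = \frac{N-1}{\sqrt{N}} \sqrt{\frac{t^2_{\alpha/(2N),\,N-2}}{N - 2 + t^2_{\alpha/(2N),\,N-2}}}$$
Here $\bar{x}$ and $s$ are the sample mean and standard deviation, $N$ is the sample size, and $t_{\alpha/(2N),\,N-2}$ is the upper critical value of the $t$-distribution with $N-2$ degrees of freedom.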
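One way to sketch the two-sided Grubbs' test is shown below, assuming SciPy is available for the $t$-distribution quantile. The function name `grubbs_test`, the default `alpha=0.05`, and the sample values are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

def grubbs_test(data, alpha=0.05):
    """Two-sided Grubbs' test for a single outlier.

    Returns (index of most extreme point, G statistic, critical value, reject flag).
    """
    data = np.asarray(data, dtype=float)
    n = data.size
    mean = data.mean()
    sd = data.std(ddof=1)                      # sample standard deviation
    idx = int(np.argmax(np.abs(data - mean)))  # most extreme observation
    g = abs(data[idx] - mean) / sd
    # Upper critical value of the t-distribution at level alpha / (2N), N - 2 d.o.f.
    t_crit = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    g_crit = (n - 1) / np.sqrt(n) * np.sqrt(t_crit**2 / (n - 2 + t_crit**2))
    return idx, float(g), float(g_crit), bool(g > g_crit)

# Illustrative sample with one extreme value at position 6.
sample = np.array([10.0, 10.2, 9.9, 10.1, 9.8, 10.0, 25.0])
idx, g, g_crit, is_outlier = grubbs_test(sample)
```

The test checks only the single most extreme value. Applying it repeatedly after removing a flagged point is a common extension, but each pass still relies on the normality assumption.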
3. Isolation Forest
Core Insight
Anomalies are few and different — they are easier to isolate than normal points. Isolation Forest explicitly isolates anomalies by random recursive partitioning.
Algorithm
1. Build an Isolation Tree (iTree):
   - Randomly select a feature.
   - Randomly select a split value between the feature's min and max.
   - Recurse on left and right partitions until isolation or depth limit.
2. Build the forest: Repeat step 1 for $T$ trees.
3. Score each point:
   $$s(x, n) = 2^{-E[h(x)] / c(n)}$$
   where $E[h(x)]$ is the average path length of $x$ across all trees, $n$ is the number of points used to build each tree, and:
   $$c(n) = 2H(n-1) - \frac{2(n-1)}{n}$$
   is the average path length of unsuccessful search in a BST, with $H(i) \approx \ln(i) + \gamma$ and $\gamma \approx 0.5772$ (Euler–Mascheroni constant).
- $s \to 1$: highly anomalous
- $s < 0.5$: normal
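The scoring step can be sketched in a few lines, assuming the average path length has already been computed by a forest. The function names `c` and `anomaly_score` and the example path lengths are illustrative assumptions.

```python
import math

EULER_GAMMA = 0.5772156649

def c(n):
    """Average path length of an unsuccessful BST search over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA   # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(avg_path_length, n):
    """Isolation Forest score s = 2 ** (-E[h(x)] / c(n))."""
    return 2.0 ** (-avg_path_length / c(n))

n = 256                                  # subsample size per tree (assumed)
short_path = anomaly_score(2.0, n)       # isolated after about 2 splits
typical_path = anomaly_score(c(n), n)    # path length equal to the expected value
long_path = anomaly_score(20.0, n)       # buried deep in the tree
```

A point whose average path length equals $c(n)$ scores exactly 0.5; shorter paths push the score toward 1 and longer paths push it toward 0.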
Implementation
from sklearn.ensemble import IsolationForest
clf = IsolationForest(n_estimators=300, contamination=0.05, random_state=42)
preds = clf.fit_predict(X) # 1 = normal, -1 = anomaly
scores = clf.decision_function(X) # lower = more anomalous
Key Hyperparameters
| Parameter | Description |
|---|---|
| n_estimators | Number of trees (more = more stable) |
| max_samples | Subsample size per tree ($\psi$; scikit-learn's default 'auto' uses min(256, n_samples)) |
| contamination | Expected proportion of anomalies |
| max_features | Features per tree (default 1.0) |
4. Local Outlier Factor (LOF)
LOF detects local outliers by comparing the density of a point to its neighbors' densities.
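Distance to k-th Nearest Neighbor
$$d_k(p) = d(p, o_k), \quad \text{where } o_k \text{ is the } k\text{-th nearest neighbor of } p$$
Reachability Distance
$$\text{reach-dist}_k(p, o) = \max\{d_k(o),\; d(p, o)\}$$
This smoothing damps statistical fluctuations in $d(p, o)$ when $p$ is very close to $o$.
Local Reachability Density
$$\mathrm{lrd}_k(p) = \left( \frac{1}{|N_k(p)|} \sum_{o \in N_k(p)} \text{reach-dist}_k(p, o) \right)^{-1}$$
where $N_k(p)$ is the set of $k$ nearest neighbors of $p$.
LOF Score
$$\mathrm{LOF}_k(p) = \frac{1}{|N_k(p)|} \sum_{o \in N_k(p)} \frac{\mathrm{lrd}_k(o)}{\mathrm{lrd}_k(p)}$$
- $\mathrm{LOF}_k(p) \approx 1$: point is in a region of similar density
- $\mathrm{LOF}_k(p) \gg 1$: point is in a sparser region than neighbors → anomalous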
from sklearn.neighbors import LocalOutlierFactor
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
preds = lof.fit_predict(X)
scores = lof.negative_outlier_factor_ # more negative = more anomalous
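To make the LOF definitions concrete, here is a small NumPy-only sketch that computes the k-distance, reachability distance, local reachability density, and LOF score by brute force. It assumes Euclidean distance and exactly `k` neighbors per point (ties at the k-th distance are not expanded); the function name `lof_scores` and the toy data are illustrative.

```python
import numpy as np

def lof_scores(X, k=3):
    """Brute-force LOF; O(n^2) memory and time, for illustration only."""
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances; a point is never its own neighbor.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)
    neighbors = np.argsort(dist, axis=1)[:, :k]   # indices of the k nearest neighbors
    k_dist = np.sort(dist, axis=1)[:, k - 1]      # distance to the k-th nearest neighbor
    rows = np.arange(len(X))[:, None]
    # reach-dist_k(p, o) = max(d_k(o), d(p, o)) for each neighbor o of p
    reach = np.maximum(k_dist[neighbors], dist[rows, neighbors])
    lrd = 1.0 / reach.mean(axis=1)                # local reachability density
    # LOF = mean lrd of the neighbors divided by the point's own lrd
    return lrd[neighbors].mean(axis=1) / lrd

# Toy data: a 3x3 grid of points plus one isolated point at (10, 10).
X_demo = np.array([[i, j] for i in range(3) for j in range(3)] + [[10, 10]])
demo_scores = lof_scores(X_demo, k=3)
```

The grid points end up with scores near 1, while the isolated point scores far above 1 because its neighbors are much denser than it is.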
5. DBSCAN for Anomaly Detection
DBSCAN labels points as core, border, or noise. Noise points are natural anomaly candidates.
Definitions
| Term | Definition |
|---|---|
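| ε-neighborhood | $N_\varepsilon(p) = \{q \in D : d(p, q) \le \varepsilon\}$ |
| Core point | $\lvert N_\varepsilon(p) \rvert \ge \text{MinPts}$ |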
| Border point | Not core, but within ε of a core point |
| Noise point | Neither core nor border → anomaly |
from sklearn.cluster import DBSCAN
db = DBSCAN(eps=0.5, min_samples=10)
labels = db.fit_predict(X)
anomalies = X[labels == -1] # noise points
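The three point types can be recovered from a fitted scikit-learn `DBSCAN` model through its `labels_` and `core_sample_indices_` attributes. The helper name `dbscan_point_types` and the toy data below are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def dbscan_point_types(X, eps=0.5, min_samples=10):
    """Label every point as 'core', 'border', or 'noise'."""
    db = DBSCAN(eps=eps, min_samples=min_samples).fit(X)
    types = np.full(len(X), "border")        # clustered but not core, by default
    types[db.labels_ == -1] = "noise"        # unclustered points
    types[db.core_sample_indices_] = "core"  # points with a dense eps-neighborhood
    return types

# Toy data: a dense 5x4 grid, one point on its fringe, one far away.
grid = np.array([[0.1 * i, 0.1 * j] for i in range(5) for j in range(4)])
X_demo = np.vstack([grid, [[0.85, 0.0]], [[5.0, 5.0]]])
point_types = dbscan_point_types(X_demo, eps=0.5, min_samples=8)
```

In this toy set the fringe point is reachable from a core point but has too few neighbors of its own, so it is a border point; only the far point is treated as an anomaly candidate.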
Parameter Selection
- eps (ε): Use k-distance graph — sort distances to k-th neighbor and look for the "elbow"
- MinPts: Rule of thumb: $\text{MinPts} \ge d + 1$ (where $d$ is feature dimensionality); $2d$ is a common choice for larger or noisier datasets
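A k-distance curve can be computed with scikit-learn's `NearestNeighbors`, as sketched below. Calling `kneighbors()` without arguments excludes each point from its own neighbor list. The function name `k_distance_curve`, the random data, and the choice `k=4` are illustrative assumptions; note that scikit-learn's `min_samples` counts the point itself, so `k = min_samples - 1` is one reasonable pairing.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def k_distance_curve(X, k):
    """Sorted distance from each point to its k-th nearest neighbor."""
    nn = NearestNeighbors(n_neighbors=k).fit(X)
    distances, _ = nn.kneighbors()      # no argument: query point is not its own neighbor
    return np.sort(distances[:, -1])    # last column = k-th neighbor distance

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))      # synthetic 2-D data (assumed)
curve = k_distance_curve(X_demo, k=4)

# To read off eps, plot the curve and look for the sharp bend, e.g.:
#   import matplotlib.pyplot as plt
#   plt.plot(curve); plt.ylabel("4-NN distance"); plt.show()
```

The elbow is usually judged by eye; values of eps just below the bend keep dense regions connected while leaving sparse points as noise.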
6. Autoencoder-Based Detection
Autoencoders learn a compressed representation of normal data. Anomalies produce high reconstruction error because the model has not learned to encode them.
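Architecture
$$x \;\xrightarrow{\;\text{encoder } f_\theta\;}\; z \;\xrightarrow{\;\text{decoder } g_\phi\;}\; \hat{x}, \qquad \dim(z) \ll \dim(x)$$
Training Objective
$$\mathcal{L}(\theta, \phi) = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{d} \lVert x_i - g_\phi(f_\theta(x_i)) \rVert^2$$
where $d$ is the input dimension. Train only on normal data (or assume most data is normal). At inference:
$$\text{score}(x) = \frac{1}{d} \lVert x - \hat{x} \rVert^2, \qquad x \text{ is flagged if } \text{score}(x) > \tau$$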
Variants
| Variant | Key Idea |
|---|---|
| Vanilla AE | Standard reconstruction error |
| Variational AE (VAE) | Use reconstruction probability (or the negative ELBO) as the anomaly score |
| Denoising AE | Train to reconstruct from corrupted input |
| LSTM-AE | Temporal autoencoder for time-series anomalies |
Implementation
import torch
import torch.nn as nn
class Autoencoder(nn.Module):
    def __init__(self, input_dim, latent_dim=8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 32),
            nn.ReLU(),
            nn.Linear(32, 64),
            nn.ReLU(),
            nn.Linear(64, input_dim),
        )
    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z)
# Train on normal data
# (X_train and X_test are assumed to be float32 tensors of shape (n_samples, n_features))
model = Autoencoder(input_dim=X_train.shape[1])
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for epoch in range(100):
    x_hat = model(X_train)
    loss = nn.MSELoss()(x_hat, X_train)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
# Score test data by per-sample mean squared reconstruction error
with torch.no_grad():
    x_hat = model(X_test)
    scores = ((X_test - x_hat) ** 2).mean(dim=1)
# Flag the top 5% of test scores
threshold = scores.quantile(0.95)
anomalies = X_test[scores > threshold]
7. Evaluation Metrics
Anomaly detection is inherently imbalanced — standard accuracy is misleading.
Confusion Matrix for Anomalies
| | Predicted Normal | Predicted Anomaly |
|---|---|---|
| Actual Normal | TN | FP |
| Actual Anomaly | FN | TP |
Key Metrics
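Precision and Recall:
$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}$$
F1 Score:
$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
AUROC (Area Under ROC Curve): Threshold-independent metric; the probability that the detector scores a random anomaly higher than a random normal point.
$$\text{AUROC} = P\big(s(x^+) > s(x^-)\big)$$
where $x^+$ is a random anomaly and $x^-$ is a random normal point.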
AUPRC (Area Under Precision-Recall Curve): More informative than AUROC when anomalies are extremely rare.
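The definitions of precision, recall, F1, and AUROC can be checked by hand on a tiny example. The sketch below uses only NumPy; the function names, the six labels and scores, and the 0.5 threshold are made up for illustration, and ties in the AUROC are counted as half a correct pair.

```python
import numpy as np

def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 with the anomaly class encoded as 1."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def auroc_pairwise(y_true, scores):
    """Fraction of (anomaly, normal) pairs where the anomaly scores higher."""
    y_true, scores = np.asarray(y_true), np.asarray(scores, dtype=float)
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

y_demo = np.array([0, 0, 0, 0, 1, 1])                   # 1 = anomaly
s_demo = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.9])      # higher = more anomalous
precision, recall, f1 = precision_recall_f1(y_demo, (s_demo > 0.5).astype(int))
auroc = auroc_pairwise(y_demo, s_demo)
```

Here one normal point (score 0.8) crosses the threshold, giving precision 2/3 at recall 1, and it outranks one of the two anomalies, so 7 of the 8 anomaly-normal pairs are ordered correctly.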
Metrics That Don't Require Labels
| Metric | Description |
|---|---|
| Local Outlier Factor | Avg LOF score of flagged points |
| Silhouette Score | Separation of anomaly vs. normal clusters |
| Mass-based | Fraction of total mass assigned to anomalies |
from sklearn.metrics import precision_recall_fscore_support, roc_auc_score
precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average='binary')
auroc = roc_auc_score(y_true, anomaly_scores)
print(f"Precision: {precision:.3f}, Recall: {recall:.3f}, F1: {f1:.3f}, AUROC: {auroc:.3f}")
8. Complete Python Implementation
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, roc_auc_score
# Generate synthetic data with anomalies
np.random.seed(42)
X_normal, y_normal = make_blobs(n_samples=500, centers=2, cluster_std=1.0, random_state=42)
X_anomaly = np.random.uniform(low=-8, high=8, size=(30, 2))
X = np.vstack([X_normal, X_anomaly])
y_true = np.array([0]*500 + [1]*30)
# Scale
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# --- Isolation Forest ---
iso = IsolationForest(n_estimators=200, contamination=0.06, random_state=42)
iso_preds = iso.fit_predict(X_scaled)
iso_preds_binary = (iso_preds == -1).astype(int)
iso_scores = -iso.decision_function(X_scaled)  # negate so higher = more anomalous
# --- LOF ---
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.06)
lof_preds = lof.fit_predict(X_scaled)
lof_preds_binary = (lof_preds == -1).astype(int)
lof_scores = -lof.negative_outlier_factor_  # negate so higher = more anomalous
# --- DBSCAN ---
db = DBSCAN(eps=0.5, min_samples=10)
db_labels = db.fit_predict(X_scaled)
db_preds_binary = (db_labels == -1).astype(int)
db_scores = db_preds_binary  # DBSCAN has no continuous score; AUROC uses the 0/1 labels
# --- Evaluate ---
for name, preds, scores in [("Isolation Forest", iso_preds_binary, iso_scores),
                            ("LOF", lof_preds_binary, lof_scores),
                            ("DBSCAN", db_preds_binary, db_scores)]:
    print(f"\n=== {name} ===")
    print(classification_report(y_true, preds, target_names=["Normal", "Anomaly"]))
    # AUROC is computed from continuous scores where the detector provides them
    print(f"AUROC: {roc_auc_score(y_true, scores):.3f}")
# --- Visualization ---
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
titles = ["Isolation Forest", "LOF", "DBSCAN"]
predictions = [iso_preds_binary, lof_preds_binary, db_preds_binary]
for ax, title, preds in zip(axes, titles, predictions):
    ax.scatter(X[preds==0, 0], X[preds==0, 1], c='steelblue', s=10, label='Normal', alpha=0.6)
    ax.scatter(X[preds==1, 0], X[preds==1, 1], c='red', s=30, marker='x', label='Anomaly')
    ax.set_title(title)
    ax.legend()
    ax.set_xlim(-10, 10)
    ax.set_ylim(-10, 10)
plt.tight_layout()
plt.savefig("anomaly_comparison.png", dpi=150)
plt.show()
Summary
| Method | Strengths | Weaknesses | Best For |
|---|---|---|---|
| Z-Score / IQR | Simple, interpretable | Gaussian assumption | 1-D, univariate data |
| Isolation Forest | Scalable, no distance computation | Random splits reduce precision | High-dimensional data |
| LOF | Captures local structure | $O(n^2)$ without indexing | Varying-density clusters |
| DBSCAN | No distribution assumption | Sensitive to ε, MinPts | Spatial data, known density |
| Autoencoder | Non-linear, powerful | Needs training data, tuning | Complex high-dim, images, sequences |
Key takeaways:
- No single method dominates — ensemble multiple detectors for robustness
- Feature engineering (domain-specific features) often matters more than algorithm choice
- Threshold selection is critical — use precision-recall tradeoffs aligned with business costs
- For time-series, use temporal models (LSTM-AE, Spectral Residual) rather than static methods
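As one hedged illustration of the ensembling takeaway, detector scores on different scales can be combined by averaging their ranks. This is only one of several possible combination schemes; the function names and the two score vectors are hypothetical, and ties are broken arbitrarily by `argsort`.

```python
import numpy as np

def rank_normalize(scores):
    """Map scores to (0, 1] by rank; a higher score gets a higher value."""
    scores = np.asarray(scores, dtype=float)
    ranks = scores.argsort().argsort() + 1   # 1 = lowest score, n = highest
    return ranks / len(scores)

def ensemble_scores(score_lists):
    """Average the rank-normalized scores of several detectors."""
    return np.mean([rank_normalize(s) for s in score_lists], axis=0)

# Two hypothetical detectors on very different scales (higher = more anomalous).
detector_a = [0.1, 0.2, 0.9, 0.3]
detector_b = [10.0, 500.0, 400.0, 20.0]
combined = ensemble_scores([detector_a, detector_b])
```

Rank averaging ignores the magnitude of each detector's scores, which makes it robust to scale differences but discards how confident a detector was; every input must be oriented so that higher means more anomalous.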