Introduction to Machine Learning

What Is Machine Learning?

Machine Learning (ML) is a branch of artificial intelligence where systems learn patterns from data rather than following hand-coded rules. Instead of programming every decision, you give the algorithm examples and it discovers the underlying logic itself.

Core Idea: A machine learning model finds a function f that maps inputs X to outputs y, learned entirely from data.

$f: X \rightarrow y$
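
To make the idea concrete, here is a minimal sketch in plain Python that learns a one-feature threshold rule from labelled examples instead of hard-coding it. The toy data (number of exclamation marks per email) and the helper `learn_threshold` are hypothetical, chosen only for illustration.

```python
# Toy labelled data (hypothetical): number of '!' in an email -> 1 = spam, 0 = not spam
X = [0, 1, 1, 2, 5, 6, 7, 9]
y = [0, 0, 0, 0, 1, 1, 1, 1]


def learn_threshold(xs, ys):
    """Return the threshold t with the fewest errors for the rule 'predict 1 if x > t'."""
    best_t, best_errors = None, len(xs) + 1
    for t in sorted(set(xs)):
        errors = sum((x > t) != bool(label) for x, label in zip(xs, ys))
        if errors < best_errors:
            best_t, best_errors = t, errors
    return best_t


threshold = learn_threshold(X, y)   # discovered from the examples, not hand-coded


def f(x):
    """The learned function f: X -> y."""
    return int(x > threshold)


print(threshold)
print(f(1), f(8))
```

Nobody wrote the rule "more than two exclamation marks means spam"; the search over candidate thresholds found it from the examples.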

Real-World Examples

| Application | Input (X) | Output (y) | ML Type |
|---|---|---|---|
| Email spam filter | Email text | Spam / Not Spam | Classification |
| House price prediction | Size, location, rooms | Price ($) | Regression |
| Netflix recommendations | Watch history | Movie suggestions | Collaborative filtering |
| Fraud detection | Transaction data | Fraud / Legit | Anomaly detection |
| ChatGPT | Text prompt | Text response | Generative AI |

The Three Types of Machine Learning

Machine Learning
├── Supervised Learning      ← learns from labeled data (X, y pairs)
│   ├── Classification       ← predicts categories  (spam/not spam)
│   └── Regression           ← predicts numbers     (house price)
├── Unsupervised Learning    ← finds patterns in unlabeled data
│   ├── Clustering           ← groups similar items (customer segments)
│   └── Dimensionality Red.  ← compresses features  (PCA)
└── Reinforcement Learning   ← learns by reward/penalty (game AI)

1. Supervised Learning

The algorithm learns a mapping from inputs to outputs using labeled training examples.

$\hat{y} = f_\theta(X) \quad \text{where } \theta \text{ are learned parameters}$

Training objective — minimize loss:

$\mathcal{L}(\theta) = \frac{1}{n}\sum_{i=1}^{n} \ell(y_i, \hat{y}_i)$
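
As an illustration of minimising $\mathcal{L}(\theta)$, the sketch below fits a one-feature linear model $\hat{y} = wx + b$ by gradient descent on the mean squared error. The data, learning rate, and number of steps are arbitrary assumptions for the demo, not recommended settings.

```python
# Toy data generated from y = 2x + 1 (assumed for illustration)
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
n = len(xs)

w, b = 0.0, 0.0   # theta = (w, b), the learned parameters
lr = 0.05         # learning rate (hypothetical choice)


def mse(w, b):
    """L(theta): average squared error over the training set."""
    return sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys)) / n


initial_loss = mse(w, b)
for _ in range(2000):
    # Gradients of the MSE with respect to w and b
    grad_w = sum(-2 * x * (y - (w * x + b)) for x, y in zip(xs, ys)) / n
    grad_b = sum(-2 * (y - (w * x + b)) for x, y in zip(xs, ys)) / n
    # Step against the gradient to reduce the loss
    w -= lr * grad_w
    b -= lr * grad_b

final_loss = mse(w, b)
print(w, b)                      # should end up close to 2 and 1
print(initial_loss, final_loss)
```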

| Loss Function | Used For | Formula |
|---|---|---|
| MSE | Regression | $\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$ |
| Cross-Entropy | Classification | $-\sum y_i \log(\hat{y}_i)$ |
| Hinge | SVM | $\max(0, 1 - y_i \hat{y}_i)$ |
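
The three losses can be computed by hand in a few lines. This is a sketch under stated assumptions: the cross-entropy version takes a one-hot label and predicted class probabilities for a single example, and the hinge version assumes labels in $\{-1, +1\}$ with a raw (unthresholded) model score. The function names are illustrative.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error over a list of regression targets."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def cross_entropy(y_onehot, p_pred):
    """Cross-entropy for one example: one-hot label vs. predicted probabilities."""
    return -sum(y * math.log(p) for y, p in zip(y_onehot, p_pred) if y > 0)


def hinge(y, score):
    """Hinge loss for one example: y is -1 or +1, score is the raw model output."""
    return max(0.0, 1.0 - y * score)


print(mse([3.0, 5.0], [2.0, 7.0]))                 # (1 + 4) / 2 = 2.5
print(cross_entropy([0, 1, 0], [0.2, 0.7, 0.1]))   # equals -log(0.7)
print(hinge(+1, 0.3), hinge(-1, -2.0))             # inside the margin vs. safely correct
```

The hinge loss is zero once an example is on the correct side with a margin of at least 1, which is why the second call contributes nothing.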

2. Unsupervised Learning

No labels — the algorithm discovers structure in raw data.

K-Means objective: $\min_{C_1,...,C_k} \sum_{j=1}^{k} \sum_{x \in C_j} \|x - \mu_j\|^2$

Here $\mu_j$ is the centroid (the mean of the points) of cluster $C_j$.
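
A minimal sketch of how this objective is typically reduced: alternate between assigning each point to its nearest centroid and recomputing each centroid as the mean of its points (Lloyd's algorithm). The toy points and the naive initialisation are assumptions for the demo, and the code does not handle empty clusters.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centroids):
    """Label each point with the index j of its nearest centroid mu_j."""
    return [
        min(range(len(centroids)), key=lambda j: sq_dist(point, centroids[j]))
        for point in points
    ]


def update(points, labels, k):
    """Recompute each centroid as the mean of the points assigned to it."""
    centroids = []
    for j in range(k):
        members = [pt for pt, lab in zip(points, labels) if lab == j]
        centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
    return centroids


def objective(points, centroids, labels):
    """The K-Means objective: total squared distance to assigned centroids."""
    return sum(sq_dist(pt, centroids[lab]) for pt, lab in zip(points, labels))


# Toy 2-D data with two visible groups (hypothetical)
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids = [points[0], points[1]]   # naive initialisation, assumed for the demo

history = []
for _ in range(5):                   # a few iterations are enough here
    labels = assign(points, centroids)
    centroids = update(points, labels, k=2)
    history.append(objective(points, centroids, labels))

print(labels)
print(history)   # the objective never increases from one iteration to the next
```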

3. Reinforcement Learning

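An agent takes actions in an environment and learns to maximize cumulative reward. Q-learning, a classic algorithm, keeps a value estimate $Q(s,a)$ for each state-action pair and updates it after every step: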

$Q(s,a) \leftarrow Q(s,a) + \alpha\bigl[r + \gamma\max_{a'}Q(s',a') - Q(s,a)\bigr]$

| Symbol | Meaning |
|---|---|
| $s, s'$ | Current and next state |
| $a$ | Action taken |
| $r$ | Reward received |
| $\alpha$ | Learning rate |
| $\gamma$ | Discount factor (0–1) |
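
A single application of the update rule can be traced in code. This is a sketch with a tabular Q stored in nested dictionaries; the state names, action names, and the values chosen for $\alpha$ and $\gamma$ are all hypothetical.

```python
from collections import defaultdict

ALPHA = 0.1   # learning rate (assumed value)
GAMMA = 0.9   # discount factor (assumed value)

# Q-table: Q[state][action] -> estimated return, defaulting to 0.0
Q = defaultdict(lambda: defaultdict(float))


def q_update(Q, s, a, r, s_next, actions):
    """Apply one Q-learning update for the transition (s, a, r, s_next)."""
    best_next = max(Q[s_next][a2] for a2 in actions)   # max over a' of Q(s', a')
    td_target = r + GAMMA * best_next
    Q[s][a] += ALPHA * (td_target - Q[s][a])
    return Q[s][a]


actions = ["left", "right"]
Q["s1"]["right"] = 2.0   # pretend this move is already known to be valuable

# The agent moved right from s0, received reward 1.0, and landed in s1
new_value = q_update(Q, "s0", "right", 1.0, "s1", actions)
print(new_value)   # 0 + 0.1 * (1.0 + 0.9 * 2.0 - 0), i.e. about 0.28
```

The estimate for `("s0", "right")` moves only a fraction $\alpha$ of the way toward the target, which is what makes the learning incremental.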

The Complete ML Workflow

1. Define Problem        → What are you predicting? What metric matters?
         ↓
2. Collect Data          → Gather raw data from databases, APIs, files
         ↓
3. Explore (EDA)         → Distributions, correlations, missing values
         ↓
4. Preprocess            → Clean, encode, split train/test, scale (fit on train only)
         ↓
5. Feature Engineering   → Create new features, select important ones
         ↓
6. Choose Model          → Linear, tree-based, neural network?
         ↓
7. Train                 → Fit model on training data
         ↓
8. Evaluate              → Accuracy, RMSE, F1 — on held-out test set
         ↓
9. Tune                  → Grid search or Bayesian hyperparameter search (scored by cross-validation, not the test set)
         ↓
10. Deploy               → REST API, batch job, or embedded system
         ↓
11. Monitor              → Track data drift and model degradation
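
Steps 4 and 6 to 9 can be sketched compactly with scikit-learn. The example below assumes the breast-cancer dataset bundled with scikit-learn and a logistic regression model. The candidate values for `C` are arbitrary assumptions. Putting the scaler inside a `Pipeline` means it is re-fitted on the training portion of every cross-validation fold, so no information leaks from validation data.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Step 4: hold out a test set before anything is fitted
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)

# Steps 4 and 6: scaling and model bundled into one estimator
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Steps 7 and 9: train and tune with 5-fold cross-validation on the training set only
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}   # candidate values are assumptions
search = GridSearchCV(pipe, param_grid, cv=5, scoring="f1")
search.fit(X_train, y_train)

# Step 8: one final check on the held-out test set
test_f1 = f1_score(y_test, search.predict(X_test))
print(search.best_params_, round(test_f1, 4))
```

The test set is touched exactly once, after tuning has finished, so the reported score is not biased by the hyperparameter search.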

Complete Python Example — End-to-End ML Pipeline

import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score,
                             classification_report, confusion_matrix)
import matplotlib.pyplot as plt
import seaborn as sns

# ── 1. Load Data ────────────────────────────────────────────────────
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name='target')  # 0=malignant, 1=benign

print(f"Dataset shape : {X.shape}")
print(f"Class balance : {y.value_counts().to_dict()}")
print(f"\nFeatures:\n{X.describe().round(2)}")

# ── 2. Explore (EDA) ────────────────────────────────────────────────
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Class distribution (sort_index puts class 0 first so the bars match the tick labels)
y.value_counts().sort_index().plot(kind='bar', ax=axes[0], color=['salmon','steelblue'])
axes[0].set_title('Class Distribution')
axes[0].set_xticklabels(['Malignant','Benign'], rotation=0)

# Correlation heatmap of the 10 features most correlated with 'mean radius'
top_features = X.corr()['mean radius'].abs().nlargest(10).index
X[top_features].corr().pipe(
    lambda c: sns.heatmap(c, ax=axes[1], cmap='coolwarm',
                          annot=True, fmt='.2f', cbar=False)
)
axes[1].set_title('Top Feature Correlations')
plt.tight_layout()
plt.show()

# ── 3. Preprocess ───────────────────────────────────────────────────
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)   # fit on train only!
X_test_s  = scaler.transform(X_test)        # apply same scale to test

# ── 4. Train Multiple Models ─────────────────────────────────────────
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Random Forest':       RandomForestClassifier(n_estimators=100, random_state=42),
}

results = {}
for name, model in models.items():
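    # 5-fold cross-validation on the training set. Note: the scaler was fitted on
    # all of X_train, so fold scores are slightly optimistic; wrapping the scaler
    # and the model in a sklearn Pipeline avoids this.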
    cv_scores = cross_val_score(model, X_train_s, y_train, cv=5, scoring='f1')
    model.fit(X_train_s, y_train)
    y_pred = model.predict(X_test_s)
    # precision, recall and F1 use pos_label=1 by default, so "positive" means benign here
    results[name] = {
        'CV F1 (mean)': cv_scores.mean(),
        'CV F1 (std)':  cv_scores.std(),
        'Test Accuracy': accuracy_score(y_test, y_pred),
        'Test Precision': precision_score(y_test, y_pred),
        'Test Recall':    recall_score(y_test, y_pred),
        'Test F1':        f1_score(y_test, y_pred),
    }

# ── 5. Compare Results ──────────────────────────────────────────────
results_df = pd.DataFrame(results).T.round(4)
print("\n" + "="*60)
print("MODEL COMPARISON")
print("="*60)
print(results_df.to_string())

# ── 6. Detailed Report for Best Model ──────────────────────────────
best_model = models['Random Forest']
y_pred = best_model.predict(X_test_s)

print("\nClassification Report — Random Forest:")
print(classification_report(y_test, y_pred,
                            target_names=['Malignant','Benign']))

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Malignant','Benign'],
            yticklabels=['Malignant','Benign'])
plt.title('Confusion Matrix — Random Forest')
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.tight_layout()
plt.show()

# Feature Importance
importances = pd.Series(
    best_model.feature_importances_,
    index=data.feature_names
).sort_values(ascending=False).head(10)

plt.figure(figsize=(10, 5))
importances.plot(kind='bar', color='steelblue')
plt.title('Top 10 Feature Importances — Random Forest')
plt.ylabel('Importance Score')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

Sample Output:

```text
Dataset shape : (569, 30)
Class balance : {1: 357, 0: 212}

MODEL COMPARISON
                      CV F1   Test Acc  Test Prec  Test Recall  Test F1
Logistic Regression   0.9648    0.9649     0.9655       0.9722   0.9688
Random Forest         0.9622    0.9737     0.9722       0.9861   0.9791
```

---

## Evaluation Metrics Reference

### Classification Metrics

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\text{Precision} = \frac{TP}{TP + FP} \quad \text{(How precise are positive predictions?)}$$

$$\text{Recall} = \frac{TP}{TP + FN} \quad \text{(How many positives did we catch?)}$$

$$F_1 = 2 \cdot \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
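
The four formulas translate directly into code. The sketch below works from raw confusion-matrix counts; the counts are made up for illustration, and returning 0.0 when a denominator is zero is a convention assumed here, not part of the definitions.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # assumed convention for 0/0
    recall = tp / (tp + fn) if (tp + fn) else 0.0      # assumed convention for 0/0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts: 40 TP, 45 TN, 10 FP, 5 FN
metrics = classification_metrics(tp=40, tn=45, fp=10, fn=5)
for name, value in metrics.items():
    print(f"{name:9s} {value:.4f}")
```

With these counts, precision (0.80) is lower than recall (about 0.89): the model catches most positives but raises some false alarms.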

### Regression Metrics

| Metric | Formula | Interpretation |
|---|---|---|
| MAE | $\frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i\rvert$ | Average absolute error (same units as y) |
| RMSE | $\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$ | Penalises large errors more |
| R² | $1 - \frac{SS_{res}}{SS_{tot}}$ | Proportion of variance explained (1 = perfect) |
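
A short sketch of the three regression metrics in plain Python. The target and prediction values are made up for illustration, and the function name is hypothetical.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) for paired lists of targets and predictions."""
    n = len(y_true)
    errors = [y - p for y, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e ** 2 for e in errors)              # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_y = sum(y_true) / n
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)   # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return mae, rmse, r2


y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 11.0]
mae, rmse, r2 = regression_metrics(y_true, y_pred)
print(mae, rmse, r2)
```

RMSE comes out larger than MAE here because the single error of 2 is squared before averaging, which is the sense in which RMSE penalises large errors more.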

---

## Choosing the Right Algorithm

    START
    │
    ├─ Labelled data? ──NO──→ Unsupervised (K-Means, PCA, DBSCAN)
    │       YES
    │
    ├─ Predicting a number? ─YES──→ Regression
    │       ├── Linear/Ridge/Lasso   (linear relationships)
    │       ├── Random Forest        (non-linear, robust)
    │       └── XGBoost              (tabular, best accuracy)
    │
    └─ Predicting a category? YES─→ Classification
            ├── Logistic Regression  (baseline, interpretable)
            ├── Random Forest        (robust, few params)
            ├── XGBoost/LightGBM     (tabular champion)
            └── Neural Network       (images, text, audio)
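
The decision tree above can also be written as a small lookup function. This is only a sketch: the function name and its arguments are hypothetical, and the returned lists simply mirror the branches of the tree as starting points rather than a definitive recommendation.

```python
def suggest_algorithms(labelled, target=None):
    """Return candidate algorithms following the decision tree.

    labelled: True if labelled data is available.
    target:   "number" or "category" (only used when labelled is True).
    """
    if not labelled:
        return ["K-Means", "PCA", "DBSCAN"]
    if target == "number":
        return ["Linear/Ridge/Lasso", "Random Forest", "XGBoost"]
    if target == "category":
        return ["Logistic Regression", "Random Forest",
                "XGBoost/LightGBM", "Neural Network"]
    raise ValueError("target must be 'number' or 'category' when labelled is True")


print(suggest_algorithms(labelled=False))
print(suggest_algorithms(labelled=True, target="number"))
print(suggest_algorithms(labelled=True, target="category"))
```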


---

## Key Takeaways

1. **Supervised learning** needs labelled data and optimises a loss function
2. **The bias-variance tradeoff** — simple models underfit, complex ones overfit
3. **Always evaluate on held-out data** — cross-validation gives reliable estimates
4. **Feature engineering matters** as much as algorithm choice
5. **Start simple** — logistic regression is a strong baseline before trying neural networks