Intro to Machine Learning: Supervised vs Unsupervised

What is Machine Learning?

Machine Learning (ML) is a subset of artificial intelligence that enables systems to learn from data and improve their performance without being explicitly programmed. Rather than following static rules, ML algorithms identify patterns in data and build mathematical models that make predictions or decisions.

Definition: Machine Learning (Tom Mitchell, 1997)

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

ML Mathematical Framework

D = \{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}

Here,

$D$ =Dataset of input-output pairs
$x_i \in \mathbb{R}^d$ =Feature vector for sample i
$y_i$ =Target label for sample i
$n$ =Number of samples

\min_f \; \mathbb{E}_{(x,y) \sim P_{\text{data}}} \left[ \mathcal{L}(f(x),\; y) \right]
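
The expectation above is taken over the unknown data distribution, so in practice it is approximated by the average loss over the $n$ samples in $D$ (the empirical risk). The following is a minimal sketch of that approximation, assuming a squared-error loss; the toy data and the candidate models `f` and `g` are hypothetical and chosen only for illustration.

```python
import numpy as np

def empirical_risk(model, X, y, loss):
    """Average loss of `model` over the dataset (X, y)."""
    predictions = np.array([model(x) for x in X])
    return float(np.mean(loss(predictions, y)))

def squared_error(y_pred, y_true):
    """Per-sample squared-error loss (an assumed choice for L)."""
    return (y_pred - y_true) ** 2

# Toy dataset: n = 4 samples, d = 1 feature (illustrative values only)
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

def f(x):
    """Hypothetical candidate model: f(x) = 2x + 1."""
    return 2.0 * x[0] + 1.0

def g(x):
    """A second hypothetical candidate: g(x) = x."""
    return x[0]

risk_f = empirical_risk(f, X, y, squared_error)
risk_g = empirical_risk(g, X, y, squared_error)
print(f"empirical risk of f: {risk_f}, of g: {risk_g}")
```

Minimizing over candidate models means preferring the one with the lower average loss, here `f`.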

ℹ️ The No Free Lunch Theorem

\sum_{f} P(d_m^y \mid f, m, a_1) = \sum_{f} P(d_m^y \mid f, m, a_2)

Here, $a_1$ and $a_2$ are any two algorithms, $f$ ranges over all possible target functions, and $d_m^y$ is the sequence of outcomes observed after $m$ evaluations. Summed over all possible problems, every algorithm performs the same, so no single learning algorithm is universally superior. An algorithm that does well on one class of problems must do correspondingly worse on others. This is why empirical evaluation on representative data is essential.

Types of Machine Learning

Architecture Diagram

                                Machine Learning
                                        │
              ┌─────────────────────────┼─────────────────────────┐
              │                         │                         │
         Supervised               Unsupervised              Reinforcement
          Learning                  Learning                  Learning
              │                         │                         │
       ┌──────┴──────┐          ┌───────┴───────┐                 │
       │             │          │               │                 │
Classification  Regression Clustering    Dimensionality      Agent-Based
                                            Reduction         Learning

1. Supervised Learning

In supervised learning, the algorithm learns from labeled data — each input comes with a known correct output.

Key Characteristics:

Training data includes input-output pairs $(x_i, y_i)$
Goal: Learn mapping from inputs to outputs
Evaluation is straightforward: compare predictions to known labels

Two Main Tasks:

Task	Output Type	Example	Algorithms
Classification	Discrete labels	Email → spam/not spam	Logistic Regression, SVM, Decision Trees
Regression	Continuous values	House features → price	Linear Regression, Random Forest, XGBoost

Classification Formulation

\hat{y} = \arg\max_{c \in \{1,\dots,C\}} P(Y=c \mid X=x)

Here,

$\hat{y}$ =Predicted class label
$C$ =Number of classes
$P(Y=c \mid X=x)$ =Posterior probability of class c given x
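
As a small illustration of the decision rule above, the sketch below picks the class with the highest posterior for each sample. The probability values are made up for the example; in practice they would come from a fitted classifier.

```python
import numpy as np

def predict_class(posteriors):
    """Return the index of the most probable class for each row."""
    return np.argmax(posteriors, axis=1)

# Assumed posteriors P(Y=c | X=x) for 3 samples and C = 3 classes
posteriors = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
])

# Classes are 0-indexed here, so class c in {1, ..., C} maps to index c - 1
y_hat = predict_class(posteriors)
print(y_hat)
```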

Regression Formulation

\hat{y} = f(x) = w_1 x_1 + \dots + w_d x_d + b = \mathbf{w}^T \mathbf{x} + b

Here,

$\mathbf{w}$ =Weight vector
$b$ =Bias term
$d$ =Number of features
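
The linear prediction can be computed for many samples at once as a matrix-vector product. This is a minimal sketch with assumed values for $\mathbf{w}$ and $b$; a real model would learn them from training data.

```python
import numpy as np

# Assumed parameters for d = 3 features (illustrative values)
w = np.array([0.5, -1.0, 2.0])   # weight vector
b = 1.0                          # bias term

def predict(X, w, b):
    """Linear prediction y_hat = X @ w + b, one value per row of X."""
    return X @ w + b

X = np.array([
    [2.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
y_hat = predict(X, w, b)
print(y_hat)
```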

2. Unsupervised Learning

Unsupervised learning works with unlabeled data, discovering hidden patterns or structures.

Key Characteristics:

No target variable provided
Goal: Discover structure, patterns, or representations
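Evaluation is harder: without ground-truth labels, quality is judged with internal metrics (e.g., silhouette score) or domain review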

Three Main Tasks:

Task	Goal	Example	Algorithms
Clustering	Group similar data	Customer segmentation	K-Means, DBSCAN, Hierarchical
Dimensionality Reduction	Reduce features while preserving info	Visualize high-dim data	PCA, t-SNE, UMAP
Anomaly Detection	Find outliers	Fraud detection	Isolation Forest, Autoencoders

K-Means Objective

J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \|x_i - \mu_k\|^2

Here,

$K$ =Number of clusters
$C_k$ =Set of points in cluster k
$\mu_k$ =Centroid of cluster k
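
To make the objective concrete, the sketch below evaluates $J$ for two different assignments of the same four points. The points and assignments are hypothetical; the function only scores a given clustering and does not run the K-Means iterations.

```python
import numpy as np

def kmeans_objective(X, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    diffs = X - centroids[labels]
    return float(np.sum(diffs ** 2))

def centroids_for(X, labels, k):
    """Centroid of each cluster = mean of the points assigned to it."""
    return np.array([X[labels == j].mean(axis=0) for j in range(k)])

X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

good_labels = np.array([0, 0, 1, 1])   # groups nearby points together
bad_labels = np.array([0, 1, 0, 1])    # splits nearby points apart

J_good = kmeans_objective(X, good_labels, centroids_for(X, good_labels, 2))
J_bad = kmeans_objective(X, bad_labels, centroids_for(X, bad_labels, 2))
print(J_good, J_bad)
```

A lower $J$ indicates tighter clusters, which is why K-Means prefers the first assignment.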

3. Reinforcement Learning

An agent learns to make decisions by interacting with an environment, receiving rewards or penalties.

Key Components:

Agent: The learner/decision maker
Environment: The world the agent interacts with
State (s): Current situation of the agent
Action (a): What the agent can do
Reward (r): Feedback signal
Policy (π): Strategy mapping states to actions

Bellman Equation (Value Function)

V(s) = \max_a \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \, V(s') \right]

Here,

$V(s)$ =Value of state s under the optimal policy
$R(s, a)$ =Reward for taking action a in state s
$\gamma$ =Discount factor in [0,1]
$P(s' \mid s, a)$ =Transition probability
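
One common way to solve this equation is value iteration: apply the right-hand side repeatedly until the values stop changing. The sketch below does this for a hypothetical 2-state, 2-action environment; all rewards and transition probabilities are assumptions made for the example, and $\gamma < 1$ is assumed so the iteration converges.

```python
# Hypothetical 2-state, 2-action MDP (all numbers are illustrative assumptions)
states = [0, 1]
actions = [0, 1]
gamma = 0.9

# R[s][a]: reward for taking action a in state s
R = [[0.0, 1.0],
     [2.0, 0.0]]

# P[s][a][s2]: probability of landing in s2 after action a in state s
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [1.0, 0.0]]]

def value_iteration(R, P, gamma, tol=1e-10, max_iter=10_000):
    """Repeatedly apply the Bellman update until values change by less than tol."""
    V = [0.0 for _ in states]
    for _ in range(max_iter):
        V_new = [
            max(
                R[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in states)
                for a in actions
            )
            for s in states
        ]
        if max(abs(new - old) for new, old in zip(V_new, V)) < tol:
            return V_new
        V = V_new
    return V

V = value_iteration(R, P, gamma)
print([round(v, 4) for v in V])
```

In this toy environment, staying in state 1 earns reward 2 forever, so its value is $2 / (1 - 0.9) = 20$, and the best move from state 0 is to go there.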

ML Workflow

Architecture Diagram

+----------+    +-----------+    +----------+    +-------------+
|  Define  |--->|  Collect  |--->| Prepare  |--->|   Explore   |
| Problem  |    |   Data    |    |   Data   |    | Data (EDA)  |
+----------+    +-----------+    +----------+    +-------------+
                                                        |
                                                        v
+----------+    +-----------+    +----------+    +-------------+
| Deploy & |<---| Evaluate  |<---| Select & |<---|   Feature   |
| Monitor  |    |   Model   |    |  Train   |    | Engineering |
+----------+    +-----------+    +----------+    +-------------+

Step-by-Step Process:

1. Problem Definition:

What are we predicting?
What type of ML task is this?
What is the business objective?

2. Data Collection:

Sources: databases, APIs, web scraping, sensors
Consider: quality, quantity, representativeness

3. Data Preparation:

Data Preparation Pipeline

X_{\text{cleaned}} = \text{Encode}\left(\text{Scale}\left(\text{Impute}(X_{\text{raw}})\right)\right)

Here,

$X_{\text{raw}}$ =Raw input data
$X_{\text{cleaned}}$ =Fully processed data
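
The three stages can be sketched with plain NumPy as below. This is one possible reading of the pipeline, assuming mean imputation and standardization for numeric columns and one-hot encoding for a categorical column; the helper names and the toy data are hypothetical. In a real project the imputation and scaling statistics should be computed on the training split only.

```python
import numpy as np

def impute_mean(X):
    """Replace NaNs in each numeric column with that column's mean."""
    X = X.copy()
    col_means = np.nanmean(X, axis=0)
    nan_rows, nan_cols = np.where(np.isnan(X))
    X[nan_rows, nan_cols] = col_means[nan_cols]
    return X

def standardize(X):
    """Scale each column to zero mean and unit variance (assumes no constant column)."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def one_hot(categories):
    """Encode a list of category labels as one-hot columns, in sorted label order."""
    levels = sorted(set(categories))
    return np.array([[1.0 if c == level else 0.0 for level in levels]
                     for c in categories])

# Toy raw data: two numeric columns (one missing value) and one categorical column
X_num_raw = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, 30.0]])
city = ["paris", "tokyo", "paris"]

X_imputed = impute_mean(X_num_raw)
X_scaled = standardize(X_imputed)
X_cleaned = np.hstack([X_scaled, one_hot(city)])
print(X_cleaned.shape)
```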

4. Exploratory Data Analysis (EDA):

Statistical summaries: mean, variance, correlations
Visualization: histograms, scatter plots, heatmaps

5. Feature Engineering:

Create new features: $x_{\text{new}} = x_1 \times x_2$
Transform features: $\log(x)$ , $\sqrt{x}$ , polynomial features
Select features: correlation analysis, mutual information
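
The three operations above can be sketched in a few lines of NumPy. The feature values and the target are made up; the target is deliberately built as $x_1 \times x_2$ so that the interaction feature shows the strongest correlation.

```python
import numpy as np

# Toy feature matrix with two positive-valued features (illustrative values)
X = np.array([[1.0, 4.0], [2.0, 9.0], [3.0, 16.0]])
x1, x2 = X[:, 0], X[:, 1]

# Create and transform features
interaction = x1 * x2          # x_new = x1 * x2
log_x1 = np.log(x1)            # requires x1 > 0
sqrt_x2 = np.sqrt(x2)          # requires x2 >= 0
x1_squared = x1 ** 2           # degree-2 polynomial term

X_engineered = np.column_stack([X, interaction, log_x1, sqrt_x2, x1_squared])

# Select features: absolute Pearson correlation with a (hypothetical) target
y = np.array([4.0, 18.0, 48.0])
correlations = [abs(np.corrcoef(X_engineered[:, j], y)[0, 1])
                for j in range(X_engineered.shape[1])]
print(np.round(correlations, 3))
```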

6. Model Selection & Training:

Split data: training (60-80%), validation (10-20%), test (10-20%)
Train multiple algorithms
Tune hyperparameters
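
A three-way split can be sketched by shuffling sample indices once and slicing them. The 70/15/15 proportions and the helper name `train_val_test_split` are assumptions that fall within the ranges listed above.

```python
import numpy as np

def train_val_test_split(n_samples, val_frac=0.15, test_frac=0.15, seed=42):
    """Return shuffled, non-overlapping index arrays for train/validation/test."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    n_test = int(round(test_frac * n_samples))
    n_val = int(round(val_frac * n_samples))
    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_test + n_val]
    train_idx = indices[n_test + n_val:]
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = train_val_test_split(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```

Hyperparameters are tuned against the validation indices, and the test indices are touched only once, for the final performance estimate.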

7. Model Evaluation:

ℹ️ Model Evaluation

\text{Performance} = f(\text{Accuracy}, \text{Precision}, \text{Recall}, \text{F1}, \text{AUC})
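
Most of these metrics follow directly from the four confusion-matrix counts. The sketch below computes accuracy, precision, recall, and F1 for binary labels in plain Python; the label vectors are made up. AUC is left out because it needs predicted scores rather than hard labels.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Illustrative labels: 3 true positives, 1 false negative, 1 false positive, 3 true negatives
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```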

8. Deployment & Monitoring:

Deploy to production
Monitor for drift: $|P_{\text{train}}(x) - P_{\text{prod}}(x)|$
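
One simple way to turn the drift expression into a number is the total variation distance between histograms of a feature in training and in production. This is only a sketch under that assumption; the function name, the bin count, and the simulated income data are all hypothetical, and other drift statistics are equally valid choices.

```python
import numpy as np

def total_variation_distance(train_values, prod_values, bins=10):
    """Half the L1 distance between two histogram estimates on shared bins (0 to 1)."""
    lo = min(train_values.min(), prod_values.min())
    hi = max(train_values.max(), prod_values.max())
    edges = np.linspace(lo, hi, bins + 1)
    p_train, _ = np.histogram(train_values, bins=edges)
    p_prod, _ = np.histogram(prod_values, bins=edges)
    p_train = p_train / p_train.sum()
    p_prod = p_prod / p_prod.sum()
    return 0.5 * float(np.abs(p_train - p_prod).sum())

rng = np.random.default_rng(0)
train_income = rng.normal(50_000, 15_000, 5_000)
prod_same = rng.normal(50_000, 15_000, 5_000)      # same distribution
prod_shifted = rng.normal(65_000, 15_000, 5_000)   # mean shifted upward

drift_same = total_variation_distance(train_income, prod_same)
drift_shifted = total_variation_distance(train_income, prod_shifted)
print(f"no shift: {drift_same:.3f}, shifted: {drift_shifted:.3f}")
```

A value near 0 suggests the production data still resembles the training data, while a larger value is a signal to investigate or retrain.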

Model Selection Criteria

Bias-Variance Tradeoff

\mathbb{E}\left[(y - \hat{f}(x))^2\right] = \text{Bias}^2(\hat{f}) + \text{Var}(\hat{f}) + \sigma^2

Theorem: Bias-Variance Decomposition (Proof Sketch)

For a model $\hat{f}$ predicting $y = f(x) + \epsilon$ where $\epsilon \sim \mathcal{N}(0, \sigma^2)$ , the expected squared error at a point $x$ is:

\mathbb{E}\left[(y - \hat{f}(x))^2\right] = \underbrace{\left(f(x) - \mathbb{E}[\hat{f}(x)]\right)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}\left[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\right]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Irreducible}}

Proof: Substitute $y = f(x) + \epsilon$ into $\mathbb{E}[(y - \hat{f})^2]$, add and subtract $\mathbb{E}[\hat{f}]$ inside the square, and expand. The cross terms vanish because $\epsilon$ is independent of $\hat{f}$ with $\mathbb{E}[\epsilon] = 0$, and because $f - \mathbb{E}[\hat{f}]$ is deterministic, so $\mathbb{E}[(\hat{f} - \mathbb{E}[\hat{f}])(f - \mathbb{E}[\hat{f}])] = 0$ .
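
The decomposition can also be checked numerically. The sketch below is a Monte Carlo experiment under assumed settings: a true function $f(x) = \sin(x)$, noise level $\sigma = 0.3$, and a straight-line model fitted with `np.polyfit` on many independent training sets. None of these choices come from the text above; they are picked only to make the bias visible.

```python
import numpy as np

rng = np.random.default_rng(0)

def f_true(x):
    """Assumed ground-truth function."""
    return np.sin(x)

sigma = 0.3       # noise standard deviation
x0 = 1.0          # point at which the error is decomposed
n_train = 20      # samples per training set
n_trials = 4000   # number of independent training sets
degree = 1        # deliberately simple model, so bias is noticeable

# Fit one model per training set and record its prediction at x0
preds = np.empty(n_trials)
for t in range(n_trials):
    x_train = rng.uniform(0.0, 3.0, n_train)
    y_train = f_true(x_train) + rng.normal(0.0, sigma, n_train)
    coeffs = np.polyfit(x_train, y_train, degree)
    preds[t] = np.polyval(coeffs, x0)

bias_sq = (f_true(x0) - preds.mean()) ** 2
variance = preds.var()

# Fresh noisy targets at x0, independent of the training sets
y0 = f_true(x0) + rng.normal(0.0, sigma, n_trials)
expected_sq_error = np.mean((y0 - preds) ** 2)

decomposed = bias_sq + variance + sigma ** 2
print(f"Monte Carlo squared error: {expected_sq_error:.4f}")
print(f"Bias^2 + Var + sigma^2:    {decomposed:.4f}")
```

The two printed numbers should agree up to sampling noise, which shrinks as `n_trials` grows.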

Architecture Diagram

Error
  ^
  |  \                           /
  |   \       Total Error       /
  |    \                       /
  |     '-._               _.-'
  |         '--.._____..--'
  |  \             ^             /
  |   \            |            /   Variance
  |    '.       Optimal       .'
  |      '-.   complexity  .-'
  |         '-._       _.-'
  |             '-._.-'
  |         _.-'       '-._
  | ____.--'               '--.____   Bias^2
  +----------------------------------> Model Complexity
      Simple                Complex

Overfitting vs Underfitting

Condition	Training Error	Validation Error	Diagnosis
Underfitting	High	High	Model too simple
Good Fit	Low	Low (close to training)	Model appropriate
Overfitting	Very Low	High	Model too complex

Regularization

💡 Preventing Overfitting

To prevent overfitting, add a penalty term to the objective: $J_{\text{reg}} = J_{\text{original}} + \lambda \cdot \text{Penalty}$

Type	Penalty	Formula	Effect
Ridge (L2)	$\|w\|_2^2$	$\sum_j w_j^2$	Shrinks coefficients
Lasso (L1)	$\|w\|_1$	$\sum_j |w_j|$	Feature selection
Elastic Net	Mix	$\alpha\|w\|_1 + (1-\alpha)\|w\|_2^2$	Both effects
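
The shrinking effect of the L2 penalty can be seen with the closed-form ridge solution $w = (X^T X + \lambda I)^{-1} X^T y$. The sketch below assumes a model without an intercept and uses synthetic data; `ridge_fit` is a hypothetical helper, not a library function.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge weights: solve (X^T X + lam * I) w = X^T y (no intercept)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Synthetic regression data (illustrative)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
w_true = np.array([3.0, -2.0, 0.0, 0.0, 1.0])
y = X @ w_true + rng.normal(scale=0.5, size=50)

lambdas = [0.0, 1.0, 10.0, 100.0]
norms = []
for lam in lambdas:
    w = ridge_fit(X, y, lam)
    norms.append(float(np.linalg.norm(w)))
    print(f"lambda={lam:6.1f}  ||w||_2 = {norms[-1]:.3f}")
```

As $\lambda$ grows, the weight norm shrinks toward zero. Lasso has no closed form like this and is usually solved iteratively.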

Real-World Applications

1. Healthcare — Disease Diagnosis

features = ['age', 'blood_pressure', 'cholesterol', 'glucose', 'bmi']
# Supervised classification: healthy vs diabetic
# Accuracy: 95%+, used as screening tool

2. Finance — Credit Scoring

features = ['income', 'debt_ratio', 'credit_history', 'employment_years']
# Binary classification: approve/deny loan
# Goal: Minimize false positives (approving risky borrowers)

3. E-commerce — Recommendation Systems

# User-item interaction matrix
# Unsupervised: collaborative filtering
# Find users with similar purchase patterns
# Recommend items they haven't seen
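
The outline above can be sketched as user-based collaborative filtering with cosine similarity. The interaction matrix and the `recommend` helper are hypothetical, and the sketch assumes every user has at least one purchase so that no row has zero norm.

```python
import numpy as np

# Hypothetical user-item matrix: rows = users, columns = items, 1 = purchased
interactions = np.array([
    [1, 1, 0, 0],   # user 0
    [1, 1, 1, 0],   # user 1
    [0, 0, 1, 1],   # user 2
], dtype=float)

def cosine_similarity_matrix(M):
    """Pairwise cosine similarity between the rows of M."""
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    unit = M / norms
    return unit @ unit.T

def recommend(user, interactions):
    """Find the most similar other user and return items only they have purchased."""
    sims = cosine_similarity_matrix(interactions)
    sims[user, user] = -np.inf              # exclude the user themself
    neighbor = int(np.argmax(sims[user]))   # most similar other user
    unseen = (interactions[user] == 0) & (interactions[neighbor] == 1)
    return neighbor, np.flatnonzero(unseen).tolist()

neighbor, items = recommend(0, interactions)
print(f"most similar user: {neighbor}, recommended items: {items}")
```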

4. Autonomous Vehicles — Object Detection

# Computer vision pipeline
# 1. Detect objects (cars, pedestrians, signs)
# 2. Classify object types
# 3. Predict trajectories
# Deep learning + reinforcement learning

5. Natural Language Processing — Sentiment Analysis

# Text classification
# Input: "This product is amazing!"
# Output: Positive sentiment (0.95 probability)
# Use case: Brand monitoring, customer feedback

Complete Python Example

📝 Supervised vs Unsupervised Learning Comparison

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score, classification_report, silhouette_score

# Generate synthetic dataset
np.random.seed(42)
n_samples = 1000

# Features: income, age, credit_score
X = np.column_stack([
    np.random.normal(50000, 15000, n_samples),
    np.random.normal(40, 12, n_samples),
    np.random.normal(680, 50, n_samples)
])

# Binary target: loan approval (0=denied, 1=approved)
y = ((X[:, 0] > 45000) & (X[:, 2] > 650)).astype(int)
noise = np.random.binomial(1, 0.1, n_samples)
y = np.bitwise_xor(y, noise)

df = pd.DataFrame(X, columns=['income', 'age', 'credit_score'])
df['approved'] = y

# --- Supervised Learning ---
X_train, X_test, y_train, y_test = train_test_split(
    df.drop('approved', axis=1), df['approved'],
    test_size=0.2, random_state=42, stratify=df['approved']
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

lr_model = LogisticRegression(random_state=42)
lr_model.fit(X_train_scaled, y_train)
lr_pred = lr_model.predict(X_test_scaled)

rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train_scaled, y_train)
rf_pred = rf_model.predict(X_test_scaled)

print("--- Logistic Regression ---")
print(f"Accuracy: {accuracy_score(y_test, lr_pred):.4f}")
print(classification_report(y_test, lr_pred))

print("\n--- Random Forest ---")
print(f"Accuracy: {accuracy_score(y_test, rf_pred):.4f}")
print(classification_report(y_test, rf_pred))

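# --- Unsupervised Learning ---
# Use a separate scaler so the supervised scaler stays fit on the training split only,
# and scale the full feature matrix once instead of refitting on every call
cluster_scaler = StandardScaler()
X_all_scaled = cluster_scaler.fit_transform(df.drop('approved', axis=1))

kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
clusters = kmeans.fit_predict(X_all_scaled)

print("\n--- K-Means Clustering ---")
print(f"Silhouette Score: {silhouette_score(X_all_scaled, clusters):.4f}")
print(f"Cluster sizes: {np.bincount(clusters)}")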

Key Takeaways

📋 Summary: Intro to Machine Learning

ML = Learning from Data: Systems improve with experience without explicit programming
Three Paradigms: Supervised (labeled), Unsupervised (unlabeled), Reinforcement (reward-based)
Bias-Variance Tradeoff: Balance model complexity to minimize total error: $\mathbb{E}[(y-\hat{f})^2] = \text{Bias}^2 + \text{Var} + \sigma^2$
Workflow Matters: Success often depends more on data quality and preparation than on algorithm choice
No Free Lunch: No single algorithm works best for all problems — try multiple approaches
Evaluation is Critical: Always use held-out test data; never evaluate on training data

Practice Exercises

Exercise 1: Problem Classification

Classify each scenario as supervised, unsupervised, or reinforcement learning:

a) Predicting house prices from features
b) Grouping customers by purchase behavior
c) Training a robot to walk
d) Detecting spam emails
e) Reducing 1000 features to 10 for visualization

Exercise 2: Dataset Exploration

from sklearn.datasets import load_iris
iris = load_iris()
# a) How many samples and features?
# b) What are the class labels?
# c) Visualize feature distributions
# d) Which features are most discriminative?

Exercise 3: Model Comparison

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Perform 5-fold cross-validation for each
# Which algorithm performs best? Why?

Exercise 4: Bias-Variance Analysis

Train a Decision Tree with max_depth = 2 (high bias) and max_depth = 20 (high variance)
Plot training and validation accuracy vs max_depth
Find the optimal depth

Reflection Questions

When would you choose unsupervised over supervised learning?
Why might a simpler model be preferred over a complex one?
What are the ethical considerations when deploying ML models?
