Intro to ML: Supervised vs Unsupervised

Intro to Machine Learning: Supervised vs Unsupervised

What is Machine Learning?

Machine Learning (ML) is a subset of artificial intelligence that enables systems to learn from data and improve their performance without being explicitly programmed. Rather than following static rules, ML algorithms identify patterns in data and build mathematical models that make predictions or decisions.

Definition: Machine Learning (Tom Mitchell, 1997)

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

ML Mathematical Framework

$$D = \{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}$$

Here,

  • $D$ = Dataset of input-output pairs
  • $x_i \in \mathbb{R}^d$ = Feature vector for sample i
  • $y_i$ = Target label for sample i
  • $n$ = Number of samples

Learning then amounts to choosing the function $f$ that minimizes the expected loss $\mathcal{L}$ over the data distribution $P_{\text{data}}$:

$$\min_f \; \mathbb{E}_{(x,y) \sim P_{\text{data}}} \left[ \mathcal{L}(f(x),\; y) \right]$$
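The expectation in this objective cannot be computed directly, because the data distribution is unknown. In practice it is usually approximated by the average loss over the dataset. The sketch below illustrates that idea under two assumptions that the lesson does not specify: a squared-error loss and two hypothetical one-feature models.

```python
# Empirical risk: average loss of a model f over a finite dataset D.
# The squared-error loss and both toy models are illustrative assumptions.

def squared_loss(prediction, target):
    return (prediction - target) ** 2

def empirical_risk(f, dataset, loss=squared_loss):
    """Mean loss of f over (x, y) pairs; approximates the expected loss."""
    return sum(loss(f(x), y) for x, y in dataset) / len(dataset)

# Toy dataset of (x_i, y_i) pairs generated by y = 2x
D = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

def good_model(x):
    return 2.0 * x

def poor_model(x):
    return x + 1.0

risk_good = empirical_risk(good_model, D)
risk_poor = empirical_risk(poor_model, D)
print(risk_good, risk_poor)
```

The model with the lower empirical risk fits this dataset better; whether it also generalizes is a separate question that requires held-out data.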

Note: The No Free Lunch Theorem

$$\sum_{d} P(d \mid m_1) = \sum_{d} P(d \mid m_2)$$

No single learning algorithm is universally superior across all possible data distributions. For every algorithm that performs well on some class of problems, there exists a distribution on which it performs no better than random guessing. This is why empirical evaluation on representative data is essential.


Types of Machine Learning

Architecture Diagram
Machine Learning
├── Supervised Learning
│   ├── Classification
│   └── Regression
├── Unsupervised Learning
│   ├── Clustering
│   └── Dimensionality Reduction
└── Reinforcement Learning
    └── Agent-Based Learning

1. Supervised Learning

In supervised learning, the algorithm learns from labeled data: each input comes with a known correct output.

Key Characteristics:

  • Training data includes input-output pairs $(x_i, y_i)$
  • Goal: Learn mapping from inputs to outputs
  • Evaluation is straightforward: compare predictions to known labels

Two Main Tasks:

| Task | Output Type | Example | Algorithms |
|---|---|---|---|
| Classification | Discrete labels | Email → spam/not spam | Logistic Regression, SVM, Decision Trees |
| Regression | Continuous values | House features → price | Linear Regression, Random Forest, XGBoost |

Classification Formulation

$$\hat{y} = \arg\max_{c \in \{1,\dots,C\}} P(Y=c \mid X=x)$$

Here,

  • $\hat{y}$ = Predicted class label
  • $C$ = Number of classes
  • $P(Y=c \mid X=x)$ = Posterior probability of class c given x
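The decision rule above can be sketched in a few lines. The lesson does not say how the posterior probabilities are obtained, so this example assumes they come from a softmax over hypothetical raw class scores; only the final argmax step corresponds to the formula.

```python
import math

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_class(scores):
    """Return the 1-based class index with the highest probability."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=lambda c: probs[c])
    return best + 1, probs

# Hypothetical scores for C = 3 classes
y_hat, probs = predict_class([1.0, 3.0, 0.5])
print(y_hat, [round(p, 3) for p in probs])
```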

Regression Formulation

$$\hat{y} = f(x) = w_0 + w_1 x_1 + \dots + w_d x_d = \mathbf{w}^T \mathbf{x} + b$$

Here,

  • $\mathbf{w} = (w_1, \dots, w_d)$ = Weight vector
  • $b$ = Bias term (the intercept, written $w_0$ in the expanded form)
  • $d$ = Number of features
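As a small worked instance of the regression formula, the sketch below computes a prediction as a dot product plus a bias. The weights, bias, and feature values are made up for illustration and do not come from a fitted model.

```python
def predict(w, x, b):
    """Linear prediction: dot product of weights and features plus bias."""
    if len(w) != len(x):
        raise ValueError("w and x must have the same number of features")
    return sum(w_j * x_j for w_j, x_j in zip(w, x)) + b

# Hypothetical weights for d = 2 features (e.g., floor area and room count)
w = [150.0, 10000.0]
b = 50000.0
x = [120.0, 3.0]

y_hat = predict(w, x, b)
print(y_hat)  # 150*120 + 10000*3 + 50000 = 98000.0
```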

2. Unsupervised Learning

Unsupervised learning works with unlabeled data, discovering hidden patterns or structures.

Key Characteristics:

  • No target variable provided
  • Goal: Discover structure, patterns, or representations
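  • Evaluation is less direct: with no ground-truth labels, quality is judged by internal metrics (such as the silhouette score) or by domain review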

Three Main Tasks:

| Task | Goal | Example | Algorithms |
|---|---|---|---|
| Clustering | Group similar data | Customer segmentation | K-Means, DBSCAN, Hierarchical |
| Dimensionality Reduction | Reduce features while preserving information | Visualize high-dimensional data | PCA, t-SNE, UMAP |
| Anomaly Detection | Find outliers | Fraud detection | Isolation Forest, Autoencoders |

K-Means Objective

$$J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \|x_i - \mu_k\|^2$$

Here,

  • $K$ = Number of clusters
  • $C_k$ = Set of points in cluster k
  • $\mu_k$ = Centroid of cluster k
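To make the objective concrete, the following sketch evaluates J for two hand-made cluster assignments of the same four points. It only computes the objective for a given assignment; it is not a full K-Means implementation, and the data are illustrative.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]

def kmeans_objective(clusters):
    """Sum of squared distances from each point to its cluster centroid."""
    total = 0.0
    for points in clusters:
        mu = centroid(points)
        total += sum(squared_distance(x, mu) for x in points)
    return total

# The same four 2D points, grouped two different ways
good = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
bad = [[(0.0, 0.0), (10.0, 0.0)], [(0.0, 2.0), (10.0, 2.0)]]

J_good = kmeans_objective(good)
J_bad = kmeans_objective(bad)
print(J_good, J_bad)  # 4.0 100.0
```

The assignment that groups nearby points yields the smaller J, which is the quantity K-Means tries to reduce at each iteration.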

3. Reinforcement Learning

An agent learns to make decisions by interacting with an environment, receiving rewards or penalties.

Key Components:

  • Agent: The learner/decision maker
  • Environment: The world the agent interacts with
  • State (s): Current situation of the agent
  • Action (a): What the agent can do
  • Reward (r): Feedback signal
  • Policy (π): Strategy mapping states to actions

Bellman Optimality Equation (Value Function)

$$V(s) = \max_a \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \, V(s') \right]$$

Here,

  • $V(s)$ = Value of state s when acting optimally
  • $R(s, a)$ = Reward for taking action a in state s
  • $\gamma$ = Discount factor in [0,1]
  • $P(s' \mid s, a)$ = Probability of moving to state s' after taking action a in state s
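One common way to solve this equation is value iteration: apply the right-hand side as an update until the values stop changing. The sketch below does this for a hypothetical two-state problem; the states, actions, rewards, and transitions are all invented for illustration.

```python
def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-10, max_iters=10_000):
    """Apply the Bellman optimality update in place until values stabilize."""
    V = {s: 0.0 for s in states}
    for _ in range(max_iters):
        delta = 0.0
        for s in states:
            best = max(
                R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                for a in actions
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    return V

# Hypothetical deterministic MDP: P[s][a] is a list of (probability, next_state)
states = ["A", "B"]
actions = ["stay", "move"]
P = {
    "A": {"stay": [(1.0, "A")], "move": [(1.0, "B")]},
    "B": {"stay": [(1.0, "B")], "move": [(1.0, "A")]},
}
R = {
    "A": {"stay": 0.0, "move": 1.0},
    "B": {"stay": 2.0, "move": 0.0},
}

V = value_iteration(states, actions, P, R)
print({s: round(v, 4) for s, v in V.items()})
```

Here staying in B forever is worth 2 / (1 - 0.9) = 20, and the best move from A is to go to B, worth 1 + 0.9 * 20 = 19.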

ML Workflow

Architecture Diagram
+----------+    +-----------+    +----------+    +-------------+
|  Define  |--->| Collect   |--->| Prepare  |--->|   Select    |
| Problem  |    | Data      |    | Data     |    |  Algorithm  |
+----------+    +-----------+    +----------+    +-------------+
                                                        |
                                                        v
+----------+    +-----------+    +----------+    +-------------+
| Deploy & |<---| Evaluate  |<---| Train    |<---|   Feature   |
| Monitor  |    | Model     |    | Model    |    | Engineering |
+----------+    +-----------+    +----------+    +-------------+

Step-by-Step Process:

1. Problem Definition:

  • What are we predicting?
  • What type of ML task is this?
  • What is the business objective?

2. Data Collection:

  • Sources: databases, APIs, web scraping, sensors
  • Consider: quality, quantity, representativeness

3. Data Preparation:

Data Preparation Pipeline

$$X_{\text{raw}} \xrightarrow{\text{Impute}} X_{\text{imputed}} \xrightarrow{\text{Scale}} X_{\text{scaled}} \xrightarrow{\text{Encode}} X_{\text{cleaned}}$$

Here,

  • $X_{\text{raw}}$ = Raw input data
  • $X_{\text{imputed}}$ = Data after missing values are filled in
  • $X_{\text{scaled}}$ = Data after numeric features are rescaled
  • $X_{\text{cleaned}}$ = Fully processed data
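The three preparation steps can be sketched without any library to show what each one does. This is a simplified illustration, assuming mean imputation, standardization, and one-hot encoding (the lesson does not name specific methods), applied to made-up columns.

```python
from statistics import mean, pstdev

def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def standardize(values):
    """Rescale to zero mean and unit variance (population std)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def one_hot(categories):
    """Encode each category as a 0/1 vector over the sorted unique values."""
    levels = sorted(set(categories))
    rows = [[1 if c == level else 0 for level in levels] for c in categories]
    return rows, levels

# Hypothetical raw data: a numeric column with a gap and a categorical column
income_raw = [40.0, None, 60.0, 50.0]
city_raw = ["Paris", "Oslo", "Paris", "Rome"]

income_imputed = impute_mean(income_raw)
income_scaled = standardize(income_imputed)
city_encoded, city_levels = one_hot(city_raw)

# Combine into the cleaned feature matrix, one row per sample
X_cleaned = [[x] + row for x, row in zip(income_scaled, city_encoded)]
for row in X_cleaned:
    print(row)
```

In a real project the imputation and scaling statistics would be computed on the training split only and then reused on validation and test data; scikit-learn's `SimpleImputer`, `StandardScaler`, and `OneHotEncoder` provide the same steps with that fit/transform separation.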

4. Exploratory Data Analysis (EDA):

  • Statistical summaries: mean, variance, correlations
  • Visualization: histograms, scatter plots, heatmaps

5. Feature Engineering:

  • Create new features: $x_{\text{new}} = x_1 \times x_2$
  • Transform features: $\log(x)$, $\sqrt{x}$, polynomial features
  • Select features: correlation analysis, mutual information
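The creation and transformation steps listed above might look like the sketch below for a single sample. The feature names are hypothetical, and `log1p` (the log of 1 + x) is used in place of a plain logarithm as an assumption, so that a zero input stays valid.

```python
import math

def engineer_features(x1, x2):
    """Return derived features for one sample with two non-negative inputs."""
    return {
        "interaction": x1 * x2,    # x_new = x1 * x2
        "log_x1": math.log1p(x1),  # log(1 + x1), defined at x1 = 0
        "sqrt_x2": math.sqrt(x2),  # square-root transform
        "x1_squared": x1 ** 2,     # degree-2 polynomial term
    }

features = engineer_features(3.0, 4.0)
print(features)
```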

6. Model Selection & Training:

  • Split data: training (60-80%), validation (10-20%), test (10-20%)
  • Train multiple algorithms
  • Tune hyperparameters
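A three-way split like the one described above can be sketched with the standard library alone. The 70/15/15 proportions and the function name are illustrative choices within the ranges given; they are not prescribed by the lesson.

```python
import random

def train_val_test_split(samples, val_frac=0.15, test_frac=0.15, seed=42):
    """Shuffle a copy of the samples and cut it into three disjoint parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_frac)
    n_val = round(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

data = list(range(100))
train, val, test = train_val_test_split(data)
print(len(train), len(val), len(test))  # 70 15 15
```

The validation part is used for tuning hyperparameters and comparing algorithms, while the test part is kept untouched until the final evaluation.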

7. Model Evaluation:

Note: Model Evaluation

For classification, judge performance with several complementary metrics rather than accuracy alone:

$$\text{Performance} = f(\text{Accuracy}, \text{Precision}, \text{Recall}, \text{F1}, \text{AUC})$$
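Most of these metrics follow directly from the confusion-matrix counts. The sketch below computes accuracy, precision, recall, and F1 for a binary problem on made-up labels; AUC is left out because it needs predicted scores rather than hard labels.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Illustrative labels: 4 positives and 6 negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```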

8. Deployment & Monitoring:

  • Deploy to production
  • Monitor for drift: $|P_{\text{train}}(x) - P_{\text{prod}}(x)|$
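One possible way to turn the drift expression into a number is to compare histograms of a feature in training and in production. The sketch below assumes a fixed bin width and uses the total variation distance (half the summed absolute difference); both are illustrative choices, and the age values are invented.

```python
from collections import Counter

def histogram(values, bin_width):
    """Relative frequency of values per bin of fixed width."""
    counts = Counter(int(v // bin_width) for v in values)
    n = len(values)
    return {b: c / n for b, c in counts.items()}

def drift_score(train_values, prod_values, bin_width=10.0):
    """Total variation distance: half the summed |P_train - P_prod| over bins."""
    p_train = histogram(train_values, bin_width)
    p_prod = histogram(prod_values, bin_width)
    bins = set(p_train) | set(p_prod)
    return 0.5 * sum(abs(p_train.get(b, 0.0) - p_prod.get(b, 0.0)) for b in bins)

train_ages = [22, 25, 31, 38, 41, 47, 52, 58]
similar_ages = [23, 27, 33, 36, 44, 45, 51, 59]
older_ages = [61, 64, 66, 70, 72, 75, 78, 79]

no_drift = drift_score(train_ages, similar_ages)
full_drift = drift_score(train_ages, older_ages)
print(no_drift, full_drift)  # 0.0 1.0
```

A score of 0 means the binned distributions match, and 1 means they do not overlap at all; the alert threshold in between is a project-specific decision.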

Model Selection Criteria

Bias-Variance Tradeoff

$$\mathbb{E}\left[(y - \hat{f}(x))^2\right] = \text{Bias}^2(\hat{f}) + \text{Var}(\hat{f}) + \sigma^2$$

Theorem: Bias-Variance Decomposition (Proof Sketch)

For a model $\hat{f}$ predicting $y = f(x) + \epsilon$ where $\epsilon \sim \mathcal{N}(0, \sigma^2)$, the expected squared error at a point $x$ is:

$$\mathbb{E}\left[(y - \hat{f}(x))^2\right] = \underbrace{\left(f(x) - \mathbb{E}[\hat{f}(x)]\right)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}\left[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\right]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Irreducible}}$$

Proof: Expand $\mathbb{E}[(y - \hat{f})^2]$ by adding and subtracting $\mathbb{E}[\hat{f}]$ and $f(x)$, then apply the independence of $\epsilon$ and $\hat{f}$. Cross terms vanish due to $\mathbb{E}[\epsilon] = 0$ and $\mathbb{E}[(\hat{f} - \mathbb{E}[\hat{f}])(f - \mathbb{E}[\hat{f}])] = 0$.
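The decomposition can also be checked numerically. The simulation below is a sketch under invented assumptions: the true value f(x) is fixed at 2.0, and the estimator is a deliberately shrunken sample mean, which introduces bias. It repeatedly draws training sets, then compares the measured squared error against the sum of the three terms.

```python
import random

def simulate(f_x=2.0, sigma=1.0, n_train=10, shrink=0.5, trials=20_000, seed=0):
    """Estimate bias^2, variance, noise, and expected squared error at one x."""
    rng = random.Random(seed)
    preds, sq_errors = [], []
    for _ in range(trials):
        # Fresh training set of noisy observations y = f(x) + eps
        train = [f_x + rng.gauss(0.0, sigma) for _ in range(n_train)]
        f_hat = shrink * sum(train) / n_train  # biased toy estimator
        y_new = f_x + rng.gauss(0.0, sigma)    # independent test observation
        preds.append(f_hat)
        sq_errors.append((y_new - f_hat) ** 2)
    mean_pred = sum(preds) / trials
    bias_sq = (f_x - mean_pred) ** 2
    variance = sum((p - mean_pred) ** 2 for p in preds) / trials
    expected_error = sum(sq_errors) / trials
    return bias_sq, variance, sigma ** 2, expected_error

bias_sq, variance, noise, expected_error = simulate()
print(f"bias^2 + variance + noise = {bias_sq + variance + noise:.3f}")
print(f"measured expected error   = {expected_error:.3f}")
```

For this toy estimator the theoretical values are Bias² = 1, Variance = 0.025, and σ² = 1, so both printed numbers should land close to 2.025, up to simulation noise.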

Architecture Diagram
Error
  ^
  |  \                              /
  |   \       Total Error          /
  |    \                          /
  |     '-.                    .-'
  |        '--.            .--'
  |            '---.__.---'
  |                 ^
  |                 Optimal complexity
  |
  |  Bias^2:   high ----------------> low
  |  Variance: low  ----------------> high
  +----------------------------------> Model Complexity
      Simple                Complex

Overfitting vs Underfitting

| Condition | Training Error | Validation Error | Diagnosis |
|---|---|---|---|
| Underfitting | High | High | Model too simple |
| Good Fit | Low | Low (close to training) | Model appropriate |
| Overfitting | Very Low | High | Model too complex |

Regularization

Tip: Preventing Overfitting

To prevent overfitting, add a penalty term to the objective: $J_{\text{reg}} = J_{\text{original}} + \lambda \cdot \text{Penalty}$, where $\lambda \ge 0$ controls the strength of the penalty.

| Type | Penalty | Formula | Effect |
|---|---|---|---|
| Ridge (L2) | $\lVert w \rVert_2^2$ | $\sum_j w_j^2$ | Shrinks coefficients |
| Lasso (L1) | $\lVert w \rVert_1$ | $\sum_j \lvert w_j \rvert$ | Can set coefficients exactly to zero (feature selection) |
| Elastic Net | Mix | $\alpha \lVert w \rVert_1 + (1-\alpha) \lVert w \rVert_2^2$ | Both effects |
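The three penalties are simple functions of the weight vector. The sketch below evaluates each one for an illustrative weight vector and adds the L2 penalty to a made-up unregularized loss; the function names are hypothetical.

```python
def l1_penalty(w):
    """Lasso penalty: sum of absolute weights."""
    return sum(abs(w_j) for w_j in w)

def l2_penalty(w):
    """Ridge penalty: sum of squared weights."""
    return sum(w_j ** 2 for w_j in w)

def elastic_net_penalty(w, alpha):
    """Weighted mix of the L1 and L2 penalties, with alpha in [0, 1]."""
    return alpha * l1_penalty(w) + (1 - alpha) * l2_penalty(w)

def regularized_loss(original_loss, w, lam, penalty):
    """J_reg = J_original + lambda * Penalty(w)."""
    return original_loss + lam * penalty(w)

w = [3.0, -4.0, 0.0]
l1 = l1_penalty(w)                        # 7.0
l2 = l2_penalty(w)                        # 25.0
enet = elastic_net_penalty(w, alpha=0.5)  # 16.0
j_reg = regularized_loss(1.0, w, lam=0.1, penalty=l2_penalty)
print(l1, l2, enet, j_reg)
```

Larger weights are penalized more heavily under L2 than under L1, which is why Ridge mainly shrinks large coefficients while Lasso can push small ones to zero.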

Real-World Applications

1. Healthcare: Disease Diagnosis

features = ['age', 'blood_pressure', 'cholesterol', 'glucose', 'bmi']
# Supervised classification: healthy vs diabetic
# Accuracy: 95%+, used as screening tool

2. Finance: Credit Scoring

features = ['income', 'debt_ratio', 'credit_history', 'employment_years']
# Binary classification: approve/deny loan
# Goal: Minimize false positives (approving risky borrowers)

3. E-commerce: Recommendation Systems

# User-item interaction matrix
# Unsupervised: collaborative filtering
# Find users with similar purchase patterns
# Recommend items they haven't seen

4. Autonomous Vehicles: Object Detection

# Computer vision pipeline
# 1. Detect objects (cars, pedestrians, signs)
# 2. Classify object types
# 3. Predict trajectories
# Deep learning + reinforcement learning

5. Natural Language Processing: Sentiment Analysis

# Text classification
# Input: "This product is amazing!"
# Output: Positive sentiment (0.95 probability)
# Use case: Brand monitoring, customer feedback

Complete Python Example

Supervised vs Unsupervised Learning Comparison

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score, classification_report, silhouette_score

# Generate synthetic dataset
np.random.seed(42)
n_samples = 1000

# Features: income, age, credit_score
X = np.column_stack([
    np.random.normal(50000, 15000, n_samples),
    np.random.normal(40, 12, n_samples),
    np.random.normal(680, 50, n_samples)
])

# Binary target: loan approval (0=denied, 1=approved)
y = ((X[:, 0] > 45000) & (X[:, 2] > 650)).astype(int)
noise = np.random.binomial(1, 0.1, n_samples)
y = np.bitwise_xor(y, noise)

df = pd.DataFrame(X, columns=['income', 'age', 'credit_score'])
df['approved'] = y

# --- Supervised Learning ---
X_train, X_test, y_train, y_test = train_test_split(
    df.drop('approved', axis=1), df['approved'],
    test_size=0.2, random_state=42, stratify=df['approved']
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

lr_model = LogisticRegression(random_state=42)
lr_model.fit(X_train_scaled, y_train)
lr_pred = lr_model.predict(X_test_scaled)

rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train_scaled, y_train)
rf_pred = rf_model.predict(X_test_scaled)

print("--- Logistic Regression ---")
print(f"Accuracy: {accuracy_score(y_test, lr_pred):.4f}")
print(classification_report(y_test, lr_pred))

print("\n--- Random Forest ---")
print(f"Accuracy: {accuracy_score(y_test, rf_pred):.4f}")
print(classification_report(y_test, rf_pred))

# --- Unsupervised Learning ---
# Scale all features once with a separate scaler, so the scaler fitted on the
# training split above is not overwritten. The 'approved' labels are ignored.
X_all_scaled = StandardScaler().fit_transform(df.drop('approved', axis=1))

kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
clusters = kmeans.fit_predict(X_all_scaled)

print("\n--- K-Means Clustering ---")
print(f"Silhouette Score: {silhouette_score(X_all_scaled, clusters):.4f}")
print(f"Cluster sizes: {np.bincount(clusters)}")

Key Takeaways

Summary: Intro to Machine Learning

  • ML = Learning from Data: Systems improve with experience without explicit programming
  • Three Paradigms: Supervised (labeled), Unsupervised (unlabeled), Reinforcement (reward-based)
  • Bias-Variance Tradeoff: Balance model complexity to minimize total error: $\mathbb{E}[(y-\hat{f})^2] = \text{Bias}^2 + \text{Var} + \sigma^2$
  • Workflow Matters: Success often depends more on data preparation than on algorithm choice
  • No Free Lunch: No single algorithm works best for all problems, so try multiple approaches
  • Evaluation is Critical: Always use held-out test data; never evaluate on training data

Practice Exercises

Exercise 1: Problem Classification

Classify each scenario as supervised, unsupervised, or reinforcement learning:

  • a) Predicting house prices from features
  • b) Grouping customers by purchase behavior
  • c) Training a robot to walk
  • d) Detecting spam emails
  • e) Reducing 1000 features to 10 for visualization

Exercise 2: Dataset Exploration

from sklearn.datasets import load_iris
iris = load_iris()
# a) How many samples and features?
# b) What are the class labels?
# c) Visualize feature distributions
# d) Which features are most discriminative?

Exercise 3: Model Comparison

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Perform 5-fold cross-validation for each
# Which algorithm performs best? Why?

Exercise 4: Bias-Variance Analysis

  • Train a Decision Tree with max_depth = 2 (high bias) and max_depth = 20 (high variance)
  • Plot training and validation accuracy vs max_depth
  • Find the optimal depth

Reflection Questions

  1. When would you choose unsupervised over supervised learning?
  2. Why might a simpler model be preferred over a complex one?
  3. What are the ethical considerations when deploying ML models?
