Introduction to Machine Learning: Supervised, Unsupervised and Reinforcement

What is Machine Learning?

Machine Learning is a field of artificial intelligence that gives computers the ability to learn from data without being explicitly programmed. Rather than following rigid rules, ML algorithms identify patterns in data and improve their performance over time.

Tom Mitchell's Formal Definition (1997):

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

Mathematical Formulation:

For a computer program to be "learning," we require:

\text{Performance}(T, P, E_{\text{new}}) > \text{Performance}(T, P, E_{\text{old}})

where $E_{\text{new}}$ contains more or better experience than $E_{\text{old}}$.

Example (Email Spam Filter):

Task T: Classify emails as spam or not spam
Experience E: Labeled dataset of emails (spam/not spam)
Performance P: Classification accuracy (fraction of correctly classified emails)

As the system processes more emails, its accuracy improves: it learns.

ML Algorithm Taxonomy

Machine learning algorithms are broadly categorized by the nature of the learning signal or feedback available to them.

Types of Machine Learning

1. Supervised Learning

In supervised learning, the algorithm learns from labeled data: each training example consists of an input vector $\mathbf{x}$ and a corresponding target output $y$. The goal is to learn a mapping function $f: \mathbf{x} \rightarrow y$.

Two primary sub-tasks:

Classification (predicting a discrete class label):

\hat{y} = f(\mathbf{x}), \quad \hat{y} \in \{c_1, c_2, \ldots, c_K\}

Examples: Email spam detection (spam/not spam), image recognition (cat/dog/bird), medical diagnosis (malignant/benign).
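As a minimal sketch of classification, the k-nearest-neighbors rule below predicts a discrete label by majority vote among the closest training points. The choice of k-NN, the function name `knn_predict`, and the toy data are illustrative assumptions, not something the text prescribes.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Predict a discrete label for x by majority vote among its k nearest training points."""
    # Sort training examples by Euclidean distance to the query point.
    neighbors = sorted(zip(train_X, train_y), key=lambda pair: math.dist(pair[0], x))
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two numeric features per email, two classes.
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (4.0, 4.2), (4.1, 3.9), (3.8, 4.0)]
train_y = ["not spam", "not spam", "not spam", "spam", "spam", "spam"]

prediction = knn_predict(train_X, train_y, (3.9, 4.1))
print(prediction)
```

The output set $\{c_1, \ldots, c_K\}$ here is simply the set of labels that appear in `train_y`.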

Regression (predicting a continuous value):

\hat{y} = f(\mathbf{x}), \quad \hat{y} \in \mathbb{R}

Examples: House price prediction, temperature forecasting, stock price estimation.
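A minimal regression sketch, assuming a single input feature and ordinary least squares as the fitting method (the text does not fix a particular algorithm). The helper `fit_line` and the house-size data are hypothetical.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data generated exactly from price = 3 * size.
sizes = [50.0, 80.0, 100.0, 120.0]
prices = [150.0, 240.0, 300.0, 360.0]

slope, intercept = fit_line(sizes, prices)
predicted = slope * 90.0 + intercept  # continuous prediction for an unseen size
print(slope, intercept, predicted)
```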

2. Unsupervised Learning

In unsupervised learning, the algorithm receives unlabeled data and must discover hidden patterns, structures, or relationships on its own.

Key sub-tasks:

Clustering (grouping similar data points together):

\text{Partition } \mathcal{D} = \{x_1, \ldots, x_n\} \text{ into } k \text{ clusters } C_1, \ldots, C_k

K-Means Objective (minimize within-cluster variance):

J = \sum_{i=1}^{k} \sum_{\mathbf{x} \in C_i} \| \mathbf{x} - \boldsymbol{\mu}_i \|^2

where $\boldsymbol{\mu}_i$ is the centroid of cluster $C_i$.

Examples: Customer segmentation, document grouping, image segmentation.
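The K-Means objective is usually minimized with Lloyd's algorithm, which alternates an assignment step and a centroid-update step. The sketch below assumes that algorithm, hand-picked initial centroids, and toy 2-D points. The names `kmeans` and `objective` are illustrative.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm: alternate assignment and centroid-update steps."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, clusters

def objective(centroids, clusters):
    """J: sum of squared distances from each point to its cluster centroid."""
    return sum(math.dist(p, c) ** 2 for c, cluster in zip(centroids, clusters) for p in cluster)

points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
J = objective(centroids, clusters)
print(centroids, J)
```

Lloyd's algorithm only finds a local minimum of $J$, so the result generally depends on the initial centroids.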

Dimensionality Reduction (reducing the number of features while preserving important structure):

\mathbf{x} \in \mathbb{R}^d \rightarrow \mathbf{z} \in \mathbb{R}^r, \quad r \ll d

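PCA Objective: Find a projection matrix $\mathbf{W} \in \mathbb{R}^{d \times r}$ that maximizes the variance of the projected (mean-centered) data $\mathbf{z} = \mathbf{W}^\top \mathbf{x}$: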

\max_{\mathbf{W}} \text{tr}(\mathbf{W}^\top \boldsymbol{\Sigma} \mathbf{W}), \quad \text{subject to } \mathbf{W}^\top \mathbf{W} = \mathbf{I}

where $\boldsymbol{\Sigma}$ is the covariance matrix.

Examples: Visualization of high-dimensional data, noise reduction, feature extraction.
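For $r = 1$ the PCA objective is maximized by the leading eigenvector of $\boldsymbol{\Sigma}$. The sketch below finds it with power iteration, which is one of several possible methods and is chosen here only because it needs no external library. The function name and the toy data are assumptions.

```python
import math

def first_principal_component(data, iterations=100):
    """Leading eigenvector of the sample covariance matrix via power iteration."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in data]
    # Sample covariance matrix Sigma (d x d).
    cov = [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    # Arbitrary start vector; it must not be orthogonal to the leading eigenvector.
    w = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * w[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
    return w, means

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # points on the line y = 2x
w, means = first_principal_component(data)
# Project onto w: z = w^T (x - mean), reducing d = 2 features to r = 1.
z = [sum(wj * (xj - mj) for wj, xj, mj in zip(w, row, means)) for row in data]
print(w, z)
```

Because the toy points lie exactly on a line, the single component retains all of the variance.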

Association (discovering interesting relationships between variables):

\{A \Rightarrow B\}: \text{support}, \text{confidence}, \text{lift}

Support: $P(A \cap B)$, how frequently the items appear together
Confidence: $P(B|A)$, how often $B$ appears when $A$ is present
Lift: $\frac{P(A \cap B)}{P(A) \cdot P(B)}$, the strength of association beyond chance (a lift above 1 means $A$ and $B$ co-occur more often than they would if independent)

Examples: Market basket analysis ("customers who buy X also buy Y"), recommendation systems.
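The three rule metrics can be computed directly from transaction counts. This is a small sketch over a made-up basket dataset. `rule_metrics` is a hypothetical helper and probabilities are estimated as relative frequencies.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Support, confidence, and lift for the rule antecedent => consequent."""
    n = len(transactions)
    a, b = set(antecedent), set(consequent)
    count_a = sum(1 for t in transactions if a <= t)
    count_b = sum(1 for t in transactions if b <= t)
    count_ab = sum(1 for t in transactions if (a | b) <= t)
    support = count_ab / n                          # P(A and B)
    confidence = count_ab / count_a                 # P(B | A)
    lift = support / ((count_a / n) * (count_b / n))
    return support, confidence, lift

# Hypothetical market baskets.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread"},
    {"milk"},
    {"butter", "milk"},
]
support, confidence, lift = rule_metrics(transactions, {"bread"}, {"butter"})
print(support, confidence, lift)
```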

3. Semi-supervised Learning

Semi-supervised learning combines a small amount of labeled data with a large pool of unlabeled data. This is practical because labeled data is often expensive to obtain.

Core Assumption: If two data points are close in feature space, they likely share the same label (smoothness assumption).

Key Approaches:

Method	Description
Self-training	Model labels unlabeled data and retrains
Co-training	Two views of data train separate classifiers
Label Propagation	Graph-based spreading of labels
MixMatch	Combines consistency regularization + entropy minimization

Applications: Medical imaging (limited expert annotations), NLP with large unlabeled corpora.
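Self-training, the first method in the table above, can be sketched in a few lines. The example below assumes 1-D inputs, a nearest-class-mean base classifier, and a distance margin as the confidence test. All three are simplifications chosen for illustration, as are the names `class_means` and `self_train`.

```python
def class_means(labeled):
    """Base classifier: the mean of the 1-D inputs for each label."""
    sums = {}
    for x, y in labeled:
        total, count = sums.get(y, (0.0, 0))
        sums[y] = (total + x, count + 1)
    return {y: total / count for y, (total, count) in sums.items()}

def self_train(labeled, unlabeled, margin=1.0, rounds=5):
    """Pseudo-label confident points and retrain; assumes at least two classes."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(rounds):
        means = class_means(labeled)
        confident, remaining = [], []
        for x in unlabeled:
            ranked = sorted(means, key=lambda y: abs(x - means[y]))
            best, runner_up = ranked[0], ranked[1]
            # Treat a large gap between the two nearest class means as confidence.
            if abs(x - means[runner_up]) - abs(x - means[best]) >= margin:
                confident.append((x, best))
            else:
                remaining.append(x)
        if not confident:
            break  # nothing new was labeled, so retraining would change nothing
        labeled += confident
        unlabeled = remaining
    return class_means(labeled), unlabeled

means, still_unlabeled = self_train(
    labeled=[(0.0, "a"), (10.0, "b")],
    unlabeled=[1.0, 2.0, 8.0, 9.0, 5.0],
)
print(means, still_unlabeled)
```

The point at 5.0 sits midway between the classes, so it never passes the confidence test and stays unlabeled.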

4. Reinforcement Learning

Reinforcement learning involves an agent learning to make sequential decisions by interacting with an environment to maximize cumulative reward.

Markov Decision Process (MDP):

An MDP is defined by the tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$:

$\mathcal{S}$: Set of states
$\mathcal{A}$: Set of actions
$P(s'|s, a)$: Transition probability
$R(s, a, s')$: Reward function
$\gamma \in [0, 1)$: Discount factor

Objective (maximize the expected value of the discounted return $G_t$):

G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}

Bellman Optimality Equation (Optimal Action-Value Function):

Q^*(s, a) = \mathbb{E}\left[R_{t+1} + \gamma \max_{a'} Q^*(S_{t+1}, a') \mid S_t = s, A_t = a\right]

Applications: Game playing (AlphaGo, Atari), robotics, autonomous driving, resource management.
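To make the return and the Bellman equation concrete, the sketch below computes a discounted return and then solves a tiny MDP by repeatedly applying the Bellman backup to a table of Q-values. This is Q-value iteration, which assumes $P$ and $R$ are known. It is not a learning-from-interaction method such as Q-learning. The two-state MDP and all names are made up for illustration.

```python
def discounted_return(rewards, gamma):
    """G_t for a finite reward sequence R_{t+1}, R_{t+2}, ..."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

def q_value_iteration(transitions, gamma, sweeps=200):
    """transitions[(s, a)] is a list of (probability, next_state, reward) outcomes."""
    actions = {}
    for s, a in transitions:
        actions.setdefault(s, []).append(a)
    q = {sa: 0.0 for sa in transitions}
    for _ in range(sweeps):
        new_q = {}
        for (s, a), outcomes in transitions.items():
            # Bellman backup: expected reward plus discounted best next value.
            new_q[(s, a)] = sum(
                p * (r + gamma * max((q[(s2, a2)] for a2 in actions.get(s2, [])), default=0.0))
                for p, s2, r in outcomes
            )
        q = new_q
    return q

# Hypothetical deterministic MDP with states "A" and "B".
transitions = {
    ("A", "stay"): [(1.0, "A", 0.0)],
    ("A", "go"): [(1.0, "B", 1.0)],
    ("B", "stay"): [(1.0, "B", 2.0)],
    ("B", "back"): [(1.0, "A", 0.0)],
}
g = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
q = q_value_iteration(transitions, gamma=0.5)
print(g, q)
```

The greedy policy read off from `q` is to take "go" in state A and "stay" in state B.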

The ML Workflow

The machine learning workflow is a systematic pipeline from raw data to a deployed model.

Step-by-Step Breakdown

1. Data Collection

Gather relevant data from databases, APIs, sensors, logs, or web scraping. Data quality largely determines model quality: the principle of "garbage in, garbage out" applies.

2. Data Preprocessing

Handle missing values (imputation, deletion)
Remove duplicates and outliers
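Normalize/standardize features, where $\mu$ and $\sigma$ are the feature's mean and standard deviation (computed on the training set to avoid leaking test information):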

z = \frac{x - \mu}{\sigma} \quad \text{(standardization)}

x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \quad \text{(min-max scaling)}

Encode categorical variables (one-hot, label encoding)
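The scaling and encoding steps above can be sketched with the standard library alone. The helper names are hypothetical, the population standard deviation is an assumption (some tools divide by $n - 1$ instead), and constant features, which would cause a division by zero, are not handled.

```python
import math

def standardize(values):
    """z = (x - mu) / sigma, using the population standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]

def min_max_scale(values):
    """x' = (x - x_min) / (x_max - x_min), mapping values into [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(categories):
    """One indicator column per distinct category, in sorted order."""
    vocab = sorted(set(categories))
    return [[1 if c == v else 0 for v in vocab] for c in categories]

ages = [20.0, 30.0, 40.0]  # hypothetical numeric feature
z = standardize(ages)
scaled = min_max_scale(ages)
encoded = one_hot(["red", "green", "red"])
print(z, scaled, encoded)
```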

3. Feature Engineering

Select relevant features (feature selection)
Create new features from existing ones
Reduce dimensionality (PCA, feature hashing)
Domain-specific transformations

4. Model Training

Split data into training, validation, and test sets (typically 70/15/15 or 80/10/10):

\mathcal{D} = \mathcal{D}_{\text{train}} \cup \mathcal{D}_{\text{val}} \cup \mathcal{D}_{\text{test}}

Select algorithm and hyperparameters, then minimize loss:

\hat{\theta} = \arg\min_{\theta} \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}(f_\theta(\mathbf{x}_i), y_i)
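A minimal sketch of this step, assuming a 70/15/15 split, a linear model $f_\theta(x) = wx + b$ with $\theta = (w, b)$, squared-error loss, and plain gradient descent as the optimizer. The learning rate, epoch count, and synthetic data are illustrative choices rather than recommendations.

```python
import random

def split(data, train=0.7, val=0.15, seed=0):
    """Shuffle once, then cut into disjoint train / validation / test sets."""
    rows = list(data)
    random.Random(seed).shuffle(rows)
    n_train = round(len(rows) * train)
    n_val = round(len(rows) * val)
    return rows[:n_train], rows[n_train:n_train + n_val], rows[n_train + n_val:]

def train_linear(rows, lr=0.1, epochs=2000):
    """Minimize the mean squared error of f(x) = w * x + b by gradient descent."""
    w, b = 0.0, 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in rows) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in rows) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Synthetic, noise-free data from y = 2x + 1.
data = [(i / 10, 2.0 * (i / 10) + 1.0) for i in range(20)]
train_set, val_set, test_set = split(data)
w, b = train_linear(train_set)
print(len(train_set), len(val_set), len(test_set), w, b)
```

Only `train_set` is used for fitting. The validation set is reserved for choosing hyperparameters and the test set for the final performance estimate.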

5. Model Evaluation

Assess generalization on unseen data using appropriate metrics (covered below).

6. Deployment and Monitoring

Deploy the model to production, monitor it for data drift, and retrain periodically.

Model Selection and Evaluation

Common Evaluation Metrics

For Classification:

Metric	Formula	Interpretation
Accuracy	$\frac{TP + TN}{TP + TN + FP + FN}$	Overall correctness
Precision	$\frac{TP}{TP + FP}$	Of predicted positives, how many are correct
Recall	$\frac{TP}{TP + FN}$	Of actual positives, how many were found
F1 Score	$\frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$	Harmonic mean of precision and recall
AUC-ROC	Area under ROC curve	Discrimination ability across thresholds

Confusion Matrix:

	Predicted Positive	Predicted Negative
Actual Positive	TP (True Positive)	FN (False Negative)
Actual Negative	FP (False Positive)	TN (True Negative)
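The confusion-matrix counts and the metrics derived from them can be computed as follows. This is a sketch for the binary case. The function names, the toy labels, and the convention of returning 0.0 when a denominator is zero are assumptions.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, FN, TN) for a binary problem."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def classification_metrics(y_true, y_pred, positive=1):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Assumed convention: report 0.0 when a denominator is zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(confusion_counts(y_true, y_pred), metrics)
```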

For Regression:

Metric	Formula
MSE	$\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$
RMSE	$\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$
MAE	$\frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$
$R^2$	$1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2}$
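The four regression metrics follow directly from the residuals $y_i - \hat{y}_i$. This is a small sketch with made-up values. `regression_metrics` is a hypothetical helper, and $R^2$ is undefined when all targets are equal.

```python
import math

def regression_metrics(y_true, y_pred):
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(r * r for r in residuals)           # sum of squared residuals
    mean_y = sum(y_true) / n
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)  # total sum of squares
    mse = ss_res / n
    return {
        "mse": mse,
        "rmse": math.sqrt(mse),
        "mae": sum(abs(r) for r in residuals) / n,
        "r2": 1 - ss_res / ss_tot,
    }

y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]
scores = regression_metrics(y_true, y_pred)
print(scores)
```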

Cross-Validation

K-Fold Cross-Validation: Split data into $k$ folds. Train on $k-1$ folds, validate on the remaining fold. Rotate and average:

\text{CV}(k) = \frac{1}{k} \sum_{i=1}^{k} \text{Score}_i

This provides a more robust estimate of model performance than a single train-test split.
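The rotation over folds can be sketched as below. The example assumes contiguous, unshuffled folds (in practice the data is usually shuffled first), a trivial "predict the training mean" model, and mean absolute error as the score. All names are illustrative.

```python
def k_fold_indices(n, k):
    """Yield (train_indices, val_indices) for k contiguous folds over range(n)."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, val
        start += size

def cross_validate(data, k, fit, score):
    """Average the validation score over k train/validate rotations."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(data), k):
        model = fit([data[i] for i in train_idx])
        scores.append(score(model, [data[i] for i in val_idx]))
    return sum(scores) / k

def fit_mean(train):
    # Hypothetical model: always predict the mean of the training targets.
    return sum(train) / len(train)

def mean_abs_error(model, val):
    return sum(abs(v - model) for v in val) / len(val)

data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
cv_score = cross_validate(data, 3, fit_mean, mean_abs_error)
print(cv_score)
```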

Overfitting and Underfitting

The central challenge in ML is finding the right balance between model complexity and generalization.

The Bias-Variance Tradeoff

The expected prediction error can be decomposed:

\text{Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Noise}

Bias: Error from erroneous assumptions in the learning algorithm. High bias leads to underfitting.
Variance: Error from sensitivity to fluctuations in training data. High variance leads to overfitting.
Irreducible Noise: Inherent randomness in the data that no model can eliminate.

Preventing Overfitting

Technique	Description
Cross-validation	Better estimation of generalization performance
Regularization (L1/L2)	Penalize large weights: $\lambda\|\theta\|_1$ or $\lambda\|\theta\|_2^2$
Early stopping	Stop training when validation error starts increasing
Dropout	Randomly deactivate neurons during training
Data augmentation	Artificially increase training set size
Feature selection	Use fewer, more relevant features
Ensemble methods	Combine multiple models (bagging, boosting)
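As one concrete instance from the table, the L2 penalty shrinks weights toward zero. The sketch below assumes the simplest possible case: a single weight $w$ with no intercept, where minimizing $\sum_i (y_i - w x_i)^2 + \lambda w^2$ has the closed form $w = \sum_i x_i y_i / (\sum_i x_i^2 + \lambda)$. The function name and data are hypothetical.

```python
def ridge_slope(xs, ys, lam):
    """Minimizer of sum((y - w*x)^2) + lam * w^2 for a single weight w."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # generated from y = 2x

# A larger lambda gives a stronger penalty and a smaller weight.
weights = {lam: ridge_slope(xs, ys, lam) for lam in (0.0, 14.0, 100.0)}
print(weights)
```

With $\lambda = 0$ the unregularized least-squares weight is recovered. Increasing $\lambda$ trades a worse fit on the training data for a smaller weight.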

Model Complexity vs Error Tradeoff

Key Observations:

Training error typically decreases as model complexity increases (a sufficiently flexible model can memorize the training data).
Validation error decreases initially, reaches a minimum, then increases (overfitting begins).
The gap between training and validation error indicates the degree of overfitting.
The optimal model minimizes validation error, not training error.

No Free Lunch Theorem

The No Free Lunch (NFL) Theorem (Wolpert, 1996) states that no single machine learning algorithm is universally superior to all others across all possible problems.

Formal Statement:

For any two learning algorithms $a$ and $b$:

\sum_{P \in \mathcal{P}} \text{Performance}(a, P) = \sum_{P \in \mathcal{P}} \text{Performance}(b, P)

That is, when performance is averaged uniformly over all possible problems $\mathcal{P}$ (all possible data-generating distributions), every algorithm performs equally well.
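The intuition behind this statement can be checked by brute force on a toy problem. The sketch below is an illustration of the uniform-average idea under simplifying assumptions (binary labels, a handful of unseen inputs, deterministic predictors), not a proof of Wolpert's theorem. It enumerates every possible labeling of the unseen inputs and averages each predictor's accuracy over all of them.

```python
from itertools import product

def average_off_training_accuracy(predictor, test_inputs):
    """Mean accuracy of `predictor` over every possible binary labeling of test_inputs."""
    labelings = list(product([0, 1], repeat=len(test_inputs)))
    total = 0.0
    for labels in labelings:
        correct = sum(1 for x, y in zip(test_inputs, labels) if predictor(x) == y)
        total += correct / len(test_inputs)
    return total / len(labelings)

def always_one(x):
    # Hypothetical algorithm a: predicts class 1 for every input.
    return 1

def parity(x):
    # Hypothetical algorithm b: predicts the parity of the input.
    return x % 2

test_inputs = [0, 1, 2]
acc_a = average_off_training_accuracy(always_one, test_inputs)
acc_b = average_off_training_accuracy(parity, test_inputs)
print(acc_a, acc_b)
```

Both predictors average the same accuracy because each is right on exactly half of all labelings at every input. One outperforms the other only once some labelings are assumed more likely than others.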

Practical Implications:

No universally best algorithm: choosing the right model requires understanding the problem domain, data characteristics, and constraints.
Inductive bias: every algorithm makes assumptions about the data, so an algorithm that works well on one class of problems may fail on another.
Algorithm selection depends on the characteristics of the problem, for example:

Problem Characteristic	Preferred Approaches
Large dataset, many features	Neural Networks, Gradient Boosting
Small dataset, interpretability needed	Logistic/Linear Regression, Decision Trees
High-dimensional data	SVM with RBF kernel, PCA preprocessing
Time series	LSTM, ARIMA, Prophet
Non-linear relationships	Kernel methods, Random Forests, Neural Nets

Ensemble methods often work well in practice because they combine multiple models, which reduces the impact of any single model's weaknesses.

Wisdom from Practice:

"All models are wrong, but some are useful." (George Box)

"The question is not whether a model is 'true,' but whether it is useful for the task at hand."

Summary

Concept	Key Takeaway
ML Definition	Algorithms that improve performance through experience (Mitchell, 1997)
Supervised Learning	Learns from labeled data → Classification or Regression
Unsupervised Learning	Discovers structure in unlabeled data → Clustering, Dim. Reduction, Association
Semi-supervised	Combines small labeled + large unlabeled datasets
Reinforcement Learning	Agent learns via rewards from environment interactions
ML Workflow	Data → Preprocess → Features → Train → Evaluate → Deploy
Evaluation	Use appropriate metrics (accuracy, F1, MSE, $R^2$) and cross-validation
Overfitting	Model memorizes noise; high variance, low bias
Underfitting	Model too simple; low variance, high bias
NFL Theorem	No single algorithm works best for all problems; choose based on context