CW

Introduction to Machine Learning: Supervised, Unsupervised and Reinforcement

Module 7: Machine Learning FundamentalsFree Lesson

Advertisement

What is Machine Learning?

Machine Learning is a field of artificial intelligence that gives computers the ability to learn from data without being explicitly programmed. Rather than following rigid rules, ML algorithms identify patterns in data and improve their performance over time.

Tom Mitchell's Formal Definition (1997):

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

Mathematical Formulation:

For a computer program to be "learning," we require:

Performance(T,P,Enew)>Performance(T,P,Eold)\text{Performance}(T, P, E_{\text{new}}) > \text{Performance}(T, P, E_{\text{old}})

where EnewE_{\text{new}} contains more or better experience than EoldE_{\text{old}}.

Example — Email Spam Filter:

  • Task T: Classify emails as spam or not spam
  • Experience E: Labeled dataset of emails (spam/not spam)
  • Performance P: Classification accuracy (fraction of correctly classified emails)

As the system processes more emails, its accuracy improves — it learns.


ML Algorithm Taxonomy

Machine learning algorithms are broadly categorized based on the nature of the learning signal or feedback available:

Machine LearningSupervised LearningUnsupervised LearningSemi-supervisedReinforcementClassificationRegressionLinear/LogisticSVM, Trees, KNNNaive Bayes, Neural NetsLinear Reg, RidgeLasso, SVRDecision Trees RegClusteringDim. Reduct.AssociationK-Means, DBSCANHierarchicalGMMPCA, t-SNEUMAP, AutoencodersAprioriFP-GrowthEclatModel-FreeModel-BasedQ-LearningSARSAMonte CarloDyna-QMCTSWorld ModelsSelf-trainingLabel PropagationCo-trainingGraph-basedParadigm SummarySupervised: Labeled data (X → y)Unsupervised: Unlabeled data (X → structure)Semi-supervised: Mix of labeled + unlabeledReinforcement: Agent learns via rewards

Types of Machine Learning

1. Supervised Learning

In supervised learning, the algorithm learns from labeled data — each training example consists of an input vector x\mathbf{x} and a corresponding target output yy. The goal is to learn a mapping function f:xyf: \mathbf{x} \rightarrow y.

Two primary sub-tasks:

Classification — Predicting a discrete class label:

y^=f(x),y^{c1,c2,,cK}\hat{y} = f(\mathbf{x}), \quad \hat{y} \in \{c_1, c_2, \ldots, c_K\}

Examples: Email spam detection (spam/not spam), image recognition (cat/dog/bird), medical diagnosis (malignant/benign).

Regression — Predicting a continuous value:

y^=f(x),y^R\hat{y} = f(\mathbf{x}), \quad \hat{y} \in \mathbb{R}

Examples: House price prediction, temperature forecasting, stock price estimation.

Supervised Learning: Classification vs RegressionClassification (Discrete Output)Decision BoundaryFeature x₁Feature x₂Class AClass By ∈ {cat, dog, bird, ...}Regression (Continuous Output)Feature (e.g., size)Predicted trend (fitted line)Observed data pointsy ∈ ℝ (e.g., price in dollars)

2. Unsupervised Learning

In unsupervised learning, the algorithm receives unlabeled data and must discover hidden patterns, structures, or relationships on its own.

Key sub-tasks:

Clustering — Grouping similar data points together:

Partition D={x1,,xn} into k clusters C1,,Ck\text{Partition } \mathcal{D} = \{x_1, \ldots, x_n\} \text{ into } k \text{ clusters } C_1, \ldots, C_k

K-Means Objective (minimize within-cluster variance):

J=i=1kxCixμi2J = \sum_{i=1}^{k} \sum_{\mathbf{x} \in C_i} \| \mathbf{x} - \boldsymbol{\mu}_i \|^2

where μi\boldsymbol{\mu}_i is the centroid of cluster CiC_i.

Examples: Customer segmentation, document grouping, image segmentation.

Dimensionality Reduction — Reducing the number of features while preserving important structure:

xRdzRr,rd\mathbf{x} \in \mathbb{R}^d \rightarrow \mathbf{z} \in \mathbb{R}^r, \quad r \ll d

PCA Objective: Find projection matrix W\mathbf{W} that maximizes variance:

maxWtr(WΣW),subject to WW=I\max_{\mathbf{W}} \text{tr}(\mathbf{W}^\top \boldsymbol{\Sigma} \mathbf{W}), \quad \text{subject to } \mathbf{W}^\top \mathbf{W} = \mathbf{I}

where Σ\boldsymbol{\Sigma} is the covariance matrix.

Examples: Visualization of high-dimensional data, noise reduction, feature extraction.

Association — Discovering interesting relationships between variables:

{AB}:support,confidence,lift\{A \Rightarrow B\}: \text{support}, \text{confidence}, \text{lift}
  • Support: P(AB)P(A \cap B) — how frequently items appear together
  • Confidence: P(BA)P(B|A) — how often BB appears when AA is present
  • Lift: P(AB)P(A)P(B)\frac{P(A \cap B)}{P(A) \cdot P(B)} — strength of association beyond chance

Examples: Market basket analysis ("customers who buy X also buy Y"), recommendation systems.


3. Semi-supervised Learning

Semi-supervised learning combines a small amount of labeled data with a large pool of unlabeled data. This is practical because labeled data is often expensive to obtain.

Core Assumption: If two data points are close in feature space, they likely share the same label (smoothness assumption).

Key Approaches:

MethodDescription
Self-trainingModel labels unlabeled data and retrains
Co-trainingTwo views of data train separate classifiers
Label PropagationGraph-based spreading of labels
MixMatchCombines consistency regularization + entropy minimization

Applications: Medical imaging (limited expert annotations), NLP with large unlabeled corpora.


4. Reinforcement Learning

Reinforcement learning involves an agent learning to make sequential decisions by interacting with an environment to maximize cumulative reward.

Markov Decision Process (MDP):

An MDP is defined by the tuple (S,A,P,R,γ)(\mathcal{S}, \mathcal{A}, P, R, \gamma):

  • S\mathcal{S}: Set of states
  • A\mathcal{A}: Set of actions
  • P(ss,a)P(s'|s, a): Transition probability
  • R(s,a,s)R(s, a, s'): Reward function
  • γ[0,1)\gamma \in [0, 1): Discount factor

Objective — Maximize Expected Cumulative Reward:

Gt=Rt+1+γRt+2+γ2Rt+3+=k=0γkRt+k+1G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}

Bellman Equation (Optimal Value Function):

Q(s,a)=E[Rt+1+γmaxaQ(s,a)St=s,At=a]Q^*(s, a) = \mathbb{E}\left[R_{t+1} + \gamma \max_{a'} Q^*(s', a') \mid S_t = s, A_t = a\right]

Applications: Game playing (AlphaGo, Atari), robotics, autonomous driving, resource management.


The ML Workflow

The machine learning workflow is a systematic pipeline from raw data to deployed model:

DataCollectionGather raw datafrom sources→DataPreprocessingClean, normalize,handle missing→FeatureEngineeringSelect, transform,create features→ModelTrainingFit model totraining data→Evaluateand DeployTest performance,put in productionThe Machine Learning Workflow Pipeline

Step-by-Step Breakdown

1. Data Collection

Gather relevant data from databases, APIs, sensors, logs, or web scraping. Data quality determines model quality — the principle of "garbage in, garbage out" applies.

2. Data Preprocessing

  • Handle missing values (imputation, deletion)
  • Remove duplicates and outliers
  • Normalize/standardize features:
z=xμσ(standardization)z = \frac{x - \mu}{\sigma} \quad \text{(standardization)}
x=xxminxmaxxmin(min-max scaling)x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \quad \text{(min-max scaling)}
  • Encode categorical variables (one-hot, label encoding)

3. Feature Engineering

  • Select relevant features (feature selection)
  • Create new features from existing ones
  • Reduce dimensionality (PCA, feature hashing)
  • Domain-specific transformations

4. Model Training

Split data into training, validation, and test sets (typically 70/15/15 or 80/10/10):

D=DtrainDvalDtest\mathcal{D} = \mathcal{D}_{\text{train}} \cup \mathcal{D}_{\text{val}} \cup \mathcal{D}_{\text{test}}

Select algorithm and hyperparameters, then minimize loss:

θ^=argminθ1Ni=1NL(fθ(xi),yi)\hat{\theta} = \arg\min_{\theta} \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}(f_\theta(\mathbf{x}_i), y_i)

5. Model Evaluation

Assess generalization on unseen data using appropriate metrics (covered below).

6. Deployment and Monitoring

Deploy to production, monitor for data drift, retrain periodically.


Model Selection and Evaluation

Common Evaluation Metrics

For Classification:

MetricFormulaInterpretation
AccuracyTP+TNTP+TN+FP+FN\frac{TP + TN}{TP + TN + FP + FN}Overall correctness
PrecisionTPTP+FP\frac{TP}{TP + FP}Of predicted positives, how many are correct
RecallTPTP+FN\frac{TP}{TP + FN}Of actual positives, how many were found
F1 Score2PrecisionRecallPrecision+Recall\frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}Harmonic mean of precision and recall
AUC-ROCArea under ROC curveDiscrimination ability across thresholds

Confusion Matrix:

Predicted PositivePredicted Negative
Actual PositiveTP (True Positive)FN (False Negative)
Actual NegativeFP (False Positive)TN (True Negative)

For Regression:

MetricFormula
MSE1ni=1n(yiy^i)2\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2
RMSE1ni=1n(yiy^i)2\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}
MAE1ni=1nyiy^i\frac{1}{n}\sum_{i=1}^{n}\|y_i - \hat{y}_i\|
R2R^21(yiy^i)2(yiyˉ)21 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2}

Cross-Validation

K-Fold Cross-Validation: Split data into kk folds. Train on k1k-1 folds, validate on the remaining fold. Rotate and average:

CV(k)=1ki=1kScorei\text{CV}(k) = \frac{1}{k} \sum_{i=1}^{k} \text{Score}_i

This provides a more robust estimate of model performance than a single train-test split.


Overfitting and Underfitting

The central challenge in ML is finding the right balance between model complexity and generalization:

Overfitting vs Underfitting vs Good FitUnderfittingModel Complexity →High Bias, Low VarianceModel too simpleGood Fit (Generalizes)Model Complexity →Balanced Bias-VarianceCaptures true patternOverfittingModel Complexity →Low Bias, High VarianceMemorizes noise

The Bias-Variance Tradeoff

The expected prediction error can be decomposed:

Error=Bias2+Variance+Irreducible Noise\text{Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Noise}
  • Bias: Error from erroneous assumptions in the learning algorithm. High bias → underfitting.
  • Variance: Error from sensitivity to fluctuations in training data. High variance → overfitting.
  • Irreducible Noise: Inherent randomness in the data that no model can eliminate.

Preventing Overfitting

TechniqueDescription
Cross-validationBetter estimation of generalization performance
Regularization (L1/L2)Penalize large weights: λθ1\lambda\|\|\theta\|\|_1 or λθ22\lambda\|\|\theta\|\|_2^2
Early stoppingStop training when validation error starts increasing
DropoutRandomly deactivate neurons during training
Data augmentationArtificially increase training set size
Feature selectionUse fewer, more relevant features
Ensemble methodsCombine multiple models (bagging, boosting)

Model Complexity vs Error Tradeoff

Model Complexity vs Error TradeoffModel Complexity →Prediction Error →Optimal ComplexityUnderfittingHigh BiasOverfittingHigh VarianceSweet SpotTraining ErrorValidation Error

Key Observations:

  • Training error decreases monotonically as model complexity increases (model memorizes training data).
  • Validation error decreases initially, reaches a minimum, then increases (overfitting begins).
  • The gap between training and validation error indicates the degree of overfitting.
  • The optimal model minimizes validation error, not training error.

No Free Lunch Theorem

The No Free Lunch (NFL) Theorem (Wolpert, 1996) states that no single machine learning algorithm is universally superior to all others across all possible problems.

Formal Statement:

For any two learning algorithms aa and bb:

PPPerformance(a,P)=PPPerformance(b,P)\sum_{P \in \mathcal{P}} \text{Performance}(a, P) = \sum_{P \in \mathcal{P}} \text{Performance}(b, P)

averaged over all possible problems P\mathcal{P} (all possible data-generating distributions), every algorithm performs equally well.

Practical Implications:

  1. No universally best algorithm — Choosing the right model requires understanding the problem domain, data characteristics, and constraints.
  2. The "No Free Lunch" principle — Every algorithm has inductive biases (assumptions about the data). An algorithm that works well on one class of problems may fail on another.
  3. Algorithm selection — Consider:
Problem CharacteristicPreferred Approaches
Large dataset, many featuresNeural Networks, Gradient Boosting
Small dataset, interpretability neededLogistic/Linear Regression, Decision Trees
High-dimensional dataSVM with RBF kernel, PCA preprocessing
Time seriesLSTM, ARIMA, Prophet
Non-linear relationshipsKernel methods, Random Forests, Neural Nets
  1. Ensemble methods work well in practice because they combine multiple algorithms, reducing the impact of individual algorithm weaknesses.

Wisdom from Practice:

"All models are wrong, but some are useful." — George Box

"The question is not whether a model is 'true,' but whether it is useful for the task at hand."


Summary

ConceptKey Takeaway
ML DefinitionAlgorithms that improve performance through experience (Mitchell, 1997)
Supervised LearningLearns from labeled data → Classification or Regression
Unsupervised LearningDiscovers structure in unlabeled data → Clustering, Dim. Reduction, Association
Semi-supervisedCombines small labeled + large unlabeled datasets
Reinforcement LearningAgent learns via rewards from environment interactions
ML WorkflowData → Preprocess → Features → Train → Evaluate → Deploy
EvaluationUse appropriate metrics (accuracy, F1, MSE, R2R^2) and cross-validation
OverfittingModel memorizes noise; high variance, low bias
UnderfittingModel too simple; low variance, high bias
NFL TheoremNo single algorithm works best for all problems — choose based on context

Further Reading

  • Mitchell, T. (1997). Machine Learning. McGraw Hill.
  • Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
  • Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning. Springer.
  • Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press.

Advertisement

Need Expert Data Science Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement