What is Machine Learning?
Machine Learning is a field of artificial intelligence that gives computers the ability to learn from data without being explicitly programmed. Rather than following rigid rules, ML algorithms identify patterns in data and improve their performance over time.
Tom Mitchell's Formal Definition (1997):
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
Mathematical Formulation:
For a computer program to be "learning," we require:
$$P(T, E_2) > P(T, E_1)$$
where $E_2$ contains more or better experience than $E_1$.
Example — Email Spam Filter:
- Task T: Classify emails as spam or not spam
- Experience E: Labeled dataset of emails (spam/not spam)
- Performance P: Classification accuracy (fraction of correctly classified emails)
As the system processes more emails, its accuracy improves — it learns.
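The T/E/P framing can be made concrete with a toy program. The sketch below is only an illustration under stated assumptions: the word-count classifier, the six training emails, and the four held-out emails are all invented for this example and are not a realistic spam filter. It measures P (accuracy) on the same held-out set after training on a small and then a larger experience E.

```python
from collections import Counter

def train(emails):
    """Count how often each word appears in spam vs. non-spam emails."""
    spam_words, ham_words = Counter(), Counter()
    for text, is_spam in emails:
        target = spam_words if is_spam else ham_words
        target.update(text.lower().split())
    return spam_words, ham_words

def predict(model, text):
    """Predict spam if the email's words were seen more often in spam."""
    spam_words, ham_words = model
    score = sum(spam_words[w] - ham_words[w] for w in text.lower().split())
    return score > 0

def accuracy(model, emails):
    """Performance measure P: fraction of correctly classified emails."""
    correct = sum(predict(model, text) == label for text, label in emails)
    return correct / len(emails)

# Experience E: hypothetical labeled emails (True = spam)
experience = [
    ("win money now", True),
    ("meeting at noon", False),
    ("free prize claim now", True),
    ("lunch with the team", False),
    ("claim your free money", True),
    ("project report attached", False),
]
# Hypothetical held-out emails used only to measure P
held_out = [
    ("free prize inside", True),
    ("team meeting report", False),
    ("claim your money", True),
    ("lunch now with the team", False),
]

acc_small = accuracy(train(experience[:2]), held_out)  # trained on 2 emails
acc_large = accuracy(train(experience), held_out)      # trained on all 6
print(acc_small, acc_large)
```

On this hand-picked data the accuracy rises from 0.5 to 1.0 as the experience grows. Real data gives no such guarantee for any single run; the definition only asks that performance improve with experience.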
ML Algorithm Taxonomy
Machine learning algorithms are broadly categorized based on the nature of the learning signal or feedback available.
Types of Machine Learning
1. Supervised Learning
In supervised learning, the algorithm learns from labeled data: each training example consists of an input vector $\mathbf{x}_i$ and a corresponding target output $y_i$. The goal is to learn a mapping function $f: \mathbf{x} \mapsto y$.
Two primary sub-tasks:
Classification: Predicting a discrete class label, $y \in \{1, 2, \dots, K\}$.
Examples: Email spam detection (spam/not spam), image recognition (cat/dog/bird), medical diagnosis (malignant/benign).
Regression: Predicting a continuous value, $y \in \mathbb{R}$.
Examples: House price prediction, temperature forecasting, stock price estimation.
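To illustrate the two sub-tasks, here is a minimal sketch using only the standard library. The choice of a least-squares line for regression and a 1-nearest-neighbor rule for classification is an assumption made for brevity, and the house sizes, prices, and 2-D feature points are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit of y ~ slope * x + intercept for 1-D inputs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    return slope, mean_y - slope * mean_x

def nearest_neighbor_label(train_points, query):
    """Return the label of the training point closest to the query."""
    def sq_dist(p):
        return sum((a - b) ** 2 for a, b in zip(p, query))
    _, label = min(train_points, key=lambda pair: sq_dist(pair[0]))
    return label

# Regression: hypothetical sizes (m^2) and prices (thousands), price = 3 * size
sizes = [50.0, 80.0, 100.0, 120.0]
prices = [150.0, 240.0, 300.0, 360.0]
slope, intercept = fit_line(sizes, prices)
predicted_price = slope * 90.0 + intercept  # continuous output

# Classification: hypothetical 2-D feature vectors with discrete labels
labeled = [
    ((1.0, 1.0), "cat"), ((1.2, 0.8), "cat"),
    ((5.0, 5.0), "dog"), ((4.8, 5.2), "dog"),
]
predicted_label = nearest_neighbor_label(labeled, (4.5, 4.9))  # discrete output
print(predicted_price, predicted_label)
```

Both functions learn a mapping from inputs to targets using labeled pairs; only the type of the target differs.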
2. Unsupervised Learning
In unsupervised learning, the algorithm receives unlabeled data and must discover hidden patterns, structures, or relationships on its own.
Key sub-tasks:
Clustering — Grouping similar data points together:
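K-Means Objective (minimize within-cluster variance):
$$J = \sum_{k=1}^{K} \sum_{\mathbf{x}_i \in C_k} \lVert \mathbf{x}_i - \boldsymbol{\mu}_k \rVert^2$$
where $\boldsymbol{\mu}_k$ is the centroid of cluster $C_k$.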
Examples: Customer segmentation, document grouping, image segmentation.
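One common way to reduce the within-cluster variance is Lloyd's algorithm, which alternates between assigning points to the nearest centroid and moving each centroid to the mean of its points. The sketch below assumes caller-supplied initial centroids and a fixed number of iterations; the six 2-D points are hypothetical.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def kmeans(points, centroids, n_iters=10):
    """Lloyd's algorithm; returns final centroids and the last assignment."""
    clusters = []
    for _ in range(n_iters):
        # Assignment step: attach each point to its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda k: sq_dist(p, centroids[k]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid)
        centroids = [mean_point(c) if c else centroids[k]
                     for k, c in enumerate(clusters)]
    return centroids, clusters

def within_cluster_ss(centroids, clusters):
    """K-means objective: total squared distance to assigned centroids."""
    return sum(sq_dist(p, mu) for mu, c in zip(centroids, clusters) for p in c)

points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
objective = within_cluster_ss(centroids, clusters)
print(centroids, objective)
```

The result depends on the initial centroids, so practical implementations usually run several random restarts and keep the solution with the lowest objective.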
Dimensionality Reduction — Reducing the number of features while preserving important structure:
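PCA Objective: Find projection matrix $\mathbf{W}$ that maximizes variance:
$$\mathbf{W}^* = \arg\max_{\mathbf{W}} \operatorname{tr}\left(\mathbf{W}^\top \boldsymbol{\Sigma} \mathbf{W}\right) \quad \text{subject to} \quad \mathbf{W}^\top \mathbf{W} = \mathbf{I}$$
where $\boldsymbol{\Sigma}$ is the covariance matrix.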
Examples: Visualization of high-dimensional data, noise reduction, feature extraction.
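For a single projection direction, maximizing the projected variance amounts to finding the leading eigenvector of the covariance matrix. The sketch below uses power iteration for that, which is one simple option rather than what numerical libraries typically do. The four data points are hypothetical and lie exactly on the line y = 2x, so the expected direction is known in advance.

```python
import math

def covariance_matrix(data):
    """Sample covariance matrix (divides by n - 1) of row-wise data."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in data]
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def first_principal_component(cov, n_iters=100):
    """Power iteration: repeatedly apply cov and renormalize.

    Assumes the covariance matrix is non-zero and the start vector is not
    orthogonal to the leading eigenvector.
    """
    d = len(cov)
    w = [1.0] * d
    for _ in range(n_iters):
        w = [sum(cov[i][j] * w[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
    # Variance of the data projected onto w, i.e. w^T cov w
    variance = sum(w[i] * cov[i][j] * w[j] for i in range(d) for j in range(d))
    return w, variance

data = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # points on y = 2x
cov = covariance_matrix(data)
w, variance = first_principal_component(cov)
print(w, variance)
```

The recovered direction is proportional to (1, 2), and projecting onto it keeps all of the variance in this degenerate example.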
Association — Discovering interesting relationships between variables:
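- Support: $\text{supp}(A \Rightarrow B) = P(A \cap B)$, how frequently items appear together
- Confidence: $\text{conf}(A \Rightarrow B) = P(B \mid A) = \frac{P(A \cap B)}{P(A)}$, how often $B$ appears when $A$ is present
- Lift: $\text{lift}(A \Rightarrow B) = \frac{P(A \cap B)}{P(A)\,P(B)}$, strength of association beyond chance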
Examples: Market basket analysis ("customers who buy X also buy Y"), recommendation systems.
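The three association measures can be computed directly from transaction counts. The sketch below assumes transactions are represented as Python sets; the five shopping baskets are hypothetical.

```python
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk", "eggs"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent)."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

def lift(antecedent, consequent, transactions):
    """Confidence relative to how common the consequent is overall."""
    return (confidence(antecedent, consequent, transactions)
            / support(consequent, transactions))

rule = ({"bread"}, {"milk"})
s = support(rule[0] | rule[1], transactions)   # 3 of 5 baskets
c = confidence(*rule, transactions)            # about 0.75
l = lift(*rule, transactions)                  # below 1 in this data
print(s, c, l)
```

A lift below 1, as here, means bread buyers are slightly less likely to buy milk than shoppers overall, so the rule carries no positive association despite its fairly high confidence.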
3. Semi-supervised Learning
Semi-supervised learning combines a small amount of labeled data with a large pool of unlabeled data. This is practical because labeled data is often expensive to obtain.
Core Assumption: If two data points are close in feature space, they likely share the same label (smoothness assumption).
Key Approaches:
| Method | Description |
|---|---|
| Self-training | Model labels unlabeled data and retrains |
| Co-training | Two views of data train separate classifiers |
| Label Propagation | Graph-based spreading of labels |
| MixMatch | Combines consistency regularization + entropy minimization |
Applications: Medical imaging (limited expert annotations), NLP with large unlabeled corpora.
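Self-training from the table above can be sketched in a few lines. This is a simplified illustration, not a reference implementation: it assumes 1-D features, a nearest-centroid classifier, and a hypothetical `margin` rule for deciding when a pseudo-label is confident enough to keep. The data values are invented.

```python
def class_centroids(labeled):
    """Mean feature value per class for 1-D labeled data."""
    sums, counts = {}, {}
    for x, y in labeled:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def self_train(labeled, unlabeled, margin=1.0, max_rounds=10):
    """Iteratively pseudo-label confident points and refit the centroids.

    Assumes at least two classes are present in `labeled`.
    """
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(max_rounds):
        cents = class_centroids(labeled)
        confident, remaining = [], []
        for x in unlabeled:
            dists = sorted((abs(x - c), y) for y, c in cents.items())
            # Confident when the nearest centroid beats the runner-up by `margin`
            if dists[1][0] - dists[0][0] >= margin:
                confident.append((x, dists[0][1]))
            else:
                remaining.append(x)
        if not confident:
            break  # nothing new to learn from
        labeled.extend(confident)
        unlabeled = remaining
    return class_centroids(labeled), unlabeled

labeled = [(1.0, "a"), (9.0, "b")]            # small labeled set
unlabeled = [1.5, 2.0, 8.0, 8.5, 5.2]         # larger unlabeled pool
final_centroids, leftover = self_train(labeled, unlabeled)
print(final_centroids, leftover)
```

Points near an existing class are absorbed, which relies on the smoothness assumption, while the ambiguous point at 5.2 stays unlabeled because neither centroid is clearly closer.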
4. Reinforcement Learning
Reinforcement learning involves an agent learning to make sequential decisions by interacting with an environment to maximize cumulative reward.
Markov Decision Process (MDP):
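An MDP is defined by the tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$:
- $\mathcal{S}$: Set of states
- $\mathcal{A}$: Set of actions
- $P(s' \mid s, a)$: Transition probability
- $R(s, a)$: Reward function
- $\gamma \in [0, 1]$: Discount factor
Objective (Maximize Expected Cumulative Reward):
$$\pi^* = \arg\max_{\pi} \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t)\right]$$
Bellman Equation (Optimal Value Function):
$$V^*(s) = \max_{a \in \mathcal{A}} \left[ R(s, a) + \gamma \sum_{s' \in \mathcal{S}} P(s' \mid s, a)\, V^*(s') \right]$$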
Applications: Game playing (AlphaGo, Atari), robotics, autonomous driving, resource management.
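When the transition probabilities and rewards are known, the optimal value function can be approximated by value iteration, which applies the Bellman optimality backup repeatedly until the values stop changing. The sketch below uses a hypothetical three-state MDP invented for illustration. It attaches a reward to each transition, so the expected reward of an action is the probability-weighted average of those transition rewards.

```python
GAMMA = 0.9  # discount factor (assumed value)

# Hypothetical MDP: P[state][action] = list of (probability, next_state, reward)
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(0.8, 1, 0.0), (0.2, 0, 0.0)]},
    1: {"stay": [(1.0, 1, 0.0)], "go": [(0.8, 2, 1.0), (0.2, 1, 0.0)]},
    2: {"stay": [(1.0, 2, 0.0)], "go": [(1.0, 2, 0.0)]},  # absorbing goal state
}

def action_value(V, state, action):
    """Expected immediate reward plus discounted value of the next state."""
    return sum(prob * (reward + GAMMA * V[nxt])
               for prob, nxt, reward in P[state][action])

def value_iteration(n_sweeps=200):
    """Apply the Bellman optimality backup to every state, n_sweeps times."""
    V = {s: 0.0 for s in P}
    for _ in range(n_sweeps):
        V = {s: max(action_value(V, s, a) for a in P[s]) for s in P}
    return V

def greedy_policy(V):
    """Pick, in each state, the action with the highest backed-up value."""
    return {s: max(P[s], key=lambda a: action_value(V, s, a)) for s in P}

V = value_iteration()
policy = greedy_policy(V)
print(V, policy)
```

Value iteration requires a known model of the environment. Most of the applications listed above instead learn from sampled interaction, which this sketch does not cover.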
The ML Workflow
The machine learning workflow is a systematic pipeline from raw data to deployed model.
Step-by-Step Breakdown
1. Data Collection
Gather relevant data from databases, APIs, sensors, logs, or web scraping. Data quality determines model quality — the principle of "garbage in, garbage out" applies.
2. Data Preprocessing
- Handle missing values (imputation, deletion)
- Remove duplicates and outliers
- Normalize/standardize features, e.g. z-score standardization: $x' = \frac{x - \mu}{\sigma}$
- Encode categorical variables (one-hot, label encoding)
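Two of the preprocessing steps listed above, standardization and one-hot encoding, can be sketched with the standard library alone. The population standard deviation and the alphabetical column order are choices made for this example, and the input values are hypothetical.

```python
import math

def standardize(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]

def one_hot(categories):
    """Encode each category as a 0/1 vector; columns are sorted category names."""
    vocab = sorted(set(categories))
    encoded = [[1 if c == v else 0 for v in vocab] for c in categories]
    return encoded, vocab

scaled = standardize([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # mean 5, std 2
encoded, vocab = one_hot(["red", "green", "red", "blue"])
print(scaled)
print(vocab, encoded)
```

In a real pipeline the mean, standard deviation, and category vocabulary should be computed on the training split only and then reused on validation and test data, to avoid leaking information.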
3. Feature Engineering
- Select relevant features (feature selection)
- Create new features from existing ones
- Reduce dimensionality (PCA, feature hashing)
- Domain-specific transformations
4. Model Training
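Split data into training, validation, and test sets (typically 70/15/15 or 80/10/10):
$$\mathcal{D} = \mathcal{D}_{\text{train}} \cup \mathcal{D}_{\text{val}} \cup \mathcal{D}_{\text{test}}$$
Select algorithm and hyperparameters, then minimize loss:
$$\theta^* = \arg\min_{\theta} \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}\left(f_\theta(\mathbf{x}_i), y_i\right)$$
where the sum runs over the $n$ training examples.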
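The split-then-minimize step can be sketched as follows. The example assumes a 70/15/15 split, a one-parameter model y ~ w * x, mean squared error as the loss, and plain gradient descent with a hand-picked learning rate. The data is synthetic and noiseless.

```python
import random

def train_val_test_split(data, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle once with a fixed seed, then slice into three disjoint parts."""
    items = list(data)
    random.Random(seed).shuffle(items)
    n_test = round(len(items) * test_frac)
    n_val = round(len(items) * val_frac)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test

def mse(w, pairs):
    """Mean squared error of the model y ~ w * x on (x, y) pairs."""
    return sum((w * x - y) ** 2 for x, y in pairs) / len(pairs)

def fit_slope(pairs, lr=0.1, n_steps=500):
    """Gradient descent on the training MSE, starting from w = 0."""
    w = 0.0
    for _ in range(n_steps):
        grad = sum(2 * (w * x - y) * x for x, y in pairs) / len(pairs)
        w -= lr * grad
    return w

data = [(x / 10, 2.0 * x / 10) for x in range(1, 21)]  # synthetic y = 2x
train, val, test = train_val_test_split(data)
w = fit_slope(train)
val_error = mse(w, val)    # used for model and hyperparameter choices
test_error = mse(w, test)  # reported once, after all choices are made
print(len(train), len(val), len(test), w)
```

Keeping the test set out of every training and tuning decision is what makes its error an honest estimate of generalization.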
5. Model Evaluation
Assess generalization on unseen data using appropriate metrics (covered below).
6. Deployment and Monitoring
Deploy to production, monitor for data drift, retrain periodically.
Model Selection and Evaluation
Common Evaluation Metrics
For Classification:
| Metric | Formula | Interpretation |
|---|---|---|
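| Accuracy | $\frac{TP + TN}{TP + TN + FP + FN}$ | Overall correctness |
| Precision | $\frac{TP}{TP + FP}$ | Of predicted positives, how many are correct |
| Recall | $\frac{TP}{TP + FN}$ | Of actual positives, how many were found |
| F1 Score | $\frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$ | Harmonic mean of precision and recall |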
| AUC-ROC | Area under ROC curve | Discrimination ability across thresholds |
Confusion Matrix:
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | TP (True Positive) | FN (False Negative) |
| Actual Negative | FP (False Positive) | TN (True Negative) |
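The threshold-based classification metrics follow directly from the four confusion-matrix counts. The sketch below assumes binary labels with 1 as the positive class and returns 0.0 when a denominator is zero, which is one common convention. The label vectors are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, FN, TN) for binary labels."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 from the confusion counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
counts = confusion_counts(y_true, y_pred)   # TP=3, FP=1, FN=1, TN=5
metrics = classification_metrics(y_true, y_pred)
print(counts, metrics)
```

AUC-ROC is omitted here because it needs predicted scores rather than hard labels.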
For Regression:
| Metric | Formula |
|---|---|
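| MSE | $\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ |
| RMSE | $\sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$ |
| MAE | $\frac{1}{n} \sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert$ |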
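The three regression metrics differ only in how they aggregate the per-example errors, as this short sketch shows. The target and prediction values are hypothetical.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, and MAE for paired targets and predictions."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / len(errors)
    mae = sum(abs(e) for e in errors) / len(errors)
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae}

# Errors are 1, 0, -1, -2, so MSE = 6/4 and MAE = 4/4
reg = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.0, 5.0, 8.0, 11.0])
print(reg)
```

Because MSE squares each error, the single error of size 2 contributes more to it than the two errors of size 1 combined, whereas MAE weights all errors linearly.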
Cross-Validation
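K-Fold Cross-Validation: Split data into $K$ folds. Train on $K-1$ folds, validate on the remaining fold. Rotate and average:
$$\text{CV}_K = \frac{1}{K} \sum_{k=1}^{K} \text{Score}_k$$
where $\text{Score}_k$ is the metric measured on fold $k$.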
This provides a more robust estimate of model performance than a single train-test split.
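The rotation can be sketched as a generator of index splits plus a loop that averages the per-fold scores. This version assumes contiguous, unshuffled folds for readability; in practice the data is usually shuffled first. The "model" is a hypothetical mean predictor scored by MSE, chosen only to keep the example short.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; fold sizes differ by at most one."""
    indices = list(range(n_samples))
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

def cross_val_score(ys, k, fit, score):
    """Average the validation score over the k folds."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(ys), k):
        model = fit([ys[i] for i in train_idx])
        scores.append(score(model, [ys[i] for i in val_idx]))
    return sum(scores) / len(scores)

def fit_mean(train_ys):
    """Hypothetical model: always predict the training mean."""
    return sum(train_ys) / len(train_ys)

def mse_of_constant(prediction, val_ys):
    """MSE of a constant prediction on the validation targets."""
    return sum((y - prediction) ** 2 for y in val_ys) / len(val_ys)

folds = list(k_fold_indices(7, 3))
cv_mse = cross_val_score([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], 3,
                         fit_mean, mse_of_constant)
print(folds, cv_mse)
```

Every example is used for validation exactly once, which is why the averaged score is less sensitive to one lucky or unlucky split.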
Overfitting and Underfitting
The central challenge in ML is finding the right balance between model complexity and generalization.
The Bias-Variance Tradeoff
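The expected prediction error can be decomposed:
$$\mathbb{E}\left[(y - \hat{f}(x))^2\right] = \text{Bias}\left[\hat{f}(x)\right]^2 + \text{Var}\left[\hat{f}(x)\right] + \sigma^2$$
with $\sigma^2$ denoting the variance of the irreducible noise.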
- Bias: Error from erroneous assumptions in the learning algorithm. High bias → underfitting.
- Variance: Error from sensitivity to fluctuations in training data. High variance → overfitting.
- Irreducible Noise: Inherent randomness in the data that no model can eliminate.
Preventing Overfitting
| Technique | Description |
|---|---|
| Cross-validation | Better estimation of generalization performance |
| Regularization (L1/L2) | Penalize large weights: $\lambda \lVert \mathbf{w} \rVert_1$ or $\lambda \lVert \mathbf{w} \rVert_2^2$ |
| Early stopping | Stop training when validation error starts increasing |
| Dropout | Randomly deactivate neurons during training |
| Data augmentation | Artificially increase training set size |
| Feature selection | Use fewer, more relevant features |
| Ensemble methods | Combine multiple models (bagging, boosting) |
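The effect of an L2 penalty is easiest to see in a one-parameter model, where the regularized solution has a closed form. The sketch below assumes a no-intercept model y ~ w * x and the penalized objective sum((w*x - y)^2) + lam * w^2; the data and the values of `lam` are hypothetical.

```python
def ridge_slope(xs, ys, lam):
    """Minimizer of sum((w*x - y)^2) + lam * w^2 over w.

    Setting the derivative to zero gives w = sum(x*y) / (sum(x^2) + lam).
    """
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]   # exactly y = 2x
w_unregularized = ridge_slope(xs, ys, lam=0.0)
w_moderate = ridge_slope(xs, ys, lam=14.0)
w_strong = ridge_slope(xs, ys, lam=126.0)
print(w_unregularized, w_moderate, w_strong)
```

Larger values of `lam` shrink the weight toward zero, trading a little bias for lower variance. The penalty strength is normally chosen on validation data.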
Model Complexity vs Error Tradeoff
Key Observations:
- Training error decreases monotonically as model complexity increases (model memorizes training data).
- Validation error decreases initially, reaches a minimum, then increases (overfitting begins).
- The gap between training and validation error indicates the degree of overfitting.
- The optimal model minimizes validation error, not training error.
No Free Lunch Theorem
The No Free Lunch (NFL) Theorem (Wolpert, 1996) states that no single machine learning algorithm is universally superior to all others across all possible problems.
Formal Statement:
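For any two learning algorithms $A_1$ and $A_2$:
$$\sum_{f} \mathbb{E}\left[\text{Error}(A_1 \mid f)\right] = \sum_{f} \mathbb{E}\left[\text{Error}(A_2 \mid f)\right]$$
That is, averaged over all possible problems $f$ (all possible data-generating distributions), every algorithm performs equally well.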
Practical Implications:
- No universally best algorithm — Choosing the right model requires understanding the problem domain, data characteristics, and constraints.
- The "No Free Lunch" principle — Every algorithm has inductive biases (assumptions about the data). An algorithm that works well on one class of problems may fail on another.
- Algorithm selection — Consider:
| Problem Characteristic | Preferred Approaches |
|---|---|
| Large dataset, many features | Neural Networks, Gradient Boosting |
| Small dataset, interpretability needed | Logistic/Linear Regression, Decision Trees |
| High-dimensional data | SVM with RBF kernel, PCA preprocessing |
| Time series | LSTM, ARIMA, Prophet |
| Non-linear relationships | Kernel methods, Random Forests, Neural Nets |
- Ensemble methods work well in practice because they combine multiple algorithms, reducing the impact of individual algorithm weaknesses.
Wisdom from Practice:
"All models are wrong, but some are useful." — George Box
"The question is not whether a model is 'true,' but whether it is useful for the task at hand."
Summary
| Concept | Key Takeaway |
|---|---|
| ML Definition | Algorithms that improve performance through experience (Mitchell, 1997) |
| Supervised Learning | Learns from labeled data → Classification or Regression |
| Unsupervised Learning | Discovers structure in unlabeled data → Clustering, Dim. Reduction, Association |
| Semi-supervised | Combines small labeled + large unlabeled datasets |
| Reinforcement Learning | Agent learns via rewards from environment interactions |
| ML Workflow | Data → Preprocess → Features → Train → Evaluate → Deploy |
| Evaluation | Use appropriate metrics (accuracy, F1, MSE) and cross-validation |
| Overfitting | Model memorizes noise; high variance, low bias |
| Underfitting | Model too simple; low variance, high bias |
| NFL Theorem | No single algorithm works best for all problems — choose based on context |
Further Reading
- Mitchell, T. (1997). Machine Learning. McGraw Hill.
- Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
- Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning. Springer.
- Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press.