# What Is Machine Learning?
Machine Learning (ML) is a branch of artificial intelligence where systems learn patterns from data rather than following hand-coded rules. Instead of programming every decision, you give the algorithm examples and it discovers the underlying logic itself.
**Core idea:** A machine learning model learns a function $f$ that maps inputs $X$ to outputs $y$, with its parameters estimated from data rather than written by hand.
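As a minimal illustration of learning $f$ from examples, the sketch below fits a straight line to a few (x, y) pairs with the closed-form least-squares solution. The data points and the helper names `fit_line` and `predict` are made up for illustration; in practice a library such as scikit-learn would normally do the fitting.

```python
def fit_line(xs, ys):
    """Return slope w and intercept b that minimize squared error for y ~ w*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    w = cov_xy / var_x
    b = mean_y - w * mean_x
    return w, b


# Hypothetical training examples: house size (in 100 m^2) and price (in $100k)
sizes = [0.5, 1.0, 1.5, 2.0]
prices = [1.1, 2.0, 3.1, 3.9]

w, b = fit_line(sizes, prices)


def predict(x):
    """The learned function f: its parameters came from the data, not from hand-written rules."""
    return w * x + b


print(f"f(x) = {w:.2f} * x + {b:.2f}")
print(f"Predicted price for size 1.2: {predict(1.2):.2f}")
```

Nothing in `predict` encodes a pricing rule; changing the training pairs changes the function.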
## Real-World Examples
| Application | Input (X) | Output (y) | ML Type |
|---|---|---|---|
| Email spam filter | Email text | Spam / Not Spam | Classification |
| House price prediction | Size, location, rooms | Price ($) | Regression |
| Netflix recommendations | Watch history | Movie suggestions | Collaborative filtering |
| Fraud detection | Transaction data | Fraud / Legit | Anomaly detection |
| ChatGPT | Text prompt | Text response | Generative AI |
## The Three Types of Machine Learning

```text
Machine Learning
├── Supervised Learning          ← learns from labeled data (X, y pairs)
│   ├── Classification           ← predicts categories (spam/not spam)
│   └── Regression               ← predicts numbers (house price)
├── Unsupervised Learning        ← finds patterns in unlabeled data
│   ├── Clustering               ← groups similar items (customer segments)
│   └── Dimensionality Reduction ← compresses features (PCA)
└── Reinforcement Learning       ← learns by reward/penalty (game AI)
```
### 1. Supervised Learning
The algorithm learns a mapping from inputs to outputs using labeled training examples.
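Training objective: choose the $f$ that minimizes the average loss $\mathcal{L}$ over the $n$ training examples:

$$\hat{f} = \arg\min_{f} \frac{1}{n}\sum_{i=1}^{n} \mathcal{L}\big(y_i, f(x_i)\big)$$

Here $\hat{y}_i = f(x_i)$ is the model's prediction for example $i$.

| Loss Function | Used For | Formula |
|---|---|---|
| MSE | Regression | $\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$ |
| Cross-Entropy | Classification | $-\frac{1}{n}\sum_{i=1}^{n}\left[y_i\log\hat{y}_i + (1-y_i)\log(1-\hat{y}_i)\right]$ (binary case, $\hat{y}_i$ a predicted probability) |
| Hinge | SVM | $\frac{1}{n}\sum_{i=1}^{n}\max(0,\ 1 - y_i\hat{y}_i)$ (labels $y_i \in \{-1, +1\}$, $\hat{y}_i$ a raw score) |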
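The three losses named in the table can be written in a few lines of plain Python. This is a sketch under stated assumptions: cross-entropy is shown in its binary form with labels in {0, 1} and predicted probabilities, hinge loss uses labels in {-1, +1} with raw scores, and the function names and sample values are illustrative.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error for regression targets."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Average log loss; y_true in {0, 1}, p_pred is the predicted P(y = 1)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip so log(0) never occurs
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


def hinge(y_true, scores):
    """Average hinge loss; y_true in {-1, +1}, scores are raw model outputs."""
    return sum(max(0.0, 1 - y * s) for y, s in zip(y_true, scores)) / len(y_true)


print(mse([3.0, 5.0], [2.5, 5.5]))               # regression
print(binary_cross_entropy([1, 0], [0.9, 0.2]))  # classification
print(hinge([1, -1], [0.8, -2.0]))               # SVM-style margin loss
```

In the last call the second example contributes zero loss because it is on the correct side of the margin; only the first (score 0.8, inside the margin) is penalized.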
### 2. Unsupervised Learning
No labels — the algorithm discovers structure in raw data.
K-Means objective:

$$J = \sum_{k=1}^{K}\sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2$$

where $\mu_k$ is the centroid of cluster $C_k$.
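To make the K-Means objective concrete, the sketch below alternates the two standard steps (assign each point to its nearest centroid, then move each centroid to the mean of its points) and prints the within-cluster sum of squared distances after each round. The points, the starting centroids, and the helper names are hypothetical; a library implementation such as scikit-learn's `KMeans` adds smarter initialization and stopping rules.

```python
def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def assign(points, centroids):
    """Index of the nearest centroid for each point."""
    return [
        min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
        for p in points
    ]


def update(points, labels, centroids):
    """Move each centroid to the mean of its assigned points (unchanged if empty)."""
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [p for p, lab in zip(points, labels) if lab == k]
        if members:
            new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
        else:
            new_centroids.append(old)
    return new_centroids


def objective(points, labels, centroids):
    """Within-cluster sum of squared distances, the quantity K-Means minimizes."""
    return sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids = [(0.0, 0.0), (10.0, 10.0)]  # hypothetical starting guesses

for step in range(3):
    labels = assign(points, centroids)
    centroids = update(points, labels, centroids)
    print(step, round(objective(points, labels, centroids), 4))
```

Each assign/update round can only keep the objective the same or lower it, which is why the loop converges.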
### 3. Reinforcement Learning
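An agent takes actions in an environment and learns to maximize cumulative reward.

A standard example is the tabular Q-learning update:

$$Q(s, a) \leftarrow Q(s, a) + \alpha\left[r + \gamma \max_{a'} Q(s', a') - Q(s, a)\right]$$

| Symbol | Meaning |
|---|---|
| $s$, $s'$ | Current and next state |
| $a$ | Action taken |
| $r$ | Reward received |
| $\alpha$ | Learning rate |
| $\gamma$ | Discount factor (0–1) |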
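The symbols in the table are those of a temporal-difference update such as tabular Q-learning. Assuming that rule, here is a minimal sketch of a single update step applied to a made-up three-state corridor. The environment, the replayed transitions, and the values of `alpha` and `gamma` are all illustrative assumptions.

```python
from collections import defaultdict


def q_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
    """One tabular Q-learning step:
    Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
    """
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    td_target = r + gamma * best_next
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
    return Q[(s, a)]


# Hypothetical corridor: states 0 -> 1 -> 2, reward 1.0 for reaching state 2.
actions = ["left", "right"]
Q = defaultdict(float)  # unseen (state, action) pairs start at 0.0

# Replay the same two transitions (s, a, r, s') a few times.
episode = [(0, "right", 0.0, 1), (1, "right", 1.0, 2)]
for _ in range(3):
    for s, a, r, s_next in episode:
        q_update(Q, s, a, r, s_next, actions)

print({k: round(v, 4) for k, v in Q.items() if v})
```

The reward is only received at the last step, yet the value of moving right from state 0 also rises over the replays: the discount factor propagates future reward backwards one step at a time.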
## The Complete ML Workflow

```text
 1. Define Problem       → What are you predicting? What metric matters?
        ↓
 2. Collect Data         → Gather raw data from databases, APIs, files
        ↓
 3. Explore (EDA)        → Distributions, correlations, missing values
        ↓
 4. Preprocess           → Clean, encode, scale, split train/test
        ↓
 5. Feature Engineering  → Create new features, select important ones
        ↓
 6. Choose Model         → Linear, tree-based, neural network?
        ↓
 7. Train                → Fit model on training data
        ↓
 8. Evaluate             → Accuracy, RMSE, F1 on a held-out test set
        ↓
 9. Tune                 → Grid search or Bayesian hyperparameter search
                           (score candidates with cross-validation, not the test set)
        ↓
10. Deploy               → REST API, batch job, or embedded system
        ↓
11. Monitor              → Track data drift and model degradation
```
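Step 4 hides a common pitfall: scaling statistics should be learned from the training split only and then reused on the test split, otherwise information about the test data leaks into training. The sketch below shows the idea in plain Python; the feature values and the helper names (`train_test_split_simple`, `fit_scaler`, `transform`) are hypothetical stand-ins for what scikit-learn's `train_test_split` and `StandardScaler` do.

```python
import random
import statistics


def train_test_split_simple(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and hold out the last test_fraction of them."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[:-n_test], shuffled[-n_test:]


def fit_scaler(values):
    """Learn mean and (population) standard deviation from the given values."""
    return statistics.fmean(values), statistics.pstdev(values)


def transform(values, mean, std):
    """Standardize values using previously learned statistics."""
    return [(v - mean) / std for v in values]


# Hypothetical single feature, e.g. house size in m^2
sizes = [50.0, 60.0, 75.0, 80.0, 95.0, 110.0, 120.0, 150.0, 180.0, 200.0]
train, test = train_test_split_simple(sizes)

mean, std = fit_scaler(train)               # statistics come from the training split only
train_scaled = transform(train, mean, std)
test_scaled = transform(test, mean, std)    # the test split reuses them unchanged

print(len(train), len(test))
print(round(statistics.fmean(train_scaled), 6), round(statistics.pstdev(train_scaled), 6))
```

The scaled training data has mean 0 and standard deviation 1 by construction; the scaled test data generally does not, and that is expected.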
## Complete Python Example: End-to-End ML Pipeline

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score,
                             classification_report, confusion_matrix)
import matplotlib.pyplot as plt
import seaborn as sns

# ── 1. Load Data ────────────────────────────────────────────────────
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name='target')  # 0=malignant, 1=benign

print(f"Dataset shape : {X.shape}")
print(f"Class balance : {y.value_counts().to_dict()}")
print(f"\nFeatures:\n{X.describe().round(2)}")

# ── 2. Explore (EDA) ────────────────────────────────────────────────
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Class distribution; sort_index() puts class 0 first so the tick labels
# below line up with the bars (value_counts() alone sorts by frequency)
y.value_counts().sort_index().plot(kind='bar', ax=axes[0],
                                   color=['salmon', 'steelblue'])
axes[0].set_title('Class Distribution')
axes[0].set_xticklabels(['Malignant', 'Benign'], rotation=0)

# Correlation heatmap of the 10 features most correlated with 'mean radius'
top_features = X.corr()['mean radius'].abs().nlargest(10).index
sns.heatmap(X[top_features].corr(), ax=axes[1], cmap='coolwarm',
            annot=True, fmt='.2f', cbar=False)
axes[1].set_title('Top Feature Correlations')
plt.tight_layout()
plt.show()

# ── 3. Preprocess ───────────────────────────────────────────────────
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)  # fit on train only!
X_test_s = scaler.transform(X_test)        # apply same scale to test

# ── 4. Train Multiple Models ─────────────────────────────────────────
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
}

results = {}
for name, model in models.items():
    # 5-fold cross-validation on the training set. The scaler above was fit
    # on the whole training set, so the folds share its statistics; wrapping
    # scaler + model in a sklearn Pipeline gives a stricter estimate.
    cv_scores = cross_val_score(model, X_train_s, y_train, cv=5, scoring='f1')

    model.fit(X_train_s, y_train)
    y_pred = model.predict(X_test_s)

    # precision/recall/F1 treat label 1 (benign) as the positive class
    results[name] = {
        'CV F1 (mean)': cv_scores.mean(),
        'CV F1 (std)': cv_scores.std(),
        'Test Accuracy': accuracy_score(y_test, y_pred),
        'Test Precision': precision_score(y_test, y_pred),
        'Test Recall': recall_score(y_test, y_pred),
        'Test F1': f1_score(y_test, y_pred),
    }

# ── 5. Compare Results ──────────────────────────────────────────────
results_df = pd.DataFrame(results).T.round(4)
print("\n" + "=" * 60)
print("MODEL COMPARISON")
print("=" * 60)
print(results_df.to_string())

# ── 6. Detailed Report for Best Model ──────────────────────────────
best_model = models['Random Forest']  # selected by hand for this walkthrough
y_pred = best_model.predict(X_test_s)

print("\nClassification Report (Random Forest):")
print(classification_report(y_test, y_pred,
                            target_names=['Malignant', 'Benign']))

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Malignant', 'Benign'],
            yticklabels=['Malignant', 'Benign'])
plt.title('Confusion Matrix: Random Forest')
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.tight_layout()
plt.show()

# Feature Importance
importances = pd.Series(
    best_model.feature_importances_,
    index=data.feature_names
).sort_values(ascending=False).head(10)

plt.figure(figsize=(10, 5))
importances.plot(kind='bar', color='steelblue')
plt.title('Top 10 Feature Importances: Random Forest')
plt.ylabel('Importance Score')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
```
Sample output (abridged; column names are shortened here, and exact values can vary with the scikit-learn version):

```text
Dataset shape : (569, 30)
Class balance : {1: 357, 0: 212}

MODEL COMPARISON
                     CV F1   Test Acc  Test Prec  Test Recall  Test F1
Logistic Regression  0.9648  0.9649    0.9655     0.9722       0.9688
Random Forest        0.9622  0.9737    0.9722     0.9861       0.9791
```
---
## Evaluation Metrics Reference
### Classification Metrics
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\text{Precision} = \frac{TP}{TP + FP} \quad \text{(How precise are positive predictions?)}$$
$$\text{Recall} = \frac{TP}{TP + FN} \quad \text{(How many positives did we catch?)}$$
$$F_1 = 2 \cdot \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
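The four formulas translate directly into code once the confusion-matrix counts are known. The counts below are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen for this sketch rather than part of the definitions.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts for a 100-sample test set
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name:>9}: {value:.3f}")
```

With these counts precision is higher than recall: the model is rarely wrong when it predicts the positive class, but it misses one in five actual positives.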
### Regression Metrics
| Metric | Formula | Interpretation |
|---|---|---|
| MAE | $\frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i\rvert$ | Average absolute error (same units as y) |
| RMSE | $\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$ | Penalises large errors more |
| R² | $1 - \frac{SS_{res}}{SS_{tot}}$ | Proportion of variance explained (1 = perfect) |
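A small worked example of the three regression metrics, using made-up targets and predictions. Here $SS_{res}$ is the sum of squared residuals and $SS_{tot}$ is the sum of squared deviations of the targets from their mean; the function name is illustrative.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) for paired lists of targets and predictions."""
    n = len(y_true)
    errors = [y - p for y, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e ** 2 for e in errors)
    rmse = math.sqrt(ss_res / n)
    mean_y = sum(y_true) / n
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    r2 = 1 - ss_res / ss_tot
    return mae, rmse, r2


# Hypothetical targets and predictions
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 8.0, 8.5]

mae, rmse, r2 = regression_metrics(y_true, y_pred)
print(f"MAE={mae:.3f}  RMSE={rmse:.3f}  R2={r2:.3f}")
```

RMSE comes out larger than MAE here because the single error of 1.0 is squared before averaging, which is the sense in which RMSE penalises large errors more.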
---
## Choosing the Right Algorithm
```text
START
  │
  ├─ Labelled data? ──NO──→ Unsupervised (K-Means, PCA, DBSCAN)
  │       YES
  │
  ├─ Predicting a number? ─YES──→ Regression
  │       ├── Linear/Ridge/Lasso  (linear relationships)
  │       ├── Random Forest       (non-linear, robust)
  │       └── XGBoost             (tabular, often best accuracy)
  │
  └─ Predicting a category? YES─→ Classification
          ├── Logistic Regression (baseline, interpretable)
          ├── Random Forest       (robust, few params)
          ├── XGBoost/LightGBM    (tabular champion)
          └── Neural Network      (images, text, audio)
```
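The same decision flow can be written as a small lookup function, which is handy as a checklist at the start of a project. The function name `suggest_algorithms` and its arguments are hypothetical, and the returned names are just the candidates listed in the chart, not a ranking.

```python
def suggest_algorithms(has_labels, target="category"):
    """Return candidate algorithms following the decision chart.

    target is "number" (regression) or "category" (classification);
    it is ignored when has_labels is False.
    """
    if not has_labels:
        return ["K-Means", "PCA", "DBSCAN"]
    if target == "number":
        return ["Linear/Ridge/Lasso", "Random Forest", "XGBoost"]
    if target == "category":
        return ["Logistic Regression", "Random Forest",
                "XGBoost/LightGBM", "Neural Network"]
    raise ValueError(f"unknown target kind: {target!r}")


print(suggest_algorithms(has_labels=False))
print(suggest_algorithms(has_labels=True, target="number"))
print(suggest_algorithms(has_labels=True, target="category"))
```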
---
## Key Takeaways
1. **Supervised learning** needs labelled data and optimises a loss function
2. **The bias-variance tradeoff** — simple models underfit, complex ones overfit
3. **Always evaluate on held-out data** — cross-validation gives reliable estimates
4. **Feature engineering matters** as much as algorithm choice
5. **Start simple** — logistic regression is a strong baseline before trying neural networks
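To show what cross-validation in takeaway 3 does mechanically, here is a minimal sketch of k-fold index splitting in plain Python. The helper `k_fold_indices` is hypothetical, and it assumes the samples are already shuffled; scikit-learn's `KFold` and `cross_val_score` handle this, plus the model fitting, in practice.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, val_idx) pairs; each sample is validated on exactly once."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(10, k=3))
for train_idx, val_idx in folds:
    print("validate on", val_idx, "| train on", train_idx)
```

A model is trained once per fold and scored on that fold's validation indices; averaging the k scores gives an estimate that depends less on one lucky or unlucky split.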