Logistic Regression + Confusion Matrix

Module 2: Machine Learning

Why Not Linear Regression for Classification?

The Problem

Using linear regression for binary classification ($y \in \{0, 1\}$) leads to several issues:

y
^
|                              *  *
|                         *
|                    *
|               *          Linear Regression
|          *               tries to fit line through
|     *                    binary outcomes
|  *
|*
+---------------------------------------------> x
0=Negative              1=Positive

Problems:
1. Predictions can be <0 or >1 (not valid probabilities)
2. Assumes constant variance (violated for binary data)
3. Sensitive to outliers
4. Residuals are NOT normally distributed
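
The first problem in the list is easy to reproduce. The sketch below fits an ordinary least-squares line to binary labels and evaluates it at a few points; the toy data (label 1 whenever x > 5) is an assumption made purely for illustration.

```python
import numpy as np

# Toy 1-D data (assumed for illustration): label is 1 when x > 5.
x = np.arange(0, 11, dtype=float)   # 0, 1, ..., 10
y = (x > 5).astype(float)

# Ordinary least-squares fit of y = slope * x + intercept.
slope, intercept = np.polyfit(x, y, deg=1)

# Evaluate the fitted line inside and outside the training range.
x_new = np.array([-5.0, 0.0, 5.0, 10.0, 15.0])
y_hat = slope * x_new + intercept

for xi, yi in zip(x_new, y_hat):
    print(f"x = {xi:5.1f} -> prediction = {yi:6.3f}")
```

Several of the printed values fall below 0 or above 1, so they cannot be read as probabilities.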

The Solution: Logistic Regression

Instead of predicting $y$ directly, predict the probability that $y = 1$:

$$P(y = 1 \mid \mathbf{x}) = \sigma(\mathbf{w}^T \mathbf{x} + b)$$

The Sigmoid Function

Mathematical Definition

Definition: Sigmoid Function

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

Properties

  • Output range: $(0, 1)$, so the output is always a valid probability
  • Derivative: $\sigma'(z) = \sigma(z)(1 - \sigma(z))$
  • Decision boundary: $\sigma(z) = 0.5$ when $z = 0$

Visual Representation

sigma(z)
^
1.0 |                           ___________
    |                         /
    |                       /
0.5 |---------------------/----------------  <-- Decision Boundary
    |                   /
    |                 /
0.0 |_______________/
    |
    +-------|-------|-------|-------|------> z
           -5      0       5

Key Points:
- Always outputs between 0 and 1
- Smooth S-shaped curve
- z=0 -> sigma(z) = 0.5 (decision boundary)
- Symmetric: sigma(-z) = 1 - sigma(z)
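
The key points above can be checked numerically. This is a minimal sketch using NumPy; the grid of z values and the finite-difference step are arbitrary choices, not part of the lesson.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid: 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-10, 10, 201)
s = sigmoid(z)

# Output stays strictly between 0 and 1 on this grid.
in_range = bool(np.all((s > 0) & (s < 1)))

# Symmetry: sigma(-z) = 1 - sigma(z).
symmetric = bool(np.allclose(sigmoid(-z), 1 - s))

# Monotonic: larger z gives a larger output.
monotonic = bool(np.all(np.diff(s) > 0))

# Derivative identity sigma'(z) = sigma(z) * (1 - sigma(z)),
# compared against a central finite difference.
h = 1e-5
numeric_derivative = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
derivative_ok = bool(np.allclose(numeric_derivative, s * (1 - s), atol=1e-8))

print(in_range, symmetric, monotonic, derivative_ok)
print("sigmoid(0) =", sigmoid(0.0))
```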

Why Sigmoid? (Intuition)

💡 Why Sigmoid for Classification?

Linear regression outputs can be any real number. For classification, we need probabilities in [0, 1]. The sigmoid function:

  1. Compresses any real number into (0,1)
  2. Is monotonic (higher z always means higher probability)
  3. Has a clean derivative: $\sigma'(z) = \sigma(z) \cdot (1 - \sigma(z))$
  4. The log-odds interpretation gives statistical meaning

From Sigmoid to Log-Odds

If p = sigma(z) = 1/(1 + e^(-z)), then:

  p/(1-p) = e^z          (odds)

  ln(p/(1-p)) = z        (log-odds / logit)

  ln(p/(1-p)) = w^T*x + b

Interpretation:
- w_i > 0: increasing feature i increases probability of class 1
- w_i < 0: increasing feature i decreases probability of class 1
- |w_i|: magnitude of effect on log-odds
- increasing feature i by one unit: odds are multiplied by e^(w_i)

Decision Boundary

The decision boundary is where P(y=1|x) = 0.5, which occurs when w^T*x + b = 0.

For 2D features (x1, x2):
  w1*x1 + w2*x2 + b = 0
  x2 = -(w1*x1 + b) / w2    <-- line in feature space

  x2 |      Class 1 (P>0.5)
     |    /
     |   /  <-- Decision boundary: w1*x1 + w2*x2 + b = 0
     |  /
     | /
     |/      Class 0 (P<0.5)
     +-------------> x1

Key insight: The boundary is ALWAYS linear in the features the model is given (a line in 2D, plane in 3D, hyperplane in higher dims).
This is why logistic regression is called a "linear classifier."
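
As a small illustration of the 2D case, the sketch below solves the boundary equation for x2 and confirms that points on that line receive probability 0.5. The weights and bias are hypothetical values chosen for the example.

```python
import numpy as np

# Hypothetical parameters of a fitted 2-feature model (not from the lesson).
w = np.array([2.0, -1.0])
b = 0.5

def boundary_x2(x1):
    """Solve w1*x1 + w2*x2 + b = 0 for x2 (requires w2 != 0)."""
    return -(w[0] * x1 + b) / w[1]

def predict_proba(points):
    """P(y=1 | x) for each row of `points`."""
    return 1.0 / (1.0 + np.exp(-(points @ w + b)))

x1 = np.array([-1.0, 0.0, 1.0])
points_on_boundary = np.column_stack([x1, boundary_x2(x1)])
probs = predict_proba(points_on_boundary)

print(points_on_boundary)
print(probs)  # every point on the boundary has probability 0.5
```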

The Loss Function: Cross-Entropy

ℹ️ Why Not MSE for Classification?

If true y=1, predicted p=0.01 (very wrong): MSE = (1 - 0.01)^2 = 0.98

If true y=1, predicted p=0.99 (very right): MSE = (1 - 0.99)^2 = 0.0001

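With a sigmoid output, the MSE gradient with respect to z carries a factor p(1-p), which is close to zero near p=0 or p=1. A confidently wrong prediction therefore produces only a tiny update → slow learning!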

Solution: Binary Cross-Entropy (Log Loss):

$$L = -\left[y \cdot \ln(p) + (1-y) \cdot \ln(1-p)\right]$$
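
To make the slow-learning argument concrete, the following sketch compares the gradient of each loss with respect to the pre-sigmoid value z for a single example. It assumes the squared-error loss is written as (p - y)^2 with p = sigma(z); the z values are arbitrary.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_gradients(z, y):
    """Return (dMSE/dz, dBCE/dz) for one example with p = sigmoid(z)."""
    p = sigmoid(z)
    grad_mse = 2.0 * (p - y) * p * (1.0 - p)   # chain rule through the sigmoid
    grad_bce = p - y                            # the p(1-p) factor cancels
    return grad_mse, grad_bce

y = 1.0
for z in (-5.0, -2.0, 0.0):
    g_mse, g_bce = loss_gradients(z, y)
    print(f"z = {z:5.1f}  p = {sigmoid(z):.4f}  dMSE/dz = {g_mse:+.4f}  dBCE/dz = {g_bce:+.4f}")
```

For a confidently wrong prediction (z = -5 with y = 1) the cross-entropy gradient stays close to -1 while the squared-error gradient is close to 0.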

The Decision Rule

$$\hat{y} = \begin{cases} 1 & \text{if } \sigma(\mathbf{w}^T \mathbf{x} + b) \geq 0.5 \\ 0 & \text{if } \sigma(\mathbf{w}^T \mathbf{x} + b) < 0.5 \end{cases}$$

Here,

  • $\hat{y}$ = Predicted class label

Odds Ratio and Log-Odds

Odds

The odds of an event occurring:

$$\text{odds} = \frac{P(y=1)}{P(y=0)} = \frac{p}{1-p}$$

Here,

  • $p$ = Probability of positive class

Log-Odds (Logit Function)

$$\log\left(\frac{p}{1-p}\right) = \mathbf{w}^T \mathbf{x} + b$$

This is the inverse of the sigmoid function. With $p = \sigma(z)$:

$$\text{logit}(p) = \log\left(\frac{p}{1-p}\right) = \log(\sigma(z)) - \log(1 - \sigma(z)) = z$$

Interpretation

| Probability $p$ | Odds $\frac{p}{1-p}$ | Log-Odds $\log\left(\frac{p}{1-p}\right)$ |
|---|---|---|
| 0.01 | 0.0101 | -4.60 |
| 0.1 | 0.111 | -2.20 |
| 0.5 | 1.0 | 0.0 |
| 0.9 | 9.0 | 2.20 |
| 0.99 | 99.0 | 4.60 |

Coefficient Interpretation:

A one-unit increase in $x_j$ multiplies the odds by $e^{\beta_j}$, where $\beta_j$ is the coefficient on feature $j$ (written $w_j$ earlier):

Odds Ratio

$$\frac{\text{odds}(x_j + 1)}{\text{odds}(x_j)} = e^{\beta_j}$$

Here,

  • $\beta_j$ = Coefficient for feature j
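
The odds-ratio identity can be verified directly. This sketch uses a single-feature model with a hypothetical coefficient and intercept; whatever the starting value of x, the ratio of odds after a one-unit increase should equal e raised to the coefficient.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def odds(p):
    return p / (1.0 - p)

# Hypothetical coefficient and intercept for a one-feature model.
beta, intercept = 0.8, -1.0

ratios = []
for x in (0.0, 1.0, 2.0):
    odds_before = odds(sigmoid(beta * x + intercept))
    odds_after = odds(sigmoid(beta * (x + 1) + intercept))
    ratios.append(odds_after / odds_before)
    print(f"x = {x:.1f}: odds ratio = {ratios[-1]:.4f}")

print(f"exp(beta) = {math.exp(beta):.4f}")
```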

Cost Function: Cross-Entropy Loss

Why Not MSE?

For logistic regression, using MSE creates a non-convex cost function with local minima:

$$J_{\text{MSE}} = \frac{1}{n}\sum_{i=1}^{n}\left(\sigma(\mathbf{w}^T\mathbf{x}_i + b) - y_i\right)^2$$

J(MSE)
^
|   *     *
|    *   *
|     * *
|      *        Local minima
|    *   *      make optimization
|   *     *     difficult
|  *       *
+------------------------> Parameters

Binary Cross-Entropy Loss

$$J(\mathbf{w}) = -\frac{1}{n}\sum_{i=1}^{n}\left[y_i \log(\hat{p}_i) + (1-y_i)\log(1-\hat{p}_i)\right]$$

💡 Cross-Entropy Intuition

  • When $y = 1$: $J = -\log(\hat{p})$, which penalizes low predicted probabilities
  • When $y = 0$: $J = -\log(1-\hat{p})$, which penalizes high predicted probabilities

Gradient of Cross-Entropy

$$\frac{\partial J}{\partial w_j} = \frac{1}{n}\sum_{i=1}^{n}(\hat{p}_i - y_i)x_{ij}$$

$$\frac{\partial J}{\partial b} = \frac{1}{n}\sum_{i=1}^{n}(\hat{p}_i - y_i)$$

This has the same form as the linear regression gradient, with $\hat{p}_i = \sigma(\mathbf{w}^T\mathbf{x}_i + b)$ in place of the linear prediction.
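
One way to gain confidence in the gradient formulas is to compare them with finite differences of the loss. The sketch below does this on random data; the data, parameter values, and step size are all arbitrary assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(w, b, X, y):
    """Mean binary cross-entropy J(w, b)."""
    p = sigmoid(X @ w + b)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def bce_gradient(w, b, X, y):
    """Analytic gradient: mean of (p_hat - y) * x, and mean of (p_hat - y)."""
    error = sigmoid(X @ w + b) - y
    return X.T @ error / len(y), error.mean()

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.integers(0, 2, size=50).astype(float)
w = rng.normal(size=3)
b = 0.1

grad_w, grad_b = bce_gradient(w, b, X, y)

# Central finite differences, one parameter at a time.
eps = 1e-6
numeric_w = np.zeros_like(w)
for j in range(len(w)):
    step = np.zeros_like(w)
    step[j] = eps
    numeric_w[j] = (bce_loss(w + step, b, X, y) - bce_loss(w - step, b, X, y)) / (2 * eps)
numeric_b = (bce_loss(w, b + eps, X, y) - bce_loss(w, b - eps, X, y)) / (2 * eps)

max_diff = max(np.max(np.abs(grad_w - numeric_w)), abs(grad_b - numeric_b))
print(f"largest analytic vs numeric difference: {max_diff:.2e}")
```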


Complete Derivation

Maximum Likelihood Estimation

The likelihood function for logistic regression:

Definition: Likelihood Function

$$L(\mathbf{w}) = \prod_{i=1}^{n} \hat{p}_i^{y_i} (1-\hat{p}_i)^{1-y_i}$$

Log-likelihood (easier to optimize):

$$\ell(\mathbf{w}) = \sum_{i=1}^{n}\left[y_i \log(\hat{p}_i) + (1-y_i)\log(1-\hat{p}_i)\right]$$

Negative log-likelihood = Cross-entropy loss:

$$J(\mathbf{w}) = -\frac{1}{n}\ell(\mathbf{w})$$
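
Besides turning a product into a sum, the log-likelihood is also numerically safer. The sketch below, on synthetic predictions that are assumed only for illustration, shows the raw likelihood underflowing to zero in floating point while the log-likelihood and the resulting cross-entropy remain finite.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Synthetic predicted probabilities and labels drawn to match them (illustrative only).
p_hat = rng.uniform(0.05, 0.95, size=n)
y = (rng.random(n) < p_hat).astype(float)

# Likelihood: product of per-sample terms p^y * (1 - p)^(1 - y).
per_sample = p_hat**y * (1 - p_hat) ** (1 - y)
likelihood = np.prod(per_sample)

# Log-likelihood: sum of per-sample log terms.
log_likelihood = np.sum(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

# Cross-entropy loss is the negative mean log-likelihood.
cross_entropy = -log_likelihood / n

print("likelihood     :", likelihood)        # underflows to 0.0
print("log-likelihood :", log_likelihood)
print("cross-entropy  :", cross_entropy)
```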

Gradient Descent Update

$$w_j^{(t+1)} = w_j^{(t)} - \alpha \cdot \frac{1}{n}\sum_{i=1}^{n}(\hat{p}_i - y_i)x_{ij}$$
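
The update rule can be turned into a short from-scratch trainer. This is a sketch under stated assumptions rather than a production implementation: the function name `fit_logistic`, the learning rate, the iteration count, and the synthetic data with its "true" parameters are all hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iter=2000):
    """Batch gradient descent on the mean cross-entropy loss."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    losses = []
    for _ in range(n_iter):
        p = sigmoid(X @ w + b)
        p_safe = np.clip(p, 1e-12, 1 - 1e-12)      # avoid log(0)
        losses.append(-np.mean(y * np.log(p_safe) + (1 - y) * np.log(1 - p_safe)))
        error = p - y
        w -= lr * (X.T @ error) / n                 # w_j <- w_j - alpha * dJ/dw_j
        b -= lr * error.mean()                      # b   <- b   - alpha * dJ/db
    return w, b, losses

# Synthetic data generated from an assumed logistic model.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
true_w, true_b = np.array([2.0, -1.0]), 0.5
y = (rng.random(400) < sigmoid(X @ true_w + true_b)).astype(float)

w, b, losses = fit_logistic(X, y)
accuracy = np.mean((sigmoid(X @ w + b) >= 0.5) == (y == 1))

print("learned w:", w, "learned b:", b)
print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}, training accuracy: {accuracy:.3f}")
```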

Binary Classification Implementation

📝Logistic Regression with Confusion Matrix Analysis

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score, confusion_matrix, classification_report,
    roc_curve, auc, precision_recall_curve
)

np.random.seed(42)
n = 1000

X = np.column_stack([
    np.random.normal(50, 15, n),
    np.random.normal(120, 20, n),
    np.random.normal(200, 40, n),
    np.random.normal(100, 30, n),
    np.random.normal(28, 5, n)
])

logits = (
    0.05 * (X[:, 0] - 50) +
    0.02 * (X[:, 1] - 120) +
    0.01 * (X[:, 2] - 200) +
    0.03 * (X[:, 3] - 100) +
    0.1 * (X[:, 4] - 28) - 1
)
prob = 1 / (1 + np.exp(-logits))
y = np.random.binomial(1, prob)

df = pd.DataFrame(X, columns=['age', 'blood_pressure', 'cholesterol', 'glucose', 'bmi'])
df['heart_disease'] = y

X_train, X_test, y_train, y_test = train_test_split(
    df.drop('heart_disease', axis=1), df['heart_disease'],
    test_size=0.2, random_state=42, stratify=df['heart_disease']
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = LogisticRegression(random_state=42, max_iter=1000)
model.fit(X_train_scaled, y_train)

y_pred = model.predict(X_test_scaled)
y_prob = model.predict_proba(X_test_scaled)[:, 1]

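print("=== Model Performance ===")
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print("\nConfusion Matrix (rows = actual, columns = predicted):")
print(confusion_matrix(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))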

Confusion Matrix

Structure

                    Predicted
                    0         1
                +---------+---------+
Actual    0     |   TN    |   FP    |
                +---------+---------+
          1     |   FN    |   TP    |
                +---------+---------+

TN = True Negative   (correctly predicted 0)
FP = False Positive  (incorrectly predicted 1) — Type I Error
FN = False Negative  (incorrectly predicted 0) — Type II Error
TP = True Positive   (correctly predicted 1)

Metrics Derived from Confusion Matrix

Accuracy

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

Here,

  • $TP$ = True Positives
  • $TN$ = True Negatives
  • $FP$ = False Positives
  • $FN$ = False Negatives

Precision

$$\text{Precision} = \frac{TP}{TP + FP}$$

Here,

  • $TP$ = True Positives
  • $FP$ = False Positives

ℹ️ Precision Interpretation

"Of all patients we predicted as positive, how many actually have the disease?"

Recall (Sensitivity)

$$\text{Recall} = \frac{TP}{TP + FN}$$

Here,

  • $TP$ = True Positives
  • $FN$ = False Negatives

ℹ️ Recall Interpretation

"Of all patients who actually have the disease, how many did we correctly identify?"

Specificity (True Negative Rate)

$$\text{Specificity} = \frac{TN}{TN + FP}$$

F1 Score

$$F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

Here,

  • $\text{Precision}$ = Positive predictive value
  • $\text{Recall}$ = True positive rate

Visual Example

Medical Diagnosis Example (1000 patients)
==========================================

                Predicted Healthy    Predicted Sick
              +--------------------+------------------+
Actually  0   |   TN = 850        |   FP = 50        |
Healthy       |   (Correct)       |   (False Alarm)  |
              +--------------------+------------------+
Actually  1   |   FN = 30         |   TP = 70        |
Sick          |   (Missed)        |   (Correct)      |
              +--------------------+------------------+

Metrics:
- Accuracy = (850+70)/1000 = 92%
- Precision = 70/(70+50) = 58.3%
- Recall = 70/(70+30) = 70%
- Specificity = 850/(850+50) = 94.4%
- F1 = 2 × 0.583 × 0.70 / (0.583 + 0.70) = 63.6%
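
The arithmetic above can be reproduced with a small helper. The function name `confusion_metrics` is a hypothetical choice for this sketch; the counts are the ones from the medical diagnosis example.

```python
def confusion_metrics(tn, fp, fn, tp):
    """Compute the standard metrics from the four confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
    }

metrics = confusion_metrics(tn=850, fp=50, fn=30, tp=70)
for name, value in metrics.items():
    print(f"{name:12s}: {value:.1%}")
```

Note how accuracy (92%) looks far better than precision (58.3%) here, because healthy patients dominate the sample.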

Precision-Recall Tradeoff

Precision
^
1.0 |    *
    |   * *
    |  *   *
    | *     *
    |*       *
    |         *
    |           *
    |             *
    +-------------------------> Recall
    0                     1.0

Higher threshold → usually higher Precision, lower Recall
Lower threshold → usually lower Precision, higher Recall
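
A tiny hand-made example shows the tradeoff. The scores and labels below are invented for illustration, and `precision_recall_at` is a hypothetical helper name.

```python
import numpy as np

# Invented predicted probabilities and true labels (illustrative only).
scores = np.array([0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.70, 0.80, 0.90, 0.95])
labels = np.array([0, 0, 0, 1, 0, 1, 1, 0, 1, 1])

def precision_recall_at(threshold):
    """Precision and recall when predicting 1 for scores >= threshold."""
    predicted = scores >= threshold
    tp = np.sum(predicted & (labels == 1))
    fp = np.sum(predicted & (labels == 0))
    fn = np.sum(~predicted & (labels == 1))
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn)
    return float(precision), float(recall)

results = {t: precision_recall_at(t) for t in (0.3, 0.5, 0.7)}
for t, (precision, recall) in results.items():
    print(f"threshold = {t:.1f}: precision = {precision:.3f}, recall = {recall:.3f}")
```

On this data, raising the threshold from 0.3 to 0.7 lifts precision from 0.625 to 0.750 while recall drops from 1.0 to 0.6.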

ROC Curve and AUC

ROC (Receiver Operating Characteristic) Curve

Plots True Positive Rate vs False Positive Rate at various thresholds:

$$\text{TPR} = \text{Recall} = \frac{TP}{TP + FN}$$

$$\text{FPR} = 1 - \text{Specificity} = \frac{FP}{FP + TN}$$

TPR (Sensitivity)
^
1.0 |           ___________
    |          /
    |         /
    |        /    AUC = 0.92
    |       /     (Excellent)
0.5 |      /
    |     /
    |    /
    |   /
    |  /
    | /
0.0 |/
    +-------------------------> FPR (1-Specificity)
    0                     1.0

Diagonal line (random): AUC = 0.5
Perfect classifier: AUC = 1.0

AUC (Area Under ROC Curve)

| AUC Range | Interpretation |
|---|---|
| 0.9 - 1.0 | Excellent |
| 0.8 - 0.9 | Good |
| 0.7 - 0.8 | Fair |
| 0.6 - 0.7 | Poor |
| 0.5 - 0.6 | Fail (random) |

from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

fpr, tpr, thresholds = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)

plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='blue', lw=2, label=f'ROC curve (AUC = {roc_auc:.3f})')
plt.plot([0, 1], [0, 1], color='gray', linestyle='--', label='Random classifier')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.grid(True, alpha=0.3)
plt.show()

Multiclass Classification

One-vs-Rest (OvR)

Train $K$ binary classifiers, one per class:

$$\text{Class}_k: P(y=k \mid \mathbf{x}) \text{ vs } P(y \neq k \mid \mathbf{x})$$

Class 1 vs Rest    Class 2 vs Rest    Class 3 vs Rest
+---+---+---+      +---+---+---+      +---+---+---+
| 1 | 1 | 2 |      | R | 2 | 2 |      | R | R | 3 |
| 1 | R | R |      | R | 2 | R |      | R | 2 | 3 |
+---+---+---+      +---+---+---+      +---+---+---+

Final prediction: argmax of all classifier probabilities
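
A from-scratch sketch of one-vs-rest follows: one binary logistic model per class, then an argmax over their probabilities. The three Gaussian blobs, the helper `fit_binary`, and the training settings are assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_binary(X, y, lr=0.1, n_iter=1000):
    """Binary logistic regression trained by batch gradient descent."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(n_iter):
        error = sigmoid(X @ w + b) - y
        w -= lr * (X.T @ error) / len(y)
        b -= lr * error.mean()
    return w, b

# Three well-separated synthetic classes (illustrative data).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
labels = np.repeat(np.arange(3), 100)
X = centers[labels] + rng.normal(size=(300, 2))

# One classifier per class: class k (label 1) versus the rest (label 0).
models = [fit_binary(X, (labels == k).astype(float)) for k in range(3)]

# Each column holds one classifier's P(y = k | x); predict the most confident class.
scores = np.column_stack([sigmoid(X @ w + b) for w, b in models])
predictions = scores.argmax(axis=1)
accuracy = np.mean(predictions == labels)

print(f"one-vs-rest training accuracy: {accuracy:.3f}")
```

The per-class probabilities in `scores` do not sum to 1 across a row, since each classifier is trained independently.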

One-vs-One (OvO)

Train $\binom{K}{2}$ classifiers for all pairs:

$$\binom{K}{2} = \frac{K(K-1)}{2}$$

Majority vote determines the final class.
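
The pair count and the voting step can be sketched with the standard library alone. The class names and the pairwise winners below are invented; in practice each winner would come from a trained binary classifier.

```python
from collections import Counter
from itertools import combinations
from math import comb

classes = ["cat", "dog", "bird", "fish"]

# Every unordered pair of classes gets its own binary classifier.
pairs = list(combinations(classes, 2))
print(len(pairs), "pairwise classifiers; K(K-1)/2 =", comb(len(classes), 2))

# Hypothetical output of each pairwise classifier for one input sample.
pairwise_winner = {
    ("cat", "dog"): "dog",
    ("cat", "bird"): "cat",
    ("cat", "fish"): "cat",
    ("dog", "bird"): "dog",
    ("dog", "fish"): "dog",
    ("bird", "fish"): "fish",
}

# Majority vote across all pairwise decisions.
votes = Counter(pairwise_winner[pair] for pair in pairs)
predicted = votes.most_common(1)[0][0]
print(dict(votes), "->", predicted)
```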

Softmax (Multinomial Logistic Regression)

For $K$ classes, directly model:

Softmax Function

$$P(y = k \mid \mathbf{x}) = \frac{e^{\mathbf{w}_k^T \mathbf{x} + b_k}}{\sum_{j=1}^{K} e^{\mathbf{w}_j^T \mathbf{x} + b_j}}$$

Here,

  • $K$ = Number of classes
  • $\mathbf{w}_k$ = Weight vector for class k
  • $b_k$ = Bias for class k

Properties:

  • $\sum_{k=1}^{K} P(y=k \mid \mathbf{x}) = 1$
  • Reduces to sigmoid when $K=2$
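
Both properties can be checked numerically. In this sketch the class scores are arbitrary numbers; for two classes, the softmax probability of the first class should match the sigmoid of the difference between the two scores.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax; subtracting the row maximum keeps exp() from overflowing."""
    shifted = z - z.max(axis=-1, keepdims=True)
    exp_z = np.exp(shifted)
    return exp_z / exp_z.sum(axis=-1, keepdims=True)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Arbitrary scores for K = 2 classes, one row per sample.
z = np.array([[1.2, -0.3], [0.0, 2.0], [-1.5, -1.5]])
p = softmax(z)

print("rows sum to 1:", np.allclose(p.sum(axis=1), 1.0))

# With K = 2, P(class 0) equals sigmoid(z_0 - z_1).
print("softmax P(class 0):", p[:, 0])
print("sigmoid(z0 - z1)  :", sigmoid(z[:, 0] - z[:, 1]))
```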

Feature Importance and Odds Ratios

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

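np.random.seed(42)
X_train = np.random.randn(500, 4) * [15, 40, 20, 5] + [50, 200, 120, 3]
# Threshold at the mean score (0.05*50 + 0.02*120 = 4.9) so both classes are represented
y_train = (X_train[:, 0] * 0.05 + X_train[:, 2] * 0.02 - 4.9 > 0).astype(int)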

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)

model = LogisticRegression(random_state=42)
model.fit(X_scaled, y_train)

feature_names = ['Age', 'Cholesterol', 'Blood Pressure', 'Exercise Hours']
coefficients = model.coef_[0]

print("Feature Importance Analysis:")
print("-" * 50)
for name, coef in zip(feature_names, coefficients):
    # Features were standardized, so each odds ratio is per one-standard-deviation increase
    odds_ratio = np.exp(coef)
    direction = "increases" if coef > 0 else "decreases"
    print(f"{name:20s}: coef={coef:.3f}, OR={odds_ratio:.3f} ({direction} risk)")

Key Takeaways

📋Summary: Logistic Regression

  1. Logistic Regression outputs probabilities: it uses the sigmoid function to map any real number to $(0, 1)$
  2. Cross-entropy loss is convex → no spurious local minima; gradient descent with a suitable learning rate approaches the global minimum
  3. Coefficient interpretation: $e^{\beta_j}$ gives the odds ratio for a one-unit increase in $x_j$
  4. Confusion Matrix is essential — accuracy alone is misleading for imbalanced data
  5. Precision vs Recall tradeoff: Adjust threshold based on business requirements
  6. AUC-ROC provides threshold-independent evaluation; AUC > 0.8 is generally good
  7. Multiclass: Use softmax for direct multiclass or OvR/OvO strategies

Practice Exercises

Exercise 1: Binary Classification

from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=10,
    n_classes=2, weights=[0.7, 0.3], random_state=42
)

# a) Split data and train model
# b) Print confusion matrix
# c) Calculate precision, recall, F1
# d) Plot ROC curve and report AUC

Exercise 2: Threshold Optimization

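import numpy as np
from sklearn.metrics import f1_score

# Assumes y_test and y_prob from a fitted model, as in the implementation section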

thresholds = np.arange(0.1, 0.9, 0.01)
f1_scores = [f1_score(y_test, (y_prob >= t).astype(int)) for t in thresholds]
optimal_threshold = thresholds[np.argmax(f1_scores)]
print(f"Optimal threshold: {optimal_threshold:.2f}")

Exercise 3: Multiclass Problem

from sklearn.datasets import load_wine
# a) multinomial logistic regression
# b) one-vs-rest logistic regression
# c) Which performs better? Why?

Exercise 4: Cost-Sensitive Learning

  • Train a model with class_weight='balanced'
  • Compare confusion matrices with unweighted model
  • In which scenarios is the weighted model better?

Discussion Questions

  1. When would you prioritize recall over precision (and vice versa)?
  2. Why might AUC be preferred over accuracy for imbalanced datasets?
  3. How does regularization affect logistic regression coefficients?
