Logistic Regression + Confusion Matrix

Why Not Linear Regression for Classification?

The Problem

Using linear regression for binary classification ( $y \in \{0, 1\}$ ) leads to several issues:

Architecture Diagram

y
^
|                              *  *
|                         *
|                    *
|               *          Linear Regression
|          *               tries to fit line through
|     *                    binary outcomes
|  *
|*
+---------------------------------------------> x
0=Negative              1=Positive

Problems:
1. Predictions can be <0 or >1 (not valid probabilities)
2. Assumes constant variance (violated for binary data)
3. Sensitive to outliers
4. Residuals are NOT normally distributed
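
The first problem in the list is easy to reproduce. The sketch below fits an ordinary least-squares line to a small made-up binary dataset (the data and variable names are illustrative assumptions, not part of the lesson's dataset) and prints predictions that fall outside the valid probability range.

```python
import numpy as np

# Toy 1-D data (illustrative assumption): the label is 1 once x reaches 5.
x = np.arange(0.0, 11.0)          # 0, 1, ..., 10
y = (x >= 5).astype(float)

# Ordinary least squares for y = w*x + b, using the design matrix [x, 1].
A = np.column_stack([x, np.ones_like(x)])
(w, b), *_ = np.linalg.lstsq(A, y, rcond=None)

# Evaluate the fitted line on a wider range than the training data.
x_new = np.array([-5.0, 0.0, 5.0, 10.0, 15.0])
preds = w * x_new + b

for xi, pi in zip(x_new, preds):
    print(f"x = {xi:5.1f} -> prediction = {pi:6.3f}")
```

Several of the printed values are below 0 or above 1, so they cannot be read as probabilities.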

The Solution: Logistic Regression

Instead of predicting $y$ directly, predict the probability that $y = 1$ :

P(y = 1 \mid \mathbf{x}) = \sigma(\mathbf{w}^T \mathbf{x} + b)

The Sigmoid Function

Mathematical Definition

Sigmoid Function

\sigma(z) = \frac{1}{1 + e^{-z}}

Properties

Output range: $(0, 1)$ — always a valid probability
Derivative: $\sigma'(z) = \sigma(z)(1 - \sigma(z))$
Decision boundary: $\sigma(z) = 0.5$ when $z = 0$
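
These properties can be checked numerically. The following is a minimal sketch using only the standard library; the helper names `sigmoid` and `sigmoid_grad` are chosen here for illustration, and the two-branch form is one common way to avoid overflow for large negative inputs.

```python
import math

def sigmoid(z: float) -> float:
    """Sigmoid written so that exp() never receives a large positive argument."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def sigmoid_grad(z: float) -> float:
    """Derivative from the identity sigma'(z) = sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

# Compare the closed-form derivative with a central finite difference.
h = 1e-6
for z in (-3.0, 0.0, 2.5):
    numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
    print(f"z={z:5.1f}  sigma={sigmoid(z):.6f}  "
          f"analytic={sigmoid_grad(z):.6f}  numeric={numeric:.6f}")
```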

Visual Representation

Architecture Diagram

sigma(z)
^
1.0 |                           ___________
    |                         /
    |                       /
0.5 |---------------------/----------------  <-- Decision Boundary
    |                   /
    |                 /
0.0 |_______________/
    |
    +-------|-------|-------|-------|------> z
           -5      0       5

Key Points:
- Always outputs between 0 and 1
- Smooth S-shaped curve
- z=0 -> sigma(z) = 0.5 (decision boundary)
- Symmetric: sigma(-z) = 1 - sigma(z)

Why Sigmoid? (Intuition)

💡 Why Sigmoid for Classification?

Linear regression outputs can be any real number, but classification needs a probability in [0, 1]. The sigmoid function:

Compresses any real number into (0, 1)
Is monotonic (higher z always means higher probability)
Has a clean derivative: $\sigma'(z) = \sigma(z) \cdot (1 - \sigma(z))$
Has a log-odds interpretation that gives the coefficients statistical meaning

From Sigmoid to Log-Odds

Architecture Diagram

If p = sigma(z) = 1/(1 + e^(-z)), then:

  p/(1-p) = e^z          (odds)

  ln(p/(1-p)) = z        (log-odds / logit)

  ln(p/(1-p)) = w^T*x + b

Interpretation:
- w_i > 0: increasing feature i increases probability of class 1
- w_i < 0: increasing feature i decreases probability of class 1
- |w_i|: magnitude of effect on log-odds
- one-unit increase in feature i: odds are multiplied by e^(w_i)

Decision Boundary

The decision boundary is where P(y=1|x) = 0.5, which occurs when w^T*x + b = 0.

Architecture Diagram

For 2D features (x1, x2):
  w1*x1 + w2*x2 + b = 0
  x2 = -(w1*x1 + b) / w2    <-- line in feature space

  x2 |      Class 1 (P>0.5)
     |    /
     |   /  <-- Decision boundary: w1*x1 + w2*x2 + b = 0
     |  /
     | /
     |/      Class 0 (P<0.5)
     +-------------> x1

Key insight: The boundary is ALWAYS linear (a line in 2D, plane in 3D, hyperplane in higher dims).
This is why logistic regression is called a "linear classifier."
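
As a small illustration of the 2D case, the sketch below uses hypothetical parameters `w1`, `w2`, `b` (picked only for the example) to compute the point on the boundary for a given `x1` and to show that the predicted probability is 0.5 there and moves away from 0.5 on either side.

```python
import math

# Hypothetical parameters of a two-feature model (illustration only).
w1, w2, b = 1.0, 2.0, -4.0

def prob_class1(x1: float, x2: float) -> float:
    """P(y = 1 | x) for the linear score w1*x1 + w2*x2 + b."""
    z = w1 * x1 + w2 * x2 + b
    return 1.0 / (1.0 + math.exp(-z))

def boundary_x2(x1: float) -> float:
    """x2 on the decision boundary for a given x1 (requires w2 != 0)."""
    return -(w1 * x1 + b) / w2

x1 = 2.0
x2_on = boundary_x2(x1)
print(f"boundary point: ({x1}, {x2_on})")
print(f"on the line:    P = {prob_class1(x1, x2_on):.3f}")
print(f"one unit above: P = {prob_class1(x1, x2_on + 1.0):.3f}")
print(f"one unit below: P = {prob_class1(x1, x2_on - 1.0):.3f}")
```

Which side of the line is class 1 depends on the signs of the weights; with `w2 > 0` as assumed here, points above the line get P > 0.5.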

The Loss Function: Cross-Entropy

ℹ️ Why Not MSE for Classification?

If true y=1, predicted p=0.01 (very wrong): MSE = (1 - 0.01)^2 = 0.98

If true y=1, predicted p=0.99 (very right): MSE = (1 - 0.99)^2 = 0.0001

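With a sigmoid output, the MSE gradient with respect to z is 2(p - y) * p(1 - p). The factor p(1 - p) is close to 0 whenever p is near 0 or 1, so even a confidently wrong prediction gets a tiny update → slow learning!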

Solution: Binary Cross-Entropy (Log Loss):

L = -\left[y \cdot \ln(p) + (1-y) \cdot \ln(1-p)\right]
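
To make the comparison concrete, the sketch below evaluates both losses and their gradients with respect to the logit z for the "very wrong" case above (y = 1, p = 0.01). The function names are illustrative; the gradient formulas follow from the chain rule with dp/dz = p(1 - p).

```python
import math

def mse(y: float, p: float) -> float:
    return (y - p) ** 2

def bce(y: float, p: float) -> float:
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def mse_grad_z(y: float, p: float) -> float:
    """d/dz of (y - p)^2 with p = sigmoid(z): 2 (p - y) p (1 - p)."""
    return 2 * (p - y) * p * (1 - p)

def bce_grad_z(y: float, p: float) -> float:
    """d/dz of binary cross-entropy with p = sigmoid(z): p - y."""
    return p - y

y, p = 1.0, 0.01  # true class 1, predicted probability 0.01
print(f"MSE loss = {mse(y, p):.4f}, gradient wrt z = {mse_grad_z(y, p):.4f}")
print(f"BCE loss = {bce(y, p):.4f}, gradient wrt z = {bce_grad_z(y, p):.4f}")
```

The cross-entropy gradient is roughly fifty times larger in magnitude here, which is why it recovers from confident mistakes much faster.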

The Decision Rule

Decision Rule

\hat{y} = \begin{cases} 1 & \text{if } \sigma(\mathbf{w}^T \mathbf{x} + b) \geq 0.5 \\ 0 & \text{if } \sigma(\mathbf{w}^T \mathbf{x} + b) < 0.5 \end{cases}

Here,

$\hat{y}$ =Predicted class label

Odds Ratio and Log-Odds

Odds

The odds of an event occurring:

Odds

\text{odds} = \frac{P(y=1)}{P(y=0)} = \frac{p}{1-p}

Here,

$p$ =Probability of positive class

Log-Odds (Logit Function)

\log\left(\frac{p}{1-p}\right) = \mathbf{w}^T \mathbf{x} + b

The logit is the inverse of the sigmoid function: applying it to $p = \sigma(z)$ recovers $z$.

\text{logit}(p) = \log\left(\frac{p}{1-p}\right) = \log(\sigma(z)) - \log(1 - \sigma(z)) = z

Interpretation

Probability $p$	Odds $\frac{p}{1-p}$	Log-Odds $\log\left(\frac{p}{1-p}\right)$
0.01	0.0101	-4.60
0.1	0.111	-2.20
0.5	1.0	0.0
0.9	9.0	2.20
0.99	99.0	4.60

Coefficient Interpretation:

A one-unit increase in $x_j$, with all other features held fixed, multiplies the odds by $e^{\beta_j}$ :

Odds Ratio

\frac{\text{odds}(x_j + 1)}{\text{odds}(x_j)} = e^{\beta_j}

Here,

$\beta_j$ =Coefficient for feature j (the weight $w_j$ in the notation used above)
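
The odds-ratio identity can be verified directly. In this sketch the coefficients `w`, intercept `b`, and input `x` are hypothetical values chosen for illustration; the check is that raising one feature by one unit multiplies the odds by the exponential of that feature's coefficient.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def odds(p: float) -> float:
    return p / (1.0 - p)

# Hypothetical model and input (illustration only).
w = [0.8, -0.5]
b = 0.1
x = [1.0, 2.0]

def predict(features):
    """P(y = 1 | x) for the linear score w . x + b."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, features)) + b)

j = 0                      # feature whose effect is being measured
x_plus = list(x)
x_plus[j] += 1.0           # one-unit increase, other features held fixed

ratio = odds(predict(x_plus)) / odds(predict(x))
print(f"observed odds ratio: {ratio:.4f}")
print(f"exp(coefficient):    {math.exp(w[j]):.4f}")
```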

Cost Function: Cross-Entropy Loss

Why Not MSE?

For logistic regression, composing MSE with the sigmoid gives a cost function that is non-convex in the parameters, so gradient descent can stall in flat regions or poor local minima:

J_{\text{MSE}} = \frac{1}{n}\sum_{i=1}^{n}\left(\sigma(\mathbf{w}^T\mathbf{x}_i + b) - y_i\right)^2

Architecture Diagram

J(MSE)
^
|   *     *
|    *   *
|     * *
|      *        Local minima
|    *   *      make optimization
|   *     *     difficult
|  *       *
+------------------------> Parameters

Binary Cross-Entropy Loss

J(\mathbf{w}) = -\frac{1}{n}\sum_{i=1}^{n}\left[y_i \log(\hat{p}_i) + (1-y_i)\log(1-\hat{p}_i)\right]

💡 Cross-Entropy Intuition

When $y = 1$ : $J = -\log(\hat{p})$ — penalizes low probability predictions
When $y = 0$ : $J = -\log(1-\hat{p})$ — penalizes high probability predictions

Gradient of Cross-Entropy

\frac{\partial J}{\partial w_j} = \frac{1}{n}\sum_{i=1}^{n}(\hat{p}_i - y_i)x_{ij}

\frac{\partial J}{\partial b} = \frac{1}{n}\sum_{i=1}^{n}(\hat{p}_i - y_i)

This has the same form as the MSE gradient in linear regression (up to a constant factor), with $\hat{p}_i$ in place of the linear prediction.
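
One way to gain confidence in these formulas is a finite-difference check. The sketch below builds a small random problem (the data, seed, and parameter values are arbitrary assumptions) and compares the analytic gradients with numerical ones.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))                      # random features
y = rng.integers(0, 2, size=n).astype(float)     # random 0/1 labels
w = rng.normal(size=d)                           # arbitrary parameter values
b = 0.3

def loss(w_vec, b_val):
    """Mean binary cross-entropy J(w, b)."""
    p = 1.0 / (1.0 + np.exp(-(X @ w_vec + b_val)))
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Analytic gradients: (1/n) * sum (p_hat - y) * x_ij and (1/n) * sum (p_hat - y).
p_hat = 1.0 / (1.0 + np.exp(-(X @ w + b)))
grad_w = X.T @ (p_hat - y) / n
grad_b = np.mean(p_hat - y)

# Central finite differences, one coordinate at a time.
eps = 1e-6
basis = np.eye(d)
num_grad_w = np.array([
    (loss(w + eps * basis[j], b) - loss(w - eps * basis[j], b)) / (2 * eps)
    for j in range(d)
])
num_grad_b = (loss(w, b + eps) - loss(w, b - eps)) / (2 * eps)

print("analytic :", grad_w, grad_b)
print("numerical:", num_grad_w, num_grad_b)
```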

Complete Derivation

Maximum Likelihood Estimation

The likelihood function for logistic regression:

Likelihood Function

L(\mathbf{w}) = \prod_{i=1}^{n} \hat{p}_i^{y_i} (1-\hat{p}_i)^{1-y_i}

Log-likelihood (easier to optimize):

\ell(\mathbf{w}) = \sum_{i=1}^{n}\left[y_i \log(\hat{p}_i) + (1-y_i)\log(1-\hat{p}_i)\right]

Average negative log-likelihood = cross-entropy loss:

J(\mathbf{w}) = -\frac{1}{n}\ell(\mathbf{w})
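
The step from this loss to the gradient given earlier follows from the chain rule and the identity $\sigma'(z) = \sigma(z)(1 - \sigma(z))$. A short sketch of the derivation, writing $z_i = \mathbf{w}^T\mathbf{x}_i + b$ and $\hat{p}_i = \sigma(z_i)$ (the shorthand $z_i$ is introduced here for convenience):

```latex
\begin{align*}
\frac{\partial}{\partial z_i}\left[y_i \log \hat{p}_i + (1-y_i)\log(1-\hat{p}_i)\right]
  &= \frac{y_i}{\hat{p}_i}\,\hat{p}_i(1-\hat{p}_i)
     - \frac{1-y_i}{1-\hat{p}_i}\,\hat{p}_i(1-\hat{p}_i) \\
  &= y_i(1-\hat{p}_i) - (1-y_i)\,\hat{p}_i \;=\; y_i - \hat{p}_i \\[4pt]
\frac{\partial J}{\partial w_j}
  &= -\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{p}_i)\,\frac{\partial z_i}{\partial w_j}
   \;=\; \frac{1}{n}\sum_{i=1}^{n}(\hat{p}_i - y_i)\,x_{ij}
\end{align*}
```

The bias gradient follows the same way with $\partial z_i / \partial b = 1$.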

Gradient Descent Update

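w_j^{(t+1)} = w_j^{(t)} - \alpha \cdot \frac{1}{n}\sum_{i=1}^{n}(\hat{p}_i - y_i)x_{ij}

b^{(t+1)} = b^{(t)} - \alpha \cdot \frac{1}{n}\sum_{i=1}^{n}(\hat{p}_i - y_i)

where $\alpha$ is the learning rate.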
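
The update rule translates almost line for line into NumPy. The following is a minimal batch gradient descent sketch, not a replacement for a library solver: the function name `fit_logistic`, the learning rate, the iteration count, and the synthetic data are all assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, alpha=0.1, n_iter=2000):
    """Batch gradient descent on the mean cross-entropy loss."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(n_iter):
        p_hat = sigmoid(X @ w + b)        # current predicted probabilities
        error = p_hat - y                 # (p_hat_i - y_i) for every sample
        w -= alpha * (X.T @ error) / n    # weight update
        b -= alpha * error.mean()         # bias update
    return w, b

# Synthetic data generated from known (made-up) parameters.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 2))
true_w, true_b = np.array([2.0, -1.0]), 0.5
y = rng.binomial(1, sigmoid(X @ true_w + true_b)).astype(float)

w, b = fit_logistic(X, y)
accuracy = np.mean((sigmoid(X @ w + b) >= 0.5) == (y == 1))
print("learned w:", w, "learned b:", b)
print(f"training accuracy: {accuracy:.3f}")
```

The learned weights should have the same signs as `true_w`; they will not match exactly because the labels are sampled with noise.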

Binary Classification Implementation

📝Logistic Regression with Confusion Matrix Analysis

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Reproducible synthetic data for 1000 patients
np.random.seed(42)
n = 1000

# Columns: age, blood_pressure, cholesterol, glucose, bmi
X = np.column_stack([
    np.random.normal(50, 15, n),
    np.random.normal(120, 20, n),
    np.random.normal(200, 40, n),
    np.random.normal(100, 30, n),
    np.random.normal(28, 5, n)
])

# True log-odds: linear in the mean-centered features, with intercept -1
logits = (
    0.05 * (X[:, 0] - 50) +
    0.02 * (X[:, 1] - 120) +
    0.01 * (X[:, 2] - 200) +
    0.03 * (X[:, 3] - 100) +
    0.1 * (X[:, 4] - 28) - 1
)
prob = 1 / (1 + np.exp(-logits))   # sigmoid gives P(heart_disease = 1)
y = np.random.binomial(1, prob)    # sample labels from those probabilities

df = pd.DataFrame(X, columns=['age', 'blood_pressure', 'cholesterol', 'glucose', 'bmi'])
df['heart_disease'] = y

# Stratified 80/20 split keeps the class ratio equal in train and test
X_train, X_test, y_train, y_test = train_test_split(
    df.drop('heart_disease', axis=1), df['heart_disease'],
    test_size=0.2, random_state=42, stratify=df['heart_disease']
)

# Fit the scaler on the training set only, then apply it to the test set
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = LogisticRegression(random_state=42, max_iter=1000)
model.fit(X_train_scaled, y_train)

y_pred = model.predict(X_test_scaled)               # class labels at threshold 0.5
y_prob = model.predict_proba(X_test_scaled)[:, 1]   # P(y = 1), reused for the ROC curve

print("=== Model Performance ===")
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print("\nConfusion Matrix (rows = actual, columns = predicted):")
print(confusion_matrix(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

Confusion Matrix

Structure

Architecture Diagram

                    Predicted
                    0         1
                +---------+---------+
Actual    0     |   TN    |   FP    |
                +---------+---------+
          1     |   FN    |   TP    |
                +---------+---------+

TN = True Negative   (correctly predicted 0)
FP = False Positive  (incorrectly predicted 1) — Type I Error
FN = False Negative  (incorrectly predicted 0) — Type II Error
TP = True Positive   (correctly predicted 1)
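
The four cells can be counted by hand from a list of true and predicted labels. This is a small sketch with made-up labels; the function name `confusion_counts` is chosen here for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Return (TN, FP, FN, TP) for binary labels coded as 0 and 1."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1          # correctly predicted 1
        elif actual == 0 and predicted == 0:
            tn += 1          # correctly predicted 0
        elif actual == 0 and predicted == 1:
            fp += 1          # Type I error
        else:
            fn += 1          # Type II error
    return tn, fp, fn, tp

# Made-up labels for eight samples.
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
print(f"TN={tn}  FP={fp}  FN={fn}  TP={tp}")
```

For 0/1 labels, scikit-learn's `confusion_matrix(y_true, y_pred)` uses the same layout as the diagram above, so `confusion_matrix(y_true, y_pred).ravel()` yields the counts in the order TN, FP, FN, TP.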

Metrics Derived from Confusion Matrix

Accuracy

\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}

Here,

$TP$ =True Positives
$TN$ =True Negatives
$FP$ =False Positives
$FN$ =False Negatives

Precision

\text{Precision} = \frac{TP}{TP + FP}

Here,

$TP$ =True Positives
$FP$ =False Positives

ℹ️ Precision Interpretation

"Of all patients we predicted as positive, how many actually have the disease?"

Recall (Sensitivity)

\text{Recall} = \frac{TP}{TP + FN}

Here,

$TP$ =True Positives
$FN$ =False Negatives

ℹ️ Recall Interpretation

"Of all patients who actually have the disease, how many did we correctly identify?"

Specificity (True Negative Rate)

\text{Specificity} = \frac{TN}{TN + FP}

F1 Score

F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}

Here,

$\text{Precision}$ =Positive predictive value
$\text{Recall}$ =True positive rate

Visual Example

Architecture Diagram

Medical Diagnosis Example (1000 patients)
==========================================

                Predicted Healthy    Predicted Sick
              +--------------------+------------------+
Actually  0   |   TN = 850        |   FP = 50        |
Healthy       |   (Correct)       |   (False Alarm)  |
              +--------------------+------------------+
Actually  1   |   FN = 30         |   TP = 70        |
Sick          |   (Missed)        |   (Correct)      |
              +--------------------+------------------+

Metrics:
- Accuracy = (850+70)/1000 = 92%
- Precision = 70/(70+50) = 58.3%
- Recall = 70/(70+30) = 70%
- Specificity = 850/(850+50) = 94.4%
- F1 = 2 × 0.583 × 0.70 / (0.583 + 0.70) = 63.6%
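
The same arithmetic can be wrapped in a small helper. The sketch below recomputes the metrics from the four counts in the example; the function name and the choice to return 0.0 when a denominator is zero are assumptions made for this illustration.

```python
def classification_metrics(tn: int, fp: int, fn: int, tp: int) -> dict:
    """Compute the standard metrics from confusion-matrix counts."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }

# Counts from the medical diagnosis example.
m = classification_metrics(tn=850, fp=50, fn=30, tp=70)
for name, value in m.items():
    print(f"{name:12s}: {value:.1%}")
```

Accuracy looks strong at 92%, yet precision is under 60%: most patients are healthy, so the large TN count dominates the accuracy figure.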

Precision-Recall Tradeoff

Architecture Diagram

Precision
^
1.0 |    *
    |   * *
    |  *   *
    | *     *
    |*       *
    |         *
    |           *
    |             *
    +-------------------------> Recall
    0                     1.0

Higher threshold → fewer positive predictions: recall falls, precision usually rises
Lower threshold → more positive predictions: recall rises, precision usually falls
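
The tradeoff shows up clearly on a handful of scores. In the sketch below the labels and predicted probabilities are made up for illustration, and `precision_recall_at` is a hypothetical helper name.

```python
# Made-up true labels and predicted probabilities for ten samples.
y_true = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]
y_prob = [0.10, 0.20, 0.30, 0.35, 0.45, 0.55, 0.60, 0.70, 0.80, 0.90]

def precision_recall_at(threshold: float):
    """Precision and recall when predicting 1 for scores >= threshold."""
    tp = sum(1 for t, p in zip(y_true, y_prob) if p >= threshold and t == 1)
    fp = sum(1 for t, p in zip(y_true, y_prob) if p >= threshold and t == 0)
    fn = sum(1 for t, p in zip(y_true, y_prob) if p < threshold and t == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

for threshold in (0.25, 0.50, 0.75):
    precision, recall = precision_recall_at(threshold)
    print(f"threshold={threshold:.2f}  precision={precision:.3f}  recall={recall:.3f}")
```

On this toy data precision climbs from 0.625 to 1.0 as the threshold rises, while recall drops from 1.0 to 0.4.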

ROC Curve and AUC

ROC (Receiver Operating Characteristic) Curve

Plots True Positive Rate vs False Positive Rate at various thresholds:

\text{TPR} = \text{Recall} = \frac{TP}{TP + FN}

\text{FPR} = 1 - \text{Specificity} = \frac{FP}{FP + TN}

Architecture Diagram

TPR (Sensitivity)
^
1.0 |           ___________
    |          /
    |         /
    |        /    AUC = 0.92
    |       /     (Excellent)
0.5 |      /
    |     /
    |    /
    |   /
    |  /
    | /
0.0 |/
    +-------------------------> FPR (1-Specificity)
    0                     1.0

Diagonal line (random): AUC = 0.5
Perfect classifier: AUC = 1.0

AUC (Area Under ROC Curve)

AUC Range	Interpretation
0.9 - 1.0	Excellent
0.8 - 0.9	Good
0.7 - 0.8	Fair
0.6 - 0.7	Poor
0.5 - 0.6	Fail (close to random)

from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

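# y_test and y_prob come from the Binary Classification Implementation above
fpr, tpr, thresholds = roc_curve(y_test, y_prob)   # one (FPR, TPR) point per threshold
roc_auc = auc(fpr, tpr)                            # trapezoidal area under the curve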

plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='blue', lw=2, label=f'ROC curve (AUC = {roc_auc:.3f})')
plt.plot([0, 1], [0, 1], color='gray', linestyle='--', label='Random classifier')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.grid(True, alpha=0.3)
plt.show()
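
AUC also has a ranking interpretation: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, counting ties as one half. The sketch below computes it that way on made-up scores; `auc_by_pairs` is an illustrative name, and the quadratic loop is meant for clarity rather than speed.

```python
def auc_by_pairs(y_true, scores) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0      # positive ranked above negative
            elif sp == sn:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))

# Made-up labels and scores for ten samples.
y_true = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]
scores = [0.10, 0.20, 0.30, 0.35, 0.45, 0.55, 0.60, 0.70, 0.80, 0.90]

print(f"AUC = {auc_by_pairs(y_true, scores):.2f}")
```

Here 22 of the 25 positive-negative pairs are ordered correctly, giving an AUC of 0.88.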

Multiclass Classification

One-vs-Rest (OvR)

Train $K$ binary classifiers, one per class:

\text{Class}_k: P(y=k \mid \mathbf{x}) \text{ vs } P(y \neq k \mid \mathbf{x})

Architecture Diagram

Class 1 vs Rest    Class 2 vs Rest    Class 3 vs Rest
+---+---+---+      +---+---+---+      +---+---+---+
| 1 | 1 | R |      | R | 2 | 2 |      | R | R | 3 |
| 1 | R | R |      | R | 2 | R |      | R | R | 3 |
+---+---+---+      +---+---+---+      +---+---+---+

Final prediction: argmax of all classifier probabilities
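
The final argmax step can be sketched as follows. The probability matrix is made up for illustration: each column stands for the output of one hypothetical "class k vs rest" classifier, so rows need not sum to 1.

```python
import numpy as np

# Rows = samples, columns = P(class k vs rest) from K = 3 separate binary models.
# Values are invented for the example.
ovr_probs = np.array([
    [0.90, 0.20, 0.10],
    [0.30, 0.60, 0.40],
    [0.20, 0.10, 0.70],
    [0.40, 0.50, 0.45],
])

# Pick, for every sample, the class whose binary classifier is most confident.
predicted_class = ovr_probs.argmax(axis=1)
print(predicted_class)
```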

One-vs-One (OvO)

Train $\binom{K}{2}$ classifiers for all pairs:

\binom{K}{2} = \frac{K(K-1)}{2}

Majority vote determines the final class.
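
A brief sketch of the pair count and the voting step, using only the standard library. The dictionary `pair_winner` holds invented outcomes of the pairwise classifiers for a single sample; in practice each entry would come from a trained binary model.

```python
from collections import Counter
from itertools import combinations

K = 4
pairs = list(combinations(range(K), 2))   # every unordered pair of classes
print(f"{len(pairs)} classifiers for K={K} (formula gives {K * (K - 1) // 2})")

# Invented winners of each pairwise contest for one sample.
pair_winner = {
    (0, 1): 1, (0, 2): 0, (0, 3): 3,
    (1, 2): 1, (1, 3): 1, (2, 3): 3,
}

# Each pairwise classifier casts one vote; the most-voted class wins.
votes = Counter(pair_winner[pair] for pair in pairs)
predicted_class = votes.most_common(1)[0][0]
print("votes:", dict(votes), "-> predicted class:", predicted_class)
```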

Softmax (Multinomial Logistic Regression)

For $K$ classes, directly model:

Softmax Function

P(y = k \mid \mathbf{x}) = \frac{e^{\mathbf{w}_k^T \mathbf{x} + b_k}}{\sum_{j=1}^{K} e^{\mathbf{w}_j^T \mathbf{x} + b_j}}

Here,

$K$ =Number of classes
$\mathbf{w}_k$ =Weight vector for class k
$b_k$ =Bias for class k

Properties:

$\sum_{k=1}^{K} P(y=k \mid \mathbf{x}) = 1$
Reduces to sigmoid when $K=2$
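
Both properties can be checked numerically. The sketch below implements softmax with the usual max-subtraction trick for numerical stability (a standard technique, added here as an implementation choice) and compares the two-class case with the sigmoid of the score difference. The input scores are arbitrary.

```python
import math

def softmax(scores):
    """Softmax over a list of class scores, shifted by the max for stability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

# Property 1: probabilities sum to 1.
probs = softmax([2.0, 1.0, 0.1])
print("probabilities:", [round(p, 4) for p in probs], "sum =", round(sum(probs), 6))

# Property 2: with K = 2, P(class 1) equals sigmoid of the score difference.
z1, z0 = 1.5, -0.5
print(f"softmax: {softmax([z1, z0])[0]:.6f}   sigmoid(z1 - z0): {sigmoid(z1 - z0):.6f}")
```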

Feature Importance and Odds Ratios

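import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

np.random.seed(42)  # reproducible synthetic data

# Columns: Age, Cholesterol, Blood Pressure, Exercise Hours
X_train = np.random.randn(500, 4) * [15, 40, 20, 5] + [50, 200, 120, 3]
# Labels depend only on Age and Blood Pressure, centered at their means
# so that both classes occur in roughly equal numbers
y_train = (0.05 * (X_train[:, 0] - 50) + 0.02 * (X_train[:, 2] - 120) > 0).astype(int)

# Standardize so coefficient magnitudes are comparable across features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)

model = LogisticRegression(random_state=42)
model.fit(X_scaled, y_train)

feature_names = ['Age', 'Cholesterol', 'Blood Pressure', 'Exercise Hours']
coefficients = model.coef_[0]

# Because the features are standardized, each odds ratio is per one
# standard deviation of the feature, not per raw unit
print("Feature Importance Analysis:")
print("-" * 50)
for name, coef in zip(feature_names, coefficients):
    odds_ratio = np.exp(coef)
    direction = "increases" if coef > 0 else "decreases"
    print(f"{name:20s}: coef={coef:.3f}, OR={odds_ratio:.3f} ({direction} risk)")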

Key Takeaways

📋Summary: Logistic Regression

Logistic regression outputs probabilities: the sigmoid function maps any real number to $(0, 1)$
Cross-entropy loss is convex in the parameters → gradient descent cannot get trapped in a local minimum
Coefficient interpretation: $e^{\beta_j}$ gives the odds ratio for a one-unit increase in $x_j$
Confusion Matrix is essential — accuracy alone is misleading for imbalanced data
Precision vs Recall tradeoff: Adjust threshold based on business requirements
AUC-ROC provides threshold-independent evaluation; AUC > 0.8 is generally good
Multiclass: Use softmax for direct multiclass or OvR/OvO strategies

Practice Exercises

Exercise 1: Binary Classification

from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=10,
    n_classes=2, weights=[0.7, 0.3], random_state=42
)

# a) Split data and train model
# b) Print confusion matrix
# c) Calculate precision, recall, F1
# d) Plot ROC curve and report AUC

Exercise 2: Threshold Optimization

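import numpy as np
from sklearn.metrics import f1_score

# y_test and y_prob come from the Binary Classification Implementation above.
# In practice, tune the threshold on a validation set rather than the test set.
thresholds = np.arange(0.1, 0.9, 0.01)
f1_scores = [f1_score(y_test, (y_prob >= t).astype(int)) for t in thresholds]
optimal_threshold = thresholds[np.argmax(f1_scores)]
print(f"Optimal threshold: {optimal_threshold:.2f}")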

Exercise 3: Multiclass Problem

from sklearn.datasets import load_wine
# a) multinomial logistic regression
# b) one-vs-rest logistic regression
# c) Which performs better? Why?

Exercise 4: Cost-Sensitive Learning

Train a model with class_weight='balanced'
Compare confusion matrices with unweighted model
In which scenarios is the weighted model better?

Discussion Questions

When would you prioritize recall over precision (and vice versa)?
Why might AUC be preferred over accuracy for imbalanced datasets?
How does regularization affect logistic regression coefficients?
