Logistic Regression + Confusion Matrix
Why Not Linear Regression for Classification?
The Problem
Using linear regression for binary classification ($y \in \{0, 1\}$) leads to several issues:
[Figure: binary outcomes (0 = Negative, 1 = Positive) plotted against x, with a linear regression line trying to fit a straight line through them.]
Problems:
1. Predictions can be <0 or >1 (not valid probabilities)
2. Assumes constant variance (violated for binary data)
3. Sensitive to outliers
4. Residuals are NOT normally distributed
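Problem 1 is easy to reproduce. The following is a minimal sketch, assuming a made-up one-dimensional dataset (the `x` and `y` values are illustrative, not from any real study): an ordinary least squares line is fitted to 0/1 labels and then evaluated at a few points.

```python
import numpy as np

# Toy 1-D data (made up for illustration): label is 1 when x is large.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Ordinary least squares fit of y = slope * x + intercept.
slope, intercept = np.polyfit(x, y, deg=1)

# Evaluate the fitted line inside and outside the training range.
x_new = np.array([-2.0, 0.0, 4.5, 10.0, 15.0])
y_hat = slope * x_new + intercept

for xi, yi in zip(x_new, y_hat):
    flag = "" if 0.0 <= yi <= 1.0 else "  <-- not a valid probability"
    print(f"x = {xi:5.1f} -> prediction = {yi:6.3f}{flag}")
```

Away from the middle of the data the line produces values below 0 and above 1, which cannot be read as probabilities.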
The Solution: Logistic Regression
Instead of predicting $y$ directly, predict the probability that $y = 1$:
$$P(y = 1 \mid x) = \sigma(w^T x + b)$$
The Sigmoid Function
Mathematical Definition
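Sigmoid Function
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
Properties
- Output range: $(0, 1)$, so the output is always a valid probability
- Derivative: $\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$
- Decision boundary: $\sigma(z) = 0.5$ when $z = 0$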
Visual Representation
sigma(z)
^
1.0 | ___________
| /
| /
0.5 |---------------------/---------------- <-- Decision Boundary
| /
| /
0.0 |_______________/
|
+-------|-------|-------|-------|------> z
-5 0 5
Key Points:
- Always outputs between 0 and 1
- Smooth S-shaped curve
- z=0 -> sigma(z) = 0.5 (decision boundary)
- Symmetric: sigma(-z) = 1 - sigma(z)
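The key points above can be checked numerically. This is a small sketch using NumPy; the helper name `sigmoid` and the grid of `z` values are choices made here for illustration. It also compares the derivative identity sigma'(z) = sigma(z) * (1 - sigma(z)) against a finite-difference estimate.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid 1 / (1 + exp(-z)), applied elementwise."""
    z = np.asarray(z, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5.0, 5.0, 11)
s = sigmoid(z)

# Output stays strictly between 0 and 1, and sigma(0) = 0.5.
print(s.min(), s.max(), sigmoid(0.0))

# Symmetry: sigma(-z) = 1 - sigma(z).
print(np.allclose(sigmoid(-z), 1.0 - s))

# Derivative: compare sigma(z) * (1 - sigma(z)) with a central difference.
h = 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
analytic = s * (1.0 - s)
print(np.allclose(numeric, analytic, atol=1e-8))
```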
Why Sigmoid? (Intuition)
💡 Why Sigmoid for Classification?
Linear regression outputs can be any real number, but classification needs probabilities in $[0, 1]$. The sigmoid function:
- Compresses any real number into $(0, 1)$
- Is monotonic (higher z always means higher probability)
- Has a clean derivative: $\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$
- Has an inverse (the logit), which gives the linear score z a statistical meaning as log-odds
From Sigmoid to Log-Odds
If p = sigma(z) = 1/(1 + e^(-z)), then:
p/(1-p) = e^z (odds)
ln(p/(1-p)) = z (log-odds / logit)
ln(p/(1-p)) = w^T*x + b
Interpretation:
- w_i > 0: increasing feature i increases probability of class 1
- w_i < 0: increasing feature i decreases probability of class 1
- |w_i|: magnitude of effect on log-odds
- one-unit increase in feature i: odds are multiplied by e^(w_i); a k-unit increase multiplies them by e^(k*w_i)
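The effect of a coefficient on the odds can be verified directly. The sketch below assumes hypothetical parameter values (`w`, `b`) and an arbitrary input `x`; it increases one feature by one unit and compares the resulting odds ratio with e^(w_i).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical fitted parameters for two features (illustrative values only).
w = np.array([0.8, -0.5])
b = -0.2

x = np.array([1.0, 2.0])
x_plus = x + np.array([1.0, 0.0])  # increase feature 0 by one unit

def odds(x_vec):
    """Odds p / (1 - p) of class 1 for input x_vec."""
    p = sigmoid(w @ x_vec + b)
    return p / (1.0 - p)

ratio = odds(x_plus) / odds(x)
print(f"odds before: {odds(x):.4f}")
print(f"odds after : {odds(x_plus):.4f}")
print(f"ratio      : {ratio:.4f}  vs  exp(w_0) = {np.exp(w[0]):.4f}")
```

The ratio does not depend on the starting point `x`, which is what makes the odds-ratio reading of a coefficient convenient.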
Decision Boundary
The decision boundary is where P(y=1|x) = 0.5, which occurs when w^T*x + b = 0.
For 2D features (x1, x2):
w1*x1 + w2*x2 + b = 0
x2 = -(w1*x1 + b) / w2 <-- line in feature space
x2 | Class 1 (P>0.5)
| /
| / <-- Decision boundary: w1*x1 + w2*x2 + b = 0
| /
| /
|/ Class 0 (P<0.5)
+-------------> x1
Key insight: The boundary is ALWAYS linear (a line in 2D, plane in 3D, hyperplane in higher dims).
This is why logistic regression is called a "linear classifier."
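As a quick check of the boundary equation, the following sketch picks hypothetical values for `w1`, `w2`, and `b`, computes points on the line x2 = -(w1*x1 + b) / w2, and confirms that the model assigns them probability 0.5.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters of a 2-feature model (w2 must be non-zero here).
w1, w2, b = 1.5, -2.0, 0.5

def boundary_x2(x1):
    """x2 coordinate of the decision boundary for a given x1."""
    return -(w1 * x1 + b) / w2

x1 = np.array([-1.0, 0.0, 2.0])
x2 = boundary_x2(x1)

# Points on the boundary have probability 0.5.
p_on_line = sigmoid(w1 * x1 + w2 * x2 + b)
print(p_on_line)

# Stepping off the line in the direction of (w1, w2) raises P(y=1|x).
p_shifted = sigmoid(w1 * (x1 + 0.1 * w1) + w2 * (x2 + 0.1 * w2) + b)
print(p_shifted)
```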
The Loss Function: Cross-Entropy
ℹ️ Why Not MSE for Classification?
If true y=1, predicted p=0.01 (very wrong): MSE = (1 - 0.01)^2 = 0.98
If true y=1, predicted p=0.99 (very right): MSE = (1 - 0.99)^2 = 0.0001
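With a sigmoid output, the MSE gradient contains the factor p(1-p), so it is very small near p=0 or p=1, even when the prediction is badly wrong → slow learning!
Solution: Binary Cross-Entropy (Log Loss):
$$L(y, p) = -\left[\, y \log p + (1 - y) \log(1 - p) \,\right]$$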
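To make the slow-learning argument concrete, this sketch compares the gradient of each loss with respect to the logit z for the "very wrong" case from above (true y=1, predicted p=0.01). The gradient expressions follow from the chain rule through the sigmoid; variable names are chosen here for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

y = 1.0                       # true label
z = np.log(0.01 / 0.99)       # logit chosen so that p = sigmoid(z) = 0.01
p = sigmoid(z)

# Gradients with respect to the logit z (chain rule through the sigmoid).
grad_mse = 2.0 * (p - y) * p * (1.0 - p)   # d/dz of (y - p)^2
grad_bce = p - y                           # d/dz of -[y log p + (1-y) log(1-p)]

print(f"p = {p:.2f}")
print(f"MSE gradient wrt z: {grad_mse:.4f}")
print(f"BCE gradient wrt z: {grad_bce:.4f}")
```

The cross-entropy gradient stays large when the prediction is confidently wrong, while the MSE gradient is damped by the p(1-p) factor.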
The Decision Rule
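$$\hat{y} = \begin{cases} 1 & \text{if } P(y = 1 \mid x) \ge 0.5 \\ 0 & \text{otherwise} \end{cases}$$
Here,
- $\hat{y}$ = Predicted class label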
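In code, the decision rule is a single comparison against a threshold. The helper `predict_labels` below is a hypothetical name used for illustration; the default threshold of 0.5 matches the rule above, and the second call shows how a stricter threshold changes the labels.

```python
import numpy as np

def predict_labels(probabilities, threshold=0.5):
    """Return 1 where the predicted probability reaches the threshold, else 0."""
    return (np.asarray(probabilities) >= threshold).astype(int)

p_hat = np.array([0.05, 0.40, 0.50, 0.73, 0.99])
print(predict_labels(p_hat))                 # default threshold 0.5
print(predict_labels(p_hat, threshold=0.8))  # stricter threshold
```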
Odds Ratio and Log-Odds
Odds
The odds of an event occurring:
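$$\text{odds} = \frac{p}{1 - p}$$
Here,
- $p$ = Probability of positive class
Log-Odds (Logit Function)
$$\text{logit}(p) = \ln\!\left(\frac{p}{1 - p}\right) = w^T x + b$$
This is the inverse of the sigmoid function: $\sigma(\text{logit}(p)) = p$.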
Interpretation
| Probability | Odds | Log-Odds |
|---|---|---|
| 0.01 | 0.0101 | -4.60 |
| 0.1 | 0.111 | -2.20 |
| 0.5 | 1.0 | 0.0 |
| 0.9 | 9.0 | 2.20 |
| 0.99 | 99.0 | 4.60 |
Coefficient Interpretation:
A one-unit increase in $x_j$ multiplies the odds by $e^{w_j}$:
Odds Ratio
$$\text{OR}_j = \frac{\text{odds}(x_j + 1)}{\text{odds}(x_j)} = e^{w_j}$$
Here,
- $w_j$ = Coefficient for feature j
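The probability, odds, and log-odds columns in the interpretation table can be reproduced in a few lines. This sketch uses the same five probabilities and also checks that applying the sigmoid to the log-odds recovers the original probabilities.

```python
import numpy as np

p = np.array([0.01, 0.1, 0.5, 0.9, 0.99])
odds = p / (1.0 - p)
log_odds = np.log(odds)

for pi, oi, li in zip(p, odds, log_odds):
    print(f"p = {pi:4.2f}  odds = {oi:8.4f}  log-odds = {li:6.2f}")

# The sigmoid inverts the logit and recovers the original probabilities.
p_back = 1.0 / (1.0 + np.exp(-log_odds))
print(np.allclose(p_back, p))
```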
Cost Function: Cross-Entropy Loss
Why Not MSE?
For logistic regression, using MSE creates a non-convex cost function with local minima:
[Figure: the MSE cost J plotted against the parameters. The curve is non-convex, and its local minima make optimization difficult.]
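Binary Cross-Entropy Loss
$$J(w, b) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log p^{(i)} + (1 - y^{(i)}) \log\left(1 - p^{(i)}\right) \right], \qquad p^{(i)} = \sigma(w^T x^{(i)} + b)$$
💡 Cross-Entropy Intuition
- When $y = 1$: $L = -\log(p)$, which penalizes low probability predictions
- When $y = 0$: $L = -\log(1 - p)$, which penalizes high probability predictions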
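Gradient of Cross-Entropy
$$\frac{\partial J}{\partial w} = \frac{1}{m} \sum_{i=1}^{m} \left(p^{(i)} - y^{(i)}\right) x^{(i)}, \qquad \frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} \left(p^{(i)} - y^{(i)}\right), \qquad p^{(i)} = \sigma(w^T x^{(i)} + b)$$
This has the same form as the linear regression gradient, with the sigmoid output $p^{(i)}$ in place of the linear prediction.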
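A finite-difference check is a simple way to confirm the gradient formula (1/m) * X^T (p - y). The sketch below assumes random data and parameters generated only for the test; `bce_loss` and `bce_grad` are names introduced here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(w, b, X, y):
    """Mean binary cross-entropy for parameters (w, b)."""
    p = sigmoid(X @ w + b)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def bce_grad(w, b, X, y):
    """Analytic gradient: (1/m) X^T (p - y) for w, mean(p - y) for b."""
    p = sigmoid(X @ w + b)
    return X.T @ (p - y) / len(y), np.mean(p - y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.integers(0, 2, size=50).astype(float)
w = rng.normal(size=3)
b = 0.1

grad_w, grad_b = bce_grad(w, b, X, y)

# Central finite differences as an independent check.
eps = 1e-6
num_w = np.zeros_like(w)
for j in range(len(w)):
    step = np.zeros_like(w)
    step[j] = eps
    num_w[j] = (bce_loss(w + step, b, X, y) - bce_loss(w - step, b, X, y)) / (2 * eps)
num_b = (bce_loss(w, b + eps, X, y) - bce_loss(w, b - eps, X, y)) / (2 * eps)

print(np.allclose(grad_w, num_w, atol=1e-6), np.isclose(grad_b, num_b, atol=1e-6))
```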
Complete Derivation
Maximum Likelihood Estimation
The likelihood function for logistic regression:
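Likelihood Function
$$\mathcal{L}(w, b) = \prod_{i=1}^{m} \left(p^{(i)}\right)^{y^{(i)}} \left(1 - p^{(i)}\right)^{1 - y^{(i)}}, \qquad p^{(i)} = \sigma(w^T x^{(i)} + b)$$
Log-likelihood (easier to optimize):
$$\ell(w, b) = \sum_{i=1}^{m} \left[ y^{(i)} \log p^{(i)} + (1 - y^{(i)}) \log\left(1 - p^{(i)}\right) \right]$$
Negative log-likelihood = Cross-entropy loss:
$$J(w, b) = -\frac{1}{m}\, \ell(w, b)$$
Gradient Descent Update
$$w \leftarrow w - \alpha\, \frac{\partial J}{\partial w}, \qquad b \leftarrow b - \alpha\, \frac{\partial J}{\partial b}$$
where $\alpha$ is the learning rate.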
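The update rule can be turned into a from-scratch trainer in a few lines. This is a minimal sketch, not a replacement for a library solver: it assumes plain batch gradient descent on the mean cross-entropy with no regularization, and the learning rate, iteration count, and synthetic data (`true_w`, `true_b`) are illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iter=2000):
    """Batch gradient descent on the mean cross-entropy loss (no regularization)."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(n_iter):
        p = sigmoid(X @ w + b)
        w -= lr * (X.T @ (p - y)) / m   # w <- w - lr * dJ/dw
        b -= lr * np.mean(p - y)        # b <- b - lr * dJ/db
    return w, b

# Synthetic data drawn from a known model (values chosen for illustration).
rng = np.random.default_rng(42)
X = rng.normal(size=(2000, 2))
true_w, true_b = np.array([2.0, -1.0]), 0.5
y = rng.binomial(1, sigmoid(X @ true_w + true_b)).astype(float)

w, b = fit_logistic(X, y)
print("estimated w:", np.round(w, 2), " estimated b:", round(float(b), 2))

accuracy = np.mean((sigmoid(X @ w + b) >= 0.5) == y)
print(f"training accuracy: {accuracy:.3f}")
```

Because the labels are sampled from the model's own probabilities, the estimates should land near `true_w` and `true_b` but will not match them exactly.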
Binary Classification Implementation
📝Logistic Regression with Confusion Matrix Analysis
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score, confusion_matrix, classification_report,
    roc_curve, auc, precision_recall_curve
)

# Synthetic patient data: each column is drawn from a normal distribution
np.random.seed(42)
n = 1000
X = np.column_stack([
    np.random.normal(50, 15, n),    # age
    np.random.normal(120, 20, n),   # blood_pressure
    np.random.normal(200, 40, n),   # cholesterol
    np.random.normal(100, 30, n),   # glucose
    np.random.normal(28, 5, n)      # bmi
])

# True log-odds: a linear function of the centered features
logits = (
    0.05 * (X[:, 0] - 50) +
    0.02 * (X[:, 1] - 120) +
    0.01 * (X[:, 2] - 200) +
    0.03 * (X[:, 3] - 100) +
    0.1 * (X[:, 4] - 28) - 1
)
prob = 1 / (1 + np.exp(-logits))   # sigmoid turns log-odds into probabilities
y = np.random.binomial(1, prob)    # sample binary labels from those probabilities

df = pd.DataFrame(X, columns=['age', 'blood_pressure', 'cholesterol', 'glucose', 'bmi'])
df['heart_disease'] = y

# Stratified 80/20 split keeps the class ratio the same in train and test
X_train, X_test, y_train, y_test = train_test_split(
    df.drop('heart_disease', axis=1), df['heart_disease'],
    test_size=0.2, random_state=42, stratify=df['heart_disease']
)

# Fit the scaler on the training set only to avoid leaking test statistics
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = LogisticRegression(random_state=42, max_iter=1000)
model.fit(X_train_scaled, y_train)

y_pred = model.predict(X_test_scaled)              # class labels at threshold 0.5
y_prob = model.predict_proba(X_test_scaled)[:, 1]  # P(heart_disease = 1)

print("=== Model Performance ===")
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print("\nConfusion Matrix (rows = actual, columns = predicted):")
print(confusion_matrix(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
Confusion Matrix
Structure
                  Predicted
                 0         1
            +---------+---------+
    Actual 0|   TN    |   FP    |
            +---------+---------+
           1|   FN    |   TP    |
            +---------+---------+
TN = True Negative (correctly predicted 0)
FP = False Positive (incorrectly predicted 1) — Type I Error
FN = False Negative (incorrectly predicted 0) — Type II Error
TP = True Positive (correctly predicted 1)
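The four cells can be counted directly from label arrays. The sketch below uses made-up labels (`y_true`, `y_hat`) and lays the counts out with rows as actual classes and columns as predicted classes, matching the structure above.

```python
import numpy as np

# Made-up labels for illustration.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_hat  = np.array([0, 1, 0, 0, 1, 0, 1, 0, 1, 1])

tn = int(np.sum((y_true == 0) & (y_hat == 0)))  # correctly predicted 0
fp = int(np.sum((y_true == 0) & (y_hat == 1)))  # incorrectly predicted 1
fn = int(np.sum((y_true == 1) & (y_hat == 0)))  # incorrectly predicted 0
tp = int(np.sum((y_true == 1) & (y_hat == 1)))  # correctly predicted 1

# Rows are actual classes, columns are predicted classes.
cm = np.array([[tn, fp],
               [fn, tp]])
print(cm)
```

scikit-learn's `confusion_matrix(y_true, y_pred)` uses the same row/column layout for binary labels, so `tn, fp, fn, tp = cm.ravel()` unpacks it in this order.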
Metrics Derived from Confusion Matrix
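Accuracy
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
Here,
- $TP$ = True Positives
- $TN$ = True Negatives
- $FP$ = False Positives
- $FN$ = False Negatives
Precision
$$\text{Precision} = \frac{TP}{TP + FP}$$
Here,
- $TP$ = True Positives
- $FP$ = False Positives
ℹ️ Precision Interpretation
"Of all patients we predicted as positive, how many actually have the disease?"
Recall (Sensitivity)
$$\text{Recall} = \frac{TP}{TP + FN}$$
Here,
- $TP$ = True Positives
- $FN$ = False Negatives
ℹ️ Recall Interpretation
"Of all patients who actually have the disease, how many did we correctly identify?"
Specificity (True Negative Rate):
$$\text{Specificity} = \frac{TN}{TN + FP}$$
F1 Score
$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
Here,
- $\text{Precision}$ = Positive predictive value
- $\text{Recall}$ = True positive rate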
Visual Example
Medical Diagnosis Example (1000 patients)
==========================================
Predicted Healthy Predicted Sick
+--------------------+------------------+
Actually 0 | TN = 850 | FP = 50 |
Healthy | (Correct) | (False Alarm) |
+--------------------+------------------+
Actually 1 | FN = 30 | TP = 70 |
Sick | (Missed) | (Correct) |
+--------------------+------------------+
Metrics:
- Accuracy = (850+70)/1000 = 92%
- Precision = 70/(70+50) = 58.3%
- Recall = 70/(70+30) = 70%
- Specificity = 850/(850+50) = 94.4%
- F1 = 2 × 0.583 × 0.70 / (0.583 + 0.70) = 63.6%
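The metrics in this example follow mechanically from the four counts. The sketch below recomputes them and adds one comparison that is not in the table: the accuracy of a trivial model that labels every patient healthy, which illustrates why accuracy alone can mislead on imbalanced data.

```python
tn, fp, fn, tp = 850, 50, 30, 70
total = tn + fp + fn + tp

accuracy    = (tp + tn) / total
precision   = tp / (tp + fp)
recall      = tp / (tp + fn)
specificity = tn / (tn + fp)
f1          = 2 * precision * recall / (precision + recall)

print(f"Accuracy    = {accuracy:.1%}")
print(f"Precision   = {precision:.1%}")
print(f"Recall      = {recall:.1%}")
print(f"Specificity = {specificity:.1%}")
print(f"F1          = {f1:.1%}")

# A model that predicts "healthy" for everyone is right on every healthy patient.
baseline_accuracy = (tn + fp) / total
print(f"Always-healthy baseline accuracy = {baseline_accuracy:.1%}")
```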
Precision-Recall Tradeoff
[Figure: precision-recall curve, with Precision (up to 1.0) on the vertical axis and Recall (0 to 1.0) on the horizontal axis.]
Higher threshold → Higher Precision, Lower Recall
Lower threshold → Lower Precision, Higher Recall
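The tradeoff shows up as soon as the threshold is swept over a set of scores. This sketch assumes synthetic scores drawn from two Beta distributions (an arbitrary choice so that positives tend to score higher); the convention of reporting precision as 1.0 when nothing is predicted positive is also an assumption made here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic scores (illustrative): positives tend to score higher than negatives.
y_true = np.concatenate([np.zeros(700, dtype=int), np.ones(300, dtype=int)])
scores = np.concatenate([rng.beta(2, 5, size=700), rng.beta(5, 2, size=300)])

def precision_recall_at(threshold):
    """Precision and recall when predicting 1 for scores >= threshold."""
    y_hat = (scores >= threshold).astype(int)
    tp = np.sum((y_hat == 1) & (y_true == 1))
    fp = np.sum((y_hat == 1) & (y_true == 0))
    fn = np.sum((y_hat == 0) & (y_true == 1))
    precision = tp / (tp + fp) if (tp + fp) > 0 else 1.0
    recall = tp / (tp + fn)
    return precision, recall

for t in [0.2, 0.35, 0.5, 0.65, 0.8]:
    prec, rec = precision_recall_at(t)
    print(f"threshold = {t:.2f}  precision = {prec:.3f}  recall = {rec:.3f}")
```

Recall can only fall or stay flat as the threshold rises; precision usually rises but is not guaranteed to do so at every step.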
ROC Curve and AUC
ROC (Receiver Operating Characteristic) Curve
Plots True Positive Rate vs False Positive Rate at various thresholds:
TPR (Sensitivity)
^
1.0 | ___________
| /
| /
| / AUC = 0.92
| / (Excellent)
0.5 | /
| /
| /
| /
| /
| /
0.0 |/
+-------------------------> FPR (1-Specificity)
0 1.0
Diagonal line (random): AUC = 0.5
Perfect classifier: AUC = 1.0
AUC (Area Under ROC Curve)
| AUC Range | Interpretation |
|---|---|
| 0.9 - 1.0 | Excellent |
| 0.8 - 0.9 | Good |
| 0.7 - 0.8 | Fair |
| 0.6 - 0.7 | Poor |
| 0.5 - 0.6 | Fail (random) |
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='blue', lw=2, label=f'ROC curve (AUC = {roc_auc:.3f})')
plt.plot([0, 1], [0, 1], color='gray', linestyle='--', label='Random classifier')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.grid(True, alpha=0.3)
plt.show()
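AUC also has a ranking interpretation: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below checks this on synthetic scores (generated here for illustration) by counting correctly ordered pairs and comparing the result with scikit-learn's `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, size=200)
# Illustrative scores: positives are shifted upward by one unit.
scores = rng.normal(size=200) + 1.0 * y_true

pos = scores[y_true == 1]
neg = scores[y_true == 0]

# Fraction of (positive, negative) pairs ranked correctly; ties count as half.
diff = pos[:, None] - neg[None, :]
pairwise_auc = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

print(f"pairwise AUC : {pairwise_auc:.4f}")
print(f"roc_auc_score: {roc_auc_score(y_true, scores):.4f}")
```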
Multiclass Classification
One-vs-Rest (OvR)
Train $K$ binary classifiers, one per class:
Class 1 vs Rest Class 2 vs Rest Class 3 vs Rest
+---+---+---+ +---+---+---+ +---+---+---+
| 1 | 1 | 2 | | R | 2 | 2 | | R | R | 3 |
| 1 | R | R | | R | 2 | R | | R | 2 | 3 |
+---+---+---+ +---+---+---+ +---+---+---+
Final prediction: argmax of all classifier probabilities
One-vs-One (OvO)
Train one classifier for every pair of classes:
$$\text{Number of classifiers} = \binom{K}{2} = \frac{K(K-1)}{2}$$
Majority vote determines the final class.
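The difference in the number of models is easy to see with scikit-learn's `OneVsRestClassifier` and `OneVsOneClassifier` wrappers. This sketch assumes the built-in digits dataset (10 classes), chosen here only because the two strategies train clearly different numbers of classifiers on it.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X = StandardScaler().fit_transform(X)
K = len(np.unique(y))   # number of classes

# Each wrapper clones the base binary classifier as many times as it needs.
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

n_ovr = len(ovr.estimators_)
n_ovo = len(ovo.estimators_)
print(f"OvR trains {n_ovr} classifiers (K = {K})")
print(f"OvO trains {n_ovo} classifiers (K(K-1)/2 = {K * (K - 1) // 2})")
```

OvO trains more models, but each one sees only the samples of its two classes.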
Softmax (Multinomial Logistic Regression)
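For $K$ classes, directly model:
Softmax Function
$$P(y = k \mid x) = \frac{e^{w_k^T x + b_k}}{\sum_{j=1}^{K} e^{w_j^T x + b_j}}$$
Here,
- $K$ = Number of classes
- $w_k$ = Weight vector for class k
- $b_k$ = Bias for class k
Properties:
- Reduces to sigmoid when $K = 2$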
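A short sketch of the softmax function, assuming raw class scores z_k = w_k^T x + b_k are already computed (the numbers below are arbitrary). Subtracting the maximum score before exponentiating is a common numerical-stability step that does not change the result. The last lines check the two-class case, where the class-1 probability equals the sigmoid of the score difference.

```python
import numpy as np

def softmax(z):
    """Softmax over the last axis, shifted by the max for numerical stability."""
    z = np.asarray(z, dtype=float)
    shifted = z - z.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

scores = np.array([2.0, 1.0, -1.0])   # one score per class
probs = softmax(scores)
print(probs, probs.sum())

# With K = 2, the class-1 probability equals sigmoid(z_1 - z_0).
z0, z1 = 0.3, 1.7
two_class = softmax([z0, z1])
print(two_class[1], sigmoid(z1 - z0))
```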
Feature Importance and Odds Ratios
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

feature_names = ['Age', 'Cholesterol', 'Blood Pressure', 'Exercise Hours']

# Synthetic data; columns follow the order of feature_names
rng = np.random.default_rng(42)
X_train = rng.standard_normal((500, 4)) * [15, 40, 20, 5] + [50, 200, 120, 3]

# Labels depend on age and blood pressure only. Centering the features keeps
# both classes well represented, which LogisticRegression needs in order to fit.
logits = 0.05 * (X_train[:, 0] - 50) + 0.02 * (X_train[:, 2] - 120)
y_train = rng.binomial(1, 1 / (1 + np.exp(-logits)))

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)
model = LogisticRegression(random_state=42)
model.fit(X_scaled, y_train)

# Features are standardized, so each odds ratio is per one standard deviation
coefficients = model.coef_[0]
print("Feature Importance Analysis:")
print("-" * 50)
for name, coef in zip(feature_names, coefficients):
    odds_ratio = np.exp(coef)
    direction = "increases" if coef > 0 else "decreases"
    print(f"{name:20s}: coef={coef:.3f}, OR={odds_ratio:.3f} ({direction} risk)")
Key Takeaways
📋Summary: Logistic Regression
- Logistic Regression outputs probabilities: the sigmoid function maps any real number to $(0, 1)$
- Cross-entropy loss is convex → no spurious local minima, so gradient descent with a suitable learning rate converges to the global minimum
- Coefficient interpretation: $e^{w_j}$ gives the odds ratio for a one-unit increase in feature j
- Confusion Matrix is essential — accuracy alone is misleading for imbalanced data
- Precision vs Recall tradeoff: Adjust threshold based on business requirements
- AUC-ROC provides threshold-independent evaluation; AUC > 0.8 is generally good
- Multiclass: Use softmax for direct multiclass or OvR/OvO strategies
Practice Exercises
Exercise 1: Binary Classification
from sklearn.datasets import make_classification
X, y = make_classification(
n_samples=1000, n_features=20, n_informative=10,
n_classes=2, weights=[0.7, 0.3], random_state=42
)
# a) Split data and train model
# b) Print confusion matrix
# c) Calculate precision, recall, F1
# d) Plot ROC curve and report AUC
Exercise 2: Threshold Optimization
import numpy as np
from sklearn.metrics import f1_score
# Uses y_test and y_prob from the binary classification implementation
thresholds = np.arange(0.1, 0.9, 0.01)
f1_scores = [f1_score(y_test, (y_prob >= t).astype(int)) for t in thresholds]
optimal_threshold = thresholds[np.argmax(f1_scores)]
print(f"Optimal threshold: {optimal_threshold:.2f}")
Exercise 3: Multiclass Problem
from sklearn.datasets import load_wine
# a) multinomial logistic regression
# b) one-vs-rest logistic regression
# c) Which performs better? Why?
Exercise 4: Cost-Sensitive Learning
- Train a model with class_weight='balanced'
- Compare confusion matrices with unweighted model
- In which scenarios is the weighted model better?
Discussion Questions
- When would you prioritize recall over precision (and vice versa)?
- Why might AUC be preferred over accuracy for imbalanced datasets?
- How does regularization affect logistic regression coefficients?