Logistic Regression: Sigmoid, Decision Boundary and Multi-class

Module 7: Machine Learning Fundamentals

The Bridge from Regression to Classification

Logistic regression extends linear regression to classification problems. Despite its name, it's a classification algorithm that predicts probabilities using the sigmoid function.

From Linear to Logistic: The Transformation

[Figure: Linear Regression → Logistic Regression. Linear regression predicts ŷ ∈ (−∞, +∞), so a value such as ŷ = 1.8 is not a valid probability. Passing the linear output through σ(z) gives logistic regression, which predicts a probability P(y=1) between 0 and 1.]

1. The Sigmoid Function

Mathematical Definition

Sigmoid (Logistic) Function:

$$\sigma(z) = \frac{1}{1 + e^{-z}} = \frac{e^z}{1 + e^z}$$

Properties:

  • Output range: $(0, 1)$, so the output can always be read as a probability
  • Symmetry: $\sigma(-z) = 1 - \sigma(z)$
  • Derivative: $\sigma'(z) = \sigma(z)(1 - \sigma(z))$
  • At $z = 0$: $\sigma(0) = 0.5$ (decision boundary)
  • Asymptotes: $\lim_{z \to \infty} \sigma(z) = 1$, $\lim_{z \to -\infty} \sigma(z) = 0$
[Figure: Sigmoid function σ(z) = 1/(1 + e⁻ᶻ). The curve passes through σ(0) = 0.5, the decision boundary between Class 0 (P < 0.5) and Class 1 (P > 0.5), with σ(−3) ≈ 0.05 and σ(3) ≈ 0.95.]
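The properties above can be checked numerically. The following is a minimal sketch using only the standard library; the function name `sigmoid` and the branch on the sign of `z` (to avoid overflow in the exponential) are illustrative choices, not something the lesson prescribes.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function for a scalar z, written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use the equivalent form e^z / (1 + e^z).
    ez = math.exp(z)
    return ez / (1.0 + ez)

for z in (-3.0, 0.0, 3.0):
    print(f"sigmoid({z:+.1f}) = {sigmoid(z):.4f}")

# Symmetry: sigmoid(-z) should equal 1 - sigmoid(z).
symmetry_gap = abs(sigmoid(-2.0) - (1.0 - sigmoid(2.0)))
print(f"symmetry gap: {symmetry_gap:.2e}")
```

Both algebraic forms of σ(z) are used here: the first for non-negative inputs, the second for negative ones, so the exponent passed to `math.exp` is never large and positive.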

Derivative of Sigmoid

The derivative has an elegant form that enables efficient backpropagation:

Derivation:

$$\sigma'(z) = \frac{d}{dz}\left(\frac{1}{1 + e^{-z}}\right) = \frac{e^{-z}}{(1 + e^{-z})^2}$$
$$= \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}} = \sigma(z) \cdot (1 - \sigma(z))$$

Key Insight: The derivative is expressed in terms of the function's own output, so once σ(z) has been computed in the forward pass, the gradient needs no extra exponentials.
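One way to sanity-check the closed form is to compare it with a central finite difference. This is a small sketch; the helper names `sigmoid_grad` and `numeric_grad` and the step size `h` are assumptions made for illustration.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z: float) -> float:
    """Closed-form derivative: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

def numeric_grad(f, z: float, h: float = 1e-6) -> float:
    """Central finite-difference approximation of f'(z)."""
    return (f(z + h) - f(z - h)) / (2.0 * h)

for z in (-2.0, 0.0, 1.5):
    analytic = sigmoid_grad(z)
    numeric = numeric_grad(sigmoid, z)
    print(f"z={z:+.1f}  analytic={analytic:.6f}  numeric={numeric:.6f}")
```

The derivative peaks at z = 0, where it equals 0.25, and shrinks toward zero as |z| grows.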

Log-Odds Interpretation

From Probability to Log-Odds:

$$p = \sigma(z) = \frac{1}{1 + e^{-z}} \implies \frac{p}{1-p} = e^z$$
$$\ln\left(\frac{p}{1-p}\right) = z = \mathbf{w}^T\mathbf{x} + b$$

Interpretation:

  • $\frac{p}{1-p}$ = odds (ratio of success to failure)
  • $\ln\left(\frac{p}{1-p}\right)$ = log-odds (logit function)
  • $w_j$ = change in log-odds for a one-unit increase in $x_j$, with the other features held fixed
  • $e^{w_j}$ = odds ratio: the multiplicative effect on the odds

| Probability $p$ | Odds $\frac{p}{1-p}$ | Log-Odds $\ln\left(\frac{p}{1-p}\right)$ |
|---|---|---|
| 0.01 | 0.0101 | −4.60 |
| 0.1 | 0.111 | −2.20 |
| 0.3 | 0.429 | −0.85 |
| 0.5 | 1.0 | 0.0 |
| 0.7 | 2.333 | 0.85 |
| 0.9 | 9.0 | 2.20 |
| 0.99 | 99.0 | 4.60 |
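The table values can be reproduced with a few lines of Python. The sketch below also illustrates the odds-ratio interpretation with a made-up coefficient `w_j = 0.7` and starting probability `p_before = 0.3`; both numbers are hypothetical and chosen only for the example.

```python
import math

def odds(p: float) -> float:
    return p / (1.0 - p)

def logit(p: float) -> float:
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

def inv_logit(z: float) -> float:
    """Inverse of the logit, i.e. the sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))

for p in (0.01, 0.1, 0.3, 0.5, 0.7, 0.9, 0.99):
    print(f"p={p:<5} odds={odds(p):8.4f} log-odds={logit(p):+.2f}")

# Hypothetical coefficient: a one-unit increase in x_j multiplies the odds by e^{w_j}.
w_j = 0.7
p_before = 0.3
odds_after = odds(p_before) * math.exp(w_j)
p_after = odds_after / (1.0 + odds_after)
print(f"odds ratio = {math.exp(w_j):.3f}, p goes from {p_before} to {p_after:.3f}")
```

Note that the change in probability depends on the starting point, while the change in log-odds is always exactly `w_j`.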

2. Decision Boundary

Linear Decision Boundary

The decision boundary is the hypersurface where $P(y=1|\mathbf{x}) = 0.5$, which occurs when $\mathbf{w}^T\mathbf{x} + b = 0$.

Binary Classification (2D Features):

$$w_1 x_1 + w_2 x_2 + b = 0$$
$$x_2 = -\frac{w_1 x_1 + b}{w_2} \quad (w_2 \neq 0)$$

This is always a line (hyperplane in higher dimensions).
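As a quick illustration of the 2D case, the sketch below picks hypothetical parameters `w1`, `w2`, `b` (not taken from any fitted model), solves for `x2` on the boundary, and confirms that the predicted probability there is 0.5.

```python
import math

# Hypothetical parameters, chosen only for illustration.
w1, w2, b = 2.0, -1.0, 0.5

def boundary_x2(x1: float) -> float:
    """x2 on the decision boundary for a given x1 (requires w2 != 0)."""
    return -(w1 * x1 + b) / w2

def predict_proba(x1: float, x2: float) -> float:
    z = w1 * x1 + w2 * x2 + b
    return 1.0 / (1.0 + math.exp(-z))

for x1 in (-1.0, 0.0, 2.0):
    x2 = boundary_x2(x1)
    print(f"x1={x1:+.1f}  x2={x2:+.2f}  P(y=1)={predict_proba(x1, x2):.3f}")

# A point displaced from the boundary in the direction of w lands on the Class 1 side.
p_side = predict_proba(1.0, 0.0)
print(f"P(y=1) at (1, 0) = {p_side:.3f}")
```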

[Figure: Linear decision boundary w₁x₁ + w₂x₂ + b = 0 in the (x₁, x₂) plane, separating Class 1 from Class 0. The weight vector w is normal to the boundary.]

Non-Linear Decision Boundaries

By adding polynomial features, logistic regression can learn non-linear boundaries:

Polynomial Features:

$$\mathbf{w}^T\phi(\mathbf{x}) + b = w_1 x_1 + w_2 x_2 + w_3 x_1^2 + w_4 x_2^2 + w_5 x_1 x_2 + b = 0$$

This creates elliptical, parabolic, or other conic section boundaries.

[Figure: Non-linear decision boundaries via polynomial features. Panels: Linear, Circular, Elliptical.]

Add polynomial features: x₁², x₂², x₁x₂ to learn curved boundaries
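To see how squared terms produce a curved boundary, consider a hand-picked weight vector that makes the boundary a circle of radius 2. This is a sketch only: the feature map `phi`, the weights `w`, and the bias `b` are hypothetical values chosen so that the boundary is $x_1^2 + x_2^2 = 4$, not the result of training.

```python
import math

def phi(x1: float, x2: float) -> list:
    """Degree-2 polynomial feature map: [x1, x2, x1^2, x2^2, x1*x2]."""
    return [x1, x2, x1 ** 2, x2 ** 2, x1 * x2]

# Hypothetical weights: z = 4 - x1^2 - x2^2, so Class 1 lies inside the circle.
w = [0.0, 0.0, -1.0, -1.0, 0.0]
b = 4.0

def predict_proba(x1: float, x2: float) -> float:
    z = sum(wi * fi for wi, fi in zip(w, phi(x1, x2))) + b
    return 1.0 / (1.0 + math.exp(-z))

for point in [(0.0, 0.0), (2.0, 0.0), (3.0, 0.0)]:
    print(point, f"P(y=1) = {predict_proba(*point):.3f}")
```

The model is still linear in the transformed features; only the mapping back to the original $(x_1, x_2)$ plane is curved.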


3. Cost Function: Binary Cross-Entropy

Why Not MSE for Classification?

MSE with Sigmoid Creates Non-Convex Loss:

$$J_{\text{MSE}} = \frac{1}{n}\sum_{i=1}^{n}\left(\sigma(\mathbf{w}^T\mathbf{x}_i + b) - y_i\right)^2$$
  • Problem: The loss is non-convex in $\mathbf{w}$, so gradient descent can get stuck in local minima or flat regions
  • Gradient: Near saturation ($\sigma \approx 0$ or $1$), the gradient is $\approx 0$ → slow learning, even when the prediction is wrong
[Figure: MSE loss surface (non-convex, multiple local minima, unreliable optimization) vs. cross-entropy loss surface (convex, single global minimum).]
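The saturation problem can be made concrete by comparing per-sample gradients with respect to the logit z. The sketch below assumes a single sample with true label y = 1 and uses the squared error $(\sigma(z) - y)^2$ and the binary cross-entropy $-\log \sigma(z)$; the function names are illustrative.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def mse_grad_z(z: float, y: int) -> float:
    """d/dz of (sigmoid(z) - y)^2 = 2 (p - y) p (1 - p)."""
    p = sigmoid(z)
    return 2.0 * (p - y) * p * (1.0 - p)

def bce_grad_z(z: float, y: int) -> float:
    """d/dz of the binary cross-entropy = p - y."""
    return sigmoid(z) - y

y = 1
for z in (-10.0, -5.0, 0.0):
    print(f"z={z:+5.1f}  MSE grad={mse_grad_z(z, y):+.6f}  BCE grad={bce_grad_z(z, y):+.6f}")
```

At z = −10 the prediction is confidently wrong, yet the MSE gradient is nearly zero because of the extra $p(1-p)$ factor, while the cross-entropy gradient stays close to −1.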

Binary Cross-Entropy Derivation

Single Sample Loss:

For $y = 1$: $L = -\log(\hat{p})$ → penalizes a low predicted probability when the true class is 1

For $y = 0$: $L = -\log(1-\hat{p})$ → penalizes a high predicted probability when the true class is 0

Combined:

$$\mathcal{L}(\hat{p}, y) = -\left[y \log(\hat{p}) + (1-y) \log(1-\hat{p})\right]$$
[Figure: Cross-entropy loss components vs. predicted probability p̂. The curve −log(p̂) applies when y = 1 and −log(1 − p̂) when y = 0; the two cross at p̂ = 0.5. When y = 1 and p̂ → 0, or y = 0 and p̂ → 1, the loss → ∞, so confident wrong predictions are penalized heavily.]
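The combined loss is short enough to write out directly. In this sketch the clipping constant `eps` is an assumption added to keep the logarithm finite when a prediction is exactly 0 or 1; it is not part of the formula above.

```python
import math

def binary_cross_entropy(y: int, p_hat: float, eps: float = 1e-12) -> float:
    """Per-sample binary cross-entropy with clipping to avoid log(0)."""
    p = min(max(p_hat, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

cases = [(1, 0.9), (1, 0.1), (0, 0.1), (0, 0.9)]
for y, p_hat in cases:
    print(f"y={y}  p_hat={p_hat}  loss={binary_cross_entropy(y, p_hat):.4f}")
```

A confident correct prediction (y = 1, p̂ = 0.9) costs about 0.105, while the same confidence in the wrong direction (y = 1, p̂ = 0.1) costs about 2.303.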

Average Cross-Entropy Loss

Full Cost Function:

$$J(\mathbf{w}) = -\frac{1}{n}\sum_{i=1}^{n}\left[y_i \log(\hat{p}_i) + (1-y_i)\log(1-\hat{p}_i)\right]$$

Where $\hat{p}_i = \sigma(\mathbf{w}^T\mathbf{x}_i + b)$

Gradient (same form as linear regression!):

$$\frac{\partial J}{\partial w_j} = \frac{1}{n}\sum_{i=1}^{n}(\hat{p}_i - y_i)x_{ij}$$
$$\frac{\partial J}{\partial b} = \frac{1}{n}\sum_{i=1}^{n}(\hat{p}_i - y_i)$$
[Figure: Contours of the convex cross-entropy loss over (w₁, w₂), with the gradient descent path converging to the optimum.]
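The two gradient formulas are all that batch gradient descent needs. Below is a minimal NumPy sketch on synthetic data; the learning rate, iteration count, data-generating weights (`true_w`, `true_b`), and function names are assumptions made for this example rather than recommended settings.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(w, b, X, y):
    """Average binary cross-entropy J(w) over the data set."""
    p_hat = sigmoid(X @ w + b)
    return -np.mean(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

def fit_logistic(X, y, lr=0.1, n_iter=2000):
    """Batch gradient descent using dJ/dw = X^T (p_hat - y) / n."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_iter):
        error = sigmoid(X @ w + b) - y   # (p_hat_i - y_i) for every sample
        w -= lr * (X.T @ error) / n      # gradient with respect to w
        b -= lr * error.mean()           # gradient with respect to b
    return w, b

# Synthetic data drawn from a known (hypothetical) logistic model.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
true_w, true_b = np.array([2.0, -1.0]), 0.5
y = rng.binomial(1, sigmoid(X @ true_w + true_b))

initial_loss = bce_loss(np.zeros(2), 0.0, X, y)
w, b = fit_logistic(X, y)
final_loss = bce_loss(w, b, X, y)
accuracy = np.mean((sigmoid(X @ w + b) >= 0.5) == y)

print("learned w:", np.round(w, 2), "b:", round(float(b), 2))
print(f"loss: {initial_loss:.4f} -> {final_loss:.4f}, train accuracy: {accuracy:.3f}")
```

With all-zero parameters every prediction is 0.5, so the starting loss equals log 2; the fitted weights should land near the generating values, up to sampling noise.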

4. Maximum Likelihood Estimation

Likelihood Function

Bernoulli Likelihood:

Each observation follows $y_i \sim \text{Bernoulli}(\hat{p}_i)$

$$P(y_i | \mathbf{x}_i; \mathbf{w}) = \hat{p}_i^{y_i}(1-\hat{p}_i)^{1-y_i}$$

Joint Likelihood (i.i.d. samples):

$$L(\mathbf{w}) = \prod_{i=1}^{n} \hat{p}_i^{y_i}(1-\hat{p}_i)^{1-y_i}$$

Log-Likelihood (easier to optimize):

$$\ell(\mathbf{w}) = \sum_{i=1}^{n}\left[y_i \log(\hat{p}_i) + (1-y_i)\log(1-\hat{p}_i)\right]$$

Negative Log-Likelihood = Cross-Entropy Loss:

$$J(\mathbf{w}) = -\frac{1}{n}\ell(\mathbf{w})$$
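The chain from likelihood to log-likelihood to cross-entropy can be traced on a handful of numbers. The labels and predicted probabilities below are made up for illustration.

```python
import math

# Hypothetical labels and predicted probabilities for five samples.
y = [1, 0, 1, 1, 0]
p_hat = [0.9, 0.2, 0.6, 0.7, 0.4]

# Joint Bernoulli likelihood: product of p^y * (1 - p)^(1 - y).
likelihood = math.prod(p ** yi * (1 - p) ** (1 - yi) for yi, p in zip(y, p_hat))

# Log-likelihood: the same quantity as a sum of logs.
log_likelihood = sum(
    yi * math.log(p) + (1 - yi) * math.log(1 - p) for yi, p in zip(y, p_hat)
)

# Cross-entropy is the negative log-likelihood averaged over samples.
cross_entropy = -log_likelihood / len(y)

print(f"likelihood      = {likelihood:.5f}")
print(f"log-likelihood  = {log_likelihood:.5f}")
print(f"cross-entropy J = {cross_entropy:.5f}")
```

Because the logarithm is monotonic, maximizing the likelihood and minimizing the cross-entropy select the same parameters; the sum form is simply better behaved numerically than a long product of small numbers.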

Gradient Derivation

Chain Rule Application:

$$\frac{\partial J}{\partial w_j} = \frac{1}{n}\sum_{i=1}^{n} \frac{\partial \mathcal{L}_i}{\partial \hat{p}_i} \cdot \frac{\partial \hat{p}_i}{\partial z_i} \cdot \frac{\partial z_i}{\partial w_j}$$

Where $\mathcal{L}_i = \mathcal{L}(\hat{p}_i, y_i)$ is the loss for sample $i$ and $z_i = \mathbf{w}^T\mathbf{x}_i + b$:

  • $\frac{\partial \mathcal{L}_i}{\partial \hat{p}_i} = \frac{\hat{p}_i - y_i}{\hat{p}_i(1-\hat{p}_i)}$
  • $\frac{\partial \hat{p}_i}{\partial z_i} = \hat{p}_i(1-\hat{p}_i)$
  • $\frac{\partial z_i}{\partial w_j} = x_{ij}$

Result:

$$\frac{\partial J}{\partial w_j} = \frac{1}{n}\sum_{i=1}^{n}(\hat{p}_i - y_i)x_{ij}$$

The $\hat{p}_i(1-\hat{p}_i)$ terms cancel, leaving only the prediction error $(\hat{p}_i - y_i)$ multiplied by the feature value.
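A standard way to confirm a derived gradient is a finite-difference check. The sketch below compares the closed-form result with numerical derivatives of the loss on random data; the data, the step size `h`, and the helper names are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, b, X, y):
    p_hat = sigmoid(X @ w + b)
    return -np.mean(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

def analytic_grad(w, b, X, y):
    """Closed-form gradients: X^T (p_hat - y) / n and mean(p_hat - y)."""
    error = sigmoid(X @ w + b) - y
    return X.T @ error / len(y), error.mean()

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = rng.integers(0, 2, size=50).astype(float)
w = rng.normal(size=3)
b = 0.1

grad_w, grad_b = analytic_grad(w, b, X, y)

# Central differences along each coordinate direction.
h = 1e-6
numeric_w = np.array([
    (loss(w + h * e, b, X, y) - loss(w - h * e, b, X, y)) / (2 * h)
    for e in np.eye(3)
])
numeric_b = (loss(w, b + h, X, y) - loss(w, b - h, X, y)) / (2 * h)

print("analytic:", np.round(grad_w, 6), round(float(grad_b), 6))
print("numeric: ", np.round(numeric_w, 6), round(float(numeric_b), 6))
```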


5. Multi-class Classification

One-vs-Rest (OvR)

Train KK binary classifiers, one per class:

[Figure: One-vs-Rest strategy with K = 3 binary classifiers. Classifier 1: Class 1 vs Rest, P(y=1|x) vs P(y≠1|x). Classifier 2: Class 2 vs Rest, P(y=2|x) vs P(y≠2|x). Classifier 3: Class 3 vs Rest, P(y=3|x) vs P(y≠3|x).]

Final: ŷ = argmax[P(y=1|x), P(y=2|x), P(y=3|x)]
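The prediction step of One-vs-Rest can be sketched as follows. The weight matrix `W` and biases `b` stand in for three already-trained binary classifiers and are hypothetical values; class indices are 0-based here, so index 0 corresponds to Class 1 in the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters of K = 3 binary classifiers on 2 features
# (row k holds the weights of the "class k vs rest" model).
W = np.array([[ 2.0, -1.0],
              [-1.0,  2.0],
              [-1.0, -1.0]])
b = np.array([0.0, 0.0, 0.5])

def ovr_predict(X):
    """Score every sample with each binary classifier and take the argmax."""
    scores = sigmoid(X @ W.T + b)        # shape (n_samples, K)
    return scores.argmax(axis=1), scores

X_new = np.array([[1.5, 0.0], [0.0, 1.5], [-1.0, -1.0]])
labels, scores = ovr_predict(X_new)

print("predicted class indices:", labels)
print("per-class scores:\n", np.round(scores, 3))
print("row sums:", np.round(scores.sum(axis=1), 3))
```

The row sums show that OvR scores come from independent models and generally do not add up to 1, unlike the softmax probabilities of a multinomial model.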

Softmax (Multinomial Logistic Regression)

For direct multi-class modeling, use the softmax function:

Softmax Function:

$$P(y = k | \mathbf{x}) = \frac{e^{\mathbf{w}_k^T\mathbf{x} + b_k}}{\sum_{j=1}^{K} e^{\mathbf{w}_j^T\mathbf{x} + b_j}}$$

Properties:

  • $\sum_{k=1}^{K} P(y=k|\mathbf{x}) = 1$ (valid probability distribution)
  • Reduces to sigmoid when $K = 2$
  • Monotonic: higher logit → higher probability
  • Differentiable everywhere
[Figure: Softmax converting logits to probabilities. Logits z₁ = 2.0, z₂ = 1.0, z₃ = 0.1 are exponentiated (e²·⁰ = 7.39, e¹·⁰ = 2.72, e⁰·¹ = 1.11) and divided by their sum, giving P(y=1) = 0.66, P(y=2) = 0.24, P(y=3) = 0.10.]

ŷ = argmax(P) = Class 1 (highest probability)
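The worked example with logits 2.0, 1.0, and 0.1 can be reproduced in a few lines. Subtracting the maximum logit before exponentiating is a common stability trick and an addition of this sketch; it leaves the result unchanged because softmax is invariant to shifting all logits by a constant.

```python
import math

def softmax(logits):
    """Softmax over a list of logits, shifted by the max for numerical stability."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 2) for p in probs])      # [0.66, 0.24, 0.1]
print("predicted class index:", probs.index(max(probs)))

# With K = 2 and the second logit fixed at 0, softmax matches the sigmoid.
z = 1.3
print(softmax([z, 0.0])[0], sigmoid(z))
```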

Cross-Entropy for Multi-class

Categorical Cross-Entropy Loss:

$$J(\mathbf{W}) = -\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{K} y_{ik} \log(\hat{p}_{ik})$$

Where $y_{ik}$ is 1 if sample $i$ belongs to class $k$, else 0 (one-hot encoded).

Gradient for class k:

$$\frac{\partial J}{\partial \mathbf{w}_k} = \frac{1}{n}\sum_{i=1}^{n}(\hat{p}_{ik} - y_{ik})\mathbf{x}_i$$
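In matrix form, the loss and the gradient for all classes fit in a few NumPy lines. This is a sketch on random data; the shapes, the clipping constant `eps`, and the helper names are assumptions for illustration. Row k of the returned gradient corresponds to $\partial J / \partial \mathbf{w}_k$.

```python
import numpy as np

def softmax_rows(Z):
    """Row-wise softmax with the max-shift for stability."""
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def categorical_cross_entropy(P, Y, eps=1e-12):
    """Mean over samples of -sum_k y_ik * log(p_ik)."""
    return -np.mean(np.sum(Y * np.log(np.clip(P, eps, 1.0)), axis=1))

def grad_W(X, P, Y):
    """Row k is (1/n) * sum_i (p_ik - y_ik) * x_i."""
    return (P - Y).T @ X / X.shape[0]

rng = np.random.default_rng(0)
n, d, K = 6, 3, 4
X = rng.normal(size=(n, d))
labels = rng.integers(0, K, size=n)
Y = np.eye(K)[labels]                    # one-hot encoding of the labels

W = np.zeros((K, d))                     # untrained model: uniform predictions
b = np.zeros(K)
P = softmax_rows(X @ W.T + b)

loss = categorical_cross_entropy(P, Y)
G = grad_W(X, P, Y)
print(f"loss at zero weights: {loss:.4f} (log K = {np.log(K):.4f})")
print("gradient shape:", G.shape)
```

With all-zero weights every class gets probability 1/K, so the loss equals log K, which is a handy sanity check before training.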

6. Regularization

L1 Regularization (Lasso)

L1 Penalty:

$$J_{\text{L1}}(\mathbf{w}) = J(\mathbf{w}) + \lambda \sum_{j=1}^{p} |w_j|$$

Effect: Drives some coefficients to exactly zero → feature selection

Geometric Interpretation: Diamond constraint region → corner solutions

L2 Regularization (Ridge)

L2 Penalty:

$$J_{\text{L2}}(\mathbf{w}) = J(\mathbf{w}) + \lambda \sum_{j=1}^{p} w_j^2$$

Effect: Shrinks all coefficients toward zero, but typically none to exactly zero

Geometric Interpretation: Circular constraint region → smooth solutions

Elastic Net

Combined Penalty:

$$J_{\text{EN}}(\mathbf{w}) = J(\mathbf{w}) + \lambda_1 \sum_{j=1}^{p} |w_j| + \lambda_2 \sum_{j=1}^{p} w_j^2$$

Advantage: Handles correlated features better than L1 alone

[Figure: Regularization constraint regions. L1 (Lasso): diamond region, corner solutions → sparsity. L2 (Ridge): circular region, smooth shrinkage, no zeros. Elastic Net: a blend of the two.]
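The three penalty terms, and the reason L1 tends to zero out small weights while L2 does not, can be illustrated directly. The weight vector and λ values below are hypothetical, and the L1 "gradient" is a subgradient (using 0 at w = 0).

```python
import numpy as np

def l1_penalty(w, lam):
    return lam * np.sum(np.abs(w))

def l2_penalty(w, lam):
    return lam * np.sum(w ** 2)

def elastic_net_penalty(w, lam1, lam2):
    return l1_penalty(w, lam1) + l2_penalty(w, lam2)

def l1_subgradient(w, lam):
    """Constant-magnitude pull toward zero, regardless of |w_j|."""
    return lam * np.sign(w)

def l2_gradient(w, lam):
    """Pull proportional to w_j, so it fades as w_j approaches zero."""
    return 2.0 * lam * w

w = np.array([3.0, -0.5, 0.01, 0.0])     # hypothetical coefficients
lam = 0.1

print("L1 penalty:         ", l1_penalty(w, lam))
print("L2 penalty:         ", l2_penalty(w, lam))
print("Elastic Net penalty:", elastic_net_penalty(w, lam, lam))
print("L1 subgradient:", l1_subgradient(w, lam))
print("L2 gradient:   ", l2_gradient(w, lam))
```

For the small weight 0.01, the L1 pull is still 0.1 while the L2 pull is only 0.002, which is one way to see why L1 can drive coefficients all the way to zero.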

7. Complete Python Implementation

Binary Classification with Evaluation

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score, confusion_matrix, classification_report,
    roc_curve, auc, precision_recall_curve
)

np.random.seed(42)

# Generate synthetic medical data
n = 1000
X = np.column_stack([
    np.random.normal(50, 15, n),    # age
    np.random.normal(120, 20, n),   # blood_pressure
    np.random.normal(200, 40, n),   # cholesterol
    np.random.normal(100, 30, n),   # glucose
    np.random.normal(28, 5, n)      # BMI
])

# True relationship
logits = (
    0.05 * (X[:, 0] - 50) +
    0.02 * (X[:, 1] - 120) +
    0.01 * (X[:, 2] - 200) +
    0.03 * (X[:, 3] - 100) +
    0.1 * (X[:, 4] - 28) - 1
)
prob = 1 / (1 + np.exp(-logits))
y = np.random.binomial(1, prob)

df = pd.DataFrame(X, columns=['age', 'blood_pressure', 'cholesterol', 'glucose', 'bmi'])
df['heart_disease'] = y

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    df.drop('heart_disease', axis=1), df['heart_disease'],
    test_size=0.2, random_state=42, stratify=df['heart_disease']
)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train model
model = LogisticRegression(random_state=42, max_iter=1000)
model.fit(X_train_scaled, y_train)

# Predictions
y_pred = model.predict(X_test_scaled)
y_prob = model.predict_proba(X_test_scaled)[:, 1]

print("=== Model Performance ===")
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"\nClassification Report:")
print(classification_report(y_test, y_pred))

Feature Importance and Odds Ratios

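feature_names = ['Age', 'Blood Pressure', 'Cholesterol', 'Glucose', 'BMI']
# The model was fit on standardized features, so each coefficient (and odds
# ratio) refers to a one-standard-deviation increase, not one raw unit.
coefficients = model.coef_[0]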

print("\n=== Feature Importance ===")
print(f"{'Feature':<20} {'Coefficient':>12} {'Odds Ratio':>12} {'Effect':>15}")
print("-" * 62)

for name, coef in zip(feature_names, coefficients):
    odds_ratio = np.exp(coef)
    direction = "increases" if coef > 0 else "decreases"
    print(f"{name:<20} {coef:>12.4f} {odds_ratio:>12.4f} {direction:>15}")

print(f"\nIntercept: {model.intercept_[0]:.4f}")

Decision Boundary Visualization

from matplotlib.colors import ListedColormap

# Use first two features for visualization
X_2d = X_train_scaled[:, :2]
y_2d = y_train

model_2d = LogisticRegression(random_state=42)
model_2d.fit(X_2d, y_2d)

# Create mesh grid
h = 0.02
x_min, x_max = X_2d[:, 0].min() - 1, X_2d[:, 0].max() + 1
y_min, y_max = X_2d[:, 1].min() - 1, X_2d[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

# Predict on mesh
Z = model_2d.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

# Plot
plt.figure(figsize=(10, 8))
# Background colors: class 0 = light red, class 1 = light blue
cmap_light = ListedColormap(['#FFAAAA', '#AAAAFF'])

plt.contourf(xx, yy, Z, alpha=0.3, cmap=cmap_light)
plt.contour(xx, yy, Z, colors='black', linewidths=0.5)

# Point colors match the background region of each class
plt.scatter(X_2d[y_2d == 0, 0], X_2d[y_2d == 0, 1], c='red',
            label='Class 0', alpha=0.6, edgecolors='black')
plt.scatter(X_2d[y_2d == 1, 0], X_2d[y_2d == 1, 1], c='blue',
            label='Class 1', alpha=0.6, edgecolors='black')

plt.xlabel('Feature 1 (standardized)')
plt.ylabel('Feature 2 (standardized)')
plt.title('Logistic Regression Decision Boundary')
plt.legend()
plt.show()

ROC Curve and AUC

fpr, tpr, thresholds = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)

plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='blue', lw=2, label=f'ROC curve (AUC = {roc_auc:.3f})')
plt.plot([0, 1], [0, 1], color='gray', linestyle='--', label='Random classifier')
plt.fill_between(fpr, tpr, alpha=0.2, color='blue')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.grid(True, alpha=0.3)
plt.show()

8. Threshold Optimization

The default threshold of 0.5 is not always optimal!

Cost-Sensitive Threshold Selection:

$$\text{Expected Cost} = FN \cdot C_{FN} + FP \cdot C_{FP}$$

Where $C_{FN}$ = cost of a false negative (missed disease), $C_{FP}$ = cost of a false alarm.
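The cost formula above can be turned into a threshold search. In this sketch the labels, predicted probabilities, and the costs `c_fn = 10` and `c_fp = 1` are all made-up values, chosen so that missing a positive is ten times worse than a false alarm.

```python
import numpy as np

def expected_cost(y_true, y_prob, threshold, c_fn, c_fp):
    """Total cost FN * C_FN + FP * C_FP at a given decision threshold."""
    y_pred = (y_prob >= threshold).astype(int)
    fn = np.sum((y_true == 1) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    return fn * c_fn + fp * c_fp

# Hypothetical labels and predicted probabilities.
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
y_prob = np.array([0.03, 0.12, 0.22, 0.31, 0.37, 0.42, 0.61, 0.72, 0.56, 0.91])
c_fn, c_fp = 10.0, 1.0

thresholds = np.linspace(0.05, 0.95, 19)
costs = [expected_cost(y_true, y_prob, t, c_fn, c_fp) for t in thresholds]
best_t = thresholds[int(np.argmin(costs))]
cost_at_default = expected_cost(y_true, y_prob, 0.5, c_fn, c_fp)

print(f"cost at threshold 0.50: {cost_at_default}")
print(f"best threshold: {best_t:.2f} with cost {min(costs)}")
```

Because false negatives are expensive here, the cost-minimizing threshold falls below 0.5: the model accepts a couple of extra false alarms to avoid missing a positive case.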

from sklearn.metrics import f1_score, precision_recall_curve

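# Find the threshold that maximizes F1.
# Note: the search runs on the test set for brevity; in practice, tune the
# threshold on a separate validation set so the reported F1 is not optimistic.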
thresholds = np.arange(0.1, 0.9, 0.01)
f1_scores = [f1_score(y_test, (y_prob >= t).astype(int)) for t in thresholds]
optimal_threshold = thresholds[np.argmax(f1_scores)]

print(f"Optimal threshold (F1): {optimal_threshold:.2f}")
print(f"F1 at optimal: {max(f1_scores):.4f}")

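# Plot F1, precision, and recall against the threshold
precision, recall, pr_thresholds = precision_recall_curve(y_test, y_prob)

plt.figure(figsize=(10, 6))
plt.plot(thresholds, f1_scores, label='F1 Score', color='blue', lw=2)
# precision and recall have one more entry than pr_thresholds, so drop the last
plt.plot(pr_thresholds, precision[:-1], label='Precision', color='green', lw=1)
plt.plot(pr_thresholds, recall[:-1], label='Recall', color='orange', lw=1)
plt.axvline(x=optimal_threshold, color='red', linestyle='--', label=f'Optimal: {optimal_threshold:.2f}')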
plt.xlabel('Threshold')
plt.ylabel('Score')
plt.title('Threshold Optimization')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

Key Takeaways

  1. Sigmoid function maps any real number to $(0, 1)$, which enables a probability interpretation
  2. Cross-entropy loss is convex in the parameters, so any minimum that gradient descent reaches is the global minimum
  3. Decision boundary is always linear in feature space (use polynomial features for non-linear)
  4. Coefficient interpretation: $e^{w_j}$ = odds ratio for a one-unit increase in $x_j$
  5. Multi-class: Use softmax for direct multiclass, or OvR for binary decomposition
  6. Regularization: L1 for feature selection, L2 for smooth shrinkage, Elastic Net for both
  7. Threshold tuning is critical: the default 0.5 may not be optimal for imbalanced data or unequal error costs

Practice Exercises

Exercise 1: Binary Classification Pipeline

from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=10,
    n_classes=2, weights=[0.7, 0.3], random_state=42
)

# a) Split data, train model with StandardScaler
# b) Print confusion matrix
# c) Calculate precision, recall, F1
# d) Plot ROC curve and report AUC
# e) Compare with/without regularization

Exercise 2: Multi-class Problem

from sklearn.datasets import load_wine

wine = load_wine()
X, y = wine.data, wine.target

# a) Train multinomial logistic regression
# b) Train one-vs-rest logistic regression
# c) Compare accuracy and confusion matrices
# d) Which strategy performs better? Why?

Exercise 3: Cost-Sensitive Learning

  • Train a model with class_weight='balanced'
  • Compare confusion matrices with unweighted model
  • In which scenarios is the weighted model better?

Exercise 4: Polynomial Features

from sklearn.preprocessing import PolynomialFeatures

# Generate non-linear data (e.g., circles or moons)
# Train logistic regression with polynomial features of degree 2, 3, 4
# Visualize decision boundaries
# What's the trade-off with higher degree?

Discussion Questions

  1. When would you prioritize recall over precision (and vice versa)?
  2. Why might AUC be preferred over accuracy for imbalanced datasets?
  3. How does regularization affect logistic regression coefficients and interpretability?
  4. Under what assumptions does logistic regression fail?
