Logistic Regression: Sigmoid, Decision Boundary and Multi-class
The Bridge from Regression to Classification
Logistic regression extends linear regression to classification problems. Despite its name, it's a classification algorithm that predicts probabilities using the sigmoid function.
From Linear to Logistic: The Transformation
1. The Sigmoid Function
Mathematical Definition
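Sigmoid (Logistic) Function: $\sigma(z) = \dfrac{1}{1 + e^{-z}}$, where $z = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n$ is the linear score.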
Properties:
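- Output range: $\sigma(z) \in (0, 1)$, always a valid probability
- Symmetry: $\sigma(-z) = 1 - \sigma(z)$
- Derivative: $\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$
- At z = 0: $\sigma(0) = 0.5$ (decision boundary)
- Asymptotes: $\sigma(z) \to 0$ as $z \to -\infty$, $\sigma(z) \to 1$ as $z \to +\infty$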
Derivative of Sigmoid
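The derivative has an elegant form that enables efficient backpropagation: $\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$
Derivation:
$$\sigma'(z) = \frac{d}{dz}\left(1 + e^{-z}\right)^{-1} = \frac{e^{-z}}{\left(1 + e^{-z}\right)^{2}} = \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}} = \sigma(z)\,(1 - \sigma(z))$$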
Key Insight: The derivative is expressed in terms of the function itself — no recomputation needed!
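The identity $\sigma'(z) = \sigma(z)(1 - \sigma(z))$ can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the helper names `sigmoid` and `sigmoid_grad` are illustrative and do not come from any library.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """Derivative written in terms of the sigmoid output itself."""
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.linspace(-5.0, 5.0, 11)
eps = 1e-6

# Central finite difference as an independent estimate of the derivative
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)

max_abs_diff = float(np.max(np.abs(numeric - sigmoid_grad(z))))
symmetry_gap = float(np.max(np.abs(sigmoid(-z) - (1.0 - sigmoid(z)))))

print("sigmoid(0) =", sigmoid(0.0), " sigmoid'(0) =", sigmoid_grad(0.0))
print("max |numeric - analytic| =", max_abs_diff)
print("max symmetry gap =", symmetry_gap)
```

In a training loop, the value `s` computed in the forward pass can be reused for the backward pass, which is the practical meaning of the key insight.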
Log-Odds Interpretation
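From Probability to Log-Odds: with $p = P(y = 1 \mid \mathbf{x}) = \sigma(z)$,
$$\ln\frac{p}{1 - p} = z = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n$$
Interpretation:
- $\frac{p}{1 - p}$ = odds (ratio of success to failure)
- $\ln\frac{p}{1 - p}$ = log-odds (logit function)
- $\beta_j$ = change in log-odds for a unit increase in $x_j$, holding the other features fixed
- $e^{\beta_j}$ = odds ratio, the multiplicative effect on the odds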
| Probability | Odds | Log-Odds |
|---|---|---|
| 0.01 | 0.0101 | −4.60 |
| 0.1 | 0.111 | −2.20 |
| 0.3 | 0.429 | −0.85 |
| 0.5 | 1.0 | 0.0 |
| 0.7 | 2.333 | 0.85 |
| 0.9 | 9.0 | 2.20 |
| 0.99 | 99.0 | 4.60 |
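The table rows can be reproduced with a few lines of standard-library Python. This is a small sketch; the function names `odds`, `logit`, and `inv_logit` are illustrative, and the values of `z` and `beta` at the end are hypothetical numbers chosen only to show the odds-ratio property.

```python
import math

def odds(p):
    """Odds of success: p / (1 - p)."""
    return p / (1.0 - p)

def logit(p):
    """Log-odds of probability p."""
    return math.log(odds(p))

def inv_logit(z):
    """Sigmoid: maps log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

for p in (0.01, 0.1, 0.3, 0.5, 0.7, 0.9, 0.99):
    print(f"{p:<5} odds={odds(p):<8.4f} log-odds={logit(p):+.2f}")

# Adding beta to the log-odds multiplies the odds by exp(beta)
z, beta = -0.4, 0.5          # hypothetical score and coefficient
odds_ratio = odds(inv_logit(z + beta)) / odds(inv_logit(z))
print("odds ratio:", odds_ratio, " exp(beta):", math.exp(beta))
```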
2. Decision Boundary
Linear Decision Boundary
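The decision boundary is the hypersurface where $P(y = 1 \mid \mathbf{x}) = 0.5$, which occurs when $z = 0$.
Binary Classification (2D Features): $\beta_0 + \beta_1 x_1 + \beta_2 x_2 = 0$, or equivalently $x_2 = -\dfrac{\beta_0 + \beta_1 x_1}{\beta_2}$ when $\beta_2 \neq 0$
This is always a line (a hyperplane in higher dimensions).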
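To make the 2D case concrete, the sketch below solves the boundary equation for the second feature and confirms that the predicted probability is 0.5 on that line. The coefficients `b0`, `b1`, `b2` are hypothetical values picked for illustration, not fitted parameters.

```python
import numpy as np

# Hypothetical coefficients for illustration: z = b0 + b1*x1 + b2*x2
b0, b1, b2 = -1.0, 2.0, 4.0

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def boundary_x2(x1):
    """Solve b0 + b1*x1 + b2*x2 = 0 for x2 (requires b2 != 0)."""
    return -(b0 + b1 * x1) / b2

x1 = np.array([-2.0, 0.0, 2.0])
x2 = boundary_x2(x1)

# Every point on the line has score z = 0 and therefore probability 0.5
p_on_boundary = sigmoid(b0 + b1 * x1 + b2 * x2)

print("x2 on boundary:", x2)
print("probability on boundary:", p_on_boundary)
```

Points on the side of the line where the score is positive are assigned to class 1, and points on the other side to class 0.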
Non-Linear Decision Boundaries
By adding polynomial features, logistic regression can learn non-linear boundaries:
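Polynomial Features: $z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1^2 + \beta_4 x_2^2 + \beta_5 x_1 x_2$
With degree-2 terms, the boundary $z = 0$ is a conic section in the original features: an ellipse, a parabola, or a hyperbola.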
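A short sketch of the idea, assuming NumPy: the model stays linear in the expanded features, yet the boundary in the original two features is curved. The weight vector `beta` below is hypothetical and hand-picked so that the boundary is the unit circle; `poly2_features` is an illustrative helper, not a library function.

```python
import numpy as np

def poly2_features(X):
    """Map (x1, x2) to (1, x1, x2, x1^2, x2^2, x1*x2)."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)), x1, x2, x1**2, x2**2, x1 * x2])

# Hypothetical weights: z = -1 + x1^2 + x2^2, so z = 0 is the unit circle
beta = np.array([-1.0, 0.0, 0.0, 1.0, 1.0, 0.0])

X = np.array([[0.0, 0.0], [0.5, 0.5], [1.0, 0.0], [1.5, 0.0], [-1.0, -1.0]])
z = poly2_features(X) @ beta
prob = 1.0 / (1.0 + np.exp(-z))
pred = (prob > 0.5).astype(int)   # inside the circle -> 0, outside -> 1

print("scores:", z)
print("predictions:", pred)
```

In practice the weights would be learned, for example by passing the output of `PolynomialFeatures` to `LogisticRegression`.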
3. Cost Function: Binary Cross-Entropy
Why Not MSE for Classification?
MSE with Sigmoid Creates Non-Convex Loss: $J_{\text{MSE}} = \dfrac{1}{m}\sum_{i=1}^{m}\left(\sigma(z^{(i)}) - y^{(i)}\right)^2$
- Problem: Multiple local minima make optimization unreliable
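- Gradient: Near saturation ($\hat{y} \approx 0$ or $\hat{y} \approx 1$), the factor $\hat{y}(1 - \hat{y})$ drives the gradient → 0, so learning is slow even when the prediction is badly wrong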
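The saturation problem is easy to see for a single sample. The sketch below compares the derivative of each loss with respect to the score $z$ when the true label is 1 and the model is confidently wrong. The function names are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mse_grad_z(z, y):
    """d/dz of (sigmoid(z) - y)^2."""
    p = sigmoid(z)
    return 2.0 * (p - y) * p * (1.0 - p)

def bce_grad_z(z, y):
    """d/dz of -[y*log(p) + (1-y)*log(1-p)] with p = sigmoid(z)."""
    return sigmoid(z) - y

y = 1
for z in (-10.0, -2.0, 0.0):
    print(f"z={z:+6.1f}  MSE grad={mse_grad_z(z, y):+.6f}  BCE grad={bce_grad_z(z, y):+.6f}")
```

At $z = -10$ the MSE gradient is nearly zero although the prediction is as wrong as it can be, while the cross-entropy gradient stays close to $-1$.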
Binary Cross-Entropy Derivation
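Single Sample Loss (with $\hat{y} = \sigma(z)$):
For $y = 1$: $L = -\log(\hat{y})$ → penalizes low-confidence predictions of the true class
For $y = 0$: $L = -\log(1 - \hat{y})$ → penalizes high-confidence wrong predictions
Combined: $L(\hat{y}, y) = -\left[\,y \log \hat{y} + (1 - y)\log(1 - \hat{y})\,\right]$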
Average Cross-Entropy Loss
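Full Cost Function:
$$J(\boldsymbol{\beta}) = -\frac{1}{m}\sum_{i=1}^{m}\left[\,y^{(i)}\log\hat{y}^{(i)} + \left(1 - y^{(i)}\right)\log\left(1 - \hat{y}^{(i)}\right)\right]$$
Where $\hat{y}^{(i)} = \sigma(\boldsymbol{\beta}^\top \mathbf{x}^{(i)})$, $m$ is the number of samples, and $x_0^{(i)} = 1$ carries the intercept $\beta_0$
Gradient (same form as linear regression!):
$$\frac{\partial J}{\partial \beta_j} = \frac{1}{m}\sum_{i=1}^{m}\left(\hat{y}^{(i)} - y^{(i)}\right)x_j^{(i)}$$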
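A vectorized version of the cost and its gradient might look like the following. This is a sketch under stated assumptions: the design matrix `X` carries a leading column of ones for the intercept, the data are random toy values, and the function names are illustrative. A finite-difference check confirms that the analytic gradient matches the cost.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_cost(beta, X, y, eps=1e-12):
    """Mean binary cross-entropy; X is assumed to carry a leading column of ones."""
    p = np.clip(sigmoid(X @ beta), eps, 1.0 - eps)   # clip to keep log() finite
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def bce_grad(beta, X, y):
    """Gradient (1/m) * X^T (p - y)."""
    return X.T @ (sigmoid(X @ beta) - y) / len(y)

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
y = (rng.random(50) < 0.5).astype(float)
beta = np.array([0.1, -0.3, 0.2])

# Finite-difference check of each partial derivative
h = 1e-6
numeric = np.array([
    (bce_cost(beta + h * e, X, y) - bce_cost(beta - h * e, X, y)) / (2 * h)
    for e in np.eye(3)
])
grad_gap = float(np.max(np.abs(numeric - bce_grad(beta, X, y))))
cost_at_zero = bce_cost(np.zeros(3), X, y)   # all predictions 0.5, so cost = log(2)

print("max gradient gap:", grad_gap)
print("cost at beta = 0:", cost_at_zero)
```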
4. Maximum Likelihood Estimation
Likelihood Function
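Bernoulli Likelihood:
Each observation follows $y^{(i)} \sim \text{Bernoulli}(\hat{y}^{(i)})$, so $P(y^{(i)} \mid \mathbf{x}^{(i)}) = \left(\hat{y}^{(i)}\right)^{y^{(i)}}\left(1 - \hat{y}^{(i)}\right)^{1 - y^{(i)}}$
Joint Likelihood (i.i.d. samples): $\mathcal{L}(\boldsymbol{\beta}) = \prod_{i=1}^{m}\left(\hat{y}^{(i)}\right)^{y^{(i)}}\left(1 - \hat{y}^{(i)}\right)^{1 - y^{(i)}}$
Log-Likelihood (easier to optimize): $\ell(\boldsymbol{\beta}) = \sum_{i=1}^{m}\left[\,y^{(i)}\log\hat{y}^{(i)} + \left(1 - y^{(i)}\right)\log\left(1 - \hat{y}^{(i)}\right)\right]$
Negative Log-Likelihood = Cross-Entropy Loss: $J(\boldsymbol{\beta}) = -\dfrac{1}{m}\,\ell(\boldsymbol{\beta})$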
Gradient Derivation
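Chain Rule Application (single sample, $\hat{y} = \sigma(z)$, $z = \boldsymbol{\beta}^\top\mathbf{x}$):
$$\frac{\partial L}{\partial \beta_j} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial \beta_j}$$
Where: $\dfrac{\partial L}{\partial \hat{y}} = -\dfrac{y}{\hat{y}} + \dfrac{1 - y}{1 - \hat{y}} = \dfrac{\hat{y} - y}{\hat{y}(1 - \hat{y})}$, $\dfrac{\partial \hat{y}}{\partial z} = \hat{y}(1 - \hat{y})$, $\dfrac{\partial z}{\partial \beta_j} = x_j$
Result: $\dfrac{\partial L}{\partial \beta_j} = (\hat{y} - y)\,x_j$
The $\hat{y}(1 - \hat{y})$ terms cancel beautifully!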
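Because the per-sample gradient reduces to (prediction minus label) times the feature, maximum likelihood fitting by plain gradient descent takes only a few lines. The sketch below is illustrative rather than a production solver: the data are synthetic, `true_beta` is a hypothetical ground truth, and the learning rate and iteration count are arbitrary choices that happen to work for this toy problem.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(beta, X, y):
    """Mean negative log-likelihood (binary cross-entropy)."""
    p = np.clip(sigmoid(X @ beta), 1e-12, 1 - 1e-12)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

rng = np.random.default_rng(42)
m = 2000
X = np.column_stack([np.ones(m), rng.normal(size=(m, 2))])   # leading column of ones = intercept
true_beta = np.array([-0.5, 1.5, -1.0])                      # hypothetical ground truth
y = (rng.random(m) < sigmoid(X @ true_beta)).astype(float)

beta = np.zeros(3)
lr = 0.5
initial_loss = nll(beta, X, y)
for _ in range(2000):
    grad = X.T @ (sigmoid(X @ beta) - y) / m   # (1/m) * sum of (y_hat - y) * x
    beta -= lr * grad

final_loss = nll(beta, X, y)
grad_norm = float(np.linalg.norm(X.T @ (sigmoid(X @ beta) - y) / m))

print("estimated beta:", beta)
print("loss:", initial_loss, "->", final_loss, " gradient norm:", grad_norm)
```

Library solvers such as the ones behind scikit-learn's `LogisticRegression` use more sophisticated optimizers, but they minimize the same objective (plus a penalty term by default).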
5. Multi-class Classification
One-vs-Rest (OvR)
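Train $K$ binary classifiers, one per class, where classifier $k$ separates class $k$ from all others; predict the class with the highest score: $\hat{y} = \arg\max_{k}\,\sigma(\boldsymbol{\beta}_k^\top \mathbf{x})$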
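The prediction step of one-vs-rest can be sketched as follows. The weight matrix `B` is hypothetical (in practice each row would come from a separately trained binary model), and the three sample points are chosen so that each one clearly belongs to a different class.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights for K = 3 one-vs-rest classifiers
# (rows: classes; columns: bias, x1, x2)
B = np.array([
    [0.0,  2.0,  0.0],   # class 0 vs rest
    [0.0, -1.0,  2.0],   # class 1 vs rest
    [0.0, -1.0, -2.0],   # class 2 vs rest
])

X = np.array([[ 2.0,  0.0],
              [-1.0,  2.0],
              [-1.0, -2.0]])
X1 = np.column_stack([np.ones(len(X)), X])   # prepend bias column

scores = sigmoid(X1 @ B.T)      # shape (n_samples, K); each column is an independent binary model
pred = scores.argmax(axis=1)    # pick the most confident classifier
row_sums = scores.sum(axis=1)   # not constrained to equal 1, unlike softmax

print("predictions:", pred)
print("row sums:", row_sums)
```

The row sums differ from 1 because the K classifiers are trained independently, which is one motivation for the softmax formulation.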
Softmax (Multinomial Logistic Regression)
For direct multi-class modeling, use the softmax function:
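Softmax Function: $P(y = k \mid \mathbf{x}) = \dfrac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}$, where $z_k = \boldsymbol{\beta}_k^\top \mathbf{x}$ is the logit for class $k$
Properties:
- $\sum_{k=1}^{K} P(y = k \mid \mathbf{x}) = 1$ (valid probability distribution)
- Reduces to sigmoid when $K = 2$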
- Monotonic: higher logit → higher probability
- Differentiable everywhere
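A numerically stable softmax is short enough to write by hand. The sketch below, assuming NumPy, subtracts the row maximum before exponentiating (a standard trick that leaves the result unchanged) and checks two of the listed properties: rows sum to 1, and the two-class case matches the sigmoid.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax; subtracting the row max avoids overflow without changing the result."""
    z = np.asarray(z, dtype=float)
    shifted = z - z.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

logits = np.array([[2.0, 1.0, 0.1],
                   [1000.0, 1001.0, 1002.0]])   # large values would overflow a naive exp
probs = softmax(logits)

# K = 2: softmax([z, 0]) gives sigmoid(z) for the first class
z = 1.7
two_class = softmax(np.array([z, 0.0]))

print(probs)
print("row sums:", probs.sum(axis=1))
print("two-class softmax vs sigmoid:", two_class[0], sigmoid(z))
```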
Cross-Entropy for Multi-class
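Categorical Cross-Entropy Loss: $J = -\dfrac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K} y_k^{(i)} \log \hat{y}_k^{(i)}$
Where $y_k^{(i)}$ is 1 if sample $i$ belongs to class $k$, else 0 (one-hot encoded), and $\hat{y}_k^{(i)}$ is the softmax probability of class $k$.
Gradient for class k: $\dfrac{\partial J}{\partial \boldsymbol{\beta}_k} = \dfrac{1}{m}\sum_{i=1}^{m}\left(\hat{y}_k^{(i)} - y_k^{(i)}\right)\mathbf{x}^{(i)}$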
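The multi-class gradient has the same "prediction minus target" shape as the binary case. The following sketch, assuming NumPy and random toy data, stores the class weight vectors as columns of a matrix `W` and verifies one entry of the analytic gradient against a finite difference. All names are illustrative.

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def cce_loss(W, X, Y):
    """Mean categorical cross-entropy; W has shape (n_features, K), Y is one-hot."""
    P = softmax(X @ W)
    return float(-np.mean(np.sum(Y * np.log(P + 1e-12), axis=1)))

def cce_grad(W, X, Y):
    """Gradient (1/m) * X^T (P - Y); column k is the gradient for class k."""
    return X.T @ (softmax(X @ W) - Y) / len(X)

rng = np.random.default_rng(1)
m, n, K = 30, 4, 3
X = rng.normal(size=(m, n))
labels = rng.integers(0, K, size=m)
Y = np.eye(K)[labels]                 # one-hot encoding
W = rng.normal(scale=0.1, size=(n, K))

# Finite-difference check on a single entry of W
h = 1e-6
E = np.zeros_like(W)
E[2, 1] = 1.0
numeric = (cce_loss(W + h * E, X, Y) - cce_loss(W - h * E, X, Y)) / (2 * h)
analytic = cce_grad(W, X, Y)[2, 1]
loss_at_zero = cce_loss(np.zeros((n, K)), X, Y)   # uniform predictions, so loss = log(K)

print("numeric:", numeric, " analytic:", analytic)
print("loss at W = 0:", loss_at_zero)
```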
6. Regularization
L1 Regularization (Lasso)
L1 Penalty: $J_{L1}(\boldsymbol{\beta}) = J(\boldsymbol{\beta}) + \lambda\sum_{j=1}^{n}|\beta_j|$
Effect: Drives some coefficients to exactly zero → feature selection
Geometric Interpretation: Diamond constraint region → corner solutions
L2 Regularization (Ridge)
L2 Penalty: $J_{L2}(\boldsymbol{\beta}) = J(\boldsymbol{\beta}) + \lambda\sum_{j=1}^{n}\beta_j^2$
Effect: Shrinks all coefficients toward zero, none exactly zero
Geometric Interpretation: Circular constraint region → smooth solutions
Elastic Net
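Combined Penalty: $J_{EN}(\boldsymbol{\beta}) = J(\boldsymbol{\beta}) + \lambda_1\sum_{j=1}^{n}|\beta_j| + \lambda_2\sum_{j=1}^{n}\beta_j^2$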
Advantage: Handles correlated features better than L1 alone
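The different effects of the two penalties can be illustrated with their proximal steps, a technique swapped in here for illustration only (it is not necessarily how any particular library fits the model). For the L1 penalty the proximal step is soft-thresholding, which sets small coefficients to exactly zero; for the L2 penalty it is a uniform rescaling that never reaches zero. The coefficient vector and the step size `0.25` are hypothetical.

```python
import numpy as np

def l1_penalty(beta, lam):
    return lam * np.sum(np.abs(beta))

def l2_penalty(beta, lam):
    return lam * np.sum(beta ** 2)

def elastic_net_penalty(beta, lam1, lam2):
    return l1_penalty(beta, lam1) + l2_penalty(beta, lam2)

def soft_threshold(beta, t):
    """Proximal step for the L1 penalty t * sum(|beta|): shrink by t and clip at zero."""
    return np.sign(beta) * np.maximum(np.abs(beta) - t, 0.0)

def l2_shrink(beta, t):
    """Proximal step for the L2 penalty t * sum(beta^2): uniform rescaling."""
    return beta / (1.0 + 2.0 * t)

beta = np.array([3.0, -0.2, 0.05, -1.5])   # hypothetical coefficients, intercept excluded
after_l1 = soft_threshold(beta, 0.25)
after_l2 = l2_shrink(beta, 0.25)

print("after L1 step:", after_l1)
print("after L2 step:", after_l2)
print("nonzero after L1:", np.count_nonzero(after_l1),
      " nonzero after L2:", np.count_nonzero(after_l2))
```

In scikit-learn, the corresponding choices are exposed through the `penalty` argument of `LogisticRegression`, with `C` acting as the inverse of the regularization strength.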
7. Complete Python Implementation
Binary Classification with Evaluation
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score, confusion_matrix, classification_report,
    roc_curve, auc, precision_recall_curve
)
np.random.seed(42)
# Generate synthetic medical data
n = 1000
X = np.column_stack([
    np.random.normal(50, 15, n),   # age
    np.random.normal(120, 20, n),  # blood_pressure
    np.random.normal(200, 40, n),  # cholesterol
    np.random.normal(100, 30, n),  # glucose
    np.random.normal(28, 5, n)     # BMI
])
# True relationship
logits = (
    0.05 * (X[:, 0] - 50) +
    0.02 * (X[:, 1] - 120) +
    0.01 * (X[:, 2] - 200) +
    0.03 * (X[:, 3] - 100) +
    0.1 * (X[:, 4] - 28) - 1
)
prob = 1 / (1 + np.exp(-logits))
y = np.random.binomial(1, prob)
df = pd.DataFrame(X, columns=['age', 'blood_pressure', 'cholesterol', 'glucose', 'bmi'])
df['heart_disease'] = y
# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    df.drop('heart_disease', axis=1), df['heart_disease'],
    test_size=0.2, random_state=42, stratify=df['heart_disease']
)
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train model
model = LogisticRegression(random_state=42, max_iter=1000)
model.fit(X_train_scaled, y_train)
# Predictions
y_pred = model.predict(X_test_scaled)
y_prob = model.predict_proba(X_test_scaled)[:, 1]
print("=== Model Performance ===")
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
Feature Importance and Odds Ratios
feature_names = ['Age', 'Blood Pressure', 'Cholesterol', 'Glucose', 'BMI']
# Features were standardized, so each odds ratio is per one standard deviation, not per raw unit
coefficients = model.coef_[0]
print("\n=== Feature Importance ===")
print(f"{'Feature':<20} {'Coefficient':>12} {'Odds Ratio':>12} {'Effect':>15}")
print("-" * 62)
for name, coef in zip(feature_names, coefficients):
    odds_ratio = np.exp(coef)
    direction = "increases" if coef > 0 else "decreases"
    print(f"{name:<20} {coef:>12.4f} {odds_ratio:>12.4f} {direction:>15}")
print(f"\nIntercept: {model.intercept_[0]:.4f}")
Decision Boundary Visualization
from matplotlib.colors import ListedColormap
# Use first two features for visualization
X_2d = X_train_scaled[:, :2]
y_2d = y_train.to_numpy()  # plain array so the boolean masks below index X_2d by position
model_2d = LogisticRegression(random_state=42)
model_2d.fit(X_2d, y_2d)
# Create mesh grid
h = 0.02
x_min, x_max = X_2d[:, 0].min() - 1, X_2d[:, 0].max() + 1
y_min, y_max = X_2d[:, 1].min() - 1, X_2d[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
# Predict on mesh
Z = model_2d.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
# Plot
plt.figure(figsize=(10, 8))
cmap_light = ListedColormap(['#FFAAAA', '#AAAAFF'])  # class 0 region light red, class 1 light blue
plt.contourf(xx, yy, Z, alpha=0.3, cmap=cmap_light)
plt.contour(xx, yy, Z, colors='black', linewidths=0.5)
# Point colors match the region colors: class 0 red, class 1 blue
plt.scatter(X_2d[y_2d == 0, 0], X_2d[y_2d == 0, 1], c='red',
            label='Class 0', alpha=0.6, edgecolors='black')
plt.scatter(X_2d[y_2d == 1, 0], X_2d[y_2d == 1, 1], c='blue',
            label='Class 1', alpha=0.6, edgecolors='black')
plt.xlabel('Age (standardized)')
plt.ylabel('Blood pressure (standardized)')
plt.title('Logistic Regression Decision Boundary')
plt.legend()
plt.show()
ROC Curve and AUC
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='blue', lw=2, label=f'ROC curve (AUC = {roc_auc:.3f})')
plt.plot([0, 1], [0, 1], color='gray', linestyle='--', label='Random classifier')
plt.fill_between(fpr, tpr, alpha=0.2, color='blue')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.grid(True, alpha=0.3)
plt.show()
8. Threshold Optimization
Default threshold = 0.5 is not always optimal!
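Cost-Sensitive Threshold Selection: $t^{*} = \dfrac{C_{FP}}{C_{FP} + C_{FN}}$ (assuming calibrated probabilities and zero cost for correct decisions)
Where $C_{FN}$ = cost of false negative (missed disease), $C_{FP}$ = cost of false alarm.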
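A cost-based threshold follows from comparing expected costs: predicting positive is cheaper whenever p times the false-negative cost exceeds (1 − p) times the false-positive cost, which rearranges to p > C_FP / (C_FP + C_FN). The sketch below assumes calibrated probabilities and zero cost for correct decisions; the costs, labels, and scores are hypothetical toy values, and the helper names are illustrative.

```python
import numpy as np

def cost_optimal_threshold(c_fp, c_fn):
    """Predict positive when p * c_fn > (1 - p) * c_fp, i.e. p > c_fp / (c_fp + c_fn)."""
    return c_fp / (c_fp + c_fn)

def total_cost(y_true, y_prob, threshold, c_fp, c_fn):
    """Sum of misclassification costs at a given threshold."""
    y_pred = (y_prob >= threshold).astype(int)
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return float(c_fp * fp + c_fn * fn)

# Hypothetical costs: a missed disease is 9x as costly as a false alarm
c_fp, c_fn = 1.0, 9.0
t_star = cost_optimal_threshold(c_fp, c_fn)

# Toy labels and scores for illustration
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
y_prob = np.array([0.05, 0.08, 0.2, 0.3, 0.15, 0.6, 0.7, 0.9])

cost_default = total_cost(y_true, y_prob, 0.5, c_fp, c_fn)
cost_tuned = total_cost(y_true, y_prob, t_star, c_fp, c_fn)

print("cost-optimal threshold:", t_star)
print("cost at 0.5:", cost_default, " cost at t*:", cost_tuned)
```

Lowering the threshold trades extra false alarms for fewer missed cases, which pays off here because a miss is far more expensive.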
from sklearn.metrics import f1_score, precision_recall_curve
# Find optimal F1 threshold (selected on the test set here for brevity; use a validation split in practice)
thresholds = np.arange(0.1, 0.9, 0.01)
f1_scores = [f1_score(y_test, (y_prob >= t).astype(int)) for t in thresholds]
optimal_threshold = thresholds[np.argmax(f1_scores)]
print(f"Optimal threshold (F1): {optimal_threshold:.2f}")
print(f"F1 at optimal: {max(f1_scores):.4f}")
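# Plot threshold vs metrics
precision, recall, pr_thresholds = precision_recall_curve(y_test, y_prob)
plt.figure(figsize=(10, 6))
plt.plot(thresholds, f1_scores, label='F1 Score', color='blue', lw=2)
# precision and recall have one more entry than pr_thresholds, so drop the final point
plt.plot(pr_thresholds, precision[:-1], label='Precision', color='green', lw=1)
plt.plot(pr_thresholds, recall[:-1], label='Recall', color='orange', lw=1)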
plt.axvline(x=optimal_threshold, color='red', linestyle='--', label=f'Optimal: {optimal_threshold:.2f}')
plt.xlabel('Threshold')
plt.ylabel('Score')
plt.title('Threshold Optimization')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
Key Takeaways
- Sigmoid function maps any real number to $(0, 1)$, which enables the probability interpretation
- Cross-entropy loss is convex in the parameters → any local minimum is global, so gradient descent with a suitable step size converges toward it
- Decision boundary is always linear in feature space (use polynomial features for non-linear)
- Coefficient interpretation: $e^{\beta_j}$ = odds ratio for a one-unit increase in $x_j$
- Multi-class: Use softmax for direct multiclass, or OvR for binary decomposition
- Regularization: L1 for feature selection, L2 for smooth shrinkage, Elastic Net for both
- Threshold tuning is critical — default 0.5 may not be optimal for imbalanced data
Practice Exercises
Exercise 1: Binary Classification Pipeline
from sklearn.datasets import make_classification
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=10,
    n_classes=2, weights=[0.7, 0.3], random_state=42
)
# a) Split data, train model with StandardScaler
# b) Print confusion matrix
# c) Calculate precision, recall, F1
# d) Plot ROC curve and report AUC
# e) Compare with/without regularization
Exercise 2: Multi-class Problem
from sklearn.datasets import load_wine
wine = load_wine()
X, y = wine.data, wine.target
# a) Train multinomial logistic regression
# b) Train one-vs-rest logistic regression
# c) Compare accuracy and confusion matrices
# d) Which strategy performs better? Why?
Exercise 3: Cost-Sensitive Learning
- Train a model with class_weight='balanced'
- Compare confusion matrices with the unweighted model
- In which scenarios is the weighted model better?
Exercise 4: Polynomial Features
from sklearn.preprocessing import PolynomialFeatures
# Generate non-linear data (e.g., circles or moons)
# Train logistic regression with polynomial features of degree 2, 3, 4
# Visualize decision boundaries
# What's the trade-off with higher degree?
Discussion Questions
- When would you prioritize recall over precision (and vice versa)?
- Why might AUC be preferred over accuracy for imbalanced datasets?
- How does regularization affect logistic regression coefficients and interpretability?
- Which assumptions, when violated, cause logistic regression to perform poorly?