Logistic Regression: Sigmoid, Decision Boundary and Multi-class

The Bridge from Regression to Classification

Logistic regression extends linear regression to classification problems. Despite its name, it's a classification algorithm that predicts probabilities using the sigmoid function.

From Linear to Logistic: The Transformation

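[Figure: Linear Regression → Logistic Regression. Left panel (Linear Regression): Target (y) against Feature (x), with an example prediction ŷ = 1.8; predicts ŷ ∈ (-∞, +∞). The sigmoid f(z) maps to the right panel (Logistic Regression): P(y=1) against Feature (x), with the 0.5 threshold marked; predicts P(y=1) ∈ [0, 1].]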

How this diagram works: This side-by-side comparison illustrates the transformation from linear regression to logistic regression. On the left, linear regression predicts unbounded values (like ŷ = 1.8) that fall outside the valid probability range, making it unsuitable for classification. On the right, the sigmoid function f(z) squashes any input into the [0, 1] range, producing valid probabilities. The S-shaped curve creates a natural decision boundary at 0.5: points whose predicted probability lies above the threshold are classified as Class 1, and those below as Class 0.

1. The Sigmoid Function

Mathematical Definition

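Sigmoid (Logistic) Function:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

(The figures label the same function f(z).)

Properties:

Output range: $\sigma(z) \in (0, 1)$, always a valid probability
Symmetry: $\sigma(-z) = 1 - \sigma(z)$
Derivative: $\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$
At z = 0: $\sigma(0) = 0.5$ (decision boundary)
Asymptotes: $\lim_{z \to -\infty} \sigma(z) = 0$, $\lim_{z \to +\infty} \sigma(z) = 1$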
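
The properties listed above can be checked numerically. The following is a minimal sketch, not a reference implementation: the helper name `sigmoid` and the branch on the sign of z (used to avoid overflow in the exponential) are illustrative choices that do not come from the article.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), evaluated without overflow."""
    z = np.atleast_1d(np.asarray(z, dtype=float))
    out = np.empty_like(z)
    pos = z >= 0
    # For z >= 0, exp(-z) is at most 1 and cannot overflow.
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # For z < 0, use the equivalent form exp(z) / (1 + exp(z)).
    exp_z = np.exp(z[~pos])
    out[~pos] = exp_z / (1.0 + exp_z)
    return out

z = np.linspace(-10, 10, 201)
s = sigmoid(z)

in_range = bool(np.all((s > 0) & (s < 1)))           # output range on this grid
symmetric = bool(np.allclose(sigmoid(-z), 1 - s))    # sigma(-z) = 1 - sigma(z)
midpoint = float(sigmoid([0.0])[0])                  # value at z = 0
extreme = sigmoid([-1000.0, 1000.0])                 # asymptotic behaviour

print(in_range, symmetric, midpoint, extreme)
```

Branching on the sign of z keeps the exponent non-positive in both cases, which is why very large inputs such as ±1000 do not trigger overflow warnings.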

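[Figure: Sigmoid function f(z) = 1/(1 + e⁻ᶻ), plotted as f(z) against z. The curve passes through 0.5 at z = 0 (the decision boundary) and approaches the asymptotes y = 0 and y = 1, with σ(-3) ≈ 0.05 and σ(3) ≈ 0.95. The region with P > 0.5 is labeled Class 1.]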

Derivative of Sigmoid

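The derivative has an elegant form that enables efficient backpropagation:

$$\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$$

Derivation:

$$\sigma'(z) = \frac{d}{dz}\left(1 + e^{-z}\right)^{-1} = \frac{e^{-z}}{\left(1 + e^{-z}\right)^{2}} = \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}} = \sigma(z)\,(1 - \sigma(z))$$

Key Insight: The derivative is expressed in terms of the function itself, so once σ(z) has been computed in the forward pass, no further exponentials are needed for the gradient.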
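
As a sanity check on the identity σ'(z) = σ(z)(1 - σ(z)), the closed form can be compared with a central finite difference. This is a small sketch; the function names and the step size `eps` are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """Closed-form derivative, reusing the forward value s = sigmoid(z)."""
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.linspace(-5, 5, 11)
eps = 1e-6
# Central difference approximation of d(sigmoid)/dz at each grid point.
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
max_abs_error = float(np.max(np.abs(numeric - sigmoid_grad(z))))

print(f"max |numeric - analytic| = {max_abs_error:.2e}")
```

The derivative peaks at z = 0 with value 0.25 and shrinks toward zero in both tails, which is the saturation effect discussed later for the MSE loss.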

Log-Odds Interpretation

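From Probability to Log-Odds:

$$p = P(y = 1 \mid \mathbf{x}) = \sigma(\mathbf{w}^\top \mathbf{x} + b) \quad\Longleftrightarrow\quad \log\frac{p}{1 - p} = \mathbf{w}^\top \mathbf{x} + b$$

Interpretation:

$\frac{p}{1-p}$ = odds (ratio of success to failure)
$\log\frac{p}{1-p}$ = log-odds (logit function)
$w_j$ = change in log-odds for a unit increase in $x_j$
$e^{w_j}$ = odds ratio: multiplicative effect on the odds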

Probability	Odds	Log-Odds
0.01	0.0101	-4.60
0.1	0.111	-2.20
0.3	0.429	-0.85
0.5	1.0	0.0
0.7	2.333	0.85
0.9	9.0	2.20
0.99	99.0	4.60
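
The table can be reproduced with a few lines of NumPy. This sketch also applies the sigmoid to the log-odds to confirm that the two transformations are inverses of each other; the variable names are illustrative.

```python
import numpy as np

p = np.array([0.01, 0.1, 0.3, 0.5, 0.7, 0.9, 0.99])

odds = p / (1 - p)                        # ratio of success to failure
log_odds = np.log(odds)                   # logit(p)
recovered = 1 / (1 + np.exp(-log_odds))   # sigmoid(logit(p)) should give p back

print(f"{'Probability':>12} {'Odds':>10} {'Log-Odds':>10}")
for pi, oi, li in zip(p, odds, log_odds):
    print(f"{pi:>12.2f} {oi:>10.3f} {li:>10.2f}")
```

Note the symmetry around p = 0.5: probabilities p and 1 - p have log-odds of equal magnitude and opposite sign.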

2. Decision Boundary

Linear Decision Boundary

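The decision boundary is the hypersurface where $P(y = 1 \mid \mathbf{x}) = 0.5$, which occurs when $z = \mathbf{w}^\top \mathbf{x} + b = 0$.

Binary Classification (2D Features):

$$w_1 x_1 + w_2 x_2 + b = 0 \quad\Longrightarrow\quad x_2 = -\frac{w_1}{w_2}\,x_1 - \frac{b}{w_2} \qquad (w_2 \neq 0)$$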

This is always a line (hyperplane in higher dimensions).
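
To make the 2D case concrete, the sketch below solves w₁x₁ + w₂x₂ + b = 0 for x₂ and confirms that points on that line receive probability 0.5. The weights and intercept are hypothetical values chosen only for illustration, not parameters fitted anywhere in this article.

```python
import numpy as np

# Hypothetical parameters, chosen only for illustration.
w = np.array([2.0, -1.0])
b = 0.5

def boundary_x2(x1):
    """Solve w1*x1 + w2*x2 + b = 0 for x2 (requires w2 != 0)."""
    return -(w[0] * x1 + b) / w[1]

x1 = np.array([-1.0, 0.0, 1.0])
points = np.column_stack([x1, boundary_x2(x1)])

z = points @ w + b              # linear score; zero on the boundary
p = 1 / (1 + np.exp(-z))        # predicted probability; 0.5 on the boundary

print(points)
print(p)
```

Moving a point off the line in the direction of w increases z and pushes the probability above 0.5; moving against w pushes it below.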

Non-Linear Decision Boundaries

By adding polynomial features, logistic regression can learn non-linear boundaries:

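Polynomial Features:

$$z = b + w_1 x_1 + w_2 x_2 + w_3 x_1^2 + w_4 x_2^2 + w_5 x_1 x_2$$

The boundary $z = 0$ is still linear in the expanded features, but curved in the original $(x_1, x_2)$ space.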

This creates elliptical, parabolic, or other conic section boundaries.

Add polynomial features: x₁², x₂², x₁x₂ to learn curved boundaries
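
A quick way to see the effect is to fit logistic regression on data that no straight line can separate, with and without degree-2 features. This is a sketch under assumptions: the concentric-circles dataset from `make_circles` and its parameters are illustrative choices, and accuracy is measured on the training data only to show what the model can represent.

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Two concentric rings: the classes differ by radius, not by a linear direction.
X, y = make_circles(n_samples=500, noise=0.1, factor=0.4, random_state=0)

# Plain logistic regression on the raw features (linear boundary).
linear_clf = make_pipeline(StandardScaler(), LogisticRegression())

# Same model after adding x1^2, x2^2 and x1*x2 (conic-section boundary).
poly_clf = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    LogisticRegression(),
)

linear_acc = linear_clf.fit(X, y).score(X, y)
poly_acc = poly_clf.fit(X, y).score(X, y)

print(f"Linear features:   {linear_acc:.3f}")
print(f"Degree-2 features: {poly_acc:.3f}")
```

The linear model should hover near chance level on this data, while the degree-2 model can express a circular boundary through the x₁² and x₂² terms.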

3. Cost Function: Binary Cross-Entropy

Why Not MSE for Classification?

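MSE with Sigmoid Creates Non-Convex Loss:

$$J_{\text{MSE}}(\mathbf{w}, b) = \frac{1}{n}\sum_{i=1}^{n}\left(\sigma(\mathbf{w}^\top \mathbf{x}_i + b) - y_i\right)^2$$

Problem: The loss is non-convex in the parameters, so local minima and flat regions make optimization unreliable
Gradient: Near saturation ($\sigma(z) \to 0$ or $\sigma(z) \to 1$), the factor $\sigma(z)(1 - \sigma(z))$ drives the gradient → 0, so learning is slow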
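
The saturation problem is easy to see on a single confidently wrong example. The sketch below compares the gradient of each loss with respect to the linear score z for a positive sample that the model scores at z = -8; the specific numbers are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

y = 1.0      # true label
z = -8.0     # linear score: the model is confidently wrong
p = sigmoid(z)

# Squared error (p - y)^2: the chain rule brings in the factor p * (1 - p).
grad_mse = 2 * (p - y) * p * (1 - p)

# Cross-entropy -log(p) for y = 1: the p * (1 - p) factor cancels, leaving p - y.
grad_ce = p - y

print(f"p = {p:.6f}")
print(f"dMSE/dz = {grad_mse:.6f}")
print(f"dCE/dz  = {grad_ce:.6f}")
```

With squared error the update is tiny precisely when the prediction is most wrong, whereas cross-entropy keeps the gradient close to -1 and corrects the mistake quickly.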

[Figure: MSE Loss Surface (Non-Convex) vs Cross-Entropy (Convex). Left panel: MSE loss with multiple local minima → unreliable. Right panel: cross-entropy loss with a single global minimum; convex → guaranteed global minimum.]

Binary Cross-Entropy Derivation

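Single Sample Loss:

For $y = 1$: $\ell = -\log(\hat{p})$ → penalizes low-confidence predictions

For $y = 0$: $\ell = -\log(1 - \hat{p})$ → penalizes high-confidence wrong predictions

Combined:

$$\ell(\hat{p}, y) = -\left[\,y \log \hat{p} + (1 - y)\log(1 - \hat{p})\,\right]$$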

[Figure: Cross-Entropy Loss Components. Loss against predicted probability (p̂), showing the curves -log(p̂) when y=1 and -log(1-p̂) when y=0, with p̂ = 0.5 marked. When y=1 and p̂→0, or when y=0 and p̂→1, Loss→∞: confident wrong predictions are penalized.]
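
The single-sample loss, -log(p̂) when y = 1 and -log(1 - p̂) when y = 0, can be written as one function. This is a minimal sketch; the name `bce` and the clipping constant `eps` (which guards against log(0)) are assumptions added for illustration.

```python
import numpy as np

def bce(y, p_hat, eps=1e-12):
    """Binary cross-entropy for one sample (or elementwise for arrays)."""
    p_hat = np.clip(p_hat, eps, 1 - eps)   # keep log() finite at p_hat = 0 or 1
    return -(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

# For a positive example, the loss grows as the predicted probability falls.
for p_hat in [0.9, 0.5, 0.1, 0.01]:
    print(f"y=1, p_hat={p_hat:<5} loss={bce(1, p_hat):.3f}")
```

Because exactly one of y and 1 - y is nonzero for a binary label, the combined expression selects the matching branch automatically.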

Average Cross-Entropy Loss

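Full Cost Function:

$$J(\mathbf{w}, b) = -\frac{1}{n}\sum_{i=1}^{n}\left[\,y_i \log \hat{p}_i + (1 - y_i)\log(1 - \hat{p}_i)\,\right]$$

Where $\hat{p}_i = \sigma(\mathbf{w}^\top \mathbf{x}_i + b)$ is the predicted probability for sample $i$ and $n$ is the number of samples.

Gradient (same form as linear regression!):

$$\nabla_{\mathbf{w}} J = \frac{1}{n}\sum_{i=1}^{n}(\hat{p}_i - y_i)\,\mathbf{x}_i, \qquad \frac{\partial J}{\partial b} = \frac{1}{n}\sum_{i=1}^{n}(\hat{p}_i - y_i)$$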
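
The average cross-entropy gradient, the mean of (p̂ᵢ - yᵢ)xᵢ over the samples, is all that batch gradient descent needs. The sketch below trains on synthetic data; the data-generating weights, the learning rate `lr` and the iteration count are illustrative assumptions rather than recommended settings.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 2
X = rng.normal(size=(n, d))
true_w, true_b = np.array([2.0, -1.0]), 0.5        # illustrative ground truth
y = rng.binomial(1, 1 / (1 + np.exp(-(X @ true_w + true_b))))

def cost(w, b):
    """Average binary cross-entropy over the dataset."""
    p = 1 / (1 + np.exp(-(X @ w + b)))
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

w, b, lr = np.zeros(d), 0.0, 0.5
history = [cost(w, b)]
for _ in range(500):
    p = 1 / (1 + np.exp(-(X @ w + b)))
    w -= lr * (X.T @ (p - y)) / n      # gradient w.r.t. the weights
    b -= lr * np.mean(p - y)           # gradient w.r.t. the intercept
    history.append(cost(w, b))

print(f"initial cost: {history[0]:.4f}, final cost: {history[-1]:.4f}")
print(f"learned w: {w}, learned b: {b:.3f}")
```

With all parameters at zero every prediction is 0.5, so the starting cost is log 2 ≈ 0.693. Since the loss is convex, a sufficiently small fixed step size decreases it at every iteration.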

4. Maximum Likelihood Estimation

Likelihood Function

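Bernoulli Likelihood:

Each observation follows $y_i \mid \mathbf{x}_i \sim \text{Bernoulli}(\hat{p}_i)$ with $\hat{p}_i = \sigma(\mathbf{w}^\top \mathbf{x}_i + b)$, so

$$P(y_i \mid \mathbf{x}_i) = \hat{p}_i^{\,y_i}\,(1 - \hat{p}_i)^{1 - y_i}$$

Joint Likelihood (i.i.d. samples):

$$L(\mathbf{w}, b) = \prod_{i=1}^{n} \hat{p}_i^{\,y_i}\,(1 - \hat{p}_i)^{1 - y_i}$$

Log-Likelihood (easier to optimize):

$$\log L(\mathbf{w}, b) = \sum_{i=1}^{n}\left[\,y_i \log \hat{p}_i + (1 - y_i)\log(1 - \hat{p}_i)\,\right]$$

Negative Log-Likelihood = Cross-Entropy Loss:

$$J(\mathbf{w}, b) = -\frac{1}{n}\log L(\mathbf{w}, b) = -\frac{1}{n}\sum_{i=1}^{n}\left[\,y_i \log \hat{p}_i + (1 - y_i)\log(1 - \hat{p}_i)\,\right]$$

Maximizing the likelihood is therefore the same as minimizing the average cross-entropy.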

Gradient Derivation

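Chain Rule Application:

For the per-sample loss $\ell_i = -\left[\,y_i \log \hat{p}_i + (1 - y_i)\log(1 - \hat{p}_i)\,\right]$ with $\hat{p}_i = \sigma(z_i)$ and $z_i = \mathbf{w}^\top \mathbf{x}_i + b$:

$$\frac{\partial \ell_i}{\partial w_j} = \frac{\partial \ell_i}{\partial \hat{p}_i} \cdot \frac{\partial \hat{p}_i}{\partial z_i} \cdot \frac{\partial z_i}{\partial w_j}$$

Where:

$\frac{\partial \ell_i}{\partial \hat{p}_i} = -\frac{y_i}{\hat{p}_i} + \frac{1 - y_i}{1 - \hat{p}_i} = \frac{\hat{p}_i - y_i}{\hat{p}_i(1 - \hat{p}_i)}$
$\frac{\partial \hat{p}_i}{\partial z_i} = \hat{p}_i(1 - \hat{p}_i)$
$\frac{\partial z_i}{\partial w_j} = x_{ij}$

Result:

$$\frac{\partial \ell_i}{\partial w_j} = (\hat{p}_i - y_i)\,x_{ij}$$

The $\hat{p}_i(1 - \hat{p}_i)$ terms cancel beautifully!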
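
The cancellation can be verified numerically: for a single sample, the analytic gradient (p̂ - y)·x should match a finite-difference gradient of the cross-entropy loss. This is a sketch with randomly drawn inputs; the sample, weights and step size are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=3)     # one sample with three features
w = rng.normal(size=3)     # arbitrary weights at which to check the gradient
b, y = 0.2, 1.0

def loss(w_vec):
    """Cross-entropy loss of the single sample (x, y) as a function of the weights."""
    p = 1 / (1 + np.exp(-(x @ w_vec + b)))
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# Analytic gradient after the p_hat * (1 - p_hat) factors cancel.
p_hat = 1 / (1 + np.exp(-(x @ w + b)))
analytic = (p_hat - y) * x

# Central finite differences, perturbing one weight at a time.
eps = 1e-6
numeric = np.array([
    (loss(w + eps * e) - loss(w - eps * e)) / (2 * eps)
    for e in np.eye(3)
])

print(analytic)
print(numeric)
```

Agreement between the two vectors confirms that no sigmoid-derivative factor survives in the final gradient.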

5. Multi-class Classification

One-vs-Rest (OvR)

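Train $K$ binary classifiers, one per class, each separating class $k$ from all the other classes:

$$P(y = k \mid \mathbf{x}) = \sigma(\mathbf{w}_k^\top \mathbf{x} + b_k), \qquad k = 1, \dots, K$$

Final (for $K = 3$): ŷ = argmax[P(y=1|x), P(y=2|x), P(y=3|x)]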
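
The one-vs-rest recipe can be written out by hand with ordinary binary logistic regression models. This sketch uses the Iris dataset as an illustrative assumption and evaluates on the training data only to show the mechanics; scikit-learn's `OneVsRestClassifier` wraps the same idea.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)
classes = np.unique(y)

# One binary classifier per class: "class k" versus "everything else".
binary_models = [
    LogisticRegression(max_iter=1000).fit(X, (y == k).astype(int))
    for k in classes
]

# Column k holds classifier k's probability that a sample belongs to class k.
scores = np.column_stack([m.predict_proba(X)[:, 1] for m in binary_models])

# Final prediction: the class whose classifier is most confident.
y_pred = classes[np.argmax(scores, axis=1)]
ovr_accuracy = float(np.mean(y_pred == y))

print(f"OvR training accuracy: {ovr_accuracy:.3f}")
print("Row sums of the scores (first 3 rows):", scores[:3].sum(axis=1))
```

Each classifier is trained independently, so the rows of `scores` generally do not sum to 1; only their ranking matters for the argmax.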

Softmax (Multinomial Logistic Regression)

For direct multi-class modeling, use the softmax function:

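Softmax Function:

$$P(y = k \mid \mathbf{x}) = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}, \qquad z_k = \mathbf{w}_k^\top \mathbf{x} + b_k$$

Properties:

$\sum_{k=1}^{K} P(y = k \mid \mathbf{x}) = 1$ with every term in $(0, 1)$ (valid probability distribution)
Reduces to sigmoid when $K = 2$
Monotonic: higher logit → higher probability
Differentiable everywhere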
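
A short sketch of softmax makes these properties tangible. Subtracting the largest logit before exponentiating is a standard stability trick that leaves the result unchanged; the example logits are arbitrary illustrative values.

```python
import numpy as np

def softmax(z):
    """Softmax over the last axis, shifted by the max logit for numerical stability."""
    z = np.asarray(z, dtype=float)
    shifted = z - z.max(axis=-1, keepdims=True)   # does not change the output
    exp_z = np.exp(shifted)
    return exp_z / exp_z.sum(axis=-1, keepdims=True)

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs, probs.sum())

# With K = 2, the probability of class 1 equals the sigmoid of the logit difference.
z1, z2 = 1.3, -0.4
two_class = softmax([z1, z2])[0]
sigmoid_of_diff = 1 / (1 + np.exp(-(z1 - z2)))
print(two_class, sigmoid_of_diff)
```

The two-class identity follows from dividing the numerator and denominator of e^{z₁}/(e^{z₁} + e^{z₂}) by e^{z₁}.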

Architecture Diagram

[Figure: Softmax: Converting Logits to Probabilities. The logits (z) pass through exp() to give eᶻ, are normalized by their sum, and yield the distribution P(y=k|x) for k = 1, 2, 3. In the example shown, ŷ = argmax(P) = Class 1 (highest probability).]

Cross-Entropy for Multi-class

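Categorical Cross-Entropy Loss:

$$J = -\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{K} y_{ik}\,\log \hat{p}_{ik}$$

Where $y_{ik}$ is 1 if sample $i$ belongs to class $k$, else 0 (one-hot encoded), and $\hat{p}_{ik}$ is the softmax probability that sample $i$ belongs to class $k$.

Gradient for class k:

$$\nabla_{\mathbf{w}_k} J = \frac{1}{n}\sum_{i=1}^{n}(\hat{p}_{ik} - y_{ik})\,\mathbf{x}_i$$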
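
For softmax regression, the gradient for class k is the average of (p̂ᵢₖ - yᵢₖ)xᵢ over the samples, the same "prediction minus target" shape as in the binary case. The sketch below computes it in matrix form and checks one entry by finite differences; the random data and the (features × classes) weight layout are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, K = 6, 4, 3
X = rng.normal(size=(n, d))
labels = rng.integers(0, K, size=n)
Y = np.eye(K)[labels]                      # one-hot targets, shape (n, K)
W = rng.normal(scale=0.1, size=(d, K))     # one weight column per class
b = np.zeros(K)

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def loss(W_mat):
    """Average categorical cross-entropy for weight matrix W_mat."""
    P = softmax(X @ W_mat + b)
    return -np.mean(np.sum(Y * np.log(P), axis=1))

P = softmax(X @ W + b)
grad_W = X.T @ (P - Y) / n                 # column k is the gradient for class k

# Finite-difference check of a single entry (feature 1, class 2).
eps = 1e-6
E = np.zeros_like(W)
E[1, 2] = eps
numeric = (loss(W + E) - loss(W - E)) / (2 * eps)

print(grad_W[1, 2], numeric)
```

Because both the predicted probabilities and the one-hot targets sum to 1 across classes, the class gradients sum to zero for every feature.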

6. Regularization

L1 Regularization (Lasso)

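L1 Penalty:

$$J_{\text{L1}}(\mathbf{w}, b) = J(\mathbf{w}, b) + \lambda \sum_{j} |w_j|$$

where $J$ is the unregularized cross-entropy loss and $\lambda \geq 0$ sets the regularization strength.

Effect: Drives some coefficients to exactly zero → feature selection

Geometric Interpretation: Diamond constraint region → corner solutions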

L2 Regularization (Ridge)

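L2 Penalty:

$$J_{\text{L2}}(\mathbf{w}, b) = J(\mathbf{w}, b) + \lambda \sum_{j} w_j^2$$

where $J$ is the unregularized cross-entropy loss and $\lambda \geq 0$ sets the regularization strength.

Effect: Shrinks all coefficients toward zero, none exactly zero

Geometric Interpretation: Circular constraint region → smooth solutions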

Elastic Net

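Combined Penalty:

$$J_{\text{EN}}(\mathbf{w}, b) = J(\mathbf{w}, b) + \lambda_1 \sum_{j} |w_j| + \lambda_2 \sum_{j} w_j^2$$

where $J$ is the unregularized cross-entropy loss.

Advantage: Handles correlated features better than L1 alone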
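
The difference between the penalties shows up in how many coefficients end up exactly zero. The following sketch compares L1 and L2 on synthetic data where only 4 of 20 features are informative; the dataset, the `liblinear` solver and the value `C=0.05` are illustrative assumptions. In scikit-learn, `C` is the inverse of the regularization strength, so a smaller `C` means a stronger penalty.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# 20 features, of which only 4 carry signal; the rest are pure noise.
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=4,
    n_redundant=0, random_state=0,
)
X = StandardScaler().fit_transform(X)

# Same regularization strength, different penalty type.
l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.05).fit(X, y)
l2 = LogisticRegression(penalty="l2", solver="liblinear", C=0.05).fit(X, y)

n_zero_l1 = int(np.sum(l1.coef_ == 0))
n_zero_l2 = int(np.sum(l2.coef_ == 0))

print(f"L1: {n_zero_l1} of {l1.coef_.size} coefficients are exactly zero")
print(f"L2: {n_zero_l2} of {l2.coef_.size} coefficients are exactly zero")
```

L1 is expected to discard most of the noise features outright, while L2 keeps every feature with a small but nonzero weight.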

[Figure: Regularization: Constraint Regions. L1 (Lasso): corner solutions → sparsity. L2 (Ridge): smooth shrinkage, no zeros. Elastic Net: best of both worlds.]

7. Complete Python Implementation

Binary Classification with Evaluation

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score, confusion_matrix, classification_report,
    roc_curve, auc, precision_recall_curve
)

np.random.seed(42)

# Generate synthetic medical data
n = 1000
X = np.column_stack([
    np.random.normal(50, 15, n),    # age
    np.random.normal(120, 20, n),   # blood_pressure
    np.random.normal(200, 40, n),   # cholesterol
    np.random.normal(100, 30, n),   # glucose
    np.random.normal(28, 5, n)      # BMI
])

# True relationship
logits = (
    0.05 * (X[:, 0] - 50) +
    0.02 * (X[:, 1] - 120) +
    0.01 * (X[:, 2] - 200) +
    0.03 * (X[:, 3] - 100) +
    0.1 * (X[:, 4] - 28) - 1
)
prob = 1 / (1 + np.exp(-logits))
y = np.random.binomial(1, prob)

df = pd.DataFrame(X, columns=['age', 'blood_pressure', 'cholesterol', 'glucose', 'bmi'])
df['heart_disease'] = y

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    df.drop('heart_disease', axis=1), df['heart_disease'],
    test_size=0.2, random_state=42, stratify=df['heart_disease']
)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train model
model = LogisticRegression(random_state=42, max_iter=1000)
model.fit(X_train_scaled, y_train)

# Predictions
y_pred = model.predict(X_test_scaled)
y_prob = model.predict_proba(X_test_scaled)[:, 1]

print("=== Model Performance ===")
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"\nClassification Report:")
print(classification_report(y_test, y_pred))

Feature Importance and Odds Ratios

feature_names = ['Age', 'Blood Pressure', 'Cholesterol', 'Glucose', 'BMI']
coefficients = model.coef_[0]

print("\n=== Feature Importance ===")
print(f"{'Feature':<20} {'Coefficient':>12} {'Odds Ratio':>12} {'Effect':>15}")
print("-" * 62)

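# The model was fit on standardized features, so each odds ratio below
# is the effect of a one-standard-deviation increase in that feature.
for name, coef in zip(feature_names, coefficients):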
    odds_ratio = np.exp(coef)
    direction = "increases" if coef > 0 else "decreases"
    print(f"{name:<20} {coef:>12.4f} {odds_ratio:>12.4f} {direction:>15}")

print(f"\nIntercept: {model.intercept_[0]:.4f}")

Decision Boundary Visualization

from matplotlib.colors import ListedColormap

# Use first two features for visualization
X_2d = X_train_scaled[:, :2]
y_2d = y_train.to_numpy()  # plain NumPy labels for boolean masking of X_2d

model_2d = LogisticRegression(random_state=42)
model_2d.fit(X_2d, y_2d)

# Create mesh grid
h = 0.02
x_min, x_max = X_2d[:, 0].min() - 1, X_2d[:, 0].max() + 1
y_min, y_max = X_2d[:, 1].min() - 1, X_2d[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

# Predict on mesh
Z = model_2d.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

# Plot
plt.figure(figsize=(10, 8))
# Region colors in class order (0, 1), matching the scatter points below:
# light blue for Class 0, light red for Class 1
cmap_light = ListedColormap(['#AAAAFF', '#FFAAAA'])

plt.contourf(xx, yy, Z, alpha=0.3, cmap=cmap_light)
plt.contour(xx, yy, Z, colors='black', linewidths=0.5)

plt.scatter(X_2d[y_2d == 0, 0], X_2d[y_2d == 0, 1], c='blue', 
            label='Class 0', alpha=0.6, edgecolors='black')
plt.scatter(X_2d[y_2d == 1, 0], X_2d[y_2d == 1, 1], c='red', 
            label='Class 1', alpha=0.6, edgecolors='black')

plt.xlabel('Feature 1 (standardized)')
plt.ylabel('Feature 2 (standardized)')
plt.title('Logistic Regression Decision Boundary')
plt.legend()
plt.show()

ROC Curve and AUC

fpr, tpr, thresholds = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)

plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='blue', lw=2, label=f'ROC curve (AUC = {roc_auc:.3f})')
plt.plot([0, 1], [0, 1], color='gray', linestyle='--', label='Random classifier')
plt.fill_between(fpr, tpr, alpha=0.2, color='blue')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.grid(True, alpha=0.3)
plt.show()

8. Threshold Optimization

Default threshold = 0.5 is not always optimal!

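Cost-Sensitive Threshold Selection:

$$t^{*} = \arg\min_{t}\;\left[\,C_{FN} \cdot FN(t) + C_{FP} \cdot FP(t)\,\right]$$

Where $C_{FN}$ = cost of false negative (missed disease), $C_{FP}$ = cost of false alarm, and $FN(t)$, $FP(t)$ are the numbers of false negatives and false positives at threshold $t$.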
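
Cost-sensitive selection can be sketched as a simple scan over candidate thresholds, keeping the one with the lowest total misclassification cost. Everything here is an illustrative assumption: the synthetic scores, the cost values `C_FN` and `C_FP`, and the helper `total_cost`. In practice the scan would be run on a validation set.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic, well-calibrated scores: each label is drawn with its own probability.
y_prob = rng.uniform(size=2000)
y_true = rng.binomial(1, y_prob)

# Hypothetical costs: a missed case is 5x as costly as a false alarm.
C_FN, C_FP = 5.0, 1.0

def total_cost(threshold):
    """Total cost of false negatives and false positives at a given threshold."""
    y_pred = (y_prob >= threshold).astype(int)
    fn = np.sum((y_pred == 0) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    return C_FN * fn + C_FP * fp

thresholds = np.arange(0.01, 1.00, 0.01)
costs = np.array([total_cost(t) for t in thresholds])
best_threshold = float(thresholds[np.argmin(costs)])

print(f"Cost at 0.50: {total_cost(0.5):.0f}")
print(f"Best threshold: {best_threshold:.2f}, cost: {costs.min():.0f}")
```

For perfectly calibrated probabilities the cost-minimizing threshold is C_FP / (C_FP + C_FN), about 0.17 for these costs, so the empirical optimum should land well below 0.5.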

from sklearn.metrics import f1_score, precision_recall_curve

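# Find optimal F1 threshold (shown on the test set for brevity;
# in practice, tune the threshold on a separate validation set)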
thresholds = np.arange(0.1, 0.9, 0.01)
f1_scores = [f1_score(y_test, (y_prob >= t).astype(int)) for t in thresholds]
optimal_threshold = thresholds[np.argmax(f1_scores)]

print(f"Optimal threshold (F1): {optimal_threshold:.2f}")
print(f"F1 at optimal: {max(f1_scores):.4f}")

# Plot threshold vs metrics
precision, recall, pr_thresholds = precision_recall_curve(y_test, y_prob)

plt.figure(figsize=(10, 6))
plt.plot(thresholds, f1_scores, label='F1 Score', color='blue', lw=2)
plt.axvline(x=optimal_threshold, color='red', linestyle='--', label=f'Optimal: {optimal_threshold:.2f}')
plt.xlabel('Threshold')
plt.ylabel('Score')
plt.title('Threshold Optimization')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

Key Takeaways

Sigmoid function maps any real number to (0, 1), which enables a probability interpretation
Cross-entropy loss is convex → guaranteed global minimum with gradient descent
Decision boundary is always linear in feature space (use polynomial features for non-linear)
Coefficient interpretation: $e^{w_j}$ = odds ratio for a one-unit increase in $x_j$
Multi-class: Use softmax for direct multiclass, or OvR for binary decomposition
Regularization: L1 for feature selection, L2 for smooth shrinkage, Elastic Net for both
Threshold tuning is critical: the default 0.5 may not be optimal for imbalanced data

Practice Exercises

Exercise 1: Binary Classification Pipeline

from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=10,
    n_classes=2, weights=[0.7, 0.3], random_state=42
)

# a) Split data, train model with StandardScaler
# b) Print confusion matrix
# c) Calculate precision, recall, F1
# d) Plot ROC curve and report AUC
# e) Compare with/without regularization

Exercise 2: Multi-class Problem

from sklearn.datasets import load_wine

wine = load_wine()
X, y = wine.data, wine.target

# a) Train multinomial logistic regression
# b) Train one-vs-rest logistic regression
# c) Compare accuracy and confusion matrices
# d) Which strategy performs better? Why?

Exercise 3: Cost-Sensitive Learning

Train a model with class_weight='balanced'
Compare confusion matrices with unweighted model
In which scenarios is the weighted model better?

Exercise 4: Polynomial Features

from sklearn.preprocessing import PolynomialFeatures

# Generate non-linear data (e.g., circles or moons)
# Train logistic regression with polynomial features of degree 2, 3, 4
# Visualize decision boundaries
# What's the trade-off with higher degree?

Discussion Questions

When would you prioritize recall over precision (and vice versa)?
Why might AUC be preferred over accuracy for imbalanced datasets?
How does regularization affect logistic regression coefficients and interpretability?
Under what assumptions does logistic regression fail?
