
Logistic Regression: Decision Boundary, Cost Function & Multiclass

Meta & Microsoft Interview

The workhorse of binary classification in industry

Interview Question

"Explain the decision boundary in logistic regression. Why do we use the logistic (sigmoid) function instead of a linear function for classification? How do you extend logistic regression to multiclass problems?"

Difficulty: Medium | Frequently asked at Meta, Microsoft, Amazon


Theoretical Foundation

The Logistic Model

Logistic regression models the probability that a binary outcome $y \in \{0, 1\}$ occurs given input features $x$:

$$P(y = 1 \mid x) = \sigma(z) = \frac{1}{1 + e^{-z}}$$

where $z = \beta^T x + \beta_0$ is the linear combination of features and the sigmoid function $\sigma(z)$ maps any real number to the interval $(0, 1)$.

Properties of the sigmoid function:

  • $\sigma(0) = 0.5$ (decision boundary)
  • $\sigma(-z) = 1 - \sigma(z)$ (symmetry)
  • $\sigma'(z) = \sigma(z)(1 - \sigma(z))$ (efficient gradient computation)
  • As $z \to \infty$, $\sigma(z) \to 1$
  • As $z \to -\infty$, $\sigma(z) \to 0$
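The properties listed above are easy to check numerically. The following is a minimal sketch using only the standard library; the test point `z = 1.7` and the step size `h` are arbitrary choices, not values from the lesson.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # z < 0, so exp(z) cannot overflow
    return ez / (1.0 + ez)


def sigmoid_grad(z: float) -> float:
    """Derivative via the identity sigma'(z) = sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)


z = 1.7   # arbitrary test point
h = 1e-6  # step for the central finite difference
numeric_grad = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)

print("sigma(0)       =", sigmoid(0.0))
print("symmetry gap   =", abs(sigmoid(-z) - (1.0 - sigmoid(z))))
print("derivative gap =", abs(numeric_grad - sigmoid_grad(z)))
print("limits         =", sigmoid(-1000.0), sigmoid(1000.0))
```

Branching on the sign of `z` is one common way to keep `math.exp` from overflowing; in floating point the two limits are reached exactly for very large `|z|`.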

ℹ️

Key Insight: The sigmoid function is the inverse of the logit function: $\text{logit}(p) = \log\left(\frac{p}{1-p}\right)$. This means logistic regression models the log-odds of the outcome as a linear function of the features.
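One practical consequence of the log-odds being linear: increasing a feature by one unit multiplies the odds by $e^{\beta_j}$. A small sketch of this, assuming a hypothetical one-feature model with made-up coefficients `beta0` and `beta1`:

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    """Log-odds; the inverse of the sigmoid on (0, 1)."""
    return math.log(p / (1.0 - p))


# Hypothetical coefficients, chosen only for illustration.
beta0, beta1 = -1.0, 0.8


def odds(x):
    p = sigmoid(beta0 + beta1 * x)
    return p / (1.0 - p)


round_trip = logit(sigmoid(0.3))     # should recover 0.3
odds_ratio = odds(3.0) / odds(2.0)   # effect of a one-unit increase in x

print("logit(sigmoid(0.3)) =", round_trip)
print("odds ratio          =", odds_ratio)
print("exp(beta1)          =", math.exp(beta1))
```

The odds ratio does not depend on where the one-unit step is taken, which is what makes coefficients interpretable on the odds scale.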

Why Not Use Linear Regression for Classification?

There are three fundamental problems:

  1. Unbounded Output: Linear regression predicts values in $(-\infty, \infty)$, but probabilities must lie in $[0, 1]$

  2. Non-normal Errors: For a binary outcome, $y$ given $x$ follows a Bernoulli distribution, so the error term $\epsilon$ can take only two values for any given $x$ and cannot be normally distributed

  3. Heteroscedasticity: The error variance is $p(x)(1 - p(x))$, which depends on the input and violates the constant-variance assumption of OLS

The Decision Boundary

The decision boundary is the hypersurface where the model switches between predicting class 0 and class 1. For logistic regression:

$$\text{Decision boundary: } \beta^T x + \beta_0 = 0$$

This is because:

  • When $\beta^T x + \beta_0 > 0$: $P(y=1 \mid x) > 0.5$ → Predict class 1
  • When $\beta^T x + \beta_0 < 0$: $P(y=1 \mid x) < 0.5$ → Predict class 0
  • When $\beta^T x + \beta_0 = 0$: $P(y=1 \mid x) = 0.5$ → Boundary

The decision boundary is always linear (a hyperplane in feature space). This is both a strength (simple, interpretable) and a limitation (cannot capture non-linear relationships without feature engineering).
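To make the linearity concrete, here is a sketch with two features. The parameters `beta` and `beta0` are hypothetical, not fitted to any data; the point is that thresholding the probability at 0.5 is the same as checking the sign of the linear score, and that the boundary is a straight line in the $(x_1, x_2)$ plane.

```python
import math

# Hypothetical fitted parameters for a model with two features.
beta = [2.0, -1.0]
beta0 = -0.5


def linear_score(x):
    """z = beta^T x + beta_0"""
    return sum(b * xi for b, xi in zip(beta, x)) + beta0


def predict_proba(x):
    return 1.0 / (1.0 + math.exp(-linear_score(x)))


def predict(x, threshold=0.5):
    return int(predict_proba(x) >= threshold)


def boundary_x2(x1):
    """Solve beta[0]*x1 + beta[1]*x2 + beta0 = 0 for x2 (needs beta[1] != 0)."""
    return -(beta0 + beta[0] * x1) / beta[1]


for x in [(1.0, 0.0), (0.0, 1.0), (1.0, boundary_x2(1.0))]:
    print(x, "z =", linear_score(x), "p =", round(predict_proba(x), 3),
          "class =", predict(x))
```

Points exactly on the boundary get probability 0.5; whether they are labelled 0 or 1 is a tie-breaking convention (this sketch uses `>=`).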

⚠️

Common Misconception: Many candidates think the decision boundary in logistic regression is curved because the probability is non-linear. The boundary itself is linear; only the probability mapping is non-linear.

The Cost Function

Logistic regression uses Binary Cross-Entropy (BCE) loss, also called log loss:

$$J(\beta) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(\hat{p}_i) + (1-y_i) \log(1-\hat{p}_i) \right]$$

where $\hat{p}_i = \sigma(\beta^T x_i + \beta_0)$.

Why not use Mean Squared Error?

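If we used MSE:

$$J(\beta) = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - \sigma(\beta^T x_i + \beta_0)\right)^2$$

The problem is that composing the squared error with the sigmoid makes the loss non-convex in $\beta$, so gradient-based optimization can stall in flat regions or local minima. The MSE gradient also vanishes when the sigmoid saturates, even for confidently wrong predictions. BCE is convex in $\beta$, so any local minimum is a global minimum.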
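The difference shows up clearly on a single confidently wrong prediction. The sketch below compares the per-sample losses and their derivatives with respect to the score $z$ for a positive example ($y = 1$); the chosen `z` values are arbitrary, and the `eps` clipping is a common numerical safeguard rather than part of the definition.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce(y, p, eps=1e-12):
    """Per-sample binary cross-entropy, with p clipped away from 0 and 1."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def mse(y, p):
    return (y - p) ** 2


def bce_grad_z(y, z):
    """d(BCE)/dz simplifies to p - y."""
    return sigmoid(z) - y


def mse_grad_z(y, z):
    """d(MSE)/dz = 2 (p - y) p (1 - p); the p(1 - p) factor vanishes at saturation."""
    p = sigmoid(z)
    return 2.0 * (p - y) * p * (1.0 - p)


y = 1
for z in (0.0, -2.0, -6.0):
    p = sigmoid(z)
    print(f"z={z:5.1f}  p={p:.4f}  BCE={bce(y, p):.3f}  MSE={mse(y, p):.3f}  "
          f"dBCE/dz={bce_grad_z(y, z):.4f}  dMSE/dz={mse_grad_z(y, z):.4f}")
```

As the prediction becomes more wrong, BCE keeps growing and its gradient stays close to -1, while MSE is capped below 1 and its gradient shrinks toward zero.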

Mathematical Derivation of BCE:

From maximum likelihood estimation:

$$\mathcal{L}(\beta) = \prod_{i=1}^{n} \hat{p}_i^{y_i} (1-\hat{p}_i)^{1-y_i}$$

Taking the log:

$$\log \mathcal{L}(\beta) = \sum_{i=1}^{n} \left[ y_i \log(\hat{p}_i) + (1-y_i) \log(1-\hat{p}_i) \right]$$

Negating the log-likelihood and dividing by $n$ gives exactly $J(\beta)$, so minimizing BCE is equivalent to maximizing the likelihood.

Gradient of the Cost Function

The gradient has a beautiful form:

$$\frac{\partial J}{\partial \beta_j} = \frac{1}{n} \sum_{i=1}^{n} (\hat{p}_i - y_i) x_{ij}$$

This has the same form as the OLS gradient, with the linear prediction replaced by $\hat{p}_i$. Unlike OLS, setting it to zero has no closed-form solution, so logistic regression is fit with iterative optimization algorithms.
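A quick way to convince yourself of the gradient formula is to compare it against finite differences of the cost. This is a sketch on a tiny made-up dataset; `X`, `y`, and the evaluation point `beta`, `beta0` are arbitrary illustrative values.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def cost(beta, beta0, X, y):
    """Mean binary cross-entropy J(beta)."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(sum(b * v for b, v in zip(beta, xi)) + beta0)
        total += yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return -total / len(y)


def gradient(beta, beta0, X, y):
    """Analytic gradient: (1/n) * sum_i (p_hat_i - y_i) * x_ij."""
    n = len(y)
    errors = [sigmoid(sum(b * v for b, v in zip(beta, xi)) + beta0) - yi
              for xi, yi in zip(X, y)]
    return [sum(e * xi[j] for e, xi in zip(errors, X)) / n
            for j in range(len(beta))]


# Made-up data and evaluation point, for illustration only.
X = [[0.5, 1.2], [1.5, -0.3], [-1.0, 0.8], [2.0, 2.0]]
y = [0, 1, 0, 1]
beta, beta0 = [0.3, -0.2], 0.1

analytic = gradient(beta, beta0, X, y)
h = 1e-6
numeric = []
for j in range(len(beta)):
    up, down = beta.copy(), beta.copy()
    up[j] += h
    down[j] -= h
    numeric.append((cost(up, beta0, X, y) - cost(down, beta0, X, y)) / (2 * h))

max_diff = max(abs(a - b) for a, b in zip(analytic, numeric))
print("analytic:", analytic)
print("numeric: ", numeric)
print("max abs difference:", max_diff)
```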

Optimization Algorithms

1. Gradient Descent:

$$\beta^{(t+1)} = \beta^{(t)} - \eta \nabla J(\beta^{(t)})$$
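The update rule can be sketched in a few lines of plain Python. The learning rate `lr`, the iteration count, and the toy dataset below are illustrative assumptions; a production implementation would vectorize this and add a convergence check.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_gradient_descent(X, y, lr=0.5, n_iters=2000):
    """Batch gradient descent on mean BCE; returns (beta, beta0)."""
    n, p = len(X), len(X[0])
    beta, beta0 = [0.0] * p, 0.0
    for _ in range(n_iters):
        # errors_i = p_hat_i - y_i, the common factor in every partial derivative
        errors = [
            sigmoid(sum(b * v for b, v in zip(beta, xi)) + beta0) - yi
            for xi, yi in zip(X, y)
        ]
        grad = [sum(e * xi[j] for e, xi in zip(errors, X)) / n for j in range(p)]
        grad0 = sum(errors) / n
        beta = [b - lr * g for b, g in zip(beta, grad)]
        beta0 -= lr * grad0
    return beta, beta0


# Toy, non-separable data with one feature; values are made up for illustration.
X = [[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]]
y = [0, 0, 1, 0, 1, 1]

beta_hat, beta0_hat = fit_gradient_descent(X, y)
print("slope:", round(beta_hat[0], 3), "intercept:", round(beta0_hat, 3))
```

The data are deliberately non-separable. On perfectly separable data the unregularized coefficients grow without bound, so the loop would never settle.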

2. Newton-Raphson (Second-Order):

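$$\beta^{(t+1)} = \beta^{(t)} - H^{-1} \nabla J(\beta^{(t)})$$

where $H$ is the Hessian matrix of $J$. Newton's method converges in far fewer iterations than gradient descent, but each iteration costs $O(np^2)$ to form the Hessian plus $O(p^3)$ to solve the linear system, for $n$ samples and $p$ features.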
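For a model with one feature and an intercept, the Hessian is only 2×2, so the Newton step can be written out by hand. The sketch below assumes the standard logistic Hessian $H = \frac{1}{n}\sum_i \hat{p}_i(1-\hat{p}_i)\,\tilde{x}_i \tilde{x}_i^T$ with $\tilde{x}_i = (x_i, 1)$; the dataset is made up, and there is no step-size damping, which a robust implementation would normally add.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gradient(b1, b0, x, y):
    """Gradient of mean BCE with respect to slope b1 and intercept b0."""
    n = len(x)
    errors = [sigmoid(b1 * xi + b0) - yi for xi, yi in zip(x, y)]
    return sum(e * xi for e, xi in zip(errors, x)) / n, sum(errors) / n


def fit_newton(x, y, n_iters=10):
    """Undamped Newton-Raphson for one feature plus intercept."""
    b1, b0 = 0.0, 0.0
    n = len(x)
    for _ in range(n_iters):
        g1, g0 = gradient(b1, b0, x, y)
        # Hessian weights p_i (1 - p_i)
        p = [sigmoid(b1 * xi + b0) for xi in x]
        w = [pi * (1.0 - pi) for pi in p]
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x)) / n
        h10 = sum(wi * xi for wi, xi in zip(w, x)) / n
        h00 = sum(w) / n
        det = h11 * h00 - h10 * h10
        # Newton step: solve H * step = g using the closed-form 2x2 inverse.
        step1 = (h00 * g1 - h10 * g0) / det
        step0 = (h11 * g0 - h10 * g1) / det
        b1, b0 = b1 - step1, b0 - step0
    return b1, b0


# Toy, non-separable data; values are made up for illustration.
x = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
y = [0, 0, 1, 0, 1, 1]

b1_hat, b0_hat = fit_newton(x, y)
g1_final, g0_final = gradient(b1_hat, b0_hat, x, y)
print("slope:", round(b1_hat, 4), "intercept:", round(b0_hat, 4))
print("gradient at solution:", g1_final, g0_final)
```

Ten iterations are far more than this problem needs; the gradient is already at machine precision after a handful of steps, which illustrates the fast local convergence that motivates second-order methods.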

3. L-BFGS (Limited-memory BFGS):

  • Approximates the Hessian using limited memory
  • Default solver in scikit-learn's LogisticRegression (solver='lbfgs')
  • Good balance of speed and memory efficiency

4. Coordinate Descent (for regularized logistic regression):

  • Used when L1/L2 penalties are added
  • Solves one coordinate at a time
  • Very efficient for high-dimensional sparse data

Multiclass Extensions

One-vs-Rest (OvR) / One-vs-All

Trains $K$ binary classifiers, one for each class:

$$\text{Class } k \text{ vs. Not Class } k$$

Prediction: Choose the class whose classifier outputs the highest probability.

Pros: Simple, parallelizable

Cons: The $K$ class probabilities don't sum to 1 unless renormalized; can be suboptimal when classes are not balanced
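A short sketch of the OvR prediction step. The class names and linear scores are hypothetical stand-ins for the outputs $z_k = \beta_k^T x + \beta_{k0}$ of three separately trained binary classifiers.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical scores from K = 3 independently trained binary classifiers.
scores = {"cat": 1.2, "dog": 0.4, "bird": -2.0}

# Each classifier reports its own P(class k vs. rest).
ovr_probs = {label: sigmoid(z) for label, z in scores.items()}
total = sum(ovr_probs.values())

# Prediction: the class whose classifier is most confident.
predicted = max(ovr_probs, key=ovr_probs.get)

# Optional renormalization so the values can be read as a distribution.
normalized = {label: p / total for label, p in ovr_probs.items()}

print("raw OvR probabilities:", ovr_probs)
print("sum of raw values:    ", total)
print("predicted class:      ", predicted)
print("renormalized:         ", normalized)
```

Dividing by the sum is a simple post-hoc fix; it restores a valid distribution but does not make the underlying classifiers jointly trained.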

Multinomial Logistic Regression (Softmax)

Directly models $K$ classes using the softmax function:

$$P(y = k \mid x) = \frac{e^{\beta_k^T x}}{\sum_{j=1}^{K} e^{\beta_j^T x}}$$

Properties:

  • Probabilities sum to 1: $\sum_{k=1}^{K} P(y=k \mid x) = 1$
  • Requires $K$ weight vectors (or $K-1$ if one class is the reference)
  • Uses categorical cross-entropy loss:

$$J(\beta) = -\frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{K} y_{ik} \log(\hat{p}_{ik})$$
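Below is a minimal sketch of softmax and categorical cross-entropy in plain Python. Subtracting the maximum score before exponentiating is a standard stability trick and does not change the result; the `eps` floor inside the log is likewise a numerical safeguard, not part of the definition.

```python
import math


def softmax(scores):
    """Convert K linear scores beta_k^T x into probabilities that sum to 1."""
    m = max(scores)  # shift for numerical stability; softmax is shift-invariant
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(Y_onehot, P, eps=1e-12):
    """Mean over samples of -sum_k y_ik * log(p_ik)."""
    n = len(Y_onehot)
    loss = 0.0
    for y_i, p_i in zip(Y_onehot, P):
        loss -= sum(y_ik * math.log(max(p_ik, eps)) for y_ik, p_ik in zip(y_i, p_i))
    return loss / n


probs = softmax([1.2, 0.4, -2.0])
print("probabilities:", probs, "sum =", sum(probs))

# With K = 2 and the second class as reference (score 0), softmax is the sigmoid.
z = 0.7
print("softmax([z, 0])[0] =", softmax([z, 0.0])[0])
print("sigmoid(z)         =", 1.0 / (1.0 + math.exp(-z)))

# A uniform prediction over 3 classes costs log(3) regardless of the true label.
print("uniform loss:", categorical_cross_entropy([[0, 1, 0]], [[1 / 3, 1 / 3, 1 / 3]]))
```

Shift-invariance is also why only $K-1$ weight vectors are identifiable: adding the same vector to every $\beta_k$ leaves all probabilities unchanged.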

💡

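Production Tip: In practice, multinomial logistic regression (softmax) is usually preferred over one-vs-rest when classes are mutually exclusive, because it fits all classes jointly and yields probabilities that sum to 1. Meta uses softmax regression for their content recommendation ranking systems.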


Code Implementation

Explanation of Code

  1. Decision Boundary: Visualizes the linear decision boundary and shows the equation of the separating hyperplane.

  2. Probability Output: Demonstrates how logistic regression outputs class probabilities and how different thresholds affect predictions.

  3. Multiclass Comparison: Compares One-vs-Rest vs Multinomial (Softmax) approaches, showing that Softmax produces probabilities that sum to 1 across classes.

  4. Cost Function: Illustrates why BCE is preferred over MSE for classification, showing BCE penalizes confident wrong predictions more heavily.

  5. Regularization: Shows how C (the inverse of the regularization strength λ) controls model complexity in logistic regression: a smaller C means a stronger penalty and smaller coefficients.
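The effect of C can be sketched without any library. The code below assumes an L2 penalty of the form $\frac{1}{2C}\lVert\beta\rVert^2$ added to the summed log loss, with an unpenalized intercept; this mirrors the role of C as an inverse regularization strength but is a hand-rolled illustration, not scikit-learn's implementation. The dataset, learning rate, and C values are made up.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_l2(x, y, C, lr=0.1, n_iters=3000):
    """Gradient descent on (1/n) * [sum of log losses + b1^2 / (2C)].

    One feature plus an unpenalized intercept; smaller C = stronger penalty.
    """
    n = len(x)
    b1, b0 = 0.0, 0.0
    for _ in range(n_iters):
        errors = [sigmoid(b1 * xi + b0) - yi for xi, yi in zip(x, y)]
        g1 = (sum(e * xi for e, xi in zip(errors, x)) + b1 / C) / n
        g0 = sum(errors) / n
        b1 -= lr * g1
        b0 -= lr * g0
    return b1, b0


# Toy, non-separable data; values are made up for illustration.
x = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
y = [0, 0, 1, 0, 1, 1]

coefs = {C: fit_l2(x, y, C)[0] for C in (0.1, 1.0, 10.0)}
for C, b1 in coefs.items():
    print(f"C = {C:5.1f}  ->  slope = {b1:.3f}")
```

The slope shrinks toward zero as C decreases, which is the sense in which C controls model complexity.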


Real-World Applications

Meta: Content Ranking

Meta uses logistic regression (and its neural network extensions) for:

  • News Feed Ranking: Predicting probability of user engagement
  • Ad Targeting: Estimating click-through rates (CTR)
  • Content Moderation: Classifying potentially harmful content

Microsoft: Spam Detection

Microsoft's email spam filters use multinomial logistic regression with:

  • TF-IDF features from email text
  • Header features (sender reputation, time sent)
  • Behavioral features (sender-recipient interaction history)

Industry Best Practices

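  1. Feature Scaling: Standardize features before fitting regularized logistic regression; the penalty and the convergence of gradient-based solvers both depend on feature scale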
  2. Class Imbalance: Use class_weight='balanced' or SMOTE
  3. Multicollinearity: Check VIF values; use regularization if VIF > 10
  4. Model Calibration: Use CalibratedClassifierCV if probabilities need to be well-calibrated
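As an illustration of the feature-scaling practice, the sketch below standardizes one feature column using statistics estimated on the training data only, then applies the same transform to new data. The helper names `fit_scaler` and `transform` and the numbers are hypothetical; population standard deviation is used here as an assumption.

```python
import statistics


def fit_scaler(column):
    """Estimate mean and population standard deviation from training data."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    if std == 0:
        std = 1.0  # constant feature: avoid division by zero
    return mean, std


def transform(column, mean, std):
    return [(v - mean) / std for v in column]


# Made-up values for a single feature.
train = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
test = [5.0, 9.0]

mean, std = fit_scaler(train)
train_scaled = transform(train, mean, std)
test_scaled = transform(test, mean, std)  # reuse training statistics

print("mean =", mean, "std =", std)
print("scaled test values:", test_scaled)
```

Fitting the scaler on the training split alone keeps information from the test set out of the model.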

💡

Meta Interview Tip: Be prepared to discuss how logistic regression scales to billions of samples. Mention techniques like stochastic gradient descent, feature hashing, and parameter servers.
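Two of the techniques named in this tip, stochastic gradient descent and feature hashing, fit in a short sketch. Everything here is illustrative: the bucket count, the token format, the use of `zlib.crc32` as the hash, and the tiny event stream are assumptions, and a real system would shard the weight vector across parameter servers.

```python
import math
import zlib

N_BUCKETS = 2 ** 10  # fixed model size, independent of vocabulary size


def hash_features(tokens):
    """Hashing trick: map string features to a sparse {bucket: count} dict."""
    x = {}
    for token in tokens:
        idx = zlib.crc32(token.encode("utf-8")) % N_BUCKETS
        x[idx] = x.get(idx, 0.0) + 1.0
    return x


def predict(weights, bias, x):
    z = bias + sum(weights[i] * v for i, v in x.items())
    return 1.0 / (1.0 + math.exp(-z))


def sgd_step(weights, bias, x, y, lr=0.1):
    """One stochastic update; touches only the non-zero features of x.

    Mutates weights in place and returns the updated bias.
    """
    err = predict(weights, bias, x) - y
    for i, v in x.items():
        weights[i] -= lr * err * v
    return bias - lr * err


# Hypothetical click stream: (features, clicked).
stream = [
    (["ad:sports", "user:1"], 1),
    (["ad:finance", "user:1"], 0),
    (["ad:sports", "user:2"], 1),
    (["ad:finance", "user:2"], 0),
]

weights, bias = [0.0] * N_BUCKETS, 0.0
for _ in range(50):  # several passes over the stream
    for tokens, clicked in stream:
        bias = sgd_step(weights, bias, hash_features(tokens), clicked)

for tokens, clicked in stream:
    print(tokens, "label =", clicked,
          "p =", round(predict(weights, bias, hash_features(tokens)), 3))
```

Each update costs time proportional to the number of active features rather than to the model size, which is what lets this approach scale to very large sparse datasets.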


Common Follow-Up Questions

Q1: Why is the sigmoid function preferred over other activation functions for binary classification?

The sigmoid function is preferred because:

  • It's the inverse of the logit function, giving a natural probabilistic interpretation
  • Its derivative $\sigma'(z) = \sigma(z)(1-\sigma(z))$ is easy to compute
  • It arises naturally from the exponential family: the logit is the canonical link function for the Bernoulli distribution
  • It ensures outputs are in $(0, 1)$

Q2: How do you handle multiclass problems where classes are not mutually exclusive?

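Use the One-vs-Rest (OvR) structure instead of Softmax, but threshold each classifier independently rather than taking the argmax. Softmax assumes mutual exclusivity (probabilities sum to 1). For multi-label classification, train independent binary classifiers for each class, so an example can receive zero, one, or several labels.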
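A minimal sketch of the multi-label prediction step, assuming hypothetical per-label scores from independently trained binary classifiers and a shared threshold of 0.5 (in practice each label may get its own tuned threshold):

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical linear scores, one per label, for a single document.
scores = {"sports": 2.1, "politics": 0.3, "finance": -1.5}

# Each label gets its own probability; they are not forced to sum to 1.
probs = {label: sigmoid(z) for label, z in scores.items()}

# Assign every label whose probability clears the threshold.
threshold = 0.5
assigned = [label for label, p in probs.items() if p >= threshold]

print("per-label probabilities:", probs)
print("assigned labels:", assigned)
```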

Q3: What is the connection between logistic regression and neural networks?

Logistic regression is equivalent to a single-layer neural network with:

  • One output neuron
  • Sigmoid activation function
  • Binary cross-entropy loss

Deep learning extends this by adding hidden layers and non-linear activations.

Q4: How do you detect and handle class imbalance in logistic regression?

  1. Resampling: SMOTE (oversampling minority) or undersampling majority
  2. Class weights: Set class_weight='balanced' in sklearn
  3. Threshold tuning: Lower the decision threshold below 0.5 to increase recall on the positive (minority) class, at the cost of precision
  4. Metrics: Use F1, AUC-PR instead of accuracy
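Two of these remedies can be sketched directly. The class-weight helper below uses the `n_samples / (n_classes * count)` heuristic, which is the formula scikit-learn documents for `class_weight='balanced'`; the predicted probabilities and labels are made up to show the precision/recall trade-off when the threshold moves.

```python
from collections import Counter


def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    n, k = len(y), len(counts)
    return {label: n / (k * c) for label, c in counts.items()}


def precision_recall(probs, y, threshold):
    """Precision and recall for the positive class at a given threshold."""
    tp = sum(1 for p, yi in zip(probs, y) if p >= threshold and yi == 1)
    fp = sum(1 for p, yi in zip(probs, y) if p >= threshold and yi == 0)
    fn = sum(1 for p, yi in zip(probs, y) if p < threshold and yi == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up model outputs on an imbalanced validation set (3 positives, 5 negatives).
probs = [0.05, 0.10, 0.20, 0.30, 0.35, 0.45, 0.60, 0.80]
y = [0, 0, 0, 0, 1, 0, 1, 1]

weights = balanced_class_weights(y)
print("class weights:", weights)
for t in (0.5, 0.3):
    precision, recall = precision_recall(probs, y, t)
    print(f"threshold {t}: precision = {precision:.2f}, recall = {recall:.2f}")
```

Lowering the threshold from 0.5 to 0.3 recovers the missed positive but admits two false positives, which is why the threshold should be chosen against the metric that matters for the application.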

Company-Specific Tips

Meta Interview Tips

  • Discuss online learning variants for streaming data
  • Be ready to explain probability calibration techniques
  • Mention A/B testing frameworks for model comparison
  • Talk about feature importance in high-dimensional sparse data

Microsoft Interview Tips

  • Focus on interpretability requirements in regulated industries
  • Discuss model serving at scale (batch vs real-time)
  • Be prepared to explain regularization choices in production
  • Mention monitoring for model drift in live systems
