Meta & Microsoft Interview

Logistic Regression: Decision Boundary, Cost Function & Multiclass

The workhorse of binary classification in industry

Interview Question

"Explain the decision boundary in logistic regression. Why do we use the logistic (sigmoid) function instead of a linear function for classification? How do you extend logistic regression to multiclass problems?"

Difficulty: Medium | Frequently asked at Meta, Microsoft, Amazon

Theoretical Foundation

The Logistic Model

Logistic regression models the probability that a binary outcome $y \in \{0, 1\}$ occurs given input features $x$:

$$P(y = 1 | x) = \sigma(z) = \frac{1}{1 + e^{-z}}$$

where $z = \beta^T x + \beta_0$ is the linear combination of features and the sigmoid function $\sigma(z)$ maps any real number to the interval $(0, 1)$.

Properties of the sigmoid function:

$\sigma(0) = 0.5$ (decision boundary)
$\sigma(-z) = 1 - \sigma(z)$ (symmetry)
$\sigma'(z) = \sigma(z)(1 - \sigma(z))$ (efficient gradient computation)
As $z \to \infty$, $\sigma(z) \to 1$
As $z \to -\infty$, $\sigma(z) \to 0$
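The properties listed above can be checked numerically. The following is a minimal sketch using only the standard library; the function names `sigmoid` and `sigmoid_grad` are illustrative, and the branch on the sign of `z` is one common way to avoid overflow, not the only one.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function, written to stay finite for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use the equivalent form e^z / (1 + e^z) so that
    # exp() is never called with a large positive argument.
    ez = math.exp(z)
    return ez / (1.0 + ez)

def sigmoid_grad(z: float) -> float:
    """Derivative expressed through the function value: s * (1 - s)."""
    s = sigmoid(z)
    return s * (1.0 - s)

for z in (-1000.0, -2.0, 0.0, 2.0, 1000.0):
    print(f"z={z:>8}: sigma={sigmoid(z):.6f}  sigma'={sigmoid_grad(z):.6f}")
```

The derivative peaks at $z = 0$ and shrinks toward zero in both tails, which is why gradients become small when the model is very confident.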

ℹ️

Key Insight: The sigmoid function is the inverse of the logit function: $\text{logit}(p) = \log\left(\frac{p}{1-p}\right)$. This means logistic regression models the log-odds of the outcome as a linear function of the features.
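The inverse relationship can be verified in a few steps. Starting from $p = \sigma(z)$:

$$
1 - p = \frac{e^{-z}}{1 + e^{-z}}, \qquad \frac{p}{1-p} = \frac{1}{e^{-z}} = e^{z}, \qquad \log\left(\frac{p}{1-p}\right) = z = \beta^T x + \beta_0
$$

One consequence: increasing feature $x_j$ by one unit, with the others held fixed, adds $\beta_j$ to the log-odds and so multiplies the odds by $e^{\beta_j}$.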

Why Not Use Linear Regression for Classification?

There are three fundamental problems:

Unbounded Output: Linear regression predicts values in $(-\infty, \infty)$, but probabilities must be in $[0, 1]$
Non-normal Errors: For a binary outcome, the error term $\epsilon$ can take only two values at any given $x$, so it cannot be normally distributed (the outcome itself is Bernoulli)
Heteroscedasticity: The variance of the error term, $p(x)(1 - p(x))$, depends on the input, violating the constant-variance assumption of OLS
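The unbounded-output problem is easy to reproduce. The sketch below fits ordinary least squares to a hypothetical one-feature toy dataset (the data and the query points are made up for illustration) and evaluates the fitted line outside the training range.

```python
import numpy as np

# Hypothetical toy data: one feature, binary label that flips at x = 5.
x = np.arange(0.0, 11.0)            # 0, 1, ..., 10
y = (x >= 5).astype(float)

# Ordinary least squares fit of y on [1, x].
A = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# Evaluate the fitted line at and beyond the edges of the training range.
x_new = np.array([-5.0, 0.0, 5.0, 10.0, 15.0])
y_hat = coef[0] + coef[1] * x_new
print(y_hat)  # the extreme entries fall below 0 and above 1
```

A straight line keeps growing in both directions, so some "probabilities" are necessarily negative or greater than one.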

The Decision Boundary

The decision boundary is the hypersurface where the model switches between predicting class 0 and class 1. For logistic regression:

$$\text{Decision boundary: } \beta^T x + \beta_0 = 0$$

This is because:

When $\beta^T x + \beta_0 > 0$: $P(y=1|x) > 0.5$ → Predict class 1
When $\beta^T x + \beta_0 < 0$: $P(y=1|x) < 0.5$ → Predict class 0
When $\beta^T x + \beta_0 = 0$: $P(y=1|x) = 0.5$ → Boundary

The decision boundary is always linear (a hyperplane in feature space). This is both a strength (simple, interpretable) and a limitation (cannot capture non-linear relationships without feature engineering).

⚠️

Common Misconception: Many candidates think the decision boundary in logistic regression is curved because the probability is non-linear. The boundary itself is linear; only the probability mapping is non-linear.
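To make the distinction concrete, here is a minimal sketch with two features. The weights `beta` and intercept `beta0` are hypothetical values chosen for illustration, not fitted to any data. It shows that thresholding the probability at 0.5 gives the same labels as checking the sign of the linear score, and that the boundary is the straight line obtained by solving $\beta_1 x_1 + \beta_2 x_2 + \beta_0 = 0$ for $x_2$.

```python
import numpy as np

beta = np.array([2.0, -1.0])   # hypothetical weights
beta0 = -1.0                   # hypothetical intercept

def predict_proba(X):
    """P(y = 1 | x) for each row of X."""
    z = X @ beta + beta0
    return 1.0 / (1.0 + np.exp(-z))

def boundary_x2(x1):
    """Solve beta[0]*x1 + beta[1]*x2 + beta0 = 0 for x2."""
    return -(beta0 + beta[0] * x1) / beta[1]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))

# Two equivalent decision rules: probability threshold vs. sign of z.
by_probability = predict_proba(X) > 0.5
by_sign = (X @ beta + beta0) > 0
print("same labels:", np.array_equal(by_probability, by_sign))

# Points on the line all receive probability 0.5.
x1 = np.linspace(-3, 3, 5)
on_boundary = np.column_stack([x1, boundary_x2(x1)])
print(predict_proba(on_boundary))
```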

The Cost Function

Logistic regression uses Binary Cross-Entropy (BCE) loss, also called log loss:

$$J(\beta) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(\hat{p}_i) + (1-y_i) \log(1-\hat{p}_i) \right]$$

where $\hat{p}_i = \sigma(\beta^T x_i + \beta_0)$.

Why not use Mean Squared Error?

If we used MSE: $J(\beta) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \sigma(\beta^T x_i + \beta_0))^2$

The problem is that composing MSE with the sigmoid gives a loss that is non-convex in $\beta$, so gradient-based optimizers can stall in flat regions or settle in local minima. BCE is convex in $\beta$, so any local minimum is also a global minimum.
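The convexity difference can be seen on the smallest possible case: a single training example with $x = 1$, $y = 1$ and a single weight $w$. The sketch below is an illustration under that assumption, not a general proof. It evaluates both losses on a grid and inspects the discrete second derivative, then compares the penalty each loss assigns to a confident wrong prediction.

```python
import numpy as np

w = np.linspace(-6.0, 6.0, 601)          # candidate weight values
p = 1.0 / (1.0 + np.exp(-w))             # predicted probability for x = 1

mse = (1.0 - p) ** 2                     # squared error when y = 1
bce = np.logaddexp(0.0, -w)              # -log(sigmoid(w)), computed stably

# Discrete second derivative: negative entries mark concave regions.
mse_curv = np.diff(mse, n=2)
bce_curv = np.diff(bce, n=2)
print("MSE has a concave region:", bool((mse_curv < 0).any()))
print("BCE curvature never negative:", bool((bce_curv >= 0).all()))

# Penalty for a confident wrong prediction (y = 1, predicted p = 0.001).
p_wrong = 0.001
print("MSE penalty:", (1.0 - p_wrong) ** 2)   # bounded above by 1
print("BCE penalty:", -np.log(p_wrong))       # grows without bound as p -> 0
```

MSE saturates near 1 however wrong the prediction is, while BCE keeps increasing, so BCE provides a much stronger gradient signal for confident mistakes.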

Mathematical Derivation of BCE:

From maximum likelihood estimation:

$$\mathcal{L}(\beta) = \prod_{i=1}^{n} \hat{p}_i^{y_i} (1-\hat{p}_i)^{1-y_i}$$

Taking the log:

$$\log \mathcal{L}(\beta) = \sum_{i=1}^{n} \left[ y_i \log(\hat{p}_i) + (1-y_i) \log(1-\hat{p}_i) \right]$$

Minimizing the negative log-likelihood gives us BCE.
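Written out, the two quantities differ only by sign and a factor of $1/n$, so maximizing one is the same as minimizing the other:

$$
J(\beta) = -\frac{1}{n} \log \mathcal{L}(\beta), \qquad \arg\max_\beta \mathcal{L}(\beta) = \arg\min_\beta J(\beta)
$$

For a single observation with $y_i = 1$ the loss reduces to $-\log(\hat{p}_i)$, and with $y_i = 0$ it reduces to $-\log(1-\hat{p}_i)$.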

Gradient of the Cost Function

The gradient has a beautiful form:

$$\frac{\partial J}{\partial \beta_j} = \frac{1}{n} \sum_{i=1}^{n} (\hat{p}_i - y_i) x_{ij}$$

This is identical in form to the OLS gradient, which is why logistic regression can be solved with similar optimization algorithms.
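A quick way to gain confidence in the gradient formula is to compare it against central finite differences. The sketch below does this on random data; it assumes the intercept is absorbed into $\beta$ by adding a column of ones to `X`, and the function names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(beta, X, y):
    """Mean binary cross-entropy; X already contains a column of ones."""
    z = X @ beta
    # log(1 + e^z) - y*z is algebraically equal to the BCE term
    # and avoids evaluating log(0).
    return np.mean(np.logaddexp(0.0, z) - y * z)

def bce_grad(beta, X, y):
    """Analytic gradient: (1/n) * X^T (p_hat - y)."""
    return X.T @ (sigmoid(X @ beta) - y) / len(y)

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 3))])
y = rng.integers(0, 2, size=50).astype(float)
beta = rng.normal(size=4)

# Central finite differences, one coordinate direction at a time.
eps = 1e-6
numeric = np.array([
    (bce(beta + eps * e, X, y) - bce(beta - eps * e, X, y)) / (2 * eps)
    for e in np.eye(4)
])
max_diff = np.max(np.abs(numeric - bce_grad(beta, X, y)))
print("largest disagreement:", max_diff)
```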

Optimization Algorithms

1. Gradient Descent:

$$\beta^{(t+1)} = \beta^{(t)} - \eta \nabla J(\beta^{(t)})$$

2. Newton-Raphson (Second-Order):

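$$\beta^{(t+1)} = \beta^{(t)} - H^{-1} \nabla J(\beta^{(t)})$$

where $H$ is the Hessian matrix. Convergence takes far fewer iterations, but each iteration costs $O(np^2)$ to form $H$ and $O(p^3)$ to solve the linear system, for $n$ samples and $p$ features.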

3. L-BFGS (Limited-memory BFGS):

Approximates the Hessian using limited memory
Default solver for LogisticRegression in scikit-learn
Good balance of speed and memory efficiency

4. Coordinate Descent (for regularized logistic regression):

Used when L1/L2 penalties are added
Solves one coordinate at a time
Very efficient for high-dimensional sparse data
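The first two update rules can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic data, not a production solver: the learning rate, step counts, and `true_beta` are arbitrary choices, the intercept is absorbed into `X` as a column of ones, and there is no regularization or convergence check.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient(beta, X, y):
    return X.T @ (sigmoid(X @ beta) - y) / len(y)

def hessian(beta, X):
    """(1/n) * X^T W X with W = diag(p * (1 - p))."""
    p = sigmoid(X @ beta)
    w = p * (1.0 - p)
    return (X * w[:, None]).T @ X / len(p)

def gradient_descent(X, y, lr=0.5, steps=500):
    beta = np.zeros(X.shape[1])
    for _ in range(steps):
        beta -= lr * gradient(beta, X, y)
    return beta

def newton(X, y, steps=10):
    beta = np.zeros(X.shape[1])
    for _ in range(steps):
        # Solve H d = g rather than forming the inverse explicitly.
        beta -= np.linalg.solve(hessian(beta, X), gradient(beta, X, y))
    return beta

# Synthetic data with noisy labels, so a finite optimum exists.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(500), rng.normal(size=(500, 2))])
true_beta = np.array([-0.5, 1.0, -2.0])
y = (rng.random(500) < sigmoid(X @ true_beta)).astype(float)

b_gd = gradient_descent(X, y)
b_nt = newton(X, y)
print("gradient descent (500 steps):", b_gd.round(3))
print("Newton (10 steps):           ", b_nt.round(3))
```

Both methods head for the same minimizer because the loss is convex; Newton gets there in far fewer iterations at a higher cost per iteration.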

Multiclass Extensions

One-vs-Rest (OvR) / One-vs-All

Trains $K$ binary classifiers, one for each class:

$$\text{Class } k \text{ vs. Not Class } k$$

Prediction: Choose the class with the highest probability.

Pros: Simple, parallelizable
Cons: The $K$ classifiers are trained independently, so their raw probabilities don't sum to 1; can be suboptimal when classes are not balanced
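The following sketch shows One-vs-Rest with scikit-learn's `OneVsRestClassifier` on the Iris dataset (the dataset choice is an assumption made for illustration). It reads the per-class probability from each fitted binary classifier to show that the raw scores do not sum to 1, then normalizes them by hand.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)

# One binary logistic regression per class.
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

# Column k holds P(class k vs. rest) from the k-th binary classifier.
raw = np.column_stack([est.predict_proba(X)[:, 1] for est in ovr.estimators_])
print("raw row sums (first 3):", raw.sum(axis=1)[:3])

# Prediction: pick the class whose classifier is most confident.
pred = ovr.classes_[raw.argmax(axis=1)]
print("training accuracy:", (pred == y).mean())

# Dividing by the row sum yields a proper distribution over classes.
normalized = raw / raw.sum(axis=1, keepdims=True)
```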

Multinomial Logistic Regression (Softmax)

Directly models $K$ classes using the softmax function:

$$P(y = k | x) = \frac{e^{\beta_k^T x}}{\sum_{j=1}^{K} e^{\beta_j^T x}}$$

Properties:

Probabilities sum to 1: $\sum_{k=1}^{K} P(y=k|x) = 1$
Requires $K$ weight vectors (or $K-1$ if one class is the reference)
Uses categorical cross-entropy loss:

$$J(\beta) = -\frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{K} y_{ik} \log(\hat{p}_{ik})$$
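A minimal NumPy sketch of softmax and categorical cross-entropy follows. The scores are hypothetical, labels are assumed to be integer class indices rather than one-hot vectors, and subtracting the row maximum is a standard trick that leaves the result unchanged while preventing overflow.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax; subtracting the row max avoids overflow."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

def categorical_cross_entropy(probs, labels):
    """Mean of -log p(true class); labels are integer class indices."""
    n = len(labels)
    return -np.mean(np.log(probs[np.arange(n), labels]))

scores = np.array([[2.0, 1.0, 0.1],
                   [1000.0, 1001.0, 999.0]])   # large scores stay finite
labels = np.array([0, 1])

probs = softmax(scores)
print("row sums:", probs.sum(axis=1))
print("loss:", categorical_cross_entropy(probs, labels))

# With K = 2 and the second class as reference (score 0),
# softmax reduces to the sigmoid of the first score.
z = 0.7
two_class = softmax(np.array([[z, 0.0]]))[0, 0]
print(two_class, 1.0 / (1.0 + np.exp(-z)))
```

The last check is the reason only $K-1$ weight vectors are needed when one class serves as the reference.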

💡

Production Tip: In practice, multinomial logistic regression (softmax) usually outperforms one-vs-rest. Meta uses softmax regression for their content recommendation ranking systems.

Code Implementation

Explanation of Code

Decision Boundary: Visualizes the linear decision boundary and shows the equation of the separating hyperplane.
Probability Output: Demonstrates how logistic regression outputs calibrated probabilities and how different thresholds affect predictions.
Multiclass Comparison: Compares One-vs-Rest vs Multinomial (Softmax) approaches, showing Softmax produces probabilities that sum to 1 across classes by construction.
Cost Function: Illustrates why BCE is preferred over MSE for classification, showing BCE penalizes confident wrong predictions more heavily.
Regularization: Shows how C (inverse of λ) controls model complexity in logistic regression.
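The regularization point can be illustrated with a small sketch. It assumes scikit-learn's `LogisticRegression` with its default L2 penalty and a synthetic dataset from `make_classification`; the three values of `C` are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=4, random_state=0)
X = StandardScaler().fit_transform(X)

# Smaller C means a stronger penalty (C is the inverse of lambda).
norms = {}
for C in (0.01, 1.0, 100.0):
    model = LogisticRegression(C=C, max_iter=1000).fit(X, y)
    norms[C] = float(np.linalg.norm(model.coef_))
    print(f"C={C:<6} ||coef||={norms[C]:.3f}")
```

The coefficient norm shrinks as `C` decreases, which is the sense in which `C` controls model complexity.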

Real-World Applications

Meta: Content Ranking

Meta uses logistic regression (and its neural network extensions) for:

News Feed Ranking: Predicting probability of user engagement
Ad Targeting: Estimating click-through rates (CTR)
Content Moderation: Classifying potentially harmful content

Microsoft: Spam Detection

Microsoft's email spam filters use multinomial logistic regression with:

TF-IDF features from email text
Header features (sender reputation, time sent)
Behavioral features (sender-recipient interaction history)

Industry Best Practices

Feature Scaling: Standardize features before fitting; L1/L2 penalties are not scale-invariant, and gradient-based solvers converge faster on scaled inputs
Class Imbalance: Use class_weight='balanced' or SMOTE
Multicollinearity: Check VIF values; use regularization if VIF > 10
Model Calibration: Use CalibratedClassifierCV if probabilities need to be well-calibrated
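Two of these practices, scaling and class weighting, can be combined in a scikit-learn pipeline. The sketch below uses a synthetic imbalanced dataset (roughly 5% positives, an arbitrary choice) and compares minority-class recall with and without `class_weight='balanced'`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Putting the scaler inside the pipeline means it is fitted on the
# training split only, which avoids leaking test statistics.
models = {
    "plain": make_pipeline(StandardScaler(),
                           LogisticRegression(max_iter=1000)),
    "balanced": make_pipeline(StandardScaler(),
                              LogisticRegression(max_iter=1000,
                                                 class_weight="balanced")),
}

recalls = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    recalls[name] = recall_score(y_te, model.predict(X_te))
    print(f"{name:>8}: minority recall = {recalls[name]:.2f}")
```

Class weighting typically raises recall on the minority class at the cost of more false positives, so precision should be checked alongside it.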

💡

Meta Interview Tip: Be prepared to discuss how logistic regression scales to billions of samples. Mention techniques like stochastic gradient descent, feature hashing, and parameter servers.
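Two of the techniques named in the tip, stochastic gradient descent and feature hashing, fit together naturally. The sketch below is a toy single-machine illustration: the bucket count, learning rate, and the click-style event stream are all hypothetical, and a real system would add regularization, adaptive learning rates, and distributed parameter storage.

```python
import math
import zlib

N_BUCKETS = 2 ** 18   # hypothetical size of the hashed feature space

def hashed_indices(tokens):
    """Map string features to bucket indices (the hashing trick)."""
    # crc32 is deterministic across runs, unlike Python's built-in hash().
    return [zlib.crc32(t.encode("utf-8")) % N_BUCKETS for t in tokens]

def predict(weights, idx):
    z = sum(weights[i] for i in idx)
    z = max(-30.0, min(30.0, z))          # clip to keep exp() well-behaved
    return 1.0 / (1.0 + math.exp(-z))

def sgd_step(weights, tokens, label, lr=0.1):
    """One online update: w_j -= lr * (p - y) for each active feature."""
    idx = hashed_indices(tokens)
    error = predict(weights, idx) - label
    for i in idx:
        weights[i] -= lr * error
    return error

# Hypothetical event stream: (active binary features, clicked or not).
stream = [
    (["user=42", "ad=shoes", "hour=20"], 1),
    (["user=42", "ad=insurance", "hour=9"], 0),
    (["user=7", "ad=shoes", "hour=21"], 1),
    (["user=7", "ad=insurance", "hour=8"], 0),
] * 50

weights = [0.0] * N_BUCKETS
for tokens, label in stream:
    sgd_step(weights, tokens, label)

# A previously unseen user still gets a sensible score from shared features.
p_shoes = predict(weights, hashed_indices(["user=99", "ad=shoes", "hour=20"]))
p_insurance = predict(weights, hashed_indices(["user=99", "ad=insurance", "hour=9"]))
print(p_shoes, p_insurance)
```

Each update touches only the weights of the active features, so the cost per example is independent of the total number of features, and the weight vector has a fixed size no matter how many distinct feature strings appear.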

Common Follow-Up Questions

Q1: Why is the sigmoid function preferred over other activation functions for binary classification?

The sigmoid function is preferred because:

It's the inverse of the logit function, giving a natural probabilistic interpretation
Its derivative $\sigma'(z) = \sigma(z)(1-\sigma(z))$ is easy to compute
It arises naturally as the inverse canonical link when the Bernoulli distribution is written in exponential-family form
It ensures outputs are in $(0, 1)$

Q2: How do you handle multiclass problems where classes are not mutually exclusive?

Use One-vs-Rest (OvR) instead of Softmax. Softmax assumes mutual exclusivity (probabilities sum to 1). For multi-label classification, train independent binary classifiers for each class.
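The difference shows up clearly on a single example. In the sketch below the three scores are hypothetical per-label logits for one item that plausibly carries two labels at once.

```python
import numpy as np

scores = np.array([2.0, 1.5, -3.0])   # hypothetical per-label scores

# Softmax forces the labels to compete for a total probability of 1.
softmax = np.exp(scores - scores.max())
softmax /= softmax.sum()

# Independent sigmoids score each label on its own.
sigmoids = 1.0 / (1.0 + np.exp(-scores))

print("softmax :", softmax.round(3))
print("sigmoids:", sigmoids.round(3))
print("labels above 0.5 (sigmoid):", np.flatnonzero(sigmoids > 0.5))
print("labels above 0.5 (softmax):", np.flatnonzero(softmax > 0.5))
```

With independent sigmoids both of the first two labels clear the 0.5 threshold, whereas softmax lets only one of them do so.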

Q3: What is the connection between logistic regression and neural networks?

Logistic regression is equivalent to a single-layer neural network with:

One output neuron
Sigmoid activation function
Binary cross-entropy loss

Deep learning extends this by adding hidden layers and non-linear activations.

Q4: How do you detect and handle class imbalance in logistic regression?

Resampling: SMOTE (oversampling minority) or undersampling majority
Class weights: Set class_weight='balanced' in sklearn
Threshold tuning: Lower threshold to increase recall
Metrics: Use F1, AUC-PR instead of accuracy
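Threshold tuning can be sketched without any model at all. The labels and predicted probabilities below are hypothetical validation-set values chosen to show the precision/recall trade-off as the threshold moves.

```python
import numpy as np

def precision_recall(y_true, p_hat, threshold):
    """Precision and recall for the positive class at a given threshold."""
    pred = p_hat >= threshold
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    fn = np.sum(~pred & (y_true == 1))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return float(precision), float(recall)

# Hypothetical scores on an imbalanced validation set (3 positives of 10).
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
p_hat = np.array([0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.55,
                  0.25, 0.45, 0.80])

results = {}
for t in (0.5, 0.4, 0.2):
    results[t] = precision_recall(y_true, p_hat, t)
    print(f"threshold={t}: precision={results[t][0]:.2f}, "
          f"recall={results[t][1]:.2f}")
```

Lowering the threshold catches more positives but eventually admits many false positives, which is why the threshold is usually chosen on a validation set against a metric such as F1.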

Company-Specific Tips

Meta Interview Tips

Discuss online learning variants for streaming data
Be ready to explain probability calibration techniques
Mention A/B testing frameworks for model comparison
Talk about feature importance in high-dimensional sparse data

Microsoft Interview Tips

Focus on interpretability requirements in regulated industries
Discuss model serving at scale (batch vs real-time)
Be prepared to explain regularization choices in production
Mention monitoring for model drift in live systems
