# Logistic Regression
Logistic regression predicts class probabilities for binary classification problems. Despite its name, it is a classification algorithm, not a regression method, and it serves as a standard baseline for binary classification.
## Model Definition
The probability of the positive class is computed as:
P(y=1|x) = sigmoid(w·x + b) = 1 / (1 + e^(-(w·x + b)))
The sigmoid function maps any real value to the interval (0, 1).
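For concreteness, here is a minimal NumPy sketch of the sigmoid and the resulting class probability (the weights, bias, and input values are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    """Map any real value to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative weights, bias, and input (not from the text)
w = np.array([0.8, -0.4])
b = 0.1
x = np.array([2.0, 1.5])

p = sigmoid(np.dot(w, x) + b)  # P(y=1|x)
print(p)  # ~0.75
```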
## Decision Boundary
Prediction: class 1 if P(y=1|x) > threshold (default: 0.5), otherwise class 0.
Equivalently, when w·x + b > 0 the sigmoid output exceeds 0.5, so the model predicts class 1 at the default threshold.
The decision boundary is a hyperplane in feature space.
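A short sketch of the decision rule, reusing the `sigmoid` helper from above (the function name `predict` is illustrative):

```python
def predict(X, w, b, threshold=0.5):
    """Predict class 1 where P(y=1|x) exceeds the threshold, else class 0."""
    probs = sigmoid(X @ w + b)  # X has shape (n_samples, n_features)
    return (probs > threshold).astype(int)
```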
## Training
### Loss Function: Binary Cross-Entropy
Loss = -(1/n) * Sum[y_i * log(y_hat_i) + (1 - y_i) * log(1 - y_hat_i)]
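A direct NumPy translation of this loss; clipping predictions away from 0 and 1 is a common numerical safeguard, not part of the formula itself:

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Average negative log-likelihood of the true labels."""
    y_pred = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
```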
### Gradient Descent
Training logistic regression with gradient descent (a runnable sketch follows this list):
- Initialize weights w and bias b to zero
- For each iteration:
  - Compute the linear combination: z = X·w + b
  - Apply the sigmoid to get predictions: y_hat = 1 / (1 + e^(-z))
  - Compute the weight gradient: dw = X^T · (y_hat - y) / n
  - Compute the bias gradient: db = mean(y_hat - y)
  - Update the weights: w = w - learning_rate * dw
  - Update the bias: b = b - learning_rate * db
- Return the learned weights and bias
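Putting these steps together, a minimal sketch of the training loop, reusing the `sigmoid` helper from earlier (the function name and hyperparameter defaults are illustrative):

```python
def fit_logistic(X, y, learning_rate=0.1, n_iterations=1000):
    """Train logistic regression with full-batch gradient descent."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)  # initialize weights to zero
    b = 0.0                   # initialize bias to zero
    for _ in range(n_iterations):
        z = X @ w + b                        # linear combination
        y_hat = sigmoid(z)                   # predicted probabilities
        dw = X.T @ (y_hat - y) / n_samples   # weight gradient
        db = np.mean(y_hat - y)              # bias gradient
        w -= learning_rate * dw              # update weights
        b -= learning_rate * db              # update bias
    return w, b
```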
## Regularization
L1 and L2 penalties apply as in linear regression:
Loss = BCE + lambda * penalty(w)
Where penalty(w) is either the L1 norm of the weights (sum of absolute values) or the squared L2 norm (sum of squared values). The bias is conventionally left unpenalized.
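If the L2 penalty is written with the common (lambda/2) * ||w||^2 scaling, its gradient is simply lambda * w, so the training loop above only needs one extra term; a sketch under that assumption (`lam` is used to avoid shadowing Python's built-in `lambda`):

```python
def fit_logistic_l2(X, y, learning_rate=0.1, n_iterations=1000, lam=0.01):
    """Gradient descent with a squared-L2 penalty on the weights (bias unpenalized)."""
    n_samples, n_features = X.shape
    w, b = np.zeros(n_features), 0.0
    for _ in range(n_iterations):
        y_hat = sigmoid(X @ w + b)
        dw = X.T @ (y_hat - y) / n_samples + lam * w  # extra term from the penalty
        db = np.mean(y_hat - y)
        w -= learning_rate * dw
        b -= learning_rate * db
    return w, b
```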
## Multi-class Classification
### One-vs-Rest (OvR)
Train K binary classifiers, one per class, each separating that class from all the others. Predict the class whose classifier assigns the highest probability.
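If scikit-learn is available, OvR can be composed explicitly; a sketch assuming a training set `X_train`, `y_train` and test features `X_test` (all hypothetical names):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

ovr = OneVsRestClassifier(LogisticRegression())
ovr.fit(X_train, y_train)           # fits one binary classifier per class
probs = ovr.predict_proba(X_test)   # one probability column per class
preds = probs.argmax(axis=1)        # class with the highest probability
```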
### Softmax (Multinomial)
P(y=k|x) = e^(w_k·x) / Sum_j e^(w_j·x)
This normalizes the outputs across all classes so probabilities sum to 1.
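A NumPy sketch of the softmax; subtracting the per-row maximum before exponentiating is a standard numerical-stability trick that does not change the result:

```python
import numpy as np

def softmax(scores):
    """Normalize class scores of shape (n_samples, n_classes) into probabilities."""
    shifted = scores - scores.max(axis=1, keepdims=True)  # avoid overflow in exp
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

# With a weight matrix W of shape (n_classes, n_features):
# probs = softmax(X @ W.T)  # each row sums to 1
```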
## Evaluation Metrics
| Metric | Use Case |
|---|---|
| Accuracy | Balanced classes |
| Precision | High false positive cost |
| Recall | High false negative cost |
| F1 Score | Balance precision and recall |
| AUC-ROC | Overall discrimination quality |
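All of these are available in scikit-learn's `sklearn.metrics` module; a sketch assuming arrays `y_true`, `y_pred` (hard 0/1 labels), and `y_prob` (predicted probabilities for class 1):

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
print("AUC-ROC  :", roc_auc_score(y_true, y_prob))  # needs probabilities, not labels
```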
## Reference
| Topic | Description |
|---|---|
| Sigmoid function purpose | Maps real numbers to (0, 1) for probability interpretation. Has convenient mathematical properties for optimization. |
| Difference from linear regression | Linear regression predicts unbounded continuous values; logistic regression predicts probabilities bounded in (0, 1). It is trained with log loss (binary cross-entropy) instead of MSE. |
| Imbalanced classes | Use class weights, threshold tuning, or resampling. Evaluate with precision-recall metrics, not accuracy. |
| Coefficient interpretation | A one-unit increase in feature X changes the log-odds by the coefficient value, holding other features fixed. Exponentiate the coefficient to get an odds ratio. |
| Failure cases | Complex non-linear decision boundaries, high-dimensional sparse data without regularization, and feature interactions that are not explicitly engineered. |