Linear Regression
Linear regression predicts a continuous target variable as a linear combination of input features. It serves as a standard baseline for regression problems.
Model Definition
y = w_0 + w_1*x_1 + w_2*x_2 + ... + w_n*x_n + e
Matrix form: y = Xw + e, where X includes a leading column of ones so that w_0 acts as the intercept.
| Symbol | Meaning |
|---|---|
| y | Target variable |
| X | Feature matrix |
| w | Weight parameters |
| e | Error term |
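As a minimal sketch of the matrix form, assuming NumPy and made-up numbers, the prediction is a single matrix-vector product once an intercept column is prepended to X:

```python
import numpy as np

# Hypothetical data: 4 observations, 2 features.
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.5],
              [4.0, 3.0]])

# Prepend a column of ones so w[0] acts as the intercept w_0.
X_design = np.hstack([np.ones((X.shape[0], 1)), X])

# Assumed weights [w_0, w_1, w_2], chosen only for illustration.
w = np.array([0.5, 1.2, -0.7])

# y_hat = Xw: the linear model's prediction for each observation.
y_hat = X_design @ w
print(y_hat)
```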
Model Assumptions
| Assumption | Description |
|---|---|
| Linearity | Relationship between X and y is linear |
| Independence | Observations are independent |
| Homoscedasticity | Constant variance of residuals |
| Normality | Residuals are normally distributed |
| No multicollinearity | Features are not highly correlated |
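These assumptions are usually probed through residual diagnostics after fitting. The sketch below is one rough way to do that, assuming NumPy/SciPy and made-up data; it is illustrative, not a replacement for residual plots.

```python
import numpy as np
from scipy import stats

def residual_checks(y, y_hat):
    """Rough residual diagnostics for an already-fitted linear model."""
    resid = y - y_hat

    # Normality: Shapiro-Wilk test on the residuals (small samples).
    _, shapiro_p = stats.shapiro(resid)

    # Homoscedasticity (rough check): compare residual variance in the
    # lower and upper halves of the predicted values.
    order = np.argsort(y_hat)
    half = len(resid) // 2
    var_ratio = resid[order[half:]].var() / resid[order[:half]].var()

    return {"shapiro_p": shapiro_p, "var_ratio": var_ratio}

# Toy usage with made-up predictions and targets.
rng = np.random.default_rng(0)
y_hat = rng.uniform(0, 10, size=40)
y = y_hat + rng.normal(scale=0.5, size=40)
print(residual_checks(y, y_hat))
```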
Training Methods
Ordinary Least Squares (OLS)
Minimizes the sum of squared residuals:
Loss = Sum((y_i - y_hat_i)^2)
Closed-form solution: w = (X^T * X)^(-1) * X^T * y, valid when X^T * X is invertible (no perfectly collinear features).
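A minimal NumPy sketch of the closed-form fit, under the assumption that X already contains an intercept column and that X^T X is well conditioned:

```python
import numpy as np

def ols_fit(X, y):
    """Solve the normal equations (X^T X) w = X^T y for w."""
    # np.linalg.solve avoids forming an explicit inverse; it still
    # assumes X^T X is non-singular (no perfect multicollinearity).
    return np.linalg.solve(X.T @ X, X.T @ y)

# Toy data: y = 1 + 2*x plus a little noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
X = np.column_stack([np.ones_like(x), x])  # intercept column + feature
y = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=50)

print(ols_fit(X, y))  # approximately [1.0, 2.0]
```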
Gradient Descent
Gradient descent iteratively updates weights to minimize the loss:
- Initialize weights to zero
- For each iteration:
  - Compute predictions: y_hat = X * w
  - Calculate gradient: gradient = -2 * X^T * (y - y_hat) / n
  - Update weights: w = w - learning_rate * gradient
- Repeat until convergence or maximum iterations reached
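The loop above might look like the following NumPy sketch; the learning rate, iteration cap, and tolerance are illustrative defaults, not tuned values:

```python
import numpy as np

def gd_fit(X, y, learning_rate=0.01, max_iter=10_000, tol=1e-8):
    """Gradient descent on the mean squared error loss."""
    n = X.shape[0]
    w = np.zeros(X.shape[1])            # initialize weights to zero
    for _ in range(max_iter):
        y_hat = X @ w                   # predictions
        gradient = -2 * X.T @ (y - y_hat) / n
        step = learning_rate * gradient
        w = w - step                    # update weights
        if np.linalg.norm(step) < tol:  # stop once updates become tiny
            break
    return w

# Toy data: y = 1 + 2*x plus noise, with an explicit intercept column.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=50)
print(gd_fit(X, y))  # close to the closed-form solution above
```

With well-scaled features and a sufficiently small learning rate, this converges to the same weights as the closed-form solution.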
Regularization
Ridge Regression (L2)
Adds penalty on sum of squared weights:
Loss = Sum((y_i - y_hat_i)^2) + lambda * Sum(w_j^2)
Effect: Shrinks weights toward zero without typically making them exactly zero; stabilizes estimates when features are correlated.
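Because the L2 penalty keeps the loss quadratic, ridge also has a closed-form solution, w = (X^T * X + lambda * I)^(-1) * X^T * y. A minimal sketch, assuming the first column of X is the intercept (which is conventionally left unpenalized):

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge: w = (X^T X + lam * I)^(-1) X^T y."""
    penalty = lam * np.eye(X.shape[1])
    penalty[0, 0] = 0.0  # conventionally, the intercept is not penalized
    return np.linalg.solve(X.T @ X + penalty, X.T @ y)
```

With lam > 0, the added penalty keeps the system solvable even when the features themselves are perfectly correlated, which is why ridge is a standard remedy for multicollinearity.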
Lasso Regression (L1)
Adds penalty on sum of absolute weights:
Loss = Sum((y_i - y_hat_i)^2) + lambda * Sum(|w_j|)
Effect: Can drive some weights exactly to zero, which performs implicit feature selection.
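The L1 penalty has no closed-form solution, so lasso is usually fit iteratively (e.g. by coordinate descent). A minimal scikit-learn sketch with made-up data; note that scikit-learn's alpha plays the role of lambda and its loss scaling differs slightly from the formula above:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
# Only the first two features matter in this toy target.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

model = Lasso(alpha=0.1).fit(X, y)
print(model.coef_)  # weights for the irrelevant features are driven to (near) zero
```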
Elastic Net
Combines L1 and L2 penalties:
Loss = Sum((y_i - y_hat_i)^2) + lambda_1 * Sum(|w_j|) + lambda_2 * Sum(w_j^2)
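Elastic net is useful when features are correlated and sparsity is still desired. In scikit-learn the two penalties are expressed through a single alpha (overall strength) and l1_ratio (the L1/L2 mix) rather than separate lambda_1 and lambda_2; a minimal sketch with made-up data:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

# alpha controls the total penalty strength; l1_ratio splits it between
# the L1 part (sparsity) and the L2 part (shrinkage).
model = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(model.coef_)
```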
Evaluation Metrics
| Metric | Formula | Interpretation |
|---|---|---|
| MSE | Mean((y - y_hat)^2) | Lower indicates better fit |
| RMSE | sqrt(MSE) | Same units as target variable |
| MAE | Mean(|y - y_hat|) | Less sensitive to outliers than MSE |
| R^2 | 1 - SS_res/SS_tot | Proportion of variance explained |
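These metrics are short enough to compute directly; a NumPy sketch with made-up numbers:

```python
import numpy as np

def regression_metrics(y, y_hat):
    """MSE, RMSE, MAE, and R^2 for a set of predictions."""
    resid = y - y_hat
    mse = np.mean(resid ** 2)
    rmse = np.sqrt(mse)
    mae = np.mean(np.abs(resid))
    ss_res = np.sum(resid ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return {"MSE": mse, "RMSE": rmse, "MAE": mae,
            "R2": 1.0 - ss_res / ss_tot}

# Toy usage with made-up values.
y = np.array([3.0, 5.0, 7.5, 10.0])
y_hat = np.array([2.8, 5.4, 7.0, 9.5])
print(regression_metrics(y, y_hat))
```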
Reference
| Topic | Description |
|---|---|
| Model assumptions | Linearity, independence, homoscedasticity, normal errors, no multicollinearity. Different assumptions matter for prediction versus inference. |
| Ridge vs Lasso | Ridge: use when all features may be relevant and are correlated. Lasso: use when feature selection is desired. |
| Multicollinearity handling | Ridge regression, drop correlated features, or PCA. VIF (variance inflation factor) detects multicollinearity; see the sketch after this table. |
| Non-linear relationships | Add polynomial features, interaction terms, or use non-linear models. |
| Coefficient interpretation | A one-unit increase in feature X changes the prediction by the coefficient value, holding other features constant. Interpretation requires attention to scaling and multicollinearity. |
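As a sketch of the VIF check mentioned in the table, each feature's VIF is 1 / (1 - R_j^2), where R_j^2 comes from regressing that feature on all the others; the data here is made up:

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of a feature matrix X."""
    n, p = X.shape
    vifs = np.empty(p)
    for j in range(p):
        # Regress feature j on the remaining features (with an intercept)
        # and convert the R^2 of that auxiliary regression into a VIF.
        design = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        w, *_ = np.linalg.lstsq(design, X[:, j], rcond=None)
        resid = X[:, j] - design @ w
        r2 = 1.0 - resid @ resid / np.sum((X[:, j] - X[:, j].mean()) ** 2)
        vifs[j] = 1.0 / (1.0 - r2)
    return vifs

# Toy example: the third feature is nearly a copy of the first.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[:, 2] = X[:, 0] + rng.normal(scale=0.05, size=200)
print(vif(X))  # columns 0 and 2 should show large VIFs
```

VIF values above roughly 5 to 10 are a common rule of thumb for problematic multicollinearity; statsmodels also provides a variance_inflation_factor helper for the same computation.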