Linear Regression
Linear regression predicts a continuous target variable as a linear combination of input features. It serves as a standard baseline for regression problems.
Model Definition
y = w_0 + w_1*x_1 + w_2*x_2 + ... + w_n*x_n + e
Matrix form: y = Xw + e, where X includes a leading column of ones so that w_0 acts as the intercept.
| Symbol | Meaning |
|---|---|
| y | Target variable |
| X | Feature matrix |
| w | Weight parameters |
| e | Error term |
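As a minimal sketch of the matrix form, assuming NumPy and made-up numbers, the prediction is a single matrix-vector product once an intercept column is prepended to X:

```python
import numpy as np

# Hypothetical data: 4 observations, 2 features.
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.5],
              [4.0, 3.0]])

# Prepend a column of ones so w[0] acts as the intercept w_0.
X_design = np.hstack([np.ones((X.shape[0], 1)), X])

# Assumed weights [w_0, w_1, w_2], chosen only for illustration.
w = np.array([0.5, 1.2, -0.7])

# y_hat = Xw: the linear model's prediction for each observation.
y_hat = X_design @ w
print(y_hat)
```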
Model Assumptions
| Assumption | Description |
|---|---|
| Linearity | Relationship between X and y is linear |
| Independence | Observations are independent |
| Homoscedasticity | Constant variance of residuals |
| Normality | Residuals are normally distributed |
| No multicollinearity | Features are not highly correlated |
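These assumptions are usually probed through residual diagnostics after fitting. The sketch below is one rough way to do that, assuming NumPy/SciPy and made-up data; it is illustrative, not a replacement for residual plots.

```python
import numpy as np
from scipy import stats

def residual_checks(y, y_hat):
    """Rough residual diagnostics for an already-fitted linear model."""
    resid = y - y_hat

    # Normality: Shapiro-Wilk test on the residuals (small samples).
    _, shapiro_p = stats.shapiro(resid)

    # Homoscedasticity (rough check): compare residual variance in the
    # lower and upper halves of the predicted values.
    order = np.argsort(y_hat)
    half = len(resid) // 2
    var_ratio = resid[order[half:]].var() / resid[order[:half]].var()

    return {"shapiro_p": shapiro_p, "var_ratio": var_ratio}

# Toy usage with made-up predictions and targets.
rng = np.random.default_rng(0)
y_hat = rng.uniform(0, 10, size=40)
y = y_hat + rng.normal(scale=0.5, size=40)
print(residual_checks(y, y_hat))
```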
Training Methods
Ordinary Least Squares (OLS)
Minimizes the sum of squared residuals:
Loss = Sum((y_i - y_hat_i)^2)
Closed-form solution: w = (X^T * X)^(-1) * X^T * y, valid when X^T * X is invertible (no perfectly collinear features).
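A minimal NumPy sketch of the closed-form fit, under the assumption that X already contains an intercept column and that X^T X is well conditioned:

```python
import numpy as np

def ols_fit(X, y):
    """Solve the normal equations (X^T X) w = X^T y for w."""
    # np.linalg.solve avoids forming an explicit inverse; it still
    # assumes X^T X is non-singular (no perfect multicollinearity).
    return np.linalg.solve(X.T @ X, X.T @ y)

# Toy data: y = 1 + 2*x plus a little noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
X = np.column_stack([np.ones_like(x), x])  # intercept column + feature
y = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=50)

print(ols_fit(X, y))  # approximately [1.0, 2.0]
```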
Gradient Descent
Gradient descent iteratively updates weights to minimize the loss:
- Initialize weights to zero
- For each iteration:
  - Compute predictions: y_hat = X * w
  - Calculate gradient: gradient = -2 * X^T * (y - y_hat) / n
  - Update weights: w = w - learning_rate * gradient
- Repeat until convergence or maximum iterations reached
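The loop above might look like the following NumPy sketch; the learning rate, iteration cap, and tolerance are illustrative defaults, not tuned values:

```python
import numpy as np

def gd_fit(X, y, learning_rate=0.01, max_iter=10_000, tol=1e-8):
    """Gradient descent on the mean squared error loss."""
    n = X.shape[0]
    w = np.zeros(X.shape[1])            # initialize weights to zero
    for _ in range(max_iter):
        y_hat = X @ w                   # predictions
        gradient = -2 * X.T @ (y - y_hat) / n
        step = learning_rate * gradient
        w = w - step                    # update weights
        if np.linalg.norm(step) < tol:  # stop once updates become tiny
            break
    return w

# Toy data: y = 1 + 2*x plus noise, with an explicit intercept column.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=50)
print(gd_fit(X, y))  # close to the closed-form solution above
```

With well-scaled features and a sufficiently small learning rate, this converges to the same weights as the closed-form solution.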
Regularization
Ridge Regression (L2)
Adds penalty on sum of squared weights:
Loss = Sum((y_i - y_hat_i)^2) + lambda * Sum(w_j^2)
Effect: Shrinks weights toward zero without typically making them exactly zero; stabilizes estimates when features are correlated.
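Because the L2 penalty keeps the loss quadratic, ridge also has a closed-form solution, w = (X^T * X + lambda * I)^(-1) * X^T * y. A minimal sketch, assuming the first column of X is the intercept (which is conventionally left unpenalized):

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge: w = (X^T X + lam * I)^(-1) X^T y."""
    penalty = lam * np.eye(X.shape[1])
    penalty[0, 0] = 0.0  # conventionally, the intercept is not penalized
    return np.linalg.solve(X.T @ X + penalty, X.T @ y)
```

With lam > 0, the added penalty keeps the system solvable even when the features themselves are perfectly correlated, which is why ridge is a standard remedy for multicollinearity.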
Lasso Regression (L1)
Adds penalty on sum of absolute weights:
Loss = Sum((y_i - y_hat_i)^2) + lambda * Sum(|w_j|)
Effect: Can drive some weights exactly to zero, which performs implicit feature selection.
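The L1 penalty has no closed-form solution, so lasso is usually fit iteratively (e.g. by coordinate descent). A minimal scikit-learn sketch with made-up data; note that scikit-learn's alpha plays the role of lambda and its loss scaling differs slightly from the formula above:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
# Only the first two features matter in this toy target.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

model = Lasso(alpha=0.1).fit(X, y)
print(model.coef_)  # weights for the irrelevant features are driven to (near) zero
```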
Elastic Net
Combines L1 and L2 penalties:
Loss = Sum((y_i - y_hat_i)^2) + lambda_1 * Sum(|w_j|) + lambda_2 * Sum(w_j^2)
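Elastic net is useful when features are correlated and sparsity is still desired. In scikit-learn the two penalties are expressed through a single alpha (overall strength) and l1_ratio (the L1/L2 mix) rather than separate lambda_1 and lambda_2; a minimal sketch with made-up data:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

# alpha controls the total penalty strength; l1_ratio splits it between
# the L1 part (sparsity) and the L2 part (shrinkage).
model = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(model.coef_)
```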
Evaluation Metrics
| Metric | Formula | Interpretation |
|---|---|---|
| MSE | Mean((y - y_hat)^2) | Lower indicates better fit |
| RMSE | sqrt(MSE) | Same units as target variable |
| MAE | Mean(|y - y_hat|) | Less sensitive to outliers than MSE |
| R^2 | 1 - SS_res/SS_tot | Proportion of variance explained |
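These metrics are short enough to compute directly; a NumPy sketch with made-up numbers:

```python
import numpy as np

def regression_metrics(y, y_hat):
    """MSE, RMSE, MAE, and R^2 for a set of predictions."""
    resid = y - y_hat
    mse = np.mean(resid ** 2)
    rmse = np.sqrt(mse)
    mae = np.mean(np.abs(resid))
    ss_res = np.sum(resid ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return {"MSE": mse, "RMSE": rmse, "MAE": mae,
            "R2": 1.0 - ss_res / ss_tot}

# Toy usage with made-up values.
y = np.array([3.0, 5.0, 7.5, 10.0])
y_hat = np.array([2.8, 5.4, 7.0, 9.5])
print(regression_metrics(y, y_hat))
```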
Reference
| Topic | Description |
|---|---|
| Model assumptions | Linearity, independence, homoscedasticity, normal errors, no multicollinearity. Different assumptions matter for prediction versus inference. |
| Ridge vs Lasso | Ridge: use when all features may be relevant and are correlated. Lasso: use when feature selection is desired. |
| Multicollinearity handling | Ridge regression, drop correlated features, or PCA. VIF (variance inflation factor) detects multicollinearity; see the sketch after this table. |
| Non-linear relationships | Add polynomial features, interaction terms, or use non-linear models. |
| Coefficient interpretation | A one-unit increase in feature X changes the prediction by the coefficient value, holding other features constant. Interpretation requires attention to scaling and multicollinearity. |
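As a sketch of the VIF check mentioned in the table, each feature's VIF is 1 / (1 - R_j^2), where R_j^2 comes from regressing that feature on all the others; the data here is made up:

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of a feature matrix X."""
    n, p = X.shape
    vifs = np.empty(p)
    for j in range(p):
        # Regress feature j on the remaining features (with an intercept)
        # and convert the R^2 of that auxiliary regression into a VIF.
        design = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        w, *_ = np.linalg.lstsq(design, X[:, j], rcond=None)
        resid = X[:, j] - design @ w
        r2 = 1.0 - resid @ resid / np.sum((X[:, j] - X[:, j].mean()) ** 2)
        vifs[j] = 1.0 / (1.0 - r2)
    return vifs

# Toy example: the third feature is nearly a copy of the first.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[:, 2] = X[:, 0] + rng.normal(scale=0.05, size=200)
print(vif(X))  # columns 0 and 2 should show large VIFs
```

VIF values above roughly 5 to 10 are a common rule of thumb for problematic multicollinearity; statsmodels also provides a variance_inflation_factor helper for the same computation.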