Linear Regression

Linear regression predicts a continuous target variable as a linear combination of input features. It serves as a standard baseline for regression problems.

Model Definition

y = w_0 + w_1*x_1 + w_2*x_2 + ... + w_n*x_n + e

Matrix form: y = Xw + e

Symbol | Meaning
y      | Target variable
X      | Feature matrix
w      | Weight parameters
e      | Error term
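
As a concrete illustration of the matrix form, here is a minimal NumPy sketch with a small arbitrary dataset and hand-picked weights (all values are illustrative, not fitted):

```python
import numpy as np

# Illustrative data: 4 samples, 2 features (values chosen arbitrarily).
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.5],
              [4.0, 3.0]])

w = np.array([0.8, -0.3])   # weights w_1, w_2
w0 = 1.5                    # intercept w_0

# Matrix form of the model: y_hat = w_0 + X w
y_hat = w0 + X @ w
print(y_hat)
```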

Model Assumptions

Assumption           | Description
Linearity            | Relationship between X and y is linear
Independence         | Observations are independent
Homoscedasticity     | Constant variance of residuals
Normality            | Residuals are normally distributed
No multicollinearity | Features are not highly correlated

Training Methods

Ordinary Least Squares (OLS)

Minimizes the sum of squared residuals:

Loss = Sum((y_i - y_hat_i)^2)

Closed-form solution: w = (X^T * X)^(-1) * X^T * y
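
A minimal NumPy sketch of the closed-form solution on synthetic data; the intercept is handled by prepending a column of ones to X, and np.linalg.solve / np.linalg.lstsq are used rather than forming the inverse explicitly:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3

# Synthetic data: y = Xw + noise (true weights chosen for illustration).
X = np.column_stack([np.ones(n), rng.normal(size=(n, d))])  # intercept column first
true_w = np.array([1.5, 0.8, -0.3, 2.0])
y = X @ true_w + rng.normal(scale=0.1, size=n)

# Normal equations: w = (X^T X)^(-1) X^T y
w_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Equivalent result, but more numerically robust in practice:
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w_closed)
print(w_lstsq)
```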

Gradient Descent

Gradient descent iteratively updates weights to minimize the loss:

  1. Initialize weights to zero
  2. For each iteration:
    • Compute predictions: y_hat = X * w
    • Calculate gradient: gradient = -2 * X^T * (y - y_hat) / n
    • Update weights: w = w - learning_rate * gradient
  3. Repeat until convergence or maximum iterations reached
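
A minimal NumPy sketch of these steps, assuming the mean squared error loss (hence the division by n in the gradient) and illustrative hyperparameter values:

```python
import numpy as np

def gradient_descent(X, y, learning_rate=0.01, max_iters=1000, tol=1e-8):
    """Fit linear regression weights by batch gradient descent on MSE."""
    n, d = X.shape
    w = np.zeros(d)                          # step 1: initialize weights to zero
    for _ in range(max_iters):               # step 2: iterate
        y_hat = X @ w                        # predictions
        gradient = -2.0 * X.T @ (y - y_hat) / n
        w_new = w - learning_rate * gradient
        if np.linalg.norm(w_new - w) < tol:  # step 3: stop at convergence
            return w_new
        w = w_new
    return w

# Usage on the same kind of synthetic setup as the closed-form example.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
y = X @ np.array([1.5, 0.8, -0.3]) + rng.normal(scale=0.1, size=200)
print(gradient_descent(X, y, learning_rate=0.05))
```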

Regularization

Ridge Regression (L2)

Adds a penalty on the sum of squared weights:

Loss = Sum((y_i - y_hat_i)^2) + lambda * Sum(w_j^2)

Effect: shrinks weights toward zero and mitigates multicollinearity.
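
Ridge also has a closed-form solution, w = (X^T X + lambda*I)^(-1) X^T y. A minimal NumPy sketch, assuming standardized features and a centered target so no separate (unpenalized) intercept term is needed:

```python
import numpy as np

def ridge_closed_form(X, y, lam=1.0):
    """Ridge weights via (X^T X + lambda I)^(-1) X^T y.

    Assumes features are standardized and y is centered, so the
    intercept can be omitted.
    """
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
X[:, 2] = X[:, 0] + 0.01 * rng.normal(size=100)   # nearly collinear feature
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

print(ridge_closed_form(X, y, lam=0.0))   # OLS: unstable under collinearity
print(ridge_closed_form(X, y, lam=1.0))   # shrunk, better-conditioned weights
```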

Lasso Regression (L1)

Adds a penalty on the sum of absolute weights:

Loss = Sum((y_i - y_hat_i)^2) + lambda * Sum(|w_j|)

Effect: Can zero out weights, enabling feature selection.
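
The absolute-value penalty has no closed-form minimizer, so lasso is typically fit iteratively (for example by coordinate descent). A minimal sketch using scikit-learn's Lasso on synthetic data; note that scikit-learn names the regularization strength alpha rather than lambda, and the value used here is arbitrary:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 10))
# Only the first two features actually matter in this synthetic target.
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=200)

model = Lasso(alpha=0.1)   # alpha plays the role of lambda above
model.fit(X, y)

# Irrelevant features tend to receive exactly-zero coefficients.
print(model.coef_)
```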

Elastic Net

Combines L1 and L2 penalties:

Loss = Sum((y_i - y_hat_i)^2) + lambda_1 * Sum(|w_j|) + lambda_2 * Sum(w_j^2)
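
A minimal sketch using scikit-learn's ElasticNet on the same kind of synthetic data; scikit-learn expresses the two penalties through a single strength alpha and a mixing weight l1_ratio rather than separate lambda_1 and lambda_2 (values here are illustrative):

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=200)

# alpha controls overall penalty strength, l1_ratio the L1/L2 mix.
model = ElasticNet(alpha=0.1, l1_ratio=0.5)
model.fit(X, y)
print(model.coef_)
```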

Evaluation Metrics

Metric | Formula              | Interpretation
MSE    | Mean((y - y_hat)^2)  | Lower indicates better fit
RMSE   | sqrt(MSE)            | Same units as target variable
MAE    | Mean(|y - y_hat|)    | Robust to outliers
R^2    | 1 - SS_res/SS_tot    | Proportion of variance explained
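
A minimal NumPy sketch computing all four metrics on illustrative arrays:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.8, 5.3, 2.9, 6.6])

mse = np.mean((y_true - y_pred) ** 2)
rmse = np.sqrt(mse)
mae = np.mean(np.abs(y_true - y_pred))

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1.0 - ss_res / ss_tot

print(f"MSE={mse:.3f}  RMSE={rmse:.3f}  MAE={mae:.3f}  R^2={r2:.3f}")
```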

Reference

Topic                      | Description
Model assumptions          | Linearity, independence, homoscedasticity, normal errors, no multicollinearity. Different assumptions matter for prediction versus inference.
Ridge vs Lasso             | Ridge: use when all features may be relevant and are correlated. Lasso: use when feature selection is desired.
Multicollinearity handling | Ridge regression, drop correlated features, or PCA. VIF (variance inflation factor) detects multicollinearity (see the sketch after this table).
Non-linear relationships   | Add polynomial features, interaction terms, or use non-linear models.
Coefficient interpretation | A one-unit increase in feature X changes the prediction by the coefficient value, holding other features constant. Interpretation requires attention to scaling and multicollinearity.
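
As a rough illustration of the VIF check mentioned in the table, the sketch below computes VIF_j = 1 / (1 - R_j^2) by regressing each feature on the remaining features with plain NumPy; a common rule of thumb treats values above roughly 5 to 10 as problematic:

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of the feature matrix X."""
    n, d = X.shape
    vifs = []
    for j in range(d):
        target = X[:, j]
        # Regress feature j on all other features (with an intercept).
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        pred = others @ beta
        ss_res = np.sum((target - pred) ** 2)
        ss_tot = np.sum((target - target.mean()) ** 2)
        r2 = 1.0 - ss_res / ss_tot
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 3))
X[:, 2] = X[:, 0] + 0.05 * rng.normal(size=200)  # nearly redundant feature
print(vif(X))  # columns 0 and 2 show large VIFs
```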