ML-06: Linear Regression

Summary
A deep dive into linear regression: understand why squared error works, derive the normal equation step-by-step, explore the geometric intuition behind projections, and implement from scratch.
*Figure: House Price Fit. Linear regression finds the best-fit line by minimizing the sum of squared residuals (vertical dashed lines).*

Learning Objectives

  • Understand regression vs classification
  • Derive the least squares solution
  • Implement linear regression from scratch
  • Extend to multiple features

Theory

Regression vs Classification

| Task | Output Type | Example |
|------|-------------|---------|
| Classification | Discrete labels | Spam or not spam |
| Regression | Continuous values | House price prediction |

Classification answers “which category?” while regression answers “how much?”

Linear Model

$$\hat{y} = w_0 + w_1 x_1 + w_2 x_2 + \cdots = \boldsymbol{w}^T \boldsymbol{x}$$

Where $\boldsymbol{x} = (1, x_1, x_2, \ldots)^T$ includes a constant 1 so the bias term (intercept) is handled by the same dot product.

Breaking it down:

  • $w_0$ is the intercept (baseline prediction when all features are 0)
  • $w_1, w_2, \ldots$ are weights that determine how much each feature contributes
  • Each weight tells the model: “for every 1-unit increase in this feature, change the prediction by this amount”
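
To make the formula concrete, here is a tiny numeric sketch (the weights and feature values are invented purely for illustration, not taken from any dataset in this lesson):

```python
import numpy as np

# Hypothetical weights: intercept 50, +30 per 1000 sqft of size, -2 per year of age
w = np.array([50.0, 30.0, -2.0])

# One house, with the leading 1 for the bias term: size = 1.2 (thousand sqft), age = 10 years
x = np.array([1.0, 1.2, 10.0])

y_hat = w @ x  # w^T x = 50 + 30*1.2 - 2*10 = 66
print(y_hat)   # 66.0 (up to floating-point rounding)
```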

Why Squared Error?

The goal is to find the “best” line through data points. What makes a line optimal? Consider these error metrics:

| Error Metric | Formula | Properties |
|--------------|---------|------------|
| Sum of errors | $\sum (y_i - \hat{y}_i)$ | ❌ Positive and negative errors cancel out |
| Sum of absolute errors | $\sum \lvert y_i - \hat{y}_i \rvert$ | ✓ Works, but not differentiable at 0 |
| Sum of squared errors | $\sum (y_i - \hat{y}_i)^2$ | ✓ Differentiable, penalizes large errors more |

Key insight: Squared error is preferred because:

  1. It’s always positive (no cancellation)
  2. It’s smooth and differentiable (enables calculus-based optimization)
  3. It penalizes large errors heavily (one big error is worse than several small ones)
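
A quick numeric illustration of the three metrics on a made-up set of residuals, showing both the cancellation problem and the extra penalty squared error puts on the one large error:

```python
import numpy as np

# Made-up residuals (y_i - y_hat_i): two that cancel, one small, one large
residuals = np.array([1.0, -1.0, 0.5, 3.0])

print("Sum of errors:         ", residuals.sum())          # 3.5 (the +1 and -1 cancel out)
print("Sum of absolute errors:", np.abs(residuals).sum())  # 5.5
print("Sum of squared errors: ", (residuals ** 2).sum())   # 11.25 (dominated by the 3.0 error)
```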

Least Squares Objective

Minimize the sum of squared errors:

$$J(\boldsymbol{w}) = \sum_{i=1}^{N} (y_i - \boldsymbol{w}^T \boldsymbol{x}_i)^2 = \lVert \boldsymbol{y} - X\boldsymbol{w} \rVert^2$$

In matrix form, where $X$ is the design matrix (each row is a sample, each column is a feature):

  • $\boldsymbol{y}$ = vector of true values $(y_1, y_2, \ldots, y_N)^T$
  • $X\boldsymbol{w}$ = vector of predictions
  • $\lVert \cdot \rVert^2$ = squared Euclidean norm
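
For concreteness, here is a minimal sketch of how a design matrix with a bias column can be assembled in NumPy (the three samples and two features are made-up values):

```python
import numpy as np

# Three samples, two features each (illustrative values only)
features = np.array([[1.2, 3.0],
                     [0.8, 2.0],
                     [1.5, 4.0]])

# Prepend a column of ones so that w_0 plays the role of the intercept
X = np.c_[np.ones(len(features)), features]
print(X.shape)  # (3, 3)
print(X)
```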

Deriving the Closed-Form Solution

The optimal weights can be derived step by step.

Step 1: Expand the objective function

$$J(\boldsymbol{w}) = (\boldsymbol{y} - X\boldsymbol{w})^T(\boldsymbol{y} - X\boldsymbol{w})$$

Expanding the product:

$$J(\boldsymbol{w}) = \boldsymbol{y}^T\boldsymbol{y} - 2\boldsymbol{w}^T X^T \boldsymbol{y} + \boldsymbol{w}^T X^T X \boldsymbol{w}$$

Step 2: Take the gradient with respect to $\boldsymbol{w}$

Using matrix calculus rules:

  • $\frac{\partial}{\partial \boldsymbol{w}}(\boldsymbol{w}^T \boldsymbol{a}) = \boldsymbol{a}$
  • $\frac{\partial}{\partial \boldsymbol{w}}(\boldsymbol{w}^T A \boldsymbol{w}) = 2A\boldsymbol{w}$ (when $A$ is symmetric)

$$\nabla_{\boldsymbol{w}} J = -2X^T\boldsymbol{y} + 2X^T X\boldsymbol{w}$$

Step 3: Set gradient to zero and solve

$$-2X^T\boldsymbol{y} + 2X^T X\boldsymbol{w}^* = 0$$

$$X^T X \boldsymbol{w}^* = X^T \boldsymbol{y}$$

$$\boldsymbol{w}^* = (X^T X)^{-1} X^T \boldsymbol{y}$$

This is the normal equation — a direct, closed-form solution for linear regression.
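
As a sanity check (not part of the original derivation), the normal equation can be compared against NumPy's built-in least-squares solver on synthetic data; both should return the same weights:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.c_[np.ones(50), rng.random((50, 2))]      # design matrix with a bias column
y = X @ np.array([1.0, 2.0, -3.0]) + 0.1 * rng.standard_normal(50)

w_normal = np.linalg.inv(X.T @ X) @ X.T @ y      # normal equation
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)  # NumPy's least-squares solver

print(np.allclose(w_normal, w_lstsq))            # True: both minimize the same objective
```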

Geometric Interpretation

The prediction $\hat{\boldsymbol{y}} = X\boldsymbol{w}$ is the projection of $\boldsymbol{y}$ onto the column space of $X$.

*Figure: Geometric interpretation of least squares. The prediction ŷ is the orthogonal projection of y onto the column space of X, with the residual (y - ŷ) perpendicular to the plane.*

Why projection? The intuition:

  • The column space of $X$ contains all possible predictions the model can make
  • The true $\boldsymbol{y}$ might not be in this space (data is rarely perfectly linear)
  • The closest point in this space to $\boldsymbol{y}$ is its orthogonal projection
  • “Closest” in Euclidean distance = minimizing squared error!

The orthogonality condition: The residual vector $(\boldsymbol{y} - \hat{\boldsymbol{y}})$ is perpendicular to every column of $X$:

$$X^T(\boldsymbol{y} - X\boldsymbol{w}) = 0$$

This is exactly the normal equation rearranged—geometry and calculus yield the same answer.
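
The orthogonality condition is easy to verify numerically. A small sketch (synthetic data, not from the lesson's examples): after solving the normal equation, $X^T(\boldsymbol{y} - X\boldsymbol{w})$ should be zero up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.c_[np.ones(30), rng.random((30, 2))]
y = rng.random(30)

w = np.linalg.inv(X.T @ X) @ X.T @ y  # least-squares weights via the normal equation
residual = y - X @ w

# Each entry should be essentially zero (on the order of 1e-14 or smaller)
print(X.T @ residual)
```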

Understanding R² Score

The coefficient of determination ($R^2$) measures how well the model explains the variance in the data:

$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$

Where:

  • $SS_{res}$ = residual sum of squares (unexplained variance)
  • $SS_{tot}$ = total sum of squares (total variance)
  • $\bar{y}$ = mean of $y$

Interpreting R²:

| R² Value | Interpretation |
|----------|----------------|
| 1.0 | Perfect fit: model explains all variance |
| 0.8 | Good: 80% of variance explained |
| 0.5 | Moderate: half the variance explained |
| 0.0 | Model is no better than predicting the mean |
| < 0 | Model is worse than predicting the mean |
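
A short sketch of computing $R^2$ directly from its definition and checking it against scikit-learn's `r2_score` (the toy arrays below are made up):

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.3, 6.9, 9.2])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(r2_manual)                 # ~0.991
print(r2_score(y_true, y_pred))  # same value
```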

Code Practice

From Scratch Implementation

🐍 Python

```python
import numpy as np

class LinearRegressionScratch:
    def __init__(self):
        self.w = None

    def fit(self, X, y):
        # Add bias column
        X_b = np.c_[np.ones(len(X)), X]
        # Normal equation
        self.w = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y
        return self

    def predict(self, X):
        X_b = np.c_[np.ones(len(X)), X]
        return X_b @ self.w
```

Example: House Price

🐍 Python

```python
import numpy as np
import matplotlib.pyplot as plt

# Generate data
np.random.seed(42)
X = 2 * np.random.rand(100, 1)  # Size in 1000 sqft
y = 4 + 3 * X.ravel() + np.random.randn(100)  # Price = 4 + 3*size + noise

# Train
model = LinearRegressionScratch()
model.fit(X, y)

print(f"Intercept: {model.w[0]:.2f}")
print(f"Slope: {model.w[1]:.2f}")

# Plot
plt.scatter(X, y, alpha=0.6)
X_line = np.array([[0], [2]])
plt.plot(X_line, model.predict(X_line), 'r-', linewidth=2)
plt.xlabel('Size (1000 sqft)')
plt.ylabel('Price ($10k)')
plt.title('Linear Regression: House Price')
plt.savefig('assets/linear_regression.png')
```

Output:

```
Intercept: 4.22
Slope: 2.77
```

*Figure: "Linear Regression: House Price" (scatter plot with the fitted line, saved to assets/linear_regression.png).*

Multiple Linear Regression

🐍 Python

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Load dataset
diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Train
model = LinearRegression()
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
print(f"R² Score: {r2_score(y_test, y_pred):.3f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.2f}")

# Feature importance
for name, coef in zip(diabetes.feature_names, model.coef_):
    print(f"{name:>6}: {coef:>7.2f}")
```

Output:

```
R² Score: 0.480
RMSE: 52.27
   age:  -49.11
   sex: -187.26
   bmi:  552.55
    bp:  294.34
    s1: -932.06
    s2:  556.03
    s3:  161.11
    s4:  191.75
    s5:  763.95
    s6:  148.11
```

Deep Dive

Q1: When does the normal equation fail?

The normal equation $\boldsymbol{w}^* = (X^T X)^{-1} X^T \boldsymbol{y}$ requires inverting $X^T X$, which can fail or be problematic in several cases:

| Problem | Cause | Solution |
|---------|-------|----------|
| Singular matrix | Features are linearly dependent (e.g., feature_3 = 2 × feature_1) | Remove redundant features or use regularization |
| Near-singular matrix | Features are highly correlated (multicollinearity) | Use Ridge regression (L2 regularization) |
| Large dataset | Forming $X^T X$ costs $O(np^2)$ and inverting it costs $O(p^3)$, which is slow at scale | Use gradient descent (iterative, $O(np)$ per step) |
| More features than samples | $X^T X$ is not full rank when $p > n$ | Use regularization or dimensionality reduction |

Practical rule of thumb: Use normal equation for small datasets (< 10,000 samples), gradient descent for large ones.
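
When $X^T X$ is singular or nearly singular, a common workaround is to replace the plain inverse with the pseudoinverse or to add L2 regularization. A minimal sketch (the collinear data here is constructed on purpose; this is an illustration, not the lesson's recommended pipeline):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(2)
x1 = rng.random(20)
X = np.c_[np.ones(20), x1, 2 * x1]  # third column is exactly 2x the second -> X^T X is singular
y = 1 + 3 * x1 + 0.05 * rng.standard_normal(20)

# np.linalg.inv(X.T @ X) would fail or be numerically meaningless here.
w_pinv = np.linalg.pinv(X) @ y      # pseudoinverse: minimum-norm least-squares solution
print(w_pinv)

ridge = Ridge(alpha=1.0, fit_intercept=False).fit(X, y)  # L2 regularization stabilizes the solution
print(ridge.coef_)
```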

Q2: What does a negative coefficient mean?

A negative weight indicates an inverse relationship: as that feature increases, the prediction decreases (holding other features constant).

Examples:

  • House prices: “years since renovation” → negative (older renovation = lower price)
  • Test scores: “hours of distraction” → negative (more distraction = lower score)
  • Fuel efficiency: “vehicle weight” → negative (heavier = less efficient)
Tip

Be careful with interpretation! Coefficients show correlation in the model, not necessarily causation. A negative coefficient for “age” in a house price model might reflect other factors correlated with age.

Q3: Linear regression vs. correlation?

These concepts are related but serve different purposes:

| Aspect | Correlation (r) | Linear Regression |
|--------|-----------------|-------------------|
| Purpose | Measure relationship strength | Predict values |
| Output | Single number (-1 to 1) | Prediction function $\hat{y} = \boldsymbol{w}^T \boldsymbol{x}$ |
| Directionality | Symmetric ($r(X,Y) = r(Y,X)$) | Asymmetric ($X$ predicts $Y$) |
| Units | Unitless | Same units as $Y$ |
| Relationship | $R^2 = r^2$ for simple linear regression | |

Key insight: The correlation coefficient $r$ measures how tightly the points cluster around the best-fit line, and $R^2 = r^2$ gives the proportion of variance explained; the two convey the same information in different forms.
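
The $R^2 = r^2$ identity for simple (single-feature) linear regression can be checked directly; a small sketch on synthetic data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(3)
x = rng.random(100)
y = 2 + 1.5 * x + 0.2 * rng.standard_normal(100)

r = np.corrcoef(x, y)[0, 1]  # Pearson correlation coefficient
model = LinearRegression().fit(x.reshape(-1, 1), y)
r2 = r2_score(y, model.predict(x.reshape(-1, 1)))

print(r ** 2, r2)  # the two values agree up to floating-point error
```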

Q4: How to handle categorical features?

Linear regression works with numbers, so categorical features must be encoded:

| Method | Example | When to use |
|--------|---------|-------------|
| One-hot encoding | Color: [Red, Blue, Green] → [1,0,0], [0,1,0], [0,0,1] | Nominal categories (no order) |
| Ordinal encoding | Size: [S, M, L] → [1, 2, 3] | Ordinal categories (with order) |
Warning

Don’t use ordinal encoding for nominal categories! Assigning Red=1, Blue=2, Green=3 implies Blue is “between” Red and Green, which is meaningless.
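
A minimal sketch of one-hot encoding with scikit-learn's `OneHotEncoder` (the color values are made up; note that the encoder orders its output columns alphabetically):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

colors = np.array([["Red"], ["Blue"], ["Green"], ["Blue"]])

encoder = OneHotEncoder()
one_hot = encoder.fit_transform(colors).toarray()

print(encoder.categories_)  # categories are sorted alphabetically: Blue, Green, Red
print(one_hot)              # each row has a single 1 in the column for its color
```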

Summary

| Concept | Formula/Meaning |
|---------|-----------------|
| Linear Model | $\hat{y} = \boldsymbol{w}^T \boldsymbol{x}$ |
| Loss Function | $\lVert \boldsymbol{y} - X\boldsymbol{w} \rVert^2$ (squared error) |
| Solution | $\boldsymbol{w}^* = (X^T X)^{-1} X^T \boldsymbol{y}$ |
| R² Score | Proportion of variance explained |
