Linear Regression and Regularization

Author
Dr. Ole Bialas

import numpy as np
import matplotlib.pyplot as plt

Linear models are useful tools for describing the relationship between variables (e.g. the intensity of a stimulus and the magnitude of a neural response) and form the basis of many different methods of statistical analysis. In this notebook you are going to learn about two of them: ordinary least squares (OLS) and ridge regression. OLS finds the line that best fits the data, while ridge regression helps to avoid overfitting, leading to better predictions on new, unseen data. You’ll also learn how to use bootstrap resampling to identify which predictors are significantly associated with the outcome.

The notebook proceeds as follows:

  1. Estimate the linear relationship between simulated variables using OLS regression
  2. Use ridge regularization and test how this improves the prediction of new, unseen data
  3. Apply bootstrap resampling to test the significance of the model’s predictors
  4. Implement OLS and ridge regression using pure NumPy

Setup

Utility Functions

def _fit_regression(X, y, intercept=False, lam=0):
    """Fit OLS (lam=0) or ridge regression and return the coefficients."""
    X = X.copy()
    if intercept:
        # prepend a column of ones so the first coefficient acts as the intercept
        X = np.hstack((np.ones((len(X), 1)), X))
    I = np.eye(X.shape[1])
    if intercept:
        I[0, 0] = 0  # do not regularize the intercept
    # solve the (regularized) normal equations (X^T X + lam * I) beta = X^T y
    coef = np.linalg.solve(X.T @ X + lam * I, X.T @ y)
    return coef

def _apply_regression(X, coefs):
    """Compute predictions X @ coefs, adding an intercept column if needed."""
    X = np.atleast_2d(X).T if X.ndim == 1 else X.copy()
    if len(coefs) == X.shape[1] + 1:
        # coefs includes an intercept term, so prepend a column of ones
        X = np.hstack([np.ones((len(X), 1)), X])
    y_pred = X @ coefs
    return y_pred

def _plot_bootstrap_coefs(coefs, ci=95):
    """Plot mean bootstrap coefficients with percentile confidence intervals."""
    mean = coefs.mean(axis=0)
    lower = np.percentile(coefs, (100 - ci) / 2, axis=0)
    upper = np.percentile(coefs, 100 - (100 - ci) / 2, axis=0)

    plt.errorbar(range(len(mean)), mean, 
                yerr=[mean - lower, upper - mean], 
                fmt='o', capsize=5)
    plt.axhline(0, color='gray', linestyle='--')
    plt.xlabel("Coefficient")
    plt.ylabel("Value");

    # mark coefficients whose CI excludes zero with an asterisk
    for i, (lo, hi) in enumerate(zip(lower, upper)):
        if lo > 0 or hi < 0:
            plt.text(i, upper[i] + 0.02, '*', ha='center')

class utils:
    fit_regression = _fit_regression
    apply_regression = _apply_regression
    plot_bootstrap_coefs = _plot_bootstrap_coefs

Section 1: Ordinary Least Squares (OLS) Regression

Background

A linear relationship has the form $y = X\beta$ where $y$ is the dependent variable and $X$ is a matrix with one or multiple predictors. $\beta$ is its slope, which describes how strongly $y$ and $X$ are associated. If there are multiple predictors in $X$, there is one $\beta$ for each of them.
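In matrix form, each row of $X$ holds one observation and each entry of $\beta$ weights one predictor. A minimal illustration with made-up numbers:

```python
import numpy as np

# three predictors, one beta per predictor (illustrative values)
X = np.array([[1.0, 0.5, -2.0],
              [0.0, 1.5, 1.0]])
beta = np.array([3.0, -1.5, 0.8])

y = X @ beta  # one value of y per row of X
print(y)  # [ 0.65 -1.45]
```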

The simplest way of estimating the values of $\beta$ is ordinary least squares (OLS) regression. OLS finds the coefficients $\hat{\beta}$ that minimize the sum of squared residuals between the observed data $y$ and the model predictions $\hat{y} = X\hat{\beta}$:

$$\hat{\beta} = \underset{\beta}{\arg\min} \|y - X\beta\|^2$$

Optionally, we can add an intercept $\alpha$ to the model, $y = \alpha + X\beta$, to account for cases where the data is not centered on 0.
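As a minimal sketch of the idea (Section 4 walks through the implementation step by step), the OLS estimate can be obtained by solving the normal equations with NumPy; the data here is simulated for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal(size=(50, 1))
y = 2.0 * X[:, 0] + 0.1 * rng.standard_normal(50)  # true slope: 2

# solve the normal equations (X^T X) beta = X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # close to the true slope of 2
```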

Exercises

In this section, you will use OLS to estimate the slope and intercept of a linear model from simulated data and visualize the estimated model. Rather than implementing OLS from scratch, the notebook provides utility functions for fitting and applying OLS regression. Here’s how to use them:

Code Description
utils.fit_regression(X, y) Fit OLS regression and return the estimated coefficients $\hat{\beta}$
utils.fit_regression(X, y, intercept=True) Fit OLS regression with an intercept term $\alpha$
utils.apply_regression(X, coef) Compute predictions $\hat{y} = X\hat{\beta}$ for a given X and coefficients
plt.scatter(x, y) Create a scatter plot of x and y
plt.plot(x, y) Create a line plot of x and y
rng = np.random.default_rng(42) Create a random number generator with seed 42
rng.standard_normal(size=(10, 1)) Sample 10 values from a standard normal distribution

Example: The cell below samples 20 data points that follow the equation $y = 3x$. Use utils.fit_regression and print the estimated coefficient $\beta$.

rng = np.random.default_rng(42)
X = rng.standard_normal(size=(20 , 1))
y = 3 * X + rng.standard_normal(size=(20 , 1))
plt.scatter(X, y);

coef = utils.fit_regression(X, y)
coef
array([[3.15525491]])

Exercise: The cell below samples 500 points from the same model. Use utils.fit_regression and print the estimated $\beta$ coefficient. Is this value close to the true value $\beta = 3$?

rng = np.random.default_rng(42)
X = rng.standard_normal(size=(500 , 1))
y = 3 * X + rng.standard_normal(size=(500, 1))
plt.scatter(X, y);

Solution
coef = utils.fit_regression(X, y)
coef
array([[2.99134185]])

Exercise: The cell below samples 100 data points from a different model. Use utils.fit_regression to estimate the value of $\beta$.

rng = np.random.default_rng(42)
X = rng.standard_normal(size=(100 , 1))
y = -0.75 * X + rng.standard_normal(size=(100, 1))
plt.scatter(X, y);

Solution
coef = utils.fit_regression(X, y)
coef
array([[-0.63043635]])

Example: Estimate the value of $\beta$ and use utils.apply_regression to compute the predicted values y_pred. Plot the original data points together with the prediction.

coef = utils.fit_regression(X, y)
y_pred = utils.apply_regression(X, coef)
plt.scatter(X, y)
plt.plot(X, y_pred);

Exercise: The cell below samples 100 data points from the model $y = 2 + 3x$. Estimate the value of $\beta$ and use utils.apply_regression to compute the predicted values y_pred. Plot the original data points together with the prediction. Does the line fit the data?

rng = np.random.default_rng(42)
X = rng.standard_normal(size=(100 , 1))
y = 2 + 3 * X + rng.standard_normal(size=(100,1))
Solution
coef = utils.fit_regression(X, y)
y_pred = utils.apply_regression(X, coef)
plt.scatter(X, y)
plt.plot(X, y_pred);

Exercise: Repeat the code from above but set intercept=True when calling utils.fit_regression() to also estimate the intercept $\alpha$. Does the line fit the data now?

Solution
coef = utils.fit_regression(X, y, intercept=True)
y_pred = utils.apply_regression(X, coef)
plt.scatter(X, y)
plt.plot(X, y_pred);

Exercise: Print the estimated coefficients from the previous exercise. Do they match the true values $\alpha = 2$ and $\beta = 3$?

Solution
coef
array([[1.99535902],
       [3.11917464]])

Exercise: The cell below samples 100 data points from the model $y = 2 + 3x_1 - 1.5x_2 + 0.8x_3$. Use utils.fit_regression() to fit a linear regression with intercept and print the estimated coefficients.

rng = np.random.default_rng(42)
X = rng.standard_normal(size=(100 , 3))
y = 2 + 3*X[:,0] - 1.5*X[:,1] + 0.8*X[:,2] + rng.standard_normal(100)
Solution
coef = utils.fit_regression(X, y, intercept=True)
coef
array([ 2.08767734,  3.00559502, -1.43020719,  0.60250069])

Example: Predict the relationship between $y$ and $x_1$ (X[:,0]) using the estimates for $\alpha$ and $\beta_1$ (coef[[0, 1]]) and plot the original data together with the estimated relationship.

y_pred = utils.apply_regression(X[:,0], coef[[0, 1]])
plt.scatter(X[:,0], y)
plt.plot(X[:,0], y_pred)

Exercise: Predict the relationship between $y$ and $x_2$ (X[:,1]) using the estimates for $\alpha$ and $\beta_2$ (coef[[0, 2]]) and plot the original data together with the estimated relationship.

Solution
y_pred = utils.apply_regression(X[:,1], coef[[0, 2]])
plt.scatter(X[:,1], y)
plt.plot(X[:,1], y_pred)

Exercise: Predict the relationship between $y$ and $x_3$ (X[:,2]) using the estimates for $\alpha$ and $\beta_3$ (coef[[0, 3]]) and plot the original data together with the estimated relationship.

Solution
y_pred = utils.apply_regression(X[:,2], coef[[0, 3]])
plt.scatter(X[:,2], y)
plt.plot(X[:,2], y_pred)

Section 2: Making Robust Predictions with Ridge Regularization

Background

OLS finds the coefficients that best fit the training data, but when the number of predictors is large relative to the number of observations, it can overfit — fitting the noise in the training data rather than the underlying signal. As a result, the model performs well on the training data but poorly on new, unseen data.

Ridge regression addresses this by adding a penalty on the size of the coefficients to the OLS objective:

$$\hat{\beta} = \underset{\beta}{\arg\min} \|y - X\beta\|^2 + \lambda\|\beta\|^2$$

This shrinks the coefficients toward zero, reducing overfitting. The strength of the regularization is controlled by $\lambda$: a larger $\lambda$ leads to more shrinkage. The solution satisfies:

$$(X^T X + \lambda I)\hat{\beta} = X^T y$$

When $\lambda = 0$, ridge regression is equivalent to OLS.
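To see the shrinkage in action, here is a minimal sketch with simulated data and an illustrative helper named ridge, comparing coefficient norms across values of $\lambda$:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal(size=(20, 5))
y = X[:, 0] + rng.standard_normal(20)

def ridge(X, y, lam):
    # solve the regularized normal equations (X^T X + lam * I) beta = X^T y
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# the norm of the coefficient vector shrinks as lambda grows
for lam in [0, 1, 10, 100]:
    print(lam, np.linalg.norm(ridge(X, y, lam)))

# lam=0 reproduces the OLS solution exactly
print(np.allclose(ridge(X, y, 0), np.linalg.solve(X.T @ X, X.T @ y)))  # True
```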

Exercises

In this section, you’ll use the lam parameter in utils.fit_regression to apply ridge regression with different values of $\lambda$. You’ll fit ridge regression to training data and evaluate it on test data to see how well the model generalizes to new, unseen data. To evaluate model accuracy, you are going to use the root mean squared error (RMSE) between the observed data and the model’s predictions. Here are the relevant code snippets:

Code Description
utils.fit_regression(X, y, lam=1) Fit ridge regression with regularization strength $\lambda = 1$
np.sqrt(np.mean((y - y_pred)**2)) Compute the root mean squared error (RMSE) between y and y_pred
np.linspace(0, 10, 20) Create an array of 20 evenly spaced values between 0 and 10
np.argmin(arr) Return the index of the smallest value in arr

The cell below generates 30 samples with 10 predictors, only two of which are actually associated with y (in reality we usually do not know which predictors are associated with y).

We then assign the first 15 samples to the train set and the rest to the test set. We’ll fit the regression on the train set and evaluate it on the test set to see how well the model generalizes to unseen data.

rng = np.random.default_rng(42)
X = rng.standard_normal(size=(30 , 10))
y = 1 + 1.75*X[:,0] - 0.9*X[:,1] + rng.standard_normal(30)

X_train, X_test = X[:15], X[15:]
y_train, y_test = y[:15], y[15:]

Example: Fit OLS regression on the train set, predict y_train and compute the root mean squared error (RMSE) of the prediction: np.sqrt(np.mean((y_train-y_train_pred)**2)).

coef = utils.fit_regression(X_train, y_train, intercept=True)
y_train_pred = utils.apply_regression(X_train, coef)
np.sqrt(np.mean((y_train-y_train_pred)**2))
np.float64(0.34449221293568094)

Exercise: Use the coef from above to predict the test set and compute the RMSE. How does it compare to the RMSE for the train set?

Solution
y_test_pred = utils.apply_regression(X_test, coef)
np.sqrt(np.mean((y_test-y_test_pred)**2))
np.float64(2.339959720309393)

Exercise: Fit a regression to the train set again but set lam=5 to use ridge regression with $\lambda = 5$. Predict y_train and compute the RMSE. How does it compare to the previous RMSE for the train set?

Solution
coef = utils.fit_regression(X_train, y_train, intercept=True, lam=5)
y_train_pred = utils.apply_regression(X_train, coef)
np.sqrt(np.mean((y_train-y_train_pred)**2))
np.float64(0.7331380443164356)

Exercise: Use the coef from above to predict the test set and compute the RMSE. How does it compare to the previous RMSE for the test set?

Solution
y_test_pred = utils.apply_regression(X_test, coef)
np.sqrt(np.mean((y_test-y_test_pred)**2))
np.float64(1.5495128300227903)

Exercise: Fit a ridge regression on the train set with $\lambda = 15$ and use the estimated coefficients to predict y_train and y_test. How do the RMSEs for the train and test sets compare to the previous exercises?

Solution
coef = utils.fit_regression(X_train, y_train, intercept=True, lam=15)
y_train_pred = utils.apply_regression(X_train, coef)
y_test_pred = utils.apply_regression(X_test, coef)
np.sqrt(np.mean((y_train-y_train_pred)**2)), np.sqrt(np.mean((y_test-y_test_pred)**2))
(np.float64(1.0963781800130035), np.float64(1.4791331477136682))

Exercise: Fit a ridge regression on the train set with $\lambda = 1000$ and use the estimated coefficients to predict y_train and y_test. Is the RMSE larger for the train or the test set?

Solution
coef = utils.fit_regression(X_train, y_train, intercept=True, lam=1000)
y_train_pred = utils.apply_regression(X_train, coef)
y_test_pred = utils.apply_regression(X_test, coef)
np.sqrt(np.mean((y_train-y_train_pred)**2)), np.sqrt(np.mean((y_test-y_test_pred)**2))
(np.float64(1.757910624020377), np.float64(1.733375772618923))

Exercise: Use np.linspace() to create an array with 20 lambdas between 0 and 100 and run the cell below to compute the RMSE on the train set for every $\lambda$ and plot the result.

lambdas = ... # np.linspace()
Solution
lambdas = np.linspace(0, 100, 20)
mse = []
for lam in lambdas:
    coef = utils.fit_regression(X_train, y_train, intercept=True, lam=lam)
    y_train_pred = utils.apply_regression(X_train, coef)
    mse.append(np.sqrt(np.mean((y_train-y_train_pred)**2)))

plt.plot(lambdas, mse)
plt.xlabel(r"$\lambda$")
plt.ylabel("RMSE (Train Set)")

Exercise: Use np.linspace() to create an array with 1000 lambdas between 0 and 200 and run the cell below to compute the RMSE on the test set for every $\lambda$ and plot the result.

lambdas = ... # np.linspace()
Solution
lambdas = np.linspace(0, 200, 1000)
mse = []
for lam in lambdas:
    coef = utils.fit_regression(X_train, y_train, intercept=True, lam=lam)
    y_test_pred = utils.apply_regression(X_test, coef)
    mse.append(np.sqrt(np.mean((y_test-y_test_pred)**2)))

plt.plot(lambdas, mse)
plt.xlabel(r"$\lambda$")
plt.ylabel("RMSE (Test Set)")

Exercise: Use np.argmin to find the index of the smallest RMSE for the test set and run the cell below to print the optimal value for $\lambda$.

min_idx = ... # np.argmin()
Solution
min_idx = np.argmin(mse)
print("Optimal value for lambda: ", lambdas[min_idx])
Optimal value for lambda:  13.013013013013014

Section 3: Quantifying Uncertainty with Bootstrapping

Background

In the previous section we used ridge regression and found the value of $\lambda$ that best predicts new, unseen data. However, a single regression fit gives point estimates of the coefficients, but no measure of how reliable those estimates are. Bootstrapping is a resampling technique that quantifies this uncertainty without requiring distributional assumptions.

The idea is simple: repeatedly resample the data with replacement, refit the model on each resample, and collect the resulting coefficients. The spread of these bootstrapped estimates reflects how much the coefficients would vary if the experiment were repeated — giving a confidence interval (CI) for each coefficient.

A coefficient whose CI does not include zero can be considered statistically significant at the corresponding level.
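The percentile method described above can be sketched as follows; the one-predictor model without intercept and the variable names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(200)
y = 0.5 * x + rng.standard_normal(200)  # true slope: 0.5

# resample with replacement, refit, and collect the slope estimates
slopes = []
for _ in range(2000):
    idx = rng.choice(len(x), len(x), replace=True)
    xb, yb = x[idx], y[idx]
    slopes.append((xb @ yb) / (xb @ xb))  # OLS slope without intercept
slopes = np.array(slopes)

# 95% percentile confidence interval
lower, upper = np.percentile(slopes, [2.5, 97.5])
print(lower, upper)  # if the interval excludes 0, the slope is significant
```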

Exercises

In the following exercises you are going to apply bootstrap resampling to identify which coefficients are significantly associated with the dependent variable. You are also going to see how ridge and OLS regression differ with respect to the coefficient estimates. The notebook provides the resampling code as well as a utility function to visualize the estimates with confidence intervals:

Code Description
idx = np.random.choice(n, n, replace=True) Draw n random integers between 0 and n-1 with replacement
np.stack(arrays) Stack a list of arrays into a single array
utils.plot_bootstrap_coefs(coefs) Plot mean coefficients with 95% CI, asterisk marks CIs excluding 0
utils.plot_bootstrap_coefs(coefs, ci=99) Plot with a 99% CI

Exercise: The code below randomly samples from X and y and then applies ridge regression to the resulting bootstrap sample X_boot, y_boot. Rerun the cell a few times and observe how the values of the estimated coefficients change.

Note: The true trend in the data is given by the equation $y = 1 + 1.75x_1 - 0.9x_2$. $x_3$ to $x_{10}$ are not associated with $y$ at all.

idx = np.random.choice(len(X), len(X), replace=True)
X_boot, y_boot = X[idx], y[idx]
coef = utils.fit_regression(X_boot, y_boot, intercept=True, lam=13)
coef
array([ 1.28666415,  0.98292911, -0.61409241, -0.18522603,  0.15980478,
       -0.19004276, -0.09841726, -0.17051188, -0.16305329, -0.06011261,
        0.04875652])

Exercise: Set n_boot to 100 and run the cell below to compute the regression coefficients 100 times with bootstrap resampling, using the previously estimated best value $\lambda = 13$. Print the bootstrapped coefficients - what is the shape of the array?

n_boot = ...
Solution
n_boot = 100
coefs = []
for i in range(n_boot):
    idx = np.random.choice(len(X), len(X), replace=True)
    X_boot, y_boot = X[idx], y[idx]
    coef = utils.fit_regression(X_boot, y_boot, intercept=True, lam=13)
    coefs.append(coef)
coefs = np.stack(coefs)
coefs
array([[ 1.23695717,  0.69134477, -0.55211956, ...,  0.05927028,
         0.06008401, -0.151001  ],
       [ 1.62440651,  0.91209257, -0.55471076, ..., -0.05183594,
        -0.11540324,  0.05235936],
       [ 1.4551361 ,  0.67992198, -0.15373708, ..., -0.0040847 ,
        -0.01714821,  0.04267496],
       ...,
       [ 1.42698789,  1.03316772, -0.420802  , ..., -0.14716959,
        -0.05798001, -0.00685537],
       [ 1.2208383 ,  0.79222136, -0.54683182, ..., -0.10275386,
         0.00533218,  0.03058543],
       [ 1.21449081,  0.91517625, -0.39263143, ...,  0.07640955,
        -0.11786716,  0.15861812]], shape=(100, 11))

Exercise: Rerun the code from above but set n_boot to 10_000. How did this change the shape of coefs?

Solution
n_boot = 10000
coefs = []
for i in range(n_boot):
    idx = np.random.choice(len(X), len(X), replace=True)
    X_boot, y_boot = X[idx], y[idx]
    coef = utils.fit_regression(X_boot, y_boot, intercept=True, lam=13)
    coefs.append(coef)
coefs = np.stack(coefs)
coefs
array([[ 1.66580184,  0.83907837, -0.38962806, ...,  0.03339816,
        -0.088214  ,  0.05630418],
       [ 0.97983361,  0.57847098, -0.2078252 , ...,  0.25443684,
        -0.09184317,  0.06679293],
       [ 1.32553261,  1.06024077, -0.50136916, ..., -0.01212684,
         0.03584278, -0.12186026],
       ...,
       [ 1.6390338 ,  0.93733506, -0.40086725, ..., -0.26435439,
        -0.08500956,  0.04845428],
       [ 1.49176372,  0.81536038, -0.27869283, ..., -0.13402602,
         0.08238664,  0.200545  ],
       [ 1.17760004,  1.03200618, -0.62195721, ..., -0.20897942,
         0.12867731, -0.17484593]], shape=(10000, 11))

Example: Plot the mean value for each coefficient across all bootstrap resamples with a 95% confidence interval (CI). The asterisk indicates that the CI for the given coefficient does not include 0.

utils.plot_bootstrap_coefs(coefs, ci=95)

Exercise: Plot the bootstrapped coefficients with a 99% and 99.9% CI. Which coefficients are significantly different from 0?

Solution
utils.plot_bootstrap_coefs(coefs, ci=99)

utils.plot_bootstrap_coefs(coefs, ci=99.9)

Exercise: Repeat the bootstrap estimation of the coefficients but set lam=0 to use OLS regression. Plot the estimated coefficients with a 99.9% CI. How does this compare to the coefficients estimated with ridge regression?

Solution
n_boot = 10000
coefs = []
for i in range(n_boot):
    idx = np.random.choice(len(X), len(X), replace=True)
    X_boot, y_boot = X[idx], y[idx]
    coef = utils.fit_regression(X_boot, y_boot, intercept=True, lam=0)
    coefs.append(coef)
coefs = np.stack(coefs)
utils.plot_bootstrap_coefs(coefs, ci=99.9)

Section 4: Implementing OLS and Ridge Regression in NumPy

Background

In the previous sections we used the provided utility functions to fit and apply OLS and ridge regression. This allowed us to focus on the interpretation of the results rather than the implementation of the regression itself. However, the functions are actually pretty straightforward and can be implemented in a few steps using NumPy. This section will show you how to do exactly that.

Exercises

In this section you are going to apply OLS and ridge regression to simulated data using only NumPy functions and the matrix multiplication operator @. Some of the steps will be provided for you and there will be hints in the exercise text on which functions to use.

Code Description
np.ones((n, 1)) Create an $n \times 1$ column of ones
np.hstack((A, B)) Horizontally stack arrays A and B
np.eye(p) Create a $p \times p$ identity matrix
np.linalg.solve(A, b) Solve the linear system $Ax = b$ for $x$
X.T @ X Compute $X^T X$
X @ coef Compute predictions $\hat{y} = X\hat{\beta}$

The cell below generates 100 data points from the model $y = 1 + 2x$.

rng = np.random.default_rng(42)
X = rng.standard_normal(size=(100, 1))
y = 1 + 2 * X + rng.standard_normal(size=(100, 1))
plt.scatter(X, y)

Exercise: Solve the normal equations $(X^T X)\beta = X^T y$ using np.linalg.solve(X.T @ X, X.T @ y) and print the returned regression coefficients.

Solution
coef = np.linalg.solve(X.T @ X, X.T @ y)
coef
array([[2.03574216]])

Exercise: Predict the values of $y$ via matrix multiplication of X and the estimated coefficients (X @ coef). Then plot the original data together with the prediction.

Solution
y_pred = X @ coef
plt.scatter(X, y)
plt.plot(X, y_pred);

Example: To fit an intercept, we prepend a column of np.ones() to X. The coefficient assigned to that column gets multiplied by 1 for every sample and thus acts as a constant offset.

X = np.hstack((np.ones((len(X), 1)), X))
X[:10]
array([[ 1.        ,  0.30471708],
       [ 1.        , -1.03998411],
       [ 1.        ,  0.7504512 ],
       [ 1.        ,  0.94056472],
       [ 1.        , -1.95103519],
       [ 1.        , -1.30217951],
       [ 1.        ,  0.1278404 ],
       [ 1.        , -0.31624259],
       [ 1.        , -0.01680116],
       [ 1.        , -0.85304393]])

Exercise: Solve the normal equations $(X^T X)\beta = X^T y$ using np.linalg.solve(X.T @ X, X.T @ y) and print the returned regression coefficients. Do the values match the true intercept $\alpha = 1$ and slope $\beta = 2$?

coef = np.linalg.solve(X.T @ X, X.T @ y)
coef
array([[0.99535902],
       [2.11917464]])

Exercise: Predict the values of $y$ via matrix multiplication of X and the estimated coefficients (X @ coef). Then plot the original data together with the prediction.

Note: Because we added a column of ones to X, use the second column when plotting (X[:,1]).

Solution
plt.scatter(X[:, 1], y)
plt.plot(X[:, 1], X @ coef);

Example: To use ridge regression, we create an identity matrix with size equal to the number of features in X. We multiply it by $\lambda$ to control the regularization strength. Because we do not want to regularize the intercept, we set the first entry of I to 0.

I = np.eye(X.shape[1])
I[0, 0] = 0
I
array([[0., 0.],
       [0., 1.]])

Exercise: Solve the normal equations $(X^T X + \lambda I)\beta = X^T y$ using np.linalg.solve(X.T @ X + lam * I, X.T @ y) with lam=1. Compute predictions and plot the original data together with the regression line.

Solution
lam = 1
coef = np.linalg.solve(X.T @ X + lam * I, X.T @ y)
plt.scatter(X[:, 1], y)
plt.plot(X[:, 1], X @ coef);

Exercise: Test different values of $\lambda$ and observe how the slope of the regression line changes.

Solution
plt.scatter(X[:, 1], y, color="black")
for lam in [1, 10, 100, 1000]:
    coef = np.linalg.solve(X.T @ X + lam * I, X.T @ y)
    plt.plot(X[:, 1], X @ coef, label=rf"$\lambda$={lam}")
plt.legend();