Linear Regression and Regularization

Author
Dr. Ole Bialas

import numpy as np
import matplotlib.pyplot as plt

Linear models are useful tools for describing the relationship between variables (e.g. the intensity of a stimulus and the magnitude of a neural response) and form the basis of many different methods of statistical analysis. In this notebook you are going to learn about two of them: ordinary least squares (OLS) and ridge regression. OLS finds the line that best fits the data, while ridge regression helps to avoid overfitting, leading to better predictions on new, unseen data. You’ll also learn how to use bootstrap resampling to identify which predictors are significantly associated with the outcome.

The notebook proceeds as follows:

  1. Estimate the linear relationship between simulated variables using OLS regression
  2. Use ridge regularization and test how this improves the prediction of new, unseen data
  3. Apply bootstrap resampling to test the significance of the model’s predictors
  4. Implement OLS and ridge regression using pure NumPy

Setup

Utility Functions

def _fit_regression(X, y, intercept=False, lam=0):
    """Fit OLS (lam=0) or ridge regression and return the coefficients."""
    X = X.copy()
    if intercept:
        # prepend a column of ones so the first coefficient acts as the intercept
        X = np.hstack((np.ones((len(X), 1)), X))
    I = np.eye(X.shape[1])
    if intercept:
        I[0, 0] = 0  # do not regularize the intercept
    # solve the (regularized) normal equations (X^T X + lam * I) beta = X^T y
    coef = np.linalg.solve(X.T @ X + lam * I, X.T @ y)
    return coef

def _apply_regression(X, coefs):
    """Compute predictions X @ coefs, adding an intercept column if needed."""
    X = np.atleast_2d(X).T if X.ndim == 1 else X.copy()
    if len(coefs) == X.shape[1] + 1:
        # coefs includes an intercept term, so prepend a column of ones
        X = np.hstack([np.ones((len(X), 1)), X])
    y_pred = X @ coefs
    return y_pred

def _plot_bootstrap_coefs(coefs, ci=95):
    """Plot mean bootstrap coefficients with percentile confidence intervals."""
    mean = coefs.mean(axis=0)
    lower = np.percentile(coefs, (100 - ci) / 2, axis=0)
    upper = np.percentile(coefs, 100 - (100 - ci) / 2, axis=0)

    plt.errorbar(range(len(mean)), mean, 
                yerr=[mean - lower, upper - mean], 
                fmt='o', capsize=5)
    plt.axhline(0, color='gray', linestyle='--')
    plt.xlabel("Coefficient")
    plt.ylabel("Value");

    # mark coefficients whose CI excludes zero with an asterisk
    for i, (lo, hi) in enumerate(zip(lower, upper)):
        if lo > 0 or hi < 0:
            plt.text(i, upper[i] + 0.02, '*', ha='center')

class utils:
    fit_regression = _fit_regression
    apply_regression = _apply_regression
    plot_bootstrap_coefs = _plot_bootstrap_coefs

Section 1: Ordinary Least Squares (OLS) Regression

Background

A linear relationship has the form $y = X\beta$ where $y$ is the dependent variable and $X$ is a matrix with one or multiple predictors. $\beta$ is its slope, which describes how strongly $y$ and $X$ are associated. If there are multiple predictors in $X$, there is one $\beta$ for each of them.
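In matrix form, each row of $X$ holds one observation and each entry of $\beta$ weights one predictor. A minimal illustration with made-up numbers:

```python
import numpy as np

# three predictors, one beta per predictor (illustrative values)
X = np.array([[1.0, 0.5, -2.0],
              [0.0, 1.5, 1.0]])
beta = np.array([3.0, -1.5, 0.8])

y = X @ beta  # one value of y per row of X
print(y)  # [ 0.65 -1.45]
```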

The simplest way of estimating the values of $\beta$ is ordinary least squares (OLS) regression. OLS finds the coefficients $\hat{\beta}$ that minimize the sum of squared residuals between the observed data $y$ and the model predictions $\hat{y} = X\hat{\beta}$:

$$\hat{\beta} = \underset{\beta}{\arg\min} \|y - X\beta\|^2$$

Optionally, we can add an intercept $\alpha$ to the model, $y = \alpha + X\beta$, to account for cases where the data is not centered on 0.
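As a minimal sketch of the idea (Section 4 walks through the implementation step by step), the OLS estimate can be obtained by solving the normal equations with NumPy; the data here is simulated for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal(size=(50, 1))
y = 2.0 * X[:, 0] + 0.1 * rng.standard_normal(50)  # true slope: 2

# solve the normal equations (X^T X) beta = X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # close to the true slope of 2
```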

Exercises

In this section, you will use OLS to estimate the slope and intercept of a linear model from simulated data and visualize the estimated model. Rather than implementing OLS from scratch, the notebook provides utility functions for fitting and applying OLS regression. Here’s how to use them:

Code Description
utils.fit_regression(X, y) Fit OLS regression and return the estimated coefficients $\hat{\beta}$
utils.fit_regression(X, y, intercept=True) Fit OLS regression with an intercept term $\alpha$
utils.apply_regression(X, coef) Compute predictions $\hat{y} = X\hat{\beta}$ for a given X and coefficients
plt.scatter(x, y) Create a scatter plot of x and y
plt.plot(x, y) Create a line plot of x and y
rng = np.random.default_rng(42) Create a random number generator with seed 42
rng.standard_normal(size=(10, 1)) Sample 10 values from a standard normal distribution

Example: The cell below samples 20 data points that follow the equation $y = 3x$. Use utils.fit_regression and print the estimated coefficient $\beta$.

rng = np.random.default_rng(42)
X = rng.standard_normal(size=(20 , 1))
y = 3 * X + rng.standard_normal(size=(20 , 1))
plt.scatter(X, y);

coef = utils.fit_regression(X, y)
coef
array([[3.15525491]])

Exercise: The cell below samples 500 points from the same model. Use utils.fit_regression and print the estimated $\beta$ coefficient. Is this value close to the true value $\beta = 3$?

rng = np.random.default_rng(42)
X = rng.standard_normal(size=(500 , 1))
y = 3 * X + rng.standard_normal(size=(500, 1))
plt.scatter(X, y);

Solution
coef = utils.fit_regression(X, y)
coef
array([[2.99134185]])

Exercise: The cell below samples 100 data points from a different model. Use utils.fit_regression to estimate the value of $\beta$.

rng = np.random.default_rng(42)
X = rng.standard_normal(size=(100 , 1))
y = -0.75 * X + rng.standard_normal(size=(100, 1))
plt.scatter(X, y);

Solution
coef = utils.fit_regression(X, y)
coef
array([[-0.63043635]])

Example: Estimate the value of $\beta$ and use utils.apply_regression to compute the predicted values y_pred. Plot the original data points together with the prediction.

coef = utils.fit_regression(X, y)
y_pred = utils.apply_regression(X, coef)
plt.scatter(X, y)
plt.plot(X, y_pred);

Exercise: The cell below samples 100 data points from the model $y = 2 + 3x$. Estimate the value of $\beta$ and use utils.apply_regression to compute the predicted values y_pred. Plot the original data points together with the prediction. Does the line fit the data?

rng = np.random.default_rng(42)
X = rng.standard_normal(size=(100 , 1))
y = 2 + 3 * X + rng.standard_normal(size=(100,1))
Solution
coef = utils.fit_regression(X, y)
y_pred = utils.apply_regression(X, coef)
plt.scatter(X, y)
plt.plot(X, y_pred);

Exercise: Repeat the code from above but set intercept=True when calling utils.fit_regression() to also estimate the intercept $\alpha$. Does the line fit the data now?

Solution
coef = utils.fit_regression(X, y, intercept=True)
y_pred = utils.apply_regression(X, coef)
plt.scatter(X, y)
plt.plot(X, y_pred);

Exercise: Print the estimated coefficients from the previous exercise. Do they match the true values $\alpha = 2$ and $\beta = 3$?

Solution
coef
array([[1.99535902],
       [3.11917464]])

Exercise: The cell below samples 100 data points from the model $y = 2 + 3x_1 - 1.5x_2 + 0.8x_3$. Use utils.fit_regression() to fit a linear regression with intercept and print the estimated coefficients.

rng = np.random.default_rng(42)
X = rng.standard_normal(size=(100 , 3))
y = 2 + 3*X[:,0] - 1.5*X[:,1] + 0.8*X[:,2] + rng.standard_normal(100)
Solution
coef = utils.fit_regression(X, y, intercept=True)
coef
array([ 2.08767734,  3.00559502, -1.43020719,  0.60250069])

Example: Predict the relationship between $y$ and $x_1$ (X[:,0]) using the estimates for $\alpha$ and $\beta_1$ (coef[[0, 1]]) and plot the original data together with the estimated relationship.

y_pred = utils.apply_regression(X[:,0], coef[[0, 1]])
plt.scatter(X[:,0], y)
plt.plot(X[:,0], y_pred)

Exercise: Predict the relationship between $y$ and $x_2$ (X[:,1]) using the estimates for $\alpha$ and $\beta_2$ (coef[[0, 2]]) and plot the original data together with the estimated relationship.

Solution
y_pred = utils.apply_regression(X[:,1], coef[[0, 2]])
plt.scatter(X[:,1], y)
plt.plot(X[:,1], y_pred)

Exercise: Predict the relationship between $y$ and $x_3$ (X[:,2]) using the estimates for $\alpha$ and $\beta_3$ (coef[[0, 3]]) and plot the original data together with the estimated relationship.

Solution
y_pred = utils.apply_regression(X[:,2], coef[[0, 3]])
plt.scatter(X[:,2], y)
plt.plot(X[:,2], y_pred)

Section 2: Making Robust Predictions with Ridge Regularization

Background

OLS finds the coefficients that best fit the training data, but when the number of predictors is large relative to the number of observations, it can overfit — fitting the noise in the training data rather than the underlying signal. As a result, the model performs well on the training data but poorly on new, unseen data.

Ridge regression addresses this by adding a penalty on the size of the coefficients to the OLS objective:

$$\hat{\beta} = \underset{\beta}{\arg\min} \|y - X\beta\|^2 + \lambda\|\beta\|^2$$

This shrinks the coefficients toward zero, reducing overfitting. The strength of the regularization is controlled by $\lambda$: a larger $\lambda$ leads to more shrinkage. The solution satisfies:

$$(X^T X + \lambda I)\hat{\beta} = X^T y$$

When $\lambda = 0$, ridge regression is equivalent to OLS.
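To see the shrinkage in action, here is a minimal sketch with simulated data and an illustrative helper named ridge, comparing coefficient norms across values of $\lambda$:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal(size=(20, 5))
y = X[:, 0] + rng.standard_normal(20)

def ridge(X, y, lam):
    # solve the regularized normal equations (X^T X + lam * I) beta = X^T y
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# the norm of the coefficient vector shrinks as lambda grows
for lam in [0, 1, 10, 100]:
    print(lam, np.linalg.norm(ridge(X, y, lam)))

# lam=0 reproduces the OLS solution exactly
print(np.allclose(ridge(X, y, 0), np.linalg.solve(X.T @ X, X.T @ y)))  # True
```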

Exercises

In this section, you’ll use the lam parameter in utils.fit_regression to apply ridge regression with different values of $\lambda$. You’ll fit ridge regression to training data and evaluate it on test data to see how well the model generalizes to new, unseen data. To evaluate model accuracy, you are going to use the root mean squared error (RMSE) between the observed data and the model’s predictions. Here are the relevant code snippets:

Code Description
utils.fit_regression(X, y, lam=1) Fit ridge regression with regularization strength $\lambda = 1$
np.sqrt(np.mean((y - y_pred)**2)) Compute the root mean squared error (RMSE) between y and y_pred
np.linspace(0, 10, 20) Create an array of 20 evenly spaced values between 0 and 10
np.argmin(arr) Return the index of the smallest value in arr

The cell below generates 30 samples with 10 predictors, only two of which are actually associated with y (in reality we usually do not know which predictors are associated with y).

We then assign the first 15 samples to the train set and the rest to the test set. We’ll fit the regression on the train set and evaluate it on the test set to see how well the model generalizes to unseen data.

rng = np.random.default_rng(42)
X = rng.standard_normal(size=(30 , 10))
y = 1 + 1.75*X[:,0] - 0.9*X[:,1] + rng.standard_normal(30)

X_train, X_test = X[:15], X[15:]
y_train, y_test = y[:15], y[15:]

Example: Fit OLS regression on the train set, predict y_train and compute the root mean squared error (RMSE) of the prediction: np.sqrt(np.mean((y_train-y_train_pred)**2)).

coef = utils.fit_regression(X_train, y_train, intercept=True)
y_train_pred = utils.apply_regression(X_train, coef)
np.sqrt(np.mean((y_train-y_train_pred)**2))
np.float64(0.34449221293568094)

Exercise: Use the coef from above to predict the test set and compute the RMSE. How does it compare to the RMSE for the train set?

Solution
y_test_pred = utils.apply_regression(X_test, coef)
np.sqrt(np.mean((y_test-y_test_pred)**2))
np.float64(2.339959720309393)

Exercise: Fit a regression to the train set again but set lam=5 to use ridge regression with $\lambda = 5$. Predict y_train and compute the RMSE. How does it compare to the previous RMSE for the train set?

Solution
coef = utils.fit_regression(X_train, y_train, intercept=True, lam=5)
y_train_pred = utils.apply_regression(X_train, coef)
np.sqrt(np.mean((y_train-y_train_pred)**2))
np.float64(0.7331380443164356)

Exercise: Use the coef from above to predict the test set and compute the RMSE. How does it compare to the previous RMSE for the test set?

Solution
y_test_pred = utils.apply_regression(X_test, coef)
np.sqrt(np.mean((y_test-y_test_pred)**2))
np.float64(1.5495128300227903)

Exercise: Fit a ridge regression on the train set with $\lambda = 15$ and use the estimated coefficients to predict y_train and y_test. How do the RMSEs for the train and test sets compare to the previous exercises?

Solution
coef = utils.fit_regression(X_train, y_train, intercept=True, lam=15)
y_train_pred = utils.apply_regression(X_train, coef)
y_test_pred = utils.apply_regression(X_test, coef)
np.sqrt(np.mean((y_train-y_train_pred)**2)), np.sqrt(np.mean((y_test-y_test_pred)**2))
(np.float64(1.0963781800130035), np.float64(1.4791331477136682))

Exercise: Fit a ridge regression on the train set with $\lambda = 1000$ and use the estimated coefficients to predict y_train and y_test. Is the RMSE larger for the train or the test set?

Solution
coef = utils.fit_regression(X_train, y_train, intercept=True, lam=1000)
y_train_pred = utils.apply_regression(X_train, coef)
y_test_pred = utils.apply_regression(X_test, coef)
np.sqrt(np.mean((y_train-y_train_pred)**2)), np.sqrt(np.mean((y_test-y_test_pred)**2))
(np.float64(1.757910624020377), np.float64(1.733375772618923))

Exercise: Use np.linspace() to create an array with 20 lambdas between 0 and 100 and run the cell below to compute the RMSE on the train set for every $\lambda$ and plot the result.

lambdas = ... # np.linspace()
Solution
lambdas = np.linspace(0, 100, 20)
mse = []
for lam in lambdas:
    coef = utils.fit_regression(X_train, y_train, intercept=True, lam=lam)
    y_train_pred = utils.apply_regression(X_train, coef)
    mse.append(np.sqrt(np.mean((y_train-y_train_pred)**2)))

plt.plot(lambdas, mse)
plt.xlabel(r"$\lambda$")
plt.ylabel("RMSE (Train Set)")

Exercise: Use np.linspace() to create an array with 1000 lambdas between 0 and 200 and run the cell below to compute the RMSE on the test set for every $\lambda$ and plot the result.

lambdas = ... # np.linspace()
Solution
lambdas = np.linspace(0, 200, 1000)
mse = []
for lam in lambdas:
    coef = utils.fit_regression(X_train, y_train, intercept=True, lam=lam)
    y_test_pred = utils.apply_regression(X_test, coef)
    mse.append(np.sqrt(np.mean((y_test-y_test_pred)**2)))

plt.plot(lambdas, mse)
plt.xlabel(r"$\lambda$")
plt.ylabel("RMSE (Test Set)")

Exercise: Use np.argmin to find the index of the smallest RMSE for the test set and run the cell below to print the optimal value for $\lambda$.

min_idx = ... # np.argmin()
Solution
min_idx = np.argmin(mse)
print("Optimal value for lambda: ", lambdas[min_idx])
Optimal value for lambda:  13.013013013013014

Section 3: Quantifying Uncertainty with Bootstrapping

Background

In the previous section we used ridge regression and found the value of $\lambda$ that best predicts new, unseen data. However, a single regression fit gives point estimates of the coefficients, but no measure of how reliable those estimates are. Bootstrapping is a resampling technique that quantifies this uncertainty without requiring distributional assumptions.

The idea is simple: repeatedly resample the data with replacement, refit the model on each resample, and collect the resulting coefficients. The spread of these bootstrapped estimates reflects how much the coefficients would vary if the experiment were repeated — giving a confidence interval (CI) for each coefficient.

A coefficient whose CI does not include zero can be considered statistically significant at the corresponding level.
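The percentile method described above can be sketched as follows; the one-predictor model without intercept and the variable names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(200)
y = 0.5 * x + rng.standard_normal(200)  # true slope: 0.5

# resample with replacement, refit, and collect the slope estimates
slopes = []
for _ in range(2000):
    idx = rng.choice(len(x), len(x), replace=True)
    xb, yb = x[idx], y[idx]
    slopes.append((xb @ yb) / (xb @ xb))  # OLS slope without intercept
slopes = np.array(slopes)

# 95% percentile confidence interval
lower, upper = np.percentile(slopes, [2.5, 97.5])
print(lower, upper)  # if the interval excludes 0, the slope is significant
```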

Exercises

In the following exercises you are going to apply bootstrap resampling to identify which coefficients are significantly associated with the dependent variable. You are also going to see how ridge and OLS regression differ with respect to the coefficient estimates. The notebook provides the resampling code as well as a utility function to visualize the estimates with confidence intervals:

Code Description
idx = np.random.choice(n, n, replace=True) Draw n random integers between 0 and n-1 with replacement
np.stack(arrays) Stack a list of arrays into a single array
utils.plot_bootstrap_coefs(coefs) Plot mean coefficients with 95% CI, asterisk marks CIs excluding 0
utils.plot_bootstrap_coefs(coefs, ci=99) Plot with a 99% CI

Exercise: The code below randomly samples from X and y and then applies ridge regression to the resulting bootstrap sample X_boot, y_boot. Rerun the cell a few times and observe how the values of the estimated coefficients change.

Note: The true trend in the data is given by the equation $y = 1 + 1.75x_1 - 0.9x_2$. $x_3$ to $x_{10}$ are not associated with $y$ at all.

idx = np.random.choice(len(X), len(X), replace=True)
X_boot, y_boot = X[idx], y[idx]
coef = utils.fit_regression(X_boot, y_boot, intercept=True, lam=13)
coef
array([ 1.28666415,  0.98292911, -0.61409241, -0.18522603,  0.15980478,
       -0.19004276, -0.09841726, -0.17051188, -0.16305329, -0.06011261,
        0.04875652])

Exercise: Set n_boot to 100 and run the cell below to compute the regression coefficients 100 times with bootstrap resampling, using the previously estimated best value $\lambda = 13$. Print the bootstrapped coefficients - what is the shape of the array?

n_boot = ...
Solution
n_boot = 100
coefs = []
for i in range(n_boot):
    idx = np.random.choice(len(X), len(X), replace=True)
    X_boot, y_boot = X[idx], y[idx]
    coef = utils.fit_regression(X_boot, y_boot, intercept=True, lam=13)
    coefs.append(coef)
coefs = np.stack(coefs)
coefs
array([[ 1.23695717,  0.69134477, -0.55211956, ...,  0.05927028,
         0.06008401, -0.151001  ],
       [ 1.62440651,  0.91209257, -0.55471076, ..., -0.05183594,
        -0.11540324,  0.05235936],
       [ 1.4551361 ,  0.67992198, -0.15373708, ..., -0.0040847 ,
        -0.01714821,  0.04267496],
       ...,
       [ 1.42698789,  1.03316772, -0.420802  , ..., -0.14716959,
        -0.05798001, -0.00685537],
       [ 1.2208383 ,  0.79222136, -0.54683182, ..., -0.10275386,
         0.00533218,  0.03058543],
       [ 1.21449081,  0.91517625, -0.39263143, ...,  0.07640955,
        -0.11786716,  0.15861812]], shape=(100, 11))

Exercise: Rerun the code from above but set n_boot to 10_000. How did this change the shape of coefs?

Solution
n_boot = 10000
coefs = []
for i in range(n_boot):
    idx = np.random.choice(len(X), len(X), replace=True)
    X_boot, y_boot = X[idx], y[idx]
    coef = utils.fit_regression(X_boot, y_boot, intercept=True, lam=13)
    coefs.append(coef)
coefs = np.stack(coefs)
coefs
array([[ 1.66580184,  0.83907837, -0.38962806, ...,  0.03339816,
        -0.088214  ,  0.05630418],
       [ 0.97983361,  0.57847098, -0.2078252 , ...,  0.25443684,
        -0.09184317,  0.06679293],
       [ 1.32553261,  1.06024077, -0.50136916, ..., -0.01212684,
         0.03584278, -0.12186026],
       ...,
       [ 1.6390338 ,  0.93733506, -0.40086725, ..., -0.26435439,
        -0.08500956,  0.04845428],
       [ 1.49176372,  0.81536038, -0.27869283, ..., -0.13402602,
         0.08238664,  0.200545  ],
       [ 1.17760004,  1.03200618, -0.62195721, ..., -0.20897942,
         0.12867731, -0.17484593]], shape=(10000, 11))

Example: Plot the mean value for each coefficient across all bootstrap resamples with a 95% confidence interval (CI). The asterisk indicates that the CI for the given coefficient does not include 0.

utils.plot_bootstrap_coefs(coefs, ci=95)

Exercise: Plot the bootstrapped coefficients with a 99% and 99.9% CI. Which coefficients are significantly different from 0?

Solution
utils.plot_bootstrap_coefs(coefs, ci=99)

utils.plot_bootstrap_coefs(coefs, ci=99.9)

Exercise: Repeat the bootstrap estimation of the coefficients but set lam=0 to use OLS regression. Plot the estimated coefficients with a 99.9% CI. How does this compare to the coefficients estimated with ridge regression?

Solution
n_boot = 10000
coefs = []
for i in range(n_boot):
    idx = np.random.choice(len(X), len(X), replace=True)
    X_boot, y_boot = X[idx], y[idx]
    coef = utils.fit_regression(X_boot, y_boot, intercept=True, lam=0)
    coefs.append(coef)
coefs = np.stack(coefs)
utils.plot_bootstrap_coefs(coefs, ci=99.9)

Section 4: Implementing OLS and Ridge Regression in NumPy

Background

In the previous sections we used the provided utility functions to fit and apply OLS and ridge regression. This allowed us to focus on the interpretation of the results rather than the implementation of the regression itself. However, the functions are actually pretty straightforward and can be implemented in a few steps using NumPy. This section will show you how to do exactly that.

Exercises

In this section you are going to apply OLS and ridge regression to simulated data using only NumPy functions and the matrix multiplication operator @. Some of the steps will be provided for you and there will be hints in the exercise text on which functions to use.

Code Description
np.ones((n, 1)) Create an $n \times 1$ column of ones
np.hstack((A, B)) Horizontally stack arrays A and B
np.eye(p) Create a $p \times p$ identity matrix
np.linalg.solve(A, b) Solve the linear system $Ax = b$ for $x$
X.T @ X Compute $X^T X$
X @ coef Compute predictions $\hat{y} = X\hat{\beta}$

The cell below generates 100 data points from the model $y = 1 + 2x$.

rng = np.random.default_rng(42)
X = rng.standard_normal(size=(100, 1))
y = 1 + 2 * X + rng.standard_normal(size=(100, 1))
plt.scatter(X, y)

Exercise: Solve the normal equations $(X^T X)\beta = X^T y$ using np.linalg.solve(X.T @ X, X.T @ y) and print the returned regression coefficients.

Solution
coef = np.linalg.solve(X.T @ X, X.T @ y)
coef
array([[2.03574216]])

Exercise: Predict the values of $y$ via matrix multiplication of X and the estimated coefficients (X @ coef). Then plot the original data together with the prediction.

Solution
y_pred = X @ coef
plt.scatter(X, y)
plt.plot(X, y_pred);

Example: To fit an intercept, we prepend a column of np.ones() to X. The coefficient assigned to that column gets multiplied by 1 for every sample and thus acts as a constant offset.

X = np.hstack((np.ones((len(X), 1)), X))
X[:10]
array([[ 1.        ,  0.30471708],
       [ 1.        , -1.03998411],
       [ 1.        ,  0.7504512 ],
       [ 1.        ,  0.94056472],
       [ 1.        , -1.95103519],
       [ 1.        , -1.30217951],
       [ 1.        ,  0.1278404 ],
       [ 1.        , -0.31624259],
       [ 1.        , -0.01680116],
       [ 1.        , -0.85304393]])

Exercise: Solve the normal equations $(X^T X)\beta = X^T y$ using np.linalg.solve(X.T @ X, X.T @ y) and print the returned regression coefficients. Do the values match the true intercept $\alpha = 1$ and slope $\beta = 2$?

coef = np.linalg.solve(X.T @ X, X.T @ y)
coef
array([[0.99535902],
       [2.11917464]])

Exercise: Predict the values of $y$ via matrix multiplication of X and the estimated coefficients (X @ coef). Then plot the original data together with the prediction.

Note: Because we added a column of ones to X, use the second column when plotting (X[:,1]).

Solution
plt.scatter(X[:, 1], y)
plt.plot(X[:, 1], X @ coef);

Example: To use ridge regression, we create an identity matrix with size equal to the number of features in X. We multiply it by $\lambda$ to control the regularization strength. Because we do not want to regularize the intercept, we set the first entry of I to 0.

I = np.eye(X.shape[1])
I[0, 0] = 0
I
array([[0., 0.],
       [0., 1.]])

Exercise: Solve the normal equations $(X^T X + \lambda I)\beta = X^T y$ using np.linalg.solve(X.T @ X + lam * I, X.T @ y) with lam=1. Compute predictions and plot the original data together with the regression line.

Solution
lam = 1
coef = np.linalg.solve(X.T @ X + lam * I, X.T @ y)
plt.scatter(X[:, 1], y)
plt.plot(X[:, 1], X @ coef);

Exercise: Test different values of $\lambda$ and observe how the slope of the regression line changes.

Solution
plt.scatter(X[:, 1], y, color="black")
for lam in [1, 10, 100, 1000]:
    coef = np.linalg.solve(X.T @ X + lam * I, X.T @ y)
    plt.plot(X[:, 1], X @ coef, label=rf"$\lambda$={lam}")
plt.legend();