Regression models

In a regression problem we are given a set of data points with input features and output values. Our job is to construct a mathematical model that can accurately predict the output values from the input features. Mathematically, this means constructing a function

f(xi,1, xi,2, …, xi,n)

that can predict the output value yi for each point in our data set. Our goal is to construct a model function that minimizes the sum of square errors of the predictions:

In linear regression we limit ourselves to model functions that are linear functions of the input features.

f(xi,1, xi,2, …, xi,n) = β0 + β1 xi,1 + β2 xi,2 + ⋯ + βn xi,n

This can also be written

f(xi,1, xi,2, …, xi,n) = β0 1 + β1 xi,1 + β2 xi,2 + ⋯ + βn xi,n

= β · Xi

Where Xi is the original list of input features for data point i supplemented by a bias term x0 = 1.

If our goal is to construct a linear function that minimizes the sum of square errors for a set of data points, it is actually possible to solve for the optimal set of β values via the normal equations:

β = (XT X)-1 (XT Y)

Here X is a matrix whose rows are the input vectors Xi that have been supplemented with a bias term and Y is the vector of y values for the data set.

Preparing the data

In the notes below I am going to construct a linear regression model for the cycling data set I introduced in the last lecture.

The first step is replicate the data cleaning and preparation steps we introduced in the last lecture.

import pandas as pd
import numpy as np
df = pd.read_csv('rides.csv')
df['dir'] = df['dir'].fillna(value=184)
df['s'] = df['wind']*np.cos(((df['dir']-180)/180)*np.pi)
df['w'] = df['wind']*np.cos(((df['dir']-270)/180)*np.pi)
df['year'] = df['year']-2016
df = df.drop(['wind','dir'],axis=1).dropna()

Next, we need to add a new column for the bias term. To make a column of 1s we use the numpy ones() function, which makes an array filled with 1s of the desired length.

df['bias'] = np.ones(len(df))

To prepare for model building and evaluation we next have to split our data set into a training set and a test set. We will use the training data to build the regression model and then will compute errors using the test set. The following code will split the original data set of 52 elements into a test set of 11 randomly selected test items and the remaining 41 training items. To randomize the data frame I use the pandas sample() method, which can return a random sample of a specified fraction of the rows in the data frame. We set the fraction to 1, which effectively shuffles the rows of the entire data frame.

df = df.sample(frac=1)
test = df.iloc[0:11].copy()
train = df.iloc[11:].copy()

To copy the first 11 rows into a test dataframe we use the pandas iloc[] method, which is used to select rows of a data frame by their integer location indices.

Solving the normal equations

To set up the normal equations we separate the training set into columns of input features and a column of output values. The values property of a pandas data frame returns a numpy array containing the data we want to manipulate.

X = train.drop(columns=['speed']).values
Y = train['speed'].values

To solve the normal equations we use numpy linear algebra functions. For this example I will be using the object forms of the numpy dot() and transpose() functions:

np.dot(A,B)

becomes

A.dot(B)

and

np.transpose(X)

becomes

X.T

This allows us to write the normal equations more compactly:

one = np.linalg.inv(X.T.dot(X))
two = X.T.dot(Y)
beta = one.dot(two)

Model evaluation

We can now evaluate the model on the test data and compute an mse for test data.

testX = test.drop(columns=['speed']).values
testY = test['speed'].values
preds = testX.dot(beta.T)
errs =  testY - preds
sse = (errs*errs).sum()/len(errs)
mse = np.sqrt(sse)
print(mse)

Programming assignment

One problem with the procedure above is that the MSE we computed is dependent on the choice of test set and training set. You can observe that if you run the program above multiple times the MSE values will differ quite a bit as the sample() method gives us different train/test splits.

One way to get a more reliable estimate for the MSE is to use a strategy called cross-validation. Here is an outline of that strategy.

  1. We begin as before by randomizing the original data set.
  2. We then split the data set into five roughly equal sized slices.
  3. We do five rounds of testing, giving each slice in turn a chance to be the test data set, while the remaining four slices combine to form the training set for that round.
  4. We compute an MSE for each of the five rounds and then report the average of those five MSE values as our final MSE estimate for the model.

Modify the program above to do cross validation instead of a single train/test split. Have your program print the MSE values for each of the five rounds and the final average of the MSE values.

Here are a couple of suggestions to help you out. Since the original data set contains 52 entries you may have a little trouble dividing it into slices. One easy way to make slices when your data set is not evenly divisible by five is to use the numpy linspace() function. Given a starting value, an ending value, and an integer N, linspace() returns a list of numbers that divide the original range into N roughly equal subranges.

Another problem you will encounter is reassembling the four slices into one array. To do this you should use the numpy array append() method, which can combine two arrays into a larger array:

Xcombined = X1.append(X2)

For your convenience, here is an archive containing today's program and the CSV file.

Postscript: improving the regression

The results that our linear regression model produce are somewhat underwhelming. This is due to two primary factors. The first is that the data for this example is quite noisy. You can easily find examples of data points where the features are similar yet the outcomes (the average speed of the ride) are quite different. This is due to the fact that our model data does not capture every factor that has an effect on the outcome, such as how tired I was on the day of a ride or how aggressive I decided to be on a particular ride.

The second factor is that the outcome may not be a linear function of some of the features. Two sets of features are relevant here: the temperature and the wind speed. The effect of the temperature on the speed of the ride is that the speeds do generally go up as the temperture rises, but once the temperature passes a certain point (probably around 70 degrees F), increases in temperature may well lead to a decrease in speed. The wind speed also clearly has a nonlinear effect: since wind speeds can be both postive and negative, and winds speeds with a larger absolute value decrease the ride speed more, there is a nonlinear relationship between the wind speeds and the outcome.

In cases where we suspect that there might be a nonlinear relationship between some of the factors and the outcome, we can replace linear regression with polynomial regression. In this technique we introduce new factors that are polynomial functions of some of the original factors and then do linear regression on the expanded set of features.

For example, suppose we wanted to more effectively model the effect of wind speed in the model. Right now the model has two factors that give wind speed information, columns 's' and 'w' in the data frame. In the original linear model these two factors would appear in the linear regression model looking like

βs xs + βw xw

In a polynomial regression model we would replace these terms with

βs xs + βs2 (xs)2 + βsw (xs xw) + βw2 (xw)2 + βw xw

To implement this change in our code we would simply add three new columns to our data frame at the start of the process:

df['s2'] = df['s']**2
df['sw'] = df['s']*df['w']
df['w2'] = df['w']**2

and then proceed through the rest of the steps in the linear regression.

Unfortunately, this does not help. The problem is that we then run into another problem called overfitting. In our effort to make our model more accurate, it becomes more accurate on the training set. Unfortunately, if we then apply our more accurate model to the test set it performs worse. The usual fix for overfitting is to throw more data at the problem. Unfortunately, in this example our data set is fixed at 52 observations. Until I get out and ride my bike for a couple of more years on the same route we won't have enough data to overcome this problem.