First assignment

For the first assignment I am going to have you make a couple of modifications to the code found in the linear.ipynb notebook. You will be making some changes to the model I constructed: you should put the work for the updated model in a new notebook. You can copy and paste the relevant portions of linear.ipynb into the new notebook and then make the modifications I propose below.

Changing the temperature feature

One aspect of the current model needs an adjustment. One input feature in the current model is the temperature at the time of the bike ride. This has an impact on the average speed of the ride, since I go more slowly when the weather is cool and faster when the weather gets warmer. In fact, if you look closely at the details of the linear regression model we developed you will see that the model produces a higher average speed as the temperature rises.

One aspect of the relationship between temperature and speed that is not quite right is the idea that higher temperatures always produce higher speeds. In practice, what happens is that I tend to go faster as the temperature rises, but once I hit an 'ideal' temperature of about 68 degrees, higher temperatures actually produce lower speeds. This makes sense, because once the temperature gets into the 80s, overheating becomes a problem and I have to slow down to compensate.

The original model would predict that at a temperature of 85 degrees I would go faster than I would at a temperature of 65 degrees. Since this is clearly not the case, we need to fix our model.

To fix the model, replace the original temperature input feature with a feature that instead measures the departure from an ideal temperature: the absolute value of the difference between the original temperature and 68 degrees. Demonstrate that this change produces a lower rms error than we found for the original model.
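The feature transformation can be sketched in a few lines. This is a minimal illustration of the idea, not code from linear.ipynb; the function name and the sample temperatures are my own.

```python
import numpy as np

IDEAL_TEMP = 68.0  # ideal riding temperature in degrees, per the assignment

def temp_deviation(temps):
    """Replace raw temperature with its absolute distance from the ideal."""
    return np.abs(np.asarray(temps, dtype=float) - IDEAL_TEMP)

# A 65-degree ride and an 85-degree ride: under the new feature the hotter
# ride is *farther* from ideal, so a linear model can now predict that it
# is slower, which the raw temperature feature could not express.
print(temp_deviation([65.0, 68.0, 85.0]))  # → [ 3.  0. 17.]
```

Note that the model is still linear in its inputs; the nonlinearity (speed peaking at 68 degrees and falling off on both sides) is captured entirely by the transformed feature.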

Stratified sampling

In the original model I used two techniques to estimate the rms error for the model. The first was a simple 80/20 train/test split. The second was a five-fold cross-validation technique.
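Both estimation techniques follow the standard scikit-learn pattern. The sketch below uses synthetic stand-in data (the arrays X and y are placeholders for the features and average speeds loaded in linear.ipynb), but the split and cross-validation calls are the ones you will need.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score

# Stand-in data: one temperature-like feature and a noisy speed response.
rng = np.random.default_rng(0)
X = rng.uniform(40, 95, size=(200, 1))
y = 15.0 - 0.1 * np.abs(X[:, 0] - 68.0) + rng.normal(0, 0.5, 200)

# Technique 1: simple 80/20 train/test split.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_tr, y_tr)
rms_split = np.sqrt(np.mean((model.predict(X_te) - y_te) ** 2))

# Technique 2: five-fold cross-validation. scikit-learn reports negated MSE,
# so negate and take the square root to recover per-fold rms errors.
scores = cross_val_score(LinearRegression(), X, y,
                         scoring="neg_mean_squared_error", cv=5)
rms_cv = np.sqrt(-scores).mean()

print(f"split rms: {rms_split:.3f}  cv mean rms: {rms_cv:.3f}")
```

The cross-validation estimate averages five different test folds, so it is usually the more stable of the two numbers.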

A better approach than the simple 80/20 train/test split is to use stratified sampling. Read about stratified sampling in the "Create a Test Set" section of chapter two of the text. Use this technique to construct an 80/20 train/test split by selecting the test set as a 20% sample of the original data set stratified by temperature. Is the rms error from this stratified sample closer to the mean rms error from cross-validation than the rms error from the simple 80/20 random train/test split?
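One way to carry this out, in the spirit of the "Create a Test Set" section, is to bin the temperatures into categories and let train_test_split sample 20% from each bin. The sketch below uses synthetic data and bin edges I chose for illustration; you will want to pick bins that give each stratum a reasonable number of rides from the actual data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in data frame; in the assignment this comes from linear.ipynb.
rng = np.random.default_rng(0)
df = pd.DataFrame({"temperature": rng.uniform(40, 95, 500)})

# Bin temperatures into strata (edges here are illustrative).
df["temp_cat"] = pd.cut(df["temperature"],
                        bins=[0, 55, 65, 75, 85, np.inf],
                        labels=[1, 2, 3, 4, 5])

# Stratified 80/20 split: each bin contributes 20% of its rows to the test set.
train_set, test_set = train_test_split(df, test_size=0.2,
                                       stratify=df["temp_cat"],
                                       random_state=42)

# Each stratum's share of the test set now mirrors its share of the full data.
print(test_set["temp_cat"].value_counts(normalize=True).sort_index())

# Drop the helper column before training, as the text does.
train_set = train_set.drop(columns="temp_cat")
test_set = test_set.drop(columns="temp_cat")
```

The text's version of this uses StratifiedShuffleSplit; passing the category column via the stratify argument of train_test_split achieves the same effect more compactly.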