First Machine Learning Project

Background reading

The techniques I use below to construct our first example model are developed in chapter two of the textbook. You should read that chapter before attempting to understand this example in detail.

The cycling model

For our first example of a machine learning model I am going to construct a simple linear regression model that can predict the average speed for a bike ride based on a number of input features such as the temperature and wind speed.

I have a favorite route that I like to ride, depicted below.

Each time I go for a bike ride I use a GPS device to record my route. I subsequently upload the GPS files to strava.com, which allows me to maintain a record of all of the rides I have done.

Since conditions vary each time I do this particular ride, my average speed for this route will vary over a given year from a low of about 14 mph to a high of about 17 mph. For our first exercise in machine learning I am going to build a simple model that can predict my average speed for the route as a function of various factors that may play a role in determining that average speed.

Since I ride this particular route frequently, I have gathered enough data about the bike ride to be able to formulate a simple mathematical model. I have data for 50 iterations of this particular bike ride gathered from the years 2016 and 2017. This is enough data to begin to construct a simple model to predict my average speed.

Building a model

The first step in building a mathematical model is to write down a list of features that will have some impact on the output of the model. For this example we are trying to predict the average speed on a bike ride. Here are some factors that could have an impact on the outcome:

How much power the bike rider can generate: a powerful rider will go faster
The bicycle used for the ride: a light, fast bike will allow for a faster ride
The difficulty of the route: a hilly route will produce a lower average speed than a flat route
The temperature at the time of the ride: cold weather rides will be slower than warm weather rides
The wind speed and direction: strong headwinds make for a slower ride

A general rule of model building is that models with more input features require more data points than models with fewer input features. A commonly used strategy in science is the controlled experiment, in which we hold as many factors as possible fixed to reduce the number of input features we have to deal with. Here is what I did to reduce the number of input features for this problem.

Since all of the data I will use for this example was generated by me, as a first approximation we can assume that the first factor listed above, rider power, is fixed for all of the rides in the data set.
I also used the same bicycle for each of these rides. I did, however, use two different wheel sets: in warmer weather I used a set of relatively light wheels, and in cooler weather I switched to a heavier set of wheels with tougher tires. That variation in bike type will be captured by a feature in the data set that assigns a 0 to the lighter wheel set and a 1 to the heavier wheel set.
The route used for each ride in the data set will be the same. One small factor that may affect the speed on this route is that the route changed slightly between 2016 and 2017. At the end of 2016 the highway department installed a roundabout at one of the intersections the route crosses. Since I had to slow down to navigate that intersection after the roundabout was installed, this has a small but noticeable impact on my average speed for the route. This difference will be captured by a year column in the data set that indicates whether a given ride took place in 2016 on the faster route or in 2017 on the slower route.

With these features held more or less steady, the input data set now consists of five input features:

Year: 2016 or 2017
Wheel type: 1 for heavy, slow wheels and 0 for light, faster wheels
Temperature
Wind speed
Wind direction
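
To make the shape of the model concrete, here is a minimal sketch of how one ride could be turned into the five input features and how a linear model would combine them. The function names, the 0/1 encoding of the year, and the use of degrees for wind direction are assumptions made for illustration; they are not taken from the data set or the notebook.

```python
def encode_ride(year, heavy_wheels, temperature, wind_speed, wind_direction):
    """Return the five input features for one ride as a list of floats.

    Assumed encoding (hypothetical, for illustration only):
      year           -> 0.0 for 2016 (faster route), 1.0 for 2017 (roundabout)
      heavy_wheels   -> 0.0 for the light wheel set, 1.0 for the heavy set
      wind_direction -> compass bearing in degrees
    Temperature and wind speed are passed through in whatever units
    the data set uses.
    """
    return [
        1.0 if year == 2017 else 0.0,
        1.0 if heavy_wheels else 0.0,
        float(temperature),
        float(wind_speed),
        float(wind_direction),
    ]


def predict_speed(features, intercept, weights):
    """Linear model: intercept plus a weighted sum of the features."""
    return intercept + sum(w * x for w, x in zip(weights, features))


# One hypothetical ride: 2017, heavy wheels, made-up weather values.
features = encode_ride(2017, True, 70, 8, 180)
print(features)
```

Fitting the model means choosing the intercept and the five weights so that the predictions match the recorded average speeds as closely as possible.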

Gathering the data

The input data for this model comes from two data sources. The first data source is a collection of GPX files downloaded from strava.com. These files contain GPS data from each bike ride I took in 2016 and 2017. The GPS data for each ride takes the form of a long list of GPS coordinates: for each ride the file records a time stamp, latitude, longitude, and elevation in meters every four seconds. The second data source is a spreadsheet containing weather data from the National Weather Service. The spreadsheet contains weather observations recorded at the Appleton airport every 20 minutes from the start of 2016 through the end of 2017.
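
As an illustration of how an average speed can be recovered from a list of GPS points, the sketch below sums the great-circle (haversine) distance between consecutive points and divides by the elapsed time. It assumes the track has already been parsed into (timestamp, latitude, longitude) tuples; the function names are hypothetical, and this is not necessarily how strava.py computes the speed. Note that it uses total elapsed time, so any stops count against the average.

```python
import math
from datetime import datetime, timedelta

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def average_speed_mph(track):
    """Average speed over a track of (timestamp, lat, lon) tuples in time order."""
    distance = sum(
        haversine_miles(p[1], p[2], q[1], q[2])
        for p, q in zip(track, track[1:])
    )
    elapsed_hours = (track[-1][0] - track[0][0]).total_seconds() / 3600
    return distance / elapsed_hours


# Artificial two-point track: one degree of longitude along the equator in one hour.
start = datetime(2016, 6, 1, 9, 0, 0)
demo_track = [(start, 0.0, 0.0), (start + timedelta(hours=1), 0.0, 1.0)]
demo_speed = average_speed_mph(demo_track)
print(round(demo_speed, 2))
```

Elevation is ignored here, which is a reasonable simplification when the same route is used for every ride.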

The first step in any machine learning project is data wrangling. We have to take the raw input data we have at our disposal and restructure it into a set of data points for the features we want. To handle this part of the task I wrote a Python program, strava.py, that aggregates the data from the GPX files and the weather data spreadsheet. For further details on how this Python program works, see the Read Me file and the comments in the Python program.
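
One piece of this wrangling is pairing each ride with a weather observation. A simple approach, sketched below, is to pick the observation whose time stamp is closest to the start of the ride. This is only an assumed approach for illustration: the function name, the record layout, and the sample values are hypothetical, and strava.py may combine the two sources differently.

```python
from bisect import bisect_left
from datetime import datetime


def nearest_observation(observations, when):
    """Return the (timestamp, record) pair closest in time to `when`.

    `observations` must be a non-empty list of (timestamp, record) tuples
    sorted by timestamp, as a 20-minute weather log naturally is.
    """
    times = [t for t, _ in observations]
    i = bisect_left(times, when)
    # Only the neighbors on either side of the insertion point can be closest.
    candidates = observations[max(i - 1, 0): i + 1]
    return min(candidates, key=lambda obs: abs(obs[0] - when))


# Made-up observations at 20-minute intervals (placeholder values, not real data).
weather = [
    (datetime(2016, 6, 1, 9, 0), {"temperature": 61.0, "wind_speed": 7.0, "wind_direction": 200.0}),
    (datetime(2016, 6, 1, 9, 20), {"temperature": 62.0, "wind_speed": 8.0, "wind_direction": 210.0}),
    (datetime(2016, 6, 1, 9, 40), {"temperature": 63.0, "wind_speed": 9.0, "wind_direction": 220.0}),
]

ride_start = datetime(2016, 6, 1, 9, 27)
matched_time, matched_record = nearest_observation(weather, ride_start)
print(matched_time, matched_record)
```

For a ride lasting an hour or more, averaging the observations that fall within the ride would be an equally plausible choice.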

The end result of the data wrangling step is a CSV file, short_rides.csv, that contains the input data for our model.

Since we are going to focus more on the model building than the data wrangling for this example, I will not comment in detail on the code in the Python program at this point. We may revisit this code later in the tutorial when we get ready to do some further machine learning projects for the final project.

Building and testing the model

With our data set in place, we are ready to start constructing a model. The full details of this process are available in the Jupyter notebook linear.ipynb contained in the archive below.
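
For readers who want a preview of what fitting a linear regression involves, here is a minimal sketch using NumPy's least-squares solver. It does not reproduce the notebook and does not read short_rides.csv, whose column names are not described here. Instead it builds a synthetic, noise-free data set of 50 rows from arbitrary made-up coefficients and checks that the fit recovers them. All names and numbers in the block are assumptions for illustration.

```python
import numpy as np


def fit_linear_model(X, y):
    """Least-squares fit of y ~ intercept + X @ weights."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so the first coefficient acts as the intercept.
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[0], coef[1:]


# Synthetic stand-in data: NOT the real rides, values chosen arbitrarily.
rng = np.random.default_rng(0)
n_rides = 50
X = np.column_stack([
    rng.integers(0, 2, n_rides),    # year indicator (0 or 1)
    rng.integers(0, 2, n_rides),    # wheel type (0 light, 1 heavy)
    rng.uniform(40, 90, n_rides),   # temperature
    rng.uniform(0, 20, n_rides),    # wind speed
    rng.uniform(0, 360, n_rides),   # wind direction in degrees
])
true_intercept = 14.0
true_weights = np.array([-0.3, -0.4, 0.03, -0.05, 0.001])
y = true_intercept + X @ true_weights  # noise-free by construction

intercept, weights = fit_linear_model(X, y)
print(intercept, weights)
```

With real ride data the targets contain noise, so the fitted coefficients are estimates rather than exact values, and the model should be evaluated on rides it was not fitted to.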

Files for the model

All of the files for the cycling model are available in this archive.