The techniques I will use below to construct our first example model are developed in chapter two in the textbook. You should read that chapter before attempting to understand this example in detail.
For our first example of a machine learning model I am going to construct a simple linear regression model that can predict the average speed for a bike ride based on a number of input features such as the temperature and wind speed.
I have a favorite route that I like to ride, depicted below.
Each time I go for a bike ride I use a GPS device to record my route. I subsequently upload the GPS files to strava.com, which allows me to maintain a record of all of the rides I have done.
Because conditions vary from ride to ride, my average speed for this route ranges over a given year from a low of about 14 mph to a high of about 17 mph. For our first exercise in machine learning I am going to build a simple model that can predict my average speed for the route as a function of various factors that may play a role in determining that average speed.
Since I ride this particular route frequently, I have gathered enough data about the bike ride to be able to formulate a simple mathematical model. I have data for 50 iterations of this particular bike ride gathered from the years 2016 and 2017. This is enough data to begin to construct a simple model to predict my average speed.
The first step in building a mathematical model is to write down a list of features that will have some impact on the output of the model. For this example we are trying to predict the average speed on a bike ride. Here are some factors that could have an impact on the outcome:
A general rule of model building is that models with more input features require more data points than models with fewer input features. A commonly used strategy in science is the controlled experiment, in which we hold as many factors as possible fixed so that fewer inputs remain to be modeled. For this example I tried to do the same. Here is what I did to reduce the number of input features for this problem.
With these features held more or less steady, the input data set now consists of five input features:
The input data for this model comes from two data sources. The first data source is a collection of GPX files downloaded from strava.com. These files contain GPS data from each bike ride I took in 2016 and 2017. The GPS data for each ride is a long list of GPS points taken over the course of the ride: a time stamp, latitude, longitude, and elevation in meters, recorded every four seconds. The second data source is a spreadsheet containing weather data from the National Weather Service. The spreadsheet holds weather observations taken at the Appleton airport every 20 minutes from the start of 2016 through the end of 2017.
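To make the GPS data more concrete, here is a minimal sketch of how an average speed could be computed from a list of time-stamped points. It is not the code in strava.py. The function names, the tuple layout, and the coordinates in the example are all assumptions made for illustration, and the track is synthetic rather than taken from a real GPX file.

```python
import math
from datetime import datetime, timedelta

EARTH_RADIUS_M = 6_371_000.0   # mean Earth radius in meters
METERS_PER_MILE = 1609.344


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def average_speed_mph(points):
    """Average speed over a track.

    points: list of (timestamp, latitude, longitude) tuples in time order.
    Elevation is ignored, and elapsed time is used, so stops count as riding.
    """
    if len(points) < 2:
        raise ValueError("need at least two track points")
    distance_m = sum(
        haversine_m(p[1], p[2], q[1], q[2])
        for p, q in zip(points, points[1:])
    )
    elapsed_s = (points[-1][0] - points[0][0]).total_seconds()
    if elapsed_s <= 0:
        raise ValueError("timestamps must increase along the track")
    return (distance_m / METERS_PER_MILE) / (elapsed_s / 3600.0)


# Synthetic track (made-up coordinates): one point every four seconds,
# heading due north at a steady 15 mph.
start = datetime(2016, 6, 1, 17, 0, 0)
step_m = 15.0 * METERS_PER_MILE / 3600.0 * 4.0        # meters covered in 4 s
step_deg = math.degrees(step_m / EARTH_RADIUS_M)      # same step in degrees of latitude
track = [
    (start + timedelta(seconds=4 * i), 44.0 + i * step_deg, -88.0)
    for i in range(100)
]
speed = average_speed_mph(track)
print(round(speed, 2))
```

Because the synthetic track was built to move at 15 mph, the printed value should come out very close to 15. A real ride would also need a decision about whether time spent stopped should count toward the average.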
The first step in any machine learning project is data wrangling. We have to take the raw input data we have at our disposal and restructure it into a set of data points for the features we want. To handle this part of the task I wrote a Python program, strava.py, that aggregates the data from the GPX files and the weather data spreadsheet. For further details on how this Python program works, see the Read Me file and the comments in the Python program.
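One piece of this wrangling is pairing each ride with the weather at the time of the ride. The sketch below shows one plausible way to do that, by picking the weather observation closest in time to the start of a ride. This is an assumption about the approach, not a description of what strava.py does. The function name, the record layout, and the weather values are hypothetical.

```python
from bisect import bisect_left
from datetime import datetime


def nearest_observation(observations, when):
    """Return the observation whose timestamp is closest to `when`.

    observations: non-empty list of (timestamp, data) tuples sorted by timestamp.
    On a tie, the earlier observation is returned.
    """
    times = [t for t, _ in observations]
    i = bisect_left(times, when)               # first observation at or after `when`
    candidates = observations[max(i - 1, 0): i + 1]
    return min(candidates, key=lambda obs: abs(obs[0] - when))


# Made-up observations at 20-minute intervals (values are illustrative only).
weather = [
    (datetime(2016, 6, 1, 14, 0), {"temp_f": 71.0, "wind_mph": 8.0}),
    (datetime(2016, 6, 1, 14, 20), {"temp_f": 72.0, "wind_mph": 9.0}),
    (datetime(2016, 6, 1, 14, 40), {"temp_f": 72.5, "wind_mph": 11.0}),
]
ride_start = datetime(2016, 6, 1, 14, 27)
obs_time, obs_data = nearest_observation(weather, ride_start)
print(obs_time, obs_data)
```

Here the ride starts at 14:27, which is 7 minutes after the 14:20 observation and 13 minutes before the 14:40 one, so the 14:20 record is selected. Averaging the observations over the whole ride would be an equally reasonable design choice.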
The end result of the data wrangling step is a CSV file, short_rides.csv, that contains the input data for our model.
Since we are going to focus more on the model building than the data wrangling for this example, I will not comment in detail on the code in the Python program at this point. We may revisit this code later in the tutorial when we get ready to do some further machine learning projects for the final project.
With our data set in place, we are ready to start constructing a model. The full details of this process are available in the Jupyter notebook linear.ipynb contained in the archive below.
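As a rough preview of what fitting a linear regression model involves, here is a small self-contained sketch that fits a model of the form speed = w0 + w1 * temperature + w2 * wind speed by least squares. It is not the code from linear.ipynb, which may well use a library instead. The helper names are hypothetical, and the data points are invented from a known linear rule so the result can be checked; they are not taken from short_rides.csv.

```python
def fit_linear(rows, targets):
    """Least-squares weights [w0, w1, ...] for targets ~ w0 + w1*x1 + ...

    Solves the normal equations (X^T X) w = X^T y by Gauss-Jordan elimination.
    """
    X = [[1.0] + list(r) for r in rows]        # prepend the intercept column
    n = len(X[0])
    # Augmented matrix [X^T X | X^T y]
    A = [
        [sum(x[i] * x[j] for x in X) for j in range(n)]
        + [sum(x[i] * y for x, y in zip(X, targets))]
        for i in range(n)
    ]
    for col in range(n):
        # Partial pivoting: move the largest remaining entry onto the diagonal
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("features are linearly dependent")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]


def predict(weights, row):
    """Evaluate the fitted linear model on one row of features."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], row))


# Invented data: (temperature in F, wind speed in mph) for six rides, with
# speeds generated exactly from speed = 14.0 + 0.03*temp - 0.1*wind.
features = [(50, 5), (60, 10), (70, 3), (80, 12), (65, 8), (55, 15)]
speeds = [14.0 + 0.03 * t - 0.1 * w for t, w in features]

weights = fit_linear(features, speeds)
print([round(w, 4) for w in weights])
print(round(predict(weights, (75, 6)), 2))
```

Because the invented speeds follow the rule exactly, the fit should recover weights close to 14.0, 0.03, and -0.1. Real ride data will be noisy, so the fitted weights there are estimates rather than exact values.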
All of the files for the cycling model are available in this archive.