Final exam details

The final exam for this course will be structured the same way as the midterm. The final exam will be a 24-hour take-home exam. I will post the exam problems on the course web site at 7 PM on Tuesday, March 16. Your work on the final exam will be due by 7 PM on Wednesday, March 17.

The final exam will have problems that involve writing and using functions, working with text files, and working with pandas dataframes.

Below you will find the final exam problems from the last iteration of this course.

Sample exam data set

The dataset we will be using for this exam comes from a survey of young people gathered for a statistics class at a university in the UK. Additional information about this data set is available at https://www.kaggle.com/miroslavsabo/young-people-survey.

The CSV file we will be working with is available here.

The structure of the CSV file is simple enough that you can split each line of the file into fields by simply doing split(',') on the lines you read from the file.
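As a minimal sketch of that approach, the following code reads comma-separated lines and splits each one into fields. The sample text and its column names are made up for illustration and stand in for the real file; with the real data you would loop over the open file object instead of the StringIO object.

```python
import io

# Made-up sample standing in for the real CSV file (hypothetical contents).
sample = io.StringIO("Music,Age,Gender\n5,20,female\n4,19,male\n")

rows = []
for line in sample:
    # Strip the trailing newline before splitting, otherwise the last
    # field of every row would end in '\n'.
    fields = line.rstrip('\n').split(',')
    rows.append(fields)

header = rows[0]   # column titles from the first line
data = rows[1:]    # remaining lines, each a list of strings

print(header)
print(data[0])
```

Note that every field comes back as a string, so numeric fields still need to be converted with int() before you do arithmetic on them.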

Python problems from the previous final

For the problems in this section you should use straight Python code. You may not use pandas to solve these problems.

1. Some of the columns in the data file contain numbers while others contain text. Construct a function that opens the file and returns a list of integers that lists the column numbers of the columns that contain numbers and no text. To test whether or not a string in Python contains only digits and no other characters you can use the isdigit() method on the string. Helpful hint: to read a single line of text from the file you can use the readline() method, which reads just one line from the file. You should skip over the first line in the file and use the contents of the second line in the file to construct your list of column numbers.
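One possible shape for such a function is sketched below. The function name numeric_columns, the path parameter, and the tiny demo file are my own choices for illustration and are not part of the problem statement. Keep in mind that isdigit() returns False for an empty string and for strings with a minus sign or a decimal point, so this sketch treats such fields as text.

```python
import os
import tempfile

def numeric_columns(path):
    """Return the indices of the columns whose entry in the second
    line of the file consists only of digits."""
    with open(path) as f:
        f.readline()                          # skip the header line
        second = f.readline().rstrip('\n')    # first line of actual data
    fields = second.split(',')
    return [i for i, field in enumerate(fields) if field.isdigit()]

# Demo on a tiny made-up file (hypothetical contents, not the real data).
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False) as tmp:
    tmp.write("Music,Age,Gender\n5,20,female\n")
    demo_path = tmp.name

cols = numeric_columns(demo_path)
os.remove(demo_path)
print(cols)
```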

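2. The correlation between two lists of numbers x_i and y_i is computed via the formula

$$ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} $$

where

$$ \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i $$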

Write a program that computes the correlation between the column titled 'Classical' and the column titled 'Techno'.
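The core of such a program is a function that computes the correlation of two lists. Here is a sketch in plain Python, assuming the formula intended is the standard Pearson correlation coefficient. The function name correlation and the short test lists are made up for illustration. To apply it to the real file you would still need to locate the two columns by title (for example with index() on the list of header fields) and convert the entries to numbers, skipping any row where either entry is not a number.

```python
import math

def correlation(xs, ys):
    """Pearson correlation of two equal-length lists of numbers."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator: sum of products of the deviations from the means.
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Denominator: square root of the product of the summed squared deviations.
    den = math.sqrt(sum((x - x_bar) ** 2 for x in xs) *
                    sum((y - y_bar) ** 2 for y in ys))
    return num / den

r_up = correlation([1, 2, 3, 4], [2, 4, 6, 8])   # perfectly increasing relationship
r_down = correlation([1, 2, 3], [3, 2, 1])       # perfectly decreasing relationship
print(r_up, r_down)
```

The result always lies between -1 and 1. The function divides by zero if either list is constant, so a real program may want to guard against that case.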

pandas problem from the previous final

3. Use pandas to read the CSV file into a dataframe. Using the function you wrote in problem 1, make a list cols of column indices for the columns that have numbers in them, and then use iloc[:,cols].copy() to make a new data frame that contains only the columns that have numbers in them.

The original data frame has a column labeled 'Alcohol'. Construct a function that maps the entry 'never' to 0, 'social drinker' to 1, and 'drink a lot' to 2. map() this function across the Alcohol column to create a new column and add this column to your numbers-only data frame.
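The mapping step can be sketched as follows. The small data frame here is made up for illustration and stands in for the real survey data, and the names alcohol_level, df, and ndf are my own choices. Any entry not in the dictionary comes back as None, which pandas stores as a missing value.

```python
import pandas as pd

def alcohol_level(entry):
    """Translate an Alcohol survey answer into a number."""
    levels = {'never': 0, 'social drinker': 1, 'drink a lot': 2}
    return levels.get(entry)   # None for anything unexpected

# Made-up stand-in for the original data frame (hypothetical contents).
df = pd.DataFrame({
    'Alcohol': ['never', 'social drinker', 'drink a lot', 'never'],
    'Music': [5, 4, 3, 5],
})

# Stand-in for the numbers-only frame, built here from the one numeric column.
ndf = df[['Music']].copy()

# map() applies the function to each entry and returns a new Series.
ndf['Alcohol'] = df['Alcohol'].map(alcohol_level)
print(ndf)
```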

Assuming your new data frame, with the numeric columns plus the new Alcohol column, is stored in a variable ndf, the following code will compute a list of columns that correlate most strongly with alcohol use:

corrs = ndf.corr()
best_cols = corrs[(corrs['Alcohol'] < -0.15)|(corrs['Alcohol'] > 0.15)].index

Construct a third data frame from the second that has only these columns in it.

Use this third data frame to construct a multilinear regression model that uses these factors to predict the amount of alcohol use. Make a final dataframe that has columns for the predicted alcohol use and the actual alcohol use and print the contents of that data frame.
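One way to carry out this last step is an ordinary least squares fit. The sketch below uses numpy.linalg.lstsq, which is my choice for illustration and is not necessarily the tool used in class. The data frame third and its columns 'Party' and 'Sleep' are made up and stand in for the real third data frame. Note that the Alcohol column correlates perfectly with itself, so it ends up among the selected columns and has to be removed from the predictors before fitting.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the third data frame (hypothetical columns and values).
third = pd.DataFrame({
    'Party':   [1, 2, 3, 4, 5, 2],
    'Sleep':   [5, 3, 4, 1, 2, 5],
    'Alcohol': [0, 1, 1, 2, 2, 0],
})

# Predictors: every column except the one we are trying to predict.
factors = third.drop(columns='Alcohol')

# Design matrix with a leading column of ones for the intercept term.
X = np.column_stack([np.ones(len(factors)), factors.to_numpy(dtype=float)])
y = third['Alcohol'].to_numpy(dtype=float)

# Least squares solution of X @ coeffs ≈ y.
coeffs, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)

# Final frame: predicted versus actual alcohol use.
results = pd.DataFrame({'predicted': X @ coeffs, 'actual': third['Alcohol']})
print(results)
```

With the real data, rows containing missing values would have to be dropped or filled in before fitting, since the least squares routine cannot handle them.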

Solutions

Here are my solutions to these sample questions.