Data Files

This is an open-note exam. You are free to use any of the material posted to the course web site in this exam.

The problems in this exam will require you to work with some data files I have provided. Start by clicking the data files button above to download an archive containing the data files you will need to use in this exam.

1. One of the more powerful features in pandas is column operations. Since the columns in a pandas dataframe are stored as numpy arrays, pandas gets to take advantage of the powerful methods offered by the numpy array. For example, to compute the sum of the elements in a dataframe column, you simply apply the .sum() method to the column. Other column operations include .mean(), .max(), and .min().
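As a small illustration of these column operations, here is a sketch on a made-up dataframe. The column names lotsize and price and all of the numbers are invented for this example and are not taken from the exam's data files.

```python
import pandas as pd

# Made-up data for illustration only; not the contents of any exam file.
df = pd.DataFrame({
    "lotsize": [4000, 6000, 5000],
    "price": [50000, 80000, 65000],
})

# Each call below operates on a whole column at once, with no explicit loop.
total_price = df["price"].sum()
mean_price = df["price"].mean()
largest_lot = df["lotsize"].max()
smallest_lot = df["lotsize"].min()

print(total_price, mean_price, largest_lot, smallest_lot)
```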

In this problem you will do some computations that would normally require the use of loops: these computations can all be done as column operations in pandas.

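The correlation coefficient for two data series with elements $x_i$ and $y_i$ is

$$r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2}}$$

where $\bar{x}$ and $\bar{y}$ are the means of the two data series.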

Write a Python program or construct a Jupyter notebook that implements the following steps. In each of the steps below you must do your work without using any loops.

  1. Make a dataframe from the Housing.csv file.
  2. Make a new, smaller dataframe that contains just the columns with lot size information and price information.
  3. Compute the mean values of the two columns in your dataframe.
  4. Given a data series with elements $x_i$, the recentered data series is a new series whose elements are $x_i - \bar{x}$, where $\bar{x}$ is the mean of the original data series. Compute two data series for the recentered lot size and the recentered house price.
  5. Using only column operations involving the recentered data series, compute the correlation coefficient between the lot size and the house price.
  6. Apply the .corr() method to your dataframe to compute the correlation matrix for the dataframe. Confirm that the entry in the correlation matrix that shows the correlation between the lot size and the house price matches the result you got in step 5.
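The steps above can be sketched as follows. This is only one possible approach under stated assumptions: the column names lotsize and price are guesses at what Housing.csv contains, and the numbers are invented so that the example runs without the file. With the real data, the dataframe would instead come from pd.read_csv("Housing.csv").

```python
import pandas as pd

# Stand-in for pd.read_csv("Housing.csv"); column names and values are assumptions.
df = pd.DataFrame({
    "lotsize": [3000.0, 4000.0, 5000.0, 6500.0, 8000.0],
    "price": [40000.0, 52000.0, 61000.0, 70000.0, 95000.0],
    "bedrooms": [2, 3, 3, 4, 4],
})

# Step 2: keep only the two columns of interest.
small = df[["lotsize", "price"]]

# Step 3: column means (a Series indexed by column name).
means = small.mean()

# Step 4: recentered series, computed as column operations.
dx = small["lotsize"] - means["lotsize"]
dy = small["price"] - means["price"]

# Step 5: correlation coefficient from the recentered series only.
r = (dx * dy).sum() / ((dx ** 2).sum() * (dy ** 2).sum()) ** 0.5

# Step 6: compare against the matching entry of the correlation matrix.
r_pandas = small.corr().loc["lotsize", "price"]
print(r, r_pandas)
```

The two printed values should agree up to floating-point rounding, since .corr() computes the Pearson correlation by default.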

2. In a list A an inversion occurs when A[i] > A[i+1] for some index i. Write a function that takes a list of integers and an index j as its parameters and returns the smallest index i ≥ j at which the list has an inversion. If the list has no inversions at or after index j, your function should return the length of the list.

Use this function to implement a second function that can be used to fully sort a list of almost sorted integers. A list of integers is almost sorted if it contains only a few inversions. For example, the list

[1,2,4,3,5,6,8,7,9,10,11,13,12,15,14]

is an almost sorted list.
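One possible way to approach this problem is sketched below. The function names first_inversion and sort_almost_sorted are hypothetical, and the strategy of swapping the two elements at each inversion and then resuming the search one position earlier is an assumption about what a solution might look like, not a prescribed method.

```python
def first_inversion(a, j):
    """Return the smallest index i >= j with a[i] > a[i+1], or len(a) if none."""
    for i in range(max(j, 0), len(a) - 1):
        if a[i] > a[i + 1]:
            return i
    return len(a)


def sort_almost_sorted(a):
    """Return a sorted copy of a by repairing one inversion at a time."""
    result = list(a)
    i = first_inversion(result, 0)
    while i < len(result):
        # Swap the two out-of-order neighbors.
        result[i], result[i + 1] = result[i + 1], result[i]
        # The swap may create a new inversion at i - 1, so resume one step back.
        i = first_inversion(result, max(i - 1, 0))
    return result


almost = [1, 2, 4, 3, 5, 6, 8, 7, 9, 10, 11, 13, 12, 15, 14]
print(first_inversion(almost, 0))
print(sort_almost_sorted(almost))
```

Each swap removes exactly one out-of-order pair, so the amount of work grows with the number of inversions, which is why this approach suits lists that are already nearly in order.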

3. In the package of data files I provided you will find a file named survey.csv. Our most basic method for reading text from a CSV file looks something like this:

def cleanData(line):
    # Put some code here to clean up the line after we split it.
    return line  # placeholder: return the list of fields unchanged

def readData(fileName):
    result = []
    with open(fileName) as f:
        for line in f.readlines():
            result.append(cleanData(line.split(',')))
    return result

This sort of code works fine for reading a wide range of CSV files, but it will fail when we try to use it with the survey file. The culprit is the use of the split() method with a delimiter of ',': the survey CSV file includes some column headings that look like this:

"Hiphop, Rap","Reggae, Ska","Swing, Jazz","Techno, Trance"

Note the presence of the comma embedded in each of these column headings. This is not a comma that should be used to separate one column from the next in the row. If we simply used split(',') on this line we would end up splitting many of the column headings into two columns because of the embedded commas. To fix this we are going to have to write our own function for splitting lines, and that function will have to explicitly avoid doing splits on commas that appear inside pairs of quote marks.
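To see the failure concretely, the short sketch below applies split(',') to the four headings shown above. The variable names are chosen only for this illustration.

```python
# The four quoted headings from the survey file, as one line of text.
line = '"Hiphop, Rap","Reggae, Ska","Swing, Jazz","Techno, Trance"'

# A naive split treats every comma as a column separator.
pieces = line.split(',')

print(len(pieces))
for piece in pieces:
    print(repr(piece))
```

The line holds four headings, but the naive split produces eight pieces, because each embedded comma cuts a heading in half.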

Write the code for a function splitWell(line) that can take an entire line of text from the CSV file as its input and correctly split it into individual fields. The function should return a list of strings, with each of the column headings represented as a separate string.

Here are some hints on how to proceed. Write a loop that iterates over the characters in the line. Each time you encounter a comma, split off a new piece of the line and append that text to a list you are constructing. Each time you encounter a double quote character, throw a switch that temporarily turns off paying attention to commas. When you encounter the matching double quote, throw the switch again to start paying attention to commas again.

To access an individual character in a line you can use the syntax line[n]. To slice off a portion of a string you can use the array slicing syntax line[start:end], which returns a string containing a copy of the characters from location start to location end-1.
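The hints above can be sketched roughly as follows. This is a minimal illustration rather than a definitive solution. It assumes that the surrounding double quotes should be removed from each field and that fields never contain escaped quote characters, and the test line is made up rather than copied from survey.csv.

```python
def splitWell(line):
    """Split one CSV line on commas that are not inside double quotes."""
    line = line.rstrip("\r\n")   # drop the line ending, if any
    fields = []
    start = 0                    # index where the current field begins
    in_quotes = False            # the "switch": True while inside a quoted field
    for n in range(len(line)):
        ch = line[n]
        if ch == '"':
            in_quotes = not in_quotes
        elif ch == ',' and not in_quotes:
            # Slice off the field that just ended; strip its surrounding quotes.
            fields.append(line[start:n].strip('"'))
            start = n + 1
    # Whatever remains after the last separator is the final field.
    fields.append(line[start:].strip('"'))
    return fields


# Made-up sample line for illustration.
for heading in splitWell('Music,"Hiphop, Rap","Reggae, Ska",Age\n'):
    print(heading)
```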

To confirm that your function is working correctly, have your program print all the column headings in the file, one heading per line.