A generic data reading function

In the next few examples we will read data from text files. In every case the data is arranged as a data series, with a list of data items on each line of the file. The following Python function serves as a generic data reading function that loads the raw data from a text file. It reads the lines of the input file as text strings and uses the string split() method to split each line into a list of strings, one per data item.

def readData(fileName):
    """Generic data reading function:
       reads lines in a text file and
       splits them into lists."""
    data = []
    with open(fileName) as f:
        # Iterating over the file object yields one line at a time.
        for line in f:
            data.append(line.split())
    return data
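
To see what split() produces, here is a small sketch that runs the same line-by-line logic on an in-memory string instead of a file. The sample text and the use of io.StringIO are assumptions for illustration; they stand in for a real input file.

```python
import io

# Assumed sample text standing in for the contents of an input file.
sampleText = "1935  32.1\n1940  30.5\n"

rows = []
for line in io.StringIO(sampleText):
    # split() with no arguments splits on any run of whitespace
    # and discards the trailing newline.
    rows.append(line.split())

print(rows)  # [['1935', '32.1'], ['1940', '30.5']]
```

Note that every item is still a string at this point, which is why a conversion step is needed afterwards.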

The next step will typically be to convert the strings in our data lists into a data format that is appropriate for our particular application. For example, in the first example program today we are going to be working with an input file that shows US farm population data over time. The input file looks like this:

1935  32.1
1940  30.5
1945  24.4
1950  23
1955  19.1
1960  15.6
1965  12.4
1970  9.7
1975  8.9
1980  7.2

The first entry in the data list returned by readData will look like

["1935","32.1"]

I would like to convert that pair of strings into a tuple containing a combination of an integer and a float. Here is a simple data cleaning function that can perform that transformation:

def cleanLine(line):
    """Converts a raw line list into an appropriate data format."""
    return (int(line[0]), float(line[1]))

We can then use this data cleaning function in combination with a list comprehension to construct the list of data values we want to work with:

rawData = readData("farm.txt")
pairs = [cleanLine(line) for line in rawData]
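
One thing to watch for: a blank line in the input file (for example, a trailing empty line) comes back from split() as an empty list, and cleanLine would then fail with an IndexError. Here is a minimal sketch of one way to guard against that, assuming blank lines should simply be ignored. The helper name cleanLines is hypothetical and not part of the original program, and the raw data is typed in by hand.

```python
def cleanLine(line):
    """Converts a raw line list into an (int, float) tuple."""
    return (int(line[0]), float(line[1]))

def cleanLines(rawData):
    """Hypothetical helper: cleans every non-empty line list.
       An empty list is falsy, so 'if line' skips blank lines."""
    return [cleanLine(line) for line in rawData if line]

# Assumed raw data, including the empty list a blank line would produce.
rawData = [["1935", "32.1"], [], ["1940", "30.5"]]
pairs = cleanLines(rawData)
print(pairs)  # [(1935, 32.1), (1940, 30.5)]
```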

First example: linear regression

Here is a point plot showing the data in the input file mentioned above.

The data shows a roughly linear decline in US farm population over the span of time in question. A simple way to model this data for the purpose of making predictions is to do a linear regression of the data.

A least-squares linear regression computes the coefficients of a linear model

y = a + b x

that attempts to most closely match a given data series. The regression coefficients can be computed via the following formulas:

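x̄ = mean(X)

ȳ = mean(Y)

b = Σ (x_i - x̄)(y_i - ȳ) / Σ (x_i - x̄)^2

a = ȳ - b x̄

where both sums run over all of the data points (x_i, y_i).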
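
As a sanity check on these formulas, here is a small sketch that applies them to three made-up points lying exactly on the line y = 1 + 2x. The points are an assumption chosen so the arithmetic can be checked by hand. The slope is the sum of the products of the x and y deviations from their means, divided by the sum of the squared x deviations.

```python
# Made-up points on the line y = 1 + 2x (an assumption for illustration).
points = [(0, 1), (1, 3), (2, 5)]

n = len(points)
xBar = sum(x for x, y in points) / n   # 1.0
yBar = sum(y for x, y in points) / n   # 3.0

# Sums of products of deviations from the means.
sxy = sum((x - xBar) * (y - yBar) for x, y in points)  # 4.0
sxx = sum((x - xBar) ** 2 for x, y in points)          # 2.0

b = sxy / sxx        # slope: 2.0
a = yBar - b * xBar  # intercept: 1.0
print(a, b)          # 1.0 2.0
```

Because the points lie exactly on a line, the formulas recover that line's intercept and slope.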

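Here are the necessary Python functions to do these computations. The means function computes the tuple

(x̄, ȳ)

from a list of (x,y) pairs. The covariance and xVariance functions compute the numerator and denominator of the slope b as plain sums over the data; the usual 1/N factors are omitted because they cancel in the ratio.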

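def means(pairs):
    """Computes the tuple of means (xMean, yMean)
       from a list of (x,y) pairs."""
    xSum = 0
    ySum = 0
    for x,y in pairs:
        xSum += x
        ySum += y
    N = len(pairs)
    return (xSum/N,ySum/N)

def covariance(pairs,m):
    """Sums (x - xMean)*(y - yMean) over all pairs; m is the
       tuple returned by means. Not divided by N."""
    total = 0
    for x,y in pairs:
        total += (x-m[0])*(y-m[1])
    return total

def xVariance(pairs,xMean):
    """Sums (x - xMean)**2 over all pairs. Not divided by N."""
    total = 0
    for x,y in pairs:
        total += (x-xMean)*(x-xMean)
    return total

def regressionCoeffs(pairs):
    """Computes linear regression coefficients (a,b)
       from a list of (x,y) pairs."""
    m = means(pairs)
    # The missing 1/N factors cancel in this ratio.
    beta = covariance(pairs,m)/xVariance(pairs,m[0])
    alpha = m[1]-beta*m[0]
    return (alpha,beta)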

Finally, here is a short program that loads the data, computes the regression line, and then prints a table of populations predicted by the regression line for each year in the data sequence versus the actual population:

rawData = readData("farm.txt")
pairs = [cleanLine(line) for line in rawData]
a,b = regressionCoeffs(pairs)
for x,y in pairs:
    prediction = a+x*b
    print("Year: {:d} Prediction: {:5.2f} Actual: {:5.2f}".format(x,prediction,y))

The output produced by this program is

Year: 1935 Prediction: 31.49 Actual: 32.10
Year: 1940 Prediction: 28.56 Actual: 30.50
Year: 1945 Prediction: 25.62 Actual: 24.40
Year: 1950 Prediction: 22.69 Actual: 23.00
Year: 1955 Prediction: 19.76 Actual: 19.10
Year: 1960 Prediction: 16.82 Actual: 15.60
Year: 1965 Prediction: 13.89 Actual: 12.40
Year: 1970 Prediction: 10.96 Actual:  9.70
Year: 1975 Prediction:  8.02 Actual:  8.90
Year: 1980 Prediction:  5.09 Actual:  7.20

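This looks about right for a linear regression: the predictions stay within about 2.1 of the actual values, with the largest misses at 1940 and 1980.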
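
To quantify how well the line fits, one option is to look at the residuals, the differences between the actual and predicted values. The sketch below is self-contained: it assumes the farm data typed in directly and recomputes the coefficients inline instead of calling regressionCoeffs.

```python
# Farm population data from the input file, typed in directly.
pairs = [(1935, 32.1), (1940, 30.5), (1945, 24.4), (1950, 23.0),
         (1955, 19.1), (1960, 15.6), (1965, 12.4), (1970, 9.7),
         (1975, 8.9), (1980, 7.2)]

n = len(pairs)
xBar = sum(x for x, y in pairs) / n
yBar = sum(y for x, y in pairs) / n
b = (sum((x - xBar) * (y - yBar) for x, y in pairs)
     / sum((x - xBar) ** 2 for x, y in pairs))
a = yBar - b * xBar

# Residual = actual value minus the value predicted by the line.
residuals = [(x, y - (a + b * x)) for x, y in pairs]

# Find the data point with the largest residual in absolute value.
worstYear, worstResidual = max(residuals, key=lambda r: abs(r[1]))

for year, r in residuals:
    print("Year: {:d} Residual: {:5.2f}".format(year, r))
print("Largest miss: {:d} ({:.2f})".format(worstYear, worstResidual))
```

A positive residual means the line underestimates the actual value for that year; a negative one means it overestimates.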