Working with lists

The example program that I constructed in the last lecture featured a set of three data points that we wanted to interpolate with a quadratic polynomial. In that program I started by listing the x and y coordinates of the data points.

# The initial set of data points.
x0 = 1.04
y0 = 2.71
x1 = 1.51
y1 = 2.55
x2 = 1.83
y2 = 2.82

Listing the data points in this way will get more and more unwieldy as we deal with larger data sets. A more convenient way to list this data is to use the Python list construct.

# The initial set of data points.
xs = [1.04, 1.51, 1.83]
ys = [2.71, 2.55, 2.82]

To work with the individual numbers stored in these lists we will use an index notation. To refer to the first number in the list of x values we use the notation xs[0]. Here is the rest of the program rewritten to use the list data with index notation.

# Constructing the coefficients of the polynomial
diff10 = (ys[1]-ys[0])/(xs[1]-xs[0])
diff21 = (ys[2]-ys[1])/(xs[2]-xs[1])
c = ys[0]
b = diff10
a = (diff21-diff10)/(xs[2]-xs[0])

# This is the value of x at which we want to evaluate
x = 1.7

# Compute the value of the polynomial at x using the
# Newton form of the polynomial
p = a*(x-xs[1])*(x-xs[0])+b*(x-xs[0])+c

# Print the result
print('At x = 1.7, y is approximately', p)
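As a sanity check on the index arithmetic, the sketch below rebuilds the coefficients from the two lists and evaluates the Newton form at each of the three data points. An interpolating polynomial should reproduce the y values there, up to rounding. The names p0, p1, and p2 are introduced here for illustration only and are not part of the original program.

```python
# The initial set of data points.
xs = [1.04, 1.51, 1.83]
ys = [2.71, 2.55, 2.82]

# Coefficients of the polynomial, exactly as in the program above.
diff10 = (ys[1]-ys[0])/(xs[1]-xs[0])
diff21 = (ys[2]-ys[1])/(xs[2]-xs[1])
c = ys[0]
b = diff10
a = (diff21-diff10)/(xs[2]-xs[0])

# Evaluate the Newton form at each data point in turn.
p0 = a*(xs[0]-xs[1])*(xs[0]-xs[0]) + b*(xs[0]-xs[0]) + c
p1 = a*(xs[1]-xs[1])*(xs[1]-xs[0]) + b*(xs[1]-xs[0]) + c
p2 = a*(xs[2]-xs[1])*(xs[2]-xs[0]) + b*(xs[2]-xs[0]) + c

# Each pair printed here should agree up to floating point rounding.
print(p0, ys[0])
print(p1, ys[1])
print(p2, ys[2])
```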

Lists and loops

Perhaps the most compelling reason to use the list data format is that lists lend themselves quite naturally to automation through the use of loops. Here is an example: we start by making a list of data values.

data = [1.7,2.3,1.0,1.2,2.4]

Suppose we wanted to sum up these data values. The following snippet of Python code shows how to do this.

sum = 0
for x in data:
    sum += x

This is our first example of a Python for loop. A for loop iterates over each member of a list of data values. For each member of the list the for loop temporarily assigns the member to the variable x. In the body of the loop (the line of code following the colon) we specify what we want to do with the current value. In this example we add the current value of x onto a running total stored in the sum variable. The Python operator += adds the value of x to the current value of sum and stores the result as the new value of the sum variable.
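To see the loop at work, it can help to print the running total on every pass. The following is a small sketch of that idea. It uses the name total instead of sum; that renaming is a choice made for this example so that Python's built-in sum() function is not hidden by a variable of the same name.

```python
data = [1.7, 2.3, 1.0, 1.2, 2.4]

# 'total' plays the role of the sum variable in the text.
total = 0
for x in data:
    total += x   # shorthand for: total = total + x
    print('x =', x, ' running total =', total)

print('Final total:', total)
```

The loop body runs five times, once for each item in data, and the final total is 8.6 up to floating point rounding.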

A loop body can consist of more than one statement. Python uses indentation to specify which statements belong to the loop body. In the next example we want to compute the average of the items in the data list. To do that we will need to both compute the sum of the data items and count how many items are in the list. For that purpose we introduce two variables, sum and count, and make the loop body update both variables for each member of the data list.

sum = 0
count = 0
for x in data:
    sum += x
    count += 1
average = sum/count

To indent lines of code in a Python program you can use spaces or tabs. The most widely used convention for indenting statements is to use four spaces for each level of indentation. Alternatively, you can use a single tab to make one level of indentation. Either approach is fine, although you should not mix the two styles of indentation: the Python interpreter will give you an error message if you try to run a program with inconsistent indentation.
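The sketch below illustrates how indentation decides what belongs to the loop body. The variable names total and running_average are assumptions made for this example. The indented statements run once for every item in the list, while the unindented statement at the end runs only once, after the loop has finished.

```python
data = [1.7, 2.3, 1.0, 1.2, 2.4]

total = 0
count = 0
for x in data:
    total += x
    count += 1
    # Indented: part of the loop body, recomputed for every item.
    running_average = total / count
    print('after', count, 'items the average is', running_average)

# Not indented: runs a single time, after the loop is done.
average = total / count
print('final average:', average)
```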

Visual Studio Code will handle indentation for you automatically. As soon as you type the colon at the end of the first line of the for loop and hit enter, Code will automatically indent the following lines for you. To exit from the indentation you simply press the backspace key at the beginning of a line.

Iterating over a range of indices

One common way to access the elements of a list is to use the list index notation.

For example, here is a list of data values:

list = [12,3,4,5,2]

Each item in a list is associated with a list index. The first item in any list has an index value of 0, the second item has an index value of 1, and so on. You can use the index notation to access individual items in the list. For example, to print the next-to-last item in the list you could use the statement

print(list[3])

Since the index values start at 0, the last index in a list with five items is index 4, which makes the index of the next-to-last item 3.
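Here is a short sketch of the index notation on the same five numbers. The list is named values here rather than list; that is a choice made for this example so that Python's built-in list() function stays available. The variable names first, next_to_last, and last are likewise just illustrative.

```python
values = [12, 3, 4, 5, 2]

first = values[0]          # index 0 is the first item
next_to_last = values[3]   # index 3 is the fourth of five items
last = values[4]           # index 4 is the fifth and final item

print(first, next_to_last, last)

# Asking for values[5] would stop the program with an IndexError,
# because the largest valid index for a five-item list is 4.
```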

An alternative way to iterate over a list of items is to iterate over the indices instead of the list items themselves. To do this, we use the Python range() construct:

for n in range(0,5):
    print(list[n])

The expression range(0,5) effectively sets up a list of index values running from 0 to 4. (The range construct always runs up through the value right before the second number in the range.) The for loop then iterates over that list of index numbers, using the index numbers one at a time to look up numbers in the original list.
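To make the behavior of range() concrete, the sketch below converts two ranges to lists so that their contents can be printed, and then uses a range to add up the items of a list by index. The names values and total are assumptions made for this example; the list is not called list here because the example needs Python's built-in list() function.

```python
values = [12, 3, 4, 5, 2]

# list() turns a range into an ordinary list we can look at.
print(list(range(0, 5)))   # prints [0, 1, 2, 3, 4]
print(list(range(2, 5)))   # prints [2, 3, 4]

# Use the indices 0 through 4 to visit every item.
total = 0
for n in range(0, 5):
    total += values[n]
print(total)               # prints 26
```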

Example - linear regression

In linear regression we seek the linear function that is the best fit to a set of data points.

The data points usually do not fall on a line, but it will be possible to construct a line that comes closest to fitting the data points.

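Here are the necessary formulas for a linear regression. Given a set of $n$ points with coordinates $x_i$ and $y_i$ we compute the mean values

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$$

the covariance of the two data sequences and the variance of the x sequence

$$\operatorname{cov}(x,y) = \sum_{i=1}^{n} (x_i-\bar{x})(y_i-\bar{y}), \qquad \operatorname{var}(x) = \sum_{i=1}^{n} (x_i-\bar{x})^2$$

(Both sums are written without the usual factor of $1/n$, which cancels in the ratio below.)

From these we construct the coefficients of the regression line

$$\beta = \frac{\operatorname{cov}(x,y)}{\operatorname{var}(x)}$$

$$\alpha = \bar{y} - \beta \bar{x}$$

The best fit regression line has the equation

$$y = \alpha + \beta x$$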

Here now is a Python program that constructs the regression line for the data set shown in the picture above.

# The data set for this example is farm population in the United States over several decades.
x = [1935,1940,1945,1950,1955,1960,1965,1970,1975,1980]
y = [32.1,30.5,24.4,23,19.1,15.6,12.4,9.7,8.9,7.2]

# Compute averages of the two data sets
count = 0
xSum = 0
for year in x:
    count += 1
    xSum += year
ySum = 0
for pop in y:
    ySum += pop
xBar = xSum/count
yBar = ySum/count

# Compute the variance and the covariance
covariance = 0
variance = 0
for i in range(0,count):
    covariance += (x[i]-xBar)*(y[i]-yBar)
    variance += (x[i]-xBar)**2

# Compute the coefficients of the regression line
beta = covariance/variance
alpha = yBar - beta * xBar

# Use the regression line to compute an estimate for y
# when x is 1955
estimate = alpha + beta * 1955
print("The farm population in 1955 is approximately", estimate)
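One way to gain confidence in a program like this is to run the same steps on a tiny data set whose answer can be worked out by hand. The sketch below does that with three made-up points, (0,1), (1,3), and (2,5), which all lie on the line y = 1 + 2x. These points are an assumption chosen for illustration and have nothing to do with the farm data. If the steps are right, the program should recover alpha = 1 and beta = 2.

```python
# Three made-up points that lie exactly on y = 1 + 2x.
x = [0, 1, 2]
y = [1, 3, 5]

# Compute averages of the two data sets
count = 0
xSum = 0
for value in x:
    count += 1
    xSum += value
ySum = 0
for value in y:
    ySum += value
xBar = xSum/count   # 3/3 = 1.0
yBar = ySum/count   # 9/3 = 3.0

# Compute the variance and the covariance
covariance = 0
variance = 0
for i in range(0, count):
    covariance += (x[i]-xBar)*(y[i]-yBar)   # 2 + 0 + 2 = 4
    variance += (x[i]-xBar)**2              # 1 + 0 + 1 = 2

# Compute the coefficients of the regression line
beta = covariance/variance    # 4/2 = 2.0
alpha = yBar - beta * xBar    # 3 - 2*1 = 1.0
print('alpha =', alpha, 'beta =', beta)
```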

Iterating over a range of index values

One part of this program introduces another new idea in loop construction. The formula for the covariance

$$\operatorname{cov}(x,y) = \sum_{i=1}^{n} (x_i-\bar{x})(y_i-\bar{y})$$

requires us to iterate over both the x list and the y list at the same time. This means that the for loop construct that we have been using

for year in x:

will not work here, because it visits the items of only one list. The solution to this problem is to set up a loop that iterates over all possible values of the index variable i instead:

covariance = 0
for i in range(0,count):
    covariance += (x[i]-xBar)*(y[i]-yBar)

Using this loop structure in combination with the list index notation gets us what we need.

Python has a len() function that can be applied to lists or strings to determine their lengths. This is useful for writing this style of loop.

covariance = 0
for i in range(0,len(x)):
    covariance += (x[i]-xBar)*(y[i]-yBar)
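As a brief illustration of len(), the sketch below applies it to a list and to a string, and then uses it to drive an index loop over two lists of equal length. The shortened lists and the names n_items and n_letters are assumptions made for this example.

```python
x = [1935, 1940, 1945]
y = [32.1, 30.5, 24.4]

n_items = len(x)         # number of items in the list: 3
n_letters = len('farm')  # number of characters in the string: 4
print(n_items, n_letters)

# The same index i looks up matching entries in both lists.
for i in range(0, len(x)):
    print(i, x[i], y[i])
```

Because the loop bound comes from len(x), the loop keeps working without changes if more data points are added to both lists.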

Printing a table

Something we will want to start doing when we write programs that work with larger lists of data is to expand the set of techniques we have available to print that data.

To print data in the form of a table we will want to construct a loop that prints the individual rows of the table. Furthermore, we will want to exert careful control over the text we print on each row to make sure that the columns of the table line up nicely. Python solves this problem through the use of the string format() method. This method acts on a specially prepared string that contains formatting specifications, which act as a series of placeholders with associated formatting information. The parameters that you pass to the format method are the values you want to place in the placeholders. Here is an example: suppose we wanted to print the data from the x and y lists in the example above. This is how we would use a loop and a format string to manage that.

x = [1935,1940,1945,1950,1955,1960,1965,1970,1975,1980]
y = [32.1,30.5,24.4,23,19.1,15.6,12.4,9.7,8.9,7.2]
print(' Year Population')
for i in range(0,len(x)):
    print('{:5d}    {:>4.1f}'.format(x[i],y[i]))

Each placeholder is contained inside a pair of curly braces. Inside the braces, a colon introduces the format specification. When you set up a placeholder to print a number you need to set a format type for the number: d stands for decimal integer and f stands for floating point number. If you want your number printed using a certain number of characters, you place a width specifier in front of the format type. In the example above the width specifier 5 I used for the year specifies that the year value should be printed using at least 5 characters. If the number you are printing takes up fewer than 5 characters it will be padded with extra spaces on the left to bring the character count up to 5, since numbers are right justified by default. The specifier 4.1 I used for the population figure specifies that this number should be printed using at least 4 characters, counting the decimal point, with 1 digit to the right of the decimal point. The > character used with the f specifier indicates that the number should be printed right justified: if the number takes up fewer than 4 characters it is padded with additional spaces on the left to bring the character count up to 4.
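The effect of a format specification is easiest to see by trying a few values. The sketch below formats several numbers and prints each result with repr(), which shows the surrounding quotes so that the padding spaces are visible. The variable names are assumptions made for this example, and the < alignment character, which is not used in the table program, is included only for contrast with >.

```python
year = '{:5d}'.format(42)            # integers are right justified by default
left = '{:<5d}'.format(42)           # < asks for left justification instead
pop = '{:>4.1f}'.format(7.2)         # one decimal place, padded on the left
whole = '{:>4.1f}'.format(23)        # an integer value still gets one decimal
row = '{:5d}    {:>4.1f}'.format(1980, 7.2)

print(repr(year))    # prints '   42'
print(repr(left))    # prints '42   '
print(repr(pop))     # prints ' 7.2'
print(repr(whole))   # prints '23.0'
print(repr(row))     # prints ' 1980     7.2'
```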

You can find more documentation on format specifiers in the "Format Specification Mini-Language" section of the Python documentation for the string module.

Programming exercise

Once we have constructed a regression line, we can use it to make predictions of y values for different x values.

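$$y_{\text{predicted}} = \alpha + \beta x$$

The residual associated with a predicted value is the difference between the actual value $y_i$ and the value predicted for the corresponding $x_i$.

$$\text{residual}_i = y_i - (\alpha + \beta x_i)$$

Another way to judge the quality of a prediction is to compute the sum of the squares of the errors (SSE) for a sequence of predictions:

$$\text{SSE} = \sum_{i=1}^{n} \text{residual}_i^2 = \sum_{i=1}^{n} \bigl(y_i - (\alpha + \beta x_i)\bigr)^2$$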

Modify the program above to print both a table of residuals and the SSE for the computed regression line.