Some final features of the Python language

These lecture notes will cover three additional features of the Python language. These did not come up in the course of the term, but I still think that these features are significant enough to cover in this course.

Generator functions

Suppose you have some data stored in a list. Here are two common examples of loops that you can use to iterate over the list.

for item in data_list:
  # do something with item

for i in range(0,N):
  # do something with data_list[i]

The intention of the first loop is clear: simply iterate over all the data items in the list.

The second example uses a range expression to help us iterate over a range of index values. You may guess from the structure of the range expression that range() is a function. You may also guess that what range() does is to construct a list of index values [0,1,2,…,N-1] that the loop then iterates over.

This is not what range() does. Building that list would take space proportional to N, so that approach is not optimal. (In Python 3 the built-in range() returns a lazy range object that computes each index value only when it is needed.) We can get the same space-saving behavior from a peculiar sort of function called a generator function. Here is what the code for a range()-like generator function might look like:

def range(start,end):
    n = start
    while n < end:
        yield n
        n += 1

In place of a return statement, generator functions have yield statements. Calling a generator function does not run its body right away; instead, the call returns a generator object. Each time the next value is requested, the body runs until it encounters a yield statement. The function then pauses and hands back the value in the yield statement. On the next request, the function picks back up at the statement after the yield and again runs until it either hits a yield or reaches the end of the function body.

The Python for loop is designed to work in combination with generator functions to do iteration. The loop will repeatedly ask the generator for its next value until the generator stops yielding values. The advantage to this arrangement is that it is space efficient. Since we don't need to construct a list of index values to iterate over, we can save the space of that list.
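
To make the pause-and-resume behavior concrete, here is a minimal sketch that drives a generator by hand with the built-in next() function. The name count_up is made up for this illustration so that it does not hide the built-in range().

```python
def count_up(start, end):
    """Hypothetical generator used only for this illustration."""
    n = start
    while n < end:
        yield n
        n += 1

# Calling the generator function does not run its body yet;
# it returns a generator object.
gen = count_up(0, 3)

# Each call to next() resumes the body until the next yield.
first = next(gen)
second = next(gen)
third = next(gen)

# The generator is now exhausted. With a default argument, next()
# returns that default; without one it would raise StopIteration.
leftover = next(gen, None)

# A for loop makes the same requests for us and stops automatically
# when the generator runs out of values.
collected = []
for value in count_up(0, 3):
    collected.append(value)

print(first, second, third, leftover, collected)
```

At no point does a list of all the index values exist unless we build one ourselves, as the last loop does.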

There are a number of applications where generator functions make sense. One common application shows up in the context of reading data from a file. In past examples we have handled data coming in from a text file by reading all of the data into a list and then working with that list. Once again, with generator functions we can avoid having to make that list. Here is some code to show how this might work:

def cleanLine(line):
    """Converts a raw line list into an appropriate data format."""
    return (int(line[0]), float(line[1]))


def readData(fileName):
    """Generic data reading function. This will read individual lines,
    format their contents as desired, and return one line's worth of
    data at a time."""
    with open(fileName) as f:
        # Iterate over the file object directly so that lines are read
        # one at a time instead of being loaded into a list first.
        for line in f:
            yield cleanLine(line.split())

This generator function is designed to hand you one line's worth of data at a time. Using this generator function you can write code like this:

for year, pop in readData("data.txt"):
    # do something with year and pop
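
Here is a self-contained sketch of that pattern that you can run without having data.txt on hand. It assumes each line holds an integer year and a real-valued population separated by whitespace; the sample values and the temporary file are made up for the illustration.

```python
import os
import tempfile

def cleanLine(line):
    """Converts a raw line list into an appropriate data format."""
    return (int(line[0]), float(line[1]))

def readData(fileName):
    """Yields one line's worth of cleaned data at a time."""
    with open(fileName) as f:
        for line in f:  # the file object hands over one line at a time
            yield cleanLine(line.split())

# Write a small sample file (made-up values, for illustration only).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("1990 5.3\n1995 5.7\n2000 6.1\n")
    path = tmp.name

years = []
total = 0.0
for year, pop in readData(path):
    years.append(year)
    total += pop

os.remove(path)  # clean up the temporary file
print(years, total)
```

The loop only ever holds one line's worth of data from the file at a time, which is the point of writing readData() as a generator.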

A more advanced application is a generator function meant to capture some complex pattern of terms in a sequence. For example, in the Simpson's rule example we saw a few lectures back we needed to use some tricky logic to capture the pattern of 1,4,2,4,2,…,4,1 coefficients in the sequence of terms in the sum. Here is a program for doing a Simpson's rule calculation that uses a generator to handle the logic needed to make the coefficients in the sum:

import math

# This is the function we will work with for this example
def f(x):
    return math.sqrt(4.0 - x*x)

# Generator that generates both the coefficients and the x values
# needed in Simpson's rule.
def simpsonTerms(a,b,N):
    yield (1,a)
    h = (b-a)/N
    for i in range(1,N):
        yield (4,a+((i-1)*(b-a))/N+h/2)
        yield (2,a+(i*(b-a))/N)
    yield (4,b-h/2)
    yield (1,b)

# Compute an approximation to the integral
# of f(x) from a to b by using Simpson's rule
# with steps of size h = (b-a)/N
def SimpsonArea(a,b,N):
    total = 0.0  # named total so we do not hide the built-in sum()
    for coeff,x in simpsonTerms(a,b,N):
        total = total + coeff*f(x)
    return total*(b-a)/(6*N)

print("      N  error")
for j in range(3,20,2):
    estimate = SimpsonArea(0.0,2.0,2**j)
    error = math.fabs(math.pi - estimate)
    print("{:>7d}  {:g}".format(2**j,error))
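
If you want to see the pattern the generator produces, one quick check is to collect its output for a small N. The sketch below repeats simpsonTerms() so that it runs on its own; the interval [0,3] and N = 3 are arbitrary choices made for the illustration.

```python
def simpsonTerms(a, b, N):
    # Same generator as in the program above, repeated here so that
    # this snippet can be run by itself.
    yield (1, a)
    h = (b - a) / N
    for i in range(1, N):
        yield (4, a + ((i - 1) * (b - a)) / N + h / 2)
        yield (2, a + (i * (b - a)) / N)
    yield (4, b - h / 2)
    yield (1, b)

# Collect every (coefficient, x) pair for three steps on [0, 3].
terms = list(simpsonTerms(0.0, 3.0, 3))
coeffs = [c for c, x in terms]
points = [x for c, x in terms]

print(coeffs)  # [1, 4, 2, 4, 2, 4, 1]
print(points)  # [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
```

The coefficient 4 lands on the midpoint of each step and the coefficient 2 lands on the interior step boundaries, with 1 at the two ends.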

Assertions

One of the few things about the Python language that I don't like is the absence of type information. This problem is most pronounced when we pass parameters to functions. For a function to work correctly, callers need to pass it the type of data the author intended when they wrote the function. For example, if a function f(x,y) expects x to be a real number and y to be a tuple with three items, nothing stops us from writing f(1.5,2.5) in our code. When the code runs, the program will most likely generate an error somewhere inside the code for the f(x,y) function. Only then will we realize that we passed the wrong parameters to the function.

There are a number of things that function authors can do to prevent these sorts of problems. One thing they can do is to provide a doc string or other documentation that documents their expectations for the parameters. Another thing the author can do is to put defensive if statements at the start of the body of the function that check to make sure that the incoming parameters have the right type and other details. For example, the author of our imaginary f(x,y) function can do this:

def f(x,y):
  if not isinstance(x,float):
    print("Error! x should be a float.")
  elif not isinstance(y,tuple):
    print("Error! y should be a tuple.")
  elif len(y) != 3:
    print("Error! y should have three elements.")
  else:
    # Go on with the rest of the code
    pass

The Python isinstance() function gives us a way to check the type of some data item. The if statements here use that test to confirm that the incoming data has the right types.

This is not an ideal solution: although it will alert you to the fact that you passed the wrong parameters to f, it won't stop your program from running past the point where you called f incorrectly. That is a problem, because if f can't do what you wanted it to do, it is quite likely that your own code will break shortly after the failed call to f.

The correct approach in this case is to force the program to both print an error message and stop running as soon as you detect the problem. One way to accomplish this in Python is to use the Python assert construct:

def f(x,y):
  assert isinstance(x,float),"x in f(x,y) should be a float."
  assert isinstance(y,tuple), "y in f(x,y) should be a tuple."
  assert len(y) == 3, "y in f(x,y) should have three elements."
  # Go on with the rest of the code

An assert statement contains a test and an optional string message. If the test fails, the assert will stop the program and print an error message containing the message along with information about where the program was when the assert failed.

For those of you who have seen other programming languages that make use of exceptions, you may be interested to know that what assert does is to raise an exception (an AssertionError) when its test fails.
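
Here is a small sketch of what a failed assert looks like in practice. The return statement in f is a made-up body for illustration only, and the snippet uses the try construct covered in the next section so that the failed assert does not stop the demonstration.

```python
def f(x, y):
    assert isinstance(x, float), "x in f(x,y) should be a float."
    assert isinstance(y, tuple), "y in f(x,y) should be a tuple."
    assert len(y) == 3, "y in f(x,y) should have three elements."
    return x * sum(y)  # hypothetical body, for illustration only

# A correct call passes all three checks.
good = f(1.5, (1, 2, 3))

# The incorrect call from the discussion above: y is a float, not a tuple.
try:
    f(1.5, 2.5)
    message = None
except AssertionError as err:
    message = str(err)

print(good)
print(message)
```

One caveat worth knowing: Python skips assert statements entirely when it is run with the -O option, so asserts are best treated as a debugging aid rather than as the only line of defense.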

The try construct

Python makes use of a mechanism called the exception mechanism to handle fatal errors in programs. Throughout this course we have used the information provided by exceptions to debug our programs. One problem with programming this way is that when something goes wrong our program stops completely and we have to fix the problem.

To give programmers more options in how to respond to these sorts of errors Python offers the try construct. The general form of the try construct is

try:
  # Code that may possibly generate an error
except:
  # Code that runs only if an error took place

Here is an example of how this can be useful. In the plotting assignment with populations and GDP values you have to contend with missing data. Here is one possible solution to that assignment that makes use of the try construct to recover gracefully from errors caused by occasional missing data:

import matplotlib.pyplot as plt
import pandas as pd
import json

# Load the data into a list.
filename = 'data/population_data.json'
with open(filename) as f:
    pop_data = json.load(f)
# Load the data frame
df = pd.read_csv('data/GDP.csv')
df = df.set_index('Country Code')

def findPopulation(code, year):
    for entry in pop_data:
        if entry['Country Code'] == code and entry['Year'] == str(year):
            return float(entry['Value'])
    return 0


g = []
p = []
for entry in pop_data:
    if entry['Year'] == '1995':
        code = entry['Country Code']
        try:
            p95 = float(entry['Value'])
            p05 = findPopulation(code,2005)
            gdp95 = df.loc[code,'1995']
            gdp05 = df.loc[code,'2005']
            gpc95 = gdp95/p95
            gpc05 = gdp05/p05
            gdp_growth = (gpc05-gpc95)/gpc95
            pop_growth = (p05-p95)/p95
            g.append(gdp_growth)
            p.append(pop_growth)
        except:
            print('Missing data for '+code)

plt.plot(p,g, 'bo')
plt.show()

If you look at the code in the for loop at the bottom of the program you will see a try construct. The code that follows the try may fail with an exception if one of two things happens. If population data is missing for a particular country, the findPopulation() function is designed to return 0. This will very shortly lead to a division by 0 exception when we go to compute the GDP per capita. The second thing that can go wrong here is that the data frame may be missing an entry for a particular country code. When that happens the loc[] expressions will generate a KeyError exception.

If anything goes wrong in the code after the try, the program will stop running the code after the try and jump to the code that follows the except. This will result in the program printing an error message (and also not appending anything to g or p for that particular country), but then going on to process the rest of the countries. This is a graceful and appropriate way to recover from problems caused by missing data.
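
A try construct can also name the kinds of exceptions it is prepared to handle, which lets the program respond differently to each kind of problem. The sketch below shows this for the two failure modes just described. It is only an illustration: it uses plain dictionaries with made-up country codes and values in place of the assignment's JSON file and data frame.

```python
# Made-up data for illustration; not taken from the assignment's files.
gdp = {"AAA": 200.0, "BBB": 300.0}  # no GDP entry for "CCC"
population = {"AAA": 10.0, "BBB": 0.0, "CCC": 5.0}

per_capita = {}
problems = []
for code in ["AAA", "BBB", "CCC"]:
    try:
        per_capita[code] = gdp[code] / population[code]
    except ZeroDivisionError:
        # The population was recorded as 0.
        problems.append((code, "zero population"))
    except KeyError:
        # The country code was not found in the gdp dictionary.
        problems.append((code, "no GDP entry"))

print(per_capita)
print(problems)
```

Naming the exception types has a second benefit: an error you did not anticipate, such as a misspelled variable name, will still stop the program instead of being silently reported as missing data.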