Additional numpy examples

In the last lecture you saw some of the basics of numpy. In this lecture I am going to show a series of extended examples of applications of numpy. Some of these examples are intended to be run as programs, while others will take the form of Jupyter notebooks.

Example program: constructing a basis

One of the things you can represent with the numpy array type is a vector. Our first application today consists of manipulating sets of vectors.

An orthogonal basis for the vector space ℝⁿ is a set of n nonzero, mutually perpendicular vectors. The coordinates of a vector v with respect to a basis {x1, x2, …, xn} are a list of numbers {c1, c2, …, cn} such that

v = c1 x1 + c2 x2 + ⋯ + cn xn

In the next example we are going to construct a basis for ℝ³ and then use that basis to compute the coordinates of a target vector.

The example comes from computer graphics. In a computer graphics application we typically work with two different coordinate systems. The world coordinate system uses a standard basis located at the origin. The basis is composed of the three standard coordinate vectors e1 = (1, 0, 0), e2 = (0, 1, 0), and e3 = (0, 0, 1).

In computer graphics we also use a second coordinate system, called eye coordinates. This represents the coordinate system used by an observer who sits at a location in space observing the objects in the world.

The eye coordinate system is usually specified in the form of some information about the observer.

  1. The observer is located at an eye location, peye, which is a point expressed in world coordinates.
  2. The observer is looking at a look point, plook, which is also expressed in world coordinates.
  3. We use these two points to construct a look vector, vlook = plook - peye.
  4. Normalizing and negating the look vector produces the z axis for the eye coordinate frame.
  5. We also specify an up vector for the observer. This is used to orient the observer in space. The y axis for the eye coordinate frame is a vector that lies in the plane formed by the eye z direction and the up vector, but is perpendicular to the eye z direction and has length one.
  6. Taking the cross product of the y and z vectors produces the x vector.

Here now is a summary of the mathematical steps needed to compute the three coordinate vectors for the eye coordinate frame.

vlook = plook - peye

z = -vlook / || vlook ||

t = up - (up · z) z

y = t / || t ||

x = y × z

Here the notation || v || stands for the norm of the vector v. Once we have constructed the vectors that make up the eye coordinate frame, we gather them as columns into a matrix A = [x y z]. We can then express any point p in world coordinates in terms of eye coordinates via the following mathematical steps:

v = p - peye

Solve A c = v for c

Here now is the Python program that does all of these steps. For the most part the math above translates pretty directly into code.

import numpy as np

# Points and vectors that serve as the input data
# for the computation
p_eye = np.array([2,2,4])   # The viewer's location
p_look = np.array([-1,0,0]) # The point they are looking at
up = np.array([0,1,0])    # Their up direction
pt = np.array([-2,1,1])   # The location of a point in the scene

# Compute the three vectors that make up the eye coordinate frame.
v_look = p_look - p_eye
z = -v_look/np.linalg.norm(v_look)
t = up - np.dot(up,z)*z
y = t/np.linalg.norm(t)
x = np.cross(y,z)

# Compute the coordinates of pt with respect to this new frame.
A = np.transpose(np.array([x,y,z]))
print(A)

c = np.linalg.solve(A,pt-p_eye)
print(c)

One statement in the code requires some additional explanation. In the mathematical recipe above we had a step where we gathered together the basis vectors for the new coordinate system to make a matrix.

If we tried translating this into code as

A = np.array([x,y,z])

the result would not be correct, since numpy will interpret each of the vectors x, y, and z as rows in the matrix A. The fix for this is to transpose the matrix so that what were rows of the matrix become columns.

A = np.transpose(np.array([x,y,z]))
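An equivalent way to build the matrix with the vectors as columns is numpy's column_stack function, which stacks one-dimensional arrays side by side as columns:

A = np.column_stack([x,y,z])

Either form produces the same matrix A.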

Programming assignment

Given two vectors u and v in ℝⁿ we can compute the portion of v that lies in the direction of u by the following algorithm: we start by normalizing u by dividing it by its own length, u = u / || u ||,

and then we compute the dot product u · v. The part of v that does not lie in the direction of u is

w = v - (u · v) u

w is then guaranteed to be perpendicular to u.
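Here is a minimal numpy sketch of this basic step; the two input vectors are made-up examples:

import numpy as np

u = np.array([3.0,4.0,0.0])   # Example input vectors
v = np.array([1.0,2.0,2.0])

u = u/np.linalg.norm(u)       # Normalize u
w = v - np.dot(u,v)*u         # Remove the part of v that lies along u
print(np.dot(u,w))            # Prints (approximately) 0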

This basic step forms the basis for an algorithm called the Gram-Schmidt algorithm. In this algorithm we start with a collection of vectors

v1, v2, …, vk

in ℝⁿ. We then seek to construct a new set of vectors

u1, u2, …, uk

such that the u vectors all have length 1 and are all perpendicular to each other. The u vectors are all constructed by forming appropriate linear combinations of the v vectors.

Here are the first few steps of the algorithm.

The first u vector is simply the normalized version of the first v vector: u1 = v1 / || v1 ||.

The second u vector is computed by first computing the part of v2 that does not lie in the direction of u1,

w2 = v2 - (v2 · u1) u1

and then normalizing that vector: u2 = w2 / || w2 ||.

Likewise, here is the procedure for computing u3: first form

w3 = v3 - (v3 · u1) u1 - (v3 · u2) u2

and then normalize to get u3 = w3 / || w3 ||. Each later u vector is constructed the same way, by subtracting off the components along every previously constructed u vector and normalizing the result.
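To give you a feel for how these steps translate into numpy, here is a sketch of the general step applied to some randomly generated vectors. This is only a sketch: your program should read its vectors from the file instead.

import numpy as np

rng = np.random.default_rng(0)
vs = rng.random((6,6))          # Six made-up vectors in R^6

us = []
for v in vs:
    w = v.copy()
    for u in us:
        w -= np.dot(v,u)*u      # Remove the component of v along u
    us.append(w/np.linalg.norm(w))

# Check: the matrix of dot products should be the identity
U = np.array(us)
print(np.round(U @ U.T,10))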

Here is a text file containing a set of vectors. Write a Python program that reads the six vectors from the file and constructs a set of six mutually perpendicular vectors from them using the algorithm outlined above. You should use numpy vector operations to carry out each step in the algorithm.

More on vector functions

As we saw in the lecture on plotting, one of the neat tricks that numpy can do is vector function computations. In this technique we construct an ordinary looking function, and then pass a numpy array as a parameter to that function. The function will automatically return a new array whose entries are the result of applying the function to each entry of the original array. When combined with aggregation functions, this technique makes it possible to do sophisticated computations entirely without loops.
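For example, here is a small made-up function applied to an array, followed by an aggregation:

import numpy as np

def g(x):
    return x*x + 1

a = np.array([1.0,2.0,3.0])
print(g(a))        # Prints [ 2.  5. 10.]
print(g(a).sum())  # Prints 17.0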

Here is an example that makes use of both vector function evaluation and aggregation functions.

At the end of Calculus I you were introduced to the definite integral of a function f(x).

This quantity can be defined geometrically as an area: the area between the function f(x) and the x-axis over the range of x values a ≤ x ≤ b. As you saw back in Calculus I, you can use geometric techniques to estimate this area. The simplest geometric technique involves dividing the interval from a to b into a mesh of N + 1 equally spaced points xi = a + h i, where the step size h is equal to (b-a)/N.

This mesh of points divides the interval [a,b] into subintervals [xi, xi+1] for i = 0 to N-1. Over each of those subintervals we can place a rectangle that approximates the area underneath f(x) over the subinterval [xi, xi+1]. One way to get a halfway decent approximating rectangle is to give the rectangle that sits over the interval [xi, xi+1] a height f(xi). The drawing below illustrates this set-up.

By summing the areas of all of these approximating rectangles we can get a crude approximation for the area we want.
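In code this rectangle sum is a short numpy computation. Here is a sketch, assuming numpy has been imported as np and a vectorized function f like the one in the program below:

# Left endpoint rectangle estimate for the area under f over [a,b]
def rectangleArea(a,b,N):
    h = (b - a)/N
    x = np.arange(a,b,h)   # The left endpoints x0 through x(N-1)
    return h*f(x).sum()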

Since rectangles do not do a very good job of approximating an area under a curve we need a better method. One obvious improvement is to replace rectangles with trapezoids. Over each subinterval [xi , xi+1] we place a trapezoid whose left height is given by f(xi) and whose right height is given by f(xi+1). The picture below illustrates that the trapezoid does a better job of matching the area under the curve than the rectangle did.

The area of one such trapezoid is h (f(xi) + f(xi+1))/2. If we sum up the areas of all of the trapezoids under the curve we get the trapezoid rule estimate for an integral:

h [ f(x0)/2 + f(x1) + f(x2) + ⋯ + f(xN-1) + f(xN)/2 ]

Note that the two endpoint values f(x0) = f(a) and f(xN) = f(b) are each weighted by 1/2, while each interior point is counted once; this is exactly the sum the program below computes.

The Python program below computes this sum for the function

f(x) = √(4 - x²)

over the range from a = 0 to b = 2. The area under this curve is a quarter circle of radius 2, so we know in advance that the correct area is (π · 2²)/4 = π. This will allow us to see how close to the actual answer the trapezoid rule estimate is.

import numpy as np
import math

# This is the function we will work with for this example
def f(x):
    return np.sqrt(4.0 - x*x)

# Compute an approximation to the integral
# of f(x) from a to b by using trapezoids
# with steps of size (b-a)/N
def trapezoidArea(a,b,N):
    h = (b - a)/N
    x = np.arange(a+h,b,h)   # The interior mesh points x1 through x(N-1)

    return h*(f(x).sum()) + h*f(a)/2 + h*f(b)/2

print("      N  error");
for j in range(3,20,2):
    estimate = trapezoidArea(0.0,2.0,2**j)
    error = math.fabs(math.pi - estimate)
    print("{:>7d}  {:g}".format(2**j,error))

Here is the output generated by this program:

      N  error
      8  0.0517735
     32  0.00649023
    128  0.000811861
    512  0.000101501
   2048  1.26882e-05
   8192  1.58604e-06
  32768  1.98256e-07
 131072  2.4782e-08
 524288  3.09775e-09

Not only does the method produce a good estimate, but the use of numpy vector function computation has the additional side effect of making the code very fast. The same computation written with conventional Python loops would be noticeably slower.
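For comparison, here is a sketch of what a loop-based version might look like; timing the two against each other, with the time module for instance, makes the speed difference easy to see:

import math

def f_scalar(x):
    return math.sqrt(4.0 - x*x)

# The same trapezoid sum computed one term at a time
def trapezoidAreaLoop(a,b,N):
    h = (b - a)/N
    total = (f_scalar(a) + f_scalar(b))/2
    for i in range(1,N):
        total += f_scalar(a + i*h)
    return h*total

print(abs(math.pi - trapezoidAreaLoop(0.0,2.0,2**19)))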

The banknotes problem revisited

For our final numpy example I am going to revisit the machine learning banknotes problem. A peculiar result we got from the k-nearest neighbors method in the banknotes example was that the algorithm was able to get all of the test examples right. This is unusual in machine learning applications, so this case bears some further investigation.

We have already seen that numpy and pyplot complement each other very well. I am going to take advantage of that convenient link in this example. What I want to do is to use pyplot to make a visualization of the points in our data set. Hopefully the visualization will shed light on the question of why the k-nearest neighbor algorithm works so well.

The one difficulty in putting together a visualization of our data is the fact that our data points live in a four-dimensional space. Each point in the data set is a list of four numbers. pyplot only contains tools for plotting points in either two or three dimensions, so to make a plot of our data we have to reduce the dimensionality of the data set. A fairly crude way to do this would be to simply ignore the fourth number in each data point and plot the data points in three dimensions. This is fairly easy to do, and I will leave it as a simple exercise for you to try after seeing this example.

A somewhat more effective and sophisticated way to make a three-dimensional plot of a four-dimensional data set is to apply a standard machine learning dimensionality reduction algorithm. The algorithm we are going to use in this example is the principal component analysis, or PCA, algorithm. This algorithm works by constructing an alternative basis for the four-dimensional space that the data points live in. The axes in this alternative basis are selected to align with directions of maximum variability in the data set. The first axis is the direction along which the data set has the highest variability, and successive axes are chosen to be directions of maximum variability perpendicular to the already selected axes. The final axis chosen in this method is the direction with the least variability: since this is the least interesting of the four axes, we can discard this final coordinate when we reduce our points from four dimensions to three. The code that I will be using to do the PCA method comes from this page on the towardsdatascience.com site:

def pca(X):
  # Data matrix X, assumes 0-centered
  n, m = X.shape
  assert np.allclose(X.mean(axis=0), np.zeros(m))
  # Compute covariance matrix
  C = np.dot(X.T, X) / (n-1)
  # Eigen decomposition. Note that np.linalg.eig does not order its
  # results, so we sort the eigenvectors by decreasing eigenvalue to
  # put the directions of maximum variability first.
  eigen_vals, eigen_vecs = np.linalg.eig(C)
  order = np.argsort(eigen_vals)[::-1]
  eigen_vecs = eigen_vecs[:, order]
  # Project X onto PC space
  X_pca = np.dot(X, eigen_vecs)
  return X_pca

This code assumes that you have stored your data set into a numpy array, where each row in the array represents one of your data points. This function returns a new numpy array containing the coordinates of each data point with respect to the new basis computed by the PCA algorithm. A further requirement of this function is that the original data be zero-centered, which means that we have to compute the mean value for each column in the data set and then subtract that mean value from each column in the data set.

Here is the code that reads both the training and test sets from the text files. This code organizes the full data set into a list of data points, converts that list into a numpy array X, zero centers X by subtracting the mean from each column, and then calls the pca() function.

points = []
classes = []

with open("training.txt") as f:
    for line in f.readlines():
        parts = line.split()
        points.append([float(parts[0]),float(parts[1]),float(parts[2]),float(parts[3])])
        classes.append(int(parts[4]))

with open("test.txt") as f:
    for line in f.readlines():
        parts = line.split()
        points.append([float(parts[0]),float(parts[1]),float(parts[2]),float(parts[3])])
        classes.append(int(parts[4])+2)

X = np.array(points)

X -= np.mean(X,axis=0)
D = pca(X)

The result of applying the PCA algorithm is a new array of data points D giving the coordinates of each of our original data points in the new PCA basis. This is where we get to do the reduction from four dimensions to three. We use some simple numpy slicing operations to carve out the first three columns of the array to make a list of three-dimensional data points to plot.

xs = D[:,0]
ys = D[:,1]
zs = D[:,2]
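Equivalently, a single slice can grab all three columns at once:

D3 = D[:,0:3]   # All rows, first three columns

Splitting the columns into separate coordinate arrays is just a convenience for the plotting calls below.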

The next step is to segregate these data points into sublists based on the classifications of the original points. You may have noticed that in the code above that reads the data points I also collected classification information. Real banknotes in the training set get a classification of 0, fake banknotes in the training set get a classification of 1, real banknotes in the test set get a classification of 2, and fake banknotes in the test set get a classification of 3. The next bit of code reattaches the classifications to the transformed data points and then splits the lists again based on the four classifications:

all_points = list(zip(xs,ys,zs,classes))

red = [] # Points in the training set that are classified 1
green = [] # Points in the training set that are classified 0
yellow = [] # Points in the test set that are classified 1
blue = [] # Points in the test set that are classified 0

for x,y,z,c in all_points:
    if c == 0:
      green.append((x,y,z))
    elif c == 1:
      red.append((x,y,z))
    elif c == 2:
      blue.append((x,y,z))
    else:
      yellow.append((x,y,z))

redX = [x for x,y,z in red]
redY = [y for x,y,z in red]
redZ = [z for x,y,z in red]

greenX = [x for x,y,z in green]
greenY = [y for x,y,z in green]
greenZ = [z for x,y,z in green]

yellowX = [x for x,y,z in yellow]
yellowY = [y for x,y,z in yellow]
yellowZ = [z for x,y,z in yellow]

blueX = [x for x,y,z in blue]
blueY = [y for x,y,z in blue]
blueZ = [z for x,y,z in blue]
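As an aside, these coordinate lists can be built more compactly with numpy itself: converting one of the point lists to an array and transposing it turns its columns into the three coordinate arrays.

redX, redY, redZ = np.array(red).T   # The same works for green, yellow, and blue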

We now have four separate lists of data points to plot.

The last step is to use pyplot to make some three-dimensional plots. Since I am going to be making several plots that view the data set in different ways, I will be using a pyplot figure, which is a container that can hold multiple subplots. pyplot calls subplots axes, so I will use the add_subplot() method to make four different axes.

The type of plot we are going to use for each of the subplots is a pyplot scatter plot, which is used to plot clouds of data points.

import matplotlib.pyplot as plt

fig = plt.figure()
ax = fig.add_subplot(221, projection='3d')

ax.scatter(redX, redY, redZ, c='r', marker='.')
ax.scatter(greenX, greenY, greenZ, c='g', marker='.')

ax = fig.add_subplot(222, projection='3d')

ax.scatter(blueX, blueY, blueZ, c='b', marker='.')
ax.scatter(yellowX, yellowY, yellowZ, c='y', marker='.')

ax = fig.add_subplot(223, projection='3d')

ax.scatter(redX, redY, redZ, c='r', marker='.')
ax.scatter(yellowX, yellowY, yellowZ, c='y', marker='.')

ax = fig.add_subplot(224, projection='3d')

ax.scatter(greenX, greenY, greenZ, c='g', marker='.')
ax.scatter(blueX, blueY, blueZ, c='b', marker='.')

plt.savefig("pca3.png")

Here then is the plot that results.

The two plots in the first row show the training and test sets. What we see in those two plots is that the real and fake banknotes form separate bowl-shaped regions, with the bowl of fake data points nested inside and slightly above the bowl for the real data points. It is this separation of the two regions that accounts for the success of the k-nearest neighbors method.

The two plots in the second row show only the fake banknotes on the left and the real banknotes on the right. In these plots we can see that the data points from the test set merge into the data points from the training set.