The classification task

In the classification task in machine learning we start with a set of data points, each of which has been assigned one of several possible classes. Each data point consists of a list of data values, called features, and a label that places the data point in one of those classes.

In machine learning we take a subset of the data points, called the training set, and use them to train a model that can predict what class a given data point belongs to. We then evaluate the performance of the model by having it predict the classifications for a second set of data points, the test set. By comparing the predictions of the model against the actual classes of the points in the test set we can see how well the algorithm has performed.

Classifying bank notes as real or fake

The data set we are going to work with for this exercise was prepared by a group of machine learning researchers who had access to examples of both real and forged European bank notes. They took pictures of each note and compressed the images using a wavelet transform. As a side effect of the compression algorithm they extracted a set of four numerical features for each note.

The data set we will work with for this assignment consists of two data files. The file data.txt contains 1097 entries, each of the form

-2.62,-6.8555,6.2169,-0.62285,1

The first four numbers in the line are the four features, while the fifth number gives the classification: 0 for real and 1 for fake.

The second data file, test.txt, has the same format and stores 275 test cases that we can use to evaluate the performance of the algorithm. For each of the 275 items in the test set we will present the four feature values to the algorithm and ask it to predict what class the item belongs to. We can then compare the predicted class against the actual class of each test case.

Evaluating the performance of a classifier

When we evaluate a test case there will be four possible outcomes. A true positive is a case where the note is fake and the algorithm correctly predicts that it is fake. A true negative is a case where the note is real and the algorithm correctly predicts that it is real. In a false positive the algorithm mistakenly predicts that a real note is fake. In a false negative the algorithm mistakenly classifies a fake note as real.

A convenient way to summarize the performance of a classification algorithm on a test data set is to set up a confusion matrix, which shows how many examples of each of the four possible outcomes we saw:

                  Actual Real   Actual Fake
  Predicted Real      TN            FN
  Predicted Fake      FP            TP

From the confusion matrix we also compute two additional metrics. The precision of the model is

    precision = TP / (TP + FP)

which measures what fraction of the examples the model predicted fake were actually fake.

The recall of the model is

    recall = TP / (TP + FN)

which measures what fraction of the actually fake notes the model correctly classified as fake.
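As a purely illustrative example (these numbers are made up, not results from this data set): if a model produced TP = 40, TN = 130, FP = 10, and FN = 20, its precision would be 40 / (40 + 10) = 0.80 and its recall would be 40 / (40 + 20), or about 0.67.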

The k-nearest neighbors model

The machine learning model we will use to classify the banknotes is one of the simplest such models, the k-nearest neighbors model. The basic idea behind this model is the assumption that the two classes involved in the problem will cluster in distinct regions of the feature space. When we have to classify a data point we have never seen before, we can look at its immediate neighbors to see which classes they belong to. If most of the immediate neighbors belong to a particular class, we will guess that the new point belongs to that class as well.

Here is an outline of the classification algorithm; a short Python sketch follows the list.

  1. We input a new point p from the test set that needs to be classified.
  2. We compute the distance from that point to every point in the training set.
  3. We select the k points from the training set that are closest to p.
  4. If the majority of those k points belong to a particular class, we guess that p belongs to that same class.
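To make these steps concrete, here is a minimal sketch in Python. The names distance and classify, and the choice of Euclidean distance, are illustrative assumptions rather than requirements of the assignment.

    from collections import Counter

    def distance(p, q):
        # Euclidean distance between two feature lists of the same length.
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

    def classify(p, training, k):
        # training is a list of (features, label) pairs.
        # Sort the training points by their distance to p and keep the k closest.
        neighbors = sorted(training, key=lambda item: distance(p, item[0]))[:k]
        # Guess the class that appears most often among those k neighbors.
        votes = Counter(label for _, label in neighbors)
        return votes.most_common(1)[0][0]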

Programming assignment

Below you will find links to two files containing the data sets for this exercise.

Training Data Set

Test Data Set

Write a program that reads the training and test data sets from their respective files. Since the data items on each line are separated by commas, you will need to use split(",") instead of split() to break each line of the input files into distinct data items.
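For example, each file might be read along these lines. The function name read_data is an illustrative choice, and the code assumes the class value on each line is written as an integer, as in the sample entry above.

    def read_data(file_name):
        # Returns a list of (features, label) pairs from a comma-separated file.
        points = []
        with open(file_name) as f:
            for line in f:
                if not line.strip():
                    continue          # skip any blank lines
                values = line.strip().split(",")
                features = [float(x) for x in values[:4]]
                label = int(values[4])
                points.append((features, label))
        return points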

Once your program has read the input data, it should use the k-nearest neighbors model with k = 5 to classify each of the points from the test data set. Compare the classifications generated by the model with the actual classes for each item in the test data set, and construct a confusion matrix from the results. Print the confusion matrix along with the precision and recall values.
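One possible way to put the pieces together, assuming the read_data and classify helpers sketched above:

    training = read_data("data.txt")
    test = read_data("test.txt")

    tp = tn = fp = fn = 0
    for features, actual in test:
        predicted = classify(features, training, 5)
        if predicted == 1 and actual == 1:
            tp += 1
        elif predicted == 0 and actual == 0:
            tn += 1
        elif predicted == 1 and actual == 0:
            fp += 1
        else:
            fn += 1

    print("                  Actual Real   Actual Fake")
    print(f"  Predicted Real  {tn:11d}   {fn:11d}")
    print(f"  Predicted Fake  {fp:11d}   {tp:11d}")
    print("Precision:", tp / (tp + fp))
    print("Recall:", tp / (tp + fn))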

Helpful hint: To make finding the k nearest neighbors for a given test data point easier, construct a tuple with the structure

(distance,class)

for each point in the training set, where distance is the distance from the current test data point and class is the class that the training data point belongs to. If you sort this list of tuples in order of ascending distance values, the first k items in that list will be the distances to the k nearest neighbors of the test data point along with their classes.
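A short sketch of that hint, reusing the distance function from the earlier sketch (the variable names are illustrative):

    # For one test point p, pair every training point's distance to p with its class.
    pairs = [(distance(p, features), label) for features, label in training]
    pairs.sort()              # ascending distance
    nearest = pairs[:k]       # (distance, class) tuples for the k nearest neighbors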