Classification learning

In a classification learning task we have a set of data points where each data point is labeled with a category. The problem at hand is to construct a model that predicts the correct category for each input data point.

The simplest version of the classification problem is the binary classification problem. In this problem we have only two categories that the data points can be labeled with. In the examples below I will be working through a binary classification problem where some of the data points are labeled with a category of 0 and others are labeled with a category of 1. Our goal is to construct a model that takes a data point as input and outputs either a 0 or a 1 to predict the classification of that data point.

The diagram below shows a very simple version of the binary classification problem. Each data point in the data set is described by a single variable, x, which makes it possible to plot the points on a line. The points are then colored to show their classifications. The black points are points in category 0 and the red points are in category 1.

The decision function

A crude and simple way to separate the points into different categories is to introduce a decision function:

d(x,b) = x - b

The decision function depends on a parameter, b, which is the point at which the function crosses the axis.

We use the decision function in combination with a classification function

c(z) = 1 if z > 0
c(z) = 0 if z <= 0

to predict the classification for an input data point xi:

c(d(xi, b))

This procedure classifies all points for which the decision function is positive as category 1 points, and all points for which the decision function is negative as category 0 points. If we have chosen the model parameter b correctly, the model will correctly distinguish all of the points.

We say that the parameter b determines the decision boundary for the model.
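
A minimal Python sketch of this decision rule, using made-up toy data and a hand-picked value for the decision boundary b, might look like this:

    def d(x, b):
        # Decision function: positive to the right of b, negative to the left.
        return x - b

    def c(z):
        # Step classification function: category 1 for positive decision values.
        return 1 if z > 0 else 0

    # Toy data set: points below 2.0 are category 0, points above it are category 1.
    points = [0.5, 1.0, 1.5, 2.5, 3.0, 3.5]
    labels = [0, 0, 0, 1, 1, 1]

    b = 2.0  # a correctly chosen decision boundary
    print([c(d(x, b)) for x in points])  # [0, 0, 0, 1, 1, 1]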

Error measures and learning

In more complex problems we will not know in advance how to set the model parameters to produce the correct classification behavior. In those cases we will follow a strategy to adjust the model parameter:

  1. Start with a random value for the model parameter.
  2. Estimate the model error.
  3. Use the error estimate to adjust the model parameter until there is no longer any error.

The picture below shows the situation we would face if we had initially set the model parameter incorrectly.

The decision function in this case would mistakenly classify some of the category 0 points as category 1 points. To measure the classification error we can construct an appropriate error function:

e(b) = (1/N) Σi (c(d(xi, b)) - yi)^2

This particular error function measures the mean squared error of the classification function, averaged over all N points in the data set. The yi values give the correct classification for each of the points in the data set. For the classification function we are using, this error function measures what fraction of the data points the classifier classifies incorrectly.
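
A rough sketch of this error measure in Python, again with made-up toy data, might look like this; because each squared term is either 0 or 1, the result is the fraction of misclassified points:

    def d(x, b):
        return x - b

    def c(z):
        return 1 if z > 0 else 0

    def error(b, points, labels):
        # Mean squared error of the step classifier c(d(x, b)); since each
        # term is either 0 or 1, this is the fraction of points misclassified.
        return sum((c(d(x, b)) - y) ** 2 for x, y in zip(points, labels)) / len(points)

    points = [0.5, 1.0, 1.5, 2.5, 3.0, 3.5]
    labels = [0, 0, 0, 1, 1, 1]
    print(error(2.0, points, labels))  # 0.0: the boundary at b = 2.0 makes no mistakes
    print(error(2.7, points, labels))  # a misplaced boundary misclassifies 1 of the 6 points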

Our goal is to adjust b to reduce the error to 0. One way to do this is to use the derivative of the error function with respect to b. The iterative learning algorithm

bk+1 = bk - η e'(bk)

uses the derivative of the error measure to compute a new model parameter. The learning algorithm makes use of a learning rate term, η, to control the rate at which we adjust the model parameter. In typical applications several rounds of learning are sufficient to adjust the model parameter to reduce the error term to 0.
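
A minimal sketch of the update loop itself might look like the following; here a stand-in differentiable error e(b) = (b - 3)^2 plays the role of the error function, since the derivative of the classification error is the subject of the next section.

    def learn(b0, de_db, eta, rounds):
        # Repeatedly apply the update rule b_{k+1} = b_k - eta * e'(b_k).
        b = b0
        for k in range(rounds):
            b = b - eta * de_db(b)
        return b

    # Stand-in error e(b) = (b - 3)^2 with derivative e'(b) = 2 * (b - 3);
    # the loop moves b from the initial guess toward the minimum at b = 3.
    print(learn(b0=0.0, de_db=lambda b: 2 * (b - 3), eta=0.1, rounds=50))  # approximately 3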

Logistic regression

The only problem with the scheme I outlined above is that the error function I have constructed is actually not differentiable. The reason for this is that the classification function c(x) that we used in our classifier is not differentiable.

The fix for this is to replace our classification function with a smooth, differentiable function. The most widely used choice for this purpose is the logistic function

c(z) = 1 / (1 + e^(-z))

Here is a plot of that function.

Note that this classification function is designed to output classifications that fall in the range from 0 to 1. You can think of the output of this classification function as a measure of the classifier's confidence that a particular point belongs to category 1.

If we use this logistic classification function to compute classifications c(d(x,b)) we say that we have constructed a logistic regression model.
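
A sketch of the resulting model in Python, with sigmoid standing in for the logistic classification function and the same one-parameter decision function d(x, b) = x - b, might look like this:

    import math

    def sigmoid(z):
        # The logistic function: smooth, differentiable, output in (0, 1).
        return 1.0 / (1.0 + math.exp(-z))

    def predict(x, b):
        # Logistic regression prediction c(d(x, b)): a confidence,
        # between 0 and 1, that x belongs to category 1.
        return sigmoid(x - b)

    print(predict(3.0, b=2.0))  # well above 0.5: likely category 1
    print(predict(0.5, b=2.0))  # well below 0.5: likely category 0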

Since the logistic classification function is differentiable with respect to the model parameter b, we can now make use of our learning algorithm

bk+1 = bk - η e'(bk)

to iteratively adjust the model parameter from an initial guess b0 until we have driven the error down to essentially 0.
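
Putting the pieces together, a sketch of the whole training procedure on made-up toy data, with a hand-picked learning rate and number of learning rounds, might look like this:

    import math

    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    def error_derivative(b, points, labels):
        # Derivative with respect to b of the mean squared error
        # e(b) = (1/N) sum_i (sigmoid(x_i - b) - y_i)^2,
        # using sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z)) and d/db (x - b) = -1.
        total = 0.0
        for x, y in zip(points, labels):
            s = sigmoid(x - b)
            total += -2.0 * (s - y) * s * (1.0 - s)
        return total / len(points)

    # Toy data: category 0 points sit below 2.0, category 1 points sit above it.
    points = [0.5, 1.0, 1.5, 2.5, 3.0, 3.5]
    labels = [0, 0, 0, 1, 1, 1]

    b = 0.0      # start from an arbitrary initial guess b0
    eta = 2.0    # learning rate, chosen by hand for this toy problem
    for k in range(500):
        b = b - eta * error_derivative(b, points, labels)   # the learning rule

    print(round(b, 2))                                         # close to 2.0
    print([1 if sigmoid(x - b) > 0.5 else 0 for x in points])  # [0, 0, 0, 1, 1, 1]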

Ambiguous models

In some problems we confront a situation in which there is no clear boundary between the two sets of points. Here is a more challenging version of the simple, one dimensional classification problem.

The problem this time is that there is no place to position the decision boundary to cleanly separate the two groups of points. No matter where we position the decision function we are going to end up incorrectly classifying at least some of our points. The best we can hope for once again is to pick a model parameter b via the learning algorithm to minimize the error function e(b). The difference in this case is that the minimum value of e(b) is no longer 0.

In situations like this, it is going to be inevitable that the decision function will produce both false positives and false negatives. A false positive is a category 0 point that mistakenly gets classified as a category 1 point. In the picture above, the one black dot to the right of the blue line is a false positive. A false negative is a category 1 point that gets incorrectly classified as a category 0 point. In the picture above the one red point to the left of the blue line is a false negative.

In situations like this we can introduce two further error measures, called precision and recall. These quantities are defined in terms of the following four counts.

Count   Interpretation
TP      True positives: how many category 1 points get classified as 1s
FP      False positives: how many category 0 points get classified as 1s
FN      False negatives: how many category 1 points get classified as 0s
TN      True negatives: how many category 0 points get classified as 0s

In terms of these counts we define

precision = TP / (TP + FP)

recall = TP / (TP + FN)

Precision measures what fraction of the points that our model classifies as 1s are actually 1s, while recall measures what fraction of the true 1s we classify as 1s.
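
A quick sketch of these two measures in Python, computed from made-up lists of true labels and predicted labels, might look like this:

    def precision_and_recall(labels, predictions):
        # Count the outcomes that matter for a binary (0/1) classifier.
        tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
        fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
        fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
        return tp / (tp + fp), tp / (tp + fn)

    # One false positive and one false negative, as in the ambiguous example above.
    labels      = [0, 0, 0, 1, 1, 1]
    predictions = [0, 0, 1, 0, 1, 1]
    print(precision_and_recall(labels, predictions))  # precision 2/3, recall 2/3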

Thresholding

In situations where data points are difficult to separate cleanly, we can describe the situation as having a region of ambiguity. This is a region near the decision boundary where the decision function has some trouble correctly classifying data points. The picture below shows that region bordered by a pair of green lines.

To the right of the right green line the decision function correctly classifies all category 1 points as 1s. To the left of the left green line the decision function correctly classifies all category 0 points as 0s.

One way to create a region like this is to use the concept of a threshold. We pick some number t and say that the classifier labels an input x as a 1 if d(x, b) > t. The picture below shows two different choices for t. If we set the threshold to the higher value of t, every point the classifier labels as a 1 will truly be in category 1, so there are no false positives. If we set it to the lower value of t, every point the classifier labels as a 0 will truly be in category 0, so there are no false negatives.
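
A sketch of this thresholding idea, with made-up overlapping data and two hand-picked threshold values, might look like this; raising the threshold avoids false positives at the cost of false negatives, and lowering it does the reverse:

    def classify(points, b, t):
        # Label a point as 1 only when the decision value x - b exceeds the threshold t.
        return [1 if (x - b) > t else 0 for x in points]

    # Ambiguous data: the two categories overlap near x = 2.0.
    points = [0.5, 1.0, 1.9, 2.1, 3.0, 3.5]
    labels = [0, 0, 1, 0, 1, 1]
    b = 2.0

    print(classify(points, b, t=0.5))   # higher threshold: no false positives, but misses the 1 at x = 1.9
    print(classify(points, b, t=-0.5))  # lower threshold: no false negatives, but picks up the 0 at x = 2.1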

Classifying multiple factors

To keep things simple, up to this point I have worked with an example that has only one input factor. Most models that we build for classification tasks will involve multiple factors. The discussion below shows how the ideas we have developed for one factor generalize to problems with multiple input factors. For simplicity I will work with two input factors, but everything I say about two factors carries over to any number of factors.

The picture below shows a set of data points in a two-factor space. Once again the points are in two categories: category 0 points are black and category 1 points are red.

The decision function for this model takes the form

d(x, b) = b0 + b1 x1 + b2 x2

The graph of the decision function looks like a plane. The decision function equals 0 along a line in the (x1, x2) plane called the decision boundary. If it is possible to draw a decision boundary that cleanly separates the two categories of points, we say that the sets are linearly separable.

The error function for this model generalizes in a natural way:

e(b) = (1/N) Σi (c(d(xi, b)) - yi)^2

The only difference here is that now instead of having a single parameter in our model we have a vector of parameters, b. To minimize the error function we use a generalization of the learning rule we developed for one dimension. The generalization of the derivative term to multiple dimensions is the gradient of the error function, ∇e(b), the vector of partial derivatives of e with respect to each of the parameters b0, b1, and b2:

bk+1 = bk - η ∇e(bk)

This learning rule implements the gradient descent learning algorithm.
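
A sketch of gradient descent for the two-factor logistic regression model, on a made-up linearly separable data set with a hand-picked learning rate, might look like this; the gradient is computed with the same chain rule steps used in the one-factor case:

    import math

    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    def predict(point, b):
        # Decision function d(x, b) = b0 + b1*x1 + b2*x2 passed through the logistic function.
        x1, x2 = point
        return sigmoid(b[0] + b[1] * x1 + b[2] * x2)

    def error_gradient(b, points, labels):
        # Gradient of e(b) = (1/N) sum_i (c(d(x_i, b)) - y_i)^2 with respect to (b0, b1, b2).
        grad = [0.0, 0.0, 0.0]
        for (x1, x2), y in zip(points, labels):
            s = predict((x1, x2), b)
            common = 2.0 * (s - y) * s * (1.0 - s)   # chain rule through the square and the sigmoid
            grad[0] += common * 1.0
            grad[1] += common * x1
            grad[2] += common * x2
        n = len(points)
        return [g / n for g in grad]

    # A small, linearly separable data set in two factors.
    points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (2.0, 2.0), (3.0, 2.0), (2.0, 3.0)]
    labels = [0, 0, 0, 1, 1, 1]

    b = [0.0, 0.0, 0.0]   # initial guess for the parameter vector
    eta = 0.5             # learning rate, chosen by hand
    for k in range(5000):
        grad = error_gradient(b, points, labels)
        b = [bj - eta * gj for bj, gj in zip(b, grad)]   # b_{k+1} = b_k - eta * grad e(b_k)

    print([1 if predict(p, b) > 0.5 else 0 for p in points])  # should recover [0, 0, 0, 1, 1, 1]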