From Regression To Classification

Part 3 of the series "Building a Neural Network from scratch"

17 October 2016
neural-networks
theory

Code for this post

We successfully extended our neuron so that it can handle any dataset generated by a linear continuous function. Now we move on to classification tasks, meaning that the target values in the dataset are discrete rather than continuous.

Regression vs. Classification

So far, all problems were regression tasks: our neuron was given two input variables, which it used to predict a continuous target variable. At the end of the last post, we introduced a classification problem, where the target for an input vector was either 0 or 1. Our neuron failed to solve it because all it can do is fit a linear function, i.e. produce a least squares fit. For regression tasks this suffices, but for classification the output should not change linearly; it should change abruptly.

One solution would be to map the neuron's output to discrete values, for example:

net_input = np.dot(x, weights) + bias
prediction = 0 if net_input <= 0.5 else 1

What this means theoretically is that we are moving away from a least squares fit. A least squares fit tries to find a linear function that minimizes the squared distance between the function and the data points. That is exactly what we did so far. But for classification, a least squares fit will not always be the best solution. I found a good image in these slides, which explain the problem in more detail:

The blue line is the neuron's output and the green dotted line is the resulting prediction. We can see that discretizing the output of a least squares regression may be a bad idea for classification. From [1]

The key is to only train the neuron on the samples that it classified wrong. In the code, this means that we do not calculate our error as:

error = np.square(y - net_input)

Instead, we use:

error = np.square(y - prediction)
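
To make the effect concrete, here is a tiny check with made-up numbers (the values are for illustration only): for a correctly classified sample, the new error is zero, so that sample no longer pulls on the weights, while the old error would still be nonzero.

import numpy as np

# Made-up values for illustration only
y = 1                                        # true class
net_input = 0.9                              # weighted sum of inputs plus bias
prediction = 0 if net_input <= 0.5 else 1    # step activation -> 1, i.e. correctly classified

old_error = np.square(y - net_input)   # ~0.01: nonzero, would still adjust the weights
new_error = np.square(y - prediction)  # 0.0: correctly classified samples are left alone
print(old_error, new_error)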

Activation Functions

What we just added is an activation function: we extended our neuron by feeding its weighted sum of inputs into a function that calculates the output. Specifically, we used the step function for activation. Now the neuron only "fires" if the net input reaches a certain threshold.
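
To make this explicit, here is a minimal sketch of the step activation written as its own function, so that it can later be swapped for a different activation. The function names and the default threshold are my own choices, not the exact code of this series.

import numpy as np

def step(net_input, threshold=0.5):
    # Fire (output 1) only if the net input exceeds the threshold
    return np.where(net_input > threshold, 1, 0)

def predict(x, weights, bias):
    # Weighted sum of the inputs plus the bias, fed through the step activation
    net_input = np.dot(x, weights) + bias
    return step(net_input)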

Although the code works for our problem, we got a little careless here. We changed the neuron's output function and thus the derivative of the error w.r.t. the weights, but we did not modify the update rule. The reason is that the step function is discontinuous: its derivative is zero everywhere except at the step, where it is not defined. For now, it is enough to ignore the step function during the weight updates, as sketched below; we will tackle this problem properly in the next post.
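
Concretely, ignoring the step function during the updates means we keep using the error between target and prediction, but treat the activation's derivative as if it were 1. A minimal sketch of such an update, with assumed variable names and learning rate (not the exact code of this series):

import numpy as np

def train_step(x, y, weights, bias, learning_rate=0.1):
    # One update on a single sample; the step function's derivative is treated as 1
    net_input = np.dot(x, weights) + bias
    prediction = 0 if net_input <= 0.5 else 1
    error = y - prediction                                      # zero for correctly classified samples
    weights = weights + learning_rate * error * np.asarray(x)   # d(net_input)/d(weights) = x
    bias = bias + learning_rate * error                         # d(net_input)/d(bias) = 1
    return weights, bias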

Heading for new shores

Now that we can tackle classification tasks as well, let us try a new dataset, generated by the XOR function. In the code above, we only change get_target to:

def get_target(x):
    return 1 \
        if (x[0] > 0.5 and x[1] <= 0.5) \
           or (x[0] <= 0.5 and x[1] > 0.5) \
        else 0

and again, our neuron fails. Why? Because the data in the new problem are not linearly separable.
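
To see the XOR pattern, we can evaluate the new get_target on four representative points, one per quadrant of the unit square (the specific points are arbitrary examples):

for x in [(0.2, 0.2), (0.8, 0.2), (0.2, 0.8), (0.8, 0.8)]:
    print(x, get_target(x))
# prints 0, 1, 1, 0 - exactly the XOR truth table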

Linear Separability

Our first problem was defined by

def get_target(x):
    return 1 if 0.58 * x[0] + 0.67 * x[1] + 0.5 > 1 else 0

The definition of the second problem is above. I plotted a dataset with 200 samples and the neuron's decision boundary for each problem below:

First Problem
Second Problem
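
For reference, a minimal sketch of how such a plot could be produced, assuming matplotlib and the get_target of the respective problem; the weights and bias of the boundary line are placeholders standing in for the trained neuron's parameters:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
samples = rng.random((200, 2))                        # 200 random points in the unit square
targets = np.array([get_target(x) for x in samples])  # labels from the problem definition

# Scatter the two classes
plt.scatter(*samples[targets == 0].T, label="class 0")
plt.scatter(*samples[targets == 1].T, label="class 1")

# The neuron's decision boundary: all points where the net input equals the threshold 0.5
# (placeholder weights and bias instead of the trained neuron's actual parameters)
weights, bias = np.array([0.58, 0.67]), 0.0
xs = np.linspace(0, 1, 100)
plt.plot(xs, (0.5 - bias - weights[0] * xs) / weights[1], label="decision boundary")

plt.legend()
plt.show()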

The problem is that our neuron can only separate the data linearly, but the XOR-generated samples are not linearly separable. Note that if we introduce noise into the linearly separable dataset of the first problem, it stops being linearly separable as well. The neuron's training then no longer converges, but the resulting decision boundary is still good.
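
As a side note, here is one way such noise could be introduced, assuming the data generation of the first problem; the function name and noise level are arbitrary choices for illustration:

import numpy as np

rng = np.random.default_rng(0)

def get_noisy_target(x, noise=0.05):
    # First problem's target, with Gaussian noise added to the decision value,
    # so that some points near the boundary end up on the "wrong" side
    value = 0.58 * x[0] + 0.67 * x[1] + 0.5 + rng.normal(0, noise)
    return 1 if value > 1 else 0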

But what is the solution for the XOR dataset? Looking at the plot, we could say that we need two decision boundaries. Intuitively, this suggests moving from a single neuron to our ultimate goal: a neural network.

Conclusion

Fail? Success? I am not sure. We modified our neuron so that it can classify data, but the modification was a little careless (we cannot properly calculate the derivative of the error anymore). Also, the modification worked for our first dataset, but we immediately hit a new barrier: data that are not linearly separable. Let us find out what we can do about this in the next post.

Next Post: When one Neuron is not enough
