Experimenting with Hyperparameters

Part 6 of the series "Building a Neural Network from scratch"

24 October 2016
neural-networks
theory

Code for this post

Object-oriented Refactoring

As the code got more complex, I refactored it into an object-oriented design. This makes it easier to stay on top of things, and the neural network can now be used by other Python modules. I really recommend taking a look at net.py on GitHub. There, you will also find the files for the following experiments in the same package.

Now, we have classes for the net, the hidden layers and the output layer. This way, we can build a modular net and quickly vary its size and depth by connecting multiple hidden layers. The calculations, however, stayed the same.
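To make the structure concrete, here is a minimal sketch of how such a layout can look. The class and method names (Net, HiddenLayer, OutputLayer, predict) are assumptions for illustration; the actual implementation is in net.py.

```python
# Minimal sketch of an object-oriented layout (the names are assumptions,
# the real implementation lives in net.py).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class Layer:
    def __init__(self, n_inputs, n_neurons):
        # small random weights and zero biases
        self.weights = np.random.randn(n_inputs, n_neurons) * 0.1
        self.biases = np.zeros(n_neurons)

    def forward(self, x):
        return sigmoid(x @ self.weights + self.biases)

class HiddenLayer(Layer):
    pass

class OutputLayer(Layer):
    pass

class Net:
    def __init__(self, layers):
        self.layers = layers

    def predict(self, x):
        # feed the input through all layers in order
        for layer in self.layers:
            x = layer.forward(x)
        return x

# a 2-3-1 net for XOR, built by stacking layers
net = Net([HiddenLayer(2, 3), OutputLayer(3, 1)])
print(net.predict(np.array([[0.0, 1.0]])))
```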

Experiments

I ran some experiments on the XOR dataset with a net of 3 hidden neurons. I visualized the decision boundary using the same code that Denny Britz uses in his tutorial. This is the result:

Decision boundary of a neural network with 3 hidden neurons
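For reference, such a plot can be produced roughly like this. It is a sketch in the spirit of Denny Britz's helper, assuming a net with a predict method as in the sketch above; the actual code is in boundary_visualization.py.

```python
# Sketch of a decision-boundary plot: evaluate the net on a fine grid and
# colour each grid point by its predicted class.
import numpy as np
import matplotlib.pyplot as plt

def plot_decision_boundary(predict, X, y, step=0.01):
    x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
    y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, step),
                         np.arange(y_min, y_max, step))
    grid = np.c_[xx.ravel(), yy.ravel()]
    predictions = (predict(grid) > 0.5).astype(int).reshape(xx.shape)
    plt.contourf(xx, yy, predictions, alpha=0.4)
    plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors="k")
    plt.show()
```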

I also tracked the decrease of the Mean Squared Error for 50 runs. This resulted in the following graph:

Error decrease for 50 runs with a learning rate of 0.1.
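As a reminder, the error we track is the Mean Squared Error over all \(N\) training samples (written here without the \(\frac{1}{2}\) factor some texts use):

\[ MSE = \frac{1}{N} \sum_{i=1}^{N} \left(y_i - \hat{y}_i\right)^2 \]

where \(y_i\) is the target value and \(\hat{y}_i\) the net's prediction.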

You can find the code for both experiments on Github in boundary_visualization.py and error_visualization.py. Both experiments show that our network seems to learn the dataset pretty well. Also, our choice of 0.1 as the learning rate seems ok.

Next, I tried out sklearn's moons dataset (two interleaving half circles) with a net of 5 hidden neurons and a learning rate of 0.1. The code is in moons.py. The net reached an MSE of 0.01 after 2000 epochs through the 200 samples. The visualization of its decision boundary shows that it was able to learn the underlying structure of the dataset:

Decision boundary with 5 hidden neurons
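A rough sketch of this experiment, reusing the Net and plot_decision_boundary sketches from above; make_moons is the real scikit-learn helper, while the noise level and the commented-out training call are assumptions:

```python
# Sketch of the moons experiment (assumptions noted in the comments).
from sklearn.datasets import make_moons

# 200 samples of two interleaving half circles, with some noise
X, y = make_moons(n_samples=200, noise=0.2)

# a 2-5-1 net, following the Net sketch above
net = Net([HiddenLayer(2, 5), OutputLayer(5, 1)])

# hypothetical training call -- the real interface is in moons.py
# net.train(X, y.reshape(-1, 1), epochs=2000, learning_rate=0.1)

plot_decision_boundary(net.predict, X, y)
```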

Choosing the Number of Hidden Neurons

Similar to Denny Britz, I also tried out other values for the number of hidden neurons (see the sketch after this list). You can take a look at the results and compare them to his:
1 hidden neuron
2 hidden neurons
3 hidden neurons
10 hidden neurons
50 hidden neurons
100 hidden neurons
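The comparison itself can be sketched as a simple loop over the layer sizes listed above, again reusing the hypothetical Net interface and the moons data from the previous sketches:

```python
# Sketch of the hidden-layer-size comparison (Net interface is an assumption).
for n_hidden in [1, 2, 3, 10, 50, 100]:
    net = Net([HiddenLayer(2, n_hidden), OutputLayer(n_hidden, 1)])
    # hypothetical training call, as in the moons sketch
    # net.train(X, y.reshape(-1, 1), epochs=2000, learning_rate=0.1)
    plot_decision_boundary(net.predict, X, y)
```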

Beware that his network is larger for the same number of hidden neurons and that he uses a different error function, the Cross Entropy Loss. We will talk about this error function in the next post. For now, we can see in both experiments that a higher number of hidden neurons means a higher number of weights/parameters to optimize. Therefore, large networks tend to overfit the data. This is bad because it makes our net memorize the data instead of generalizing to the underlying structure. In fact, throughout the last posts, we committed a major crime in Machine Learning.
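To make that concrete: assuming two input features, \(h\) hidden neurons and a single output neuron as in the sketches above, the net has \(2h\) input-to-hidden weights, \(h\) hidden-to-output weights and \(h + 1\) biases, so \(4h + 1\) parameters in total. Going from 3 to 100 hidden neurons therefore raises the count from 13 to 401 parameters that all have to be fitted to the same small dataset.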

Dividing between Training and Test Data

So far, we trained our classifier on our data and evaluated its performance on the exact same data. If we stick to this measure, the best classifier would be a database that simply looks up the x and returns the stored y. But we want a classifier that can predict the target value of new data, so it is important to measure its performance on new data. We divide the dataset at hand into two parts: a training set for training the classifier and a test set for evaluating its performance. The classifier must not see the samples of the test set until training is finished and the evaluation starts. We will add the code for evaluating the net’s performance with a test set in the next post.
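A minimal sketch of such a split, using scikit-learn's train_test_split helper (the 80/20 ratio here is just an example choice, not necessarily what the next post will use):

```python
# Hold out 20% of the moons data as a test set.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=200, noise=0.2)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# train only on (X_train, y_train), evaluate only on (X_test, y_test)
```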

As for our current example, we can see that we need to prevent overfitting. Otherwise, new data will not be predicted based on the general structure of the training data, but based on where its outliers happen to lie. These examples illustrate the problem in a more extreme way: [1] [2].

An immediate solution would be to limit the number of hidden neurons. However, we should be careful with that. As Andrej Karpathy puts it in his course on Neural Networks:

The subtle reason behind this is that smaller networks are harder to train with local methods such as Gradient Descent: It’s clear that their loss functions have relatively few local minima, but it turns out that many of these minima are easier to converge to, and that they are bad (i.e. with high loss). Conversely, bigger neural networks contain significantly more local minima, but these minima turn out to be much better in terms of their actual loss.

So, limiting the number of hidden neurons reduces overfitting, but it also raises the danger of getting stuck in a bad local minimum. Instead of setting the number of parameters too low, we should therefore use other techniques to prevent overfitting, such as constraining the values of the weights. We will talk about those techniques in the next post.

Choosing the Learning Rate

For our simple dataset, the net with 3 hidden neurons did not get stuck in a local minimum. So, I tried out a couple of learning rates:

We can see that dynamically decreasing (or annealing) the learning rate is a very good idea for this dataset. Right now, we halve the learning rate whenever the training error did not decrease after an epoch. However, the learning rate is not allowed to drop below \(10^{-6}\).
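As a standalone sketch, the annealing rule described above looks like this (the function name is made up; the real logic is part of the training code):

```python
# Halve the learning rate whenever an epoch did not lower the training
# error, but never let it drop below 1e-6.
def anneal(learning_rate, error, previous_error, minimum=1e-6):
    if error >= previous_error:
        return max(learning_rate / 2.0, minimum)
    return learning_rate

# example: the error stalls, so the rate is halved
lr = anneal(0.1, error=0.05, previous_error=0.05)
print(lr)  # 0.05
```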

Conclusion

This was a short post. We experimented with two hyperparameters – the learning rate and number of hidden neurons – and found out that

  • a high number of weights leads to overfitting, but also that
  • choosing the number of weights too low increases the danger of getting stuck in a local minimum.

Generally, we “conquered” the XOR and the Moons dataset. So, in the next post, we will move on to a more complicated, real-world task: digit recognition.

Next Post: Using Softmax for Multiple Classes
