An Interactive Character-Level Language Model

19 February 2017 Source Code

I let a neural network read long texts one letter at a time. Its task was to predict the next letter based on those it had seen so far. Over time, it recognized patterns between letters. Find out what it learned by feeding it some letters below. When you click the send button on the right / bottom, it will read your text and auto-complete it.

You can choose between networks that read a lot of Wikipedia articles, US Congress transcripts etc.


Here is a detailed description of what I did: I used a specific type of recurrent neural network, the LSTM (Long Short-Term Memory), to learn a language model for a given text corpus. Because I fed it only one letter at a time, it learned a language model at the character level. This idea is not new at all. It was inspired by Andrej Karpathy's blog post "The Unreasonable Effectiveness of Recurrent Neural Networks". He also trained character-level networks on Shakespeare, Wikipedia, the Linux source code etc. The results are amazing.

I decided to try to reproduce his results and make the trained models available via an interactive chat box, so that you can try them out as well.
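To make the character-level setup above concrete, here is a minimal sketch of how a text can be turned into training pairs in which every character has to predict its successor. The helper names (build_vocab, one_hot, next_char_pairs) are made up for this illustration and are not taken from my code.

```python
def build_vocab(text):
    """Map each distinct character to an integer id (sorted for determinism)."""
    return {ch: i for i, ch in enumerate(sorted(set(text)))}


def one_hot(index, size):
    """Return a list of `size` zeros with a single 1.0 at position `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def next_char_pairs(text, vocab):
    """Return (input_id, target_id) pairs: each character predicts the next one."""
    ids = [vocab[ch] for ch in text]
    return list(zip(ids[:-1], ids[1:]))


text = "hello"
vocab = build_vocab(text)
pairs = next_char_pairs(text, vocab)
# One-hot encode the input side, as the network would see it.
inputs = [one_hot(i, len(vocab)) for i, _ in pairs]
```

For "hello", the pairs are h→e, e→l, l→l and l→o. The real models use a 90-character vocabulary instead of the four characters here.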

Datasets

These are the datasets I used:

US Congress
  • 488 million characters from transcripts of the United States Senate's congressional record
  • Trained for 2 days.
Wikipedia
  • 447 million characters from about 140,000 articles (2.5% of the English Wikipedia)
  • Trained for 2 days.
Sherlock
  • 3.6 million characters (about 650,000 words) from the whole Sherlock Holmes corpus by Sir Arthur Conan Doyle. I removed indentation but kept all line breaks even if their only purpose was formatting.
  • Trained for 3 hours.
South Park
  • 4.7 million characters from all 277 South Park episodes
  • Trained for 2 hours.
Goethe
  • 1.5 million characters from all poems by Johann Wolfgang von Goethe
  • Trained for 2 hours.

For the training / validation split, I used a 90 / 10 ratio for the three small datasets and a 95 / 5 ratio for the US Congress and Wikipedia datasets.
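A split like this can be sketched in a few lines. The sketch assumes a simple contiguous cut with the validation part taken from the end of the text, which is only one way to do it; the function name split_corpus is hypothetical.

```python
def split_corpus(text, train_fraction):
    """Cut a corpus into a training part and a validation part.

    Assumption: a contiguous split, with the validation text at the end.
    """
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    cut = int(len(text) * train_fraction)
    return text[:cut], text[cut:]


# Dummy corpus of 1,000 characters, split 90 / 10.
corpus = "a" * 1000
train_text, valid_text = split_corpus(corpus, 0.9)
```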

Model

I did not do a lot of hyper-parameter tuning, and the tuning I did yielded only marginal improvements. Below are the hyper-parameters that I found to work well for each dataset. For all models, I used 90-dimensional one-hot encodings as input and output of the model. The models were trained by minimizing the cross-entropy (measured in nats per character) using RMSprop.

  • Number of LSTM layers: 3 (the South Park and the Goethe model only had 2 layers)
  • Number of neurons per layer: 795
  • Batch size: 100
  • Learning rate: 0.001
  • Dropout: 0.5
  • Gradient L2 norm bound: 5
  • A dense softmax layer with 90 units followed the last LSTM layer.
  • At each training step, the error was backpropagated through 160 time steps / characters.
  • I trained all models on an AWS p2.xlarge instance ($0.2 per hour).
  • I used early stopping to prevent the model from overfitting.
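The "gradient L2 norm bound" in the list above refers to gradient clipping. Here is a minimal sketch in plain Python, assuming the bound is applied to the joint (global) norm of all gradients, similar in spirit to TensorFlow's tf.clip_by_global_norm. The function below is an illustration, not the code used for training.

```python
import math


def clip_by_global_norm(gradients, max_norm):
    """Rescale a list of gradient vectors so their joint L2 norm is <= max_norm.

    `gradients` is a list of flat lists of floats, one per parameter tensor.
    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    global_norm = math.sqrt(sum(g * g for grad in gradients for g in grad))
    if global_norm <= max_norm:
        # Small gradients pass through unchanged.
        return [list(grad) for grad in gradients], global_norm
    scale = max_norm / global_norm
    return [[g * scale for g in grad] for grad in gradients], global_norm


# Two toy gradient vectors with a joint norm of 13, clipped to the bound of 5.
clipped, norm_before = clip_by_global_norm([[3.0, 4.0], [0.0, 12.0]], max_norm=5.0)
```

Clipping rescales all gradients by the same factor, so the direction of the update stays the same and only its length is limited.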

The final models that power the text box above run on an AWS instance. As it turns out, five TensorFlow models with 8 to 13 million parameters each can run simultaneously on a single t2.micro instance with only 1 GiB of RAM.
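The 8 to 13 million parameters can be checked with a bit of arithmetic. The sketch below assumes standard LSTM cells (four gates, each with input weights, recurrent weights and a bias, no peepholes or projections) followed by the dense softmax layer. The function names are made up for this example.

```python
def lstm_layer_params(input_size, hidden_size):
    """Parameters of one standard LSTM layer: 4 gates x (weights + bias)."""
    return 4 * (hidden_size * (input_size + hidden_size) + hidden_size)


def model_params(vocab_size, hidden_size, num_layers):
    """Stacked LSTM layers plus a dense softmax layer over the vocabulary."""
    total = 0
    input_size = vocab_size
    for _ in range(num_layers):
        total += lstm_layer_params(input_size, hidden_size)
        input_size = hidden_size  # deeper layers read the layer below
    total += hidden_size * vocab_size + vocab_size  # dense weights + bias
    return total


three_layers = model_params(vocab_size=90, hidden_size=795, num_layers=3)
two_layers = model_params(vocab_size=90, hidden_size=795, num_layers=2)
```

Under these assumptions the three-layer models come out at 13,007,880 parameters and the two-layer models at 7,948,500.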

Resetting the LSTM state

I initially trained the LSTM in a stateful manner, meaning that the LSTM's state never gets reset to zero. This allows the model to keep information about the current context beyond the 160 time steps of backpropagation through time. However, I found that this causes the model to generate sentences like this on the Wikipedia dataset:

The Victorian Artist of the Year 1943) was a student of the University of California...
The Victorian Army and the United States Congress of the United States) and the Committee of the American...

Sounds good except for that closing bracket in both sentences. I assume that happens because the model only sees a zero hidden state once during training - at the very beginning. After that, the state never gets reset to a zero vector. Thus, the model can safely store information in the hidden state and even attribute information to zeros in the hidden state, such as "I need to close the bracket soon". For validation and sampling, however, the model starts again with a zero state, so it closes a bracket that was never opened. Resetting the hidden state to zero every now and then solves this problem. I reset the LSTM every 20 training steps, i.e. 3,200 time steps.
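The reset schedule can be sketched as a small training loop. train_step and zero_state are hypothetical callbacks standing in for one truncated-backpropagation update and for creating a zero LSTM state; the toy versions below only exist so that the sketch runs without TensorFlow.

```python
def train_with_periodic_reset(batches, train_step, zero_state, reset_every=20):
    """Carry the recurrent state across steps, but zero it every `reset_every` steps.

    train_step(batch, state) -> new_state is assumed to run one update
    (160 time steps in my setup) and return the final state of that batch.
    """
    state = None
    num_resets = 0
    for step, batch in enumerate(batches):
        if step % reset_every == 0:
            # The model regularly sees a zero state, as it will when sampling.
            state = zero_state()
            num_resets += 1
        state = train_step(batch, state)
    return state, num_resets


# Toy stand-ins: the "state" is just the number of steps it has been
# carried since the last reset.
def toy_zero_state():
    return 0


def toy_train_step(batch, state):
    return state + 1


final_state, num_resets = train_with_periodic_reset(
    batches=range(45), train_step=toy_train_step, zero_state=toy_zero_state)
```

With 45 toy batches the state is reset at steps 0, 20 and 40, so it has been carried for 5 steps at the end.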

Mutual Perplexity

I also fed every validation dataset to each of the models and measured their average character perplexity (with base e instead of 2). Here is how confused they were:

               Trained on
Evaluated on   Sherlock   Wikipedia   Congress   South Park   Goethe
Sherlock           2.94        8.26       5.21         6.25   183.07
Wikipedia          4.35        3.16       5.36         7.31   186.79
Congress           7.39        3.77       2.21         7.68   270.40
South Park         7.10        4.90       6.04         3.35   206.41
Goethe            66.69       14.43      35.53        49.41     5.36

We can see that Goethe was quite confused by the English language.
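For reference, this is how a per-character perplexity can be computed from the probabilities a model assigned to the characters that actually came next. It is a minimal sketch with a made-up function name, not the evaluation code itself.

```python
import math


def perplexity(target_probs):
    """Perplexity = exp(average negative log-probability in nats).

    `target_probs` holds, for each position, the probability the model
    assigned to the character that actually followed.
    """
    nats_per_char = -sum(math.log(p) for p in target_probs) / len(target_probs)
    return math.exp(nats_per_char)


# A model that guesses uniformly among 90 characters is exactly
# as confused as the vocabulary is large.
uniform = perplexity([1.0 / 90] * 10)
```

A perplexity of 3 roughly means that the model is, on average, as uncertain as if it had to choose uniformly between 3 characters.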

Code

I published my code on GitHub and as a PyPI package that lets you create your own language model in just a few lines of code:

import tensorflow as tf
from tensorlm import CharLM

with tf.Session() as session:

    # Create a new model. You can also use WordLM
    model = CharLM(session, "datasets/sherlock/tinytrain.txt", max_vocab_size=96,
                   neurons_per_layer=100, num_layers=3, num_timesteps=15)

    # Train it
    model.train(session, max_epochs=10, max_steps=500)

    # Let it generate a text
    generated = model.sample(session, "The ", num_steps=100)
    print("The " + generated)
