Introduction to Deep Learning
In this task, we will once again deal with the issue of language modeling. This time, however, we will be using Tensorflow’s RNN functionalities, which make defining models significantly easier. On the flip side, we will be dealing with the issues that come with variable-length inputs. We will stick with character-level models for now; while word models are more common in practice, they come with their own problems that we will deal with at a later time.
Once again we provide you with a script to help you process raw text into a format Tensorflow can understand. Download the script here. This script differs from the previous one in a few regards:
- The script takes a regular expression that is used to split the raw text into separate sequences. For example, use [0-9]+:[0-9]+ for the King James Bible (available via Project Gutenberg) to split the text into verses, or try \n\n+ on the Shakespeare text to split it into monologues. Both should give you about 30,000 sequences each, with lengths peaking around 100 characters. Note: Depending on your OS you might need to supply \\n\\n+ for Shakespeare.
- There is a maxlen argument that will remove any sequences longer than this threshold. This is useful to remove things such as the Project Gutenberg disclaimers and generally keep your program from exploding due to overly long inputs.
- This also means that you need to provide the data to Tensorflow in a slightly different way during training: use padded_batch instead of batch (a short sketch follows below).
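A minimal sketch of what this could look like with the tf.data pipeline; the variable names and the toy data here are made up for illustration:

```python
import tensorflow as tf

# Toy stand-in for the integer-encoded character sequences produced by the
# preprocessing script; 0 is assumed to be the padding symbol.
sequences = [[5, 12, 3], [7, 2], [9, 4, 4, 1, 8]]

dataset = tf.data.Dataset.from_generator(
    lambda: iter(sequences), output_types=tf.int32, output_shapes=[None])

# padded_batch pads every sequence in a batch with 0 up to the length of the
# longest sequence in that batch, so the batch fits into one dense tensor.
dataset = dataset.padded_batch(2, padded_shapes=[None])

batch = dataset.make_one_shot_iterator().get_next()  # shape: batch x max_time
```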
Defining an RNN is much simpler when using the full Tensorflow library. Again, there are essentially no official tutorials on this, so here are the basic steps (a sketch putting them together follows after the list):
- The various RNN cells live in tf.nn.rnn_cell. For example, there are the classes BasicRNNCell, LSTMCell and GRUCell. Note that these take different parameters. Building a “deep” RNN is as simple as wrapping a list of cells in a MultiRNNCell.
- There is tf.nn.static_rnn, which basically implements what you did last assignment: a computation graph unrolled over time that can only deal with sequences of a specific length. In this case, you should use tf.nn.dynamic_rnn instead, which dynamically unrolls the computation graph as needed. This means it can deal with sequences of any length.
- dynamic_rnn returns two things. The first is a batch_size x time x output_size tensor of outputs over time (only of the highest layer in case of multiple layers). The second output holds the final state of all layers (not needed for training, but useful for sampling). Note that for the pre-defined cells, usually output_size == state_size and the outputs are just the states. That is, we still need to apply the output layer to get the logits.
- Reshape the outputs to batch_size*time x state_size, use a dense layer and compare to the (also reshaped) targets for the costs (or reshape the outputs back to 3D). Alternatively, you can use tf.tensordot to directly multiply the 3D outputs with the 2D output weight matrix; use axes=1. If you do this, make sure you understand why/how it works. Note that the softmax_cross_entropy_with_logits function will automatically flatten the labels/logits to 2D, so you can just put the 3D tensors into this loss.
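Putting these steps together, the graph definition could look roughly like the sketch below. The placeholder names, vocabulary size, hyperparameters and optimizer are assumptions made only for illustration, and the padding problem discussed next is still ignored here:

```python
import tensorflow as tf

# Assumed hyperparameters, chosen only for illustration.
vocab_size = 100   # number of characters, including the padding symbol
state_size = 256
num_layers = 2

inputs = tf.placeholder(tf.int32, [None, None])   # batch x time, character IDs
targets = tf.placeholder(tf.int32, [None, None])  # batch x time, shifted by one

one_hot_inputs = tf.one_hot(inputs, vocab_size)   # batch x time x vocab_size

# One cell per layer; wrapping the list in MultiRNNCell gives a "deep" RNN.
cells = [tf.nn.rnn_cell.GRUCell(state_size) for _ in range(num_layers)]
cell = tf.nn.rnn_cell.MultiRNNCell(cells)

# dynamic_rnn unrolls the graph as needed. outputs is batch x time x state_size
# (top layer only); state holds the final state of every layer.
outputs, state = tf.nn.dynamic_rnn(cell, one_hot_inputs, dtype=tf.float32)

# Output layer: flatten to 2D, apply a dense layer, reshape the logits back to 3D.
flat_outputs = tf.reshape(outputs, [-1, state_size])
flat_logits = tf.layers.dense(flat_outputs, vocab_size)
logits = tf.reshape(
    flat_logits, [tf.shape(inputs)[0], tf.shape(inputs)[1], vocab_size])

# softmax_cross_entropy_with_logits flattens labels/logits itself, so the 3D
# tensors can go in directly. Padding is NOT handled yet (see below).
step_costs = tf.nn.softmax_cross_entropy_with_logits(
    labels=tf.one_hot(targets, vocab_size), logits=logits)  # batch x time
cost = tf.reduce_mean(step_costs)

train_op = tf.train.AdamOptimizer().minimize(cost)  # example optimizer choice
```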
You may have noticed that there is one problem remaining: We padded shorter sequences to get valid tensors, but the RNN functions as well as the cost computations have no way of actually knowing that we did this. This means we most likely get nonsensical outputs (and thus costs) for all those sequence elements that correspond to padding. Let’s fix these issues (a sketch of the masked cost follows after the list below):
- dynamic_rnn takes a sequence_length argument, which should for each sequence in the batch contain an integer giving the actual sequence length without padding. The RNN will essentially ignore inputs for a given sequence once it’s past the corresponding sequence length, not updating the state and producing dummy outputs. Compute the sequence lengths as the number of non-zero elements (since we’re using 0 as padding symbol) for each sequence. tf.not_equal should be useful here. It outputs a tensor of bools; cast these to integers before summing.
- The costs also need to be masked so that the padded positions do not contribute; once again, tf.not_equal does the job. There is also tf.sequence_mask that will produce such a mask from the lengths we computed beforehand, but since we essentially built the mask already to get the lengths, this would be silly.
- Be careful when averaging the masked costs with tf.reduce_mean: Note that the mean is just the sum divided by the number of elements, and this number would also include padding elements! Instead, you should use reduce_sum and then divide by the number of “real” elements (remember the mask!).
- Alternatively, you can use tf.contrib.seq2seq.sequence_loss. Note that this expects targets to be indices, not one-hot vectors. The mask should go into the weights argument (cast to floats). Unfortunately, this won’t work if you used tensordot earlier, because the output of that has an unknown shape, which leads to a crash in the loss.
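Here is a sketch of how the sequence lengths, the mask and the masked cost could fit together. It reuses the made-up names from the earlier sketch (inputs, targets, cell, one_hot_inputs, logits, vocab_size) and would replace the corresponding dynamic_rnn and cost lines there:

```python
# Sequence lengths: count the non-padding (non-zero) elements per sequence.
mask = tf.cast(tf.not_equal(inputs, 0), tf.float32)       # batch x time
lengths = tf.reduce_sum(tf.cast(mask, tf.int32), axis=1)  # batch

# Tell dynamic_rnn about the real lengths so it stops updating the state and
# only produces dummy outputs past the end of each sequence.
outputs, state = tf.nn.dynamic_rnn(
    cell, one_hot_inputs, sequence_length=lengths, dtype=tf.float32)

# Per-step cross-entropy, zeroed out at padded positions via the mask, then
# averaged over the number of *real* elements instead of using reduce_mean.
step_costs = tf.nn.softmax_cross_entropy_with_logits(
    labels=tf.one_hot(targets, vocab_size), logits=logits)  # batch x time
cost = tf.reduce_sum(step_costs * mask) / tf.reduce_sum(mask)
```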
While this was a lot of explanation, your final program should be far more succinct than the previous one, and it’s more flexible as well! Look at the computation graph of this network to see how compactly the RNN is represented. Experiment with different cell types or even multiple layers and see the effects on the cost. Also evaluate the quality of samples from the network.
Sampling works a lot like before: We can input single characters by simply treating them as length-1 sequences. The process should be stopped not after a certain number of steps, but when the special character </S> is sampled (you could also continue sampling and see how your network breaks down…). Once again, supplying the initial state as a placeholder should help; note that if you use a MultiRNNCell, this needs to be a tuple of states.
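A possible sampling loop is sketched below. Everything in it builds on assumed names: state_in would be a tuple of state placeholders (one per layer) fed as initial_state to dynamic_rnn, probs the softmax of the logits for the single time step (a vector of vocab_size probabilities), and START_ID / EOS_ID the assumed IDs of the start and </S> characters:

```python
import numpy as np

def sample_sequence(sess, max_steps=500):
    """Sample character IDs one at a time until </S> appears."""
    current = START_ID
    # MultiRNNCell state: one (batch=1, state_size) array per layer.
    current_state = tuple(
        np.zeros([1, state_size], dtype=np.float32) for _ in range(num_layers))
    sampled = []
    for _ in range(max_steps):
        p, current_state = sess.run(
            [probs, state],
            feed_dict={inputs: [[current]],
                       **dict(zip(state_in, current_state))})
        p = p / p.sum()  # guard against float32 rounding in the softmax
        current = np.random.choice(len(p), p=p)
        if current == EOS_ID:  # stop once the special </S> character is sampled
            break
        sampled.append(current)
    return sampled
```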
Finally, keep in mind that language modeling is actually about assessing the “quality” of a piece of language, usually formalized via probabilities. The probability of a given sequence is simply the product of the probability outputs (for the character actually appearing next) at each time step (i.e. apply softmax to the logits). Try this out: Compute the probabilities for some sequences typical for the training corpus (you could just take them straight from there). Compare these to the probabilities for some not-so-typical sequences. Note that only sequences of the same length can be compared since longer sequences automatically receive a lower probability. For example, in the King James corpus you could simply replace the word “LORD” by “LOOD” somewhere and see how this affects the probability of the overall sequence.
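As a sketch of this probability computation, the helper below scores one sequence, again assuming the inputs placeholder and logits tensor from the earlier sketches, with seq being a list of character IDs ending in the </S> ID:

```python
import numpy as np
import tensorflow as tf

probs_over_time = tf.nn.softmax(logits)  # batch x time x vocab_size

def sequence_probability(sess, seq):
    # Feed the sequence (without its last character) as a batch of one.
    p = sess.run(probs_over_time, feed_dict={inputs: [seq[:-1]]})[0]  # time x vocab
    # At every step, pick the probability of the character that actually follows.
    step_probs = p[np.arange(len(seq) - 1), seq[1:]]
    return float(np.prod(step_probs))  # product over all time steps
```

For longer sequences, summing log probabilities is numerically safer than multiplying the raw probabilities, but for comparing short sequences of equal length the product is fine.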