Introduction to Deep Learning
no submission necessary
In this task, we will once again deal with the issue of language modeling. This
time, however, we will be using Tensorflow’s RNN functionalities, which makes
defining models significantly easier. On the flip side, we will be dealing with
the issues that come with variable-length inputs. This in turn makes defining
models significantly more complicated.
You are also asked to try wrapping everything into the high-level Estimator
interface, which will require a few workarounds.
We will stick with character-level models for now; while word models are more
common in practice, they come with additional problems.
Once again we provide you with a script to help you process raw text into a format Tensorflow can understand. Download the script here. This script differs from the previous one in a few regards:

- It takes a delimiter (interpreted as a regular expression) that determines how the raw text is split into sequences. You can use \n as a simple baseline (you might need to provide \\n instead depending on your OS – check whether the resulting sequence lengths/number of sequences is reasonable; it looks like \n works with Windows and \\n with Linux/Mac). More interesting results should come from a sensible delimiter for the given corpus. For example, try [0-9]+:[0-9]+ for the King James Bible (available via Project Gutenberg) to split the text into verses, or try \n\n+ (or \\n\\n+, see above) on the Shakespeare text to split it into monologues. Both should give you about 30,000 sequences, with lengths peaking around 100 characters.
- It takes a maxlen argument that will remove any sequences longer than this threshold. This is useful to remove things such as the Project Gutenberg disclaimers and generally keeps your program from exploding due to overly long inputs.

This also means that you need to provide the data to Tensorflow in a slightly
different way during training:
At the end of the day, Tensorflow works on tensors, and these have fixed
sizes in each dimension. That is, sequences of different lengths can’t be put
in the same tensor (and thus not in the same batch). The standard work-around
for this issue is padding: Sequences are filled up with “dummy values” to get
them all to the same length (namely, that of the longest sequence of the
batch). The most straightforward approach is to simply add these dummy values
at the end, and the most common value to use for this is 0. Doing padding is
simple in Tensorflow: use padded_batch instead of batch in tf.data.
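For illustration, a minimal sketch of such a pipeline (the names sequences, make_dataset and the batch size are illustrative, not prescribed by the script):

```python
import tensorflow as tf

def make_dataset(sequences, batch_size=64):
    # `sequences`: a Python list of 1-D int arrays of character indices,
    # one per (variable-length) sequence, with 0 reserved as the padding symbol.
    ds = tf.data.Dataset.from_generator(
        lambda: iter(sequences), tf.int32, tf.TensorShape([None]))
    ds = ds.shuffle(10000)
    # padded_batch fills each sequence with 0s up to the longest one in the batch
    return ds.padded_batch(batch_size, padded_shapes=[None])
```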
Finally, a note on the “input function” for tf.Estimator: You may have read that this should return a two-tuple (features, labels) that will be passed to the model function. However, in this case we don’t really need labels, since our targets are just the features shifted by one step. You can simply have your input function return None as the labels; of course, you can also explicitly pass the shifted inputs as labels instead if you prefer this.
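Wrapped into an input function for tf.Estimator, this could look roughly as follows (make_dataset refers to the sketch above; returning None as the labels follows the note just made):

```python
def input_fn():
    ds = make_dataset(sequences).repeat()
    features = ds.make_one_shot_iterator().get_next()   # [batch, max_len] int32
    return features, None   # no labels: the targets are the inputs shifted by one step
```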
Defining an RNN is much simpler when using the full Tensorflow library. Again, there are essentially no official tutorials on this, so here are the basic steps:

- Pre-defined RNN cells live in tf.nn.rnn_cell. For example, there are the classes BasicRNNCell, LSTMCell and GRUCell. Note that these take different parameters. Building a “deep” RNN is as simple as wrapping a list of cells in a MultiRNNCell.
- To run a cell over a full input sequence, there is tf.nn.static_rnn, which basically implements what you did last assignment: a computation graph unrolled over time that can only deal with sequences of a specific length. In this case, you should use tf.nn.dynamic_rnn instead, which dynamically unrolls the computation graph as needed. This means it can deal with sequences of any length.
- dynamic_rnn returns two outputs: the first is a batch_size x time x output_size tensor of outputs over time (only of the highest layer in case of multiple layers); the second holds the final state of all layers (not needed for training, but can be useful for sampling). Note that for the pre-defined cells, usually output_size == state_size and the outputs are just the states. That is, we still need to apply the output layer to get the logits.
- Since the RNN output is a 3D tensor, you can’t simply use tf.matmul here (both inputs need to be 2D for that function). Luckily, tf.layers.dense is quite flexible and will simply do a “tensor product” over the last dimension of the input if it has more than two dimensions. This suits us just fine, so you can use a dense layer to compute the outputs (logits). Feel free to use tf.tensordot yourself to learn how it works (at least read the API docs).
- The softmax_cross_entropy_with_logits function will automatically reshape the labels/logits to 2D, so you can just put the 3D tensors (logits/targets) into this loss.

The very least you should do is to re-implement the task from last assignment with these functionalities. That is, you may work with fixed, known sequence lengths as a start. However, the real task lies ahead, and you may skip the re-implementation and go straight for that one if you wish.
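To make these steps concrete, here is a minimal sketch, assuming one-hot inputs of shape [batch, time, vocab_size] and ignoring padding for now (that is dealt with below); the variable names and sizes are illustrative:

```python
import tensorflow as tf

vocab_size = 128                                               # illustrative value
inputs = tf.placeholder(tf.float32, [None, None, vocab_size])  # one-hot input sequences
targets = tf.placeholder(tf.float32, [None, None, vocab_size]) # inputs shifted by one step

cell = tf.nn.rnn_cell.LSTMCell(num_units=256)
# a "deep" RNN would instead wrap a list of cells:
# cell = tf.nn.rnn_cell.MultiRNNCell([tf.nn.rnn_cell.LSTMCell(256) for _ in range(2)])

# dynamic_rnn unrolls the graph as needed and returns per-step outputs + final state
outputs, final_state = tf.nn.dynamic_rnn(cell, inputs, dtype=tf.float32)

# the dense layer applies the output layer over the last dimension of the 3D tensor
logits = tf.layers.dense(outputs, vocab_size)   # [batch, time, vocab_size]

# padding is not yet accounted for here (see below)
loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=targets, logits=logits))
```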
You may have noticed that there is one problem remaining: We padded shorter sequences to get valid tensors, but the RNN functions as well as the cost computations have no way of actually knowing that we did this. This means we most likely get nonsensical outputs (and thus costs) for all those sequence elements that correspond to padding. Let’s fix these issues.
- dynamic_rnn takes a sequence_length argument, which should contain, for each sequence in the batch, an integer giving the actual sequence length without padding. The RNN will essentially ignore inputs for a given sequence once it’s past the corresponding sequence length, not updating the state and producing dummy outputs. Compute the sequence lengths as the number of non-zero elements (since we’re using 0 as the padding symbol) minus one for each sequence. tf.not_equal should be useful here; it outputs a tensor of bools – cast these to integers before summing. The minus one comes because the last non-padding sequence element (the end-of-sequence character) isn’t used to predict anything, so it isn’t “really” part of the input sequence.
- To keep the padded positions out of the cost, you also need a mask that is 1 for “real” sequence elements and 0 for padding. There is tf.sequence_mask that will produce such a mask from the lengths we computed earlier, but since we essentially built the mask already to get the lengths, this would be silly.
- Be careful when averaging the masked cost with tf.reduce_mean: the mean is just the sum divided by the number of elements, and this number would also include padding elements! Instead, you should use reduce_sum and then divide by the number of “real” elements (remember the mask!), which differs for each element of the batch.
- Alternatively, there is tf.contrib.seq2seq.sequence_loss. Note that this expects targets to be indices, not one-hot vectors. The mask should go into the weights argument (cast to floats).

While this was a lot of explanation, your program should hopefully be more succinct than the previous one, and it’s more flexible as well! Look at the computation graph of this network to see how compactly the RNN is represented. Experiment with different cell types or even multiple layers and see the effects on the cost. Be prepared for significantly longer training times than with feedforward networks such as CNNs.
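A sketch of the masked version, assuming the padded index sequences arrive from the input function as a tensor seqs of shape [batch, time] and that cell and vocab_size are defined as before (names illustrative); for brevity this uses tf.sequence_mask, even though, as noted above, you could reuse the comparison you already did:

```python
import tensorflow as tf

# seqs: padded index sequences from the input function, shape [batch, time]
inputs, targets = seqs[:, :-1], seqs[:, 1:]     # one way to realize the one-step shift

# actual lengths: non-zero entries minus one (the end-of-sequence char predicts nothing)
seq_lengths = tf.reduce_sum(tf.cast(tf.not_equal(seqs, 0), tf.int32), axis=1) - 1

one_hot_inputs = tf.one_hot(inputs, vocab_size)
outputs, _ = tf.nn.dynamic_rnn(cell, one_hot_inputs,
                               sequence_length=seq_lengths, dtype=tf.float32)
logits = tf.layers.dense(outputs, vocab_size)

# mask padded positions out of the loss and average over "real" elements only
mask = tf.cast(tf.sequence_mask(seq_lengths, maxlen=tf.shape(inputs)[1]), tf.float32)
step_losses = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=targets,
                                                             logits=logits)
loss = tf.reduce_sum(step_losses * mask) / tf.reduce_sum(mask)

# or equivalently, letting the library handle the masking (targets are indices here):
# loss = tf.contrib.seq2seq.sequence_loss(logits, targets, weights=mask)
```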
Sampling with tf.Estimator
Unfortunately, by using tf.Estimator we lose the low-level control needed to feed samples of the network’s output back in as its next input, step by step. To do sampling, you could just do a low-level implementation again. In this case, it works a lot like before: We can input single characters by simply treating them as length-1 sequences. The process should be stopped not after a certain number of steps, but when the special character </S> is sampled (you could also continue sampling and see if your network breaks down…). Once again, supplying the initial state as a placeholder should help – note that if you use a MultiRNNCell, this needs to be a tuple of states.
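A rough sketch of such a sampling loop, assuming a single GRUCell (its state is a single tensor, which keeps the placeholder simple; with a MultiRNNCell you would need a tuple of state placeholders). The names cell, num_units, vocab_size, sos_index, eos_index and sess are illustrative, and in a real script the cell and output-layer variables would have to be shared with the trained model (e.g. via variable scopes):

```python
import numpy as np
import tensorflow as tf

char_ph = tf.placeholder(tf.int32, [1, 1])                  # one length-1 sequence
state_ph = tf.placeholder(tf.float32, [1, num_units])       # previous hidden state

one_hot = tf.one_hot(char_ph, vocab_size)                   # [1, 1, vocab_size]
rnn_out, next_state = tf.nn.dynamic_rnn(cell, one_hot, initial_state=state_ph)
logits = tf.layers.dense(rnn_out[:, 0, :], vocab_size)      # in practice: reuse the trained layer
next_char = tf.multinomial(logits, num_samples=1)[0, 0]     # sample one character index

# Python-side loop: start from <S> and a zero state, feed the sampled character
# and the returned state back in, and stop once </S> is sampled.
sampled, char, state = [], sos_index, np.zeros((1, num_units), np.float32)
while char != eos_index:
    char, state = sess.run([next_char, next_state],
                           {char_ph: [[char]], state_ph: state})
    sampled.append(char)
```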
But can we do sampling using tf.Estimator as well? As it turns out, we can slightly abuse the sequence-to-sequence framework in tf.contrib.seq2seq for this task. Consider this part optional, as it introduces many additional concepts. However, it can be instructive to learn how to work around the restrictions of high-level frameworks without having to sacrifice all of their benefits.
There is a seq2seq tutorial on the TF website. This deals with machine translation using encoder-decoder architectures. In our case, we basically only have a decoder that generates from a fixed initial state (usually a zero state). To adapt your RNN to allow for random sampling, you need to take the following steps (most of the mentioned classes/functions live in tf.contrib.seq2seq):
- For training, create a TrainingHelper that takes the one-hot inputs and the tensor containing sequence lengths. Note: TrainingHelper seems to be quite “smart” in that, apparently, if the maximum sequence length provided is smaller than the length of the input, the output length will be as well. That is, outputs will not necessarily be provided for all inputs, only for those covered by some sequence length. Keep this in mind in case you run into shape mismatches.
- Create a BasicDecoder using your RNN cell, the aforementioned helper, some initial state (the zero_state method of your RNN cell comes to mind) and an output layer (i.e. the Dense layer to produce the logits goes in here).
- Run the decoder with dynamic_decode. This returns a 3-tuple (the seq2seq tutorial is outdated and assumes a 2-tuple); the decoder outputs are the first element of this tuple.
- For training, the logits can be found in outputs.rnn_output; proceed as usual from there.
- For sampling, you need a different Helper. To get random output, this needs to include some kind of sampling. There is SampleEmbeddingHelper, but this only makes sense if you are using an additional character embedding before the RNN – if you have been following this tutorial, you aren’t doing this. Otherwise, the best choice seems to be InferenceHelper. You can use tf.multinomial as sample_fn. As end_fn you should check whether the end-of-sequence character was generated. next_inputs_fn should turn the sampled indices into one-hot vectors. Finally, start_inputs should be a batch of one-hot vectors encoding the beginning-of-sequence character.
- When sampling, outputs is once again the first element of the 3-tuple returned by dynamic_decode; the generated samples can be found in outputs.sample_id.
- Do the sampling in predict mode. Note that you still need to provide inputs to the model to generate samples, but the samples are actually completely independent from those inputs (since they are random and always start from the initial state), meaning that you can provide “dummy inputs” to the model if you wish.

A rough code sketch putting these pieces together follows below.

Finally, keep in mind that language modeling is actually about assessing the “quality” of a piece of language, usually formalized via probabilities. The probability of a given sequence is simply the product of the probability outputs for the character actually appearing next at each time step. Try this out: Compute the probabilities for some sequences typical of the training corpus (you could just take them straight from there). Compare these to the probabilities for some not-so-typical sequences. Note that only sequences of the same length can be compared, since longer sequences automatically receive a lower probability. For example, in the King James corpus you could simply replace the word “LORD” by “LOOD” somewhere and see how this affects the probability of the overall sequence.
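Putting the seq2seq steps from the list above together, here is a rough sketch; it assumes cell, vocab_size, batch_size, the one-hot training inputs, the sequence lengths and the start/end-of-sequence indices are available under the illustrative names used below:

```python
import tensorflow as tf
from tensorflow.contrib import seq2seq

output_layer = tf.layers.Dense(vocab_size)     # produces the logits inside the decoder

# training: decode the known (one-hot, padded) inputs
train_helper = seq2seq.TrainingHelper(one_hot_inputs, seq_lengths)
train_decoder = seq2seq.BasicDecoder(
    cell, train_helper,
    initial_state=cell.zero_state(batch_size, tf.float32),
    output_layer=output_layer)
outputs, final_state, final_lengths = seq2seq.dynamic_decode(train_decoder)
logits = outputs.rnn_output                    # proceed with the masked loss as before

# sampling: a helper that samples from the logits and feeds its own output back in
sample_helper = seq2seq.InferenceHelper(
    sample_fn=lambda out: tf.squeeze(tf.multinomial(out, 1), axis=-1),
    sample_shape=[], sample_dtype=tf.int64,
    start_inputs=tf.one_hot(tf.fill([batch_size], sos_index), vocab_size),
    end_fn=lambda ids: tf.equal(ids, eos_index),
    next_inputs_fn=lambda ids: tf.one_hot(ids, vocab_size))
sample_decoder = seq2seq.BasicDecoder(
    cell, sample_helper,
    initial_state=cell.zero_state(batch_size, tf.float32),
    output_layer=output_layer)
# cap the number of steps in case </S> is never sampled
samples, _, _ = seq2seq.dynamic_decode(sample_decoder, maximum_iterations=500)
sampled_ids = samples.sample_id                # the generated character indices
```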
In the Estimator interface, getting “your own” sequences into the network can be a bit annoying, since you need to work via input functions. You have (at least) two options; one is to use tf.data.Dataset.from_generator – this allows you to run arbitrary Python code within a generator and yield inputs to the model function as you deem appropriate. Either way, you will then want to run your model in predict mode and get the probabilities that way.
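A rough sketch of the from_generator route; it assumes your model function puts the per-step softmax outputs and the target indices into its predictions dict under the keys "probs" and "targets" (those keys, the encode helper and the probe sentences are all assumptions for illustration, not given by the assignment):

```python
import numpy as np
import tensorflow as tf

probes = ["and the LORD said unto Moses", "and the LOOD said unto Moses"]

def probe_input_fn():
    def gen():
        for text in probes:
            yield np.asarray(encode(text), dtype=np.int32)  # encode: string -> indices
    ds = tf.data.Dataset.from_generator(gen, tf.int32, tf.TensorShape([None]))
    ds = ds.padded_batch(1, padded_shapes=[None])           # one sequence per batch
    return ds.make_one_shot_iterator().get_next(), None

for text, pred in zip(probes, estimator.predict(probe_input_fn)):
    step_probs = pred["probs"]    # [time, vocab_size] softmax outputs per step
    targets = pred["targets"]     # [time] indices of the characters that actually follow
    # the sequence probability is the product of the per-step probabilities of the
    # true next character; summing log-probabilities avoids numerical underflow
    log_prob = sum(np.log(step_probs[t, c]) for t, c in enumerate(targets))
    print(text, log_prob)
```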