Introduction to Deep Learning
In this assignment, you will tackle language modeling using RNNs. Language modeling forms an important basis for most NLP applications such as tagging, parsing or machine translation. However, it can also be used on its own to generate “natural” language.
A language model assigns a probability (or, more generally, some kind of score) to a piece of text. Most of the time, this is done by interpreting the text as a sequence of words and computing probabilities of each word given the previous ones. Check out this Wikipedia article for a quick overview, especially on the classic n-gram models.
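In symbols, this is the usual chain-rule factorization:

```latex
% A language model scores a word sequence w_1, ..., w_T via the chain rule:
P(w_1, \dots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1})

% A classic n-gram model truncates the conditioning context to the last n-1 words:
P(w_t \mid w_1, \dots, w_{t-1}) \approx P(w_t \mid w_{t-n+1}, \dots, w_{t-1})
```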
A consequence of having a probability distribution over words given previous words is that we can sample from this distribution. This way, we can generate whole sequences of language (usually of questionable quality and sense).
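Schematically, such ancestral sampling looks like the sketch below; next_word_probs and vocab are hypothetical stand-ins for whatever model and vocabulary you use, not something provided with the assignment:

```python
import numpy as np

def sample_sequence(next_word_probs, vocab, start_token="<S>", max_len=20):
    """Draw each word from the model's conditional distribution given the
    words generated so far (ancestral sampling).

    next_word_probs(prefix) is assumed to return a probability vector over
    vocab -- a hypothetical interface, not part of the assignment code.
    """
    sequence = [start_token]
    for _ in range(max_len):
        probs = next_word_probs(sequence)             # distribution over the vocabulary
        idx = np.random.choice(len(vocab), p=probs)   # sample the next word
        sequence.append(vocab[idx])
    return sequence
```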
Language Modeling can also be done on a character level, however. That is, the text is predicted character-for-character instead of word-for-word. n-gram models quickly fail here due to their limited context. RNNs offer a compelling alternative due to their memory reaching back an arbitrary amount of time (in theory). Read through this “famous” blog post by Andrej Karpathy to get an impression of what can be done here.
The basic idea is that we train the RNN to predict the next element of a sequence given the previous elements. That is, at each time step the RNN receives a character as input. From this input and its current state, it (computes a new state and) produces a probability distribution over the next character.
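Written out for a single time step, a plain (“vanilla”) RNN of the kind you are asked to build might look like this; the weight names are our own convention:

```latex
% One time step: x_t is the one-hot encoded input character,
% h_{t-1} the previous state, p_t the distribution over the next character.
h_t = \tanh\left( x_t W_{xh} + h_{t-1} W_{hh} + b_h \right)
p_t = \mathrm{softmax}\left( h_t W_{hy} + b_y \right)
```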
Tensorflow has a reputation of having not-so-great support for RNNs, though this has gotten much better in recent times. However, an RNN “layer” can be confusing due to its black-box nature: all computations over a full sequence of inputs are done internally. To make sure you understand how an RNN “works”, you are asked to implement one from the ground up, defining variables yourself and using basic operations such as tf.matmul
to define the “unrolled” computation graph. There is an RNN tutorial on the Tensorflow website, but this is severely lacking, presenting incomplete code snippets out of context while the full tutorial code is extremely bloated. Also, you are asked not to use the RNNCell
classes for now. You might want to proceed as follows:
- Create a TFRecordDataset from the provided data and map it via the parse_seq function we provide. Hint: You will need to create a new function from this with a fixed sequence length that only takes an example as input, e.g. data.map(lambda x: parse_seq(x, 200)) for sequences of length 200.
- Having prepared the data, build an RNN as follows (a minimal sketch of such a graph is given below):
  - One-hot encode the inputs, i.e. transform the batch_size x seq_len tensor of character indices into batch_size x seq_len x vocab_size.
  - Define the initial state as a placeholder of shape batch_size x state_size and feed it with zeros at training time.

For now you might be happy with just training the RNN. Experiment with different layer sizes or sequence lengths. As a reference, an average loss of ~1.5 should be achievable on the Shakespeare corpus using length-200 sequences, with 512 hidden units (batch size 128 and Adam optimizer). Visualize the computation graph in Tensorboard and contemplate your life choices. If you’re feeling fancy, you could even construct a “deep” RNN (stacking multiple RNN layers) or implement more advanced architectures such as LSTMs or GRUs, but these will appear in later assignments anyway.
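To make these steps concrete, here is a minimal sketch of the data pipeline and the unrolled graph, written against the TensorFlow 1.x API that the rest of the assignment implies (placeholders, tf.train.Saver). The file name, vocabulary size, the exact shape returned by parse_seq and the input/target split are assumptions for illustration, not specifications from the assignment:

```python
import numpy as np
import tensorflow as tf

# Reference hyperparameters from the text above; VOCAB_SIZE depends on your corpus.
SEQ_LEN, BATCH_SIZE, STATE_SIZE, VOCAB_SIZE = 200, 128, 512, 65

# --- Data pipeline: parse_seq is the function provided with the assignment. ---
data = tf.data.TFRecordDataset("shakespeare.tfrecords")    # placeholder path
data = data.map(lambda x: parse_seq(x, SEQ_LEN))           # fixed sequence length
data = data.repeat().batch(BATCH_SIZE)
seqs = data.make_one_shot_iterator().get_next()            # assumed: batch_size x seq_len character indices

# Predict each character from its predecessors (this split is an assumption about parse_seq's output).
inputs, targets = seqs[:, :-1], seqs[:, 1:]

# --- Variables of a single RNN "cell", shared across all time steps. ---
W_xh = tf.get_variable("W_xh", [VOCAB_SIZE, STATE_SIZE])
W_hh = tf.get_variable("W_hh", [STATE_SIZE, STATE_SIZE])
b_h  = tf.get_variable("b_h",  [STATE_SIZE], initializer=tf.zeros_initializer())
W_hy = tf.get_variable("W_hy", [STATE_SIZE, VOCAB_SIZE])
b_y  = tf.get_variable("b_y",  [VOCAB_SIZE], initializer=tf.zeros_initializer())

# Initial state as a placeholder so the same idea carries over to generation later.
initial_state = tf.placeholder(tf.float32, [None, STATE_SIZE], name="initial_state")

# One-hot encode: batch_size x seq_len  ->  batch_size x seq_len x vocab_size.
onehot = tf.one_hot(inputs, VOCAB_SIZE)

# --- "Unrolled" graph: one round of matmuls per time step. ---
state, logits = initial_state, []
for t in range(SEQ_LEN - 1):
    x_t = onehot[:, t, :]
    state = tf.tanh(tf.matmul(x_t, W_xh) + tf.matmul(state, W_hh) + b_h)
    logits.append(tf.matmul(state, W_hy) + b_y)
logits = tf.stack(logits, axis=1)                          # batch_size x (seq_len-1) x vocab_size

loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=targets, logits=logits))
train_op = tf.train.AdamOptimizer().minimize(loss)
saver = tf.train.Saver()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    zero_state = np.zeros([BATCH_SIZE, STATE_SIZE], dtype=np.float32)
    for step in range(10000):
        _, avg_loss = sess.run([train_op, loss], {initial_state: zero_state})
        if step % 100 == 0:
            print(step, avg_loss)
    saver.save(sess, "model.ckpt")                         # placeholder checkpoint path
```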
Having trained an RNN, you can use it to generate language – technically, you’re “sampling from the language model”. To do this, you should:
- Save your trained model using a tf.train.Saver.
- In your generation code, restore the trained parameters via saver.restore.
- Start with the input <S> (the beginning-of-sequence character inserted when creating the dataset) and the last state filled with zeros. Make sure to output the resulting state along with the probabilities so you can feed it into the network for the next step (this is where defining the initial state as a placeholder becomes useful).
- Sample the next character from the output distribution (e.g. using numpy’s random.choice). This will give you an index that you can feed back as input into the network for the next step. Also, you can map this to a character using the vocabulary file. (A sketch of this sampling loop is given below.)

Assuming your network was trained properly and your generation process works, the output should at least superficially resemble the training data. For example, in the case of Shakespeare you should see a dialogue structure with proper use of newlines and whitespace. The text itself should “look like” English, although there will likely be plenty of fantasy words. This is not a problem per se – chances are the task is just too difficult for this simple network. If your output looks completely jumbled, there is probably something wrong with your generation process.
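Here is a matching minimal sketch of the sampling loop, again for TensorFlow 1.x and the vanilla RNN from the earlier sketch; the checkpoint path, the way the vocabulary is loaded and the sample length are placeholders:

```python
import numpy as np
import tensorflow as tf

STATE_SIZE, VOCAB_SIZE = 512, 65   # must match the trained model

# Single-step graph: same variable names/shapes as in training so that
# saver.restore can fill them with the trained values.
W_xh = tf.get_variable("W_xh", [VOCAB_SIZE, STATE_SIZE])
W_hh = tf.get_variable("W_hh", [STATE_SIZE, STATE_SIZE])
b_h  = tf.get_variable("b_h",  [STATE_SIZE])
W_hy = tf.get_variable("W_hy", [STATE_SIZE, VOCAB_SIZE])
b_y  = tf.get_variable("b_y",  [VOCAB_SIZE])

char_in  = tf.placeholder(tf.int32, [1], name="char_in")            # one character index
state_in = tf.placeholder(tf.float32, [1, STATE_SIZE], name="state_in")

x = tf.one_hot(char_in, VOCAB_SIZE)
state_out = tf.tanh(tf.matmul(x, W_xh) + tf.matmul(state_in, W_hh) + b_h)
probs = tf.nn.softmax(tf.matmul(state_out, W_hy) + b_y)

# Stand-in vocabulary: load the real index-to-character mapping from the provided vocabulary file.
vocab = ["<S>"] + [chr(i) for i in range(32, 32 + VOCAB_SIZE - 1)]

saver = tf.train.Saver()
with tf.Session() as sess:
    saver.restore(sess, "model.ckpt")                               # placeholder checkpoint path

    idx = vocab.index("<S>")                        # start with the beginning-of-sequence character
    state = np.zeros([1, STATE_SIZE], dtype=np.float32)
    generated = []
    for _ in range(500):
        p, state = sess.run([probs, state_out],
                            {char_in: [idx], state_in: state})
        p = p[0].astype(np.float64)                 # renormalize against float32 rounding
        idx = np.random.choice(VOCAB_SIZE, p=p / p.sum())   # sample the next character index
        generated.append(vocab[idx])
    print("".join(generated))
```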