Introduction to Deep Learning
Deadline: Sunday November 4, 6PM
Same submission options as last time.
In this assignment, you will create a better model for the MNIST dataset using convolutional neural networks.
You should have seen that modifying layer sizes, changing activation functions etc. is simple: You can generally change parts of the model without affecting the rest of the program. In fact, you can change the full pipeline from input to model output without having to change anything else (restrictions apply).
Replace your MNIST MLP with a CNN. You can check this tutorial for an example. Note: Depending on your machine, training a CNN may take much longer than training the MLPs we’ve seen so far. Also, processing the full test set in one go for evaluation might be too much for your RAM. In that case, you could break up the test set into smaller chunks and average the results. You could also remove dropout and make the dense layer at the end smaller.
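If you do need to chunk the evaluation, a minimal sketch could look like this (assuming a session-based setup; the names x, y and accuracy are just stand-ins for whatever your program uses):

    # Sketch: evaluate the test set in chunks to keep memory use down (TF 1.x graph mode).
    # `x`, `y` and `accuracy` are assumed to exist in your graph; the names are illustrative.
    import numpy as np

    def evaluate_in_chunks(sess, accuracy, x, y, test_images, test_labels, chunk_size=1000):
        accs = []
        for start in range(0, len(test_images), chunk_size):
            end = start + chunk_size
            accs.append(sess.run(accuracy, feed_dict={x: test_images[start:end],
                                                      y: test_labels[start:end]}))
        # The plain mean is exact as long as all chunks have the same size;
        # otherwise, weight each chunk by its number of examples.
        return float(np.mean(accs))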
Next, have a look at the tf.layers API. It offers a decent middle ground between low-level control and convenience (defining tf.Variable objects by hand gets old quickly). Try building a CNN with layers functions instead of the super-low-level approach of the tutorial above.
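As a starting point, a minimal sketch of what such a model could look like (the layer sizes, filter counts and the flat 784-dimensional input are assumptions, not requirements):

    # Sketch: MNIST CNN built from tf.layers functions (TF 1.x). All sizes are examples.
    import tensorflow as tf

    x = tf.placeholder(tf.float32, [None, 784])      # flat MNIST images (assumed input format)
    labels = tf.placeholder(tf.int64, [None])        # integer class labels

    images = tf.reshape(x, [-1, 28, 28, 1])          # conv layers expect NHWC tensors
    conv1 = tf.layers.conv2d(images, filters=32, kernel_size=5,
                             padding="same", activation=tf.nn.relu)
    pool1 = tf.layers.max_pooling2d(conv1, pool_size=2, strides=2)
    conv2 = tf.layers.conv2d(pool1, filters=64, kernel_size=5,
                             padding="same", activation=tf.nn.relu)
    pool2 = tf.layers.max_pooling2d(conv2, pool_size=2, strides=2)
    flat = tf.layers.flatten(pool2)
    dense = tf.layers.dense(flat, units=128, activation=tf.nn.relu)
    logits = tf.layers.dense(dense, units=10)

    loss = tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))

Note that this sketch skips dropout and keeps the dense layer fairly small, in line with the performance remarks above.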
Also, you might want to work with tf.data for your inputs to get more practice with it.
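For example, a minimal tf.data pipeline could look roughly like this (assuming your training data is already available as numpy arrays train_images and train_labels):

    # Sketch: tf.data input pipeline (TF 1.x). `train_images`/`train_labels` are assumed
    # to be numpy arrays from whatever MNIST loader you already use.
    dataset = tf.data.Dataset.from_tensor_slices((train_images, train_labels))
    dataset = dataset.shuffle(buffer_size=60000).batch(64).repeat()
    iterator = dataset.make_one_shot_iterator()
    batch_images, batch_labels = iterator.get_next()
    # `batch_images` and `batch_labels` are tensors; build the model on top of them
    # instead of feeding placeholders via feed_dict.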
So far, we have simply used the basic GradientDescentOptimizer. One option is to use adaptive algorithms, the most popular of which is called Adam. Check out tf.train.AdamOptimizer. This will
usually lead to much faster learning without manual tuning of the learning rate
or other parameters. We will discuss advanced optimization strategies later in
the class, but the basic idea behind Adam is that it automatically
chooses/adapts a per-parameter learning rate as well as incorporating momentum.
Using Adam, your CNN should beat your MLP after only a few hundred steps of
training. Alternatively, you could experiment with MomentumOptimizer and some form of learning rate annealing such as tf.train.polynomial_decay. The
general consensus is that a well-tuned gradient descent with momentum and
learning rate decay will outperform adaptive methods, but you will need to
invest some time into finding a good parameter setting.

If your CNN is set up well, you should reach extremely high accuracy results. This is arguably where MNIST stops being interesting. If you haven’t done so, consider working with Fashion-MNIST instead (see Assignment 1). This should present more of a challenge and make improvements due to hyperparameter tuning more obvious/meaningful.
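Picking up the optimizer discussion above, both options might look roughly like this (a sketch only; none of the hyperparameters are tuned, and `loss` stands for your training loss):

    # Sketch: the two optimizer options discussed above (TF 1.x); values are illustrative.
    # Option 1: Adam with its default settings.
    train_step_adam = tf.train.AdamOptimizer().minimize(loss)

    # Option 2: momentum plus polynomial learning rate decay.
    global_step = tf.train.get_or_create_global_step()
    learning_rate = tf.train.polynomial_decay(learning_rate=0.1,
                                              global_step=global_step,
                                              decay_steps=10000,
                                              end_learning_rate=0.001)
    train_step_momentum = tf.train.MomentumOptimizer(learning_rate, momentum=0.9).minimize(
        loss, global_step=global_step)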
Having set up your basic CNN, you should include some visualizations. In
particular, one thing that is often used to diagnose CNN performance is
visualizing the filters, i.e. the weights of the convolutional layers. The only
filters that are straightforward to interpret are the ones in the first layer,
since they operate directly on the input. The filter matrix should have shape filter_width x filter_height x 1 x n_filters. Visualize each of the n_filters filters as an image. You can do this via tf.summary.image (this allows you to see the filters develop over training). Alternatively, you can use a library such as matplotlib, which offers many more plotting options (better colormaps in particular).
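With matplotlib, a filter grid could be plotted roughly like this (assuming you have already fetched the first-layer kernel as a numpy array `kernel` of shape (filter_height, filter_width, 1, n_filters); see the note further below on how to get at the weights):

    # Sketch: plot first-layer filters with matplotlib. `kernel` is assumed to be a numpy
    # array of shape (filter_height, filter_width, 1, n_filters), e.g. obtained via sess.run.
    import matplotlib.pyplot as plt

    n_filters = kernel.shape[-1]
    cols = 8
    rows = (n_filters + cols - 1) // cols
    fig, axes = plt.subplots(rows, cols, figsize=(cols, rows))
    for i, ax in enumerate(axes.flat):
        if i < n_filters:
            ax.imshow(kernel[:, :, 0, i], cmap="seismic")
        ax.axis("off")
    plt.show()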
Comment on what these filters seem to be recognizing (this can be difficult with small filter sizes such as 5 x 5). Experiment with different filter sizes as well (maybe up to 28 x 28?). See if there are any redundant filters (i.e. multiple filters recognizing the same patterns) and whether you can achieve a similar performance using fewer filters. In principle such redundancy checking can be done for higher layers as well, but note that there each filter has as many channels as there are filters in the layer below (you would need to visualize these separately).
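Redundancy is easiest to judge by eye, but if you want a rough numerical check on top of the plots, pairwise cosine similarity between flattened filters is one simple option (not required, just a possibility):

    # Sketch: crude redundancy check via pairwise cosine similarity of flattened filters.
    # `kernel` as above: a numpy array of shape (filter_height, filter_width, 1, n_filters).
    import numpy as np

    flat = kernel.reshape(-1, kernel.shape[-1]).T      # one row per filter
    flat = flat / np.linalg.norm(flat, axis=1, keepdims=True)
    sims = flat @ flat.T                                # cosine similarities
    np.fill_diagonal(sims, 0.0)
    print("most similar pair of filters:", np.unravel_index(np.argmax(sims), sims.shape))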
Note: Accessing the filters when using the layers API is a bit annoying because they are created “under the hood”. This is particularly true if you use something like y = tf.layers.conv2d(x, ...). Instead, you could use conv_layer = tf.layers.Conv2D(...); y = conv_layer.apply(x). This gives the same result, but allows you to access the layer parameters via conv_layer.trainable_weights. See here for some examples of using tf.layers (and also tf.data). You can ignore anything mentioning “feature columns”.
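Putting the last two points together, a sketch of the object-based approach combined with an image summary (shapes and names are illustrative):

    # Sketch: object-style conv layer so the kernel stays accessible, plus an image summary.
    conv_layer = tf.layers.Conv2D(filters=32, kernel_size=5,
                                  padding="same", activation=tf.nn.relu)
    y = conv_layer.apply(images)                  # same result as tf.layers.conv2d(images, ...)

    kernel = conv_layer.trainable_weights[0]      # shape (5, 5, 1, 32): (h, w, in, out)
    # tf.summary.image expects a batch of images of shape (N, height, width, channels),
    # so move the filter dimension to the front.
    filters_as_images = tf.transpose(kernel, [3, 0, 1, 2])
    tf.summary.image("first_layer_filters", filters_as_images, max_outputs=32)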
Also, you should consider applying some form of regularization (e.g. simple L1 or L2 weight penalties) as this can often improve the subjective quality/interpretability of the resulting weights.
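One way to do this is a manual L2 penalty added to the loss (a sketch; the coefficient is arbitrary, and penalizing every trainable variable, biases included, is a simplification):

    # Sketch: simple L2 weight penalty added to the training loss.
    l2_coeff = 1e-4                               # illustrative value, needs tuning
    l2_penalty = tf.add_n([tf.nn.l2_loss(v) for v in tf.trainable_variables()])
    total_loss = loss + l2_coeff * l2_penalty
    # Alternatively, tf.layers accepts a per-layer kernel_regularizer argument, e.g.
    # tf.contrib.layers.l2_regularizer(1e-4); the collected penalties can then be added
    # to the loss via tf.losses.get_regularization_loss().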
So far, we have been using the graph-based execution model of Tensorflow: Build the graph first, then run it. While this results in powerful, optimized models, it can also be cumbersome to use, especially when you want low-level control at each training step and/or are using complex control flow in your models. Luckily, Tensorflow offers an alternative execution model in the form of eager execution. This essentially runs ops as they are written, much like a corresponding model would work e.g. in numpy. Since the added flexibility is particularly important in research applications, we should have a look at eager execution.
There is a fairly comprehensive guide on the TF website. There are also simpler tutorials available if you prefer a more “modular” introduction. Note that these tutorials make use of the Keras package. Keras is another deep learning framework that by now has been fully integrated into Tensorflow. Feel free to try it out; however, we won’t really be using it in this class. The “deal” with Keras is that it is very minimalistic and to-the-point, allowing you to define models in very few, easy-to-understand lines of code. However, it lacks the customization options/low-level control we need for our purposes.
After you have studied the tutorials, rewrite your CNN training for eager execution. Comment on any (dis)advantages you perceive compared to graph-based execution (including with regard to visualization!). Try to structure your code such that you could switch between eager and graph-based execution with as little effort as possible!
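As a starting point, a single eager-mode training step could look roughly like this (a sketch; `model` is assumed to be a callable that maps image batches to logits and exposes trainable_variables, e.g. built from layer objects):

    # Sketch: eager execution training step (TF 1.x with eager enabled).
    import tensorflow as tf
    tf.enable_eager_execution()                   # must run right after importing tensorflow

    optimizer = tf.train.AdamOptimizer()

    def train_step(model, images, labels):
        with tf.GradientTape() as tape:
            logits = model(images)
            loss = tf.reduce_mean(
                tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels,
                                                               logits=logits))
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        return loss

Keeping the model behind such a callable is also one way to make switching between eager and graph-based execution relatively painless.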