Introduction to Deep Learning
Deadline: Sunday November 4, 6PM
Same submission options as last time.
In this assignment, you will create a better model for the MNIST dataset using convolutional neural networks.
You should have seen that modifying layer sizes, changing activation functions etc. is simple: You can generally change parts of the model without affecting the rest of the program. In fact, you can change the full pipeline from input to model output without having to change anything else (restrictions apply).
Replace your MNIST MLP with a CNN. You can check this tutorial for an example. Note: Depending on your machine, training a CNN may take much longer than training the MLPs we’ve seen so far. Also, processing the full test set in one go for evaluation might be too much for your RAM. In that case, you could break up the test set into smaller chunks and average the results. You could also remove dropout and make the dense layer at the end smaller.
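If you do need to chunk the evaluation, a minimal sketch could look like this (assuming a session-based setup; the names x, y and accuracy are just stand-ins for whatever your program uses):

    # Sketch: evaluate the test set in chunks to keep memory use down (TF 1.x graph mode).
    # `x`, `y` and `accuracy` are assumed to exist in your graph; the names are illustrative.
    import numpy as np

    def evaluate_in_chunks(sess, accuracy, x, y, test_images, test_labels, chunk_size=1000):
        accs = []
        for start in range(0, len(test_images), chunk_size):
            end = start + chunk_size
            accs.append(sess.run(accuracy, feed_dict={x: test_images[start:end],
                                                      y: test_labels[start:end]}))
        # The plain mean is exact as long as all chunks have the same size;
        # otherwise, weight each chunk by its number of examples.
        return float(np.mean(accs))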
Next, have a look at the tf.layers API. It offers a decent middle ground between low-level control and convenience (defining tf.Variable objects by hand gets old quickly). Try building a CNN with layers functions instead of the super-low-level approach of the tutorial above.
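As a starting point, a minimal sketch of what such a model could look like (the layer sizes, filter counts and the flat 784-dimensional input are assumptions, not requirements):

    # Sketch: MNIST CNN built from tf.layers functions (TF 1.x). All sizes are examples.
    import tensorflow as tf

    x = tf.placeholder(tf.float32, [None, 784])      # flat MNIST images (assumed input format)
    labels = tf.placeholder(tf.int64, [None])        # integer class labels

    images = tf.reshape(x, [-1, 28, 28, 1])          # conv layers expect NHWC tensors
    conv1 = tf.layers.conv2d(images, filters=32, kernel_size=5,
                             padding="same", activation=tf.nn.relu)
    pool1 = tf.layers.max_pooling2d(conv1, pool_size=2, strides=2)
    conv2 = tf.layers.conv2d(pool1, filters=64, kernel_size=5,
                             padding="same", activation=tf.nn.relu)
    pool2 = tf.layers.max_pooling2d(conv2, pool_size=2, strides=2)
    flat = tf.layers.flatten(pool2)
    dense = tf.layers.dense(flat, units=128, activation=tf.nn.relu)
    logits = tf.layers.dense(dense, units=10)

    loss = tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))

Note that this sketch skips dropout and keeps the dense layer fairly small, in line with the performance remarks above.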
Also, you might want to work with tf.data for your inputs to get more practice with it.
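For example, a minimal tf.data pipeline could look roughly like this (assuming your training data is already available as numpy arrays train_images and train_labels):

    # Sketch: tf.data input pipeline (TF 1.x). `train_images`/`train_labels` are assumed
    # to be numpy arrays from whatever MNIST loader you already use.
    dataset = tf.data.Dataset.from_tensor_slices((train_images, train_labels))
    dataset = dataset.shuffle(buffer_size=60000).batch(64).repeat()
    iterator = dataset.make_one_shot_iterator()
    batch_images, batch_labels = iterator.get_next()
    # `batch_images` and `batch_labels` are tensors; build the model on top of them
    # instead of feeding placeholders via feed_dict.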
So far, we have simply used the basic GradientDescentOptimizer. One option is to use adaptive algorithms, the most popular of which is called Adam. Check out tf.train.AdamOptimizer. This will
usually lead to much faster learning without manual tuning of the learning rate
or other parameters. We will discuss advanced optimization strategies later in
the class, but the basic idea behind Adam is that it automatically
chooses/adapts a per-parameter learning rate as well as incorporating momentum.
Using Adam, your CNN should beat your MLP after only a few hundred steps of
training. Alternatively, you could experiment with MomentumOptimizer and some form of learning rate annealing such as tf.train.polynomial_decay. The
general consensus is that a well-tuned gradient descent with momentum and
learning rate decay will outperform adaptive methods, but you will need to
invest some time into finding a good parameter setting.

If your CNN is set up well, you should reach extremely high accuracy results. This is arguably where MNIST stops being interesting. If you haven’t done so, consider working with Fashion-MNIST instead (see Assignment 1). This should present more of a challenge and make improvements due to hyperparameter tuning more obvious/meaningful.
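Picking up the optimizer discussion above, both options might look roughly like this (a sketch only; none of the hyperparameters are tuned, and `loss` stands for your training loss):

    # Sketch: the two optimizer options discussed above (TF 1.x); values are illustrative.
    # Option 1: Adam with its default settings.
    train_step_adam = tf.train.AdamOptimizer().minimize(loss)

    # Option 2: momentum plus polynomial learning rate decay.
    global_step = tf.train.get_or_create_global_step()
    learning_rate = tf.train.polynomial_decay(learning_rate=0.1,
                                              global_step=global_step,
                                              decay_steps=10000,
                                              end_learning_rate=0.001)
    train_step_momentum = tf.train.MomentumOptimizer(learning_rate, momentum=0.9).minimize(
        loss, global_step=global_step)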
Having set up your basic CNN, you should include some visualizations. In
particular, one thing that is often used to diagnose CNN performance is
visualizing the filters, i.e. the weights of the convolutional layers. The only
filters that are straightforward to interpret are the ones in the first layer,
since they operate directly on the input. The filter matrix should have shape filter_width x filter_height x 1 x n_filters. Visualize each of the n_filters filters as an image. You can do this via tf.summary.image (this allows you to see the filters develop over training). Alternatively, you can use a library such as matplotlib, which offers many more plotting options (better colormaps in particular).
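With matplotlib, a filter grid could be plotted roughly like this (assuming you have already fetched the first-layer kernel as a numpy array `kernel` of shape (filter_height, filter_width, 1, n_filters); see the note further below on how to get at the weights):

    # Sketch: plot first-layer filters with matplotlib. `kernel` is assumed to be a numpy
    # array of shape (filter_height, filter_width, 1, n_filters), e.g. obtained via sess.run.
    import matplotlib.pyplot as plt

    n_filters = kernel.shape[-1]
    cols = 8
    rows = (n_filters + cols - 1) // cols
    fig, axes = plt.subplots(rows, cols, figsize=(cols, rows))
    for i, ax in enumerate(axes.flat):
        if i < n_filters:
            ax.imshow(kernel[:, :, 0, i], cmap="seismic")
        ax.axis("off")
    plt.show()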
Comment on what these filters seem to be recognizing (this can be difficult with small filter sizes such as 5 x 5). Experiment with different filter sizes as well (maybe up to 28 x 28?). See if there are any redundant filters (i.e. multiple filters recognizing the same patterns) and whether you can achieve a similar performance using fewer filters. In principle such redundancy checking can be done for higher layers as well, but note that there each filter has as many channels as there are filters in the layer below (you would need to visualize these separately).
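Redundancy is easiest to judge by eye, but if you want a rough numerical check on top of the plots, pairwise cosine similarity between flattened filters is one simple option (not required, just a possibility):

    # Sketch: crude redundancy check via pairwise cosine similarity of flattened filters.
    # `kernel` as above: a numpy array of shape (filter_height, filter_width, 1, n_filters).
    import numpy as np

    flat = kernel.reshape(-1, kernel.shape[-1]).T      # one row per filter
    flat = flat / np.linalg.norm(flat, axis=1, keepdims=True)
    sims = flat @ flat.T                                # cosine similarities
    np.fill_diagonal(sims, 0.0)
    print("most similar pair of filters:", np.unravel_index(np.argmax(sims), sims.shape))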
Note: Accessing the filters when using the layers API is a bit annoying because they are created “under the hood”. This is particularly true if you use something like y = tf.layers.conv2d(x, ...). Instead, you could use conv_layer = tf.layers.Conv2D(...); y = conv_layer.apply(x). This gives the same result, but allows you to access the layer parameters via conv_layer.trainable_weights. See here for some examples of using tf.layers (and also tf.data). You can ignore anything mentioning “feature columns”.
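Putting the last two points together, a sketch of the object-based approach combined with an image summary (shapes and names are illustrative):

    # Sketch: object-style conv layer so the kernel stays accessible, plus an image summary.
    conv_layer = tf.layers.Conv2D(filters=32, kernel_size=5,
                                  padding="same", activation=tf.nn.relu)
    y = conv_layer.apply(images)                  # same result as tf.layers.conv2d(images, ...)

    kernel = conv_layer.trainable_weights[0]      # shape (5, 5, 1, 32): (h, w, in, out)
    # tf.summary.image expects a batch of images of shape (N, height, width, channels),
    # so move the filter dimension to the front.
    filters_as_images = tf.transpose(kernel, [3, 0, 1, 2])
    tf.summary.image("first_layer_filters", filters_as_images, max_outputs=32)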
Also, you should consider applying some form of regularization (e.g. simple L1 or L2 weight penalties) as this can often improve the subjective quality/interpretability of the resulting weights.
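One way to do this is a manual L2 penalty added to the loss (a sketch; the coefficient is arbitrary, and penalizing every trainable variable, biases included, is a simplification):

    # Sketch: simple L2 weight penalty added to the training loss.
    l2_coeff = 1e-4                               # illustrative value, needs tuning
    l2_penalty = tf.add_n([tf.nn.l2_loss(v) for v in tf.trainable_variables()])
    total_loss = loss + l2_coeff * l2_penalty
    # Alternatively, tf.layers accepts a per-layer kernel_regularizer argument, e.g.
    # tf.contrib.layers.l2_regularizer(1e-4); the collected penalties can then be added
    # to the loss via tf.losses.get_regularization_loss().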
So far, we have been using the graph-based execution model of Tensorflow: Build the graph first, then run it. While this results in powerful, optimized models, it can also be cumbersome to use, especially when you want low-level control at each training step and/or are using complex control flow in your models. Luckily, Tensorflow offers an alternative execution model in the form of eager execution. This essentially runs ops as they are written, much like a corresponding model would work e.g. in numpy. Since the added flexibility is particularly important in research applications, we should have a look at eager execution.
There is a fairly comprehensive guide on the TF website. There are also simpler tutorials available if you prefer a more “modular” introduction. Note that these tutorials make use of the Keras package. Keras is another deep learning framework that by now has been fully integrated into Tensorflow. Feel free to try it out; however, we won’t really be using it in this class. The “deal” with Keras is that it is very minimalistic and to-the-point, allowing you to define models in very few, easy-to-understand lines of code. However, it lacks the customization options/low-level control we need for our purposes.
After you have studied the tutorials, rewrite your CNN training for eager execution. Comment on any (dis)advantages you perceive compared to graph-based execution (including with regard to visualization!). Try to structure your code such that you could switch between eager and graph-based execution with as little effort as possible!
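As a starting point, a single eager-mode training step could look roughly like this (a sketch; `model` is assumed to be a callable that maps image batches to logits and exposes trainable_variables, e.g. built from layer objects):

    # Sketch: eager execution training step (TF 1.x with eager enabled).
    import tensorflow as tf
    tf.enable_eager_execution()                   # must run right after importing tensorflow

    optimizer = tf.train.AdamOptimizer()

    def train_step(model, images, labels):
        with tf.GradientTape() as tape:
            logits = model(images)
            loss = tf.reduce_mean(
                tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels,
                                                               logits=logits))
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        return loss

Keeping the model behind such a callable is also one way to make switching between eager and graph-based execution relatively painless.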