Introduction to Deep Learning
In this assignment, you will create a better model for the MNIST dataset using convolutional neural networks.
You should have seen that modifying layer sizes, changing activation functions, and the like is simple: you can generally change parts of the model without affecting the rest of the program. In fact, you can change the full pipeline from input to model output without having to change anything else (restrictions apply).
Replace your MNIST MLP with a CNN. You can check this tutorial for an example. Note: Depending on your machine, training a CNN may take much longer than the MLPs we've seen so far. Also, processing the full test set in one go for evaluation might be too much for your RAM. In that case, you could break up the test set into smaller chunks and average the results (see the sketch below). You could also remove dropout and make the dense layer at the end smaller.
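Here is a minimal sketch of chunked evaluation. It assumes a TF 1.x session, placeholders x and y, an accuracy tensor, and NumPy arrays test_images/test_labels; all of these names stand in for whatever your own code defines. Note that averaging per-chunk accuracies is exact as long as all chunks have the same size (e.g. 10 chunks of 1,000 for MNIST's 10,000 test images).

```python
import numpy as np

def evaluate_in_chunks(sess, accuracy, x, y, test_images, test_labels,
                       chunk_size=1000):
    """Run the accuracy op on fixed-size chunks of the test set and average."""
    accs = []
    for start in range(0, len(test_images), chunk_size):
        end = start + chunk_size
        accs.append(sess.run(accuracy,
                             feed_dict={x: test_images[start:end],
                                        y: test_labels[start:end]}))
    return float(np.mean(accs))
```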
Next, have a look at the tf.layers API. It offers a decent middle ground between low-level control and convenience (defining tf.Variables by hand gets old quickly). Try building a CNN with layers functions instead of the super-low-level way of the above tutorial. Also, you might want to work with tf.data for your inputs to get more practice with it.

So far we have trained everything with the basic GradientDescentOptimizer. One option for improvement is to use adaptive algorithms, the most popular of which is called Adam; check out tf.train.AdamOptimizer. This will usually lead to much faster learning without manual tuning of the learning rate or other parameters. We will discuss advanced optimization strategies later in the class, but the basic idea behind Adam is that it automatically chooses/adapts a per-parameter learning rate and also incorporates momentum. Using Adam, your CNN should beat your MLP after only a few hundred steps of training. Alternatively, you could experiment with MomentumOptimizer and some form of learning rate annealing such as tf.train.polynomial_decay. The general consensus is that a well-tuned gradient descent with momentum and learning rate decay will outperform adaptive methods, but you will need to invest some time into finding a good parameter setting.

If your CNN is set up well, you should reach extremely high accuracy. This is arguably where MNIST stops being interesting. If you haven't done so, consider working with Fashion-MNIST instead (see Assignment 1). It should present more of a challenge and make improvements due to hyperparameter tuning more obvious/meaningful.
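To make this concrete, here is a sketch of a small CNN built with tf.layers functions, fed by a tf.data pipeline, and trained with Adam (TF 1.x style). The architecture, hyperparameters, and the train_images/train_labels arrays are illustrative assumptions, not requirements of the assignment.

```python
import tensorflow as tf

def build_input(images, labels, batch_size=64):
    # images: [N, 784] float32, labels: [N] int class indices (assumed format).
    data = tf.data.Dataset.from_tensor_slices((images, labels))
    data = data.shuffle(10000).repeat().batch(batch_size)
    return data.make_one_shot_iterator().get_next()

img_batch, lbl_batch = build_input(train_images, train_labels)  # your arrays

x = tf.reshape(img_batch, [-1, 28, 28, 1])
conv1 = tf.layers.conv2d(x, filters=32, kernel_size=5, padding="same",
                         activation=tf.nn.relu)
pool1 = tf.layers.max_pooling2d(conv1, pool_size=2, strides=2)
conv2 = tf.layers.conv2d(pool1, filters=64, kernel_size=5, padding="same",
                         activation=tf.nn.relu)
pool2 = tf.layers.max_pooling2d(conv2, pool_size=2, strides=2)
flat = tf.layers.flatten(pool2)
dense = tf.layers.dense(flat, 256, activation=tf.nn.relu)
logits = tf.layers.dense(dense, 10)

loss = tf.losses.sparse_softmax_cross_entropy(labels=lbl_batch, logits=logits)
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)

# Alternative: momentum plus polynomial learning rate decay (values made up).
# global_step = tf.train.get_or_create_global_step()
# lr = tf.train.polynomial_decay(0.1, global_step, decay_steps=10000,
#                                end_learning_rate=0.001)
# train_op = tf.train.MomentumOptimizer(lr, momentum=0.9).minimize(
#     loss, global_step=global_step)
```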
Having set up your basic CNN, you should include some visualization. In particular, one thing that is often used to diagnose CNN performance is visualizing the filters, i.e. the weights of the convolutional layers. The only filters that are straightforward to interpret are those in the first layer, since they operate directly on the input. The filter matrix should have shape filter_height x filter_width x 1 x n_filters; visualize each of the n_filters filters as an image. You can do this via tf.summary.image (this lets you watch the filters develop over training). Alternatively, you can use libraries such as matplotlib, which offer many more plotting options (better colormaps in particular).
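For instance, assuming kernel is the first layer's weight variable with shape [filter_height, filter_width, 1, n_filters] (see the note below on getting at it when using tf.layers), a sketch for logging one grayscale image per filter could look like this:

```python
import tensorflow as tf

def filter_summaries(kernel, name="conv1_filters"):
    # [h, w, 1, n] -> [n, h, w, 1]: one single-channel image per filter.
    imgs = tf.transpose(kernel, [3, 0, 1, 2])
    # Rescale to [0, 1] so tf.summary.image renders the weights visibly.
    lo = tf.reduce_min(imgs)
    imgs = (imgs - lo) / (tf.reduce_max(imgs) - lo + 1e-8)
    return tf.summary.image(name, imgs, max_outputs=int(imgs.shape[0]))
```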
Comment on what these filters seem to be recognizing (this can be difficult with small filter sizes such as 5 x 5). Experiment with different filter sizes as well (maybe up to 28 x 28?). See if there are any redundant filters, i.e. multiple filters recognizing the same patterns (one way to quantify this is sketched below), and whether you can achieve similar performance using fewer filters. In principle, such redundancy checking can be done for higher layers as well, but note that there each filter has as many channels as there are filters in the layer below, so you would need to visualize these channels separately.
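One simple way to hunt for redundancy (a suggestion of ours, not part of the assignment spec): fetch the first-layer kernel as a NumPy array and compute cosine similarities between the flattened filters; off-diagonal entries close to +1 or -1 suggest near-duplicate filters.

```python
import numpy as np

def filter_similarity(kernel_value):
    """kernel_value: array of shape [h, w, 1, n], e.g. from sess.run(kernel)."""
    n = kernel_value.shape[-1]
    flat = kernel_value.reshape(-1, n).T                 # [n, h*w]
    flat = flat / (np.linalg.norm(flat, axis=1, keepdims=True) + 1e-8)
    return flat @ flat.T                                 # [n, n] cosine matrix
```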
Note: Accessing the filters when using the layers API is a bit annoying because they are created “under the hood”. This is particularly true if you use something like y = tf.layers.conv2d(x, ...). Instead, you could use conv_layer = tf.layers.Conv2D(...); y = conv_layer.apply(x). This gives the same result, but allows you to access the layer parameters via conv_layer.trainable_weights. See here for some examples of using tf.layers (and also tf.data). You can ignore anything mentioning “feature columns”.
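Spelled out, the object-oriented variant looks like this (the hyperparameters are again just placeholders):

```python
import tensorflow as tf

# x: input tensor of shape [batch, 28, 28, 1], assumed to exist already.
conv_layer = tf.layers.Conv2D(filters=32, kernel_size=5, padding="same",
                              activation=tf.nn.relu)
y = conv_layer.apply(x)  # equivalent to y = tf.layers.conv2d(x, ...)

# Once apply() has built the layer, its variables are reachable:
kernel, bias = conv_layer.trainable_weights  # kernel shape: [5, 5, 1, 32]
```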