WS 2017/18

Introduction to Deep Learning

Assignment 7: Autoencoders

In this assignment, you will implement various autoencoder architectures on our beloved MNIST data. In particular, you will gain some insight into the problem of training convolutional autoencoders.

Autoencoders in Tensorflow

Building autoencoders in Tensorflow is pretty simple. You need to define an encoding based on the input, a decoding based on the encoding, and a loss function that measures the distance between decoding and input. A common choice for image data is simply the mean squared error. To start off, you could try simple MLPs. Note that you are in no way obliged to make the decoder the "reverse" of the encoder architecture; e.g. you could use a 10-layer MLP as an encoder and a single layer as a decoder if you wanted.

Note: The activation function of the last decoder layer is very important, as it needs to be able to cover the range of the input data. The MNIST data that comes with Tensorflow should already be floats in the range [0, 1]. However, if you use the data we provided for earlier assignments, you should divide all images by 255.0, since that data is stored as ints in [0, 255] instead. Having data in the range [0, 1] allows you to use a sigmoid output activation, for example. Experiment with different output activations such as sigmoid, relu or linear (i.e. no) activation and see how this affects the model.
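For concreteness, here is a minimal sketch of such a fully-connected autoencoder in the Tensorflow 1.x layers API. The layer names, the hidden size h_dim = 64 and the training schedule are illustrative choices, not requirements:

    import tensorflow as tf
    from tensorflow.examples.tutorials.mnist import input_data

    mnist = input_data.read_data_sets("MNIST_data")  # images are floats in [0, 1]

    h_dim = 64
    inputs = tf.placeholder(tf.float32, [None, 784])
    # encoder and decoder are each a single dense layer here
    encoding = tf.layers.dense(inputs, h_dim, activation=tf.nn.relu, name="encoder")
    decoding = tf.layers.dense(encoding, 784, activation=tf.nn.sigmoid, name="decoder")

    # mean squared error between reconstruction and input
    loss = tf.reduce_mean(tf.squared_difference(decoding, inputs))
    train_step = tf.train.AdamOptimizer().minimize(loss)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for _ in range(2000):
            batch, _ = mnist.train.next_batch(100)
            sess.run(train_step, {inputs: batch})

The sigmoid output matches the [0, 1] data range; swapping it for relu or no activation is a one-line change.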

Keep in mind that the cost is not a good proxy for the "quality" of an autoencoder; instead, you need to get an impression of what the model has learned. Note that if you use a single-layer decoder, its weight matrix will be h_dim x 784, and each of the h_dim rows can be reshaped to 28x28 to get an impression of what kind of image the respective hidden dimension represents. The same holds for the encoder, of course, which in the single-layer case has a 784 x h_dim weight matrix whose columns can be reshaped the same way.
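As a sketch of this kind of visualization (assuming the layer names and h_dim = 64 from the snippet above, and matplotlib for plotting), you could fetch the decoder's kernel and reshape its rows:

    import matplotlib.pyplot as plt

    # run this while the training session is still open; "decoder/kernel" is
    # the weight matrix created by tf.layers.dense(..., name="decoder") above
    with tf.variable_scope("decoder", reuse=True):
        kernel = tf.get_variable("kernel")  # shape [h_dim, 784]
    weights = sess.run(kernel)

    for i in range(h_dim):
        plt.subplot(8, 8, i + 1)  # 8x8 grid, assuming h_dim = 64
        plt.imshow(weights[i].reshape(28, 28), cmap="gray")
        plt.axis("off")
    plt.show()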

Convolutional Autoencoders

Next, you should switch to a convolutional encoder/decoder to make use of the fact that we are working with image data. The encoding should simply be one or more convolutional layers, with any filter size and number of filters. The "inverse" of a tf.layers.conv2d is tf.layers.conv2d_transpose. Again, there is no requirement to make the parameters of encoder and decoder "fit"; e.g. you don't need to use the same filter sizes. As long as you use "same" padding and unit strides, there should be no problems with mismatching input/output shapes. The last transposed convolution should have only one filter to get back to what is basically grayscale image space (one channel, like the MNIST input). Note that pooling cannot be inverted as easily, so you should not use it here. Also, to achieve decent visualizations, you should aim for shallow networks with large filters (e.g. one layer each for encoder and decoder, with 11x11 filters). Take note of the cost value the network achieves, and visualize the learned filters. Find an explanation for your observations.
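A minimal convolutional variant might look as follows; the filter count and the 11x11 kernel size are again just illustrative (this reuses the inputs placeholder and training loop from the first sketch):

    # reshape the flat MNIST vectors into 28x28 single-channel images
    images = tf.reshape(inputs, [-1, 28, 28, 1])

    # one convolutional layer as the encoder...
    code = tf.layers.conv2d(images, filters=16, kernel_size=11, padding="same",
                            activation=tf.nn.relu, name="conv_encoder")
    # ...and one transposed convolution back to a single channel as the decoder
    reconstruction = tf.layers.conv2d_transpose(code, filters=1, kernel_size=11,
                                                padding="same",
                                                activation=tf.nn.sigmoid,
                                                name="conv_decoder")

    loss = tf.reduce_mean(tf.squared_difference(reconstruction, images))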

Winner-Take-All Autoencoders

As you should have seen, unregularized convolutional autoencoders don't learn interesting filters. One way of regularizing is to enforce extremely sparse activations. Read the paper on Winner-Take-All Autoencoders (an understanding of the method is enough; no need to read e.g. the experiments in detail) and implement this scheme. You only need to implement spatial sparsity for convolutional networks; lifetime sparsity is optional. However, it might actually be easier to start by implementing fully-connected WTA autoencoders with lifetime sparsity, and you may be able to reuse this functionality for the convolutional variant. Implementing any of these parts requires some careful manipulation of tensors and indices and is a good exercise in and of itself; one possible approach is sketched below.
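Here is a sketch of both sparsity operations; the function names are ours, and how you wire them into training is up to you. Spatial sparsity keeps only the single largest activation in each feature map; lifetime sparsity keeps, for each hidden unit, only its k largest activations across the batch:

    def spatial_sparsity(conv_code):
        # conv_code has shape [batch, height, width, channels]; keep only the
        # maximum of each feature map, zero out the rest (ties all survive)
        maxima = tf.reduce_max(conv_code, axis=[1, 2], keep_dims=True)
        mask = tf.cast(tf.equal(conv_code, maxima), tf.float32)
        return conv_code * mask  # gradients only flow through the winners

    def lifetime_sparsity(code, k):
        # code has shape [batch, h_dim]; find the k-th largest activation of
        # each hidden unit across the batch and use it as a threshold
        top_k = tf.nn.top_k(tf.transpose(code), k=k)
        threshold = top_k.values[:, k - 1]  # shape [h_dim]
        mask = tf.cast(tf.greater_equal(code, threshold), tf.float32)
        return code * mask

Either function would be applied to the encoder output before it enters the decoder; the reconstruction loss itself stays unchanged.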

Experiment with different sparsity settings and see how the learned filters are affected. If you want to go the extra mile, you could even compare the learned filters and/or embedding spaces (remember the TensorBoard Projector!) of your autoencoders with those of a comparable classification model (i.e. the encoder followed by a softmax classification layer), or even use a trained encoder to initialize the weights of such a model.
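If you attempt the weight-reuse experiment, one possible approach (a sketch, assuming the convolutional encoder above was saved with tf.train.Saver to a checkpoint; the file name is hypothetical) is to rebuild the encoder under the same variable names in a fresh graph, add a softmax layer, and restore only the encoder's variables:

    # same architecture and name as the trained encoder, in a fresh graph
    code = tf.layers.conv2d(images, filters=16, kernel_size=11, padding="same",
                            activation=tf.nn.relu, name="conv_encoder")
    logits = tf.layers.dense(tf.layers.flatten(code), 10, name="classifier")

    labels = tf.placeholder(tf.int64, [None])
    classifier_loss = tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels,
                                                       logits=logits))

    # restore only the encoder's variables from the autoencoder checkpoint
    encoder_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES,
                                     scope="conv_encoder")
    saver = tf.train.Saver(var_list=encoder_vars)
    # inside a session: saver.restore(sess, "wta_autoencoder.ckpt")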