Currently some of the best deep learning methods, e.g. the ones that are beating records on ImageNet, rely purely on supervised learning. In theory, it is clear that the data contains much more information than supervised learning uses. Moreover, in many applications labeled data is slow and/or expensive to gather, while unlabeled samples are readily available in large quantities. So why has unsupervised learning not been able to offer significant improvements in problems such as ImageNet? I argue that the problem lies in the unsupervised techniques, which have not been able to discard details that are irrelevant for the task at hand but important for reconstructing the data. I will propose a solution: a combination of denoising autoencoders and denoising source separation, together with a model structure that has short-cut connections between the encoder and the decoder. These short-cut connections allow the encoder to discard task-irrelevant details, because the decoder can recover the discarded information through the short-cuts. Denoising source separation adds an important feature to the model: unlike standard autoencoders, the model supports unsupervised learning (and cost function terms) throughout the network. This means that it is very easy to take an existing model structure that performs well in a supervised setting (e.g. convnets) and add unsupervised learning on top of it. With very little added complexity, we have been able to achieve state-of-the-art results on a permutation-invariant version of MNIST (in both the semi-supervised and the fully labeled settings) and are now working on extending the result to more realistic problems like ImageNet.
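To make the structure concrete, here is a minimal NumPy sketch of the forward pass only. It is an illustration under stated assumptions, not the model used for the MNIST results: the function names, the layer sizes, the ReLU nonlinearity, and especially the fixed scalar `mix` that blends the lateral short-cut with the top-down signal are all hypothetical stand-ins (a real model would learn the combination and be trained by gradient descent, together with a supervised cost at the top layer). What the sketch does show is the two ideas from the abstract: each decoder layer receives a short-cut from the corresponding encoder layer, and a denoising cost term is attached to every layer rather than to the input alone.

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(x, weights, noise_std=0.0):
    """Return the activations of every layer (input included).

    With noise_std > 0, Gaussian noise is added to the input and to each
    layer's pre-activation, giving the corrupted path of the model.
    """
    h = x + noise_std * rng.standard_normal(x.shape)
    activations = [h]
    for W in weights:
        z = h @ W
        z = z + noise_std * rng.standard_normal(z.shape)
        h = np.maximum(z, 0.0)  # ReLU, chosen here only for simplicity
        activations.append(h)
    return activations

def decoder(noisy, dec_weights, mix=0.5):
    """Reconstruct every layer top-down, using lateral short-cuts.

    dec_weights[l] maps layer l+1 back to layer l. `mix` is a hypothetical
    fixed blend between the short-cut (noisy[l]) and the top-down signal.
    """
    top = len(noisy) - 1
    recon = [None] * len(noisy)
    recon[top] = noisy[top]  # assumption: the top layer is passed through as-is
    for l in range(top - 1, -1, -1):
        top_down = recon[l + 1] @ dec_weights[l]
        # Short-cut connection: details the encoder dropped higher up
        # can still be recovered from the same-level noisy activation.
        recon[l] = mix * noisy[l] + (1.0 - mix) * top_down
    return recon

def unsupervised_cost(clean, recon):
    """Sum of per-layer denoising costs (mean squared error at each layer)."""
    return sum(np.mean((c - r) ** 2) for c, r in zip(clean, recon))

# Toy usage with random, untrained weights (sizes are arbitrary).
sizes = [784, 64, 10]
weights = [0.01 * rng.standard_normal((m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
dec_weights = [0.01 * rng.standard_normal((n, m)) for m, n in zip(sizes[:-1], sizes[1:])]
x = rng.random((32, 784))

clean = encoder(x, weights, noise_std=0.0)   # targets for every layer
noisy = encoder(x, weights, noise_std=0.3)   # corrupted path
recon = decoder(noisy, dec_weights)
cost = unsupervised_cost(clean, recon)
```

Because the cost is a sum over layers, the same recipe can in principle be wrapped around an existing supervised network by reusing its layers as the encoder; the decoder and the per-layer terms are the only additions.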
For more on ZenRobotics see http://iki.fi/valpola/research/