Like many neuroscientists, I’m interested in artificial neural networks and curious about deep learning networks. I want to dedicate some blog posts to this topic, in order to 1) approach deep learning from the stupid neuroscientist’s perspective and 2) get a feeling for what deep networks can and cannot do. Part I, Part III, Part IV, Part IVb.
To work with deep learning networks, one can either write one’s own functions and libraries for loss functions and backpropagation, or use one of the available “frameworks” that provide these libraries along with everything else under the hood, the parts that are more interesting for developers than for users of deep learning networks.
Here’s a best of: Tensorflow, Theano, Torch, Caffe. Out of those, I would recommend either Tensorflow, the software developed and used by Google, or Theano, which is similar but developed by academic researchers. Both can be imported as libraries in Python.
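To give a feeling for what the do-it-yourself route involves, here is a minimal sketch of a hand-written loss function and its gradient for a single linear layer, in plain NumPy. It is not taken from any of the frameworks above; the function names, the random stand-in data and the learning rate are my own assumptions for illustration.

```python
import numpy as np

def softmax(logits):
    # Subtract the row-wise maximum for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def loss_and_grads(W, b, x, labels):
    """Mean cross-entropy of a linear classifier and its gradients."""
    n = labels.shape[0]
    probs = softmax(x @ W + b)
    loss = -np.log(probs[np.arange(n), labels]).mean()
    # Gradient of the mean cross-entropy with respect to the logits.
    dlogits = probs.copy()
    dlogits[np.arange(n), labels] -= 1.0
    dlogits /= n
    return loss, x.T @ dlogits, dlogits.sum(axis=0)

# Random stand-in data with MNIST-like shapes (assumption: 784 inputs, 10 classes).
rng = np.random.default_rng(0)
x = rng.normal(size=(32, 784))
labels = rng.integers(0, 10, size=32)
W = np.zeros((784, 10))
b = np.zeros(10)

loss_before, dW, db = loss_and_grads(W, b, x, labels)
# One plain gradient descent step with a hypothetical learning rate.
W -= 0.01 * dW
b -= 0.01 * db
loss_after, _, _ = loss_and_grads(W, b, x, labels)
print(loss_before, loss_after)
```

A framework does this bookkeeping (and the gradients for every additional layer) automatically, which is the main reason to use one.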
First, I tried to install both Theano and Tensorflow on Windows 7, and gave up after a while: neither framework is designed to work best on Windows, although there are solutions that do work (update early 2017: Tensorflow is now also available for Windows, although you have to take care to install the correct Python version). So I switched to Linux (Mint), installed Tensorflow, and spent some time digging out my Python knowledge, which had got lost during the last couple of years when I used Matlab only.
Here’s the recipe that should give a good handle on Tensorflow in 2-4 days:
- Installation of Tensorflow. As always with Python, this can easily take more than one hour. I installed a CPU-only version, since my GPU is not very powerful.
- Some basic information to read – most of this applies to Theano as well.
- Then here’s a nice MNIST tutorial using Tensorflow. MNIST is a handwritten digit recognition dataset that is used as a standard benchmark for supervised classifiers.
- For a more systematic introduction, check out the Tensorflow Udacity class that has been announced recently on Google’s research blog, based on a dataset similar to MNIST, but a little bit more difficult to classify. The lecture videos are short and focused on a pragmatic understanding of the software and of deep learning. For deeper understanding, please read a book.
The core of the Tensorflow class is the hands-on part, which consists of code notebooks that are made available on GitHub (notebooks 1-4). They allow you to understand the Tensorflow syntax and to solve some “problems” by modifying small parts of the given sample code. This is really helpful, because googling for possible solutions to the problems gives a broader overview of Tensorflow. Going through these exercises will take 1-3 days of work, depending on whether you are a perfectionist or not.
- Now you should be prepared to use convolutions, max pooling, dropout and stochastic gradient descent on your own data with Tensorflow. I will try to do this in the next couple of weeks and report it here.
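For readers who have not met these terms yet, here is a rough NumPy sketch of what two of the building blocks named above do. It is not Tensorflow code and not from the course notebooks; the function names, the 2x2 pooling window and the "inverted dropout" scaling are my own choices for illustration.

```python
import numpy as np

def max_pool_2x2(image):
    """Keep the maximum of each non-overlapping 2x2 patch.

    Assumes a 2D array with even height and width.
    """
    h, w = image.shape
    return image.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def dropout(activations, keep_prob, rng):
    """Zero each unit with probability 1 - keep_prob during training.

    The surviving units are scaled by 1 / keep_prob so that the
    expected activation stays the same.
    """
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

image = np.arange(16, dtype=float).reshape(4, 4)
pooled = max_pool_2x2(image)

rng = np.random.default_rng(0)
hidden = np.ones(1000)
dropped = dropout(hidden, keep_prob=0.5, rng=rng)
print(pooled)
print(dropped.mean())
```

Max pooling shrinks the image while keeping the strongest response in each patch; dropout randomly silences units during training, so the network cannot rely on any single one of them.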
Here are some observations that I made when trying out Tensorflow:
- There are a lot of hyperparameters (learning rate, number of layers, size of layers, scaling factor of L2 regularization of different layers, shape/stride/depth of convolutional filters) that have to be optimized. Sometimes the behavior is unexpected or even unstable. I didn’t find good and reliable advice on how to set the hyperparameters as a function of the classification task.
- To illustrate the unpredictable behavior, here is a very small parameter study on a neural network (no convolutional layer) with two hidden layers of 50 and 25 units, respectively. The factors L1 and L2 give the scaling of the L2 regularization loss term (for an explanation see chapter 7.1.1 in this book) of the first and second hidden layer, respectively. If the regularization loss for the (smaller) second hidden layer is weighted more strongly than the regularization loss for the first hidden layer (i.e., L2 >> L1), the system is likely to become unstable and to settle on a random assignment of the ten available categories (= 10% success rate). However, whether this happens or not also depends on the learning rate, and even on the initialization of the network.
| L1 \ L2 | 1e-4 | 5e-4 | 1e-3 | 5e-3 |
|---|---|---|---|---|
| 1e-4 | 89.9% | 10.0% | 10.0% | 10.0% |
| 5e-4 | 91.8% | 91.9% | 92.1% | 10.0% |
| 1e-3 | 92.3% | 92.6% | 92.2% | 10.0% |
| 5e-3 | 10.0% | 10.0% | 89.3% | 10.0% |
- When following the tutorial and the course, there is not much to be done except for tweaking algorithms and hyperparameters, with the only network performance readout being “89.1%”, “91.0%”, “95.2%”, “10.0%”, “88.5%” and so on; or maybe a learning curve. It’s a black box with some tunable knobs. But what has happened to the network? How did it learn? What does its internal representation look like? In one of the following weeks, I will have a look into that.
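Coming back to the small parameter study above: the two scaling factors enter the total loss roughly as in the following NumPy sketch. This is a reconstruction under assumptions, not the code I ran. In particular, the input size of 784, the weight initialization, the placeholder value for the data loss and the convention of using half the sum of squared weights as the penalty are all assumed here, and the variable names are made up.

```python
import numpy as np

def l2_penalty(weights):
    # Half the sum of squared weights (assumed convention).
    return 0.5 * np.sum(weights ** 2)

def total_loss(data_loss, w_hidden1, w_hidden2, scale_hidden1, scale_hidden2):
    """Classification loss plus separately scaled L2 penalties per layer."""
    return (data_loss
            + scale_hidden1 * l2_penalty(w_hidden1)
            + scale_hidden2 * l2_penalty(w_hidden2))

# Two hidden layers with 50 and 25 units, as in the parameter study.
rng = np.random.default_rng(0)
w_hidden1 = rng.normal(scale=0.1, size=(784, 50))  # input size 784 is an assumption
w_hidden2 = rng.normal(scale=0.1, size=(50, 25))
data_loss = 2.3  # placeholder value, not a measured result

scales = (1e-4, 5e-4, 1e-3, 5e-3)
for scale_hidden1 in scales:
    for scale_hidden2 in scales:
        loss = total_loss(data_loss, w_hidden1, w_hidden2,
                          scale_hidden1, scale_hidden2)
        print(scale_hidden1, scale_hidden2, loss)
```

Note that under these assumptions the first hidden layer has far more weights (784 x 50) than the second (50 x 25), so the same scaling factor produces penalties of very different size for the two layers.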