Deep Learning is the latest of our machine learning / intelligence analytics methods (let's settle on "methods") that is sweeping all sorts of development. It's supposed to solve the toughest of problems and be our gateway to AI. All the tech companies (and a lot that aren't) will eagerly tell you that they use, or are looking into, Deep Learning applied to rocket science, coffee making, and everything in between. So, what is it?
The Short Version
The oversimplified version is, more or less, "Deep Learning is using large, complex Artificial Neural Networks, trained on massive amounts of data, to perform operations beyond classification". This doesn't sound too complicated, and, indeed, it is not. The main advantage we have today is, simply put, that we have the computational power to, well, power these large (or, by historical standards, enormous) neural networks.
The Long Version – Artificial Neural Networks
To see where we came from, and where we are today, let's discuss Artificial Neural Networks, or ANNs. These are, effectively, a facsimile of our understanding of how brains work: a set of nodes, joined by links, that are able to excite each other in various ways. A neural network consists of three types of layers. An input layer accepts the data, with each input neuron generally taking a real number between 0 and 1. Zero or more (but usually at least one) hidden layers transform that data, which is then fed into an output layer with a limited number of nodes. Each node in the output layer, once more, produces a real number between 0 and 1.
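To make that concrete, here is a minimal sketch in Python (plain NumPy, not any particular framework) of what such a network boils down to: a couple of matrix multiplications with a squashing function in between. All the sizes here are arbitrary.

```python
import numpy as np

def sigmoid(x):
    # Squash any real number into the (0, 1) range.
    return 1.0 / (1.0 + np.exp(-x))

# A toy network: 4 input neurons, 3 hidden neurons, 2 output neurons.
rng = np.random.default_rng(0)
w_hidden = rng.normal(size=(4, 3))   # weights: input -> hidden
w_output = rng.normal(size=(3, 2))   # weights: hidden -> output

def forward(inputs):
    # inputs: 4 numbers, each ideally in [0, 1]
    hidden = sigmoid(inputs @ w_hidden)
    return sigmoid(hidden @ w_output)  # 2 numbers, each in (0, 1)

print(forward(np.array([0.2, 0.9, 0.0, 0.5])))
```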
Before the ANN can be used, it must be trained. Training uses pairs of predetermined input and output data. This data is defined as correct, and is fed to the neural network in some order (often randomised). The network produces some output based on the input, from which an error is calculated. That error is then used to try to correct the weights assigned to the connections between the nodes in the neural network: strengthening connections that help produce the correct answer, and weakening ones that don't.
Over time, the neural network should converge into one that has figured out the commonalities in the input data, and uses those to produce the correct output. This is where large data sets come in handy: the more data the network is trained on, the higher the chance it learns to properly identify which feature(s) of the input data are important.
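As a from-scratch illustration of that loop (a toy sketch, not how a production library would do it), here is a single-layer network learning that the first input column is the one that determines the output:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy training pairs: the output happens to equal the first input value.
X = np.array([[0., 0., 1.], [0., 1., 1.], [1., 0., 1.], [1., 1., 1.]])
y = np.array([[0.], [0.], [1.], [1.]])

rng = np.random.default_rng(1)
weights = rng.normal(size=(3, 1))

for epoch in range(10000):
    prediction = sigmoid(X @ weights)
    error = y - prediction                      # how wrong was the network?
    # Strengthen or weaken each connection in proportion to its
    # contribution to the error (gradient of the squared error).
    gradient = X.T @ (error * prediction * (1 - prediction))
    weights += 0.1 * gradient

print(sigmoid(X @ weights))  # converges towards [0, 0, 1, 1]
```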
Feeding Artificial Neural Networks
Because the ANN's input layer can only take one number per neuron, one cannot, for example, simply take an arbitrary image and feed it into the ANN; the image must first be turned into a suitable set of numbers. This step is often called preprocessing or encoding, and consists of normalising the input data so that each data point falls within the chosen range (e.g. [0,1]), and that the data is presented in a uniform manner. For an image, this could mean converting it to black and white and making sure it is a square. For audio, this could mean converting it to mono and increasing the volume if it is too low.
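As an example, a (hypothetical) preprocessing step for images could look like this, using Pillow and NumPy; the exact choices, such as size, greyscale and range, depend entirely on what the network was trained with:

```python
import numpy as np
from PIL import Image

def preprocess(path, size=28):
    # Convert to greyscale ("L" mode), force a square shape,
    # and scale pixel values from [0, 255] down to [0, 1].
    img = Image.open(path).convert("L").resize((size, size))
    return np.asarray(img, dtype=np.float32) / 255.0

pixels = preprocess("ball.jpg")   # shape (28, 28), values in [0, 1]
network_input = pixels.flatten()  # one number per input neuron
```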
This preprocessing step is necessary to make the data uniform. A neural network can only reasonably work if the data it sees is similar to the data it has been trained with. Training on black-and-white pictures, for example, lets the network handle objects of any colour, simply because it never sees colour in the first place; all it can distinguish is luminance. When wanting to identify, say, a ball, it was seldom relevant what colour the ball was. (With deep learning, this is changing, but we'll touch on that later.)
Training Artificial Neural Networks
We touched earlier on the fact that training a neural network is a repetitive process. It is also a complex one. Properly training a neural network involves selecting a good, wide data set, and configuring a number of choices about how the network should be trained: the error (loss) function, how the connections between neurons should be updated during the training phase, and so on.
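In a modern framework such as TensorFlow's Keras API, most of those decisions are reduced to a handful of parameters. A hedged sketch follows; every size and choice here is an arbitrary placeholder rather than a recommendation:

```python
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(64, activation="relu", input_shape=(784,)),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(10, activation="softmax"),
])

# The "how should the network be trained" decisions: the loss (error)
# function, the rule for updating the weights, and what to report.
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
```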
Furthermore, during training, the ANN should be supervised (in the sense of being monitored). The best results usually come from networks that use large parts of their structure for every answer, i.e. that are taught to generalise. Allowing the ANN to diverge instead leads to the creation of mini neural networks within the larger one, each learning to do one specific thing with only a few neurons; this works well during training but gives worse real-world performance ("overfitting").
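One common way to keep an eye on this in practice is to hold back part of the training data as a validation set, and stop as soon as performance on it stops improving. A self-contained sketch, again using Keras and entirely made-up data:

```python
import numpy as np
from tensorflow import keras

# Stand-in data: 1,000 samples of 20 features, with a binary label.
X = np.random.rand(1000, 20)
y = (X.sum(axis=1) > 10).astype("float32")

model = keras.Sequential([
    keras.layers.Dense(32, activation="relu", input_shape=(20,)),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

model.fit(
    X, y,
    epochs=100,
    validation_split=0.2,      # hold back 20% of the data to watch for overfitting
    callbacks=[keras.callbacks.EarlyStopping(
        monitor="val_loss",    # stop once the held-out error stops improving
        patience=5,
        restore_best_weights=True,
    )],
)
```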
Even without these concerns, training a neural network is a time-intensive process. The training time increases with both the size of the neural network and the size of the training data. It is, therefore, effectively without an upper bound; the practical limit is whatever the designer/trainer decides is a reasonable ANN and data set size. This is a process that can start at hours, but easily runs into the "days" category, if not more, with larger amounts of data.
This has led to things such as Google's Inception image recognition network, an open, pre-trained neural network for recognising images. Not only is it pre-trained, but Google has additionally made it possible to retrain only parts of the network (specifically, only the final layer) so that it can recognise image classes other than the 1,000 it was pre-trained on. They note that retraining only the last layer of the network can be done in "as little as 30 minutes", though they neglect to say how large a training set is used for that.
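This pattern of reusing a pre-trained network and retraining only the top is now standard. The sketch below is not Google's retraining script, just my rough reconstruction of the same idea using the InceptionV3 weights that ship with Keras:

```python
from tensorflow import keras

# Load Inception v3 pre-trained on ImageNet, without its final classifier.
base = keras.applications.InceptionV3(weights="imagenet", include_top=False,
                                      input_shape=(299, 299, 3))
base.trainable = False  # freeze everything that was already trained

# Bolt a new final layer on top for, say, 5 new image classes.
model = keras.Sequential([
    base,
    keras.layers.GlobalAveragePooling2D(),
    keras.layers.Dense(5, activation="softmax"),
])

model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(...) then only updates the new layer's weights.
```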
One of the more illustrative sentences in Google's discussion about retraining Inception harkens back to the difficulty of properly training ANNs (emphasis mine):
A final test accuracy evaluation is run on a set of images kept separate from the training and validation pictures. This test evaluation is the best estimate of how the trained model will perform on the classification task. You should see an accuracy value of between 90% and 95%, though the exact value will vary from run to run since there’s randomness in the training process.
So, in a nutshell, training ANNs is hard, choosing the right data is hard, setting up the right parameters is hard, and, even if one manages to get all those things right, training can take a long time and still not produce a good enough result.
So why are we talking about ANNs? Besides, aren't they old?
In a nutshell, yes. ANNs are a practically ancient concept in computer science, and have always been viewed as the gateway to AI, primarily on the assumption that if we can gather enough computational power to simulate a human brain, we should be able to run the same “software” as a human brain, but in a computer.
The major change that has happened recently is just that: the hardware has become available to run vast ANNs that would previously have been unheard of. It has never been a mystery that ANNs can scale with CPU and memory, but twenty years ago the default assumption would have been no more than two hidden layers in an ANN; today, two hidden layers would be an introductory exercise. That limitation led to a decline in ANNs during the 2000s, when they were supplanted by hand-fitted algorithms. These algorithms could run faster, and do things like run on a mobile phone, qualities that were valued more highly than the flexibility of ANNs.
Enter Deep Neural Networks, or DNNs. There is (as yet) no real convergence on how deep a network needs to be to qualify as a DNN, but we are definitely above two hidden layers; hundreds of layers is not uncommon for a DNN. These slides from nVidia and MIT give a good overview of the area, and of the richness of neural network architectures and techniques that have been developed.
I get it, DNNs are awesome. I’ll use them everywhere.
Because there's no free lunch, DNNs are good, but they are still good at specific tasks. For example, an LSTM (Long Short-Term Memory network) is a type of network favoured for speech recognition, as it is able to "remember" things that happened recently. This helps it recognise a sentence, as it can reference the earlier sounds, which in turn helps resolve ambiguities in speech. Meanwhile, for image recognition, the current leaders in the field are CNNs (Convolutional Neural Networks), which iterate over the image one part at a time. For performing logic, RNNs (Recurrent Neural Networks) may be preferred.
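At the simplest level, the difference shows up in how each network is shaped and what kind of input it expects. A rough sketch (all sizes arbitrary):

```python
from tensorflow import keras

# A small CNN: slides filters over a 2D image, one patch at a time.
cnn = keras.Sequential([
    keras.layers.Conv2D(32, (3, 3), activation="relu", input_shape=(64, 64, 1)),
    keras.layers.MaxPooling2D((2, 2)),
    keras.layers.Flatten(),
    keras.layers.Dense(10, activation="softmax"),
])

# A small LSTM: reads a sequence step by step and carries a memory along.
lstm = keras.Sequential([
    keras.layers.LSTM(64, input_shape=(100, 13)),  # 100 time steps, 13 features each
    keras.layers.Dense(10, activation="softmax"),
])
```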
This can lead to fairly complicated setups being needed in order to perform more complex operations. In a demo at Google I/O 2017, the speaker showed a system that could be fed a video and then be asked questions about it. For example, the system could be asked "what is the man doing?" and answer "packing", based on the observation that a man in the video was putting a box in a car.
Seemingly a simple use case, it actually boils down to a fairly large number of distinct neural networks, pictured to the right: a CNN to recognise the individual video frames, an LSTM to put them together, another LSTM to recognise what the user is asking, a weighted function to combine the output of the video branch with the output of the user question, and, finally, a classifier (which appears to be an RNN) to produce one of seven possible output keywords.
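To give a feel for what that wiring looks like, here is a hedged sketch of such a pipeline in Keras. This is my reconstruction of the general shape, not the actual demo code, and every size and name in it is made up:

```python
from tensorflow import keras

# Video branch: a CNN per frame, then an LSTM over the frame features.
video_in = keras.Input(shape=(40, 224, 224, 3))            # 40 frames per clip
frame_cnn = keras.Sequential([
    keras.layers.Conv2D(32, (3, 3), activation="relu", input_shape=(224, 224, 3)),
    keras.layers.GlobalAveragePooling2D(),
])
video_features = keras.layers.TimeDistributed(frame_cnn)(video_in)
video_vector = keras.layers.LSTM(128)(video_features)

# Question branch: an LSTM over the words of the question.
question_in = keras.Input(shape=(20,), dtype="int32")      # up to 20 word ids
embedded = keras.layers.Embedding(input_dim=10000, output_dim=64)(question_in)
question_vector = keras.layers.LSTM(128)(embedded)

# Combine both branches and classify into one of seven answer keywords.
combined = keras.layers.concatenate([video_vector, question_vector])
answer = keras.layers.Dense(7, activation="softmax")(combined)

model = keras.Model(inputs=[video_in, question_in], outputs=answer)
```

The weighted combination from the talk is reduced here to a plain concatenation followed by a dense layer; the point is the number of separate networks involved, not the exact glue between them.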
This is complicated.
So it’s hard. Is there any Deep Learning shortcut?
This is where things get interesting. Because of the rapid increase in available CPU, GPU and TPU power, it is now possible to, quite simply, train multiple DNNs in parallel and see which one comes up with the best result. This cross-breeding of evolutionary computing and neural networks (or, quite possibly, an infinite number of monkeys) allows the search over DNN architectures and training parameters to be automated, quite possibly making it much easier to create new ones.
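In its bluntest form, that search is just training a number of candidate networks with randomly chosen settings and keeping the best one. A toy sketch, assuming a hypothetical build_and_train helper that trains one model and returns its validation accuracy:

```python
import random

def random_config():
    # Each candidate network gets its own randomly chosen settings.
    return {
        "hidden_layers": random.randint(2, 10),
        "units_per_layer": random.choice([64, 128, 256, 512]),
        "learning_rate": 10 ** random.uniform(-5, -2),
    }

best_score, best_config = 0.0, None
for _ in range(20):                      # in practice, run these in parallel
    config = random_config()
    score = build_and_train(config)      # hypothetical: returns validation accuracy
    if score > best_score:
        best_score, best_config = score, config

print(best_config, best_score)
```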
Coupled with releases such as Tensor2Tensor, it is easier than ever to test out multiple neural networks, combinations of neural networks, and sequences of connected neural networks (while, at the same time, running up the cloud bill).
This, if anything, is the current promise behind Deep Learning.