本文源自YouTube网站《Deep Learning SIMPLIFIED》系列视频，文章内容为视频的字幕，供深度学习初学者参考学习。
So now you’re probably thinking - wow, deep nets are really great! But why did it take so long for them to become popular? Well as it turns out, when you try to train them with a method called backpropagation（反向传播）, you run into a fundamental problem called the vanishing gradient（消失梯度）, or sometimes the exploding gradient（爆发梯度）. When that happens, training takes too long and the accuracy really suffers. Let’s take a closer look.
When you’re training a neural net, you’re constantly calculating a cost value. The cost is typically the difference between the net’s predicted output and the actual output from a set of labelled training data. The cost is then lowered by making slight adjustments to the weights and biases over and over throughout the training process, until the lowest possible value is obtained. Here is that forward prop again; and here are the example weights and biases.
The training process utilizes something called a gradient（梯度）, which measures the rate at which the cost will change with respect to a change in a weight or a bias.
Deep architectures are your best and sometimes your only choice for complex machine learning problems such as facial recognition. But up until 2006, there was no way to accurately train deep nets due to a fundamental problem with the training process: the vanishing gradient.
Let’s think of a gradient like a slope, and the training process like a rock rolling down that slope. A rock will roll quickly down a steep slope but will barely move at all on a flat surface. The same is true with the gradient of a deep net. When the gradient is large, the net will train quickly. When the gradient is small, the net will train slowly.
Here’s that deep net again.
And here is how the gradient could potentially vanish or decay back through the net. As you can see, the gradients are much small in the earlier layers. As a result, the early layers of the network are the slowest to train. But this is a fundamental problem! The early layers are responsible for detecting the simple patterns and the building blocks -when it came to facial recognition, the early layers detected the edges which were combined to form facial features later in the network.
And if the early layers get it wrong, the result built up by the net will be wrong as well. It could mean that instead of a face like this, your net looks for this.
The process used for training a neural net is called back-propagation or back-prop（反向传播）. We saw before that forward prop starts with the inputs and works forward; back-prop does the reverse, calculating the gradient from right to left.
For example, here are 5 gradients, 4 weights and 1 bias.
It starts with the left and works back through the layers, like so. Each time it calculates a gradient, it uses all the previous gradients up to that point.
So, let’s start with that node. That edge uses the gradient at that node. And the next. So far things are simple.
As you keep going back, things get a bit more complex - that one for example uses a lot of gradients, even though this is a relatively simple net. If your net gets larger and deeper, like this one, it gets even worse.
But why is that? Well, a gradient at any point is the product of the previous gradients up to that point. And the product of two numbers between 0 and 1 gives you a smaller number.
Say this rectangle is a one.
Also, say there are two gradients - a fourth - like that
- and a third.
If you multiply them, you get a fourth of a third which is a twelfth.
A fourth of a twelfth is a forty-eighth.
You can see that numbers keep getting smaller the more you multiply. Have you ever had this issue while training a neural network with backpropagation? If so, please comment and let me know your thoughts.
As a result of all this, backprop ends up taking a lot of time to train the net, and the accuracy is often very low.
Up until 2006, deep nets were still underperforming shallow nets and other machine learning algorithms. But everything changed after three breakthrough papers published by Hinton, Lecun, and Bengio in 2006 and 2007.
In the next video, we’ll begin taking a closer look at these breakthroughs, starting with the Restricted Boltzmann Machine（受限波兹曼机）.