In this video, we will discuss
two main problems that arise in training Recurrent Neural Networks,
the problems of exploding and vanishing gradients.
As you already know,
we can use the backpropagation algorithm to train a recurrent neural network,
but in this case, we backpropagate
gradients not only through layers but also through time.
As an example, in the previous video
we derived the expression for the gradient of the loss L with
respect to the recurrent weight matrix W. We know
that to compute the gradient of the loss at time step t
with respect to W, we should sum up
the contributions from all the previous time steps to this gradient.
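Written out, that expression looks roughly as follows; this is only a sketch, with h_k denoting the hidden state at time step k and L_t the loss at time step t, and the exact notation from the previous video may differ slightly.
\[
\frac{\partial L_t}{\partial W}
  \;=\; \sum_{k=1}^{t}
        \frac{\partial L_t}{\partial h_t}
        \left( \prod_{i=k+1}^{t} \frac{\partial h_i}{\partial h_{i-1}} \right)
        \frac{\partial h_k}{\partial W}.
\]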
Now let's look at the expression for the gradient more closely.
As you can see, there is a product of Jacobian matrices in each term of the sum.
And if we look at the particular term which corresponds to
the contribution from some time step k to the gradient
at time step t, then we see that the more steps there are between time moments k and t,
the more elements there are in this product.
So the values of the Jacobian matrices have
a very strong influence, especially on the contributions from faraway steps.
Let's suppose for a moment that we have only one hidden unit in our network.
So h is a scalar.
Then all the elements in the expression for the gradient are also scalars:
the gradient itself is a scalar, all the Jacobian matrices are scalars, and so on.
In this case, it is clear that if all the Jacobian matrices,
which are now simply scalars, are less than one in absolute value,
then their product goes to zero
exponentially fast as the number of elements in the product grows.
And on the contrary, if all the Jacobian scalars are greater than one in absolute value,
then their product goes to infinity exponentially fast.
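A quick numeric sketch in Python makes this concrete; the values 0.9 and 1.1 are just hypothetical Jacobian scalars, and 100 is a hypothetical number of steps between k and t.

# Hypothetical Jacobian scalars multiplied 100 times,
# as for a dependency spanning 100 time steps.
print(0.9 ** 100)  # ~2.7e-05 -- the product vanishes
print(1.1 ** 100)  # ~1.4e+04 -- the product explodes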
As a result, in the first case,
the contributions from the faraway steps go to zero and
the gradient contains only the information about nearby steps.
Thus it is difficult to learn long-range dependencies
with a simple recurrent neural network.
This problem is usually called the vanishing gradient problem,
because a lot of elements in the gradient simply vanish and don't affect the training.
In the second case, the contributions from
faraway steps grow exponentially fast, so the gradient itself grows too.
If an input sequence is long enough,
the gradient may even overflow and become not-a-number (NaN) in practice.
This problem is called the exploding gradient
problem and it makes the training very unstable.
The reason is that if the gradient is
very large, then we take
a very long step in the direction of this gradient in the parameter space.
Since we optimize a very complex multimodal function and use stochastic methods,
we may end up at a very poor point after such a step.
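As a reminder, a plain, unclipped gradient descent update has the following form, where theta stands for the parameters and eta for the learning rate, so a huge gradient translates directly into a huge jump in the parameter space.
\[
\theta \;\leftarrow\; \theta - \eta \, \nabla_{\theta} L .
\]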
OK, we have discussed the simplified case;
now let's return to real life.
A recurrent neural network usually contains not just one hidden unit,
but a whole vector of them.
Consequently, the Jacobian matrices are really matrices, not scalars.
We can apply the same reasoning here.
But instead of the absolute value,
we need to use the spectral matrix norm, which is equal to
the largest singular value of the matrix.
If all the Jacobian matrices in the product have norms less than one,
then the gradient vanishes.
And if all the Jacobian matrices have norms greater than one,
then the gradient explodes.
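Here is a small NumPy sketch of this effect. The 4x4 symmetric Jacobian and the norms 0.9 and 1.1 are arbitrary choices, and repeating the same Jacobian 50 times is a simplification of the real situation, where the Jacobians differ from step to step.

import numpy as np

rng = np.random.default_rng(0)

def scaled_jacobian(target_norm, size=4):
    # Random symmetric matrix rescaled so that its spectral norm
    # (largest singular value) equals target_norm. For a symmetric
    # matrix the spectral norm coincides with the largest |eigenvalue|.
    A = rng.standard_normal((size, size))
    J = (A + A.T) / 2
    return J * (target_norm / np.linalg.norm(J, 2))

for norm in (0.9, 1.1):
    J = scaled_jacobian(norm)
    product = np.linalg.matrix_power(J, 50)   # product of 50 identical Jacobians
    print(norm, np.linalg.norm(product, 2))   # ~0.9**50 (tiny) vs ~1.1**50 (huge)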
Now we know that the values
of the Jacobian matrices are crucial for training a recurrent neural network.
So let's see what values they take in practice.
As you remember, the hidden units at time step t
can be computed by applying some nonlinear function f to
a linear combination of the inputs at this time step and the hidden units at the previous time step.
Let's denote this linear combination as the preactivation at time step t.
To compute the Jacobian matrix of the hidden units at time step t
with respect to the hidden units at the previous time step,
we can use the chain rule.
So we first compute the Jacobian of h with respect to its preactivation and
then compute the Jacobian matrix
of this preactivation with respect to the previous hidden units.
Since f is an element-wise nonlinearity,
the first Jacobian here is a diagonal matrix with the derivatives of f on the diagonal.
And how do we compute the second Jacobian? That is a question for you.
Since the preactivation is a linear combination of some elements,
the second Jacobian consists of the weights
of the hidden units in this linear combination.
So it is equal to the weight matrix W. Now let's look at
both parts of the Jacobian matrix of the hidden units h_t
with respect to the hidden units h_{t-1} separately.
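In formulas, and only as a sketch, denoting the preactivation at time step t by a_t and writing U and b for the input weight matrix and the bias, which are not spelled out in this video, the Jacobian reads:
\[
h_t = f(a_t), \qquad a_t = W h_{t-1} + U x_t + b, \qquad
\frac{\partial h_t}{\partial h_{t-1}}
  = \operatorname{diag}\!\bigl(f'(a_t)\bigr)\, W .
\]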
The first part depends on the type of nonlinearity we use.
The usual choice of nonlinearity for neural networks
is the sigmoid, the hyperbolic tangent, or the rectified linear unit function.
As you can see on the left part of the slide,
all these nonlinearities are very flat in a large part of the input space.
The sigmoid and hyperbolic tangent are
almost constant for both very negative and very positive inputs,
and the rectified linear unit is equal to zero for all negative inputs.
The derivatives of these nonlinearities are very
close to zero in the regions where they are flat.
So, as you can see on the right part of this slide,
the derivatives of the sigmoid and hyperbolic tangent are less than one
almost everywhere,
and this is very likely to cause
the vanishing gradient problem. The situation
with the rectified linear unit is much better: at least for positive inputs
its derivative is equal to one.
But the gradients may still vanish because of the zero
derivative in the negative part of the input space.
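A quick NumPy sketch of these derivatives; the sample inputs are arbitrary.

import numpy as np

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])   # arbitrary sample inputs

sigmoid = 1.0 / (1.0 + np.exp(-x))
d_sigmoid = sigmoid * (1.0 - sigmoid)       # at most 0.25, near zero for large |x|
d_tanh = 1.0 - np.tanh(x) ** 2              # at most 1, near zero for large |x|
d_relu = (x > 0).astype(float)              # 0 for non-positive inputs, 1 for positive

print(d_sigmoid)   # approx. [0.0066 0.1966 0.25 0.1966 0.0066]
print(d_tanh)      # approx. [0.0002 0.42 1. 0.42 0.0002]
print(d_relu)      # [0. 0. 0. 1. 1.]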
Now let's look at the second part of the Jacobian of h. The weight matrix is
a parameter of the model, so its norm may be either large or small.
A small norm could aggravate the vanishing gradient problem, and
a large norm could cause the exploding gradient problem,
especially in combination with the rectified linear unit nonlinearity.
OK, let's summarize what we have learned in this video.
Recurrent Neural Networks have a sequential nature, so they are very deep in time.
Therefore, the vanishing and exploding gradient problems may arise during their training.
Actually, these problems are not exclusive to recurrent neural networks;
they also occur in deep feedforward networks.
Vanishing gradients make the learning of long-range dependencies very
difficult, and exploding gradients
make the training process unstable and may even crash it.
In the next videos, we will discuss
different methods that can help us to overcome these issues.