Welcome to the TensorBoard intro, where we learn how to debug your Neural Networks.
So a Neural Network often is some sort
of black box, and it's very hard to see what's going on during training.
So usually people tend to print out all sorts of measures during
the Gradient Descent Loop in order to debug and make sense of the training phase.
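As a rough sketch, such a loop might look like the following, assuming the TensorFlow 1.x API, hypothetical tensors train_step, loss and accuracy, and a helper next_batch supplying training data:

    import tensorflow as tf

    # Hypothetical training loop that prints measures while running
    # gradient descent. train_step, loss, accuracy, x, y_ and
    # next_batch are assumed to be defined elsewhere.
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for step in range(1000):
            batch_xs, batch_ys = next_batch(100)
            _, loss_val, acc_val = sess.run(
                [train_step, loss, accuracy],
                feed_dict={x: batch_xs, y_: batch_ys})
            if step % 100 == 0:
                print("step", step, "loss", loss_val, "accuracy", acc_val)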
So what are the most important parameters to look at when training a Neural Network?
The most important is definitely the loss over time.
So as you remember from the previous video,
loss defines how well or how badly a Neural Network fits the data.
And by the way, this here is quite a good loss curve
because it converges to a local optimum quite fast.
The blue line instead indicates a learning rate which is too small:
we are slowly converging towards the optimum, but it takes too long.
So let's run this for 10,000 iterations, and we obtain the following diagram.
It's hard to say, but chances are high
that with a learning rate that is too low,
we never reach the desired local optimum.
Finally, if we have chosen a learning rate which is too high,
we notice a considerable amount of bouncing,
and the only reason why we are still converging is that
the underlying model we're optimizing over is rather simple.
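A minimal sketch of where this learning rate lives in code, assuming the TensorFlow 1.x API and a hypothetical cross_entropy loss tensor:

    # The learning rate is simply a parameter of the optimizer.
    # 0.5 works for this simple model; something like 0.0001 would
    # converge too slowly, while a very large value would bounce.
    train_step = tf.train.GradientDescentOptimizer(
        learning_rate=0.5).minimize(cross_entropy)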
Another, maybe even more important, measure is accuracy.
Here you see accuracy over training iterations.
And as loss goes down,
you should see accuracy going up.
So those are scalar time series,
but now let's have a look at the weight matrices.
This is far more interesting.
Those contain many values,
and we have to find a way to visualize them at once.
Fortunately, TensorBoard does a pretty good job in creating a summary
over a weight matrix of arbitrary shape.
You can find more on how these summaries are calculated in the video description.
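As a sketch, assuming TensorFlow 1.x, recording such a summary over a weight variable W could look like this:

    # Histogram summary over the weight matrix; TensorBoard condenses
    # these per-step histograms into the view shown in the Histograms tab.
    W = tf.Variable(tf.zeros([784, 10]), name="weights")
    tf.summary.histogram("weights", W)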
But here, we see a pretty healthy distribution of weights.
Bad examples include cases where all values are very close to zero
or a uniform distribution resembling a random weight initialization.
This can be an indication that the particular layer hasn't learned anything.
Besides monitoring weights, we can also monitor activations.
Note that the activation is the output of
a particular layer with the activation function applied.
In this simple case, this is just Y.
Remember that Y is the output of the Softmax function.
So any K-dimensional vector
with values ranging between minus and plus infinity is squashed
into a K-dimensional vector with values ranging between zero and one.
This is the default activation function for the output layer in classification tasks.
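A small worked example of this squashing, using NumPy with K equal to three (my own illustration, not part of the lecture code):

    import numpy as np

    # Softmax: exponentiate each entry, then normalize so they sum to one.
    z = np.array([2.0, -1.0, 0.5])            # unbounded scores
    y = np.exp(z) / np.sum(np.exp(z))         # each entry now in (0, 1)
    print(y, y.sum())                         # approx. [0.79 0.04 0.18] 1.0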
So here we see what we expect in a healthy classifier.
Most of the values are close to zero, because those are the assigned probabilities
that a particular input is not in the class.
And we see a fair amount of values near one;
those are the cases where a particular input is in the class.
Since we have 10 different classes,
it is obvious that we see more values close to zero than close to one.
So let's have a look at TensorBoard.
Instructions on how to access TensorBoard from
Data Science Experience are given in the description section of this video.
TensorBoard can visualize multiple runs simultaneously
in order to compare among them.
But let's only have a look at a single run for now.
First, let's have a look at the Scalars view.
In this tab, all summaries of scalars
and how they evolve over training time are displayed.
In our example, we've recorded the two most important measures:
Loss and Accuracy. As you can see,
they are inverse, which is actually a very good sign.
The lower the loss, and therefore the error of the Neural Network,
the better the accuracy.
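A minimal sketch of how Loss and Accuracy end up in this tab, assuming TensorFlow 1.x and hypothetical tensors cross_entropy and accuracy:

    # Scalar summaries feed the Scalars tab.
    tf.summary.scalar("loss", cross_entropy)
    tf.summary.scalar("accuracy", accuracy)
    merged = tf.summary.merge_all()
    writer = tf.summary.FileWriter("./logs/run1", tf.get_default_graph())
    # Inside the training loop:
    #   summary, _ = sess.run([merged, train_step], feed_dict=...)
    #   writer.add_summary(summary, global_step=step)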
We can maximize the plot and also adjust the smoothing parameter.
If we set it to one,
we obtain a line which is smoothed so much that it isn't useful here.
So let's decrease it slowly.
As we can see, the line fits the actual trajectory
more and more, and with 0.96
we can clearly see a trend without getting lost too much in the details.
And here again, we can have a look at
those two important measures and see that they are inverse.
Remember that loss is based on a defined loss function,
Cross-Entropy in this case, and accuracy tells us
the fraction of correctly classified examples over the total number of examples.
So now let's have a look at the Graph tab.
This graph should be read bottom-up.
So we start with the weight matrix W and the placeholder X for our input data.
These are multiplied, and then the bias variable b is added.
This result is squashed using Softmax and the final result is stored in Y.
Note that Y underscore and Y are then used to
calculate accuracy by taking the Argmax of both.
By comparing those and taking the mean over this vector,
we obtain the accuracy.
The upper branch of the graph computes loss using the Cross-Entropy function.
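For reference, a minimal sketch of the graph just described, assuming the usual TensorFlow 1.x Softmax regression setup with 784 input features and 10 classes:

    # Read bottom up: x and W are multiplied, the bias b is added,
    # and the result is squashed with Softmax into Y.
    x = tf.placeholder(tf.float32, [None, 784], name="x")
    y_ = tf.placeholder(tf.float32, [None, 10], name="y_")   # true labels
    W = tf.Variable(tf.zeros([784, 10]), name="W")
    b = tf.Variable(tf.zeros([10]), name="b")
    y = tf.nn.softmax(tf.matmul(x, W) + b, name="y")

    # Upper branch: loss via the Cross-Entropy function.
    cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), axis=1))

    # Accuracy: take the Argmax of both, compare, and take the mean.
    correct = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))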
Note that parts of the graph are hidden from us in an externalized sub-graph.
As you can see, the connection points of the sub-graph have turned red now.
You can include the sub-graph but then the graph becomes more complex.
This is because the gradient computation is reading values from variables and
placeholders all over the place and doesn't add more information at that point.
Therefore, let's remove it again.
Now, let's turn to histograms and have a look at the weight matrix.
This is a highly compressed view of what's going on inside the matrix during training.
So on the z axis,
training iterations are reflected;
the closer they are, the more recent they are.
The x axis informs us about the values
that the elements of the matrix have been assigned,
and the y axis shows their frequency.
So in this case,
we are looking at a pretty healthy chart.
It is important to understand how this condensed view is created.
Therefore, a link containing further explanations can be found below.
But intuitively, much like a histogram in the pure mathematical sense,
the plot gives you an idea of which value ranges,
and at what frequencies, are assigned to the elements of the weight matrix.
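To make that concrete, here is a small NumPy sketch of one slice of this view, the histogram of all weight values at a single training step; the randomly drawn weights merely stand in for a real snapshot of W:

    import numpy as np

    # Simulated snapshot of a 784 x 10 weight matrix at one training step.
    W_snapshot = np.random.normal(loc=0.0, scale=0.1, size=(784, 10))

    # One slice of the Histograms view is essentially this: value bins on
    # the x axis, how many weights fall into each bin on the y axis.
    counts, bin_edges = np.histogram(W_snapshot.ravel(), bins=30)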
So we see that the values are symmetrically centered around a mean of zero,
and although they are close to zero, they are not exactly zero.
This is important since gradients correlate
with weights, and we cannot work with zero gradients.
The other observation is
that at the left and right extremes,
close to minus one and plus one,
there's not much going on.
And therefore, we are definitely not oversaturating.
Finally, this also definitely doesn't look like a uniform distribution.
Therefore, training worked pretty well.
Now let's have a look at Y. Y in
the Softmax regression model corresponds to the neuron activations of the output layer.
You see that most of the values are close to
zero, some are close to one, and in between
there's not much going on.
And this is also a very good sign,
since this is exactly what Softmax should do in a multi-class classifier,
because class probabilities in the output vectors are either
zero for the wrong classes or one for the correct class.
Since there are more incorrect classes than correct ones,
nine versus one in this case,
you see many more zero values than ones.
There's much more to say about TensorBoard,
but we've covered the most important aspects for this course and we will see in
future lectures how those measures can be
used in order to debug Neural Networks during training.
In the next module, we will cover automatic differentiation,
one of the key features of TensorFlow.