0:00

In this video, I'm going to return to the idea of full Bayesian learning and explain a little bit more about how it works. And then in the following video, I'm going to show how it can be made practical. In full Bayesian learning, we don't try to find a single best setting of the parameters. Instead, we try to find the full posterior distribution over all possible settings. That is, for every possible setting, we want a posterior probability density, and we want all those densities to add up to one. It's extremely computationally intensive to compute this for all but the simplest models. So, in the example earlier, we did it for a biased coin, which has just one parameter: how biased it is. But in general, for a neural net, it's impossible. After we've computed the posterior distribution across all possible settings of the parameters, we can then make predictions by letting each different setting of the parameters make its own prediction, and then averaging all those predictions together, weighting by their posterior probability. This is also very computationally intensive.
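To make the averaging idea concrete, here's a minimal sketch with made-up numbers (not from the lecture): three parameter settings each make their own prediction, and the Bayesian prediction is their posterior-weighted average.

```python
# Hypothetical example: three parameter settings, each making its own
# prediction for one test case, combined with posterior weights.
posterior = [0.5, 0.3, 0.2]      # posterior probabilities (add up to one)
predictions = [1.0, 2.0, 4.0]    # each setting's own prediction

# Bayesian prediction: posterior-weighted average of all predictions.
bayes_prediction = sum(p * y for p, y in zip(posterior, predictions))
# = 0.5*1.0 + 0.3*2.0 + 0.2*4.0
print(bayes_prediction)
```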

The advantage of doing this is that if we use the full Bayesian approach, we can use complicated models even when we don't have much data. So there's a very interesting philosophical point here.

Â 1:24

We're now used to the idea of overfitting when you fit a complicated model to a small amount of data. But that's basically just a result of not bothering to get the full posterior distribution over the parameters. So, frequentists would say, if you don't have much data, you should use a simple model. And that's true. But it's only true if you assume that fitting a model means finding the single best setting of the parameters. If you find the full posterior distribution, that gets rid of overfitting. If there's very little data, the full posterior distribution will typically give you very vague predictions, because many different settings of the parameters that make very different predictions will have significant posterior probability. As you get more data, the posterior probability will get more and more focused on a few settings of the parameters, and the posterior predictions will get much sharper. So, here's a classic example of overfitting. We've got six data points, and we've fitted a fifth-order polynomial, so it should go exactly through the data, which it more or less does. We've also fitted a straight line, which only has two degrees of freedom.
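As a sketch of that comparison (with invented data points, since the lecture's actual points aren't given), fitting both models with NumPy looks like this:

```python
import numpy as np

# Six made-up data points roughly along a line (not the lecture's data).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.2, 0.9, 2.1, 2.9, 4.2, 4.8])

# Fifth-order polynomial: six coefficients for six points, so it can
# pass through every point almost exactly.
poly5 = np.polyfit(x, y, deg=5)

# Straight line: only two degrees of freedom, so it can't fit exactly.
line = np.polyfit(x, y, deg=1)

print(np.abs(np.polyval(poly5, x) - y).max())  # essentially zero
print(np.abs(np.polyval(line, x) - y).max())   # noticeably larger
```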

Â 2:42

And so, which model do you believe? The model that has six coefficients and fits the data almost perfectly, or the model that only has two coefficients and doesn't fit the data all that well? It's obvious that the complicated model fits better, but you don't believe it. It's not economical, and it also makes silly predictions. So, if you look at the blue arrow: if that's the input value and you're trying to predict the output value, the red curve will predict a value that's lower than any of the observed data points, which seems crazy, whereas the green line will predict a sensible value. But everything changes if, instead of fitting one fifth-order polynomial, we start with a reasonable prior over fifth-order polynomials, for example, that the coefficients shouldn't be too big. And then we compute the full posterior distribution over fifth-order polynomials. And I've shown you a sample from this distribution in the picture, where a thicker line means higher probability in the posterior.
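The lecture doesn't show how such samples are drawn, but for a polynomial with a Gaussian prior on the coefficients and Gaussian observation noise, the posterior over coefficients is itself Gaussian, so it can be sampled exactly. A sketch, with assumed prior and noise precisions (alpha, beta) and the same made-up data:

```python
import numpy as np

# Made-up data points; design matrix with columns x^0 ... x^5.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.2, 0.9, 2.1, 2.9, 4.2, 4.8])
X = np.vander(x, N=6, increasing=True)

alpha = 1.0   # prior precision: "the coefficients shouldn't be too big"
beta = 25.0   # assumed precision of the output noise

# Gaussian prior N(0, I/alpha) + Gaussian likelihood gives a Gaussian
# posterior over coefficients with this covariance and mean.
S_inv = alpha * np.eye(6) + beta * X.T @ X
S = np.linalg.inv(S_inv)
m = beta * S @ X.T @ y

# Draw sample curves from the posterior, like the curves in the picture.
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(m, S, size=10)
```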

Â 3:49

So, you will see some of those thin curves miss a few of the data points by quite a lot, but nevertheless they're quite close to most of the data points. Now we get much vaguer, but much more sensible, predictions. So, where the blue arrow is, you'll see the different models predict very different things, while, on average, they make a prediction quite close to the prediction made by the green line. From a Bayesian perspective, there's no reason why the amount of data you collect should influence your prior beliefs about the complexity of the model.

Â 4:24

A true Bayesian would say, you have prior beliefs about how complicated things might be, and just because you haven't collected any data yet, it doesn't mean you think things are much simpler. So, we can approximate full Bayesian learning in a neural net, if the neural net has very few parameters.

Â 4:48

So each parameter is only allowed a few alternative values, and then we take the cross product of all those values for all the parameters. And now we get a number of grid points in the parameter space. At each of those points, we can see how well our model predicts the data; that is, if we're doing supervised learning, how well the model predicts the target outputs. And we can say that the posterior probability at that grid point is the product of how well it predicts the data and how likely it is under the prior, with the whole thing normalized so that the posterior probabilities over all grid points add up to one.
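As a sketch of that likelihood-times-prior-then-normalize step, here's a toy one-parameter model with a Gaussian likelihood and a Gaussian prior (the lecture doesn't fix a particular model; these are assumptions):

```python
import numpy as np

# Toy setup: one parameter w on a grid; data assumed to be ~ N(w, 1).
grid = np.linspace(-3.0, 3.0, 9)     # a few allowed values for w
y = np.array([0.8, 1.2, 1.0])        # made-up observed data

# How well each grid point predicts the data: product of Gaussian
# likelihoods over the data points (done in log space for stability).
log_lik = np.array([-0.5 * np.sum((y - w) ** 2) for w in grid])

# Prior over grid points: here a Gaussian preferring small |w|.
log_prior = -0.5 * grid ** 2

# Posterior = likelihood * prior, renormalized to add up to one.
log_post = log_lik + log_prior
post = np.exp(log_post - log_post.max())
post /= post.sum()
```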

Â 5:32

This is still very expensive, but notice it has some attractive features. There's no gradient descent involved, and there are no local optimum issues. We're not following a path in this space; we're just evaluating a set of points in this space. Once we've decided on the posterior probability to assign to each grid point, we then use them all to make predictions on the test data. That's also expensive, but when there isn't much data, it'll work much better than maximum likelihood or maximum a posteriori. So the way we predict the test output, given the test input, is this: the probability of the test output, given the test input, is the sum over all grid points of the probability of that grid point, given the training data and our prior, times the probability of that test output, given the test input and that grid point. In other words, we have to take into account the fact that we might add noise to the output of the net before producing the test answer. So, here's a picture of full Bayesian

learning. We have a little net here that has four weights and two biases. If we allow nine possible values for each of those weights and biases, there will be 9^6 = 531,441 grid points in the parameter space. It's a big number, but we can cope with it. For each of those grid points, we compute the probability of the observed outputs on all the training cases. We multiply that by the prior for the grid point, which might depend on the values of the weights, for example. And then we renormalize to get the posterior probability over all the grid points. Then we make predictions using those grid points, weighting each of their predictions by its posterior probability.
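Putting the whole recipe together, here's a minimal sketch for an even smaller model than the lecture's six-parameter net: a single linear neuron with one weight and one bias, so the grid stays tiny. The Gaussian noise model, the prior, and the data are all assumptions for illustration.

```python
import numpy as np
from itertools import product

# Made-up training data for a tiny model y = w*x + b.
x_train = np.array([0.0, 1.0, 2.0])
y_train = np.array([0.1, 1.1, 1.9])

# A few allowed values per parameter; the grid is their cross product.
values = np.linspace(-2.0, 2.0, 9)
grid = list(product(values, values))   # (w, b) pairs: 9**2 grid points

beta = 10.0   # assumed precision of the output noise

log_post = []
for w, b in grid:
    # Probability of the observed outputs at this grid point.
    log_lik = -0.5 * beta * np.sum((y_train - (w * x_train + b)) ** 2)
    # Prior preferring small parameter values (an assumption).
    log_prior = -0.5 * (w ** 2 + b ** 2)
    log_post.append(log_lik + log_prior)

# Renormalize so the posterior over grid points adds up to one.
log_post = np.array(log_post)
post = np.exp(log_post - log_post.max())
post /= post.sum()

# Predict a test output: posterior-weighted average over grid points.
x_test = 3.0
prediction = sum(p * (w * x_test + b) for p, (w, b) in zip(post, grid))
```

Note there's no gradient descent anywhere: every grid point is just evaluated once, and the prediction is the weighted average the lecture describes.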
