the probability of getting heads given theta, times the probability of getting tails given theta, times the probability of the second tail, times the probability of heads, times another probability of heads. And the probability of the first head given theta is theta; one minus theta for the tail; one minus theta; theta; theta. Or theta to the power of three times one minus theta to the power of two, and that is exactly the function that is drawn over here.
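That likelihood is easy to evaluate directly. A minimal sketch for the observed sequence of three heads and two tails (the function name is mine, not from the lecture):

```python
def likelihood(theta):
    """Likelihood of the observed sequence (3 heads, 2 tails)
    for a coin with P(heads) = theta."""
    return theta**3 * (1 - theta)**2

# Evaluating at a few points shows the curve peaking near theta = 0.6:
print(likelihood(0.5))  # 0.03125
print(likelihood(0.6))  # ~0.0346
print(likelihood(0.7))  # ~0.0309
```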

Now, if we're looking for the theta that predicts D, well, we just defined that as the theta that maximizes this function. If you draw a line from this maximum down to the bottom, you can see that this function is maximized at a point near 0.6, which, not surprisingly, is the same as the three heads that we saw over the five total tosses. But generalizing that, let's assume that

we have observed, in this context, MH heads and MT tails, and we want to find the theta that maximizes the likelihood. Just as in the simple example that we had, this likelihood function is going to have theta appearing MH times and one minus theta appearing MT times.

And so, that's going to give us a likelihood function that looks just like the likelihood function that we saw on the previous slide.
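That generalized likelihood can be checked numerically: brute-force search over a grid of theta values recovers the fraction of heads as the maximizer. A sketch; the helper names are mine:

```python
def lik(theta, m_h, m_t):
    # Generalized likelihood: theta appears m_h times, (1 - theta) m_t times.
    return theta**m_h * (1 - theta)**m_t

def argmax_on_grid(m_h, m_t, steps=10_000):
    # Brute-force search over a fine grid of theta values in (0, 1).
    grid = (i / steps for i in range(1, steps))
    return max(grid, key=lambda t: lik(t, m_h, m_t))

# With 3 heads and 2 tails, the maximizer is 3/5:
print(argmax_on_grid(3, 2))  # 0.6
```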

And if we think about how we can go about maximizing such a function, then usually we take the following steps. First, it's convenient to think about not the likelihood but rather what's called the log-likelihood, denoted by a small l, which is simply the log of this expression over here, and that has the benefit of turning a product into a summation. So that gives us a simpler optimization objective, but one that has the exact same maximum.
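Since the log is monotonic, the sum MH log theta + MT log(1 - theta) peaks at the same theta as the product. A quick sketch confirming that (helper names are mine):

```python
from math import log

def log_likelihood(theta, m_h, m_t):
    # A sum replaces the product theta**m_h * (1 - theta)**m_t.
    return m_h * log(theta) + m_t * log(1 - theta)

# Same maximizer as the likelihood itself, here with 3 heads and 2 tails:
grid = [i / 1000 for i in range(1, 1000)]
best = max(grid, key=lambda t: log_likelihood(t, 3, 2))
print(best)  # 0.6
```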

And we can now go ahead and apply the standard way of maximizing a function like this, which is differentiating the log-likelihood and solving for theta.
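Written out, that differentiation step goes:

```latex
\ell(\theta) = M_H \log\theta + M_T \log(1-\theta), \qquad
\frac{d\ell}{d\theta} = \frac{M_H}{\theta} - \frac{M_T}{1-\theta} = 0
\;\Longrightarrow\; M_H(1-\theta) = M_T\,\theta
\;\Longrightarrow\; \hat\theta = \frac{M_H}{M_H + M_T}.
```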

And that's going to give us an optimum which is exactly as we would expect: it's the fraction of heads among the total number of coin tosses. That's the maximum of this log-likelihood function, and therefore of the likelihood as well. Now, an important notion in the context of maximum likelihood estimation, one that is also important when we develop it further, is the notion of a sufficient statistic.

So, when we computed theta in the coin toss example, we defined the likelihood

function as an expression that has this form.
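The expression in question is theta to the MH times one minus theta to the MT. A small sketch illustrating the point made next, that any ordering of the same counts yields the same likelihood (function and variable names are mine):

```python
from math import prod

def seq_likelihood(theta, seq):
    # Product of per-toss probabilities: theta for 'H', (1 - theta) for 'T'.
    return prod(theta if s == "H" else 1 - theta for s in seq)

# Two different orderings with the same counts (3 heads, 2 tails)
# give the same likelihood:
a = seq_likelihood(0.6, "HTTHH")
b = seq_likelihood(0.6, "HHHTT")
# Both equal 0.6**3 * 0.4**2 (up to floating-point rounding).
```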

And notice this expression didn't care about the order in which the heads and

the tails came up. It only cared about the number of heads

and the number of tails. And that was sufficient in order to define the likelihood function, and therefore sufficient for maximizing the likelihood

function. And so, in this case, MH and MT are what's known as sufficient statistics for this particular estimation problem, because they suffice in order to understand the likelihood function and therefore to optimize it. So, more generally, when is a function of the data set a sufficient statistic? It's a function from instances to a