In this video, we will cover

several probabilistic graphical models for sequence tagging tasks.

You already know one of them.

So you know Hidden Markov models,

and you briefly know how to train and apply this.

In this video, we'll speak about few

more and we'll apply them to Named Entity Recognition,

which is a good example of sequence tagging tasks.

So, this is a recap for hidden Markov model.

You maybe remember the formula,

and one important thing to tell you is that it is generative model,

which means that it models the joint probabilities of x and y.

And the picture in the bottom of the slide illustrates the formula.

So you see that every arrow is about some particular probability in the formula.

So we have transition probabilities going from one y to the next one,

and we have output probabilities that go to x variables.

Now, another one would be Maximum Entropy Markov model.

Is a super similar.

So, do you see what is now different in the picture?

Only one thing has changed.

You have now the arrows going from x to ys.

Okay? And in the formula,

we can see that now the factorization is a little bit different.

We have the probabilities of current tag given the previous tag,

and the current x.

What is also important is that this model is discriminative,

which means that it models the conditional probability of y given x.

So, it doesn't care to describe how the text can be generated.

It says that the text is observable and we just need to

produce the probabilities for our hidden variables, y.

Now, another important thing to mention is

that you see that it is factorized nicely still.

Right? So you have some separate probabilities,

and you have a product of this probabilities.

Let us look into more details how every probability can be written down.

So, in this model,

every probability looks like that.

Maybe a little bit scaring but let us see what is here.

So, we have some exponents and we have some normalization constants.

So, this is actually just soft marks.

This is just soft marks applied to some brackets.

What do we see in the brackets?

We have there something linear.

We have weights multiplied by features.

So you can think about a vector of weights and the vector of features,

and you have a dot product.

Probably you have a feeling that you have

already seen something similar just in the machine learning.

So, do you remember a similar model?

Well, actually logistic regression is super similar to maximum entropy Markov model.

There, you also have a soft max applied to dot product of features and weights.

The only difference is that here,

you have rather complicated features that can depend on the next and the previous states.

So, the model knows about the sequence,

and it knows that the tags are not just separate.

Actually this feature is

a very interesting question because it is our job to generate these features somehow,

so we will get back to this question in a few slides.

But now, let us write down

one more probabilistic graphical model that can be even more powerful than that one.

This model is called conditional random field.

First, you can see that it is still discriminative model.

So it is the probability of y given x.

Now you can see that it is actually similar to the previous one, for example,

in the brackets, you'll still have these dot product of weights and features.

But what is the difference?

Do you see an important difference between CRF and maximum entropy Markov model?

The thing is that now you have only one normalization

constrain that goes outside of the product.

So you don't have any probabilities inside at all.

So, the model is not factorized into probabilities.

Instead, we have some product of some energy functions,

and then we normalize all of them to get the final probability.

And this normalization is actually complicated because,

well, we have many different sequences,

and we have to normalize in such a way that these

probability sums to one over all possible sequences of tags.

Now, when we depict this model with the graph,

it would be undirected graph.

So, I don't have any arrows at all.

I have just some links between the nodes.

And actually in this picture,

I write down a more specific case than the one in the top of the slide.

So here, in the top of the slide,

you can see that your features can depend on three things.

And here I kept them only for two things.

So I have one type of features about

transitions and another type of features about outputs.

Obviously, I could have something more complicated.

So, a general form of conditional random field would look like that.

So you have some arbitrary factors that depend

on some groups of y variables and x variables.

In the picture, these small green boxes would stand for different factors,

that you multiply them and then you normalize them and you get your probabilities.

So this is rather general model,

and maybe you are already lost with all those options.

So, if this is the case,

I have good news for you.

Probably, you will not need to implement this model

yourself because there are lots of black-box implementations for CRF model.

So, this is just some links to check out.

For example, Mallet is a nice library that has an implementation for CRF,

but what we have to do is we have to generate features to feed them into CRF models.

So, in the rest of the video,

we'll discuss how to generate those features.

From the formulas, you might remember that those "f" features can depend on three things;

the current tag, the previous tag,

and the current output.

Now, not to be overwhelmed with the variety of your features,

there is a very nice common technique which is called label observation features.

So, it says that you are only allowed to have these kind of features.

The observation part is about something that depends on the output.

So, we will go to this part,

and the green part,

the labeled part, is about indicators.

So you just check whether you have the current label equal to y,

and you check it for all possible labels.

Okay? So, it means that you have as many features as many labels you

have multiplied by the number of different observation functions that you invent.

And in the case of the second and the third line,

you will have even more features because there,

you check these indicators for the current and for the previous tags.

So, we are going to have lots of them.

Now, how those observation parts will look like.

This is just some example taken from the paper,

and it says that you can be as creative as you want.

So, first, you can check that your current word is equal to some predefined word.

And you can check it for all the words in the vocabulary.

So again, you will have let's say plus 100,000 features just by the first line.

Then, you may want to check your part-of-speech tag for

the current word defined by some extrinsic part-of-speech tager,

and you will have again many features,

many binary features here that tell you whether your tag is equal to

noun or whether it is equal to a verb and so on and so on for all possible tags.

And you can have lots of other ways to define your features.

For example, you can check whether your word is capitalized,

or whether it has a dash in the middle,

or whether it consists only from capital letters,

or whether it appears in some predefined list of

stop words or predefined list of some names.

So, actually this is how your work would look like if

you decided to use CRF for some sequence tagging task.

You would take some implementation and you would generate

as many useful features as you can think about.

Now, one trick that I want to mention here is that even

though we say that the feature can depend only on the current output,

x_t, well, it is not like that.

So, honestly, we can put into this x_t everything that we want.

For example, we can say that our current x_t consists of the current word,

the previous word and the next word,

and with have features for all of them.

So, you should multiply those tremendous number of

features by three right now. And it is okay.

So, the model will not break down just because it is discriminative model.

So, we do not care about modeling the sequence of

x which means that we can do whatever we want basically.

So, this picture depicts this idea and it says that

every feature can actually have access to all words in our sentence.

It can be a little bit messy.

So, just be prepared that it happens and people think that this is okay and this is nice,

and this is how it works actually.

So, just to sum up what we have discussed.

Well, we have discussed different graphical models for sequence tagging.

We have discussed one of them,

Hidden Markov model with more details.

So, we have seen how to train this model and how to do decoding in this model.

And in practice, usually you will need just to do feature generation,

and use some implementation of one of this model.

So, this is all for probabilistic approach for sequence tagging,

but we will also discuss neural networks for this task later.