Hi. In this video,
we will talk about intent classifier and slot tagger in depth.
Let's start with the intent classifier: how can we build one?
You can use any model on top of bag-of-words features with n-grams and TF-IDF weighting,
that is, classical text mining approaches,
or you can use a recurrent architecture with LSTM cells,
GRU cells, or any other recurrent cell.
You can also use convolutional networks,
with the 1D convolutions that we overviewed in week one.
Studies actually show that CNNs can
perform better on datasets where the task is essentially
key phrase recognition, which happens in
some sentiment detection datasets, for example.
So, it makes sense to try an RNN, a CNN,
or any classical approach as a baseline, and choose whatever works best.
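Just to make that classical baseline concrete, here is a minimal sketch, assuming scikit-learn and a couple of made-up training utterances; the exact features and classifier are my choice, not something prescribed here:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Hypothetical toy data: utterances paired with intent labels.
    utterances = ["take me to starbucks", "show me flights from seattle to san diego"]
    intents = ["navigate", "find_flight"]

    # TF-IDF over word unigrams and bigrams feeding a linear classifier.
    baseline = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2)),
        LogisticRegression(max_iter=1000),
    )
    baseline.fit(utterances, intents)
    print(baseline.predict(["show me flights to boston"]))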
Then comes the slot tagger,
and this is a slightly more difficult task.
It can use handcrafted rules like regular expressions,
so that when I say,
for example, "take me to Starbucks,"
you know that whatever comes after the phrase "take me to"
is most likely the destination slot, or some other slot of your intent.
But that approach doesn't scale, because
natural language has huge variation in how we can express the same thing.
So, it makes sense to do something data-driven here.
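To illustrate the rule-based option before we move on, here is a tiny sketch of such a handcrafted rule; the pattern and the slot name are made up for the example:

    import re

    # Handcrafted rule: whatever follows "take me to" is treated as the
    # destination slot. This is exactly the kind of rule that does not scale.
    pattern = re.compile(r"take me to (?P<to_location>.+)", re.IGNORECASE)

    match = pattern.search("Take me to Starbucks")
    if match:
        print(match.group("to_location"))  # prints "Starbucks"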
You can use conditional random fields,
which is a rather classical approach,
or you can use an RNN sequence-to-sequence model
with an encoder and a decoder.
A fun fact is that you can also
use convolutional networks for the sequence-to-sequence task,
and you can add attention to any of these sequence-to-sequence models.
On the next slide, I want to overview
the convolutional sequence-to-sequence model, because it
is gaining popularity: it works
faster and sometimes even beats RNNs on some tasks.
Okay, let's see how convolutional networks can be used to model sequences.
Let's say we have an input sequence which is padding, padding,
then a start-of-sequence token and three German words.
And let's say that what we actually want to do
is solve the task of language modeling.
When we see each new token,
we need to predict which token comes next.
Usually, we use recurrent architectures for this.
But let's see how we can use convolutions.
Let's say that when we generate the next token,
we only care about
the last three tokens of the sequence that we have seen.
If we assume that,
then we can use a convolution to aggregate the information about
the last three tokens, which is the blue triangle here,
and we get some filter outputs.
Let's take half of those filters as they are, and the second half,
we will pass through a sigmoid activation function,
and then take an element-wise multiplication of the two halves.
What we get is a Gated Linear Unit:
the sigmoid gate adds a non-linear part, so the whole unit becomes non-linear.
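As a minimal sketch of that gating, assuming PyTorch and toy shapes of my own choosing (this is not the authors' exact code):

    import torch

    def glu(conv_out):
        # conv_out: (batch, 2 * channels, time); the convolution produces
        # twice as many filters as we want to keep.
        a, b = conv_out.chunk(2, dim=1)   # split the filters into two halves
        return a * torch.sigmoid(b)       # gate one half with the sigmoid of the other

    x = torch.randn(8, 128, 20)           # a toy batch: 8 sequences, 20 time steps
    h = glu(x)                            # shape (8, 64, 20)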
So, this is how we look at
the preceding context and predict some hidden state or, let's say,
the next token, and you can use convolutions for that.
That triangle is actually a convolutional filter, and you can slide
it across the sequence and use the same weights,
the same learned filters,
and it works the same way at every position of the sequence.
So, it is pretty similar to an RNN,
but this way
we don't have a hidden state that we need to update.
We only look at the context that came before
and at some intermediate representations.
But you can see that we look at
only the last three tokens, and that is not very good.
Maybe we need to look at the last 10 tokens or so, because an RNN with LSTM cells
can actually have a very long short-term memory.
Okay. From convolutional neural networks,
we know how to increase the receptive field:
we stack convolutional layers.
Let's stack six layers here with kernel size five,
and that results in a receptive field of 25 elements.
And experiments show that
a receptive field of 25 elements might be enough to model your sequences.
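The arithmetic behind that number, as a quick sketch: each stacked layer with kernel size k adds k - 1 extra tokens of context on top of the single token you start from.

    def receptive_field(num_layers, kernel_size):
        # Every stacked convolutional layer widens the context by kernel_size - 1.
        return 1 + num_layers * (kernel_size - 1)

    print(receptive_field(num_layers=6, kernel_size=5))  # prints 25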
Let's see how CNNs work for sequences.
The authors provided results on a language modeling dataset, WikiText-103,
and you can see that this CNN architecture actually beats the LSTM:
it has lower perplexity,
and it also runs faster.
We will go into that a little bit later.
Another example is a machine translation dataset,
from English to French,
let's say, and there the metric is called
BLEU, and the higher that metric, the better.
You can see that convolutional sequence-to-sequence actually beats the LSTM here as well,
which is pretty surprising.
A good thing about CNNs is the speed benefit.
If you compare them with an RNN,
the problem with the RNN is that it has
a hidden state that changes from step to step,
so we cannot do the calculations in parallel,
because every step depends on the previous one.
We can overcome that with convolutional networks, because during training,
we can process all time steps in parallel:
we apply the same convolutional filters at each time step,
and since the time steps are independent, we can compute them in parallel.
During testing, let's say in the sequence-to-sequence setting,
our encoder can do the same, because there is
no dependence on previous outputs there; we use only our input tokens,
so we can apply the convolutions and get our hidden states in parallel.
One more good thing is that GPUs are highly optimized for
convolutions, so we can get higher throughput
by using convolutions instead of RNNs.
You can see a table here
comparing the model based on LSTM
with the model based on convolutional sequence-to-sequence,
and the convolutional model actually provides
a better score in terms of translation quality,
and it also works 10 times faster.
That is a pretty good thing, because real-world systems like,
let's say, Facebook need to translate
a post whenever you want, and they need to translate it fast.
So, to implement machine translation in a production environment,
a CNN may be a very good choice.
By the way, this paper is by folks from Facebook.
Let's look at one more thing.
You know that when you do a sequence-to-sequence task,
you actually want your encoder to be bi-directional,
so that you look at the sequence both from left to right and from right to left.
The good thing about convolutions is that
you can make the convolutional filters symmetric,
so you can look at the context to the left and to the right at the same time.
So, it is very easy to make a bi-directional encoder with CNNs.
And it still works in parallel:
there is no dependence on a hidden state here,
it just applies all of those multiplications in parallel.
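A small sketch of the difference between the two setups, assuming PyTorch; the padding choices here are mine, just to illustrate the idea:

    import torch
    import torch.nn.functional as F

    x = torch.randn(1, 64, 20)            # (batch, channels, time)
    w = torch.randn(64, 64, 5)            # 64 filters with kernel size 5

    # Causal convolution: pad only on the left, so position t sees tokens <= t.
    causal = F.conv1d(F.pad(x, (4, 0)), w)

    # Symmetric convolution: pad both sides, so position t sees context on
    # its left and right at the same time, which is what the encoder wants.
    symmetric = F.conv1d(x, w, padding=2)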
To move further, let me remind you that
we are reviewing the intent classifier and the slot tagger,
and we need some dataset to use for our overview.
Let's take the ATIS dataset,
the Airline Travel Information System.
It was collected back in the 90s,
and it has roughly 5,000 context-independent utterances, and that is important.
It means that we have one-turn dialogues,
and we don't need a fancy dialogue manager here.
It has 17 intents and 127 slot labels,
like from-location, to-location,
departure time, and so forth.
The utterances look like this:
"show me flights from Seattle to San Diego tomorrow."
The state of the art for this task is the following: 1.7 intent error
and 95.9 slot F1.
So, this is pretty cool.
Another thing is that you can actually learn
your intent classifier and slot tagger jointly.
You don't need to train two separate models;
you can train this supertask,
because it can learn representations that are suitable for both tasks,
and this way, we provide more supervision for
our training and get higher quality as a result.
Let's see how this joint model might work.
It is still a sequence-to-sequence model,
but this time we use,
let's say, a bi-directional encoder,
and we can use the last hidden state
for decoding the slot tags
and, at the same time, for decoding the intent.
If we train this end-to-end on the two tasks,
we can get higher quality.
Notice that in the decoder,
we have the hidden states from the encoder passed in just as they are,
which is called aligned inputs,
and we also have the c-vectors, which are the attention.
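Here is a minimal sketch of that shared-encoder idea, assuming PyTorch; the layer sizes, the GRU encoder, and the per-token slot classifier are simplifications of mine, not the exact model from the slide:

    import torch
    import torch.nn as nn

    class JointNLU(nn.Module):
        def __init__(self, vocab_size, num_intents, num_slots, dim=128):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, dim)
            # One shared bi-directional encoder for both tasks.
            self.encoder = nn.GRU(dim, dim, batch_first=True, bidirectional=True)
            self.intent_head = nn.Linear(2 * dim, num_intents)
            self.slot_head = nn.Linear(2 * dim, num_slots)

        def forward(self, tokens):
            states, _ = self.encoder(self.embed(tokens))      # (batch, time, 2 * dim)
            intent_logits = self.intent_head(states[:, -1])   # last state -> intent
            slot_logits = self.slot_head(states)              # every state -> slot tag
            return intent_logits, slot_logits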
Let's see how attention works in the decoder.
Let's say that we are at time step i,
and we have to output a new decoder hidden state s_i.
That state is actually a function of the previous hidden state, which is in blue,
the previous output, which is in red,
the aligned hidden state from the encoder, and some vector which is the attention.
So how is that attention computed?
The attention vector c_i
is actually a weighted sum of the hidden vectors from the encoder,
and we need to come up with weights for these vectors.
We train the system to learn these weights
in such a way that attention goes
to the encoder vectors it makes sense to attend to.
The coefficient that defines
the weight of a particular encoder vector
is modeled by a feed-forward network that takes our previous decoder hidden state
and the states from the encoder,
and it needs to figure out whether we need that encoder state or not.
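Sketching that computation in code, under the assumption of a generic additive attention (my reconstruction, not necessarily the exact parameterisation from the paper):

    import torch
    import torch.nn as nn

    dim = 128
    # Small feed-forward network that scores one encoder state against the
    # previous decoder state.
    score = nn.Sequential(nn.Linear(2 * dim, dim), nn.Tanh(), nn.Linear(dim, 1))

    def attend(prev_decoder_state, encoder_states):
        # prev_decoder_state: (batch, dim); encoder_states: (batch, time, dim)
        batch, time, _ = encoder_states.shape
        query = prev_decoder_state.unsqueeze(1).expand(-1, time, -1)
        energies = score(torch.cat([query, encoder_states], dim=-1)).squeeze(-1)
        weights = torch.softmax(energies, dim=-1)              # one weight per encoder state
        context = (weights.unsqueeze(-1) * encoder_states).sum(dim=1)  # weighted sum c_i
        return context, weights

    c, a = attend(torch.randn(4, dim), torch.randn(4, 10, dim))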
You can also see an example of
the attention distribution when we predict the label for the last word:
when we predict a label like departure time,
our model looks at phrases like
"from city," or "city name," or something like that.
Okay. We can also see how our two losses decrease during training;
during training, we use two losses and take their sum.
The green loss here is for the intent,
and the blue one is for the slots.
You can see that the intent loss saturates and stops changing, but the blue
slot curve continues to decrease, and so
our model continues to train, because slot tagging is a harder task than intent classification.
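In code, that summed objective would look roughly like this; the shapes below just mirror the ATIS numbers and are otherwise made up:

    import torch
    import torch.nn.functional as F

    batch, time, num_intents, num_slots = 4, 12, 17, 127
    intent_logits = torch.randn(batch, num_intents)
    slot_logits = torch.randn(batch, time, num_slots)
    intent_labels = torch.randint(num_intents, (batch,))
    slot_labels = torch.randint(num_slots, (batch, time))

    # The joint objective is simply the sum of the two cross-entropy losses,
    # which are the two curves on the slide.
    intent_loss = F.cross_entropy(intent_logits, intent_labels)
    slot_loss = F.cross_entropy(slot_logits.transpose(1, 2), slot_labels)
    loss = intent_loss + slot_loss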
Okay. Let's look at the joint training results on the ATIS dataset.
If we train slot filling independently,
we get a slot F1 of 95.7,
and if we train our intent detection,
our classifier, independently, we get an intent error of about two percent,
but if we train those two tasks jointly, using the architecture that we have overviewed,
we can actually get a higher slot F1 and a lower intent error.
Another good thing is that
this joint model works faster if you use it on a mobile phone
or any other embedded system, because you have
only one encoder and you reuse its representations for the two tasks.
Okay. Let's summarize what we have overviewed.
We have looked at different options for the intent classifier and slot tagger:
you can start from classical approaches and go all the way to deep approaches.
People have started to use CNNs for sequence modeling
and sometimes get better results than with RNNs,
which is a pretty surprising fact.
You can also use joint training, and it can be beneficial
in terms of speed and performance for your slot tagger and intent classifier.
In the next video,
we will take a look at context utilization in our NLU,
our intent classifier and slot tagger.