So the gated recurrent unit (GRU) combines the forget gate and the input gate into a single update gate, and it also merges the cell state and the hidden state. Along with some other small changes, the resulting model is simpler than the standard LSTM, and it has been growing increasingly popular. The GRU structure has only two gates: a reset gate and an update gate. The reset gate determines how to combine the new input with the previous hidden state, and the update gate defines how much of the previous hidden state to keep.

So let's look at them more closely, starting with the update gate. One important simplification of the GRU compared to the LSTM is to combine the forget gate and the input gate into one update gate. The update gate helps determine how much of the previous hidden state is kept, and it is just a sigmoid over a linear combination of the current input x_t and the previous hidden state h_{t-1}: z_t = sigmoid(W_z x_t + U_z h_{t-1}).

Then we also have a reset gate, which determines how much of the past information to forget or to remember. Here again it is the same gating mechanism, a sigmoid over a linear combination of the input x_t and the previous hidden state h_{t-1}: r_t = sigmoid(W_r x_t + U_r h_{t-1}).

What new information we want to bring into the hidden state is determined by the candidate state h̃_t, which is a tanh over the current input x_t plus the reset gate element-wise multiplied with the transformed previous hidden state: h̃_t = tanh(W x_t + r_t ⊙ (U h_{t-1})). So that is how the reset gate is used: if the gate is closed, r_t = 0, we forget the entire history from this point onward; if the reset gate r_t = 1, we remember everything; and if it is somewhere between zero and one, we forget a little bit. Keep in mind again that this is an element-wise operation, so different dimensions of the hidden state can adaptively be remembered or forgotten.

The new hidden state is then updated as one minus the update gate, element-wise multiplied with the previous hidden state h_{t-1}, plus the update gate z_t times the new information h̃_t we just calculated on the previous slide: h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t. Now, putting everything together, a GRU can be parameterized with these four equations: z_t the update gate, r_t the reset gate, the new information h̃_t, and the update equation for the hidden state h_t.

Next we will talk about some extensions of RNNs and how to use RNNs in applications. So far we have assumed the sequence is processed from left to right. That makes sense for a lot of time series models: if you want to predict the future, you should use history to make that prediction, not any future information. But if you are analyzing other types of sequential data, like sentences or text, you may want to look both ways. A bidirectional RNN is just a simple combination of two RNNs, one processing the sequence from left to right and one from right to left. We then combine the hidden states of these two RNNs by concatenating them into a new hidden state, and from that we generate the final output prediction. That is the bidirectional RNN; both the GRU cell and this bidirectional wiring are sketched in code below. A slightly more involved application of RNNs is the sequence-to-sequence model.
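Before getting into the sequence-to-sequence details, here is a minimal NumPy sketch of the four GRU equations and of the bidirectional wiring just described, to make the element-wise gating concrete. The parameter names (W_z, U_z, and so on), the init helper, and the toy sizes are illustrative assumptions, not code from the lecture; biases are omitted, as in the equations above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x_t, h_prev, params):
    """One GRU step: the four equations from the slides (biases omitted)."""
    Wz, Uz, Wr, Ur, W, U = (params[k] for k in ("Wz", "Uz", "Wr", "Ur", "W", "U"))
    z = sigmoid(Wz @ x_t + Uz @ h_prev)             # update gate: how much old state to keep
    r = sigmoid(Wr @ x_t + Ur @ h_prev)             # reset gate: how much history feeds the candidate
    h_tilde = np.tanh(W @ x_t + r * (U @ h_prev))   # candidate state, with element-wise reset
    return (1.0 - z) * h_prev + z * h_tilde         # interpolate old state and candidate

def run_gru(xs, params, hidden_size):
    """Run the GRU over a sequence and return all hidden states."""
    h = np.zeros(hidden_size)
    states = []
    for x_t in xs:
        h = gru_cell(x_t, h, params)
        states.append(h)
    return states

def bidirectional_gru(xs, fwd_params, bwd_params, hidden_size):
    """Bidirectional RNN: one pass left-to-right, one right-to-left, concatenate."""
    fwd = run_gru(xs, fwd_params, hidden_size)
    bwd = run_gru(xs[::-1], bwd_params, hidden_size)[::-1]
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]

# Tiny usage with made-up sizes: input dim 3, hidden size 4.
rng = np.random.default_rng(0)
def init(hidden, inp):
    return {"Wz": rng.normal(size=(hidden, inp)), "Uz": rng.normal(size=(hidden, hidden)),
            "Wr": rng.normal(size=(hidden, inp)), "Ur": rng.normal(size=(hidden, hidden)),
            "W":  rng.normal(size=(hidden, inp)), "U":  rng.normal(size=(hidden, hidden))}

xs = [rng.normal(size=3) for _ in range(5)]
states = bidirectional_gru(xs, init(4, 3), init(4, 3), hidden_size=4)
print(states[0].shape)   # (8,) = forward 4 + backward 4
```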
So the idea of the sequence-to-sequence model is, again, to combine two different RNNs: one as an encoder, which encodes the input sequence into some kind of context vector c, and another RNN as the decoder, which takes the context c along with its own inputs to generate the output. Mathematically, the encoder part is just a standard RNN that you can learn: you run it over the input and take the final hidden state h_T, where T is the length of the sequence, and this h_T becomes the context vector, so in effect the whole input sequence is encoded in this one context vector.

What the decoder tries to do is, at every step, predict y_t given everything that has been generated before it, y_1 through y_{t-1}, along with the context vector c. If the decoder is an RNN, then the output at time t is an activation function of the previous decoder hidden state, the previous output y_{t-1}, and the context vector c. That gives us the decoder formulation. As for applications of the sequence-to-sequence model, the most standard one is machine translation: you can imagine the input is a sentence in one language, like Chinese, and the output is the translation in another language, like English.
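To make the encoder-decoder formulation concrete, here is a small greedy sequence-to-sequence sketch that reuses gru_cell, run_gru, init, rng, and xs from the GRU sketch above. The embedding table out_embed, the output projection W_out, the start/end token ids, feeding the previous output's embedding concatenated with c as the decoder input, and initializing the decoder state with c are all illustrative assumptions; this is one common way to realize the formulation, not details given in the lecture.

```python
def encode(xs, enc_params, hidden_size):
    """Encoder: a standard (GRU) RNN; the final hidden state is the context vector c."""
    return run_gru(xs, enc_params, hidden_size)[-1]

def decode(c, dec_params, out_embed, W_out, max_len, start_id=0, end_id=1):
    """Greedy decoder: each step conditions on the previous hidden state,
    the previous output token, and the context vector c."""
    h = c.copy()                    # simple choice: start the decoder state from c
    y_prev = start_id
    outputs = []
    for _ in range(max_len):
        x_t = np.concatenate([out_embed[y_prev], c])   # previous output embedding + context
        h = gru_cell(x_t, h, dec_params)
        y_prev = int(np.argmax(W_out @ h))             # greedy pick of the next token
        if y_prev == end_id:
            break
        outputs.append(y_prev)
    return outputs

# Tiny usage with made-up sizes: vocabulary of 6 tokens, embedding dim 3, hidden size 4.
vocab, emb_dim, hid = 6, 3, 4
out_embed = rng.normal(size=(vocab, emb_dim))
W_out = rng.normal(size=(vocab, hid))
dec_params = init(hid, emb_dim + hid)    # decoder input is [embedding, c]
c = encode(xs, init(hid, 3), hidden_size=hid)
print(decode(c, dec_params, out_embed, W_out, max_len=10))
```

In practice the output projection would go through a softmax over the vocabulary and everything would be trained end to end; the argmax here just stands in for greedy decoding at inference time.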