One very popular variation of the RNN is called the long short-term memory network, or LSTM. The idea of the LSTM is that, instead of using a very simple recurrent computation, it uses a more sophisticated one, packaged into what is called an LSTM cell. If you think about the standard RNN on the left-hand side, the cell component there is very simple: it just combines the current input x(t) with the previous hidden state h(t-1), takes a linear combination of them, and passes it through a nonlinear activation, for example a tanh function, to get h(t). That is the cell, or the computation that generates the hidden vectors. The LSTM tries to make this cell computation more robust, and better, by proposing the cell shown here, which is more sophisticated; we will go into more detail and explain the idea behind it.

The key idea behind the LSTM is this: it controls what information to remember or to forget through components called gates. Specifically, a gate is just a sigmoid function over a linear combination of the inputs. If the sigmoid is close to one, the gate is open; if the sigmoid is close to zero, the gate is closed. The LSTM uses this gating mechanism to control how information is passed through, remembered, or forgotten over time, and it introduces three different gates: the input gate, the forget gate, and the output gate.

It also introduces the important concept of the cell state. To overcome the vanishing gradient problem we discussed before, which is the main problem for RNNs, the LSTM introduces this new structure called the cell state, which is designed to remember useful information and also to forget some information over time, and it is kept separate from the hidden state.

To figure out how to update the cell state, we need to understand two different gates. One is the forget gate: the forget gate controls how much of the previous cell state should be kept. How do we compute that? Through a sigmoid function over the input, where the input is the previous hidden state h(t-1) and the current input x(t). That is the forget gate.

Then there is the input gate, which determines what new information to add to the cell state, and it has two different parts. One is the input gate itself; in terms of formulation it is exactly the same as the forget gate, just with different parameters, so it controls what information to keep and forget in its own way. You have a sigmoid function over the current input x(t) and the previous hidden state h(t-1), and that becomes the input gate. Then the candidate information you want to keep comes through another activation function, in this case tanh, so you get values between -1 and 1, and again it is a linear combination of the current input and the previous hidden state. That is the new information, c tilde at time t.

Now the cell state can be updated: the forget gate, which we already discussed at the beginning, does an elementwise multiplication (that is what this symbol means) with the previous cell state, plus the input gate doing an elementwise multiplication with c tilde at time t, the new information at this timestep. But keep in mind the cell state is a vector, so all of these are vector operations.
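To make this concrete, here is a short summary of the updates just described, written in the common LSTM notation; the slide may use slightly different symbols, and the weight matrices W and biases b here are the standard convention rather than something quoted from the lecture.

```latex
\begin{aligned}
f_t &= \sigma\big(W_f\,[h_{t-1},\, x_t] + b_f\big) && \text{(forget gate)}\\
i_t &= \sigma\big(W_i\,[h_{t-1},\, x_t] + b_i\big) && \text{(input gate)}\\
\tilde{c}_t &= \tanh\big(W_c\,[h_{t-1},\, x_t] + b_c\big) && \text{(candidate new information)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state update, elementwise)}
\end{aligned}
```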
So you can imagine that some of the elements in this vector will be forgotten more quickly, and some of the elements may be remembered for a longer period of time, right? That is why this vector operation is very important. Similarly, the gates are also vectors, so you can control what information to keep and what information to forget not in a global way, but dimension by dimension, with respect to each element of the cell state vector c(t). So that is the cell state.

Now let's look at how we compute h(t), the hidden state. The hidden state is determined by the output gate, which again has the same structure as the input gate and the forget gate: it takes the current input x(t) and the previous hidden state h(t-1) and passes them through another sigmoid function to get the output gate. Then we do an elementwise multiplication of the output gate with tanh of the current cell state c(t), and that becomes the new hidden vector h(t).

Now let's look at the whole thing put together. It still looks pretty complicated, but the three gates, the forget gate, the input gate, and the output gate, actually work very well together, and we maintain both the cell state and the hidden state. So that's the LSTM.
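If it helps to see the whole cell in one place, below is a minimal NumPy sketch of a single LSTM step, assuming the common parameterization in which each gate applies its own weight matrix and bias to the concatenation of h(t-1) and x(t). The function name lstm_cell_step and the params dictionary are illustrative, not something from the lecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_step(x_t, h_prev, c_prev, params):
    """One LSTM step. params holds weight matrices W_* of shape
    (hidden_dim, hidden_dim + input_dim) and bias vectors b_* of shape
    (hidden_dim,) for the forget, input, candidate, and output paths."""
    # Concatenate the previous hidden state with the current input.
    z = np.concatenate([h_prev, x_t])

    f_t = sigmoid(params["W_f"] @ z + params["b_f"])      # forget gate
    i_t = sigmoid(params["W_i"] @ z + params["b_i"])      # input gate
    c_tilde = np.tanh(params["W_c"] @ z + params["b_c"])  # candidate new information
    c_t = f_t * c_prev + i_t * c_tilde                    # cell state update (elementwise)
    o_t = sigmoid(params["W_o"] @ z + params["b_o"])      # output gate
    h_t = o_t * np.tanh(c_t)                              # new hidden state
    return h_t, c_t

# Example: one step with small random parameters (input_dim=3, hidden_dim=4).
rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4
params = {f"W_{k}": 0.1 * rng.standard_normal((hidden_dim, hidden_dim + input_dim))
          for k in ("f", "i", "c", "o")}
params.update({f"b_{k}": np.zeros(hidden_dim) for k in ("f", "i", "c", "o")})
h, c = lstm_cell_step(rng.standard_normal(input_dim),
                      np.zeros(hidden_dim), np.zeros(hidden_dim), params)
```

Notice that the gate values and the cell state are all vectors of the same size, which is exactly why each dimension can be kept or forgotten independently.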