Welcome. This week I will teach you about the transformer model. It's a purely attention-based model that I developed with my friends at Google to remedy some problems with RNNs. First, let me tell you what these problems are so you understand why the transformer model is needed. Let's dive in.

First, I'll talk about some problems related to recurrent neural networks using some familiar architectures. After that, I'll show you why pure attention models help to solve those issues. In neural machine translation, you use a neural network architecture to translate from one language to another. In this example, we're going to translate from English to German. Using an RNN, you have to take sequential steps to encode your input: you start from the beginning of your input and make computations at every step until you reach the end. At that point, you decode the information following a similar sequential procedure. As you can see here, you have to go through every word in your input, starting with the first word followed by the second word, one after another, in a sequential manner. In order to start the translation, you first need to finish encoding the input, and the decoding is done in a sequential way too. For that reason, there is not much room for parallel computations here. The more words you have in the input sequence, the more time it will take to process that sentence.

Take a look at a more general sequence-to-sequence architecture. In this case, to propagate information from your first word to the last output, you have to go through T sequential steps, where T is an integer that stands for the number of time-steps that your model will go through to process the inputs of one example sentence. If, let's say for instance, you are inputting a sentence that consists of five words, then the model will take five time-steps to encode the sentence, and in this example T equals five. As you may recall from earlier in the specialization, with large sequences the information tends to get lost within the network, and vanishing gradient problems arise related to the length of your input sequences. LSTMs and GRUs help a little with these problems, but even those architectures stop working well when they try to process very long sequences. To recap, there is a loss of information, and then there's the vanishing gradient problem.

Now, transformers are models specially designed to tackle some of the problems that you just saw. For instance, recall that the conventional encoder-decoder architecture, which is based on RNNs, needs to compute T sequential steps. In contrast, transformers are based on attention and don't require any sequential computation per layer; only one single step is needed. Additionally, the number of steps a gradient needs to travel from the last output to the first input in a transformer is just one, whereas for RNNs the number of steps is equal to T. Finally, transformers don't suffer from vanishing gradient problems related to the length of the sequences.

The transformer differs from the sequence-to-sequence model by using multi-head attention layers instead of recurrent layers. To understand the intuition behind multi-head attention, let's first recall self-attention. Self-attention directly uses queries, keys, and values obtained from every input in the sequence, and it outputs a sequence of the same length, one output for each input. Self-attention can be understood as an attention model that incorporates a dense layer for every input to produce the queries, keys, and values. Now you can think of adding a set of parallel self-attention layers, which are called heads. The outputs of these heads are then concatenated to produce a single output. This model is called multi-head attention, and it emulates the effect of recurrence over the sequence using attention alone.
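To make this more concrete, here is a minimal sketch of self-attention and multi-head attention in plain Python with NumPy. The function names, the toy dimensions, and the random weight matrices are my own illustrative choices, not the exact implementation used in the course; the sketch is only meant to show the idea of queries, keys, values, and parallel heads.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (seq_len, d_head); every position attends to every position
    d_head = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_head)   # (seq_len, seq_len) similarity scores
    weights = softmax(scores, axis=-1)   # each row of weights sums to 1
    return weights @ V                   # weighted sum of values: (seq_len, d_head)

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    # X: (seq_len, d_model); dense projections produce queries, keys, and values
    d_model = X.shape[1]
    d_head = d_model // num_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    heads = []
    for h in range(num_heads):           # parallel self-attention "heads"
        s = slice(h * d_head, (h + 1) * d_head)
        heads.append(scaled_dot_product_attention(Q[:, s], K[:, s], V[:, s]))
    concat = np.concatenate(heads, axis=-1)  # concatenate the heads
    return concat @ W_o                      # final dense layer; same length as the input

# Toy usage: a 5-word "sentence", model dimension 8, 2 heads
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                  # pretend these are word embeddings
W_q, W_k, W_v, W_o = (rng.normal(size=(8, 8)) for _ in range(4))
out = multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads=2)
print(out.shape)  # (5, 8): one output per input position
```

Notice that nothing in this computation depends on processing the words one after another; every position is handled by the same matrix operations at once, which is exactly what lets the transformer parallelize where an RNN cannot.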
Transformers also incorporate a positional encoding stage, which encodes each input's position in the sequence, since word order and position are very important for any language. For instance, let's suppose you want to translate from German the phrase "ich bin glücklich," and I apologize for my pronunciation. To capture the sequential information, the transformer uses a positional encoding to retain the positional information of the input sequence. The positional encoding outputs values that are added to the embeddings. That way, every input word that is given to the model carries some information about its order and position. In this case, the positional encoding vector for each word, ich, bin, and glücklich, will carry information about its respective position. Unlike the recurrent layer, the multi-head attention layer computes the output for each input in the sequence independently, which allows us to parallelize the computation, but on its own it fails to model the sequential information of a given sequence. That is why you need to incorporate the positional encoding stage into the transformer model. I'll include a small sketch of this positional encoding at the end of this section.

In summary, RNNs have some problems that come from their sequential structure. With RNNs, it is hard to fully exploit the advantages of parallel computing. For long sequences, important information might get lost within the network, and vanishing gradient problems arise. Fortunately, recent research has found ways to overcome these shortcomings of RNNs by using transformers. Transformers are a great alternative to RNNs that help solve these problems in NLP and in many fields that process sequence data. Now you understand why RNNs can be slow and can have problems with big contexts. These are the cases where the transformer can help. Next, I will show you concrete examples of tasks where the transformer is used. Let's go to the next video.
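As promised, here is a minimal sketch of the sinusoidal positional encoding described above, following the formula from the original transformer paper. The function name and the toy dimensions are illustrative assumptions, not the course's exact code.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    # Sinusoidal positional encoding from "Attention Is All You Need":
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    positions = np.arange(seq_len)[:, np.newaxis]    # (seq_len, 1)
    dims = np.arange(d_model)[np.newaxis, :]         # (1, d_model)
    angle_rates = 1.0 / np.power(10000, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                 # (seq_len, d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])            # even dimensions: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])            # odd dimensions: cosine
    return pe

# Toy usage: "ich bin glücklich" -> 3 positions, embedding dimension 8
embeddings = np.random.default_rng(0).normal(size=(3, 8))
inputs_with_position = embeddings + positional_encoding(3, 8)
print(inputs_with_position.shape)  # (3, 8): each word now carries positional information
```

Because sines and cosines of different frequencies give each position a distinct pattern of values, adding them to the embeddings lets the model recover word order even though the attention layers themselves are order-agnostic.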