Good to see you again. When running large deep models, you'll often run out of memory, because each layer's activations have to be kept around for a long time for the backward pass. I'll show you how this can be solved using reversible layers, so let's dive in.

The transformer network proceeds by repeatedly adding residuals to the hidden states. To run it in reverse, you can subtract the residuals in the opposite order, starting with the outputs of the model. But in order to save the memory otherwise used to store the residuals, you need to be able to recompute them quickly instead, and this is where reversible residual connections come in. The key idea is that you start with two copies of the model inputs, then at each layer you only update one of them. The activations that you don't update are the ones used to compute the residuals. With this configuration, you can now run the network in reverse. Layer 1 is attention and layer 2 is feedforward. The activations in the model are now twice as big, but you don't have to worry about caching them for the backward pass.

The standard transformer equations give Y_a equals x plus attention of x, and Y_b equals Y_a plus feedforward of Y_a. This is the normal residual connection. In the reversible case, you instead have Y_1 equals x_1 plus attention of x_2, and Y_2 equals x_2 plus feedforward of Y_1. Then, to save the memory, you can reconstruct the inputs x_1 and x_2 as follows: x_1 equals Y_1 minus attention of x_2, and x_2 equals Y_2 minus feedforward of Y_1. Feel free to pause here for a moment to make sure you understand what's happening. But basically, you take one side of the network and add the feedforward of the other side to compute Y_2, and similarly, you take one side and add the attention of the other side to compute Y_1. By doing it this way, you can use the last two formulas to recompute x_1 and x_2.

Let me show you how these new formulas fit into the reversible layers illustration I've already shown you. To do so, I'm going to rotate the diagram onto its side, something like this, so that information flows from left to right in a forward pass. Restating the equations I showed you earlier, the first step is to calculate Y_1 equals x_1 plus the attention of x_2. Then, after you've done this, the second step is to calculate Y_2 with the second formula, which has a dependency on Y_1 and requires a few extra parts of the illustration that weren't used in the first step: Y_2 equals x_2 plus feedforward of Y_1 from the first step, and that's it. But notice that Y_1 dependency; the key takeaway is to recognize the two parts of the operation and how the second builds on the first. First you find Y_1, then you use that to find Y_2. That's a forward pass for a reversible residual block. It combines the standard attention and feedforward residual layers from a regular transformer into a single reversible residual block, and nothing needs to be saved in memory except the Y_1 and Y_2 of the output layer, instead of the activations for every individual layer. Memory saved.
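To make the forward pass concrete, here is a minimal sketch in Python. It's only an illustration under simple assumptions: the functions attention and feedforward, the helper reversible_forward, and the tiny random weight matrices W_attn and W_ff are toy stand-ins I've made up for this example, not the actual sublayers or implementation used in the course.

```python
import numpy as np

# Toy stand-ins for the attention and feedforward sublayers. Any
# deterministic functions of the right shape are enough to illustrate
# how one reversible residual block works.
rng = np.random.default_rng(0)
d_model = 4
W_attn = rng.normal(size=(d_model, d_model))
W_ff = rng.normal(size=(d_model, d_model))

def attention(x):
    return np.tanh(x @ W_attn)      # placeholder attention sublayer

def feedforward(x):
    return np.tanh(x @ W_ff)        # placeholder feedforward sublayer

def reversible_forward(x1, x2):
    """Forward pass of one reversible residual block."""
    y1 = x1 + attention(x2)         # Y_1 = x_1 + Attention(x_2)
    y2 = x2 + feedforward(y1)       # Y_2 = x_2 + FeedForward(Y_1)
    return y1, y2

# Start with two copies of the model inputs, as described above.
x = rng.normal(size=(3, d_model))   # (sequence length, d_model)
y1, y2 = reversible_forward(x.copy(), x.copy())
```

Notice how the code mirrors the two-step structure: Y_1 is computed first, and only then can Y_2 be computed from it.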
Now, I'll show you how to recompute x_1 and x_2 from Y_1 and Y_2 for a backward pass. The first thing to notice is that I'm going to calculate x_2 before x_1, for a reason I'll explain soon. First, I'll reverse some of the arrows in the illustration; the reversed direction indicates that information is now flowing backwards. Notice I also changed the plus signs in the orange circles to minuses, to indicate that subtraction will now be taking place.

The first step is to calculate x_2 equals Y_2 minus feedforward of Y_1, and great work, you just calculated an input, x_2, from the two outputs of the forward pass. The second step is to calculate x_1. The formula for x_1 has a dependency on the x_2 you just calculated, similar to the Y_2 dependency on Y_1 in the forward pass formulas above: x_1 equals Y_1 minus the attention of the x_2 that you already calculated in the first step. You now know how to compute a backward pass for a residual layer in a transformer model without needing to save memory-hungry activations in the forward pass. I showed you how to do this using reversible residual blocks. You'll be seeing this again in the assignment.

Comparing the performance of regular and reversible transformers on machine translation, you find roughly the same BLEU scores, and language modeling produces similar results. In fact, reversibility is a very general technique that can be applied anywhere you use a transformer model. While the comparison shown here suggests that the reversible version performs slightly better, that's really because of the hyperparameter tuning that has happened in the three years since the original transformer paper was published. You now understand reversible layers and how they solve the memory issue during training. Next, I'll put them together with LSH attention; this will give us a variant of the transformer that can run on very long sequences. Let's go to the next video.
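Before you move on, if you'd like to convince yourself that the reconstruction really is exact, here's a minimal round-trip sketch that continues the toy example above: it runs one reversible block forward, then in reverse, and checks that the original inputs come back. As before, attention, feedforward, and the helper names are illustrative assumptions, not the course's or the Reformer paper's actual implementation.

```python
import numpy as np

# Same toy stand-ins as in the forward-pass sketch above.
rng = np.random.default_rng(0)
d_model = 4
W_attn = rng.normal(size=(d_model, d_model))
W_ff = rng.normal(size=(d_model, d_model))

def attention(x):
    return np.tanh(x @ W_attn)

def feedforward(x):
    return np.tanh(x @ W_ff)

def reversible_forward(x1, x2):
    y1 = x1 + attention(x2)          # Y_1 = x_1 + Attention(x_2)
    y2 = x2 + feedforward(y1)        # Y_2 = x_2 + FeedForward(Y_1)
    return y1, y2

def reversible_backward(y1, y2):
    # Run the block in reverse: x_2 first, then x_1.
    x2 = y2 - feedforward(y1)        # x_2 = Y_2 - FeedForward(Y_1)
    x1 = y1 - attention(x2)          # x_1 = Y_1 - Attention(x_2)
    return x1, x2

x = rng.normal(size=(3, d_model))
x1, x2 = x.copy(), x.copy()
y1, y2 = reversible_forward(x1, x2)
x1_rec, x2_rec = reversible_backward(y1, y2)
print(np.allclose(x1, x1_rec), np.allclose(x2, x2_rec))  # True True
```

This is why x_2 has to be recovered first: it's the input to the attention term you need in order to recover x_1.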