Hi and welcome back! Last time I left you with a Smith-Waterman code. Have you been able to calculate its operational intensity? If not, it is not a problem. Now we will see together how to calculate its operational intensity. Let’s start with some considerations on the code. First of all, we can see that the external loop, iterates for MATRIX_SIZE - N times, that is, N*M - N times. Now, inside the loop, we have some operations that are always executed, and then two conditional statements. The first conditional branch is executed whenever the index i is equal to zero. In the code, this index is zero whenever we are considering the first column and in our particular case, it happens N - 1 times. However, the number of operations that we need to count for our operational intensity in this particular branch is equal to zero, as there are only two assignments to local variables. For what concerns the else branch, by doing some math we can understand that it is executed N-1 times M - 1 times. This because the external loop starts iterating from N and the if is executed N-1 times. To better understand the number of iterations, observe this picture. Our loop is starting from N, hence we are skipping entirely the first line. As the if considers only the first column, and the first element of the column has already been skipped it iterates only N - 1 times, as in this picture. The else takes all the remaining iterations. Observe this new picture. From here it is clear that it iterates N - 1 x M - 1 times. Now, in this else branch, we can see that there is a best and worst case. The worst case happens when the condition of the last if branch is true for each iteration, while the best case is when the condition is always false. I am considering only the last if here, as for each of the previous if branches, the operations involved are just assignments. Now that we have all these information, we can see how to count the operations for this code. Let’s start with Arithmetical operations. We have the index that is incremented each iteration, as well as the two operations on the indexes i and j. Considering the else statement, we can see that there are 4 additions resulting in (NxM - M) x3 + (N - 1) x (M - 1) x4 operations. For arithmetical operations, both best and worst case are the same, as in the final if statement there is no operation of this type. For comparison operations, we can count the condition of the external for, the one of the first if, one for the comparison of the two strings here, and then one comparison operation for each successive if statement that we found in the code, resulting in (MxN - N) x2 + (N - 1) x (M - 1) x5 operations. As before, also for comparison operations best and worst case are the same. Last operations are the indexing one. There is one indexing operation outside the for, to initialize the max_index, then we have three indexing operation here and other two here, to store the results of the matrices. Then, in the worst case we need to add one indexing operation, 0 otherwise. In total we have then 1 + (N-1) x (M -1) x5 in the best case, and 1 + (N-1) x (M -1) x6 in the worst one. Let’s have a look at memory traffic. Our function is reading 1 integer and 2 characters for (N - 1) x (M - 1) times both in the best and worst case, and is writing an integer here, and an integer and a short here for (N - 1) x (M - 1) times. Finally, in the worst case, we need to add another integer for the same number of iterations as before. Putting all these numbers together, we can see as the operational intensity, when our query size N is equal to 256 and our database M is equal to 2048, is equal to 1.668 in the worst case and 1.35 in the best one. That’s great! We have the initial values for the number of operations performed for each byte of information that we are moving! Now we just need to understand what kind of performance to expect from our roofline model. In this picture, you can see two roofline models created for two of the board supported in Xilinx SDAccel. These two boards are the Xilinx Kintex Ultrascale 3 and the Xilinx Virtex 7, both produced by Alpha data. On the x-axis we can see the operational intensity, while on the y-axis there are the performance, by means of Giga Operations Per Second. Note here, that this roofline is created reflecting the fact that we have only integer operations with 32 bits operands. With the operational intensity we obtained in our current code, we can see how we are in the very left part of the chart for both the FPGA boards. This means that our code can achieve at most performance in the order of 16 GOPS, that is a very low performance for our FPGA. Furthermore, the roofline is telling us that we are in the memory bound area, meaning that we are dominated by the memory transactions. In order to increase the performance of our final architecture we need to move to the right part of the chart, and in order to do so we need to reduce the amount of data that we are moving back and forth to our FPGA device. This is not an easy task, and not always it is a viable solution. We will see next how we can tackle this problem. Now we have a performance prediction for the naive implementation of our Smith-Waterman. In the next video, we will see how to run it on SDAccel to perform software and hardware emulation, then we will start optimizing it and we will use the roofline to understand how the optimizations we will perform will impact our performance.