One of the coolest things about FPGAs is the possibility of performing a huge number of operations in parallel while keeping a relatively low power profile. We have already seen how to synthesize an application for a specific FPGA device. Now let's see how we can exploit the full capabilities of our FPGA. To do so, we need to analyze the algorithm we want to accelerate: we need to understand where the parallelism of our application lies, and write the code so that we exploit it to increase performance. Let's try to understand where the parallelism is in the Smith-Waterman algorithm. We know that the most compute-intensive part of the algorithm is the calculation of the similarity and traceback matrices, so let's try to understand whether this can be parallelized. We start by looking at the dependencies among the operations in the code. Here is a representation of the similarity matrix computation, where the first row is an example query and the first column is an example database. In this picture we do not show the first row and column, as they would be all zeros. By looking at the code, we can see that each element, such as the one in the red box, depends on three other elements: in the picture, the three yellow boxes called north, north-west and west. To calculate the element in the red box we in fact need these three values. If we think a little more about this dependency, we can see that each element lying on an anti-diagonal is dependency-free with respect to all the other values on the same anti-diagonal. However, to calculate the elements of an anti-diagonal, the elements of the two previous anti-diagonals must already have been computed. This means that, once the two previous anti-diagonals are available, we can compute all the elements of an anti-diagonal in parallel. 
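As a software sketch of this observation, the snippet below fills the similarity matrix by sweeping anti-diagonals instead of rows: every cell with the same i + j index depends only on cells of the two previous anti-diagonals, so the inner loop over one anti-diagonal is fully parallelizable. The scoring constants (MATCH, MISMATCH, GAP) are illustrative assumptions, not values taken from the original code.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical scoring scheme, chosen for illustration only.
constexpr int MATCH = 2, MISMATCH = -1, GAP = -1;

// Fill the similarity matrix by anti-diagonals: all cells with the same
// i + j are independent of each other, so the inner loop could run in
// parallel once the two previous anti-diagonals have been computed.
std::vector<std::vector<int>> fill_by_antidiagonals(const std::string& query,
                                                    const std::string& db) {
    const int N = query.size(), M = db.size();
    std::vector<std::vector<int>> H(N + 1, std::vector<int>(M + 1, 0));
    for (int d = 2; d <= N + M; ++d) {               // anti-diagonal index i + j
        for (int i = std::max(1, d - M); i <= std::min(N, d - 1); ++i) {
            int j = d - i;
            // north-west, north and west dependencies
            int diag = H[i - 1][j - 1] + (query[i - 1] == db[j - 1] ? MATCH : MISMATCH);
            int up   = H[i - 1][j] + GAP;
            int left = H[i][j - 1] + GAP;
            H[i][j] = std::max({0, diag, up, left});
        }
    }
    return H;
}
```

On an FPGA the inner loop would be unrolled into hardware; here the sweep order only demonstrates that the dependency structure allows it.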
On the other hand, the values computed in the current anti-diagonal will then be used to evaluate the two successive anti-diagonals, and so on. We can exploit this parallelism along the anti-diagonal by implementing what in hardware is called a systolic array. In a systolic array we have multiple, tightly coupled processing elements; each processing element is responsible for computing a partial result based on the data it receives from its upstream neighbours, and then sends its result to the next processing element. This picture provides a graphical view of a systolic array composed of three processing elements. To maximize the parallelism along the anti-diagonal, we should implement a systolic array of N processing elements, with N the size of the query sequence. From the picture it is clear that each processing element takes as input one character of the query, one character of the database, and the three dependency values described before. The logic to calculate the output value is the same for each processing element, and in our code it corresponds to the computation of the new elements of the similarity and traceback matrices. Finally, the similarity-matrix output of a processing element needs to be routed to the next processing element as the new west value, and fed back to the same processing element as the new north value, while the value that was the north dependency becomes, in the next iteration, the new north-west value. With an architecture like the one described so far, we can compute all the elements of an anti-diagonal in parallel, producing at each step N new elements to be stored in the similarity matrix and N for the traceback matrix. Great! Now we have an idea of an architecture we can use to accelerate the compute_matrices function. Our main goal now is to compute one anti-diagonal per clock cycle. 
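The per-element logic just described can be sketched as a single function: one processing element receives a query character, a database character and the three dependency values, and produces the new similarity value plus a traceback direction. The struct name, direction encoding and scoring parameters are assumptions made for this sketch, not taken from the original code.

```cpp
#include <algorithm>

// Sketch of one processing element (PE) of the systolic array.
// Direction encoding (an assumption): 0 = stop, 1 = north-west, 2 = north, 3 = west.
struct PEOutput {
    int score;      // new similarity value: forwarded to the next PE as its
                    // "west" input and fed back to this PE as the new "north"
    int direction;  // entry written into the traceback matrix
};

PEOutput processing_element(char q, char d,
                            int north, int northwest, int west,
                            int match, int mismatch, int gap) {
    int diag = northwest + (q == d ? match : mismatch);
    int up   = north + gap;
    int left = west + gap;
    int best = std::max({0, diag, up, left});
    int dir  = (best == 0) ? 0 : (best == diag) ? 1 : (best == up) ? 2 : 3;
    return {best, dir};
}
```

In an HLS design, N copies of this function would be instantiated and chained, with the score register of each PE wired to its neighbour as described above.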
In this way, we would be able to compute the entire similarity and traceback matrices in just (N + M - 1) clock cycles, where N is the size of the query and M the size of the database. All the considerations we have made so far are great; however, we need to remember that the roofline model told us that our implementation was memory bound. Basically, this means that the amount of data we move between the DDR and the FPGA is too high relative to the number of operations performed. So... we need to think of a way to reduce the amount of data transferred and to increase the number of operations performed on the FPGA. If we think more carefully about what we have said so far, we realize that we do not actually need the similarity matrix to identify the optimal local alignment. In fact, the traceback step starts from the index of the maximum element of the similarity matrix and follows the directions stored in the traceback matrix. This is very valuable information, because it tells us that, each time we calculate an anti-diagonal, we only need the two previous anti-diagonals of the similarity matrix; once the anti-diagonal has been computed and the dependency values updated, we no longer need the older values of the similarity matrix. This is very important because, as we are performing only the computation of the matrices on the FPGA, we need to send the intermediate results of the computation back to our host. However, since we do not need the similarity matrix, we can considerably reduce the memory traffic! As an example, look at this picture. Here we have three processing elements, one for each query character, and it is clear that to compute the current diagonal (the one in darkest red) we just need the two previous diagonals, as they contain the dependency values. This picture also clarifies another point. 
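The idea of keeping only the two previous anti-diagonals can be sketched as follows: the similarity matrix is never materialized, only two rolling buffers (indexed by query row) survive from one anti-diagonal to the next, and only the traceback matrix is produced for the host. Buffer names and the scoring constants are assumptions of this sketch.

```cpp
#include <algorithm>
#include <string>
#include <vector>

constexpr int MATCH = 2, MISMATCH = -1, GAP = -1;  // assumed scoring scheme

// Compute only the traceback matrix, keeping just two rolling buffers of the
// similarity matrix: prev1 holds anti-diagonal d-1, prev2 holds d-2.
// Traceback encoding (assumed): 0 = stop, 1 = north-west, 2 = north, 3 = west.
std::vector<std::vector<int>> traceback_only(const std::string& query,
                                             const std::string& db) {
    const int N = query.size(), M = db.size();
    std::vector<std::vector<int>> T(N + 1, std::vector<int>(M + 1, 0));
    std::vector<int> prev2(N + 1, 0), prev1(N + 1, 0), curr(N + 1, 0);
    for (int d = 2; d <= N + M; ++d) {
        std::fill(curr.begin(), curr.end(), 0);
        for (int i = std::max(1, d - M); i <= std::min(N, d - 1); ++i) {
            int j = d - i;
            int diag = prev2[i - 1] + (query[i - 1] == db[j - 1] ? MATCH : MISMATCH);
            int up   = prev1[i - 1] + GAP;  // north: row i-1 on diagonal d-1
            int left = prev1[i] + GAP;      // west:  row i   on diagonal d-1
            int best = std::max({0, diag, up, left});
            curr[i] = best;
            T[i][j] = (best == 0) ? 0 : (best == diag) ? 1 : (best == up) ? 2 : 3;
        }
        std::swap(prev2, prev1);  // diagonal d-1 becomes d-2 ...
        std::swap(prev1, curr);   // ... and the current diagonal becomes d-1
    }
    return T;
}
```

The memory saving is the point: the similarity values occupy O(N) storage instead of O(N x M), and only the traceback matrix crosses the DDR interface.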
When we create the systolic array, we fix the number of processing elements to the size of the query. However, the number of elements that need to be computed when we work by anti-diagonals depends on which anti-diagonal we are computing. In particular, at the beginning we have just one element to compute, then two, and so on until we reach N, the size of the query. The number of elements then remains N until it starts decreasing in the lower-right corner of the matrix. Obviously, we need to account for this in our hardware architecture, as we have to determine how many elements to compute at each iteration. If we add code to compute the number of elements at each iteration, we may end up adding too much control logic to our design. This is something it is preferable to avoid, as it may slow down the computation on the FPGA. A very straightforward solution to this issue is to buffer out the corners. This simply means that at each iteration we always compute the same number of elements, and then we account for the elements that do not contribute to the final result in our host application. Now that we have all this information, we can design our architecture as a systolic array of processing elements. For the sake of this example, we will reduce the complexity of the problem by removing the calculation of the max score: we will return just the traceback matrix to the host.
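One simple way to realize this corner buffering, sketched below under the assumption that a padding symbol can never match a query character, is to pad the database with N - 1 dummy symbols on each side: every iteration then feeds exactly N processing elements, and the host simply discards the results produced by the padding. The function name and padding character are illustrative choices, not from the original code.

```cpp
#include <string>

// "Buffer out the corners": pad the database with N - 1 dummy symbols on each
// side so that the systolic array of N processing elements always computes a
// full anti-diagonal. The host discards the entries produced by the padding.
std::string pad_database(const std::string& db, int N, char pad = '-') {
    const std::string corner(N - 1, pad);
    return corner + db + corner;
}
```

With this padding, the control logic no longer has to track how many elements each anti-diagonal contains; the cost is a few extra, discarded computations at the two corners.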