Hi! In this lesson we are going to talk about code profiling. Code profiling is one of the most basic, yet most important, steps in the hardware design flow. Profiling is helpful in multiple ways. First, most of the time the code we want to accelerate does not fit completely in the hardware accelerator, even once it is fully optimized. Second, even when it does fit, it may not be worthwhile to accelerate every part of the code, as some portions may not benefit from an FPGA implementation. Finally, profiling tells us which parts of the algorithm, if any, are the most computationally intensive.

In the previous lessons, we have seen that the Smith-Waterman algorithm is mainly composed of two parts: the first function takes as input two strings, called the query and the database, and outputs two matrices that are needed by the subsequent step. The second step of the algorithm takes the directionMatrix as input and outputs how the query aligns to the target database. If you do not remember how the algorithm works, I suggest you take a look at the lesson on the Smith-Waterman algorithm.

Now that we have recalled the structure of the Smith-Waterman, let's see how we can profile a naive version of this algorithm. For this example, we use Valgrind to profile our application. Valgrind is an instrumentation framework that, paired with its Callgrind tool, can be used as a call-graph-generating cache and branch-prediction profiler. In order to profile our application, I have already tuned some parameters, such as the scoring system, inside the source code. For the first profiling run, I specify a query of 256 randomly generated characters and a database composed of more than a million characters. One thing to notice here is that it is very important to use representative datasets when profiling the application; otherwise we may obtain results that are not meaningful!

Now, let's take a look at how to launch Valgrind to profile the application. First of all, let's compile our smith_waterman code using gcc, the C compiler. Awesome! We have our executable. Now we can call Valgrind, specifying that we want to use Callgrind as a tool. Let's specify the executable and its parameters, and press Enter to start the profiling. In the example, N is used to specify the query size, while M is used to specify the database size. The profiler can take some time to benchmark our application, depending on the complexity of the algorithm and on the size and type of the dataset we provide as input. Once the profiling is complete, the tool produces, in the current folder, a file called callgrind.out followed by a number (the process ID), which differs at each invocation of the tool.

Now that we have this file, we can launch KCacheGrind to interpret the result and analyze the call graph. In the bottom-right part of the screen, we can see the call graph of the execution of the program. Let's zoom into the call graph in order to analyze it better. The call graph is a directed graph in which each block represents a function of the program, and an arrow from block A to block B indicates that function A calls function B. In the example, the function "compute_smith_waterman" is called by the main function. The percentages in the chart represent the fraction of time spent executing the function itself plus the functions it calls. In your opinion, what is the most compute-intensive function called by compute_smith_waterman?
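While you think about it, here is a quick recap of the commands we just ran. This is a minimal sketch: the source file name (smith_waterman.c), the compiler flags, and the exact order of the command-line arguments are assumptions, since the lesson does not show the precise invocation on screen.

```sh
# Compile the naive Smith-Waterman implementation (file name assumed)
gcc -O2 -o smith_waterman smith_waterman.c

# Run the executable under Valgrind with the Callgrind tool.
# Here N = 256 (query size) and M is a database of over a million
# characters; the argument order is an assumption for this sketch.
valgrind --tool=callgrind ./smith_waterman 256 1048576

# Callgrind writes callgrind.out.<pid> in the current folder;
# open it with KCacheGrind to inspect the call graph.
kcachegrind callgrind.out.<pid>
```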
This one was easy: it is the compute_matrices function. In fact, it takes 83.78% of the total execution time of the program. Notice that compute_matrices takes almost the same percentage of time as its caller, compute_smith_waterman. Wait a second, there is something strange here: where is the function performing the second step of the Smith-Waterman algorithm? To understand a little more about what happened, we need to go back to the tool and focus on the left part of the screen, where all the functions are listed. Here we can find the trace_back function. In this panel, the profiler provides us with several metrics. In particular, it gives the user the cost of the function itself, or Self Cost; the cost of the function including all the functions it calls, or Inclusive Cost; and the number of times the function is called during program execution. The reason why the trace_back function is not included in the call graph on the right is that its cost is less than 1% of the total execution.

Let's now try to profile our application using a smaller dataset, setting the query size to 128 characters and the database size to 1024. Profiling with this dataset, we notice that the cost of the trace_back function increases to 0.40%, suggesting that its impact is greater with smaller datasets. As a final confirmation, we can profile our application again, keeping the same query size as the first run and increasing the size of the target database. This final profiling confirms that compute_matrices is the most compute-intensive part of the algorithm, and shows once again that the impact of the trace_back function is almost zero when we use bigger datasets.

Thanks to this profiling step, we now know that it is not worthwhile to accelerate the second part of the algorithm, as it is fast enough to be executed on a general-purpose CPU. It is very important to understand that not all parts of an algorithm will benefit from hardware acceleration, and understanding how to split the computation will allow us to extract the best performance from our algorithm. In the next steps, we will focus on implementing and optimizing the first function on the hardware accelerator. We will perform a static code analysis on our code in order to identify its operational intensity. Then we will exploit the Berkeley Roofline Model to predict the system performance.
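Before closing, it helps to see why compute_matrices dominates so clearly. Below is a minimal C sketch of what a naive first step of the Smith-Waterman typically looks like; the function signature, matrix layout, scoring values, and direction encoding are illustrative assumptions, not the lesson's actual source code.

```c
/* Minimal sketch of the first Smith-Waterman step: fill the score
 * matrix and the direction matrix for a query of length n and a
 * database of length m. Both matrices are (n+1) x (m+1), row-major.
 * Scoring values and direction codes are illustrative assumptions. */
void compute_matrices(const char *query, int n,
                      const char *database, int m,
                      int *score, char *direction)
{
    const int MATCH = 2, MISMATCH = -1, GAP = -1;

    /* First row and column start at zero (local alignment). */
    for (int i = 0; i <= n; i++) { score[i * (m + 1)] = 0; direction[i * (m + 1)] = 0; }
    for (int j = 0; j <= m; j++) { score[j] = 0; direction[j] = 0; }

    for (int i = 1; i <= n; i++) {
        for (int j = 1; j <= m; j++) {
            int diag = score[(i - 1) * (m + 1) + (j - 1)]
                     + (query[i - 1] == database[j - 1] ? MATCH : MISMATCH);
            int up   = score[(i - 1) * (m + 1) + j] + GAP;
            int left = score[i * (m + 1) + (j - 1)] + GAP;

            int best = 0;  /* Smith-Waterman scores never go below zero */
            char dir = 0;  /* 0 = stop; read later by trace_back        */
            if (diag > best) { best = diag; dir = 1; }  /* 1 = diagonal */
            if (up   > best) { best = up;   dir = 2; }  /* 2 = up       */
            if (left > best) { best = left; dir = 3; }  /* 3 = left     */

            score[i * (m + 1) + j] = best;
            direction[i * (m + 1) + j] = dir;
        }
    }
}
```

The nested loops touch every cell of an (N+1) x (M+1) matrix, so the work grows with the product of the query and database sizes (roughly 268 million cells for a 256-character query against a million-character database), while trace_back only walks a single path back through the direction matrix. That is exactly the behavior the profiler measured, and it is why the first function is the one we will bring to the hardware accelerator.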