In the previous class we saw how different executing an application on a processor can be compared to executing it on an FPGA. We observed that, with respect to computational throughput and memory bandwidth, the HLS compiler gets the best out of the FPGA resources through the processes of scheduling, pipelining, and dataflow, but we did not explain what these three processes actually do. It is now time to take a closer look at them and see what they are doing! Generally speaking, as we can also read on Wikipedia, scheduling is the method by which work specified by some means is assigned to the resources that complete the work. In our context, scheduling means identifying the data and control dependencies between different instructions in order to determine the order in which each one will be executed. In the FPGA scenario we are not working with a known and fixed set of computational units: resources can be adapted and configured as needed to implement the desired set. Here, the scheduling process refers to the possibility of parallelizing the software algorithm for a hardware implementation. Within this context we can focus on dependencies between adjacent operations as well as across time, while guaranteeing that the desired computational units will have enough resources available on the FPGA at a specific time. This allows the compiler to group operations, in functions, to execute in the same clock cycle and to set up the hardware so that their executions can overlap. Being able to overlap the executions of a set of operations removes the processor restriction that requires the current function call to fully complete before the next call to the same set of operations can begin. This overlap is exactly the main idea behind the PIPELINING process. 
Pipelining is a performance optimization technique based on overlapping the execution of multiple instructions taken from a sequential execution flow. Pipelining exploits the parallelism among instructions in a sequential instruction stream. The idea is to break data dependencies and increase the level of parallelism in a hardware implementation of a given function. The data dependency of the original software implementation is preserved for functional equivalence, but the required circuit is divided into a chain of independent stages. It is important to notice that all the stages, since they are supposed to run in parallel, have to share the same clock cycle. As an example, consider the following function: y = (a + b + x) * d. One possible implementation uses two adder blocks and one multiplier. This implementation is purely combinational and behaves like the corresponding software function, since all the input values must be known at the start of the computation and only one result y can be computed at a time. The overall execution time can be computed, ignoring signal propagation delays, as the execution time of the first adder, plus the execution time of the second adder, plus the execution time of the multiplier. To make this concrete, let us assume the multiplier takes 10 ns and each adder takes 5 ns, which means the time needed to compute the value of y is 20 ns. Within this context, the time needed to compute two subsequent values of y is 40 ns (20 ns × 2), the time needed to compute three subsequent values of y is 60 ns, and so forth. Now, what we are looking for is a way to overlap the execution of multiple instructions, and the way we are going to do this is by moving from purely combinational logic to sequential logic. In this figure we can appreciate a pipelined version of the previous circuit. 
The squares introduced in this circuit represent registers, which are implemented by flip-flop blocks in the FPGA. Within this context we are creating the chain of independent stages that we were previously describing, where each box counts as a single clock cycle. Therefore, in the pipelined version, the computation of each result y takes three clock cycles instead of one, as it was in the previous example. This does not look like an optimisation, but we have to remember that what matters is not only the number of clock cycles per se, but also the clock frequency and the throughput. By adding the registers, each block is now an isolated unit of computation. This means that the section with the multiplier and the section with the two adders can run in parallel, reducing the overall computational latency of the function. Now the time needed to complete the computation of the two adders is 10 ns, which is also the time needed to complete the multiplication. Within this context we can reduce the clock period from 20 ns to 10 ns. Let us now consider, as we did with the circuit without pipelining, the case of multiple inputs. In the previous example, to obtain two independent values of y we had to wait 40 ns; now, by running these stages in parallel, the circuit is essentially computing the two values of y in parallel. The initial computation of y, which is also referred to as the pipeline fill time, takes three clock cycles. However, after this initial computation, a new value of y is available at the output on every clock cycle, because the pipeline contains overlapped data sets for the current and subsequent y computations. This means that two independent values of y can be computed in 40 ns… Wait. 40 ns? What? Where is the improvement? Well, wait a second and consider the case of computing three independent values of y. 
In the previous case we had to wait 60 ns (20 ns × 3); now we are going to wait only 50 ns: 30 ns for the computation of the first y, but after the first 10 ns we can already start the computation on the second set of inputs, and the same happens in the following 10 ns. Finally, let us briefly introduce the dataflow optimization process. When we write an algorithm to be executed in software, it will usually be executed sequentially! There is nothing wrong with this, but this is also exactly the idea behind parallel computation and parallel architectures. What if we have a series of independent functions waiting to be executed, not because they need data to start their execution, but because they are waiting for computational resources to become available? Well, this is obviously something we want to avoid, and that is why we now work with multi-core processors… but what if our underlying hardware could be adapted to meet, by “nature”, such a condition? Well, this is exactly the case of an FPGA. Even when working with multi-cores, we cannot guarantee that all the cores will always be used to execute our application, which leads us to waste resources. With an FPGA this is not going to happen: we can adapt it to implement the necessary computation units according to the application's needs. Within this context, the goal of the dataflow process is to express parallelism at a coarse-grain level. The SDAccel compiler extracts this level of parallelism by evaluating the interactions between the different functions and loops of a program based on their inputs and outputs. The SDAccel compiler can also extract this level of parallelism for functions and loops within a loop. I am not going to provide a detailed example here of how to use the dataflow idea to optimize our application, but we are going to hear more about this process in the near future. Don't worry!