Hi! One of the most challenging, yet important, tasks in developing high-performance applications is to devise an implementation that fully exploits the capabilities of the underlying hardware. Several techniques have been studied in the literature, but in this class we will present and discuss just one of them: the Berkeley Roofline Model, a graphical tool that allows us to easily identify the performance bottleneck of an implementation and provides hints on how to improve it to fully leverage the capabilities of the target architecture. Before digging into the details of how to use this model to identify performance bottlenecks, and how to apply it in a scenario where the underlying architecture is implemented on an FPGA, let us start with a brief history of the roofline model. The roofline model was first proposed in 2008 by Samuel Webb Williams in his PhD thesis at UC Berkeley, titled “Auto-tuning Performance on Multicore Computers”. As the thesis title suggests, the model was born as a tool primarily meant for general-purpose processors. Nevertheless, the roofline model can also be applied and extended to deal with hardware accelerators and, in particular, with FPGA-based accelerators. In this class, the aim is to provide an overview of the model and its basic principles. Hence, we will discuss the foundations of the roofline model and mainly focus on its application to general-purpose processors.

Let us now move to the core of the lesson and have a closer look at the roofline model. The roofline model exploits the so-called "bound and bottleneck analysis" which, instead of trying to estimate performance as stochastic models do, aims at showing where the bottlenecks of the developed application lie. Furthermore, it tries to quantify what is limiting the performance of the system, so that the user knows where optimizations are needed. The underlying assumption behind the roofline model is that, for the foreseeable future, off-chip memory bandwidth will often be the constraining factor. Hence, a model for analyzing the performance of an application needs, at the very least, to consider both the bandwidth towards the off-chip memory and the computational capability of the chip, given as its peak performance. Both the off-chip memory bandwidth and the peak performance are performance upper bounds that are architecture-specific and do not depend on the given application. On the other hand, whether the most constraining resource is the peak performance or the memory bandwidth greatly depends on the application itself and on the type of computation it performs. An application is Input/Output (I/O) bound if the most constraining resource is the memory bandwidth, while it is said to be compute bound if its performance is mainly limited by the available computational resources.

In order to measure how much time the application spends computing on the data versus moving data to and from the off-chip memory, the roofline model introduces the concept of Operational Intensity. The Operational Intensity (or OI) is defined as the ratio between the Work, W, and the Memory Traffic, M, that is, OI = W / M. Within this context, the Work is the number of floating-point operations performed on the data, while the Memory Traffic is the number of bytes of data moved to and from the off-chip memory.
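To make this definition concrete, the short Python sketch below computes the operational intensity of a hypothetical single-precision SAXPY kernel (y[i] = a * x[i] + y[i]); both the kernel and the numbers are illustrative assumptions, not an example taken from Williams' thesis.

# Hypothetical example: operational intensity (OI = W / M) of a
# single-precision SAXPY kernel, y[i] = a * x[i] + y[i], over n elements.

def operational_intensity(work_flops, traffic_bytes):
    """OI = W / M: floating-point operations per byte of off-chip traffic."""
    return work_flops / traffic_bytes

n = 1_000_000
flops_per_element = 2        # one multiply and one add per element
bytes_per_element = 3 * 4    # read x[i], read y[i], write y[i], 4 bytes each

W = n * flops_per_element    # Work: total floating-point operations
M = n * bytes_per_element    # Memory Traffic: total bytes to/from off-chip memory

print(f"OI = {operational_intensity(W, M):.3f} FLOP/byte")  # ~0.167 FLOP/byte

Such a low operational intensity already suggests that, on most machines, this kind of streaming kernel will fall in the memory bound region we discuss next.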
Notice that, in the context of general-purpose processors, the memory traffic considered for the operational intensity is the traffic between the DRAM and the cache hierarchy, since this communication channel is often much slower and more constraining than the one between the processor and the cache hierarchy. With this information, we can now start to build our roofline model for a given architecture. Let us consider a 2D graph in which the x-axis reports the operational intensity of the application and the y-axis reports the application performance. Such performance, when dealing with the CPU domain, is often measured in GFLOPS, i.e., billions of floating-point operations per second. This metric is very relevant for scientific applications, in which floating-point operations represent the dominant portion of the kernel function whose performance needs to be evaluated.

The first bound we can draw on the roofline model is the peak performance bound, which represents the theoretical maximum number of floating-point operations per second that can be achieved on the architecture. It appears as a horizontal line within the roofline that cannot be exceeded by any application running on that architecture. The memory bandwidth limit, instead, is represented by a diagonal line whose slope is the peak memory bandwidth β. The point at which the performance saturates at the peak performance level π, where the diagonal and the horizontal roof meet, is defined as the ridge point. The ridge point offers insight into the machine's overall balance: it provides the minimum operational intensity required to achieve peak performance and suggests, at a glance, the amount of effort required from the programmer to reach it. The least operational intensity needed by an application to reach the ridge point is given by the peak performance divided by the memory bandwidth, π / β. If an application has an operational intensity smaller than this value, it is referred to as memory bound, or I/O bound. On the other hand, if the application has an operational intensity higher than that of the ridge point, it is referred to as compute bound. Notice that an application might be memory bound on one architecture and compute bound on a different one. This is because different architectures have different values for both the memory bandwidth and the peak performance. Overall, the ridge point divides the area of the roofline model into two regions: the I/O bound region, to the left of the ridge point, and the compute bound region, to its right. Hence, depending on the operational intensity of the application, the upper bound on the achievable performance is given either by the memory bandwidth bound or by the peak performance bound (see the short sketch at the end of this lesson).

This concludes this first class on the roofline model. In the following one, we will present how the model can be used to give insights into why the real performance of an application falls below the theoretical bounds, and how to apply the roofline model also in the context of an FPGA used to implement the underlying architecture.
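As a quick recap, the two roofs we have drawn can be written compactly as attainable performance = min(π, β × OI). The short Python sketch below evaluates this bound and classifies an application with respect to the ridge point; the values chosen for π and β are made up purely for illustration and do not refer to any specific machine.

# Hypothetical machine parameters, chosen only for illustration.
peak_perf_gflops = 100.0   # pi: peak performance [GFLOP/s]
peak_bw_gbs = 25.0         # beta: peak off-chip memory bandwidth [GB/s]

ridge_point = peak_perf_gflops / peak_bw_gbs  # minimum OI to reach peak [FLOP/byte]

def attainable_gflops(oi):
    """Roofline bound: min(peak performance, memory bandwidth * operational intensity)."""
    return min(peak_perf_gflops, peak_bw_gbs * oi)

for oi in (0.17, 2.0, 8.0):
    region = "memory (I/O) bound" if oi < ridge_point else "compute bound"
    print(f"OI = {oi:4.2f} FLOP/byte -> bound = {attainable_gflops(oi):6.2f} GFLOP/s, {region}")

Moving an application's operational intensity past the ridge point, for example by improving data reuse, is what shifts it from the I/O bound region into the compute bound region.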