Hi everybody, and welcome to this new class! Having been introduced to the SDAccel environment, it is now time to see how to optimize our application to meet the desired performance. This lesson, together with the following ones, will guide you through exactly that!
Nowadays, software is at the foundation of application specification and development, no matter which domain the application targets (genomics and, more generally, medicine, gaming, entertainment, physical simulation, scientific computation, telecommunications). Most products available today started out as a software model or prototype that needed to be accelerated and executed on a hardware device. From this starting point, the software engineer is tasked with choosing the execution device that gets a solution to market with the highest possible degree of acceleration.
Within this context, recent advances have included the development of multi-core and heterogeneous computing platforms. All of these platforms require the programmer to rethink the problem to be solved in terms of explicit parallelism. Recognizing this programming challenge, the Khronos Group industry consortium has developed the OpenCL programming standard. The OpenCL specification defines a single consistent memory and computational model, and a system-level abstraction, for all hardware devices that support the standard. For a software engineer, this means a single programming model to learn that can be directly used on devices from multiple vendors. As specified by the OpenCL standard, any code that complies with the OpenCL specification is functionally portable and will execute on any computing device that supports the standard. Therefore, any code change is for performance optimization.
The degree to which an OpenCL program needs to be modified for performance depends on the quality of the starting source code and the execution environment for the application.
Let's now spend a few words to better describe the OpenCL computational and memory models. The two main components of an OpenCL platform are the host and the device. The former is in charge of enabling the drivers for all the devices, executing the host application, and managing both memory buffers and kernel execution. The latter is configured at runtime to execute the kernel. The OpenCL computational model is built around the logical abstractions of work-item and work-group: a work-item is the basic unit of work within an OpenCL device, while a work-group is a group of work-items. From a physical point of view, each work-group is mapped to a compute unit, while each work-item is mapped to a processing element (PE). The user writes the kernel that the work-items will execute, and specifies the number of work-items per work-group. The OpenCL memory model, in turn, consists of three layers of memory: a global memory, which is shared between host and device; a local memory, which is accessible by all the work-items inside a compute unit; and a private memory, which is accessible only to a single PE, that is, a single work-item. Depending on the target device, these layers map onto different physical memories.
Xilinx is an active member of the Khronos Group, collaborating on the specification of OpenCL, and supports the compilation of OpenCL programs for Xilinx FPGAs. The Software-Defined Development Environment for Acceleration (SDAccel) is the Xilinx development environment for compiling OpenCL programs to execute on Xilinx FPGAs.
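To make the work-item, work-group, and memory-layer terminology a bit more concrete, here is a minimal sketch in plain C that emulates, sequentially on a CPU, how an index space is split into work-groups and how the three memory layers are used. It is not OpenCL code and does not call the OpenCL API: the function name `run_group_sums`, the constant `MAX_LOCAL_SIZE`, and the choice of a per-group sum as the workload are all assumptions made only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical upper bound on the number of work-items per work-group. */
#define MAX_LOCAL_SIZE 64

/*
 * Sequential emulation of an OpenCL-style index space (illustrative only).
 *   in         : input array, standing in for global memory
 *   group_sums : one partial sum per work-group, also in "global memory"
 *   global_size: total number of work-items
 *   local_size : number of work-items per work-group
 */
static void run_group_sums(const int *in, int *group_sums,
                           size_t global_size, size_t local_size)
{
    /* In this sketch the global size must be a multiple of the group size. */
    assert(local_size > 0 && local_size <= MAX_LOCAL_SIZE);
    assert(global_size % local_size == 0);

    size_t num_groups = global_size / local_size;

    /* Each iteration plays the role of one work-group on one compute unit. */
    for (size_t group = 0; group < num_groups; group++) {
        /* "Local memory": visible to every work-item of this work-group. */
        int local_buf[MAX_LOCAL_SIZE];

        /* Phase 1: each work-item copies one element from global to local. */
        for (size_t lid = 0; lid < local_size; lid++) {
            size_t gid = group * local_size + lid; /* global work-item index */
            int value = in[gid];                   /* "private memory" */
            local_buf[lid] = value;
        }

        /* A real kernel would need a work-group barrier at this point. */

        /* Phase 2: work-item 0 of the group reduces the local buffer. */
        int sum = 0;
        for (size_t lid = 0; lid < local_size; lid++) {
            sum += local_buf[lid];
        }
        group_sums[group] = sum; /* result written back to global memory */
    }
}
```

For example, with the eight inputs 1 to 8 and a work-group size of 4, the two emulated work-groups would produce the partial sums 10 and 26. On a real device the work-groups and work-items run in parallel rather than in nested loops, which is exactly why the kernel has to be written in terms of explicit parallelism.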
SDAccel combines the industry's first architecturally optimizing compiler supporting any combination of OpenCL, C, and C++ kernels, along with a debugger, a profiler, libraries, development boards, and the first complete CPU/GPU-like development and run-time experience for FPGAs.
The OpenCL standard guarantees functional portability but not performance portability. Therefore, even though the same code will run on every platform supporting OpenCL, the performance achieved will vary depending on the coding style and the capabilities of the underlying hardware. Optimizing for an FPGA using the SDAccel tool chain requires the same effort as code optimization for a CPU/GPU. The one difference is that on a CPU/GPU the programmer is trying to get the best mapping of an application onto a fixed architecture, whereas on an FPGA the programmer is concerned with guiding the compiler to generate an optimized compute architecture for each accelerator (referred to as a kernel) in the application.
To aid the user in these optimizations, SDAccel provides a set of optimization directives to better specify the architecture to generate. Moreover, SDAccel supports both software and hardware emulation, to evaluate the correctness of the design at different levels. Finally, SDAccel offers performance profiling capabilities integrated into the run-time. This profiling helps the user analyze the achieved performance and pinpoint any potential bottlenecks that need to be addressed.
This first introductory lesson was meant to refresh a few concepts we should already be familiar with, but it is now time to start our “optimization journey”...