In the previous weeks, we learned about task parallelism. This week, we will study an alternative form of parallel computing called data parallelism. The central idea in task parallelism was to express the parts of the computation that should proceed in parallel. Here is one definition of task parallelism that we can agree on: a form of parallelization that distributes execution processes across computing nodes. We saw this in the previous weeks, where we learned how to use the task and parallel constructs to express parallel computations. Data parallelism takes a different approach: the computation is expressed once, but the same computation is expected to execute in parallel on different parts of the data. So the contrasting definition that we can use for data parallelism is: a form of parallelization that distributes data across computing nodes.

Let's look at some examples to make things more concrete. One of the simplest data parallel programming constructs is the parallel for loop. Here is an example of using a parallel for loop to initialize array entries. Let's declare a method called initializeArray which, given an integer array xs and an integer value v, writes the value v to every array entry in parallel. The initializeArray method starts a parallel for loop. The parallel for loop looks exactly like a regular for loop over a range collection, with the addition of a call to the par method. Adding .par to the range converts it into a parallel range, and iterations of the parallel loop will be executed on different processors concurrently with each other. In the body of the parallel for loop, the value v is assigned to the corresponding array entry i.

A parallel for loop does not return a resulting value, so the only way it can interact with the rest of the program is by performing a side effect, such as writing to an array; the parallel for loop is therefore not very functional. Since it relies on side effects, a parallel for loop is correct only if the iterations of the loop write to separate memory locations, or use some form of synchronization. In our example this is ensured: although we are not using any synchronization primitives, separate loop iterations modify separate array entries i. However, if the iterations of the loop all modified the same array entry, say entry 0, overwriting it with i, the program would be non-deterministic. Such improperly synchronized programs must be avoided.

Let's take a look at another parallel for loop example: visualizing the Mandelbrot set. The Mandelbrot set is the set of complex numbers for which a particular sequence does not approach infinity. What does that mean? Let's take a look at the complex plane. The complex plane is a plane spanned by two coordinates, x and y. A complex number in this plane is written with a real and an imaginary component. For example, the complex number at coordinates x = 3 and y = 2 is written as 3 + 2i; here, 3 is the real component and 2 is the imaginary component, multiplied by the special value i. For any such number c, we can compute the following sequence. We start with the complex number c itself; in the next step, we square it and add c; in the step after that, we take the entire expression from the previous step, square it, and add c; and we keep doing this, squaring the previous expression and adding c. This gives us a sequence of complex numbers z_n, defined by the formula z_{n+1} = z_n^2 + c.
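Before continuing with the Mandelbrot example, here is a minimal sketch of how the initializeArray method described above might look in Scala. This is a reconstruction from the description rather than the lecture's exact code; on Scala 2.13 and later, the .par conversion additionally requires the scala-parallel-collections module and the import shown below.

import scala.collection.parallel.CollectionConverters._

def initializeArray(xs: Array[Int], v: Int): Unit = {
  // .par converts the range into a parallel range, so iterations of this
  // loop may execute on different processors concurrently.
  for (i <- (0 until xs.length).par) {
    // Each iteration writes to a separate array entry, so no synchronization
    // is needed.
    xs(i) = v
  }
}

A call such as initializeArray(new Array[Int](1000000), 1) then fills all one million entries with 1 in parallel.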
Returning to the Mandelbrot set: if the absolute values of the complex numbers z_n in this sequence do not approach infinity, we consider the number c to be a part of the Mandelbrot set. How do we visualize this on a computer image? Let's see an example. Here you can see one part of the Mandelbrot set. To visualize it, we select a rectangle within the complex plane and map it to the screen. For each pixel on the screen, we compute the corresponding point c in the complex plane, and then compute the values z_n up to some maximum number of iterations. If it turns out that the absolute value of z_i exceeds 2 at some iteration i, we know that the sequence will approach infinity. Otherwise, if we reach the maximum number of iterations, we assume that the sequence never diverges. Diverging pixels get a color that is a function of the number of completed iterations; the color visually tells us how fast the pixel is diverging. For example, this pixel is diverging. Non-diverging pixels get the black color.

What we just saw is done by the computePixel method, which takes the coordinates xc and yc of the complex number and the maximum number of iterations, and returns the color of the pixel. The method initializes the complex number components x and y, and then updates them according to our formula z_{n+1} = z_n^2 + c. You can use your knowledge of complex numbers to verify for yourself that these two assignments really correspond to our sequence formula. At the end, the computePixel method uses another method, color, to convert the number of performed iterations i into an actual 32-bit color code. The parallel for loop is then simple: given an array of pixels called image, it traverses all the pixels in parallel, converts every pixel index to the corresponding coordinates xc and yc, and assigns the color returned by computePixel to the image array.

So let's take a look at the demo. The demo compares implementations of parallel Mandelbrot rendering: one that uses the task construct from the previous weeks, and others that use the parallel for loop. Let's start the Mandelbrot set program from SBT. We've prepared an interactive UI in which you can simultaneously see the Mandelbrot set being rendered and observe the running times of the different implementations. On the left, you can see the visualization of the Mandelbrot set. Pixels that are part of the Mandelbrot set are this time shown in white, and pixels that are not part of the Mandelbrot set are shown in black or in different shades of blue. The control panel on the right lets us choose between the different implementations. Reduction tree is a task parallel implementation similar to the ones we saw in the previous weeks; we will see that, out of the three options, this is the least performant implementation. The second implementation uses a parallel for loop that relies on parallel collections. The third implementation also uses a parallel for loop, but relies on an experimental data parallel scheduler.

We will start with the task parallel implementation. Let's render the Mandelbrot set several times. We can see that the running time for rendering the set is always well above 200 milliseconds. Next, let's try the parallel collections implementation. Here, the performance varied: in the worst case it was only about as fast as the task parallel implementation, but on average the performance is around 25% better than the task parallel implementation.
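Before we look at the third implementation, here is a hedged sketch of the computePixel method and the parallel rendering loop described above. It is reconstructed from the description, not taken from the lecture's code: the color helper, the parameter lists, and the way a pixel index is mapped to coordinates are all placeholder choices. The divergence test |z| > 2 is written as x*x + y*y > 4 to avoid a square root.

import scala.collection.parallel.CollectionConverters._

// Placeholder color scheme: points that never diverge are painted black,
// diverging points get a blue gradient based on the iteration count.
def color(iterations: Int, maxIterations: Int): Int =
  if (iterations >= maxIterations) 0xFF000000
  else 0xFF000000 | (255 * iterations / maxIterations)

def computePixel(xc: Double, yc: Double, maxIterations: Int): Int = {
  var i = 0
  var x = 0.0
  var y = 0.0
  // Iterate z_{n+1} = z_n^2 + c on the real part x and the imaginary part y,
  // where c = xc + yc*i, until |z| exceeds 2 or the iteration limit is hit.
  while (x * x + y * y < 4 && i < maxIterations) {
    val xt = x * x - y * y + xc
    val yt = 2 * x * y + yc
    x = xt
    y = yt
    i += 1
  }
  color(i, maxIterations)
}

// Renders the rectangle [xMin, xMax] x [yMin, yMax] of the complex plane into
// a width-by-height pixel array, one pixel per parallel loop iteration.
def parRender(image: Array[Int], width: Int, height: Int,
              xMin: Double, xMax: Double, yMin: Double, yMax: Double,
              maxIterations: Int): Unit = {
  for (idx <- (0 until image.length).par) {
    val xc = xMin + (xMax - xMin) * (idx % width) / width
    val yc = yMin + (yMax - yMin) * (idx / width) / height
    image(idx) = computePixel(xc, yc, maxIterations)
  }
}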
The third implementation, which relies on an experimental scheduler, should be even faster, so let's try it out. This implementation is up to two times faster than the task parallel one; as we can see, the running time is usually around 111 milliseconds. In the demo, we saw that the task parallel Mandelbrot set rendering was in many cases the least performant, and in some cases the data parallel implementation outperformed the task parallel one by a factor of 2x. We can ask ourselves what the reason for this is.

It turns out that different data parallel programs have different workloads. A workload is a function that maps each input element to the amount of work required to process it. The initializeArray method defined previously has a constant amount of work per element; we write this as w(i) = constant. Consider a graph in which the x axis is the loop iteration and the y axis is the running time. For the initializeArray method, each loop iteration is assigned the same amount of time. We call this a uniform workload, and a uniform workload is particularly easy to parallelize. An irregular workload, on the other hand, is defined by some function f(i): shown graphically, each loop iteration can be assigned a different amount of work. In the case of Mandelbrot rendering, the running time for each index depends on the number of iterations spent in the computePixel method, and this value depends on the particular pixel and the particular part of the Mandelbrot set that we are rendering. In other words, it depends on the problem instance.

The goal of the data parallel scheduler is to efficiently balance the workload across processors, without necessarily having any knowledge about w(i). Thanks to the scheduler, the task of balancing the workload is shifted away from the programmer; this is one of the main advantages of data parallel programming. In this lecture, we focused mainly on parallel for loops. In the next lecture, we will study additional data parallel operations.
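To close with one more illustration (this example is not from the lecture): here is a sketch of a loop with an irregular workload, where iteration i does an amount of work proportional to i. The programmer writes the same kind of parallel for loop as before and leaves the balancing of the uneven iterations to the data parallel scheduler.

import scala.collection.parallel.CollectionConverters._

// Deliberately naive prefix sums: iteration i sums the first i + 1 elements,
// so its workload is w(i) = i + 1, and later iterations do more work than
// earlier ones. The scheduler balances this across processors even though
// the programmer never specifies w(i).
def prefixSums(xs: Array[Int]): Array[Int] = {
  val out = new Array[Int](xs.length)
  for (i <- (0 until xs.length).par) {
    var sum = 0
    var j = 0
    while (j <= i) {
      sum += xs(j)
      j += 1
    }
    out(i) = sum
  }
  out
}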