In the previous week, we learned about transformer operations and parallel collections. This week, we will study the implementations of these methods. We will see how combiners are implemented for various collections, and learn about data structures that are more amenable to parallel computations. So while the emphasis in the previous week was on data-parallel APIs and parallel programming abstractions, this week's lecture will be more algorithmic in nature. It will give you insight into how to pick the right data structure and how to organize the data in a parallel program.

Recall that a transformer operation is a collection operation that creates another collection instead of just a single value. Methods such as filter, map, flatMap, and groupBy are examples of transformer operations. By contrast, methods such as fold, sum, and aggregate are not transformer operations. We also learned that sequential transformer operations can be implemented generically, using an abstraction called a builder. The Builder trait in Scala takes two type parameters: T, which denotes the type of the elements of the collection, for example String, and Repr, which denotes the type of the collection, for example a sequence of strings. This trait has two important methods: +=, which is used to add an element to the builder, and result, which is used to obtain the collection after all the elements are added. For example, let's assume that a single processor has a sequence of strings: Adam, Bob, and Eve. To capitalize all the strings in the sequence, the processor needs to map each element. It starts by creating a new builder object, and then adds the capitalized versions of the strings into the builder by calling the += method. At the end, the processor invokes the result method, which creates the resulting sequence.

Builders can only be used to implement sequential transformer operations. To implement parallel transformer operations, we need an abstraction called a combiner. A combiner is a builder with an additional method called combine, which takes the elements from two input combiners and produces a third, new combiner. Here's an example. Given a combiner this with elements x, y, and z, and another combiner that with elements u, v, and w, the combine method produces a third combiner with the elements x, y, z, u, v, and w, and in the process invalidates the old combiners. After the combine method returns, the old combiners can no longer be used, but the new one can be used in the parallel program. Making this operation efficient is not a trivial task, and as we will see, we need to find a way to make it fast.

Before we start, let's consider what the combine operation means. It turns out that the meaning of the combine operation depends on the underlying data structure. The collection Repr could be a Scala set or a Scala map. In this case, the combine operation represents a union. So, given two sets, this with the elements 1, 2, and 4, and that with the elements 2, 3, and 4, combine produces a union with the elements 1, 2, 3, and 4. When the collection Repr is a Scala sequence such as a vector or an array buffer, the combine operation represents concatenation. So, for example, given two sequences with the elements 1, 2, 4 and 2, 3, 4, the combine operation produces a new sequence with the elements 1, 2, 4, 2, 3, and 4. In both cases, the combine operation must be efficient. By efficient, we mean it must execute in O(log n + log m) time, where n and m are the sizes of the two input combiners.
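To make these two abstractions concrete, here is a hedged sketch in Scala. The trait names MyBuilder and MyCombiner, the SeqStringBuilder helper, and the use of toUpperCase for the capitalization example are illustrative assumptions made for this sketch; they are simplified stand-ins, not the actual Scala standard-library or course definitions.

```scala
import scala.collection.mutable.ArrayBuffer

// Simplified sketch of the builder abstraction described above:
// T is the element type, Repr the type of the resulting collection.
trait MyBuilder[T, Repr] {
  def +=(elem: T): this.type   // add an element to the builder
  def result(): Repr           // obtain the collection after all elements are added
}

// A combiner is a builder with one extra method, combine, which merges
// the elements of two combiners into a new one (invalidating the old ones).
trait MyCombiner[T, Repr] extends MyBuilder[T, Repr] {
  def combine(that: MyCombiner[T, Repr]): MyCombiner[T, Repr]
}

// A concrete builder for sequences of strings, backed by an ArrayBuffer.
class SeqStringBuilder extends MyBuilder[String, Seq[String]] {
  private val buf = ArrayBuffer.empty[String]
  def +=(elem: String): this.type = { buf += elem; this }
  def result(): Seq[String] = buf.toSeq
}

// Sequential map implemented with a builder, as in the capitalization example
// (capitalization is taken here to mean uppercasing).
def mapToUpper(xs: Seq[String], b: MyBuilder[String, Seq[String]]): Seq[String] = {
  for (x <- xs) b += x.toUpperCase  // e.g. "Adam" becomes "ADAM"
  b.result()
}

// mapToUpper(Seq("Adam", "Bob", "Eve"), new SeqStringBuilder)
// returns Seq("ADAM", "BOB", "EVE")
```

With these signatures in mind, let's return to the efficiency requirement on combine.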
Why do we make this requirement? During a parallel operation, the combine method could be invoked multiple times. Assume for a moment that combine takes n + m computational steps. If we have four processors executing a filter operation, then at the leaf level of the parallel reduction tree, they will traverse the n elements in n/4 computational steps. Each processor independently produces an array of the elements it filtered. At the next level of the tree, two pairs of arrays would be combined in n/2 computational steps, by, for example, processors P0 and P2. Finally, a single processor would need to call combine to produce the final array at the root of the tree, which takes n computational steps. If we sum these computational steps, we see that the total running time is n/4 + n/2 + n = 7n/4, which is much more than n, the time required for a single processor to complete the filter operation. Although this is a very rough analysis, which assumes that combine takes about as much time as filtering, we can see that combine must be more efficient if we are to gain any speedup. If we require that the running time of the combine operation is O(log n + log m), the combine steps become insignificant in the overall execution time.

In the following, we show a combine implementation for two arrays. For simplicity, we have slightly diverged from the previous combine signature, but the semantics of this method remain the same: given two integer arrays, produce a third array that is their concatenation. So here is a question for you: is this method efficient enough?

If you answered no, you are correct. This method takes O(n + m) time, where n and m are the lengths of the two arrays. To show this, let's count the total number of steps in this method. The first line allocates the result array r, whose length must be n + m, and this requires n + m computational steps on the JVM runtime. Then, the contents of the arrays are copied to r with the Array.copy calls, which again take n + m steps. In total, that's 2n + 2m computational steps. In complexity analysis, constant factors are ignored, so we can write this as just O(n + m). So, in conclusion, the correct answer is no: this combine method is not efficient.

So the question is, can we produce a more efficient combine implementation for arrays? We argue that this is not possible, and here is why. Computer memory can be visualized as a long tape. Any data structure that we use occupies a subset of locations on this tape. An array of length n is a very simple data structure that occupies a contiguous block of memory. The arrays xs and ys can be at arbitrary positions on this tape. The resulting array must also be a contiguous block of memory. If xs and ys were adjacent, we could return the resulting array pointing to the start of xs; in this case, we would not need to copy anything. However, xs and ys could be anywhere, so we are forced to copy their elements to produce a contiguous, uninterrupted block.
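The lecture analyzes a concrete combine method for two integer arrays that is not reproduced in this transcript. A minimal sketch consistent with the description above (allocate a result array of length n + m, then fill it with two Array.copy calls) might look like this; the method name and element type are assumptions of this sketch, not an actual library API.

```scala
// Array concatenation as a combine operation: O(n + m) time, because the
// result must be a fresh contiguous block containing copies of all elements.
def combine(xs: Array[Int], ys: Array[Int]): Array[Int] = {
  val r = new Array[Int](xs.length + ys.length) // allocation: n + m steps
  Array.copy(xs, 0, r, 0, xs.length)            // copy xs into r: n steps
  Array.copy(ys, 0, r, xs.length, ys.length)    // copy ys after xs: m steps
  r
}
```

As argued above, no asymptotically faster array concatenation is possible as long as the result has to occupy a single contiguous block of memory.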
Let's consider some other data structures typically used in programming languages to implement sets and maps, and the efficiency of their operations. A hash table is a contiguous block of memory, partially populated with elements. To find or modify an element, we compute its hash code and use it to locate the element. Insert, remove, and lookup operations take expected constant O(1) time.

Balanced search trees are composed of nodes that contain elements and optional child nodes. The topmost node is called the root of the tree. The nodes without children are called leaves. Balanced trees usually have the property that the length of the longest path from the root to a leaf is never more than two times larger than the length of the shortest path from the root to a leaf. In the tree shown here, the longest path from the root to a leaf is equal to 3, while the shortest path from the root to a leaf is equal to 2. Generally, this property ensures that the height of a tree with n elements is O(log n). One useful consequence of this is that it takes O(log n) steps to reach any leaf starting from the root, so lookup, insert, and remove operations take O(log n) time.

A linked list is a linear arrangement of nodes, where each node except the last one has a single successor. Finding an element in the list requires traversing the entire list in the worst case, so the running time of these operations is O(n). So, for example, if we want to check that the element 9 is in the shown list, we have to traverse all the elements in the list.

Unfortunately, these standard data structures do not have an efficient union operation, making their combiners tricky to implement. As an exercise, we encourage you to try to implement the union operation for some of these data structures.

Let's next consider data structures that are typically used in programming languages to implement sequences. Here, again, we get various operation complexities. Mutable linked lists have constant-time prepend and append operations, but linear-time insertion and lookup. So, for example, if we want to append the element 9 at the end of this list, we simply need to update its tail pointer. Functional linked lists, or cons lists, have constant-time prepend, but since their tail cannot be mutated, all other operations take linear time. Array lists, which are used to implement array buffers in Scala, have an amortized constant-time append operation. Most of the time, we append an element to the array list by filling in the first empty entry in the array; so, if we want to add the elements 8 and 6, we just add them like this. When we run out of space in an array list, we simply allocate a larger array block and copy the existing elements. Once that is done, we can append the new element. On average, we do a constant amount of work when we append, and only occasionally need to copy everything (a sketch of this doubling strategy appears at the end of this lecture). Random access to elements takes constant time, but all other operations, such as prepend, take linear time. Except for mutable linked lists, concatenation for these data structures requires copying all the elements, and takes linear O(n) time.

This lecture should have convinced us that implementing combiners is not trivial. Since most of these data structures do not have an efficient concatenation or union, providing a combiner for the corresponding collections is not straightforward. However, things are not so grim. In the next lecture, we will see that it is indeed possible to implement efficient combiners for these data structures, and we will study the techniques used to do this.
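As an appendix to the amortized-append discussion above, here is a minimal, hypothetical sketch of the doubling strategy for array lists. The class GrowableIntArray and its details are illustrative assumptions for this sketch only; this is not Scala's actual ArrayBuffer implementation.

```scala
// Illustrative growable array: append is amortized O(1) because most appends
// fill the next empty slot, and only when the array is full do we allocate a
// block twice as large and copy the existing elements over.
class GrowableIntArray {
  private var elems = new Array[Int](4)
  private var size = 0

  def append(x: Int): Unit = {
    if (size == elems.length) {                   // out of space: grow the block
      val larger = new Array[Int](2 * elems.length)
      Array.copy(elems, 0, larger, 0, size)       // occasional O(n) copy
      elems = larger
    }
    elems(size) = x                               // common case: O(1) write
    size += 1
  }

  def apply(i: Int): Int = elems(i)               // random access in O(1)
  def length: Int = size
}
```

Appending n elements this way performs roughly 2n element writes in total (n direct writes plus at most about n copied elements across all resizes), which is why the per-append cost is amortized constant.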