[MUSIC] We're going to use k-means as an opportunity to present a really powerful and broadly used framework for parallel and distributed implementations of algorithms, and this framework is called MapReduce. We'll start by describing MapReduce in the context of a really simple word count example, which is the "Hello World" example of MapReduce, and then we're going to get to how to apply the MapReduce framework to parallelize k-means.

Okay, but for now let's imagine that we have 10 billion documents and just a single machine, and we want to count the number of occurrences of every word in the corpus. So how are we going to do this? Well, to begin with, we're going to initialize a hash table, which we call count. Then, for every document d in our corpus, which we'll call the set of documents, and for every word in document d, we're simply going to increment our count of this word. But of course, cycling through 10 billion documents and every word in each one of those documents, and then incrementing our count for each one of these words while maintaining this hash table, is going to be a really, really intensive task. So instead the question is, well, what if I had a thousand machines available to me? Can I somehow parallelize or distribute this operation across these different machines?

So in particular, let's imagine that we're going to just evenly distribute our data. We're going to send 10 million documents to each one of our thousand machines. And then what we can think about doing is counting words in the documents per machine, separately, because when we're doing word counts, this is an operation that's independent across documents. The word counts in one document don't influence the word counts in some separate document. So this is an example of what's called a data parallel task: some task that can be done completely independently across the different data points. What this implies is that we can think about just counting the occurrences of words separately in each one of our documents and then merging these together.

Okay, so we can do these counts per machine and then simply add these counts together. To be clear, per machine, we run exactly the code that we had on the previous slide, but now it's only over 10 million documents, not 10 billion documents. But then, once we have these hash tables, we're going to combine them together. So the total count of a given word in the entire corpus is the sum, over each one of our thousand machines, of the local count of that word. Okay, but here we've shown this just for a single word, and remember we want to do this for every word in our vocabulary that is present in the corpus. So how do we do this? Well, it seems like we're now back to a sequential problem: we have to go through each one of our words and do this merge of these hash tables. So just to make sure that's clear, we have to cycle through all words in our vocabulary, and I'm not happy about doing that. So can we do something even better? Well, let's imagine that we have a whole bunch of machines available for this word counting, and a whole bunch of machines available when we go to merge these hash tables and form our counts of all words in this vocabulary. Okay, so what we can do is divide this into two different phases: one phase where we have an operation that's parallel over documents, which is counting the words that appear in each one of these documents.
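To make these two steps concrete, here is a minimal Python sketch of the per-machine counting and the simple merge just described. It is not code from the lecture: the representation of documents as lists of word tokens, and the names count_words, merge_counts, and machine_1_docs, are illustrative assumptions.

```python
from collections import Counter

def count_words(documents):
    """Count word occurrences in the documents assigned to one machine."""
    counts = Counter()
    for doc in documents:
        for word in doc:        # doc is assumed to be a list of word tokens
            counts[word] += 1
    return counts

def merge_counts(per_machine_counts):
    """Merge the per-machine hash tables into one corpus-wide count."""
    total = Counter()
    for counts in per_machine_counts:
        total.update(counts)    # Counter.update adds counts key by key
    return total

# Toy usage: two "machines", each holding a small shard of the corpus.
machine_1_docs = [["machine", "learning"], ["learning", "is", "fun"]]
machine_2_docs = [["deep", "learning"]]
per_machine = [count_words(machine_1_docs), count_words(machine_2_docs)]
print(merge_counts(per_machine))   # Counter({'learning': 3, ...})
```

The count_words calls are independent per machine (data parallel over documents); the merge shown here is still sequential, which is exactly the bottleneck the next step addresses.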
So for example, maybe this machine is going to produce five counts of some word, ten counts of the word "machine", and seven counts of the word "learning". And maybe this machine also has some counts of the word "learning", maybe here we have three counts, as well as counts of other words that appeared in the documents on this machine. So that's the step that we described on our previous slide. But now what we're going to do is utilize multiple machines when we go to do the counts across the entire corpus, in particular when we go to merge these different counts that we've made. And to do this, and this is the really key step, we're going to send all counts of a given word to a single machine. So all counts of the word "learning" that appear on any one of these thousand machines are all going to get sent to one machine. Let's say "learning" all goes to this machine here, and this machine is then going to sum these counts. And in this case, maybe we would output "learning", and if these were the only two instances, it would output "learning" with a count of 10.

And the important thing about being able to do this second phase in this distributed manner is the fact that the operation that we're doing when we're merging these different hash tables is, again, an operation that can be implemented independently, now over words. So this is an operation that we'll call data parallel over words, and that's what allows us to have this efficient structure for merging the hash tables. Okay, but the critical component here is making sure that we send all counts of a given word to a single machine, so that the aggregation over these counts that's produced on this machine here corresponds to all the counts of that word in the entire corpus. And how are we going to do this? Well, what we're going to do is use something that's called a hash function, and let's describe this in a little bit more detail. So which words go to machine i? Well, our hash function h is going to be something that's defined over our entire vocabulary, so over the indices of however many words we have in the vocabulary. That's going to be the input to the function, and the output is going to be in the range 1, 2, all the way up to the number of machines that we have available. And so in particular, we'll send counts of the word "learning" to machine h(learning).

Okay, so to step back, what we've seen is that we had an operation that we could distribute across documents, which was simply to count how many instances of a given word appeared in each document. And then we had a step that we could distribute across the unique words, which was to add up all the counts that we had across these different documents to get a total count of a given word in the corpus. And this is an example of something that falls into what's called the MapReduce abstraction. [MUSIC]
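Here is a similarly minimal sketch of that second phase: routing every count to machine h(word) and then summing per word. The number of machines, the use of Python's built-in hash, and the helper names shuffle and reduce_machine are assumptions for illustration, not part of any actual MapReduce API.

```python
from collections import Counter, defaultdict

NUM_MACHINES = 4   # illustrative; the lecture's example uses a thousand machines

def h(word, num_machines=NUM_MACHINES):
    """Hash function mapping a word to the machine responsible for it.
    A real system would use a hash that is deterministic across runs."""
    return hash(word) % num_machines

def shuffle(per_machine_counts, num_machines=NUM_MACHINES):
    """Send every (word, count) pair to machine h(word)."""
    inboxes = [defaultdict(list) for _ in range(num_machines)]
    for counts in per_machine_counts:
        for word, count in counts.items():
            inboxes[h(word, num_machines)][word].append(count)
    return inboxes

def reduce_machine(inbox):
    """Each machine sums all the counts it received, word by word."""
    return Counter({word: sum(counts) for word, counts in inbox.items()})

# After the shuffle, all counts of a given word live on exactly one machine,
# so this reduce step is data parallel over words.
per_machine = [Counter({"learning": 7, "machine": 10}), Counter({"learning": 3})]
results = [reduce_machine(inbox) for inbox in shuffle(per_machine)]
# Exactly one of the results contains 'learning' with a count of 10.
```

Because h sends all occurrences of a word to the same machine, each machine's output is the complete corpus-wide count for the words it owns, which mirrors the guarantee described above.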