So, in this video, we're going to start reasoning about the performance of hash tables. In particular, we'll make precise the idea that, properly implemented, they guarantee constant-time lookup. Let me briefly remind you of the road map we're in the middle of. We observed that every fixed hash function is subject to a pathological data set, so we're exploring the solution of making a real-time decision about which hash function to use. We've already gone over the really quite interesting definition of universal hash functions, which is our proposed definition of a good random hash function. Moreover, in the previous video I showed you that there are simple such families of hash functions: they don't take much space to store, and they don't take much time to evaluate. The plan for this video is to prove formally that if you draw a hash function uniformly at random from a universal family, like the one we saw in the previous video, then you're guaranteed expected constant time for all of the supported operations.

So, here's the definition once more of a universal family of hash functions. We'll be using this definition, of course, when we prove that these hash functions give good performance. Remember, we're talking now about a set of hash functions; these are all of the conceivable real-time decisions you might make about which hash function to use. The universe is fixed (something like IP addresses), and the number of buckets is fixed; that's going to be something like 10,000, say. What it means for a family to be universal is that the probability that any given pair of distinct items collides is as small as with the gold standard of completely perfect, uniform random hashing. That is, for each pair x, y of distinct elements of the universe (for example, for each pair of distinct IP addresses), the probability, over the choice of the random hash function little h from the family script H, that x and y collide is no more than 1/n, where n is the number of buckets. So, if you have 10,000 buckets, the probability that any given pair of IP addresses winds up getting mapped to the same bucket is at most one in 10,000.

Let me now spell out the precise guarantee we're going to prove if you use a randomly chosen hash function from a universal family. For this video, we're only going to talk about hash tables implemented with chaining, with one linked list per bucket. We'll be able to give a completely precise mathematical analysis with this collision-resolution scheme. We're going to analyze the performance of this hash table in expectation over the choice of a hash function little h drawn uniformly from a universal family script H. For example, for the family that we constructed in the previous video, this just amounts to choosing each of the four coefficients uniformly at random; that's how you select a hash function uniformly at random from that family. This theorem, and also the definition of universal hash functions, dates back to a 1979 research paper by Carter and Wegman. The idea of hashing dates back quite a bit before that, certainly to the 1950s, so this shows that sometimes ideas have to percolate for a while before you find the right way to explain what's going on. Carter and Wegman provided this very clean and modular way of thinking about the performance of hashing through the universal hashing definition.
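To make that real-time random choice concrete, here is a minimal Python sketch of the kind of family described above, assuming (as in the previous video) that the number of buckets n is prime and that keys are 4-tuples of values in {0, ..., 255}, like the four parts of an IP address; the function names are just for illustration.

```python
import random

def make_universal_hash(n):
    """Draw one hash function at random from a family like the one in the
    previous video: n is assumed to be a prime number of buckets, and keys
    are 4-tuples of integers in {0, ..., 255} (e.g., the four octets of an
    IP address). Each of the four coefficients is chosen uniformly at
    random from {0, ..., n-1}; that choice is the real-time decision."""
    a = [random.randrange(n) for _ in range(4)]
    def h(key):
        x1, x2, x3, x4 = key
        return (a[0] * x1 + a[1] * x2 + a[2] * x3 + a[3] * x4) % n
    return h

# Example usage: pick h once, up front, then use it for every operation.
h = make_universal_hash(10007)        # 10007 is prime, roughly 10,000 buckets
print(h((171, 64, 13, 2)))            # bucket index for one IP address
```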
And the guarantee is exactly the one that I promised way back when we talked about which operations are supported by hash tables and what kind of performance you should expect: constant-time performance. As always with hashing, there is some fine print, so let me be precise about the caveats of this guarantee. First of all, this guarantee is necessarily in expectation; it's on average over the choice of the hash function little h. But I want to reiterate that this guarantee holds for an arbitrary data set. In that sense it's quite reminiscent of the one we had for the randomized Quicksort algorithm. In Quicksort, we made no assumptions about the data; it was a completely arbitrary input array, and the guarantee said that, on average over the real-time randomized decisions of the algorithm, no matter what the input is, the expected running time was O(n log n). Here we're again making no assumptions about the data. It doesn't matter what you're storing in the hash table: in expectation over the real-time random decision of which hash function you use, you get constant-time performance, no matter what the data set is.

The second caveat is something we've talked about before. Remember, the key to good hash table performance is that not only do you need a good hash function, which is what universality is providing us, but you also need to control the load of the hash table. Of course, to get constant-time performance, as we've discussed, a necessary condition is that you have enough buckets to hold more or less the stuff that you're storing. That is, the load alpha, the number of objects in the table divided by the number of buckets, should be constant for this theorem to hold. And finally, whenever you do a hash table operation, you have to, in particular, invoke the hash function on whatever key you're given. So, in order to have constant-time performance, it had better be the case that it only takes constant time to evaluate your hash function. That's something we also discussed in the previous video, when we emphasized the importance of having simple universal hash functions, like the random linear combinations we discussed there.

In general, the mathematical analysis of hash table performance is a quite deep field, and there are some quite mathematically interesting results that are well outside the scope of this course. But what's cool about this theorem is that I'll be able to provide you a full and rigorous proof. So, for hash tables with chaining, using a randomly chosen universal hash function, I'm going to now prove that you do get constant-time performance. Now, hash tables support various operations: Insert, Delete, and Lookup. But really, if we can just bound the running time of an unsuccessful lookup, that's going to be enough to bound the running time of all of these operations. Remember, in a hash table with chaining, you first hash to the appropriate bucket and then you do the appropriate Insert, Delete, or Lookup in the linked list in that bucket. The worst case, as far as traversing this linked list goes, is when you're looking for something that isn't there, because you have to look at every single element of the list, all the way to the end, before you can conclude that the lookup has failed. Insertion, as we discussed, is always constant time, and with deletion and successful searches you might get lucky and stop early, before you hit the end of the list.
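To tie the caveats and the chaining implementation together, here is a minimal sketch of hashing with chaining that keeps the load constant by doubling the bucket array when it gets too full. The (a*x + b) mod p family, the 0.75 threshold, the integer-key assumption, and all of the names are illustrative choices on my part, not the lecture's exact construction.

```python
import random

class ChainedHashTable:
    """A sketch of hashing with chaining: one Python list per bucket.

    The hash function is drawn at random once, up front, from a standard
    Carter-Wegman-style family ((a*x + b) mod p) mod n_buckets, and the load
    alpha = (number of objects) / (number of buckets) is kept below a
    constant by doubling the bucket array."""

    P = 2**61 - 1  # a prime much larger than any key we expect to store

    def __init__(self, n_buckets=8):
        self.buckets = [[] for _ in range(n_buckets)]
        self.size = 0
        self._draw_hash_function()

    def _draw_hash_function(self):
        # The real-time random decision: pick the coefficients a and b.
        self.a = random.randrange(1, self.P)
        self.b = random.randrange(self.P)

    def _bucket(self, key):
        return ((self.a * key + self.b) % self.P) % len(self.buckets)

    def insert(self, key):
        if self.size / len(self.buckets) > 0.75:   # keep the load constant
            self._resize(2 * len(self.buckets))
        self.buckets[self._bucket(key)].append(key)
        self.size += 1

    def lookup(self, key):
        # Hash to the right bucket, then scan that bucket's list.
        return key in self.buckets[self._bucket(key)]

    def _resize(self, new_n):
        old_keys = [k for bucket in self.buckets for k in bucket]
        self.buckets = [[] for _ in range(new_n)]
        self._draw_hash_function()                 # re-draw for the new size
        for k in old_keys:
            self.buckets[self._bucket(k)].append(k)
```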
So, all we're going to do is analyze unsuccessful lookups; that will carry over to all of the other operations. A little more precisely, let S be the data set, the set of objects that we're storing in our hash table. Remember that to get constant-time lookup, it really needs to be the case that the load is constant, so we're assuming that the size of S is O(n), where n is the number of buckets. And let's suppose we're searching for some other object, call it x, which is not in S. Again, I want to emphasize that we're making no assumptions about what this data set S is, other than that its size is comparable to the number of buckets.

Conceptually, doing a lookup in a hash table with chaining is a very simple thing: you just hash to the appropriate bucket and then you scan through the linked list in that bucket. So it's very easy to write down what the running time of this unsuccessful lookup is; it has two parts. The first thing you do is evaluate the hash function to figure out the right bucket, and remember, we're assuming that we have a simple hash function that takes constant time to evaluate. Now, of course, the magic of hash tables is that, given this hash value, we can zip right to where the linked list is, using random access into our array of buckets. So we go straight to the appropriate place in our array of buckets and we search through the list, ultimately failing to find x. Traversing a linked list, as you all know, takes time proportional to the length of the list. So we find something that we discussed informally in the past, which is that the running time of hash table operations implemented with chaining is governed by the list lengths. That's really the key quantity we have to understand, so let's give it a name, capital L: the length of the list in x's bucket. It's important enough to deserve a name.

What I want you to appreciate at this point is that this key quantity, capital L, the list length in x's bucket, is a random variable; it is a function of which hash function little h we wind up selecting in real time. For some choices of the hash function little h, this list length might be as small as zero, but for other choices of h, the list length could be bigger. This is exactly analogous to Quicksort, where, depending on the real-time decision of which random pivot element you use, you're going to get a different number of comparisons and a different running time. So the list length, and hence the time for the unsuccessful lookup, depends on the hash function little h, which we're choosing at random.

Let's recall what it is we're trying to prove. We're trying to prove an upper bound on the running time of an unsuccessful lookup on average, where the "on average" is over the choice of the hash function little h. We've expressed this lookup time in terms of the list length in x's bucket, where again the average is over the random choice of h. Summarizing, we've reduced what we care about, the expected time for a lookup, to understanding the expected value of this random variable capital L, the list length in x's bucket. So that's what we've got to do: compute the expected value of this random variable capital L. To do that, I want to jog your memory of a general technique for analyzing expectations which you haven't seen in a while.
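In symbols, and just restating the discussion above, the cost of the unsuccessful lookup splits into two pieces:

\[
T_{\text{lookup}}(x) \;=\; \underbrace{O(1)}_{\text{evaluate } h(x)} \;+\; \underbrace{O(L)}_{\text{scan the list in bucket } h(x)},
\qquad\text{where}\qquad
L \;=\; \bigl|\{\, y \in S : h(y) = h(x) \,\}\bigr| .
\]

Taking expectations over the random choice of h, the expected lookup time is \(O(1) + O(\mathbf{E}_h[L])\), so the whole proof comes down to bounding \(\mathbf{E}_h[L]\).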
The last time we saw it, I believe, was when we were analyzing randomized Quicksort and counting its comparisons. So here's a general decomposition principle, which we're now going to use in exactly the same way as we did in Quicksort, this time to analyze the performance of hashing with chaining. This is for when you want to understand the expectation of a random variable which is complicated, but which you can express as the sum of much simpler random variables, ideally 0-1, or indicator, random variables.

The first step is to figure out the random variable, capital Y is what I'm calling it here, that you really care about. Now, we finished the last slide by completing step one: what we really care about is capital L, the list length in x's bucket, since that governs the running time of an unsuccessful lookup. Step two of the decomposition principle is, well, the random variable you care about is complicated and hard to analyze directly, so decompose it as a sum of 0-1 indicator random variables. That's what we're going to do at the beginning of the next slide. Why is it useful to decompose a complicated random variable into a sum of 0-1 random variables? Well, then you're in the wheelhouse of linearity of expectation. You get that the expected value of the random variable you care about is just the sum of the expected values of the individual indicator random variables, and those expectations are generally much easier to understand. That will again be the case here, in this theorem about the performance of hash tables with chaining.

So, let's apply this three-step decomposition principle to complete the proof of the Carter-Wegman theorem. For the record, let me just remind you of the random variable that we actually care about; that's capital L. The reason it's a random variable is that it depends on the choice of the hash function little h: this number could vary between zero and something much, much bigger than zero, depending on which h you choose. This is complicated and hard to analyze directly, so let's try to express it as a sum of 0-1 random variables. Here are the 0-1 random variables that are going to be the constituents of capital L. We're going to have one such variable for each object y in the data set. Now, remember this is an unsuccessful search; x is not in the data set capital S, so if y is in the data set, x and y are necessarily different. We define each random variable Z_y as follows: Z_y is 1 if y collides with x under h, and 0 otherwise. So, for a given Z_y, we have fixed objects x and y. For example, x is some IP address and y is some distinct IP address; x is not in our hash table, y is in our hash table. Depending on which hash function we wind up using, these two distinct IP addresses may or may not get mapped to the same bucket of our hash table. So, this is an indicator random variable just indicating whether or not they collide, whether or not we unluckily choose a hash function little h that sends these distinct IP addresses x and y to exactly the same bucket. Okay, so that's Z_y; clearly, by definition, it's a 0-1 random variable. Now, what's cool about these random variables is that capital L, the list length we care about, decomposes precisely into the sum of the Z_y's. That is, we can write capital L as the sum, over the objects y in the hash table, of Z_y.
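Written out, here is that decomposition:

\[
Z_y \;=\;
\begin{cases}
1 & \text{if } h(y) = h(x),\\
0 & \text{otherwise,}
\end{cases}
\qquad\text{for each } y \in S,
\qquad\text{and}\qquad
L \;=\; \sum_{y \in S} Z_y .
\]

By linearity of expectation, \(\mathbf{E}_h[L] = \sum_{y \in S} \mathbf{E}_h[Z_y]\), which reduces the problem to understanding the expectation of a single indicator variable.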
So, if you think about it, this equation is always true, no matter what the hash function h is. That is, if we choose some hash function that maps the IP address x to, say, bucket number 17, capital L is just counting how many objects in the hash table wind up getting mapped to bucket number 17. Maybe ten different objects got mapped to bucket number 17; those are exactly the ten values of y that will have their Z_y equal to 1. So L is just counting the number of objects in the data set S that collide with x, and a given Z_y is just indicating whether or not a given object y in the hash table collides with x. Summing over all of the possible things that could collide with x, that is, summing over the Z_y's, we of course get the total number of things that collide with x, which is exactly the population of x's bucket in the hash table.

Alright, so we've got all of our ducks in a row. Now, if we just remember all of the things we have going for us, we can follow our nose and nail the proof of this theorem. So, what is it that we have going for us? Well, in addition to this decomposition of the list length into indicator random variables, we've got linearity of expectation going for us, we've got the fact that our hash function is drawn from a universal family going for us, and we've got the fact that we've chosen the number of buckets n to be comparable to the data set size. We want to use all of those assumptions and finish the proof that the expected performance is constant.

So, we're going to have a few inequalities, and we're going to begin with the thing that we really care about: the expected list length in x's bucket. Remember, we saw in the previous slide that this is what governs the expected performance of the lookup. If we can prove that the expected value of capital L is constant, we're done; we've finished the theorem. The whole point of this decomposition principle is to apply linearity of expectation, which says that the expected value of a sum of random variables equals the sum of the expected values. Because L can be expressed as the sum of these Z_y's, we can exchange the summation and the expectation: we first sum over the y's, over the objects in the hash table, and then take the expected value of Z_y. Now, something that came up in our Quicksort analysis, but which you might have forgotten, is that 0-1 random variables have particularly simple expectations. The next quiz is just going to jog your memory about why 0-1 random variables are so nice in this context.

Okay, so the answer to this quiz is the third one: the expected value of Z_y is simply the probability that x and y collide. That just follows from the definition of the random variable Z_y and the definition of expectation. Namely, recall how we defined Z_y: it is 1 if this object y in the hash table happens to collide, under the hash function h, with the object x that we're looking for, and 0 otherwise. Again, it will be 1 for some hash functions and 0 for other hash functions. Then we just have to compute the expectation. One way to compute the expected value of a 0-1 random variable is to say, well, there are the cases where the random variable evaluates to 0, and there are the cases where it evaluates to 1, and of course the 0 term cancels. So this just equals the probability that Z_y = 1.
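In symbols, the computation from the quiz is just:

\[
\mathbf{E}[Z_y] \;=\; 0 \cdot \Pr[Z_y = 0] \;+\; 1 \cdot \Pr[Z_y = 1] \;=\; \Pr[Z_y = 1] \;=\; \Pr_h[h(x) = h(y)] .
\]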
And since Z_y being 1 is exactly the same thing as x and y colliding, that's what gives us the answer: it's the probability that x collides with y. So, let's plug that into our derivation. We again have the sum, over all of the objects y in our hash table, of the expected value of Z_y; let's write that in the more interpretable form, the probability that this particular object y in the hash table collides with the thing we're looking for, x. Now, we know something pretty cool about the probability that a given pair of distinct elements like x and y collide with each other. What is it that we know? Okay, so I hope you answered the second response to this quiz. This is really, in some sense, the key point of the analysis; this is the role that being a universal family of hash functions plays in this performance guarantee. What does it mean to be universal? It means that for any pair of distinct objects, like the x and y in this proof, if you make a random choice of a hash function, the probability of a collision is as good as with perfectly random hashing, namely at most 1/n, where n is the number of buckets.

So now I can return to the derivation. What that quiz reminds you of is that the definition of a universal family of hash functions guarantees that this probability, for each y, is at most 1/n, where n is the number of buckets in the hash table. So let's just rewrite that: we've upper bounded the expected list length by a sum, over the objects in the data set, of 1/n. And this, of course, is just equal to the number of objects in the data set, the cardinality of S, divided by n. And what is this? This is simply the load; this is the definition of the load alpha, which we're assuming is constant. Remember, that was the second caveat in the theorem. So, as long as you have a hash function that you can evaluate quickly, in constant time, and as long as you keep the load under control, so that the number of buckets is commensurate with the size of the data set you're storing, that's why universal hash functions, in a hash table with chaining, guarantee expected constant-time performance.
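Putting the whole derivation together in one chain:

\[
\mathbf{E}_h[L]
\;=\; \sum_{y \in S} \mathbf{E}_h[Z_y]
\;=\; \sum_{y \in S} \Pr_h[h(x) = h(y)]
\;\le\; \sum_{y \in S} \frac{1}{n}
\;=\; \frac{|S|}{n}
\;=\; \alpha
\;=\; O(1),
\]

where the inequality is exactly the definition of a universal family, and the last two steps use the definition of the load and the assumption that the load is constant. Combined with constant-time hash evaluation, this gives expected O(1) time for an unsuccessful lookup, and hence for Insert, Delete, and Lookup in general.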