0:02

Okay. We're going to finish up by talking about some, practical applications of red

black trees. And in particular, B trees which are a general, general version. So

the idea the sign behind bee trees is that often, the data that we're, trying to

store is, is really huge. There's a, a large amount of data. And we're, we're

going to look at a more general model for external storage. Where, we work with

continuous blocks of data that are big. Maybe four, 4k or bigger, or maybe even a

whole file. And all we want to count is the first time we access a page, because

the main cost is trying to find where the page is. Once it's read in we get to read

all of the page for free pretty much. So the real property of external storage that

not your local memory, is that the time required to get to a page is way larger

than the time to access data within a page. So what we want to do is try to

access data that's out, externally, using a minimum number of probes. That's a model

of a file system that is pretty workable. And so bee trees are a generalization of.

Balance trees, that allow for this. The idea is to, allow not just two or three,

keys per node, but a large number like the number that can fit in a page. So, might

be, M = 1000 or M = 4000. And well, we've gotta have at least, two, keys at the

root. And, and the only other restriction is that, we don't want the nodes to get

too empty. So we have less than M. But we want to have at least M over two. And as

you'll see, this is, a generalization of, two three trees. That allows us to, build

balance trees that, are very, very shallow. Typically, these are set up so

that, the, all the data is in the external nodes. And so the external nodes have no

links, they just have keys. And their in, kept in sorted order. So, for example,

this is a external, this is M = six. This is an external five node. So it's got five

keys. It's got, room for one more temporary one, And then what will happen

is, when you insert into a full node, it'll split in the same as before. And

then we'll pass the s plit up causing a split up higher so the red keys in the

internal nodes are copies of keys down below that direct the search. And it is

that, that's a little extra detail. It just makes the implementation a little bit

easier and that's the way it's usually done. So but for now the main idea is that

it's like a 233 except that we allow way more keys per node. And then when a node

gets filled it splits. Into two so a node is always between half full and full. So

it's in 1000, it splits in two. And then each side has 500. And then we can use,

that property of the trees, in the analysis to, show that, it's not going to

be very many probes to get to any key. So the, search is, you know, just the same as

we've been doing, just generalized. There's a list of keys at every internal

node and that key, tells you that, then links for every key that give you, a place

where your key would have to be. So, this link is for all the keys in the B tree

that are between this key and the next one and in every case, it's that way. So if

we're looking for E. B and this b tree would go down the left link. And then

looking on the second link cuz E is between D and H. And that's just the way

it organized. And then when you get to an external node you just look for and so

that's a, that's the all searches terminated in external node, in other

words that's just a generalization of what we just did. And insertion, works the same

way, where we get to the bottom, and then, and then we split. So let's look at just

inserting A into this B tree. It comes into the node on the left. And then that

makes that temporarily overful. It's got one too many. So we split it into two

nodes. And that causes us to add a new entry into this internal, node. In this

case, it's the C, which is the smallest one in this new page. And that had to be

added. And we can move all those over. There's plenty of time by the memory

model. We're only counting the number of times we access the pages. We get to move

things around for free. And you could have some hybrid struc ture where you use

something different for the internal model. But usually it's fine just to do

that. Now that one becomes overfull. So it has to split and we have to create a new

route, just in the same way as we've been doing. So without seeing all the details

yo can understand that the same basic idea is going to work in this situation where

we're dealing with much, much more memory. And so the end result is that a search or

an insertion in a B-tree in a order m, that's where we're putting M keys per

page, requires between log base M - 1N and log. Base M over two M probes and that's

going to be a really small number, so say M is a 1000, log base M over two is, is

log base 500. So what power do you have to raise 500 to get bigger than N? In

practice that's going to be like four or five. And we can keep the root page in

memory so that it means, for any conceivable application, you can get to

any piece of data. Even if it's trillions of, of pieces of data in this huge, huge

file. You can get to any one with only five or six probes. That's quite amazing.

It's really an astounding example of algorithmic technology. Doing something

that, you wouldn't really, necessarily think that you could do so easily.

Maintain a dynamic search symbol table with trillions of keys, so that you can

get to any key just by looking five or six places but that's what B-trees provide for

us. This is a simulation that shows a, a growing B-tree so when a page, at the top,

there's just one page that fills up. When it fills up, it's red, and that splits

into two half pages and then keys get added on one side or the other so each.

Lying in this table some pages getting a new key and eventually one of them fills

up and splits. Now we have three pages and we keep going eventually one of them fills

up and splits. Now we have four pages and now this time the first one fills up and

splits and so forth. So the black is the occupied part of the page. The white is

the unoccupied part. And the full page about to split then right below there's

two pages. S o this shows the process of building a large B-tree. And that and you

can see the amount of black. It's kind of half empty. It's a little more than half

empty usually now, as this shows. And, and people have variants of these algorithms

that keep it more, much more than half empty if that kind of space is a, is a

consideration. So, as I've mentioned, red black trees and B-trees are widely used as

system symbol tables. The Java implementation of tree map and tree set is

red black trees, C++, the standard template library uses, red black trees.

And it's also, used in the, Linux kernel, and in many other systems. B-trees,

there's many different variants that, give different characteristics of, space usage

and other characteristics. In most databases, nowadays that, that you might

use. Sql or Oracles database and others, are based on, some variant of B-trees

because they're so, so effective. But you really know that your data structure and

algorithm is used by a lot of people when it appears in the popular culture. My

friend Philippe Flajolet who recently died was a famous French mathematician send me

an e-mail late one night. He was quite excited because he was watching a re-run

on, of an English actually Canadian TV show on French TV. I didn't know he spend

his time doing that But he was very excited because he saw this clip.

>> It was the red door again. >> I thought the red door was the storage

container. >> But it wasn't red anymore, it was

black. >> So red turning to black means what?

>> Budget deficits, red ink, black ink. >> It could be from a binary search tree.

The red black tree tracks every simple path from a node to a descendant leaf that

has the same number of black nodes. >> Does that help you with the ladies?

>> So not only is there some excitement in that dialogue but it's also technically

correct which you don't often find with math in popular culture of computer

science. A red black tree tracks every simple path from a node to a descendant

leaf with the same number of black nodes they got that rig ht. And that's also true

of B-trees and both of these methods are very effective and widely use.