0:02

My key index counter is a great algorithm but that's not the end of the story.

Â It's also useful for creating a more general purpose algorithm for strings.

Â The first one we'll look at is called LSD radix sort, Least Significant Digit for

Â string sorting. And the idea is a very simple one.

Â We have strings so we're gonna consider now a small example where the strings are

Â all the same length. And again that's often true in practical

Â applications account numbers and so forth. String that uses, sort these, may all be

Â the same length. And what we're gonna do is consider.

Â The character positions in the strings, and move from right to left.

Â The algorithm, is to just stably sort using key index counting, on, the chara-,

Â deed character of the key, where deed goes from the right end and decreases.

Â So, this is, a stable sort. Of those twelve keys, sorting on the

Â right-most character, the B's go before the D's go before the Es.

Â It's crucial that the sort be stable in this application, and that's why we

Â checked with key index counting to make sure that it was stable.

Â So it's a stable sort on, on the right most character.

Â And then all we do is move from right to left and do now the second character.

Â And this now is a stable sort of those same keys on the second character.

Â So now the ones with A in the second character come before the one with B and

Â so-forth. And not only that, it's stable, so their

Â relative order is maintained, BAB, CAB, FAD, BAD, and so-forth.

Â And then. To finish this.

Â Sort. Now we do it on the first key.

Â And the magic of LSD radix sorting is eh, once you do it on the first key then the

Â strings are all sorted. So, that's three passes, one for each

Â character in the string. Each taking linear time and we get a

Â string sort. That's LSD sort.

Â Now we need to prove that it works. And so this is a simple proof by induction

Â that it worked. After we have done I passes then we can

Â assume by induction that the strings are sorted on the last I characters.

Â So we are just showing that for two, after two passes. it sorted on the last two

Â characters thats we're assuming by induction.

Â So now what about the next character that we are sorting.

Â Well, there's two things that can happen. If two strings are different on the first

Â key, then, the key index sound is gonna do the job.

Â A string that starts with B is gonna come before one that starts with a D, and so

Â forth. So if they're different on the sort key,

Â the key index sort puts them in order. If they're the same on a sort key then

Â stability does the job. So all the ones with that stay on order

Â because we've insisted on a stable sort. That's a simple proof by induction that

Â LSD string sorts fixed length strings in a descending order.

Â And it's really easy to implement. This is a complete.

Â Java implementation of LSD string sort. So we're explicitly working with radix

Â r256. =

Â 256 and where the radix comes in, the value of the radix comes in, is that's the

Â size of the array that we use for the, counts and the accumulants.

Â We need one for each character. For each possible character value, we're

Â gonna index into that array that has to be that big.

Â And all this is, is the code for key index counting.

Â And then all we do is take a variable t that goes down from this is the strings of

Â fixed width w And we start at the rightmost character and go down to the

Â first character. And instead of dealing with a-ah, our

Â string A of I2 we're just look at the deed character which is a character.

Â Otherwise just with that replacement and that replacement it's the same code as we

Â looked at for key index counting. So let's do key index counting on the deed

Â character going down from the width from right to left.

Â That's remarkably compact code. And that's gonna be the method of choice

Â for lots of situations with fixed length keys as the sort key.

Â And, and it gives us another look at the performance of sorting out algorithms.

Â That gives us another line in the table that we're requiring that, that be, they

Â be fixed length keys, there's ways to work around that.

Â And we'll consider another algorithm that deals with that in a minute.

Â But again it's often or typically the case that the.

Â Width of the keys is not that long. It's a small constant.

Â And therefore, we have a linear time algorithm.

Â This even works if the keys are binary numbers represented in a binary word.

Â We can break them up into groups of small number of bytes Say, 64 byte number can be

Â broken up into 88 eight bytes characters or four sixteen byte characters.

Â For sixteen byte characters, W would be four and you can get a huge array of that

Â kind of numbers sorted in just the four passes through the array.

Â If they don't have the same length we have to do some extra work.

Â It's an interesting problem to think about.

Â We're, we're gonna look at a different method in a minute.

Â So here's the type of question that somebody might could ask for a job

Â interview. Actually, a web services company, every

Â day, might be in the position of needing sort a million or a billion 32 bit or 64

Â byte integers. And an algorithm student, in interviewing,

Â might could ask what sorting method do you use?

Â Now, senator. You hear Google and I like to think of the

Â presidency as a job interview. Now it's hard to get a job. as president.

Â Right. And, and you're going through that obviously now.

Â It's also hard to get a job at Google. Right.

Â [laugh] We, we have questions. And we ask our candidates questions.

Â And this one is from Larry Schwimmer. Okay.

Â [laugh]. You guys think I am kidding?

Â It's right here. What is the most efficient way to sort a

Â million 32 bit integers? Well I, no, no, no, no, no, I, I think the

Â bubble sort would be the wrong way to go. Come on, who told him this?

Â Okay. I didn't see computer science in the

Â background. We've got our spies in there.

Â Well. 'Kay.

Â Let's ask, let's ask a different interview [laugh] question.

Â [laugh]. So the bottom line is if you want a good

Â job maybe, you wanna know about LSD string sort.

Â Actually, this method has been around for, really a long time.

Â So we'll start with a little bit of a story.

Â So what did people do, in the nineteenth century when, they wanted to take a

Â census? And actually, the story is that, for the

Â 1880 census, it was actually obsolete, before it was completed.

Â It took 1,500 people seven years to manually process the data.

Â So, during that time there was room for some invention, and a man named Herman

Â Hollerith developed an automated machine that could help do the census faster.

Â So what his idea was to use punch cards to record data.

Â The kind of data that was taken in the census.

Â And then the machine could tabulate the data by sorting one column at, at time.

Â And we'll look a byte at how it does that in just a minute.

Â And the idea was that the re-, result of that was that the next census finish

Â really much earlier and under budget, 'cause this machine automated much of the

Â process. And that had a really a profound effect on

Â the development of computing 'cause punch cards, it turned out were useful not just

Â for census but for many other applications.

Â For accounting and for many other for business processes and for many decades,

Â punch cards were the primary medium that was used to store, enter, and process

Â data. And Hollerith's company for building this

Â machine later emerged with a bunch of other companies.

Â And in 1924 that company became known as IBM.

Â And actually, punch cards were used up into the 70's' And even in the 80's' in

Â some places. So, if, it's important.

Â Let's take a little break and talk about the role of LSD string sort.

Â For you know, a couple of decades people who wrote programs were working with

Â punched cards. And in courses at universities, if you

Â want to write a program, you wrote it by putting one line on each punched card.

Â In your program, therefore, was a deck, a long deck of punched cards.

Â If you had a 1000 line program, you had 1000 punched cards.

Â They came in boxes that held 2,000, and people would carry around these boxes of

Â punch cards that, that were their programs.

Â To enter the program there was a thing called a card punch which had a, which had

Â a keyboard kind of like a typewriter, but all it did, you could see the cards, and

Â it'd actually punche holes in the card with what you typed.

Â Now there was a huge so then, you, you're program was punched cards.

Â And there was a machine called a card reader which would take the cards in and

Â convert those punches back into, into binary and characters that again, could be

Â processed on the computer and then you get your results printed out on paper in a

Â line printer. So for many, many years people programmed

Â by making decks of punched cards handing them to an operator who put them on a card

Â reader and then waiting for the printed output to come out.

Â There were other devices, but this was the main thing for a long time.

Â And lots of people learned to program this way.

Â Now, there was a huge flaw in the system though.

Â The flaw was, if you dropped the deck, then your program was completely scrambled

Â and out of order. And you had no program.

Â You had to go through the cards and find the first line and, then find the second

Â line. Well, people figured out to work around

Â for this really almost right from the beginning, cause this is clearly

Â 12:25

intolerable, situation, and, the, along with the same room with the card punch,

Â There was a thing called a card sorter, And the card punch did one other thing

Â automatically. every time you punched a card.

Â It would go to the last six columns of the card and it would put in, a number.

Â Actually they skip by ten So, it would be. The first card would be card ten then

Â twenty, 30. And your cards would be numbered up to six digits, so you could

Â have thousands of cards sequentially. So when you typed in your program, you get

Â the cards numbered, in order. If you wanted to add a few lines to a

Â program, you had room to add a couple of numbers and, re-punch the numbers, but,

Â nuh, the whole point was, that all the time when you're holding on to a card

Â deck, the cards are in order, by number on the right hand column.

Â And if you dropped it, all you needed to do was sort it.

Â Or if the m-, machine operator dropped it, that was not viewed as a big deal for your

Â cards to get [laugh] out of order, because there was this machine that could sort

Â cards. And the way that it worked was, LSD Rate

Â exort, the LSD string sort on those characters that are the numbers that keep

Â the card in order. They would, you'ld start on the right

Â column. And there was a physical thing, you'd set

Â to a call and, it was gonna sort on, you put the deck in, and it'd distribute, the

Â ones with zero in the first bin, or ones with one in the second bin, all the way

Â up, it'd, and of all the cards it start with zero would come in, and it was

Â stable, whatever order the cards were in, that's the order they'd appear in the

Â pile, and then you'd pick them up, and you'd have a new deck, and it'd be all

Â sorted on the right-most column. Then you move over for one position from

Â right to left, and run the cards through again.

Â So if running the cards through the card sorter six times, you could get your deck

Â sorted. So, every programmer knew LSD radix sort.

Â For decades, it was not something that was, difficult to teach, 40 years ago.

Â And, these, this equipment, is now all pretty much gone.

Â But LSD radix sort, is still a good algorithm to know. Not related is

Â something else that was going on with those initials at that time.

Â