0:03

Today we're going to talk about data compression. This is a family of algorithms that everyone uses, and it's based on many of the classical ideas that we've covered in implementing basic algorithms. We'll start with an introduction: just what data compression is. The idea is to reduce the size of a file, for two primary reasons. One is to save space when storing it, and the other is to save time when transmitting it. Often, significant savings are not too difficult to achieve, because most files have plenty of redundancy; but of course we're interested in efficient algorithms that get the best savings possible. Now, you might think that with the way technology has been improving in recent years we wouldn't really need compression, because we could just buy a bigger, faster disk. The problem with that idea is that even though Moore's law tells us we're going to get twice as much space every one and a half to two years, whatever space we have, we're going to want to fill it up with better, higher-quality data. If you get more space, you want a higher-resolution movie.

And if we can remove redundancy, you're always going to want to do that. It's a cheap way to save space that lets us do more with the space we have available. Here's a report from last year on big data: every day we create 2.5 quintillion bytes of data, so much that 90 percent of the data in the world today has been created in the last two years alone. If we can use data compression to cut that amount in half, that's all the data in the world created in a year.

So this is a very interesting topic to consider from an algorithmic point of view, because some of the basic concepts date back to the 1950s, when it was extremely important to save time when transmitting information, and to save space, because both were so costly. But some of the best algorithmic technology has been developed recently, and much of it uses the data structures that we've considered for other applications; we'll see that as we get into the topic.

As for specific applications that may be familiar: for file compression, all file systems and disks have built-in compression technologies, such as zip and bzip and many others of similar type. For multimedia, everybody's familiar with JPEG and MP3 and MPEG and all those sorts of formats for images, sound, and video; those are all about data compression. For communication, data compression is what enabled fax, and it also enables newer technologies like Skype: the ability to reduce the amount of data that you actually need to send helps any kind of communication technology and improves the quality of what you can do. And then there are the massive amounts of data that people are saving nowadays; Google and Facebook and other web giants are saving lots of data.

And they're going to be able to save even more because of data compression.

Now, from a theoretical point of view, the type of compression that we're going to consider, lossless compression, is conceptually very simple. Everything can be encoded in bits, so we might as well just think of a stream of bits that we want to compress; we'll call it B. A data compression algorithm is a black box that takes those bits and produces a compressed version, which we'll call C(B), and which we hope is many fewer bits. That's what we save, or what we send. Then we need a companion algorithm, an expansion algorithm, that takes C(B) and produces B back. That's lossless compression: we want to get back precisely the same bits that we put in, and that's very important in many applications. It's not so important in some applications, like images and video, where you're happy with an approximation of the original bitstream; but we're not going to consider that kind of lossy algorithm. We're just going to consider lossless compression, where, say for a program, you can't afford to lose a single bit.

The measure we use to evaluate our algorithms is the compression ratio: the size of C(B) as a percentage of the size of B, which tells us how much we're able to save. We also care about efficiency, since we don't want to spend a lot of time creating C(B), but the compression ratio is the measure we use to compare algorithms. And surprisingly, even for natural-language text, we can get compression ratios of 50 to 75 percent or even better.

Just to put this in a larger context: data compression has been with us since antiquity, because it's always better to express what you're doing concisely. Mathematical notation is an example of that, or number systems, or even natural languages. Communications technology has evolved in ways that depend on achieving good data compression, from Braille to Morse code to the telephone system. And in modern life, everybody's trying to get pictures and movies onto their devices, and that's definitely enabled by good data compression. So, it's something to think about what role data compression will play in the future, but it's really hard to see it going away. Number one, it's effective; number two, it's not that difficult to implement, and it's in software, so in many situations you don't have to buy a bigger disk if you have a good compression algorithm.

Here's a simple example that sprang up just in recent years. We've seen in other applications that a genome is a string over a four-letter alphabet, and the string might be huge: it might be billions of letters, and there might be lots of them, for different individuals and different species. So there are huge genomic databases out there with these very long strings over the alphabet A, C, T, and G. Now, the standard way to store them, in standard programming languages on standard computers, is to use ASCII, with eight bits per character; so an N-character genomic string takes 8N bits, and if it's a billion characters, that's eight billion bits. But just look at the problem: there are really only four characters, so you can get by with just two bits per character. That's one quarter the storage: if you spent X dollars on the space to store your stuff, you could spend X/4.
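Just as a hedged illustration (this is my own sketch, not code from the lecture or its booksite), the two-bits-per-character idea might be coded like this:

```java
// Sketch: encode a genomic string over {A, C, G, T} at two bits per
// character, packed four characters per byte (length kept separately).
public class Genome2Bit {
    private static final String ALPHABET = "ACGT";

    public static byte[] compress(String s) {
        byte[] out = new byte[(s.length() + 3) / 4];
        for (int i = 0; i < s.length(); i++) {
            int code = ALPHABET.indexOf(s.charAt(i)); // A=0, C=1, G=2, T=3
            out[i / 4] |= code << (2 * (3 - i % 4));  // pack high bits first
        }
        return out;
    }

    // Expansion recovers exactly the original string: lossless.
    public static String expand(byte[] b, int length) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < length; i++) {
            int code = (b[i / 4] >> (2 * (3 - i % 4))) & 3;
            sb.append(ALPHABET.charAt(code));
        }
        return sb.toString();
    }
}
```

An N-character string now occupies about N/4 bytes instead of N bytes, the factor-of-four savings just described (a 25 percent compression ratio).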

And that might be a significant number. In general, if you know you have a small alphabet, you can get savings this way. This seems very obvious and straightforward, but it's amazingly true that in the 1990s, when genomic databases started to emerge, they were four times too big: people used ASCII and didn't think to use this trivial encoding that could have cut costs by a factor of four. So again, if it's so easy to do, why not do it? Where else would you give up reducing cost by a factor of four so easily? That's why we're interested in data compression.

Now, in order to write effective data compression algorithms, we're going to need some tools that are slightly different from the tools we've been using, so we've got extensions of our standard-input and standard-output libraries that work with bits. These are the methods we're going to use to read bits from standard input and write bits to standard output. These bit streams are different from StdIn and StdOut: they are binary standard input and binary standard output. With BinaryStdIn we can read a boolean, which is just reading one bit; or, to save some processing, we can read eight bits of data and put them in a char; or, in general, r bits, for r up to eight, and put them into a char value. We can also, if we want, read into int, long, and double values with similar methods, and we don't have to read all the bits if we've got a scheme where we only want to use a certain number of them. The idea is that we can read a variable number of bits and get them ready for processing easily by putting them into one of Java's primitive types. Conversely, with BinaryStdOut we can write a single bit, or eight bits, or any number of bits from a char value or an int value, and so on. This allows us to take in bit streams and write out bit streams, and it takes care of all the conversion between Java primitive types and bit streams. It's not too difficult to use, and it's very intuitive, as you'll see when we come to the code.

Just to give an example of how having this capability can save space, here's a simple example of representing a date. If you represent the date 12/31/1999 as an ASCII character string in Java, then when you write out the string you've got ten characters, and each character is eight bits, so that's a total of 80 bits. The standard ASCII encoding is to look up the ASCII code of the '1', which is 00110001, then the '2', and so forth through the slashes and the other digits. For each character we've got eight bits, and we put them together, and that's 80 bits; if it were Unicode, there would be even more bits. So that's one way that you might consider representing a date with bits.
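To make that concrete, here is a small sketch of my own (not the lecture's code) that spells out the 8-bit pattern for each character of the string:

```java
// Sketch: the 8-bit ASCII pattern for each character of a string;
// for "12/31/1999" that is 10 characters x 8 bits = 80 bits.
public class AsciiDateBits {
    public static String toBits(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            // left-pad the binary form of the character code to 8 bits
            String bits = Integer.toBinaryString(c);
            sb.append("0".repeat(8 - bits.length())).append(bits);
        }
        return sb.toString();
    }
}
```

For "12/31/1999", toBits produces an 80-bit string beginning 00110001, the ASCII code for '1'.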

Now, if we represent it as three int values, that's not so effective a way to do it: the month is an int, the day is an int, and the year is an int, and each of those is 32 bits, for a total of 96 bits, which is actually even worse. But the month and all these fields lie within small ranges. So with BinaryStdOut we can say: write out just the rightmost four bits of the month int, because four bits give a number between 0 and 15, and months are between 1 and 12; then five bits for the day, a number between 0 and 31, which covers any possible day; and then twelve bits for the year.
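The packing just described can be sketched as follows (my own illustration using shifts into a single int; the lecture itself does this with BinaryStdOut's write methods):

```java
// Sketch: pack month (4 bits), day (5 bits), and year (12 bits)
// into the low 21 bits of one int, and unpack the fields again.
public class PackedDate {
    public static int pack(int month, int day, int year) {
        return (month << 17) | (day << 12) | year;
    }
    public static int month(int packed) { return (packed >> 17) & 0xF;  }
    public static int day(int packed)   { return (packed >> 12) & 0x1F; }
    public static int year(int packed)  { return packed & 0xFFF;        }
}
```

For example, pack(12, 31, 1999) fits in 21 bits, and the three fields come back unchanged.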

Then we have a total of 21 bits, plus a little bit extra at the end, because our bit streams really are, in most situations, reasonably implemented as byte streams: they're going to get carved into chunks of size eight by the hardware or its drivers, so we always round up to the nearest multiple of eight to get proper alignment on bytes. But that's still a pretty significant savings, from 80 bits down to 24, just from having the ability to write out portions of int values. And if we had a billion dates, that would be a huge cost savings, one that would translate to dollars as well.

The other thing we do, to give some visual impression of the effectiveness of our compression algorithms, particularly in the book, is to look at different ways to examine the contents of a bitstream. On the booksite there's a binary dump program that prints out the bits, in rows of sixteen bits at a time, and then the total number of bits. This file, abra.txt, is a standard character stream, so everything is encoded with eight bits; there are twelve characters, so that's 96 bits, and you can see what all the bits are. Sometimes it's more convenient to use a hex dump, where we just read off the hex digits, considering the bits four at a time; those 96 bits are twelve bytes, or 24 hex digits.
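For instance, a minimal hex dump can be sketched like this (my own version, not the booksite's HexDump, applied to the twelve characters of "ABRACADABRA!"):

```java
// Sketch: print each 8-bit character as two hex digits, which is
// the same as reading the bitstream four bits at a time.
public class TinyHexDump {
    public static String hexDump(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) sb.append(' ');
            sb.append(String.format("%02x", (int) s.charAt(i)));
        }
        return sb.toString();
    }
}
```

hexDump("ABRACADABRA!") gives the 24 hex digits 41 42 52 41 43 41 44 41 42 52 41 21.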

And we print those out so that we can see them. Another thing that's useful sometimes, for big files, is to look at the bits as a picture: we give the width and height of a pixel window and just map the bits to white squares and black squares. This is the same abra.txt; in this view you can see much more easily that all the characters start with 01. This visual representation sometimes gives an interesting perspective on what a compressed file, or any bunch of bits, looks like. And here is the hex-to-ASCII conversion table; if you're not familiar with it, it's a quick way to find the hexadecimal corresponding to an ASCII character. You look up, say, R in the table: it's in row 5 and column 2, so it's 52, and that's the 52 for the R in ABRACADABRA. So those are the basic tools we're going to use when we talk about compression algorithms.

Now, there's one thing that's really important to think about, and that's the idea that you cannot have a compression algorithm that compresses every file, despite the fact that somebody has patented one: there's a patent claiming a method of data compression that is capable of compressing all files. And here's another one, which came out on Slashdot a few years ago, where a company announced a breakthrough in data compression that does 100-to-1 lossless compression of random data. To their credit, the post does say "if this is true," because if it were, our bandwidth problems would have just gotten a lot smaller. But there's no way it's true, and let's look at why that's the case. The proposition is that no algorithm can compress every bit string, and here's one easy way to prove it, by contradiction. Suppose we have an algorithm that can compress every possible bit string. We take a bit string and use our algorithm to compress it, getting a smaller bit string. But since our algorithm can compress every bit string, why not use it on that one? Use it on that one, and you get a smaller one.

Then we keep going: since the algorithm can compress every bit string, eventually we get down to a bit string of size one, and since it can compress that one, we get down to zero bits. So if you had an algorithm that compresses every bit string, you could compress all bit strings down to zero bits, and that's absurd. That's a proof by contradiction. Another way to see the same thing is just by counting. Say you've got an algorithm that can compress all 1000-bit strings, just to pick a number. Now, there are a lot of 1000-bit strings: 2^1000 of them. But how many bit strings can be encoded with 999 bits or fewer? Just 1 + 2 + 4 + ... + 2^999, counting the strings of each shorter length, and if you add up that sum you get 2^1000 - 1. That's fewer than the number of possible 1000-bit strings, so there's at least one that can't be compressed.
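That count is easy to check mechanically; here's a hedged sketch of my own that verifies the geometric sum falls one short:

```java
import java.math.BigInteger;

// Sketch: count all bit strings shorter than n bits and compare
// with the number of n-bit strings: 1 + 2 + ... + 2^(n-1) = 2^n - 1.
public class CompressCount {
    public static BigInteger shorterStrings(int n) {
        BigInteger sum = BigInteger.ZERO;
        for (int len = 0; len < n; len++) {
            sum = sum.add(BigInteger.TWO.pow(len)); // 2^len strings of length len
        }
        return sum;
    }
    public static BigInteger strings(int n) {
        return BigInteger.TWO.pow(n);               // 2^n strings of length n
    }
}
```

shorterStrings(1000) equals strings(1000) minus one, so no algorithm can shorten every 1000-bit string; and since shorterStrings(500) + 1 is exactly strings(1000) / strings(500), only about one 1000-bit string in 2^500 can be encoded in fewer than 500 bits.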

And if you ask for many fewer bits, you have a much worse problem: only about one out of every 2^500 of them can be encoded with fewer than 500 bits. A smaller number of bits just gives you way fewer possible bit strings, and you have to be able to get your original bit string back; so the smaller number of bits puts a limit on the number of bit strings that you can possibly compress. So we can't have universal data compression. If you want to convince yourself of that, just try running your favorite compression algorithm on the result of what it produces, say for a photo. Most good compression algorithms will figure out that you're doing that and just give you the same file back, so at least the file won't get larger; but there are plenty of methods out there that will make your file larger.

There's another basic idea: there's no way to find the best way to compress a file. If you think about it, that can be extremely complicated. Here's an example that illustrates the point, and it's also a relevant example, because it applies to so much of the data that we deal with. At the top is a picture dump of a million-bit file. Those million bits look random; well, they're not random, they're pseudorandom. What do I mean by pseudorandom? We've talked about that: they're generated by a program. This is a Java program that uses a pseudorandom number generator to generate bits, and that's where those bits came from.
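The lecture's program isn't reproduced here, but a program of that flavor might look like this sketch of mine (the seed is an arbitrary assumption):

```java
import java.util.Random;

// Sketch: a tiny program that deterministically "represents" a
// million pseudorandom bits; the source is far smaller than its output.
public class MillionBits {
    public static String bits(long seed, int n) {
        Random random = new Random(seed); // fixed seed => same bits every run
        StringBuilder sb = new StringBuilder(n);
        for (int i = 0; i < n; i++) {
            sb.append(random.nextBoolean() ? '1' : '0');
        }
        return sb.toString();
    }
}
```

Calling bits(12345L, 1000000) reproduces exactly the same million bits on every run, even though the method is only a few lines of source.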

But if you think about it, that program is a representation of those million bits. It's only a couple of dozen characters of source, so if you want to find the optimal way to compress those bits, one of the things you'd have to consider is the program itself: that's a pretty compact way to represent a million bits. And so much of the data that's out there actually was created by a program, so optimal compression amounts to finding the program that created the data. That's obviously a pretty difficult problem; in fact, it's known to be undecidable. So let's not think that we can find the best possible compression algorithm. The other point I want to finish the introduction with is to realize that when we're talking about compressing, say, English-language text, it's amazing how much redundancy there actually is in the English language.

Here's a perturbed version of an English-language paragraph which shows that you can change letters, and even delete letters, and still figure out what it says; in a lot of situations, you really only need the first and last letters of each word in place for readability. So there's a lot of freedom, because there's so much redundancy; and since there's so much redundancy, we can actually do really well compressing English-language text, for example. So that's an introduction to data compression, and we'll take a look at the algorithms next.
