Data repositories in which cases are related to subcases are identified as hierarchical. This course covers the representation schemes of hierarchies and algorithms that enable analysis of hierarchical data, as well as provides opportunities to apply several methods of analysis.

Associate Professor at Arizona State University in the School of Computing, Informatics & Decision Systems Engineering and Director of the Center for Accelerating Operational Efficiency School of Computing, Informatics & Decision Systems Engineering

K. Selcuk Candan

Professor of Computer Science and Engineering Director of ASU’s Center for Assured and Scalable Data Engineering (CASCADE)

The second algorithm that we will learn again,

to roughly approximate the idea of edit distance is called the compression distance.

So, the compression distance idea is based on the following observations.

So, here I'll need

to do a very brief overview of what compression is.

So, let us assume that we have a string with a lot of repetitions,

and let's assume that we have

another string which does not have a lot of repetitions.

If you use a compression algorithm,

and I'm pretty sure you have used various compression tools,

compression products out there to compress your large files.

If you use any compression algorithm,

you will see that compression of the string S_1,

and if you compress the string S_2,

this string will be compressed

significantly for the compressed size of this original size of this one,

two, three, four, five, six, seven, eight,

nine, 10 symbols, the value compressed as it is

basically the compressed size of this will be much less than 10,

maybe close to three or four whereas compression of the second string one,

two, three, four, five, six, seven, eight,

nine, 10, compression of

the second string will actually

be close to 10 or in this case it will be identical to 10.

Why is this happening?

The reason why this is happening is that

competition algorithms usually look at redundancies

and remove redundancies to reduce the text representation or sequence representation.

In this example, string S_1 has a lot of redundancy in it,

there are lots of repeated symbols,

there's lots of redundancy in it so,

it can be represented with few symbols once you remove the redundancies.

The second string however,

in this example has no redundancy,

and no redundancy means that it cannot be compressed,

so the compressed size of string S_2 will actually be

exactly identical to uncompressed size of it because there's no redundancy to remove.

So, the gist of this is that compression,

we are not going to learn any compressional algorithms but the gist of

basically the idea is that compression algorithms find redundancies,

they remove these redundancies,

and find a representation that is smaller.

The higher the redundancy,

the lower the compressed text length size.

So, the compression distance essentially will use this idea.

So, let's see how that works.

So, when we compute compression distance,

essentially what we do is,

we take the string P,

we take the string Q,

and we also create a new string which is concatenation of P and Q,

so we have three strings.

Then we create a compressed version of string P,

we create a compressed version of string Q,

and we also create

a compressed version of the string P concatenated with Q.

The observation is that the string P and Q may or may not have a lot of redundancy in it.

If P is very similar to Q,

there will be a lot of redundancy so,

this distance will be essentially similar

to maximum of compressed distance

of P or compressed distance of Q.

Why is this going to be like that?

Because there's a lot of redundancy.

Say, I have A,

B, C, D, A, B,

as P, and say I have A,

B, C, D, A, B,

as Q, they will be identical.

The compassion algorithm will recognize that these are very similar to each

other and will basically say that well,

the compressed size of P and Q together will be

similar to the compressed size of P and in this example,

is similar to the compressed size of Q.

Why? Because they are very similar to each other.

In contrast, if P is very different from Q,

if it is very different from Q,

then C(P:Q) will essentially

be similar to C(P) plus C(Q).

That is compressed size of P concatenated Q,

will be similar to compressed size of P plus compressed size of Q. Why is this happening?

Well, going back to the example,

if we have P as A,

B, C, D, A, B,

and if we have Q as X, Y,

Z, U, T, U,

if you concatenate them,

there is no redundancy between P and Q strings so,

the size of this will be essentially is if I compress this

separately and I compress this part separately and I add up the compressed sizes.

So, what does this mean?

This means that the compressed size of (P:Q), P concatenate Q,

can be as close to maximum of C(P), C(Q),

or can be as large as the summation of C(P) and C(Q),

depending on whether P and Q strings are similar to each other or not.

So, essentially what we do is that when we compute

the compression distance between two string,

we leverage this fact.

So, if two strings are very similar,

this turn here correspond to maximal C(P), C(Q).

This is, if P and Q is very similar,

essentially this means that these are very similar values to each other.

These maximal C(P) and C(Q) will also be very similar to

the minimum of C(P) and C(Q) if the two strings are very similar to each other.

So, essentially in this term will be a small turn close to zero.

So, if the two strings are very similar,

this is going to give us some number very close to zero indicating that the distance

between these two strings are essentially close to zero.

However, if the two strings are very different from each other,

this turn is going to be replaced with C(P) plus C(Q),

and what we are essentially doing is that we are subtracting the minimum of them.

So, let's assume that basically for this example,

let's assume that C(P) is less than C(Q) for this example.

So, this is equal to C(P) and this is equal to C(Q).

So, if the two strings are very different from each other,

essentially what's happening is that I remove the minimum of C(P) from the other time,

I'm left with C(Q) and I divide them by C(Q) for

the distance essentially will be similar to one, close to one.

So, if the two strings are similar,

if P is similar to Q,

I get distance close to zero,

indicating they are similar and if the strings are very different from each other,

then I get a distance close to one.

So, essentially the compressed distance algorithm uses the compression idea,

compression algorithms to compute something approximating

the edit distance without having to compute the edit distance explicitly.

Of course once again,

this is an approximation because it doesn't really compute

the number of edit that you have to do to convert P to Q,

but it to give you a sense as to whether string P is similar to string Q or not.

If the compressed distance is similar to zero,

you can kind of say that these things might be similar to each other.

But if the distance is close to one,

you know that these strings have no chance of being similar to each other.

So, you give basically some sense about the similar turn distance by

using a compression algorithm instead of computing the edit distance directly.

Explore nosso catálogo

Registre-se gratuitamente e obtenha recomendações, atualizações e ofertas personalizadas.