0:00

Our next application is the so-called shortest

common superstring problem, abbreviated as SCS.

In this case, the input consists of n strings.

And what we would like to find is

the shortest possible string that contains each of our n strings as a substring.

For example, if we have input strings like BCD, ABC,

and CDA, then the shortest common superstring is ABCDA.

Right? Indeed, BCD can be found here,

ABC can be found here,

and CDA can be found here.

So this essentially allows us to compress these n strings into one.

In particular, if we have many, many,

many such strings that can be compressed into one string,

which is not too long,

this allows us to store these strings more efficiently and also to use less space.

In particular, instead of storing the BCD, ABC, and CDA,

we just store the indices of the first appearance of the symbol in the superstring.

For example, if the indices,

they start with 1,

then instead of BCD,

we can store just 2.

Instead of ABC, we store 1.

Instead of CDA, we store 3.

So we store just the superstring and the indices of their first symbol.

So instead of just nine symbols, for example, here,

we store five symbols and also three indices.

So in this case, like the space saving,

I'm probably not dramatic,

but if we had many, many,

many such strings, then it makes perfect sense.

So what we've just shown is that there is

a natural application in data storage and data compression in this case.

But also in other famous application of

the shortest common superstring problem

is these genome assemblies that we discussed over this.

So recall that genome is just a very long string,

like billions, millions and millions of symbols of alphabet ACGT.

AC, I'm sorry, GT.

So the story is the current sequence in technologies,

they do not allow us to read the whole genome.

So what they allow us to read is a short three minutes, usually called reads.

So instead of a whole genome, we have many, many,

many of its relatively short substrings.

And then out of these substrings,

we need to compile a strings that contains all the reads.

So in reality, everything is slightly

more complicated because reads contains errors and so on.

And we're not interested in a single genome.

We're looking for a set of context.

But still, so the shortest common superstring is

considered as a mathematical model for genome assembly.

And at this point,

it is not at all clear how

this particular problem is related to the travelling salesman problem.

Because in this case we have n strings,

in case of the travelling salesman problem,

we've had a complete graph with ways.

But still, we will,

as you guessed already,

it is connected to the travelling salesman problem,

and we will figure out how it is related just in a few seconds.

So as a matter of it an example,

consider the following instance of

the following input to the shortest common superstring problem.

So the strings ABE, DFA,

DAB, CBD, ECA and ACB.

First of all, if we're looking for some superstrings,

not just for the shortest ones,

then it is quite easy to get one.

We just concatenate them.

So this is just ABEDFADABCBDECAACB.

So in a sense,

we do not compress anything in this case.

And the reason why sometimes it is possible to compress

or to a have shorter superstring is exactly because some pairs of input strings,

they have non-zero overlap.

So, for example, if we have strings ECA and ACB,

as you see, they share the suffix of lengths one of the first string which is A.

It's the same as the prefix of the second string,

which is also A.

So in this case,

they have non-empty overlap.

So A is called an overlap.

And, usually, we're interested to find the largest overlap, the longest one.

And this, in particular,

means that instead of concatenating them,

we can overlap them,

resulting in a string ECACB.

Here, we can find ECA. It is here, and here, we find ACB.

So they overlap, and this allows us to save some symbols.

And this basically means that what we want to do in

this case is to find the permutation of our input strings.

That way, we would like just to find an order of these strings,

such that the sum of consecutive overlaps is as large as possible.

And this actually brings us to

the idea that this problem is indeed a permutation problem.

So what we're looking for is sum permutation of the n given strings.

For example, if we permute our six strings as follows,

so DAB becomes the first one,

ABE becomes the second one,

the short one is ECA, the fourth one is ACB,

the fifth one is CBD, and the sixth one is DFA.

Then we see that the consecutive strings here,

they have overlap two.

Then these two guys have overlap one, one symbol.

These two guys have overlap two,

as well as these two guys.

And finally, these two guys also have overlap one.

And this allows us to compress

all these strings just in this particular order into one string,

which gives us the shortest superstring

than what we get if we just concatenate all of them.

So once again, this brings us to an idea that,

though it was not obvious from the beginning,

but that the shortest common superstring problem is actually permutation problem.

And this, in fact, allows us to solve,

to reduce this problem to

the travelling salesman problem in which we're looking for a path of maximum weight.

So we can argue as follows.

So let's construct the so-called overlap graph.

So the nodes of our graph are just our input strings.

And then we add directed edges between every pair of the strings.

This results in the following graph.

So on every edge,

let's just focus on the highlighted edges.

For example, on the edge from CBD to DAB,

we put weight 1 just because the overlap of CBD and DAB is equal to 1.

So, indeed they share the letter D. On the edge from ACB to CBD, we put 2.

This just reflects the fact that the overlap from ACB to CBD is equal to 2.

Just because the suffix CB is equal to the prefix CB of the second string.

On the other hand, here, we put zero, from DAB to ACB,

because they do not have any non-trivial overlap.

This gives us a directed graph.

And what we would like to find in this graph is a path of maximum total weight.

So instead of minimum total weight,

as in the classical version of the travelling salesman problem,

we're looking for maximum because

the overlap actually tells us how many symbols we can save.

And we would like to save as many symbols as possible.

So instead of finding the minimum weight path,

we're looking for a maximum way path.

In almost all approaches that we're going to consider in this class,

actually it doesn't matter whether we're looking for a minimum path or for a maximum.

And, also, this illustrates nicely that sometimes we're interested in the minimum path,

sometimes we're interested in the maximum path.

So that's why we have prefix marks for this problem all just

to emphasize the fact that we're looking for the most heavy paths.

And also this letter A here stands for asymmetric.

This means that we have a directed graph.

So you can find such abbreviations in the literature.

So the travelling salesman problem is just a general notion.

And if you would like to specify

that you're looking for a longest paths in the directed graphs,

then you specify it is ATSP.

So, finally, when we construct such a graph,

and we find a path of maximum weight,

this gives us something like this.

So this a path of maximum weight in this particular graph.

So it goes as follows. So it visits all the nodes.

Then this path actually spells a superstring that is as short as possible.

So we start with DAB.

This is DAB.

Then we go to the next node.

We see that there is an overlap of size 2,

so which means that we already have AB in it.

So we have just E to this string.

Then from ABE, we go to ECA.

And we know that the overlap is equal to 1,

which means that we just append CA.

Just instead of appending three symbols,

we append three minus one symbols, and so on.

So by going through all these,

along these paths, we spelled the superstrings of all our n strings.

And the fact that this is a maximum path guarantees that what we

get is a superstring that saves as many symbols as possible,

which, in turn, means that it's the shortest common superstring.