So, in these next few videos, we're going to continue our discussion of the

minimum cost spanning tree problem. And I'm going to focus on a second good

algorithm, good greedy algorithm for the problem,

namely Kruskal's algorithm. Now, we already have an excellent

algorithm for computing the minimum cost spanning tree in Prim's algorithm.

So you might wonder, you know, why bother spending the time learning a

second one? Well, let me give you three reasons.

So the first reason is this is just a, it's a cool algorithm.

It's definitely a candidate for the greatest hits.

It's something I think you should know. It's competitive with Prim's algorithm

both in theory and in practice. So it's another great greedy solution for

the minimum cost spanning problem. The second reason is it'll give us an

opportunity to learn a new data structure,

one we haven't discussed yet in this course.

It's called the union-find data structure.

So, in exactly the same way or in a similar way to how we used heaps to get a

super fast implementation of Prim's algorithm.

We use a unionfind data structure to get a super fast implementation of Kruskal's

algorithm. So that'll be a fun topic just in its own

right. The third reason is, is there's some very

cool connections between Kruskal's algorithm and certain types of clustering

algorithms. So that's a nice application that we'll

spend some time talking about. I'll discuss how natural greedy

algorithms in a clustering context are best understood as a variance of

Kruskal's minimum spanning tree algorithm.

So let me just briefly review some of the things I expect you to remember about the

minimum cost spanning tree problem. So the input of course is an undirected

graph, G and each edge has a cost. And what we're trying to output, the

responsibility of the algorithm is a spanning tree.

That is a subgraph which has no cycles and is connected.

There's a path between each pairs of vertices and amongst all potentially

exponential many spanning trees, the algorithm is supposed to output the one

with the smallest cost, smallest sum of edge costs.

So let me reiterate the, the standing assumptions that we've got through the

minimum spanning tree lectures. So first of all, in assuming if the graph

is connected, that's obviously necessary for it to have

any spanning trees. that said, Kruskal's algorithm actually

extends in a really, just easy, elegant way to the case where G is disconnected.

But I'm not going to talk about that here.

secondly, remember, we're going to assume that all of the edge costs are distincts,

that is there are no ties. Now don't worry, Kruskal's algorithm is

just as correct. If there are ties amongst the edge costs.

I'm just not going to give you a proof that covers that case, but don't worry, a

proof does, indeed exist. Finally, of the various machinery that we

developed to prove the correctness of Prim's algorithm perhaps the most

important and most subtle point was what's called the cut property.

So this is a condition which guarantees you're not making a mistake in a greedy

algorithm. It guarantees that a given edge is indeed

in the minimum spanning tree. And remember, the cut property says,

if you have an edge of a graph and you could find just a single cut for which

this edge is the cheapest one crossing it.

Okay? So, the edge E crosses is cut and every

other edge that crosses it is more expensive.

That certifies the presence of this edge in the minimum spanning tree,

guarantees that it's a safe edge to include.

So we'll definitely be using that again in Kruskal's algorithm when we prove it's

correct. So as with Prim's algorithm, before I hit

you with the pseudocode, let me just show you how the algorithm works in an

example. I think you'll find it very natural.

So let's look at the following graph with five vertices.

So the graph has seven edges and I've annotated them with their edge costs in

blue. So here's the big difference in

philosophy between Prim's algorithm and Kruskal's algorithm.

In Prim's algorithm, you insisted on growing like a mold from a starting

point, always maintaining connectivity, and spanning one new vertex in each

iteration. Kruskal's going to just throw out the

desire to have a connected subgraph at each step of the iteration.

Kruskal's algorithm will be totally content to grow a tree in parallel with

lots of simultaneous little pieces, only having them coalesce at the very end of

the algorithm. So in Prim's algorithm, while we were

only allowed to pick the cheapest edge subject to this constraint of spanning

some new vertex. In Kruskal's algorithm we're just going to pick the cheapest

edge that we haven't looked at yet. Now, there is an issue, of course,

we want to construct a spanning tree at the end.

So, we certainly don't want to create any cycles,

so we'll skip over edges that will create cycles.

But other than that constraint, we'll just look at the cheapest edge next in

Kruskal's algorithm and pick it if there is no cycles.

So let's look at this five vertex example.

Again, there is no starting point. We're just going to look at the cheapest

edge overall. So that's obviously this unit cost edge

and we're going to include that in our tree.

Right? Why not? Why not pick the cheapest edge? It's a

greedy algorithm. So what do we do next?

Well, now we have this edge of cost two, that looks good, so let's go ahead and

pick that one. Cool.

Notice these two edges are totally disjoint.

Kay.' So we are not maintaining a connectivity

of our subgraph at each iteration of Kruskal's algorithm.

Now, it just so happens that when we look at the next edge, the edge of cost 3, we

will fuse together the two disjoint pieces that we had previously. Now, we

happen to have one connected piece. Now, here's where it gets interesting.

When we look at the next edge, the edge of cost 4, we notice that we're not

allowed to pick the edge of cost 4. Why?

Well, that would create a triangle with the edges of costs 2 and 3,

and that of course is a no-no. We want to span the tree at the end of

the day, so we can't have any cycles. So we skip over the 4 because we have no

choice, we can't pick it, we move on to the 5 and the 5 is fine.

So when we pick the edge of cost 5, there's no cycles,

we go ahead and include it. And now we have a spanning tree and we

stop or if you prefer, you could think of it that we do, we do consider the edge of

cost 6. That would create a triangle with the

edges of costs 3 and 5, so we skip the 6. And then, for completeness, we think

about considering the 7, but that would form a triangle with the

edges of costs 1 and 5, so we skip that. So after this single scan through the

edges in assorted order, we find ourselves with these four pink edges.

In this case, it's a spanning tree and as we'll see, not just in this graph but in

every graph it's actually the minimum cost spanning tree.

So, with the intuition hopefully solidly in place, I don't think the following

pseudocode will surprise you. We want to get away with a single scan

through the edges in short order. So, obviously in the preprocessing step,

we want to take the unsorted array of edges and sort them by edge cost.

To keep the notation and the pseudocode simple, let me just, for the purposes of

the algorithm, description only, rename the edges 1, 2, 3, 4, all the way up to m

conforming to this sorted order, right? So, the algorithm's just going to scan

through the edges in this newly found sorted order.

So we're going to call the tree in progress capital T, like we did in Prim's

algorithm. And now, we're just going to zip through

the edges once in sorted order. And we take an edge, unless it's

obviously a bad idea. And here a bad idea means it creates a

cycle, that's a no-no, but as long as there's no cycle, we go

ahead and include the edge. And that's it, after you finish up the

for loop you just return the tree capital T.

It's easy to imagine various optimizations that you could do.

So for example, once you've added enough edges to have a spanning tree. So n - 1

edges, where n is the number of vertices, you can get ahead, go ahead and abort

this for loop and terminate the algorithm.

But let's just keep things simple and analyze this three line version of

Kruskal's algorithm in this lecture. So just like in our discussion of Primm's

algorithm, what I want to do is first just focus on why is Kruskal's algorithm

correct? Why does it output a spanning tree at all? And then, the spanning tree

that it outputs? Why on earth should it have the minimum cost amongst all

spanning trees? That's when we're, once we're convinced

of the correctness, we'll move on to a naive running time implementation and

then finally, a fast implementation using suitable data structures.