All right. Let's now use everything you have learned in this lesson to optimize
our real-world Spark application, and this will be the PageRank algorithm.
You have already figured out how PageRank works and
created a solid implementation of this algorithm in Spark.
This implementation already uses some smart performance tricks.
For example, you know that the links RDD is
static and will be used iteratively, so you persisted it in memory.
You also know that reducing the volume of a shuffle is a good thing, so you used
the reduceByKey transformation to sum up all the contributing PageRanks.
But, as always, you can do better.
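For reference, the iterative core of such an implementation looks roughly like the following PySpark sketch. It is a minimal sketch that follows the well-known PageRank example rather than the exact course code: the function name, the input path handling, and the assumption that each input line is a "source destination" pair are all illustrative.

```python
from operator import add
from pyspark import SparkContext

def naive_pagerank(sc: SparkContext, edges_path: str, iterations: int = 30):
    # links: page -> list of its outgoing links. It never changes and is
    # reused on every iteration, so it is persisted in memory.
    links = (sc.textFile(edges_path)
               .map(lambda line: tuple(line.split()))
               .groupByKey()
               .map(lambda kv: (kv[0], list(kv[1])))  # a plain map, so no partitioner survives
               .cache())

    # ranks is also built with a plain map, so it has no partitioner either
    ranks = links.map(lambda kv: (kv[0], 1.0))

    for _ in range(iterations):
        # join links with the current ranks and spread each page's rank
        # evenly over its outgoing links
        contribs = links.join(ranks).flatMap(
            lambda kv: [(dest, kv[1][1] / len(kv[1][0])) for dest in kv[1][0]])
        # reduceByKey sums the contributions with map-side combining, which keeps
        # the shuffle volume down; mapValues applies the damping factor
        ranks = contribs.reduceByKey(add).mapValues(lambda r: 0.15 + 0.85 * r)
    return ranks
```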
We will start with a simple example of a graph with only 5 nodes and 9 edges.
This makes things simple, fast, and manageable.
As usual, our test environment has 10 executors,
each having two cores and four gigabytes of memory.
The input file contains just 9 lines and we will do 30 iterations.
The test run finished in 24 seconds.
It generated 63 stages and more than 2,000 tasks,
and it looks a bit strange that just nine lines of input produced 2,000 tasks.
It becomes pretty obvious why this happens when you
look at the DAG that was generated by the job.
It has repetitive, redundant stages, marked with
red rectangles, and what you want to do is get rid of these unnecessary shuffles.
The cause of these shuffles is the join of the links with the ranks.
Why does this join generate a shuffle?
Simple: joins generate shuffles by their very nature.
But you know something about this specific problem.
You know that links and ranks have the same keys and that those keys don't
change, so it may be possible to use this knowledge and optimize things a bit.
You know that Spark uses a partitioner to perform a shuffle.
In this implementation of PageRank,
the links RDD has two partitions but its partitioner is None,
and the ranks RDD, though it also has two partitions,
does not have a partitioner either.
This is not good, because ranks is created by
transforming only the values, not the keys.
That means the links RDD and the ranks RDD should be co-partitioned,
while in fact they are not: their partitioners are not known and equal.
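If you want to verify this yourself, both properties are easy to print from the driver. A small sketch, assuming the links and ranks variables from the implementation above; the exact output depends on your setup:

```python
# Inspect the partitioning of both RDDs
print(links.getNumPartitions(), links.partitioner)   # e.g. "2 None"
print(ranks.getNumPartitions(), ranks.partitioner)   # e.g. "2 None"
```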
To make things explicit,
we will use the partitionBy method on the links RDD to introduce a partitioner.
We will use two partitions, as the data set
is small, but we will increase this number later.
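In code this is a single extra step; a sketch, assuming the links RDD from above (in practice you would put partitionBy into the chain that builds links, before persisting it):

```python
# Introduce an explicit hash partitioner with 2 partitions on links
links = links.partitionBy(2).cache()

print(links.partitioner)   # now a known partitioner
print(ranks.partitioner)   # still None -- ranks is untouched so far
```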
Okay, you have specified the partitioner, but nothing has really changed.
What should you do next?
To make your RDDs co-partitioned,
you should use the same partitioner for the ranks.
This can be achieved by preserving the partitioner in the map transformation that creates ranks.
As you can see, both RDDs now have a known partitioner, and those partitioners are equal.
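A sketch of that change, assuming ranks is derived from links as above (the preservesPartitioning flag is PySpark's way of promising that the keys are not modified):

```python
# Keep links' partitioner when creating ranks from it
ranks = links.map(lambda kv: (kv[0], 1.0), preservesPartitioning=True)

# mapValues would achieve the same thing, since it never touches the keys:
# ranks = links.mapValues(lambda _: 1.0)

print(links.partitioner == ranks.partitioner)   # True -- co-partitioned
```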
Are we done? Not quite.
The ranks RDD is modified in the for loop using the PageRanks of
the contributing links, and this modification is done by
the reduceByKey transformation, which also needs to perform a shuffle.
But if you ensure that reduceByKey generates
the same number of partitions in the ranks RDD as before,
everything will be okay.
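Inside the loop this looks roughly as follows; a fragment, assuming contribs holds the per-link contributions as in the sketch above and that links was partitioned into two partitions:

```python
num_partitions = 2   # the same number used in partitionBy on links
# Summing the contributions still shuffles (the keys change), but producing
# the same number of partitions keeps the result lined up with links.
summed = contribs.reduceByKey(lambda a, b: a + b, numPartitions=num_partitions)
```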
One final note: the damping factor is applied
via the mapValues transformation, which preserves the partitioner.
If this were done with the map transformation,
you would have to preserve the partitioner explicitly.
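Continuing the fragment above (summed is the output of reduceByKey):

```python
# mapValues transforms only the values, so the partitioner survives
ranks = summed.mapValues(lambda r: 0.15 + 0.85 * r)

# With a plain map you would have to preserve the partitioner explicitly:
# ranks = summed.map(lambda kv: (kv[0], 0.15 + 0.85 * kv[1]),
#                    preservesPartitioning=True)
```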
Great. The final implementation is here.
It is mostly the same as before,
with only three minor modifications.
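Putting the three changes together, the optimized version looks roughly like this; again a hedged sketch rather than the exact course code, with illustrative names and a num_partitions parameter that stays at 2 for the small graph:

```python
from operator import add
from pyspark import SparkContext

def optimized_pagerank(sc: SparkContext, edges_path: str,
                       iterations: int = 30, num_partitions: int = 2):
    # Change 1: an explicit partitioner on links, applied once before caching
    links = (sc.textFile(edges_path)
               .map(lambda line: tuple(line.split()))
               .groupByKey()
               .mapValues(list)
               .partitionBy(num_partitions)
               .cache())

    # Change 2: ranks is derived from links without touching the keys, so the
    # two RDDs are co-partitioned and the join below does not have to shuffle them
    ranks = links.map(lambda kv: (kv[0], 1.0), preservesPartitioning=True)

    for _ in range(iterations):
        contribs = links.join(ranks, numPartitions=num_partitions).flatMap(
            lambda kv: [(dest, kv[1][1] / len(kv[1][0])) for dest in kv[1][0]])
        # Change 3: reduceByKey keeps the same number of partitions, and
        # mapValues preserves that partitioner while applying the damping
        # factor, so co-partitioning survives into the next iteration
        ranks = (contribs.reduceByKey(add, numPartitions=num_partitions)
                         .mapValues(lambda r: 0.15 + 0.85 * r))
    return ranks
```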
Let's test it.
As you can see, the execution time has halved, as has the number of
stages, and the number of tasks has decreased 31 times.
Now the Spark UI shows that there are no redundant stages:
everything that happens is the same stage recomputing the ranks on each iteration.
Let's now test your modified solution on a realistic problem.
We'll use the Berkeley-Stanford web graph, which contains about
700,000 nodes and more than 7 million edges.
The algorithm will iterate five times.
The first implementation, which did not use
the partitioning optimizations, finished in 4.1
minutes, while the optimized solution took 3.5
minutes, which gives an increase in performance of about 20%.
And that is pretty great, because the algorithm ran for just 5 iterations,
which is far fewer than a web graph of this size would normally need.
So, summing up: the most important thing you should care
about in your Spark applications is reducing shuffles.
You can achieve this when the partitioners of your RDDs are
known, and it is better to assign a partitioner explicitly.
If your transformations do not change the keys,
you should preserve the partitioner.
And if your job does have to shuffle,
make sure you decrease the volume of the shuffle by
using an appropriate transformation such as reduceByKey.