This was very easy for programmers to wrap their minds around, even though behind it were large clusters of many machines. All you had to do was think in terms of map and reduce, and suddenly you were controlling a lot of machines. Using just these two operations is a really simple way to express these large-scale distributed computations.
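To make that concrete, here is a minimal sketch of thinking in terms of map and reduce, written with plain Scala collections rather than the real Hadoop API; the input lines here are made up for illustration.

```scala
// A sketch of the map/reduce way of thinking, on plain Scala collections.
// The input `lines` is a made-up example, not real Hadoop input.
val lines = List("the quick brown fox", "the lazy dog")

// Map phase: emit a (word, 1) pair for every word on every line.
val mapped: List[(String, Int)] =
  lines.flatMap(line => line.split(" ").map(word => (word, 1)))

// Reduce phase: group the pairs by word and sum the counts for each group.
val wordCounts: Map[String, Int] =
  mapped.groupBy(_._1).map { case (word, pairs) => (word, pairs.map(_._2).sum) }
// wordCounts("the") == 2
```

The point is just that the whole computation is expressed as one map step and one reduce step; the framework takes care of spreading those steps across a cluster.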
Really, the big thing that it offered, in addition to this API that lots of people could easily wrap their minds around, was fault tolerance.
And fault tolerance is what made it possible for
Hadoop/MapReduce to scale to such large configurations of nodes.
So that meant that you could do processing
on data whose size was completely unbounded.
Things that could never in anybody's wildest dreams be done in a week's worth
of time on one computer, could be easily done on a network of hundreds or
thousands of computers in a matter of minutes or hours.
So, fault tolerance is what made that possible.
Without fault tolerance it would be impossible to do jobs on hundreds or
thousands of nodes.
And the reason fault tolerance is so important in distributed systems, or
in large-scale data processing generally, is that these machines weren't particularly
reliable or fancy, so the likelihood of at least one node failing or
having some network issue midway through a job was extremely high.
What Hadoop/MapReduce gave programmers was the assurance that
you really could do computations on unthinkably large data sets and
know that they would run to completion,
because even if a few nodes fail,
the system can find a way to recompute the data that was lost.
So these two things, this simple API and this ability to recover from failures so
you can scale your job out to large clusters, made it possible for
a normal Google software engineer to craft complex pipelines of MapReduce stages on
really, really big data sets.
So basically everybody in the company was able to do these really large-scale
computations and find all kinds of cool insights in these really large data sets.
Okay, so that all sounds really great.
Why don't we just use Hadoop?
Why bother with Spark then?
Well, if we jump back to the latency numbers that we just saw,
fault tolerance in Hadoop/MapReduce comes at a cost.
Between each map and reduce step, in order to recover from potential failures, Hadoop
will shuffle its data over the network and write intermediate data to disk.
So it's doing a lot of network and disk operations.
Remember, we just saw that reading from and writing to disk is actually 100 times
slower than working in-memory, not 1,000 times; that was a typo.
And network communication is up to 1,000,000 times
slower than doing computations in-memory, if you can do them there.
That's a big difference.
So keep those latency numbers in mind:
if we know that Hadoop/MapReduce is doing a bunch of operations on disk and
over the network, those operations have to be expensive.
Is there a better way?
Is there some way to do more of this in-memory?
Spark manages to keep fault tolerance, but it does so
while taking a different strategy to try and reduce this latency.
So in particular, what's really cool about Spark is that it uses ideas from
functional programming to deal with this latency problem,
to avoid writing to disk so much and doing so much network communication.
What Spark tries to do is keep all of the data
that is needed for a computation in-memory as much as possible.
And that data is always immutable.
If we can keep immutable data in-memory and use these nice
functional transformations, like we learned with Scala collections,
doing a map on a list and getting another list back,
then we can build up chains of transformations on
this immutable, functional data.
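For example, with ordinary Scala collections (the same idea Spark applies to distributed data sets), every transformation returns a new immutable collection, so a computation is just a chain of transformations; the numbers here are made up.

```scala
// Each step returns a new immutable collection; nothing is modified in place.
val numbers: List[Int] = List(1, 2, 3, 4, 5)

val doubled: List[Int] = numbers.map(_ * 2)    // List(2, 4, 6, 8, 10)
val bigOnes: List[Int] = doubled.filter(_ > 4) // List(6, 8, 10)
val total: Int         = bigOnes.reduce(_ + _) // 24

// `numbers` is untouched throughout; the result is fully determined by the
// chain of transformations map -> filter -> reduce applied to it.
```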
So we can get fault tolerance by just keeping track
of the functional transformations that we make on this immutable data.
And in order to figure out how to recompute a piece of data on
a node that might have crashed, or to restart
the work that was supposed to be done on that node somewhere else, all we have to do
is remember the transformations that were done to this immutable data.
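As a rough sketch of what that looks like with Spark's RDD API (here `sc` is the SparkContext you get in the Spark shell, and the input path is hypothetical), the program is just a recorded chain of transformations, and that recorded lineage is what lets Spark rebuild lost pieces.

```scala
// Assumes the Spark shell, where `sc` is a predefined SparkContext.
// The input path is hypothetical.
val lines  = sc.textFile("hdfs://...")
val words  = lines.flatMap(_.split(" "))
val pairs  = words.map(word => (word, 1))
val counts = pairs.reduceByKey(_ + _)

// Spark records the lineage textFile -> flatMap -> map -> reduceByKey.
// If a node holding partitions of `counts` crashes, Spark replays that
// chain of transformations on the corresponding input to recompute just
// the lost partitions, rather than relying on intermediate data on disk.
counts.count()  // an action that actually triggers the computation
```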