In this video, we will talk about dynamic allocation.
In the previous video,
we tried to answer the following question:
how many resources should you allocate to your app?
And it turned out that you actually had to answer three questions.
How many executors should you start?
How many cores and how much memory should each executor have?
That's quite a few decisions to make to start a job.
It happens that Spark allows you to skip one of
these decisions with a mechanism called dynamic allocation.
So how does this work?
For simplicity, suppose you have a cluster which can run 12 executors of a fixed size.
At the beginning of the day,
your colleague starts a Jupyter Notebook which
allocates nine executors to compute an expensive analytical query.
But then he remembers that he has forgotten to grab
his morning coffee and goes to the kitchen to pour some.
His Jupyter session is still there on the cluster, holding nine executors which do nothing.
Then you come to the workplace and want to start a simple job with just four executors,
but you fail because only three executors are available,
even though the cluster is doing nothing.
Dynamic allocation handles this situation by deallocating
idle executors and starting new executors if your job needs them.
So, if dynamic allocation is turned on,
Spark will kill several of your colleague's idle containers,
and your job will start successfully.
So, why does Spark need dynamic allocation of resources?
Classical MapReduce frameworks use short-lived containers for their tasks,
but Spark's architecture uses long-running containers to speed up computations.
And this may not be the optimal approach in shared and noisy clusters.
So, how do you configure a dynamic allocation in Spark?
The main option is spark.dynamicAllocation.enabled,
which is false by default.
Then you should also specify the executor idle timeout,
spark.dynamicAllocation.executorIdleTimeout:
how long an executor must stay idle before Spark removes it.
It is 60 seconds by default.
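For instance, if you build your Spark session in PySpark, the configuration might look like this. This is just a minimal sketch; the app name and the timeout value are illustrative, not recommendations:

```python
from pyspark.sql import SparkSession

# Minimal sketch: enable dynamic allocation and set the idle timeout.
# "60s" matches the default and is shown only for illustration.
spark = (
    SparkSession.builder
    .appName("dynamic-allocation-demo")  # hypothetical app name
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
    .getOrCreate()
)
```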
There is also another important option which is forgotten all the time.
You now know that you can speed up your job by caching data.
And if an executor has cached data,
Spark can't remove it,
because spark.dynamicAllocation.cachedExecutorIdleTimeout is set to infinity by default.
So you really have to tune this option for dynamic allocation to work well.
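In the same sketch, you could give this timeout a finite value; "10min" here is purely illustrative:

```python
from pyspark.sql import SparkSession

# By default cachedExecutorIdleTimeout is infinity, so executors holding
# cached blocks are never reclaimed. A finite value (illustrative here)
# lets dynamic allocation remove them after that much idle time.
spark = (
    SparkSession.builder
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.cachedExecutorIdleTimeout", "10min")
    .getOrCreate()
)
```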
Two other options are minExecutors and maxExecutors.
The minimum number of executors allocated to your job is zero by default.
This means that you may end up in a situation where your app
is running only the driver and no executors at all.
This behavior may not be desirable.
The maximum number of executors is unbounded by default.
So you may or may not want to tune these options.
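As a sketch, bounding both ends might look like this; the numbers 2 and 20 are arbitrary examples:

```python
from pyspark.sql import SparkSession

# Illustrative bounds: keep at least 2 executors so the app is never
# driver-only, and cap at 20 so one job cannot occupy the whole cluster.
spark = (
    SparkSession.builder
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "2")
    .config("spark.dynamicAllocation.maxExecutors", "20")
    .getOrCreate()
)
```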
Another important thing to keep in mind with dynamic allocation
is that executors have a piece of memory reserved for execution.
We talked about this in one of the previous videos.
This means that the results of shuffles,
aggregations, and joins are stored in this memory.
And if Spark removes an executor,
your job will lose this data.
To prevent this, when your cluster runs on YARN,
you have to enable the external shuffle service in Spark,
and also point YARN to this external shuffle service.
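Here is a sketch of both sides of that setup: the Spark options go into your session, while the YARN properties live in yarn-site.xml on every node (shown below as comments):

```python
from pyspark.sql import SparkSession

# Sketch: pair dynamic allocation with the external shuffle service
# so shuffle files survive executor removal.
spark = (
    SparkSession.builder
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.shuffle.service.enabled", "true")
    .getOrCreate()
)

# On the YARN side, each NodeManager must run the shuffle service.
# In yarn-site.xml (cluster-wide configuration, not per-job):
#   yarn.nodemanager.aux-services = spark_shuffle
#   yarn.nodemanager.aux-services.spark_shuffle.class =
#       org.apache.spark.network.yarn.YarnShuffleService
```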
So, what are the use cases for dynamic allocation?
The first one is long-running ETL jobs.
These jobs start on a regular basis and process batches of new raw data.
Since data generation is a probabilistic process,
each run of the job deals with a different volume of data.
The second one, and the most useful for our course,
is interactive jobs like Jupyter Notebooks.
Due to the nature of analytical work with data,
it is best to use dynamic allocation here.
And the last one is jobs with large shuffles.
So let's sum up.
Spark's architecture assumes using long-running containers for speed.
This may cause underutilization of cluster resources,
and that's why dynamic allocation can help.
Dynamic allocation starts new executors when they are
needed and removes executors when they are idle.
But dynamic allocation requires fine-tuning of your job,
and you need to pay attention to several configuration options.
Finally, due to the dynamic nature of this strategy of resource allocation,
temporary shuffle results should be stored in an external service.
More information on dynamic allocation is available in these talks.