Next, we'll discuss how and why you should consider processing the same Hadoop job code in the cloud using Cloud Dataproc on GCP. So what are the advantages of running Hadoop and Spark workloads on Cloud Dataproc?

First, low cost. Cloud Dataproc is priced at $0.01 per virtual CPU per cluster per hour, on top of any other GCP resources you use. In addition, Cloud Dataproc clusters can include preemptible instances, short-lived VMs with lower compute prices, so you use and pay for things only when you need them. Cloud Dataproc charges second-by-second billing, with, of course, a one-minute minimum billing period.

Next, it's super fast. Cloud Dataproc clusters are quick to start, scale, and shut down, with each of these operations taking about 90 seconds or less on average.

Next, clusters are resizable. Clusters can be created and scaled quickly with a variety of virtual machine types, disk sizes, numbers of nodes, and networking options, as you're going to see later.

There's the open source ecosystem. You can use Spark and Hadoop tools, libraries, and documentation with Cloud Dataproc. Cloud Dataproc provides frequent updates to native versions of Spark, Hadoop, Pig, and Hive, so there's no need to learn new tools or APIs, and it's possible to move existing projects or ETL pipelines without redeveloping any code.

It's managed. You can easily interact with clusters and Spark or Hadoop jobs, without the assistance of an administrator or special software, through the GCP Console, the Cloud SDK, or the Cloud Dataproc REST API. When you're done with a cluster, simply turn it off so money isn't spent on idle cluster resources.

Cloud Dataproc supports image versioning, which allows you to switch between different versions of Apache Spark, Apache Hadoop, and other tools.

And it's integrated. It has built-in integration with Cloud Storage, BigQuery, and Cloud Bigtable to ensure data will never be lost. This, together with Stackdriver Logging and Stackdriver Monitoring, provides a complete data platform, not just a Spark or Hadoop cluster. For example, you can use Cloud Dataproc to effortlessly ETL terabytes of raw log data directly into BigQuery for business reporting.

In addition to that, you get a few more benefits. The clusters are highly available: you can run clusters with multiple master nodes and set jobs to restart on failure to ensure your clusters and jobs are highly available. There are developer tools: you have multiple ways to manage your cluster, including the GCP Console, the Cloud SDK, RESTful APIs, and direct SSH access. There are initialization actions: you can run these actions to install or customize the settings and libraries you need when your cluster is first created; we'll look at those in more detail later. And Cloud Dataproc supports automatic or manual configuration of the cluster: it automatically configures the hardware and software on the cluster for you, but if you want to go in and manually update it yourself, you can.

Cloud Dataproc has two ways to customize clusters: optional components and initialization actions. Pre-configured optional components can be selected when deploying via the console or the command line, and they include Anaconda, Jupyter Notebook, Zeppelin Notebook, Presto, and ZooKeeper. Initialization actions let you customize your cluster by specifying executables or scripts that Cloud Dataproc will run on all nodes in your Cloud Dataproc cluster immediately after the cluster is set up.
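To make those two customization options concrete, here's a minimal sketch of a cluster-creation command using the Cloud SDK. The cluster name, region, component list, and script path are placeholders, so verify the exact flag and component names against your version of gcloud:

# Sketch: create a cluster with optional components and an initialization action.
# The gs:// script path is a placeholder; pre-built scripts are available in the
# public initialization-actions repository discussed next.
gcloud dataproc clusters create example-cluster \
    --region=us-central1 \
    --num-masters=1 \
    --num-workers=2 \
    --optional-components=ANACONDA,JUPYTER \
    --initialization-actions=gs://my-bucket/hbase/hbase.sh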
You can define your own initialization scripts or select from a wide range of frequently used ones and other sample initialization scripts that are available. The sketch above gives a sense of how you can create a Dataproc cluster using the Cloud SDK, specifying an HBase shell script to run on the cluster's initialization. There are a lot of pre-built startup scripts that you can leverage for common Hadoop cluster setup tasks, like Flink, Jupyter, and more. You can check out the GitHub repo link to learn more. Do you see the additional parameters for the number of master and worker nodes in the script?

Let's talk more about the architecture of the cluster. The standard setup architecture is much like you would expect on-premises: you have a cluster of virtual machines for processing and persistent disks for storage via HDFS. As you can see here, you've got your master node VMs and a set of worker nodes. Worker nodes can also be part of a managed instance group, which is just another way of ensuring that the VMs within that group are all built from the same template. The advantage, as you'll soon see in your lab, is that you can spin up more VMs than you need and automatically resize your cluster based on demand. It also only takes a few minutes to upgrade or downgrade your cluster.

Generally, though, you shouldn't think of a Cloud Dataproc cluster as long-lived. Instead, you should spin clusters up when you need compute processing for a job and then simply turn them down. Of course, you could persist them indefinitely if you wanted to. So what happens to HDFS storage on disk when you turn those clusters down? Well, the storage goes away too, which is why it's a best practice to use storage that's off-cluster by connecting to other GCP products. Here we've extended the diagram to show what that could look like. Instead of using native HDFS on the cluster, you could simply use a Cloud Storage bucket via the Cloud Storage connector. It's pretty easy to adapt existing Hadoop code to use Cloud Storage instead of HDFS; it's just a matter of changing the prefix for the storage path from hdfs:// to gs://. What about HBase off-cluster? Well, consider writing to Cloud Bigtable instead. What about large analytical workloads? Consider reading that data into BigQuery and doing the analytical work there.

Using Cloud Dataproc involves this sequence of events: setup, configuration, optimization, utilization, and monitoring. Setup means creating a cluster, and you can do that through the console or from the command line using the gcloud command. You can also export a YAML file from an existing cluster or create a cluster from a YAML file. You can create a cluster from a Deployment Manager template as well, or, if you wanted to, you can use the REST API.

For configuration, the cluster can be set up as a single VM, which is usually done to keep costs down for development and experimentation. Standard is with a single master node, and high availability has three master nodes. You can choose between a region and a zone, or select the global region and allow the service to choose the zone for you. The cluster defaults to a global endpoint, but defining a regional endpoint may offer increased isolation and, in certain cases, lower latency. The master node is where the HDFS NameNode runs, as well as the YARN resource manager and job drivers. HDFS replication defaults to two in Cloud Dataproc.
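As a rough sketch of that setup and configuration step, here are a few command-line variations. The cluster names, region, and file name are placeholders, and you should confirm the flags against the gcloud reference for your SDK version:

# Create a single-node cluster to keep costs down for development and experimentation.
gcloud dataproc clusters create dev-cluster --single-node --region=us-central1

# Create a high-availability cluster with three master nodes.
gcloud dataproc clusters create ha-cluster --num-masters=3 --num-workers=2 --region=us-central1

# Export an existing cluster's configuration to YAML, then recreate it from that file later.
gcloud dataproc clusters export example-cluster --region=us-central1 --destination=example-cluster.yaml
gcloud dataproc clusters import example-cluster-copy --region=us-central1 --source=example-cluster.yaml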
Optional components from the Hadoop ecosystem include Anaconda (a Python distribution and package manager), WebHCat, Jupyter Notebook, and Zeppelin Notebook as well. Cluster properties are runtime values that can be used by configuration files for more dynamic startup options, and user labels can be used to tag your cluster for your own solutions or reporting purposes. The master node, worker nodes, and preemptible worker nodes, if enabled, have separate VM options such as vCPU, memory, and storage. Preemptible nodes include a YARN NodeManager, but they don't run HDFS. There is a minimum number of worker nodes; the default is two. The maximum number of worker nodes is determined by a quota and the number of SSDs attached to each worker. You can also specify initialization actions, such as the initialization script we saw earlier, that can further customize your worker nodes on startup. And metadata can be defined so the VMs can share state information with each other.

This may be the first time you've seen preemptible nodes as an option for your cluster. When and why would you use them if they can be preempted by a different job? Well, the main reason to use preemptible VMs, or PVMs, is to lower costs for fault-tolerant workloads. PVMs can be pulled from service at any time within 24 hours. But if your workload and your cluster architecture are a healthy mix of VMs and PVMs, you may be able to withstand the interruptions and get a great discount on the cost of running your job.

Custom machine types allow you to specify the balance of memory and CPU to tune the VM to the load, so you're not wasting resources. A custom image can be used to pre-install software, so it takes less time for the customized node to become operational than if you install the software at boot time using an initialization script. You can also use a persistent SSD boot disk for faster cluster startup.

So how do you submit a job to Cloud Dataproc for processing? Jobs can be submitted through the console, the gcloud command, or via the REST API. They can also be started by orchestration services such as Cloud Dataproc workflow templates and Cloud Composer. Don't use Hadoop's direct interfaces to submit jobs, because the metadata will not be available to Cloud Dataproc for job and cluster management, and, for security, they're disabled by default. By default, jobs are not restartable. However, you can create restartable jobs through the command line or REST API. Restartable jobs must be designed to be idempotent and to detect successorship and restore state.

Lastly, after you submit your job you'll want to monitor it. You can do so using Stackdriver, or you can build a custom dashboard with graphs and set up monitoring alert policies to send emails, for example, so you can be notified if incidents happen. Any details from HDFS or YARN, metrics about a particular job, or overall metrics for the cluster, like CPU utilization, disk, and network usage, can all be monitored and alerted on with Stackdriver.
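As an illustration, here's a sketch of submitting jobs with the gcloud command. The cluster name, region, and bucket path are placeholders; the SparkPi class and jar path are the standard Spark example that ships on the cluster image, and --max-failures-per-hour is the option that makes the second job restartable:

# Submit the bundled SparkPi example to an existing cluster.
gcloud dataproc jobs submit spark \
    --cluster=example-cluster \
    --region=us-central1 \
    --class=org.apache.spark.examples.SparkPi \
    --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
    -- 1000

# Submit a PySpark job that is allowed to restart on failure (a restartable job).
gcloud dataproc jobs submit pyspark gs://my-bucket/my_job.py \
    --cluster=example-cluster \
    --region=us-central1 \
    --max-failures-per-hour=5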