Maybe you're going to ask the question, when is local HDFS a good option instead of using Google Cloud Storage? Local HDFS is a good option if your job requires a lot of metadata operations, for example, you have thousands of partitions and directories and each file size is relatively small. Or you modify the HDFS data frequently or rename directories often; Cloud Storage objects are immutable, so renaming a directory is an expensive operation, because it consists of copying all objects to a new key and then deleting them afterwards. Or you heavily use the append operations on HDFS files. Or you have workloads that involve very heavy input/output, for example, you have a lot of partitioned writes. Or you have I/O workloads that are especially sensitive to latency, for example, you require single-digit millisecond latency per storage operation.

But in general, we recommend using Cloud Storage as the initial and final source of data in your big data pipeline. For example, if a workflow contains five Spark jobs in a series, the first job retrieves the initial data from Cloud Storage and then writes shuffled data and intermediate job output to HDFS. The final Spark job then writes its results to Cloud Storage. Using Cloud Dataproc with Cloud Storage allows you to reduce the disk requirements and save costs by putting your data there instead of persisting it on HDFS. When you keep your data on Cloud Storage and don't store it locally in your cluster's HDFS, you can use smaller disks for your cluster. And by making your cluster truly on-demand, you're able to, as we talked about before, separate the compute and the storage, which helps you reduce costs significantly.

Even if you store all of your data in Cloud Storage, your Cloud Dataproc cluster does need HDFS for certain operations, such as storing control and recovery files or aggregating logs. It also needs non-HDFS local disk space for shuffling. You can reduce the disk size per worker if you're not heavily using local HDFS.

Here are some options to adjust the size of your local HDFS for you to consider. You can decrease the total size of your local HDFS by decreasing the size of the primary persistent disks for the master and worker nodes. The primary persistent disk also contains the boot volume and system libraries, so be sure to allocate at least a hundred gigabytes. You can increase the total size of local HDFS by increasing the size of the primary persistent disks for workers. Consider this option very carefully; it's rare to have workloads that get better performance by using HDFS with standard persistent disks in comparison with using Cloud Storage, or local HDFS on SSDs. You can attach up to eight SSDs of 375 gigabytes each to each worker, and you can use these disks for HDFS. This is a very good option if you need HDFS for very input/output-intensive workloads and you need that single-digit millisecond latency. Make sure that you use a machine type that has enough CPU and memory on the worker to support this many disks. Or you can use SSD persistent disks, or pd-ssd, currently in beta, for your master or your workers as the primary disk.
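To make the recommended pattern concrete, here is a minimal PySpark sketch of a workflow that reads its initial data from Cloud Storage, keeps intermediate output on the cluster's local HDFS, and writes final results back to Cloud Storage. The bucket and path names are placeholders, and the gs:// scheme assumes the Cloud Storage connector that Dataproc clusters include by default.

```python
# Minimal sketch of the recommended pattern: Cloud Storage as the initial
# source and final sink, local HDFS only for intermediate output.
# Bucket and path names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gcs-hdfs-pattern").getOrCreate()

# First job in the series: read the initial data from Cloud Storage.
events = spark.read.parquet("gs://example-bucket/raw/events/")

# Intermediate result: keep it on the cluster's local HDFS, since it only
# needs to live as long as the workflow does.
daily = events.groupBy("event_date").count()
daily.write.mode("overwrite").parquet("hdfs:///intermediate/daily_counts/")

# Final job: read the intermediate data back and write the results to
# Cloud Storage so they outlive the cluster.
final = spark.read.parquet("hdfs:///intermediate/daily_counts/")
final.write.mode("overwrite").parquet("gs://example-bucket/output/daily_counts/")
```

Because nothing the workflow needs long term lives on HDFS, the cluster's local disks can stay small, which is exactly the cost lever described above.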
You should understand the repercussions of geography and regions before you configure your data and jobs. Many GCP services require you to specify the regions or zones where resources are allocated. The latency of requests can increase when the requests are made from a different region than the one where the resources are actually stored. Additionally, if the service's resources and your persistent data are located in different regions, some calls to GCP services might copy all the required data from one zone to another before processing, and this can have a severe impact on your performance.

Cloud Storage is the primary way to store unstructured data in GCP, but it isn't the only storage option. Some of your data might be better suited to different storage products designed explicitly for big data. You can use Cloud Bigtable to store large amounts of sparse data; Cloud Bigtable offers an HBase-compliant API with low latency and high scalability to adapt to your jobs. For data warehousing and analytical workloads, you can consider using BigQuery.

Because Cloud Dataproc runs Hadoop on GCP, using a persistent Cloud Dataproc cluster to replicate your on-premises setup might seem like the easiest solution. However, there are some limitations to this lift-and-shift approach. Keeping your data in persistent HDFS clusters using Cloud Dataproc is more expensive than storing your data in Cloud Storage, which is generally what we recommend. Keeping your data in an HDFS cluster also limits your ability to use the data with other GCP products, because it's isolated on your cluster. Augmenting or replacing some of your open source based tools with other related GCP services can be more efficient or economical for particular use cases. And using a single persistent Cloud Dataproc cluster for your jobs is more difficult to manage than shifting to targeted clusters that serve individual jobs or job areas.

The most cost-effective and flexible way to migrate your Hadoop system to GCP is to shift away from thinking in terms of large, multi-purpose, persistent clusters, and instead think of small, short-lived clusters that are designed exclusively to run specific jobs. You store your data in Cloud Storage so it persists across multiple temporary processing clusters. This model is often called the ephemeral model, because the clusters used for processing jobs are allocated as needed, and then released and torn down as the jobs finish. You get efficient utilization and don't pay for resources that you don't use.

You can automatically set a timer so the cluster is deleted a fixed amount of time after it enters an idle state. You can give it a timestamp, and the countdown starts immediately once the expiration has been set. Or you can set a duration, the time in seconds to wait before automatically turning down the cluster. The value can range from ten minutes as a minimum to 14 days as a maximum, at a granularity of one second. This is currently available from the command line and the REST API, but not through the console.

The biggest shift in your approach between running an on-premises Hadoop workflow and running the same workflow on GCP is that shift away from monolithic, persistent clusters to ephemeral clusters. You just spin up a cluster when you need to run a job and then delete that cluster when the job completes. The resources required by your jobs are active only when they're being used, so you only pay for what you use. This approach enables you to tailor your cluster configurations for individual jobs. And because you aren't maintaining and configuring a persistent cluster, you reduce the cost of resource use and cluster administration.

This section describes how to move your existing Hadoop infrastructure to an ephemeral model. To get the most from Cloud Dataproc, you have to move to that ephemeral model, only using clusters when you need them.
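The scheduled-deletion options mentioned above are what make the ephemeral model practical to automate. Here is a minimal sketch, assuming the google-cloud-dataproc Python client, of creating a cluster that deletes itself after sitting idle; the project, region, cluster name, and machine types are placeholders.

```python
# A minimal sketch using the google-cloud-dataproc Python client.
# Project, region, cluster name, and machine types are placeholders.
from google.cloud import dataproc_v1 as dataproc
from google.protobuf import duration_pb2

project_id = "example-project"
region = "us-central1"
cluster_name = "ephemeral-demo"

cluster_client = dataproc.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project_id,
    "cluster_name": cluster_name,
    "config": {
        "master_config": {
            "num_instances": 1,
            "machine_type_uri": "n1-standard-4",
            # Primary disk holds the boot volume and system libraries,
            # so keep it at 100 GB or more.
            "disk_config": {"boot_disk_size_gb": 100},
        },
        "worker_config": {
            "num_instances": 2,
            "machine_type_uri": "n1-standard-4",
            "disk_config": {"boot_disk_size_gb": 100},
        },
        # Scheduled deletion: delete the cluster after 30 minutes of
        # idleness. The allowed range is 10 minutes to 14 days, at
        # one-second granularity.
        "lifecycle_config": {
            "idle_delete_ttl": duration_pb2.Duration(seconds=1800),
        },
    },
}

operation = cluster_client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
print(f"Created cluster: {operation.result().cluster_name}")
```

The same idle timeout can also be set directly from the command line or the REST API, as noted above.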
This can be scary, because a persistent cluster is comfortable. But with GCS data persistence and the fast boot of Cloud Dataproc, a persistent cluster is a waste of resources. If you do need a persistent cluster, make it small; your clusters can be resized at any time. The ephemeral model is the recommended route, but it requires that storage be decoupled from compute. You'll want to separate job shapes and separate your clusters, and you can decompose even further with job-scoped clusters.

How does this work? You want to isolate your dev, your staging, and your production environments, so run them all on separate clusters. They can read data from the same underlying data source in GCS, no problem, but they run on separate clusters that are, again, ephemeral. And of course, you can add appropriate ACLs to your service accounts to protect your data.

The point of ephemeral clusters is to use them only for the job's lifetime. When it's time to run a job, follow this process: create a properly configured cluster; run your job, sending the output to Cloud Storage or another persistent location; delete your cluster; use the job output however you need to; and lastly, view the logs in Stackdriver or Cloud Storage.

If you can't accomplish your work without a persistent cluster, sure, you can create one. This option may be costly, and it's not recommended if there's a way to get your job done on an ephemeral cluster. You can minimize the cost of a persistent or long-lived cluster by creating the smallest cluster you can, scoping your work on that cluster to the smallest number of jobs, and scaling the cluster to the minimum number of worker nodes, adding more dynamically to meet demand.
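Going back to the ephemeral process described above (create the cluster, run the job with output in Cloud Storage, then delete the cluster), here is a rough sketch of the run-and-delete steps, again assuming the google-cloud-dataproc Python client and the placeholder cluster created in the earlier sketch; the PySpark file URI is hypothetical.

```python
# A rough sketch of the ephemeral flow: submit a job to the cluster, then
# delete the cluster once the job has finished. Assumes the placeholder
# cluster from the earlier sketch; the job file URI is hypothetical.
from google.cloud import dataproc_v1 as dataproc

project_id = "example-project"
region = "us-central1"
cluster_name = "ephemeral-demo"

endpoint = {"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
job_client = dataproc.JobControllerClient(client_options=endpoint)
cluster_client = dataproc.ClusterControllerClient(client_options=endpoint)

# Run the job, with its results going to Cloud Storage (as in the PySpark
# sketch earlier), not to anything that lives only on the cluster.
job = {
    "placement": {"cluster_name": cluster_name},
    "pyspark_job": {
        "main_python_file_uri": "gs://example-bucket/jobs/daily_counts.py"
    },
}
operation = job_client.submit_job_as_operation(
    request={"project_id": project_id, "region": region, "job": job}
)
finished_job = operation.result()

# Driver output is staged in Cloud Storage, so it survives cluster deletion.
print(f"Driver output: {finished_job.driver_output_resource_uri}")

# The job is done and its results are in Cloud Storage, so delete the cluster.
cluster_client.delete_cluster(
    request={
        "project_id": project_id,
        "region": region,
        "cluster_name": cluster_name,
    }
).result()
```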