There are several ways to create jobs that can run on a Dataproc cluster. Dataproc provides a Hadoop cluster with pre-installed and preconfigured software, including some of the most common software found on a standard cluster, such as Apache Hive, Pig, and Spark. In this module, you'll have the opportunity to experience firsthand how Hive, Pig, and Spark work. You'll get a real sense of their strengths and limitations, and you'll learn key concepts important to using Dataproc. There are whole books written on Hive, Pig, and Spark. The goal here is not to make you a proficient Hive, Pig, or Spark programmer, but to give you basic exposure in case you decide to learn more about them after this course.

Welcome back. You're on track in the Data Engineering on Google Cloud Platform specialization. This is the course Leveraging Unstructured Data, and this is the second of four modules, called Running Dataproc Jobs. Once again, I'm one of your instructors and a curriculum developer at Google; my name is Tom Stern.

In this module, you'll learn to run Hadoop jobs on the Dataproc cluster using several tools and methods. You'll begin with the most common tools for working with structured data, including open source software that provides a SQL-like interface called Hive. Next, you'll use an open source language called Pig that's often used for cleaning up data and for turning semi-structured data into structured data. Finally, you'll learn to work with Spark, a powerful open source data processing pipeline system that can work with unstructured data.

In the previous module, you learned that Dataproc provides a stateless Hadoop cluster in only 90 seconds. You'll recall that Dataproc was designed to overcome Hadoop's IT overhead and operational limitations, and the best practices to accomplish that are: create a cluster specifically for one job; use Cloud Storage instead of HDFS so that you can shut down the cluster when it's not actually running a job; use custom machine types to closely match the CPU and memory requirements of the job; and on non-critical jobs requiring huge clusters, use preemptible VMs to cut cost and speed up results at the same time.

In this module, you'll learn about Hive, Pig, and Spark, the native Hadoop services for working with structured, semi-structured, and unstructured data. And you'll learn about the various ways to submit jobs to the cluster and the differences between submission methods.
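To make those best practices a little more concrete, here is a minimal sketch of creating a job-scoped cluster with the google-cloud-dataproc Python client library. The project, region, cluster name, and machine shapes are placeholders, not values from this course, and the sketch assumes the requirements of the job are already known.

```python
# Minimal sketch: create a short-lived Dataproc cluster sized for one job.
# Assumes the google-cloud-dataproc library is installed and the placeholder
# project/region/cluster values are replaced with real ones.
from google.cloud import dataproc_v1

project_id = "your-project-id"      # placeholder
region = "us-central1"              # placeholder
cluster_name = "one-job-cluster"    # placeholder

client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project_id,
    "cluster_name": cluster_name,
    "config": {
        # Match machine shapes to the job; a custom machine type URI could be
        # substituted here to fit CPU and memory exactly (assumption: the
        # job's requirements are known in advance).
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
        # Secondary workers are preemptible by default, which is what cuts
        # cost on large, non-critical jobs.
        "secondary_worker_config": {"num_instances": 4},
    },
}

operation = client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
print(f"Cluster created: {operation.result().cluster_name}")
```

Because the job's input and output would live in Cloud Storage (gs:// paths) rather than HDFS, a cluster like this can be deleted as soon as the job finishes, which is the point of treating it as stateless.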
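As a preview of the submission methods discussed later in the module, here is one hedged example: submitting a Hive query to an existing cluster with the same Python client library. The project, region, cluster name, and query are placeholders; the Cloud Console, the gcloud command-line tool, and the REST API are other ways to submit the same kind of job.

```python
# Minimal sketch: submit a Hive query to a running Dataproc cluster.
# Placeholder project/region/cluster values; the query itself is trivial.
from google.cloud import dataproc_v1

project_id = "your-project-id"      # placeholder
region = "us-central1"              # placeholder
cluster_name = "one-job-cluster"    # placeholder

job_client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

job = {
    "placement": {"cluster_name": cluster_name},
    # A Pig job would look the same, with "pig_job" in place of "hive_job".
    "hive_job": {"query_list": {"queries": ["SHOW DATABASES;"]}},
}

operation = job_client.submit_job_as_operation(
    request={"project_id": project_id, "region": region, "job": job}
)
result = operation.result()  # blocks until the job completes
print(f"Job finished with state: {result.status.state.name}")
```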
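And for the Spark piece, a very small PySpark sketch of the kind of job you might submit: counting words in unstructured text files read directly from Cloud Storage. The bucket paths are placeholders, and this is only an illustration of the pattern, not one of the exercises in this module.

```python
# Minimal PySpark sketch: word count over unstructured text in Cloud Storage.
# The gs:// paths are placeholders; on Dataproc the Cloud Storage connector
# is preinstalled, so gs:// URIs can be read and written directly.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()

lines = spark.read.text("gs://your-bucket/input/*.txt").rdd.map(lambda row: row[0])

counts = (
    lines.flatMap(lambda line: line.split())   # split lines into words
         .map(lambda word: (word, 1))          # pair each word with a count of 1
         .reduceByKey(lambda a, b: a + b)      # sum the counts per word
)

counts.saveAsTextFile("gs://your-bucket/output/wordcount")
spark.stop()
```

A script like this would be submitted to the cluster as a PySpark job, pointing the job at the script's location in Cloud Storage, and it would run on the cluster's workers.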