Hello. My name is Pavel and I am glad to welcome you to the Big Data for Data Engineers specialization. You are now starting the second course of this specialization, Big Data Analysis. The second week of the course, which I will teach, is dedicated to data analysis using Spark SQL. By the end of the week you will be able to read, write, and process structured data in different ways with Spark SQL.

This week contains three lessons. In the first lesson, we will discuss the general properties of Spark SQL: in particular, why it's needed, how it works, and what capabilities it has. Lesson number two will be devoted to the basic methods of data processing. There you will learn how to extract information from your data, how to build aggregates, how to join tables, and how to process them with different functions. In the third lesson, you will figure out how to handle missing values, work with dates and times, define your own functions, use window functions, and do many other interesting things. I would like to call this introductory video "the coolest way of big data processing", which is quite modest and corresponds to the facts.

To start with, I would like to clarify why structured data processing is so important. The fact is that the biggest part of the information in companies is stored in a structured way. For example, banks processing payment transactions have all the necessary information: the sender ID, the receiver ID, the date, and the amount of money. Web companies collect and process the access logs of their sites. Each time somebody opens a web page of Google or Facebook, the company stores the visitor's ID, the time of the visit, and the visited page in its big data storage. Of course, they also keep unstructured information. For example, social networks store the text and images from the web pages that have been visited, and banks store the customers' call history. However, the most valuable part of the information is structured.

And there is a special language for structured data processing: SQL, which stands for Structured Query Language. Hive allows you to process big data in the common SQL language. This is the main reason why Hive has become one of the most popular data processing tools. All the previous tools like MapReduce operated on raw big data, and it was a real pain in the neck, not to say worse, to join different data sets or write complex pipelines. Hive simplified data processing on top of MapReduce. Hive users don't need to parse data on disk, as it is already parsed and its structure is saved in a separate database. They also don't have to write joins in MapReduce each time they need to merge data sets: Hive can use the SQL JOIN clause instead. Nevertheless, the MapReduce framework is quite slow: it stores intermediate and final results on disk between different MapReduce steps. If you execute Hive queries on Spark instead, Spark works ten times faster for general data processing, and 100 times faster if it can cache intermediate results in memory.

Structured data processing on Spark has several advantages over transforming data with your own hand-written algorithms. Firstly, you don't have to parse the data every time you need to work with it, and the data structure simplifies your code and makes it more readable. Secondly, your code is optimized before execution: for example, the optimizer can choose between a map-side join and a reduce-side join for you and make your code work in a more efficient way. Thirdly, you can write your Spark processing in plain SQL and make your life easier by not worrying about RDD syntax, as the sketch below shows. And finally, when your SQL query is parsed and executed, all the data processing steps will be done in Java without any overhead of Python code execution, which also makes SQL on Spark much faster than the RDD API.
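To make this concrete, here is a minimal sketch of running plain SQL on Spark through the Python API. The payments table, its columns, and the values are made up for illustration, echoing the bank example above; the SparkSession calls are the standard Spark 2.0 API.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-intro").getOrCreate()

# A tiny structured data set, like the bank example:
# sender ID, receiver ID, date, and amount of money.
payments = spark.createDataFrame(
    [(1, 2, "2017-03-01", 100.0),
     (2, 1, "2017-03-02", 250.0),
     (1, 3, "2017-03-02", 40.0)],
    ["sender_id", "receiver_id", "date", "amount"])

# Register the data as a temporary view so it can be queried with SQL.
payments.createOrReplaceTempView("payments")

# Plain SQL instead of hand-written MapReduce or RDD code;
# the optimizer plans the aggregation for you.
spark.sql("""
    SELECT sender_id, SUM(amount) AS total_sent
    FROM payments
    GROUP BY sender_id
""").show()
```

The same query could also be written with DataFrame methods; both forms compile down to the same optimized execution plan.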
One more Spark SQL benefit is that you can read and write your data from other structured data sources, like JSON files or external databases. All the additional steps of data import and export become optional: everything can be done inside your Spark job.

All these things become real for you when you use Spark SQL and the DataFrame framework. In Spark 1.x, the DataFrame API was one of the top-level components of the Spark API that worked on top of Spark RDDs. However, in Spark 2.0, Spark SQL turned into the main API: Spark RDDs are now just an internal implementation of it, and all the top-level libraries are being rewritten to work on DataFrames.

Here is a quick summary of this video. You have learned that Spark SQL is like Hive, but faster; that it is similar to RDDs, but easier and faster, and allows you to integrate with different data sources; and that it is the main API in Spark 2.0. In the next video, you will get acquainted with the main element of Spark SQL, the DataFrame.
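As a small preview of the DataFrame, here is a minimal sketch of reading and writing structured data with Spark SQL in Python. The file path access_log.json is hypothetical; the read and write calls are the standard DataFrameReader and DataFrameWriter API.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-sources").getOrCreate()

# Read a JSON file directly into a DataFrame;
# Spark infers the column names and types from the data.
log = spark.read.json("access_log.json")  # hypothetical path
log.printSchema()

# Write the same data out in another structured format,
# all inside the Spark job, with no separate import or export step.
log.write.mode("overwrite").parquet("access_log.parquet")
```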