Azure Databricks is a fully managed, cloud-based big data and machine learning platform that empowers developers to accelerate AI by simplifying the process of building enterprise-grade production data applications. Built as a joint effort by Databricks and Microsoft, Azure Databricks provides data science and engineering teams with a single platform for big data processing and machine learning. By combining the power of Databricks, an end-to-end managed Apache Spark platform optimized for the cloud, with the enterprise scale and security of the Microsoft Azure platform, Azure Databricks makes it simple to run large-scale Spark workloads.

To provide the best platform for data engineers, data scientists, and business users, Azure Databricks is natively integrated with Microsoft Azure as a first-party Microsoft service. The Azure Databricks collaborative workspace enables teams to work together by using features such as user management, Git source code repository integration, and user workspace folders. Microsoft is working to integrate Azure Databricks closely with the Azure platform, and many key features are already complete, including the integrations that we will now look at.

Many existing VM types can be used for clusters, including F-series for machine learning scenarios, M-series for massive memory scenarios, and D-series for general purpose. In the areas of security and privacy, ownership and control of data always remain with the customer, and Microsoft aims for Azure Databricks to adhere to all of the compliance certifications that the rest of Azure provides. To ensure flexibility in network topology, Azure Databricks supports deployments into virtual networks (VNets), which can control which sources and sinks can be accessed and how they can be accessed. Building on the orchestration abilities of Azure, ETL and ELT workflows, including analytics workloads in Azure Databricks, can be operationalized using Azure Data Factory pipelines.
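To make the Data Factory point concrete, here is a minimal sketch of a pipeline definition that runs a Databricks notebook as one activity, written as a Python dict mirroring the JSON shape Azure Data Factory uses. The pipeline name, notebook path, parameters, and linked service name below are all illustrative assumptions, not values from this course.

```python
# Sketch of an Azure Data Factory pipeline that operationalizes a
# Databricks notebook as one ETL step. Expressed as a Python dict that
# mirrors ADF's JSON pipeline format; every name and path here is a
# hypothetical placeholder.
etl_pipeline = {
    "name": "DailyETLPipeline",  # hypothetical pipeline name
    "properties": {
        "activities": [
            {
                "name": "RunTransformNotebook",
                # ADF activity type for running a Databricks notebook
                "type": "DatabricksNotebook",
                "typeProperties": {
                    # Path of the notebook inside the workspace (illustrative)
                    "notebookPath": "/ETL/transform_sales",
                    # Parameters passed to the notebook at run time (illustrative)
                    "baseParameters": {"run_date": "2024-01-01"},
                },
                # Reference to an Azure Databricks linked service defined in ADF
                "linkedServiceName": {
                    "referenceName": "AzureDatabricksLinkedService",
                    "type": "LinkedServiceReference",
                },
            }
        ]
    },
}
```

In a real deployment you would author this definition in the ADF authoring UI or deploy it via ARM templates, and ADF would handle scheduling and retries around the notebook run.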
Power BI can be connected directly to Databricks clusters using JDBC in order to query data interactively at massive scale using familiar tools. Azure Databricks workspaces are deployed into customer subscriptions, and naturally, Azure Active Directory can be used to control access to sources, results, and jobs. Azure Storage and Azure Data Lake Store are exposed to Databricks users through the Databricks File System (DBFS) to provide caching and optimized analysis over existing data. Azure Databricks easily and efficiently uploads results into Azure Synapse Analytics, Azure SQL Database, and Azure Cosmos DB for further analysis and real-time serving, making it simple to build end-to-end data architectures on Azure. Integration with IoT Hub, Azure Event Hubs, and Azure HDInsight Kafka clusters enables developers to build scalable streaming solutions for real-time analytics.

For developers, this design provides three things. First, it enables easy connection to any storage resources in their account, such as an existing Blob Storage account or Data Lake Store. Second, they can take advantage of deep integrations with other Azure services to quickly build data applications. Third, Databricks is managed centrally from the Azure control center, requiring no additional setup, which allows developers to focus on core business value, not infrastructure management.

When you create an Azure Databricks service, a Databricks appliance is deployed as an Azure resource in your subscription. At the time of cluster creation, you specify the types and sizes of the virtual machines (VMs) to use for both the driver and worker nodes, but Azure Databricks manages all other aspects of the cluster. The Databricks appliance is deployed into Azure as a managed resource group within your subscription. This resource group contains the driver and worker VMs, along with other required resources, including a virtual network, a security group, and a storage account.
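The JDBC connection mentioned above is configured from details on the cluster's JDBC/ODBC settings page. The helper below is a sketch that only assembles a URL of the general shape the legacy Spark JDBC driver uses; the exact option string can vary between driver versions, and the host and HTTP path shown are illustrative placeholders, so treat this as an assumption rather than a definitive connection string.

```python
def databricks_jdbc_url(workspace_host: str, http_path: str) -> str:
    """Assemble a JDBC URL of the general form tools like Power BI use
    to reach a Databricks cluster over HTTPS (port 443).

    workspace_host and http_path are placeholders you would copy from
    the cluster's JDBC/ODBC settings page; option names may differ by
    driver version, so this is a sketch, not a guaranteed format.
    """
    return (
        f"jdbc:spark://{workspace_host}:443/default;"
        f"transportMode=http;ssl=1;httpPath={http_path}"
    )

# Illustrative values only -- copy real ones from your cluster settings.
url = databricks_jdbc_url(
    "adb-1234567890123456.7.azuredatabricks.net",
    "sql/protocolv1/o/1234567890123456/0123-456789-abc123",
)
```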
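The cluster-creation step described above, where you pick VM types and sizes for the driver and workers while Azure Databricks manages everything else, can also be done programmatically. Below is a sketch of the payload shape used by the Databricks Clusters REST API (`POST /api/2.0/clusters/create`), written as a Python dict; the cluster name, runtime version, and node types are illustrative assumptions.

```python
# Sketch of a Clusters API create payload. You choose the driver and
# worker VM sizes (node types); Azure Databricks manages all other
# aspects of the cluster. Values below are illustrative.
cluster_spec = {
    "cluster_name": "analytics-cluster",       # hypothetical name
    "spark_version": "13.3.x-scala2.12",       # illustrative runtime version
    "node_type_id": "Standard_D3_v2",          # D-series: general-purpose workers
    "driver_node_type_id": "Standard_D3_v2",   # driver VM size, set separately
    "num_workers": 4,                          # fixed-size cluster of 4 workers
}
```

Swapping `node_type_id` for an F-series or M-series size maps to the machine learning and large-memory scenarios mentioned earlier.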
All metadata for your cluster, such as scheduled jobs, is stored in an Azure database with geo-replication for fault tolerance. Internally, Azure Kubernetes Service is used to run the Azure Databricks control plane and data planes through containers running on the latest generation of Azure hardware, with Dv3 virtual machines using NVMe SSDs capable of 100-microsecond I/O latency. This ensures the best Databricks I/O performance. In addition, accelerated networking provides the fastest virtualized network infrastructure in the cloud, and Azure Databricks utilizes it to further improve Spark performance.

The control plane hosts Databricks jobs, notebooks with query results, the cluster manager, the web application, the Hive metastore, security access control lists, and user sessions. These components are managed by Microsoft in collaboration with Databricks and do not reside within your Azure subscription. The data plane contains all the Databricks runtime clusters hosted within the workspace. All data processing and storage exists within the client subscription; this means no data processing ever takes place within the Microsoft-managed Databricks subscription.

Looking at the platform in a bit more detail, you can see what is being exchanged between the Azure Databricks platform components. Since the web app and cluster manager are part of the control plane, any commands executed in a notebook are sent from the cluster manager to the customer's clusters in the data plane. As previously mentioned, this is because data processing only occurs within the customer's own subscription. Any table metadata and logs are exchanged between these two high-level components. Customer data sources within the client subscription exchange data with the data plane through read and write activities. Let's take a closer look at a standard Azure Databricks deployment, which shows the boundaries between the control plane and data plane, along with the Azure components deployed to each.
The deployment is made up of the customer subscription and the Microsoft subscription. The control plane exists within the Microsoft subscription. The customer subscription contains the data plane and data sources, as well as the Microsoft-managed Azure Databricks workspace virtual network (VNet). Information exchanged between this VNet and the Microsoft-managed Azure Databricks control plane VNet is sent over a secure TLS connection through ports 22 and 5557. These ports are enabled by network security groups (NSGs) and protected with port IP filtering. The Blob Storage account provides default file storage within the workspace and the Databricks File System. This resource and all other Microsoft-managed resources are completely locked from changes made by the customer, while all other resources within the customer subscription are customer managed and can be added or modified per your Azure subscription permissions. Connectivity between these resources and the Databricks clusters that reside within the data plane is also secured via TLS.

Let's recap a few key points and best practices from the Azure Databricks standard deployment. You can write to the default DBFS file storage as needed, but you cannot change the Blob Storage account settings, since the account is managed by the Microsoft-managed control plane. As a best practice, only use the default storage for temporary files; mount additional storage accounts (Blob Storage or Azure Data Lake Storage Gen2) that you create in your Azure subscription for long-term file storage. The default file storage is tied to the lifecycle of your Azure Databricks account: if you delete the Azure Databricks account, the default storage gets deleted with it. And finally, if you need advanced network connectivity, such as custom VNet peering and VNet injection, you can deploy Azure Databricks data plane resources within your own VNet.
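The best practice above, mounting your own storage accounts for long-term data, is typically done inside a notebook with `dbutils.fs.mount`. Since `dbutils` only exists on a running cluster, the helper below is a sketch that just assembles the source URI and the OAuth configuration for an Azure Data Lake Storage Gen2 mount; the storage account, container, and service principal identifiers are all placeholders, and on a real cluster you would pass the results to `dbutils.fs.mount`.

```python
def adls_gen2_mount_args(storage_account: str, container: str,
                         client_id: str, client_secret: str,
                         tenant_id: str) -> tuple[str, dict]:
    """Build the source URI and OAuth config for mounting an ADLS Gen2
    container into DBFS. On a cluster you would then call:

        dbutils.fs.mount(source=source,
                         mount_point="/mnt/<name>",
                         extra_configs=configs)

    All identifiers passed in are placeholders for your own values.
    """
    # abfss:// URI addressing a container in an ADLS Gen2 account
    source = f"abfss://{container}@{storage_account}.dfs.core.windows.net/"
    # Hadoop ABFS connector settings for service-principal (OAuth) auth
    configs = {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": client_id,
        "fs.azure.account.oauth2.client.secret": client_secret,
        "fs.azure.account.oauth2.client.endpoint":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }
    return source, configs
```

In practice the client secret would come from a secret scope rather than being passed around as a plain string.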