How does data science happen? The five P's of data science. Now that we have identified what data science is and how companies can strategize around big data to build a purpose, let's come back to using data science to get value out of big data around the purpose or questions they defined. Our experience with building and observing successful data science projects led to a method around the craft, with five distinct components of data science. Here we define data science as a multi-disciplinary craft that combines people teaming up around an application-specific purpose that can be achieved through a process, big data computing platforms, and programmability. All of these should lead to products, where the focus really is on the questions or purpose defined by your big data strategy ideas. There are many activities around these questions related to technology, data, and analytical research and development. But in the end, everything we do in this phase is to reach that final product based on our purpose. So, it makes sense to start with the product and build a process around how we make it happen. Remember the wildfire prediction project I described? One of the products we described there was the rate of spread and direction of an ongoing fire. We identified the questions and the process that led us to the product that solves them. We brought together experts in fire modeling, data management, time series analysis, scalable computing, Geographical Information Systems, and emergency response. I asked them: let's not dive into the techniques yet. What is the problem at large? How do we see ourselves solving it? A typical conversation around the process starts with this question. From then on, as we drill down into many areas of expertise, we often blur the lines between the steps.
My wildfire team would start listing things like: we don't have an integrated system, or we don't have real-time programmatic access to data, so we can't analyze fires on the fly. Or they might say: I can't integrate sensor data with satellite data. All of this leads to challenges I can then use to define problems. There are many dimensions of data science to think about within this discussion. Let's start with the obvious ones: people and purpose. People refers to the data science team or the project's stakeholders. As you know by now, they're experts in data and analytics, business, computing, science, or big data management, like the set of experts I listed in my wildfire scenario. The purpose refers to the challenge or set of challenges defined by your big data strategy, like answering the question about the rate of spread and direction of the fire perimeter in the wildfire case. Since there's a predefined team with a purpose, a great place for this team to start is a process they can iterate on. We can simply say that people with purpose will define a process to collaborate and communicate around. The process is conceptual in the beginning and defines the set of steps and how everyone can contribute to them. There are many ways to look at the process. One way is as two distinct activities: big data engineering and big data analytics, or computational big data science, as I like to call it, since more than simple analytics is being performed here. A more detailed view reveals five distinct steps or activities of this data science process, namely acquire, prepare, analyze, report, and act. We can simply say that data science happens at the boundaries of these steps. Ideally, this process should support experimental work and dynamic scalability on big data and computing platforms.
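To make the five sequential activities concrete, here is a minimal sketch of the process as an end-to-end pipeline in Python. This is an illustration only, not material from the course: all of the step functions, the toy data, and the summary statistics are my own placeholder assumptions.

```python
# Illustrative sketch of the five-step data science process:
# acquire -> prepare -> analyze -> report -> act.
# Every function and value here is a hypothetical placeholder.

def acquire():
    # Pull raw data from sources (files, sensors, APIs).
    return [3, 1, 4, 1, 5, 9, 2, 6]

def prepare(raw):
    # Clean and reshape the raw data, e.g. drop missing records and sort.
    return sorted(x for x in raw if x is not None)

def analyze(clean):
    # Apply an analytical technique; here, trivial summary statistics.
    return {"mean": sum(clean) / len(clean), "max": max(clean)}

def report(insights):
    # Communicate results to the team in some shared, readable form.
    return f"mean={insights['mean']:.2f}, max={insights['max']}"

def act(summary):
    # Turn insights into a decision or product; here, just hand it back.
    return summary

def run_process():
    # In practice the team iterates: each pass through these steps
    # can refine the questions, the data, and the analysis.
    return act(report(analyze(prepare(acquire()))))

print(run_process())
```

In a real project each of these steps would be a substantial effort owned by different team members; the point of the sketch is only that the steps hand results to each other and the whole chain can be re-run as the team iterates.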
This five-step process can be used in alternative ways in real-life big data applications, once we add the dependencies of different tools on each other. The influence of big data pushes for alternative scalability approaches at each step of the process. Just as you can scale each step on its own, you can also scale the whole process in the end. One can simply say that all of these steps have reporting needs in different forms. Or there is a need to draw all these activities as an iterating process, with build, explore, and scale for big data as steps. Big data analysis needs alternative data management techniques and systems, as well as analytical tools and methods. Multiple modes of scalability are needed based on dynamic data and computing loads. In addition, changes in physical infrastructure and urgencies specific to streaming data, arising from special events, can also require multiple modes of scalability. In this intro course, for simplicity, we will refer to the process as a set of five sequential activities that iterate. However, we'll touch on scalability as needed in our example applications. As part of building your big data process, it's important to mention two other P's. The first one is big data platforms, like the ones in the Hadoop framework, or other computing platforms used to scale different steps. Scalability should be on the minds of all team members and get communicated as an expectation. In addition, the scalable process should be programmable through reusable and reproducible programming interfaces to libraries, like systems middleware, analytical tools, visualization environments, and end-user reporting environments. Thinking of big data applications as a process, a set of activities that team members can collaborate over, also helps to build metrics for accountability into it.
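The idea of scaling a single step on its own can also be sketched briefly. Below, a hypothetical "analyze" step is mapped over data partitions with a parallel executor, and partial results are combined afterwards. This is my own simplified illustration, not any particular platform's API: a real system like Hadoop would distribute the partitions across nodes, whereas this sketch just uses threads on one machine.

```python
# Simplified illustration of scaling one step of the process on its own:
# the "analyze" step runs over data partitions in parallel, and the
# partial results are merged. The partitioning scheme and the statistic
# are assumptions made for this sketch.
from concurrent.futures import ThreadPoolExecutor

def analyze_partition(partition):
    # Per-partition analysis; a partial sum and count are returned so
    # that results from all partitions can be combined correctly.
    return (sum(partition), len(partition))

def scalable_analyze(partitions, workers=4):
    # Threads stand in here for distributed workers on a big data platform.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(analyze_partition, partitions))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count  # global mean assembled from partial results

if __name__ == "__main__":
    # Pretend each partition came from a different data source or node.
    partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
    print(scalable_analyze(partitions))
```

The design point mirrors the lecture: because the step emits combinable partial results, adding more partitions or workers changes throughput without changing the rest of the process.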
This way, expectations on cost, time, optimization of deliverables, and timelines can be discussed among the team members from the beginning of the data science process. Sometimes we may not be able to do this in one step, and joint explorations, like statistical evaluations of intermediate results or of the accuracy of sample data sets, become important. As a summary, data science can be defined as a craft of using the five P's identified in this lecture, leading to a sixth P, the data product. Having a process between the more business-driven P's, like people and purpose, and the more technically driven P's, like platforms and programmability, leads to a streamlined approach that starts and ends with the product, team accountability, and collaboration in mind. The data science process provides guidelines for implementing big data solutions, as it helps to organize efforts and ensures that all critical steps taken conform to pre-defined and agreed-upon metrics.