Welcome to module four. Let's take a look at the data science approach to big data. We have mentioned the CRISP-DM process earlier in the course. This data mining process has become a standard called the Cross-Industry Standard Process for Data Mining. CRISP-DM is composed of six phases: business understanding, data understanding, data preparation, modeling, evaluation, and deployment. So far we have spent a lot of time on data understanding and data preparation using KNIME. In this module, we're going to focus on modeling, evaluation, and deployment. When you think about an upcoming project where you think you might want to use data mining, you can apply this process and walk through all of these phases. You would start with business understanding, where we spend time understanding the project objectives and requirements and turning them into a data mining problem definition. Oftentimes, we need to do a situation assessment and take inventory of the resources, requirements, assumptions, and constraints in order to have a successful project. This is where we determine the data mining goals and what success looks like, and start producing the project plan. Before we can even think about what kind of data mining approaches and methods we might want to apply to the data, we need to understand the data. In the data understanding phase, we look at the initial data collection and description. We identify whether there are any obvious data quality issues. We typically describe the data in a data description report, and we start exploring the data. Once we understand the data that we have, and maybe additional data that we need to collect, we move into the data preparation phase. This is where we say that data scientists spend 60 to 90 percent of their time. We select a dataset, clean that data, integrate and format it, and record attribute selections.
We would probably want to include some rationale for the inclusion or exclusion of certain variables, and we will spend a lot of time deriving attributes and maybe generating records. We might have to integrate data from many different sources, and oftentimes we will have to format and reformat that data in order to prepare it for the modeling phase. In the modeling phase, we will choose the appropriate technique. Actually, we're typically going to choose more than one and compare them. We will select the training and the test dataset, and then we will train the model. In this phase, as we start building the models, we will build several different models with different parameter settings and different possible model descriptions. We're continually going to assess those models and revise parameter settings as we go through this phase. Once we're happy with the model we have created, we want to evaluate the results. We can determine whether the results meet the business objectives, and we can identify any business or technical issues that might exist with the model or models we have produced. We're going to walk through a review process and determine the next steps. The next steps are exciting: we want to deploy that model. In the deployment phase, we will deploy the results of the model into production. There are many different ways we can do that, and we will spend a little bit of time at the end of this module looking into different ways of deploying models with KNIME. We create a plan for monitoring and maintenance of this model: how often do we want to retrain it? Then we create a full, detailed deployment plan and produce the final report and documentation. So if you think about the data mining process at a high level, what we really do is explore the data, find patterns, and then perform predictions. Performing predictions is oftentimes called scoring the model.
If we look into this approach in more detail, just like we have seen in CRISP-DM, we're going to collect historical data about a particular set of circumstances that we would like to create a predictive model for. We'll start exploring that data and then cleaning it. Once we clean the data, we're going to split it into training data and test data, and we'll talk a little bit more about this later. We're going to perform modeling, finding patterns throughout the data, and this is what we call training the model. This is where CRISP-DM applies really well. Once we train that model, we're going to go into the evaluation phase, where we have a test dataset that is separate from the training dataset. We're going to take that trained model and apply the test dataset to it in order to test, evaluate, and validate the model. Once we are happy with that model, new data will be coming in and we're going to perform prediction, or what we call scoring the model. This approach applies anywhere from exploratory data analysis to predictive analytics. If we're talking about exploratory data analysis, we're typically talking about analyzing datasets in order to summarize their main characteristics, often with visual methods or statistical models. Exploratory data analysis was promoted in order to encourage data exploration, to formulate hypotheses, and to guide us to new data collections and new experiments. Sometimes we go into a project knowing exactly what we're going to do, and sometimes we just know that this data should be able to bring us some insight, but we're not exactly sure what we would like to get from it, and exploratory data analysis is extremely valuable for those kinds of projects. We will use exploratory data analysis even if we have a very well formulated hypothesis of what we would like to do, because it really takes a lot of time to get to know your data and understand it, and exploratory data analysis can only benefit that process.
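The split, train, evaluate, and score loop described above can be sketched in a few lines of code. This is a minimal illustration assuming scikit-learn and its bundled Iris dataset, not the KNIME workflow the course actually uses; the decision tree here simply stands in for whatever model you choose.

```python
# Sketch of the split -> train -> evaluate -> score loop
# (assumes scikit-learn; the course itself uses KNIME nodes for this).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Historical data about a set of circumstances we want to predict.
X, y = load_iris(return_X_y=True)

# Split the historical data into a training set and a held-out test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# "Training the model": find patterns in the training data only.
model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# "Evaluation": apply the trained model to the separate test set.
accuracy = accuracy_score(y_test, model.predict(X_test))

# "Scoring": as new data comes in, perform prediction with the model.
new_sample = [[5.1, 3.5, 1.4, 0.2]]  # a hypothetical new observation
prediction = model.predict(new_sample)
```

The key point mirrored from the lecture is that the test rows never touch the learner; they are only used afterward to check the model.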
Then there is descriptive modeling, oftentimes referred to as discovering patterns or rules. Descriptive modeling typically focuses on summarizing a sample in order to learn about the population that that sample of data represents. So if we're talking about descriptive models, we're oftentimes talking about clustering, customer segmentation, and association rules and dependencies, where typically the system explores the data trying to find out whether there are any relationships between different attributes. Then, if one attribute is present, can that imply the presence of another attribute? So maybe 50 percent of the people who buy milk also buy bread or cheese. We can look into those types of patterns. Sometimes we're even interested in what sequence they appear in. When we talk about sequential patterns, we typically have the system search through the data and try to identify repeated patterns within it. A third category of models is predictive modeling. When we talk about predictive modeling, we can refer to classification and regression, temporal analysis, or deviation detection. Typically, when we talk about classification models, the system learns how to partition the data. We supply the system with examples or objects from different groups, that is, a historical dataset, and then we let the algorithm decide on a profile of each group based on the attributes that were unique to that particular group. When we talk about temporal or time sequence data, we're typically looking at methods where we give a set of time sequences and the method can then identify regular occurrences of the same sequence, or look into anomaly detection. Deviation detection works the other way around: the system determines whether there has been a considerable change in a feature from previous or expected values. Sometimes we call this outlier or anomaly detection.
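The "people who buy milk also buy bread" idea above is usually quantified with the support and confidence measures from association rule mining. Here is a tiny sketch in plain Python over a made-up set of transactions (the numbers are illustrative, not from any real dataset):

```python
# Toy illustration of association-rule measures for a rule like
# "milk -> bread". The transactions below are hypothetical.
transactions = [
    {"milk", "bread"},
    {"milk", "cheese"},
    {"milk", "bread", "cheese"},
    {"bread"},
]

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """How often the rule holds: support(A and B) / support(A)."""
    return support(antecedent | consequent) / support(antecedent)

# Of the people who buy milk, what share also buy bread?
conf = confidence({"milk"}, {"bread"})  # 2 of the 3 milk buyers
```

With these four transactions, the rule "milk implies bread" has confidence 2/3: three baskets contain milk, and two of those also contain bread.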
There are many different types of machine learning models, but there are three major categories: supervised, unsupervised, and reinforcement learning. When we talk about supervised learning, we're typically talking about classification and regression methods. As we'll see in just a little bit, when we talk about decision trees and regression trees, most classification methods predict a nominal or categorical value, while most regression models predict a numeric value. That's the major difference between these two groups. We have a whole family of unsupervised learning. Typically, when you ask people about unsupervised learning they will immediately say, "Oh, clustering. Yeah, I know an example of that." Many people have already had experience with k-means clustering and maybe recommender systems. When we talk about reinforcement learning, we're typically referring to a family of methods that deal with game AI and learning tasks, often applied to robot navigation and real-time decisions. So what is data science? How does data science fit within the whole world of big data? How does it differ from what we've just learned about CRISP-DM and the data mining process? How different is the data science framework from what we have learned so far? Let's take a look at that. If we look at the data science definition from Wikipedia, it's an interdisciplinary field about processes and systems to extract knowledge or insight from data in various forms. That data can obviously be structured or unstructured, and we've talked a lot about that earlier. We really are bringing tools from statistics, machine learning, and data mining together into this one framework. There are many components of data science: data wrangling, data preparation and cleaning, and data curation. Once we prepare that data, we're typically running some machine learning algorithms. Oftentimes, they're within a distributed data architecture.
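Since k-means is the unsupervised example most people name first, here is a small sketch of it, assuming scikit-learn and two made-up groups of points. Notice that no labels are supplied; the algorithm discovers the grouping on its own.

```python
# Minimal k-means sketch (assumes scikit-learn; data is made up).
import numpy as np
from sklearn.cluster import KMeans

# Two visually obvious groups of 2-D points, but no labels are given.
points = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
                   [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])

# Ask for two clusters; the algorithm assigns each point to one of them.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
labels = kmeans.labels_
```

This is exactly what makes it unsupervised: compare it with the classification setting, where the historical dataset already tells the learner which group each row belongs to.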
We're going to apply parallel processing because we have a lot of data and we want to create a predictive model as fast and as accurate as possible. We will obviously apply data visualization, and most machine learning models have some type of probability model built into them. Then there are new models like deep learning, and new jobs like data engineering, that are highly related to data science. So let's take a look at the data science lifecycle. Just like with CRISP-DM, we're going to initiate the project, and then we're going to start with business understanding. Once we understand the business, we're going to take a look into acquiring and preparing the data. Now, this could be slightly different or very different from what we have talked about in CRISP-DM. Our data sources now are not just flat files, like they might be in a traditional machine learning project. We now have data that is coming from tweets, sensors, video, text, etc. The data might be coming in streams or in batches, and then we can start manipulating that data through visualization, ETL or ELT, and validation. We might be performing this on many different computing environments, anywhere from the cloud and the data lake to Hadoop and GPUs. Once we finish this data acquisition, preparation, and cleaning, we have created a training dataset. The training dataset will then be used to create the models. Before we can start training any models, we will have to perform feature engineering and transformation on that data. We will start applying methods. We will select a number of different methods and then we're going to perform parameter tuning, possibly pruning of those models, and then we're going to evaluate the models. Whether we do that by splitting the data into training and test sets or by using 10-fold cross validation, at the end we're going to validate those models.
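The 10-fold cross validation mentioned above can be sketched very compactly. This sketch again assumes scikit-learn and its Iris dataset rather than KNIME's cross-validation nodes: the data is split into 10 folds, the model is trained and tested 10 times, each time holding out a different fold, and the scores are averaged.

```python
# Sketch of 10-fold cross validation (assumes scikit-learn).
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Train/evaluate on 10 different train/test splits, one score per fold.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
mean_accuracy = scores.mean()
```

Compared with a single train/test split, this gives a more stable estimate of the model's accuracy, at the cost of training the model ten times.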
Before we can deploy them, we're going to create a plan for product testing and deployment of those models. Once we decide to deploy the models, we can do that in many different ways. Oftentimes, you see these data science models built into products, web services, or smart apps. Then, of course, at the end comes customer acceptance. KNIME's approach to data science is very similar. We will read the dataset, transform it, analyze it, and deploy it. So far we have spent a lot of time on reading and transforming data, so now we're ready to start analyzing and then deploying the models. As far as KNIME goes, there are many modeling tools, anywhere from decision trees and random forests to neural networks, deep learning, etc. We have many types of available frameworks and libraries like R, Python, H2O, WEKA, etc. There are many different types of evaluation nodes, like the ROC curve, numeric and entropy scorers, feature elimination, 10-fold cross validation, etc. One of the main nodes that we're going to utilize in building predictive models is the node called Partitioning. This node will allow us to partition the entire dataset into the training and test datasets. We can decide that we want 50-50 or maybe 70-30 percent of the data in the training dataset versus the test dataset, we can apply stratified sampling, and we can set the random number generator seed in order to ensure that there is no bias as we split the data. Once we split the data, most models that follow the Learner-Predictor motif will work in a similar way to the one we have represented here. Once the data is split into training and testing sets, the training data typically goes into the model learner. In this case, we are looking at the Decision Tree Learner. Once that Decision Tree Learner node creates the model, we're going to use the test data and utilize the Predictor node in order to take that new data and test the model that we have built.
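A rough code analogue of what the Partitioning node configuration does, assuming scikit-learn and the Iris dataset (150 rows, 50 per class), would look like this: a 70-30 split, stratified sampling so class proportions are preserved in both partitions, and a fixed seed so the split is reproducible.

```python
# Rough Python analogue of KNIME's Partitioning node settings
# (assumes scikit-learn; the Iris data stands in for your own table).
from collections import Counter
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # 150 rows, 50 per class

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    train_size=0.7,   # 70-30 split between training and test data
    stratify=y,       # stratified sampling: equal class proportions
    random_state=1)   # fixed seed for a reproducible, unbiased split

train_counts = Counter(y_train)  # expect 35 rows of each class
test_counts = Counter(y_test)    # expect 15 rows of each class
```

The training partition then plays the role of the input to the Learner node, and the test partition is what the Predictor node uses to evaluate the resulting model.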