To better understand the steps of the CRISP-DM process, let's walk through an example together. This particular example is from my own past experience leading a team that developed this product. The product we're going to discuss today was a prediction tool for electric utilities, to predict the severity and locations of power outages in advance of storms impacting their territory. We'll walk through some of the important considerations our team faced as we worked through each of the six steps of the CRISP-DM process.

Let's start with Step 1, the business understanding. The first part of Step 1 is to define the problem. Let's start with who has the problem. In our case, the person we were serving was typically the director or manager of operations at the electric utility customer. The specific problem that person had is that a big part of their job is planning for storm restoration efforts. As part of the planning process, a couple of days in advance of a storm impacting their territory, they need to decide how many crews to have available to repair the expected damage from the storm. This is a challenging problem because they have to make this decision under a high degree of uncertainty. Two or three days in advance of a storm, it can be really difficult to predict the level of outages and the level of restoration effort that will be needed. The problem is particularly challenging because if they overestimate and staff too many crews, they're wasting money: they have a lot of crews sitting around who need to be paid not only wages, but often also put up in hotels, provided meals, and so on.
On the other side, if they have too few crews available, the storm restoration efforts can take a long time; the utility's customers are out of power for a long time and can get upset, and in severe cases the director of operations can even be called in front of government or regulatory bodies, such as the Public Utility Commission, to explain why they made the decisions they did and why they didn't properly staff the restoration efforts. Prior to us launching this product, the current state of solving this problem was that the director of operations would use weather forecasts provided by companies such as my own and others, and then use their own experience and intuition to make an educated guess as to how many crews they might need to staff in order to prepare for the expected restoration efforts.

The second step in the business understanding phase is to define what success looks like in solving this problem. The expected impact of success for our user was that we would improve their planning and thereby improve restoration times, while also helping them minimize the wasted cost of overestimating and having too many crews on hand. Specific metrics we could look at to quantify the expected impact would be things such as the reduction in average restoration time for customers of the utility. We also needed to quantify the expected output of the prediction model we would build as part of this product: since our job in this case was to create a regression model for predicting a number of outages, we would use a typical regression metric, such as the mean squared error of the predictions. Part of the business understanding phase is also to begin to set targets to quantify our expected outcome and output metrics.
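As a minimal sketch of the output metric mentioned above, mean squared error averages the squared gap between predicted and actual outage counts across storms. The outage numbers here are purely illustrative, not real utility data.

```python
# Minimal sketch: mean squared error as the output metric for an
# outage-count regression model. All numbers are hypothetical.

def mean_squared_error(actual, predicted):
    """Average of squared prediction errors across storms."""
    errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sum(errors) / len(errors)

# Hypothetical outage counts for three past storms.
actual_outages = [120, 45, 300]
predicted_outages = [100, 50, 280]

print(mean_squared_error(actual_outages, predicted_outages))  # → 275.0
```

Squaring the errors penalizes large misses most heavily, which matches the business pain here: badly underestimating a major storm is far worse than being slightly off on a minor one.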
In this case, we could set a target around the reduction of average restoration time, such as: our goal is to reduce the average restoration time for customers in the utility territory, over the course of all events occurring throughout the year, by X number of minutes on average. For the output of our model, we would set a target around the mean squared error we're trying to achieve. We also had a few constraints on the solution we had to build, specifically that we needed to deliver a prediction of the number of crews needed, or the number of outages expected, within the time frame this person has to make a decision. So the predictions had to be delivered at least 48 hours in advance of the storm's start within the utility's territory.

Once we've defined the problem and what constitutes success in solving it, we now need to identify possible factors to use in our model. In this case, as part of our solution, we were building a predictive model that uses various inputs to predict the level of power outages and also the approximate locations of severe power outages within the territory. There are a couple of key drivers of power outages. The first and most obvious is the weather. However, when we think about weather, there are a lot of different parameters we might consider. We might look at things such as sustained wind speeds, peak wind gusts, and precipitation, whether that's rainfall or snow and ice, and for each of these parameters there are many different ways we can quantify them. For winds, for example, we can look at sustained winds over the course of an hour, or average winds over the course of a day. We can look at peak gusts within an hour. We can look at the number of hours in a day where sustained winds were over a certain threshold, such as over 30 miles per hour. There are many different forms this can take.
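The wind quantifications just listed can be sketched as simple feature calculations over hourly readings. The function name, field names, and the sample readings are hypothetical; the 30 mph threshold follows the example in the text.

```python
# Sketch: turning one day of hourly wind readings into candidate
# model features. Data values are hypothetical.

def wind_features(hourly_sustained_mph, hourly_peak_gust_mph, threshold=30):
    """Summarize hourly wind readings into a few candidate features."""
    return {
        "max_sustained": max(hourly_sustained_mph),
        "mean_sustained": sum(hourly_sustained_mph) / len(hourly_sustained_mph),
        "peak_gust": max(hourly_peak_gust_mph),
        # Hours where sustained wind exceeded the threshold (30 mph here).
        "hours_over_threshold": sum(
            1 for w in hourly_sustained_mph if w > threshold
        ),
    }

sustained = [12, 18, 25, 33, 41, 38, 22, 15]  # one value per hour
gusts = [20, 27, 36, 48, 62, 55, 31, 24]

print(wind_features(sustained, gusts))
```

Each quantification captures something different: a single extreme gust versus hours of sustained high wind can produce very different damage patterns, which is exactly why the feature set takes many forms.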
The second main factor was the location and density of the utility's assets. In areas where the utility has many assets concentrated, we're much more likely to see power outages than in areas where there are very few assets, or very few power lines which can go down and cause outages. Understanding where the utility's assets, specifically their power lines, actually were was important to constructing this model. The final key factor was the location of trees relative to power lines. It turns out that the primary cause of power outages is trees falling on power lines. Therefore, we have to understand not only where the power lines are, but also where the trees are relative to them. In areas where power lines run through heavily forested areas, it's much more likely that trees will fall on the lines than where power lines run through open fields with no trees. Likewise, it turns out that seasonality makes a big difference in this case. During the late spring, summer, and early fall, when leaves are on the trees, they're much more likely to topple over during wind gusts than in the wintertime when there are no leaves on the trees.

After we identified each of the possible factors which might contribute to our model, we then needed to find sources of data for each of those factors. Weather was an easy one: since our company was a provider of commercial weather forecasting, we already had our own weather data available. For the data on tree locations, we had to turn to vendors of satellite imagery and source data from them, so that we could identify locations of trees relative to power lines and utility assets. For the density and location of power lines and other assets, we had to source that from our customer themselves, the utility. Likewise, to get the historical outages, which are the target variable we're trying to predict, we needed to get those from the utility as well.
As we build our model, we need to train it on historical data, which means we needed to go back in time and collect data on each of our inputs, but also on each of our outputs, the actual power outages, from a number of storms over previous years, and feed that into our model in order to train it to generate predictions on future storms. There were a couple of key considerations. One is how much data we actually needed: was a year's worth of historical data on storms enough, or five years, or ten years? This is also limited by how much data the utility actually has and keeps within their systems. The second concern was sensitivity. Some utility customers are hesitant about turning over information on the locations of their assets to private vendors, for example, so there were some challenges we had to overcome there. The final consideration was cost. Some of these datasets we had to purchase from outside vendors, and we had to think through those costs, which again depended on how much data we were trying to collect.

After we sourced, collected, and aggregated our data, we then had to go through the data validation phase. It turned out in this case that there were significant amounts of missing data for various reasons. Likewise, these different datasets were all coming to us mapped to different geographical resolutions. Some were on a grid scale, where a territory was divided up into a five-mile by five-mile grid, for example, with data points provided for each grid cell. Some were on a township or city-wide level, or a zip code level. We had to do a geographical mapping of all these different sources to get them onto a common geographical scale. Additionally, we had a number of outlier storms in our data: situations where, for one reason or another, certain storms produced major outages which fell well beyond the bounds of the majority of storms. We had to understand what it was in those particular cases that caused such widespread outages.
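The geographical mapping step above can be sketched as bucketing point records onto a common grid so datasets at different resolutions can be joined. The coordinate scheme (miles from a territory origin) and the sample outage records are hypothetical simplifications; real work would use projected geographic coordinates.

```python
# Sketch: snapping point records onto a common 5-mile grid, as in the
# example above, so differently-resolved datasets can be aggregated
# on one geographical scale. Records are hypothetical.
from collections import defaultdict

GRID_MILES = 5  # five-mile by five-mile cells

def grid_cell(x_miles, y_miles):
    """Map a point (miles from a territory origin) to a grid cell id."""
    return (int(x_miles // GRID_MILES), int(y_miles // GRID_MILES))

# Hypothetical outage records: (x, y, outage_count)
records = [(2.0, 3.0, 4), (4.9, 1.0, 2), (7.5, 3.0, 1)]

outages_per_cell = defaultdict(int)
for x, y, count in records:
    outages_per_cell[grid_cell(x, y)] += count

print(dict(outages_per_cell))
```

Once every source (weather, tree cover, assets, outages) is keyed by the same cell id, the per-cell rows can be joined into one training table.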
After we cleaned and validated our data, we then had to define the feature set we were going to use for our model. Again, there were many possible features we could choose to include. There is a whole variety of weather parameters we could use, and for each weather parameter, a number of different ways we could quantify it. There are also interactions between features. For example, if there has been heavy recent rainfall and the ground is already saturated, it turns out that trees are much more likely to fall over at lower wind speeds, relative to dry periods when the ground is very hard and trees are less likely to fall over. We have to consider things such as the interactions between those features. Likewise, when we first started, we didn't really know if we had correctly identified all of the features. This is a challenging problem to try to model, and so it's possible that we were missing certain important features which contribute to causing power outages.

After we defined the feature set, we then began the process of actually building our prediction model. In this case, we were seeking a balance between performance of the model and interpretability. It's important to achieve good performance, as it usually is, but we also wanted to provide interpretability, in this case, to our customer. We have to recognize that models are not always correct; they're going to be wrong at some point. When they're wrong, we need to be able to explain, in this case to our customer, why they went wrong, but also to be able to debug them so that we can improve in the future.
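The rainfall-and-wind interaction described above is often encoded as an explicit interaction term so that even a simple, interpretable model can capture it. The function name, variable names, and scaling here are hypothetical illustrations, not the product's actual features.

```python
# Sketch: an explicit interaction feature. High wind on rain-saturated
# ground is riskier than either factor alone suggests, so a product
# term lets a simple linear model capture that. Names are hypothetical.

def outage_risk_features(peak_wind_mph, rain_last_72h_in):
    """Build a small feature dict including a wind-rain interaction."""
    return {
        "peak_wind": peak_wind_mph,
        "recent_rain": rain_last_72h_in,
        # Interaction term: large only when both wind and rain are high.
        "wind_x_rain": peak_wind_mph * rain_last_72h_in,
    }

print(outage_risk_features(40, 2.5))
```

In a linear model, the coefficient on `wind_x_rain` directly quantifies the extra risk of wind on saturated ground, which also supports the interpretability goal discussed above.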
While we could have used more complex models such as neural networks, in this case we chose simpler models which provided a higher degree of interpretability, so that we could better understand how the model was reaching its predictions, and when it did go wrong, we would be able to explain to our customer what was happening. Likewise, another important consideration in this case was whether we could use a single model to serve multiple utility customers, or whether we had to create a number of individually tailored models, one for each of our customers, which would generate predictions specific to their territory.

After we created our model, we then had to test its performance. There are a few different ways we could approach that. The first way we approached this was by using a cross-validation strategy, where we would run the model on historical data, each time leaving one storm out of the historical dataset. We would train the model on the rest of the storms, then use the model to predict on that single storm, and we'd repeat this through the course of all the data we had available. The second strategy we used for testing was customer testing using live data. We trained our model using all of the available historical data, then we'd work with the customer to test our model as new storms occurred in their territory, feeding in live data and evaluating the results of our model predictions relative to what was actually happening. As is typical, when we went through this testing phase, we uncovered a number of issues that led to lower performance. In this case, the most severe issues were with the data itself and its quality. That caused us to go all the way back in our CRISP-DM process, to go through another round of cleaning and scrubbing the data, and then to continue on through validating, preparing, model building, and evaluating again.
We repeated this cycle a number of times until we got a model that we felt good about. After we had reached a point in our testing where we felt comfortable with the results the model was producing, it was then time to deploy our model to our first customers. In this case, our model, as is usual, was not acting as a standalone model, but was integrated within a broader product that our customers were using. The specific product in our case was a visualization product that they used within their control center to visualize weather and weather predictions. We integrated the model into that visualization product, and we displayed the results of the model as a visual interface, a map of predicted outages for our customer.

An important consideration in the deployment phase was change management within the customers who were using our product. These customers were used to working in a very particular way. They had worked generally the same way for years and years in terms of how they prepared for storms and how they reached their decisions on staffing levels. Through the use of our product, their workflow changed: they were now relying more on a prediction tool relative to other means of developing their staffing plan. Part of our deployment process was to go through a process of change management with our customers, to help them adopt the new product and adjust their workflow to utilize it. We also had to continue to monitor the performance of our product after it was released into the wild and used by our customers. We had to continue to look at the outputs it was providing, but also at the outcomes it was achieving for our customers, to make sure that not only was the output performance of the model good, but that they were also achieving the business outcomes, in terms of reducing average restoration time, that we had defined for our customer.
Finally, we have to also recognize that the environment around a model can change over time. In this particular example, there are a number of things changing, for example, the locations of trees around power lines. Utilities typically engage in tree trimming, pruning back trees and vegetation near their lines. Likewise, there may be new trees planted or trees knocked down, and the locations of utility assets may change over time as the utility adds more assets to account for growth in their territory. There are a number of factors which can change, and therefore it's important not to just train the model once, release it, and let it go, but to have a retraining plan, so that we identify how often and when we need to retrain our model to account for changes in our environment and its performance stays good.
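One simple way to operationalize such a retraining plan is a drift check: compare the model's recent error against its error at deployment time and flag retraining when it degrades past a tolerance. The 25% tolerance and the error values below are hypothetical, not figures from the actual product.

```python
# Sketch of a retraining trigger: flag the model for retraining when
# its recent error drifts too far above the error it had at deployment.
# Threshold and error values are hypothetical.

def needs_retraining(recent_mse, baseline_mse, tolerance=0.25):
    """True when recent error exceeds the deployment baseline by >25%."""
    return recent_mse > baseline_mse * (1 + tolerance)

print(needs_retraining(recent_mse=310.0, baseline_mse=275.0))  # within tolerance
print(needs_retraining(recent_mse=400.0, baseline_mse=275.0))  # drifted
```

A check like this could run after each storm, alongside a fixed periodic retrain (say, seasonally, given the leaf-on/leaf-off effect described earlier), so environmental changes are caught between scheduled retrains.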