In this section, we'll move on to learning some basic statistical concepts that will be imperative for your journey in machine learning, as well as data-driven decision-making. So what are our learning goals for this section? We're going to cover what to keep in mind when discussing estimation versus inference in statistics, we're going to discuss the differences between parametric and non-parametric approaches to modeling, we'll discuss the common statistical distributions that we see in the real world, and finally, we'll introduce the difference between frequentist and Bayesian statistics. So starting off with estimation versus inference: when we talk about estimation, essentially what we want to keep in mind here is that an estimate just gives us a single value for a certain parameter, such as the mean, from our sample data. So in order to calculate the mean, we take the sum of all of the values in a certain column and divide by the number of values that are there to get the average value. Now, estimation is only one part of statistical inference. When performing statistical inference, we're trying to understand the underlying distribution of the population, including our estimate of the mean, as well as other quantities, such as the standard error of that estimate, that describe the underlying properties of the population we're sampling from. In order to get the standard error, we would use something like the equation that we see here: we take each value's distance from our estimate of the mean, use those distances to compute the sample standard deviation, and then divide by the square root of the sample size. Now, machine learning and what we've defined as statistical inference are very similar. We'll see the degree to which much of what we learn and use throughout this course is actually intertwined with and built from the foundations of statistics, much of it developed before we even had the computing power that we have today.
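As a quick sketch of those two calculations, here is the mean and the standard error of the mean computed from scratch on a small made-up sample (the values are hypothetical, not from any dataset in the course):

```python
import math

# Hypothetical sample of customer tenures in months (made-up values).
sample = [12, 7, 30, 24, 18, 9, 41, 15]

# Estimate of the mean: sum of the values divided by how many there are.
mean = sum(sample) / len(sample)

# Sample standard deviation: average squared distance from the mean
# (with the n - 1 correction), then a square root.
variance = sum((x - mean) ** 2 for x in sample) / (len(sample) - 1)
std_dev = math.sqrt(variance)

# Standard error of the mean: the standard deviation shrunk by sqrt(n),
# reflecting that larger samples pin down the mean more precisely.
std_error = std_dev / math.sqrt(len(sample))

print(mean, std_error)
```

Note how the standard error shrinks as the sample size grows, which is exactly why more data gives us tighter inference.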
In both machine learning and statistical inference, we're using some sample data in order to infer qualities of the actual underlying population distribution in the real world, and the models that would have generated that data. When we say data-generating process here, we can think of the linear model as an example of a data-generating process representing the actual joint distribution between our x and y variables. We may care either about the entire distribution when doing machine learning, or just some features of that distribution, such as the point estimate of our mean, for example. Machine learning that focuses on understanding the underlying parameters and the individual effect of each one requires tools pulled from statistical inference. On the other hand, some machine learning models put very little focus on the underlying parameters of our distribution, and instead focus only on prediction results, or just those estimates. Now, I want to introduce a business example that we'll use throughout to help bring context to what we've learned so far and what we'll learn throughout this course. The example that we're going to be using is customer churn. What do we mean by that? Data related to churn will include a target variable for whether or not a customer has left the company. Obviously, we don't want customers leaving the company, so we want a lower churn rate. The data will also include features to help us predict whether a future customer will leave, such as the length of time they've been a customer, the type and amount of purchases that customer has made, and other customer characteristics such as age, location, and so on.
Churn prediction is often approached by predicting a score for each individual that estimates the probability the customer will leave: 0.99 means they're very likely to leave, 0.01 means we're probably going to hold on to that customer. So when we talk about estimation here versus inference, we can estimate the impact of each feature. Think: for every additional year someone has been a customer, that being the feature, they are 20 percent less likely to churn, giving an estimate of the value of each additional year. When we talk about inference, we'd expand that to getting the interval as well, so we'd get the statistical significance of the estimate. For example, using what we just said, rather than just "20 percent less likely to churn," we can have a 95 percent confidence interval on that estimate, saying that the effect is between 19 and 21 percent. Then we'd be fairly confident that for each additional year we've had a customer, the effect is between 19 and 21 percent, meaning 20 percent is a good estimate of how much less likely they are to churn. On the other hand, the 95 percent confidence interval could be between negative 10 and 50 percent, meaning for each additional year we're very uncertain about 20 percent as a point estimate. For all we know, there could actually be a negative effect, or a much stronger positive effect; we just don't have the statistical significance that we would if the interval were somewhere between 19 and 21 percent. So let's look at some actual customer churn data. We're going to use the Telco customer churn dataset from IBM Cognos Analytics, which represents customer characteristics and churn outcomes for a fictional telecommunications firm.
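To make the point-estimate-versus-interval distinction concrete, here is a small sketch using made-up numbers (the estimates and standard errors below are hypothetical, not fitted to the actual dataset): an approximate 95 percent confidence interval is the estimate plus or minus 1.96 standard errors, and whether that interval is tight or crosses zero tells us how much to trust the point estimate.

```python
def confidence_interval_95(estimate, std_error):
    """Approximate 95% confidence interval: estimate +/- 1.96 standard errors."""
    margin = 1.96 * std_error
    return (estimate - margin, estimate + margin)

# Tight case: a -20% churn effect per additional year of tenure,
# with a small standard error (hypothetical numbers).
low, high = confidence_interval_95(-0.20, 0.005)
# Interval is roughly (-0.21, -0.19): we're fairly confident in the effect.

# Loose case: same point estimate, much larger standard error.
low2, high2 = confidence_interval_95(-0.20, 0.15)
# Interval is roughly (-0.49, 0.09): it crosses zero, so for all we know
# the true effect could be negative, zero, or positive.
```

The point estimate is identical in both cases; only the inference step, the interval, distinguishes a trustworthy effect from an uncertain one.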
It includes account type, customer characteristics, revenue per customer, a customer satisfaction score, and an estimate of that customer's lifetime value, customer lifetime value being the purchase amounts over the entire time that we have that customer. It also includes information on whether the customer has churned or not, as well as some categories of churn; you can think of different churn types, such as actively canceling versus just not renewing a subscription. Our data for the phone sub-sample, that is, for phone subscriptions, is contained in a pandas DataFrame assigned to the variable df_phone. So keep that in mind as we walk through some EDA in the next couple of slides. We're going to start off by creating a bar plot. Here, we're setting the y-variable equal to our churn value, so how likely they are to churn, and we're going to look at that according to their different payment types, so x equals payment, with the data coming from the pandas DataFrame df_phone; we're also not going to add on a confidence interval, which we could add to our bar plots. We see here that people that pay by credit card are much less likely to churn than those using bank withdrawal payments or mailed checks. So we see that extra piece of information by looking at the bar plot that we add here on the right. We can also look at churn value by the number of months. Here we're using pd.cut. Whenever we create a bar plot, we want the x-axis to be a categorical variable, so we're going to create a categorical variable using pd.cut, cutting tenure into five equal-length bins. We see here on the right that we have values between around zero and 15, then 15 and 30, and so on, and we can see that those that have been customers for less time are obviously much more likely to churn compared to those that have been a customer for a longer period of time.
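Under the hood, a bar plot like this just shows the mean churn value per group. Here is a minimal sketch of that aggregation with pandas, using a tiny made-up frame standing in for df_phone (the column names tenure_months and churn_value are hypothetical stand-ins for the dataset's actual columns):

```python
import pandas as pd

# Tiny made-up stand-in for df_phone (hypothetical columns and values).
df_phone = pd.DataFrame({
    "tenure_months": [2, 5, 14, 20, 33, 41, 55, 60],
    "churn_value":   [1, 1, 1, 0, 0, 1, 0, 0],
})

# pd.cut turns numeric tenure into a categorical variable with
# equal-length bins, which is what we want on the x-axis of a bar plot.
df_phone["tenure_bin"] = pd.cut(df_phone["tenure_months"], bins=3)

# The height of each bar is the mean churn value within the bin.
churn_by_bin = df_phone.groupby("tenure_bin", observed=True)["churn_value"].mean()
print(churn_by_bin)

# The equivalent seaborn call would be roughly:
# sns.barplot(x="tenure_bin", y="churn_value", data=df_phone)
```

Because churn_value is 0/1, the mean within each bin is exactly the churn rate for that group, which is why a bar plot of the mean is a sensible summary here.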
We can also pull out our pair plot, which was introduced earlier. We're going to create a pair plot here, just looking at the features that we have selected: the number of months we've had them as a customer, gigabytes used per month, the total revenue that customer has brought in, as well as CLTV, the customer lifetime value, and then obviously we have the churn value as well. So we run the pair plot to see the relationship between each pair of these variables, and we're also going to split that up: we see that hue equals churn_value, so we're going to split it by whether or not they churned, one being equal to them churning, zero equal to them not churning. So we see the different graphs that we have here, blue being that they did not churn and green being that they did churn, and we see the relationship between each pair of variables as well as the distribution of each one, and a lot of these distributions should make sense. Like we said, when we look at the tenure in months, those with a shorter tenure are much more likely to churn, as we see with the green values, versus not likely to churn, as we see with the blue values in the top-left corner. We're also going to look at the hexbin plot that we introduced earlier. We're going to use the joint plot, and the labels are going to be months versus monthly, which is just tenure in months versus the monthly charge. So we want to see, if they're charged more, are they more likely to stay, or maybe those more expensive customers are more or less likely to stay. We're able to see the distribution of tenure in months along the top of our plot, and we can also see the distribution of the monthly charge on the right side of our plot. Our hexbins are going to show us the density of customers at each combination of the two.
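The two plots described here come from seaborn's pairplot and jointplot. As a sketch on a tiny made-up frame (hypothetical column names and values standing in for df_phone), the plotting calls would look roughly as in the comments below, and the main pattern the pair plot reveals, churners having shorter tenure, can also be checked directly with a groupby:

```python
import pandas as pd

# Tiny made-up stand-in for df_phone (hypothetical values).
df_phone = pd.DataFrame({
    "tenure_months":  [3, 6, 10, 28, 40, 52, 61, 70],
    "monthly_charge": [95, 90, 40, 55, 60, 85, 100, 105],
    "churn_value":    [1, 1, 1, 0, 0, 0, 0, 0],
})

# The seaborn calls described in the lecture would look roughly like:
# sns.pairplot(df_phone, hue="churn_value")
# sns.jointplot(x="tenure_months", y="monthly_charge",
#               data=df_phone, kind="hex")

# The pattern the pair plot shows, checked numerically: churners (1)
# have a much shorter average tenure than non-churners (0).
mean_tenure = df_phone.groupby("churn_value")["tenure_months"].mean()
print(mean_tenure)
```

This kind of numeric cross-check is a useful habit alongside EDA plots: if the groupby summary and the visual pattern disagree, something is off in the plot or the data.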
We see a high density in the top-right corner, meaning there are many people that have a longer tenure in months as well as a higher monthly charge, but we also see, on the other end of the spectrum, that people with a shorter tenure in months also tend to have a higher monthly charge than average, and then we see a lot less density between those two clusters. That concludes our portion here on the EDA for the customer churn example. In the next lecture, we're going to start talking about parametric versus non-parametric statistics. I'll see you there.