[MUSIC] Trying to distinguish good from bad analytics always starts with a simple question: what is the data-generating process? So what is the data-generating process? This question may sound odd, because it was created by data scientists for data scientists. You see, data scientists sometimes think of the world as a giant lab where people do their thing and, by doing so, generate data for the rest of us. So when I ask, what is the data-generating process, I am basically asking this: explain to me the customer journey and how different people make decisions. Explain to me how firms react to these decisions. And finally, when I make a comparison across groups of consumers, explain to me how those consumer and firm decisions affect which groups consumers belong to.
Now, I know this sounds quite abstract. Let's make it more concrete. Remember this search engine advertising example? Suppose that the example was about buying cars. First, the consumer journey and how consumers make decisions. Some consumers are in the market for a car, and some consumers are not. If a consumer is in the market for a car, one of the things they are likely to do is type a car-related search term into a search engine like Google. If a consumer is not in the market for a car, they still might use Google, but most likely they will not search for something related to buying a car.
Okay, so what does this consumer behavior mean for firm decisions? Car manufacturers and dealers are interested in reaching consumers who might be ready to buy a car. One very targeted way of reaching such consumers is to use search engine advertising. This way, the firm only spends on advertising if a consumer has identified themselves as being interested in a car. And this makes sure that manufacturers and dealers don't waste their money advertising to consumers who are not in the market for a new car.
Final step: how does this affect which groups consumers end up in?
When I see a chart like this, why do consumers end up in each of the bins on the x-axis? Data scientists ask, how did consumers get assigned to each group? Well, given that manufacturers and dealers only bid for ads when a consumer types in a car-related search term, the consumers on the left are those who, because they were not interested in buying a car, did not type in a car-related search term. Those on the right, because they were interested in buying a car, typed in a keyword that manufacturers and dealers paid for. That is what it means to understand the data-generating process.
Let's pick another example. Remember the Pentathlon case you read? Step one, consumer journey. Some consumers buy sporting goods for many sports, some only for one sport. Consumers who purchase across many product categories tend to spend more with Pentathlon than those who purchase from only one category. Step two, firm decisions. Pentathlon has a decentralized way of managing email promotions, so the more departments a consumer buys from, the more promotional emails he or she gets. Step three, assignment to groups. When we see this table, consumers who spend little with Pentathlon show up in the zero to 1.9 email group. Consumers who spend a moderate amount with Pentathlon show up in the two to 3.9 email group. And high spenders show up in the four plus email group. That's the data-generating process.
Knowing the data-generating process helps data scientists, and you, determine whether the different groups of consumers that we are comparing are in fact comparable. In the next videos, I will discuss this in more detail. But just as a preview, ask yourself whether the groups of consumers who get zero to 1.9 emails and those who get four plus emails are probabilistically equivalent. [MUSIC]
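The three Pentathlon steps can be sketched as a small simulation. This is only an illustration under made-up assumptions, not data or code from the case: the number of departments (1 to 5), the email formula, the spend formula, and the function names `simulate_customers`, `email_group`, and `summarize` are all hypothetical. The key assumption is that emails have no effect on spending at all; both emails and spending are driven by how many departments a consumer buys from.

```python
import random
from statistics import mean


def simulate_customers(n=10_000, seed=0):
    """Simulate a hypothetical data-generating process (all numbers are assumptions)."""
    rng = random.Random(seed)
    customers = []
    for _ in range(n):
        # Step 1, consumer journey: how many departments the consumer buys from.
        departments = rng.randint(1, 5)
        # Step 2, firm decisions: each department sends its own emails,
        # so email volume grows with the number of departments.
        emails = departments + rng.uniform(-1, 1)
        # Spending depends only on departments. Emails do NOT enter this formula.
        spend = 40 * departments + rng.gauss(0, 15)
        customers.append((departments, emails, spend))
    return customers


def email_group(emails):
    """Step 3, assignment to groups: the bins used in the comparison table."""
    if emails < 2:
        return "0-1.9"
    if emails < 4:
        return "2-3.9"
    return "4+"


def summarize(customers):
    """Average departments and spend per email group."""
    groups = {}
    for departments, emails, spend in customers:
        groups.setdefault(email_group(emails), []).append((departments, spend))
    return {
        name: {
            "n": len(rows),
            "avg_departments": mean(d for d, _ in rows),
            "avg_spend": mean(s for _, s in rows),
        }
        for name, rows in groups.items()
    }


if __name__ == "__main__":
    summary = summarize(simulate_customers())
    for name in ("0-1.9", "2-3.9", "4+"):
        print(name, summary[name])
```

In this sketch, average spend rises across the email groups even though emails were given no causal role. The groups differ in how many departments their consumers buy from, which is the kind of difference the preview question about probabilistic equivalence is pointing at.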