Welcome to this first week of the Moneyball story. This week we are going to look in more detail at the underpinnings of the Moneyball narrative, as told in the book by Michael Lewis and as examined in subsequent research carried out by economists. The basic story of Moneyball, as told by Michael Lewis, is that Billy Beane and a team of statisticians at the Oakland A's used statistical analysis to identify players who could play better than received opinion within baseball generally thought. In other words, statistical analysis enabled you to identify talent more accurately than the existing talent scouts could. In the book, one story that comes up again and again is the story of walks. Generally speaking, scouts and general managers in baseball had relied on batting statistics, such as slugging percentage, to identify talent. Slugging percentage is basically a statistic that gives you a sense of how strong a hitter a player is. But hitting is not the whole of the story in baseball. If you are a good judge of a pitch, it's possible for you to get to first base by getting what's called a walk. A walk is when the pitcher throws four balls that don't go over the plate, and you as a batter restrain yourself from swinging at those balls, and that gets you to first base. That skill, it was argued in the book, was being undervalued more generally in baseball. And therefore, when Billy Beane and his statisticians came along and identified that this was a valuable skill, they were able to hire players with that skill and so be more successful on the field. Now this argument in fact has an interesting economic foundation, since what it must be saying, if it's true, is that prior to the statistical analysis of Beane and those like him, the ability to draw a walk was being undervalued, and hence there was an inefficiency in the market.
Not only that, but if the Moneyball story is correct, then once the book was published and this fact about batting skills was revealed, the capacity to draw a walk should have come to be valued. Since the Moneyball story was very popular and everybody heard about it, all of the other teams should have copied the A's, and so the undervaluation of walks should have disappeared. That amounts to a hypothesis about the operation of markets, and it is testable. Soon after the book was published, two economists, Jahn Hakes and Skip Sauer, decided to see whether they could test this hypothesis using data. In addition to slugging percentage, they looked at a statistic called on base percentage. On base percentage includes the skill of drawing a walk. So it measures your ability to hit the ball, but in addition it includes your capacity to draw walks. They wanted to test the hypothesis that this skill was undervalued before the publication of Moneyball and was better rewarded afterwards. In order to make that argument, Hakes and Sauer needed to do two things. Firstly, they needed to show that the capacity to draw walks, as measured by a statistic like on base percentage, really was valuable, and potentially more valuable than hitting skills as measured by statistics such as slugging percentage. Secondly, they needed to show that prior to Moneyball the salaries of players did not reward the capacity to draw walks, and that after Moneyball salaries were adjusted to compensate players for that ability. In the paper that they wrote, they did demonstrate both of these facts, and the results are shown essentially in two tables in their paper, Table 1 and Table 3.
In this week we're going to demonstrate the first result, which is shown in Table 1, reproduced below. In the following week we're going to look at the result from Table 3. As you can see, the table has four columns, and each column represents a regression analysis. A regression is a mechanism for establishing the relationship between one variable and a group of other variables, ideally in order to identify some kind of causation running from the variables you're using to explain something to the variable that you're trying to explain. Here, what Hakes and Sauer wanted to show was that on base percentage really was significant in determining the success of teams, and more significant than a statistic such as slugging percentage. So they ran regressions of a measure of success, in the form of win percentage, on on base percentage and slugging percentage, to see which one was more significant. Now, in terms of a team being able to win, it's not just your own performance in winning bases that matters. The success of your opponents in winning bases also matters. So win percentage is going to be affected not just by the statistics your team achieves against its opponents, but also by the statistics of your opponents when playing against you. So Hakes and Sauer first calculated the on base percentage and the slugging percentage of each team. They used the formulas for on base percentage and slugging percentage, and based on the statistics for each team across the season, they worked out what these values were. But then they also worked out what these statistics were against each team, that is, the success of a team's opponents against it in each season. And then they ran these regressions, which included on base percentage and slugging percentage both for and against.
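To make the idea of a regression with for and against terms concrete, here is a minimal sketch in Python. It does not use the real team data: the numbers are synthetic, generated with no noise from coefficients of 3.0 and -3.0 that are picked purely for illustration, and all of the variable names are hypothetical. It only shows the mechanics of regressing win percentage on on base percentage for and against with ordinary least squares.

```python
import numpy as np

# Synthetic team-seasons for illustration only; these are NOT real baseball data.
rng = np.random.default_rng(0)
n_teams = 30
obp_for = rng.uniform(0.30, 0.36, n_teams)      # hypothetical team on base percentage
obp_against = rng.uniform(0.30, 0.36, n_teams)  # hypothetical opponents' on base percentage

# Assumed relationship (coefficients chosen arbitrarily, no noise added).
win_pct = 0.5 + 3.0 * obp_for - 3.0 * obp_against

# Design matrix: a constant term plus the two explanatory variables.
X = np.column_stack([np.ones(n_teams), obp_for, obp_against])

# Ordinary least squares fit of win percentage on the explanatory variables.
coefs, _, _, _ = np.linalg.lstsq(X, win_pct, rcond=None)
intercept, b_for, b_against = coefs

print("for coefficient:", round(float(b_for), 3))
print("against coefficient:", round(float(b_against), 3))
```

Because the synthetic data were built with equal and opposite effects, the fitted for and against coefficients come out equal in size and opposite in sign. With real data there is noise, so the estimates would only be approximately equal and opposite.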
So if we look at column 1 of this table, you can see the first regression looks at on base percentage and on base percentage against for each team. What you can see here is that the coefficient for on base percentage in the first column is 3.294, and the coefficient for on base percentage against is -3.317. Now, the precise value is not particularly important at this stage. What's important to notice is that the for and against coefficients are roughly equal in size, and that should make sense, in that scoring one more run is, in terms of success, exactly equivalent to conceding one less run against other teams. So these coefficients should be roughly equal in size and opposite in sign. Column 2 then turns to slugging percentage, running a regression of win percentage on slugging percentage. You can see here that the two coefficients are slightly different: for slugging it's 1.731, and for slugging against it's -1.999. But they are roughly equal and opposite. The takeaway for Hakes and Sauer is that the coefficient of on base percentage is much larger than the coefficient of slugging percentage. What that is saying is that each extra unit of on base percentage is much more valuable than an extra unit of slugging percentage, for or against, which means that in some sense on base percentage is more important in determining wins. In column 3, you can see that when you include these two statistics together in the regression of win percentage, both are significant in determining win percentage. But the coefficient of on base percentage is roughly twice the size of the coefficient of slugging percentage: 2.141 against 0.802 for on base compared to slugging for, and -1.892 against -1.005 for on base and slugging against. So, roughly speaking, the on base statistic has twice the value of the slugging statistic, which means that in some sense it's twice as important.
So we can conclude from that that on base percentage really is important in determining the wins of teams. And just briefly, the last column imposes the restriction that the for and against coefficients must be equal in size and opposite in sign. This is a restriction that makes some sense and is supported by the data. So there you have the basic story, that on base percentage really does matter. And so it really should be something that plays a role when we come to look at the salaries of players. Now, what we're going to do in this week is reproduce this table using the data. So rather than just looking at statements about how on base percentage and slugging affect the outcomes of teams, we're going to collect all of the on base and slugging statistics, for and against, and reproduce these regressions for ourselves. So the regressions that you can see written down at the bottom here, we are going to reproduce ourselves. For that, we need data which will allow us to calculate slugging and on base percentage, and so we now need to define slugging and on base percentage. Slugging is defined as the number of singles of a team, plus twice the number of doubles, plus three times the number of triples, plus four times the number of home runs, all divided by the number of at bats. It's not truly a percentage, but nonetheless it's usually called slugging percentage. Notice that walks are not included in this definition. On base percentage is defined as hits plus walks plus hit by pitch, so walks are included here, all divided by at bats plus walks plus hit by pitch plus sacrifice flies. So we have a definition of these statistics, and now we need to find a database where we can identify them and make these calculations.
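The two definitions just given can be written as a pair of small Python functions. This is only a sketch: the function and argument names are hypothetical, and the season totals passed in at the end are made-up numbers rather than figures for any real team.

```python
def slugging_pct(singles, doubles, triples, home_runs, at_bats):
    """Total bases (1, 2, 3 or 4 per hit type) divided by at bats."""
    total_bases = singles + 2 * doubles + 3 * triples + 4 * home_runs
    return total_bases / at_bats


def on_base_pct(hits, walks, hit_by_pitch, at_bats, sac_flies):
    """Times on base divided by at bats plus walks, hit by pitch and sacrifice flies."""
    times_on_base = hits + walks + hit_by_pitch
    opportunities = at_bats + walks + hit_by_pitch + sac_flies
    return times_on_base / opportunities


# Made-up totals for illustration: 140 hits = 100 singles + 20 doubles + 5 triples + 15 home runs.
slg = slugging_pct(singles=100, doubles=20, triples=5, home_runs=15, at_bats=500)
obp = on_base_pct(hits=140, walks=50, hit_by_pitch=5, at_bats=500, sac_flies=5)

print(round(slg, 3), round(obp, 3))
```

Note that walks appear in both the numerator and the denominator of on base percentage but nowhere in slugging, which is why only the first statistic picks up the skill of drawing a walk.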
Well, before we do that, we first need to load our packages in Python. We're going to use Pandas and NumPy. We won't actually be using Matplotlib, which allows us to draw graphs, but it's often useful to have it in the background just in case you want to produce your own chart of something as you're going along. So we import those. The next thing we want to do is load our data. The statistics for teams in each game can be found on Retrosheet, which is a fantastic resource, an open source database which provides data on team performance going back to the 1870s. We're going to focus primarily on the years of the Moneyball story itself, as analyzed by Hakes and Sauer, which are the years 1999 to 2003. You can download this data for yourself from Retrosheet, but a note of caution: if you download the Retrosheet data, you'll find that it doesn't include headers. You can add these yourself, but there are a lot of variables in the data. We've added here a link to a source where you can download the headers and just merge them into the dataset. But for the purposes of this course, we've provided you with the data we need in an Excel spreadsheet, so you can just load the Excel data yourself and we can analyze it further. You'll note that it takes a little while to load, because this is a fairly large dataset. While you can see that asterisk, the code is still running, which tells you that the data is in the process of loading. When it's done, a number will appear in that box here on the left hand side, telling you that it's completed. So we just have to wait until that's done. And there we are, the data is now loaded. This is a very large dataset, and it contains many variables of potential interest to baseball statisticians, but relatively few that we actually need.
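For anyone who does download the raw files rather than using the spreadsheet, here is one possible way to attach headers to a headerless file with Pandas. It is a sketch under stated assumptions: the two rows of text are made up, the team codes are placeholders, and the five column names are hypothetical stand-ins for the much longer list of fields in a real game log. The course spreadsheet itself would be loaded with `pd.read_excel` and whatever filename you were given.

```python
import io

import pandas as pd

# Two made-up, headerless rows standing in for a raw game log file.
# A real game log has many more fields than these five.
raw_text = "19990401,AAA,BBB,3,5\n19990402,BBB,AAA,4,2\n"

# Hypothetical column names; in practice these come from a separate header list.
column_names = ["Date", "VisitingTeam", "HomeTeam", "VisitorRunsScored", "HomeRunsScored"]

# header=None tells pandas the first row is data, names= supplies the headers.
games = pd.read_csv(io.StringIO(raw_text), header=None, names=column_names)

print(games.columns.tolist())
print(games.shape)
```

Passing `header=None` matters here: without it, Pandas would treat the first game in the file as the header row and silently drop it from the data.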
So the first thing I'm going to do is just print a list of all the variables. You can see this print command here, which enables me to see what variables are in the data. You can see all these different variables, which could be very useful in other contexts but are much more than we need here, and we're going to drop a lot of them. But before we do that, there's one thing we need that is not contained in this data, and that is the identity of the winner of each game. The data only tells you the score for the home team and the score for the away team. We can use that information to identify the winner, because obviously the winner is the team that scores the most runs. So we can create new variables here for whether the home team wins or the away team wins. We call them H win and A win. H win will have a value of 1 if the home team wins and 0 if the home team loses. A win will be the reverse of this: it will have a value of 1 if the away team wins and 0 if the away team loses. So we can create those variables. Now you might wonder, what if the teams are tied? That's a very rare event in baseball, but it does happen on occasion. So what I've put in the self test is to create a variable for ties and then use that variable to count the number of ties in the data. You'll find that there are a very small number of ties in our dataset.
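The step of listing the variables and creating the two win indicators might look something like the following. This is a sketch on a tiny made-up table, not the course dataset, and the column names `home_score`, `away_score`, `hwin` and `awin` are hypothetical; the real spreadsheet will use its own names for the score columns.

```python
import numpy as np
import pandas as pd

# Three made-up games; the last one is a tie.
games = pd.DataFrame({
    "home_score": [5, 2, 3],
    "away_score": [3, 4, 3],
})

# Print the list of variables currently in the data.
print(games.columns.tolist())

# 1 if the home team scored more runs than the away team, otherwise 0.
games["hwin"] = np.where(games["home_score"] > games["away_score"], 1, 0)

# 1 if the away team scored more runs than the home team, otherwise 0.
games["awin"] = np.where(games["away_score"] > games["home_score"], 1, 0)

print(games)
```

With strict comparisons like these, a tied game gets 0 in both columns, so neither team is credited with a win. That is the case the self test asks you to pick out with a separate variable.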