Welcome to week two of Moneyball and Beyond. Remember, in week one we looked at the basis for the Moneyball story, based on the paper written by Hakes and Sauer in 2006. In fact, what we did was reproduce their Table 1, which showed that on-base percentage matters more, in some sense, for winning in baseball than slugging does, even though slugging is the statistic that coaches, scouts and general managers traditionally relied on when judging the quality of baseball players. That was partial support for the Moneyball story that on-base percentage matters. And remember that on-base percentage reflects the capacity to draw a walk as well as the capacity to hit the ball, whereas slugging doesn't account for the ability to draw a walk.
So what we're going to do this week is look at the relationship between on-base percentage and the salary that is paid to the players. The logic here is that if on-base percentage matters for winning, then players with high on-base percentages should be paid more than players with poor on-base percentages. We have data on player salaries, which we can use to examine that hypothesis. And that's exactly what Hakes and Sauer did in their paper to generate Table 3, which we're going to look at in a moment.
Now, it's also important to say that on-base percentage and slugging are not likely to be the only factors that matter for determining the salary of the players. Indeed, Hakes and Sauer added three other types of variable to their analysis to help explain the salaries of players. The first one is fairly straightforward. It's plate appearances: how often do you play? Players who play more often are going to tend to be the better players. The second is arbitration eligibility and free agency. This group of variables has to do with your bargaining status relative to the owners.
When you're a rookie in baseball, you have no bargaining status whatsoever. You have to accept what you're paid, and you have no capacity to negotiate a higher salary or move to a different team. But after a couple of years, you become what's known as arbitration eligible, which means that you can challenge the salary that your team offers you. If you make that challenge, then an arbitrator will decide what is a fair salary for your level of skill. And then finally, after six years in the majors, you become a free agent, and you can sell your services to the highest bidder. So your status is going to affect your salary, because your negotiating ability will differ significantly from one status to the next.
And then, finally, the third group of variables that Hakes and Sauer added relates to fielding position. Some positions might be more valuable to the team than others, and therefore we want to take account of that possibility in running these regressions.
So let's now take a look at Hakes and Sauer's Table 3 to see what it is we're going to reproduce this week. Here we have it. Table 3 looks like this, and what you can see is the list of variables I just described down the left-hand side of the table, with on-base percentage and slugging at the top, of course, because those are the ones we're particularly interested in. The columns relate to particular years. The first column covers all the years in the data. The second column covers the years 2000 to 2003, effectively the years before Moneyball was published. The next four columns look at the individual years in that group: 2000, 2001, 2002 and 2003. And the last column is 2004, the year after Moneyball was published.
This table tells a very interesting story. If we look at the first column, which covers all years, you can see that although the impact of on-base percentage appears positive, it's not statistically significant.
It's not statistically significant in the second column either, where we look at the period 2000 to 2003, and it's not statistically significant in any of the individual years between 2000 and 2003. But when we look at the column for 2004, we see that suddenly on-base percentage is statistically significant. And if we compare it with slugging percentage, which is the next row down, we can see that in every year prior to 2004 the coefficient on on-base percentage is smaller than the coefficient on slugging percentage. But in 2004, the coefficient on on-base percentage is larger, much larger, than the coefficient on slugging percentage.
So what Hakes and Sauer concluded from this table was that prior to 2004, baseball teams were not taking on-base percentage into account when offering salaries to players, although they were looking at slugging percentage. But in 2004, the year following the publication of Moneyball, on-base percentage suddenly becomes highly significant in determining player salaries, and in fact has a much larger effect than slugging percentage. That seems to confirm the Moneyball story that was written up in the book and discussed in the film.
So from here on, what we're going to do is set about reproducing this table, using data and using our programming skills in Python. Before we actually start on the process of constructing the data, I'd just like to show a brief schematic diagram to illustrate what's really going on here. We're trying to recreate a regression based on the data that we've got, and we can identify four steps in assembling that data. Those steps involve gathering the salary data, getting the data on on-base percentage and slugging, getting the data that relates to the bargaining status of the player, and then getting the data on the fielding positions of the players. All of that data comes from different sources.
We combine it all into one data set so that, when we're done, we have something that looks like the blue rectangle you can see here. Along the top, you have the names of columns relating to the variables that we're interested in. And along the side, we have an index which tells us which particular player the data relates to. So that is our target: to produce data in this form, which will then enable us to run the regressions we want to run.
Most of what we're going to do this week is actually going through the process of assembling the data, and that looks a little bit like this. You can imagine that you start off with different data frames. Those are these different rectangles in different colors, and we slice and dice the data in different ways in order to get to our ultimate blue rectangle, which will enable us to run the regressions. We're going to go through four steps to build that data set. What you should understand is that this is a cumulative process: we keep adding on to the initial data that we generate in order to build our regressions.
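To make that cumulative process a bit more concrete, here is a minimal sketch in Python using pandas. Everything in it is an assumption made for illustration: the four small data frames, the column names (playerID, yearID, OBP, SLG, PA, Arb, Free, POS) and all of the numbers are made up, and this is not the course data or necessarily the way the course notebooks do it. It just shows the idea of starting with salaries and adding each of the other sources on, one at a time.

```python
import pandas as pd

# Toy, made-up rows purely for illustration; not real player data.
# Step 1: salary data.
salaries = pd.DataFrame({
    "playerID": ["a01", "b02", "c03"],
    "yearID": [2004, 2004, 2004],
    "salary": [500_000, 2_500_000, 9_000_000],
})

# Step 2: on-base percentage, slugging and plate appearances.
batting = pd.DataFrame({
    "playerID": ["a01", "b02", "c03"],
    "yearID": [2004, 2004, 2004],
    "OBP": [0.310, 0.345, 0.390],
    "SLG": [0.400, 0.450, 0.520],
    "PA": [250, 480, 610],
})

# Step 3: bargaining status (1 = yes, 0 = no).
status = pd.DataFrame({
    "playerID": ["a01", "b02", "c03"],
    "yearID": [2004, 2004, 2004],
    "Arb": [0, 1, 0],
    "Free": [0, 0, 1],
})

# Step 4: fielding position.
fielding = pd.DataFrame({
    "playerID": ["a01", "b02", "c03"],
    "yearID": [2004, 2004, 2004],
    "POS": ["C", "SS", "OF"],
})

# Cumulative assembly: start from salaries and merge each source on in turn.
# An inner join keeps only player-years that appear in every source.
master = salaries
for frame in [batting, status, fielding]:
    master = master.merge(frame, on=["playerID", "yearID"], how="inner")

# The "blue rectangle": variables along the top, player and year as the index.
master = master.set_index(["playerID", "yearID"])
print(master)
```

The choice of an inner join here is just one option. It drops any player who is missing from one of the sources, which is convenient for a regression but means you should check how many rows you lose at each step.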