Welcome back to the final part of Moneyball and Beyond, Week 2. So now, having spent all this time constructing the data frame, we're in a position to run the regressions in the same way that Hakes & Sauer did to construct their Table 3. And we're going to see if we can get the same results as they did with our data. So the first thing to do is to cut down our master data frame to the relevant years. We've constructed it for a much wider range of years, which we'll be able to use next week when we extend our analysis to the years since the publication of Moneyball. But right now, we are only interested in the years between 2000 and 2004, which are the years for which the Hakes & Sauer regressions were run for Table 3. And so we limit the data set to that group now. And we can see which variables are in the data frame that we've constructed here, bearing in mind that we want to reconstruct their salary regressions. And you can see here what the variables are: we want on-base percentage, slugging, plate appearances, arbitration eligibility, free agent status, catchers and infielders. So our next lines of code set up that regression. We have to import statsmodels in order to be able to run a regression. We then define the regression model as a formula where the left-hand side is the log of salary, and the right-hand side has on-base percentage, plus slugging, plus plate appearances, plus arbitration eligible, plus free agent, plus catcher, plus infielder. We tell it that the data is derived from our Moneyball data frame; we call this subset of our data frame MB_data. And then this tells Python to print out a summary of the regression results. And if we run this, we get a regression result that looks like this. You can see here the regression output, and you should be getting used to this by now.
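The steps just described can be sketched roughly as follows. This is only an illustration under stated assumptions: the data frame here is filled with random placeholder numbers rather than real player data, and the column names (yearID, OBP, SLG, PA, Arb, Free, Catcher, Infielder, log_salary) and the name reg_all are guesses at a naming scheme, not necessarily the ones used in the course notebook. Only MB_data is a name taken from the lecture.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the master data frame: random numbers, NOT real
# player data. All column names are assumptions made for this sketch.
rng = np.random.default_rng(0)
n = 1000
status = rng.integers(0, 3, n)    # 0 = neither, 1 = arbitration eligible, 2 = free agent
position = rng.integers(0, 3, n)  # 0 = other, 1 = catcher, 2 = infielder
master = pd.DataFrame({
    "yearID": rng.integers(1998, 2008, n),  # seasons 1998..2007
    "OBP": rng.normal(0.33, 0.03, n),
    "SLG": rng.normal(0.42, 0.06, n),
    "PA": rng.integers(130, 700, n),
    "Arb": (status == 1).astype(int),
    "Free": (status == 2).astype(int),
    "Catcher": (position == 1).astype(int),
    "Infielder": (position == 2).astype(int),
})
# Made-up relationship, used only so the regression has something to fit.
master["log_salary"] = (
    10 + 1.5 * master["OBP"] + 2.5 * master["SLG"] + 0.003 * master["PA"]
    + 1.2 * master["Arb"] + 1.7 * master["Free"] + rng.normal(0, 0.5, n)
)

# Cut the master data down to the seasons 2000 through 2004.
MB_data = master[(master["yearID"] >= 2000) & (master["yearID"] <= 2004)]
print(MB_data.columns.tolist())

# Log salary on the left-hand side, the seven regressors on the right.
reg_all = smf.ols(
    formula="log_salary ~ OBP + SLG + PA + Arb + Free + Catcher + Infielder",
    data=MB_data,
).fit()
print(reg_all.summary())
```

With real data, only the construction of master would change; the filtering and the formula would stay the same.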
Looking at what these regressions produce, you have here in a long list all of the values of the coefficients, the standard errors, the t-statistics. And we want to compare these to the data in the Hakes & Sauer table. So what I've done below is cut out the relevant column from the Hakes & Sauer table and the relevant column from the regression that we've just run. And you can see here that these coefficients line up to be almost exactly the same. Again, as we found with Table 1, it's not quite exact, but close enough for us to believe that these are essentially the same data and the same regression. When we talked about this in relation to Table 1, we talked about reasons why they might differ. One of the reasons we mentioned was that different statistical packages produce slightly different numbers, although usually those differences are very, very small and probably wouldn't explain this. And the other factor, which is more likely in this case, is that since the publication of the Hakes & Sauer paper, the Lahman database has been updated to correct errors that were made in the past, and that means that these estimates are going to be slightly different. But again, the focus is on the fact that these estimates are only slightly different. By and large, this is almost exactly the data that was produced by Hakes & Sauer. So the regression we've run here is essentially the regression for all seasons in Table 3 of Hakes & Sauer, the first column, and we find here that the effect of on-base percentage is significantly smaller than the effect of slugging. The next thing we do is reproduce column two of Table 3 by running this regression here, and it's essentially the same story. And again, the same idea here: I've put alongside the extract of Table 3 for the relevant seasons, this is 2000 to 2003, the coefficients from our regression, and once again you can see these are more or less identical.
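One way to line up the two pooled regressions for a by-eye comparison is to collect their coefficients and standard errors in a single data frame. The sketch below again runs on random placeholder data, so its numbers mean nothing; the helper make_synthetic_data and all column names are hypothetical, and the published Hakes & Sauer values are deliberately not reproduced here.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def make_synthetic_data(n=1000, seed=0):
    """Hypothetical helper: random placeholder data, NOT real player data."""
    rng = np.random.default_rng(seed)
    status = rng.integers(0, 3, n)    # 1 = arbitration eligible, 2 = free agent
    position = rng.integers(0, 3, n)  # 1 = catcher, 2 = infielder
    df = pd.DataFrame({
        "yearID": rng.integers(1998, 2008, n),
        "OBP": rng.normal(0.33, 0.03, n),
        "SLG": rng.normal(0.42, 0.06, n),
        "PA": rng.integers(130, 700, n),
        "Arb": (status == 1).astype(int),
        "Free": (status == 2).astype(int),
        "Catcher": (position == 1).astype(int),
        "Infielder": (position == 2).astype(int),
    })
    df["log_salary"] = (
        10 + 1.5 * df["OBP"] + 2.5 * df["SLG"] + 0.003 * df["PA"]
        + 1.2 * df["Arb"] + 1.7 * df["Free"] + rng.normal(0, 0.5, n)
    )
    return df

formula = "log_salary ~ OBP + SLG + PA + Arb + Free + Catcher + Infielder"
master = make_synthetic_data()

# Column one: all seasons 2000-2004. Column two: 2000-2003 only.
all_seasons = master[master["yearID"].between(2000, 2004)]
pre_seasons = master[master["yearID"].between(2000, 2003)]
reg_all = smf.ols(formula, data=all_seasons).fit()
reg_pre = smf.ols(formula, data=pre_seasons).fit()

# Coefficients and standard errors side by side, one row per regressor.
comparison = pd.DataFrame({
    "coef 2000-2004": reg_all.params,
    "se 2000-2004": reg_all.bse,
    "coef 2000-2003": reg_pre.params,
    "se 2000-2003": reg_pre.bse,
})
print(comparison.round(3))
```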
Now we can go through this step by step and painstakingly compare line with line. But that could get a little bit hard on the eyeballs after a while. And so we want a way to reproduce Table 3 in a format that looks similar, with all the columns stacked up next to each other, and we can do this with the following commands. So firstly, we run all of the regressions and create the regression outputs without displaying them. So you can see here the name of the regression output is given here, and the data that is to be used is defined here. And so these are the regressions for each of the individual years: 2000, 2001, 2002, 2003 and 2004. Now, when we run them, Python creates these outputs but doesn't display them. It only displays things when we tell it to, and we're going to tell it to display them in a minute. So first, we're just going to run them so that they are created in the background. And now we're going to use something that we've used before from statsmodels, the summary_col option. And that will allow us to put into column format the coefficients from each of the regressions that we list here under salary column. So we've got the names of each of the regressions here, and this has the nice option as well that we can define the order in which the variables are going to appear, and we can use the same order as they appear in the Hakes & Sauer paper. And we can also create a header, which here defines the names that will appear at the top of each column, and again we can use the same names as appear in the Hakes & Sauer paper. And then we can run that. And you can see here this is essentially our reproduction of Table 3 of the Hakes & Sauer paper. And once again, beneath this I've matched Table 3 from Hakes & Sauer with our estimates. What you can see here is that the pattern of coefficients, and not just the coefficients themselves but also the standard errors, is almost identical.
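A minimal sketch of the stacked table might look like the following, assuming one regression per season on the same formula. As before, the data are random placeholders and the helper make_synthetic_data, the column names and the column headers are hypothetical; summary_col itself is the statsmodels function, with regressor_order setting the row order and model_names setting the column headers.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.iolib.summary2 import summary_col

def make_synthetic_data(n=1000, seed=0):
    """Hypothetical helper: random placeholder data, NOT real player data."""
    rng = np.random.default_rng(seed)
    status = rng.integers(0, 3, n)    # 1 = arbitration eligible, 2 = free agent
    position = rng.integers(0, 3, n)  # 1 = catcher, 2 = infielder
    df = pd.DataFrame({
        "yearID": rng.integers(1998, 2008, n),
        "OBP": rng.normal(0.33, 0.03, n),
        "SLG": rng.normal(0.42, 0.06, n),
        "PA": rng.integers(130, 700, n),
        "Arb": (status == 1).astype(int),
        "Free": (status == 2).astype(int),
        "Catcher": (position == 1).astype(int),
        "Infielder": (position == 2).astype(int),
    })
    df["log_salary"] = (
        10 + 1.5 * df["OBP"] + 2.5 * df["SLG"] + 0.003 * df["PA"]
        + 1.2 * df["Arb"] + 1.7 * df["Free"] + rng.normal(0, 0.5, n)
    )
    return df

formula = "log_salary ~ OBP + SLG + PA + Arb + Free + Catcher + Infielder"
master = make_synthetic_data()
years = [2000, 2001, 2002, 2003, 2004]

# Fit one regression per season; nothing is displayed at this stage.
results = [smf.ols(formula, data=master[master["yearID"] == y]).fit() for y in years]

# Stack the fitted models as columns, with rows in a chosen order.
table3 = summary_col(
    results,
    model_names=[str(y) for y in years],
    regressor_order=["OBP", "SLG", "PA", "Arb", "Free", "Catcher", "Infielder", "Intercept"],
    stars=True,
    float_format="%.3f",
)
print(table3)
```

Each cell shows a coefficient with its standard error in parentheses underneath, which is the layout that makes a column-by-column comparison with a published table practical.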
And so what we've shown is that we are able to reproduce with our data the story told by Hakes & Sauer about Moneyball. And this is, I would argue, quite impressive. Reproducing published statistical results is actually a problem in a lot of scientific fields. There are many results that cannot be reproduced, and it is a testament to the quality of the work that Hakes & Sauer did that we can exactly, or almost exactly, reproduce their results. And our results confirm the story that Hakes & Sauer told in their paper. We can see that the coefficient on on-base percentage is insignificant in the seasons before the publication of Moneyball, and only becomes significant in the year 2004, the year after Moneyball was published. Slugging percentage, on the other hand, is consistently statistically significant in every season prior to the publication of Moneyball, and it's still statistically significant in the year after Moneyball is published. But in that year the size of its coefficient is about half the size of the coefficient on on-base percentage, which suggests that on-base percentage became much more important in determining salaries. So in that sense, the Moneyball story is confirmed. On-base percentage matters for winning, but coaches, scouts, general managers and owners did not seem to value on-base percentage prior to the publication of Moneyball. Once Moneyball was published and people were brought to realize that on-base percentage was significant, then finally players started getting rewarded in their salary for the capacity to have a high on-base percentage, and in particular to draw walks. So in these first two weeks, we've now managed to reconfirm the story of Moneyball through the Hakes & Sauer paper and data. What we're going to move on to next week is to go beyond this by looking at how the Moneyball story stands up in the years following 2004. Can we see whether the story has remained true?
What has been the relationship between salaries in particular and on-base percentage in the years following the publication of Moneyball and up to more recent times? And in fact, we can even go back a little bit before the period covered by Hakes & Sauer as well, and try to look at a longer sweep of history, to see whether we can find a consistent pattern over time. So that's what we're going to move on to when we come back next week to continue our analysis of Moneyball and beyond.