Welcome back to Week 2 of Moneyball and Beyond. We're now moving on to step 4, the last step in the process of creating our data frame so that we can run the regressions to reproduce the Hakes and Sauer table 3, and illustrate the relationship between player salaries and on base percentage and slugging percentage. So the last step is to take account of the fielding positions of the players. And so we need to have some data which tells us where the players play on the field, and for the purposes of reproducing the Hakes and Sauer table, they defined players according to two criteria. They defined catches separately, and they defined infielders separately, by infielders, they meant second baseman, third baseman or short stops. So we're just going to reproduce that. And the implication is that everybody else falls into the default category of player. Okay, so we have a data file on the Lahman database, which tells us where players play on the field and it's called appearances, and if we load that up here, you can see the codes for appearances are all found along the top here. So if you see around the middle column, you see G_p that refers to pictures, G_c that refers to catches, G_1b that's first baseman and then you can see second basement, third baseman, short stops, left field, center field, right field, and so on. So, we're going to use this in order to construct reconstruct really what Hakes and Sauer did in their paper. So the first thing is to look at the aggregate playing position of the players across all of their appearances. And this is the same problem we ran into before. In any given season, a player could have played for more than one team, so they have potentially multiple stints in any given season. And we need to aggregate the players according to their stints, so that we just have in each year we each player appears only once. We did this earlier on, we're going to do this again. We run this code again. And in addition, we restrict the data to just the variables were interested in which are the fielding positions of the players. So we have the player ID, the year in which they played, and they're fielding position they took. Now notice here that in fact, a player could have played in more than one position in a year, and that complicates our analysis a little bit. Now we can see from the day to here how many appearances each player had in each position. But what we want to do is figure out what was the position in which they appeared most frequently, since a player could have played in more than one position in any given season, and you can see that here. So the first few players are actually none of them are are fielding positions, so they have no appearances. Then roles 9 and 10, referred to the great Hank Aaron in 1954-1955, and you can see in 1950 for he played 116 games as an outfielder, but in 1955 he played 100, and 26 games as an outfielder, but also 27 games as a second baseman. So what we want to do is find the position in which he played most frequently. We're going to say that is the position in which he was fielding. This line of code will do that for us, it specifies variable Max_G, and identifies the Max_G is the highest value if all of these columnn dot max will give us the maximum value. So if we run that line of code now, we get the following and you can see here. So there you can see 1955 Hank Aaron on his Max_G is indeed 100 and 26 which is his position as outfielder. So we're going to use that now to say, right, the your fielding position is the position for which you have the value Max_G. Wherever you have a value of Max_G that defines your fielding position. So to do that we need to actually use define a function to do that calculation, and that requires us to use this code def to write a function. And you can see here that we write def, then position which is the name of our function is going to take, and then the next lines of codes give instructions for when you run the function. So when you run the function for position, if Max_G is equal to G_C, which is a catcher, then return the value C for catcher. Otherwise, when you run the function Max_G is equal to first base, then return the value first base, and so on. So this gives the different possible choices which are available for position dependent on the value of Max_G. And so we then say to apply this function to appearances and use that to define position, which is going to be based on what the result of this function is going to be. And so if we run that now, You can see what happens here. Okay, so the position is in the last column and you can see, for example, when it comes to Hank Aaron, we get a position of outfielder both in 54 and 55 which was the value of Max_G. Notice that these players for whom we have no fielding position at all, we actually returns the first value, which is C for catcher. Actually, we're going to discard those rows when they have 0 values, so they won't interfere with our analysis. But you can see here as we go through, we get values for first baseman, outfielders, and so on as we go down the data shortstop, second base, and so on. So as you can see, we had those players with 0 fielding appearance is defined as catches. So we're just going to discard those values by saying, well, only include Max_G, when the value is greater than 0. And we can then now just narrow down our data frame because the only thing we actually want is the fielding position, the player ID, and the year ID. So that then we can merge this into our data frame. If we run that, our appearances data now just looks like this, which says, what the player ideas the year ID, and the fielding position. And now the thing we want to be able to say is, what is the fielding position as defined by Hakes and Sauer? Remember, they only define two categories, catcher and infielder. Well, we can create a dummy variable for catcher, which is equal to 1, if the value of position is C and zero otherwise and a dummy variable for the position of infielder which has a value of 1 if your position is second base, third base or shortstop and zero otherwise, and we will see it would look like. Yeah, and so finally, we can merge this data into our master dataset using PD merge again. So this looks exactly as it did in the previous steps when we carried out emerge. And when we run this, we should now have all of the variables that we need in order to run the Hakes and Sauer regressions, And that's what we're going to do next.