Welcome back to week two of Moneyball and Beyond. We are now moving on to step three in the process of assembling the data so that we can run the regressions to reproduce Hakes and Sauer's Table 3. This part focuses on the negotiating status of the players. There are three possible states a player can be in. They can be a rookie, in which case they have no negotiating capacity whatever: they have to accept the salary that they're offered. Or they can be arbitration eligible, which means that if they are not satisfied with the salary they are offered, they can go to arbitration, and an independent arbitrator will decide what is a fair salary for the player. Or, after six years of service, a player can become a free agent, in which case they can sell their services to the highest bidder. We need to reproduce in the data a measure of this status. We're going to do that using data from the Lahman Database again, and this time we're going to use a file called People, which contains a number of biographical characteristics relating to the player. If we run that line here, you can see there's a lot of information about the individual player and their history. We can see in this file the year in which the player was born, the day and month in which they were born, the country, state, and city they were born in, and so on. So there's a lot of detailed information, but we're only going to use one piece of it for our analysis, and that's the debut year of the player. In fact, the data contains something more specific than that: the exact date of their debut.
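To make that concrete, here is a minimal sketch of loading and inspecting the file. The rows are made up and read from an in-memory string so that the example runs on its own. The column names follow the Lahman People file, but the player IDs, birth details, and dates are hypothetical, and the real file path is an assumption.

```python
import io
import pandas as pd

# Hypothetical stand-in for the Lahman People file: the column names are real,
# but every row below is invented purely for illustration.
sample_csv = io.StringIO(
    "playerID,birthYear,birthMonth,birthDay,birthCountry,birthState,birthCity,debut\n"
    "playera01,1975,4,12,USA,CA,Fresno,1998-04-03\n"
    "playerb01,1978,9,30,D.R.,,Santo Domingo,2001-07-15\n"
)

# With the real data this would be something like pd.read_csv("People.csv"),
# where the path depends on where the file has been saved (an assumption).
People = pd.read_csv(sample_csv)

# List the available columns, then look at the one we actually need.
print(People.columns.tolist())
print(People[["playerID", "debut"]])
```

Note that debut is read in as text in the form year-month-day, which is why the string handling that follows is needed.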
To establish their arbitration status, we only need to know the year in which they debuted. So what we're going to do is take the debut date from this file, slice it so that it only tells us the year of their debut, and then merge that into our dataset in order to calculate how many years the player has been playing in the major leagues, and hence their negotiating status. Rather than continue to work with this very large dataset, we're just going to produce a smaller version which contains only the player ID and the debut date of the player. So we've just got two columns now, player ID and debut. Now we revisit something we looked at before, which is a problem relating to strings and integers, and how we cut up a particular set of characters to select just the part that we're interested in. You can see here that debut takes the form of a code: the first four digits tell us the year, then there's a hyphen, then the next two digits tell us the month, then another hyphen, and then the last two digits tell us the day of the month. We only want the first four digits of this code, and that's easy to extract. What we need to do first is tell Python that we're treating this as a string variable, so that it's essentially a piece of text. Then we say that, for each row, we want to cut the string after the fourth character, so we get the four digits of the year and nothing else. The command is in fact relatively straightforward. You can see here we create a variable for debut year called debutyr. It's formed by taking the variable debut and applying .astype(str), which tells Python to treat this variable as a string, and then .str[0:4], which says take the first four characters of the string.
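The two steps just described, cutting the data down to two columns and slicing out the year, might look something like the sketch below. The column names playerID and debut follow the Lahman People file and debutyr is the name used in the lecture. The rows are hypothetical, and the DataFrame names People and debut are assumptions.

```python
import pandas as pd

# Hypothetical rows standing in for the much larger People file.
People = pd.DataFrame({
    "playerID": ["playera01", "playerb01", "playerc01"],
    "birthYear": [1975, 1978, 1980],
    "debut": ["1998-04-03", "2001-07-15", "2003-09-01"],
})

# Keep only the two columns we need; .copy() avoids modifying a view of People.
debut = People[["playerID", "debut"]].copy()

# Treat debut as text, then keep characters 0 to 3, i.e. the four-digit year.
debut["debutyr"] = debut["debut"].astype(str).str[0:4]

print(debut)
```

The result in debutyr is still text at this point, not a number.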
The [0:4] slice keeps only those first four characters and leaves everything else out. If we run that, we'll see that we get exactly what we wanted: we now have debutyr as the first four characters of that string, so we have a variable which is the year. Now one thing to bear in mind is that, as far as Python is concerned, this variable is a string, not an integer. If we want to use it as an integer, we'll actually have to tell Python to go back to treating it as an integer. Python will treat variables in the way that we tell it to, depending on what commands we use. The next thing is just a little piece of housekeeping. We don't need the debut variable any more, so we can cut that out and restrict debut to two columns, player ID and debut year, and then we want to merge this into our master dataset. We're going to use the pd.merge command again: we merge Master and debut, and we merge them on the player ID. The how equals left option tells it to keep every row of Master and just tack on this new variable as the last column, and then we display Master. We should see that with this merge we now have debut year added to the end of the dataset, and indeed, if you look at the last column there, you can see debut year attached to our Master DataFrame. The next thing we want to do in this process is calculate how many years of experience any given player has in any particular season. How long have they been playing in the majors? We can define that as the difference between the year ID, which tells us the year that we are looking at for any given player, and the debut year, the year in which they first started playing. The only thing to note here is that this time we have to tell Python to treat debut year as an integer. We are treating it as a number so that this subtraction will actually produce a number: how many years of experience the player has. We're going to call that exp, e-x-p for experience.
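As a rough sketch of the housekeeping, the merge, and the experience calculation together, assuming DataFrames named Master and debut with the columns described above (the rows here are hypothetical, and the real Master has many more columns):

```python
import pandas as pd

# Hypothetical player-season rows standing in for the real Master DataFrame.
Master = pd.DataFrame({
    "playerID": ["playera01", "playera01", "playerb01", "playerc01"],
    "yearID": [1999, 2005, 2004, 2004],
})

# Hypothetical debut table, as it looks after slicing out the year.
debut = pd.DataFrame({
    "playerID": ["playera01", "playerb01", "playerc01"],
    "debut": ["1998-04-03", "2001-07-15", "2003-09-01"],
    "debutyr": ["1998", "2001", "2003"],
})

# Housekeeping: drop the full debut date and keep only the ID and the year.
debut = debut[["playerID", "debutyr"]]

# how="left" keeps every row of Master and attaches debutyr where the ID matches.
Master = pd.merge(Master, debut, on="playerID", how="left")

# debutyr is text, so convert it to an integer before subtracting.
# (If some players had no match, debutyr would be missing for those rows and
# this conversion would fail; every player in this sample has a match.)
Master["exp"] = Master["yearID"] - Master["debutyr"].astype(int)

print(Master)
```

Because the merge is on playerID alone, a player who appears in several seasons gets the same debut year on every row, and exp then differs by season.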
If we look at that in our DataFrame, we can just scroll to the end, and you can see we now have a number of years of experience for each player, ranging from zero, which is the first year that they play in the majors. You can see an example there of a player with seven, and there are players with 10, 12, 14, 15 and so on. So we have a range of years of experience for the players. Now remember, our interest is in defining these three bargaining categories, and these categories are defined by the number of years of experience that you have. If you have fewer than three years of experience, you're a rookie; that will be our base group, if you like. We're going to define two other groups. One is arbitration eligible: that's a player whose experience is greater than or equal to three years and less than or equal to six years. We call that arb, a-r-b, for arbitration eligible players. Then, in addition, we'll have free agents, and these are players whose experience is greater than six years. We call that free. These variables arb and free are dummy variables. arb is going to be a one if you are a player who's arbitration eligible in that season, or zero if you're not. Likewise, free will be a one if you're a free agent in that season and zero otherwise. If we run that again and scroll along to the end, we can see the variables that we've just created. Just over here we can see that the first player in our list has one year of experience, so is neither arbitration eligible nor a free agent. The second player has three years of experience, so is arbitration eligible. Obviously, you can be arbitration eligible, or a free agent, or neither, but you can't be in more than one of these categories at the same time. The third player has one year of experience, so is not arbitration eligible or a free agent, and so on. We can see that we have now defined these variables depending on the bargaining status of the player.
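One way to build the two dummy variables from exp is sketched below. It uses np.where, which is an assumption about the method rather than something confirmed by the lecture, and the rows are hypothetical. The thresholds are the ones stated above.

```python
import numpy as np
import pandas as pd

# Hypothetical player-seasons with experience already calculated.
Master = pd.DataFrame({
    "playerID": ["playera01", "playerb01", "playerc01",
                 "playerd01", "playere01", "playerf01"],
    "exp": [1, 3, 1, 6, 7, 12],
})

# Arbitration eligible: three to six years of experience, inclusive.
Master["arb"] = np.where((Master["exp"] >= 3) & (Master["exp"] <= 6), 1, 0)

# Free agent: more than six years of experience.
Master["free"] = np.where(Master["exp"] > 6, 1, 0)

# Rows with arb == 0 and free == 0 form the base (rookie) group.
print(Master)
```

Since the two conditions cannot both hold for the same value of exp, no row ever has both dummies equal to one, which matches the point that a player can only be in one category at a time.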
That completes step three. Now we can move on to the fourth and final step in creating the DataFrame, and that step is going to define the fielding position for each player in our data.