In this video, we're going to go ahead now and read in the batting data that we'll be using for regression analysis. And then we'll go ahead and merge this batting data with the salary data that we created in our previous video. So to start out with, we're going to go ahead here and read in our batting data, and you can see in the markdown, we've given a little description here. This batting data was obtained from Sean Lahman's baseball database, and we have a link up here for Sean Lahman's website. Sean Lahman is a journalist and baseball analyst who has done a really great job providing open-source data for baseball. He's got a number of different datasets available on the web, and we have a second link here to definitions. So definitions for the datasets that we'll be using, and for the other datasets that are available in Sean Lahman's baseball database, can be found if you click on this link. So once again, many thanks to Sean Lahman for making this data available to use. To start out with, we are going to now go ahead and read in our batting data. So we start out with just calling this dataset batting, and we are using read_csv to read in the data, and you can see that we have this pd.set_option with display.max_columns. This line of code tells Jupyter to display a maximum of 50 columns, and we can go ahead and run that cell and see what the data looks like. So we can see now, there are actually only 22 columns, so it'll just show all 22 columns. If we had more than 50 columns, then it would only display 50 columns, since that's what we input into our set_option. And you can take a look and see, we have a number of raw statistics for baseball at the player level, and what I want to call out is one variable in particular. So we have the player ID. We have the year ID, so that's the season. Then we have the stint variable here, the third variable in the data.
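The read-in step described above might look like the following minimal sketch. The file name Batting.csv is an assumption (that is how the Lahman database is commonly distributed); here a tiny in-memory CSV with a few of the 22 Lahman columns stands in for the real file so the snippet is self-contained.

```python
import io
import pandas as pd

# Tell pandas (and therefore Jupyter) to display up to 50 columns
# before truncating the view.
pd.set_option("display.max_columns", 50)

# Stand-in for the real Lahman Batting.csv (file name is an assumption);
# in the notebook this would be: batting = pd.read_csv("Batting.csv")
csv_text = """playerID,yearID,stint,teamID,AB,H
jeterde01,1998,1,NYA,626,203
"""
batting = pd.read_csv(io.StringIO(csv_text))
print(batting.shape)  # (rows, columns)
```

With the real file, the same `read_csv` call loads all 22 columns, and the `display.max_columns` option ensures none of them are hidden when the DataFrame is displayed.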
This stint variable is created for players that played on multiple teams in a season. So, for example, say a player started the season with one team, then got traded in the middle of the season and finished with another team. That player would have a stint of one for his first team, along with the corresponding statistics for that first team, and then a stint of two for the second team, with the corresponding statistics for that second team. Now what we want to do is collapse that into one row of data. We want to aggregate all statistics based on that one season of data. What we don't want to do is run our regression on partial seasons of data. So to do that, we're going to go to our next step here, and we are going to use a groupby function. So we are going to overwrite our batting data, and we're going to do a groupby statement on player ID and year ID, and then you can see a .sum after that. So what this line of code is going to do is sum all variables by player ID and year ID, so that each player has one observation, or one set of statistics, for every given season. So we are aggregating a player's statistics if he played on more than one team in the season. We can go ahead and run that piece of code, and if you scroll down to the bottom, you can see we have 96,612 observations now. Now, for the next portion, we're going to subset our data. This data actually dates back to 1871 in Sean Lahman's database, and you can see that in our initial data step, we have data dating back to 1871. To look at the Moneyball story, we probably don't need to go that far back. What we're going to do now in this step is filter our data to just some of the more recent seasons that are likely to be a little bit more relevant to our Moneyball analysis. So you can see in this step here, we are going to filter our batting data so that year ID is greater than or equal to 1993, and year ID is less than or equal to 2016.
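The groupby aggregation and year filter described above can be sketched as follows. The toy DataFrame is an assumption standing in for the real data: one player with two stints in 1998 (a mid-season trade) that should collapse into one row, plus an 1871 season that the year filter should drop.

```python
import pandas as pd

# Toy data in place of the real Lahman batting table; column names
# follow the Lahman conventions (playerID, yearID, stint, AB, H).
batting = pd.DataFrame({
    "playerID": ["smithjo01", "smithjo01", "oldtimer01"],
    "yearID":   [1998, 1998, 1871],
    "stint":    [1, 2, 1],
    "AB":       [300, 250, 100],
    "H":        [90, 70, 30],
})

# Collapse multi-stint seasons into one row per player per season,
# summing the raw statistics across stints.
batting = batting.groupby(["playerID", "yearID"], as_index=False).sum()

# Keep only the seasons relevant to the Moneyball analysis.
batting = batting[(batting["yearID"] >= 1993) & (batting["yearID"] <= 2016)]
print(batting[["playerID", "yearID", "AB", "H"]])
```

After these two steps, the traded player's two partial seasons become a single 1998 row with 550 at bats and 160 hits, and the 1871 row is gone.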
And then we have one other subset statement here. We're going to subset our batting data so that at bats are greater than or equal to 130. 130 was chosen because that is the cutoff for whether a player is considered a rookie or not. And really, the purpose of this cutoff, this 130, is to filter out any players that had very few at bats in a given season. So, for example, a minor leaguer might come up and only get 10 at bats in the season. We don't want those types of players to skew our data and our results. So that is why we are including this greater-than-or-equal-to 130 here. With that subset statement, we can go ahead and run that. And now you can scroll down to the bottom again, and you can see we have 9,303 rows. So we've subset our data to the time frame that we wanted, and with a minimum number of at bats. Now, in the next step, we can go ahead and start to calculate some of the statistics that we need for our regression analysis. So we see here the formula for plate appearances. Plate appearances is equal to at bats, plus walks, plus hit by pitches, plus sacrifice hits, plus sacrifice flies. As we saw in the last video, on-base percentage is equal to hits, plus walks, plus hit by pitches, divided by at bats, plus walks, plus hit by pitches, plus sacrifice flies. And slugging percentage is equal to singles, plus two times doubles, plus three times triples, plus four times home runs, divided by at bats. And remember, we don't explicitly have singles in our dataset, but we can easily calculate singles by taking hits, minus doubles, minus triples, minus home runs. So now we can go ahead and calculate these three statistics. You can see we create PA for plate appearances, OBP for on-base percentage, and SLG for slugging percentage. And the formulas for each one of these on the right-hand side follow exactly the formulas we have here in our markdown. So with that in mind, we can go ahead and run that piece of code.
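The at-bat cutoff and the three derived statistics can be sketched like this. The two toy rows are an assumption standing in for real player-seasons; the column names (AB, H, 2B, 3B, HR, BB, HBP, SH, SF) follow the Lahman conventions, and the formulas mirror the ones given above.

```python
import pandas as pd

# Toy player-seasons: one regular, and one short call-up that the
# 130 at-bat rookie cutoff should remove.
batting = pd.DataFrame({
    "playerID": ["goodba01", "cupofco01"],
    "yearID":   [2001, 2001],
    "AB":  [500, 10],
    "H":   [150, 2],
    "2B":  [30, 0],   # doubles
    "3B":  [5, 0],    # triples
    "HR":  [20, 1],
    "BB":  [60, 1],   # walks
    "HBP": [5, 0],    # hit by pitch
    "SH":  [2, 0],    # sacrifice hits
    "SF":  [3, 0],    # sacrifice flies
})

# Drop partial seasons below the 130 at-bat rookie cutoff.
batting = batting[batting["AB"] >= 130]

# Plate appearances, on-base percentage, and slugging percentage.
batting["PA"] = (batting["AB"] + batting["BB"] + batting["HBP"]
                 + batting["SH"] + batting["SF"])
batting["OBP"] = (batting["H"] + batting["BB"] + batting["HBP"]) / (
    batting["AB"] + batting["BB"] + batting["HBP"] + batting["SF"])

# Singles are not in the data, so derive them from hits.
singles = batting["H"] - batting["2B"] - batting["3B"] - batting["HR"]
batting["SLG"] = (singles + 2 * batting["2B"] + 3 * batting["3B"]
                  + 4 * batting["HR"]) / batting["AB"]
print(batting[["playerID", "PA", "OBP", "SLG"]])
```

For the surviving row, PA works out to 570 and SLG to exactly .500, which is a quick sanity check that the formulas are wired up correctly.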
And then you can see on the right-hand side, we have plate appearances, on-base percentage, and slugging percentage now in our data. And the last thing for this video is we are going to merge our batting data into our salary data. If you remember, what we want to do in our data frame is link a player's previous year's performance to his next year's salary. That's the way we want to link it so we can run our regression analysis that way. So to do that, we're going to create this one-year lag between the year ID and the salary year. So we have here batting and salary year, so that we're creating a salary year variable, and that's just going to be equal to the year ID plus 1, so we can run that. And then we can see on the right-hand side there, if we scroll, we have the salary year for every player. And then to merge our batting data and our salary data, we are going to carry out this final step here. So we're going to overwrite our master data, and what we're going to do is merge the batting data we have created in this video with our previous master data, which was essentially just the salary data. And the variables that we'll be merging on are the salary year and the player ID. So those are the two variables that overlap; the two datasets are going to link together using these two variables. So using this merge statement, we can go ahead and run that, and we can see what that looks like. You can see now we have the player ID and the year ID. We have all of our batting data on the left. And if you scroll to the right now, we have all of our salary data: salary year, team ID, league ID, salary, and our log salary. And you can also notice that for every season, so for our first observation here, for the year 1998, we have the player's statistics from 1998, but we have his salary for the next year.
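The one-year lag and the final merge described above might look like the following sketch. The column name sal_year for the salary-year variable is an assumption, as are the toy rows: one 1998 batting season and one 1999 salary record that the merge should link together.

```python
import pandas as pd

# Toy batting data (1998 performance) standing in for the data built
# in this video.
batting = pd.DataFrame({
    "playerID": ["smithjo01"],
    "yearID":   [1998],
    "OBP":      [0.380],
})

# Toy salary ("master") data from the previous video; sal_year is an
# assumed name for the salary-year column.
master = pd.DataFrame({
    "playerID": ["smithjo01"],
    "sal_year": [1999],
    "salary":   [2_000_000],
})

# One-year lag: a 1998 season should link to the 1999 salary.
batting["sal_year"] = batting["yearID"] + 1

# Merge performance onto next-year salary by player and salary year.
master = pd.merge(batting, master, on=["playerID", "sal_year"])
print(master)
```

The merged row carries the 1998 statistics alongside the 1999 salary, which is exactly the previous-performance-to-next-salary link the regression needs.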
So we've successfully linked a player's previous year's performance to his next year's salary, which is exactly what we want in this regression. So now we have the salary data and the batting data in our master dataset. And the next step is to add in a couple of experience variables based on a player's arbitration eligibility or free-agent eligibility. We're going to add these variables into this master dataset to continue our analysis, and we will go ahead and get to that in our next video.