Learn how probability, math, and statistics can be used to help baseball, football and basketball teams improve, player and lineup selection as well as in game strategy.

Loading...

From the course by University of Houston System

Math behind Moneyball

24 ratings

University of Houston System

24 ratings

Learn how probability, math, and statistics can be used to help baseball, football and basketball teams improve, player and lineup selection as well as in game strategy.

From the lesson

Module 8

You will learn how to use game results to rate sports teams and set point spreads. Simulation of the NCAA basketball tournament will aid you in filling out your 2016 bracket. Final 4 is in Houston!

- Professor Wayne WinstonVisiting Professor

Bauer College of Business

Okay, we are now going to start teaching you how to rate sports teams given the scores of games in major sports. This would work for baseball, soccer, football, hockey, sorry not baseball, basketball. And the reason we'd have trouble with baseball is cause the starting pitcher is such a crucial part of it. In hockey, I guess, the goaltender but most teams use the same goaltender most of the time. [COUGH] Excuse me. So let's start talking about the 2013 NFL season.

And so what we want to learn is if we've got the score of every game, there's 256 regular season games, and we have the playoff games here.

The reason it doesn't say 256 there is because there are a bunch of rows that involve just, not the game score, but just a heading. But that will be okay. When we use our formulas, you'll see how that goes. We'll use our famous if ever, okay. So to understand how to rate teams, let's take a simple example, okay, this one here.

And you have to assume you know how good each team is and then, I know we don't at this point, but then we can figure it out. So the Texans might be, let's say, one point better than average, and the Colts would be five points better than average. Let's suppose that's true. And you have to know where the game is suppose, and again, it's at Indy. And home edge is about three points.

In college football, pro football, and pro basketball, home edge seems to hover around three points. In college basketball, it seem to be more like four points, I don't know why. There's a nice book called Scorecasting that talks about what's the main cause of home edge.

And the main cause that the authors came up with is that it's that the refs want to please the home fans. And they had some very clever ways of trying to figure that out. But I think we'll just try and evaluate the home edge rather than try to figure out why it occurred.

Okay, so if that game is at Indianapolis, the Colts will be four points better, five minus one. So we have the Colts five, you take their ability minus the opponents, plus the home edge, and we're always going to try and predict the amount by which the home team wins. Games in London, I'm not going to worry about considering them separately, you really shouldn't have a full home edge there, but we'll just take whoever was the home team gets the home edge in those games, to keep our life simple. If you really are worried about that, you could say games on the neutral field don't put the home edge in your prediction equation. Okay, now if the game was at Texas, at the Texans in Houston, okay well, the Texans are a one, and the Colts are a five. So the home team would be four points worse, okay, and then I would add three to that, okay.

So at Indy, it's Colts by seven. And at the Texans, it's the Texans by minus one or the Colts by one.

So the swing there is six points from home to away, and if you look at NFL point spreads, you'll see pretty much [COUGH], sorry again, for the divisional games. That there's usually a swing of around six points from the spread when a team is home to the spread when a team is away. Okay, so now we're going to use the Excel solvers. So there's three parts to the Excel solver, we briefly.

Looked at this in the previous video. Okay, but there's three parts. There's the target cell, which you want to maximize or minimize.

And we could weight more recent games more but we're not going to worry about that. But that's really what regression does, it the least squares equation that we've been looking at in regression minimized with the sum of squared errors, okay. So our changing cell is the second part of the self remodel, changing cells. What are the unknowns? Well these are the ratings of each team. How good is each team? We use the solver on just the plus minus as we saw a couple of videos back. Rating of each team, and we'll make zero is an average team, just for convenience. So a plus five rating means five points better than average.

A minus four means four points for. Now you could start with any numbers, that's the beauty of this. Okay, now the only constraint, so optimizations have a constraint, and the key constraint we're going to have is the average rating in zero. Just to make sure zero is an average team, plus rating is an above average team, etc. So we minimize the sum of the squared errors in predicting the home margin.

Now you might say why not minimize the sum of the absolute errors? Again squared errors and absolute errors, the positives and negatives don't cancel out when you square absolute errors. Squared errors minimization seems to predict the future better.

If you use absolute errors the ratings are more based on a typical performance, which I would think would be better, but it just doesn't turn out to be.

And that's cause you're not squaring the errors. Okay, so here we got the scores of these games and we're going to use, again, through the playoffs start in rows 280. So we're going to use [NOISE] through row 279 in our model and try and predict the playoffs here.

Okay, so the data we have, is how many points the home team scored, how many points the away team scored, we want to predict the difference. The difference is just the home score minus the away score, and if there was no game, I use nip error to put a blank in there. See, right there, there's a blank, that makes life easier. So for example, in that first game the home team was Denver, they won by 22. In the third game the home team was Jacksonville, they don't beat much of any team, and they lost that game by 26. Chiefs had a very good year that year, although most people didn't expect that. Okay, so we want to predict this column. So our changing cells are the ratings of the teams, which are highlighted in yellow, and also the home edge. And we average the ratings to make them come out to zero, but we can start with any numbers in those yellow cells and things will work out fine. All right.

Okay, these don't even average to zero, but it gives us a way to easily check our formulas, and let's say we have a home edge of two, okay none of those are correct.

All right, so now what we want to do, the key is to predict the outcome of each game and we use the principal over here. We take the home rating minus the away rating to predict the home, not the home team, but who would win by and add on the home edge. And we can start with any numbers and we minimize squared errors. And if we have ratings that don't make much sense, then our squared errors will be really high and then the solver will change things to make it work. Okay, so the forecast, this is the key, do an if error. How much will the home team win by? You take the home edge, then you look up the home team rating, that's what this does, you need a few look up balls because we have the team names. Then you subtract off the away team rating, okay, and that would be your prediction. So for example, Denver's rating here is a three, Baltimore is a ten, okay.

So I take Denver, they're three points better than average, Baltimore was ten points better than average. I think Denver would lose by seven, but they get two for home field. So that would lose by five is the forecast. Now to check this, you could go formulas, evaluate formula. Okay, so the home edge is two. It looks up Denver, they're are three. And then it's going to subtract off Baltimore, which is a ten. Five minus ten is minus five, and that's how we get the minus five. Now what is the squared error? It's the actual amount by which the home team won by minus the forecast that we square that. Do I do absolute errors, you would simply put ABS for absolute error in there. I think, maybe we'll try that out, okay.

I went through row 290, I think we want to go down through, again, row 279 for the record. Okay, we're minimize the sum of the squared errorrs. Okay now lets put in the solver window from scratch.

Okay let's do reset all, and we'll use this GRG non linear method. Basically what that will do is juggle all the changing cells until there is no way to improve by either increasing team's rating or decreasing a team's rating or changing the home edge. Okay, in a later video we'll need to use what's called a multi stroke. But I want to minimize squared errors.

Okay, oh let me just one point here on this squared error report. It's an if error. Okay so if there's a hyphen, I used a hyphen, underscore, not a blank here, it basically, if the column J minus column K would give an error, I put a zero there so it doesn't affect our targets. Okay, so now we got through row 279. Okay, now we're ready for solver.

And then you put a, sorry, and then you put a comma, there's 32 ratings, and then you put the home edge.

Now you don't want to make all the variables non-negative because how can you make the teams average to zero? It's like the Lake Wobegon effect in Garrison Keillor's show. You have to allow negative numbers to make an average of zero will all be zero. Then we'll add a constraint that the average rating is zero, B1 is the average rating.

Okay, and there is an option here for what's called the linear model, but we're squaring the errors. We have a linear forecast model, let me just make that clear. See I'm adding together and subtracting changing cells, but when I square the errors I have a non linear model. And luckily this is what's called a convex function, the solver has no trouble with it. We should be able to get the ratings of the teams very nicely. All right, so now Denver, and normally should come out pretty good here.

Okay. Okay, so solver converts at a current solution, that's great. Home edge is about 3.1 points. Okay, don't worry about this. Okay now, the best team here, I used the rank function here to rank the teams, and Denver came out number two, going to the playoffs. And guess what? Seattle came out number one, 13 points better than average.

And basically, Denver came out 11 points better than average. Okay, and so we would have predicted, and New England was, in the AFC who was the best team after the Denver, 49ers were third. Who do we add fourth? Carolina Panthers had that great season where they did nothing but play offs. Arizona was sixth. New Orleans was fifth. The NFC was stocked heavy with the NFC. The Chiefs came out seventh, the Colts beat them in an amazing comeback there. Okay, and so we would get going into the playoffs that the Seahawks would be the best team, not that they might have the best chance to win the Super Bowl cause they would be facing tougher opposition then Denver. And I think the bookies had Denver with the best chance of winning the Super Bowl going into the playoffs. For that very reason, the NFC was much tougher. We can talk about strength of schedule in a couple of videos here. Let me just copy this worksheet.

Right click, move or copy, and let's do absolute value while we're at it. So in other words, if I would just change this to absolute value of

J8 minus K8, then copy it down. Let's see if it makes Seattle better vis-a-vis Denver or not. Remember there was about a two point difference in the last spreadsheet. So in other words, this is now absolute value. Now the solver has a little bit more trouble with absolute value because at zero absolute value function doesn't have a slope, but it should get this okay. So it's working here, I did the sum of the absolute errors so it's no longer squared errors. Okay so it's improving a little, it takes a little while here. The problem with absolute errors, by the way, you could have multiple solutions.

Okay, there could be more than one set of ratings that minimizes the mean apps and deviation, and that of course can be an issue, though it's still improving here so I guess we'll give it some time here. You can see in the lower left-hand corner the progress we're making, our target cell, the sum of the errors would be.

We're now at 21 [INAUDIBLE], we've got it. Now see, this makes Denver number one. It makes them 15 points better than average and Seattle 13 points better than average. Okay, so there's a big difference here, I mean, if we looked at absolute errors we would predict Denver to be better by two points than Seattle, when we use least errors we get it the other way.

Coursera provides universal access to the world’s best education, partnering with top universities and organizations to offer courses online.