Confidence Intervals and You

One of the truths that pundits in sports ignore is that there are times in the year when it is exceedingly hard to make predictions. Right now is very much such a time in the 2014-15 NBA season. Given how few games have been played, it would be silly to make predictions or draw conclusions with the same level of confidence that we would have in, say, January. The smaller the sample, the larger the error embedded in your chest-thumping affirmation on television.

You'll notice that this doesn't really stop anyone from getting on their bully pulpit.

One of the advantages of a sport like basketball is that, eventually, the number of games in a season allows us to draw pretty accurate conclusions about the relative strength of a team. This leads to a certain level of assurance in those who write about the sport that is (again, after a certain point in the season) a lot more warranted. Five or six games into the season is not that time.

Here's an illustration of just why this is so.

One of the fun projects I took on in the offseason was to build a database of every NBA game ever played. The graph above shows the standard deviation of Point Margin per game before and after a certain game. I looked at every 82-game season played in the NBA (all non-strike years from 1974-75 onwards) and calculated the difference between each team's Point Margin up to and including game number X and its Point Margin afterwards (PM Delta). For example, after 41 games the standard deviation of this point margin delta is 3.3 points per game over the last ten seasons. After 6 games, that number is 5.3.

Right now, your margin of error is about 60% higher than it would be at the midpoint of the season.
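For anyone checking the math, the 60% figure is just the ratio of the two standard deviations quoted above. A quick sketch (the variable names are mine, not from the database):

```python
# Standard deviations of PM Delta quoted in the text (points per game).
sd_after_6_games = 5.3
sd_after_41_games = 3.3

# How much wider the error is after 6 games than after 41.
extra_error = sd_after_6_games / sd_after_41_games - 1
print(f"{extra_error:.0%}")  # about 61%, which the text rounds to 60%
```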

Does that mean we need to recuse ourselves from providing an opinion until such a time as there is enough information to provide it? Nope. Having to provide a usable and useful opinion without actually having all the relevant facts is actually a common problem. Think of your local weatherman -- he wishes he had enough time to get his weather predictions in the 95% range all the time, but he simply lacks the resources or the processing time to get anything above the typical "80% chance of thundershowers" on a typical day.

Data costs money. Sampling and processing take time. I would love to run my models in the real world with an unlimited number of samples, but the reality is that ten or fewer samples might have to satisfy me most of the time. Scientists, engineers and economists all have to make do with the data that is available and draw conclusions that are less than optimal and less certain than we would like.

Given that reality, it's not surprising that there are some well-worn concepts for dealing with the uncertainty brought about by small samples. Let's talk about confidence intervals.

In statistics, a confidence interval is a way to provide an interval estimate of a population parameter. It is a range of values that acts as a good estimate of the unknown quantity (for example, projected wins). The confidence level of the interval indicates the calculated probability that the range captures the true population parameter given the samples we have handy. Basically, we are able to provide a maximum and minimum value for the number in question based on the observed behavior of the sample.

For example, we could look at a 5-0 NBA team with an average margin of victory of 10 points per game and determine the maximum and minimum number of wins we would expect at a certain confidence level. In applied practice, confidence intervals are typically stated at the 95% confidence level. In layman's terms, if I said team A would be expected with 95% confidence to win between 40 and 62 games in an 82-game season, I am saying that I expect that if the season were played 1000 times, 950 of them would fall within 40 and 62 wins.
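To make the "played 1000 times" reading concrete, here is a minimal simulation sketch. It assumes, purely for illustration, that a team's season win total is normally distributed and that the 40 to 62 range is symmetric around its midpoint; neither assumption comes from my models, and `share_inside` is a made-up helper name.

```python
import random

def share_inside(low, high, mean, sd, seasons=1000, seed=2014):
    """Simulate `seasons` win totals and return the share landing in [low, high]."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    inside = sum(low <= rng.gauss(mean, sd) <= high for _ in range(seasons))
    return inside / seasons

Z_95 = 1.96                       # two-sided 95% z-score for a normal distribution
low, high = 40, 62                # the interval from the example
mean = (low + high) / 2           # assumed center: 51 wins
sd = (high - low) / 2 / Z_95      # assumed spread that makes 40-62 a 95% range

print(share_inside(low, high, mean, sd))
```

The printed share should land close to 0.95, and it wobbles a little from run to run if you change the seed, which is the whole point: roughly 950 of the 1000 simulated seasons fall inside the range.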

The trick is that we need to have a clue as to what the expected error and variation are. If you were paying attention at the top, you realize that we kind of do. That lets us do some fun things.

If I had a team with a 6-1 record and an 11-point margin of victory, I could build a tool to estimate confidence intervals for their expected win total. That would look like so:

[Interactive Tableau visualization]

That purely hypothetical Northern Atlantic dinosaur-themed team would be expected with 95% confidence to win between 45 and 79 games this season.
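The tool itself is not reproduced here, but the first step can be sketched on the point-margin scale. This is a rough illustration, not the actual model: it assumes the PM Delta is normally distributed around zero, borrows the 5.3 standard deviation quoted for 6 games even though this team has played 7, and stops short of converting margin into wins, so it will not reproduce the 45 to 79 range. The function name `margin_interval` is hypothetical.

```python
from statistics import NormalDist

def margin_interval(current_margin, sd_delta, confidence=0.95):
    """Range for rest-of-season point margin, assuming a normal PM Delta centered on zero."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    half_width = z * sd_delta
    return current_margin - half_width, current_margin + half_width

# 11-point margin so far, 5.3-point standard deviation from the early-season data.
low_margin, high_margin = margin_interval(11.0, 5.3)
print(f"{low_margin:.1f} to {high_margin:.1f} points per game")
```

Under those assumptions the range runs from a bit above break-even to better than 20 points per game, which shows just how little seven games pin down.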

If I wanted to look at the actual odds of a specific win total for that selfsame team, let's say 48 wins, I'd build a tool like this:

[Interactive Tableau visualization]

96.3% of the time, that's a tasty over.
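For a feel of where a number like that comes from, here is a back-of-the-envelope sketch. It assumes the 45 to 79 range is a symmetric 95% interval from a normal distribution, which is my simplification for illustration and not how the tool works, so it lands near the tool's 96.3% rather than on it.

```python
from statistics import NormalDist

low, high = 45, 79                      # the 95% range quoted above
z = NormalDist().inv_cdf(0.975)         # about 1.96

# Assumed normal season: centered on the midpoint, spread backed out of the range.
season = NormalDist(mu=(low + high) / 2, sigma=(high - low) / 2 / z)

# Chance of finishing above a 48-win line under that assumption.
p_over = 1 - season.cdf(48)
print(f"{p_over:.1%}")
```

This comes out at roughly 95%. The gap to 96.3% is a reminder that the shape of the distribution matters once you start pricing specific win totals.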

There are, of course, some more factors to consider when projecting the season (the schedule, for one). We'll cover that in our upcoming rankings (which will of course incorporate our shiny new confidence intervals) a bit later in the week.

-Arturo

🔖 Arturo Galletti on Nov 11th, 2014
