# College Math Teaching

## September 1, 2011

### Classic Overfitting

Filed under: media, news, popular mathematics, probability, statistics — blueollie @ 1:33 am

A common modeling mistake is overfitting: tuning a model so closely to past data that it reproduces every known result without gaining any real predictive power.
Life is complicated, and if one goes looking for a correlation between outcomes and past conditions, it really isn’t hard to find one.

Here Nate Silver calls out a case of overfitting: a model built on 13 “keys” that is supposed to predict the outcome of a presidential election and has been “proven” right in the past. He writes:

> If there are, say, 25 keys that could defensibly be included in the model, and you can pick any set of 13 of them, that is a total of 5,200,300 possible combinations. It’s not hard to get a perfect score when you have that large a menu to pick from! Some of those combinations are going to do better than others just by chance alone.
>
> In addition, as I mentioned, at least a couple of variables can credibly be scored in either direction for each election. That gives Mr. Lichtman even more flexibility. It’s less that he has discovered the right set of keys than that he’s a locksmith and can keep minting new keys until he happens to open all 38 doors.
>
> By the way — many of these concerns also apply to models that use solely objective data, like economic variables. These models tell you something, but they are not nearly as accurate as claimed when held up to scrutiny. While you can’t manipulate economic variables — you can’t say that G.D.P. growth was 5 percent when the government said it was 2 percent, at least if anyone is paying attention — you can choose from among dozens of economic variables until you happen to find the ones that pick the lock.
>
> These types of problems, which are technically known as overfitting and data dredging, are among the most important things you ought to learn about in a well-taught econometrics class — but many published economists and political scientists seem to ignore them when it comes to elections forecasting.
>
> In short, be suspicious of results that seem too good to be true. I’m probably in the minority here, but if two interns applied to FiveThirtyEight, and one of them claimed to have a formula that predicted 33 of the last 38 elections correctly, and the other one said they had gotten all 38 right, I’d hire the first one without giving it a second thought — it’s far more likely that she understood the limitations of empirical and statistical analysis.
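
The count in the quoted passage is easy to check, and a small simulation gives a feel for how much a search alone can inflate a track record. What follows is a toy sketch, not Lichtman's model: the "keys" are coin flips, the 38 election outcomes are random too, and the function name `search_random_keys` and all of its parameters are invented for illustration. To stay fast it tries a few thousand random 13-key subsets (combined by majority vote, which is also an assumption) rather than all 5,200,300.

```python
import math
import random

# Number of ways to choose 13 keys out of 25 candidates.
print(math.comb(25, 13))  # 5200300


def search_random_keys(n_elections=38, n_keys=25, subset_size=13,
                       n_tries=2000, seed=0):
    """Try random key subsets on pure noise; return (best score, average score)."""
    rng = random.Random(seed)
    # Hypothetical data: outcomes and keys are independent coin flips,
    # so no key carries any real information about any election.
    outcomes = [rng.randint(0, 1) for _ in range(n_elections)]
    keys = [[rng.randint(0, 1) for _ in range(n_elections)]
            for _ in range(n_keys)]

    best = 0
    total = 0
    for _ in range(n_tries):
        subset = rng.sample(range(n_keys), subset_size)
        hits = 0
        for e in range(n_elections):
            votes = sum(keys[k][e] for k in subset)
            # Predict 1 when a majority of the chosen keys say 1.
            prediction = 1 if 2 * votes > subset_size else 0
            hits += prediction == outcomes[e]
        best = max(best, hits)
        total += hits
    return best, total / n_tries


best, average = search_random_keys()
print(f"best of the search: {best}/38, typical subset: {average:.1f}/38")
```

The gap between the best score and the average score comes entirely from the search, since nothing here has any predictive content. Allowing individual keys to be rescored after the fact, as the quote describes, would widen that gap further.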

I recommend reading the rest of the article. The point isn’t that the model will be wrong this time; in fact, going by the current betting market, there is a slightly better than 50 percent chance that it will be right. But being right once more would not make the model useful.
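
One way to see the difference between "right in the past" and "useful" is to hold some elections back when choosing the model. The sketch below again uses coin-flip keys and random outcomes. The 28/10 split, the majority-vote rule, and the names `majority_hits` and `holdout_experiment` are all assumptions made for illustration. A key subset is chosen for its record on the first 28 elections and then scored on 10 it has never seen.

```python
import random


def majority_hits(keys, subset, outcomes, elections):
    """Count elections where the majority vote of the chosen keys is correct."""
    hits = 0
    for e in elections:
        votes = sum(keys[k][e] for k in subset)
        hits += (2 * votes > len(subset)) == bool(outcomes[e])
    return hits


def holdout_experiment(n_train=28, n_test=10, n_keys=25, subset_size=13,
                       n_tries=300, n_repeats=20, seed=0):
    """Return (in-sample hit rate, held-out hit rate) of the best-looking subset."""
    rng = random.Random(seed)
    n = n_train + n_test
    train, test = range(n_train), range(n_train, n)
    train_hits = test_hits = 0
    for _ in range(n_repeats):
        # Fresh noise each repeat: keys are unrelated to outcomes by construction.
        outcomes = [rng.randint(0, 1) for _ in range(n)]
        keys = [[rng.randint(0, 1) for _ in range(n)] for _ in range(n_keys)]
        candidates = [rng.sample(range(n_keys), subset_size)
                      for _ in range(n_tries)]
        # Pick the subset with the best record on the training elections only.
        chosen = max(candidates,
                     key=lambda s: majority_hits(keys, s, outcomes, train))
        train_hits += majority_hits(keys, chosen, outcomes, train)
        test_hits += majority_hits(keys, chosen, outcomes, test)
    return (train_hits / (n_repeats * n_train),
            test_hits / (n_repeats * n_test))


in_sample, held_out = holdout_experiment()
print(f"in-sample: {in_sample:.2f}, held-out: {held_out:.2f}")
```

Because the keys carry no information, the held-out rate should hover around one half however good the in-sample rate looks, which is the sense in which a model can have a strong record and still not be useful.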