Data and its role in journalism is a hot topic among some of the bloggers that I regularly follow. See: Nate Silver on what he hopes to accomplish with his new website, and Paul Krugman’s caveats on this project. The debate is, as I see it, about the role of data and the role of having expertise in a subject when it comes to providing the public with an accurate picture of what is going on.
Then I saw this meme on a Facebook page:
These two things (the discussion and the meme) led me to write this post.
First, the meme: I thought of this meme as a way to explain volume integration by “cross sections”. 🙂 But for this post, I’ll focus on the meme as an example of a “projection map” in mathematics. I can even provide some equations: imagine the following set in $\mathbb{R}^3$, described as follows: $S = \{(x,y,z) \mid (y-2)^2 + (z-2)^2 \le 1,\ 1 \le x \le 2\}$. Now the projection map to the $yz$-plane is given by $p_{yz}(x,y,z) = (0,y,z)$ and the image set is $S_{yz} = \{(0,y,z) \mid (y-2)^2 + (z-2)^2 \le 1\}$, which is a disk (in the yellow).
The projection onto the $xz$-plane is given by $p_{xz}(x,y,z) = (x,0,z)$ and the image is $S_{xz} = \{(x,0,z) \mid 1 \le x \le 2,\ 1 \le z \le 3\}$, which is a rectangle (in the blue).
The issue raised by this meme is that neither projection, in and of itself, determines the set $S$. In fact, even both of these projections, taken together, do not determine the object. For example: the “hollow can” in the shape of our $S$ would have the same projections, and there are literally uncountably many other such sets. Another example: imagine a flat rectangle in the shape of the blue projection, joined to one end disk parallel to the yellow plane.
Of course, one can put some restrictions on candidates for $S$ (the preimage of both projections taken together); say, one might want $S$ to be a manifold of either 2 or 3 dimensions, or to satisfy some other criterion. But THAT would be adding more information to the mix and thereby, in a sense, providing yet another projection map.
Projections, by design, lose information.
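Here is a minimal sketch of that non-uniqueness on an integer grid. Everything in it is an assumption made for illustration: the radius `R`, the length `L`, the choice to center the cylinder on the x-axis, and the helper names `project_yz` and `project_xz`. It builds a solid cylinder and the "rectangle joined to one end disk" set, then checks that the two sets differ even though both of their shadows agree.

```python
from itertools import product

# Hypothetical sizes: a cylinder of radius R and length L along the x-axis,
# sampled at integer grid points (not the coordinates used in the text).
R, L = 3, 5
grid = range(-R, R + 1)

# Solid cylinder: every grid point within distance R of the x-axis.
solid = {
    (x, y, z)
    for x in range(L + 1)
    for y, z in product(grid, grid)
    if y * y + z * z <= R * R
}

# Alternative set: a flat rectangle in the plane y = 0 ...
flat_rectangle = {(x, 0, z) for x in range(L + 1) for z in grid}
# ... joined to a single end disk in the plane x = 0.
end_disk = {
    (0, y, z) for y, z in product(grid, grid) if y * y + z * z <= R * R
}
rectangle_plus_disk = flat_rectangle | end_disk


def project_yz(points):
    """Forget the x coordinate (shadow on the yz-plane)."""
    return {(y, z) for _, y, z in points}


def project_xz(points):
    """Forget the y coordinate (shadow on the xz-plane)."""
    return {(x, z) for x, _, z in points}


same_disk_shadow = project_yz(solid) == project_yz(rectangle_plus_disk)
same_rectangle_shadow = project_xz(solid) == project_xz(rectangle_plus_disk)
sets_differ = solid != rectangle_plus_disk

print(same_disk_shadow, same_rectangle_shadow, sets_differ)
```

The two shadows match while the sets themselves do not, so no amount of staring at the two projections can tell these objects apart.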
In statistics, a statistic, by definition, is a type of projection. Consider, for example, linear regression. I discussed linear regressions and using “fake data” to teach linear regression here. But the linear regression process inputs data points $(x_i, y_i)$ and produces numbers, including the means and standard deviations of the $x$ and $y$ values as well as the correlation coefficient and the regression coefficients.
But one loses information in the process. A good demonstration of this comes from Anscombe’s quartet: one has four very different data sets producing identical regression coefficients (and yes, correlation coefficients, confidence intervals, etc.). Here are the plots of the data:
And here is the data:
The Wikipedia article I quoted is pretty good; they even provide a link to a paper that gives an algorithm to generate different data sets with the same regression values (and yes, the paper defines what is meant by “different”).
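For readers who want to check this numerically, here is a short sketch that computes the usual least-squares summary for each of the four sets. The numbers are the standard published values of Anscombe's quartet; the function name `summarize` and the dictionary layout are my own choices for illustration, and only the Python standard library is used.

```python
from math import sqrt

# Anscombe's quartet; sets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                  6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                   6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
            5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(xs, ys):
    """Return means, least-squares line, and correlation for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {
        "mean_x": mean_x,
        "mean_y": mean_y,
        "slope": slope,
        "intercept": mean_y - slope * mean_x,
        "r": sxy / sqrt(sxx * syy),
    }


summaries = {name: summarize(xs, ys) for name, (xs, ys) in quartet.items()}
for name, stats in summaries.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

To two decimal places, all four sets give a slope of 0.50, an intercept of 3.00, and a correlation of 0.82, even though the scatter plots look nothing alike. The summary is a projection, and the shape of the data is exactly what it throws away.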
Moral: when one crunches data, one has to be aware of the loss of information that is involved.