I recently read this article in *Science News*:

During the past century, though, a mutant form of math has deflected science’s heart from the modes of calculation that had long served so faithfully. Science was seduced by statistics, the math rooted in the same principles that guarantee profits for Las Vegas casinos. Supposedly, the proper use of statistics makes relying on scientific results a safe bet. But in practice, widespread misuse of statistical methods makes science more like a crapshoot.

It’s science’s dirtiest secret: The “scientific method” of testing hypotheses by statistical analysis stands on a flimsy foundation. Statistical tests are supposed to guide scientists in judging whether an experimental result reflects some real effect or is merely a random fluke, but the standard methods mix mutually inconsistent philosophies and offer no meaningful basis for making such decisions. Even when performed correctly, statistical tests are widely misunderstood and frequently misinterpreted. As a result, countless conclusions in the scientific literature are erroneous, and tests of medical dangers or treatments are often contradictory and confusing.

Strong stuff and strong claims, right? Well, I wonder. The rest of the article goes on to say that few practitioners understand the use of the so-called “p-value” of a statistical test.

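Here is a rough and dirty summary: suppose one is comparing data from two groups, say one group that got a treatment and one that did not. One can run a statistical test (often a t-test or a z-test, but there are others). The p-value is the probability, computed under the assumption that the null hypothesis is true (the hypothesis that the treatment caused no difference), of seeing a difference at least as large as the one actually observed. One rejects the null hypothesis when the p-value falls below a threshold chosen in advance. That threshold is the probability that one rejects the null hypothesis even if the null hypothesis is true; that is, it is the probability of a false positive (often called a Type I error).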

The typical threshold is .05 (or 5 percent), though at times other thresholds are used.

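So, if one runs a study and finds a difference with a p-value of, say, .04, the result counts as “statistically significant” at the .05 level, but there is still a real chance that the “positive result” was a fluke: when the treatment truly does nothing, about 1 study in 20 will clear the .05 threshold anyway.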
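
Here is a minimal simulation sketch of that “1 in 20” claim. It is not from the article; the group size, the known standard deviation, and the function names are all assumptions chosen for illustration. Each simulated study compares two groups drawn from the same distribution, so every “significant” result is a false positive by construction.

```python
import math
import random


def two_sided_p_from_z(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def null_trial_p_value(rng, n_per_group=20, sigma=1.0):
    """Simulate one treated-vs-untreated study in which the treatment
    does nothing, and return the p-value of a two-sample z-test.
    Assumes normally distributed data with known standard deviation."""
    treated = [rng.gauss(0.0, sigma) for _ in range(n_per_group)]
    control = [rng.gauss(0.0, sigma) for _ in range(n_per_group)]
    diff = sum(treated) / n_per_group - sum(control) / n_per_group
    standard_error = sigma * math.sqrt(2.0 / n_per_group)
    return two_sided_p_from_z(diff / standard_error)


def false_positive_rate(n_trials=4000, alpha=0.05, seed=0):
    """Fraction of no-effect studies that still come out 'significant'."""
    rng = random.Random(seed)
    hits = sum(null_trial_p_value(rng) < alpha for _ in range(n_trials))
    return hits / n_trials


if __name__ == "__main__":
    rate = false_positive_rate()
    print(f"Fraction of no-effect studies with p < .05: {rate:.3f}")
```

The printed fraction should land close to .05, which is just the threshold showing up as the long-run false positive rate.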

I would imagine that most practitioners know this; this is why scientific studies need to be replicated. But here is a very interesting way in which this “false positive” stuff pops up:

Even when “significance” is properly defined and P values are carefully calculated, statistical inference is plagued by many other problems. Chief among them is the “multiplicity” issue — the testing of many hypotheses simultaneously. When several drugs are tested at once, or a single drug is tested on several groups, chances of getting a statistically significant but false result rise rapidly. Experiments on altered gene activity in diseases may test 20,000 genes at once, for instance. Using a P value of .05, such studies could find 1,000 genes that appear to differ even if none are actually involved in the disease. Setting a higher threshold of statistical significance will eliminate some of those flukes, but only at the cost of eliminating truly changed genes from the list. In metabolic diseases such as diabetes, for example, many genes truly differ in activity, but the changes are so small that statistical tests will dismiss most as mere fluctuations. Of hundreds of genes that misbehave, standard stats might identify only one or two. Altering the threshold to nab 80 percent of the true culprits might produce a list of 13,000 genes — of which over 12,000 are actually innocent.
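
The arithmetic in that passage is easy to check. The sketch below is my own illustration with hypothetical function names. The chance of at least one false positive assumes the tests are independent, and the Bonferroni correction is just one standard way of making the threshold stricter; the article does not name a specific method.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of 'significant' results when no effect is real."""
    return n_tests * alpha


def chance_of_any_false_positive(n_tests, alpha):
    """Probability of at least one false positive, assuming independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(n_tests, alpha):
    """Per-test threshold that keeps the overall false positive chance near alpha."""
    return alpha / n_tests


def false_discovery_proportion(n_flagged, n_false):
    """Share of the flagged items that are actually innocent."""
    return n_false / n_flagged


if __name__ == "__main__":
    # 20,000 genes tested at .05, none truly involved.
    print(round(expected_false_positives(20_000, 0.05)))
    # Even 20 independent tests give a large chance of some fluke.
    print(round(chance_of_any_false_positive(20, 0.05), 3))
    # A stricter per-gene threshold for 20,000 tests.
    print(bonferroni_threshold(20_000, 0.05))
    # The article's list of 13,000 genes with over 12,000 innocent.
    print(round(false_discovery_proportion(13_000, 12_000), 3))
```

The last line gives a lower bound for the article's example: if over 12,000 of 13,000 flagged genes are innocent, more than 92 percent of the list is wrong.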

Of course, there is the false “negative” too (often called a Type II error); that is, a false null hypothesis isn’t rejected. This could well be because the test isn’t sensitive enough to detect the difference (for instance, because the sample is too small), or because no sufficiently sensitive test exists. So “no statistical significance” doesn’t mean that the effect has been disproved.
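
To see how easily a real effect can be missed, here is a sketch that computes the power of a two-sided, two-sample z-test, that is, the probability of rejecting the null hypothesis when a true difference exists. The effect size, group sizes, and known standard deviation are assumptions picked for illustration, and `z_test_power` is a hypothetical helper name.

```python
import math
from statistics import NormalDist


def z_test_power(true_diff, n_per_group, sigma=1.0, alpha=0.05):
    """Probability that a two-sided two-sample z-test rejects the null
    when the true difference in means is `true_diff`.
    Assumes normal data with known standard deviation `sigma`."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1.0 - alpha / 2.0)
    # How many standard errors the true difference is away from zero.
    shift = true_diff / (sigma * math.sqrt(2.0 / n_per_group))
    # Reject when the test statistic falls in either tail.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


if __name__ == "__main__":
    for n in (20, 100, 500):
        power = z_test_power(true_diff=0.2, n_per_group=n)
        print(f"n per group = {n:4d}: power = {power:.2f}, "
              f"false negative rate = {1 - power:.2f}")
```

With a small true difference and only 20 subjects per group, the test misses the effect most of the time, which is exactly why a non-significant result is not evidence of no effect.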

Then there is the case where an effect is statistically significant at a very low p-value but the effect itself isn’t significant in any practical sense:

Another common error equates statistical significance to “significance” in the ordinary use of the word. Because of the way statistical formulas work, a study with a very large sample can detect “statistical significance” for a small effect that is meaningless in practical terms. A new drug may be statistically better than an old drug, but for every thousand people you treat you might get just one or two additional cures — not clinically significant. Similarly, when studies claim that a chemical causes a “significantly increased risk of cancer,” they often mean that it is just statistically significant, possibly posing only a tiny absolute increase in risk.
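
A quick sketch of the large-sample point, using a two-proportion z-test. The cure rates (50.0 percent versus 50.2 percent, i.e. two extra cures per thousand) and the sample sizes are made-up numbers, and the code idealizes by treating the observed cure rates as exactly equal to those values.

```python
import math


def two_proportion_p_value(rate_new, rate_old, n_per_arm):
    """Two-sided p-value of a pooled two-proportion z-test, with
    `n_per_arm` patients in each arm and the given observed cure rates."""
    pooled = (rate_new + rate_old) / 2.0
    standard_error = math.sqrt(pooled * (1.0 - pooled) * 2.0 / n_per_arm)
    z = (rate_new - rate_old) / standard_error
    return math.erfc(abs(z) / math.sqrt(2))


if __name__ == "__main__":
    # Same tiny improvement, very different sample sizes.
    for n in (1_000, 100_000, 1_000_000):
        p = two_proportion_p_value(0.502, 0.500, n)
        print(f"n per arm = {n:9,d}: p-value = {p:.4f}")
```

The improvement is the same two cures per thousand in every row; only the sample size changes, and with a million patients per arm the p-value drops well below .05.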

And of course, there is the situation in which, say, one drug produces a statistically significant effect and a second one does not. But the difference in effects between the two drugs isn’t statistically significant!
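
A small worked example of that last situation. The effect estimates and standard errors below are invented for illustration, and the comparison assumes the two estimates are independent and approximately normal.

```python
import math


def p_value(estimate, standard_error):
    """Two-sided p-value for an estimate tested against zero (z-test)."""
    return math.erfc(abs(estimate / standard_error) / math.sqrt(2))


def compare_two_drugs(effect_a, se_a, effect_b, se_b):
    """Return p-values for drug A, drug B, and the difference A minus B.
    Assumes the two effect estimates are independent."""
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2)
    return (
        p_value(effect_a, se_a),
        p_value(effect_b, se_b),
        p_value(effect_a - effect_b, se_diff),
    )


if __name__ == "__main__":
    # Hypothetical estimates: drug A improves by 25 units, drug B by 10,
    # each with a standard error of 10.
    p_a, p_b, p_diff = compare_two_drugs(25.0, 10.0, 10.0, 10.0)
    print(f"drug A vs. nothing: p = {p_a:.3f}")
    print(f"drug B vs. nothing: p = {p_b:.3f}")
    print(f"drug A vs. drug B:  p = {p_diff:.3f}")
```

Drug A clears the .05 threshold and drug B does not, yet the direct comparison of A against B is nowhere near significant, so the data do not show that A is better than B.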

I’d recommend reading the whole article, and I’ll probably give it to my second-semester statistics class to read.
