"We're so good at medical studies that most of them are wrong:"
The problem now is that we're rapidly expanding our ability to do tests. Various speakers pointed to data sources as diverse as gene expression chips and the Sloan Digital Sky Survey, which provide tens of thousands of individual data points to analyze. At the same time, the growth of computing power means we can ask many questions of these large data sets at once, and each of these tests increases the chance that an error will creep into a study; as Shaffer put it, "every decision increases your error prospects." She pointed out that dividing data into subgroups, which can often identify susceptible subpopulations, is also a decision, and increases the chances of a spurious result. Smaller subpopulations are also more prone to random associations.
In the end, Young noted, by the time you reach 61 tests, there's a 95 percent chance that at least one will come up significant purely by chance. And, let's face it: researchers want to see a significant result, so there's a strong, unintentional bias towards trying different tests until something pops out.
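Young's arithmetic is easy to check: if each test is run at the conventional 0.05 significance threshold and the tests are independent, the chance of at least one false positive among n tests is 1 - 0.95^n. A minimal sketch in Python:

```python
# Family-wise error rate for n independent tests at the 0.05 threshold.
alpha = 0.05
for n in (1, 10, 20, 61):
    fwer = 1 - (1 - alpha) ** n
    print(f"{n:3d} tests -> {fwer:.1%} chance of at least one false positive")

# 61 tests -> 95.6% -- you're essentially guaranteed to find "something".
```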
Timmer notes some studies where the fault clearly lies with the researchers: failure to make the code for statistical models openly available, and flat-out statistical errors. But it's not just an issue of bad research practice (as Timmer also notes); this is a real challenge even for those who are aware of all the statistical pitfalls. Most of the statistical methods used, at least in genomics, to control for multiple hypothesis testing are fairly crude, amounting to little more than shifts in the significance threshold. The result is that, because we can measure so many things at once, we have real trouble teasing genuine but small effects (which we know exist) out of the noise, and what we're left with are just the most significant effects.
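To make "crude" concrete, here's a sketch of the two workhorse corrections, with invented p-values for illustration. Bonferroni is literally just a lowered significance threshold; the Benjamini-Hochberg false discovery rate procedure is less conservative, but still amounts to a rank-dependent threshold:

```python
# Two standard multiple-testing corrections, applied to made-up p-values.
p_values = sorted([0.0004, 0.009, 0.012, 0.041, 0.049, 0.2, 0.74])
m = len(p_values)
alpha = 0.05

# Bonferroni: reject only if p < alpha / m.
bonferroni_hits = [p for p in p_values if p < alpha / m]

# Benjamini-Hochberg: find the largest rank k with p_(k) <= (k/m) * alpha,
# then reject every hypothesis up to and including that rank.
k_max = max((k for k, p in enumerate(p_values, 1) if p <= k / m * alpha),
            default=0)
bh_hits = p_values[:k_max]

print("Bonferroni keeps:", bonferroni_hits)  # [0.0004]
print("BH keeps:        ", bh_hits)          # [0.0004, 0.009, 0.012]
```

Either way, the genuinely small effects near the bottom of the list get discarded along with the noise.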
And so, in the case of, say, genome-wide association studies, you need an insanely large number of study subjects in order to have the statistical power to detect the effects of the many genetic variants that each make a small contribution to disease risk.
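A back-of-the-envelope power calculation shows the scale of the problem. The genome-wide significance threshold of 5×10^-8 is the standard one, but the effect size below is an invented, plausibly small value, and the formula is a textbook normal-approximation sample-size estimate, not anything from the studies discussed above:

```python
# Rough sample size needed to detect a standardized effect size d at
# two-sided significance alpha with a given power:
#     N ~= ((z_alpha/2 + z_power) / d) ** 2
from scipy.stats import norm

alpha = 5e-8   # conventional genome-wide significance threshold
power = 0.80
d = 0.02       # hypothetical small per-variant effect

z_alpha = norm.ppf(1 - alpha / 2)
z_power = norm.ppf(power)
n = ((z_alpha + z_power) / d) ** 2

print(f"~{n:,.0f} subjects needed")  # roughly 99,000
```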
I'm not a statistician, and I don't know whether a new, as yet undiscovered statistical approach will get around these problems, or whether we're really bumping up against the limits of how much signal can be pulled out of the noise in one giant experiment. In any case, given our current state of knowledge, we can't solve every problem with genomics and large-scale studies. These studies are a start, but they have not rendered obsolete the need to establish multiple lines of evidence for a phenomenon before we put too much confidence in its significance.