Combining Significance Levels from Multiple Experiments or Analyses

By Lou Jost

Workers in the biological and social sciences frequently need to judge the overall statistical significance of a series of experiments or analyses. For example, out of twenty experiments testing the effectiveness of a new drug, two may yield statistically significant results. What conclusions should be drawn from such a set of experiments? One might expect one or two experiments out of twenty to yield mildly significant results just by chance, and as more experiments are included in the analysis, it becomes more likely that some will reach statistical significance even if the null hypothesis is true. The usual way to deal with this issue is to demand a higher significance level when multiple experiments or analyses are done. In biology it is common to use the Bonferroni correction, a crude and extremely conservative formula. Here we present an expression that gives the true probability that a set of p-values was produced by chance. More precisely, it gives the probability of obtaining a set of p-values as low as or lower than a given set of p-values, under the null hypothesis. This formula can reveal statistical significance in a series of barely significant results, and can accurately assess the overall significance of a body of work containing a complex mixture of significant and nonsignificant results.

The derivation of the formula is simple. It relies on the observation that, under the null hypothesis, a p-value is a uniformly distributed random variable on the interval (0, 1). For n experiments or analyses, one can construct an n-dimensional unit hypercube and plot the point (P1, P2, P3, ..., Pn) whose coordinates are the p-values of the n experiments. One then finds the surface of points with the same probability as this point.
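This geometric picture can be checked numerically before deriving any closed form: sample many random points in the unit hypercube and count the fraction lying under the constant-probability surface, i.e. the fraction of null-hypothesis p-value sets whose product is as small as the observed one. A minimal Monte Carlo sketch (the function name, trial count, and example p-values are illustrative, not from the original text):

```python
import math
import random

def combined_p_monte_carlo(pvalues, trials=200_000, seed=1):
    """Estimate the volume under the surface x1*x2*...*xn = k inside
    the unit hypercube: the chance that n independent uniform(0, 1)
    p-values have a product as small as the observed product k."""
    k = math.prod(pvalues)
    n = len(pvalues)
    rng = random.Random(seed)
    hits = sum(
        1 for _ in range(trials)
        if math.prod(rng.random() for _ in range(n)) <= k
    )
    return hits / trials

# Two barely significant results, p = 0.04 and p = 0.05 (k = 0.002):
print(combined_p_monte_carlo([0.04, 0.05]))  # close to 0.0144
```

For a single p-value this estimate converges to the p-value itself, as it must, since P(U <= p) = p for a uniform variable.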
Since the p-values are independent (under the null hypothesis), the individual probabilities can be multiplied to give the probability of obtaining this set of p-values. The set of points whose probability equals that of the given set of p-values is the hypersurface x1*x2*x3*...*xn = k (a hyperbola when n = 2), where k = P1*P2*P3*...*Pn, the product of the given p-values. We are not directly concerned with this exact probability. We need the probability of getting a set of p-values as extreme as or more extreme than the given set, so we must find the volume under this surface. Because p-values are uniformly distributed random variables, and because the total volume of the hypercube equals 1, the volume under the surface directly gives the probability of obtaining a set of p-values as extreme as or more extreme than the given set. The volume integral depends only on k, the product of the given p-values, and n, the number of p-values under consideration. For two p-values, the overall significance level is

  k - k ln k,

and for n tests the combined significance level is

  k * [1 + (-ln k) + (-ln k)^2/2! + ... + (-ln k)^(n-1)/(n-1)!],

that is, k times the sum from i = 0 to n - 1 of (-ln k)^i / i!.

R. A. Fisher attacked the same issue of combining significance levels in his classic statistical writings. He derived a formula that returns a chi-squared statistic (-2 ln k, which under the null hypothesis follows a chi-squared distribution with 2n degrees of freedom), whose significance can then be looked up in chi-squared tables. It is strange that he left the issue hanging there. It turns out that in this case the chi-squared integral can in fact be solved analytically, and the result is the same as the formula above. The generality of this formula has not been appreciated, and it should be used whenever one must judge the overall significance of a series of experiments or a table of statistical results (such as the output of ANOVA tests or a series of correlation coefficients).
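The closed form, k times the sum from i = 0 to n - 1 of (-ln k)^i / i! (which reduces to k - k ln k when n = 2), is a few lines of code. A small sketch (function name and example p-values are illustrative):

```python
import math

def combined_significance(pvalues):
    """Probability, under the null hypothesis, of obtaining a set of
    n p-values whose product is as small as or smaller than the
    observed product k: k * sum_{i=0}^{n-1} (-ln k)^i / i!"""
    k = math.prod(pvalues)
    n = len(pvalues)
    log_term = -math.log(k)  # positive, since 0 < k < 1
    return k * sum(log_term**i / math.factorial(i) for i in range(n))

# n = 2 reduces to k - k ln k:
print(combined_significance([0.04, 0.05]))  # ~ 0.0144
# Five borderline results are jointly highly significant:
print(combined_significance([0.06] * 5))    # ~ 0.0017
```

Note that the second example illustrates the point made above: five results at p = 0.06, none individually significant, combine to an overall significance level well below 0.01.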