

Combining Significance Levels from Multiple Experiments or Analyses
By Lou Jost
Workers in the biological or social sciences frequently need to judge the overall statistical significance of a series of experiments or analyses. For example, out of twenty experiments testing the effectiveness of a new drug, two may yield statistically significant results. What conclusions should be drawn from such a set of experiments? One might expect one or two experiments out of twenty to yield mildly significant results just by chance, and as more experiments are included in the analysis, it becomes more likely that some will reach statistical significance even if the null hypothesis is true. The usual way to deal with this issue is to demand a higher significance level when multiple experiments or analyses are done. In biology it is common to use the Bonferroni correction, a crude and extremely conservative formula.

We here present an expression that gives the true probability that a set of p-values was produced by chance. More precisely stated, it gives the probability of obtaining a set of p-values as low as or lower than a given set of p-values, under the null hypothesis. This formula can reveal statistical significance in a series of barely significant results, and can accurately assess the overall significance of a body of work containing a complex mixture of significant and nonsignificant results.

The derivation of the formula is simple. It relies on the observation that, under the null hypothesis, a p-value is a uniformly distributed random variable on the interval (0, 1). For n experiments or analyses, one can create an n-dimensional unit hypercube and plot the point (P1, P2, P3, ..., Pn) representing the p-values of the n experiments. One then establishes the surface of points with the same probability as this point. Since the p-values are independent (under the null hypothesis), the individual probabilities can be multiplied to give the probability of obtaining this set of p-values.
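The geometric argument can be checked numerically: under the null hypothesis each p-value is Uniform(0, 1), so the probability of a product of p-values at least as small as k is the fraction of random points in the unit hypercube falling under the surface whose coordinate product equals k. A minimal Monte Carlo sketch in Python (the p-values used here are illustrative, not taken from the article):

```python
import math
import random

random.seed(42)

n = 3                        # number of experiments (illustrative)
k = 0.05 * 0.04 * 0.06       # product of a hypothetical set of p-values
trials = 200_000

# Draw random points (P1, ..., Pn) uniformly in the unit hypercube and
# count how often the product of coordinates is <= k, i.e. how often
# the point falls under the surface x1*x2*...*xn = k.
hits = sum(
    1 for _ in range(trials)
    if math.prod(random.random() for _ in range(n)) <= k
)

estimate = hits / trials     # estimated overall significance level
print(estimate)
```

The estimate is far larger than k itself, reflecting the volume of the region under the surface rather than the probability of the single point.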
The set of points whose probability is equal to that of the given set of p-values is just the hypersurface x1*x2*x3*...*xn = k, where k = P1*P2*P3*...*Pn, the product of the given set of p-values. We are not directly concerned with this exact probability; we need the probability of obtaining a set of p-values as extreme as or more extreme than the given set, so we must find the volume under this surface. Because p-values are uniformly distributed random variables, and because the total volume of the hypercube equals 1, this volume directly gives the probability of obtaining a set of p-values as extreme as or more extreme than the given set. The volume integral depends only on k, the product of the given set of p-values, and n, the number of p-values under consideration. For the case of two p-values, the overall significance level is k - (k ln k), and for n tests the combined significance level is

    k * [1 + (-ln k) + (-ln k)^2/2! + ... + (-ln k)^(n-1)/(n-1)!],

that is, k times the sum from i = 0 to n - 1 of (-ln k)^i / i!.

R. A. Fisher attacked the same issue of combining significance levels in his classic statistical writings. He derived a formula that returns a chi-squared statistic, whose significance can then be looked up in chi-squared tables. It is strange that he left the issue hanging there. It turns out that in this case the chi-squared integral can in fact be solved analytically, and the result is the same as the formula above. The generality of this formula has not been appreciated, and it should be used whenever one must judge the overall significance of a series of experiments or a table of statistical results (such as the output of ANOVA tests or a series of correlation coefficients).
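The closed-form expression, and its agreement with Fisher's chi-squared approach, can be sketched in Python as follows (the function names are ours, not from the article). Fisher's statistic is X = -2 ln k with 2n degrees of freedom, and for even degrees of freedom the chi-squared survival function has an elementary closed form, which is exactly the volume formula:

```python
import math

def combined_p(k, n):
    """Overall significance level: the probability, under the null
    hypothesis, that the product of n independent uniform p-values
    is <= k.  This is the volume under the surface x1*...*xn = k
    in the unit hypercube: k * sum_{i=0}^{n-1} (-ln k)^i / i!."""
    log_k = math.log(k)
    return k * sum((-log_k) ** i / math.factorial(i) for i in range(n))

def fisher_chi2_p(pvalues):
    """Fisher's method: X = -2 * sum(ln p_i) is chi-squared with 2n
    degrees of freedom under the null.  For even degrees of freedom
    the chi-squared survival function is exp(-X/2) * sum_{i=0}^{n-1}
    (X/2)^i / i!, which, with k = prod(p_i) so that X/2 = -ln k,
    is identical to combined_p above."""
    n = len(pvalues)
    half_x = -sum(math.log(p) for p in pvalues)
    return math.exp(-half_x) * sum(
        half_x ** i / math.factorial(i) for i in range(n)
    )
```

For n = 2 the series reduces to k - k ln k, matching the two-p-value case in the text, and for any set of p-values the volume formula and the chi-squared survival probability agree to machine precision.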