New Synthesis of Diversity and Similarity Measures

Diversity and Similarity Measures in Genetics

The tools of diversity analysis in population genetics were not originally designed for that purpose. Heterozygosity came to be regarded as synonymous with "diversity" for reasons that are largely historical; there was no careful reflection on the mathematical properties we should demand of a "diversity" measure. Likewise the classic differentiation measure of population genetics, Gst, was originally designed to measure fixation probabilities, and its gradual adaptation as a measure of compositional differentiation is, from an outsider's view, almost inexplicable. Back in the days when geneticists studied very simple systems with just two alleles, these measures gave acceptable results. However, when diversity is high, they give wildly incorrect results. Gst, for example, can be arbitrarily close to zero even if differentiation is 100%. The belief that Gst is a measure of differentiation is a sort of collective hallucination, a mass hysteria in which geneticists are driven by peer pressure to believe in something which any of them could easily have disproved had they tried. Even after Gregorius (1988) and Hedrick (2005) pointed this out in precise quantitative detail, most population geneticists irrationally continued to use Gst as a measure of compositional differentiation. Few episodes in the history of science are more difficult to explain than this. (One reason these problems were ignored is that biologists rarely pay much attention to the actual values of these measures anyway. Instead, biologists use them as mere tools to generate p-values. Unfortunately the p-value of a meaningless measure is also meaningless.)

The problems in population genetics start with the misinterpretation of heterozygosity as "diversity". One of the principal themes of conservation biology is the conservation of diversity. The moment that we begin to conceive of diversity as something that can be conserved or lost by preserving or destroying populations, we implicitly impose subtle mathematical requirements on the concept. For example, if conservation arguments are to be logically consistent, the amount of diversity lost plus the amount of diversity saved must add to the total diversity, at least in highly symmetric examples where other factors that might affect diversity are held constant. Imagine that we are evaluating a conservation plan for an endangered colonial seabird. Its entire population nests on a cluster of twenty small rock islands. Each island contains about the same number of birds, and members of the colonies always return to their home colony to breed. A genetic analysis at a very polymorphic locus shows that each colony (deme) has a heterozygosity of 0.95, and also shows that no alleles are shared by members of different colonies (all alleles are private). The military plans to use 19 of the 20 islands for bombing practice. Geneticists are asked what effect this will have on the genetic diversity of the endangered seabird. Following standard procedure, geneticists will equate genetic diversity with heterozygosity, and calculate the answer.

Geneticists hired by the military will argue that even if we save just one island, we will be preserving almost all of the genetic "diversity" of the species. The total heterozygosity of the pooled colonies is 0.9975, and the heterozygosity of one colony is 0.95, so the proportion of the species' genetic "diversity" saved is 0.95/0.9975, or about 95%.

Geneticists hired by an environmental protection group will use the same measures to argue that the military's plan will destroy about 99.99% (0.9974/0.9975, where 0.9974 is the heterozygosity of the 19 pooled colonies that would be lost) of the species' genetic "diversity". The same measures and the same reasoning, applied to the same data, lead to diametrically opposite conclusions depending on which side we are on.

In fact, both results are as fishy as the seabirds' feces. All the colonies are equally large and equally diverse, and all alleles are private to single islands, so each colony should contribute equally to the total genetic diversity of the species. The loss of 19 of the 20 colonies should cause the loss of 95% (=19/20) of the species' genetic diversity, and saving one colony should preserve 5% (=1/20) of the species' diversity. These sum to unity, as they must. If we want to avoid logical contradictions when arguing about conserving diversity, our diversity measure must behave in this way in these highly symmetrical cases. These cases are an acid test of whether our measures match our diversity concept. Heterozygosity fails this test and the others described in Jost (2008).
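The arithmetic behind the two conflicting claims can be checked with a minimal Python sketch. It assumes exactly the symmetric setup of the seabird example (equally sized colonies, equal within-colony heterozygosity, all alleles private); the function and variable names are illustrative only and do not come from any published software.

```python
def pooled_heterozygosity(h_within, n_demes):
    """Heterozygosity of n equally sized demes pooled together.

    Assumes every allele is private to one deme and every deme has the
    same heterozygosity h_within. Pooling divides each allele frequency
    by n, so the pooled homozygosity is (1 - h_within) / n.
    """
    return 1.0 - (1.0 - h_within) / n_demes


h_one = 0.95                                # one surviving colony
h_all = pooled_heterozygosity(h_one, 20)    # all 20 colonies pooled
h_lost = pooled_heterozygosity(h_one, 19)   # the 19 colonies to be bombed

saved_share = h_one / h_all    # the "saved" ratio used by one side
lost_share = h_lost / h_all    # the "destroyed" ratio used by the other

print(f"total heterozygosity:   {h_all:.4f}")
print(f"share 'saved':          {saved_share:.4f}")
print(f"share 'destroyed':      {lost_share:.4f}")
print(f"saved + destroyed:      {saved_share + lost_share:.4f}")
```

The two shares add up to nearly 2 rather than 1, which is the logical inconsistency described above.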

There are many measures which pass these kinds of tests, and I call them "true diversities". One which has been floating around in population genetics for a long time is Kimura and Crow's (1964) "effective number of alleles". This can be safely equated with genetic diversity. A simple transformation, 1/(1-H), converts heterozygosity to effective number of alleles. This transformation is not optional if we want logical consistency in our arguments about diversity.
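As a rough illustration, the transformation 1/(1-H) can be applied to the seabird example in a few lines of Python. The function name is hypothetical, and the heterozygosity of the 19 pooled colonies is computed under the example's assumptions (equal colony sizes, all alleles private), not taken from real data.

```python
def effective_number_of_alleles(h):
    """Convert heterozygosity H to the effective number of alleles, 1/(1-H)."""
    if not 0.0 <= h < 1.0:
        raise ValueError("heterozygosity must satisfy 0 <= H < 1")
    return 1.0 / (1.0 - h)


# Seabird example: 20 equal colonies, each with H = 0.95, all alleles private.
n_one = effective_number_of_alleles(0.95)            # one colony
n_all = effective_number_of_alleles(0.9975)          # all 20 colonies pooled
n_lost = effective_number_of_alleles(1 - 0.05 / 19)  # the 19 lost colonies pooled

saved_share = n_one / n_all
lost_share = n_lost / n_all

print(f"effective alleles, one colony:   {n_one:.1f}")
print(f"effective alleles, all colonies: {n_all:.1f}")
print(f"share saved: {saved_share:.3f}, share lost: {lost_share:.3f}")
```

On this scale one colony holds about 20 effective alleles out of about 400, so the saved and lost shares come out near 5% and 95% and sum to unity, as the symmetric example requires.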

We are no more free to choose measures of diversity than we are free to choose formulas for variance or standard deviation. Standard deviation and variance are simple transformations of each other, but each has properties that the other lacks, and reasoning which makes implicit use of the mathematical properties of standard deviations will often be wrong if applied to variances (and vice versa). It is time to get serious about our measures. Imagine what kind of mess statistics would be in if no one recognized the differences between variance and other measures of dispersion, and called them all by the same word, and used them interchangeably in their formulas. The mess we would have is not unlike the mess we have now in ecology and population genetics surrounding the concept of diversity. This is especially true in their conservation applications. How can conservation biology and conservation genetics do their important tasks of guiding policy-makers, resource managers, and the general public, if they cannot even achieve a basic level of logical consistency?

Geneticists made another, more or less independent error which compounded the misconceptions about diversity. The idea that "diversity" or heterozygosity can be additively partitioned into within- and between-group components, by analogy with the additive decomposition of independent variances, is a mistake which has been corrected in most other sciences since at least 1975 (Aczél and Daróczy 1975). But in biology (both population genetics and ecology), this mistake has dovetailed with historical accidents to become a standard partitioning technique to this day (though see Gregorius 1988). This additive partitioning is at the heart of Nei's (1973) derivation of Gst as a measure of differentiation. This measure, (Ht-Hs)/Ht, was perhaps uncritically accepted because of its close relation to Fst, which had long been (mis-)used as a measure of genetic differentiation. Ht (total heterozygosity of the pooled demes) is necessarily greater than or equal to Hs and less than or equal to unity. This obviously means that if within-group heterozygosity Hs is high (close to unity), Ht-Hs will be close to 1-1 = 0, no matter how differentiated the demes are. Dividing by Ht (~ 1) doesn't change this. How could geneticists accept a measure of "differentiation" which necessarily approaches zero when heterozygosity is high? Even Nei (1973) recognized this problem, but later geneticists continued to make the mistake of interpreting Gst or Fst as the principal measure of allelic differentiation in population genetics. This mistake contaminates most of the empirical studies on the genetic structure of populations. It makes nonsense out of many theoretical conclusions as well. Evolutionary models of population subdivision are based on the effect of migration and mutation on Gst. If model parameters produce a low value of Gst, the modellers conclude there is little genetic differentiation among demes. Since Gst turns out not to be a measure of differentiation, this is false reasoning, as one can easily show by trying some high-mutation-rate, low-migration-rate examples with many demes, and looking at the actual allele distributions that result at equilibrium. Gst can be low for these examples (especially if deme size and number of demes are both large) but the differences in allele frequencies between demes can be very large.
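The collapse of Gst at high heterozygosity can be seen numerically with the seabird example, where no alleles are shared and differentiation is complete. The sketch below is illustrative only: the function names are hypothetical, Gst follows the formula (Ht-Hs)/Ht given above, and the comparison measure is the differentiation D of Jost (2008), written here in the form [(Ht-Hs)/(1-Hs)][n/(n-1)] for n equally weighted demes, which is an assumption about that paper's definition rather than a quotation from it.

```python
def gst(h_total, h_within):
    """Nei's Gst = (Ht - Hs) / Ht."""
    return (h_total - h_within) / h_total


def jost_d(h_total, h_within, n_demes):
    """Differentiation D, assumed form: [(Ht - Hs) / (1 - Hs)] * [n / (n - 1)]."""
    return ((h_total - h_within) / (1.0 - h_within)) * (n_demes / (n_demes - 1.0))


# Seabird example: 20 equal demes, Hs = 0.95, no shared alleles, so Ht = 0.9975.
h_s, h_t, n = 0.95, 0.9975, 20

gst_value = gst(h_t, h_s)
d_value = jost_d(h_t, h_s, n)

print(f"Gst = {gst_value:.4f}")
print(f"D   = {d_value:.4f}")
```

Even though the demes share no alleles at all, Gst stays below 0.05, while D is essentially 1.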

How then should we measure diversity and differentiation in population genetics? Biologists have often treated these questions as if they were matters of opinion, but in fact the concepts of diversity, compositional similarity, and differentiation have deep logical and mathematical roots, rich contexts, and many interconnections. These foundations have not been appreciated by most biologists. I have tried to explain some of this in the following articles. Most are written for ecologists rather than geneticists, but the issues are nearly identical in both fields.

Jost, L. 2006. Entropy and diversity. Oikos 113: 363–375.

Jost, L. 2007. Partitioning diversity into independent alpha and beta components. Ecology 88: 2427–2439.

Jost, L. 2008. Gst and its relatives do not measure differentiation. Molecular Ecology 17: 4015–4026.

Jost, L. 2009. Mis-measuring biological diversity. Ecological Economics 68: 925-928.

Jost, L. 2009. Partitioning diversity into independent components: Reply to Veech and Crist. Ecology in press.

Jost, L. 2009. Gst versus D: Reply to Heller and Siegismund (2009) and Ryman and Leimar (2009). Molecular Ecology, in press.

Please use the Population Genetics forum in Nature Forums (http://network.nature.com/groups/popgen/forum/) to post responses, queries, or criticisms.

Software for correctly-formulated diversity and differentiation measures is available at:

Chao, A, and Shen, 2009. SPADE. http://chao.stat.nthu.edu.tw/softwareCE.html

Crawford, N. www.ngcrawford.com/django/jost/ This now accepts popular pop gen file formats.

You can also download a basic Excel sheet for calculating genetic differentiation D (Jost 2008) for two demes by clicking here: Excel Genetic Differentiation D Calculator. The programs just cited are much more complete.

Note that these programs may sometimes produce a differentiation value that is negative. This is not a flaw but a necessary feature of an unbiased estimation procedure. "Unbiased" means that the expected value of the estimator equals the true value of the parameter you are trying to estimate. The estimator's undershoots must therefore balance its overshoots on average (otherwise it would have a bias). When the true value is zero, it will sometimes overshoot, but this means it must counterbalance that by sometimes undershooting. When the true population value of differentiation is near zero, the only way to undershoot is to sometimes produce an estimate that is negative.

Citations (other than my own articles, which are cited above)

Aczél J, Daróczy Z. 1975. On measures of information and their characterizations. Mathematics in Science and Engineering, vol. 115, Academic Press, New York, San Francisco, London, 1975, xii + 234 pp.

Gregorius, H-R. 1988. The meaning of genetic variation within and between subpopulations. Theor. Appl. Genet. 76: 947-951.

Hedrick, P. 2005. A Standardized Genetic Differentiation Measure. Evolution 59: 1633-1638.

Kimura, M., Crow, J. 1964. The number of alleles that can be maintained in a finite population. Genetics 49: 725-738.

Nei, M. 1973. Analysis of gene diversity in subdivided populations. Proceedings of the National Academy of Sciences, USA 70: 3321-3323.