Diversity and Similarity Measures in Genetics
The tools of diversity analysis in population
genetics were not originally designed for that purpose. Heterozygosity
came to be regarded as synonymous with "diversity"
for reasons that are largely historical; there was no careful
reflection on the mathematical properties we should demand of
a "diversity" measure. Likewise the classic differentiation
measure of population genetics, Gst, was originally designed
to measure fixation probabilities, and its gradual adoption
as a measure of compositional differentiation is, from an outsider's
view, almost inexplicable. Back in the days when geneticists
studied very simple systems with just two alleles, these measures
gave acceptable results. However, when diversity is high, they
give wildly incorrect results. Gst, for example, can be arbitrarily
close to zero even if differentiation is 100%. The belief that
Gst is a measure of differentiation is a sort of collective
hallucination, a mass hysteria in which geneticists are driven
by peer pressure to believe in something which any of them could
easily have disproved had they tried. Even after Hedrick (2005)
and Gregorius (1988) pointed this out in precise quantitative
detail, most population geneticists irrationally continued to
use Gst as a measure of compositional differentiation. Few episodes
in the history of science are more difficult to explain than
this. (One reason these problems were ignored is that biologists
rarely pay much attention to the actual values of these measures
anyway. Instead biologists use them as mere tools to generate
p-values. Unfortunately the p-value of a meaningless measure
is also meaningless.)
The problems in population genetics start with the misinterpretation
of heterozygosity as "diversity". One of the principal
themes of conservation biology is the conservation of diversity.
The moment that we begin to conceive of diversity as something
that can be conserved or lost by preserving or destroying populations,
we implicitly impose subtle mathematical requirements on the
concept. For example, if conservation arguments are to be logically
consistent, the amount of diversity lost plus the amount of
diversity saved must add to the total diversity, at least in
highly symmetric examples where other factors that might affect
diversity are held constant. Imagine that we are evaluating
a conservation plan for an endangered colonial seabird. Its
entire population nests on a cluster of twenty small rock islands.
Each island contains about the same number of birds, and members
of the colonies always return to their home colony to breed.
A genetic analysis at a very polymorphic locus shows that each
colony (deme) has a heterozygosity of 0.95, and also shows that
no alleles are shared by members of different colonies (all
alleles are private). The military plans to use 19 of the 20
islands for bombing practice. Geneticists are asked what effect
this will have on the genetic diversity of the endangered seabird.
Following standard procedure, geneticists will equate genetic
diversity with heterozygosity, and calculate the answer.
Geneticists hired by the military will argue that even if we save just one
island, we will be preserving almost all of the genetic "diversity"
of the species. The total heterozygosity of the pooled colonies
is 0.9975, and the heterozygosity of one colony is 0.95, so
the proportion of the species' genetic "diversity"
saved is 0.95/0.9975 = 95%.
Geneticists hired by an environmental protection
group will use the same measures to argue that the military's
plan will destroy more than 99.9% of the species' genetic "diversity":
the 19 doomed colonies, pooled, have a heterozygosity of 0.9974, and
0.9974/0.9975 exceeds 0.999. The same measures and the same
reasoning, applied to the same data, lead to diametrically opposite
conclusions depending on which side we are on.
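The arithmetic behind both claims can be checked with a short Python sketch. It assumes, purely for illustration, that each colony carries 20 equally common private alleles (one way to obtain a heterozygosity of 0.95) and that colonies are pooled with equal weight. The function names are hypothetical.

```python
def heterozygosity(freqs):
    """Expected heterozygosity: 1 minus the sum of squared allele frequencies."""
    return 1.0 - sum(p * p for p in freqs)


def pooled_private(n_colonies, alleles_per_colony):
    """Allele frequencies after pooling equally sized colonies whose
    alleles are all private and (by assumption) equally common."""
    total = n_colonies * alleles_per_colony
    return [1.0 / total] * total


h_one = heterozygosity(pooled_private(1, 20))    # one surviving colony
h_all = heterozygosity(pooled_private(20, 20))   # all 20 colonies pooled
h_lost = heterozygosity(pooled_private(19, 20))  # the 19 bombed colonies pooled

saved_claim = h_one / h_all   # the military's "proportion saved"
lost_claim = h_lost / h_all   # the environmentalists' "proportion destroyed"

print(f"H one colony = {h_one:.4f}, H total = {h_all:.4f}, H lost = {h_lost:.4f}")
print(f"'saved' = {saved_claim:.3f}, 'lost' = {lost_claim:.4f}, "
      f"sum = {saved_claim + lost_claim:.3f}")
```

The two "proportions" add up to nearly 2 rather than 1, which is the logical inconsistency in question.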
In fact, both results are as fishy as the seabirds' feces. All
the colonies are equally large and equally diverse, and all
alleles are private to single islands, so each colony should
contribute equally to the total genetic diversity of the species.
The loss of 19 of the 20 colonies should cause the loss of 95%
(=19/20) of the species' genetic diversity, and saving one colony
should preserve 5% (=1/20) of the species' diversity. These
sum to unity, as they must. If we want to avoid logical contradictions
when arguing about conserving diversity, our diversity measure
must behave in this way in these highly symmetrical cases. These
cases are an acid test of whether our measures match our diversity
concept. Heterozygosity fails this test and the others described
in Jost (2008).
There are many measures which pass these kinds
of tests, and I call them "true diversities". One
which has been floating around in population genetics for a
long time is Kimura and Crow's (1964) "effective number
of alleles". This can be safely equated with genetic diversity.
A simple transformation, 1/(1-H), converts heterozygosity to
effective number of alleles. This transformation is not optional
if we want logical consistency in our arguments about diversity.
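As a rough check, the same seabird numbers can be pushed through the 1/(1-H) transformation in Python. The heterozygosity of the 19 pooled colonies, 1 - 1/380, is not stated in the text; it is derived here from the assumptions of equal colony sizes and private alleles. The function name is hypothetical.

```python
def effective_alleles(h):
    """Kimura and Crow's effective number of alleles, 1 / (1 - H)."""
    return 1.0 / (1.0 - h)


d_one = effective_alleles(0.95)           # one colony
d_all = effective_alleles(0.9975)         # all 20 colonies pooled
d_lost = effective_alleles(1 - 1 / 380)   # 19 colonies pooled (derived value)

fraction_saved = d_one / d_all
fraction_lost = d_lost / d_all

print(f"effective alleles: one = {d_one:.1f}, all = {d_all:.1f}, lost = {d_lost:.1f}")
print(f"saved = {fraction_saved:.3f}, lost = {fraction_lost:.3f}, "
      f"sum = {fraction_saved + fraction_lost:.3f}")
```

On this scale one colony holds 20 of the species' 400 effective alleles, so the saved and lost fractions come out as 1/20 and 19/20 and sum to unity.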
We are no more free to choose measures of diversity than we are
free to choose formulas for variance or standard deviation.
Standard deviation and variance are simple transformations of
each other, but each has properties that the other lacks, and
reasoning which makes implicit use of the mathematical properties
of standard deviations will often be wrong if applied to variances
(and vice versa). It is time to get serious about our measures.
Imagine what kind of mess statistics would be in if no one recognized
the differences between variance and other measures of dispersion,
and called them all by the same word, and used them
interchangeably in their formulas. The mess we would have is
not unlike the mess we have now in ecology and population genetics
surrounding the concept of diversity. This is especially true
in their conservation applications. How can conservation biology
and conservation genetics do their important tasks of guiding
policy-makers and resource managers and the general public, if
they cannot even achieve a basic level of logical consistency?
Geneticists made another, more or less independent error which compounded
the misconceptions about diversity. The idea that "diversity"
or heterozygosity can be additively partitioned into within-
and between-group components, by analogy with the additive decomposition
of independent variances, is a mistake which has been corrected
in most other sciences since at least 1975 (Aczél and Daróczy
1975). But in biology (both population genetics and ecology),
this mistake has dovetailed with historical accidents to become
a standard partitioning technique to this day (though see Gregorius
1988). This additive partitioning is at the heart of Nei's (1973)
derivation of Gst as a measure of differentiation. This measure,
(Ht-Hs)/Ht, was perhaps uncritically accepted because of its
close relation to Fst, which had long been (mis-)used as a measure
of genetic differentiation. Ht (total heterozygosity of the
pooled demes) is necessarily greater than or equal to Hs and
less than or equal to unity. This obviously means that if within-group
heterozygosity Hs is high (close to unity), Ht-Hs will be close
to 1-1 = 0, no matter how differentiated are the demes. Dividing
by Ht (~ 1) doesn't change this. How could geneticists accept
a measure of "differentiation" which necessarily approaches
zero when heterozygosity is high? Even Nei (1973) recognized
this problem, but later geneticists continued to make the mistake
of interpreting Gst or Fst as the principal measure of allelic
differentiation in population genetics. This mistake contaminates
most of the empirical studies on the genetic structure of populations.
It makes nonsense out of many theoretical conclusions as well.
Evolutionary models of population subdivision are based on the
effect of migration and mutation on Gst. If model parameters
produce a low value of Gst, the modellers conclude there is
little genetic differentiation among demes. Since Gst turns
out not to be a measure of differentiation, this is false reasoning,
as one can easily show by trying some high-mutation-rate, low-migration-rate
examples with many demes, and looking at the actual allele distributions
that result at equilibrium. Gst can be low for these examples
(especially if deme size and number of demes are both large)
but the differences in allele frequencies between demes can
be very large.
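A small Python sketch makes the point concrete. It computes Gst = (Ht-Hs)/Ht for demes that share no alleles at all, under the illustrative assumption that each of n equally weighted demes carries k equally common private alleles. For comparison it also computes the differentiation measure D; the formula used, D = [(Ht-Hs)/(1-Hs)] * [n/(n-1)], is taken from Jost (2008) and is not stated in this text. Function names are hypothetical.

```python
def gst(hs, ht):
    """Nei's Gst: (Ht - Hs) / Ht."""
    return (ht - hs) / ht


def jost_d(hs, ht, n_demes):
    """Differentiation D as given in Jost (2008)."""
    return ((ht - hs) / (1.0 - hs)) * (n_demes / (n_demes - 1.0))


n_demes = 20
for k in (2, 20, 200):                  # private alleles per deme
    hs = 1.0 - 1.0 / k                  # within-deme heterozygosity
    ht = 1.0 - 1.0 / (n_demes * k)      # heterozygosity of the pooled demes
    print(f"k={k:3d}  Hs={hs:.4f}  Ht={ht:.5f}  "
          f"Gst={gst(hs, ht):.4f}  D={jost_d(hs, ht, n_demes):.4f}")
```

In every row the demes are completely differentiated, yet Gst shrinks toward zero as within-deme heterozygosity rises, while D stays at 1.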
How then should we measure diversity and differentiation in population
genetics? Biologists have often treated these questions as if
they were matters of opinion, but in fact the concepts of diversity,
compositional similarity, and differentiation have deep logical
and mathematical roots, rich contexts, and many interconnections.
These foundations have not been appreciated by most biologists.
I have tried to explain some of this in the following articles.
Most are written for ecologists rather than geneticists, but
the issues are nearly identical in both fields.
Jost, L. 2006. Entropy and diversity. Oikos 113: 363–375.
Jost, L. 2007. Partitioning diversity into independent alpha and beta components. Ecology 88: 2427–2439.
Jost, L. 2008. Gst and its relatives do not measure differentiation. Molecular Ecology 17: 4015–4026.
Jost, L. 2009. Mis-measuring biological diversity. Ecological Economics 68: 925–928.
Jost, L. 2009. Partitioning diversity into independent components: Reply to Veech and Crist. Ecology, in press.
Jost, L. 2009. Gst versus D: Reply to Heller and Siegismund (2009) and Ryman and Leimar (2009). Molecular Ecology, in press.
Please use the Population Genetics forum
in Nature Forums (http://network.nature.com/groups/popgen/forum/)
to post responses, queries, or criticisms.
Software for correctly-formulated diversity and
differentiation measures is available at:
Chao, A, and Shen, 2009. SPADE.
http://chao.stat.nthu.edu.tw/softwareCE.html
Crawford, N. www.ngcrawford.com/django/jost/
This now accepts popular pop gen file formats.
You can also download a basic Excel sheet for
calculating genetic differentiation D (Jost 2008) for two demes
by clicking here: Excel
Genetic Differentiation D Calculator. The programs
just cited are much more complete.
Note that these programs may sometimes produce
a differentiation value that is negative. This is not a flaw but
a necessary feature of an unbiased estimation procedure. "Unbiased"
means that the expected value of the estimator equals the true
value of the parameter you are trying to estimate. The estimator's
undershoots must therefore balance its overshoots on average
(otherwise it would have a bias). When the true value is
zero, it will sometimes overshoot, but this means it must counterbalance
that by sometimes undershooting. When the true population value
of differentiation is near zero, the only way to undershoot
is to sometimes produce an estimate that is negative.
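The effect can be seen in a small simulation. The sketch below does not reproduce the estimators used by the programs listed above; as a stand-in it uses a simpler quantity, the sum of squared allele-frequency differences between two demes, for which an unbiased estimator is easy to write down. The allele frequencies, sample size, and seed are arbitrary illustrative choices, and the function names are hypothetical.

```python
import random


def sample_freqs(true_freqs, n_genes, rng):
    """Draw n_genes allele copies and return the sample allele frequencies."""
    alleles = range(len(true_freqs))
    draws = rng.choices(alleles, weights=true_freqs, k=n_genes)
    return [draws.count(a) / n_genes for a in alleles]


def unbiased_sq_distance(f1, f2, n1, n2):
    """Unbiased estimate of sum_i (p1_i - p2_i)^2 between two demes."""
    naive = sum((a - b) ** 2 for a, b in zip(f1, f2))
    # The naive value is inflated by sampling variance, so subtract
    # unbiased estimates of that variance for each sample.
    correction = (sum(a * (1 - a) for a in f1) / (n1 - 1)
                  + sum(b * (1 - b) for b in f2) / (n2 - 1))
    return naive - correction


rng = random.Random(1)
true_freqs = [0.25, 0.25, 0.25, 0.25]  # both demes identical: true value is 0
n = 50                                 # allele copies sampled per deme

estimates = []
for _ in range(2000):
    f1 = sample_freqs(true_freqs, n, rng)
    f2 = sample_freqs(true_freqs, n, rng)
    estimates.append(unbiased_sq_distance(f1, f2, n, n))

mean_estimate = sum(estimates) / len(estimates)
share_negative = sum(e < 0 for e in estimates) / len(estimates)
print(f"mean estimate = {mean_estimate:.5f}, share negative = {share_negative:.2f}")
```

The average of the estimates sits close to the true value of zero only because a substantial share of the individual estimates are negative; truncating them at zero would push the average upward and introduce a bias.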
Citations (other than my own articles, which are cited above)
Aczél, J., Daróczy, Z. 1975. On measures of information and their characterizations. Mathematics in Science and Engineering, vol. 115. Academic Press, New York, San Francisco, London. xii + 234 pp.
Gregorius, H.-R. 1988. The meaning of genetic variation within and between subpopulations. Theor. Appl. Genet. 76: 947–951.
Hedrick, P. 2005. A standardized genetic differentiation measure. Evolution 59: 1633–1638.
Kimura, M., Crow, J. 1964. The number of alleles that can be maintained in a finite population. Genetics 49: 725–738.
Nei, M. 1973. Analysis of gene diversity in subdivided populations. Proceedings of the National Academy of Sciences, USA 70: 3321–3323.