Chapter 7 Bayesian phylogenetics

Let’s take stock of what we have covered so far. We introduced models, and how to simulate data with them. We then explored how to use models to calculate the likelihood – the probability of the observed data given the topology, edge lengths, model parameters, and the model itself. We used likelihood as an optimality criterion in heuristic searches to find the Maximum Likelihood (ML) phylogeny. We introduced methods to calculate the frequency of edges on a focal topology, like the ML topology, in a sample of phylogenies. We showed how to generate a sample of phylogenies by running ML searches on bootstrapped matrices, and used this to calculate bootstrap support for each edge in the ML phylogeny.

ML and bootstraps are widely used and are a critical foundation for many phylogenetic analyses. There are, however, a few things about analysis frameworks based on optimality criteria, ML, and bootstraps that are not ideal:

  • Whenever we apply an optimality criterion, such as ML or parsimony, to identify the “best” phylogeny, that doesn’t tell us anything about how much better this optimal phylogeny is than other hypotheses. If it is the best phylogeny by far, then this single phylogeny tells us a lot about the phylogenetic information in our data. If, however, there are multiple phylogenies that are almost as good as the best one, as is often the case, then we would need to know about these others as well to have a good understanding of what hypotheses are consistent with our data.

  • Likelihood is the probability of the data given the phylogenetic hypothesis. The probability of the data under the ML hypothesis will be exceptionally low, often far less than one in a million, since the ML hypothesis could generate many other datasets as well. This is because we are evaluating the probability of the data given the hypothesis, but often what we really want to know is the probability of the hypothesis given the data. The two are quite different, but they are related. It would be nice to be more explicit about that relationship.

  • Bootstraps are a convenient way to generate a sample of phylogenies, but their statistical interpretation is not clear. The frequencies of bootstrap supports don’t correspond in a clear way to probabilities of our hypothesis, but instead tell us something about how frequently features of the phylogeny are supported when we resample the data. In practice it turns out this is a good proxy for phylogenetic support, but it would be nice to have support values that have a more explicit statistical interpretation.

All analyses include a variety of tradeoffs, and the downsides of the issues listed above are often mitigated by the multiple upsides of a ML bootstrap analysis framework. It also greatly helps that there are decades of experience with such analyses that help contextualize and interpret ML bootstrap results. This work has also produced ML methods and software packages that are highly optimized for computational efficiency, enabling the routine application of these approaches to very large datasets.

Another analysis framework, Bayesian statistics, directly addresses all of the issues listed above. Rather than focus on a single “best” topology, it provides a set of hypotheses consistent with the data. The frequencies of these hypotheses in the sample are the expected probabilities of the hypotheses given the data. These frequencies have a clear probabilistic interpretation, unlike the bootstrap support values. Bayesian statistics and likelihood have deep mathematical connections, so we will be able to build directly on the foundation established in previous chapters.

7.1 Bayesian statistics

Bayesian statistics is much older than many other domains of statistics, but it has recently seen a surge in use as computational and methodological advances allow it to be applied more effectively to a wider range of problems. The basic intuition is that our understanding of the world is not drawn from the data alone. Instead, the data are used to update our prior understanding of the world. This updated understanding is referred to as the posterior. There are many excellent introductions to Bayesian statistics, so I will not provide a full derivation here. If you would like more of a background, please consult Appendix A.3.

Bayes’ theorem establishes the following relationship:

\[\begin{equation} P(H|D) = \frac{P(D|H)P(H)}{P(D)} \tag{7.1} \end{equation}\]

\(P(H|D)\), read as “the probability of the hypothesis given the data”, is the posterior probability of a given hypothesis. It takes into account prior expectations and new insight from the data. This posterior probability is what we are trying to estimate. We have already seen \(P(D|H)\), read as “the probability of the data given the hypothesis”. This is the likelihood, and from previous chapters we know how to calculate it. \(P(H)\) and \(P(D)\) are our prior probabilities, our world view before we collected data. \(P(H)\) is the prior probability of the hypothesis, without considering the data. \(P(D)\) is the prior probability of the data, without considering the hypothesis.

Let’s plug in a few numbers to get some intuition for the behavior of Bayes’ theorem. First, consider the case where \(P(H)=0\), i.e. you assign a prior probability of \(0\) to the hypothesis. The posterior probability, \(P(H|D)\), will then also be \(0\). This shows that if you believe ahead of collecting data that the hypothesis is absolutely impossible, no amount of data can change your mind and your updated hypothesis, the posterior, will also be \(0\).

Second, let’s consider the case where the data are just as likely under the hypothesis as they are in general. This would be like having a medical test for a specific condition where the test was so bad that the data (test result) didn’t depend in any way on the hypothesis (the presence of the condition). For example, the test returns a positive 10% of the time regardless of whether you have the condition or not. In that situation, \(P(D|H)=P(D)\). Since \(P(D|H)\) is in the numerator and \(P(D)\) is in the denominator, they cancel out. That leaves \(P(H|D)=P(H)\) – the posterior probability of the hypothesis is the same as the prior on the hypothesis. The data didn’t change our understanding of the hypothesis at all. This is the behavior we want when the data have no information relevant to the hypothesis.

Third, consider a case where the data are perfect indicators of the hypothesis. You get a specific pattern in the data whenever, and only when, the hypothesis is true. Because the hypothesis and data are perfectly linked, \(P(D)=P(H)\), and the priors cancel out. In this case, \(P(H|D)=P(D|H)\), i.e. the posterior becomes equal to the likelihood. This never happens in phylogenetic inference, since any given phylogeny could always generate multiple different sequences at the tips. But it is the case that as the data become more informative the priors have less impact on the posterior.
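The three cases above can be checked directly by plugging numbers into Equation (7.1). The specific probabilities below are arbitrary, chosen only to illustrate each case:

```python
def posterior(p_d_given_h, p_h, p_d):
    """Bayes' theorem: P(H|D) = P(D|H) * P(H) / P(D)."""
    return p_d_given_h * p_h / p_d

# Case 1: a prior of zero -- no amount of data can rescue the hypothesis.
print(round(posterior(p_d_given_h=0.9, p_h=0.0, p_d=0.1), 6))  # 0.0

# Case 2: uninformative data, P(D|H) = P(D) -- the posterior equals the prior.
print(round(posterior(p_d_given_h=0.1, p_h=0.3, p_d=0.1), 6))  # 0.3

# Case 3: data perfectly track the hypothesis, P(D) = P(H) --
# the posterior equals the likelihood.
print(round(posterior(p_d_given_h=0.9, p_h=0.2, p_d=0.2), 6))  # 0.9
```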

7.2 Bayesian phylogenetic inference

To calculate the posterior probability of a phylogenetic hypothesis, we need to address all the terms on the right side of Equation (7.1). We already know how to calculate \(P(D|H)\), the likelihood. What about \(P(H)\) and \(P(D)\)? There are a variety of practical approaches to \(P(H)\) that have been shown to work well in the context of phylogenetics. These include uniform priors, where all topologies are considered to be equally likely. Edge lengths are commonly assigned exponential priors. One of the major take-aways from the past two decades of work on Bayesian phylogenetics is that inference is quite robust to the choice of \(P(H)\) when the data are informative about the phylogeny, as expected.
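One common way to write these priors out explicitly: for an unrooted binary topology \(T\) with \(n\) tips there are \((2n-5)!!\) possible topologies, each given equal prior probability under a uniform prior, and each edge length \(b_i\) is assigned an independent exponential prior with some rate \(\lambda\):

\[P(H) = P(T)\prod_{i} p(b_i), \qquad P(T)=\frac{1}{(2n-5)!!}, \qquad p(b_i)=\lambda e^{-\lambda b_i}\]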

What then, about \(P(D)\)? This is our prior on the data itself. Calculating it requires integrating the probability of generating these particular character data (e.g., nucleotide sequences observed at the tips) across all possible topologies and edge lengths. That would be prohibitively computationally expensive to actually do. So we won’t.
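Written out, this prior on the data sums over every possible topology \(T\) and integrates over all possible edge lengths \(\mathbf{b}\) and model parameters \(\theta\):

\[P(D) = \sum_{T} \int P(D \mid T, \mathbf{b}, \theta)\, P(T, \mathbf{b}, \theta)\, d\mathbf{b}\, d\theta\]

The number of topologies alone grows super-exponentially with the number of tips, which is why evaluating this expression directly is out of reach.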

Instead, we will forgo dealing with \(P(D)\) at all by approximating the posterior with Markov Chain Monte Carlo (MCMC) sampling (Metropolis et al. 1953). MCMC is a widely used method for approximating probability distributions that are too complex to calculate analytically. It is implemented by proposing a series of hypotheses that are either accepted or rejected based on a specially formulated test statistic \(R\) and criteria for evaluating this statistic, such that the accepted hypotheses form a sample drawn from the distribution of interest. In our case, that distribution of interest is the posterior distribution.

Consider the current hypothesis to be \(H\), and the newly proposed hypothesis \(H^*\). We calculate our test statistic as the ratio of the posterior probability of \(H^*\) to the posterior probability of \(H\):

\[\begin{equation} R = \frac{P(D|H^*)P(H^*)}{P(D)} \frac{P(D)}{P(D|H)P(H)} = \frac{P(D|H^*)P(H^*)}{P(D|H)P(H)} \tag{7.2} \end{equation}\]

Because \(P(D)\) doesn’t depend on the hypothesis under consideration, it is the same in both posteriors. Because we are considering a ratio of posteriors, it cancels out. We don’t need to calculate it to derive \(R\).
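In practice, likelihoods are so small that \(R\) is computed in log space to avoid numerical underflow. A minimal sketch, where the log-likelihood and log-prior values are assumed to have been computed elsewhere:

```python
import math

def acceptance_ratio(log_lik_new, log_prior_new, log_lik_old, log_prior_old):
    """Equation (7.2) computed in log space:
    log R = [log P(D|H*) + log P(H*)] - [log P(D|H) + log P(H)]."""
    log_r = (log_lik_new + log_prior_new) - (log_lik_old + log_prior_old)
    return math.exp(min(log_r, 0.0))  # values of R above 1 all mean "accept"

# A proposed hypothesis slightly worse than the current one is accepted
# only some of the time:
print(acceptance_ratio(-1002.3, -5.0, -1001.1, -5.0))  # exp(-1.2), about 0.30
```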

The “Markov Chain” in MCMC alludes to the fact that MCMC is a series of repeated events, each of which depends only on the previous step. Each repeated cycle of events, also known as a generation, proceeds as follows:

  1. We have the current hypothesis \(H\).
  2. We propose a new hypothesis \(H^*\) by modifying \(H\).
  3. We calculate \(R\) according to Equation (7.2).
  4. If \(R \geq 1\), we accept \(H^*\). If \(R < 1\), we accept \(H^*\) with probability \(R\); otherwise, we retain \(H\).
  5. The result of the step above is added to the posterior sample, and becomes \(H\) for the next iteration of the cycle.

MCMC produces a sample of model parameters and topologies with edge lengths. This sample is an approximation of the posterior distribution of these entities. We can summarize the topologies in this posterior sample in the same way we did for bootstraps, as edge frequencies. Unlike bootstraps, though, the frequency of an edge in this distribution has a clear statistical interpretation. It is an approximation of the posterior probability of that edge, i.e. the probability of the edge given the data. The edge lengths and model parameters form continuous probability distributions. We can summarize these in a variety of ways, for example by taking the mean for each.
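Summarizing a posterior sample as edge frequencies amounts to counting how often each clade appears across the sampled topologies. A toy illustration, using frozensets of tip names as a simplified stand-in for real tree objects:

```python
from collections import Counter

# A hypothetical posterior sample of four topologies for tips A-D. Each
# tree is represented as a frozenset of clades, and each clade as a
# frozenset of tip names.
posterior_sample = [
    frozenset({frozenset({"A", "B"}), frozenset({"C", "D"})}),
    frozenset({frozenset({"A", "B"}), frozenset({"C", "D"})}),
    frozenset({frozenset({"A", "C"}), frozenset({"B", "D"})}),
    frozenset({frozenset({"A", "B"}), frozenset({"C", "D"})}),
]

# The frequency of each clade across the sample approximates its
# posterior probability.
counts = Counter(clade for tree in posterior_sample for clade in tree)
for clade, n in sorted(counts.items(), key=lambda kv: -kv[1]):
    print(sorted(clade), n / len(posterior_sample))
```

Here the clade (A, B) appears in 3 of the 4 sampled trees, for an approximate posterior probability of 0.75.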

There are a variety of practical considerations to implementing a Bayesian phylogenetic analysis with MCMC. These include:

  • Selecting the appropriate model. Many of the same considerations we reviewed in the context of likelihood apply here.
  • Selecting appropriate priors on topology, edge length, and model parameters.
  • Devising an appropriate hypothesis proposal mechanism. If the new hypothesis \(H^*\) is too different from \(H\), it will tend to be rejected once \(H\) becomes optimized, so the chain rarely moves. This will lead to an approximated posterior distribution that is too narrow, leading to overconfidence in a small set of hypotheses. If the steps are too small, MCMC will tend to get trapped in local optima, because it cannot reach more distant, better optima. There are several approaches to these challenges, including searches with multiple MCMC chains that take different step sizes.
  • The very early samples in the chain are not a good estimate of the posterior distribution, because the initial \(H\) is not necessarily close to a peak in the posterior. The initial samples before MCMC settles into a stable distribution are therefore discarded as part of a “burn-in” phase.
  • Knowing when enough generations have been sampled to adequately approximate the posterior distribution. This is often evaluated by running multiple independent MCMC analyses and stopping them if and when they have converged on similar distributions.
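The last two points can be illustrated together. Below, two independent chains are run, the first samples are discarded as burn-in, and the post-burn-in summary statistics are compared as a crude convergence check. The target is again a toy standard normal, standing in for a full phylogenetic run:

```python
import math
import random
import statistics

def chain(seed, n_generations=5000, start=5.0, step=0.8):
    """A toy random-walk Metropolis chain targeting a standard normal,
    standing in for a full phylogenetic MCMC analysis."""
    random.seed(seed)
    x, out = start, []
    for _ in range(n_generations):
        y = x + random.gauss(0.0, step)
        log_r = -0.5 * (y * y - x * x)    # log posterior ratio
        if log_r >= 0 or random.random() < math.exp(log_r):
            x = y
        out.append(x)
    return out

# Run two independent chains and discard the first 10% as burn-in.
a = chain(seed=1)[500:]
b = chain(seed=2)[500:]

# A crude convergence check: independent chains should agree on the
# summary statistics of the distribution they are sampling.
print(statistics.mean(a), statistics.stdev(a))
print(statistics.mean(b), statistics.stdev(b))
```

Real analyses use more formal diagnostics than eyeballing means, but the logic is the same: independent chains that have converged should approximate the same distribution.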

References

Metropolis, Nicholas, Arianna W Rosenbluth, Marshall N Rosenbluth, Augusta H Teller, and Edward Teller. 1953. “Equation of State Calculations by Fast Computing Machines.” The Journal of Chemical Physics 21 (6): 1087–92. https://doi.org/10.1063/1.1699114.