Chapter 7 Bayesian phylogenetics
Let’s take stock of what we have covered so far. We introduced models, and how to simulate data with them. We then explored how to use models to calculate the likelihood – the probability of the observed data given the topology, branch lengths, model parameters, and model. We used likelihood as an optimality criterion in heuristic searches to find the Maximum Likelihood (ML) phylogeny. We introduced methods to calculate the frequency of branches of a focal topology, such as the ML topology, in a sample of phylogenies. We showed how to generate a sample of phylogenies by running ML searches on bootstrapped matrices, and used this to calculate bootstrap support for each branch in the ML phylogeny.
ML and bootstraps are widely used and are a critical foundation for many phylogenetic analyses. There are, however, a few things about analysis frameworks based on optimality criteria, ML, and bootstraps that are not ideal:
Whenever we apply an optimality criterion, such as ML or parsimony, to identify the “best” phylogeny, that doesn’t tell us anything about how much better this optimal phylogeny is than other hypotheses. If it is the best phylogeny by far, then this single phylogeny tells us a lot about the phylogenetic information in our analysis. If, however, there are multiple phylogenies that are almost as good as the best one, as is often the case, then we would need to know about these others as well to have a good understanding of what hypotheses are consistent with our data.
Likelihood is the probability of the data given the phylogenetic hypothesis. The probability of the data under the ML hypothesis will be exceptionally low, often far less than one in a million, since the ML hypothesis could generate many other datasets as well. This is because we are evaluating the probability of the data given the hypothesis, but often what we really want to know is the probability of the hypothesis given the data. The two are quite different, but they are related. It would be nice to be more explicit about that relationship.
Bootstraps are a convenient way to generate a sample of phylogenies, but their statistical interpretation is not clear. Bootstrap support values don’t correspond in a clear way to the probabilities of our hypotheses; instead, they tell us something about how frequently features of the phylogeny are recovered when we resample the data. In practice it turns out this is a good proxy for phylogenetic support, but it would be nice to have support values with a more explicit statistical interpretation.
All analyses include a variety of tradeoffs, and the downsides of the issues listed above are often mitigated by the multiple upsides of an ML bootstrap analysis framework. It also greatly helps that there are decades of experience with ML analyses that help contextualize and interpret ML bootstrap results. This work has produced ML methods and software packages that are highly optimized for computational efficiency, enabling the routine application of these approaches to very large datasets.
Another analysis framework, Bayesian statistics, directly addresses all of the issues listed above (Chen, Kuo, and Lewis 2014). Rather than focus on a single “best” topology, it provides a sample of hypotheses consistent with the data. The frequencies of these hypotheses in the sample are the expected probabilities of the hypotheses given the data. These frequencies have a clear probabilistic interpretation, unlike bootstrap support values. Bayesian statistics and likelihood have deep mathematical connections, so we will be able to build directly on the foundation established in previous chapters.
7.1 Bayesian statistics
Bayesian statistics is much older than many other domains of statistics, but has recently seen a surge in use as computational and methodological advances allow it to be applied more effectively to a wider range of problems. The basic intuition is that our understanding of the world is not drawn from the data alone. Instead, the data are used to update our prior understanding of the world. This updated understanding is referred to as the posterior. There are many excellent introductions to Bayesian statistics, so I will not provide a full derivation here. If you would like more of a background, please consult Appendix A.2.
Bayes’ theorem establishes the following relationship:
\[\begin{equation} P(H|D) = \frac{P(D|H)P(H)}{P(D)} \tag{7.1} \end{equation}\]
\(P(H|D)\), read as “the probability of the hypothesis given the data”, is the posterior probability of a given hypothesis. It takes into account prior expectations and new insight from the data. This posterior probability is what we are trying to estimate. We have already seen \(P(D|H)\), read as “the probability of the data given the hypothesis”. This is the likelihood, and from previous chapters we know how to calculate it. \(P(H)\) and \(P(D)\) are our prior probabilities, our world view before we collected data. \(P(H)\) is the prior probability of the hypothesis, without considering the data. \(P(D)\) is the prior probability of the data, without considering the hypothesis.
Let’s plug in a few numbers to get some intuition for the behavior of Bayes’ theorem. First, consider the case where \(P(H)=0\), i.e. you assign a prior probability of \(0\) to the hypothesis. The posterior probability, \(P(H|D)\), will then also be \(0\). This shows that if you believe ahead of collecting data that the hypothesis is absolutely impossible, no amount of data can change your mind and your updated hypothesis, the posterior, will also be \(0\).
Second, let’s consider the case where the data are just as likely under the hypothesis as they are under other hypotheses. This would be like having a medical test for a specific condition where the test was so bad that the data (test result) didn’t depend in any way on the hypothesis (the presence of the condition). For example, the test returns a positive 10% of the time regardless of whether you have the condition or not. In that situation, \(P(D|H)=P(D)\). Since \(P(D|H)\) is in the numerator and \(P(D)\) is in the denominator, they cancel out. That leaves \(P(H|D)=P(H)\) – the posterior probability of the hypothesis is the same as the prior on the hypothesis. The data didn’t change our understanding of the hypothesis at all. This is the behavior we want when the data have no information relevant to the hypothesis.
Third, consider a case where the data are perfect indicators of the hypothesis. You get a specific pattern in the data whenever, and only when, the hypothesis is true. Because the hypothesis and data are perfectly linked, \(P(D)=P(H)\), and the priors cancel out. In this case, \(P(H|D)=P(D|H)\), i.e. the posterior becomes equal to the likelihood. This never happens in phylogenetic inference, since any given phylogeny could always generate many different sequences at the tips. But it is the case that as the data become more informative, the priors have less impact on the posterior.
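To make this behavior concrete, here is a minimal numerical sketch in Python. The test performance numbers (a 90% true positive rate, together with the 10% false positive rate from the example above) are illustrative assumptions, not values from any real test:

```python
def posterior(p_d_given_h, p_h, p_d):
    """Bayes' theorem: P(H|D) = P(D|H) * P(H) / P(D)."""
    return p_d_given_h * p_h / p_d

# Case 1: a prior of zero. No amount of data can rescue the hypothesis.
print(posterior(p_d_given_h=0.9, p_h=0.0, p_d=0.1))  # 0.0

# Case 2: an uninformative test, P(D|H) = P(D). The posterior equals the prior.
print(posterior(p_d_given_h=0.1, p_h=0.05, p_d=0.1))  # 0.05

# Case 3: an informative test with a 90% true positive rate and a 10%
# false positive rate. P(D) sums over both possibilities for H.
p_d = 0.9 * 0.05 + 0.1 * 0.95  # = 0.14
print(posterior(p_d_given_h=0.9, p_h=0.05, p_d=p_d))  # ~0.321
```

A positive result from the informative test raises a 5% prior to a posterior of about 32%: the data moved our understanding substantially, but the low prior still matters.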
7.2 Bayesian phylogenetic inference
To calculate the posterior probability of a phylogenetic hypothesis, we need to address all the terms on the right side of Equation (7.1). We already know how to calculate \(P(D|H)\), the likelihood. What about \(P(H)\) and \(P(D)\)? There are a variety of practical approaches to \(P(H)\) that have been shown to work well in the context of phylogenetics. These include uniform priors, where all topologies are considered to be equally likely. Priors on branch lengths are often modeled with an exponential distribution. One of the major take-aways from the past two decades of work on Bayesian phylogenetics is that, as expected, inference is quite robust to \(P(H)\) when the data are informative about the phylogeny.
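As a concrete illustration of a branch length prior, the short sketch below evaluates and samples from an exponential distribution with a mean of 0.1 substitutions per site. The mean is an assumed value chosen for illustration, not a universal default:

```python
import numpy as np

mean_branch_length = 0.1           # assumed prior mean, in substitutions per site
rate = 1.0 / mean_branch_length    # rate parameter of the exponential distribution

def exponential_prior_density(branch_length):
    """Prior density of a given branch length under the exponential prior."""
    return rate * np.exp(-rate * branch_length)

print(exponential_prior_density(0.01))  # short branches: high prior density
print(exponential_prior_density(0.5))   # long branches: low prior density

# Draws from the prior, e.g. to initialize branch lengths in an analysis
rng = np.random.default_rng(seed=1)
print(rng.exponential(scale=mean_branch_length, size=5))
```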
What then, about \(P(D)\)? This is our prior on the data itself. Calculating it requires integrating the probability of generating these particular character data (e.g., nucleotide sequences observed at the tips) across all possible topologies and branch lengths. That would be prohibitively computationally expensive to actually do. So we won’t.
Instead, we will forego calculating \(P(D)\) by approximating the posterior with Markov Chain Monte Carlo (MCMC) sampling (Metropolis et al. 1953). MCMC is a widely used method for approximating probability distributions that are too complex to calculate analytically. MCMC is implemented by proposing a series of hypotheses that are either rejected or accepted based on a specially formulated test statistic \(R\) and criteria for evaluating this statistic, such that the accepted hypotheses form a sample drawn from the distribution of interest. In our case, that distribution of interest is the posterior distribution.
Consider the current hypothesis to be \(H\), and the newly proposed hypothesis \(H^*\). We calculate our test statistic as the ratio of the posterior probability of \(H^*\) to the posterior probability of \(H\):
\[\begin{equation} R = \frac{P(D|H^*)P(H^*)}{P(D)} \frac{P(D)}{P(D|H)P(H)} = \frac{P(D|H^*)P(H^*)}{P(D|H)P(H)} \tag{7.2} \end{equation}\]
Because \(P(D)\) doesn’t depend on the hypothesis under consideration, it is the same in both posteriors. Because we are considering a ratio of posteriors, it cancels out. We don’t need to calculate it to derive \(R\).
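In practice, likelihoods are such small numbers that software computes \(R\) in log space to avoid numerical underflow. A minimal sketch of the calculation (the function names here are mine, not from any particular package):

```python
import math
import random

def log_acceptance_ratio(log_lik_new, log_prior_new, log_lik_old, log_prior_old):
    """log R = [log P(D|H*) + log P(H*)] - [log P(D|H) + log P(H)].
    P(D) cancels out of the ratio, so it never appears."""
    return (log_lik_new + log_prior_new) - (log_lik_old + log_prior_old)

def accept(log_r):
    """Accept if R >= 1 (log R >= 0); otherwise accept with probability R."""
    return log_r >= 0 or random.random() < math.exp(log_r)
```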
The “Markov Chain” in MCMC alludes to the fact that MCMC is a series of repeated steps, each of which depends only on the step before it. Each repeated cycle of events, also known as a generation, proceeds as follows (a minimal code sketch follows the list):
- We have hypothesis \(H\).
- We propose a new hypothesis \(H^*\) by modifying \(H\).
- We calculate \(R\) according to Equation (7.2).
- If \(R \geq 1\), we accept \(H^*\). If \(R<1\), we accept \(H^*\) with probability \(R\); otherwise, we retain \(H\).
- The result of the step above is added to the posterior sample, and becomes \(H\) for the next iteration of the cycle.
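The cycle above applies to any target distribution, not just phylogenetic posteriors. The sketch below runs it on a deliberately simple one-dimensional problem, where the hypothesis is a single number; the target distribution and step size are illustrative assumptions:

```python
import math
import random

def log_posterior(x):
    # Illustrative unnormalized log posterior: a normal with mean 2 and sd 1.
    # In a phylogenetic analysis this would be log P(D|H) + log P(H) for a
    # tree and its parameters; the normalizing constant P(D) is never needed.
    return -0.5 * (x - 2.0) ** 2

current = 0.0                # initial hypothesis H
samples = []
for generation in range(10_000):
    proposal = current + random.gauss(0.0, 0.5)          # propose H* from H
    log_r = log_posterior(proposal) - log_posterior(current)
    if log_r >= 0 or random.random() < math.exp(log_r):  # accept per R
        current = proposal
    samples.append(current)  # the retained hypothesis joins the sample

burn_in = 1_000
posterior_sample = samples[burn_in:]
print(sum(posterior_sample) / len(posterior_sample))     # approximately 2.0
```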
MCMC produces a sample of model parameters and topologies with branch lengths. This sample is an approximation of the posterior distribution of these entities. We can summarize the topologies in this posterior sample in the same way we did for bootstraps, with branch frequencies. Unlike bootstraps, though, the frequency of a branch in this distribution has a clear statistical interpretation. It is an approximation of the posterior probability of that branch, i.e. the probability of the branch given the data and our priors. The branch lengths and model parameters form continuous probability distributions. We can summarize these in a variety of ways, for example by taking the mean for each.
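For example, the following sketch summarizes a posterior sample of a single branch length with its mean and a 95% credible interval. The sample here is simulated for illustration; in a real analysis it would come from the MCMC output:

```python
import numpy as np

# Hypothetical posterior sample of one branch length
rng = np.random.default_rng(seed=2)
branch_length_sample = rng.exponential(scale=0.1, size=2000)

mean = branch_length_sample.mean()
lower, upper = np.percentile(branch_length_sample, [2.5, 97.5])
print(f"posterior mean {mean:.3f}, 95% credible interval ({lower:.3f}, {upper:.3f})")
```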
7.3 Bayesian phylogenetic inference in practice
There are a variety of practical considerations to implementing a Bayesian phylogenetic analysis with MCMC. These include:
- Selecting the appropriate model. Many of the same considerations we reviewed in the context of likelihood apply here.
- Selecting appropriate priors on topology, branch length, and model parameters.
- Devising an appropriate hypothesis proposal mechanism. If the new hypothesis \(H^*\) is too different from \(H\), it will tend to be rejected once the chain has settled into a region of high posterior probability, so the chain rarely moves. This leads to an approximated posterior distribution that is too narrow, and overconfidence in a small set of hypotheses. If the steps are too small, the chain explores hypothesis space slowly and can get trapped in local optima because it cannot reach more distant, better optima. There are several approaches to these challenges, including searches with multiple MCMC chains that take different step sizes.
- The very early samples in the chain are not a good estimate of the posterior distribution, because the initial \(H\) is not necessarily close to a peak in the posterior. The initial samples before MCMC settles into a stable distribution are therefore discarded as part of a “burn-in” phase.
- Knowing when enough generations have been sampled to adequately approximate the posterior distribution. This is often evaluated by running multiple independent MCMC analyses and stopping them if and when they have converged on similar distributions (one such convergence diagnostic is sketched below).
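One widely used convergence diagnostic is the Gelman-Rubin potential scale reduction factor, which compares variance within chains to variance between chains; values near 1 suggest that independent chains are sampling the same distribution. A minimal sketch (my own implementation, for intuition rather than production use):

```python
import numpy as np

def potential_scale_reduction(chains):
    """Gelman-Rubin R-hat for a list of equal-length 1D sample arrays."""
    chains = np.asarray(chains)              # shape: (n_chains, n_samples)
    n = chains.shape[1]
    chain_means = chains.mean(axis=1)
    b = n * chain_means.var(ddof=1)          # between-chain variance
    w = chains.var(axis=1, ddof=1).mean()    # mean within-chain variance
    var_hat = (n - 1) / n * w + b / n        # pooled variance estimate
    return np.sqrt(var_hat / w)

# Two chains drawn from the same distribution give R-hat close to 1
rng = np.random.default_rng(seed=3)
chain_a = rng.normal(0.0, 1.0, size=5000)
chain_b = rng.normal(0.0, 1.0, size=5000)
print(potential_scale_reduction([chain_a, chain_b]))
```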
The ability to specify a prior can be a great advantage of Bayesian analyses. A prior is an explicit, principled way to incorporate prior knowledge into a new analysis. For example, if you are very confident that a particular clade exists based on external evidence, you could design a topology prior that favors topologies that include that clade over topologies that don’t. If you have external information about model parameters, you could set a prior that constrains these parameters. These are cases of what are called informative priors: they are nonuniform, and intentionally constrain results based on external information. In many cases, though, investigators use relatively uninformative priors that give equal probabilities to all hypotheses. This allows the posterior to be dominated by the data, and is appropriate when the investigator doesn’t have external information to include in the prior or wants to focus on the signal in a particular dataset. If you are ever concerned about the impact of your priors on your posteriors, you can run your analysis without any data; the posterior will then be determined entirely by the prior. Comparing this data-free analysis to an analysis with data gives you a good sense of how much the prior and the data each contribute to the posterior.
Bayesian phylogenetic software stores samples of the run at regular intervals, for example every hundred generations (where a generation is a single MCMC proposal). Typically, there are two output files, a tree file and a trace file. The tree file has one line per sample, each with a tree in newick format. The trace file also has one line per sample. There are multiple columns, each with a model parameter or run statistic. It is called a trace file because you can plot each of these columns to trace the progress of the MCMC run.
Investigators should at a minimum examine the trace of the log likelihood (Figure 7.1). There are a few things to look for. First, an appropriate burn-in should be identified. This is the number of generations at the start of the run that are discarded as the MCMC chain settles into the sampling process. During the burn-in, the log likelihood rapidly climbs as the chain moves toward higher likelihood regions of tree and model space. In the example provided here, 100,000 generations is sufficient. Second, the investigator should examine the post-burn-in sampling process to make sure it is well mixed. It should appear noisy, as it does here, indicating that trees and model parameters with a range of likelihoods are sampled. There should not be any sudden major jumps or longer-term trends, which could indicate that the sampling process is not adequate (longer runs may be required).
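A sketch of this inspection in Python, assuming a tab-separated trace file named `trace.tsv` with columns named `generation` and `log_likelihood` (the file and column names are assumptions; each program has its own conventions):

```python
import pandas as pd
import matplotlib.pyplot as plt

trace = pd.read_csv("trace.tsv", sep="\t")  # hypothetical trace file

burn_in = 100_000  # generations to discard, chosen by inspecting the trace
plt.plot(trace["generation"], trace["log_likelihood"], linewidth=0.5)
plt.axvline(burn_in, color="red", linestyle="--", label="burn-in cutoff")
plt.xlabel("generation")
plt.ylabel("log likelihood")
plt.legend()
plt.show()

# Retain only post-burn-in samples for all downstream summaries
post_burn_in = trace[trace["generation"] > burn_in]
```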
Once the trace file has been examined, the investigator turns to the tree file. All pre-burn-in trees are discarded. You then pick a focal tree, which could be the maximum likelihood tree or a consensus of the post-burn-in Bayesian trees. The frequency of each edge in the focal tree is then calculated from the set of post-burn-in trees and recorded as the posterior probability of that edge. These are then plotted onto the tree, often along with the maximum likelihood bootstrap scores.
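A sketch of the edge frequency calculation, assuming each post-burn-in tree has already been decomposed into its clades, represented here as frozensets of tip names (newick parsing is omitted; libraries such as DendroPy provide it):

```python
from collections import Counter

# Hypothetical post-burn-in sample: each tree is the set of clades it contains
sampled_trees = [
    {frozenset({"A", "B"}), frozenset({"A", "B", "C"})},
    {frozenset({"A", "B"}), frozenset({"C", "D"})},
    {frozenset({"A", "B"}), frozenset({"A", "B", "C"})},
]

# Count how many sampled trees contain each clade
counts = Counter(clade for tree in sampled_trees for clade in tree)

# Posterior probability of each edge (clade) in a hypothetical focal tree
focal_tree = {frozenset({"A", "B"}), frozenset({"A", "B", "C"})}
for clade in sorted(focal_tree, key=len):
    print(sorted(clade), counts[clade] / len(sampled_trees))
# ['A', 'B'] 1.0
# ['A', 'B', 'C'] 0.666...
```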
7.4 Resources
- A video introduction to Bayes’ theorem: https://www.youtube.com/watch?v=HZGCoVF3YvM
- The excellent MCMC robots by Paul Lewis: https://plewis.github.io/applets/mcmc-robot/