Chapter 5 Molecular inference in practice

In the last chapter we examined the mathematical and statistical underpinnings of phylogenetic inference with maximum likelihood. In this chapter we examine some of the project design considerations and decisions you will need to make in implementing a phylogenetic analysis. We are sticking with DNA sequence data for now, but we will extend these approaches to other types of data later. This is an introduction to the decisions you will face and some of the approaches available, not a comprehensive explanation of how to implement a study.

5.1 Your question

The first step to an analysis is to clearly define your question. There will be many decisions to make along the way, and a clear question will provide an explicit benchmark for working through these decisions – which decision is best for the question at hand? If you have a vague question it will be much harder to decide which steps are best, and many decisions about trade-offs will essentially be arbitrary.

5.2 Taxon sampling

The tips in a phylogeny, be they genes, species, individuals, or some other evolving entity, are referred to as taxa. One of the first things you will want to consider is the breadth and depth of taxon sampling.

One of the most common mistakes in phylogenetic analyses is to include too few taxa. There are many ways that adding taxa helps:

Adding taxa breaks up branches in the phylogeny. This often improves the performance of maximum likelihood. If you are only interested in taxa 1…20, it still helps to add in taxa 21…50, even if you end up not paying much attention to where they go. They will often improve your ability to infer relationships among the taxa you are most interested in.
Adding taxa can help improve model evaluation and model parameter estimation.
Adding taxa reduces the number of assumptions going into a study. You are almost always inferring a phylogeny because you are uncertain at the outset about evolutionary relationships. If, for example, you include only a couple of species from Clade B because you assume it is monophyletic, this poor sampling may be insufficient to reveal that Clade C is in fact nested within Clade B. With insufficient sampling of Clade B, it may instead appear that the two groups are sister clades.
Adding taxa provides a more rigorous test of phylogenetic hypotheses. Suppose you are conducting a study to investigate the phylogenetic placement of a specific enigmatic species that is so morphologically distinct from other organisms that it is hard to know where it goes. You think it may fall in Clade A, Clade B, or Clade C, so you densely sample those groups. But in fact it falls within Clade D, and your taxon sampling is such that you can’t even evaluate this hypothesis because you have no taxa from Clade D in your study.
Adding more outgroup taxa often improves rooting. Skimping on outgroup sampling is one of the most common mistakes in project design. The best practice in general is to include the most closely related outgroup taxa possible, and some that are progressively more distantly related. The root and the branches that connect to it are the deepest in the tree, and can be the hardest to infer. It is good to densely sample taxa relevant to the root.

There are costs to adding taxa that must be balanced against the benefits above:

If you are collecting the data, it costs money to add taxa. That money may be best allocated in other ways.
Adding taxa can considerably increase the computational demand of analyses. This can increase the time and cost of analysis, or even make an analysis unfeasible on the computers available to the investigator.

There are some conditions under which you may want to exclude specific taxa, at least from some analyses:

If you suspect that the data are contaminated, incorrectly identified, or insufficiently documented.
Preliminary analyses suggest that the taxon has properties very different from other taxa in the study, such as a very different equilibrium frequency or an exceptionally high rate of evolution.
The taxon has a very high fraction of missing data. Sometimes missing data makes a taxon unstable, which can obscure well supported relationships between other taxa. Sometimes missing data can interact with other factors to lead to bias in the placement of the taxon.

5.3 Collecting sequence data

You will also need to consider which genome regions to include in your analyses. To build a phylogeny, you need homologous characters across the taxa under investigation. In the context of DNA analyses, this means that you need overlapping genome regions across the taxa that will be in your phylogeny. Decisions about which specific taxa to include interact with decisions about which genome regions you will consider, so there is often back and forth as you refine these decisions. If you are using publicly available data, your taxon and genome region sampling will be strongly constrained by what data are already available.

You will need to decide if and how to enrich for particular genome regions prior to phylogenetic analyses. Enrichment occurs in two stages. The first is sample enrichment. It takes place at the lab bench after you extract your DNA and before you sequence, and uses molecular methods to bias which DNA regions are sequenced. The second is data enrichment. It takes place after you sequence and before you perform phylogenetic analyses, and uses computational methods to obtain a useful subset of data best suited to your question.

Here I consider a few sample enrichment strategies. I start with methods that use the least sample enrichment since, as sequencing technologies improve and fall in price, these will become more favorable for a given project.

5.3.1 Whole genomes

It is now routine for phylogenetic studies to use shotgun sequencing to sequence entire genomes of sampled taxa without any sample enrichment at all. This is often cheaper and more productive than taking steps to only sequence regions of interest. Studies differ greatly, though, in how these whole genome data are used. In clades with small genomes with minimal sequence divergence, such as some viruses and bacteria, entire genomes can be sequenced and assembled.

For larger genomes, such as plants and animals, data are processed to isolate a subset of homologous regions across taxa. The processing depends on the questions at hand and genomic resources already available. Several strategies include:

Mapping reads to reference genomes and calling variable sites. If genome assemblies are available for the clade of interest, then reads from each new sample can be mapped to the genome. Sites that are variable at the same mapped sites can then be used for phylogenetic inference. This makes excellent use of the full dataset, often providing hundreds of thousand of phylogenetically informative sites. It is limited, though, to studies where reference genomes are available and are comparable across the taxa included in the analysis.
Assembling and analyzing organelle genomes. Sometimes mitochondrial and plastid genomes are sufficient for answering the questions that motivate a study. Because these genomes are small compared to nuclear genomes, they can often be readily assembled from raw sequence reads and then aligned and analyzed. This approach can be applied regardless of existing genome resources.
Using other methods to isolate homologous regions across studies. The number of tools that can process raw reads to isolate homologous genome regions across samples without the need for a reference genome is growing. These tools include sharkmer, written by Sam Church and me and available at https://github.com/caseywdunn/sharkmer, which can produce gene sequences given Polymerase Chain Reaction (PCR) primer sequences and raw DNA sequence data.

Multiple strategies above are often used in the same study, they are not exclusive. Apart from reduced expenses in the lab, one of the values of whole genome sequencing is that it is compatible with so many downstream analysis methods.

5.3.2 Transcriptomes

A transcriptome is the collection of RNA in a cell or collection of cells. Because RNA is transcribed from the genome, sequencing the RNA gives a snapshot of an enriched subset of the genome. Whereas the other enrichment approaches described are implemented by the investigator at the bench, transcriptome sequencing largely relies on a natural process in the living cell (transcription) to enrich for particular genome regions.

In a typical phylogenetic study based on transcriptome data, messenger RNA (mRNA), which encodes the protein coding genes, is isolated. mRNA is then copied to complementary DNA (cDNA) and sequenced. Because many of the same genes are highly expressed across species and tissues, there is considerable overlap between the genes in transcriptome datasets from different samples. Many investigators are interested in protein coding genes for other purposes as well, adding value to the transcriptome data.

There are a couple challenges to working with transcriptomes. For one, RNA is much less stable than DNA. It is therefore critical to take considerable care when collecting, storing, and processing tissue for transcriptome studies. Transcriptome sequencing only captures genes that are being expressed, leading to missing data for some genes in some samples. As sequencing costs have fallen, genome sequencing has become a better option than transcriptome sequencing for most phylogenetic projects.

5.3.3 Targeted enrichment

In targeted enrichment, an investigator designs short bait sequences that are similar to conserved genome regions. They then isolate DNA from their taxa of interest, fragment it, and use the baits to fish out the conserved regions along with their neighboring sequences. This allows the investigator to create pools of DNA that are enriched for specific pre-selected regions. This is a highly cost-effective way to isolate homologous sequences in across species, and works well when samples have fragmented DNA (in museum specimens, for example).

The baits are typically designed based on one or more genomes that are already available in the clade of interest. There are a few challenges to using targeted enrichment. One is that data are difficult to combine across different studies that used different baits. In addition, the design and synthesis of baits require a large up-front investment that makes this approach best suited to larger projects.

5.3.4 RAD-seq

Like targeted enrichment, RAD-seq enriches for sequence data from specific regions scattered across the genome. It differs, though, in that it uses intrinsic properties of the genome for enrichment, rather than user-designed baits. Genomic DNA is digested with one or two restriction enzymes that cut the genome at specific small sequences, and leave the DNA with particular overhangs at the cut end. These cut fragments are then size-selected. If cutting with two restriction enzymes, the fragments that have different overhangs (because they were cut once with each enzyme) are then further enriched. The fragments are then sequenced.

The result is that sequences are enriched for regions with similar distances between restriction enzyme sites across taxa. This is a powerful way to get a consistent subset of the genome across many different samples. It doesn’t require a big investment in baits at the outset, but there are a few important limitations. First, it only works well on closely related taxa, since differences in restriction site placement evolve relatively quickly and enrichment overlap falls quickly with evolutionary distance. RAD-seq therefore is most often used when looking at evolutionary relationships within species or among very closely related species. Second, there is very little control over which specific genome regions are sequenced. For example, if you want to build a phylogeny and look at the evolution of a particular gene, you can’t readily engineer the sample enrichment to target that specific gene.

5.3.5 PCR

The most extreme form of genome enrichment is PCR. This method enriches for a small number of specific genome regions selected by the user. Short DNA oligonucleotide sequences are designed to flank the region of interest and target it for amplification. By placing these oligonucleotides in conserved regions and engineering them to bind slightly different sequences, they can be used to successfully enrich the same genome region from quite distantly related species. PCR enabled the first wave of DNA sequence-based phylogenetic inference starting in the 1980s, and is still widely used today even as a variety of new enrichment methods have become available.

PCR is cheap and relatively easy. Because it has been used for decades to sequence a small number of genes across a wide diversity of organisms, it is a great way to collect data to analyze in combination with publicly available data. Ironically, though, it is now very expensive per nucleotide to collect data with PCR relative to the other methods above. That is because it is difficult to leverage the advantages of modern high-throughput sequencing, which is very cheap per nucleotide, with such a small number of short amplified sequences. As one of the most extreme forms of genome enrichment available, it provides access to a very small fraction of the genome. PCR is still a convenient way to get a tiny amount of sequence data from a large number of taxa.

5.4 Data processing upstream of phylogenetic analyses

Each enrichment strategy requires its own series of data processing steps to get the data ready for phylogenetic inference. These steps vary greatly across enrichment approaches, requiring distinct methods and software tools. The data processing that occurs upstream of the actual phylogenetic analysis has two primary objectives, estimating the sequence of the sampled genome regions and enrichment of homologous sequences across taxa.

5.4.1 Estimation of original sequences

Raw sequence data have a few properties that make them unusable for direct analysis:

They are redundant. The same genome regions are sequenced multiple times. PCR products are typically sequenced a couple of times (once in each direction), and high throughput sequencing often sequences each region hundreds or thousands of times.
They have many errors. These are introduced during sample preparation at the bench, by the sequencing instruments during data acquisition, and in the initial base calling that generates the sequence file. Errors in raw data vary across sequencing technologies, and are on the order of a few per hundreds or thousands of nucleotides.
They are fragmented. Often the raw sequences do not span the full genome region of interest, and must be tiled together to predict longer regions.
They are ambiguous. It often isn’t clear what region of the genome a sequence belongs to, due to genomic repeats, heterozygosity, and genome duplication. This ambiguity must be resolved or eliminated, depending on the application.

The above technical challenges are usually addressed simultaneously in an analysis process called assembly. The goal of assembly is to predict the sequences of the original genome regions based on the raw sequence data. The general idea is to line up overlapping raw sequence data, correct errors by comparing sequences, and collapse all the sequences down to hypotheses about the original actual sequence from which all the raw data were generated. This is a very complex process, and still a major challenge for many genomes and transcriptomes in particular. The best approach differs quite a bit depending on the enrichment strategy, sequencing technology, and the properties of the genome (such as how repetitive it is).

The process of assembling a genome (or genome regions) from scratch is referred to as de novo assembly. If high-quality closely related genomes are available, they can be used for guided assembly or mapping, where the raw data are tiled onto the reference rather than assembled from scratch.

5.4.2 Identification of homologous genome regions

It isn’t enough to collect and assemble sequence data from the taxa of interest. We also need to collect and identify homologous sequences that represent regions that are shared because they derive from the same region in common ancestors of the taxa.

Like assembly, the optimal approach for identifying homologous sequences depends greatly on the enrichment strategy used. This step is quite straight forward when targeted capture and PCR are used as enrichment strategies, because the data are from specific known regions that were often selected in part because their homology is easier to assess.

In some cases, such as with closely related viruses, genomes can be directly aligned with each other along their full lengths. In larger genomes and transcriptomes, protein coding sequences are often extracted from the assemblies and are the focus of phylogenetic analyses. There are various approaches to identifying homologous protein coding genes across taxa. One is to identify sequences for a preselected panel of genes that are selected because they tend to be present in most datasets, they have properties of molecular evolution that are well-suited to phylogenetic analysis, and their homology is thought to be easier to assess. Another is to compare all gene sequences across all species to identify homologous regions. This is more computational work, but has the advantage of optimizing gene selection to the data and question at hand.

5.5 Working with publicly available data

Most phylogenetic studies include at least some publicly available data from previously published studies. Some rely entirely on publicly available data, while others add new sequence data as well. The primary repositories for DNA sequence data in published studies, including phylogenetic studies, are NCBI, EMBL-EBI, DDBJ, and NGDC. Many of the same data are available across these platforms.

Most phylogenetic studies deposit their raw data in these public archives, but the raw data can require considerable processing (including assembly) for use in phylogenetic inference. It is often hard to track down assemblies and other intermediate products that are easier to work with since they are not consistently deposited in centralized data archives. Wrangling publicly available data and getting it ready for analysis can be one of the most time-consuming components of a phylogenetic inference project. This is one of the major challenges to conducting phylogenetic analyses of DNA sequence data at this time.

When I am interested in building a phylogeny for a particular group of organisms, one of my first steps is to take a look at the NCBI taxonomy browser. There you can enter a taxonomic group name, and select which data you would like to see (Figure 5.1). You can adjust the number of taxonomic levels to control whether you want to view just summaries for larger groups or get all the way down to the species level.

A view of the sequence data available for Cnidaria. Here I have selected to view available Nucleotide, Protein, SRA, Genome, and Assembly data.

Figure 5.1: A view of the sequence data available for Cnidaria. Here I have selected to view available Nucleotide, Protein, SRA, Genome, and Assembly data.

Based on this overview, you can assess which types of data (i.e., which data produced by which enrichment strategy) are best for advancing your question with publicly available data. All the PCR data and most of the raw sequence data will usually be available here, but many assemblies and processed data for whole genomes, transcriptomes, and other high-throughput datasets may not be. The best practice when working with these data is to take a deep dive into the recent literature and see if these intermediate products were deposited in another public archive, or contact authors for more details.

5.6 Homologous sites

Once you have collected homologous sequences for one or more genome regions across your taxa of interest, you next need to identify homologous nucleotide sites within these regions. This process is referred to as multiple sequence alignment (MSA). Rather than rely on explicit models of molecular evolution, multiple sequence alignment usually relies on similarities between sequences. The general idea is to slide the sequences along each other and insert gaps as needed so that similarity at each site is optimized while reducing the number of gaps introduced.

The output of multiple sequence alignment is a matrix. Each row is a taxon that will correspond to a tip in the phylogeny. Each column is a single nucleotide site that is hypothesized to be homologous across taxa. There are multiple file formats that are used to store aligned sequences, including fasta, nexus, and phylip.

5.7 Combining information across multiple genome regions

Sometimes when we build phylogenies from genomic data, we are interested in the specific evolutionary history of the genome regions themselves. This is the case, for example, when looking at gene family evolution. Often, though, we are using the evolutionary history of the genome regions as a proxy for the evolutionary history of the organisms, such as when we are asking questions with DNA data about how species are related to each other. There are a variety of conditions, though, under which different genome regions can have different evolutionary histories from each other, and some of these histories deviate from the evolutionary history of the taxa as a whole (Maddison 1997). Processes that can lead different regions to have different evolutionary histories include:

Horizontal transfer of genome regions across species. This can occur, for example, when a virus carries DNA from one species to another. Rates of horizontal gene transfer vary greatly across different groups of organisms.
Duplication and loss of genome regions. Genome replication is not always perfect, and sometimes results in an organism inheriting multiple copies of a whole genome or specific genome regions from a parent. In fact, many genes are members of gene families within a species that arose via a history of duplication and subsequent differentiation. As genome regions are duplicated and sometimes lost, their evolutionary history can become distinct from that of other genome regions.
All variation that we see across species originally arises within populations. Sometimes new variants are lost, and sometimes they become fixed across entire species. If multiple variants persist through multiple speciation events, though, the evolutionary history of some genome regions can differ from that of species. This process is called Incomplete Lineage Sorting (ILS). Rates of ILS are higher on shorter branches and when populations are larger.

If there are concerns about incongruent evolutionary histories in the group of taxa under investigation, there are a variety of methods available for addressing each of these sources of incongruence. In general, they involve independently estimating the phylogeny of each region, and then integrating information across these independent phylogenies.

In cases where such differences are expected to be rare or have little impact, it is common practice to concatenate alignments. Each homologous gene region is aligned independently and then the alignments are joined end to end into a giant superalignment. In this case, information is integrated across regions prior to phylogenetic inference.

References

Maddison, Wayne P. 1997. “Gene Trees in Species Trees.” Systematic Biology 46 (3): 523–36. https://doi.org/10.1093/sysbio/46.3.523.