For this exercise we will focus mainly on IQ-TREE and MrBayes.
For visualization of trees, we are going to use FigTree.
1. Sequence composition tests for nucleotide/aminoacid sequences
The purpose of sequence composition test is to test for homogeneity of character composition (nucleotide or amino-acid) for each sequence
This is done by a Chi^2 test. In this test and average composition is calculated and whether the character composition of a sequence deviates significantly from the overall composition is done by testing the chi2 value using the chi2-distribution with k-1 degrees of freedom (df=3 for DNA or df=19 for amino acids).
Sequence composition tests are performed at the beginning of each IQ-TREE run.
It is important to see this test as an explorative tool. On the IQ-TREE page is perfectly described:
This test should be regarded as an explorative tool which might help to nail down problems in a dataset. One would typically not remove failing sequences by default. But if the tree shows unexpected topology the test might point in direction of the origin of the problem.
Could you think about a few reasons why sequence composition might be significantly different in some sequences in a dataset?
Let's try to compute the chi2 test for a small 18S dataset.
For our purpose we will use IQ-TREE. Open a new terminal and start the IQ-TREE analysis.
iqtree -s INFILE
This will tell IQ-TREE to compute a ML tree with the best model. As we don't want to compute the tree, we can stop the analysis immediately after the chi2 composition test has been computed. To stop the analysis use
Did any of the seuqnces failed chi2 test?
2. Tree topology tests
Tree selection is a common practice in phylogenetics and is used to find an optimal tree from taxonomic molecular sequences. One of the widely used selection criteria is the maximum likelihood (ML) method, which calculates a likelihood value for each of the candidate trees and then selects the tree with the largest likelihood value. If we have imaginary sequences of infnite length for a infnite number of taxa, the optimal tree will represent the true history of evolution unless the assumed model of evolution, that is the substitution process, is extremely misspecifed. In practice, however, the sequence length is finite. When the sequence length is not long enough, sampling error causes tree selection to fluctuate, and th e optimal tree may not reflect the true tree. In other words, the optimal tree may have been designated optimal by chance.
Three main tree topology tests are used:
Approximately unbiased (AU) test
The KH test (Kishino and Hasegawa, 1989) was designed to test 2 trees and thus has no correction for multiple testing. This is solved in the SH test (Shimodaira and Hasegawa, 1999). However, the SH test becomes too conservative when testing many trees. The AU test (Shimodaira, 2002) fixes this issue and is thus recommended as replacement for both KH and SH tests.
All these tests can be computed using IQ-TREE.
We will use the ATP7A protein dataset, to test these three topologies. We can also make constraint trees ourselves - so let's remind ourselves about the newick format. Besides manipulating a fully resolved tree, you can provide a tree that has all branches collapsed except the ones to test. Importantly, the constraint trees have to start with a
( bracked and end with
iqtree -s ALIGNMENT_FILE --trees TREES_FILE --test-weight --test-au --test 10000 -m JTT+R3 -n 0
This will tell IQ-TREE calculate the KH, SH and AU tests
--test-au for the given trees
--trees TREES_FILE and that we want 10000 RELL replicates
--test 10000 and we also want the weighted tests
-n 0 sets the number of iterations to stop for a fast analysis, normally you would run a full tree search.
The output of the analysis can be viewed in the
*.iqtree file. Open it in Notepad++.
3. Saturation tests
This is to assess the saturation level in our datasets, i.e. whether it is still possible to distinguish singular character substitutions from those that hide multiple evolutionary events or if too many characters have exchanged too many times, masking the true history. To visualize this, we plot for each taxon pair the observed number of substitutions vs. the hypothetical value inferred from the tree (branch lengths). Under no saturation, this correlation is linear, whereas under some saturation, there is a skew of inferred values away from the diagonal (this is usually the case for real data). This is because we do not always see all the substitutions events in the data. In extreme cases, this is a warning sign.
Branch lengths are a good proxy for the true number of substitutions, and can be obtained from a robust tree inferred from our data (presumably ML with a more advanced model).
A simpler version of saturation analysis is the plot of nucleotide transitions vs. transversions. Under no saturation, both increase linearly. Under saturation, transitions will be affected more, leading to a skew – the number of observed transitions will be lower than transversions.
Perform a saturation analysis using AsaturA on the arginine deiminase dataset.
1. AsaturA doesn't come bundled with the LG matrix. You will need to download it from here.
2. With AsaturA open, choose the alignment file, a substitution matrix (LG) and a distance correction method (Poisson).
3. Go to the Choose a Cutoff Value function and overlay the rare and frequent substitution plots (red and blue data points) by setting the slider so that they largely overlap at low values, then hit OK. We can see that there is considerable saturation in the alignment for evolutionary distances >0.8. You can also check which sites are most saturated using the Show Substitutions Sites function.
4. AsaturA offers a possibility to compute a tree based on this cutoff, if one is interested. To compute a tree select Bootstrap, then hit Make It So, enter an outfile name and then the bootstrap replicates (500 could do).
Before naming the output file, make sure you make a new directory and save it in a new directory, as during the bootstrap, AsaturA will generate a lot files.
5. You might get an error log error between samples which means there are sequences in the alignment that do not overlap at all after application of the cutoff value. Most of the time the software will tell you which sequences are problematic. These can be excluded by clicking on Choose Species and removing the problematic species in the right window.
6. After the tree computation finished click on Draw Tree and from the new window select Compute tree data. You will be asked to select the outgroup for rooting the tree. Go in the end of the list and select 1_Ca_Korarchaeum_cryptof and hit OK. After the computation finished, click on Show Tree to view and save the tree.
4. Fast evolving site removal
Sometimes the removal of fast-evolving sites helps to understand how they influence the final topology. BioTiger is a tree-independent method shown to perform comparably to other ML approaches.
Note that as the authors (Cummins and McInerney, 2011; DOI: 10.1093/sysbio/syr064) point out, “bootstrap support values or Bayesian clade probability values are probably meaningless when there is a directed attempt to remove sites that disagree with the rest of the data”. In other words, branch support will likely increase with the removal of incongruent characters. Therefore, “this method should be used as one part of data exploration”.
Tiger package can be downloaded from here. If you followed the instructions for WSL or Conda setup, you should have it already installed.
For this example we will use the multigene dataset used for phylogenomics.
1. First, let's look at the parameters:
2. To create an index file, use an alignment.
-u tells the program which character is considered unknown,
-s is useful to split a multi-gene dataset into smaller files:
tiger index -i INPUT.ali -u X -o OUTFILE [-s 10]
3. During indexing the software creates a *ref.ti file. We can use this file to calculate the rates. For split files, a for loop can be implemented (ask for it).
tiger rate -i OUTFILE.ref.ti
4. Finally, process the output.
-fa is the input alignment again, as the reference file only contains the rates.
-b defines the number of bins and
-exc is used to exclude individual bins from the final alignment. Use
-f 4 for a fasta output.
tiger output -i OUTFILE.ref.gr -fa INPUT.ali -o SLOWSITESFASTA -b 10 -exc 9,10 --mask
You can use my tiger_trimX.py script to remove the masked columns, but you will need to install the biopython package containing the AlignIO module
conda install -c conda-forge biopython.