1. Sequence composition tests for nucleotide/aminoacid sequences
The purpose of sequence composition test is to test for homogeneity of character composition (nucleotide or amino-acid) for each sequence
This is done by a Chi^2 test. In this test and average composition is calculated and whether the character composition of a sequence deviates significantly from the overall composition is done by testing the chi2 value using the chi2-distribution with k-1 degrees of freedom (df=3 for DNA or df=19 for amino acids).
Sequence composition tests are performed at the beginning of each IQ-TREE run.
It is important to see this test as an explorative tool. On the IQ-TREE page is perfectly described:
This test should be regarded as an explorative tool which might help to nail down problems in a dataset. One would typically not remove failing sequences by default. But if the tree shows unexpected topology the test might point in direction of the origin of the problem.
Could you think about a few reasons why sequence composition might be significantly different in some sequences in a dataset?
Let's try to compute the chi2 test for a small 18S dataset.
For our purpose we will use IQ-TREE. Open a new terminal and start the IQ-TREE analysis.
iqtree -s INFILE
This will tell IQ-TREE to compute a ML tree with the best model. As we don't want to compute the tree, we can stop the analysis immediately after the chi2 composition test has been computed. To stop the analysis use Ctrl+C.
Did any of the sequences fail the chi2 test?
2. Tree topology tests
Tree selection is a common practice in phylogenetics and is used to find an optimal tree from taxonomic molecular sequences. One of the widely used selection criteria is the maximum likelihood (ML) method, which calculates a likelihood value for each of the candidate trees and then selects the tree with the largest likelihood value. If we have imaginary sequences of infnite length for a infnite number of taxa, the optimal tree will represent the true history of evolution unless the assumed model of evolution, that is the substitution process, is extremely misspecifed. In practice, however, the sequence length is finite. When the sequence length is not long enough, sampling error causes tree selection to fluctuate, and th e optimal tree may not reflect the true tree. In other words, the optimal tree may have been designated optimal by chance.
Three main tree topology tests are used:
Kishino-Hasegawa(KH) test
Shimodaira–Hasegawa(SH) test
Approximately unbiased (AU) test
The KH test (Kishino and Hasegawa, 1989) was designed to test 2 trees and thus has no correction for multiple testing. This is solved in the SH test (Shimodaira and Hasegawa, 1999). However, the SH test becomes too conservative when testing many trees. The AU test (Shimodaira, 2002) fixes this issue and is thus recommended as replacement for both KH and SH tests.
All these tests can be computed using IQ-TREE.
We will use the ATP7A protein dataset, to test these three topologies. Let's check the differences and then split the topologies into individual files:
split -l 1 TOPOLOGY_FILE topo
For each topology, we will compute the best constraint tree.
iqtree -m GTR+F+R3 -s ATP7A_orthologues_nucl_trim.fas -g topoaa --prefix ATP7A.constr1
iqtree -m GTR+F+R3 -s ATP7A_orthologues_nucl_trim.fas -g topoab --prefix ATP7A.constr2
..
iqtree -m GTR+F+R3 -s ATP7A_orthologues_nucl_trim.fas -g topoae --prefix ATP7A.constr5
iqtree -m GTR+F+R3 -s ATP7A_orthologues_nucl_trim.fas --prefix ATP7A.unconstr
Finally, merge the resulting trees into one and compare them.
cat ATP7A*constr*treefile > all.trees
iqtree -m GTR+F+R3 -s ATP7A_orthologues_nucl_trim.fas --trees all.trees --test-weight --test-au -n 0 --test 10000 --prefix ATP7A_autest
This will tell IQ-TREE calculate the KH, SH and AU tests --test-au for the given trees --trees TREES_FILE and that we want 10000 RELL replicates --test 10000 and we also want the weighted tests --test-weight
The output of the analysis can be viewed in the *.iqtree file. Open it in Notepad++.
3. Fast evolving site removal
Sometimes the removal of fast-evolving sites helps to understand how they influence the final topology. BioTiger is a tree-independent method shown to perform comparably to other ML approaches.
Note that as the authors (Cummins and McInerney, 2011; DOI: 10.1093/sysbio/syr064) point out, “bootstrap support values or Bayesian clade probability values are probably meaningless when there is a directed attempt to remove sites that disagree with the rest of the data”. In other words, branch support will likely increase with the removal of incongruent characters. Therefore, “this method should be used as one part of data exploration”.
Tiger package can be downloaded from here. If you followed the instructions for WSL or Conda setup, you should have it already installed.
For this example we will use the multigene dataset used for phylogenomics.
1. First, let's look at the parameters:
tiger
2. To create an index file, use an alignment. -u tells the program which character is considered unknown, -s is useful to split a multi-gene dataset into smaller files:
tiger index -i INPUT.ali -u X -o OUTFILE [-s 10]
3. During indexing the software creates a *ref.ti file. We can use this file to calculate the rates. For split files, a for loop can be implemented (ask for it).
tiger rate -i OUTFILE.ref.ti
4. Finally, process the output. -fa is the input alignment again, as the reference file only contains the rates. -b defines the number of bins and -exc is used to exclude individual bins from the final alignment. Use -f 4 for a fasta output.
tiger output -i OUTFILE.ref.gr -fa INPUT.ali -o SLOWSITESFASTA -b 10 -exc 9,10 --mask
Alternatively, you could also gradually remove highest-entropy sites from an alignment using bmge (-h parameter).