Light

Dark

Part 4: Multigene phylogenetics

For this exercise we will focus mainly on IQ-TREE and MrBayes.

For visualization of trees, we are going to use FigTree.

How to make your trees more robust?

Single-gene trees often lack sufficient phylogenetic signal to allow unambiguous conclusions. We have covered how model and method selection influence tree topology. Now, we will look at special cases when strengthening of the phylogenetic signal is possible by adding more genes, and at statistical testing of alternate branching.

Why multigene phylogenies?

At present, phylogenomic or multi-gene analyses of hundreds of genes is now becoming commonplace. This is because these datasets contain more phylogenetic signal, and it is assumed that phylogenetic bias is neutralized. Still, multi-gene analyses must be performed with caution. Merging genes expected to have different evolutionary histories (horizontal gene transfers vs. vertical inheritance; organellar vs. nuclear genes; genes with very different evolutionary rates) could result in phylogenetic artifacts.

In general there are three approaches to perform multi-gene analyses:

Combining the trimmed alignments into a single file and work as with one long gene, overlooking the fact that different genes evolved differently. This is not suitable for organellar+nuclear datasets, nor for joint amino acid+nucleotide alignments.
Join the alignments but inform the algorithm that each gene is a partition and should be treated individually.
Infer trees from each alignment separately and then combine the single-gene trees into a so-called supertree.

We will focus on the second method during these practicals. Because we tell the software about each gene which is present in the combined alignment, we can combine different types of data (e.g. DNA and protein) and the software will be able to apply different models for different types of data. Note that partitioned phylogenies were shown to suffer from enhanced long-branch attraction (Wang et al., 2019; DOI: 10.1093/sysbio/syz021) due to the “small sample bias compounded over many single protein alignments”. Hence, merging short protein alignments into larger partitions is advisable (we will show you how).

Partitioning could be also useful in a particular case of nucleotide alignments. Which one? (Hint: We are thinking of protein-coding sequences)

Creating a multigene dataset

Most of the softwares used for tree reconstitution use a single file as input. The notable exception is IQ-TREE, which uses two alignment files, but only if you use combined DNA/RNA and protein sequences.

For this reason, merging each individual datasets into a supermatrix is essential. This can be done manually, or automatically using certain scripts written in python. The good news is that tools with graphical user interface have been developed for this purpose.

We will use SequenceMatrix to create a multigene dataset. For this purpose we will use a small dataset with eight proteins.

1. Download the dataset from here.

2. After downloading the file, extract the content of the archive. Let's open two of the files in Aliview.

When you merge datasets, in order to merge them correctly, it is essential that the sequence names for one species is exactly the same in all datasets which you plan to merge. Otherwise they will not be merged, but they will be a different taxon. Also, for SequenceMatrix it is important that the names are no longer than 11 characters similar to a phylip file.

3. Start SequenceMatrix

SequenceMatrix accepts as input FASTA or NEXUS file format.

3. Select all the fasta files from the multigene dataset and drag and drop them in the SequenceMatrix window.

4. You will be asked if you want to encode external gaps as missing data. Answer Yes to all as we want to encode these external gaps.

Sometimes SequenceMatrix can bug out in machines with low amount of memory. In case your software gets stuck in Loading into memory you will need to force quit the software and you can use the multigene datasets from here.

5. Check the resulting table in SequenceMatrix. Are there any taxons which lack some protein sequences? Do you notice something unusual?

SequenceMatrix was specifically designed for nucleotide sequences. It has no problem merging protein sequences, however, this is the reason why in a sequence statistics you see (xx 'N') as in nucleotide alphabet N means an unknown base, which is not the case in protein datasets.

6. Let's save the merged dataset. As we will use it for MrBayes we will save it as a "naked" NEXUS file. Click on Export and select Export sequences as NEXUS("naked", e.g. for GARLI). Let's save it under the name merged.nex.

7. Let's also save it also as interleaved. Click on Export and select Export sequences as NEXUS(interleaved, 1000 bp). Let's save it under the name merged_interleaved.nex.

Naked NEXUS doesn't contain quotation marks in the name, and also it doesn't contain any partition informations. This is important, as we want to partition our dataset, so either we do this manually, or we take this from the merged_interleaved.nex file.

8. Let's open both merged.nex and merged_interleaved.nex files in a text editor. Do you notice the differences between the files?

9. From here on we will work on the merged.nex file. SequenceMatrix was made specifically for nucleotide sequence, so we need to change the DATATYPE from DNA to PROTEIN.

FORMAT DATATYPE=PROTEIN GAP=- MISSING=? ;

With this, we fixed the main issue with this file, now we will continue ahead and add the partition information in the nexus file.

Data partitioning

For multigene phylogenies, besides the data itself, we need to prepare a list of partitions which will tell the software where in the aligment a particular gene is.

The partitions for MrBayes are defined in the MrBayes data block, after the matrix itself. Let's focus on merged.nex file. At the end of the file we see a END; line. Hit enter and let's start to define the MrBayes data block:

BEGIN mrbayes;

Let's check the merged_interleaved.nex file. At the end of the file there is a data block which starts with BEGIN SETS; In this datablock we find the partition information at the end:

CHARSET alphatub_prot_fasta = 1-364;
CHARSET betatub_prot_fasta = 365-739;
CHARSET ef1alpha_prot_fasta = 740-1072;
CHARSET ef2_prot_fasta = 1073-1514;
CHARSET gamma_tubulin_prot_fasta = 1515-1916;
CHARSET hsp70_prot_fasta = 1917-2500;
CHARSET hsp90_prot_fasta = 2501-3113;
CHARSET polyubq_prot_fasta = 3114-3189;

This block defines the range for each gene.

We add this to in the MrBayes data block in the merged.nex file and now we will also define the partitions in the data block.

partition genes = 8: alphatub_prot_fasta, betatub_prot_fasta, ef1alpha_prot_fasta, ef2_prot_fasta, gamma_tubulin_prot_fasta, hsp70_prot_fasta, hsp90_prot_fasta, polyubq_prot_fasta;
set partition=genes;

This part of the block specifies how many partitions are, and tells MrBayes to setup the partitions according to the genes.

Now we will need to add in the MrBayes data block commands which specifies what models to use for each partition.

lset applyto=(1) rates=invgamma covarion=yes;
lset applyto=(2) rates=invgamma covarion=yes;
lset applyto=(3) rates=invgamma covarion=yes;
lset applyto=(4) rates=invgamma covarion=yes;
lset applyto=(5) rates=invgamma covarion=yes;
lset applyto=(6) rates=invgamma covarion=yes;
lset applyto=(7) rates=invgamma covarion=yes;
lset applyto=(8) rates=invgamma covarion=yes;
prset applyto=(1) aamodelpr=fixed(wag);
prset applyto=(2) aamodelpr=fixed(wag);
prset applyto=(3) aamodelpr=fixed(wag);
prset applyto=(4) aamodelpr=fixed(wag);
prset applyto=(5) aamodelpr=fixed(wag);
prset applyto=(6) aamodelpr=fixed(wag);
prset applyto=(7) aamodelpr=fixed(wag);
prset applyto=(8) aamodelpr=fixed(wag);

Because we have a multigene dataset we want to use the unlink function to tell MrBayes that certain parameters are independent for each partition. Very different evolutionary histories of these genes require setting different gamma values for each partition (this was tested on this particular dataset).

unlink statefreq=(all) shape=(all) pinvar=(all) switchrates=(all) brlens=(all) Aamodel=(1,2,3,4,5,6,7,8);
prset ratepr=variable;

When MrBayes attempts to perform an analysis, the model is set for individual partitions. If the same parameter applies to different partitions and if that parameter has the same prior, the program will link the parameters: that is, it will use a single value for the parameter across all compatible partitions. The program's default, then, is to strive for parsimony. However, there are lots of cases where you may want unlink a parameter across partitions. For example, you may want a different transition/transversion rate ratio to apply to different partitions. unlink allows you to unlink the parameters, or to make them different across partitions.

We almost finished the MrBayes data block, we only need to add the chain computation command as well as a line which summarizes the results and computes the consensus tree.

mcmc ngen=5000 samplefreq=100 printfreq=100 diagnfreq=1000 nchains=4 filename=mbout savebrlens=yes;
sumt filename=mbout relburnin=yes burninfrac=0.25 contype=allcompat;

At this stage we finished the MrBayes data block. To close the data block let's append an end:

END;

We save the changes to this file and then we open a new terminal. We navigate to the location of this file and we start MrBayes by typing mb into the terminal.

Then we execute the file by typing:

execute merged.nex

This computation should take a few minutes, as we asked for very few number of generations (5000). This is not enough for the chains to converge. Normally we should ask for a few million generations, however this would take days to compute.

IQ-TREE data partitioning

IQ-TREE supports partition files similar to MrBayes, however, when we mix the data different types, they need to be separated. Thus, when one would like to use mixed protein and nucleotide data for phylogenetics the protein and nucleotide partitions must be provided as two different files.

We will use the same merged.nex file, however, for IQ-TREE we will need to make a new partition file. IQ-TREE uses a similar format as MrBayes. You can download the partition file from here.

You don't need to remove the MrBayes data block from merged.nex file. IQ-TREE will ignore it.

After downloading, open the partition file in a text editor. As you can see, the partition file is structured similarly to that for MrBayes:

#nexus
begin sets;
CHARSET alphatub_prot_fasta = merged.nex: 1-364;
CHARSET betatub_prot_fasta = merged.nex: 365-739;
CHARSET ef1alpha_prot_fasta = merged.nex: 740-1072;
CHARSET ef2_prot_fasta = merged.nex: 1073-1514;
CHARSET gamma_tubulin_prot_fasta = merged.nex: 1515-1916;
CHARSET hsp70_prot_fasta = merged.nex: 1917-2500;
CHARSET hsp90_prot_fasta = merged.nex: 2501-3113;
CHARSET polyubq_prot_fasta = merged.nex: 3114-3189;
end;

Do you see any difference in the partition file for IQ-TREE and the one for MrBayes?

In order to run this analysis with IQ-TREE, all the files (merged.nex, partition.nex) must be in the same directory.

After you put all the files in that folder, navigate there using the terminal and start the analysis by typing:

iqtree -p partition.nex -B 1000 [-T AUTO]

The parameter -p defines the input partition file. -T AUTO tells the software to automatically determine the best number of threads to use and -B 1000 to perform 1000 ultrafast bootstraps.

We don't need to specify the sequence files (merged.nex), because IQ-TREE is informed about these files in the partition file.

Finally, we will analyse the same dataset using site-specific frequency model (PMSF) (Wang et al. 2018). The PMSF are the amino-acid profiles for each alignment site computed from an input mixture model and a guide tree. As the guide tree we may use the tree produced from the previous run. Let's rename it as guidetree.tre. As the input alignment file we will use the file merged.nex, which is present in our directory. We will start the analysis as follows:

iqtree -s merged.nex -m LG+C20+F+G -ft guidetree.tre [-T AUTO]

The parameter -s defines the input alignment file. -T AUTO tells the software to automatically determine the best number of threads to use and -m sets the type of mixture model and -ft defines the guidetree. Note that without -ft option, IQ-TREE will conduct the analysis under the specified mixture model.