Single gene phylogeny

Posted on 2019-03-02 Edited on 2025-11-25 In PhyloSuite demo

PhyloSuite: from data acquisition to phylogenetic tree annotation (single gene tree building)

To help new users get started, this tutorial briefly introduces PhyloSuite, using 51 ciliate 18S sequences as an example (https://onlinelibrary.wiley.com/doi/full/10.1111/j.1550-7408.2009.00413.x).

1. Sequence download and preparation

The accession numbers we will use are below (fetched from a table in the abovementioned paper):

AF164122 X65150 AY547546 U17356 EU286810
AF100301 U47620 M97909 L31520 EU039903
Z29442 U97110 EU039887 Z29438 AY007454
AF060454 AY007445 U82204 X71134 X65149
U57767 L26447 AF300281 AF530529 U17355
AY331802 AY242119 AY331790 EU286812 DQ232761
X71140 EU286811 AY331794 EU264561 EU286809
X65151 U24248 AM292311 EU039899 DQ057347
AF300286 EF123708 EU039896 L26446 U57765
Z22931 AY007450 EU264560 AY378112 DQ834370
U97112

First, download these IDs to the PhyloSuite workspace:

Select any of the work folders (here I chose ‘files’);
Click Open file(s)/Input ID(s) to open the input window.
Copy the above IDs into the text box (spaces, line breaks, tabs, etc. are supported as separators);
Enter your email (tell NCBI who is downloading the sequences);
Click Start to download.
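For readers who want to see what such a download amounts to outside the GUI, here is a minimal sketch. It is not PhyloSuite's own code: it only splits a pasted ID list on common separators and builds a request URL for NCBI's public E-utilities efetch endpoint. Treating commas and semicolons as separators, the function names, and the example email address are all assumptions made for illustration. The sketch builds the URL but does not send the request.

```python
import re
from urllib.parse import urlencode

# Public NCBI E-utilities endpoint for fetching records.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def parse_ids(text):
    """Split pasted text on whitespace, commas, or semicolons (assumed separators).

    Duplicates are dropped while the original order is kept.
    """
    tokens = [token for token in re.split(r"[\s,;]+", text) if token]
    return list(dict.fromkeys(tokens))


def build_efetch_url(ids, email):
    """Build a GenBank flat-file request for the given nucleotide accessions."""
    params = {
        "db": "nuccore",       # nucleotide database
        "id": ",".join(ids),   # comma-separated accession list
        "rettype": "gb",       # GenBank format
        "retmode": "text",
        "email": email,        # tells NCBI who is downloading
    }
    return f"{EFETCH_URL}?{urlencode(params)}"


# A few accessions from the list above, pasted with mixed separators.
pasted = "AF164122 X65150\tAY547546\nU17356, EU286810"
ids = parse_ids(pasted)
print(len(ids), "IDs parsed")
print(build_efetch_url(ids, "you@example.org"))  # hypothetical address
```

For large ID lists, NCBI asks clients to send requests in batches and at a moderate rate, which this sketch does not handle.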

After the sequences are downloaded to the workspace, we find that some taxonomic groups are designated as N/A, indicating that the taxonomic group was not recognized. Select all sequences (Ctrl+A), then right-click and select Get taxonomy (NCBI). A prompt will ask you to confirm that you wish to replace the current taxonomy with the taxonomic information from NCBI’s taxonomy database (this is important for the subsequent phylogenetic tree annotation).

After the above steps are completed, there may still be some IDs that are not recognized, indicating that their taxonomic group is not in the latest version of NCBI’s taxonomic classification system. You may also wish to change the current names manually, in which case you can double-click the corresponding cell and enter the correct taxonomic group name.

2. Sequence extraction

Select all sequences (Ctrl+A) and open the context menu (right-click).

Choose Extract
Since we are extracting 18S sequences, we select 18S here;
Choose the name type you wish to use for each sequence: organism, ID, isolate, etc.;
Select the taxonomic categories relevant for your study; in this case it is ‘class’.
Here you can select colors to designate taxa in the phylogenetic tree in iTOL. Double-click to modify a color. If you set fewer colors than there are taxonomic groups in the dataset, random colors will be assigned to the remaining taxa. For example, there are 11 classes and I only set 5 colors, so the remaining 6 classes will use random colors;
Click Start to run the Extraction function.
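The color rule described in this list (user-chosen colors first, random colors for the rest) can be sketched as follows. This is only an illustration of the rule, not PhyloSuite's implementation; the function name, the placeholder class names, and the hex color format are assumptions.

```python
import random


def assign_colours(taxa, chosen, seed=None):
    """Return a colour for every taxon.

    Taxa listed in `chosen` keep their user-defined colour; every other
    taxon receives a random hex colour.
    """
    rng = random.Random(seed)
    colours = {}
    for taxon in taxa:
        if taxon in chosen:
            colours[taxon] = chosen[taxon]
        else:
            colours[taxon] = f"#{rng.randrange(0x1000000):06x}"
    return colours


# Hypothetical placeholder names standing in for the 11 classes.
classes = [f"class_{i:02d}" for i in range(1, 12)]
# Five user-defined colours, as in the example above.
user_colours = {
    "class_01": "#e41a1c",
    "class_02": "#377eb8",
    "class_03": "#4daf4a",
    "class_04": "#984ea3",
    "class_05": "#ff7f00",
}

colours = assign_colours(classes, user_colours, seed=1)
random_count = sum(1 for taxon in classes if taxon not in user_colours)
print(random_count, "classes received a random colour")
```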

After the extraction, open the extract_results/rRNA folder (the feature identity of 18S is rRNA, so the folder name is rRNA). We can see that different entries use different names for 18S, so we need to unify the names first.

Open the extract_results/StatFiles folder, find the name_for_unification.csv file, change the name to 18S in the New Name column, and save (you must save as csv format). As I am not sure that 16S-like entries are homologous to 18S, I will leave them unchanged.
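If the file contains many name variants, the edit can also be scripted. The sketch below is hedged: only the New Name column is mentioned in this tutorial, so the Old Name column, the example gene names, and the rule for skipping 16S-like entries are assumptions made for illustration. Check the header of your own name_for_unification.csv before adapting it.

```python
import csv
import io


def unify_names(csv_text, target="18S", skip=lambda old: "16S" in old):
    """Set the New Name column to `target` for every row not matched by `skip`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    for row in rows:
        # "Old Name" is an assumed column name, not confirmed by the tutorial.
        if not skip(row["Old Name"]):
            row["New Name"] = target
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Hypothetical file content; real files may have different columns and names.
example = (
    "Old Name,New Name\n"
    "18S ribosomal RNA,18S ribosomal RNA\n"
    "small subunit ribosomal RNA,small subunit ribosomal RNA\n"
    "16S-like ribosomal RNA,16S-like ribosomal RNA\n"
)
print(unify_names(example))
```

The 16S-like row is left untouched, mirroring the cautious choice made above.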

Then go back to the extraction window:

Click the Settings button next to the sequence type to enter the settings interface;
Drag the modified name_for_unification.csv file into the area indicated in the figure, and the modified names will be automatically recognized.

Close the settings window; your new settings are saved automatically. Now re-run the extraction to get the 18S sequences under the unified name. After reopening the 18S.fas file in the rRNA folder, you will find only 42 sequences in it (remember that we have a total of 51 IDs). Three more sequences, annotated as 16S-like, are in the same folder. Because I wasn’t sure that these are homologues, I left them in a separate file so that I can check their homology manually. If you are certain, you can rename them to 18S as described above. The remaining 6 sequences were not extracted; the details can be found in two files in the ‘StatFiles’ folder under files: StatFiles/feature_unrecognized.csv and StatFiles/ID_with_no_features.csv. The best way to see what went wrong is to open these sequences (in the main working window, find the ID, right-click, and choose open) and compare them with the successfully extracted records. If you figure out what is wrong with them, you can edit the records manually in a text editor, save them, and re-extract all sequences.

However, if you are confident that all the sequences are 18S, here is an easy way:

Check ‘Default’ instead of ‘Custom’ in the Extracter function and set the output file name;
Start to extract.

Double-clicking the “18S.fas” file will open the sequence viewer, where you will see that all 51 sequences were successfully extracted.
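The sequence counts reported in this section are easy to cross-check outside the sequence viewer. The sketch below counts FASTA headers and redoes the bookkeeping for the custom extraction (42 sequences in 18S.fas, 3 annotated as 16S-like, 51 IDs in total). The function names and the toy FASTA text are assumptions for illustration.

```python
def fasta_headers(text):
    """Return the header of every record in FASTA-formatted text."""
    return [line[1:].strip() for line in text.splitlines() if line.startswith(">")]


def extraction_summary(requested, extracted_counts):
    """Compare the number of requested IDs with the number of extracted records."""
    recovered = sum(extracted_counts.values())
    return {
        "requested": requested,
        "recovered": recovered,
        "missing": requested - recovered,
    }


# Toy FASTA text with hypothetical names, only to show the header count.
toy = ">seq_a\nACGT\nACGT\n>seq_b\nACGA\n"
print(len(fasta_headers(toy)), "records in the toy file")

# Counts reported above for the custom extraction.
summary = extraction_summary(51, {"18S": 42, "16S-like": 3})
print(summary["missing"], "sequences were not extracted")  # 51 - 42 - 3 = 6
```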

3. Multiple sequence alignment

Now that we have extracted all sequences, we can start the phylogenetic analysis. For this, we have to align them first using MAFFT. There are many ways to do this, only two of which will be shown here.

First, select the extract_results folder in the main PhyloSuite window (extracted in the previous step), and then open MAFFT through Alignment-->MAFFT in the menu bar. The extracted sequences will be automatically imported into MAFFT;
Alternatively, select MAFFT from the Alignment menu without selecting a folder first; a pop-up window will then offer you the available input files from the entire workspace. Scroll through them to find the results that you need, or click ‘No, thanks’ and select the input file manually.

You will find that other plug-ins in PhyloSuite also use this interface to import files.

Next, run the sequence alignment:

First you may choose to delete the 16S sequence files using the delete button (the button next to it can be used to view the sequence).
Align the sequences using Normal mode;
Other parameters can be set according to your own requirements; and then click Start.

After the alignment is completed, you can view the aligned sequences and modify them.

Select the mafft_results folder;
Double-click the alignment result to open it in the sequence viewer.
If necessary, manual correction of the sequence can be performed in the same way as in MEGA.
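After manual editing, it is worth confirming that the file is still a valid alignment, meaning that every sequence has the same length including gaps. The following is a minimal sketch of such a check in plain Python; the function names and the toy sequences are assumptions, and the parser handles only simple FASTA text.

```python
def read_fasta(text):
    """Parse FASTA text into a {name: sequence} dictionary."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


def alignment_length(records):
    """Return the common sequence length, or raise if the lengths differ."""
    lengths = {len(seq) for seq in records.values()}
    if len(lengths) != 1:
        raise ValueError(f"sequences have different lengths: {sorted(lengths)}")
    return lengths.pop()


# Toy alignment with hypothetical names; "-" marks a gap.
aligned = ">seq_a\nACGT-ACG\n>seq_b\nAC-TTACG\n"
print("alignment length:", alignment_length(read_fasta(aligned)))
```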

4. Nucleotide substitution model selection

Next is the selection of the evolutionary model. Since ModelFinder does not recognize the results of MAFFT automatically (the Concatenation function ordinarily serves as a bridge between these two functions, but we won’t use it for a single-gene dataset), we need to import the alignment manually. The operation is simple:

Double-click mafft_results to open the folder;
Open the ModelFinder software window via Phylogeny-->ModelFinder in the menu bar and drag the aligned 18S sequence into the file input box of ModelFinder.
In Model for, select Mrbayes so that ModelFinder chooses the optimal model from those supported by MrBayes;
Other parameters can be selected according to your needs. After settings, you can click Start to start the program.
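Once ModelFinder has finished, the chosen model is recorded in the *.iqtree report in the results folder. If you want to pick it up in a script, a minimal sketch could look like this. It assumes the report contains a line of the form "Best-fit model according to BIC: ...", which is how IQ-TREE reports usually present the result; the sample text and the model name in it are illustrative and are not the result for this dataset.

```python
import re

# Assumed line format: "Best-fit model according to <criterion>: <model>"
BEST_FIT = re.compile(r"^Best-fit model according to (\w+):\s*(\S+)", re.MULTILINE)


def best_fit_model(report_text):
    """Return (criterion, model) from an IQ-TREE report, or None if not found."""
    match = BEST_FIT.search(report_text)
    if match is None:
        return None
    return match.group(1), match.group(2)


# Illustrative excerpt, not the real output for the 18S dataset.
sample_report = """\
ModelFinder
-----------

Best-fit model according to BIC: GTR+F+I+G4
"""
print(best_fit_model(sample_report))
```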

5. Phylogenetic tree reconstruction based on Bayesian Inference (BI)

Double-click *.iqtree in the ModelFinder_results folder to view the selection result of the optimal model;
Select the ModelFinder_results folder in the main PhyloSuite interface;
Open the Bayesian software Phylogeny-->MrBayes from the menu bar;
You can see that the alignment and model selection results have been automatically imported into MrBayes. You may choose to select one or more outgroups through the Outgroup drop-down menu;
Select the number of generations you wish to run;
Set other parameters if you wish to do so; and then click Start.
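The settings chosen in this window correspond to commands in a MrBayes block. As a rough illustration of what such a block looks like, the sketch below assembles one from a few parameters. It is not the block that PhyloSuite writes: the default values (a GTR-type model with invariant sites and gamma rates, one million generations, 25% burn-in) and the outgroup name are assumptions, and only a single outgroup taxon is handled.

```python
def mrbayes_block(outgroup=None, nst=6, rates="invgamma",
                  ngen=1_000_000, samplefreq=1000, nchains=4, burninfrac=0.25):
    """Assemble a MrBayes command block as text.

    All default values are illustrative; use the model reported by
    ModelFinder and a run length suited to your own data.
    """
    lines = ["begin mrbayes;"]
    if outgroup:
        lines.append(f"  outgroup {outgroup};")          # single outgroup taxon
    lines.append(f"  lset nst={nst} rates={rates};")      # substitution model
    lines.append(
        f"  mcmcp ngen={ngen} samplefreq={samplefreq} nchains={nchains};"
    )
    lines.append("  mcmc;")                               # run the chains
    lines.append(f"  sumt relburnin=yes burninfrac={burninfrac};")  # consensus tree
    lines.append("end;")
    return "\n".join(lines) + "\n"


# "taxon_1" is a hypothetical taxon name.
print(mrbayes_block(outgroup="taxon_1", ngen=200_000))
```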

6. Phylogenetic tree reconstruction based on the Maximum Likelihood method (ML)

IQ-TREE can both select the best-fit model and conduct the phylogenetic analysis, so you can complete both in one step.

Drag the aligned sequences (*.fas) directly into the IQ-TREE file input box.
By selecting the Auto model, you are allowing IQ-TREE to select the optimal evolutionary model first, and then use it to conduct the phylogenetic analysis;
You can tick +R to include FreeRate models among the candidates, which is recommended by the IQ-TREE author;
Since the 18S alignment is short, it is best to use the standard bootstrap method, following the author’s recommendation;
Other parameters can be selected according to your needs; then run it by clicking Start.
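For comparison, a roughly equivalent IQ-TREE command line can be assembled as below. This is a sketch under assumptions, not the command that PhyloSuite issues: the executable name, the alignment file name, and the number of bootstrap replicates are illustrative. The -m MFP option asks IQ-TREE to run model selection and then infer the tree, and -b requests standard (non-parametric) bootstrap replicates.

```python
import shlex


def iqtree_command(alignment, bootstrap=100, executable="iqtree2"):
    """Build an IQ-TREE argument list: model selection, tree search, bootstrap."""
    return [
        executable,            # assumed name; may be "iqtree" on your system
        "-s", alignment,       # input alignment
        "-m", "MFP",           # ModelFinder followed by tree inference
        "-b", str(bootstrap),  # standard bootstrap replicates
    ]


command = iqtree_command("18S_mafft.fas")  # hypothetical file name
print(shlex.join(command))
```

Keeping the command as a list makes it straightforward to pass to subprocess.run if you decide to execute it.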

7. Flowchart

In addition to the step-by-step procedure described above, PhyloSuite provides a streamlined workflow mode (Flowchart) that runs the analyses described in sections 3 to 6 in a single go. You only need to provide the input file to the first program (MAFFT); the remaining steps are executed automatically in the default order, alignment-->concatenation-->model selection-->phylogenetic tree reconstruction, with the results of each upstream program used as the input for the downstream program.

Click Flowchart in the menu bar to open the program window;
Select the analyses you want to conduct; here these are MAFFT, Concatenation, ModelFinder, IQ-TREE, and MrBayes. Concatenation must be selected even though this is a single-gene dataset, because the downstream programs only recognize concatenated results as input files; here the Concatenation function simply serves as a format converter;
Click on the red Input file here to open the MAFFT program window;
Click on the MAFFT file input box and select one of the automatically prepared input files (see section 3 for details);
Select the extraction results from before (note that I only kept 18S);
Just close the window and MAFFT will automatically save the input file and the parameters that you set (also applied to other programs).

After inputting the file into the MAFFT and setting the parameters, you may also wish to set the parameters for other programs.

Click the Check | Start button, and a parameter summary window will pop up, allowing you to modify and check the parameters;
Click the blue words to set the parameters (the corresponding widgets will be highlighted in the pop-up window);
After setting the parameters for each software, just close the window to save the settings;
PhyloSuite can also check for conflicting parameters automatically. For example, if the model selected by ModelFinder does not use a partition model, partition mode should also be turned off for MrBayes. Click autocorrect to fix such conflicts automatically.
After the parameters are checked (note that for the demo purpose, we chose very short running times for MrBayes and IQ-TREE), you can click Ok, start to start the program.

IQ-TREE and MrBayes programs can be run simultaneously (they don’t have upstream/downstream program relationship). After a program has finished, you can view the results by clicking the button next to the program name.
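The upstream/downstream logic of the workflow can be pictured as a small dependency graph. The sketch below uses Python's standard graphlib module (Python 3.9 or later) to group the steps into batches that could run at the same time. The exact dependencies are an assumption based on the default order described above; in particular, letting both tree-building programs depend on ModelFinder is a simplification and not a statement about PhyloSuite's internals.

```python
from graphlib import TopologicalSorter

# Each program maps to the set of programs whose results it needs (assumed).
dependencies = {
    "MAFFT": set(),
    "Concatenation": {"MAFFT"},
    "ModelFinder": {"Concatenation"},
    "IQ-TREE": {"ModelFinder"},
    "MrBayes": {"ModelFinder"},
}


def execution_batches(graph):
    """Group steps into batches; steps within a batch do not depend on each other."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # every step whose inputs are done
        batches.append(ready)
        sorter.done(*ready)
    return batches


for number, batch in enumerate(execution_batches(dependencies), start=1):
    print(number, ", ".join(batch))
```

The last batch contains both IQ-TREE and MrBayes, which reflects the point above that the two can run simultaneously.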

After all the programs have finished, the main interface will give you a summary description of the operation, as well as references to the software used, both of which can be useful for writing the Materials & Methods section of your manuscript.

8. Phylogenetic tree annotation

After phylogenetic tree reconstruction, the last step is to annotate the phylogenetic tree. Here we will use the results of IQ-TREE as an example.
1. First open iTOL’s home page https://itol.embl.de/, log in to your account, and select My Trees;
2. Double-click IQtree_results to open the IQ-TREE results folder, and drag the *.treefile file into the area shown in the figure to automatically import the phylogenetic tree;

3. Click on the tree title to enter the editing interface.
4. Double-click extract_results to open the extraction files (that we generated in section 2). The files within the itolFiles folder are used for iTOL annotation. The operation is very simple, just drag the file into the iTOL interface. As shown in the figure, after dragging in itol_gb_labels.txt, you will find that the name of the species has been replaced with the accession number (in my case, there is an error log showing that some of the listed IDs were not found, because 9 sequences have not been extracted; as it will not affect the annotation, we can ignore it for the purpose of this demo). Two more files can be used to replace names (itol_labels.txt and itol_ori_labels.txt);

5. Check file names to see which taxonomic category information they carry. For example, itol_Class_ColourStrip.txt and itol_Class_Text.txt can be used to annotate (colourize) different classes.
a. Drag these two files (itol_Class_ColourStrip.txt and itol_Class_Text.txt) together into the iTOL interface;
b. iTOL treats each file as a separate dataset. You can set the parameters for each dataset by clicking the gear-shaped settings button;
c. Set the Text size factor of the Class text dataset to change the font size for the name of the class;
d. Set the Strip width of the color_strip dataset to change the width of the vertical block.
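To give an idea of what a colour strip annotation file contains, the sketch below writes a small dataset in iTOL's DATASET_COLORSTRIP format. It is an illustration only: the leaf names, class names, and colors are hypothetical, and the files generated by PhyloSuite may contain additional fields.

```python
def colourstrip_dataset(label, legend_colour, rows, strip_width=25):
    """Return the text of an iTOL colour strip dataset.

    `rows` is a list of (leaf_id, colour, taxon) tuples; leaf IDs must
    match the leaf names in the uploaded tree.
    """
    lines = [
        "DATASET_COLORSTRIP",
        "SEPARATOR TAB",
        f"DATASET_LABEL\t{label}",
        f"COLOR\t{legend_colour}",        # colour of the dataset in the legend
        f"STRIP_WIDTH\t{strip_width}",    # width of the colour block
        "DATA",
    ]
    for leaf_id, colour, taxon in rows:
        lines.append(f"{leaf_id}\t{colour}\t{taxon}")
    return "\n".join(lines) + "\n"


# Hypothetical leaves and classes.
rows = [
    ("leaf_1", "#e41a1c", "Class_A"),
    ("leaf_2", "#377eb8", "Class_B"),
]
print(colourstrip_dataset("Class", "#000000", rows))
```

Changing STRIP_WIDTH in the file has the same kind of effect as the Strip width setting in step d.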

When we extracted the sequences (section 2), there was an option to set the color for the taxonomic groups. The colors of the five classes that we set before can be seen here, while the colors of other classes were randomly chosen. Note that the location of the taxonomic group name may not be in the middle of the colour strip representing the clade (see Colpodea in the figure above), but you can export the tree into an SVG format file and then fine-tune it using Adobe Illustrator.
The advantages of this annotation method are especially pronounced when you have hundreds of species, or when you want to compare multiple topologies based on the same dataset. Colour-coded annotation is very helpful in such cases, especially for identifying paraphyly and polyphyly.

When annotating the BI tree, just drag the *.con.tre tree file into iTOL, and the rest of the operations are the same as for the IQ-TREE tree.

At this point, the single-gene phylogenetic analysis is finished. This tutorial covered the complete procedure, from data acquisition to phylogenetic tree annotation. Of course, you can start from any step. For example, it is possible to start directly from the MAFFT multiple sequence alignment, but then the iTOL annotation files, which are generated during sequence extraction (section 2), will not be available for annotating the phylogenetic tree.

9. Software download website

https://github.com/dongzhang0725/PhyloSuite/releases or https://dongzhang0725.github.io/dongzhang0725.github.io/installation/#Chinese_download_link (China)

10. Reference

Zhang, D., F. Gao, I. Jakovlić, H. Zou, J. Zhang, W.X. Li, and G.T. Wang, PhyloSuite: An integrated and scalable desktop platform for streamlined molecular sequence data management and evolutionary phylogenetics studies. Molecular Ecology Resources, 2020. 20(1): p. 348–355. DOI: 10.1111/1755-0998.13096.
