Customizing the extraction

When extracting sequences, you should use different settings for different data-types. PhyloSuite has five predefined data-types, Mitogenome, general, cox1, 16S and 18S, but you can easily add your own data-type extraction settings, suited to your data.

Here I will demonstrate the process of making the extraction settings for the mitochondrial genome data-type.

1. GenBank file format

The GenBank file format (Feature section) is shown below (for detailed GenBank format please visit https://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html):

FEATURES             Location/Qualifiers
    source[Feature] 1..5028
                    /organism[Qualifier]  ="Saccharomyces cerevisiae[Value or Name]"
                    /db_xref[Qualifier]   ="taxon:4932[Value or Name]"
                    /chromosome[Qualifier]="IX[Value or Name]"
                    /map[Qualifier]       ="9[Value or Name]"
    CDS[Feature]    1..206
                    /product[Qualifier]   ="TCP1-beta[Value or Name]"

Three terms, Feature, Qualifier and Name (see above), are important for us to extract a sequence. After importing your sequences into PhyloSuite (see this example for how to select and download mitogenomes), you can open the sequence to check which features and qualifiers it contains.

2. General extraction

See this brief demo for how to operate the extraction function.

By default, PhyloSuite provides settings for six data-types (loci): Mitogenome, chloroplast genome, general, cox1, 16S and 18S. You can inspect them via Settings-->GenBank File Extracting. When you are not sure which data-type setting to use, you can choose general.

After the extraction, you can open overview.csv file to see the summary of this extraction.

From this file, you can see the features that these data contain, and settings for the general data-type. In my example, you can see that many genes are synonymous (for example, COI, COX1 and COXI are synonymous); they can be unified in the following step.

3. Customize extraction

For detailed description of the extraction settings, please refer here.

Based on the overview, now we can adjust the settings to suit our sequences.

  1. Click gear button to open the settings;
  2. Click Current version button, then select Add new version; you can name it mtDNA in the popup window.

You will see that only rRNA feature is predefined, so we can add more features according to the overview.csv file above. Click the add button, and then enter the feature name that you intend to include into your extraction. You can refer to “Features found in sequences” section in the overview file to decide which features to include (note that feature names are case-sensitive). Here, I added CDS, tRNA and misc_feature. Every time we add a feature, its default qualifiers are gene, product and note. Generally, they are sufficient to cover most cases, but you can add other qualifiers via the corresponding add button.

Qualifiers are used to recognize gene names: PhyloSuite will search the value/name of qualifiers one by one. In the example below (see screenshot), PhyloSuite will first search the value/name of gene for CDS, if there is no gene qualifier, then it searches the value/name of product, and finally the value/name of note. If none of the specified qualifiers are found for the feature, it will be recorded in the summary.csv file.

Finally, we need to unify synonymous gene names. In the extracted results folder (the one from the section 2 of this demo), go to StatFiles folder and open the name_for_unification.csv file.

  1. Modify gene aliases to the chosen standard name in the “New Name” column;
  2. Drag the modified “name_for_unification.csv” file and drop it into the area specified below (see screenshot).

Closing the settings window will save your parameters automatically. See also https://dongzhang0725.github.io/dongzhang0725.github.io/PhyloSuite-demo/single-gene-phylogeny/#2-Sequence-extraction.

4. Extracting with customized settings

Now we can use the customized setting to extract the sequences anew. The multiple gene name aliases should be unified in the extracted file (see screenshot).

Other demo tutorials can be seen here.