In phylogenetic analyses, a large number of identical sequences can sometimes prove to be problematic. This post outlines a protocol for creating and running a customized Unix shell script that reinserts identical sequences into a phylogenetic tree file (NEWICK or NEXUS format), for situations in which identical sequences were removed pre-analysis.
Identical sequences may have been removed using the Mothur 'unique.seqs' function (in which case a '.names' file would have been generated, storing the information about which sequences were removed) or RAxML (which generates a '.reduced.phy' file for phylogenetic analysis and a log file that contains a list of the removed sequences and their remaining representatives in a format that can be easily extracted using Unix or Microsoft Excel). The protocol described here relies on using a '.names' file. If sequences were not removed using Mothur, the '.names' file can be manually generated (here are notes on the basic format: http://www.mothur.org/wiki/Names_file ), or the original sequence file can be processed using the Mothur 'unique.seqs' function.
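For orientation, a '.names' file is tab-delimited text with two columns: the representative sequence name, then a comma-separated list of every name in that group (including the representative). The following is a minimal sketch, not part of the original protocol, that builds a tiny example file and counts how many sequences each representative stands for. The file name 'example.names' and the ID 'EL01A01c01' are made up for illustration; the other IDs are borrowed from the example script in this post.

```shell
#!/bin/bash
# Build a tiny, hypothetical '.names' file (two tab-separated columns).
workdir=$(mktemp -d)
printf '5005c2\t5005c2,HL06C03c12\n'   > "$workdir/example.names"
printf '5027c58\t5027c58,EL08B01c09\n' >> "$workdir/example.names"
printf 'EL01A01c01\tEL01A01c01\n'      >> "$workdir/example.names"   # made-up singleton

# For each representative, count the comma-separated names in column 2.
counts=$(awk -F'\t' '{ n = split($2, ids, ","); print $1 "\t" n }' "$workdir/example.names")
printf '%s\n' "$counts"

# Total number of sequences the tree should contain after reinsertion.
total=$(awk -F'\t' '{ total += split($2, ids, ",") } END { print total }' "$workdir/example.names")
echo "total sequences: $total"
```

The total is a handy figure to keep: once the identical sequences have been reinserted, the tree should contain that many names.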
This script will need to be built from the ground up as a customized Unix shell script for your sequence set. This can be assembled easily in Microsoft Excel or one of its clones:
Column A: 'sed -i s/' all the way down the column
Column B: sequence IDs for representative sequences (Column 1 of the '.names' file)
Column C: forward slashes ('/') all the way down the column
Column D: lists of sequences represented by each representative sequence (including the representative itself) separated by commas; each line must correlate with the Column B identifiers (Column D corresponds to Column 2 of the Mothur '.names' file)
Column E: '/g file_name.tre' all the way down the column (with 'file_name.tre' replaced by the name of your tree file)
After this is put together, save it as tab-delimited text and open it in a text editor that can search and replace tab characters (e.g., TextWrangler or TextPad). Remove all tabs (search for tabs and replace them with nothing), then add the header lines manually to make it a working script (see the first lines of the example below). [Note: If one sequence name anywhere in the tree file or '.names' file is nested within another (e.g., 'bacterium' and 'bacterium2'), a plain search for the shorter name will also match part of the longer one. To avoid this, add a colon immediately after the shorter representative name in the search pattern and a matching colon after the list of sequences that replaces it; in a tree file with branch lengths, each sequence name is followed directly by a colon.] The script can now be run on the original tree file, transforming it into a tree file that contains every sequence in the original sequence set (before identical sequences were removed). Because 'sed -i' edits the file in place, keep a backup copy of the tree file before running the script.
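The nested-name problem described in the note can be seen on a toy tree. This is only an illustrative sketch: the Newick string and the name 'dupA' are invented, and the two commands simply contrast an unanchored search with the colon-anchored one.

```shell
#!/bin/bash
# Toy Newick tree (hypothetical) in which one label is a prefix of another.
tree='((bacterium:0.1,bacterium2:0.2):0.05,other:0.3);'

# Unanchored pattern: also rewrites the start of 'bacterium2'.
naive=$(printf '%s\n' "$tree" | sed 's/bacterium/bacterium,dupA/g')

# Colon-anchored pattern: only the label immediately followed by ':' matches.
anchored=$(printf '%s\n' "$tree" | sed 's/bacterium:/bacterium,dupA:/g')

echo "naive:    $naive"
echo "anchored: $anchored"
```

In the unanchored result 'bacterium2' is corrupted into 'bacterium,dupA2', while the colon-anchored version leaves it alone. The trailing colon only protects against names that begin with the shorter name; a label that ends with the shorter name would still match and would need additional care.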
Here's an example of a finished script:
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -o search_replace.log -j y
sed -i s/5005c2/5005c2,HL06C03c12/g RAxML_bipartitions.Rhizo_RAxML_topo_BP_50plus.tre
sed -i s/5005c4/5005c4,CL08C02c09,uncultured_bacterium_FD01A08,uncultured_bacterium_FD04E06/g RAxML_bipartitions.Rhizo_RAxML_topo_BP_50plus.tre
sed -i s/5015c31/5015c31,EL02B02c77,EL02C01c63,EL02C02c85,EL02C03c68,EL04C01c65,EL04C01c68,EL04C01c71,EL04C01c72,EL05B03c02,EL06C03c65,EL06C03f17,EL08B01c10,EL08B01c13,EL08B03c17,EL08B03c19,EL09A01c65,EL09A01c67,EL09A01c68,EL09A01c70,EL09A03c19,EL09A03c20,EL09B02c36,EL09B02c39,EL10B01c41,HL10A02c32,NL07B01c12,NL08C03c25,NL08C03f89,NL08C03f90,NL08C03f93/g RAxML_bipartitions.Rhizo_RAxML_topo_BP_50plus.tre
sed -i s/5027c58/5027c58,EL08B01c09/g RAxML_bipartitions.Rhizo_RAxML_topo_BP_50plus.tre
sed -i s/uncultured_bacterium:/uncultured_bacterium,HL08B03c26:/g RAxML_bipartitions.Rhizo_RAxML_topo_BP_50plus.tre
sed -i s/uncultured_bacterium_5C231311/uncultured_bacterium_5C231311,GQ109020/g RAxML_bipartitions.Rhizo_RAxML_topo_BP_50plus.tre
sed -i s/uncultured_bacterium_FD02D06/uncultured_bacterium_FD02D06,EL02C01c61,EL02C02c84,EL02C03c67,EL08C01c23,EL08C03c09,EL10C03c15,NL01B03c63,NL07B01c10,NL07B01d89,NL07B03d84,NL10A02c32/g RAxML_bipartitions.Rhizo_RAxML_topo_BP_50plus.tre
sed -i s/uncultured_bacterium_nbw397h09c1/uncultured_bacterium_nbw397h09c1,HL05A03c20/g RAxML_bipartitions.Rhizo_RAxML_topo_BP_50plus.tre
sed -i s/uncultured_bacterium_Sed3/uncultured_bacterium_Sed3,EF064161/g RAxML_bipartitions.Rhizo_RAxML_topo_BP_50plus.tre
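As an alternative to assembling the columns in a spreadsheet, the same commands could probably be generated straight from the '.names' file. The sketch below is one way to do that with awk, under some assumptions that are not part of the original protocol: sequence names contain no characters that are special to sed or the shell (such as '/', '&', or spaces), the file and script names ('example.names', 'search_replace.sh') are placeholders, and the ID 'EL01A01c01' is made up. It does not handle nested names, which would still need the colon adjustment by hand.

```shell
#!/bin/bash
# Sketch: generate the sed commands from a '.names' file instead of a spreadsheet.
workdir=$(mktemp -d)
names_file="$workdir/example.names"        # hypothetical input file
tree_file="file_name.tre"                  # placeholder tree name, as in Column E
script_file="$workdir/search_replace.sh"   # hypothetical output script

# Small example input: two groups with duplicates and one made-up singleton.
printf '5005c2\t5005c2,HL06C03c12\n'   > "$names_file"
printf '5027c58\t5027c58,EL08B01c09\n' >> "$names_file"
printf 'EL01A01c01\tEL01A01c01\n'      >> "$names_file"

{
  echo '#!/bin/bash'
  # Columns A-E joined without tabs; singletons (column 1 == column 2) are skipped
  # because replacing a name with itself would change nothing.
  awk -F'\t' -v tree="$tree_file" \
    '$1 != $2 { print "sed -i s/" $1 "/" $2 "/g " tree }' "$names_file"
} > "$script_file"

cat "$script_file"
```

The generated file has the same shape as the hand-built example above. Note that 'sed -i' without an argument is the GNU form; the sed shipped with macOS and the BSDs expects a backup suffix after '-i' (for example `sed -i ''`).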
This information and many more bioinformatics tricks, tips, and scripts can be found in my doctoral dissertation (Hodkinson 2011), which will be coming out soon!
- Brendan
Update: These instructions are now published as part of a paper in Environmental Microbiology (Hodkinson et al. 2012) and the data/analysis/instruction files are available from the Dryad data repository (Hodkinson et al. 2011).
----------------------------------------------
References
The above instructions are published in the following sources:
Hodkinson, B. P. 2011. A phylogenetic, ecological, and functional characterization of non-photoautotrophic bacteria in the lichen microbiome. Doctoral Dissertation, Duke University, Durham, NC.
Hodkinson, B. P., N. R. Gottel, C. W. Schadt, and F. Lutzoni. 2011. Data from: Photoautotrophic symbiont and geography are major factors affecting highly structured and diverse bacterial communities in the lichen microbiome. Dryad Digital Repository. doi:10.5061/dryad.t99b1.
Hodkinson, B. P., N. R. Gottel, C. W. Schadt, and F. Lutzoni. 2012. Photoautotrophic symbiont and geography are major factors affecting highly structured and diverse bacterial communities in the lichen microbiome. Environmental Microbiology 14(1): 147-161. doi:10.1111/j.1462-2920.2011.02560.x.
----------------------------------------------
This work was funded in part by NSF DEB-1011504.