Tuesday, February 22, 2011

Fast UniFrac for barcoded 454 16S amplicons

When analyzing large 16S sequence data sets from multiple samples to infer community patterns, one of the best analytical methods available is Fast UniFrac.  Barcoded 454 sequencing (a type of 'pyrosequencing') is quickly becoming the method of choice for generating the data for these types of analyses, and Fast UniFrac has been built especially to handle the large data sets generated using this method.  However, there are simple data management issues that can keep researchers from using this methodology.  One major hindrance with Fast UniFrac is that, if one wants to take advantage of the 16S reference tree that is built in to the program (something that becomes almost obligatory with sufficiently large data sets), it is necessary to put all of the sequence names in a specific format that reveals the sample of origin while preserving the individual sequence identifiers.  Below I describe two distinct procedures that I developed for assembling the input file for the first step of the Fast UniFrac pyrosequencing analysis procedure.  I have tested the first procedure on a data set of ~120,000 sequences (this required sending an email to request a higher Fast UniFrac quota, which is typically capped at 100,000 sequences) of ~500bp in length and the second procedure on a data set of ~40,000 sequences of the same approximate length.

Files needed:
Sequence files ('.fna'+'.qual' or '.sff' or '.fasta'+'.groups') with sequence names that are all the same length (standard for 454 data)
Oligos file with primers and barcodes for each sample (for use with Mothur) [http://www.mothur.org/wiki/Trim.seqs]

Special programs needed:
Mothur
A text editor that can perform search and replace on returns (e.g., TextWrangler for Macintosh or TextPad for Windows)
Microsoft Excel 2008 (previous versions max out at ~65000 rows and may do other unexpected things to large data sets)
BLAST (local)
PyCogent
Enthought Python Distribution (I used v6.3) (alternatively, you can download Python and Numpy, but the EPD has versions of these that play well together and should hopefully work with PyCogent as long as they are first in your path... getting Python, Numpy, and PyCogent to talk can be more complicated than it would seem)

Method #1 (no UNIX scripting required):

1) Take the '.fna' (fasta) and '.qual' (quality) files (or alternatively the '.sff' file) plus the manually-produced '.oligos' file for your data set and process them with Mothur using the 'trim.seqs' function [http://www.mothur.org/wiki/Trim.seqs].  This will produce a '.fasta' file with your sequences and a '.groups' file that shows which sequences come from which samples ('environments' in the language of UniFrac).  I recommend using the many functions of Mothur to further cull your data set; however, the amazing power of Mothur will be the subject of future posts.  If you do decide to change the composition of the '.fasta' file in any way (this is always necessary at least to some degree), you can use the Mothur 'list.seqs' function (performed on the finalized '.fasta' file) to generate an '.accnos' file, then follow that by the 'get.seqs' function (performed on the '.accnos' file and the original '.groups' file) to generate a finalized '.groups' file that correlates with the finalized '.fasta' file.

2) Open the '.groups' file in Microsoft Excel and edit the file to create a four-column spreadsheet: (A) '>' on every row; (B) sample names; (C) delimiter (I use '#') on every row, and (D) sequence names.  Highlight all four columns and sort ascending according to column D (sequence names).  This is done by going to 'Data' > 'Sort' > 'Column D', 'ascending'.  Save as 'file1.csv' (comma-delimited text).

3) Open the new 'file1.csv' file in a text editor and remove all commas (this can be easily done in TextWrangler for Mac by opening the file and going to 'Search' > 'Find', putting ',' in the 'Find:' box, typing nothing in the 'Replace:' box, and clicking 'Replace All').  Save as 'file2.txt' (plain text).

4) Open the '.fasta' file with a text editor (e.g., TextWrangler).  Remove all returns (e.g., Find: '\r', Replace: '' in TextWrangler with the 'Grep' box checked on the Find/Replace screen).  Replace all instances of '>' with a return and '>' (e.g., '\r>'; remember that 'Grep' must be checked if you are using TextWrangler).  Now the first line is empty; manually delete that line.  Save as 'file3.txt' (plain text).

5) Import 'file2.txt' into the first column of Microsoft Excel (column A).  On the same spreadsheet, import the names and sequences from file3.txt to the second and third columns (columns B and C) by clicking on cell B1 (column B, row 1) and importing 'file3.txt' as a text file with 'fixed width'; set the field width for the column break manually at the interface between the sequence name and the beginning of the nucleotide sequence (for my data, it was around the 15th place; if this doesn't work with the text import wizard, fixed-width delimitation can be performed under 'Text to Columns...' in the 'Data' menu).  Highlight columns B and C *only* (if you highlight column A as well, this will not work).  Go to 'Data' > 'Sort' > 'Column B', 'ascending'.  At this point, you should have rows that look something like: '>CLSt#GR5DBVW03HJKND' '>GR5DBVW03HJKND' 'TACGATCGATCGATCAGCATCGATCA...' where the columns are correlated with one another and reading across a row should show the same identifier in column A as in column B.  Be sure that this is the case throughout the spreadsheet.  Now delete column B (the one with the identifiers from the imported fasta file).  Save as 'file4.csv' (comma-delimited text).

6) Open 'file4.csv' with a text editor.  Replace all instances of ',,' with a single return (e.g., find ',,' and replace with '\r' in TextWrangler... remember to have the 'Grep' box checked).  Save as 'file5.fasta'.

7) Follow the instructions for "The BLAST to GreenGenes protocol" and 1-7 of the "Steps" in the Fast UniFrac tutorial (download raw pairwise distance matrices for further analyses), all found here:
http://bmf2.colorado.edu/fastunifrac/tutorial.psp

Method #2 (simpler overall, but requires assembling and running a customized UNIX shell script):

1) Take the '.fna' (fasta) and '.qual' (quality) files (or alternatively the '.sff' file) plus the manually-produced '.oligos' file for your data set and process them with Mothur using the 'trim.seqs' function [http://www.mothur.org/wiki/Trim.seqs].  This will produce a '.fasta' file with your sequences and a '.groups' file that shows which sequences come from which samples ('environments' in the language of UniFrac).  I recommend using the many functions of Mothur to further cull your data set; however, the amazing power of Mothur will be the subject of future posts.  If you do decide to change the composition of the '.fasta' file in any way (this is always necessary at least to some degree), you can use the Mothur 'list.seqs' function (performed on the finalized '.fasta' file) to generate an '.accnos' file, then follow that by the 'get.seqs' function (performed on the '.accnos' file and the original '.groups' file) to generate a finalized '.groups' file that correlates with the finalized '.fasta' file.

2) Create and run a shell script on the '.fasta' file that looks something like this:

#!/bin/bash
#$ -S /bin/bash 
#$ -cwd
#$ -o search_replace.log -j y


sed -i s/GMQ03P202B3SVL/MID08#GMQ03P202B3SVL/g file_name.fasta
sed -i s/GMQ03P202BTOMD/MID08#GMQ03P202BTOMD/g file_name.fasta
sed -i s/GMQ03P202CGYZW/MID08#GMQ03P202CGYZW/g file_name.fasta
sed -i s/GMQ03P202BT4E9/MID08#GMQ03P202BT4E9/g file_name.fasta
sed -i s/GMQ03P202BXELV/MID08#GMQ03P202BXELV/g file_name.fasta

It will have to contain a row for every sequence.  This sort of thing can be assembled in Microsoft Excel:
Column A: 'sed -i s/' all the way down the column
Column B: sequence IDs (from a Mothur '.groups' file)
Column C: backslashes all the way down the column
Column D: sample identifiers (correlated with the sequence IDs... easily done if the sequence IDs and sample identifiers are both taken from a Mothur '.groups' file and the order is preserved)
Column E: delimiter (e.g., #) all the way down the column
Column F: identical to column B
Column G: '/g file_name.fasta' all the way down the column
After that is put together, save it as comma-delimited text, open it with a text editor, remove all commas, add the first few lines manually to make it a working script, and run.

3) Follow the instructions for "The BLAST to GreenGenes protocol" and 1-7 of the "Steps" in the Fast UniFrac tutorial (download raw pairwise distance matrices for further analyses), all found here: http://bmf2.colorado.edu/fastunifrac/tutorial.psp


Once you have the raw pairwise distance matrices, they can be analyzed in R or another statistical package/program.  This will be the subject of a later post.  As part of the Fast UniFrac protocol (when using the GreenGeenes core backbone tree), it will be necessary to use PyCogent.  If you have trouble with PyCogent itself (or the command-line in general), please see my earlier post with some tricks and tips on getting that part of the Fast UniFrac 'BLAST-to-GreenGenes' protocol to work.  Hopefully the current post can help someone in constructing the initial input file for the first part of the Fast UniFrac procedure!

- Brendan

P.S. This procedure places '#' in the sequence names as the delimiter. This information will be needed during the PyCogent-dependent portion of the Fast UniFrac protocol. Here is an example of what I type in the command line once I have navigated to the folder where I have placed the 'create_unifrac_env_file_BLAST.py' python script and the 'blast_output...' file (the confusing thing with my nomenclature is that the Python input is named 'output' because it is the BLAST output, while the Python output is named 'input' because it becomes the Fast UniFrac input; the logic behind this is that the essence of this step is the transformation of the BLAST output file into the Fast UniFrac input file):
python create_unifrac_env_file_BLAST.py blast_output_allreplaced_ready4python.txt fastunifrac_input_file.txt #

Update: These instructions are now published as part of my dissertation (Hodkinson 2011) and an article in Environmental Microbiology (Hodkinson et al. 2012a); the supporting data/analysis/instruction files for the latter are available from the Dryad data repository (Hodkinson et al. 2012b). 

----------------------------------------------

References

The above instructions are published in the following sources:

Hodkinson, B. P. 2011. A phylogenetic, ecological, and functional characterization of non-photoautotrophic bacteria in the lichen microbiome. Doctoral Dissertation, Duke University, Durham, NC.
Download Dissertation (PDF file)

Hodkinson, B. P., N. R. Gottel, C. W. Schadt, and F. Lutzoni. 2012a. Photoautotrophic symbiont and geography are major factors affecting highly structured and diverse bacterial communities in the lichen microbiome. Environmental Microbiology 14(1): 147-161.

Hodkinson, B. P., N. R. Gottel, C. W. Schadt, and F. Lutzoni. 2012b. Data from: Photoautotrophic symbiont and geography are major factors affecting highly structured and diverse bacterial communities in the lichen microbiome. Dryad Digital Repository doi:10.5061/dryad.t99b1.

----------------------------------------------

The development of the above protocols was supported in part by NSF (DEB-1011504) and the US Department of Energy.

No comments:

Post a Comment