Thursday, February 28, 2013

PandaSeq to QIIME

Recently I have been working with some paired-end amplicon MiSeq data. One important step with paired-end amplicon data is assembling each pair of reads to make composite sequences. PandaSeq is a good program for doing this, but when assembling two reads, it seems to use only the portion of the identifier shared between the two reads as the identifier for the assembled read. This becomes problematic for me when performing downstream analyses with QIIME. Therefore, I wrote the following simple Perl script to make the identifiers of the PandaSeq assembly file jive with the identifiers in the corresponding indexing reads file generated after the MiSeq run (so they can both be processed together using QIIME):


my $filename = <$ARGV[0]>;
chomp $filename;
open (FASTQ, $filename);
if ($filename =~ /(.*)\.[^.]*/)
open OUT, ">$1.fixed.fastq";

while ()
if ($_ =~ /^\@(\w*\-\w*\:\d*\:\w*\-\w*\:\d*\:\d*\:\d*\:\d*)\:/)
print OUT "\@$1 2:N:0:\n";
print OUT $_;

The above text can be copied into a file (to make the Perl script) and then invoked with the following:
perl assembly.fastq

Note that these instructions presuppose that you have three .fastq files from your paired-end MiSeq run:
01 - Reads from one end of each amplicon
02 - Index reads
03 - Reads from the other end of each amplicon
These files are generated by default when making .fastq files with some Illumina software, but sometimes making all three files (notably the indexing read file) requires specifying certain parameters. As of version 1.6.0 of QIIME, the group of indexing reads must be entered as a separate file if these data are to be properly integrated into the QIIME workflow.

One final issue that arises once reads have been assembled is that there are now fewer reads in the assembly file than there are in the index file. This can be remedied by making a barcode file with only the entries associated with sequences in the PandaSeq assembled data set. I use the following two commands (after running the above Perl script) to take care of this issue:
sed -n '1~4'p assembly.fixed.fastq | sed 's/^@//g' > defs_in_assembly.txt -f SampleID_NoIndex_L001_R2_001.fastq -o index_reads_filtered.fastq -s defs_in_assembly.txt

I then do a final check to see if the entries in the index file and the assembly file are truly the same:
sed -n '1~4'p index_reads_filtered.fastq | sed 's/^@//g' > index_defs_filtered.txt
diff -s index_defs_filtered.txt defs_in_assembly.txt

If those two files are the same, then the .fastq files should be ready to run through the script to start the QIIME workflow.

Please let me know if you use the above Perl script or if you run into issues with any of this!

- Brendan

[Update - I just got back a data set from another facility in which the first part of the identifier for each sequence (the section before the first colon) was written in a slightly different format. To deal with this, the above 'if' line that comes after 'while' should read as follows:
if ($_ =~ /^\@(\w*\:\d*\:\w*\-\w*\:\d*\:\d*\:\d*\:\d*)\:/)
If you are not sure which format your identifiers take, it may be best to try the script as written above and then try it with this modification.]


  1. Would you be able to post of examples of what your sequence names look like when you get them from the sequencing center, after they are assembled in PandaSeq, and what this perl script changes them to? That would be very helpful.

    Otherwise, this is a great help, thank you!

  2. Hi Daniel,

    That's a good idea. This should make it a bit more helpful, especially if some facilities use slightly different conventions for identifiers.

    Identifiers for the first read (the amplicon's left side) look like this:
    @HWI-M00590:25:000000000-A35K0:1:1101:14342:1507 1:N:0:
    @HWI-M00590:25:000000000-A35K0:1:1101:15983:1528 1:N:0:
    @HWI-M00590:25:000000000-A35K0:1:1101:15941:1552 1:N:0:

    Identifiers for the second read (the index read) look like this:
    @HWI-M00590:25:000000000-A35K0:1:1101:14342:1507 2:N:0:
    @HWI-M00590:25:000000000-A35K0:1:1101:15983:1528 2:N:0:
    @HWI-M00590:25:000000000-A35K0:1:1101:15941:1552 2:N:0:

    Identifiers for the third read (the amplicon's right side) look like this:
    @HWI-M00590:25:000000000-A35K0:1:1101:14342:1507 3:N:0:
    @HWI-M00590:25:000000000-A35K0:1:1101:15983:1528 3:N:0:
    @HWI-M00590:25:000000000-A35K0:1:1101:15941:1552 3:N:0:

    Once the first and third read (the reads from the two sides of the amplicon) have been assembled using Pandaseq, the identifiers of the assembly look like this:

    Then the Perl script changes the identifiers in the assembly to look exactly like the identifiers for the second (i.e., index) read:
    @HWI-M00590:25:000000000-A35K0:1:1101:14342:1507 2:N:0:
    @HWI-M00590:25:000000000-A35K0:1:1101:15983:1528 2:N:0:
    @HWI-M00590:25:000000000-A35K0:1:1101:15941:1552 2:N:0:

    Here is an example of one of the identifiers referenced in the 'update' section at the end of my post:
    @M00830:38:000000000-A3AD8:1:1101:18437:1751 2:N:0:
    It really only differs in that there is no hyphen in the portion of the identifier that is before the first colon.

    If you'd like to create a modified version of the Perl script to match a different type of identifier, I'd be happy to help out!