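#!/usr/bin/perl
use strict;
use warnings;

# Usage: perl script.pl assembly.fastq
my $filename = $ARGV[0] or die "Usage: perl $0 assembly.fastq\n";
open(my $fastq, '<', $filename) or die "Cannot open $filename: $!\n";

# Name the output after the input, e.g. assembly.fastq -> assembly.fixed.fastq
(my $base = $filename) =~ s/\.[^.]*$//;
open(my $out, '>', "$base.fixed.fastq") or die "Cannot write $base.fixed.fastq: $!\n";

while (<$fastq>)
{
    # On header lines, keep the identifier and replace what follows it with " 2:N:0:"
    if ($_ =~ /^\@(\w*\-\w*\:\d*\:\w*\-\w*\:\d*\:\d*\:\d*\:\d*)\:/)
    {
        print $out "\@$1 2:N:0:\n";
    }
    else
    {
        # Sequence, "+" and quality lines pass through unchanged
        print $out $_;
    }
}
close $fastq;
close $out;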
The above text can be copied into a file (to make the Perl script) and then invoked with the following:
perl script.pl assembly.fastq
Note that these instructions presuppose that you have three .fastq files from your paired-end MiSeq run:
01 - Reads from one end of each amplicon
02 - Index reads
03 - Reads from the other end of each amplicon
These files are generated by default when making .fastq files with some Illumina software, but sometimes making all three files (notably the indexing read file) requires specifying certain parameters. As of version 1.6.0 of QIIME, the group of indexing reads must be entered as a separate file if these data are to be properly integrated into the QIIME workflow.
One final issue arises once the reads have been assembled: the assembly file now contains fewer reads than the index file, because not every read pair ends up in the assembled output. This can be remedied by making an index (barcode) file containing only the entries associated with sequences in the PANDAseq-assembled data set. I use the following two commands (after running the above Perl script) to take care of this:
sed -n '1~4'p assembly.fixed.fastq | sed 's/^@//g' > defs_in_assembly.txt
filter_fasta.py -f SampleID_NoIndex_L001_R2_001.fastq -o index_reads_filtered.fastq -s defs_in_assembly.txt
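The `1~4` address in the first command is a GNU sed extension, so it may not work with the sed shipped on macOS or other BSD systems. A minimal sketch of a portable alternative using awk is shown here; the function name `fastq_headers` is made up for illustration, and it assumes every record in the file spans exactly four lines.

```shell
# Print the header line of every 4-line FASTQ record, without the leading "@".
# Selecting every fourth line (rather than every line starting with "@") matters
# because quality lines can also begin with "@".
fastq_headers() {
    awk 'NR % 4 == 1 { sub(/^@/, ""); print }' "$1"
}

# Usage with the file names from this post:
# fastq_headers assembly.fixed.fastq > defs_in_assembly.txt
```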
I then do a final check to see if the entries in the index file and the assembly file are truly the same:
sed -n '1~4'p index_reads_filtered.fastq | sed 's/^@//g' > index_defs_filtered.txt
diff -s index_defs_filtered.txt defs_in_assembly.txt
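If diff reports differences, a quick first thing to look at is whether the two files even hold the same number of records. The sketch below is one way to check that; the function names are hypothetical, and it assumes plain, uncompressed FASTQ with four lines per record (a fractional count would suggest a truncated file).

```shell
# Number of records in a FASTQ file, assuming 4 lines per record.
fastq_record_count() {
    awk 'END { print NR / 4 }' "$1"
}

# Succeeds (exit status 0) only if both files contain the same number of records.
same_record_count() {
    [ "$(fastq_record_count "$1")" = "$(fastq_record_count "$2")" ]
}

# Usage with the file names from this post:
# same_record_count index_reads_filtered.fastq assembly.fixed.fastq && echo "counts match"
```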
Please let me know if you use the above Perl script or if you run into issues with any of this!
- Brendan
[Update: I just got back a data set from another facility in which the first part of each sequence identifier (the section before the first colon) was written in a slightly different format, without the hyphen. To deal with this, the 'if' line inside the 'while' loop should read as follows:
if ($_ =~ /^\@(\w*\:\d*\:\w*\-\w*\:\d*\:\d*\:\d*\:\d*)\:/)
If you are not sure which format your identifiers take, it may be best to try the script as written above and then try it with this modification.]
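Rather than running the script twice, the first header of the assembly file can be tested against both patterns directly. The following is a rough sketch, not part of the original workflow: the function name `check_header_format` is invented here, and the two Perl patterns have been translated to POSIX extended regular expressions on the assumption that `\w` corresponds to letters, digits and underscore.

```shell
# Report which of the two identifier formats a FASTQ header line matches.
# Prints "original", "updated", or "unknown".
check_header_format() {
    header=$1
    w='[[:alnum:]_]*'   # stands in for Perl's \w*
    d='[0-9]*'          # stands in for Perl's \d*
    original="^@${w}-${w}:${d}:${w}-${w}:${d}:${d}:${d}:${d}:"
    updated="^@${w}:${d}:${w}-${w}:${d}:${d}:${d}:${d}:"
    if printf '%s\n' "$header" | grep -Eq "$original"; then
        echo original
    elif printf '%s\n' "$header" | grep -Eq "$updated"; then
        echo updated
    else
        echo unknown
    fi
}

# Usage with the file name from this post:
# check_header_format "$(head -n 1 assembly.fastq)"
```

A result of "unknown" would mean that neither version of the 'if' line will rewrite the headers, so the pattern would need further adjustment for that data set.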