So I wrote a couple of simple Perl scripts that allow me to make my alignments in Sequencher (the standard program for editing raw sequence reads) and easily move them over to Mesquite or MacClade (standard programs for assembling data matrices for downstream phylogenetic analyses) so that they can be joined with a reference alignment that I had made previously. In this way, I could avoid completely realigning all sequences to one another with an automatic alignment program, thereby preserving certain sequence alignment patterns (note that I often deal with over 1000 sequences at a time). If you use Linux or Macintosh, running a Perl script is generally a pretty simple matter (since a Perl interpreter is typically included with the operating system). If you use Windows, you will probably need to download an interpreter like Strawberry Perl or ActivePerl.
The type of data that I was dealing with was a set of bidirectional Sanger sequences (one forward, one reverse primer for each sequence) of fragments ~650 bp in length. These sequences were cloned and therefore had vector overhang on both ends of both strands, which had to be deleted. If you have data that are similar, here is a procedure that can be used to preserve the Sequencher alignment pattern and bring it into MacClade/Mesquite (potentially for merging with a curated reference alignment, if you have one of these):
[a] In the Sequencher alignment, make sure at least one sequencing strand of each pair of strands (from the bidirectionally-sequenced pool of DNA fragments) has all of the corrected bases, and delete the second strand for each pair. This gives an alignment with one strand for each sequence. [This Sequencher alignment can be tweaked visually to align with a reference set that is already pre-aligned by introducing gaps into the Sequencher alignment to accommodate the gaps in the reference alignment.]
[b] The Sequencher alignment can then be exported as a contig in aligned fasta format and subsequently opened in MacClade/Mesquite. [Note: If you have exported the sequences from Sequencher as a concatenated set of sequence fragments, it might use ':' instead of '-' to represent the gaps; make sure all of the gaps are changed to '-' for integration into MacClade or Mesquite (this can be done as a simple search and replace with any text editor).]
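The gap replacement mentioned in the note above can also be scripted rather than done by hand in a text editor. The following is only a minimal sketch, not one of the scripts used for the published work: the subroutine name normalize_gaps and the example sequence names are made up for illustration, and it assumes that ':' should be changed only on sequence lines, so that any ':' in a header line is left alone.

```perl
use strict;
use warnings;

# Replace ':' gap characters with '-' on sequence lines only,
# leaving header lines (those starting with '>') untouched.
# normalize_gaps is a hypothetical helper name, not part of any library.
sub normalize_gaps {
    my ($fasta_text) = @_;
    # A limit of -1 keeps trailing empty fields, so a final newline survives the round trip.
    my @lines = split /\n/, $fasta_text, -1;
    for my $line (@lines) {
        next if $line =~ /^>/;   # skip headers, which may legitimately contain ':'
        $line =~ tr/:/-/;        # swap every ':' for '-'
    }
    return join "\n", @lines;
}

# Small in-memory demonstration with made-up sequence names.
my $example = ">seq:A\nAC::GT\n>seq:B\nA:CGGT\n";
print normalize_gaps($example);
```

A plain search and replace in a text editor does the same job; the only advantage of the sketch is that it cannot accidentally alter a sequence name.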
For my particular sequences, I had to deal with the issue of all of the sequence names being preceded by my initials and having strand-specific information tacked on to the end (both standard pieces of information added by the sequencing facility). Here is another blog post with the Perl script that I wrote for editing the fasta file to extract the 10-character alphanumeric code used to identify my sequences. Also, I had to line my sequence block up with the portion of my reference alignment to which it corresponded. In my particular situation, the block of sequences that I had aligned began 488 bases into the reference alignment. Here is the script that I used to add 488 gap characters to the front of each sequence in the fasta file (this script relies on having a 10-character code name for each sequence):
use strict; use warnings;
print "\nPlease type the name of your input file: ";
chomp(my $filename = <STDIN>);   # drop the newline typed after the file name
open(FASTA, "<", $filename) or die "Cannot open $filename: $!";
my $base = ($filename =~ /(.*)\.[^.]*/) ? $1 : $filename;   # input name without its extension
open(OUT, ">", "$base.ed.fasta") or die "Cannot write $base.ed.fasta: $!";
my $gaps = "-" x 488;   # 488 gap characters placed ahead of each sequence
while (<FASTA>) {
    if ($_ =~ /^>(..........)/) {
        print OUT "\r>$1\r$gaps\n";   # header: 10-character name, then the leading gaps
    } else {
        print OUT $_;   # sequence line: copied unchanged
    }
}
close(FASTA); close(OUT);
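Before importing the edited file, it can be worth confirming that every sequence really does begin with the expected run of gaps. The following is only a sketch under stated assumptions and was not part of the original workflow: the subroutine names read_fasta and missing_offset are hypothetical, the example uses an offset of 3 instead of 488 to stay readable, and the parser assumes that line endings may be \r, \n, or \r\n.

```perl
use strict;
use warnings;

# Parse FASTA text into a name => sequence hash,
# accepting \r, \n, or \r\n line endings.
sub read_fasta {
    my ($text) = @_;
    my (%seqs, $name);
    for my $line (split /\r\n|\r|\n/, $text) {
        if ($line =~ /^>(\S+)/) {
            $name = $1;              # start a new record
            $seqs{$name} = '';
        } elsif (defined $name) {
            $line =~ s/\s+//g;       # drop stray whitespace
            $seqs{$name} .= $line;   # join wrapped sequence lines
        }
    }
    return %seqs;
}

# Return the sorted names of sequences that do NOT begin with
# at least $offset gap characters.
sub missing_offset {
    my ($offset, %seqs) = @_;
    my @bad = grep { $seqs{$_} !~ /^-{$offset}/ } keys %seqs;
    return sort @bad;
}

# Demonstration with made-up names and an offset of 3.
my %seqs = read_fasta("\r>AAAAAAAAAA\r---\nACGT\n>BBBBBBBBBB\nACGT\n");
print "Missing offset: $_\n" for missing_offset(3, %seqs);
```

For a real file, the text would be read from the .ed.fasta output and the offset set to 488; any name that is printed points to a header that did not match the 10-character pattern.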
The final step was to simply open up my reference alignment in MacClade and import the newly-generated fasta file of aligned cloned sequences... and they lined up perfectly! I then tweaked exclusion sets, saved the full alignment, and was ready for downstream phylogenetic analyses.
Even though MacClade and Mesquite are very good programs overall for alignment, aligning a set of 1000+ sequences in them is extremely cumbersome, and Sequencher can be much faster and easier as long as the sequences are relatively conserved. With the Perl scripts discussed above, hopefully researchers will no longer see any impediment or inefficiency in a process that involves aligning and correcting relatively conserved sequences in Sequencher (with all of the raw sequence data) before moving them over to MacClade/Mesquite for final data set assembly and formatting.
The above protocols are published in the following sources:
Hodkinson, B. P. 2011. A Phylogenetic, Ecological, and Functional Characterization of Non-Photoautotrophic Bacteria in the Lichen Microbiome. Doctoral Dissertation, Duke University, Durham, NC.
Hodkinson, B. P., N. R. Gottel, C. W. Schadt, and F. Lutzoni. 2011. Data from: Photoautotrophic symbiont and geography are major factors affecting highly structured and diverse bacterial communities in the lichen microbiome. Dryad Digital Repository doi:10.5061/dryad.t99b1.
Hodkinson, B. P., N. R. Gottel, C. W. Schadt, and F. Lutzoni. In press. Photoautotrophic symbiont and geography are major factors affecting highly structured and diverse bacterial communities in the lichen microbiome. Environmental Microbiology.
This work was funded in part by NSF DEB-1011504.