Tuesday, July 5, 2011

Perl: Renaming DNA Sequences

When I began my dissertation studies, I did not know the wonders of the Perl programming language. However, within the past year, it has proven to be an invaluable tool for manipulating DNA sequence data sets and helping me to tackle projects that once seemed too large in scope. In this post I will give just one example of a Perl script that I wrote after getting some training from Dr. Bob Thomson, whom I met at the NSF-sponsored "Fast, Free Phylogenies" workshop at NIMBioS (Knoxville, TN).

When processing large numbers of DNA sequences, it always helps to have a standardized naming system so that the sequences can be handled in an automated way during downstream analyses. For a recent large-scale cloning experiment that involved picking 2880 clonal bacterial colonies (to amplify and sequence a vector-inserted 16S gene fragment from each), I developed a 10-character alphanumeric code that allowed me to encode all of the necessary data about my sequences into each sequence identifier. However, the sequencing facility also needed to use its own codes to keep track of my sequences. I ended up with long names that had my own codes in the middle, information identifying them as my sequences in front, and information about the individual sequence reads tacked on at the end. To recover my original names (without the sequencing facility's additions) in an automated way, I wrote a simple Perl script to edit a fasta file containing the sequences. I ran it after manual sequence correction had been finished.

The following script (‘Clon_16S_fasta_renamer.pl’) allowed me to extract the 10-character alphanumeric codes that I used in my dissertation studies (Hodkinson 2011) from the long names (with extraneous information) that come from the sequencing facility. It creates a new fasta file with these modified identifiers. Specifically, it takes sequences whose names have "BH_" (my initials), followed by a 10-character code, followed by additional characters, and renames each sequence using just the 10-character code (effectively stripping out "BH_" at the beginning and any extra characters at the end). The new file will have the same name as the input file, but the extension will be replaced by ".ed.fasta". The script can easily be modified for any set of sequences that are identified using a standardized naming scheme.

#!/usr/bin/perl
use strict;
use warnings;

# Ask for the name of the input fasta file.
print "\nPlease type the name of your input file: ";
my $filename = <STDIN>;
chomp $filename;
open(my $fasta, '<', $filename) or die "Cannot open '$filename': $!\n";

# Name the output after the input, replacing the last extension with
# ".ed.fasta" (if there is no extension, ".ed.fasta" is simply appended).
(my $base = $filename) =~ s/\.[^.]*$//;
open(my $out, '>', "$base.ed.fasta") or die "Cannot create '$base.ed.fasta': $!\n";

while (my $line = <$fasta>)
    {
    # Header line: keep only the 10 characters that follow ">BH_".
    if ($line =~ /^>BH_(.{10})/)
        {
        print $out ">$1\n";
        }
    # Sequence line: one or more IUPAC nucleotide codes, ":" or "-".
    # Lines containing any other characters are skipped.
    elsif ($line =~ /^[ACGTRYKMSWBDHVN:-]+\s*$/)
        {
        print $out $line;
        }
    }

close $fasta;
close $out or die "Cannot finish writing '$base.ed.fasta': $!\n";
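To check the renaming step on its own before running it on a whole file, the header pattern can be tried on a single name. The following is only a minimal sketch: the helper name extract_code and the example header are made up for illustration (they are not taken from my data or from the script above), and it assumes the code is always exactly the 10 characters that follow ">BH_".

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): return the
# 10-character code from a header line, or undef if the line does not
# start with ">BH_" followed by at least 10 characters.
sub extract_code
    {
    my ($header) = @_;
    return $header =~ /^>BH_(.{10})/ ? $1 : undef;
    }

# Made-up header in the general shape described above: initials,
# 10-character code, then extra characters added by the facility.
my $example = ">BH_AB12CD34EF_facility_read_001.ab1";

my $code = extract_code($example);
print defined $code ? ">$code\n" : "no match\n";
```

For this made-up header the sketch prints ">AB12CD34EF". Adapting it to another naming scheme should mostly be a matter of changing the prefix and the length inside the pattern.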

If you need to know how to run a Perl script, you can look it up on Google, but here is one example of how to run a Perl script using Windows (it's actually easier on almost any other type of operating system). Since I was performing a simple task with a very specific data set, it was easy for me to use basic Perl commands. However, for more complex sequence manipulations, BioPerl provides an excellent collection of Perl modules for biological applications.

- Brendan

----------------------------------------------

References

The above script is published in the following sources:

Hodkinson, B. P. 2011. A Phylogenetic, Ecological, and Functional Characterization of Non-Photoautotrophic Bacteria in the Lichen Microbiome. Doctoral Dissertation, Duke University, Durham, NC.

Hodkinson, B. P., N. R. Gottel, C. W. Schadt, and F. Lutzoni. 2011. Data from: Photoautotrophic symbiont and geography are major factors affecting highly structured and diverse bacterial communities in the lichen microbiome. Dryad Digital Repository doi:10.5061/dryad.t99b1.

Hodkinson, B. P., N. R. Gottel, C. W. Schadt, and F. Lutzoni. In press. Photoautotrophic symbiont and geography are major factors affecting highly structured and diverse bacterial communities in the lichen microbiome. Environmental Microbiology.

----------------------------------------------

This work was funded in part by NSF DEB-1011504 and EF-0832858.
