Tuesday, April 6, 2010

Submitting to GenBank

Anyone who does much DNA sequencing or sequence analysis eventually has to deposit sequences in GenBank.  For a sequence or two, it is relatively simple to use the online 'BankIt' submission system.  For larger batches, however, it becomes necessary to use the batch functions available in the 'Sequin' program.  Learning to submit large batches of sequences that each contain multiple 'features' involves quite a steep learning curve (especially if you are teaching yourself).  Unfortunately, the pages on the NCBI website do not provide quite enough detail to make submission a simple process.  While submitting sequences for a number of recent papers (e.g., Hodkinson & Lutzoni 2009, Hodkinson & Lendemer 2010, Lendemer & Hodkinson 2009, 2010), I wrote myself a tutorial on how to submit RNA-encoding sequences (rRNA, introns, transcribed spacers, etc.) to GenBank.  Most of it applies to all sequence types, but getting the information for protein-coding sequences correct may still require some extra assistance.  The 16-step outline below gives the most detail on the most difficult and problematic aspect: annotating multiple sequence features.

GenBank Submission Using SEQUIN:
1) Make a FASTA+GAP file with bracketed modifiers for all basic info that varies between sequences (e.g., organism, isolate, specimen-voucher; for basic formatting see http://www.ncbi.nlm.nih.gov/Sequin/QuickGuide/sequin.htm#AlignmentFormats; for appropriate modifiers see http://www.ncbi.nlm.nih.gov/Sequin/QuickGuide/sequin.htm#DefinitionLine).
2) Run SEQUIN and type in the authorship, contact, and citation information that applies to all sequences.
3) Import file into SEQUIN as a 'Phylogenetic Study' set in 'FASTA+GAP' format.
4) Click 'Edit' > 'Alignment Assistant...'.
5) Click 'Features' > 'Apply To Alignment' > 'RNA'.
6) If the 5' or 3' end of the feature that you wish to annotate is partial, be sure to check the corresponding box.
7) Type in the alignment coordinates of the particular feature that you are annotating.
8) In the 'RNA Type' box, pick the type of feature (e.g., 'misc_RNA' for ITS1 and ITS2, or 'rRNA' for 18S, 5.8S, or 28S).
9) In the field next to 'RNA Name', put in the specific type of RNA (18S ribosomal RNA, internal transcribed spacer 1, etc.).
10) Click 'Accept'.
11) Repeat steps 5-10 for each section of RNA in the sequence set.
12) In the 'Alignment Assistant' window, go to the 'File' menu and click 'Close'.
13) To check/edit your work: next to 'Target Sequence' choose 'ALL SEQUENCES' and next to 'Format' choose 'Graphic' (double-click on any particular feature annotation to see details and/or make changes; if a particular annotation is entirely erroneous, highlight the annotation and go to 'Edit' then 'Clear').
14) Click 'Done' on the main viewing window.
15) When it asks 'Are you ready to save the record?' click 'Yes'.
16) Save the file to the hard drive and email it to 'gb-sub@ncbi.nlm.nih.gov'.
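
For step 1, typing the bracketed modifiers by hand gets tedious for large batches, so here is a minimal Python sketch of how the FASTA+GAP file could be generated instead.  It is only an illustration and not part of Sequin: the function names (definition_line, fasta_gap) and the example records are made up, and it assumes the definition-line layout described on the NCBI pages linked in step 1 (a sequence ID after '>', followed by [modifier=value] pairs).  It also checks that every aligned sequence has the same length, since the sequences in a FASTA+GAP file are already aligned, with '-' marking the gaps.

```python
def definition_line(seq_id, modifiers):
    """Build one definition line: '>ID [key=value] [key=value] ...'."""
    parts = [f"[{key}={value}]" for key, value in modifiers.items()]
    return ">" + " ".join([seq_id] + parts)


def fasta_gap(records, width=60):
    """Format (seq_id, modifiers, aligned_sequence) tuples as FASTA+GAP text.

    Raises ValueError if the aligned sequences differ in length, because
    the feature coordinates entered later refer to alignment columns.
    """
    lengths = {len(seq) for _, _, seq in records}
    if len(lengths) > 1:
        raise ValueError(f"aligned sequences differ in length: {sorted(lengths)}")
    lines = []
    for seq_id, modifiers, seq in records:
        lines.append(definition_line(seq_id, modifiers))
        # Wrap the sequence to a fixed line width.
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


# Hypothetical placeholder records; replace with your own data.
EXAMPLE_RECORDS = [
    ("Seq1", {"organism": "Genus speciesA", "isolate": "A1"}, "ACGT--ACGT"),
    ("Seq2", {"organism": "Genus speciesB", "isolate": "B7"}, "ACGTTTAC-T"),
]

if __name__ == "__main__":
    print(fasta_gap(EXAMPLE_RECORDS), end="")
```

Saving the printed text to a file gives something that can be tried as the import file in step 3.  The modifier names must still match the ones NCBI accepts (see the second link in step 1).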

I have been told that this protocol is helpful in getting Sequin to 'work'.  I hope that posting it here will help even more people!

- Brendan

