Thursday, February 10, 2011

PyCogent for Fast UniFrac

As someone studying the composition of lichen-associated bacterial communities, I have generated several data sets of 16S rRNA gene sequences from bacteria that live in this specialized niche. Beyond the simple question of "who lives there?" we can start to use phylogenetic inferences to examine the ecology of this niche by comparing sets of 16S sequences from different communities and taking into account where the different members fall in a phylogeny. UniFrac is a tool that allows the integration of phylogenetic information into ecological comparative community analyses, and its hip new cousin Fast UniFrac is all the rage these days. But, alas, fully utilizing the special features of Fast UniFrac (such as mapping pyrosequencing reads to a reference phylogeny) requires PyCogent, the installation of which has given me much grief recently. 

PyCogent is a great Python-based toolkit that can be used for conducting a number of analyses on biological sequence data (DNA, RNA, proteins); it is billed as "making sense from sequence" (Knight et al. 2007).  There is a good guide to PyCogent known as the PyCogent Cookbook.  Some programs/packages/pipelines that depend on PyCogent include QIIME and Fast UniFrac (for the latter, PyCogent is required only if you have a large 16S data set that requires a guide tree).

I have had trouble getting the different versions of Python, NumPy, and PyCogent to communicate with one another on UNIX (on both CentOS and Mac OS X). The mix of versions of the various dependencies may have been part of the problem, since I do not own the machines and I run several versions of Python myself locally. However, I ran through the simple 2-step protocol listed below on Windows XP and Windows 7, and it worked very well for running the Python script associated with the Fast UniFrac 'BLAST-to-GreenGenes' protocol. This is a little odd, since it is stated that installation of PyCogent by itself is not supported on Windows, and yet the procedure that I outline below seems to be a pretty simple way to get it installed.

Installing and running PyCogent requires using the command line. If you would like to do this on a Windows machine and you are unfamiliar with the Windows command line, you can google tutorials on "MS-DOS" and/or "command prompt".  There is a decent introductory guide here. The instructions below are written in a broad, inclusive way so that they should work with a UNIX-based system as well (including Macintosh; if you are a Mac user and are unfamiliar with the command line, you can google something like "Mac OSX Terminal" or find a good beginners' tutorial here).  

Whatever type of system it is, the PATH variable must be set correctly so that the programs can find one another.  As long as you do not have previous versions of Python, NumPy, or PyCogent installed, Windows should automatically set the environment variables so that this protocol will work without a hitch (Macintosh most likely will not set the variables automatically, because it usually comes with a pre-installed Python that it will always want to use).  Click here to see a post that further addresses one of the issues with the wrong version of Python/NumPy getting in the way. 
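
If you are unsure which Python and NumPy your command line is actually picking up, a short diagnostic script can help. The following is only a minimal sketch: the function name describe_module is made up for illustration, and it assumes that PyCogent is imported under the name 'cogent'. Save it as a file and run it with the same 'python' command you plan to use for the installation.

```python
import sys


def describe_module(name):
    """Return (version, file path) for an importable module, or None if the import fails."""
    try:
        module = __import__(name)
    except ImportError:
        return None
    # Not every module defines these attributes, so fall back to placeholders.
    version = getattr(module, "__version__", "unknown")
    location = getattr(module, "__file__", "built-in")
    return (version, location)


if __name__ == "__main__":
    # The interpreter that the 'python' command on your PATH resolved to.
    print("Python executable: %s" % sys.executable)
    print("Python version: %s" % sys.version.split()[0])
    for name in ("numpy", "cogent"):
        info = describe_module(name)
        if info is None:
            print("%s: not importable from this interpreter" % name)
        else:
            print("%s: version %s at %s" % (name, info[0], info[1]))
```

If the paths that are printed point to an older system-wide installation rather than the one you just installed, the PATH (or PYTHONPATH) is the likely culprit.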

Here is my simplistic protocol for getting PyCogent moving enough to run the Python script mentioned above (I should note that this protocol is not approved by the makers of PyCogent, since it may not produce a fully-functional package, but it does allow me to run the script):

1) Installing Python, NumPy, etc.:
Install the most recent version of the Enthought Python Distribution package (free for academics).

2) Installing PyCogent:
Download the most recent version of PyCogent ('.tgz' file).
Unzip the folder (using, e.g., WinRAR, WinZip, or 7-Zip; an automatic partial unzip might leave it as '.gz' but one of the previously mentioned programs will allow you to unzip it fully and you can drag the folder to your desktop if necessary).
In the command line, navigate to the PyCogent directory.
Type in the command line:
python setup.py install
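
For those who would rather script step 2 than use an unzipping program, here is a rough sketch of the same idea in Python. It is not the official installation procedure; the function names are invented for illustration, and it simply unpacks the '.tgz' file with the standard tarfile module and then runs 'setup.py install' with whichever Python is executing the script.

```python
import os
import subprocess
import sys
import tarfile


def extract_tgz(archive_path, dest_dir):
    """Unpack a .tgz archive into dest_dir and return the top-level paths it created."""
    with tarfile.open(archive_path, "r:gz") as archive:
        # Collect the first path component of every member (usually one folder).
        top_level = sorted({
            os.path.normpath(member.name).split(os.sep)[0]
            for member in archive.getmembers()
        } - {"."})
        archive.extractall(dest_dir)
    return [os.path.join(dest_dir, name) for name in top_level]


def install_from_source(source_dir):
    """Run 'setup.py install' inside source_dir using the current interpreter."""
    return subprocess.call([sys.executable, "setup.py", "install"], cwd=source_dir)


# Example usage (the archive name is a placeholder; use the file you downloaded):
# folders = extract_tgz("PyCogent.tgz", ".")
# install_from_source(folders[0])
```

Only unpack archives that you trust (i.e., downloaded from the official source), since extracting a tar file writes whatever paths it contains.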

There are some further notes on installation here and in the README, but please note that it was the fact that these instructions didn't quite get me to where I was going that inspired me to write this post. Still, they are likely to provide exactly what is needed for most situations.

Depending on the sort of jobs you need to run using PyCogent, a single computer may or may not have enough computing power.  I have an interest in PyCogent because I need it to run the aforementioned script that makes the Fast UniFrac '.env' input file (see the Fast UniFrac tutorial for more details on how this fits into the overall Fast UniFrac protocol).  A single computer processor has more than enough computing power to handle this job, but some of the more advanced QIIME functions will certainly require greater power for sufficiently large data sets.
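
To give a feel for what an environment file contains, here is a small sketch that writes one from a list of (sequence, sample) assignments. This is not the script distributed with Fast UniFrac. It assumes a tab-delimited layout of sequence name, sample name, and count, which should be checked against the Fast UniFrac tutorial, and the function name and identifiers are made up for illustration.

```python
from collections import Counter


def write_env_file(assignments, out_path):
    """Write one tab-delimited line per (sequence, sample) pair with its count.

    assignments: iterable of (sequence_id, sample_id) tuples, one per read.
    Returns the number of lines written.
    """
    # Count how many reads from each sample were assigned to each sequence.
    counts = Counter(assignments)
    with open(out_path, "w") as handle:
        for (sequence_id, sample_id), count in sorted(counts.items()):
            handle.write("%s\t%s\t%d\n" % (sequence_id, sample_id, count))
    return len(counts)


# Example usage with made-up identifiers:
# write_env_file([("ref_001", "lichen_A"), ("ref_001", "lichen_A"),
#                 ("ref_002", "lichen_B")], "example.env")
```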

Hopefully the notes here can make Fast UniFrac more accessible to more people (specifically, when the mapping of pyrosequencing reads to a reference tree is required), since the various errors that may occur with PyCogent, NumPy, Python, etc. can be difficult to sort out. If you wish to use PyCogent directly, you will probably have to be somewhat familiar with the Python programming language, although the cookbook has enough examples that one may be able to stumble through it naively (not that I would recommend it). If you're like me, and only use PyCogent so that you can map sequences to a reference tree for Fast UniFrac, then everything else you'll need to know can probably be found in the excellent Fast UniFrac tutorial. The Fast UniFrac 'BLAST-to-GreenGenes' procedure also requires a local installation of BLAST (installation instructions for PC, Mac, Linux, etc.). Making the initial input file for this specific type of Fast UniFrac analysis can require some creative thinking, and will be the subject of a future post.

- Brendan


  1. Not sure what led you to think that you need BLAST for UniFrac. You need a phylogenetic tree (typically made by Neighbor-Joining, but could be anything), and an "environment" file mapping sequence names to samples/environments.

    I've posted a bunch about it:

    Happy to answer questions, if I can.

  2. Thanks for the comment! The post wasn't really a tutorial on Fast UniFrac, especially since I provided links to the Fast UniFrac page, which has an excellent tutorial and also explains the theory quite well. You are absolutely correct that Fast UniFrac itself can be run without PyCogent. However, as written above:
    "fully utilizing the special features of Fast UniFrac requires PyCogent".

    The most powerful new feature of Fast UniFrac is not its ability to simply process larger trees (this has presumably been available in the command line version of UniFrac from the start); it's the fact that you can now use a reference phylogeny that is essentially built into the framework of the program. This is crucial for handling massive data sets of 454 (pyrosequencing) 16S amplicons or PhyloChip data. Please refer to my next post, which gives some of the grisly details of how to run these analyses on gigantic data sets for which phylogenetic reconstruction is impractical.

    Here's the link to the Fast UniFrac abstract where you'll see that they talk about "BLAST-based sequence assignment" for pyrosequencing and PhyloChip data:

  3. Brendan,

    I'm sorry to hear that you had some difficulties getting PyCogent to work properly on your system. It should be noted that we've never tested against Enthought and that could easily be the source of your issues. On your CentOS system, you'll run into permission issues if you attempt to install system-wide. However, if you install everything locally you shouldn't have a problem. Arguably the easiest solution is to simply download and run the QIIME virtual machine, as it works "out of the box" and comes set up with a whole host of bioinformatics tools including PyCogent. QIIME is additionally available on Amazon's EC2 if you need to scale up and leverage more compute power.

  4. Hi Daniel,

    Thanks so much for commenting here! I really appreciate your input. I do think that the QIIME virtual machine would be a good alternative to what I have posted here. Since I have not tried it out myself, I can't really say anything beyond the fact that it sounds like a reasonable option.

    I do think that it's paradoxical that I had the most success using Enthought on a Windows machine, since neither of these seems to be recommended. However, that was what inspired me to post this in the first place! I hope that people who may be running their analyses primarily through other programs, like Mothur, may still find posted here some relatively simple ways to take advantage of the unique capabilities of Fast UniFrac (some of which require PyCogent).

    I certainly think that anyone who is really serious about using PyCogent for a whole host of functions should just buckle down and work through any of the problems that might arise in getting it installed. I haven't run into problems since I started using Enthought (although I certainly didn't go through all of the PyCogent functions and test them). The reason that I went for Enthought is that it has Python, NumPy, etc., all bundled together and the various issues associated with different versions of different programs not being able to communicate with one another never arise.

    On CentOS, I did do local installations of the various components (none of them being Enthought), but I needed to retain different versions of Python for the different programs that I was running (some require old, some require new) and then different versions of NumPy are required for some of the different versions of Python, and then... I felt like I was falling further and further down the rabbit hole, although I'm sure with enough effort I would have come out the other end! I guess it wasn't really 'permissions' that were giving me trouble as much as the fact that all of the newer programs that I installed kept looking for their dependencies in the set of system-wide programs, and would preferentially go there, which invariably led to older versions that would not do. So I changed my PATH and changed my PYTHONPATH, and it still was going back to older system-wide installations for certain components. In summary, the protocol described above is just much simpler and it suits my needs... for now, that is!

    Amazon's EC2 looks pretty awesome. Once I relocate to a new institution I might start trying it out!

    All the best,