Friday, December 16, 2011

Using R for community analyses

For the central chapter of my doctoral dissertation (published in Environmental Microbiology), I wanted to visualize similarities and differences between lichen-associated bacterial communities, find which factors were most strongly correlated with community differences, and determine the significance of these correlations.  When communities are characterized using large sets of DNA sequence data, this can be achieved by using both genetic measures (e.g., UniFrac distance) and taxon-based measures (e.g., Bray-Curtis OTU-based dissimilarity).

This post gives an example of a script (or a series of commands) to be run with the R statistical package (R Development Core Team 2010) for performing ANOSIM and NMDS analyses of (1) an OTU-based relative abundance matrix derived from a set of clone libraries and (2) a UniFrac distance matrix derived from a 454 amplicon library (see example files at the bottom of this post).  These examples are derived from Hodkinson et al. (2012), although the scripts and example files shown here are abbreviated in order to save space.  Full example files can be downloaded from the Dryad data package associated with this publication [Hodkinson et al. 2011; http://dx.doi.org/10.5061/dryad.t99b1].  The matrices were obtained using the Fast UniFrac program (see instructions here) and Mothur (using the get.relabund command; for the full pipeline that I authored to process these data, see the Dryad data package).  Some minor modifications were required to get the matrices into the precise format needed for R (edits were performed in Microsoft Excel and files were saved in .csv format... note that all of the spaces/tabs in the examples below would actually be commas in the raw text versions of the files).

Running the script as it is presented below (with hash marks in front of the 'plot' commands) will allow one to obtain an output file with the numerical results of all analyses.  However, in order to visualize the NMDS plots, one can run each of the commands line-by-line in the R-GUI (of course, all 'anosim' commands can be ignored if the only goal is NMDS), removing the hash marks in front of the plot commands (note that R has many fancy options for visualization using 'plot'; these may be worth investigating further for your own purposes).  Here is a link to the version of R that I used for these analyses: http://cran.r-project.org/bin/windows/base/old/2.12.0/.  Once R is installed, you can call the script with a command like this one:
C:\\"Program Files"\R\R-2.12.0\bin\i386\Rterm.exe --verbose --no-restore --file=C:\\ANOSIM_NMDS_Clon_454_win.R > C:\\ANOSIM_NMDS_Clon_454_win.out

Here is the script, named 'ANOSIM_NMDS_Clon_454_win.R' (abbreviated example script; full file available at http://dx.doi.org/10.5061/dryad.t99b1):

# load the necessary libraries
library(ecodist)
library(vegan)
# set the output file
sink("ANOSIM_nmds.out", append=TRUE, split=TRUE)
# load the clone data set
samplesClon<-read.csv("C:\\SamplesClon.csv", header=TRUE)
samplesClon
# load the Bray-Curtis OTU-based matrix
OTUsClon<-read.csv("C:\\HodkinsonClon16S_OTUs_1per_171.csv", head = FALSE, row.names = 1)
# transform the OTU matrix into a Bray-Curtis matrix
OTUsClon.dist<-vegdist(OTUsClon, "bray")
# run Bray-Curtis OTU-based ANOSIM
anosim(dis = OTUsClon.dist, grouping = samplesClon$Photobiont, strata = samplesClon$Site)
anosim(dis = OTUsClon.dist, grouping = samplesClon$Mycobiont, strata = samplesClon$Site)
anosim(dis = OTUsClon.dist, grouping = samplesClon$Mycobiont, strata = samplesClon$Photobiont)
anosim(dis = OTUsClon.dist, grouping = samplesClon$Site, strata = samplesClon$Photobiont)
anosim(dis = OTUsClon.dist, grouping = samplesClon$Site, strata = samplesClon$Mycobiont)
# run Bray-Curtis OTU-based NMDS
nmds <- nmds(as.dist(OTUsClon.dist))
nmin <- nmds.min(nmds)
nmin
#plot(nmin, pch=as.numeric(samplesClon$Photobiont))
#plot(nmin, pch=as.numeric(samplesClon$Mycobiont))
#plot(nmin, pch=as.numeric(samplesClon$Site))
# load the 454 data set
samples454<-read.csv("C:\\Samples454.csv", header=TRUE)
samples454
# load the normalized weighted UniFrac matrix
unifracnw454<-read.csv("C:\\DistUFnw454.csv", head = FALSE, row.names = 1)
# run normalized weighted UniFrac ANOSIM
anosim(dat = as.dist(unifracnw454), grouping = samples454$Photobiont, strata = samples454$Site)
anosim(dat = as.dist(unifracnw454), grouping = samples454$Site, strata = samples454$Mycobiont)
anosim(dat = as.dist(unifracnw454), grouping = samples454$Site, strata = samples454$Photobiont)
anosim(dat = as.dist(unifracnw454), grouping = samples454$Mycobiont, strata = samples454$Photobiont)
anosim(dat = as.dist(unifracnw454), grouping = samples454$Mycobiont, strata = samples454$Site)
# run normalized weighted UniFrac NMDS
nmds <- nmds(as.dist(unifracnw454))
nmin <- nmds.min(nmds)
nmin
#plot(nmin, pch=as.numeric(samples454$Photobiont))
#plot(nmin, pch=as.numeric(samples454$Mycobiont))
#plot(nmin, pch=as.numeric(samples454$Site))
# close the output file
sink()
# unload the libraries
detach("package:ecodist")
detach("package:vegan")


Samples454.csv (referenced in the above script)

SampleID Photobiont Site Mycobiont
CLCl Ch C Cl
CLDi Cy C Di
CLLe Cy C Le
CLPe Cy C Pe
CLSt Cy C St
CLUs Ch C Us
ELCl Ch E Cl
ELFl Ch E Fl
ELOp Ch E Op
ELPe Cy E Pe
ELUm Ch E Um
HLCl Ch H Cl
HLLe Cy H Le
HLPe Cy H Pe
HLSt Cy H St
HLUs Ch H Us
NLCl Ch N Cl
NLFl Ch N Fl
NLOp Ch N Op
NLPe Cy N Pe
NLUm Ch N Um


DistUFnw454.csv (referenced in the above script; full file available at http://dx.doi.org/10.5061/dryad.t99b1)

CLCl
0.16 0.17 0.11 0.08 0.12 ...
CLDi 0.16
0.24 0.18 0.13 0.18 ...
CLLe 0.17 0.24
0.11 0.16 0.20 ...
CLPe 0.11 0.18 0.11
0.10 0.18 ...
CLSt 0.08 0.13 0.16 0.10
0.13 ...
CLUs 0.12 0.18 0.20 0.18 0.13
...
ELCl 0.20 0.22 0.30 0.25 0.19 0.17
ELFl 0.14 0.18 0.24 0.19 0.14 0.15 ...
ELOp 0.33 0.35 0.40 0.37 0.34 0.31 ...
ELPe 0.16 0.21 0.24 0.19 0.15 0.19 ...
ELUm 0.22 0.24 0.32 0.27 0.22 0.22 ...
HLCl 0.19 0.18 0.27 0.24 0.17 0.15 ...
HLLe 0.13 0.16 0.20 0.15 0.11 0.12 ...
HLPe 0.09 0.17 0.15 0.11 0.11 0.16 ...
HLSt 0.14 0.16 0.21 0.18 0.12 0.13 ...
HLUs 0.10 0.20 0.18 0.15 0.14 0.11 ...
NLCl 0.19 0.21 0.28 0.25 0.19 0.15 ...
NLFl 0.18 0.21 0.27 0.24 0.19 0.15 ...
NLOp 0.24 0.27 0.32 0.29 0.25 0.25 ...
NLPe 0.10 0.18 0.13 0.09 0.10 0.14 ...
NLUm 0.09 0.18 0.21 0.15 0.13 0.16 ...


SamplesClon.csv (referenced in the above script)

SampleID Photobiont Site Mycobiont
CL01 Cy C Peltigera
CL02 Ch C Alectoria
CL03 Ch C Cladonia
CL04 Tr C Placopsis
CL05 Cy C Erioderma
CL06 Cy C Sticta
CL07 Cy C Dictyonema
CL08 Cy C Leptogium
CL09 Tr C Stereocaulon
CL10 Ch C Usnea
EL01 Cy E Peltigera
EL01t Tr E Peltigera
EL02 Ch E Alectoria
EL03 Ch E Cladonia
EL04 Ch E Rhizocarpon
EL05 Ch E Ophioparma
EL06 Ch E Flavocetraria
EL07 Ch E Arctoparmelia
EL08 Ch E Umbilicaria
EL09 Ch E Masonhalea
EL10 Ch E Dactylina
HL01 Cy H Peltigera
HL02 Ch H Platismatia
HL03 Ch H Cladonia
HL04 Ch H Parmotrema
HL05 Ch H Flavoparmelia
HL06 Cy H Sticta
HL07 Ch H Xanthoparmelia
HL08 Cy H Leptogium
HL09 Tr H Stereocaulon
HL10 Ch H Usnea
NL01 Cy N Peltigera
NL01t Tr N Peltigera
NL03 Ch N Cladonia
NL04 Tr N Amygdalaria
NL05 Ch N Ophioparma
NL06 Ch N Flavocetraria
NL07 Ch N Parmelia
NL08 Ch N Umbilicaria
NL09 Tr N Stereocaulon
NL10 Ch N Cetraria


HodkinsonClon16S_OTUS_1per_171.csv (referenced in the above script; full file available at http://dx.doi.org/10.5061/dryad.t99b1)

CL01 0.00 0.00 0.00 0.00 0.00 0.00 ...
CL02 0.00 0.00 0.00 0.00 0.00 0.00 ...
CL03 0.00 0.25 0.00 0.00 0.00 0.00 ...
CL04 0.11 0.00 0.00 0.00 0.00 0.00 ...
CL05 0.00 0.03 0.34 0.05 0.00 0.00 ...
CL06 0.00 0.10 0.02 0.03 0.00 0.02 ...
CL07 0.00 0.56 0.00 0.00 0.00 0.19 ...
CL08 0.00 0.03 0.00 0.00 0.00 0.00 ...
CL09 0.02 0.13 0.06 0.15 0.00 0.00 ...
CL10 0.00 0.00 0.00 0.00 0.00 0.00 ...
EL01 0.00 0.00 0.00 0.00 0.00 0.17 ...
EL01t 0.00 0.00 0.00 0.00 0.00 0.31 ...
EL02 0.00 0.26 0.00 0.00 0.00 0.00 ...
EL03 0.00 0.03 0.00 0.00 0.00 0.00 ...
EL04 0.00 0.10 0.00 0.00 0.00 0.00 ...
EL05 0.00 0.04 0.00 0.00 0.00 0.00 ...
EL06 0.00 0.07 0.00 0.00 0.00 0.00 ...
EL07 0.00 0.00 0.00 0.00 0.00 0.00 ...
EL08 0.00 0.34 0.00 0.00 0.00 0.00 ...
EL09 0.00 0.21 0.00 0.00 0.00 0.00 ...
EL10 0.00 0.21 0.00 0.00 0.04 0.00 ...
HL01 0.00 0.00 0.00 0.00 0.00 0.00 ...
HL02 0.13 0.00 0.00 0.00 0.00 0.00 ...
HL03 0.00 0.00 0.00 0.00 0.00 0.00 ...
HL04 0.06 0.00 0.00 0.00 0.00 0.00 ...
HL05 0.00 0.03 0.00 0.00 0.00 0.00 ...
HL06 0.00 0.00 0.00 0.06 0.00 0.00 ...
HL07 0.00 0.14 0.00 0.00 0.00 0.00 ...
HL08 0.00 0.00 0.00 0.00 0.00 0.00 ...
HL09 0.00 0.00 0.00 0.00 0.00 0.00 ...
HL10 0.00 0.02 0.00 0.00 0.00 0.00 ...
NL01 0.00 0.15 0.00 0.00 0.00 0.00 ...
NL01t 0.00 0.00 0.00 0.00 0.00 0.15 ...
NL03 0.00 0.00 0.00 0.00 0.00 0.00 ...
NL04 0.00 0.15 0.00 0.00 0.00 0.00 ...
NL05 0.00 0.17 0.00 0.00 0.00 0.00 ...
NL06 0.00 0.02 0.00 0.00 0.00 0.00 ...
NL07 0.00 0.33 0.00 0.00 0.00 0.00 ...
NL08 0.00 0.80 0.00 0.00 0.00 0.00 ...
NL09 0.00 0.00 0.00 0.00 0.00 0.00 ...
NL10 0.00 0.05 0.00 0.00 0.00 0.00 ...


Here are some examples of non-metric multidimensional scaling plots produced from OTU-based Bray-Curtis dissimilarities between lichen-associated bacterial communities (Hodkinson et al. 2012). Photobiont is indicated by color (light green = green algae; dark blue = Cyanobacteria; black = Tripartite), while the site is indicated with symbols (• = Eagle Summit, AK; + = Nome, AK; x = Highlands, NC, and * = Cerro de la Muerte, CR). Numbers indicate the identity of different mycobiont types (see associated publication for details). Continuous lines act as visual aids to delimit communities associated with the two major photobiont-types, whereas dashed lines delimit communities associated with chlorolichens from northern versus southern sites.



















Plot A shows results obtained from clone library data of 16S sequences from the order Rhizobiales.



Plot B was produced from 454 barcoded 16S amplicon data (representing approximately half the number of samples as the clone library set but ~100 times as many sequences from a much wider range of bacterial diversity; since this plot was produced through OTU-based methods it does not precisely correlate with what would be produced through the abbreviated script above, which uses UniFrac distances to generate NMDS plots based on 454 data).

For additional information and documentation, including results of ANOSIM analyses, please see the references below!

- Brendan

---------------------------------------

References

Hodkinson, B. P. 2011. A phylogenetic, ecological, and functional characterization of non-photoautotrophic bacteria in the lichen microbiome. Doctoral Dissertation, Duke University, Durham, NC.
Download Dissertation (PDF file)

Hodkinson, B. P., N. R. Gottel, C. W. Schadt, and F. Lutzoni. 2012. Photoautotrophic symbiont and geography are major factors affecting highly structured and diverse bacterial communities in the lichen microbiome. Environmental Microbiology 14(1): 147-161. [doi:10.1111/j.1462-2920.2011.02560.x]
Download publication (PDF file)
Download supplementary phylogeny (PDF file)
View data and analysis file webportal (website)
Download data and analysis file archive (ZIP file)

Hodkinson, B. P., N. R. Gottel, C. W. Schadt, and F. Lutzoni. 2011. Data from: Photoautotrophic symbiont and geography are major factors affecting highly structured and diverse bacterial communities in the lichen microbiome. Dryad Digital Repository. doi:10.5061/dryad.t99b1

R Development Core Team. 2010. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. http://www.r-project.org/

Thursday, December 8, 2011

Editor's Pick

I recently received an email notifying me that the journal The Lichenologist had just released a compilation of papers entitled 'Editor's Pick of Papers: from Genome to Ecosystems.' This group consists of the editor's top 15 articles published in the journal over the past two years (volumes 42 and 43), during which well over 100 works have been published in The Lichenologist. I was excited to see that one of my papers was selected! To read my previous blog post outlining the content of the paper, please click here. Many thanks to Peter Crittenden for his work as editor-in-chief and to the British Lichen Society for publishing the journal!

- Brendan

Tuesday, December 6, 2011

Two new orders of fungi

This week I had a manuscript published in which two entirely new orders of fungi were established! They are Sarrameanales and Trapeliales, and they both belong to the subclass Ostropomycetidae in the class Lecanoromycetes (the main class of lichen-forming fungi). To read more about the taxonomy of these two orders, see the publication in the December 2011 issue of Phytologia. This research can partly be seen as a taxonomic extension of the NSF-funded Assembling the Fungal Tree of Life project, since the distinctness of these orders was only recognized as a result of multi-gene phylogenetic analyses conducted as part of this effort.

We are making progress taxonomically, but even the higher-level work is not done yet; there are still many groups of 'incertae sedis' fungi out there and probably many more legitimate orders to be described... which means that there's plenty more exciting work to be done in the future!

- Brendan

------------------------------------

Reference

Hodkinson, B. P., and J. C. Lendemer. 2011. The orders of Ostropomycetidae (Lecanoromycetes, Ascomycota): Recognition of Sarrameanales and Trapeliales with a request to retain Pertusariales over Agyriales. Phytologia 93(3): 407-412.
Download publication (PDF file)

Sunday, November 20, 2011

My New Supercomputing Center

Recently I applied for and received an XSEDE (Extreme Science and Engineering Discovery Environment) Startup Allocation Award (DEB-110024), allowing me to use the San Diego Supercomputing Center (SDSC) Trestles cluster. The cluster itself contains over 10,000 processor cores; you can read more about it here. While there are many different types of research being conducted using this cluster, my work focuses on ecological and evolutionary analyses of molecular sequence data. Previously I conducted my computationally-intensive analyses using the Duke Shared Cluster Resource (DSCR). However, since leaving Duke University a few months ago, I have been without a cluster. I am happy to say that I now have access to a cluster once more! Let the supercomputing resume!

- Brendan

Thursday, November 10, 2011

Outreach and training through Youtube videos

Over the past few months I have acted as the New York Botanical Garden's Workflow Coordinator for the NSF-funded "Digitization TCN Collaborative Research: North American Lichens and Bryophytes: Sensitive Indicators of Environmental Quality and Change" (EF-1115086).  This project is a collaboration between multiple institutions across North America and is aimed at cataloging label data from the vast majority of North American lichen and bryophyte specimens.  Recently, as part of this project, we at NYBG released two videos on Youtube.  The first acts as an introduction to the project for the general public and gives some of the rationale behind it:


The second is a training video that can be used by members of the partner institutions or others who are thinking about taking on similar projects:


As you will probably notice from the videos themselves, Charlie Zimmerman, the imager that I have supervised for this project here at the garden, was the one who did most of the writing, filming and editing... and we greatly appreciate all of his work!

Please enjoy the videos!

- Brendan

Sunday, November 6, 2011

Building linkage-probability-based RNA secondary structure models for phylogenetic inference

RNA secondary structure models are increasingly being integrated into likelihood-based phylogenetic inferences, but the dynamic structure of functional RNA molecules makes any single structural inference necessarily inaccurate. In this post I present an objective method for determining which elements of secondary structure are most stable based on the statistical significance of linkage probabilities between sites on a given RNA molecule. I briefly outline how this information can be integrated into a phylogenetic analysis by creating an input file that contains these statistically significant structural elements.

For some additional background on RNA secondary structure, see this previous post:
http://squamules.blogspot.com/2011/08/its-rna-secondary-structure.html

Functional RNA molecules include pairs of nucleotide sites that are linked to one another physically, resulting in specific secondary structures that define the shape of each molecule.  This linkage causes certain sites to evolve in tandem with their counterparts.  As such, the secondary structure of RNA molecules has been recognized for some time as a significant consideration in the inference of phylogenies from functional RNA-encoding genes (Kimura 1985, Tillier and Collins 1995).  Typically, RNA secondary structure is used to optimize multiple-sequence-alignment accuracy for functional RNA-encoding genes (Gutell et al. 1992, Kjer 1995, Lendemer and Hodkinson 2009).  However, in recent years, the use of secondary structure in modeling evolution for likelihood-based phylogenetic inferences has begun to gain popularity (Hodkinson and Lendemer in review, Savill et al. 2001, Telford et al. 2005).  This approach requires defining the pairs of linked sites on the encoded RNA molecule and treating these pairs as states that are separate from the standard independent nucleotide states (A, C, T, G) in a phylogenetic inference.  This method allows one to properly account for the interdependency of interacting nucleotides, since paired RNA nucleotides are no longer required to be treated as independent sites, leading to a more accurate approach for modeling sequence evolution.

Current protocols for integrating RNA secondary structure data into phylogenetic analyses require a single hypothetical structure to be used as an input.  Structures are typically inferred using algorithms that minimize free energy or use other thermodynamic considerations to produce the best single structural inference (Mathews & Turner 2006).  However, RNA secondary structure is dynamic, frequently changing in the cell as RNA catalyzes reactions and performs various cellular functions.  When RNA molecules encounter certain enzymes and cellular components, the thermodynamic rules that previously favored one structure might strongly favor another.  Additionally, different methods of structural inference are not always comparable, and small differences in algorithms can favor significantly different structures; the use of differing structural models in a phylogenetic context can have consequences in terms of both topology inference and the calculation of support (Ullrich et al. 2010).

These problems can largely be solved by removing statistically non-significant linkages from phylogenetic analyses, leaving only the most probable structural elements to be incorporated into downstream inferences.  The determination of which RNA secondary structural elements are supported with statistical significance is often overlooked and is certainly not a standard part of the current work flow for scientists integrating secondary structural data into phylogenetic analyses.

Since RNA secondary structure can serve as such a useful tool for revealing the evolutionary history of certain groups, it is essential that objective criteria be established for incorporating structural elements into phylogenetic inferences.  The simple method outlined here allows one (a) to evaluate the probability that each site on an RNA-encoding gene is linked to each other site and (b) to produce an 'elemental' secondary structure model for phylogenetic inference containing only the statistically-supported elements of the structure.

The UNAFold package provides a particularly useful set of tools for exploring various aspects of RNA secondary structure (Markham and Zuker 2008).  UNAFold's 'hybrid-ss.exe' yields a set of '.plot' files that give the probability of each base binding to each other base for all reasonable pairings.  After installing UNAFold and running 'hybrid-ss.exe' on a FASTA-formatted sequence, one can choose the '.plot' file with the number that most closely approximates the typical cellular temperature (in degrees Celsius) of the organism from which the sequence is derived.  This '.plot' file can be modified in Excel by sorting according to 'P(i,j)' values (the probability of pairing) and isolating only the rows for which 'P(i,j)' is above 0.95.  This stringent 95% pairing probability cut-off seems most easily justifiable; however, other cut-off values could potentially be used in the context of this method.

For integrating this type of data into a phylogenetic analysis (e.g., using RAxML 7.2.8; http://wwwkramer.in.tum.de/exelixis/software.html; Stamatakis 2006), the standard 'Vienna' dot-bracket notation is used (Hofacker et al. 1994).  Any standard secondary structure inference program can be used to create an initial structure that may serve as a template; parentheses can be converted to periods using a standard text-editor or secondary structure editing program (e.g., 4SALE; http://4sale.bioapps.biozentrum.uni-wuerzburg.de/; Seibel et al. 2006) for sites whose linkage is statistically non-significant.  These procedures will produce a secondary structure model that includes only the statistically supported elements of structure.  When this 'elemental' secondary structure model is incorporated into phylogenetic analyses, it could serve to decrease the degree of uncertainty inserted into the standard secondary structure-based inferences.

Future advances may allow the integration of various intermediate linkage probabilities to be considered in the calculation of tree likelihoods.  However, it seems that certain theoretical hurdles remain to be overcome before this type of analysis can be possible.  Meanwhile, a methodology like the one outlined here could be beneficial if one wishes to reduce the amount of chance introduced into phylogenetic analyses while still accounting for the fact that certain sites are inextricably linked.

- Brendan

---------------------------------------------------

References

Gutell, R. R., A. Power, G. Z. Hertz, E. J. Putz, and G. D. Stormo. 1992. Identifying constraints on the higher-order structure of RNA: continued development and application of comparative sequence analysis methods. Nucleic Acids Research 20(21): 5785–5795.

Hodkinson, B. P., and J. C. Lendemer. In review. Systematics of a enigmatic sterile crustose lichen. 

Hofacker, I. L., W. Fontana, P. F. Stadler, S. Bonhoeffer, M. Tacker, and P. Schuster. 1994. Fast folding and comparison of RNA secondary structures. Monatshefte für Chemie / Chemical Monthly 125: 167-188.

Kimura, M. 1985. The role of compensatory neutral mutations in molecular evolution. Journal of Genetics 64(1):7-19.

Kjer, K. M. 1995. Use of rRNA secondary structure in phylogenetic studies to identify homologous positions: an example of alignment and data presentation from frogs. Molecular Phylogenetics and Evolution 4: 314-330.

Lendemer, J. C., and B. P. Hodkinson. 2009. The wisdom of fools: new molecular and morphological insights into the North American apodetiate species of Cladonia. Opuscula Philolichenum 7: 79-100.

Markham, N., and M. Zuker. 2008. UNAFold: software for nucleic acid folding and hybridization. Methods in Molecular Biology 453: 3-31.

Mathews, D. H., and D. H. Turner. 2006. Prediction of RNA secondary structure by free energy minimization. Journal of Molecular Biology 16(3): 270-278.

Savill N. J., D. C. Hoyle, and P. G. Higgs. 2001. RNA sequence evolution with secondary structure constraints: comparison of substitution rate models using maximum likelihood methods. Genetics 157: 399-411.

Seibel P. N., T. Müller, T. Dandekar, J. Schultz, and M. Wolf. 2006. 4SALE - A tool for synchronous RNA sequence and secondary structure alignment and editing. BMC Bioinformatics 7: 498.

Stamatakis, A. 2006. RAxML-VI-HPC: maximum likelihood- based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 22: 2688-2690.

Telford, M., M. Wise, and V. Gowri-Shankar. 2005. Consideration of RNA secondary structure significantly improves likelihood-based estimates of phylogeny: examples from the Bilateria. Molecular Biology and Evolution 22: 1129-1136.

Tillier, E. R. M., and R. A. Collins. 1995. Neighbor-joining and maximum likelihood with RNA sequences: addressing interdependence of sites. Molecular Biology and Evolution 12: 7-15.

Ullrich, B., K. Reinhold, O. Niehuis, and B. Misof. 2010. Secondary structure and phylogenetic analysis of the internal transcribed spacers 1 and 2 of bush crickets (Orthoptera: Tettigoniidae: Barbitistini). Journal of Zoological Systematics and Evolutionary Research 48(3): 219-228.

---------------------------------------------------

This article can be cited as:
Hodkinson, B. P. 2011. Building linkage-probability-based RNA secondary structure models for phylogenetic inference. Squamules Unlimited, New York. [Available at: http://squamules.blogspot.com/2011/11/building-linkage-probability-based-rna.html]

Monday, October 31, 2011

Visiting Harvard

Last week I visited Harvard University and gave a talk on my research into the consortia of bacteria associated with lichens. I was invited by Anne Pringle and was really excited to see some of the great research going on in her lab. One highlight of the visit was a tour of the Farlow Herbarium with Don Pfister. It was interesting to discuss how the herbarium would implement procedures for the NSF-funded TCN North American Lichens and Bryophytes project, for which I am the coodinator at the New York Botanical Garden (NSF Awards - NYBG: EF-1115086; Harvard: EF-1114957). Soon we will be putting online YouTube videos detailing procedures and workflows for this project, which will hopefully help other institutions such as the Farlow Herbarium to get the project up and running quickly once they are ready to begin. I hope that this visit will represent the first step in establishing lasting collaborations with some of the biologists at Harvard!

- Brendan

Thursday, October 13, 2011

Lichen Conservation

I saw this story in the news and thought that it was interesting:
http://www.bbc.co.uk/news/uk-wales-south-west-wales-15180179
It talks about cutting down trees at a location in Wales in order to engineer a park so that it will remain an optimal location for lichen growth. I am very interested in lichen conservation, but I typically take the stance that we should just "leave it alone" (i.e., let the preserved places evolve naturally) rather than actively attempting to create, maintain, or otherwise engineer a suitable habitat for lichens. But perhaps in parts of the world where there are only a few small, fragmented pieces of land on which lichens can thrive, a different approach must be taken (so that the preserved places remain more or less static instead of evolving naturally into old-growth habitats).  It's not something that I would jump into without reservations, but perhaps this instance will be a good test case.  I still instinctively remain wary of anything that could set off a chain of interventions that may lead us where we never intended to go!  What are your thoughts?

- Brendan

Tuesday, September 27, 2011

Unravelling Lecidea

Recently, I co-authored a paper (Schmull et al. 2011) in which we presented the results of analyses aimed at determining the phylogenetic placement of numerous lineages of lichen-forming fungi that were previously placed in the genus Lecidea based solely on morphology. For a long time, it has been known that the assemblage of species placed in Lecidea by Zahlbruckner did not form a single evolutionary lineage. However, placing all of the species in known families has been problematic. For our paper, we conducted two separate 6-gene analyses of lichen-forming fungi in the class Lecanoromycetes in order to infer the placement of twenty-five Lecidea taxa. Most species fell within three families: Lecanoraceae, Pilocarpaceae, and Lecideaceae (the familiy of the 'real' Lecidea). Those within the first two families will unquestionably need to be given new generic names in the near future. The main story that I hope will come out of this paper is that there is much more work to be done! We have used molecular data to demonstrate the scope of the problem with the genus Lecidea, but the definitive placement of all described species will require a great deal of additional study. I'm looking forward to continuing work on this group in the future!

- Brendan

-----------------------------------------------

Reference

Schmull, M., J. Miadlikowska, M. Pelzer, E. Stocker-Wörgötter, V. Hofstetter, E. Fraker, B. P. Hodkinson, V. Reeb, M. Kukwa, H. T. Lumbsch, F. Kauff, and F. Lutzoni. 2011. Phylogenetic affiliations of members of the heterogeneous lichen-forming genus Lecidea sensu Zahlbruckner (Lecanoromycetes, Ascomycota). Mycologia 103(5): 983-1003.
Download publication (PDF file)
Download nucleotide alignment (NEXUS file)
Download supplementary data table 1 (PDF file)
Download supplementary data table 2 (PDF file)

Wednesday, September 14, 2011

Musical Lichenology

I recently received an email from Sean Beeching, famous to readers of this blog for his poetry (click here and here for samples). This is what he wrote:

"Here is a video that Nancy Lowe of discoverlife shot during our lichen workshop [http://www.youtube.com/watch?v=rvxnv-6Z6rg]. I am sawing up branches to show the students the lichens that were growing on them in time to Tommy [Jordan]'s banjo playing. He and I play together in the evenings during the workshop. You might also have a look at the lichen key I made for our students at discoverlife, go to the nature guides at the discoverlife.org website and then page down to 'Lichens, Georgia.' The site just passed a billion hits."


For anyone in the southeastern US interested in learning a thing or two about lichens, I would highly recommend any workshop with Sean!

- Brendan

Wednesday, September 7, 2011

Diversity of Lichenology

Not too long ago I reviewed the 100th anniversary issue of the journal/book series Bibliotheca Lichenologica for The Bryologist. Here is what I wrote:

The 100th volume of Bibliotheca Lichenologica (‘‘Diversity of Lichenology — Anniversary Volume’’) provides important contributions to the field and gives us further insights into the biology of lichens while connecting us to our historical roots. While the volumes of this series have taken many different forms, this edition appears as a standard journal volume, with 18 scientific and historical articles from a total of 37 authors representing a diverse array of lichenologists. It should be noted that the emphasis is on lichens of Eurasia and/or the Southern Hemisphere; however, many of the articles will appeal to a general, worldwide audience.

In terms of taxonomy, this volume will be of particular interest to those following the changes in the family Teloschistaceae. Kondratyuk et al. describe 35 new species in the family. Many of these are known only from the type locality, or a small handful of specimens, making further evaluation of some of the taxa difficult, although the authors are to be commended for providing excellent color photographs of the thalli. The work by Fedorenko et al. focusing on the phylogeny of ‘xanthorioid’ lichens represents a significant contribution in terms of both the data generated and the provisional generic concepts articulated. However, both of the aforementioned works seem to highlight the fact that the largest task still remains: an integrated systematic revision of the family Teloschistaceae that includes crustose, foliose, and fruticose forms. In different ways, these works both improve our understanding of this family for which the taxonomy continues to evolve.

This volume also includes important insights into the taxonomy of the oft-neglected lichenicolous fungi. Hafellner presents an excellent ‘traditional’ treatment of the lichenicolous genera Phacothecium and Phacographa. The work is notably thorough, articulating precisely what is known about the genera, while highlighting areas where additional data and analyses are needed. The author also provides a useful key to opegraphoid lichenicolous fungi with widely exposed hymenia, along with a summary table of the phenotypic characters separating the five opegraphoid arthonialean genera with lichenicolous members discussed in the text (i.e., Opegrapha, Lecanographa, Phacothecium, Phacographa, and Plectocarpon).

Among the works that will appeal to a broader audience is Kärnefelt’s contribution entitled ‘‘Fifty influential lichenologists.’’ This portion of the volume provides a veritable ‘‘Who’s Who’’ of lichenology. The careers of some of the world’s major players in our field, both modern and historical, are briefly summarized, starting with ‘‘the father of lichenology,’’ Erik Acharius. Another work with broad appeal is by Lücking et al., entitled ‘‘How many tropical lichens are there… really?’’ This piece discusses the various factors involved in calculating a ‘ballpark’ estimate of the overall diversity of lichens in the neotropics, the tropics in general, and the globe as a whole (the estimate for the latter comes in at 28,000 species!). Although any estimate is subjective in nature, various points that have not been explicitly integrated into previous estimates are considered (e.g., taxonomic ‘orphans,’ species pairs, photosymbiodemes, chemotypes, and cryptic species). Another portion of the volume that will appeal to amateurs and professionals alike is the section by Randlane et al., which provides what is undoubtedly the best synthetic work on the species of Usnea for the continent of Europe. Range maps and photographs of small-scale features make this section both informative and interesting.

Reading this volume also makes it apparent that the changing landscape of lichenological research has led to certain problems that require special attention. Many of the problematic issues plaguing our field seem to be associated with the process of adjusting to the molecular age. A number of the studies published herein leave the reader wanting to know more, especially in terms of molecular data and how they were analyzed (or how they could be analyzed differently). Beyond the simple deposition of sequences in a public repository, alignments must be reviewed and made available if alignment-based phylogenetic analyses are to be reproducible. Authors will find that making their assembled molecular datasets freely available to readers (on their own personal websites if necessary) increases the impact and relevance of their work by permitting others to easily build on their studies. Ultimately, this practice will allow our field to advance more quickly and will raise the bar in terms of research quality. In summary, this volume of Bibliotheca Lichenologica, as its title suggest, provides an excellent picture of the diversity of lichenology and represents quite well the overall state of the field as we enter this next decade. Many of the works contained in this anniversary edition provide important contributions, and any lichenologist’s collection would be enriched by the addition of this volume.

- Brendan

-----------------------------

Citation:
Hodkinson, B. P. 2010. Lichenological Diversity. The Bryologist 113(4): 828-829.

Wednesday, August 31, 2011

ITS RNA secondary structure

I have recently been conducting phylogenetic and taxonomic studies of selected groups of lichen-forming fungi using sequences from the quickly evolving nuclear ribosomal ITS (internal transcribed spacer) region to examine relationships within and between species (e.g., Hodkinson & Lendemer 2011, Hodkinson et al. 2010, Lendemer & Hodkinson 2009, 2010, in prep). In order to properly analyze the evolutionary relationships between the organisms from which these molecules were derived, I built secondary structure models for the RNA molecules encoded by ITS1 and ITS2 (the two rapidly evolving sections of the ITS region) for some of the groups. 

The ITS1 and ITS2 spacer regions encode stretches of RNA that fold up in specific conformations and help to assemble the ribosomes (the pieces of cellular machinery that build protein molecules based on specific messenger RNA sequences transcribed from DNA). The particular folding pattern is referred to as the molecule's "secondary structure."  Here is an example of a secondary structure model that I put together for ITS2 of Parmotrema perforatum:


Notice the A(adenine)-U(uracil) pairings and the G(guanine)-C(cytosine) pairings, just like the complementary strands of DNA (except that with DNA you have T for thymine instead of U for uracil).

There are two main reasons that one might want to have a secondary structure model when inferring phylogeny:

[1] Nucleotide Alignment - An understanding of the overall structure of the molecule can aid in discerning which sets of sites in different organisms actually represent the same character when they have different states and there are adjacent nucleotides that have been inserted or deleted in some taxa (Kjer 1995). Many studies use principles of secondary structure to aid in alignment.

[2] Phylogenetic Inference - Since paired sites in some sense evolve in tandem (if one nucleotide changes, the linked nucleotide will often change to compensate over evolutionary time), it is most appropriate within a likelihood framework to apply a different model of evolution to the paired nucleotides so that this can be taken into consideration. This type of inference can be done with RAxML (Stamatakis 2006) and I have recently integrated this into my workflow (Hodkinson & Lendemer in prep).

The really interesting thing to think about is the fact that this type of macromolecule needs to be able to move in order to function, which means that the structure is not actually static, but dynamic. While we usually use the 'best' structure for phylogenetic inference, there are actually many structures that are nearly equally good, and the molecule actually changes its conformation through space and time, flipping between these different conformations in order to perform its functions in the cell. To drive the point home, here is a quick video I made of the ITS2 molecule of Cladonia stipitata Lendemer & Hodkinson (2009) shifting between different likely conformations:



------------------------------

Sources cited:

Hodkinson, B. P., and J. C. Lendemer. 2011. Molecular analyses reveal semi-cryptic species in Xanthoparmelia tasmanica. Bibliotheca Lichenologica 106: 115-126.
Download publication (PDF file)
Download nucleotide alignment (NEXUS file)

Hodkinson, B. P., and J. C. Lendemer. In prep. Systematics of a enigmatic sterile crustose lichen. 

Hodkinson, B. P., J. C. Lendemer, and T. L. Esslinger. 2010. Parmelia barrenoae, a macrolichen new to North America and Africa. North American Fungi 5(3): 1-5.
Download publication (PDF file)

Kjer, K. M. 1995. Use of rRNA secondary structure in phylogenetic studies to identify homologous positions: an example of alignment and data presentation from the frogs. Molecular Phylogenetics and Evolution 4: 314–330.

Lendemer, J. C., and B. P. Hodkinson. 2009. The Wisdom of Fools: new molecular and morphological insights into the North American apodetiate species of Cladonia. Opuscula Philolichenum 7: 79-100.
Download publication (PDF file)
Download nucleotide alignment (NEXUS file)

Lendemer, J. C., and B. P. Hodkinson. 2010. A new perspective on Punctelia subrudecta in North America: previously-rejected morphological characters corroborate molecular phylogenetic evidence and provide insight into an old problem. The Lichenologist 42(4): 405-421.
Download publication (PDF file)
Download nucleotide alignment (NEXUS file)

Stamatakis, A. 2006. RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 22: 2688–2690.

Wednesday, August 17, 2011

Taxonomy: Art or Science?

When Googling "science definition," the first thing that came up was "The intellectual and practical activity encompassing the systematic study of the structure and behavior of the physical and natural world through observation and experiment." After a little more research, I was surprised to see that this seems to be one of the stricter definitions of science (others may be as broad as "the state of knowing" or some such...), but it is one with which I can get on board. I tend to think of science itself in a very strict sense, as the process of developing and testing hypotheses. However, my big caveat is that there are many activities that are involved in (and are absolutely essential to) the practice of science that are not science per se according to that definition. This does not diminish their value to science. Some of this has to do with the acquisition of background knowledge that informs the hypotheses to be tested, while some of it is associated with making the results of inquiry available and comprehensible to the scientific community and the public.

So then is taxonomy art or science? With taxonomy, there is not a "right" answer, although there are plenty of wrong answers if one wishes to have a system that is informed by the results of scientific inquiry. Taxonomic units are all in some sense arbitrary. Although a group of organisms may form a "clade," whether we recognize that clade with a certain name is somewhat arbitrary. I personally like to think of taxonomic units being defined by specific innovations (morphological, molecular, ecological, etc.) that have changed the evolutionary trajectory of a group, but that rule is certainly not universally applied, and there could certainly be many alternative taxonomies even if such standards were applied.

For me, the argument for taxonomy as an art does not actually diminish taxonomy in any way as part of what we must do in order to be effective and responsible scientists. In fact, having this perspective on taxonomy can help to enhance the understanding of the significance of taxonomy for science. As scientists, we must use what we discover through the scientific process to help facilitate communication about natural phenomena. Taxonomy is a tool that we use to communicate ideas about organisms, so taxonomy is an absolutely necessary part of the pursuit of scientific truth, even if it is not "science" itself.

One test for me of whether taxonomy is itself a science in the very strictest sense of the word is whether it is directly involved in the process of hypothesis testing. One can use principles of phylogenetics, ecology, or molecular biology to test hypotheses, but taxonomic principles would not be used. When we begin to dissect some of the scientific questions that are often deemed "taxonomic questions," it can be argued that they are not actually taxonomic in nature, and that the taxonomic repercussions would really only be a byproduct of obtaining results through scientific inquiry. For instance, a question like "Is this a good genus?" is really asking something like "Do the species form a distinct clade?", which is a question that is evolutionary in nature. Likewise, the question "Do these individuals make up one species?" is perhaps just a way of saying "How can we properly apply a biological, morphological, chemical, ecological, and/or phylogenetic species concept to this group of individuals?", a question that draws on different fields of biology.

I can see that many systematists would hesitate to state that taxonomy is an art, because of what it implies. If it is an art, then it opens the door for people to say that people who do taxonomy are not really scientists at all. But a consummate scientist is not just someone who constantly tests hypotheses one after another without consideration for anything else. To be a scientist, one must also lay the groundwork for scientific pursuits, and defining the terms used to communicate ideas about specific units of the tree of life (whether or not it is itself an artistic pursuit) is crucial to the advancement of science.

- Brendan

Thursday, August 4, 2011

Using Sequencher for Multiple Sequence Alignments

Much of the molecular research that I have done over the years has involved working with DNA sequences generated through Sanger sequencing. These sequences are never perfect, and always require manual correction. It is especially helpful to correct sequences and align them to other similar sequences simultaneously. In this way, alignment and structural data can be taken into consideration when interpreting the chromatograms for the DNA sequences.

So I wrote a couple of simple Perl scripts that would allow me to make my alignments in Sequencher (the standard program for editing raw sequence reads) and easily move it over to Mesquite or MacClade (standard programs for assembling data matrices for downstream phylogenetic analyses) so that it could be joined with a reference alignment that I had made previously. In this way, I could avoid completely realigning all sequences to one another through an automatic alignment program, thereby preserving certain sequence alignment patterns (note that I often deal with over 1000 sequences at a time). If you use Linux or Macintosh, running a Perl script is generally a pretty simple matter (since Perl interpreters are typically built into the operating system).  If you use Windows, you will probably need to download an interpreter like Strawberry Perl or ActivePerl.

The type of data that I was dealing with was a set of bidirectional Sanger sequences (one forward, one reverse primer for each sequence) of fragments ~650 bp in length. These sequences were cloned and therefore had vector overhang on both ends of both strands, which had to be deleted. If you have data that are similar, here is a procedure that can be used to preserve the Sequencher alignment pattern and bring it into MacClade/Mesquite (potentially for merging with a curated reference alignment, if you have one of these):

[a] In the Sequencher alignment, make sure at least one sequencing strand of each pair of strands (from the bidirectionally-sequenced pool of DNA fragments) has all of the corrected bases, and delete the second strand for each pair. This gives an alignment with one strand for each sequence. [This Sequencher alignment can be tweaked visually to align with a reference set that is already pre-aligned by introducing gaps into the Sequencher alignment to accommodate the gaps in the reference alignment.]
[b] The Sequencher alignment can then be exported as a contig in aligned fasta format and subsequently opened in MacClade/Mesquite. [Note: If you have exported the sequences from Sequencher as a concatenated set of sequence fragments, it might use ':' instead of '-' to represent the gaps; make sure all of the gaps are changed to '-' for integration into MacClade or Mesquite (this can be done as a simple search and replace with any text editor).]

For my particular sequences, I had to deal with the issue of all of the sequence names being proceeded by my initials and having strand-specific information tacked on to the end (both standard pieces of information added by the sequencing facility). Here is another blog post with the Perl script that I wrote for editing the fasta file to extract the 10-digit alpha-numeric code used to identify my sequences. Also, I had to line my sequence block up with the portion of my reference alignment with which it correlated. In my particular situation, the block of sequences that I had aligned began 488 bases into the reference alignment.  Here is the script that I used to add 488 bases to the front of each sequence in the fasta file (this script relies on having a 10-digit code name for each sequence):

#!/usr/bin/perl

print "\nPlease type the name of your input file: ";
my $filename = <STDIN>;
chomp $filename;
open (FASTA, $filename);
        {
        if ($filename =~ /(.*)\.[^.]*/)
                {
                open OUT, ">$1.ed.fasta";
                }
        }

while (<FASTA>)
        {
       if ($_ =~ /^>(..........)/)
                {
                print OUT "\r>$1\r\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\n";
                }
       else
                {
                print OUT $_;
                }
       }


The final step was to simply open up my reference alignment in MacClade and import the newly-generated fasta file of aligned cloned sequences... and they lined up perfectly!  I then tweaked exclusion sets, saved the full alignment, and was ready for downstream phylogenetic analyses.

Even though MacClade and Mesquite are very good programs overall for alignment, aligning a set of 1000+ sequences is extremely cumbersome, and Sequencher can be much faster and easier as long as the sequences are relatively conserved.  With this set of Perl scripts discussed above, hopefully researchers will no longer perceive impediments or inefficiency in a process that includes aligning and correcting relatively conserved sequences in Sequencher (with all of the raw sequence data) before moving them over to MacClade/Mesquite for final data set assembly and formatting.

- Brendan

----------------------------------------------

References

The above protocols are published in the following sources:

Hodkinson, B. P. 2011. A Phylogenetic, Ecological, and Functional Characterization of Non-Photoautotrophic Bacteria in the Lichen Microbiome. Doctoral Dissertation, Duke University, Durham, NC.
Download Dissertation (PDF file)

Hodkinson, B. P., N. R. Gottel, C. W. Schadt, and F. Lutzoni. 2011. Data from: Photoautotrophic symbiont and geography are major factors affecting highly structured and diverse bacterial communities in the lichen microbiome. Dryad Digital Repository doi:10.5061/dryad.t99b1.

Hodkinson, B. P., N. R. Gottel, C. W. Schadt, and F. Lutzoni. In press. Photoautotrophic symbiont and geography are major factors affecting highly structured and diverse bacterial communities in the lichen microbiome. Environmental Microbiology.

----------------------------------------------


This work was funded in part by NSF DEB-1011504.