Friday, December 16, 2011

Using R for community analyses

For the central chapter of my doctoral dissertation (published in Environmental Microbiology), I wanted to visualize similarities and differences between lichen-associated bacterial communities, find which factors were most strongly correlated with community differences, and determine the significance of these correlations.  When communities are characterized using large sets of DNA sequence data, this can be achieved by using both genetic measures (e.g., UniFrac distance) and taxon-based measures (e.g., Bray-Curtis OTU-based dissimilarity).

This post gives an example of a script (or a series of commands) to be run with the R statistical package (R Development Core Team 2010) for performing ANOSIM and NMDS analyses of (1) an OTU-based relative abundance matrix derived from a set of clone libraries and (2) a UniFrac distance matrix derived from a 454 amplicon library (see example files at the bottom of this post).  These examples are derived from Hodkinson et al. (2012), although the scripts and example files shown here are abbreviated in order to save space.  Full example files can be downloaded from the Dryad data package associated with this publication [Hodkinson et al. 2011; http://dx.doi.org/10.5061/dryad.t99b1].  The matrices were obtained using the Fast UniFrac program (see instructions here) and Mothur (using the get.relabund command; for the full pipeline that I authored to process these data, see the Dryad data package).  Some minor modifications were required to get the matrices into the precise format needed for R (edits were performed in Microsoft Excel and files were saved in .csv format... note that all of the spaces/tabs in the examples below would actually be commas in the raw text versions of the files).

Running the script as it is presented below (with hash marks in front of the 'plot' commands) will allow one to obtain an output file with the numerical results of all analyses.  However, in order to visualize the NMDS plots, one can run each of the commands line-by-line in the R-GUI (of course, all 'anosim' commands can be ignored if the only goal is NMDS), removing the hash marks in front of the plot commands (note that R has many fancy options for visualization using 'plot'; these may be worth investigating further for your own purposes).  Here is a link to the version of R that I used for these analyses: http://cran.r-project.org/bin/windows/base/old/2.12.0/.  Once R is installed, you can call the script with a command like this one:
C:\\"Program Files"\R\R-2.12.0\bin\i386\Rterm.exe --verbose --no-restore --file=C:\\ANOSIM_NMDS_Clon_454_win.R > C:\\ANOSIM_NMDS_Clon_454_win.out

Here is the script, named 'ANOSIM_NMDS_Clon_454_win.R' (abbreviated example script; full file available at http://dx.doi.org/10.5061/dryad.t99b1):

# load the necessary libraries
library(ecodist)
library(vegan)
# set the output file
sink("ANOSIM_nmds.out", append=TRUE, split=TRUE)
# load the clone data set
samplesClon<-read.csv("C:\\SamplesClon.csv", header=TRUE)
samplesClon
# load the Bray-Curtis OTU-based matrix
OTUsClon<-read.csv("C:\\HodkinsonClon16S_OTUs_1per_171.csv", head = FALSE, row.names = 1)
# transform the OTU matrix into a Bray-Curtis matrix
OTUsClon.dist<-vegdist(OTUsClon, "bray")
# run Bray-Curtis OTU-based ANOSIM
anosim(dis = OTUsClon.dist, grouping = samplesClon$Photobiont, strata = samplesClon$Site)
anosim(dis = OTUsClon.dist, grouping = samplesClon$Mycobiont, strata = samplesClon$Site)
anosim(dis = OTUsClon.dist, grouping = samplesClon$Mycobiont, strata = samplesClon$Photobiont)
anosim(dis = OTUsClon.dist, grouping = samplesClon$Site, strata = samplesClon$Photobiont)
anosim(dis = OTUsClon.dist, grouping = samplesClon$Site, strata = samplesClon$Mycobiont)
# run Bray-Curtis OTU-based NMDS
nmds <- nmds(as.dist(OTUsClon.dist))
nmin <- nmds.min(nmds)
nmin
#plot(nmin, pch=as.numeric(samplesClon$Photobiont))
#plot(nmin, pch=as.numeric(samplesClon$Mycobiont))
#plot(nmin, pch=as.numeric(samplesClon$Site))
# load the 454 data set
samples454<-read.csv("C:\\Samples454.csv", header=TRUE)
samples454
# load the normalized weighted UniFrac matrix
unifracnw454<-read.csv("C:\\DistUFnw454.csv", head = FALSE, row.names = 1)
# run normalized weighted UniFrac ANOSIM
anosim(dat = as.dist(unifracnw454), grouping = samples454$Photobiont, strata = samples454$Site)
anosim(dat = as.dist(unifracnw454), grouping = samples454$Site, strata = samples454$Mycobiont)
anosim(dat = as.dist(unifracnw454), grouping = samples454$Site, strata = samples454$Photobiont)
anosim(dat = as.dist(unifracnw454), grouping = samples454$Mycobiont, strata = samples454$Photobiont)
anosim(dat = as.dist(unifracnw454), grouping = samples454$Mycobiont, strata = samples454$Site)
# run normalized weighted UniFrac NMDS
nmds <- nmds(as.dist(unifracnw454))
nmin <- nmds.min(nmds)
nmin
#plot(nmin, pch=as.numeric(samples454$Photobiont))
#plot(nmin, pch=as.numeric(samples454$Mycobiont))
#plot(nmin, pch=as.numeric(samples454$Site))
# close the output file
sink()
# unload the libraries
detach("package:ecodist")
detach("package:vegan")


Samples454.csv (referenced in the above script)

SampleID Photobiont Site Mycobiont
CLCl Ch C Cl
CLDi Cy C Di
CLLe Cy C Le
CLPe Cy C Pe
CLSt Cy C St
CLUs Ch C Us
ELCl Ch E Cl
ELFl Ch E Fl
ELOp Ch E Op
ELPe Cy E Pe
ELUm Ch E Um
HLCl Ch H Cl
HLLe Cy H Le
HLPe Cy H Pe
HLSt Cy H St
HLUs Ch H Us
NLCl Ch N Cl
NLFl Ch N Fl
NLOp Ch N Op
NLPe Cy N Pe
NLUm Ch N Um


DistUFnw454.csv (referenced in the above script; full file available at http://dx.doi.org/10.5061/dryad.t99b1)

CLCl
0.16 0.17 0.11 0.08 0.12 ...
CLDi 0.16
0.24 0.18 0.13 0.18 ...
CLLe 0.17 0.24
0.11 0.16 0.20 ...
CLPe 0.11 0.18 0.11
0.10 0.18 ...
CLSt 0.08 0.13 0.16 0.10
0.13 ...
CLUs 0.12 0.18 0.20 0.18 0.13
...
ELCl 0.20 0.22 0.30 0.25 0.19 0.17
ELFl 0.14 0.18 0.24 0.19 0.14 0.15 ...
ELOp 0.33 0.35 0.40 0.37 0.34 0.31 ...
ELPe 0.16 0.21 0.24 0.19 0.15 0.19 ...
ELUm 0.22 0.24 0.32 0.27 0.22 0.22 ...
HLCl 0.19 0.18 0.27 0.24 0.17 0.15 ...
HLLe 0.13 0.16 0.20 0.15 0.11 0.12 ...
HLPe 0.09 0.17 0.15 0.11 0.11 0.16 ...
HLSt 0.14 0.16 0.21 0.18 0.12 0.13 ...
HLUs 0.10 0.20 0.18 0.15 0.14 0.11 ...
NLCl 0.19 0.21 0.28 0.25 0.19 0.15 ...
NLFl 0.18 0.21 0.27 0.24 0.19 0.15 ...
NLOp 0.24 0.27 0.32 0.29 0.25 0.25 ...
NLPe 0.10 0.18 0.13 0.09 0.10 0.14 ...
NLUm 0.09 0.18 0.21 0.15 0.13 0.16 ...


SamplesClon.csv (referenced in the above script)

SampleID Photobiont Site Mycobiont
CL01 Cy C Peltigera
CL02 Ch C Alectoria
CL03 Ch C Cladonia
CL04 Tr C Placopsis
CL05 Cy C Erioderma
CL06 Cy C Sticta
CL07 Cy C Dictyonema
CL08 Cy C Leptogium
CL09 Tr C Stereocaulon
CL10 Ch C Usnea
EL01 Cy E Peltigera
EL01t Tr E Peltigera
EL02 Ch E Alectoria
EL03 Ch E Cladonia
EL04 Ch E Rhizocarpon
EL05 Ch E Ophioparma
EL06 Ch E Flavocetraria
EL07 Ch E Arctoparmelia
EL08 Ch E Umbilicaria
EL09 Ch E Masonhalea
EL10 Ch E Dactylina
HL01 Cy H Peltigera
HL02 Ch H Platismatia
HL03 Ch H Cladonia
HL04 Ch H Parmotrema
HL05 Ch H Flavoparmelia
HL06 Cy H Sticta
HL07 Ch H Xanthoparmelia
HL08 Cy H Leptogium
HL09 Tr H Stereocaulon
HL10 Ch H Usnea
NL01 Cy N Peltigera
NL01t Tr N Peltigera
NL03 Ch N Cladonia
NL04 Tr N Amygdalaria
NL05 Ch N Ophioparma
NL06 Ch N Flavocetraria
NL07 Ch N Parmelia
NL08 Ch N Umbilicaria
NL09 Tr N Stereocaulon
NL10 Ch N Cetraria


HodkinsonClon16S_OTUS_1per_171.csv (referenced in the above script; full file available at http://dx.doi.org/10.5061/dryad.t99b1)

CL01 0.00 0.00 0.00 0.00 0.00 0.00 ...
CL02 0.00 0.00 0.00 0.00 0.00 0.00 ...
CL03 0.00 0.25 0.00 0.00 0.00 0.00 ...
CL04 0.11 0.00 0.00 0.00 0.00 0.00 ...
CL05 0.00 0.03 0.34 0.05 0.00 0.00 ...
CL06 0.00 0.10 0.02 0.03 0.00 0.02 ...
CL07 0.00 0.56 0.00 0.00 0.00 0.19 ...
CL08 0.00 0.03 0.00 0.00 0.00 0.00 ...
CL09 0.02 0.13 0.06 0.15 0.00 0.00 ...
CL10 0.00 0.00 0.00 0.00 0.00 0.00 ...
EL01 0.00 0.00 0.00 0.00 0.00 0.17 ...
EL01t 0.00 0.00 0.00 0.00 0.00 0.31 ...
EL02 0.00 0.26 0.00 0.00 0.00 0.00 ...
EL03 0.00 0.03 0.00 0.00 0.00 0.00 ...
EL04 0.00 0.10 0.00 0.00 0.00 0.00 ...
EL05 0.00 0.04 0.00 0.00 0.00 0.00 ...
EL06 0.00 0.07 0.00 0.00 0.00 0.00 ...
EL07 0.00 0.00 0.00 0.00 0.00 0.00 ...
EL08 0.00 0.34 0.00 0.00 0.00 0.00 ...
EL09 0.00 0.21 0.00 0.00 0.00 0.00 ...
EL10 0.00 0.21 0.00 0.00 0.04 0.00 ...
HL01 0.00 0.00 0.00 0.00 0.00 0.00 ...
HL02 0.13 0.00 0.00 0.00 0.00 0.00 ...
HL03 0.00 0.00 0.00 0.00 0.00 0.00 ...
HL04 0.06 0.00 0.00 0.00 0.00 0.00 ...
HL05 0.00 0.03 0.00 0.00 0.00 0.00 ...
HL06 0.00 0.00 0.00 0.06 0.00 0.00 ...
HL07 0.00 0.14 0.00 0.00 0.00 0.00 ...
HL08 0.00 0.00 0.00 0.00 0.00 0.00 ...
HL09 0.00 0.00 0.00 0.00 0.00 0.00 ...
HL10 0.00 0.02 0.00 0.00 0.00 0.00 ...
NL01 0.00 0.15 0.00 0.00 0.00 0.00 ...
NL01t 0.00 0.00 0.00 0.00 0.00 0.15 ...
NL03 0.00 0.00 0.00 0.00 0.00 0.00 ...
NL04 0.00 0.15 0.00 0.00 0.00 0.00 ...
NL05 0.00 0.17 0.00 0.00 0.00 0.00 ...
NL06 0.00 0.02 0.00 0.00 0.00 0.00 ...
NL07 0.00 0.33 0.00 0.00 0.00 0.00 ...
NL08 0.00 0.80 0.00 0.00 0.00 0.00 ...
NL09 0.00 0.00 0.00 0.00 0.00 0.00 ...
NL10 0.00 0.05 0.00 0.00 0.00 0.00 ...


Here are some examples of non-metric multidimensional scaling plots produced from OTU-based Bray-Curtis dissimilarities between lichen-associated bacterial communities (Hodkinson et al. 2012). Photobiont is indicated by color (light green = green algae; dark blue = Cyanobacteria; black = Tripartite), while the site is indicated with symbols (• = Eagle Summit, AK; + = Nome, AK; x = Highlands, NC, and * = Cerro de la Muerte, CR). Numbers indicate the identity of different mycobiont types (see associated publication for details). Continuous lines act as visual aids to delimit communities associated with the two major photobiont-types, whereas dashed lines delimit communities associated with chlorolichens from northern versus southern sites.



















Plot A shows results obtained from clone library data of 16S sequences from the order Rhizobiales.



Plot B was produced from 454 barcoded 16S amplicon data (representing approximately half the number of samples as the clone library set but ~100 times as many sequences from a much wider range of bacterial diversity; since this plot was produced through OTU-based methods it does not precisely correlate with what would be produced through the abbreviated script above, which uses UniFrac distances to generate NMDS plots based on 454 data).

For additional information and documentation, including results of ANOSIM analyses, please see the references below!

- Brendan

---------------------------------------

References

Hodkinson, B. P. 2011. A phylogenetic, ecological, and functional characterization of non-photoautotrophic bacteria in the lichen microbiome. Doctoral Dissertation, Duke University, Durham, NC.
Download Dissertation (PDF file)

Hodkinson, B. P., N. R. Gottel, C. W. Schadt, and F. Lutzoni. 2012. Photoautotrophic symbiont and geography are major factors affecting highly structured and diverse bacterial communities in the lichen microbiome. Environmental Microbiology 14(1): 147-161. [doi:10.1111/j.1462-2920.2011.02560.x]
Download publication (PDF file)
Download supplementary phylogeny (PDF file)
View data and analysis file webportal (website)
Download data and analysis file archive (ZIP file)

Hodkinson, B. P., N. R. Gottel, C. W. Schadt, and F. Lutzoni. 2011. Data from: Photoautotrophic symbiont and geography are major factors affecting highly structured and diverse bacterial communities in the lichen microbiome. Dryad Digital Repository. doi:10.5061/dryad.t99b1

R Development Core Team. 2010. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. http://www.r-project.org/