\documentclass[article,shortnames,nojss]{jss} %\VignetteIndexEntry{An R coenocline simulation package} %\VignettePackage{coenocliner} %\VignetteEngine{knitr::knitr} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %% declarations for jss.cls %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %% almost as usual \author{Gavin L. Simpson\\Institute of Environmental Change and Society\\University of Regina} \title{\pkg{coenocliner}: a coenocline simulation package for R} %% for pretty printing and a nice hypersummary also set: \Plainauthor{Gavin L. Simpson} %% comma-separated \Plaintitle{coenocliner: a coenocline package for R} %% without formatting \Shorttitle{coenocliner} %% a short title (if necessary) %% an abstract and keywords \Abstract{ This vignette provides an introduction to, and user-guide for, the \pkg{coenocliner} package for \proglang{R}. \pkg{coenocliner} can be used to generated random count or occurrence data from parameterised species response curves. The classic Gaussian response and the generalised beta response functions are provided and simulated count or occurrence data can be generated via random draws from a number of probability distributions. } \Keywords{coenocline simulation, \proglang{R}, ecology, species, Gaussian response, generalized beta response, Poisson, negative binomial} \Plainkeywords{coenocline simulation, R, ecology, species, Gaussian response, generalized beta response, Poisson, negative binomial} %% without formatting %% at least one keyword must be supplied %% The address of (at least) one author should be given %% in the following format: \Address{ Gavin L. Simpson\\ Insitutute of Environmental Change and Society\\ University of Regina\\ 3737 Wascana Parkway\\ Regina\\ SK, S4S 0A2\\ Canada\\ E-mail: \email{ucfagls@gmail.com}\\ URL: \url{http://www.fromthebottomoftheheap.net/} } %% It is also possible to add a telephone and fax number %% before the e-mail in the following format: %% Telephone: +43/1/31336-5053 %% Fax: +43/1/31336-734 %% for those who use Sweave please include the following line (with % symbols): %% need no \usepackage{Sweave.sty} %% end of declarations %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \usepackage{textcomp} \newcommand{\glstextapprox}{\raisebox{0.5ex}{\texttildelow}} \begin{document} <>= library("knitr") opts_chunk$set(fig.lp = "fig:", size = "small", out.width = ".7\\linewidth", fig.align = "center", fig.show = "hold") @ \section{Introduction} Coenoclines are, according to the Oxford Dictionary of Ecology \citep{Allaby1998}, \textit{``gradients of communities (e.g.~in a transect from the summit to the base of a hill), reflecting the changing importance, frequency, or other appropriate measure of different species populations''}. In much ecological research, and that of related fields, data on these coenoclines are collected and analyzed in a variety of ways. When developing new statistical methods or when trying to understand the behaviour of existing methods, we often resort to simulating data with known pattern or structure and then torture whatever method is of interest with the simulated data to tease out how well methods work or where they breakdown. There's a long history of using computers to simulate species abundance data along coenoclines but until recently no \proglang{R} packages were available that performed coenocline simulation. Dave Roberts' \pkg{coenoflex} package was on CRAN for a while but the latest version was unstable and removed from CRAN because of issues with the \proglang{FORTRAN}\footnote{Dave has since fixed these issues but the updated package is not yet, as of August 2014, re-available from CRAN.}. \pkg{coenocliner} was designed to fill this gap. \pkg{coenocliner} can simulate species abundance or occurrence data along one or two gradients from either a Gaussian or generalised beta response model. Parameters for the response model are supplied for each species and parameterised species repsonse curves along the gradients are returned. Simulated abundance or occurrence data can be produced by sampling from one of several error distributions which use the parameterised species response curves as the expected count or probability of occurrence for the chosen error distribution. The error distributions available in \pkg{coenocliner} are \begin{itemize} \item Poisson \item Negative binomial \item Bernoulli (occurrence; Binomial with denominator $m = 1$) \item Binomial (counts with specified denominator $m$) \item Beta-binomial \item Zero-inflated Poisson (ZIP) \item Zero-inflated negative binomial (ZINB) \item Zero-inflated Binomial (ZIB) \item Zero-inflated Beta-Binomial (ZIBB) \end{itemize} This vignette provides a brief overview of the \pkg{coenocliner} package. \section[A brief overview of coenocliner]{A brief overview of \pkg{coenocliner}} To begin, load \pkg{coenocliner} and check the start-up message to see if you are using the current (\Sexpr{packageVersion("coenocliner")}) release of the package <>= library("coenocliner") @ The main function in \pkg{coenocliner} is \texttt{coenocline()}, which provides a relatively simple interface to coenocline simulation allowing flexible specification of gradient locations and response model parameters for species. Gradient locations are specified via argument \texttt{x}, which can be a single vector, or, in the case of two gradients, a matrix or a list containing vectors of gradient values. The matrix version assumes the first gradient's values are in the first column and those for the second gradient in the second column <>= xy <- cbind(x = seq(from = 4, to = 7, length.out = 100), y = seq(from = 1, to = 100, length.out = 100)) @ Similarly, for the list version, the first component contains the values for the first gradient and the second component the values for the second gradient <>= xy <- list(x = seq(from = 4, to = 6, length.out = 100), y = seq(from = 1, to = 100, length.out = 100)) @ The species response model used is indicated via the \texttt{responseModel} argument; available options are \texttt{"gaussian"} and \texttt{"beta"} for the classic Gaussian response model and the generalise beta response model respectively. Parameters are supplied to \texttt{coenocline()} via the \texttt{params} argument. \texttt{showParams()} can be used list the parameters for the desired response model. The parameters for the Gaussian response model are <>= showParams("gaussian") @ As indicated, some parameters are only supplied once per species, regardless of whether there are one or two gradients. Hence for the Gaussian model, the parameter \texttt{h} is only supplied for the first gradient even if two gradients are required. Parameters are supplied as a matrix with named columns, or as a list with named components. For example, for a Gaussian response for each of 3 species we could use either of the two forms <>= opt <- c(4,5,6) tol <- rep(0.25, 3) h <- c(10,20,30) parm <- cbind(opt = opt, tol = tol, h = h) # matrix form parl <- list(opt = opt, tol = tol, h = h) # list form @ In the case of two gradients, a list with two components, one per gradient, is required. The first component contains parameters for the first gradient, the second element contains those for the second gradient. These components can be either a matrix or a list, as described previously. For example a list with parameters supplied as matrices <>= opty <- c(25, 50, 75) tol <- c(5, 10, 20) pars <- list(px = parm, py = cbind(opt = opty, tol = tol)) @ Note that parameter $h$ is not specified in the second set as this parameter, the height of the response curve at the gradient optima, applies globally --- in the case of two gradients, $h$ refers to the height of the bell-shaped curve at the bivariate optimum. Notice also how parameters are specified at the species level. To evaluate the response curve at the supplied gradient locations each set of parameters needs to be repeated for each gradient location. Thankfully \texttt{coenocline()} takes care of this detail for us. Additional parameters that may be needed for the response model but which are not specified at the species level are supplied as a list with named components to argument \texttt{extraParams}. An example is the correlation between Gaussian response curves in case of two gradients. This, unfortunately, means that a single correlation between response curves applies to all species\footnote{This is not strictly true as you can work out how the species parameters are replicated relative to gradient values and hence pass a vector of the correct length with the species-specific values included. Study the outputs from \texttt{expand()} when supplied gradient locations and parameters to work out how to specify \texttt{extraParams} appropriately}, and is caused by a poor choice of implementation. Thankfully this is relatively easy to fix, which will be done for version 0.3-0 along with a fix for a similar issue relating to the statement of additional parameters for the error distribution used (see below). To simulate realistic count data we need to sample \textit{with error} from the parameterised species response curves. Which of the distributions (listed earlier) is used is specified via argument \texttt{countModel}; available options are <>= c("poisson", "negbin", "bernoulli", "binary", "binomial", "betabinomial", "ZIP", "ZINB", "ZIB", "ZIBB") @ Some of these distributions (all bar \texttt{"poisson"} and \texttt{"bernoulli"}) require additional arguments, such as the $\alpha$ parameter for (one parameterisation of) the negative binomial distribution. These arguments are supplied as a list with named components. Again, due to the same implementation snafu as for \texttt{extraParams}, such parameters act globally for all species\footnote{Again, this is not strictly true as you can work out how the species parameters are replicated relative to gradient values and hence pass a vector of the correct length with the species-specific values included. Study the outputs from \texttt{expand()} when supplied gradient locations and parameters to work out how to specify \texttt{countParams} appropriately}. The final argument is \texttt{expectation}, which defaults to \texttt{FALSE}. When set to \texttt{TRUE}, simulating species counts or occurrences with error is skipped and the values of the parameterised response curve evaluated at the gradient locations are returned. This option is handy if you want to look at or plot the species response curves used in a simulation. \subsection{Example usage} In the next few sections the basic usage of \texttt{coenocline()} is illustrated. \subsubsection{Gaussian responses along a single gradient} This example, of multiple species responses along a single environmental gradient, illustrates the simplest usage of \texttt{coenocline()}. The example uses a hypothetical pH gradient with species optima drawn at random uniformally along the gradient. Species tolerances are the same for all species. The maximum abundance of each species, $h$, is drawn from a lognormal distribution with a mean of \glstextapprox{}20 ($e^3$). This simulation will be for a community of 20 species, evaluated at 100 equally spaced locations. First we set up the parameters <>= set.seed(2) M <- 20 # number of species ming <- 3.5 # gradient minimum... maxg <- 7 # ...and maximum locs <- seq(ming, maxg, length = 100) # gradient locations opt <- runif(M, min = ming, max = maxg) # species optima tol <- rep(0.25, M) # species tolerances h <- ceiling(rlnorm(M, meanlog = 3)) # max abundances pars <- cbind(opt = opt, tol = tol, h = h) # put in a matrix @ As a check, before simulating any count data, we can look at the coenocline implied by these parameters by returning the expectations only from \texttt{coenocline()} <>= mu <- coenocline(locs, responseModel = "gaussian", params = pars, expectation = TRUE) @ This returns a matrix of values obtained by evaluating each species response curve at the supplied gradient locations. There is one column per species and one row per gradient location <>= class(mu) dim(mu) mu @ A quick way to visualise the parameterised species response is to use the \texttt{plot()} method <>= plot(mu, lty = "solid", type = "l", xlab = "pH", ylab = "Abundance") @ The resultant plot is shown in Figure~\ref{fig:example1-plot-expectations}. As this looks OK, we can simulate some count data. The simplest model for doing so is to make random draws from a Poisson distribution with the mean, $\lambda$, for each species set to value of the response curve evaluated at each gradient location. Hence the values in \texttt{mu} that we just created can be thought of as the expected count per species at each of the gradient locations we are interested in. To simulate Poisson count data, use \texttt{expectation = FALSE} or remove this argument from the call. To be more explicit, we should also state \texttt{countModel = "poisson"}\footnote{\texttt{countModel = "poisson"} is the default so this can be excluded from the call.}. <>= simp <- coenocline(locs, responseModel = "gaussian", params = pars, countModel = "poisson") @ Again, \texttt{matplot} is useful in visualizing the simulated data <>= plot(simp, lty = "solid", type = "p", pch = 1:10, cex = 0.8, xlab = "pH", ylab = "Abundance") @ The resultant plot is shown in Figure~\ref{fig:example1-plot-simulations}. Whilst the simulated counts look reasonable and follow the response curves in Figure~\ref{fig:example1-plot-expectations} there is a problem; the variation around the expected curves is too small. This is due to the error variance implied by the Poisson distribution encapsulating only that variance which would arise due to repeated sampling at the gradient locations. Most species abundance data exhibit much larger degrees of variation than that shown in Figure~\ref{fig:example1-plot-simulations}. A solution to this is to sample from a distribution that incorporates additional variance or \textit{overdispersion}. A natural partner to the Poisson that includes overdispersion is the negative binomial. To simulate count data using the negative binomial distribution we must alter \texttt{countModel} and supply the overdispersion parameter $\alpha$ to use\footnote{Recall that this is only easily specifiable globally in version 0.1-0 of \pkg{coenocliner}.} via \texttt{countParams}. <>= simnb <- coenocline(locs, responseModel = "gaussian", params = pars, countModel = "negbin", countParams = list(alpha = 0.5)) @ Using \texttt{plot} it is apparent that the simluated species data are now far more relalistic (Figure~\ref{fig:example1-plot-nb-simulations}) <>= plot(simnb, lty = "solid", type = "p", pch = 1:10, cex = 0.8, xlab = "pH", ylab = "Abundance") @ \subsubsection{Generalised beta responses along a single gradient} In this example, I recreate figure 2 in \citet{Minchin1987} and then simulate species abundances from the species response curves. The species parameters for the generalised beta response for the six species in \citet{Minchin1987} are <>= A0 <- c(5,4,7,5,9,8) * 10 # max abundance m <- c(25,85,10,60,45,60) # location on gradient of modal abundance r <- c(3,3,4,4,6,5) * 10 # species range of occurence on gradient alpha <- c(0.1,1,2,4,1.5,1) # shape parameter gamma <- c(0.1,1,2,4,0.5,4) # shape parameter locs <- 1:100 # gradient locations pars <- list(m = m, r = r, alpha = alpha, gamma = gamma, A0 = A0) # species parameters, in list form @ To recreate figure 2 in \citet{Minchin1987} evaluations at the chosen gradient locations, \texttt{locs}, of the parameterised generalised beta are required and can be generated by passing \texttt{coenocline()} the gradient locations and the chosen species parameters as before, choosing the generalised beta response model and using \texttt{expectation = TRUE} <>= mu <- coenocline(locs, responseModel = "beta", params = pars, expectation = TRUE) @ As before \texttt{mu} is a matrix with one column per species <>= mu @ and as such we can use \texttt{matplot} to draw the species responses <>= plot(mu, lty = "solid", type = "l", xlab = "Gradient", ylab = "Abundance") @ Figure~\ref{fig:example2-beta-plot-expectations} is a good facsimile of figure 2 in \citet{Minchin1987}. \subsubsection{Gaussian response along two gradients} In this example I illustrate how to simulate species abundance in an environment comprising two gradients. Parameters for the simulation are defined first, including the number of species and samples required, followed by definitions of the gradient units and lengths, species optima, and tolerances for each gradient, and the maximal abundance ($h$). <>= set.seed(10) N <- 30 # number of samples M <- 20 # number of species ## First gradient ming1 <- 3.5 # 1st gradient minimum... maxg1 <- 7 # ...and maximum loc1 <- seq(ming1, maxg1, length = N) # 1st gradient locations opt1 <- runif(M, min = ming1, max = maxg1) # species optima tol1 <- rep(0.5, M) # species tolerances h <- ceiling(rlnorm(M, meanlog = 3)) # max abundances par1 <- cbind(opt = opt1, tol = tol1, h = h) # put in a matrix ## Second gradient ming2 <- 1 # 2nd gradient minimum... maxg2 <- 100 # ...and maximum loc2 <- seq(ming2, maxg2, length = N) # 2nd gradient locations opt2 <- runif(M, min = ming2, max = maxg2) # species optima tol2 <- ceiling(runif(M, min = 5, max = 50)) # species tolerances par2 <- cbind(opt = opt2, tol = tol2) # put in a matrix ## Last steps... pars <- list(px = par1, py = par2) # put parameters into a list locs <- expand.grid(x = loc1, y = loc2) # put gradient locations together @ Notice how the parameter sets for each gradient are individual matrices which are combined in a list, \texttt{pars}, ready for use. Also different this time is the \texttt{expand.grid()} call which is used to generate all pairwise combinations of the locations on the two gradients. This has the effect of creating a coordinate pair on the two gradients at which we'll evaluate the response curves. In effect this creates a grid of points over the gradient space. Having set up the parameters, the call to \texttt{coenocline()} is the same as before, except now we specify a degree of correlation between the two gradients via \texttt{extraParams = list(corr = 0.5)} <>= mu2d <- coenocline(locs, responseModel = "gaussian", params = pars, extraParams = list(corr = 0.5), expectation = TRUE) @ \texttt{mu2d} now contains a matrix of expected species abundances, one column per species as before. Because of the way the \texttt{expand.grid()} function works, the ordering of species abudances in each column has the first gradient locations varying fastest --- the locations on the first gradient are repeated in order for each location on the second gradient <>= head(locs) @ As a result, we can reshape the abundances for a single species into a matrix reflecting the grid of locations over the gradient space via a simple \texttt{matrix()} call, setting the number of columns in the resultant matrix equal to the number of gradient locations in the simulation. Thankfully this is taken care of for you in the \texttt{persp()} method\footnote{The \texttt{plot()} methods for class \texttt{"coenocline"} will use the \texttt{persp()} if you try to plot an object with two gradients.}. The method will draw perspective plots for all species in the simulation; it is keft to the user to handle how these should be displayed on the current graphics device. By way of illustration, I plot the expected abundance for four of the species in \texttt{mu2d} <>= layout(matrix(1:4, ncol = 2)) op <- par(mar = rep(1, 4)) persp(mu2d, species = c(2, 8, 13, 19), ticktype = "detailed", zlab = "Abundance") par(op) layout(1) @ The selected species response curves are shown in Figure~\ref{fig:example3-persp-plots}. Simulated counts for each species can be produced by removing \texttt{expectation = TRUE} from the call and choosing an error distribution to make random draws from. For example, for negative binomial errors with dispersion $\alpha = 1$ we can use <>= sim2d <- coenocline(locs, responseModel = "gaussian", params = pars, extraParams = list(corr = 0.5), countModel = "negbin", countParams = list(alpha = 1)) @ The resulting simulated counts for the same four selected species are shown in Figure~\ref{fig:example3-persp-plots2}, which was generated using the code below <>= layout(matrix(1:4, ncol = 2)) op <- par(mar = rep(1, 4)) persp(sim2d, species = c(2, 8, 13, 19), ticktype = "detailed", zlab = "Abundance") par(op) layout(1) @ \section{Future directions} \pkg{coenocliner} was designed to be quite modular and hence easy to add new species response models or probability distributions from which to simulate count or abundance data. The current release covers the main functionality envisaged when I started to develop \pkg{coenocliner}. Some fine-tuning and polishing of the functions is needed as are some methods such as \texttt{plot} methods and nicer displays of the outputted species data such as row and column labels on the resultant community matrices. Beyond this, it would be useful to include options in other community simulator packages, such as COMPAS, which include competition effects between species etc. Such modifcations would occur after the simulated counts were generated and would act to modify those counts in line with ecological theory. \bibliography{coenocliner} \section{Appendix} \subsection{Computational details} This vignette was created using the following R and add-on package versions <>= toLatex(sessionInfo()) @ \end{document}