\documentclass{article}
%\VignetteIndexEntry{Examples for enhanced mle code}
%\VignettePackage{bbmle}
%\VignetteDepends{Hmisc}
%\VignetteDepends{emdbook}
%\VignetteDepends{ggplot2}
%\VignetteDepends{lattice}
%\VignetteEngine{knitr::knitr}
\usepackage[utf8]{inputenc} % for UTF-8/single quotes from sQuote()
\usepackage[english]{babel} % for texi2dvi ~ bug
\usepackage{graphicx}
\usepackage{natbib}
\usepackage{array}
\usepackage{color}
\usepackage[colorlinks=true,bookmarks=true]{hyperref}
\hypersetup{linkcolor=purple,urlcolor=blue,citecolor=gray}
\usepackage{url}
\author{Ben Bolker}
\title{Maximum likelihood estimation and analysis with the \code{bbmle} package}
\newcommand{\code}[1]{{\tt #1}}
\newcommand{\bbnote}[1]{\color{red} {\em #1} \color{black}}
\date{\today}
\begin{document}
\bibliographystyle{chicago}
%\bibliographystyle{plain}
\maketitle
\tableofcontents

<<>>=
if (require("knitr")) {
  opts_chunk$set(fig.width=5,fig.height=5,
                 tidy=FALSE,warning=FALSE,error=TRUE)
}
@

<<>>=
library(Hmisc)
@

The \code{bbmle} package, designed to simplify maximum likelihood
estimation and analysis in R, extends and modifies the \code{mle}
function and class in the \code{stats4} package that comes with R by
default. \code{mle} is in turn a wrapper around the \code{optim}
function in base R. The maximum-likelihood-estimation function and
class in \code{bbmle} are both called \code{mle2}, to avoid confusion
and conflict with the original functions in the \code{stats4} package.
The major differences between \code{mle} and \code{mle2} are:
\begin{itemize}
\item \code{mle2} is more robust, with additional warnings (e.g. if the
  Hessian can't be computed by finite differences, \code{mle2} returns
  a fit with a missing Hessian rather than stopping with an error)
\item \code{mle2} uses a \code{data} argument to allow different data
  to be passed to the negative log-likelihood function
\item \code{mle2} has a formula interface like that of (e.g.)
  \code{gls} in the \code{nlme} package. For relatively simple models
  the formula for the maximum likelihood can be written in-line, rather
  than defining a negative log-likelihood function. The formula
  interface also simplifies fitting models with categorical variables.
  Models fitted using the formula interface also have applicable
  \code{predict} and \code{simulate} methods.
\item \code{bbmle} defines \code{anova}, \code{AIC}, \code{AICc}, and
  \code{BIC} methods for \code{mle2} objects, as well as \code{AICtab},
  \code{BICtab}, and \code{AICctab} functions for producing summary
  tables of information criteria for a set of models.
\end{itemize}

Other packages with similar functionality (extending GLMs in various
ways) are:
\begin{itemize}
\item on CRAN: \code{aods3} (overdispersed models such as
  beta-binomial); \code{VGAM} (a wide range of models); \code{betareg}
  (beta regression); \code{pscl} (zero-inflated, hurdle models);
  \code{maxLik} (another general-purpose maximizer, with a different
  selection of optimizers)
\item in Jim Lindsey's code repository
  (\url{http://popgen.unimaas.nl/~jlindsey/rcode.html}): \code{gnlr}
  and \code{gnlr3}
\end{itemize}

\section{Example: \emph{Orobanche}/overdispersed binomial}

This example uses the classic data set on \emph{Orobanche} germination
from \cite{Crowder1978} (you can also use
\code{glm(...,family="quasibinomial")} or the \code{aods3} package to
analyze these data).

\subsection{Test basic fit to simulated beta-binomial data}

First, generate a single beta-binomially distributed set of points as a
simple test.
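(For reference, and as an assumption to be checked against
\code{?dbetabinom}: the beta-binomial functions used below are
parameterized by a mean probability \code{prob} $=p$ and an
overdispersion parameter \code{theta} $=\theta$, corresponding to
beta shape parameters $\theta p$ and $\theta(1-p)$, so that
\begin{displaymath}
\mathrm{Prob}(x|p,\theta,N) = {N \choose x}
  \frac{B(x+\theta p,\; N-x+\theta(1-p))}{B(\theta p,\; \theta(1-p))},
\end{displaymath}
where $N$ is the number of trials (\code{size}) and $B$ is the beta
function.)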
Load the \code{emdbook} package to get functions for the beta-binomial
distribution (random-deviate function \code{rbetabinom} --- these
functions are also available in Jim Lindsey's \code{rmutil} package).

<<>>=
library(emdbook)
@

Generate random deviates from a beta-binomial distribution:

<<>>=
set.seed(1001)
x1 <- rbetabinom(n=1000,prob=0.1,size=50,theta=10)
@

Load the package:

<<>>=
library(bbmle)
@

Construct a simple negative log-likelihood function:

<<>>=
mtmp <- function(prob,size,theta) {
  -sum(dbetabinom(x1,prob,size,theta,log=TRUE))
}
@

Fit the model --- use \code{data} to pass the \code{size} parameter
(since it wasn't hard-coded in the \code{mtmp} function):

<<>>=
suppressWarnings(
  m0 <- mle2(mtmp,start=list(prob=0.2,theta=9),data=list(size=50))
)
@

(Here and below, I'm suppressing lots of warnings about {\tt NaNs
produced}.)

The \code{summary} method for \code{mle2} objects shows the parameters;
approximate standard errors (based on the quadratic approximation to
the curvature at the maximum likelihood estimate); and a test of the
parameter difference from zero based on this standard error and on an
assumption that the likelihood surface is quadratic (or equivalently
that the sampling distribution of the estimated parameters is normal).

<<>>=
summary(m0)
@

Construct the likelihood profile (you can apply \code{confint} directly
to \code{m0}, but if you're going to work with the likelihood profile
[e.g. plotting, or looking for confidence intervals at several
different $\alpha$ values] then it is more efficient to compute the
profile once):

<<>>=
suppressWarnings(
  p0 <- profile(m0)
)
@

Compare the confidence interval estimates based on inverting a spline
fit to the profile (the default); based on the quadratic approximation
at the maximum likelihood estimate; and based on root-finding to find
the exact point where the profile crosses the critical level.

<<>>=
confint(p0)
confint(m0,method="quad")
confint(m0,method="uniroot")
@

All three types of confidence limits are similar. Plot the profiles:

<<>>=
par(mfrow=c(1,2))
plot(p0,plot.confstr=TRUE)
@

By default, the plot method for likelihood profiles displays the square
root of the deviance difference (twice the difference in negative
log-likelihood from the best fit), so it will be {\sf V}-shaped for
cases where the quadratic approximation works well (as in this case).
(For a better visual estimate of whether the profile is quadratic, use
the \code{absVal=FALSE} option to the \code{plot} method.)

You can also request confidence intervals calculated using
\code{uniroot}, which may be more exact when the profile is not smooth
enough to be modeled accurately by a spline. However, this method is
also more sensitive to numerical problems.

Instead of defining an explicit function for \code{minuslogl}, we can
also use the formula interface. The formula interface assumes that the
density function given (1) has \code{x} as its first argument (if the
distribution is multivariate, then \code{x} should be a matrix of
observations) and (2) has a \code{log} argument that will return the
log-probability or log-probability density if \code{log=TRUE}. Some of
the extended functionality (prediction etc.) depends on the existence
of an \code{s}-variant function for the distribution that returns (at
least) the mean and median as a function of the parameters (currently
defined: \code{snorm}, \code{sbinom}, \code{sbeta}, \code{snbinom},
\code{spois}).
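As an illustration of this convention, here is a minimal sketch of what
an \code{s}-variant function might look like for the Poisson
distribution (hypothetical code for exposition only; the \code{spois}
actually defined for use with \code{bbmle} may return more components):

<<>>=
## hypothetical s-variant sketch: return (at least) the mean and median
## as a function of the distribution's parameters
spois_sketch <- function(lambda) {
    list(title = "Poisson",
         lambda = lambda,
         mean = lambda,               ## mean of Poisson(lambda)
         median = qpois(0.5, lambda)) ## median via the quantile function
}
@

Returning to the example, here is the same fit via the formula
interface: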
<<>>=
m0f <- mle2(x1~dbetabinom(prob,size=50,theta),
            start=list(prob=0.2,theta=9),data=data.frame(x1))
@

Note that you must specify the data via the \code{data} argument when
using the formula interface. This may be slightly more unwieldy than
just pulling the data from your workspace when you are doing simple
things, but in the long run it makes tasks like predicting new
responses much simpler.

It's convenient to use the formula interface to try out likelihood
estimation on transformed parameters:

<<>>=
m0cf <- mle2(x1~dbetabinom(prob=plogis(lprob),size=50,theta=exp(ltheta)),
             start=list(lprob=0,ltheta=2),data=data.frame(x1))
confint(m0cf,method="uniroot")
confint(m0cf,method="spline")
@

In this case the answers from the \code{uniroot} and \code{spline}
(default) methods barely differ.

\subsection{Real data (\emph{Orobanche}, \cite{Crowder1978})}

Data are copied from the \code{aods3} package (but a copy is saved with
the package to avoid depending on the \code{aods3} package):

<<>>=
load(system.file("vignetteData","orob1.rda",package="bbmle"))
summary(orob1)
@

Now construct a negative log-likelihood function that differentiates
among groups:

<<>>=
X <- model.matrix(~dilution, data = orob1)
ML1 <- function(prob1,prob2,prob3,theta,x) {
  prob <- c(prob1,prob2,prob3)[as.numeric(x$dilution)]
  size <- x$n
  -sum(dbetabinom(x$m,prob,size,theta,log=TRUE))
}
@

% Would like to show an intermediate example that does plogis(X %*% beta)
% explicitly but argument processing is messed up for list-like parameters ...
% sigh ...

Results from \cite{Crowder1978}:

<<results="asis">>=
crowder.results <- matrix(c(0.132,0.871,0.839,78.424,0.027,0.028,0.032,-34.991,
                            rep(NA,7),-34.829,
                            rep(NA,7),-56.258),
   dimnames=list(c("prop diffs","full model","homog model"),
                 c("prob1","prob2","prob3","theta",
                   "sd.prob1","sd.prob2","sd.prob3","NLL")),
   byrow=TRUE,nrow=3)
latex(crowder.results,file="",table.env=FALSE,title="model")
@

<<>>=
(m1 <- mle2(ML1,
            start=list(prob1=0.5,prob2=0.5,prob3=0.5,theta=1),
            data=list(x=orob1)))
@

Or:

<<>>=
## would prefer ~dilution-1, but problems with starting values ...
(m1B <- mle2(m~dbetabinom(prob,size=n,theta),
             parameters=list(prob~dilution),
             start=list(prob=0.5,theta=1),
             data=orob1))
@

The result warns us that the optimization has not converged; we also
don't match Crowder's results for $\theta$ exactly. We can fix both of
these problems by setting \code{parscale} appropriately.

Since we don't bound $\theta$ (or below, $\sigma$), we get a fair
number of warnings with this and the next few fitting and profiling
attempts. We will ignore these for now, since the final results reached
are reasonable (and match or nearly match Crowder's values); the
appropriate, careful thing to do would be either to fit on a
transformed scale where all real-valued parameter values were legal, or
to use \code{method="L-BFGS-B"} (or \code{method="bobyqa"} with the
\code{optimx} package) to bound the parameters appropriately.
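For concreteness, a bounded version of the fit above might look
something like the following sketch (illustrative bounds, not the
approach actually used in the rest of this vignette, and therefore not
evaluated here):

<<eval=FALSE>>=
## hypothetical bounded fit: constrain the probabilities to (0,1)
## and theta to be positive, using the L-BFGS-B optimizer
m1_bounded <- mle2(ML1,
                   start=list(prob1=0.5,prob2=0.5,prob3=0.5,theta=1),
                   data=list(x=orob1),
                   method="L-BFGS-B",
                   lower=c(prob1=0.001,prob2=0.001,prob3=0.001,theta=0.001),
                   upper=c(prob1=0.999,prob2=0.999,prob3=0.999,theta=Inf))
@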
You can also use \code{suppressWarnings()} if you're sure you don't
need to know about any warnings (beware: this will suppress \emph{all}
warnings, those you weren't expecting as well as those you were
\ldots)

<<>>=
opts_chunk$set(warning=FALSE)
@

<<>>=
(m2 <- mle2(ML1,start=as.list(coef(m1)),
            control=list(parscale=coef(m1)),
            data=list(x=orob1)))
@

Calculate the likelihood profile (restricting the upper limit of
$\theta$, simply because it will make the picture below a little bit
nicer):

<<>>=
p2 <- profile(m2,prof.upper=c(Inf,Inf,Inf,theta=2000))
@

Get the curvature-based parameter standard deviations (which Crowder
used rather than computing likelihood profiles):

<<>>=
round(stdEr(m2),3)
@

We are slightly off Crowder's numbers --- rounding error?

Crowder also defines a variance (overdispersion) parameter
$\sigma^2=1/(1+\theta)$.

<<>>=
sqrt(1/(1+coef(m2)["theta"]))
@

Using the delta method (via the \code{deltavar} function in the
\code{emdbook} package) to approximate the standard deviation of
$\sigma$:

<<>>=
sqrt(deltavar(sqrt(1/(1+theta)),meanval=coef(m2)["theta"],
              vars="theta",Sigma=vcov(m2)[4,4]))
@

Another way to fit in terms of $\sigma$ rather than $\theta$ is to
compute $\theta=1/\sigma^2-1$ on the fly in a formula:

<<>>=
m2b <- mle2(m~dbetabinom(prob,size=n,theta=1/sigma^2-1),
            data=orob1,
            parameters=list(prob~dilution,sigma~1),
            start=list(prob=0.5,sigma=0.1))
## ignore warnings (we haven't bothered to bound sigma<1)
round(stdEr(m2b)["sigma"],3)
p2b <- profile(m2b,prof.lower=c(-Inf,-Inf,-Inf,0))
@

As might be expected, since the standard deviation of $\sigma$ is
large, the quadratic approximation is poor:

<<>>=
r1 <- rbind(confint(p2)["theta",],
            confint(m2,method="quad")["theta",])
rownames(r1) <- c("spline","quad")
r1
@

Plot the profile:

<<>>=
plot(p2,which="theta",plot.confstr=TRUE,show.points=TRUE)
@

What does the profile for $\sigma$ look like?

<<>>=
## not working?
## plot(p2b,which="sigma",plot.confstr=TRUE,show.points=TRUE)
par(las = 1, bty = "l")
with(p2b@profile$sigma,
     plot(par.vals[,"sigma"], abs(z), type = "b"))
@

Now fit a homogeneous model:

<<>>=
ml0 <- function(prob,theta,x) {
  size <- x$n
  -sum(dbetabinom(x$m,prob,size,theta,log=TRUE))
}
m0 <- mle2(ml0,start=list(prob=0.5,theta=100),
           data=list(x=orob1))
@

The log-likelihood matches Crowder's result:

<<>>=
logLik(m0)
@

It's easier to use the formula interface to specify all three of the
models fitted by Crowder (homogeneous; probabilities differing by
group; probabilities and overdispersion differing by group):

<<>>=
m0f <- mle2(m~dbetabinom(prob,size=n,theta),
            parameters=list(prob~1,theta~1),
            data=orob1,
            start=list(prob=0.5,theta=100))
m2f <- update(m0f,
              parameters=list(prob~dilution,theta~1),
              start=list(prob=0.5,theta=78.424))
m3f <- update(m0f,
              parameters=list(prob~dilution,theta~dilution),
              start=list(prob=0.5,theta=78.424))
@

\code{anova} runs a likelihood ratio test on nested models:

<<>>=
anova(m0f,m2f,m3f)
@

The various \code{ICtab} commands produce tables of information
criteria; by default the results are sorted and presented as
$\Delta$IC. There are various options, including printing model
weights.
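For reference, the weights reported are the standard Akaike weights: if
$\Delta_i$ is the difference between the IC value of model $i$ and the
smallest IC value in the set, then
\begin{displaymath}
w_i = \frac{\exp(-\Delta_i/2)}{\sum_j \exp(-\Delta_j/2)} .
\end{displaymath}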
<<>>=
AICtab(m0f,m2f,m3f,weights=TRUE)
BICtab(m0f,m2f,m3f,nobs=nrow(orob1),weights=TRUE)
AICctab(m0f,m2f,m3f,nobs=nrow(orob1),weights=TRUE)
@

<<>>=
opts_chunk$set(warning=FALSE)
@

\section{Example: reed frog size predation}

Data from an experiment by Vonesh \citep{VoneshBolker2005}:

<<>>=
frogdat <- data.frame(
  size=rep(c(9,12,21,25,37),each=3),
  killed=c(0,2,1,3,4,5,rep(0,4),1,rep(0,4)))
frogdat$initial <- rep(10,nrow(frogdat))
@

<<>>=
library(ggplot2)
@

<<>>=
gg1 <- ggplot(frogdat,aes(x=size,y=killed))+geom_point()+
  stat_sum(aes(size=..n..))+
  labs(size="#")+scale_x_continuous(limits=c(0,40))+
  scale_size(breaks=1:3)
@

<<>>=
m3 <- mle2(killed~dbinom(prob=c*(size/d)^g*exp(1-size/d),
                         size=initial),
           data=frogdat,start=list(c=0.5,d=5,g=1))
pdat <- data.frame(size=1:40,initial=rep(10,40))
pdat1 <- data.frame(pdat,killed=predict(m3,newdata=pdat))
@

<<>>=
m4 <- mle2(killed~dbinom(prob=c*((size/d)*exp(1-size/d))^g,
                         size=initial),
           data=frogdat,start=list(c=0.5,d=5,g=1))
pdat2 <- data.frame(pdat,killed=predict(m4,newdata=pdat))
@

<<>>=
gg1 + geom_line(data=pdat1,colour="red")+
  geom_line(data=pdat2,colour="blue")
@

<<>>=
coef(m4)
prof4 <- profile(m4)
@

Three different ways to draw the profile:

(1) Built-in method (base graphics):

<<>>=
plot(prof4)
@

(2) Using \code{xyplot} from the \code{lattice} package:

\setkeys{Gin}{width=\textwidth}
<<>>=
prof4_df <- as.data.frame(prof4)
library(lattice)
xyplot(abs(z)~focal|param,data=prof4_df,
       subset=abs(z)<3,
       type="b",
       xlab="",
       ylab=expression(paste(abs(z),
                             " (square root of ",Delta," deviance)")),
       scales=list(x=list(relation="free")),
       layout=c(3,1))
@

(3) Using \code{ggplot} from the \code{ggplot2} package:

<<>>=
ss <- subset(prof4_df,abs(z)<3)
ggplot(ss, aes(x=focal,y=abs(z)))+geom_line()+
  geom_point()+
  facet_grid(.~param,scales="free_x")
@

\section*{Additions/enhancements/differences from \code{stats4::mle}}
\begin{itemize}
\item{\code{anova} method}
\item{warnings on convergence failure}
\item{more robust to non-positive-definite Hessian; can also specify
  \code{skip.hessian} to skip Hessian computation when it is
  problematic}
\item{when profiling fails because a better value is found, report the
  new values}
\item{can take named vectors as well as lists as starting parameter
  vectors}
\item{added \code{AICc}, \code{BIC} definitions, \code{ICtab}
  functions}
\item{added \code{"uniroot"} and \code{"quad"} options to
  \code{confint}}
\item{more options for colors, line types, etc. The old arguments are:
<<>>=
function (x, levels, conf = c(99, 95, 90, 80, 50)/100, nseg = 50,
          absVal = TRUE, ...) {}
@
The new ones are:
<<>>=
function (x, levels, which=1:p, conf = c(99, 95, 90, 80, 50)/100,
          nseg = 50, plot.confstr = FALSE, confstr = NULL, absVal = TRUE,
          add = FALSE, col.minval="green", lty.minval=2, col.conf="magenta",
          lty.conf=2, col.prof="blue", lty.prof=1, xlabs=nm, ylab="score",
          show.points=FALSE, main, xlim, ylim, ...) {}
@
\code{which} selects (by character vector or numbers) which parameters
to plot; \code{nseg} does nothing (even in the old version);
\code{plot.confstr} turns on the labels for the confidence levels;
\code{confstr} gives the labels; \code{add} specifies whether to add
the profile to an existing plot; the \code{col} and \code{lty} options
specify the colors and line types for the horizontal and vertical lines
marking the minimum and the confidence limits, and for the profile
curve; \code{xlabs} gives a vector of x labels; \code{ylab} gives the y
label; \code{show.points} specifies whether to show the raw points
computed.
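For example, a sketch of how the newer options might be combined (not
evaluated; it reuses the \code{p2} profile computed earlier):
<<eval=FALSE>>=
## hypothetical illustration of the newer profile-plot options
plot(p2, which="theta", plot.confstr=TRUE, show.points=TRUE,
     col.conf="gray", lty.conf=3)
@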
}
\item{\code{mle.options()}}
\item{\code{data} argument}
\item{handling of names in argument lists}
\item{can use alternative optimizers (\code{nlminb}, \code{nlm},
  \code{constrOptim}, \code{optimx}, \code{optimize})}
\item{uses code from the \code{numDeriv} package to compute Hessians
  rather than built-in optimizer code}
\item{by default, uses \code{MASS::ginv} (generalized inverse) rather
  than \code{solve} to invert the Hessian (more robust to
  positive-semidefinite Hessians \ldots)}
\item{can use \code{vecpar=TRUE} (and \code{parnames()}) to use
  objective functions with parameters specified as vectors (for
  compatibility with \code{optim} etc.)}
\end{itemize}

\section{Newer stuff}

\textbf{To do:}
\begin{itemize}
\item{use \code{predict}, \code{simulate}, etc. to demonstrate
  different parametric bootstrap approaches to confidence and
  prediction intervals
  \begin{itemize}
  \item use \code{predict} to get means and standard deviations, use
    delta method?
  \item use \code{vcov}, assuming quadratic profiles, with
    \code{predict(\ldots,newparams=\ldots)}
  \item prediction intervals assuming no parameter uncertainty with
    \code{simulate}
  \item both together \ldots
  \end{itemize}
}
\end{itemize}

\section{Technical details}

\subsection{Profiling and confidence intervals}

This section describes the algorithm for constructing profiles and
confidence intervals, which is not otherwise documented anywhere except
in the code. * indicates changes from the version in
\code{stats4:::mle}.

\subsubsection{Estimating standard error}

In order to construct the profile for a particular parameter, one needs
an initial estimate of the scale over which to vary that parameter. The
estimated standard error of the parameter, based on the estimated
curvature of the likelihood surface at the MLE, is a good guess.
\begin{itemize}
\item if \code{std.err} is missing, extract the standard error from the
  summary coefficient table (ultimately computed from
  \code{sqrt(diag(inverse Hessian))} of the fit)
\item * a user-set value of \code{std.err} overrides this behavior
  unless the value is specified as \code{NA} (in which case the
  estimate from the previous step is used)
\item * if the standard error value is still \code{NA} (i.e. the user
  did not specify it and the value estimated from the Hessian is
  missing or \code{NA}), use \code{sqrt(1/diag(hessian))}. This
  represents a (fairly feeble) attempt to come up with a plausible
  number when the Hessian is not positive definite but still has
  positive diagonal entries
\item if all else fails, stop and * print an error message that
  encourages the user to specify the values with \code{std.err}
\end{itemize}

There may be further tricks that would help guess the appropriate
scale: for example, one could guess on the basis of a comparison
between the parameter values and negative log-likelihoods at the
starting and ending points of the fits. On the other hand, (a) this
would take some effort and still be subject to failure for sufficiently
pathological fits, and (b) there is some value to forcing the user to
take explicit, manual steps to remedy such problems, as they may be
signs of poorly defined or buggy log-likelihood functions.

\subsubsection{Profiling}

Profiling is done on the basis of a constructed function that minimizes
the negative log-likelihood for a fixed value of the focal parameter
and returns the signed square root of the deviance difference from the
minimum (denoted by $z$). At the MLE $z=0$ by definition; the deviance
difference should never be $<0$ unless something has gone wrong with
the original fit.
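In symbols (a restatement of the definition above): if
$\tilde{\ell}(p)$ denotes the negative log-likelihood minimized over
all non-focal parameters with the focal parameter fixed at $p$, and
$\ell_{\min}$ is the overall minimum negative log-likelihood attained
at the MLE $\hat{p}$, then
\begin{displaymath}
z(p) = \mathrm{sign}(p - \hat{p}) \,
       \sqrt{2 \left( \tilde{\ell}(p) - \ell_{\min} \right)} .
\end{displaymath}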
The LRT significance cutoffs for $z$ are equal to the usual two-tailed
normal distribution cutoffs (e.g. $\pm \approx 1.96$ for 95\%
confidence regions).

In each direction (decreasing and increasing from the MLE for the focal
parameter):
\begin{itemize}
\item fix the focal parameter
\item adjust control parameters etc. accordingly (e.g. remove the entry
  for the focal parameter so that the remaining control parameters
  match the non-fixed parameters)
\item{controls on the profiling (which can be set manually, but for
    which there is not much guidance in the documentation):
  \begin{itemize}
  \item \code{zmax} Maximum $z$ to aim for. (Default:
    \code{sqrt(qchisq(1-alpha/2, p))}.) The default maximum $\alpha$
    (type~I error) is 0.01.
    \bbnote{I don't understand this criterion. It seems to expand the
      size of the univariate profile to match a cutoff for the
      multivariate confidence region of the model. The $\chi^2$ cutoff
      for deviance to get the $(1-\alpha)$ multivariate confidence
      region (i.e., on all $p$ of the parameters) would be
      \code{qchisq(1-alpha,p)} --- %
      representing a one-tailed test on the deviance. Taking the square
      root makes sense, since we are working with the square root of
      the deviance, but I don't understand (1) why we are expanding the
      region to allow for the multivariate confidence region (since we
      are computing univariate profiles) [you could at least argue that
      this is conservative, making the region a little bigger than it
      needs to be]; (2) why we are using $1-\alpha/2$ rather than
      $1-\alpha$.}
    For comparison, \code{MASS::profile.glm} (written by Bates and
    Venables in 1996, ported to R by BDR in 1998) uses
    \code{zmax}=\code{sqrt(qchisq(1-alpha,1))}
    \bbnote{(this makes more sense to me \ldots)}. On the other hand,
    the profiling code in \code{lme4a} (the \code{profile} method for
    \code{merMod}, in \code{profile.R}) uses
    \code{qchisq(1-alphamax, nptot)} \ldots
  \item \code{del} Step size (scaled by standard error). (Default:
    \code{zmax}/5.) Presumably (?) copied from
    \code{MASS::profile.glm}, which says (in \code{?profile.glm}):
    ``[d]efault value chosen to allow profiling at about 10 parameter
    values.''
  \item \code{maxsteps} Maximum number of profiling steps to try in
    each direction. (Default: 100.)
  \end{itemize}
}
\item While \verb+step