\documentclass{article}
%\VignetteEngine{knitr::knitr}
%\VignetteDepends{fitode}
%\VignetteDepends{bbmle}
%\VignetteIndexEntry{Getting started with fitode package}
\title{Getting started with the fitode package}
\author{Sang Woo Park}
\usepackage{amsmath}
\usepackage{natbib}
\usepackage{hyperref}
\newcommand{\rzero}{{\cal R}_0}
\newcommand{\code}[1]{{\tt #1}}
\newcommand{\bmb}[1]{{\color{blue} bmb: \emph{#1}}}
\bibliographystyle{chicago}
\date{\today}
\begin{document}
\maketitle
<<>>=
library("knitr")
## DON'T cache=TRUE or warning=FALSE globally
## DO set error=FALSE so that errors are not caught
## (except temporarily for debugging purposes)
opts_chunk$set(fig.width=6, fig.height=4, error=FALSE)
knit_hooks$set(basefig=function(before, options, envir) {
    if (before) {
        oldpar <- par(bty="l", las=1)
        on.exit(par(oldpar))
    }
})
@
\tableofcontents
\pagebreak

\section{Introduction}

\code{fitode} is an R package for fitting ordinary differential equation
(ODE) models to data by maximum likelihood or Bayesian Markov chain Monte
Carlo (MCMC). It relies on the symbolic differentiation features of the
\code{Deriv} package to solve the sensitivity equations, so that
gradient-based optimization algorithms can be used. Its features include:
\begin{itemize}
\item response distributions: Gamma, Gaussian, Poisson, and negative
  binomial (NB1 and NB2 parameterizations)
\item link functions on model parameters: log, logit, and identity
\item fitting multiple states to multivariate time series
\item priors/penalization: Beta, Gamma, and Gaussian distributions
\item confidence intervals on parameters and their transformations via
  the delta method, profiling, and importance sampling
\end{itemize}
In order to construct a model in \code{fitode} you need to:
\begin{itemize}
\item specify the gradients using formula notation (e.g., $dX/dt=f(X)$
  is expressed as \verb+X ~ f(X)+)
\item specify the observation process using formula notation (e.g.,
  \verb+Xobs ~ dnorm(mean=X, sd=sigma)+)
\item specify the initial conditions using formula notation
\item specify the parameters of the model
\item specify the link functions (the log link is the default)
\end{itemize}
To fit a model, you need to:
\begin{itemize}
\item specify the data (as well as the time column)
\item specify the starting values for optimization or MCMC
\item optionally specify fixed parameters
\item optionally specify prior distributions (or penalizations); not
  specifying prior distributions in MCMC will result in improper priors
  on the link scales
\end{itemize}
This document was generated using \Sexpr{R.version$version.string} and
package versions:
<<>>=
used.pkgs <- c("bbmle", "Deriv", "deSolve", "fitode", "ggplot2")
pkgver <- vapply(sort(used.pkgs),
                 function(x) as.character(packageVersion(x)), "")
print(pkgver, quote=FALSE)
@

\section{Basic fitting - estimating epidemic growth rates}

\subsection{Data}

Here, we study a time series of confirmed cases from the 2014 Ebola
outbreak in Sierra Leone in order to characterize epidemic growth
patterns. Once you load \code{fitode}, the data set
(\code{SierraLeone2014}) is automatically available:
<<>>=
library(ggplot2); theme_set(theme_bw())
library(fitode)
plot(SierraLeone2014)
@

\subsection{Exponential growth model}

Exponential growth is one of the simplest models we can use to
characterize the initial spread of a disease:
\begin{equation}
\frac{dX}{dt} = rX.
\end{equation}
This model is parameterized by the initial growth rate $r$ and the
initial value $X(0)$.
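For reference, this ODE has the closed-form solution
\begin{equation}
X(t) = X(0)\, e^{rt},
\end{equation}
which we will use below to cross-check the ODE-based fit against a direct
maximum-likelihood fit.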
The variable $X$ describes the dynamics of the \emph{mean} number of
confirmed cases; we will assume that the observed number of confirmed
cases at time $t$ follows a Poisson distribution with mean $X(t)$. This
model can be constructed in \code{fitode} as follows:
<<>>=
exp_model <- odemodel(
    name="exponential",
    model=list(
        X ~ r * X
    ),
    observation=list(
        confirmed ~ dpois(lambda=X)
    ),
    initial=list(
        X ~ X0
    ),
    par=c("r", "X0")
)
@
Note that the name(s) of the observed variable(s) (here,
\code{confirmed}) must be different from the name(s) of the state
variable(s) (here, \code{X}).

In order to fit this model to the data, we have to specify starting
parameters for the optimization. To do so, we can simulate the model for
various parameter values and try to find a reasonable parameter set by
eye. For example, here is a parameter set found by trial and error:
<<>>=
start <- c(r=7, X0=30)
ss <- simulate(exp_model, parms=start, times=SierraLeone2014$times)
plot(SierraLeone2014)
lines(X ~ times, data=ss)
abline(v=2014.8, col="red", lty=2)
@
Here, we used the \code{simulate} function to simulate the model. It
requires a parameter set (\code{parms} argument) and a time vector
(\code{times} argument) to run. It returns the deterministic ODE solution
for each state variable as well as stochastic simulated observations
based on the ODE solution; we will ignore the simulated observations for
now.

The data do not exhibit exponential growth indefinitely. In order to fit
the exponential model, we have to choose a fitting window. Here we will
fit the model from the beginning of the epidemic to time 2014.8 (red
dashed line in the previous figure).
<<>>=
exp_fit <- fitode(
    model=exp_model,
    data=subset(SierraLeone2014, times <= 2014.8),
    start=start
)
@
The estimated parameters are very close to our initial guess:
<<>>=
exp_fit
@
Since the exponential ODE has a simple closed-form solution (shown
above), we could also have used maximum likelihood directly in this case:
<<>>=
mlefit <- bbmle::mle2(confirmed ~ dpois(X0*exp(r*(times-times[1]))),
                      data=subset(SierraLeone2014, times <= 2014.8),
                      start=as.list(start))
coef(mlefit)
@
We can quantify the uncertainty in the parameters by using \code{confint}:
<<>>=
confint(exp_fit)
@
By default, \code{confint} calculates Wald confidence intervals via the
delta method.

We diagnose the fit by using the \code{plot} function (the
\code{level=0.95} argument specifies that 95\% confidence intervals
should be drawn):
<<>>=
plot(exp_fit, level=0.95)
@
The confidence intervals on our predictions are suspiciously narrow,
probably because of our choice of error distribution. The Poisson
distribution assumes that the variance of the observations is equal to
the mean (i.e., the fitted value). Instead, we can use a negative
binomial distribution, which assumes that the variance is a quadratic
function of the mean. We then have to estimate an extra parameter (the
\code{size} argument of \code{dnbinom}) to account for overdispersion.
We use the \code{update} function to adjust only these particular aspects
of the model, leaving the gradient specification the same:
<<>>=
exp_fit_nbinom <- update(
    exp_fit,
    observation=list(
        confirmed ~ dnbinom(mu=X, size=phi)
    ),
    par=c("r", "X0", "phi"),
    start=c(start, phi=10)
)
@
Note that we need to specify a starting value for the overdispersion
parameter as well.
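For reference, in the \code{mu}/\code{size} parameterization used by
\code{dnbinom} (the NB2 parameterization mentioned in the introduction),
the variance is
\begin{equation}
\mathrm{Var}(Y) = \mu + \frac{\mu^2}{\phi},
\end{equation}
where $\phi$ is the \code{size} parameter; smaller values of $\phi$
correspond to more overdispersion, and the Poisson model is recovered in
the limit $\phi \to \infty$.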
Alternatively, we can update the \code{odemodel} object and refit the
model:
<<>>=
exp_model_nbinom <- update(exp_model,
    name="exponential (nbinom)",
    observation=list(
        confirmed ~ dnbinom(mu=X, size=phi)
    ),
    par=c("r", "X0", "phi")
)
exp_fit_nbinom2 <- fitode(
    model=exp_model_nbinom,
    data=SierraLeone2014[SierraLeone2014$times <= 2014.8, ],
    start=c(start, phi=10)
)
@
Both approaches give the same results. We can plot this fit:
<<>>=
plot(exp_fit_nbinom, level=0.95)
@
Our uncertainty estimates are now more reasonable. This change widens the
confidence intervals on the parameters as well:
<<>>=
confint(exp_fit_nbinom)
@

\subsection{Logistic growth model}

The exponential growth model accounts for only the initial portion of the
observed data. Instead, we might want to model the entire time series.
Note that the cumulative number of cases saturates over time:
<<>>=
plot(cumsum(confirmed) ~ times, data=SierraLeone2014)
@
We can use a logistic model to describe this saturating pattern:
\begin{equation}
\frac{dX}{dt} = r X \left(1 - \frac{X}{K}\right).
\end{equation}
While we could fit $X$ directly to the cumulative number of cases, doing
so can lead to overly confident results due to accumulation of
observation error \citep{king2015avoidable}. Instead, we can use
\emph{interval counts} to model the number of new cases in each reporting
interval: $X(t) - X(t - \Delta t)$, where $\Delta t$ is the reporting
time step (see the worked expression below). This is done by using the
\code{diffnames} argument:
<<>>=
logistic_model <- update(
    exp_model_nbinom,
    name="logistic (nbinom)",
    model=list(
        X ~ r * X * (1 - X/K)
    ),
    diffnames="X",
    par=c("r", "X0", "K", "phi")
)
@
In this case, we need to modify the data set by adding an extra \code{NA}
observation before the first observation; this allows \code{fitode} to
take the interval difference and still end up with the same number of
observations as the time series.
<<>>=
SierraLeone2014b <- rbind(
    c(times=SierraLeone2014$times[1] - diff(SierraLeone2014$times)[1],
      confirmed=NA),
    SierraLeone2014
)
@
Again, we can try to find a reasonable parameter set by trial and error:
<<>>=
start_logistic <- c(coef(exp_fit_nbinom),
                    K=sum(SierraLeone2014$confirmed))
## need to use a different value for X0
start_logistic[["X0"]] <- 300
ss_logistic <- simulate(
    logistic_model,
    parms=start_logistic,
    times=SierraLeone2014b$times
)
plot(SierraLeone2014)
lines(X ~ times, data=ss_logistic)
@
and fit the model:
<<>>=
logistic_fit <- fitode(
    logistic_model,
    data=SierraLeone2014b,
    start=start_logistic
)
@
In this case, we get a much higher growth rate estimate:
<<>>=
confint(logistic_fit)
@
Plot:
<<>>=
plot(logistic_fit, level=0.95)
@
There is a clear bias in our fit; the estimated trajectory underestimates
the peak of the epidemic. This is likely to affect our parameter
estimates. We can be smarter about our choice of fitting window. Instead
of using the entire time series, we can fit the logistic model from the
beginning of the epidemic to the next observation after the peak
\citep{ma2014estimating}.
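Before refitting, it is worth spelling out the interval-count
construction that \code{diffnames} implements. With \code{diffnames="X"},
the mean of the observation at time $t_i$ is the increment of the state
variable over the preceding reporting interval,
\begin{equation}
\Delta X_i = X(t_i) - X(t_{i-1});
\end{equation}
the \code{NA} row prepended above supplies the extra time point $t_0$
needed to define the first increment.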
<<>>=
ma_begin <- 1
ma_end <- which.max(SierraLeone2014b$confirmed) + 1
logistic_fit_ma <- update(
    logistic_fit,
    data=SierraLeone2014b[ma_begin:ma_end, ]
)
@
We get a much better fit:
<<>>=
plot(logistic_fit, level=0.95)
plot(logistic_fit_ma, level=0.95, add=TRUE, col.traj="red", col.conf="red")
@
We get slightly wider confidence intervals on the parameters because we
are using less data:
<<>>=
confint(logistic_fit_ma)
@

\subsection{SIR model}

The Susceptible-Infected-Recovered (SIR) model describes how a disease
spreads in a homogeneous population:
\begin{equation}
\begin{aligned}
\frac{dS}{dt} &= - \beta S \frac{I}{N}\\
\frac{dI}{dt} &= \beta S \frac{I}{N} - \gamma I\\
\frac{dR}{dt} &= \gamma I
\end{aligned}
\end{equation}
We can assume that confirmed cases are placed under control and are no
longer infectious, thus effectively recovering from infection
\citep{he2009plug}; in other words, we model the cumulative number of
confirmed cases with the cumulative number of recovered individuals
(state variable $R$). Again, we use interval counts by setting
\code{diffnames="R"}:
<<>>=
SIR_model <- odemodel(
    name="SIR (nbinom)",
    model=list(
        S ~ - beta * S * I/N,
        I ~ beta * S * I/N - gamma * I,
        R ~ gamma * I
    ),
    observation=list(
        confirmed ~ dnbinom(mu=R, size=phi)
    ),
    initial=list(
        S ~ N * (1 - i0),
        I ~ N * i0,
        R ~ 0
    ),
    diffnames="R",
    par=c("beta", "gamma", "N", "i0", "phi"),
    link=c(i0="logit")
)
@
For simplicity, we assume that there are no recovered individuals at the
beginning of the epidemic.\footnote{Since these individuals would be
completely uninvolved in the epidemic, and we are estimating the
population size, we can make this assumption without any loss of
generality.} The initial conditions are given by
\begin{equation}
\begin{aligned}
S(0) &= N (1 - i_0)\\
I(0) &= N i_0\\
R(0) &= 0
\end{aligned}
\end{equation}
where $i_0$ is the initial proportion of infected individuals. Setting
\code{link=c(i0="logit")} tells \code{fitode} that the parameter
\code{i0} must lie between 0 and 1.\footnote{The \emph{logit}, or
log-odds, function (\code{qlogis()} in R) is the inverse of the logistic
curve; it is a natural way to transform a value from the range $(0,1)$ to
$(-\infty,\infty)$.}

Searching for starting values:
<<>>=
SIR_start <- c(beta=70, gamma=60, N=40000, i0=0.0004, phi=6)
ss_SIR <- simulate(SIR_model, parms=SIR_start, times=SierraLeone2014b$times)
plot(SierraLeone2014)
lines(ss_SIR$times, ss_SIR$R)
@
Fit:
<<>>=
SIR_fit <- fitode(
    SIR_model,
    data=SierraLeone2014b,
    start=SIR_start
)
@
Plot:
<<>>=
plot(SIR_fit, level=0.95)
@
Again, the SIR model underestimates the peak. This could be a problem
with the fitting window. When we get rid of the long tail of the time
series, we get a much better fit:
<<>>=
SIR_fit_b <- update(
    SIR_fit,
    data=SierraLeone2014b[SierraLeone2014b$times < 2015.4, ]
)
@
<<>>=
plot(SIR_fit_b, level=0.95)
@
There are several ways we can get confidence intervals on the growth rate
($r = \beta - \gamma$). By default, the package uses the delta method:
<<>>=
confint(SIR_fit_b, parm=list(r~beta-gamma))
@
We discuss other methods later.
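In outline, the delta-method calculation for a linear combination like
$r = \beta - \gamma$ works as follows: if
$\widehat{\mathrm{Var}}(\hat\beta)$, $\widehat{\mathrm{Var}}(\hat\gamma)$,
and $\widehat{\mathrm{Cov}}(\hat\beta,\hat\gamma)$ denote the estimated
variances and covariance of the parameter estimates, then
\begin{equation}
\widehat{\mathrm{Var}}(\hat r) =
\widehat{\mathrm{Var}}(\hat\beta) + \widehat{\mathrm{Var}}(\hat\gamma)
- 2\,\widehat{\mathrm{Cov}}(\hat\beta,\hat\gamma),
\end{equation}
and the reported interval is the Wald interval
$\hat r \pm z_{0.975}\sqrt{\widehat{\mathrm{Var}}(\hat r)}$.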
Figure~\ref{fig:fitsummary} compares the results of all of the methods we
have tried.
<<fitsummary,fig.cap="Estimates of the initial epidemic growth rate from all of the models and fitting windows considered.">>=
fit_summ <- data.frame(
    fits=c("exponential\n(poisson)", "exponential\n(nbinom)",
           "logistic\n(full)", "logistic\n(window)",
           "SIR\n(full)", "SIR\n(window)"),
    estimate=c(coef(exp_fit)[1], coef(exp_fit_nbinom)[1],
               coef(logistic_fit)[1], coef(logistic_fit_ma)[1],
               -diff(coef(SIR_fit)[1:2]), -diff(coef(SIR_fit_b)[1:2])),
    lwr=c(confint(exp_fit)[1,2], confint(exp_fit_nbinom)[1,2],
          confint(logistic_fit)[1,2], confint(logistic_fit_ma)[1,2],
          confint(SIR_fit, parm=list(r~beta-gamma))[2],
          confint(SIR_fit_b, parm=list(r~beta-gamma))[2]),
    upr=c(confint(exp_fit)[1,3], confint(exp_fit_nbinom)[1,3],
          confint(logistic_fit)[1,3], confint(logistic_fit_ma)[1,3],
          confint(SIR_fit, parm=list(r~beta-gamma))[3],
          confint(SIR_fit_b, parm=list(r~beta-gamma))[3])
)
fit_summ$fits <- factor(fit_summ$fits, levels=fit_summ$fits)
print(ggplot(fit_summ) +
      geom_pointrange(aes(fits, estimate, ymin=lwr, ymax=upr)) +
      labs(y="Initial epidemic growth rate") +
      coord_flip()
)
@

\section{Advanced fitting - multivariate time series}

Data:
<<>>=
## FIXME: store these data locally
hare <- read.csv("https://raw.githubusercontent.com/stan-dev/example-models/master/knitr/lotka-volterra/hudson-bay-lynx-hare.csv",
                 skip=2)
plot(Hare ~ Year, data=hare, type="l")
lines(Lynx ~ Year, data=hare, type="l", col=2)
@
The Lotka-Volterra predator-prey model:
\begin{equation}
\begin{aligned}
\frac{du}{dt} &= \alpha u - \beta uv\\
\frac{dv}{dt} &= \delta uv - \gamma v
\end{aligned}
\end{equation}
<<>>=
lotka_model <- odemodel(
    name="Lotka Volterra model",
    model=list(
        u ~ alpha * u - beta * u * v,
        v ~ delta * u * v - gamma * v
    ),
    observation=list(
        Hare ~ dnbinom(mu=u, size=size1),
        Lynx ~ dnbinom(mu=v, size=size2)
    ),
    initial=list(
        u ~ u0,
        v ~ v0
    ),
    par=c("alpha", "beta", "delta", "gamma", "u0", "v0", "size1", "size2")
)
@
Fit with good starting values (estimated by someone else):
<<>>=
harestart <- c(alpha=0.55, beta=0.028, delta=0.026, gamma=0.84,
               u0=30, v0=10, size1=1, size2=1)
harefit <- fitode(lotka_model, data=hare, start=harestart, tcol="Year")
plot(harefit, level=0.95)
@
The estimates of the \code{size} parameters are extremely large:
<<>>=
coef(harefit)
@
This suggests that a Poisson error model is actually good enough:
<<>>=
## FIXME: we need this (fancy stuff with filling in all
## of the links as log) now because I'm checking links more carefully
## is there a way around this?
poisson_pars <- setdiff(lotka_model@par, c("size1", "size2"))
harefit_poisson <- update(
    harefit,
    observation=list(
        Hare ~ dpois(lambda=u),
        Lynx ~ dpois(lambda=v)
    ),
    link=setNames(rep("log", length(poisson_pars)), poisson_pars),
    par=poisson_pars
)
@
<<>>=
plot(harefit_poisson, level=0.95)
@
Using \code{confint()} on the two models (with the default Wald/delta
method) shows that the confidence intervals on the parameters are nearly
identical, except for the two negative binomial dispersion parameters,
which have extremely (ridiculously) wide confidence intervals; e.g., the
95\% CI for the dispersion parameter for hares is
(\Sexpr{confint(harefit)["size1","2.5 %"]},
\Sexpr{confint(harefit)["size1","97.5 %"]}). This problem occurs because
the standard Wald approximation fails badly for these parameters. We can
get \emph{lower} bounds on the confidence intervals for the dispersion
parameters by using likelihood profiling.
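To see why, recall that \code{fitode} estimates parameters on the link
scale (log, by default), so the reported Wald interval for a dispersion
parameter $\phi$ is back-transformed from a symmetric interval in
$\log\phi$:
\begin{equation}
\left( \exp\!\left(\log\hat\phi - z_{0.975}\,\widehat{\mathrm{SE}}\right),
\; \exp\!\left(\log\hat\phi + z_{0.975}\,\widehat{\mathrm{SE}}\right)
\right),
\end{equation}
where $\widehat{\mathrm{SE}}$ is the estimated standard error of
$\log\hat\phi$. When the log-likelihood is nearly flat in $\log\phi$, as
it is whenever the negative binomial is close to its Poisson limit, the
standard error is enormous and the interval explodes in both directions.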
We have to work a little harder; we (1) manually set the parameter
standard error to provide an initial scale for the profile (since the
Wald estimate of the standard errors fails badly in this case) and (2)
allow the profiling to proceed even if it discovers a fit that is
slightly better (by up to 0.1 log-likelihood units) than the original
fit.
<<>>=
confint(harefit, parm=c("size1", "size2"), method="profile",
        std.err=1, tol.newmin=0.1)
@
The results tell us that the upper 95\% confidence limits are undefined
(as would be expected if the model is not significantly better than
Poisson), and the lower 95\% limits are $\approx 50$.

\bibliography{fitode}
\end{document}