---
title: "modelObj: A Model Object Framework for Regression Analysis"
author: Shannon T. Holloway
output: pdf_document
vignette: >
  %\VignetteIndexEntry{modelObj}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

\newcommand{\bma}[1]{\mbox{\boldmath $#1$}}
\newcommand{\proglang}[1]{\texttt{#1}}
\newcommand{\code}[1]{\texttt{#1}}
\newcommand{\pkg}[1]{\texttt{#1}}
\newcommand{\var}[1]{\textbf{#1}}
\setlength\parindent{0pt}

\begin{abstract}
It is becoming common practice for researchers to disseminate new statistical methods to the broader research community by developing and releasing \proglang{R} packages that implement their methods. Often, methods are developed on the framework of traditional regression. To simplify software development, researchers and developers often make choices regarding the types of regression models that can be used, hard-coding a specific regression method into the library and limiting or eliminating the ability of the user to modify regression control parameters. In many instances, there is not a fundamental reason why a new statistical method should use a specific regression method (e.g., linear vs. non-linear) and such choices artificially limit the general application of new packages. In addition, if a new method requires multiple regression steps, a developer must artificially break the method into multiple function calls, each for a specific regression step, or provide a cumbersome and/or confusing interface. We have developed a new \proglang{R} package to facilitate the use of existing and future \proglang{R} regression libraries that simplifies the development of general, non-model-specific implementations of new statistical methods.
\end{abstract}

\section{Introduction}

IMPACT is a joint venture of the University of North Carolina at Chapel Hill, Duke University, and North Carolina State University.
The program project aims to improve the health and longevity of people by improving the clinical trial process. A key component of this research has been the development of public-use software packages that implement new statistical methods developed by the 30$+$ investigators. Whenever possible, these methods have been developed in \proglang{R}. The \code{modelObj} library was born from the need to create simple, general-use implementations of new statistical methods that do not limit the underlying regression method(s) and do not require continued upgrading as new regression methods become available.

When creating \proglang{R} packages for statistical methods developed on the framework of traditional regression or classification methods, researchers and/or software developers often make choices regarding the types of models that can be specified by the user, hard-coding the regression method into the library and limiting or eliminating the ability of the user to modify regression control parameters. However, the choice of a specific regression method may not be a fundamental constraint of the new method, and such choices can limit the general application of an implementation. In addition, a new method may require multiple regression steps. For example, \pkg{DynTxRegime} implements the Augmented Inverse Probability Weighted Estimators (AIPWE) for average treatment effects and requires multiple regression analyses. To implement this method generally without using the framework described herein would require that the procedure be artificially broken into multiple function calls, each for a specific regression step, or that the user interface to the method be cumbersome and/or confusing.

The \pkg{modelObj} library is built on the premise of a ``model object."
A model object contains all of the information needed to complete a standard regression analysis and subsequent prediction step: a formula object, the existing \proglang{R} regression method to be used to obtain parameter estimates (the so-called solver method), any control arguments to be passed to the regression method, the \proglang{R} method to be used to obtain predictions, and any arguments to be passed to the prediction method. This information is grouped into a single object of class \code{modelObj} by a call to \code{buildModelObj(\dots)}. To use a package built on the model object framework, the user creates the \code{modelObj} prior to calling the statistical method and passes the model object as input. The \code{modelObj} library provides simple functions that developers can use to implement standard regression procedures, such as \code{fit(\dots)} to obtain parameter estimates and \code{predict(\dots)} to obtain predictions. \section{Interacting with packages that implement the model object framework.} Users of packages that have been developed based on the model object framework will interface with the \code{modelObj} library through calls to \code{buildModelObj(\dots)}. These calls create a model object for a single regression step and are passed as input to the method. The \code{buildModelObj(\dots)} function takes as input \begin{itemize} \item{\var{model} : } {an object of class formula. Any lhs variables provided will be ignored. If the fitting function specified in \var{solver.method} takes as input a model matrix rather than a formula object, \var{model} will be used to obtain the model matrix.} \item{\var{solver.method} : } {an object of class character; the name of the \proglang{R} regression method. For example, a user might commonly specify `lm' or `glm.' For classification, `rpart' might be used. 
\textbf{The specified method MUST have a corresponding predict method.}}
\item{\var{solver.args} : } {an object of class list; additional arguments to be passed to solver.method. The name of each element of the list must match a formal argument of solver.method. For example, for logistic regression using glm:
\begin{verbatim}
solver.method = "glm"
solver.args = list("family"=binomial)
\end{verbatim}
If \var{solver.method} takes as input a formula object, it is assumed that the function specified has formal arguments ``formula" and ``data." If the solver.method does not use ``formula" and/or ``data," \var{solver.args} must explicitly indicate the variable names used for these inputs. For example, list(``x"=``formula") if the formula object is passed to solver.method through input argument ``x" or list(``df"=``data") if the data.frame object is passed to solver.method through input argument ``df."

If \var{solver.method} instead takes as input a model matrix, it is assumed that the function specified has formal arguments ``x" and ``y" for the design matrix and response, respectively. If the solver.method does not use ``x" and/or ``y," \var{solver.args} must explicitly indicate the variable names used for these inputs. For example, list(``X"=``x") if the design matrix is passed to solver.method through input argument ``X" or list(``Y"=``y") if the response is passed to solver.method through input argument ``Y."}
\item{\var{predict.method} : } {an object of class character; the function name of the \proglang{R} function to be used to obtain predictions. For example, `predict.lm' or `predict.glm.' If no function is explicitly given, the generic `predict' method is assumed. Most often, this input can be omitted.}
\item{\var{predict.args} : } {an object of class list; additional arguments to be passed to predict.method. The name of each element of the list must match a formal argument of predict.method.
For example, if a logistic regression using glm was used to fit the model formula object and predictions on the scale of the response are desired,
\begin{verbatim}
solver.method = "glm"
solver.args = list("family"=binomial)
predict.method = "predict"
predict.args = list("type"="response")
\end{verbatim}
It is assumed that the \proglang{R} method specified in predict.method has formal arguments ``object" and ``newdata." If predict.method does not use these formal arguments, predict.args must explicitly indicate the variable names used for these inputs. For example, list(``x"=``object") if the object returned by solver.method is passed to predict.method through input argument ``x" or list(``ndf"=``newdata") if the data.frame object is passed to predict.method through input argument ``ndf." }
\end{itemize}

Unless modified through \var{solver.args} and \var{predict.args}, default settings are assumed for the methods specified in \var{solver.method} and \var{predict.method}.

As a simple example,

```{r}
library(modelObj)
object1 <- buildModelObj(model=~x1, solver.method='lm')
```

defines a model object for a linear model, the parameter estimates of which are to be obtained using \code{lm}, and predictions obtained using \code{predict}. The solver and prediction methods will use default settings.

As a more complex (though contrived) example, consider the following functions

```{r}
## Solver: fits by least squares from a design matrix X and response Y
mylm <- function(X,Y){
  obj <- list()
  obj$lm <- lm.fit(x=X, y=Y)
  obj$var <- "does something neat"
  class(obj) <- "mylm"
  return(obj)
}

## Predictor: returns exp() of the fitted values or of the linear
## predictor evaluated at the design matrix passed through `data`
predict.mylm <- function(obj,data=NULL){
  if( is.null(data) ) {
    obj <- exp(obj$lm$fitted.values)
  } else {
    obj <- data %*% obj$lm$coefficients
    obj <- exp(obj)
  }
  return(obj)
}
```

which, for the sake of argument, represent a ``new" regression method that a user would like to utilize. These functions are chosen to illustrate solver methods and prediction methods that do not accept the standard formal arguments.
They provide a simple illustration of how flexible the framework can be. In this circumstance, the user would define the following modeling object:

```{r}
object2 <- buildModelObj(model = ~x1,
                         solver.method = mylm,
                         solver.args = list('X' = "x", 'Y' = "y"),
                         predict.method = predict.mylm,
                         predict.args = list('obj' = "object",
                                             'data' = "newdata"))
```

\section{Developing packages that implement the model object framework.}

The \code{buildModelObj()} function invoked by a user returns an object of class \code{modelObj}. Developers who use this utility package should carefully document for users any required settings for \var{solver.method} and \var{predict.method}, for example, the scale of the response needed for predictions. Though the developer can access and modify the argument lists provided by users using methods \code{predictorArgs()} and \code{solverArgs()}, there is no strict variable naming convention in \proglang{R}, and some methods do not adhere to the ``usual" choices. Thus, identifying the formal argument to adjust may be tricky.

Once provided an object of class \code{modelObj}, developers can **see** all of the values contained in the object but can modify only the argument lists to be passed to methods.
Specifically:
\begin{itemize}
\item{\code{model} } {Retrieves the formula object.}
\item{\code{solver} } {Retrieves the character name of the regression method.}
\item{\code{solverArgs} } {Retrieves the list of arguments to be passed to the regression method.}
\item{\code{predictor} } {Retrieves the character name of the prediction method.}
\item{\code{predictorArgs}} {Retrieves the list of arguments to be passed to the prediction method.}
\item{\code{solverArgs<-} } {Sets the list of arguments to be passed to the regression method.}
\item{\code{predictorArgs<-}} {Sets the list of arguments to be passed to the prediction method.}
\end{itemize}

The primary utility method available for objects of class \code{modelObj} is \code{fit(\dots)}, which implements the regression step. The inputs for \code{fit(\dots)} are:
\begin{itemize}
\item{\var{object} } {an object of class \code{modelObj}.}
\item{\var{data} } {an object of class data.frame; the covariates to be used to obtain the fit.}
\item{\var{response}} {an object of class numeric; the response.}
\item{\textbf{\dots}} {ignored.}
\end{itemize}

The \code{fit(\dots)} method constructs and executes the function call to the specified solver method using the formula object and formal arguments provided by the user in solver.args. The \code{fit(\dots)} method uses an internal naming convention for the response, and thus only the right-hand side of the formula object is referenced.

\code{fit(\dots)} returns an S4 object of class \code{modelObjFit}. Developers can access members of this class using the following methods:
\begin{itemize}
\item{\code{fitObject} } {Retrieves the value object returned by the regression method.
Through this retrieve method, developers have the ability to access any defined methods for the regression function, such as \code{coef}, \code{residuals}, or \code{plot}.}
\item{\code{predictor} } {Retrieves the character name of the prediction method to be used to obtain predictions.}
\item{\code{predictorArgs}} {Retrieves the list of arguments to be passed to the prediction method when making predictions.}
\end{itemize}

Note that \code{predictor} and \code{predictorArgs} only allow developers to **see** what has been specified for the prediction method. Should changes need to be made to the arguments, they must be applied to the defining \code{modelObj} before creating the \code{modelObjFit} object.

Additional methods available for \code{modelObjFit} objects are
\begin{itemize}
\item{\code{coef(\dots)} } {If defined for the regression method, returns the estimated coefficients.}
\item{\code{plot(\dots)} } {If defined for the regression method, generates the plot of the model fitting class.}
\item{\code{predict(\dots)} } {Obtains predictions from the results of the model fitting function.}
\item{\code{residuals(\dots)}} {If defined for the regression method, returns the residuals.}
\item{\code{show(\dots)}} {Uses the predefined show method of the regression method.}
\item{\code{summary(\dots)}} {Uses the predefined summary method of the regression method.}
\end{itemize}

Again, the value object returned by the regression method can be retrieved using \code{fitObject}, thereby providing access to any \proglang{R} methods developed for the regression method. We have chosen to implement only the most common methods (\code{coef()}, \code{residuals()}, etc.) for the \code{modelObjFit} object. Note that not all regression methods have these functions. If these functions are required in your implementation, additional checks must be incorporated into your code to ensure their availability.
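
One way to incorporate such a check is sketched below in base \proglang{R}. The helper \code{tryExtract()} and the class name \code{noMethods} are hypothetical and are not part of \pkg{modelObj}; within a package built on the framework, the first argument would presumably be the value returned by \code{fitObject}.

```r
# Hypothetical helper (not part of modelObj): apply an extractor such as
# coef() or residuals() and return NULL if the call raises an error.
tryExtract <- function(fit, extractor) {
  tryCatch(extractor(fit), error = function(e) NULL)
}

# coef() is defined for lm objects, so the estimates are returned.
fitLm <- lm(pressure ~ temperature, data = pressure)
cfsLm <- tryExtract(fitLm, coef)

# A bare classed list stands in for a fit that stores no coefficients;
# the result is NULL.
fitBare <- structure(list(), class = "noMethods")
cfsBare <- tryExtract(fitBare, coef)

# Callers test for NULL before using the value.
if (is.null(cfsBare)) message("coef() is not available for this fit")
```

Returning \code{NULL} in both the ``undefined" and the ``failed" case keeps the calling code to a single \code{is.null()} test, at the cost of hiding the reason for the failure.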
\section{Example Implementation of \code{modelObj}}

We use a standard \proglang{R} dataset to illustrate the implementation of the model object framework. The `pressure' data frame contains ``data on the relation between temperature in degrees Celsius and vapor pressure of mercury in millimeters (of mercury)." The details of the dataset are not relevant for this illustration. However, the reader is referred to \code{?pressure} for details.

```{r}
summary(pressure)
```

It is straightforward to implement a regression step. As an example, suppose we are developing a new package called \pkg{wow}. The primary function of this package is \code{exampleFun()}. In this function, we want to obtain a fit and return the square of the fitted response and the estimated coefficients in a list. Our function takes the following form

```{r}
exampleFun <- function(modelObj, data, Y){

  fitObj <- fit(object = modelObj, data = data, response = Y)

  ## Test that coef() is an available method
  cfs <- try(coef(fitObj), silent = TRUE)
  if( inherits(cfs, "try-error") ){
    warning("Provided regression method does not have a coef method.\n")
    cfs <- NULL
  }

  fitted <- predict(fitObj)^2

  return(list("fittedSq" = fitted, "coef" = cfs))
}
```

To use this function, a user must create the object of class \code{modelObj} and provide it as input to \code{exampleFun()}.
The user can implement a linear model as follows:

```{r}
ylog <- log(pressure$pressure)

objlm <- buildModelObj(model = ~temperature,
                       solver.method = "lm",
                       predict.method = "predict.lm",
                       predict.args = list("type"="response"))

fitObjlm <- exampleFun(objlm, pressure, ylog)
print(fitObjlm$coef)
```

Or, the non-linear least squares method \code{nls}:

```{r}
objnls <- buildModelObj(model = ~exp(a + b*temperature),
                        solver.method = "nls",
                        solver.args = list('start'=list(a=1, b=0.1)),
                        predict.method = "predict",
                        predict.args = list("type" = "response"))

fitObjnls <- exampleFun(objnls, pressure, pressure$pressure)
print(fitObjnls$coef)
```

Or, even the previously defined ``new" method:

```{r}
objectnew <- buildModelObj(model = ~temperature,
                           solver.method = mylm,
                           solver.args = list('X' = "x", 'Y' = "y"),
                           predict.method = predict.mylm,
                           predict.args = list('obj'="object",
                                               'data'="newdata"))

fitObjnew <- exampleFun(objectnew, pressure, ylog)
print(fitObjnew$coef)
```

In the last example, the function returned NULL for the parameter estimates because there is no available method to retrieve the estimated parameters.

The same function, \code{exampleFun()}, can be used to implement each of these models, and no development is required to extend the \pkg{wow} function to new regression methods as they become available.
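
The reason no further development is needed can be illustrated with a minimal base \proglang{R} sketch of the general idea: the solver name and its argument list are stored, and the call is assembled only when the fit is requested. This is an illustration under stated assumptions and not the \pkg{modelObj} implementation; the function \code{miniFit()} and the internal response name \code{YinternalY} are hypothetical.

```r
# Hypothetical stand-in for a generic fit step; NOT the modelObj code.
# rhs:        one-sided formula giving the right-hand side of the model
# solver:     character name of a regression function that accepts
#             formal arguments `formula` and `data`
# solverArgs: named list of additional arguments for the solver
miniFit <- function(rhs, solver, solverArgs = list(), data, response) {
  # Store the response under a fixed internal name and complete the formula
  data$YinternalY <- response
  form <- update(rhs, YinternalY ~ .)
  # Assemble and execute the call to the requested solver
  do.call(solver, c(list(formula = form, data = data), solverArgs))
}

ylog <- log(pressure$pressure)

# The same function serves different solvers without modification.
fitLm  <- miniFit(~temperature, "lm", data = pressure, response = ylog)
fitGlm <- miniFit(~temperature, "glm", list(family = gaussian()),
                  data = pressure, response = ylog)
```

Because the solver is referenced only by name, a regression function released in the future can be used through the same code path, provided it accepts the assumed \code{formula} and \code{data} arguments.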