%\VignetteIndexEntry{Penalized user guide}
\documentclass[a4paper]{article}

\usepackage{natbib}
\bibliographystyle{chicago}

\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}
\newcommand{\Rfunarg}[1]{{\textit{#1}}}

\sloppy

\title{L1 and L2 Penalized Regression Models}
\author{Jelle Goeman \and Rosa Meijer \and Nimisha Chaturvedi}
\date{Package version \Sexpr{packageDescription("penalized")$Version}\\Date: \today}

\begin{document}
\SweaveOpts{concordance=TRUE}

\maketitle

\tableofcontents

\newpage

\section{Citing \Rpackage{penalized}}

To cite the package \Rpackage{penalized}, please cite Goeman, J. J. (2010). L1 penalized estimation in the Cox proportional hazards model. \emph{Biometrical Journal} 52(1), 70--84.

\section{Introduction}

This short note explains the use of the \Rpackage{penalized} package. The package is designed for penalized estimation in generalized linear models. The lasso and elastic net algorithm that it implements is described in \cite{Goeman2010}. The models supported at this moment are linear regression, logistic regression, Poisson regression and the Cox proportional hazards model, but others are likely to be included in the future. As to penalties, the package allows an L1 absolute value (``lasso'') penalty \citep{Tibshirani1996, Tibshirani1997}, an L2 quadratic (``ridge'') penalty \citep{Hoerl1970, Cessie1992, Verweij1994}, or a combination of the two \citep[the ``naive elastic net'' of][]{Zou2005}. It is also possible to use a fused lasso penalty, which puts an L1 absolute value (``lasso'') penalty on both the coefficients and their differences \citep{Tibshirani, Tibshirani2007}. The package also includes facilities for likelihood cross-validation and for optimization of the tuning parameter.

L1 and L2 penalized estimation methods shrink the estimates of the regression coefficients towards zero relative to the maximum likelihood estimates. The purpose of this shrinkage is to prevent overfitting arising from either collinearity of the covariates or high-dimensionality. Although both are shrinkage methods, the effects of L1 and L2 penalization are quite different in practice. Applying an L2 penalty tends to result in all regression coefficients being small but non-zero, whereas applying an L1 penalty tends to result in many regression coefficients shrunk exactly to zero and a few other regression coefficients with comparatively little shrinkage. Combining L1 and L2 penalties tends to give a result in between, with fewer regression coefficients set to zero than in a pure L1 setting, and more shrinkage of the other coefficients. The fused lasso penalty, an extension of the lasso penalty, encourages sparsity of the coefficients and of their differences by penalizing the L1-norm of both at the same time, thus producing sparse and piecewise constant stretches of non-zero coefficients.

The amount of shrinkage is determined by tuning parameters $\lambda_1$ and $\lambda_2$. A value of zero always means no shrinkage (= maximum likelihood estimation) and a value of infinity means infinite shrinkage (= setting all regression coefficients to zero). For more details about the methods, please refer to the above-mentioned papers.
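In schematic form (a simplified sketch only; the cited papers give the exact formulation for each model), the package maximizes a penalized log-likelihood of the form
\[
\ell_{\mathrm{pen}}(\beta) \;=\; \ell(\beta) \;-\; \lambda_1 \sum_j |\beta_j| \;-\; \frac{\lambda_2}{2} \sum_j \beta_j^2,
\]
where $\ell(\beta)$ is the log-likelihood (or log partial likelihood) of the chosen model and the sums run over the penalized regression coefficients only.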
It is important to note that shrinkage methods are generally not invariant to the relative scaling of the covariates. Before fitting a model, it is prudent to consider whether the covariates already have a natural scaling relative to each other or whether they should be standardized.

The main algorithm for L1 penalized estimation (lasso, elastic net) that is used in this package is documented in \cite{Goeman2010}. It has been combined with ideas from \cite{Eilers2001} and \cite{Houwelingen2006} for efficient L2 penalized estimation. The algorithm used for fused lasso penalized estimation is described in Chaturvedi (2012).

<<echo=FALSE>>=
options(continue = "  ")
@

\section{Penalized likelihood estimation}

The basic function of the package is the \Rfunction{penalized} function, which performs penalized estimation for fixed values of $\lambda_1$ and $\lambda_2$. Its syntax has been loosely modeled on that of the functions \Rfunction{glm} (package \Rpackage{stats}) and \Rfunction{coxph} (package \Rpackage{survival}), but it is slightly more flexible in some respects. Two main input types are allowed: one using \Rclass{formula} objects, one using matrices.

\subsection{the nki70 data}

As example data we use the 70 gene signature of \cite{Veer2002} in the gene expression data set of \cite{Vijver2002}.

<<>>=
library(penalized)
library(survival)
data(nki70)
@

This loads a \Rclass{data.frame} with 144 breast cancer patients and 77 covariates. The first two covariates indicate the survival time and event status (time is in months), the next five are clinical covariates (diameter of the tumor, lymph node status, estrogen receptor status, grade of the tumor and age of the patients), and the other 70 are gene expression measurements of the 70 molecular markers. As we are interested in survival as an outcome, we also need the \Rpackage{survival} package.

<<>>=
set.seed(1)
@

\subsection{the penalized function}

The \Rfunction{penalized} function can be used to fit a penalized prediction model for prediction of a response. For example, to predict the Estrogen Receptor status \Robject{ER} for the patients in the \Robject{nki70} data with the two markers ``DIAPH3'' and ``NUSAP1'' at $\lambda_1 = 0$ and $\lambda_2 = 1$, we can say (all are equivalent)

<<>>=
fit <- penalized(ER, ~DIAPH3+NUSAP1, data=nki70, lambda2=1)
fit <- penalized(ER, nki70[,10:11], data=nki70, lambda2=1)
fit <- penalized(ER~DIAPH3+NUSAP1, data=nki70, lambda2=1)
@

The covariates may be specified in the second function argument (\Rfunarg{penalized}) as a \Rclass{formula} object with an empty left-hand side, as in the first line. Alternatively, they may be specified as a \Rclass{matrix} or \Rclass{data.frame}, as in the second line. If, as here, they are supplied as a \Rclass{data.frame}, they are coerced to a matrix. For consistency with \Rfunction{glm} and \Rfunction{coxph} the third option is also allowed, in which the covariates are included in the first function argument.

The \Rfunction{penalized} function tries to determine the appropriate generalized linear model from the \Rfunarg{response} variable. This automatic choice may not always be appropriate. In such cases the model may be specified explicitly using the \Rfunarg{model} argument.

For the examples in the rest of this vignette we use the Cox proportional hazards model, using the survival time (\Robject{Surv(time,event)}) as the response to be predicted. This is a \Rclass{Surv} object.

<<>>=
fit <- penalized(Surv(time,event)~DIAPH3+NUSAP1, data=nki70, lambda2=1)
@

We use \Rfunction{attach} to avoid specifying the \Rfunarg{data} argument every time.
<<>>=
attach(nki70)
@

\subsection{choice of lambda}

It is difficult to say in advance which value of \Rfunarg{lambda1} or \Rfunarg{lambda2} to use. The \Rpackage{penalized} package offers ways of finding optimal values using cross-validation. This is explained in Section \ref{cv}.

Note that for small values of \Rfunarg{lambda1} or \Rfunarg{lambda2} the algorithm may be very slow, may fail to converge or may run into numerical problems, especially in high-dimensional data. When this happens, increase the value of \Rfunarg{lambda1} or \Rfunarg{lambda2}.

It is possible to specify both \Rfunarg{lambda1} and \Rfunarg{lambda2}. In that case both types of penalty apply, resulting in a so-called \emph{elastic net}.

<<>>=
fit <- penalized(Surv(time,event)~DIAPH3+NUSAP1, data=nki70, lambda1=1, lambda2=1)
@

Sometimes it can be useful to have different values of \Rfunarg{lambda1} or \Rfunarg{lambda2} for different covariates. This can be done in \Rpackage{penalized} by specifying \Rfunarg{lambda1} or \Rfunarg{lambda2} as a vector.

<<>>=
fit <- penalized(Surv(time,event)~DIAPH3+NUSAP1, data=nki70, lambda2=c(1,2))
@

\subsection{penfit objects}

The \Rfunction{penalized} function returns a \Rclass{penfit} object, from which useful information can be extracted. For example, to extract (martingale) residuals, individual relative risks and the baseline survival curve, write

<<>>=
residuals(fit)[1:10]
fitted(fit)[1:10]
basesurv(fit)
@

See \Robject{help(penfit)} for more information on \Rclass{penfit} objects and Section \ref{breslow} on \Rclass{breslow} objects.

The \Rfunction{coefficients} function extracts the named vector of regression coefficients. It has an extra second argument \Rfunarg{which} that can be used to specify which coefficients are of interest. Possible values of \Rfunarg{which} are \Robject{nonzero} (the default) for extracting all non-zero coefficients, \Robject{all} for all coefficients, and \Robject{penalized} and \Robject{unpenalized} for only the penalized or unpenalized ones.

<<>>=
coefficients(fit, "all")
@

To extract the loglikelihood of the fit and the evaluated penalty function, use

<<>>=
loglik(fit)
penalty(fit)
@

The \Rfunction{loglik} function gives the loglikelihood without the penalty, and the \Rfunction{penalty} function gives the fitted penalty, i.e.\ for L1 it is \Robject{lambda1} times the sum of the absolute values of the fitted penalized coefficients, and for L2 it is 0.5 times \Robject{lambda2} times the sum of the squared penalized coefficients.

The \Rclass{penfit} object can also be used to generate predictions for new data using the \Rfunction{predict} function. Pretending that the first three subjects in the \Robject{nki70} data are new subjects, we can find their predicted survival curves with either of

<<>>=
predict(fit, ~DIAPH3+NUSAP1, data=nki70[1:3,])
predict(fit, nki70[1:3,c("DIAPH3","NUSAP1")])
@

See Section \ref{breslow} for more on \Rclass{breslow} objects. We can get five-year survival predictions by saying

<<>>=
pred <- predict(fit, nki70[1:3,c("DIAPH3","NUSAP1")])
survival(pred, time=5)
@

\subsection{standardization}

If the covariates are not naturally on the same scale, it is advisable to standardize them. The function argument \Rfunarg{standardize} (default: \Robject{FALSE}) standardizes the covariates to unit second central moment before applying penalization. This standardization makes sure that each covariate is affected more or less equally by the penalization.
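For illustration (a minimal sketch; the object name \Robject{sfit} is arbitrary), the Cox model used above can be refitted with standardized covariates by simply adding the \Rfunarg{standardize} argument.

<<>>=
sfit <- penalized(Surv(time,event)~DIAPH3+NUSAP1, data=nki70, lambda2=1,
  standardize=TRUE)
@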
The fitted regression coefficients that the function returns have been scaled back and correspond to the original scale of the covariates. To extract the regression coefficients of the standardized covariates, use the \Rfunction{coefficients} function with \Robject{standardize = TRUE}. This option is also available if the model was not fitted with standardized covariates, as the covariates are always standardized internally for numerical stability. To find the weights used by the function, use \Robject{weights(fit)}.

<<>>=
coefficients(fit)
coefficients(fit, standardize = TRUE)
weights(fit)
@

\subsection{unpenalized covariates}

In some situations it is desirable that not all covariates are subject to a penalty. Any additional covariates that should be included in the model without being penalized can be specified separately, using the third function argument (\Rfunarg{unpenalized}). For example (the two commands below are equivalent)

<<>>=
fit <- penalized(Surv(time,event), nki70[,8:77], ~ER, lambda2=1)
fit <- penalized(Surv(time,event)~ER, nki70[,8:77], lambda2=1)
@

This adds estrogen receptor status as an unpenalized covariate. Note in the second line that the right-hand side of the \Rclass{formula} object in the \Rfunarg{response} argument is automatically taken to be the \Rfunarg{unpenalized} argument, because the \Rfunarg{penalized} argument was given by the user.

In linear and logistic regression the intercept is by default never penalized. The use of an intercept can be suppressed with \Robject{unpenalized = \textasciitilde{}0}. The intercept is always removed from the penalized model matrix, unless the penalized model consists of only an intercept.

It is possible to include an offset term in the model. Use the \Rfunction{offset} function in the \Rfunarg{unpenalized} argument, which must then be of \Rclass{formula} type.

The Cox model implementation allows \Rfunction{strata} terms.

<<>>=
fit <- penalized(Surv(time,event)~strata(ER), nki70[,8:77], lambda2=1)
@

\subsection{factors}

If some of the covariates included in the \Rclass{formula} object \Rfunarg{penalized} are of type \Rclass{factor}, these are automatically made into dummy variables, as in \Rfunction{glm} and \Rfunction{coxph}, but in a special way that is more appropriate for penalized regression.

Unordered factors are turned into as many dummy variables as the factor has levels. This ensures a symmetric treatment of all levels and guarantees that the fit does not depend on the ordering of the levels. See \Robject{help(contr.none)} for details.

Ordered factors are turned into dummy variables that code for the difference between successive levels (one dummy less than the number of levels). L2 penalization on such factors therefore leads to small successive differences; L1 penalization leads to ranges of successive levels with identical effects. See \Robject{help(contr.diff)} for details.

When fitting a model with factors with more than two levels with an L1 penalty, it is advisable to add a small L2 penalty as well in order to speed up convergence. By varying the L2 penalty it can be checked that the L2 penalty is not so large that it influences the estimates. To override the automatic choice of contrasts, use \Rfunction{C} (package \Rpackage{stats}).

The \Rfunarg{response} argument may also be specified as a \Rclass{factor} in a logistic regression model. In that case, the value \Robject{levels(response)[1]} is treated as a failure (0), and all other values as a success (1).
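As a small illustration (a sketch only; it assumes that \Robject{Grade} in the \Robject{nki70} data is stored as a \Rclass{factor}, and the object name \Robject{fitg} is arbitrary), a factor can be included in the \Rfunarg{penalized} formula directly. Following the advice above, a small L2 penalty is added to the L1 penalty.

<<>>=
fitg <- penalized(Surv(time,event), ~Grade, data=nki70, lambda1=1, lambda2=0.1)
coefficients(fitg, "all")
@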
\subsection{fitting in steps}

In some cases it may be interesting to visualize the effect of changing the tuning parameter \Rfunarg{lambda1} or \Rfunarg{lambda2} on the values of the fitted regression coefficients. This can be done using the function argument \Rfunarg{steps} in combination with the \Rfunction{plotpath} function. At this moment, this functionality is only available for visualizing the effect of \Rfunarg{lambda1} (not for fused lasso estimation).

When using the \Rfunarg{steps} argument, the function starts fitting the model at the maximal value of $\lambda_1$, that is, the smallest value that shrinks all regression coefficients to zero. From that value it continues fitting the model for \Rfunarg{steps} successively decreasing values of $\lambda_1$ until the specified value of \Rfunarg{lambda1} is reached. If the argument \Rfunarg{steps} is supplied to \Rfunction{penalized}, the function returns a \Rclass{list} of \Rclass{penfit} objects. These can be accessed individually or their coefficients can be plotted using \Rfunction{plotpath}.

<<>>=
fit <- penalized(Surv(time,event), nki70[,8:77], lambda1=1, steps=50, trace = FALSE)
plotpath(fit, log="x")
@

Following \cite{Park2007} it is possible to choose the values of $\lambda_1$ in such a way that these are the change-points at which the active set changes. This can be done by setting \Robject{steps = "Park"}.

<<>>=
fit <- penalized(Surv(time,event), nki70[,8:77], lambda1=1, steps="Park", trace = FALSE)
@

Note that \Rfunction{plotpath} plots the unstandardized coefficients by default. Standardized coefficients can be plotted (even when the model was not fitted with standardized covariates) with the \Rfunarg{standardize} argument.

\begin{figure}
<<fig=TRUE,echo=FALSE>>=
plotpath(fit, log="x")
@
\end{figure}

\subsection{a positivity constraint}

In some applications it is natural to restrict all estimated regression coefficients to be non-negative. Such a positivity constraint is an alternative type of constrained estimation that is easily combined with L1 and L2 penalization in the algorithm implemented in the \Rpackage{penalized} package.

To add a positivity restriction to the regression coefficients of all penalized covariates, set the function argument \Rfunarg{positive} to \Robject{TRUE} (the default is \Robject{FALSE}). Note that it is not strictly necessary to also include an L1 or L2 penalty; the model can also be fitted with only a positivity constraint.

<<>>=
fit <- penalized(Surv(time,event), nki70[,8:77], positive=TRUE)
@

\begin{figure}
<<fig=TRUE,echo=FALSE>>=
fit0 <- penalized(Surv(time,event), nki70[,8:77], positive=TRUE, steps=50)
plotpath(fit0)
@
\end{figure}

<<>>=
coefficients(fit)
@

It is also possible to constrain only part of the regression coefficients to be non-negative by giving the \Rfunarg{positive} argument as a logical vector.

<<>>=
coef(penalized(Surv(time,event), nki70[,8:16], positive=c(F,rep(T,8))))
@

\subsection{fused lasso}

For problems involving features that can be ordered in some meaningful way, it may be useful to take the information about their spatial structure into account when estimating the coefficients. For example, copy number data exhibit spatial correlation along the genome. This suggests that feature selection should take genomic location into account to produce more interpretable results for copy number based classifiers. The fused lasso takes the genomic location into account by putting an L1 penalty on the coefficients as well as on their differences.
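In schematic form (a sketch; see the papers cited in the introduction for the precise formulation), the fused lasso penalty on coefficients $\beta_1,\ldots,\beta_p$ taken in their natural (e.g.\ genomic) order is
\[
\lambda_1 \sum_{j=1}^{p} |\beta_j| \;+\; \lambda_2 \sum_{j=2}^{p} |\beta_j - \beta_{j-1}|,
\]
so that in this setting $\lambda_2$ controls an L1 penalty on the differences between adjacent coefficients rather than a ridge penalty.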
The fused lasso thus produces sparse results with local constancy of the coefficient profile. For estimation with the fused lasso one can use the basic \Rfunction{penalized} function of the package with the function argument \Rfunarg{fusedl} set to \Robject{TRUE}. The argument \Rfunarg{fusedl} can take two forms: a logical or a vector. For example, in the case of copy number data, if information about the genomic location is available, it can be given to \Rfunarg{fusedl} as a vector of these locations. If the function argument \Rfunarg{fusedl} is set to \Robject{TRUE} or given as a vector, the \Rfunction{penalized} function performs fused lasso penalized estimation for fixed values of $\lambda_1$ and $\lambda_2$. Note that for fused lasso estimation, the value for $\lambda_2$ given in the function is used for putting an L1 penalty on the differences of the coefficients.

We demonstrate the fused lasso feature of the \Rfunction{penalized} function by applying it to a simulated dataset with a binomial response. We generate a data set with 100 samples and 70 probes, with mean equal to 0 and standard deviation equal to 0.5.

<<>>=
X <- matrix(0,70,100)
for (i in 1:100){
  X[1:70,i] <- rnorm(70,mean=0,sd=0.5)
}
colnames(X) <- as.character(1:ncol(X))
rownames(X) <- as.character(1:nrow(X))
@

Out of these 100 samples, 50 are selected randomly to receive aberrations in the region 30:40. The mean of the aberrated region is taken to be -0.7 and its standard deviation 0.5.

<<>>=
a <- sample(1:ncol(X),50,prob=rep(0.5,length(1:ncol(X))))
for (i in 1:50){
  X[30:40,a[i]] <- rnorm(length(30:40),mean = -0.7,sd=0.5)
}
@

We generate the probabilities of the samples being 1 or 0 by using beta equal to -1 in a logistic model.

<<>>=
Xbeta <- rnorm(100, mean = 0, sd = 0.5)
Xbeta[a] <- rnorm(length(a), mean = -0.7, sd = 0.5)
beta <- numeric(100)
beta[1:length(beta)] <- -1
responsep <- numeric(100)
for(i in 1:100){
  coeff <- -beta[i] * Xbeta[i]
  responsep[i] <- 1/(1+exp(coeff))
}
X <- t(X)
response <- responsep
for(i in 1:100){
  response[i] <- sample(0:1, size = 1, prob = c((1-responsep[i]),responsep[i]))
}
@

For estimating the coefficients using the fused lasso, the \Rfunarg{fusedl} argument of the \Rfunction{penalized} function can take two forms. If it is logical, the genomic location information (for example the chromosome number in copy number data) is taken to be 1 for every probe. Otherwise, if this information is available, it can be given as a vector to the \Rfunarg{fusedl} argument.

<<>>=
fit <- penalized(response, X, lambda1 = 2, lambda2=2, fusedl=TRUE)
@

\begin{figure}
<<fig=TRUE,echo=FALSE>>=
fit <- penalized(response, X, lambda1 = 2, lambda2=3, fusedl=TRUE)
plot(coefficients(fit,"all")[-1], main = "fused lasso", col="red",
     xlab = "probes", ylab = "coefficients", type="l")
@
\end{figure}

<<>>=
chr <- c(rep(1,30),rep(2,20),rep(3,10),rep(4,10))
fit <- penalized(response, X, lambda1 = 2, lambda2=2, fusedl=chr)
@

\section{Pretesting}

Before fitting a penalized regression model it can be worthwhile to test the global null hypothesis of no association between any of the predictor variables and the response. A package that can do this, and which ties in very closely with the \Rpackage{penalized} package, is the \Rpackage{globaltest} package, available from \texttt{www.bioconductor.org}. The package can be installed using the Bioconductor installation script

<<eval=FALSE>>=
source("http://bioconductor.org/biocLite.R")
biocLite("globaltest")
@

The interface of \Rpackage{globaltest} is very similar to the interface of \Rpackage{penalized}.
To test for any evidence of association in the \Robject{nki70} data, say

<<>>=
gt(Surv(time,event), nki70[,8:77])
@

The resulting p-value can be interpreted as a global indicator of predictive ability. Data sets that have a significant test result almost always have an optimal lambda value smaller than infinity. See the vignette of the \Rpackage{globaltest} package for details.

\section{Cross-validation and optimization} \label{cv}

Cross-validation can be used to assess the predictive quality of the penalized prediction model or to compare the predictive ability of different values of the tuning parameter. The \Rpackage{penalized} package uses likelihood cross-validation for all models. Likelihood cross-validation has some advantages over other optimization criteria: it tends to be a continuous function of the tuning parameter; it can be defined in a general way for almost any model; and it does not require calculation of the effective dimension of a model, which is problematic in L1 penalized models. For the Cox proportional hazards model, the package uses the cross-validated log partial likelihood \citep{Verweij1993}, which is a natural extension of the cross-validated log likelihood to the Cox model.

Five functions are available for calculating the cross-validated log likelihood and for optimizing the cross-validated log likelihood with respect to the tuning parameters. They have largely the same arguments. See \Robject{help(cvl)} for an overview.

\subsection{cross-validation}

The function \Rfunction{cvl} calculates the cross-validated log likelihood for fixed values of $\lambda_1$ and $\lambda_2$. It accepts the same arguments as \Rfunction{penalized} (except \Rfunarg{steps}: see \Rfunction{profL1} below) as well as the \Rfunarg{fold} argument. This will usually be a single number $k$ to indicate $k$-fold cross-validation. In that case, the allocation of the subjects to the folds is random. Alternatively, the precise allocation of the subjects into the folds can be specified by giving \Rfunarg{fold} as a vector of the length of the number of subjects, with values from 1 to $k$, each indicating the fold allocation of the corresponding subject. The default is to do leave-one-out cross-validation. To obtain cross-validated fused lasso estimates, set the argument \Rfunarg{fusedl} to \Robject{TRUE}.

In addition there is the argument \Rfunarg{approximate} (default value \Robject{FALSE}). If its value is set to \Robject{TRUE}, an approximation method is used instead of true cross-validation, which is much faster. This method is explained in more detail in the next subsection. When a linear model is fitted, the approximation method is no longer an approximation but gives exact answers. For this reason, the package will automatically set the value of \Rfunarg{approximate} to \Robject{TRUE} in the case of a linear model. This argument does not work for fused lasso cross-validation.

The function \Rfunction{cvl} returns a named \Rclass{list} with four elements:
\begin{description}
\item[\Robject{cvl}] the cross-validated log likelihood.
\item[\Robject{fold}] the fold allocation used; this may serve as input to a next call to \Rfunction{cvl} to ensure comparability.
\item[\Robject{predictions}] the predictions made on each left-out subject. The format depends on the model used. In logistic regression this is just a vector of probabilities. In the Cox model this is a collection of predicted survival curves (a \Rclass{breslow} object).
In the linear model this is a collection of predicted means and predicted standard deviations (the latter based on the maximum penalized likelihood estimate of $\sigma^2$).
\item[\Robject{fullfit}] the fit on the full data (a \Rclass{penfit} object).
\end{description}

<<>>=
fit <- cvl(Surv(time,event), nki70[,8:77], lambda1=1, fold=10)
@

<<>>=
fit$cvl
fit$fullfit
@

<<>>=
fit <- cvl(Surv(time,event), nki70[,8:77], lambda1=2, fold=fit$fold)
@

\subsection{approximated cross-validation}

To save time, one can choose to set the argument \Rfunarg{approximate} in the function \Rfunction{cvl} to \Robject{TRUE}. For now, this option is only available for ridge models, i.e.\ models in which the lasso penalty equals 0. In that case the cross-validated likelihood will not be calculated by leaving out one observation or a set of observations at a time and refitting the model, but will be based on approximations. These approximations are based on a Taylor expansion around the estimate of the full model. Since refitting the model is no longer necessary, a lot of time can be saved in this way. The results are in most cases quite accurate, but the method tends to be a little too optimistic, which results in slightly too high values of the corresponding \emph{cvl}. The method works best for large data sets and a resampling scheme with many folds. The same option can be chosen in \Rfunction{optL2} and \Rfunction{profL2} (see below). Here one must again be aware of the tendency to be a little too optimistic, which will result in optimal penalty values that are a little smaller than the ones found by real cross-validation.

In the following example, the cross-validated log-likelihood is calculated twice: first by using leave-one-out cross-validation, then by using the approximation method.

<<>>=
fit1 <- cvl(Surv(time,event), nki70[,8:77], lambda2=10)
@

<<>>=
fit1$cvl
@

<<>>=
fit2 <- cvl(Surv(time,event), nki70[,8:77], lambda2=10, approximate=TRUE)
@

<<>>=
fit2$cvl
@

As we can see, the answers are very similar.

\subsection{breslow objects} \label{breslow}

The \Rclass{breslow} class is defined in the \Rpackage{penalized} package to store estimated survival curves. Such objects are used for the predictions in cross-validation and for the baseline survival estimated in the \Rfunction{penalized} function. See \Robject{help(breslow)} for details.

<<>>=
fit$predictions
time(fit$predictions)
as.data.frame(basesurv(fit$fullfit))[1:10,]
plot(fit$predictions)
@

\begin{figure}
<<fig=TRUE,echo=FALSE>>=
plot(fit$predictions)
@
\end{figure}

We can easily extract the five-year cross-validated survival probabilities

<<>>=
survival(fit$predictions, 5)[1:10]
@

\subsection{profiling the cross-validated log likelihood}

The functions \Rfunction{profL1} and \Rfunction{profL2} can be used to examine the effect of the parameters $\lambda_1$ and $\lambda_2$ on the cross-validated log likelihood. The \Rfunction{profL1} function can be used to vary $\lambda_1$ while keeping $\lambda_2$ fixed, and vice versa for \Rfunction{profL2}. The minimum and maximum values between which the cross-validated log likelihood is to be profiled can be given as \Rfunarg{minlambda1} and \Rfunarg{maxlambda1} or \Rfunarg{minlambda2} and \Rfunarg{maxlambda2}, respectively. The default value of \Rfunarg{minlambda1} and \Rfunarg{minlambda2} is zero. The default value of \Rfunarg{maxlambda1} is the maximal value of $\lambda_1$, that is, the smallest value that shrinks all regression coefficients to zero. There is no default for \Rfunarg{maxlambda2}.
The number of steps between the minimal and maximal values can be given in the \Rfunarg{steps} argument (default 100). These steps are equally spaced if the argument \Rfunarg{log} is \Robject{FALSE}, or equally spaced on the log scale if the argument \Rfunarg{log} is \Robject{TRUE}. Note that the default value of \Rfunarg{log} differs between \Rfunction{profL1} (\Robject{FALSE}) and \Rfunction{profL2} (\Robject{TRUE}). If \Rfunarg{log} is \Robject{TRUE}, \Rfunarg{minlambda1} or \Rfunarg{minlambda2} must be given by the user, as the default value is not usable.

By default, the profiling is stopped prematurely when the cross-validated log likelihood drops below the cross-validated log likelihood of the null model with all penalized regression coefficients equal to zero. This is done because it avoids lengthy calculations at small values of $\lambda$, when the models are most likely not interesting. The automatic stopping can be controlled using the option \Rfunarg{minsteps} (default \Rfunarg{steps}/2). The algorithm only considers early stopping after it has done at least \Rfunarg{minsteps} steps. Setting \Rfunarg{minsteps} equal to \Rfunarg{steps} cancels the automatic stopping.

The functions \Rfunction{profL1} and \Rfunction{profL2} return a named list with the same elements as returned by \Rfunction{cvl}, but each of \Robject{cvl}, \Robject{predictions}, \Robject{fullfit} is now a \Rclass{vector} or a \Rclass{list} (as appropriate), as multiple cross-validated likelihoods were calculated. An additional vector \Robject{lambda} is returned, which lists the values of $\lambda_1$ or $\lambda_2$ at which the cross-validated likelihood was calculated. The allocation of the subjects into cross-validation folds is done only once, so that all cross-validated likelihoods are calculated using the same allocation. This makes the cross-validated log likelihoods more comparable. As in \Rfunction{cvl} the allocation is returned in \Robject{fold}.

It is also possible in these functions to set \Robject{fold = 1}. This will cause no cross-validation to be performed, but only the full data fits to be calculated. This can be used in a similar way to the use of the \Rfunction{penalized} function with its \Rfunarg{steps} argument, only with more flexibility.

The profiles can be plotted using the output of \Rfunction{profL1} and \Rfunction{profL2}, or directly using the \Rfunarg{plot} arguments of these functions.

<<>>=
fit1 <- profL1(Surv(time,event), nki70[,50:70], fold=10, plot=TRUE)
fit2 <- profL2(Surv(time,event), nki70[,50:70], fold=fit1$fold, minl = 0.01, maxl = 1000)
plot(fit2$lambda, fit2$cvl, type="l", log="x")
@

\begin{figure}
<<fig=TRUE,echo=FALSE>>=
plot(fit1$lambda, fit1$cvl, type="l")
@
\end{figure}

\begin{figure}
<<fig=TRUE,echo=FALSE>>=
plot(fit2$lambda, fit2$cvl, type="l", log="x")
@
\end{figure}

The \Rfunction{plotpath} function can again be used to visualize the effect of the tuning parameter on the regression coefficients.

<<>>=
plotpath(fit2$fullfit, log="x")
@

\begin{figure}
<<fig=TRUE,echo=FALSE>>=
plotpath(fit2$fullfit, log="x")
@
\end{figure}

\subsection{optimizing the cross-validated likelihood}

Often we are not interested in the whole profile of the cross-validated likelihood, but only in the optimum. The functions \Rfunction{optL1} and \Rfunction{optL2} can be used to find the optimal value of $\lambda_1$ or $\lambda_2$. The algorithm used for the optimization is the Brent algorithm for minimization without derivatives \citep[][see also \Robject{help(optimize)}]{Brent1973}.
When using this algorithm, it is important to realize that it is guaranteed to work only for unimodal functions and that it may converge to a local maximum. This is especially relevant for L1 optimization, as the cross-validated likelihood as a function of $\lambda_1$ very often has several local maxima. It is recommended to use \Rfunction{optL1} only in combination with \Rfunction{profL1}, to prevent convergence to the wrong optimum. The cross-validated likelihood as a function of $\lambda_2$, on the other hand, is far better behaved and practically never has local maxima. The function \Rfunction{optL2} can safely be used even without combining it with \Rfunction{profL2}.

The functions \Rfunction{optL1} and \Rfunction{optL2} take the same arguments as \Rfunction{cvl}, and some additional ones. The arguments \Rfunarg{minlambda1} and \Rfunarg{maxlambda1}, and \Rfunarg{minlambda2} and \Rfunarg{maxlambda2}, can be used to specify the range between which the cross-validated log likelihood is to be optimized. Both arguments can be left out in both functions, but supplying them can improve convergence speed. In \Rfunction{optL1}, the parameter range can be used to ensure that the function converges to the right maximum. In \Rfunction{optL2} the user can also supply only one of \Rfunarg{minlambda2} and \Rfunarg{maxlambda2} to give the algorithm advance information about the order of magnitude of $\lambda_2$. In this case, the algorithm will search for an optimum around \Rfunarg{minlambda2} or \Rfunarg{maxlambda2}.

The functions \Rfunction{optL1} and \Rfunction{optL2} return a named list just as \Rfunction{cvl} does, with an additional element \Robject{lambda} that contains the optimum found. The returned \Robject{cvl}, \Robject{predictions} and \Robject{fullfit} all relate to the optimal $\lambda$ found.

<<>>=
opt1 <- optL1(Surv(time,event), nki70[,50:70], fold=fit1$fold)
@

<<>>=
opt1$lambda
opt1$cvl
@

<<>>=
opt2 <- optL2(Surv(time,event), nki70[,50:70], fold=fit2$fold)
@

\section{A note on standard errors and confidence intervals}

It is a very natural question to ask for standard errors of regression coefficients or other estimated quantities. In principle such standard errors can easily be calculated, e.g.\ using the bootstrap. Still, this package deliberately does not provide them. The reason for this is that standard errors are not very meaningful for strongly biased estimates such as arise from penalized estimation methods. Penalized estimation is a procedure that reduces the variance of estimators by introducing substantial bias. The bias of each estimator is therefore a major component of its mean squared error, whereas its variance may contribute only a small part.

Unfortunately, in most applications of penalized regression it is impossible to obtain a sufficiently precise estimate of the bias. Any bootstrap-based calculations can only give an assessment of the variance of the estimates. Reliable estimates of the bias are only available if reliable unbiased estimates are available, which is typically not the case in situations in which penalized estimates are used.

Reporting a standard error of a penalized estimate therefore tells only part of the story. It can give a mistaken impression of great precision, completely ignoring the inaccuracy caused by the bias. It is certainly a mistake to make confidence statements that are based only on an assessment of the variance of the estimates, as bootstrap-based confidence intervals do.
Reliable confidence intervals around the penalized estimates can be obtained in the case of low-dimensional models using the standard generalized linear model theory as implemented in \Rfunction{lm}, \Rfunction{glm} and \Rfunction{coxph}. Methods for constructing reliable confidence intervals in the high-dimensional situation are, to our knowledge, not available.

\bibliography{penalized}

\end{document}