--- title: "csmpv" author: "Aixiang Jiang" date: "`r Sys.Date()`" output: html_document: df_print: paged vignette: > %\VignetteIndexEntry{csmpv} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set(fig.width=6, fig.height=4.5) # Set the width and height ``` The csmpv R package provides an extensive set of functions that encompass biomarker confirmation, variable selection, modeling, predictive analysis, and validation. It can handle diverse outcome variable types, including binary, continuous, and time-to-event data. Its key objectives include: - Biomarker Confirmation/Validation: This feature employs both single-variable and multivariable regression techniques to confirm and validate established biomarkers. - Biomarker Discovery: The package streamlines the identification of new biomarkers through variable selection methods like LASSO2, LASSO_plus, and LASSO2plus. - Predictive Model Development: By harnessing a fusion of machine learning and traditional statistical tools, this process facilitates the creation of predictive models focused on specific biomarkers. - Model Prediction: Developed models can predict outcomes when applied to new datasets. - Model Validation: These models validate outcomes when applied to novel datasets, provided an outcome variable is present. To simplify the modeling process, we've designed an all-in-one function capable of managing predictive model development, prediction, and validation for all eight methods within this package across three distinct outcome types. This versatile function streamlines the process, allowing for a concise implementation with just a single function call. It can handle a single method with single or multiple outcome variables. Moreover, if a validation dataset is available, the prediction and validation processes can seamlessly integrate into a unified operation. In addition to these core functionalities, the csmpv package introduces a unique approach allowing the creation of binary risk classification models based on survival models. This innovative feature enables predicting risk binary outcomes for new datasets using the developed model. Please note, the external validation of this model is limited due to the absence of binary classification variables in new datasets. Despite this limitation, the predicted binary classification can serve as a surrogate biomarker, and its correlation with survival outcomes in new datasets can be tested when survival outcome information is available. To enhance user experience, the csmpv R package focuses on streamlining coding efforts. Each user-end function acts as a comprehensive wrapper condensing multiple analyses into a single function call. Additionally, result files are conveniently saved locally, further simplifying the analytical process. # I Installation The csmpv package is available on CRAN, and it can be directly installed in R using the following command: ```{r, eval=FALSE} install.packages("csmpv") ``` Alternatively, let’s proceed to install csmpv from GitHub using the devtools or remotes R package. ```{r, eval=FALSE} # Install devtools package if not already installed options(repos = c(CRAN = "https://cloud.r-project.org")) install.packages("devtools") # Install csmpv package from GitHub devtools::install_github("ajiangsfu/csmpv",force = TRUE) # Using force = TRUE will ensure the installation, overriding any existing versions ``` ``` {r, eval=FALSE} # Install remotes package if not already installed install.packages("remotes") # Install csmpv package from GitHub remotes::install_github("ajiangsfu/csmpv",force = TRUE) # Using force = TRUE will ensure the installation, overriding any existing versions ``` Both methods will download and install the csmpv package from the GitHub repository. Please ensure an active internet connection and the necessary dependencies for a successful installation. # II Example Code In this section, we will show some example code, however, before that, we will introduce example data first. ## 1. Example data The example data was extracted from our in-house diffuse large B-cell lymphoma (DLBCL) dataset, specifically utilizing supplemental Table S1 from Alduaij et al. (2023, DOI: 10.1182/blood.2022018248). To handle missing values, we retained only samples with all the necessary variables and no missing values. Furthermore, to ensure compatibility with all eight modeling methods within csmpv, we transformed all categorical variables into binary format, overcoming limitations in XGBoost (eXtreme Gradient Boosting) and LASSO (Least Absolute Shrinkage and Selection Operator) when dealing with categorical variables with more than two levels. Following these procedures, an object named datlist was generated and is included in csmpv, accessible straightforwardly after installing and loading csmpv, as demonstrated below. ``` {r} library(csmpv) data("datlist", package = "csmpv") tdat = datlist$training dim(tdat) vdat = datlist$validation dim(vdat) ``` Subsequently, we defined three outcome variables and their respective independent variables. To illustrate different types of outcome variables, we'll define examples for binary, continuous, and time-to-event categories: - Binary: DZsig (dark zone signature) - Continuous: Age - Time-to-event: FFP (freedom from progression) For binary and time-to-event variables, independent variables are defined as: ```{r} Xvars = c("B.Symptoms","MYC.IHC","BCL2.IHC", "CD10.IHC","BCL6.IHC", "MUM1.IHC","Male","AgeOver60", "stage3_4","PS1","LDH.Ratio1", "Extranodal1","Bulk10cm","HANS_GCB", "DTI") ``` For the continuous variable, the corresponding independent variables align with those above, excluding AgeOver60 due to its correlation with the outcome variable Age: ```{r} AgeXvars = setdiff(Xvars, "AgeOver60") ``` To enhance reproducibility and minimize variability from random number generation, we established and set a specific random seed: ```{r} set.seed(12345) ``` Users can define their own temporary directory to save all results. If not, tempdir() can be used to get the system's temporary directory. ```{r} temp_dir = tempdir() knitr::opts_knit$set(root.dir = temp_dir) ``` ## 2. Biomarker confirmation/validation Whether this procedure is labeled as biomarker confirmation, validation, or testing, the fundamental aspect involves regular regression analyses on both single and multiple variables across three distinct outcome categories: binary, continuous, and time-to-event. In this context, our objective is to assess the presence of an association between outcomes and a set of independent variables. It's important to note that this differs from model validation, which will be covered subsequently. ### 2.1 Binary outcome To confirm biomarkers for binary outcomes: ```{r, echo=FALSE} options(warn = -1) ``` ``` {r, results = 'hide',message=FALSE, warnings=FALSE} bconfirm = confirmVars(data = tdat, biomks = Xvars, Y = "DZsig", outfile = "confirmBinary") ``` The confirmVars function acts as a wrapper, invoking various functions to perform regression analysis based on different outcome types. By default, the outcome type is binary, requiring no explicit specification when handling binary outcomes. Upon execution, the bconfirm object comprises a multivariable model and a list of two forest plots. The first plot consolidates individual forest plots for each single variable, while the second represents the forest plot for the multivariable model. These outputs are locally saved, along with a combined table containing models for each single variable. ``` {r} print(bconfirm$fit) bconfirm$allplot[[2]] ``` For instance, the above output showcases a multivariable model and it's corresponding forest plot. ### 2.2 Continous outcome To confirm biomarkers for continuous outcomes: ``` {r,results = 'hide',message=FALSE, warnings=FALSE} cconfirm = confirmVars(data = tdat, biomks = AgeXvars, Y = "Age", outcomeType = "continuous", outfile = "confirmContinuous") ``` The same confirmVars function is called; however, this time, we specify the outcome type as continuous. In a similar fashion, the cconfirm object comprises two elements: a multivariable model and a list of two forest plots. The first plot consolidates all forest plots for each single variable, while the second represents the forest plot for the multivariable model. All these outputs are saved locally, accompanied by a combined table containing models for each single variable. Below, you'll find the multivariable model and it's corresponding forest plot: ```{r} print(cconfirm$fit) cconfirm$allplot[[2]] ``` ### 2.3 Time-to-event outcome To confirm biomarkers for time-to-event outcomes: ``` {r, results = 'hide',message=FALSE, warnings=FALSE} tconfirm = confirmVars(data = tdat, biomks = Xvars, time = "FFP..Years.", event = "Code.FFP", outcomeType = "time-to-event", outfile = "confirmSurvival") ``` The confirmVars function is called once again, this time with the outcome type specified as time-to-event, necessitating the inclusion of both time and event variable names. Similarly, two PDF and two table files are saved, accompanied by locally stored Kaplan-Meier plots. A single Kaplan-Meier plot is generated for each independent categorical variable with no more than four levels. The tconfirm object also stores two elements: a multivariable model and a list of two forest plots. Below, you'll find the multivariable model and it's corresponding forest plot: ```{r} print(tconfirm$fit) tconfirm$allplot[[2]] ``` ## 3. Biomarker discovery with variable selection This section details the process of biomarker discovery through variable selection, utilizing three distinct methods: LASSO2, LASSO2plus, and LASSO_plus. ### 3.1 Variable selection with LASSO2 The variable selection process using our customized LASSO algorithm, LASSO2, employs a tailored approach distinct from the conventional LASSO algorithm. This adjustment aims to address the randomness introduced by random splits and to guarantee the inclusion of at least two variables. This process utilizes glmnet::cv.glmnet for cross-validation-based variable selection, setting lambda as the largest value where the error remains within 1 standard error of the minimum. However, as indicated in the cv.glmnet's help file, variability in results can arise due to the randomness inherent in cross-validation splits. To counteract this variability, our new function, LASSO2, performs 10 runs of glmnet::cv.glmnet. The resulting average lambda value from these iterations becomes the final lambda used for regularization regression on the complete dataset. It's important to note that since LASSO2 selects the lambda as the largest lambda within 1 standard error of the minimum, following the default behavior of cv.glmnet, it may yield a smaller number of selected variables compared to the lambda that minimizes the mean cross-validated error. This more conservative approach could potentially result in only one or no selected variables. To address this potential issue, when LASSO2 identifies only one or no variables, it defaults to selecting the first lambda that results in at least two variables being chosen from the full dataset. This strategy ensures the inclusion of at least two variables, striking a balance between model complexity and the necessity for meaningful variable inclusion. #### 3.1.1 Binary outcome For binary outcomes, no additional specification is needed for outcomeType, as it is the default value. ``` {r} bl = LASSO2(data = tdat, biomks = Xvars, Y = "DZsig", outfile = "binaryLASSO2") ``` One figure and one text file are saved locally. ```{r} bl$coefs ``` This displays the selected variables and their corresponding shrunken coefficients. #### 3.1.2 Continuous outcome For variable selection involving a continuous outcome variable, specify outcomeType = "continuous": ``` {r} cl = LASSO2(data = tdat, biomks = AgeXvars, outcomeType = "continuous", Y = "Age", outfile = "continuousLASSO2") ``` Similar to before, one figure and one text file are saved locally. ```{r} cl$coefs ``` This shows the selected variables and their associated shrunken coefficients for the continuous outcome. #### 3.1.3 Time-to-event outcome For variable selection with a time-to-event outcome, set outcomeType = "time-to-event", and ensure you provide the variable names for both time and event: ``` {r} tl = LASSO2(data = tdat, biomks = Xvars, outcomeType = "time-to-event", time = "FFP..Years.",event = "Code.FFP", outfile = "survivalLASSO2") ``` In a similar fashion, one figure and one text file are saved locally. ```{r} tl$coefs ``` This shows the selected variables and their associated shrunk coefficients for time-to-event outcome. ### 3.2 Variable selection with LASSO2plus LASSO2plus is an innovative approach that combines LASSO2, a modified LASSO algorithm, with other techniques. It selects variables in three steps: - Apply LASSO2, which is slightly different from the standard LASSO as discussed in Section 3.1. - Fit a simple regression model for each variable and adjusting the p-values using the Benjamini Hochberg method (1995). - Perform a stepwise variable selection procedure on the combined list of variables from the previous steps. Therefore, LASSO2plus can incorporates both the regularization and the significance testing aspects of variable selection. All parameter settings for LASSO2plus are the same as for LASSO2. #### Binary outcome For binary outcomes, no additional specification is needed for outcomeType, as it is the default value. ``` {r} b2fit = LASSO2plus(data = tdat, biomks = Xvars, Y = "DZsig", outfile = "binaryLASSO2plus") b2fit$fit$coefficients ``` The coefficients are shown above. Two figures and two tables are stored locally. #### Continuous outcome For variable selection involving a continuous outcome variable, specify outcomeType = "continuous": ``` {r} c2fit = LASSO2plus(data = tdat, biomks = AgeXvars, outcomeType = "continuous", Y = "Age", outfile = "continuousLASSO2plus") c2fit$fit$coefficients ``` Again, the coefficients shown above and Two figures and two tables are stored locally. #### Time-to-event outcome For variable selection with a time-to-event outcome, set outcomeType = "time-to-event", and ensure you provide the variable names for both time and event: ``` {r} t2fit = LASSO2plus(data = tdat, biomks = Xvars, outcomeType = "time-to-event", time = "FFP..Years.",event = "Code.FFP", outfile = "survivalLASSO2plus") t2fit$fit$coefficients ``` Similar to the other types of outcomes, the coefficients are displayed above, and two figures along with two tables are stored locally. ### 3.3. Variable selection with LASSO_plus LASSO_plus is another innovative approach that builds on the LASSO algorithm and adds more techniques. However, it differs from LASSO2plus that is described in Section 3.2 in its initial step. It selects variables in the following three steps: - 1) Utilize 'Modified LASSO' for regularization regression instead of LASSO2 involves selecting a list, appearing at least two times during lambda simulations, based on the number of variables, and aiming to find the one nearest to a predefined target. - 2) Fit a simple regression model for each variable and adjusting the p-values using the Benjamini Hochberg method (1995); - 3) Perform a stepwise variable selection procedure on the combined list of variables from the previous steps. Therefore, LASSO_plus also incorporates both the regularization and the significance testing aspects of variable selection. In LASSO_plus, all parameters from LASSO2 and LASSO2plus are retained, with the addition of the unique parameter topN. Please be aware that the topN parameter in LASSO_plus serves as a guide for variable selection. #### Binary outcome Setting the topN parameter to 5 is intended to incorporate the top 5 variables into the final model. However, it's important to note that the resulting model may not always precisely comprise 5 variables. Consequently, when using the same topN value for different datasets, the number of selected variables may vary. For binary outcomes, outcome type specification is unnecessary, as it defaults to this type. ``` {r} bfit = LASSO_plus(data = tdat, biomks = Xvars, Y = "DZsig", outfile = "binaryLASSO_plus", topN = 5) bfit$fit$coefficients ``` The identified variables and their corresponding coefficients are displayed above. A figure and a table are locally stored. #### Continuous outcome For continuous outcome variables, ensure you specify outcomeType = "continuous": ``` {r} cfit = LASSO_plus(data = tdat, biomks = AgeXvars, outcomeType = "continuous", Y = "Age", outfile = "continuousLASSO_plus", topN = 5) cfit$fit$coefficients ``` The identified variables and their corresponding coefficients are displayed above. A figure and a table are locally stored #### Time-to-event outcome When dealing with time-to-event outcomes, set outcomeType = "time-to-event", and ensure you provide the names of variables for both time and event: ``` {r} tfit = LASSO_plus(data = tdat, biomks = Xvars, outcomeType = "time-to-event", time = "FFP..Years.",event = "Code.FFP", outfile = "survivalLASSO_plus", topN = 5) tfit$fit$coefficients ``` Displayed above are the identified variables and their corresponding coefficients. A figure and a table are locally stored ## 4. Predictive model development Predictive model development is a crucial aspect of the csmpv R package, involving eight distinct approaches: - Use shrunk coefficients directly from LASSO2. - Select variables with LASSO2, then run a regular regression model. - Extract coefficients directly from LASSO_plus output. - Extract coefficients directly from LASSO2plus output. - Build a machine learning model with XGBoost. - Utilize LASSO2 for variable selection and build an XGBoost model. - Utilize LASSO_plus for variable selection and build an XGBoost model. - Utilize LASSO2plus for variable selection and build an XGBoost model. ### 4.1 LASSO2 Directly use shrunk coefficients from LASSO2 output as shown in Section 3.1. ### 4.2 LASSO2 + regular regression The approach involves utilizing the variables selected by LASSO2 to conduct a standard regression model. Rather than relying on the shrunken coefficients obtained from LASSO2, this method opts for a conventional regression analysis with the chosen variables. While it's feasible to manually extract variables from an LASSO2 object for regular regression based on the outcome type, LASSO2_reg function is introduced to simplify this process for coding convenience and efficiency. All parameter settings are the same as for LASSO2. #### Binary outcome ```{r, results = 'hide',message=FALSE, warnings=FALSE} blr = LASSO2_reg(data = tdat, biomks = Xvars, Y = "DZsig", outfile = "binaryLASSO2_reg") ``` ```{r} blr$fit$coefficients ``` #### Continuous outcome ```{r, results = 'hide',message=FALSE, warnings=FALSE} clr = LASSO2_reg(data = tdat, biomks = AgeXvars, outcomeType = "continuous", Y = "Age", outfile = "continuousLASSO2_reg") ``` ```{r} clr$fit$coefficients ``` #### Time-to-event outcome ```{r, results = 'hide',message=FALSE, warnings=FALSE} tlr = LASSO2_reg(data = tdat, biomks = Xvars, outcomeType = "time-to-event", time = "FFP..Years.",event = "Code.FFP", outfile = "survivalLASSO2_reg") ``` ```{r} tlr$fit$coefficients ``` The selected variables and their coefficients are displayed above. For each outcome type, three figure files, one text file, and two tables are saved locally. Furthermore, Kaplan-Meier plots are generated and saved locally for time-to-event outcome variables. Specifically, a single Kaplan-Meier plot is produced for each independent categorical variable, with no more than four levels. ### 4.3 LASSO_plus Directly use coefficients from LASSO2_plus output as shown in Section 3.3. ### 4.4 LASSO2plus Directly use coefficients from the LASSO2plus output, as described in Section 3.2. ### 4.5 XGBoost XGBoost is a powerful machine learning algorithm recognized for its boosting capabilities. The XGBtraining function within the csmpv package leverages the strengths of XGBoost for model training. As XGBoost doesn’t inherently feature a dedicated variable selection procedure, you’ll need to manually define or select a set of variables using other methods. Once you have a predefined set of variables for constructing an XGBoost model, the XGBtraining function in the csmpv package streamlines this process. #### Binary outcome ```{r} bxfit = XGBtraining(data = tdat, biomks = Xvars, Y = "DZsig", outfile = "binary_XGBoost") head(bxfit$XGBoost_score) ``` The output from the above code consists of training log-loss values for specific iterations of the model. Log-loss, a widely used loss function in classification tasks, assesses the alignment between the model's predicted probabilities and the actual class labels. By default, XGBtraining runs for 5 iterations, and the output is saved locally as a text file. The bxfit object contains four components: - XGBoost object. - XGBoost scores for all entries in the tdat dataset. Notably, XGBoost operates as a black box model and doesn't return coefficients; however, it provides model scores. For binary outcomes, these scores represent the probability of the positive class. - Observed outcome. - Outcome type. #### Continuous outcome ```{r} cxfit = XGBtraining(data = tdat, biomks = AgeXvars, outcomeType = "continuous", Y = "Age", outfile = "continuous_XGBoost") head(cxfit$XGBoost_score) ``` The reported values, denoted by "train-rmse", signify the RMSE metric calculated during each iteration of the XGBoost model on the training set. RMSE measures the root mean squared error between predicted and actual values in the training set, with lower values indicating superior model performance. The results are saved locally in a text file. These metrics illustrate the iterative nature of training the XGBoost model, where each iteration aims to minimize the RMSE on the training set. The diminishing RMSE values signify the model's learning process, demonstrating its progressive improvement in predictive accuracy during training. Within cxfit, there are four elements: - XGBoost object. - XGBoost scores for all entries in tdat. Notably, XGBoost, functioning as a black box model, does not yield coefficients but provides model scores. For continuous outcomes, these scores represent the estimated continuous values. - Observed outcome. - Outcome type. #### Time-to-event outcome ```{r} txfit = XGBtraining(data = tdat, biomks = Xvars, outcomeType = "time-to-event", time = "FFP..Years.",event = "Code.FFP", outfile = "survival_XGBoost") head(txfit$XGBoost_score) ``` The negative log-likelihood, displayed in the output, serves as a standard loss function in survival analysis, notably prominent in Cox proportional hazards models. It quantifies the disparity between predicted survival probabilities and observed survival times and events within the training data. Minimizing this metric is crucial, as lower values signify a better fit of the model to the training data. The resulting output is saved locally as a text file. By monitoring the negative log-likelihood throughout the training process, you can evaluate the model's learning progress and its convergence toward an optimal solution. Ideally, a decreasing trend in the negative log-likelihood indicates the model's improved fit to the training data across iterations. In txfit, there are six components: - XGBoost object. - XGBoost scores for all entries in tdat. Notably, XGBoost operates as a black box model and does not yield coefficients but provides model scores. For time-to-event outcomes, these scores represent the risk score. - Baseline hazard table. - Observed time. - Event. - Outcome type. ### 4.6 LASSO2 + XGBoost Combine LASSO2 variable selection with XGBoost modeling using the LASSO2_XGBtraining function, which selects variables via LASSO2 but constructs an XGBoost model without relying on shrunk coefficients. The resulting objects maintain the same format as XGBtraining, while the results related to LASSO2 and XGBoost are saved locally. #### Binary outcome ```{r} blxfit = LASSO2_XGBtraining(data = tdat, biomks = Xvars, Y = "DZsig", outfile = "binary_LASSO2_XGBoost") head(blxfit$XGBoost_score) ``` #### Continuous outcome ```{r} clxfit = LASSO2_XGBtraining(data = tdat, biomks = AgeXvars, outcomeType = "continuous", Y = "Age", outfile = "continuous_LASSO2_XGBoost") head(clxfit$XGBoost_score) ``` #### Time-to-event outcome ```{r} tlxfit = LASSO2_XGBtraining(data = tdat, biomks = Xvars, outcomeType = "time-to-event", time = "FFP..Years.",event = "Code.FFP", outfile = "survival_LASSO2_XGBoost") head(tlxfit$XGBoost_score) ``` ### 4.7 LASSO_plus + XGBoost To combine LASSO_plus variable selection with XGBoost modeling, the LASSO_plus_XGBtraining R function is employed. This approach selects variables using LASSO_plus but does not utilize the coefficients from LASSO_plus to construct the model; instead, it generates an XGBoost model. The format of the returned objects are identical to those of the XGBtraining function, while results related to LASSO_plus and XGBoost are saved locally. #### Binary outcome ```{r, warning=FALSE} blpxfit = LASSO_plus_XGBtraining(data = tdat, biomks = Xvars, Y = "DZsig", topN = 5,outfile = "binary_LASSO_plus_XGBoost") head(blpxfit$XGBoost_score) ``` The majority of the outputs stem from LASSO_plus, with the final portion being attributed to XGBoost. #### Continuous outcome ```{r, warning=FALSE} clpxfit = LASSO_plus_XGBtraining(data = tdat, biomks = AgeXvars, outcomeType = "continuous", Y = "Age", topN = 5,outfile = "continuous_LASSO_plus_XGBoost") head(clpxfit$XGBoost_score) ``` Similar to the previous scenario, the primary outputs stem from LASSO_plus, while the concluding section originates from XGBoost. #### Time-to-event outcome ```{r,warning=FALSE} tlpxfit = LASSO_plus_XGBtraining(data = tdat, biomks = Xvars, outcomeType = "time-to-event", time = "FFP..Years.",event = "Code.FFP", topN = 5,outfile = "survival_LASSO_plus_XGBoost") head(tlpxfit$XGBoost_score) ``` Analogous to the previous cases, the bulk of the outputs originate from LASSO_plus, while the final segment is attributed to XGBoost. ### 4.8 LASSO2plus + XGBoost To seamlessly integrate LASSO2plus variable selection with XGBoost modeling, we leverage the LASSO2plus_XGBtraining R function. This hybrid approach utilizes LASSO2plus for variable selection but diverges from using its coefficients to construct the model. Instead, it generates an XGBoost model, producing an output akin to that of the XGBtraining function. The format of the returned objects mirror those of the XGBtraining function, while results related to LASSO2plus and XGBoost are saved locally. #### Binary outcome ```{r, warning=FALSE} bl2xfit = LASSO2plus_XGBtraining(data = tdat, biomks = Xvars, Y = "DZsig", outfile = "binary_LASSO2plus_XGBoost") head(bl2xfit$XGBoost_score) ``` The primary outputs stem from LASSO2plus, while the latter part pertains to XGBoost. #### Continuous outcome ```{r, warning=FALSE} cl2xfit = LASSO2plus_XGBtraining(data = tdat, biomks = AgeXvars, outcomeType = "continuous", Y = "Age", outfile = "continuous_LASSO2plus_XGBoost") head(cl2xfit$XGBoost_score) ``` Similar to the previous case, the primary outputs arise from LASSO2plus, while the final section pertains to XGBoost. #### Time-to-event outcome ```{r, warning=FALSE} tl2xfit = LASSO2plus_XGBtraining(data = tdat, biomks = Xvars, outcomeType = "time-to-event", time = "FFP..Years.", event = "Code.FFP", outfile = "survival_LASSO2plus_XGBoost") head(tl2xfit$XGBoost_score) ``` Similarly, most outputs arise from LASSO2plus, while the final section pertains to XGBoost. ## 5. Model prediction In this section, we outline the prediction process for the six different modeling approaches included in this package when given the input variables (X) in a new dataset. ### 5.1 LASSO2 prediction We begin by discussing predictions for LASSO2 model outcomes. #### Binary ouctome To predict binary outcomes using LASSO2, we use the following code snippet: ```{r} pbl = LASSO2_predict(bl, newdata = vdat, outfile = "pred_LASSO2_binary") head(pbl) ``` The pbl object holds the predicted probabilities for the positive group for each entry/sample. #### Continuous ouctome For continuous outcomes prediction, the code snippet is as follows: ```{r} pcl = LASSO2_predict(cl, newdata = vdat, outfile = "pred_LASSO2_cont") head(pbl) ``` The pcl object holds the predicted Y values for each entry/sample. #### Time-to-event ouctome When predicting time-to-event outcomes, we use the code: ```{r} ptl = LASSO2_predict(tl, newdata = vdat, outfile = "pred_LASSO2_time_to_event") head(pbl) ``` The ptl object holds predicted risk scores for each entry/sample. ### 5.2 LASSO2 + regular regression prediction Moving forward, let's explore predictions regarding the regression model using selected variables from LASSO2. The function rms_model specifically caters to model prediction when utilizing a regular modeling object like those produced by LASSO2_reg. Upon performing predictions for binary and continuous outcomes, this step generates one figure and five tables. Additionally, for time-to-event outcomes, an extra table is generated. These resulting files are all saved locally for convenient access. #### Binary ouctome To predict binary outcomes using the LASSO2 + regular regression model: ```{r} pblr = rms_model(blr$fit, newdata = vdat, outfile = "pred_LASSO2reg_binary") head(pblr) ``` #### Continuous ouctome For continuous outcomes prediction: ```{r} pclr = rms_model(clr$fit, newdata = vdat, outfile = "pred_LASSO2reg_continuous") head(pclr) ``` #### Time-to-event ouctome To predict time-to-event outcomes: ```{r} ptlr = rms_model(tlr$fit, data = tdat, newdata = vdat, outfile = "pred_LASSO2reg_time_to_event") head(ptlr) ``` For time-to-event outcomes, the LASSO2_reg object requires the training dataset to be provided. ### 5.3 LASSO_plus prediction We also use rms_model to predict LASSO_plus model outcomes. #### Binary ouctome To predict binary outcomes using the LASSO_plus model: ```{r} pbfit = rms_model(bfit$fit, newdata = vdat, outfile = "pred_LASSOplus_binary") ``` #### Continuous ouctome For continuous outcomes prediction: ```{r} pcfit = rms_model(cfit$fit, newdata = vdat, outfile = "pred_LASSOplus_continuous") ``` #### Time-to-event ouctome To predict time-to-event outcomes: ```{r} ptfit = rms_model(tfit$fit, data = tdat, newdata = vdat, outfile = "pred_LASSOplus_time_to_event") ``` ### 5.4 LASSO2plus prediction Similarly, we use rms_model to predict LASSO2plus model outcomes. #### Binary ouctome To predict binary outcomes using the LASSO_plus model: ```{r} p2bfit = rms_model(b2fit$fit, newdata = vdat, outfile = "pred_LASSO2plus_binary") ``` #### Continuous ouctome For continuous outcomes prediction: ```{r} p2cfit = rms_model(c2fit$fit, newdata = vdat, outfile = "pred_LASSO2plus_continuous") ``` #### Time-to-event ouctome To predict time-to-event outcomes: ```{r} p2tfit = rms_model(t2fit$fit, data = tdat, newdata = vdat, outfile = "pred_LASSO2plus_time_to_event") ``` ### 5.5 XGBoost prediction Continuing, we discuss predictions for the XGBoost model outcomes. #### Binary ouctome To predict binary outcomes using the XGBoost model: ```{r} pbxfit = XGBtraining_predict(bxfit, newdata = vdat, outfile = "pred_XGBoost_binary") ``` #### Continuous ouctome For continuous outcomes prediction: ```{r} pcxfit = XGBtraining_predict(cxfit, newdata = vdat, outfile = "pred_XGBoost_cont") ``` #### Time-to-event ouctome To predict time-to-event outcomes: ```{r} ptxfit = XGBtraining_predict(txfit, newdata = vdat, outfile = "pred_XGBoost_time_to_event") ``` ### 5.6 LASSO2 + XGBoost prediction Next, we explore predictions for the combined LASSO2 and XGBoost model outcomes. #### Binary ouctome To predict binary outcomes: ```{r} pblxfit = XGBtraining_predict(blxfit, newdata = vdat, outfile = "pred_LXGBoost_binary") ``` #### Continuous ouctome To predict continuous outcomes: ```{r} pclxfit = XGBtraining_predict(clxfit, newdata = vdat, outfile = "pred_LXGBoost_cont") ``` #### Time-to-event ouctome To predict time-to-event outcomes: ```{r} ptlxfit = XGBtraining_predict(tlxfit, newdata = vdat, outfile = "pred_LXGBoost_time_to_event") ``` #### 5.7 LASSO_plus + XGBoost prediction Lastly, we discuss predictions for the combined LASSO_plus and XGBoost model outcomes. ##### 1) Binary ouctome To predict binary outcomes: ```{r} pblpxfit = XGBtraining_predict(blpxfit, newdata = vdat, outfile = "pred_LpXGBoost_binary") ``` ##### 2) Continuous ouctome For continuous outcomes prediction: ```{r} pclpxfit = XGBtraining_predict(clpxfit, newdata = vdat, outfile = "pred_LpXGBoost_cont") ``` ##### 3) Time-to-event ouctome To predict time-to-event outcomes: ```{r} ptlpxfit = XGBtraining_predict(tlpxfit, newdata = vdat, outfile = "pred_LpXGBoost_time_to_event") ``` ### 5.8 LASSO2plus + XGBoost prediction #### Binary ouctome To predict binary outcomes using the LASSO2plus + XGBoost model: ```{r} pbl2xfit = XGBtraining_predict(bl2xfit, newdata = vdat, outfile = "pred_L2XGBoost_binary") ``` #### Continuous ouctome For continuous outcomes prediction: ```{r} pcl2xfit = XGBtraining_predict(cl2xfit, newdata = vdat, outfile = "pred_L2XGBoost_cont") ``` #### Time-to-event ouctome To predict time-to-event outcomes: ```{r} ptl2xfit = XGBtraining_predict(tl2xfit, newdata = vdat, outfile = "pred_L2XGBoost_time_to_event") ``` ## 6. (External) Model Validation In the validation phase, we assess our models' effectiveness by utilizing a fresh dataset that includes the outcome variable. This separate dataset, known as the validation dataset, stands distinct from the one used for training and is termed external validation. This distinction is crucial, setting it apart from internal validation methods such as sampling, cross-validation, leave-one-out, and bootstrapping. It's important to emphasize that while the same functions are used for both prediction and validation, the validation process requires the inclusion of an outcome variable. This distinction prompts additional analyses and comparisons beyond mere prediction. All generated validation plots and associated result files are stored locally for easy reference. ### 6.1 LASSO2 validation We conduct validation for the LASSO2 model with different types of outcome variables. #### Binary ouctome ```{r} vbl = LASSO2_predict(bl, newdata = vdat, newY = TRUE, outfile = "valid_LASSO2_binary") ``` While the returned object vbl also holds predicted probabilities for the ’DZsig’ positive group, in addition, a validation performance figure is saved locally. #### Continuous ouctome ```{r} vcl = LASSO2_predict(cl, newdata = vdat, newY = TRUE, outfile = "valid_LASSO2_cont") ``` Similarly, the returned object vcl holds predicted value, and a validation performance plot is saved. #### Time-to-event ouctome ```{r} vtl = LASSO2_predict(tl, newdata = vdat, newY = TRUE, outfile = "valid_LASSO2_time_to_event") ``` The returned object vtl keeps the predicted risk scores, and locally saved validation results include a calibration plot and a table containing performance statistics. ### 6.2 LASSO2 + regular regression validation Similar to prediction step, we use rms_model to validate the regular regression model using the selected variables from LASSO2. #### Binary ouctome ```{r} vblr = rms_model(blr$fit, newdata = vdat, newY = TRUE, outfile = "valid_LASSO2reg_binary") ``` The above code generates and saves two figures and five tables and some of them are duplicated to the prediction step. #### Continuous ouctome ```{r} vclr = rms_model(clr$fit, newdata = vdat, newY = TRUE, outfile = "valid_LASSO2reg_continuous") ``` The above code also generates and saves two figures and five tables and some of them are duplicated to the prediction step. #### Time-to-event ouctome ```{r} vtlr = rms_model(tlr$fit, data = tdat, newdata = vdat, newY = TRUE, outfile = "valid_LASSO2reg_time_to_event") ``` Same as for prediction step, validation of time-to-event outcome requires training data as well. The above code generates and saves two figures and six tables and some of them are duplicated to the prediction step. ### 6.3 LASSO_plus validation Next, we utilize the same rms_model function for validating the LASSO_plus models, with parameter settings and outputs mirroring those detailed in the model validation of Section 6.2. #### Binary ouctome ```{r} vbfit = rms_model(bfit$fit, newdata = vdat, newY = TRUE, outfile = "valid_LASSOplus_binary") ``` #### Continuous ouctome ```{r} vcfit = rms_model(cfit$fit, newdata = vdat, newY = TRUE, outfile = "valid_LASSOplus_continuous") ``` #### Time-to-event ouctome ```{r} vtfit = rms_model(tfit$fit, data = tdat, newdata = vdat, newY = TRUE, outfile = "valid_LASSOplus_time_to_event") ``` ### 6.4 LASSO2plus validation Additionally, we leverage the same rms_model function to validate the LASSO_plus models. The parameter configurations and outputs align with those outlined in the regression validation detailed in Section 6.2. #### Binary ouctome ```{r} v2bfit = rms_model(b2fit$fit, newdata = vdat, newY = TRUE, outfile = "valid_LASSO2plus_binary") ``` #### Continuous ouctome ```{r} v2cfit = rms_model(c2fit$fit, newdata = vdat, newY = TRUE, outfile = "valid_LASSO2plus_continuous") ``` #### Time-to-event ouctome ```{r} v2tfit = rms_model(t2fit$fit, data = tdat, newdata = vdat, newY = TRUE, outfile = "valid_LASSO2plus_time_to_event") ``` ### 6.5 XGBoost validation The XGBtraining_predict function introduced in Section 5.5, also serves for model validation when the outcome variable is present in the validation cohort. The parameter settings and outputs are the same as those for the LASSO2_prediction function detailed in Section 6.1. #### Binary ouctome ```{r} vbxfit = XGBtraining_predict(bxfit, newdata = vdat, newY = TRUE, outfile = "valid_XGBoost_binary") ``` Predicted probability for the positive group is given for each entry/sample. #### Continuous ouctome ```{r} vcxfit = XGBtraining_predict(cxfit, newdata = vdat, newY = TRUE, outfile = "valid_XGBoost_cont") ``` #### Time-to-event ouctome ```{r} vtxfit = XGBtraining_predict(txfit, newdata = vdat, newY = TRUE, outfile = "valid_XGBoost_time_to_event") ``` ### 6.6 LASSO2 + XGBoost validation The same XGBtraining_predict function is employed for LASSO2 + XGBoost model validation as for the standalone XGBoost model shown in Section 6.5, with consistent parameter settings and identical outputs. #### Binary ouctome ```{r} vblxfit = XGBtraining_predict(blxfit, newdata = vdat, newY = TRUE, outfile = "valid_LXGBoost_binary") ``` #### Continuous ouctome ```{r} vclxfit = XGBtraining_predict(clxfit, newdata = vdat, newY = TRUE, outfile = "valid_LXGBoost_cont") ``` #### Time-to-event ouctome ```{r} vtlxfit = XGBtraining_predict(tlxfit, newdata = vdat, newY = TRUE, outfile = "valid_LXGBoost_time_to_event") ``` ### 6.7 LASSO_plus + XGBoost validation The same XGBtraining_predict function is employed for LASSO_plus + XGBoost model validation as for the standalone XGBoost model shown in Section 6.5, with consistent parameter settings and identical outputs. #### Binary ouctome ```{r} vblpxfit = XGBtraining_predict(blpxfit, newdata = vdat, newY = TRUE, outfile = "valid_LpXGBoost_binary") ``` #### Continuous ouctome ```{r} vclpxfit = XGBtraining_predict(clpxfit, newdata = vdat, newY = TRUE, outfile = "valid_LpXGBoost_cont") ``` #### Time-to-event ouctome ```{r} vtlpxfit = XGBtraining_predict(tlpxfit, newdata = vdat, newY = TRUE, outfile = "valid_LpXGBoost_time_to_event") ``` ### 6.8 LASSO2plus + XGBoost validation The same XGBtraining_predict function is employed for LASSO2plus + XGBoost model validation as for the standalone XGBoost model shown in Section 6.5, with consistent parameter settings and identical outputs. #### Binary ouctome ```{r} vbl2xfit = XGBtraining_predict(bl2xfit, newdata = vdat, newY = TRUE, outfile = "valid_L2XGBoost_binary") ``` #### Continuous ouctome ```{r} vcl2xfit = XGBtraining_predict(cl2xfit, newdata = vdat, newY = TRUE, outfile = "valid_L2XGBoost_cont") ``` #### Time-to-event ouctome ```{r} vtl2xfit = XGBtraining_predict(tl2xfit, newdata = vdat, newY = TRUE, outfile = "valid_L2XGBoost_time_to_event") ``` ## 7. All-in-one! If you find it challenging to call various functions separately, the all-in-one function provides a simplified solution. It efficiently manages predictive model development and validation for all eight methods integrated into this package, spanning three distinct outcome types, with a single function call. Moreover, you can employ this versatile function for a single method with one or more outcome variables, offering flexibility to suit your specific needs. If a validation dataset is at your disposal, the function seamlessly incorporates the validation process within the same operation. ```{r, eval = FALSE} modelout = csmpvModelling(tdat = tdat, vdat = vdat, Ybinary = "DZsig", varsBinary = Xvars, Ycont = "Age", varsCont = AgeXvars, time = "FFP..Years.", event = "Code.FFP", varsSurvival = Xvars, outfileName= "all_in_one") ``` This single function call generates all models and provides predictions and validations for each of them with our example training data: tdat and validation data: vdat.. To save space, the running results are hidden. In other words, this single function call can replace all three sections discussed in Sections 4, 5, and 6. The models will be returned, and all result files will be saved locally. Certainly, we can use this all-in-one function to work on one outcome variable and one model at a time, for example: ```{r, warning=FALSE} DZlassoreg = csmpvModelling(tdat = tdat, vdat = vdat, Ybinary = "DZsig", varsBinary = Xvars, methods = "LASSO2_reg", outfileName= "just_one") ``` This is equivalent to using LASSO2_reg for modeling, followed by prediction and validation with rms_model for the "DZsig" classification. ## 8. Special modelling In preceding sections, the target model type consistently matched the provided output. However, scenarios can emerge where they do not necessarily correspond. For instance, situations might arise in which we aim to construct a risk classification model even when our training cohort lacks risk classification data but includes survival information. To undertake this specialized modeling, let's assume that we have a set of variables associated with survival outcomes. XGpred can utilize all of these variables and provides choices for variable selection within the functions with three options: LASSO2, LASSO2plus, and LASSO_plus. By employing the same variable list, denoted as Xvars, we can invoke the XGpred function with options to perform variable selection using LASSO_plus to identify the top 5 variables. This wrapper function applies XGBoost and Cox modeling to determine high and low-risk groups using survival data. Subsequently, these groups undergo filtration and are used to construct an empirical Bayesian-based binary risk classification model. ``` {r,results = 'hide',message=FALSE, warnings=FALSE} xgobj = XGpred(data = tdat, varsIn = Xvars, selection = TRUE, vsMethod = "LASSO_plus", topN = 5, time = "FFP..Years.", event = "Code.FFP", outfile = "XGpred") ``` The XGpred output object, xgobj, encompasses all the essential information for risk classification, including that of the training cohort. By default, this function generates binary classification; any samples not classified into high-risk groups are placed into the low-risk group. When 'nclass' is set to 3, samples are categorized into low, middle, and high-risk groups, as demonstrated with the following code: ``` {r,results = 'hide',message=FALSE, warnings=FALSE} xgobj3 = XGpred(data = tdat, varsIn = Xvars, time = "FFP..Years.", event = "Code.FFP", selection = TRUE, vsMethod = "LASSO_plus", topN = 5, nclass = 3, outfile = "XGpred") ``` To observe the performance of the risk group in the training set, we can generate a KM plot using the confirmVars function: ```{r, results = 'hide',message=FALSE, warnings=FALSE} tdat$XGpred_class2 = xgobj$XGpred_prob_class training_risk_confirm2 = confirmVars(data = tdat, biomks = "XGpred_class2", time = "FFP..Years.", event = "Code.FFP", outfile = "training2grps_riskSurvival", outcomeType = "time-to-event") training_risk_confirm2[[3]] tdat$XGpred_class3 = xgobj3$XGpred_prob_class training_risk_confirm3 = confirmVars(data = tdat, biomks = "XGpred_class3", time = "FFP..Years.", event = "Code.FFP", outfile = "training3grps_riskSurvival", outcomeType = "time-to-event") training_risk_confirm3[[3]] ``` Then we can predict the risk classification for a validation cohort: ``` {r} xgNew = XGpred_predict(newdat = vdat, XGpredObj = xgobj) xgNew3 = XGpred_predict(newdat = vdat, XGpredObj = xgobj3) ``` While the default calibration shift (scoreShift) is set to 0, you can adjust it based on model scores if there's a platform/batch difference between the training and validation cohorts. If survival data is available for the testing dataset, we can employ the confirmVars function introduced earlier to assess the reasonableness of risk classification. ```{r,results = 'hide',message=FALSE, warnings=FALSE } vdat$XGpred_class2 = xgNew$XGpred_prob_class risk_confirm2 = confirmVars(data = vdat, biomks = "XGpred_class2", time = "FFP..Years.", event = "Code.FFP", outfile = "riskSurvival2grps", outcomeType = "time-to-event") risk_confirm2[[3]] vdat$XGpred_class3 = xgNew3$XGpred_prob_class risk_confirm3 = confirmVars(data = vdat, biomks = "XGpred_class3", time = "FFP..Years.", event = "Code.FFP", outfile = "riskSurvival3grps", outcomeType = "time-to-event") risk_confirm3[[3]] ``` ## csmpv R package general information Title: Biomarker confirmation, selection, modelling, prediction and validation Version: 1.0.2 Author: Aixiang Jiang Maintainer: Aixiang Jiang aijiang@bccrc.ca{.email} Depends: R (\>= 4.2.0) Suggests: knitr VignetteBuilder: knitr Imports: survival, glmnet, Hmisc, rms, forestmodel, ggplot2, ggpubr,survminer, mclust, xgboost, cowplot ## References ```{r} devtools::session_info() ``` Hastie et al. (1992, ISBN 0 534 16765-9) Therneau et al. (2000, ISBN 0-387-98784-3) Friedman et al. (2010) Simon et al. (2011) Chen and Guestrin (2016) Aoki et al. (2023)