Title: Subsampling Ranking Forward Selection (SuRF)
Version: 1.1.0.1
Maintainer: Toby Kenney <tkenney@mathstat.dal.ca>
Depends: R (≥ 3.2.3)
Imports: glmnet, survival, dplyr
Suggests: foreach, parallel, doParallel, knitr
Author: Lihui Liu [aut], Toby Kenney [aut, cre]
Description: Performs variable selection based on subsampling ranking forward selection. Details of the method are published in Lihui Liu, Hong Gu, Johan Van Limbergen, Toby Kenney (2020) SuRF: A new method for sparse variable selection, with application in microbiome data analysis Statistics in Medicine 40 897-919 <doi:10.1002/sim.8809>. Xo is the matrix of predictor variables. y is the response variable. Currently only binary responses using logistic regression are supported. X is a matrix of additional predictors which should be scaled to have sum 1 prior to analysis. fold is the number of folds for cross-validation. Alpha is the parameter for the elastic net method used in the subsampling procedure: the default value of 1 corresponds to LASSO. prop is the proportion of variables to remove in each subsample. weights indicates whether observations should be weighted by class size. When the class sizes are unbalanced, weighting observations can improve results. B is the number of subsamples to use for ranking the variables. C is the number of permutations to use for estimating the critical value of the null distribution. If the 'doParallel' package is installed, the function can be run in parallel by setting ncores to the number of threads to use. If the default value of 1 is used, or if the 'doParallel' package is not installed, the function does not run in parallel. display.progress indicates whether the function should display messages indicating its progress. family is a family variable for the glm() fitting. Note that the 'glmnet' package does not permit the use of nonstandard link functions, so it will always use the default link function. However, the glm() fitting will use the specified link. The default is binomial with logistic regression, because this is a common use case. pval is the p-value for inclusion of a variable in the model. Under the null case, the number of false positives will be geometrically distributed with this as probability of success, so if this parameter is set to p, the expected number of false positives should be p/(1-p).
Encoding: UTF-8
License: GPL-3
VignetteBuilder: knitr
NeedsCompilation: no
Packaged: 2022-01-07 15:26:42 UTC; tkenney
Repository: CRAN
Date/Publication: 2022-01-08 01:52:49 UTC

Ranking

Description

This function ranks the variables after B subsamplings. It also removes the highly correlated variables from the lower level.

Usage

Ranking(data, model)

Arguments

data

the data object returned from the dataclean function (the last column is the outcome)

model

model object from the sub-sampling procedure

Value

table: a table showing the ranked variable list with selection frequencies (in descending order)

Beta: coefficient flags (1 or 0) indicating whether each variable is selected; the intercept is not included
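
As an illustration of the ranking idea only (not the package's own code), the following minimal sketch ranks variables by how often they receive a nonzero coefficient across subsamples. The matrix beta_mat and its values are made up for the example, and the removal of highly correlated variables is not covered.

```r
# Hypothetical 4 x 3 coefficient matrix: one row per subsample,
# one column per variable (values are invented for illustration).
beta_mat <- matrix(
  c(0.5, 0.0, 0.2,
    0.4, 0.0, 0.0,
    0.6, 0.1, 0.0,
    0.3, 0.0, 0.3),
  nrow = 4, byrow = TRUE,
  dimnames = list(NULL, c("X1", "X2", "X3"))
)

# Selection flag: 1 if the coefficient is nonzero in that subsample, else 0.
selected <- (beta_mat != 0) * 1

# Selection frequency per variable, sorted in descending order.
freq <- sort(colSums(selected), decreasing = TRUE)
rank_table <- data.frame(
  variable = names(freq),
  frequency = as.vector(freq),
  stringsAsFactors = FALSE
)
print(rank_table)
```

Variables selected in more subsamples appear higher in rank_table, which mirrors the descending-frequency table described above.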


Ranking_cox

Description

This function ranks the variables after B subsamplings for the Cox proportional hazards model. It also removes the highly correlated variables from the lower level.

Usage

Ranking_cox(data, model)

Arguments

data

the data object returned from the dataclean function (the last column is the outcome)

model

Cox proportional hazards model object from the sub-sampling procedure (B times)

Value

table: a table showing the ranked variable list with selection frequencies (in descending order)

Beta: coefficient flags (1 or 0) indicating whether each variable is selected; the intercept is not included


SURF

Description

SuRF is a sparse variable selection method which uses a subsampling approach and LASSO to rank variables before applying forward selection using a permutation test. The function is able to give results at a range of significance levels simultaneously.

Usage

SURF(
  Xo,
  y,
  X = NULL,
  fold = 10,
  Alpha = 1,
  prop = 0.1,
  weights = FALSE,
  B = 1000,
  C = 200,
  ncores = 1,
  display.progress = TRUE,
  family = stats::binomial(link = "logit"),
  alpha_u = 0.1,
  alpha = 0.05
)

Arguments

Xo

- other types of predictor variables

y

- response variable, a vector for most families. For family="cox", y should be a matrix with the response variable in column 1 and the censoring status in column 2.

X

- count data; needs to be converted to proportions

fold

- number of folds for cross-validation in LASSO

Alpha

- Alpha parameter for elastic net

prop

- proportion of observations left out in subsampling

weights

- whether to use weighted regression: for unbalanced class sizes (binomial family only) or a weighted sample for other families. In a binomial model, set weights=TRUE if the weighted version is desired and weights=FALSE otherwise. In other models, set weights to a vector of weights of the same length as the sample size N if the weighted version is desired, and to FALSE otherwise.

B

- number of subsamples to take

C

- number of permutations used to estimate null distribution

display.progress

- whether SuRF should print a message on completion of each

alpha_u

- the upper bound of the significance level for the permutation test; alpha_u must be in the range (0,1). The larger this value, the longer the program will run.

alpha

- the alpha value of interest (alpha > 0 and alpha <= alpha_u). It can be a single value or a vector. If missing, it defaults to 0.05.

ncores

- whether SuRF should compute in parallel: 1 indicates no parallel computation; any value greater than 1 will compute in parallel

family

- the distribution family of the response variable
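
To make the idea of weighting observations by class size concrete, here is a minimal sketch of one common scheme, in which each observation is weighted inversely to the size of its class so that both classes carry equal total weight. This scheme is an assumption for illustration; the exact weights that SURF() uses when weights=TRUE are not specified here.

```r
# Toy unbalanced binary outcome: six 0s and two 1s.
y <- c(0, 0, 0, 0, 0, 0, 1, 1)

# Class sizes, indexed by class label.
class_size <- table(y)

# Assumed scheme: weight = N / (number of classes * size of own class).
w <- length(y) / (length(class_size) * as.vector(class_size[as.character(y)]))

# Total weight per class; under this scheme both classes sum to N / 2.
tapply(w, y, sum)
```

For families other than binomial, the argument description above indicates that a user-supplied numeric vector of length N is passed instead.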

Details

SuRF consists of two steps. In the first step, LASSO variable selection is applied to a large number of subsamples of the data set, to provide a list of selected variables for each subsample. This list is used to rank the variables, based on the number of subsamples in which each variable is selected, so that variables that are selected in more subsamples are ranked more highly. In the second step, this list is used as a basis for forward selection, with variables higher on the list tried first. If a highly-ranked variable is not selected, later variables are tried, and after each variable is selected, the variables not yet selected (even previously non-selected variables) are tried in order of the ranking from Step 1. The decision whether to include a variable is based on a permutation test for the deviance statistic.

Full details of the SuRF method are in the paper:

Lihui Liu, Hong Gu, Johan Van Limbergen, Toby Kenney (2020) SuRF: A new method for sparse variable selection, with application in microbiome data analysis Statistics in Medicine 40 897-919

doi: https://onlinelibrary.wiley.com/doi/10.1002/sim.8809
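
The control flow of the second step can be sketched as follows. This is a simplified illustration under stated assumptions: forward_by_rank() and toy_accept() are hypothetical names, and the toy acceptance rule stands in for the permutation test on the deviance statistic.

```r
# Forward selection over a ranked list. After every accepted variable,
# the scan restarts from the top of the ranking, so variables that were
# rejected earlier are tried again.
forward_by_rank <- function(ranked, accept) {
  selected <- character(0)
  repeat {
    added <- FALSE
    for (v in setdiff(ranked, selected)) {   # keeps the ranking order
      if (accept(selected, v)) {
        selected <- c(selected, v)
        added <- TRUE
        break                                # restart from the top
      }
    }
    if (!added) break                        # no candidate passed: stop
  }
  selected
}

# Toy acceptance rule (stand-in for the permutation test):
# X1 is always accepted, X3 only once X1 is in the model, others never.
toy_accept <- function(selected, candidate) {
  if (candidate == "X1") return(TRUE)
  if (candidate == "X3") return("X1" %in% selected)
  FALSE
}

result <- forward_by_rank(c("X3", "X1", "X2", "X4"), toy_accept)
print(result)
```

In this toy run X3 is ranked first but rejected, X1 is then accepted, and X3 is accepted on the next pass, which shows why previously non-selected variables are revisited.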

Value

Bmod: sub-sampling results

trdata: data frame including both X and y

ranklist: ranking table

modpath: variable selection path (along the alpha range)

selmod: model results at the selected alpha(s)

family: model family used

Examples

library(survival)
library(glmnet)
library(SuRF.vs)

# simulate predictors and survival times
N <- 100
p <- 200
nzc <- p / 3
X <- matrix(rnorm(N * p), N, p)
beta <- rnorm(nzc)
fx <- X[, seq(nzc)] %*% beta / 3
hx <- exp(fx)
ty <- rexp(N, hx)
tcens <- rbinom(n = N, prob = 0.3, size = 1)  # censoring indicator (1 or 0)

# settings (most values are passed literally in the calls below)
Xo <- NULL
B <- 20
Alpha <- 1
fold <- 5
ncores <- 1
prop <- 0.1
C <- 3
alpha_u <- 0.2
alpha <- seq(0.01, 0.1, len = 20)

# binomial model
XX <- X[, 1:2]
f <- 1 + XX %*% c(2, 1.5)
prob_y <- exp(f) / (1 + exp(f))  # success probability; avoids overwriting p
y <- rbinom(N, 1, prob_y)
weights <- FALSE
family <- stats::binomial(link = "logit")

surf_binary <- SURF(Xo = X, y = y, fold = 5, weights = weights, B = 10, C = 5,
                    family = family, alpha_u = 0.1, alpha = alpha)

# linear regression
y <- 1 + XX %*% c(0.1, 0.2)
family <- stats::gaussian(link = "identity")

surf_lm <- SURF(Xo = X, y = y, fold = 5, weights = weights, B = 10, C = 5,
                family = family, alpha_u = 0.1, alpha = alpha)

# Cox proportional hazards model
y <- cbind(time = ty, status = 1 - tcens)
weights <- rep(1, N)
rseed <- floor(runif(20, 1, 100))  # random observations that get weight 2
weights[rseed] <- 2
family <- list(family = "cox")

surf_cox <- SURF(Xo = X, y = y, fold = 5, weights = weights, B = 10, C = 5,
                 family = family, alpha_u = alpha_u, alpha = alpha)


Subsample.w

Description

This function subsamples the data and performs LASSO (a single time) on the selected samples.

Usage

Subsample.w(data, fold, Alpha, prop, weights, family, Type)

Arguments

data

the data frame should be arranged so that the columns are X1, X2, X3, ..., Xp, status, where the Xi's are variables and status is the outcome (for logistic regression, the outcome is coded 0/1)

fold

number of folds used in LASSO cross-validation to select the tuning parameter

Type

should always be 'class' for classification

Alpha

1 for LASSO, 0 for ridge regression

prop

proportion of samples left out for each subsampling

weights

In a binomial model, weights=TRUE if the weighted version is desired and weights=FALSE otherwise. In other generalized models, weights is a vector of weights of the same length as the sample size N if the weighted version is desired, and FALSE otherwise.

family

the distribution family for the response variable.

Value

lambda: the tuning parameter within 1 sd of the tuning parameter that gives the lowest CV error

coef: a table showing the names of the variables selected by LASSO and their coefficients

table: an equal proportion of samples from each status is left out, and the model built on the selected subsamples is used to predict the left-out ones. The table contains two columns: column 1 is the predicted value and column 2 is the true class

error: misclassification error based on the above table

Beta: a vector of length p+1 containing the beta coefficients from the LASSO model. Note that the intercept is placed at the end of this vector.
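
The leave-out scheme and the misclassification error described above can be sketched in base R as follows. This is an illustration only: the toy outcome, the rounding rule for the number of left-out samples, and the made-up predictions are assumptions, and no LASSO fit is performed.

```r
set.seed(1)

# Toy binary outcome with 30 observations of class 0 and 10 of class 1.
status <- rep(c(0, 1), times = c(30, 10))
prop <- 0.1

# Leave out the same proportion of observations within each class
# (the rounding used here is an assumption).
left_out <- unname(unlist(lapply(
  split(seq_along(status), status),
  function(idx) idx[sample.int(length(idx), size = round(prop * length(idx)))]
)))
train_idx <- setdiff(seq_along(status), left_out)

# Two-column table: predicted value and true class for the left-out samples.
# The predictions are invented here; in practice they come from the fitted model.
pred_table <- data.frame(predicted = c(0, 0, 1, 1), true = status[left_out])

# Misclassification error based on the table.
error <- mean(pred_table$predicted != pred_table$true)
print(error)
```

Sampling within each class keeps the class balance of the left-out set equal to that of the full data, which matters when class sizes are unbalanced.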


Subsample.w_cox

Description

This function subsamples the data and performs LASSO (a single time) on the selected samples for the Cox proportional hazards model.

Usage

Subsample.w_cox(data, fold, Alpha, prop, weights)

Arguments

data

the data frame should be arranged so that the columns are X1, X2, X3, ..., Xp, status, where the Xi's are variables and status is the outcome (for logistic regression, the outcome is coded 0/1)

fold

number of folds used in LASSO cross-validation to select the tuning parameter

Alpha

1 for LASSO, 0 for ridge regression

prop

proportion of samples left out for each sub-sampling

weights

a vector of weights if the weighted version is desired; FALSE otherwise

Value

lambda: the tuning parameter within 1 sd of the tuning parameter that gives the lowest CV error

coef: a table showing the names of the variables selected by LASSO and their coefficients

table: an equal proportion of samples from each status is left out, and the model built on the selected subsamples is used to predict the left-out ones. The table contains two columns: column 1 is the predicted value and column 2 is the true value of the outcome

error: misclassification error based on the above table

Beta: a vector of length p+1 containing the beta coefficients from the LASSO model.


Subsample_B

Description

This function runs the sub-sampling procedure B times.

Usage

Subsample_B(B, data, fold, Alpha, prop, weights, ncores, family)

Arguments

B

the number of sub-samplings to run (e.g., B=1000)

data

the data frame should be arranged so that the columns are X1, X2, X3, ..., Xp, status, where the Xi's are variables and status is the outcome (for logistic regression, the outcome is coded 0/1)

fold

number of folds used in LASSO cross-validation to select the tuning parameter

Alpha

1 for LASSO, 0 for ridge regression

prop

proportion of samples left out for each subsampling

family

the distribution family of the response variable

weights

In a binomial model, weights=TRUE if the weighted version is desired and weights=FALSE otherwise. In other generalized models, weights is a vector of weights of the same length as the sample size N if the weighted version is desired, and FALSE otherwise.

ncores

the number of cores to use for parallel computation

Value

Class.Err: misclassification error on the left-out samples over B runs; a vector of length B

Lambda: tuning parameters selected from the B runs; a vector of length B

BETA: a matrix that stores the beta coefficients from all B runs


dataclean

Description

This function 1) scales the count data (count data only) to proportions, 2) creates a data frame consisting of the proportion data, and 3) keeps a variable name list (original variable names and names in terms of X's, e.g. X1, X2, etc.).

Usage

dataclean(X.c, X.o, y)

Arguments

X.c

data frame that has count data from all levels (only count data will be row scaled)

X.o

data frame that has other environmental variables (no scaling is done here; these variables will be scaled together with the proportion data in the LASSO step)

y

a vector representing the outcome (0 or 1 for the binomial model)

Value

data.Xy: a data frame containing all variables, named X1, X2, ..., Xp, and the binary outcome (called status) in the last column; this data frame is used in the other functions for data analysis
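
The three tasks listed for dataclean can be illustrated with a small base R sketch. It is not the package's implementation: the toy counts, the taxon names and the object names (props, name_map, data_xy) are invented for the example, and the X.o variables are omitted.

```r
# Toy count data: 3 observations (rows) by 3 taxa (columns).
counts <- data.frame(
  taxonA = c(10, 0, 5),
  taxonB = c(30, 20, 5),
  taxonC = c(60, 80, 10)
)
y <- c(0, 1, 1)

# 1) Row-scale the counts so that each observation sums to 1.
props <- sweep(as.matrix(counts), 1, rowSums(counts), "/")

# 3) Keep a map between the original names and the generic names X1, ..., Xp.
name_map <- data.frame(
  original = colnames(counts),
  internal = paste0("X", seq_len(ncol(counts))),
  stringsAsFactors = FALSE
)

# 2) Build the analysis data frame with the outcome (status) in the last column.
data_xy <- data.frame(props)
colnames(data_xy) <- name_map$internal
data_xy$status <- y
print(data_xy)
```

Keeping name_map alongside the data makes it possible to translate selected variables such as X2 back to their original names after the analysis.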


selectnew

Description

This function adds a new node (a new deviance distribution for adding the new variable).

Usage

selectnew(
  vslist,
  ranktable,
  data,
  weights,
  ncores = 1,
  family,
  alpha_l,
  alpha_u,
  C
)

Arguments

ranktable

ranking table from the ranking step

family

generalized model family

alpha_l

the minimum significance level (>= 0)

data

the data frame should be arranged so that the columns are X1, X2, X3, ..., Xp, status, where the Xi's are variables and status is the outcome (for logistic regression, the outcome is coded 0/1)

ncores

number of cores for parallel computing

C

the number of permutations

alpha_u

the upper significance level

weights

In a binomial model, weights=TRUE if the weighted version is desired and weights=FALSE otherwise. In other generalized models, weights is a vector of weights of the same length as the sample size N if the weighted version is desired, and FALSE otherwise.

vslist

the current list of selected variables

Value

vslist: the updated list of selected variables

dev.dist: deviance distributions used for selecting the new variable

vtlist: has 1) alpha.range for the newly selected variable, 2) selvar: the newly selected variable name, 3) pval: p-value for the newly selected variable, and 4) dev: the deviance contributed by the newly selected variable


selpath

Description

This function traces the selection path.

Usage

selpath(data, weights, ranktable, ncores, family, C, alpha_u)

Arguments

ranktable

ranking table from the ranking step

family

generalized model family

data

the data frame should be arranged so that the columns are X1, X2, X3, ..., Xp, status, where the Xi's are variables and status is the outcome (for logistic regression, the outcome is coded 0/1)

ncores

number of cores for parallel computing

C

the number of permutations

alpha_u

the upper significance level

weights

In a binomial model, weights=TRUE if the weighted version is desired and weights=FALSE otherwise. In other generalized models, weights is a vector of weights of the same length as the sample size N if the weighted version is desired, and FALSE otherwise.

Value

selpoint: a list containing each selected variable point. The information includes 1) vslist: the variable set before selecting the variable listed in 'selvar'; 2) alpha.range: the variable will be selected within this alpha range; 3) pval: p-value of the variable; 4) selvar: selected variable; 5) vslist: the variable set after selecting the variable listed in 'selvar'

sel.nodes: a list of the deviance distributions used for selecting each new variable. It includes 1) vslist: the variable set before the new selection; 2) dev.dist: the permutation distribution for selecting the new variable; 3) vtlist, which has i) pval: p-value of the proposed variable, ii) selvar: selected variable (NULL if the proposed variable is not selected), iii) dev: deviance contributed by the proposed variable


selvar_alpha

Description

This function extracts and summarizes the results at the 'alpha' level from the 'mod' object to obtain 1) the final selected variables 'selvar'; 2) the p-value of each selected variable, corresponding to the variables in 'selvar'; 3) the deviance contributed by each selected variable (given the previously selected variables); 4) the deviance permutation distribution; and 5) the cutoff value based on the (1-alpha) percentile of the deviance permutation distribution. When no variable is selected, only the last deviance distribution and the cutoff value are returned. This function can be used separately after running selpath(). The alpha value must be > 0 and <= the alpha_u parameter from SURF().

Usage

selvar_alpha(res, alpha)

Arguments

res

'mod' object returned from the 'selpath' function

alpha

alpha level (default alpha=0.05); a single value no greater than the value of 'alpha_u' specified in the selpath() function

Value

selvar: final selected variables

pval: p-value of each selected variable (present if at least 1 variable is selected)

devlist: deviance contributed by each selected variable (given the previously selected variables; present if at least 1 variable is selected)

dist.mat: a list of deviance permutation distributions (including the distribution from the step at which no more variables are added)
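
The cutoff described above, the (1-alpha) percentile of a deviance permutation distribution, can be sketched with base R. The permutation deviances and the observed deviance below are invented numbers, and the default quantile type used by quantile() is an assumption about how the percentile is computed.

```r
# Hypothetical permutation deviances for one selection step.
perm_dev <- c(0.2, 0.5, 1.1, 1.8, 2.4, 3.0, 3.9, 4.6, 5.2, 6.7)

alpha_u <- 0.1
alpha <- 0.05

# The alpha of interest must lie in (0, alpha_u].
stopifnot(alpha > 0, alpha <= alpha_u)

# Cutoff: (1 - alpha) percentile of the permutation distribution.
cutoff <- quantile(perm_dev, probs = 1 - alpha, names = FALSE)

# A candidate is kept when its observed deviance exceeds the cutoff.
observed_dev <- 7.3
keep_variable <- observed_dev > cutoff
print(c(cutoff = cutoff, keep = keep_variable))
```

A smaller alpha moves the cutoff further into the upper tail, so fewer variables pass the test.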


update_dev

Description

This function derives the deviance distribution based on the permutation method. It is not meant to be used independently; it is called by the function selectnew().

Usage

update_dev(data, vslist, C, weights, ncores, family)

Arguments

data

the variable 'data' within seqcutoff()

vslist

a vector of selected variables

C

the number of permutations

family

e.g., family=stats::gaussian(link="identity"), family=stats::binomial(link="logit"), family=list(family="cox"), etc.

weights

In a binomial model, weights=TRUE if the weighted version is desired and weights=FALSE otherwise. In other generalized models, weights is a vector of weights of the same length as the sample size N if the weighted version is desired, and FALSE otherwise.

ncores

the number of cores to use for parallel computation

Value

dev: a vector of deviances after C permutations (the length of this vector is C)
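
One generic way to build a permutation distribution of a deviance statistic is sketched below with stats::glm(). It is an assumption-laden illustration, not the procedure update_dev() implements: the helper perm_deviance(), its candidate argument, and the choice to permute the candidate column are all hypothetical.

```r
set.seed(42)

# Toy data: X1 is related to the outcome, X2 is pure noise.
n <- 60
dat <- data.frame(X1 = rnorm(n), X2 = rnorm(n))
dat$status <- rbinom(n, 1, plogis(1.5 * dat$X1))

# Hypothetical helper: deviance reduction from adding a permuted copy of
# `candidate` to the model that already contains `vslist`, repeated C times.
perm_deviance <- function(data, vslist, candidate, C,
                          family = stats::binomial(link = "logit")) {
  base_terms <- if (length(vslist) > 0) vslist else "1"
  base_fit <- glm(reformulate(base_terms, response = "status"),
                  data = data, family = family)
  full_formula <- reformulate(c(vslist, candidate), response = "status")
  vapply(seq_len(C), function(i) {
    permuted <- data
    permuted[[candidate]] <- sample(permuted[[candidate]])  # break any link to y
    deviance(base_fit) - deviance(glm(full_formula, data = permuted,
                                      family = family))
  }, numeric(1))
}

dev <- perm_deviance(dat, vslist = "X1", candidate = "X2", C = 20)
length(dev)
```

The result is a vector of length C, matching the shape of the value documented above; its upper quantiles can then serve as critical values for the observed deviance reduction.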


update_dev_cox

Description

For the Cox proportional hazards model only. This function derives the deviance distribution based on the permutation method. It is not meant to be used independently; it is called by the function selectnew().

Usage

update_dev_cox(data, vslist, C, ncores, weights)

Arguments

data

the variable 'data' within seqcutoff()

vslist

a vector of selected variables

C

the number of permutations

weights

In a binomial model, weights=TRUE if the weighted version is desired and weights=FALSE otherwise. In other generalized models, weights is a vector of weights of the same length as the sample size N if the weighted version is desired, and FALSE otherwise.

ncores

the number of cores to use for parallel computation

Value

dev: a vector of deviances after C permutations (the length of this vector is C)