Title: Tidy Estimation of Heterogeneous Treatment Effects
Version: 1.0.2
Description: Estimates heterogeneous treatment effects using tidy semantics on experimental or observational data. Methods are based on the doubly-robust learner of Kennedy (2020) <doi:10.48550/arXiv.2004.14497>. You provide a simple recipe for what machine learning algorithms to use in estimating the nuisance functions and 'tidyhte' will take care of cross-validation, estimation, model selection, diagnostics and construction of relevant quantities of interest about the variability of treatment effects.
URL: https://github.com/ddimmery/tidyhte https://ddimmery.github.io/tidyhte/index.html
BugReports: https://github.com/ddimmery/tidyhte/issues
License: MIT + file LICENSE
Encoding: UTF-8
RoxygenNote: 7.2.3
Suggests: covr, devtools, estimatr, ggplot2, glmnet, knitr, mockr, nprobust, palmerpenguins, quadprog, quickblock, rmarkdown, testthat (>= 3.0.0), vimp, WeightedROC
Config/testthat/edition: 3
Imports: checkmate, dplyr, lifecycle, magrittr, progress, purrr, R6, rlang, SuperLearner, tibble
VignetteBuilder: knitr
NeedsCompilation: no
Packaged: 2023-08-11 15:35:39 UTC; drewd
Author: Drew Dimmery
Maintainer: Drew Dimmery <drew.dimmery@univie.ac.at>
Repository: CRAN
Date/Publication: 2023-08-14 11:30:02 UTC
tidyhte: Tidy Estimation of Heterogeneous Treatment Effects
Description
Estimates heterogeneous treatment effects using tidy semantics on experimental or observational data. Methods are based on the doubly-robust learner of Kennedy (2020) arXiv:2004.14497. You provide a simple recipe for what machine learning algorithms to use in estimating the nuisance functions and 'tidyhte' will take care of cross-validation, estimation, model selection, diagnostics and construction of relevant quantities of interest about the variability of treatment effects.
Details
The best place to get started with tidyhte is vignette("experimental_analysis"), which walks through a full analysis of HTE on simulated data, or vignette("methodological_details"), which gets into more of the details underlying the method.
Author(s)
Maintainer: Drew Dimmery <drew.dimmery@univie.ac.at> (ORCID) [copyright holder]
References
Kennedy, E. H. (2020). Towards optimal doubly robust estimation of heterogeneous causal effects. arXiv preprint arXiv:2004.14497.
See Also
The core public-facing functions are make_splits, produce_plugin_estimates, construct_pseudo_outcomes and estimate_QoI. Configuration is accomplished through HTE_cfg in addition to a variety of related classes (see basic_config).
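A minimal end-to-end sketch of this workflow is shown below. The dataframe df and its columns (unit identifier uid, treatment a, outcome y, known propensity score ps, and moderators x1 and x2) are hypothetical stand-ins; a complete runnable example appears under attach_config() later in this manual.
library("dplyr")
library("tidyhte")
# Recipe: known propensity score plus one smooth and one stratified moderator.
cfg <- basic_config() %>%
  add_known_propensity_score("ps") %>%
  add_moderator("KernelSmooth", x1) %>%
  add_moderator("Stratified", x2)
# Pipeline: split, fit plugin models, build pseudo-outcomes, estimate QoIs.
df %>%
  attach_config(cfg) %>%
  make_splits(uid, .num_splits = 4) %>%
  produce_plugin_estimates(y, a, x1, x2) %>%
  construct_pseudo_outcomes(y, a) %>%
  estimate_QoI(x1, x2)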
Configuration of a Constant Estimator
Description
Constant_cfg is a configuration class for estimating a constant model. That is, the model is a simple, one-parameter mean model.
Super class
tidyhte::Model_cfg -> Constant_cfg
Public fields
model_class
The class of the model, required for all classes which inherit from Model_cfg.
Methods
Public methods
Method new()
Create a new Constant_cfg
object.
Usage
Constant_cfg$new()
Returns
A new Constant_cfg
object.
Examples
Constant_cfg$new()
Method clone()
The objects of this class are cloneable with this method.
Usage
Constant_cfg$clone(deep = FALSE)
Arguments
deep
Whether to make a deep clone.
Examples
## ------------------------------------------------
## Method `Constant_cfg$new`
## ------------------------------------------------
Constant_cfg$new()
Configuration of Model Diagnostics
Description
Diagnostics_cfg
is a configuration class for estimating a variety of
diagnostics for the models trained in the course of HTE estimation.
Public fields
ps
Model diagnostics for the propensity score model.
outcome
Model diagnostics for the outcome models.
effect
Model diagnostics for the joint effect model.
params
Parameters for any requested diagnostics.
Methods
Public methods
Method new()
Create a new Diagnostics_cfg
object with specified diagnostics to estimate.
Usage
Diagnostics_cfg$new(ps = NULL, outcome = NULL, effect = NULL, params = NULL)
Arguments
ps
Model diagnostics for the propensity score model.
outcome
Model diagnostics for the outcome models.
effect
Model diagnostics for the joint effect model.
params
List providing values for parameters to any requested diagnostics.
Returns
A new Diagnostics_cfg
object.
Examples
Diagnostics_cfg$new(
outcome = c("SL_risk", "SL_coefs", "MSE", "RROC"),
ps = c("SL_risk", "SL_coefs", "AUC")
)
Method add()
Add diagnostics to the Diagnostics_cfg
object.
Usage
Diagnostics_cfg$add(ps = NULL, outcome = NULL, effect = NULL)
Arguments
ps
Model diagnostics for the propensity score model.
outcome
Model diagnostics for the outcome models.
effect
Model diagnostics for the joint effect model.
Returns
An updated Diagnostics_cfg
object.
Examples
cfg <- Diagnostics_cfg$new(
outcome = c("SL_risk", "SL_coefs", "MSE", "RROC"),
ps = c("SL_risk", "SL_coefs")
)
cfg <- cfg$add(ps = "AUC")
Method clone()
The objects of this class are cloneable with this method.
Usage
Diagnostics_cfg$clone(deep = FALSE)
Arguments
deep
Whether to make a deep clone.
Examples
Diagnostics_cfg$new(
outcome = c("SL_risk", "SL_coefs", "MSE", "RROC"),
ps = c("SL_risk", "SL_coefs", "AUC")
)
## ------------------------------------------------
## Method `Diagnostics_cfg$new`
## ------------------------------------------------
Diagnostics_cfg$new(
outcome = c("SL_risk", "SL_coefs", "MSE", "RROC"),
ps = c("SL_risk", "SL_coefs", "AUC")
)
## ------------------------------------------------
## Method `Diagnostics_cfg$add`
## ------------------------------------------------
cfg <- Diagnostics_cfg$new(
outcome = c("SL_risk", "SL_coefs", "MSE", "RROC"),
ps = c("SL_risk", "SL_coefs")
)
cfg <- cfg$add(ps = "AUC")
Predictor class for the cross-fit predictor of "partial" CATEs
Description
Predictor class for the cross-fit predictor of "partial" CATEs
Details
The class makes it easier to manage the K predictors for retrieving K-fold cross-validated estimates, as well as to measure how treatment effects change when only a single covariate is changed from its "natural" levels (in the sense of "natural" used in the direct / indirect effects literature).
Public fields
models
A list of the K model fits
num_splits
The number of folds used in cross-fitting.
num_mc_samples
The number of samples to retrieve across the covariate space. If num_mc_samples is larger than the sample size, then the entire dataset will be used.
covariates
The unquoted names of the covariates used in the second-stage model.
model_class
The model class (in the sense of Model_cfg). For instance, a SuperLearner model will have model class "SL".
Methods
Public methods
Method new()
FX.Predictor is a class which simplifies the management of a set of cross-fit prediction models of treatment effects and provides the ability to get the "partial" effects of particular covariates.
Usage
FX.Predictor$new(models, num_splits, num_mc_samples, covariates, model_class)
Arguments
models
A list of the K model fits.
num_splits
Integer number of cross-fitting folds.
num_mc_samples
Integer number of Monte-Carlo samples across the covariate space. If this is larger than the sample size, then the whole dataset will be used.
covariates
The unquoted names of the covariates.
model_class
The model class (in the sense of Model_cfg).
Method predict()
Predicts the PCATE surface over a particular covariate, returning a tibble with the predicted HTE for every Monte-Carlo sample.
Usage
FX.Predictor$predict(data, covariate)
Arguments
data
The full dataset
covariate
The unquoted covariate name for which to calculate predicted treatment effects.
Returns
A tibble with columns:
- covariate_value - The value of the covariate of interest.
- .hte - An estimated HTE.
- .id - The identifier for the original row (which had covariate modified to covariate_value).
Method clone()
The objects of this class are cloneable with this method.
Usage
FX.Predictor$clone(deep = FALSE)
Arguments
deep
Whether to make a deep clone.
R6 class to represent partitions of the data between training and held-out
Description
R6 class to represent partitions of the data between training and held-out
Details
This takes a set of folds calculated elsewhere and represents these folds in a consistent format.
Public fields
train
A dataframe containing only the training set
holdout
A dataframe containing only the held-out data
in_holdout
A logical vector indicating if the initial data lies in the holdout set.
Methods
Public methods
Method new()
Creates an R6 object of the data split between training and test set.
Usage
HTEFold$new(data, split_id)
Arguments
data
The dataset to be split
split_id
An identifier indicating which data should lie in the holdout set.
Returns
Returns an object of class HTEFold
Method clone()
The objects of this class are cloneable with this method.
Usage
HTEFold$clone(deep = FALSE)
Arguments
deep
Whether to make a deep clone.
Configuration of Quantities of Interest
Description
HTE_cfg
is a configuration class that pulls everything together, indicating
the full configuration for a given HTE analysis. This includes how to estimate
models and what Quantities of Interest to calculate based off those underlying models.
Public fields
outcome
Model_cfg object indicating how outcome models should be estimated.
treatment
Model_cfg object indicating how the propensity score model should be estimated.
effect
Model_cfg object indicating how the joint effect model should be estimated.
qoi
QoI_cfg object indicating what the Quantities of Interest are and providing all necessary detail on how they should be estimated.
verbose
Logical indicating whether to print debugging information.
Methods
Public methods
Method new()
Create a new HTE_cfg
object with all necessary information about how
to carry out an HTE analysis.
Usage
HTE_cfg$new( outcome = NULL, treatment = NULL, effect = NULL, qoi = NULL, verbose = FALSE )
Arguments
outcome
Model_cfg object indicating how outcome models should be estimated.
treatment
Model_cfg object indicating how the propensity score model should be estimated.
effect
Model_cfg object indicating how the joint effect model should be estimated.
qoi
QoI_cfg object indicating what the Quantities of Interest are and providing all necessary detail on how they should be estimated.
verbose
Logical indicating whether to print debugging information.
Examples
mcate_cfg <- MCATE_cfg$new(cfgs = list(x1 = KernelSmooth_cfg$new(neval = 100)))
pcate_cfg <- PCATE_cfg$new(
cfgs = list(x1 = KernelSmooth_cfg$new(neval = 100)),
model_covariates = c("x1", "x2", "x3"),
num_mc_samples = list(x1 = 100)
)
vimp_cfg <- VIMP_cfg$new()
diag_cfg <- Diagnostics_cfg$new(
outcome = c("SL_risk", "SL_coefs", "MSE"),
ps = c("SL_risk", "SL_coefs", "AUC")
)
qoi_cfg <- QoI_cfg$new(
mcate = mcate_cfg,
pcate = pcate_cfg,
vimp = vimp_cfg,
diag = diag_cfg
)
ps_cfg <- SLEnsemble_cfg$new(
learner_cfgs = list(SLLearner_cfg$new("SL.glm"), SLLearner_cfg$new("SL.gam"))
)
y_cfg <- SLEnsemble_cfg$new(
learner_cfgs = list(SLLearner_cfg$new("SL.glm"), SLLearner_cfg$new("SL.gam"))
)
fx_cfg <- SLEnsemble_cfg$new(
learner_cfgs = list(SLLearner_cfg$new("SL.glm"), SLLearner_cfg$new("SL.gam"))
)
HTE_cfg$new(outcome = y_cfg, treatment = ps_cfg, effect = fx_cfg, qoi = qoi_cfg)
Method clone()
The objects of this class are cloneable with this method.
Usage
HTE_cfg$clone(deep = FALSE)
Arguments
deep
Whether to make a deep clone.
Examples
## ------------------------------------------------
## Method `HTE_cfg$new`
## ------------------------------------------------
mcate_cfg <- MCATE_cfg$new(cfgs = list(x1 = KernelSmooth_cfg$new(neval = 100)))
pcate_cfg <- PCATE_cfg$new(
cfgs = list(x1 = KernelSmooth_cfg$new(neval = 100)),
model_covariates = c("x1", "x2", "x3"),
num_mc_samples = list(x1 = 100)
)
vimp_cfg <- VIMP_cfg$new()
diag_cfg <- Diagnostics_cfg$new(
outcome = c("SL_risk", "SL_coefs", "MSE"),
ps = c("SL_risk", "SL_coefs", "AUC")
)
qoi_cfg <- QoI_cfg$new(
mcate = mcate_cfg,
pcate = pcate_cfg,
vimp = vimp_cfg,
diag = diag_cfg
)
ps_cfg <- SLEnsemble_cfg$new(
learner_cfgs = list(SLLearner_cfg$new("SL.glm"), SLLearner_cfg$new("SL.gam"))
)
y_cfg <- SLEnsemble_cfg$new(
learner_cfgs = list(SLLearner_cfg$new("SL.glm"), SLLearner_cfg$new("SL.gam"))
)
fx_cfg <- SLEnsemble_cfg$new(
learner_cfgs = list(SLLearner_cfg$new("SL.glm"), SLLearner_cfg$new("SL.gam"))
)
HTE_cfg$new(outcome = y_cfg, treatment = ps_cfg, effect = fx_cfg, qoi = qoi_cfg)
Configuration for a Kernel Smoother
Description
KernelSmooth_cfg
is a configuration class for non-parametric local-linear
regression to construct a smooth representation of the relationship between
two variables. This is typically used for displaying a surface of the conditional
average treatment effect over a continuous covariate.
Kernel smoothing is handled by the nprobust
package.
Super class
tidyhte::Model_cfg -> KernelSmooth_cfg
Public fields
model_class
The class of the model, required for all classes which inherit from Model_cfg.
neval
The number of points at which to evaluate the local regression. More points will provide a smoother line at the cost of somewhat higher computation.
eval_min_quantile
Minimum quantile at which to evaluate the smoother.
Methods
Public methods
Method new()
Create a new KernelSmooth_cfg
object with specified number of evaluation points.
Usage
KernelSmooth_cfg$new(neval = 100, eval_min_quantile = 0.05)
Arguments
neval
The number of points at which to evaluate the local regression. More points will provide a smoother line at the cost of somewhat higher computation.
eval_min_quantile
Minimum quantile at which to evaluate the smoother. A value of zero will do no clipping. Clipping is performed from both the top and the bottom of the empirical distribution. A value of alpha would evaluate over [alpha, 1 - alpha].
Returns
A new KernelSmooth_cfg
object.
Examples
KernelSmooth_cfg$new(neval = 100)
Method clone()
The objects of this class are cloneable with this method.
Usage
KernelSmooth_cfg$clone(deep = FALSE)
Arguments
deep
Whether to make a deep clone.
Examples
## ------------------------------------------------
## Method `KernelSmooth_cfg$new`
## ------------------------------------------------
KernelSmooth_cfg$new(neval = 100)
Configuration of Known Model
Description
Known_cfg
is a configuration class for when a particular model is known
a-priori. The prototypical usage of this class is when heterogeneous
treatment effects are estimated in the context of a randomized control
trial with known propensity scores.
Super class
tidyhte::Model_cfg -> Known_cfg
Public fields
covariate_name
The name of the column in the dataset which corresponds to the known model score.
model_class
The class of the model, required for all classes which inherit from Model_cfg.
Methods
Public methods
Method new()
Create a new Known_cfg
object with specified covariate column.
Usage
Known_cfg$new(covariate_name)
Arguments
covariate_name
The name of the column, a string, in the dataset corresponding to the known model score (i.e. the true conditional expectation).
Returns
A new Known_cfg
object.
Examples
Known_cfg$new("propensity_score")
Method clone()
The objects of this class are cloneable with this method.
Usage
Known_cfg$clone(deep = FALSE)
Arguments
deep
Whether to make a deep clone.
Examples
## ------------------------------------------------
## Method `Known_cfg$new`
## ------------------------------------------------
Known_cfg$new("propensity_score")
Configuration of Marginal CATEs
Description
MCATE_cfg
is a configuration class for estimating marginal response
surfaces based on heterogeneous treatment effect estimates. "Marginal"
in this context implies that all other covariates are marginalized.
Thus, if two covariates are highly correlated, it is likely that their
MCATE surfaces will be extremely similar.
Public fields
cfgs
Named list of covariate names to a Model_cfg object defining how to present that covariate's CATE surface (while marginalizing over all other covariates).
std_errors
Boolean indicating whether the results should be returned with standard errors or not.
estimand
String indicating the estimand to target.
Methods
Public methods
Method new()
Create a new MCATE_cfg
object with specified model name and hyperparameters.
Usage
MCATE_cfg$new(cfgs, std_errors = TRUE)
Arguments
cfgs
Named list from moderator name to a Model_cfg object defining how to present that covariate's CATE surface (while marginalizing over all other covariates).
std_errors
Boolean indicating whether the results should be returned with standard errors or not.
Returns
A new MCATE_cfg
object.
Examples
MCATE_cfg$new(cfgs = list(x1 = KernelSmooth_cfg$new(neval = 100)))
Method add_moderator()
Add a moderator to the MCATE_cfg
object. This entails defining a configuration
for displaying the effect surface for that moderator.
Usage
MCATE_cfg$add_moderator(var_name, cfg)
Arguments
var_name
The name of the moderator to add (and the name of the column in the dataset).
cfg
A
Model_cfg
defining how to display the selected moderator's effect surface.
Returns
An updated MCATE_cfg
object.
Examples
cfg <- MCATE_cfg$new(cfgs = list(x1 = KernelSmooth_cfg$new(neval = 100)))
cfg <- cfg$add_moderator("x2", KernelSmooth_cfg$new(neval = 100))
Method clone()
The objects of this class are cloneable with this method.
Usage
MCATE_cfg$clone(deep = FALSE)
Arguments
deep
Whether to make a deep clone.
Examples
MCATE_cfg$new(cfgs = list(x1 = KernelSmooth_cfg$new(neval = 100)))
## ------------------------------------------------
## Method `MCATE_cfg$new`
## ------------------------------------------------
MCATE_cfg$new(cfgs = list(x1 = KernelSmooth_cfg$new(neval = 100)))
## ------------------------------------------------
## Method `MCATE_cfg$add_moderator`
## ------------------------------------------------
cfg <- MCATE_cfg$new(cfgs = list(x1 = KernelSmooth_cfg$new(neval = 100)))
cfg <- cfg$add_moderator("x2", KernelSmooth_cfg$new(neval = 100))
Base Class of Model Configurations
Description
Model_cfg
is the base class from which all other model configurations
inherit.
Public fields
model_class
The class of the model, required for all classes which inherit from Model_cfg.
Methods
Public methods
Method new()
Create a new Model_cfg
object with any necessary parameters.
Usage
Model_cfg$new()
Returns
A new Model_cfg
object.
Method clone()
The objects of this class are cloneable with this method.
Usage
Model_cfg$clone(deep = FALSE)
Arguments
deep
Whether to make a deep clone.
R6 class to represent data to be used in estimating a model
Description
R6 class to represent data to be used in estimating a model
Details
This class provides consistent names and interfaces to data which will be used in a supervised regression / classification model.
Public fields
label
The labels for the eventual model as a vector.
features
The matrix representation of the data to be used for model fitting. Constructed using stats::model.matrix.
model_frame
The data-frame representation of the data as constructed by stats::model.frame.
split_id
The split identifiers as a vector.
num_splits
The integer number of splits in the data.
cluster
A cluster ID as a vector, constructed using the unit identifiers.
weights
The case-weights as a vector.
Methods
Public methods
Method new()
Creates an R6 object to represent data to be used in a prediction model.
Usage
Model_data$new(data, label_col, ..., .weight_col = NULL)
Arguments
data
The full dataset to populate the class with.
label_col
The unquoted name of the column to use as the label in supervised learning models.
...
The unquoted names of features to use in the model.
.weight_col
The unquoted name of the column to use as case-weights in subsequent models.
Returns
A Model_data
object.
Examples
library("dplyr") df <- dplyr::tibble( uid = 1:100, x1 = rnorm(100), x2 = rnorm(100), x3 = sample(4, 100, replace = TRUE) ) %>% dplyr::mutate( y = x1 + x2 + x3 + rnorm(100), x3 = factor(x3) ) df <- make_splits(df, uid, .num_splits = 5) data <- Model_data$new(df, y, x1, x2, x3)
Method SL_cv_control()
A helper function to create the cross-validation options to be used by SuperLearner.
Usage
Model_data$SL_cv_control()
Method clone()
The objects of this class are cloneable with this method.
Usage
Model_data$clone(deep = FALSE)
Arguments
deep
Whether to make a deep clone.
See Also
SuperLearner::SuperLearner.CV.control
Examples
## ------------------------------------------------
## Method `Model_data$new`
## ------------------------------------------------
library("dplyr")
df <- dplyr::tibble(
uid = 1:100,
x1 = rnorm(100),
x2 = rnorm(100),
x3 = sample(4, 100, replace = TRUE)
) %>% dplyr::mutate(
y = x1 + x2 + x3 + rnorm(100),
x3 = factor(x3)
)
df <- make_splits(df, uid, .num_splits = 5)
data <- Model_data$new(df, y, x1, x2, x3)
Configuration of Partial CATEs
Description
PCATE_cfg
is a configuration class for estimating marginal
response surfaces based on heterogeneous treatment effect estimates.
"Partial" in this context is used similarly to the use in partial
dependence plots or in partial regression. In essence, a PCATE
attempts to partial out the contribution to the CATE from all other
covariates. Two highly correlated variables may have very different
PCATE surfaces.
Public fields
cfgs
Named list of covariate names to a Model_cfg object defining how to present that covariate's CATE surface.
model_covariates
A character vector of all the covariates to be included in the second-level effect regression.
num_mc_samples
A named list from covariate name to the number of Monte Carlo samples to take to calculate the double integral (See Details).
estimand
String indicating the estimand to target.
Methods
Public methods
Method new()
Create a new PCATE_cfg
object with specified model name and hyperparameters.
Usage
PCATE_cfg$new(model_covariates, cfgs, num_mc_samples = 100)
Arguments
model_covariates
A character vector of all the covariates to be included in the second-level effect regression.
cfgs
Named list from moderator name to a Model_cfg object defining how to present that covariate's CATE surface.
num_mc_samples
A named list from covariate name to the number of Monte Carlo samples to take to calculate the double integral (See Details). If all covariates should use the same number of samples, simply pass the (integer) number of samples.
effect_cfg
A Model_cfg object indicating how to fit the second level effect regression (joint across all selected covariates).
Returns
A new PCATE_cfg
object.
Examples
PCATE_cfg$new(
cfgs = list(x1 = KernelSmooth_cfg$new(neval = 100)),
model_covariates = c("x1", "x2", "x3"),
num_mc_samples = list(x1 = 100)
)
Method add_moderator()
Add a moderator to the PCATE_cfg
object. This entails adding it to the joint
model of effects and defines a configuration for displaying the effect surface
for that moderator.
Usage
PCATE_cfg$add_moderator(var_name, cfg)
Arguments
var_name
The name of the moderator to add (and the name of the column in the dataset).
cfg
A
Model_cfg
defining how to display the selected moderator's effect surface.
Returns
An updated PCATE_cfg
object.
Examples
cfg <- PCATE_cfg$new(
cfgs = list(x1 = KernelSmooth_cfg$new(neval = 100)),
model_covariates = c("x1", "x2", "x3"),
num_mc_samples = list(x1 = 100)
)
cfg <- cfg$add_moderator("x2", KernelSmooth_cfg$new(neval = 100))
Method clone()
The objects of this class are cloneable with this method.
Usage
PCATE_cfg$clone(deep = FALSE)
Arguments
deep
Whether to make a deep clone.
Examples
PCATE_cfg$new(
cfgs = list(x1 = KernelSmooth_cfg$new(neval = 100)),
model_covariates = c("x1", "x2", "x3"),
num_mc_samples = list(x1 = 100)
)
## ------------------------------------------------
## Method `PCATE_cfg$new`
## ------------------------------------------------
PCATE_cfg$new(
cfgs = list(x1 = KernelSmooth_cfg$new(neval = 100)),
model_covariates = c("x1", "x2", "x3"),
num_mc_samples = list(x1 = 100)
)
## ------------------------------------------------
## Method `PCATE_cfg$add_moderator`
## ------------------------------------------------
cfg <- PCATE_cfg$new(
cfgs = list(x1 = KernelSmooth_cfg$new(neval = 100)),
model_covariates = c("x1", "x2", "x3"),
num_mc_samples = list(x1 = 100)
)
cfg <- cfg$add_moderator("x2", KernelSmooth_cfg$new(neval = 100))
Configuration of Quantities of Interest
Description
QoI_cfg
is a configuration class for the Quantities of Interest to be
generated by the HTE analysis.
Public fields
mcate
A configuration object of type MCATE_cfg of marginal effects to calculate.
pcate
A configuration object of type PCATE_cfg of partial effects to calculate.
vimp
A configuration object of type VIMP_cfg of variable importance to calculate.
diag
A configuration object of type Diagnostics_cfg of model diagnostics to calculate.
ate
Logical flag indicating whether an estimate of the ATE should be returned.
predictions
Logical flag indicating whether estimates of the CATE for every unit should be returned.
Methods
Public methods
Method new()
Create a new QoI_cfg
object with specified Quantities of Interest
to estimate.
Usage
QoI_cfg$new( mcate = NULL, pcate = NULL, vimp = NULL, diag = NULL, ate = TRUE, predictions = FALSE )
Arguments
mcate
A configuration object of type MCATE_cfg of marginal effects to calculate.
pcate
A configuration object of type PCATE_cfg of partial effects to calculate.
vimp
A configuration object of type VIMP_cfg of variable importance to calculate.
diag
A configuration object of type Diagnostics_cfg of model diagnostics to calculate.
ate
A logical flag for whether to calculate the Average Treatment Effect (ATE) or not.
predictions
A logical flag for whether to return predictions of the CATE for every unit or not.
Returns
A new QoI_cfg object.
Examples
mcate_cfg <- MCATE_cfg$new(cfgs = list(x1 = KernelSmooth_cfg$new(neval = 100)))
pcate_cfg <- PCATE_cfg$new(
cfgs = list(x1 = KernelSmooth_cfg$new(neval = 100)),
model_covariates = c("x1", "x2", "x3"),
num_mc_samples = list(x1 = 100)
)
vimp_cfg <- VIMP_cfg$new()
diag_cfg <- Diagnostics_cfg$new(
outcome = c("SL_risk", "SL_coefs", "MSE"),
ps = c("SL_risk", "SL_coefs", "AUC")
)
QoI_cfg$new(
mcate = mcate_cfg,
pcate = pcate_cfg,
vimp = vimp_cfg,
diag = diag_cfg
)
Method clone()
The objects of this class are cloneable with this method.
Usage
QoI_cfg$clone(deep = FALSE)
Arguments
deep
Whether to make a deep clone.
Examples
mcate_cfg <- MCATE_cfg$new(cfgs = list(x1 = KernelSmooth_cfg$new(neval = 100)))
pcate_cfg <- PCATE_cfg$new(
cfgs = list(x1 = KernelSmooth_cfg$new(neval = 100)),
model_covariates = c("x1", "x2", "x3"),
num_mc_samples = list(x1 = 100)
)
vimp_cfg <- VIMP_cfg$new()
diag_cfg <- Diagnostics_cfg$new(
outcome = c("SL_risk", "SL_coefs", "MSE"),
ps = c("SL_risk", "SL_coefs", "AUC")
)
QoI_cfg$new(
mcate = mcate_cfg,
pcate = pcate_cfg,
vimp = vimp_cfg,
diag = diag_cfg
)
## ------------------------------------------------
## Method `QoI_cfg$new`
## ------------------------------------------------
mcate_cfg <- MCATE_cfg$new(cfgs = list(x1 = KernelSmooth_cfg$new(neval = 100)))
pcate_cfg <- PCATE_cfg$new(
cfgs = list(x1 = KernelSmooth_cfg$new(neval = 100)),
model_covariates = c("x1", "x2", "x3"),
num_mc_samples = list(x1 = 100)
)
vimp_cfg <- VIMP_cfg$new()
diag_cfg <- Diagnostics_cfg$new(
outcome = c("SL_risk", "SL_coefs", "MSE"),
ps = c("SL_risk", "SL_coefs", "AUC")
)
QoI_cfg$new(
mcate = mcate_cfg,
pcate = pcate_cfg,
vimp = vimp_cfg,
diag = diag_cfg
)
Elastic net regression with pairwise interactions
Description
Penalized regression using elastic net. Alpha = 0 corresponds to ridge regression and alpha = 1 corresponds to Lasso. Included in the model are pairwise interactions between covariates.
See vignette("glmnet_beta", package = "glmnet")
for a nice tutorial on
glmnet.
Usage
SL.glmnet.interaction(
Y,
X,
newX,
family,
obsWeights,
id,
alpha = 1,
nfolds = 10,
nlambda = 100,
useMin = TRUE,
loss = "deviance",
...
)
Arguments
Y
Outcome variable
X
Covariate dataframe
newX
Dataframe to predict the outcome
family
"gaussian" for regression, "binomial" for binary classification. Untested options: "multinomial" for multiple classification or "mgaussian" for multiple response, "poisson" for non-negative outcome with proportional mean and variance, "cox".
obsWeights
Optional observation-level weights
id
Optional id to group observations from the same unit (not used currently).
alpha
Elastic net mixing parameter, range [0, 1]. 0 = ridge regression and 1 = lasso.
nfolds
Number of folds for internal cross-validation to optimize lambda.
nlambda
Number of lambda values to check, recommended to be 100 or more.
useMin
If TRUE use lambda that minimizes risk, otherwise use 1 standard-error rule which chooses a higher penalty with performance within one standard error of the minimum (see Breiman et al. 1984 on CART for background).
loss
Loss function, can be "deviance", "mse", or "mae". If family = binomial can also be "auc" or "class" (misclassification error).
...
Any additional arguments are passed through to cv.glmnet.
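Below is a small sketch of how this wrapper can be used directly with SuperLearner on simulated data, assuming tidyhte (which provides the wrapper) and glmnet are installed; within a tidyhte analysis it is more typically added via add_outcome_model("SL.glmnet.interaction", ...) or add_effect_model("SL.glmnet.interaction", ...).
library("SuperLearner")
library("tidyhte")
set.seed(1)
n <- 200
X <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
Y <- X$x1 * X$x2 + rnorm(n)
# Ensemble of a mean-only benchmark and the pairwise-interaction elastic net.
fit <- SuperLearner(
  Y = Y, X = X,
  family = gaussian(),
  SL.library = c("SL.mean", "SL.glmnet.interaction")
)
fit$coef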
Configuration for a SuperLearner Ensemble
Description
SLEnsemble_cfg is a configuration class for estimating a model using an ensemble of learners via SuperLearner.
Super class
tidyhte::Model_cfg -> SLEnsemble_cfg
Public fields
cvControl
A list of parameters for controlling the cross-validation used in SuperLearner.
SL.library
A vector of the names of learners to include in the SuperLearner ensemble.
SL.env
An environment containing all of the programmatically generated learners to be included in the SuperLearner ensemble.
family
stats::family object to determine how SuperLearner should be fitted.
model_class
The class of the model, required for all classes which inherit from Model_cfg.
Methods
Public methods
Method new()
Create a new SLEnsemble_cfg
object with specified settings.
Usage
SLEnsemble_cfg$new( cvControl = NULL, learner_cfgs = NULL, family = stats::gaussian() )
Arguments
cvControl
A list of parameters for controlling the cross-validation used in SuperLearner. For more details, see SuperLearner::SuperLearner.CV.control.
learner_cfgs
A list of SLLearner_cfg objects.
family
stats::family object to determine how SuperLearner should be fitted.
Returns
A new SLEnsemble_cfg
object.
Examples
SLEnsemble_cfg$new(
learner_cfgs = list(SLLearner_cfg$new("SL.glm"), SLLearner_cfg$new("SL.gam"))
)
Method add_sublearner()
Adds a model (or class of models) to the SuperLearner ensemble. If hyperparameter values are specified, this method will add a learner for every element in the cross-product of provided hyperparameter values.
Usage
SLEnsemble_cfg$add_sublearner(learner_name, hps = NULL)
Arguments
learner_name
Possible values use SuperLearner naming conventions. A full list is available with SuperLearner::listWrappers("SL").
hps
A named list of hyper-parameters. Every element of the cross-product of these hyper-parameters will be included in the ensemble.
Examples
cfg <- SLEnsemble_cfg$new(
learner_cfgs = list(SLLearner_cfg$new("SL.glm"))
)
cfg <- cfg$add_sublearner("SL.gam", list(deg.gam = c(2, 3)))
Method clone()
The objects of this class are cloneable with this method.
Usage
SLEnsemble_cfg$clone(deep = FALSE)
Arguments
deep
Whether to make a deep clone.
Examples
SLEnsemble_cfg$new(
learner_cfgs = list(SLLearner_cfg$new("SL.glm"), SLLearner_cfg$new("SL.gam"))
)
## ------------------------------------------------
## Method `SLEnsemble_cfg$new`
## ------------------------------------------------
SLEnsemble_cfg$new(
learner_cfgs = list(SLLearner_cfg$new("SL.glm"), SLLearner_cfg$new("SL.gam"))
)
Configuration of SuperLearner Submodel
Description
SLLearner_cfg
is a configuration class for a single
sublearner to be included in SuperLearner. By constructing with a named list
of hyperparameters, this configuration allows distinct submodels
for each unique combination of hyperparameters. To understand what models
and hyperparameters are available, examine the methods listed in
SuperLearner::listWrappers("SL")
.
Public fields
model_name
The name of the model as passed to SuperLearner through the SL.library parameter.
hyperparameters
Named list from hyperparameter name to a vector of values that should be swept over.
Methods
Public methods
Method new()
Create a new SLLearner_cfg
object with specified model name and hyperparameters.
Usage
SLLearner_cfg$new(model_name, hp = NULL)
Arguments
model_name
The name of the model as passed to SuperLearner through the SL.library parameter.
hp
Named list from hyperparameter name to a vector of values that should be swept over. Hyperparameters not included in this list are left at their SuperLearner default values.
Returns
A new SLLearner_cfg
object.
Examples
SLLearner_cfg$new("SL.glm") SLLearner_cfg$new("SL.gam", list(deg.gam = c(2, 3)))
Method clone()
The objects of this class are cloneable with this method.
Usage
SLLearner_cfg$clone(deep = FALSE)
Arguments
deep
Whether to make a deep clone.
Examples
## ------------------------------------------------
## Method `SLLearner_cfg$new`
## ------------------------------------------------
SLLearner_cfg$new("SL.glm")
SLLearner_cfg$new("SL.gam", list(deg.gam = c(2, 3)))
Configuration for a Stratification Estimator
Description
Stratified_cfg
is a configuration class for stratifying a covariate
and calculating statistics within each cell.
Super class
tidyhte::Model_cfg -> Stratified_cfg
Public fields
model_class
The class of the model, required for all classes which inherit from Model_cfg.
covariate
The name of the column in the dataset which corresponds to the covariate on which to stratify.
Methods
Public methods
Method new()
Create a new Stratified_cfg object with the specified covariate on which to stratify.
Usage
Stratified_cfg$new(covariate)
Arguments
covariate
The name of the column in the dataset which corresponds to the covariate on which to stratify.
Returns
A new Stratified_cfg
object.
Examples
Stratified_cfg$new(covariate = "test_covariate")
Method clone()
The objects of this class are cloneable with this method.
Usage
Stratified_cfg$clone(deep = FALSE)
Arguments
deep
Whether to make a deep clone.
Examples
## ------------------------------------------------
## Method `Stratified_cfg$new`
## ------------------------------------------------
Stratified_cfg$new(covariate = "test_covariate")
Configuration of Variable Importance
Description
VIMP_cfg is a configuration class for estimating a variable importance measure across all moderators. This provides a meaningful measure of which moderators explain the most variation in the CATE surface.
Public fields
estimand
String indicating the estimand to target.
sample_splitting
Logical indicating whether to use sample splitting in the calculation of variable importance.
linear
Logical indicating whether the variable importance assuming a linear model should be estimated.
Methods
Public methods
Method new()
Create a new VIMP_cfg
object with specified model configuration.
Usage
VIMP_cfg$new(sample_splitting = TRUE, linear_only = FALSE)
Arguments
sample_splitting
Logical indicating whether to use sample splitting in the calculation of variable importance. Choosing not to use sample splitting means that inference will only be valid for moderators with non-null importance.
linear_only
Logical indicating whether the variable importance should use only a single linear-only model. Variable importance measure will only be consistent for the population quantity if the true model of pseudo-outcomes is linear.
Returns
A new VIMP_cfg
object.
Examples
VIMP_cfg$new()
Method clone()
The objects of this class are cloneable with this method.
Usage
VIMP_cfg$clone(deep = FALSE)
Arguments
deep
Whether to make a deep clone.
References
Williamson, B. D., Gilbert, P. B., Carone, M., & Simon, N. (2021). Nonparametric variable importance assessment using machine learning techniques. Biometrics, 77(1), 9-22.
Williamson, B. D., Gilbert, P. B., Simon, N. R., & Carone, M. (2021). A general framework for inference on algorithm-agnostic variable importance. Journal of the American Statistical Association, 1-14.
Examples
VIMP_cfg$new()
## ------------------------------------------------
## Method `VIMP_cfg$new`
## ------------------------------------------------
VIMP_cfg$new()
Add an additional diagnostic to the effect model
Description
This adds a diagnostic to the effect model.
Usage
add_effect_diagnostic(hte_cfg, diag)
Arguments
hte_cfg
HTE_cfg object to update.
diag
Character indicating the name of the diagnostic to include. Possible values are "SL_risk", "SL_coefs", "MSE" and "RROC".
Value
Updated HTE_cfg
object
Examples
library("dplyr")
basic_config() %>%
add_effect_diagnostic("RROC") -> hte_cfg
Add an additional model to the joint effect ensemble
Description
This adds a learner to the ensemble used for estimating a model of the conditional expectation of the pseudo-outcome.
Usage
add_effect_model(hte_cfg, model_name, ...)
Arguments
hte_cfg
HTE_cfg object to update.
model_name
Character indicating the name of the model to incorporate into the joint effect ensemble. Possible values use SuperLearner naming conventions; a full list is available with SuperLearner::listWrappers("SL").
...
Parameters over which to grid-search for this model class.
Value
Updated HTE_cfg
object
Examples
library("dplyr")
basic_config() %>%
add_effect_model("SL.glm.interaction") -> hte_cfg
Uses a known propensity score
Description
This replaces the propensity score model with a known value of the propensity score.
Usage
add_known_propensity_score(hte_cfg, covariate_name)
Arguments
hte_cfg
HTE_cfg object to update.
covariate_name
Character indicating the name of the column in the dataframe corresponding to the known propensity score.
Value
Updated HTE_cfg
object
Examples
library("dplyr")
basic_config() %>%
add_known_propensity_score("ps") -> hte_cfg
Adds moderators to the configuration
Description
This adds a definition of how to display a moderator to the MCATE config. A moderator is any variable with respect to which you want to view information about CATEs.
Usage
add_moderator(hte_cfg, model_type, ..., .model_arguments = NULL)
Arguments
hte_cfg
HTE_cfg object to update.
model_type
Character indicating the model type for these moderators. Currently two model types are supported: "Stratified" and "KernelSmooth".
...
The (unquoted) names of the moderator variables.
.model_arguments
A named list from argument name to value to pass into the constructor for the model. See KernelSmooth_cfg and Stratified_cfg for the available arguments.
Value
Updated HTE_cfg
object
Examples
library("dplyr")
basic_config() %>%
add_moderator("Stratified", x2, x3) %>%
add_moderator("KernelSmooth", x1, x4, x5) -> hte_cfg
Add an additional diagnostic to the outcome model
Description
This adds a diagnostic to the outcome model.
Usage
add_outcome_diagnostic(hte_cfg, diag)
Arguments
hte_cfg
HTE_cfg object to update.
diag
Character indicating the name of the diagnostic to include. Possible values are "SL_risk", "SL_coefs", "MSE" and "RROC".
Value
Updated HTE_cfg
object
Examples
library("dplyr")
basic_config() %>%
add_outcome_diagnostic("RROC") -> hte_cfg
Add an additional model to the outcome ensemble
Description
This adds a learner to the ensemble used for estimating a model of the conditional expectation of the outcome.
Usage
add_outcome_model(hte_cfg, model_name, ...)
Arguments
hte_cfg
HTE_cfg object to update.
model_name
Character indicating the name of the model to incorporate into the outcome ensemble. Possible values use SuperLearner naming conventions; a full list is available with SuperLearner::listWrappers("SL").
...
Parameters over which to grid-search for this model class.
Value
Updated HTE_cfg
object
Examples
library("dplyr")
basic_config() %>%
add_outcome_model("SL.glm.interaction") -> hte_cfg
Add an additional diagnostic to the propensity score
Description
This adds a diagnostic to the propensity score.
Usage
add_propensity_diagnostic(hte_cfg, diag)
Arguments
hte_cfg
HTE_cfg object to update.
diag
Character indicating the name of the diagnostic to include. Possible values are "SL_risk", "SL_coefs", "AUC" and "MSE".
Value
Updated HTE_cfg
object
Examples
library("dplyr")
basic_config() %>%
add_propensity_diagnostic(c("AUC", "MSE")) -> hte_cfg
Add an additional model to the propensity score ensemble
Description
This adds a learner to the ensemble used for estimating propensity scores.
Usage
add_propensity_score_model(hte_cfg, model_name, ...)
Arguments
hte_cfg
HTE_cfg object to update.
model_name
Character indicating the name of the model to incorporate into the propensity score ensemble. Possible values use SuperLearner naming conventions; a full list is available with SuperLearner::listWrappers("SL").
...
Parameters over which to grid-search for this model class.
Value
Updated HTE_cfg
object
Examples
library("dplyr")
basic_config() %>%
add_propensity_score_model("SL.glmnet", alpha = c(0, 0.5, 1)) -> hte_cfg
Adds variable importance information
Description
This adds a variable importance quantity of interest to the outputs.
Usage
add_vimp(hte_cfg, sample_splitting = TRUE, linear_only = FALSE)
Arguments
hte_cfg
HTE_cfg object to update.
sample_splitting
Logical indicating whether to use sample splitting or not. Choosing not to use sample splitting means that inference will only be valid for moderators with non-null importance.
linear_only
Logical indicating whether the variable importance should use only a single linear-only model. Variable importance measure will only be consistent for the population quantity if the true model of pseudo-outcomes is linear.
Value
Updated HTE_cfg
object
References
Williamson, B. D., Gilbert, P. B., Carone, M., & Simon, N. (2021). Nonparametric variable importance assessment using machine learning techniques. Biometrics, 77(1), 9-22.
Williamson, B. D., Gilbert, P. B., Simon, N. R., & Carone, M. (2021). A general framework for inference on algorithm-agnostic variable importance. Journal of the American Statistical Association, 1-14.
Examples
library("dplyr")
basic_config() %>%
add_vimp(sample_splitting = FALSE) -> hte_cfg
Attach an HTE_cfg to a dataframe
Description
This adds a configuration attribute to a dataframe for HTE estimation. This configuration details the full analysis of HTE that should be performed.
Usage
attach_config(data, .HTE_cfg)
Arguments
data
dataframe
.HTE_cfg
HTE_cfg object representing the full configuration of the HTE analysis to be performed.
Details
For information about how to set up an HTE_cfg object, see the Recipe API documentation, starting with basic_config().
To see an example analysis, read vignette("experimental_analysis")
in the context
of an experiment, vignette("experimental_analysis")
for an observational study, or
vignette("methodological_details")
for a deeper dive under the hood.
See Also
basic_config(), make_splits(), produce_plugin_estimates(), construct_pseudo_outcomes(), estimate_QoI()
Examples
library("dplyr")
if(require("palmerpenguins")) {
data(package = 'palmerpenguins')
penguins$unitid = seq_len(nrow(penguins))
penguins$propensity = rep(0.5, nrow(penguins))
penguins$treatment = rbinom(nrow(penguins), 1, penguins$propensity)
cfg <- basic_config() %>%
add_known_propensity_score("propensity") %>%
add_outcome_model("SL.glm.interaction") %>%
remove_vimp()
attach_config(penguins, cfg) %>%
make_splits(unitid, .num_splits = 4) %>%
produce_plugin_estimates(outcome = body_mass_g, treatment = treatment, species, sex) %>%
construct_pseudo_outcomes(body_mass_g, treatment) %>%
estimate_QoI(species, sex)
}
Create a basic config for HTE estimation
Description
This provides a basic recipe for HTE estimation that can be extended by providing additional information about models to be estimated and what quantities of interest should be returned based on those models. This basic model includes only linear models for nuisance function estimation, and basic diagnostics.
Usage
basic_config()
Details
Additional models, diagnostics and quantities of interest should be added using their respective helper functions provided as part of the Recipe API.
To see an example analysis, read vignette("experimental_analysis")
in the context
of an experiment, vignette("experimental_analysis")
for an observational study, or
vignette("methodological_details")
for a deeper dive under the hood.
Value
HTE_cfg
object
See Also
add_propensity_score_model(), add_known_propensity_score(), add_propensity_diagnostic(), add_outcome_model(), add_outcome_diagnostic(), add_effect_model(), add_effect_diagnostic(), add_moderator(), add_vimp()
Examples
library("dplyr")
basic_config() %>%
add_known_propensity_score("ps") %>%
add_outcome_model("SL.glm.interaction") %>%
add_outcome_model("SL.glmnet", alpha = c(0.05, 0.15, 0.2, 0.25, 0.5, 0.75)) %>%
add_outcome_model("SL.glmnet.interaction", alpha = c(0.05, 0.15, 0.2, 0.25, 0.5, 0.75)) %>%
add_outcome_diagnostic("RROC") %>%
add_effect_model("SL.glm.interaction") %>%
add_effect_model("SL.glmnet", alpha = c(0.05, 0.15, 0.2, 0.25, 0.5, 0.75)) %>%
add_effect_model("SL.glmnet.interaction", alpha = c(0.05, 0.15, 0.2, 0.25, 0.5, 0.75)) %>%
add_effect_diagnostic("RROC") %>%
add_moderator("Stratified", x2, x3) %>%
add_moderator("KernelSmooth", x1, x4, x5) %>%
add_vimp(sample_splitting = FALSE) -> hte_cfg
Calculates a SATE and a PATE using AIPW
Description
This function takes fully prepared data (with all auxiliary columns from the necessary models) and estimates average treatment effects using AIPW.
Usage
calculate_ate(data)
Arguments
data
The dataset of interest after it has been prepared fully.
References
Kennedy, E. H. (2020). Towards optimal doubly robust estimation of heterogeneous causal effects. arXiv preprint arXiv:2004.14497.
Tsiatis, A. A., Davidian, M., Zhang, M., & Lu, X. (2008). Covariate adjustment for two‐sample treatment comparisons in randomized clinical trials: a principled yet flexible approach. Statistics in medicine, 27(23), 4658-4677.
See Also
basic_config(), attach_config(), make_splits(), produce_plugin_estimates(), construct_pseudo_outcomes(), estimate_QoI()
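As a rough sketch of where this sits in the workflow, and assuming a dataframe df that has already been carried through attach_config(), make_splits(), produce_plugin_estimates() and construct_pseudo_outcomes(), the ATE estimates could be computed directly (estimate_QoI() typically produces them for you when the ATE is requested):
# Sketch only: df is assumed to be fully prepared, with nuisance predictions
# and pseudo-outcomes already attached by the earlier pipeline steps.
ates <- calculate_ate(df)
ates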
Calculate diagnostics
Description
This function calculates the diagnostics requested by the Diagnostics_cfg
object.
Usage
calculate_diagnostics(data, treatment, outcome, .diag.cfg)
Arguments
data
Data frame with all additional columns (such as model predictions) included.
treatment
Unquoted treatment variable name
outcome
Unquoted outcome variable name
.diag.cfg
Diagnostics_cfg object specifying which diagnostics to calculate.
Value
Returns a tibble with columns:
- estimand - Character indicating the diagnostic that was calculated.
- level - Indicates the scope of this diagnostic (e.g. does it apply only to the model of the outcome under treatment).
- term - Indicates a more granular descriptor of what the value is for, such as the specific model within the SuperLearner ensemble.
- estimate - Point estimate of the diagnostic.
- std_error - Standard error of the diagnostic.
Calculate Linear Variable Importance of HTEs
Description
calculate_linear_vimp estimates the linear hypothesis test of removing a particular moderator from a linear model containing all moderators. Unlike calculate_vimp, this will only be unbiased and have correct asymptotic coverage rates if the true model is linear. This linear approach is also substantially faster, so may be useful when prototyping an analysis.
Usage
calculate_linear_vimp(
full_data,
weight_col,
pseudo_outcome,
...,
.VIMP_cfg,
.Model_cfg
)
Arguments
full_data
dataframe
weight_col
Unquoted name of the weight column.
pseudo_outcome
Unquoted name of the pseudo-outcome.
...
Unquoted names of covariates to include in the joint effect model. The variable importance will be calculated for each of these covariates.
.VIMP_cfg
A VIMP_cfg object.
.Model_cfg
A Model_cfg object.
References
Williamson, B. D., Gilbert, P. B., Carone, M., & Simon, N. (2021). Nonparametric variable importance assessment using machine learning techniques. Biometrics, 77(1), 9-22.
Williamson, B. D., Gilbert, P. B., Simon, N. R., & Carone, M. (2021). A general framework for inference on algorithm-agnostic variable importance. Journal of the American Statistical Association, 1-14.
Calculate "partial" CATE estimates
Description
Usage
calculate_pcate_quantities(
full_data,
.weights,
.outcome,
fx_model,
...,
.MCATE_cfg
)
Regression ROC Curve calculation
Description
This function calculates the regression ROC (RROC) curve of Hernández-Orallo <doi:10.1016/j.patcog.2013.06.014>. It provides estimates for the positive and negative errors when predictions are shifted by a variety of constants (which range across the domain of observed residuals). Curves closer to the axes are, in general, to be preferred. This curve provides a simple way to visualize the error properties of a regression model.
Usage
calculate_rroc(label, prediction, nbins = 100)
Arguments
label
True label
prediction
Model prediction of the label (out of sample)
nbins
Number of shift values to sweep over
Details
The dot shows the errors when no shift is applied, corresponding to the base model predictions.
Value
A tibble with nbins
rows.
References
Hernández-Orallo, J. (2013). ROC curves for regression. Pattern Recognition, 46(12), 3395-3411.
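A small illustrative sketch, assuming calculate_rroc() is available to call directly and using simulated stand-ins for out-of-sample predictions:
set.seed(1)
y <- rnorm(500)
y_hat <- 0.8 * y + rnorm(500, sd = 0.5)  # stand-in for out-of-sample predictions
rroc <- calculate_rroc(y, y_hat, nbins = 50)
head(rroc)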
Calculate Variable Importance of HTEs
Description
calculate_vimp estimates the reduction in (population) R^2 from removing a particular moderator from a model containing all moderators.
Usage
calculate_vimp(
full_data,
weight_col,
pseudo_outcome,
...,
.VIMP_cfg,
.Model_cfg
)
Arguments
full_data
dataframe
weight_col
Unquoted name of the weight column.
pseudo_outcome
Unquoted name of the pseudo-outcome.
...
Unquoted names of covariates to include in the joint effect model. The variable importance will be calculated for each of these covariates.
.VIMP_cfg
A VIMP_cfg object.
.Model_cfg
A Model_cfg object.
References
Williamson, B. D., Gilbert, P. B., Carone, M., & Simon, N. (2021). Nonparametric variable importance assessment using machine learning techniques. Biometrics, 77(1), 9-22.
Williamson, B. D., Gilbert, P. B., Simon, N. R., & Carone, M. (2021). A general framework for inference on algorithm-agnostic variable importance. Journal of the American Statistical Association, 1-14.
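A hypothetical sketch of a direct call, assuming the function is available to call directly and a prepared dataframe df with pseudo-outcome column psi, weight column w and moderators x1, x2 and x3 (all assumed names); in a standard analysis estimate_QoI() performs this step when VIMP is requested:
vimp <- calculate_vimp(
  df, w, psi, x1, x2, x3,
  .VIMP_cfg = VIMP_cfg$new(),
  .Model_cfg = SLEnsemble_cfg$new(
    learner_cfgs = list(SLLearner_cfg$new("SL.glm"))
  )
)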
Checks that a dataframe has an attached configuration for HTEs
Description
This helper function ensures that the provided dataframe has the necessary auxiliary configuration information for HTE estimation.
Usage
check_data_has_hte_cfg(data)
Arguments
data
Dataframe of interest.
Value
Returns NULL. Errors if a problem is discovered.
Checks that an appropriate identifier has been provided
Description
This helper function makes a few simple checks to identify obvious issues with the provided column of unit identifiers.
Usage
check_identifier(data, id_col)
Arguments
data
Dataframe of interest.
id_col
Quoted name of identifier column.
Value
Returns NULL. Errors if a problem is discovered.
Checks that nuisance models have been estimated and exist in the supplied dataset.
Description
This helper function makes a few simple checks to identify obvious issues with the way that nuisance functions are created and prepared.
Usage
check_nuisance_models(data)
Arguments
data
Dataframe which should have appropriate columns of nuisance function predictions: .pi_hat, .mu0_hat and .mu1_hat.
Value
Returns NULL. Errors if a problem is discovered.
Checks that splits have been properly created.
Description
This helper function makes a few simple checks to identify obvious issues with the way that splits have been made in the supplied data.
Usage
check_splits(data)
Arguments
data
Dataframe which should have an appropriate .split_id column (as created by make_splits()).
Value
Returns NULL. Errors if a problem is discovered.
Checks that an appropriate weighting variable has been provided
Description
This helper function makes a few simple checks to identify obvious issues with the weights provided.
Usage
check_weights(data, weight_col)
Arguments
data
Dataframe of interest.
weight_col
Quoted name of weights column.
Value
Returns NULL. Errors if a problem is discovered.
Construct Pseudo-outcomes
Description
construct_pseudo_outcomes
takes a dataset which has been prepared
with plugin estimators of nuisance parameters and transforms these into
a "pseudo-outcome": an unbiased estimator of the conditional average
treatment effect under exogeneity.
Usage
construct_pseudo_outcomes(data, outcome, treatment, type = "dr")
Arguments
data
dataframe (already prepared with produce_plugin_estimates())
outcome
Unquoted name of outcome variable.
treatment
Unquoted name of treatment variable.
type
String representing how to construct the pseudo-outcome. Valid values are "dr" (the default), "ipw" and "plugin". See "Details" for more discussion of these options.
Details
Taking averages of these pseudo-outcomes (or fitting a model to them) will approximate averages (or models) of the underlying treatment effect.
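For intuition only, the snippet below sketches the standard doubly-robust (AIPW) pseudo-outcome that a "dr"-type construction is based on, computed from the plugin columns added by produce_plugin_estimates() (.pi_hat, .mu0_hat, .mu1_hat). The package performs this construction internally; the names df, a, y and psi here are assumptions for illustration.
library("dplyr")
df %>%
  mutate(
    .mu_a_hat = ifelse(a == 1, .mu1_hat, .mu0_hat),
    # AIPW residual correction plus the plugin effect estimate.
    psi = (a - .pi_hat) / (.pi_hat * (1 - .pi_hat)) * (y - .mu_a_hat) +
      .mu1_hat - .mu0_hat
  )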
See Also
attach_config(), make_splits(), produce_plugin_estimates(), estimate_QoI()
Estimate Quantities of Interest
Description
estimate_QoI
takes a dataframe already prepared with split IDs,
plugin estimates and pseudo-outcomes and calculates the requested
quantities of interest (QoIs).
Usage
estimate_QoI(data, ...)
Arguments
data
data frame (already prepared with attach_config(), make_splits(), produce_plugin_estimates() and construct_pseudo_outcomes())
...
Unquoted names of moderators to calculate QoIs for.
Details
To see an example analysis, read vignette("experimental_analysis")
in the context
of an experiment, vignette("experimental_analysis")
for an observational study, or
vignette("methodological_details")
for a deeper dive under the hood.
See Also
attach_config(), make_splits(), produce_plugin_estimates(), construct_pseudo_outcomes()
Examples
library("dplyr")
if(require("palmerpenguins")) {
data(package = 'palmerpenguins')
penguins$unitid = seq_len(nrow(penguins))
penguins$propensity = rep(0.5, nrow(penguins))
penguins$treatment = rbinom(nrow(penguins), 1, penguins$propensity)
cfg <- basic_config() %>%
add_known_propensity_score("propensity") %>%
add_outcome_model("SL.glm.interaction") %>%
remove_vimp()
attach_config(penguins, cfg) %>%
make_splits(unitid, .num_splits = 4) %>%
produce_plugin_estimates(outcome = body_mass_g, treatment = treatment, species, sex) %>%
construct_pseudo_outcomes(body_mass_g, treatment) %>%
estimate_QoI(species, sex)
}
Function to calculate diagnostics based on model outputs
Description
This function defines the calculations of common model diagnostics which are available.
Usage
estimate_diagnostic(data, label, prediction, diag_name, params)
Arguments
data
The full data frame with all auxiliary columns.
label
The (string) column name for the labels to evaluate against.
prediction
The (string) column name of predictions from the model to diagnose.
diag_name
The (string) name of the diagnostic to calculate. Currently available are "AUC", "MSE", "SL_coefs", "SL_risk", "RROC"
params
Any other necessary options to pass to the given diagnostic.
Examples
df <- dplyr::tibble(y = rbinom(100, 1, 0.5), p = rep(0.5, 100), w = rexp(100), u = 1:100)
attr(df, "weights") <- "w"
attr(df, "identifier") <- "u"
estimate_diagnostic(df, "y", "p", "AUC")
Fits a treatment effect model using the appropriate settings
Description
This function prepares data, fits the appropriate model and returns the resulting estimates in a standardized format.
Usage
fit_effect(full_data, weight_col, fx_col, ..., .Model_cfg)
Arguments
full_data
The full dataset of interest for the modelling problem.
weight_col
The unquoted weighting variable name to use in model fitting.
fx_col
The unquoted column name of the pseudo-outcome.
...
The unquoted names of covariates to use in the model.
.Model_cfg
A Model_cfg object determining how the model should be fit.
Value
A list with one element, fx. This element contains a Predictor object of the appropriate subclass corresponding to the Model_cfg fit to the data.
Fit a predictor for treatment effects
Description
This function predicts treatment effects in a second stage model.
Usage
fit_fx_predictor(full_data, weights, psi_col, ..., .pcate.cfg, .Model_cfg)
Arguments
full_data
The full original data with all auxiliary columns.
weights
Weights to be used in the analysis.
psi_col
The unquoted column name of the calculated pseudo-outcome.
...
Covariate data, passed in as the unquoted names of columns in full_data.
.pcate.cfg
A PCATE_cfg object.
.Model_cfg
A Model_cfg object.
Value
A list with two items:
- model - The FX.Predictor model object used internally for PCATE estimation.
- data - The data augmented with column .pseudo_outcome_hat for the cross-fit predictions of the HTE for each unit.
Fits a plugin model using the appropriate settings
Description
This function prepares data, fits the appropriate models and returns the resulting estimates in a standardized format.
Usage
fit_plugin(full_data, weight_col, outcome_col, ..., .Model_cfg)
Arguments
full_data
The full dataset of interest for the modelling problem.
weight_col
The unquoted weighting variable name to use in model fitting.
outcome_col
The unquoted column name to use as a label for the supervised learning problem.
...
The unquoted names of covariates to use in the model.
.Model_cfg
A Model_cfg object determining how the model should be fit.
Value
A new Predictor
object of the appropriate subclass corresponding to the
Model_cfg
fit to the data.
Fits a propensity score model using the appropriate settings
Description
This function prepares data, fits the appropriate model and returns the resulting estimates in a standardized format.
Usage
fit_plugin_A(full_data, weight_col, a_col, ..., .Model_cfg)
Arguments
full_data
The full dataset of interest for the modelling problem.
weight_col
The unquoted weighting variable name to use in model fitting.
a_col
The unquoted column name of the treatment.
...
The unquoted names of covariates to use in the model.
.Model_cfg
A Model_cfg object determining how the model should be fit.
Value
A list with one element, ps. This element contains a Predictor object of the appropriate subclass corresponding to the Model_cfg fit to the data.
Fits a T-learner using the appropriate settings
Description
This function prepares data, fits the appropriate model and returns the resulting estimates in a standardized format.
Usage
fit_plugin_Y(full_data, weight_col, y_col, a_col, ..., .Model_cfg)
Arguments
full_data
The full dataset of interest for the modelling problem.
weight_col
The unquoted weighting variable name to use in model fitting.
y_col
The unquoted column name of the outcome.
a_col
The unquoted column name of the treatment.
...
The unquoted names of covariates to use in the model.
.Model_cfg
A Model_cfg object determining how the models should be fit.
Value
A list with two elements, mu1 and mu0, corresponding to the models fit to the treatment and control potential outcomes, respectively. Each is a new Predictor object of the appropriate subclass corresponding to the Model_cfg fit to the data.
Removes rows which have missing data on any of the supplied columns.
Description
This function removes rows with missingness based on the columns provided. If rows are dropped, a message is displayed to the user to inform them of this fact.
Usage
listwise_deletion(data, ...)
Arguments
data
The dataset from which to drop cases which are not fully observed.
...
Unquoted column names which must be non-missing. Missingness in these columns will result in dropped observations. Missingness in other columns will not.
Value
The original data with all observations which are fully observed.
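A minimal sketch of the behaviour on a small hypothetical tibble; only missingness in the named columns causes rows to be dropped:
library("dplyr")
df <- tibble(
  y = c(1, NA, 3),
  x1 = c(2, 2, NA),
  x2 = c(NA, 5, 6)
)
# Rows 2 and 3 are dropped (NA in y or x1); the NA in x2 is ignored.
listwise_deletion(df, y, x1)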
Define splits for cross-fitting
Description
This takes a dataset, a column with a unique identifier and an
arbitrary number of covariates on which to stratify the splits.
It returns the original dataset with an additional column .split_id
corresponding to an identifier for the split.
Usage
make_splits(data, identifier, ..., .num_splits)
Arguments
data
dataframe
identifier
Unquoted name of unique identifier column
...
variables on which to stratify (requires that the quickblock package be installed)
.num_splits
number of splits to create. If VIMP is requested in the HTE_cfg, this number must be even.
Details
To see an example analysis, read vignette("experimental_analysis")
in the context
of an experiment, vignette("experimental_analysis")
for an observational study, or
vignette("methodological_details")
for a deeper dive under the hood.
Value
original dataframe with additional .split_id
column
See Also
attach_config(), produce_plugin_estimates(), construct_pseudo_outcomes(), estimate_QoI()
Examples
library("dplyr")
if(require("palmerpenguins")) {
data(package = 'palmerpenguins')
penguins$unitid = seq_len(nrow(penguins))
penguins$propensity = rep(0.5, nrow(penguins))
penguins$treatment = rbinom(nrow(penguins), 1, penguins$propensity)
cfg <- basic_config() %>%
add_known_propensity_score("propensity") %>%
add_outcome_model("SL.glm.interaction") %>%
remove_vimp()
attach_config(penguins, cfg) %>%
make_splits(unitid, .num_splits = 4) %>%
produce_plugin_estimates(outcome = body_mass_g, treatment = treatment, species, sex) %>%
construct_pseudo_outcomes(body_mass_g, treatment) %>%
estimate_QoI(species, sex)
}
Prediction for an SL.glmnet object
Description
Prediction for the glmnet wrapper.
Usage
## S3 method for class 'SL.glmnet.interaction'
predict(
object,
newdata,
remove_extra_cols = TRUE,
add_missing_cols = TRUE,
...
)
Arguments
object
Result object from SL.glmnet
newdata
Dataframe or matrix that will generate predictions.
remove_extra_cols
Remove any extra columns in the new data that were not part of the original model.
add_missing_cols
Add any columns from original data that do not exist in the new data, and set values to 0.
...
Any additional arguments (not used).
See Also
SL.glmnet
Estimate models of nuisance functions
Description
This takes a dataset with an identified outcome and treatment column along
with any number of covariates and appends three columns to the dataset corresponding
to an estimate of the conditional expectation of treatment (.pi_hat
), along with the
conditional expectation of the control and treatment potential outcome surfaces
(.mu0_hat
and .mu1_hat
respectively).
Usage
produce_plugin_estimates(data, outcome, treatment, ..., .weights = NULL)
Arguments
data
dataframe (already prepared with attach_config() and make_splits())
outcome
Unquoted name of the outcome variable.
treatment
Unquoted name of the treatment variable.
...
Unquoted names of covariates to include in the models of the nuisance functions.
.weights
Unquoted name of weights column. If NULL, all analysis will assume weights are all equal to one and sample-based quantities will be returned.
Details
To see an example analysis, read vignette("experimental_analysis")
in the context
of an experiment, vignette("experimental_analysis")
for an observational study, or
vignette("methodological_details")
for a deeper dive under the hood.
See Also
attach_config(), make_splits(), construct_pseudo_outcomes(), estimate_QoI()
Examples
library("dplyr")
if(require("palmerpenguins")) {
data(package = 'palmerpenguins')
penguins$unitid = seq_len(nrow(penguins))
penguins$propensity = rep(0.5, nrow(penguins))
penguins$treatment = rbinom(nrow(penguins), 1, penguins$propensity)
cfg <- basic_config() %>%
add_known_propensity_score("propensity") %>%
add_outcome_model("SL.glm.interaction") %>%
remove_vimp()
attach_config(penguins, cfg) %>%
make_splits(unitid, .num_splits = 4) %>%
produce_plugin_estimates(outcome = body_mass_g, treatment = treatment, species, sex) %>%
construct_pseudo_outcomes(body_mass_g, treatment) %>%
estimate_QoI(species, sex)
}
Removes variable importance information
Description
This removes the variable importance quantity of interest from an HTE_cfg.
Usage
remove_vimp(hte_cfg)
Arguments
hte_cfg
HTE_cfg object to update.
Value
Updated HTE_cfg
object
Examples
library("dplyr")
basic_config() %>%
remove_vimp() -> hte_cfg
Partition the data into folds
Description
This takes a dataset and a split ID and generates two subsets of the data corresponding to a training set and a holdout.
Usage
split_data(data, split_id)
Arguments
data
dataframe
split_id
integer representing the split to construct
Value
Returns an R6 object HTEFold with three public fields:
- train - The split to be used for training the plugin estimates.
- holdout - The split not used for training.
- in_holdout - A logical vector indicating for each unit whether they lie in the holdout.
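A minimal usage sketch, assuming df has already been given a .split_id column by make_splits(); this holds out split 1 and keeps the remaining splits for training:
fold <- split_data(df, split_id = 1)
nrow(fold$train)       # units used for training
nrow(fold$holdout)     # units held out
mean(fold$in_holdout)  # share of the original data in the holdout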