---
title: "MCBoost - Basics and Extensions"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{MCBoost - Basics and Extensions}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, echo = FALSE}
NOT_CRAN = identical(tolower(Sys.getenv("NOT_CRAN")), "true")
HAS_DEPS = identical(tolower(Sys.getenv("_R_CHECK_DEPENDS_ONLY_")), "true")
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  purl = (NOT_CRAN & !HAS_DEPS),
  eval = (NOT_CRAN & !HAS_DEPS)
)
```

```{r setup}
library("mcboost")
library("mlr3")
set.seed(83007)
```

## Example 0: Multi-Accuracy in 6 lines of code

As a brief introduction, we show how to use **mcboost** in only 6 lines of code.
For our example, we use the data from the *sonar* binary classification task.
We instantiate an `MCBoost` instance by specifying an `auditor_fitter`.
This `auditor_fitter` defines the splits into groups in each boosting iteration, based on the obtained residuals.
In this example, we choose a `Tree`-based model.
Afterwards, we run the `$multicalibrate()` method on our data to start multi-calibration.
We only use the first 200 samples of the *sonar* data set to train our multi-calibrated model.

```{r}
tsk = tsk("sonar")
d = tsk$data(cols = tsk$feature_names)
l = tsk$data(cols = tsk$target_names)[[1]]
mc = MCBoost$new(auditor_fitter = "TreeAuditorFitter")
mc$multicalibrate(d[1:200,], l[1:200])
```

After calibration, we use the model to predict on the left-out data (8 observations).

```{r}
mc$predict_probs(d[201:208,])
```

## What does mcboost do?

Internally, mcboost runs the following procedure `max_iter` times:

1. Predict on X using the model from the previous iteration (`init_predictor` in the first iteration).
1. Compute the residuals `res = y - y_hat`.
1. Split the predictions into `num_buckets` buckets according to `y_hat`.
1. Fit the auditor (`auditor_fitter`, here called `c(x)`) on the data in each bucket, using the residuals `res` as the target variable.
1. Compute `misscal = mean(c(x) * res(x))`.
1. If `misscal > alpha`: for the bucket with the highest `misscal`, update the model using the prediction `c(x)`. Otherwise: stop the procedure.

Many more details can be found in the code and in the corresponding publications.

## Example 1: Multi-Accuracy Boosting on the Adult Dataset

First we download the data and create an `mlr3` classification task:

```{r}
library(data.table)
adult_train = fread(
  "https://raw.githubusercontent.com/Yorko/mlcourse.ai/master/data/adult_train.csv",
  stringsAsFactors = TRUE
)
adult_train$Country = NULL
adult_train$fnlwgt = NULL
train_tsk = TaskClassif$new("adult_train", adult_train, target = "Target")
```

We removed the features `Country` and `fnlwgt` since we expect them to have no predictive power.
`fnlwgt` means "final weight" and aims to allocate similar weights to people with similar demographic characteristics, while `Country` has 42 distinct levels, yet 89% of the observations are from the United States.

### 1.1 Preprocessing

Then we do basic preprocessing:

* Collapse the rarest factor levels according to their prevalence
* Drop missing factor levels
* One-hot encode categorical variables
* Impute NAs using a histogram approach

```{r}
library(mlr3pipelines)
pipe = po("collapsefactors", no_collapse_above_prevalence = 0.0006) %>>%
  po("fixfactors") %>>%
  po("encode") %>>%
  po("imputehist")
prep_task = pipe$train(train_tsk)[[1]]
```
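The `encode` step turns the factor `Race` into several binary indicator columns. We can list them to see what the next step has to remove (a quick optional check; the pattern simply matches the encoded column names used below):

```{r}
# Indicator columns created for `Race` by the one-hot encoding step
grep("^Race", prep_task$feature_names, value = TRUE)
```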
In order to simulate settings where a sensitive feature is not available, we remove these (dummy-encoded) `Race` features from the training task.

```{r}
prep_task$set_col_roles(
  c("Race.Amer.Indian.Eskimo", "Race.Asian.Pac.Islander", "Race.Black", "Race.Other", "Race.White"),
  remove_from = "feature")
```

Now we fit a random forest.

```{r}
library(mlr3learners)
l = lrn("classif.ranger", num.trees = 10L, predict_type = "prob")
l$train(prep_task)
```

### 1.2 MCBoost

A simple way to use the predictions from any model in **mcboost** is to wrap its predict function and provide it as the initial predictor. This works with any model from any library. Note that we have to make sure that our `init_predictor` returns a numeric vector of predictions.

```{r}
init_predictor = function(data) {
  l$predict_newdata(data)$prob[, 2]
}
```

As **mcboost** requires the data to be provided in `X, y` format (a `data.table` or `data.frame` of features and a vector of labels), we create those two objects.

```{r}
data = prep_task$data(cols = prep_task$feature_names)
labels = 1 - one_hot(prep_task$data(cols = prep_task$target_names)[[1]])
```

We use a ridge-regularized linear regression model as the auditor.

```{r}
mc = MCBoost$new(auditor_fitter = "RidgeAuditorFitter", init_predictor = init_predictor)
mc$multicalibrate(data, labels)
```

The `print` method additionally lists the average auditor values in the different buckets in each iteration:

```{r}
mc
```

### 1.3 Evaluation on Test Data

```{r}
adult_test = fread(
  "https://raw.githubusercontent.com/Yorko/mlcourse.ai/master/data/adult_test.csv",
  stringsAsFactors = TRUE
)
adult_test$Country = NULL
adult_test$fnlwgt = NULL

# The first row seems to have an error
adult_test = adult_test[Target != "",]
adult_test$Target = droplevels(adult_test$Target)

# Note that we have to convert columns from numeric to integer here:
sdc = train_tsk$feature_types[type == "integer", id]
adult_test[, (sdc) := lapply(.SD, as.integer), .SDcols = sdc]

test_tsk = TaskClassif$new("adult_test", adult_test, target = "Target")
prep_test = pipe$predict(test_tsk)[[1]]
```

Now, we can again extract `X, y`

```{r}
test_data = prep_test$data(cols = prep_test$feature_names)
test_labels = 1 - one_hot(prep_test$data(cols = prep_test$target_names)[[1]])
```

and **predict**.

```{r}
prs = mc$predict_probs(test_data)
```

The accuracy of the multi-calibrated model

```{r}
mean(round(prs) == test_labels)
```

is similar to that of the non-calibrated model.

```{r}
mean(round(init_predictor(test_data)) == test_labels)
```

But if we look at the bias for the different subpopulations of the feature `Race`, we can see that the predictions became better calibrated. Note that we gave neither the initial model nor the auditor explicit access to the feature `Race`.

```{r}
# Get bias per subgroup for the multi-calibrated predictor
adult_test$biasmc = (prs - test_labels)
adult_test[, .(abs(mean(biasmc)), .N), by = .(Race)]
# Get bias per subgroup for the initial predictor
adult_test$biasinit = (init_predictor(test_data) - test_labels)
adult_test[, .(abs(mean(biasinit)), .N), by = .(Race)]
```

### 1.4 The Auditor Effect

We can also obtain the auditor effect after multi-calibration. This indicates "how much" each observation has been affected by multi-calibration (on average across iterations).

```{r}
ae = mc$auditor_effect(test_data)
hist(ae)
```

We can see that a few instances show pronounced effects, while most are hardly affected at all.
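A numeric summary of the effects confirms this skew (a quick optional check):

```{r}
summary(ae)
```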
To gain more insight, we compute quantiles of the less and the more affected populations (using the median effect as the cut-point) and analyze the differences.

```{r}
effect = apply(test_data[ae >= median(ae[ae > 0]),], 2, quantile)
no_effect = apply(test_data[ae < median(ae[ae > 0]),], 2, quantile)
difference = apply((effect - no_effect), 2, mean)
difference[difference > 0.1]
```

There seems to be a difference in some variables, such as `Education` and `Marital_Status`. We can look at the affected individuals in more detail:

```{r}
test_data[ae >= median(ae[ae > 0]), names(which(difference > 0.1)), with = FALSE]
```

### 1.5 Predicting using only the first `t` iterations

Multi-calibration is an iterative procedure. The `t` parameter can be used to predict using only the first `t` iterations of the multi-calibration procedure.

```{r}
prs = mc$predict_probs(test_data, t = 3L)
```

## Example 2: MCBoost with non-mlr3 models: GLM

`mcboost` does not require your model to be an `mlr3` model. As input, `mcboost` expects a function `init_predictor` that takes `data` as input and returns a prediction.

```{r}
tsk = tsk("sonar")
data = tsk$data()[, Class := as.integer(Class) - 1L]
mod = glm(data = data, formula = Class ~ .)
```

The `init_predictor` can then use the `glm` model:

```{r}
init_predictor = function(data) {
  predict(mod, data)
}
```

... and we can calibrate this predictor.

```{r}
d = data[, -1]
l = data$Class
mc = MCBoost$new(init_predictor = init_predictor)
mc$multicalibrate(d[1:200,], l[1:200])
mc$predict_probs(d[201:208,])
```

## Example 3: Avoiding Overfitting in MCBoost

`MCBoost`'s calibration is often very aggressive and tends to overfit. This section introduces methods to regularize against this overfitting.

### 3.1 CVLearner

In this section we use a cross-validated learner that predicts on held-out data during the training phase. This idea is based on Wolpert's (1992) stacked generalization. Other, simpler methods include choosing a smaller step size `eta` or reducing the number of iterations via `max_iter`; a short sketch follows at the end of this subsection.

```{r}
tsk = tsk("sonar")
```

As an `init_predictor` we again use a `ranger` model from mlr3, constructed via the convenience function provided by `mcboost`.

```{r}
learner = lrn("classif.ranger", predict_type = "prob")
learner$train(tsk)
init_predictor = mlr3_init_predictor(learner)
```

... and we can calibrate this predictor. This time, we use a `CVTreeAuditorFitter` instead of a `TreeAuditorFitter`, which helps avoid overfitting via the stacked generalization idea mentioned above. Note that this can take a little longer, since each learner is cross-validated using 3 folds (the default).

```{r}
d = data[, -1]
l = data$Class
mc = MCBoost$new(init_predictor = init_predictor, auditor_fitter = CVTreeAuditorFitter$new(), max_iter = 2L)
mc$multicalibrate(d[1:200,], l[1:200])
mc$predict_probs(d[201:208,])
```
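The simpler regularizations mentioned above are set directly in the constructor. A minimal sketch, using the same `init_predictor` and data as above (not evaluated here):

```{r, eval = FALSE}
# Sketch: shrink each update via `eta` and cap the number of boosting
# rounds via `max_iter` to make the calibration less aggressive.
mc_reg = MCBoost$new(
  init_predictor = init_predictor,
  auditor_fitter = TreeAuditorFitter$new(),
  eta = 0.1,
  max_iter = 5L
)
mc_reg$multicalibrate(d[1:200,], l[1:200])
```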
### 3.2 Data Splitting

We can also use a fresh chunk of the validation data in each iteration. `mcboost` implements two strategies, `"bootstrap"` and `"split"`. While `"split"` simply splits up the data, `"bootstrap"` draws a new bootstrap sample of the data in each iteration.

```{r}
tsk = tsk("sonar")
```

Again, we use a `ranger` mlr3 model as our initial predictor:

```{r}
learner = lrn("classif.ranger", predict_type = "prob")
learner$train(tsk)
init_predictor = mlr3_init_predictor(learner)
```

and we can now calibrate:

```{r}
d = data[, -1]
l = data$Class
mc = MCBoost$new(
  init_predictor = init_predictor,
  auditor_fitter = TreeAuditorFitter$new(),
  iter_sampling = "bootstrap"
)
mc$multicalibrate(d[1:200,], l[1:200])
mc$predict_probs(d[201:208,])
```

## Example 4: Adjusting the SubPop Fitter

For this example, we use the *sonar* dataset once again:

```{r}
tsk = tsk("sonar")
data = tsk$data(cols = tsk$feature_names)
labels = tsk$data(cols = tsk$target_names)[[1]]
```

### 4.1 LearnerAuditorFitter

The SubPop fitter can easily be adjusted by constructing it from a `LearnerAuditorFitter`. This allows for using any **mlr3** learner. See [here](https://mlr3extralearners.mlr-org.com/articles/learners/list_learners.html) for a list of available learners.

```{r}
rf = LearnerAuditorFitter$new(lrn("regr.rpart", minsplit = 10L))
mc = MCBoost$new(auditor_fitter = rf)
mc$multicalibrate(data, labels)
```

The `TreeAuditorFitter` and `RidgeAuditorFitter` are two instantiations of this fitter with pre-defined learners. By providing the corresponding character string, the fitter can be constructed automatically.

### 4.2 SubpopAuditorFitter & SubgroupAuditorFitter

On some occasions, instead of using a `Learner`, we might want to use a fixed set of subgroups. These can either be derived from the data itself or provided from the outside.

**Splitting via the dataset**

In order to split the data into groups according to a set of columns, we use a `SubpopAuditorFitter` together with a list of `subpops`. These define the group splits to multi-calibrate on. Each split can be either a `character` string referencing a binary variable in the data, or a `function` that, when evaluated on the data, returns a binary vector. In order to showcase both options, we add a binary variable to our `data`:

```{r}
data[, Bin := sample(c(1, 0), nrow(data), replace = TRUE)]
```

```{r}
rf = SubpopAuditorFitter$new(list(
  "Bin",
  function(data) {data[["V1"]] > 0.2},
  function(data) {data[["V1"]] > 0.2 | data[["V3"]] < 0.29}
))
```

```{r}
mc = MCBoost$new(auditor_fitter = rf)
mc$multicalibrate(data, labels)
```

And we can again apply it to predict on new data:

```{r}
mc$predict_probs(data)
```

**Manually defined masks**

If we want to add the splitting from the outside, we can supply manually defined binary masks for the rows of the data. Note that each mask has to correspond to the number of rows in the dataset.

```{r}
rf = SubgroupAuditorFitter$new(list(
  rep(c(0, 1), 104),
  rep(c(1, 1, 1, 0), 52)
))
```

```{r}
mc = MCBoost$new(auditor_fitter = rf)
mc$multicalibrate(data, labels)
```

During prediction, we now have to supply a set of masks for the prediction data.

```{r}
predict_masks = list(
  rep(c(0, 1), 52),
  rep(c(1, 1, 1, 0), 26)
)
```

```{r}
mc$predict_probs(data[1:104,], subgroup_masks = predict_masks)
```
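Since the prediction masks must line up with the rows of the prediction data (here the first 104 rows), a quick length check can prevent surprises (an optional sanity check):

```{r}
# Each mask must have one entry per row of the prediction data (104 here)
sapply(predict_masks, length)
```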
## Example 5: Multi-Calibrating data with missing values using a pipeline

When data has missing values or other non-standard columns, we often have to preprocess it before we can fit a model. Those preprocessing steps can be embedded into the SubPop fitter by using an **mlr3pipelines** pipeline. The following code shows a brief example:

```{r}
tsk = tsk("penguins")

# First we convert to a binary task
row_ids = tsk$data(cols = c("species", "..row_id"))[species %in% c("Adelie", "Gentoo")][["..row_id"]]
tsk$filter(row_ids)$droplevels()
tsk
```

```{r}
library("mlr3pipelines")
library("mlr3learners")

# Convert the task to X, y
X = tsk$data(cols = tsk$feature_names)
y = tsk$data(cols = tsk$target_names)

# Our initial model is a pipeline that imputes missing values and encodes categoricals
init_model = as_learner(po("encode") %>>% po("imputehist") %>>%
  lrn("classif.glmnet", predict_type = "prob"))
# And we fit it on a subset of the data in order to simulate a poorly performing model
init_model$train(tsk$clone()$filter(row_ids[c(1:9, 160:170)]))
init_model$predict(tsk)$score()

# The auditor is likewise a pipeline that imputes missing values and encodes categoricals
auditor = as_learner(po("encode") %>>% po("imputehist") %>>% lrn("regr.rpart"))

mc = MCBoost$new(auditor_fitter = auditor, init_predictor = init_model)
mc$multicalibrate(X, y)
```

and we can observe where it improved:

```{r}
mc
```

## Example 6: Multi-Calibration for Regression

We abuse the `Communities & Crime` dataset in order to showcase how `mcboost` can also be used in a regression setting. First we download the data and create an `mlr3` regression task:

```{r}
library(data.table)
library(mlr3oml)
oml = OMLData$new(42730)
data = oml$data
tsk = TaskRegr$new("communities_crime", data, target = "ViolentCrimesPerPop")
```

Currently, **mcboost** only works with targets between 0 and 1. Luckily, our target variable's values are already in that range; if they were not, we could simply scale them to [0, 1] before the analysis (a short sketch of such a rescaling follows at the end of Section 6.2).

```{r}
summary(data$ViolentCrimesPerPop)
```

We again split our task into **train** and **test**. We do this in `mlr3` by creating a 2/3 - 1/3 split using `mlr3::partition()` and assigning the train ids the row role `"use"`.

```{r}
split = partition(tsk)
tsk$set_row_roles(split$train, "use")
```

### 6.1 Preprocessing

Then we do basic preprocessing. Since we do not have any categorical variables, we only impute NAs using a histogram approach.

```{r}
library(mlr3pipelines)
pipe = po("imputehist")
prep_task = pipe$train(list(tsk))[[1]]

prep_task$set_col_roles(
  c("racepctblack", "racePctWhite", "racePctAsian", "racePctHisp", "community"),
  remove_from = "feature")
```

Now we fit our first `Learner`: a random forest.

```{r}
library(mlr3learners)
l = lrn("regr.ranger", num.trees = 10L)
l$train(prep_task)
```

### 6.2 MCBoost

A simple way to use the predictions from any model in **mcboost** is to wrap its predict function and provide it as the initial predictor. This works with any model from any library. Note that we have to make sure that our `init_predictor` returns a numeric vector of predictions.

```{r}
init_predictor = function(data) {
  l$predict_newdata(data)$response
}
```

As **mcboost** requires the data to be provided in `X, y` format (a `data.table` or `data.frame` of features and a vector of labels), we create those two objects.

```{r}
data = prep_task$data(cols = prep_task$feature_names)
labels = prep_task$data(cols = prep_task$target_names)[[1]]
```

```{r}
mc = MCBoost$new(auditor_fitter = "RidgeAuditorFitter", init_predictor = init_predictor, eta = 0.1)
mc$multicalibrate(data, labels)
```
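Had the target not been in [0, 1] already, a simple min-max rescaling would suffice. A minimal sketch (not needed here, since `labels` is already in range; keep the original bounds to map predictions back afterwards):

```{r, eval = FALSE}
# Sketch: rescale a target vector to [0, 1] and remember the bounds
y_min = min(labels)
y_max = max(labels)
labels_scaled = (labels - y_min) / (y_max - y_min)
# After prediction, map back: preds_original = preds * (y_max - y_min) + y_min
```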
### 6.3 Evaluation on Test Data

We first create the test task by assigning the test ids to the row role `"use"`, and then use our preprocessing `pipe`'s predict function to also impute missing values for the validation data. Then we again extract the features `X` and target `y`

```{r}
test_task = tsk$clone()
test_task$row_roles$use = split$test
test_task = pipe$predict(list(test_task))[[1]]
test_data = test_task$data(cols = tsk$feature_names)
test_labels = test_task$data(cols = tsk$target_names)[[1]]
```

and **predict**.

```{r}
prs = mc$predict_probs(test_data)
```

Now we can compute the MSE of the multi-calibrated model

```{r}
mean((prs - test_labels)^2)
```

and compare it to the non-calibrated version:

```{r}
mean((init_predictor(test_data) - test_labels)^2)
```

Looking at subpopulations, we can see that the predictions became better calibrated. Since we cannot show all subpopulations, we only show the MSE for the feature `racepctblack`.

```{r}
test_data$se_mcboost = (prs - test_labels)^2
test_data$se_init = (init_predictor(test_data) - test_labels)^2
test_data[, .(mcboost = mean(se_mcboost), initial = mean(se_init), .N), by = .(racepctblack > 0.5)]
```
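The same comparison can be repeated for any other subpopulation of interest, for example (a brief sketch using another race-percentage feature from the same dataset):

```{r}
# MSE comparison for a different subpopulation split
test_data[, .(mcboost = mean(se_mcboost), initial = mean(se_init), .N), by = .(racePctWhite > 0.5)]
```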