Type: | Package |
Version: | 0.2.0 |
Title: | Projection Pursuit Classification Forest |
Maintainer: | Natalia da Silva <natalia.dasilva@fcea.edu.uy> |
Description: | Implements projection pursuit forest algorithm for supervised classification. |
License: | GPL-2 | GPL-3 [expanded from: GPL (≥ 2)] |
URL: | https://github.com/natydasilva/PPforest |
LazyData: | yes |
Depends: | R (≥ 4.1.0) |
Imports: | Rcpp (≥ 0.12.7), magrittr, plyr, dplyr (≥ 0.7.5), tidyr, doParallel, tibble, tidyselect |
Suggests: | knitr, gridExtra, GGally, ggplot2, RColorBrewer, roxygen2 (≥ 3.0.0), PPtreeViz, rmarkdown |
VignetteBuilder: | knitr |
LinkingTo: | Rcpp,RcppArmadillo |
Encoding: | UTF-8 |
RoxygenNote: | 7.3.2 |
BugReports: | https://github.com/natydasilva/PPforest/issues |
NeedsCompilation: | yes |
Packaged: | 2025-07-23 22:09:04 UTC; nataliadasilva |
Author: | Natalia da Silva |
Repository: | CRAN |
Date/Publication: | 2025-07-23 22:20:19 UTC |
NCI60 data set
Description
NCI60 data set
Usage
data(NCI60)
Format
cDNA microarrays were used to examine the variation in gene expression among the 60 cell lines. The cell lines are derived from tumors with different sites of origin. This data set contains 61 observations and 30 feature variables from 8 different tissue types.
- Type
has 8 different tissue types: 9 cases of breast, 5 cases of central nervous system (CNS), 7 cases of colon, 8 cases of leukemia, 8 cases of melanoma, 9 cases of non-small-cell lung carcinoma (NSCLC), 6 cases of ovarian and 9 cases of renal.
- Gene1, Gene2, ..., Gene30
Numeric gene expression information.
A data frame with 61 rows and 31 variables
Source
Dudoit, S., Fridlyand, J. and Speed, T. P. (2002). Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data. Journal of the American Statistical Association, 97, 77-87.
Predict class for the test set and calculate prediction error after finding the PPtree structure.
Description
Predict class for the test set and calculate prediction error after finding the PPtree structure.
Usage
PPclassify(Tree.result, test.data = NULL, Rule = 1, true.class = NULL)
Arguments
Tree.result |
the result from PPtree_split |
test.data |
the test dataset |
Rule |
split rule. 1: mean of two group means; 2: weighted mean; 3: mean of max(left group) and min(right group); 4: weighted mean of max(left group) and min(right group) |
true.class |
true class of test dataset if available |
Value
predict.class predicted class
predict.error prediction error
References
Lee, YD, Cook, D., Park JW, and Lee, EK (2013) PPtree: Projection pursuit classification tree, Electronic Journal of Statistics, 7:1369-1386.
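As a rough illustration of what Rule = 1 and the returned prediction error mean at a single split node, the base R sketch below projects toy data onto an assumed 1-dimensional projection, places the cutoff at the mean of the two group means, and counts the misclassified test cases. The data, the projection vector a, and all object names are hypothetical, and reporting the error as a count is an assumption; this is not the package's internal code.

```r
# Toy two-class training data (hypothetical values)
train_x <- matrix(c(1, 2,
                    2, 1,
                    6, 7,
                    7, 6), ncol = 2, byrow = TRUE)
train_y <- c("A", "A", "B", "B")

# Assumed 1-dimensional projection direction (unit length)
a <- c(1, 1) / sqrt(2)

# Project the training data and compute the Rule 1 cutoff:
# the mean of the two group means of the projected data
proj_train <- as.vector(train_x %*% a)
group_means <- tapply(proj_train, train_y, mean)
cutoff <- mean(group_means)

# Classify test cases: values below the cutoff go to the group
# with the smaller projected mean, the rest to the other group
test_x <- matrix(c(1.5, 1.5,
                   6.5, 6.5,
                   5.0, 5.0), ncol = 2, byrow = TRUE)
true_class <- c("A", "B", "A")
proj_test <- as.vector(test_x %*% a)
low_group <- names(group_means)[which.min(group_means)]
high_group <- names(group_means)[which.max(group_means)]
predict_class <- ifelse(proj_test < cutoff, low_group, high_group)

# Prediction error as the number of misclassified test cases
# (a count is assumed here; a proportion would divide by the test size)
predict_error <- sum(predict_class != true_class)
```

In this toy setting the third test case projects above the cutoff and is assigned to class B, so one of the three cases is misclassified.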
Projection Pursuit Random Forest
Description
PPforest
implements a random forest using the projection pursuit tree algorithm (based on the PPtreeViz package).
Usage
PPforest(data, y, std = 'scale', size.tr, m, PPmethod, size.p,
lambda = .1, parallel = FALSE, cores = 2, rule = 1)
Arguments
data |
Data frame with the complete data set. |
y |
A character with the name of the response variable. |
std |
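character string selecting how the data set is standardized (the usage default is 'scale'; the examples also use 'min-max' and 'no'). Standardization is needed to compute the global importance measure. |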
size.tr |
proportion of the data used for training when the data are split into training and test sets. |
m |
number of bootstrap replicates, which corresponds to the number of trees to grow. To ensure that each observation is predicted a few times, this number should not be too small. |
PPmethod |
is the projection pursuit index to optimize in each classification tree. The options are 'LDA' and 'PDA'. |
size.p |
proportion of variables randomly sampled in each split. |
lambda |
penalty parameter in the PDA index; it lies between 0 and 1. If |
parallel |
logical condition, if it is TRUE then parallelize the function |
cores |
number of cores used in the parallelization |
rule |
split rule. 1: mean of two group means; 2: weighted mean of two group means, weighted by group size; 3: weighted mean of two group means, weighted by group sd; 4: weighted mean of two group means, weighted by group se; 5: mean of two group medians; 6: weighted mean of two group medians, weighted by group size; 7: weighted mean of two group medians, weighted by group IQR; 8: weighted mean of two group medians, weighted by group IQR and size |
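To make the difference between the mean-based and median-based split rules concrete, the base R sketch below computes the cutoff for rule 1 and rule 5 on projected 1-dimensional data. The values are hypothetical, and only the two unweighted rules are shown because the exact weighting formulas are not given in this documentation.

```r
# Projected 1-dimensional data for two groups (hypothetical values);
# the right group contains one outlier
left  <- c(1, 2, 3)
right <- c(7, 8, 9, 40)

# Rule 1: mean of the two group means
cutoff_rule1 <- (mean(left) + mean(right)) / 2

# Rule 5: mean of the two group medians
cutoff_rule5 <- (median(left) + median(right)) / 2
```

Here the outlier pulls the rule 1 cutoff up to 9, so the right-group values 7 and 8 fall below it, while the rule 5 cutoff of 5.25 still separates the two groups. This is one reason a median-based rule may be preferable with skewed projected data.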
Value
An object of class PPforest
with components.
prediction.training |
predicted values for training data set. |
training.error |
error of the training data set. |
prediction.test |
predicted values for the test data set if |
error.test |
error of the test data set if |
oob.error.forest |
out of bag error in the forest. |
oob.error.tree |
out of bag error for each tree in the forest. |
boot.samp |
information on the bootstrap samples. |
output.trees |
output from a |
proximity |
Proximity matrix, if two cases are classified in the same terminal node then the proximity matrix is increased by one in |
votes |
a matrix with one row for each input data point and one column for each class, giving the fraction of (OOB) votes from the |
n.tree |
number of trees grown in PPforest. |
n.var |
number of predictor variables selected to use for splitting at each node. |
type |
classification. |
confusion |
confusion matrix of the prediction (based on OOB data). |
call |
the original call to PPforest. |
train |
is the training data based on size.tr. |
test |
is the test data based on size.tr. |
References
da Silva, N., Cook, D., & Lee, E. K. (2021). A projection pursuit forest algorithm for supervised classification. Journal of Computational and Graphical Statistics, 30(4), 1168-1180.
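The votes, oob.error.forest and confusion components can be pictured with a small base R sketch. It assumes a hypothetical matrix of per-tree predictions and a matching in-bag indicator; these inputs and all object names are illustrative and do not reflect how the package stores its results.

```r
# Hypothetical inputs: 3 trees, 4 observations
# pred[b, i]  = class predicted by tree b for observation i
# inbag[b, i] = TRUE if observation i was in the bootstrap sample of tree b
classes <- c("A", "B")
pred <- matrix(c("A", "B", "A", "B",
                 "A", "B", "B", "B",
                 "B", "B", "B", "B"), nrow = 3, byrow = TRUE)
inbag <- matrix(c(TRUE,  FALSE, TRUE,  FALSE,
                  FALSE, FALSE, TRUE,  TRUE,
                  TRUE,  TRUE,  FALSE, FALSE), nrow = 3, byrow = TRUE)
true_class <- c("A", "B", "A", "B")

# Fraction of OOB votes per observation (rows) and class (columns);
# every observation here is out of bag for at least one tree
votes <- t(sapply(seq_len(ncol(pred)), function(i) {
  oob_pred <- pred[!inbag[, i], i]
  as.vector(table(factor(oob_pred, levels = classes))) / length(oob_pred)
}))
colnames(votes) <- classes

# OOB prediction by majority vote, OOB error, and confusion matrix
oob_class <- classes[max.col(votes, ties.method = "first")]
oob_error <- mean(oob_class != true_class)
confusion <- table(observed = true_class, predicted = oob_class)
```

Each row of votes sums to one, and only trees that did not see an observation contribute to its row. With too few trees, some observations may never be out of bag, which is why m should not be too small.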
Examples
#crab example with 80% of the observations used as training
set.seed(123)
pprf.crab <- PPforest(data = crab, y = 'Type',
std = 'no', size.tr = 0.8, m = 100, size.p = 1,
PPmethod = 'LDA' , parallel = TRUE, cores = 2, rule = 1)
pprf.crab
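The proximity component counts how often two cases end up in the same terminal node. A minimal base R sketch of that counting step follows, assuming a hypothetical matrix of terminal node ids per tree; the normalization at the end is an assumption and may not match what the package returns.

```r
# Hypothetical terminal node ids: node[b, i] is the terminal node that
# observation i reaches in tree b (3 trees, 4 observations)
node <- matrix(c(1, 1, 2, 2,
                 1, 2, 2, 2,
                 3, 3, 3, 4), nrow = 3, byrow = TRUE)

n_obs <- ncol(node)
proximity <- matrix(0, n_obs, n_obs)
for (b in seq_len(nrow(node))) {
  # TRUE where two observations share a terminal node in tree b
  same_node <- outer(node[b, ], node[b, ], "==")
  proximity <- proximity + same_node
}

# Optional normalization by the number of trees (an assumption here)
proximity_scaled <- proximity / nrow(node)
```

The resulting matrix is symmetric, and its diagonal equals the number of trees because every case shares a node with itself.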
Projection pursuit classification tree with random variable selection in each split
Description
Find tree structure using projection pursuit indices of classification in each split.
Usage
PPtree_split(form, data, PPmethod = 'LDA',
size.p = 1, lambda = 0.1, ...)
Arguments
form |
A character string identifying the class variable, given as a formula string in the example below ('Type~.'). |
data |
Data frame with the complete data set. |
PPmethod |
index to use for projection pursuit: 'LDA' or 'PDA' |
size.p |
proportion of variables randomly sampled in each split; the default is 1, which returns a PPtree. |
lambda |
penalty parameter in the PDA index; it lies between 0 and 1. If |
... |
arguments to be passed to methods |
Value
An object of class PPtreeclass
with components
Tree.Struct |
Tree structure of projection pursuit classification tree |
projbest.node |
1-dim optimal projections of each split node |
splitCutoff.node |
cutoff values of each split node |
origclass_num |
original class numeric |
origdata |
original data |
References
Lee, YD, Cook, D., Park JW, and Lee, EK (2013) PPtree: Projection pursuit classification tree, Electronic Journal of Statistics, 7:1369-1386.
Examples
#crab data set
Tree.crab <- PPtree_split('Type~.', data = crab, PPmethod = 'LDA', size.p = 0.5)
Tree.crab
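The effect of size.p can be sketched in base R as drawing a random subset of the predictors before a split is computed. The variable names come from the crab data documented in this manual; the rounding rule and the object names are assumptions made for illustration, and the package may round differently.

```r
set.seed(1)
vars <- c("FL", "RW", "CL", "CW", "BD")  # predictor names from the crab data
size.p <- 0.5

# Number of variables to sample at a split; rounding up and keeping at
# least one variable are assumptions made for this sketch
n_sample <- max(1, ceiling(size.p * length(vars)))

# Draw the variables without replacement for one split
split_vars <- sample(vars, n_sample)
```

With size.p = 1 every predictor is kept at every split, so the random variable selection has no effect.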
For each bootstrap sample grow a projection pursuit tree (PPtree object).
Description
For each bootstrap sample grow a projection pursuit tree (PPtree object).
Usage
baggtree(
data,
y,
m = 500,
PPmethod = "LDA",
lambda = 0.1,
size.p = 1,
parallel = FALSE,
cores = 2
)
Arguments
data |
Data frame with the complete data set. |
y |
A character with the name of the y variable. |
m |
number of bootstrap replicates, which corresponds to the number of trees to grow. To ensure that each observation is predicted a few times, this number should not be too small. |
PPmethod |
projection pursuit index to be optimized, either LDA or PDA; the default is LDA. |
lambda |
a parameter for PDA index |
size.p |
proportion of variables randomly sampled in each split; if size.p = 1 the procedure is bagging, and if size.p < 1 it is a forest. |
parallel |
logical condition, if it is TRUE then parallelize the function |
cores |
number of cores used in the parallelization |
Value
data frame with trees_pp output for all the bootstrap samples.
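The bootstrap step behind m can be sketched in base R as follows. This uses a plain bootstrap over row indices with hypothetical sizes; whether the package stratifies the resampling by class is not stated here, so treat the sampling scheme as an assumption.

```r
set.seed(123)
n <- 10   # number of observations
m <- 5    # number of bootstrap replicates (trees)

# One row of in-bag indices per bootstrap replicate, drawn with replacement
boot_index <- t(replicate(m, sample(seq_len(n), n, replace = TRUE)))

# Observations never drawn in a replicate are out of bag for that tree
oob_index <- lapply(seq_len(m), function(b) setdiff(seq_len(n), boot_index[b, ]))

# How many trees can predict each observation out of bag
oob_count <- tabulate(unlist(oob_index), nbins = n)
```

Inspecting oob_count for small m shows why the number of trees should not be too small: some observations may be out of bag for very few trees, or for none.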
Australian crabs
Description
Australian crabs
Usage
data(crab)
Format
Measurements on rock crabs of the genus Leptograpsus. The data set contains 200 observations from two species of crab (blue and orange), with 50 specimens of each sex of each species, collected on site at Fremantle, Western Australia.
- Type
is the class variable and has 4 classes with the combinations of species and sex (BlueMale, BlueFemale, OrangeMale and OrangeFemale).
- FL
the size of the frontal lobe length, in mm.
- RW
rear width, in mm
- CL
length of midline of the carapace, in mm.
- CW
maximum width of carapace, in mm.
- BD
depth of the body; for females, measured after displacement of the abdomen, in mm.
A data frame with 200 rows and 6 variables
Source
Campbell, N. A. & Mahon, R. J. (1974), A Multivariate Study of Variation in Two Species of Rock Crab of genus Leptograpsus, Australian Journal of Zoology 22(3), 417 - 425.
Fish catch data set
Description
Fish catch data set
Usage
data(fishcatch)
Format
159 fish of 7 species were caught and measured. Altogether there are 7 variables. All the fish were caught from the same lake (Laengelmavesi) near Tampere in Finland.
- Type
has 7 fish classes, with 35 cases of Bream, 11 cases of Parkki, 56 cases of Perch, 17 cases of Pike, 20 cases of Roach, 14 cases of Smelt and 6 cases of Whitewish.
- weight
Weight of the fish (in grams).
- length1
Length from the nose to the beginning of the tail (in cm).
- length2
Length from the nose to the notch of the tail (in cm).
- length3
Length from the nose to the end of the tail (in cm).
- height
Maximal height as % of length3.
- width
Maximal width as % of length3.
A data frame with 159 rows and 7 variables
Source
<http://www.amstat.org/publications/jse/jse_data_archive.htm>
Glass data set
Description
Glass data set
Usage
data(glass)
Format
Contains measurements on 214 observations of 6 types of glass, defined in terms of their oxide content.
- Type
has 6 types of glass.
- X1
refractive index.
- X2
Sodium (unit measurement: weight percent in corresponding oxide).
- X3
Magnesium.
- X4
Aluminum.
- X5
Silicon.
- X6
Potassium.
- X7
Calcium.
- X8
Barium.
- X9
Iron.
A data frame with 214 rows and 10 variables
The image data set
Description
The image data set
Usage
data(image)
Format
Contains 2310 observations of instances from 7 outdoor images.
- Type
has 7 types of outdoor images, brickface, cement, foliage, grass, path, sky, and window.
- X1
the column of the center pixel of the region.
- X2
the row of the center pixel of the region.
- X3
the number of pixels in a region = 9.
- X4
the results of a line extraction algorithm that counts how many lines of length 5 (any orientation) with low contrast, less than or equal to 5, go through the region.
- X5
measure the contrast of horizontally adjacent pixels in the region. There are 6, the mean and standard deviation are given. This attribute is used as a vertical edge detector.
- X6
standard deviation of X5.
- X7
measures the contrast of vertically adjacent pixels. Used for horizontal line detection.
- X8
standard deviation of X7.
- X9
the average over the region of (R + G + B)/3
- X10
the average over the region of the R value.
- X11
the average over the region of the B value.
- X12
the average over the region of the G value.
- X13
measure the excess red: (2R - (G + B)).
- X14
measure the excess blue: (2B - (G + R)).
- X15
measure the excess green: (2G - (R + B)).
- X16
3-d nonlinear transformation of RGB. (Algorithm can be found in Foley and VanDam, Fundamentals of Interactive Computer Graphics).
- X17
mean of X16.
- X18
hue mean.
A data frame with 2310 rows and 19 variables
Leukemia data set
Description
Leukemia data set
Usage
data(leukemia)
Format
This dataset comes from a study of gene expression in two types of acute leukemia, acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML). Gene expression levels were measured using Affymetrix high density oligonucleotide arrays containing 6817 human genes. The data set contains 72 observations from 3 leukemia classes.
- Type
has 3 classes with 38 cases of B-cell ALL, 25 cases of AML and 9 cases of T-cell ALL.
- Gene1, Gene2, ..., Gene40
gene expression levels.
A data frame with 72 rows and 41 variables
Source
Dudoit, S., Fridlyand, J. and Speed, T. P. (2002). Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data. Journal of the American Statistical Association, 97, 77-87.
Lymphoma data set
Description
Lymphoma data set
Usage
data(lymphoma)
Format
Gene expression in the three most prevalent adult lymphoid malignancies: B-cell chronic lymphocytic leukemia (B-CLL), follicular lymphoma (FL), and diffuse large B-cell lymphoma (DLBCL). Gene expression levels were measured using a specialized cDNA microarray, the Lymphochip, containing genes that are preferentially expressed in lymphoid cells or that are of known immunologic or oncologic importance. This data set contains 80 observations from 3 lymphoma types.
- Type
Class variable with 3 classes: 29 cases of B-cell chronic lymphocytic leukemia (B-CLL), 42 cases of diffuse large B-cell lymphoma (DLBCL) and 9 cases of follicular lymphoma (FL).
- Gene1, Gene2, ..., Gene50
gene expression.
A data frame with 80 rows and 51 variables
Source
Dudoit, S., Fridlyand, J. and Speed, T. P. (2002). Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data. Journal of the American Statistical Association, 97, 77-87.
Data structure with the projected data and boundaries by node and class.
Description
Data structure with the projected data and boundaries by node and class.
Usage
node_data(ppf, tr, Rule = 1)
Arguments
ppf |
is a PPforest object |
tr |
numerical value to identify a tree |
Rule |
split rule. 1: mean of two group means; 2: weighted mean; 3: mean of max(left group) and min(right group); 4: weighted mean of max(left group) and min(right group) |
Value
Data frame with projected data for each class and node id and the boundaries
Examples
#crab data set with all the observations used as training
pprf.crab <- PPforest(data = crab, std = 'min-max', y = 'Type',
size.tr = 1, m = 200, size.p = .5, PPmethod = 'LDA')
node_data(ppf = pprf.crab, tr = 1)
The olive data set
Description
The olive data set
Usage
data(olive)
Format
Contains 572 observations and 10 variables
- Region
Three super-classes of Italy: North, South and the island of Sardinia.
- area
Nine collection areas: three from North, four from South and two from Sardinia.
- palmitic
fatty acids percent x 100.
- palmitoleic
fatty acids percent x 100.
- stearic
fatty acids percent x 100.
- oleic
fatty acids percent x 100.
- linoleic
fatty acids percent x 100.
- linolenic
fatty acids percent x 100.
- arachidic
fatty acids percent x 100.
- eicosenoic
fatty acids percent x 100.
A data frame with 572 rows and 10 variables
Parkinson data set
Description
Parkinson data set
Usage
data(parkinson)
Format
A data set containing 195 observations from 2 classes (healthy and Parkinson).
- Type
Class variable has 2 classes, there are 48 cases of healthy people and 147 cases with Parkinson. The feature variables are biomedical voice measures.
- X1
Average vocal fundamental frequency.
- X2
Maximum vocal fundamental frequency.
- X3
Minimum vocal fundamental frequency.
- X4
MDVP:Jitter(%) measures of variation in fundamental frequency.
- X5
MDVP:Jitter(Abs) measures of variation in fundamental frequency.
- X6
MDVP:RAP measures of variation in fundamental frequency.
- X7
MDVP:PPQ measures of variation in fundamental frequency.
- X8
Jitter:DDP measures of variation in fundamental frequency.
- X9
MDVP:Shimmer measures of variation in amplitude.
- X10
MDVP:Shimmer(dB) measures of variation in amplitude.
- X11
Shimmer:APQ3 measures of variation in amplitude.
- X12
Shimmer:APQ5 measures of variation in amplitude.
- X13
MDVP:APQ measures of variation in amplitude.
- X14
Shimmer:DDA measures of variation in amplitude.
- X15
NHR measures of ratio of noise to tonal components in the voice.
- X16
HNR measures of ratio of noise to tonal components in the voice.
- X17
RPDE nonlinear dynamical complexity measures.
- X18
D2 nonlinear dynamical complexity measures.
- X19
DFA - Signal fractal scaling exponent.
- X20
spread1 Nonlinear measures of fundamental frequency variation.
- X21
spread2 Nonlinear measures of fundamental frequency variation.
- X22
PPE Nonlinear measures of fundamental frequency variation.
A data frame with 195 rows and 23 variables
Source
<https://archive.ics.uci.edu/ml/datasets/Parkinsons>
Obtain the permuted variable importance measure
Description
Obtain the permuted variable importance measure
Usage
permute_importance(ppf)
Arguments
ppf |
is a PPforest object |
Value
A data frame with permuted importance measures: imp is the permuted importance measure defined in Breiman's paper and imp2 is the permuted importance measure defined in the randomForest package. The standard deviation of each measure (sd.im and sd.imp2) and the standardized measure are also computed.
Examples
pprf.crab <- PPforest(data = crab, y = 'Type',
std = 'min-max', size.tr = 1, m = 100, size.p = .4, PPmethod = 'LDA', parallel = TRUE, cores = 2)
permute_importance(ppf = pprf.crab)
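The general idea of a permuted importance measure can be sketched in base R: permute one variable, re-predict, and record the drop in accuracy. The sketch uses simulated data and a stand-in threshold classifier, both hypothetical; in a forest the drop would typically be computed on the out-of-bag cases of each tree and then averaged, and the exact definitions of imp and imp2 are not reproduced here.

```r
set.seed(42)
n <- 100
# x1 determines the class, x2 is pure noise (hypothetical data)
dat <- data.frame(x1 = c(rnorm(n / 2, mean = -3), rnorm(n / 2, mean = 3)),
                  x2 = rnorm(n))
y <- ifelse(dat$x1 < 0, "A", "B")

# Stand-in classifier that only looks at x1
classify <- function(d) ifelse(d$x1 < 0, "A", "B")

accuracy <- function(d) mean(classify(d) == y)
base_acc <- accuracy(dat)

# Importance of a variable: drop in accuracy after permuting its values
perm_importance <- sapply(names(dat), function(v) {
  permuted <- dat
  permuted[[v]] <- sample(permuted[[v]])
  base_acc - accuracy(permuted)
})
```

Because the stand-in classifier ignores x2, permuting x2 leaves the accuracy unchanged and its importance is exactly zero, while permuting x1 can only lower the accuracy.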
Global importance measure for a PPforest object as the average IMP PPtree measure over all the trees in the forest
Description
Global importance measure for a PPforest object as the average IMP PPtree measure over all the trees in the forest
Usage
ppf_avg_imp(ppf, y)
Arguments
ppf |
is a PPforest object |
y |
A character with the name of the class variable. |
Value
Data frame with the global importance measure
References
da Silva, N., Cook, D., & Lee, E. K. (2021). A projection pursuit forest algorithm for supervised classification. Journal of Computational and Graphical Statistics, 30(4), 1168-1180.
Examples
#crab data set with all the observations used as training
pprf.crab <- PPforest(data = crab, std = 'min-max', y = 'Type',
size.tr = 1, m = 100, size.p = .5, PPmethod = 'LDA')
ppf_avg_imp(pprf.crab, 'Type')
Global importance measure for a PPforest object
Description
Global importance measure for a PPforest object
Usage
ppf_global_imp(data, y, ppf)
Arguments
data |
Data frame with the complete data set. |
y |
A character with the name of the class variable. |
ppf |
is a PPforest object |
Value
Data frame with the global importance measure
References
da Silva, N., Cook, D., & Lee, E. K. (2021). A projection pursuit forest algorithm for supervised classification. Journal of Computational and Graphical Statistics, 30(4), 1168-1180.
Examples
#crab data set with all the observations used as training
pprf.crab <- PPforest(data = crab, y = 'Type',
std = 'no', size.tr = 1, m = 200, size.p = .5,
PPmethod = 'LDA', parallel = TRUE, cores = 2)
ppf_global_imp(data = crab, y = 'Type', pprf.crab)
Predict method for PPforest objects
Description
Predict method for PPforest objects
Usage
## S3 method for class 'PPforest'
predict(object, newdata, rule = 1, parallel = TRUE, cores = 2, ...)
Arguments
object |
A fitted PPforest object |
newdata |
A data frame with predictors (same structure as training data) |
rule |
Split rule used in classification (integer from 1 to 8). 1: mean of two group means; 2: weighted mean of two group means, weighted by group size; 3: weighted mean of two group means, weighted by group sd; 4: weighted mean of two group means, weighted by group se; 5: mean of two group medians; 6: weighted mean of two group medians, weighted by group size; 7: weighted mean of two group medians, weighted by group IQR; 8: weighted mean of two group medians, weighted by group IQR and size |
parallel |
Logical, whether to use parallel processing |
cores |
Number of cores to use if parallel = TRUE |
... |
Additional arguments (ignored) |
Value
A list with:
- predtree
Matrix with individual tree predictions
- predforest
Final predicted classes based on majority vote
Examples
## Not run:
set.seed(123)
train <- sample(1:nrow(crab), nrow(crab)*.7)
crab_train <- data.frame(crab[train, ])
crab_test <- data.frame(crab[-train, ])
# if you split your data into training and test sets outside PPforest, size.tr should be 1.
pprf.crab <- PPforest(data = crab_train, y = 'Type',
std = 'scale', size.tr = 1, m = 200, size.p = .4, PPmethod = 'LDA', parallel = TRUE )
pred <- predict(pprf.crab, newdata = crab_test[,-1], parallel = TRUE)
## End(Not run)
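The relation between predtree and predforest can be pictured with a few lines of base R. The matrix below is hypothetical, and its orientation (one row per tree, one column per new case) is an assumption made for the sketch.

```r
# Hypothetical predtree matrix: one row per tree, one column per new case
predtree <- matrix(c("A", "B", "B",
                     "A", "B", "A",
                     "B", "B", "A"), nrow = 3, byrow = TRUE)

# Majority vote per column gives the forest prediction
predforest <- apply(predtree, 2, function(p) names(which.max(table(p))))
```

With an odd number of trees and two classes no ties can occur; with more classes a tie-breaking rule would be needed, and which.max simply takes the first class in alphabetical order.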
Print PPforest object
Description
Print PPforest object
Usage
## S3 method for class 'PPforest'
print(x, ...)
Arguments
x |
is a PPforest class object |
... |
additional parameter |
Value
printed results for PPforest object
Data structure with the projected data and boundaries by node and class.
Description
Data structure with the projected data and boundaries by node and class.
Usage
ternary_str(ppf, id, sp, dx, dy)
Arguments
ppf |
is a PPforest object |
id |
is a vector with the selected projection directions |
sp |
is the simplex dimensions, if k is the number of classes sp = k - 1 |
dx |
first direction included in id |
dy |
second direction included in id |
Value
Data frame needed to visualize a ternary plot
References
da Silva, N., Cook, D. & Lee, EK. Interactive graphics for visually diagnosing forest classifiers in R. Comput Stat 40, 3105–3125 (2025). https://doi.org/10.1007/s00180-023-01323-x
Examples
#crab data set with all the observations used as training
pprf.crab <- PPforest(data = crab, std ='min-max', y = "Type",
size.tr = 1, m = 100, size.p = .5, PPmethod = 'LDA')
require(dplyr)
pl_ter <- function(dat, dx, dy ){
p1 <- dat[[1]] %>% dplyr::filter(pair %in% paste(dx, dy, sep = "-") ) %>%
dplyr::select(Class, x, y) %>%
ggplot2::ggplot(ggplot2::aes(x, y, color = Class)) +
ggplot2::geom_segment(data = dat[[2]], ggplot2::aes(x = x1, xend = x2,
y = y1, yend = y2), color = "black" ) +
ggplot2::geom_point(size = I(3), alpha = .5) +
ggplot2::labs(y = " ", x = " ") +
ggplot2::theme(legend.position = "none", aspect.ratio = 1) +
ggplot2::scale_colour_brewer(type = "qual", palette = "Dark2") +
ggplot2::labs(x = paste0("T", dx, " "), y = paste0("T", dy, " ")) +
ggplot2::theme(aspect.ratio = 1)
p1
}
#ternary plot in three different selected directions
pl_ter(ternary_str(pprf.crab, id = c(1, 2, 3), sp = 3, dx = 1, dy = 2), 1, 2 )
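The relation sp = k - 1 reflects that vote proportions for k classes sum to one and therefore lie on a (k - 1)-dimensional simplex. For the simplest case of k = 3, the base R sketch below maps hypothetical vote proportions to the usual 2-dimensional ternary coordinates. This is the standard textbook mapping, not necessarily the projection that ternary_str uses.

```r
# Hypothetical vote proportions for 3 classes (rows sum to 1)
votes <- rbind(c(1, 0, 0),
               c(0, 1, 0),
               c(0, 0, 1),
               c(1, 1, 1) / 3)

# Standard ternary coordinates: the k = 3 proportions live on a
# (k - 1) = 2 dimensional simplex, drawn as an equilateral triangle
ternary_xy <- cbind(x = votes[, 2] + votes[, 3] / 2,
                    y = sqrt(3) / 2 * votes[, 3])
```

The three pure-vote rows land on the corners of the triangle and the evenly split row lands at its center. With 4 classes, as in the crab example, the simplex is 3-dimensional, which is why pairs of directions (dx, dy) are plotted.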
Obtain predicted class for new data from baggtree function or PPforest
Description
Obtain predicted class for new data from baggtree function or PPforest
Usage
trees_pred(object, xnew, parallel = FALSE, cores = 2, rule = 1)
Arguments
object |
Projection pursuit classification forest structure from PPforest or baggtree |
xnew |
data frame with explanatory variables used to get new predicted values. |
parallel |
logical condition, if it is TRUE then parallelize the function |
cores |
number of cores used in the parallelization |
rule |
Split rule used in classification (integer from 1 to 8). 1: mean of two group means; 2: weighted mean of two group means, weighted by group size; 3: weighted mean of two group means, weighted by group sd; 4: weighted mean of two group means, weighted by group se; 5: mean of two group medians; 6: weighted mean of two group medians, weighted by group size; 7: weighted mean of two group medians, weighted by group IQR; 8: weighted mean of two group medians, weighted by group IQR and size |
Value
predicted values from PPforest or baggtree
Wine data set
Description
Wine data set
Usage
data(wine)
Format
A data set containing 178 observations from 3 wine cultivars grown in Italy.
- Type
Class variable with 3 classes, corresponding to 3 different wine cultivars grown in Italy.
- X1 to X13
Check vbles
A data frame with 178 rows and 14 variables