Type: Package
Title: R Interface for the RAPIDS cuML Suite of Libraries
Version: 0.3.2
Maintainer: Daniel Falbel <daniel@rstudio.com>
Description: R interface for RAPIDS cuML (https://github.com/rapidsai/cuml), a suite of GPU-accelerated machine learning libraries powered by CUDA (https://en.wikipedia.org/wiki/CUDA).
License: MIT + file LICENSE
URL: https://mlverse.github.io/cuda.ml/
BugReports: https://github.com/mlverse/cuda.ml/issues
Depends: R (≥ 3.2)
Imports: ellipsis, hardhat, parsnip, Rcpp (≥ 1.0.6), rlang (≥ 0.1.4)
Suggests: callr, glmnet, MASS, magrittr, mlbench, purrr, reticulate, testthat, xgboost
LinkingTo: Rcpp
Encoding: UTF-8
RoxygenNote: 7.1.2
OS_type: unix
SystemRequirements: RAPIDS cuML (see https://rapids.ai/start.html)
NeedsCompilation: yes
Packaged: 2022-01-07 22:00:35 UTC; yitaoli
Author: Yitao Li
Repository: CRAN
Date/Publication: 2022-01-08 01:42:47 UTC
Get the major version of the RAPIDS cuML shared library cuda.ml was linked to.
Description
Get the major version of the RAPIDS cuML shared library cuda.ml was linked to.
Usage
cuML_major_version()
Value
The major version of the RAPIDS cuML shared library cuda.ml was linked to, as a character vector, or NA_character_ if cuda.ml was not linked to any version of RAPIDS cuML.
Examples
library(cuda.ml)
print(cuML_major_version())
Get the minor version of the RAPIDS cuML shared library cuda.ml was linked to.
Description
Get the minor version of the RAPIDS cuML shared library cuda.ml was linked to.
Usage
cuML_minor_version()
Value
The minor version of the RAPIDS cuML shared library cuda.ml was linked to, as a character vector, or NA_character_ if cuda.ml was not linked to any version of RAPIDS cuML.
Examples
library(cuda.ml)
print(cuML_minor_version())
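# A small follow-up sketch combining both accessors: verify which cuML
# version cuda.ml was linked to before running GPU-dependent code.
if (!is.na(cuML_major_version())) {
  message("Linked to RAPIDS cuML ", cuML_major_version(), ".", cuML_minor_version())
}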
cuda.ml
Description
This package provides an R interface for the RAPIDS cuML library.
Author(s)
Yitao Li <yitao@rstudio.com>
Perform Single-Linkage Agglomerative Clustering.
Description
Recursively merge the pair of clusters that minimally increases a given linkage distance.
Usage
cuda_ml_agglomerative_clustering(
x,
n_clusters = 2L,
metric = c("euclidean", "l1", "l2", "manhattan", "cosine"),
connectivity = c("pairwise", "knn"),
n_neighbors = 15L
)
Arguments
x |
The input matrix or dataframe. Each data point should be a row and should consist of numeric values only. |
n_clusters |
The number of clusters to find. Default: 2L. |
metric |
Metric used for linkage computation. Must be one of "euclidean", "l1", "l2", "manhattan", "cosine". If connectivity is "knn" then only "euclidean" is accepted. Default: "euclidean". |
connectivity |
The type of connectivity matrix to compute. Must be one of "pairwise", "knn". Default: "pairwise".
- "pairwise": compute the entire fully-connected graph of pairwise distances between each set of points. This is the fastest option for smaller datasets but requires O(n^2) space.
- "knn": sparsify the fully-connected connectivity matrix to save memory and enable much larger inputs. n_neighbors controls the amount of memory used, and the graph will be connected automatically in the event n_neighbors was not large enough to connect it. |
n_neighbors |
The number of neighbors to compute when connectivity = "knn". Default: 15L. |
Value
A clustering object with the following attributes:
"n_clusters": The number of clusters found by the algorithm.
"children": The children of each non-leaf node. Values less than
nrow(x)
correspond to leaves of the tree which are the original
samples. children[i + 1][1]
and children[i + 1][2]
were
merged to form node (nrow(x) + i)
in the i
-th iteration.
"labels": cluster label of each data point.
Examples
library(cuda.ml)
library(MASS)
library(magrittr)
library(purrr)
set.seed(0L)
gen_pts <- function() {
centers <- list(c(1000, 1000), c(-1000, -1000), c(-1000, 1000))
pts <- centers %>%
map(~ mvrnorm(50, mu = .x, Sigma = diag(2)))
rlang::exec(rbind, !!!pts) %>% as.matrix()
}
clust <- cuda_ml_agglomerative_clustering(
x = gen_pts(),
metric = "euclidean",
n_clusters = 3L
)
print(clust$labels)
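# A CPU-based cross-check sketch, reusing `gen_pts()` from above: base R's
# single-linkage hclust() should recover the same three well-separated
# clusters (label numbering may differ).
pts <- gen_pts()
hc <- hclust(dist(pts), method = "single")
print(table(cutree(hc, k = 3)))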
Determine whether a CuML model can predict class probabilities.
Description
Given a trained CuML model, return TRUE if the model is a classifier and is capable of outputting class probabilities as prediction results (e.g., if the model is a KNN or an ensemble classifier); otherwise return FALSE.
Usage
cuda_ml_can_predict_class_probabilities(model)
Arguments
model |
A trained CuML model. |
Value
A logical value indicating whether the model supports outputting class probabilities.
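Examples
library(cuda.ml)
# A hedged usage sketch: gate probability output on this capability check.
# The `output_class_probabilities` argument name for predict() is an
# assumption here.
model <- cuda_ml_rand_forest(Species ~ ., iris, trees = 100)
if (cuda_ml_can_predict_class_probabilities(model)) {
  probs <- predict(
    model, iris[names(iris) != "Species"],
    output_class_probabilities = TRUE # assumed argument name
  )
}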
Run the DBSCAN clustering algorithm.
Description
Run the DBSCAN (Density-based spatial clustering of applications with noise) clustering algorithm.
Usage
cuda_ml_dbscan(
x,
min_pts,
eps,
cuML_log_level = c("off", "critical", "error", "warn", "info", "debug", "trace")
)
Arguments
x |
The input matrix or dataframe. Each data point should be a row and should consist of numeric values only. |
min_pts , eps |
A point 'p' is a core point if at least 'min_pts' are within distance 'eps' from it. |
cuML_log_level |
Log level within cuML library functions. Must be one of "off", "critical", "error", "warn", "info", "debug", "trace". Default: off. |
Value
A list containing the cluster assignments of all data points. A data point not belonging to any cluster (i.e., "noise") will have NA as its cluster assignment.
Examples
library(cuda.ml)
library(magrittr)
gen_pts <- function() {
centroids <- list(c(1000, 1000), c(-1000, -1000), c(-1000, 1000))
pts <- centroids %>%
purrr::map(~ MASS::mvrnorm(10, mu = .x, Sigma = diag(2)))
rlang::exec(rbind, !!!pts)
}
m <- gen_pts()
clusters <- cuda_ml_dbscan(m, min_pts = 5, eps = 3)
print(clusters)
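# A hedged follow-up sketch: a point far from all three centroids should be
# reported as noise (NA cluster assignment).
clusters_with_noise <- cuda_ml_dbscan(rbind(m, c(0, 0)), min_pts = 5, eps = 3)
print(clusters_with_noise)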
Train a linear model using elastic regression.
Description
Train a linear model with combined L1 and L2 priors as the regularizer.
Usage
cuda_ml_elastic_net(x, ...)
## Default S3 method:
cuda_ml_elastic_net(x, ...)
## S3 method for class 'data.frame'
cuda_ml_elastic_net(
x,
y,
alpha = 1,
l1_ratio = 0.5,
max_iter = 1000L,
tol = 0.001,
fit_intercept = TRUE,
normalize_input = FALSE,
selection = c("cyclic", "random"),
...
)
## S3 method for class 'matrix'
cuda_ml_elastic_net(
x,
y,
alpha = 1,
l1_ratio = 0.5,
max_iter = 1000L,
tol = 0.001,
fit_intercept = TRUE,
normalize_input = FALSE,
selection = c("cyclic", "random"),
...
)
## S3 method for class 'formula'
cuda_ml_elastic_net(
formula,
data,
alpha = 1,
l1_ratio = 0.5,
max_iter = 1000L,
tol = 0.001,
fit_intercept = TRUE,
normalize_input = FALSE,
selection = c("cyclic", "random"),
...
)
## S3 method for class 'recipe'
cuda_ml_elastic_net(
x,
data,
alpha = 1,
l1_ratio = 0.5,
max_iter = 1000L,
tol = 0.001,
fit_intercept = TRUE,
normalize_input = FALSE,
selection = c("cyclic", "random"),
...
)
Arguments
x |
Depending on the context: * A __data frame__ of predictors. * A __matrix__ of predictors. * A __recipe__ specifying a set of preprocessing steps created from recipes::recipe(). * A __formula__ specifying the predictors and the outcome. |
... |
Optional arguments; currently unused. |
y |
A numeric vector (for regression) or factor (for classification) of desired responses. |
alpha |
Multiplier of the penalty term (i.e., the result would become an Ordinary Least Squares model if alpha = 0). Default: 1. |
l1_ratio |
The ElasticNet mixing parameter, with 0 <= l1_ratio <= 1. For l1_ratio = 0 the penalty is an L2 penalty. For l1_ratio = 1 it is an L1 penalty. For 0 < l1_ratio < 1, the penalty is a combination of L1 and L2. The penalty term is computed using the following formula: penalty = alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2, where ||w||_1 and ||w||_2 are the L1 and L2 norms of the coefficients. Default: 0.5. |
max_iter |
The maximum number of coordinate descent iterations. Default: 1000L. |
tol |
Stop the coordinate descent when the duality gap is below this threshold. Default: 1e-3. |
fit_intercept |
If TRUE, then the model tries to correct for the global mean of the response variable. If FALSE, then the model expects data to be centered. Default: TRUE. |
normalize_input |
Ignored when fit_intercept is FALSE. If TRUE, then the predictors will be normalized before the regression. Default: FALSE. |
selection |
If "random", then instead of updating coefficients in cyclic order, a random coefficient is updated in each iteration. Default: "cyclic". |
formula |
A formula specifying the outcome terms on the left-hand side, and the predictor terms on the right-hand side. |
data |
When a __recipe__ or __formula__ is used, data is specified as a __data frame__ containing the predictors and (if applicable) the outcome. |
Value
An elastic net regressor that can be used with the 'predict' S3 generic to make predictions on new data points.
Examples
library(cuda.ml)
model <- cuda_ml_elastic_net(
formula = mpg ~ ., data = mtcars, alpha = 1e-3, l1_ratio = 0.6
)
cuda_ml_predictions <- predict(model, mtcars)
# predictions will be comparable to those from a `glmnet` model with `lambda`
# set to 1e-3 and `alpha` set to 0.6
# (in `glmnet`, `lambda` is the weight of the penalty term, and `alpha` is
# the elastic net mixing parameter between L1 and L2 penalties).
library(glmnet)
glmnet_model <- glmnet(
x = as.matrix(mtcars[names(mtcars) != "mpg"]), y = mtcars$mpg,
alpha = 0.6, lambda = 1e-3, nlambda = 1, standardize = FALSE
)
glm_predictions <- predict(
glmnet_model, as.matrix(mtcars[names(mtcars) != "mpg"]),
s = 0
)
print(
all.equal(
as.numeric(glm_predictions),
cuda_ml_predictions$.pred,
tolerance = 1e-2
)
)
Determine whether Forest Inference Library (FIL) functionalities are enabled in the current installation of cuda.ml.
Description
CuML Forest Inference Library (FIL) functionalities (see https://github.com/rapidsai/cuml/tree/main/python/cuml/fil#readme) require the Treelite C API. If you need FIL to run tree-based model ensembles on GPU, and cuda_ml_fil_enabled() returns FALSE, then please consider installing Treelite and then re-installing cuda.ml.
Usage
cuda_ml_fil_enabled()
Value
A logical value indicating whether the Forest Inference Library (FIL) functionalities are enabled.
Examples
if (cuda_ml_fil_enabled()) {
# run GPU-accelerated Forest Inference Library (FIL) functionalities
} else {
message(
"FIL functionalities are disabled in the current installation of ",
"{cuda.ml}. Please reinstall Treelite C library first, and then re-install",
" {cuda.ml} to enable FIL."
)
}
Load an XGBoost or LightGBM model file.
Description
Load an XGBoost or LightGBM model file using Treelite. The resulting model object can be used to perform high-throughput batch inference on new data points using the GPU acceleration functionality from the CuML Forest Inference Library (FIL).
Usage
cuda_ml_fil_load_model(
filename,
mode = c("classification", "regression"),
model_type = c("xgboost", "lightgbm"),
algo = c("auto", "naive", "tree_reorg", "batch_tree_reorg"),
threshold = 0.5,
storage_type = c("auto", "dense", "sparse"),
threads_per_tree = 1L,
n_items = 0L,
blocks_per_sm = 0L
)
Arguments
filename |
Path to the saved model file. |
mode |
Type of task to be performed by the model. Must be one of "classification", "regression". |
model_type |
Format of the saved model file. Must be one of "xgboost", "lightgbm". Default: "xgboost". |
algo |
Type of the algorithm for inference. Must be one of the following.
- "auto": choose the algorithm automatically. Currently "batch_tree_reorg" is used for dense storage, and "naive" for sparse storage.
- "naive": simple inference using shared memory.
- "tree_reorg": similar to "naive" but with trees rearranged to be more coalescing-friendly.
- "batch_tree_reorg": similar to "tree_reorg" but predicting multiple rows per thread block.
Default: "auto". |
threshold |
Class probability threshold for classification. Ignored for regression tasks. Default: 0.5. |
storage_type |
In-memory storage format of the FIL model. Must be one of the following.
- "auto": choose the storage type automatically.
- "dense": create a dense forest.
- "sparse": create a sparse forest; requires algo = "naive" or algo = "auto".
Default: "auto". |
threads_per_tree |
If > 1, then have multiple (neighboring) threads infer on the same tree within a block, which will improve memory bandwidth near the tree root (at the cost of consuming more shared memory). Default: 1L. |
n_items |
Number of input samples each thread processes. If 0, then choose (up to 4) that fit into shared memory. Default: 0L. |
blocks_per_sm |
Indicates how CuML should determine the number of thread blocks to launch for the inference kernel.
- 0: launch the number of blocks proportional to the number of data points.
- >= 1: attempt to launch blocks_per_sm blocks per streaming multiprocessor.
Default: 0L. |
Value
A GPU-accelerated FIL model that can be used with the 'predict' S3 generic to make predictions on new data points.
Examples
library(cuda.ml)
library(xgboost)
model_path <- file.path(tempdir(), "xgboost.model")
model <- xgboost(
data = as.matrix(mtcars[names(mtcars) != "mpg"]),
label = as.matrix(mtcars["mpg"]),
max.depth = 6,
eta = 1,
nthread = 2,
nrounds = 20,
objective = "reg:squarederror"
)
xgb.save(model, model_path)
model <- cuda_ml_fil_load_model(
model_path,
mode = "regression",
model_type = "xgboost"
)
preds <- predict(model, mtcars[names(mtcars) != "mpg"])
print(preds)
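# A hedged deployment sketch: guard the FIL code path with
# cuda_ml_fil_enabled() so the script degrades gracefully on installations
# built without Treelite.
if (cuda_ml_fil_enabled()) {
  fil_model <- cuda_ml_fil_load_model(
    model_path,
    mode = "regression", model_type = "xgboost"
  )
  print(predict(fil_model, mtcars[names(mtcars) != "mpg"]))
}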
Apply the inverse transformation defined by a trained cuML model.
Description
Given a trained cuML model, apply the inverse transformation defined by that model to an input dataset.
Usage
cuda_ml_inverse_transform(model, x, ...)
Arguments
model |
A model object. |
x |
The dataset to be transformed. |
... |
Additional model-specific parameters (if any). |
Value
The transformed data points.
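Examples
library(cuda.ml)
# A usage sketch, assuming the PCA model documented under cuda_ml_pca():
# map the principal-component representation back to the original feature
# space.
iris.pca <- cuda_ml_pca(iris[1:4], n_components = 2)
recovered <- cuda_ml_inverse_transform(iris.pca, iris.pca$transformed_data)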
Determine whether a CuML model is a classifier.
Description
Given a trained CuML model, return TRUE if the model is a classifier; otherwise return FALSE (e.g., if the model is a regressor).
Usage
cuda_ml_is_classifier(model)
Arguments
model |
A trained CuML model. |
Value
A logical value indicating whether the model is a classifier.
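Examples
library(cuda.ml)
# A quick illustration using models documented elsewhere in this manual:
# a random forest fitted to a factor outcome is a classifier, while an OLS
# model is not.
clf <- cuda_ml_rand_forest(Species ~ ., iris, trees = 10)
reg <- cuda_ml_ols(mpg ~ ., mtcars)
print(cuda_ml_is_classifier(clf)) # expected: TRUE
print(cuda_ml_is_classifier(reg)) # expected: FALSE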
Run the K means clustering algorithm.
Description
Run the K means clustering algorithm.
Usage
cuda_ml_kmeans(
x,
k,
max_iters = 300,
tol = 0,
init_method = c("kmeans++", "random"),
seed = 0L,
cuML_log_level = c("off", "critical", "error", "warn", "info", "debug", "trace")
)
Arguments
x |
The input matrix or dataframe. Each data point should be a row and should consist of numeric values only. |
k |
The number of clusters. |
max_iters |
Maximum number of iterations. Default: 300. |
tol |
Relative tolerance with regards to inertia to declare convergence. Default: 0 (i.e., do not use inertia-based stopping criterion). |
init_method |
Method for initializing the centroids. Valid methods include "kmeans++", "random", or a matrix of k rows, each row specifying the initial value of a centroid. Default: "kmeans++". |
seed |
Seed to the random number generator. Default: 0. |
cuML_log_level |
Log level within cuML library functions. Must be one of "off", "critical", "error", "warn", "info", "debug", "trace". Default: off. |
Value
A list containing the cluster assignments and the centroid of each cluster. Each centroid will be a column within the 'centroids' matrix.
Examples
library(cuda.ml)
kclust <- cuda_ml_kmeans(
iris[names(iris) != "Species"],
k = 3, max_iters = 100
)
print(kclust)
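# A CPU cross-check sketch: stats::kmeans() on the same data should find
# similar centroids (cluster numbering may differ between runs).
set.seed(0L)
print(kmeans(iris[names(iris) != "Species"], centers = 3, iter.max = 100)$centers)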
Build a KNN model.
Description
Build a k-nearest-neighbors model for classification or regression tasks.
Usage
cuda_ml_knn(x, ...)
## Default S3 method:
cuda_ml_knn(x, ...)
## S3 method for class 'data.frame'
cuda_ml_knn(
x,
y,
algo = c("brute", "ivfflat", "ivfpq", "ivfsq"),
metric = c("euclidean", "l2", "l1", "cityblock", "taxicab", "manhattan",
"braycurtis", "canberra", "minkowski", "chebyshev", "jensenshannon", "cosine",
"correlation"),
p = 2,
neighbors = 5L,
...
)
## S3 method for class 'matrix'
cuda_ml_knn(
x,
y,
algo = c("brute", "ivfflat", "ivfpq", "ivfsq"),
metric = c("euclidean", "l2", "l1", "cityblock", "taxicab", "manhattan",
"braycurtis", "canberra", "minkowski", "chebyshev", "jensenshannon", "cosine",
"correlation"),
p = 2,
neighbors = 5L,
...
)
## S3 method for class 'formula'
cuda_ml_knn(
formula,
data,
algo = c("brute", "ivfflat", "ivfpq", "ivfsq"),
metric = c("euclidean", "l2", "l1", "cityblock", "taxicab", "manhattan",
"braycurtis", "canberra", "minkowski", "chebyshev", "jensenshannon", "cosine",
"correlation"),
p = 2,
neighbors = 5L,
...
)
## S3 method for class 'recipe'
cuda_ml_knn(
x,
data,
algo = c("brute", "ivfflat", "ivfpq", "ivfsq"),
metric = c("euclidean", "l2", "l1", "cityblock", "taxicab", "manhattan",
"braycurtis", "canberra", "minkowski", "chebyshev", "jensenshannon", "cosine",
"correlation"),
p = 2,
neighbors = 5L,
...
)
Arguments
x |
Depending on the context: * A __data frame__ of predictors. * A __matrix__ of predictors. * A __recipe__ specifying a set of preprocessing steps created from recipes::recipe(). * A __formula__ specifying the predictors and the outcome. |
... |
Optional arguments; currently unused. |
y |
A numeric vector (for regression) or factor (for classification) of desired responses. |
algo |
The query algorithm to use. Must be one of "brute", "ivfflat", "ivfpq", "ivfsq", or a KNN algorithm specification constructed using one of the cuda_ml_knn_algo_* functions (e.g., cuda_ml_knn_algo_ivfflat()). Descriptions of supported algorithms:
- "brute": brute force; slow but produces exact results.
- "ivfflat": inverted file; divide the dataset into partitions and perform the search on relevant partitions only.
- "ivfpq": inverted file and product quantization (vectors are divided into sub-vectors, and each sub-vector is encoded using intermediary k-means clusterings to provide partial information).
- "ivfsq": inverted file and scalar quantization (vector components are quantized into a reduced binary representation allowing faster distance calculations).
Default: "brute". |
metric |
Distance metric to use. Must be one of "euclidean", "l2", "l1", "cityblock", "taxicab", "manhattan", "braycurtis", "canberra", "minkowski", "lp", "chebyshev", "linf", "jensenshannon", "cosine", "correlation". Default: "euclidean". |
p |
Parameter for the Minkowski metric. If p = 1, then the metric is equivalent to manhattan distance (l1). If p = 2, the metric is equivalent to euclidean distance (l2). |
neighbors |
Number of nearest neighbors to query. Default: 5L. |
formula |
A formula specifying the outcome terms on the left-hand side, and the predictor terms on the right-hand side. |
data |
When a __recipe__ or __formula__ is used, data is specified as a __data frame__ containing the predictors and (if applicable) the outcome. |
Value
A KNN model that can be used with the 'predict' S3 generic to make predictions on new data points. The model object contains the following:
- "knn_index": a GPU pointer to the KNN index.
- "algo": enum value of the algorithm being used for the KNN query.
- "metric": enum value of the distance metric used in KNN computations.
- "p": parameter for the Minkowski metric.
- "n_samples": number of input data points.
- "n_dims": dimension of each input data point.
Examples
library(cuda.ml)
library(MASS)
library(magrittr)
library(purrr)
set.seed(0L)
centers <- list(c(3, 3), c(-3, -3), c(-3, 3))
gen_pts <- function(cluster_sz) {
pts <- centers %>%
map(~ mvrnorm(cluster_sz, mu = .x, Sigma = diag(2)))
rlang::exec(rbind, !!!pts) %>% as.matrix()
}
gen_labels <- function(cluster_sz) {
seq_along(centers) %>%
sapply(function(x) rep(x, cluster_sz)) %>%
factor()
}
sample_cluster_sz <- 1000
sample_pts <- cbind(
gen_pts(sample_cluster_sz) %>% as.data.frame(),
label = gen_labels(sample_cluster_sz)
)
model <- cuda_ml_knn(label ~ ., sample_pts, algo = "ivfflat", metric = "euclidean")
test_cluster_sz <- 10
test_pts <- gen_pts(test_cluster_sz) %>% as.data.frame()
predictions <- predict(model, test_pts)
print(predictions, n = 30)
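# A sketch of passing an explicit algorithm specification instead of a
# string, using cuda_ml_knn_algo_ivfflat() (documented below); the `nlist`
# and `nprobe` values here are illustrative only.
model <- cuda_ml_knn(
  label ~ ., sample_pts,
  algo = cuda_ml_knn_algo_ivfflat(nlist = 8L, nprobe = 2L),
  metric = "euclidean"
)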
Build a specification for the "ivfflat" KNN query algorithm.
Description
Build a specification of the flat-inverted-file KNN query algorithm, with all required parameters specified explicitly.
Usage
cuda_ml_knn_algo_ivfflat(nlist, nprobe)
Arguments
nlist |
Number of cells to partition dataset into. |
nprobe |
At query time, the number of cells used for approximate nearest neighbor search. |
Value
An object encapsulating all required parameters of the "ivfflat" KNN query algorithm.
Build a specification for the "ivfpq" KNN query algorithm.
Description
Build a specification of the inverted-file-product-quantization KNN query algorithm, with all required parameters specified explicitly.
Usage
cuda_ml_knn_algo_ivfpq(
nlist,
nprobe,
m,
n_bits,
use_precomputed_tables = FALSE
)
Arguments
nlist |
Number of cells to partition dataset into. |
nprobe |
At query time, the number of cells used for approximate nearest neighbor search. |
m |
Number of subquantizers. |
n_bits |
Bits allocated per subquantizer. |
use_precomputed_tables |
Whether to use precomputed tables. |
Value
An object encapsulating all required parameters of the "ivfpq" KNN query algorithm.
Build a specification for the "ivfsq" KNN query algorithm.
Description
Build a specification of the inverted-file-scalar-quantization KNN query algorithm, with all required parameters specified explicitly.
Usage
cuda_ml_knn_algo_ivfsq(
nlist,
nprobe,
qtype = c("QT_8bit", "QT_4bit", "QT_8bit_uniform", "QT_4bit_uniform", "QT_fp16",
"QT_8bit_direct", "QT_6bit"),
encode_residual = FALSE
)
Arguments
nlist |
Number of cells to partition dataset into. |
nprobe |
At query time, the number of cells used for approximate nearest neighbor search. |
qtype |
Quantizer type. Must be one of "QT_8bit", "QT_4bit", "QT_8bit_uniform", "QT_4bit_uniform", "QT_fp16", "QT_8bit_direct", "QT_6bit". |
encode_residual |
Whether to encode residuals. |
Value
An object encapsulating all required parameters of the "ivfsq" KNN query algorithm.
Train a linear model using LASSO regression.
Description
Train a linear model using LASSO (Least Absolute Shrinkage and Selection Operator) regression.
Usage
cuda_ml_lasso(x, ...)
## Default S3 method:
cuda_ml_lasso(x, ...)
## S3 method for class 'data.frame'
cuda_ml_lasso(
x,
y,
alpha = 1,
max_iter = 1000L,
tol = 0.001,
fit_intercept = TRUE,
normalize_input = FALSE,
selection = c("cyclic", "random"),
...
)
## S3 method for class 'matrix'
cuda_ml_lasso(
x,
y,
alpha = 1,
max_iter = 1000L,
tol = 0.001,
fit_intercept = TRUE,
normalize_input = FALSE,
selection = c("cyclic", "random"),
...
)
## S3 method for class 'formula'
cuda_ml_lasso(
formula,
data,
alpha = 1,
max_iter = 1000L,
tol = 0.001,
fit_intercept = TRUE,
normalize_input = FALSE,
selection = c("cyclic", "random"),
...
)
## S3 method for class 'recipe'
cuda_ml_lasso(
x,
data,
alpha = 1,
max_iter = 1000L,
tol = 0.001,
fit_intercept = TRUE,
normalize_input = FALSE,
selection = c("cyclic", "random"),
...
)
Arguments
x |
Depending on the context: * A __data frame__ of predictors. * A __matrix__ of predictors. * A __recipe__ specifying a set of preprocessing steps created from recipes::recipe(). * A __formula__ specifying the predictors and the outcome. |
... |
Optional arguments; currently unused. |
y |
A numeric vector (for regression) or factor (for classification) of desired responses. |
alpha |
Multiplier of the L1 penalty term (i.e., the result would become an Ordinary Least Squares model if alpha = 0). Default: 1. |
max_iter |
The maximum number of coordinate descent iterations. Default: 1000L. |
tol |
Stop the coordinate descent when the duality gap is below this threshold. Default: 1e-3. |
fit_intercept |
If TRUE, then the model tries to correct for the global mean of the response variable. If FALSE, then the model expects data to be centered. Default: TRUE. |
normalize_input |
Ignored when fit_intercept is FALSE. If TRUE, then the predictors will be normalized before the regression. Default: FALSE. |
selection |
If "random", then instead of updating coefficients in cyclic order, a random coefficient is updated in each iteration. Default: "cyclic". |
formula |
A formula specifying the outcome terms on the left-hand side, and the predictor terms on the right-hand side. |
data |
When a __recipe__ or __formula__ is used, data is specified as a __data frame__ containing the predictors and (if applicable) the outcome. |
Value
A LASSO regressor that can be used with the 'predict' S3 generic to make predictions on new data points.
Examples
library(cuda.ml)
model <- cuda_ml_lasso(formula = mpg ~ ., data = mtcars, alpha = 1e-3)
cuda_ml_predictions <- predict(model, mtcars)
# predictions will be comparable to those from a `glmnet` model with `lambda`
# set to 1e-3 and `alpha` set to 1
# (in `glmnet`, `lambda` is the weight of the penalty term, and `alpha` is
# the elastic net mixing parameter between L1 and L2 penalties).
library(glmnet)
glmnet_model <- glmnet(
x = as.matrix(mtcars[names(mtcars) != "mpg"]), y = mtcars$mpg,
alpha = 1, lambda = 1e-3, nlambda = 1, standardize = FALSE
)
glm_predictions <- predict(
glmnet_model, as.matrix(mtcars[names(mtcars) != "mpg"]),
s = 0
)
print(
all.equal(
as.numeric(glm_predictions),
cuda_ml_predictions$.pred,
tolerance = 1e-2
)
)
Train a logistic regression model.
Description
Train a logistic regression model using Quasi-Newton (QN) algorithms (i.e., Orthant-Wise Limited Memory Quasi-Newton (OWL-QN) if there is L1 regularization, Limited Memory BFGS (L-BFGS) otherwise).
Usage
cuda_ml_logistic_reg(x, ...)
## Default S3 method:
cuda_ml_logistic_reg(x, ...)
## S3 method for class 'data.frame'
cuda_ml_logistic_reg(
x,
y,
fit_intercept = TRUE,
penalty = c("l2", "l1", "elasticnet", "none"),
tol = 1e-04,
C = 1,
class_weight = NULL,
sample_weight = NULL,
max_iters = 1000L,
linesearch_max_iters = 50L,
l1_ratio = NULL,
...
)
## S3 method for class 'matrix'
cuda_ml_logistic_reg(
x,
y,
fit_intercept = TRUE,
penalty = c("l2", "l1", "elasticnet", "none"),
tol = 1e-04,
C = 1,
class_weight = NULL,
sample_weight = NULL,
max_iters = 1000L,
linesearch_max_iters = 50L,
l1_ratio = NULL,
...
)
## S3 method for class 'formula'
cuda_ml_logistic_reg(
formula,
data,
fit_intercept = TRUE,
penalty = c("l2", "l1", "elasticnet", "none"),
tol = 1e-04,
C = 1,
class_weight = NULL,
sample_weight = NULL,
max_iters = 1000L,
linesearch_max_iters = 50L,
l1_ratio = NULL,
...
)
## S3 method for class 'recipe'
cuda_ml_logistic_reg(
x,
data,
fit_intercept = TRUE,
penalty = c("l2", "l1", "elasticnet", "none"),
tol = 1e-04,
C = 1,
class_weight = NULL,
sample_weight = NULL,
max_iters = 1000L,
linesearch_max_iters = 50L,
l1_ratio = NULL,
...
)
Arguments
x |
Depending on the context: * A __data frame__ of predictors. * A __matrix__ of predictors. * A __recipe__ specifying a set of preprocessing steps created from recipes::recipe(). * A __formula__ specifying the predictors and the outcome. |
... |
Optional arguments; currently unused. |
y |
A numeric vector (for regression) or factor (for classification) of desired responses. |
fit_intercept |
If TRUE, then the model tries to correct for the global mean of the response variable. If FALSE, then the model expects data to be centered. Default: TRUE. |
penalty |
The penalty type, must be one of "none", "l1", "l2", "elasticnet". If "none" or "l2" is selected, then L-BFGS solver will be used. If "l1" is selected, solver OWL-QN will be used. If "elasticnet" is selected, OWL-QN will be used if l1_ratio > 0, otherwise L-BFGS will be used. Default: "l2". |
tol |
Tolerance for stopping criteria. Default: 1e-4. |
C |
Inverse of regularization strength; must be a positive float. Default: 1.0. |
class_weight |
Weights associated with the classes. If NULL, then each class has an equal weight of 1. If "balanced", then class weights are inversely proportional to class frequencies in the input data. Default: NULL. |
sample_weight |
Array of weights assigned to individual samples.
If NULL, then each sample has an equal weight of 1. Default: NULL. |
max_iters |
Maximum number of solver iterations. Default: 1000L. |
linesearch_max_iters |
Max number of linesearch iterations per outer iteration used in the L-BFGS and OWL-QN solvers. Default: 50L. |
l1_ratio |
The Elastic-Net mixing parameter; must be in the interval [0, 1]. Only used when penalty is "elasticnet". Default: NULL. |
formula |
A formula specifying the outcome terms on the left-hand side, and the predictor terms on the right-hand side. |
data |
When a __recipe__ or __formula__ is used, data is specified as a __data frame__ containing the predictors and (if applicable) the outcome. |
Examples
library(cuda.ml)
X <- scale(as.matrix(iris[names(iris) != "Species"]))
y <- iris$Species
model <- cuda_ml_logistic_reg(X, y, max_iters = 100)
predictions <- predict(model, X)
# NOTE: if we were only performing binary classifications (e.g., by having
# `iris_data <- iris %>% mutate(Species = (Species == "setosa"))`), then the
# above would be conceptually equivalent to the following:
#
# iris_data <- iris %>% mutate(Species = (Species == "setosa"))
# model <- glm(
# Species ~ ., data = iris_data, family = binomial(link = "logit"),
# control = glm.control(epsilon = 1e-8, maxit = 100)
# )
#
# predict(model, iris_data, type = "response")
Train an OLS model.
Description
Train an Ordinary Least Square (OLS) model for regression tasks.
Usage
cuda_ml_ols(x, ...)
## Default S3 method:
cuda_ml_ols(x, ...)
## S3 method for class 'data.frame'
cuda_ml_ols(
x,
y,
method = c("svd", "eig", "qr"),
fit_intercept = TRUE,
normalize_input = FALSE,
...
)
## S3 method for class 'matrix'
cuda_ml_ols(
x,
y,
method = c("svd", "eig", "qr"),
fit_intercept = TRUE,
normalize_input = FALSE,
...
)
## S3 method for class 'formula'
cuda_ml_ols(
formula,
data,
method = c("svd", "eig", "qr"),
fit_intercept = TRUE,
normalize_input = FALSE,
...
)
## S3 method for class 'recipe'
cuda_ml_ols(
x,
data,
method = c("svd", "eig", "qr"),
fit_intercept = TRUE,
normalize_input = FALSE,
...
)
Arguments
x |
Depending on the context: * A __data frame__ of predictors. * A __matrix__ of predictors. * A __recipe__ specifying a set of preprocessing steps created from recipes::recipe(). * A __formula__ specifying the predictors and the outcome. |
... |
Optional arguments; currently unused. |
y |
A numeric vector (for regression) or factor (for classification) of desired responses. |
method |
Must be one of "svd", "eig", "qr". - "svd": compute SVD decomposition using Jacobi iterations. - "eig": use an eigendecomposition of the covariance matrix. - "qr": use the QR decomposition algorithm and solve 'Rx = Q^T y'. If the number of features is larger than the sample size, then the "svd" algorithm will be force-selected because it is the only algorithm that can support this type of scenario. Default: "svd". |
fit_intercept |
If TRUE, then the model tries to correct for the global mean of the response variable. If FALSE, then the model expects data to be centered. Default: TRUE. |
normalize_input |
Ignored when fit_intercept is FALSE. If TRUE, then the predictors will be normalized before the regression. Default: FALSE. |
formula |
A formula specifying the outcome terms on the left-hand side, and the predictor terms on the right-hand side. |
data |
When a __recipe__ or __formula__ is used, data is specified as a __data frame__ containing the predictors and (if applicable) the outcome. |
Value
An OLS regressor that can be used with the 'predict' S3 generic to make predictions on new data points.
Examples
library(cuda.ml)
model <- cuda_ml_ols(formula = mpg ~ ., data = mtcars, method = "qr")
predictions <- predict(model, mtcars[names(mtcars) != "mpg"])
# predictions will be comparable to those from a `stats::lm` model
lm_model <- stats::lm(formula = mpg ~ ., data = mtcars, method = "qr")
lm_predictions <- predict(lm_model, mtcars[names(mtcars) != "mpg"])
print(
all.equal(
as.numeric(lm_predictions),
predictions$.pred,
tolerance = 1e-3
)
)
Perform principal component analysis.
Description
Compute principal component(s) of the input data. Each feature from the input will be mean-centered (but not scaled) before the SVD computation takes place.
Usage
cuda_ml_pca(
x,
n_components = NULL,
eig_algo = c("dq", "jacobi"),
tol = 1e-07,
n_iters = 15L,
whiten = FALSE,
transform_input = TRUE,
cuML_log_level = c("off", "critical", "error", "warn", "info", "debug", "trace")
)
Arguments
x |
The input matrix or dataframe. Each data point should be a row and should consist of numeric values only. |
n_components |
Number of principal component(s) to keep. Default: min(nrow(x), ncol(x)). |
eig_algo |
Eigen decomposition algorithm to be applied to the covariance matrix. Valid choices are "dq" (divide-and-conquer method for symmetric matrices) and "jacobi" (the Jacobi method for symmetric matrices). Default: "dq". |
tol |
Tolerance for singular values computed by the Jacobi method. Default: 1e-7. |
n_iters |
Maximum number of iterations for the Jacobi method. Default: 15. |
whiten |
If TRUE, then de-correlate all components, making each component have unit variance and removing multi-collinearity. Default: FALSE. |
transform_input |
If TRUE, then compute an approximate representation of the input data. Default: TRUE. |
cuML_log_level |
Log level within cuML library functions. Must be one of "off", "critical", "error", "warn", "info", "debug", "trace". Default: off. |
Value
A PCA model object with the following attributes:
- "components": a matrix of n_components
rows containing the top
principal components.
- "explained_variance": amount of variance within the input data explained
by each component.
- "explained_variance_ratio": fraction of variance within the input data
explained by each component.
- "singular_values": singular values (non-negative) corresponding to the
top principal components.
- "mean": the column wise mean of x
which was used to mean-center
x
first.
- "transformed_data": (only present if "transform_input" is set to TRUE)
an approximate representation of input data based on principal
components.
- "pca_params": opaque pointer to PCA parameters which will be used for
performing inverse transforms.
The model object can be used as input to the inverse_transform() function to map a representation based on principal components back to the original feature space.
Examples
library(cuda.ml)
iris.pca <- cuda_ml_pca(iris[1:4], n_components = 3)
print(iris.pca)
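# A CPU cross-check sketch: stats::prcomp() on the same mean-centered (but
# unscaled) data should produce components spanning the same subspace,
# possibly with flipped signs.
print(prcomp(iris[1:4], center = TRUE, scale. = FALSE)$rotation[, 1:3])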
Train a random forest model.
Description
Train a random forest model for classification or regression tasks.
Usage
cuda_ml_rand_forest(x, ...)
## Default S3 method:
cuda_ml_rand_forest(x, ...)
## S3 method for class 'data.frame'
cuda_ml_rand_forest(
x,
y,
mtry = NULL,
trees = NULL,
min_n = 2L,
bootstrap = TRUE,
max_depth = 16L,
max_leaves = Inf,
max_predictors_per_note_split = NULL,
n_bins = 128L,
min_samples_leaf = 1L,
split_criterion = NULL,
min_impurity_decrease = 0,
max_batch_size = 128L,
n_streams = 8L,
cuML_log_level = c("off", "critical", "error", "warn", "info", "debug", "trace"),
...
)
## S3 method for class 'matrix'
cuda_ml_rand_forest(
x,
y,
mtry = NULL,
trees = NULL,
min_n = 2L,
bootstrap = TRUE,
max_depth = 16L,
max_leaves = Inf,
max_predictors_per_note_split = NULL,
n_bins = 128L,
min_samples_leaf = 1L,
split_criterion = NULL,
min_impurity_decrease = 0,
max_batch_size = 128L,
n_streams = 8L,
cuML_log_level = c("off", "critical", "error", "warn", "info", "debug", "trace"),
...
)
## S3 method for class 'formula'
cuda_ml_rand_forest(
formula,
data,
mtry = NULL,
trees = NULL,
min_n = 2L,
bootstrap = TRUE,
max_depth = 16L,
max_leaves = Inf,
max_predictors_per_note_split = NULL,
n_bins = 128L,
min_samples_leaf = 1L,
split_criterion = NULL,
min_impurity_decrease = 0,
max_batch_size = 128L,
n_streams = 8L,
cuML_log_level = c("off", "critical", "error", "warn", "info", "debug", "trace"),
...
)
## S3 method for class 'recipe'
cuda_ml_rand_forest(
x,
data,
mtry = NULL,
trees = NULL,
min_n = 2L,
bootstrap = TRUE,
max_depth = 16L,
max_leaves = Inf,
max_predictors_per_note_split = NULL,
n_bins = 128L,
min_samples_leaf = 1L,
split_criterion = NULL,
min_impurity_decrease = 0,
max_batch_size = 128L,
n_streams = 8L,
cuML_log_level = c("off", "critical", "error", "warn", "info", "debug", "trace"),
...
)
Arguments
x |
Depending on the context: * A __data frame__ of predictors. * A __matrix__ of predictors. * A __recipe__ specifying a set of preprocessing steps created from recipes::recipe(). * A __formula__ specifying the predictors and the outcome. |
... |
Optional arguments; currently unused. |
y |
A numeric vector (for regression) or factor (for classification) of desired responses. |
mtry |
The number of predictors that will be randomly sampled at each split when creating the tree models. Default: the square root of the total number of predictors. |
trees |
An integer for the number of trees contained in the ensemble. Default: 100L. |
min_n |
An integer for the minimum number of data points in a node that are required for the node to be split further. Default: 2L. |
bootstrap |
Whether to perform bootstrap. If TRUE, each tree in the forest is built on a bootstrapped sample with replacement. If FALSE, the whole dataset is used to build each tree. |
max_depth |
Maximum tree depth. Default: 16L. |
max_leaves |
Maximum leaf nodes per tree. Soft constraint. Default: Inf (unlimited). |
max_predictors_per_note_split |
Number of predictors to consider per node split. Default: square root of the total number of predictors. |
n_bins |
Number of bins used by the split algorithm. Default: 128L. |
min_samples_leaf |
The minimum number of data points in each leaf node. Default: 1L. |
split_criterion |
The criterion used to split nodes, can be "gini" or "entropy" for classifications, and "mse" or "mae" for regressions. Default: "gini" for classification; "mse" for regression. |
min_impurity_decrease |
Minimum decrease in impurity required for a node to be split. Default: 0. |
max_batch_size |
Maximum number of nodes that can be processed in a given batch. Default: 128L. |
n_streams |
Number of CUDA streams to use for building trees. Default: 8L. |
cuML_log_level |
Log level within cuML library functions. Must be one of "off", "critical", "error", "warn", "info", "debug", "trace". Default: off. |
formula |
A formula specifying the outcome terms on the left-hand side, and the predictor terms on the right-hand side. |
data |
When a __recipe__ or __formula__ is used, data is specified as a __data frame__ containing the predictors and (if applicable) the outcome. |
Value
A random forest classifier / regressor object that can be used with the 'predict' S3 generic to make predictions on new data points.
Examples
library(cuda.ml)
# Classification
model <- cuda_ml_rand_forest(
formula = Species ~ .,
data = iris,
trees = 100
)
predictions <- predict(model, iris[names(iris) != "Species"])
# Regression
model <- cuda_ml_rand_forest(
formula = mpg ~ .,
data = mtcars,
trees = 100
)
predictions <- predict(model, mtcars[names(mtcars) != "mpg"])
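# A quick fit-quality sketch for the regression model above: training-set
# mean squared error, assuming predictions are returned in the `.pred`
# column as in the other examples of this manual.
print(mean((predictions$.pred - mtcars$mpg)^2))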
Random projection for dimensionality reduction.
Description
Generate a random projection matrix for dimensionality reduction, and optionally transform input data to a projection in a lower dimension space using the generated random matrix.
Usage
cuda_ml_rand_proj(
x,
n_components = NULL,
eps = 0.1,
gaussian_method = TRUE,
density = NULL,
transform_input = TRUE,
seed = 0L
)
Arguments
x |
The input matrix or dataframe. Each data point should be a row and should consist of numeric values only. |
n_components |
Dimensionality of the target projection space. If NULL, then the parameter is deduced using the Johnson-Lindenstrauss lemma, taking into consideration the number of samples and the eps parameter. Default: NULL. |
eps |
Error tolerance during projection. Default: 0.1. |
gaussian_method |
If TRUE, then use the Gaussian random projection method. Otherwise, use the sparse random projection method. See https://en.wikipedia.org/wiki/Random_projection for details. Default: TRUE. |
density |
Ratio of non-zero component in the random projection matrix. If NULL, then the value is set to the minimum density as recommended by Ping Li et al.: 1 / sqrt(n_features). Default: NULL. |
transform_input |
Whether to project input data onto a lower dimension space using the random matrix. Default: TRUE. |
seed |
Seed for the pseudorandom number generator. Default: 0L. |
Value
A context object containing a GPU pointer to a random matrix that can be used as input to the cuda_ml_transform() function. If transform_input is set to TRUE, then the context object will also contain a "transformed_data" attribute containing the lower-dimensional projection of the input data.
Examples
library(cuda.ml)
library(mlbench)
data(Vehicle)
vehicle_data <- Vehicle[order(Vehicle$Class), which(names(Vehicle) != "Class")]
model <- cuda_ml_rand_proj(vehicle_data, n_components = 4)
set.seed(0L)
print(kmeans(model$transformed_data, centers = 4, iter.max = 1000))
Train a linear model using ridge regression.
Description
Train a linear model with L2 regularization.
Usage
cuda_ml_ridge(x, ...)
## Default S3 method:
cuda_ml_ridge(x, ...)
## S3 method for class 'data.frame'
cuda_ml_ridge(
x,
y,
alpha = 1,
fit_intercept = TRUE,
normalize_input = FALSE,
...
)
## S3 method for class 'matrix'
cuda_ml_ridge(
x,
y,
alpha = 1,
fit_intercept = TRUE,
normalize_input = FALSE,
...
)
## S3 method for class 'formula'
cuda_ml_ridge(
formula,
data,
alpha = 1,
fit_intercept = TRUE,
normalize_input = FALSE,
...
)
## S3 method for class 'recipe'
cuda_ml_ridge(
x,
data,
alpha = 1,
fit_intercept = TRUE,
normalize_input = FALSE,
...
)
Arguments
x |
Depending on the context: * A __data frame__ of predictors. * A __matrix__ of predictors. * A __recipe__ specifying a set of preprocessing steps created from recipes::recipe(). * A __formula__ specifying the predictors and the outcome. |
... |
Optional arguments; currently unused. |
y |
A numeric vector (for regression) or factor (for classification) of desired responses. |
alpha |
Multiplier of the L2 penalty term (i.e., the result would become an Ordinary Least Squares model if alpha = 0). Default: 1. |
fit_intercept |
If TRUE, then the model tries to correct for the global mean of the response variable. If FALSE, then the model expects data to be centered. Default: TRUE. |
normalize_input |
Ignored when fit_intercept is FALSE. If TRUE, then the predictors will be normalized before the regression. Default: FALSE. |
formula |
A formula specifying the outcome terms on the left-hand side, and the predictor terms on the right-hand side. |
data |
When a __recipe__ or __formula__ is used, data is specified as a __data frame__ containing the predictors and (if applicable) the outcome. |
Value
A ridge regressor that can be used with the 'predict' S3 generic to make predictions on new data points.
Examples
library(cuda.ml)
model <- cuda_ml_ridge(formula = mpg ~ ., data = mtcars, alpha = 1e-3)
cuda_ml_predictions <- predict(model, mtcars[names(mtcars) != "mpg"])
# predictions will be comparable to those from a `glmnet` model with `lambda`
# set to 2e-3 and `alpha` set to 0
# (in `glmnet`, `lambda` is the weight of the penalty term, and `alpha` is
# the elastic net mixing parameter between L1 and L2 penalties).
library(glmnet)
glmnet_model <- glmnet(
x = as.matrix(mtcars[names(mtcars) != "mpg"]), y = mtcars$mpg,
alpha = 0, lambda = 2e-3, nlambda = 1, standardize = FALSE
)
glmnet_predictions <- predict(
glmnet_model, as.matrix(mtcars[names(mtcars) != "mpg"]),
s = 0
)
print(
all.equal(
as.numeric(glmnet_predictions),
cuda_ml_predictions$.pred,
tolerance = 1e-3
)
)
Serialize a CuML model
Description
Given a CuML model, serialize its state into a connection.
Usage
cuda_ml_serialize(model, connection = NULL, ...)
cuda_ml_serialise(model, connection = NULL, ...)
Arguments
model |
The model object. |
connection |
An open connection, or NULL to return the serialized model state as a raw vector. Default: NULL. |
... |
Additional arguments to the underlying serialization routine. |
Value
NULL unless connection is NULL, in which case the serialized model state is returned as a raw vector.
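Examples
library(cuda.ml)
# A minimal serialization sketch, reusing the OLS model from this manual:
# write the model state to an open file connection, then obtain the same
# state as a raw vector by passing `connection = NULL`.
model <- cuda_ml_ols(mpg ~ ., mtcars)
con <- file(file.path(tempdir(), "cuda_ml_ols_model"), open = "wb")
cuda_ml_serialize(model, con)
close(con)
raw_state <- cuda_ml_serialize(model) # `connection = NULL` returns a raw vector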
Train an MBSGD linear model.
Description
Train a linear model using mini-batch stochastic gradient descent.
Usage
cuda_ml_sgd(x, ...)
## Default S3 method:
cuda_ml_sgd(x, ...)
## S3 method for class 'data.frame'
cuda_ml_sgd(
x,
y,
fit_intercept = TRUE,
loss = c("squared_loss", "log", "hinge"),
penalty = c("none", "l1", "l2", "elasticnet"),
alpha = 1e-04,
l1_ratio = 0.5,
epochs = 1000L,
tol = 0.001,
shuffle = TRUE,
learning_rate = c("constant", "invscaling", "adaptive"),
eta0 = 0.001,
power_t = 0.5,
batch_size = 32L,
n_iters_no_change = 5L,
...
)
## S3 method for class 'matrix'
cuda_ml_sgd(
x,
y,
fit_intercept = TRUE,
loss = c("squared_loss", "log", "hinge"),
penalty = c("none", "l1", "l2", "elasticnet"),
alpha = 1e-04,
l1_ratio = 0.5,
epochs = 1000L,
tol = 0.001,
shuffle = TRUE,
learning_rate = c("constant", "invscaling", "adaptive"),
eta0 = 0.001,
power_t = 0.5,
batch_size = 32L,
n_iters_no_change = 5L,
...
)
## S3 method for class 'formula'
cuda_ml_sgd(
formula,
data,
fit_intercept = TRUE,
loss = c("squared_loss", "log", "hinge"),
penalty = c("none", "l1", "l2", "elasticnet"),
alpha = 1e-04,
l1_ratio = 0.5,
epochs = 1000L,
tol = 0.001,
shuffle = TRUE,
learning_rate = c("constant", "invscaling", "adaptive"),
eta0 = 0.001,
power_t = 0.5,
batch_size = 32L,
n_iters_no_change = 5L,
...
)
## S3 method for class 'recipe'
cuda_ml_sgd(
x,
data,
fit_intercept = TRUE,
loss = c("squared_loss", "log", "hinge"),
penalty = c("none", "l1", "l2", "elasticnet"),
alpha = 1e-04,
l1_ratio = 0.5,
epochs = 1000L,
tol = 0.001,
shuffle = TRUE,
learning_rate = c("constant", "invscaling", "adaptive"),
eta0 = 0.001,
power_t = 0.5,
batch_size = 32L,
n_iters_no_change = 5L,
...
)
Arguments
x |
Depending on the context: * A __data frame__ of predictors. * A __matrix__ of predictors. * A __recipe__ specifying a set of preprocessing steps created from recipes::recipe(). * A __formula__ specifying the predictors and the outcome. |
... |
Optional arguments; currently unused. |
y |
A numeric vector (for regression) or factor (for classification) of desired responses. |
fit_intercept |
If TRUE, then the model tries to correct for the global mean of the response variable. If FALSE, then the model expects data to be centered. Default: TRUE. |
loss |
Loss function, must be one of "squared_loss", "log", "hinge". |
penalty |
Type of regularization to perform, must be one of "none", "l1", "l2", "elasticnet".
- "none": no regularization.
- "l1": perform regularization based on the L1-norm (LASSO), which tries to minimize the sum of the absolute values of the coefficients.
- "l2": perform regularization based on the L2-norm (Ridge), which tries to minimize the sum of the squares of the coefficients.
- "elasticnet": perform Elastic Net regularization, which is based on a weighted average of the L1 and L2 norms.
Default: "none". |
alpha |
Multiplier of the penalty term. Default: 1e-4. |
l1_ratio |
The ElasticNet mixing parameter, with 0 <= l1_ratio <= 1. For l1_ratio = 0 the penalty is an L2 penalty. For l1_ratio = 1 it is an L1 penalty. For 0 < l1_ratio < 1, the penalty is a combination of L1 and L2. The penalty term is computed using the following formula: penalty = alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2, where ||w||_1 and ||w||_2 are the L1 and L2 norms of the coefficients. Default: 0.5. |
epochs |
The number of times the model should iterate through the entire dataset during training. Default: 1000L. |
tol |
Threshold for stopping training. Training will stop if (loss in current epoch) > (loss in previous epoch) - tol. Default: 1e-3. |
shuffle |
Whether to shuffle the training data after each epoch. Default: TRUE. |
learning_rate |
Must be one of "constant", "invscaling", "adaptive". - "constant": the learning rate will be kept constant.
- "invscaling": (learning rate) = (initial learning rate) / pow(t, power_t)
where |
eta0 |
The initial learning rate. Default: 1e-3. |
power_t |
The exponent used in the invscaling learning rate calculations. Default: 0.5. |
batch_size |
The number of samples that will be included in each batch. Default: 32L. |
n_iters_no_change |
The maximum number of epochs to train if there is no improvement in the model. Default: 5L. |
formula |
A formula specifying the outcome terms on the left-hand side, and the predictor terms on the right-hand side. |
data |
When a __recipe__ or __formula__ is used, data is specified as a __data frame__ containing the predictors and (if applicable) the outcome. |
Value
A linear model that can be used with the 'predict' S3 generic to make predictions on new data points.
Examples
library(cuda.ml)
model <- cuda_ml_sgd(
mpg ~ ., mtcars,
batch_size = 4L, epochs = 50000L,
learning_rate = "adaptive", eta0 = 1e-5,
penalty = "l2", alpha = 1e-5, tol = 1e-6,
n_iters_no_change = 10L
)
preds <- predict(model, mtcars[names(mtcars) != "mpg"])
print(all.equal(preds$.pred, mtcars$mpg, tolerance = 0.09))
Train a SVM model.
Description
Train a Support Vector Machine model for classification or regression tasks.
Usage
cuda_ml_svm(x, ...)
## Default S3 method:
cuda_ml_svm(x, ...)
## S3 method for class 'data.frame'
cuda_ml_svm(
x,
y,
cost = 1,
kernel = c("rbf", "tanh", "polynomial", "linear"),
gamma = NULL,
coef0 = 0,
degree = 3L,
tol = 0.001,
max_iter = NULL,
nochange_steps = 1000L,
cache_size = 1024,
epsilon = 0.1,
sample_weights = NULL,
cuML_log_level = c("off", "critical", "error", "warn", "info", "debug", "trace"),
...
)
## S3 method for class 'matrix'
cuda_ml_svm(
x,
y,
cost = 1,
kernel = c("rbf", "tanh", "polynomial", "linear"),
gamma = NULL,
coef0 = 0,
degree = 3L,
tol = 0.001,
max_iter = NULL,
nochange_steps = 1000L,
cache_size = 1024,
epsilon = 0.1,
sample_weights = NULL,
cuML_log_level = c("off", "critical", "error", "warn", "info", "debug", "trace"),
...
)
## S3 method for class 'formula'
cuda_ml_svm(
formula,
data,
cost = 1,
kernel = c("rbf", "tanh", "polynomial", "linear"),
gamma = NULL,
coef0 = 0,
degree = 3L,
tol = 0.001,
max_iter = NULL,
nochange_steps = 1000L,
cache_size = 1024,
epsilon = 0.1,
sample_weights = NULL,
cuML_log_level = c("off", "critical", "error", "warn", "info", "debug", "trace"),
...
)
## S3 method for class 'recipe'
cuda_ml_svm(
x,
data,
cost = 1,
kernel = c("rbf", "tanh", "polynomial", "linear"),
gamma = NULL,
coef0 = 0,
degree = 3L,
tol = 0.001,
max_iter = NULL,
nochange_steps = 1000L,
cache_size = 1024,
epsilon = 0.1,
sample_weights = NULL,
cuML_log_level = c("off", "critical", "error", "warn", "info", "debug", "trace"),
...
)
Arguments
x |
Depending on the context: * A __data frame__ of predictors. * A __matrix__ of predictors. * A __recipe__ specifying a set of preprocessing steps created from recipes::recipe(). * A __formula__ specifying the predictors and the outcome. |
... |
Optional arguments; currently unused. |
y |
A numeric vector (for regression) or factor (for classification) of desired responses. |
cost |
A positive number for the cost of predicting a sample within or on the wrong side of the margin. Default: 1. |
kernel |
Type of the SVM kernel function (must be one of "rbf", "tanh", "polynomial", or "linear"). Default: "rbf". |
gamma |
The gamma coefficient (only relevant to polynomial, RBF, and tanh kernel functions, see explanations below). Default: 1 / (num features).
The following kernels are implemented:
- RBF: K(x_1, x_2) = exp(-gamma |x_1 - x_2|^2)
- TANH: K(x_1, x_2) = tanh(gamma <x_1, x_2> + coef0)
- POLYNOMIAL: K(x_1, x_2) = (gamma <x_1, x_2> + coef0)^degree
- LINEAR: K(x_1, x_2) = <x_1, x_2>,
where < , > denotes the dot product. |
coef0 |
The 0th coefficient (only applicable to polynomial and tanh kernel functions, see explanations below). Default: 0.
The following kernels are implemented:
- RBF: K(x_1, x_2) = exp(-gamma |x_1 - x_2|^2)
- TANH: K(x_1, x_2) = tanh(gamma <x_1, x_2> + coef0)
- POLYNOMIAL: K(x_1, x_2) = (gamma <x_1, x_2> + coef0)^degree
- LINEAR: K(x_1, x_2) = <x_1, x_2>,
where < , > denotes the dot product. |
degree |
Degree of the polynomial kernel function (note: not applicable to other kernel types, see explanations below). Default: 3.
The following kernels are implemented:
- RBF: K(x_1, x_2) = exp(-gamma |x_1 - x_2|^2)
- TANH: K(x_1, x_2) = tanh(gamma <x_1, x_2> + coef0)
- POLYNOMIAL: K(x_1, x_2) = (gamma <x_1, x_2> + coef0)^degree
- LINEAR: K(x_1, x_2) = <x_1, x_2>,
where < , > denotes the dot product. |
tol |
Tolerance to stop fitting. Default: 1e-3. |
max_iter |
Maximum number of outer iterations in SmoSolver. Default: 100 * (num samples). |
nochange_steps |
Number of steps with no change w.r.t convergence. Default: 1000. |
cache_size |
Size of kernel cache (MiB) in device memory. Default: 1024. |
epsilon |
Epsilon parameter of the epsilon-SVR model. There is no penalty for points that are predicted within the epsilon-tube around the target values. Please note this parameter is only relevant for regression tasks. Default: 0.1. |
sample_weights |
Optional weight assigned to each input data point. |
cuML_log_level |
Log level within cuML library functions. Must be one of "off", "critical", "error", "warn", "info", "debug", "trace". Default: off. |
formula |
A formula specifying the outcome terms on the left-hand side, and the predictor terms on the right-hand side. |
data |
When a __recipe__ or __formula__ is used, data is specified as a __data frame__ containing the predictors and (if applicable) the outcome. |
Value
A SVM classifier / regressor object that can be used with the 'predict' S3 generic to make predictions on new data points.
Examples
library(cuda.ml)
# Classification
model <- cuda_ml_svm(
formula = Species ~ .,
data = iris,
kernel = "rbf"
)
predictions <- predict(model, iris[names(iris) != "Species"])
# Regression
model <- cuda_ml_svm(
formula = mpg ~ .,
data = mtcars,
kernel = "rbf"
)
predictions <- predict(model, mtcars)
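# A hedged follow-up sketch for the regression model above: widening the
# epsilon-tube removes the penalty on more of the small residuals, which
# typically yields a smoother fit.
model_wide_tube <- cuda_ml_svm(mpg ~ ., mtcars, kernel = "rbf", epsilon = 1)
predictions_wide <- predict(model_wide_tube, mtcars)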
Transform data using a trained cuML model.
Description
Given a trained cuML model, transform an input dataset using that model.
Usage
cuda_ml_transform(model, x, ...)
Arguments
model |
A model object. |
x |
The dataset to be transformed. |
... |
Additional model-specific parameters (if any). |
Value
The transformed data points.
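Examples
library(cuda.ml)
# A usage sketch, assuming the random projection context object documented
# under cuda_ml_rand_proj(): apply the stored random matrix to (re-)project
# data points.
proj <- cuda_ml_rand_proj(iris[1:4], n_components = 2)
projected <- cuda_ml_transform(proj, iris[1:4])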
t-distributed Stochastic Neighbor Embedding.
Description
t-distributed Stochastic Neighbor Embedding (TSNE) for visualizing high-dimensional data.
Usage
cuda_ml_tsne(
x,
n_components = 2L,
n_neighbors = ceiling(3 * perplexity),
method = c("barnes_hut", "fft", "exact"),
angle = 0.5,
n_iter = 1000L,
learning_rate = 200,
learning_rate_method = c("adaptive", "none"),
perplexity = 30,
perplexity_max_iter = 100L,
perplexity_tol = 1e-05,
early_exaggeration = 12,
late_exaggeration = 1,
exaggeration_iter = 250L,
min_grad_norm = 1e-07,
pre_momentum = 0.5,
post_momentum = 0.8,
square_distances = TRUE,
seed = NULL,
cuML_log_level = c("off", "critical", "error", "warn", "info", "debug", "trace")
)
Arguments
x |
The input matrix or dataframe. Each data point should be a row and should consist of numeric values only. |
n_components |
Dimension of the embedded space. |
n_neighbors |
The number of datapoints to use in the attractive forces. Default: ceiling(3 * perplexity). |
method |
T-SNE method, must be one of "barnes_hut", "fft", "exact". The "exact" method will be more accurate but slower. Both "barnes_hut" and "fft" methods are fast approximations. |
angle |
Valid values are between 0.0 and 1.0, which trade off speed and accuracy, respectively. Generally, these values are set between 0.2 and 0.8. (Barnes-Hut only.) |
n_iter |
Maximum number of iterations for the optimization. Should be at least 250. Default: 1000L. |
learning_rate |
Learning rate of the t-SNE algorithm, usually between (10, 1000). If the learning rate is too high, then t-SNE result could look like a cloud / ball of points. |
learning_rate_method |
Must be one of "adaptive", "none". If "adaptive", then learning rate, early exaggeration, and perplexity are automatically tuned based on input size. Default: "adaptive". |
perplexity |
The target value of the conditional distribution's perplexity (see https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding for details). |
perplexity_max_iter |
The number of epochs over which the best Gaussian bands are found. Default: 100L. |
perplexity_tol |
Stop optimizing the Gaussian bands when the conditional distribution's perplexity is within this desired tolerance of its target value. Default: 1e-5. |
early_exaggeration |
Controls the space between clusters. Not critical to tune this. Default: 12.0. |
late_exaggeration |
Controls the space between clusters. It may be beneficial to increase this slightly to improve cluster separation. This will be applied after 'exaggeration_iter' iterations (FFT only). |
exaggeration_iter |
Number of exaggeration iterations. Default: 250L. |
min_grad_norm |
If the gradient norm is below this threshold, the optimization will be stopped. Default: 1e-7. |
pre_momentum |
During the exaggeration iteration, more forcefully apply gradients. Default: 0.5. |
post_momentum |
During the late phases, less forcefully apply gradients. Default: 0.8. |
square_distances |
Whether TSNE should square the distance values. |
seed |
Seed to the pseudorandom number generator. Setting this can make repeated runs look more similar. Note, however, that this highly parallelized t-SNE implementation is not completely deterministic between runs, even with the same seed. Default: NULL. |
cuML_log_level |
Log level within cuML library functions. Must be one of "off", "critical", "error", "warn", "info", "debug", "trace". Default: off. |
Value
A matrix containing the embedding of the input data in a low-dimensional space, with each row representing an embedded data point.
Examples
library(cuda.ml)
embedding <- cuda_ml_tsne(iris[1:4], method = "exact")
set.seed(0L)
print(kmeans(embedding, centers = 3))
Truncated SVD.
Description
Dimensionality reduction using Truncated Singular Value Decomposition.
Usage
cuda_ml_tsvd(
x,
n_components = 2L,
eig_algo = c("dq", "jacobi"),
tol = 1e-07,
n_iters = 15L,
transform_input = TRUE,
cuML_log_level = c("off", "critical", "error", "warn", "info", "debug", "trace")
)
Arguments
x |
The input matrix or dataframe. Each data point should be a row and should consist of numeric values only. |
n_components |
Desired dimensionality of output data. Must be strictly less than ncol(x) (i.e., the number of features in the input data). Default: 2L. |
eig_algo |
Eigen decomposition algorithm to be applied to the covariance matrix. Valid choices are "dq" (divide-and-conquer method for symmetric matrices) and "jacobi" (the Jacobi method for symmetric matrices). Default: "dq". |
tol |
Tolerance for singular values computed by the Jacobi method. Default: 1e-7. |
n_iters |
Maximum number of iterations for the Jacobi method. Default: 15. |
transform_input |
If TRUE, then compute an approximate representation of the input data. Default: TRUE. |
cuML_log_level |
Log level within cuML library functions. Must be one of "off", "critical", "error", "warn", "info", "debug", "trace". Default: off. |
Value
A TSVD model object with the following attributes:
- "components": a matrix of n_components
rows to be used for
dimensionalitiy reduction on new data points.
- "explained_variance": (only present if "transform_input" is set to TRUE)
amount of variance within the input data explained by each component.
- "explained_variance_ratio": (only present if "transform_input" is set to
TRUE) fraction of variance within the input data explained by each
component.
- "singular_values": The singular values corresponding to each component.
The singular values are equal to the 2-norms of the n_components
variables in the lower-dimensional space.
- "tsvd_params": opaque pointer to TSVD parameters which will be used for
performing inverse transforms.
Examples
library(cuda.ml)
iris.tsvd <- cuda_ml_tsvd(iris[1:4], n_components = 2)
print(iris.tsvd)
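Because the "components" attribute has n_components rows spanning the reduced space, new points can also be projected manually. A minimal sketch, assuming the attributes listed above are accessible as list elements:
library(cuda.ml)
iris.tsvd <- cuda_ml_tsvd(iris[1:4], n_components = 2)
# Project the (uncentered) input onto the components; each row of
# "components" is one basis vector in the original feature space.
projected <- as.matrix(iris[1:4]) %*% t(iris.tsvd$components)
print(head(projected))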
Uniform Manifold Approximation and Projection (UMAP) for dimension reduction.
Description
Run the Uniform Manifold Approximation and Projection (UMAP) algorithm to find a low dimensional embedding of the input data that approximates an underlying manifold.
Usage
cuda_ml_umap(
x,
y = NULL,
n_components = 2L,
n_neighbors = 15L,
n_epochs = 500L,
learning_rate = 1,
init = c("spectral", "random"),
min_dist = 0.1,
spread = 1,
set_op_mix_ratio = 1,
local_connectivity = 1L,
repulsion_strength = 1,
negative_sample_rate = 5L,
transform_queue_size = 4,
a = NULL,
b = NULL,
target_n_neighbors = n_neighbors,
target_metric = c("categorical", "euclidean"),
target_weight = 0.5,
transform_input = TRUE,
seed = NULL,
cuML_log_level = c("off", "critical", "error", "warn", "info", "debug", "trace")
)
Arguments
x |
The input matrix or dataframe. Each data point should be a row and should consist of numeric values only. |
y |
An optional numeric vector of target values for supervised dimension reduction. Default: NULL. |
n_components |
The dimension of the space to embed into. Default: 2. |
n_neighbors |
The size of local neighborhood (in terms of number of neighboring sample points) used for manifold approximation. Default: 15. |
n_epochs |
The number of training epochs to be used in optimizing the low dimensional embedding. Default: 500. |
learning_rate |
The initial learning rate for the embedding optimization. Default: 1.0. |
init |
Initialization mode of the low dimensional embedding. Must be one of "spectral", "random". Default: "spectral". |
min_dist |
The effective minimum distance between embedded points. Default: 0.1. |
spread |
The effective scale of embedded points. In combination with min_dist, this determines how clustered/clumped the embedded points are. Default: 1.0. |
set_op_mix_ratio |
Interpolate between (fuzzy) union and intersection as the set operation used to combine local fuzzy simplicial sets to obtain a global fuzzy simplicial set. Both fuzzy set operations use the product t-norm. The value of this parameter should be between 0.0 and 1.0; a value of 1.0 will use a pure fuzzy union, while 0.0 will use a pure fuzzy intersection. Default: 1.0. |
local_connectivity |
The local connectivity required – i.e. the number of nearest neighbors that should be assumed to be connected at a local level. Default: 1. |
repulsion_strength |
Weighting applied to negative samples in low dimensional embedding optimization. Values higher than one will result in greater weight being given to negative samples. Default: 1.0. |
negative_sample_rate |
The number of negative samples to select per positive sample in the optimization process. Default: 5. |
transform_queue_size |
For transform operations (embedding new points using a trained model), this will control how aggressively to search for nearest neighbors. Default: 4.0. |
a , b |
More specific parameters controlling the embedding. If not set, then these values are set automatically as determined by min_dist and spread. Default: NULL. |
target_n_neighbors |
The number of nearest neighbors to use to construct the target simplicial set. Default: n_neighbors. |
target_metric |
The metric for measuring distance between the actual and the target values (y) for supervised dimension reduction. Must be one of "categorical", "euclidean". Default: "categorical". |
target_weight |
Weighting factor between data topology and target topology. A value of 0.0 weights entirely on data, a value of 1.0 weights entirely on target. The default of 0.5 balances the weighting equally between data and target. |
transform_input |
If TRUE, then compute an approximate representation of the input data. Default: TRUE. |
seed |
Optional seed for pseudo random number generator. Default: NULL. Setting a PRNG seed will enable consistency of trained embeddings, allowing for reproducible results to 3 digits of precision, but at the expense of potentially slower training and increased memory usage. If the PRNG seed is not set, then the trained embeddings will not be deterministic. |
cuML_log_level |
Log level within cuML library functions. Must be one of "off", "critical", "error", "warn", "info", "debug", "trace". Default: off. |
Value
A UMAP model object that can be used as input to the
cuda_ml_transform()
function.
If transform_input
is set to TRUE, then the model object will
contain a "transformed_data" attribute containing the lower dimensional
embedding of the input data.
Examples
library(cuda.ml)
model <- cuda_ml_umap(
x = iris[1:4],
y = iris[[5]],
n_components = 2,
n_epochs = 200,
transform_input = TRUE
)
set.seed(0L)
print(kmeans(model$transformed_data, iter.max = 100, centers = 3))
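New data points can then be embedded with the trained model. A minimal sketch, assuming cuda_ml_transform() takes the model followed by the new data:
# Embed a random subset of iris using the model trained above.
new_points <- iris[sample(nrow(iris), 10), 1:4]
new_embedding <- cuda_ml_transform(model, new_points)
print(new_embedding)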
Unserialize a CuML model state
Description
Unserialize a CuML model state into a CuML model object.
Usage
cuda_ml_unserialize(connection, ...)
cuda_ml_unserialise(connection, ...)
Arguments
connection |
An open connection or a raw vector. |
... |
Additional arguments to base::unserialize(). |
Value
An unserialized CuML model.
See Also
cuda_ml_serialize()
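A minimal round-trip sketch, assuming cuda_ml_serialize() mirrors base::serialize() (returning a raw vector when connection is NULL) and that the model type supports serialization:
library(cuda.ml)
model <- cuda_ml_tsvd(iris[1:4], n_components = 2)
# Serialize the model state to a raw vector, then restore it.
state <- cuda_ml_serialize(model, connection = NULL)
restored <- cuda_ml_unserialize(state)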
Determine whether cuda.ml was linked to a valid version of the RAPIDS cuML shared library.
Description
Determine whether cuda.ml was linked to a valid version of the RAPIDS cuML shared library.
Usage
has_cuML()
Value
A logical value indicating whether the current installation of cuda.ml was linked to a valid version of the RAPIDS cuML shared library.
Examples
library(cuda.ml)
if (!has_cuML()) {
warning(
"Please install the RAPIDS cuML shared library first, and then re-",
"install {cuda.ml}."
)
}
Make predictions on new data points.
Description
Make predictions on new data points using a FIL model.
Usage
## S3 method for class 'cuda_ml_fil'
predict(object, x, output_class_probabilities = FALSE, ...)
Arguments
object |
A trained CuML model. |
x |
A matrix or dataframe containing new data points. |
output_class_probabilities |
Whether to output class probabilities. NOTE: setting output_class_probabilities to TRUE is only valid for classification models. Default: FALSE. |
... |
Additional arguments to predict(). |
Value
Predictions on new data points.
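A FIL model is loaded from a serialized forest rather than trained in R. A hypothetical sketch using the xgboost package (listed in Suggests) to produce a model file; the exact argument names of cuda_ml_fil_load_model() are assumptions here:
library(cuda.ml)
library(xgboost)
# Train a small XGBoost classifier and save it to a file.
data(agaricus.train, package = "xgboost")
bst <- xgboost(
  data = agaricus.train$data, label = agaricus.train$label,
  nrounds = 5, objective = "binary:logistic"
)
model_path <- file.path(tempdir(), "xgboost.model")
xgb.save(bst, model_path)
# Load the saved forest with FIL and predict class probabilities.
model <- cuda_ml_fil_load_model(model_path, mode = "classification")
probs <- predict(model, as.matrix(agaricus.train$data),
                 output_class_probabilities = TRUE)
print(head(probs))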
Make predictions on new data points.
Description
Make predictions on new data points using a CuML KNN model.
Usage
## S3 method for class 'cuda_ml_knn'
predict(object, x, output_class_probabilities = NULL, ...)
Arguments
object |
A trained CuML model. |
x |
A matrix or dataframe containing new data points. |
output_class_probabilities |
Whether to output class probabilities. NOTE: setting output_class_probabilities to TRUE is only valid for classification models. Default: NULL. |
... |
Additional arguments to predict(). |
Value
Predictions on new data points.
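A hypothetical sketch, assuming cuda_ml_knn() accepts the formula/data interface shared by the other modeling functions in this package:
library(cuda.ml)
model <- cuda_ml_knn(Species ~ ., iris)
# Hard class predictions, then per-class probabilities.
print(predict(model, iris[1:4]))
print(predict(model, iris[1:4], output_class_probabilities = TRUE))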
Make predictions on new data points.
Description
Make predictions on new data points using a linear model.
Usage
## S3 method for class 'cuda_ml_linear_model'
predict(object, x, ...)
Arguments
object |
A trained CuML model. |
x |
A matrix or dataframe containing new data points. |
... |
Additional arguments to predict(). |
Value
Predictions on new data points.
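A minimal sketch, assuming cuda_ml_ols() (the ordinary least-squares fitter in this package) accepts a formula/data interface:
library(cuda.ml)
model <- cuda_ml_ols(mpg ~ ., mtcars)
# Predict fuel efficiency for the training data.
print(predict(model, mtcars))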
Make predictions on new data points.
Description
Make predictions on new data points using a CuML logistic regression model.
Usage
## S3 method for class 'cuda_ml_logistic_reg'
predict(object, x, ...)
Arguments
object |
A trained CuML model. |
x |
A matrix or dataframe containing new data points. |
... |
Additional arguments to predict(). |
Value
Predictions on new data points.
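A minimal sketch, assuming cuda_ml_logistic_reg() accepts the same formula/data interface:
library(cuda.ml)
model <- cuda_ml_logistic_reg(Species ~ ., iris)
# Predicted class labels for the training data.
print(predict(model, iris[1:4]))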
Make predictions on new data points.
Description
Make predictions on new data points using a CuML random forest model.
Usage
## S3 method for class 'cuda_ml_rand_forest'
predict(
object,
x,
output_class_probabilities = NULL,
cuML_log_level = c("off", "critical", "error", "warn", "info", "debug", "trace"),
...
)
Arguments
object |
A trained CuML model. |
x |
A matrix or dataframe containing new data points. |
output_class_probabilities |
Whether to output class probabilities. NOTE: setting output_class_probabilities to TRUE is only valid for classification models. Default: NULL. |
cuML_log_level |
Log level within cuML library functions. Must be one of "off", "critical", "error", "warn", "info", "debug", "trace". Default: off. |
... |
Additional arguments to predict(). |
Value
Predictions on new data points.
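A minimal sketch, assuming cuda_ml_rand_forest() follows parsnip-style argument names (e.g. trees) and a formula/data interface:
library(cuda.ml)
model <- cuda_ml_rand_forest(Species ~ ., iris, trees = 100)
# Hard class predictions, then per-class probabilities.
print(predict(model, iris[1:4]))
print(predict(model, iris[1:4], output_class_probabilities = TRUE))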
Make predictions on new data points.
Description
Make predictions on new data points using a CuML SVM model.
Usage
## S3 method for class 'cuda_ml_svm'
predict(object, x, ...)
Arguments
object |
A trained CuML model. |
x |
A matrix or dataframe containing new data points. |
... |
Additional arguments to predict(). |
Value
Predictions on new data points.
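A minimal sketch, assuming cuda_ml_svm() accepts a formula/data interface and a kernel argument:
library(cuda.ml)
model <- cuda_ml_svm(Species ~ ., iris, kernel = "rbf")
# Predicted class labels for the training data.
print(predict(model, iris[1:4]))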