Version: | 1.3.1 |
Date: | 2021-12-15 |
Title: | Toolkit for Credit Modeling, Analysis and Visualization |
Maintainer: | Dongping Fan <fdp@pku.edu.cn> |
Description: | Provides a highly efficient R tool suite for Credit Modeling, Analysis and Visualization. Contains infrastructure functionalities such as data exploration and preparation, missing values treatment, outliers treatment, variable derivation, variable selection, dimensionality reduction, grid search for hyper parameters, data mining and visualization, model evaluation, strategy analysis etc. This package is designed to make the development of binary classification models (machine learning based models as well as credit scorecard) simpler and faster. The references include: 1. Refaat, M. (2011, ISBN: 9781447511199). Credit Risk Scorecard: Development and Implementation Using SAS; 2. Bezdek, James C. FCM: The fuzzy c-means clustering algorithm. Computers & Geosciences (0098-3004), <doi:10.1016/0098-3004(84)90020-7>. |
Depends: | R (≥ 2.10) |
Imports: | data.table,dplyr,ggplot2,foreach,doParallel,glmnet,rpart,cli,xgboost |
Suggests: | pdp,pmml,XML,knitr,gbm,randomForest,rmarkdown |
VignetteBuilder: | knitr |
Encoding: | UTF-8 |
ByteCompile: | yes |
LazyData: | yes |
LazyLoad: | yes |
License: | AGPL-3 |
RoxygenNote: | 7.1.2 |
NeedsCompilation: | no |
Author: | Dongping Fan [aut, cre] |
Repository: | CRAN |
Packaged: | 2022-01-07 07:44:53 UTC; HANSEN |
Date/Publication: | 2022-01-07 11:32:41 UTC |
creditmodel: toolkit for credit modeling and data analysis
Description
creditmodel provides a highly efficient R tool suite for Credit Modeling, Analysis and Visualization. Contains infrastructure functionalities such as data exploration and preparation, missing values treatment, outliers treatment, variable derivation, variable selection, dimensionality reduction, grid search for hyper parameters, data mining and visualization, model evaluation, strategy analysis etc. This package is designed to make the development of binary classification models (machine learning based models as well as credit scorecard) simpler and faster.
Details
It has three main goals:
creditmodel is a free and open source automated modeling R package designed to help model developers improve model development efficiency and to enable people with no background in data science to complete the modeling work in a short time, so that they can focus more on the problem itself and allocate more time to decision-making.
creditmodel covers various tools such as data preprocessing, variable processing/derivation, variable screening/dimensionality reduction, modeling, data analysis, data visualization, model evaluation, strategy analysis, etc. It is a customized "core" toolkit for model developers.
creditmodel is suitable for automated machine learning modeling of classification targets, and is especially suited to the risk and marketing data of financial credit, e-commerce, and insurance, which have relatively high noise and low information content.
To learn more about creditmodel, start with the WeChat Platform: hansenmode
Author(s)
Maintainer: Dongping Fan fdp@pku.edu.cn
Fuzzy String matching
Description
Fuzzy String matching
Usage
x %alike% y
Arguments
x |
A string. |
y |
A string. |
Value
Logical.
Examples
"xyz" %alike% "xy"
Fuzzy String matching
Description
Fuzzy String matching
Usage
x %islike% y
Arguments
x |
A string. |
y |
A string. |
Value
Logical.
Examples
"xyz" %islike% "yz$"
PCA Dimension Reduction
Description
PCA_reduce
is used for PCA reduction of high dimension data.
Usage
PCA_reduce(train = train, test = NULL, mc = 0.9)
Arguments
train |
A data.frame with independent variables and target variable. |
test |
A data.frame of test data. |
mc |
Threshold of cumulative imp. |
Examples
## Not run:
num_x_list = get_names(dat = UCICreditCard, types = c('numeric'),
ex_cols = "ID$|date$|default.payment.next.month$", get_ex = FALSE)
PCA_dat = PCA_reduce(train = UCICreditCard[num_x_list])
## End(Not run)
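The idea behind a cumulative threshold can be sketched in base R. This is a minimal sketch, assuming that mc is a threshold on the cumulative share of explained variance (the source only says "cumulative imp."); pca_keep is a hypothetical helper built on stats::prcomp and is not the package's implementation.

```r
# Keep the leading principal components whose cumulative share of
# explained variance first reaches the threshold mc (assumed meaning).
pca_keep <- function(x, mc = 0.9) {
  pca <- prcomp(x, center = TRUE, scale. = TRUE)
  cum_share <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
  # small tolerance so that mc = 1 is not missed by rounding error
  n_keep <- which(cum_share >= mc - 1e-12)[1]
  list(n_keep = n_keep,
       cum_share = cum_share,
       scores = pca$x[, seq_len(n_keep), drop = FALSE])
}

# Toy data: a and b are nearly identical, c is independent noise
set.seed(46)
z <- rnorm(200)
x <- data.frame(a = z + rnorm(200, sd = 0.1),
                b = z + rnorm(200, sd = 0.1),
                c = rnorm(200))
res <- pca_keep(x, mc = 0.9)
res$n_keep
round(res$cum_share, 3)
```

Because a and b carry almost the same information, the reduced data needs fewer columns than the original while still covering the requested share of variance.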
UCI Credit Card data
Description
This research aimed at the case of customers' default payments in Taiwan and compares the predictive accuracy of probability of default among six data mining methods. This research employed a binary variable, default payment (Yes = 1, No = 0), as the response variable. This study reviewed the literature and used the following 24 variables as explanatory variables.
Format
A data frame with 30000 rows and 26 variables.
Details
ID: Customer id
apply_date: This is a fake occur time.
LIMIT_BAL: Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit.
SEX: Gender (male; female).
EDUCATION: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others).
MARRIAGE: Marital status (1 = married; 2 = single; 3 = others).
AGE: Age (year)
History of past payment. We tracked the past monthly payment records (from April to September, 2005) as follows:
PAY_0: the repayment status in September
PAY_2: the repayment status in August
PAY_3: ...
PAY_4: ...
PAY_5: ...
PAY_6: the repayment status in April
The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; ...; 8 = payment delay for eight months; 9 = payment delay for nine months and above.
Amount of bill statement (NT dollar):
BILL_AMT1: amount of bill statement in September
BILL_AMT2: amount of bill statement in August
BILL_AMT3: ...
BILL_AMT4: ...
BILL_AMT5: ...
BILL_AMT6: amount of bill statement in April
Amount of previous payment (NT dollar):
PAY_AMT1: amount paid in September
PAY_AMT2: amount paid in August
PAY_AMT3: ...
PAY_AMT4: ...
PAY_AMT5: ...
PAY_AMT6: amount paid in April
default.payment.next.month: default payment (Yes = 1, No = 0), as the response variable
Source
http://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients
add_variable_process
Description
This function is not intended to be used by end user.
Usage
add_variable_process(add)
Arguments
add |
A data.frame containing address variables. |
address_varieble
Description
This function is not intended to be used by end user.
Usage
address_varieble(
df,
address_cols = NULL,
address_pattern = NULL,
parallel = TRUE
)
Arguments
df |
A data.frame. |
address_cols |
Variables of address. |
address_pattern |
Regular expressions, used to match address variable names. |
parallel |
Logical, parallel computing. Default is TRUE. |
missing Analysis
Description
analysis_nas
is for understanding the reason for missing data and the distribution of missing data, so that it can be categorised as:
missing completely at random (MCAR),
missing at random (MAR), or
missing not at random, also known as IM.
Usage
analysis_nas(
dat,
class_var = FALSE,
nas_rate = NULL,
na_vars = NULL,
mat_nas_shadow = NULL,
dt_nas_random = NULL,
...
)
Arguments
dat |
A data.frame with independent variables and target variable. |
class_var |
Logical, NAs analysis of the nominal variables. Default is FALSE. |
nas_rate |
A list contains nas rate of each variable. |
na_vars |
Names of variables which contain nas. |
mat_nas_shadow |
A shadow matrix of variables which contain nas. |
dt_nas_random |
A data.frame with random nas imputation. |
... |
Other parameters. |
Value
A data.frame with outliers analysis for each variable.
Outliers Analysis
Description
analysis_outliers
is the function for outliers analysis.
Usage
analysis_outliers(dat, target, x, lof = NULL)
Arguments
dat |
A data.frame with independent variables and target variable. |
target |
The name of target variable. |
x |
The name of variable to process. |
lof |
Outliers of each variable detected by |
Value
A data.frame with outliers analysis for each variable.
Percent Format
Description
as_percent
is a small function for formatting numbers as percentages.
Usage
as_percent(x, digits = 2)
Arguments
x |
A numeric vector or list. |
digits |
Number of digits. Default: 2. |
Value
x with percent format.
Examples
as_percent(0.2363, digits = 2)
as_percent(1)
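The formatting step can be sketched in base R. This is a minimal sketch, not the package's code: to_percent is a hypothetical helper, and the exact string returned by as_percent (for example the number of trailing zeros) may differ.

```r
# Format a numeric vector as percent strings with a fixed number of decimals
to_percent <- function(x, digits = 2) {
  paste0(formatC(100 * x, format = "f", digits = digits), "%")
}

to_percent(0.2363, digits = 2)  # "23.63%"
to_percent(1)                   # "100.00%"
```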
auc_value
Description
auc_value
is used to get the best lambda required in lasso_filter.
Usage
auc_value(target, prob)
Arguments
target |
Vector of target. |
prob |
A list of predicted probability or score. |
Value
Lambda value
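The function name and its arguments (target, prob) suggest an area-under-the-ROC-curve computation. Assuming that reading, a rank-based AUC can be sketched in base R as follows; auc_rank is a hypothetical helper, and it assumes the target is coded 0/1 with 1 as the positive class.

```r
# Rank-based AUC (Mann-Whitney U statistic divided by n_pos * n_neg)
auc_rank <- function(target, prob) {
  target <- as.integer(target)
  n_pos <- sum(target == 1)
  n_neg <- sum(target == 0)
  r <- rank(prob)  # average ranks, so ties count as one half
  (sum(r[target == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

auc_rank(target = c(0, 0, 1, 1), prob = c(0.1, 0.4, 0.35, 0.8))  # 0.75
```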
Cramer's V matrix between categorical variables.
Description
char_cor_vars
is a function for calculating Cramer's V matrix between categorical variables.
char_cor
is a function for calculating the correlation coefficient between variables by Cramer's V.
Usage
char_cor_vars(dat, x)
char_cor(dat, x_list = NULL, ex_cols = "date$", parallel = FALSE, note = FALSE)
Arguments
dat |
A data frame. |
x |
The name of variable to process. |
x_list |
Names of independent variables. |
ex_cols |
A list of excluded variables. Regular expressions can also be used to match variable names. Default is "date$". |
parallel |
Logical, parallel computing. Default is FALSE. |
note |
Logical. Outputs info. Default is FALSE. |
Value
A list contains correlation index of x with other variables in dat.
Examples
## Not run:
char_x_list = get_names(dat = UCICreditCard,
types = c('factor', 'character'),
ex_cols = "ID$|date$|default.payment.next.month$", get_ex = FALSE)
char_cor(dat = UCICreditCard[char_x_list])
## End(Not run)
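For reference, Cramer's V for one pair of categorical variables can be sketched in base R. This is a minimal sketch using the standard definition V = sqrt(chi2 / (n * (min(r, c) - 1))); cramers_v is a hypothetical helper and not the package's implementation.

```r
# Cramer's V from the chi-squared statistic of a contingency table
cramers_v <- function(x, y) {
  tab <- table(x, y)
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  n <- sum(tab)
  k <- min(nrow(tab), ncol(tab)) - 1
  as.numeric(sqrt(chi2 / (n * k)))
}

# Perfect association gives 1, independence gives 0
x <- c("a", "a", "b", "b", "c", "c")
cramers_v(x, x)
u <- rep(c("a", "b"), each = 4)
v <- rep(c("p", "q"), times = 4)
cramers_v(u, v)
```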
character to number
Description
char_to_num
is for converting character variables that actually hold numbers stored as strings to numeric.
Usage
char_to_num(
dat,
char_list = NULL,
m = 0,
p = 0.5,
note = FALSE,
ex_cols = NULL
)
Arguments
dat |
A data frame |
char_list |
The list of characteristic variables that need to merge categories. Default is NULL. In case of NULL, merge categories for all variables of string type. |
m |
The minimum number of categories. |
p |
The max percent of categories. |
note |
Logical, outputs info. Default is FALSE. |
ex_cols |
A list of excluded variables. Regular expressions can also be used to match variable names. Default is NULL. |
Value
A data.frame
Examples
dat_sub = lendingclub[c('dti_joint', 'emp_length')]
str(dat_sub)
# convert variables whose strings contain numbers to numeric
dat_sub = char_to_num(dat_sub)
str(dat_sub)
Checking Data
Description
checking_data
checks dat before processing.
Usage
checking_data(
dat = NULL,
target = NULL,
occur_time = NULL,
note = FALSE,
pos_flag = NULL
)
Arguments
dat |
A data.frame with independent variables and target variable. |
target |
The name of target variable. Default is NULL. |
occur_time |
The name of the variable that represents the time at which each observation takes place. |
note |
Logical. Outputs info. Default is FALSE. |
pos_flag |
The value of positive class of target variable, default: "1". |
Value
data.frame
Examples
dat = checking_data(dat = UCICreditCard, target = "default.payment.next.month")
city_varieble
Description
This function is used for city variables derivation.
Usage
city_varieble(
df = df,
city_cols = NULL,
city_pattern = NULL,
city_class = city_class,
parallel = TRUE
)
Arguments
df |
A data.frame. |
city_cols |
Variables of city. |
city_pattern |
Regular expressions, used to match city variable names. Default is "city$". |
city_class |
Class or levels of cities. |
parallel |
Logical, parallel computing. Default is TRUE. |
Processing of Address Variables
Description
This function is not intended to be used by end user.
Usage
city_varieble_process(df_city, x, city_class)
Arguments
df_city |
A data.frame. |
x |
Variables of city. |
city_class |
Class or levels of cities. |
cohort_table_plot
cohort_table_plot
is for plotting the cohort (vintage) analysis table.
Description
This function is not intended to be used by end user.
Usage
cohort_table_plot(cohort_dat)
cohort_plot(cohort_dat)
Arguments
cohort_dat |
A data.frame generated by |
Correlation Heat Plot
Description
cor_heat_plot
is for plotting a correlation matrix.
Usage
cor_heat_plot(
cor_mat,
low_color = love_color("deep_red"),
high_color = love_color("light_cyan"),
title = "Correlation Matrix"
)
Arguments
cor_mat |
A correlation matrix. |
low_color |
color of the lowest correlation between variables. |
high_color |
color of the highest correlation between variables. |
title |
title of plot. |
Examples
train_test = train_test_split(UCICreditCard,
split_type = "Random", prop = 0.8,save_data = FALSE)
dat_train = train_test$train
dat_test = train_test$test
cor_mat = cor(dat_train[,8:12],use = "complete.obs")
cor_heat_plot(cor_mat)
Correlation Plot
Description
cor_plot
is for plotting a correlation matrix.
Usage
cor_plot(
dat,
dir_path = tempdir(),
x_list = NULL,
gtitle = NULL,
save_data = FALSE,
plot_show = FALSE
)
Arguments
dat |
A data.frame with independent variables and target variable. |
dir_path |
The path for periodically saved graphic files. Default is tempdir(). |
x_list |
Names of independent variables. |
gtitle |
The title of the graph & The name for periodically saved graphic file. Default is "_correlation_of_variables". |
save_data |
Logical, save results in locally specified folder. Default is FALSE. |
plot_show |
Logical, show graph in current graphic device. |
Examples
train_test = train_test_split(UCICreditCard,
split_type = "Random", prop = 0.8,save_data = FALSE)
dat_train = train_test$train
dat_test = train_test$test
cor_plot(dat_train[,8:12],plot_show = TRUE)
cos_sim
Description
This function is not intended to be used by end user.
Usage
cos_sim(x, y, cos_margin = 1)
Arguments
x |
A list of numbers |
y |
A list of numbers |
cos_margin |
Margin of matrix, 1 for rows and 2 for cols, Default is 1. |
Value
The cosine similarity, a number.
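For two numeric vectors, cosine similarity is the dot product divided by the product of the vector norms. A minimal base R sketch follows; cosine_similarity is a hypothetical helper that ignores the cos_margin argument of the package function.

```r
# Cosine similarity between two numeric vectors of equal length
cosine_similarity <- function(x, y) {
  sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

cosine_similarity(c(1, 0), c(0, 1))        # orthogonal vectors: 0
cosine_similarity(c(1, 2, 3), c(2, 4, 6))  # parallel vectors: 1 (up to rounding)
```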
Customer Segmentation
Description
customer_segmentation
is a function for clustering and finding the best segment variable.
Usage
customer_segmentation(
dat,
x_list = NULL,
ex_cols = NULL,
cluster_control = list(meth = "Kmeans", kc = 2, nstart = 1, epsm = 1e-06, sf = 2,
max_iter = 100),
tree_control = list(cv_folds = 5, maxdepth = kc + 1, minbucket = nrow(dat)/(kc + 1)),
save_data = FALSE,
file_name = NULL,
dir_path = tempdir()
)
Arguments
dat |
A data.frame containing only predictor variables. |
x_list |
A list of x variables. |
ex_cols |
A list of excluded variables. Default is NULL. |
cluster_control |
A list of controls for clustering. kc is the number of cluster centers (default is 2), nstart is the number of random groups (default is 1), max_iter is the maximum number of iterations (default is 100). |
tree_control |
A list of controls for the decision tree used to find the best segment variable. |
save_data |
Logical. If TRUE, save outliers analysis file to the specified folder at |
file_name |
The name for periodically saved segmentation file. Default is NULL. |
dir_path |
The path for periodically saved segmentation file. |
Value
A "data.frame" object contains cluster results.
References
Bezdek, James C. "FCM: The fuzzy c-means clustering algorithm". Computers & Geosciences (0098-3004),doi: 10.1016/0098-3004(84)90020-7
Examples
clust = customer_segmentation(dat = lendingclub[1:10000,20:30],
x_list = NULL, ex_cols = "id$|loan_status",
cluster_control = list(meth = "FCM", kc = 2), save_data = FALSE,
tree_control = list(minbucket = round(nrow(lendingclub) / 10)),
file_name = NULL, dir_path = tempdir())
Generating Initial Equal Size Sample Bins
Description
cut_equal
is used to generate initial breaks for equal frequency binning.
Usage
cut_equal(dat_x, g = 10, sp_values = NULL, cut_bin = "equal_depth")
Arguments
dat_x |
A vector of a variable x. |
g |
numeric, number of initial bins for equal_bins. |
sp_values |
a list of special value. Default: list(-1, "missing") |
cut_bin |
A string, 'equal_depth' or 'equal_width', default is 'equal_depth'. |
See Also
get_breaks, get_breaks_all, get_tree_breaks
Examples
#equal sample size breaks
equ_breaks = cut_equal(dat_x = UCICreditCard[, "PAY_AMT2"], g = 10)
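The difference between the two cut_bin options can be sketched in base R. This is a minimal sketch under the usual definitions (quantile-based breaks for equal depth, evenly spaced breaks for equal width); both helpers are hypothetical and ignore sp_values.

```r
# Equal depth: breaks at quantiles, so each bin holds about the same count
equal_depth_breaks <- function(x, g = 10) {
  probs <- seq(0, 1, length.out = g + 1)
  unique(as.numeric(quantile(x, probs = probs, na.rm = TRUE)))
}

# Equal width: breaks evenly spaced between the minimum and the maximum
equal_width_breaks <- function(x, g = 10) {
  seq(min(x, na.rm = TRUE), max(x, na.rm = TRUE), length.out = g + 1)
}

x <- 1:100
depth_breaks <- equal_depth_breaks(x, g = 4)
counts <- table(cut(x, breaks = depth_breaks, include.lowest = TRUE))
counts                          # 25 observations in each of the 4 bins
equal_width_breaks(x, g = 4)
```

On skewed variables the quantile breaks adapt to the data, while equal-width bins can leave most observations in one bin.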
Stratified Folds
Description
This function creates stratified folds for cross validation.
Usage
cv_split(dat, k = 5, occur_time = NULL, seed = 46)
Arguments
dat |
A data.frame. |
k |
k is an integer specifying the number of folds. |
occur_time |
time variable for creating OOT folds. Default is NULL. |
seed |
A seed. Default is 46. |
Value
a list of indices
Examples
sub = cv_split(UCICreditCard, k = 30)[[1]]
dat = UCICreditCard[sub,]
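Stratification means that each fold keeps roughly the same class mix as the full data. A minimal base R sketch follows; stratified_folds is a hypothetical helper that stratifies on a label vector y, which is an assumption, since cv_split itself takes a data.frame and its stratification variable is not described here.

```r
# Assign observations to k folds separately within each class of y,
# so that every fold preserves the class proportions.
stratified_folds <- function(y, k = 5, seed = 46) {
  set.seed(seed)
  fold_id <- integer(length(y))
  for (cls in unique(y)) {
    idx <- which(y == cls)
    idx <- idx[sample.int(length(idx))]          # shuffle within the class
    fold_id[idx] <- rep_len(seq_len(k), length(idx))
  }
  split(seq_along(y), fold_id)                   # list of row indices per fold
}

y <- rep(c(0, 1), times = c(80, 20))
folds <- stratified_folds(y, k = 5)
sapply(folds, function(i) mean(y[i]))            # positive rate of each fold
```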
Data Cleaning
Description
The data_cleansing
function is a simple wrapper for data cleaning functions, such as:
delete variables whose values are all NAs;
check dat and target format;
delete low variance variables;
replace null or NULL or blank with NA;
encode variables whose NAs & miss value rate is more than 95%;
encode variables whose unique value rate is more than 95%;
merge categories of character variables that have more than 10 categories;
transfer time variables to date format;
remove duplicated observations;
process outliers;
process NAs.
Usage
data_cleansing(
dat,
target = NULL,
obs_id = NULL,
occur_time = NULL,
pos_flag = NULL,
x_list = NULL,
ex_cols = NULL,
miss_values = NULL,
remove_dup = TRUE,
outlier_proc = TRUE,
missing_proc = "median",
low_var = 0.999,
missing_rate = 0.999,
merge_cat = TRUE,
note = TRUE,
parallel = FALSE,
save_data = FALSE,
file_name = NULL,
dir_path = tempdir()
)
Arguments
dat |
A data frame with x and target. |
target |
The name of target variable. |
obs_id |
The name of ID of observations. Default is NULL. |
occur_time |
The name of occur time of observations. Default is NULL. |
pos_flag |
The value of positive class of target variable, default: "1". |
x_list |
A list of x variables. |
ex_cols |
A list of excluded variables. Default is NULL. |
miss_values |
Other extreme value might be used to represent missing values, e.g: -9999, -9998. These miss_values will be encoded to -1 or "missing". |
remove_dup |
Logical, if TRUE, remove the duplicated observations. |
outlier_proc |
Logical, process outliers or not. Default is TRUE. |
missing_proc |
If logical, process missing values or not. If "median", NAs are imputed with the median of k neighbors. If "avg_dist", the distance weighted average method is applied to impute NAs from k neighbors. If "default", missing values are assigned to -1 or "missing"; otherwise, missing values are processed according to the results of missing analysis. |
low_var |
The maximum percent of unique values (including NAs) for filtering low variance variables. |
missing_rate |
The maximum percent of missing values for recoding values to missing and non_missing. |
merge_cat |
The minimum number of categories for merging categories of character variables. |
note |
Logical. Outputs info. Default is TRUE. |
parallel |
Logical, parallel computing or not. Default is FALSE. |
save_data |
Logical, save the result or not. Default is FALSE. |
file_name |
The name for periodically saved data file. Default is NULL. |
dir_path |
The path for periodically saved data file. Default is tempdir(). |
Value
A preprocessed data.frame
See Also
remove_duplicated, null_blank_na, entry_rate_na, low_variance_filter, process_nas, process_outliers
Examples
#data cleaning
dat_cl = data_cleansing(dat = UCICreditCard[1:2000,],
target = "default.payment.next.month",
x_list = NULL,
obs_id = "ID",
occur_time = "apply_date",
ex_cols = c("PAY_6|BILL_"),
outlier_proc = TRUE,
missing_proc = TRUE,
low_var = TRUE,
save_data = FALSE)
Data Exploration
Description
The data_exploration
function includes both univariate and bivariate analysis and ranges from univariate statistics and frequency distributions, to correlations, cross-tabulation and characteristic analysis.
Usage
data_exploration(
dat,
save_data = FALSE,
file_name = NULL,
dir_path = tempdir(),
note = FALSE
)
Arguments
dat |
A data.frame with x and target. |
save_data |
Logical. If TRUE, save files to the specified folder at |
file_name |
The file name for periodically saved outliers analysis file. Default is NULL. |
dir_path |
The path for periodically saved outliers analysis file. Default is tempdir(). |
note |
Logical, outputs info. Default is FALSE. |
Value
A list containing both category and numeric variable analysis.
Examples
data_ex = data_exploration(dat = UCICreditCard[1:1000,])
Date Time Cut Point
Description
date_cut
is a small function to get date point.
Usage
date_cut(dat_time, pct = 0.7, g = 100)
Arguments
dat_time |
time vectors. |
pct |
the percent of cutting. Default: 0.7. |
g |
Number of cuts. |
Value
A Date.
Examples
date_cut(dat_time = lendingclub$issue_d, pct = 0.8)
#"2018-08-01"
Recovery One-Hot Encoding
Description
de_one_hot_encoding
is for one-hot encoding recovery processing
Usage
de_one_hot_encoding(dat_one_hot, cat_vars = NULL, na_act = TRUE, note = FALSE)
Arguments
dat_one_hot |
A data frame with the one-hot encoding variables. |
cat_vars |
Variables to be recovery processed. Default is NULL; if NULL, these variables are found through regular expressions. |
na_act |
Logical. If TRUE, the missing value is assigned as "missing"; if FALSE, the missing value is omitted. Default is TRUE. |
note |
Logical. Outputs info. Default is FALSE. |
Value
A data frame with the character variables recovered from one-hot encoding.
Examples
#one hot encoding
dat1 = one_hot_encoding(dat = UCICreditCard,
cat_vars = c("SEX", "MARRIAGE"),
merge_cat = TRUE, na_act = TRUE)
#de one hot encoding
dat2 = de_one_hot_encoding(dat_one_hot = dat1,
cat_vars = c("SEX","MARRIAGE"),
na_act = FALSE)
Recovery Percent Format
Description
de_percent
is a small function for recovering numbers from percent format.
Usage
de_percent(x, digits = 2)
Arguments
x |
Character with percent format. |
digits |
Number of digits. Default: 2. |
Value
x without percent format.
Examples
de_percent("24%")
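The reverse conversion can be sketched in base R. This is a minimal sketch, not the package's code: from_percent is a hypothetical helper, and it assumes that digits is the number of decimals kept in the returned proportion.

```r
# Strip the percent sign and rescale to a proportion
from_percent <- function(x, digits = 2) {
  round(as.numeric(sub("%", "", x, fixed = TRUE)) / 100, digits)
}

from_percent("24%")                  # 0.24
from_percent("23.63%", digits = 4)   # 0.2363
```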
derived_interval
Description
This function is not intended to be used by end user.
Usage
derived_interval(dat_s, interval_type = c("cnt_interval", "time_interval"))
Arguments
dat_s |
A data.frame containing only predictor variables. |
interval_type |
One of c("cnt_interval", "time_interval"). |
derived_partial_acf
Description
This function is not intended to be used by end user.
Usage
derived_partial_acf(dat_s)
Arguments
dat_s |
A data.frame |
derived_pct
Description
This function is not intended to be used by end user.
Usage
derived_pct(dat_s, pct_type = "total_pct")
Arguments
dat_s |
A data.frame containing only predictor variables. |
pct_type |
Available option: "total_pct". |
Derivation of Behavioral Variables
Description
This function is used for deriving behavioral variables and is not intended to be used by end user.
Usage
derived_ts_vars(
dat,
grx = NULL,
td = NULL,
ID = NULL,
ex_cols = NULL,
x_list = NULL,
der = c("cvs", "sums", "means", "maxs", "max_mins", "time_intervals",
"cnt_intervals", "total_pcts", "cum_pcts", "partial_acfs"),
parallel = TRUE,
note = TRUE
)
derived_ts(
dat,
grx_x = NULL,
x_list = NULL,
td = NULL,
ID = NULL,
ex_cols = NULL,
der = c("cvs", "sums", "means", "maxs", "max_mins", "time_intervals",
"cnt_intervals", "total_pcts", "cum_pcts", "partial_acfs")
)
Arguments
dat |
A data.frame containing only predictor variables. |
grx |
Regular expressions used to match variable names. |
td |
Number of variables to derive. |
ID |
The name of ID of observations or key variable of data. Default is NULL. |
ex_cols |
A list of excluded variables. Regular expressions can also be used to match variable names. Default is NULL. |
x_list |
Names of independent variables. |
der |
Variables to derive. |
parallel |
Logical, parallel computing. Default is TRUE. |
note |
Logical, outputs info. Default is TRUE. |
grx_x |
Regular expression used to match a group of variable names. |
Details
The key to creating a good model is not the power of a specific modelling technique, but the breadth and depth of derived variables that represent a higher level of knowledge about the phenomena under examination.
Number of digits
Description
digits_num
is for calculating the optimal number of digits for numeric variables.
Usage
digits_num(dat_x)
Arguments
dat_x |
A numeric variable. |
Value
A number of digits
Examples
## Not run:
digits_num(lendingclub[,"dti"])
# 7
## End(Not run)
Entropy Weight Method
Description
entropy_weight
is for calculating Entropy Weight.
Usage
entropy_weight(dat, pos_vars, neg_vars)
Arguments
dat |
A data.frame with independent variables. |
pos_vars |
Names or index of positive direction variables, the bigger the better. |
neg_vars |
Names or index of negative direction variables, the smaller the better. |
Details
Step 1: Raw data normalization.
Step 2: Find out the total amount of contributions of all samples to the index Xj.
Step 3: Each element of the matrix generated in the previous step is transformed into the product of the element and LN(element), and the information entropy is calculated.
Step 4: Calculate redundancy.
Step 5: Calculate the weight of each index.
Value
A data.frame with weights of each variable.
Examples
entropy_weight(dat = ewm_data,
pos_vars = c(6,8,9,10),
neg_vars = c(7,11))
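The five steps listed under Details can be sketched in base R. This is a minimal sketch of the textbook entropy weight method, assuming min-max normalization in Step 1 and the convention 0 * ln(0) = 0; entropy_weights is a hypothetical helper, not the package's implementation, and it does not handle constant columns.

```r
entropy_weights <- function(dat, pos_vars, neg_vars = NULL) {
  # Step 1: min-max normalization, reversed for "smaller is better" variables
  rescale <- function(v, positive) {
    rng <- range(v)
    if (positive) (v - rng[1]) / (rng[2] - rng[1]) else (rng[2] - v) / (rng[2] - rng[1])
  }
  pos <- lapply(dat[pos_vars], rescale, positive = TRUE)
  neg <- lapply(dat[neg_vars], rescale, positive = FALSE)
  x <- as.matrix(as.data.frame(c(pos, neg)))
  # Step 2: contribution of each sample to each index
  p <- sweep(x, 2, colSums(x), "/")
  # Step 3: information entropy of each index, with 0 * ln(0) treated as 0
  plogp <- ifelse(p > 0, p * log(p), 0)
  e <- -colSums(plogp) / log(nrow(x))
  # Step 4: redundancy
  d <- 1 - e
  # Step 5: weights
  d / sum(d)
}

toy <- data.frame(quality = c(1, 2, 3, 4), cost = c(10, 10, 10, 40))
w <- entropy_weights(toy, pos_vars = "quality", neg_vars = "cost")
w
```

Indexes whose values are spread unevenly across samples have lower entropy and therefore receive a larger weight.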
Max Percent of missing Value
Description
entry_rate_na
is the function to recode variables with missing values up to a certain percentage with missing and non_missing.
Usage
entry_rate_na(dat, nr = 0.98, note = FALSE)
Arguments
dat |
A data frame with x and target. |
nr |
The maximum percent of NAs. |
note |
Logical. Outputs info. Default is FALSE. |
Value
A data.frame
Examples
datss = entry_rate_na(dat = lendingclub[1:1000, ], nr = 0.98)
euclid_dist
Description
This function is not intended to be used by end user.
Usage
euclid_dist(x, y, cos_margin = 1)
Arguments
x |
A list |
y |
A list |
cos_margin |
rows or cols |
Functions of xgboost feval
Description
eval_auc, eval_ks, eval_lift, eval_tnr
are for getting the best params of xgboost.
Usage
eval_auc(preds, dtrain)
eval_ks(preds, dtrain)
eval_tnr(preds, dtrain)
eval_lift(preds, dtrain)
Arguments
preds |
A list of predict probability or score. |
dtrain |
Matrix of x predictors. |
Value
List of best value
Entropy Weight Method Data
Description
This data is for Entropy Weight Method examples.
Format
A data frame with 10 rows and 13 variables.
high_cor_filter
Description
fast_high_cor_filter
In a highly correlated variable group, select the variable with the highest IV.
high_cor_filter
In a highly correlated variable group, select the variable with the highest IV.
Usage
fast_high_cor_filter(
dat,
p = 0.95,
x_list = NULL,
com_list = NULL,
ex_cols = NULL,
save_data = FALSE,
cor_class = TRUE,
vars_name = TRUE,
parallel = FALSE,
note = FALSE,
file_name = NULL,
dir_path = tempdir(),
...
)
high_cor_filter(
dat,
com_list = NULL,
x_list = NULL,
ex_cols = NULL,
onehot = TRUE,
parallel = FALSE,
p = 0.7,
file_name = NULL,
dir_path = tempdir(),
save_data = FALSE,
note = FALSE,
...
)
Arguments
dat |
A data.frame with independent variables. |
p |
Threshold of correlation between features. Default is 0.95. |
x_list |
Names of independent variables. |
com_list |
A data.frame with important values of each variable. eg : IV_list |
ex_cols |
A list of excluded variables. Regular expressions can also be used to match variable names. Default is NULL. |
save_data |
Logical, save results in locally specified folder. Default is FALSE. |
cor_class |
Calculate the correlation matrix of categorical variables. Default is TRUE. |
vars_name |
Logical, output a list of filtered variables or table with detailed compared value of each variable. Default is TRUE. |
parallel |
Logical, parallel computing. Default is FALSE. |
note |
Logical. Outputs info. Default is TRUE. |
file_name |
The name for periodically saved results files. Default is "Feature_selected_COR". |
dir_path |
The path for periodically saved results files. Default is tempdir(). |
... |
Additional parameters. |
onehot |
one-hot-encoding independent variables. |
Value
A list of selected variables.
See Also
get_correlation_group, high_cor_selector, char_cor_vars
Examples
# calculate iv for each variable.
iv_list = feature_selector(dat_train = UCICreditCard[1:1000,], dat_test = NULL,
target = "default.payment.next.month",
occur_time = "apply_date",
filter = c("IV"), cv_folds = 1, iv_cp = 0.01,
ex_cols = "ID$|date$|default.payment.next.month$",
save_data = FALSE, vars_name = FALSE)
fast_high_cor_filter(dat = UCICreditCard[1:1000,],
com_list = iv_list, save_data = FALSE,
ex_cols = "ID$|date$|default.payment.next.month$",
p = 0.9, cor_class = FALSE, vars_name = FALSE)
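The selection rule described above (within a highly correlated group, keep the variable with the highest IV) can be sketched as a greedy filter in base R. This is a minimal sketch under that reading; cor_filter is a hypothetical helper for numeric variables only, and the importance values are made up for illustration.

```r
# Walk through variables from most to least important and keep a variable
# only if its absolute correlation with every kept variable is at most p.
cor_filter <- function(dat, importance, p = 0.95) {
  vars <- names(sort(importance, decreasing = TRUE))
  cor_mat <- abs(cor(dat[vars], use = "pairwise.complete.obs"))
  kept <- character(0)
  for (v in vars) {
    if (length(kept) == 0 || all(cor_mat[v, kept] <= p)) kept <- c(kept, v)
  }
  kept
}

set.seed(46)
z <- rnorm(300)
toy <- data.frame(x1 = z, x2 = z + rnorm(300, sd = 0.05), x3 = rnorm(300))
iv <- c(x1 = 0.30, x2 = 0.25, x3 = 0.10)   # illustrative importance values
kept_vars <- cor_filter(toy, iv, p = 0.9)
kept_vars                                   # x2 is dropped in favour of x1
```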
Feature Selection Wrapper
Description
feature_selector
uses four different methods (IV, PSI, correlation, xgboost) to select important features. The correlation algorithm must be used with IV.
Usage
feature_selector(
dat_train,
dat_test = NULL,
x_list = NULL,
target = NULL,
pos_flag = NULL,
occur_time = NULL,
ex_cols = NULL,
filter = c("IV", "PSI", "XGB", "COR"),
cv_folds = 1,
iv_cp = 0.01,
psi_cp = 0.5,
xgb_cp = 0,
cor_cp = 0.98,
breaks_list = NULL,
hopper = FALSE,
vars_name = TRUE,
parallel = FALSE,
note = TRUE,
seed = 46,
save_data = FALSE,
file_name = NULL,
dir_path = tempdir(),
...
)
Arguments
dat_train |
A data.frame with independent variables and target variable. |
dat_test |
A data.frame of test data. Default is NULL. |
x_list |
Names of independent variables. |
target |
The name of target variable. |
pos_flag |
The value of positive class of target variable, default: "1". |
occur_time |
The name of the variable that represents the time at which each observation takes place. |
ex_cols |
A list of excluded variables. Regular expressions can also be used to match variable names. Default is NULL. |
filter |
The methods for selecting important and stable variables. |
cv_folds |
Number of cross-validations. Default: 1. |
iv_cp |
The minimum threshold of IV. 0 < iv_i ; 0.01 to 0.1 usually work. Default: 0.01 |
psi_cp |
The maximum threshold of PSI. 0 <= psi_i <=1; 0.05 to 0.2 usually work. Default: 0.5 |
xgb_cp |
Threshold of XGB feature's Gain. 0 <= xgb_cp <=1. Default is 1/number of independent variables. |
cor_cp |
Threshold of correlation between features. 0 <= cor_cp <=1; 0.7 to 0.98 usually work. Default is 0.98. |
breaks_list |
A table containing a list of splitting points for each independent variable. Default is NULL. |
hopper |
Logical. Filtering screening. Default is FALSE. |
vars_name |
Logical, output a list of filtered variables or table with detailed IV and PSI value of each variable. Default is TRUE. |
parallel |
Logical, parallel computing. Default is FALSE. |
note |
Logical. Outputs info. Default is TRUE. |
seed |
Random number seed. Default is 46. |
save_data |
Logical, save results in locally specified folder. Default is FALSE. |
file_name |
The name for periodically saved results files. Default is "select_vars". |
dir_path |
The path for periodically saved results files. Default is tempdir(). |
... |
Other parameters. |
Value
A list of selected features
See Also
psi_iv_filter, xgb_filter, gbm_filter
Examples
feature_selector(dat_train = UCICreditCard[1:1000,c(2,8:12,26)],
dat_test = NULL, target = "default.payment.next.month",
occur_time = "apply_date", filter = c("IV", "PSI"),
cv_folds = 1, iv_cp = 0.01, psi_cp = 0.1, xgb_cp = 0, cor_cp = 0.98,
vars_name = FALSE,note = FALSE)
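To make the iv_cp and psi_cp thresholds concrete, the two statistics can be sketched in base R for an already binned variable. This is a minimal sketch using the standard definitions of information value and population stability index; iv_value and psi_value are hypothetical helpers, they assume a 0/1 target, and they do not smooth empty bins (which would produce infinite values).

```r
# Information value of a binned variable against a 0/1 target
iv_value <- function(bin, target) {
  tab <- table(bin, target)
  pct_good <- tab[, "0"] / sum(tab[, "0"])
  pct_bad  <- tab[, "1"] / sum(tab[, "1"])
  sum((pct_bad - pct_good) * log(pct_bad / pct_good))
}

# Population stability index between an expected and an actual sample
psi_value <- function(expected_bin, actual_bin) {
  lv <- union(unique(expected_bin), unique(actual_bin))
  e <- prop.table(table(factor(expected_bin, levels = lv)))
  a <- prop.table(table(factor(actual_bin, levels = lv)))
  sum((a - e) * log(a / e))
}

bin    <- rep(c("low", "high"), each = 50)
target <- c(rep(0, 40), rep(1, 10), rep(0, 20), rep(1, 30))
iv  <- iv_value(bin, target)
psi <- psi_value(bin, rep(c("low", "high"), times = c(70, 30)))

# A variable passes when it is predictive enough and stable enough
keep <- iv >= 0.01 && psi <= 0.5
c(iv = iv, psi = psi, keep = keep)
```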
Fuzzy Cluster means.
Description
This function is used for Fuzzy Clustering.
Usage
fuzzy_cluster_means(
dat,
kc = 2,
sf = 2,
nstart = 1,
max_iter = 100,
epsm = 1e-06
)
fuzzy_cluster(dat, kc = 2, init_centers, sf = 3, max_iter = 100, epsm = 1e-06)
Arguments
dat |
A data.frame containing only predictor variables. |
kc |
The number of cluster centers (default is 2). |
sf |
Default is 2. |
nstart |
The number of random groups (default is 1). |
max_iter |
Maximum number of iterations (default is 100). |
epsm |
Default is 1e-06. |
init_centers |
Initial centers of obs. |
References
Bezdek, James C. "FCM: The fuzzy c-means clustering algorithm". Computers & Geosciences (0098-3004),doi: 10.1016/0098-3004(84)90020-7
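The fuzzy c-means iteration from the reference above can be sketched in base R. This is a minimal sketch of the standard algorithm, not the package's code: fcm_sketch is a hypothetical helper, its argument m is the fuzzifier (assumed here to correspond to sf), and the initial centers are supplied by the caller instead of being drawn at random.

```r
fcm_sketch <- function(x, centers, m = 2, max_iter = 100, eps = 1e-6) {
  x <- as.matrix(x)
  centers <- as.matrix(centers)
  kc <- nrow(centers)
  u <- matrix(0, nrow(x), kc)
  for (iter in seq_len(max_iter)) {
    # squared Euclidean distance from every observation to every center
    d2 <- sapply(seq_len(kc), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    d2 <- pmax(d2, .Machine$double.eps)      # guard against division by zero
    # membership update: u_ij proportional to d_ij^(-2 / (m - 1))
    w <- d2^(-1 / (m - 1))
    u_new <- w / rowSums(w)
    # center update: mean of the observations weighted by u_ij^m
    um <- u_new^m
    centers <- t(um) %*% x / colSums(um)
    converged <- max(abs(u_new - u)) < eps
    u <- u_new
    if (converged) break
  }
  list(centers = centers, membership = u,
       cluster = max.col(u, ties.method = "first"))
}

# Two well separated groups of 20 points each
set.seed(1)
x <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2))
fit <- fcm_sketch(x, centers = x[c(1, 40), ])
table(fit$cluster, rep(1:2, each = 20))
```

Unlike k-means, every observation receives a membership degree for every cluster; the hard cluster label is simply the column with the largest membership.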
gather or aggregate data
Description
This function is used for gathering or aggregating data.
Usage
gather_data(dat, x_list = NULL, ID = NULL, FUN = sum_x)
Arguments
dat |
A data.frame containing only predictor variables. |
x_list |
The names of variables to gather. |
ID |
The name of ID of observations or key variable of data. Default is NULL. |
FUN |
The function of gathering method. |
Details
The key to creating a good model is not the power of a specific modelling technique, but the breadth and depth of derived variables that represent a higher level of knowledge about the phenomena under examination.
Examples
dat = data.frame(id = c(1,1,1,2,2,3,3,3,4,4,4,4,4,5,5,6,7,7,
8,8,8,9,9,9,10,10,11,11,11,11,11,11),
terms = c('a','b','c','a','c','d','d','a',
'b','c','a','c','d','a','c',
'd','a','e','f','b','c','f','b',
'c','h','h','i','c','d','g','k','k'),
time = c(8,3,1,9,6,1,4,9,1,3,4,8,2,7,1,
3,4,1,8,7,2,5,7,8,8,2,1,5,7,2,7,3))
gather_data(dat = dat, x_list = "time", ID = 'id', FUN = sum_x)
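For readers who want to see the idea of aggregating by a key without the package, here is a minimal base R sketch. It assumes base sum() as a stand-in for sum_x, whose exact behavior (for example NA handling) is not described in this section.

```r
# Collapse several rows per id into one derived value per id (base R only).
dat <- data.frame(id   = c(1, 1, 1, 2, 2, 3),
                  time = c(8, 3, 1, 9, 6, 1))
# one row per id, with the values of `time` collapsed by sum()
agg <- aggregate(time ~ id, data = dat, FUN = sum)
agg
```

Swapping FUN for max, mean or length yields other derived variables from the same key.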
Select Features using GBM
Description
gbm_filter selects important features using GBM.
Usage
gbm_filter(
dat,
target = NULL,
x_list = NULL,
ex_cols = NULL,
pos_flag = NULL,
GBM.params = gbm_params(),
cores_num = 2,
vars_name = TRUE,
note = TRUE,
save_data = FALSE,
file_name = NULL,
dir_path = tempdir(),
seed = 46,
...
)
Arguments
dat |
A data.frame with independent variables and target variable. |
target |
The name of target variable. |
x_list |
Names of independent variables. |
ex_cols |
A list of excluded variables. Regular expressions can also be used to match variable names. Default is NULL. |
pos_flag |
The value of positive class of target variable, default: "1". |
GBM.params |
Parameters of GBM. |
cores_num |
The number of CPU cores to use. |
vars_name |
Logical, output a list of filtered variables or table with detailed IV and PSI value of each variable. Default is TRUE. |
note |
Logical, outputs info. Default is TRUE. |
save_data |
Logical, save results in locally specified folder. Default is FALSE. |
file_name |
The name for periodically saved results files. Default is "Feature_importance_GBDT". |
dir_path |
The path for periodically saved results files. Default is tempdir(). |
seed |
Random number seed. Default is 46. |
... |
Other parameters to pass to gbdt_params. |
Value
Selected variables.
See Also
psi_iv_filter, xgb_filter, feature_selector
Examples
GBM.params = gbm_params(n.trees = 2, interaction.depth = 2, shrinkage = 0.1,
bag.fraction = 1, train.fraction = 1,
n.minobsinnode = 30,
cv.folds = 2)
## Not run:
features = gbm_filter(dat = UCICreditCard[1:1000, c(8:12, 26)],
target = "default.payment.next.month",
occur_time = "apply_date",
GBM.params = GBM.params,
vars_name = FALSE)
## End(Not run)
GBM Parameters
Description
gbm_params is the list of parameters used to train a GBM in training_model.
Usage
gbm_params(
n.trees = 1000,
interaction.depth = 6,
shrinkage = 0.01,
bag.fraction = 0.5,
train.fraction = 0.7,
n.minobsinnode = 30,
cv.folds = 5,
...
)
Arguments
n.trees |
Integer specifying the total number of trees to fit. This is equivalent to the number of iterations and the number of basis functions in the additive expansion. Default is 1000. |
interaction.depth |
Integer specifying the maximum depth of each tree (i.e., the highest level of variable interactions allowed). A value of 1 implies an additive model, a value of 2 implies a model with up to 2-way interactions, etc. Default is 6. |
shrinkage |
A shrinkage parameter applied to each tree in the expansion. Also known as the learning rate or step-size reduction; 0.001 to 0.1 usually work, but a smaller learning rate typically requires more trees. Default is 0.01. |
bag.fraction |
The fraction of the training set observations randomly selected to propose the next tree in the expansion. This introduces randomness into the model fit. If bag.fraction < 1, running the same model twice will result in similar but different fits. gbm uses the R random number generator, so set.seed can ensure that the model can be reconstructed. Preferably, the user can save the returned gbm.object using save. Default is 0.5. |
train.fraction |
The first train.fraction * nrows(data) observations are used to fit the gbm and the remainder are used for computing out-of-sample estimates of the loss function. Default is 0.7. |
n.minobsinnode |
Integer specifying the minimum number of observations in the terminal nodes of the trees. Note that this is the actual number of observations, not the total weight. Default is 30. |
cv.folds |
Number of cross-validation folds to perform. If cv.folds > 1 then gbm, in addition to the usual fit, will perform a cross-validation and calculate an estimate of generalization error returned in cv.error. Default is 5. |
... |
Other parameters |
Details
See details at: gbm
Value
A list of parameters.
See Also
training_model, lr_params, xgb_params, rf_params
get_auc_ks_lambda
Description
get_auc_ks_lambda gets the best lambda for lasso_filter. This function is required by lasso_filter.
Usage
get_auc_ks_lambda(
lasso_model,
x_test,
y_test,
save_data = FALSE,
plot_show = TRUE,
file_name = NULL,
dir_path = tempdir()
)
Arguments
lasso_model |
A lasso model generated by glmnet. |
x_test |
A matrix of test dataset with x. |
y_test |
A matrix of test dataset with y. |
save_data |
Logical, save results in locally specified folder. Default is FALSE. |
plot_show |
Logical, if TRUE plot the results. Default is TRUE. |
file_name |
The name for periodically saved results files. Default is NULL. |
dir_path |
The path for periodically saved results files. |
Value
Lambda values with max K-S and AUC.
See Also
lasso_filter, get_sim_sign_lambda
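To make "lambda with max AUC" concrete, the sketch below computes AUC from ranks and picks the best column of a prediction matrix. It is a base R illustration only: auc_sketch, best_by_auc and the layout of pred_mat (one column of test-set predictions per candidate lambda) are hypothetical, not the package's internals.

```r
# AUC via the rank-sum (Mann-Whitney) identity; average ranks handle ties.
auc_sketch <- function(score, y, pos_flag = 1) {
  pos <- y == pos_flag
  r <- rank(score)
  n_pos <- sum(pos)
  n_neg <- sum(!pos)
  (sum(r[pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# pred_mat: hypothetical matrix, one column of predictions per candidate lambda
best_by_auc <- function(pred_mat, y) {
  which.max(apply(pred_mat, 2, auc_sketch, y = y))
}

score <- c(0.9, 0.8, 0.7, 0.3, 0.2, 0.1)
y     <- c(1, 1, 0, 1, 0, 0)
auc   <- auc_sketch(score, y)
pred_mat <- cbind(lambda_1 = rev(score), lambda_2 = score)
best  <- best_by_auc(pred_mat, y)
```

The same selection could be run with a K-S function in place of auc_sketch to obtain the lambda with the largest K-S.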
Table of Binning
Description
get_bins_table is used to generate summary information of variables.
get_bins_table_all can generate bins tables for all specified independent variables.
Usage
get_bins_table_all(
dat,
x_list = NULL,
target = NULL,
pos_flag = NULL,
dat_test = NULL,
ex_cols = NULL,
breaks_list = NULL,
parallel = FALSE,
note = FALSE,
bins_total = TRUE,
save_data = FALSE,
file_name = NULL,
dir_path = tempdir()
)
get_bins_table(
dat,
x,
target = NULL,
pos_flag = NULL,
dat_test = NULL,
breaks = NULL,
breaks_list = NULL,
bins_total = TRUE,
note = FALSE
)
Arguments
dat |
A data.frame with independent variables and target variable. |
x_list |
Names of independent variables. |
target |
The name of target variable. |
pos_flag |
Value of positive class, Default is "1". |
dat_test |
A data.frame of test data. Default is NULL. |
ex_cols |
A list of excluded variables. Regular expressions can also be used to match variable names. Default is NULL. |
breaks_list |
A table containing a list of splitting points for each independent variable. Default is NULL. |
parallel |
Logical, parallel computing. Default is FALSE. |
note |
Logical, outputs info. Default is FALSE. |
bins_total |
Logical, total sum for each column. Default is TRUE. |
save_data |
Logical, save results in locally specified folder. Default is FALSE. |
file_name |
The name for periodically saved bins table file. Default is "bins_table". |
dir_path |
The path for periodically saved bins table file. Default is tempdir(). |
x |
The name of an independent variable. |
breaks |
Splitting points for an independent variable. Default is NULL. |
See Also
get_iv, get_iv_all, get_psi, get_psi_all
Examples
breaks_list = get_breaks_all(dat = UCICreditCard, x_list = names(UCICreditCard)[3:4],
target = "default.payment.next.month", equal_bins =TRUE,best = FALSE,g=5,
ex_cols = "ID|apply_date", save_data = FALSE)
get_bins_table_all(dat = UCICreditCard, breaks_list = breaks_list,
target = "default.payment.next.month")
Generates Best Breaks for Binning
Description
get_breaks generates optimal binning for numerical and nominal variables.
get_breaks_all is a simpler wrapper for get_breaks.
Usage
get_breaks_all(
dat,
target = NULL,
x_list = NULL,
ex_cols = NULL,
pos_flag = NULL,
occur_time = NULL,
oot_pct = 0.7,
best = TRUE,
equal_bins = FALSE,
cut_bin = "equal_depth",
g = 10,
sp_values = NULL,
tree_control = list(p = 0.05, cp = 1e-06, xval = 5, maxdepth = 10),
bins_control = list(bins_num = 10, bins_pct = 0.05, b_chi = 0.05, b_odds = 0.1,
b_psi = 0.05, b_or = 0.15, mono = 0.3, odds_psi = 0.2, kc = 1),
parallel = FALSE,
note = FALSE,
save_data = FALSE,
file_name = NULL,
dir_path = tempdir(),
...
)
get_breaks(
dat,
x,
target = NULL,
pos_flag = NULL,
best = TRUE,
equal_bins = FALSE,
cut_bin = "equal_depth",
g = 10,
sp_values = NULL,
occur_time = NULL,
oot_pct = 0.7,
tree_control = NULL,
bins_control = NULL,
note = FALSE,
...
)
Arguments
dat |
A data frame with x and target. |
target |
The name of target variable. |
x_list |
A list of x variables. |
ex_cols |
A list of excluded variables. Default is NULL. |
pos_flag |
The value of positive class of target variable, default: "1". |
occur_time |
The name of the variable that represents the time at which each observation takes place. |
oot_pct |
Percentage of observations retained for overtime test (especially to calculate PSI). Default is 0.7. |
best |
Logical, if TRUE, merge initial breaks to get optimal breaks for binning. |
equal_bins |
Logical, if TRUE, initial breaks are generated with equal sample sizes. If FALSE, initial breaks are generated using a decision tree. |
cut_bin |
A string, if equal_bins is TRUE, 'equal_depth' or 'equal_width', default is 'equal_depth'. |
g |
Integer, number of initial bins for equal_bins. |
sp_values |
A list of missing values. |
tree_control |
The list of tree parameters. |
bins_control |
The list of binning parameters. |
parallel |
Logical, parallel computing or not. Default is FALSE. |
note |
Logical, outputs info. Default is FALSE. |
save_data |
Logical, save results in locally specified folder. Default is FALSE. |
file_name |
File name for the results saved in the locally specified folder. Default is "breaks_list". |
dir_path |
Path to save results. Default is tempdir(). |
... |
Additional parameters. |
x |
The name of an independent variable. |
Value
A table containing a list of splitting points for each independent variable.
See Also
get_tree_breaks, cut_equal, select_best_class, select_best_breaks
Examples
#controls
tree_control = list(p = 0.02, cp = 0.000001, xval = 5, maxdepth = 10)
bins_control = list(bins_num = 10, bins_pct = 0.02, b_chi = 0.02, b_odds = 0.1,
b_psi = 0.05, b_or = 15, mono = 0.2, odds_psi = 0.1, kc = 5)
# get category variable breaks
b = get_breaks(dat = UCICreditCard[1:1000,], x = "MARRIAGE",
target = "default.payment.next.month",
occur_time = "apply_date",
sp_values = list(-1, "missing"),
tree_control = tree_control, bins_control = bins_control)
# get numeric variable breaks
b2 = get_breaks(dat = UCICreditCard[1:1000,], x = "PAY_2",
target = "default.payment.next.month",
occur_time = "apply_date",
sp_values = list(-1, "missing"),
tree_control = tree_control, bins_control = bins_control)
# get breaks of all predictive variables
b3 = get_breaks_all(dat = UCICreditCard[1:1000,], target = "default.payment.next.month",
x_list = c("MARRIAGE","PAY_2"),
occur_time = "apply_date", ex_cols = "ID",
sp_values = list(-1, "missing"),
tree_control = tree_control, bins_control = bins_control,
save_data = FALSE)
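The difference between the 'equal_depth' and 'equal_width' options of cut_bin can be illustrated with base R. This is a sketch of the two general techniques on made-up data, not the package's cut_equal code; the variable names are hypothetical.

```r
# Skewed toy variable: most values are small, one is very large.
x <- c(1, 2, 2, 3, 4, 5, 8, 13, 21, 100)
g <- 4  # number of initial bins

# equal-depth: cut points at quantiles, so bins hold roughly equal counts
depth_breaks <- unique(quantile(x, probs = seq(0, 1, length.out = g + 1),
                                names = FALSE))
# equal-width: cut points evenly spaced over the observed range
width_breaks <- seq(min(x), max(x), length.out = g + 1)

depth_bins <- table(cut(x, depth_breaks, include.lowest = TRUE))
width_bins <- table(cut(x, width_breaks, include.lowest = TRUE))
depth_bins
width_bins
```

On skewed data, equal-width bins tend to leave most observations in one bin, which is one reason equal-depth is a common starting point before bins are merged.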
get_correlation_group
Description
get_correlation_group is a function for obtaining highly correlated variable groups.
select_cor_group is a function for selecting a highly correlated variable group.
select_cor_list is a function for selecting a highly correlated variable list.
Usage
get_correlation_group(cor_mat, p = 0.8)
select_cor_group(cor_vars)
select_cor_list(cor_vars_list)
Arguments
cor_mat |
A correlation matrix of independent variables. |
p |
Threshold of correlation between features. Default is 0.8. |
cor_vars |
Correlated variables. |
cor_vars_list |
List of correlated variables. |
Value
A list of selected variables.
Examples
## Not run:
cor_mat = cor(UCICreditCard[8:20],
use = "complete.obs", method = "spearman")
get_correlation_group(cor_mat, p = 0.6 )
## End(Not run)
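What "highly correlated variable groups" means can be sketched with a small hand-built correlation matrix. The function cor_groups_sketch is hypothetical, and comparing the absolute correlation against p is an assumption; the package may group variables differently.

```r
# For each variable, list the variables whose |correlation| with it is >= p.
cor_groups_sketch <- function(cor_mat, p = 0.8) {
  vars <- colnames(cor_mat)
  high <- abs(cor_mat) >= p
  diag(high) <- FALSE                 # a variable is not its own partner
  groups <- lapply(vars, function(v) c(v, vars[high[v, ]]))
  names(groups) <- vars
  groups[lengths(groups) > 1]         # keep variables with at least one partner
}

vars <- c("x1", "x2", "x3")
cor_mat <- matrix(c(1.0, 0.9, 0.1,
                    0.9, 1.0, 0.2,
                    0.1, 0.2, 1.0),
                  nrow = 3, dimnames = list(vars, vars))
groups <- cor_groups_sketch(cor_mat, p = 0.8)
groups
```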
Calculate Information Value (IV)
Description
get_iv is used to calculate Information Value (IV) of an independent variable.
get_iv_all can loop through IV for all specified independent variables.
Usage
get_iv_all(
dat,
x_list = NULL,
ex_cols = NULL,
breaks_list = NULL,
target = NULL,
pos_flag = NULL,
best = TRUE,
equal_bins = FALSE,
tree_control = NULL,
bins_control = NULL,
g = 10,
parallel = FALSE,
note = FALSE
)
get_iv(
dat,
x,
target = NULL,
pos_flag = NULL,
breaks = NULL,
breaks_list = NULL,
best = TRUE,
equal_bins = FALSE,
tree_control = NULL,
bins_control = NULL,
g = 10,
note = FALSE
)
Arguments
dat |
A data.frame with independent variables and target variable. |
x_list |
Names of independent variables. |
ex_cols |
A list of excluded variables. Regular expressions can also be used to match variable names. Default is NULL. |
breaks_list |
A table containing a list of splitting points for each independent variable. Default is NULL. |
target |
The name of target variable. |
pos_flag |
Value of positive class, Default is "1". |
best |
Logical, merge initial breaks to get optimal breaks for binning. |
equal_bins |
Logical, generates initial breaks for equal frequency binning. |
tree_control |
Parameters for using a decision tree to segment initial breaks. See details: |
bins_control |
Parameters used to control binning. See details: |
g |
Number of initial breakpoints for equal frequency binning. |
parallel |
Logical, parallel computing. Default is FALSE. |
note |
Logical, outputs info. Default is FALSE. |
x |
The name of an independent variable. |
breaks |
Splitting points for an independent variable. Default is NULL. |
Details
IV rules of thumb for evaluating the strength of a predictor:
Less than 0.02: unpredictive
0.02 to 0.1: weak
0.1 to 0.3: medium
0.3 and above: strong
References
Information Value Statistic: Bruce Lund, Magnify Analytics Solutions, a Division of Marketing Associates, Detroit, MI (Paper AA-14-2013)
See Also
get_iv, get_iv_all, get_psi, get_psi_all
Examples
get_iv_all(dat = UCICreditCard,
x_list = names(UCICreditCard)[3:10],
equal_bins = TRUE, best = FALSE,
target = "default.payment.next.month",
ex_cols = "ID|apply_date")
get_iv(UCICreditCard, x = "PAY_3",
equal_bins = TRUE, best = FALSE,
target = "default.payment.next.month")
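The IV calculation and the rules of thumb from the Details section can be sketched in base R for an already binned variable. The names iv_sketch and iv_strength are hypothetical, and the 0.5 floor for empty cells is an assumption made here to avoid log(0), not necessarily what get_iv does.

```r
# IV for a binned variable: sum over bins of (pct_good - pct_bad) * WOE.
iv_sketch <- function(x_bin, y, pos_flag = 1) {
  bad  <- tapply(y == pos_flag, x_bin, sum)
  good <- tapply(y != pos_flag, x_bin, sum)
  # floor of 0.5 guards against log(0) in empty cells (assumption)
  pct_bad  <- pmax(bad, 0.5) / sum(bad)
  pct_good <- pmax(good, 0.5) / sum(good)
  woe <- log(pct_good / pct_bad)
  sum((pct_good - pct_bad) * woe)
}

# Map an IV value onto the rule-of-thumb labels listed in Details
iv_strength <- function(iv) {
  cut(iv, breaks = c(-Inf, 0.02, 0.1, 0.3, Inf), right = FALSE,
      labels = c("unpredictive", "weak", "medium", "strong"))
}

x_bin <- c("a", "a", "a", "a", "b", "b", "b", "b")
y     <- c(1, 0, 0, 0, 1, 1, 1, 0)
iv    <- iv_sketch(x_bin, y)
iv_strength(iv)
```

The sign convention of WOE differs between references, but IV itself does not depend on it because both factors change sign together.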
get logistic coef
Description
get_logistic_coef gets the coefficients of a logistic regression model.
Usage
get_logistic_coef(
lg_model,
file_name = NULL,
dir_path = tempdir(),
save_data = FALSE
)
Arguments
lg_model |
An object of logistic model. |
file_name |
The name for periodically saved coefficient file. Default is "LR_coef". |
dir_path |
The path for periodically saved coefficient file. Default is tempdir(). |
save_data |
Logical, save the result or not. Default is FALSE. |
Value
A data.frame with logistic coefficients.
Examples
# dataset splitting
sub = cv_split(UCICreditCard, k = 30)[[1]]
dat = UCICreditCard[sub,]
#rename the target variable
dat = re_name(dat, "default.payment.next.month", "target")
dat = data_cleansing(dat, target = "target", obs_id = "ID",
occur_time = "apply_date", miss_values = list("", -1))
# train/test splitting
train_test = train_test_split(dat, split_type = "OOT", prop = 0.7,
occur_time = "apply_date")
dat_train = train_test$train
dat_test = train_test$test
#get breaks of all predictive variables
x_list = c("PAY_0", "LIMIT_BAL", "PAY_AMT5", "EDUCATION", "PAY_3", "PAY_2")
breaks_list = get_breaks_all(dat = dat_train, target = "target",
x_list = x_list, occur_time = "apply_date", ex_cols = "ID",
save_data = FALSE, note = FALSE)
#woe transforming
train_woe = woe_trans_all(dat = dat_train,
target = "target",
breaks_list = breaks_list,
woe_name = FALSE)
test_woe = woe_trans_all(dat = dat_test,
target = "target",
breaks_list = breaks_list,
note = FALSE)
Formula = as.formula(paste("target", paste(x_list, collapse = ' + '), sep = ' ~ '))
set.seed(46)
lr_model = glm(Formula, data = train_woe[, c("target", x_list)], family = binomial(logit))
#get LR coefficient
dt_imp_LR = get_logistic_coef(lg_model = lr_model, save_data = FALSE)
bins_table = get_bins_table_all(dat = dat_train, target = "target",
x_list = x_list,dat_test = dat_test,
breaks_list = breaks_list, note = FALSE)
#score card
LR_score_card = get_score_card(lg_model = lr_model, bins_table, target = "target")
#scoring
train_pred = dat_train[, c("ID", "apply_date", "target")]
test_pred = dat_test[, c("ID", "apply_date", "target")]
train_pred$pred_LR = score_transfer(model = lr_model,
tbl_woe = train_woe,
save_data = TRUE)[, "score"]
test_pred$pred_LR = score_transfer(model = lr_model,
tbl_woe = test_woe, save_data = FALSE)[, "score"]
get central value.
Description
This function is not intended to be used by end user.
Usage
get_median(x, weight_avg = NULL)
Arguments
x |
A vector or list. |
weight_avg |
avg weight to calculate means. |
Get Variable Names
Description
get_names gets the names of variables of particular classes.
Usage
get_names(
dat,
types = c("logical", "factor", "character", "numeric", "integer64", "integer",
"double", "Date", "POSIXlt", "POSIXct", "POSIXt"),
ex_cols = NULL,
get_ex = FALSE
)
Arguments
dat |
A data.frame with independent variables and target variable. |
types |
The classes or types of variables whose names to get. Default: c("logical", "factor", "character", "numeric", "integer64", "integer", "double", "Date", "POSIXlt", "POSIXct", "POSIXt"). |
ex_cols |
A list of excluded variables. Regular expressions can also be used to match variable names. Default is NULL. |
get_ex |
Logical, if TRUE, return a list containing the names of excluded variables. |
Value
A list containing the names of variables.
Examples
x_list = get_names(dat = UCICreditCard, types = c('factor', 'character'),
ex_cols = c("default.payment.next.month","ID$|_date$"), get_ex = FALSE)
x_list = get_names(dat = UCICreditCard, types = c('numeric', 'character', "integer"),
ex_cols = c("default.payment.next.month", "ID$|SEX "), get_ex = FALSE)
get_nas_random
Description
This function is not intended to be used by end user.
Usage
get_nas_random(dat)
Arguments
dat |
A data.frame containing only predictor variables. |
Calculate Population Stability Index (PSI)
Description
get_psi is used to calculate Population Stability Index (PSI) of an independent variable.
get_psi_all can loop through PSI for all specified independent variables.
Usage
get_psi_all(
dat,
x_list = NULL,
target = NULL,
dat_test = NULL,
breaks_list = NULL,
occur_time = NULL,
start_date = NULL,
cut_date = NULL,
oot_pct = 0.7,
pos_flag = NULL,
parallel = FALSE,
ex_cols = NULL,
as_table = FALSE,
g = 10,
bins_no = TRUE,
note = FALSE
)
get_psi(
dat,
x,
target = NULL,
dat_test = NULL,
occur_time = NULL,
start_date = NULL,
cut_date = NULL,
pos_flag = NULL,
breaks = NULL,
breaks_list = NULL,
oot_pct = 0.7,
g = 10,
as_table = TRUE,
note = FALSE,
bins_no = TRUE
)
Arguments
dat |
A data.frame with independent variables and target variable. |
x_list |
Names of independent variables. |
target |
The name of target variable. |
dat_test |
A data.frame of test data. Default is NULL. |
breaks_list |
A table containing a list of splitting points for each independent variable. Default is NULL. |
occur_time |
The name of the variable that represents the time at which each observation takes place. |
start_date |
The earliest occurrence time of observations. |
cut_date |
Time points for splitting data sets, e.g., splitting Actual and Expected data sets. |
oot_pct |
Percentage of observations retained for overtime test (especially to calculate PSI). Default is 0.7. |
pos_flag |
Value of positive class, Default is "1". |
parallel |
Logical, parallel computing. Default is FALSE. |
ex_cols |
Names of excluded variables. Regular expressions can also be used to match variable names. Default is NULL. |
as_table |
Logical, output results in a table. Default is TRUE for get_psi and FALSE for get_psi_all. |
g |
Number of initial breakpoints for equal frequency binning. |
bins_no |
Logical, add serial numbers to bins. Default is TRUE. |
note |
Logical, outputs info. Default is FALSE. |
x |
The name of an independent variable. |
breaks |
Splitting points for an independent variable. Default is NULL. |
Details
PSI rules for evaluating the stability of a predictor:
Less than 0.02: very stable
0.02 to 0.1: stable
0.1 to 0.2: unstable
0.2 to 0.5: change
More than 0.5: great change
See Also
get_iv, get_iv_all, get_psi, get_psi_all
Examples
# dat_test is NULL
get_psi(dat = UCICreditCard, x = "PAY_3", occur_time = "apply_date")
# dat_test is not NULL
# train_test split
train_test = train_test_split(dat = UCICreditCard, prop = 0.7, split_type = "OOT",
occur_time = "apply_date", start_date = NULL, cut_date = NULL,
save_data = FALSE, note = FALSE)
dat_ex = train_test$train
dat_ac = train_test$test
# generate psi table
get_psi(dat = dat_ex, dat_test = dat_ac, x = "PAY_3",
occur_time = "apply_date", bins_no = TRUE)
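For intuition about the number that get_psi reports, the PSI of one variable between an expected and an actual sample can be sketched in base R. psi_sketch is a hypothetical name, the breaks are supplied by hand, and the 0.5 floor for empty bins is an assumption made here to avoid log(0).

```r
# PSI = sum over bins of (actual% - expected%) * log(actual% / expected%).
psi_sketch <- function(expected, actual, breaks) {
  e <- table(cut(expected, breaks, include.lowest = TRUE))
  a <- table(cut(actual, breaks, include.lowest = TRUE))
  # floor of 0.5 guards against log(0) in empty bins (assumption)
  pe <- pmax(as.vector(e), 0.5) / sum(e)
  pa <- pmax(as.vector(a), 0.5) / sum(a)
  sum((pa - pe) * log(pa / pe))
}

expected <- 1:10
actual   <- c(1, 2, 6, 7, 8, 9, 10, 10)
psi <- psi_sketch(expected, actual, breaks = c(0, 5, 10))
psi
```

Two samples with identical bin proportions give a PSI of 0; the value grows as the actual distribution drifts away from the expected one.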
Calculate IV & PSI
Description
get_psi_iv is used to calculate Information Value (IV) and Population Stability Index (PSI) of an independent variable.
get_psi_iv_all can loop through IV & PSI for all specified independent variables.
Usage
get_psi_iv_all(
dat,
dat_test = NULL,
x_list = NULL,
target,
ex_cols = NULL,
pos_flag = NULL,
breaks_list = NULL,
occur_time = NULL,
oot_pct = 0.7,
equal_bins = FALSE,
cut_bin = "equal_depth",
tree_control = NULL,
bins_control = NULL,
bins_total = FALSE,
best = TRUE,
g = 10,
as_table = TRUE,
note = FALSE,
parallel = FALSE,
bins_no = TRUE
)
get_psi_iv(
dat,
dat_test = NULL,
x,
target,
pos_flag = NULL,
breaks = NULL,
breaks_list = NULL,
occur_time = NULL,
oot_pct = 0.7,
equal_bins = FALSE,
cut_bin = "equal_depth",
tree_control = NULL,
bins_control = NULL,
bins_total = FALSE,
best = TRUE,
g = 10,
as_table = TRUE,
note = FALSE,
bins_no = TRUE
)
Arguments
dat |
A data.frame with independent variables and target variable. |
dat_test |
A data.frame of test data. Default is NULL. |
x_list |
Names of independent variables. |
target |
The name of target variable. |
ex_cols |
A list of excluded variables. Regular expressions can also be used to match variable names. Default is NULL. |
pos_flag |
The value of positive class of target variable, default: "1". |
breaks_list |
A table containing a list of splitting points for each independent variable. Default is NULL. |
occur_time |
The name of the variable that represents the time at which each observation takes place. |
oot_pct |
Percentage of observations retained for overtime test (especially to calculate PSI). Default is 0.7. |
equal_bins |
Logical, generates initial breaks for equal frequency or width binning. |
cut_bin |
A string, if equal_bins is TRUE, 'equal_depth' or 'equal_width', default is 'equal_depth'. |
tree_control |
Parameters for using a decision tree to segment initial breaks. See details: |
bins_control |
Parameters used to control binning. See details: |
bins_total |
Logical, total sum for each variable. |
best |
Logical, merge initial breaks to get optimal breaks for binning. |
g |
Number of initial breakpoints for equal frequency binning. |
as_table |
Logical, output results in a table. Default is TRUE. |
note |
Logical, outputs info. Default is FALSE. |
parallel |
Logical, parallel computing. Default is FALSE. |
bins_no |
Logical, add serial numbers to bins. Default is TRUE. |
x |
The name of an independent variable. |
breaks |
Splitting points for an independent variable. Default is NULL. |
See Also
get_iv, get_iv_all, get_psi, get_psi_all
Examples
iv_list = get_psi_iv_all(dat = UCICreditCard[1:1000, ],
x_list = names(UCICreditCard)[3:5], equal_bins = TRUE,
target = "default.payment.next.month", ex_cols = "ID|apply_date")
get_psi_iv(UCICreditCard, x = "PAY_3",
target = "default.payment.next.month",bins_total = TRUE)
Plot PSI (Population Stability Index)
Description
You can use psi_plot to plot the PSI of your data.
get_psi_plots can loop through plots for all specified independent variables.
Usage
get_psi_plots(
dat_train,
dat_test = NULL,
x_list = NULL,
ex_cols = NULL,
breaks_list = NULL,
occur_time = NULL,
g = 10,
plot_show = TRUE,
save_data = FALSE,
file_name = NULL,
parallel = FALSE,
g_width = 8,
dir_path = tempdir()
)
psi_plot(
dat_train,
x,
dat_test = NULL,
occur_time = NULL,
g_width = 8,
breaks_list = NULL,
breaks = NULL,
g = 10,
plot_show = TRUE,
save_data = FALSE,
dir_path = tempdir()
)
Arguments
dat_train |
A data.frame with independent variables. |
dat_test |
A data.frame of test data. Default is NULL. |
x_list |
Names of independent variables. |
ex_cols |
A list of excluded variables. Regular expressions can also be used to match variable names. Default is NULL. |
breaks_list |
A table containing a list of splitting points for each independent variable. Default is NULL. |
occur_time |
The name of occur time. |
g |
Number of initial breakpoints for equal frequency binning. |
plot_show |
Logical, show model performance in current graphic device. Default is TRUE. |
save_data |
Logical, save results in locally specified folder. Default is FALSE. |
file_name |
The name for periodically saved data file. Default is NULL. |
parallel |
Logical, parallel computing. Default is FALSE. |
g_width |
The width of graphs. |
dir_path |
The path for periodically saved graphic files. |
x |
The name of an independent variable. |
breaks |
Splitting points for a continuous variable. |
Examples
train_test = train_test_split(UCICreditCard[1:1000,], split_type = "Random",
prop = 0.8, save_data = FALSE)
dat_train = train_test$train
dat_test = train_test$test
get_psi_plots(dat_train[, c(8, 9)], dat_test = dat_test[, c(8, 9)])
Score Card
Description
get_score_card generates a standard scorecard.
Usage
get_score_card(
lg_model,
target,
bins_table,
a = 600,
b = 50,
file_name = NULL,
dir_path = tempdir(),
save_data = FALSE
)
Arguments
lg_model |
An object of glm model. |
target |
The name of target variable. |
bins_table |
a data.frame generated by |
a |
Base line of score. |
b |
Numeric, increase in score when the odds double. |
file_name |
The name for periodically saved scorecard file. Default is "LR_Score_Card". |
dir_path |
The path for periodically saved scorecard file. Default is tempdir(). |
save_data |
Logical, save results in locally specified folder. Default is FALSE. |
Value
scorecard
Examples
# dataset splitting
sub = cv_split(UCICreditCard, k = 30)[[1]]
dat = UCICreditCard[sub,]
#rename the target variable
dat = re_name(dat, "default.payment.next.month", "target")
dat = data_cleansing(dat, target = "target", obs_id = "ID",
occur_time = "apply_date", miss_values = list("", -1))
# train/test splitting
train_test = train_test_split(dat, split_type = "OOT", prop = 0.7,
occur_time = "apply_date")
dat_train = train_test$train
dat_test = train_test$test
#get breaks of all predictive variables
x_list = c("PAY_0", "LIMIT_BAL", "PAY_AMT5", "EDUCATION", "PAY_3", "PAY_2")
breaks_list = get_breaks_all(dat = dat_train, target = "target",
x_list = x_list, occur_time = "apply_date", ex_cols = "ID",
save_data = FALSE, note = FALSE)
#woe transforming
train_woe = woe_trans_all(dat = dat_train,
target = "target",
breaks_list = breaks_list,
woe_name = FALSE)
test_woe = woe_trans_all(dat = dat_test,
target = "target",
breaks_list = breaks_list,
note = FALSE)
Formula = as.formula(paste("target", paste(x_list, collapse = ' + '), sep = ' ~ '))
set.seed(46)
lr_model = glm(Formula, data = train_woe[, c("target", x_list)], family = binomial(logit))
#get LR coefficient
dt_imp_LR = get_logistic_coef(lg_model = lr_model, save_data = FALSE)
bins_table = get_bins_table_all(dat = dat_train, target = "target",
dat_test = dat_test,
x_list = x_list,
breaks_list = breaks_list, note = FALSE)
#score card
LR_score_card = get_score_card(lg_model = lr_model, bins_table, target = "target")
#scoring
train_pred = dat_train[, c("ID", "apply_date", "target")]
test_pred = dat_test[, c("ID", "apply_date", "target")]
train_pred$pred_LR = score_transfer(model = lr_model,
tbl_woe = train_woe,
save_data = FALSE)[, "score"]
test_pred$pred_LR = score_transfer(model = lr_model,
tbl_woe = test_woe, save_data = FALSE)[, "score"]
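The roles of a (base score) and b (points gained when the odds double) can be illustrated with the usual scorecard scaling. This is a sketch under assumptions: score_sketch is a hypothetical function, and anchoring the base score a at odds of 1:1 is a choice made here; the source does not state which odds the package anchors to.

```r
# Standard "points to double the odds" scaling of a logistic model's log-odds.
score_sketch <- function(log_odds_bad, a = 600, b = 50) {
  factor <- b / log(2)          # points per unit of log-odds
  a - factor * log_odds_bad     # lower odds of being bad give a higher score
}

base_score   <- score_sketch(0)          # odds of 1:1 map to the base score
better_score <- score_sketch(-log(2))    # halving the bad odds adds b points
c(base_score, better_score)
```

For a fitted glm, predict(model, type = "link") returns the log-odds that would be passed as log_odds_bad when the positive class is the bad outcome.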
get_shadow_nas
Description
This function is not intended to be used by end user.
Usage
get_shadow_nas(dat)
Arguments
dat |
A data.frame containing only predictor variables. |
get_sim_sign_lambda
Description
get_sim_sign_lambda gets the best lambda for lasso_filter. This function is required by lasso_filter.
Usage
get_sim_sign_lambda(lasso_model, sim_sign = "negtive")
Arguments
lasso_model |
A lasso model generated by glmnet. |
sim_sign |
Default is "negtive". This is related to pos_flag. If pos_flag is 1 (numeric or character), the value must be set to negative. If pos_flag is 0 (numeric or character), the value must be set to positive. |
Details
lambda.sim_sign gives the model in which the coefficients of all variables have the same sign, either all positive or all negative.
Value
Lambda value.
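The idea of a lambda at which all coefficients share one sign can be sketched on a plain coefficient matrix. Everything here is illustrative: same_sign_lambda and its direction argument are hypothetical, and returning the smallest qualifying lambda is an assumption, since the source does not say which qualifying lambda the package returns. For a glmnet fit, as.matrix(fit$beta) and fit$lambda would be the natural inputs.

```r
# beta: coefficients, one column per lambda; lambda: the matching penalty values.
same_sign_lambda <- function(beta, lambda,
                             direction = c("negative", "positive")) {
  direction <- match.arg(direction)
  ok <- apply(beta, 2, function(b) {
    nz <- b[b != 0]                       # only non-zero coefficients count
    if (length(nz) == 0) return(FALSE)
    if (direction == "negative") all(nz < 0) else all(nz > 0)
  })
  if (!any(ok)) return(NA_real_)
  min(lambda[ok])   # least-penalised model that still satisfies the sign rule
}

beta <- cbind(c(0, 0), c(-0.1, 0), c(-0.3, -0.2), c(-0.4, 0.1))
lambda <- c(1, 0.5, 0.1, 0.01)
best <- same_sign_lambda(beta, lambda, direction = "negative")
best
```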
Getting the breaks for terminal nodes from decision tree
Description
get_tree_breaks generates initial breaks by decision tree for a numerical or nominal variable.
The get_breaks function is a simpler wrapper for get_tree_breaks.
Usage
get_tree_breaks(
dat,
x,
target,
pos_flag = NULL,
tree_control = list(p = 0.02, cp = 1e-06, xval = 5, maxdepth = 10),
sp_values = NULL
)
Arguments
dat |
A data frame with x and target. |
x |
The name of the variable for which breaks are cut by the tree. |
target |
The name of target variable. |
pos_flag |
The value of positive class of target variable, default: "1". |
tree_control |
The list of parameters to control cutting initial breaks by decision tree. |
sp_values |
A list of special values. Default: NULL. |
Examples
#tree breaks
tree_control = list(p = 0.02, cp = 0.000001, xval = 5, maxdepth = 10)
tree_breaks = get_tree_breaks(dat = UCICreditCard, x = "MARRIAGE",
target = "default.payment.next.month", tree_control = tree_control)
Get X List.
Description
get_x_list gets the intersection of the names in x_list, train and test.
Usage
get_x_list(
dat_train = NULL,
dat_test = NULL,
x_list = NULL,
ex_cols = NULL,
note = FALSE
)
Arguments
dat_train |
A data.frame with independent variables. |
dat_test |
Another data.frame. |
x_list |
Names of independent variables. |
ex_cols |
A list of excluded variables. Regular expressions can also be used to match variable names. Default is NULL. |
note |
Logical, outputs info. Default is FALSE. |
Value
A list containing the names of variables.
Examples
x_list = get_x_list(x_list = NULL,dat_train = UCICreditCard,
ex_cols = c("default.payment.next.month","ID$|_date$"))
Compare the two highly correlated variables
Description
high_cor_selector is a function for comparing two highly correlated variables and selecting the variable with the largest IV value.
Usage
high_cor_selector(
cor_mat,
p = 0.95,
x_list = NULL,
com_list = NULL,
retain = TRUE
)
Arguments
cor_mat |
A correlation matrix. |
p |
The threshold of high correlation. |
x_list |
Names of independent variables. |
com_list |
A data.frame with important values of each variable, e.g., IV_list. |
retain |
Logical, output selected variables, if FALSE, output filtered variables. |
Value
A list of selected variables.
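The selection rule described above (among two highly correlated variables, keep the one with the larger IV) can be sketched in base R. drop_lower_iv is a hypothetical helper working on a named IV vector rather than the com_list data.frame, and the pairwise scan order is an assumption.

```r
# For every pair with |correlation| > p, drop the variable with the lower IV.
drop_lower_iv <- function(cor_mat, iv, p = 0.95) {
  vars <- colnames(cor_mat)
  keep <- vars
  for (i in seq_along(vars)) {
    for (j in seq_along(vars)) {
      both_kept <- all(c(vars[i], vars[j]) %in% keep)
      if (i < j && abs(cor_mat[i, j]) > p && both_kept) {
        loser <- if (iv[[vars[i]]] >= iv[[vars[j]]]) vars[j] else vars[i]
        keep <- setdiff(keep, loser)
      }
    }
  }
  keep
}

vars <- c("x1", "x2", "x3")
cor_mat <- matrix(c(1.00, 0.97, 0.10,
                    0.97, 1.00, 0.20,
                    0.10, 0.20, 1.00),
                  nrow = 3, dimnames = list(vars, vars))
iv <- c(x1 = 0.10, x2 = 0.30, x3 = 0.05)
selected <- drop_lower_iv(cor_mat, iv, p = 0.95)
selected
```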
is_date
Description
is_date is a small function for distinguishing time formats.
Usage
is_date(x)
Arguments
x |
A list or vector. |
Value
A Date.
Examples
is_date(lendingclub$issue_d)
Impute NAs using KNN
Description
This function is not intended to be used by end user.
Usage
knn_nas_imp(
dat,
x,
nas_rate = NULL,
mat_nas_shadow = NULL,
dt_nas_random = NULL,
k = 10,
scale = FALSE,
method = "median",
miss_value_num = -1
)
Arguments
dat |
A data.frame with independent variables. |
x |
The name of variable to process. |
nas_rate |
A list contains nas rate of each variable. |
mat_nas_shadow |
A shadow matrix of variables which contain nas. |
dt_nas_random |
A data.frame with random nas imputation. |
k |
Number of neighbors for each observation in which x is missing. |
scale |
Logical. Standardization of variable. |
method |
The methods of imputation by knn. "median" is knn imputation with k neighbors median, "avg_dist" is knn imputation with k neighbors of distance weighted mean. |
miss_value_num |
Default value of missing data imputation for numeric variables. Default is -1. |
ks_table & plot
Description
ks_table
is for generating a model performance table.
ks_table_plot
is for plotting the table generated by ks_table
ks_psi_plot
is for K-S & PSI distribution plotting.
Usage
ks_table(
train_pred,
test_pred = NULL,
target = NULL,
score = NULL,
g = 10,
breaks = NULL,
pos_flag = list("1", "1", "Bad", 1)
)
ks_table_plot(
train_pred,
test_pred,
target = "target",
score = "score",
g = 10,
plot_show = TRUE,
g_width = 12,
file_name = NULL,
save_data = FALSE,
dir_path = tempdir(),
gtitle = NULL
)
ks_psi_plot(
train_pred,
test_pred,
target = "target",
score = "score",
gtitle = NULL,
plot_show = TRUE,
g_width = 12,
save_data = FALSE,
breaks = NULL,
g = 10,
dir_path = tempdir()
)
model_key_index(tb_pred)
Arguments
train_pred |
A data frame of training with predicted prob or score. |
test_pred |
A data frame of validation with predict prob or score. |
target |
The name of target variable. |
score |
The name of prob or score variable. |
g |
Number of breaks for prob or score. |
breaks |
Splitting points of prob or score. |
pos_flag |
The value of positive class of target variable, default: list("1", "1", "Bad", 1). |
plot_show |
Logical, show model performance in current graphic device. Default is TRUE. |
g_width |
Width of graphs. |
file_name |
The name for periodically saved data file. Default is NULL. |
save_data |
Logical, save results in locally specified folder. Default is FALSE. |
dir_path |
The path for periodically saved graphic files. |
gtitle |
The title of the graph & The name for periodically saved graphic file. Default is "_ks_psi_table". |
tb_pred |
A table generated by ks_table |
Examples
sub = cv_split(UCICreditCard, k = 30)[[1]]
dat = UCICreditCard[sub,]
dat = re_name(dat, "default.payment.next.month", "target")
dat = data_cleansing(dat, target = "target", obs_id = "ID",
occur_time = "apply_date", miss_values = list("", -1))
train_test = train_test_split(dat, split_type = "OOT", prop = 0.7,
occur_time = "apply_date")
dat_train = train_test$train
dat_test = train_test$test
x_list = c("PAY_0", "LIMIT_BAL", "PAY_AMT5", "PAY_3", "PAY_2")
Formula = as.formula(paste("target", paste(x_list, collapse = ' + '), sep = ' ~ '))
set.seed(46)
lr_model = glm(Formula, data = dat_train[, c("target", x_list)], family = binomial(logit))
dat_train$pred_LR = round(predict(lr_model, dat_train[, x_list], type = "response"), 5)
dat_test$pred_LR = round(predict(lr_model, dat_test[, x_list], type = "response"), 5)
# model evaluation
ks_psi_plot(train_pred = dat_train, test_pred = dat_test,
score = "pred_LR", target = "target",
plot_show = TRUE)
tb_pred = ks_table_plot(train_pred = dat_train, test_pred = dat_test,
score = "pred_LR", target = "target",
g = 10, g_width = 13, plot_show = FALSE)
key_index = model_key_index(tb_pred)
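The PSI part of these plots compares how scores distribute over bins in the training and test samples. A minimal base R sketch of the usual PSI formula follows; psi_sketch is a hypothetical helper, and equal-frequency bins taken from the training scores are an assumption about the binning.

```r
# Hypothetical sketch of the population stability index (PSI):
# sum over bins of (actual share - expected share) * log(actual share / expected share).
psi_sketch <- function(expected, actual, g = 10) {
  breaks <- unique(quantile(expected, probs = seq(0, 1, length.out = g + 1)))
  breaks[1] <- -Inf                 # open the outer bins so test scores never fall outside
  breaks[length(breaks)] <- Inf
  e <- as.numeric(table(cut(expected, breaks))) / length(expected)
  a <- as.numeric(table(cut(actual, breaks))) / length(actual)
  eps <- 1e-6                       # guard against empty bins
  e <- pmax(e, eps)
  a <- pmax(a, eps)
  sum((a - e) * log(a / e))
}

set.seed(46)
train_score <- rbeta(5000, 2, 5)
psi_same <- psi_sketch(train_score, rbeta(5000, 2, 5))   # same distribution
psi_shift <- psi_sketch(train_score, rbeta(5000, 3, 4))  # shifted distribution
print(c(same = psi_same, shifted = psi_shift))
```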
ks_value
Description
ks_value
is for getting the K-S value of a prob or score.
Usage
ks_value(target, prob)
Arguments
target |
Vector of target. |
prob |
A list of predicted probabilities or scores. |
Value
KS value
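For reference, the K-S statistic is the largest gap between the cumulative distributions of scores for the positive and negative classes. The sketch below is a base R illustration under that definition; ks_sketch is a hypothetical name and the target is assumed to be coded 0/1.

```r
# Hypothetical sketch: K-S as the maximum absolute difference between the
# cumulative share of positives and negatives, scanning scores from high to low.
ks_sketch <- function(target, prob) {
  ord <- order(prob, decreasing = TRUE)
  y <- target[ord]
  cum_pos <- cumsum(y == 1) / sum(y == 1)
  cum_neg <- cumsum(y == 0) / sum(y == 0)
  # Compare only at the last row of each distinct score so tied scores move together.
  last_of_tie <- !duplicated(prob[ord], fromLast = TRUE)
  max(abs(cum_pos - cum_neg)[last_of_tie])
}

set.seed(46)
y <- rbinom(1000, 1, 0.2)
p <- plogis(-1.5 + 1.2 * y + rnorm(1000))
ks <- ks_sketch(y, p)
print(ks)
```

On continuous scores this agrees with the two-sample statistic returned by stats::ks.test.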
Variable selection by LASSO
Description
lasso_filter
filter variables by lasso.
Usage
lasso_filter(
dat_train,
dat_test = NULL,
target = NULL,
x_list = NULL,
pos_flag = NULL,
ex_cols = NULL,
sim_sign = "negtive",
best_lambda = "lambda.auc",
save_data = FALSE,
plot.it = TRUE,
seed = 46,
file_name = NULL,
dir_path = tempdir(),
note = FALSE
)
Arguments
dat_train |
A data.frame with independent variables and target variable. |
dat_test |
A data.frame of test data. Default is NULL. |
target |
The name of target variable. |
x_list |
Names of independent variables. |
pos_flag |
The value of positive class of target variable, default: "1". |
ex_cols |
A list of excluded variables. Regular expressions can also be used to match variable names. Default is NULL. |
sim_sign |
The coefficients of all variables should be all negative or all positive after WOE transformation. Default is "negtive" when pos_flag is "1". |
best_lambda |
Methods for choosing the best lambda used to filter variables by LASSO. There are 3 methods: ("lambda.auc", "lambda.ks", "lambda.sim_sign"). Default is "lambda.auc". |
save_data |
Logical, save results in locally specified folder. Default is FALSE |
plot.it |
Logical, shrinkage plot. Default is TRUE. |
seed |
Random number seed. Default is 46. |
file_name |
The name for periodically saved results files. Default is "Feature_selected_LASSO". |
dir_path |
The path for periodically saved results files. Default is tempdir(). |
note |
Logical, outputs info. Default is FALSE. |
Value
A list of filtered x variables by lasso.
Examples
sub = cv_split(UCICreditCard, k = 40)[[1]]
dat = UCICreditCard[sub,]
dat = re_name(dat, "default.payment.next.month", "target")
dat_train = data_cleansing(dat, target = "target", obs_id = "ID", occur_time = "apply_date",
miss_values = list("", -1))
dat_train = process_nas(dat_train)
#get breaks of all predictive variables
x_list = c("PAY_0", "LIMIT_BAL", "PAY_AMT5", "EDUCATION", "PAY_3", "PAY_2")
breaks_list = get_breaks_all(dat = dat_train, target = "target",
x_list = x_list, occur_time = "apply_date", ex_cols = "ID",
save_data = FALSE, note = FALSE)
#woe transform
train_woe = woe_trans_all(dat = dat_train,x_list = x_list,
target = "target",
breaks_list = breaks_list,
woe_name = FALSE)
lasso_filter(dat_train = train_woe,
target = "target", x_list = x_list,
save_data = FALSE, plot.it = FALSE)
Lending Club data
Description
This data contains complete loan data for all loans issued through the time period stated, including the current loan status (Current, Late, Fully Paid, etc.) and latest payment information. The data containing loan data through the "present" contains complete loan data for all loans issued through the previous completed calendar quarter (time period: 2018Q1 to 2018Q4).
Format
A data frame with 63532 rows and 145 variables.
Details
id: A unique LC assigned ID for the loan listing.
issue_d: The month which the loan was funded.
loan_status: Current status of the loan.
addr_state: The state provided by the borrower in the loan application.
acc_open_past_24mths: Number of trades opened in past 24 months.
all_util: Balance to credit limit on all trades.
annual_inc: The self-reported annual income provided by the borrower during registration.
avg_cur_bal: Average current balance of all accounts.
bc_open_to_buy: Total open to buy on revolving bankcards.
bc_util: Ratio of total current balance to high credit/credit limit for all bankcard accounts.
dti: A ratio calculated using the borrower's total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower's self-reported monthly income.
dti_joint: A ratio calculated using the co-borrowers' total monthly payments on the total debt obligations, excluding mortgages and the requested LC loan, divided by the co-borrowers' combined self-reported monthly income.
emp_length: Employment length in years. Possible values are between 0 and 10 where 0 means less than one year and 10 means ten or more years.
emp_title: The job title supplied by the Borrower when applying for the loan.
funded_amnt_inv: The total amount committed by investors for that loan at that point in time.
grade: LC assigned loan grade
inq_last_12m: Number of credit inquiries in past 12 months
installment: The monthly payment owed by the borrower if the loan originates.
max_bal_bc: Maximum current balance owed on all revolving accounts
mo_sin_old_il_acct: Months since oldest bank installment account opened
mo_sin_old_rev_tl_op: Months since oldest revolving account opened
mo_sin_rcnt_rev_tl_op: Months since most recent revolving account opened
mo_sin_rcnt_tl: Months since most recent account opened
mort_acc: Number of mortgage accounts.
pct_tl_nvr_dlq: Percent of trades never delinquent
percent_bc_gt_75: Percentage of all bankcard accounts > 75% of limit.
purpose: A category provided by the borrower for the loan request.
sub_grade: LC assigned loan subgrade
term: The number of payments on the loan. Values are in months and can be either 36 or 60.
tot_cur_bal: Total current balance of all accounts
tot_hi_cred_lim: Total high credit/credit limit
total_acc: The total number of credit lines currently in the borrower's credit file
total_bal_ex_mort: Total credit balance excluding mortgage
total_bc_limit: Total bankcard high credit/credit limit
total_cu_tl: Number of finance trades
total_il_high_credit_limit: Total installment high credit/credit limit
verification_status_joint: Indicates if the co-borrowers' joint income was verified by LC, not verified, or if the income source was verified
zip_code: The first 3 numbers of the zip code provided by the borrower in the loan application.
lift_value
Description
lift_value
is for getting max lift value for a prob or score.
Usage
lift_value(target, prob)
Arguments
target |
Vector of target. |
prob |
A list of predict probability or score. |
Value
Max lift value
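One common way to define a maximum lift is sketched below: sort by score, take the cumulative positive rate at each of g equal-size cut points, and divide by the overall positive rate. Whether lift_value uses exactly this binning is an assumption; max_lift_sketch is a hypothetical helper.

```r
# Hypothetical sketch: maximum cumulative lift over g equal-size score groups.
max_lift_sketch <- function(target, prob, g = 10) {
  ord <- order(prob, decreasing = TRUE)
  y <- as.numeric(target[ord] == 1)
  n <- length(y)
  cuts <- unique(ceiling(seq_len(g) * n / g))   # row index closing each group
  cum_rate <- cumsum(y)[cuts] / cuts            # positive rate among the top rows
  max(cum_rate / mean(y))                       # relative to the overall rate
}

target <- c(1, 1, 0, 0, 0, 0, 0, 0, 0, 0)
prob <- (10:1) / 10
lift <- max_lift_sketch(target, prob, g = 10)
print(lift)
```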
local_outlier_factor
Description
local_outlier_factor
is a function for calculating the LOF factor for a data set using knn.
This function is not intended to be used by end users.
Usage
local_outlier_factor(dat, k = 10)
Arguments
dat |
A data.frame containing only predictor variables. |
k |
Number of neighbors for LOF. Default is 10. |
Logarithmic transformation
Description
log_trans
is for logarithmic transformation
Usage
log_trans(
dat,
target,
x_list = NULL,
cor_dif = 0.01,
ex_cols = NULL,
note = TRUE
)
log_vars(dat, x_list = NULL, target = NULL, cor_dif = 0.01, ex_cols = NULL)
Arguments
dat |
A data.frame. |
target |
The name of target variable. |
x_list |
A list of x variables. |
cor_dif |
The correlation coefficient difference with the target of logarithm transformed variable and original variable. |
ex_cols |
Names of excluded variables. Regular expressions can also be used to match variable names. Default is NULL. |
note |
Logical, outputs info. Default is TRUE. |
Value
Log transformed data.frame.
Examples
dat = log_trans(dat = UCICreditCard, target = "default.payment.next.month",
x_list =NULL,cor_dif = 0.01,ex_cols = "ID", note = TRUE)
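The role of cor_dif can be illustrated with a small base R sketch: a variable is replaced by its logarithm only when that raises its absolute correlation with the target by more than cor_dif. The helper log_if_better and the use of log1p (to tolerate zeros) are assumptions, not the package's code.

```r
# Hypothetical sketch: log-transform x only if it gains more than cor_dif in
# absolute correlation with the target y.
log_if_better <- function(x, y, cor_dif = 0.01) {
  if (any(is.na(x)) || min(x) < 0) return(x)   # skip variables that cannot be logged safely
  lx <- log1p(x)                               # log(1 + x); an assumption to handle zeros
  gain <- abs(cor(lx, y)) - abs(cor(x, y))
  if (gain > cor_dif) lx else x
}

set.seed(46)
x_skew <- rlnorm(2000, meanlog = 3, sdlog = 1.5)          # heavily right-skewed
y <- as.numeric(log(x_skew) + rnorm(2000) > 3)            # target driven by log(x)
x_new <- log_if_better(x_skew, y, cor_dif = 0.01)
print(c(raw = abs(cor(x_skew, y)), logged = abs(cor(x_new, y))))
```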
Loop Function
Description
loop_function
is an iterator that loops func over the names in x_list.
Usage
loop_function(
func = NULL,
args = list(data = NULL),
x_list = NULL,
bind = "rbind",
parallel = TRUE,
as_list = FALSE
)
Arguments
func |
A function. |
args |
A list of arguments required by function. |
x_list |
Names of objects to loop through. |
bind |
Compile results, "rbind" & "cbind" are available. |
parallel |
Logical, parallel computing. |
as_list |
Logical, whether outputs to be a list. |
Value
A data.frame or list
Examples
dat = UCICreditCard[24:26]
num_x_list = get_names(dat = dat, types = c('numeric', 'integer', 'double'),
ex_cols = NULL, get_ex = FALSE)
dat[ ,num_x_list] = loop_function(func = outliers_kmeans_lof, x_list = num_x_list,
args = list(dat = dat),
bind = "cbind", as_list = FALSE,
parallel = FALSE)
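The looping pattern can be approximated sequentially in base R with lapply and do.call. This sketch ignores the parallel option and assumes the looped function accepts the current name through an argument called x; both loop_sketch and col_range are hypothetical.

```r
# Hypothetical sequential sketch: call func once per name and bind the results.
loop_sketch <- function(func, args = list(), x_list, bind = "cbind", as_list = FALSE) {
  res <- lapply(x_list, function(x) do.call(func, c(args, list(x = x))))
  names(res) <- x_list
  if (as_list) return(res)
  do.call(bind, res)
}

# Toy function: range of one column, returned as a one-row data.frame.
col_range <- function(dat, x) {
  data.frame(variable = x, min = min(dat[[x]]), max = max(dat[[x]]))
}

out <- loop_sketch(col_range, args = list(dat = mtcars),
                   x_list = c("mpg", "hp"), bind = "rbind")
print(out)
```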
love_color
Description
love_color
is for getting colors by name or from a palette.
Usage
love_color(color = NULL, type = "Blues", n = 10, ...)
Arguments
color |
The name of colors. |
type |
The type of colors, "deep", or the name of a palette. The sequential palettes are Blues, BuGn, BuPu, GnBu, Greens, Greys, Oranges, OrRd, PuBu, PuBuGn, PuRd, Purples, RdPu, Reds, YlGn, YlGnBu, YlOrBr, YlOrRd. The diverging palettes are BrBG, PiYG, PRGn, PuOr, RdBu, RdGy, RdYlBu, RdYlGn, Spectral. The qualitative palettes are Accent, Dark2, Paired, Pastel1, Pastel2, Set1, Set2, Set3. |
n |
Number of different colors, minimum is 1. |
... |
Other parameters. |
Examples
love_color(color="dark_cyan")
Filtering Low Variance Variables
Description
low_variance_filter
is for removing variables with repeated values up to a certain percentage.
Usage
low_variance_filter(
dat,
lvp = 0.97,
only_NA = FALSE,
note = FALSE,
ex_cols = NULL
)
Arguments
dat |
A data frame with x and target. |
lvp |
The maximum percentage of identical values (including NAs) allowed in a variable. |
only_NA |
Logical, only process variables whose NA rate is more than lvp. |
note |
Logical. Outputs info. Default is FALSE. |
ex_cols |
A list of excluded variables. Default is NULL. |
Value
A data.frame
Examples
dat = low_variance_filter(lendingclub[1:1000, ], lvp = 0.9)
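A minimal sketch of the filtering idea: compute, for each column, the share taken by its most frequent value (counting NA as a value) and drop the column when that share exceeds lvp. The helper low_var_sketch is hypothetical, and whether the package uses a strict or non-strict comparison at the threshold is an assumption.

```r
# Hypothetical sketch: drop columns whose most frequent value (NA included)
# covers more than lvp of the rows.
low_var_sketch <- function(dat, lvp = 0.97) {
  top_share <- vapply(dat, function(col) {
    max(table(col, useNA = "ifany")) / length(col)
  }, numeric(1))
  dat[, top_share <= lvp, drop = FALSE]
}

dat <- data.frame(a = c(rep(0, 99), 1),        # 99% zeros: dropped
                  b = 1:100,                   # all distinct: kept
                  c = c(rep(NA, 98), 1, 2))    # 98% NA: dropped
kept <- names(low_var_sketch(dat, lvp = 0.97))
print(kept)
```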
Logistic Regression & Scorecard Parameters
Description
lr_params
is the list of parameters used to train an LR model or scorecard in training_model.
lr_params_search
is for searching the optimal parameters of logistic regression when any parameter in lr_params has more than one candidate value.
Usage
lr_params(
tree_control = list(p = 0.02, cp = 1e-08, xval = 5, maxdepth = 10),
bins_control = list(bins_num = 10, bins_pct = 0.05, b_chi = 0.02, b_odds = 0.1, b_psi
= 0.03, b_or = 0.15, mono = 0.2, odds_psi = 0.15, kc = 1),
f_eval = "ks",
best_lambda = "lambda.ks",
method = "random_search",
iters = 10,
lasso = TRUE,
step_wise = TRUE,
score_card = TRUE,
sp_values = NULL,
forced_in = NULL,
obsweight = c(1, 1),
thresholds = list(cor_p = 0.8, iv_i = 0.02, psi_i = 0.1, cos_i = 0.5),
...
)
lr_params_search(
method = "random_search",
dat_train,
target,
dat_test = NULL,
occur_time = NULL,
x_list = NULL,
prop = 0.7,
iters = 10,
tree_control = list(p = 0.02, cp = 0, xval = 1, maxdepth = 10),
bins_control = list(bins_num = 10, bins_pct = 0.02, b_chi = 0.02, b_odds = 0.1, b_psi
= 0.05, b_or = 0.1, mono = 0.1, odds_psi = 0.03, kc = 1),
thresholds = list(cor_p = 0.8, iv_i = 0.02, psi_i = 0.1, cos_i = 0.6),
step_wise = FALSE,
lasso = FALSE,
f_eval = "ks"
)
Arguments
tree_control |
the list of parameters to control cutting initial breaks by decision tree. See details at: |
bins_control |
the list of parameters to control merging initial breaks. See details at: |
f_eval |
Customized evaluation function, "ks" & "auc" are available. |
best_lambda |
Methods for choosing the best lambda used to filter variables by LASSO. There are 3 methods: ("lambda.auc", "lambda.ks", "lambda.sim_sign"). Default is "lambda.ks". |
method |
Method of searching optimal parameters. "random_search","grid_search","local_search" are available. |
iters |
Number of iterations of "random_search" optimal parameters. |
lasso |
Logical, if TRUE, variables filtering by LASSO. Default is TRUE. |
step_wise |
Logical, stepwise method. Default is TRUE. |
score_card |
Logical, transfer woe to a standard scorecard. If TRUE, Output scorecard, and score prediction, otherwise output probability. Default is TRUE. |
sp_values |
Values that will be placed in separate bins, e.g. list(-1, "missing") means that -1 & missing are treated as special values. Default is NULL. |
forced_in |
Names of forced input variables. Default is NULL. |
obsweight |
An optional vector of 'prior weights' to be used in the fitting process. Should be NULL or a numeric vector. If you oversample or cluster different datasets to train the LR model, set this parameter so that the probability output by the logistic regression matches that before oversampling or segmentation. E.g., there are 10,000 0 obs and 500 1 obs before oversampling or under-sampling, and 5,000 0 obs and 3,000 1 obs after. Then this parameter should be set to c(10000/5000, 500/3000). Default is c(1, 1). |
thresholds |
Thresholds for selecting variables. |
... |
Other parameters |
dat_train |
data.frame of train data. Default is NULL. |
target |
name of target variable. |
dat_test |
data.frame of test data. Default is NULL. |
occur_time |
The name of the variable that represents the time at which each observation takes place.Default is NULL. |
x_list |
names of independent variables. Default is NULL. |
prop |
Percentage of train-data after the partition. Default: 0.7. |
Value
A list of parameters.
See Also
training_model, xgb_params, gbm_params, rf_params
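The obsweight arithmetic from the argument description can be checked directly: with weights equal to the class counts before resampling divided by the counts after, the weighted positive rate of the resampled data equals the original rate. The snippet below only reproduces that arithmetic with the counts quoted in the description; it does not call the package.

```r
# Class counts before and after resampling, as in the obsweight description.
n_before <- c(`0` = 10000, `1` = 500)
n_after <- c(`0` = 5000, `1` = 3000)
obsweight <- unname(n_before / n_after)     # c(10000/5000, 500/3000)

# Resampled target vector and the per-row weight implied by obsweight.
y <- rep(c(0, 1), times = n_after)
w <- ifelse(y == 0, obsweight[1], obsweight[2])

weighted_rate <- sum(w * y) / sum(w)        # positive rate after re-weighting
original_rate <- n_before[["1"]] / sum(n_before)
print(c(weighted = weighted_rate, original = original_rate))
```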
Variance-Inflation Factors
Description
lr_vif
is for calculating Variance-Inflation Factors.
Usage
lr_vif(lr_model)
Arguments
lr_model |
An object of logistic model. |
Examples
sub = cv_split(UCICreditCard, k = 30)[[1]]
x_list = c("PAY_0", "LIMIT_BAL", "PAY_AMT5", "PAY_3", "PAY_2")
dat = re_name(UCICreditCard[sub,], "default.payment.next.month", "target")
dat = dat[,c("target",x_list)]
dat = data_cleansing(dat, miss_values = list("", -1))
train_test = train_test_split(dat, prop = 0.7)
dat_train = train_test$train
dat_test = train_test$test
Formula = as.formula(paste("target", paste(x_list, collapse = ' + '), sep = ' ~ '))
set.seed(46)
lr_model = glm(Formula, data = dat_train[, c("target", x_list)], family = binomial(logit))
lr_vif(lr_model)
get_logistic_coef(lr_model)
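For intuition, the textbook variance inflation factor of a predictor is 1 / (1 - R^2), where R^2 comes from regressing that predictor on the others. The sketch below computes it with lm on the built-in mtcars data; vif_sketch is a hypothetical helper, and lr_vif may instead work from the fitted model's coefficient covariance, so values need not match exactly for a logistic model.

```r
# Hypothetical sketch: VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing
# predictor j on all remaining predictors.
vif_sketch <- function(dat, x_list) {
  vapply(x_list, function(x) {
    others <- setdiff(x_list, x)
    r2 <- summary(lm(reformulate(others, response = x), data = dat))$r.squared
    1 / (1 - r2)
  }, numeric(1))
}

vifs <- vif_sketch(mtcars, c("disp", "hp", "wt"))
print(round(vifs, 2))
```

The same numbers appear on the diagonal of the inverse correlation matrix of the predictors, which gives a quick cross-check.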
Max Min Normalization
Description
max_min_norm
is for normalizing each column vector of matrix 'x' using max_min normalization
Usage
max_min_norm(x)
Arguments
x |
Vector |
Value
Normalized vector
Examples
dat_s = apply(UCICreditCard[,12:14], 2, max_min_norm)
Merge Category
Description
merge_category
is for merging categories of nominal variables whose number of categories is more than m or whose percentage of samples in any category is less than p.
Usage
merge_category(dat, char_list = NULL, ex_cols = NULL, m = 10, note = TRUE)
Arguments
dat |
A data frame with x and target. |
char_list |
The list of character variables that need to merge categories. Default is NULL. In case of NULL, merge categories for all variables of string type. |
ex_cols |
A list of excluded variables. Default is NULL. |
m |
The minimum number of categories. |
note |
Logical, outputs info. Default is TRUE. |
Value
A data.frame with merged category variables.
Examples
#merge_category
dat = merge_category(lendingclub,ex_cols = "id$|_d$")
char_list = get_names(dat = dat,types = c('factor', 'character'),
ex_cols = "id$|_d$", get_ex = FALSE)
str(dat[,char_list])
Min Max Normalization
Description
min_max_norm
is for normalizing each column vector of matrix 'x' using min_max normalization
Usage
min_max_norm(x)
Arguments
x |
Vector |
Value
Normalized vector
Examples
dat_s = apply(UCICreditCard[,12:14], 2, min_max_norm)
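The two normalizations can be sketched as follows. It is an assumption here that min_max_norm maps the minimum to 0 and the maximum to 1, and that max_min_norm (documented separately) is the reversed scale; the function names below are hypothetical.

```r
# Hypothetical sketches of the two scalings (assumed definitions).
min_max_sketch <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])   # minimum -> 0, maximum -> 1
}
max_min_sketch <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (rng[2] - x) / (rng[2] - rng[1])   # maximum -> 0, minimum -> 1
}

x <- c(2, 4, 6)
print(min_max_sketch(x))
print(max_min_sketch(x))
scaled <- apply(mtcars[, c("mpg", "hp")], 2, min_max_sketch)
```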
model result plots
Description
model_result_plot
is a wrapper of the following:
perf_table
is for generating a model performance table.
ks_plot
is for K-S.
roc_plot
is for ROC.
lift_plot
is for Lift Chart.
score_distribution_plot
is for plotting the score distribution.
Usage
model_result_plot(
train_pred,
score,
target,
test_pred = NULL,
gtitle = NULL,
perf_dir_path = NULL,
save_data = FALSE,
plot_show = TRUE,
total = TRUE,
g = 10,
cut_bin = "equal_depth",
digits = 4
)
perf_table(
train_pred,
test_pred = NULL,
target = NULL,
score = NULL,
g = 10,
cut_bin = "equal_depth",
breaks = NULL,
digits = 2,
pos_flag = list("1", "1", "Bad", 1),
total = FALSE,
binsNO = FALSE
)
ks_plot(
train_pred,
test_pred = NULL,
target = NULL,
score = NULL,
gtitle = NULL,
breaks = NULL,
g = 10,
cut_bin = "equal_width",
perf_tb = NULL
)
lift_plot(
train_pred,
test_pred = NULL,
target = NULL,
score = NULL,
gtitle = NULL,
breaks = NULL,
g = 10,
cut_bin = "equal_depth",
perf_tb = NULL
)
roc_plot(
train_pred,
test_pred = NULL,
target = NULL,
score = NULL,
gtitle = NULL
)
score_distribution_plot(
train_pred,
test_pred,
target,
score,
gtitle = NULL,
breaks = NULL,
g = 10,
cut_bin = "equal_depth",
perf_tb = NULL
)
Arguments
train_pred |
A data frame of training with predicted prob or score. |
score |
The name of prob or score variable. |
target |
The name of target variable. |
test_pred |
A data frame of validation with predict prob or score. |
gtitle |
The title of the graph & The name for periodically saved graphic file. |
perf_dir_path |
The path for periodically saved graphic files. |
save_data |
Logical, save results in locally specified folder. Default is FALSE. |
plot_show |
Logical, show model performance in current graphic device. Default is TRUE. |
total |
Whether to summarize the table. default: TRUE. |
g |
Number of breaks for prob or score. |
cut_bin |
A string, if equal_bins is TRUE, 'equal_depth' or 'equal_width', default is 'equal_depth'. |
digits |
Digits of numeric, default is 4. |
breaks |
Splitting points of prob or score. |
pos_flag |
The value of positive class of target variable, default: "1". |
binsNO |
Bins Number. Default is FALSE. |
perf_tb |
Performance table. |
Examples
sub = cv_split(UCICreditCard, k = 30)[[1]]
dat = UCICreditCard[sub,]
dat = re_name(dat, "default.payment.next.month", "target")
x_list = c("PAY_0", "LIMIT_BAL", "PAY_AMT5", "PAY_3", "PAY_2")
dat = data_cleansing(dat, target = "target", obs_id = "ID",x_list = x_list,
occur_time = "apply_date", miss_values = list("", -1))
dat = process_nas(dat,default_miss = TRUE)
train_test = train_test_split(dat, split_type = "OOT", prop = 0.7,
occur_time = "apply_date")
dat_train = train_test$train
dat_test = train_test$test
Formula = as.formula(paste("target", paste(x_list, collapse = ' + '), sep = ' ~ '))
set.seed(46)
lr_model = glm(Formula, data = dat_train[, c("target", x_list)], family = binomial(logit))
dat_train$pred_LR = round(predict(lr_model, dat_train[, x_list], type = "response"), 5)
dat_test$pred_LR = round(predict(lr_model, dat_test[, x_list], type = "response"), 5)
# model evaluation
perf_table(train_pred = dat_train, test_pred = dat_test, target = "target", score = "pred_LR")
ks_plot(train_pred = dat_train, test_pred = dat_test, target = "target", score = "pred_LR")
roc_plot(train_pred = dat_train, test_pred = dat_test, target = "target", score = "pred_LR")
#lift_plot(train_pred = dat_train, test_pred = dat_test, target = "target", score = "pred_LR")
#score_distribution_plot(train_pred = dat_train, test_pred = dat_test,
#target = "target", score = "pred_LR")
#model_result_plot(train_pred = dat_train, test_pred = dat_test,
#target = "target", score = "pred_LR")
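The area under the ROC curve that roc_plot visualizes can be computed without plotting through the rank (Mann-Whitney) identity. This base R sketch assumes a 0/1 target and is not taken from the package; auc_sketch is a hypothetical name.

```r
# Hypothetical sketch: AUC from ranks. Average ranks make tied scores count as half.
auc_sketch <- function(target, score) {
  r <- rank(score)
  n1 <- sum(target == 1)
  n0 <- sum(target == 0)
  (sum(r[target == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

set.seed(46)
y <- rbinom(500, 1, 0.3)
s <- plogis(-1 + 1.5 * y + rnorm(500))
auc <- auc_sketch(y, s)
print(auc)
```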
Arrange list of plots into a grid
Description
Plot multiple ggplot-objects as a grid-arranged single plot.
Usage
multi_grid(..., grobs = list(...), nrow = NULL, ncol = NULL)
Arguments
... |
Other parameters. |
grobs |
A list of ggplot-objects to be arranged into the grid. |
nrow |
Number of rows in the plot grid. |
ncol |
Number of columns in the plot grid. |
Details
This function takes a list of ggplot-objects as argument. Plotting functions of this package that produce multiple plot objects (e.g., when there is an argument facet.grid) usually return multiple plots as list.
Value
An object of class gtable.
Examples
library(ggplot2)
sub = cv_split(UCICreditCard, k = 30)[[1]]
dat = UCICreditCard[sub,]
dat = re_name(dat, "default.payment.next.month", "target")
dat = data_cleansing(dat, target = "target", obs_id = "ID",
occur_time = "apply_date", miss_values = list("", -1))
dat = process_nas(dat)
train_test = train_test_split(dat, split_type = "OOT", prop = 0.7,
occur_time = "apply_date")
dat_train = train_test$train
dat_test = train_test$test
x_list = c("PAY_0", "LIMIT_BAL", "PAY_AMT5", "PAY_3", "PAY_2")
Formula = as.formula(paste("target", paste(x_list, collapse = ' + '), sep = ' ~ '))
set.seed(46)
lr_model = glm(Formula, data = dat_train[, c("target", x_list)], family = binomial(logit))
dat_train$pred_LR = round(predict(lr_model, dat_train[, x_list], type = "response"), 5)
dat_test$pred_LR = round(predict(lr_model, dat_test[, x_list], type = "response"), 5)
# model evaluation
p1 = ks_plot(train_pred = dat_train, test_pred = dat_test, target = "target", score = "pred_LR")
p2 = roc_plot(train_pred = dat_train, test_pred = dat_test, target = "target", score = "pred_LR")
p3 = lift_plot(train_pred = dat_train, test_pred = dat_test, target = "target", score = "pred_LR")
p4 = score_distribution_plot(train_pred = dat_train, test_pred = dat_test,
target = "target", score = "pred_LR")
p_plots= multi_grid(p1,p2,p3,p4)
plot(p_plots)
multi_left_join
Description
multi_left_join
is for quickly left-joining a list of datasets.
Usage
multi_left_join(..., df_list = list(...), key_dt = NULL, by = NULL)
Arguments
... |
Datasets to join. |
df_list |
A list of datasets. |
key_dt |
Name or index of Key table to left join. |
by |
Name of Key columns to join. |
Examples
multi_left_join(UCICreditCard[1:10, 1:10], UCICreditCard[1:10, c(1,8:14)],
UCICreditCard[1:10, c(1,20:25)], by = "ID")
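The same kind of chained left join can be written in base R with Reduce and merge. This is only a sketch of the idea on toy data; left_join_all is a hypothetical helper, the first data set plays the role of the key table, and the package's faster implementation is not reproduced.

```r
# Hypothetical sketch: left-join every data set in the list onto the first one.
left_join_all <- function(df_list, by) {
  Reduce(function(left, right) merge(left, right, by = by, all.x = TRUE), df_list)
}

a <- data.frame(ID = 1:3, x = c(10, 20, 30))
b <- data.frame(ID = c(1, 3), y = c("p", "q"), stringsAsFactors = FALSE)
d <- data.frame(ID = 2:4, z = c(TRUE, FALSE, TRUE))
joined <- left_join_all(list(a, b, d), by = "ID")
print(joined)
```

Rows of the first table without a match keep NA in the joined columns, and keys that appear only in later tables (ID 4 here) are not added.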
The length of a string.
Description
Returns the number of "code points" in a string.
Usage
n_char(string)
Arguments
string |
A string. |
Value
A numeric vector giving the number of characters (code points) in each element of the character vector. Missing strings have missing length.
Examples
n_char(letters)
n_char(NA)
Encode NAs
Description
null_blank_na
is the function to replace null, NULL, blank or other missing values with NA.
Usage
null_blank_na(dat, miss_values = NULL, note = FALSE)
Arguments
dat |
A data frame with x and target. |
miss_values |
Other extreme value might be used to represent missing values, e.g: -9999, -9998. These miss_values will be encoded to -1 or "missing". |
note |
Logical. Outputs info. Default is FALSE. |
Value
A data.frame
Examples
datss = null_blank_na(dat = UCICreditCard[1:1000, ], miss_values =list(-1,-2))
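A simplified base R sketch of this kind of recoding is shown below. The token list and the helper name to_na_sketch are assumptions; the real function may recognize more representations of missing data.

```r
# Hypothetical sketch: turn blank/null-like strings and user-listed codes into NA.
to_na_sketch <- function(dat, miss_values = NULL) {
  tokens <- c("", "null", "NULL", "NA", "NaN")   # assumed spellings of "missing"
  dat[] <- lapply(dat, function(col) {
    if (is.character(col)) col[trimws(col) %in% tokens] <- NA
    col[col %in% miss_values] <- NA              # extra codes such as -9999
    col
  })
  dat
}

raw <- data.frame(a = c("x", "", " null "), b = c(1, -9999, 3),
                  stringsAsFactors = FALSE)
clean <- to_na_sketch(raw, miss_values = c(-9999))
print(clean)
```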
One-Hot Encoding
Description
one_hot_encoding
is for converting the factor or character variables into multiple columns
Usage
one_hot_encoding(
dat,
cat_vars = NULL,
ex_cols = NULL,
merge_cat = TRUE,
na_act = TRUE,
note = FALSE
)
Arguments
dat |
A data frame. |
cat_vars |
The name or Column index list to be one_hot encoded. |
ex_cols |
Variables to be excluded, use regular expression matching |
merge_cat |
Logical. If TRUE, to merge categories greater than 8, default is TRUE. |
na_act |
Logical. If TRUE, the missing value is processed; if FALSE, the missing value is omitted. |
note |
Logical. Outputs info. Default is FALSE. |
Value
A data frame with the one hot encoding applied to all the variables with type as factor or character.
Examples
dat1 = one_hot_encoding(dat = UCICreditCard,
cat_vars = c("SEX", "MARRIAGE"),
merge_cat = TRUE, na_act = TRUE)
dat2 = de_one_hot_encoding(dat_one_hot = dat1,
cat_vars = c("SEX","MARRIAGE"), na_act = FALSE)
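What one-hot encoding produces can be seen in a few lines of base R. The sketch below creates one 0/1 column per level and ignores the merge_cat and na_act options; one_hot_sketch and the column naming scheme are hypothetical.

```r
# Hypothetical sketch: replace each categorical column by one 0/1 column per level.
one_hot_sketch <- function(dat, cat_vars) {
  for (v in cat_vars) {
    f <- factor(dat[[v]])
    dummies <- outer(as.character(f), levels(f), "==") * 1L   # n x levels indicator matrix
    colnames(dummies) <- paste(v, levels(f), sep = ".")
    dat <- cbind(dat[setdiff(names(dat), v)], dummies)
  }
  dat
}

toy <- data.frame(SEX = c(1, 2, 2), AGE = c(30, 40, 50))
encoded <- one_hot_sketch(toy, cat_vars = "SEX")
print(encoded)
```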
Outliers Detection
Description
outliers_detection
is for detecting outliers using Kmeans and Local Outlier Factor (lof)
Usage
outliers_detection(dat, x, kc = 3, kn = 5)
Arguments
dat |
A data.frame with independent variables. |
x |
The name of variable to process. |
kc |
Number of clustering centers for Kmeans |
kn |
Number of neighbors for LOF. |
Value
Outliers of each variable.
Entropy
Description
This function is not intended to be used by end user.
Usage
p_ij(x)
e_ij(x)
Arguments
x |
A numeric vector. |
Value
A numeric vector of entropy.
prob to score
Description
p_to_score
is for transforming probability to score.
Usage
p_to_score(p, PDO = 20, base = 600, ratio = 1)
Arguments
p |
Probability. |
PDO |
Point-to-Double Odds. |
base |
Base Point. |
ratio |
The corresponding odds when the score is base. |
Value
The score corresponding to the probability.
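The conventional scorecard scaling behind PDO, base and ratio can be sketched as below. It is an assumption that higher probability maps to a lower score and that no rounding is applied; score_sketch is a hypothetical helper, not the package's p_to_score.

```r
# Hypothetical sketch of standard scorecard scaling:
# score = A - B * log(odds), with B = PDO / log(2) and A = base + B * log(ratio),
# so the score equals base when the odds equal ratio, and doubling the odds
# lowers the score by PDO points.
score_sketch <- function(p, PDO = 20, base = 600, ratio = 1) {
  B <- PDO / log(2)
  A <- base + B * log(ratio)
  A - B * log(p / (1 - p))
}

scores <- score_sketch(c(1 / 3, 0.5, 2 / 3))   # odds of 0.5, 1 and 2
print(scores)
```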
partial_dependence_plot
Description
partial_dependence_plot
is for generating a partial dependence plot.
get_partial_dependence_plots
is for plotting partial dependence of all variables in x_list.
Usage
partial_dependence_plot(model, x, x_train, n.trees = NULL)
get_partial_dependence_plots(
model,
x_train,
x_list,
n.trees = NULL,
dir_path = getwd(),
save_data = TRUE,
plot_show = FALSE,
parallel = FALSE
)
Arguments
model |
A fitted model object. |
x |
The name of an independent variable. |
x_train |
A data.frame with independent variables. |
n.trees |
Number of trees for best.iter of gbm. |
x_list |
Names of independent variables. |
dir_path |
The path for periodically saved graphic files. |
save_data |
Logical, save results in locally specified folder. Default is TRUE. |
plot_show |
Logical, show model performance in current graphic device. Default is FALSE. |
parallel |
Logical, parallel computing. Default is FALSE. |
Examples
sub = cv_split(UCICreditCard, k = 30)[[1]]
dat = UCICreditCard[sub,]
dat = re_name(dat, "default.payment.next.month", "target")
dat = data_cleansing(dat, target = "target", obs_id = "ID",
occur_time = "apply_date", miss_values = list("", -1))
train_test = train_test_split(dat, split_type = "OOT", prop = 0.7,
occur_time = "apply_date")
dat_train = train_test$train
dat_test = train_test$test
x_list = c("PAY_0", "LIMIT_BAL", "PAY_AMT5", "PAY_3", "PAY_2")
Formula = as.formula(paste("target", paste(x_list, collapse = ' + '), sep = ' ~ '))
set.seed(46)
lr_model = glm(Formula, data = dat_train[, c("target", x_list)], family = binomial(logit))
#plot partial dependency of one variable
partial_dependence_plot(model = lr_model, x ="LIMIT_BAL", x_train = dat_train)
#plot partial dependency of all variables
pd_list = get_partial_dependence_plots(model = lr_model, x_list = x_list[1:2],
x_train = dat_train, save_data = FALSE,plot_show = TRUE)
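Partial dependence itself is simple to compute by hand: fix the variable of interest at each grid value for every row, predict, and average. The sketch below does this for a logistic regression on the built-in mtcars data; pd_sketch and the evenly spaced grid are assumptions, and the package functions additionally draw the plot.

```r
# Hypothetical sketch: average prediction when variable x is set to each grid value.
pd_sketch <- function(model, x, x_train, n_grid = 10) {
  grid <- seq(min(x_train[[x]]), max(x_train[[x]]), length.out = n_grid)
  avg_pred <- vapply(grid, function(v) {
    tmp <- x_train
    tmp[[x]] <- v                                  # overwrite x for every row
    mean(predict(model, newdata = tmp, type = "response"))
  }, numeric(1))
  data.frame(value = grid, avg_pred = avg_pred)
}

fit <- glm(am ~ wt + hp, data = mtcars, family = binomial(logit))
pd <- pd_sketch(fit, "wt", mtcars)
print(pd)
```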
Plot Colors
Description
You can use the plot_colors
to show colors on the graph device.
Usage
plot_colors(colors)
color_ramp_palette(colors)
Arguments
colors |
A vector of colors. |
Examples
plot_colors(rgb(158,122,122, maxColorValue = 255 ))
plot_oot_perf
Description
plot_oot_perf
is for plotting performance of cross time samples in the future
Usage
plot_oot_perf(
dat_test,
x,
occur_time,
target,
k = 3,
g = 10,
period = "month",
best = FALSE,
equal_bins = TRUE,
pl = "rate",
breaks = NULL,
cut_bin = "equal_depth",
gtitle = NULL,
perf_dir_path = NULL,
save_data = FALSE,
plot_show = TRUE
)
Arguments
dat_test |
A data frame of testing dataset with predicted prob or score. |
x |
The name of prob or score variable. |
occur_time |
The name of the variable that represents the time at which each observation takes place. |
target |
The name of target variable. |
k |
If period is NULL, number of equal frequency samples. |
g |
Number of breaks for prob or score. |
period |
OOT period, 'weekly' and 'month' are available. If NULL, use k equal frequency samples. |
best |
Logical, merge initial breaks to get optimal breaks for binning. |
equal_bins |
Logical, generates initial breaks for equal frequency or width binning. |
pl |
'lift' is for lift chart plot,'rate' is for positive rate plot. |
breaks |
Splitting points of prob or score. |
cut_bin |
A string, if equal_bins is TRUE, 'equal_depth' or 'equal_width', default is 'equal_depth'. |
gtitle |
The title of the graph & The name for periodically saved graphic file. |
perf_dir_path |
The path for periodically saved graphic files. |
save_data |
Logical, save results in locally specified folder. Default is FALSE. |
plot_show |
Logical, show model performance in current graphic device. Default is TRUE. |
Examples
sub = cv_split(UCICreditCard, k = 30)[[1]]
dat = UCICreditCard[sub,]
dat = re_name(dat, "default.payment.next.month", "target")
x_list = c("PAY_0", "LIMIT_BAL", "PAY_AMT5", "PAY_3", "PAY_2")
dat = data_cleansing(dat, target = "target", obs_id = "ID",x_list = x_list,
occur_time = "apply_date", miss_values = list("", -1))
dat = process_nas(dat)
train_test = train_test_split(dat, split_type = "OOT", prop = 0.7,
occur_time = "apply_date")
dat_train = train_test$train
dat_test = train_test$test
Formula = as.formula(paste("target", paste(x_list, collapse = ' + '), sep = ' ~ '))
set.seed(46)
lr_model = glm(Formula, data = dat_train[, c("target", x_list)], family = binomial(logit))
dat_train$pred_LR = round(predict(lr_model, dat_train[, x_list], type = "response"), 5)
dat_test$pred_LR = round(predict(lr_model, dat_test[, x_list], type = "response"), 5)
plot_oot_perf(dat_test = dat_test, occur_time = "apply_date", target = "target", x = "pred_LR")
plot_table
Description
plot_table
is for table visualization.
Usage
plot_table(
grid_table,
theme = c("cyan", "grey", "green", "red", "blue", "purple"),
title = NULL,
title.size = 12,
title.color = "black",
title.face = "bold",
title.position = "middle",
subtitle = NULL,
subtitle.size = 8,
subtitle.color = "black",
subtitle.face = "plain",
subtitle.position = "middle",
tile.color = "white",
tile.size = 1,
colname.size = 3,
colname.color = "white",
colname.face = "bold",
colname.fill.color = love_color("dark_cyan"),
text.size = 3,
text.color = love_color("dark_grey"),
text.face = "plain",
text.fill.color = c("white", love_color("pale_grey"))
)
Arguments
grid_table |
A data.frame or table |
theme |
The theme of color, "cyan","grey","green","red","blue","purple" are available. |
title |
The title of table |
title.size |
The title size of plot. |
title.color |
The title color. |
title.face |
The title face, such as "plain", "bold". |
title.position |
The title position, such as "left", "middle", "right". |
subtitle |
The subtitle of table |
subtitle.size |
The subtitle size. |
subtitle.color |
The subtitle color. |
subtitle.face |
The subtitle face, such as "plain", "bold", default is "plain". |
subtitle.position |
The subtitle position, such as "left", "middle", "right", default is "middle". |
tile.color |
The color of table lines, default is 'white'. |
tile.size |
The size of table lines, default is 1. |
colname.size |
The size of colnames, default is 3. |
colname.color |
The color of colnames, default is 'white'. |
colname.face |
The face of colnames, default is 'bold'. |
colname.fill.color |
The fill color of colnames, default is love_color("dark_cyan"). |
text.size |
The size of text, default is 3. |
text.color |
The color of text, default is love_color("dark_grey"). |
text.face |
The face of text, default is 'plain'. |
text.fill.color |
The fill color of text, default is c('white', love_color("pale_grey")). |
Examples
iv_list = get_psi_iv_all(dat = UCICreditCard[1:1000, ],
x_list = names(UCICreditCard)[3:5], equal_bins = TRUE,
target = "default.payment.next.month", ex_cols = "ID|apply_date")
iv_dt = get_psi_iv(UCICreditCard, x = "PAY_3",
target = "default.payment.next.month", bins_total = TRUE)
plot_table(iv_dt)
plot_theme
Description
plot_theme
is a simpler wrapper of theme for ggplot2.
Usage
plot_theme(
legend.position = "top",
angle = 30,
legend_size = 7,
axis_size_y = 8,
axis_size_x = 8,
axis_title_size = 10,
title_size = 11,
title_vjust = 0,
title_hjust = 0,
linetype = "dotted",
face = "bold"
)
Arguments
legend.position |
See details at: legend.position |
angle |
See details at: axis.text.x |
legend_size |
See details at: legend.text |
axis_size_y |
See details at: axis.text.y |
axis_size_x |
See details at: axis.text.x |
axis_title_size |
See details at: axis.title.x |
title_size |
See details at: plot.title |
title_vjust |
See details at: plot.title |
title_hjust |
See details at: plot.title |
linetype |
See details at: panel.grid.major |
face |
See details at: axis.title.x |
Details
See details at: theme
pred_score
Description
pred_score
is for using a logistic regression model to predict new data.
Usage
pred_score(
model,
dat,
x_list = NULL,
bins_table = NULL,
obs_id = NULL,
miss_values = list(-1, "-1", "NULL", "-1", "-9999", "-9996", "-9997", "-9995",
"-9998", -9999, -9998, -9997, -9996, -9995),
woe_name = FALSE
)
Arguments
model |
Logistic Regression Model generated by |
dat |
Dataframe of new data. |
x_list |
Names of the variables that enter the model. |
bins_table |
a data.frame generated by |
obs_id |
The name of ID of observations or key variable of data. Default is NULL. |
miss_values |
Special values. |
woe_name |
Logical. Whether woe variable's name contains 'woe'. Default is FALSE. |
Value
new scores.
See Also
training_model
, lr_params
, xgb_params
, rf_params
Missing Treatment
Description
process_nas_var
is for missing value analysis and treatment using knn imputation, central imputation and random imputation.
process_nas
is a simpler wrapper for process_nas_var
.
Usage
process_nas(
dat,
x_list = NULL,
class_var = FALSE,
miss_values = list(-1, "missing"),
default_miss = list(-1, "missing"),
parallel = FALSE,
ex_cols = NULL,
method = "median",
note = FALSE,
save_data = FALSE,
file_name = NULL,
dir_path = tempdir(),
...
)
process_nas_var(
dat = dat,
x,
missing_type = NULL,
method = "median",
nas_rate = NULL,
default_miss = list("missing", -1),
mat_nas_shadow = NULL,
dt_nas_random = NULL,
note = FALSE,
save_data = FALSE,
file_name = NULL,
dir_path = tempdir(),
...
)
Arguments
dat |
A data.frame with independent variables. |
x_list |
Names of independent variables. |
class_var |
Logical, nas analysis of the nominal variables. Default is FALSE. |
miss_values |
Other extreme value might be used to represent missing values, e.g:-1, -9999, -9998. These miss_values will be encoded to NA. |
default_miss |
Default value of missing data imputation. Default is list(-1, 'missing'). |
parallel |
Logical, parallel computing. Default is FALSE. |
ex_cols |
A list of excluded variables. Regular expressions can also be used to match variable names. Default is NULL. |
method |
The methods of imputation by knn. If "median", then NAs imputation with k neighbors median. If "avg_dist", the distance weighted average method is applied to determine the NAs imputation with k neighbors. If "default", assigning the missing values to -1 or "missing", otherwise, processing the missing values according to the results of missing analysis. |
note |
Logical, outputs info. Default is FALSE. |
save_data |
Logical. If TRUE, save missing analysis to |
file_name |
The file name for periodically saved missing analysis file. Default is NULL. |
dir_path |
The path for periodically saved missing analysis file. Default is tempdir(). |
... |
Other parameters. |
x |
The name of variable to process. |
missing_type |
Type of missing, generated by analysis_nas |
nas_rate |
A list contains nas rate of each variable. |
mat_nas_shadow |
A shadow matrix of variables which contain nas. |
dt_nas_random |
A data.frame with random nas imputation. |
Value
A data frame with no NAs.
Examples
dat_na = process_nas(dat = UCICreditCard[1:1000,],
parallel = FALSE,ex_cols = "ID$", method = "median")
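The two-step idea described in the arguments (encode sentinel values as NA, then impute) can be sketched in base R as follows. This is only an illustration under stated assumptions: it does not call creditmodel, the helper name impute_sketch and the toy data are hypothetical, and a plain column median is swapped in for the package's knn-based median.

```r
# Toy data: -1 and "missing" stand in for missing values (hypothetical data).
dat <- data.frame(
  x = c(1, -1, 3, NA, 5),
  g = c("a", "missing", "b", NA, "a"),
  stringsAsFactors = FALSE
)

# Hypothetical helper, not part of creditmodel.
impute_sketch <- function(dat, miss_values = list(-1, "missing"),
                          default_miss = list(-1, "missing")) {
  sentinels <- unlist(lapply(miss_values, as.character))
  for (col in names(dat)) {
    v <- dat[[col]]
    # Step 1: encode sentinel values as NA.
    v[as.character(v) %in% sentinels] <- NA
    # Step 2: impute. Numeric columns get the column median (a simplification
    # of the knn median); other columns get the character default.
    if (is.numeric(v)) {
      v[is.na(v)] <- median(v, na.rm = TRUE)
    } else {
      v[is.na(v)] <- default_miss[[2]]
    }
    dat[[col]] <- v
  }
  dat
}

dat_imp <- impute_sketch(dat)
```

After the call, dat_imp contains no NAs: the numeric column is filled with its median and the character column with "missing".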
Outliers Treatment
Description
outliers_kmeans_lof
is for outlier detection and treatment using Kmeans and Local Outlier Factor (LOF).
process_outliers
is a simpler wrapper for outliers_kmeans_lof
.
Usage
process_outliers(
dat,
target,
ex_cols = NULL,
kc = 3,
kn = 5,
x_list = NULL,
parallel = FALSE,
note = FALSE,
process = TRUE,
save_data = FALSE,
file_name = NULL,
dir_path = tempdir()
)
outliers_kmeans_lof(
dat,
x,
target = NULL,
kc = 3,
kn = 5,
note = FALSE,
process = TRUE,
save_data = FALSE,
file_name = NULL,
dir_path = tempdir()
)
Arguments
dat |
Dataset with independent variables and target variable. |
target |
The name of target variable. |
ex_cols |
A list of excluded variables. Regular expressions can also be used to match variable names. Default is NULL. |
kc |
Number of clustering centers for Kmeans |
kn |
Number of neighbors for LOF. |
x_list |
Names of independent variables. |
parallel |
Logical, parallel computing. |
note |
Logical, outputs info. Default is FALSE. |
process |
Logical, process outliers, not just analysis. |
save_data |
Logical. If TRUE, save outliers analysis file to the specified folder at |
file_name |
The file name for periodically saved outliers analysis file. Default is NULL. |
dir_path |
The path for periodically saved outliers analysis file. Default is tempdir(). |
x |
The name of variable to process. |
Value
A data frame in which the outliers of all the variables have been processed.
Examples
dat_out = process_outliers(UCICreditCard[1:10000,c(18:21,26)],
target = "default.payment.next.month",
ex_cols = "date$", kc = 3, kn = 10,
parallel = FALSE,note = TRUE)
Variable reduction based on Information Value & Population Stability Index filter
Description
psi_iv_filter
is for selecting important and stable features using IV & PSI.
Usage
psi_iv_filter(
dat,
dat_test = NULL,
target,
x_list = NULL,
breaks_list = NULL,
pos_flag = NULL,
ex_cols = NULL,
occur_time = NULL,
best = FALSE,
equal_bins = TRUE,
g = 10,
sp_values = NULL,
tree_control = list(p = 0.05, cp = 1e-06, xval = 5, maxdepth = 10),
bins_control = list(bins_num = 10, bins_pct = 0.05, b_chi = 0.05, b_odds = 0.1, b_psi
= 0.05, b_or = 0.15, mono = 0.3, odds_psi = 0.2, kc = 1),
oot_pct = 0.7,
psi_i = 0.1,
iv_i = 0.01,
cos_i = 0.7,
vars_name = FALSE,
note = TRUE,
parallel = FALSE,
save_data = FALSE,
file_name = NULL,
dir_path = tempdir(),
...
)
Arguments
dat |
A data.frame with independent variables and target variable. |
dat_test |
A data.frame of test data. Default is NULL. |
target |
The name of target variable. |
x_list |
Names of independent variables. |
breaks_list |
A table containing a list of splitting points for each independent variable. Default is NULL. |
pos_flag |
The value of positive class of target variable, default: "1". |
ex_cols |
A list of excluded variables. Regular expressions can also be used to match variable names. Default is NULL. |
occur_time |
The name of the variable that represents the time at which each observation takes place. |
best |
Logical, if TRUE, merge initial breaks to get optimal breaks for binning. |
equal_bins |
Logical, if TRUE, generates initial breaks of equal sample size. If FALSE, generates tree breaks using a decision tree. |
g |
Integer, number of initial bins for equal_bins. |
sp_values |
A list of missing values. |
tree_control |
the list of tree parameters. |
bins_control |
the list of parameters. |
oot_pct |
Percentage of observations retained for overtime test (especially to calculate PSI). Default is 0.7. |
psi_i |
The maximum threshold of PSI. 0 <= psi_i <=1; 0.05 to 0.2 usually work. Default: 0.1 |
iv_i |
The minimum threshold of IV. 0 < iv_i ; 0.01 to 0.1 usually work. Default: 0.01 |
cos_i |
cos_similarity of positive rate of train and test. 0.7 to 0.9 usually work. Default: 0.7. |
vars_name |
Logical, output a list of filtered variables or table with detailed IV and PSI value of each variable. Default is FALSE. |
note |
Logical, outputs info. Default is TRUE. |
parallel |
Logical, parallel computing. Default is FALSE. |
save_data |
Logical, save results in locally specified folder. Default is FALSE. |
file_name |
The name for periodically saved results files. Default is "Feature_importance_IV_PSI". |
dir_path |
The path for periodically saved results files. Default is tempdir(). |
... |
Other parameters. |
Value
A list with the following elements:
- Feature: Selected variables.
- IV: IV of variables.
- PSI: PSI of variables.
- COS: cos_similarity of positive rate of train and test.
See Also
xgb_filter
, gbm_filter
, feature_selector
Examples
psi_iv_filter(dat= UCICreditCard[1:1000,c(2,4,8:9,26)],
target = "default.payment.next.month",
occur_time = "apply_date",
parallel = FALSE)
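To make the psi_i and iv_i thresholds concrete, the sketch below computes PSI and IV for one synthetic variable with the usual textbook formulas and then applies the documented defaults (iv_i = 0.01, psi_i = 0.1). It is an illustration only: the data are simulated, the helpers bin_share, psi_sketch and iv_sketch are hypothetical names, and the package's own binning and computation may differ.

```r
set.seed(1)
n <- 1000
x_train <- rnorm(n)
x_test  <- rnorm(n, mean = 0.1)               # slightly shifted later sample
y_train <- rbinom(n, 1, plogis(x_train - 1))  # synthetic 0/1 target

# Equal-frequency breaks taken from the training sample (10 bins).
breaks <- c(-Inf, quantile(x_train, probs = seq(0.1, 0.9, by = 0.1),
                           names = FALSE), Inf)

# Share of observations per bin, floored to avoid log(0) for empty bins.
bin_share <- function(x, breaks, eps = 1e-6) {
  pmax(as.numeric(table(cut(x, breaks))) / length(x), eps)
}

# PSI = sum((actual - expected) * log(actual / expected)) over bins.
psi_sketch <- function(expected, actual, breaks) {
  e <- bin_share(expected, breaks)
  a <- bin_share(actual, breaks)
  sum((a - e) * log(a / e))
}

# IV = sum((pos share - neg share) * log(pos share / neg share)) over bins.
iv_sketch <- function(x, y, breaks, eps = 1e-6) {
  tab <- table(cut(x, breaks), factor(y, levels = c(0, 1)))
  pos <- pmax(tab[, "1"] / sum(tab[, "1"]), eps)
  neg <- pmax(tab[, "0"] / sum(tab[, "0"]), eps)
  sum((pos - neg) * log(pos / neg))
}

psi_x <- psi_sketch(x_train, x_test, breaks)
iv_x  <- iv_sketch(x_train, y_train, breaks)

# Keep the variable only if it is both informative and stable.
keep_x <- iv_x >= 0.01 && psi_x <= 0.1
```

Both quantities are non-negative by construction, so a variable is dropped either for carrying too little information (low IV) or for drifting too much between samples (high PSI).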
List as data.frame quickly
Description
quick_as_df
is a function for fast data frame transformation.
Usage
quick_as_df(df_list)
Arguments
df_list |
A list of data. |
Value
A data.frame.
Examples
UCICreditCard = quick_as_df(UCICreditCard)
Ranking Percent Process
Description
ranking_percent_proc
is for processing ranking percent variables.
ranking_percent_dict
is for generating ranking percent dictionary.
Usage
ranking_percent_proc(
dat,
ex_cols = NULL,
x_list = NULL,
rank_dict = NULL,
pct = 0.01,
parallel = FALSE,
note = FALSE,
save_data = FALSE,
file_name = NULL,
dir_path = tempdir(),
...
)
ranking_percent_proc_x(dat, x, rank_dict = NULL, pct = 0.01)
ranking_percent_dict(
dat,
x_list = NULL,
ex_cols = NULL,
pct = 0.01,
parallel = FALSE,
save_data = FALSE,
file_name = NULL,
dir_path = tempdir(),
...
)
ranking_percent_dict_x(dat, x = NULL, pct = 0.01)
Arguments
dat |
A data.frame. |
ex_cols |
Names of excluded variables. Regular expressions can also be used to match variable names. Default is NULL. |
x_list |
A list of x variables. |
rank_dict |
The dictionary of rank_percent generated by |
pct |
Percent of rank. Default is 0.01. |
parallel |
Logical, parallel computing. Default is FALSE. |
note |
Logical, outputs info. Default is FALSE. |
save_data |
Logical, save results in locally specified folder. Default is FALSE |
file_name |
The name for periodically saved rank_percent data file. Default is "dat_rank_percent". |
dir_path |
The path for periodically saved rank_percent data file. Default is tempdir(). |
... |
Additional parameters. |
x |
The name of an independent variable. |
Value
Data.frame with new processed variables.
Examples
rank_dict = ranking_percent_dict(dat = UCICreditCard[1:1000,],
x_list = c("LIMIT_BAL","BILL_AMT2","PAY_AMT3"), ex_cols = NULL )
UCICreditCard_new = ranking_percent_proc(dat = UCICreditCard[1:1000,],
x_list = c("LIMIT_BAL", "BILL_AMT2", "PAY_AMT3"), rank_dict = rank_dict, parallel = FALSE)
re_code
Description
re_code
recodes the values of a variable according to a table of original values and their replacement codes.
Usage
re_code(x, codes)
Arguments
x |
Variable to recode. |
codes |
A data.frame of original value & recode value |
Examples
SEX = sample(c("F","M"),1000,replace = TRUE)
codes= data.frame(ori_value = c('F','M'), code = c(0,1) )
SEX_re = re_code(SEX,codes)
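For readers who want to see the lookup idea behind such a recoding, here is a base R sketch using match. It is not the package's implementation, and the return type of re_code may differ; the variable names below are illustrative only.

```r
set.seed(1)
sex <- sample(c("F", "M"), 10, replace = TRUE)

# Lookup table: original value and its replacement code.
codes <- data.frame(ori_value = c("F", "M"), code = c(0, 1),
                    stringsAsFactors = FALSE)

# match() returns, for each element of sex, its row position in the lookup
# table; indexing the code column by that position performs the recoding.
sex_re <- codes$code[match(sex, codes$ori_value)]
```

Values that do not appear in the lookup table come back as NA with this approach, which makes unmapped categories easy to spot.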
Rename
Description
re_name
is for renaming variables.
Usage
re_name(dat, oldname = c(), newname = c())
Arguments
dat |
A data frame with variables to rename. |
oldname |
Old names of variables. |
newname |
New names of variables. |
Value
data with new variable names.
Examples
dt = re_name(dat = UCICreditCard, "default.payment.next.month" , "target")
names(dt['target'])
Read data
Description
read_data
is for loading data in formats like csv, txt, data and so on.
Usage
read_data(
path,
pattern = NULL,
encoding = "unknown",
header = TRUE,
sep = "auto",
stringsAsFactors = FALSE,
select = NULL,
drop = NULL,
nrows = Inf
)
check_data_format(path)
Arguments
path |
Path to the file, or a file name in the working directory. |
pattern |
An optional regular expression. Only file names which match the regular expression will be returned. |
encoding |
Default is "unknown". Other possible options are "UTF-8" and "Latin-1". |
header |
Does the first data line contain column names? |
sep |
The separator between columns. |
stringsAsFactors |
Logical. Convert all character columns to factors? |
select |
A vector of column names or numbers to keep, drop the rest. |
drop |
A vector of column names or numbers to drop, keep the rest. |
nrows |
The maximum number of rows to read. |
Filtering highly correlated variables with reduce method
Description
reduce_high_cor_filter
is function for filtering highly correlated variables with reduce method.
Usage
reduce_high_cor_filter(
dat,
x_list = NULL,
size = ncol(dat)/10,
p = 0.95,
com_list = NULL,
ex_cols = NULL,
cor_class = TRUE,
parallel = FALSE
)
Arguments
dat |
A data.frame with independent variables. |
x_list |
Names of independent variables. |
size |
Size of variable group. |
p |
Threshold of correlation between features. Default is 0.95. |
com_list |
A data.frame with important values of each variable, e.g. IV_list. |
ex_cols |
A list of excluded variables. Regular expressions can also be used to match variable names. Default is NULL. |
cor_class |
Calculate the correlation matrix of category variables. Default is TRUE. |
parallel |
Logical, parallel computing. Default is FALSE. |
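The general idea of dropping one variable from each highly correlated pair can be sketched in base R as a greedy pass over the correlation matrix. This is a simplified stand-in, not the package's reduce method: the helper cor_filter_sketch, its importance argument (playing the role of com_list) and the simulated data are all hypothetical.

```r
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.01)  # nearly a copy of x1
x3 <- rnorm(n)
dat <- data.frame(x1, x2, x3)

# Hypothetical helper: visit variables in order of importance (or column
# order if none is given) and keep a variable only if its absolute
# correlation with every variable kept so far is at most p.
cor_filter_sketch <- function(dat, p = 0.95, importance = NULL) {
  vars <- names(dat)
  if (!is.null(importance)) {
    vars <- vars[order(-importance[vars])]
  }
  cor_mat <- abs(cor(dat[, vars, drop = FALSE]))
  kept <- character(0)
  for (v in vars) {
    if (length(kept) == 0 || all(cor_mat[v, kept] <= p)) {
      kept <- c(kept, v)
    }
  }
  kept
}

kept_default <- cor_filter_sketch(dat, p = 0.95)
kept_ranked  <- cor_filter_sketch(dat, p = 0.95,
                                  importance = c(x1 = 0.1, x2 = 0.3, x3 = 0.2))
```

Without importance values the first of the two near-duplicates survives; with them, the more important one (here x2) is retained instead.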
Remove Duplicated Observations
Description
remove_duplicated
is the function to remove duplicated observations.
Usage
remove_duplicated(
dat = dat,
obs_id = NULL,
occur_time = NULL,
target = NULL,
note = FALSE
)
Arguments
dat |
A data frame with x and target. |
obs_id |
The name of ID of observations. Default is NULL. |
occur_time |
The name of occur time of observations. Default is NULL. |
target |
The name of target variable. |
note |
Logical. Outputs info. Default is FALSE. |
Value
A data.frame
Examples
datss = remove_duplicated(dat = UCICreditCard,
target = "default.payment.next.month",
obs_id = "ID", occur_time = "apply_date")
Replace Value
Description
replace_value
is for replacing values of some variables.
replace_value_x
is for replacing values of a variable.
Usage
replace_value(
dat = dat,
x_list = NULL,
x_pattern = NULL,
replace_dat,
MARGIN = 2,
VALUE = if (MARGIN == 2) colnames(replace_dat) else rownames(replace_dat),
RE_NAME = TRUE,
parallel = FALSE
)
replace_value_x(
dat,
x,
replace_dat,
MARGIN = 2,
VALUE = if (MARGIN == 2) colnames(replace_dat) else rownames(replace_dat),
RE_NAME = TRUE
)
Arguments
dat |
A data.frame. |
x_list |
Names of variables to replace value. |
x_pattern |
Regular expressions, used to match variable names. |
replace_dat |
A data.frame contains value to replace. |
MARGIN |
A vector giving the subscripts which the function will be applied over. E.g., for a matrix 1 indicates rows, 2 indicates columns, c(1, 2) indicates rows and columns. Where X has named dimnames, it can be a character vector selecting dimension names. |
VALUE |
Values to replace. |
RE_NAME |
Logical, rename the replaced variable. |
parallel |
Logical, parallel computing. Default is FALSE. |
x |
Name of variable to replace value. |
Packages required and installation
Description
require_packages
is a function for loading required packages and installing missing packages if needed.
Usage
require_packages(..., pkg = as.character(substitute(list(...))))
Arguments
... |
Packages to be loaded. |
pkg |
A list or vector of names of required packages. |
Value
Packages installed and loaded.
Examples
## Not run:
require_packages(data.table, ggplot2, dplyr)
## End(Not run)
Random Forest Parameters
Description
rf_params
is the list of parameters used to train a Random Forest in training_model
.
Usage
rf_params(ntree = 100, nodesize = 30, samp_rate = 0.5, tune_rf = FALSE, ...)
Arguments
ntree |
Number of trees to grow. This should not be set to too small a number, to ensure that every input row gets predicted at least a few times. |
nodesize |
Minimum size of terminal nodes. Setting this number larger causes smaller trees to be grown (and thus take less time). Note that the default values are different for classification (1) and regression (5). |
samp_rate |
Percentage of sample to draw. Default is 0.5. |
tune_rf |
A logical. If TRUE, then tune Random Forest model. Default is FALSE. |
... |
Other parameters |
Details
See details at : https://www.stat.berkeley.edu/~breiman/Using_random_forests_V3.1.pdf
Value
A list of parameters.
See Also
training_model
, lr_params
, gbm_params
, xgb_params
Functions for vector operation.
Description
Functions for vector operation.
Usage
rowAny(x)
rowAllnas(x)
colAllnas(x)
colAllzeros(x)
rowAll(x)
rowCVs(x, na.rm = FALSE)
rowSds(x, na.rm = FALSE)
colSds(x, na.rm = TRUE)
rowMaxs(x, na.rm = FALSE)
rowMins(x, na.rm = FALSE)
rowMaxMins(x, na.rm = FALSE)
colMaxMins(x, na.rm = FALSE)
cnt_x(x)
sum_x(x)
max_x(x)
min_x(x)
avg_x(x)
Arguments
x |
A data.frame or Matrix. |
na.rm |
Logical, remove NAs. |
Value
A data.frame or Matrix.
Examples
#any row has missing values
row_any = rowAny(UCICreditCard[8:10])
#rows which are all missing values
row_na = rowAllnas(UCICreditCard[8:10])
#cols which are all missing values
col_na = colAllnas(UCICreditCard[8:10])
#cols which are all zeros
col_zero = colAllzeros(UCICreditCard[8:10])
#sum all numbers of a row
row_all = rowAll(UCICreditCard[8:10])
#calculate cv of a row
row_cv = rowCVs(UCICreditCard[8:10])
#calculate sd of a row
row_sd = rowSds(UCICreditCard[8:10])
#calculate sd of a column
col_sd = colSds(UCICreditCard[8:10])
Save data
Description
save_data
is for saving a data.frame or a list fast.
Usage
save_data(
...,
files = list(...),
file_name = as.character(substitute(list(...))),
dir_path = getwd(),
note = FALSE,
as_list = FALSE,
row_names = FALSE,
append = FALSE
)
Arguments
... |
datasets |
files |
A dataset or a list of datasets. |
file_name |
The file name of data. |
dir_path |
A string. The dir path to save the data. |
note |
Logical. Outputs info. Default is FALSE. |
as_list |
Logical. List format or data.frame format to save. Default is FALSE. |
row_names |
Logical, retain rownames. |
append |
Logical, append newdata to old. |
Examples
save_data(UCICreditCard, file_name = "UCICreditCard", dir_path = tempdir())
Score Transformation
Description
score_transfer
is for transferring woe to score.
Usage
score_transfer(
model,
tbl_woe,
a = 600,
b = 50,
file_name = NULL,
dir_path = tempdir(),
save_data = FALSE
)
Arguments
model |
A logistic regression model, e.g. one fitted with glm. |
tbl_woe |
a data.frame with woe variables. |
a |
Base line of score. |
b |
Numeric. Increased scores from doubling Odds. |
file_name |
The name for periodically saved score file. Default is "dat_score". |
dir_path |
The path for periodically saved score file. Default is tempdir(). |
save_data |
Logical, save results in locally specified folder. Default is FALSE. |
Value
A data.frame with variables whose values are transferred to score.
Examples
# dataset splitting
sub = cv_split(UCICreditCard, k = 30)[[1]]
dat = UCICreditCard[sub,]
#rename the target variable
dat = re_name(dat, "default.payment.next.month", "target")
dat = data_cleansing(dat, target = "target", obs_id = "ID",
occur_time = "apply_date", miss_values = list("", -1))
#train_test splitting
train_test = train_test_split(dat, split_type = "OOT", prop = 0.7,
occur_time = "apply_date")
dat_train = train_test$train
dat_test = train_test$test
#get breaks of all predictive variables
x_list = c("PAY_0", "LIMIT_BAL", "PAY_AMT5", "EDUCATION", "PAY_3", "PAY_2")
breaks_list = get_breaks_all(dat = dat_train, target = "target",
x_list = x_list, occur_time = "apply_date", ex_cols = "ID",
save_data = FALSE, note = FALSE)
#woe transforming
train_woe = woe_trans_all(dat = dat_train,
target = "target",
breaks_list = breaks_list,
woe_name = FALSE)
test_woe = woe_trans_all(dat = dat_test,
target = "target",
breaks_list = breaks_list,
note = FALSE)
Formula = as.formula(paste("target", paste(x_list, collapse = ' + '), sep = ' ~ '))
set.seed(46)
lr_model = glm(Formula, data = train_woe[, c("target", x_list)], family = binomial(logit))
#get LR coefficient
dt_imp_LR = get_logistic_coef(lg_model = lr_model, save_data = FALSE)
bins_table = get_bins_table_all(dat = dat_train, target = "target",
x_list = x_list,dat_test = dat_test,
breaks_list = breaks_list, note = FALSE)
#score card
LR_score_card = get_score_card(lg_model = lr_model, bins_table, target = "target")
#scoring
train_pred = dat_train[, c("ID", "apply_date", "target")]
test_pred = dat_test[, c("ID", "apply_date", "target")]
train_pred$pred_LR = score_transfer(model = lr_model,
tbl_woe = train_woe,
save_data = FALSE)[, "score"]
test_pred$pred_LR = score_transfer(model = lr_model,
tbl_woe = test_woe, save_data = FALSE)[, "score"]
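The roles of a (base line of score) and b (increase in score from doubling the odds) can be illustrated with the common scorecard scaling formula. The sketch below is an assumption, not the package's confirmed implementation: score_sketch is a hypothetical helper, and the sign convention (higher log-odds of the positive class lowers the score) is chosen only for illustration.

```r
# Hypothetical scaling: score = a - (b / log(2)) * log_odds,
# where log_odds is the model's log-odds of the positive class.
score_sketch <- function(log_odds, a = 600, b = 50) {
  a - (b / log(2)) * log_odds
}

# Odds of 1:1 map to the base score a.
score_even <- score_sketch(log(1))

# Doubling the odds moves the score by exactly b points.
score_double <- score_sketch(log(2))
```

With the defaults a = 600 and b = 50, even odds give a score of 600 and each doubling of the odds shifts the score by 50 points.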
Generates Best Binning Breaks
Description
select_best_class
& select_best_breaks
are for merging initial breaks of variables using chi-square, odds-ratio, PSI, G/B index and so on.
The get_breaks
is a simpler wrapper for select_best_class
& select_best_breaks
.
Usage
select_best_class(
dat,
x,
target,
breaks = NULL,
occur_time = NULL,
oot_pct = 0.7,
pos_flag = NULL,
bins_control = NULL,
sp_values = NULL,
...
)
select_best_breaks(
dat,
x,
target,
breaks = NULL,
pos_flag = NULL,
sp_values = NULL,
occur_time = NULL,
oot_pct = 0.7,
bins_control = NULL,
...
)
Arguments
dat |
A data frame with x and target. |
x |
The name of variable to process. |
target |
The name of target variable. |
breaks |
Splitting points for an independent variable. Default is NULL. |
occur_time |
The name of the variable that represents the time at which each observation takes place. |
oot_pct |
The percentage of Actual and Expected set for PSI calculating. |
pos_flag |
The value of positive class of target variable, default: "1". |
bins_control |
The list of parameters. |
sp_values |
A list of special values. |
... |
Other parameters. |
Details
The following is the list of reference principles:
1. The increasing or decreasing trend of variables is consistent with the actual business experience. (The percent of non-monotonic intervals which are not head or tail is less than 0.35.)
2. Maximum 10 intervals for a single variable.
3. Each interval should cover more than 2
4. Each interval needs at least 30 or 1
5. Combining the values of blank, missing or other special value into the same interval called missing.
6. The difference of Chi effect size between intervals should be at least 0.02.
7. The difference of absolute odds ratio between intervals should be at least 0.1.
8. The difference of positive rate between intervals should be at least 1/10 of the total positive rate.
9. The difference of G/B index between intervals should be at least 15.
10. The PSI of each interval should be less than 0.1.
Value
A list of breaks for x.
See Also
get_tree_breaks
,
cut_equal
,
get_breaks
Examples
#equal sample size breaks
equ_breaks = cut_equal(dat = UCICreditCard[, "PAY_AMT2"], g = 10)
# select best bins
bins_control = list(bins_num = 10, bins_pct = 0.02, b_chi = 0.02,
b_odds = 0.1, b_psi = 0.05, b_or = 0.15, mono = 0.3, odds_psi = 0.1, kc = 1)
select_best_breaks(dat = UCICreditCard, x = "PAY_AMT2", breaks = equ_breaks,
target = "default.payment.next.month", occur_time = "apply_date",
sp_values = NULL, bins_control = bins_control)
sim_str
Description
This function is not intended to be used by end user.
Usage
sim_str(a, b, sep = "_|[.]|[A-Z]")
Arguments
a |
A string |
b |
A string |
sep |
Separator of strings. Default is "_|[.]|[A-Z]". |
split_bins
Description
split_bins
is for binning using breaks.
Usage
split_bins(
dat,
x,
breaks = NULL,
bins_no = TRUE,
as_factor = FALSE,
labels = NULL,
use_NA = TRUE,
char_free = FALSE
)
Arguments
dat |
A data.frame with independent variables. |
x |
The name of an independent variable. |
breaks |
Breaks for binning. |
bins_no |
Number the generated bins. Default is TRUE. |
as_factor |
Whether to convert to factor type. |
labels |
Labels of bins. |
use_NA |
Whether to process NAs. |
char_free |
Logical, if TRUE, characters are not split. |
Value
A data.frame with binned x.
Examples
bins = split_bins(dat = UCICreditCard,
x = "PAY_AMT1", breaks = NULL, bins_no = TRUE)
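Binning a numeric variable by explicit breaks can be illustrated with base R's cut. This is a sketch of the concept only; the toy vector and break points are made up, and the labels and NA handling produced by split_bins may differ.

```r
# Toy values and interior break points (hypothetical).
x <- c(0, 150, 2000, 5000, 12000, NA)
breaks <- c(1000, 5000)

# Pad the breaks with -Inf and Inf so every finite value falls in a bin.
# Intervals are right-closed: (-Inf, 1000], (1000, 5000], (5000, Inf].
bins <- cut(x, breaks = c(-Inf, breaks, Inf), right = TRUE, dig.lab = 5)

# Bin numbers, analogous in spirit to numbering the generated bins.
bin_no <- as.integer(bins)

# One way to treat NAs: give them their own "missing" bin.
bin_label <- as.character(bins)
bin_label[is.na(bin_label)] <- "missing"
```

Note that a value exactly equal to a break point (5000 here) lands in the lower interval because the intervals are closed on the right.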
Split bins all
Description
split_bins
is for transforming data to bins.
The split_bins_all
function is a simpler wrapper for split_bins
.
Usage
split_bins_all(
dat,
x_list = NULL,
ex_cols = NULL,
breaks_list = NULL,
bins_no = TRUE,
note = FALSE,
return_x = FALSE,
char_free = FALSE,
save_data = FALSE,
file_name = NULL,
dir_path = tempdir(),
...
)
Arguments
dat |
A data.frame with independent variables. |
x_list |
A list of x variables. |
ex_cols |
Names of excluded variables. Regular expressions can also be used to match variable names. Default is NULL. |
breaks_list |
A list containing breaks of variables. It is generated by get_breaks_all or get_breaks. |
bins_no |
Number the generated bins. Default is TRUE. |
note |
Logical, outputs info. Default is FALSE. |
return_x |
Logical, return data.frame containing only variables in x_list. |
char_free |
Logical, if TRUE, characters are not split. |
save_data |
Logical, save results in locally specified folder. Default is FALSE. |
file_name |
The name for periodically saved woe file. Default is "dat_woe". |
dir_path |
The path for periodically saved woe file. Default is tempdir(). |
... |
Additional parameters. |
Value
A data.frame with split bins.
See Also
get_tree_breaks
, cut_equal
, select_best_class
, select_best_breaks
Examples
sub = cv_split(UCICreditCard, k = 30)[[1]]
dat = UCICreditCard[sub,]
dat = re_name(dat, "default.payment.next.month", "target")
dat = data_cleansing(dat, target = "target", obs_id = "ID", occur_time = "apply_date",
miss_values = list("", -1))
train_test = train_test_split(dat, split_type = "OOT", prop = 0.7,
occur_time = "apply_date")
dat_train = train_test$train
dat_test = train_test$test
#get breaks of all predictive variables
x_list = c("PAY_0", "LIMIT_BAL", "PAY_AMT5", "EDUCATION", "PAY_3", "PAY_2")
breaks_list = get_breaks_all(dat = dat_train, target = "target",
x_list = x_list, occur_time = "apply_date", ex_cols = "ID",
save_data = FALSE, note = FALSE)
#split bins
train_bins = split_bins_all(dat = dat_train,
breaks_list = breaks_list,
woe_name = FALSE)
test_bins = split_bins_all(dat = dat_test,
breaks_list = breaks_list,
note = FALSE)
Automatic production of hive SQL
Description
Returns text parse of hive SQL
Usage
sql_hive_text_parse(
sql_dt,
key_sql = NULL,
key_table = NULL,
key_id = NULL,
key_where = c("dt = date_add(current_date(),-1)"),
only_key = FALSE,
left_id = NULL,
left_where = c("dt = date_add(current_date(),-1)"),
new_name = NULL,
...
)
Arguments
sql_dt |
The data dictionary has three columns: table, map and feature. |
key_sql |
You can write your own SQL for the main table. |
key_table |
Key table. |
key_id |
Primary key id. |
key_where |
Key table conditions. |
only_key |
Only key table. |
left_id |
Right table's key id. |
left_where |
Right table conditions. |
new_name |
A string, Rename all variables except primary key with suffix 'new_name'. |
... |
Other params. |
Value
Text parse of hive SQL
Examples
#sql_dt:table, map and feature
sql_dt = data.frame(table = c("table_1", "table_1", "table_1", "table_1","table_1",
"table_2", "table_2","table_2",
"table_2","table_2","table_2","table_2",
"table_2","table_2","table_2","table_2",
"table_2","table_2","table_2","table_3","table_3",
"table_3","table_3","table_3"),
map = c("all","all", "all","all","all","all","all","all","all","all",
"all", "all","all","id_card_info",
"id_card_info","id_card_info", "mobile_info","mobile_info",
"mobile_info","all", "all","all", "all","all"),
feature =c( "user_id","real_name","id_card_encode","mobile_encode","dt",
"user_id","type_code","first_channel",
"second_channel","user_name","user_sex","user_birthday",
"user_age","card_province","card_zone",
"card_city","city","province","carrier","user_id",
"biz_id","biz_code","apply_time","dt"))
#sample 1
sql_hive_text_parse(sql_dt = sql_dt,
key_sql = NULL,
key_table = "table_2",
key_where = c("user_sex = 'male'",
"user_age > 20"),
only_key = FALSE,
key_id = "user_id",
left_id = "user_id",
left_where = c("dt = date_add(current_date(),-1)",
"apply_time >= '2020-05-01' "
), new_name ="basic"
)
#sample 2
sql_hive_text_parse(sql_dt = subset(sql_dt),
key_sql = "SELECT
user_id,
max(apply_time) as max_apply_time
FROM table_3
WHERE dt = date_add(current_date(),-1)
GROUP BY user_id",
key_id = "user_id",
left_id = "user_id",
left_where = c("dt = date_add(current_date(),-1)"
),
new_name = NULL)
Parallel computing and export variables to global Env.
Description
This function is not intended to be used by end user.
Usage
start_parallel_computing(parallel = TRUE)
Arguments
parallel |
A logical, default is TRUE. |
Value
parallel works.
Stop parallel computing
Description
This function is not intended to be used by end user.
Usage
stop_parallel_computing(cluster)
Arguments
cluster |
Parallel works. |
Value
stop clusters.
string match
Description
str_match
searches for matches to argument pattern within each element of a character vector.
Usage
str_match(pattern, str_r)
Arguments
pattern |
character string containing a regular expression (or character string for fixed = TRUE) to be matched in the given character vector. Coerced by as.character to a character string if possible. If a character vector of length 2 or more is supplied, the first element is used with a warning. missing values are allowed except for regexpr and gregexpr. |
str_r |
a character vector where matches are sought, or an object which can be coerced by as.character to a character vector. Long vectors are supported. |
Examples
orignal_nam = c("12mdd","11mdd","10mdd")
str_match(str_r = orignal_nam,pattern= "\\d+")
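For comparison, the same kind of extraction can be sketched in base R. This is only an illustration built on regexpr and regmatches; it does not call str_match, and whether str_match returns exactly this shape is an assumption. The variable names are hypothetical.

```r
# Base R sketch: first match of a regular expression in each element.
original_names <- c("12mdd", "11mdd", "10mdd")

# regexpr() gives the position of the first match per element (-1 if none).
m <- regexpr("\\d+", original_names)

# regmatches() returns only the matched pieces, so fill a result vector
# that keeps NA for elements without a match.
first_match <- rep(NA_character_, length(original_names))
first_match[m > 0] <- regmatches(original_names, m)
first_match
```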
Summary table
Description
sum_table
includes both univariate and bivariate analysis, ranging from univariate statistics and frequency distributions to correlations, cross-tabulation and characteristic analysis.
Usage
sum_table(dat, ..., x_s = as.character(substitute(list(...))), x_list = NULL)
Arguments
dat |
A data.frame with x and target. |
... |
x of dat |
x_s |
A list of x. |
x_list |
Names of dat. |
Value
A list containing both category and numeric variable analysis.
Examples
sum_table(UCICreditCard)
sum_table(UCICreditCard,LIMIT_BAL,AGE,EDUCATION,SEX)
TF-IDF
Description
The term_filter
is for filtering stop_words and low frequency words.
The term_idf
is for computing the IDF (inverse document frequency) of terms.
The term_tfidf
is for computing tf-idf of documents.
Usage
term_tfidf(term_df, idf = NULL)
term_idf(term_df, n_total = NULL)
term_filter(term_df, low_freq = 0.01, stop_words = NULL)
Arguments
term_df |
A data.frame with id and term. |
idf |
A data.frame with idf. |
n_total |
Number of documents. |
low_freq |
Threshold for low frequency terms, given either as a usage rate of terms or as a number of uses. |
stop_words |
Stop words. |
Value
A data.frame
Examples
term_df = data.frame(id = c(1,1,1,2,2,3,3,3,4,4,4,4,4,5,5,6,7,7,
8,8,8,9,9,9,10,10,11,11,11,11,11,11),
terms = c('a','b','c','a','c','d','d','a','b','c','a','c','d','a','c',
'd','a','e','f','b','c','f','b','c','h','h','i','c','d','g','k','k'))
term_df = term_filter(term_df = term_df, low_freq = 1)
idf = term_idf(term_df)
tf_idf = term_tfidf(term_df,idf = idf)
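As a rough illustration of what these quantities mean, the following base R sketch computes term frequency, inverse document frequency and their product on a toy table. It assumes the textbook definitions (tf as the share of a term within a document, idf as log(N / document frequency)); the exact formulas, smoothing and log base used by term_idf and term_tfidf are not stated here and may differ.

```r
# Toy corpus: one row per (document id, term) occurrence.
term_df <- data.frame(
  id    = c(1, 1, 1, 2, 2, 3),
  terms = c("a", "b", "a", "a", "c", "c"),
  stringsAsFactors = FALSE
)

# Count occurrences of each term in each document.
counts <- as.data.frame(table(id = term_df$id, terms = term_df$terms),
                        stringsAsFactors = FALSE)
counts <- counts[counts$Freq > 0, ]

# Term frequency: share of each term within its document.
doc_len <- tapply(counts$Freq, counts$id, sum)
counts$tf <- counts$Freq / as.numeric(doc_len[counts$id])

# Inverse document frequency: log(N / number of documents containing the term).
n_docs <- length(unique(term_df$id))
doc_freq <- tapply(counts$id, counts$terms, function(x) length(unique(x)))
idf <- log(n_docs / doc_freq)

# TF-IDF is the product of the two.
counts$tf_idf <- counts$tf * as.numeric(idf[counts$terms])
counts[order(counts$id, counts$terms), ]
```

Terms that appear in every document get an idf of zero under this definition, which is why stop words are usually filtered first.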
Process time series data
Description
This function is used for time series data processing.
Usage
time_series_proc(dat, ID = NULL, group = NULL, time = NULL)
Arguments
dat |
A data.frame containing only predictor variables. |
ID |
The name of ID of observations or key variable of data. Default is NULL. |
group |
The group of behavioral or status variables. |
time |
The name of the variable recording the time at which the behavior happened. |
Details
The key to creating a good model is not the power of a specific modelling technique, but the breadth and depth of derived variables that represent a higher level of knowledge about the phenomena under examination.
Examples
dat = data.frame(id = c(1,1,1,2,2,3,3,3,4,4,4,4,4,5,5,6,7,7,
8,8,8,9,9,9,10,10,11,11,11,11,11,11),
terms = c('a','b','c','a','c','d','d','a',
'b','c','a','c','d','a','c',
'd','a','e','f','b','c','f','b',
'c','h','h','i','c','d','g','k','k'),
time = c(8,3,1,9,6,1,4,9,1,3,4,8,2,7,1,
3,4,1,8,7,2,5,7,8,8,2,1,5,7,2,7,3))
time_series_proc(dat = dat, ID = 'id', group = 'terms',time = 'time')
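To make the idea of derived variables concrete, here is a small base R sketch that turns raw event times into a few per-ID features. The feature names (n_events, first_time, last_time, mean_gap) and the helper derive_one are hypothetical; this is not the output of time_series_proc, only an example of the kind of derivation the Details paragraph describes.

```r
# Raw events: one row per (id, time).
dat <- data.frame(
  id   = c(1, 1, 1, 2, 2),
  time = c(8, 3, 1, 9, 6)
)

# Hypothetical helper: summarise the events of one id.
derive_one <- function(d) {
  t_sorted <- sort(d$time)
  gaps <- diff(t_sorted)  # time between consecutive events
  data.frame(
    id         = d$id[1],
    n_events   = length(t_sorted),
    first_time = t_sorted[1],
    last_time  = t_sorted[length(t_sorted)],
    mean_gap   = if (length(gaps) > 0) mean(gaps) else NA_real_
  )
}

# Apply the helper to each id and stack the results.
derived <- do.call(rbind, lapply(split(dat, dat$id), derive_one))
derived
```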
Time Format Transferring
Description
time_transfer
is for converting time variables to a time format.
Usage
time_transfer(dat, date_cols = NULL, ex_cols = NULL, note = FALSE)
Arguments
dat |
A data frame |
date_cols |
Names of time variable or regular expressions for finding time variables. Default is "DATE$|time$|date$|timestamp$|stamp$". |
ex_cols |
Names of excluded variables. Regular expressions can also be used to match variable names. Default is NULL. |
note |
Logical, outputs info. Default is FALSE. |
Value
A data.frame with transformed time variables.
Examples
#transfer a variable.
dat = time_transfer(dat = lendingclub,date_cols = "issue_d")
class(dat[,"issue_d"])
#transfer a group of variables with similar name.
#transfer all time variables.
dat = time_transfer(dat = lendingclub[1:3],date_cols = "_d$")
class(dat[,"issue_d"])
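The regular-expression selection of date columns can be sketched in base R as follows. The toy data frame and its column names are made up for illustration, and the single fixed date format is an assumption; time_transfer itself may recognise more formats.

```r
# Toy data with two date-like character columns and one numeric column.
dat <- data.frame(
  issue_d   = c("2018-01-15", "2018-02-20"),
  last_date = c("2019-03-01", "2019-04-11"),
  amount    = c(1000, 2500),
  stringsAsFactors = FALSE
)

# Pick columns whose names match the pattern, then convert them to Date.
date_cols <- grep("_d$|date$", names(dat), value = TRUE)
dat[date_cols] <- lapply(dat[date_cols], as.Date, format = "%Y-%m-%d")

# Inspect the resulting column classes.
sapply(dat, function(col) class(col)[1])
```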
time_variable
Description
This function is not intended to be used by end users.
Usage
time_variable(
dat,
date_cols = NULL,
enddate = NULL,
units = c("secs", "mins", "hours", "days", "weeks")
)
Arguments
dat |
A data.frame. |
date_cols |
Time variables. |
enddate |
End time. |
units |
Units of diff_time; "secs", "mins", "hours", "days" and "weeks" are available. |
Processing of Time or Date Variables
Description
This function is not intended to be used by end users.
Usage
time_vars_process(
df_tm = df_tm,
x,
enddate = NULL,
units = c("secs", "mins", "hours", "days", "weeks")
)
Arguments
df_tm |
A data.frame |
x |
Time variable. |
enddate |
End time. |
units |
Units of diff_time; "secs", "mins", "hours", "days" and "weeks" are available. |
tnr_value
Description
tnr_value
is for getting the true negative rate for a probability or score.
Usage
tnr_value(prob, target)
Arguments
prob |
A list of predicted probabilities or scores. |
target |
Vector of target. |
Value
True Negative Rate
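For reference, the true negative rate is TN / (TN + FP): the share of actual negatives that are predicted negative. The sketch below computes it in base R at a fixed cutoff. The function name tnr_at_cutoff and the cutoff argument are hypothetical; tnr_value takes no cutoff, and how it picks one is not stated here.

```r
# Hypothetical helper: true negative rate at a given probability cutoff.
tnr_at_cutoff <- function(prob, target, cutoff = 0.5) {
  predicted_pos <- prob >= cutoff
  tn <- sum(!predicted_pos & target == 0)  # negatives predicted negative
  fp <- sum(predicted_pos & target == 0)   # negatives predicted positive
  tn / (tn + fp)
}

prob   <- c(0.1, 0.4, 0.35, 0.8, 0.7, 0.2)
target <- c(0, 0, 1, 1, 0, 0)
tnr <- tnr_at_cutoff(prob, target)
tnr
```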
Training LR model
Description
train_lr
is for training the logistic regression model used in training_model
.
Usage
train_lr(
dat_train,
dat_test = NULL,
target,
x_list = NULL,
occur_time = NULL,
prop = 0.7,
tree_control = list(p = 0.02, cp = 1e-08, xval = 5, maxdepth = 10),
bins_control = list(bins_num = 10, bins_pct = 0.05, b_chi = 0.02, b_odds = 0.1, b_psi
= 0.03, b_or = 0.15, mono = 0.2, odds_psi = 0.15, kc = 1),
thresholds = list(cor_p = 0.8, iv_i = 0.02, psi_i = 0.1, cos_i = 0.6),
lasso = TRUE,
step_wise = TRUE,
best_lambda = "lambda.auc",
seed = 1234,
...
)
Arguments
dat_train |
data.frame of train data. |
dat_test |
data.frame of test data. Default is NULL. |
target |
name of target variable. |
x_list |
names of independent variables. Default is NULL. |
occur_time |
The name of the variable that represents the time at which each observation takes place. Default is NULL. |
prop |
Percentage of train-data after the partition. Default: 0.7. |
tree_control |
the list of parameters to control cutting initial breaks by decision tree. See details at: |
bins_control |
the list of parameters to control merging initial breaks. See details at: |
thresholds |
Thresholds for selecting variables.
|
lasso |
Logical. If TRUE, variables are filtered by LASSO. Default is TRUE. |
step_wise |
Logical, stepwise method. Default is TRUE. |
best_lambda |
Method for choosing the best lambda used to filter variables by LASSO. Three methods are available: "lambda.auc", "lambda.ks" and "lambda.sim_sign". Default is "lambda.auc". |
seed |
Random number seed. Default is 1234. |
... |
Other parameters |
Train-Test-Split
Description
train_test_split
is a function for partitioning data.
Usage
train_test_split(
dat,
prop = 0.7,
split_type = "Random",
occur_time = NULL,
cut_date = NULL,
start_date = NULL,
save_data = FALSE,
dir_path = tempdir(),
file_name = NULL,
note = FALSE,
seed = 43
)
Arguments
dat |
A data.frame with independent variables and target variable. |
prop |
The percentage of train data samples after the partition. |
split_type |
Methods for partition.
|
occur_time |
The name of the variable that represents the time at which each observation takes place. It is used for "OOT" split. |
cut_date |
Time points for splitting data sets, e.g. splitting Actual and Expected data sets. |
start_date |
The earliest occurrence time of observations. |
save_data |
Logical, save results in locally specified folder. Default is FALSE. |
dir_path |
The path for periodically saved data file. Default is tempdir(). |
file_name |
The name for periodically saved data file. Default is "dat". |
note |
Logical. Outputs info. Default is FALSE. |
seed |
Random number seed. Default is 43. |
Value
A list of indices (train-test)
Examples
train_test = train_test_split(lendingclub,
split_type = "OOT", prop = 0.7,
occur_time = "issue_d", seed = 12, save_data = FALSE)
dat_train = train_test$train
dat_test = train_test$test
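The difference between a random split and a time-ordered split can be sketched in base R. The sketch assumes that "OOT" means out-of-time, i.e. the earliest observations go to train and the latest to test; the toy data and variable names are hypothetical, and the package's own rules (for example cut_date handling) are not reproduced.

```r
set.seed(43)
dat <- data.frame(
  id         = 1:10,
  apply_date = as.Date("2020-01-01") + 0:9
)
prop <- 0.7
n_train <- round(prop * nrow(dat))

# Random split: sample row indices without regard to time.
train_idx    <- sample(seq_len(nrow(dat)), n_train)
random_train <- dat[train_idx, ]
random_test  <- dat[-train_idx, ]

# Out-of-time split (assumed meaning of "OOT"): order by time,
# keep the earliest rows for training and the latest for testing.
ord       <- order(dat$apply_date)
oot_train <- dat[ord[seq_len(n_train)], ]
oot_test  <- dat[ord[-seq_len(n_train)], ]
```

A time-ordered split is usually preferred when the model will be applied to future applications, because it tests stability over time rather than only fit.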
Training XGBoost
Description
train_xgb
is for training an XGB model used in training_model
.
Usage
train_xgb(
seed_number = 1234,
dtrain,
nthread = 2,
nfold = 1,
watchlist = NULL,
nrounds = 100,
f_eval = "ks",
early_stopping_rounds = 10,
verbose = 0,
params = NULL,
...
)
Arguments
seed_number |
Random number seed. Default is 1234. |
dtrain |
train-data of xgb.DMatrix datasets. |
nthread |
Number of threads |
nfold |
Number of the cross validation of xgboost |
watchlist |
named list of xgb.DMatrix datasets to use for evaluating model performance. Generated by
nrounds |
Max number of boosting iterations. |
f_eval |
Customized evaluation function; "ks" and "auc" are available. |
early_stopping_rounds |
If NULL, the early stopping function is not triggered. If set to an integer k, training with a validation set will stop if the performance doesn't improve for k rounds. |
verbose |
If 0, xgboost will stay silent. If 1, it will print information about performance. |
params |
A list containing parameters of xgboost. The complete list of parameters is available at: http://xgboost.readthedocs.io/en/latest/parameter.html |
... |
Other parameters |
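The two evaluation metrics named by f_eval can be sketched in base R. KS is taken here as the maximum gap between the cumulative true positive and false positive rates, and AUC is computed from ranks; the helper names ks_stat and auc_stat are hypothetical, and the package's own evaluation functions may differ in details such as tie handling.

```r
# KS statistic: largest gap between cumulative TPR and FPR when
# observations are ordered from highest to lowest score.
# Note: this simple version assumes there are no tied scores.
ks_stat <- function(prob, target) {
  y   <- target[order(prob, decreasing = TRUE)]
  tpr <- cumsum(y == 1) / sum(y == 1)
  fpr <- cumsum(y == 0) / sum(y == 0)
  max(abs(tpr - fpr))
}

# AUC via the rank-sum (Mann-Whitney) formula; rank() averages ties.
auc_stat <- function(prob, target) {
  r     <- rank(prob)
  n_pos <- sum(target == 1)
  n_neg <- sum(target == 0)
  (sum(r[target == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

prob   <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3)
target <- c(1, 1, 0, 1, 0, 0)
ks  <- ks_stat(prob, target)
auc <- auc_stat(prob, target)
```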
Training model
Description
training_model
Model builder
Usage
training_model(
model_name = "mymodel",
dat,
dat_test = NULL,
target = NULL,
occur_time = NULL,
obs_id = NULL,
x_list = NULL,
ex_cols = NULL,
pos_flag = NULL,
prop = 0.7,
split_type = if (!is.null(occur_time)) "OOT" else "Random",
preproc = TRUE,
low_var = 0.99,
missing_rate = 0.98,
merge_cat = 30,
remove_dup = TRUE,
outlier_proc = TRUE,
missing_proc = "median",
default_miss = list(-1, "missing"),
miss_values = NULL,
one_hot = FALSE,
trans_log = FALSE,
feature_filter = list(filter = c("IV", "PSI", "COR", "XGB"), iv_cp = 0.02, psi_cp =
0.1, xgb_cp = 0, cv_folds = 1, hopper = FALSE),
algorithm = list("LR", "XGB", "GBM", "RF"),
LR.params = lr_params(),
XGB.params = xgb_params(),
GBM.params = gbm_params(),
RF.params = rf_params(),
breaks_list = NULL,
parallel = FALSE,
cores_num = NULL,
save_pmml = FALSE,
plot_show = FALSE,
vars_plot = TRUE,
model_path = tempdir(),
seed = 46,
...
)
Arguments
model_name |
A string, name of the project. Default is "mymodel" |
dat |
A data.frame with independent variables and target variable. |
dat_test |
A data.frame of test data. Default is NULL. |
target |
The name of target variable. |
occur_time |
The name of the variable that represents the time at which each observation takes place. Default is NULL. |
obs_id |
The name of ID of observations or key variable of data. Default is NULL. |
x_list |
Names of independent variables. Default is NULL. |
ex_cols |
Names of excluded variables. Regular expressions can also be used to match variable names. Default is NULL. |
pos_flag |
The value of positive class of target variable, default: "1". |
prop |
Percentage of train-data after the partition. Default: 0.7. |
split_type |
Methods for partition. See details at : |
preproc |
Logical. Preprocess data. Default is TRUE. |
low_var |
Delete low variance variables or not; may be given as a numeric threshold. Default is 0.99. |
missing_rate |
The maximum percent of missing values for recoding values to missing and non_missing. |
merge_cat |
Merge the categories of character variables whose number of categories exceeds this value. |
remove_dup |
Logical, if TRUE, remove the duplicated observations. |
outlier_proc |
Logical, process outliers or not. Default is TRUE. |
missing_proc |
If logical, whether to process missing values or not. If "median", NAs are imputed with the median of the k nearest neighbors. If "avg_dist", NAs are imputed with the distance-weighted average of the k nearest neighbors. If "default", missing values are assigned -1 or "missing". Otherwise, missing values are processed according to the results of missing analysis. |
default_miss |
Default value of missing data imputation. Default is list(-1,'missing'). |
miss_values |
Other extreme value might be used to represent missing values, e.g: -9999, -9998. These miss_values will be encoded to -1 or "missing". |
one_hot |
Logical. If TRUE, one-hot encoding of category variables. Default is FALSE. |
trans_log |
Logical, Logarithmic transformation. Default is FALSE. |
feature_filter |
Parameters for selecting important and stable features. See details at: |
algorithm |
Algorithms for training a model. list("LR", "XGB", "GBM", "RF") are available. |
LR.params |
Parameters of logistic regression & scorecard. See details at : |
XGB.params |
Parameters of xgboost. See details at : |
GBM.params |
Parameters of GBM. See details at : |
RF.params |
Parameters of Random Forest. See details at : |
breaks_list |
A table containing a list of splitting points for each independent variable. Default is NULL. |
parallel |
Logical, parallel computing. Default is FALSE. |
cores_num |
The number of CPU cores to use. |
save_pmml |
Logical, save model in PMML format. Default is FALSE. |
plot_show |
Logical, show model performance in current graphic device. Default is FALSE. |
vars_plot |
Logical. If TRUE, plot the distribution, correlation or partial dependence of model input variables. Default is TRUE. |
model_path |
The path for periodically saved data file. Default is tempdir(). |
seed |
Random number seed. Default is 46. |
... |
Other parameters. |
Value
A list containing Model Objects.
See Also
train_test_split, data_cleansing, feature_selector, lr_params, xgb_params, gbm_params, rf_params, fast_high_cor_filter, get_breaks_all, lasso_filter, woe_trans_all, get_logistic_coef, score_transfer, get_score_card, model_key_index, ks_psi_plot, ks_table_plot
Examples
sub = cv_split(UCICreditCard, k = 30)[[1]]
dat = UCICreditCard[sub,]
x_list = c("LIMIT_BAL")
B_model = training_model(dat = dat,
model_name = "UCICreditCard",
target = "default.payment.next.month",
x_list = x_list,
occur_time =NULL,
obs_id =NULL,
dat_test = NULL,
preproc = FALSE,
outlier_proc = FALSE,
missing_proc = FALSE,
feature_filter = NULL,
algorithm = list("LR"),
LR.params = lr_params(lasso = FALSE,
step_wise = FALSE,
score_card = FALSE),
breaks_list = NULL,
parallel = FALSE,
cores_num = NULL,
save_pmml = FALSE,
plot_show = FALSE,
vars_plot = FALSE,
model_path = tempdir(),
seed = 46)
Process group numeric variables
Description
This function is used for grouped numeric data processing.
Usage
var_group_proc(dat, ID = NULL, group = NULL, num_var = NULL)
Arguments
dat |
A data.frame containing only predictor variables. |
ID |
The name of ID of observations or key variable of data. Default is NULL. |
group |
The group of behavioral or status variables. |
num_var |
The name of numeric variable to process. |
Examples
dat = data.frame(id = c(1,1,1,2,2,3,3,3,4,4,4,4,4,5,5,6,7,7,
8,8,8,9,9,9,10,10,11,11,11,11,11,11),
terms = c('a','b','c','a','c','d','d','a',
'b','c','a','c','d','a','c',
'd','a','e','f','b','c','f','b',
'c','h','h','i','c','d','g','k','k'),
time = c(8,3,1,9,6,1,4,9,1,3,4,8,2,7,1,
3,4,1,8,7,2,5,7,8,8,2,1,5,7,2,7,3))
var_group_proc(dat = dat, ID = 'id', group = 'terms', num_var = 'time')
variable_process
Description
This function is not intended to be used by end users.
Usage
variable_process(add)
Arguments
add |
A data.frame |
WOE Transformation
Description
woe_trans
is for transforming data to woe.
The woe_trans_all
function is a simpler wrapper for woe_trans
.
Usage
woe_trans_all(
dat,
x_list = NULL,
ex_cols = NULL,
bins_table = NULL,
target = NULL,
breaks_list = NULL,
note = FALSE,
save_data = FALSE,
parallel = FALSE,
woe_name = FALSE,
file_name = NULL,
dir_path = tempdir(),
...
)
woe_trans(
dat,
x,
bins_table = NULL,
target = NULL,
breaks_list = NULL,
woe_name = FALSE
)
Arguments
dat |
A data.frame with independent variables. |
x_list |
A list of x variables. |
ex_cols |
Names of excluded variables. Regular expressions can also be used to match variable names. Default is NULL. |
bins_table |
A table containing the WOE of each bin of variables; it is generated by get_bins_table_all or get_bins_table. |
target |
The name of target variable. Default is NULL. |
breaks_list |
A list containing the breaks of variables; it is generated by get_breaks_all or get_breaks. |
note |
Logical, outputs info. Default is FALSE. |
save_data |
Logical, save results in locally specified folder. Default is FALSE. |
parallel |
Logical, parallel computing. Default is FALSE. |
woe_name |
Logical. Add "_woe" at the end of the variable name. |
file_name |
The name for periodically saved woe file. Default is "dat_woe". |
dir_path |
The path for periodically saved woe file. Default is tempdir(). |
... |
Additional parameters. |
x |
The name of an independent variable. |
Value
A data.frame of WOE-transformed variables.
See Also
get_tree_breaks, cut_equal, select_best_class, select_best_breaks
Examples
sub = cv_split(UCICreditCard, k = 30)[[1]]
dat = UCICreditCard[sub,]
dat = re_name(dat, "default.payment.next.month", "target")
dat = data_cleansing(dat, target = "target", obs_id = "ID", occur_time = "apply_date",
miss_values = list("", -1))
train_test = train_test_split(dat, split_type = "OOT", prop = 0.7,
occur_time = "apply_date")
dat_train = train_test$train
dat_test = train_test$test
#get breaks of all predictive variables
x_list = c("PAY_0", "LIMIT_BAL", "PAY_AMT5", "EDUCATION", "PAY_3", "PAY_2")
breaks_list = get_breaks_all(dat = dat_train, target = "target",
x_list = x_list, occur_time = "apply_date", ex_cols = "ID",
save_data = FALSE, note = FALSE)
#woe transform
train_woe = woe_trans_all(dat = dat_train,
target = "target",
breaks_list = breaks_list,
woe_name = FALSE)
test_woe = woe_trans_all(dat = dat_test,
target = "target",
breaks_list = breaks_list,
note = FALSE)
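To show what a WOE transformation does to a single variable, here is a minimal base R sketch. It assumes the convention WOE = ln(share of bads in the bin / share of goods in the bin); the opposite sign convention is also common, and which one woe_trans uses is not stated here. The toy data and the single break point are made up, and empty bins (zero counts) would need smoothing that this sketch omits.

```r
x      <- c(1, 2, 3, 4, 5, 6, 7, 8)
target <- c(0, 0, 0, 1, 0, 1, 1, 1)   # 1 = bad, 0 = good (assumption)

# Bin the variable at one break point: (-Inf, 4] and (4, Inf].
bins <- cut(x, breaks = c(-Inf, 4, Inf))

# Count bads and goods per bin.
bad  <- tapply(target == 1, bins, sum)
good <- tapply(target == 0, bins, sum)

# WOE per bin under the assumed convention.
woe <- log((bad / sum(bad)) / (good / sum(good)))

# Replace each raw value by the WOE of its bin.
x_woe <- as.numeric(woe[as.character(bins)])
data.frame(x = x, bin = bins, x_woe = x_woe)
```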
XGBoost data
Description
xgb_data
is for preparing the data used in training_model
.
Usage
xgb_data(
dat_train,
target,
dat_test = NULL,
x_list = NULL,
prop = 0.7,
occur_time = NULL
)
Arguments
dat_train |
data.frame of train data. |
target |
name of target variable. |
dat_test |
data.frame of test data. Default is NULL. |
x_list |
names of independent variables of raw data. Default is NULL. |
prop |
Percentage of train-data after the partition. Default: 0.7. |
occur_time |
The name of the variable that represents the time at which each observation takes place. Default is NULL. |
Select Features using XGB
Description
xgb_filter
is for selecting important features using xgboost.
Usage
xgb_filter(
dat_train,
dat_test = NULL,
target = NULL,
pos_flag = NULL,
x_list = NULL,
occur_time = NULL,
ex_cols = NULL,
xgb_params = list(nrounds = 100, max_depth = 6, eta = 0.1, min_child_weight = 1,
subsample = 1, colsample_bytree = 1, gamma = 0, scale_pos_weight = 1,
early_stopping_rounds = 10, objective = "binary:logistic"),
f_eval = "auc",
cv_folds = 1,
cp = NULL,
seed = 46,
vars_name = TRUE,
note = TRUE,
save_data = FALSE,
file_name = NULL,
dir_path = tempdir(),
...
)
Arguments
dat_train |
A data.frame with independent variables and target variable. |
dat_test |
A data.frame of test data. Default is NULL. |
target |
The name of target variable. |
pos_flag |
The value of positive class of target variable, default: "1". |
x_list |
Names of independent variables. |
occur_time |
The name of the variable that represents the time at which each observation takes place. |
ex_cols |
A list of excluded variables. Regular expressions can also be used to match variable names. Default is NULL. |
xgb_params |
Parameters of xgboost. The complete list of parameters is available at: http://xgboost.readthedocs.io/en/latest/parameter.html. |
f_eval |
Customized evaluation function; "ks" and "auc" are available. |
cv_folds |
Number of cross-validations. Default: 1. |
cp |
Threshold of XGB feature's Gain. Default is 1/number of independent variables. |
seed |
Random number seed. Default is 46. |
vars_name |
Logical, output a list of filtered variables or table with detailed IV and PSI value of each variable. Default is TRUE. |
note |
Logical, outputs info. Default is TRUE. |
save_data |
Logical, save results in locally specified folder. Default is FALSE. |
file_name |
The name for periodically saved results files. Default is "Feature_importance_XGB". |
dir_path |
The path for periodically saved results files. Default is tempdir(). |
... |
Other parameters to pass to xgb_params. |
Value
Selected variables.
See Also
psi_iv_filter, gbm_filter, feature_selector
Examples
dat = UCICreditCard[1:1000,c(2,4,8:9,26)]
xgb_params = list(nrounds = 100, max_depth = 6, eta = 0.1,
min_child_weight = 1, subsample = 1,
colsample_bytree = 1, gamma = 0, scale_pos_weight = 1,
early_stopping_rounds = 10,
objective = "binary:logistic")
## Not run:
xgb_features = xgb_filter(dat_train = dat, dat_test = NULL,
target = "default.payment.next.month", occur_time = "apply_date",f_eval = 'ks',
xgb_params = xgb_params,
cv_folds = 1, ex_cols = "ID$|date$|default.payment.next.month$", vars_name = FALSE)
## End(Not run)
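The cp threshold described above can be illustrated without training a model. The sketch below applies the documented default, 1 divided by the number of independent variables, to a hypothetical importance table. The Gain values are invented for illustration and are not results, and whether the comparison in xgb_filter is strict or inclusive is an assumption.

```r
# Hypothetical feature importance table (Gain values are illustrative only).
importance <- data.frame(
  Feature = c("PAY_0", "LIMIT_BAL", "PAY_AMT5", "EDUCATION"),
  Gain    = c(0.55, 0.30, 0.10, 0.05),
  stringsAsFactors = FALSE
)

# Documented default threshold: 1 / number of independent variables.
cp <- 1 / nrow(importance)

# Keep features whose Gain reaches the threshold (inclusive by assumption).
selected <- importance$Feature[importance$Gain >= cp]
selected
```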
XGBoost Parameters
Description
xgb_params
is the list of parameters used to train an XGB model in training_model
.
xgb_params_search
is for searching for the optimal parameters of xgboost when any parameter of params in xgb_params
has more than one candidate value.
Usage
xgb_params(
nrounds = 1000,
params = list(max_depth = 6, eta = 0.01, gamma = 0, min_child_weight = 1, subsample =
1, colsample_bytree = 1, scale_pos_weight = 1),
early_stopping_rounds = 100,
method = "random_search",
iters = 10,
f_eval = "auc",
nfold = 1,
nthread = 2,
...
)
xgb_params_search(
dat_train,
target,
dat_test = NULL,
x_list = NULL,
prop = 0.7,
occur_time = NULL,
method = "random_search",
iters = 10,
nrounds = 100,
early_stopping_rounds = 10,
params = list(max_depth = 6, eta = 0.01, gamma = 0, min_child_weight = 1, subsample =
1, colsample_bytree = 1, scale_pos_weight = 1),
f_eval = "auc",
nfold = 1,
nthread = 2,
...
)
Arguments
nrounds |
Max number of boosting iterations. |
params |
A list containing parameters of xgboost. The complete list of parameters is available at: http://xgboost.readthedocs.io/en/latest/parameter.html |
early_stopping_rounds |
If NULL, the early stopping function is not triggered. If set to an integer k, training with a validation set will stop if the performance doesn't improve for k rounds. |
method |
Method of searching optimal parameters. "random_search", "grid_search" and "local_search" are available. |
iters |
Number of iterations of "random_search" optimal parameters. |
f_eval |
Customized evaluation function; "ks" and "auc" are available. |
nfold |
Number of the cross validation of xgboost |
nthread |
Number of threads |
... |
Other parameters |
dat_train |
A data.frame of train data. |
target |
Name of target variable. |
dat_test |
A data.frame of test data. Default is NULL. |
x_list |
Names of independent variables. Default is NULL. |
prop |
Percentage of train-data after the partition. Default: 0.7. |
occur_time |
The name of the variable that represents the time at which each observation takes place. Default is NULL. |
Value
A list of parameters.
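The idea behind "random_search" can be sketched in base R: build the grid of candidate values, sample a limited number of combinations, score each one, and keep the best. The scoring function below is a hypothetical stand-in for a cross-validated AUC or KS; it is not how xgb_params_search evaluates candidates, and the candidate values are made up.

```r
set.seed(46)

# Candidate values; a parameter with more than one value triggers a search.
params <- list(max_depth = c(3, 6), eta = c(0.01, 0.1), subsample = 1)

# Full grid of combinations (what a grid search would evaluate).
grid <- expand.grid(params)

# Random search: evaluate only `iters` randomly chosen combinations.
iters  <- 3
picked <- grid[sample(nrow(grid), min(iters, nrow(grid))), ]

# Hypothetical stand-in objective; a real search would train and validate a model.
score_fun <- function(p) -abs(p$max_depth - 6) - abs(p$eta - 0.1)
scores <- vapply(seq_len(nrow(picked)),
                 function(i) score_fun(picked[i, ]), numeric(1))

# Keep the best-scoring combination.
best <- picked[which.max(scores), ]
best
```

Random search trades completeness for speed: with many parameters the full grid grows multiplicatively, while the number of evaluated combinations stays fixed at iters.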