Type: Package
Title: A Family of Beta Mixture Models for Clustering Beta-Valued DNA Methylation Data
Version: 1.0.4
Description: A family of novel beta mixture models (BMMs) has been developed by Majumdar et al. (2022) <doi:10.48550/arXiv.2211.01938> to appositely model the beta-valued cytosine-guanine dinucleotide (CpG) sites, to objectively identify methylation state thresholds and to identify the differentially methylated CpG (DMC) sites using a model-based clustering approach. The family of beta mixture models employs different parameter constraints applicable to different study settings. The EM algorithm is used for parameter estimation, with a novel approximation during the M-step providing tractability and ensuring computational feasibility.
License: GPL-3
Depends: R (≥ 3.5.0)
Imports: foreach, doParallel, stats, utils, ggplot2, plotly, scales, pROC
Encoding: UTF-8
NeedsCompilation: yes
LazyData: true
LazyDataCompression: xz
RoxygenNote: 7.3.2
Suggests: rmarkdown, knitr
VignetteBuilder: knitr
Maintainer: Koyel Majumdar <koyel.majumdar@ucdconnect.ie>
Packaged: 2024-09-24 09:00:43 UTC; koyel
Author: Koyel Majumdar ORCID iD [aut, cre], Romina Silva [aut], Antoinette Sabrina Perry [aut], Ronald William Watson [aut], Andrea Rau ORCID iD [aut], Florence Jaffrezic [aut], Thomas Brendan Murphy ORCID iD [aut], Isobel Claire Gormley ORCID iD [aut]
Repository: CRAN
Date/Publication: 2024-09-24 09:30:05 UTC

AUC and WD function

Description

Function to find the level of similarities between the R cumulative distributions estimated in each of the K clusters.

Usage

AUC_WD_metric(alpha, delta, K, R)

Arguments

alpha

The first shape parameter estimated for the beta mixture model.

delta

The second shape parameter estimated for the beta mixture model.

K

The number of clusters estimated.

R

The number of sample types in the dataset.

Details

Function to find the level of similarities between the R cumulative distributions estimated in each of the K clusters.

Value

The list with AUC and WD values.

See Also

betaclust


The DMC identification function

Description

A function to identify the most differentially methylated clusters from K clusters.

Usage

DMC_identification(
  object,
  data,
  CpG_site_list,
  threshold = 0.65,
  metric = "AUC"
)

Arguments

object

A betaclust object

data

A dataframe of dimension C \times NR containing methylation values for C CpG sites from R samples collected from N patients which was passed as an argument to the betaclust function.

CpG_site_list

The IlmnID of all the CpG sites analysed by betaclust function.

threshold

The threshold value/s for selecting the most differentially methylated clusters, default= 0.65

metric

The metric (AUC or WD selected). default="AUC"

Details

This function selects the most diffentially methylated clusters based on AUC and WD metric calculated and returns the CpG sites belonging to those clusters.

Value

The function returns a dataframe of CpG sites and methylation values identified to belong to the most differentially methylated clusters

See Also

beta_kr

pca.methylation.data

plot.betaclust

summary.betaclust

betaclust

Examples


my.seed <- 190
M <- 3
N <- 4
R <- 2
data_output <- betaclust(pca.methylation.data[1:30,2:9], M, N, R,
            model_names = "K.R",
            parallel_process = FALSE, seed = my.seed)
dmc_df <-DMC_identification(data_output,pca.methylation.data[1:30,2:9],
pca.methylation.data[1:30,1],
 threshold = 0.65, metric = "AUC")



Fit the K.. model

Description

Fit the K.. model from the betaclust family of beta mixture models for DNA methylation data. The K.. model analyses a single DNA sample type and identifies the thresholds between the different methylation states.

Usage

beta_k(data, M = 3, parallel_process = FALSE, seed = NULL)

Arguments

data

A dataframe of dimension C \times N containing methylation values for C CpG sites from R = 1 sample type collected from N patients. Samples are grouped together in the dataframe such that the columns are ordered as Sample1_Patient1, Sample1_Patient2, etc.

M

Number of methylation states to be identified in a DNA sample type.

parallel_process

The "TRUE" option results in parallel processing of the models for increased computational efficiency. The default option has been set as "FALSE" due to package testing limitations.

seed

Seed to allow for reproducibility (default = NULL).

Details

The K.. model clusters each of the C CpG sites into one of K methylation states, based on data from N patients for one DNA sample type (i.e. R = 1). As each CpG site can belong to any of the M = 3 methylation states (hypomethylated, hemimethylated or hypermethylated), the default value of K = M = 3. Under the K.. model the shape parameters of each cluster are constrained to be equal for each patient. The returned object from this function can be passed as an input parameter to the threshold function available in this package to calculate the thresholds between the methylation states.

Value

A list containing:

See Also

beta_kn

betaclust

threshold

Examples

my.seed <- 190
M <- 3
data_output <- beta_k(pca.methylation.data[1:30,2:5], M,
                      parallel_process = FALSE, seed = my.seed)
thresholds <- threshold(data_output, pca.methylation.data[1:30,2:5], "K..")

Fit the KN. model

Description

Fit the KN. model from the betaclust family of beta mixture models for DNA methylation data. The KN. model analyses a single DNA sample type and identifies the thresholds between the different methylation states.

Usage

beta_kn(data, M = 3, parallel_process = FALSE, seed = NULL)

Arguments

data

A dataframe of dimension C \times N containing methylation values for C CpG sites from R = 1 sample type collected from N patients. Samples are grouped together in the dataframe such that the columns are ordered as Sample1_Patient1, Sample1_Patient2, etc.

M

Number of methylation states to be identified in a DNA sample type.

parallel_process

The "TRUE" option results in parallel processing of the models for increased computational efficiency. The default option has been set as "FALSE" due to package testing limitations.

seed

Seed to allow for reproducibility (default = NULL).

Details

The KN. model clusters each of the C CpG sites into one of K methylation states, based on data from N patients for one DNA sample type (i.e. R = 1). As each CpG site can belong to any of the M = 3 methylation states (hypomethylated, hemimethylated or hypermethylated), the default value of K = M = 3. The KN. model differs from the K.. model as it is less parsimonious, allowing cluster and patient-specific shape parameters. The returned object can be passed as an input parameter to the threshold function available in this package to calculate the thresholds between the methylation states.

Value

A list containing:

See Also

beta_k

betaclust

threshold

Examples

my.seed <- 190
M <- 3
data_output <- beta_kn(pca.methylation.data[1:30,2:5], M,
                       parallel_process = FALSE, seed = my.seed)
thresholds <- threshold(data_output, pca.methylation.data[1:30,2:5], "KN.")

Fit the K.R Model

Description

A beta mixture model for identifying differentially methylated CpG sites between R DNA sample types collected from N patients.

Usage

beta_kr(data, M = 3, N, R, parallel_process = FALSE, seed = NULL)

Arguments

data

A dataframe of dimension C \times NR containing methylation values for C CpG sites from R sample types collected from N patients. Samples are grouped together in the dataframe such that the columns are ordered as Sample1_Patient1, Sample1_Patient2, Sample2_Patient1, Sample2_Patient2, etc.

M

Number of methylation states to be identified.

N

Number of patients in the study.

R

Number of sample types collected from each patient for study.

parallel_process

The "TRUE" option results in parallel processing of the models for increased computational efficiency. The default option has been set as "FALSE" due to package testing limitations.

seed

Seed to allow for reproducibility (default = NULL).

Details

The K.R model allows identification of the differentially methylated CpG sites between the R DNA sample types collected from each of N patients. As each CpG site in a DNA sample can belong to one of M methylation states, there can be K=M^R methylation state changes between R DNA sample types. The shape parameters vary for each DNA sample type but are constrained to be equal for each patient. An initial clustering using k-means is performed to identify K clusters. The resulting clustering solution is provided as starting values to the Expectation-Maximisation algorithm. A digamma approximation is used to obtain the maximised parameters in the M-step.

Value

A list containing:

See Also

betaclust

Examples

my.seed <- 190
M <- 3
N <- 4
R <- 2
data_output = beta_kr(pca.methylation.data[1:30,2:9], M, N, R,
                      parallel_process = FALSE, seed = my.seed)

The betaclust wrapper function

Description

A family of model-based clustering techniques to identify methylation states in beta-valued DNA methylation data.

Usage

betaclust(
  data,
  M = 3,
  N,
  R,
  model_names = "K..",
  model_selection = "BIC",
  parallel_process = FALSE,
  seed = NULL
)

Arguments

data

A dataframe of dimension C \times NR containing methylation values for C CpG sites from R sample types collected from N patients. Samples are grouped together in the dataframe such that the columns are ordered as Sample1_Patient1, Sample1_Patient2, Sample2_Patient1, Sample2_Patient2, etc.

M

Number of methylation states to be identified in a DNA sample type.

N

Number of patients in the study.

R

Number of sample types collected from each patient for the study.

model_names

Models to run from the set of models, K.., KN. and K.R, default = K.. . See details.

model_selection

Information criterion used for model selection. Options are AIC, BIC or ICL (default = BIC).

parallel_process

The "TRUE" option results in parallel processing of the models for increased computational efficiency. The default option has been set as "FALSE" due to package testing limitations.

seed

Seed to allow for reproducibility (default = NULL).

Details

This is a wrapper function which can be used to fit all three models (K.., KN., K.R) within a single function.

The K.. and KN. models are used to analyse a single DNA sample type (R = 1) and cluster the C CpG sites into the K clusters which represent the different methylation states in a DNA sample type. As each CpG site can belong to any of the M=3 methylation states (hypomethylation, hemimethylation and hypermethylation), the default value for K=M=3. The thresholds between methylation states are objectively inferred from the clustering solution.

The K.R model is used to analyse R independent sample types collected from N patients, where each sample contains C CpG sites, and cluster the dataset into K=M^R clusters to identify the differentially methylated CpG (DMC) sites between the R DNA sample types.

Value

The function returns an object of the betaclust class which contains the following values:

References

Silva, R., Moran, B., Russell, N.M., Fahey, C., Vlajnic, T., Manecksha, R.P., Finn, S.P., Brennan, D.J., Gallagher, W.M., Perry, A.S.: Evaluating liquid biopsies for methylomic profiling of prostate cancer. Epigenetics 15(6-7), 715-727 (2020). doi: 10.1080/15592294.2020.1712876.

Majumdar, K., Silva, R., Perry, A.S., Watson, R.W., Murphy, T.B., Gormley, I.C.: betaclust: a family of mixture models for beta valued DNA methylation data. arXiv [stat.ME] (2022). doi: 10.48550/ARXIV.2211.01938.

See Also

beta_k

beta_kn

beta_kr

pca.methylation.data

plot.betaclust

summary.betaclust

threshold

Examples


my.seed <- 190
M <- 3
N <- 4
R <- 2
data_output <- betaclust(pca.methylation.data[1:30,2:9], M, N, R,
            model_names = c("K..","KN.","K.R"), model_selection = "BIC",
            parallel_process = FALSE, seed = my.seed)



The empirical cumulative distribution function plot

Description

An empirical cumulative distribution function (ECDF) plot for a betaclust object.

Usage

ecdf.betaclust(x, R = 2, sample_name = NULL, title = NULL)

Arguments

x

A dataframe containing methylation values of identified differentially methylated regions related to a gene. Samples are grouped together in the dataframe such that the columns are ordered as Sample1_Patient1, Sample1_Patient2, Sample2_Patient1, Sample2_Patient2, etc.

R

Number of tissue sample types from which DNA methylation data are collected (default R = 2).

sample_name

The order in which the sample types are grouped in the dataframe. If no value is specified then default values of sample names, e.g. Sample 1, Sample 2, etc are used (default = NULL).

title

The title that the user wants to display on the graph. The default is "NULL".

Details

This function plots the ECDF of the differentially methylated CpG sites identified using the K.R model for all patient samples. The plot visualises the methylation state changes between the different DNA samples for each patient.

Value

The ECDF plot for the selected CpG sites for all patients and their DNA sample types.

See Also

betaclust

beta_kr


Akaike Information Criterion

Description

Compute the AIC value for the optimal model.

Usage

em_aic(llk, C, M, N, R, model_name = "K..")

Arguments

llk

Log-likelihood value.

C

Number of CpG sites.

M

Number of methylation states identified in a DNA sample.

N

Number of patients.

R

Number of DNA sample types collected from each patient.

model_name

Fitted mixture model. Options are "K..", "KN." and/or "K.R" (default = "K..").

Details

Computes the AIC for a specified model given the log-likelihood, the dimension of the data, and the model names.

Value

The AIC value for the selected model.

See Also

em_bic

em_icl


Bayesian Information Criterion

Description

Compute the BIC value for the optimal model.

Usage

em_bic(llk, C, M, N, R, model_name = "K..")

Arguments

llk

Log-likelihood value.

C

Number of CpG sites.

M

Number of methylation states identified in a DNA sample.

N

Number of patients.

R

Number of DNA sample types collected from each patient.

model_name

Fitted mixture model. Options are "K..", "KN." and/or "K.R" (default = "K..").

Details

Computes the BIC for a specified model given the log-likelihood, the dimension of the data, and the model names.

Value

The BIC value for the selected model.

See Also

em_aic

em_icl


Integrated Complete-data Likelihood (ICL) Criterion

Description

Compute the ICL value for the optimal model.

Usage

em_icl(llk, C, M, N, R, model_name = "K..", z)

Arguments

llk

Log-likelihood value.

C

Number of CpG sites.

M

Number of methylation states identified in a DNA sample.

N

Number of patients.

R

Number of DNA sample types collected from each patient.

model_name

Fitted mixture model. Options are "K..", "KN." and/or "K.R" (default = "K..").

z

A matrix of posterior probabilities of cluster membership, stored as z in the object from beta_k, beta_kn and beta_kr functions.

Details

Computes the ICL for a specified model given the log-likelihood, the dimension of the data, and the model names. This criterion penalises the BIC by including an entropy term favouring well separated clusters.

Value

The ICL value for the selected model.

See Also

em_aic

em_bic


MethylationEPIC manifest data.

Description

A dataset containing a subset of the manifest data from the Illumina MethylationEPIC beadchip array. A subset of the complete dataset has been uploaded in the package for testing purpose. The complete dataset is available on GitHub.

Usage

data(legacy.data)

Format

A data frame with 5,080 rows and 6 columns.

See Also

pca.methylation.data


DNA methylation data from patients with prostate cancer

Description

A dataset containing pre-processed beta methylation values from R=2 sample types, collected from N=4 patients with prostate cancer.

Usage

data(pca.methylation.data)

Format

A data frame with 5,067 rows and 9 columns. The data contain no missing values.

Details

The raw methylation array data was first quality controlled and preprocessed using the RnBeads package. The array data were then normalized and and probes located outside of CpG sites and on the sex chromosome were filtered out. The CpG sites with missing values were removed from the resulting dataset. A subset of the complete dataset has been uploaded in the package for testing purposes. The complete dataset is available on GitHub.

References

Mueller F, Scherer M, Assenov Y, Lutsik P, Walter J, Lengauer T, Bock C (2019). “RnBeads 2.0: comprehensive analysis of DNA methylation data.” Genome Biology, 20(55).

Assenov Y, Mueller F, Lutsik P, Walter J, Lengauer T, Bock C (2014). “Comprehensive Analysis of DNA Methylation Data with RnBeads.” Nature Methods, 11(11), 1138–1140.

See Also

legacy.data


Plots for visualizing the betaclust class object

Description

Visualise a betaclust clustering solution by plotting the fitted and kernel density estimates, the uncertainty and the information criterion.

Usage

## S3 method for class 'betaclust'
plot(
  x,
  what = "fitted density",
  plot_type = "ggplot",
  data = NULL,
  sample_name = NULL,
  title = NULL,
  patient_number = 1,
  threshold = FALSE,
  scale_param = "free_y",
  ...
)

Arguments

x

A betaclust object.

what

The different plots that can be obtained are either "fitted density","kernel density","uncertainty" or "information criterion" (default = "fitted density").

plot_type

The plot type to be displayed are either "ggplot" or "plotly" (default = "ggplot").

data

A dataframe of dimension C \times NR containing methylation values for C CpG sites from R samples collected from N patients which was passed as an argument to the betaclust function. The data is not required as an input when generating "uncertainty" or "information criterion" plots and the default has been set as "NULL". The data needs to be passed as an argument to this function when generating either "fitted density" or "kernel density" plots.

sample_name

The names of DNA sample types in the dataset analysed by the K.R model. If no value is passed then default values of sample names, e.g. Sample 1, Sample 2, etc are used as legend text (default = NULL).

title

The title that the user wants to display. If no title is to be displayed the default is "NULL".

patient_number

The column number representing the patient in the patient-wise ordered dataset selected for visualizing the clustering solution of the K.. or KN. model (default = 1).

threshold

The "TRUE" option displays the threshold points in the graph for the K.. and the KN. model (default = "FALSE").

scale_param

The position scales can be fixed or allowed to vary between different panels generated for the density estimate plots for visualizing the K.R clustering solution. Options are "fixed", "free_y","free_x" or "free" (default = "free_y"). The option "fixed" results in the x and y scales being fixed across all panels, "free" varies the x and y scales across the panels, "free_x" fixes the y scale and lets the x scale vary across all panels and "free_y" fixes the x scale and lets the y scale vary across all panels.

...

Other graphics parameters.

Details

The fitted density estimates can be visualized under the optimal clustering solution by specifying what = "fitted density" and kernel density estimates under the optimal clustering solution by specifying what = "kernel density".

The threshold inferred can be visualized by specifying threshold = TRUE. The KN. model calculates different pairs of threshold points for each patient as the shape parameters are allowed to vary for each patient. So the patient for whom the threshold needs to be displayed can be specified by inputting the column number representing the patient in the patient-wise ordered dataset in the parameter patient_number.

Interactive plots can also be produced using plot_type = "plotly". The uncertainty in the clustering solution can be plotted using what = "uncertainty". The information criterion values for all fitted models can be plotted using what = "information criterion".

Value

This function displays the following plots as requested by the user:

See Also

betaclust

Examples


my.seed <- 190
M <- 3
N <- 4
R <- 2
data_output <- betaclust(pca.methylation.data[1:100,2:9], M, N, R,
            model_names = c("K..","KN.","K.R"), model_selection = "BIC",
            parallel_process = FALSE, seed = my.seed)
plot(data_output, what = "fitted density", plot_type = "ggplot",
     data=pca.methylation.data[1:100,2:9])
plot(data_output, what = "uncertainty", plot_type = "ggplot")
plot(data_output, what = "information criterion", plot_type = "ggplot")


Summarizing the beta mixture model fits

Description

Summary method for a betaclust object containing the results under the optimal model selected.

Usage

## S3 method for class 'betaclust'
summary(object, ...)

Arguments

object

A betaclust object.

...

Further arguments passed to or from other methods.

Value

An object of class summary.betaclust which contains the following list:

See Also

betaclust

Examples


my.seed <- 190
M <- 3
N <- 4
R <- 2
data_output <- betaclust(pca.methylation.data[1:30,2:9], M, N, R,
            model_names=c("K..","KN.","K.R"), model_selection="BIC",
            parallel_process = FALSE, seed=my.seed)
summary(data_output)

Thresholds under the K.. and the KN. models

Description

An objective method to calculate the threshold points for the clustering solution of the K.. and the KN. models.

Usage

threshold(object, data, model_name)

Arguments

object

A beta_k or beta_kn object.

data

A dataframe of dimension C \times NR containing methylation values for C CpG sites from R sample types collected from N patients which was passed as an argument to the betaclust function.

model_name

The name of the model for which the thresholds need to be calculated.

Details

As the K.. model constrains the shape parameters to be equal for all patients, a single pair of threshold points are calculated for all patients. The KN. model allows patient-specific shape parameters which results in a pair of threshold points for each patient based on the shape parameters for that patient. The first threshold point means that any CpG site with beta value less than this value is likely to be hypomethylated. The second threshold point means that any CpG site with beta value greater than this is highly likely to be hypermethylated. A CpG site with beta value lying between the two threshold points is likely to be hemimethylated.

Value

thresholds - The threshold points calculated for the selected model. A vector containing two threshold points are returned for the K.. model whereas a matrix containing two threshold points for each patient is returned for the KN. model.

See Also

beta_k

beta_kn

betaclust