Title: | Clustering of Functional Data Based on Measures of Change |
Version: | 2.2.1 |
Description: | Implements a three-step procedure in the spirit of Leffondre et al. (2004) to identify clusters of individual longitudinal trajectories. The procedure involves (1) computing a number of "measures of change" capturing various features of the trajectories; (2) using a Principal Component Analysis based dimension reduction algorithm to select a subset of measures and (3) using the k-medoids or k-means algorithm to identify clusters of trajectories. |
License: | MIT + file LICENSE |
URL: | https://CRAN.R-project.org/package=traj |
Encoding: | UTF-8 |
RoxygenNote: | 7.3.2 |
Suggests: | knitr, rmarkdown, testthat (≥ 3.0.0) |
Config/testthat/edition: | 3 |
Imports: | stats, cluster, psych |
Depends: | R (≥ 2.10) |
LazyData: | true |
VignetteBuilder: | knitr |
NeedsCompilation: | no |
Packaged: | 2025-02-01 16:14:39 UTC; Moi |
Author: | Marie-Pierre Sylvestre [aut], Laurence Boulanger [aut, cre], Gillis Delmas Tchouangue Dinkou [ctb], Dan Vatnik [ctb] |
Maintainer: | Laurence Boulanger <laurence.boulanger@umontreal.ca> |
Repository: | CRAN |
Date/Publication: | 2025-02-01 23:40:02 UTC |
traj: Clustering of Functional Data Based on Measures of Change
Description
Implements a three-step procedure in the spirit of Leffondre et al. (2004) to identify clusters of individual longitudinal trajectories. The procedure involves (1) computing a number of "measures of change" capturing various features of the trajectories; (2) using a Principal Component Analysis based dimension reduction algorithm to select a subset of measures and (3) using the k-medoids or k-means algorithm to identify clusters of trajectories.
Author(s)
Maintainer: Laurence Boulanger laurence.boulanger@umontreal.ca
Authors:
Marie-Pierre Sylvestre marie-pierre.sylvestre@umontreal.ca
Other contributors:
Gillis Delmas Tchouangue Dinkou [contributor]
Dan Vatnik [contributor]
Compute Measures for Identifying Patterns of Change in Longitudinal Data
Description
Step1Measures computes up to 19 measures for each longitudinal trajectory. See Details for the list of measures.
Usage
Step1Measures(
Data,
Time = NULL,
ID = FALSE,
measures = c(1:18),
midpoint = NULL,
cap.outliers = FALSE
)
## S3 method for class 'trajMeasures'
print(x, ...)
## S3 method for class 'trajMeasures'
summary(object, ...)
Arguments
Data |
a matrix or data frame in which each row contains the longitudinal data (trajectories). |
Time |
either |
ID |
logical. Set to |
measures |
a vector containing the numerical identifiers of the measures to compute. The default, 1:18, corresponds to measures 1-18 and thus excludes the measures which require specifying a midpoint. |
midpoint |
specifies which column of |
cap.outliers |
logical. If |
x |
object of class |
... |
further arguments passed to or from other methods. |
object |
object of class |
Details
Each trajectory must have at least 3 observations; otherwise, it is omitted from the analysis.
The 19 measures and their numerical identifiers are listed below. Please refer to the vignette for the specific formulas used to compute them.
1. Maximum
2. Range (max - min)
3. Mean value
4. Standard deviation
5. Intercept of the linear model
6. Slope of the linear model
7. R^2: Proportion of variance explained by the linear model
8. Curve length (total variation)
9. Rate of intersection with the mean
10. Proportion of time spent above the mean
11. Minimum of the first derivative
12. Maximum of the first derivative
13. Mean of the first derivative
14. Standard deviation of the first derivative
15. Minimum of the second derivative
16. Maximum of the second derivative
17. Mean of the second derivative
18. Standard deviation of the second derivative
19. Later change/Early change
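To make the first few measures concrete, here is a minimal base R sketch that computes simple versions of measures 1 to 8 for a single trajectory. The helper simple_measures is hypothetical and not part of the package, and the formulas are textbook definitions that may differ from the exact ones given in the vignette (for instance, curve length is taken here as the plain sum of absolute successive differences).

```r
# Hypothetical helper (not part of traj): textbook versions of measures 1-8
# for one trajectory y observed at times t. The package's own formulas
# (see the vignette) may differ.
simple_measures <- function(y, t = seq_along(y)) {
  fit <- lm(y ~ t)  # least-squares line through the trajectory
  c(
    maximum         = max(y),
    range           = max(y) - min(y),
    mean            = mean(y),
    sd              = sd(y),
    intercept       = unname(coef(fit)[1]),
    slope           = unname(coef(fit)[2]),
    r.squared       = cor(t, y)^2,        # equals R^2 of a simple linear fit
    total.variation = sum(abs(diff(y)))   # sum of absolute successive changes
  )
}

y <- c(1, 3, 2, 5, 4, 6)  # made-up trajectory observed at times 1, ..., 6
m <- simple_measures(y)
print(round(m, 3))
```

Applying such a helper to each row of a data matrix (for example with apply) would give one row of measures per trajectory, which is the shape of input the later steps work with.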
If cap.outliers is set to TRUE, or if some measures are infinite as a result of division by 0, Nishiyama's improved Chebyshev bound for continuous distributions is used to determine extreme values for each measure, corresponding to a 0.3% probability threshold. Extreme values beyond the threshold are then capped to the 0.3% probability threshold (see the vignette for more details). If applicable, values that would be of the form 0/0 are set to 1.
Value
An object of class trajMeasures; a list containing the values of the measures, a table of the outliers which have been capped, as well as a curated form of the function's arguments.
References
Leffondre K, Abrahamowicz M, Regeasse A, Hawker GA, Badley EM, McCusker J, Belzile E. Statistical measures were proposed for identifying longitudinal patterns of change in quantitative health indicators. J Clin Epidemiol. 2004 Oct;57(10):1049-62. doi: 10.1016/j.jclinepi.2004.02.012. PMID: 15528056.
Nishiyama T. Improved Chebyshev inequality: new probability bounds with known supremum of PDF. arXiv:1808.10770v2 [stat.ME]. https://doi.org/10.48550/arXiv.1808.10770
Examples
## Not run:
data("trajdata")
trajdata.noGrp <- trajdata[, -which(colnames(trajdata) == "Group")] #remove the Group column
m1 = Step1Measures(trajdata.noGrp, ID = TRUE, measures = 19, midpoint = NULL)
m2 = Step1Measures(trajdata.noGrp, ID = TRUE, measures = 19, midpoint = 3)
identical(m1$measures, m2$measures)
## End(Not run)
Select a Subset of the Measures Using Factor Analysis
Description
This function applies the following dimension reduction algorithm to the measures computed by Step1Measures:
1. Drop the measures whose values are constant across the trajectories;
2. Whenever two measures are highly correlated (absolute value of Pearson correlation > 0.98), keep the highest-ranking measure on the list (see Step1Measures) and drop the other;
3. Use principal component analysis (PCA) on the measures to form factors summarizing the variability in the measures;
4. Drop the factors whose variance is smaller than that of any one of the standardized measures (each of which has variance 1);
5. Perform a varimax rotation on the remaining factors;
6. For each rotated factor, select the measure that has the highest correlation (aka factor loading) with it and that hasn't yet been selected;
7. Drop the remaining measures.
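The steps above can be sketched in base R as follows. This is only an illustration under stated assumptions: it uses eigen and varimax from the stats package instead of the principal function from psych that the package relies on, the data and column names (m1 to m6) are simulated and hypothetical, and details such as tie handling may differ from the actual implementation.

```r
set.seed(1)
n <- 200
a <- rnorm(n); b <- rnorm(n)
# Simulated "measures" (hypothetical): m2 nearly duplicates m1, m4 is constant.
M <- cbind(m1 = a, m2 = a + rnorm(n, sd = 0.01), m3 = a + rnorm(n, sd = 0.5),
           m4 = rep(1, n), m5 = b, m6 = b + rnorm(n, sd = 0.5))

# Drop constant measures.
X <- M[, apply(M, 2, sd) > 0, drop = FALSE]

# Among highly correlated measures, keep the one that comes first in the list.
kept <- character(0)
for (nm in colnames(X)) {
  if (length(kept) == 0 ||
      all(abs(cor(X[, nm], X[, kept, drop = FALSE])) <= 0.98)) {
    kept <- c(kept, nm)
  }
}
X <- X[, kept, drop = FALSE]

# PCA on standardized measures; keep factors with variance (eigenvalue) >= 1.
e <- eigen(cor(X))
k <- sum(e$values >= 1)
L <- e$vectors[, 1:k, drop = FALSE] %*% diag(sqrt(e$values[1:k]), k)
rownames(L) <- colnames(X)

# Varimax rotation (stats::varimax needs at least two factors).
if (k > 1) L <- unclass(varimax(L)$loadings)

# For each rotated factor, pick the not-yet-selected measure with the
# largest absolute loading.
selected <- character(0)
for (j in seq_len(k)) {
  ranked <- names(sort(abs(L[, j]), decreasing = TRUE))
  selected <- c(selected, setdiff(ranked, selected)[1])
}
print(selected)
```

In this simulated setup the measures fall into two nearly independent blocks, so one would expect two factors to be retained and one measure to be selected from each block.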
Usage
Step2Selection(trajMeasures, num.select = NULL, discard = NULL, select = NULL)
## S3 method for class 'trajSelection'
print(x, ...)
## S3 method for class 'trajSelection'
summary(object, ...)
Arguments
trajMeasures |
object of class |
num.select |
an optional positive integer indicating the number of
factors to keep in the second stage of the algorithm. Defaults to |
discard |
an optional vector of positive integers corresponding to the
measures to be dropped from the analysis. See
|
select |
an optional vector of positive integers corresponding to the
measures to forcefully select. Defaults to |
x |
object of class |
... |
further arguments passed to or from other methods. |
object |
object of class |
Details
Whenever two measures are highly correlated (absolute value of Pearson correlation > 0.98), the highest-ranking measure on the list (see Step1Measures) is kept and the other is discarded. PCA is applied to the remaining measures using the principal function from the psych package.
Value
An object of class trajSelection; a list containing the values of the selected measures, the output of the principal component analysis as well as a curated form of the arguments.
References
Leffondre K, Abrahamowicz M, Regeasse A, Hawker GA, Badley EM, McCusker J, Belzile E. Statistical measures were proposed for identifying longitudinal patterns of change in quantitative health indicators. J Clin Epidemiol. 2004 Oct;57(10):1049-62. doi: 10.1016/j.jclinepi.2004.02.012. PMID: 15528056.
Examples
## Not run:
data("trajdata")
trajdata.noGrp <- trajdata[, -which(colnames(trajdata) == "Group")] #remove the Group column
m = Step1Measures(trajdata.noGrp, measures = c(1:18), ID = TRUE)
s = Step2Selection(m)
print(s)
s2 = Step2Selection(m, select = c(13, 3, 12, 9))
## End(Not run)
Classify the Longitudinal Data Based on the Selected Measures
Description
Classifies the trajectories by applying the k-medoids or k-means algorithm to the measures selected by Step2Selection.
Usage
Step3Clusters(
trajSelection,
algorithm = "k-medoids",
metric = "euclidean",
nstart = 200,
iter.max = 100,
nclusters = NULL,
criterion = "Calinski-Harabasz",
K.max = min(ceiling(sqrt(nrow(trajSelection$selection))), 10),
B = 500
)
## S3 method for class 'trajClusters'
print(x, ...)
## S3 method for class 'trajClusters'
summary(object, ...)
Arguments
trajSelection |
object of class |
algorithm |
either |
metric |
to be passed to the |
nstart |
to be passed to the |
iter.max |
to be passed to the |
nclusters |
either |
criterion |
criterion to determine the optimal number of clusters if |
K.max |
maximum number of clusters to be considered if |
B |
to be passed to the |
x |
object of class |
... |
further arguments passed to or from other methods. |
object |
object of class |
Details
If "GAP" is the chosen criterion for determining the optimal number of clusters, the method described by Tibshirani et al. is implemented by the clusGap function.
Instead, if "Calinski-Harabasz" is the chosen criterion, the Calinski-Harabasz index is computed for each possible number of clusters between 2 and K.max and the optimal number of clusters is the maximizer of the Calinski-Harabasz index.
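As an illustration of the Calinski-Harabasz criterion, the sketch below computes the index for each candidate number of clusters and picks the maximizer. It is a simplified stand-in, not the package's code: it uses kmeans from the stats package rather than k-medoids, the data are simulated, and ch_index is a hypothetical helper using the standard definition (between-cluster sum of squares over k - 1, divided by within-cluster sum of squares over n - k).

```r
set.seed(42)
# Simulated data (hypothetical): three well-separated groups in two dimensions.
centers <- rbind(c(0, 0), c(10, 0), c(5, 10))
X <- do.call(rbind, lapply(1:3, function(i) {
  cbind(rnorm(30, centers[i, 1], 0.5), rnorm(30, centers[i, 2], 0.5))
}))

# Same form as the default K.max in the Usage section, applied to these data.
K.max <- min(ceiling(sqrt(nrow(X))), 10)

# Hypothetical helper: Calinski-Harabasz index of a kmeans fit on n points.
ch_index <- function(fit, n) {
  k <- length(fit$size)
  (fit$betweenss / (k - 1)) / (fit$tot.withinss / (n - k))
}

ch <- sapply(2:K.max, function(k) {
  ch_index(kmeans(X, centers = k, nstart = 20, iter.max = 100), nrow(X))
})
names(ch) <- 2:K.max

# The suggested number of clusters is the maximizer of the index.
best_k <- as.integer(names(ch)[which.max(ch)])
print(round(ch, 1))
print(best_k)
```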
Value
An object of class trajClusters; a list containing the result of the clustering, as well as a curated form of the arguments.
References
Tibshirani, R., Walther, G. and Hastie, T. (2001). Estimating the number of data clusters via the Gap statistic. Journal of the Royal Statistical Society B, 63, 411–423.
Tibshirani, R., Walther, G. and Hastie, T. (2000). Estimating the number of clusters in a dataset via the Gap statistic. Technical Report. Stanford.
Examples
## Not run:
data("trajdata")
trajdata.noGrp <- trajdata[, -which(colnames(trajdata) == "Group")] #remove the Group column
m = Step1Measures(trajdata.noGrp, ID = TRUE, measures = 1:18)
s = Step2Selection(m)
s$RC$loadings
s2 = Step2Selection(m, select = c(10, 12, 8, 4))
c3.part <- Step3Clusters(s2, nclusters = 3)$partition
c4.part <- Step3Clusters(s2, nclusters = 4)$partition
c5.part <- Step3Clusters(s2, nclusters = 5)$partition
## End(Not run)
Plots trajClusters objects
Description
Plots the cluster-specific median and mean trajectories and a random sample of trajectories from each cluster.
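For readers who want the underlying numbers rather than the plots, the pointwise mean and median trajectories of each cluster can be computed directly. The sketch below uses a small made-up matrix and a made-up cluster assignment; the object names (traj, cluster, mean_traj, median_traj) are hypothetical and not part of the package.

```r
# Made-up data: five trajectories observed at three time points.
traj <- rbind(c(1, 2, 3), c(3, 4, 5), c(2, 9, 4),
              c(10, 8, 6), c(12, 10, 8))
colnames(traj) <- c("t1", "t2", "t3")
cluster <- c(1, 1, 1, 2, 2)  # hypothetical cluster membership

# Split the rows by cluster, then summarize each time point within a cluster.
by_cluster  <- split(as.data.frame(traj), cluster)
mean_traj   <- t(sapply(by_cluster, colMeans))
median_traj <- t(sapply(by_cluster, function(d) apply(d, 2, median)))

print(mean_traj)    # one row per cluster, one column per time point
print(median_traj)
```

The median trajectory is less sensitive than the mean to a single unusual observation, such as the value 9 at the second time point in the first cluster.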
Usage
## S3 method for class 'trajClusters'
plot(x, sample.size = 5, ask = TRUE, which.plots = NULL, spline = FALSE, ...)
scatterplots(x, ask = TRUE, ...)
critplot(x, ...)
Arguments
x |
object of class |
sample.size |
the number of trajectories to be randomly sampled
from each cluster. Defaults to |
ask |
logical. If |
which.plots |
either |
spline |
logical. If |
... |
other parameters to be passed through to plotting functions. |
Examples
## Not run:
data("trajdata")
trajdata.noGrp <- trajdata[, -which(colnames(trajdata) == "Group")] #remove the Group column
m = Step1Measures(trajdata.noGrp, ID = TRUE)
s = Step2Selection(m)
c3 = Step3Clusters(s, nclusters = 3)
plot(c3)
#The pointwise mean trajectories correspond to the third and fourth displayed plots.
c4 = Step3Clusters(s, nclusters = 4)
plot(c4, which.plots = 3:4)
## End(Not run)
trajdata
Description
An artificially created data set with 130 trajectories split into four groups, labelled A, B, C, D according to the data generating process.
Usage
trajdata
Format
This data frame has 130 rows and the following 8 columns:
- ID
An identification variable that runs from 1 to 130.
- Group
A character variable that is either "A", "B", "C" or "D", depending on which of the four data generating processes the trajectory comes from.
- X1
The observation of the trajectory at time t = 1.
- X2
The observation of the trajectory at time t = 2.
- X3
The observation of the trajectory at time t = 3.
- X4
The observation of the trajectory at time t = 4.
- X5
The observation of the trajectory at time t = 5.
- X6
The observation of the trajectory at time t = 6.