R Tutorials: Robustness Assessment of Regressions using Cluster Analysis Typologies

Leonard Roth

In a standard sequence analysis, similar trajectories are clustered together to create a typology of trajectories, which is then often used to evaluate the association between sequence patterns and covariates in regression models. The sampling uncertainty, which affects both the derivation of the typology and the associated regressions, is typically ignored in this analysis, an oversight that may lead to incorrect statistical conclusions.

RARCAT aims to overcome this limitation by assessing the robustness of the regression results obtained from this standard analysis. It works as follows. Bootstrap samples are drawn from the data and, for each bootstrap sample, a new typology replicating the original one is constructed, followed by the estimation of the corresponding regression models. The bootstrap estimates are then combined within a multilevel modelling framework. The method and the exact interpretation of the results are fully discussed in Roth et al. (2024), listed in the references at the end of this document.

You are kindly asked to cite this reference if you use the methods presented in this document.

This document focuses on the R code needed to use RARCAT. It is structured in two parts: first, a short tutorial presenting a streamlined standard analysis of sequence data and a robustness assessment made with the rarcat function; then, an extended tutorial with the same data illustration and a detailed explanation of the functions, their arguments and their outputs.

Let us start by setting the seed for reproducible results.

set.seed(1)

Time for this code chunk to run: 0.01 seconds

Short tutorial

Creating the state sequence object

For this example, we use the openly available mvad dataset on transitions from school to work. First, we create the state sequence object.

## Loading the TraMineR library
library(TraMineR)
## Loading the data
data(mvad)

## State properties
mvad.alphabet <- c("employment", "FE", "HE", "joblessness", "school", "training")
mvad.lab <- c("employment", "further education", "higher education", "joblessness", "school", "training")
mvad.shortlab <- c("EM","FE","HE","JL","SC","TR")

## Creating the state sequence object
mvad.seq <- seqdef(mvad[, 17:86], alphabet = mvad.alphabet, states = mvad.shortlab, 
                   labels = mvad.lab, xtstep = 6)

Time for this code chunk to run: 0.21 seconds

Constructing the typology

We will now construct a typology using cluster analysis. For readers seeking more details, the WeightedCluster library manual provides an in-depth explanation of the process and the computation of cluster quality measures (Studer, 2013).

We start by computing dissimilarities with the seqdist function using the longest common subsequence distance. We then use Ward clustering to create a typology of the trajectories.

## Using fastcluster for hierarchical clustering
library(fastcluster)
## Distance computation
diss <- seqdist(mvad.seq, method="LCS")
## Hierarchical clustering
hc <- hclust(as.dist(diss), method="ward.D")

Time for this code chunk to run: 3.84 seconds

We can now compute several cluster quality indices with the as.clustrange function, for two to ten groups.

# Loading the WeightedCluster library
library(WeightedCluster)
# Computing cluster quality measures.
clustqual <- as.clustrange(hc, diss=diss, ncluster=10)
clustqual
##            PBC   HG HGSD  ASW ASWw     CH   R2   CHsq R2sq   HC
## cluster2  0.59 0.75 0.74 0.43 0.43 216.72 0.23 431.41 0.38 0.13
## cluster3  0.43 0.51 0.50 0.25 0.25 175.61 0.33 335.58 0.49 0.25
## cluster4  0.52 0.67 0.67 0.29 0.30 164.63 0.41 352.31 0.60 0.17
## cluster5  0.52 0.69 0.68 0.31 0.31 153.46 0.46 326.41 0.65 0.16
## cluster6  0.57 0.79 0.78 0.34 0.35 151.17 0.52 372.95 0.73 0.11
## cluster7  0.58 0.83 0.83 0.37 0.37 144.38 0.55 383.10 0.77 0.09
## cluster8  0.51 0.78 0.78 0.32 0.33 140.00 0.58 358.85 0.78 0.12
## cluster9  0.52 0.85 0.85 0.34 0.35 137.85 0.61 386.02 0.81 0.09
## cluster10 0.51 0.87 0.87 0.35 0.36 131.67 0.63 379.81 0.83 0.08

Time for this code chunk to run: 0.27 seconds

Different clustering solutions could be argued for based on the information above. We are interested in a relatively detailed partition of the data. Moreover, an analysis with the clustassoc function suggests that at least six groups are needed to sufficiently account for the association with the covariates of interest (see the WeightedCluster vignette Validating Sequence Analysis Typologies to be used in Subsequent Regression). We therefore retain a six-cluster solution for further illustration. The corresponding typology is shown next:

seqdplot(mvad.seq, group=clustqual$clustering$cluster6, border=NA)

Time for this code chunk to run: 0.49 seconds

Association study

We are now interested in the relationship between this typology and the funemp and gcse5eq covariates, which represent the father’s unemployment status and the qualifications gained by the end of compulsory education, respectively (both are binary variables). Such associations can be studied with different approaches. Here, we focus on fitting a separate logistic regression for each cluster and estimating average marginal effects (AMEs) from these models. Readers are referred to Roth et al. (2024) for theoretical and practical justifications of these methodological choices.

Multiple commands would normally be required to explore all possible combinations of clusters and covariates; a hand-rolled version for a single cluster is sketched below. The rarcat function then does all of this in a single call.
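For a single cluster, the corresponding manual computation could look roughly as follows. This is only an illustrative sketch: it uses base glm() and approximates the AME of funemp by a contrast of average predicted probabilities, which is not necessarily how rarcat computes it internally.

## Sketch: logistic regression and AME of funemp for the first cluster only
in.cl1 <- clustqual$clustering$cluster6 == 1
fit1 <- glm(in.cl1 ~ funemp + gcse5eq, data = mvad, family = binomial)
## Predicted membership probabilities with funemp set to "yes" and to "no"
d.yes <- d.no <- mvad
d.yes$funemp <- factor("yes", levels = levels(mvad$funemp))
d.no$funemp  <- factor("no",  levels = levels(mvad$funemp))
## Average difference in predicted probabilities: a simple AME estimate
mean(predict(fit1, newdata = d.yes, type = "response") -
     predict(fit1, newdata = d.no,  type = "response"))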

# Add the clustering solution to the dataset
mvad$clustering <- clustqual$clustering$cluster6
# The first argument is a formula for the association between clustering and covariates of interest
# The function estimates a separate logistic regression for each cluster and computes the corresponding AMEs
# with their confidence intervals
rarcatout <- rarcat(clustering ~ funemp + gcse5eq, data=mvad, diss=diss, robust=FALSE)
rarcatout
## 
##  NAIVE ESTIMATES
##  Average Marginal Effect (AME) to be in a given cluster instead of any others
##            Cluster 1        Cluster 2         Cluster 3        
## funempyes  -0.0701          -0.0192           0.0111           
##            (-0.150;  0.010) (-0.0749; 0.0364) (-0.0722; 0.0944)
## gcse5eqyes -0.1811           0.1592           0.0140           
##            (-0.244; -0.118) ( 0.1093; 0.2092) (-0.0496; 0.0776)
##            Cluster 4         Cluster 5        Cluster 6         
## funempyes   0.0563           -0.0489           0.0512           
##            (-0.0274;  0.140) (-0.110; 0.0124) (-0.0030;  0.1055)
## gcse5eqyes -0.2020            0.2543          -0.0506           
##            (-0.2597; -0.144) ( 0.197; 0.3116) (-0.0815; -0.0196)
## 
##  Robust estimates not estimated

Time for this code chunk to run: 1.25 seconds

The results in the table above are AMEs, which measure the expected change in the probability of belonging to a given cluster when the covariate takes a given value, together with their 95% confidence intervals. Thus, after adjustment for the GCSEs, the father’s unemployment status is only marginally associated with membership in cluster 6, which contains the children’s unemployment trajectories (the p-value is actually 0.06).

On the other hand, there are clear associations between the GCSEs obtained and the trajectory groups, with a distinctly higher probability of entering higher education (clusters 2 and 5) compared to being employed, in training or, to a lesser degree, unemployed (clusters 1, 4 and 6) if five or more GCSEs were gained. However, the standard analysis presented up to this point does not properly account for the sampling uncertainty, so the reliability of these findings remains uncertain.

Robustness assessment

The Robustness Assessment of Regressions using Cluster Analysis Typologies (RARCAT) procedure allows for evaluating the reproducibility of the analysis on resamples from the data. We focus here on a quick implementation of the procedure and refer to the extended tutorial below or the article by Roth et al. (2024) for further details. The rarcat function is run again below with further arguments controlling the bootstrap procedure.

Parallel processing can be enabled with the parallel=TRUE argument. However, it is generally recommended to set up the parallel environment yourself, which ensures that it is initialized only once and that it matches your computing environment. This is achieved with the plan() function of the future framework (see its documentation for more detail), run before calling rarcat(), as in the code below. If you set up a plan() yourself, rarcat automatically uses parallel computations and you should not use the parallel=TRUE argument.

# Loading the future library
library(future)
plan(multisession)

Time for this code chunk to run: 0.71 seconds

Now we can run it using parallel processing.

# Evaluate the validity of the original analysis and the reliability of its
# findings by applying RARCAT with 100 bootstrap replications.
# As in the original analysis, hierarchical clustering with Ward method is implemented.
# The number of clusters is fixed to 6 here.
rarcatout <- rarcat(clustering ~ funemp + gcse5eq, data = mvad, diss = diss, robust = TRUE, R = 100, 
                    kmedoid = FALSE, hclust.method = "ward.D", 
                    fixed = TRUE, ncluster = 6)
rarcatout
## 
##  NAIVE ESTIMATES
##  Average Marginal Effect (AME) to be in a given cluster instead of any others
##            Cluster 1        Cluster 2         Cluster 3        
## funempyes  -0.0701          -0.0192           0.0111           
##            (-0.150;  0.010) (-0.0749; 0.0364) (-0.0722; 0.0944)
## gcse5eqyes -0.1811           0.1592           0.0140           
##            (-0.244; -0.118) ( 0.1093; 0.2092) (-0.0496; 0.0776)
##            Cluster 4         Cluster 5        Cluster 6         
## funempyes   0.0563           -0.0489           0.0512           
##            (-0.0274;  0.140) (-0.110; 0.0124) (-0.0030;  0.1055)
## gcse5eqyes -0.2020            0.2543          -0.0506           
##            (-0.2597; -0.144) ( 0.197; 0.3116) (-0.0815; -0.0196)
## 
##  ROBUST ESTIMATES (RARCAT)
##  Average Marginal Effect (AME) to be in a given cluster instead of any others
## 
##  Number of bootstraps:  100  with  fixed  number of clusters
##            Cluster 1         Cluster 2         Cluster 3        
## funempyes  -0.0442           -0.0162           -0.00131         
##            (-0.107;  0.0189) (-0.0755; 0.0432) (-0.0721; 0.0695)
## gcse5eqyes -0.1813            0.1721            0.02948         
##            (-0.274; -0.0888) ( 0.0963; 0.2478) (-0.0482; 0.1071)
##            Cluster 4          Cluster 5        Cluster 6           
## funempyes   0.016             -0.0486           0.0838             
##            (-0.0544;  0.0864) (-0.110; 0.0126) ( 0.00556;  0.16210)
## gcse5eqyes -0.159              0.2509          -0.0894             
##            (-0.2331; -0.0858) ( 0.172; 0.3297) (-0.17013; -0.00873)

Time for this code chunk to run: 44.52 seconds

Now, a second table presents the estimates of the RARCAT analysis. The results stay fairly close to those of the original analysis (first table), in particular regarding the “significant” associations. The relationship between the father’s unemployment status and unemployment trajectories (cluster 6) is somewhat stronger than in the original sample, and the relationship between GCSEs and long training spells (cluster 4) is a bit weaker. The other relevant associations are virtually unchanged. We thus conclude that the original analysis appears to be valid and robust to sampling variation. However, fixing the number of clusters (six here) throughout the bootstrap procedure may not always be warranted.

Extended Tutorial

In the following subsections, we present several extensions and complementary analyses that help to make the most of the RARCAT results and better interpret them. Generally speaking, the Robustness Assessment of Regressions using Cluster Analysis Typologies (RARCAT) procedure works as follows (a minimal code sketch of these steps is given after the list):

  1. A random sample with replacement (i.e., a bootstrap sample) is drawn from the data.
  2. The bootstrap sample is clustered, applying the exact same clustering procedure as the one used in the original analysis, which implies using the same dissimilarity measure, cluster algorithm, and method to determine the number of clusters.
  3. A separate logistic regression predicting membership probability in each group is estimated.
  4. The AME of each covariate on the probability to be assigned to a given type is retrieved for all sequences belonging to this type.
  5. These steps are repeated N times, with N typically large.
  6. The AMEs from step 4 are pooled using a multilevel modelling framework.
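As a rough illustration of steps 1 to 4, the sketch below re-clusters a few bootstrap samples and recomputes a simplified per-cluster AME each time, reusing the objects created in the short tutorial (diss, mvad). This is not the actual rarcat implementation; in particular, the AME of funemp is again approximated by a contrast of average predicted probabilities.

## Illustrative sketch of steps 1 to 4 (not the rarcat implementation)
n <- nrow(mvad)
R <- 5                              # deliberately small, for illustration only
boot.ame <- vector("list", R)
for (r in seq_len(R)) {
  ## Step 1: draw a bootstrap sample
  idx <- sample(n, replace = TRUE)
  ## Step 2: apply the same clustering procedure to the bootstrap sample
  hc.b <- fastcluster::hclust(as.dist(diss[idx, idx]), method = "ward.D")
  cl.b <- cutree(hc.b, k = 6)
  dat.b <- mvad[idx, ]
  ## Steps 3 and 4: one logistic regression per cluster, simplified AME of funemp
  boot.ame[[r]] <- sapply(sort(unique(cl.b)), function(k) {
    fit <- glm(I(cl.b == k) ~ funemp + gcse5eq, data = dat.b, family = binomial)
    d.yes <- d.no <- dat.b
    d.yes$funemp <- factor("yes", levels = levels(mvad$funemp))
    d.no$funemp  <- factor("no",  levels = levels(mvad$funemp))
    mean(predict(fit, d.yes, type = "response") -
         predict(fit, d.no,  type = "response"))
  })
}
## Steps 5 and 6 (repeating with a large N and pooling the AMEs in a multilevel
## model) are what rarcat() performs for you.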

In this section, we further detail some of these steps. First, we consider the estimation of the number of clusters (instead of using a fixed number). Second, we analyse the AMEs estimated in each bootstrap and their distribution by cluster, which provides relevant information on which clusters lead to stable or unstable results. We then discuss several diagnostics and checks to ensure that a sufficient number of bootstraps has been used. We finally discuss how to investigate the stability of the clustering itself and its potential impact on the regression results.

Estimating the Number of Clusters

Usually, the number of groups of a typology is not known in advance, but rather estimated from the data, for instance by maximizing (or minimizing) a given cluster quality index (CQI). The RARCAT procedure can account for the estimation of the number of groups through CQI optimization. This is achieved by specifying fixed=FALSE, the maximum number of groups to consider (here ncluster=10), as well as the target CQI (here Hubert’s C index, HC). To limit computation time, we only use R=50 here, but this should be increased in any real application.

# Bootstrap replicates of the typology and its association with the variable funemp.
# As in the original analysis, hierarchical clustering with the Ward method is implemented.
# In addition, an optimal clustering solution with between 2 and 10 groups is selected each
# time by optimizing the HC index (for which lower values are better).

# As we previously initialized parallel computing, it is used in these computations.
vary.rarcat <- rarcat(clustering ~ funemp, data = mvad, diss = diss, R = 50,
                       kmedoid=FALSE, hclust.method = "ward.D",
                       fixed = FALSE, ncluster = 10, cqi = "HC")
vary.rarcat
## 
##  NAIVE ESTIMATES
##  Average Marginal Effect (AME) to be in a given cluster instead of any others
##           Cluster 1       Cluster 2          Cluster 3        Cluster 4       
## funempyes -0.0447         -0.0412            0.00878          0.0927          
##           (-0.13; 0.0407) (-0.0874; 0.00509) (-0.0734; 0.091) (0.00205; 0.183)
##           Cluster 5         Cluster 6       
## funempyes -0.0797           0.0641          
##           (-0.131; -0.0285) (0.00462; 0.123)
## 
##  ROBUST ESTIMATES (RARCAT)
##  Average Marginal Effect (AME) to be in a given cluster instead of any others
## 
##  Number of bootstraps:  50  with  varying  number of clusters
## Distribution of the number of clusters across bootstraps
## 
##  4  5  6  7  8  9 10 
##  3  2  1  3  2 15 24 
## 
##           Cluster 1         Cluster 2         Cluster 3        
## funempyes -0.0229           -0.033            -0.0113          
##           (-0.0862; 0.0403) (-0.0943; 0.0283) (-0.0413; 0.0186)
##           Cluster 4        Cluster 5        Cluster 6       
## funempyes 0.0205           -0.08            0.0871          
##           (-0.017; 0.0581) (-0.135; -0.025) (-0.0086; 0.183)

Time for this code chunk to run: 10.91 seconds

The output now prints information about the distribution of the optimal number of groups per bootstrap, i.e. the number identified by optimizing the HC index. Here, most bootstrap partitions have nine or ten clusters. The RARCAT procedure takes the varying number of groups across bootstraps into account when computing the robust AMEs. Logically, the “naive” approach provides the same results as before, but the robust ones change because we modified the bootstrap procedure.

Plotting the AME

RARCAT works by estimating, for each observation, the AME of its cluster in each bootstrap. We can plot these AMEs (across individuals and bootstraps) using the plot() function and specifying the covariate of interest with the covar argument.

# Histogram of the estimated AMEs for all individuals and all bootstraps
plot(rarcatout, covar="funempyes")

Time for this code chunk to run: 0.21 seconds

The plot also shows the estimated overall AME using the naive approach (in blue) and the RARCAT approach (in red). The corresponding confidence intervals are represented by coloured shaded areas.

For some clusters, we observe a unimodal distribution of the AME, usually with low dispersion. This indicates that the results for this specific cluster are stable: the observations initially grouped in this cluster were assigned, across bootstraps, to clusters with similar AME values. We can therefore be confident about our interpretation (whether a positive, negative or neutral association).

On the contrary, a large dispersion of the AME means that the bootstraps lead to the estimation of different values, indicating less stable results. We can then be less confident in the interpretation, and the RARCAT confidence interval will generally be wider.

Multimodal distributions of the AME generally reveal an interesting situation. They often mean that a subgroup of observations switches to a different cluster, with a different AME, in several of the bootstraps. This can happen, for instance, in the presence of observations lying in between types. We can identify those observations by looking at the outliers (see below).

Pooling Effect Sizes

The final RARCAT estimates are pooled using a multilevel model. This model yields the pooled estimate and its expected variation across bootstraps. It also allows us to identify outliers.
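Conceptually, this pooling can be thought of as fitting a cross-classified random-intercept model to the stacked per-observation, per-bootstrap AMEs. The sketch below illustrates the idea with lme4; the data frame ame.long (with columns ame, boot and obs) is hypothetical, and rarcat’s internal implementation may differ.

## Conceptual sketch of the pooling step; ame.long is a hypothetical data frame
## with one row per observation and bootstrap (columns ame, boot and obs)
library(lme4)
pool <- lmer(ame ~ 1 + (1 | boot) + (1 | obs), data = ame.long)
fixef(pool)        # pooled AME estimate
ranef(pool)$obs    # observation-level deviations, used below to flag outliers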

Diagnostics

To pool the results, we rely on a cross-classified multilevel model, with bootstraps and individuals as upper levels. The standard errors reported below provide information on the remaining estimation uncertainty related to the bootstrapping procedure; they should therefore be low if a sufficiently large number of iterations has been used. The summary() function provides several pieces of information.

summary(rarcatout)
## 
## RARCAT Diagnostics
## Distribution of standardized observation random intercept
## 
##  funempyes 
## 
##     < -2 -2....-1   -1...1    1...2       >2 
##        2       54      567       69       20 
## 
## 
##  gcse5eqyes 
## 
##     < -2 -2....-1   -1...1    1...2       >2 
##       12       53      583       17       47 
## 
## 
## Standard errors
##                      1           2           3           4           5
## funempyes  0.003576662 0.002978417 0.003737584 0.003983932 0.003075888
## gcse5eqyes 0.008404634 0.003879203 0.005412312 0.004271173 0.004193584
##                      6
## funempyes  0.004280292
## gcse5eqyes 0.004535972

Time for this code chunk to run: 0.02 seconds

In our example, the standard errors are low; the number of bootstraps was therefore sufficient. If these values are too high, increasing the number of bootstraps should fix it.

Outliers

The previous output also shows the number of outliers for each covariate. These observations lie far from the average estimate in at least some of the bootstraps. Interestingly, outliers are generally concentrated in only some of the clusters. We can look at the distribution of the standardized observation-level random effects per cluster. These random effects can be interpreted as estimated deviations from the pooled estimates. Therefore, we can use them to identify observations (i.e. sequences) that are not well represented by their pooled estimates.

plot(rarcatout, what="ranef", covar="funempyes")

Time for this code chunk to run: 0.12 seconds

We can also identify outliers directly by looking at the observation-level standardized random effects, which are available in the observation.stdranef component of the rarcat object. Absolute values above 2 or even 3 are generally used to identify outliers. Here, we flag observations with an absolute value above 2 and plot the identified sequences in each of the clusters.

## Flag observations whose standardized random effect exceeds 2 in absolute value
outliers <- abs(rarcatout$observation.stdranef[, "funempyes"])>2
## Plot the flagged sequences, grouped by their original cluster
seqIplot(mvad.seq[outliers, ], group=mvad$clustering[outliers])

Time for this code chunk to run: 0.28 seconds

Interestingly, there are no deviating sequences in some of the clusters.

Lastly, the individual-specific random effects in the rarcat output inform us about within-cluster homogeneity. Indeed, individuals with fitted values that diverge from the other individuals in the same cluster were often assigned to different clusters in the bootstraps, thus impacting their estimated AMEs. We can see this by looking at the distribution of these random effects.
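One way to inspect this distribution by cluster is a simple boxplot of the standardized random effects shown above against the original cluster membership (a sketch using base graphics):

## Distribution of the observation-level standardized random effects by cluster
boxplot(rarcatout$observation.stdranef[, "funempyes"] ~ mvad$clustering,
        xlab = "Cluster", ylab = "Standardized random effect (funempyes)")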

Bootstrap replicates of the typology

There are several reasons that might lead to non-robust results. Being able to distinguish them allows for a better understanding of RARCAT results.

One of them relates to the stability of the clustering itself: the clustering (i.e. the estimation of the typology) might be sample dependent, in which case inference and interpretation should be made with extreme caution. Hennig (2007) proposed to look at cluster stability across different bootstraps to estimate this sampling uncertainty. The analysis allows identifying which specific clusters are stable or not across bootstraps. The method is available in the fpc package and can be run as follows for hierarchical clustering, with B=500 bootstraps.

# Loading the fpc library
library(fpc)
# Cluster-wise stability assessment by bootstrap
stab <- clusterboot(diss, B = 500, distances = TRUE, clustermethod = disthclustCBI, 
                    method = "ward.D", k = 6, count = FALSE)
stab
## * Cluster stability assessment *
## Cluster method:  hclust 
## Full clustering results are given as parameter result
## of the clusterboot object, which also provides further statistics
## of the resampling results.
## Number of resampling runs:  500 
## 
## Number of clusters found in data:  6 
## 
##  Clusterwise Jaccard bootstrap (omitting multiple points) mean:
## [1] 0.5946359 0.8804019 0.7663908 0.3992446 0.9393705 0.6298396
## dissolved:
## [1]  70  31  37 449   0 190
## recovered:
## [1]  32 438 318  28 449 187

Time for this code chunk to run: 12.74 seconds

The results show the cluster-wise Jaccard bootstrap means. Values above 0.85 indicate high cluster-wise stability, meaning that most of the individuals belonging to such a cluster in the original partition tend to be clustered together again in the bootstrapped partitions. Values below 0.60 indicate unstable clusters, which are unlikely to be recovered in other samples of the data and should therefore be interpreted with caution.

In fact, the estimated Jaccard coefficient for the fourth cluster (long training spells) is only 0.40, compared with 0.94 for the fifth cluster and 0.63 for the sixth.
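The cluster-wise Jaccard means can also be extracted directly from the clusterboot object, for instance to flag highly stable clusters (this assumes the bootmean component documented in fpc):

## Clusters with a Jaccard bootstrap mean above 0.85 are considered highly stable
which(stab$bootmean > 0.85)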

References

Hennig, C. (2007). Cluster-wise assessment of cluster stability. Computational Statistics & Data Analysis, 52(1), 258-271.

Roth, L., Studer, M., Zuercher, E., & Peytremann-Bridevaux, I. (2024). Robustness assessment of regressions using cluster analysis typologies: a bootstrap procedure with application in state sequence analysis. BMC Medical Research Methodology, 24(1), 303. https://doi.org/10.1186/s12874-024-02435-8.

Studer, M. (2013). WeightedCluster Library Manual: A practical guide to creating typologies of trajectories in the social sciences with R. LIVES Working Papers, 24.