Title: | Cluster Sharpening |
Version: | 0.1.0.1 |
Author: | Tomasz Konopka [aut, cre] |
Maintainer: | Tomasz Konopka <tokonopka@gmail.com> |
Description: | Clustering typically assigns data points into discrete groups, but the clusters can sometimes be indistinct. Cluster sharpening adjusts an existing clustering to create contrast between groups. This package provides a general interface for cluster sharpening along with several implementations based on different excision criteria. |
Depends: | R (≥ 3.5.0) |
Imports: | methods, stats |
License: | MIT + file LICENSE |
URL: | https://github.com/tkonopka/ksharp |
BugReports: | https://github.com/tkonopka/ksharp/issues |
LazyData: | true |
Suggests: | cluster, dbscan, knitr, Rcssplot (≥ 1.0.0), rmarkdown, testthat |
VignetteBuilder: | knitr |
Encoding: | UTF-8 |
RoxygenNote: | 7.0.2 |
NeedsCompilation: | no |
Packaged: | 2020-01-18 15:56:31 UTC; tkonopka |
Repository: | CRAN |
Date/Publication: | 2020-01-26 10:10:02 UTC |
Toy dataset with two convex groups with partial overlap
Description
Toy dataset with two convex groups with partial overlap
Usage
data(kdata.1)
Format
matrix with two columns: D1
, D2
Toy dataset with two non-overalpping and non-spherical groups
Description
Toy dataset with two non-overalpping and non-spherical groups
Usage
data(kdata.2)
Format
matrix with two columns: D1
, D2
Toy dataset with three groups
Description
Toy dataset with three groups
Usage
data(kdata.3)
Format
matrix with two columns: D1
, D2
Toy dataset with four groups atop a wide area of noise points
Description
Toy dataset with four groups atop a wide area of noise points
Usage
data(kdata.4)
Format
matrix with two columns: D1
, D2
sharpen a clustering
Description
Each data point in a clustering is assigned to a cluster, but some data points may lie in ambiguous zones between two or more clusters, or far from other points. Cluster sharpening assigns these border points into a separate noise group, thereby creating more stark distinctions between groups.
Usage
ksharp(
x,
threshold = 0.1,
data = NULL,
method = c("silhouette", "neighbor", "medoid"),
threshold.abs = NULL
)
Arguments
x |
clustering object; several types of inputs are acceptable, including objects of class kmeans, pam, and self-made lists with a component "cluster". |
threshold |
numeric; the fraction of points to place in noise group |
data |
matrix, raw data corresponding to clustering x; must be present when sharpening for the first time or if data is not present within x. |
method |
character, determines method used for sharpening |
threshold.abs |
numeric; absolute-value of threshold for sharpening. When non-NULL, this value overrides value in argument 'threshold' |
Details
Noise points are assigned to a group with cluster index 0. This is analogous behavior to output produced by dbscan.
Value
clustering object based on input x, with adjusted cluster assignments and additional list components with sharpness measures. Cluster assignments are placed in $cluster and excised data points are given a cluster index of 0. Original cluster assignments are saved in $cluster.original. Sharpness measures are stored in components $silinfo, $medinfo, and $neiinfo, although these details may change in future versions of the package.
Examples
# prepare iris dataset for analysis
iris.data = iris[, 1:4]
rownames(iris.data) = paste0("iris_", seq_len(nrow(iris.data)))
# cluster the dataset into three groups
iris.clustered = kmeans(iris.data, centers=3)
table(iris.clustered$cluster)
# sharpen the clustering by excluding 10% of the data points
iris.sharp = ksharp(iris.clustered, threshold=0.1, data=iris.data)
table(iris.sharp$cluster)
# visualize cluster assignments
iris.pca = prcomp(iris.data)$x[,1:2]
plot(iris.pca, col=iris$Species, pch=ifelse(iris.sharp$cluster==0, 1, 19))
compute info on distances to medoids/centroids
Description
Analogous in structure to silinfo and neiinfo, it computes a "widths" matrix assessing how well each data point belongs to its cluster. Here, this measure is the ratio of two distances: in the numerator, the distance from the point to the nearest cluster center, and in the denominator, from the point to its own cluster center.
Usage
medinfo(cluster, data, silwidths)
Arguments
cluster |
named vector |
data |
matrix with raw data |
silwidths |
matrix with silhouette widths |
Value
list with component widths. The widths object is a matrix with one row per data item, with column med_ratio holding the sharpness measure.
Examples
# construct a manual clustering of the iris dataset
iris.data = iris[, 1:4]
rownames(iris.data) = paste0("iris_", seq_len(nrow(iris.data)))
iris.dist = dist(iris.data)
iris.clusters = setNames(as.integer(iris$Species), rownames(iris.data))
# compute sharpnessvalues based on medoids
iris.silinfo = silinfo(iris.clusters, iris.dist)
medinfo(iris.clusters, iris.data, iris.silinfo$widths)
Compute info on 'neighbor widths'
Description
This function provides information on how well each data point belongs to its cluster. For each query point, the function considers n of its nearest neighbors. The neighbor widths are defined as the fraction of those neighbors that belong to the same cluster as the query point. These values are termed 'widths' in analogy to silhouette widths, another measure of cluster membership.
Usage
neiinfo(cluster, dist)
Arguments
cluster |
vector with assignments of data elements to clusters |
dist |
distance object or matrix |
Details
The function follows a similar signature as silinfo from this package.
Value
list with component widths. The wdiths object is a matrix with one row per data item, wth column neighborhood holding the sharpness value.
Examples
# construct a manual clustering of the iris dataset
iris.data = iris[, 1:4]
rownames(iris.data) = paste0("iris_", seq_len(nrow(iris)))
iris.dist = dist(iris.data)
iris.clusters = setNames(as.integer(iris$Species), rownames(iris.data))
# compute neighbor-based sharpness widths
neiinfo(iris.clusters, iris.dist)
Compute info on silhouette widths
Description
This function provides information on how well each data point belongs to its cluster. For each query point, the function considers the average distance to other members of the same cluster and the average distance to members of another, nearest, cluster. The widths are defined as the
Usage
silinfo(cluster, dist)
Arguments
cluster |
vector with assignments of data elements to clusters |
dist |
distance object or matrix |
Details
The function signature is very similar to cluster::silhouette but the implementation has important differences. This implementation requires both the dist object and and cluster vector must have names. This prevents accidental assignment of silhouette widths to the wrong elements.
Value
list, analogous to object within output from cluster::pam. In particular, the list has a component widths. The widths object is matrix with one row per data item, with column sil_width holding the silhouette width.
Examples
# construct a manual clustering of the iris dataset
iris.data = iris[, 1:4]
rownames(iris.data) = paste0("iris_", seq_len(nrow(iris.data)))
iris.dist = dist(iris.data)
iris.clusters = setNames(as.integer(iris$Species), rownames(iris.data))
# compute sharpness values based on silhouette widths
silinfo(iris.clusters, iris.dist)