\input{head.tex} %\VignetteIndexEntry{Introduction to streamMOA} <>= options(width = 75, digits = 3, prompt = 'R> ') @ \section{Introduction} Please refer to the vignette in package~\pkg{stream} for an introduction to data stream mining in \proglang{R}. In this vignette we give two examples that show how to use the \pkg{stream} framework being used from start to finish. The examples encompasses the creation of data streams, preparation of data stream clustering algorithms, the online clustering of data points into micro-clusters, reclustering and finally evaluation. The first example shows how compare a set of data stream clustering algorithms on a static data set. The second example shows how to perform evaluation on a data stream with concept drift (clusters evolve over time). \section{Experimental Comparison on Static Data} \label{examples:full} First, we set up a static data set. We extract 1500 data points from the Bars and Gaussians data stream generator with 5\% noise and put them in a \code{DSD_Memory}. The wrapper is used to replay the same part of the data stream for each algorithm. We will use the first 1000 points to learn the clustering and the remaining 500 points for evaluation. <>= set.seed(1234) @ <>= library("streamMOA") @ <>= stream <- DSD_BarsAndGaussians(noise=0.05) %>% DSD_Memory(n = 5500) stream plot(stream) @ \begin{figure} \centering \includegraphics[width=.5\linewidth]{streamMOA-data_bng} \caption{Bar and Gaussians data set.} \label{figure:data_bng} \end{figure} Figure~\ref{figure:data_bng} shows the structure of the data set. It consists of four clusters, two Gaussians and two uniformly filled rectangular clusters. The Gaussian and the bar to the right have $1/3$ the density of the other two clusters. We initialize a k-means on a sample and multiple clusterers from \pkg{streamMOA}. We choose the parameters experimentally so that the algorithm produce each (approximately) 100 micro-clusters. <<>>= algorithms <- list( 'Sample + k-means' = DSC_TwoStage(micro = DSC_Sample(k = 100), macro = DSC_Kmeans(k = 4)), 'DenStream' = DSC_DenStream(epsilon = .5, mu = 1), 'cluStream' = DSC_CluStream(m = 100, k = 4), 'Bico' = DSC_BICO_MOA(Cluster = 4, Dimensions = 2, MaxClusterFeatures = 100) ) @ We store the algorithms in a list for easier handling and then cluster the same 1000 data points with each algorithm. Note that we have to reset the stream each time before we cluster. <<>>= for (a in algorithms) { reset_stream(stream) update(a, stream, 1000) } @ We use \code{nclusters()} to inspect the number of micro-clusters. <<>>= sapply(algorithms, nclusters, type = "micro") @ All algorithms except DenStream produce around 100 micro-clusters. We were not able to adjust DenStream to produce more than around 50 micro-clusters for this data set. To inspect micro-cluster placement, we plot the calculated micro-clusters and the original data. <>= op <- par(no.readonly = TRUE) layout(mat = matrix(1:4, ncol = 2)) for (a in algorithms) { reset_stream(stream) plot(a, stream, main = description(a), type = "micro") } par(op) @ \begin{figure} \centering \includegraphics{streamMOA-microclusters} \caption{Micro-cluster placement for different data stream clustering algorithms.} \label{figure:microclusters} \end{figure} Figure~\ref{figure:microclusters} shows the micro-cluster placement by the different algorithms. Micro-clusters are shown as red circles and the size is proportional to each cluster's weight. Reservoir sampling and the sliding window randomly place the micro-clusters and also a few noise points (shown as grey dots). Clustream and BICO also do not suppress noise and places even more micro-clusters on noise points since it tries to represent all data as faithfully as possible. DenStream suppresses noise and concentrate the micro-clusters on the real clusters. DenStream produces one heavy micro-cluster on one cluster, while using a large number of micro clusters for the others. It also has problems with detecting the rectangular low-density cluster. It is also interesting to compare the assignment areas for micro-clusters created by different algorithms. The assignment area is the area around the center of a micro-cluster in which points are considered to belong to the micro-cluster. In case that a point is in the assignment area of several micro-clusters, the closer center is chosen. To show the assignment area we add \code{assignment = TRUE} to plot. We also disable showing micro-cluster weights to make the plot clearer. <>= op <- par(no.readonly = TRUE) layout(mat = matrix(1:4, ncol = 2)) for (a in algorithms) { reset_stream(stream) plot( a, stream, main = description(a), assignment = TRUE, weight = FALSE, type = "micro" ) } par(op) @ \begin{figure} \centering \includegraphics{streamMOA-microclusters_assignment} \caption{Micro-cluster assignment areas for different data stream clustering algorithms.} \label{figure:microclusters_assignment} \end{figure} Figure~\ref{figure:microclusters_assignment} shows the assignment areas as dotted circles around micro-clusters. Not all algorithms provide assignment ares. <<>>= sapply( algorithms, FUN = function(a) { reset_stream(stream, 1001) evaluate_static( a, stream, measure = c("numMicroClusters", "purity", "SSQ", "silhouette"), n = 500, assignmentMethod = "auto", type = "micro" ) }) @ We need to be careful with the comparison of these numbers, since the depend heavily on the number of micro-clusters with more clusters leading to a better value. Therefore, a comparison with DenStream is not valid. We can compare the measures, of the other algorithms since the number of micro-clusters is close. Sampling produces very good values for purity, BICO achieves the highest average silhouette coefficient and CluStream produces the lowest sum of squares. For better results more data and cross-validation could be used. \section{Outlier detection} To support the outlier detection area, \pkg{streamMOA} contains a wrapper to the MOA implementation of the Micro-cluster Continuous Outlier Detector (MCOD). To demonstrate a synergy of outlier detection capabilities between \pkg{stream} and \pkg{streamMOA} packages, we bring two basic examples. First, we create a fixed data stream with noise outliers that are well separated from the clusters using Mahalanobis distance. <>= library(stream) set.seed(1000) stream <- DSD_Gaussians(k = 3, d = 2, variance_limit = c(0.1, 1), space_limit = c(0, 30), noise = .01, noise_limit = c(0, 30), noise_separation = 6, separation_type = "Mahalanobis" ) %>% DSD_Memory(n = 1000) @ <>= plot(stream, n = 1000) @ The generated stream can be seen in Figure~\ref{figure:outlier_points}. \begin{figure} \centering \includegraphics[width=.5\linewidth]{streamMOA-outlier2} \caption{Data points from \code{DSD\_Gaussians} having 3 clusters and 1\% is outliers (noise).} \label{figure:outlier_points} \end{figure} Then we define a \code{DSC_MCOD} clusterer. Since this is a single-pass clusterer \code{DSC_SinglePass}, we do not need to update the model first, we can immediately call evaluation. <>= reset_stream(stream) mic_c <- DSOutlier_MCOD(r = 2, w = 1000) evaluate_static( mic_c, stream, n = 1000, type = "micro", measure = c("crand", "outlierjaccard") ) @ <>= reset_stream(stream) plot(mic_c, stream, n = 1000) @ \begin{figure} \centering \includegraphics[width=.45\linewidth]{streamMOA-outlier4} \caption{MCOD outlier detection.} \label{figure:outlier_mcod1} \end{figure} In Figure~\ref{figure:outlier_mcod1} we can see micro-cluster and outlier assignments for the generated data stream. \input{foot.tex}