%\VignetteIndexEntry{Introducing coop: Fast Covariance, Correlation, and Cosine Operations}
\documentclass[]{article}
\input{./include/settings}

\mytitle{Introducing coop: Fast Covariance, Correlation, and Cosine Operations}
\mysubtitle{}
\myversion{0.6-2}
\myauthor{
\centering
Drew Schmidt \\
\texttt{wrathematics@gmail.com}
}

\begin{document}
\makefirstfew


\section{Introduction}\label{introduction}

The \textbf{coop} package~\cite{coop} does co-operations: covariance, correlation, and cosine similarity. And it does them in a quick, memory-efficient way. The package has separate, optimized routines for dense matrices and vectors, and currently there is a cosine similarity method for a sparse matrix (such as a term-document or document-term matrix) stored as a ``simple triplet matrix'' (aij/coo format). The use of each of these methods is seamless to the user by way of R's S3 methods.

A full description of the algorithms, computational complexity, and benchmarks is available in the vignette \emph{Algorithms and Benchmarks for the coop Package}~\cite{coop2016algos}.

When building/using this package, you will see the biggest performance improvements, in decreasing order of value, by using:

\begin{enumerate}
\item A good \textbf{BLAS} library.
\item A compiler supporting OpenMP~\cite{openmp} (preferably version 4 or better).
\end{enumerate}

See the section below for more details.


\subsection{Why Put this in a Package?}\label{why-put-this-in-a-package}

The way R computes covariance and Pearson correlation is unreasonably slow. Additionally, there is no function built into R to compute the all-pairwise cosine similarities of the columns of a matrix. As such, you will find people across the R community computing cosine similarity in a myriad of bizarre, often inefficient, ways. In fact, some of them are even incorrect! The operation \texttt{cosine()} provided by this package is simple, but it is important enough, particularly to fields like text mining, to have a good, high-performance implementation.


\subsection{Installation}\label{installation}

You can install the stable version from CRAN using the usual \texttt{install.packages()}:

\begin{lstlisting}[language=rr]
install.packages("coop")
\end{lstlisting}

The development version is maintained on GitHub. You can install this version using any of the well-known installer packages available to R:

\begin{lstlisting}[language=rr]
### Pick your preference
devtools::install_github("wrathematics/coop")
ghit::install_github("wrathematics/coop")
remotes::install_github("wrathematics/coop")
\end{lstlisting}


\subsection{A Note on the C Library}\label{a-note-on-the-c-library}

The ``workhorse'' C source code is separated from the R wrapper code, so it easily builds as a standalone shared library after removing the file \texttt{src/wrapper.c}. Additionally, like the rest of the package, the C shared library code is licensed under the permissive 2-clause BSD license.


\section{Choice of BLAS Library}\label{choice-of-blas-library}

This topic has been written about endlessly, so we will only briefly summarize it here. The \textbf{coop} package depends heavily on ``Basic Linear Algebra Subprograms'' (\textbf{BLAS})~\cite{lawson1979basic} operations. R ships the reference \textbf{BLAS}~\cite{lawson1979basic}, which is known to have poor performance.
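If you are not sure which \textbf{BLAS} your R installation is currently linked against, sufficiently recent versions of R (3.4.0 and later) record the \textbf{BLAS} and \textbf{LAPACK} shared libraries in use, and a quick check looks something like this:

\begin{lstlisting}[language=rr]
### Report the BLAS/LAPACK shared libraries R is linked against.
### Requires R >= 3.4.0; older versions do not record this information.
si <- sessionInfo()
si$BLAS
si$LAPACK
\end{lstlisting}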
Modern re-implementations of the \textbf{BLAS} have an identical API and should generally produce very similar output from the same input, but they are engineered for very high performance. These high-performance versions take advantage of things like vector instructions (``vectorization'') and multiple processor cores. Among the well-supported, high-performance \textbf{BLAS} libraries in use today are Intel \textbf{MKL}~\cite{mkl} and AMD \textbf{ACML}~\cite{acml}, both of which are proprietary and can, depending on circumstances, require paying for a license to use. There are also good open source implementations, such as \textbf{OpenBLAS}~\cite{OpenBLAS}, which is perhaps the best of the free options. Another free \textbf{BLAS} library which will outperform the reference \textbf{BLAS} is \textbf{Atlas}~\cite{atlas}, although if \textbf{OpenBLAS} is available, there is generally no good reason to choose \textbf{Atlas} over it. In addition to merely being faster on a single core, all of the above-named \textbf{BLAS} libraries except \textbf{Atlas} are multi-threaded.

If you're on Linux, you can generally use \textbf{OpenBLAS} with R without too much trouble. For example, on Ubuntu you can simply run:

\begin{lstlisting}[language=rr]
sudo apt-get install libopenblas-dev
sudo update-alternatives --config libblas.so.3
\end{lstlisting}

Users on other platforms like Windows or Mac (which I know considerably less about) might consider using Revolution R Open, which ships with Intel MKL.


\section{The Operations}\label{the-operations}

Covariance and correlation should largely need no introduction. Cosine similarity arises commonly in, for example, natural language processing, where the cosine similarities of all pairs of columns of a term-document or document-term matrix are needed.


\subsection{Covariance}\label{covariance}

\subsubsection{Definition}\label{definition}

The covariance matrix of an \(m\times n\) matrix \(A\), whose rows are observations and whose columns are variables, is given by:

\[ covar(A) = \frac{1}{m-1}\sum_{i=1}^m\left(x_i-\mu_x\right)^T\left(x_i-\mu_x\right) \]

where \(x_i\) is row \(i\) of \(A\) (treated as a row vector), and \(\mu_x\) is a vector of the column means of \(A\).

\subsubsection{Computing Covariances}\label{computing-covariances}

Although covariance is often stated as above, it is generally not computed in this way. For example, in R, which uses column-major matrix storage, operating by rows will lead to very poor performance due to unnecessary cache misses. Instead, we will want to first sweep the column means from each corresponding column and then compute a crossproduct. In R, this necessarily requires a copy of the data (for the sweep operation).

In R, we can use the \texttt{cov()} function. It does not use the \textbf{BLAS} for this computation, and is likely otherwise un-optimized. However, we note that it has multiple ways of handling \texttt{NA} values. Implementing the computation in a more runtime-efficient way in R is simple:

\begin{lstlisting}[language=rr]
cov2 <- function(x)
{
  1/(NROW(x)-1) * crossprod(scale(x, TRUE, FALSE))
}
\end{lstlisting}

And indeed, this is a simplified R version of how the computation is performed in \textbf{coop}'s \texttt{covar()}, the latter being written in C.
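Before turning to performance, it is worth quickly verifying that the two functions agree. On a small random matrix, a check along these lines should report \texttt{TRUE} (up to the usual floating point tolerance):

\begin{lstlisting}[language=rr]
### Sanity check: cov2() should match R's cov() up to floating
### point roundoff (all.equal() compares with a small tolerance).
set.seed(1234)
x <- matrix(rnorm(20*4), 20, 4)

all.equal(cov(x), cov2(x))
\end{lstlisting}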
For very small matrices, the \texttt{cov()} version is likely to dominate \texttt{cov2()}. But for even modest-sized data (and a good \textbf{BLAS}), \texttt{cov2()} will win out:

\begin{lstlisting}[language=rr]
m <- 500
n <- 100
x <- matrix(rnorm(m*n), m, n)

rbenchmark::benchmark(cov(x), cov2(x), columns=c("test", "replications", "elapsed", "relative"))
##      test replications elapsed relative
## 2 cov2(x)          100   0.246    1.000
## 1  cov(x)          100   0.435    1.768
\end{lstlisting}


\subsection{Correlation}\label{correlation}

We restrict our discussion solely to Pearson correlation.

\subsubsection{Definition}\label{definition-1}

This is really just a minor modification of covariance. The correlation matrix of an \(m\times n\) matrix \(A\) is given by:

\[ pcor(A) = \frac{1}{m-1}\sum_{i=1}^m \left(\frac{x_i-\mu_x}{\sigma_x}\right)^T \left(\frac{x_i-\mu_x}{\sigma_x}\right) \]

where \(x_i\) and \(\mu_x\) are as before, \(\sigma_x\) is a vector of the column standard deviations of \(A\), and the division by \(\sigma_x\) is performed elementwise.

\subsubsection{Computing Correlations}\label{computing-correlations}

As easily guessed, we need only make a trivial modification of the above covariance function to get correlation:

\begin{lstlisting}[language=rr]
cor2 <- function(x)
{
  1/(NROW(x)-1) * crossprod(scale(x, TRUE, TRUE))
}
\end{lstlisting}


\subsection{Cosine Similarity}\label{cosine-similarity}

\subsubsection{Definition}\label{definition-2}

Given two vectors \(x\) and \(y\), each of length \(m\), we can define the \emph{cosine similarity} of the two vectors as

\[ cosim(x,y) = \frac{x\cdot y}{\|x\| \|y\|} \]

This is the cosine of the angle between the two vectors. It is very similar to Pearson correlation; in fact, if the vectors \(x\) and \(y\) have their means removed first, the two are identical:

\[ \rho(x,y) = \frac{(x - \bar{x}) \cdot (y - \bar{y})}{\|x - \bar{x}\| \|y - \bar{y}\|} \]

Given an \(m\times n\) matrix \(A\), we can define the \emph{cosine similarity} by extending the above definition to

\[ cosim(A) = \left[ cosim(A_{*,i}, A_{*,j}) \right]_{n\times n} \]

where \(A_{*,i}\) denotes the \(i\)'th column vector of \(A\). In words, this is the matrix of all possible pairwise cosine similarities of the columns of the matrix \(A\).

\subsubsection{Computing Cosines}\label{computing-cosines}

We begin with a naive implementation. Proceeding directly from the definition, we can easily compute the cosine of two vectors as follows:

\begin{lstlisting}[language=rr]
cosine <- function(x, y)
{
  cp <- t(x) %*% y
  normx <- sqrt(t(x) %*% x)
  normy <- sqrt(t(y) %*% y)

  cp / normx / normy
}
\end{lstlisting}

This might lead us to generalize to a matrix by defining the function to operate recursively or iteratively on the above construction. In fact, this is precisely how the \textbf{lsa} package~\cite{lsa} computes cosine similarity. And while this captures the mathematical spirit of the operation, it ignores how computers actually operate on data in some critical ways. Notably, this will only use the so-called ``Level 1'' \textbf{BLAS}. \textbf{BLAS} operations are enumerated 1 through 3, for vector-vector, matrix-vector, and matrix-matrix operations, respectively. The higher level operations are much more cache efficient, and so can achieve significant performance improvements over the lower level \textbf{BLAS} operations. Since we wish to perform a full crossproduct (what numerical analysts call a ``rank-k update''), we will achieve much better performance by using the level 3 \textbf{BLAS}.
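To make the contrast concrete, a naive generalization in that spirit might simply loop over all pairs of columns, calling the two-vector \texttt{cosine()} defined above. The following is only an illustrative sketch of that approach (it is not the \textbf{lsa} package's actual code):

\begin{lstlisting}[language=rr]
### Illustrative sketch only: a naive pairwise generalization that loops
### over all column pairs, reusing the two-vector cosine() from above.
cosine_naive <- function(x)
{
  n <- NCOL(x)
  cos <- matrix(0, n, n)

  for (j in 1:n)
  {
    for (i in j:n)
    {
      cos[i, j] <- cosine(x[, i], x[, j])
      cos[j, i] <- cos[i, j]
    }
  }

  cos
}
\end{lstlisting}

Each pass through the inner loop touches only two columns at a time, which is exactly the vector-by-vector access pattern described above.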
However, we can massively improve the runtime performance by being slightly more careful about which R functions we use. Consider for example:

\begin{lstlisting}[language=rr]
cosine <- function(x)
{
  cp <- crossprod(x)
  rtdg <- sqrt(diag(cp))
  cos <- cp / tcrossprod(rtdg)

  return(cos)
}
\end{lstlisting}

The main reason this will \emph{significantly} outperform the naive implementation is that it makes very efficient use of the \textbf{BLAS}. Additionally, \texttt{crossprod()} and \texttt{tcrossprod()} are optimized not only to avoid the unnecessary copy made when the transpose is explicitly formed with \texttt{t()}, but also to compute only one triangle of the (symmetric) result and copy it over to the other, essentially doing only half the work. These operations alone allow for a significant improvement in performance:

\begin{lstlisting}[language=rr]
n <- 250
x <- matrix(rnorm(n*n), n, n)

mb <- microbenchmark::microbenchmark(t(x) %*% x, crossprod(x), times=20)
boxplot(mb)
\end{lstlisting}

One can easily verify numerically that these are identical computations. However, this implementation is not memory efficient, as it requires the allocation of additional storage for:

\begin{enumerate}
\item \(n\times n\) elements for the \texttt{crossprod()}
\item \(n\) elements for the \texttt{diag()}
\item Another \(n\) elements just for the \texttt{sqrt()}
\item \(n\times n\) elements for the \texttt{tcrossprod()}
\end{enumerate}

The final output storage is the result of the division of \texttt{cp} by the result of \texttt{tcrossprod()}. The total number of \emph{intermediary} elements allocated is therefore \(2n(n+1)\). Assuming standard 8-byte double precision values, for an input matrix with 1000 columns you need roughly 15~MiB of additional memory, and for one with 10000 columns you need roughly 1.5~GiB of additional storage. For a smaller matrix, this may be satisfactory.

By comparison, the implementation in \textbf{coop} needs no additional storage for computing cosines (to be more precise, the storage is \(O(1)\)). The implementation for two dense vector inputs is dominated by the product \texttt{t(x) \%*\% y}, performed by the \textbf{BLAS} subroutine \texttt{dgemm}, and the normalizing products \texttt{t(x) \%*\% x} and \texttt{t(y) \%*\% y}, each computed via the \textbf{BLAS} function \texttt{dsyrk}. For more details, see the vignette \emph{Algorithms and Benchmarks for the coop Package}.


\addcontentsline{toc}{section}{References}
\bibliography{./include/coop}
\bibliographystyle{plain}

\end{document}