\documentclass[nojss]{jss}
\usepackage{Sweave}
\author{Tomasz G\'orecki\\Adam Mickiewicz University \And \L ukasz Smaga\\Adam Mickiewicz University}
\title{\pkg{fdANOVA}: An \proglang{R}~Software Package for\\ Analysis of Variance for Univariate and Multivariate Functional Data}
\Plainauthor{Tomasz G\'orecki, \L ukasz Smaga}
\Plaintitle{fdANOVA: An R Software Package for\\ Analysis of Variance for Univariate and Multivariate Functional Data}
\Shorttitle{\pkg{fdANOVA}: Analysis of Variance for Functional Data in \proglang{R}}
\Abstract{ Functional data, i.e., observations represented by curves or functions, frequently arise in various fields. The theory and practice of statistical methods for such data are referred to as functional data analysis (FDA), which is one of the major research fields in statistics. The practical use of FDA methods is made possible by the availability of specialized and usually free software. In particular, a number of \proglang{R}~packages are devoted to these methods. In this vignette, we introduce the new \proglang{R}~package \pkg{fdANOVA}, which provides access to a broad range of global analysis of variance methods for univariate and multivariate functional data. The implemented testing procedures, mainly for the homoscedastic case, are briefly reviewed and illustrated by examples on a well-known functional data set. To reduce the computation time, a parallel implementation is provided. }
\Keywords{analysis of variance, functional data, \pkg{fdANOVA}, \proglang{R}}
\Plainkeywords{analysis of variance, functional data, fdANOVA, R} %% without formatting
\Address{ Tomasz G\'orecki, \L ukasz Smaga\\ Faculty of Mathematics and Computer Science\\ Adam Mickiewicz University\\ Umultowska 87, 61-614 Pozna\'n, Poland\\ E-mail: \email{tomasz.gorecki@amu.edu.pl, ls@amu.edu.pl}}
%% for those who use Sweave please include the following line (with % symbols):
%% need no \usepackage{Sweave.sty}
\usepackage{amsmath}
\usepackage{booktabs}
\newcommand{\vect}[1]{\mbox{\boldmath${#1}$}}
\begin{document}
\SweaveOpts{concordance=TRUE}
\SweaveOpts{engine=R,eps=FALSE}
%\VignetteIndexEntry{fdANOVA: An R Software Package for Analysis of Variance for Univariate and Multivariate Functional Data}
%\VignetteDepends{}
%\VignetteKeywords{analysis of variance, functional data, fdANOVA, R}
%\VignettePackage{fdANOVA}
<<echo=FALSE>>=
options(prompt = "R> ", continue = "+ ", width = 70, useFancyQuotes = FALSE)
@
\section[Introduction]{Introduction} \label{sec1}
In recent years, considerable attention has been devoted to the analysis of so-called functional data. Functional data are represented by functions or curves, which are observations of a random variable (or random variables) taken over a continuous interval or over a fine discretization of it. Sets of functional observations are particular examples of high-dimensional data, where the number of variables significantly exceeds the number of observations. Such data are often gathered automatically due to advances in modern technology, including computing environments. Functional data are naturally collected in agricultural sciences, behavioral sciences, chemometrics, economics, medicine, meteorology, spectroscopy, and many other fields. The main purpose of functional data analysis (FDA) is to provide tools for statistically describing and modelling sets of functions or curves. The monographs by \cite{RamsaySilverman2005}, \cite{FerratyVieu2006}, \cite{HorvathKokoszka2012} and \cite{Zhang2013} present a broad perspective on FDA solutions.
The following problems for functional data are commonly studied \citep[see also the review papers of][]{Cuevas2014,Wangetal2016}: analysis of variance \citep{Faraway1997, CFF2004, ShenFaraway2004, ZhangChen2007, CAFB2010, Zhang2011, Zhang2013, ZhangLiang2014, GoreckiSmaga2015, Zhangetal2013}, canonical correlation analysis \citep{KW2013, KeserKocakoc2015}, classification \citep{JamesHastie2001, DelaigleHall2012, DelaigleHall2013, Changetal2014}, cluster analysis \citep{Giacofcietal2013, Coffeyetal2014, YamamotoTerada2014}, outlier detection \citep{FBetal2008}, principal component analysis \citep{Boenteetal2014, Fremdtetal2014}, regression analysis \citep{Chenetal2011, Hilgertetal2013, Morris2015, Robbianoetal2015, Pengetal2016}, repeated measures analysis \citep{MCC2011,Smaga2017}, simultaneous confidence bands \citep{Degras2011, Caoetal2012, Maetal2012}. The above references concern univariate functional data. However, some methods have their multivariate counterparts, e.g., analysis of variance \citep{GoreckiSmaga2017}, canonical correlation analysis \citep{Goreckietal2016a}, classification \citep{Goreckietal2015, Goreckietal2016a,Goreckietal2016b}, cluster analysis \citep{Tokushigeetal2007, JacquesPreda2014,ParkAhn2016}, linear regression and prediction \citep{Chiouetal2016,KrzyskoSmaga2017}, principal component analysis \citep{Berrenderoetal2011, Chiouetal2014}, statistical modelling \citep{ChiouMuller2014}. Some examples of applications of functional data analysis can be found in \cite{Ogdenetal2002}, \cite{Pfeifferetal2002}, \cite{Rossietal2002}, \cite{JankShmueli2006}, \cite{FBetal2008}, \cite{Bobelynetal2010}, \cite{TarrioSaavedraetal2011} and \cite{Longetal2012}.
Many methods for functional data analysis have already been implemented in the \proglang{R}~software \citep{RCoreTeam2017}. The packages \pkg{fda} \citep{Ramsayetal2009,Ramsayetal2014} and \pkg{fda.usc} \citep{FBOF2012} are the biggest and probably the most commonly used ones. The first package includes techniques for functional data in the Hilbert space $L_2(I)$ of square integrable functions over an interval $I=[a,b]$. On the other hand, in the second one, many of the implemented methods do not need such an assumption and use only the values of the functions evaluated on a grid of (possibly non-equally spaced) discretization points. There is also a broad range of \proglang{R}~packages containing solutions for more specific functional data problems. The widely used principal component analysis for functional data, implemented in the packages \pkg{fda} and \pkg{fda.usc}, is also available in the packages \pkg{fdapace} \citep{Daietal2016}, \pkg{fpca} \citep{PengPaul2011}, \pkg{fdasrvf} \citep{fdasrvf2017} and \pkg{MFPCA} \citep{MFPCA2017, HappGreven2016}. For classification and cluster analysis, the packages \pkg{classiFunc} \citep{classiFunc2017}, \pkg{fdakma} \citep{Parodietal2015}, \pkg{Funclustering} \citep{Soueidatt2014}, \pkg{funcy} \citep{Yassouridis2016} and \pkg{GPFDA} \citep{GPFDA2014} can be used. In the packages \pkg{fdaPDE} \citep{Lilaetal2016}, \pkg{FDboost} \citep{Brockhausetal2015, BrockhausRugamer2016}, \pkg{flars} \citep{flars2016}, \pkg{funreg} \citep{DziakShiyko2016} and \pkg{refund} \citep{Goldsmithetal2016}, a lot of work has been done to develop regression analysis for functional data.
The packages \pkg{freqdom} \citep{freqdom2015}, \pkg{ftsa} \citep{Shang2013, HyndmanShang2016}, \pkg{ftsspec} \citep{Tavakoli2015} and \pkg{pcdpca} \citep{Kidzinskietal2016} provide implementations of various methods of functional time series analysis. Methods for the robust analysis of univariate and multivariate functional data are provided in the \pkg{roahd} package \citep{roahd2017}. Interval testing procedures for functional data in different frameworks (i.e., one- or two-population frameworks, functional linear models) are implemented in the \pkg{fdatest} package \citep{fdatest2015}. Functional data analysis in the mixed model framework is implemented in the package \pkg{fdaMixed} \citep{Markussen2014}. Functional observations can be visualized by many plots implemented in the packages \pkg{GOplot} \citep{Walteretal2015}, \pkg{rainbow} \citep{ShangHyndman2016} and \pkg{refund.shiny} \citep{WrobelGoldsmith2016}. The packages \pkg{fds} \citep{ShangHyndman2013} and \pkg{mfds} \citep{GoreckiSmaga2017mfds} contain some interesting functional data sets.
Despite so many \proglang{R}~packages for functional data analysis, a broad range of tests for the widely applicable analysis of variance problem for functional data was implemented only very recently, in the package \pkg{fdANOVA}. Earlier, only the testing procedures of \cite{CFF2004} and \cite{CAFB2010} were available in the package \pkg{fda.usc}. The package \pkg{fdANOVA} is available from the Comprehensive \proglang{R}~Archive Network at \url{http://CRAN.R-project.org/package=fdANOVA}. The aim of this package is to provide a few functions implementing most of the known analysis of variance tests for univariate and multivariate functional data, mainly for the homoscedastic case. Most of them are based on the bootstrap, permutations or projections, which may be time-consuming. For this reason, the package also contains parallel implementations, which reduce the computation time significantly, as shown in an empirical evaluation.
The rest of the vignette is organized in the following manner. In Section~\ref{sec2}, the problems of the analysis of variance for univariate and multivariate functional data are introduced. A review of most of the known solutions to these problems is also presented there. Some of the testing procedures are slightly generalized. Since it was not an easy task to implement many different tests in a few functions, their usage may also not be easy at first. Thus, Section~\ref{sec3} contains a detailed description of the (possible) preparation of the data and the package functionality, as well as a package demonstration on a commonly used real data set.
\section{Analysis of variance for functional data} \label{sec2}
In this section, we briefly describe most of the known testing procedures for the analysis of variance problem for functional data in the univariate and multivariate cases. All of them are implemented in the package \pkg{fdANOVA}.
\subsection{Univariate case} \label{sec21}
We consider $l$ groups of independent random functions $X_{ij}(t)$, $i=1,\dots,l$, $j=1,\dots,n_i$, defined over a closed and bounded interval $I=[a,b]$. Let $n=n_1+\dots+n_l$. These groups may differ in their mean functions, i.e., we assume that $X_{ij}(t)$, $j=1,\dots,n_i$, are stochastic processes with mean function $\mu_i(t)$, $t\in I$, and covariance function $\gamma(s, t)$, $s,t\in I$, for $i=1,\dots,l$. Of interest is to test the following null hypothesis
\begin{equation} \label{H0u} H_0:\mu_1(t)=\dots=\mu_l(t),\ t\in I.
\end{equation}
The alternative is the negation of the null hypothesis. The problem of testing this hypothesis is known as the one-way analysis of variance problem for functional data (FANOVA). Many of the tests for (\ref{H0u}) are based on the pointwise $F$~test statistic \citep{RamsaySilverman2005} given by the formula
$$F_n(t)=\frac{\mathrm{SSR}_n(t)/(l-1)}{\mathrm{SSE}_n(t)/(n-l)},$$
where
$$\mathrm{SSR}_n(t)=\sum_{i=1}^l n_i(\bar{X}_i(t)-\bar{X}(t))^{2},\quad\mathrm{SSE}_n(t)=\sum_{i=1}^l\sum_{j=1}^{n_i}(X_{ij}(t)-\bar{X}_i(t))^{2}$$
are the pointwise between-subject and within-subject variations, respectively, and $\bar{X}(t)=(1/n)\sum_{i=1}^l\sum_{j=1}^{n_i}X_{ij}(t)$ and $\bar{X}_i(t)=(1/n_i)\sum_{j=1}^{n_i}X_{ij}(t),$ $i=1,\dots,l$, are respectively the sample grand mean function and the sample group mean functions. \cite{Faraway1997} and \cite{ZhangChen2007} proposed to use only the pointwise between-subject variation and considered the test statistic $\int_I\mathrm{SSR}_n(t)dt$. Tests based on it are called the $L^2$-norm-based tests. The distribution of this test statistic can be approximated by that of $\beta\chi_d^2$ by comparing the first two moments of these random variables. The parameters $\beta$ and $d$ were estimated by the naive and bias-reduced methods \citep[see, for instance,][for more detail]{GoreckiSmaga2015}. Thus, we have the $L^2$-norm-based tests with the naive and bias-reduced methods of estimation of these parameters (the $\mathrm{L}^{2}$N and $\mathrm{L}^{2}$B tests for short). In the case of non-Gaussian samples or small sample sizes, the bootstrap $L^2$-norm-based test is also considered (the $\mathrm{L}^{2}$b test for short). A slightly different $L^2$-norm-based test was proposed by \cite{CFF2004}. Namely, they considered $\sum_{1\leq i
\begin{CodeChunk}
\begin{CodeInput}
R> library("fdANOVA")
R> str(fanova.tests)
\end{CodeInput}
\begin{CodeOutput}
function(x = NULL, group.label, test = "ALL", params = NULL,
  parallel = FALSE, nslaves = NULL)
\end{CodeOutput}
\end{CodeChunk}
which is controlled by the following parameters:
\begin{itemize}
\item \code{x} -- a $\mathcal{T}\times n$ matrix of data, whose columns are discretized versions of functions and whose rows correspond to design time points. Its default value is \code{NULL}, since if only the FP test is used, a basis representation of the data can be given instead of the raw observations (see the list \code{paramFP} below). For any of the other testing procedures, the raw data are needed.
\item \code{group.label} -- a vector containing group labels.
\item \code{test} -- an indicator establishing the choice of FANOVA tests to be performed. Its default value means that all testing procedures of Section~\ref{sec21} will be used. When only some tests are to be used, the parameter \code{test} is an appropriate subvector of the following vector of test labels (see Section~\ref{sec21}):
\begin{Code}
c("FP", "CH", "CS", "L2N", "L2B", "L2b", "FN", "FB", "Fb", "GPF",
  "Fmaxb", "TRP")
\end{Code}
\item \code{params} -- a list of additional parameters for the FP, CH, CS, L$^2$b, Fb, Fmaxb tests and the test based on random projections. Its possible elements and their default values are described below. The default value of this parameter means that these tests are performed with their default settings.
\item \code{parallel} -- a logical indicating whether to use parallelization.
\item \code{nslaves} -- if \code{parallel = TRUE}, the number of slaves. Its default value means that it will be equal to the number of logical processors of the computer used.
\end{itemize}
The list \code{params} can contain all or some of the elements \code{paramFP}, \code{paramCH}, \code{paramCS}, \code{paramL2b}, \code{paramFb}, \code{paramFmaxb} and \code{paramTRP} for passing the parameters for the FP, CH, CS, L$^2$b, Fb, Fmaxb tests and the test based on random projections, respectively, to the function \code{fanova.tests()}. They are described as follows. The list
\begin{Code}
paramFP = list(int, B.FP = 1000,
               basis = c("Fourier", "b-spline", "own"),
               own.basis, own.cross.prod.mat,
               criterion = c("BIC", "eBIC", "AIC", "AICc", "NO"),
               commonK = c("mode", "min", "max", "mean"),
               minK = NULL, maxK = NULL, norder = 4, gamma.eBIC = 0.5)
\end{Code}
contains the parameters of the FP test and their default values, where:
\begin{itemize}
\item \code{int} -- a vector of two elements representing the interval $I=[a,b]$. When it is not specified, it is determined by the number of design time points.
\item \code{B.FP} -- the number of permutation replicates.
\item \code{basis} -- a choice of basis of functions used in the basis function representation of the data.
\item \code{own.basis} -- if \code{basis = "own"}, a $K\times n$ matrix with columns containing the coefficients of the basis function representation of the observations.
\item \code{own.cross.prod.mat} -- if \code{basis = "own"}, a $K\times K$ cross product matrix corresponding to the basis used to obtain the matrix \code{own.basis}.
\item \code{criterion} -- a choice of information criterion for selecting the optimum value of $K$. \code{criterion = "NO"} means that $K$ is equal to the parameter \code{maxK} defined below. By (\ref{Xbfr}), we have $\text{\code{BIC}}(X_{ij})=\mathcal{T}\log(\mathbf{e}_{ij}^{\top}\mathbf{e}_{ij}/\mathcal{T})+K\log\mathcal{T}$, $\text{\code{eBIC}}(X_{ij})=\mathcal{T}\log(\mathbf{e}_{ij}^{\top}\mathbf{e}_{ij}/\mathcal{T})+K[\log\mathcal{T}+2\gamma\log(K_{\max})]$, $\text{\code{AIC}}(X_{ij})=\mathcal{T}\log(\mathbf{e}_{ij}^{\top}\mathbf{e}_{ij}/\mathcal{T})+2K$ and $\text{\code{AICc}}(X_{ij})=\text{\code{AIC}}(X_{ij})+2K(K + 1)/(n-K-1)$, where $\mathbf{e}_{ij}=(e_{ij1},\dots,e_{ij\mathcal{T}})^{\top}$, $e_{ijr}=X_{ij}(t_r)-\sum_{m=1}^K\hat{c}_{ijm}\varphi_m(t_r)$, $t_1,\dots,t_{\mathcal{T}}$ are the design time points, $\gamma\in[0,1]$, $K_{\max}$ is the maximum $K$ considered and $\log$ denotes the natural logarithm.
\item \code{commonK} -- a choice of method for selecting the common value for all observations from the values of $K$ corresponding to all processes.
\item \code{minK} (resp. \code{maxK}) -- a minimum (resp. maximum) value of $K$. When \code{basis = "Fourier"}, they have to be odd numbers. If $\text{\code{minK}}=\text{\code{NULL}}$ and $\text{\code{maxK}}=\text{\code{NULL}}$, we take $\text{\code{minK}}=3$ and \code{maxK} equal to the largest odd number smaller than the number of design time points. If \code{maxK} is greater than or equal to the number of design time points, \code{maxK} is taken as above. For \code{basis = "b-spline"}, \code{minK} (resp. \code{maxK}) has to be greater (resp. smaller) than or equal to \code{norder} defined below (resp. $\mathcal{T}$). If $\text{\code{minK}}=\text{\code{NULL}}$ or $\text{\code{minK}}<\text{\code{norder}}$ and $\text{\code{maxK}}=\text{\code{NULL}}$ or $\text{\code{maxK}}>\mathcal{T}$, then we take $\text{\code{minK}}=\text{\code{norder}}$ and $\text{\code{maxK}}=\mathcal{T}$.
\item \code{norder} -- if \code{basis = "b-spline"}, an integer specifying the order of b-splines.
\item \code{gamma.eBIC} -- a $\gamma\in[0,1]$ parameter in the eBIC.
\end{itemize}
It should be noted that the AICc may choose the final model with a number $K$ of coefficients close to the number of observations $n$, when $K_{\max}$ is greater than $n$. Such a selection usually differs from the choices suggested by the other criteria, but it seems that this does not have much impact on the testing results.
For the CH and CS (resp. L$^2$b, Fb and Fmaxb) tests, the parameters \code{paramCH} and \code{paramCS} (resp. \code{paramL2b}, \code{paramFb} and \code{paramFmaxb}) denote the numbers of discretized artificial trajectories for certain Gaussian processes (resp. bootstrap samples) used to approximate the null distributions of their test statistics. The default value of each of these parameters is 10,000.
The parameters of the test based on random projections and their default values are contained in the list
\begin{Code}
paramTRP = list(k = 30, projection = c("GAUSS", "BM"),
                permutation = FALSE, B.TRP = 10000,
                independent.projection.tests = TRUE)
\end{Code}
where:
\begin{itemize}
\item \code{k} -- a vector of numbers of projections.
\item \code{projection} -- a method of generating Gaussian processes in step 1 of the testing procedure based on random projections presented in Section \ref{sec2}. If \code{projection = "GAUSS"}, Gaussian white noise is generated as in the function \code{anova.RPm()} from the \proglang{R}~package \pkg{fda.usc} \citep{FBOF2012}. In the second case, Brownian motion is generated.
\item \code{permutation} -- a logical indicating whether to compute $p$~values by a permutation method.
\item \code{B.TRP} -- the number of permutation replicates.
\item \code{independent.projection.tests} -- a logical indicating whether to generate the random projections independently or dependently for different elements of the vector \code{k}. In the first case, the random projections for each element of the vector \code{k} are generated separately, while in the second one, they are generated as chained subsets, e.g., for \code{k = c(5, 10)}, the first 5 projections are a subset of the last 10. The second way of generating random projections is faster than the first one.
\end{itemize}
A Brownian process on $[a,b]$ has small variance near $a$ and higher variance close to $b$. This means that the tests based on random projections and Brownian motion may be better able to detect differences among the group mean functions when those differences are located much closer to $b$ than to $a$. To perform step 3 of the procedure based on random projections given in Section \ref{sec21}, we use five testing procedures in the package: the standard (\code{paramTRP$permutation = FALSE}) and permutation (\code{paramTRP$permutation = TRUE}) tests based on the ANOVA $F$~test statistic and the ANOVA-type statistic (ATS) proposed by \cite{Brunneretal1997}, as well as the testing procedure based on the Wald-type permutation statistic (WTPS) of \cite{Paulyetal2015}.
The function \code{fanova.tests()} returns a list of the class \code{fanovatests}. This list contains the values of the test statistics, the $p$~values and the parameters used. The results for a given test are given in a list (being an element of the output list) named the same as the indicator of that test in the vector \code{test}.
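To see how these arguments fit together, the following minimal sketch applies a few of the FANOVA tests to simulated data. The simulated curves, the two-group design and all settings (the objects \code{t.points}, \code{x.sim}, \code{group.sim} and \code{fanova.sim}, the number of permutations, the basis choice and the number of projections) are arbitrary choices made only for illustration and should be adapted to the problem at hand.
\begin{CodeChunk}
\begin{CodeInput}
R> library("fdANOVA")
R> set.seed(1)
R> # simulate n = 20 curves observed at 50 design time points in [0, 1];
R> # columns are curves, rows are design time points
R> t.points <- seq(0, 1, length.out = 50)
R> x.sim <- sapply(1:20, function(i) sin(2 * pi * t.points) +
+    (i > 10) * 0.5 * t.points + rnorm(50, sd = 0.2))
R> group.sim <- rep(1:2, each = 10)
R> # FP test with a Fourier basis, GPF test, and tests based on
R> # k = 5 random projections generated from Brownian motion
R> fanova.sim <- fanova.tests(x = x.sim, group.label = group.sim,
+    test = c("FP", "GPF", "TRP"),
+    params = list(paramFP = list(B.FP = 500, basis = "Fourier",
+                                 criterion = "eBIC", commonK = "mode"),
+                  paramTRP = list(k = 5, projection = "BM")))
R> summary(fanova.sim)
\end{CodeInput}
\end{CodeChunk}
The returned \code{fanovatests} object can then be inspected with \code{print()} or \code{summary()}, exactly as in the real data example of Section~\ref{sec33}.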
Additional outputs, such as the chosen optimal length of the basis expansion (the FP test), the values of the estimators used in the approximations of the null distributions of the test statistics (the L$^2$N, L$^2$B, FN, FB, GPF tests) and the projections of the data (the tests based on random projections), are also contained in appropriate lists of the returned object. If \code{independent.projection.tests = TRUE}, the projections of the data are contained in a list of length equal to the length of the vector \code{k}, whose $i$th element is an $n\times \text{\code{k[i]}}$ matrix whose columns are projections of the data. When \code{independent.projection.tests = FALSE}, the projections of the data are contained in an $n\times \max(\text{\code{k}})$ matrix whose columns are projections of the data.
The permutation tests based on a basis function representation for the FMANOVA problem, i.e., the W, LH, P and R tests, are implemented in the function \code{fmanova.ptbfr()}:
\begin{CodeChunk}
\begin{CodeInput}
R> library("fdANOVA")
R> str(fmanova.ptbfr)
\end{CodeInput}
\begin{CodeOutput}
function(x = NULL, group.label, int, B = 1000, parallel = FALSE,
  nslaves = NULL, basis = c("Fourier", "b-spline", "own"),
  own.basis, own.cross.prod.mat,
  criterion = c("BIC", "eBIC", "AIC", "AICc", "NO"),
  commonK = c("mode", "min", "max", "mean"),
  minK = NULL, maxK = NULL, norder = 4, gamma.eBIC = 0.5)
\end{CodeOutput}
\end{CodeChunk}
The parameters \code{group.label}, \code{int}, \code{B}, \code{parallel}, \code{nslaves}, \code{basis}, \code{norder} and \code{gamma.eBIC} are the same as in the function \code{fanova.tests()} (\code{B} corresponds to \code{B.FP}). The other arguments of \code{fmanova.ptbfr()} are described as follows:
\begin{itemize}
\item \code{x} -- a list of $\mathcal{T}\times n$ matrices of data, whose columns are discretized versions of functions and whose rows correspond to design time points. The $m$th element of this list contains the data for the $m$th feature, $m=1,\dots,p$. Its default value is \code{NULL}, because a basis representation of the data can be given instead of the raw observations (see the parameter \code{own.basis} below).
\item \code{own.basis} -- if \code{basis = "own"}, a list of length $p$, whose elements are $K_m\times n$ matrices ($m=1,\dots,p$) with columns containing the coefficients of the basis function representation of the observations.
\item \code{own.cross.prod.mat} -- if \code{basis = "own"}, a $KM\times KM$ cross product matrix corresponding to the basis used to obtain the list \code{own.basis}.
\item \code{criterion} -- a choice of information criterion for selecting the optimum value of $K_m$, $m=1,\dots,p$. \code{criterion = "NO"} means that $K_m$ are equal to the parameter \code{maxK} defined below.
\item \code{commonK} -- a choice of method for selecting the common value for all observations from the values of $K_{m}$ corresponding to all processes.
\item \code{minK} (resp. \code{maxK}) -- a minimum (resp. maximum) value of $K_m$. Further remarks about these arguments are the same as for the function \code{fanova.tests()}.
\end{itemize}
The function \code{fmanova.ptbfr()} returns a list of the class \code{fmanovaptbfr} containing the values of the test statistics (\code{W}, \code{LH}, \code{P}, \code{R}), the $p$~values (\code{pvalueW}, \code{pvalueLH}, \code{pvalueP}, \code{pvalueR}), the chosen optimal values of $K_m$ and the parameters used. This function uses the \proglang{R}~package \pkg{fda} \citep{Ramsayetal2014}.
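As a quick illustration of the input format expected by \code{fmanova.ptbfr()}, the following sketch builds a list of two simulated feature matrices and runs the permutation tests. Again, the simulated curves, the group structure and the chosen settings (the objects \code{t.points}, \code{x.sim}, \code{group.sim} and \code{fmanova.sim}) are arbitrary and serve only as an example.
\begin{CodeChunk}
\begin{CodeInput}
R> library("fdANOVA")
R> set.seed(1)
R> t.points <- seq(0, 1, length.out = 50)
R> # each list element is a T x n matrix (50 time points, 20 subjects)
R> x.sim <- vector("list", 2)
R> x.sim[[1]] <- sapply(1:20, function(i) sin(2 * pi * t.points) +
+    rnorm(50, sd = 0.2))
R> x.sim[[2]] <- sapply(1:20, function(i) cos(2 * pi * t.points) +
+    rnorm(50, sd = 0.2))
R> group.sim <- rep(1:2, each = 10)
R> # W, LH, P and R permutation tests with a Fourier basis representation
R> fmanova.sim <- fmanova.ptbfr(x.sim, group.sim, int = c(0, 1), B = 500,
+    basis = "Fourier", criterion = "BIC", commonK = "mode")
R> summary(fmanova.sim)
\end{CodeInput}
\end{CodeChunk}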
The function \code{fmanova.trp()} performs the testing procedures based on random projections for the FMANOVA problem (the Wp, LHp, Pp and Rp tests):
\begin{CodeChunk}
\begin{CodeInput}
R> library("fdANOVA")
R> str(fmanova.trp)
\end{CodeInput}
\begin{CodeOutput}
function(x, group.label, k = 30, projection = c("GAUSS", "BM"),
  permutation = FALSE, B = 1000,
  independent.projection.tests = TRUE, parallel = FALSE,
  nslaves = NULL)
\end{CodeOutput}
\end{CodeChunk}
The first two parameters of this function as well as the arguments \code{parallel} and \code{nslaves} are the same as in the function \code{fmanova.ptbfr()}. The other ones have the same meaning as in the parameter list \code{paramTRP} of the function \code{fanova.tests()} (\code{B} corresponds to \code{B.TRP}). The function \code{fmanova.trp()} returns a list of class \code{fmanovatrp} containing the parameters and the following elements ($|\text{\code{k}}|$ denotes the length of the vector \code{k}): \code{pvalues} -- a $4\times |\text{\code{k}}|$ matrix of $p$~values of the tests; \code{data.projections} -- if \code{independent.projection.tests = TRUE}, a list of length $|\text{\code{k}}|$, whose elements are lists of $n\times p$ matrices of projections of the observations, while when \code{independent.projection.tests = FALSE}, a list of length $\max(\text{\code{k}})$, whose elements are $n\times p$ matrices of projections of the observations.
Selecting the optimum length of the basis expansion by an information criterion as well as the bootstrap, permutation and projection loops are the most time-consuming steps of the testing procedures under consideration. To reduce the computational cost of the procedures, these steps are parallelized when the parameter \code{parallel} is set to \code{TRUE}. The parallel execution is handled by the \pkg{doParallel} package \citep{RASW2015}.
The package also contains a number of auxiliary functions. The $p$~values of the tests based on random projections for the FANOVA problem are plotted against the number of projections by the function \code{plot.fanovatests()}, which uses the package \pkg{ggplot2} \citep{Wickham2009} and is controlled by the following parameters: \code{x} -- an \code{fanovatests} object, more precisely, a result of the function \code{fanova.tests()} for the standard tests based on random projections; \code{y} -- an \code{fanovatests} object, more precisely, a result of the function \code{fanova.tests()} for the permutation tests based on random projections. Similarly, the $p$~values of the Wp, LHp, Pp and Rp tests are plotted by the function \code{plot.fmanovatrp()}. The arguments of this function are as follows: \code{x} -- an \code{fmanovatrp} object, more precisely, a result of the function \code{fmanova.trp()} for the standard tests; \code{y} -- an \code{fmanovatrp} object, more precisely, a result of the function \code{fmanova.trp()} for the permutation tests; \code{withoutRoy} -- a logical indicating whether the $p$~values of the Rp test should be omitted from the plot. We can use only one of the arguments \code{x} and \code{y}, or both simultaneously.
Using the package \pkg{ggplot2} \citep{Wickham2009}, the function \code{plotFANOVA()}:
\begin{CodeChunk}
\begin{CodeInput}
R> library("fdANOVA")
R> str(plotFANOVA)
\end{CodeInput}
\begin{CodeOutput}
function(x, group.label = NULL, int = NULL, separately = FALSE,
  means = FALSE, smooth = FALSE, ...)
\end{CodeOutput}
\end{CodeChunk}
produces a plot showing univariate functional observations, with or without indication of groups, as well as the group mean functions. The following parameters control this function:
\begin{itemize}
\item \code{x} -- a $\mathcal{T}\times n$ matrix of data, whose columns are discretized versions of functions and whose rows correspond to design time points.
\item \code{group.label} -- a character vector containing group labels. Its default value means that all functional observations are drawn without division into groups.
\item \code{int} -- this parameter is the same as in the function \code{fanova.tests()}.
\item \code{separately} -- a logical indicating how groups are drawn. If \code{separately = FALSE}, groups are drawn on one plot in different colors. When \code{separately = TRUE}, they are depicted in different panels.
\item \code{means} -- a logical indicating whether to plot only group mean functions.
\item \code{smooth} -- a logical indicating whether to plot reconstructed data via smoothing splines instead of raw data.
\end{itemize}
The functions \code{print.fanovatests()}, \code{print.fmanovaptbfr()} and \code{print.fmanovatrp()} print out the $p$~values and the values of the test statistics for the implemented testing procedures. Additionally, the functions \code{summary.fanovatests()}, \code{summary.fmanovaptbfr()} and \code{summary.fmanovatrp()} print out information about the data and the parameters of the methods. When calling the functions of the \pkg{fdANOVA} package, the software will check for the presence of the \pkg{doBy}, \pkg{doParallel}, \pkg{ggplot2}, \pkg{fda}, \pkg{foreach}, \pkg{magic}, \pkg{MASS} and \pkg{parallel} packages if necessary \citep{doBy2016, RASW2015, Wickham2009, Ramsayetal2014, magic2005, MASS2002}. If the required packages are not installed, an error message will be displayed.
It is worth mentioning that fifteen labeled multivariate functional data sets are available in the \pkg{mfds} package \citep{GoreckiSmaga2017mfds}, which is a kind of supplement to the \pkg{fdANOVA} package. Table~\ref{Tab1} presents brief information about these data sets, i.e., the numbers of variables, design time points, groups and observations. The data sets were created from multivariate time series data available in \cite{Chenetal2015}, \cite{Leebetal2007}, \cite{Lichman2013} and \cite{Olszewski2001} by extending all variables to the same length as in \cite{Rodriguezetal2005}. They originate from different domains, including handwriting recognition, medicine, robotics, etc. The data sets can be used for illustrating and evaluating the practical efficiency of classification and statistical inference methods \citep[see, for example,][]{GoreckiSmaga2015,GoreckiSmaga2017}.
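Before turning to the real data example of the next subsection, the following sketch indicates how the benefit of the \code{parallel} argument of the testing functions might be assessed. The simulated data and the chosen number of bootstrap samples (the objects \code{t.points}, \code{x.sim} and \code{group.sim}) are arbitrary; the actual speed-up depends on the hardware and on the size of the problem.
\begin{CodeChunk}
\begin{CodeInput}
R> library("fdANOVA")
R> set.seed(1)
R> # simulated data: 60 curves observed at 100 design time points
R> t.points <- seq(0, 1, length.out = 100)
R> x.sim <- sapply(1:60, function(i) sin(2 * pi * t.points) +
+    rnorm(100, sd = 0.2))
R> group.sim <- rep(1:3, each = 20)
R> # sequential versus parallel execution of the bootstrap-based Fmaxb test
R> system.time(fanova.tests(x.sim, group.sim, test = "Fmaxb",
+    params = list(paramFmaxb = 5000)))
R> system.time(fanova.tests(x.sim, group.sim, test = "Fmaxb",
+    params = list(paramFmaxb = 5000), parallel = TRUE))
\end{CodeInput}
\end{CodeChunk}
On very small examples, the overhead of starting the slave processes may dominate, so the gain from \code{parallel = TRUE} becomes visible mainly for larger data sets or larger numbers of bootstrap or permutation replicates.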
\begin{table}
\begin{center}
\begin{tabular}{lcccc}
\hline
Data sets & \#Variables & \#Time points & \#Groups & \#Observations \\ \hline
Arabic digits & 13 & 93 & 10 & 8800 \\
Australian language & 22 & 136 & 95 & 2565 \\
Character trajectories & 3 & 205 & 20 & 2858 \\
ECG & 2 & 152 & 2 & 200 \\
Graz & 3 & 1152 & 2 & 140 \\
Japanese vowels & 12 & 29 & 9 & 640 \\
Libras & 2 & 45 & 15 & 360 \\
Pen digits & 2 & 8 & 10 & 10992 \\
Robot failure LP1 & 6 & 15 & 4 & 88 \\
Robot failure LP2 & 6 & 15 & 5 & 47 \\
Robot failure LP3 & 6 & 15 & 4 & 47 \\
Robot failure LP4 & 6 & 15 & 3 & 117 \\
Robot failure LP5 & 6 & 15 & 5 & 164 \\
uWaveGestureLibrary & 3 & 315 & 8 & 4478 \\
Wafer & 6 & 198 & 2 & 1194 \\ \hline
\end{tabular}
\end{center}
\caption{Brief information about data sets available in the \pkg{mfds} package \citep{GoreckiSmaga2017mfds}.}
\label{Tab1}
\end{table}
\subsection{Package demonstration on real data example} \label{sec33}
In this section, we provide examples that illustrate how the functions of the \proglang{R}~package \pkg{fdANOVA} can be used to analyze real data. For this purpose, we use the popular \textit{gait} data set available in the \pkg{fda} package. This data set consists of the angles formed by the hip and knee of each of 39 children over each child's gait cycle. The simultaneous variations of the hip and knee angles for children are observed at 20 equally spaced time points in $[0.025, 0.975]$. So, in this data set, we have two functional features, which we put in the list \code{x.gait} of length two, as presented below.
\begin{CodeChunk}
\begin{CodeInput}
R> library("fda")
R> gait.data.frame <- as.data.frame(gait)
R> x.gait <- vector("list", 2)
R> x.gait[[1]] <- as.matrix(gait.data.frame[, 1:39])
R> x.gait[[2]] <- as.matrix(gait.data.frame[, 40:78])
\end{CodeInput}
\end{CodeChunk}
Similarly to \cite{GoreckiSmaga2017}, for illustrative purposes, the functional observations are divided into three samples. Namely, the first sample consists of the functions for the first 13 children, the second sample of the functions for the next 13 children, and the third sample of the functions for the remaining children. The sample labels are contained in the vector \code{group.label.gait}:
\begin{CodeChunk}
\begin{CodeInput}
R> group.label.gait <- rep(1:3, each = 13)
\end{CodeInput}
\end{CodeChunk}
We can plot the functional data by using the function \code{plotFANOVA()}. For example, we plot the observations for the first functional feature without (Figure~\ref{Fig1} (a)) and with indication of the samples (Figure~\ref{Fig1} (b) and (c)) as well as the group mean functions (Figure~\ref{Fig1} (d)).
\begin{CodeChunk}
\begin{CodeInput}
R> library("fdANOVA")
R> plotFANOVA(x = x.gait[[1]], int = c(0.025, 0.975))
R> plotFANOVA(x = x.gait[[1]], group.label = as.character(group.label.gait),
+    int = c(0.025, 0.975))
R> plotFANOVA(x = x.gait[[1]], group.label = as.character(group.label.gait),
+    int = c(0.025, 0.975), separately = TRUE)
R> plotFANOVA(x = x.gait[[1]], group.label = as.character(group.label.gait),
+    int = c(0.025, 0.975), means = TRUE)
\end{CodeInput}
\end{CodeChunk}
\begin{figure}[!t]
\includegraphics[width=0.99\textwidth, height=0.45\textwidth]{Fig1.pdf}
\caption{The first functional feature of the \textit{gait} data without (Panel (a)) and with indication of the samples (Panels (b) and (c)). Panel (d) depicts the group mean functions.}
\label{Fig1}
\end{figure}
From Figure~\ref{Fig1}, it seems that the mean functions of the three samples do not differ significantly.
To confirm this statistically, we use the FANOVA tests implemented in the \code{fanova.tests()} function. First, we use the default values of the parameters of this function:
\begin{CodeChunk}
\begin{CodeInput}
R> set.seed(123)
R> (fanova <- fanova.tests(x = x.gait[[1]], group.label = group.label.gait))
\end{CodeInput}
\begin{CodeOutput}
Analysis of Variance for Functional Data

FP test - permutation test based on a basis function representation
Test statistic = 1.468218  p-value = 0.198

CH test - L2-norm-based parametric bootstrap test for homoscedastic samples
Test statistic = 7911.385  p-value = 0.2247

CS test - L2-norm-based parametric bootstrap test for heteroscedastic samples
Test statistic = 7911.385  p-value = 0.1944

L2N test - L2-norm-based test with naive method of estimation
Test statistic = 2637.128  p-value = 0.2106562

L2B test - L2-norm-based test with bias-reduced method of estimation
Test statistic = 2637.128  p-value = 0.1957646

L2b test - L2-norm-based bootstrap test
Test statistic = 2637.128  p-value = 0.2169

FN test - F-type test with naive method of estimation
Test statistic = 1.46698  p-value = 0.2226683

FB test - F-type test with bias-reduced method of estimation
Test statistic = 1.46698  p-value = 0.2198691

Fb test - F-type bootstrap test
Test statistic = 1.46698  p-value = 0.2704

GPF test - globalizing the pointwise F-test
Test statistic = 1.363179  p-value = 0.2691363

Fmaxb test - Fmax bootstrap test
Test statistic = 3.752671  p-value = 0.1815

TRP - tests based on k = 30 random projections
p-value ANOVA = 0.4026718
p-value ATS = 0.3422311
p-value WTPS = 0.509
\end{CodeOutput}
\end{CodeChunk}
Besides the $p$~values displayed above, the list of matrices of projections of the data may be of practical interest to users of the tests based on random projections. The reason for this is that we can check the assumptions of the tests used in step 3 of the procedure based on random projections (see Section~\ref{sec21}), e.g., the normality assumption of the ANOVA $F$~test. Such an inspection may help in choosing the appropriate test for this step. This is especially important when the tests based on random projections differ in their decisions.
\begin{CodeChunk}
\begin{CodeInput}
R> fanova$TRP$data.projections
\end{CodeInput}
\begin{CodeOutput}
[[1]]
          [,1]     [,2]      [,3]     [,4]         [,30]
 [1,] 56.27204 42.95853  4.717162 2.128967 ...  -6.055347
 [2,] 59.79175 47.34407  4.081016 5.029328 ... -15.867838
 [3,] 59.56066 50.29226  5.867918 4.284960 ... -17.816042
...
[39,] 82.65391 65.15959 12.971629 7.695403 ...  -7.070380
\end{CodeOutput}
\end{CodeChunk}
As expected, none of the FANOVA tests rejects the null hypothesis.
Now, we show how particular tests can be chosen and how the parameters of these tests can be changed. For the FP test, we use a predefined basis function representation of the data. For this purpose, we expand the data in the b-spline basis by using functions from the \pkg{fda} package. They return the coefficients of the expansion as well as the cross product matrix corresponding to the basis functions. For comparison, we also choose the GPF test, which does not need any additional parameters. The Fmaxb test is performed with 1000 bootstrap samples. For the tests based on random projections, 10 and 15 projections are generated by using Brownian motion, and the $p$~values are computed by the permutation method.
\begin{CodeChunk}
\begin{CodeInput}
R> fbasis <- create.bspline.basis(rangeval = c(0.025, 0.975), 19, norder = 4)
R> own.basis <- Data2fd(seq(0.025, 0.975, len = 20), x.gait[[1]], fbasis)$coefs
R> own.cross.prod.mat <- inprod(fbasis, fbasis)
R> set.seed(123)
R> fanova.tests(x.gait[[1]], group.label.gait,
+    test = c("FP", "GPF", "Fmaxb", "TRP"),
+    params = list(paramFP = list(B.FP = 1000, basis = "own",
+                                 own.basis = own.basis,
+                                 own.cross.prod.mat =
+                                   own.cross.prod.mat),
+                  paramFmaxb = 1000,
+                  paramTRP = list(k = c(10, 15),
+                                  projection = "BM",
+                                  permutation = TRUE,
+                                  B.TRP = 1000)))
\end{CodeInput}
\begin{CodeOutput}
Analysis of Variance for Functional Data

FP test - permutation test based on a basis function representation
Test statistic = 1.468105  p-value = 0.193

GPF test - globalizing the pointwise F-test
Test statistic = 1.363179  p-value = 0.2691363

Fmaxb test - Fmax bootstrap test
Test statistic = 3.752671  p-value = 0.177

TRP - tests based on k = 10 random projections
p-value ANOVA = 0.3583333
p-value ATS = 0.3871429
p-value WTPS = 0.465

TRP - tests based on k = 15 random projections
p-value ANOVA = 0.504
p-value ATS = 0.507
p-value WTPS = 0.345
\end{CodeOutput}
\end{CodeChunk}
The above examples concern only the first functional feature of the \textit{gait} data set. A similar analysis can be performed for the second one. However, both features can be simultaneously investigated by using the FMANOVA tests described in Section~\ref{sec22}. First, we consider the permutation tests based on a basis function representation implemented in the function \code{fmanova.ptbfr()}. We apply this function to the whole data set, specifying non-default values of most of the parameters. Here, we also show how to use the function \code{summary.fmanovaptbfr()} to additionally obtain a summary of the data and the test parameters. Observe that the results are consistent with those obtained by the FANOVA tests.
\begin{CodeChunk}
\begin{CodeInput}
R> set.seed(123)
R> fmanova <- fmanova.ptbfr(x.gait, group.label.gait, int = c(0.025, 0.975),
+    B = 5000, basis = "b-spline", criterion = "eBIC",
+    commonK = "mean", minK = 5, maxK = 20, norder = 4,
+    gamma.eBIC = 0.7)
R> summary(fmanova)
\end{CodeInput}
\begin{CodeOutput}
FMANOVA - Permutation Tests based on a Basis Function Representation

Data summary
Number of observations = 39
Number of features = 2
Number of time points = 20
Number of groups = 3
Group labels: 1 2 3
Group sizes: 13 13 13
Range of data = [0.025 , 0.975]

Testing results
W = 0.9077424  p-value = 0.5322
LH = 0.1003732  p-value = 0.5286
P = 0.09340229  p-value = 0.5366
R = 0.08565056  p-value = 0.3852

Parameters of test
Number of permutations = 5000
Basis: b-spline (norder = 4)
Criterion: eBIC (gamma.eBIC = 0.7)
CommonK: mean
Km = 20 20
KM = 20
minK = 5
maxK = 20
\end{CodeOutput}
\end{CodeChunk}
Finally, we apply the tests based on random projections for the FMANOVA problem in the \textit{gait} data set. In the following, these tests are performed with $k=1,5,10,15,20$ projections, the standard and permutation methods, as well as random projections generated in independent and dependent ways.
The resulting $p$~values are visualized by the \code{plot.fmanovatrp()} function:
\begin{CodeChunk}
\begin{CodeInput}
R> set.seed(123)
R> fmanova1 <- fmanova.trp(x.gait, group.label.gait, k = c(1, 5, 10, 15, 20))
R> fmanova2 <- fmanova.trp(x.gait, group.label.gait, k = c(1, 5, 10, 15, 20),
+    permutation = TRUE)
R> plot(x = fmanova1, y = fmanova2)
R> plot(x = fmanova1, y = fmanova2, withoutRoy = TRUE)
R> set.seed(123)
R> fmanova3 <- fmanova.trp(x.gait, group.label.gait, k = c(1, 5, 10, 15, 20),
+    independent.projection.tests = FALSE)
R> fmanova4 <- fmanova.trp(x.gait, group.label.gait, k = c(1, 5, 10, 15, 20),
+    permutation = TRUE,
+    independent.projection.tests = FALSE)
R> plot(x = fmanova3, y = fmanova4)
R> plot(x = fmanova3, y = fmanova4, withoutRoy = TRUE)
\end{CodeInput}
\end{CodeChunk}
\begin{figure}[!t]
\includegraphics[width=0.99\textwidth, height=0.55\textwidth]{Fig2.pdf}
\caption{P~values of the tests based on random projections for the FMANOVA problem in the \textit{gait} data set. The Wpp, LHpp, Ppp and Rpp tests are the permutation versions of the Wp, LHp, Pp and Rp tests. In the first (resp. second) row, the random projections were generated in an independent (resp. dependent) way.}
\label{Fig2}
\end{figure}
The obtained plots are shown in Figure \ref{Fig2}. As we can observe, except for the standard Rp test, all testing procedures behave similarly and do not reject the null hypothesis. The standard Rp test does not keep the pre-assigned type-I error rate, as \cite{GoreckiSmaga2017} showed in simulations. More precisely, this test is usually too liberal, which explains why its $p$~values are much smaller than those of the other testing procedures. That is why the function has an option not to plot the $p$~values of this test.
\newpage
\bibliography{fdANOVA}
%\bibliographystyle{jss}
\end{document}