\title{Dynamic Network Regression Using R Package dnr} \author{Abhirup Mallik} \date{\today} %\VignetteIndexEntry{Dynamic Network Regression Using dnr} %\VignetteEngine{knitr::knitr} \documentclass[12pt]{article} \usepackage{amsmath} \usepackage{amsthm} \usepackage{amssymb} \usepackage{geometry} \geometry{verbose,tmargin=2.5cm,bmargin=2.5cm,lmargin=2.5cm,rmargin=2.5cm} %\usepackage{fullpage} \usepackage{graphicx} \usepackage{setspace} \usepackage{verbatim} \begin{document} %\SweaveOpts{concordance=TRUE} \newcommand\myeq{\mathrel{\stackrel{\makebox[0pt]{\mbox{\normalfont\tiny def}}}{=}}} <>= library(knitr) # set global chunk options knitr::opts_chunk$set(fig.path='figure/Vignette-', fig.align='center', fig.show='hold', fig.width='\\linewidth', out.width='\\linewidth') options(formatR.arrow=TRUE,width=90) @ \maketitle R package 'dnr' enables the user to fit dynamic network regression models for time variate network data available mostly in social sciences or social network analysis. In this document, we demonstrate the process of building a model to fit a dynamic network data set and using that model for prediction. \section{Analysis of Beach data} \label{sec:analysis-beach-data} First, we consider the beach data for our demo. <>= suppressMessages(library(dnr)) data(beach) ## get the number of time points length(beach) ## network size (that allows NA) network.size.1 <- function(x){ if(!network::is.network(x)){ if(is.na(x)) return(0) } else return(network::network.size(x)) } ## get the size of networks at each time point sapply(beach, network.size.1) @ The beach data is a rapidly changing data set with possible periodic effects. We visualize the adjacency matrix from four time points of the data. <>= par(mfrow = c(2,2)) binaryPlot(beach[[1]][, ], title = "Time point 1") binaryPlot(beach[[10]][, ], title = "Time point 10") binaryPlot(beach[[20]][, ], title = "Time point 20") binaryPlot(beach[[31]][, ], title = "Time point 31") @ For vertex model, we define our own term dictionary. We use the similar approach as the edge model for specifying the time dependence of the terms using a matrix of lag terms. \begin{table}[h] \centering \begin{tabular}[h]{|c|c|} \hline Term & Index \\ \hline degree (Freeman) & 1 \\ in degree & 2 \\ out degree & 3 \\ Eigen centrality & 4 \\ Between centrality & 5 \\ Info centrality & 6 \\ Closeness centrality & 7 \\ Log K cycle & 8 \\ Log size & 9 \\ \hline \end{tabular} \caption{Index of the terms for specifying the vertex model} \end{table} \subsection{Model Fitting} \label{sec:model-fitting-beach} We first try to build the model for vertex regression. We consider a maximum lag of 3. We need to specify the lag structure using a binary vector of size 3. We also need to specify the dependence on the vertex parameters up to 3 lags. There are $9$ vertex parameters available in the current version of the library. We use a binary matrix of size $3 \times 9$ for specifying the lag dependence of the parameters. <>= nvertexstats <- 9 maxLag = 3 VertexLag = rep(1, maxLag) VertexLagMatrix1 <- matrix(1, maxLag, nvertexstats) VertexLagMatrix1 @ As for this data set there is expected seasonal effect, for example weekends would have different effect than weekdays, we would like to model that using a time variate intercept parameter. We write a function to extract the day information from the data. <>= getWeekend <- function(z){ weekends <- c("Saturday", "Sunday") if(!network::is.network(z)){ if(is.na(z)) return(NA) } else { zDay <- get.network.attribute(z, attrname = "day") out <- ifelse(zDay %in% weekends, 1, 0) return(out) } } ## for(i in 1:31) print(getWeekend(beach[[i]])) ## generate a vector of network level exogenous variable dayClass <- numeric(length(beach)) for(i in seq_along(dayClass)) { dayClass[i] <- getWeekend(beach[[i]]) } @ We then use the function paramVertexOnly() to fit the model specified above to the beach data. Most of the options are kept at their default value. For a detail description of the model specification, please refer to the help pages of the function. We use the default 'bayesGLM' option for the logistic regression. We print the model object, which is an object from arm package, with its own summary method. <>= out <- paramVertexOnly(InputNetwork = beach, maxLag = 3, VertexStatsvec = rep(1, nvertexstats), VertexLag = rep(1, maxLag), VertexLagMatrix = VertexLagMatrix1, dayClass = dayClass) summary(out$VertexFit$fit) @ As we can see the model is hardly parsimonious. So, we decided to remove the terms that were not significant. We report the result of the refitted model. <>= VertexLagMatrix <- matrix(0, maxLag, nvertexstats) VertexLagMatrix[, c(4, 7)] <- 1 VertexLagMatrix[c(2,3),7] <- 0 VertexLagMatrix out <- paramVertexOnly(InputNetwork = beach, maxLag = 3, VertexStatsvec = rep(1, nvertexstats), VertexLag = rep(1, maxLag), VertexLagMatrix = VertexLagMatrix, dayClass = dayClass) summary(out$VertexFit$fit) @ Now, we have a model with mostly significant parameters, so we select this model for vertex generation. As the edge model and vertex model are separable, we can expect this model to work for the joint model as well. We use the function paramVertex() for fitting the joint vertex-edge model to the beach data. <>= out <- paramVertex(InputNetwork = beach, maxLag = 3, VertexStatsvec = rep(1, nvertexstats), VertexModelGroup = "regular", VertexLag = rep(1, maxLag), VertexLagMatrix = VertexLagMatrix, dayClass = dayClass, EdgeModelTerms = NA, EdgeModelFormula = NA, EdgeGroup = NA, EdgeIntercept = c("edges"), EdgeNetparam = c("logSize"), EdgeExvar = NA, EdgeLag = c(1, 1, 0), paramout = TRUE) summary(out$VertexFit$fit) summary(out$EdgeFit$fit) @ The edge model parameters are specified using 'EdgeIntercept' term, as we are using an intercept only model. For this example, we have tried using time variate parameters, but finally decided on the intercept only model. The term 'EdgeNetParam' indicates the network level attribute. Currently the only attribute supported here is 'logSize', which is log of the network size at the present time point. The binary vector 'EdgeLag' indicates the lag dependence of the edges. The terms 'EdgeModelTerms' and 'EdgeModelFormula' has not been used for this example. \subsection{Prediction for Beach Data} \label{sec:pred-beach-data} As we have finalized on a model for the beach data, we can use this model to predict the future networks up to any arbitrary number of time points. As long as we do not run into the problems of degeneracy, the simulation method should be able to generate networks with this model. <>= suppressWarnings(simResult <- engineVertex(InputNetwork = beach, numSim = 3, maxLag = 3, VertexStatsvec = rep(1, nvertexstats), VertexModelGroup = "regular", VertexAttLag = rep(1, maxLag), VertexLag = rep(1, maxLag), VertexLagMatrix = VertexLagMatrix, dayClassObserved = dayClass, dayClassFuture = c(1, 0, 0, 0, 0), EdgeModelTerms = NA, EdgeModelFormula = NA, EdgeGroup = NA, EdgeIntercept = c("edges"), EdgeNetparam = c("logSize"), EdgeExvar = NA, EdgeLag = c(0, 1, 0), paramout = TRUE )) par(mfrow = c(2,2)) binaryPlot(beach[[31]][, ], title = "Time point 31") binaryPlot(simResult$SimNetwork[[1]][, ], title = "Time point 32 (simulated)") binaryPlot(simResult$SimNetwork[[2]][, ], title = "Time point 33 (simulated)") binaryPlot(simResult$SimNetwork[[3]][, ], title = "Time point 34 (simulated)") @ \section{Model for Fixed Vertex Case} \label{sec:model-fixed-vertex} Even though fixed vertex case can be considered as a special case of dynamic vertex-edge case, it is preferred that the fixed vertex case is handled in a simpler way. We have provided separate functions for this case, that we will demonstrate using the blog data set. <>= data(rdNets) length(rdNets) rdNets[[1]] plot(rdNets[[1]]) @ We use the function paramest() to fit the edge model to the blog data. The function accepts ERGM style model formulas, however we require that the terms are expanded by the user. This is required to construct the lag dependence binary matrix. The interface for this function is similar to the dynamic vertex case. <>= input_network=rdNets[1:6] model.terms=c("triadcensus.003", "triadcensus.012", "triadcensus.102", "triadcensus.021D", "gwesp"); model.formula = net~triadcensus(0:3)+gwesp(decay=0, fixed=FALSE, cutoff=30)-1; graph_mode='digraph'; group='dnc'; alpha.glmnet=1 directed=TRUE; method <- 'bayesglm' maxlag <- 3 lambda=NA intercept = c("edges") cdim <- length(model.terms) lagmat <- matrix(sample(c(0,1),(maxlag+1)*cdim,replace = TRUE),ncol = cdim) ylag <- rep(1,maxlag) exvar <- NA out <- suppressWarnings(paramEdge(input_network, model.terms, model.formula, graph_mode='digraph', group,intercept = c("edges"),exvar=NA, maxlag = 3, lagmat = matrix(sample(c(0,1),(maxlag+1)*cdim, replace = TRUE),ncol = cdim), ylag = rep(1,maxlag), lambda = NA, method='bayesglm', alpha.glmnet=1)) out$coef @ Here the model formula is an ERGM formula. However, we have also provided the expansion of all the terms in the formula. For example the term 'triadcensus(0:3)' has been expanded out to respective triadcensus terms. The 'group' parameter is a categorical attribute for the vertices. This was present in the dynamic vertex case also. The specification of the intercept term is similar as well. The lag terms and lag dependancy of the parameters are represented with a binary vector or a binary matrix respectively. We can use the model chosen to simulate the networks in future time points. Here we are simulating $10$ future networks. We have kept the option of specifying the model and the initial network separate unlike the dynamic vertex case. However, using different model than the model fitted on the input network is not recommended as it is easily possible to create examples where these two inputs differ significantly, hurting the performance of the simulation. <>= input_network=rdNets[1:6] model.terms=c("triadcensus.003", "triadcensus.012", "triadcensus.102", "triadcensus.021D", "gwesp") model.formula = net~triadcensus(0:3)+gwesp(decay = 0, fixed=FALSE, cutoff=30)-1 graph_mode='digraph' group='dnc' alpha.glmnet=1 directed=TRUE method <- 'bayesglm' maxlag <- 3 lambda=NA intercept = c("edges") cdim <- length(model.terms) lagmat <- matrix(sample(c(0,1),(maxlag+1)*cdim,replace = TRUE),ncol = cdim) ylag <- rep(1,maxlag) lagmat[1,] <- rep(0,ncol(lagmat)) out <- suppressWarnings(paramEdge(input_network,model.terms, model.formula, graph_mode="digraph",group,intercept = c("edges"),exvar=NA, maxlag = 3, lagmat = lagmat, ylag = rep(1,maxlag), lambda = NA, method='bayesglm', alpha.glmnet=1)) # start_network <- input_network inputcoeff <- out$coef$coef nvertex <- 47 ns <- 10 exvar <- NA input_network <- rdNets[1:6] maxlag <- 3 start_network <- input_network inputcoeff <- out$coef$coef nvertex <- 47 ns <- 10 exvar <- NA tmp <- suppressWarnings(engineEdge(start_network=start_network,inputcoeff=inputcoeff,ns=ns, model.terms=model.terms, model.formula=model.formula, graph_mode=graph_mode,group=group,intercept=intercept, exvar=exvar, maxlag=maxlag, lagmat=lagmat, ylag=ylag, lambda = NA, method='bayesglm', alpha.glmnet=alpha.glmnet)) par(mfrow = c(2,2)) binaryPlot(input_network[[1]][, ], title = "Time point 6", axlabs = FALSE) binaryPlot(tmp$out_network[[1]][, ], title = "Time point 7 (simulated)", axlabs = FALSE) binaryPlot(tmp$out_network[[2]][, ], title = "Time point 8 (simulated)", axlabs = FALSE) binaryPlot(tmp$out_network[[3]][, ], title = "Time point 9 (simulated)", axlabs = FALSE) @ \subsection{Time series of parameter estimates} \label{sec:time-seri-param} As all the coefficients are calculated as a part of the simulation, they are also provided along with the simulated networks. We can plot the time series of the network parameters to see the quality of the simulations. <>= plot.ts(tmp$coefmat[, 1:10], xy.labels=FALSE, main = "Estimated parameters from simulated networks", cex = 0.8) @ \subsection{Performance metrics} \label{sec:performance-metrics} We also provide some functions for assessing the quality of the simulated networks and make comparisons with holdout set or some other benchmarked networks. Specifically, there are functions for number of triangles, cluster coefficient and expectation of degree distribution has been implemented. We report the performance metrics for the input networks as well as the simulated networks for our example on blog data. <>= perfMetrics <- cbind(c(sapply(tmp$out_network, function(net) ntriangles(net[, ])), sapply(input_network, function(net) ntriangles(net[, ]))), c(sapply(tmp$out_network, function(net) clustCoef(net[, ])), sapply(input_network, function(net) clustCoef(net[, ]))), c(sapply(tmp$out_network, function(net) expdeg(net[, ])), sapply(input_network, function(net) expdeg(net[, ])))) colnames(perfMetrics) <- c("Triangles", "ClustCoefs", "ExpDeg") perfMetrics <- data.frame(perfMetrics, row.names = NULL) knitr::kable(perfMetrics, digits = 3, caption ="Performance metrics for input and simulated networks.") @ \end{document}