%\VignetteIndexEntry{Maximum Entropy Bootstrap for Time Series: Toy Example Exposition}


\documentclass{article}



\usepackage{multirow,thumbpdf}
\usepackage{amsmath, color}

\begin{document}
\SweaveOpts{keep.source=TRUE}
\SweaveOpts{concordance=FALSE}

\DefineVerbatimEnvironment{Sinput}{Verbatim}{formatcom={\color{red}}}
\DefineVerbatimEnvironment{Soutput}{Verbatim}{formatcom={\color{blue}}}

\author{Hrishikesh D. Vinod\\Fordham University}
\title{Maximum Entropy Bootstrap for Time Series: Toy Example Exposition}
\maketitle

\section*{Toy Example}

The Maximum Entropy Bootstrap is illustrated with a small example. Let the sequence $x_t = (4, 12, 36, 20, 8)$ be the series of data observed from the period $t = 1$ to $t = 5$ as indicated in the first two columns in Table 1. We jointly sort these two columns on the second column and place the result in the next two columns (Table 2 columns 3 and 4), giving us the ordering index vector in column 3.

Next, the four intermediate points in Column 5 are seen to be simple averages of consecutive order statistics. We need two more (limiting) "intermediate" points. These are obtained as described in Step 3 above. Using 10\% trimming, the limiting intermediate values are $z_0 = -11$ and $z_T = 51$. With these six $z_t$ values we build our five half open intervals:
$$U(-11, 6] \times U(6, 10] \times U(10, 16] \times U(16, 28] \times U(28, 51]$$

The maximum entropy density of the ME bootstrap is defined as the combination of $T$ uniform densities defined over (the support of) $T$ half open intervals.

\begin{table}[ht]
\centering
\small{
\begin{tabular}{lllllllll}
\\
\hline
\multirow{3}{0.7cm}{Time} & \multirow{3}{0.5cm}{$x_t$} &
\multirow{3}{1.2cm}{Ordering vector} & \multirow{3}{1cm}{Sorted $x_t$} &
\multirow{3}{1.3cm}{Interme-diate points} & \multirow{3}{1.3cm}{Desired means} &
\multirow{3}{1.3cm}{Uniform draws} & \multirow{3}{1.3cm}{Preli-minary values} &
\multirow{3}{1.3cm}{Final replicate} \\ \\ \\

\hline
\multicolumn{1}{c}{1} & \multicolumn{1}{c}{4} &
\multicolumn{1}{c}{1} & \multicolumn{1}{c}{4} &
\multicolumn{1}{c}{6} & \multicolumn{1}{c}{5} & \multicolumn{1}{c}{0.12} &
\multicolumn{1}{c}{5.85} & \multicolumn{1}{c}{5.85} \\
%
\multicolumn{1}{c}{2} & \multicolumn{1}{c}{12} & \multicolumn{1}{c}{5} &
\multicolumn{1}{c}{8} & \multicolumn{1}{c}{10} & \multicolumn{1}{c}{8} &
\multicolumn{1}{c}{0.83} & \multicolumn{1}{c}{6.70} & \multicolumn{1}{c}{13.90} \\
%
\multicolumn{1}{c}{3} & \multicolumn{1}{c}{36} & \multicolumn{1}{c}{2} &
\multicolumn{1}{c}{12} & \multicolumn{1}{c}{16} & \multicolumn{1}{c}{13} &
\multicolumn{1}{c}{0.53} &
\multicolumn{1}{c}{13.90} &\multicolumn{1}{c}{23.95} \\
\multicolumn{1}{c}{4} & \multicolumn{1}{c}{20} & \multicolumn{1}{c}{4} &
\multicolumn{1}{c}{20} & \multicolumn{1}{c}{28} & \multicolumn{1}{c}{22} &
\multicolumn{1}{c}{0.59} & \multicolumn{1}{c}{15.70} & \multicolumn{1}{c}{15.70} \\
%
\multicolumn{1}{c}{5} & \multicolumn{1}{c}{8} & \multicolumn{1}{c}{3} &
\multicolumn{1}{c}{36} & &
\multicolumn{1}{c}{32} & \multicolumn{1}{c}{0.11} & \multicolumn{1}{c}{23.95} &
\multicolumn{1}{c}{6.70} \\
\hline
\end{tabular}
}
\caption{\label{Tt5ex} Example of the ME bootstrap algorithm.}
\end{table}

\begin{figure}[ht]
\begin{center}
\includegraphics[width=0.5\textwidth]{t5ex_density}
\end{center}
\caption{\label{Ft5ex.density} Maximum entropy density for the $x_t=4,12,36,20,8$ example.}
\end{figure}


<<>>=
xx <- c(4,12,36,20,8) #original time series up and down shape
trimprop <- 0.10 #trimming proportion
reachbnd <- FALSE #reaching the bound of the range forced or  not?

# uniform draws used as an example in Table 1 of the paper

p <- c(0.12, 0.83, 0.53, 0.59, 0.11)

n <- length(xx)

x <- sort(xx)
ordxx <- sort(xx, index.return=TRUE)
print(c("ordxx=",ordxx)) #without the dollar ix appending

print(c("ordxx$ix=",ordxx$ix))
@


\subsection*{Index Return = TRUE using \texttt{sort} command}

The above use of the \texttt{sort} command with \texttt{index.return=TRUE} is worth learning.
It will be used later to map from numerical magnitudes (values) domain to the
time domain.  The use of \texttt{sort} with option  \texttt{index.return=TRUE} allows us
to avoid explicit use of sorting on two columns of data.

<<>>=
x <- sort(xx)
x #sorted magnitudes original xx data in values domain
#embed good for getting a matrix with lagged values in the second column
embed(1:4,2) #allows no worry about missing values with lags
embed(x, 2) #apply embed to our sorted xx
z <- rowMeans(embed(x, 2))
z #these are intermediate values
dv <- abs(diff(xx))
dv #vector of absolute differences
dvtrim <- mean(dv, trim=trimprop)
dvtrim #trimmed mean of dv
xmin <- x[1]-dvtrim
xmax <- x[n]+dvtrim

xmin #ultimate minimum for resampled data gives z_0
xmax #ultimate maximum for resampled data gives z_T
@

\subsection*{\texttt{embed} command}

R function \texttt{embed} Embeds the time series x into a low-dimensional Euclidean space.  We are using dimension=2 here.
It gives lagged values in second column of a matrix without worrying about missing values.

\texttt{dv} denotes the absolute difference between consecutive sorted values.

\texttt{z} denotes intermediate values needed for defining half-open intervals $I_t = (z_{(t-1)},\, z_t]$.

Unfortunately, the xmin ($z_0=-11$) and xmax ($z_T=51$) do not appear in the published Table.

\section*{Mass and Mean Preserving Constraints satisfy ergodic theorem}

A fraction $1/T$ of the mass of the probability distribution must lie in each
interval.  meboot requires each half open interval $I_{t}$ to have an equal chance
being included in the resample.


$\Sigma x_{t}=\Sigma x_{(t)}=\Sigma m_{t}$, where $m_{t}$ denote the mean of
$f(x)$ within the interval $I_{t}$.
\textbf{mean preserving} constraint.
\begin{subequations}
\begin{align}
f(x)&= 1/(z_{1}- z_{0}),\quad x \in I_{(1)},\quad
m_{1}=0.75x_{(1)}+0.25x_{(2)},\\
f(x)&=1/(z_{k}- z_{k-1}),\quad x \in (z_{k}- z_{k-1}],\\
&\mbox{\qquad with mean}\quad m_{k}=0.25x_{(k-1)}+ 0.50x_{(k)}+0.25x_{(k +
1)}\\
& {\qquad\quad\ \rm for}\quad k= 2, \ldots , T-1,\\
f(x)&= 1/(z_{T}-z_{T - 1}),\quad x \in I_{(T) },\quad m_{T}=0.25x_{(T -
1)}+0.75x_{(T)}.
\end{align}
\end{subequations}

The weights for the two observations at the left end interval are (0.75, 0.25)

Note that the weights are (0.25, 0.50, and 0.25) for all intermediate intervals
We have $T=5$ here leading to three intervals needing these weights.

The weights for the two observations at the right end interval are (0.25, 0.75)

\section*{Properties of uniform density}

If the range of continuous uniform random variable are \texttt{a} to \texttt{b}
the density of uniform is f( 1/(b-a))

Mean of uniform is (a+b)/2

We have used maximum entropy principle to say that the densities
between the intermediate points $z_0$  to $z_T$ are all uniform.

The desired means for our toy example with order \texttt{stats}=(4,8,12,20,36) are

(6+10)/2=8,  (10+16)/2=13,  (16+28)/2=22 for intermediate intervals

For the left extreme we use $0.75*x_{(1)}+0.25*x_{(2)}$=0.75*4+0.25*8=5

For the right extreme interval desired mean is $0.25*x_{(T-1)}+0.75*x_{(T)}$
0.25*20+0.75*36=32

see last table column entitled \texttt{desintxb} with entries (5,8,13,22,32)

\section*{\texttt{embed} helps achieve desired means $m_t$ of the $T$ intervals}

It is worth learning how three dimensional \texttt{embed} function of R works with
this toy example where we are considering mean of 3 consecutive values, except
for the two intervals at the two ends of the series.
<<>>=
embed(1:5,3) #embeding 1:5 gives 3 by 3 matrix
#Note j-th column has lag=j-1 values.  Col.2 has lag 1
#Note embed retains only non-missing lag values
x
t(embed(x,3))# transpose embed matrix
t(embed(x, 3))*c(0.25,0.5,0.25) #multiply by weights
t(t(embed(x, 3))*c(0.25,0.5,0.25)) #transpose twice to get back
@


Next we compute the row sum of above and call it a vector \texttt{aux}.
This applies to intermediate intervals not the extreme
intervals a the bottom end and at the top end, where one
needs to average only two intermediate values with weights 0.75 and 0.25.



<<>>=
aux <- rowSums( t( t(embed(x, 3))*c(0.25,0.5,0.25) ) )

aux #these are only 3
#append the means of two extreme intervals at the two ends
desintxb <- c(0.75*x[1]+0.25*x[2], aux, 0.25*x[n-1]+0.75*x[n])
desintxb# des=desired, int=interval, xb=xbar=means
desintxb  #desired means  5  8 13 22 32,  Now 5 as desired
print( "mean(xx),mean(desintxb),mean(x)") #all=16
print(c(mean(xx),mean(desintxb), mean(x)))
@


The above shows that the \textbf{mean preserving constraint} is satisfied,
since the mean of data and mean of desired means equal the same number 16.
This is no accident, but achieved by designed weights which force the desired means of each interval
to be based on the $x_t$ data.  This helps ensure that the ergodic theorem is satisfied by our resamples.



\section*{Drawing random quantile $q_t\in [z_0,z_T]$ from empirical cumulative ME density$\in [0,1]$}

Given empirical cdf of ME density consisting of uniform patches, we just
draw 999 realizations of iid uniform in the values domain.  For example,
a random draw of uniform between 0 to 1 illustrated in the toy example is:

\texttt{p}=c(0.12, 0.83, 0.53, 0.59, 0.11)

I wish I had included an additional column for sorted uniform draws
\texttt{pp}=(0.11, 0.12, 0.53, 0.59, 0.83) in the published paper for clearer exposition. A complete
table is included toward the end of this document.


<<>>=
q=rep(0,n) #place holder for q
pp=sort(p) #sorted random draws
print(c("sorted random draws", pp))
@

In traditional iid bootstrap each $x_t$ has 1/T (if we have $T$ observations) chance of being included
in the resample.  Of course, some $x_t$ might repeat and
some may not be present in some individual realizations of the random draws from the uniform density.
Imposing similar requirement in meboot algorithm is called satisfying \textbf{mass preserving constraint} in the paper.


If the uniform random variable is defined over the range [a,b], then
the its mean is (a+b)/2.
The first interval is $(-11,6]$ with width 17 and mean $-5/2$.
Since we have T=5 observations in the toy example, each interval
should have 1/5 =0.2 probability of being included in the resample.

The sorted random draws are (0.11, 0.12,
0.53, 0.59, 0.83).  the numbers
having the pp values less than 0.2 (=1/n or 1/T) are two numbers: 0.11 and 0.12.
First we use the \texttt{approx} function to interpolate in $I_1$ interval to get the
corresponding two interpolated value qq=$-1.65, -0.8$.
These do \textbf{not} satisfy the mean preserving constraint.  They
need to be adjusted by adding the adjustment 7.5 (calculated above) to yield
5.85 and 6.7 as the two quantiles of the ME density associated with the first two
sorted random draws 0.11 and 0.12.

<<>>=
z[1] #first intermediate value
xmin #smallest
0.5*(z[1]+xmin) #average for the first interval
desintxb[1] #des=desired, int=interval, xb=xbar=mean
desintxb[1]-0.5*(z[1]+xmin)#adjustment for first interval
@


Thus the first interval adjustment 7.5 must be added so that
the mean equals the desired value so that eventually we satisfy
mean preserving constraint.

Now we turn to random draw(s) which happen to be less than
or equal to (1/T=1/5), which will come
from the first half open interval $I_1=(z_0, z_1]$.

\section*{R commands \texttt{approx} (linear interpolate) and \texttt{which}}

<<>>=
ref1 <- which(pp <= (1/n)) #how many are less than or equal to 1/5 if n=T=5
ref1
# approx. returns list of points which linearly interpolate given data points,
#first interval ref1

if(length(ref1)>0){
  qq <- approx(c(0,1/n), c(xmin,z[1]), pp[ref1])$y
  qq #interpolated values
adj= desintxb[1]-0.5*(z[1]+xmin)
print(c("qq=",qq,"adj=",adj))
  q[ref1] <- qq
  if(!reachbnd)  q[ref1] <- qq + desintxb[1]-0.5*(z[1]+xmin)
}

print(c("qq=",qq))
print(c("q",q))
@

\section*{Second, Third and Fourth intervals}

In our example sorted random draws are \texttt{pp}
=(0.11 0.12 0.53 0.59 0.83) and the relevant range limits are
(0, 0.2, 0.4, 0.6, 0.8, 1.0).  Clearly the first two \texttt{pp}
values are in the first interval 0 to 0.2 discussed above.



The second interval is $I_2=(6, 10]$ with width 4 and mean $8$ for sorted \texttt{pp}
in (1/T, 2/T]. None of our \texttt{pp}=(0.11 0.12 0.53 0.59 0.83) is
between 0.2 and 0.4.


Two \texttt{pp} values 0.53 and 0.59 are both in the range
0.4 to 0.6 from which there is no random draw.
The next  second range of probabilities is 0.2 to 0.4
and no draw in the range 06 to 0.8.

The third interval is $I_3=(10, 16]$ with width 6 and mean $13$ for sorted \texttt{pp}
in (2/T, 3/T].  After interpolation and adjustment, corresponding two quantile values are 13.9 and 15.7.

The fourth interval is $(16, 28]$ with width 12 and mean $22$ for sorted \texttt{pp}
in (3/T, 4/T]. No random draw here.


<<>>=
for(i1 in 1:(n-2)){
  ref2 <- which(pp > (i1/n))
print(c("ref2",ref2,"pp[ref2]", pp[ref2]))
  ref3 <- which(pp <= ((i1+1)/n))
print(c("ref3",ref3))
  ref23 <- intersect(ref2, ref3)
print(c("ref23",ref23,"sorted draw pp[ref23]=",pp[ref23]))

  if(length(ref23)>0){
    qq <- approx(c(i1/n,(i1+1)/n), c(z[i1], z[i1+1]), pp[ref23])$y
print(c("interpolated value qq=",qq))
adj= desintxb[-1][i1]-0.5*(z[i1]+z[i1+1])
print(c("qq=",qq,"adj=",adj))
    q[ref23] <- qq + desintxb[-1][i1]-0.5*(z[i1]+z[i1+1])
print(c("q",q))
  }
}
@

We find that the adjustment to interpolated value above is zero.

\section*{Interval called \texttt{ref4} if \texttt{pp} exactly equals 4/5  (n-1)/n is empty.}
<<>>=
ref4 <- which(pp == ((n-1)/n))
print(c("ref4", ref4))
if(length(ref4)>0)
  q[ref4] <- z[n-1]
q
@

\section*{Last interval,  fifth here, is called \texttt{ref5}}

Note that the last interval interpolated value \texttt{qq} is 31.45 and we adjust it by \texttt{adj=-7.5}
to yield 23.95.  Recall that the adjustment is designed to ensure the \textbf{ergodic
theorem} is numerically satisfied by the meboot algorithm.

<<>>=
ref5 <- which(pp > ((n-1)/n))
if(length(ref5)>0){
print(c("ref5",ref5,"pp[ref5]=",pp[ref5]))
  qq <- approx(c((n-1)/n,1), c(z[n-1],xmax), pp[ref5])$y
print(c("interpolated value qq in last interval",qq))
  q[ref5] <- qq   # this implicitly shifts xmax for algorithm
adj=desintxb[n]-0.5*(z[n-1]+xmax)
print(c("qq=",qq,"adj=",adj))

  if(!reachbnd)  q[ref5] <- qq + desintxb[n]-0.5*(z[n-1]+xmax)
}
@


\section*{Now wrap up the entire calculation of meboot for toy example}

Following code maps the \texttt{q} vector from values domain to the time domain
by using the sort function with the option \texttt{index.return=TRUE} noted above.

We set \texttt{q[ordxx\$ix]} as sorted \texttt{q} denoted by \texttt{qseq} in the values domain.


<<>>=
prel=q #preliminary quantile values
qseq <- sort(q)
print(c("sorted q",qseq))
q[ordxx$ix] <- qseq
print(c("after mapping to time domain",q))

print(q)
@


Now we produce the table.

<<>>=
Tim=1:5
xt=xx #notation xt for original data
xordstat=x #order stats
ord1=ordxx$ix #output of sort
intermed=c(z,xmax)  #these are zt
prel
qseq #sorted quantiles
final=q #final quantiles of ME density
cb=cbind(Tim,xt,xordstat,ord1,intermed,desintxb, p,pp,prel,final)
@


\newpage
<< eval=FALSE>>=
require(xtable)
options(xtable.comment = FALSE)
print(xtable(cb))
@

\begin{table}[ht]
\centering
\begin{tabular}{rrrrrrrrrrr}
  \hline
 & Tim & xt & xordstat & ord1 & intermed & desintxb & p & pp & prel & final \\
  \hline
1 & 1.00 & 4.00 & 4.00 & 1.00 & 6.00 & 5.00 & 0.12 & 0.11 & 5.85 & 5.85 \\
  2 & 2.00 & 12.00 & 8.00 & 5.00 & 10.00 & 8.00 & 0.83 & 0.12 & 6.70 & 13.90 \\
  3 & 3.00 & 36.00 & 12.00 & 2.00 & 16.00 & 13.00 & 0.53 & 0.53 & 13.90 & 23.95 \\
  4 & 4.00 & 20.00 & 20.00 & 4.00 & 28.00 & 22.00 & 0.59 & 0.59 & 15.70 & 15.70 \\
  5 & 5.00 & 8.00 & 36.00 & 3.00 & 51.00 & 32.00 & 0.11 & 0.83 & 23.95 & 6.70 \\
   \hline
\end{tabular}
\end{table}



\bigskip
I thank Fred Viole, director of the consulting firm OVVO Financial Systems specializing in analysis of stock market data for
vastly improving an earlier draft version of this vignette.

\end{document}