\documentclass[a4paper,10pt]{scrartcl} \usepackage[OT1]{fontenc} \usepackage{Sweave} %% additional packages \usepackage{natbib} \bibpunct{(}{)}{,}{a}{}{,} \usepackage{amsmath, amssymb} \usepackage{hyperref} \hypersetup{colorlinks, citecolor=blue, linkcolor=blue, urlcolor=blue} \usepackage[top=30mm, bottom=30mm, left=30mm, right=30mm]{geometry} %% additional commands \newcommand{\code}[1]{\texttt{#1}} \newcommand{\pkg}[1]{\mbox{\textbf{#1}}} \newcommand{\proglang}[1]{\mbox{\textsf{#1}}} %%\VignetteIndexEntry{Robust Pareto Tail Modeling for the Estimation of Indicators on Social Exclusion using the R Package laeken} %%\VignetteDepends{laeken} %%\VignetteKeywords{social exclusion, indicators, robust estimation, Pareto distribution} %%\VignettePackage{laeken} \begin{document} \title{Robust Pareto Tail Modeling for the Estimation of Indicators on Social Exclusion using the \proglang{R} Package \pkg{laeken}} %\author{ % Andreas Alfons\footnote{Vienna University of Technology, % \href{mailto:alfons@statistik.tuwien.ac.at}{alfons@statistik.tuwien.ac.at}}, % Matthias Templ\footnote{Vienna University of Technology \& Statistics Austria, % \href{mailto:templ@tuwien.ac.at}{templ@tuwien.ac.at}}, % Peter Filzmoser\footnote{Vienna University of Technology, % \href{mailto:p.filzmoser@tuwien.ac.at}{p.filzmoser@tuwien.ac.at}}, % Josef Holzer\footnote{Landesstatistik Steiermark, % \href{mailto:josef.holzer@stmk.gv.at}{josef.holzer@stmk.gv.at}} %} \author{ Andreas Alfons$^{1}$, Matthias Templ$^{2}$, Peter Filzmoser$^{3}$, Josef Holzer$^{4}$ } \date{} \maketitle \setlength{\footnotesep}{11pt} \footnotetext[1]{ \begin{tabular}[t]{l} Erasmus School of Economics, Erasmus University Rotterdam\\ E-mail: \href{mailto:alfons@ese.eur.nl}{alfons@ese.eur.nl} \end{tabular} } \footnotetext[2]{ \begin{tabular}[t]{l} Zurich University of Applied Sciences\\ E-mail: \href{mailto:matthias.templ@zhaw.ch}{matthias.templ@zhaw.ch} \end{tabular} } \footnotetext[3]{ \begin{tabular}[t]{l} Vienna University of Technology\\ E-mail: \href{mailto:p.filzmoser@tuwien.ac.at}{p.filzmoser@tuwien.ac.at} \end{tabular} } \footnotetext[4]{ \begin{tabular}[t]{l} Landesstatistik Steiermark\\ E-mail: \href{mailto:josef.holzer@stmk.gv.at}{josef.holzer@stmk.gv.at} \end{tabular} } % change R prompt <>= options(prompt="R> ") @ %% specify folder and name for Sweave graphics %\SweaveOpts{prefix.string=figures-pareto/fig} \paragraph{Abstract} In this vignette, robust semiparametric estimation of social exclusion indicators using the \proglang{R} package \pkg{laeken} is discussed. Special emphasis is thereby given to income inequality indicators, as the standard estimates for these indicators are highly influenced by outliers in the upper tail of the income distribution. This influence can be reduced by modeling the upper tail with a Pareto distribution in a robust manner. While the focus of the paper is to demonstrate the functionality of \pkg{laeken} beyond the standard estimation techniques, a brief mathematical description of the implemented procedures is given as well. % ------------ % introduction % ------------ \section{Introduction} From a robustness point of view, the standard estimators for some of the social exclusion indicators defined by \citet{EU-SILC04, EU-SILC09} are problematic. In particular the income inequality indicators \emph{quintile share ratio} (QSR) and \emph{Gini coefficient} suffer from a lack of robustness. Consider, e.g., the QSR, which is estimated as the ratio of estimated totals or means (see Section~\ref{sec:QSR} for an exact definition). It is well known that the classical estimates for totals or means have a breakdown point of 0, meaning that even a single outlier can distort the results to an arbitrary extent. In fact, the influence of a single observation in the upper tail of the income distribution on the estimation of the QSR is linear and therefore unbounded. For practical purposes, the standard QSR estimator thus cannot be recommended in many situations \citep[cf.][]{hulliger09a}. It is also important to note that the behavior of the Gini coefficient is similar to the behavior of the QSR. The data basis for the estimation of the social exclusion indicators according to \citet{EU-SILC04, EU-SILC09} is the \emph{European Union Statistics on Income and Living Conditions} (EU-SILC), which is an annual panel survey conducted in EU member states and other European countries. On the one hand, EU-SILC data typically contain a considerable amount of \emph{representative} outliers in the upper tail of the income distribution, i.e., correct observations that behave differently from the main part of the data, but that are not unique in the population and hence need to be considered for computing estimates of the indicators. On the other hand, EU-SILC data frequently contain some even more extreme \emph{nonrepresentative} outliers, i.e., observations that are either incorrect or can be considered unique in the population. Consequently, such nonrepresentative outliers need to be excluded from the estimation process or downweighted. As a remedy, the upper tail of the income distribution may be modeled with a \emph{Pareto distribution} in order to recalibrate the sample weights or use fitted income values for observations in the upper tail when estimating the indicators (see Section~\ref{sec:fit}). %This is highly applicable because the upper tail of the income distribution in %EU-SILC data virtually always contains a considerable amount of representative %outliers. Nevertheless, classical estimators for the parameters of the Pareto distribution are highly influenced by the nonrepresentative outliers themselves. Using robust methods reduces the influence on fitting the Pareto distribution to the representative outliers and therefore on the estimation of the indicators. Rather than evaluating these methods, the paper concentrates on showing how they can be applied in the statistical environment \proglang{R} \citep{RDev} with the add-on package \pkg{laeken} \citep{laeken}. The basic design of the package, as well as standard estimation of the social exclusion indicators is discussed in detail in vignette \code{laeken-standard} \citep{templ11a}. Furthermore, the general framework for variance estimation is illustrated in vignette \code{laeken-variance} \citep{templ11b}. Those documents can be viewed from within \proglang{R} with the following commands: <>= vignette("laeken-standard") vignette("laeken-variance") @ Morover, a general introduction to package \pkg{laeken} is published as \citet{alfons13b}. Throughout the paper, the example data from package \pkg{laeken} is used. The data set is called \code{eusilc} and consists of $14\,827$ observations from $6\,000$ households. In addition, it was synthetically generated from Austrian EU-SILC survey data from 2006 using the data simulation methodology proposed by \citet{alfons11c} and implemented in the \proglang{R} package \pkg{simPopulation} \citep{simPopulation}. More information on the example data can be found in vignette \code{laeken-standard} or in the corresponding \proglang{R} help page. <<>>= library("laeken") data("eusilc") @ The rest of the paper is organized as follows. Section~\ref{sec:laeken} gives a mathematical description of the Eurostat definitions of the social exclusion indicators QSR and Gini coefficient. In Section~\ref{sec:Pareto}, the Pareto distribution is briefly discussed. Section~\ref{sec:threshold} discusses a rule of thumb for estimating the threshold for the upper tail of the distribution, and illustrates graphical methods for exploring the data in order to find the threshold. Classical and robust estimators for the shape parameter of the Pareto distribution are described in Section~\ref{sec:shape}. How to use Pareto tail modeling to estimate the social exclusion indicators is then shown in Section~\ref{sec:fit}. Finally, Section~\ref{sec:concl} concludes. % ------------------- % selected indicators % ------------------- \section{Social exclusion indicators} \label{sec:laeken} This paper is focused on the inequality indicators \emph{quintile share ratio} (QSR) and \emph{Gini coefficient}, which are both highly influenced by outliers in the upper tail of the distribution. Note that for the estimation of the social exclusion indicators, each person in a household is assigned the same \emph{eqivalized disposable income}. See vignette \code{laeken-standard} \citep{templ11a} for the computation of the equivalized disposable income with the \proglang{R} package \pkg{laeken}. For the following definitions, let $\boldsymbol{x} := (x_{1}, \ldots, x_{n})'$ be the equivalized disposable income with $x_{1} \leq \ldots \leq x_{n}$ and let $\boldsymbol{w} := (w_{i}, \ldots, w_{n})'$ be the corresponding personal sample weights, where $n$ denotes the number of observations. \subsection{Quintile share ratio (QSR)} \label{sec:QSR} The income \emph{quintile share ratio} (QSR) is defined as the ratio of the sum of the equivalized disposable income received by the 20\% of the population with the highest equivalized disposable income to that received by the 20\% of the population with the lowest equivalized disposable income \citep{EU-SILC04, EU-SILC09}. For the estimation of the quintile share ratio from a sample, let $\hat{q}_{0.2}$ and $\hat{q}_{0.8}$ denote the weighted 20\% and 80\% quantiles, respectively. With $0 \leq p \leq 1$, these weighted quantiles are given by \begin{equation} \label{eq:wq} \hat{q}_{p} = \hat{q}_{p} (\boldsymbol{x}, \boldsymbol{w}) := \begin{cases} \frac{1}{2} (x_{j} + x_{j+1}), & \quad \text{if } \sum_{i=1}^{j} w_{i} = p \sum_{i=1}^{n} w_{i}, \\ x_{j+1}, & \quad \text{if } \sum_{i=1}^{j} w_{i} < p \sum_{i=1}^{n} w_{i} < \sum_{i=1}^{j+1} w_{i}. \end{cases} \end{equation} %See also vignette \code{laeken-standard} \citep{templ11a} for the computation %of these quantiles with package \pkg{laeken}. Using index sets \mbox{$I_{\leq \hat{q}_{0.2}} := \{ i \in \{ 1, \ldots, n \} : x_{i} \leq \hat{q}_{0.2} \}$} and \mbox{$I_{> \hat{q}_{0.8}} := \{ i \in \{ 1, \ldots, n \} : x_{i} > \hat{q}_{0.8} \}$}, the quintile share ratio is estimated by \begin{equation} \widehat{QSR} := \frac{\sum_{i \in I_{> \hat{q}_{0.8}}} w_{i} x_{i}}{\sum_{i \in I_{\leq \hat{q}_{0.2}}} w_{i} x_{i}}. \end{equation} With package \pkg{laeken}, the quintile share ratio can be estimated using the function \code{qsr()}. Sample weights can thereby be supplied via the \code{weights} argument. <<>>= qsr("eqIncome", weights = "rb050", data = eusilc) @ \subsection{Gini coefficient} \label{sec:Gini} The \emph{Gini coefficient} is defined as the relationship of cumulative shares of the population arranged according to the level of equivalized disposable income, to the cumulative share of the equivalized total disposable income received by them \citep{EU-SILC04, EU-SILC09}. For the estimation of the Gini coefficient from a sample, the sample weights need to be taken into account. In mathematical terms, the Gini coefficient is estimated by \begin{equation} \widehat{Gini} := 100 \left[ \frac{2 \sum_{i=1}^{n} \left( w_{i} x_{i} \sum_{j=1}^{i} w_{j} \right) - \sum_{i=1}^{n} w_{i}^{\phantom{i}2} x_{i}}{\left( \sum_{i=1}^{n} w_{i} \right) \sum_{i=1}^{n} \left(w_{i} x_{i} \right)} - 1 \right]. \end{equation} The function \code{gini()} is available in \pkg{laeken} to estimate the Gini coefficient. As before, sample weights can be specified with the \code{weights} argument. <<>>= gini("eqIncome", weights = "rb050", data = eusilc) @ % ------------------- % Pareto distribution % ------------------- \section{The Pareto distribution} \label{sec:Pareto} The \emph{Pareto distribution} is well studied in the literature and is defined in terms of its cumulative distribution function \begin{equation} \label{eq:CDF} F_{\theta}(x) = 1 - \left( \frac{x}{x_{0}} \right) ^{-\theta}, \qquad x \geq x_{0}, \end{equation} where $x_{0} > 0$ is the scale parameter and $\theta > 0$ is the shape parameter \citep{kleiber03}. Furthermore, its density function is given by \begin{equation} f_{\theta}(x) = \frac{\theta x_{0}^{\theta}}{x^{\theta + 1}}, \qquad x \geq x_{0}. \end{equation} Figure~\ref{fig:Pareto} visualizes the Pareto probability density function with scale parameter $x_{0} = 1$ and different values of the shape parameter $\theta$. Clearly, the Pareto distribution is a highly right-skewed distribution with a heavy tail. It is therefore reasonable to assume that a random variable following a Pareto distribution contains extreme values. The effect of changing the shape parameter $\theta$ is visible in the probability mass at the scale parameter $x_{0}$: the higher $\theta$, the higher the probability mass at $x_{0}$. <>= x <- seq(1, 6, length.out=1000) dpareto <- function(x, x0 = 1, theta = 1) theta*x0^theta / x^(theta+1) y1 <- dpareto(x, theta=1) y2 <- dpareto(x, theta=2) y3 <- dpareto(x, theta=3) @ \begin{figure} \begin{center} <>= par(mar = c(4, 4, 0.5, 0.5) + 0.1) plot(x, y3, type = "l", lty = 3, ylab = "f(x)", xlim = c(0.75, 6), panel.first = { abline(h = 0, col = grey(0.75)) abline(v = 1, col = grey(0.75)) }) lines(x, y2, lty = 2) lines(x, y1, lty = 1) leg <- expression(paste(theta, " = 1"), paste(theta, " = 2"), paste(theta, " = 3")) legend("topright", legend = leg, lty = 1:3) @ \caption{Pareto probability density functions with parameters $x_{0} = 1$ and $\theta = 1, 2, 3$.} \label{fig:Pareto} \end{center} \end{figure} In Pareto tail modeling, the cumulative distribution function on the whole range of $x$ is modeled as \begin{equation} \label{eq:tail} F(x) = \left\{ \begin{array}{ll} G(x), & \quad \text{if } x \leq x_{0}, \\ G(x_{0}) + (1 - G(x_{0})) F_{\theta}(x), & \quad \text{if } x > x_{0}, \end{array} \right. \end{equation} where $G$ is an unknown distribution function \citep{dupuis06}. Let $n$ be the number of observations and let $\boldsymbol{x} = (x_{1}, \ldots, x_{n})'$ denote the observed values with $x_{1} \leq \ldots \leq x_{n}$. In addition, let $k$ be the number of observations to be used for tail modeling. In this scenario, the threshold $x_{0}$ is estimated by % Let $k$ be the number of observations to be used for tail modeling and let % $x_{(1)} \leq \ldots \leq x_{(n)}$, denote the sorted observations. In this % scenario, the threshold $x_{0}$ is estimated by \begin{equation} \hat{x}_{0} := x_{n-k}. \end{equation} If an estimate $\hat{x}_{0}$ for the scale parameter of the Pareto distribution has been obtained, $k$ is given by the number of observations larger than $\hat{x}_{0}$. Thus estimating $x_{0}$ and $k$ directly corresponds with each other. In the remainder of this package vignette, the equivalized disposable income of the EU-SILC example data is of main interest. Consequently, the Pareto distribution will be modeled at the household level rather than the individual level. Moreover, the focus of this vignette is on robust estimation of the social exclusion indicators. Hence the equivalized disposable income of the household with the largest income is replaced by a large outlier. <<>>= hID <- eusilc$db030[which.max(eusilc$eqIncome)] eusilc[eusilc$db030 == hID, "eqIncome"] <- 10000000 @ Since the aim is to model a Pareto distribution at the household level, the following command creates a data set that contains only the equivalized disposable income and the sample weights on the household level. This data set will be used in Sections~\ref{sec:threshold} and~\ref{sec:shape} to estimate the parameters of the Pareto distribution. <<>>= eusilcH <- eusilc[!duplicated(eusilc$db030), c("eqIncome", "db090")] @ % --------- % threshold % --------- \section{Finding the threshold} \label{sec:threshold} The aim of the methods presented in this sections is to find the threshold $x_{0}$ for modeling the Pareto distribution. Several methods for the estimation of the threshold $x_{0}$ or the number of observations $k$ in the tail have been proposed in the literature, but those proposals typically do not consider sample weights. \citet{beirlant96a, beirlant96b} developed a procedure that analytically determines the optimal choice of $k$ for the Hill estimator of the shape parameter \citep[see also Section~\ref{sec:Hill} of this paper]{hill75} by minimizing the asymptotic mean squared error (AMSE). In package \pkg{laeken}, this approach is implemented in the function \code{minAMSE()}. However, the procedure is designed for the non-robust Hill estimator and is therefore not further discussed in this paper. Furthermore, \citet{danielsson01} proposed a bootstrap method to find the optimal $k$ for the Hill estimator with respect to the AMSE, which has less analytical requirements than the approach by \citet{beirlant96a, beirlant96b}. Please note that this method is not robust either and that it is currently not available in package \pkg{laeken}. A robust prediction error criterion for choosing the number of observations $k$ in the tail and estimating the shape parameter $\theta$ was developed by \citet{dupuis06}. Nevertheless, our implementation of this robust criterion was unstable and is therefore not included in \pkg{laeken}. In any case, \citet{holzer09} concludes that graphical methods for finding the threshold outperform those analytical approaches in the case of EU-SILC data. While this section is thus focused graphical methods, a simple rule of thumb designed specifically for the equivalized disposable income in EU-SILC data is described in the following as well. \subsection{Van Kerm's rule of thumb} \label{sec:vanKerm} \citet{vankerm07} presented a formula that is more of a rule of thumb for the threshold of the equivalized disposable income in EU-SILC data. Is is given by \begin{equation} \hat{x}_{0} := \min(\max(2.5\bar{x}, q_{0.98}), q_{0.97}), \end{equation} where $\bar{x}$ is the weighted mean, and $q_{0.98}$ and $q_{0.97}$ are weighted quantiles as defined in Equation~(\ref{eq:wq}). In package \pkg{laeken}, the function \code{paretoScale()} provides functionality for computing the threshold with van Kerm's rule of thumb. The argument \code{w} is available to supply sample weights. %In the example below, the household IDs are supplied via the argument %\code{groups} to estimate the threshold on the houshold level rather than the %personal level. %<<>>= %paretoScale(eusilc$eqIncome, eusilc$db090, groups = eusilc$db030) %@ <<>>= ts <- paretoScale(eusilcH$eqIncome, w = eusilcH$db090) ts @ It should be noted that the function returns an object of class \code{"paretoScale"}, which consists of a component \code{x0} for the threshold (scale parameter) and a component \code{k} for the number of observations in the tail of the distribution, i.e., that are larger than the threshold. \subsection{Pareto quantile plot} The \emph{Pareto quantile plot} is a graphical method for inspecting the parameters of a Pareto distribution. For the case without sample weights, it is described in detail in \citet{beirlant96a}. If the Pareto model holds, there exists a linear relationship between the lograrithms of the observed values and the quantiles of the standard exponential distribution, since the logarithm of a Pareto distributed random variable follows an exponential distribution. Hence the logarithms of the observed values, $\log (x_{i})$, $i = 1, \ldots, n$, are plotted against the theoretical quantiles. In the case without sample weights, the theoretical quantiles of the standard exponential distribution are given by \begin{equation} \label{eq:quantiles} -\log \left( 1 - \frac{i}{n+1} \right), \qquad i = 1, \ldots, n, \end{equation} i.e., by dividing the range into $n + 1$ equally sized subsets and using the resulting $n$ inner gridpoints as probabilities for the quantiles. If the data contain sample weights, the range of the exponential distribution needs to be divided according to the weights of the $n$ observations. The Pareto quantile plot is thus generalized by using the theoretical quantiles \begin{equation} -\log \left( 1 - \frac{\sum_{j=1}^{i} w_{j}}{\sum_{j=1}^{n} w_{j}} \frac{n}{n+1} \right), \qquad i = 1, \ldots, n, \end{equation} where the correction factor $\frac{n}{n+1}$ ensures that the quantiles reduce to (\ref{eq:quantiles}) if all sample weights are equal. If the tail of the data follows a Pareto distribution, those observations form almost a straight line. The leftmost point of a fitted line can thus be used as an estimate of the threshold $x_{0}$, the scale parameter. All values starting from the point after the threshold may be modeled by a Pareto distribution, but this point cannot be determined exactly. Furthermore, the slope of the fitted line is in turn an estimate of $\frac{1}{\theta}$, the reciprocal of the shape parameter. Figure~\ref{fig:ParetoQuantile} displays the Pareto quantile plot for the example data \code{eusilc} on the household level with the largest observation replaced by an outlier. The plot is generated using the function \code{paretoQPlot()}, which allows to supply sample weights via the argument \code{w}. In addition, the threshold can be selected interactively by clicking on a data point. Information on the selected threshold is then printed on the \proglang{R} console. When the interactive selection is terminated, which is typically done by a secondary mouse click, the selected threshold is returned as an object of class \code{"paretoScale"}. Another advantage of the Pareto quantile plot is also illustrated in Figure~\ref{fig:ParetoQuantile}. Nonrepresentative outliers such as the large income introduced into the example data in Section~\ref{sec:Pareto}, i.e., extreme observations in the upper tail that deviate from the Pareto model, are clearly visible. \begin{figure} \begin{center} \setkeys{Gin}{width=.75\textwidth} <>= paretoQPlot(eusilcH$eqIncome, w = eusilcH$db090) @ \caption{Pareto Quantile plot for the example data \code{eusilc} on the household level with the largest observation replaced by an outlier.} \label{fig:ParetoQuantile} \end{center} \end{figure} \subsection{Mean excess plot} The \emph{mean excess plot} is another graphical method for inspecting the threshold for Pareto tail modeling, but it does not provide information on the shape parameter. It is based on the excess function \begin{equation} \label{eq:excess} e(x_{0}) := \mathbb{E}(x - x_{0}|x > x_{0}), \qquad x_{0} \geq 0. \end{equation} A detailed description for the case without sample weights can be found in \citet{borkovec00}. For the following definition of the mean excess plot, keep in mind that the observations are sorted such that $x_{1} \leq \ldots \leq x_{n}$. For each observation $x_{i}$, $i = 1, \ldots, \lfloor n-\sqrt{n} \rfloor$, the empirical excess function $e_{n}$ is computed. In the case without sample weights, the expectation in Equation~(\ref{eq:excess}) is replaced by the arithmetic mean, and the empirical excess function is given by \begin{equation} e_{n}(x_{i}) := \frac{1}{n-i} \sum_{j=i+1}^{n} (x_{j} - x_{i}), \qquad i = 1, \ldots, \lfloor n-\sqrt{n} \rfloor. \end{equation} The values of the empirical excess function $e_{n}(x_{i})$ are then plotted against the corresponding $x_{i}$, $i = 1, \ldots, \lfloor n-\sqrt{n} \rfloor$. If sample weights are available in the data, the mean excess plot is simply generalized by using the weighted mean for the empirical excess function: \begin{equation} e_{n}(x_{i}) := \frac{1}{\sum_{j=i+1}^{n} w_{j}} \sum_{j=i+1}^{n} w_{j} (x_{j} - x_{i}), \qquad i = 1, \ldots, \lfloor n-\sqrt{n} \rfloor. \end{equation} If the tail of the data follows a Pareto distribution, those observations show a positive linear trend. The leftmost point of a fitted line can thus be used as an estimate of the threshold $x_{0}$, the scale parameter. As for the Pareto quantile plot, a disadvantage of the mean excess plot is that the threshold cannot be determined exactly. \begin{figure} \begin{center} \setkeys{Gin}{width=.75\textwidth} <>= meanExcessPlot(eusilcH$eqIncome, w = eusilcH$db090) @ \caption{Mean excess plot for the example data \code{eusilc} on the household level with the largest observation replaced by an outlier.} \label{fig:meanExcess} \end{center} \end{figure} Figure~\ref{fig:meanExcess} shows the mean excess plot for the example data \code{eusilc} on the household level with the largest observation replaced by an outlier. The function \code{meanExcessPlot()} is thereby used to produce the plot. Sample weights can be supplied via the argument \code{w}. Interactive selection of the threshold works just like for the Pareto quantile plot. Again, the selected threshold is returned as an object of class \code{"paretoScale"}. % --------------- % shape parameter % --------------- \section{Estimation of the shape parameter} \label{sec:shape} This section is focused on methods for estimating the shape parameter $\theta$ once the threshold $x_0$ is fixed. It should be noted that none of the original proposals takes sample weights into account. Most estimators presented in the following were therefore adjusted for the case of sample weights. \subsection{Hill estimator} \label{sec:Hill} The maximum likelihood estimator for the shape parameter of the Pareto distribution was introduced by \citet{hill75} and is referred to as the \emph{Hill} estimator. If the data do not contain sample weights, it is given by \begin{equation} \label{eq:Hill} \hat{\theta}_{\mathrm{Hill}} = \frac{k}{\sum_{i = 1}^{k} \log x_{n-k+i} - k \log x_{n-k}}. \end{equation} In the case of sample weights, the \emph{weighted Hill} (wHill) estimator is given by generalizing Equation~(\ref{eq:Hill}) to \begin{equation} \label{eq:wHill} \hat{\theta}_{\mathrm{wHill}} = \frac{\sum_{i = 1}^{k} w_{n-k+i}}{\sum_{i = 1}^{k} w_{n-k+i} \left( \log x_{n-k+i} - \log x_{n-k} \right)} . \end{equation} Package \pkg{laeken} provides the function \code{thetaHill()} to compute the Hill estimator. It requires to specify either the number of observations in the tail via the argument \code{k}, or the threshold via the argument \code{x0}. Furthermore, the argument \code{w} can be used to supply sample weights. In the following example, the shape parameter is estimated using the largest observations (first command) and the threshold (second command) as computed with van Kerm's rule of thumb in Section~\ref{sec:vanKerm}. <<>>= thetaHill(eusilcH$eqIncome, k = ts$k, w = eusilcH$db090) thetaHill(eusilcH$eqIncome, x0 = ts$x0, w = eusilcH$db090) @ \subsection{Weighted maximum likelihood estimator} The \emph{weighted maximum likelihood} (WML) estimator \citep{dupuis02, dupuis06} falls into the class of M-estimators and is given by the solution $\hat{\theta}$ of \begin{equation} \sum_{i = 1}^{k} \mathrm{\Psi}(x_{n-k+i}, \theta) = 0 \end{equation} with \begin{equation} \mathrm{\Psi}(x, \theta) := u(x, \theta) \frac{\partial}{\partial \theta} \log f(x, \theta) = u(x, \theta) \left( \frac{1}{\theta} - \log \frac{x}{x_{0}} \right), \end{equation} where $u(x, \theta)$ is a weight function with values in $[0,1]$. In the implementation in package \pkg{laeken}, a Huber type weight function is used by default, as proposed by \citet{dupuis06}. Let the logarithms of the relative excesses be denoted by \begin{equation} z_{i} := \log \left( \frac{x_{n-k+i}}{x_{n-k}} \right), \qquad i = 1, \ldots, k. \end{equation} In the Pareto model, these can be predicted by \begin{equation} \hat{z}_{i} := -\frac{1}{\theta} \log \left( \frac{k+1-i}{k+1} \right), \qquad i = 1, \ldots, k. \end{equation} The variance of $z_{i}$ is given by \begin{equation} \sigma_{i}^{\phantom{i}2} := \sum_{j = 1}^{i} \frac{1}{\theta^{2} (k-i+j)^{2}}, \qquad i = 1, \ldots, k. \end{equation} Using the standardized residuals \begin{equation} r_{i} := \frac{z_{i} - \hat{z}_{i}}{\sigma_{i}}, \end{equation} the Huber type weight function with tuning constant $c$ is defined as \begin{equation} u(x_{n-k+i}, \theta) := \left\{ \begin{array}{cl} 1, & \quad \text{if } |r_{i}| \leq c, \\ \frac{c}{|r_{i}|}, & \quad \text{if } |r_{i}| > c. \end{array} \right. \end{equation} For this choice of weight function, the bias of $\hat{\theta}$ is approximated by \begin{equation} \hat{B}(\hat{\theta}) = - \frac{\sum_{i=1}^{k} \left( u_{i} \frac{\partial}{\partial \theta} \log f_{i} \right) \vert_{\hat{\theta}} \left( F_{\hat{\theta}}(x_{n-k+i}) - F_{\hat{\theta}}(x_{n-k+i-1}) \right)}{\sum_{i=1}^{k} \left( \frac{\partial}{\partial \theta} u_{i} \frac{\partial}{\partial \theta} \log f_{i} + u_{i} \frac{\partial^{2}}{\partial \theta^{2}} \log f_{i} \right) \vert_{\hat{\theta}} \left( F_{\hat{\theta}}(x_{n-k+i}) - F_{\hat{\theta}}(x_{n-k+i-1}) \right)}, \end{equation} where $u_{i} := u(x_{n-k+i}, \theta)$ and $f_{i} := f(x_{n-k+i}, \theta)$. This term is used to obtain a bias-corrected estimator \begin{equation} \tilde{\theta} := \hat{\theta} - \hat{B}(\hat{\theta}). \end{equation} For details and proofs of the above statements, as well as for information on a probability-based weight function $u(x, \theta)$, the reader is referred to \citet{dupuis02} and \citet{dupuis06}. However, note the WML estimator does not consider sample weights. An adjustment of the estimator to take sample weights into account is currently not available due to its complexity. For sampling designs that lead to equal sample weights, the WML estimator may still be useful, though. The function \code{thetaWML()} is available in \pkg{laeken} to compute the WML estimator. Again, either the argument \code{k} or \code{x0} needs to be used to specify the number of observations in the tail or the threshold. Since the sample weights in the example data are not equal, the following example is only included to demonstrate the use of the function. <<>>= thetaWML(eusilcH$eqIncome, k = ts$k) thetaWML(eusilcH$eqIncome, x0 = ts$x0) @ \subsection{Integrated squared error estimator} For the \emph{integrated squared error} (ISE) estimator \citep{vandewalle07}, the Pareto distribution is modeled in terms of the relative excesses \begin{equation} y_{i} := \frac{x_{n-k+i}}{x_{n-k}}, \qquad i = 1, \ldots, k. \end{equation} The density function of the Pareto distribution for the relative excesses is approximated by \begin{equation} f_{\theta}(y) = \theta y^{-(1+\theta)}. \end{equation} The ISE estimator is then given by minimizing the integrated squared error criterion \citep{terrell90}: \begin{equation} \hat{\theta} = \arg \min_{\theta} \left[ \int f_{\theta}^{2}(y) dy - 2 \mathbb{E}(f_{\theta}(Y)) \right] . \end{equation} If there are no sample weights in the data, the mean is used as an unbiased estimator of $\mathbb{E}(f_{\theta}(Y))$ in order to obtain the ISE estimate \begin{equation} \label{eq:ISE} \hat{\theta}_{\mathrm{ISE}} = \arg \min_{\theta} \left[ \int f_{\theta}^{2}(y) dy - \frac{2}{k} \sum_{i=1}^{k} f_{\theta}(y_{i}) \right] . \end{equation} See \citet{vandewalle07} for more information on the ISE estimator for the case without sample weights. If sample weights are available in the data, the mean in Equation~(\ref{eq:ISE}) is simply replaced by a weighted mean to obtain the \emph{weighted integrated squared error} (wISE) estimator: \begin{equation} \label{eq:wISE} \hat{\theta}_{\mathrm{wISE}} = \arg \min_{\theta} \left[ \int f_{\theta}^{2}(y) dy - \frac{2}{\sum_{i=1}^{k} w_{n-k+i}} \sum_{i=1}^{k} w_{n-k+i} f_{\theta}(y_{i}) \right] . \end{equation} With package \pkg{laeken}, the ISE estimator can be computed using the function \code{thetaISE()}. The arguments \code{k} and \code{x0} are available to specify either the number of observations in the tail or the threshold, and sample weights can be supplied via the argument \code{w}. <<>>= thetaISE(eusilcH$eqIncome, k = ts$k, w = eusilcH$db090) thetaISE(eusilcH$eqIncome, x0 = ts$x0, w = eusilcH$db090) @ \subsection{Partial density component estimator} For the \emph{partial density component} (PDC) estimator \cite{vandewalle07} minimizes the integrated squared error criterion using an incomplete density mixture model $u f_{\theta}$. If the data do not contain sample weights, the PDC estimator in is thus given by \begin{equation} \label{eq:PDC} \hat{\theta}_{\mathrm{PDC}} = \arg \min_{\theta} \left[ u^{2} \int f_{\theta}^{2}(y) dy - \frac{2 u}{k} \sum_{i = 1}^{k} f_{\theta}(y_{i}) \right]. \end{equation} The parameter $u$ can be interpreted as a measure of the uncontaminated part of the sample and is estimated by \begin{equation} \label{eq:u} \hat{u} = \frac{\frac{1}{k} \sum_{i = 1}^{k} f_{\hat{\theta}}(y_{i})}{\int f_{\hat{\theta}}^{2}(y) dy}. \end{equation} See \cite{vandewalle07} and references therein for more information on the PDC estimator for the case without sample weights. Taking sample weights into account, the \emph{weighted partial density component} (wPDC) estimator is obtained by generalizing Equations~(\ref{eq:PDC}) and~(\ref{eq:u}) to \begin{align} \label{eq:wPDC} \hat{\theta}_{\mathrm{wPDC}} =& \arg \min_{\theta} \left[ u^{2} \int f_{\theta}^{2}(y) dy - \frac{2u}{\sum_{i=1}^{k} w_{n-k+i}} \sum_{i = 1}^{k} w_{n-k+i} f_{\theta}(y_{i}) \right] , \\ \hat{u} =& \frac{\frac{1}{\sum_{i=1}^{k} w_{n-k+i}} \sum_{i = 1}^{k} w_{n-k+i} f_{\hat{\theta}}(y_{i})}{\int f_{\hat{\theta}}^{2}(y) dy} . \end{align} The function \code{thetaPDC()} is implemented in package \pkg{laeken} to compute the PDC estimator. As for the other estimators, it is necessary to specify either the number of observations in the tail via the argument \code{k}, or the threshold via the argument \code{x0}. Sample weights can be supplied using the argument \code{w}. <<>>= thetaPDC(eusilcH$eqIncome, k = ts$k, w = eusilcH$db090) thetaPDC(eusilcH$eqIncome, x0 = ts$x0, w = eusilcH$db090) @ % ---------------------------- % estimation of the indicators % ---------------------------- \section{Estimation of the indicators using Pareto tail modeling} \label{sec:fit} Three approaches based on Pareto tail modeling for reducing the influence of outliers on the social exclusion indicators are implemented in the \proglang{R} package \pkg{laeken}: \begin{description} \item[Calibration for nonrepresentative outliers (CN):] Values larger than a certain quantile of the fitted distribution are declared as nonrepresentative outliers. Since these are considered to be unique to the population data, the sample weights of the corresponding observations are set to $1$ and the weights of the remaining observations are adjusted accordingly by calibration. \item[Replacement of nonrepresentative outliers (RN):] Values larger than a certain quantile of the fitted distribution are declared as nonrepresentative outliers. Only these nonrepresentative outliers are replaced by values drawn from the fitted distribution, thereby preserving the order of the original values. \item[Replacement of the tail (RT):] All values above the threshold are replaced by values drawn from the fitted distribution. The order of the original values is preserved. \end{description} An evaluation of the RT approach by means of a simulation study can be found in \citet{alfons10b}. Keep in mind that the largest observation in the example data \code{eusilc} was replaced by a large outlier in Section~\ref{sec:Pareto}. With the following command, the Gini coefficient is estimated according to the Eurostat definition to show that even a single outlier can completely distort the results for the standard estimation (see Section~\ref{sec:Gini} for the original value). <<>>= gini("eqIncome", weights = "rb050", data = eusilc) @ For Pareto tail modeling, the function \code{paretoTail()} is implemented in \pkg{laeken}. It returns an object of class \code{"paretoTail"}, which contains all the necessary information for further analysis using the three approaches described above. Note that the household IDs are supplied via the argument \code{groups} such that the Pareto distribution is fitted on the household level rather than the individual level. In addition, the PDC is used by default to estimate the shape parameter. Other estimators can be specified via the \code{method} argument. <<>>= fit <- paretoTail(eusilc$eqIncome, k = ts$k, w = eusilc$db090, groups = eusilc$db030) @ The function \code{reweightOut()} is available for semiparametric estimation with the CN approach. It returns a vector of the recalibrated weights. In this example, regional information is used as auxiliary variables for calibration. The function \code{calibVars()} thereby transforms a factor into a matrix of binary variables, as required by the calibration function \code{calibWeights()}, which is called internally. These recalibrated weights are then simply used to estimate the Gini coefficient with function \code{gini()}. <<>>= w <- reweightOut(fit, calibVars(eusilc$db040)) gini(eusilc$eqIncome, w) @ For the RN approach, the function \code{replaceOut()} is implemented. Since values are drawn from the fitted distribution to replace the observations flagged as outliers, the seed of the random number generator is set first for reproducibility of the results. The returned vector of incomes is then supplied to \code{gini()} to estimate the Gini coefficient. <<>>= set.seed(1234) eqIncome <- replaceOut(fit) gini(eqIncome, weights = eusilc$rb050) @ Similarly, the function \code{replaceTail()} is available for the RT approach. Again, the seed of the random number generator is set beforehand. <<>>= set.seed(1234) eqIncome <- replaceTail(fit) gini(eqIncome, weights = eusilc$rb050) @ It should be noted that \code{replaceTail()} can also be used for the RN approach by setting the argument \code{all} to \code{FALSE}. In fact, \code{replaceOut(x, ...)} is a simple wrapper for \code{replaceTail(x, all = FALSE, ...)}. In any case, the estimates for the semiparametric approaches based on Pareto tail modeling are very close to the original value before the outlier has been introduced (see Section~\ref{sec:Gini}), whereas the standard estimation is corrupted by the outlier. Furthermore, the estimation of other indicators such as the quintile share ratio (see Section~\ref{sec:QSR}) using the semiparametric approaches is straightforward and hence not shown here. % ----------- % conclusions % ----------- \section{Conclusions} \label{sec:concl} This vignette shows the functionality of package \pkg{laeken} for robust semiparametric estimation of social exclusion indicators based on Pareto tail modeling. Most notably, it demonstrates that the functions are easy to use and that the implementation follows an object-oriented design. While the focus of the paper lies on the use of the package, a mathematical description of the methods is given as well. Furthermore, it is shown that the standard estimation of the inequality indicators can be corrupted by a single outlier, thus underlining the need for robust alternatives. Three approaches for robust semiparametric estimation based on Pareto tail modeling are thereby implemented such that the corresponding functions share a common interface for ease of use. % --------------- % acknowledgments % --------------- \section*{Acknowledgments} This work was partly funded by the European Union (represented by the European Commission) within the 7$^{\mathrm{th}}$ framework programme for research (Theme~8, Socio-Economic Sciences and Humanities, Project AMELI (Advanced Methodology for European Laeken Indicators), Grant Agreement No. 217322). Visit \url{http://ameli.surveystatistics.net} for more information on the project. % ------------ % bibliography % ------------ \bibliographystyle{plainnat} \bibliography{laeken} \end{document}