\documentclass[a4paper,10pt]{scrartcl} \usepackage[OT1]{fontenc} \usepackage{Sweave} %% additional packages \usepackage{natbib} \bibpunct{(}{)}{,}{a}{}{,} \usepackage{amsmath, amssymb} \usepackage{hyperref} \hypersetup{colorlinks, citecolor=blue, linkcolor=blue, urlcolor=blue} \usepackage[top=30mm, bottom=30mm, left=30mm, right=30mm]{geometry} \usepackage{enumerate} \usepackage{engord} %% additional commands \newcommand{\code}[1]{\texttt{#1}} \newcommand{\pkg}[1]{\mbox{\textbf{#1}}} \newcommand{\proglang}[1]{\mbox{\textsf{#1}}} %%\VignetteIndexEntry{Standard Methods for Point Estimation of Indicators on Social Exclusion and Poverty using the R Package laeken} %%\VignetteDepends{laeken} %%\VignetteKeywords{social exclusion, poverty, indicators, point estimation} %%\VignettePackage{laeken} \begin{document} \title{Standard Methods for Point Estimation of Indicators on Social Exclusion and Poverty using the \proglang{R} Package \pkg{laeken}} \author{Matthias Templ$^{1}$, Andreas Alfons$^{2}$} \date{} \maketitle \setlength{\footnotesep}{11pt} \footnotetext[1]{ \begin{tabular}[t]{l} Zurich University of Applied Sciences\\ E-mail: \href{mailto:matthias.templ@zhaw.ch}{matthias.templ@zhaw.ch} \end{tabular} } \footnotetext[2]{ \begin{tabular}[t]{l} Erasmus School of Economics, Erasmus University Rotterdam\\ E-mail: \href{mailto:alfons@ese.eur.nl}{alfons@ese.eur.nl} \end{tabular} } % change R prompt <>= options(prompt="R> ") @ \paragraph{Abstract} This vignette demonstrates the use of the \proglang{R} package \pkg{laeken} for standard point estimation of indicators on social exclusion and poverty according to the definitions by Eurostat. The package contains synthetically generated data for the European Union Statistics on Income and Living Conditions (EU-SILC), which is used in the code examples throughout the paper. Furthermore, the basic object-oriented design of the package is discussed. Even though the paper is focused on showing the functionality of package \pkg{laeken}, it also provides a brief mathematical description of the implemented indicators. % ------------ % introduction % ------------ \section{Introduction} The \emph{European Union Statistics on Income and Living Conditions} (EU-SILC) is a panel survey conducted in EU member states and other European countries, and serves as basis for measuring risk-of-poverty and social cohesion in Europe. %and for evaluating the Lisbon~2010 strategy and for monitoring the %Europe~2020 goals of the European Union. A short overview of the $11$ most important indicators on social exclusion and poverty according to \cite{EU-SILC04} %and \cite{EU-SILC09} is given in the following. \paragraph{Primary indicators} \begin{enumerate} \item At-risk-of-poverty rate (after social transfers) \begin{enumerate}[a.] \item At-risk-of-poverty rate by age and gender \item At-risk-of-poverty rate by most frequent activity status and gender \item At-risk-of-poverty rate by household type \item At-risk-of-poverty rate by accommodation tenure status \item At-risk-of-poverty rate by work intensity of the household \item At-risk-of-poverty threshold (illustrative values) \end{enumerate} \item Inequality of income distribution: S80/S20 income quintile share ratio \item At-persistent-risk-of-poverty rate by age and gender ($60\%$ median) \item Relative median at-risk-of-poverty gap, by age and gender \newcounter{enumi_last} \setcounter{enumi_last}{\value{enumi}} \end{enumerate} \paragraph{Secondary indicators} \begin{enumerate} \setcounter{enumi}{\value{enumi_last}} \item Dispersion around the at-risk-of-poverty threshold \item At-risk-of-poverty rate anchored at a moment in time \item At-risk-of-poverty rate before social transfers by age and gender \item Inequality of income distribution: Gini coefficient \item At-persistent-risk-of-poverty rate, by age and gender ($50\%$ median) \setcounter{enumi_last}{\value{enumi}} \end{enumerate} \paragraph{Other indicators} \begin{enumerate} \setcounter{enumi}{\value{enumi_last}} \item Mean equivalized disposable income \item The gender pay gap \end{enumerate} \paragraph{} Note that especially the Gini coefficient is very well studied due to its importance in many fields of research. The add-on package \pkg{laeken} \citep{laeken} aims is to bring functionality for the estimation of indicators on social exclusion and poverty to the statistical environment \proglang{R} \citep{RDev}. In the examples in this vignette, standard estimates for the most important indicators are computed according to the Eurostat definitions \citep{EU-SILC04, EU-SILC09}. More sophisticated methods that are less influenced by outliers are described in vignette \code{laeken-pareto} \citep{alfons11a}, while the basic framework for variance estimation is discussed in vignette \code{laeken-variance} \citep{templ11b}. Those documents can be viewed from within \proglang{R} with the following commands: <>= vignette("laeken-pareto") vignette("laeken-variance") @ Morover, a general introduction to package \pkg{laeken} is published as \citet{alfons13b}. The example data set of package \pkg{laeken}, which is called \code{eusilc} and consists of $14\,827$ observations from $6\,000$ households, is used throughout the paper. It was synthetically generated from Austrian EU-SILC survey data from 2006 using the data simulation methodology proposed by \citet{alfons11c} and implemented in the \proglang{R} package \pkg{simPopulation} \citep{simPopulation}. The first three observations of the synthetic data set \code{eusilc} are printed below. <<>>= library("laeken") data("eusilc") head(eusilc, 3) @ Only a few of the large number of variables in the original survey are included in the example data set. The variable names are rather cryptic codes, but these are the standardized names used by the statistical agencies. Furthermore, the variables \code{hsize} (household size), \code{age}, \code{eqSS} (equivalized household size) and \code{eqIncome} (equivalized disposable income) are not included in the standardized format of EU-SILC data, but have been derived from other variables for convenience. Moreover, some very sparse income components were not included in the the generation of this synthetic data set. Thus the equivalized household income is computed from the available income components. For the remainder of the paper, the variable \code{eqIncome} (equivalized disposable income) is of main interest. Other variables are in some cases used to break down the data in order to evaluate the indicators on the resulting subsets. It is important to note that EU-SILC data are in practice conducted through complex sampling designs with different inclusion probabilities for the observations in the population, which results in different weights for the observations in the sample. Furthermore, calibration is typically performed for non-response adjustment of these initial design weights. Therefore, the sample weights have to be considered for all estimates, otherwise biased results are obtained. The rest of the paper is organized as follows. Section \ref{sec:design} briefly illustrates the basic object-oriented design of the package. The calculation of the equivalized household size and the equivalized disposable income is then described in Section \ref{sec:income}. Afterwards, Section~\ref{sec:w} introduces the Eurostat definitions of the weighted median and weighted quantiles, which are required for the estimation of some of the indicators. In Section~\ref{sec:ind}, a mathematical description of the most important indicators on social exclusion and poverty is given and their estimation with package \pkg{laeken} is demonstrated. Section~\ref{sec:sub} discusses a useful subsetting method, and Section~\ref{sec:concl} concludes. % ------------ % basic design % ------------ \section{Basic design of the package} \label{sec:design} The implementation of the package follows an object-oriented design using \proglang{S3} classes \citep{chambers92}. Its aim is to provide functionality for point and variance estimation of Laeken indicators with a single command, even for different years and domains. Currently, the following indicators are available in the \proglang{R} package \pkg{laeken}: \begin{itemize} \item \emph{At-risk-of-poverty rate}: function \code{arpr()} \item \emph{Quintile share ratio}: function \code{qsr()} \item \emph{Relative median at-risk-of-poverty gap}: function \code{rmpg()} \item \emph{Dispersion around the at-risk-of-poverty threshold}: also function \code{arpr()} \item \emph{Gini coefficient}: function \code{gini()} \end{itemize} Note that the implementation strictly follows the Eurostat definitions \citep{EU-SILC04,EU-SILC09}. %In addition, robust estimators are also implemented. Here, the focus is on %Pareto tail modeling. \subsection{Class structure} In this section, the class structure of package \pkg{laeken} is briefly discussed. Section~\ref{sec:indicator} describes the basic class \code{"indicator"}, while the different subclasses for the specific indicators are listed in Section~\ref{sec:classes}. \subsubsection{Class \code{"indicator"}} \label{sec:indicator} The basic class \code{"indicator"} acts as the superclass for all classes in the package corresponding to specific indicators. It consists of the following components: % \begin{description} \item[\code{value}:] A numeric vector containing the point estimate(s). \item[\code{valueByStratum}:] A \code{data.frame} containing the point estimates by domain. \item[\code{varMethod}:] A character string specifying the type of variance estimation used. \item[\code{var}:] A numeric vector containing the variance estimate(s). \item[\code{varByStratum}:] A \code{data.frame} containing the variance estimates by domain. \item[\code{ci}:] A numeric vector or matrix containing the confidence interval(s). \item[\code{ciByStratum}:] A \code{data.frame} containing the confidence intervals by domain. \item[\code{alpha}:] The confidence level is given by $1 - $\code{alpha}. \item[\code{years}:] A numeric vector containing the different years of the survey. \item[\code{strata}:] A character vector containing the different strata of the breakdown. % \item[\code{seed}:] The seed of the random number generator before the computations. \end{description} These list components are inherited by each indicator in the package. One of the most important features of \pkg{laeken} is that indicators can be evaluated for different years and domains. The latter of which can be regions (e.g., NUTS2), but also any other breakdown given by a categorical variable (see the examples in Section~\ref{sec:ind}). In any case, the advantage of the object-oriented implementation is the possibility of sharing code among the indicators. To give an example, the following methods for the basic class \code{"indicator"} are implemented in the package: <<>>= methods(class="indicator") @ The \code{print()} and \code{subset()} methods are called by their respective generic functions if an object inheriting from class \code{"indicator"} is supplied. While the \code{print()} method defines the output of objects inheriting from class \code{"indicator"} shown on the \proglang{R} console, the \code{subset()} method allows to extract subsets of an object inheriting from class \code{"indicator"} and is discussed in detail in Section~\ref{sec:sub}. Furthermore, the function \code{is.indicator()} is available to test whether an object is of class \code{"indicator"}. \subsubsection{Additional classes} \label{sec:classes} For the specific indicators on social exclusion and poverty, the following classes are implemented in package \pkg{laeken}: % \begin{itemize} \item Class \code{"arpr"} with the following additional components: \begin{description} \item[\code{p}:] The percentage of the weighted median used for the at-risk-of-poverty threshold. \item[\code{threshold}:] The at-risk-of-poverty threshold(s). \end{description} \item Class \code{"qsr"} with no additional components. \item Class \code{"rmpg"} with the following additional components: \begin{description} \item[\code{threshold}:] The at-risk-of-poverty threshold(s). \end{description} \item Class \code{"gini"} with no additional components. \end{itemize} % All these classes are subclasses of the basic class \code{"indicator"} and therefore inherit all its components and methods. In addition, functions to test whether an object is a member of one of these subclasses are implemented. Similarly to \code{is.indicator()}, these are called \code{is.foo()}, where \code{foo} is the name of the respective class (e.g., \code{is.arpr()}). % ----------------------------- % equivalized disposable income % ----------------------------- \section{Calculation of the equivalized disposable income} \label{sec:income} For each person, the equivalized disposable income is defined as the total household disposable income divided by the equivalized household size. It follows that each person in the same household receives the same equivalized disposable income. The total disposable income of a household is calculated by adding together the personal income received by all of the household members plus the income received at the household level. The equivalized household size is defined according to the modified OECD scale, which gives a weight of 1.0 to the first adult, 0.5 to other household members aged 14 or over, and 0.3 to household members aged less than 14 \citep{EU-SILC04, EU-SILC09}. In practice, the equivalized disposable income needs to be computed from the income components included in EU-SILC for the estimation of the indicators on social exclusion and poverty. Therefore, this section outlines how to perform this step with package \pkg{laeken}, even though the variable \code{eqIncome} containing the equivalized disposable income is already available in the example data set \code{eusilc}. Note that not all variables that are required for an exact computation of the equivalized income are included in the synthetic example data. However, the functions of the package can be applied in exactly the same manner to real EU-SILC data. First, the equivalized household size according to the modified OECD scale needs to be computed. This can be done with the function \code{eqSS()}, which requires the household ID and the age of the individuals as arguments. In the example data, household~ID and age are stored in the variables \code{db030} and \code{age}, respectively. It should be noted that the variable \code{age} is not in the standardized format of EU-SILC data and needs to be calculated from the data beforehand. Nevertheless, these computations are very simple and are therefore not shown here \citep[for details, see][]{EU-SILC09}. The following two lines of code calculate the equivalized household size, add it to the data set, and print the first eight observations of the variables involved. <<>>= eusilc$eqSS <- eqSS("db030", "age", data=eusilc) head(eusilc[,c("db030", "age", "eqSS")], 8) @ Then the equivalized disposable income can be computed with the function \code{eqInc()}. It requires the following information to be supplied: the household~ID, the household income components to be added and subtracted, respectively, the personal income components to be added and subtracted, respectively, as well as the equivalized household size. With the following commands, the equivalized disposable income is calculated and added to the data set, after which the first eight observations of the important variables in this context are printed. <<>>= hplus <- c("hy040n", "hy050n", "hy070n", "hy080n", "hy090n", "hy110n") hminus <- c("hy130n", "hy145n") pplus <- c("py010n", "py050n", "py090n", "py100n", "py110n", "py120n", "py130n", "py140n") eusilc$eqIncome <- eqInc("db030", hplus, hminus, pplus, character(), "eqSS", data=eusilc) head(eusilc[,c("db030", "eqSS", "eqIncome")], 8) @ % Note that the net income is considered in this example, therefore no personal income component needs to be subtracted \citep[see][]{EU-SILC04, EU-SILC09}. This is reflected in the call to \code{eqInc()} by the use of an empty character vector \code{character()} for the corresponding argument. % ------------------ % weighted quantiles % ------------------ \section{Weighted median and quantile estimation} \label{sec:w} Some of the indicators on social exclusion and poverty require the estimation of the median income or other quantiles of the income distribution. Hence functions that strictly follow the definitions according to \citet{EU-SILC04, EU-SILC09} are implemented in package \pkg{laeken}. They are used internally for the estimation of the respective indicators, but can also be called by the user directly. In the analysis of income distributions, the median income is typically of higher interest than the arithmetic mean. This is because income distributions commonly are strongly right-skewed with a heavy tail of \emph{representative outliers} (correctly measured units that are not unique to the population) and \emph{nonrepresentative outliers} (either measurement errors or correct observations that can be considered unique in the population). Therefore, the center of the distribution is more reliably estimated by a weighted median than by a weighted mean, as the latter is highly influenced by extreme values. In mathematical terms, quantiles are defined as $q_{p} := F^{-1}(p)$, where $F$ is the distribution function on the population level and $0 \leq p \leq 1$. The median as an important special case is given by $p = 0.5$. For the following definitions, let $n$ be the number of observations in the sample, let $\boldsymbol{x} := (x_{1}, \ldots, x_{n})'$ denote the equivalized disposable income with \mbox{$x_{1} \leq \ldots \leq x_{n}$}, and let $\boldsymbol{w} := (w_{i}, \ldots, w_{n})'$ be the corresponding personal sample weights. Weighted quantiles for the estimation of the population values according to \citet{EU-SILC04, EU-SILC09} are then given by \begin{equation} \label{eq:wq} \hat{q}_{p} = \hat{q}_{p} (\boldsymbol{x}, \boldsymbol{w}) := \begin{cases} \frac{1}{2} (x_{j} + x_{j+1}), & \quad \text{if } \sum_{i=1}^{j} w_{i} = p \sum_{i=1}^{n} w_{i}, \\ x_{j+1}, & \quad \text{if } \sum_{i=1}^{j} w_{i} < p \sum_{i=1}^{n} w_{i} < \sum_{i=1}^{j+1} w_{i}. \end{cases} \end{equation} This definition of weighted quantiles is available in \pkg{laeken} through the function \code{weightedQuantile()}. The following command computes the weighed 20\% quantile, the weighted median, and the weighted 80\% quantile. In the context of social exclusion indicators, these are of most importance. % ----- <>= weightedQuantile(eusilc$eqIncome, eusilc$rb050, probs = c(0.2, 0.5, 0.8)) @ % ----- For the important special case of the weighted median, the function \code{weightedMedian()} is available for convenience. % ----- <<>>= weightedMedian(eusilc$eqIncome, eusilc$rb050) @ In addition, the functions \code{incMedian()} and \code{incQuintile()} are more tailored towards application in the case of indicators on social exclusion and poverty and provide a similar interface as the functions for the indicators (see Section~\ref{sec:ind}). In particular, they allow to supply an additional variable to be used as tie-breakers for sorting, and to compute the weighted median and income quintiles, respectively, for several years of the survey. With the following lines of code, the median income as well as the \engordnumber{1} and \engordnumber{4} income quintile (i.e., the weighted 20\% and 80\% quantiles) are estimated. <<>>= incMedian("eqIncome", weights = "rb050", data = eusilc) incQuintile("eqIncome", weights = "rb050", k = c(1, 4), data = eusilc) @ % ------------------- % selected indicators % ------------------- \section{Indicators on social exclusion and poverty} \label{sec:ind} In this section, the most important indicators on social exclusion and poverty are described in detail. Furthermore, the functionality of package \pkg{laeken} to estimate these indicators is demonstrated. It should be noted that all functions for the implemented indicators provide a very similar interface. Most importantly, it is possible to compute estimates for several years of the survey and different subdomains with a single command. Furthermore, the functions allow to supply an additional variable to be used as tie-breakers for sorting. However, not all of the implemented functionality is shown in this vignette. For a complete description of the functions and their arguments, the reader is referred to the corresponding \proglang{R} help pages. In addition, only point estimation of the indicators on social exclusion and poverty is illustrated here, statistical significance of these estimates is not discussed. The functionality for variance estimation of the indicators is described in the package vignette \code{laeken-variance} \citep{templ11b}. For the following definitions of the estimators according to \citet{EU-SILC04, EU-SILC09}, let $\boldsymbol{x} := (x_{1}, \ldots, x_{n})'$ be the equivalized disposable income with $x_{1} \leq \ldots \leq x_{n}$ and let $\boldsymbol{w} := (w_{i}, \ldots, w_{n})'$ be the corresponding personal sample weights, where $n$ denotes the number of observations. Furthermore, define the following index sets for a certain threshold $t$: \begin{align} I_{< t} &:= \{ i \in \{ 1, \ldots, n \} : x_{i} < t \},\label{eq:01-Ilt}\\ I_{\leq t} &:= \{ i \in \{ 1, \ldots, n \} : x_{i} \leq t \},\label{eq:01-Ileqt}\\ I_{> t} &:= \{ i \in \{ 1, \ldots, n \} : x_{i} > t\}\label{eq:01-Igt}. \end{align} \subsection{At-risk-at-poverty rate} \label{sec:ARPR} In order to define the \emph{at-risk-of-poverty rate} (ARPR), the \emph{at-risk-of-poverty threshold} (ARPT) needs to be introduced first, which is set at $60\%$ of the national median equivalized disposable income. Then the at-risk-at-poverty rate is defined as the proportion of persons with an equivalized disposable income below the at-risk-at-poverty threshold \citep{EU-SILC04, EU-SILC09}. In a more mathematical notation, the at-risk-at-poverty rate is defined as \begin{equation} \label{eq:ARPR} ARPR := P(x < 0.6 \cdot q_{0.5}) \cdot 100,% = F(0.6 \cdot q_{0.5}) \cdot 100, \end{equation} where $q_{0.5} := F^{-1}(0.5)$ denotes the population median (50\% quantile) and $F$ is the distribution function of the equivalized income on the population level. For the estimation of the at-risk-at-poverty rate from a sample, the sample weights need to be taken into account. %Let $n$ be the number of observations in the sample, let $\boldsymbol{x} := %(x_{1}, \ldots, x_{n})'$ denote the equivalized disposable income with %\mbox{$x_{1} \leq \ldots \leq x_{n}$}, and let $\boldsymbol{w} := (w_{i}, %\ldots, w_{n})'$ be the corresponding personal sample weights. Then the %at-risk-at-poverty threshold is estimated by First, the at-risk-at-poverty threshold is estimated by \begin{equation} \label{eq:ARPT} \widehat{ARPT} = 0.6 \cdot \hat{q}_{0.5}, \end{equation} where $\hat{q}_{0.5}$ is the weighted median as defined in Equation~(\ref{eq:wq}). %Furthermore, define an index set of observations with an equivalized disposable %income below the estimated at-risk-at-poverty threshold as %\begin{equation} %I_{< \widehat{ARPT}} := \{ i \in \{ 1, \ldots, n \} : x_{i} < \widehat{ARPT} \}. %\end{equation} %With these definitions, the at-risk-at-poverty rate can be estimated by Then the at-risk-at-poverty rate can be estimated by \begin{equation} \widehat{ARPR} := \frac{\sum_{i \in I_{< \widehat{ARPT}}} w_{i}}{\sum_{i=1}^{n} w_{i}} \cdot 100, \end{equation} where $I_{< \widehat{ARPT}}$ is an index set of persons with an equivalized disposable income below the estimated at-risk-of-poverty threshold as defined in Equation~(\ref{eq:01-Ilt}). In package \pkg{laeken}, the functions \code{arpt()} and \code{arpr()} are implemented for the estimation of the at-risk-of-poverty threshold and the at-risk-of-poverty rate. Whenever sample weights are available in the data, they should be supplied as the \code{weights} argument. Even though \code{arpt()} is called internally by \code{arpr()}, it can also be called by the user directly. <<>>= arpt("eqIncome", weights = "rb050", data = eusilc) arpr("eqIncome", weights = "rb050", data = eusilc) @ It is also possible to use these functions for the estimation of the indicator \emph{dispersion around the at-risk-of-poverty threshold}, which is defined as the proportion of persons with an equivalized disposable income below $40\%$, $50\%$ and $70\%$ of the national weighted median equivalized disposable income. The proportion of the median equivalized income to be used can thereby be adjusted via the argument \code{p}. <<>>= arpr("eqIncome", weights = "rb050", p = 0.4, data = eusilc) arpr("eqIncome", weights = "rb050", p = 0.5, data = eusilc) arpr("eqIncome", weights = "rb050", p = 0.7, data = eusilc) @ In order to compute estimates for different subdomains, a breakdown variable simply needs to be supplied as the \code{breakdown} argument. Note that in this case the same overall at-risk-of-poverty threshold is used for all subdomains \citep[see][]{EU-SILC04, EU-SILC09}. The following command computes the overall estimate, as well as estimates for all NUTS2 regions. <<>>= arpr("eqIncome", weights = "rb050", breakdown = "db040", data = eusilc) @ However, any kind of breakdown can be supplied, e.g., the breakdowns defined by \citet{EU-SILC04, EU-SILC09}. With the following lines of code, a breakdown variable with all possible combinations of age categories and gender is defined and added to the data set, before it is used to compute estimates for the corresponding domains. <<>>= ageCat <- cut(eusilc$age, c(-1, 16, 25, 50, 65, Inf), right=FALSE) eusilc$breakdown <- paste(ageCat, eusilc$rb090, sep=":") arpr("eqIncome", weights = "rb050", breakdown = "breakdown", data = eusilc) @ Clearly, the results are even more heterogeneous than for the breakdown into NUTS2 regions. %The results are even more different when considering household size %(\code{hsize}) and citizenship (\code{pb220a}) as the domain level for %estimation. %<<>>= %eusilc$breakdown <- paste(eusilc$hsize, eusilc$pb220a, sep=":") %arpr("eqIncome", weights = "rb050", breakdown = "breakdown", data = eusilc) %@ \subsection{Quintile share ratio} The income \emph{quintile share ratio} (QSR) is defined as the ratio of the sum of the equivalized disposable income received by the 20\% of the population with the highest equivalized disposable income to that received by the 20\% of the population with the lowest equivalized disposable income \citep{EU-SILC04, EU-SILC09}. For the estimation of the quintile share ratio from a sample, let $\hat{q}_{0.2}$ and $\hat{q}_{0.8}$ denote the weighted 20\% and 80\% quantiles, respectively, as defined in Equation~(\ref{eq:wq}). Using index sets $I_{\leq \hat{q}_{0.2}}$ and $I_{> \hat{q}_{0.8}}$ as defined in Equations~(\ref{eq:01-Ileqt}) and~(\ref{eq:01-Igt}), respectively, the quintile share ratio is estimated by \begin{equation} \widehat{QSR} := \frac{\sum_{i \in I_{> \hat{q}_{0.8}}} w_{i} x_{i}}{\sum_{i \in I_{\leq \hat{q}_{0.2}}} w_{i} x_{i}}. \end{equation} With package \pkg{laeken}, the quintile share ratio can be estimated using the function \code{qsr()}. As for the at-risk-of-poverty rate, sample weights can be supplied via the \code{weights} argument. <<>>= qsr("eqIncome", weights = "rb050", data = eusilc) @ Computing estimates for different subdomains is again possible by specifying the \code{breakdown} argument. In the following example, estimates for each NUTS2 region are computed in addition to the overall estimate. <<>>= qsr("eqIncome", weights = "rb050", breakdown = "db040", data = eusilc) @ Nevertheless, it should be noted that the quintile share ratio is highly influenced by outliers \citep[see][]{hulliger09a, alfons10b}. Since the upper tail of income distributions virtually always contains nonrepresentative outliers, robust estimators of the quintile share ratio should preferably be used. Thus robust semi-parametric methods based on Pareto tail modeling are implemented in package \pkg{laeken} as well. Their application is discussed in vignette \code{laeken-pareto} \citep{alfons11a}. \subsection{Relative median at-risk-of-poverty gap (by age and gender)} The \emph{relative median at-risk-of-poverty gap} (RMPG) is defined as the difference between the median equivalized disposable income of persons below the at-risk-of-poverty threshold and the at-risk of poverty threshold itself, expressed as a percentage of the at-risk-of-poverty threshold \citep{EU-SILC04, EU-SILC09}. %Let $wmed_{(poor)}$ the weighted median of the people who having an income %below $ARPR$ defined in Equation~\ref{eq:ARPR}. Then the relative median %at-risk-of-poverty gap is estimated by %\begin{displaymath} %RMPG = \frac{ARPR - wmed_{(poor)}}{ARPR} \cdot 100 %\end{displaymath} For the estimation of the relative median at-risk-of-poverty gap from a sample, let $\widehat{ARPT}$ be the estimated at-risk-of-poverty threshold according to Equation~(\ref{eq:ARPT}), and let $I_{< \widehat{ARPT}}$ be an index set of persons with an equivalized disposable income below the estimated at-risk-of-poverty threshold as defined in Equation~(\ref{eq:01-Ilt}). Using this index set, define $\boldsymbol{x}_{< \widehat{ARPT}} := (x_{i})_{i \in I_{< \widehat{ARPT}}}$ and $\boldsymbol{w}_{< \widehat{ARPT}} := (w_{i})_{i \in I_{< \widehat{ARPT}}}$. Furthermore, let $\hat{q}_{0.5} (\boldsymbol{x}_{< \widehat{ARPT}}, \boldsymbol{w}_{< \widehat{ARPT}})$ be the corresponding weighted median according to the definition in Equation~(\ref{eq:wq}). Then the relative median at-risk-of-poverty gap is estimated by \begin{equation} \widehat{RMPG} = \frac{\widehat{ARPT} - \hat{q}_{0.5} (\boldsymbol{x}_{< \widehat{ARPT}}, \boldsymbol{w}_{< \widehat{ARPT}})}{\widehat{ARPT}} \cdot 100. \end{equation} In package \pkg{laeken}, the function \code{rmpg()} is implemented for the estimation of the relative median at-risk-of-poverty gap. If available in the data, sample weights should be supplied as the \code{weights} argument. Note that the function \code{arpt()} for the estimation of the at-risk-of-poverty threshold is called internally (cf. function \code{arpr()} for the at-risk-of-poverty rate in Section~\ref{sec:ARPR}). <<>>= rmpg("eqIncome", weights = "rb050", data = eusilc) @ Estimates for different subdomains can be computed by making use of the \code{breakdown} argument. With the following command, the overall estimate and estimates for all NUTS2 regions are computed. <<>>= rmpg("eqIncome", weights = "rb050", breakdown = "db040", data = eusilc) @ For the relative median at-risk-of-poverty gap, the breakdown by age and gender is of particular interest. In the following example, a breakdown variable with all possible combinations of age categories and gender is defined and added to the data set. Afterwards, estimates for the corresponding domains are computed. <<>>= ageCat <- cut(eusilc$age, c(-1, 16, 25, 50, 65, Inf), right=FALSE) eusilc$breakdown <- paste(ageCat, eusilc$rb090, sep=":") rmpg("eqIncome", weights = "rb050", breakdown = "breakdown", data = eusilc) @ \subsection{Gini coefficient} The \emph{Gini coefficient} is defined as the relationship of cumulative shares of the population arranged according to the level of equivalized disposable income, to the cumulative share of the equivalized total disposable income received by them \citep{EU-SILC04, EU-SILC09}. For the estimation of the Gini coefficient from a sample, the sample weights need to be taken into account. In mathematical terms, the Gini coefficient is estimated by \begin{equation} \widehat{Gini} := 100 \left[ \frac{2 \sum_{i=1}^{n} \left( w_{i} x_{i} \sum_{j=1}^{i} w_{j} \right) - \sum_{i=1}^{n} w_{i}^{\phantom{i}2} x_{i}}{\left( \sum_{i=1}^{n} w_{i} \right) \sum_{i=1}^{n} \left(w_{i} x_{i} \right)} - 1 \right]. \end{equation} The function \code{gini()} is available in \pkg{laeken} to estimate the Gini coefficient. As for the other indicators, sample weights can be specified with the \code{weights} argument. <<>>= gini("eqIncome", weights = "rb050", data = eusilc) @ Using the \code{breakdown} argument in the following command, estimates for the NUTS2 regions are computed in addition to the overall estimate. <<>>= gini("eqIncome", weights = "rb050", breakdown = "db040", data = eusilc) @ Since outliers have a strong influence on the Gini coefficient, robust estimators are preferred to the standard estimation described above \citep[see][]{alfons10b}. Vignette \code{laeken-pareto} \citep{alfons11a} describes how to apply the robust semi-parametric methods implemented in package \pkg{laeken}. % ------------------ % extracting subsets % ------------------ \section{Extracting information using the \code{subset()} method} \label{sec:sub} If estimates of an indicator have been computed for several subdomains, it may sometimes be desired to extract the results for some domains of particular interest. In package \pkg{laeken}, this is implemented by taking advantage of the object-oriented design of the package. Each of the functions for the indicators described in Section~\ref{sec:ind} returns an object belonging to a class of the same name as the respective function, e.g., function \code{arpr()} returns an object of class \code{"arpr"}. All these classes thereby inherit from the basic class \code{"indicator"} (see Section~\ref{sec:design}). <<>>= a <- arpr("eqIncome", weights = "rb050", breakdown = "db040", data = eusilc) print(a) is.arpr(a) is.indicator(a) class(a) @ To extract a subset of results from such an object, a \code{subset()} method for the class \code{"indicator"} is implemented in \pkg{laeken}. The method \code{subset.indicator()} is hidden from the user and is called internally by the generic function \code{subset()} whenever an object of class \code{"indicator"} is supplied. In the following example, the estimates of the at-risk-of-poverty rate for the regions Lower Austria and Vienna are extracted from the object computed above. <<>>= subset(a, strata = c("Lower Austria", "Vienna")) @ % ----------- % conclusions % ----------- \section{Conclusions} \label{sec:concl} This vignette demonstrates the use of package \pkg{laeken} for point estimation of the European Union indicators on social exclusion and poverty. Since the description of the indicators in \citet{EU-SILC04, EU-SILC09} is weak from a mathematical point of view, a more precise notation is given in this paper. Currently, the most important indicators are implemented in \pkg{laeken}. Their estimation is made easy with the package, as it is even possible to compute estimates for several years and different subdomains with a single command. Concerning the inequality indicators quintile share ratio and Gini coefficient, it is clearly visible from their definitions that the standard estimators are highly influenced by outliers \citep[see also][]{hulliger09a, alfons10b}. Therefore, robust semi-parametric methods are implemented in \pkg{laeken} as well. These are described in vignette \code{laeken-pareto} \citep{alfons11a}, while variance and confidence interval estimation for the indicators on social exclusion and poverty with package \pkg{laeken} is treated in vignette \code{laeken-variance} \citep{templ11b}. % --------------- % acknowledgments % --------------- \section*{Acknowledgments} This work was partly funded by the European Union (represented by the European Commission) within the 7$^{\mathrm{th}}$ framework programme for research (Theme~8, Socio-Economic Sciences and Humanities, Project AMELI (Advanced Methodology for European Laeken Indicators), Grant Agreement No. 217322). Visit \url{http://ameli.surveystatistics.net} for more information on the project. % ------------ % bibliography % ------------ \bibliographystyle{plainnat} \bibliography{laeken} \end{document}