Type: | Package |
Title: | A User-Oriented Statistical Toolkit for Analytical Variance Estimation |
Depends: | R(≥ 3.2.5) |
Imports: | methods, utils, stats, Matrix |
Suggests: | testthat, sampling, magrittr, tibble, dplyr, data.table |
Version: | 1.0.0 |
URL: | https://github.com/InseeFr/gustave |
BugReports: | https://github.com/InseeFr/gustave/issues |
Description: | Provides a toolkit for analytical variance estimation in survey sampling. Apart from the implementation of standard variance estimators, its main feature is to help the sampling expert produce easy-to-use variance estimation "wrappers", where systematic operations (linearization, domain estimation) are handled in a consistent and transparent way. |
License: | GPL-3 |
Collate: | 'data.R' 'utils.R' 'define_variance_wrapper.R' 'variance_function.R' 'define_statistic_wrapper.R' 'standard_statistic_wrapper.R' 'qvar.R' |
Encoding: | UTF-8 |
LazyData: | true |
RoxygenNote: | 7.2.3 |
NeedsCompilation: | no |
Packaged: | 2023-11-17 09:16:29 UTC; mc |
Author: | Martin Chevalier [aut] (Creator), Khaled Larbi [cre], Institut national de la statistique et des études économiques [cph] |
Maintainer: | Khaled Larbi <khaled.larbi@insee.fr> |
Repository: | CRAN |
Date/Publication: | 2023-11-17 10:10:02 UTC |
Expand a matrix or a data.frame with zeros based on rownames matching
Description
For a given two-dimensional object with rownames and a character
vector, add_zero
produces a corresponding object whose rownames match
the character vector, with zeros on the additional rows.
This function is an easy-to-use and reliable way to reintroduce non-responding units in the variance estimation process (after the non-response phase is taken into account).
Usage
add_zero(y, rownames, remove = TRUE)
Arguments
y |
A (sparse) matrix or a data.frame. The object to add zeros to. |
rownames |
A character vector (other types are coerced to character). The character vector giving the rows of the produced object. |
remove |
Should rows of |
Value
A (sparse) matrix or data.frame depending on the type of y
.
Author(s)
Martin Chevalier
Examples
# Data generation
set.seed(1)
n <- 10
p <- 2
y <- matrix(1:(n*p), ncol = p, dimnames = list(sample(letters, n)))
y[c(3, 8, 12)] <- NA
rownames <- letters
# Standard use
add_zero(y, rownames)
# Use when rownames in y do not match
# any element in the rownames argument
rownames(y)[1:3] <- toupper(rownames(y)[1:3])
add_zero(y, rownames)
add_zero(y, rownames, remove = FALSE)
Define a statistic wrapper
Description
define_statistic_wrapper
defines
statistic wrappers to be used together with
variance estimation wrappers
.
A statistic wrapper produces both the point estimator and the
linearized variable associated with a given statistic to estimate
variance on (Deville, 1999). define_statistic_wrapper
is intended
for advanced use only, standard statistic wrappers are included
in the gustave package (see standard statistic wrappers
)
Usage
define_statistic_wrapper(
statistic_function,
arg_type,
arg_not_affected_by_domain = NULL,
display_function = standard_display
)
Arguments
statistic_function |
An R function specific to the statistic to calculate. It should produce at least the point estimator and the linearized variable associated with the statistic (see Details). |
arg_type |
A named list with three character vectors describing
the type of each argument of |
arg_not_affected_by_domain |
A character vector indicating the arguments which should not be affected by domain-splitting. Such parameters may appear in some complex linearization formula, for instance when the At-Risk of Poverty Rate (ARPR) is estimated by region but with a poverty line calculated at the national level. |
display_function |
An R function which produces, for each variance
estimation, the data.frame to be displayed by the variance estimation
wrapper. The default display function ( |
Details
When the statistic to estimate is not a total, the application of analytical variance estimation formulae developed for the estimator of a total is not straightforward (Deville, 1999). An asymptotically unbiased variance estimator can nonetheless be obtained if the estimation of variance is performed on a variable obtained from the original data through a linearization step.
define_statistic_wrapper
is the function used to create, for a
given statistic, an easy-to-use function which calculates both the point
estimator and the linearized variable associated with the statistic. These
operations are implemented by the statistic_function
, which can have
any needed input (for example num
and denom
for a ratio
estimator) and should output a list with at least two named elements:
-
point
: the point estimator of the statistic -
lin
: the linearized variable to be passed on to the variance estimation formula. If several variables are to be associated with the statistics,lin
can be a list itself.
All other named elements in the output of define_statistic_wrapper
are
treated as metadata (that may be used later on by display_function
).
arg_type
is a named list of three elements that describes the nature
of the argument of statistic_function
:
-
data
: data argument(s), numerical vector(s) to be used to calculate the point estimate and the linearized variable associated with the statistic -
weight
: weight argument, numerical vector to be used as row weights -
param
: parameters, non-data arguments to be used to control some aspect of the computation
Statistic wrappers are quite flexible tools to apply a variance function to an estimator requiring a linearization step (e.g. all estimators except the estimator of a total) with virtually no additional complexity for the end-user.
standard statistic wrappers
are included within the gustave package and automatically added
to the variance estimation wrappers. New statistic wrappers can be defined
using the define_statistic_wrapper
and then explicitly added to the
variance estimation wrappers using the objects_to_include
argument.
Note: To some extent, statistic wrappers can be seen as ggplot2
geom_
and stat_
functions: they help the end-user in writing
down what he or she wants without having to go too deep into the details
of the corresponding layers.
Value
A function to be used within a variance estimation wrapper to estimate
a specific statistic (see examples). Its formals are the ones of
statistic_function
with the addition of by
and where
(for domain estimation, set to NULL
by default).
Author(s)
Martin Chevalier
References
Deville J.-C. (1999), "Variance estimation for complex statistics and estimators: linearization and residual techniques", Survey Methodology, 25:193–203
See Also
standard statistic wrappers
, define_variance_wrapper
Examples
### Example from the Information and communication technologies (ICT) survey
# Let's define a variance wrapper asfor the ICT survey
# as in the examples of the qvar function:
precision_ict <- qvar(
data = ict_sample,
dissemination_dummy = "dissemination",
dissemination_weight = "w_calib",
id = "firm_id",
scope_dummy = "scope",
sampling_weight = "w_sample",
strata = "strata",
nrc_weight = "w_nrc",
response_dummy = "resp",
hrg = "hrg",
calibration_weight = "w_calib",
calibration_var = c(paste0("N_", 58:63), paste0("turnover_", 58:63)),
define = TRUE
)
precision_ict(ict_survey, mean(speed_quanti))
# Let's now redefine the mean statistic wrapper
mean2 <- define_statistic_wrapper(
statistic_function = function(y, weight){
point <- sum(y * weight) / sum(weight)
lin <- (y - point) / sum(weight)
list(point = point, lin = lin, metadata = list(n = length(y)))
},
arg_type = list(data = "y", weight = "weight")
)
# mean2 can now be used inside precision_ict (and yields
# the same results as the mean statistic wrapper)
precision_ict(ict_survey, mean(speed_quanti), mean2(speed_quanti))
Define a variance estimation wrapper
Description
Given a variance estimation function (specific to a
survey), define_variance_wrapper
defines a variance estimation
wrapper easier to use (e.g. automatic domain estimation,
linearization).
Usage
define_variance_wrapper(
variance_function,
reference_id,
reference_weight,
default_id = NULL,
technical_data = NULL,
technical_param = NULL,
objects_to_include = NULL
)
Arguments
variance_function |
An R function. It is the methodological workhorse of the variance estimation: from a set of arguments including the variables of interest (see below), it should return a vector of estimated variances. See Details. |
reference_id |
A vector containing the ids of all the responding units
of the survey. It can also be an unevaluated expression (enclosed in
|
reference_weight |
A vector containing the reference weight of the survey.
It can also be an unevaluated expression (enclosed in |
default_id |
A character vector of length 1, the name of the default
identifying variable in the survey file. It can also be an unevaluated
expression (enclosed in |
technical_data |
A named list of technical data needed to perform
the variance estimation (e.g. sampling strata, first- or second-order
probabilities of inclusion, estimated response probabilities, calibration
variables). Its names should match the names of the corresponding arguments
in |
technical_param |
A named list of technical parameters used to control
some aspect of the variance estimation process (e.g. alternative methodology).
Its names should match the names of the corresponding arguments in |
objects_to_include |
(Advanced use) A character vector indicating the name of additional R objects to include within the variance wrapper. |
Details
Defining variance estimation wrappers is the key feature of
the gustave
package. It is the workhorse of the ready-to-use
qvar
function and should be used directly to handle more complex
cases (e.g. surveys with several stages or balanced sampling).
Analytical variance estimation is often difficult to carry out by non-specialists owing to the complexity of the underlying sampling and estimation methodology. This complexity yields complex variance estimation functions which are most often only used by the sampling expert who actually wrote them. A variance estimation wrapper is an intermediate function that is "wrapped around" the (complex) variance estimation function in order to provide the non-specialist with user-friendly features (see examples):
calculation of complex statistics (see
standard statistic wrappers
)domain estimation
handy evaluation and factor discretization
define_variance_wrapper
allows the sampling expert to define a
variance estimation wrapper around a given variance estimation function and
set its default parameters. The produced variance estimation wrapper is
standalone in the sense that it contains all technical data necessary
to carry out the estimation (see technical_data
).
The arguments of the variance_function
fall into three types:
the data argument (mandatory, only one allowed): the numerical matrix of variables of interest to apply the variance estimation formula on
technical data arguments (optional, one or more allowed): technical and methodological information used by the variance estimation function (e.g. sampling strata, first- or second-order probabilities of inclusion, estimated response probabilities, calibration variables)
technical parameters (optional, one or more allowed): non-data arguments to be used to control some aspect of the variance estimation (e.g. alternative methodology)
technical_data
and technical_param
are used to determine
which arguments of variance_function
relate to technical information,
the only remaining argument is considered as the data argument.
Value
An R function that makes the estimation of variance based on the provided variance function easier. Its parameters are:
-
data
: one or more calls to a statistic wrapper (e.g.total()
,mean()
,ratio()
). See examples andstandard statistic wrappers
) andstandard statistic wrappers
) -
where
: a logical vector indicating a domain on which the variance estimation is to be performed -
by
: q qualitative variable whose levels are used to define domains on which the variance estimation is performed -
alpha
: a numeric vector of length 1 indicating the threshold for confidence interval derivation (0.05
by default) -
display
: a logical verctor of length 1 indicating whether the result of the estimation should be displayed or not -
id
: a character vector of size 1 containing the name of the identifying variable in the survey file. Its default value depends on the value ofdefault_id
indefine_variance_wrapper
-
envir
: an environment containing a binding todata
Author(s)
Martin Chevalier
References
Rao, J.N.K (1975), "Unbiased variance estimation for multistage designs", Sankhya, C n°37
See Also
qvar
, standard statistic wrappers
, varDT
Examples
### Example from the Labour force survey (LFS)
# The (simulated) Labour force survey (LFS) has the following characteristics:
# - first sampling stage: balanced sampling of 4 areas (each corresponding to
# about 120 dwellings) on first-order probability of inclusion (proportional to
# the number of dwellings in the area) and total annual income in the area.
# - second sampling stage: in each sampled area, simple random sampling of 20
# dwellings
# - neither non-response nor calibration
# As this is a multi-stage sampling design with balanced sampling at the first
# stage, the qvar function does not apply. A variance wrapper can nonetheless
# be defined using the core define_variance_wrapper function.
# Step 1 : Definition of the variance function and the corresponding technical data
# In this context, the variance estimation function specific to the LFS
# survey can be defined as follows:
var_lfs <- function(y, ind, dwel, area){
variance <- list()
# Variance associated with the sampling of the dwellings
y <- sum_by(y, ind$id_dwel)
variance[["dwel"]] <- var_srs(
y = y, pik = dwel$pik_dwel, strata = dwel$id_area,
w = (1 / dwel$pik_area^2 - dwel$q_area)
)
# Variance associated with the sampling of the areas
y <- sum_by(y = y, by = dwel$id_area, w = 1 / dwel$pik_dwel)
variance[["area"]] <- varDT(y = y, precalc = area)
Reduce(`+`, variance)
}
# where y is the matrix of variables of interest and ind, dwel and area the technical data:
technical_data_lfs <- list()
# Technical data at the area level
# The varDT function allows for the pre-calculation of
# most of the methodological quantities needed.
technical_data_lfs$area <- varDT(
y = NULL,
pik = lfs_samp_area$pik_area,
x = as.matrix(lfs_samp_area[c("pik_area", "income")]),
id = lfs_samp_area$id_area
)
# Technical data at the dwelling level
# In order to implement Rao (1975) formula for two-stage samples,
# we associate each dwelling with the diagonal term corresponding
# to its area in the first-stage variance estimator:
lfs_samp_dwel$q_area <- with(technical_data_lfs$area, setNames(diago, id))[lfs_samp_dwel$id_area]
technical_data_lfs$dwel <- lfs_samp_dwel[c("id_dwel", "pik_dwel", "id_area", "pik_area", "q_area")]
# Technical data at the individual level
technical_data_lfs$ind <- lfs_samp_ind[c("id_ind", "id_dwel", "sampling_weight")]
# Test of the variance function var_lfs
y <- matrix(as.numeric(lfs_samp_ind$unemp), ncol = 1, dimnames = list(lfs_samp_ind$id_ind))
with(technical_data_lfs, var_lfs(y = y, ind = ind, dwel = dwel, area = area))
# Step 2 : Definition of the variance wrapper
# Call of define_variance_wrapper
precision_lfs <- define_variance_wrapper(
variance_function = var_lfs,
technical_data = technical_data_lfs,
reference_id = technical_data_lfs$ind$id_ind,
reference_weight = technical_data_lfs$ind$sampling_weight,
default_id = "id_ind"
)
# Test
precision_lfs(lfs_samp_ind, unemp)
# The variance wrapper precision_lfs has the same features
# as variance wrappers produced by the qvar function (see
# qvar examples for more details).
Sampling frame of the Information and communication technologies (ICT) survey
Description
A (simulated) dataset containing basic identification information and auxiliary variables for the sampling of the Information and communication technologies (ICT) survey in the information and communication sector (NACE rev 2 J section).
Usage
ict_pop
Format
A data frame with 7670 observations and 5 variables:
- firm_id
identifier of the firm
- division
identifier of the economic sub-sector
- employees
number of employees
- turnover
firm turnover, in thousand euros
- strata
stratification variable
See Also
Sample of the Information and communication technologies (ICT) survey
Description
A (simulated) dataset containing sampling information about the sample of the Information and communication technologies (ICT) survey in the information and communication sector (NACE rev 2 J section)
Usage
ict_sample
Format
A data frame with 650 observations and 8 variables:
- firm_id
identifier of the firm
- division
identifier of the economic sub-sector
- employees
number of employees
- turnover
firm turnover, in euros
- strata
stratification variable
- w_sample
sampling weight
- scope
boolean indicating whether the firm did belong to the scope of the survey or not
- resp
boolean indicating whether the firm did respond to the survey or not
- nrc
boolean indicating whether the firm did take part in the non-response correction process or not
- hrg
homogeneous response group used for the non-response correction
- response_prob_est
response probability of the unit estimated using homogeneous response groups
- w_nrc
weight after unit non-response correction
- calib
boolean indicating whether the firm was integrated in the calibration process or not (
TRUE
for all responding units)- N_58, N_59, N_60, N_61, N_62, N_63, turnover_58, turnover_59, turnover_60, turnover_61, turnover_62, turnover_63
calibration variables (number of firms and turnover broken down by economic sub-sector)
- w_calib
calibrated weight
- dissemination
boolean indicating whether the unit appears in the dissemination file
See Also
Survey data of the Information and communication technologies (ICT) survey
Description
A (simulated) dataset containing calibration and survey variables of the respondents to the Information and communication technologies (ICT) survey in the information and communication sector (NACE rev 2 J section)
Usage
ict_survey
Format
A data frame with 612 observations and 11 variables:
- firm_id
identifier of the firm
- division
identifier of the economic sub-sector
- employees
number of employees
- turnover
firm turnover, in euros
- w_calib
calibrated weight
- speed_quanti, speed_quanti_NA
internet connection speed of the firm in Mbps, without or with missing values
- speed_quali, speed_quali_NA
internet connection speed of the firm recoded in classes, without or with missing values
- big_data, big_data_NA
use of big data analytics within the firm, without or with missing values
See Also
Sample of areas in the Labour force survey
Description
A (simulated) dataset containing information about 4 geographical areas (about 120 dwellings each) sampled for the labour force survey.
Usage
lfs_samp_area
Format
A data frame with 4 observations and 3 variables:
- id_area
identifier of the area
- income
total annual income of the area in thousand euros (from income registry)
- pik_area
first-order inclusion probability of the area (proportional to the number of dwellings in the area)
See Also
define_variance_wrapper
, lfs_samp_dwel
, lfs_samp_ind
Sample of dwellings in the Labour force survey
Description
A (simulated) dataset containing information about 80 dwellings
sampled for the Labour force survey (in the 4 areas described
in lfs_samp_area
).
Usage
lfs_samp_dwel
Format
A data frame with 80 observations and 6 variables:
- id_dwel
identifier of the dwelling
- id_area
identifier of the area
- income
total annual income of the dwelling in thousand euros (from income registry)
- pik_area
first-order inclusion probability of the area
- pik_dwel
first-order inclusion probability of the dwelling within the area (20 dwelling sampled per area)
- pik
first-order inclusion probability of the dwelling
See Also
define_variance_wrapper
, lfs_samp_area
, lfs_samp_ind
Sample of individuals in the Labour force survey
Description
A (simulated) dataset containing information about 157 individuals
sampled for the Labour force survey (all members of the 80 dwellings
described in lfs_samp_dwel
). It also contains the
unemployment status extracted from the survey file (no non-response).
Usage
lfs_samp_ind
Format
A data frame with 157 observations and 5 variables:
- id_ind
identifier of the individual
- id_dwel
identifier of the dwelling
- income
total annual income of the individual in thousand euros (from income registry)
- unemp
unemployment status
- sampling_weight
sampling weight of the individual (inverse of the first-order inclusion probability of the dwelling)
See Also
define_variance_wrapper
, lfs_samp_area
, lfs_samp_dwel
Quickly perform a variance estimation in common cases
Description
qvar
(for "quick variance estimation") is a function
performing analytical variance estimation in most common cases, that is:
stratified simple random sampling
non-response correction (if any) through reweighting
calibration (if any)
Used with define = TRUE
, it defines a so-called variance wrapper, that
is a standalone ready-to-use function that can be applied to the survey dataset
without having to specify the methodological characteristics of the survey
(see define_variance_wrapper
).
Usage
qvar(
data,
...,
by = NULL,
where = NULL,
alpha = 0.05,
display = TRUE,
id,
dissemination_dummy,
dissemination_weight,
sampling_weight,
strata = NULL,
scope_dummy = NULL,
nrc_weight = NULL,
response_dummy = NULL,
nrc_dummy = NULL,
calibration_weight = NULL,
calibration_dummy = NULL,
calibration_var = NULL,
define = FALSE,
envir = parent.frame()
)
Arguments
data |
The |
... |
One or more calls to a statistic wrapper (e.g. |
by |
A qualitative variable whose levels are used to define domains on which the variance estimation is performed. |
where |
A logical vector indicating a domain on which the variance estimation is to be performed. |
alpha |
A numeric vector of length 1 indicating the threshold
for confidence interval derivation ( |
display |
A logical verctor of length 1 indicating whether the result of the estimation should be displayed or not. |
id |
The identification variable of the units in |
dissemination_dummy |
A character vector of length 1, the name
of the logical variable in |
dissemination_weight |
A character vector of length 1, the name
of the numerical variable in |
sampling_weight |
A character vector of length 1, the name of the
numeric variable in |
strata |
A character vector of length 1, the name of the factor
variable in |
scope_dummy |
A character vector of length 1, the name of the logical
variable in |
nrc_weight |
A character vector of length 1, the name of the
numerical variable in |
response_dummy |
A character vector of length 1, the name of of the logical
variable in |
nrc_dummy |
A character vector of length 1, the name of
the logical variable in |
calibration_weight |
A character vector of length 1, the name of the
numerical variable in |
calibration_dummy |
A character vector of length 1, the name of of the logical
variable in |
calibration_var |
A character vector, the name of the variable(s) used in
the calibration process. Logical variables are coerced to numeric.
Character and factor variables are automatically discretized.
|
define |
Logical vector of lentgh 1. Should a variance wrapper be defined instead of performing a variance estimation (see details and examples)? |
envir |
An environment containing a binding to |
Details
qvar
performs not only technical but also
methodological checks in order to ensure that the standard variance
estimation methodology does apply (e.g. equal probability of
inclusion within strata, number of units per stratum).
Used with define = TRUE
, the function returns a variance
estimation wrapper, that is a ready-to-use function that
implements the described variance estimation methodology and
contains all necessary data to do so (see examples).
Note: To some extent, qvar
is analogous to the qplot
function
in the ggplot2 package, as it is an easier-to-use function for common
cases. More complex cases are to be handled by using the core functions of
the gustave package, e.g. define_variance_wrapper
.
See Also
define_variance_wrapper
, standard_statistic_wrapper
Examples
### Example from the Information and communication technologies (ICT) survey
# The (simulated) Information and communication technologies (ICT) survey
# has the following characteristics:
# - stratified one-stage sampling design
# - non-response correction through reweighting in homogeneous response groups
# - calibration on margins.
# The ict_survey data.frame is a (simulated) subset of the ICT
# survey file containing the variables of interest for the 612
# responding firms.
# The ict_sample data.frame is the (simulated) sample of 650
# firms corresponding to the ict_survey file. It contains all
# technical information necessary to estimate a variance with
# the qvar() function.
## Methodological description of the survey
# Direct call of qvar()
qvar(
# Sample file
data = ict_sample,
# Dissemination and identification information
dissemination_dummy = "dissemination",
dissemination_weight = "w_calib",
id = "firm_id",
# Scope
scope_dummy = "scope",
# Sampling design
sampling_weight = "w_sample",
strata = "strata",
# Non-response correction
nrc_weight = "w_nrc",
response_dummy = "resp",
hrg = "hrg",
# Calibration
calibration_weight = "w_calib",
calibration_var = c(paste0("N_", 58:63), paste0("turnover_", 58:63)),
# Statistic(s) and variable(s) of interest
mean(employees)
)
# Definition of a variance estimation wrapper
precision_ict <- qvar(
# As before
data = ict_sample,
dissemination_dummy = "dissemination",
dissemination_weight = "w_calib",
id = "firm_id",
scope_dummy = "scope",
sampling_weight = "w_sample",
strata = "strata",
nrc_weight = "w_nrc",
response_dummy = "resp",
hrg = "hrg",
calibration_weight = "w_calib",
calibration_var = c(paste0("N_", 58:63), paste0("turnover_", 58:63)),
# Replacing the variables of interest by define = TRUE
define = TRUE
)
# Use of the variance estimation wrapper
precision_ict(ict_sample, mean(employees))
# The variance estimation wrapper can also be used on the survey file
precision_ict(ict_survey, mean(speed_quanti))
## Features of the variance estimation wrapper
# Several statistics in one call (with optional labels)
precision_ict(ict_survey,
"Mean internet speed in Mbps" = mean(speed_quanti),
"Turnover per employee" = ratio(turnover, employees)
)
# Domain estimation with where and by arguments
precision_ict(ict_survey,
mean(speed_quanti),
where = employees >= 50
)
precision_ict(ict_survey,
mean(speed_quanti),
by = division
)
# Domain may differ from one estimator to another
precision_ict(ict_survey,
"Mean turnover, firms with 50 employees or more" = mean(turnover, where = employees >= 50),
"Mean turnover, firms with 100 employees or more" = mean(turnover, where = employees >= 100)
)
# On-the-fly evaluation (e.g. discretization)
precision_ict(ict_survey, mean(speed_quanti > 100))
# Automatic discretization for qualitative (character or factor) variables
precision_ict(ict_survey, mean(speed_quali))
# Standard evaluation capabilities
variables_of_interest <- c("speed_quanti", "speed_quali")
precision_ict(ict_survey, mean(variables_of_interest))
# Integration with %>% and dplyr
library(magrittr)
library(dplyr)
ict_survey %>%
precision_ict("Internet speed above 100 Mbps" = mean(speed_quanti > 100)) %>%
select(label, est, lower, upper)
Linear Regression Residuals Calculation
Description
res_cal
calculates linear regression residuals in an
efficient way : handling several dependent variables at a time, using
Matrix::TsparseMatrix capabilities and allowing for pre-calculation of
the matrix inverse.
Usage
res_cal(y = NULL, x, w = NULL, by = NULL, precalc = NULL, id = NULL)
Arguments
y |
A (sparse) numerical matrix of dependent variable(s). |
x |
A (sparse) numerical matrix of independent variable(s). |
w |
An optional numerical vector of row weights. |
by |
An optional categorical vector (factor or character) when residuals calculation is to be conducted within by-groups (see Details). |
precalc |
A list of pre-calculated results (see Details). |
id |
A vector of identifiers of the units used in the calculation.
Useful when |
Details
In the context of the gustave
package, linear
regression residual calculation is solely used to take into account
the effect of calibration on variance estimation. Independent variables
are therefore most likely to be the same from one variance estimation
to another, hence the inversion of the matrix
t(x) %*% Diagonal(x = w) %*% x
can be done once and for all
at a pre-calculation step.
The parameters y
and precalc
determine whether a list of
pre-calculated data should be used in order to speed up the regression
residuals computation at execution time:
if
y
notNULL
andprecalc
NULL
: on-the-fly calculation of the matrix inverse and the regression residuals (no pre-calculation).if
y
NULL
andprecalc
NULL
: pre-calculation of the matrix inverse which is stored in a list of pre-calculated data.if
y
notNULL
andprecalc
notNULL
: calculation of the regression residuals using the list of pre-calculated data.
The by
parameter allows for calculation within by-groups : all
calculation are made separately for each by-group (when calibration was
conducted separately on several subsamples), but in an efficient way using
Matrix::TsparseMatrix capabilities (especially when the matrix inverse is
pre-calculated).
Value
if
y
is notNULL
(calculation step) : a numerical matrix with same structure (regular base::matrix or Matrix::TsparseMatrix) and dimensions asy
.if
y
isNULL
(pre-calculation step) : a list containing pre-calculated data.
Author(s)
Martin Chevalier
Examples
# Generating random data
set.seed(1)
n <- 100
H <- 5
y <- matrix(rnorm(2*n), nrow = n)
x <- matrix(rnorm(10*n), nrow = n)
by <- letters[sample(1:H, n, replace = TRUE)]
# Direct calculation
res_cal(y, x)
# Calculation with pre-calculated data
precalc <- res_cal(y = NULL, x)
res_cal(y, precalc = precalc)
identical(res_cal(y, x), res_cal(y, precalc = precalc))
# Matrix::TsparseMatrix capability
require(Matrix)
X <- as(x, "TsparseMatrix")
Y <- as(y, "TsparseMatrix")
identical(res_cal(y, x), as.matrix(res_cal(Y, X)))
# by parameter for within by-groups calculation
res_cal(Y, X, by = by)
all.equal(
res_cal(Y, X, by = by)[by == "a", ],
res_cal(Y[by == "a", ], X[by == "a", ])
)
Standard statistic wrappers
Description
Functions to be used within variance estimation wrappers in order to specify which statistic is to be estimated.
Usage
total(y, by = NULL, where = NULL)
ratio(num, denom, by = NULL, where = NULL)
mean(y, by = NULL, where = NULL)
diff_of_ratio(num1, denom1, num2, denom2, by = NULL, where = NULL)
ratio_of_ratio(num1, denom1, num2, denom2, by = NULL, where = NULL)
Arguments
y |
A vector corresponding to the variable to calculate the statitic on. If not numeric (character or factor), it is automatically discretized. |
by |
Factor vector (character vectors are coerced to factors) whose levels are used to break down the estimation by domains. |
where |
Logical vector indicating the domain to perform variance estimation on. |
num , num1 , num2 |
Numerical vector(s) corresponding to the numerator(s) to be used in the estimation. |
denom , denom1 , denom2 |
Numerical vector(s) corresponding to the denominator(s) to be used in the estimation. |
Details
When the estimator is not the estimator of a total, the application of analytical variance estimation formulae developed for the estimator of a total is not straightforward (Deville, 1999). An asymptotically unbiased variance estimator can nonetheless be obtained if the estimation of variance is performed on a variable obtained from the original data through a linerization step.
The ratio
, mean
, diff_of_ratio
and ratio_of_ratio
functions produce the point estimate of the statistic and derive the
corresponding linearized variable which is later on passed on to the variance
estimation function within the variance estimation wrapper.
Note: The total
function does not perform any linearization
(as none is needed for the estimator of a total) and solely produces the
corresponding point estimator.
Author(s)
Martin Chevalier
References
Caron N. (1998), "Le logiciel Poulpe : aspects méthodologiques", Actes des Journées de méthodologie statistique http://jms-insee.fr/jms1998s03_1/
Deville J.-C. (1999), "Variance estimation for complex statistics and estimators: linearization and residual techniques", Survey Methodology, 25:193–203
See Also
define_statistic_wrapper
, define_variance_wrapper
Examples
# See qvar examples
Efficient by-group (weighted) summation
Description
sum_by
performs an efficient and optionally weighted
by-group summation by using linear algebra and the Matrix package
capabilities. The by-group summation is performed through matrix cross-product
of the y
parameter (coerced to a matrix if needed) with a (very) sparse
matrix built up using the by
and the (optional) w
parameters.
Compared to base R, dplyr or data.table alternatives, this implementation aims at being easier to use in a matrix-oriented context and can yield efficiency gains when the number of columns becomes high.
Usage
sum_by(y, by, w = NULL, na_rm = TRUE, keep_sparse = FALSE)
Arguments
y |
A (sparse) vector, a (sparse) matrix or a data.frame. The object to perform by-group summation on. |
by |
The factor variable defining the by-groups. Character variables are coerced to factors. |
w |
The optional row weights to be used in the summation. |
na_rm |
Should |
keep_sparse |
When |
Value
A vector, a matrix or a data.frame depending on the type of y
. If y
is
sparse and keep_sparse = TRUE
, then the result is also sparse (without names
when it is a sparse vector, see keep_sparse argument for details).
Author(s)
Martin Chevalier
Examples
# Data generation
set.seed(1)
n <- 100
p <- 10
H <- 3
y <- matrix(rnorm(n*p), ncol = p, dimnames = list(NULL, paste0("var", 1:10)))
y[1, 1] <- NA
by <- letters[sample.int(H, n, replace = TRUE)]
w <- rep(1, n)
w[by == "a"] <- 2
# Standard use
sum_by(y, by)
# Keeping the NAs
sum_by(y, by, na_rm = FALSE)
# With a weight
sum_by(y, by, w = w)
Variance approximation with Deville-Tillé (2005) formula
Description
varDT
estimates the variance of the estimator of a total
in the case of a balanced sampling design with equal or unequal probabilities
using Deville-Tillé (2005) formula. Without balancing variables, it falls back
to Deville's (1993) classical approximation. Without balancing variables and
with equal probabilities, it falls back to the classical Horvitz-Thompson
variance estimator for the total in the case of simple random sampling.
Stratification is natively supported.
var_srs
is a convenience wrapper for the (stratified) simple random
sampling case.
Usage
varDT(
y = NULL,
pik,
x = NULL,
strata = NULL,
w = NULL,
precalc = NULL,
id = NULL
)
var_srs(y, pik, strata = NULL, w = NULL, precalc = NULL)
Arguments
y |
A (sparse) numerical matrix of the variable(s) whose variance of their total is to be estimated. |
pik |
A numerical vector of first-order inclusion probabilities. |
x |
An optional (sparse) numerical matrix of balancing variable(s). |
strata |
An optional categorical vector (factor or character) when variance estimation is to be conducted within strata. |
w |
An optional numerical vector of row weights (see Details). |
precalc |
A list of pre-calculated results (see Details). |
id |
A vector of identifiers of the units used in the calculation.
Useful when |
Details
varDT
aims at being the workhorse of most variance estimation conducted
with the gustave
package. It may be used to estimate the variance
of the estimator of a total in the case of (stratified) simple random sampling,
(stratified) unequal probability sampling and (stratified) balanced sampling.
The native integration of stratification based on Matrix::TsparseMatrix allows
for significant performance gains compared to higher level vectorizations
(*apply
especially).
Several time-consuming operations (e.g. collinearity-check, matrix
inversion) can be pre-calculated in order to speed up the estimation at
execution time. This is determined by the value of the parameters y
and precalc
:
if
y
notNULL
andprecalc
NULL
: on-the-fly calculation (no pre-calculation).if
y
NULL
andprecalc
NULL
: pre-calculation whose results are stored in a list of pre-calculated data.if
y
notNULL
andprecalc
notNULL
: calculation using the list of pre-calculated data.
w
is a row weight used at the final summation step. It is useful
when varDT
or var_srs
are used on the second stage of a
two-stage sampling design applying the Rao (1975) formula.
Value
if
y
is notNULL
(calculation step) : the estimated variances as a numerical vector of size the number of columns of y.if
y
isNULL
(pre-calculation step) : a list containing pre-calculated data.
Difference with varest
from package sampling
varDT
differs from sampling::varest
in several ways:
The formula implemented in
varDT
is more general and encompasses balanced sampling.Even in its reduced form (without balancing variables), the formula implemented in
varDT
slightly differs from the one implemented insampling::varest
. Caron (1998, pp. 178-179) compares the two estimators (sampling::varest
implements V_2,varDT
implements V_1).-
varDT
introduces several optimizations:-
matrixwise operations allow to estimate variance on several interest variables at once
Matrix::TsparseMatrix capability and the native integration of stratification yield significant performance gains.
-
the ability to pre-calculate some time-consuming operations speeds up the estimation at execution time.
-
-
varDT
does not natively implements the calibration estimator (i.e. the sampling variance estimator that takes into account the effect of calibration). In the context of thegustave
package,res_cal
should be called beforevarDT
in order to achieve the same result.
Author(s)
Martin Chevalier
References
Caron N. (1998), "Le logiciel Poulpe : aspects méthodologiques", Actes des Journées de méthodologie statistique http://jms-insee.fr/jms1998s03_1/ Deville, J.-C. (1993), Estimation de la variance pour les enquêtes en deux phases, Manuscript, INSEE, Paris.
Deville, J.-C., Tillé, Y. (2005), "Variance approximation under balanced sampling", Journal of Statistical Planning and Inference, 128, issue 2 569-591
Rao, J.N.K (1975), "Unbiased variance estimation for multistage designs", Sankhya, C n°37
See Also
Examples
library(sampling)
set.seed(1)
# Simple random sampling case
N <- 1000
n <- 100
y <- rnorm(N)[as.logical(srswor(n, N))]
pik <- rep(n/N, n)
varDT(y, pik)
sampling::varest(y, pik = pik)
N^2 * (1 - n/N) * var(y) / n
# Unequal probability sampling case
N <- 1000
n <- 100
pik <- runif(N)
s <- as.logical(UPsystematic(pik))
y <- rnorm(N)[s]
pik <- pik[s]
varDT(y, pik)
varest(y, pik = pik)
# The small difference is expected (see Details).
# Balanced sampling case
N <- 1000
n <- 100
pik <- runif(N)
x <- matrix(rnorm(N*3), ncol = 3)
s <- as.logical(samplecube(x, pik))
y <- rnorm(N)[s]
pik <- pik[s]
x <- x[s, ]
varDT(y, pik, x)
# Balanced sampling case (variable of interest
# among the balancing variables)
N <- 1000
n <- 100
pik <- runif(N)
y <- rnorm(N)
x <- cbind(matrix(rnorm(N*3), ncol = 3), y)
s <- as.logical(samplecube(x, pik))
y <- y[s]
pik <- pik[s]
x <- x[s, ]
varDT(y, pik, x)
# As expected, the total of the variable of interest is perfectly estimated.
# strata argument
n <- 100
H <- 2
pik <- runif(n)
y <- rnorm(n)
strata <- letters[sample.int(H, n, replace = TRUE)]
all.equal(
varDT(y, pik, strata = strata),
varDT(y[strata == "a"], pik[strata == "a"]) + varDT(y[strata == "b"], pik[strata == "b"])
)
# precalc argument
n <- 1000
H <- 50
pik <- runif(n)
y <- rnorm(n)
strata <- sample.int(H, n, replace = TRUE)
precalc <- varDT(y = NULL, pik, strata = strata)
identical(
varDT(y, precalc = precalc),
varDT(y, pik, strata = strata)
)
Sen-Yates-Grundy variance estimator
Description
varSYG
computes the Sen-Yates-Grundy
variance estimator.
Usage
varSYG(y = NULL, pikl, precalc = NULL)
Arguments
y |
A (sparse) numerical matrix of the variable(s) whose variance of their total is to be estimated. |
pikl |
A numerical matrix of second-order inclusion probabilities. |
precalc |
A list of pre-calculated results (analogous to the one used by
|
Details
varSYG
aims at being an efficient implementation of the
Sen-Yates-Grundy variance estimator for sampling designs with fixed sample
size. It should be especially useful when several variance estimations are
to be conducted, as it relies on (sparse) matrix linear algebra.
Moreover, in order to be consistent with varDT
, varSYG
has a precalc
argument allowing for the re-use of intermediary
results calculated once and for all in a pre-calculation step (see
varDT
for details).
Value
if
y
is notNULL
(calculation step) : a numerical vector of size the number of columns of y.if
y
isNULL
(pre-calculation step) : a list containing pre-calculated data (analogous to the one used byvarDT
).
Difference with varHT
from package sampling
varSYG
differs from sampling::varHT
in several ways:
The formula implemented in
varSYG
is solely the Sen-Yates-Grundy estimator, which is the one calculated byvarHT
when method = 2.-
varSYG
introduces several optimizations:-
matrixwise operations allow to estimate variance on several interest variables at once
Matrix::TsparseMatrix capability yields significant performance gains.
-
Author(s)
Martin Chevalier
Examples
library(sampling)
set.seed(1)
# Simple random sampling case
N <- 1000
n <- 100
y <- rnorm(N)[as.logical(srswor(n, N))]
pikl <- matrix(rep((n*(n-1))/(N*(N-1)), n*n), nrow = n)
diag(pikl) <- rep(n/N, n)
varSYG(y, pikl)
sampling::varHT(y = y, pikl = pikl, method = 2)
Variance estimator for a Poisson sampling design
Description
var_pois
estimates the variance of the estimator
of a total for a Poisson sampling design.
Usage
var_pois(y, pik, w = rep(1, length(pik)))
Arguments
y |
A (sparse) numerical matrix of the variable(s) whose variance of their total is to be estimated. |
pik |
A numerical vector of first-order inclusion probabilities. |
w |
An optional numerical vector of row weights (see Details). |
Details
w
is a row weight used at the final summation step. It is useful
when var_pois
is used on the second stage of a two-stage sampling
design applying the Rao (1975) formula.
Value
The estimated variances as a numerical vector of size the number of
columns of y
.
Author(s)
Martin Chevalier
References
Rao, J.N.K (1975), "Unbiased variance estimation for multistage designs", Sankhya, C n°37