This vignette details the data requirements, implementation steps, and relevant code references for data preparation in shinymrp.
MRP requires two main data components: the survey or test data and a corresponding poststratification table. The workflow involves two stages:
Data preprocessing accepts either of two formats, individual-level data or aggregated data:
For continuous outcomes, only individual-level data
are supported.
For binary outcomes, the aggregated format is preferred
for computational efficiency; individual-level data are aggregated
automatically upon upload. Other data requirements depend on format,
primarily regarding outcome measures.
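The aggregation of individual-level binary data can be sketched as follows. This is a minimal Python illustration, not the package's own code: the function name `aggregate_binary` and the grouping columns `sex` and `age` are hypothetical, while `positive` and `total` follow the column names required for binary outcomes.

```python
from collections import defaultdict

def aggregate_binary(records, group_keys):
    """Collapse individual-level rows into one row per cell.

    Each output row carries `total` (number of records in the cell)
    and `positive` (number of records with a positive outcome).
    """
    cells = defaultdict(lambda: {"total": 0, "positive": 0})
    for row in records:
        key = tuple(row[k] for k in group_keys)
        cells[key]["total"] += 1
        cells[key]["positive"] += int(row["positive"])
    # Sort the cells so the output order is deterministic.
    return [
        {**dict(zip(group_keys, key)), **counts}
        for key, counts in sorted(cells.items())
    ]

# Made-up individual-level records for illustration only.
individual = [
    {"sex": "female", "age": "18-34", "positive": 1},
    {"sex": "female", "age": "18-34", "positive": 0},
    {"sex": "male", "age": "35-64", "positive": 1},
]
aggregated = aggregate_binary(individual, ["sex", "age"])
```

Three individual rows collapse into two cells here, which is why the aggregated format is cheaper to fit when many records share the same cell.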
The code expects columns with specific names and values (not case-sensitive):
Input data are grouped into categories so that each has clear requirements and its own implementation modules. The two primary categories, time-varying and cross-sectional, each support a specific application as well as general cases. The following cheatsheet summarizes the requirements and typical preprocessing outputs for each.
Time-varying: COVID-19 Test Data; General
Cross-sectional: Public Opinion Poll Data; General
The preprocessing pipeline includes:
Code reference: preprocess.
A major strength of MRP is small area estimation, so it is advisable to include as much geographic/geo-covariate information as possible.
First, the application identifies geographic units at larger scales that are not present in the data. It automatically determines the smallest geographic unit in the data and infers the corresponding units at larger scales. For example, if the data contain ZIP codes, the application finds the county and state that have the largest overlap with each ZIP code.
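The largest-overlap rule can be illustrated with a short Python sketch. The crosswalk rows, the `overlap` field (assumed here to be the share of a ZIP code falling in each county), and the function name are all hypothetical; the application's actual lookup tables and code may differ.

```python
def link_zip_to_larger_units(crosswalk):
    """For each ZIP code, keep the county/state pair with the largest overlap."""
    best = {}
    for row in crosswalk:
        current = best.get(row["zip"])
        if current is None or row["overlap"] > current["overlap"]:
            best[row["zip"]] = row
    return {z: (r["county"], r["state"]) for z, r in best.items()}

# Made-up crosswalk: ZIP "Z1" straddles two counties, "Z2" sits in one.
crosswalk = [
    {"zip": "Z1", "county": "C1", "state": "S1", "overlap": 0.8},
    {"zip": "Z1", "county": "C2", "state": "S1", "overlap": 0.2},
    {"zip": "Z2", "county": "C2", "state": "S1", "overlap": 1.0},
]
zip_links = link_zip_to_larger_units(crosswalk)
```

Each ZIP code ends up linked to exactly one county and state, so larger-scale identifiers can be attached to every record without ambiguity.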
Quantitative measures associated with geographic units are sourced from your data or external datasets. For general use cases, the app scans the data to find quantities that have a one-to-one relationship with the geographic identifier of interest.
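One way to picture the scan for general use cases is the sketch below. It assumes that a column qualifies when every geographic unit maps to exactly one value of it; this reading of "one-to-one", the function name, and the example columns are assumptions, not the app's documented behavior.

```python
def find_geo_covariates(rows, geo_col):
    """Return columns whose value is constant within every geographic unit."""
    candidates = [c for c in rows[0] if c != geo_col]
    covariates = []
    for col in candidates:
        seen = {}  # geographic unit -> first value observed for this column
        constant = True
        for row in rows:
            first = seen.setdefault(row[geo_col], row[col])
            if first != row[col]:
                constant = False  # two different values inside one unit
                break
        if constant:
            covariates.append(col)
    return covariates

# Made-up records: income is fixed per ZIP code, age varies within a ZIP code.
rows = [
    {"zip": "Z1", "median_income": 52000, "age": 34},
    {"zip": "Z1", "median_income": 52000, "age": 61},
    {"zip": "Z2", "median_income": 71000, "age": 34},
]
geo_covariates = find_geo_covariates(rows, "zip")
```

Individual-level attributes such as age fail the check because they vary within a geographic unit, so only area-level quantities are retained.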
For the COVID-19 use case, we have identified specific ZIP code-level measures that are informative in modeling COVID-19 test results. We obtain these quantities at the tract level from the ACS and other sources, then aggregate over the tracts that overlap with each ZIP code based on the USPS crosswalk table.
We obtain the following tract-level measures from the ACS and other sources:
Code reference: get_tract_data.
While the ACS reports geography at the levels of census tracts, counties, and states, ZIP codes are defined by the U.S. Postal Service (USPS). We use the ZIP code crosswalk table released by the U.S. Department of Housing and Urban Development and USPS to link ZIP codes to census tracts, and we calculate ZIP code-level measures by aggregating all available tract-level measures, weighted by tract population counts. ZIP code-level statistics are computed by combining the values across census tracts as follows:
Code reference: combine_tracts_covid.
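As a rough illustration of population-weighted aggregation, the Python sketch below computes a weighted mean of one tract-level measure for each ZIP code. It is not the logic of `combine_tracts_covid`, whose per-measure rules may differ; the tract IDs, the `college_share` measure, and the function name are hypothetical.

```python
def weighted_zip_measure(tracts, zip_to_tracts, measure):
    """Population-weighted mean of a tract-level measure for each ZIP code."""
    result = {}
    for zip_code, tract_ids in zip_to_tracts.items():
        weighted_sum = 0.0
        population = 0.0
        for tract_id in tract_ids:
            tract = tracts.get(tract_id)
            if tract is None:
                continue  # skip tracts with no available data
            weighted_sum += tract["population"] * tract[measure]
            population += tract["population"]
        # Leave the value undefined when no overlapping tract has data.
        result[zip_code] = weighted_sum / population if population > 0 else None
    return result

# Made-up tract-level data and crosswalk for illustration only.
tracts = {
    "T1": {"population": 1000, "college_share": 0.5},
    "T2": {"population": 3000, "college_share": 0.25},
}
zip_to_tracts = {"Z1": ["T1", "T2"], "Z2": ["T2"], "Z3": ["T9"]}
zip_measure = weighted_zip_measure(tracts, zip_to_tracts, "college_share")
```

For `Z1` the larger tract dominates: (1000 × 0.5 + 3000 × 0.25) / 4000 = 0.3125.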
Poststratification tables are computed from ACS data via the
tidycensus package and IPUMS and summarize the size of
every subpopulation defined by demographic and geographic
cross-categories. For efficiency, tables are precomputed at the tract
level and then aggregated for larger geographies. We select the county
with the most overlapping residential addresses for a given ZIP code as
the ZIP-linked county and sum over the overlapping tracts for each ZIP
code to obtain ZIP code-level population counts.
Code reference: combine_tracts and combine_tracts_covid.
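Summing tract-level cells into a ZIP code-level poststratification table might look like the sketch below. The cell keys, tract IDs, and function name are hypothetical, and the real `combine_tracts` implementation may be organized differently.

```python
from collections import Counter

def zip_poststrat(tract_counts, zip_to_tracts):
    """Sum tract-level cell counts over the tracts overlapping each ZIP code."""
    table = {}
    for zip_code, tract_ids in zip_to_tracts.items():
        totals = Counter()
        for tract_id in tract_ids:
            # Counter.update adds counts for matching demographic cells.
            totals.update(tract_counts.get(tract_id, {}))
        table[zip_code] = dict(totals)
    return table

# Made-up tract-level tables keyed by (sex, age) cell.
tract_counts = {
    "T1": {("female", "18-34"): 120, ("male", "18-34"): 100},
    "T2": {("female", "18-34"): 80},
}
poststrat = zip_poststrat(tract_counts, {"Z1": ["T1", "T2"]})
```

Precomputing the tract-level tables once means any larger geography can be assembled by this kind of summation rather than by querying the ACS again.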
Geographical columns are optional for general use. The app automatically identifies the smallest available geographic scale and infers higher levels.
For individual-level data, dates are automatically
converted to time indices but can be provided explicitly. Aggregated
data must include a time column with time indices.
Optionally include a date column (first day of each period)
for visualization. The interface uses time-invariant poststratification
data.
For continuous outcomes, name your outcome column
outcome.
For binary outcomes, the column in individual-level data
must be positive. For aggregated data, use
total (number in cell) and positive (number
positive in cell).
Survey weights must be in a column named
weight. If uploaded poststratification data contain
weights, they’re used to estimate population counts.