Help for package fetch

Type:

Package

Title:

Fetch Data from Various Data Sources

Version:

0.1.5

Maintainer:

David Bosak <dbosak01@gmail.com>

Description:

Contains functions to fetch data from various data sources. The user first creates a catalog of objects from a data source, then fetches data from the catalog. The package provides an easy way to access data from many different types of sources.

Encoding:

UTF-8

License:

CC0

URL:

https://fetch.r-sassy.org

BugReports:

https://github.com/dbosak01/fetch/issues

Depends:

R (≥ 4.1), common

Imports:

readr, readxl, haven, crayon, tibble, foreign

Suggests:

knitr, rmarkdown, testthat (≥ 3.0.0)

Config/testthat/edition:

VignetteBuilder:

knitr

RoxygenNote:

7.3.1

NeedsCompilation:

Packaged:

2024-02-11 00:10:39 UTC; dbosa

Author:

David Bosak [aut, cre], Kevin Kramer [ctb], Archytas Clinical Solutions [cph]

Repository:

CRAN

Date/Publication:

2024-02-11 00:30:02 UTC

Fetch data from many data sources

Description

The fetch package allows you to retrieve data from many different data sources. The package retrieves data in a memory-efficient manner. You first identify the data by defining a data catalog. Then fetch the data from the catalog. Catalogs can be defined for many popular data formats: csv, rds, sas7bdat, excel, etc.

The functions contained in the fetch package are as follows:

catalog: Creates a data catalog for a data source
fetch: Fetches a dataset from a catalog item
import_spec: Defines an import spec for a specific dataset

The fetch function retrieves a dataset from a data catalog. The function accepts a catalog item as the first parameter. The catalog item is the only required parameter. The "select" parameter allows you to pull only some of the columns. The "where" and "top" parameters may be used to define a subset of the data to retrieve. The "import_specs" parameter accepts an import_spec object, which can be used to control how data is read into the data frame.

Usage

fetch(catalog, select = NULL, where = NULL, top = NULL, import_specs = NULL)

Arguments

catalog

The catalog item to fetch data for. Catalog items are created using the catalog function.

select

A vector of column names or column numbers to extract from the data item. Note that the column names can be easily obtained as a vector from the catalog item, and then manipulated to suit your needs.

where

An optional expression to be used to filter the fetched data. Use the base R expression function to define the expression. The expression allows logical operators and Base R functions. Column names can be unquoted.

top

The number of records to return from the head of the data item. The value must be an integer.

import_specs

The import specs to use for the fetch operation. Import specs can be used to control the data types of the fetched dataset. An import specification is created with the import_spec function. See the documentation of this function for additional details and an example.
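To make the "where" argument more concrete, here is a minimal base R sketch of how an object built with expression can be evaluated against a data frame so that column names resolve without quotes. It illustrates the general mechanism only and is not taken from the fetch source code; the data frame df and the variable names are made up for the illustration.

```r
# Hypothetical data, invented for this illustration
df <- data.frame(SUBJID = c("049", "050", "051", "051"),
                 AVAL = c(1, 2, 3, NA),
                 stringsAsFactors = FALSE)

# Define the filter once; SUBJID and AVAL are unquoted column names
flt <- expression(SUBJID == "051" & !is.na(AVAL))

# Evaluate the expression with the data frame as the lookup environment
keep <- eval(flt[[1]], envir = df)

# Use the resulting logical vector to subset the rows
res <- df[keep, ]
res
```

Because the expression is stored as an object, the same filter can be defined once and passed to several calls.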

Value

The desired dataset, returned as a tibble.

Author(s)

Maintainer: David Bosak dbosak01@gmail.com

Other contributors:

Kevin Kramer kkrame02@amgen.com [contributor]
Archytas Clinical Solutions [copyright holder]

Examples

# Get data directory
pkg <- system.file("extdata", package = "fetch")

# Create catalog
ct <- catalog(pkg, engines$csv)

# View catalog
ct
# data catalog: 6 items
# - Source: C:/packages/fetch/inst/extdata
# - Engine: csv
# - Items:
  # data item 'ADAE': 56 cols 150 rows
  # data item 'ADEX': 17 cols 348 rows
  # data item 'ADPR': 37 cols 552 rows
  # data item 'ADPSGA': 42 cols 695 rows
  # data item 'ADSL': 56 cols 87 rows
  # data item 'ADVS': 37 cols 3617 rows

# Example 1: Fetch Entire Dataset

# Get data from the catalog
dat1 <- fetch(ct$ADEX)

# View Data
dat1
# A tibble: 348 × 17                                                                                      
#   STUDYID USUBJID   SUBJID SITEID TRTP  TRTPN TRTA  TRTAN RANDFL SAFFL
#   <chr>   <chr>     <chr>  <chr>  <chr> <dbl> <chr> <dbl> <chr>  <chr>
#  1 ABC     ABC-01-0… 049    01     ARM D     4 ARM D     4 Y      Y    
#  2 ABC     ABC-01-0… 049    01     ARM D     4 ARM D     4 Y      Y    
#  3 ABC     ABC-01-0… 049    01     ARM D     4 ARM D     4 Y      Y    
#  4 ABC     ABC-01-0… 049    01     ARM D     4 ARM D     4 Y      Y    
#  5 ABC     ABC-01-0… 050    01     ARM B     2 ARM B     2 Y      Y    
#  6 ABC     ABC-01-0… 050    01     ARM B     2 ARM B     2 Y      Y    
#  7 ABC     ABC-01-0… 050    01     ARM B     2 ARM B     2 Y      Y    
#  8 ABC     ABC-01-0… 050    01     ARM B     2 ARM B     2 Y      Y    
#  9 ABC     ABC-01-0… 051    01     ARM A     1 ARM A     1 Y      Y    
# 10 ABC     ABC-01-0… 051    01     ARM A     1 ARM A     1 Y      Y    
#  338 more rows
#  7 more variables: MITTFL <chr>, PPROTFL <chr>, PARAM <chr>,
#  PARAMCD <chr>, PARAMN <dbl>, AVAL <dbl>, AVALCAT1 <chr>
#  Use `print(n = ...)` to see more rows

# Example 2: Fetch a Subset

# Get data with selected columns and where expression
dat2 <- fetch(ct$ADEX, select = c("SUBJID", "TRTA", "RANDFL", "SAFFL"),
              where = expression(SUBJID == '051'))

# View Data
dat2
# A tibble: 4 x 4
#   SUBJID TRTA  RANDFL SAFFL
#   <chr>  <chr> <chr>  <chr>
# 1 051    ARM A Y      Y    
# 2 051    ARM A Y      Y    
# 3 051    ARM A Y      Y    
# 4 051    ARM A Y      Y
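The "select" and "top" parameters can also be combined. The sketch below assumes that the column names of a catalog item can be read from its Column field (as the dictionary printouts in this manual suggest), and that "top" accepts an integer literal such as 5L. The variable names cols, keep, and dat3 are illustrative only.

```r
library(fetch)

# Get data directory and create catalog
pkg <- system.file("extdata", package = "fetch")
ct <- catalog(pkg, engines$csv)

# Pull the column names from the catalog item's dictionary
cols <- ct$ADEX$Column

# Drop two columns that are not needed
keep <- cols[!cols %in% c("STUDYID", "SITEID")]

# Fetch only the first 5 records of the remaining columns
dat3 <- fetch(ct$ADEX, select = keep, top = 5L)

# Check the dimensions of the result
dim(dat3)
```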

Create a data source catalog

Description

The catalog function returns a data catalog for a data source. A data catalog is like a collection of data dictionaries for all the datasets in the data source. The catalog allows you to examine the datasets in the data source without yet loading anything into memory. Once you decide which data items you want to load, use the fetch function to load that item into memory.

Usage

catalog(source, engine, pattern = NULL, where = NULL, import_specs = NULL)

Arguments

source

The source for the data. This parameter is required. Normally the source is passed as a full or relative path.

engine

The data engine to use for this data source. This parameter is required. The available data engines are listed in the engines enumeration. For example, engines$csv will specify the CSV engine, and engines$rdata will specify the RDATA engine.

pattern

A pattern to use when loading data items from the data source. The pattern can be a name or a vector of names. Names also accept wildcards. The supplied pattern will be used to filter which data items are loaded into the catalog. For example, pattern = "AD*" will load only datasets that start with "AD".

where

A where expression to use when fetching the data. This expression will apply to all fetch operations on this catalog. The where expression should be defined with the Base R expression function. The expression is unquoted and can use any Base R operators or functions.

import_specs

The import specs to use for any fetch operation on this catalog. The import spec can be used to control the data types on the incoming columns. You can create separate import specs for each dataset, or one import spec to use for all datasets. See the import_spec and specs functions for more information about this capability.

Value

The loaded data catalog, as class "dcat". The catalog will be a list of data dictionaries. Each data dictionary is a tibble.

Examples

# Get data directory
pkg <- system.file("extdata", package = "fetch")

# Create catalog
ct <- catalog(pkg, engines$csv)

# Example 1: Catalog all rows

# View catalog
ct
# data catalog: 6 items
# - Source: C:/packages/fetch/inst/extdata
# - Engine: csv
# - Items:
  # data item 'ADAE': 56 cols 150 rows
  # data item 'ADEX': 17 cols 348 rows
  # data item 'ADPR': 37 cols 552 rows
  # data item 'ADPSGA': 42 cols 695 rows
  # data item 'ADSL': 56 cols 87 rows
  # data item 'ADVS': 37 cols 3617 rows

# View catalog item
ct$ADEX
# data item 'ADEX': 17 cols 348 rows
# - Engine: csv
# - Size: 70.7 Kb
# - Last Modified: 2020-09-18 14:30:22
#    Name   Column     Class Label Format NAs MaxChar
# 1  ADEX  STUDYID character  <NA>     NA   0       3
# 2  ADEX  USUBJID character  <NA>     NA   0      10
# 3  ADEX   SUBJID character  <NA>     NA   0       3
# 4  ADEX   SITEID character  <NA>     NA   0       2
# 5  ADEX     TRTP character  <NA>     NA   8       5
# 6  ADEX    TRTPN   numeric  <NA>     NA   8       1
# 7  ADEX     TRTA character  <NA>     NA   8       5
# 8  ADEX    TRTAN   numeric  <NA>     NA   8       1
# 9  ADEX   RANDFL character  <NA>     NA   0       1
# 10 ADEX    SAFFL character  <NA>     NA   0       1
# 11 ADEX   MITTFL character  <NA>     NA   0       1
# 12 ADEX  PPROTFL character  <NA>     NA   0       1
# 13 ADEX    PARAM character  <NA>     NA   0      45
# 14 ADEX  PARAMCD character  <NA>     NA   0       8
# 15 ADEX   PARAMN   numeric  <NA>     NA   0       1
# 16 ADEX     AVAL   numeric  <NA>     NA  16       4
# 17 ADEX AVALCAT1 character  <NA>     NA  87      10


# Example 2: Catalog with where expression
ct <- catalog(pkg, engines$csv, where = expression(SUBJID == '049'))

# View catalog item - Now only 4 rows
ct$ADEX
# data item 'ADEX': 17 cols 4 rows
# - Where: SUBJID == "049"
# - Engine: csv
# - Size: 4.5 Kb
# - Last Modified: 2020-09-18 14:30:22
#    Name   Column     Class Label Format NAs MaxChar
# 1  ADEX  STUDYID character  <NA>     NA   0       3
# 2  ADEX  USUBJID character  <NA>     NA   0      10
# 3  ADEX   SUBJID character  <NA>     NA   0       3
# 4  ADEX   SITEID character  <NA>     NA   0       2
# 5  ADEX     TRTP character  <NA>     NA   0       5
# 6  ADEX    TRTPN   numeric  <NA>     NA   0       1
# 7  ADEX     TRTA character  <NA>     NA   0       5
# 8  ADEX    TRTAN   numeric  <NA>     NA   0       1
# 9  ADEX   RANDFL character  <NA>     NA   0       1
# 10 ADEX    SAFFL character  <NA>     NA   0       1
# 11 ADEX   MITTFL character  <NA>     NA   0       1
# 12 ADEX  PPROTFL character  <NA>     NA   0       1
# 13 ADEX    PARAM character  <NA>     NA   0      45
# 14 ADEX  PARAMCD character  <NA>     NA   0       8
# 15 ADEX   PARAMN   numeric  <NA>     NA   0       1
# 16 ADEX     AVAL   numeric  <NA>     NA   0       4
# 17 ADEX AVALCAT1 character  <NA>     NA   1      10
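The "pattern" parameter is not covered by the examples above. The following sketch restricts the catalog to a subset of the sample datasets by combining an exact name with a wildcard, as the argument description allows. It assumes the catalog items can be listed with names(), which is consistent with the ct$ADEX style of access but is not stated explicitly in this manual.

```r
library(fetch)

# Get data directory
pkg <- system.file("extdata", package = "fetch")

# Load only ADSL plus any dataset whose name starts with "ADV"
ct2 <- catalog(pkg, engines$csv, pattern = c("ADSL", "ADV*"))

# List the item names that made it into the catalog
names(ct2)
```

Filtering at catalog time keeps the catalog small when a source directory holds many files and only a few are of interest.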

A list of engine types

Description

The engines enumeration contains all possible options for the "engine" parameter of the catalog function. Use this enumeration to specify what kind of data you would like to load. Options are: csv, dbf, rda, rds, rdata, sas7bdat, xls, xlsx, and xpt.

Usage

engines

Format

An object of class etype of length 9.

Value

The engine parameter string.

Examples

# Get data directory
pkg <- system.file("extdata", package = "fetch")

# Create catalog
ct <- catalog(pkg, engines$csv)

# Example 1: Catalog all rows

# View catalog
ct
# data catalog: 6 items
# - Source: C:/packages/fetch/inst/extdata
# - Engine: csv
# - Items:
  # data item 'ADAE': 56 cols 150 rows
  # data item 'ADEX': 17 cols 348 rows
  # data item 'ADPR': 37 cols 552 rows
  # data item 'ADPSGA': 42 cols 695 rows
  # data item 'ADSL': 56 cols 87 rows
  # data item 'ADVS': 37 cols 3617 rows
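The examples in this manual all use engines$csv. As a hedged sketch of switching engines, the code below writes a small RDS file to a temporary directory and catalogs it with engines$rds. The directory name, the file name scores.rds, and the toy data are invented for the illustration, and the item name "scores" assumes that items are named after the file without its extension, as with the CSV samples.

```r
library(fetch)

# Create a scratch directory holding a single RDS file
demo_dir <- file.path(tempdir(), "fetch_rds_demo")
dir.create(demo_dir, showWarnings = FALSE)
saveRDS(data.frame(ID = 1:3, NAME = c("a", "b", "c")),
        file.path(demo_dir, "scores.rds"))

# Pick the engine from the enumeration
eng <- engines$rds

# Catalog the directory with the RDS engine and fetch the item
ct_rds <- catalog(demo_dir, eng)
dat_rds <- fetch(ct_rds$scores)
dat_rds
```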

Create an Import Specification

Description

A function to create the import specifications for a particular data file. This information can be used on the catalog or fetch functions to correctly assign the data types for columns on imported data. The import specifications are defined as name/value pairs, where the name is the column name and the value is the data type indicator. Available data type indicators are 'guess', 'logical', 'character', 'integer', 'numeric', 'date', 'datetime', and 'time'.

Also note that multiple import specifications can be combined into a collection, and assigned to an entire catalog. See the specs function for an example of using a specs collection.

Usage

import_spec(..., na = NULL, trim_ws = NULL)

Arguments

...

Named pairs of column names and column data types, separated by commas. Available types are: 'guess', 'logical', 'character', 'integer', 'numeric', 'date', 'datetime', and 'time'. The date/time data types accept an optional input format. To supply the input format, append it after the data type following an equals sign, e.g.: 'date=%d%b%Y' or 'datetime=%d-%m-%Y %H:%M:%S'. Default is NULL, meaning no column types are specified, and the function should make its best guess for each column.

na

A vector of values to be treated as NA. For example, the vector c('', ' ') will cause empty strings and single blanks to be converted to NA values. Default is NULL, meaning the value of the na parameter will be taken from the specs function. Any value supplied on the import_spec function will override the value from the specs function.

trim_ws

Whether or not to trim white space from the input data values. The default is NULL, meaning the value of the trim_ws parameter will be taken from the specs function. Any value supplied on the import_spec function will override the value from the specs function.

Value

The import specification object. The class of the object will be "import_spec".

Date/Time Format Codes

Below are some common date formatting codes. For a complete list, see the documentation for the strptime function:

%d = day as a number
%a = abbreviated weekday
%A = unabbreviated weekday
%m = month number
%b = abbreviated month name
%B = unabbreviated month name
%y = 2-digit year
%Y = 4-digit year
%H = hour
%M = minute
%S = second
%p = AM/PM indicator
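The codes above are the same ones used by base R date parsing, so a format string can be checked in isolation before it goes into an import spec. The sketch below uses only base R functions with purely numeric formats; codes such as %b and %a depend on the session locale and are left out for that reason. The sample strings are invented for the illustration.

```r
# Parse a date written as day-month-year with a numeric month
d <- as.Date("18-09-2020", format = "%d-%m-%Y")
format(d, "%Y-%m-%d")   # "2020-09-18"

# Parse a datetime; the time zone is fixed so the result is reproducible
dt <- as.POSIXct("18-09-2020 14:30:22",
                 format = "%d-%m-%Y %H:%M:%S", tz = "UTC")
format(dt, "%H:%M")     # "14:30"

# A format that does not match the input yields NA rather than an error
bad <- as.Date("2020/09/18", format = "%d-%m-%Y")
is.na(bad)              # TRUE
```

An NA result from a quick check like this is a sign that the format string does not match the data.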

Examples

# Get sample data directory
pkg <- system.file("extdata", package = "fetch")

# Create import spec
spc <- import_spec(TRTSDT = "date=%d%b%Y",
                   TRTEDT = "date=%d%b%Y")

# Create catalog without filter
ct <- catalog(pkg, engines$csv, import_specs = spc)

# Get dictionary for ADVS with Import Spec
d <- ct$ADVS

# Observe data types for TRTSDT and TRTEDT are now Dates
d[d$Column %in% c("TRTSDT", "TRTEDT"), ]
# data item 'ADVS': 37 cols 3617 rows
#- Engine: csv
#- Size: 1.1 Mb
#- Last Modified: 2020-09-18 14:30:22
#   Name Column Class Label Format NAs MaxChar
#16 ADVS TRTSDT  Date  <NA>     NA  54      10
#17 ADVS TRTEDT  Date  <NA>     NA 119      10
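An import spec can also be supplied at fetch time instead of on the catalog, and it can carry its own na and trim_ws settings. The sketch below is one way to do that under the signatures documented in this manual; the particular na vector is an arbitrary choice for the illustration.

```r
library(fetch)

# Get sample data directory and create a catalog without any specs
pkg <- system.file("extdata", package = "fetch")
ct <- catalog(pkg, engines$csv)

# Spec with date types plus spec-level na and trim_ws settings
spc <- import_spec(TRTSDT = "date=%d%b%Y",
                   TRTEDT = "date=%d%b%Y",
                   na = c("", "NA", " "),
                   trim_ws = TRUE)

# Apply the spec to this fetch operation only
dat <- fetch(ct$ADVS, import_specs = spc)

# Inspect the resulting column classes
class(dat$TRTSDT)
class(dat$TRTEDT)
```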

Print a data catalog

Description

A class-specific instance of the print function for a data catalog. The function prints the catalog in a summary manner. Use the verbose = TRUE option to print the catalog as a list.

Usage

## S3 method for class 'dcat'
print(x, ..., verbose = FALSE)

Arguments

x

The catalog to print.

...

Any follow-on parameters.

verbose

Whether or not to print the catalog in verbose style. By default, the parameter is FALSE, meaning to print in summary style.

Value

The object, invisibly.

Examples

# Get data directory
pkg <- system.file("extdata", package = "fetch")

# Create catalog
ct <- catalog(pkg, engines$csv)

# View catalog
print(ct)
# data catalog: 6 items
# - Source: C:/packages/fetch/inst/extdata
# - Engine: csv
# - Items:
  # data item 'ADAE': 56 cols 150 rows
  # data item 'ADEX': 17 cols 348 rows
  # data item 'ADPR': 37 cols 552 rows
  # data item 'ADPSGA': 42 cols 695 rows
  # data item 'ADSL': 56 cols 87 rows
  # data item 'ADVS': 37 cols 3617 rows
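The verbose option and the invisible return value can be exercised as in the following sketch. It uses only the print signature documented above; the variable name res is illustrative, and the verbose output itself is not reproduced here.

```r
library(fetch)

# Get data directory and create catalog
pkg <- system.file("extdata", package = "fetch")
ct <- catalog(pkg, engines$csv)

# Print the catalog as a list and capture the invisible return value
res <- print(ct, verbose = TRUE)

# The returned object is the catalog itself
class(res)
```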

Print a data catalog item

Description

A class-specific instance of the print function for data catalog items. The function prints the info in a summary manner. Use verbose = TRUE to print the data info as a list.

Usage

## S3 method for class 'dinfo'
print(x, ..., verbose = TRUE)

Arguments

x

The catalog item to print.

...

Any follow-on parameters.

verbose

Whether or not to print the info in verbose style. As shown in the usage above, the default is TRUE. Verbose style includes a full data dictionary and printing of all attributes. Set verbose = FALSE to print in summary style.

Value

The data catalog item, invisibly.

Print import specifications

Description

A function to print the import specification collection.

Usage

## S3 method for class 'specs'
print(x, ..., verbose = FALSE)

Arguments

x

The specifications to print.

...

Any follow-on parameters to the print function.

verbose

Whether or not to print the specifications in verbose style. By default, the parameter is FALSE, meaning to print in summary style.

Value

The specification object, invisibly.

Read import specs from the file system

Description

A function to read import specifications from the file system. The function accepts a full or relative path to the spec file, and returns the specs as an object. If the file_path parameter is passed as a directory name, the function will search for a file with a '.specs' extension and read it.

Usage

read.specs(file_path = getwd())

Arguments

file_path

The full or relative path to the spec file or to a directory that contains it. Default is the current working directory. If the file_path is a file name that does not contain the '.specs' file extension, the function will add the extension. If the file_path contains a directory name, the function will search the directory for a file with an extension of '.specs'. If more than one file with an extension of '.specs' is found, the function will generate an error.

Value

The specifications object.

Create an Import Spec Collection

Description

A function to create a collection of import specifications for a data source. These specs can be used on the catalog function to correctly assign the data types uniquely for different imported data files. The spec collection is a set of import_spec objects identified by name/value pairs. The name corresponds to the name of the input dataset, without file extension. The value is the import_spec object to use for that dataset. In this way, you may define different specs for each dataset in your catalog.

The import engines will guess at the data types for any columns that are not explicitly defined in the import specifications. The import spec syntax is the same for all data engines.

Note that the na and trim_ws parameters on the specs function will be applied globally to all files in the catalog. These global settings can be overridden on the import_spec for any particular data file.

Also note that the specs collection is defined as an object so it can be stored and reused. See the write.specs and read.specs functions for additional information on saving and restoring specs.

Usage

specs(..., na = c("", "NA"), trim_ws = TRUE)

Arguments

...

Named input specs. The name should correspond to the file name, without the file extension. The spec is defined as an import_spec object. See the import_spec function for additional information on parameters for that object.

na

A vector of values to be treated as NA. For example, the vector c('', ' ') will cause empty strings and single blanks to be converted to NA values. For most file types, empty strings and the string 'NA' ('', 'NA') are considered NA. For SAS® datasets and transport files, a single blank and a single dot c(" ", ".") are considered NA. The value of the na parameter on the specs function can be overridden by the na parameter on the import_spec function.

trim_ws

Whether or not to trim white space from the input data values. Valid values are TRUE and FALSE. Default is TRUE. The value of the trim_ws parameter on the specs function can be overridden by the trim_ws parameter on the import_spec function.

Value

The import spec collection. The class of the object is "specs".

Examples

# Get sample data directory
pkg <- system.file("extdata", package = "fetch")

# Create import spec
spc <- specs(ADAE = import_spec(TRTSDT = "date=%d%b%Y",
                                TRTEDT = "date=%d%b%Y"),
             ADVS = import_spec(TRTSDT = "character",
                                TRTEDT = "character"))

# Create catalog with specs collection
ct <- catalog(pkg, engines$csv, import_specs = spc)

# Get dictionary for ADAE with Import Spec
d1 <- ct$ADAE

# Observe data types for TRTSDT and TRTEDT are Dates
d1[d1$Column %in% c("TRTSDT", "TRTEDT"), ]
# data item 'ADAE': 56 cols 150 rows
#- Engine: csv
#- Size: 155 Kb
#- Last Modified: 2020-09-18 14:30:22
#   Name Column Class Label Format NAs MaxChar
#13 ADAE TRTSDT  Date  <NA>     NA   1      10
#14 ADAE TRTEDT  Date  <NA>     NA   4      10

# Get dictionary for ADVS with Import Spec
d2 <- ct$ADVS

# Observe data types for TRTSDT and TRTEDT are character
d2[d2$Column %in% c("TRTSDT", "TRTEDT"), ]
# data item 'ADVS': 37 cols 3617 rows
#- Engine: csv
#- Size: 1.1 Mb
#- Last Modified: 2020-09-18 14:30:22
#   Name Column     Class Label Format NAs MaxChar
#16 ADVS TRTSDT character  <NA>     NA  54       9
#17 ADVS TRTEDT character  <NA>     NA 119       9
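The global na and trim_ws settings, and their override on a single import_spec, can be sketched as follows. The specific values are arbitrary choices for the illustration; only the parameter names and the override rule come from the descriptions above.

```r
library(fetch)

# Get sample data directory
pkg <- system.file("extdata", package = "fetch")

# Global settings apply to every dataset unless a spec overrides them
spc <- specs(ADAE = import_spec(TRTSDT = "date=%d%b%Y",
                                TRTEDT = "date=%d%b%Y"),
             ADVS = import_spec(TRTSDT = "character",
                                trim_ws = FALSE),  # override for ADVS only
             na = c("", "NA", " "),
             trim_ws = TRUE)

# Create catalog with the specs collection
ct <- catalog(pkg, engines$csv, import_specs = spc)

# Check the class of the collection
class(spc)
```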

Write import specs to the file system

Description

A function to write import specifications to the file system. The function accepts a specifications object and a full or relative path. The function returns the full file path. This function is useful so that you can define import specifications once, and reuse them in multiple programs or across multiple teams.

Usage

write.specs(x, dir_path = getwd(), file_name = NULL)

Arguments

x

A specifications object of class 'specs'.

dir_path

A full or relative path to save the specs. Default is the current working directory.

file_name

The file name to save the specs to, without a file extension. The '.specs' file extension will be added automatically. If no file name is supplied, the function will use the variable name as the file name.

Value

The full file path.
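A round trip through write.specs and read.specs might look like the sketch below. The directory and file names are invented for the illustration, and the code relies only on the signatures and return values documented in this manual.

```r
library(fetch)

# Define a specs collection to save
spc <- specs(ADAE = import_spec(TRTSDT = "date=%d%b%Y",
                                TRTEDT = "date=%d%b%Y"))

# Use a dedicated scratch directory so only one .specs file is present
out_dir <- file.path(tempdir(), "fetch_specs_demo")
dir.create(out_dir, showWarnings = FALSE)

# Write the specs; the '.specs' extension is added automatically
pth <- write.specs(spc, dir_path = out_dir, file_name = "study_specs")
pth

# Read the specs back using the returned file path
spc2 <- read.specs(pth)

# Alternatively, pass the directory and let the function find the file
spc3 <- read.specs(out_dir)
```

Keeping each saved collection in its own directory avoids the error raised when a directory holds more than one '.specs' file.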
