Title: Quantitative Text Kit
Version: 1.1.1
Description: Support package for the textbook "An Introduction to Quantitative Text Analysis for Linguists: Reproducible Research Using R" (Francom, 2024) <doi:10.4324/9781003393764>. Includes functions to acquire, clean, and analyze text data as well as functions to document and share the results of text analysis. The package is designed to be used in conjunction with the book, but can also be used as a standalone package for text analysis.
License: GPL (≥ 3)
URL: https://cran.r-project.org/package=qtkit
BugReports: https://github.com/qtalr/qtkit/issues
SystemRequirements: Chromium-based browser (e.g., Chrome, Chromium, or Brave)
Depends: R (≥ 4.1)
Imports: chromote, dplyr, ggplot2, gutenbergr, kableExtra, knitr, Matrix, openai, rlang, xml2
Suggests: httptest, rmarkdown, testthat (≥ 3.0.0), webshot2, fs, tibble, glue, readr
Config/testthat/edition: 3
Encoding: UTF-8
Language: en-US
RoxygenNote: 7.3.1
VignetteBuilder: knitr
Author: Jerid Francom [aut, cre, cph]
Maintainer: Jerid Francom <francojc@wfu.edu>
NeedsCompilation: no
Packaged: 2025-01-14 04:29:08 UTC; francojc
Repository: CRAN
Date/Publication: 2025-01-14 06:50:02 UTC

Add Package Citations to BibTeX File

Description

Adds citation information for R packages to a BibTeX file. Uses the knitr::write_bib function to generate and append package citations in BibTeX format.

Usage

add_pkg_to_bib(pkg_name, bib_file = "packages.bib")

Arguments

pkg_name

Character string. The name of the R package to add to the BibTeX file.

bib_file

Character string. The path and name of the BibTeX file to write to. Default is "packages.bib".

Details

The function will create the BibTeX file if it doesn't exist, or append to it if it does. It includes citations for both the specified package and all currently loaded packages.

Value

Invisible NULL. The function is called for its side effect of writing to the BibTeX file.

Examples

# Create a temporary BibTeX file
my_bib_file <- tempfile(fileext = ".bib")

# Add citations for dplyr package
add_pkg_to_bib("dplyr", my_bib_file)

# View the contents of the BibTeX file
readLines(my_bib_file) |> cat(sep = "\n")


Calculate Association Metrics for Bigrams

Description

This function calculates various association metrics (PMI, Dice's Coefficient, G-score) for bigrams in a given corpus.

Usage

calc_assoc_metrics(
  data,
  doc_index,
  token_index,
  type,
  association = "all",
  verbose = FALSE
)

Arguments

data

A data frame containing the corpus.

doc_index

Column in 'data' which represents the document index.

token_index

Column in 'data' which represents the token index.

type

Column in 'data' which represents the tokens or terms.

association

A character vector specifying which metrics to calculate. Can be any combination of 'pmi', 'dice_coeff', 'g_score', or 'all'. Default is 'all'.

verbose

A logical value indicating whether to keep the intermediate probability columns. Default is FALSE.

Value

A data frame with one row per bigram and columns for each calculated metric.

Examples

data_path <- system.file("extdata", "bigrams_data.rds", package = "qtkit")
data <- readRDS(data_path)

calc_assoc_metrics(data, doc_index, token_index, type)


Calculate Document Frequency

Description

Computes the document frequency (DF) for each term in a term-document matrix. DF is the number of documents in which each term appears at least once.

Usage

calc_df(tdm)

Arguments

tdm

A term-document matrix

Value

A numeric vector of document frequencies for each term
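
Since calc_df is an internal helper, the idea can be sketched in base R on a toy term-document matrix (rows = terms, columns = documents; the values below are made up for illustration):

```r
# Toy term-document matrix: rows are terms, columns are documents
tdm <- matrix(
  c(1, 0, 2,
    3, 1, 0,
    0, 0, 5),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("the", "of", "zebra"), paste0("doc", 1:3))
)

# Document frequency: in how many documents each term appears at least once
df <- rowSums(tdm > 0)
df
# the: 2, of: 2, zebra: 1
```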


Calculate Gries' Deviation of Proportions

Description

Computes the Deviation of Proportions (DP) measure developed by Stefan Th. Gries. DP measures how evenly distributed a term is across all parts of the corpus. The normalized version (DP_norm) is returned, which ranges from 0 (evenly distributed) to 1 (extremely clumped distribution).

Usage

calc_dp(tdm)

Arguments

tdm

A term-document matrix

Value

A numeric vector of normalized DP values for each term

References

Gries, S. T. (2008). Dispersions and adjusted frequencies in corpora. International Journal of Corpus Linguistics, 13(4), 403-437.
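
Gries' formula can be sketched in base R (an illustration of the published measure, not the package's internal code; the normalization divides by 1 minus the smallest expected part proportion):

```r
tdm <- matrix(
  c(5, 5, 5,   # "alpha": spread across all documents
    9, 0, 0),  # "beta": concentrated in one document
  nrow = 2, byrow = TRUE,
  dimnames = list(c("alpha", "beta"), paste0("doc", 1:3))
)

expected <- colSums(tdm) / sum(tdm)     # expected proportions (part sizes)
dp_norm <- apply(tdm, 1, function(freqs) {
  observed <- freqs / sum(freqs)        # observed proportions of the term
  dp <- sum(abs(observed - expected)) / 2
  dp / (1 - min(expected))              # normalize to the 0-1 range
})

# The concentrated term scores closer to 1 than the evenly spread one
dp_norm["beta"] > dp_norm["alpha"]
```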


Calculate Inverse Document Frequency

Description

Computes the inverse document frequency (IDF) for each term in a term-document matrix. IDF is calculated as log(N/df) where N is the total number of documents and df is the document frequency of the term.

Usage

calc_idf(tdm)

Arguments

tdm

A term-document matrix

Value

A numeric vector of inverse document frequencies for each term
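
The formula can be illustrated in base R (calc_idf is internal, so this recomputes log(N/df) directly on a toy matrix):

```r
tdm <- matrix(
  c(1, 2, 3,
    0, 0, 4),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("common", "rare"), paste0("doc", 1:3))
)

n_docs <- ncol(tdm)
df <- rowSums(tdm > 0)   # document frequency per term
idf <- log(n_docs / df)  # IDF = log(N / df)
idf
# "common" occurs in every document, so its IDF is log(3/3) = 0
```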


Calculate Normalized Entropy for Categorical Variables

Description

Computes the normalized entropy (uncertainty measure) for categorical variables, providing a standardized measure of dispersion or randomness in the data.

Usage

calc_normalized_entropy(x)

Arguments

x

A character vector or factor containing categorical data.

Details

The calculation process:

  1. Computes category proportions

  2. Calculates raw entropy using Shannon's formula

  3. Normalizes by dividing by maximum possible entropy
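
The three steps can be sketched in base R (this sketch drops missing values before tabulating, which table() does by default):

```r
x <- c("A", "B", "B", "C", "C", "C", "D", "D", "D", "D")

p <- prop.table(table(x))      # 1. category proportions
h <- -sum(p * log(p))          # 2. Shannon entropy
h_norm <- h / log(length(p))   # 3. divide by maximum entropy, log(k)
round(h_norm, 3)
# 0.923
```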

Value

A numeric value between 0 and 1 representing the normalized entropy, where 0 indicates that all observations fall in a single category and 1 indicates that observations are spread evenly across all categories.

Examples

# Calculate entropy for a simple categorical vector
x <- c("A", "B", "B", "C", "C", "C", "D", "D", "D", "D")
calc_normalized_entropy(x)

# Handle missing values
y <- c("A", "B", NA, "C", "C", NA, "D", "D")
calc_normalized_entropy(y)

# Works with factors too
z <- factor(c("Low", "Med", "Med", "High", "High", "High"))
calc_normalized_entropy(z)


Calculate Observed Relative Frequency

Description

Computes the observed relative frequency (ORF) for each term in a term-document matrix. ORF is the relative frequency expressed as a percentage (RF * 100).

Usage

calc_orf(tdm)

Arguments

tdm

A term-document matrix

Value

A numeric vector of observed relative frequencies (as percentages) for each term


Internal Functions for Calculating Dispersion and Frequency Metrics

Description

A collection of internal helper functions that calculate various dispersion and frequency metrics from term-document matrices. These functions support the main calc_type_metrics function by providing specialized calculations for different statistical measures.

Computes the relative frequency (RF) for each term in a term-document matrix, representing how often each term occurs relative to the total corpus size.

Usage

calc_rf(tdm)

Arguments

tdm

A sparse term-document matrix (Matrix package format)

Details

The package implements these metrics:

Dispersion measures: document frequency (calc_df), inverse document frequency (calc_idf), and Gries' deviation of proportions (calc_dp).

Frequency measures: relative frequency (calc_rf) and observed relative frequency (calc_orf).

The calculation process:

  1. Sums occurrences of each term across all documents

  2. Divides by total corpus size (sum of all terms)

  3. Returns proportions between 0 and 1

Value

A numeric vector where each element represents a term's relative frequency in the corpus (range: 0-1)

References

Gries, S. T. (2008). Dispersions and adjusted frequencies in corpora. International Journal of Corpus Linguistics, 13(4), 403-437.
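
The relative-frequency steps above, plus the percentage form used by calc_orf, can be sketched in base R on a toy matrix:

```r
tdm <- matrix(
  c(2, 1, 1,
    0, 3, 3),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("a", "b"), paste0("doc", 1:3))
)

rf <- rowSums(tdm) / sum(tdm)  # proportions in the 0-1 range
orf <- rf * 100                # observed relative frequency, per 100 tokens
rf
# a: 0.4, b: 0.6
```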


Calculate Frequency and Dispersion Metrics for Text Types

Description

Calculates various frequency and dispersion metrics for types (terms/tokens) in tokenized text data. Provides a comprehensive analysis of how types are distributed across documents in a corpus.

Usage

calc_type_metrics(data, type, document, frequency = NULL, dispersion = NULL)

Arguments

data

Data frame. Contains the tokenized text data with document IDs and types/terms.

type

Symbol. Column in data containing the types to analyze (e.g., terms, lemmas).

document

Symbol. Column in data containing the document identifiers.

frequency

Character vector. Frequency metrics to calculate:

  • NULL (default): Returns only type counts

  • 'all': All available metrics

  • 'rf': Relative frequency

  • 'orf': Observed relative frequency (per 100)

dispersion

Character vector. Dispersion metrics to calculate:

  • NULL (default): Returns only type counts

  • 'all': All available metrics

  • 'df': Document frequency

  • 'idf': Inverse document frequency

  • 'dp': Gries' deviation of proportions

Details

The function creates a term-document matrix internally and calculates the requested metrics. Frequency metrics show how often types occur, while dispersion metrics show how evenly they are distributed across documents.

The 'dp' metric (Gries' Deviation of Proportions) ranges from 0 (perfectly even distribution) to 1 (completely clumped distribution).

Value

Data frame with one row per type, containing the type and one column for each requested metric.

References

Gries, S. T. (2023). Statistical Methods in Corpus Linguistics. In Readings in Corpus Linguistics: A Teaching and Research Guide for Scholars in Nigeria and Beyond, pp. 78-114.

Examples

data_path <- system.file("extdata", "types_data.rds", package = "qtkit")
df <- readRDS(data_path)
calc_type_metrics(
  data = df,
  type = letter,
  document = doc_id,
  frequency = c("rf", "orf"),
  dispersion = "dp"
)

Calculate Probabilities for Bigrams

Description

Helper function that calculates joint and marginal probabilities for bigrams in the input data using dplyr. It processes the data to create bigrams and computes their probabilities along with individual token probabilities.

Usage

calculate_bigram_probabilities(data, doc_index, token_index, type)

Arguments

data

A data frame containing the corpus

doc_index

Column name for document index

token_index

Column name for token position

type

Column name for the actual tokens/terms

Value

A data frame containing the bigram components, their joint probability (p_xy), and the marginal probabilities of the first and second tokens (p_x, p_y).


Calculate Association Metrics

Description

Helper function that computes various association metrics for bigrams based on their probability distributions. Supports PMI (Pointwise Mutual Information), Dice's Coefficient, and G-score calculations.

Usage

calculate_metrics(bigram_probs, association)

Arguments

bigram_probs

A data frame containing bigram probability data with columns:

  • p_xy Joint probability of bigram

  • p_x Marginal probability of first token

  • p_y Marginal probability of second token

association

Character vector specifying which metrics to calculate

Value

A data frame containing the original probability columns plus the requested association metrics (pmi, dice_coeff, and/or g_score).
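
Under the standard definitions that the metric names suggest (the package's exact formulas, especially for the G-score, should be checked against its source), PMI and Dice's Coefficient follow directly from these probability columns. The values below are made up for illustration:

```r
p_xy <- 0.02  # joint probability of the bigram
p_x  <- 0.10  # marginal probability of the first token
p_y  <- 0.05  # marginal probability of the second token

pmi        <- log(p_xy / (p_x * p_y))   # pointwise mutual information
dice_coeff <- (2 * p_xy) / (p_x + p_y)  # Dice's coefficient
round(c(pmi = pmi, dice_coeff = dice_coeff), 3)
# pmi: 1.386, dice_coeff: 0.267
```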


Clean Downloaded File Names

Description

Helper function that removes spaces from filenames in the target directory, replacing them with underscores.

Usage

clean_filenames(target_dir)

Arguments

target_dir

Character string of the target directory path

Value

Invisible NULL, called for side effects


Check if Permission Confirmation is Needed

Description

Helper function that determines whether to prompt the user for permission confirmation based on the confirmed parameter.

Usage

confirm_if_needed(confirmed)

Arguments

confirmed

Logical indicating if permission is pre-confirmed

Value

Logical indicating if permission is granted


Create Data Dictionary

Description

This function takes a data frame and creates a data dictionary. The data dictionary includes the variable name, a human-readable name, the variable type, and a description. If a model is specified, the function uses OpenAI's API to generate the information based on the characteristics of the data frame.

Usage

create_data_dictionary(
  data,
  file_path,
  model = NULL,
  sample_n = 5,
  grouping = NULL,
  force = FALSE
)

Arguments

data

A data frame to create a data dictionary for.

file_path

The file path to save the data dictionary to.

model

The ID of the OpenAI chat completion model to use for generating descriptions (see openai::list_models()). If NULL (default), a scaffolding for the data dictionary is created.

sample_n

The number of rows to sample from the data frame to use as input for the model. Default is 5.

grouping

A character vector of column names to group by when sampling rows from the data frame for the model. Default NULL.

force

If TRUE, overwrite the file at file_path if it already exists. Default FALSE.

Value

A data frame containing the variable name, human-readable name, variable type, and description for each variable in the input data frame.
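
Examples

A minimal usage sketch following the documented signature; it is wrapped in a not-run guard because the function writes a file and, when a model is supplied, calls the OpenAI API. With model = NULL (the default), only a scaffold is written and no API key is needed:

```r
## Not run:
# Create a scaffold data dictionary for mtcars (no API call with model = NULL)
dict_file <- tempfile(fileext = ".csv")

dict <- create_data_dictionary(
  data = mtcars,
  file_path = dict_file
)

str(dict)

## End(Not run)
```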


Create Data Origin Documentation

Description

Creates a standardized data origin documentation file in CSV format, containing essential metadata about a dataset's source, format, and usage rights.

Usage

create_data_origin(file_path, return = FALSE, force = FALSE)

Arguments

file_path

Character string. Path where the CSV file should be saved.

return

Logical. If TRUE, returns the data frame in addition to saving. Default is FALSE.

force

Logical. If TRUE, overwrites existing file at path. Default is FALSE.

Details

Generates a template with metadata fields describing the dataset's source, format, and usage rights.

Value

If return=TRUE, returns a data frame containing the data origin template. Otherwise returns invisible(NULL).

Examples

tmp_file <- tempfile(fileext = ".csv")
create_data_origin(tmp_file)
read.csv(tmp_file)

Curate ENNTT Data

Description

This function processes and curates ENNTT (European Parliament) data from a specified directory. It handles both .dat files (containing XML metadata) and .tok files (containing text content).

Usage

curate_enntt_data(dir_path)

Arguments

dir_path

A string. The path to the directory containing the ENNTT data files. Must be an existing directory.

Details

The function expects a directory containing paired .dat and .tok files with matching names, as found in the raw ENNTT data (https://github.com/senisioi/enntt-release). The .dat files should contain XML-formatted metadata attributes for each line.

The .tok files should contain the corresponding text content, one entry per line.

Value

A tibble containing the curated ENNTT data, with one row per utterance and columns for the speaker metadata and the utterance text.

Examples

# Example using simulated data bundled with the package
example_data <- system.file("extdata", "simul_enntt", package = "qtkit")
curated_data <- curate_enntt_data(example_data)

str(curated_data)


Curate Single ENNTT File Pair

Description

Curate Single ENNTT File Pair

Usage

curate_enntt_file(dir_path, corpus_type)

Arguments

dir_path

Directory containing the files

corpus_type

Type identifier for the corpus

Value

Data frame of curated data


Curate SWDA data

Description

Process and curate Switchboard Dialog Act (SWDA) data by reading all .utt files from a specified directory and converting them into a structured format.

Usage

curate_swda_data(dir_path)

Arguments

dir_path

Character string. Path to the directory containing .utt files. Must be an existing directory.

Details

The function expects a directory containing .utt files or subdirectories with .utt files, as found in the raw SWDA data (Linguistic Data Consortium. LDC97S62: Switchboard Dialog Act Corpus.)

Value

A data frame containing the curated SWDA data, with one row per utterance and columns for the speaker and utterance information.

Examples

# Example using simulated data bundled with the package
example_data <- system.file("extdata", "simul_swda", package = "qtkit")
swda_data <- curate_swda_data(example_data)

str(swda_data)


Process a single SWDA utterance file

Description

Process a single SWDA utterance file

Usage

curate_swda_file(file_path)

Arguments

file_path

Character string. Path to the .utt file

Value

A data frame containing processed data from the file


Download and Decompress Archive File

Description

Helper function that downloads an archive file to a temporary location and decompresses it to the target directory.

Usage

download_and_decompress(url, target_dir, ext)

Arguments

url

Character string of the archive file URL

target_dir

Character string of the target directory path

ext

Character string of the file extension

Value

No return value, called for side effects


Extract Attributes from XML Line Node

Description

Extract Attributes from XML Line Node

Usage

extract_dat_attrs(line_node)

Arguments

line_node

XML node containing line attributes

Value

Data frame of extracted attributes


Extract speaker information from document lines

Description

Extract speaker information from document lines

Usage

extract_speaker_info(doc_lines)

Arguments

doc_lines

Character vector of file lines

Value

Named list of speaker information


Extract and process utterances from document lines

Description

Extract and process utterances from document lines

Usage

extract_utterances(doc_lines, speaker_info)

Arguments

doc_lines

Character vector of file lines

speaker_info

List of speaker information

Value

Data frame of processed utterances


Find ENNTT Files

Description

Find ENNTT Files

Usage

find_enntt_files(dir_path)

Arguments

dir_path

Directory to search for ENNTT files

Value

Vector of unique corpus types


Detect Statistical Outliers Using IQR Method

Description

Identifies statistical outliers in a numeric variable using the Interquartile Range (IQR) method. Provides detailed diagnostics about the outlier detection process.

Usage

find_outliers(data, variable_name, verbose = TRUE)

Arguments

data

Data frame containing the variable to analyze.

variable_name

Unquoted name of the numeric variable to check for outliers.

verbose

Logical. If TRUE, prints diagnostic information about quartiles, fences, and number of outliers found. Default is TRUE.

Details

The function uses the standard IQR method for outlier detection:

  1. Calculates the first (Q1) and third (Q3) quartiles

  2. Computes the interquartile range, IQR = Q3 - Q1

  3. Flags as outliers any values below the lower fence (Q1 - 1.5 * IQR) or above the upper fence (Q3 + 1.5 * IQR)

Value

If outliers are found, the observations identified as outliers are returned. If no outliers are found, NULL is returned.

When verbose = TRUE, diagnostic information about the quartiles, fences, and number of outliers found is printed to the console.

Examples

data(mtcars)
find_outliers(mtcars, mpg)
find_outliers(mtcars, wt, verbose = FALSE)


Download and Extract Archive Files

Description

Downloads compressed archive files from a URL and extracts their contents to a specified directory. Supports multiple archive formats and handles permission confirmation.

Usage

get_archive_data(url, target_dir, force = FALSE, confirmed = FALSE)

Arguments

url

Character string. Full URL to the compressed archive file.

target_dir

Character string. Directory where the archive contents should be extracted.

force

Logical. If TRUE, overwrites existing data in target directory. Default is FALSE.

confirmed

Logical. If TRUE, skips permission confirmation prompt. Useful for reproducible workflows. Default is FALSE.

Details

Supported archive formats: .zip, .gz, .tar, and .tgz.

The function includes safety features: a permission confirmation prompt (skipped when confirmed = TRUE) and protection against overwriting existing data (unless force = TRUE).

Value

Invisible NULL. Called for its side effects: downloading the archive and extracting its contents to target_dir.

Examples

## Not run: 
data_dir <- file.path(tempdir(), "data")
url <-
  "https://raw.githubusercontent.com/qtalr/qtkit/main/inst/extdata/test_data.zip"
get_archive_data(
  url = url,
  target_dir = data_dir,
  confirmed = TRUE
)

## End(Not run)

Get Works from Project Gutenberg

Description

Retrieves works from Project Gutenberg based on specified criteria and saves the data to a CSV file. This function is a wrapper for the gutenbergr package.

Usage

get_gutenberg_data(
  target_dir,
  lcc_subject,
  birth_year = NULL,
  death_year = NULL,
  n_works = 100,
  force = FALSE,
  confirmed = FALSE
)

Arguments

target_dir

The directory where the CSV file will be saved.

lcc_subject

A character vector specifying the Library of Congress Classification (LCC) subjects to filter the works.

birth_year

An optional integer specifying the minimum birth year of authors to include.

death_year

An optional integer specifying the maximum death year of authors to include.

n_works

An integer specifying the number of works to retrieve. Default is 100.

force

A logical value indicating whether to overwrite existing data if it already exists.

confirmed

If TRUE, the user has confirmed that they have permission to use the data. If FALSE, the function will prompt the user to confirm permission. Setting this to TRUE is useful for reproducible workflows.

Details

This function retrieves Gutenberg works based on the specified LCC subjects and optional author birth and death years. It checks if the data already exists in the target directory and provides an option to overwrite it. The function also creates the target directory if it doesn't exist. If the number of works is greater than 1000 and the 'confirmed' parameter is not set to TRUE, it prompts the user for confirmation. The retrieved works are filtered based on public domain rights in the USA and availability of text. The resulting works are downloaded and saved as a CSV file in the target directory.

For more information on Library of Congress Classification (LCC) subjects, refer to the Library of Congress Classification Guide: https://www.loc.gov/catdir/cpso/lcco/.

Value

A message indicating whether the data was acquired or already existed on disk. As a side effect, writes the data files to the specified target directory.

Examples

## Not run: 
data_dir <- file.path(tempdir(), "data")

get_gutenberg_data(
  target_dir = data_dir,
  lcc_subject = "JC",
  n_works = 5,
  confirmed = TRUE
)

## End(Not run)


Process speaker turn information

Description

Process speaker turn information

Usage

process_speaker_info(speaker_turn, speaker_a_id, speaker_b_id)

Arguments

speaker_turn

Vector of speaker turns

speaker_a_id

ID for speaker A

speaker_b_id

ID for speaker B

Value

List with processed speaker information


Validate Directory Path

Description

Validate Directory Path

Usage

validate_dir_path(dir_path)

Arguments

dir_path

Directory path to validate


Validate Archive File Extension

Description

Helper function that checks if the file extension is supported (zip, gz, tar, or tgz).

Usage

validate_file_extension(ext)

Arguments

ext

Character string of the file extension

Details

Stops execution if extension is not supported

Value

No return value, called for side effects


Validate Inputs for Association Metrics Calculation

Description

Helper function that validates the input parameters for the calc_assoc_metrics function. Checks data frame structure, column existence, and association metric specifications.

Usage

validate_inputs_cam(data, doc_index, token_index, type, association)

Arguments

data

A data frame to validate

doc_index

Column name for document index

token_index

Column name for token position

type

Column name for the tokens/terms

association

Character vector of requested association metrics

Details

Stops execution with an error message if data is not a data frame, if any of the specified columns are not found in data, or if association contains unrecognized metric names.

Value

No return value, called for side effects


Validate Inputs for Type Metrics Calculation

Description

Helper function that validates the input parameters for the calc_type_metrics function. Checks data frame structure, column existence, and metric specifications.

Usage

validate_inputs_ctm(data, type, document, frequency, dispersion)

Arguments

data

A data frame to validate

type

Column name for the type/term variable

document

Column name for the document ID variable

frequency

Character vector of requested frequency metrics

dispersion

Character vector of requested dispersion metrics

Details

Stops execution with an error message if data is not a data frame, if any of the specified columns are not found in data, or if frequency or dispersion contains unrecognized metric names.

Value

No return value, called for side effects


Save ggplot Objects to Files

Description

A wrapper around ggsave that facilitates saving ggplot objects within knitr documents. Automatically handles file naming and directory creation, with support for multiple output formats.

Usage

write_gg(
  gg_obj = NULL,
  file = NULL,
  target_dir = NULL,
  device = "pdf",
  theme = NULL,
  ...
)

Arguments

gg_obj

The ggplot to be written. If not specified, the last ggplot created will be written.

file

The name of the file to be written. If not specified, the label of the code block will be used.

target_dir

The directory where the file will be written. If not specified, the current working directory will be used.

device

The device to be used for saving the ggplot. Options include "pdf" (default), "png", "jpeg", "tiff", and "svg".

theme

The ggplot2 theme to be applied to the ggplot. Default is the theme specified in the ggplot2 options.

...

Additional arguments to be passed to the ggsave function from the ggplot2 package.

Details

This function extends ggplot2::ggsave by deriving the file name from the knitr code block label when none is given, creating the target directory if it does not exist, and optionally applying a ggplot2 theme before saving.

Value

The path of the written file.

Examples

## Not run: 
library(ggplot2)

plot_dir <- file.path(tempdir(), "plot")

# Write a ggplot object as a PDF file
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

write_gg(
  gg_obj = p,
  file = "plot_file",
  target_dir = plot_dir,
  device = "pdf"
)

unlink(plot_dir)

## End(Not run)

Write a kable object to a file

Description

A wrapper around kableExtra::save_kable that facilitates saving kable objects within knitr documents. Automatically handles file naming, directory creation, and supports multiple output formats with Bootstrap theming options.

Usage

write_kbl(
  kbl_obj,
  file = NULL,
  target_dir = NULL,
  device = "pdf",
  bs_theme = "bootstrap",
  ...
)

Arguments

kbl_obj

The knitr_kable object to be written.

file

The name of the file to be written. If not specified, the name will be based on the current knitr code block label.

target_dir

The directory where the file will be written. If not specified, the current working directory will be used.

device

The device to be used for saving the file. Options include "pdf" (default), "html", "latex", "png", and "jpeg". Note that a Chromium-based browser (e.g., Google Chrome, Chromium, Microsoft Edge, or Brave) is required on your system for all options except "latex". If a suitable browser is not available, the function will stop and return an error message.

bs_theme

The Bootstrap theme to be applied to the kable object (only applicable for HTML output). Default is "bootstrap".

...

Additional arguments to be passed to the save_kable function from the kableExtra package.

Details

The function extends save_kable functionality by deriving the file name from the knitr code block label when none is given, creating the target directory if it does not exist, and supporting multiple output formats with Bootstrap theming.

For HTML output, the function supports all Bootstrap themes available in kableExtra. The default theme is "bootstrap".

Value

The path of the written file.

Examples

## Not run: 
library(knitr)

table_dir <- file.path(tempdir(), "table")

mtcars_kbl <- kable(
  x = mtcars[1:5, ],
  format = "html"
)

# Write a kable object as a PDF file
write_kbl(
  kbl_obj = mtcars_kbl,
  file = "kable_pdf",
  target_dir = table_dir,
  device = "pdf"
)

# Write a kable as an HTML file with a custom Bootstrap theme
write_kbl(
  kbl_obj = mtcars_kbl,
  file = "kable_html",
  target_dir = table_dir,
  device = "html",
  bs_theme = "flatly"
)

unlink(table_dir)

## End(Not run)

Write an R object as a file

Description

A wrapper around dput that facilitates saving R objects within knitr documents. Automatically handles file naming and directory creation, with support for preserving object structure and attributes.

Usage

write_obj(obj, file = NULL, target_dir = NULL, ...)

Arguments

obj

The R object to be written.

file

The name of the file to be written. If not specified, the label of the code block will be used.

target_dir

The directory where the file will be written. If not specified, the current working directory will be used.

...

Additional arguments to be passed to dput.

Details

This function extends dput functionality by deriving the file name from the knitr code block label when none is given and creating the target directory if it does not exist.

Objects saved with this function can be read back using the standard dget function.

Value

The path of the written file.

Examples

## Not run: 
obj_dir <- file.path(tempdir(), "obj")

# Write a data frame as a file
write_obj(
  obj = mtcars,
  file = "mtcars_data",
  target_dir = obj_dir
)

# Read the file back into an R session
my_mtcars <- dget(file.path(obj_dir, "mtcars_data"))

unlink(obj_dir)

## End(Not run)