---
title: CA Step 1 Read_Dyads
subtitle: Read and Format Data for ConversationAlign
author: Jamie Reilly, Ben Sacks, Ginny Ulichney, Gus Cooney, Chelsea Helion
date: "`r Sys.Date()`"
show_toc: true
slug: ConversationAlign Read
output:
  rmarkdown::html_vignette:
    toc: yes
vignette: >
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{CA Step 1 Read_Dyads}
  %\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
```{r, message=FALSE, warning=F, echo=F}
# Load ConversationAlign
library(ConversationAlign)
```
# Reading data into R for ConversationAlign
Half the battle with R is getting your data imported and formatted. This is especially true for string data and working with text. `ConversationAlign` uses a series of sequential functions to import, clean, and format your raw data. You **MUST** run each of these functions. They append important variable names and automatically reshape your data.
# Prepping your data for import
- `ConversationAlign` works **ONLY** on dyadic (i.e., two person) conversation transcripts.
- Each transcript must contain at least two columns: one should delineate the interlocutor (the person who produced the text), and the other should contain the text itself.
- `ConversationAlign` contains an import function called `read_dyads()` that will scan a target folder for text samples.
- `read_dyads()` will import all of your transcripts into R and concatenate them into a single dataframe.
- `read_dyads()` will append each transcript's filename as a unique identifier for that conversation. This is SUPER important to remember when analyzing your data.
- Store the individual conversation transcripts (`.csv`, `.txt`, `.ai`) that you wish to concatenate into a corpus in a single folder. By default, `ConversationAlign` will search for a folder called `my_transcripts` in the same directory as your script. However, feel free to name your folder anything you like. You can specify a custom path as an argument to `read_dyads()`.
- Each transcript must contain at least two columns of data (Participant and Text). All other columns (e.g., metadata) will be retained.
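As a minimal sketch of the layout described above, the following base R code builds a staging folder holding one two-column transcript. The folder location (a temporary directory), the file name `convo_01.csv`, and the utterances are illustrative assumptions, not files shipped with the package.

```{r, eval=F, message=F, warning=F}
# Hypothetical example data: one dyadic transcript with the two required columns
transcript <- data.frame(
  Participant = c("A", "B", "A", "B"),
  Text = c("Hi, how are you?", "Fine, thanks. And you?",
           "Doing well.", "Glad to hear it."),
  stringsAsFactors = FALSE
)
# Create a staging folder (placed in a temporary directory here; any location works)
staging_dir <- file.path(tempdir(), "my_transcripts")
dir.create(staging_dir, showWarnings = FALSE)
# Save the transcript; its file name is what identifies the conversation after import
write.csv(transcript, file.path(staging_dir, "convo_01.csv"), row.names = FALSE)
# Read the file back to confirm the column headers are what you expect
names(read.csv(file.path(staging_dir, "convo_01.csv")))
```

Because the file name doubles as the conversation identifier, it helps to give each transcript a short, distinct, descriptive name before importing.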
## `read_dyads()`
Here are some examples of `read_dyads()` in action. There is only one argument to `read_dyads()`, and that is `my_path`. Use it to supply a quoted directory path to the folder where your transcripts live. Remember to treat this folder as a staging area! Once you are finished with a set of transcripts and don't want them read into `ConversationAlign`, move them out of the folder or specify a new folder. Language data tends to proliferate quickly, and it is easy to forget what you are doing. Be a CAREFUL secretary, and record your steps.
Arguments to `read_dyads` include:
1. **my_path**: default is 'my_transcripts', change path to your folder name
```{r, eval=F, message=F, warning=F}
# will search for a folder called 'my_transcripts' in your current directory
MyConvos <- read_dyads()
# will scan a custom folder called 'MyStuff' in your current directory,
# concatenating all files in that folder into a single dataframe
# (a relative path, like the default; a leading '/' would point at the filesystem root)
MyConvos2 <- read_dyads(my_path = 'MyStuff')
```
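Since everything in the staging folder is read in, it can help to check its contents first. The sketch below uses only base R; `list_staged_transcripts()` is a hypothetical helper written for this illustration, not a `ConversationAlign` function, and the extension filter simply mirrors the file types listed earlier.

```{r, eval=F, message=F, warning=F}
# Hypothetical helper (not part of ConversationAlign):
# list the transcript files currently sitting in a staging folder
list_staged_transcripts <- function(path = "my_transcripts") {
  if (!dir.exists(path)) {
    stop("Folder not found: ", path)
  }
  # keep only files ending in .csv, .txt, or .ai (case-insensitive)
  list.files(path, pattern = "\\.(csv|txt|ai)$", ignore.case = TRUE)
}
```

Calling `list_staged_transcripts()` before `read_dyads()` gives you a record of exactly which files are about to be combined, which is worth saving alongside your analysis notes.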
## `read_1file()`
- Reads a single transcript that is already in your R environment. We will use `read_1file()` to prep the Marc Maron and Terry Gross transcript. Look at how the column headers have changed and how the object name (MaronGross_2013) is now the Event_ID (a document identifier).
Arguments to `read_1file` include:
1. **my_dat**: object already in your R environment containing text and speaker information.
```{r, eval=T, message=F, warning=F}
# prep the transcript; the input object's name becomes the Event_ID
MaronGross_Prepped <- read_1file(MaronGross_2013)
# print the first 15 rows of the formatted output
knitr::kable(head(MaronGross_Prepped, 15), format = "pipe")
```