---
title: "Tutorial and basic overview of the admixr R package"
author: "Martin Petr"
date: "`r Sys.Date()`"
output: html_document
vignette: >
  %\VignetteIndexEntry{Tutorial and basic overview of the admixr R package}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r setup, include = FALSE}
evaluate <- .Platform$OS.type == "unix" && system("which qpDstat", ignore.stdout = TRUE) == 0
knitr::opts_chunk$set(
collapse = TRUE,
comment = "",
eval = evaluate
)
```
## Introduction
[ADMIXTOOLS](https://github.com/DReichLab/AdmixTools/) is a widely used
software package for calculating admixture statistics and testing population
admixture hypotheses.
A typical ADMIXTOOLS workflow generally involves a combination of
`sed`/`awk`/shell scripting and manual editing to create text configuration
files. These are then passed as command-line arguments to one of the ADMIXTOOLS
commands and control how a particular analysis is run. The results are then
redirected to another file, which has to be parsed by the user to extract
values of interest, often using command-line utilities again or (worse) by
manual copy-pasting. Finally, the processed results are analysed in R, Excel or
another program.
This workflow can be a little cumbersome, especially if one wants to explore many
hypotheses involving different combinations of populations. Most importantly,
however, it makes it difficult to conduct reproducible research, as it is nearly
impossible to construct fully automated "pipelines" that don't require user
intervention.
This R package makes it possible to perform all stages of ADMIXTOOLS analyses
entirely from R, completely removing the need for "low level" configuration of
individual ADMIXTOOLS programs.
## Installation
**Note that in order to use the *admixr* package, you need a working
installation of ADMIXTOOLS!** You can find installation instructions
[here](https://github.com/DReichLab/AdmixTools/blob/master/README.INSTALL).
The software runs on Linux and macOS and these are the two systems that
_admixr_ is tested on.
**Furthermore, you need to make sure that R can find ADMIXTOOLS binaries on the
`$PATH`.** If this is not the case, running `library(admixr)` will show a
warning message with instructions on how to fix this.
To install *admixr* from GitHub you need to install the package `devtools`
first. To do this, you can simply run (in R):
```{r, eval = FALSE}
install.packages("devtools")
devtools::install_github("bodkan/admixr")
```
Furthermore, if you want to follow the examples in this vignette, you will need
the [tidyverse](https://www.tidyverse.org) collection of packages for
data manipulation and plotting, which you can install with:
```{r, eval = FALSE}
install.packages("tidyverse")
```
You definitely don't need tidyverse for working with _admixr_ but it really
makes data manipulation and plotting much easier. I recommend at least
giving it a shot.
When everything is ready, you can run the following code to load both
packages:
```{r libraries, message = FALSE, warning = FALSE}
library(admixr)
library(tidyverse)
```
## A note about EIGENSTRAT format
ADMIXTOOLS software uses a peculiar set of genetic file formats, which may seem
strange if you are used to working with [VCF
files](http://samtools.github.io/hts-specs/VCFv4.3.pdf). However, the basic
idea remains the same: we want to store and access SNP data (REF/ALT alleles)
of a set of individuals at a defined set of genomic positions.
EIGENSTRAT datasets always contain three kinds of files:
* `ind` file - specifies a unique name, sex (optional - can be simply "U" for
"undefined") and label (such as population assignment) of each sample;
* `snp` file - specifies the positions of SNPs, REF/ALT alleles etc.;
* `geno` file - contains SNP data (one row per site, one character per sample)
  in a dense string-based format:
    - 0: individual is homozygous ALT
    - 1: individual is a heterozygote
    - 2: individual is homozygous REF
    - 9: missing data
Therefore, a VCF file is essentially a combination of all three files in a
single package.
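
To make the `geno` encoding more concrete, here is a minimal base R sketch that decodes a few `geno` rows into a matrix of REF allele counts. The rows below are made up for illustration and are not taken from any real dataset.

```{r, eval = FALSE}
# hypothetical geno rows: one row per site, one character per sample
geno_rows <- c("0129", "2210", "9002")

# split each row into single characters and convert them to integers
geno <- do.call(rbind, lapply(strsplit(geno_rows, ""), as.integer))

# code 9 marks missing data, so replace it with NA
geno[geno == 9] <- NA

# each remaining value is the number of REF alleles carried by a sample
geno

# number of non-missing genotypes per sample (one sample per column)
colSums(!is.na(geno))
```

Storing missing genotypes as `NA` means that standard R summaries (such as `colSums()` or `mean(..., na.rm = TRUE)`) handle them without special cases.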
Let's first download a small testing SNP dataset using a built-in *admixr*
function `download_data()`. This function downloads the data into a temporary
directory (you can specify the destination using its `dirname` argument, in
case you want to place it elsewhere). In addition to this, the function returns
a shared path/prefix of the whole dataset.
```{r eigenstrat_path}
(prefix <- download_data(dirname = tempdir()))
```
We can verify that there are indeed three files with this prefix:
```{r eigenstrat_trio}
list.files(path = dirname(prefix), pattern = basename(prefix), full.names = TRUE)
```
Let's look at their contents:
#### `ind` file
```{r ind_file, echo = FALSE}
cat(system(paste0("column -t ", prefix, ".ind"), intern = TRUE), sep = "\n")
```
The first column (sample name) and the third column (population label) are
generally not the same (sample names often have numerical suffixes to make them
unique, etc.), but were kept the same here for simplicity. Importantly, when
specifying population/sample names in *admixr* functions, the information in the
third column is what is used. For example, if you have individuals such as
"French1", "French2", "French3" in the first column of an `ind` file, all three
sharing a "French" population label in the third column, specifying "French" in
an *admixr* function will combine all three samples in a single population,
instead of working with each individual separately.
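
The grouping by the third column can be sketched in base R. The `ind` lines below are hypothetical (they are not part of the example dataset) and only illustrate which samples would be pooled under each population label.

```{r, eval = FALSE}
# hypothetical ind file contents: sample name, sex, population label
ind_lines <- c(
  "French1 U French",
  "French2 U French",
  "French3 U French",
  "Han1    U Han"
)
ind <- read.table(text = ind_lines, col.names = c("sample", "sex", "label"),
                  stringsAsFactors = FALSE)

# samples that share a population label are treated as one population
split(ind$sample, ind$label)
```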
#### `snp` file (first 3 lines)
```{r snp_file, echo = FALSE}
cat(system(paste0("head -n 3 ", prefix, ".snp"), intern = TRUE), sep = "\n")
```
The columns of this file are, in order:
1. SNP string ID
2. chromosome
3. genetic distance
4. position along a chromosome
5. reference allele
6. alternative allele
#### `geno` file (first 3 lines)
```{r geno_file, echo = FALSE}
cat(system(paste0("head -n 3 ", prefix, ".geno"), intern = TRUE), sep = "\n")
```
Each row is one genomic site, each column is a genotype in one individual.
## Philosophy of *admixr*
The goal of *admixr* is to make ADMIXTOOLS analyses as trivial to run as
possible, without having to worry about par/pop/left/right configuration files
(as they are known in the jargon of ADMIXTOOLS) and other low-level details.
The only interface between you and ADMIXTOOLS is the following set of R
functions:
- `d()`
- `f4()`
- `f4ratio()`
- `f3()`
- `qpAdm()`
- `qpWave()`
Anything that would normally require [dozens of lines of shell
scripts](https://gaworkshop.readthedocs.io/en/latest/contents/06_f3/f3.html)
can often be accomplished by running a single line of R code.
## Internal representation of EIGENSTRAT data
As we saw above, each EIGENSTRAT dataset has three components. The way this
data is internally represented in *admixr* is using a small S3 R object created
using the `eigenstrat` constructor function. This function accepts the path and
prefix of a trio of EIGENSTRAT snp/ind/geno files and returns an R object of
the class `EIGENSTRAT`:
```{r}
snps <- eigenstrat(prefix)
```
```{r, comment = "#>"}
snps
```
This object encapsulates the paths to all three EIGENSTRAT components and makes
it easy to pass the data to different _admixr_ functions.
The following couple of sections describe how to use the _admixr_ package using
simple example analyses.
## D statistic
Let's say we are interested in the following question: _"Which populations
today show evidence of Neanderthal admixture?"_
One way of looking at this is using the following D statistic:
$$D(\textrm{present-day human W}, \textrm{African}, \textrm{Neanderthal}, \textrm{Chimp}).$$
$D$ statistics are based on comparing the proportions of BABA and ABBA site
patterns observed in the data:
$$D = \frac{\textrm{# BABA sites - # ABBA sites}}{\textrm{# BABA sites + # ABBA sites}}.$$
Significant departure of $D$ from zero indicates an excess of allele sharing
between the first and the third population (positive $D$), or an excess of
allele sharing between the second and the third population (negative $D$). If
we get $D$ that is not significantly different from 0, this suggests that the
first and second populations form a clade, and don't differ in the rate of
allele sharing with the third population (this is the null hypothesis that the
data is compared against).
Therefore, our $D$ statistic above tests whether some modern humans today
admixed with Neanderthals, which would increase their genetic affinity to this
archaic group compared to Africans (whose ancestors never met Neanderthals).
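
As a quick numeric illustration of the $D$ formula above, the following base R sketch computes $D$ directly from site pattern counts. The function `d_from_counts()` and the counts are hypothetical and serve only as an example. ADMIXTOOLS additionally estimates a standard error using a block jackknife, which is not shown here.

```{r, eval = FALSE}
# D statistic from counts of BABA and ABBA site patterns
d_from_counts <- function(baba, abba) {
  (baba - abba) / (baba + abba)
}

# hypothetical counts, chosen only for illustration
d_from_counts(baba = 1100, abba = 1000)  # (1100 - 1000) / 2100, about 0.048
d_from_counts(baba = 1000, abba = 1000)  # equal counts give D = 0
```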
Let's save some population names first to make our code more concise:
```{r pop_def1}
pops <- c("French", "Sardinian", "Han", "Papuan", "Khomani_San", "Mbuti", "Dinka")
```
Using the *admixr* package we can then calculate our $D$ statistic simply by
running:
```{r d}
result <- d(W = pops, X = "Yoruba", Y = "Vindija", Z = "Chimp", data = snps)
```
The result is the following data frame:
```{r eval = FALSE}
head(result)
```
```{r d_kable, echo = FALSE}
knitr::kable(head(result))
```
We can see that in addition to the specified population names, the output
table contains additional columns:
- `D` - $D$ statistic value
- `stderr` - standard error of the $D$ statistic calculated using the block
jackknife
- `Zscore` - $Z$-score value (number of standard errors the $D$ is from 0,
i.e. how strongly we reject the null hypothesis of no admixture)
- `BABA`, `ABBA` - counts of observed site patterns
- `nsnps` - number of SNPs used for a given calculation
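
The relationship between the `D`, `stderr` and `Zscore` columns can be sketched with a small made-up table. The values below are hypothetical, and the cutoff of two standard errors is just a convention used for illustration.

```{r, eval = FALSE}
# hypothetical D values and block jackknife standard errors
toy <- data.frame(
  W = c("pop1", "pop2"),
  D = c(0.040, 0.003),
  stderr = c(0.008, 0.006)
)

# the Z-score is the number of standard errors separating D from zero
toy$Zscore <- toy$D / toy$stderr

# flag results that are more than two standard errors away from zero
toy$significant <- abs(toy$Zscore) > 2
toy
```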
While we could certainly make inferences by looking at the $Z$-scores, tables in
general are not the best representation of this kind of data, especially as the
number of samples increases. Instead, we can use the
[`ggplot2`](https://ggplot2.tidyverse.org) package to plot the results:
```{r d_plot, fig.width = 7, fig.height = 4}
ggplot(result, aes(fct_reorder(W, D), D, color = abs(Zscore) > 2)) +
geom_point() +
geom_hline(yintercept = 0, linetype = 2) +
geom_errorbar(aes(ymin = D - 2 * stderr, ymax = D + 2 * stderr))
```
(If you want to know more about data analysis using R, including plotting with
ggplot2, I highly recommend [this](http://r4ds.had.co.nz) free book.)
We can see that the $D$ values for Africans are not significantly different from
0, meaning that the data is consistent with the null hypothesis of no
Neanderthal ancestry in Africans. On the other hand, the test rejects the null
hypothesis for all non-Africans today, suggesting that Neanderthals admixed with
the ancestors of present-day non-Africans.
## f4 statistic
An alternative way of addressing the previous question is to use the $f_4$
statistic, which is very similar to the $D$ statistic and can be calculated as:
$$ f_4 = \frac{\textrm{# BABA sites - # ABBA sites}}{\textrm{# sites}}$$
Again, significant departure of $f_4$ from 0 can be interpreted as evidence of
gene flow.
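
To see how the two statistics relate, here is a minimal sketch that evaluates the $f_4$ formula above on hypothetical site pattern counts and compares it with $D$ for the same counts. The function name and all numbers are made up for illustration.

```{r, eval = FALSE}
# f4 normalizes the BABA - ABBA difference by the total number of sites
f4_from_counts <- function(baba, abba, nsites) {
  (baba - abba) / nsites
}

# hypothetical counts, chosen only for illustration
baba <- 1100
abba <- 1000
nsites <- 500000

f4_from_counts(baba, abba, nsites)  # 100 / 500000 = 2e-04
(baba - abba) / (baba + abba)       # D for the same counts, for comparison
```

Both statistics share the same numerator, so they always have the same sign. Only the normalization differs.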
To repeat the previous analysis using the $f_4$ statistic, we can run the function
`f4()`:
```{r f4}
result <- f4(W = pops, X = "Yoruba", Y = "Vindija", Z = "Chimp", data = snps)
```
```{r eval = FALSE}
head(result)
```
```{r f4_kable, echo = FALSE}
knitr::kable(head(result))
```
By comparing this result to the $D$ statistic analysis above, we can draw the
same conclusions.
You might be wondering why we have both $f_4$ and $D$ if they are so similar.
The truth is that $f_4$ is, among other things, directly informative about the
amount of shared genetic drift ("branch length") between pairs of populations,
which is a very useful theoretical property. Other than that, it's often a
matter of personal preference and so *admixr* provides functions for calculating
both.
## f4-ratio statistic
Now we know that non-Africans today carry _some_ Neanderthal ancestry. But what if
we want to know _how much_ Neanderthal ancestry they have? What proportion of their
genomes is of Neanderthal origin? To answer questions like this, we can use the
$f_4$-ratio statistic, which can be formulated in the following way (using
the notation of [Patterson et al., 2012](http://www.genetics.org/content/192/3/1065), who formally described
its properties):
$$f_4\textrm{-ratio} = \frac{f_4(A, O; X, C)}{f_4(A, O; B, C)}.$$
Using *admixr*, we can calculate $f_4$-ratios using the following code (`X`
being a vector of samples which we want to estimate the Neanderthal ancestry
in):
```{r f4ratio}
result <- f4ratio(X = pops, A = "Altai", B = "Vindija", C = "Yoruba", O = "Chimp", data = snps)
```
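
The arithmetic behind the ratio is simple and can be sketched with made-up numbers. The two $f_4$ values below are hypothetical and are not results from the example dataset.

```{r, eval = FALSE}
# hypothetical f4 values for the numerator and denominator of the ratio
f4_numerator   <- 0.0006  # stands in for f4(A, O; X, C)
f4_denominator <- 0.0300  # stands in for f4(A, O; B, C)

# the f4-ratio estimate of the ancestry proportion
alpha <- f4_numerator / f4_denominator
alpha  # about 0.02, i.e. roughly 2% ancestry
```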
The ancestry proportion (a number between 0 and 1) is given in the `alpha`
column:
```{r, eval=FALSE}
head(result)
```
```{r f4ratio_kable, echo=FALSE}
knitr::kable(head(result))
```
```{r f4ratio_plot, fig.width = 7, fig.height = 4}
ggplot(result, aes(fct_reorder(X, alpha), alpha, color = abs(Zscore) > 2)) +
geom_point() +
geom_errorbar(aes(ymin = alpha - 2 * stderr, ymax = alpha + 2 * stderr)) +
geom_hline(yintercept = 0, linetype = 2) +
labs(y = "Neandertal ancestry proportion", x = "present-day individual")
```
We can make several observations:
- Again, we don't see any significant Neanderthal ancestry in present-day
Africans (proportion is consistent with 0%), which is what we confirmed using
$D$ and $f_4$ above.
- Present-day non-Africans carry between 2% and 3% Neanderthal ancestry.
- We see a much higher proportion of Neanderthal ancestry in people from Papua
New Guinea - more than 4%. This is consistent with earlier studies that
suggest additional archaic admixture events in the ancestors of present-day
Papuans.
## f3 statistic
The $f_3$ statistic, also known as the 3-population statistic, is useful
whenever we want to:
1. Estimate the branch length (shared genetic drift) between a pair of
populations $A$ and $B$ with respect to a common outgroup $C$. In this case,
the higher the $f_3$ value, the longer the shared evolutionary time between
$A$ and $B$.
2. Test whether population $C$ is a mixture of two populations $A$ and
$B$. Significantly negative values of the $f_3$ statistic are then
statistical evidence of this admixture.
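
The second use case can be illustrated with a small sketch. It assumes the common allele frequency definition of $f_3$ (following Patterson et al., 2012), in which the statistic is the average of $(c - a)(c - b)$ across SNPs, where $a$, $b$ and $c$ are allele frequencies in the three populations. The function `f3_naive()` and the frequencies are hypothetical, and the sketch leaves out the bias correction and block jackknife that a real analysis would need.

```{r, eval = FALSE}
# naive f3 estimate from allele frequencies (no bias correction, no jackknife)
f3_naive <- function(freq_a, freq_b, freq_c) {
  mean((freq_c - freq_a) * (freq_c - freq_b))
}

# hypothetical allele frequencies at five SNPs in populations A and B
freq_a <- c(0.1, 0.8, 0.5, 0.3, 0.9)
freq_b <- c(0.7, 0.2, 0.4, 0.9, 0.1)

# population C constructed as an equal mixture of A and B
freq_mixed <- 0.5 * freq_a + 0.5 * freq_b

# negative, because the frequencies in C lie between those of A and B
f3_naive(freq_a, freq_b, freq_mixed)
```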
As an example, imagine we are interested in relative divergence times between
pairs of present-day human populations, and want to know in which approximate
order they split off from each other. To address this problem, we could use
the $f_3$ statistic by fixing the $C$ outgroup as San, and calculating pairwise
$f_3$ statistics between all present-day modern humans.
```{r pops2}
pops <- c("French", "Sardinian", "Han", "Papuan", "Mbuti", "Dinka", "Yoruba")
result <- f3(A = pops, B = pops, C = "Khomani_San", data = snps)
```
```{r, eval=FALSE}
head(result)
```
```{r f3_kable, echo=FALSE}
knitr::kable(head(result))
```
```{r f3_plot, fig.width = 8, fig.height = 6}
# sort the population labels according to an increasing f3 value relative to Mbuti
ordered <- filter(result, A == "Mbuti", B != "Mbuti") %>% arrange(f3) %>% .[["B"]] %>% c("Mbuti")
# plot heatmap of pairwise f3 values
result %>%
filter(A != B) %>%
mutate(A = factor(A, levels = ordered),
B = factor(B, levels = ordered)) %>%
ggplot(aes(A, B)) + geom_tile(aes(fill = f3))
```
We can see that when we order the heatmap labels based on values of pairwise
$f_3$ statistics, the (already known) order of population splits pops up
nicely (i.e. San separated first, followed by Mbuti, etc.).
## qpWave and qpAdm
Both _qpWave_ and _qpAdm_ can be thought of as more complex and powerful
extensions of the basic ideas behind a simple $f_4$ statistic. Building upon the
$f_4$ theory and generalizing it, _qpWave_ makes it possible to find the lowest
number of "streams of ancestry" between two groups of populations that is
consistent with the data. Extending the concept of $f_4$ statistics even
further, _qpAdm_ makes it possible to estimate the proportions of ancestry from a set of
ancestral populations that contributed ancestry to our population of interest.
Unfortunately, both methods represent a rather advanced topic that still lacks
proper documentation and beginner-friendly tutorials, and explaining them in
detail is beyond the scope of this vignette. If you want to use them, it's
crucial that you read the official documentation describing the basic ideas of both methods ([distributed with
ADMIXTOOLS](https://github.com/DReichLab/AdmixTools/blob/master/pdoc.pdf)),
_and_ that you read the relevant supplementary sections of papers published
by David Reich's group. At the very least, I recommend reading:
- Note S6 of _"[Reconstructing Native American population history](https://www.nature.com/articles/nature11258)"_ by Reich et al. This paper
first introduced the theoretical background of what later became _qpWave_.
- Supplementary Information 10 of _"[Massive migration from the steppe was a source for Indo-European languages in Europe](https://www.nature.com/articles/nature14317)"_ by Haak et al., which gives a more concise
overview of the _qpWave_ method than S6 of Reich et al. 2012, and also introduces the
_qpAdm_ methodology for estimating admixture proportions.
- A phenomenal description of _qpAdm_ methodology and best practices
by [Harney _et al._
2020](https://www.biorxiv.org/content/10.1101/2020.04.09.032664v1)
(see also the PDF with [practical
guidelines](https://www.biorxiv.org/content/biorxiv/early/2020/04/10/2020.04.09.032664/DC1/embed/media-1.pdf?download=true)). Both
documents should really be mandatory reading before doing any
_qpAdm_ analysis.
In the remainder of this section, I will assume that you are familiar with both
methods, and will only explain how to use _admixr_ for running them from R.
### _qpWave_
To run _qpWave_, you must provide a list of _left_ and _right_ populations
(using the terminology of Haak et al. 2015 above). The aim of the method is to get
an idea about the number of migration waves from _right_ to _left_ (with no
back-migration from _left_ to _right_!). This is done by estimating the
rank of a matrix of all possible $f_4$ statistics
$$f_4(\textrm{left}_1, \textrm{left}_i; \textrm{right}_1, \textrm{right}_j),$$
where $\textrm{left}_1$ and $\textrm{right}_1$ are some fixed populations and
the $i$ and $j$ indices run over all other possible choices of populations.
As an example, let's try to find the number of admixture waves from _right_ =
{Yoruba, Mbuti, Altai} into _left_ = {French, Sardinian, Han} populations. We can
do this using the function `qpWave()`, setting its arguments appropriately:
```{r}
result <- qpWave(
left = c("French", "Sardinian", "Han"),
right = c("Altai", "Yoruba", "Mbuti"),
data = snps
)
```
The `qpWave()` function returns a data frame which shows the results of a series
of matrix rank tests. The `rank` column is the matrix rank tested, `df`, `chisq`
and `tail` give the degrees of freedom, $\chi^2$ value and $p$-value for the
comparison with the saturated model (the $p$-value then indicates which matrix
rank is consistent with the data - see example below), and `dfdiff`, `chisqdiff`
and `taildiff` give the same, but always comparing a model to the model with one
rank less.
```{r, eval = FALSE}
result
```
```{r, echo = FALSE}
knitr::kable(result)
```
In this example, we see that matrix rank $r = 0$ cannot be rejected (`tail` $p$-value
= 0.78). Because Reich et al. 2012 showed that $r + 1 \le n$, where $n$ is the
number of admixture waves, we can interpret this as _left_ populations having at
least $n = 1$ stream of ancestry from the set of _right_ populations. In this
case, the most likely explanation is Neanderthal admixture into non-Africans
today.
Now, what happens if we add Papuans to the _left_ group?
```{r}
result <- qpWave(
left = c("Papuan", "French", "Sardinian", "Han"),
right = c("Altai", "Yoruba", "Mbuti"),
data = snps
)
```
```{r, eval = FALSE}
result
```
```{r, echo = FALSE}
knitr::kable(result)
```
We can now clearly reject rank $r = 0$, but we see that the data is consistent
with rank $r = 1$, meaning that there must have been at least $n = 2$ streams of
ancestry from _right_ to _left_ populations ($r + 1 \le n$). Because
this happened after we introduced Papuans to the _left_ set, this could indicate
a separate pulse of archaic introgression into Papuans, which is not
surprising given what we know about significantly more archaic ancestry
in Papuans than in any other present-day population.
### _qpAdm_
The _qpAdm_ method can be used to find, for a given target population, the
proportions of ancestry coming from a set of _source_ populations. Importantly,
since we often lack accurate representatives of the true ancestral populations,
we can use a set of reference populations instead, under the crucial assumption
that the reference set is phylogenetically closer to the true _source_ populations
than to a set of specified _outgroups_. For example, coming back to our example
of estimating the proportions of Neandertal ancestry in people today, we could
define:
- a set of present-day non-African individuals as the _target_;
- Vindija Neanderthal and an African as two _source_ populations;
- _outgroup_ populations as Chimp, Altai Neanderthal and Denisovan (which are
all further from the true ancestral populations than the specified _sources_).
Having defined all three population sets, we can run qpAdm with:
```{r qpAdm}
result <- qpAdm(
target = c("Sardinian", "Han", "French"),
sources = c("Vindija", "Yoruba"),
outgroups = c("Chimp", "Denisova", "Altai"),
data = snps
)
```
The `qpAdm()` function has an argument `details` (`TRUE` by default) which makes
the function return a list of three elements:
* `proportions` - data frame with admixture proportions - this is what we
mostly care about;
* `ranks` - results of rank tests performed by _qpWave_ - these evaluate how well
the assumed target-sources-outgroups population model matches the data;
* `subsets` - results of the "all subsets" analysis (see the [documentation](https://github.com/DReichLab/AdmixTools/blob/master/pdoc.pdf) for
more details).
If `details` is set to `FALSE`, only the `proportions` component is returned
by the `qpAdm()` function.
Let's start with the `ranks` element:
```{r, eval = FALSE}
result$ranks
```
```{r, echo = FALSE}
knitr::kable(result$ranks)
```
The row with rank = 1 represents a _qpWave_ test with all $n$ _source_
populations set as the _left_ set and all _outgroups_ as the _right_ set. This
test evaluates whether the ancestral populations are descended from $n$
independent streams of ancestry. In our case, $n = 2$ (Vindija and Yoruba), which
means that the data would have to be consistent with rank $r = 1$ to satisfy the
inequality $r + 1 \le n$ proved by Reich et al., 2012. We see that this is true
for all three target populations ($p$-value > 0.05 for all targets), and the
simple model of Neandertal admixture thus seems to be reasonably consistent with
the data.
The rank = 2 row represents a _qpWave_ test after adding a target population to
the _left_ group together with the _sources_. This test makes sure that
including the target population does not increase the rank of the $f_4$ matrix,
meaning that the target can be really modelled as a mixture of ancestries from
the _sources_. If the $p$-values turn out to be very low, this indicates that
the assumed model does not fit the data and that a part of the ancestry in a
_target_ possibly cannot be traced to any of the _sources_. In our case,
however, none of the rank = 2 test $p$-values are significant, and we can be
reasonably sure that the _target_ samples can be fully modelled as mixtures of
the specified _sources_.
The most important element of a _qpAdm_ output is in the
`$proportions` component. This contains admixture proportion estimates
from all specified sources, the p-values of each model (remember, low
p-values/significance means the model is rejected!) as well as
standard errors for those proportions using a block jackknife:
```{r, eval = FALSE}
result$proportions
```
```{r, echo=FALSE}
knitr::kable(result$proportions)
```
If we compare this result to the $f_4$-ratio values calculated above, we see
that the _qpAdm_ estimates are very close to what we got earlier.
The third element in the list of results shows the outcome of an "all subsets"
analysis, which involves testing all subsets of potential source populations.
Each 1 in the "pattern" column means that the proportion of ancestry from that
particular source population (in the order as specified by the user) was forced
to 0.0.
```{r, eval = FALSE}
result$subsets
```
```{r, echo=FALSE}
knitr::kable(result$subsets)
```
*New feature*: There is a new function called `qpAdm_rotation()` which
allows exhaustive exploration of many _qpAdm_ models in parallel. For
more information please see the vignette _"Fitting qpAdm models with a
'rotation' strategy"_.
## Grouping samples
What we've been doing so far was calculating statistics for individual samples.
However, it is often useful to treat multiple samples as a single group or
population. *admixr* provides a function called `relabel()` that does just
that.
Here is an example: let's say we want to run a similar analysis to the one
described in the $D$ statistic section, but we want to treat Europeans, Africans
and archaics as combined populations, and not as separate individuals. But the
`ind` file that we have does not contain grouped labels - each sample stands on
its own:
```{r orig_ind, echo = FALSE, comment = ""}
cat(system(paste0("column -t ", snps$ind), intern = TRUE), sep = "\n")
```
To merge several individual samples under a combined label we can call
`relabel()` like this:
```{r relabel}
# create a modified EIGENSTRAT object that links to a new ind file
# containing merged population labels
modif_snps <- relabel(
snps,
European = c("French", "Sardinian"),
African = c("Dinka", "Yoruba", "Mbuti", "Khomani_San"),
Archaic = c("Vindija", "Altai", "Denisova")
)
```
```{r, comment = "#>"}
modif_snps
```
We can see that the function `relabel` returned a modified `EIGENSTRAT` object,
which contains a new item in the "modifiers" section - the path to a new ind
file. Let's look at its contents:
```{r modif_ind, echo = FALSE, comment = ""}
cat(system(paste0("column -t ", modif_snps$group), intern = TRUE), sep = "\n")
```
Having the modified `EIGENSTRAT` object ready, we can then use "European",
"African" and "Archaic" names in any of the *admixr* wrapper functions described
above. For example:
```{r modif_d}
result <- d(W = "European", X = "African", Y = "Archaic", Z = "Chimp", data = modif_snps)
```
Here is the result, again showing that Europeans have a higher genetic affinity
to archaic humans than present-day Africans do:
```{r eval = FALSE}
head(result)
```
```{r modif_d_kable, echo = FALSE}
knitr::kable(head(result))
```
Note that the `d()` function correctly picks up the "group modifier" `ind` file
from the provided `EIGENSTRAT` object and uses it in place of the original
`ind` file.
## Counting present/missing SNPs
The `count_snps` function can be useful for quality control, weighting of
admixture statistics ($D$, $f_4$, etc.) in regression analyses etc. There are
two optional arguments:
- `prop` - changes whether to report SNP counts or proportions (set to `FALSE`
by default),
- `missing` - controls whether to count missing SNPs instead of present SNPs
(set to `FALSE` by default).
For each sample, count the SNPs present in that sample:
```{r present_snps, results = "hide"}
count_snps(snps)
```
```{r present_snps_kable, echo = FALSE}
knitr::kable(count_snps(snps))
```
## Data filtering
### Filtering based on a BED file
It is quite common to repeat a particular analysis only on a subset of the
genome (such as intergenic sites, etc). However, EIGENSTRAT is a rather obscure
file format which is generally not supported by standard bioinformatics tools.
Luckily, *admixr* includes a function `filter_bed()` that takes an `EIGENSTRAT`
object and a BED file as its inputs and produces a new object that contains a
modifier called "excluded", linking to a snp file with coordinates of sites that
did not pass the filtering and will be excluded from later analyses.
```{r}
bed <- file.path(dirname(prefix), "regions.bed")
```
```{r}
# BED file contains regions to keep in an analysis
new_snps <- filter_bed(snps, bed)
# BED file contains regions to remove from an analysis
new_snps <- filter_bed(snps, bed, remove = TRUE)
```
```{r, comment = "#>"}
new_snps
```
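
To illustrate what filtering by genomic regions means, here is a base R sketch of an interval overlap check. It assumes the standard BED convention (0-based start coordinate, end coordinate not included) and 1-based SNP positions. All coordinates are hypothetical, and the sketch is only a conceptual illustration, not a description of how `filter_bed()` is implemented.

```{r, eval = FALSE}
# hypothetical SNP positions (1-based, as in a snp file)
snp_table <- data.frame(
  chrom = c("1", "1", "1", "2"),
  pos   = c(100, 250, 400, 150)
)

# hypothetical BED regions (0-based start, end not included)
regions <- data.frame(
  chrom = c("1", "2"),
  start = c(200, 0),
  end   = c(300, 100)
)

# TRUE for each SNP that falls inside at least one region
in_region <- vapply(seq_len(nrow(snp_table)), function(i) {
  any(regions$chrom == snp_table$chrom[i] &
      regions$start < snp_table$pos[i] &
      snp_table$pos[i] <= regions$end)
}, logical(1))

snp_table[in_region, ]   # sites kept if the BED file lists regions to keep
snp_table[!in_region, ]  # sites kept if the BED file lists regions to remove
```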
If we want to run the whole analysis in a single pipeline, we can use the
`%>%` pipe operator and do the following:
(The `%>%` operator takes what is on its left side and puts it as a first
argument of a function on the right side. While it takes some time to get used
to, it is very useful in longer multi-step "pipelines" because it makes
them much more readable. In fact, the resulting code often reads _almost_
like English! The `%>%` pipe is automatically imported when you load the
`tidyverse` library, and you can read about it more
[here](https://magrittr.tidyverse.org).)
```{r, eval = FALSE}
snps %>%
filter_bed("regions.bed") %>%
d(W = "French", X = "Mbuti", Y = "Vindija", Z = "Chimp")
```
This works because `data` is always the first argument in the formal definitions
of *admixr* functions, so we don't have to specify it manually.
**Important:** The `filter_bed()` function makes it very easy to do filtering
without worrying about locations of intermediate files, but it is important to
keep in mind that the function still creates temporary files under the hood. If
you plan to run many independent calculations on a filtered subset of the
data, it's better to save the new `EIGENSTRAT` object to a variable first and
re-use the same object multiple times, rather than running the whole pipeline
for each analysis separately (which would create new copies of intermediate
files for each iteration).
### Filtering out potential ancient DNA damage SNPs
In the field of ancient DNA, we often need to repeat an analysis on a subset of
data that is less likely to be influenced by ancient DNA damage, to verify that
our results are not caused by artifacts in the data (due to biochemical
properties of DNA degradation, ancient DNA damage will lead to an increase in
C→T and G→A substitutions). Using a method similar to the one
described in the BED filtering section above, we can use the
`transversions_only()` function to generate a snp file with positions that carry
transitions (C→T and G→A sites), which are then excluded from later analyses:
```{r, eval = FALSE}
new_snps <- transversions_only(snps)
# perform the calculation only on transversions
d(W = "French", X = "Dinka", Y = "Altai", Z = "Chimp", data = new_snps)
```
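
The distinction between transitions and transversions can be sketched in base R. The helper `is_transition()` and the alleles below are hypothetical and only show which REF/ALT pairs count as transitions (A/G and C/T in either direction). They do not show how `transversions_only()` works internally.

```{r, eval = FALSE}
# TRUE if a REF/ALT pair is a transition (A<->G or C<->T), FALSE otherwise
is_transition <- function(ref, alt) {
  pair <- paste0(ref, alt)
  pair %in% c("AG", "GA", "CT", "TC")
}

# hypothetical REF and ALT alleles of four SNPs
ref <- c("C", "G", "A", "T")
alt <- c("T", "A", "C", "G")

is_transition(ref, alt)          # TRUE TRUE FALSE FALSE
which(!is_transition(ref, alt))  # indices of transversions, which would be kept
```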
Again, we could combine several filtering steps into one pipeline:
```{r, eval = FALSE}
snps %>% # take the original data
filter_bed("regions.bed", remove = TRUE) %>% # remove sites falling within the specified regions
transversions_only() %>% # remove potential false SNPs due to aDNA damage
d(W = "French", X = "Dinka", Y = "Altai", Z = "Chimp") # calculate D on the filtered dataset
```
## Merging EIGENSTRAT datasets
Another useful data processing function is `merge_eigenstrat()`. This function
takes two EIGENSTRAT datasets and merges them, producing a union of samples and
intersection of SNPs from both of them and returning a new `EIGENSTRAT` object.
```{r merge_eigenstrat, eval = FALSE}
# this is just example code - it will not run unless you specify the paths
merged <- merge_eigenstrat(
  merged = "<prefix of the merged dataset>",
  a = first_EIGENSTRAT_object,
  b = second_EIGENSTRAT_object
)
```
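
The "union of samples, intersection of SNPs" logic can be illustrated with base R set operations. The sample names and SNP IDs below are made up, and the sketch only demonstrates the idea, not the actual merging procedure.

```{r, eval = FALSE}
# hypothetical sample names and SNP IDs of two datasets
samples_a <- c("French", "Han")
samples_b <- c("Papuan", "Dinka")
snps_a <- c("rs1", "rs2", "rs3", "rs4")
snps_b <- c("rs2", "rs4", "rs5")

union(samples_a, samples_b)  # samples present in the merged dataset
intersect(snps_a, snps_b)    # SNPs present in the merged dataset: "rs2" "rs4"
```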
## Examining log information
The goal of _admixr_ is to abstract away all the low-level technical workings
of ADMIXTOOLS. As we saw in the examples above, it achieves this by doing all
the dirty work of parsing the output files generated by ADMIXTOOLS, presenting
the user with convenient R data structures.
Nevertheless, admixr cannot (yet) parse _all_ the information generated by
ADMIXTOOLS utilities. For calculating D statistics, $f_4$ statistics and doing other
simple analyses, a single data frame is usually all the user needs. However,
sometimes things go wrong: the data turns out to be in the wrong format, ADMIXTOOLS
crashes with an error, or the results simply look suspicious. Furthermore, some
complex commands (such as _qpAdm_) are not yet completely implemented in admixr.
In any of these cases, it is still useful (and important!) to examine the log
outputs.
Each admixr result, such as the data frame object below, contains a (hidden!) attribute
which carries the complete log file associated with that particular analysis:
```{r d_log, comment = "#>"}
dres <- d(W = c("French", "Han", "Dinka"), X = "Yoruba", Y = "Vindija", Z = "Chimp", data = snps)
dres
```
We can examine the full log of this D statistic run with the `loginfo` function.
Without any further arguments, this function writes the complete log on the screen.
Note that we run this function _on the data frame object_ although it might appear
that it does not contain any other information (it does but it's normally hidden
as most of the time it's not useful):
```{r d_loginfo, comment = "#>"}
loginfo(dres)
```
The situation is a bit more complex for _qpAdm_ which can be evaluated for
multiple target populations/samples which are then analyzed separately, with
each analysis having its own log file:
```{r qpAdm_log, comment = "#>"}
qpadm_res <- qpAdm(
target = c("Sardinian", "Han"),
sources = c("Vindija", "Yoruba"),
outgroups = c("Chimp", "Denisova", "Altai"),
data = snps
)
qpadm_res
```
In this case, we could either print all log information (one log output for
each target we specified in the _qpAdm_ call) by running `loginfo(qpadm_res)` or,
perhaps more useful, specify which target's log file we want to examine:
```{r qpadm_log_target, comment = "#>"}
loginfo(qpadm_res, target = "Han")
```
Finally, we might want to keep some of the log files around for future reference,
further debugging or sharing with others. For these purposes, we can run the
`loginfo` function with the argument `save = TRUE`, potentially also specifying
the output directory (`dir = "."` by default) or a prefix of the output file(s)
(default prefix is simply the name of the admixr command which generated the
log file).
For example, the following will save the log of our qpAdm analysis of Neandertal
ancestry to a file with the prefix "qpAdm_Neandertal_ancestry", but only the one
for the Sardinian individual:
```{r qpadm_log_save}
loginfo(qpadm_res, target = "Sardinian", save = TRUE, prefix = "qpAdm_Neandertal_ancestry")
```
This will create a file in the current directory named `qpAdm_Neandertal_ancestry_Sardinian.txt`.
Note that when you call `loginfo()` on a qpAdm run, the name of the target is
always appended to the name of the output log file.
```{bash remove_log, echo = FALSE}
rm qpAdm_Neandertal_ancestry_Sardinian.txt
```
---
This work is licensed under a Creative Commons Attribution 4.0 International License.