--- title: "Retrieving data from VectorByte" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Retrieving data from VectorByte} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ## Introduction The ohvbd package allows you to retrieve data from many different databases directly. Currently these databases include the [VecTraits](https://vectorbyte.crc.nd.edu/vectraits-explorer) and [VecDyn](https://vectorbyte.crc.nd.edu/vecdyn-datasets) projects from [VectorByte](https://www.vectorbyte.org/), and [GBIF](https://www.gbif.org). Let's imagine you wanted to figure out where there is trait data for a particular vector species - _Aedes aegypti_, for example. Here's how you'd use `ohvbd` to do that: ``` r df <- search_hub("Aedes aegypti", "vt") |> fetch() |> glean( cols = c("Interactor1Genus", "Interactor1Species", "Latitude", "Longitude"), returnunique = TRUE ) df ``` Great, lovely. But now... what does that all mean? And how can you build up your own data request? Read on to find out... ## A note before we begin For users who are familiar with base R pipes (`|>`), this approach is generally usable in ohvbd as well. For users who are not familiar, pipes take the output of one command and feed it forward to the next command as the first argument: ``` r # Find mean of a vector normally x <- c(1, 2, 3) mean(x) ``` ``` ## [1] 2 ``` ``` r # Find mean of a vector using pipes c(1, 2, 3) |> mean() ``` ``` ## [1] 2 ``` For the rest of this vignette we will be using a piped-style approach. ## Setting up the cache Some supporting downloads such as those from AREAdata are large, but rarely updated. As such it's worthwhile keeping these files around if you'll use them regularly. To do this let's run a command to set up a local cache. ``` r library(ohvbd) set_default_ohvbd_cache() ``` Now you've done this, your cached data will not be deleted when you close R (though you must run this command once per session). When you run this on your own computer, you will receive a message giving instructions on how to permanently set this using your [.Rprofile](https://docs.posit.co/ide/user/ide/guide/environments/r/managing-r.html#rprofile) file. ## Finding IDs Datasets in VecTraits and VecDyn, and GBIF are organised by id. You can search for ids related to a particular query using the `vbdhub.org` (aka "the hub") search functionality via `search_hub()`. In this case let's search the hub for *Aedes aegypti*, the "Yellow Fever mosquito": ``` r aedes_results <- search_hub("Aedes aegypti") summary(aedes_results) ``` ``` ## Rows: 150, Query: Aedes aegypti ## ## Split by database: ## gbif px vd vt ## 21 46 12 71 ``` You can see here there are 20 GBIF datasets, 10 VecDyn datasets, and 71 VecTraits datasets. However right now we only have the ids of the data, not the data themselves. In order to get that, we must fetch the data from a given database. ## Filtering dbs Before fetching data, we must extract only the ids relevant to our database from our search. > GBIF, VecTraits, and VecDyn do not have unified ids between datasets, so if you attempted to get VT ids from another database you would (at best) get garbage. 
Filtering database results from searches can be performed using the `filter_db()` command:

``` r
aedes_vt <- filter_db(aedes_results, "vt")

aedes_vt
```

```
## 
## Database: vt
##  [1]  474  475  148  578  126  556  142  144  169  580  577  285  287  863  865
## [16]  357  473  476  149  573  576  565  555  146  841  842  356  170  214  579
## [31]  864  359  355  143  147  564  574  575  124  125  346  553  554  354  853
## [46]  286  901  825  826  145  906  892  893  358  854  855  911  860  828  910
## [61]  572  557  558  571 1506 1510 1511 1512 1507 1508 1509
```

If you only searched the hub for one database, by default `search_hub()` will automatically perform the `filter_db()` operation for you!

``` r
search_hub("Aedes aegypti", db = "vt")
```

```
## 
## Database: vt
##  [1]  474  475  148  578  126  556  142  144  169  580  577  285  287  863  865
## [16]  357  473  476  149  573  576  565  555  146  841  842  356  170  214  579
## [31]  864  359  355  143  147  564  574  575  124  125  346  553  554  354  853
## [46]  286  901  825  826  145  906  892  893  358  854  855  911  860  828  910
## [61]  572  557  558  571 1506 1510 1511 1512 1507 1508 1509
```

## Getting data

Now that we have a vector of dataset IDs from VecTraits, we need to actually retrieve the data for these datasets through the API. To do this we can use the `fetch()` function.

In this case let's get the first 5 *Aedes aegypti* datasets:

``` r
aedes_vt <- aedes_vt |>
  head(5)

aedes_responses <- aedes_vt |>
  fetch()

aedes_responses[[1]]
```

```
## 
## GET https://vectorbyte.crc.nd.edu/portal/api/vectraits-dataset/474/?format=json
## Status: 200 OK
## Content-Type: application/json
## Body: In memory (54992 bytes)
```

The `fetch()` function returns a list of the data in the form of the original `httr2` responses. These are useful if you want to know specifics about how the server sent data back, but for most use cases it is more useful to extract the data into a dataframe.

### Specific fetch functions

`fetch()` retrieves data from the appropriate database for your data. Under the hood it farms out this work to the `fetch_x()` series of functions. You can use these yourself for extra peace of mind, though we do not really recommend it.

So the above code could also be written as:

``` r
aedes_responses <- aedes_vt |>
  fetch_vt()
```

## Extracting data

Now that we have a list of responses, we can extract the relevant data from them using the `glean()` function.

``` r
aedes_data <- aedes_responses |>
  glean()

cat("Data dimensions: ", ncol(aedes_data), " cols x ", nrow(aedes_data), " rows")
```

```
## Data dimensions: 157 cols x 727 rows
```

This dataset is a bit too large to print here, and often you may only want a few columns of data rather than the whole dataset. Fortunately the `cols` argument allows us to select just the columns we need. We can also use the argument `returnunique` to instruct ohvbd to return only unique rows.
So let's get just the unique location and trait name combinations from our data using the same command as before:

``` r
aedes_data_filtered <- aedes_responses |>
  glean(
    cols = c("Location", "OriginalTraitName"),
    returnunique = TRUE
  )

aedes_data_filtered
```

```
## 
## Database: vt
##   DatasetID                      Location      OriginalTraitName
## 1       474                Marilia Brazil         fecundity rate
## 2       475                Marilia Brazil              longevity
## 3       148      Unversity of Georgia USA       glycogen content
## 4       578            Fort Myers Florida       development time
## 5       126 Guadeloupe French West Indies transmission potential
```

### Specific glean functions

Like `fetch()`, `glean()` has database-specific variants:

``` r
aedes_data <- aedes_responses |>
  glean_vt()
```

## Putting it all together

In day-to-day use, you will mostly find yourself using all these functions together to create small pipelines. A typical pipeline would likely only contain a few lines of code:

``` r
df <- search_hub("Aedes aegypti") |>
  filter_db("vt") |>
  head(5) |>
  fetch() |>
  glean(
    cols = c("Interactor1Genus", "Interactor1Species", "Latitude", "Longitude"),
    returnunique = TRUE
  )

head(df)
```

```
## 
## Database: vt
##   DatasetID Interactor1Genus Interactor1Species  Latitude Longitude
## 1       474            Aedes            aegypti -22.21389 -49.94583
## 2       475            Aedes            aegypti -22.21389 -49.94583
## 3       148            Aedes            aegypti        NA        NA
## 4       578            Aedes            aegypti  26.61667 -81.83333
## 5       126            Aedes            aegypti  16.25000 -61.58333
```

A similar pipeline taking advantage of the autofiltering in `search_hub()` might look like this:

``` r
df <- search_hub("Aedes aegypti", db = "vt") |>
  head(5) |>
  fetch() |>
  glean(
    cols = c("Interactor1Genus", "Interactor1Species", "Latitude", "Longitude"),
    returnunique = TRUE
  )
```

## Smart searching of VectorByte databases

One subtlety of VectorByte (in particular) is to do with field collisions. Let's imagine that we are looking for traits of whitefly species (*Bemisia*). We can construct a query to investigate this as follows:

``` r
df <- search_hub("Bemisia", "vt") |>
  head(6) |>
  fetch() |>
  glean(
    cols = c(
      "DatasetID",
      "Interactor1Genus",
      "Interactor1Species",
      "Interactor2Genus",
      "Interactor2Species"
    )
  )
```

Now we would expect this to be traits of *Bemisia* spp.; however, when we look at the `Interactor1Genus` column we see something a touch odd:

``` r
unique(df$Interactor1Genus)
```

```
## [1] "Axinoscymnus" "Bemisia"
```

*Axinoscymnus* is a ladybird, so why is it appearing here? Let's look at only the rows containing *Axinoscymnus*:

``` r
df |>
  dplyr::filter(Interactor1Genus == "Axinoscymnus") |>
  head()
```

```
## 
## Database: vt
##   DatasetID Interactor1Genus Interactor1Species Interactor2Genus
## 1       160     Axinoscymnus         cardilobus          Bemisia
## 2       160     Axinoscymnus         cardilobus          Bemisia
## 3       160     Axinoscymnus         cardilobus          Bemisia
## 4       160     Axinoscymnus         cardilobus          Bemisia
## 5       160     Axinoscymnus         cardilobus          Bemisia
## 6       160     Axinoscymnus         cardilobus          Bemisia
##   Interactor2Species
## 1             tabaci
## 2             tabaci
## 3             tabaci
## 4             tabaci
## 5             tabaci
## 6             tabaci
```

In this scenario, *Bemisia* is present in the dataset, but as the "target" (the second interactor) rather than the animal that the trait refers to. As such we might want to be more specific about precisely which data to retrieve.

Enter the `search_x_smart()` family of functions. These allow you to construct a more specific search. So let's construct the same search as before, but using the smart search.
``` r
df <- search_vt_smart("Interactor1Genus", "contains", "Bemisia") |>
  head(6) |>
  fetch() |>
  glean(
    cols = c(
      "DatasetID",
      "Interactor1Genus",
      "Interactor1Species",
      "Interactor2Genus",
      "Interactor2Species"
    )
  )

unique(df$Interactor1Genus)
```

```
## [1] "Bemisia"
```

Here we have made sure to only search the `Interactor1Genus` column, and as a result we get back only *Bemisia* traits!

This same sort of collision is particularly common in the "Citation" column, where papers may mention multiple trait names.

The `search_x_smart()` functions support many different operators and columns. For full details, run `?search_vt_smart` or `?search_vd_smart` in your console.

In general it is always worthwhile inspecting the data you retrieve to make sure that your query returned the data that you thought it did.

## Further steps

From here you now have all the data you might need for further analysis, so now it's down to you!

One final note to end on: it is often advisable to save any output data in a csv or parquet format so that you do not need to re-download it every time you run your script. This is as easy as running `write.csv()` on your dataframe, then reading it in later with `read.csv()`.
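As a minimal sketch (assuming the `df` built above, and using the placeholder file name `aedes_traits.csv`), that round trip might look like this:

``` r
# Save the gleaned data so it does not need to be re-downloaded next time
# ("aedes_traits.csv" is a placeholder file name - use whatever path you like)
write.csv(df, "aedes_traits.csv", row.names = FALSE)

# In a later session, read the saved copy back in instead of re-fetching
df <- read.csv("aedes_traits.csv")
```

If you prefer parquet, the arrow package provides `arrow::write_parquet()` and `arrow::read_parquet()` for the same round trip.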