--- title: "Intro to duckspatial" date: "`r Sys.Date()`" output: rmarkdown::html_vignette code-annotations: hover urlcolor: blue vignette: > %\VignetteIndexEntry{Intro to duckspatial} %\VignetteEngine{knitr::rmarkdown} \usepackage[utf8]{inputenc} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", eval = identical(tolower(Sys.getenv("NOT_CRAN")), "true"), out.width = "100%" ) # CRAN OMP THREAD LIMIT to avoid CRAN NOTE Sys.setenv(OMP_THREAD_LIMIT = 2) ``` The **{duckspatial}** package provides fast and memory-efficient functions to analyze and manipulate large spatial vector datasets in R. It allows R users to benefit directly from the analytical power of [DuckDB and its spatial extension](https://duckdb.org/docs/stable/core_extensions/spatial/functions), while remaining fully compatible with R’s spatial ecosystem, especially **{sf}**. At its core, **{duckspatial}** bridges two worlds: - R spatial workflows based on {sf} objects - Database-backed spatial analytics powered by DuckDB SQL This design makes **{duckspatial}** especially well suited for: - Working with large spatial data sets - Speeding up spatial analysis at scale - Workflows with larger-than memory data # Installation You can install **{duckspatial}** directly from CRAN with: ``` r install.packages("duckspatial") ``` Or you can install the development version from [GitHub](https://github.com/) with: ``` r # install.packages("pak") pak::pak("Cidree/duckspatial") ``` # Core idea: flexible spatial workflows A central design principle of **{duckspatial}** is that the same spatial operation can be used in different ways, depending on how your data is stored and how you want to manage memory and performance. Most functions in **{duckspatial}** support four complementary workflows: 1. Input`sf` → Output `sf` 4. Input `sf` → Output DuckDB table 2. Input DuckDB table → Output `sf` 3. Input DuckDB table → Output DuckDB table Let's see a few examples to illustrate these workflows with a few sample data sets. ```{r, message=FALSE} library(duckspatial) library(sf) # polygons countries_sf <- sf::st_read( system.file("spatial/countries.geojson", package = "duckspatial"), quiet = TRUE ) # create random points set.seed(42) n <- 10000 points_sf <- data.frame( id = 1:n, x = runif(n, min = -180, max = 180), y = runif(n, min = -90, max = 90) ) |> sf::st_as_sf(coords = c("x","y"), crs = 4326) ``` ## Workflow 1: `sf` input → `sf` output The simplest way to perform fast spatial operations. Here you pass `sf` objects as inputs, and under the hood {duckspatial}: - Registers them temporarily in DuckDB - Executes the spatial operation using SQL - Returns the result as an {sf} object In this example, we use `ddbs_join()` (which is equivalent to `sf::st_join`) to determine which country is intersected by each point. ```{r, message=FALSE} result_sf <- ddbs_join( x = points_sf, y = countries_sf, join = "intersects" ) head(result_sf) ``` - **When to use:** quick analysis, prototyping, or when you don’t need to persist intermediate tables. ## Creating a DuckDB connection The next workflows use a DuckDB connection, which makes these workflows much more efficient for working with large spatial data sets in general. To create a DuckDB connection, we use the `ddbs_create_conn()` function, which automatically creates the database connection and installs / loads DuckDB's spatial extension in a single call. 
```{r, message=FALSE}
# create a DuckDB connection and install / load the spatial extension
conn <- duckspatial::ddbs_create_conn()
```

## Workflow 2: `sf` input → DuckDB table

This workflow is ideal when you start in R but want to persist results efficiently in the database without loading them into memory. You pass `sf` objects as input, and {duckspatial} writes the output directly to DuckDB. The only difference is that here you also pass the `name` of the output table and the database `conn` where it should be saved.

```{r, message=FALSE}
ddbs_join(
  conn = conn,
  x = points_sf,
  y = countries_sf,
  join = "intersects",
  name = "points_in_countries_tbl"
)
```

If you later want to fetch the table into memory, `ddbs_read_vector()` reads a table and returns it as an `sf` object.

```{r, message=FALSE}
tbl <- ddbs_read_vector(
  conn = conn,
  name = "points_in_countries_tbl"
)

head(tbl)
```

- **When to use:** persisting intermediate results in the database, or building up tables that later steps will query.

## Workflow 3: DuckDB tables → `sf` output

In this workflow, your spatial data already lives inside DuckDB as tables, but you want the result back in memory as an `sf` object. You can write `sf` objects as tables to DuckDB with the `ddbs_write_vector()` function, passing the `sf` data and the `name` of the table to create in the database.

```{r, message=FALSE}
# write `sf` objects as tables to DuckDB
duckspatial::ddbs_write_vector(
  conn = conn,
  data = countries_sf,
  name = "countries"
)

duckspatial::ddbs_write_vector(
  conn = conn,
  data = points_sf,
  name = "points"
)
```

To perform a spatial operation, you pass the table names and get an `sf` object back.

```{r, message=FALSE}
result_sf <- ddbs_join(
  conn = conn,
  x = "points",
  y = "countries",
  join = "intersects"
)

head(result_sf)
```

- **When to use:** iterative workflows, larger-than-memory data, or when you’ll run multiple queries on the same tables.

## Workflow 4: DuckDB tables → DuckDB table

This is the fastest and most scalable workflow. The entire computation happens inside DuckDB, and the result is written to a new database table.

```{r, message=FALSE}
ddbs_join(
  conn = conn,
  x = "points",
  y = "countries",
  join = "intersects",
  name = "points_in_countries_tbl",
  overwrite = TRUE
)

# and read the table into memory as sf if needed
# tbl <- ddbs_read_vector(
#   conn = conn,
#   name = "points_in_countries_tbl"
# )
```

- **When to use:** very large datasets, production pipelines with multiple steps, or when results will be reused downstream in DuckDB.
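Finally, because everything lives in an ordinary DuckDB database, you can inspect the tables with standard {DBI} functions and close the connection when you are done. These calls come from {DBI} / {duckdb}, not from {duckspatial}:

``` r
# list the tables created in this vignette and sanity-check the join output
DBI::dbListTables(conn)
DBI::dbGetQuery(conn, "SELECT COUNT(*) AS n FROM points_in_countries_tbl")

# shut down DuckDB and release its resources
DBI::dbDisconnect(conn, shutdown = TRUE)
```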