
Last week

  • We saw how to do spatial cross validation in tidymodels
  • We downloaded occurrence points for a species using GBIF

Occurrence Data

  • Up until now, every model of occurrence we have built has used Presence / Absence data
  • We had sites where we knew the species was present and where we knew it was absent
  • The model could then estimate variation in the ‘probability’ of occurrence across different sites
  • The occurrence point data we downloaded record only places where we know the species is Present (there are no Absence data)

How to Model Presence-Only Data

  • If we tried to put the data through a model as is, the model would have no basis for choosing among candidate fits
  • There is no variation for the model to explain
  • Therefore any model that always predicts presence will be equivalently good
  • We need to introduce variation
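To see why, note that a presence-only dataset gives a classifier a constant outcome. A minimal sketch in base R (labels are made up for illustration):

```r
# With presence-only data the outcome column is constant: every
# observed point is a presence, so a model that always predicts
# "present" achieves perfect accuracy on the training labels.
labels <- rep("present", 50)       # every occurrence record is a presence
preds  <- rep("present", 50)       # trivial constant "model"
accuracy <- mean(preds == labels)  # always 1 -- nothing left to explain
```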

What Really Varies in Presence-Only Data?

  • Density!
  • Points have higher density in some areas and lower in others

What is Density?

  • Density is a continuous analog of counts
  • Density is a count per unit area, at the limit of infinitesimal area (for a two dimensional space)
  • To estimate a density over a continuous space, we need to use integration
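As a concrete illustration, a density can be estimated by counting the points that fall inside a quadrat of known area and dividing the count by that area. A base R sketch with simulated points (all values made up):

```r
# Density = count per unit area: count the points that fall inside
# a quadrat (sub-region) and divide by the quadrat's area.
set.seed(1)
pts_x <- runif(200)  # 200 simulated points on the unit square
pts_y <- runif(200)

in_quadrat   <- pts_x < 0.5 & pts_y < 0.5       # lower-left quarter
quadrat_area <- 0.5 * 0.5
density_est  <- sum(in_quadrat) / quadrat_area  # points per unit area
```

With points spread uniformly over a unit square, the estimate lands near the overall density of 200 points per unit area; shrinking the quadrat approaches the infinitesimal-area limit above.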

Quadrature

  • A simple way to estimate an integral is to sample the space randomly or regularly, evaluate the function at each sample point, and take a weighted sum of the sampled values.
  • Since models estimate the expectation of a function (which can be written as a weighted sum), we can make our model predict the density of points in space by adding random or regular ‘background’ points, often called ‘pseudo-absences’.
  • Estimating true density requires carefully choosing observation weights and applying them during model fitting, but not all models allow weighting.
  • Even without appropriate weighting, the model produces estimates that are proportional to density, which is often good enough.
  • We often just want to know where species are more or less likely to be, not the exact probability (although this is useful too, when possible).
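The quadrature idea above can be sketched in a few lines of base R, using a toy density-like surface rather than a real SDM: sample random points in the region, average the function values, and multiply by the region's area.

```r
# Monte Carlo quadrature: the integral of f over a region is roughly
# (area of region) * (mean of f at random sample points),
# i.e. a weighted sum with equal weights area / n.
set.seed(42)
f <- function(x, y) exp(-(x^2 + y^2))  # toy density-like surface

n <- 100000
x <- runif(n)  # random 'background' sample of the unit square
y <- runif(n)

area     <- 1                     # area of the unit square
estimate <- area * mean(f(x, y))  # quadrature estimate of the integral
```

The true value of this integral is about 0.5577, and with 100,000 samples the estimate lands very close to it; the same averaging logic is what the background points let an SDM do implicitly.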

Example of Pseudo-absences

## Download file size: 0 MB
## On disk at ./0107668-220831081235567.zip
## convert to sf
sachsia <- sachsia %>%
  select(long = decimalLongitude,
         lat = decimalLatitude) %>%
  st_as_sf(coords = c("long", "lat"), crs = 4326)

mapview(sachsia)
## Warning in cbind(`Feature ID` = fid, mat): number of rows of result is not a
## multiple of vector length (arg 1)

Create Background Area

bg <- create_background(sachsia, buffer = 20000, max_bg = florida)
mapview(bg)

Sample Pseudo-Absences

sachsia_dat <- sdm_data(sachsia, bg = bg, n = 10000)
sachsia_dat
## Simple feature collection with 10032 features and 2 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -81.56721 ymin: 24.4205 xmax: -80.16127 ymax: 25.79203
## Geodetic CRS:  WGS 84
## # A tibble: 10,032 × 3
##                    pnts present pnt_origin
##  *          <POINT [°]> <fct>   <chr>     
##  1 (-80.69476 25.39784) present data      
##  2 (-80.63322 25.40324) present data      
##  3  (-80.6333 25.40328) present data      
##  4 (-80.69095 25.39616) present data      
##  5 (-80.62141 25.39301) present data      
##  6 (-80.38955 25.54626) present data      
##  7 (-80.63322 25.40085) present data      
##  8 (-80.51267 25.43387) present data      
##  9   (-80.657 25.40336) present data      
## 10 (-80.39395 25.54607) present data      
## # … with 10,022 more rows
mapview(sachsia_dat, zcol = "present")

Add environmental variables

sachsia_dat <- add_env_vars(sachsia_dat, bioclim_fl)
sachsia_dat
## Simple feature collection with 10032 features and 21 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -81.56721 ymin: 24.4205 xmax: -80.16127 ymax: 25.79203
## Geodetic CRS:  WGS 84
## # A tibble: 10,032 × 22
##                    pnts present pnt_origin  BIO1  BIO2  BIO3  BIO4  BIO5  BIO6
##  *          <POINT [°]> <fct>   <chr>      <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 (-80.69476 25.39784) present data        24.2  28.0  20.0  1360   235    39
##  2 (-80.63322 25.40324) present data        24.2  27.9  19.9  1386   238    35
##  3  (-80.6333 25.40328) present data        24.2  27.9  19.9  1386   238    35
##  4 (-80.69095 25.39616) present data        24.2  28.0  20.0  1364   235    38
##  5 (-80.62141 25.39301) present data        24.2  27.9  20.0  1388   238    37
##  6 (-80.38955 25.54626) present data        24.3  27.9  20.2  1464   246    35
##  7 (-80.63322 25.40085) present data        24.2  27.9  19.9  1386   238    35
##  8 (-80.51267 25.43387) present data        24.2  27.9  20    1417   241    37
##  9   (-80.657 25.40336) present data        24.2  27.9  20.0  1376   237    36
## 10 (-80.39395 25.54607) present data        24.3  27.9  20.1  1467   246    35
## # … with 10,022 more rows, and 13 more variables: BIO7 <dbl>, BIO8 <dbl>,
## #   BIO9 <dbl>, BIO10 <dbl>, BIO11 <dbl>, BIO12 <dbl>, BIO13 <dbl>,
## #   BIO14 <dbl>, BIO15 <dbl>, BIO16 <dbl>, BIO17 <dbl>, BIO18 <dbl>,
## #   BIO19 <dbl>

Remove NAs

sachsia_dat <- sachsia_dat %>%
  drop_na(BIO1:BIO19)
mapview(st_sf(sachsia_dat), zcol = "present")
mapview(st_sf(sachsia_dat), zcol = "BIO1")

Make a recipe

  • Let’s do some transformations
sachsia_recipe <- recipe(sachsia_dat,
                         vars = c("present",
                                  sachsia_dat %>% select(BIO1:BIO19) %>% colnames()),
                         roles = c("outcome", rep("predictor", 19))) %>%
  step_YeoJohnson(all_predictors()) %>%
  step_normalize(all_predictors())

sachsia_recipe
## Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         19
## 
## Operations:
## 
## Yeo-Johnson transformation on all_predictors()
## Centering and scaling for all_predictors()

Make Model and Workflow

sachsia_mod <- rand_forest() %>%
  set_engine("ranger") %>%
  set_mode("classification")

sachsia_wf <- workflow() %>%
  add_recipe(sachsia_recipe) %>%
  add_model(sachsia_mod)

sachsia_wf
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: rand_forest()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 2 Recipe Steps
## 
## • step_YeoJohnson()
## • step_normalize()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Random Forest Model Specification (classification)
## 
## Computational engine: ranger

Fit Workflow

sachsia_fit <- sachsia_wf %>%
  fit(sachsia_dat)

sachsia_fit
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: rand_forest()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 2 Recipe Steps
## 
## • step_YeoJohnson()
## • step_normalize()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Ranger result
## 
## Call:
##  ranger::ranger(x = maybe_data_frame(x), y = y, num.threads = 1,      verbose = FALSE, seed = sample.int(10^5, 1), probability = TRUE) 
## 
## Type:                             Probability estimation 
## Number of trees:                  500 
## Sample size:                      9511 
## Number of independent variables:  19 
## Mtry:                             4 
## Target node size:                 10 
## Variable importance mode:         none 
## Splitrule:                        gini 
## OOB prediction error (Brier s.):  0.002885645

Make a prediction dataset

  • We want to make predictions on a landscape
  • Use a grid:
sachsia_grid <- sdm_data(sachsia, bg,
                         5000, sample_options = list(type = "regular")) %>%
  filter(present == "absent")
## although coordinates are longitude/latitude, st_sample assumes that they are planar
mapview(sachsia_grid)
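Under the hood, regular sampling amounts to laying an evenly spaced grid over the background region (the warning above shows `sf::st_sample()` doing the work). A base R sketch of the idea, using the rough bounding box from the data above:

```r
# A regular grid is just evenly spaced coordinates on each axis;
# expand.grid() crosses the two axes to give every combination.
xs <- seq(-81.6, -80.2, length.out = 50)  # rough longitude extent
ys <- seq(24.4, 25.8, length.out = 50)    # rough latitude extent
grid <- expand.grid(X = xs, Y = ys)       # 50 * 50 = 2500 candidate points
# in practice, points falling outside the background polygon are dropped,
# which is why sdm_data() returns fewer points than requested
```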

Add environmental variable and predict

sachsia_grid_dat <- add_env_vars(sachsia_grid, bioclim_fl) %>%
  drop_na(BIO1:BIO19)

sachsia_grid_dat
## # A tibble: 4,716 × 22
##                    pnts present pnt_origin  BIO1  BIO2  BIO3  BIO4  BIO5  BIO6
##             <POINT [°]> <fct>   <chr>      <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1  (-81.5532 24.61569) absent  pseudo      25.1  28.5  21.3  1054   152    40
##  2  (-81.5532 24.62482) absent  pseudo      25.1  28.5  21.3  1068   152    42
##  3 (-81.54407 24.62482) absent  pseudo      25.1  28.5  21.3  1057   153    41
##  4 (-81.53494 24.62482) absent  pseudo      25.1  28.5  21.3  1056   153    40
##  5 (-81.52581 24.62482) absent  pseudo      25.1  28.5  21.3  1058   154    40
##  6  (-81.5532 24.63395) absent  pseudo      25.1  28.5  21.3  1068   153    42
##  7 (-81.54407 24.63395) absent  pseudo      25.1  28.5  21.3  1072   153    43
##  8 (-81.51668 24.63395) absent  pseudo      25.1  28.5  21.2  1078   154    43
##  9 (-81.56233 24.64308) absent  pseudo      25.0  28.4  21.2  1053   153    41
## 10 (-81.51668 24.64308) absent  pseudo      25.1  28.5  21.2  1085   154    44
## # … with 4,706 more rows, and 13 more variables: BIO7 <dbl>, BIO8 <dbl>,
## #   BIO9 <dbl>, BIO10 <dbl>, BIO11 <dbl>, BIO12 <dbl>, BIO13 <dbl>,
## #   BIO14 <dbl>, BIO15 <dbl>, BIO16 <dbl>, BIO17 <dbl>, BIO18 <dbl>,
## #   BIO19 <dbl>
sachsia_preds <- sachsia_fit %>% 
  augment(sachsia_grid_dat)

Plot predictions

coords <- st_coordinates(st_sf(sachsia_preds))
ggplot(sachsia_preds %>% bind_cols(coords), aes(X, Y)) +
  geom_sf(data = florida, inherit.aes = FALSE) +
  geom_raster(aes(fill = .pred_present + 0.0001)) +
  geom_sf(data = sachsia, inherit.aes = FALSE, colour = "red") +
  scale_fill_continuous(name = "Probability of Occurrence",
                        trans = "logit") +
  coord_sf(ylim = c(24.5, 25.9), xlim = c(-81.6, -79.8)) +
  theme_minimal()
## Warning: Raster pixels are placed at uneven horizontal intervals and will be
## shifted. Consider using geom_tile() instead.
## Warning: Raster pixels are placed at uneven vertical intervals and will be
## shifted. Consider using geom_tile() instead.

Final Projects

  • Next week we will all try to do a full SDM on GBIF data
  • No lectures, both classes will be for working
  • Hand in at the end of next week
  • There will be no assignment this week
  • This assignment can be used as part of the final project

Next Class

  • Lecture will cover:
    • Poisson regression for count data
    • How to weight points in model to make output more interpretable
    • Evaluating SDMs
    • More on Final Projects