
Last week

  • We saw how to do spatial cross validation in tidymodels
  • We downloaded occurrence points for a species using GBIF

Occurrence Data

  • Up until now, every model of occurrence we have built has used Presence / Absence data
  • We had sites where we knew the species was present and where we knew it was absent
  • The model could then estimate variation in the ‘probability’ of occurrence across different sites
  • The occurrence point data we downloaded record only places where we know the species is Present (there are no Absence data)

How to Model Presence-Only Data

  • If we tried to put the data through a model as is, the model would have no basis for choosing among candidate fits
  • There is no variation for the model to explain
  • Therefore any model that always predicts presence will be equivalently good
  • We need to introduce variation
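To see why, note that a presence-only dataset gives a classifier a constant outcome. A minimal sketch in base R (labels are made up for illustration):

```r
# With presence-only data the outcome column is constant: every
# observed point is a presence, so a model that always predicts
# "present" achieves perfect accuracy on the training labels.
labels <- rep("present", 50)       # every occurrence record is a presence
preds  <- rep("present", 50)       # trivial constant "model"
accuracy <- mean(preds == labels)  # always 1 -- nothing left to explain
```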

What Really Varies in Presence-Only Data?

  • Density!
  • Points have higher density in some areas and lower in others

What is Density?

  • Density is a continuous analog of counts
  • Density is a count per unit area, at the limit of infinitesimal area (for a two dimensional space)
  • To estimate a density over a continuous space, we need to use integration
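As a concrete illustration, a density can be estimated by counting the points that fall inside a quadrat of known area and dividing the count by that area. A base R sketch with simulated points (all values made up):

```r
# Density = count per unit area: count the points that fall inside
# a quadrat (sub-region) and divide by the quadrat's area.
set.seed(1)
pts_x <- runif(200)  # 200 simulated points on the unit square
pts_y <- runif(200)

in_quadrat   <- pts_x < 0.5 & pts_y < 0.5       # lower-left quarter
quadrat_area <- 0.5 * 0.5
density_est  <- sum(in_quadrat) / quadrat_area  # points per unit area
```

With points spread uniformly over a unit square, the estimate lands near the overall density of 200 points per unit area; shrinking the quadrat approaches the infinitesimal-area limit above.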

Quadrature

  • A simple way to estimate an integral is to sample the space randomly or regularly, evaluate the function at each sample point, and take a weighted sum of the sampled values.
  • Since models estimate the expectation of a function (which can be written as a weighted sum), we can make our model predict the density of points in space by adding random or regular ‘background’ points, often called ‘pseudo-absences’.
  • Estimating true density requires carefully choosing observation weights and applying them during model fitting, but not all models allow weighting.
  • Even without appropriate weighting, the model produces estimates that are proportional to density, which is often good enough.
  • We often just want to know where species are more or less likely to be, not the exact probability (although this is useful too, when possible).
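The quadrature idea above can be sketched in a few lines of base R, using a toy density-like surface rather than a real SDM: sample random points in the region, average the function values, and multiply by the region's area.

```r
# Monte Carlo quadrature: the integral of f over a region is roughly
# (area of region) * (mean of f at random sample points),
# i.e. a weighted sum with equal weights area / n.
set.seed(42)
f <- function(x, y) exp(-(x^2 + y^2))  # toy density-like surface

n <- 100000
x <- runif(n)  # random 'background' sample of the unit square
y <- runif(n)

area     <- 1                     # area of the unit square
estimate <- area * mean(f(x, y))  # quadrature estimate of the integral
```

The true value of this integral is about 0.5577, and with 100,000 samples the estimate lands very close to it; the same averaging logic is what the background points let an SDM do implicitly.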

Example of Pseudo-absences

## Download file size: 0 MB
## On disk at ./0107668-220831081235567.zip
## convert to sf
sachsia <- sachsia %>%
  select(long = decimalLongitude,
         lat = decimalLatitude) %>%
  st_as_sf(coords = c("long", "lat"), crs = 4326)

mapview(sachsia)
## Warning in cbind(`Feature ID` = fid, mat): number of rows of result is not a
## multiple of vector length (arg 1)

Create Background Area

bg <- create_background(sachsia, buffer = 20000, max_bg = florida)
mapview(bg)

Sample Pseudo-Absences

sachsia_dat <- sdm_data(sachsia, bg = bg, n = 10000)
sachsia_dat
## Simple feature collection with 10032 features and 2 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -81.56721 ymin: 24.4205 xmax: -80.16127 ymax: 25.79203
## Geodetic CRS:  WGS 84
## # A tibble: 10,032 × 3
##                    pnts present pnt_origin
##  *          <POINT [°]> <fct>   <chr>     
##  1 (-80.69476 25.39784) present data      
##  2 (-80.63322 25.40324) present data      
##  3  (-80.6333 25.40328) present data      
##  4 (-80.69095 25.39616) present data      
##  5 (-80.62141 25.39301) present data      
##  6 (-80.38955 25.54626) present data      
##  7 (-80.63322 25.40085) present data      
##  8 (-80.51267 25.43387) present data      
##  9   (-80.657 25.40336) present data      
## 10 (-80.39395 25.54607) present data      
## # … with 10,022 more rows
mapview(sachsia_dat, zcol = "present")

Add environmental variables

sachsia_dat <- add_env_vars(sachsia_dat, bioclim_fl)
sachsia_dat
## Simple feature collection with 10032 features and 21 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -81.56721 ymin: 24.4205 xmax: -80.16127 ymax: 25.79203
## Geodetic CRS:  WGS 84
## # A tibble: 10,032 × 22
##                    pnts present pnt_origin  BIO1  BIO2  BIO3  BIO4  BIO5  BIO6
##  *          <POINT [°]> <fct>   <chr>      <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 (-80.69476 25.39784) present data        24.2  28.0  20.0  1360   235    39
##  2 (-80.63322 25.40324) present data        24.2  27.9  19.9  1386   238    35
##  3  (-80.6333 25.40328) present data        24.2  27.9  19.9  1386   238    35
##  4 (-80.69095 25.39616) present data        24.2  28.0  20.0  1364   235    38
##  5 (-80.62141 25.39301) present data        24.2  27.9  20.0  1388   238    37
##  6 (-80.38955 25.54626) present data        24.3  27.9  20.2  1464   246    35
##  7 (-80.63322 25.40085) present data        24.2  27.9  19.9  1386   238    35
##  8 (-80.51267 25.43387) present data        24.2  27.9  20    1417   241    37
##  9   (-80.657 25.40336) present data        24.2  27.9  20.0  1376   237    36
## 10 (-80.39395 25.54607) present data        24.3  27.9  20.1  1467   246    35
## # … with 10,022 more rows, and 13 more variables: BIO7 <dbl>, BIO8 <dbl>,
## #   BIO9 <dbl>, BIO10 <dbl>, BIO11 <dbl>, BIO12 <dbl>, BIO13 <dbl>,
## #   BIO14 <dbl>, BIO15 <dbl>, BIO16 <dbl>, BIO17 <dbl>, BIO18 <dbl>,
## #   BIO19 <dbl>

Remove NAs

sachsia_dat <- sachsia_dat %>%
  drop_na(BIO1:BIO19)
mapview(st_sf(sachsia_dat), zcol = "present")
mapview(st_sf(sachsia_dat), zcol = "BIO1")

Make a recipe

  • Let’s do some transformations
sachsia_recipe <- recipe(sachsia_dat,
                         vars = c("present",
                                  sachsia_dat %>% select(BIO1:BIO19) %>% colnames()),
                         roles = c("outcome", rep("predictor", 19))) %>%
  step_YeoJohnson(all_predictors()) %>%
  step_normalize(all_predictors())

sachsia_recipe
## Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         19
## 
## Operations:
## 
## Yeo-Johnson transformation on all_predictors()
## Centering and scaling for all_predictors()

Make Model and Workflow

sachsia_mod <- rand_forest() %>%
  set_engine("ranger") %>%
  set_mode("classification")

sachsia_wf <- workflow() %>%
  add_recipe(sachsia_recipe) %>%
  add_model(sachsia_mod)

sachsia_wf
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: rand_forest()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 2 Recipe Steps
## 
## • step_YeoJohnson()
## • step_normalize()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Random Forest Model Specification (classification)
## 
## Computational engine: ranger

Fit Workflow

sachsia_fit <- sachsia_wf %>%
  fit(sachsia_dat)

sachsia_fit
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: rand_forest()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 2 Recipe Steps
## 
## • step_YeoJohnson()
## • step_normalize()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Ranger result
## 
## Call:
##  ranger::ranger(x = maybe_data_frame(x), y = y, num.threads = 1,      verbose = FALSE, seed = sample.int(10^5, 1), probability = TRUE) 
## 
## Type:                             Probability estimation 
## Number of trees:                  500 
## Sample size:                      9511 
## Number of independent variables:  19 
## Mtry:                             4 
## Target node size:                 10 
## Variable importance mode:         none 
## Splitrule:                        gini 
## OOB prediction error (Brier s.):  0.002885645

Make a prediction dataset

  • We want to make predictions on a landscape
  • Use a grid:
sachsia_grid <- sdm_data(sachsia, bg,
                         5000, sample_options = list(type = "regular")) %>%
  filter(present == "absent")
## although coordinates are longitude/latitude, st_sample assumes that they are planar
mapview(sachsia_grid)
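Under the hood, regular sampling amounts to laying an evenly spaced grid over the background region (the warning above shows `sf::st_sample()` doing the work). A base R sketch of the idea, using the rough bounding box from the data above:

```r
# A regular grid is just evenly spaced coordinates on each axis;
# expand.grid() crosses the two axes to give every combination.
xs <- seq(-81.6, -80.2, length.out = 50)  # rough longitude extent
ys <- seq(24.4, 25.8, length.out = 50)    # rough latitude extent
grid <- expand.grid(X = xs, Y = ys)       # 50 * 50 = 2500 candidate points
# in practice, points falling outside the background polygon are dropped,
# which is why sdm_data() returns fewer points than requested
```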

Add environmental variable and predict

sachsia_grid_dat <- add_env_vars(sachsia_grid, bioclim_fl) %>%
  drop_na(BIO1:BIO19)

sachsia_grid_dat
## # A tibble: 4,716 × 22
##                    pnts present pnt_origin  BIO1  BIO2  BIO3  BIO4  BIO5  BIO6
##             <POINT [°]> <fct>   <chr>      <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1  (-81.5532 24.61569) absent  pseudo      25.1  28.5  21.3  1054   152    40
##  2  (-81.5532 24.62482) absent  pseudo      25.1  28.5  21.3  1068   152    42
##  3 (-81.54407 24.62482) absent  pseudo      25.1  28.5  21.3  1057   153    41
##  4 (-81.53494 24.62482) absent  pseudo      25.1  28.5  21.3  1056   153    40
##  5 (-81.52581 24.62482) absent  pseudo      25.1  28.5  21.3  1058   154    40
##  6  (-81.5532 24.63395) absent  pseudo      25.1  28.5  21.3  1068   153    42
##  7 (-81.54407 24.63395) absent  pseudo      25.1  28.5  21.3  1072   153    43
##  8 (-81.51668 24.63395) absent  pseudo      25.1  28.5  21.2  1078   154    43
##  9 (-81.56233 24.64308) absent  pseudo      25.0  28.4  21.2  1053   153    41
## 10 (-81.51668 24.64308) absent  pseudo      25.1  28.5  21.2  1085   154    44
## # … with 4,706 more rows, and 13 more variables: BIO7 <dbl>, BIO8 <dbl>,
## #   BIO9 <dbl>, BIO10 <dbl>, BIO11 <dbl>, BIO12 <dbl>, BIO13 <dbl>,
## #   BIO14 <dbl>, BIO15 <dbl>, BIO16 <dbl>, BIO17 <dbl>, BIO18 <dbl>,
## #   BIO19 <dbl>
sachsia_preds <- sachsia_fit %>% 
  augment(sachsia_grid_dat)

Plot predictions

coords <- st_coordinates(st_sf(sachsia_preds))
ggplot(sachsia_preds %>% bind_cols(coords), aes(X, Y)) +
  geom_sf(data = florida, inherit.aes = FALSE) +
  geom_raster(aes(fill = .pred_present + 0.0001)) +
  geom_sf(data = sachsia, inherit.aes = FALSE, colour = "red") +
  scale_fill_continuous(name = "Probability of Occurrence",
                        trans = "logit") +
  coord_sf(ylim = c(24.5, 25.9), xlim = c(-81.6, -79.8)) +
  theme_minimal()
## Warning: Raster pixels are placed at uneven horizontal intervals and will be
## shifted. Consider using geom_tile() instead.
## Warning: Raster pixels are placed at uneven vertical intervals and will be
## shifted. Consider using geom_tile() instead.

Final Projects

  • Next week we will all try to do a full SDM on GBIF data
  • No lectures, both classes will be for working
  • Hand in at the end of next week
  • There will be no assignment this week
  • This assignment can be used as part of the final project

Next Class

  • Lecture will cover:
    • Poisson regression for count data
    • How to weight points in model to make output more interpretable
    • Evaluating SDMs
    • More on Final Projects