Skip to contents

Forests of the Future Week 6

Species Distribution Modelling 2

  • Last week we learned how to use tidymodels to fit a model to data
  • We modelled the abundance of fish species at different reefs based on their mean temperatures
  • Today we will talk about modelling occurrence data, where we have data on the presence or absence of species at particular sites
  • We will also learn how to preprocess data using tidymodels, with the recipes package

Ecological Niche Theory

  • All species have an ecological niche
  • An Ecological niche is the position of a species within an ecosystem
    • The conditions necessary for persistence of the species
    • Its ecological role in the ecosystem
  • Combining both more formally:
    • A niche is: the part of ecological space (defined by all combinations of biotic and abiotic environmental conditions) where the species population can persist and thus utilize resources and impact its environment

Ecological Niche Theory

  • Niches are often visualized or conceptualized as hump-shaped distributions along an environmental variable

Ecological Niche Theory

  • Compare that with the model predictions I made in last lecture:

Generalise to many dimensions

  • Environmental ‘hypervolume’

3D Hypervolume

Review

Data Science Steps

  1. A Question
  2. Collect Data
  3. Munge / Clean Data
  4. Transform Data for Model
  5. Analyze Data using Model
  6. Tune Model
  7. Validate / Test Model
  8. Interpret Model

This week

  1. A Question
  2. Collect Data
  3. Munge / Clean Data
  4. Transform Data for Model
  5. Analyze Data using Model
  6. Tune Model
  7. Validate / Test Model
  8. Interpret Model

  • Modelling fish abundance
data("RF_abund")
fish_dat <- RF_abund %>%
  filter(SpeciesName == "Thalassoma pavo") %>%
  mutate(Presence = as.factor(Presence))
  • What if we only had presence / absence data?

Presence / Absence Data

  • We are not really modelling numeric data anymore
  • Presence or Absence are categories
  • We do a classification Model, and classify into the present or absent category
  • Binary classification models typically output a ‘probability’, a number between 0 and 1
  • Observed data is compared to output using the ‘binomial’ probability distribution

Presence / Absence Data

  • R has a number of probability distribution built-in which we can use to explore
rbinom(100, 1, 0.5)
##   [1] 1 0 0 0 1 0 0 1 1 1 0 0 1 1 1 1 1 0 1 0 0 1 1 0 1 0 1 1 1 0 0 0 0 1 0 0 1
##  [38] 1 0 1 0 0 0 1 1 1 1 1 1 0 0 1 1 0 0 0 1 0 0 1 0 0 0 0 0 1 1 1 0 0 1 1 0 1
##  [75] 1 1 0 1 0 1 0 0 0 0 0 1 0 1 1 0 0 0 0 0 1 0 0 1 0 1
rbinom(100, 1, 0.75)
##   [1] 1 1 1 1 1 0 0 0 1 1 1 0 1 1 1 1 1 1 0 0 1 1 0 1 1 0 0 0 1 1 1 0 0 0 1 0 1
##  [38] 0 1 1 1 1 1 0 1 1 0 1 0 1 0 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1
##  [75] 1 0 0 1 1 1 1 1 0 1 1 1 0 1 1 1 1 0 1 1 1 0 1 1 1 0
rbinom(100, 1, 0.25)
##   [1] 0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 1 1 1 0 0 0 0 0 0 0 0 1
##  [38] 1 0 0 0 0 0 0 1 1 1 0 1 0 1 1 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 1 0 0 0 0 1 0
##  [75] 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0
curve(dbinom(x, size = 100, 0.5), 0, 100)

RF_abund includes presence/absence data column

fish_dat$Presence
##   [1] 1 1 1 1 1 1 0 1 1 1 0 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 1 0
##  [38] 1 1 1 1 1 1 0 0 0 1 1 0 1 0 0 1 1 1 0 1 1 0 1 0 0 1 1 1 0 0 0 0 0 0 1 1 1
##  [75] 1 1 1 1 0 0 0 1 1 1 1 1 0 1 1 0 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [112] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [149] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [186] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 1 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 0 0
## [223] 0 0 1 1 1 0 0 0 1 0 1 1 0 0 1 1 0 0 0 1 0 1 1 1 0 1 1 1 1 1 0 0 0 0 0 0 0
## [260] 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 1
## [297] 1 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 0 1 1 1 1 0 0 0 0 0 0 0
## [334] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0
## [371] 0 1 0 0 1 1 1 1 1 0 1 0 0 0 1
## Levels: 0 1

Let’s model fish occurrence

  • Use two variables: Mean Temperature and Depth
  • In order to make them more comparable we will ‘standardise’ them
  • We will also test our model by creating a test set

  • Split off a test set:
set.seed(1234)
data_split <- initial_split(fish_dat, 0.8, strata = Presence)

train_data <- training(data_split)
test_data <- testing(data_split)

train_data
## # A tibble: 308 × 25
##    SpeciesName   SiteC…¹ Abund…² Sampl…³ MeanT…⁴ MinTe…⁵ MaxTe…⁶ SDTem…⁷ ECOre…⁸
##    <chr>         <chr>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl> <fct>  
##  1 Thalassoma p… AND2          0     164    17.6    14.1    22.8    2.85 Albora…
##  2 Thalassoma p… AND25         0     173    19.0    14.0    25.4    4.01 Albora…
##  3 Thalassoma p… AND37         0     173    19.1    14.0    26.1    4.20 Albora…
##  4 Thalassoma p… AND43         0     173    18.9    14.3    25.3    3.66 Albora…
##  5 Thalassoma p… AND50         0     164    18.2    14.2    23.5    3.02 Albora…
##  6 Thalassoma p… AND51         0     164    18.3    14.1    23.8    3.12 Albora…
##  7 Thalassoma p… AND54         0     173    18.6    14.2    24.6    3.41 Albora…
##  8 Thalassoma p… AND56         0     164    17.8    14.2    22.1    2.66 Albora…
##  9 Thalassoma p… AND60         0     164    18.4    14.9    22.2    2.53 Sahara…
## 10 Thalassoma p… AND63         0     173    18.9    14.4    26.3    3.85 Albora…
## # … with 298 more rows, 16 more variables: Presence <fct>, OLRE <fct>,
## #   MaxAbundance <dbl>, N_Obs <int>, Confidence_NObs <dbl>, T_Range_Obs <dbl>,
## #   Confidence_TRange_Obs <dbl>, N_Absences_T_Upper <int>,
## #   N_Absences_T_Lower <int>, Confidence_Occ_Tupper <dbl>,
## #   Confidence_Occ_Tlower <dbl>, T_Upper_Absences <dbl>,
## #   T_Lower_Absences <dbl>, T_Mean_Absences <dbl>, NEOLI <dbl>,
## #   Depth_Site <dbl>, and abbreviated variable names ¹​SiteCode, …
test_data
## # A tibble: 77 × 25
##    SpeciesName   SiteC…¹ Abund…² Sampl…³ MeanT…⁴ MinTe…⁵ MaxTe…⁶ SDTem…⁷ ECOre…⁸
##    <chr>         <chr>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl> <fct>  
##  1 Thalassoma p… AND1          3     217    17.5    14.1    22.6    2.75 Albora…
##  2 Thalassoma p… AND12       100     173    18.5    14.0    24.7    3.67 Albora…
##  3 Thalassoma p… AND16         0     173    18.8    14.0    25.2    3.89 Albora…
##  4 Thalassoma p… AND18        13     173    18.9    14.0    25.2    3.92 Albora…
##  5 Thalassoma p… AND19         1     173    19.0    14.0    25.5    4.08 Albora…
##  6 Thalassoma p… AND21        11     173    18.8    14.0    25.2    3.87 Albora…
##  7 Thalassoma p… AND27        14     173    19.0    14.0    25.5    4.08 Albora…
##  8 Thalassoma p… AND29        17     173    19.0    14.0    25.4    4.03 Albora…
##  9 Thalassoma p… AND3          8     164    17.7    14.1    22.9    2.90 Albora…
## 10 Thalassoma p… AND4          0     164    17.9    14.1    23.3    3.01 Albora…
## # … with 67 more rows, 16 more variables: Presence <fct>, OLRE <fct>,
## #   MaxAbundance <dbl>, N_Obs <int>, Confidence_NObs <dbl>, T_Range_Obs <dbl>,
## #   Confidence_TRange_Obs <dbl>, N_Absences_T_Upper <int>,
## #   N_Absences_T_Lower <int>, Confidence_Occ_Tupper <dbl>,
## #   Confidence_Occ_Tlower <dbl>, T_Upper_Absences <dbl>,
## #   T_Lower_Absences <dbl>, T_Mean_Absences <dbl>, NEOLI <dbl>,
## #   Depth_Site <dbl>, and abbreviated variable names ¹​SiteCode, …

Transform data

  • Make symmetric and non-skewed (Yeo-Johnson transformation)
  • Flexible transformation that needs a parameter, which is usually chosen based on an optimization with the training data
  • Standardise (make mean zero, and standard deviation 1)
  • Before:
s_dat <- train_data %>%
  select(MeanTemp_CoralWatch, Depth_Site) %>%
  pivot_longer(cols = everything(), values_to = "Value", names_to = "Variable")
ggplot(s_dat, aes(Value)) +
  geom_histogram(bins = 30) +
  facet_wrap(vars(Variable), nrow = 2, scales = "free") +
  theme_minimal()


MeanTemp_yj <- car::powerTransform(train_data$MeanTemp_CoralWatch,
                                   family = "yjPower")
Depth_yj <- car::powerTransform(train_data$Depth_Site,
                                family = "yjPower")
MeanTemp_yj
## Estimated transformation parameter 
## train_data$MeanTemp_CoralWatch 
##                     -0.8426166
Depth_yj
## Estimated transformation parameter 
## train_data$Depth_Site 
##             0.1361994
train_data <- train_data %>%
  mutate(MeanTemp_tr = car::yjPower(MeanTemp_CoralWatch, MeanTemp_yj$roundlam),
         Depth_tr = car::yjPower(Depth_Site, Depth_yj$roundlam))

  • After:
s_dat <- train_data %>%
  select(MeanTemp_tr, Depth_tr) %>%
  pivot_longer(cols = everything(), values_to = "Value", names_to = "Variable")
ggplot(s_dat, aes(Value)) +
  geom_histogram(bins = 30) +
  facet_wrap(vars(Variable), nrow = 2, scales = "free") +
  theme_minimal()

What about the test data?

  • When we use the model to predict the occurrence on the test set, the data must be transformed in exactly the same way as the training set
  • This means we need to keep our lambda parameter fit on the training data and apply the same parameter to transform the test data
  • The standardisation also requires parameters from the training data
  • Keeping track of these transformations manually is dangerous
  • A common mistake is to transform the test data using parameters derived from the test data, this makes the test data incomparable to the training data

Recipes helps to solve this issue

  • A recipe is a set of steps applied to data before modelling
  • Everything is kept track of by tidymodels so that the steps are applied correctly to test data after the model is fit
  • Let’s make a recipe
RF_recipe <- recipe(train_data,
                    Presence ~ MeanTemp_CoralWatch + Depth_Site)
RF_recipe
## Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor          2

Add Steps to the recipe

RF_recipe <- recipe(train_data,
                    Presence ~ MeanTemp_CoralWatch + Depth_Site) %>%
  step_YeoJohnson(MeanTemp_CoralWatch, Depth_Site) %>%
  step_normalize(MeanTemp_CoralWatch, Depth_Site)
RF_recipe
## Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor          2
## 
## Operations:
## 
## Yeo-Johnson transformation on MeanTemp_CoralWatch, Depth_Site
## Centering and scaling for MeanTemp_CoralWatch, Depth_Site

Prep the recipe

  • Prepping a recipe estimates any required parameters
RF_recipe <- prep(RF_recipe)
RF_recipe
## Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor          2
## 
## Training data contained 308 data points and no missing data.
## 
## Operations:
## 
## Yeo-Johnson transformation on MeanTemp_CoralWatch, Depth_Site [trained]
## Centering and scaling for MeanTemp_CoralWatch, Depth_Site [trained]

Bake the recipe

  • Baking applies the steps to a dataset
train_data_tr <- bake(RF_recipe, train_data)
train_data_tr
## # A tibble: 308 × 3
##    MeanTemp_CoralWatch Depth_Site Presence
##                  <dbl>      <dbl> <fct>   
##  1             -0.761     -1.37   0       
##  2             -0.125     -2.22   0       
##  3             -0.0707     0.555  0       
##  4             -0.150     -1.34   0       
##  5             -0.480     -1.73   0       
##  6             -0.448     -0.645  0       
##  7             -0.300     -0.946  0       
##  8             -0.672      0.144  0       
##  9             -0.369     -0.196  0       
## 10             -0.159     -0.0482 0       
## # … with 298 more rows

The data now looks like:

s_dat <- train_data_tr %>% select(MeanTemp_CoralWatch, Depth_Site) %>% pivot_longer(cols = everything(), values_to = "Value", names_to = "Variable")
ggplot(s_dat, aes(Value)) + geom_histogram(bins = 30) + facet_wrap(vars(Variable), nrow = 2, scales = "free") + theme_minimal()

Bake the test data

  • We can now easily apply the exact same data manipulation to the test data
test_data_tr <- bake(RF_recipe, test_data)
test_data_tr
## # A tibble: 77 × 3
##    MeanTemp_CoralWatch Depth_Site Presence
##                  <dbl>      <dbl> <fct>   
##  1             -0.828    -2.00    1       
##  2             -0.315     0.00543 1       
##  3             -0.177    -0.412   0       
##  4             -0.171     0.495   1       
##  5             -0.0972   -2.00    1       
##  6             -0.185    -0.667   1       
##  7             -0.0980   -1.93    1       
##  8             -0.119     0.178   1       
##  9             -0.710    -3.14    1       
## 10             -0.637    -2.64    0       
## # … with 67 more rows

Combine a recipe and a model

RF_mod <-
  boost_tree() %>%
  set_engine('xgboost') %>%
  set_mode('classification')

RF_mod
## Boosted Tree Model Specification (classification)
## 
## Computational engine: xgboost

A workflow is a recipe and a model

RF_wf <- workflow() %>%
  add_recipe(RF_recipe) %>%
  add_model(RF_mod)

RF_wf
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: boost_tree()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 2 Recipe Steps
## 
## • step_YeoJohnson()
## • step_normalize()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Boosted Tree Model Specification (classification)
## 
## Computational engine: xgboost

Use fit to run a workflow

RF_fit <- RF_wf %>%
  fit(train_data)

RF_fit
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: boost_tree()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 2 Recipe Steps
## 
## • step_YeoJohnson()
## • step_normalize()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## ##### xgb.Booster
## raw: 25.1 Kb 
## call:
##   xgboost::xgb.train(params = list(eta = 0.3, max_depth = 6, gamma = 0, 
##     colsample_bytree = 1, colsample_bynode = 1, min_child_weight = 1, 
##     subsample = 1), data = x$data, nrounds = 15, watchlist = x$watchlist, 
##     verbose = 0, nthread = 1, objective = "binary:logistic")
## params (as set within xgb.train):
##   eta = "0.3", max_depth = "6", gamma = "0", colsample_bytree = "1", colsample_bynode = "1", min_child_weight = "1", subsample = "1", nthread = "1", objective = "binary:logistic", validate_parameters = "TRUE"
## xgb.attributes:
##   niter
## callbacks:
##   cb.evaluation.log()
## # of features: 2 
## niter: 15
## nfeatures : 2 
## evaluation_log:
##     iter training_logloss
##        1        0.5244973
##        2        0.4279284
## ---                      
##       14        0.1878803
##       15        0.1829834

Make predictions

  • Predict the test data
  • The workflow knows to apply the recipe to the test data first
predict(RF_fit, test_data)
## # A tibble: 77 × 1
##    .pred_class
##    <fct>      
##  1 0          
##  2 1          
##  3 1          
##  4 1          
##  5 1          
##  6 1          
##  7 1          
##  8 1          
##  9 0          
## 10 0          
## # … with 67 more rows
predict(RF_fit, test_data, type = "prob")
## # A tibble: 77 × 2
##    .pred_0 .pred_1
##      <dbl>   <dbl>
##  1  0.916   0.0842
##  2  0.110   0.890 
##  3  0.118   0.882 
##  4  0.0654  0.935 
##  5  0.224   0.776 
##  6  0.0807  0.919 
##  7  0.224   0.776 
##  8  0.136   0.864 
##  9  0.879   0.121 
## 10  0.767   0.233 
## # … with 67 more rows

Augment data with predictions

RF_test_preds <- augment(RF_fit, test_data)

At last we test the fit based on test data

RF_test_preds %>% 
  roc_curve(.pred_0, truth = Presence) %>% 
  autoplot()


RF_test_preds %>% 
  roc_auc(.pred_0, truth = Presence)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.911

Plot predictions and original data

pred_dat <- cross_df(list(MeanTemp_CoralWatch = seq(min(fish_dat$MeanTemp_CoralWatch),
                                                    max(fish_dat$MeanTemp_CoralWatch),
                                                    length.out = 100),
                            Depth_Site = seq(min(fish_dat$Depth_Site),
                                             max(fish_dat$Depth_Site),
                                             length.out = 100)))

newdat <- augment(RF_fit, pred_dat)

ggplot(newdat, aes(MeanTemp_CoralWatch, Depth_Site)) +
  geom_contour_filled(aes(z = .pred_1)) +
  geom_point(aes(size = Presence), data = fish_dat, alpha = 0.6) +
  theme_minimal()
## Warning: Using size for a discrete variable is not advised.

Pine Rocklands Plant Species

data("IRC")
IRC
## # A tibble: 476,255 × 7
##    ScientificName         area_name            Occur…¹ Intro…² size   long   lat
##    <chr>                  <chr>                <chr>   <chr>   <chr> <dbl> <dbl>
##  1 Abelmoschus esculentus A.D. Doug Barnes Pa… Absent  Not in… 60.4… -80.3  25.7
##  2 Abelmoschus esculentus Alice C. Wainwright… Absent  Not in… 21.5… -80.2  25.7
##  3 Abelmoschus esculentus Allapattah Flats Wi… Absent  Not in… 2094… -80.5  27.2
##  4 Abelmoschus esculentus Amberjack Slough     Absent  Not in… 182 … -82.3  26.9
##  5 Abelmoschus esculentus Arch Creek Park      Absent  Not in… 10.3… -80.2  25.9
##  6 Abelmoschus esculentus Arthur R. Marshall … Absent  Not in… 1484… -80.2  26.5
##  7 Abelmoschus esculentus Bahia Honda State P… Absent  Not in… 491.… -81.3  24.7
##  8 Abelmoschus esculentus Bartlett Estate      Absent  Not in… 35 a… -80.1  26.1
##  9 Abelmoschus esculentus Beachwalk Pasley Pa… Absent  Not in… 17 a… -80.2  27.2
## 10 Abelmoschus esculentus Big and Little Geor… Absent  Not in… 18.2… -80.4  25.6
## # … with 476,245 more rows, and abbreviated variable names ¹​Occurrence,
## #   ²​Introduced

Summary

n_distinct(IRC$area_name)
## [1] 205
n_distinct(IRC$ScientificName)
## [1] 2323

Species Summary

IRC_summ <- IRC %>%
  filter(Occurrence == "Present") %>%
  group_by(ScientificName) %>%
  summarise(num_areas = n_distinct(area_name)) %>%
  arrange(desc(num_areas)) %>%
  mutate(spec_num = 1:n())

ggplot(IRC_summ, aes(spec_num, num_areas)) +
  geom_col() +
  theme_minimal()