Non-raster Data: Using Predictor Importance, Partial Dependence, and Predictive Performance Metrics

Tom Auer, Daniel Fink

2021-09-14

Outline

  1. Introduction
  2. Data Structure
  3. PI and PD
  4. Selecting Region and Season
  5. Plot Predictor Importance
  6. Plotting Partial Dependence
  7. Predictive Performance Metrics

Introduction

Beyond estimates of occurrence and relative abundance, the eBird Status and Trends products contain estimates of predictor importance (PI), partial dependence (PD), and predictive performance metrics (PPMs). The PPMs can be used to evaluate the statistical performance of the models, either over the entire spatiotemporal extent of the model results or for a specific region and season. PIs and PDs can be used to understand relationships between the occurrence or count response and the predictors, most notably the land cover variables used in the model. The functions described in this section help load and analyse these data. More details about the predictive performance metrics, including how they are calculated and interpreted, can be found in Fink et al. (2019).

Data Structure

IMPORTANT. AFTER DOWNLOADING THE RESULTS, DO NOT CHANGE THE FILE STRUCTURE. All functionality in this package relies on the structure inherent in the delivered results. Changing the folder and file structure will cause errors with this package.

The non-raster data are stored in two SQLite databases. The ebirdst package provides functions for accessing these, so you should never have to handle them manually.

PI and PD

We’ll start by loading the PI and PD data from the example data package for Yellow-bellied Sapsucker in Michigan.

library(ebirdst)
library(dplyr)

# Because the non-raster data is large, there is a parameter on the
# ebirdst_download function that defaults to downloading only the raster data.
# To access the non-raster data, set tifs_only = FALSE.
sp_path <- ebirdst_download(species = "example_data", tifs_only = FALSE)

# predictor importance
pis <- load_pis(sp_path)
glimpse(pis)

# partial dependence
pds <- load_pds(sp_path)
glimpse(pds)

Notice that the data in both of these data frames are provided for each stixel, identified by a stixel_id.

Selecting Region and Season

When working with Predictive Performance Metrics (PPMs), PIs, or PDs, it is very common to select a subset of space and time for analysis. In ebirdst this is done by creating a spatiotemporal extent object with ebirdst_extent(). These objects define the region and season for analysis and are passed to many functions in ebirdst.

# define a spatiotemporal extent
lp_extent <- ebirdst_extent(c(xmin = -86, xmax = -83, ymin = 42, ymax = 45),
                            t = c(0.425, 0.475))
print(lp_extent)

# subset to this extent
pis_ss <- ebirdst_subset(pis, ext = lp_extent)
nrow(pis)
nrow(pis_ss)
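Conceptually, subsetting keeps only the rows whose stixel falls inside the bounding box and the date range. The following is a minimal base R sketch of that idea on a made-up data frame. The column names lon, lat, and date_frac are hypothetical and may not match the columns returned by load_pis(), and this is not a description of how ebirdst_subset() is implemented.

```r
# toy stixel table; column names and values are hypothetical
stixels <- data.frame(
  stixel_id = c("a", "b", "c", "d"),
  lon = c(-85.0, -84.2, -90.1, -83.5),
  lat = c(43.0, 44.5, 43.2, 41.0),
  date_frac = c(0.43, 0.60, 0.45, 0.46)  # fraction of the year, 0-1
)

# bounding box and season with the same values as lp_extent
bbox <- c(xmin = -86, xmax = -83, ymin = 42, ymax = 45)
t_range <- c(0.425, 0.475)

# rows inside the bounding box
in_space <- stixels$lon >= bbox[["xmin"]] & stixels$lon <= bbox[["xmax"]] &
  stixels$lat >= bbox[["ymin"]] & stixels$lat <= bbox[["ymax"]]
# rows inside the date range
in_time <- stixels$date_frac >= t_range[1] & stixels$date_frac <= t_range[2]

stixels_ss <- stixels[in_space & in_time, ]
print(stixels_ss)
```

Here only one toy stixel satisfies both the spatial and the temporal condition; the others fail on longitude, latitude, or date.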

To understand which stixels contribute to estimates within a given spatiotemporal extent, stixel_footprint() generates a RasterLayer depicting where the majority of the information comes from within that extent. The map ranges from 0 to 1, with a value of 1 meaning that 100% of the selected stixels contribute information at that pixel. Calling plot() on the output of this function maps the stixel footprint as well as the centroids of all the stixels.

par(mfrow = c(1, 1), mar = c(0, 0, 0, 6))
footprint <- stixel_footprint(sp_path, ext = lp_extent)
plot(footprint)
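The idea behind the footprint can be illustrated without the package: for each pixel, compute the proportion of selected stixels that cover it. The sketch below assumes stixels are simple rectangles with made-up bounds and uses a coarse grid; it is an illustration of the concept only, not the algorithm used by stixel_footprint().

```r
# toy stixels as rectangles; bounds are made up for illustration
stx <- data.frame(
  xmin = c(-86.0, -85.0, -84.5),
  xmax = c(-84.0, -83.0, -83.0),
  ymin = c(42.0, 42.5, 43.0),
  ymax = c(44.0, 44.5, 45.0)
)

# cell centers of a coarse 0.5 degree grid over the extent
xs <- seq(-85.75, -83.25, by = 0.5)
ys <- seq(42.25, 44.75, by = 0.5)
grid <- expand.grid(x = xs, y = ys)

# for one cell center, the proportion of stixels whose rectangle covers it
covers <- function(x, y) {
  mean(x >= stx$xmin & x <= stx$xmax & y >= stx$ymin & y <= stx$ymax)
}
grid$footprint <- mapply(covers, grid$x, grid$y)

# expand.grid varies x fastest, so this matrix has rows = x, columns = y
fp <- matrix(grid$footprint, nrow = length(xs), ncol = length(ys))
image(xs, ys, fp, xlab = "Longitude", ylab = "Latitude")
```

Cells where all three toy rectangles overlap get a value of 1, while cells covered by none get 0.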

Plot Predictor Importance

The plot_pis() function generates a bar plot showing a rank of the most important predictors within a spatiotemporal subset. There is an option to show all predictors or to aggregate FRAGSTATS by the land cover types.

# with all classes
plot_pis(pis, ext = lp_extent, by_cover_class = FALSE, n_top_pred = 15)

# aggregating fragstats for cover classes
plot_pis(pis, ext = lp_extent, by_cover_class = TRUE, n_top_pred = 15)
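To make the ranking and the cover class aggregation concrete, here is a base R sketch on a made-up table of per-stixel importances. The predictor names, the "_pland" and "_ed" suffixes, and the choice to sum within a stixel and then average across stixels are all assumptions for illustration; plot_pis() may aggregate differently.

```r
# toy per-stixel predictor importance; names and values are made up
pi_toy <- data.frame(
  stixel_id = rep(c("s1", "s2", "s3"), each = 4),
  predictor = rep(c("forest_pland", "forest_ed", "water_pland", "elevation"),
                  times = 3),
  importance = c(0.30, 0.10, 0.05, 0.20,
                 0.25, 0.15, 0.10, 0.15,
                 0.35, 0.05, 0.05, 0.25)
)

# collapse FRAGSTATS-style metrics to their cover class
# assumption: the metric is encoded as a "_pland" or "_ed" suffix
pi_toy$cover_class <- sub("_(pland|ed)$", "", pi_toy$predictor)

# sum within each stixel and class, then average across stixels
by_stixel <- aggregate(importance ~ stixel_id + cover_class,
                       data = pi_toy, FUN = sum)
ranked <- aggregate(importance ~ cover_class, data = by_stixel, FUN = mean)
ranked <- ranked[order(ranked$importance, decreasing = TRUE), ]
print(ranked)

# horizontal bar plot with the most important class at the top
barplot(rev(ranked$importance), names.arg = rev(ranked$cover_class),
        horiz = TRUE, las = 1, xlab = "Mean importance")
```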

Plotting Partial Dependence

Smoothed partial dependence curves for a given predictor can be plotted using plot_pds(). Confidence intervals are estimated through a process of subsampling and bootstrapping. The function plots the smoothed curve and CIs and also returns the underlying data. For example, let’s look at PD curves for checklist start time, expressed as the difference from solar noon, and for the percentage of deciduous broadleaf forest cover. For the full list of predictors, see the data frame ebirdst_predictors.

# in the interest of speed, run with 5 bootstrap iterations
# in practice, best to run with the default number of iterations (100)
pd_smooth <- plot_pds(pds, "solar_noon_diff", ext = lp_extent, n_bs = 5)
dplyr::glimpse(pd_smooth)

# deciduous broadleaf forest
plot_pds(pds, "mcd12q1_lccs1_fs_c14_1500_pland", ext = lp_extent, n_bs = 5)
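The general pattern of bootstrapping a smoothed curve can be sketched in base R. The example below simulates noisy per-stixel response curves, resamples stixels with replacement, smooths each replicate with loess(), and takes pointwise quantiles as an interval. The data, the use of loess as the smoother, and the 10% and 90% quantiles are assumptions chosen for illustration, not the settings used internally by plot_pds().

```r
set.seed(1)

# toy PD data: 30 stixels, each with a noisy response over one predictor
# (simulated; not the structure returned by load_pds())
x_grid <- seq(0, 1, length.out = 25)
n_stixel <- 30
pd_toy <- do.call(rbind, lapply(seq_len(n_stixel), function(i) {
  data.frame(stixel = i, x = x_grid,
             y = sin(pi * x_grid) + rnorm(length(x_grid), sd = 0.3))
}))

# one bootstrap replicate: resample stixels with replacement, then smooth
smooth_once <- function() {
  ids <- sample(n_stixel, replace = TRUE)
  boot <- do.call(rbind, lapply(ids, function(i) pd_toy[pd_toy$stixel == i, ]))
  fit <- loess(y ~ x, data = boot, span = 0.75)
  predict(fit, newdata = data.frame(x = x_grid))
}

n_bs <- 20
curves <- replicate(n_bs, smooth_once())  # matrix: 25 rows x n_bs columns

# pointwise median and interval across bootstrap replicates
pd_smooth_toy <- data.frame(
  x = x_grid,
  median = apply(curves, 1, median),
  lower = apply(curves, 1, quantile, probs = 0.10),
  upper = apply(curves, 1, quantile, probs = 0.90)
)

plot(pd_smooth_toy$x, pd_smooth_toy$median, type = "l",
     ylim = range(c(pd_smooth_toy$lower, pd_smooth_toy$upper)),
     xlab = "Predictor value", ylab = "Smoothed response")
lines(pd_smooth_toy$x, pd_smooth_toy$lower, lty = 2)
lines(pd_smooth_toy$x, pd_smooth_toy$upper, lty = 2)
```

With few bootstrap iterations the interval is itself noisy, which is why a larger number of iterations is preferable in practice.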

Predictive Performance Metrics

Beyond the confidence intervals provided for the abundance estimates, the centroid data can also be used to calculate predictive performance metrics, which indicate whether the models perform well enough statistically to support interpreting the PIs and PDs (as well as the abundance and occurrence estimates). Three types of PPMs are calculated: binary, occurrence probability, and count.

Both the occurrence and count PPMs are within-range metrics, meaning the comparison between observations and predictions is only made within the range where the species occurs.

The function ebirdst_ppms() calculates a suite of PPMs for a given spatiotemporal extent. The results can then be plotted with plot().

ppms <- ebirdst_ppms(sp_path, ext = lp_extent)
plot(ppms)
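Two of the metrics that appear later in this section, kappa and AUC, are standard statistics that can be computed by hand. The sketch below does so in base R on made-up observations and predictions. The 0.5 threshold is arbitrary, and this says nothing about how ebirdst_ppms() selects test data or thresholds.

```r
# toy held-out data; values are made up
obs <- c(1, 1, 1, 0, 0, 0, 1, 0, 0, 1)  # observed detection (1) or not (0)
pred <- c(0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.7, 0.1, 0.5, 0.3)  # predicted probability

# Cohen's kappa: agreement beyond what is expected by chance
kappa_stat <- function(obs, pred_bin) {
  po <- mean(obs == pred_bin)
  pe <- mean(obs) * mean(pred_bin) + (1 - mean(obs)) * (1 - mean(pred_bin))
  (po - pe) / (1 - pe)
}

# AUC via the rank-sum (Mann-Whitney) formulation; ties get average ranks
auc_stat <- function(obs, pred) {
  r <- rank(pred)
  n1 <- sum(obs == 1)
  n0 <- sum(obs == 0)
  (sum(r[obs == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# binary predictions at an arbitrary threshold of 0.5
pred_bin <- as.numeric(pred > 0.5)

kappa_val <- kappa_stat(obs, pred_bin)
auc_val <- auc_stat(obs, pred)
print(c(kappa = kappa_val, auc = auc_val))
```

Kappa depends on the chosen threshold, whereas AUC summarises ranking performance across all thresholds.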

ebirdst_ppms_ts() can be used to get a time series of PPMs, calculating the full suite either at a weekly or monthly resolution. plot() can be used to visualise these PPM time series for a given PPM.

ppms_monthly <- ebirdst_ppms_ts(sp_path, ext = lp_extent, summarize_by = "months")
# plot binary kappa
plot(ppms_monthly, type = "binary", metric = "kappa")
# plot occurrence probability auc
plot(ppms_monthly, type = "occurrence", metric = "auc")
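A metric time series amounts to splitting the test data by time period and computing the metric within each period. As a rough base R sketch on simulated data, the example below computes monthly sensitivity (the proportion of detections that were predicted as present). The data frame, its column names, and the 80% agreement rate are invented for illustration and are unrelated to the output of ebirdst_ppms_ts().

```r
set.seed(42)

# simulated test data: a date, an observation, and a binary prediction
n <- 600
toy <- data.frame(
  date = as.Date("2020-01-01") + sample(0:365, n, replace = TRUE),
  obs = rbinom(n, 1, 0.4)
)
# predictions agree with the observations about 80% of the time
agree <- rbinom(n, 1, 0.8) == 1
toy$pred_bin <- ifelse(agree, toy$obs, 1 - toy$obs)

# month of each record, keeping all 12 levels in calendar order
toy$month <- factor(format(toy$date, "%m"), levels = sprintf("%02d", 1:12))

# sensitivity: proportion of observed detections predicted as present
sensitivity <- function(d) {
  sum(d$obs == 1 & d$pred_bin == 1) / sum(d$obs == 1)
}
sens_monthly <- sapply(split(toy, toy$month), sensitivity)

plot(1:12, sens_monthly, type = "b", ylim = c(0, 1),
     xlab = "Month", ylab = "Sensitivity")
```

Summarising by week instead of month follows the same pattern with a finer grouping variable, at the cost of fewer test observations per period.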

References

Fink, D., T. Damoulas, & J. Dave. 2013. Adaptive Spatio-Temporal Exploratory Models: Hemisphere-wide species distributions from massively crowdsourced eBird data. In Twenty-Seventh AAAI Conference on Artificial Intelligence.

Fink, D., W.M. Hochachka, B. Zuckerberg, D.W. Winkler, B. Shaby, M.A. Munson, G. Hooker, M. Riedewald, D. Sheldon, & S. Kelling. 2010. Spatiotemporal exploratory models for broad‐scale survey data. Ecological Applications, 20(8), 2131-2147.

Fink, D., T. Auer, A. Johnston, V. Ruiz‐Gutierrez, W.M. Hochachka, & S. Kelling. 2019. Modeling avian full annual cycle distribution and population trends with citizen science data. Ecological Applications, 00(00):e02056. https://doi.org/10.1002/eap.2056