Evaluating Index Performance and Setting Benchmarks

Evaluating Index Performance During Index Development

Methods for establishing benchmarks for assessment indices generally rely on the use of reference sites to estimate the range of variability in index values expected to occur under reference (natural or desired) conditions (Stoddard et al. 2006), and percentiles of this reference-site distribution are typically used to define threshold values. Therefore, errors in estimating the true variability in index values under reference conditions can reduce the precision of threshold values and the sensitivity of an index. Conceptually, the sensitivity of an index can be thought of as a signal-to-noise (S/N) ratio, where the S/N ratio is simply the magnitude of response by the index to human-caused disturbance relative to unexplained variation in index values. Unexplained variation in index values is associated with a combination of sampling and prediction error.
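As a concrete illustration, the S/N ratio described above can be computed as the difference between mean index scores at reference and disturbed sites divided by the standard deviation among reference-site scores. A minimal sketch, using entirely hypothetical index scores:

```python
import numpy as np

# Hypothetical index scores for reference and disturbed sites
# (illustrative values only, not real data)
reference_scores = np.array([0.95, 1.02, 0.98, 1.05, 0.91, 1.00, 0.97, 1.03])
disturbed_scores = np.array([0.62, 0.70, 0.55, 0.66, 0.74, 0.58])

# Signal: magnitude of the index's response to human-caused disturbance
signal = reference_scores.mean() - disturbed_scores.mean()

# Noise: unexplained variation among reference-site scores
# (a combination of sampling and prediction error)
noise = reference_scores.std(ddof=1)

sn_ratio = signal / noise
print(f"S/N ratio = {sn_ratio:.2f}")
```

A large S/N ratio indicates that the index's response to disturbance is easy to distinguish from background variability among reference sites.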

No model is completely precise or accurate, so errors in prediction need to be quantified and incorporated into assessments. The most direct and robust way of evaluating model error is to apply the model to samples collected from an independent set of reference sites, i.e., sites that were not used in construction of the model.

Additionally, model error can be visualized by plotting predicted values on the x axis and observed values on the y axis (Piñeiro et al. 2008). The slope of this relationship should be 1 (the model is generally accurate), and the scatter of the points around the regression line should be small (the model is precise). Models with high r2 values are generally more precise than models with low values, and in our experience, reasonably good models have r2 values between 0.5 and 0.75.
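This check can be sketched numerically: regress observed values on predicted values and inspect the slope and r2. The data below are simulated for illustration only, with predicted values perturbed by sampling noise:

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated example: predicted reference-site values, and observed values
# equal to the predictions plus sampling error (illustrative, not real data)
predicted = rng.uniform(10, 40, size=50)          # e.g., predicted taxa richness
observed = predicted + rng.normal(0, 3, size=50)  # observed = predicted + error

# Regress observed (y) on predicted (x); a generally accurate model has slope ~1
slope, intercept = np.polyfit(predicted, observed, 1)

# r2 quantifies scatter around the regression line (precision)
r = np.corrcoef(predicted, observed)[0, 1]
print(f"slope = {slope:.2f}, r2 = {r**2:.2f}")
```

With a real model, systematic departure of the slope from 1 suggests bias, while a low r2 suggests that the model leaves much natural variation unexplained.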

However, r2 values also depend on the range of values observed among samples, so it is possible for two different models to have different r2 values but the same amount of scatter around a single expected (E) value. Examining the frequency distribution of reference-site O/E values allows a different evaluation of model precision and avoids the dependency of r2 on the range of values. The average of reference-site observed/predicted (O/E) values should be 1 (accurate model), and the standard deviation of O/E values quantifies precision. Models with O/E standard deviations of ~0.10 approach the precision possible given typical sampling error for biological indices, whereas models with standard deviations much greater than ~0.2 are probably not accounting for a significant amount of natural variation among samples (Van Sickle et al. 2004).
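The corresponding calculation is simple: compute the mean and standard deviation of reference-site O/E scores and compare them against the benchmarks above. A sketch with hypothetical O/E scores:

```python
import numpy as np

# Hypothetical O/E scores for an independent set of reference sites
# (illustrative values only)
oe_scores = np.array([0.98, 1.05, 0.92, 1.10, 0.95, 1.02, 0.88, 1.07,
                      0.99, 1.01, 0.94, 1.06, 0.97, 1.03, 0.91, 1.08])

mean_oe = oe_scores.mean()     # should be ~1.0 for an accurate model
sd_oe = oe_scores.std(ddof=1)  # ~0.10 approaches the best achievable precision

print(f"mean O/E = {mean_oe:.2f}, SD = {sd_oe:.2f}")
```

A mean near 1.0 indicates an unbiased (accurate) model; an SD near 0.10 approaches the precision limit set by sampling error, while an SD well above 0.2 suggests unmodeled natural variation.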

Additionally, index performance can be measured and compared among indices using the following metrics from Hawkins et al. (2010), where an index score is the observed value divided by the predicted value from the model:

  • Precision
    • the standard deviation of reference site scores. 
  • Accuracy
    • the departure of mean observed reference-site values from mean predicted reference-site values (i.e., departure of mean reference index scores from 1.0)
    • the amount of variation in predicted reference site values that are still associated with naturally occurring environmental variables when submitted to a second round of random forest modeling
  • Responsiveness
    • the t-value derived from a t-test comparing reference and degraded sites
    • the slope of the regression of index values on a gradient of disturbance
  • Sensitivity
    • the percent of physicochemically degraded sites inferred to be in non-reference condition. Percentiles of reference-site values can be used to infer whether a site is in non-reference condition.
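The four performance measures above can be sketched in a few lines. The scores below are simulated, and the 10th percentile of the reference distribution is used here only as an example threshold for the sensitivity calculation:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical O/E index scores (illustrative only)
reference = rng.normal(1.0, 0.12, size=40)  # reference sites center on 1.0
degraded = rng.normal(0.65, 0.15, size=30)  # degraded sites score lower

# Precision: the standard deviation of reference-site scores
precision = reference.std(ddof=1)

# Accuracy: departure of mean reference index scores from 1.0
accuracy = reference.mean() - 1.0

# Responsiveness: t-value from a t-test comparing reference and degraded sites
t_value, p_value = stats.ttest_ind(reference, degraded, equal_var=False)

# Sensitivity: percent of degraded sites falling below a percentile of the
# reference distribution (10th percentile used as an example threshold)
benchmark = np.percentile(reference, 10)
sensitivity = 100 * (degraded < benchmark).mean()

print(f"precision={precision:.2f}, accuracy={accuracy:+.2f}, "
      f"t={t_value:.1f}, sensitivity={sensitivity:.0f}%")
```

Responsiveness can equivalently be assessed as the slope of index values regressed on a disturbance gradient, as noted above.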

Avoiding the problem of extrapolation

A potential problem in the use of empirical models is their application to inappropriate situations. This problem can arise if we wish to assess the condition of a site that is physically or geographically dissimilar to the reference sites used for model construction. For example, if data from only small streams were used to build a model, it would be dangerous, and almost certainly inappropriate, to apply the model to larger streams, i.e., to extrapolate beyond the experience of the model. NAMC's models all include a statistical test that guards against such inappropriate extrapolation. This test determines whether the values of the predictor variables measured at the sites being assessed are within a statistically acceptable range of the values measured at the reference sites. If a test site has values that fall outside that range, the program will flag the site but will still calculate an index value for it, which gives the user final say over whether the extrapolation is valid.
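The text does not specify how NAMC's test is implemented; one common way to build such a guard is to compare the Mahalanobis distance of a test site's predictor values (from the reference-site centroid) against a chi-square cutoff. A sketch under that assumption, with hypothetical predictor data and a made-up `outside_experience` helper:

```python
import numpy as np
from scipy import stats

def outside_experience(test_x, reference_X, alpha=0.01):
    """Flag a test site whose predictor values fall outside the experience of
    the reference sites, using Mahalanobis distance against a chi-square
    cutoff (one common implementation of such a guard, not NAMC's actual code)."""
    mean = reference_X.mean(axis=0)
    cov = np.cov(reference_X, rowvar=False)
    diff = test_x - mean
    d2 = diff @ np.linalg.inv(cov) @ diff
    cutoff = stats.chi2.ppf(1 - alpha, df=reference_X.shape[1])
    return d2 > cutoff

rng = np.random.default_rng(1)
# Hypothetical predictors (e.g., log watershed area, elevation) at 60 reference sites
ref = rng.normal([2.0, 1500.0], [0.4, 300.0], size=(60, 2))

print(outside_experience(np.array([2.1, 1450.0]), ref))  # within experience
print(outside_experience(np.array([5.0, 4000.0]), ref))  # far outside
```

As in the text, a flagged site would still receive an index value; the flag simply warns the user that the prediction is an extrapolation.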

Questions for practitioners to consider when setting and applying benchmarks

  • Are the reference data used representative of the environmental gradients of your specific site(s)?
  • How much remaining natural variance is there among reference sites? 
    • Site-specific natural conditions using models – model r2
    • Percentiles of reference sites within a given ecoregion – standard deviation of the reference distribution
  • How protective should you be based on your monitoring application?
    • Objective: identify potential problems = favor overprotection
    • Objective: permit renewal = need an equitable balance of over- and under-protection
    • The less extreme the percentile, the more protective the benchmark
    • The 10th or 90th percentile represents a balance between over- and under-protection
  • What degree of departure from reference is allowable while still maintaining ecosystem structure and function?
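To see how the choice of percentile shifts a benchmark, the sketch below computes several candidate benchmarks from simulated reference-site scores; less extreme percentiles sit closer to the center of the reference distribution and are therefore more protective:

```python
import numpy as np

rng = np.random.default_rng(7)
# Hypothetical reference-site O/E scores (illustrative only)
reference = rng.normal(1.0, 0.10, size=100)

# Less extreme percentiles yield more protective benchmarks: a test site
# must score higher to be judged in reference condition.
for p in (1, 5, 10, 25):
    print(f"{p}th percentile benchmark: {np.percentile(reference, p):.2f}")
```

A 1st-percentile benchmark tolerates more departure before flagging a site (favoring under-protection), while a 25th-percentile benchmark flags sites sooner (favoring over-protection); the 10th percentile is a common compromise.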

Literature Cited

  • Hawkins, C. P., Y. Cao, and B. Roper. 2010. Method of predicting reference condition biota affects the performance and interpretation of ecological indices. Freshwater Biology 55:1066–1085.
  • Piñeiro, G., S. Perelman, J. P. Guerschman, and J. M. Paruelo. 2008. How to evaluate models: Observed vs. predicted or predicted vs. observed? Ecological Modelling 216:316–322.
  • Stoddard, J. L., D. P. Larsen, C. P. Hawkins, R. K. Johnson, and R. H. Norris. 2006. Setting expectations for the ecological condition of streams: The concept of reference condition. Ecological Applications 16:1267–1276.