#### BLM/USU Buglab

#### USEPA Bioassessment/ Biocriteria

#### USEPA Aquatic Resource Monitoring

#### USGS NAWQA

#### PACFISH/INFISH

#### USEPA Causal Analysis

#### NW Biological Assessment Workgroup

#### Central Plains Center for Bioassessment

#### Midwest Biodiversity Institute

#### California Aquatic Bioassessment Lab

#### North American Benthological Society

#### Xerces Society

#### RIVPACS

#### AUSRIVAS

#### European Environment Agency

#### EU BioFresh

#### EU Freshwater Ecology

#### EU WFD Research

#### EU WISER

#### EU AQEM

#### EU STAR

# Primer on Predictive Models

If E is defined as the number of native taxa expected in a sample and O is the number of those taxa observed in that sample, then the ratio O/E is an easily understood and ecologically meaningful measure of the biological condition at a site. Values can theoretically vary from 1 (equivalent to reference condition) to zero (completely degraded, i.e., all expected taxa are missing). Note that O/E is not based on raw taxa richness. Instead, O is constrained to include only those taxa predicted to naturally occur at a site. This point is important because many of the biological changes that occur in response to pollution or habitat alteration involve taxa replacements, i.e., new taxa that are tolerant to new environmental conditions replace taxa that cannot tolerate the new conditions.

Although conceptually simple, measuring O/E is complicated by the fact that we never conduct a complete census at any site (we take samples instead). In general, the number of taxa that we collect at a site will increase with increasing sampling effort (Fig. 1), and our collections will therefore always contain a subset of the taxa that actually occur at a site. Furthermore, random sampling error ensures that replicate collections will seldom be identical in either the number of individuals or the specific taxa collected. For any single sample, we are more likely to collect an abundant taxon than a rare one. We therefore need a way of estimating E given that each taxon at a site has a different probability of being captured and that the mix of taxa collected in replicate samples varies somewhat.

We provide a hypothetical example to illustrate how the probability of detecting a taxon varies across taxa, and how we can use these probabilities to estimate E, the most likely number of taxa that we expect to observe in a single sample (Table 1). If we collected 10 replicate samples at a site, very few taxa, if any, would occur in all replicate samples; some taxa would occur in a few replicate samples; and many taxa would occur in just one or two samples. The most likely number of taxa expected in a new, single sample (E) is simply the mean of the number of taxa observed across all replicate samples. However, we can also estimate E as the sum of the individual probabilities of collecting taxa in a single sample. Predictive models use this relationship as the basis for estimating E from single samples and thus measuring O/E, the biological condition at a site (Table 2). The trick is to estimate the probabilities of collecting different taxa at different places.
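The equivalence between the two ways of estimating E can be checked with a small sketch. The taxon names and the four replicate samples below are invented for illustration (the text's hypothetical example uses 10 replicates); the function name `detection_probabilities` is likewise an assumption, not part of any RIVPACS software.

```python
# Hypothetical presence data: each set lists the taxa found in one replicate sample.
replicates = [
    {"Baetis", "Epeorus", "Hydropsyche"},
    {"Baetis", "Hydropsyche", "Simulium"},
    {"Baetis", "Epeorus"},
    {"Baetis", "Hydropsyche", "Epeorus", "Optioservus"},
]


def detection_probabilities(samples):
    """Return the fraction of replicate samples in which each taxon was collected."""
    n = len(samples)
    taxa = set().union(*samples)
    return {taxon: sum(taxon in s for s in samples) / n for taxon in sorted(taxa)}


probs = detection_probabilities(replicates)

# Estimate 1: mean number of taxa per replicate sample.
mean_richness = sum(len(s) for s in replicates) / len(replicates)

# Estimate 2: sum of the per-taxon probabilities of detection.
expected_from_probs = sum(probs.values())

print(probs)
print(mean_richness, expected_from_probs)
```

Both estimates agree because each occurrence of a taxon in a replicate contributes 1/n to that taxon's probability and 1/n to the mean richness.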

A little thought should quickly reveal that E can be calculated for any desired probability of detection threshold (e.g., 0, or 0.1, or 0.5, or 0.7). As the threshold is increased, the value of E drops. Although on initial reflection a threshold value of zero seems to make the most biological sense, higher thresholds often work better for assessment purposes. The reasons are that rare taxa (i.e., those with low probabilities of detection) cannot be modeled as well as more common taxa, and errors in their prediction lead to errors in E. Errors in predicting E create imprecise models, which have a lower ability to detect impairment when it actually exists (see discussion regarding statistical inferences below). In general, we and others have found that models based on intermediate probability of detection thresholds are usually both more precise and more sensitive in detecting effects of stressors (Simpson and Norris 2000, Hawkins et al. 2000). For that reason, all of our models report O/E values based on two probability of detection thresholds: zero and >= 0.5.
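The effect of the threshold can be illustrated with a minimal sketch. The probabilities, taxon names, and the function `observed_over_expected` are hypothetical; the sketch assumes that a threshold of zero means "all taxa with a nonzero predicted probability" and that taxa not predicted to occur are ignored when counting O.

```python
def observed_over_expected(probs, observed_taxa, threshold=0.0):
    """Return (O, E, O/E) using only taxa whose predicted probability meets the threshold."""
    # Taxa with zero probability are never expected, so they are always excluded.
    kept = {t: p for t, p in probs.items() if p > 0 and p >= threshold}
    expected = sum(kept.values())
    if expected == 0:
        raise ValueError("no taxa meet the probability threshold")
    # O counts only the expected taxa that were actually collected.
    observed = sum(1 for taxon in kept if taxon in observed_taxa)
    return observed, expected, observed / expected


# Hypothetical predicted probabilities of detection for one site.
site_probs = {"Baetis": 0.9, "Epeorus": 0.7, "Hydropsyche": 0.6,
              "Simulium": 0.3, "Optioservus": 0.1}
# Taxa collected in the sample; Chironomus was not predicted, so it does not count toward O.
collected = {"Baetis", "Hydropsyche", "Chironomus"}

for cutoff in (0.0, 0.5):
    o, e, oe = observed_over_expected(site_probs, collected, cutoff)
    print(f"threshold {cutoff}: O={o}, E={e:.2f}, O/E={oe:.2f}")
```

Raising the threshold from 0 to 0.5 drops the two rarest taxa from the calculation, so E falls from about 2.6 to about 2.2 in this example.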

The creation of RIVPACS-type models consists of two primary steps: (1) classification of sites based on their biological similarity to one another and (2) development of an empirical model to predict the class membership of new sites from environmental attributes.

**Classification.** Site classification results in biologically similar sites being grouped together into quasi-distinct classes that represent different ‘types’ of sites. Only data from reference sites are used to create the classification, because we need to estimate the likelihood of detecting different taxa under naturally occurring conditions. Classification (grouping of sites) allows us to calculate frequencies of occurrence of different taxa within classes, which is the first step in estimating taxon-specific probabilities of detection in RIVPACS-type models (see below). The classification step consists of first calculating the compositional similarity among all pairs of samples (Tables 3a and 3b). We have found that the Bray-Curtis index gives consistently good results. After calculating pair-wise similarities, a clustering algorithm is used to create a dendrogram that graphically displays the degree of biotic similarity between sites and groups of sites (Fig. 2). This dendrogram is then ‘cut’ to identify clusters of biologically similar sites, i.e., classes. Theoretically, the number of classes created can vary from 1 (all sites in a single class) to N (each site in its own class), but in our experience the best models result when dendrograms are cut to maximize mean similarity within classes (i.e., more classes), given the constraint that classes should consist of 10 or more sites.
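The two classification steps can be sketched as follows. The Bray-Curtis similarity is the standard formula; the clustering routine is an assumption, since the text does not name a particular algorithm: it uses simple average-linkage agglomeration and stops at a requested number of classes rather than drawing a dendrogram. Site names and taxa are invented, and the example is far smaller than the 10-sites-per-class guideline.

```python
from itertools import combinations


def bray_curtis_similarity(a, b):
    """Bray-Curtis similarity between two samples given as {taxon: abundance}."""
    taxa = set(a) | set(b)
    shared = sum(min(a.get(t, 0), b.get(t, 0)) for t in taxa)
    total = sum(a.values()) + sum(b.values())
    return 2 * shared / total if total else 0.0


def average_linkage_classes(samples, n_classes):
    """Repeatedly merge the two most similar clusters until n_classes remain."""
    sim = {frozenset(pair): bray_curtis_similarity(samples[pair[0]], samples[pair[1]])
           for pair in combinations(samples, 2)}
    clusters = [[name] for name in samples]

    def mean_similarity(c1, c2):
        # Average of all pairwise similarities between members of the two clusters.
        return sum(sim[frozenset((x, y))] for x in c1 for y in c2) / (len(c1) * len(c2))

    while len(clusters) > n_classes:
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda ij: mean_similarity(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]  # i < j, so deleting j is safe
        del clusters[j]
    return [sorted(c) for c in clusters]


# Hypothetical presence/absence data (abundance 1 = present) at four reference sites.
samples = {
    "site_A": {"Baetis": 1, "Epeorus": 1, "Drunella": 1},
    "site_B": {"Baetis": 1, "Epeorus": 1, "Rhyacophila": 1},
    "site_C": {"Chironomus": 1, "Physa": 1, "Baetis": 1},
    "site_D": {"Chironomus": 1, "Physa": 1, "Hyalella": 1},
}

classes = average_linkage_classes(samples, n_classes=2)
print(classes)
```

With presence/absence data, the Bray-Curtis similarity reduces to twice the number of shared taxa divided by the total number of taxa records in the two samples.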

Once sites are classified, we need to keep in mind that by assigning sites to classes, we have not made them identical to one another. Sites within a class are simply more similar to each other than they are to sites in other classes, and some sites in one class may actually be rather similar to some sites in other classes. Such ‘messy’ classifications arise whenever ecologists attempt to force any continuously varying set of attributes (such as taxa abundances or occurrences) into discrete classes. When used for predictive purposes, such classifications have the undesirable property of predicting that all new sites assigned to a class will be identical in the list of taxa they contain. Moreover, this predicted list will be either the average set of taxa or the combined list of taxa that were observed at those reference sites used in the original classification. RIVPACS attempts to adjust these coarse predictions to provide more realistic, site-specific predictions that better conform to naturally occurring distributions of biota.

**Predictive models.** Once reference sites are classified, we need a way of using this information to predict the reference condition biota of new sites, i.e., what taxa should occur at these sites in the absence of human disturbance. These predictions will typically be made for non-reference sites (hereafter called test sites) so we can compare their current biological condition to their potential. In the RIVPACS approach, discriminant function models (DFMs) are used to make these predictions. DFMs use multivariate statistical equations to relate the likelihood of class membership to variation in a set of predictor variables (Fig. 3). For assessment purposes, we therefore need predictor variables that are unlikely to be significantly affected by human activity; otherwise we could predict the biota expected under altered conditions instead of a site’s natural potential. Variables related to geographic position (e.g., latitude, longitude, elevation), catchment area, climate, and surficial geology are good candidates, because they are reasonably invariant over time periods of ecological relevance. Readers familiar with DFMs know that they have been traditionally used to place new observations into pre-existing classes, e.g., assigning an individual to a species based on measurements of its morphology. However, these models actually estimate the probabilities of a new observation belonging to each of the pre-existing classes. RIVPACS uses this information to make near site-specific predictions of taxa probabilities of detection by weighting raw frequencies of occurrence in each class.

If the world could be cleanly classified into discrete categories of site types, we could use the simple frequencies of occurrence of taxa among reference sites of the appropriate type to estimate the probabilities of detecting different taxa at other unstressed sites of that type (Table 4). Because the natural world is not so discretely organized, we need to modify the simple frequencies of occurrence derived from our classification by the likelihood that the site belongs in different classes. For example, if a classification consists of two types of sites (biota found at high and low elevations), our best prediction of what we should observe at an intermediate elevation site would likely be some mix of taxa found at sites in the high- and low-elevation classes. RIVPACS accomplishes this type of interpolation by weighting the frequencies of taxa occurrences within classes by the probabilities that a site belongs to each class given its environmental setting (Fig. 4). This calculation is made for every taxon in the regional taxa pool for every site that is assessed. Once we have these estimates of what the probabilities of detection should be at a specific environmental setting under reference conditions, we can easily calculate O/E values as described earlier (Table 2).
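The weighting step described above amounts to a probability-weighted average of class frequencies. The following sketch uses the two-class elevation example from the text; the class membership probabilities would normally come from the discriminant function model, but here they, the frequencies, and the function name are all made-up values for illustration.

```python
def taxon_detection_probabilities(class_probs, class_frequencies):
    """Weight each taxon's within-class frequency by the site's class membership probability."""
    taxa = set()
    for freqs in class_frequencies.values():
        taxa.update(freqs)
    return {
        taxon: sum(p * class_frequencies[cls].get(taxon, 0.0)
                   for cls, p in class_probs.items())
        for taxon in sorted(taxa)
    }


# Hypothetical frequencies of occurrence among reference sites in each class.
class_frequencies = {
    "high_elevation": {"Epeorus": 0.9, "Baetis": 0.8, "Hyalella": 0.0},
    "low_elevation": {"Epeorus": 0.1, "Baetis": 0.6, "Hyalella": 0.7},
}
# Hypothetical class membership probabilities for one intermediate-elevation site.
class_probs = {"high_elevation": 0.6, "low_elevation": 0.4}

site_probs = taxon_detection_probabilities(class_probs, class_frequencies)
expected = sum(site_probs.values())  # E at a threshold of zero
print(site_probs, expected)
```

For Epeorus, for example, the predicted probability of detection is 0.6 × 0.9 + 0.4 × 0.1 = 0.58, a value intermediate between the two class frequencies.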

Estimating E requires that we predict the probabilities of detecting all taxa in the region for any site we wish to assess. Ideally, these probabilities should represent the likelihood of collecting taxa under conditions of no or minimal human-caused stress. Declines in abundances of taxa associated with stress will cause the actual probabilities of detecting taxa to also decline and thus result in a smaller number of taxa being collected than expected and an O/E value less than 1.

To predict probabilities of detection under conditions of no or minimal human-caused stress requires that we model how taxa occurrences vary along those natural environmental gradients that occur within the region of interest. In general, the data on which these models are built are collected at a series of reference sites, the type and number of which should be selected to both represent the range of naturally occurring environmental variation in the region and guarantee creation of statistically robust models.

Two general approaches can be followed to predict probabilities of detection. First, separate logistic regression or generalized additive models (GAMs) can be applied to binary taxa data (presence/absence) to model how the probabilities of detecting individual taxa vary across natural environmental gradients. The outputs of the models for each taxon can then be summed to estimate E. A second modeling approach was developed in Great Britain (Moss et al. 1987, Wright 1995) and called the River Invertebrate Prediction and Classification System (RIVPACS). In this approach, one model is used to predict the probabilities of detecting all taxa. Because RIVPACS models are somewhat less intuitive than the single taxa model approach, we provide a brief overview of how RIVPACS models work in this primer. Those interested in details should consult Moss et al. (1987), Wright (1995), Wright et al. (2000), and Hawkins and Carlisle (2000).

A potential problem with empirical models is applying them to inappropriate situations. With respect to RIVPACS models, this problem can arise if we wish to assess the condition of a site that is physically or geographically dissimilar to the reference sites that were used for model construction. For example, if data from only small streams were used to build a model, it would be dangerous, and almost certainly inappropriate, to apply the model to larger streams, i.e., extrapolate beyond the experience of the model. Most applications of RIVPACS-type models, including ours, include a statistical test that guards against such inappropriate extrapolation. That test determines if the values of the predictor variables measured at the sites being assessed are within a statistically acceptable range of values measured at the reference sites. If a test site has values that fall outside that range, the program will flag the site but will still calculate an O/E value for the site, which gives the user final say over whether the extrapolation is valid or not.
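The flag-but-still-report behavior can be sketched with a deliberately simplified check. The text does not specify which statistical test is used, so the minimum-to-maximum range check below is only a stand-in, and the predictor names, values, and function name are hypothetical.

```python
def out_of_range_predictors(site, reference_sites):
    """Return the predictors whose values fall outside the range seen at reference sites.

    This min-max comparison is a simplified stand-in for the statistical test
    used in actual RIVPACS-type software.
    """
    flagged = []
    for name, value in site.items():
        ref_values = [ref[name] for ref in reference_sites]
        if not min(ref_values) <= value <= max(ref_values):
            flagged.append(name)
    return flagged


# Hypothetical predictor values at the reference sites used to build a model.
reference_sites = [
    {"elevation_m": 1200, "catchment_km2": 15},
    {"elevation_m": 1850, "catchment_km2": 42},
    {"elevation_m": 2400, "catchment_km2": 8},
]
# A test site on a much larger stream than any reference site.
test_site = {"elevation_m": 1600, "catchment_km2": 950}

flags = out_of_range_predictors(test_site, reference_sites)
if flags:
    print("Warning: predictions extrapolate beyond reference conditions for", flags)
# An O/E value would still be calculated here; the flag only informs the user.
```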

Because RIVPACS models, like all models, are neither completely precise nor accurate, errors in prediction need to be quantified and incorporated into assessments. The most direct and robust way of evaluating model error is to apply the model to samples collected from an independent set of reference sites, i.e., sites that were not used in construction of the model. In many cases, however, too few reference sites exist to both construct a robust model and validate it. In these cases, error can only be assessed by applying the model to the same data used to build the model. Although care should always be taken when using models validated in this way, previous studies have shown that, for RIVPACS models, results derived from use of internal (same data) and independent sites are very similar (Hawkins et al. 2000, Van Sickle et al. 2004).

Two types of graphs are useful in illustrating error and evaluating model performance (Fig. 5). A plot of O versus E shows the range in the number of predicted taxa that were observed across reference sites and the degree to which predictions of E agree with observed values. In good models in which alpha diversity (site richness) naturally varies considerably among sites, the slope of this relationship (i.e., the regression of O on E) should be 1 (the model is generally accurate) and the scatter of points around the regression line should be small (the model is precise). We can quantify aspects of precision in two ways. Values of r², the coefficient of determination, quantify the amount of variation in O that is predicted by E. Models with high r² values are generally more precise than models with low values, and in our experience, reasonably good models have r² values between 0.5 and 0.75. However, r² values also depend on the range of values observed among samples, so it is possible for two different models to have different r² values but have the same amount of scatter around a single value of E. Examining the frequency distribution of reference site O/E values allows a different evaluation of model precision and avoids the problem of the dependency of r² on the range of values. The average of reference site O/E values should be 1 (accurate model), and the standard deviation of O/E values quantifies precision. Models with O/E standard deviations of ~0.10 approach the precision possible given typical sampling error, whereas models with standard deviations much greater than ~0.2 are probably not accounting for a significant amount of natural variation among samples (Van Sickle et al. 2004).

The distribution of reference site O/E values also provides a basis for drawing a statistically supported inference regarding whether a test site is biologically impaired or not. The statistical hypothesis being tested is whether the O/E value observed at a test site is outside of the range of values expected for reference sites. Values falling outside of the distribution of reference site O/E values are judged to be biologically impaired. The thresholds beyond which O/E values are considered to be outside of the distribution of reference values should be set to balance what statisticians call Type I and Type II errors. Type I and Type II errors measure the likelihood of making different types of wrong inferences. In statistical terms, a Type I error rate is the likelihood that we will reject the null hypothesis of ‘no effect or difference’ when the null hypothesis is actually true. In plain English, a Type I error is the likelihood of concluding that a site is impaired when it really is not. A Type II error is the likelihood of accepting the null hypothesis of no effect or difference when it is actually false. In other words, a Type II error occurs when we conclude a site is biologically unaltered when it actually has been altered. As the Type II error rate increases, the effect that can be statistically detected increases. The tendencies to make these two types of errors are inversely related to one another, and the only ways to minimize both types of error are to use large sample sizes, reduce sampling error, or accept a large detectable effect size.

The typical statistical test most people learn to do (e.g., a t-test) considers only Type I error. Type I error rates are listed in statistics tables as alpha values, which we typically see expressed as P-values in publications. For many research applications, setting Type I error rates to 0.05 or 0.01 (5% or 1% chance of being wrong) makes sense, because such rates guard against the human tendency to seek pattern in data where none really exists. The past focus on Type I error rates fit well the judicial tradition of innocent-until-proven-guilty. Type II error rates, the chance of being judged innocent when actually guilty, were almost completely ignored until recently. We now know that Type II errors have considerable importance in natural resource management. For example, by focusing solely on Type I error and setting alpha to 0.05, we have a high chance of not detecting impairment when it has actually occurred. In fact, given the relatively small sample sizes associated with most natural resources management issues, setting Type I error rates at 0.05 or 0.01 essentially guarantees that impairment will have to be severe before we would conclude the observed value was truly different from reference. If we want to conduct fair assessments, we must address Type II as well as Type I errors. In general, the statistical tests that we use in bioassessments should be fair to both the regulated community and the resource. However, we need to set thresholds based not only on statistical issues but also on the various costs associated with making the two types of inferential errors and the specific management objectives being addressed. We therefore cannot state what exact threshold should be used to conclude a site is impaired under all situations. Such decisions must ultimately reflect a consensus of stakeholder concerns. However, until better guidance is available, the 10th and 90th percentiles of the reference site O/E values appear to represent a reasonable balance between Type I and II errors. With these threshold values, any site with an O/E value that falls outside of the central 80% of values would be flagged as impaired.
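The reference-distribution summary and the percentile thresholds can be sketched as follows. The reference O/E values are invented, and the percentile function uses linear interpolation between ordered values, which is one of several common conventions and is an assumption here rather than a documented choice.

```python
from statistics import mean, stdev


def percentile(values, pct):
    """Percentile by linear interpolation between ordered values (one common convention)."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


# Hypothetical O/E values obtained by applying a model to its reference sites.
reference_oe = [0.78, 0.85, 0.90, 0.94, 0.97, 1.00, 1.03, 1.06, 1.10, 1.15, 1.22]

accuracy = mean(reference_oe)    # should be close to 1 for an accurate model
precision = stdev(reference_oe)  # smaller values indicate a more precise model

low = percentile(reference_oe, 10)
high = percentile(reference_oe, 90)


def is_flagged(oe_value):
    """True if a test-site O/E value falls outside the central 80% of reference values."""
    return oe_value < low or oe_value > high


print(f"mean={accuracy:.2f}, sd={precision:.2f}, thresholds=({low:.2f}, {high:.2f})")
print(is_flagged(0.60), is_flagged(0.95))
```

By construction, about 10% of reference-quality sites fall below the lower threshold and 10% above the upper one, which is the Type I error rate accepted in exchange for a lower Type II error rate.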

An assessment based on a single sample at an individual site allows us to decide if a site is impaired or not, but does not allow us to specify the confidence we have in the magnitude of impairment. In those situations where we need to know the degree of impairment, replicate samples can be used to generate both a more robust estimate of the degree of impairment than is possible with single samples and measures of confidence in the estimate. The statistical bases for calculating means and confidence intervals are well worked out and documentation is available in statistics texts. Although an assessment of an individual site can only test the hypothesis “Is the O/E value measured at this site outside of the expected range of reference site values?”, we can test the more familiar hypothesis “Is the mean O/E value observed at this site different from the mean value expected for reference sites?” when assessments are regional in scope (Fig. 6). When test sites are sampled based on a probabilistic sampling design, the mean condition of all sites in the region can be inferred from the mean value observed at test sites. These types of tests are useful when we want to say something about the status of aquatic resources in a region as a whole and when we want to determine if conditions are improving or degrading over time. These tests can also be modified to test the more general hypothesis of whether the distribution of values at test sites is different from the distribution of values observed at reference sites. These more general tests may be more appropriate if stressors affect the variation among sites in biotic condition more than they do average conditions.
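As a minimal sketch of how replicate samples yield a confidence interval, the following computes a conventional t-based interval for the mean O/E at one site. The replicate values and the function name are hypothetical, and the critical t value is supplied by the user from a statistics table (2.776 is the two-sided 95% value for 4 degrees of freedom).

```python
from math import sqrt
from statistics import mean, stdev


def mean_confidence_interval(values, t_critical):
    """Return (mean, lower, upper) for a t-based confidence interval on the mean."""
    n = len(values)
    center = mean(values)
    half_width = t_critical * stdev(values) / sqrt(n)  # t times the standard error
    return center, center - half_width, center + half_width


# Hypothetical O/E values from five replicate samples at one test site.
replicate_oe = [0.62, 0.70, 0.66, 0.74, 0.68]

center, lower, upper = mean_confidence_interval(replicate_oe, t_critical=2.776)
print(f"mean O/E = {center:.2f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

In this example the whole interval lies well below 1, so the replicates support a statement about the degree of impairment and not only about its presence.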

See list of literature under Bioassessments.