Using and Building Models
Our predictive model software allows users either to run data through existing models or to build custom models for their specific needs. Use of our software requires creation and manipulation of input files that must be in text (ASCII) format. Because some data files may have more than 256 columns, you may not be able to use a spreadsheet program to manipulate and create these files. We therefore recommend creating these files with a good text editor that allows you to see control characters. Note that Microsoft’s Notepad® text editor cannot handle large data files and has limited editing capability. Do not use a word processor, because word processors often leave stray characters in the text files they create, which will cause our program to crash.
We have constructed or are in the process of constructing models for streams and rivers in the following states or regions: Oregon, Washington, combined Oregon/Washington, Idaho, Wyoming, Colorado, Utah, mountainous California, Arizona, Ohio (invertebrates and fish), North Carolina (invertebrates and fish), Maine, the Mid-Atlantic Highlands region, and the three large regions used in the National Wadeable Streams Assessment and National Lakes Assessment. These models all differ in their specific data requirements, largely associated with the predictor variables and level of taxonomy used. When running these models, you will need to follow the specific data guidelines as well as the general guidelines described below.
All users must have a username and password to access the software. Because the Center and its web resources are no longer supported by grants, user fees are required to support software and web maintenance. All users except those using the software for educational purposes must pay a yearly fee prior to accessing the software. Please contact the Center's Director regarding fees and to request a username and password.
All models require the input of two matched data files for sites you wish to assess:
- a site-by-taxa matrix containing the biological data
- a site-by-predictor variable matrix.
At this time, no standard sampling protocols have been adopted by all water quality monitoring agencies. Users must therefore be careful that the data they wish to run through our models were collected following the protocols used in their state or region. If users plan to run data through the state-specific models, these data must have been collected and processed in the same way as the data used to create the models, data must be in the same units as those used to create the models, and both taxa and predictor variable names must be exactly the same as coded into the model. Otherwise, users can run data through a regional model if they followed either our fixed-area sampling procedures or the Western EMAP targeted riffle procedures.
RIVPACS models require that biota be identified to a consistent and unambiguous level of taxonomy. The level of taxonomy used can vary among taxonomic groups, but no individuals can be ambiguous, e.g., individuals within a family cannot be identified to family part of the time, genus some of the time, and species some of the time. When we build models, we scrutinize the original data to determine the frequency with which individuals in different taxonomic groupings are identified to different levels of resolution. Depending on these frequency distributions, we make decisions to either aggregate taxa (e.g., species within genera) or exclude individuals from analyses (e.g., those individuals identified only to order or family when most others were identified to a lower level). The result of this exercise is a list of operational taxonomic units (OTUs) that can vary in their level of taxonomic resolution, but which are unique from one another. Model users must therefore take care to ensure that the biological data files they create are based on the same OTUs that were used in model construction. To facilitate conversion of raw data files to the OTUs needed in the models, users can download our taxa translation tables, which can be used in any database program to create files with the correct taxa names.
Nearly all taxonomy labs identify a subset of the individuals contained in a sample. The target counts for the subsampling procedures used by most labs range from 100 to over 500 individuals. Because the number of taxa observed in a sample is partly a function of the number of individuals examined, many of our models require that the biological data be based on the same subsampling procedures that were applied to the data used to build models. Otherwise, data must be adjusted to be compatible with the models (see testtaxa.txt file below). If you are building your own model (see below), we suggest a minimum of 300 OTU individuals, because the relative cost of processing larger numbers of individuals is small in comparison to other expenses, and models based on larger subsamples are more accurate, precise, and sensitive than those based on small subsamples (Ostermiller and Hawkins 2004).
The two input files are required to run the models, and they must be formatted as either tab-, comma-, or space-delimited rectangular text files. The first column of each file must contain the unique names or codes assigned to samples, and the order of these samples must be the same in both files. The first row of the taxa file must contain the names of the taxa (OTUs) found in the samples. The first row of the predictor variable file must contain the names or codes of the predictor variables. The cells in these matrices contain either the taxa counts (testtaxa.txt file) or the values of the predictor variables (testhab.txt file). No missing data are permitted, and zeros must be coded as zeros, not blanks. If a sample contains missing data, you will need to exclude that sample from your analysis. The site names/codes, taxa names, and predictor variable names may be up to 128 characters long. If you are using an existing model, taxa and predictor variable names must be identical (case sensitive) to those used in model construction (see specific model guidelines).
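The file-matching rules above can be checked before submission. The following Python sketch is illustrative only (it is not part of our software); the function names are our own, and it assumes tab-delimited files read into lists of rows:

```python
# Sketch: check that the taxa and predictor variable files are properly
# matched. Adapt the paths and delimiter to your own files.
import csv

def read_table(path, delimiter="\t"):
    with open(path, newline="") as f:
        return list(csv.reader(f, delimiter=delimiter))

def check_matched(taxa_rows, hab_rows):
    """Return a list of problems found in the two matrices."""
    problems = []
    # Skip the header row; compare the first column (sample names/codes).
    taxa_samples = [r[0] for r in taxa_rows[1:]]
    hab_samples = [r[0] for r in hab_rows[1:]]
    if taxa_samples != hab_samples:
        problems.append("sample names/order differ between files")
    # Every cell must hold a value; blanks (missing data) are not permitted.
    for name, rows in (("taxa", taxa_rows), ("hab", hab_rows)):
        for i, row in enumerate(rows):
            if any(cell.strip() == "" for cell in row):
                problems.append(f"blank cell in {name} file, row {i + 1}")
    return problems
```

An empty result means the two files agree in sample names and order and contain no blank cells; any sample flagged here should be fixed or excluded before running the model.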
Most biological sample data are now stored in database format instead of rectangular spreadsheets. Databases make it very easy to select subsets of samples or taxa for analysis. Once data are in 3 column format (Site, Taxon, Count), it is easy to convert original taxa names into the OTU names required by our models, create files containing fixed-counts of certain size, and create the rectangular sample by taxa matrix required by the models. We therefore strongly recommend that users manage their data in a database program.
Converting from original taxa names or codes to model OTU names. Creating a taxa file in which taxa have the required OTU names can be done through a database query once you have a taxa translation table that links original names or codes with OTU names. For example, if you are using Microsoft’s Access® database program, you would start a query by first establishing a link between your original data file and the taxa translation table. If the field name in the original file containing the taxa data was “Taxa_Name”, the translation table would also contain that name (or a similar one) as a column header. Once these two files are linked by a common field (column), you need to construct a query to generate a new file that contains the original sample names, the new OTU names, and the original counts. Because the translation table may have the same OTU name for more than one of the original taxa names, you need to make sure the query sums counts within the same OTU and sample. To do this in Access’s ‘design view’ mode, create a new query, select the tables with the original data and the translation table, link them, add the fields you want to the query boxes, then click on the totals (∑) button. At this point, make sure that the sample and OTU columns of the query are specified as ‘group by’, which is the default, and that ‘sum’ is specified for the count column. There may also be some individuals in the original data that were not assigned to an OTU, i.e., they are being dropped from the analysis because they are ambiguous taxa. These taxa will have blank cells in the OTU column of the taxa translation table. When setting up the query, be sure to type ‘Is Not Null’ in the criterion cell of the OTU column to exclude these taxa from the query.
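If you manage your data outside Access, the same translation query can be reproduced in Python with the pandas library. This is a sketch, not part of our software; the column names (Sample, Taxa_Name, Count, OTU) are hypothetical and should be matched to your own data file and translation table:

```python
# Sketch of the OTU translation query: join raw counts to the translation
# table, drop ambiguous taxa (blank OTU; the 'Is Not Null' criterion), and
# sum counts within each sample x OTU combination.
import pandas as pd

def translate_to_otus(data, translation):
    merged = data.merge(translation, on="Taxa_Name", how="left")
    merged = merged.dropna(subset=["OTU"])  # ambiguous taxa are excluded
    return (merged.groupby(["Sample", "OTU"], as_index=False)["Count"]
                  .sum())
```

As in the Access query, two original taxa mapped to the same OTU are collapsed into one record per sample, with their counts summed.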
Once you save the query and then run it, you can either export the new 3-column data in a variety of formats (e.g., text or Excel file) or you can simply copy and paste the data from the query window into an open spreadsheet or text editor. We recommend that after creating the new file, you inspect the data in the file to make sure the query ran properly.
For those models that require a fixed-count sample, users may download a Fortran subsampling program (subsample.exe) created by Dr. Dave Roberts that runs in a DOS window under Windows® and will randomly ‘sample’ individuals from each of the samples in an original data file and create a new file with no more than the selected count in each sample. Samples with original counts less than that selected will not be affected. This program requires that the original data be a tab- or comma-delimited text file in database ‘list’ format, i.e., 3 columns in which the first column contains sample names/codes, the second column contains OTU names, and the third column contains counts. These files do not include taxa with zero counts.
To run subsample.exe, place it in the directory containing the original taxa file (in 3-column format). Navigate to this directory and launch the program by double clicking on the program name or icon. A window will open and you will be requested to first enter the name of the file containing the original data. Do so and hit Enter. The program will then ask you to type in the name of the file you wish to create. Do so and hit Enter again. You must now specify the number of individuals you want to include in each sample. Finally, you need to enter a random number to initiate the program. The program will then create a new file in which each sample contains the number of individuals you specified, except for those original samples that contained fewer individuals than you specified. These samples will be saved to the new file unchanged.
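If you prefer to script this step, the fixed-count subsampling can be sketched in Python. This is an illustrative stand-in for subsample.exe, not the program itself; the function name and tuple layout are our own:

```python
# Sketch of fixed-count subsampling on the 3-column (sample, OTU, count)
# list format. Samples with fewer individuals than the target pass through
# unchanged, as with subsample.exe.
import random
from collections import Counter

def subsample(records, target, seed=0):
    """records: list of (sample, otu, count) tuples.
    Returns a new list in which each sample holds at most `target`
    randomly drawn individuals."""
    rng = random.Random(seed)  # the seed plays the role of the random number
    by_sample = {}
    for sample, otu, count in records:
        by_sample.setdefault(sample, []).extend([otu] * count)
    out = []
    for sample, individuals in by_sample.items():
        if len(individuals) > target:
            individuals = rng.sample(individuals, target)
        for otu, count in sorted(Counter(individuals).items()):
            out.append((sample, otu, count))
    return out
```

Note that, like the Fortran program, this draws individuals without replacement, so a sample's subsampled counts always total exactly the target (or the original count, if smaller).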
Although it is possible to manually create this file within a spreadsheet, it is far easier and much less error prone to create it with the program (matrify.exe) that Dr. Dave Roberts created for this purpose. To run matrify.exe, place the program in the same directory containing the taxa data (subsampled and OTU translated if needed). Launch the program and enter the original and new file names when prompted. Indicate that you want a text file in tab-, comma-, or space-delimited format (our program accepts any of these formats). Finally indicate that a zero should be used for absences. The program will create a rectangular matrix in which the OTUs and the sample names/codes are sorted in alphabetical order. Open this file in a text editor or spreadsheet (some files will contain > 256 taxa so you will not be able to use a spreadsheet) and inspect the data to make sure the program ran successfully. Also, note that the first column will not have a label. Type in a header label for the sample name column if you wish and then resave the file. The values in this matrix may be either counts or presence-absence data (1/0); both types of data will generate the same results from our software.
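The pivot that matrify.exe performs can also be sketched in Python. This is an illustrative alternative, not the program itself; the function name is our own, and unlike matrify.exe it fills in a header label ("sample") for the first column, which you would otherwise type in yourself:

```python
# Sketch: pivot the 3-column (sample, OTU, count) list into the rectangular
# site-by-taxa matrix, with zeros for absences and rows and columns sorted
# alphabetically, as matrify.exe does.
def matrify(records):
    """records: list of (sample, otu, count). Returns a header row plus one
    row per sample, as lists of strings ready to write tab-delimited."""
    samples = sorted({r[0] for r in records})
    otus = sorted({r[1] for r in records})
    counts = {(s, o): c for s, o, c in records}
    header = ["sample"] + otus
    rows = [[s] + [str(counts.get((s, o), 0)) for o in otus] for s in samples]
    return [header] + rows
```

As with the program output, inspect the resulting matrix before use to confirm the pivot ran as expected.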
The testhab.txt file is much simpler to construct than the testtaxa.txt file, and a spreadsheet program often aids in creating it. The cells of this file contain the values of the predictor variables used by the model. There are just a few rules that must be followed when creating this file:
- The sample names/codes have to be the same and in the same order as those in the testtaxa.txt file.
- The labels for the predictor variables and their order have to be exactly the same as expected by the model (case sensitive).
- The units of measure and any transformations must be the same as expected by the model.
If this file is created in a spreadsheet, save the file in text format or copy the information to a text editor and resave the data in text format.
Use a good text editor that allows you to view control characters such as tabs, end-of-records, and spaces. In general, use of tab-delimited text files facilitates inspection of files and allows you to catch errors more easily. Search all text files for unwanted spaces or other unwanted characters and eliminate them before running models. It is also useful to check for two adjacent tab characters, which indicates a cell with no value. Use of a text editor with an automatic search feature greatly simplifies this task.
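These checks can also be scripted. The sketch below is illustrative only (the function name is our own) and flags the two problems described above in a tab-delimited file read in as a list of lines:

```python
# Sketch: scan tab-delimited lines for adjacent tabs (an empty cell) and
# stray space characters (unwanted in names and values).
def scan_lines(lines):
    """Return (line_number, problem) pairs for each problem found."""
    findings = []
    for i, line in enumerate(lines, start=1):
        if "\t\t" in line:
            findings.append((i, "adjacent tabs (empty cell)"))
        if " " in line:
            findings.append((i, "space character (check names and values)"))
    return findings
```

Any line flagged here should be opened in the text editor and corrected before the file is submitted.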
To build a custom model, you must be familiar enough with multivariate statistical procedures and software to conduct a classification of sites based on their biota and then derive a discriminant functions model based on the classes identified in the classification. These analyses are not difficult to do, and we provide a brief description of them below to help guide your efforts; however, please do not contact us for help conducting statistical analyses. We do not have the resources to respond to such requests. Some standard statistical packages (including R) will allow you to conduct all of the analyses you need to run, but software designed specifically for ecological purposes, such as PC-ORD®, can often facilitate these analyses. Note, however, that many statistical programs, including PC-ORD®, restrict variable names to 8 characters.
A custom model requires two main pieces of information: the frequencies of OTUs among site classes and a method of predicting the probabilities that a new site belongs to each of the reference site classes. The frequencies of OTUs among classes will be derived from the classification analysis and will result in a single text file (reftaxa.txt) that consists of reference site samples as rows and OTUs as columns. This data file is organized exactly like the testtaxa.txt file described above, except that it has one additional column of data (group) placed after the sample name column that specifies the group to which each reference site sample was assigned. A discriminant functions model (DFM) is typically used to predict the probabilities of group membership from environmental data, although other types of predictive models can be used (e.g., Random Forests models). For the purposes of this exercise, we assume the use of DFMs.
From the output of the discriminant functions analysis, you will need to create 2 tab-delimited text files. One file (groupmeans.txt) consists of the mean values of the predictor variables used in the DFM for each biotic class. The other (inv_covariance.txt) is the inverse of the pooled within covariance matrix derived from the discriminant functions analysis. These two files represent the discriminant equations in matrix format. Two other files (reftesttaxa.txt and refhab.txt) will be used to conduct an internal validation of the model; they are identical in construction to the testtaxa.txt and testhab.txt files described above.
Reference Site Classification
Reference site classification requires access to software that can create a classification of sites based on their taxonomic similarities to one another. A variety of software programs and specific methods are available for conducting these analyses. However, based on several sensitivity analyses we and others have run, we suggest initially basing the classification on presence-absence data (0/1), between-sample similarities derived from the Sorensen Index (Bray-Curtis applied to presence-absence data), and clustering based on either the flexible-beta algorithm with beta set to between -0.25 and -0.5, Ward’s method, or TWINSPAN. Other approaches can be experimented with, but these methods appear to consistently result in good models.
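As one concrete example, this classification can be sketched in Python with scipy. This is a sketch under stated substitutions, not a prescription: Sorensen similarity on presence-absence data equals the Dice coefficient, which scipy provides as the 'dice' distance, but scipy does not implement flexible-beta linkage, so average linkage stands in here:

```python
# Sketch: hierarchical classification of reference sites from 0/1
# presence-absence data, using Sorensen (Dice) distances.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

def classify_sites(pa_matrix, n_groups):
    """pa_matrix: sites x taxa array of 0/1. Returns a group code per site."""
    dist = pdist(np.asarray(pa_matrix, dtype=bool), metric="dice")
    tree = linkage(dist, method="average")  # flexible-beta not in scipy
    return fcluster(tree, t=n_groups, criterion="maxclust")
```

In practice you would inspect the dendrogram (scipy.cluster.hierarchy.dendrogram) and choose the cut level by eye, as described below, rather than fixing the number of groups in advance.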
After preparing the reference site data sets and running the clustering or classification software, you must then assign samples to classes. If you use a clustering method like flexible-beta, print the cluster diagram. Some groupings will look more distinct than others. Initially, try to cut the cluster diagram such that you create groups (classes) with as high a level of within-group similarity as possible and at a level of similarity that results in all classes having > 10 samples. When cluster diagrams are cut at a specific similarity value, group sizes (number of sites per group) may vary considerably. This result is expected. You may also find that you need to cut branches in the cluster diagram at slightly different similarity values to generate relatively discrete groups or to ensure that very small groups are not created.
If your graphing software can produce an ASCII-character version of the cluster diagram, copy the figure to a text editor that allows you to edit columns. Insert 3-4 blank spaces next to the sample labels and then use your editor to type in the class membership codes next to the sample names. Make sure you leave at least one blank space between the class code and the sample name. We recommend coding the classes sequentially with integers, e.g., 1, 2, 3…..n. After inserting class codes into this text document, highlight just the portion of the document that contains the class assignments and the sample names (you can do this with a text editor that allows column mode editing). Copy and paste this selection into a blank spreadsheet, and then use the text-to-columns feature to create two columns of data: class codes and sample names. Check the cells to make sure that no extra characters were included, insert a header label for the two columns (e.g., sample, class), and then sort the file based on the sample name. Save this file and import it into your database for later use.
Create the reftaxa.txt file following the procedures described previously. You will need to examine the raw data to determine whether the number of taxa per sample in the original file increases with the number of individuals in a sample. If it does, you may want to create fixed-count samples with the subsample program. Once you have the reference sample taxa in matrix format, make sure the file is sorted by sample name and then insert the two columns of data from the sample-class file you created just to the right of the leftmost column of data (i.e., the sample names). Because one of the columns of data that you just inserted is the sample names, you can compare these names to the names that were already in the reftaxa.txt file. Make sure they are in the same order. If they are not, you will need to start again and resort one or both files prior to inserting the class codes into the reftaxa.txt file. Once you have determined that the sample names match, delete the extra sample names column that you inserted. You should now have a file with this sequence of columns: sample_name, class_code, OTU1, OTU2, OTU3, etc. Check the file for unwanted characters and save.
The Discriminant Model
To run a discriminant functions analysis, you will need a data file that consists of reference sample names, class codes, and values for each of the potential predictor variables. You probably constructed this file earlier with all of the data except the class codes. If so, and if this file is part of your project database, you can run a database query to create a new file that consists of the predictor variables plus the class codes. Once you have this file, open it in your favorite statistics software.
Before proceeding with a discriminant analysis, check all predictor variables for normality. You can do a quick check for normality by creating cumulative normal probability plots of the predictor variable data values. If these plots exhibit nearly straight lines, the data are likely OK for use in discriminant analysis. If they are curvilinear, a transformation is needed. A simple log transformation will often result in a near-normal distribution of data. Try it first and replot. If the log transformation did not succeed, try a power function transformation with different values of the exponent. If none of these transformations work, you may have to exclude that variable from the analysis or consider creating dummy variables based on different value ranges of the variable. Dummy variables have values of 0 or 1 depending on the observed value at a site and are useful when you want to include categorical variables in discriminant analyses, e.g., Ecoregion A, Ecoregion B, Ecoregion C, etc. In this example, for a single site, one of these variables would have a value of 1 and the other two would have values of zero, with the 1 indicating which ecoregion the site was in.
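The quick normality check and log transformation described above can be sketched in Python with scipy. The function names are our own; probplot reports the fit of the cumulative normal probability plot, with an r value near 1 indicating a nearly straight line:

```python
# Sketch: probability-plot check for normality and a log transformation.
import numpy as np
from scipy import stats

def probplot_r(values):
    """r from a normal probability plot; near 1.0 suggests normality."""
    (_osm, _osr), (_slope, _intercept, r) = stats.probplot(values, dist="norm")
    return r

def log_transform(values):
    """log10(x + 1); the +1 guards against zeros in the data."""
    return np.log10(np.asarray(values, dtype=float) + 1.0)
```

Transform a candidate predictor, recompute r, and keep whichever version of the variable gives the straighter plot.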
Once all potential predictors have been appropriately transformed, if needed, you need to run a discriminant analysis to select those predictor variables that best discriminate among the biological classes. If you are familiar with the R statistical language, we strongly suggest that you run the all-possible-subsets program developed by John Van Sickle (see Van Sickle, J., D.D. Huff, and C.P. Hawkins. 2006. Selecting discriminant function models for predicting the expected richness of aquatic macroinvertebrates. Freshwater Biology 51:359-372). Otherwise, try using a stepwise procedure with both backward and forward selection. If the two stepwise procedures agree, then use those variables in the model. If they do not agree, make initial selections based on the strength of the predictor variables, their ease of measurement, and their ecological interpretability.
Most statistical software packages provide cross-validation procedures that estimate how well the model classifies observations into the correct group. You can use this output, in conjunction with the stepwise procedures, to help you select the discriminant function model that provides the best discrimination among groups. If you followed our advice and created smaller groups with higher within-group similarity, the cross-validated error estimates may appear high (e.g., 60-75% error rates). However, high error estimates are usually not a serious problem as long as misclassified sites were predicted to occur in a neighboring, biologically similar group. A better evaluation of different discriminant function models is to examine the extent to which sites are classified into highly dissimilar groups, which will result in estimates of E with poorer precision.
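If your statistics package lacks these procedures, the model fitting and cross-validated classification rate can be sketched in Python with scikit-learn. This is an illustrative alternative, not our software's method; the function name is our own:

```python
# Sketch: fit a linear discriminant functions model and estimate its
# cross-validated classification accuracy.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

def cv_classification_rate(predictors, classes, folds=5):
    """Mean cross-validated proportion of sites placed in the right group."""
    lda = LinearDiscriminantAnalysis()
    scores = cross_val_score(lda, predictors, classes, cv=folds)
    return float(np.mean(scores))
```

Remember that, as noted above, the raw accuracy can look poor for many small groups; examining which groups the misclassified sites fall into matters more than the rate itself.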
Once you identify a final set of predictor variables, rerun the discriminant analysis as a ‘complete’ model, i.e., not in stepwise mode, with just those predictor variables you decided to use. At this time, you should also instruct the software to generate a table of class means and the pooled within covariance matrix. Make sure values in these tables include all of the significant digits present in the original data. Copy both of these tables to a spreadsheet. Clean up the class means data if needed (classes should be columns and predictor variables should be rows), copy the table to a text editor, and save the file as groupmeans.txt. The pooled within covariance matrix will take a bit more work. First make sure the copied matrix consists of columns and rows of data in which both columns and rows have the predictor variables as labels and are in the same order as in the groupmeans.txt file. There should be no completely blank columns or rows, but pasting into a spreadsheet can sometimes create blank columns between data. If there are, cut and paste until you have a complete triangular matrix. You will now need to fill in the blank spaces in the other half of the matrix so that this half is a mirror image of the first. Do this by copying the values in the matrix a column at a time, not including the diagonal value, and transposing these values to the equivalent row in the other half of the matrix. Once you have created a full rectangular matrix, generate the inverse of this matrix in another portion of the worksheet. In Excel®, this manipulation is done with the =MINVERSE function. Note that when using this function, you need to highlight a section of the worksheet of the same dimensions as the original matrix.
Then, with the cells still highlighted, type =MINVERSE, enter a beginning parenthesis, highlight the array of cells with the original matrix values, enter the ending parenthesis, and then simultaneously hit the Control, Shift, and Enter keys (hold down Control and Shift, then hit Enter). If you did this manipulation correctly, you should have a new matrix with different values than the old one. Highlight the cells of the new matrix, copy them, and then use paste-special to convert the formulas to values. Make sure both rows and columns have the appropriate variable labels and then copy the matrix to a text editor. Clean up the file if needed (unwanted spaces, etc.), and then save the file as “inv_covariance.txt” or whatever name you prefer.
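The =MINVERSE step can also be done in Python with numpy, which avoids the array-formula keystrokes. This is an optional sketch; the function name is our own:

```python
# Sketch: invert the full (mirrored) pooled within covariance matrix
# to produce the values for the inv_covariance.txt file.
import numpy as np

def invert_pooled_covariance(matrix):
    m = np.asarray(matrix, dtype=float)
    # The matrix must be symmetric; mirror the triangular half first.
    assert np.allclose(m, m.T), "matrix must be symmetric"
    return np.linalg.inv(m)
```

Write the result out tab-delimited with the predictor variable labels on both rows and columns, in the same order as in groupmeans.txt.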
To evaluate the performance of the model, you need taxa data and predictor data for either an independent set of reference samples or you will need to conduct an internal validation based on the model building data. In either case, you must prepare the two text files in the same way as described for testtaxa.txt and testhab.txt described above. You may want to name these files something like ref-val-taxa.txt and ref-val-hab.txt. If you are running an internal validation, the ref-val-taxa.txt file will be identical to the reftaxa.txt file except that the group column is eliminated. In this case, the ref-val-hab.txt file will contain the same values as in the file used in the discriminant functions analysis.
To run a custom model, you will need to specify “New Model” after accessing the web software. At this point, the software will ask you to submit 5 files instead of the two files needed for existing models: 3 files that comprise the predictive model itself (groupmeans.txt, inv_covariance.txt, and reftaxa.txt) and 2 files that contain data for samples that are to be assessed. These latter files may be validation files (e.g., ref-val-taxa.txt and ref-val-hab.txt) or regular test samples (testtaxa.txt and testhab.txt). Note that the names of these files can actually be anything you want; the generic names we use here are just convenient for keeping track of what the files are.
Once you submit the input files, you should be taken to a new page with a list of output files. If nothing happens, click the ‘Reload’ button on your web browser. For some reason that we have yet to discover, the results will not always be shown on the screen until the page is reloaded.
If there is a problem with one or more input files, you will receive an error message. We have included some rudimentary error-checking procedures in the software to help you figure out where the problem is, but these error messages are rather general and will not pinpoint the exact errors. Here is a list of common errors we have encountered:
- The taxa file and the predictor variable file do not have the same number of samples, or the order of samples in the two files differ, or the two files contain different sample names. Open both files in a text editor that supports multiple windows and align the files vertically so you can compare the sample names and their order in both files.
- One or more cells are blank in one of the files, typically the predictor variable file. Open the file in either a text editor or spreadsheet and search for blank cells.
- There are unwanted blank spaces in the input files. Unwanted spaces will be interpreted as delimiters between adjacent records, which will cause a mismatch between taxa and predictor variable files. Open the files in a text editor and search for spaces. Make sure taxa names have no spaces and rename them if necessary. For example, the software will interpret the name ‘Drunella doddsii’ as the names of two taxa. Fix such names by replacing the spaces with underscores, e.g., ‘Drunella_doddsii’. Saving the files in tab- or comma-delimited format will aid in catching these errors. If you find spaces in predictor variable or taxa names, you will have to fix them in all files where they occur.
- A carriage return is missing at the end of the file. This problem often occurs when copying and pasting from one application to another. Use a text editor that can show the location of these and other control characters. Fix by going to the end of the last record and hitting ‘Enter’.
Our software generates 4 types of output files, each of which is provided in both html and text format. The html files are probably the easiest to use because they can be opened directly within a spreadsheet for viewing, retain their formatting within a spreadsheet, and can be further manipulated for input into a statistical program. The text files are smaller but harder to view. All files can be saved to disk by right clicking on the file name and left clicking the ‘save target as’ selection. At this point you can browse to the directory in which you want to save the file and rename the file if so desired.
The SiteHabTest file
This file contains a list of all of the samples submitted in the testhab.txt file. Each record (row) contains the values of the predictor variables read by the model and includes 2 additional fields: a field showing whether the sample was within the experience of the model (P = pass or F = fail) and a count of the individuals (if abundance data) or taxa (if presence-absence data) in each sample of the testtaxa.txt file. If a sample is outside the experience of the model, the entire row is colored red.
The Probability of Detection Matrix File
This file contains the estimated probabilities of detection under reference conditions for each taxon at each site. This file can often be very large if many samples are submitted and the model is based on many taxa. It may not be particularly useful for most applications, but it does allow you to inspect the primary output of the model prior to calculation of E and O/E. We have also instructed the software to highlight cells in which taxa were predicted to occur with a probability of >= 0.5 but were not observed (cell colored red) and in which taxa were predicted to occur at a probability less than 0.5 but were collected (cell colored light blue). This file contains estimates of probabilities of detection for all submitted samples including those that were flagged as being outside of the experience of the model.
The O/E File
The O/E file contains the calculated O/E values for each sample as well as a measure of similarity between observed and expected taxa lists based on the Bray-Curtis (BC) distance measure (see Van Sickle, J. 2008. An index of compositional dissimilarity between observed and expected assemblages. Journal of the North American Benthological Society 27:227-235). Both O/E and BC values are calculated at 2 probability of capture thresholds (zero and >= 0.5). Values of E and O are included along with O/E and BC values.
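For readers who want to see the arithmetic behind these values, the following Python sketch computes O, E, and O/E for one sample at a given capture-probability threshold. This is our illustration of the standard RIVPACS-style calculation, not the software's internal code; the function name is our own:

```python
# Sketch: O, E, and O/E for one sample from the probability-of-detection
# matrix, at a given probability-of-capture threshold.
import numpy as np

def o_over_e(probs, observed, threshold=0.5):
    """probs: predicted detection probabilities for each taxon.
    observed: 1 if the taxon was collected at the site, else 0.
    Taxa below the threshold are dropped from both O and E."""
    p = np.asarray(probs, dtype=float)
    o = np.asarray(observed, dtype=float)
    keep = p >= threshold
    e = p[keep].sum()              # E: sum of retained probabilities
    obs = o[keep].sum()            # O: retained taxa actually collected
    return obs, e, obs / e
```

Setting threshold=0 reproduces the zero-threshold values reported in the file; threshold=0.5 reproduces the >= 0.5 values.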
The Taxa Response Summary File
This file lists all taxa that the model expected to see (i.e., those taxa in the reftaxa.txt file) as well as any new taxa that occurred at test sites (testtaxa.txt file) but were not observed in at least 1 reference site sample. For each of these taxa, we list their average predicted probability of detection (assuming sites were under reference condition), the number of test sites at which taxa were predicted to occur, the number of test sites at which taxa were observed, and the ratio of observed sites to expected sites for each taxon. We have labeled this ratio the ‘Sensitivity Index’ and interpret it as a measure of sensitivity of a taxon to whatever stressors are influencing that taxon within the set of test sites submitted for assessment. A ratio > 1 indicates the taxon was found at more sites than expected and was thus an ‘increaser’ or tolerant taxon. A ratio less than 1 indicates the taxon was found at fewer sites than expected and was thus a ‘decreaser’ or intolerant taxon. The magnitude of these values can provide insight into the relative sensitivities of taxa to stressors, although users should be careful to avoid over-interpreting ratios based on small numbers. Results obtained by separately submitting sets of samples that differ in the primary stressors known to be affecting sites may provide insight regarding the relative sensitivities of taxa to different stressors. We provide two versions of this file: one based on all submitted samples and one based on just those samples considered to be within the experience of the model.