STATISTICAL ANALYSIS OF
GROUND-WATER MONITORING
DATA AT RCRA FACILITIES
ADDENDUM TO INTERIM FINAL
GUIDANCE
OFFICE OF SOLID WASTE
PERMITS AND STATE PROGRAMS DIVISION
U.S. ENVIRONMENTAL PROTECTION AGENCY
401 M STREET, S.W.
WASHINGTON, D.C. 20460
JULY 1992
DISCLAIMER
This document is intended to assist Regional and State personnel in evaluating ground-water
monitoring data from RCRA facilities. Conformance with this guidance is expected to result in
statistical methods and sampling procedures that meet the regulatory standard of protecting
human health and the environment. However, EPA will not in all cases limit its approval of
statistical methods and sampling procedures to those that comport with the guidance set forth
herein. This guidance is not a regulation (i.e., it does not establish a standard of conduct which
has the force of law) and should not be used as such. Regional and State personnel should
exercise their discretion in using this guidance document as well as other relevant information in
choosing a statistical method and sampling procedure that meet the regulatory requirements for
evaluating ground-water monitoring data from RCRA facilities.
This document has been reviewed by the Office of Solid Waste, U.S. Environmental
Protection Agency, Washington, D.C., and approved for publication. Approval does not signify
that the contents necessarily reflect the views and policies of the U.S. Environmental Protection
Agency, nor does mention of trade names, commercial products, or publications constitute
endorsement or recommendation for use.
CONTENTS
1. CHECKING ASSUMPTIONS FOR STATISTICAL PROCEDURES
   1.1 Normality of Data
       1.1.1 Interim Final Guidance Methods for Checking Normality
       1.1.2 Probability Plots
       1.1.3 Coefficient of Skewness
       1.1.4 The Shapiro-Wilk Test of Normality (n≤50)
       1.1.5 The Shapiro-Francia Test of Normality (n>50)
       1.1.6 The Probability Plot Correlation Coefficient
   1.2 Testing for Homogeneity of Variance
       1.2.1 Box Plots
       1.2.2 Levene's Test
2. RECOMMENDATIONS FOR HANDLING NONDETECTS
   2.1 Nondetects in ANOVA Procedures
   2.2 Nondetects in Statistical Intervals
       2.2.1 Censored and Detects-Only Probability Plots
       2.2.2 Aitchison's Adjustment
       2.2.3 More Than 50% Nondetects
       2.2.4 Poisson Prediction Limits
       2.2.5 Poisson Tolerance Limits
3. NON-PARAMETRIC COMPARISON OF COMPLIANCE DATA TO BACKGROUND
   3.1 Kruskal-Wallis Test
       3.1.1 Adjusting for Tied Observations
   3.2 Wilcoxon Rank-Sum Test for Two Groups
       3.2.1 Handling Ties in the Wilcoxon Test
4. STATISTICAL INTERVALS: CONFIDENCE, TOLERANCE, AND PREDICTION
   4.1 Tolerance Intervals
       4.1.1 Non-parametric Tolerance Intervals
   4.2 Prediction Intervals
       4.2.1 Non-parametric Prediction Intervals
   4.3 Confidence Intervals
5. STRATEGIES FOR MULTIPLE COMPARISONS
   5.1 Background of Problem
   5.2 Possible Strategies
       5.2.1 Parametric and Non-parametric ANOVA
       5.2.2 Retesting with Parametric Intervals
       5.2.3 Retesting with Non-parametric Intervals
6. OTHER TOPICS
   6.1 Control Charts
   6.2 Outlier Testing
ACKNOWLEDGMENT
This document was developed by EPA's Office of Solid Waste under the direction of Mr.
James R. Brown of the Permits and State Programs Division. The Addendum was prepared by
the joint efforts of Mr. James R. Brown and Kirk M. Cameron, Ph.D., Senior Statistician at
Science Applications International Corporation (SAIC). SAIC provided technical support in
developing this document under EPA Contract No. 68-W0-0025. Other SAIC staff who assisted
in the preparation of the Addendum include Mr. Robert D. Aaron, Statistician.
STATISTICAL ANALYSIS OF GROUND-
WATER MONITORING DATA AT RCRA
FACILITIES
ADDENDUM TO INTERIM FINAL GUIDANCE
JULY 1992
This Addendum offers a series of recommendations and updated advice concerning the
Interim Final Guidance document for statistical analysis of ground-water monitoring data. Some
procedures in the original guidance are replaced by alternative methods that reflect more current
thinking within the statistics profession. In other cases, further clarification is offered for
currently recommended techniques to answer questions and address public comments that EPA
has received both formally and informally since the Interim Final Guidance was published.
1. CHECKING ASSUMPTIONS FOR STATISTICAL
PROCEDURES
Because any statistical or mathematical model of actual data is an approximation of reality,
all statistical tests and procedures require certain assumptions for the methods to be used
correctly and for the results to have a proper interpretation. Two key assumptions addressed in
the Interim Guidance concern the distributional properties of the data and the need for equal
variances among subgroups of the measurements. In the Addendum, new techniques are outlined
for testing both assumptions that offer distinct advantages over the methods in the Interim Final
Guidance.
1.1 NORMALITY OF DATA
Most statistical tests assume that the data come from a Normal distribution. Its density
function is the familiar bell-shaped curve. The Normal distribution is the assumed underlying
model for such procedures as parametric analysis of variance (ANOVA), t-tests, tolerance
intervals, and prediction intervals for future observations. Failure of the data to follow a Normal
distribution at least approximately is not always a disaster, but can lead to false conclusions if the
data really follow a more skewed distribution like the Lognormal. This is because the extreme tail
behavior of a data distribution is often the most critical factor in deciding whether to apply a
statistical test based on the assumption of Normality.
The Interim Final Guidance suggests that one begin by assuming that the original data are
Normal prior to testing the distributional assumptions. If the statistical test rejects the model of
Normality, the data can be tested for Lognormality instead by taking the natural logarithm of
each observation and repeating the test. If the original data are Lognormal, taking the natural
logarithm of the observations will result in data that are Normal. As a consequence, tests for
Normality can also be used to test for Lognormality by applying the tests to the logarithms of the
data.
Unfortunately, all of the available tests for Normality do at best a fair job of rejecting non-
Normal data when the sample size is small (say less than 20 to 30 observations). That is, the tests
do not exhibit high degrees of statistical power. As such, small samples of untransformed
Lognormal data can be accepted by a test of Normality even though the skewness of the data may
lead to poor statistical conclusions later. EPA's experience with environmental concentration
data, and ground-water data in particular, suggests that a Lognormal distribution is generally
more appropriate as a default statistical model than the Normal distribution, a conclusion shared
by researchers at the United States Geological Survey (USGS, Dennis Helsel, personal
communication, 1991). There also appears to be a plausible physical explanation as to why
pollutant concentrations so often seem to follow a Lognormal pattern (Ott, 1990). In Ott's
model, pollutant sources are randomly diluted in a multiplicative fashion through repeated dilution
and mixing with volumes of uncontaminated air or water, depending on the surrounding medium.
Such random and repeated dilution of pollutant concentrations can lead mathematically to a
Lognormal distribution.
Because the Lognormal distribution appears to be a better default statistical model than the
Normal distribution for most ground-water data, it is recommended that all data first be logged
prior to checking distributional assumptions. McBean and Rovers (1992) have noted that
"[s]upport for the lognormal distribution in many applications also arises from the shape of the
distribution, namely constrained on the low side and unconstrained on the high side.... The
logarithmic transform acts to suppress the outliers so that the mean is a much better
representation of the central tendency of the sample data."
Transformation to the logarithmic scale is not done to make "large numbers look smaller."
Performing a logarithmic or other monotonic transformation preserves the basic ordering within a
data set, so that the data are merely rescaled with a different set of units. Just as the physical
difference between 80° Fahrenheit and 30° Fahrenheit does not change if the temperatures are
rescaled or transformed to the numerically lower Celsius scale, so too the basic statistical
relationships between data measurements remain the same whether or not the log transformation
is applied. What does change is that the logarithms of Lognormally distributed data are more
nearly Normal in character, thus satisfying a key assumption of many statistical procedures.
Because of this fact, the same tests used to check Normality, if run on the logged data, become
tests for Lognormality.
If the assumption of Lognormality is not rejected, further statistical analyses should be
performed on the logged observations, not the original data. If the Lognormal distribution is
rejected by a statistical test, one can either test the Normality of the original data, if it was not
already done, or use a non-parametric technique on the ranks of the observations.
If no data are initially available to test the distributional assumptions, "referencing" may be
employed to justify the use of, say, a Normal or Lognormal assumption in developing a statistical
testing regimen at a particular site. "Referencing" involves the use of historical data or data from
sites in similar hydrogeologic settings to justify the assumptions applied to currently planned
statistical tests. These initial assumptions must be checked when data from the site become
available, using the procedures described in this Addendum. Subsequent changes to the initial
assumptions should be made if formal testing contradicts the initial hypothesis.
1.1.1 Interim Final Guidance Methods for Checking Normality
The Interim Final Guidance outlines three different methods for checking Normality: the
Coefficient-of-Variation (CV) test, Probability Plots, and the Chi-squared test. Of these three,
only Probability Plots are recommended within this Addendum. The Coefficient-of-Variation and
the Chi-squared test each have potential problems that can be remedied by using alternative tests.
These alternatives include the Coefficient of Skewness, the Shapiro-Wilk test, the Shapiro-Francia
test, and the Probability Plot Correlation Coefficient.
The Coefficient-of-Variation is recommended within the Interim Guidance because it is easy
to calculate and is amenable to small sample sizes. To ensure that a Normal model which predicts
a significant fraction of negative concentration values is not fitted to positive data, the Interim
Final Guidance recommends that the sample Coefficient of Variation be less than one; otherwise
this "test" of Normality fails. A drawback to using the sample CV is that for Normally distributed
data, one can often get a sample CV greater than one when the true CV is only between 0.5 and
1. In other words, the sample CV, being a random variable, often estimates the true Coefficient
of Variation with some error. Even if a Normal distribution model is appropriate, the Coefficient
of Variation test may reject the model because the sample CV (but not the true CV) is too large.
The real purpose of the CV is to estimate the skewness of a dataset, not to test Normality.
Truly Normal data can have any non-zero Coefficient of Variation, though the larger the CV, the
greater the proportion of negative values predicted by the model. As such, a Normal distribution
with large CV may be a poor model for positive concentration data. However, if the Coefficient
of Variation test is used on the logarithms of the data to test Lognormality, negative logged
concentrations will often be expected, nullifying the rationale used to support the CV test in the
first place. A better way to estimate the skewness of a dataset is to compute the Coefficient of
Skewness directly, as described below.
The Chi-square test is also recommended within the Interim Guidance. Though an
acceptable goodness-of-fit test, it is not considered the most sensitive or powerful test of
Normality in the current literature (Gan and Koehler, 1990). The major drawback to the Chi-
square test can be explained by considering the behavior of parametric tests based on the Normal
distribution. Most tests like the t-test or Analysis of Variance (ANOVA), which assume the
underlying data to be Normally distributed, give fairly robust results when the Normality
assumption fails over the middle ranges of the data distribution. That is, if the extreme tails are
approximately Normal in shape even if the middle part of the density is not, these parametric tests
will still tend to produce valid results. However, if the extreme tails are non-Normal in shape
(e.g., highly skewed), Normal-based tests can lead to false conclusions, meaning that either a
transformation of the data or a non-parametric technique should be used instead.
The Chi-square test entails a division of the sample data into bins or cells representing
distinct, non-overlapping ranges of the data values (see figure below). In each bin, an expected
value is computed based on the number of data points that would be found if the Normal
distribution provided an appropriate model. The squared difference between the expected number
and observed number is then computed and summed over all the bins to calculate the Chi-square
test statistic.
[Figure: CHI SQUARE GOODNESS OF FIT]
If the Chi-square test indicates that the data are not Normally distributed, it may not be clear
what ranges of the data most violate the Normality assumption. Departures from Normality in the
middle bins are given nearly the same weight as departures from the extreme tail bins, and all the
departures are summed together to form the test statistic. As such, the Chi-square test is not as
powerful for detecting departures from Normality in the extreme tails of the data, the areas most
crucial to the validity of parametric tests like the t-test or ANOVA (Miller, 1986). Furthermore,
even if there are departures in the tails, but the middle portion of the data distribution is
approximately Normal, the Chi-square test may not register as statistically significant in certain
cases where better tests of Normality would. Because of this, four alternative, more sensitive
tests of Normality are suggested below which can be used in conjunction with Probability Plots.
1.1.2 Probability Plots
As suggested within the Interim Final Guidance, a simple, yet useful graphical test for
Normality is to plot the data on probability paper. The y-axis is scaled to represent probabilities
according to the Normal distribution and the data are arranged in increasing order. An observed
value is plotted on the x-axis and the proportion of observations less than or equal to each
observed value is plotted as the y-coordinate. The scale is constructed so that, if the data are
Normal, the points when plotted will approximate a straight line. Visually apparent curves or
bends indicate that the data do not follow a Normal distribution (see Interim Final Guidance, pp.
4-8 to 4-11).
Probability Plots are particularly useful for spotting irregularities within the data when
compared to a specific distributional model like the Normal. It is easy to determine whether
departures from Normality are occurring more or less in the middle ranges of the data or in the
extreme tails. Probability Plots can also indicate the presence of possible outlier values that do
not follow the basic pattern of the data and can show the presence of significant positive or
negative skewness.
If a (Normal) Probability Plot is done on the combined data from several wells and
Normality is accepted, it implies that all of the data came from the same Normal distribution.
Consequently, each subgroup of the data set (e.g., observations from distinct wells), has the same
mean and standard deviation. If a Probability Plot is done on the data residuals (each value minus
its subgroup mean) and is not a straight line, the interpretation is more complicated. In this case,
either the residuals are not Normal, or there is a subgroup of the data with a Normal distribution
but a different mean or standard deviation than the other subgroups. The Probability Plot will
indicate a deviation from the underlying Normality assumption either way.
The same Probability Plot technique may be used to investigate whether a set of data or
residuals follows the Lognormal distribution. The procedure is the same, except that one first
replaces each observation by its natural logarithm. After the data have been transformed to their
natural logarithms, the Probability Plot is constructed as before. The only difference is that the
natural logarithms of the observations are used on the x-axis. If the data are Lognormal, the
Probability Plot (on Normal probability paper) of the logarithms of the observations will
approximate a straight line.
Many statistical software packages for personal computers will construct Probability Plots
automatically with a simple command or two. If such software is available, there is no need to
construct Probability Plots by hand or to obtain special graph paper. The plot itself may be
generated somewhat differently than the method described above. In some packages, the
observed value is plotted as before on the x-axis. The y-axis, however, now represents the
quantile of the Normal distribution (often referred to as the "Normal score of the observation")
corresponding to the cumulative probability of the observed value. The y-coordinate is often
computed by the following formula:
y_i = F^{-1}\!\left( \frac{i}{n+1} \right)

where F^{-1} denotes the inverse of the cumulative Normal distribution, n represents the sample
size, and i represents the rank position of the ith ordered concentration. Since the computer does
these calculations automatically, the formula does not have to be computed by hand.
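As an illustration only, the short Python sketch below (assuming NumPy and SciPy are available) computes the plotting positions and Normal quantiles described above; the data vector and the function name are simply illustrative.

```python
# Sketch: plotting positions and Normal quantiles for a Probability Plot,
# using the formula y_i = F^-1( i/(n+1) ) given above.
import numpy as np
from scipy.stats import norm

def normal_scores(data):
    """Return the ordered data and their corresponding Normal quantiles."""
    x = np.sort(np.asarray(data, dtype=float))
    n = len(x)
    cum_probs = np.arange(1, n + 1) / (n + 1)   # i/(n+1), i = 1,...,n
    return x, norm.ppf(cum_probs)               # F^-1 of each probability

# Nickel concentrations (ppb) from Example 1 below
nickel = [58.8, 1.0, 262, 56, 8.7, 19, 81.5, 331, 14, 64.4,
          39, 151, 27, 21.4, 578, 3.1, 942, 85.6, 10, 637]
ordered, quantiles = normal_scores(nickel)
# Plotting 'ordered' (x-axis) against 'quantiles' (y-axis) gives the
# Probability Plot; a roughly straight line is consistent with Normality.
```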
EXAMPLE 1
Determine whether the following data set follows the Normal distribution by using a
Probability Plot.
Nickel Concentration (ppb)
Month Well 1 Well 2 Well 3 Well 4
1 58.8 19 39 3.1
2 1.0 81.5 151 942
3 262 331 27 85.6
4 56 14 21.4 10
5 8.7 64.4 578 637
SOLUTION
Step 1.List the measured nickel concentrations in order from lowest to highest.
Nickel Concentration (ppb)    Order (i)    Probability 100*(i/(n+1))    Normal Quantile
1 1 5 -1.645
3.1 2 10 -1.28
8.7 3 14 -1.08
10 4 19 -0.88
14 5 24 -0.706
19 6 29 -0.55
21.4 7 33 -0.44
27 8 38 -0.305
39 9 43 -0.176
56 10 48 -0.05
58.8 11 52 0.05
64.4 12 57 0.176
81.5 13 62 0.305
85.6 14 67 0.44
151 15 71 0.55
262 16 76 0.706
331 17 81 0.88
578 18 86 1.08
637 19 90 1.28
942 20 95 1.645
Step 2.The cumulative probability is given in the third column and is computed as
100*(i/(n+1)) where n is the total number of samples (n=20). The last column gives the
Normal quantiles corresponding to these probabilities.
Step 3.If using special graph paper, plot the probability versus the concentration for each
sample. Otherwise, plot the Normal quantile versus the concentration for each sample,
as in the plot below. The curvature found in the Probability Plot indicates that there is
evidence of non-Normality in the data.
[Figure: PROBABILITY PLOT of Normal quantiles versus Nickel (ppb)]
1.1.3 Coefficient of Skewness
The Coefficient of Skewness (g1) indicates to what degree a data set is skewed or
asymmetric with respect to the mean. Data from a Normal distribution will have a Skewness
Coefficient of zero, while asymmetric data will have a positive or negative skewness depending on
whether the right- or left-hand tail of the distribution is longer and skinnier than the opposite tail.
Since ground-water monitoring concentration data are inherently nonnegative, one often
expects the data to exhibit a certain degree of skewness. A small degree of skewness is not likely
to affect the results of statistical tests based on an assumption of Normality. However, if the
Skewness Coefficient is larger than 1 (in absolute value) and the sample size is small (e.g., n<25),
statistical research has shown that standard Normal theory-based tests are much less powerful
than when the absolute skewness is less than 1 (Gayen, 1949).
Calculating the Skewness Coefficient is useful and not much more difficult than computing
the Coefficient of Variation. It provides a quick indication of whether the skewness is minimal
enough to assume that the data are roughly symmetric and hopefully Normal in distribution. If the
original data exhibit a high Skewness Coefficient, the Normal distribution will provide a poor
approximation to the data set. In that case, g1 can be computed on the logarithms of the data to
test for symmetry of the logged data.
The Skewness Coefficient may be computed using the following formula:
g_1 = \frac{\frac{1}{n} \sum_i \left( x_i - \bar{x} \right)^3}{\left( \frac{n-1}{n} \right)^{3/2} \mathrm{SD}^3}
where the numerator represents the average cubed residual and SD denotes the standard deviation
of the measurements. Most statistics computer packages (e.g., Minitab, GEO-EAS) will compute
the Skewness Coefficient automatically via a simple command.
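For readers without such a package, the following Python sketch (an illustration, not part of the guidance) computes g1 exactly as defined above; the nickel data of Example 1 are used so the results can be checked against Example 2.

```python
# Sketch: Coefficient of Skewness g1 as defined above, i.e.,
# g1 = [ (1/n) * sum (x_i - xbar)^3 ] / [ ((n-1)/n)^(3/2) * SD^3 ]
import numpy as np

def skewness_coefficient(data):
    x = np.asarray(data, dtype=float)
    n = len(x)
    sd = x.std(ddof=1)                              # SD with n-1 denominator
    avg_cubed_resid = np.sum((x - x.mean()) ** 3) / n
    return avg_cubed_resid / (((n - 1) / n) ** 1.5 * sd ** 3)

nickel = [58.8, 1.0, 262, 56, 8.7, 19, 81.5, 331, 14, 64.4,
          39, 151, 27, 21.4, 578, 3.1, 942, 85.6, 10, 637]
print(skewness_coefficient(nickel))             # about 1.84 (Example 2)
print(skewness_coefficient(np.log(nickel)))     # about -0.24 for the logged data
```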
EXAMPLE 2
Using the data in Example 1, compute the Skewness Coefficient to test for approximate
symmetry in the data.
SOLUTION
Step 1.Compute the mean, standard deviation (SD), and average cubed residual for the nickel
concentrations:
\bar{x} = 169.52 \text{ ppb}

\mathrm{SD} = 259.72 \text{ ppb}

\tfrac{1}{n} \sum_i \left( x_i - \bar{x} \right)^3 = 2.98923 \times 10^{7} \text{ ppb}^3
Step 2.Calculate the Coefficient of Skewness using the previous formula to get g1=1.84. Since
the skewness is much larger than 1, the data appear to be significantly positively
skewed. Do not assume that the data follow a Normal distribution.
Step 3.Since the original data evidence a high degree of skewness, one can attempt to compute
the Skewness Coefficient on the logged data instead. In that case, the skewness works
out to be |g1|= 0.24 < 1, indicating that the logged data values are slightly skewed, but
not enough to reject an assumption of Normality in the logged data. In other words,
the original data may be Lognormally distributed.
1.1.4 The Shapiro-Wilk Test of Normality (n≤50)
The Shapiro-Wilk test is recommended as a superior alternative to the Chi-square test for
testing Normality of the data. It is based on the premise that if a set of data are Normally
distributed, the ordered values should be highly correlated with corresponding quantiles taken
from a Normal distribution (Shapiro and Wilk, 1965). In particular, the Shapiro-Wilk test gives
substantial weight to evidence of non-Normality in the tails of a distribution, where the robustness
of statistical tests based on the Normality assumption is most severely affected. The Chi-square
test treats departures from Normality in the tails nearly the same as departures in the middle of a
distribution, and so is less sensitive to the types of non-Normality that are most crucial. One
cannot tell from a significant Chi-square goodness-of-fit test what sort of non-Normality is
indicated.
The Shapiro-Wilk test statistic (W) will tend to be large when a Probability Plot of the data
indicates a nearly straight line. Only when the plotted data show significant bends or curves will
the test statistic be small. The Shapiro-Wilk test is considered to be one of the very best tests of
Normality available (Miller, 1986; Madansky, 1988).
To calculate the test statistic W, one can use the following formula:
W = \left[ \frac{b}{\mathrm{SD} \sqrt{n-1}} \right]^2

where the numerator is computed as

b = \sum_{i=1}^{k} a_{n-i+1} \left( x_{(n-i+1)} - x_{(i)} \right) = \sum_{i=1}^{k} b_i
In this last formula, x(j) represents the jth smallest ordered value in the sample and
coefficients aj depend on the sample size n. The coefficients can be found for any sample size
from 3 up to 50 in Table A-1 of Appendix A. The value of k can be found as the greatest integer
less than or equal to n/2.
Normality of the data should be rejected if the Shapiro-Wilk statistic is too low when
compared to the critical values provided in Table A-2 of Appendix A. Otherwise one can assume
the data are approximately Normal for purposes of further statistical analysis. As before, it is
recommended that the test first be performed on the logarithms of the original data to test for
Lognormality. If the logged data indicate non-Normality by the Shapiro-Wilk test, a re-test can
be performed on the original data to test for Normality of the original concentrations.
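As an illustration, the Shapiro-Wilk statistic can also be obtained from standard statistical software. The Python sketch below assumes SciPy is available; note that scipy.stats.shapiro approximates the Table A-1 coefficients and reports a p-value, so its W may differ slightly from the hand calculation shown in Example 3.

```python
# Sketch: Shapiro-Wilk test on the raw and log-transformed data.
import numpy as np
from scipy.stats import shapiro

nickel = [58.8, 1.0, 262, 56, 8.7, 19, 81.5, 331, 14, 64.4,
          39, 151, 27, 21.4, 578, 3.1, 942, 85.6, 10, 637]

w_raw, p_raw = shapiro(nickel)           # raw concentrations
w_log, p_log = shapiro(np.log(nickel))   # natural logs (test of Lognormality)
print(w_raw, p_raw)    # a low W (small p-value) indicates non-Normality
print(w_log, p_log)    # a large W here supports the Lognormal model
```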
EXAMPLE 3
Use the data of Example 1 to compute the Shapiro-Wilk test of Normality.
SOLUTION
Step 1.Order the data from smallest to largest and list, as in the following table. Also list the
data in reverse order alongside the first column.
Step 2.Compute the differences x(n-i+1)-x(i) in column 3 of the table by subtracting column 1
from column 2.
i    x(i)    x(n-i+1)    x(n-i+1)-x(i)    a(n-i+1)    b(i)
1 1.0 942.0 941.0 .4734 445.47
2 3.1 637.0 633.9 .3211 203.55
3 8.7 578.0 569.3 .2565 146.03
4 10.0 331.0 321.0 .2085 66.93
5 14.0 262.0 248.0 .1686 41.81
6 19.0 151.0 132.0 .1334 17.61
7 21.4 85.6 64.2 .1013 6.50
8 27.0 81.5 54.5 .0711 3.87
9 39.0 64.4 25.4 .0422 1.07
10 56.0 58.8 2.8 .0140 0.04
11 58.8 56.0 -2.8 b=932.88
12 64.4 39.0 -25.4
13 81.5 27.0 -54.5
14 85.6 21.4 -64.2
15 151.0 19.0 -132.0
16 262.0 14.0 -248.0
17 331.0 10.0 -321.0
18 578.0 8.7 -569.3
19 637.0 3.1 -633.9
20 942.0 1.0 -941.0
Step 3.Compute k as the greatest integer less than or equal to n/2. Since n=20, k=10 in this
example.
Step 4.Look up the coefficients an-i+1 from Table A-1 and list in column 4. Multiply the
differences in column 3 by the coefficients in column 4 and add the first k products to
get quantity b. In this case, b=932.88.
Step 5.Compute the standard deviation of the sample, SD=259.72. Then
W = \left[ \frac{932.88}{259.72 \sqrt{19}} \right]^2 = 0.679.
Step 6.Compare the computed value of W=0.679 to the 5% critical value for sample size 20 in
Table A-2, namely W.05,20=0.905. Since W < 0.905, the sample shows significant
evidence of non-Normality by the Shapiro-Wilk test. The data should be transformed
using natural logs and rechecked using the Shapiro-Wilk test before proceeding with
further statistical analysis (Actually, the logged data should have been tested first. The
original concentration data are used in this example to illustrate how the assumption of
Normality can be rejected.)
1.1.5 The Shapiro-Francia Test of Normality (n>50)
The Shapiro-Wilk test of Normality can be used for sample sizes up to 50. When the
sample is larger than 50, a slight modification of the procedure called the Shapiro-Francia test
(Shapiro and Francia, 1972) can be used instead.
Like the Shapiro-Wilk test, the Shapiro-Francia test statistic (W´) will tend to be large when
a Probability Plot of the data indicates a nearly straight line. Only when the plotted data show
significant bends or curves will the test statistic be small.
To calculate the test statistic W´, one can use the following formula:
W' = \frac{\left[ \sum_i m_i x_{(i)} \right]^2}{(n-1)\,\mathrm{SD}^2 \sum_i m_i^2}
where x(i) represents the ith ordered value of the sample and where mi denotes the approximate
expected value of the ith ordered Normal quantile. The values for mi can be approximately
computed as
m_i = F^{-1}\!\left( \frac{i}{n+1} \right)

where F^{-1} denotes the inverse of the standard Normal distribution with zero mean and unit
variance. These values can be computed by hand using a Normal probability table or via simple
commands in many statistical computer packages.
Normality of the data should be rejected if the Shapiro-Francia statistic is too low when
compared to the critical values provided in Table A-3 of Appendix A. Otherwise one can assume
the data are approximately Normal for purposes of further statistical analysis. As before, the
logged data should be tested first to see if a Lognormal model is appropriate. If these data
indicate non-Normality by the Shapiro-Francia test, a re-test can be performed on the original
data.
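Since the Shapiro-Francia statistic is not built into most common packages, the following Python sketch (illustrative only; the function name is arbitrary) implements W' directly from the formula above. Critical values still come from Table A-3.

```python
# Sketch: Shapiro-Francia statistic W', with m_i approximated by F^-1(i/(n+1)).
import numpy as np
from scipy.stats import norm

def shapiro_francia(data):
    x = np.sort(np.asarray(data, dtype=float))
    n = len(x)
    m = norm.ppf(np.arange(1, n + 1) / (n + 1))   # approximate Normal quantiles
    numerator = np.sum(m * x) ** 2
    denominator = (n - 1) * x.var(ddof=1) * np.sum(m ** 2)
    return numerator / denominator

# Usage (hypothetical vector of more than 50 logged measurements):
# w_prime = shapiro_francia(logged_data)
```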
1.1.6 The Probability Plot Correlation Coefficient
One other alternative test for Normality that is roughly equivalent to the Shapiro-Wilk and
Shapiro-Francia tests is the Probability Plot Correlation Coefficient test described by Filliben
(1975). This test fits in perfectly with the use of Probability Plots, because the essence of the test
is to compute the common correlation coefficient for points on a Probability Plot. Since the
correlation coefficient is a measure of the linearity of the points on a scatterplot, the Probability
Plot Correlation Coefficient, like the Shapiro-Wilk test, will be high when the plotted points fall
along a straight line and low when there are significant bends and curves in the Probability Plot.
Comparison of the Shapiro-Wilk and Probability Plot Correlation Coefficient tests has indicated
very similar statistical power for detecting non-Normality (Ryan and Joiner, 1976).
The construction of the test statistic is somewhat different from the Shapiro-Wilk W, but
not difficult to implement. Also, tabled critical values for the correlation coefficient have been
derived for sample sizes up to n=100 (and are reproduced in Table A-4 of Appendix A). The
Probability Plot Correlation Coefficient may be computed as
r = \frac{\sum_{i=1}^{n} X_{(i)} M_i - n \bar{X} \bar{M}}{C_n \cdot \mathrm{SD} \sqrt{n-1}}

where X_{(i)} represents the ith smallest ordered concentration value, M_i is the median of the ith
order statistic from a standard Normal distribution, and \bar{X} and \bar{M} represent the average values of
the X_{(i)} and M_i. The ith Normal order statistic median may be approximated as M_i = F^{-1}(m_i), where,
as before, F^{-1} is the inverse of the standard Normal cumulative distribution and m_i can be
computed as follows (given sample size n):

m_i = \begin{cases} 1 - (0.5)^{1/n} & \text{for } i = 1 \\ \left( i - 0.3175 \right) / \left( n + 0.365 \right) & \text{for } 1 < i < n \\ (0.5)^{1/n} & \text{for } i = n \end{cases}

Quantity C_n represents the square root of the sum of squares of the M_i's minus n times the
average value \bar{M}, that is

C_n = \sqrt{ \sum_i M_i^2 - n \bar{M}^2 }

When working with a complete sample (i.e., containing no nondetects or censored values), the
average value \bar{M} = 0, and so the formula for the Probability Plot Correlation Coefficient simplifies
to

r = \frac{\sum_i X_{(i)} M_i}{\sqrt{\sum_i M_i^2} \cdot \mathrm{SD} \sqrt{n-1}}
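The following Python sketch (illustrative; assumes SciPy is available) computes Filliben's statistic for a complete sample using the simplified formula above, and can be checked against the hand calculations in Examples 4 and 5.

```python
# Sketch: Probability Plot Correlation Coefficient (Filliben's r) for a
# complete sample (no nondetects), using the simplified formula above.
import numpy as np
from scipy.stats import norm

def filliben_r(data):
    x = np.sort(np.asarray(data, dtype=float))
    n = len(x)
    m = np.empty(n)
    m[0] = 1.0 - 0.5 ** (1.0 / n)                  # i = 1
    m[-1] = 0.5 ** (1.0 / n)                       # i = n
    i = np.arange(2, n)
    m[1:-1] = (i - 0.3175) / (n + 0.365)           # 1 < i < n
    M = norm.ppf(m)                                # order statistic medians
    sd = x.std(ddof=1)
    return np.sum(x * M) / (np.sqrt(np.sum(M ** 2)) * sd * np.sqrt(n - 1))

nickel = [58.8, 1.0, 262, 56, 8.7, 19, 81.5, 331, 14, 64.4,
          39, 151, 27, 21.4, 578, 3.1, 942, 85.6, 10, 637]
print(filliben_r(nickel))            # about 0.82 (Example 4)
print(filliben_r(np.log(nickel)))    # about 0.99 for the logged data (Example 5)
```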
EXAMPLE 4
Use the data of Example 1 to compute the Probability Plot Correlation Coefficient test.
SOLUTION
Step 1.Order the data from smallest to largest and list, as in the following table.
Step 2.Compute the quantities mi from Filliben's formula above for each i in column 2 and the
order statistic medians, Mi, in column 3 by applying the inverse Normal transformation
to column 2.
Step 3.Since this sample contains no nondetects, the simplified formula for r may be used.
Compute the products X(i)*Mi in column 4 and sum to get the numerator of the
correlation coefficient (equal to 3,836.81 in this case). Also compute Mi^2 in column 5
and sum to find quantity Cn^2=17.12.
i    x(i)    mi    Mi    X(i)*Mi    Mi^2
1 1.0 .03406 -1.8242 -1.824 3.328
2 3.1 .08262 -1.3877 -4.302 1.926
3 8.7 .13172 -1.1183 -9.729 1.251
4 10.0 .18082 -0.9122 -9.122 0.832
5 14.0 .22993 -0.7391 -10.347 0.546
6 19.0 .27903 -0.5857 -11.129 0.343
7 21.4 .32814 -0.4451 -9.524 0.198
8 27.0 .37724 -0.3127 -8.444 0.098
9 39.0 .42634 -0.1857 -7.242 0.034
10 56.0 .47545 -0.0616 -3.448 0.004
11 58.8 .52455 0.0616 3.621 0.004
12 64.4 .57366 0.1857 11.959 0.034
13 81.5 .62276 0.3127 25.488 0.098
14 85.6 .67186 0.4451 38.097 0.198
15 151.0 .72097 0.5857 88.445 0.343
16 262.0 .77007 0.7391 193.638 0.546
17 331.0 .81918 0.9122 301.953 0.832
18 578.0 .86828 1.1183 646.376 1.251
19 637.0 .91738 1.3877 883.941 1.926
20 942.0 .96594 1.8242 1718.408 3.328
Step 4.Compute the Probability Plot Correlation Coefficient using the simplified formula for r,
where SD=259.72 and Cn=4.1375, to get
r = \frac{3836.81}{(4.1375)(259.72)\sqrt{19}} = 0.819
Step 5.Compare the computed value of r=0.819 to the 5% critical value for sample size 20 in
Table A-4, namely R.05,20=0.950. Since r < 0.950, the sample shows significant
evidence of non-Normality by the Probability Plot Correlation Coefficient test. The
data should be transformed using natural logs and the correlation coefficient
recalculated before proceeding with further statistical analysis.
EXAMPLE 5
The data in Examples 1, 2, 3, and 4 showed significant evidence of non-Normality. Instead
of first logging the concentrations before testing for Normality, the original data were used. This
was done to illustrate why the Lognormal distribution is usually a better default model than the
Normal. In this example, use the same data to determine whether the measurements better follow
a Lognormal distribution.
Computing the natural logarithms of the data gives the table below.
Logged Nickel Concentrations log (ppb)
Month Well 1 Well 2 Well 3 Well 4
1 4.07 2.94 3.66 1.13
2 0.00 4.40 5.02 6.85
3 5.57 5.80 3.30 4.45
4 4.03 2.64 3.06 2.30
5 2.16 4.17 6.36 6.46
SOLUTION
Method 1. Probability Plots
Step 1.List the natural logarithms of the measured nickel concentrations in order from lowest
to highest.
Order (i)    Log Nickel Concentration log(ppb)    Probability 100*(i/(n+1))    Normal Quantile
1 0.00 5 -1.645
2 1.13 10 -1.28
3 2.16 14 -1.08
4 2.30 19 -0.88
5 2.64 24 -0.706
6 2.94 29 -0.55
7 3.06 33 -0.44
8 3.30 38 -0.305
9 3.66 43 -0.176
10 4.03 48 -0.05
11 4.07 52 0.05
12 4.17 57 0.176
13 4.40 62 0.305
14 4.45 67 0.44
15 5.02 71 0.55
16 5.57 76 0.706
17 5.80 81 0.88
18 6.36 86 1.08
19 6.46 90 1.28
20 6.85 95 1.645
Step 2.Compute the probability as shown in the third column by calculating 100*(i/(n+1)),
where n is the total number of samples (n=20). The corresponding Normal quantiles
are given in column 4.
Step 3.Plot the Normal quantiles against the natural logarithms of the observed concentrations
to get the following graph. The plot indicates a nearly straight line fit (verified by
calculation of the Correlation Coefficient given in Method 4). There is no substantial
evidence that the data do not follow a Lognormal distribution. The Normal-theory
procedure(s) should be performed on the log-transformed data.
[Figure: PROBABILITY PLOT of Normal quantiles versus LN(Nickel) in LN(ppb)]
Method 2.Coefficient of Skewness
Step 1.Calculate the mean, SD, and average cubed residuals of the natural logarithms of the
data.
\bar{x} = 3.918 \text{ log(ppb)}

\mathrm{SD} = 1.802 \text{ log(ppb)}

\tfrac{1}{n} \sum_i \left( x_i - \bar{x} \right)^3 = -1.325 \text{ log}^3\text{(ppb)}
Step 2.Calculate the Skewness Coefficient, g1.
g_1 = \frac{-1.325}{\left( 0.95 \right)^{3/2} \left( 1.802 \right)^3} = -0.244
Step 3.Compute the absolute value of the skewness, |g1|=|-0.244|=0.244.
Step 4.Since the absolute value of the Skewness Coefficient is less than 1, the data do not
show evidence of significant skewness. A Normal approximation to the log-
transformed data may therefore be appropriate, but this model should be further
checked.
Method 3.Shapiro-Wilk Test
Step 1.Order the logged data from smallest to largest and list, as in following table. Also list
the data in reverse order and compute the differences x(n-i+1)-x(i).
i    LN(x(i))    LN(x(n-i+1))    a(n-i+1)    b(i)
1 0.00 6.85 .4734 3.24
2 1.13 6.46 .3211 1.71
3 2.16 6.36 .2565 1.08
4 2.30 5.80 .2085 0.73
5 2.64 5.57 .1686 0.49
6 2.94 5.02 .1334 0.28
7 3.06 4.45 .1013 0.14
8 3.30 4.40 .0711 0.08
9 3.66 4.17 .0422 0.02
10 4.03 4.07 .0140 0.00
11 4.07 4.03 b=7.77
12 4.17 3.66
13 4.40 3.30
14 4.45 3.06
15 5.02 2.94
16 5.57 2.64
17 5.80 2.30
18 6.36 2.16
19 6.46 1.13
20 6.85 0.00
Step 2.Compute k=10, since n/2=10. Look up the coefficients an-i+1 from Table A-1 and
multiply by the first k differences between columns 2 and 1 to get the quantities bi.
Add these 10 products to get b=7.77.
Step 3.Compute the standard deviation of the logged data, SD=1.8014. Then the Shapiro-
Wilk statistic is given by
W = \left[ \frac{7.77}{1.8014 \sqrt{19}} \right]^2 = 0.979.
Step 4.Compare the computed value of W to the 5% critical value for sample size 20 in Table
A-2, namely W.05,20=0.905. Since W=0.979>0.905, the sample shows no significant
evidence of non-Normality by the Shapiro-Wilk test. Proceed with further statistical
analysis using the log-transformed data.
Method 4.Probability Plot Correlation Coefficient
Step 1.Order the logged data from smallest to largest and list below.
Order (i)    Log Nickel Concentration log(ppb)    mi    Mi    X(i)*Mi    Mi^2
1 0.00 .03406 -1.8242 0.000 3.328
2 1.13 .08262 -1.3877 -1.568 1.926
3 2.16 .13172 -1.1183 -2.416 1.251
4 2.30 .18082 -0.9122 -2.098 0.832
5 2.64 .22993 -0.7391 -1.951 0.546
6 2.94 .27903 -0.5857 -1.722 0.343
7 3.06 .32814 -0.4451 -1.362 0.198
8 3.30 .37724 -0.3127 -1.032 0.098
9 3.66 .42634 -0.1857 -0.680 0.034
10 4.03 .47545 -0.0616 -0.248 0.004
11 4.07 .52455 0.0616 0.251 0.004
12 4.17 .57366 0.1857 0.774 0.034
13 4.40 .62276 0.3127 1.376 0.098
14 4.45 .67186 0.4451 1.981 0.198
15 5.02 .72097 0.5857 2.940 0.343
16 5.57 .77007 0.7391 4.117 0.546
17 5.80 .81918 0.9122 5.291 0.832
18 6.36 .86828 1.1183 7.112 1.251
19 6.46 .91738 1.3877 8.965 1.926
20 6.85 .96594 1.8242 12.496 3.328
Step 2.Compute the quantities mi and the order statistic medians Mi, according to the
procedure in Example 4 (note that these values depend only on the sample size and are
identical to the quantities in Example 4).
Step 3.Compute the products X(i)*Mi in column 4 and sum to get the numerator of the
correlation coefficient (equal to 32.226 in this case). Also compute Mi^2 in column 5
and sum to find quantity Cn^2=17.12.
Step 4.Compute the Probability Plot Correlation Coefficient using the simplified formula for r,
where SD=1.8025 and Cn=4.1375, to get
r = \frac{32.226}{(4.1375)(1.8025)\sqrt{19}} = 0.991
Step 5.Compare the computed value of r=0.991 to the 5% critical value for sample size 20 in
Table A-4, namely R.05,20=0.950. Since r > 0.950, the logged data show no significant
evidence of non-Normality by the Probability Plot Correlation Coefficient test.
Therefore, Lognormality of the original data could be assumed in subsequent statistical
procedures.
1.2 TESTING FOR HOMOGENEITY OF VARIANCE
One of the most important assumptions for the parametric analysis of variance (ANOVA) is
that the different groups (e.g., different wells) have approximately the same variance. If this is not
the case, the power of the F-test (its ability to detect differences among the group means) is
reduced. Mild differences in variance are not too bad. The effect becomes noticeable when the
largest and smallest group variances differ by a ratio of about 4 and becomes quite severe when
the ratio is 10 or more (Milliken and Johnson, 1984).
The procedure suggested in the EPA guidance document, Bartlett's test, is one way to test
whether the sample data give evidence that the well groups have different variances. However,
Bartlett's test is sensitive to non-Normality in the data and may give misleading results unless one
knows in advance that the data are approximately Normal (Milliken and Johnson, 1984). As an
alternative to Bartlett's test, two procedures for testing homogeneity of the variances are
described below that are less sensitive to non-Normality.
1.2.1 Box Plots
Box Plots were first developed for exploratory data analysis as a quick way to visualize the
"spread" or dispersion within a data set. In the context of variance testing, one can construct a
Box Plot for each well group and compare the boxes to see if the assumption of equal variances is
reasonable. Such a comparison is not a formal test procedure, but is easier to perform and is
often sufficient for checking the group variance assumption.
The idea behind a Box Plot is to order the data from lowest to highest and to trim off 25
percent of the observations on either end, leaving just the middle 50 percent of the sample values.
The spread between the lowest and highest values of this middle 50 percent (known as the
interquartile range or IQR) is represented by the length of the box. The very middle observation
(i.e., the median) can also be shown as a line cutting the box in two.
To construct a Box Plot, calculate the median and the lower and upper quartiles of the data set
(respectively, the 50th, 25th, and 75th percentiles). To do this, calculate k=p(n+1)/100 where
n=number of samples and p=percentile of interest. If k is an integer, let the kth ordered or ranked
value be an estimate of the pth percentile of the data. If k is not an integer, let the pth percentile
be equal to the average of the two values closest in rank position to k. For example, if the data
set consists of the 10 values {1, 4, 6.2, 10, 15, 17.1, 18, 22, 25, 30.5}, the position of the median
would be found as 50*(10+1)/100=5.5. The median would then be computed as the average of
the 5th and 6th ordered values, or (15+17.1)/2=16.05.
Likewise, the position of the lower quartile would be 25*(10+1)/100=2.75. Calculate the
average of the 2nd and 3rd ordered observations to estimate this percentile, i.e., (4+6.2)/2=5.1.
The upper quartile is similarly found to be 23.5, so the length of the Box Plot would be the difference
between the upper and lower quartiles, or (23.5–5.1)=18.4. The box itself should be drawn on a
graph with the y-axis representing concentration and the x-axis denoting the wells being plotted.
Three horizontal lines are drawn for each well, one line each at the lower and upper quartiles and
another at the median concentration. Vertical connecting lines are drawn to complete the box.
Most statistics packages can directly calculate the statistics needed to draw a Box Plot, and
many will construct the Box Plots as well. In some computer packages, the Box Plot will also
have two "whiskers" extending from the edges of the box. These lines indicate the positions of
extreme values in the data set, but generally should not be used to approximate the overall
dispersion.
If the box length for each group is less than 3 times the length of the shortest box, the
sample variances are probably close enough to assume equal group variances. If, however, the
box length for any group is at least triple the length of the box for another group, the variances
may be significantly different (Kirk Cameron, SAIC, personal communication). In that case, the
data should be further checked using Levene’s test described in the following section. If Levene’s
test is significant, the data may need to be transformed or a non-parametric rank procedure
considered before proceeding with further analysis.
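As a rough illustration of this screening step, the Python sketch below implements the p*(n+1)/100 rank rule for percentiles described above (many packages use a slightly different default rule) and flags a 3-to-1 or greater difference in box lengths; the arsenic data from Example 6 are used.

```python
# Sketch: interquartile ranges (box lengths) by the p*(n+1)/100 rank rule and
# a simple check of the 3-to-1 comparison across well groups.
import numpy as np

def percentile_np1(data, p):
    """pth percentile: rank k = p*(n+1)/100; average the two nearest ranked
    values when k is not an integer (ranks clipped to 1..n)."""
    x = np.sort(np.asarray(data, dtype=float))
    n = len(x)
    k = p * (n + 1) / 100.0
    lo = int(np.clip(np.floor(k), 1, n)) - 1
    hi = int(np.clip(np.ceil(k), 1, n)) - 1
    return x[lo] if lo == hi else 0.5 * (x[lo] + x[hi])

wells = {                                  # arsenic data from Example 6 (ppm)
    "Well 1": [22.9, 3.09, 35.7, 4.18], "Well 2": [2.0, 1.25, 7.8, 52],
    "Well 3": [2.0, 109.4, 4.5, 2.5],   "Well 4": [7.84, 9.3, 25.9, 2.0],
    "Well 5": [24.9, 1.3, 0.75, 27],    "Well 6": [0.34, 4.78, 2.85, 1.2],
}
iqr = {w: percentile_np1(v, 75) - percentile_np1(v, 25) for w, v in wells.items()}
print(iqr)
if max(iqr.values()) >= 3 * min(iqr.values()):
    print("Box lengths differ by 3x or more; follow up with Levene's test.")
```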
EXAMPLE 6
Construct Box Plots for each well group to test for equality of variances.
Arsenic Concentration (ppm)
Month Well 1 Well 2 Well 3 Well 4 Well 5 Well 6
1 22.9 2.0 2.0 7.84 24.9 0.34
2 3.09 1.25 109.4 9.3 1.3 4.78
3 35.7 7.8 4.5 25.9 0.75 2.85
4 4.18 52 2.5 2.0 27 1.2
SOLUTION
Step 1.Compute the 25th, 50th, and 75th percentiles for the data in each well group. To
calculate the pth percentile by hand, order the data from lowest to highest. Calculate
p*(n+1)/100 to find the ordered position of the pth percentile. If necessary, interpolate
between sample values to estimate the desired percentile.
Step 2.Using well 1 as an example, n+1=5 (since there are 4 data values). To calculate the
25th percentile, compute its ordered position (i.e., rank) as 25*5/100=1.25. Average
the 1st and 2nd ranked values at well 1 (i.e., 3.09 and 4.18) to find an estimated lower
quartile of 3.64. This estimate gives the lower end of the Box Plot. The upper end or
75th percentile can be computed similarly as the average of the 3rd and 4th ranked
values, or (22.9+35.7)/2=29.3. The median is the average of the 2nd and 3rd ranked
values, giving an estimate of 13.54.
Step 3.Construct Box Plots for each well group, lined up side by side on the same axes.
[Figure: BOX PLOTS OF WELL DATA (arsenic concentration in ppm, Wells 1 through 6)]
Step 4.Since the box length for well 3 is more than three times the box lengths for wells 4 and
6, there is evidence that the group variances may be significantly different. These data
should be further checked using Levene’s test described in the next section.
1.2.2 Levene's Test
Levene's test is a more formal procedure than Box Plots for testing homogeneity of variance
that, unlike Bartlett's test, is not sensitive to non-Normality in the data. Levene's test has been
shown to have power nearly as great as Bartlett's test for Normally distributed data and power
superior to Bartlett's for non-Normal data (Milliken and Johnson, 1984).
To conduct Levene's test, first compute the new variables
z_{ij} = \left| x_{ij} - \bar{x}_i \right|
where xij represents the jth value from the ith well and x̄i is the ith well mean. The values zij
represent the absolute values of the usual residuals. Then run a standard one-way analysis of
variance (ANOVA) on the variables zij. If the F-test is significant, reject the hypothesis of equal
group variances. Otherwise, proceed with analysis of the xij's as initially planned.
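As an illustration (assuming SciPy is available), Levene's test can be run directly: scipy.stats.levene with center='mean' uses the absolute residuals from the group means, matching the procedure just described (the SciPy default, center='median', is the Brown-Forsythe variant). On the arsenic data of Examples 6 and 7 it should reproduce an F-ratio of roughly 4.6.

```python
# Sketch: Levene's test (absolute residuals from the well means) on the
# Example 6/7 arsenic data.
from scipy.stats import levene

well1 = [22.9, 3.09, 35.7, 4.18]
well2 = [2.0, 1.25, 7.8, 52]
well3 = [2.0, 109.4, 4.5, 2.5]
well4 = [7.84, 9.3, 25.9, 2.0]
well5 = [24.9, 1.3, 0.75, 27]
well6 = [0.34, 4.78, 2.85, 1.2]

stat, p = levene(well1, well2, well3, well4, well5, well6, center='mean')
print(stat, p)   # a small p-value (e.g., below 0.05) indicates unequal variances
```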
EXAMPLE 7
Use the data from Example 6 to conduct Levene's test of equal variances.
SOLUTION
Step 1.Calculate the group mean for each well (x̄i):
Well 1 mean = 16.47 Well 4 mean = 11.26
Well 2 mean = 15.76 Well 5 mean = 13.49
Well 3 mean = 29.60 Well 6 mean = 2.29
Step 2.Compute the absolute residuals zij in each well and the well means of the residuals ( z i).
Absolute Residuals
Month Well 1 Well 2 Well 3 Well 4 Well 5 Well 6
1 6.43 13.76 27.6 3.42 11.41 1.95
2 13.38 14.51 79.8 1.96 12.19 2.49
3 19.23 7.96 25.1 14.64 12.74 0.56
4 12.29 36.24 27.1 9.26 13.51 1.09
Well Mean (z̄i) = 12.83   18.12   39.9   7.32   12.46   1.52
Overall Mean (z̄) = 15.36
Step 3.Compute the sums of squares for the absolute residuals.
SS_{\mathrm{TOTAL}} = (N-1)\,\mathrm{SD}_Z^2 = 6300.89

SS_{\mathrm{WELLS}} = \sum_i n_i \bar{z}_i^{\,2} - N \bar{z}^2 = 3522.90

SS_{\mathrm{ERROR}} = SS_{\mathrm{TOTAL}} - SS_{\mathrm{WELLS}} = 2777.99
Step 4.Construct an analysis of variance table to calculate the F-statistic. The degrees of
freedom (df) are computed as (#groups–1)=(6–1)=5 df and (#samples–#groups)=(24–
6)=18 df.
ANOVA Table
Source Sum-of-Squares df Mean-Square F-Ratio P
Between Wells 3522.90 5 704.58 4.56 0.007
Error 2777.99 18 154.33
Total 6300.89 23
Step 5.Since the F-statistic of 4.56 exceeds the tabulated value of F.05=2.77 with 5 and 18 df,
the assumption of equal variances should be rejected. Since the original concentration
data are used in this example, the data should be logged and retested.
2. RECOMMENDATIONS FOR HANDLING
NONDETECTS
The basic recommendations within the Interim Final Guidance for handling nondetect
analyses include the following (see p. 8-2): 1) if less than 15 percent of all samples are nondetect,
replace each nondetect by half its detection or quantitation limit and proceed with a parametric
analysis, such as ANOVA, Tolerance Limits, or Prediction Limits; 2) if the percent of nondetects
is between 15 and 50, either use Cohen's adjustment to the sample mean and variance in order to
proceed with a parametric analysis, or employ a non-parametric procedure by using the ranks of
the observations and by treating all nondetects as tied values; 3) if the percent of nondetects is
greater than 50 percent, use the Test of Proportions.
As to the first recommendation, experience at EPA and research at the United States
Geological Survey (USGS, Dennis Helsel, personal communication, 1991) has indicated that if
less than 15 percent of the samples are nondetect, the results of parametric statistical tests will not
be substantially affected if nondetects are replaced by half their detection limits. When more than
15 percent of the samples are nondetect, however, the handling of nondetects is more crucial to
the outcome of statistical procedures. Indeed, simple substitution methods tend to perform
poorly in statistical tests when the nondetect percentage is substantial (Gilliom and Helsel, 1986).
Even with a small proportion of nondetects, however, care should be taken when choosing
between the method detection limit (MDL) and the practical quantitation limit (PQL) in
characterizing “nondetect” concentrations. Many nondetects are characterized by analytical
laboratories with one of three data qualifier flags: "U," "J," or "E." Samples with a "U" data
qualifier represent "undetected" measurements, meaning that the signal characteristic of that
analyte could not be observed or distinguished from "background noise" during lab analysis.
Inorganic samples with an "E" flag and organic samples with a "J" flag may or may not be
reported with an estimated concentration. If no concentration is estimated, these samples
represent "detected but not quantified" measurements. In this case, the actual concentration is
assumed to be positive, but somewhere between zero and the PQL. Since all of these non-detects
may or may not have actual positive concentrations between zero and the PQL, the suggested
substitution for parametric statistical procedures is to replace each nondetect by one-half the PQL
(note, however, that "E" and "J" samples reported with estimated concentrations should be
treated, for statistical purposes, as valid measurements. Substitution of one-half the PQL is not
recommended for these samples).
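Purely as an illustration of this substitution rule, the short Python sketch below replaces "U"-qualified results by one-half the PQL when nondetects make up less than 15 percent of the samples; the column names and values are hypothetical.

```python
# Sketch: simple substitution of one-half the PQL for "U"-qualified nondetects
# (only advisable when nondetects are less than about 15% of the samples).
import pandas as pd

df = pd.DataFrame({
    "result":    [12.0, 7.5, None, 30.1, 14.2, 9.6, 22.4, 11.0],
    "qualifier": [None, None, "U", None, None, None, None, None],
    "pql":       [5.0] * 8,
})

nondetect = df["qualifier"].eq("U")
if nondetect.mean() < 0.15:
    df.loc[nondetect, "result"] = 0.5 * df.loc[nondetect, "pql"]
else:
    print("Over 15% nondetects: use Cohen's/Aitchison's adjustment or a "
          "non-parametric rank procedure instead.")
print(df)
```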
In no case should nondetect concentrations be assumed to be bounded above by the MDL.
The MDL is estimated on the basis of ideal laboratory conditions with ideal analyte samples and
does not account for matrix or other interferences encountered when analyzing specific, actual
field samples. For this reason, the PQL should be taken as the most reasonable upper bound for
nondetect concentrations.
It should also be noted that the distinction between “undetected” and “detected but not
quantified” measurements has more specific implications for rank-based non-parametric
procedures. Rather than assigning the same tied rank to all nondetects (see below and in Section
3), “detected but not quantified” measurements should be given larger ranks than those assigned
to “undetected” samples. In fact the two types of nondetects should be treated as two distinct
groups of tied observations for use in the Wilcoxon and Kruskal-Wallis non-parametric
procedures.
2.1 NONDETECTS IN ANOVA PROCEDURES
For a moderate to large percentage of nondetects (i.e., over 15%), the handling of
nondetects should vary depending on the statistical procedure to be run. If background data from
one or more upgradient wells are to be compared simultaneously with samples from one or more
downgradient wells via a t-test or ANOVA type procedure, the simplest and most reliable
recommendation is to switch to a non-parametric analysis. The distributional assumptions for
parametric procedures can be rather difficult to check when a substantial fraction of nondetects
exists. Furthermore, the non-parametric alternatives described in Section 3 tend to be efficient at
detecting contamination when the underlying data are Normally distributed, and are often more
powerful than the parametric methods when the underlying data do not follow a Normal
distribution.
Nondetects are handled easily in a nonparametric analysis. All data values are first ordered
and replaced by their ranks. Nondetects are treated as tied values and replaced by their midranks
(see Section 3). Then a Wilcoxon Rank-Sum or Kruskal-Wallis test is run on the ranked data
depending on whether one or more than one downgradient well is being tested.
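As a brief illustration (the data and detection limit are hypothetical), nondetects can be coded with a common surrogate value below all detected results so that they receive a single midrank, and the two-group comparison run with the Wilcoxon Rank-Sum test (equivalently, the Mann-Whitney test in SciPy), which applies a correction for ties.

```python
# Sketch: rank-based comparison of background vs. one compliance well with
# nondetects treated as one group of tied values.
import numpy as np
from scipy.stats import mannwhitneyu, rankdata

# 0 is used as a surrogate for results below the (common) detection limit.
background = np.array([0, 0, 6.1, 7.4, 0, 5.5, 8.0, 0])
compliance = np.array([0, 9.2, 14.5, 11.8, 0, 16.3])

# Midranks of the combined sample: all nondetects share the same midrank.
print(rankdata(np.concatenate([background, compliance])))

# One-sided test: are compliance concentrations shifted above background?
stat, p = mannwhitneyu(compliance, background, alternative="greater")
print(stat, p)
```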
The Test of Proportions is not recommended in this Addendum, even if the percentage of
nondetects is over 50 percent. Instead, for all two-group comparisons that involve more than 15
percent nondetects, the non-parametric Wilcoxon Rank-Sum procedure is recommended.
Although acceptable as a statistical procedure, the Test of Proportions does not account for
potentially different magnitudes among the concentrations of detected values. Rather, each
sample is treated as a 0 or 1 depending on whether the measured concentration is below or above
the detection limit. The Test of Proportions ignores information about concentration magnitudes,
and hence is usually less powerful than a non-parametric rank-based test like the Wilcoxon Rank-
Sum, even after adjusting for a large fraction of tied observations (e.g., nondetects). This is
because the ranks of a dataset preserve additional information about the relative magnitudes of the
concentration values, information which is lost when all observations are scored as 0's and 1's.
Another drawback to the Test of Proportions, as presented in the Interim Final Guidance, is
that the procedure relies on a Normal probability approximation to the Binomial distribution of 0's
and 1's. This approximation is recommended only when the quantities n × (%NDs) and n
× (1-%NDs) are no smaller than 5. If the percentage of nondetects is quite high and/or the
sample size is fairly small, these conditions may be violated, leading potentially to inaccurate
results.
Comparison of the Test of Proportions to the Wilcoxon Rank-Sum test shows that for small
to moderate proportions of nondetects (say 0 to 60 percent), the Wilcoxon Rank-Sum procedure
adjusted for ties is more powerful in identifying real concentration differences than the Test of
Proportions. When the percentage of nondetects is quite high (at least 70 to 75 percent), the Test
of Proportions appears to be slightly more powerful in some cases than the Wilcoxon, but the
results of the two tests almost always lead to the same conclusion, so it makes sense to simply
recommend the Wilcoxon Rank-Sum test in all cases where nondetects constitute more than 15
percent of the samples.
2.2 NONDETECTS IN STATISTICAL INTERVALS
If the chosen method is a statistical interval (Confidence, Tolerance or Prediction limit) used
to compare background data against each downgradient well separately, more options are
available for handling moderate proportions of nondetects. The basis of any parametric statistical
interval limit is the formula x̄ ± k×s, where x̄ and s represent the sample mean and standard
deviation of the (background) data and k depends on the interval type and characteristics of the
monitoring network. To use a parametric interval in the presence of a substantial number of
nondetects, it is necessary to estimate the sample mean and standard deviation. But since
nondetect concentrations are unknown, simple formulas for the mean and standard deviation
cannot be computed directly. Two basic approaches to estimating or "adjusting" the mean and
standard deviation in this situation have been described by Cohen (1959) and Aitchison (1955).
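For orientation only, the sketch below shows the generic form x̄ + k·s applied to logged background data. The multiplier k used here (a 95% prediction limit for a single future sample) is just one common choice shown for illustration; the k appropriate to a given interval type and monitoring network must be taken from the relevant sections and tables, and the background values shown are hypothetical.

```python
# Sketch: generic parametric upper limit  xbar + k*s  on logged background data,
# illustrated with one common multiplier (not a recommendation for any
# particular monitoring network).
import numpy as np
from scipy.stats import t

background_ppb = [12.1, 8.7, 15.3, 9.9, 11.4, 13.8, 10.2, 9.1]  # hypothetical
logged = np.log(background_ppb)
n = logged.size
xbar, s = logged.mean(), logged.std(ddof=1)

k = t.ppf(0.95, df=n - 1) * np.sqrt(1 + 1 / n)   # 95% prediction limit, 1 future sample
upper_log = xbar + k * s
print(np.exp(upper_log))   # limit back-transformed to the concentration scale
```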
The underlying assumptions of these procedures are somewhat different. Cohen's
adjustment (which is described in detail on pp. 8-7 to 8-11 of the Interim Final Guidance) assumes
that all the data (detects and nondetects) come from the same Normal or Lognormal population,
but that nondetect values have been "censored" at their detection limits. This implies that the
contaminant of concern is present in nondetect samples, but the analytical equipment is not
sensitive to concentrations lower than the detection limit. Aitchison's adjustment, on the other
hand, is constructed on the assumption that nondetect samples are free of contamination, so that
all nondetects may be regarded as zero concentrations. In some situations, particularly when the
analyte of concern has been detected infrequently in background measurements, this assumption
may be practical, even if it cannot be verified directly.
Before choosing between Cohen's and Aitchison's approaches, it should be cautioned that
Cohen's adjustment may not give valid results if the proportion of nondetects exceeds 50%. In a
case study by McNichols and Davis (1988), the false positive rate associated with the use of t-
tests based on Cohen's method rose substantially when the fraction of nondetects was greater than
50%. This occurred because the adjusted estimates of the mean and standard deviation are more
highly correlated as the percentage of nondetects increases, leading to less reliable statistical tests
(including statistical interval tests).
On the other hand, with less than 50% nondetects, Cohen's method performed adequately in
the McNichols and Davis case study, provided the data were not overly skewed and that more
extensive tables than those included within the Interim Final Guidance were available to calculate
Cohen's adjustment parameter. As a remedy to the latter caveat, a more extensive table of
Cohen's adjustment parameter is provided in Appendix A (Table A-5). It is also recommended
that the data (detected measurements and nondetect detection limits) first be log-transformed
prior to computing either Cohen's or Aitchison's adjustment, especially since both procedures
assume that the underlying data are Normally distributed.
2.2.1 Censored and Detects-Only Probability Plots
To decide which approach is more appropriate for a particular set of ground water data, two
separate Probability Plots can be constructed. The first is called a Censored Probability Plot and
is a test of Cohen's underlying assumption. In this method, the combined set of detects and
nondetects is ordered (with nondetects being given arbitrary but distinct ranks). Cumulative
probabilities or Normal quantiles (see Section 1.1) are then computed for the data set as in a
regular Probability Plot. However, only the detected values and their associated Normal quantiles
are actually plotted. If the shape of the Censored Probability Plot is reasonably linear, then
Cohen's assumption that nondetects have been "censored" at their detection limit is probably
acceptable and Cohen's adjustment can be made to estimate the sample mean and standard
deviation. If the Censored Probability Plot has significant bends and curves, particularly in one or
both tails, one might consider Aitchison's procedure instead.
To test the assumptions of Aitchison's method, a Detects-Only Probability Plot may be
constructed. In this case, nondetects are completely ignored and a standard Probability Plot is
constructed using only the detected measurements. Thus, cumulative probabilities or Normal
quantiles are computed only for the ordered detected values. Comparison of a Detects-Only
Probability Plot with a Censored Probability Plot will indicate that the same number of points and
concentration values are plotted on each graph. However, different Normal quantiles are
associated with each detected concentration. If the Detects-Only Probability Plot is reasonably
linear, then the assumptions underlying Aitchison's adjustment (i.e., that "nondetects" represent
zero concentrations, and that detects and nondetects follow separate probability distributions) are
probably reasonable.
If it is not clear which of the Censored or Detects-Only Probability Plots is more linear,
Probability Plot Correlation Coefficients can be computed for both approaches (note that the
correlations should only involve the points actually plotted, that is, detected concentrations). The
plot with the higher correlation coefficient will represent the most linear trend. Be careful,
however, to use other, non-statistical judgments to help decide which of Cohen's and Aitchison's
underlying assumptions appears to be most reasonable based on the specific characteristics of the
data set. It is also likely that these Probability Plots may have to be constructed on the logarithms
of the data instead of the original values, if in fact the most appropriate underlying distribution is
the Lognormal instead of the Normal.
EXAMPLE 8
Create Censored and Detects-Only Probability Plots with the following zinc data to
determine whether Cohen's adjustment or Aitchison's adjustment is most appropriate for
estimating the true mean and standard deviation.
Zinc Concentrations (ppb) at Background Wells
Sample Well 1 Well 2 Well 3 Well 4 Well 5
1 <7 <7 <7 11.69 <7
2 11.41 <7 12.85 10.90 <7
3 <7 13.70 14.20 <7 <7
4 <7 11.56 9.36 12.22 11.15
5 <7 <7 <7 11.05 13.31
6 10.00 <7 12.00 <7 12.35
7 15.00 10.50 <7 13.24 <7
8 <7 12.59 <7 <7 8.74
SOLUTION
Step 1.Pool together the data from the five background wells and list in order in the table
below.
Step 2.To construct the Censored Probability Plot, compute the probabilities i/(n+1) using the
combined set of detects and nondetects, as in column 3. Find the Normal quantiles
associated with these probabilities by applying the inverse standard Normal
transformation, Φ⁻¹.
Step 3.To construct the Detects-Only Probability Plot, compute the probabilities in column 5
using only the detected zinc values. Again apply the inverse standard Normal
transformation to find the associated Normal quantiles in column 6. Note that
nondetects are ignored completely in this method.
Order (i)   Zinc Conc. (ppb)   Censored Probs.   Normal Quantiles   Detects-Only Probs.   Normal Quantiles
1 <7 .024 -1.971
2 <7 .049 -1.657
3 <7 .073 -1.453
4 <7 .098 -1.296
5 <7 .122 -1.165
6 <7 .146 -1.052
7 <7 .171 -0.951
8 <7 .195 -0.859
9 <7 .220 -0.774
10 <7 .244 -0.694
11 <7 .268 -0.618
12 <7 .293 -0.546
13 <7 .317 -0.476
14 <7 .341 -0.408
15 <7 .366 -0.343
16 <7 .390 -0.279
17 <7 .415 -0.216
18 <7 .439 -0.153
19 <7 .463 -0.092
20 <7 .488 -0.031
21 8.74 .512 0.031 .048 -1.668
22 9.36 .537 0.092 .095 -1.309
23 10.00 .561 0.153 .143 -1.068
24 10.50 .585 0.216 .190 -0.876
25 10.90 .610 0.279 .238 -0.712
26 11.05 .634 0.343 .286 -0.566
27 11.15 .659 0.408 .333 -0.431
28 11.41 .683 0.476 .381 -0.303
29 11.56 .707 0.546 .429 -0.180
30 11.69 .732 0.618 .476 -0.060
31 12.00 .756 0.694 .524 0.060
32 12.22 .780 0.774 .571 0.180
33 12.35 .805 0.859 .619 0.303
34 12.59 .829 0.951 .667 0.431
35 12.85 .854 1.052 .714 0.566
36 13.24 .878 1.165 .762 0.712
37 13.31 .902 1.296 .810 0.876
38 13.70 .927 1.453 .857 1.068
39 14.20 .951 1.657 .905 1.309
40 15.00 .976 1.971 .952 1.668
Step 4.Plot the detected zinc concentrations versus each set of probabilities or Normal
quantiles, as per the procedure for constructing Probability Plots (see figures below).
The nondetect values should not be plotted. As can be seen from the graphs, the
Censored Probability Plot indicates a definite curvature in the tails, especially the lower
tail. The Detects-Only Probability Plot, however, is reasonably linear. This visual
impression is bolstered by calculation of a Probability Plot Correlation Coefficient for
each set of detected values: the Censored Probability Plot has a correlation of r=.969,
while the Detects-Only Probability Plot has a correlation of r=.998.
Step 5.Because the Detects-Only Probability Plot is substantially more linear than the
Censored Probability Plot, it may be appropriate to consider detects and nondetects as
arising from statistically distinct distributions, with nondetects representing "zero"
concentrations. Therefore, Aitchison's adjustment may lead to better estimates of the
true mean and standard deviation than Cohen's adjustment for censored data.
[Figure: CENSORED PROBABILITY PLOT of Normal quantiles versus zinc concentrations (ppb)]

[Figure: DETECTS-ONLY PROBABILITY PLOT of Normal quantiles versus zinc concentrations (ppb)]
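The plotting positions and correlation coefficients used in this example can be recomputed with a short script. The following sketch is only an illustration (Python, assuming the NumPy and SciPy libraries, neither of which is part of this guidance), and the variable names are arbitrary. It assigns Normal quantiles to the 20 detected zinc values under each of the two plotting conventions and computes the corresponding Probability Plot Correlation Coefficients.

    import numpy as np
    from scipy import stats

    # 20 detected zinc concentrations (ppb) pooled from the five background wells
    detects = np.sort([8.74, 9.36, 10.00, 10.50, 10.90, 11.05, 11.15, 11.41,
                       11.56, 11.69, 12.00, 12.22, 12.35, 12.59, 12.85, 13.24,
                       13.31, 13.70, 14.20, 15.00])
    n_total = 40          # detects plus nondetects
    n_det = len(detects)  # 20 detects

    # Censored plot: detects occupy ranks 21..40 within the combined sample of 40
    cens_ranks = np.arange(n_total - n_det + 1, n_total + 1)
    cens_quantiles = stats.norm.ppf(cens_ranks / (n_total + 1))

    # Detects-only plot: detects are ranked 1..20, ignoring the nondetects
    det_ranks = np.arange(1, n_det + 1)
    det_quantiles = stats.norm.ppf(det_ranks / (n_det + 1))

    # Probability Plot Correlation Coefficients (detected points only)
    r_cens = np.corrcoef(detects, cens_quantiles)[0, 1]
    r_det = np.corrcoef(detects, det_quantiles)[0, 1]
    print(round(r_cens, 3), round(r_det, 3))   # approximately .969 and .998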
2.2.2 Aitchison's Adjustment
To actually compute Aitchison's adjustment (Aitchison, 1955), it is assumed that the
detected samples follow an underlying Normal distribution. If the detects are Lognormal,
compute Aitchison's adjustment on the logarithms of the data instead. Let d=# nondetects and let
n=total # of samples (detects and nondetects combined). Then if x̄* and s* denote respectively
the sample mean and standard deviation of the detected values, the adjusted overall mean can be
estimated as

    μ̂ = (1 - d/n) x̄*

and the adjusted overall standard deviation may be estimated as the square root of the quantity

    σ̂² = [(n - (d+1))/(n - 1)](s*)² + (d/n)·[(n - d)/(n - 1)](x̄*)²
The general formula for a parametric statistical interval adjusted for nondetects by Aitchison's
method is given by μ̂ ± k·σ̂, with k depending on the type of interval being constructed.
EXAMPLE 9
In Example 8, it was determined that Aitchison's adjustment might lead to more appropriate
estimates of the true mean and standard deviation than Cohen's adjustment. Use the data in
Example 8 to compute Aitchison's adjustment.
SOLUTION
Step 1.The zinc data consist of 20 nondetects and 20 detected values; therefore d=20 and
n=40 in the above formulas.
Step 2.Compute the average x̄*=11.891 and the standard deviation s*=1.595 of the set of
detected values.
Step 3.Use the formulas for Aitchison's adjustment to compute estimates of the true mean and
standard deviation:
    μ̂ = (1 - 20/40) × 11.891 = 5.95

    σ̂² = [(40 - 21)/39](1.595)² + (20/40)(20/39)(11.891)² = 37.495  ⇒  σ̂ = 6.12
If Cohen's adjustment is mistakenly computed on these data instead, with a detection
limit of 7 ppb, the estimates become μ̂ = 7.63 and σ̂ = 4.83. Thus, the choice of
adjustment can have a significant impact on the upper limits computed for statistical
intervals.
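The arithmetic in Example 9 can be verified with a few lines of code. The sketch below is a minimal illustration (Python; the function name aitchison_adjust is not part of the guidance) of the adjustment formulas applied to the zinc summary statistics.

    import math

    def aitchison_adjust(d, n, mean_det, sd_det):
        """Aitchison-adjusted mean and standard deviation.
        d = number of nondetects, n = total number of samples,
        mean_det, sd_det = mean and SD of the detected values only."""
        mu = (1.0 - d / n) * mean_det
        var = ((n - (d + 1)) / (n - 1)) * sd_det**2 \
              + (d / n) * ((n - d) / (n - 1)) * mean_det**2
        return mu, math.sqrt(var)

    mu, sigma = aitchison_adjust(d=20, n=40, mean_det=11.891, sd_det=1.595)
    print(round(mu, 2), round(sigma, 2))   # 5.95 and about 6.12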
2.2.3 More Than 50% Nondetects
If more than 50% but less than 90% of the samples are nondetect or the assumptions of
Cohen's and Aitchison's methods cannot be justified, parametric statistical intervals should be
abandoned in favor of non-parametric alternatives (see Section 3 below). Nonparametric
statistical intervals are easy to construct and apply to ground water data measurements, and no
special steps need be taken to handle nondetects.
When 90% or more of the data values are nondetect (as often occurs when measuring
volatile organic compounds [VOCs] in ground water, for instance), the detected samples can
often be modeled as "rare events" by using the Poisson distribution. The Poisson model describes
the behavior of a series of independent events over a large number of trials, where the probability
of occurrence is low but stays constant from trial to trial. The Poisson model is similar to the
Binomial model in that both models represent "counting processes." In the Binomial case,
nondetects are counted as 'misses' or zeroes and detects are counted (regardless of contamination
level) as 'hits' or ones; in the case of the Poisson, each particle or molecule of contamination is
counted separately but cumulatively, so that the counts for detected samples with high
concentrations are larger than counts for samples with smaller concentrations. As Gibbons (1987,
p. 574) has noted, it can be postulated
...that the number of molecules of a particular compound out of a much larger
number of molecules of water is the result of a Poisson process. For example,
we might consider 12 ppb of benzene to represent a count of 12 units of
benzene for every billion units examined. In this context, Poisson's approach is
justified in that the number of units (i.e., molecules) is large, and the probability
of the occurrence (i.e., a molecule being classified as benzene) is small.
For a detect with concentration of 50 ppb, the Poisson count would be 50. Counts for
nondetects can be taken as zero or perhaps equal to half the detection limit (e.g., if the detection
limit were 10 ppb, the Poisson count for that sample would be 5). Unlike the Binomial (Test of
Proportions) model, the Poisson model has the ability to utilize the magnitudes of detected
concentrations in statistical tests.
The Poisson distribution is governed by the average rate of occurrence, λ, which can be
estimated by summing the Poisson counts of all samples in the background pool of data and
dividing by the number of samples in the pool. Once the average rate of occurrence has been
estimated, the formula for the Poisson distribution is given by

    Pr{X = x} = e^(-λ) λ^x / x!

where x represents the Poisson count and λ represents the average rate of occurrence. To use the
Poisson distribution to predict concentration values at downgradient wells, formulas for
constructing Poisson Prediction and Tolerance limits are given below.
2.2.4 Poisson Prediction Limits
To estimate a Prediction limit at a particular well using the Poisson model, the approach
described by Gibbons (1987b) and based on the work of Cox and Hinkley (1974) can be used. In
this case, an upper limit is estimated for an interval that will contain all of k future measurements
of an analyte with confidence level 1-a, given n previous background measurements.
To do this, let Tn represent the sum of the Poisson counts of n background samples. The
goal is to predict Tk*, representing the total Poisson count of the next k sample measurements.
As Cox and Hinkley show, if Tn has a Poisson distribution with mean μ and if no contamination
has occurred, it is reasonable to assume that Tk* will also have a Poisson distribution but with
mean cμ, where c depends on the number of future measurements being predicted.
In particular, Cox and Hinkley demonstrate that the quantity

    [Tk* - c(Tn + Tk*)/(1 + c)]² / [c(Tn + Tk*)/(1 + c)²]
has an approximate standard Normal distribution. From this relation, an upper prediction limit for
Tk* is calculated by Gibbons to be approximately
    Tk* = cTn + (c·t²)/2 + c·t·√[Tn(1 + 1/c) + t²/4]

where t = t(n-1, α) is the upper (1-α) percentile of the Student's t distribution with (n-1) degrees of
freedom. The quantity c in the above formulas may be computed as k/n, where, as noted, k is the
number of future samples being predicted.
EXAMPLE 10
Use the following benzene data from six background wells to estimate an upper 99%
Poisson Prediction limit for the next four measurements from a single downgradient well.
Benzene Concentrations (ppb)
Month Well 1 Well 2 Well 3 Well 4 Well 5 Well 6
1 <2 <2 <2 <2 <2 <2
2 <2 <2 <2 15.0 <2 <2
3 <2 <2 <2 <2 <2 <2
4 <2 12.0 <2 <2 <2 <2
5 <2 <2 <2 <2 <2 10.0
6 <2 <2 <2 <2 <2 <2
SOLUTION
Step 1.Pooling the background data yields n=36 samples, of which 33 (92%) are nondetects.
Because the rate of detection is so infrequent (i.e., <10%), a Poisson-based Prediction
limit may be appropriate. Since four future measurements are to be predicted, k=4, and
hence, c=k/n=1/9.
Step 2.Set each nondetect to half the detection limit or 1 ppb. Then compute the Poisson
count of the sum of all the background samples, in this case,
Tn=33(1)+(12.0+15.0+10.0) = 70.0. To calculate an upper 99% Prediction limit, the
upper 99th percentile of the t-distribution with (n-1)=35 degrees of freedom must be
taken from a reference table, namely t35,.01=2.4377.
Step 3.Using Gibbons' formula above, calculate the upper Prediction limit as:

    Tk* = (1/9)(70) + (2.4377)²/(2·9) + (2.4377/9)·√[70(1 + 9) + (2.4377)²/4] = 15.3 ppb
Step 4.To test the upper Prediction limit, the Poisson count of the sum of the next four
samples from the downgradient well should be calculated. If this sum is greater than 15.3 ppb,
there is significant evidence of contamination at the downgradient well. If not, the well may be
regarded as clean until the next testing period.
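Gibbons' Prediction limit formula is easily scripted. The following is a rough sketch (Python, assuming SciPy for the t quantile; the function name is illustrative only) that reproduces the Example 10 calculation.

    import math
    from scipy import stats

    def poisson_prediction_limit(T_n, n, k, alpha):
        """Upper Poisson prediction limit for the total count of the next k samples,
        given T_n = Poisson count (sum) of n background samples."""
        c = k / n
        t = stats.t.ppf(1 - alpha, n - 1)   # upper alpha point of t with n-1 df
        return c * T_n + c * t**2 / 2 + c * t * math.sqrt(T_n * (1 + 1 / c) + t**2 / 4)

    # Example 10: n=36 background samples, T_n=70, predict k=4 samples at 99% confidence
    print(round(poisson_prediction_limit(T_n=70.0, n=36, k=4, alpha=0.01), 1))   # about 15.3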
The procedure for generating Poisson prediction limits is somewhat flexible. The value k
above, for instance, need not represent multiple samples from a single well. It could also denote a
collection of single samples from k distinct wells, all of which are assumed to follow the same
Poisson distribution in the absence of contamination. The Poisson distribution also has the
desirable property that the sum of several Poisson variables also has a Poisson distribution, even if
the individual components are not identically distributed. Because of this, Gibbons (1987b) has
suggested that if several analytes (e.g., different VOCs) can all be modeled via the Poisson
distribution, the combined sum of the Poisson counts of all the analytes will also have a Poisson
distribution, meaning that a single prediction limit could be estimated for the combined group of
analytes, thus reducing the necessary number of statistical tests.
A major drawback to Gibbons' proposal of establishing a combined prediction limit for
several analytes is that if the limit is exceeded, it will not be clear which analyte is responsible for
"triggering" the test. In part this problem explains why the ground-water monitoring regulations
mandate that each analyte be tested separately. Still, if a large number of analytes must be
regularly tested and the detection rate is quite low, the overall facility-wide false positive rate may
be unacceptably high. To remedy this situation, it is probably wisest to do enough initial testing
of background and facility leachate and waste samples to determine those specific parameters
present at levels substantially greater than background. By limiting monitoring and statistical tests
to a few parameters meeting the above conditions, it should be possible to contain the overall
facility-wide false positive rate while satisfying the regulatory requirements and assuring reliable
identification of ground-water contamination if it occurs.
Though quantitative information on a suite of VOCs may be automatically generated as a
consequence of the analytical method configuration (e.g., SW-846 method 8260 can provide
quantitative results for approximately 60 different compounds), it is usually unnecessary to
designate all of these compounds as leak detection indicators. Such practice generally aggravates
the problem of many comparisons and results in elevated false positive rates for the facility as a
whole. This makes accurate statistical testing especially difficult. EPA therefore recommends
that the results of leachate testing or the waste analysis plan serve as the primary basis for
designating reliable leak detection indicator parameters.
2.2.5 Poisson Tolerance Limits
To apply an upper Tolerance limit using the Poisson model to a group of downgradient
wells, the approach described by Gibbons (1987b) and based on the work of Zacks (1970) can be
taken. In this case, if no contamination has occurred, the estimated interval upper limit will
contain a large fraction of all measurements from the downgradient wells, often specified at 95%
or more.
The calculations involved in deriving Poisson Tolerance limits can seem non-intuitive,
primarily because the argument leading to a mathematically rigorous Tolerance limit is
complicated. The basic idea, however, uses the fact that if each individual measurement follows a
common Poisson distribution with rate parameter λ, the sum of n such measurements will also
follow a Poisson distribution, this time with rate nλ.
Because the Poisson distribution has the property that its true mean is equal to the rate
parameter λ, the concentration sum of n background samples can be manipulated to estimate this
rate. But since we know that the distribution of the concentration sum is also Poisson, the
possible values of λ can actually be narrowed to within a small range with fixed confidence
probability (γ).
For each "possible" value of l in this confidence range, one can compute the percentile of
the Poisson distribution with rate l that would lie above, say, 95% of all future downgradient
measurements. By setting as the "probable" rate, that l which is greater than all but a small
percentage a of the most extreme possible l's, given the values of n background samples, one can
compute an upper tolerance limit with, say, 95% coverage and (1-a)% confidence.
To actually make these computations, Zacks (1970) shows that the most probable rate λ
can be calculated approximately as

    λTn = (1/2n)·χ²γ[2Tn + 2]

where as before Tn represents the Poisson count of the sum of n background samples (setting
nondetects to half the method detection limit), and χ²γ[2Tn + 2] represents the γ percentile of the
Chi-square distribution with (2Tn + 2) degrees of freedom.
To find the upper Tolerance limit with β% coverage (e.g., 95%) once a probable rate λ has
been estimated, one must compute the Poisson percentile that is larger than β% of all possible
measurements from that distribution, that is, the β% quantile of the Poisson distribution with
mean rate λTn, denoted by P⁻¹(β, λTn). Using a well-known mathematical relationship between
the Poisson and Chi-square distributions, finding the β% quantile of the Poisson amounts to
determining the least positive integer k such that

    χ²(1-β)[2k + 2] ≥ 2λTn

where, as above, the quantity [2k+2] represents the degrees of freedom of the Chi-square
distribution. By calculating two times the estimated probable rate λTn on the right-hand side of
the above inequality, and then finding the smallest degrees of freedom so that the (1-β)%
percentile of the Chi-square distribution is bigger than 2λTn, the upper tolerance limit k can
be determined fairly easily.
Once the upper tolerance limit, k, has been estimated, it will represent an upper Poisson
Tolerance limit having approximately β% coverage with γ% confidence in all comparisons with
downgradient well measurements.
EXAMPLE 11
Use the benzene data of Example 10 to estimate an upper Poisson Tolerance limit with 95%
coverage and 95% confidence probability.
SOLUTION
Step 1.The benzene data consist of 33 nondetects with detection limit equal to 2 ppb and 3
detected values for a total of n=36. By setting each nondetect to half the detection limit
as before, one finds a total Poisson count of the sum equal to Tn=70.0. It is also
known that the desired confidence probability is γ=.95 and the desired coverage is
β=.95.
Step 2.Based on the observed Poisson count of the sum of background samples, estimate the
probable occurrence rate λTn using Zacks' formula above as

    λTn = (1/2n)·χ²γ[2Tn + 2] = (1/72)·χ².95[142] = 2.37
Step 3.Compute twice the probable occurrence rate as 2λTn=4.74. Now using a Chi-square
table, find the smallest degrees of freedom (df), k, such that

    χ².05[2k + 2] ≥ 4.74

Since the 5th percentile of the Chi-square distribution with 12 df equals 5.23 (but only
4.57 with 11 df), it is seen that (2k+2)=12, leading to k=5. Therefore, the upper
Poisson Tolerance limit is estimated as k=5 ppb.
Step 4.Because the estimated upper Tolerance limit with 95% coverage equals 5 ppb, any
detected value among downgradient samples greater than 5 ppb may indicate possible
evidence of contamination.
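Because the search over Chi-square percentiles is tedious by hand, the Tolerance limit can also be computed with a short script. The sketch below is illustrative only (Python, assuming SciPy; the function name is not part of the guidance) and reproduces the Example 11 values.

    from scipy import stats

    def poisson_tolerance_limit(T_n, n, coverage=0.95, confidence=0.95):
        """Upper Poisson tolerance limit following Zacks (1970) / Gibbons (1987b)."""
        # Probable occurrence rate based on the background count T_n
        lam = stats.chi2.ppf(confidence, 2 * T_n + 2) / (2 * n)
        # Smallest integer k with chi-square_(1-coverage)[2k+2] >= 2*lam
        k = 0
        while stats.chi2.ppf(1 - coverage, 2 * k + 2) < 2 * lam:
            k += 1
        return lam, k

    lam, k = poisson_tolerance_limit(T_n=70.0, n=36)
    print(round(lam, 2), k)   # about 2.37 and k = 5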
3. NON-PARAMETRIC COMPARISON OF COMPLIANCE WELL DATA TO BACKGROUND
When concentration data from several compliance wells are to be compared with
concentration data from background wells, one basic approach is analysis of variance (ANOVA).
The ANOVA technique is used to test whether there is statistically significant evidence that the
mean concentration of a constituent is higher in one or more of the compliance wells than the
baseline provided by background wells. Parametric ANOVA methods make two key
assumptions: 1) that the data residuals are Normally distributed and 2) that the group variances
are all approximately equal. The steps for calculating a parametric ANOVA are given in the
Interim Final Guidance (pp. 5-6 to 5-14).
If either of the two assumptions crucial to a parametric ANOVA is grossly violated, it is
recommended that a non-parametric test be conducted using the ranks of the observations rather
than the original observations themselves. The Interim Final Guidance describes the Kruskal-
Wallis test when three or more well groups (including background data, see pp. 5-14 to 5-20) are
being compared. However, the Kruskal-Wallis test is not amenable to two-group comparisons,
say of one compliance well to background data. In this case, the Wilcoxon Rank-Sum procedure
(also known as the Mann-Whitney U Test) is recommended and explained below. Since most
situations will involve the comparison of at least two downgradient wells with background data,
the Kruskal-Wallis test is presented first with an additional example.
3.1 KRUSKAL-WALLIS TEST
When the assumptions used in a parametric analysis of variance cannot be verified, e.g.,
when the original or transformed residuals are not approximately Normal in distribution or have
significantly different group variances, an analysis can be performed using the ranks of the
observations. Usually, a non-parametric procedure will be needed when a substantial fraction of
the measurements are below detection (more than 15 percent), since then the above assumptions
are difficult to verify.
The assumption of independence of the residuals is still required. Under the null hypothesis
that there is no difference among the groups, the observations are assumed to come from identical
distributions. However, the form of the distribution need not be specified.
A non-parametric ANOVA can be used in any situation that the parametric analysis of
variance can be used. However, because the ranks of the data are being used, the minimum
sample sizes for the groups must be a little larger. A useful rule of thumb is to require a minimum
of three well groups with at least four observations per group before using the Kruskal-Wallis
procedure.
Non-parametric procedures typically need a few more observations than parametric
procedures for two reasons. On the one hand, non-parametric tests make fewer assumptions
concerning the distribution of the data and so more data is often needed to make the same
judgment that would be rendered by a parametric test. Also, procedures based on ranks have a
discrete distribution (unlike the continuous distributions of parametric tests). Consequently, a
larger sample size is usually needed to produce test statistics that will be significant at a specified
alpha level such as 5 percent.
The relative efficiency of two procedures is defined as the ratio of the sample sizes needed
by each to achieve a certain level of power against a specified alternative hypothesis. As sample
sizes get larger, the efficiency of the Kruskal-Wallis test relative to the parametric analysis of
variance test approaches a limit that depends on the underlying distribution of the data, but is
always at least 86 percent. This means roughly that in the worst case, if 86 measurements are
available for a parametric ANOVA, only 100 sample values are needed to have an equivalently
powerful Kruskal-Wallis test. In many cases, the increase in sample size necessary to match the
power of a parametric ANOVA is much smaller or not needed at all. The efficiency of the
Kruskal-Wallis test is 95 percent if the data are really Normal, and can be much larger than 100
percent in other cases (e.g., it is 150 percent if the residuals follow a distribution called the double
exponential).
These results concerning efficiency imply that the Kruskal-Wallis test is reasonably powerful
for detecting concentration differences despite the fact that the original data have been replaced
by their ranks, and can be used even when the data are Normally distributed. When the data are
not Normal or cannot be transformed to Normality, the Kruskal-Wallis procedure tends to be
more powerful for detecting differences than the usual parametric approach.
3.1.1 Adjusting for Tied Observations
Frequently, the Kruskal-Wallis procedure will be used when the data contain a significant
fraction of nondetects (e.g., more than 15 percent of the samples). In these cases, the parametric
assumptions necessary for the usual one-way ANOVA are difficult or impossible to verify, making
the non-parametric alternative attractive. However, the presence of nondetects prevents a unique
ranking of the concentration values, since nondetects are, up to the limit of measurement, all tied
at the same value.
To get around this problem, two steps are necessary. First, in the presence of ties (e.g.,
nondetects), all tied observations should receive the same rank. This rank (sometimes called the
midrank (Lehmann, 1975)) is computed as the average of the ranks that would be given to a
group of ties if the tied values actually differed by a tiny amount and could be ranked uniquely.
For example, if the first four ordered observations are all nondetects, the midrank given to each of
these samples would be equal to (1+2+3+4)/4=2.5. If the next highest measurement is a unique
detect, its rank would be 5 and so on until all observations are appropriately ranked.
The second step is to compute the Kruskal-Wallis statistic as described in the Interim Final
Guidance, using the midranks computed for the tied values. Then an adjustment to the Kruskal-
Wallis statistic must be made to account for the presence of ties. This adjustment is described on
page 5-17 of the Interim Final Guidance and requires computation of the formula:
    H′ = H / [1 - Σ(i=1 to g) (ti³ - ti)/(N³ - N)]

where g equals the number of groups of distinct tied observations and ti is the number of
observations in the ith tied group.
EXAMPLE 12
Use the non-parametric analysis of variance on the following data to determine whether
there is evidence of contamination at the monitoring site.
Toluene Concentration (ppb)
Background Wells Compliance Wells
Month Well 1 Well 2 Well 3 Well 4 Well 5
1 <5 <5 <5 <5 <5
2 7.5 <5 12.5 13.7 20.1
3 <5 <5 8.0 15.3 35.0
4 <5 <5 <5 20.2 28.2
5 6.4 <5 11.2 25.1 19.0
SOLUTION
Step 1.Compute the overall percentage of nondetects. In this case, nondetects account for 48
percent of the data. The usual parametric analysis of variance would be inappropriate.
Use the Kruskal-Wallis test instead, pooling both background wells into one group and
treating each compliance well as a separate group.
Step 2.Compute ranks for all the data including tied observations (e.g., nondetects) as in the
following table. Note that each nondetect is given the same midrank, equal to the
average of the first 12 unique ranks.
Toluene Ranks
Background Wells Compliance Wells
Month Well 1 Well 2 Well 3 Well 4 Well 5
1 6.5 6.5 6.5 6.5 6.5
2 14 6.5 17 18 21
3 6.5 6.5 15 19 25
4 6.5 6.5 6.5 22 24
5 13 6.5 16 23 20
Rank Sum Rb=79 R3=61 R4=88.5 R5=96.5
Rank Mean  R̄b=7.9  R̄3=12.2  R̄4=17.7  R̄5=19.3
Step 3.Calculate the sums of the ranks in each group (Ri) and the mean ranks in each group
(R̄i). These results are given above.
Step 4.Compute the Kruskal-Wallis statistic H using the formula on p. 5-15 of the Interim
Final Guidance
    H = [12/(N(N+1))]·Σ(i=1 to K) [Ri²/Ni] - 3(N+1)
where N=total number of samples, Ni=number of samples in ith group, and K=number
of groups. In this case, N=25, K=4, and H can be computed as
    H = [12/(25·26)]·[79²/10 + 61²/5 + 88.5²/5 + 96.5²/5] - 78 = 10.56.
Step 5.Compute the adjustment for ties. There is only one group of distinct tied observations,
containing 12 samples. Thus, the adjusted Kruskal-Wallis statistic is given by:
    H′ = 10.56 / [1 - (12³ - 12)/(25³ - 25)] = 11.87.
Step 6.Compare the calculated value of H′ to the tabulated Chi-square value with (K-1)=(#
groups-1)=3 df, χ²3,.05=7.81. Since the observed value of 11.87 is greater than the
Chi-square critical value, there is evidence of significant differences between the well
groups. Post-hoc pairwise comparisons are necessary.
Step 7.Calculate the critical difference for compliance well comparisons to the background
using the formula on p. 5-16 of the Interim Final Guidance document. Since the
number of samples at each compliance well is the same (five), the same critical difference can be
used for each comparison, namely,
    Ci = z(.05/3)·√[(25·26/12)·(1/10 + 1/5)] = 8.58
Step 8.Form the differences between the average ranks of each compliance well and the
background and compare these differences to the critical value of 8.58.
Well 3: R̄3 - R̄b = 12.2 - 7.9 = 4.3
Well 4: R̄4 - R̄b = 17.7 - 7.9 = 9.8
Well 5: R̄5 - R̄b = 19.3 - 7.9 = 11.4
Since the average rank differences at wells 4 and 5 exceed the critical difference, there
is significant evidence of contamination at wells 4 and 5, but not at well 3.
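The same tie-adjusted statistic can be obtained from a standard statistical library. In the sketch below (Python, assuming SciPy; not part of the guidance), scipy.stats.kruskal assigns midranks and applies the same tie correction; nondetects are coded with a single arbitrary value below the 5 ppb detection limit so that they are all tied.

    from scipy import stats

    ND = 2.5   # any single placeholder value below the 5 ppb detection limit gives the same ranks
    background = [ND, 7.5, ND, ND, 6.4,    # Well 1
                  ND, ND, ND, ND, ND]      # Well 2
    well3 = [ND, 12.5, 8.0, ND, 11.2]
    well4 = [ND, 13.7, 15.3, 20.2, 25.1]
    well5 = [ND, 20.1, 35.0, 28.2, 19.0]

    H_adj, p_value = stats.kruskal(background, well3, well4, well5)
    print(round(H_adj, 2), round(p_value, 4))   # about 11.87; p well below 0.05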
3.2 WILCOXON RANK-SUM TEST FOR TWO GROUPS
When a single compliance well group is being compared to background data and a non-
parametric test is needed, the Kruskal-Wallis procedure should be replaced by the Wilcoxon
Rank-Sum test (Lehmann, 1975; also known as the two-sample Mann-Whitney U test). For most
ground-water applications, the Wilcoxon test should be used whenever the proportion of
nondetects in the combined data set exceeds 15 percent. However, to provide valid results, do
not use the Wilcoxon test unless the compliance well and background data groups both contain at
least four samples each.
To run the Wilcoxon Rank-Sum Test, use the following algorithm. Combine the compliance
and background data and rank the ordered values from 1 to N. Assume there are n compliance
samples and m background samples so that N=m+n. Denote the ranks of the compliance samples
by Ci and the ranks of the background samples by Bi. Then add up the ranks of the compliance
samples and subtract n(n+1)/2 to get the Wilcoxon statistic W:
    W = Σ(i=1 to n) Ci - n(n+1)/2.
The rationale of the Wilcoxon test is that if the ranks of the compliance data are quite large
relative to the background ranks, then the hypothesis that the compliance and background values
came from the same population should be rejected. Large values of the statistic W give evidence
of contamination at the compliance well site.
To find the critical value of W, a Normal approximation to its distribution is used. The
expected value and standard deviation of W under the null hypothesis of no contamination are
given by the formulas
    E(W) = mn/2 ;  SD(W) = √[mn(N+1)/12]

An approximate Z-score for the Wilcoxon Rank-Sum Test then follows as:

    Z ≈ [W - E(W) - 1/2] / SD(W).
The factor of 1/2 in the numerator serves as a continuity correction since the discrete distribution
of the statistic W is being approximated by the continuous Normal distribution.
Once an approximate Z-score has been computed, it may be compared to the upper 0.01
percentile of the standard Normal distribution, z.01=2.326, in order to determine the statistical
significance of the test. If the observed Z-score is greater than 2.326, the null hypothesis may be
rejected at the 1 percent significance level, suggesting that there is significant evidence of
contamination at the compliance well site.
EXAMPLE 13
The table below contains copper concentration data (ppb) found in water samples at a
monitoring facility. Wells 1 and 2 are background wells and well 3 is a single compliance well
suspected of contamination. Calculate the Wilcoxon Rank-Sum Test on these data.
Copper Concentration (ppb)
Background Compliance
Month Well 1 Well 2 Well 3
1 4.2 5.2 9.4
2 5.8 6.4 10.9
3 11.3 11.2 14.5
4 7.0 11.5 16.1
5 7.3 10.1 21.5
6 8.2 9.7 17.6
SOLUTION
Step 1.Rank the N=18 observations from 1 to 18 (smallest to largest) as in the following table.
Ranks of Copper Concentrations
Background Compliance
Month Well 1 Well 2 Well 3
1 1 2 8
2 3 4 11
3 13 12 15
4 5 14 16
5 6 10 18
6 7 9 17
Step 2.Compute the Wilcoxon statistic by adding up the compliance well ranks and subtracting
n(n+1)/2, so that W=85-21=64.
Step 3.Compute the expected value and standard deviation of W.
    E(W) = mn/2 = 36

    SD(W) = √[mn(N+1)/12] = √114 = 10.677
Step 4.Form the approximate Z-score.
    Z ≈ [W - E(W) - 1/2] / SD(W) = (64 - 36 - 0.5)/10.677 = 2.576
Step 5.Compare the observed Z-score to the upper 0.01 percentile of the Normal distribution.
Since Z=2.576>2.326=z.01, there is significant evidence of contamination at the
compliance well at the 1 percent significance level.
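The hand calculation in Example 13 can be checked in code. The sketch below is a minimal illustration (Python, assuming NumPy and SciPy; variable names are arbitrary) that recomputes W and the continuity-corrected Z-score for the copper data.

    import math
    from scipy import stats

    background = [4.2, 5.8, 11.3, 7.0, 7.3, 8.2, 5.2, 6.4, 11.2, 11.5, 10.1, 9.7]
    compliance = [9.4, 10.9, 14.5, 16.1, 21.5, 17.6]
    m, n = len(background), len(compliance)
    N = m + n

    ranks = stats.rankdata(background + compliance)   # ranks 1..18 (midranks if tied)
    W = ranks[m:].sum() - n * (n + 1) / 2             # Wilcoxon statistic for compliance ranks

    E_W = m * n / 2
    SD_W = math.sqrt(m * n * (N + 1) / 12)
    Z = (W - E_W - 0.5) / SD_W
    print(W, round(Z, 3))   # 64.0 and about 2.576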
3.2.1 Handling Ties in the Wilcoxon Test
Tied observations in the Wilcoxon test are handled in similar fashion to the Kruskal-Wallis
procedure. First, midranks are computed for all tied values. Then the Wilcoxon statistic is
computed as before but with a slight difference. To form the approximate Z-score, an adjustment
is made to the formula for the standard deviation of W in order to account for the groups of tied
values. The necessary formula (Lehmann, 1975) is:
    SD*(W) = √{ [mn(N+1)/12]·[1 - Σ(i=1 to g) (ti³ - ti)/(N³ - N)] }
where, as in the Kruskal-Wallis method, g equals the number of groups of distinct tied
observations and ti represents the number of tied values in the ith group.
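A minimal sketch of this tie-corrected standard deviation follows (Python; the function name and the example tie counts are hypothetical, not taken from the guidance).

    import math

    def sd_w_ties(m, n, tie_counts):
        """Tie-corrected standard deviation of the Wilcoxon statistic W (Lehmann, 1975).
        tie_counts = list of t_i, the size of each group of tied observations."""
        N = m + n
        correction = sum(t**3 - t for t in tie_counts) / (N**3 - N)
        return math.sqrt(m * n * (N + 1) / 12 * (1 - correction))

    # hypothetical case: 12 background and 6 compliance samples with one group of 5 tied nondetects
    print(round(sd_w_ties(12, 6, [5]), 3))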
4. STATISTICAL INTERVALS: CONFIDENCE, TOLERANCE, AND PREDICTION
Three types of statistical intervals are often constructed from data: Confidence intervals,
Tolerance intervals, and Prediction intervals. Though often confused, the interpretations and uses
of these intervals are quite distinct. The most common interval encountered in a course on
statistics is a Confidence interval for some parameter of the distribution (e.g., the population
mean). The interval is constructed from sample data and is thus a random quantity. This means
that each set of sample data will generate a different Confidence interval, even though the
algorithm for constructing the interval stays the same every time.
A Confidence interval is designed to contain the specified population parameter (usually the
mean concentration of a well in ground-water monitoring) with a designated level of confidence
or probability, denoted as 1-a. The interval will fail to include the true parameter in
approximately a percent of the cases where such intervals are constructed.
The usual Confidence interval for the mean gives information about the average
concentration level at a particular well or group of wells. It offers little information about the
highest or most extreme sample concentrations one is likely to observe over time. Often, it is
those extreme values one wants to monitor to be protective of human health and the environment.
As such, a Confidence interval generally should be used only in two situations for ground-water
data analysis: (1) when directly specified by the permit or (2) in compliance monitoring, when
downgradient samples are being compared to a Ground-Water Protection Standard (GWPS)
representing the average of onsite background data, as is sometimes the case with an Alternate
Contaminant Level (ACL). In other situations it is usually desirable to employ a Tolerance or
Prediction interval.
A Tolerance interval is designed to contain a designated proportion of the population (e.g.,
95 percent of all possible sample measurements). Since the interval is constructed from sample
data, it also is a random interval. And because of sampling fluctuations, a Tolerance interval can
contain the specified proportion of the population only with a certain confidence level. Two
coefficients are associated with any Tolerance interval. One is the proportion of the population
that the interval is supposed to contain, called the coverage. The second is the degree of
confidence with which the interval reaches the specified coverage. This is known as the tolerance
coefficient. A Tolerance interval with coverage of 95 percent and a tolerance coefficient of 95
percent is constructed to contain, on average, 95 percent of the distribution with a probability of
95 percent.
Tolerance intervals are very useful for ground-water data analysis, because in many
situations one wants to ensure that at most a small fraction of the compliance well sample
measurements exceed a specific concentration level (chosen to be protective of human health and
the environment). Since a Tolerance interval is designed to cover all but a small percentage of the
population measurements, observations should very rarely exceed the upper Tolerance limit when
testing small sample sizes. The upper Tolerance limit allows one to gauge whether or not too
many extreme concentration measurements are being sampled from compliance point wells.
Tolerance intervals can be used in detection monitoring when comparing compliance data to
background values. They also should be used in compliance monitoring when comparing
compliance data to certain Ground-Water Protection Standards. Specifically, the tolerance
interval approach is recommended for comparison with a Maximum Contaminant Level (MCL) or
with an ACL if the ACL is derived from health-based risk data.
Prediction intervals are constructed to contain the next sample value(s) from a population or
distribution with a specified probability. That is, after sampling a background well for some time
and measuring the concentration of an analyte, the data can be used to construct an interval that
will contain the next analyte sample or samples (assuming the distribution has not changed). A
Prediction interval will thus contain a future value or values with specified probability. Prediction
intervals can also be constructed to contain the average of several future observations.
Prediction intervals are probably most useful for two kinds of detection monitoring. The
first kind is when compliance point well data are being compared to background values. In this
case the Prediction interval is constructed from the background data and the compliance well data
are compared to the upper Prediction limits. The second kind is when intrawell comparisons are
being made on an uncontaminated well. In this case, the Prediction interval is constructed on past
data sampled from the well, and used to predict the behavior of future samples from the same
well.
In summary, a Confidence interval usually contains an average value, a Tolerance interval
contains a proportion of the population, and a Prediction interval contains one or more future
observations. Each has a probability statement or "confidence coefficient" associated with it. For
further explanation of the differences between these interval types, see Hahn (1970).
One should note that all of these intervals assume that the sample data used to construct the
intervals are Normally distributed. In light of the fact that much ground-water concentration data
is better modeled by a Lognormal distribution, it is recommended that tests for Normality be run
on the logarithms of the original data before constructing the random intervals. If the data follow
the Lognormal model, then the intervals should be constructed using the logarithms of the sample
values. In this case, the limits of these intervals should not be compared to the original
compliance data or GWPS. Rather, the comparison should involve the logged compliance data or
logged GWPS. When neither the Normal nor the Lognormal model can be justified, a non-parametric
version of each interval may be utilized.
4.1 TOLERANCE INTERVALS
In detection monitoring, the compliance point samples are assumed to come from the same
distribution as the background values until significant evidence of contamination can be shown.
To test this hypothesis, a 95 percent coverage Tolerance interval can be constructed on the
background data. The background data should first be tested to check the distributional
assumptions. Once the interval is constructed, each compliance sample is compared to the upper
Tolerance limit. If any compliance point sample exceeds the limit, the well from which it was
drawn is judged to have significant evidence of contamination (note that when testing a large
number of samples, the nature of a Tolerance interval practically ensures that a few measurements
will be above the upper Tolerance limit, even when no contamination has occurred. In these
cases, the offending wells should probably be resampled in order to verify whether or not there is
definite evidence of contamination.)
If the Tolerance limit has been constructed using the logged background data, the
compliance point samples should first be logged before comparing with the upper Tolerance limit.
The steps for computing the actual Tolerance interval in detection monitoring are detailed in the
Interim Final Guidance on pp. 5-20 to 5-24. One point about the table of factors k used to adjust
the width of the Tolerance interval is that these factors are designed to provide at least 95%
coverage of the population. Applied over many data sets, the average coverage of these intervals
will often be close to 98% or more (see Guttman, 1970). To construct a one-sided upper
Tolerance interval with average coverage of (1-β)%, the k multiplier can be computed directly
with the aid of a Student's t-distribution table. In this case, the formula becomes

    k = t(n-1, 1-β)·√(1 + 1/n)

where the t-value represents the upper (1-β)th percentile of the t-distribution with (n-1) degrees
of freedom.
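Since the multiplier is just a rescaled t quantile, it can be computed in one line. The sketch below (Python, assuming SciPy; the function name is illustrative) reproduces the factor k=2.631 used for n=4 observations in Example 14 below.

    import math
    from scipy import stats

    def k_average_coverage(n, beta=0.05):
        """One-sided tolerance factor giving average (1 - beta) coverage."""
        return stats.t.ppf(1 - beta, n - 1) * math.sqrt(1 + 1 / n)   # 95th t percentile for beta=.05

    print(round(k_average_coverage(4), 3))   # about 2.631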
In compliance monitoring, the Tolerance interval is calculated on the compliance point data,
so that the upper one-sided Tolerance limit may be compared to the appropriate Ground-Water
Protection Standard (i.e., MCL or ACL). If the upper Tolerance limit exceeds the fixed standard,
and especially if the Tolerance limit has been constructed to have an average coverage of 95% as
described above, there is significant evidence that as much as 5 percent or more of all the
compliance well measurements will exceed the limit and consequently that the compliance point
wells are in violation of the facility permit. The algorithm for computing Tolerance limits in
compliance monitoring is given on pp. 6-11 to 6-15 of the Interim Final Guidance.
EXAMPLE 14
The table below contains data that represent chrysene concentration levels (ppb) found in
water samples obtained from the five compliance wells at a monitoring facility. Compute the
upper Tolerance limit at each well for an average of 95% coverage with 95% confidence and
determine whether there is evidence of contamination. The alternate concentration limit (ACL) is
80 ppb.
Chrysene Concentration (ppb)
Month Well 1 Well 2 Well 3 Well 4 Well 5
1 19.7 10.2 68.0 26.8 47.0
2 39.2 7.2 48.9 17.7 30.5
3 7.8 16.1 30.1 31.9 15.0
4 12.8 5.7 38.1 22.2 23.4
Mean 19.88 9.80 46.28 24.65 28.98
SD 13.78 4.60 16.40 6.10 13.58
SOLUTION
Step 1.Before constructing the tolerance intervals, check the distributional assumptions. The
algorithm for a parametric Tolerance interval assumes that the data used to compute the
interval are Normally distributed. Because these data are more likely to be Lognormal
in distribution than Normal, check the assumptions on the logarithms of the original
data given in the table below. Since each well has only four observations, Probability
Plots are not likely to be informative. The Shapiro-Wilk or Probability Plot Correlation
Coefficient tests can be run, but in this example only the Skewness Coefficient is
examined to ensure that gross departures from Lognormality are not missed.
Logged Chrysene Concentration [log(ppb)]
Month Well 1 Well 2 Well 3 Well 4 Well 5
1 2.98 2.32 4.22 3.29 3.85
2 3.67 1.97 3.89 2.87 3.42
3 2.05 2.78 3.40 3.46 2.71
4 2.55 1.74 3.64 3.10 3.15
Mean 2.81 2.20 3.79 3.18 3.28
SD 0.68 0.45 0.35 0.25 0.48
Step 2.The Skewness Coefficients for each well are given in the following table. Since none of
the coefficients is greater than 1 in absolute value, approximate Lognormality (that is,
Normality of the logged data) is assumed for the purpose of constructing the tolerance
intervals.
Well Skewness |Skewness|
1 .210 .210
2 .334 .334
3 .192 .192
4 -.145 .145
5 -.020 .020
Step 3.Compute the tolerance interval for each compliance well using the logged concentration
data. The means and SDs are given in the second table above.
Step 4.The tolerance factor for a one-sided Normal tolerance interval with an average of 95%
coverage with 95% probability and n=4 observations is given by
    k = t3,.05·√(1 + 1/4) = 2.631
The upper tolerance limit is calculated below for each of the five wells.
Well 1 2.81+2.631(0.68)= 4.61 log(ppb)
Well 2 2.20+2.631(0.45)= 3.38 log(ppb)
Well 3 3.79+2.631(0.35)= 4.71 log(ppb)
Well 4 3.18+2.631(0.25)= 3.85 log(ppb)
Well 5 3.28+2.631(0.48)= 4.54 log(ppb)
Step 5.Compare the upper tolerance limit for each well to the logarithm of the ACL, that is
log(80)=4.38. Since the upper tolerance limits for wells 1, 3, and 5 exceed the logged
ACL of 4.38 log(ppb), there is evidence of chrysene contamination in wells 1, 3, and 5.
4.1.1 Non-parametric Tolerance Intervals
When the assumptions of Normality and Lognormality cannot be justified, especially when a
significant portion of the samples are nondetect, the use of non-parametric tolerance intervals
should be considered. The upper Tolerance limit in a non-parametric setting is usually chosen as
an order statistic of the sample data (see Guttman, 1970), commonly the maximum value or
maybe the second largest value observed. As a consequence, non-parametric intervals should be
constructed only from wells that are not contaminated. Because the maximum sample value is
often taken as the upper Tolerance limit, non-parametric Tolerance intervals are very easy to
construct and use. The sample data must be ordered, but no ranks need be assigned to the
concentration values other than to determine the largest measurements. This also means that
nondetects do not have to be uniquely ordered or handled in any special manner.
One advantage to using the maximum concentration instead of assigning ranks to the data is
that non-parametric intervals (including Tolerance intervals) are sensitive to the actual magnitudes
of the concentration data. Another plus is that unless all the sample data are nondetect, the
maximum value will be a detected concentration, leading to a well-defined upper Tolerance limit.
Once an order statistic of the sample data (e.g., the maximum value) is chosen to represent
the upper tolerance limit, Guttman (1970) has shown that the coverage of the interval,
constructed repeatedly over many data sets, has a Beta probability density with cumulative
distribution
    It(n-m+1, m) = [Γ(n+1) / (Γ(n-m+1)·Γ(m))] ∫₀ᵗ u^(n-m) (1-u)^(m-1) du
where n=# samples in the data set and m=[(n+1)-(rank of upper tolerance limit value)]. If the
maximum sample value is selected as the tolerance limit, its rank is equal to n and so m=1. If the
second largest value is chosen as the limit, its rank would be equal to (n-1) and so m=2.
Since the Beta distribution is closely related to the more familiar Binomial distribution,
Guttman has shown that in order to construct a non-parametric tolerance interval with at least β%
coverage and (1-α) confidence probability, the number of (background) samples must be chosen
such that

    Σ(t=m to n) C(n,t)·(1-β)^t·β^(n-t) ≥ 1-α
Table A-6 in Appendix A provides the minimum coverage levels with 95% confidence for
various choices of n, using either the maximum sample value or the second largest measurement
as the tolerance limit. As an example, with 16 background measurements, the minimum coverage
is β=83% if the maximum background value is designated as the upper Tolerance limit and
β=74% if the Tolerance limit is taken to be the second largest background value. In general,
Table A-6 illustrates that if the underlying distribution of concentration values is unknown, more
background samples are needed compared to the parametric setting in order to construct a
tolerance interval with sufficiently high coverage. Parametric tolerance intervals do not require as
many background samples precisely because the form of the underlying distribution is assumed to
be known.
Because the coverage of the above non-parametric Tolerance intervals follows a Beta
distribution, it can also be shown that the average (not the minimum as discussed above) level of
coverage is equal to 1-[m/(n+1)] (see Guttman, 1970). In particular, when the maximum sample
value is chosen as the upper tolerance limit, m=1, and the expected coverage is equal to n/(n+1).
This implies that at least 19 background samples are necessary to achieve 95% coverage on
average.
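The coverage figures cited from Table A-6 can be reproduced from the Beta relationship above. The sketch below is illustrative only (Python, assuming SciPy); it returns the minimum coverage at 95% confidence and the expected coverage when the maximum (m=1) or second-largest (m=2) background value is used as the limit.

    from scipy import stats

    def np_tolerance_coverage(n, m=1, conf=0.95):
        """Minimum and expected coverage of a non-parametric upper tolerance limit
        set at the m-th largest of n background samples (m=1 is the maximum)."""
        min_cov = stats.beta.ppf(1 - conf, n - m + 1, m)   # coverage exceeded with probability conf
        avg_cov = 1 - m / (n + 1)
        return min_cov, avg_cov

    print(np_tolerance_coverage(16, m=1))   # roughly (0.83, 0.94)
    print(np_tolerance_coverage(16, m=2))   # roughly (0.74, 0.88)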
EXAMPLE 15
Use the following copper background data to establish a non-parametric upper Tolerance
limit and determine if either compliance well shows evidence of copper contamination.
Copper Concentration (ppb)
Background Wells Compliance Wells
Month Well 1 Well 2 Well 3 Well 4 Well 5
1 <5 9.2 <5
2 <5 <5 5.4
3 7.5 <5 6.7
4 <5 6.1 <5
5 <5 8.0 <5 6.2 <5
6 <5 5.9 <5 <5 <5
7 6.4 <5 <5 7.8 5.6
8 6.0 <5 <5 10.4 <5
SOLUTION
Step 1.Examine the background data in Wells 1, 2, and 3 to determine that the maximum
observed value is 9.2 ppb. Set the 95% confidence upper Tolerance limit equal to this
value. Because 24 background samples are available, Table A-6 indicates that the
minimum coverage is equal to 88% (the expected average coverage, however, is equal
to 24/25=96%). To increase the coverage level, more background samples would have
to be collected.
Step 2.Compare each sample in compliance Wells 4 and 5 to the upper Tolerance limit. Since
none of the measurements at Well 5 is above 9.2 ppb, while one sample from Well 4 is
above the limit, conclude that there is significant evidence of copper contamination at
Well 4 but not Well 5.
4.2 PREDICTION INTERVALS
When comparing background data to compliance point samples, a Prediction interval can be
constructed on the background values. If the distributions of background and compliance point
data are really the same, all the compliance point samples should be contained below the upper
Prediction interval limit. Evidence of contamination is indicated if one or more of the compliance
samples lies above the upper Prediction limit.
With intrawell comparisons, a Prediction interval can be computed on past data to contain a
specified number of future observations from the same well, provided the well has not been
previously contaminated. If any one or more of the future samples falls above the upper
Prediction limit, there is evidence of recent contamination at the well. The steps to calculate
parametric Prediction intervals are given on pp. 5-24 to 5-28 of the Interim Final Guidance.
EXAMPLE 16
The data in the table below are benzene concentrations measured at a groundwater
monitoring facility. Calculate the Prediction interval and determine whether there is evidence of
contamination.
Background Well Data                          Compliance Well Data
Sampling Date   Benzene Conc. (ppb)           Sampling Date   Benzene Conc. (ppb)
Month 1 12.6 Month 4 48.0
30.8 30.3
52.0 42.5
28.1 15.0
Month 2 33.3
44.0 n=4
3.0 Mean=33.95
12.8 SD=14.64
Month 3 58.1 Month 5 47.6
12.6 3.8
17.6 2.6
25.3 51.9
n=12 n=4
Mean=27.52 Mean=26.48
SD=17.10 SD=26.94
SOLUTION
Step 1.First test the background data for approximate Normality. Only the background data
are included since these values are used to construct the Prediction interval.
Step 2.A Probability Plot of the 12 background values is given below. The plot indicates an
overall pattern that is reasonably linear with some modest departures from Normality.
To further test the assumption of Normality, run the Shapiro-Wilk test on the
background data.
[Figure: PROBABILITY PLOT of Normal quantiles versus benzene concentration (ppb) for the 12 background measurements]
Step 3.List the data in ascending and descending order as in the following table. Also calculate
the differences x(n-i+1)-x(i) and multiply by the coefficients an-i+1 taken from Table A-1
to get the components of vector bi used to calculate the Shapiro-Wilk statistic (W).
i    x(i)    x(n-i+1)    a(n-i+1)    bi
1 3.0 58.1 0.548 30.167
2 12.6 52.0 0.333 13.101
3 12.6 44.0 0.235 7.370
4 12.8 33.3 0.159 3.251
5 17.6 30.8 0.092 1.217
6 25.3 28.1 0.030 0.085
7 28.1 25.3 b=55.191
8 30.8 17.6
9 33.3 12.8
10 44.0 12.6
11 52.0 12.6
12 58.1 3.0
Step 4.Sum the components bi in column 5 to get quantity b. Compute the standard deviation
of the background benzene values. Then the Shapiro-Wilk statistic is given as
    W = [b / (SD·√(n-1))]² = [55.191 / (17.101·√11)]² = 0.947.
Step 5.The critical value at the 5% level for the Shapiro-Wilk test on 12 observations is 0.859.
Since the calculated value of W=0.947 is well above the critical value, there is no
evidence to reject the assumption of Normality.
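A library routine gives essentially the same answer. In the sketch below (Python, assuming SciPy; not part of the guidance), scipy.stats.shapiro is applied to the twelve background values; because the library approximates the coefficients of Table A-1 algorithmically, its W may differ slightly from the hand-computed 0.947.

    from scipy import stats

    background = [12.6, 30.8, 52.0, 28.1, 33.3, 44.0,
                  3.0, 12.8, 58.1, 12.6, 17.6, 25.3]
    W, p_value = stats.shapiro(background)
    print(round(W, 3), round(p_value, 3))   # W near 0.95; p well above 0.05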
Step 6.Compute the Prediction interval using the original background data. The mean and
standard deviation of the 12 background samples are given by 27.52 ppb and 17.10
ppb, respectively.
Step 7. Since there are two future months of compliance data to be compared to the Prediction
limit, the number of future sampling periods is k=2. At each sampling period, a mean
of four independent samples will be computed, so m=4 in the prediction interval
formula (see Interim Final Guidance, p. 5-25). The Bonferroni t-statistic, t(11,2,.95),
with k=2 and 11 df is equivalent to the usual t-statistic at the .975 level with 11 df, i.e.,
t(11,.975) = 2.201.
Step 8. Compute the upper one-sided Prediction limit (UL) using the formula:

    UL = X̄ + t(n-1, k, .95) · S · √(1/m + 1/n)

Then the UL is given by:
    UL = 27.52 + (17.10)(2.201) √(1/4 + 1/12) = 49.25 ppb.
Step 9. Compare the UL to the compliance data. The means of the four compliance well
observations for months 4 and 5 are 33.95 ppb and 26.48 ppb, respectively. Since the
mean concentrations for months 4 and 5 are below the upper Prediction limit, there is
no evidence of recent contamination at the monitoring facility.
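The calculations in this example can also be carried out in software. The sketch below is offered for illustration only and is not part of the Interim Final Guidance; it assumes the Python SciPy library is available, uses SciPy's built-in Shapiro-Wilk routine (which approximates the Table A-1 coefficients, so W may differ slightly from the hand calculation), and obtains the Bonferroni t-statistic t(n-1, k, .95) as the usual t-quantile at the (1 - .05/k) level.

    import numpy as np
    from scipy import stats

    # Background benzene concentrations (ppb) from Example 16
    background = np.array([12.6, 30.8, 52.0, 28.1, 33.3, 44.0, 3.0, 12.8,
                           58.1, 12.6, 17.6, 25.3])

    # Steps 1-5: check of approximate Normality
    W, p_value = stats.shapiro(background)       # W should be close to the hand-calculated 0.947

    # Steps 6-8: upper prediction limit for k future means of m samples each
    n = len(background)                          # 12 background samples
    k, m = 2, 4                                  # 2 future sampling periods, 4 samples per period
    t_bonf = stats.t.ppf(1 - 0.05 / k, df=n - 1) # Bonferroni t-statistic, about 2.201
    UL = background.mean() + t_bonf * background.std(ddof=1) * np.sqrt(1.0 / m + 1.0 / n)
    print(round(W, 3), round(UL, 2))             # UL is approximately 49.25 ppb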
4.2.1 Non-parametric Prediction Intervals
When the parametric assumptions of a Normal-based Prediction limit cannot be justified,
often due to the presence of a significant fraction of nondetects, a non-parametric Prediction
interval may be considered instead. A non-parametric upper Prediction limit is typically
constructed in the same way as a non-parametric upper Tolerance limit, that is, by estimating the
limit to be the maximum value of the set of background samples.
The difference between non-parametric Tolerance and Prediction limits is one of
interpretation and probability. Given n background measurements and a desired confidence level,
a non-parametric Tolerance interval will have a certain coverage percentage. With high
probability, the Tolerance interval is designed to miss only a small percentage of the samples from
downgradient wells. A Prediction limit, on the other hand, involves the confidence probability
that the next future sample or samples will all fall below the upper Prediction limit. In this
sense, the Prediction limit may be thought of as a 100% coverage Tolerance limit for the next k
future samples.
As Guttman (1970) has indicated, the confidence probability associated with predicting that
the next single observation from a downgradient well will fall below the upper Prediction limit --
estimated as the maximum background value -- is the same as the expected coverage of a similarly
constructed upper Tolerance limit, namely (1-α) = n/(n+1). Furthermore, it can be shown from
Gibbons (1991b) that the probability of having k future samples all fall below the upper non-
parametric Prediction limit is (1-α) = n/(n+k). Table A-7 in Appendix A lists these confidence
levels for various choices of n and k. The false positive rate associated with a single Prediction
limit can be computed as one minus the confidence level.
Balancing the ease with which non-parametric upper Prediction limits are constructed is the
fact that, given fixed numbers of background samples and future sample values to be predicted,
the maximum confidence level associated with the Prediction limit is also fixed. To increase the
level of confidence, the only choices are to 1) decrease the number of future values to be
predicted at any testing period, or 2) increase the number of background samples used in the test.
Table A-7 can be used along these lines to plan an appropriate sampling strategy so that the false
positive rate can be minimized and the confidence probability maximized to a desired level.
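The confidence levels tabulated in Table A-7 follow directly from the n/(n+k) formula above. The short sketch below (an illustration only, not part of the guidance) shows how the achievable confidence level and false positive rate change with the number of background samples when k = 2 future samples are to be predicted.

    def nonparametric_confidence(n_background, k_future):
        """Probability that k future samples all fall below the background maximum."""
        return n_background / (n_background + k_future)

    for n in (8, 16, 18, 24, 32):
        conf = nonparametric_confidence(n, k_future=2)
        # prints the background sample size, confidence level, and false positive rate
        print(n, round(conf, 3), round(1 - conf, 3))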
EXAMPLE 17
Use the following arsenic data from a monitoring facility to compute a non-parametric upper
Prediction limit that will contain the next 2 monthly measurements from a downgradient well and
determine the level of confidence associated with the Prediction limit.

                       Arsenic Concentrations (ppb)

                            Background Wells              Compliance
         Month         Well 1    Well 2    Well 3           Well 4
           1             <5         7        <5
           2             <5        6.5       <5
           3              8        <5       10.5
           4             <5         6        <5
           5              9        12        <5                8
           6             10        <5         9               14
SOLUTION
Step 1. Determine the maximum value of the background data and use this value to estimate
the upper Prediction limit. In this case, the Prediction limit is set to the maximum value
of the n=18 samples, or 12 ppb. As is true of non-parametric Tolerance intervals, only
uncontaminated wells should be used in the construction of Prediction limits.
Step 2. Compute the confidence level and false positive rate associated with the Prediction
limit. Since two future samples are being predicted and n=18, the confidence level is
found to be n/(n+k)=18/20=90%. Consequently, the Type I error or false positive rate
is equal to (1-.90)=10%. If a lower false positive rate is desired, the number of
background samples used in the test must be enlarged.
Step 3. Compare each of the downgradient samples against the upper Prediction limit. Since
the second of the two compliance measurements (14 ppb) exceeds the limit, conclude that there is
significant evidence of contamination at the downgradient well at the 10% level of significance.
4.3 CONFIDENCE INTERVALS
Confidence intervals should only be constructed on data collected during compliance
monitoring, in particular when the Ground-Water Protection Standard (GWPS) is an ACL
computed from the average of background samples. Confidence limits for the average
concentration levels at compliance wells should not be compared to MCLs. Unlike a Tolerance
interval, Confidence limits for an average do not indicate how often individual samples will exceed
the MCL. Conceivably, the lower Confidence limit for the mean concentration at a compliance
well could fall below the MCL, yet 50 percent or more of the individual samples might exceed the
MCL. Since an MCL is designed to set an upper bound on the acceptable contamination, this
would not be protective of human health or the environment.
When comparing individual compliance wells to an ACL derived from average background
levels, a lower one-sided 99 percent Confidence limit should be constructed. If the lower
Confidence limit exceeds the ACL, there is significant evidence that the true mean concentration
at the compliance well exceeds the GWPS and that the facility permit has been violated. Again, in
most cases, a Lognormal model will approximate the data better than a Normal distribution
model. It is therefore recommended that the initial data checking and analysis be performed on
the logarithms of the data. If a Confidence interval is constructed using logged concentration
data, the lower Confidence limit should be compared to the logarithm of the ACL rather than the
original GWPS. Steps for computing Confidence intervals are given on pp. 6-3 to 6-11 of the
Interim Final Guidance.
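The comparison described above can be sketched as follows. This example is illustrative only; the compliance concentrations and the ACL shown are hypothetical, and the calculation simply forms a lower one-sided 99 percent Confidence limit on the mean of the logged data and compares it to the logarithm of the ACL.

    import numpy as np
    from scipy import stats

    compliance = np.array([4.7, 6.2, 8.9, 5.5, 7.1, 6.8])   # hypothetical compliance data (ppb)
    ACL = 5.0                                                # hypothetical ACL (ppb)

    logged = np.log(compliance)
    n = len(logged)
    # Lower one-sided 99% confidence limit for the mean of the logged concentrations
    lcl = logged.mean() - stats.t.ppf(0.99, df=n - 1) * logged.std(ddof=1) / np.sqrt(n)

    if lcl > np.log(ACL):
        print("Lower confidence limit exceeds log(ACL): evidence that the GWPS has been violated.")
    else:
        print("No significant evidence that the true mean concentration exceeds the ACL.")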
5. STRATEGIES FOR MULTIPLE COMPARISONS
5.1 BACKGROUND OF PROBLEM
Multiple comparisons occur whenever more than one statistical test is performed during any
given monitoring or evaluation period. These comparisons can arise as a result of the need to test
multiple downgradient wells against a pool of upgradient background data or to test several
indicator parameters for contamination on a regular basis. Usually the same statistical test is
performed in every comparison, each test having a fixed level of confidence (1-α), and a
corresponding false positive rate, α.
The false positive rate (or Type I error) for an individual comparison is the probability that
the test will falsely indicate contamination, i.e., that the test will "trigger," though no
contamination has occurred. If ground-water data measurements were always constant in the
absence of contamination, false positives would never occur. But ground-water measurements
typically vary, either due to natural variation in the levels of background concentrations or to
variation in lab measurement and analysis.
Applying the same test to each comparison is acceptable if the number of comparisons is
small, but when the number of comparisons is moderate to large the false positive rate associated
with the testing network as a whole (that is, across all comparisons involving a separate statistical
test) can be quite high. This means that if enough tests are run, there will be a significant chance
that at least one test will indicate contamination, even if no actual contamination has occurred. As
an example, if the testing network consists of 20 separate comparisons (some combination of
multiple wells and/or indicator parameters) and a 99% confidence level Prediction interval limit is
used on each comparison, one would expect an overall network-wide false positive rate of over
18%, even though the Type I error for any single comparison is only 1%. This means there is
nearly 1 chance in 5 that one or more comparisons will falsely register potential contamination
even if none has occurred. With 100 comparisons and the same testing procedure, the overall
network-wide false positive rate jumps to more than 63%, adding additional expense to verify the
lack of contamination at falsely triggered wells.
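The network-wide rates quoted above follow from treating the comparisons as independent, in which case the chance of at least one false trigger among m comparisons, each run at individual false positive rate α, is 1 - (1 - α)^m. A brief illustrative check of the arithmetic:

    def network_false_positive_rate(alpha, m):
        """Chance of at least one false trigger among m independent comparisons."""
        return 1 - (1 - alpha) ** m

    print(round(network_false_positive_rate(0.01, 20), 3))    # about 0.18, i.e., over 18%
    print(round(network_false_positive_rate(0.01, 100), 3))   # about 0.63, i.e., more than 63%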
To lower the network-wide false positive rate, there are several important considerations.
As noted in Section 2.2.4, only those constituents that have been shown to be reliable indicators
of potential contamination should be statistically tested on a regular basis. By limiting the number
of tested constituents to the most useful indicators, the overall number of statistical comparisons
that must be made can be reduced, lowering the facility-wide false alarm rate. In addition,
depending on the hydrogeology of the site, some indicator parameters may need to be tested only
at one (or a few adjacent) regulated waste units, as opposed to testing across the entire facility, as
long as the permit specifies a common point of compliance, thus further limiting the number of
total statistical comparisons necessary.
One could also try to lower the Type I error applied to each individual comparison.
Unfortunately, for a given statistical test in general, the lower the false positive rate, the lower the
power of the test to detect real contamination at the well. If the statistical power drops too much,
real contamination will not be identified when it occurs, creating a situation not protective of the
environment or human health. Instead, alternative testing strategies can be considered that
specifically account for the number of statistical comparisons being made during any evaluation
period. All alternative testing strategies should be evaluated in light of two basic goals:
1. Is the network-wide false positive rate (across all constituents and wells being
tested) acceptably low? and
2. Does the testing strategy have adequate statistical power to detect real
contamination when it occurs?
To establish a standard recommendation for the network-wide overall false positive rate, it
should be noted that for some statistical procedures, EPA specifications mandate that the Type I
error for any individual comparison be at least 1%. The rationale for this minimum requirement is
motivated by statistical power. For a given test, if the Type I error is set too low, the power of
the test will dip below “acceptable” levels. EPA was not able to specify a minimum level of
acceptable power within the regulations because to do so would require specification of a
minimum difference of environmental concern between the null and alternative hypotheses.
Limited current knowledge about the health and/or environmental effects associated with
incremental changes in concentration levels of Appendix IX constituents greatly complicates this
task. Therefore, minimum false positive rates were adopted for some statistical procedures until
more specific guidance could be recommended. EPA's main objective, however, as in the past, is
to approve tests that have adequate statistical power to detect real contamination of ground
water, and not to enforce minimum false positive rates.
This emphasis is evident in §264.98(g)(6) for detection monitoring and §264.99(i) for
compliance monitoring. Both of these provisions allow the owner or operator to demonstrate that
the statistically significant difference between background and compliance point wells or between
compliance point wells and the Ground-Water Protection Standard is an artifact caused by an
error in sampling, analysis, statistical evaluation, or natural variation in ground-water chemistry.
To make the demonstration that the statistically significant difference was caused by an error in
sampling, analysis, or statistical evaluation, re-testing procedures that have been approved by the
Regional Administrator can be written into the facility permit, provided their statistical power is
comparable to the EPA Reference Power Curve given below.
For large monitoring networks, it is almost impossible to maintain a low network-wide
overall false positive rate if the Type I errors for individual comparisons must be kept above 1%.
As will be seen, some alternative testing strategies can achieve a low network-wide false positive
rate while maintaining adequate power to detect contamination. EPA therefore recommends that,
instead of the 1% criterion for individual comparisons, the overall network-wide false positive
rate (across all wells and constituents) of any alternative testing strategy should be kept to
approximately 5% for each monitoring or evaluation period, while maintaining statistical power
comparable to the EPA Reference Power Curve described below.
The other goal of any testing strategy should be to maintain adequate statistical power for
detecting contamination. Technically, power refers to the probability that a statistical testing
procedure will register and identify evidence of contamination when it exists. However, power is
typically defined with respect to a single comparison, not a network of comparisons. Since some
testing procedures may identify contamination more readily when several wells in the network are
contaminated as opposed to just one or two, it is suggested that all testing strategies be compared
on the following more stringent, but common, basis. Let the effective power of a testing
procedure be defined as the probability of detecting contamination in the monitoring network
when one and only one well is contaminated with a single constituent. Note that the effective
power is a conservative measure of how a testing regimen will perform over the network, because
the test must uncover one contaminated well among many clean ones (i.e., like "finding a needle
in a haystack").
To establish a recommended standard for the statistical power of a testing strategy, it must
be understood that the power is not a single number, but rather a function of the level of
contamination actually present. For most tests, the higher the level of contamination, the higher
the statistical power; likewise, the lower the contamination level, the lower the power. As such,
when increasingly contaminated ground water passes a particular well, it becomes easier for the
statistical test to distinguish background levels from the contaminated ground water;
consequently, the power is an increasing function of the contamination level.
Perhaps the best way to describe the power function associated with a particular testing
procedure is via a graph, such as the example below of the power of a standard Normal-based
upper Prediction limit with 99% confidence. The power in percent is plotted along the y-axis
against the standardized mean level of contamination along the x-axis. The standardized
contamination levels are in units of standard deviations above the baseline (estimated from
background data), allowing different power curves to be compared across indicator parameters,
wells, and so forth. The standardized units, D, may be computed as
    D = [ (Mean Contamination Level) - (Mean Background Level) ] / (SD of Background Data)
In some situations, the probability that contamination will be detected by a particular testing
procedure may be difficult if not impossible to derive analytically and will have to be simulated on
a computer. In these cases, the power is typically estimated by generating Normally-distributed
random values at different mean levels and repeatedly simulating the test procedure. With enough
repetitions a reliable power curve can be plotted (e.g., see figure below).
[Figure: EPA Reference Power Curve (16 background samples) -- effective power (%) plotted against D, the standardized units above background.]
Notice that the power at D=0 represents the false positive rate of the test, because at that
point no contamination is actually present and the curve is indicating how often contamination will
be "detected" anyway. As long as the power at D=0 is approximately 5% (except for tests on an
individual constituent at an individual well where the false positive rate should approximate 1%)
and the rest of the power curve is acceptably high, the testing strategy should be adequately
comparable to EPA standards.
To determine an acceptable power curve for comparison to alternative testing strategies, the
following EPA Reference Power Curve is suggested. For a given and fixed number of
background measurements, and based on Normally-distributed data from a single downgradient
well generated at various mean levels above background, the EPA Reference Power Curve will
represent the power associated with a 99% confidence upper prediction limit on the next single
future sample from the well (see figure above for n=16).
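For readers who wish to reproduce or extend such curves, the following simulation sketch is offered as an illustration only. It assumes Normally-distributed data, n = 16 background samples, and the t-based upper prediction limit for a single future sample; the estimated power at each standardized mean level D is the fraction of simulated future samples that exceed the simulated limit.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n_bg = 16                                   # number of background samples
    t_99 = stats.t.ppf(0.99, df=n_bg - 1)       # factor for a 99% confidence prediction limit
    n_sims = 20000

    for delta in np.arange(0.0, 5.5, 0.5):      # standardized units above background
        bg = rng.standard_normal((n_sims, n_bg))
        limit = bg.mean(axis=1) + t_99 * bg.std(axis=1, ddof=1) * np.sqrt(1 + 1.0 / n_bg)
        future = rng.standard_normal(n_sims) + delta
        print(round(float(delta), 1), round(float(np.mean(future > limit)), 3))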
Since the power of a test depends on several factors, including the background sample size,
the type of test, and the number of comparisons, a different EPA Reference Power Curve will be
associated with each distinct number of background samples. Power curves of alternative tests
should only be compared to the EPA Reference Power Curve using a comparable number of
background measurements. If the power of the alternative test is at least as high as the EPA
reference, while maintaining an approximate 5% overall false positive rate, the alternative
procedure should be acceptable.
With respect to power curves, keep in mind three important considerations: 1) the power of
any testing method can be increased merely by relaxing the false positive rate requirement, letting
α become larger than 5%. This is why an approximate 5% alpha level is suggested as the
standard guidance, to ensure fair power comparisons among competing tests and to limit the
overall network-wide false positive rate. 2) The simulation of alternative testing methods should
incorporate every aspect of the procedure, from initial screens of the data to final decisions
concerning the presence of contamination. This is especially applicable to strategies that involve
some form of retesting at potentially contaminated wells. 3) When the testing strategy
incorporates multiple comparisons, it is crucial that the power be gauged by simulating
contamination in one and only one indicator parameter at a single well (i.e., by measuring the
effective power). As noted earlier, EPA recommends that power be defined conservatively,
forcing any test procedure to find "the needle in the haystack."
5.2 POSSIBLE STRATEGIES
5.2.1 Parametric and Non-parametric ANOVA
As described in the Interim Final Guidance, ANOVA procedures (either the parametric
method or the Kruskal-Wallis test) allow multiple downgradient wells (but not multiple
constituents) to be combined into a single statistical test, thus enabling the network-wide false
positive rate for any single constituent to be kept at 5% regardless of the size of the network. The
ANOVA method also maintains decent power for detecting real contamination, though only for
small to moderately-sized networks. In large networks, even the parametric ANOVA has a
difficult time finding the "needle in a haystack." The reason for this is that the ANOVA F-test
combines all downgradient wells simultaneously, so that "clean" wells are mixed together with the
single contaminated well, potentially masking the test's ability to detect the source of
contamination.
Because of these characteristics, the ANOVA procedure may have poorer power for
detecting a narrow plume of contamination which affects only one or two wells in a much larger
network (say 20 or more comparisons). Another drawback is that a significant ANOVA test
result will not indicate which well or wells are potentially contaminated without further post-hoc
testing. Furthermore, the power of the ANOVA procedure depends significantly on having at
least 3 to 4 samples per well available for testing. Since the samples must be statistically
independent, collection of 3 or more samples at a given well may necessitate a several-month wait
if the natural ground-water velocity at that well is low. In this case, it may be tempting to look
for other strategies (e.g., Tolerance or Prediction intervals) that allow statistical testing of each
new ground water sample as it is collected and analyzed. Finally, since the simple one-way
ANOVA procedure outlined in the Interim Final Guidance is not designed to test multiple
constituents simultaneously, the overall false positive rate will be approximately 5% per
constituent, leading to a potentially high overall network-wide false positive rate (across wells and
constituents) if many constituents need to be tested.
5.2.2 Retesting with Parametric Intervals
One strategy alternative to ANOVA is a modification of approaches suggested by Gibbons
(1991a) and Davis and McNichols (1987). The basic idea is to adopt a two-phase testing
strategy. First, new samples from each well in the network are compared, for each designated
constituent parameter, against an upper Tolerance limit with pre-specified average coverage
(Note that the upper Tolerance limit will be different for each constituent). Since some
constituents at some wells in a large network would be expected to fail the Tolerance limit even in
the absence of contamination, each well that triggers the Tolerance limit is resampled and only
those constituents that "triggered" the limit are retested via an upper Prediction limit (again
differing by constituent). If one or more resamples fails the upper Prediction limit, the specific
constituent at that well failing the test is deemed to have a concentration level significantly greater
than background. The overall strategy is effective for large networks of comparisons (e.g., 100 or
more comparisons), but also flexible enough to accommodate smaller networks.
To design and implement an appropriate pair of Tolerance and Prediction intervals, one
must know the number of background samples available and the number of comparisons in the
network. Since parametric intervals are used, it is assumed that the background data are either
Normal or can be transformed to an approximate Normal distribution. The tricky part is to
choose an average coverage for the Tolerance interval and confidence level for the Prediction
interval such that the twin goals are met of keeping the overall false positive rate to approximately
5% and maintaining adequate statistical power.
To derive the overall false positive rate for this retesting strategy, assume that when no
contamination is present each constituent and well in the network behaves independently of other
constituents and wells. Then if Ai denotes the event that well i is triggered falsely at some stage
of the testing, the overall false positive rate across m such comparisons can be written as
    total α = Pr{ A1 or A2 or ... or Ai or ... or Am } = 1 - [ Pr{Ā1} × Pr{Ā2} × ... × Pr{Ām} ]

where Āi denotes the complement of event Ai. Since Pr{Āi} is the probability of not registering a
false trigger at uncontaminated well i, it may be written as

    Pr{Āi} = Pr{Xi ≤ TL} + Pr{Xi > TL} × Pr{Yi ≤ PL | Xi > TL}
where Xi represents the original sample at well i, Yi represents the concentrations of one or more
resamples at well i, TL and PL denote the upper Tolerance and Prediction limits respectively, and
the right-most probability is the conditional event that all resample concentrations fall below the
Prediction limit when the initial sample fails the Tolerance limit.
Letting x = Pr{Xi ≤ TL} and y = Pr{Yi ≤ PL | Xi > TL}, the overall false positive rate across m
constituent-well combinations can be expressed as

    total α = 1 - [ x + (1 - x)·y ]^m
As noted by Guttman (1970), the probability that any random sample will fall below the
upper Tolerance limit (i.e., quantity x above) is equal to the expected or average coverage of the
Tolerance interval. If the Tolerance interval has been constructed to have average coverage of
95%, x=0.95. Then given a predetermined value for x, a fixed number of comparisons m, and a
desired overall false positive rate α, we can solve for the conditional probability y as follows:

    y = [ (1 - α)^(1/m) - x ] / (1 - x)
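For example (an illustrative calculation only), with m = 20 constituent-well comparisons, a Tolerance interval with 95% average coverage (x = 0.95), and a target overall false positive rate of α = 0.05, the solution is

    y = [ (0.95)^(1/20) - 0.95 ] / (1 - 0.95) = (0.99744 - 0.95) / 0.05 ≈ 0.949

so the Prediction interval would nominally require a confidence level of roughly 95 percent.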
If the conditional probability y were equal to the probability that the resample(s) for the ith
constituent-well combination falls below the upper Prediction limit, one could fix α at, say, 5%,
and construct the Prediction interval to have confidence level y. In that way, one could guarantee
an expected network-wide false positive rate of 5%. Unfortunately, whether or not one or more
resamples falls below the Prediction limit depends partly on whether the initial sample for that
comparison eclipsed the Tolerance limit. This is because the same background data are used to
construct both the Tolerance limit and the Prediction limit, creating a statistical dependence
between the tests.
The exact relationship between the conditional probability y and the unconditional
probability Pr{Yi£PL} is not known; however, simulations of the testing strategy suggest that
when the confidence level for the Prediction interval is equated to the above solution for y, the
overall network-wide false positive rate turns out to be higher than 5%. How much higher
depends on the number of background samples and also the number of downgradient
comparisons. Even with a choice of y that guarantees an expected facility-wide false positive rate
of 5%, the power characteristics of the resulting testing strategy are not necessarily equivalent to
the EPA Reference Power Curve, again depending on the number of background samples and the
number of monitoring well-constituent combinations in the network.
In practice, to meet the selection criteria of 1) establishing an overall false positive rate of
approximately 5% and 2) maintaining adequate statistical power, the confidence level chosen for
the upper Prediction limit should be somewhat higher than the solution y to the preceding
equation. The table below provides recommended choices of expected coverage and confidence
levels for the Tolerance interval-Prediction interval pair when using specific combinations of
numbers of downgradient comparisons and background samples. In general, one should pick
lower coverage Tolerance limits for smaller networks and higher coverage Tolerance limits for
larger networks. That way (as can be seen in the table), the resulting Prediction limit confidence
levels will be low enough to allow the construction of Prediction limits with decent statistical
power.
                      PARAMETRIC RETESTING STRATEGIES

       #            # BG       TOLERANCE      PREDICTION
  COMPARISONS     SAMPLES    COVERAGE (%)      LEVEL (%)     RATING
       5              8           95              90           **
                     16           95              90           **
                     16           95              85           *
                     24           95              85           **
                     24           95              90           *
      20              8           95              98           **
                     16           95              97           **
                     24           95              97           **
      50             16           98              97           **
                     16           99              92           *
                     24           98              95           **
                     24           99              90           **
     100             16           98              98           *
                     24           99              95           *
                     24           98              98           *

Note: ** = strongly recommended; * = recommended
Only strategies that approximately met the selection criteria are listed in the table. It can be
seen that some, but not all, of these strategies are strongly recommended. Those that are merely
"recommended" failed in the simulations to fully meet one or both of the selection criteria. The
performance of all the recommended strategies, however, should be adequate to correctly identify
contamination while maintaining a modest facility-wide false positive rate.
Once a combination of coverage and confidence levels for the Tolerance-Prediction interval
pair is selected, the statistical power of the testing strategy should be estimated in order to
compare with the EPA Reference Power Curve (particularly if the testing scenario is different
from those computed in this Addendum). Simulation results have suggested that the above
method for choosing a two-phase testing regimen can offer statistical power comparable to the
EPA Reference for almost any sized monitoring network (see power curves in Appendix B).
Several examples of simulated power curves are presented in Appendix B. The range of
downgradient wells tested is from 5 to 100 (note that the number of wells could actually represent
the number of constituent-well combinations if testing multiple parameters), and each curve is
based on either 8, 16, or 24 background samples. The y-axis of each graph measures the effective
power of the testing strategy, i.e., the probability that contamination is detected when one and
only one constituent at a single well has a mean concentration higher than background level. For
each case, the EPA Reference Power Curve is compared to two different two-phase testing
strategies. In the first case, wells that trigger the initial Tolerance limit are resampled once. This
single resample is compared to a Prediction limit for the next future sample. In the second case,
wells that trigger the Tolerance limit are resampled twice. Both resamples are compared to an
upper Prediction limit for the next two future samples at that well.
The simulated power curves suggest two points. First, with an appropriate choice of
coverage and prediction levels, the two-phase retesting strategies have comparable power to the
EPA Reference Power Curve, while maintaining low overall network-wide false positive rates.
Second, the power of the retesting strategy is slightly improved by the addition of a second
resample at wells that fail the initial Tolerance limit, because the sample size is increased.
Overall, the two-phase testing strategy defined above--i.e., first screening the network of
wells with a single upper Tolerance limit, and then applying an upper Prediction limit to resamples
from wells which fail the Tolerance interval--appears to meet EPA's objectives of maintaining
adequate statistical power for detecting contamination while limiting network-wide false positive
rates to low levels. Furthermore, since each compliance well is compared against the interval
limits separately, a narrow plume of contamination can be identified more efficiently than with an
ANOVA procedure (e.g., no post-hoc testing is necessary to finger the guilty wells, and the two-
phase interval testing method has more power against the "needle-in-a-haystack" contamination
hypothesis).
5.2.3 Retesting with Non-parametric Intervals
When parametric intervals are not appropriate for the data at hand, either due to a large
fraction of nondetects or a lack of fit to Normality or Lognormality, a network of individual
comparisons can be handled via retesting using non-parametric Prediction limits. The strategy is
to establish a non-parametric prediction limit for each designated indicator parameter based on
background samples that accounts for the number of well-constituent comparisons in the overall
network.
In order to meet the twin goals of maintaining adequate statistical power and a low overall
rate of false positives, a non-parametric strategy must involve some level of retesting at those
wells which initially indicate possible contamination. Retesting can be accomplished by taking a
specific number of additional, independent samples from each well in which a specific constituent
triggers the initial test and then comparing these samples against the non-parametric prediction
limit for that parameter.
Because more independent data is added to the overall testing procedure, retesting of
additional samples, in general, enables one to make more powerful and more accurate
determinations of possible contamination. Retesting does, however, involve a trade-off. Because
the power of the test increases with the number of resamples, one must decide how quickly
resamples can be collected to ensure 1) quick identification and confirmation of contamination
and yet, 2) the statistical independence of successive resamples from any particular well. Do not
forget that the performance of a non-parametric retesting strategy depends substantially on the
independence of the data from each well.
Two basic approaches to non-parametric retesting have been suggested by Gibbons (1990
and 1991b). Both strategies define the upper Prediction limit for each designated parameter to be
the maximum value of that constituent in the set of background data. Consequently, the
background wells used to construct the limits must be uncontaminated. After the Prediction limits
have been calculated, one sample is collected from each downgradient well in the network. If any
sample constituent value is greater than its upper prediction limit, the initial test is "triggered" and
one or more resamples must be collected at that downgradient well on the constituent for further
testing.
At this point, the similarity between the two approaches ends. In his 1990 article, Gibbons
computes the probability that at least one of m independent samples taken from each of k
downgradient wells will be below (i.e., pass) the prediction limit. The m samples include both the
initial sample and (m-1) resamples. Because retesting only occurs when the initial well sample fails
the limit, a given well fails the overall test (initial comparison plus retests) only if all (m-1)
resamples are above the prediction limit. If any resample passes the prediction limit, that well is
regarded as showing no significant evidence of contamination.
Initially, this first strategy may not appear to be adequately sensitive to mild contamination
at a given downgradient well. For example, suppose two resamples are to be collected whenever
the initial sample fails the upper prediction limit. If the initial sample is above the background
maximum and one of the resamples is also above the prediction limit, the well can still be
classified as "clean" if the other resample is below the prediction limit. Statistical power
simulations (see Appendix B), however, suggest that this strategy will perform adequately under a
number of monitoring scenarios. Still, EPA recognizes that a retesting strategy which might
classify a well as "clean" when the initial sample and a resample both fail the upper Prediction limit
could pose problems for permit writers and enforcement personnel.
A more stringent approach was suggested by Gibbons in 1991. In that article (1991b),
Gibbons computes, as "passing behavior," the probability that all but one of m samples taken from
each of k wells pass the upper prediction limit. Under this definition, if the initial sample fails the
upper Prediction limit, all (m-1) resamples must pass the limit in order for the well to be classified as
"clean" during that testing period. Consequently, if any single resample falls above the background
maximum, that well is judged as showing significant evidence of contamination.
Either non-parametric retesting approach offers the advantage of being extremely easy to
implement in field testing of a large downgradient well network. In practice, one has only to
determine the maximum background sample to establish the upper prediction limit against which
all other comparisons are made. Gibbons' 1991 retesting scheme offers the additional advantage
of requiring less overall sampling at a given well to establish significant evidence of
contamination. Why? If the testing procedure calls for, say, two resamples at any well that fails
the initial prediction limit screen, retesting can end whenever either one of the two resamples falls
above the prediction limit. That is, the well will be designated as potentially contaminated if the
first resample fails the prediction limit even if the second resample has not yet been collected.
In both of his papers, Gibbons offers tables that can be used to compute the overall
network-wide false positive rate, given the number of background samples, the number of
downgradient comparisons, and the number of retests for each comparison. It is clear that there is
less flexibility in adjusting a non-parametric as opposed to a parametric prediction limit to achieve
a certain Type I error rate. In fact, if only a certain number of retests are feasible at any given
well (e.g., in order to maintain independence of successive samples), the only recourse to maintain
a low false positive rate is to collect a larger number of background samples. In this way, the
inability to make parametric assumptions about the data illustrates why non-parametric tests are
on the whole less efficient and less powerful than their parametric counterparts.
Unfortunately, the power of these non-parametric retesting strategies is not explored in
detail by Gibbons. To compare the power of both Gibbons' strategies against the EPA Reference
Power Curve, Normally distributed data were simulated for several combinations of numbers of
background samples and downgradient wells (again, if multiple constituents are being tested, the
number of wells in the simulations may be regarded as the number of constituent-well
combinations). Up to three resamples were allowed in the simulations for comparative purposes.
EPA recognizes, however, that it will be feasible in general to collect only one or two independent
resamples from any given well. Power curves representing the results of these simulations are
given in Appendix B. For each scenario, the EPA Reference Power Curve is compared with the
simulated powers of six different testing strategies. These strategies include collection of no
resamples, one resample, two resamples under Gibbons' 1990 approach (designated as A on the
curves) and his 1991 approach (labelled as B), and three resamples (under approaches A and B).
Under the one resample strategy, a potentially contaminated compliance well is designated as
"clean" if the resample passes the retest and "contaminated" otherwise.
The following table lists the best-performing strategies under each scenario. As with the use
of parametric intervals for retesting, the criteria for selecting the best-performing strategies
required 1) an approximate 5% facility-wide false positive rate and 2) power equivalent to or
better than the EPA Reference Power Curve. Because Normal data were used in these power
simulations, more realistically skewed data would likely result in greater advantages for the non-
parametric retesting strategies over the EPA Reference test.
Examination of the table and the power curves in Appendix B shows that the number of
background samples has an important effect on the recommended testing strategy. For instance,
with 8 background samples in a network of at least 20 wells, the best performing strategies all
involve collection of 3 resamples per "triggered" compliance well (EPA regards such a strategy as
impractical for permitting and enforcement purposes at most RCRA facilities). It tends to be true
that as the number of available background samples grows, fewer resamples are needed from each
potentially contaminated compliance well to maintain adequate power. If, as is expected, the
number of feasible, independent retests is limited, a facility operator may have to collect additional
background measurements in order to establish an adequate retesting strategy.
                     NON-PARAMETRIC RETESTING STRATEGIES

     #        # BG
   WELLS    SAMPLES         STRATEGY            REFERENCE       RATING
     5          8       1 Resample                                 *
                8       2 Resamples (A)       Gibbons, 1990       **
               16       1 Resample                                **
               16       2 Resamples (B)       Gibbons, 1991       **
               24       2 Resamples (B)       Gibbons, 1991       **
    20          8       2 Resamples (A)       Gibbons, 1990        *
               16       1 Resample                                 *
               16       2 Resamples (A)       Gibbons, 1990        *
               24       1 Resample                                **
               24       2 Resamples (B)       Gibbons, 1991        *
               32       1 Resample                                 *
               32       2 Resamples (B)       Gibbons, 1991       **
    50         16       2 Resamples (A)       Gibbons, 1990       **
               24       1 Resample                                 *
               24       2 Resamples (A)       Gibbons, 1990        *
               32       1 Resample                                **
   100         16       2 Resamples (A)       Gibbons, 1990       **
               24       2 Resamples (A)       Gibbons, 1990        *
               32       1 Resample                                 *

Note: ** = very good performance; * = good performance
6. OTHER TOPICS
6.1 CONTROL CHARTS
Control Charts are an alternative to Prediction limits for performing either intrawell
comparisons or comparisons to historically monitored background wells during detection
monitoring. Since the baseline parameters for a Control Chart are estimated from historical data,
this method is only appropriate for initially uncontaminated compliance wells. The main
advantage of a Control Chart over a Prediction limit is that a Control Chart allows data from a
well to be viewed graphically over time. Trends and changes in the concentration levels can be
seen easily, because all sample data is consecutively plotted on the chart as it is collected, giving
the data analyst an historical overview of the pattern of contamination. Prediction limits allow
only point-in-time comparisons between the most recent data and past information, making long-
term trends difficult to identify.
More generally, intrawell comparison methods eliminate the need to worry about spatial
variability between wells in different locations. Whenever background data is compared to
compliance point measurements, there is a risk that any statistically significant difference in
concentration levels is due to spatial and/or hydrogeological differences between the wells rather
than contamination at the facility. Because intrawell comparisons involve but a single well,
significant changes in the level of contamination cannot be attributed to spatial differences
between wells, regardless of whether the method used is a Prediction limit or Control Chart.
Of course, past observations can be used as baseline data in an intrawell comparison only if
the well is known to be uncontaminated. Otherwise, the comparison between baseline data and
newly collected samples may negate the goal in detection monitoring of identifying evidence of
contamination. Furthermore, without specialized modification, Control Charts do not efficiently
handle truncated data sets (i.e., those with a significant fraction of nondetects), making them
appropriate only for those constituents with a high frequency of occurrence in monitoring wells.
Control Charts tend to be most useful, therefore, for inorganic parameters (e.g., some metals and
geochemical monitoring parameters) that occur naturally in the ground water.
The steps to construct a Control Chart can be found on pp. 7-3 to 7-10 of the Interim Final
Guidance. The way a Control Chart works is as follows. Initial sample data is collected (from the
specific compliance well in an intrawell comparison or from background wells in comparisons of
compliance data with background) in order to establish baseline parameters for the chart,
specifically, estimates of the well mean and well variance. These samples are meant to
characterize the concentration levels of the uncontaminated well, before the onset of detection
monitoring. Since the estimate of well variance is particularly important, it is recommended that
at least 8 samples be collected (say, over a year's time) to estimate the baseline parameters. Note
that none of these 8 or more samples is actually plotted on the chart.
As future samples are collected, the baseline parameters are used to standardize the data.
At each sampling period, a standardized mean Zi is computed using the formula below, where Xi is the
mean of the ni samples collected during the ith period, μ represents the baseline mean concentration,
and σ represents the baseline standard deviation:

    Zi = √ni (Xi - μ) / σ

A cumulative sum (CUSUM) for the ith period is also computed, using the formula Si = max{0,
(Zi - k) + Si-1}, where Zi is the standardized mean for that period and k represents a pre-chosen
Control Chart parameter.
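A brief sketch of these two computations is given below for illustration; the function and variable names are not from the guidance. It assumes the baseline mean μ and standard deviation σ have already been estimated from the earlier samples and that n_i samples are averaged at each sampling period.

    import numpy as np

    def shewhart_cusum(period_means, n_i, mu, sigma, k=1.0):
        """Return the standardized means Zi and the CUSUMs Si for a combined Control Chart."""
        z = np.sqrt(n_i) * (np.asarray(period_means, dtype=float) - mu) / sigma
        s = np.zeros_like(z)
        for i, zi in enumerate(z):
            previous = s[i - 1] if i > 0 else 0.0
            s[i] = max(0.0, (zi - k) + previous)        # CUSUM recursion
        return z, s

    # Monthly means from Example 18 below (two nickel samples per month, mu = 27 ppb, sigma = 25 ppb)
    z, s = shewhart_cusum([19.0, 34.5, 17.8, 23.6, 34.8, 28.8, 23.7, 81.8],
                          n_i=2, mu=27.0, sigma=25.0, k=1.0)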
Once the data have been standardized and plotted, a Control Chart is declared out-of-
control if the sample concentrations become too large when compared to the baseline parameters.
An out-of-control situation is indicated on the Control Chart when either the standardized means
or CUSUMs cross one of two pre-determined threshold values. These thresholds are based on
the rationale that if the well remains uncontaminated, new sample values standardized by the
original baseline parameters should not deviate substantially from the baseline level. If
contamination does occur, the old baseline parameters will no longer accurately represent
concentration levels at the well and, hence, the standardized values should significantly deviate
from the baseline levels on the Control Chart.
In the combined Shewhart-cumulative sum (CUSUM) Control Chart recommended by the
Interim Final Guidance (Section 7), the chart is declared out-of-control in one of two ways. First,
the standardized means (Zi) computed at each sampling period may cross the Shewhart control
limit (SCL). Such a change signifies a rapid increase in well concentration levels among the most
recent sample data. Second, the cumulative sum (CUSUM) of the standardized means may
become too large, crossing the "decision interval value" (h). Crossing the h threshold can mean
either a sudden rise in concentration levels or a gradual increase over a longer span of time. A
gradual increase or trend is particularly indicated if the CUSUM crosses its threshold but the
standardized mean Zi does not. The reason for this is that several consecutive small increases in
Zi will not trigger the SCL threshold, but may trigger the CUSUM threshold. As such, the
Control Chart can indicate the onset of either sudden or gradual contamination at the compliance
point.
As with other statistical methods, Control Charts are based on certain assumptions about the
sample data. The first is that the data at an uncontaminated well (i.e., a well process that is "in
control") are Normally distributed. Since estimates of the baseline parameters are made using
initially collected data, these data should be tested for Normality using one of the goodness-of-fit
techniques described earlier. Better yet, the logarithms of the data should be tested first, to see if
a Lognormal model is appropriate for the concentration data. If the Lognormal model is not
rejected, the Control Chart should be constructed solely on the basis of logged data.
The methodology for Control Charts also assumes that the sample data are independently
distributed from a statistical standpoint. In fact, these charts can easily give misleading results if
the consecutive sample data are not independent. For this reason, it is important to design a
sampling plan so that distinct volumes of water are analyzed each sampling period and that
duplicate sample analyses are not treated as independent observations when constructing the
Control Chart.
The final assumption is that the baseline parameters at the well reflect current background
concentration levels. Some long-term fluctuation in background levels may be possible even
though contamination has not occurred at a given well. Because of this possibility, if a Control
Chart remains "in control" for a long period of time, the baseline parameters should be updated to
include more recent observations as background data. After all, the original baseline parameters
will often be based only on the first year's data. Much better estimates of the true background
mean and variance can be obtained by including more data at a later time.
To update older background data with more recent samples, a two-sample t-test can be run
to compare the older concentration levels with the concentrations of the proposed update
samples. If the t-test does not show a significant difference at the 5 percent significance level,
proceed to re-estimate the baseline parameters by including more recent data. If the t-test does
show a significant difference, the newer data should not be characterized as background unless
some specific factor can be pinpointed explaining why background levels on the site have naturally
changed.
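The update check described above can be carried out with a standard two-sample t-test, as in the following sketch. The sketch is illustrative only and the concentration values shown are hypothetical; if the Control Chart is built on logged data, the test should be run on the logarithms.

    import numpy as np
    from scipy import stats

    old_background = np.array([21.4, 18.9, 25.6, 30.2, 22.7, 19.5, 27.3, 24.1])  # hypothetical (ppb)
    proposed_update = np.array([23.8, 26.5, 21.0, 28.9, 25.2, 22.4])              # hypothetical (ppb)

    t_stat, p_value = stats.ttest_ind(old_background, proposed_update)
    if p_value >= 0.05:
        print("No significant difference: pool the new data and re-estimate the baseline parameters.")
    else:
        print("Significant difference: do not update the background unless a natural cause is identified.")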
EXAMPLE 18
Construct a control chart for the 8 months of data collected below.
μ = 27 ppb
σ = 25 ppb
Nickel Concentration (ppb)
Month Sample 1 Sample 2
1 15.3 22.6
2 41.1 27.8
3 17.5 18.1
4 15.7 31.5
5 37.2 32.4
6 25.1 32.5
7 19.9 27.5
8 99.3 64.2
SOLUTION
Step 1. The three parameters necessary to construct a combined Shewhart-CUSUM chart are
h=5, k=1, and SCL=4.5 in units of standard deviation (SD).
Step 2. List the sampling periods and monthly means, as in the following table.

   Month    Ti    Mean (ppb)      Zi      Zi - k      Si
     1       1       19.0       -0.45     -1.45      0.00
     2       2       34.5        0.42     -0.58      0.00
     3       3       17.8       -0.52     -1.52      0.00
     4       4       23.6       -0.19     -1.19      0.00
     5       5       34.8        0.44     -0.56      0.00
     6       6       28.8        0.10     -0.90      0.00
     7       7       23.7       -0.19     -1.19      0.00
     8       8       81.8        3.10      2.10      2.10
Step 3. Compute the standardized means Zi and the quantities Si, and list them in the table above. Each
Si is computed for consecutive months using the formula on p. 7-8 of the EPA
guidance document.
S1 = max {0, -1.45 + 0} = 0.00
S2 = max {0, -0.58 + 0} = 0.00
S3 = max {0, -1.52 + 0} = 0.00
S4 = max {0, -1.19 + 0} = 0.00
S5 = max {0, -0.56 + 0} = 0.00
S6 = max {0, -0.90 + 0} = 0.00
S7 = max {0, -1.19 + 0} = 0.00
S8 = max {0, 2.10 + 0} = 2.10
Step 4. Plot the control chart as given below. The combined chart indicates that there is no
evidence of contamination at the monitoring facility, because neither the standardized
means nor the CUSUMs exceed their respective thresholds (the SCL and h) for the
months examined.
[Figure: Combined Shewhart-CUSUM Control Chart for the nickel data (μ = 27 ppb, σ = 25 ppb), plotting the standardized concentrations (Z) and the CUSUMs against sampling period, with the SCL and h thresholds marked.]
Note: In the above Control Chart, the CUSUMs are compared to threshold h, while the
standardized means (Z) are compared to the SCL threshold.
6.2 OUTLIER TESTING
Formal testing for outliers should be done only if an observation seems particularly high (by
orders of magnitude) compared to the rest of the data set. If a sample value is suspect, one should
run the outlier test described on pp. 8-11 to 8-14 of the EPA guidance document. It should be
cautioned, however, that this outlier test assumes that the rest of the data values, except for the
suspect observation, are Normally distributed (Barnett and Lewis, 1978). Since Lognormally
distributed measurements often contain one or more values that appear high relative to the rest, it
is recommended that the outlier test be run on the logarithms of the data instead of the original
observations. That way, one can avoid classifying a high Lognormal measurement as an outlier
just because the test assumptions were violated.
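A sketch of the statistic used in this outlier test, computed on the logged data as recommended, is given below for illustration; the critical value must still be taken from Table 8 of the Interim Final Guidance rather than computed here.

    import numpy as np

    def outlier_statistic(values):
        """T_n = (largest logged value - mean of logged values) / SD of logged values."""
        logged = np.log(np.asarray(values, dtype=float))
        return (logged.max() - logged.mean()) / logged.std(ddof=1)

    # Applied to the carbon tetrachloride data of Example 19 below, this gives T20 of about 2.65.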
If the test designates an observation as a statistical outlier, the sample should not be treated
as such until a specific reason for the abnormal measurement can be determined. Valid reasons
may, for example, include contaminated sampling equipment, laboratory contamination of the
sample, or errors in transcription of the data values. Once a specific reason is documented, the
sample should be excluded from any further statistical analysis. If a plausible reason cannot be
found, the sample should be treated as a true but extreme value, not to be excluded from further
analysis.
EXAMPLE 19
The table below contains data from five wells measured over a 4-month period. The value
7066 is found in the second month at well 3. Determine whether there is statistical evidence that
this observation is an outlier.
Carbon Tetrachloride Concentration (ppb)
Well 1 Well 2 Well 3 Well 4 Well 5
1.69 302 16.2 199 275
3.25 35.1 7066 41.6 6.5
7.3 15.6 350 75.4 59.7
12.1 13.7 70.14 57.9 68.4
SOLUTION
Step 1. Take logarithms of each observation. Then order and list the logged concentrations.
  Order    Concentration (ppb)    Logged Concentration
1 1.69 0.525
2 3.25 1.179
3 6.5 1.872
4 7.3 1.988
5 12.1 2.493
6 13.7 2.617
7 15.6 2.747
8 16.2 2.785
9 35.1 3.558
10 41.6 3.728
11 57.9 4.059
12 59.7 4.089
13 68.4 4.225
14 70.1 4.250
15 75.4 4.323
16 199 5.293
17 275 5.617
18 302 5.710
19 350 5.878
20 7066 8.863
Step 2. Calculate the mean and SD of all the logged measurements. In this case, the mean and
SD are 3.789 and 1.916, respectively.
Step 3. Calculate the outlier test statistic T20 as

    T20 = ( X(20) - X̄ ) / SD = (8.863 - 3.789) / 1.916 = 2.648.
Step 4. Compare the observed statistic T20 with the critical value of 2.557 for a sample size
n=20 and a significance level of 5 percent (taken from Table 8 on p. B-12 of the Interim
Final Guidance). Since the observed value T20=2.648 exceeds the critical value, there
is significant evidence that the largest observation is a statistical outlier. Before
excluding this value from further analysis, a valid explanation for this unusually high
value should be found. Otherwise, treat the outlier as an extreme but valid
concentration measurement.