Monitoring data preparation and exploratory analysis

Before reading this page, you should read:

Data preparation

Careful preparation of data for a water quality monitoring program will help ensure that:

  • proposed analysis is feasible and runs smoothly
  • valid results are obtained
  • analytical results are not unduly influenced by anomalies or errors.

Data preparation should not be hurried. Often this is the most time-consuming aspect of data analysis.

Check and clean electronic data before starting a comprehensive analysis. The data may need to be formatted, collated and manipulated. Make it possible to retrace your steps back to the raw data.

Many numerical and graphical exploratory data analysis methods are useful to check and clean data for integrity.

The choice of data formats may depend on how the data will be analysed and the software available. It’s best to enter laboratory and field study data into a database or spreadsheet that is accessible for many software applications.

Only use data that are:

  • deemed acceptable by field or laboratory quality assurance/quality control (QA/QC) criteria
  • treated consistently (e.g. rounded off to an appropriate number of significant figures).

Data integrity

The integrity of water quality data can be reduced in many ways.

Losses or errors can occur at any time, from sample collection and preparation through to interpretation and reporting. After the QA/QC checked data leave the field or laboratory, there is ample opportunity for accidental alteration of results to occur.

Gross errors that are probably the result of data manipulations (e.g. transcribing, transposing rows and columns, editing, recoding and conversion of units) are easily overlooked unless you perform a modest level of screening. These sorts of errors can usually be detected by a scan of the raw data.

Subtle erroneous effects (e.g. repeated data, accidental deletion of 1 or 2 observations, mixed scales) are more difficult to identify.

If left unchecked, anomalous readings can have a profound effect on subsequent statistical analyses and possibly lead to erroneous conclusions being drawn.

Quality control measures

A good QA/QC program for data analysis uses simple yet effective statistical tools for screening data as they are received from the field or laboratories.

Data screen measures typically include a mixture of:

  • graphical procedures — histograms, box plots, time sequence plots, control charts
  • descriptive numerical measures of key distributional aspects — mean, standard deviation (SD), coefficient of variation (CV), measures of skewness and kurtosis.

These routine analyses often provide the first opportunity to inspect data, recognise patterns and draw preliminary understanding. Most pre-processing, manipulation and screening can be undertaken using in-built database or spreadsheet functionality.

Outliers

Take care when labelling extreme observations as an ‘outlier’. An outlier is an extreme observation but not all extreme observations are outliers.

For example, even with data that follow a normal distribution, about 2 out of every 1000 observations will fall beyond 3 SDs from the mean. To treat such observations as outliers by discarding them from the dataset would introduce bias into subsequent analyses.

Instead, you should highlight outliers for follow-up investigation to identify causes (e.g. recording error, laboratory error, abnormal physical conditions during sampling).

If there are strong reasons to explain why an observation is aberrant, then it is reasonable to exclude that observation from any analysis.

If no compelling reason exists for the potential outlier, then you can exercise your professional judgement. Our recommendation would be to retain the potential outlier and flag it throughout the analyses for further scrutiny.

Sometimes, your choice of analysis method can down-weight potential outliers (e.g. robust regression compared to linear regression).

While there are statistical tests for determining if an observation can be treated as an outlier (Neter et al. 1996), simple descriptive statistical measures and graphical techniques — combined with your monitoring team’s understanding of the system — are very valuable tools for identifying outliers.

Outliers in multivariate analyses

Identification of aberrant observations in a multivariate (many variable) context is more complex.

For example, water quality monitoring generates measurements of metals, nutrients, organic compounds and other compounds. Each sample often has a vector of observations rather than a single value. These values (variables) tend to be correlated among themselves to some extent, indicating that they co-vary. If the co-dependence between variables is ignored and the aberrant observations are examined one variable at a time, unusual observations may be missed and the whole procedure will be inefficient.

In a multivariate context, it is quite possible for an observation to be ‘unusual’ even when it is reasonably close to the respective means of each of the constituent variables.

It is more difficult to determine the causes of outliers in a multivariate context. Garner et al. (1991) considered statistical procedures to help in this evaluation but again, take care when labelling and treating potential outlying observations.

The issue of outliers and multivariate normality for compositional data (where data represent percentages of various chemical constituents) was investigated by Barceló et al. (1996).

Censored data

Data that are below or above detection limits (so-called left and right ‘censored’ data, respectively) are common and should be identified and handled appropriately. The focus for chemical data is often ‘below detection limit’ (BDL), while right-censored data occur fairly frequently in ecotoxicological and biological datasets.

Unless the water body is degraded, failure to detect contaminants is common. Rather than concluding that the particular contaminant does not exist in a water sample, the monitoring team records the observation as ‘BDL’.

Some ad hoc approaches for dealing with BDL data:

  • treat the observation as missing or zero (0)
  • use the numerical value of the detection limit
  • use a numerical value of half the detection limit.

Helsel (2005) summarised some of the issues in adopting such approaches.

For example, when a large portion of the data are recoded as BDL, using any of these ad hoc approaches will be problematic because the sample variance will be severely underestimated. Also, when standard statistical techniques are applied to datasets that have constant values in place of the BDL values, the resulting estimates are biased (El-Shaarawi & Esterby 1992).

Assigning a missing value code can cause difficulties because software may treat missing values differently. Some software packages compute an average using the full dataset with missing values replaced by zeros (not a good strategy) and others simply ignore all missing values and reduce the sample size accordingly.

More sophisticated statistical procedures have been devised that use re-sampling or imputation techniques to ‘infer’ a reasonable value for the BDL observation but their implementation requires advanced statistical skills. Helsel (2005) provided a strong summary of the options and relative merits. Gilliom & Helsel (1986) and Helsel & Gilliom (1986) estimated water quality parameters with censored data using a variety of techniques. Liu et al. (1997) described a method based on ‘forward censored regression’ to model data with censored observations. Censored data are routinely addressed in many software tools and should be explored.

In the absence of more sophisticated tools for analysing censored data, the most important decision is to ensure all censored observations are flagged and tracked. If you adopt any of the ad hoc approaches, then there is a danger that the nature of the censoring will be lost over time unless that discipline is maintained.

If a significant proportion (e.g. more than 25%) of the data are BDL, then you may need additional statistical expertise to support or conduct data analysis.

If the proportion of BDL data is small, then using the full dataset with BDL data replaced by either the detection limit or half the detection limit may be useful for exploratory data analysis but it should not be used for any inferential analysis (e.g. confidence intervals or hypothesis testing).

If only a small proportion of the dataset is BDL and has been replaced by a numerical surrogate, then it is best to perform any statistical analysis twice: once using zero (0) and once using the detection limit (or half the detection limit) as the replacement value. If results from the 2 analyses differ markedly, then you should investigate more sophisticated statistical methods of dealing with censored observations. The censored observations probably have little influence on the analysis if the results do not differ markedly.

Right-censored data may also occur, particularly for ecotoxicological and biological datasets, and are typically analysed by appealing to the wider survival analysis literature and using approaches such as the nonparametric Kaplan–Meier method (Helsel 2005, Andersen 2013).

Missing data

Missing observations are common in environmental datasets, and may occur due to:

  • dropout of sites from a study
  • equipment failure
  • time or resource constraints
  • fault of the observer.

In monitoring studies, missing data can reduce the representativeness of the sample and bias the results, which may lead to misleading inferences about the population.

You should not ignore omitted observations because they can have a significant effect on the conclusions that can be drawn from data.

Exploring the missing data to uncover how they were omitted can help you to proceed correctly with analysis. Rubin (1976) differentiated between 3 types of ‘missingness’ mechanisms that will have implications for your approach to analysis and the conclusions drawn from it:

  • missing completely at random — the value of the missing variable is independent of both observed variables and unobserved parameters
  • missing at random — missingness does not depend on some variables
  • missing not at random — missingness is structured; the value of the missing variable is related to the reason it is missing.

In the presence of missing data, we recommend robust analysis methods to minimise the effects of bias on the overall conclusions. Alternative techniques for dealing with missing data include:

  • data imputation or ‘filling-in’, after which standard data analysis approaches could be used
  • Bayesian approaches to some popular methods, which can allow missing values to be treated as an additional parameter and estimated (or imputed)
  • data reduction, which can remove those records or cases where missing data are present
  • maximum likelihood estimation
  • spatial modelling
  • data interpolation.

Understanding how the missing data arise will help you to select an appropriate data imputation approach. Little & Rubin (2002) is a comprehensive and highly cited text on data analysis in the presence of missing data.

Exploratory data analysis

How you analyse the data will largely depend on the type of monitoring study being undertaken and the nature of the arising data (number of variables, continuous or discrete data classes, temporal and spatial attributes). Suggested analysis methods will largely have been discussed and noted as part of the study design process, based on the monitoring program objectives.

After preparing your data, we strongly encourage exploratory analyses and interrogation of key variables using numerical and graphical statistical methods. Simply looking at the data can yield value, identify patterns for further scrutiny or raise questions for investigation.

You can use a range of analysis strategies, including model-based and probability-based analyses and inferences depending on how the data are collected, or the broader Bayesian and frequentist modelling philosophies and methods.

If quantifying status or change in water quality is a monitoring objective, then comparison of sample data against guideline values will be of interest. We discuss how to derive guideline values using reference data, followed by details of significance testing and calculation of confidence intervals for testing monitoring data against those guideline values when the sample data are spatially and temporally independent.

If spatial or temporal independence is not likely, then this dependence needs to be accounted for in the analysis.

We discuss analysis options when a single water quality variable is considered in turn at one site over time (temporal analysis approaches) or at multiple sites for a particular point in time (spatial and regional analysis approaches). We also consider some approaches to model the relationships between multiple water quality or environmental variables, including correlation analysis, multivariate and high-dimensional data analysis techniques, and regression analyses (parametric and nonparametric approaches).

The information we provide is not exhaustive. For complex studies that require a high level of statistical sophistication you should consult a professional statistician.

Data summary

Data summary enables you to present and summarise important features of the data. It also helps to identify anomalous observations (also known as ‘outliers’) that might be attributable to experimental or transcription errors, and to ascertain the validity of outliers and their subsequent inclusion in the analysis.

You can use a variety of numerical and graphical statistical tools to summarise data, including:

  • graphs (e.g. histograms, box plots, dot plots, scatterplots)
  • tables (e.g. frequency distributions, cross-tabulations)
  • numerical measures (e.g. means, medians, SDs, percentiles).

The objective of calculating or plotting summary statistics is to convey the essential information contained in a dataset as concisely and clearly as possible, to estimate a parameter of some population of values or to help with data preparation.

Frequency tabulations are commonly used for summarising data because they can reduce large datasets to manageable formats without substantial loss of information and can be graphically presented as histograms and bar charts.

Measures of central tendency

It is common to use statistics that are measures of the central tendency, such as the arithmetic mean, median and mode (Table 1).

Table 1 Measures of central tendency (adapted from Spiegel 1992)

StatisticDescription
Arithmetic meanUpper X overbar equals sigma-summation Underscript i equals 1 Overscript n EndScripts Upper X Subscript i EndScripts divided by n where Xi denotes the ith observation in a sample of n
α% trimmed meanT.α obtained by trimming (removing) α% off both ends of the ordered sample and computing X̅ for the remainder. Used when outliers are present. α is typically 10% or 20%.
ModeMost frequently occurring value
MedianMiddle value: half the values are numerically smaller and half are numerically larger
Geometric meanThe nth root of the product of n sample values (> 0). GM = (x1x2…xn)1/n. It is always less than the arithmetic mean.
Harmonic meanReciprocal of the summation of n sample reciprocal values. HM = 1 Over n sigma-summation Underscript i equals 1 Overscript n EndScripts 1 over x Subscript i

For most water quality applications, the arithmetic mean, geometric mean or median are the most appropriate quantities.

The median is a robust estimator of central tendency because it is relatively unaffected by extremes in the data. The arithmetic mean does not share this property but is nevertheless the most widely used ‘average’ (easy to compute and uses all the data). The arithmetic mean also has well-established statistical properties, and many statistical tests relate to inference about a population mean.

The geometric mean is defined as the arithmetic mean of the values taken on a log scale, or equivalently the nth root of the product of all the observations. The geometric mean is often an appropriate measure for positively skewed data (data that will become more symmetric under a log transformation). It uses all the observations and is less affected by outliers than the arithmetic mean but cannot be used for data with zero or negative values. The geometric mean will never exceed the arithmetic mean.

Measures of variability

Variability is another extremely important characteristic of a distribution of results (Table 2).

Table 2 Common measures of variation (adapted from Spiegel 1992)

StatisticDescription
Range(largest value) – (smallest value)
Interquartile range (IQR)IQR is the difference between the 3rd and 1st quartiles (75th percentile – 25th percentile).
Sample variance (s2)s squared equals sigma-summation Underscript i equals 1 Overscript n EndScripts left-parenthesis Upper X Subscript i minus Upper X overbar right-parenthesis squared divided by left-parenthesis n minus 1 right-parenthesis
Sample standard deviation (SD)The square root of the variance. It has the same units as the central tendency statistics.
Winsorized standard deviation (sT)Used as a measure of spread of the trimmed mean X̅T. Obtained by replacing trimmed values, identified in computation of X̅T, with values that were next in line for trimming (one from each end), and computing the SD of this new sample.
s StartRoot left-parenthesis n minus 1 right-parenthesis divided by left-parenthesis k minus 1 right-parenthesis EndRoot where k is the size of the trimmed sample.
IQR/medianDimensionless robust measure of spread.
Coefficient of variation (CV)The SD divided by the mean. Dimensionless. Can be used for comparisons among different samples

The simplest measure of variability is the range (difference between largest and smallest values in dataset). The range is rarely used as a measure of variability because it is grossly affected by extremes in the dataset (and defined as the difference between the two extremes).

The most commonly used measure of variability is the variance or its (positive) square root, the standard deviation (SD). The SD is preferred because it has the same units of measurement as the original data. However, the variance is an important parameter in inferential statistics, such as analysis of variance (ANOVA). One difficulty with the SD is that it is not readily comparable among different populations or samples because it tends to be numerically higher as a result of an increasing mean. A dimensionless measure that overcomes this difficulty is the coefficient of variation (CV), which is defined as the ratio of the SD and the mean.

Percentiles of distribution

In addition to measures of location and dispersion (variability), another important statistical measure in water quality applications is the percentile of a distribution.

For example, when examining physical and chemical (PC) stressors, we recommend a triggering process based on a comparison of the 50th percentile at a test site with the 80th percentile at a reference site.

The pth percentile is the value that is greater than or equal to p%of all values in a distribution. For example, 50% of all values in a distribution are numerically less than or equal to the 50th percentile (otherwise known as the median) and 80% of all values are numerically less than or equal to the 80th percentile.

The 25th, 50th, and 75th percentiles are called the ‘quartiles’ (denoted Q1,Q2 and Q3)because they divide the distribution into 4 parts of equal probability.

Data visualisation

High quality and sophisticated graphical functionality is available in many statistical software. We strongly recommend that you create relevant graphs of the data before performing any formal statistical analysis (Tufte 1983).

Simple graphs, such as histograms, box plots, dot plots, scatterplots and probability plots, can assist statistical analyses and make it easier to interpret data with respect to:

  • data anomalies and errors
  • properties of the distributions (location, dispersion, skewness, kurtosis)
  • trends over time, space and attributes
  • relationships (identification and type)
  • checking assumptions appropriate to the distributions (e.g. normal probability plots)
  • time-series analysis
  • reducing the number of dimensions (visualising high dimensional data by projecting it into lower dimensions)
  • operational performance (e.g. control charts).

A list of graph types and potential applications is given in Table 3. Most of these plots are available in the statistical software we mention. Many other dynamic and more interactive versions of these types of visualisation are commonplace and allow important exploration. For example, through animation, partitioning, annotating and tracking subsets and the clever use of colour, symbols and transparency to represent or highlight different features of interest.

Table 3 Useful graphs for common data analysis applications

ApplicationUseful types of graphs
Cause and effectScatterplot (bi-plot), scatterplot matrix, fishbone chart
Distribution checkingHistogram, box plot, stem and leaf, probability and Q-Q plots, residual plot
Exploratory data analysisScatterplot (bi-plot), scatterplot matrix, chart (line, bar, pie), histogram, box plot, dot plot, stem and leaf, time series, 3D (contour, scatter, surface), probability and Q-Q plots, scatterplot smoothers (LOESS), run chart, control chart (Xbar, CUSUM, EWMA, Range, MA), Pareto chart, fishbone chart
Model diagnosticsScatterplot (bi-plot), scatterplot matrix, interval (error bars, confidence intervals), histogram, box plot, stem and leaf, time series, probability and Q-Q plots, residual plot
Multivariable relationshipsScatterplot (bi-plot), scatterplot matrix, 3D (contour, scatter, surface)
Outlier detectionScatterplot (bi-plot), scatterplot matrix, interval (error bars, confidence intervals), histogram, box plot, stem and leaf, time series, 3D (contour, scatter, surface), probability and Q-Q plots, residual plot
Process controlHistogram, box plot, time series, run chart, control chart (Xbar, CUSUM, EWMA, Range, MA), Pareto chart
Trend detectionScatterplot (bi-plot), scatterplot matrix, chart (line, bar, pie), dot plot, time series, 3D (contour, scatter, surface), scatterplot smoothers (LOESS), residual plot, run chart

For most water quality analyses, several different attributes or measures of water quality are being assessed simultaneously. Important information can be overlooked if water quality variables are analysed one at a time.

Relationships between individual variables can be detected relatively easily from graphical summaries. For example, the scatterplot matrix in Figure 1 shows scatterplots and histograms for the concentrations of 7 metals in water samples from a Victorian lake.

Figure 1 Scatterplot matrix for concentration of 7 metals from a single water sample

Using graphical display like this, you can assimilate the trends, relationships and distributions of a number of variables that have been measured on the same water sample. Histograms for individual variables are shown on the diagonal and plots for pairs of variables are displayed on the off-diagonal (the labelling of axes is reversed to create 2 plots for each pair of variables on the off-diagonal).

Given the high variability of most natural ecosystem processes (or the indirect processes that influence them), it is not surprising that water quality data also exhibit a high degree of variation over space and time. This high natural ‘background’ variation tends to mask trends in water quality parameters and reduces our ability to extract a signal from the ‘noise’. Simple graphical tools, such as scatterplots and time series plots, can only provide a combined view of both the trend and the noise.

So-called robust smoothers are nonparametric techniques that ‘let the data speak for themselves’ and have been shown to be remarkably effective in teasing out a signal from very noisy data (Cleveland 1979).

Robust smoothers work by placing a ‘window’ over a portion of the data, computing some numerical quantity, such as mean or median, and then ‘stepping’ the window across the data and repeating the process. The collection of statistics obtained in this way can be plotted (Figure 2) at the mid-points of the intervals to obtain a smooth representation of the underlying process. Figure 2 highlights the upward trend in magnesium over the collection period and a notable increase circa 2010.

Figure 2 Smoothed magnesium data (scatterplot smoother)

You can control the amount of smoothing in these plots by specifying a ‘span’ (fraction of dataset to be captured in the moving window) and the number of passes over the data. Greater smoothing is achieved by increasing the span or the number of passes (Chambers & Hastie 1992).

Many types of smoothing algorithms and filters have been developed, including smoothing splines (De Boor 2001), locally weighted scatterplot smoothing (LOWESS) (Cleveland 1979) and exponential smoothing. Many smoothing approaches are embedded in modelling frameworks, such as generalised additive models (Hastie & Tibshirani 1990, Wood 2006).

Control charts

A control chart is another visualisation tool useful for exploring the operational performance of data.

Statistical process control (SPC) dates back to the 1930s and is deeply rooted in industrial applications where controlling the drift and variation in a process helps to maintain production quality. Control charting techniques used for the past 70 years in industry play an important role in an environmental context.

Control charts are particularly relevant to water quality monitoring and assessment. Regulatory agencies are moving away from the ‘command and control’ mode of water quality monitoring and recognising that, in monitoring, the data generated from environmental sampling are inherently ‘noisy’. The data’s occasional excursion beyond a notional guideline value may be a chance occurrence or may indicate a potential problem. This is precisely the situation that control charts target.

Control charts provide a visual display of an evolving process and offer ‘early warning’ of a shift in the process level (mean) or dispersion (variability). Refer to Montgomery (1985) or Bennett & Franklin (1986 Chapter 10) for details.

High-frequency data

High-frequency water quality data, such as those that are generated by sensors or data loggers, may require different presentation methods to tease out patterns, trends or key features.

For example, jittering coincident points will help to reveal high-data density, while colour and plotting symbol size may also be used to improve the visualisation of high-density data.

Sometimes it may make sense to subsample the data for visualisation, or to work with summary measures (e.g. daily averages).

You need to weigh up all your choices while keeping the monitoring program objectives firmly in mind.

Scale, transformations and distributions

Mathematical transformations of water quality data are usually undertaken to:

  • restore a greater degree of linearity between 2 or more variables
  • stabilise the variance of some variable over time, space or some other attribute, or
  • restore a greater degree of normality in the distribution of the values of a variable.

These objectives generate statistical advantages for transforming data. You may have scientific or physical reasons for transforming data.

Identifying a suitable transformation (if it exists) is largely an iterative process that depends on the objectives of the exercise.

Sometimes the theory or model being fitted to the data suggests the form of the mathematical transformation. For example, you may suspect that the concentration of a nutrient is described by a power relationship of the form:

C = kQp

where C is concentration, Q is flow, and k and p are unknown parameters. A simple logarithmic transformation yields a linear equation in log(C)and log(Q):

log(C) = log(k) + plog(Q)

A log–log graph of the raw data will enable a quick assessment of the appropriateness of this model. Estimates for log(k) and p are readily obtained as the intercept and slope, respectively, of the fitted regression line.

Another common goal when transforming data is to reduce the effect of a relationship between the mean and the variance. In a number of distributions (including gamma and log-normal), the variance is related to the mean. Statistical tests designed to examine the equality of means will be affected by the non-constant variance.

Some common transformations are provided in Table 4.

Table 4 Variance stabilising transformations (adapted from Ott 1984)

Mean (μ) ~ variance (σ2) relationship

Transformation

s2 = k μ (Poisson data have k = 1)Y' = StartRoot y EndRoot or StartRoot y plus 0.375 EndRoot
s2 = k μ2 Y' = log (y) or log (y + 1)
s2 = k π(1 – π), 0 < π < 1 (binomial data have k = 1/n) Y' = sin–1 (StartRoot y EndRoot)

Analysts commonly transform data when one or more assumptions of a proposed statistical test appear to have not been met. You can transform data to try and restore some semblance of normality but often this is unnecessary. Standard statistical procedures, such as ANOVA and t-tests, are relatively robust in the presence of slight to moderate departures from normality.

Rather than attempting to achieve normality, you should ensure that the distribution of the data has a reasonable degree of symmetry.

Significant distortions in the test results are only likely to occur in cases of high skewness or high kurtosis. It is far more important to check that data have homogeneous variances (variances of the variables of interest are constant over different groups, times or space) and are independent.

Data that are either spatially or temporally correlated, or both, require specialised analysis methods. We discuss some of these in Changes in water quality over time and space.

Identifying an appropriate transformation is often a matter of trial and error. But within the class of power transformations, Box & Cox (1964, 1982) developed a systematic method to identify a value of the transformation parameter (λ) in the expression:

Y to the power of gamma equals case construction where first row is begin fraction y to the power of gamma minus 1 over gamma where gamma does not equal 0 and second row ln(y) where gamma equals 0 

such that values of the transformed data (y(λ)) exhibit a greater degree of normality than the original set of y values. This transformation is only applicable to non-negative data.

Natural scale

Transformations can be important but often we prefer approaches that analyse the data on their natural scale and address the mean–variance relationship more flexibly. Such approaches are usually available in statistical software and include:

  • generalised linear models (GLMs)
  • generalised additive models (GAMs)
  • generalised linear mixed model (GLMM) procedures.

Non-normal distributions

In a regression context, these approaches allow response variables that have error distribution models other than a normal distribution. Specific examples include logistic regression for binary responses and Poisson (or log-linear) regression for count data. Extensions that incorporate features such as zero-inflation or extra variation are available in statistical software. Refer to Wood (2006) or Zuur et al. (2009) for details.

Many statistical methods of inference rely on the assumption that the sample data have been randomly selected from a larger population of values that is normally distributed. The normal distribution enjoys a prominent role in theoretical and practical statistics for a number of reasons:

  • Many naturally occurring phenomena actually exhibit normal-shaped distributions.
  • The important central limit theorem (CLT) in statistics assures us that even when the distribution of individual observations is non-normal, aggregate quantities (e.g. arithmetic mean) tend to have a normal distribution.
  • The mathematics is ‘tractable’ (controllable) and a large number of statistical procedures have been developed around the notion of random sampling from a normally distributed population. (Although this is a less convincing argument for the use of normal-based inference.)

Many environmental variables are decidedly non-normal in their distributions. River flows may be normally distributed at a single locality for a short period of time, but on broader time scales they may decrease systematically with time in the absence of rainfall (time of sampling becomes a dominating influence). And, of course, a rainfall event will have a dramatic influence on flow.

Statistical software can do the computations associated with checking distributional assumptions.

Most statistical software can test the assumption of normality. Figure 3 shows the distribution of magnesium concentrations in a water body downstream of a mine. The normal quantile-quantile plot picks up the short tail or truncation for lower values of magnesium.

Figure 3 Distribution of downstream magnesium concentrations

Many software offer considerable flexibility in the types of distributions that can be examined.

For example, if some data were clearly non-normal and positively skewed, then it might be speculated that a log-normal distribution would be an appropriate probability model. One way of checking the log-normal assumption would be to test the normality of the logarithms of the original data. A more direct approach is to inspect a log-normal probability plot. This is a similar plot to the normal quantile-quantile plot in Figure 3 but the theoretical quantiles would be drawn from a log-normal distribution.

References

Andersen PK 2013, Survival analysis, Encyclopedia of Environmetrics 6.

Barceló C, Pawlowsky V & Grunsky E 1996, Some aspects of transformations of compositional data and the identification of outliers, Mathematical Geology 28: 501–518.

Bennett CA & Franklin NL 1986, Statistical Analysis in Chemistry and the Chemical Industry, Qualpro Inc.

Box GEP & Cox DR 1964, An analysis of transformations (with discussion), Journal of the Royal Statistical Society Series B 26: 211–246.

Box GEP & Cox DR 1982, An analysis of transformations revisited, rebutted, Journal of the American Statistical Association 77: 209–210.

Chambers JM & Hastie TJ (eds) 1992, Statistical Models in S, Wadsworth and Brooks.

Cleveland WS 1979, Robust locally weighted regression and smoothing scatterplots, Journal of the American Statistical Association 74: 828–836.

De Boor C 2001, A Practical Guide to Splines, Revised Edition, Springer, New York.

El-Shaarawi AH & Esterby SR 1992, Replacement of censored observations by a constant: an evaluation, Water Research 26: 835–844.

Garner FC, Stapanian MA & Fitzgerald KE 1991, Finding causes of outliers in multivariate data, Journal of Chemometrics 5: 241–248.

Gilliom RJ & Helsel DR 1986, Estimation of distributional parameters for censored trace level water quality data: 1. Estimation techniques, Water Resources Research 22: 135–146.

Hastie TJ & Tibshirani RJ 1990, Generalized Additive Models, Chapman and Hall/CRC, London.

Helsel DR & Gilliom RJ 1986, Estimation of distributional parameters for censored trace level water quality data: 2. Verification and applications, Water Resources Research 22: 147–155.

Helsel D 2005, Nondetects and data analysis: Statistics for censored environmental data, John Wiley and Sons, New Jersey.

Little RJA & Rubin DB 2002, Statistical Analysis with Missing Data, 2nd Edition, John Wiley and Sons, New Jersey.

Liu S, Lu J, Kolpin DW & Meeker WQ 1997, Analysis of environmental data with censored observations, Environmental Science Technology 31: 3358–3362.

Montgomery DC 1985, Introduction to Statistical Quality Control, John Wiley and Sons.

Neter J, Kutner MH, Nachtsheim CJ & Wasserman W 1996, Applied Linear Statistical Models, 4th edition, Irwin.

Rubin DB 1976, Inference and missing data, Biometrika 63(3): 581–592.

Tufte ER 1983, The Visual Display of Quantitative Information, Graphics Press, Cheshire Co.

Wood S 2006, Generalized Additive Models: An introduction with R, Chapman and Hall/CRC, Florida.

Zuur AF, Ieno EN, Walker NJ, Saveliev AA, & Smith GM 2009, Mixed Effects Models and Extensions in Ecology with R, Springer, New York.