Data analysis

Data analysis is the component of the monitoring process that turns collected data into useful information. Analysing water/sediment quality monitoring data improves your understanding of the system being measured and drives management actions.

Your monitoring program objectives will guide the data analysis, and they will also determine the study design, the quantity and type of data collected and, in some cases, the computing power required.

Data analysis is quantitative and can be computationally intensive. Undertaking valid data analysis requires a good understanding of appropriate statistical methods and a strong appreciation of the context in which data were generated and the inferences that are required.

We provide guidance in the Water Quality Guidelines on the use of common statistical methods to analyse water/sediment quality data. We have pitched this information at an introductory level to help your monitoring team identify suitable methods of analysis and interpret the results.

Data analysis process

In a typical data analysis process, your interpretation of the analysis outputs may refine the understanding of the system and lead to changes in monitoring design and the data that are collected in the future.

Careful data preparation before any analyses should minimise the influence of anomalies or errors.

Before starting a comprehensive analysis, you must enter, check and securely store the raw data. This usually involves computer database applications. The checking process will need to capture and flag data quality issues, such as missing data, detection limits and data entry errors. You should be able to retrace the steps taken back to the raw data.
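As a minimal sketch of this checking step, the snippet below flags missing values and below-detection-limit entries in a small pandas table while leaving the raw column untouched, so every step can be retraced. The site names, column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical raw records: 'tp_mg_l' is a total phosphorus column in which
# values such as '<0.01' are below the detection limit.
raw = pd.DataFrame({
    "site": ["A", "A", "B", "B"],
    "tp_mg_l": ["0.03", "<0.01", "0.05", None],
})

# Flag data quality issues before analysis, keeping the raw column intact
# so the analysis can be retraced back to the original data.
raw["below_dl"] = raw["tp_mg_l"].astype(str).str.startswith("<")
raw["missing"] = raw["tp_mg_l"].isna()

# Coerce to numeric: non-numeric entries (detection limits, gaps) become NaN.
raw["tp_clean"] = pd.to_numeric(raw["tp_mg_l"], errors="coerce")

print(raw[["site", "below_dl", "missing", "tp_clean"]])
```

How flagged values are then treated (e.g. substitution or censored-data methods for below-detection-limit results) is a separate decision that should be recorded alongside the data.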

It is essential to perform exploratory analyses and interrogation of key variables using a variety of numerical, statistical and graphical methods. Examining the raw data is valuable in itself: it can help you to identify patterns for further scrutiny or raise questions for investigation.
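An exploratory first pass might look like the sketch below: per-site numerical summaries of a synthetic turbidity dataset. The sites, distributions and random seed are invented for illustration; in practice these summaries would be paired with graphical displays such as boxplots and time series plots.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic turbidity readings (NTU) at two hypothetical sites, drawn from
# lognormal distributions to mimic typical right-skewed water quality data.
df = pd.DataFrame({
    "site": ["upstream"] * 20 + ["downstream"] * 20,
    "turbidity": np.concatenate([
        rng.lognormal(mean=1.0, sigma=0.3, size=20),
        rng.lognormal(mean=1.6, sigma=0.3, size=20),
    ]),
})

# Numerical summaries per site often reveal skewness, unusual values or
# site differences worth closer scrutiny before formal analysis.
summary = df.groupby("site")["turbidity"].describe()
print(summary[["mean", "50%", "min", "max"]])
```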

Useful modelling strategies for data analysis include model-based and probability-based approaches to analysis and inference, which may reflect how the data were collected, as well as the Bayesian and frequentist (classical) approaches to data analysis.

If quantifying status or change in water quality is a monitoring objective, then that may entail comparing sample data with guideline values. We describe how to derive guideline values using reference data, as well as significance testing and calculation of confidence or credible intervals to assess monitoring data against those guideline values when the sample data are spatially and temporally independent. If spatial or temporal independence is not likely, then this dependence may need to be accounted for in the analysis.
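For independent sample data, a common frequentist assessment against a guideline value combines a one-sided test with a confidence interval, as sketched below using scipy. The nitrate values, sample size and the guideline value of 0.5 mg/L are all invented for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Hypothetical monthly nitrate samples (mg/L), assumed spatially and
# temporally independent, and an illustrative guideline value.
samples = rng.normal(loc=0.42, scale=0.05, size=24)
guideline = 0.5

# One-sided t-test: is the mean concentration below the guideline value?
t_stat, p_value = stats.ttest_1samp(samples, guideline, alternative="less")

# 95% confidence interval for the mean concentration.
ci = stats.t.interval(0.95, len(samples) - 1,
                      loc=samples.mean(),
                      scale=stats.sem(samples))

print(f"p = {p_value:.4f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

If the interval lies entirely below the guideline value, the data support compliance at that confidence level; if independence cannot be assumed, methods that account for autocorrelation are needed instead.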

We discuss analytical options to assess temporal and spatial change, such as when a single water quality variable is considered at one site over time (temporal analysis approaches) or at multiple sites for a particular point in time (spatial and regional analysis approaches).
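As one example of a temporal analysis approach, the rank-based Mann–Kendall trend test is essentially Kendall's tau between the variable and time, which scipy can compute directly. The salinity series below is synthetic (a small upward trend plus noise); variable names and the seed are invented.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Hypothetical quarterly salinity readings at one site over 10 years,
# generated with a small upward trend plus random noise.
time = np.arange(40)
salinity = 0.8 + 0.02 * time + rng.normal(scale=0.1, size=40)

# Kendall's tau of value against time: a rank-based trend check that
# requires no normality assumption about the response variable.
tau, p_value = stats.kendalltau(time, salinity)
print(f"tau = {tau:.2f}, p = {p_value:.4f}")
```

A positive tau with a small p-value indicates an increasing monotonic trend; for seasonal data, a seasonal variant of the test would be more appropriate.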

We also introduce approaches for modelling relationships between multiple variables, including correlation analysis, multivariate and high-dimensional data analysis techniques, and regression analyses (both parametric and nonparametric approaches).
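A minimal sketch of the simplest of these approaches — correlation and parametric linear regression between two variables — is shown below using scipy. The flow and sediment data are synthetic and the linear relationship is built in for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)

# Hypothetical paired flow (m3/s) and suspended sediment (mg/L)
# measurements, generated so that sediment increases with flow.
flow = rng.uniform(1, 10, size=30)
sediment = 5 + 3 * flow + rng.normal(scale=2, size=30)

# Pearson correlation quantifies the strength of the linear association;
# simple linear regression estimates the slope of that relationship.
r, p_corr = stats.pearsonr(flow, sediment)
fit = stats.linregress(flow, sediment)
print(f"r = {r:.2f}, slope = {fit.slope:.2f}")
```

For non-linear or non-normal relationships, rank-based correlation (e.g. Spearman's) or nonparametric regression would be the analogous options.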

After completing the data analysis, interpreting and reporting the key results and findings completes the cycle of meeting the monitoring program objectives.

The information we provide here is not exhaustive. Complex monitoring studies may require a greater level of statistical sophistication and input from a professional statistician.

A useful checklist for water quality monitoring data analysis is presented in Box 1.

Box 1: Checklist for data analysis in a water quality monitoring program

  1.  Before commencing analysis, have you clearly identified:
    1. purpose of the data analysis exercise?
    2. parameters to be estimated or hypotheses to be tested?
    3. compatible data from different sources (levels of measurement, spatial scale, time scale)?
    4. objectives concerning quality and quantity of data?
    5. preferred methods for the statistical or data analysis?
    6. assumptions that need to be met for an appropriate application of those methods?
    7. data organisation and management considerations (storage media, layout, treatment of inconsistencies, outliers, missing observations and below detection limit data)?
  2. Have data visualisation and summary methods (graphical, numerical, and tabular summaries) been applied?
  3. Have data checks been performed and ‘aberrant’ observations (potential outliers) been identified?
  4. Have statistical assumptions (e.g. non-normality, non-constant variance, autocorrelation) been checked?
  5. Have data been suitably transformed, if necessary?
  6. Have data been analysed using the previously identified methods? Have alternative procedures been identified for data not amenable to particular techniques?
  7. Have results of analysis been collated into a concise (statistical) summary? Have statistical diagnostics (e.g. residual checking) been used to support the appropriateness of the modelling approach?
  8. Has the statistical output been carefully assessed and interpreted in the context of the objectives?
  9. Have the objectives been addressed? If not, you may need to redesign the study, collect new or additional data, refine the conceptual models and re-analyse the data.
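Items 4 and 5 of the checklist — checking statistical assumptions and transforming data where necessary — can be sketched as below: a normality check before and after a log transformation of synthetic right-skewed concentration data. The data and seed are invented for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical right-skewed concentration data, a common shape for
# water quality variables, drawn from a lognormal distribution.
conc = rng.lognormal(mean=0.5, sigma=0.8, size=50)

# Shapiro-Wilk test of normality (checklist item 4) on the raw data,
# then again after a log transformation (checklist item 5).
p_raw = stats.shapiro(conc).pvalue
p_log = stats.shapiro(np.log(conc)).pvalue
print(f"raw p = {p_raw:.4f}, log-transformed p = {p_log:.4f}")
```

A small p-value on the raw data signals departure from normality; a markedly larger p-value after transformation suggests the transformed scale better satisfies the assumption.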

Planning for data analysis

Data types, quantities and methods of statistical analysis need to be considered collectively and at the early planning stages of any monitoring strategy. You must make study design decisions about measurement scales, frequency of data collection, level of replication and spatial and temporal coverage so that data of sufficient quality and quantity are collected for subsequent statistical analysis.

It is important for your monitoring team to avoid the ‘data rich–information poor’ syndrome of collecting data that will not be subsequently analysed or that do not address the monitoring program objectives.

Given the costs associated with the data collection process, it is imperative for the monitoring team to use formal quality assurance and quality control (QA/QC) procedures to ensure the integrity of the data. These procedures should be supported by established back-up and archival processes. Choose the archival medium carefully because rapid advances in computer technology can quickly render storage equipment and formats obsolete.

QA/QC procedures should be applied at all stages of any monitoring program, from data collection and entry through to storage and analysis.

Before statistically analysing the monitoring data, you should use standard methods of data summary, presentation and outlier checking to help identify ‘aberrant’ observations. If undetected, these data values can have profound effects on subsequent statistical analyses and can lead to incorrect conclusions and flawed decision-making.
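One common outlier screen is Tukey's 1.5 × IQR rule, sketched below on a hypothetical set of pH readings containing an invented data entry error (72.0 instead of 7.2). Flagged values are candidates for scrutiny, not automatic removal: expert judgement is still needed to decide whether a value is an error or a genuine extreme.

```python
import numpy as np

# Hypothetical pH readings with one data entry error (72.0 instead of 7.2).
ph = np.array([7.1, 7.3, 7.2, 7.0, 7.4, 72.0, 7.2, 7.3])

# Tukey's rule: flag values more than 1.5 * IQR beyond the quartiles.
q1, q3 = np.percentile(ph, [25, 75])
iqr = q3 - q1
flagged = ph[(ph < q1 - 1.5 * iqr) | (ph > q3 + 1.5 * iqr)]
print(flagged)
```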

Develop a plan of the sequence of actions that the monitoring team will use for the statistical analysis of the water quality data. Only some of the many available statistical techniques need be used.

An initial focus of monitoring might be to assess water quality against a guideline value or to detect trends. An ultimate objective of the data analysis exercise will probably be to increase your team’s understanding of the natural system under investigation. Improved understanding should result in more informed decision-making, which in turn will lead to better environmental management.

One of the most significant challenges for the data analysis phase is to extract a ‘signal’ from an inherently noisy environment.

We can categorise monitoring study designs as:

  • descriptive studies, including audit monitoring
  • studies for the measurement of change, including assessment of water quality against a guideline value (which can also be categorised as descriptive)
  • studies for system understanding, including cause-and-effect studies and general investigations of environmental processes.

The statistical requirements for the descriptive studies are less complex than for the other 2 categories in which more detailed inferences are being sought.

Most of the statistical methods that we present here are based on classical tools of statistical inference (e.g. analysis of variance, t-tests, F-tests). These methods have served researchers from a variety of backgrounds and disciplines extremely well over many years, but concerns have been raised about their utility for environmental sampling and assessment. When measuring natural ecosystems or processes, it is invariably hard to justify the assumptions that:

  • response variables are normally distributed
  • variance is constant in space and time
  • observations are uncorrelated.

In these cases, remedial action (e.g. data transformations) may overcome some of the difficulties but it is more probable that an alternative statistical approach is needed.

For example, generalised linear models (GLMs) are often more suitable than classical analysis of variance (ANOVA) techniques for the analysis of count data because of their inherent recognition and treatment of a non-normal response variable.

Many researchers have questioned the utility and appropriateness of statistical significance testing for environmental assessment (e.g. McBride et al. 1993, Johnson 1999); refer to Bayesian versus frequentist approaches.

A large number of introductory and advanced texts of statistical methods are available (e.g. Ott 1984, Helsel & Hirsch 2002). For a relatively detailed and comprehensive description of statistical techniques used in water quality management studies, refer to Helsel & Hirsch (2002) and McBride (2005). More general resources, such as the Encyclopedia of Environmetrics (El-Shaarawi & Piegorsch 2014), provide extensive coverage on the development and application of quantitative methods in the environmental sciences.

Statistical software for data analysis

So many statistical software tools are available that it is beyond the scope of the Water Quality Guidelines to review them all.

For example, Wikipedia compares various statistical packages. These packages support different statistical techniques and different features, such as size of datasets handled, graphical representations, database interfaces, linkages to other software and the level of expertise needed to use them.

Many software tools provide a high level of functionality and technical sophistication but they also lend themselves to abuse through blind application. It is important to scrutinise both the output and the choice of technique by asking yourself:

  • Does the analysis make sense?
  • Is it consistent with what has been observed in the exploratory data analysis?

If these questions are not routinely asked, then you run the risk that the ‘mental models’ of your monitoring team may have undue influence on the outcome. Results are more likely to be accepted — even if they occur for the wrong reasons — when they match the expectations of those who are interpreting the analysis.

References

El-Shaarawi AH & Piegorsch WW (eds) 2014, Encyclopedia of Environmetrics, John Wiley and Sons.

Helsel DR & Hirsch RM 2002, Chapter A3: Statistical methods in water resources, Section A: Statistical analysis, Book 4: Hydrologic analysis and interpretation, Techniques of Water-Resources Investigations of the United States Geological Survey, US Geological Survey, United States Department of the Interior, Reston.

Johnson DH 1999, The insignificance of statistical significance testing, Journal of Wildlife Management 63: 763–772.

McBride GB, Loftis JC & Adkins NC 1993, What do significance tests really tell us about the environment? Environmental Management 17: 423–432.

McBride GB 2005, Using Statistical Methods for Water Quality Management: Issues, options and solutions, Wiley, New York.

Ott L 1984, An Introduction to Statistical Methods and Data Analysis, Duxbury Press, Boston.