Deriving guideline values using multiple lines of evidence

Using a combination of independent lines of evidence, such as a mix of field and laboratory data, has advantages when deriving site-specific guideline values for water quality stressors.

Use of multiple lines of evidence to derive guideline values has been driven in part by limitations in conventional toxicity testing methods (Cormier et al 2008), including:

  • poor ability to characterise certain water quality stressors (e.g. nutrients, suspended sediment, and persistent and bioaccumulative toxicants typically taken up via the diet rather than the water)
  • poor representation of specific groups of species typically found in receiving waters (e.g. aquatic insects).

The multiple lines-of-evidence approach has gained stronger international support in recent years (Leung et al 2014, Merrington et al 2014, van Dam et al 2014, Chapman 2016, USEPA 2016, Buchwalter et al 2017, Moore et al 2017, Suter et al 2017).

Despite many potential applications, the approach has not been used to derive default guideline values (DGVs) for Australia and New Zealand. Perhaps the most common application of this approach will be for site-specific guideline values where there is a need, possibly driven by the regulator, for greater confidence in the final value. An example is a toxicant identified as posing a higher risk than others to an ecosystem of high conservation value.

Our guidance on deriving guideline values using multiple lines of evidence involves various evaluations of, and judgements across, different datasets using a weight-of-evidence process. We focus on single-toxicant stressors although, in the case of field-effects information, multiple stressors will be present and often confounded with the toxicant of interest. We provide guidance on dealing with confounding in field studies, and intend to provide guidance on dealing with multiple stressors in the future.

After reading this guidance, we suggest you read the extra information cited, particularly USEPA (2016) Chapter 8 and, expanding on that work, Suter et al (2017).

Do not confuse this guidance with our separate guidance on following a weight-of-evidence process to assess water/sediment quality and ascribe the likely cause of any observed change.

Advantages of typical lines of evidence

Gathering more lines of evidence increases opportunities for critical comparison and generates a more defensible guideline value (Cormier et al 2008).

Lines of evidence typically include data from multiple independent field and laboratory studies on the effects of a stressor, and can also include complementary analyses of the same dataset (Cormier et al 2008).

Typical laboratory and field studies

  • Short-term laboratory or field (in situ) exposures of single or multiple (microcosms) test species.
  • Field or semi-field mesocosms, naturally colonised by biotic assemblages or seeded with assemblages or multiple species (e.g. phytoplankton, zooplankton, attached diatoms and macroinvertebrates).
  • True field-effects studies, involving assemblages (e.g. macroinvertebrate communities) in lentic waterbodies or streams or coastal and marine ecosystem types across a stressor disturbance gradient.

If the evidence from these different investigations converges and indicates that the derived guideline value is correct, then you will have greater confidence in its use.

Strengths and weaknesses of each type of evidence

  • Laboratory toxicity studies provide excellent experimental control and associated ability to quantify cause-and-effect relationships, although with generally unproven ecological realism or relevance.
  • (Semi-)field studies (e.g. mesocosms or in situ toxicity testing) offer improved ecological realism or relevance whilst still retaining some experimental control, although often with increased data variability.
  • True field studies have high ecological realism or relevance, although with little experimental control and often significant confounding.

Complementary analyses of the same dataset

Different analyses applied to the same dataset (especially for complex field data) can provide different insights into data and improve inferences, such as:

  • disentangling the effects of confounding factors
  • identifying important environmental variables associated with the biological responses
  • identifying sensitive taxa or taxa diagnostic of particular stressors
  • estimating threshold concentrations.

Some approaches are discussed in Deriving guideline values using field-effects data, along with supporting references.

Evaluating multiple lines of evidence

You should collect and evaluate data and information from different lines of evidence by following a process that will build the case for a rigorous and defensible guideline value derivation (USEPA 2016).

  1. Assess the quality of the data of each different line of evidence. This involves assessment of the relevance and reliability of the data and associated analyses. Relevance relates to how representative the data are for the stressor and ecosystem receptor/s, and the environment in which they both occur. Reliability relates to the level of confidence that a study has generated good quality results, and is generally gauged by factors like the use of well-documented protocols and practices. This includes adequate quality assurance and quality control (QA/QC), strong experimental design, appropriate statistical treatment and overall transparency of methods and decisions.
  2. Confirm that the stressor of interest is the only (or main) contributor to observed effects within each line of evidence. This is most important for field-effects studies because cause-and-effect relationships are usually very strong for laboratory studies. We discuss approaches for disentangling confounding factors in complex field-effects data in Deriving guideline values using field-effects data.
  3. Search for corroborative evidence amongst the lines of evidence. Look for consistency in toxicity effect concentrations between the various types of studies.
  4. Identify plausible explanations for differences in results amongst lines of evidence. If toxic responses and toxic-effect concentrations differ between the studies, then look for demonstrable or plausible explanations. For example, higher toxicity in field studies than in laboratory toxicity studies could reflect a combination of direct effects (e.g. toxicity) and indirect effects (e.g. trophic), or different exposure durations, amongst other things. Look for a plausible, preferably mechanistic, basis for any explanation, supported by literature where possible.
  5. Derive candidate guideline values within each line of evidence to inform final guideline value derivation. We provide guidance on how to do this in the next section, ‘Deriving a guideline value’.

Steps 2 to 4 form part of the weight-of-evidence process, which seeks to establish the causality of an effect. Establishing causality is essential if a defensible guideline value is to be derived from multiple lines of evidence (Cormier et al 2013, Suter 2016).

Deriving a guideline value

You can derive candidate guideline values for each line of evidence using the approaches we describe for each data type (e.g. in Deriving guideline values using field-effects data).

Then you can integrate the candidate guideline values (Cormier et al 2008, USEPA 2016) to agree on a final guideline value by combining the evidence or by choosing the best evidence.

Combine the evidence (one approach)

Combining candidate guideline values is most appropriate when:

  • they are similar (e.g. within an order of magnitude), and
  • the variability between them is likely to be within the measurement variability of the chemical analysis methods and instrumentation used in each of the separate investigations.

In this case, the arithmetic mean, geometric mean or a percentile of the candidate guideline values could be used as the final guideline value.
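
As a minimal sketch of this step, the Python below checks that a set of candidate guideline values falls within an order of magnitude and then computes the three kinds of summary statistic mentioned above. The values, units and function names are hypothetical, and the median is used as one example of a percentile. Which statistic to adopt remains a matter of professional judgement.

```python
import math
import statistics

def within_order_of_magnitude(values):
    """Return True if the largest value is at most 10 times the smallest.

    Assumes all values are positive concentrations.
    """
    return max(values) / min(values) <= 10

def combine_candidates(values):
    """Summarise candidate guideline values by three common statistics."""
    if not within_order_of_magnitude(values):
        raise ValueError("candidates differ by more than an order of magnitude")
    # Geometric mean computed as the exponential of the mean of the logs.
    log_mean = sum(math.log(v) for v in values) / len(values)
    return {
        "arithmetic_mean": statistics.mean(values),
        "geometric_mean": math.exp(log_mean),
        "median": statistics.median(values),  # the 50th percentile
    }

# Hypothetical candidate guideline values (mg/L), one per line of evidence.
candidates = [2.0, 4.0, 8.0]
summary = combine_candidates(candidates)
print(summary)
```

For positive values the geometric mean is never greater than the arithmetic mean, so it is the more protective of the two wherever lower concentrations are more protective.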

If some of the candidate guideline values are thought to have higher quality data or be more important than others (Linkov et al 2006, Moore et al 2017), then it might be appropriate to calculate a weighted mean of the candidate guideline values. You can do this by assigning a weighting factor to each candidate guideline value before averaging. This might take into consideration the advantages and disadvantages of the various datasets, study design and analytical methods (e.g. uncertainty in the underlying science or models, high variance in exposure–response data or models).
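
A weighted mean of this kind could be sketched as follows. The candidate values and weighting factors are hypothetical, and how weights are assigned (for example, from relevance and reliability assessments) is an assumption that would need to be agreed and documented by the assessment team.

```python
def weighted_mean(values, weights):
    """Return the weighted arithmetic mean of candidate guideline values.

    Each weight expresses the relative importance or quality assigned to
    the matching candidate value; weights need not sum to 1.
    """
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("at least one weight must be positive")
    return sum(v * w for v, w in zip(values, weights)) / total_weight

# Hypothetical candidate values (mg/L) from three lines of evidence,
# with the second line of evidence judged twice as important as the others.
candidate_values = [2.0, 4.0, 8.0]
weighting_factors = [1, 2, 1]
final_value = weighted_mean(candidate_values, weighting_factors)
print(final_value)  # 4.5
```

With equal weights the function reduces to the ordinary arithmetic mean, which makes it easy to show how much the weighting changes the outcome.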

Select the best evidence (another approach)

Selecting the best evidence may be the most appropriate approach when the quality of the candidate guideline values is highly variable. In this case, the highest quality candidate guideline value would typically be selected as the final guideline value.

Sometimes other factors might influence a decision on what represents the best candidate guideline value.

For example, the most protective candidate guideline value might be selected where uncertainty is high or a high level of ecosystem protection is required (e.g. a high conservation or high ecological value ecosystem). If the required level of protection is lower (e.g. a slightly to moderately disturbed ecosystem), then you may select a less protective candidate guideline value of appropriate quality that still ensures the ecosystem remains largely intact and the majority of species are protected (refer to our diamond mine example).
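
The two selection rules described above can be sketched in Python. The `Candidate` record, the quality scores and the values are all hypothetical, and the sketch assumes that a lower concentration is more protective, which will not hold for every stressor.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    source: str    # line of evidence the value came from
    value: float   # candidate guideline value (hypothetical units)
    quality: int   # hypothetical quality score, higher is better

def select_best(candidates, most_protective=False, min_quality=0):
    """Pick one candidate guideline value as the final value.

    By default the highest-quality candidate is returned. With
    most_protective=True, the lowest value among candidates meeting
    min_quality is returned instead (assumes lower is more protective).
    """
    eligible = [c for c in candidates if c.quality >= min_quality]
    if not eligible:
        raise ValueError("no candidate meets the minimum quality")
    if most_protective:
        return min(eligible, key=lambda c: c.value)
    return max(eligible, key=lambda c: c.quality)

# Hypothetical candidates from three lines of evidence.
options = [
    Candidate("laboratory toxicity", 3.0, quality=3),
    Candidate("mesocosm", 2.5, quality=2),
    Candidate("field survey", 5.0, quality=1),
]
best_quality = select_best(options)
protective = select_best(options, most_protective=True, min_quality=2)
print(best_quality.source, protective.source)
```

Recording the rule and any quality threshold before the candidates are compared supports the transparency that this guidance calls for.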

Alternatively, regulatory precedent could influence the choice of approach to determining a final guideline value, as suggested by Cormier et al (2008). In Australia, the derivation and use of guideline values based on multiple lines of evidence is still in its infancy so there is unlikely to be much regulatory precedent as yet.

Making professional and transparent judgements

Integrating multiple lines of evidence to assess possible guideline values will often involve a significant amount of subjectivity, which requires professional judgement.

Decisions based on judgements should be made as a team, where possible, rather than in isolation. This more rigorous approach tends to strengthen the basis of decision-making.

Formal weight-of-evidence processes, such as those described by USEPA (2016) for deriving quantitative estimates (e.g. guideline values), make judgements more transparent.

Transparency throughout the whole process is critical because we do not yet have standardised protocols for deriving guideline values using multiple lines of evidence. This extends to agreeing on and documenting the guideline value derivation method up-front, to avoid tailoring (or the perception of tailoring) the approach to give a preferred outcome.

Iteration and modification of derivation methods may sometimes be necessary, but any such decisions, including their basis, need to be made fully transparent.

Using multiple lines of evidence — some examples

These case studies provide brief insight into how guideline values can be derived using multiple lines of evidence; refer to the source publications for details. Cormier et al (2008) also provided a useful example based on bedded sediment as the stressor.

Deriving a water quality objective for electrical conductivity downstream of a diamond mine

Van Dam et al (2014) described a multiple lines-of-evidence study that derived a site-specific guideline value and then established a water quality objective for a stream receiving mine discharge waters with high electrical conductivity (EC) from the Argyle Diamond Mine in the East Kimberley region of Western Australia. Magnesium sulfate was the dominant contaminant.

Lines of evidence included in the study:

  • laboratory-based toxicity testing of the mine waters using locally relevant species in local receiving water
  • multiple wet seasons of field-effect surveys measuring responses of 4 aquatic communities along a gradient of mine water contamination
  • comprehensive water chemistry and discharge data gathered in conjunction with the biological studies.

Candidate guideline values were derived for the laboratory data and field data and used to inform a site-specific guideline value in the form of a range.

While laboratory and field guideline values of similar magnitude were derived that would provide full protection of the receiving water environment, the agreed final water quality objective (stream discharge dependent) was not as stringent. Instead, the recommended value was one that was consistent with maintaining a water quality observed from field studies to sustain important environmental values and provide important ecosystem services, including visual, social and recreational amenity.

Source: van Dam RA, Humphrey CL, Harford AJ, Sinclair A, Jones DR, Davies S & Storey AW 2014, Site-specific water quality guidelines: 1. Derivation approaches based on physicochemical, ecotoxicological and ecological data, Environmental Science and Pollution Research 21(1): 118–130.

Deriving a water quality guideline value for magnesium downstream of a uranium mine

Supervising Scientist (2017) summarised a multiple lines-of-evidence approach for deriving a site-specific guideline value for magnesium in the high conservation value creeks and billabongs downstream of the Ranger uranium mine in the Northern Territory.

Lines of evidence included in the study:

  • extensive laboratory toxicity testing for magnesium using local species in local receiving water, comprising over 250 individual experiments that also quantified the influence of calcium and exposure duration on magnesium toxicity
  • a large mesocosm study, carried out in 1500 L tubs located in the creek bed of the major receiving water, which assessed the toxicity of magnesium to aquatic communities over 8 weeks
  • extensive macroinvertebrate monitoring, over 7 annual sampling occasions between 1979 and 2013, for 14 shallow billabongs that comprised a spatial and temporal gradient of exposure to mine water dominated by magnesium sulfate
  • various other direct toxicity assessment studies conducted on mine waters over a 20-year period.

Candidate guideline values for 99% species protection were derived for the datasets assessed as relevant and reliable. The final site-specific guideline value was based on the geometric mean of the (very similar) candidate guideline values.

Source: Supervising Scientist 2017, Magnesium Rehabilitation Standard for the Ranger Uranium Mine – Water and Sediment, Supervising Scientist, Darwin.

Deriving a level of concern for the herbicide atrazine that is protective of aquatic plant communities

Moore et al (2017) derived a level of concern (LOC), which is similar to a guideline value, for the herbicide atrazine using multiple lines of evidence.

They compared 4 existing modelling approaches, each based on laboratory toxicity data, microcosm data, mesocosm data or a combination of these, and each representing a line of evidence. Each of the models focused on aquatic plant species, with the data used to determine a 60-day LOC.

Candidate LOCs derived from each of the 4 models were similar to one another and were weighted based on their environmental relevance and statistical reliability. The final community LOC was based on a mean of the weighted candidate LOCs.

Source: Moore DRJ, Greer CD, Manning G, Woodling K, Beckett KJ, Brain A & Marshall G 2017, A weight-of-evidence approach to deriving a level of concern for atrazine that is protective of aquatic plant communities, Integrated Environmental Assessment and Management 13(4): 686–701.

Deriving aquatic life water quality criteria for nutrients

Smith & Tran (2010) determined water quality criteria (similar to guideline values) from candidate criteria derived using aquatic community structure data from 40 large river sites spanning a gradient of nutrient concentrations and biological responses.

Data were subjected to 3 different analytical methods that each represented a line of evidence:

  • percentile analysis
  • nonparametric deviance reduction (change-point analysis)
  • cluster analysis.
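
To illustrate the idea behind change-point analysis by deviance reduction, the sketch below finds the stressor value that best splits a biological response into two groups. It takes deviance to be the sum of squared deviations from the group mean, which is an assumption suited to a continuous response, and it uses hypothetical data. It is not necessarily the procedure applied by Smith & Tran (2010).

```python
def deviance(values):
    """Sum of squared deviations from the mean of the group."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def change_point(stressor, response):
    """Return the stressor value that gives the largest deviance reduction.

    The data are ordered along the stressor gradient, every split between
    neighbouring distinct stressor values is tried, and the midpoint of the
    split with the lowest combined deviance is returned. Returns None if
    no split reduces the deviance.
    """
    pairs = sorted(zip(stressor, response))
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    best_split = None
    best_deviance = deviance(ys)  # deviance with no split
    for i in range(1, len(ys)):
        if xs[i] == xs[i - 1]:
            continue  # cannot split between tied stressor values
        split_deviance = deviance(ys[:i]) + deviance(ys[i:])
        if split_deviance < best_deviance:
            best_deviance = split_deviance
            best_split = (xs[i - 1] + xs[i]) / 2
    return best_split

# Hypothetical nutrient concentrations and a biological metric at 6 sites.
nutrient = [1, 2, 3, 4, 5, 6]
metric = [10, 11, 10, 4, 5, 4]
threshold = change_point(nutrient, metric)
print(threshold)  # 3.5
```

In practice the uncertainty of the estimated threshold would also need to be assessed, for example by resampling, before it is used as a candidate criterion.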

Candidate criteria derived from each of the 3 methods were weighted on the basis of strength and significance of the analysis, confidence in the data and best professional judgement. The final criteria were based on a mean of the weighted candidate guideline values.

Source: Smith AJ & Tran CP 2010, A weight-of-evidence approach to define nutrient criteria protective of aquatic life in large rivers, Journal of the North American Benthological Society 29: 875–891.

References

Buchwalter DB, Clements WH, & Luoma SN 2017, Modernizing water quality criteria in the United States: a need to expand the definition of acceptable data, Environmental Toxicology and Chemistry 36(2): 285–291.

Chapman PM 2016, Environmental quality benchmarks — the good, the bad, and the ugly, Environmental Science and Pollution Research: 1–4.

Cormier SM, Paul JF, Spehar RL, Shaw-Allen P, Berry WJ & Suter GW 2008, Using field data and weight of evidence to develop water quality criteria, Integrated Environmental Assessment and Management 4(4): 490–504.

Cormier SM, Suter GW & Norton SB 2010, Causal characteristics for ecoepidemiology, Human and Ecological Risk Assessment 16: 53–73.

Leung KMY, Merrington G, Warne MSJ & Wenning RJ 2014, Scientific derivation of environmental quality benchmarks for the protection of aquatic ecosystems: Challenges and opportunities, Environmental Science and Pollution Research 21(1): 1–5.

Linkov I, Satterstrom FK, Kiker G, Seager TP, Bridges T, Gardner KH, Rogers SH, Belluck DA & Meyer A 2006, Multicriteria decision analysis: a comprehensive decision approach for management of contaminated sediments, Risk Analysis 26: 61–78.

Merrington G, An YJ, Grist EP, Jeong SW, Rattikansukha C, Roe S, Schneider U, Sthiannopkao S, Suter GW, van Dam R & Van Sprang P 2014, Water quality guidelines for chemicals: learning lessons to deliver meaningful environmental metrics, Environmental Science and Pollution Research 21: 6–16.

Moore DRJ, Greer CD, Manning G, Woodling K, Beckett KJ, Brain A & Marshall G 2017, A weight-of-evidence approach to deriving a level of concern for atrazine that is protective of aquatic plant communities, Integrated Environmental Assessment and Management 13(4): 686–701.

Suter G, Cormier S & Barron M 2017, A weight of evidence framework for environmental assessments: inferring quantities, Integrated Environmental Assessment and Management 13(6): 1045–1051.

USEPA 2016, Weight of Evidence in Ecological Assessment, US Environmental Protection Agency Office of Research and Development, Washington DC, EPA100R16001.