How do I know if my design is good enough?

Assessing the quality of a potential design comes down to establishing whether you are using your resources as effectively as possible.

Depending on what stage you’re at, you will have more or less information to work with - it’s much easier when you know a lot about the system and are asking specific questions than when you know little and are trying to get a basic understanding.

In the typical DOE situation you intend to understand your experimental system by building a linear model, so the way to assess a design is in terms of how well it can potentially allow you to fit that model, or how well the fitted model will work for making predictions.

There are two related questions which you can ask in this context:

  • How well can I distinguish a particular effect from random noise?

  • How well can I distinguish a particular effect from all the others?

We will talk about these two using the terms power and resolution, respectively.

How well can I distinguish a particular effect from random noise?

This question is essentially the definition of what statisticians call power: “what is the probability of detecting a true effect when it’s there?” The idea is that we will perform some sort of significance test on the effect, setting the acceptable level of false positives as usual, to see whether its effect size is significantly different from zero (in either direction).
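To make this concrete, here is a minimal sketch of a power calculation by simulation, in Python. The design (an 8-run, two-level layout), effect size, and noise level are all illustrative assumptions rather than values from Synthace: we simulate many experiments at those assumed values and count how often a standard significance test detects the effect.

```python
# Monte-Carlo sketch of power for one main effect in a 2-level, 8-run
# design. The effect size (beta) and noise SD (sigma) are illustrative
# assumptions, not defaults from any particular tool.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sims, alpha = 2000, 0.05
beta, sigma = 1.0, 1.0                  # assumed effect size and noise SD

# Coded -1/+1 settings for one factor across 8 runs
x = np.tile([-1.0, 1.0], 4)

detected = 0
for _ in range(n_sims):
    y = beta * x + rng.normal(0.0, sigma, size=x.size)
    # Two-sided t-test on the fitted slope
    if stats.linregress(x, y).pvalue < alpha:
        detected += 1

print(f"Estimated power: {detected / n_sims:.2f}")
```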

The problem is, of course, that we rarely know in advance what size of effect to expect, nor how noisy our measurements will be, since both depend on many aspects of the experimental setup.

We can get around this in a few ways:

  1. We could do an experiment to estimate these

  2. We may be able to use some rules of thumb to set these based on past experience with similar experiments

  3. We can try a variety of values to get a feel for how sensitive different effects are

  4. We can compare different effects (e.g. main effects, interactions, quadratic effects) to see how well they do relative to one another

Option 1 is best if we can afford it, but it can be expensive and time-consuming. Option 2 is good, but only available if we have some previous experience to draw on. We can always do options 3 and 4, and this is essentially how we usually proceed: trying to understand, for a given design, how well it can find terms of different types in comparison to other possible designs (a sketch of option 3 follows below). This is best done as part of asking questions like “Which design should I choose?” or “What if I did twice as many runs - does that help me significantly?”
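As an illustration of option 3, the sketch below sweeps a range of assumed signal-to-noise ratios (effect size divided by noise standard deviation) and reports the resulting power, using the exact noncentral-t formula for a single coefficient in an orthogonal two-level design. The run count and the SNR grid are arbitrary choices for illustration.

```python
# Sweep assumed signal-to-noise ratios and see how power responds.
# For a -1/+1 coded column in an orthogonal design, SE(beta) = sigma/sqrt(n),
# so the t-statistic's noncentrality is snr * sqrt(n).
import numpy as np
from scipy import stats

n_runs, alpha = 12, 0.05
df = n_runs - 2                           # intercept + one slope fitted
t_crit = stats.t.ppf(1 - alpha / 2, df)

for snr in [0.25, 0.5, 1.0, 1.5, 2.0]:    # assumed effect size / noise SD
    delta = snr * np.sqrt(n_runs)         # noncentrality parameter
    power = (1 - stats.nct.cdf(t_crit, df, delta)
             + stats.nct.cdf(-t_crit, df, delta))
    print(f"SNR {snr:4.2f} -> power {power:.2f}")
```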

An example power plot is depicted below.

The plot shows a bar chart of different effects, with the height of each bar representing the power to detect that effect - that is, the probability of detecting it if it’s really there. 80% power is usually considered a good standard to aim for, so a blue dashed line is plotted at that level. The effect power slider at the very bottom of the panel (not pictured) controls the assumed signal-to-noise ratio and lets you explore how well the design will perform for effects which are easier or harder to detect.
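A bar chart like this can be reproduced from first principles. The sketch below uses a small face-centred design in two factors (an illustrative choice, not a Synthace export), computes each effect’s standard error from the model matrix, converts an assumed signal-to-noise ratio - the analogue of the effect power slider - into power, and plots the bars with the 80% guideline.

```python
# Per-effect power for a small face-centred design with quadratic terms,
# at one assumed signal-to-noise ratio. Design, model, and SNR are all
# illustrative assumptions.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

pts = [(-1,-1), (1,-1), (-1,1), (1,1),    # factorial points
       (-1,0), (1,0), (0,-1), (0,1),      # axial (face) points
       (0,0), (0,0)]                      # centre points
A = np.array([p[0] for p in pts], float)
B = np.array([p[1] for p in pts], float)

# Model matrix: intercept, main effects, interaction, quadratics
X = np.column_stack([np.ones_like(A), A, B, A * B, A**2, B**2])
names = ["A", "B", "A*B", "A^2", "B^2"]

n, p = X.shape
df = n - p
XtX_inv = np.linalg.inv(X.T @ X)
snr, alpha = 2.0, 0.05                    # assumed effect size / noise SD
t_crit = stats.t.ppf(1 - alpha / 2, df)

powers = []
for j in range(1, p):                     # skip the intercept
    se_unit = np.sqrt(XtX_inv[j, j])      # SE of beta_j per unit sigma
    delta = snr / se_unit                 # noncentrality parameter
    powers.append(1 - stats.nct.cdf(t_crit, df, delta)
                  + stats.nct.cdf(-t_crit, df, delta))

plt.bar(names, powers)
plt.axhline(0.8, linestyle="--")          # the 80% guideline
plt.ylabel("Power")
plt.show()
```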

Strictly speaking, power analysis is directly relevant only for screening-type experiments; for optimization it’s more typical to consider how well predictive models built from the design would perform. However, the ability to predict is closely related to the ability to estimate parameters correctly, which is itself related to power, so power analysis is still informative.
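That connection can be seen directly: for a linear model, the prediction variance at a point x is sigma^2 * x'(X'X)^-1 x, built from the same matrix (X'X)^-1 that sets each coefficient’s standard error and hence its power. A minimal sketch, reusing the illustrative face-centred design from above:

```python
# Prediction SE (per unit noise SD) at two points in the design space,
# computed from the same (X'X)^-1 used for the power calculation.
import numpy as np

pts = [(-1,-1), (1,-1), (-1,1), (1,1),
       (-1,0), (1,0), (0,-1), (0,1), (0,0), (0,0)]

def row(a, b):
    # Model terms: intercept, A, B, A*B, A^2, B^2
    return np.array([1.0, a, b, a * b, a * a, b * b])

X = np.array([row(a, b) for a, b in pts])
XtX_inv = np.linalg.inv(X.T @ X)

for a, b in [(0.0, 0.0), (1.0, 1.0)]:     # centre vs corner of the space
    x = row(a, b)
    var_unit = x @ XtX_inv @ x            # prediction variance / sigma^2
    print(f"at ({a:+.0f},{b:+.0f}): prediction SE = {np.sqrt(var_unit):.2f} * sigma")
```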

For space-filling experiments power can often seem low. This is not so much indicative of a problem with the designs as of a mismatch between the assumptions of the power calculation and the way you would use the design: for scoping purposes you’re either building a non-linear model (which would require different calculations) or, most likely in Synthace, working more qualitatively. In that case the assessment is also more qualitative.

How well can I distinguish a particular effect from all the others?

The second criterion worth looking at is how much different effects will be confused with one another. This is a simpler assessment to make, since it’s just a matter of seeing how correlated the columns representing each effect are. If two columns are perfectly correlated (i.e. |R| = 1), there is no way to distinguish which effect is actually causing an observed change in the response - see the sketch below.
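A classic example of perfect correlation is a half fraction of a three-factor, two-level design generated with C = AB: the column for factor C is then identical to the A:B interaction column. A minimal sketch (the design is illustrative):

```python
# In a 4-run half fraction of a 2^3 design built with C = A*B, the
# column for C and the A:B interaction column coincide, so |R| = 1
# and the two effects cannot be separated.
import numpy as np

A = np.array([-1.0, 1.0, -1.0, 1.0])
B = np.array([-1.0, -1.0, 1.0, 1.0])
C = A * B                                 # the defining relation

r = np.corrcoef(C, A * B)[0, 1]
print(f"corr(C, A*B) = {r:.1f}")          # prints 1.0: fully aliased
```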

Overall it’s obviously best for all effects to be uncorrelated, but this is impossible to achieve without doing a lot of runs when you’re looking at many factors. It’s also worth knowing that quadratic effects are always somewhat correlated with other effects (their columns are never negative, so they overlap with the intercept and with each other). The typical way to assess correlations is with a correlation plot like the one below, which plots effects against each other and shows the correlation between each pair on a colour scale, darker indicating lower absolute correlation.
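As a sketch of the underlying calculation, the example below computes the absolute pairwise correlations between the effect columns of the illustrative face-centred design used earlier and displays them as a heatmap. (It uses matplotlib’s default colour map rather than Synthace’s darker-is-lower scale.) Note that the quadratic columns come out correlated with each other even in this balanced design, illustrating the point above.

```python
# Heatmap of |R| between effect columns for the illustrative
# face-centred design used in the power example.
import numpy as np
import matplotlib.pyplot as plt

pts = [(-1,-1), (1,-1), (-1,1), (1,1),
       (-1,0), (1,0), (0,-1), (0,1), (0,0), (0,0)]
A = np.array([p[0] for p in pts], float)
B = np.array([p[1] for p in pts], float)
cols = {"A": A, "B": B, "A*B": A * B, "A^2": A**2, "B^2": B**2}

M = np.column_stack(list(cols.values()))
R = np.abs(np.corrcoef(M, rowvar=False))  # |R| between effect columns

fig, ax = plt.subplots()
im = ax.imshow(R, vmin=0, vmax=1)
ax.set_xticks(range(len(cols)))
ax.set_xticklabels(list(cols.keys()))
ax.set_yticks(range(len(cols)))
ax.set_yticklabels(list(cols.keys()))
fig.colorbar(im, label="|correlation|")
plt.show()
```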

Looking for correlations between effects is useful when you expect specific effects to be of interest, but much harder to interpret when you aren’t sure which effects are likely to matter. In cases like these the power analysis is much more informative: for many designs, correlation between effects inflates the uncertainty of the estimates, which lowers power, so the power analysis effectively subsumes the general effect of correlations and works well when you aren’t interested in specific effects.

This has been a very brief introduction to the topic of assessing designs.

To learn more about the statistics behind design diagnostics, click here.

To learn how to generate design diagnostics, click here.
