When trying to understand any dataset, regardless of how you choose to analyze it, there’s a lot to be said for doing some simple visualizations first. They help you get a feel for the data before you start fitting models.
Here are the main things to try and what to look for in each case.
Visualising responses
Unfortunately, in most DOEs you don’t actually get to see the factor levels you set in the experiment, just what you intended to set them to. The responses you measure, however, are always real information, which makes them the best starting point when getting into your data.
Plotting the distribution of each response can give you a good idea about the overall structure of your measurements: are they all coming from a single group or are there multiple groups? Do they resemble any standard distributions? Are there a few points which are obviously very different from the rest?
The following figures show a few common situations and what they look like in the diagnostic plots provided in the Select & Transform tab.
Figure 1. Plots of normally distributed data (mean = 100, sd = 15).
Figure 2. Histogram of uniformly distributed data (low = 50, high = 140).
Figure 3. Histogram of data with two distinct populations.
Figure 4. Histogram of data with a log-normal distribution.
Figure 5. Histogram of data with a small number of outliers.
Cases 1 & 2 suggest the population is reasonably homogeneous. This is good in some ways, since the modelling assumptions are likely to be satisfied, but for DOE it’s not actually ideal: we expect there to be systematic differences based on the different factor levels, which would create patterns more like case 3.
Case 3 suggests we have two different groups in our sample. This is potentially a good sign if the differences in the response relate to the underlying changes of factors. If this isn’t the case it may be necessary to model the two populations separately.
Case 4 is log-normal, that is, the values are normally distributed on a logarithmic scale. Linear models will not be able to capture the relationship between factors and response properly here unless the response is transformed by taking logs of the response values.
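As a rough illustration of why taking logs helps in case 4, the sketch below simulates log-normal responses and compares the skewness before and after the transformation. The data are simulated, and the parameters (4.0 and 0.6 on the log scale) and the `skewness` helper are assumptions chosen for this example, not values or functions from Synthace.

```python
import math
import random
import statistics

def skewness(values):
    """Sample skewness: near 0 for symmetric data, positive for a long right tail."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Simulated log-normal responses; 4.0 and 0.6 are illustrative parameters on the log scale
responses = [rng.lognormvariate(4.0, 0.6) for _ in range(2000)]
log_responses = [math.log(v) for v in responses]

raw_skew = skewness(responses)
log_skew = skewness(log_responses)
print(f"skewness of raw responses: {raw_skew:.2f}")
print(f"skewness of log responses: {log_skew:.2f}")
```

The raw values should show a clearly positive skew, while the logged values should be close to symmetric, which is the shape a linear model handles more comfortably.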
Case 5 has a small number of outliers - you can see the histogram is somewhat similar to the case where the distribution is log-normal, but the other two plots show that there are clearly a few points which differ rather than a more continuous variation.
Statistically speaking, it’s not considered acceptable to remove data purely because the numbers are obviously different. For modelling purposes, however, it’s often necessary to do this, since otherwise the model is of very little use. In case 5, for example, modelling the whole set would tell you only about how the three outliers differ from the remainder. That’s useful information but it will be very limited: the data are so unbalanced that the model won’t be able to properly capture the relationships.
The best approach is to decide which questions are most interesting: if you want to understand the outliers you don’t really have much information so it’s best to try a few different methods to see how they might differ from the rest, then do another experiment to explore them further.
If on the other hand you’re more interested in the bulk of the data you can remove them and see what the model can tell you. Removing a few points won’t hurt things too much and you can likely proceed as normal.
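If you want a repeatable way to separate a few extreme points from the bulk before making this decision, one common rule of thumb is a robust z-score based on the median and the median absolute deviation (MAD). The sketch below is only one possible approach, not the method used by Synthace. The response values, the `flag_outliers` helper and the 3.5 threshold are all assumptions for illustration.

```python
import statistics

def flag_outliers(values, threshold=3.5):
    """Return indices of points whose robust z-score exceeds the threshold.

    Uses the median and the median absolute deviation (MAD), which are
    much less affected by the outliers themselves than the mean and sd.
    """
    median = statistics.median(values)
    mad = statistics.median(abs(v - median) for v in values)
    if mad == 0:
        return []  # no spread to measure against
    # 0.6745 rescales the MAD so the score is comparable to a normal z-score
    return [i for i, v in enumerate(values)
            if abs(0.6745 * (v - median) / mad) > threshold]

# Hypothetical responses: nine typical values and three extreme ones
responses = [98, 102, 95, 110, 101, 97, 104, 99, 310, 295, 100, 305]

outlier_idx = flag_outliers(responses)
bulk = [v for i, v in enumerate(responses) if i not in outlier_idx]
print("flagged indices:", outlier_idx)
print("remaining values:", bulk)
```

Flagging points this way only tells you which values are unusual. Whether to model them, remove them or investigate them in a follow-up experiment still depends on which question you care about.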
Scatterplots of factor values against a response
Having looked at your response(s) separately you can then start to examine the relationship between responses and factors. The information you have already got from looking at the response distribution(s) will help you to understand the patterns you observe here. If you are dealing with a dataset which includes one or more pathological outliers you will need to resolve that situation before doing this in most cases.
Plotting factors against responses is essentially replicating the first level of modelling: what you’re looking to see is whether the average response changes as the relevant factor value changes. Of course, this is very imprecise and approximate, but it’s usually possible to spot the really strong effects this way. This should give you a head start on which effects to expect when model-building.
A few notes of caution. Firstly, it’s important to remember that by design DOE runs are not replicates in most cases, so the variation you see at each level is a function of both the noise AND the other signals. You can get a little more information by, e.g., colouring the points, but it’s difficult to get more than the strongest main effects this way.
Secondly, scatterplots aren’t always the best tool for this. Depending on the design type, you may have a lot of points stacked up on top of one another (e.g. 2-level designs). With some jitter in the points you can still usually get a feel for where the average would be, but it’s not easy to do in all cases. Outside of Synthace you may find that tools like boxplots make interpretation easier, but even with regular scatterplots you can usually get what you need.
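The "average response at each factor level" that you are judging by eye can also be written down as a short calculation. The sketch below uses made-up runs for a single two-level factor. The data and the `mean_by_level` helper are hypothetical and only show the idea.

```python
from collections import defaultdict
import statistics

# Hypothetical runs: (factor level, response); the numbers are made up
runs = [
    (-1, 12.1), (-1, 14.3), (-1, 11.8), (-1, 13.0),
    (+1, 19.6), (+1, 21.2), (+1, 18.9), (+1, 20.5),
]

def mean_by_level(pairs):
    """Group responses by factor level and return {level: mean response}."""
    groups = defaultdict(list)
    for level, response in pairs:
        groups[level].append(response)
    return {level: statistics.fmean(vals) for level, vals in sorted(groups.items())}

level_means = mean_by_level(runs)
# Difference between the high-level and low-level averages: a crude main-effect estimate
rough_effect = level_means[1] - level_means[-1]
print(level_means, rough_effect)
```

A large difference relative to the spread within each level is the kind of strong effect a scatterplot makes visible. In a real design the spread within each level also contains the effects of the other factors.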
Other plots you can use
Run plots
These plot responses against an ordering of the data intended to represent the order in which the runs were performed. This can help to show any correlations between adjacent runs, which might suggest experimental artifacts. They are most useful when runs are completed in a simple sequence, and much less so when the runs are created in a more complicated pattern, as is often the case for biological experiments.
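One simple numerical companion to a run plot is the lag-1 autocorrelation, which measures how strongly each response resembles the one from the following run. The sketch below is an illustration under assumptions: the drifting responses are invented, and `lag1_autocorrelation` is a hypothetical helper rather than a Synthace feature.

```python
import statistics

def lag1_autocorrelation(values):
    """Correlation between each response and the one from the following run."""
    mean = statistics.fmean(values)
    numerator = sum((a - mean) * (b - mean) for a, b in zip(values, values[1:]))
    denominator = sum((v - mean) ** 2 for v in values)
    return numerator / denominator

# Hypothetical responses listed in run order, with a steady upward drift
drifting = [10.0, 10.4, 11.1, 11.3, 12.0, 12.2, 12.9, 13.5, 13.8, 14.6]

r_drift = lag1_autocorrelation(drifting)
print(f"lag-1 autocorrelation: {r_drift:.2f}")
```

Values well above zero suggest that adjacent runs are more alike than you would expect by chance, for example because of drift over time. Values near zero give no sign of such a pattern.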
Plate heatmaps
These allow you to visualize response or other data as colourings of wells in depictions of microplates. As such they can be a useful alternative to run plots in showing whether particular plate regions (e.g. the edges vs the centres) are systematically different from one another.
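The edge-versus-centre comparison a heatmap shows visually can also be summarised as two averages. The sketch below assumes a 96-well plate stored as an 8 x 12 grid of numbers. The plate values and the helper names are hypothetical and chosen only to make the example self-contained.

```python
import statistics

ROWS, COLS = 8, 12  # assumed 96-well plate layout

def is_edge(row, col):
    """True for wells in the outermost rows or columns of the plate."""
    return row in (0, ROWS - 1) or col in (0, COLS - 1)

def edge_vs_centre(plate):
    """Return (mean of edge wells, mean of interior wells) for a ROWS x COLS grid."""
    edge = [plate[r][c] for r in range(ROWS) for c in range(COLS) if is_edge(r, c)]
    centre = [plate[r][c] for r in range(ROWS) for c in range(COLS) if not is_edge(r, c)]
    return statistics.fmean(edge), statistics.fmean(centre)

# Hypothetical plate: interior wells read 1.0 and edge wells read 0.8
plate = [[0.8 if is_edge(r, c) else 1.0 for c in range(COLS)] for r in range(ROWS)]

edge_mean, centre_mean = edge_vs_centre(plate)
print(f"edge mean = {edge_mean:.2f}, centre mean = {centre_mean:.2f}")
```

A clear gap between the two averages is the numerical counterpart of a visibly different border on the heatmap.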
Normal probability plots
Probability plots are a statistical tool which compares the ordered observations with the values you would expect from a normal distribution with the same mean and standard deviation. They give you essentially the same information as histograms of the data but are good for highlighting small deviations, particularly in the extreme ends (or tails) of the data. Normally distributed data would be expected to lie along a straight line, which is plotted to help interpretation. Other distributions, such as log-normal, have characteristic patterns which can be easily recognized.
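To make the construction concrete, the sketch below computes the points of a normal probability plot by pairing each sorted observation with its expected standard-normal quantile, then measures how close those points are to a straight line. The data are simulated. The plotting position `(i + 0.5) / n` is one common convention among several, and the helper names are assumptions for this example.

```python
import math
import random
from statistics import NormalDist, fmean

def probability_plot_points(values):
    """Pair each sorted observation with its expected standard-normal quantile."""
    ordered = sorted(values)
    n = len(ordered)
    std_normal = NormalDist()
    # (i + 0.5) / n is one common choice of plotting position
    quantiles = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return quantiles, ordered

def straightness(xs, ys):
    """Pearson correlation of the plotted points: 1.0 means a perfectly straight line."""
    mx, my = fmean(xs), fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(7)  # fixed seed so the sketch is repeatable
normal_data = [rng.gauss(100, 15) for _ in range(500)]
lognormal_data = [rng.lognormvariate(4.0, 0.8) for _ in range(500)]

r_normal = straightness(*probability_plot_points(normal_data))
r_lognormal = straightness(*probability_plot_points(lognormal_data))
print(f"normal: {r_normal:.3f}, log-normal: {r_lognormal:.3f}")
```

The normally distributed sample should give a value very close to 1, while the log-normal sample should fall noticeably short because its upper tail bends away from the line.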
To learn how to create x/y plots of your factors and data, click here.
To learn how to create plate based heat maps of your data, click here.