
Validating Your Model


Models aim to simplify the interpretation of data by making assumptions about the underlying structure of how it is generated.

Although these assumptions are never strictly “true”, they can fit the data to a greater or lesser extent, and some disagreements have a much bigger effect than others.

Before trying to interpret the model, it’s therefore a good idea to check for significant inconsistencies between the technical assumptions made by the model-fitting procedure and the actual data, and to judge whether any such inconsistencies have a large impact on the interpretation of the model.

To understand how to do this, it’s important to know the key assumptions built into linear models, as well as some other practical concerns.

There are three key assumptions for linear models when modeling DOE data:

  • The response is a linear function of the terms included in the model

  • The residual errors in the predictions of the model can be modeled using a single normal distribution

  • Every data point is an independent run, with no dependencies between any point and any other

Additionally, there are sometimes practical problems if the data include a small number of points that deviate wildly from the behavior of the rest (outliers). Although this can be seen as a violation of the first and second assumptions, it’s important enough to call out separately.

The final criterion to apply is whether the model is a good explanation of the data without being overfitted.

We will discuss how these different criteria apply one by one.

Linearity of response

As we outlined above, the actual shape of the response can be quite complex and involve nonlinear terms; it doesn’t have to be a simple straight-line relationship.

However, if we plot the predicted values from the model against the actual response values, we should see a linear relationship. The actual-by-predicted plot gives a quick assessment of how well the model fits the data and whether certain regions are systematically mispredicted (Figures 1-3).

Figure 1. Predicted vs actual for a very poorly fitting model. In this case several terms are missing, making the relationship between the model and the data highly non-linear. The model should not be used for any sort of inference.

Figure 2. Predicted-vs-actual for a reasonably well-fitting model. A few points with high responses are not well modeled; this deserves further investigation but is unlikely to significantly compromise the model. Inferences are likely to be possible, although the effect of removing the outlying points to the right should be investigated to see if it changes the interpretation significantly.

Figure 3. Predicted-vs-actual for a good model. While there is some spread around the line the model overall is likely to be good enough to be useful.
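
As a rough illustration of this check, the sketch below plots actual against predicted values in Python, assuming a pandas DataFrame loaded from a hypothetical doe_results.csv with factor columns temp and ph and a response column titre (all of these names are illustrative, not part of the product).

```python
# A minimal actual-by-predicted sketch, assuming hypothetical column names.
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("doe_results.csv")  # hypothetical table of DOE runs

# Fit a simple linear model with main effects and their interaction.
model = smf.ols("titre ~ temp * ph", data=df).fit()

# Actual response against the model's predicted values.
plt.scatter(model.fittedvalues, df["titre"])
lims = [df["titre"].min(), df["titre"].max()]
plt.plot(lims, lims, linestyle="--")  # the y = x reference line
plt.xlabel("Predicted response")
plt.ylabel("Actual response")
plt.title("Actual by predicted")
plt.show()
```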

If the response is obviously not following the line but has a systematically different shape, it may be possible to get a better model fit by applying a transformation to the response values. Alternatively, the model may be missing important effects; using automated model selection can help to avoid this second problem.

Normally distributed noise

A second thing to check is whether the size of the errors is approximately constant across all predicted values. Since the model includes only a single, constant error term, this is the only situation it can properly handle.

You can check this in several ways: the actual-by-predicted plot will give you some idea, but the most useful tools are the normal probability plot and the residual plot (Figures 4 and 5).

Figure 4. Residual and normal plots for a model which satisfies the assumption of constant error variance well.

Figure 5. Residual and normal plots for a model in which error and signal are correlated. The cone shape in the residual plot is characteristic and suggests a log transform may be needed.

This situation is another case in which transforming the response can help; see this section on transformations for more detail.
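
The same hypothetical fit can be used to sketch the residual and normal probability plots described above; again, the model object and column names are illustrative.

```python
# Residual plot and normal probability (Q-Q) plot for the hypothetical fit above.
import matplotlib.pyplot as plt
import scipy.stats as stats

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Residuals against predicted values: a cone shape suggests a transformation,
# e.g. refitting with a log-transformed response.
ax1.scatter(model.fittedvalues, model.resid)
ax1.axhline(0, linestyle="--")
ax1.set_xlabel("Predicted response")
ax1.set_ylabel("Residual")
ax1.set_title("Residual plot")

# Normal probability plot of the residuals.
stats.probplot(model.resid, dist="norm", plot=ax2)
ax2.set_title("Normal probability plot")

plt.tight_layout()
plt.show()
```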

Independence of Runs and Observations

The model-building process works with the data very simply: it treats every observation in the data set as if it were drawn from a population at random. If there are hidden relationships within the data this will lead to the model being a poor representation of the system’s real behavior.

Much of the work to prevent this from being a problem takes place during the implementation of the experiment: by taking care that the different runs are performed independently and do not bias each other, we go a long way toward fulfilling this assumption.

However, it’s useful to check whether any systematic biases have inadvertently been introduced during the experiment. One common way this happens is for measurements taken sequentially to be similar to each other. This is usually assessed using a run plot, which plots run number against response. This assumes that run numbers are consistent with the order in which runs were created and measured, which may not be valid.
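
A run plot can be sketched along the same lines, assuming the DataFrame includes a run_order column recording the order in which runs were executed and measured (an illustrative name, not one guaranteed to exist in your data).

```python
# Run plot: response in execution order, to reveal drift or sequential correlation.
import matplotlib.pyplot as plt

ordered = df.sort_values("run_order")
plt.plot(ordered["run_order"], ordered["titre"], marker="o")
plt.xlabel("Run order")
plt.ylabel("Response")
plt.title("Run plot")
plt.show()
```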

An alternative is to use the plate view of responses, since the most likely source of correlation between runs is adjacency on a plate. For example, there may be contamination from one well to its neighbor, or readings from one well may bleed over into another (such as when a transparent plate is used for fluorescence readings).
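
Outside the built-in plate view, a rough equivalent can be sketched by arranging residuals by plate position, assuming the DataFrame records each run’s well in hypothetical plate_row and plate_col columns. Clusters of similar residuals in neighboring wells can point to contamination or signal bleed-over.

```python
# Heatmap of residuals laid out by plate position (column names are illustrative).
import matplotlib.pyplot as plt

plate = df.assign(resid=model.resid).pivot(
    index="plate_row", columns="plate_col", values="resid"
)
plt.imshow(plate, cmap="coolwarm")
plt.colorbar(label="Residual")
plt.xlabel("Plate column")
plt.ylabel("Plate row")
plt.title("Residuals by plate position")
plt.show()
```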

Detecting Outliers

Models attempt to predict average behavior given the input variables. Individual data points can sometimes deviate quite a lot from the average and still be within the expected range of behavior for the model.

How much deviation is expected depends on the model itself and on where the data point sits within the range of predicted values the model encompasses. In general, more variation is expected at the extreme ends of the range than in the middle.

Outlying data points are those outside this expected range. They are important to identify and potentially correct for since they can in some cases have a significant effect on the fitted model.

Typically, outliers are easy to spot on actual-by-predicted or residual plots, since they will sit a long way from the range occupied by the other points.

It’s usually best to look into any outliers that appear and try to account for them. They indicate that at least some runs are not behaving consistently with the rest of the experiment. This could be down to some experimental artifact, or it could be because the model cannot adequately fit the behavior of these points.
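
One common, if rough, way to flag candidates for this kind of investigation is to look at studentized residuals. The sketch below continues the hypothetical statsmodels fit and flags points whose externally studentized residual exceeds about 3 in magnitude, a conventional rule of thumb rather than a hard cutoff.

```python
# Flag potential outliers using externally studentized residuals.
import numpy as np

influence = model.get_influence()
studentized = influence.resid_studentized_external
flagged = df[np.abs(studentized) > 3]
print(flagged)  # runs worth investigating, not automatic exclusions
```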

To learn more about the statistics behind model diagnostics, click here.

To learn more about what you can do to improve your model fit, click here.
