Modelling is typically an iterative process. At each stage you go round a loop: defining the exact data to model; deciding which terms to include in the model (manually, or using automated effect-selection methods such as LASSO or stepwise regression); and assessing the model using tools such as residual plots, normal quantile plots, actual-by-predicted plots, and the model statistics, as well as by interpreting its predictions and coefficients before actually using it.
Don’t worry if these terms are unfamiliar to you; it’s good to hear the names now even if you don’t yet know what they mean. We’ll explain them all in detail when we discuss model validation later in this article.
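If you’d like to see what one pass around this loop looks like in code, here is a minimal Python sketch using statsmodels. Synthace performs these steps for you, and the data, factor names (temp, ph) and response (yield_) are invented purely for illustration:

```python
# A minimal sketch of one pass around the modelling loop.
# The data and column names here are made up for illustration.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# 1. Define the exact data to model (simulated here).
df = pd.DataFrame({"temp": rng.uniform(20, 40, 30),
                   "ph": rng.uniform(6, 8, 30)})
df["yield_"] = 2.0 * df["temp"] - 5.0 * df["ph"] + rng.normal(0, 3, 30)

# 2. Decide which terms to include (here, main effects only).
model = smf.ols("yield_ ~ temp + ph", data=df).fit()

# 3. Assess the model: statistics, residuals, actual-by-predicted values.
print(model.summary())            # model statistics and coefficients
residuals = model.resid           # for a residual plot
predicted = model.fittedvalues    # for an actual-by-predicted plot
```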
Each time you go around the loop you’re trying to decide whether this is the right model for your purpose (Figure 1). The key considerations are how well the model fits the data, how well the model assumptions are met, and whether the model is the simplest you can reasonably use given your goals.
Figure 1: The modelling loop. Typically we iterate multiple times on the model, changing the underlying data or its structure, before using it to define further experiments or as the endpoint of the investigation.
Walking the Tightrope
The key goal in modelling is to create a model which looks past the specific data you have gathered and generalises about the underlying process that generated them. As we previously explained, this is somewhat like walking a tightrope: it requires finding the balance between underfitting the data (making the model too simple) and overfitting (making the model too specific to the data it was built on) (Figure 2).
When adding terms or applying transformations, you’re looking to see whether the improvement in fit justifies the extra complexity. When removing terms or data points, you’re looking to see whether the loss of information harms the model too much.
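One common way to put a number on that trade-off is an information criterion such as AIC, which rewards goodness of fit but penalises extra terms. As a hedged sketch, continuing the invented DataFrame from the example above (the interaction and quadratic terms are chosen arbitrarily for illustration):

```python
# Compare a simpler and a more complex model on the same data; a lower AIC
# suggests the extra terms earn their keep. Continues the invented
# DataFrame 'df' and the 'smf' import from the sketch above.
simple = smf.ols("yield_ ~ temp + ph", data=df).fit()
complex_ = smf.ols("yield_ ~ temp * ph + I(temp**2)", data=df).fit()

print(f"simple  AIC: {simple.aic:.1f}")
print(f"complex AIC: {complex_.aic:.1f}")
# If the complex model's AIC is not clearly lower, prefer the simpler model.
```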
Although there are plenty of statistics you could use, there’s no single right way to make this call; the goal is really to find something useful rather than something correct. You may not realise that something is missing from the model until you try to use it to decide what to do next. Ultimately, if all the models you build lead to the same conclusion, you can be a little more confident in that conclusion; where they disagree strongly, you may have to do further experiments to really understand what’s going on.
Figure 2: Illustration of the balance between under- and overfitting. The data points (black) are fitted with three models (orange lines) of increasing complexity from left to right. The leftmost model, with two terms, misses that the data are curved, while the rightmost model, with ten terms, is likely mistaking noise for signal. The middle model, with five terms, looks more reasonable: it captures the curvature while leaving points scattered evenly on either side of the line.
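You can reproduce the pattern behind Figure 2 yourself by fitting polynomials of increasing degree to the same noisy data. This sketch is illustrative only; the data are simulated, and the residual sum of squares stands in for the visual impression the figure gives:

```python
# Illustrative sketch of under- vs overfitting: fit polynomials with two,
# five and ten terms (degrees 1, 4 and 9) to the same noisy curved data.
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 15)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, x.size)  # curved data + noise

for degree in (1, 4, 9):                     # 2, 5 and 10 model terms
    coeffs = np.polyfit(x, y, degree)
    fitted = np.polyval(coeffs, x)
    rss = np.sum((y - fitted) ** 2)          # residual sum of squares
    print(f"degree {degree}: RSS = {rss:.3f}")
# RSS always falls as terms are added; the ten-term fit chases the noise,
# which is why fit statistics alone can't choose the model for you.
```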
To learn how to plot x/y plots or plate-based heatmaps of your data before you start modelling, click here.
To learn how to start identifying significant effects to include in your models and how to fit those models in Synthace, click here.