Linear statistical models are a flexible and powerful way to interpret experimental data. Fitted using statistical techniques, these models provide a quantitative understanding of cause and effect in an experimental system. The resulting models can then be used in several ways, for example (a short sketch follows the list below):
As a representation of what is currently known about the system
As a way to explore how the system will behave in different situations
As a way to find factor settings which will make the system behave in a particularly desirable way
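To make these three uses concrete, here is a minimal sketch in Python with NumPy, assuming a single-factor experiment; the factor settings and responses are invented for illustration, since the original names no software or data.

```python
import numpy as np

# Invented experimental data: factor setting x and measured response y
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Fit the linear model y = b0 + b1 * x by least squares
X = np.column_stack([np.ones_like(x), x])
b0, b1 = np.linalg.lstsq(X, y, rcond=None)[0]

# Use 1: the coefficients summarize what is known about the system
print(f"model: y = {b0:.2f} + {b1:.2f} * x")

# Use 2: explore how the system would behave at an untried setting
print("predicted response at x = 6:", b0 + b1 * 6.0)

# Use 3: find the setting expected to give a desired response, say y = 5
print("setting needed for y = 5:", (5.0 - b0) / b1)
```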
Creating a model is a scientific process in which three key steps are repeated until you find a satisfactory model (the loop is sketched in code after the list):
Choosing the input data
Choosing the shape of the model
Assessing the model
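As an illustration of this loop, the sketch below treats polynomial degree as the "shape" to be chosen and cross-validated prediction error as the assessment; both choices, and the simulated data, are assumptions made for the example rather than anything prescribed above.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)                                # step 1: the input data
y = 0.5 + 2.0 * x - 1.5 * x**2 + rng.normal(0, 0.1, x.size)

def cv_error(x, y, degree, folds=5):
    """Mean squared prediction error of a polynomial of the given degree,
    estimated by leaving out each fold in turn."""
    idx = np.arange(x.size)
    errors = []
    for k in range(folds):
        test = idx % folds == k
        coeffs = np.polyfit(x[~test], y[~test], degree)  # step 2: fit a shape
        pred = np.polyval(coeffs, x[test])
        errors.append(np.mean((y[test] - pred) ** 2))    # step 3: assess it
    return np.mean(errors)

# Repeat the fit/assess steps over candidate shapes until satisfied;
# here "satisfactory" simply means the lowest cross-validation error.
scores = {d: cv_error(x, y, d) for d in range(1, 7)}
best = min(scores, key=scores.get)
print("chosen model shape: polynomial of degree", best)
```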
Ultimately, what counts as satisfactory depends at least in part on the model's purpose, in other words what the model will be used for, but a few criteria apply generally:
The model fits the data well
The model is as simple as possible
The underlying modelling assumptions are reasonably well-satisfied
Building a good model is mostly about the trade-off between the first two criteria: it is nearly always possible to improve a model's fit by making the model more complicated, so most of the work lies in judging how much complexity the model can support without going too far.
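One common way to quantify that trade-off is a penalized criterion such as AIC, which rewards a small residual error but charges for every extra parameter. The sketch below computes AIC by hand for models of increasing complexity; the formula assumes Gaussian errors, and the data are simulated, since the text itself names no particular criterion.

```python
import numpy as np

def aic(y, fitted, n_params):
    """AIC for a least-squares fit with Gaussian errors:
    n * log(RSS / n) + 2 * k, where lower is better."""
    n = y.size
    rss = np.sum((y - fitted) ** 2)
    return n * np.log(rss / n) + 2 * n_params

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 25)
y = 1.0 + 3.0 * x - 2.0 * x**2 + rng.normal(0, 0.1, x.size)

for degree in (1, 2, 6):
    coeffs = np.polyfit(x, y, degree)
    fitted = np.polyval(coeffs, x)
    # Higher-degree models always shrink the residuals, but the
    # 2 * k penalty stops complexity from winning by default.
    print(f"degree {degree}: AIC = {aic(y, fitted, degree + 1):.1f}")
```

The degree-6 model fits the data more closely than the degree-2 model, yet its AIC is typically worse, because the penalty outweighs the small improvement in fit.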
Avoiding excess complexity is critically important because the ideal model works not only as a description of the specific data it was built on, but also as a description of how the process that generated those data behaves in general. In other words, you want to avoid biasing the model towards the particular data set used to build it, since what is of interest is not that data set but the underlying process that generated it.
The goal of model-building is therefore to make the model generalize as well as possible to other situations of interest. There are two ways this can fail:
The model can be too simple (underfitting)
The model can be too complex (overfitting)
The practice of modelling is largely a matter of striking the best possible balance between these two extremes and finding a workable middle ground (Figure 1).
Figure 1. Illustration of the balance between underfitting and overfitting. The data points (black) are fitted with three models (orange lines) of increasing complexity from left to right. The leftmost model, with two terms, misses the curvature in the data, while the rightmost model, with ten terms, is likely mistaking noise for signal. The middle model, with five terms, looks more reasonable: it captures the curvature while leaving the points scattered evenly on either side of the line.
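The same point can be made numerically. The sketch below assumes the three models are polynomials with 2, 5 and 10 terms (degrees 1, 4 and 9) fitted to simulated curved data, since the figure's own data are not given; held-out points stand in for how well each model would generalize.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 1, 40)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, x.size)

# Hold out every fourth point to measure generalization
test = np.arange(x.size) % 4 == 0
x_tr, y_tr = x[~test], y[~test]
x_te, y_te = x[test], y[test]

for n_terms in (2, 5, 10):
    coeffs = np.polyfit(x_tr, y_tr, n_terms - 1)
    rmse_tr = np.sqrt(np.mean((y_tr - np.polyval(coeffs, x_tr)) ** 2))
    rmse_te = np.sqrt(np.mean((y_te - np.polyval(coeffs, x_te)) ** 2))
    print(f"{n_terms:2d} terms: train RMSE {rmse_tr:.3f}, test RMSE {rmse_te:.3f}")
```

Typically the two-term model does poorly on both sets (underfitting), the ten-term model does well on the training points but worse on the held-out ones (overfitting), and the five-term model strikes the balance the figure illustrates.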