
Fixing Model Validation Issues: Transformations and Row Sets


As we previously saw, linear models assume a few things about the shape of the data: all the points can be described by a single model, the response is linear in relation to the input parameters, and the noise is the same wherever you go in the design space.

These three issues can, to some extent, be mitigated by performing some basic manipulations of the underlying data. This can make the data better fit the assumptions we went through earlier and make the model more meaningful. However, as always, there are trade-offs, and it's important to understand what these manipulations mean for the model and its interpretation.

Row Selections: Dealing With Problem Points

Linear models can be very sensitive to outliers, depending on where they sit in relation to the other points in the dataset (their so-called leverage). In extreme cases, one or more coefficients may be derived essentially from a single data point. It's therefore often necessary to remove outliers in order to see what a model of the rest of the data would look like.
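To give a rough sense of what leverage means in practice, here is a minimal sketch in Python that computes it directly as the diagonal of the hat matrix. The two-factor design, the extreme point, and the 2p/n rule-of-thumb cutoff are all invented for illustration; they are not part of this article's workflow.

# A minimal sketch of checking leverage with numpy; the design matrix X
# and the extreme point below are illustrative assumptions.
import numpy as np

# Hypothetical two-factor design with one extreme point on factor A.
X = np.column_stack([
    np.ones(9),
    np.array([-1, -1, -1, 0, 0, 0, 1, 1, 5.0]),   # factor A (last run extreme)
    np.array([-1, 0, 1, -1, 0, 1, -1, 0, 1.0]),   # factor B
])

# Leverage is the diagonal of the hat matrix H = X (X'X)^-1 X'.
H = X @ np.linalg.inv(X.T @ X) @ X.T
leverage = np.diag(H)
print(leverage.round(3))

# A common rule of thumb flags points with leverage > 2p/n, where p is
# the number of model parameters and n the number of runs.
p, n = X.shape[1], X.shape[0]
print("high-leverage runs:", np.where(leverage > 2 * p / n)[0])

Running this flags the extreme run on factor A: its coefficient would be determined almost entirely by that one point.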

However, it's important to realize that there are limitations to this process, and it should be approached thoughtfully. Excluding points purely to improve the model fit is not good practice: unless there is a scientific reason to consider them artifacts, those points are likely telling you that the model does not adequately capture all the behaviour of your system. And the stripped-down nature of experimental designs means that once more than a handful of points are excluded, the range of possible models you can build is drastically reduced.

Transformations: Shaping To Fit

Linear models assume that it's possible to make the input data have a linear relationship to the response simply by changing the parameters of the model. However, in many cases that isn't actually possible, for example because the underlying relationship is exponential, as often happens when measuring the growth of organisms.

This situation can be remedied by transforming the shape of the response data (or, sometimes, the factors, if they have more than two levels). One sign that this is needed is a very large range in the response, with several points which look like outliers compared to the rest. Removing those outliers just leaves more outliers, because the problem is that the fundamental shape of the data isn't what the model assumes. The remedy in the case of growth data would usually be to take the logarithm of the population counts, as in the sketch below.
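Here is a minimal sketch of that remedy using numpy with simulated growth data; the growth rate, noise level, and time points are all invented for illustration.

# A minimal sketch of a log transform on simulated growth data;
# the data below is made up, not from any real experiment.
import numpy as np

rng = np.random.default_rng(1)
time = np.linspace(0, 10, 20)
counts = 100 * np.exp(0.5 * time) * rng.lognormal(0, 0.1, size=time.size)

# A straight-line fit to the raw counts is dominated by the largest values,
# whereas after taking logs the relationship really is linear.
slope_raw, _ = np.polyfit(time, counts, 1)
slope_log, _ = np.polyfit(time, np.log(counts), 1)
print(f"slope on the raw scale: {slope_raw:.1f} (dominated by the largest counts)")
print(f"slope on the log scale: {slope_log:.3f} (the true growth rate is 0.5)")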

Transformations Again: Keeping the Noise Down

Transformations are also used for a more technical reason: as we said above, the model assumes the noise doesn't change regardless of how big the signal is, but it's very common to see more noise wherever there is more signal. This isn't a huge problem, but it does mean some of the statistics will be off, which can lead to issues like dropping terms that matter or including ones that don't. It also means some of the model parameters will be wrong, and makes the model less effective at prediction.

Again, the solution is to apply a transformation to the response: choosing the right function can smooth out the variation and make it fit the assumptions more closely. This generally makes the model more accurate, although the side effect can be that it's harder to interpret, as you have to undo the transformation to see what the various coefficients and predicted responses would be in real life.
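To make that last point concrete, the sketch below simulates a response whose noise grows with the signal, fits a model on the log scale, and then undoes the transformation. All the numbers are invented; the thing to notice is that after a log transform the coefficients become multiplicative on the original scale.

# A minimal sketch of back-transforming a log-scale model; the factor
# levels and effect size are invented for illustration.
import numpy as np

rng = np.random.default_rng(2)
x = np.repeat([-1.0, 0.0, 1.0], 10)
# Simulated response whose noise grows with the signal (multiplicative noise).
y = np.exp(2.0 + 0.8 * x) * rng.lognormal(0, 0.15, size=x.size)

slope, intercept = np.polyfit(x, np.log(y), 1)

# On the original scale the model is multiplicative: each unit increase in x
# scales the predicted response by exp(slope) rather than adding a fixed amount.
print(f"one unit of x multiplies the response by about {np.exp(slope):.2f}")
print(f"predicted response at x = 1: {np.exp(intercept + slope * 1.0):.1f}")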

To learn about the statistics behind transformations, click here.

To learn to select and save subsets of your data, click here.

To learn how to apply a predefined transformation to your data, click here.

To learn how to apply predefined column based calculations to your data, click here.

To learn how to apply custom column based calculations to your data, click here.
