One of the main applications of DOE is to help understand which possible influences on a system of interest actually have a significant effect on it. From a modelling point of view, this is a question of deciding which effects to include in the model. The same process also determines the overall shape of the response being fitted: if there is evidence of curvature, the model will include quadratic effects.
The way this is done is to design for a model with many potential terms, collect the data, and then analyse the results statistically to determine which effects change the response by significantly more than the random variation.
There are several ways to determine the significance of a potential effect. It’s important to understand that the p-value reported for a model term depends on all the other terms in the model, so strictly assessing every term would mean comparing every possible combination of terms. For models with many potential terms this is an unfeasibly large number of models to search.
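As a rough illustration of the scale of the problem, the short sketch below (generic Python, not part of the Synthace app) counts the terms in a full quadratic model for k factors and the number of possible subsets of those terms:

```python
from math import comb

def n_quadratic_terms(k: int) -> int:
    # k main effects + k*(k-1)/2 two-factor interactions + k quadratic terms
    return k + comb(k, 2) + k

for k in (3, 4, 5, 6):
    terms = n_quadratic_terms(k)
    print(f"{k} factors: {terms} potential terms, {2**terms:,} possible models")
```

Even six factors give 27 potential terms and over 100 million candidate models, which is why the methods below avoid exhaustive search.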
The Synthace analysis app offers four ways to choose the terms which appear in the model:
Find effects
Stepwise regression
Lasso regression
Manual selection
How to choose effects
Firstly, it’s important to point out that selecting terms automatically is not always the right thing to do: it depends entirely on what your research question is and how much you already know about your system. It is quite normal to include effects which are not significant by these sorts of analyses, either because they are essential to the question being studied or because they are known a priori to be of interest. Typically it is only when building predictive models for optimization (i.e. at the end stages of a campaign) that it is a good idea to be stringent about including only significant terms.
However, these methods can be very useful as a guide to what the data are telling you about the behaviour of your system. To take best advantage of them, it’s useful to understand how they differ.
Find effects
Find effects essentially tries to determine the significance of each possible effect individually, as far as that is possible, starting with main effects and progressing through two-factor interactions to quadratic terms.
This is useful as a first pass, and it is easy to understand and compute. However, it has the drawback that it does not account for any underlying relationships between the different effects. Sometimes two apparently different factors affect the system through the same mechanism, which means there is really only one effect. Assessed individually, both would appear significant; assessed together in the whole model, we would likely see only one as significant, since its effect would be explained away by the other.
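The minimal sketch below (generic Python with made-up data, not the Synthace implementation) shows this situation: two almost identical factors each look significant when tested alone, but not when fitted together.

```python
# Two correlated factors: individually significant, jointly not.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 40
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)   # x2 carries almost the same signal as x1
y = 2.0 * x1 + rng.normal(scale=0.5, size=n)

def p_values(*columns):
    X = sm.add_constant(np.column_stack(columns))
    return sm.OLS(y, X).fit().pvalues[1:]  # drop the intercept p-value

print("x1 alone:  ", p_values(x1))        # highly significant
print("x2 alone:  ", p_values(x2))        # also highly significant
print("x1 and x2: ", p_values(x1, x2))    # at most one remains significant
```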
Other methods for choosing effects try to solve this problem, at the expense of some complexity and other trade-offs.
Stepwise Regression
Stepwise regression is an iterative procedure in which effects are found by building whole models and assessing how well they fit the data. This is done either by starting with an empty model and adding effects (forward) or by starting with the biggest possible model and removing them (backward). Terms are added or removed according to which change would make the biggest difference to the existing model, and the process usually stops when no remaining term would make a significant difference. The quality of a model is assessed by rewarding a good fit to the data and penalising the number of model terms. In Synthace we offer two standard metrics for this, AIC and BIC, which penalise complex models differently, with BIC generally favouring smaller models.
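A minimal sketch of the forward variant is shown below, using generic Python and hypothetical data and term names rather than the Synthace implementation; the candidate terms would be formula strings such as "A", "A:B" or "I(A**2)".

```python
# Forward stepwise selection by AIC: greedily add the term that lowers
# AIC the most, stopping when no remaining term improves the model.
import pandas as pd
import statsmodels.formula.api as smf

def forward_stepwise(data: pd.DataFrame, response: str, candidates: list[str]) -> list[str]:
    candidates = list(candidates)          # work on a copy
    selected: list[str] = []
    current_aic = smf.ols(f"{response} ~ 1", data).fit().aic
    while candidates:
        trials = []
        for term in candidates:
            formula = f"{response} ~ {' + '.join(selected + [term])}"
            trials.append((smf.ols(formula, data).fit().aic, term))
        best_aic, best_term = min(trials)
        if best_aic >= current_aic:        # no remaining term helps
            break
        selected.append(best_term)
        candidates.remove(best_term)
        current_aic = best_aic
    return selected
```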
Stepwise avoids the issue of correlation between effects that the find effects process is subject to. However, regardless of the model fit measure used, it has a tendency to produce models which are heavily overfitted, and this tendency is particularly pronounced for backward regression. Stepwise can be a useful way to discover possible models, but it is recommended that you treat its results with caution.
Lasso Regression
Lasso regression also tries to identify the best whole model rather than assessing terms individually; however, it does this using a different mechanism from stepwise regression.
Instead of adding and removing terms sequentially, Lasso regression applies a process known as shrinkage to the model terms. A penalty shrinks the estimated coefficients towards zero, and any that fall below a threshold are set exactly to zero, which removes those terms from the model. The method varies the penalty from large to small values and identifies the best model using criteria similar to those above.
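The sketch below (generic Python using scikit-learn with made-up data; not the Synthace implementation) shows how increasing the penalty drives more coefficients to exactly zero. In practice the penalty itself is usually chosen by cross-validation.

```python
# Lasso shrinkage: as the penalty alpha grows, more coefficients become exactly zero.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 6))                      # 6 candidate model terms
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=30)
X = StandardScaler().fit_transform(X)             # terms should be on a common scale

for alpha in (0.01, 0.1, 0.5, 1.0):
    coefs = Lasso(alpha=alpha).fit(X, y).coef_
    kept = np.flatnonzero(coefs)                  # indices of terms still in the model
    print(f"alpha={alpha:<4} kept terms: {kept.tolist()}")
```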
This method has been shown to work very well in practice and is a good choice for selecting the terms to include in your model. However, one potential difficulty is that it relies on cross-validation to ensure the model generalises. This process, although standard in many machine learning applications, does not always work well with DOE data, which typically consist of a small number of carefully structured runs.
Manual Selection
Full manual selection of model terms is appropriate when you already have a specific model in mind to fit to the data. Otherwise it’s usual to use a combination of automated and manual effect selection to choose the model to fit. Automated methods are great for finding the biggest effects in the data, but they have no knowledge of the underlying system and how it works. It is therefore up to you to make sure that obviously necessary and important effects and relationships are captured, so that the model includes combinations of terms which make sense biologically.
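As a hypothetical sketch of what this looks like outside the app (the factor and response names below are invented, and this is generic Python rather than the Synthace interface), a manually chosen model can be written down directly as a formula:

```python
# Fit a hand-picked model: all main effects, one interaction of interest,
# and curvature in temperature. Data here are simulated for illustration.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
data = pd.DataFrame({
    "Temp": rng.uniform(30, 37, 24),
    "pH":   rng.uniform(6.8, 7.4, 24),
    "Feed": rng.uniform(1, 5, 24),
})
data["Titre"] = (0.5 * data["Temp"] - 0.02 * data["Temp"] ** 2
                 + 0.3 * data["Feed"] + rng.normal(scale=0.1, size=24))

model = smf.ols("Titre ~ Temp + pH + Feed + Temp:pH + I(Temp**2)", data).fit()
print(model.summary())
```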
To learn how to find and fit a significant effects model, click here.
To learn how to fit a Stepwise regression model, click here.
To learn how to fit a Lasso regression model, click here.
To learn how to manually add or remove effects from your model, click here. (Coming Soon!)