Synthace provides quantitative and qualitative ways of assessing the quality, validity, and predictive ability of a model. These factors are interlinked and can be at odds, making the interpretation of model diagnostics a somewhat subjective process that will be influenced by how the model will be used. This article provides an overview of the model diagnostics Synthace presents to users and how they are calculated.
Statistics
Synthace uses two statistical backends for fitting models: one written in Python and one written in R. The Python backend is typically used; the R backend is used instead when the model contains collinear factors.
Python: Models are specified using the ols function from the statsmodels.formula.api submodule of the statsmodels Python library. This gives a statsmodels.regression.linear_model.OLS object. The model is then fitted using the fit(method="qr") method of this class, using the QR decomposition approach to performing linear regression. The fit method gives a statsmodels.regression.linear_model.RegressionResults object, which will be referred to as python_res in the following text.
R: Models are specified and fitted with the lm function, which gives an object that will be referred to as r_res in the following text.
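As a rough sketch of the Python path described above (the factor and response names here are made up for illustration), specifying and fitting a model looks like this:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical experimental data: two factors and one response
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "temp": rng.uniform(20, 40, 30),
    "ph": rng.uniform(6, 8, 30),
})
df["response"] = 2.0 * df["temp"] - 1.5 * df["ph"] + rng.normal(0, 1, 30)

# Specify the model with a formula, giving an OLS object...
model = smf.ols("response ~ temp + ph", data=df)

# ...then fit it via the QR decomposition approach
python_res = model.fit(method="qr")  # a RegressionResults object
```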
The model diagnostics are split across the “Create Models” and “Browse Models” tabs of Analyse Responses.
Create Models - Summary Table
R Squared
Python: Given by the rsquared property on the python_res object
R: Returned by passing r_res to the summary function and accessing the r.squared property of the result
Adjusted R Squared
Python: Given by the rsquared_adj property on the python_res object
R: Returned by passing r_res to the summary function and accessing the adj.r.squared property of the result
Residual Standard Error
Python: Given by taking the square root of the scale property of the python_res object
R: Returned by passing r_res to the summary function and accessing the sigma property of the result
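As an illustration (with made-up single-factor data), the three statistics above map onto the python_res object like so:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Made-up data for illustration only
rng = np.random.default_rng(1)
df = pd.DataFrame({"x": rng.uniform(0, 10, 25)})
df["y"] = 3.0 * df["x"] + rng.normal(0, 0.5, 25)

python_res = smf.ols("y ~ x", data=df).fit(method="qr")

r_squared = python_res.rsquared
adj_r_squared = python_res.rsquared_adj
# Residual standard error is the square root of the scale property
residual_standard_error = np.sqrt(python_res.scale)
```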
p-value (F Statistic)
Python: Obtained using the f.cdf function of the scipy.stats submodule of the SciPy Python library, which is the cumulative distribution function of the F distribution. The F value, model degrees of freedom, and residual degrees of freedom necessary for calculating the p-value are given by the fvalue, df_model, and df_resid properties of the python_res object respectively. The result is subtracted from 1 to give the p-value of the F statistic
R: Obtained using the pf function, with the F value, model degrees of freedom, and residual degrees of freedom obtained by passing r_res to the summary function and accessing the fstatistic property of the result. The lower.tail argument to pf is set to FALSE
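The Python calculation can be sketched with illustrative numbers standing in for the fvalue, df_model, and df_resid properties:

```python
from scipy.stats import f

# Illustrative values in place of python_res.fvalue, .df_model, .df_resid
fvalue, df_model, df_resid = 15.2, 2.0, 27.0

# 1 minus the F-distribution CDF gives the p-value of the F statistic;
# this matches R's pf(fvalue, df_model, df_resid, lower.tail = FALSE)
p_value = 1 - f.cdf(fvalue, df_model, df_resid)
```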
F Statistic
Python: Three values corresponding to the F value, model degrees of freedom, and residual degrees of freedom, given by the fvalue, df_model, and df_resid properties of the python_res object respectively
R: Returned by passing r_res to the summary function and accessing the fstatistic property of the result
Cross-Validation Error (Mean) and Cross-Validation Error (Std Dev) correspond to the mean and standard deviation of the root mean squared prediction error, calculated across 5 cross-validation rounds
Python: A custom cross-validation algorithm is used. The input data is divided into 5 randomised groups of equal size. For each cross-validation round, data from four of these groups are used to train the model, with one group left out as the test data. The response values are predicted for the left-out group, with the absolute difference between the predicted and actual response values being the absolute prediction error. This absolute prediction error is squared and the mean calculated using the mean function of the NumPy Python library. The square root of this value is the root mean squared prediction error. The root mean squared prediction error is calculated 5 times, leaving a different group of data out of the training each time. The mean and standard deviation of these 5 values are what is reported to the user, calculated using the mean and std functions of the NumPy Python library
R: cv.lm from the lmvar R package is used to perform the cross-validation for the model. The MSE_sqrt mean and standard deviation values are shown, accessed using the mean and sd properties of the result respectively
Note: The custom cross-validation algorithm which the Python backend uses was written to behave as similarly to the cv.lm function as possible, by consulting the open-source code of the lmvar package (GitHub)
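A minimal sketch of the 5-fold procedure described above, substituting a plain NumPy least-squares fit for the full statsmodels model:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: one factor plus an intercept column
n = 50
x = rng.uniform(0, 10, n)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, n)
X = np.column_stack([np.ones(n), x])

# Divide the input data into 5 randomised groups of equal size
indices = rng.permutation(n)
folds = np.array_split(indices, 5)

rmse_per_round = []
for test_idx in folds:
    train_idx = np.setdiff1d(indices, test_idx)
    # Train on the four remaining groups
    beta, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
    # Predict the left-out group, then square -> mean -> square root
    pred = X[test_idx] @ beta
    rmse_per_round.append(np.sqrt(np.mean((pred - y[test_idx]) ** 2)))

# The mean and standard deviation of the 5 values are what is reported
cv_error_mean = np.mean(rmse_per_round)
cv_error_std = np.std(rmse_per_round)
```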
Create Models - Diagnostics
All three plots are constructed using the fitted values and the residuals given by the model, which are accessed differently depending on which statistical backend is used:
Fitted values
Python: Given by the fittedvalues property of the python_res object
R: Given by the fitted.values property of the r_res object
Residuals
Python: Given by the resid property of the python_res object
R: Given by the residuals property of the r_res object
Using these values, the plots are then constructed the same way regardless of the statistical backend used:
Predicted vs Actual: Actual values are calculated by adding the fitted values and residuals together element-wise and plotted against the fitted, predicted values. The line corresponds to the line y = x
Residual vs Predicted: The residuals are plotted against the fitted, predicted values. The line corresponds to the line y = 0
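A short sketch (with made-up data) of how the inputs to the first two plots are derived from the fitted values and residuals:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Made-up data for illustration only
rng = np.random.default_rng(7)
df = pd.DataFrame({"x": rng.uniform(0, 10, 20)})
df["y"] = 1.5 * df["x"] + rng.normal(0, 0.3, 20)

python_res = smf.ols("y ~ x", data=df).fit(method="qr")

fitted = python_res.fittedvalues  # x-axis of both plots
resid = python_res.resid

# Predicted vs Actual: actual = fitted + residuals, element-wise
actual = fitted + resid  # recovers the observed response values
# Residual vs Predicted: plot resid against fitted; reference line y = 0
```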
Normal Plot of Residuals: The x-axis values are the theoretical quantiles, calculated by passing the residual values to the probplot function of the scipy.stats submodule of the SciPy Python library. This function calculates the quantile values using Filliben's estimate of the uniform order statistic medians:

m(i) = 1 − 0.5^(1/n) for i = 1
m(i) = (i − 0.3175) / (n + 0.365) for i = 2, …, n − 1
m(i) = 0.5^(1/n) for i = n

where n is the number of residuals. These medians are passed to the normal quantile function (the inverse cumulative distribution function), with the values returned being the x-coordinates of the points. The y-coordinates are the ordered residual values. probplot also performs a least-squares fit, returning an intercept and slope, which are used to plot the line drawn on the plot. SciPy Documentation
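The probplot mechanics can be checked directly; the residuals here are simulated for illustration:

```python
import numpy as np
from scipy import stats

# Simulated residuals standing in for the model's residuals
rng = np.random.default_rng(3)
residuals = rng.normal(0, 1, 40)

# probplot returns the plot coordinates plus a least-squares fit line
(theoretical_q, ordered_resid), (slope, intercept, r) = stats.probplot(
    residuals, dist="norm"
)

# Reproduce the x-coordinates: Filliben's estimate of the uniform
# order statistic medians, mapped through the normal quantile function
n = len(residuals)
medians = np.empty(n)
medians[0] = 1 - 0.5 ** (1 / n)
medians[-1] = 0.5 ** (1 / n)
medians[1:-1] = (np.arange(2, n) - 0.3175) / (n + 0.365)
manual_q = stats.norm.ppf(medians)
```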
Browse Models
For the purposes of illustration, all tooltips are shown at once
The traffic lights displayed next to each model give an overview of the model fit, with green circles indicating a good fit, amber circles with a dash across them indicating an acceptable fit, and red circles with a cross indicating a potential issue with the model fit. Hovering your cursor over the circle will display a tooltip with more information. The circles correspond to the following:
R Squared: This is the R squared value of the model, determined in the same way as it is for the Summary Table on the “Create Models” tab (see above)
Red: R squared ≤ 0.5
Amber: 0.5 < R squared ≤ 0.85
Green: 0.85 < R squared
p-value (F Statistic): This is the p-value of the F-test, determined in the same way as it is for the Summary Table on the “Create Models” tab (see above)
Red: 0.1 < p-value or p-value is NaN
Amber: 0.05 < p-value ≤ 0.1
Green: p-value ≤ 0.05
Normality of residuals (D’Agostino-Pearson test): This tests whether the residuals are normally distributed, one of the assumptions of linear regression models. The residuals of the model, accessed in the same way as for the Diagnostics plots in the “Create Models” tab (see above), are tested using the normaltest function of the scipy.stats submodule of the SciPy Python library (SciPy Documentation). The traffic light colour corresponds to the p-value of this test, which has the null hypothesis that the data come from a normal distribution. Ideally we don’t want to reject this null hypothesis, as the residuals of a well-fitted linear model should be normally distributed, so the larger the p-value the better
Red: p-value ≤ 0.05
Amber: 0.05 < p-value ≤ 0.15
Green: 0.15 < p-value
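A sketch of the test and the traffic-light thresholds, using simulated residuals:

```python
import numpy as np
from scipy import stats

# Simulated residuals standing in for the model's residuals
rng = np.random.default_rng(5)
residuals = rng.normal(0, 1, 100)

# D'Agostino-Pearson test; null hypothesis: data come from a normal
# distribution, so a large p-value is what we want to see
statistic, p_value = stats.normaltest(residuals)

# Traffic-light thresholds used for this diagnostic
if p_value <= 0.05:
    light = "red"
elif p_value <= 0.15:
    light = "amber"
else:
    light = "green"
```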
Citations
Python version: 3.9.15
NumPy version: 1.23.5
SciPy version: 1.9.3
statsmodels version: 0.13.5
R version: 4.0.5
lmvar version: 1.5.2
To learn about the assumptions of linear modelling, and how the diagnostics can help you check whether your model violates any assumptions, click here.
To learn more about what you can do to improve your model fit, click here.