The statistics behind model diagnostics

Synthace provides quantitative and qualitative ways of assessing the quality, validity, and predictive ability of a model. These factors are interlinked and can be at odds, making the interpretation of model diagnostics a somewhat subjective process that is influenced by how the model will be used. This article provides an overview of the model diagnostics Synthace presents to users and how they are calculated.

Statistics

Synthace uses two statistical backends for fitting models, one written in Python and one written in R. The Python backend is used by default; the R backend is used when the model contains collinear factors.

Python: Models are specified using the ols function from the statsmodels.formula.api submodule of the statsmodels Python library. This gives a statsmodels.regression.linear_model.OLS object. The model is then fitted using the fit(method="qr") method of this class, which performs linear regression via QR decomposition. The fit method gives a statsmodels.regression.linear_model.RegressionResults object, which will be referred to as python_res in the following text.

R: Models are specified and fitted with the lm function, which gives an object which will be referred to as r_res in the following text.
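As an illustration of the Python path described above, the following sketch fits a model with statsmodels. The data and column names are hypothetical; Synthace constructs the model formula from your chosen factors and responses, which is not shown here:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical example data: one response, two factors
data = pd.DataFrame({
    "response": [1.0, 2.1, 2.9, 4.2, 5.1, 5.9],
    "factor_a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "factor_b": [0.5, 0.4, 0.6, 0.5, 0.7, 0.6],
})

# Specify the model with a formula (statsmodels.formula.api.ols),
# giving an OLS object, then fit it via QR decomposition
model = smf.ols("response ~ factor_a + factor_b", data=data)
python_res = model.fit(method="qr")  # RegressionResults object
```

From python_res, the summary statistics described below (rsquared, rsquared_adj, the square root of scale, and so on) can be read directly.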

The model diagnostics are split across the “Create Models” and “Browse Models” tabs of Analyse Responses.

Create Models - Summary Table

  • R Squared

    • Python: Given by the rsquared property on the python_res object

    • R: Returned by passing r_res to the summary function and accessing the r.squared property of the result

  • Adjusted R Squared

    • Python: Given by the rsquared_adj property on the python_res object

    • R: Returned by passing r_res to the summary function and accessing the adj.r.squared property of the result

  • Residual Standard Error

    • Python: Given by taking the square root of the scale property of the python_res object

    • R: Returned by passing r_res to the summary function and accessing the sigma property of the result

  • p-value (F Statistic)

    • Python: Obtained using the f.cdf function of the scipy.stats submodule of the SciPy Python library, which is the cumulative distribution function of the F distribution. The F value, model degrees of freedom, and residual degrees of freedom needed to calculate the p-value are given by the fvalue, df_model, and df_resid properties of the python_res object respectively. The result is subtracted from 1 to give the p-value of the F statistic

    • R: Obtained using the pf function, with the F value, model degrees of freedom, and residual degrees of freedom obtained by passing r_res to the summary function and accessing the fstatistic property of the result. The lower.tail argument to pf is set to FALSE

  • F Statistic

    • Python: Three values corresponding to the F value, the model degrees of freedom, and the residual degrees of freedom, given by the fvalue, df_model, and df_resid properties of the python_res object respectively

    • R: Returned by passing r_res to the summary function and accessing the fstatistic property of the result

  • Cross-Validation Error (Mean) and Cross-Validation Error (Std Dev) correspond to the mean and standard deviation of the root mean squared prediction error, calculated across 5 cross-validation rounds

    • Python: A custom cross-validation algorithm is used. The input data is divided into 5 randomised groups of equal size. For each cross-validation round, data from four of these groups are used to train the model with one group left out as the test data. The response values are predicted for the left out group, with the absolute difference between the predicted and actual response values being the absolute prediction error. This absolute prediction error is squared and the mean calculated using the mean function of the NumPy Python library. The square root of this value is the root mean squared prediction error. The root mean squared prediction error is calculated 5 times, leaving a different group of data out of the training each time. The mean and standard deviation of these 5 values is what is reported to the user, calculated using the mean and std functions of the NumPy Python library

    • R: cv.lm from the lmvar R package is used to perform the cross-validation for the model. The MSE_sqrt mean and standard deviation values are shown, accessed using the mean and sd properties of the result respectively

Note: The custom cross-validation algorithm which the Python backend uses was written to behave as similarly to the cv.lm function as possible by consulting the open-source code of the lmvar package (GitHub)
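As a sketch of the two computations above, assuming a fitted python_res-style results object: the p-value of the F-test, and a simplified reconstruction of the 5-fold cross-validation (this is not Synthace's exact code; the group-splitting details are an assumption based on the description above):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import f

def f_test_p_value(res):
    # p-value of the overall F-test: 1 minus the F-distribution CDF
    # evaluated at the observed F value
    return 1.0 - f.cdf(res.fvalue, res.df_model, res.df_resid)

def cv_rmse(data, formula, response, n_folds=5, seed=0):
    # Divide the rows into n_folds randomised groups of (roughly) equal size
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(data)), n_folds)
    rmses = []
    for test_idx in folds:
        # Train on the other folds, predict the held-out fold
        train = data.drop(data.index[test_idx])
        test = data.iloc[test_idx]
        fit = smf.ols(formula, data=train).fit(method="qr")
        err = test[response].to_numpy() - fit.predict(test).to_numpy()
        # Root mean squared prediction error for this fold
        rmses.append(np.sqrt(np.mean(err ** 2)))
    # Mean and standard deviation across the cross-validation rounds
    return np.mean(rmses), np.std(rmses)
```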

Create Models - Diagnostics

All three plots are constructed using the fitted values and the residuals given by the model, which are obtained differently depending on which statistical backend is used:

  • Fitted values

    • Python: Given by the fittedvalues property of the python_res object

    • R: Given by the fitted.values property of the r_res object

  • Residuals

    • Python: Given by the resid property of the python_res object

    • R: Given by the residuals property of the r_res object

Using these values, the plots are then constructed the same way regardless of the statistical backend used:

  • Predicted vs Actual: Actual values are calculated by adding the fitted values and residuals together element-wise and plotted against the fitted, predicted values. The line corresponds to the line y = x

  • Residual vs Predicted: The residuals are plotted against the fitted, predicted values. The line corresponds to the line y = 0

  • Normal Plot of Residuals: The x-axis values are the theoretical quantiles, calculated by passing the residual values to the probplot function of the scipy.stats submodule of the SciPy Python library. This function calculates the quantile values using Filliben’s estimate of the uniform order statistic medians:

        m(i) = 1 − 0.5^(1/n)                for i = 1
        m(i) = (i − 0.3175) / (n + 0.365)   for i = 2, …, n − 1
        m(i) = 0.5^(1/n)                    for i = n

    Where n is the number of residuals. These medians are passed to the normal quantile function (the inverse cumulative distribution function), with the values returned being the x-coordinates of the points. The y-coordinates are the ordered residual values. probplot also performs a least-squares fit, returning an intercept and slope, which are used to plot the line drawn on the plot (SciPy Documentation)
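The quantities behind these three plots can be reproduced with the following sketch, using synthetic stand-ins for the fitted values and residuals (the plotting itself is omitted):

```python
import numpy as np
from scipy.stats import probplot

rng = np.random.default_rng(1)
fitted = np.linspace(0.0, 10.0, 50)        # stand-in for the model's fitted values
residuals = rng.normal(0.0, 0.5, size=50)  # stand-in for the model's residuals

# Predicted vs Actual: actual = fitted + residuals (element-wise),
# plotted against the fitted values, with the reference line y = x
actual = fitted + residuals

# Residual vs Predicted: residuals plotted against the fitted values, line y = 0

# Normal plot of residuals: probplot pairs the theoretical quantiles
# (x-coordinates, via Filliben's estimate) with the ordered residuals
# (y-coordinates), and also returns the least-squares line parameters
(theoretical_q, ordered_resid), (slope, intercept, r) = probplot(residuals)
```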

Browse Models

For the purposes of illustration, all tooltips are shown at once

The traffic lights displayed next to each model give an overview of the model fit, with green circles indicating a good fit, amber circles with a dash across them indicating an acceptable fit, and red circles with a cross indicating a potential issue with the model fit. Hovering your cursor over the circle will display a tooltip with more information. The circles correspond to the following:

  • R Squared: This is the R squared value of the model, determined in the same way as it is for the Summary Table on the “Create Models” tab (see above)

    • Red: R squared ≤ 0.5

    • Amber: 0.5 < R squared ≤ 0.85

    • Green: 0.85 < R squared

  • p-value (F Statistic): This is the p-value of the F-test, determined in the same way as it is for the Summary Table on the “Create Models” tab (see above)

    • Red: 0.1 < p-value or p-value is NaN

    • Amber: 0.05 < p-value ≤ 0.1

    • Green: p-value ≤ 0.05

  • Normality of residuals (D’Agostino-Pearson test): This tests that the residuals are normal, one of the assumptions of linear regression models. The residuals of the model, accessed in the same way as the Diagnostics plots in the “Create Models” tab (see above), are tested to determine if they are normally distributed. This is done using the normaltest function of the scipy.stats submodule of the SciPy Python library (SciPy Documentation). The traffic light colour corresponds to the p-value of this test, which has the null hypothesis that the data is from a normal distribution. Ideally we don’t want to reject this null hypothesis, as the residuals should be normally distributed with well fitted linear models, so the larger the p-value the better

    • Red: p-value ≤ 0.05

    • Amber: 0.05 < p-value ≤ 0.15

    • Green: 0.15 < p-value
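The thresholds above can be summarised as small helper functions (the function names and colour strings here are hypothetical, for illustration only):

```python
def r_squared_light(r2):
    # Traffic light for the model's R squared value
    if r2 <= 0.5:
        return "red"
    if r2 <= 0.85:
        return "amber"
    return "green"

def f_test_light(p):
    # Traffic light for the p-value of the F-test; NaN compares
    # False in every branch, so it falls through to red
    if p <= 0.05:
        return "green"
    if p <= 0.1:
        return "amber"
    return "red"

def normality_light(p):
    # Traffic light for the D'Agostino-Pearson normality p-value;
    # here larger p-values (no evidence against normality) are better
    if p <= 0.05:
        return "red"
    if p <= 0.15:
        return "amber"
    return "green"
```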

Software Versions

  • Python version: 3.9.15

  • NumPy version: 1.23.5

  • SciPy version: 1.9.3

  • statsmodels version: 0.13.5

  • R version: 4.0.5

  • lmvar version: 1.5.2

To learn about the assumptions of linear modelling, and how the diagnostics can help you check whether your model violates any assumptions, click here.

To learn more about what you can do to improve your model fit, click here.
