
The statistics behind transformations

Updated over a year ago

Transforming a response may be necessary when it violates the assumptions of linear modelling and is therefore unsuitable for it. Alternatively, it may be best practice, or more biologically meaningful, to model the mean or variance across replicates instead of the individual replicates. Synthace provides several options for data transformation.

Statistics

Log

Calculates the natural logarithm (base $e$) of the data. Uses the log function from the NumPy Python library. No pre-processing of the data is performed, so zero values are transformed to minus infinity and negative values to NaN.
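As a minimal sketch of this behaviour (the input values are invented for illustration), calling NumPy's log directly on unprocessed data shows how zero and negative values come out:

```python
import numpy as np

# Illustrative response values, including a zero and a negative value.
values = np.array([10.0, 1.0, 0.0, -1.0])

# NumPy emits RuntimeWarnings for log(0) and log(negative); silence them here
# so that only the resulting values are shown.
with np.errstate(divide="ignore", invalid="ignore"):
    logged = np.log(values)

print(logged)  # log(1) is 0, log(0) is -inf, log(-1) is nan
```

If a response contains zeros or negative values, it is worth checking for them before choosing this transformation.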

Box-Cox

Uses the boxcox function from the scipy.stats submodule of the SciPy Python library. No pre-processing of the data is performed, so every value of the response must be greater than zero for this transformation to be applied. SciPy Documentation
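The following sketch shows a direct call to scipy.stats.boxcox on invented data. It assumes the lambda parameter is left for SciPy to estimate by maximum likelihood, which is what happens when lmbda is omitted; how Synthace chooses lambda is not stated here.

```python
import numpy as np
from scipy import stats

# Illustrative, strictly positive response values.
response = np.array([0.5, 1.2, 2.3, 4.1, 8.7, 15.2])

# With lmbda omitted, boxcox returns the transformed data together with the
# lambda that maximises the log-likelihood.
transformed, fitted_lambda = stats.boxcox(response)

# A zero (or negative) value is rejected with a ValueError.
try:
    stats.boxcox(np.array([0.0, 1.0, 2.0]))
    raised = False
except ValueError:
    raised = True

print(fitted_lambda, raised)
```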

Yeo-Johnson

Uses the yeojohnson function from the scipy.stats submodule of the SciPy Python library. SciPy Documentation
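Unlike Box-Cox, the Yeo-Johnson transformation is defined for zero and negative values. A minimal sketch on invented data, again assuming lambda is estimated by SciPy rather than supplied:

```python
import numpy as np
from scipy import stats

# Illustrative response values spanning negative, zero and positive.
response = np.array([-3.0, -0.5, 0.0, 0.8, 2.5, 9.0])

# With lmbda omitted, yeojohnson returns the transformed data and the
# lambda that maximises the log-likelihood.
transformed, fitted_lambda = stats.yeojohnson(response)

print(transformed)
print(fitted_lambda)
```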

Python Expression

Uses the eval function from the Pandas Python library to interpret code entered by the user to perform the transformation. Any infinity values, either positive or negative, in the resulting column are replaced by NaN values. Pandas Documentation
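A rough sketch of this flow, assuming the expression is evaluated with DataFrame.eval (the text above does not say whether DataFrame.eval or the top-level pandas eval is used). The column names and the expression are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical columns; the second and third rows divide by zero.
df = pd.DataFrame({"signal": [10.0, 5.0, 0.0], "blank": [2.0, 0.0, 0.0]})

# Hypothetical user-entered expression referring to columns by name.
expression = "signal / blank"
result = df.eval(expression)  # 10/2 = 5, 5/0 = inf, 0/0 = NaN

# Replace positive and negative infinity with NaN, as described above.
result = result.replace([np.inf, -np.inf], np.nan)

print(result)
```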

Mean

Calculated as:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

where $n$ is the total number of $x$ values. Uses the mean method of the DataFrame class from the Pandas Python library to calculate the mean of each row using values from the specified columns. NaN values can optionally be excluded when calculating the mean. Any infinity values, either positive or negative, in the resulting column are replaced by NaN values.
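The row-wise mean and the optional NaN exclusion can be sketched as follows. The replicate column names and values are invented, and the skipna argument is assumed to be how the NaN option maps onto pandas:

```python
import numpy as np
import pandas as pd

# Hypothetical replicate columns; row 2 has a NaN, row 3 an infinite value.
df = pd.DataFrame({
    "rep_1": [1.0, 4.0, 2.0],
    "rep_2": [2.0, np.nan, np.inf],
    "rep_3": [3.0, 6.0, 4.0],
})
columns = ["rep_1", "rep_2", "rep_3"]

# axis=1 computes one mean per row across the selected columns.
mean_excluding_nan = df[columns].mean(axis=1, skipna=True)
mean_including_nan = df[columns].mean(axis=1, skipna=False)

# Infinite results are replaced by NaN.
mean_excluding_nan = mean_excluding_nan.replace([np.inf, -np.inf], np.nan)
mean_including_nan = mean_including_nan.replace([np.inf, -np.inf], np.nan)

print(mean_excluding_nan.tolist())  # row 2 averages only 4.0 and 6.0
print(mean_including_nan.tolist())  # row 2 becomes NaN
```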

Variance

Measures the variability of values with respect to their mean. With the pandas default of one delta degree of freedom (ddof=1), the sample variance is calculated as:

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

where $\bar{x}$ is the mean of the $n$ values of $x$. Uses the var method of the DataFrame class from the Pandas Python library to calculate the variance of each row using values from the specified columns. NaN values can optionally be excluded when calculating the variance. Any infinity values, either positive or negative, in the resulting column are replaced by NaN values.
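A short sketch of the row-wise variance on invented replicate data. It assumes the pandas default ddof=1 (sample variance, dividing by n - 1); whether Synthace overrides that default is not stated here.

```python
import numpy as np
import pandas as pd

# Hypothetical replicate columns; the third row contains a NaN.
df = pd.DataFrame({
    "rep_1": [1.0, 2.0, 4.0],
    "rep_2": [2.0, 4.0, np.nan],
    "rep_3": [3.0, 6.0, 6.0],
})
columns = ["rep_1", "rep_2", "rep_3"]

# One variance per row; ddof=1 is the pandas default and is spelled out here.
variance = df[columns].var(axis=1, skipna=True, ddof=1)
variance = variance.replace([np.inf, -np.inf], np.nan)

# The same quantity computed by hand for the second row (values 2, 4, 6).
row = np.array([2.0, 4.0, 6.0])
manual = ((row - row.mean()) ** 2).sum() / (len(row) - 1)

print(variance.tolist())
print(manual)
```

With NaN values excluded, the third row is computed from its two remaining values, so each row's n can differ.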

Standard Deviation

Measures the dispersion of values with respect to their mean. With the pandas default of one delta degree of freedom (ddof=1), the sample standard deviation is calculated as:

$$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$

where $\bar{x}$ is the mean of the $n$ values of $x$. Uses the std method of the DataFrame class from the Pandas Python library to calculate the standard deviation of each row using values from the specified columns. NaN values can optionally be excluded when calculating the standard deviation. Any infinity values, either positive or negative, in the resulting column are replaced by NaN values.
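The sketch below, on invented data, illustrates that the row-wise standard deviation is the square root of the row-wise variance when both use the same ddof (the pandas default of 1 is assumed):

```python
import numpy as np
import pandas as pd

# Hypothetical replicate columns; the last row has no spread at all.
df = pd.DataFrame({
    "rep_1": [1.0, 2.0, 1.0],
    "rep_2": [2.0, 4.0, 1.0],
    "rep_3": [3.0, 6.0, 1.0],
})
columns = ["rep_1", "rep_2", "rep_3"]

# One standard deviation per row across the selected columns.
std = df[columns].std(axis=1, skipna=True)
std = std.replace([np.inf, -np.inf], np.nan)

# Square root of the row-wise variance, for comparison.
sqrt_of_variance = np.sqrt(df[columns].var(axis=1, skipna=True))

print(std.tolist())
print(sqrt_of_variance.tolist())
```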

Fitted Slope

Performs a linear least-squares regression for each row using values from the specified columns as the $y$ values and the manually entered values as the $x$ values. The slope of the fit is given by:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

where $\bar{x}$ and $\bar{y}$ are the means of the $n$ values of $x$ and $y$ respectively. NaN values can optionally be excluded when performing the regression. Uses the linregress function from the scipy.stats submodule of the SciPy Python library to perform the regression, with the slope property of the fitted model returned. Any infinity values, either positive or negative, in the resulting column are replaced by NaN values.
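One way to picture the per-row regression is sketched below. The helper row_slope, the column names and the x values are all hypothetical, and this sketch always excludes NaN values, whereas the option above makes that a choice:

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical manually entered x values, one per selected column.
x_values = np.array([0.0, 1.0, 2.0, 3.0])

# Hypothetical measurement columns; the second row is missing one value.
df = pd.DataFrame({
    "t0": [1.0, 0.5],
    "t1": [3.0, np.nan],
    "t2": [5.0, 1.5],
    "t3": [7.0, 2.0],
})
columns = ["t0", "t1", "t2", "t3"]

def row_slope(row, x):
    """Least-squares slope of one row's y values against x, ignoring NaN."""
    y = row.to_numpy(dtype=float)
    keep = ~np.isnan(y)
    if keep.sum() < 2:
        return np.nan  # a line cannot be fitted to fewer than two points
    return stats.linregress(x[keep], y[keep]).slope

slopes = df[columns].apply(lambda row: row_slope(row, x_values), axis=1)
slopes = slopes.replace([np.inf, -np.inf], np.nan)

print(slopes.tolist())  # first row follows y = 1 + 2x, second y = 0.5 + 0.5x
```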

Citations

  • Python version: 3.9.15

  • NumPy version: 1.23.5

  • SciPy version: 1.9.3

  • Pandas version: 1.4.4

  • G.E.P. Box and D.R. Cox, “An Analysis of Transformations”, Journal of the Royal Statistical Society B, 26, 211-252 (1964)

  • I. Yeo and R.A. Johnson, “A New Family of Power Transformations to Improve Normality or Symmetry”, Biometrika 87.4 (2000)

To learn to select and save subsets of your data, click here.

To learn how to apply a predefined transformation to your data, click here.

To learn how to apply predefined column based calculations to your data, click here.

To learn how to apply custom column based calculations to your data, click here.
