Technical Documentation: Generic Curve Fitting

This article outlines the technical details behind the generic curve fitting functionality in Synthace. For information on using generic curve fits in your data preparation process, see the article here.

Like other transforms in the platform, the generic curve fitting functionality operates on data that have previously been reshaped (or pivoted, in some people's terminology) on some independent variable (e.g. time, substrate concentration) so that measurements are arranged in columns, one per value of the independent variable.
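As an illustration of the reshaped layout, the following minimal sketch pivots long-format records into one row per sample with one column per value of the independent variable. The record layout, the sample names and the pivot helper are hypothetical and are not part of the platform.

```python
from collections import defaultdict

# Hypothetical long-format records: (sample, time in hours, measurement).
long_records = [
    ("A1", 0, 0.05), ("A1", 1, 0.11), ("A1", 2, 0.24),
    ("B1", 0, 0.04), ("B1", 1, 0.09), ("B1", 2, 0.19),
]

def pivot(records):
    """Return the sorted independent values and one list of measurements per row."""
    x_values = sorted({x for _, x, _ in records})
    by_row = defaultdict(dict)
    for key, x, y in records:
        by_row[key][x] = y
    # Missing measurements become None so every row has one slot per x value.
    wide = {key: [cells.get(x) for x in x_values] for key, cells in by_row.items()}
    return x_values, wide

x_values, wide = pivot(long_records)
for key, row in wide.items():
    print(key, row)
```

Each row of the resulting table is what a curve would then be fitted to, independently of the other rows.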

Generic curve fitting supports fitting functions of up to 5 free parameters for each row of data independently. Parameters are completely independent between rows. Functions can be user-defined to include all standard mathematical operations including arithmetic functions, logarithms, exponentials and trigonometric functions.

For each row, the user-defined function is fitted in two stages. The first stage attempts to find an initial estimate of the parameters using a genetic algorithm, while the second refines the initial estimate using nonlinear regression.

Initial Parameter Estimation

Initial parameter estimation uses the differential evolution metaheuristic as implemented by SciPy (version 1.9). Briefly, this is a genetic algorithm method that attempts to do global optimization by simulating an evolutionary process over a population of potential solutions to the problem of interest, subject to any constraints.

An initial population is set up by generating a Latin hypercube of values within specified bounds. The quality of each population member as a parameter set for the curve fit (known as its fitness) is estimated by evaluating the user-defined function with those parameters at the independent values in the dataset and taking the sum of squared deviations from the corresponding dependent values.
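The fitness calculation can be sketched as follows. The exponential decay model and the data values are hypothetical stand-ins for a user-defined function and a row of measurements; only the sum-of-squares idea comes from the description above.

```python
import math

def model(x, a, b):
    # Hypothetical user-defined function: exponential decay.
    return a * math.exp(-b * x)

def sum_of_squares(params, x_values, y_values):
    """Sum of squared deviations between the model and the measurements."""
    return sum((model(x, *params) - y) ** 2 for x, y in zip(x_values, y_values))

x_values = [0.0, 1.0, 2.0, 3.0]
y_values = [model(x, 2.0, 0.5) for x in x_values]  # synthetic, noise-free data

exact = sum_of_squares((2.0, 0.5), x_values, y_values)  # the generating parameters
worse = sum_of_squares((1.0, 0.5), x_values, y_values)  # a poorer candidate
print(exact, worse)
```

A lower value means a fitter population member, so the candidate that generated the data scores better than the poorer one.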

Once all population members in the current generation are evaluated, a new population is generated by preferentially selecting fitter members of the current generation to reproduce. Each daughter solution inherits a random combination of parameters based on its parents' values. Newly generated solutions are mutated to introduce further random variation.

This process is repeated until either a convergence criterion is satisfied or a pre-set number of generations is reached. The best solution discovered in the whole search is then reported. If no valid solution was discovered, an error is reported.
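To make the loop concrete, here is a simplified pure-Python sketch of one common differential evolution scheme: the current best member is perturbed by a scaled difference of two other members, mixed with an existing member by binomial crossover, and kept only if it is at least as fit. It is not SciPy's implementation. It assumes uniform random initialization instead of a Latin hypercube, clips out-of-bounds values, runs a fixed number of generations without a convergence test, and uses a hypothetical sphere function as the fitness.

```python
import random

def sphere(params):
    # Hypothetical stand-in fitness: minimum of 0 at the origin.
    return sum(p * p for p in params)

def evolve(fitness, bounds, popsize=20, generations=50, f=0.8, cr=0.7, seed=0):
    rng = random.Random(seed)
    dim = len(bounds)
    population = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(popsize)]
    scores = [fitness(member) for member in population]
    history = [min(scores)]  # best fitness after each generation
    for _ in range(generations):
        best = population[scores.index(min(scores))]
        for i in range(popsize):
            # Pick two distinct members other than the current one.
            r1, r2 = rng.sample([j for j in range(popsize) if j != i], 2)
            forced = rng.randrange(dim)  # at least one parameter is always mutated
            trial = []
            for d, (lo, hi) in enumerate(bounds):
                if d == forced or rng.random() < cr:
                    value = best[d] + f * (population[r1][d] - population[r2][d])
                    value = min(max(value, lo), hi)  # keep within bounds
                else:
                    value = population[i][d]
                trial.append(value)
            trial_score = fitness(trial)
            # Greedy selection: the trial replaces the member only if no worse.
            if trial_score <= scores[i]:
                population[i], scores[i] = trial, trial_score
        history.append(min(scores))
    return population[scores.index(min(scores))], history

best, history = evolve(sphere, [(-5.0, 5.0)] * 3)
print(best, history[-1])
```

Because a member is only ever replaced by a trial that is at least as fit, the best fitness recorded in history never gets worse from one generation to the next.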

The exact call is scipy.optimize.differential_evolution. The specific parameters used at this step are as follows:

seed=0, maxiter=1000, popsize=15, mutation=(0.5, 1), recombination=0.7, polish=False, strategy="best1bin", tol=0.01, init="latinhypercube"

Default initial bounds are between -1,000,000 and 1,000,000 at this step.

Where the user has specified bounds, these override the defaults. If all bounds are set equal, no fitting is performed.
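A call with the parameters listed above might look like the following sketch. The exponential decay model, the synthetic data and the user-specified bounds are hypothetical, and the way the platform wraps the objective function is an assumption; only the keyword arguments are taken from this article.

```python
import numpy as np
from scipy.optimize import differential_evolution

def model(x, a, b):
    # Hypothetical user-defined function: exponential decay.
    return a * np.exp(-b * x)

def sum_of_squares(params, x, y):
    # Fitness: sum of squared deviations between model and data.
    return float(np.sum((model(x, *params) - y) ** 2))

x = np.linspace(0.0, 4.0, 20)
y = model(x, 2.5, 1.3)  # synthetic, noise-free row of data

# Hypothetical user-specified bounds, one (lower, upper) pair per parameter.
# Without them, the defaults described above would be (-1e6, 1e6) for each.
bounds = [(0.0, 10.0), (0.0, 5.0)]

result = differential_evolution(
    sum_of_squares,
    bounds,
    args=(x, y),
    seed=0,
    maxiter=1000,
    popsize=15,
    mutation=(0.5, 1),
    recombination=0.7,
    polish=False,
    strategy="best1bin",
    tol=0.01,
    init="latinhypercube",
)
initial_estimate = result.x  # passed on to the refinement stage
print(initial_estimate, result.fun)
```

With polish=False, the result is only a coarse estimate, which is why a separate refinement stage follows.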

Parameter Refinement

This uses the trust region reflective algorithm for nonlinear least squares fitting with the parameters found in the first step as the starting point for optimization.

Optimization runs until one of several convergence criteria or the pre-set limit of 10,000 function evaluations is reached.

This step uses scipy.optimize.curve_fit. The specific parameters used at this step are as follows:

maxfev=10000, sigma=None, absolute_sigma=False, method="trf", jac=None

Default initial bounds are $-\infty$ and $\infty$. Where the user has specified bounds, these override the defaults. If all bounds are set equal, no fitting is performed.

Given the breadth of the default bounds, users are strongly advised to provide their own bounds wherever possible.
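The refinement step might be sketched as follows, using the parameters listed above. The exponential decay model, the synthetic data and the starting values are hypothetical; in the two-stage process, the starting point would be the estimate produced by the first stage.

```python
import numpy as np
from scipy.optimize import curve_fit

def model(x, a, b):
    # Hypothetical user-defined function: exponential decay.
    return a * np.exp(-b * x)

x = np.linspace(0.0, 4.0, 20)
y = model(x, 2.5, 1.3)  # synthetic, noise-free row of data

# Stand-in for the estimate produced by the first stage.
initial_estimate = [2.0, 1.0]

popt, pcov = curve_fit(
    model,
    x,
    y,
    p0=initial_estimate,
    bounds=(-np.inf, np.inf),  # the defaults; user-specified bounds would go here
    maxfev=10000,
    sigma=None,
    absolute_sigma=False,
    method="trf",
    jac=None,
)
print(popt)
```

Here popt holds the refined parameter values and pcov their estimated covariance matrix.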
