Skip to main content

Understanding data reshaping for processing and analysis

Understand what reshaping data is, how it works and how to use it

Updated over a week ago

Overview

Assay readouts are typically computed from raw data - converting multiple measured values into a single value per run.

Typical examples include pooling replicates, time-course experiments and dose-response curves.

In Synthace, data analysis is performed column-wise, that is, all values in a given column are treated in the same way. Preparing data for analysis means that the required values must appear in separate columns. In many cases the data must be manipulated to achieve this, which we refer to as reshaping.

Take the following example of a very simple 2 factor full-factorial DOE. Each condition has been measured at two timepoints - 0 and 60s.

Enzyme

Substrate

Time

Measurement

5

5

0

23

5

5

60

120

5

10

0

21

5

10

60

109

10

5

0

15

10

5

60

112

10

10

0

31

10

10

60

99

If we want to analyse the difference between the measurements at the initial and final timepoints, we first reshape the table to put the 0s measurements in one column and the 60s measurements in a second column. The resulting table looks like this:

Enzyme

Substrate

Measurement (0)

Measurement (60)

5

5

23

120

5

10

21

109

10

5

15

112

10

10

31

99

We can now express our response as the column-wise operation

Measurement (60) - Measurement (0)

Reshaping is required to allow variables to be used for curve fitting - the variables used for reshaping will be available as independent variables for the fit.

How Reshaping Works

Under the hood, reshaping works in two stages:

First, the data table is split into sub-tables using the selected variable(s). Each unique value of the reshaping variable will create a single table. So for the example above, reshaping on Time, we would have two tables:

Time = 0

Enzyme

Substrate

Measurement

5

5

23

5

10

21

10

5

15

10

10

31

Time = 60

Enzyme

Substrate

Measurement

5

5

120

5

10

109

10

5

112

10

10

99

Next, the sub-tables are merged back into a single table. This step requires finding one or more index variables.

The idea here is that in order to recombine the rows there must be a consistent way to identify which rows in the sub-tables are the same, and therefore to be merged.

In many cases Synthace’s Condition variable is sufficient to do this, however in some cases it is not sufficient. This can happen for example when the design is replicated, as in this case, the condition numbers will be the same across replicates.

The rules for a valid index are

  • In any sub-table every row must have a distinct value of the index

  • The index values found in different sub-tables must overlap so that rows can be merged

We do not require perfect overlap between sub-tables since in some cases we may be missing some observations. We therefore choose the indexing variables based on the amount of overlap (and therefore how much we can match).

In the example, we can uniquely identify the rows simply by using the combination of Enzyme and Substrate since each row has a unique combination of values for these two factors (which must be true since this is a non-replicated full factorial).

The system will then merge the tables to produce the result above. The rules for merging are

  • Identical columns are only represented once

  • Non-identical columns are added along with the value of the reshaping variable they occurred for

Troubleshooting

Why isn’t “Column X” listed as an option to reshape the table?

Synthace filters out columns which are unsuitable for reshaping, for example, because they would lead to a sparse or single-rowed table.

The following columns are always filtered out

  • Columns with only a single value in all rows

  • Columns where nearly every row has a different value

  • The columns Condition, Quasi-Replicate , Quasi-Condition and Control - these are usually used as indices and are not typically meaningful on their own.

For design factors we apply a more complex method to see if they are suitable. This is decided by trying to ensure the remaining factors are still going to be usable for analysis while ensuring the system allows for some values being missing.

The full details are a little complicated but essentially boil down to ensuring the factors which remain after reshaping on the selected reshaping factor map to more than one value of the reshaping factor in the majority of cases. See below (section “How Synthace Decides If A Factor is Suitable For Reshaping”) for an example and more specific details.

Why can’t Synthace find a suitable index, and how can I fix this?

Reshaping sometimes cannot be performed on the desired variable. The reason this happens is that there is no valid index in the dataset.

The solution to this is to upload a new column which creates a valid index according to the rules above - that is, it identifies each row uniquely and takes the same value for any values which should appear in the same row.

To understand how this works, consider this version of our example:

Enzyme

Substrate

Time

Measurement

5

5

0

23

5

5

60

120

5

10

0

21

5

10

60

109

10

5

0

15

10

5

60

112

10

10

0

31

10

10

60

99

5

5

0

19

5

5

60

78

5

10

0

2

5

10

60

132

10

5

0

34

10

5

60

110

10

10

0

27

10

10

60

98

Here, the design is identical to the above, but we have replicated it completely. Now if we try to reshape on Time once again we now have the following two subtables

Time = 0

Enzyme

Substrate

Measurement

5

5

23

5

10

21

10

5

15

10

10

31

5

5

19

5

10

2

10

5

34

10

10

27

Time = 60

Enzyme

Substrate

Measurement

5

5

120

5

10

109

10

5

112

10

10

99

5

5

78

5

10

132

10

5

110

10

10

98

In this case we cannot use the combination of Enzyme and Substrate as an index any more, because now there are two rows in every table with any given pair of values, e.g. in the Time=0 table the values 5,5 map to measurements of 23 and 19 in rows 1 and 5 respectively. This violates the rule that we must end up with exactly one value, so we cannot index. No other choice is any better (as is hopefully obvious), so we must help out by adding another value to serve as an index.

The obvious choice is to distinguish between replicates by adding a Replicate column with the (arbitrary) values 1 and 2 , which you can achieve using the Upload transform (available as a Calculate step). This column should be uploaded as a Response column, because as noted in the section How is the index found? above, only Response columns and the list of Synthace generated columns are considered when searching for an index.

In this example it would look like this:

Enzyme

Substrate

Time

Measurement

Replicate

5

5

0

23

1

5

5

60

120

1

5

10

0

21

1

5

10

60

109

1

10

5

0

15

1

10

5

60

112

1

10

10

0

31

1

10

10

60

99

1

5

5

0

19

2

5

5

60

78

2

5

10

0

2

2

5

10

60

132

2

10

5

0

34

2

10

5

60

110

2

10

10

0

27

2

10

10

60

98

2

Now the sub-tables for Time look like this:

Time = 0

Enzyme

Substrate

Measurement

Replicate

5

5

23

1

5

10

21

1

10

5

15

1

10

10

31

1

5

5

19

2

5

10

2

2

10

5

34

2

10

10

27

2

Time = 60

Enzyme

Substrate

Measurement

Replicate

5

5

120

1

5

10

109

1

10

5

112

1

10

10

99

1

5

5

78

2

5

10

132

2

10

5

110

2

10

10

98

2

Now the reshaping algorithm can use the combination of Enzyme, Substrate and

Replicate as the index, and create the final reshaped table:

Enzyme

Substrate

Measurement (0) - Replicate 1

Measurement (0) -
Replicate 2

Measurement (60) -
Replicate 1

Measurement (60) -
Replicate 2

5

5

23

19

120

78

5

10

21

2

109

132

10

5

15

34

112

110

10

10

31

27

99

98

Reference Technical Details

These sections include some more technical details on how the underlying methods for choosing the index and permitted reshaping columns work. Hopefully they should not need to be referred to very often, but for some tricky indexing problems they may be useful!

How is the index found?

Sometimes, the system needs help (in the form of an additional column) to correctly index the data for reshaping. To make this work you need to know how Synthace determines the index columns to use.

An initial list of columns is collected, consisting of all Response columns and the following list of columns automatically added to datasets by Synthace: Replicate, Condition, Quasi-Replicate, Quasi-Condition, location, labware_name, execution_id, and Control. We also take all design factors (except those selected as reshape columns), combine them into a single column, and add that column to the list.

Response columns are added as sometimes they can contain useful metadata, such as the wavelength used for absorbance readings. The merged design factors are added to allow full factorial designs to be freely reshaped on any column.

From the list we remove any columns selected as reshape columns, any columns which don’t vary, and any columns where the majority of values are unique.

From this filtered list we then generate every combination of the columns as possible index choices. For example, if the filtered list of columns was Condition, Wavelength, and Time, we would consider the following combinations of columns:

  1. Condition

  2. Wavelength

  3. Time

  4. Condition, Wavelength

  5. Condition, Time

  6. Wavelength, Time

  7. Condition, Wavelength, Time

We then determine which of these combinations has:

  1. Unique value combinations within each sub-table

  2. The minimal number of unique value combinations when the table is not split

By looking for the combination with the minimal number of unique value combinations when the table is not split, we are maximising the overlap in the index between sub-tables.

We communicate in the UI which columns are being used as the index, or if we are unable to find a suitable index, we list which columns were considered. This can be useful when troubleshooting.

How Synthace Decides If A Factor is Suitable For Reshaping

When checking a test design factor, we generate sub-tables using every other design factor, and then count the number of unique values of the test design factor in each sub-table. If the majority of these counts are greater than 1, the test design factor is included as a possible reshape column.

It should be mentioned that this is a heuristic; a design Factor may be identified as a suitable reshape column candidate, but when selected, doesn't result in the table changing. This is because this function does not take into account other index factors, it just looks at the other design Factors i.e. the dummy Factor considered as a potential index Factor (see the docstring of get_index_factors for more information on the dummy Factor added).

For example, it's known that this function can identify design Factors as suitable if there are repeated conditions in the DOE design i.e. the same condition in terms of design Factor value combinations, but assigned a different condition number. When we try and reshape by these design Factors, we lack enough information to find appropriate index Factors. We only know this when we try and reshape.

For example, take the following table of 3 design Factors:

Salt

Sugar

Enzyme

5

6

0

10

6

0

5

6

7

10

6

7

5

12

0

10

12

0

5

12

7

Say we are testing to see if Enzyme is suitable, we would split the table based on the values of Salt and Sugar:

Salt

Sugar

Enzyme

5

6

0

5

6

7

Salt

Sugar

Enzyme

10

6

0

10

6

7

Salt

Sugar

Enzyme

5

12

0

5

12

7

Salt

Sugar

Enzyme

10

12

0

The number of values in each sub-table would be [2, 2, 2, 1], so Enzyme would be allowed as a reshaping factor since 3 out of 4 (a majority) of tables have more than a single Enzyme value.

If the original table was:

Salt

Sugar

Enzyme

5

6

0

10

6

7

10

12

0

5

12

7

The sub-tables would be:

Salt

Sugar

Enzyme

5

6

0

Salt

Sugar

Enzyme

10

6

7

Salt

Sugar

Enzyme

5

12

7

Salt

Sugar

Enzyme

10

12

0

Now Enzyme is no longer allowable as a reshaping factor, since each combination of the other two factors maps onto exactly one value of Enzyme.

Did this answer your question?