
Algorithm - GeoMatch Design

Written by Mahi
Updated over 2 weeks ago

Designing a Geo-Experiment follows these steps:

1. Correlation Matrix:

The first heavy lift is to pivot the cleaned panel into a matrix of shape (T days × N geos) and compute an N×N Pearson correlation table.

Only one matrix is built and cached; all later stages reuse it. High correlation is the quickest proxy for "behaves the same under shared shocks", which translates into lower bias when one set is treated. We tried adding a more complex similarity score such as DTW (dynamic time warping), but it didn't change the rankings in A/B tests.
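As a rough illustration, this step boils down to a pandas pivot plus a single `corr` call; the column names `date`, `geo`, and `y` below are assumptions, not the pipeline's actual schema.

```python
import pandas as pd

def build_correlation_matrix(panel: pd.DataFrame) -> pd.DataFrame:
    """Pivot a long panel (one row per geo-day) into a (T days x N geos)
    matrix, then return the N x N Pearson correlation table.
    Built once, cached, and reused by every later stage."""
    wide = panel.pivot_table(index="date", columns="geo", values="y", aggfunc="sum")
    return wide.corr(method="pearson")
```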

2. Treatment Group Selection:

We build treatment groups for the specified sizes within our range. If no size is specified, the range extends from 1 geo up to 30% of the total unique geos. For each group size G, we generate unique combinations of geos of that size, up to 2,500 unique combinations per size. We only accept groups that stay within our maximum treatment percentage, so every selected combination is viable.

One problem we discovered is that this procedure pulls combinations lexicographically, creating very similar treatment groups of the same size. To address this, we implemented a diversity guardrail requiring new combinations to have ≤75% Jaccard overlap with already queued combinations, preventing thousands of near-duplicates.
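To illustrate the guardrail, here is a minimal sketch; the 2,500-combination cap and the 75% overlap threshold come from the text above, while everything else (function names, the simple enumeration) is assumed.

```python
from itertools import combinations

def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard overlap between two geo sets."""
    return len(a & b) / len(a | b)

def diverse_combinations(geos, group_size, max_groups=2500, max_overlap=0.75):
    """Enumerate combinations lexicographically, but only queue a candidate
    whose Jaccard overlap with every already-queued group is <= the threshold."""
    accepted = []
    for combo in combinations(sorted(geos), group_size):
        candidate = frozenset(combo)
        if all(jaccard(candidate, g) <= max_overlap for g in accepted):
            accepted.append(candidate)
        if len(accepted) >= max_groups:
            break
    return accepted
```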

3. Control Group Selection:

Qualification

Start with the full correlation matrix that was computed once at the design's outset. For each treated geo, read the column of pairwise correlations and keep only rows that: are not in the treatment list, and meet or exceed the minimum-correlation threshold (0.80 by default). If that filter empties out (common with small markets or noisy metrics), we relax the rule and keep the top N most-correlated geos for that treated unit (fallback N = 5). All qualifiers across all treated geos are unioned into a single candidate pool.

Ranking & trimming

For every candidate control geo we look at its correlation with each treated geo and take the mean of those values—this is the geo's similarity score to the treatment set as a whole. We sort the pool by that score and take the first k entries (default 50). Optionally, the user can override the entire procedure and simply label every non-treated geo as control.
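Putting qualification and ranking together, a condensed sketch might look like the following, where `corr` is the cached correlation DataFrame from step 1 and the defaults (0.80, 5, 50) mirror those mentioned above.

```python
import pandas as pd

def select_controls(corr: pd.DataFrame, treated: list,
                    min_corr: float = 0.80, fallback_n: int = 5, k: int = 50) -> list:
    """Qualify candidate controls per treated geo, then rank the pooled
    candidates by mean correlation with the whole treatment set."""
    pool = set()
    for t in treated:
        col = corr[t].drop(labels=treated, errors="ignore")
        qualifiers = col[col >= min_corr]
        if qualifiers.empty:
            # Fallback: keep the top-N most correlated geos for this treated unit.
            qualifiers = col.nlargest(fallback_n)
        pool.update(qualifiers.index)

    # Similarity score = mean correlation with every treated geo; keep the top k.
    scores = corr.loc[list(pool), treated].mean(axis=1)
    return scores.sort_values(ascending=False).head(k).index.tolist()
```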

Edge cases handled

Mandatory include / exclude lists are respected upstream, so controls never collide with forced treatment or forbidden geos. The routine works even when the cleaned panel has numeric geo IDs because both the pivot table and the correlation index are type-consistent. If k or the threshold eliminates every possible control (extremely rare), the algorithm logs a warning and widens the criteria until it finds an admissible set rather than failing hard. The result is a control pool that is (a) strongly correlated with the treatment series, (b) large enough to provide a stable synthetic counterfactual, and (c) guaranteed to be disjoint from the treatment regions.

4. Synthetic control building:

Synthetic-Control Fitting — Under the Hood

Once a treatment / control split has been drafted, we need a counterfactual series that tells us "what the treatment geos would have looked like had we not changed spend." That counterfactual is built with a synthetic-control model: a weighted cocktail of control geos tuned to shadow the treatment series in the pre-period and then projected forward into the test window. The implementation folds three variants under one wrapper and auto-selects the best.

i. Preparing the panel

  • Aggregate the raw panel into two matrices: one for the treatment geos and one for the control geos, each with one column per geo and one row per day.

Every vector is mean-centered by default so that a free intercept can soak up level differences if the optimiser decides it is helpful.

ii. Model menu

Standard Synthetic Control (Classical Abadie et al.)

  • Decision variables: weight vector $w$ and an optional intercept $\beta_0$.

  • Constraints: $\sum_i w_i = 1$ and $w_i \geq -\epsilon$ (a small negative lower bound allows mild extrapolation).

  • Objective: $\min_{w,\beta_0} \lVert y_{\text{pre}} - \beta_0 - X_{\text{pre}} w \rVert_2^2 + \lambda_1 \lVert w \rVert_1 + \lambda_2 \lVert w \rVert_2^2$,

where the λ terms add L1/L2 regularisation to prevent over-fitting and produce sparse weights.
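As a sketch of what this optimisation looks like in CVXPY (variable names, default regularisation strengths, and the ε lower bound are illustrative):

```python
import cvxpy as cp
import numpy as np

def fit_standard_sc(X_pre: np.ndarray, y_pre: np.ndarray,
                    lam_l1: float = 0.01, lam_l2: float = 0.01, eps: float = 0.05):
    """Find weights w and intercept b0 so that b0 + X_pre @ w tracks the
    treated pre-period series y_pre, with L1/L2 regularisation on w."""
    w = cp.Variable(X_pre.shape[1])
    b0 = cp.Variable()
    objective = cp.Minimize(
        cp.sum_squares(y_pre - b0 - X_pre @ w)
        + lam_l1 * cp.norm1(w)
        + lam_l2 * cp.sum_squares(w)
    )
    constraints = [cp.sum(w) == 1, w >= -eps]  # small negative bound allows mild extrapolation
    cp.Problem(objective, constraints).solve(solver=cp.ECOS)
    return w.value, b0.value
```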

Ridge-Augmented Synthetic Control

  • First fit a ridge regression "outcome model" on the control units only, predicting each geo's post-period from its pre-period features.

  • Subtract those fitted values from both treatment and control series; run the standard optimisation on the residuals.

  • Finally add the ridge predictions back in. This bias-correction helps when the simple weight model leaves a small, systematic gap.

Generalised (Factor) Synthetic Control

  • Stack pre- and post-period data from controls, centre it, extract k latent factors via truncated SVD.

  • Estimate each geo's loading on those factors from pre-period data, then project the factors forward to predict the post-period for every unit.

  • The treatment counterfactual is the factor-based prediction for the treated aggregate.
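A compact sketch of the factor variant using scikit-learn's TruncatedSVD; the shapes, the choice of k, and the projection of the treated aggregate are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

def factor_counterfactual(X_pre: np.ndarray, X_post: np.ndarray,
                          y_pre: np.ndarray, k: int = 3) -> np.ndarray:
    """Extract k latent time factors from the centred control panel, estimate
    the treated aggregate's loadings on pre-period data, then project the
    factors forward to predict its post-period counterfactual."""
    X_full = np.vstack([X_pre, X_post])                   # stack pre + post control data
    X_centred = X_full - X_full.mean(axis=0, keepdims=True)

    factors = TruncatedSVD(n_components=k).fit_transform(X_centred)  # (T_pre + T_post, k)
    f_pre, f_post = factors[: len(X_pre)], factors[len(X_pre):]

    loadings, *_ = np.linalg.lstsq(f_pre, y_pre - y_pre.mean(), rcond=None)
    return y_pre.mean() + f_post @ loadings               # projected post-period series
```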

iii. Automatic model selection

  • Split the pre-period into k folds (time-series CV).

  • For each candidate model compute out-of-fold MAPE; the model with the lowest average error wins.

  • Lightweight heuristic: if the panel is "small" (<40 geos or <100 days) we try ridge augmentation; if "large" we include the factor model in the race. Standard SC is always tested.
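A simplified version of the selection loop; `models` is assumed to be a dict of candidate objects exposing scikit-learn-style fit/predict, and the MAPE helper is defined inline.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

def mape(actual: np.ndarray, predicted: np.ndarray) -> float:
    return float(np.mean(np.abs((actual - predicted) / actual)))

def pick_best_model(models: dict, X: np.ndarray, y: np.ndarray, n_splits: int = 5):
    """Score each candidate with out-of-fold MAPE on the pre-period
    (time-series CV) and return the name of the lowest-error model."""
    tscv = TimeSeriesSplit(n_splits=n_splits)
    scores = {}
    for name, model in models.items():
        errors = []
        for train_idx, test_idx in tscv.split(X):
            model.fit(X[train_idx], y[train_idx])
            errors.append(mape(y[test_idx], model.predict(X[test_idx])))
        scores[name] = float(np.mean(errors))
    return min(scores, key=scores.get), scores
```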

iv. Solving the optimisation

  • Attempt to solve with CVXPY + the ECOS solver.

  • If ECOS fails to converge, fall back to OSQP; if that also fails, drop to SciPy's sequential quadratic programming.

  • Once a solution is found, extract the weight vector, intercept, and store diagnostics (objective value, KKT residuals, time to convergence).
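The fallback chain reads roughly as a try/except ladder over solvers; the SciPy step is only indicated schematically here and the helper name is an assumption.

```python
import cvxpy as cp

def solve_with_fallback(problem: cp.Problem):
    """Try ECOS first, then OSQP; report which back-end produced a usable
    solution. A further SciPy-based fallback would follow the same pattern."""
    for solver in (cp.ECOS, cp.OSQP):
        try:
            problem.solve(solver=solver)
        except cp.SolverError:
            continue
        if problem.status in (cp.OPTIMAL, cp.OPTIMAL_INACCURATE):
            return solver, problem.value
    raise RuntimeError("All convex solvers failed; caller falls back to SciPy SQP.")
```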

v. Diagnostics returned to upstream ranking

  • In-sample SMAPE and cross-validated MAPE.

  • Scaled L2 imbalance between actual and synthetic pre-period.

  • Placebo lift when the treatment delta is forced to zero (a bias estimate); this is called Abs Lift in Zero.

  • Model identifier ("standard", "ridge", "gsyn"), chosen λ values, number of active weights, factor count where relevant.

  • Full weight vector and intercept so analysts can eyeball which controls are pulling the synthetic line.

💡 Scaled L2 Imbalance is essentially the percent improvement in prediction achieved by tuning the weights of our synthetic control, compared to a naïve synthetic control in which all weights are equal.
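In code, one way to express that note (assuming the pre-period gap is measured as an L2 norm):

```python
import numpy as np

def scaled_l2_imbalance(y_pre: np.ndarray, X_pre: np.ndarray, w: np.ndarray) -> float:
    """Pre-period L2 gap of the fitted weights divided by the gap of a naive
    equal-weight synthetic control; 1 minus this ratio is the percent
    improvement gained by tuning the weights."""
    fitted_gap = np.linalg.norm(y_pre - X_pre @ w)
    equal_w = np.full(X_pre.shape[1], 1.0 / X_pre.shape[1])
    naive_gap = np.linalg.norm(y_pre - X_pre @ equal_w)
    return fitted_gap / naive_gap
```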

vi. Prediction step used later in power analysis

During sensitivity sweeps, we repeatedly call predict to get the synthetic counterfactual for arbitrary post-period slices. The function returns both the fitted values and the model's latest residuals, ensuring the variance of the treatment-minus-synthetic difference is always available.

vii. Fail-safe behavior

If every solver attempt fails (rare, typically due to singular control matrices), the pipeline logs an error, assigns equal weights to all controls, and flags the combination as "non-optimal." Ranking will usually demote or discard these combinations, but the design process continues rather than crashing.

This layered approach—combining three model variants, cross-validated selection, and a chain of optimisation back-ends—delivers a synthetic control that is both statistically sound and operationally robust across datasets ranging from small panels to extensive DMA datasets.

5. Refinement per group size:

After fitting the synthetic control for every combination (up to 2,500 per group size), we rank them based on holdout MAPE, SMAPE, and bias over the last 20% of the dataset. We select the top K groups for each group size, then simulate various lift effects by multiplying the data by a factor of (1 + Delta), using values from our delta range. Each lift is applied to the last N days, for each period length N in our specified range.

For each time series with a Delta lift applied over the last N days, we use permutation testing to determine whether the applied delta is sufficient to reach statistical significance (p < 0.05) in more than 90% of our simulations.

The smallest delta that achieves this threshold is designated as the Minimum Detectable Effect (MDE) for that period length. The power at that delta represents the percentage of simulations that achieved significance. We report to the end user the MDE and period combination that minimises their required investment.
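A stripped-down sketch of that power loop follows; the sign-flip permutation test, the injected noise, and the default parameters are simplifications rather than the pipeline's exact procedure.

```python
import numpy as np

def permutation_pvalue(diff: np.ndarray, n_perm: int = 1000, rng=None) -> float:
    """Two-sided sign-flip permutation test on daily treatment-minus-synthetic
    differences within the lifted window."""
    rng = rng or np.random.default_rng(0)
    observed = abs(diff.mean())
    flips = rng.choice([-1, 1], size=(n_perm, len(diff)))
    null = np.abs((flips * diff).mean(axis=1))
    return float((null >= observed).mean())

def find_mde(treated_win: np.ndarray, synthetic_win: np.ndarray, deltas,
             n_sims: int = 200, alpha: float = 0.05, power_target: float = 0.90):
    """Smallest delta whose simulated lift reaches significance in at least
    90% of runs, together with the power achieved at that delta."""
    rng = np.random.default_rng(0)
    resid_sd = float(np.std(treated_win - synthetic_win))
    for delta in sorted(deltas):
        hits = 0
        for _ in range(n_sims):
            lifted = treated_win * (1 + delta) + rng.normal(0, resid_sd, len(treated_win))
            hits += permutation_pvalue(lifted - synthetic_win, rng=rng) < alpha
        if hits / n_sims >= power_target:
            return delta, hits / n_sims
    return None, 0.0
```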

6. Ranking and Returning Experiments:

After determining the MDEs for all group-length combinations, the pipeline transitions from "physics" (pure statistics) to "policy" (selecting one design that optimally balances power, bias, and cost). The downstream steps are:

a. Choose the "best" period for each group

  • We examine all tested treatment lengths (e.g., 14, 21, 28 days) and select the period with the smallest MDE whose power at that MDE meets or exceeds the target (90% by default).

  • When two periods have equal power, we prefer the shorter window to reduce both cost and operational risk.

b. Inflate the MDE if the treatment-period bias is larger than the MDE (adaptive MDE). Treatment-period bias often differs from the Abs Lift in Zero: the latter is measured over the last 20% of the data, while treatment-period bias examines the specific final window we selected. If |bias| × safety_margin exceeds the raw MDE, we raise the MDE to |bias| × safety_margin so we are not chasing effects within the noise floor. In practice we rarely select these experiments, because the selection stage prioritises designs with the smallest treatment-period bias.

c. Convert MDE → required budget

  • For a lift test we translate the minimum incremental conversions implied by the MDE into spend using CPIC or iROAS: $\text{Spend} = \text{PeriodRevenue}_{\text{TreatmentGroup}} \times \text{MDE} \times \text{CPIC}$ (or divided by iROAS instead of multiplied by CPIC).

  • For an inverse-lift (spend cut) test, we calculate the spend reduction that would deliver the target MDE.

  • The algorithm also estimates how much money the treatment geos realistically handle in the proposed window and flags designs whose required investment exceeds that "wallet".

d. Calculate robustness tags

    • False-positive rate (FPR) via 1,000 placebo simulations.

    • Treatment-period bias (difference between actual and synthetic inside the candidate window).

    • Designs with FPR = 0 and bias < 1% are granted a Tier-1 tag.

e. Composite ranking (sketched in code below)

    • Primary key: power at (possibly inflated) MDE (higher is better).

    • Secondary keys, in order:

    1. Lower imbalance / SMAPE / L2.

    2. Shorter treatment window.

    3. Budget feasibility (within cap).

      • Tier-1 designs always rank above Tier-2 (bias >1% but <2%), which in turn rank above Tier-3.
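Putting the ranking rules together, the composite ordering can be sketched as a tuple sort key; the dict keys below are illustrative, not the pipeline's actual field names.

```python
def rank_designs(designs: list[dict]) -> list[dict]:
    """Order candidate designs: tier first, then power at the (possibly
    inflated) MDE, then the secondary keys listed above."""
    return sorted(
        designs,
        key=lambda d: (
            d["tier"],               # Tier-1 < Tier-2 < Tier-3
            -d["power_at_mde"],      # higher power ranks first
            d["imbalance"],          # lower imbalance / SMAPE / L2 first
            d["treatment_days"],     # shorter treatment window first
            not d["within_budget"],  # budget-feasible designs first
        ),
    )
```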

7. Multicell Design:

Multicell design follows the exact same procedure described above, except that, before anything else, we split our geos into N distinct groups, one per cell.

To form N cells we first rank all geos by a composite "size" score: 70% based on their total historical Y volume, 30% on their average daily Y. We then assign them to cells in a round-robin sweep (geo 1 → cell 0, geo 2 → cell 1, …) so each cell starts with a similar mix of big and small markets. Finally, thousands of smart swap iterations shuffle pairs of geos between cells, greedily or by targeted imbalance fixes, until the cells are as even as possible on total volume, mean contribution, geo count, and time-series variance. After this optimisation, each cell contains a balanced, independent slice of the country and can run the single-cell design pipeline on its own.
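A simplified sketch of the scoring and round-robin assignment (the swap-based rebalancing is omitted; the 70/30 weights come from the text, while the normalisation and column names are assumptions).

```python
import pandas as pd

def assign_cells(panel: pd.DataFrame, n_cells: int) -> dict:
    """Rank geos by a 70/30 composite of total and average daily volume, then
    deal them into cells round-robin so each cell mixes big and small markets."""
    per_geo = panel.groupby("geo")["y"]
    total, daily = per_geo.sum(), per_geo.mean()
    score = 0.7 * total / total.max() + 0.3 * daily / daily.max()

    cells = {c: [] for c in range(n_cells)}
    for i, geo in enumerate(score.sort_values(ascending=False).index):
        cells[i % n_cells].append(geo)      # geo 1 -> cell 0, geo 2 -> cell 1, ...
    return cells
```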
