Mathematical Formalism of Synthetic Controls in GeoSC

This document provides a rigorous econometric and statistical foundation for the methods implemented in GeoSC, intended for Data Scientists and Statisticians.

1. The Potential Outcomes Framework

Let $Y_{it}$ denote the outcome of interest for region $i \in \{1, \dots, N\}$ at time period $t \in \{1, \dots, T\}$. We observe a pre-treatment period $t \in \{1, \dots, T_0\}$ and a post-treatment period $t \in \{T_0+1, \dots, T\}$.

Without loss of generality, let unit $i=1$ be the treated unit, and units $i \in \{2, \dots, N\}$ be the donor pool (control units).

Following the Rubin Causal Model, we define potential outcomes:

$Y_{it}^N$: The outcome that would be observed for unit $i$ at time $t$ absent the intervention.
$Y_{it}^I$: The outcome that would be observed for unit $i$ at time $t$ exposed to the intervention.

The observed outcome is:

$$ Y_{it} = Y_{it}^N + \alpha_{it} D_{it} $$

Where $D_{it}$ is an indicator variable equal to 1 if unit $i$ receives treatment at time $t$, and 0 otherwise. $\alpha_{it} = Y_{it}^I - Y_{it}^N$ is the treatment effect for unit $i$ at time $t$.

Our goal is to estimate the Average Treatment Effect on the Treated (ATT) during the post-treatment period:

$$ \tau = \frac{1}{T - T_0} \sum_{t=T_0+1}^{T} \alpha_{1t} = \frac{1}{T - T_0} \sum_{t=T_0+1}^{T} (Y_{1t}^I - Y_{1t}^N) $$

Since $Y_{1t}^I$ is observed post-intervention ($Y_{1t}$), the fundamental problem of causal inference is estimating the unobserved counterfactual $Y_{1t}^N$.
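The averaging in the ATT definition can be made concrete with a small numerical sketch. The function name `att` and the toy numbers below are illustrative assumptions, not part of GeoSC; the counterfactual series stands in for whatever estimate of $Y_{1t}^N$ is available.

```python
def att(observed_post, counterfactual_post):
    """Average post-treatment gap between observed and counterfactual outcomes."""
    if not observed_post or len(observed_post) != len(counterfactual_post):
        raise ValueError("series must be non-empty and of equal length")
    # alpha_1t = Y_1t - Y_1t^N for each post-treatment period
    gaps = [y - y_n for y, y_n in zip(observed_post, counterfactual_post)]
    return sum(gaps) / len(gaps)


# Toy post-treatment data (hypothetical values, T - T_0 = 3 periods)
observed_post = [12.0, 13.0, 15.0]
counterfactual_post = [10.0, 11.0, 12.0]
tau_hat = att(observed_post, counterfactual_post)
```

Here the per-period gaps are 2, 2, and 3, so `tau_hat` is their mean, 7/3.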

2. The Standard Synthetic Control Estimator

The Synthetic Control Method (Abadie, Diamond, and Hainmueller, 2010) estimates the counterfactual $Y_{1t}^N$ as a weighted combination of the donor pool:

$$ \hat{Y}_{1t}^N = \sum_{j=2}^{N} w_j Y_{jt} $$

Where $\mathbf{W} = (w_2, \dots, w_N)'$ is a vector of weights satisfying:

$w_j \geq 0 \quad \forall j$ (Non-negativity)
$\sum_{j=2}^{N} w_j = 1$ (Adding-up; together with non-negativity, this restricts $\mathbf{W}$ to the unit simplex)

The weights are chosen to minimize the pre-treatment discrepancy between the treated unit and the synthetic control, typically measured by the $V$-weighted norm:

$$ || \mathbf{X}_1 - \mathbf{X}_0 \mathbf{W} ||_V = \sqrt{(\mathbf{X}_1 - \mathbf{X}_0 \mathbf{W})' V (\mathbf{X}_1 - \mathbf{X}_0 \mathbf{W})} $$

Where $\mathbf{X}_1$ is a $(K \times 1)$ vector of pre-intervention characteristics for the treated unit, $\mathbf{X}_0$ is a $(K \times J)$ matrix of the same variables for the $J = N - 1$ donor units, and $V$ is a $(K \times K)$ positive semi-definite weighting matrix.
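The constrained minimization can be illustrated with a minimal sketch. It assumes a diagonal $V$ and uses projected gradient descent on the squared norm (which has the same minimizer as the norm itself). This solver is chosen here for readability and is not claimed to be the one used by GeoSC or SparseSC. The names `project_to_simplex` and `fit_weights`, and the toy data, are hypothetical.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {w : w_j >= 0, sum_j w_j = 1}."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for k, u_k in enumerate(u, start=1):
        cumulative += u_k
        candidate = (cumulative - 1.0) / k
        if u_k - candidate > 0:
            theta = candidate  # keep the threshold from the last valid k
    return [max(x - theta, 0.0) for x in v]


def fit_weights(x1, x0, v_diag, n_iter=2000):
    """Minimize sum_k v_k * (x1_k - sum_j x0[k][j] * w_j)^2 over the simplex."""
    n_features, n_donors = len(x0), len(x0[0])
    # Conservative step size: 2 * trace(X0' V X0) bounds the gradient's Lipschitz constant
    lipschitz = 2.0 * sum(
        v_diag[k] * x0[k][j] ** 2 for k in range(n_features) for j in range(n_donors)
    )
    step = 1.0 / lipschitz
    w = [1.0 / n_donors] * n_donors  # start from uniform weights
    for _ in range(n_iter):
        resid = [
            x1[k] - sum(x0[k][j] * w[j] for j in range(n_donors))
            for k in range(n_features)
        ]
        grad = [
            -2.0 * sum(v_diag[k] * resid[k] * x0[k][j] for k in range(n_features))
            for j in range(n_donors)
        ]
        w = project_to_simplex([w[j] - step * grad[j] for j in range(n_donors)])
    return w


# Toy data: K = 4 characteristics, J = 3 donors (columns of x0).
x0 = [
    [1.0, 2.0, 0.5],
    [2.0, 1.0, 1.5],
    [0.5, 1.5, 2.0],
    [1.5, 0.5, 1.0],
]
# Treated unit built as 0.6 * donor 1 + 0.4 * donor 2, so the exact fit is known.
x1 = [1.4, 1.6, 0.9, 1.1]
weights = fit_weights(x1, x0, v_diag=[1.0, 1.0, 1.0, 1.0])
```

Because the toy treated unit lies inside the convex hull of the donors, the recovered weights approach (0.6, 0.4, 0). With real data an exact fit is rarely attainable, and the residual pre-treatment discrepancy is itself a diagnostic.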

3. Scope Relative to Synthetic Difference-in-Differences

Synthetic Difference-in-Differences (SDiD) is a related panel estimator that combines weighting ideas from synthetic control and difference-in-differences. It is not implemented by GeoSC. GeoSC’s supported estimator selector is sparsesc; the documentation does not claim an empirical superiority of SparseSC over SDiD for every geo-experiment.

The methods make different modelling and weighting choices. Suitability depends on the estimand, timing, donor pool, outcome process, and the diagnostics that the analyst can defend. Neither method removes the need to assess treatment isolation, donor comparability, stable measurement, and sensitivity to reasonable design choices.

GeoSC employs SparseSC (Sparse Synthetic Controls), which regularizes the matching problem through feature-weight and unit-weight penalties. By encouraging a parsimonious match space, it can reduce interpolation error when comparable donors exist. It does not prove that the selected donor set shares the treated unit’s data-generating process, and the implemented assumption and interference diagnostics are design checks rather than guarantees.

4. SparseSC Regularisation Mechanics

Let $\mathbf{Y}_0^{pre}$ be the $(T_0 \times J)$ matrix of pre-treatment outcomes for donors, and $\mathbf{Y}_1^{pre}$ be the $(T_0 \times 1)$ vector for the treated unit.

SparseSC modifies standard synthetic control by fitting in a regularized match space. A simplified objective for interpreting the mechanism is:

$$ \hat{\mathbf{W}}, \hat{V} = \arg\min_{\mathbf{W}, V} || \mathbf{Y}_1^{pre} - \mathbf{Y}_0^{pre} \mathbf{W} ||_V^2 + \mathcal{P}_W(\mathbf{W}; \lambda_W) + \mathcal{P}_V(V; \lambda_V) $$

where $V$ controls the feature or time-period match space, $\mathcal{P}_W$ shrinks unit weights toward a regularized solution, and $\mathcal{P}_V$ penalizes feature weights.
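To see how the penalty terms trade off against pre-treatment fit, the simplified objective can be evaluated numerically. The sketch below assumes a diagonal $V$ over pre-treatment periods and picks two illustrative penalty forms: a squared distance of $\mathbf{W}$ from uniform weights for $\mathcal{P}_W$, and an L1 penalty on the diagonal of $V$ for $\mathcal{P}_V$. These forms, the function name `penalized_objective`, and the toy numbers are assumptions made for this example and are not taken from SparseSC's source.

```python
def penalized_objective(y1_pre, y0_pre, w, v_diag, lambda_w, lambda_v):
    """V-weighted squared pre-treatment fit plus two illustrative penalties."""
    n_donors = len(w)
    fit = 0.0
    for t, row in enumerate(y0_pre):
        resid = y1_pre[t] - sum(row[j] * w[j] for j in range(n_donors))
        fit += v_diag[t] * resid ** 2
    uniform = 1.0 / n_donors
    # Assumed P_W: shrink unit weights toward the simple donor average
    penalty_w = lambda_w * sum((w_j - uniform) ** 2 for w_j in w)
    # Assumed P_V: L1 penalty on the diagonal feature weights
    penalty_v = lambda_v * sum(abs(v_t) for v_t in v_diag)
    return fit + penalty_w + penalty_v


# Toy data: T_0 = 2 pre-treatment periods, J = 2 donors (hypothetical values)
y0_pre = [[1.0, 3.0], [2.0, 4.0]]
y1_pre = [2.0, 3.0]
v_diag = [1.0, 1.0]

obj_uniform = penalized_objective(y1_pre, y0_pre, [0.5, 0.5], v_diag, 2.0, 0.1)
obj_corner = penalized_objective(y1_pre, y0_pre, [1.0, 0.0], v_diag, 2.0, 0.1)
```

In this toy case the uniform weights fit the pre-period exactly, so `obj_uniform` consists only of the $V$ penalty (0.2). Putting all weight on the first donor gives a fit term of 2.0, a weight penalty of 1.0, and the same $V$ penalty, for a total of 3.2.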

In GeoSC’s default configuration, sparse_sc_fast_estimation: true calls SparseSC’s fast path, which uses RidgeCV-backed match-space machinery. When the fast path is disabled, SparseSC uses its full fitting path with cross-validation over feature-weight penalties. The practical goal is to reduce overfitting to pre-treatment noise, especially when $T_0$ is small relative to the donor pool, while preserving a donor-based counterfactual that remains auditable.

5. Placebo Inference and Empirical P-Values

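Because geo-experiments often have only one treated unit or a handful of them ($N_1 = 1$ or a small integer, where $N_1$ denotes the number of treated units), large-sample asymptotic inference is usually not appropriate. GeoSC calculates donor-pool empirical p-values using in-space placebo permutations.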

We iteratively reassign the treatment status to every donor unit $j \in \{2, \dots, N\}$, calculate a placebo synthetic control $\hat{Y}_{jt}^N$, and derive a placebo effect $\hat{\alpha}_{jt}$.

The ratio of post-treatment MSPE to pre-treatment MSPE is calculated for the actual treated unit and all placebos:

$$ r_i = \frac{\frac{1}{T-T_0} \sum_{t=T_0+1}^T (Y_{it} - \hat{Y}_{it}^N)^2}{\frac{1}{T_0} \sum_{t=1}^{T_0} (Y_{it} - \hat{Y}_{it}^N)^2} $$
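The ratio can be computed directly from an observed series and its synthetic counterpart. The sketch below is a minimal illustration; the function names `mspe` and `mspe_ratio` and the toy series are hypothetical, and `t0` denotes the number of pre-treatment periods $T_0$.

```python
def mspe(actual, synthetic):
    """Mean squared prediction error between two equal-length series."""
    return sum((a - s) ** 2 for a, s in zip(actual, synthetic)) / len(actual)


def mspe_ratio(actual, synthetic, t0):
    """Post-treatment MSPE divided by pre-treatment MSPE."""
    pre = mspe(actual[:t0], synthetic[:t0])
    post = mspe(actual[t0:], synthetic[t0:])
    if pre == 0.0:
        raise ValueError("pre-treatment MSPE is zero; the ratio is undefined")
    return post / pre


# Toy series with T_0 = 4 pre-treatment and 2 post-treatment periods
actual = [10.0, 11.0, 12.0, 13.0, 16.0, 17.0]
synthetic = [10.5, 10.5, 12.5, 12.5, 13.0, 14.0]
ratio = mspe_ratio(actual, synthetic, t0=4)
```

The pre-treatment residuals are all ±0.5 (MSPE 0.25) and the post-treatment residuals are both 3 (MSPE 9), giving a ratio of 36. Dividing by the pre-treatment MSPE means a unit with poor pre-period fit needs a proportionally larger post-period gap to stand out.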

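The p-value is the share of units, counting the treated unit ($j = 1$) alongside the donor-pool placebos, whose ratio is at least as extreme as the treated unit’s ratio. Because the treated unit always satisfies $r_1 \geq r_1$, the smallest attainable value is $1/N$: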

$$ p = \frac{1}{N} \sum_{j=1}^N \mathbf{I}(r_j \geq r_1) $$

This quantity is only as informative as the placebo reference set. With few eligible donors, the attainable p-values are coarse; with non-comparable donors, spillovers, or poor pre-period fit, the empirical comparison can be misleading.
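The coarseness of the attainable p-values follows directly from the formula, as the short sketch below illustrates. The function name `placebo_p_value` and the ratio values are hypothetical; the list is assumed to hold the treated unit's ratio first, followed by one ratio per donor placebo.

```python
def placebo_p_value(ratios):
    """Share of units whose MSPE ratio is at least the treated unit's ratio.

    ratios[0] is the treated unit; ratios[1:] are the donor placebos.
    """
    treated = ratios[0]
    return sum(1 for r in ratios if r >= treated) / len(ratios)


# Toy example with N = 10 units: one treated unit and nine donor placebos
ratios = [36.0, 2.1, 0.8, 40.2, 1.5, 3.3, 0.9, 5.0, 1.1, 2.7]
p_value = placebo_p_value(ratios)

# Every attainable p-value is a multiple of 1/N, and the minimum is 1/N
n_units = len(ratios)
attainable = [k / n_units for k in range(1, n_units + 1)]
```

In this example one placebo (40.2) exceeds the treated ratio, so two of ten units count and the p-value is 0.2. With ten units no p-value below 0.1 is attainable, regardless of how large the treated unit's ratio is.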