
Balancing Weights

Entropy and quadratic calibration on Kang-Schafer and Hainmueller DGPs

BalancingWeights solves the mean-balancing problem

\[ \sum_{i \in \mathcal{C}} w_i = 1, \qquad \sum_{i \in \mathcal{C}} w_i X_i \approx \bar X_{\mathcal{T}}, \]

while keeping the control weights close to uniform under either an entropy or a quadratic objective. In causal-inference terms, the common ATT use case is:

\[ \widehat{\tau}_{ATT} = \bar Y_{\mathcal{T}} - \sum_{i \in \mathcal{C}} w_i Y_i. \]

The examples below use two standard stress tests:

  • Kang-Schafer, where the observed covariates are nonlinear transformations of latent Gaussian drivers.
  • Hainmueller, adapted from aipyw, where overlap and functional-form difficulty can be dialed up or down.

Literature Map

BalancingWeights sits in a tight cluster of weighting estimators that differ more by parameterization than by the balance conditions they impose.

  • Hainmueller (2012) formulates entropy balancing as a convex calibration problem: choose positive control weights that exactly match treated covariate moments while staying close to baseline weights.
  • Graham, de Xavier Pinto, and Egel (2012) introduce inverse probability tilting, which estimates a logit index from moment conditions chosen so that the implied weights satisfy exact in-sample balance.
  • Imai and Ratkovic (2014) recast the same balance-first logic as a propensity-score GMM / empirical-likelihood estimator.
  • Graham, Pinto, and Egel (2016) extend the same tilting geometry to data-combination problems by introducing separate study and auxiliary tilts.
  • Zhao and Percival (2017) make the dual interpretation explicit from the entropy-balancing side: entropy balancing behaves like a logistic propensity-score fit with a different loss.

For the ATT problem, the cleanest exact equivalence is between entropy balancing and just-identified logit CBPS. Graham’s tilting estimators belong to the same family, but full AST adds an extra layer of tilting beyond that baseline case.

Entropy Balancing, CBPS, and Tilting

Let \(c_i = c(X_i)\) be the balance basis, including an intercept, and let \(q_i > 0\) denote baseline control weights. For the ATT, entropy balancing solves

\[ \min_{\{w_i\}_{i \in \mathcal{C}}} \sum_{i \in \mathcal{C}} w_i \log\!\left(\frac{w_i}{q_i}\right) \quad \text{subject to} \quad \sum_{i \in \mathcal{C}} w_i = 1, \qquad \sum_{i \in \mathcal{C}} w_i c_i = \bar c_{\mathcal{T}}. \]

The Lagrangian first-order conditions imply a log-linear dual solution

\[ w_i(\lambda) = \frac{q_i \exp(\lambda^\top c_i)} {\sum_{j \in \mathcal{C}} q_j \exp(\lambda^\top c_j)}, \]

after the usual sign relabeling of the multipliers. This is the core Hainmueller result: entropy balancing chooses the multiplier \(\lambda\) so that the tilted control distribution exactly matches the treated moments.
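Under uniform baseline weights the dual is a smooth convex problem in \(\lambda\), so a generic optimizer recovers it directly. The sketch below is illustrative (helper names are not the library's API): it minimizes the log-partition dual with SciPy and reads the weights off as a softmax. The normalization constraint absorbs the intercept, so only the non-intercept columns of the basis enter.

```python
import numpy as np
from scipy.optimize import minimize


def entropy_balance_dual(c_control, cbar_treated):
    """Solve min_lambda logsumexp(C lambda) - lambda' cbar_T; weights are the softmax."""

    def dual(lam):
        eta = c_control @ lam
        m = eta.max()
        lse = m + np.log(np.exp(eta - m).sum())  # stable log-sum-exp
        w = np.exp(eta - lse)                    # softmax control weights
        # the gradient of the dual is exactly the balance residual
        return lse - lam @ cbar_treated, w @ c_control - cbar_treated

    res = minimize(dual, np.zeros(c_control.shape[1]), jac=True, method="BFGS")
    eta = c_control @ res.x
    w = np.exp(eta - eta.max())
    return w / w.sum()


rng = np.random.default_rng(0)
c_control = rng.normal(size=(500, 3))
cbar_treated = np.array([0.3, -0.2, 0.1])  # must lie inside the control convex hull
w = entropy_balance_dual(c_control, cbar_treated)
print(w.sum(), np.abs(w @ c_control - cbar_treated).max())  # sums to 1; residual near zero
```

Because the gradient of the dual is the balance residual itself, the optimizer's convergence tolerance directly bounds how far the tilted control moments sit from the treated moments.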

Now write the ATT-CBPS balance equations with a logit propensity score \(p_i = \Lambda(\beta^\top c_i)\):

\[ \frac{1}{n} \sum_{i=1}^n \left[ D_i c_i - (1-D_i)\frac{p_i}{1-p_i} c_i \right] = 0. \]

Because the logit odds ratio satisfies

\[ \frac{p_i}{1-p_i} = \exp(\beta^\top c_i), \]

the implied unnormalized control weights are

\[ \tilde w_i(\beta) = (1-D_i)\exp(\beta^\top c_i). \]

The intercept moment pins down their total mass:

\[ \sum_{i \in \mathcal{C}} \tilde w_i(\beta) = \sum_{i=1}^n D_i = n_{\mathcal{T}}. \]

Dividing by \(n_{\mathcal{T}}\) yields normalized control weights

\[ w_i^{\mathrm{CBPS}} = \frac{\exp(\beta^\top c_i)} {\sum_{j \in \mathcal{C}} \exp(\beta^\top c_j)}, \qquad i \in \mathcal{C}, \]

and the remaining moments become

\[ \sum_{i \in \mathcal{C}} w_i^{\mathrm{CBPS}} c_i = \bar c_{\mathcal{T}}. \]

So with the same balance basis \(c(X)\), an intercept, and uniform baseline weights \(q_i\), ATT entropy balancing and just-identified logit CBPS deliver the same normalized weights. The practical difference is mostly primal versus dual parameterization: entropy balancing solves directly for calibration weights or multipliers, while CBPS solves for propensity-score coefficients whose logit odds generate those same weights.
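A quick numerical check of this equivalence (a sketch on a toy DGP; function names and tolerances are illustrative): solve the just-identified ATT-CBPS moments for \(\beta\) with a root finder, normalize the implied inverse-odds weights, and verify that they satisfy the entropy-balancing constraints. Since the log-linear family admits a unique solution to those constraints, matching them means matching the entropy-balancing weights.

```python
import numpy as np
from scipy.optimize import root

rng = np.random.default_rng(1)
n = 800
x = rng.normal(size=(n, 2))
# treatment depends on x, so treated and control covariates are imbalanced
d = rng.binomial(1, 1.0 / (1.0 + np.exp(-(0.8 * x[:, 0] - 0.5 * x[:, 1]))))
c = np.column_stack([np.ones(n), x])  # balance basis with intercept
ct, cc = c[d == 1], c[d == 0]


def cbps_moments(beta):
    # sum_T c_i  -  sum_C (p_i / (1 - p_i)) c_i, with logit odds exp(beta'c)
    return ct.sum(axis=0) - np.exp(cc @ beta) @ cc


sol = root(cbps_moments, np.zeros(c.shape[1]))
w = np.exp(cc @ sol.x)
w /= w.sum()  # normalized control weights

# entropy-balancing constraints: weights sum to 1 and reproduce treated means
print(w.sum(), np.abs(w @ cc - ct.mean(axis=0)).max())
```

The intercept moment makes the unnormalized weights sum to \(n_{\mathcal{T}}\), so the normalization divides through cleanly and the remaining moments reproduce \(\bar c_{\mathcal{T}}\) up to solver tolerance.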

Graham’s inverse probability tilting step is the same balance-first idea in another parameterization. The 2012 IPT moments choose a logit index so the implied weights satisfy exact balance moments, which in the ATT specialization again produces inverse-odds weights proportional to \(\exp(\beta^\top c_i)\). AST adds a second layer of tilting on top of that baseline. Writing \(\hat p_i = \Lambda(r_i^\top \hat \delta)\), the auxiliary weights take the form

\[ \hat \pi_i^a \propto (1-D_i)\frac{\hat p_i} {1 - \Lambda(r_i^\top \hat \delta + t_i^\top \hat \lambda_a)}. \]

When the extra auxiliary tilt is unnecessary (\(\hat \lambda_a = 0\)), this collapses to

\[ \hat \pi_i^a \propto (1-D_i)\frac{\hat p_i}{1-\hat p_i} = (1-D_i)\exp(r_i^\top \hat \delta), \]

which is exactly the same inverse-odds / entropy-balancing weight formula. With nontrivial study or auxiliary tilts, AST is a strict generalization rather than literally the same estimator.

Quadratic balancing in this library keeps the same balance constraints and changes only the distance penalty. It therefore targets the same sample moments as entropy balancing, but it does not imply the same log-linear weight formula.
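For intuition, here is a sketch of the quadratic case under a uniform baseline (not the library's solver, which may add safeguards): with a squared-distance penalty the problem is equality-constrained least squares, so the weights have a closed form from the KKT system. Note that, unlike the entropy dual, nothing here forces the weights to stay positive.

```python
import numpy as np


def quadratic_balance(c_control, cbar_treated):
    """min ||w - 1/n||^2  s.t.  sum_i w_i = 1  and  sum_i w_i c_i = cbar_T."""
    n = c_control.shape[0]
    A = np.column_stack([np.ones(n), c_control])  # stacked constraints, n x (k+1)
    b = np.concatenate([[1.0], cbar_treated])
    u = np.full(n, 1.0 / n)                       # uniform baseline weights
    # Stationarity gives w = u + A nu; substituting into A'w = b yields
    # the (k+1)-dimensional linear system A'A nu = b - A'u.
    nu = np.linalg.solve(A.T @ A, b - A.T @ u)
    return u + A @ nu


rng = np.random.default_rng(2)
c_control = rng.normal(size=(400, 3))
cbar_treated = np.array([0.4, -0.1, 0.2])
w = quadratic_balance(c_control, cbar_treated)
print(w.sum(), np.abs(w @ c_control - cbar_treated).max())  # 1 and ~0
```

The balance constraints are satisfied to machine precision, but the weight formula is affine in the basis rather than log-linear, which is exactly why quadratic balancing does not inherit the entropy/CBPS equivalence.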

Show code
from html import escape

import matplotlib.pyplot as plt
import numpy as np
from IPython.display import HTML, display

import crabbymetrics as cm


def html_table(headers, rows):
    parts = [
        "<table>",
        "<thead>",
        "<tr>",
        *[f"<th>{escape(str(header))}</th>" for header in headers],
        "</tr>",
        "</thead>",
        "<tbody>",
    ]
    for row in rows:
        parts.append("<tr>")
        for cell in row:
            parts.append(f"<td>{escape(str(cell))}</td>")
        parts.append("</tr>")
    parts.extend(["</tbody>", "</table>"])
    return "".join(parts)


def expit(x):
    return 1.0 / (1.0 + np.exp(-x))


def kang_schafer_dgp(n, rng):
    z = rng.normal(size=(n, 4))
    z1, z2, z3, z4 = z.T

    propensity = expit(-z1 + 0.5 * z2 - 0.25 * z3 - 0.1 * z4)
    d = rng.binomial(1, propensity)

    y0 = 210.0 + 27.4 * z1 + 13.7 * (z2 + z3 + z4) + rng.normal(size=n)
    y = y0  # true treatment effect is zero, so Y = Y(0) for everyone

    x = np.column_stack(
        [
            np.exp(z1 / 2.0),
            z2 / (1.0 + np.exp(z1)) + 10.0,
            (z1 * z3 / 25.0 + 0.6) ** 3,
            (z2 + z4 + 20.0) ** 2,
        ]
    )
    return y, d, x, z


def hainmueller_dgp(
    n,
    rng,
    overlap_design=1,
    pscore_design=1,
    outcome_design=1,
):
    mean = np.zeros(3)
    cov = np.array([[2.0, 1.0, -1.0], [1.0, 1.0, -0.5], [-1.0, -0.5, 1.0]])
    x1, x2, x3 = rng.multivariate_normal(mean=mean, cov=cov, size=n).T
    x4 = rng.uniform(-3.0, 3.0, size=n)
    x5 = rng.chisquare(df=1.0, size=n)
    x6 = rng.binomial(1, 0.5, size=n)
    x = np.column_stack([x1, x2, x3, x4, x5, x6])

    if overlap_design == 1:
        epsilon = rng.normal(0.0, np.sqrt(30.0), size=n)
    elif overlap_design == 2:
        epsilon = rng.normal(0.0, 10.0, size=n)
    elif overlap_design == 3:
        epsilon = rng.chisquare(df=5.0, size=n)
        epsilon = (epsilon - 5.0) / np.sqrt(10.0) * np.sqrt(67.6) + 0.5
    else:
        raise ValueError("unknown overlap_design")

    if pscore_design == 1:
        base_term = x1 + 2.0 * x2 - 2.0 * x3 - x4 - 0.5 * x5 + x6
    elif pscore_design == 2:
        base_term = x1 + x1**2 - x4 * x6
    elif pscore_design == 3:
        base_term = 2.0 * np.cos(x1) + np.sin(np.pi * x2)
    else:
        raise ValueError("unknown pscore_design")

    d = (base_term + epsilon > 0.0).astype(int)

    eta = rng.normal(0.0, 1.0, size=n)
    if outcome_design == 1:
        y = x1 + x2 + x3 - x4 + x5 + x6 + eta
    elif outcome_design == 2:
        y = x1 + x2 + 0.2 * x3 * x4 - np.sqrt(x5) + eta
    elif outcome_design == 3:
        y = 2.0 * np.cos(x1) + np.tan(np.pi * x2) + (x1 + x2 + x5) ** 2 + eta
    else:
        raise ValueError("unknown outcome_design")

    return y, d, x


def fit_att_balancing(y, d, x, objective):
    treated = d == 1
    control = ~treated
    model = cm.BalancingWeights(
        objective=objective,
        solver="auto",
        autoscale=True,
        max_iterations=300,
        tolerance=1e-8,
    )
    model.fit(x[control], x[treated])
    summary = model.summary()
    weights = np.asarray(summary["weights"])
    att_hat = y[treated].mean() - np.dot(weights, y[control])
    return att_hat, summary


def standardized_mean_difference(x_treated, x_control, weights=None):
    treated_mean = x_treated.mean(axis=0)
    control_mean = x_control.mean(axis=0) if weights is None else np.average(
        x_control, axis=0, weights=weights
    )
    treated_var = x_treated.var(axis=0)
    control_var = x_control.var(axis=0) if weights is None else np.average(
        (x_control - control_mean) ** 2, axis=0, weights=weights
    )
    pooled = np.sqrt(0.5 * (treated_var + control_var))
    pooled = np.where(pooled > 1e-12, pooled, 1.0)
    return (treated_mean - control_mean) / pooled


def evaluate_single_dataset():
    rng = np.random.default_rng(123)
    y, d, x, z = kang_schafer_dgp(2000, rng)
    treated = d == 1
    control = ~treated

    naive_att = y[treated].mean() - y[control].mean()
    quad_att, quad_summary = fit_att_balancing(y, d, x, "quadratic")
    ent_att, ent_summary = fit_att_balancing(y, d, x, "entropy")
    oracle_att, oracle_summary = fit_att_balancing(y, d, z, "entropy")

    smd_before = standardized_mean_difference(x[treated], x[control])
    smd_quad = standardized_mean_difference(
        x[treated], x[control], weights=np.asarray(quad_summary["weights"])
    )
    smd_ent = standardized_mean_difference(
        x[treated], x[control], weights=np.asarray(ent_summary["weights"])
    )

    rows = [
        ["Naive difference", f"{naive_att: .3f}", "--", "--"],
        ["Quadratic balancing on observed X", f"{quad_att: .3f}", quad_summary["success"], f"{quad_summary['effective_sample_size']: .1f}"],
        ["Entropy balancing on observed X", f"{ent_att: .3f}", ent_summary["success"], f"{ent_summary['effective_sample_size']: .1f}"],
        ["Entropy balancing on latent Z (oracle)", f"{oracle_att: .3f}", oracle_summary["success"], f"{oracle_summary['effective_sample_size']: .1f}"],
    ]
    display(HTML(html_table(["Estimator", "ATT Estimate", "Success", "ESS"], rows)))

    labels = [f"x{j + 1}" for j in range(x.shape[1])]
    fig, ax = plt.subplots(figsize=(8, 4))
    xpos = np.arange(len(labels))
    width = 0.25
    ax.bar(xpos - width, np.abs(smd_before), width=width, label="Unweighted")
    ax.bar(xpos, np.abs(smd_quad), width=width, label="Quadratic")
    ax.bar(xpos + width, np.abs(smd_ent), width=width, label="Entropy")
    ax.axhline(0.1, color="black", linestyle="--", linewidth=1.0)
    ax.set_xticks(xpos)
    ax.set_xticklabels(labels)
    ax.set_ylabel("Absolute standardized mean difference")
    ax.set_title("Kang-Schafer: single-dataset balance on observed transformed covariates")
    ax.legend()
    fig.tight_layout()

    return {
        "naive": naive_att,
        "quadratic": quad_att,
        "entropy": ent_att,
        "oracle": oracle_att,
    }


single_dataset = evaluate_single_dataset()
Estimator                                 ATT Estimate   Success     ESS
Naive difference                               -20.704        --      --
Quadratic balancing on observed X               -6.252      True   469.3
Entropy balancing on observed X                 -4.627      True   406.7
Entropy balancing on latent Z (oracle)           0.001      True   337.3

Kang-Schafer

Kang-Schafer is useful here because balancing is asked to work on the transformed covariates \(X\), not the latent Gaussian drivers \(Z\) that generated treatment and outcomes. The true ATT is zero, so the gap between the estimator and zero is pure bias.

On the single draw above, both balancing estimators substantially reduce the raw covariate imbalance, and the oracle version that balances on latent \(Z\) shows the benchmark we would like to approach.

Show code
def run_kang_schafer_panel(n_rep=80, n=1000, seed=2026):
    rng = np.random.default_rng(seed)
    rows = []
    for rep in range(n_rep):
        y, d, x, z = kang_schafer_dgp(n, rng)
        naive = y[d == 1].mean() - y[d == 0].mean()
        quad, quad_summary = fit_att_balancing(y, d, x, "quadratic")
        ent, ent_summary = fit_att_balancing(y, d, x, "entropy")
        oracle, oracle_summary = fit_att_balancing(y, d, z, "entropy")

        rows.append(("Naive", naive, True))
        rows.append(("Quadratic on observed X", quad, bool(quad_summary["success"])))
        rows.append(("Entropy on observed X", ent, bool(ent_summary["success"])))
        rows.append(("Entropy on latent Z", oracle, bool(oracle_summary["success"])))
    return rows


def summarize_rows(rows, truth=0.0):
    methods = sorted({row[0] for row in rows})
    out = []
    for method in methods:
        values = np.array([row[1] for row in rows if row[0] == method], dtype=float)
        successes = np.array([row[2] for row in rows if row[0] == method], dtype=bool)
        out.append(
            [
                method,
                f"{values.mean(): .3f}",
                f"{(values.mean() - truth): .3f}",
                f"{np.sqrt(np.mean((values - truth) ** 2)): .3f}",
                f"{successes.mean(): .3f}",
            ]
        )
    return out


kang_rows = run_kang_schafer_panel()
display(HTML(html_table(["Method", "Mean Estimate", "Bias", "RMSE", "Success Rate"], summarize_rows(kang_rows))))
Method                     Mean Estimate      Bias     RMSE   Success Rate
Entropy on latent Z                0.027     0.027    0.095          1.000
Entropy on observed X             -4.402    -4.402    4.544          1.000
Naive                            -20.383   -20.383   20.511          1.000
Quadratic on observed X           -6.217    -6.217    6.353          1.000

The observed-\(X\) versions still live inside the canonical misspecification problem, so they do not become oracle estimators just by balancing means. But they do move sharply toward zero relative to the naive treated-control difference.

Hainmueller

The Hainmueller design below is adapted from aipyw. It keeps the true treatment effect at zero while varying overlap and the difficulty of the treatment and outcome models.

Show code
def run_hainmueller_panel(setting_name, overlap_design, pscore_design, outcome_design, n_rep=50, n=1500, seed=3030):
    rng = np.random.default_rng(seed)
    rows = []
    for rep in range(n_rep):
        y, d, x = hainmueller_dgp(
            n=n,
            rng=rng,
            overlap_design=overlap_design,
            pscore_design=pscore_design,
            outcome_design=outcome_design,
        )
        naive = y[d == 1].mean() - y[d == 0].mean()
        quad, quad_summary = fit_att_balancing(y, d, x, "quadratic")
        ent, ent_summary = fit_att_balancing(y, d, x, "entropy")

        rows.append((setting_name, "Naive", naive, True))
        rows.append((setting_name, "Quadratic", quad, bool(quad_summary["success"])))
        rows.append((setting_name, "Entropy", ent, bool(ent_summary["success"])))
    return rows


def summarize_hainmueller(rows, truth=0.0):
    settings = sorted({row[0] for row in rows})
    out = []
    for setting in settings:
        methods = sorted({row[1] for row in rows if row[0] == setting})
        for method in methods:
            values = np.array([row[2] for row in rows if row[0] == setting and row[1] == method], dtype=float)
            successes = np.array([row[3] for row in rows if row[0] == setting and row[1] == method], dtype=bool)
            out.append(
                [
                    setting,
                    method,
                    f"{values.mean(): .3f}",
                    f"{(values.mean() - truth): .3f}",
                    f"{np.sqrt(np.mean((values - truth) ** 2)): .3f}",
                    f"{successes.mean(): .3f}",
                ]
            )
    return out


hain_easy = run_hainmueller_panel("Easy: overlap 2 / pscore 1 / outcome 1", 2, 1, 1)
hain_hard = run_hainmueller_panel("Hard: overlap 1 / pscore 3 / outcome 3", 1, 3, 3)
hain_rows = hain_easy + hain_hard
display(
    HTML(
        html_table(
            ["Setting", "Method", "Mean Estimate", "Bias", "RMSE", "Success Rate"],
            summarize_hainmueller(hain_rows),
        )
    )
)
Setting                                   Method      Mean Estimate      Bias     RMSE   Success Rate
Easy: overlap 2 / pscore 1 / outcome 1    Entropy            -0.001    -0.001    0.063          1.000
Easy: overlap 2 / pscore 1 / outcome 1    Naive               1.157     1.157    1.167          1.000
Easy: overlap 2 / pscore 1 / outcome 1    Quadratic          -0.001    -0.001    0.064          0.980
Hard: overlap 1 / pscore 3 / outcome 3    Entropy             1.774     1.774   18.739          1.000
Hard: overlap 1 / pscore 3 / outcome 3    Naive               1.823     1.823   18.806          1.000
Hard: overlap 1 / pscore 3 / outcome 3    Quadratic           1.799     1.799   18.748          1.000
Show code
def rmse_by_setting(rows):
    settings = sorted({row[0] for row in rows})
    methods = ["Naive", "Quadratic", "Entropy"]
    rmse = np.zeros((len(settings), len(methods)))
    for i, setting in enumerate(settings):
        for j, method in enumerate(methods):
            values = np.array([row[2] for row in rows if row[0] == setting and row[1] == method], dtype=float)
            rmse[i, j] = np.sqrt(np.mean(values**2))
    return settings, methods, rmse


settings, methods, rmse = rmse_by_setting(hain_rows)
fig, ax = plt.subplots(figsize=(10, 4))
xpos = np.arange(len(settings))
width = 0.25
for j, method in enumerate(methods):
    ax.bar(xpos + (j - 1) * width, rmse[:, j], width=width, label=method)
ax.set_xticks(xpos)
ax.set_xticklabels(settings, rotation=10, ha="right")
ax.set_ylabel("RMSE around true ATT = 0")
ax.set_title("Hainmueller DGP: balancing weights versus the naive difference in means")
ax.legend()
fig.tight_layout()

Takeaways

  • BalancingWeights is most naturally a building block. The class returns the control weights; the ATT estimate is the weighted control mean subtracted from the treated mean.
  • autoscale=True is useful on these simulation designs because the raw covariate scales can be wildly different.
  • Entropy and quadratic balancing often move together on easy designs. Because the quadratic objective directly penalizes the sum of squared weights, it tends to deliver a larger effective sample size than entropy balancing (469.3 versus 406.7 on the single Kang-Schafer draw above), though its weights need not stay positive.
  • Kang-Schafer remains hard when only the transformed covariates are observed. Balancing means helps, but it does not erase misspecification by itself.