Moving Beyond Correlation: Estimating Cause-and-Effect with Data
Authored by: Loveleen Narang
Date: May 23, 2024
Machine learning models excel at identifying patterns and making predictions based on correlations in data (\( P(Y|X) \)). A classic mantra in statistics and data science, however, reminds us that "correlation does not imply causation." Just because two variables move together does not mean one causes the other. For example, ice cream sales and drowning incidents are correlated (both increase in summer), but ice cream doesn't cause drowning. Understanding true cause-and-effect relationships requires moving beyond standard predictive modeling to the realm of causal inference.
Causal inference aims to determine the effect of changing one variable (\( X \), the "treatment" or "cause") on another variable (\( Y \), the "outcome" or "effect"). This involves asking counterfactual questions: "What would have happened to Y if X had been different?" Answering such questions is crucial for effective decision-making in fields like medicine (Does this drug work?), policy (Does this program achieve its goal?), and business (Does this ad campaign drive sales?). We want to understand the effect of an intervention, represented notationally as \( P(Y|do(X=x)) \), which is distinct from the observational probability \( P(Y|X=x) \) (Formula 1). While Randomized Controlled Trials (RCTs) are the ideal way to establish causality, they are often impractical. This article explores frameworks and methods, including those enhanced by machine learning, for inferring causal effects primarily from observational data.
Fig 1: Correlation can arise from a common cause (confounder), while causation implies a direct influence.
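To make this concrete, here is a minimal simulation of the ice-cream example above, with invented coefficients: a shared cause (temperature) produces a strong correlation between two variables that do not affect each other, and holding the confounder roughly fixed makes the association largely vanish.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

temperature = rng.normal(25, 5, n)                    # the common cause (confounder)
ice_cream_sales = 2.0 * temperature + rng.normal(0, 3, n)
drownings = 0.5 * temperature + rng.normal(0, 3, n)   # no arrow from sales to drownings

print(np.corrcoef(ice_cream_sales, drownings)[0, 1])  # strong positive correlation

band = (temperature > 24) & (temperature < 26)        # hold the confounder roughly fixed
print(np.corrcoef(ice_cream_sales[band], drownings[band])[0, 1])  # much closer to zero
```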
The Potential Outcomes framework formalizes causal reasoning using counterfactuals. For each unit \( i \) (e.g., a patient) and a binary treatment \( T \in \{0, 1\} \), we imagine two potential states of the world: \( Y_i(1) \), the outcome unit \( i \) would experience if treated (Formula 2), and \( Y_i(0) \), the outcome it would experience if untreated (Formula 3).
The Individual Treatment Effect (ITE) is the difference: \( \tau_i = Y_i(1) - Y_i(0) \) (Formula 4). However, we face the Fundamental Problem of Causal Inference: we only ever observe one potential outcome for each unit. The observed outcome is \( Y_i^{obs} = T_i Y_i(1) + (1-T_i) Y_i(0) \) (Formula 5). The other outcome remains unseen (counterfactual).
Therefore, we usually estimate average effects: the Average Treatment Effect, \( ATE = E[Y(1) - Y(0)] \) (Formula 6), and the Average Treatment Effect on the Treated, \( ATT = E[Y(1) - Y(0) \mid T=1] \) (Formula 7).
This framework relies on the Stable Unit Treatment Value Assumption (SUTVA), meaning no interference between units and only one version of the treatment.
Fig 2: Only one potential outcome (Y(0) or Y(1)) is observed for any individual unit.
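A small sketch, with made-up numbers, of what the fundamental problem looks like in data: the simulation knows both potential outcomes for every unit, but the observed dataset reveals only one of them.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5

y0 = rng.normal(10, 2, n)            # Y_i(0): outcome without treatment
y1 = y0 + 3 + rng.normal(0, 1, n)    # Y_i(1): outcome under treatment
t = rng.integers(0, 2, n)            # observed treatment assignment

ite = y1 - y0                        # tau_i = Y_i(1) - Y_i(0), never fully observable
y_obs = t * y1 + (1 - t) * y0        # Formula 5: only one potential outcome is seen

print("true ATE (needs both columns):", ite.mean())
for i in range(n):
    unseen = y0[i] if t[i] == 1 else y1[i]
    print(f"unit {i}: T={t[i]}, Y_obs={y_obs[i]:.1f}, counterfactual={unseen:.1f} (never observed)")
```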
Structural Causal Models use Directed Acyclic Graphs (DAGs) to visually represent assumptions about causal relationships between variables (nodes connected by directed edges). Key structures include chains, forks (confounding), and colliders. The do-operator, \( do(X=x) \), represents setting variable \( X \) to value \( x \) via intervention. DAGs help identify if a causal effect is identifiable from observational data using criteria like the back-door criterion.
Fig 3: A DAG showing Treatment (T), Outcome (Y), and Confounder (Z). Adjusting for Z blocks the non-causal path.
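As a sketch of back-door adjustment under this DAG (with an invented data-generating process and a true effect of 2.0), stratifying on \( Z \) and averaging the within-stratum differences over \( P(Z) \) recovers the causal effect that the naive difference in means gets wrong.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

z = rng.integers(0, 2, n)                          # binary confounder
t = rng.binomial(1, 0.2 + 0.6 * z)                 # Z makes treatment more likely
y = 2.0 * t + 3.0 * z + rng.normal(0, 1, n)        # true causal effect of T is 2.0

naive = y[t == 1].mean() - y[t == 0].mean()        # biased: mixes in the effect of Z

# Back-door adjustment: average E[Y|T=1,Z=z] - E[Y|T=0,Z=z] weighted by P(Z=z)
adjusted = sum(
    (y[(t == 1) & (z == v)].mean() - y[(t == 0) & (z == v)].mean()) * (z == v).mean()
    for v in (0, 1)
)
print(f"naive: {naive:.2f}, adjusted: {adjusted:.2f}, truth: 2.00")
```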
By randomly assigning treatment \( T \), RCTs ensure \( T \) is independent of pre-treatment factors (\( T \perp (Y(1), Y(0), Z, U) \)). This allows direct estimation of ATE: \( ATE = E[Y|T=1] - E[Y|T=0] \) (Formula 8).
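A small illustration with simulated data: when \( T \) is assigned by a coin flip, the raw difference in means recovers the true effect even though a covariate still influences the outcome. The data-generating process is invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

z = rng.integers(0, 2, n)                      # a pre-treatment covariate
t = rng.binomial(1, 0.5, n)                    # randomized, independent of z
y = 2.0 * t + 3.0 * z + rng.normal(0, 1, n)    # true effect of t is 2.0

ate_hat = y[t == 1].mean() - y[t == 0].mean()  # ATE = E[Y|T=1] - E[Y|T=0] (Formula 8)
print(ate_hat)                                 # approximately 2.0 despite ignoring z
```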
Without randomization, we must account for confounders \( Z \). Under conditional ignorability (all confounders are measured, so treatment is as good as random given them) and positivity, the causal effect can be recovered by back-door adjustment, outcome regression, or propensity score methods such as inverse probability of treatment weighting (IPTW), where the propensity score is \( e(x) = P(T=1|X=x) \).
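One possible sketch of the propensity-score route, using a logistic-regression propensity model from scikit-learn and an invented data-generating process with a true ATE of 2.0:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
n = 50_000

x = rng.normal(0, 1, (n, 1))                               # measured confounder
t = rng.binomial(1, 1 / (1 + np.exp(-1.5 * x[:, 0])))      # confounded treatment
y = 2.0 * t + 3.0 * x[:, 0] + rng.normal(0, 1, n)          # true ATE is 2.0

e = LogisticRegression().fit(x, t).predict_proba(x)[:, 1]  # propensity scores e(x)
ate_iptw = np.mean(t * y / e) - np.mean((1 - t) * y / (1 - e))  # IPTW estimator
print(ate_iptw)                                            # close to 2.0
```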
Quasi-experimental designs offer identification strategies when key confounders are unmeasured: instrumental variables (IV), regression discontinuity designs (RDD), and difference-in-differences (DiD).
Fig 4: IV setup assumes Z affects T, T affects Y, U affects T and Y, but Z is independent of U and only affects Y through T.
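A minimal sketch of IV estimation under the structure in Fig 4, with invented coefficients: ordinary regression of \( Y \) on \( T \) is biased by the unobserved confounder \( U \), while the Wald/IV ratio \( Cov(Z, Y) / Cov(Z, T) \) recovers the true effect.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000

u = rng.normal(0, 1, n)                       # unobserved confounder
z = rng.binomial(1, 0.5, n)                   # instrument (e.g. random encouragement)
t = 0.8 * z + 0.6 * u + rng.normal(0, 1, n)   # Z and U both drive treatment intensity
y = 2.0 * t + 1.5 * u + rng.normal(0, 1, n)   # true effect of T is 2.0; U biases OLS

ols = np.cov(t, y)[0, 1] / np.var(t, ddof=1)  # biased by the unobserved confounder
iv = np.cov(z, y)[0, 1] / np.cov(z, t)[0, 1]  # Wald / IV estimator
print(f"OLS: {ols:.2f}, IV: {iv:.2f}, truth: 2.00")
```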
Fig 5: DiD estimates the treatment effect by comparing outcome changes, assuming parallel pre-treatment trends.
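A minimal DiD sketch with simulated panel-style data: a time-invariant group difference and a common time trend both cancel in the double difference, leaving the treatment effect (set to 2.0 here). All numbers are invented.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 20_000

group = rng.integers(0, 2, n)          # 1 = eventually treated group
post = rng.integers(0, 2, n)           # 1 = after the intervention
treated = group * post                 # treatment applies only to group 1, post period
y = (1.0 * group                       # time-invariant group difference (confounder)
     + 0.5 * post                      # common time trend shared by both groups
     + 2.0 * treated                   # true treatment effect
     + rng.normal(0, 1, n))

did = ((y[(group == 1) & (post == 1)].mean() - y[(group == 1) & (post == 0)].mean())
       - (y[(group == 0) & (post == 1)].mean() - y[(group == 0) & (post == 0)].mean()))
print(did)                             # close to 2.0
```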
ML enhances causal inference in two main ways: flexibly estimating nuisance components such as propensity scores and outcome regressions, and estimating heterogeneous effects via the Conditional Average Treatment Effect, \( \tau(x) = E[Y(1) - Y(0) \mid X=x] \), using meta-learners such as the S-learner (a single outcome model with the treatment as an input feature) and the T-learner (separate outcome models for treated and control units).
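As an illustration of a T-learner (the gradient-boosting models and the simulated heterogeneous effect are arbitrary choices here, not prescriptions), fit separate outcome models on treated and control units and take their difference as the CATE estimate:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(7)
n = 20_000

x = rng.uniform(-2, 2, (n, 1))
t = rng.binomial(1, 0.5, n)                              # randomized for simplicity
tau_true = 1.0 + x[:, 0]                                 # effect varies with x
y = 0.5 * x[:, 0] + tau_true * t + rng.normal(0, 1, n)

mu1 = GradientBoostingRegressor().fit(x[t == 1], y[t == 1])   # model for E[Y | T=1, X]
mu0 = GradientBoostingRegressor().fit(x[t == 0], y[t == 0])   # model for E[Y | T=0, X]
tau_hat = mu1.predict(x) - mu0.predict(x)                     # CATE estimate tau(x)

print(np.corrcoef(tau_hat, tau_true)[0, 1])                   # should be close to 1
```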
Method | Data Type | Key Assumption(s) | Handles Unobserved Confounding? |
---|---|---|---|
RCT | Experimental | Successful Randomization | Yes |
Adjustment / PS Methods | Observational | Conditional Ignorability (All confounders measured), Positivity | No |
Instrumental Variable (IV) | Observational | Relevance, Exclusion Restriction, Independence | Yes (if assumptions hold) |
Regression Discontinuity (RDD) | Observational | Continuity of potential outcomes near cutoff | Yes (locally around cutoff) |
Difference-in-Differences (DiD) | Observational (Panel) | Parallel Trends | Yes (for time-invariant confounders) |
Causal inference provides the essential framework for moving beyond correlation to understand cause-and-effect, crucial for informed decision-making. By leveraging frameworks like Potential Outcomes and Structural Causal Models, and applying identification strategies such as adjustment based on the back-door criterion, instrumental variables, RDD, or DiD (often enhanced by ML techniques for estimating nuisance components or heterogeneous effects), we can rigorously estimate causal impacts even from observational data. However, this requires careful consideration of underlying assumptions and potential biases, particularly unobserved confounding. Integrating the predictive power of ML with the inferential rigor of causal methods is key to moving from simply observing patterns to understanding the mechanisms that drive them.