Moving Beyond Correlation: Estimating Cause-and-Effect with Data
Authored by: Loveleen Narang
Date: May 23, 2024
Machine learning models excel at identifying patterns and making predictions based on correlations in data (\( P(Y|X) \)). A classic mantra in statistics and data science, however, reminds us that "correlation does not imply causation." Just because two variables move together does not mean one causes the other. For example, ice cream sales and drowning incidents are correlated (both increase in summer), but ice cream doesn't cause drowning. Understanding true cause-and-effect relationships requires moving beyond standard predictive modeling to the realm of causal inference.
Causal inference aims to determine the effect of changing one variable (\( X \), the "treatment" or "cause") on another variable (\( Y \), the "outcome" or "effect"). This involves asking counterfactual questions: "What would have happened to Y if X had been different?" Answering such questions is crucial for effective decision-making in fields like medicine (Does this drug work?), policy (Does this program achieve its goal?), and business (Does this ad campaign drive sales?). We want to understand the effect of an intervention, represented notationally as \( P(Y|do(X=x)) \), which is distinct from the observational probability \( P(Y|X=x) \) (Formula 1). While Randomized Controlled Trials (RCTs) are the ideal way to establish causality, they are often impractical. This article explores frameworks and methods, including those enhanced by machine learning, for inferring causal effects primarily from observational data.
Fig 1: Correlation can arise from a common cause (confounder), while causation implies a direct influence.
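To make this concrete, here is a minimal simulation of the ice-cream example above, with invented coefficients: a shared cause (temperature) produces a strong correlation between two variables that do not affect each other, and holding the confounder roughly fixed makes the association largely vanish.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

temperature = rng.normal(25, 5, n)                    # the common cause (confounder)
ice_cream_sales = 2.0 * temperature + rng.normal(0, 3, n)
drownings = 0.5 * temperature + rng.normal(0, 3, n)   # no arrow from sales to drownings

print(np.corrcoef(ice_cream_sales, drownings)[0, 1])  # strong positive correlation

band = (temperature > 24) & (temperature < 26)        # hold the confounder roughly fixed
print(np.corrcoef(ice_cream_sales[band], drownings[band])[0, 1])  # much closer to zero
```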
The Potential Outcomes framework formalizes causal reasoning using counterfactuals. For each unit \( i \) (e.g., a patient) and a binary treatment \( T \in \{0, 1\} \), we imagine two potential states of the world: \( Y_i(1) \), the outcome unit \( i \) would experience if treated (Formula 2), and \( Y_i(0) \), the outcome it would experience if untreated (Formula 3).
The Individual Treatment Effect (ITE) is the difference: \( \tau_i = Y_i(1) - Y_i(0) \) (Formula 4). However, we face the Fundamental Problem of Causal Inference: we only ever observe one potential outcome for each unit. The observed outcome is \( Y_i^{obs} = T_i Y_i(1) + (1-T_i) Y_i(0) \) (Formula 5). The other outcome remains unseen (counterfactual).
Therefore, we usually estimate average effects: the Average Treatment Effect, \( ATE = E[Y(1) - Y(0)] \) (Formula 6), and the Average Treatment Effect on the Treated, \( ATT = E[Y(1) - Y(0) \mid T=1] \) (Formula 7).
This framework relies on the Stable Unit Treatment Value Assumption (SUTVA), meaning no interference between units and only one version of the treatment.
Fig 2: Only one potential outcome (Y(0) or Y(1)) is observed for any individual unit.
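A small sketch, with made-up numbers, of what the fundamental problem looks like in data: the simulation knows both potential outcomes for every unit, but the observed dataset reveals only one of them.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5

y0 = rng.normal(10, 2, n)            # Y_i(0): outcome without treatment
y1 = y0 + 3 + rng.normal(0, 1, n)    # Y_i(1): outcome under treatment
t = rng.integers(0, 2, n)            # observed treatment assignment

ite = y1 - y0                        # tau_i = Y_i(1) - Y_i(0), never fully observable
y_obs = t * y1 + (1 - t) * y0        # Formula 5: only one potential outcome is seen

print("true ATE (needs both columns):", ite.mean())
for i in range(n):
    unseen = y0[i] if t[i] == 1 else y1[i]
    print(f"unit {i}: T={t[i]}, Y_obs={y_obs[i]:.1f}, counterfactual={unseen:.1f} (never observed)")
```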
Structural Causal Models use Directed Acyclic Graphs (DAGs) to visually represent assumptions about causal relationships between variables (nodes connected by directed edges). Key structures include chains, forks (confounding), and colliders. The do-operator, \( do(X=x) \), represents setting variable \( X \) to value \( x \) via intervention. DAGs help identify if a causal effect is identifiable from observational data using criteria like the back-door criterion.
Fig 3: A DAG showing Treatment (T), Outcome (Y), and Confounder (Z). Adjusting for Z blocks the non-causal path.
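As a sketch of back-door adjustment under this DAG (with an invented data-generating process and a true effect of 2.0), stratifying on \( Z \) and averaging the within-stratum differences over \( P(Z) \) recovers the causal effect that the naive difference in means gets wrong.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

z = rng.integers(0, 2, n)                          # binary confounder
t = rng.binomial(1, 0.2 + 0.6 * z)                 # Z makes treatment more likely
y = 2.0 * t + 3.0 * z + rng.normal(0, 1, n)        # true causal effect of T is 2.0

naive = y[t == 1].mean() - y[t == 0].mean()        # biased: mixes in the effect of Z

# Back-door adjustment: average E[Y|T=1,Z=z] - E[Y|T=0,Z=z] weighted by P(Z=z)
adjusted = sum(
    (y[(t == 1) & (z == v)].mean() - y[(t == 0) & (z == v)].mean()) * (z == v).mean()
    for v in (0, 1)
)
print(f"naive: {naive:.2f}, adjusted: {adjusted:.2f}, truth: 2.00")
```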
By randomly assigning treatment \( T \), RCTs ensure \( T \) is independent of pre-treatment factors (\( T \perp (Y(1), Y(0), Z, U) \)). This allows direct estimation of ATE: \( ATE = E[Y|T=1] - E[Y|T=0] \) (Formula 8).
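A small illustration with simulated data: when \( T \) is assigned by a coin flip, the raw difference in means recovers the true effect even though a covariate still influences the outcome. The data-generating process is invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

z = rng.integers(0, 2, n)                      # a pre-treatment covariate
t = rng.binomial(1, 0.5, n)                    # randomized, independent of z
y = 2.0 * t + 3.0 * z + rng.normal(0, 1, n)    # true effect of t is 2.0

ate_hat = y[t == 1].mean() - y[t == 0].mean()  # ATE = E[Y|T=1] - E[Y|T=0] (Formula 8)
print(ate_hat)                                 # approximately 2.0 despite ignoring z
```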
Without randomization, we must account for confounders \( Z \). Under conditional ignorability (all confounders are measured, so treatment is as good as random given them) and positivity, the causal effect can be recovered by back-door adjustment, outcome regression, or propensity score methods such as inverse probability of treatment weighting (IPTW), where the propensity score is \( e(x) = P(T=1|X=x) \).
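One possible sketch of the propensity-score route, using a logistic-regression propensity model from scikit-learn and an invented data-generating process with a true ATE of 2.0:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
n = 50_000

x = rng.normal(0, 1, (n, 1))                               # measured confounder
t = rng.binomial(1, 1 / (1 + np.exp(-1.5 * x[:, 0])))      # confounded treatment
y = 2.0 * t + 3.0 * x[:, 0] + rng.normal(0, 1, n)          # true ATE is 2.0

e = LogisticRegression().fit(x, t).predict_proba(x)[:, 1]  # propensity scores e(x)
ate_iptw = np.mean(t * y / e) - np.mean((1 - t) * y / (1 - e))  # IPTW estimator
print(ate_iptw)                                            # close to 2.0
```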
Quasi-experimental designs offer identification strategies when key confounders are unmeasured: instrumental variables (IV), regression discontinuity designs (RDD), and difference-in-differences (DiD).
Fig 4: IV setup assumes Z affects T, T affects Y, U affects T and Y, but Z is independent of U and only affects Y through T.
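A minimal sketch of IV estimation under the structure in Fig 4, with invented coefficients: ordinary regression of \( Y \) on \( T \) is biased by the unobserved confounder \( U \), while the Wald/IV ratio \( Cov(Z, Y) / Cov(Z, T) \) recovers the true effect.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000

u = rng.normal(0, 1, n)                       # unobserved confounder
z = rng.binomial(1, 0.5, n)                   # instrument (e.g. random encouragement)
t = 0.8 * z + 0.6 * u + rng.normal(0, 1, n)   # Z and U both drive treatment intensity
y = 2.0 * t + 1.5 * u + rng.normal(0, 1, n)   # true effect of T is 2.0; U biases OLS

ols = np.cov(t, y)[0, 1] / np.var(t, ddof=1)  # biased by the unobserved confounder
iv = np.cov(z, y)[0, 1] / np.cov(z, t)[0, 1]  # Wald / IV estimator
print(f"OLS: {ols:.2f}, IV: {iv:.2f}, truth: 2.00")
```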
Fig 5: DiD estimates the treatment effect by comparing outcome changes, assuming parallel pre-treatment trends.
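A minimal DiD sketch with simulated panel-style data: a time-invariant group difference and a common time trend both cancel in the double difference, leaving the treatment effect (set to 2.0 here). All numbers are invented.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 20_000

group = rng.integers(0, 2, n)          # 1 = eventually treated group
post = rng.integers(0, 2, n)           # 1 = after the intervention
treated = group * post                 # treatment applies only to group 1, post period
y = (1.0 * group                       # time-invariant group difference (confounder)
     + 0.5 * post                      # common time trend shared by both groups
     + 2.0 * treated                   # true treatment effect
     + rng.normal(0, 1, n))

did = ((y[(group == 1) & (post == 1)].mean() - y[(group == 1) & (post == 0)].mean())
       - (y[(group == 0) & (post == 1)].mean() - y[(group == 0) & (post == 0)].mean()))
print(did)                             # close to 2.0
```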
ML enhances causal inference in two main ways: flexibly estimating nuisance components such as propensity scores and outcome regressions, and estimating heterogeneous effects via the Conditional Average Treatment Effect, \( \tau(x) = E[Y(1) - Y(0) \mid X=x] \), using meta-learners such as the S-learner (a single outcome model with the treatment as an input feature) and the T-learner (separate outcome models for treated and control units).
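As an illustration of a T-learner (the gradient-boosting models and the simulated heterogeneous effect are arbitrary choices here, not prescriptions), fit separate outcome models on treated and control units and take their difference as the CATE estimate:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(7)
n = 20_000

x = rng.uniform(-2, 2, (n, 1))
t = rng.binomial(1, 0.5, n)                              # randomized for simplicity
tau_true = 1.0 + x[:, 0]                                 # effect varies with x
y = 0.5 * x[:, 0] + tau_true * t + rng.normal(0, 1, n)

mu1 = GradientBoostingRegressor().fit(x[t == 1], y[t == 1])   # model for E[Y | T=1, X]
mu0 = GradientBoostingRegressor().fit(x[t == 0], y[t == 0])   # model for E[Y | T=0, X]
tau_hat = mu1.predict(x) - mu0.predict(x)                     # CATE estimate tau(x)

print(np.corrcoef(tau_hat, tau_true)[0, 1])                   # should be close to 1
```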
Method | Data Type | Key Assumption(s) | Handles Unobserved Confounding? |
---|---|---|---|
RCT | Experimental | Successful Randomization | Yes |
Adjustment / PS Methods | Observational | Conditional Ignorability (All confounders measured), Positivity | No |
Instrumental Variable (IV) | Observational | Relevance, Exclusion Restriction, Independence | Yes (if assumptions hold) |
Regression Discontinuity (RDD) | Observational | Continuity of potential outcomes near cutoff | Yes (locally around cutoff) |
Difference-in-Differences (DiD) | Observational (Panel) | Parallel Trends | Yes (for time-invariant confounders) |
Causal inference provides the essential framework for moving beyond correlation to understand cause-and-effect, crucial for informed decision-making. By leveraging frameworks like Potential Outcomes and Structural Causal Models, and applying identification strategies such as adjustment based on the back-door criterion, instrumental variables, RDD, or DiD (often enhanced by ML techniques for estimating nuisance components or heterogeneous effects), we can rigorously estimate causal impacts even from observational data. However, this requires careful consideration of underlying assumptions and potential biases, particularly unobserved confounding. Integrating the predictive power of ML with the inferential rigor of causal methods is key to moving from simply observing patterns to understanding the mechanisms that drive them.