Extra L9: Prediction, Explanation, Causality

Companion notebook to Lecture 9: Machine Learning for Text & LLMs


Why this notebook

L9 introduced supervised classifiers, zero-shot transformers, and LLM APIs as tools to extract structured signals from text. This notebook is about how to interpret those signals.

None of the tools we use in L9 was designed for causal inference. They were built to predict labels. The standard ML evaluation pipeline (held-out test set, F1, accuracy) tells you how well a model predicts, not whether the prediction maps to the construct you actually care about in your research question.

Three claims this notebook defends:

  1. A model can predict perfectly and still tell you nothing about the world. (Worked example 1.)
  2. An LLM-based measure is a moving target: its calibration changes with the model version and the time period of application. (Worked example 2.)
  3. The most honest research design often uses ML as a measurement tool followed by classical inference, not as a substitute for it. (Worked example 3 + framework.)

Reading

Three core references frame this whole discussion:

  • Mullainathan, S. & Spiess, J. (2017). Machine Learning: an Applied Econometric Approach. Journal of Economic Perspectives 31(2), 87-106.
  • Kleinberg, J., Ludwig, J., Mullainathan, S. & Obermeyer, Z. (2015). Prediction Policy Problems. American Economic Review 105(5), 491-495.
  • Athey, S. & Imbens, G. (2019). Machine Learning Methods That Economists Should Know About. Annual Review of Economics 11, 685-725.

Specific to text-as-data:

  • Gentzkow, M., Kelly, B. & Taddy, M. (2019). Text as Data. Journal of Economic Literature 57(3), 535-574.
  • Hansen, S., McMahon, M. & Prat, A. (2018). Transparency and Deliberation Within the FOMC: A Computational Linguistics Approach. Quarterly Journal of Economics 133(2), 801-870.
  • Loughran, T. & McDonald, B. (2011). When Is a Liability Not a Liability? Textual Analysis, Dictionaries, and 10-Ks. Journal of Finance 66(1), 35-65.

1. Three concepts, three different questions

| Concept | Question it answers | Sufficient evidence |
| --- | --- | --- |
| Prediction | What is the expected $y$ given $x$? | High held-out test accuracy / low MSE |
| Explanation | Which features drive a model's predictions? | Stable coefficients, partial dependence, SHAP |
| Causal inference | What would $y$ be if we intervened on $x$? | Exogenous variation, identification strategy |

The three are technically distinct — Mullainathan & Spiess (JEP 2017) make this point at length. A model can be a state-of-the-art predictor, a transparent explainer of its own predictions, and still be useless for answering "what happens if we change $x$?".

The mistake to avoid in your thesis: writing "we use ML to identify the most important determinants of $y$" and treating that sentence as if it were a causal claim. It is not.
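A small simulation can make the gap between prediction and causation concrete. The sketch below is not part of L9: the data-generating process, the variable names (`z`, `x_obs`, `x_rct`) and the helper `ols_slope` are assumptions chosen purely for illustration. A hidden confounder `z` drives both `x` and `y`, while the true causal effect of `x` on `y` is zero by construction.

```python
import random

random.seed(0)

def ols_slope(x, y):
    """Slope from a one-regressor OLS fit of y on x (with intercept)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var = sum((a - mx) ** 2 for a in x)
    return cov / var

n = 5000
# z is an unobserved confounder that drives both x and y.
z = [random.gauss(0, 1) for _ in range(n)]

# Observational data: x tracks z closely, y depends on z only.
# The true causal effect of x on y is ZERO.
x_obs = [zi + random.gauss(0, 0.3) for zi in z]
y_obs = [2.0 * zi + random.gauss(0, 0.5) for zi in z]
slope_obs = ols_slope(x_obs, y_obs)   # prediction question: x predicts y well

# Intervention: x is assigned at random, independently of z.
# y is generated exactly as before, because it never depended on x.
x_rct = [random.gauss(0, 1) for _ in range(n)]
y_rct = [2.0 * zi + random.gauss(0, 0.5) for zi in z]
slope_rct = ols_slope(x_rct, y_rct)   # causal question: what if we set x?

print(f"observational slope : {slope_obs:.2f}")
print(f"interventional slope: {slope_rct:.2f}")
```

The observational slope is large because `x` is a good proxy for `z`; the interventional slope is close to zero because changing `x` does nothing to `y`. A model fitted on the observational data would score well on a held-out test set and still give the wrong answer to the causal question.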

In [ ]:
# Imports - we use the same stack as L9
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline

np.random.seed(42)
print("Ready.")

2. Worked example 1: Leakage in text classification

In L9 we trained a Logistic Regression on TF-IDF features to classify left-vs-right political documents from the synthetic manifesto dataset, achieving an F1 of about 0.92. Suppose someone now wants to use this classifier as a causal tool: take a real corpus of party manifestos over time, classify each, and regress economic policy outcomes on the predicted left/right share.

What can go wrong? Several things, but the cleanest illustration is leakage — a feature that the classifier exploits to predict the label but that does not carry the substantive meaning the researcher thinks it does.

Here we simulate a realistic version of the problem: we add a feature ("year mention") that is mechanically correlated with the label because of how the corpus was constructed, not because of political content.

In [ ]:
# Reconstruct the manifesto dataset from L9, with a twist:
# documents labelled 'left' are also more likely to mention recent years
# (simulating a research design where the labelled training data over-samples
# documents from the post-2015 period for one class).

np.random.seed(42)
n = 300

left_phrases = [
    "workers should have stronger protections",
    "public investment in healthcare is needed",
    "inequality has grown too much",
    "social safety nets need strengthening",
    "the wealthy should contribute more in taxes",
    "collective bargaining helps wage growth",
    "regulation can prevent market failures",
    "the welfare state needs reinforcement",
]
right_phrases = [
    "markets generally allocate resources well",
    "government spending should be controlled",
    "lower taxes can stimulate growth",
    "the private sector drives most innovation",
    "fiscal discipline matters for credibility",
    "competition policy benefits consumers",
    "property rights underpin investment",
    "labor market flexibility supports employment",
]
shared_phrases = [
    "the economy faces multiple challenges",
    "policy tradeoffs need to be considered",
    "data should inform decision making",
    "long term outcomes matter",
]

def make_doc_with_year(primary_phrases, label):
    n_total = np.random.randint(8, 14)
    n_primary = max(1, int(n_total * 0.55))
    n_shared = max(1, int(n_total * 0.30))
    parts = list(np.random.choice(primary_phrases, n_primary, replace=True))
    parts += list(np.random.choice(shared_phrases, n_shared, replace=True))
    # LEAKAGE: left-labelled docs over-sample recent years
    if label == "left":
        year = np.random.choice(["2022", "2023", "2024"], p=[0.3, 0.4, 0.3])
    else:
        year = np.random.choice(["2010", "2012", "2015"], p=[0.4, 0.3, 0.3])
    parts.append(f"in {year}")
    np.random.shuffle(parts)
    return " ".join(parts)

texts  = ([make_doc_with_year(left_phrases, "left")   for _ in range(n//2)] +
          [make_doc_with_year(right_phrases, "right") for _ in range(n//2)])
labels = ["left"] * (n//2) + ["right"] * (n//2)

manifesto = pd.DataFrame({"text": texts, "orientation": labels})
manifesto = manifesto.sample(frac=1, random_state=42).reset_index(drop=True)

print(f"Dataset: {len(manifesto)} documents")
print(manifesto.head(3)[["text"]].to_string())
In [ ]:
# Train the L9-style classifier and look at top features
X_train, X_test, y_train, y_test = train_test_split(
    manifesto["text"], manifesto["orientation"],
    test_size=0.2, random_state=42, stratify=manifesto["orientation"])

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=2000, ngram_range=(1, 2),
                               min_df=2, stop_words="english", sublinear_tf=True)),
    ("clf",   LogisticRegression(C=1.0, max_iter=1000, random_state=42)),
])
pipe.fit(X_train, y_train)
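# pipe.score() returns accuracy, so compute the F1 score explicitly
# (macro-averaged over the two classes).
from sklearn.metrics import f1_score
print("F1 score (test):", round(f1_score(y_test, pipe.predict(X_test), average="macro"), 3))

# Top features per class.
# classes_ is sorted alphabetically (['left', 'right']), so coef_[0] scores the
# 'right' class: positive weights push towards 'right', negative towards 'left'.
fnames = pipe.named_steps["tfidf"].get_feature_names_out()
coefs  = pipe.named_steps["clf"].coef_[0]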
coef_df = pd.DataFrame({"feature": fnames, "coef": coefs}).sort_values("coef")

print("\nTop 8 'right' features (positive coefficients):")
print(coef_df.tail(8).to_string(index=False))
print("\nTop 8 'left' features (negative coefficients):")
print(coef_df.head(8).to_string(index=False))

📊 What just happened. The classifier achieves high F1 (~0.95). But scan the top features: among the most predictive right-leaning terms appear 2010, 2012, 2015; among the most predictive left-leaning terms appear 2022, 2023, 2024. The model learned that recent years signal "left" and older years signal "right" — because we engineered that correlation into the training data.

A researcher who used the predicted labels naively to build a "leftward drift" index over time would document a perfectly real predictive pattern that has nothing to do with political content. It would track when the training data was sampled, not how political language has evolved.

This is leakage in its purest form, and it is much more common than one might think:

  • Time-stamped corpora where the label is correlated with the period (firms with negative earnings are more frequent in recessions; central banks shift tone with the cycle).
  • Authors with personal signatures (one labelled-as-positive analyst always writes "in conclusion"; the model learns "in conclusion" → positive).
  • Source-specific quirks: documents from one journal tend to use a footer template that contains the journal name; the model classifies "journal of finance" papers as one class.

The audit step is non-negotiable. Always inspect the top features of any text classifier you intend to use as input to a downstream analysis. If the top features are content-meaningful, you have a measure. If they are dates, source identifiers, or formatting artefacts, you have a leakage detector.
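Part of this audit can be automated. The sketch below is one possible helper, not something from L9: the function name `audit_top_features` and the regular expressions in `SUSPICIOUS_PATTERNS` are hypothetical, and the patterns would need adapting to your corpus. It ranks features by absolute coefficient and flags those that look like years, bare numbers, or URL fragments.

```python
import re

# Hypothetical patterns for tokens that rarely carry substantive meaning.
# Order matters: the first matching pattern is reported.
SUSPICIOUS_PATTERNS = {
    "year": re.compile(r"\b(19|20)\d{2}\b"),
    "number": re.compile(r"\b\d+\b"),
    "url_fragment": re.compile(r"(https?|www)"),
}

def audit_top_features(features, coefs, top_k=20):
    """Return (feature, coef, reason) for suspicious features among the
    top_k features ranked by absolute coefficient."""
    ranked = sorted(zip(features, coefs), key=lambda fc: abs(fc[1]), reverse=True)
    flagged = []
    for name, coef in ranked[:top_k]:
        for reason, pattern in SUSPICIOUS_PATTERNS.items():
            if pattern.search(name):
                flagged.append((name, round(float(coef), 3), reason))
                break  # one reason per feature is enough
    return flagged

# Invented toy input, shaped like the (feature, coefficient) pairs of a linear model.
toy_features = ["workers", "2023", "taxes", "2010 markets", "welfare", "www example"]
toy_coefs = [-1.2, -2.1, 0.4, 1.9, -0.8, 0.1]

flagged = audit_top_features(toy_features, toy_coefs, top_k=4)
for name, coef, reason in flagged:
    print(f"{name:15s} coef={coef:+.2f}  flagged as: {reason}")
```

On the classifier trained above, the same call would be `audit_top_features(fnames, coefs, top_k=20)`. A non-empty result is a prompt to read those features by hand. An empty result does not prove the absence of leakage, since author signatures and boilerplate phrases will not match simple patterns.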


3. Worked example 2: Calibration drift in LLM-based measures

In L9 we used BART-MNLI for zero-shot classification: pass a document, a list of candidate labels, and the model returns a probability distribution over those labels. We applied it to ECB statements to extract hawkish / dovish / neutral stances.

The temptation in research is to treat BART-MNLI's output as a measurement instrument, like a thermometer: a fixed device that maps language onto a numeric scale. If you take readings for ECB statements from 2019 and from 2024 and compare them, you appear to have a panel of stance scores.

The problem is that BART-MNLI is not a thermometer. It is a neural network trained on a corpus that has a publication cutoff, applied to text that may come from a period it has seen, partially seen, or never seen. The score distribution can drift between periods for reasons that have nothing to do with the underlying construct.

This is the construct validity problem from psychometrics (Cronbach & Meehl 1955) reapplied to LLM-as-measurement.

In [ ]:
# Simulate calibration drift: same construct (hawkishness), two periods,
# two ways of saying the same thing.

period_2010 = [
    "The Committee continues to anticipate that economic conditions are likely "
    "to warrant exceptionally low levels for the federal funds rate.",
    "Inflation pressures appear contained and inflation expectations remain stable.",
    "The Committee is prepared to provide additional accommodation as needed.",
]

period_2023 = [
    "The Committee is strongly committed to returning inflation to its 2 percent objective.",
    "In determining the extent of additional policy firming, the Committee will consider "
    "the cumulative tightening of monetary policy.",
    "The Committee anticipates that ongoing increases in the target range will be appropriate.",
]

# Pretend we run BART-MNLI on these (we use a placeholder function here
# so the notebook runs without downloading the model - swap for real `zs_classifier`
# from L9 to reproduce).

# To run for real, uncomment:
# from transformers import pipeline as hf_pipeline
# zs = hf_pipeline("zero-shot-classification", model="facebook/bart-large-mnli", device=-1)
# labels = ["hawkish monetary policy", "dovish monetary policy", "neutral statement"]
# for t in period_2010 + period_2023:
#     r = zs(t, candidate_labels=labels)
#     print(f"{r['labels'][0]:30s} {r['scores'][0]:.2f}  {t[:60]}...")

# Below: typical results based on running this in L9.
# Period 2010 (Bernanke-era dovish language):
#   dovish monetary policy            0.65    "exceptionally low levels..."
#   neutral statement                 0.71    "Inflation pressures appear..."
#   dovish monetary policy            0.58    "additional accommodation..."
#
# Period 2023 (Powell-era hawkish language):
#   hawkish monetary policy           0.91    "strongly committed to returning..."
#   hawkish monetary policy           0.85    "additional policy firming..."
#   hawkish monetary policy           0.88    "ongoing increases..."

print("Period 2010 (dovish era) - average dovish score: ~0.62")
print("Period 2023 (hawkish era) - average hawkish score: ~0.88")
print()
print("Apparent conclusion: BART measures a 26-percentage-point shift")
print("toward hawkishness from 2010 to 2023.")

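📊 The construct-validity problem. The 26-point shift looks like a clean measurement of policy stance evolution. Note first that it compares two different quantities (average dovish confidence in 2010, average hawkish confidence in 2023), so it is not a change along a single scale. Beyond that, three confounds make this number difficult to interpret causally: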

  1. Vocabulary drift. Central-bank communication style itself changed: the post-2020 Fed uses more active, directional language (firming, cumulative tightening) than the post-2008 Fed (anticipate, continues to). BART-MNLI tends to be more confident about directional language, plausibly because that is where its training corpus encountered "hawkish" verbs. The score reflects how clearly hawkishness is communicated, not necessarily how hawkish the stance is.
  2. BART's training data cutoff. facebook/bart-large-mnli was released in 2020. The "hawkish" / "dovish" semantic priors it learned are dominated by pre-2020 financial journalism. Post-2020 language may sit outside its training distribution.
  3. Compositional shift in the corpus. If the statements from 2023 are longer or more numerically detailed than those from 2010, BART processes a different kind of input. Its scores are not normalised against document length or specificity.

The methodological corollary. If you use BART-MNLI scores as a dependent or independent variable, you need to defend the assumption that the score's meaning is time-invariant. The standard ways:

  • Validation against a hand-coded gold standard within each period (not just on the pooled sample). If BART agreement with humans is 85% in 2010 and 60% in 2023, your panel is unreliable.
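  • Anchoring vignettes: for each period, score a small set of reference texts that human coders agree are "obviously hawkish" / "obviously dovish" alongside your real data, and use the reference scores to detrend. Draw the anchors from each period's own language: a frozen BART checkpoint returns the same score for the same text on every run, so one fixed set of identical texts only reveals drift when the model itself changes (as with versioned LLM APIs).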
  • Multiple-model triangulation — run FinBERT in parallel. If FinBERT and BART agree on direction and magnitude of change, you have a more robust measure.

Without one of these defences, the 26-point shift is a prediction artefact, not evidence.
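The first defence above, validation within each period, is simple to compute once a hand-coded sample exists. The following is a minimal sketch under stated assumptions: the function `agreement_by_period` and the toy records are invented for illustration and are not results from L9.

```python
from collections import defaultdict

def agreement_by_period(records):
    """Share of documents where the model label equals the human label,
    computed separately for each period.

    records: iterable of (period, model_label, human_label) tuples.
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for period, model_label, human_label in records:
        totals[period] += 1
        hits[period] += int(model_label == human_label)
    return {p: hits[p] / totals[p] for p in sorted(totals)}

# Invented toy validation sample: (period, model label, hand-coded label).
toy = [
    (2010, "dovish",  "dovish"),
    (2010, "neutral", "dovish"),
    (2010, "dovish",  "dovish"),
    (2010, "hawkish", "hawkish"),
    (2023, "hawkish", "hawkish"),
    (2023, "hawkish", "neutral"),
    (2023, "hawkish", "dovish"),
    (2023, "hawkish", "hawkish"),
]

per_period = agreement_by_period(toy)
pooled = sum(m == h for _, m, h in toy) / len(toy)

print("pooled agreement    :", pooled)
print("agreement by period :", per_period)
```

In this toy sample the pooled agreement hides the fact that the model agrees with the coders less often in the later period. Reporting only the pooled figure would conceal exactly the drift this section is about.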


4. Worked example 3: LLM output as measurement, not as outcome

The right way to think about ML-on-text in causal research is as a measurement step, separate from inference. The pipeline:

  1. Measurement: use the ML tool (TF-IDF score, BART probability, LLM extraction) to construct a numeric variable from text.
  2. Validation: hand-code a sample; report agreement / Cohen's kappa / Krippendorff's alpha.
  3. Inference: use the measured variable in a standard econometric specification with explicit identification.
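For the validation step, Cohen's kappa corrects raw agreement for the agreement two raters would reach by chance: $\kappa = (p_o - p_e) / (1 - p_e)$, where $p_o$ is observed agreement and $p_e$ is expected chance agreement. The sketch below spells the formula out in plain Python so that each term is visible. The function name `cohen_kappa` and the ten labels are illustrative assumptions, not data from L9.

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("Need two non-empty label sequences of equal length.")
    n = len(rater_a)
    # Observed agreement: share of items on which the two raters agree.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label shares, summed over labels.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a.keys() | count_b.keys()) / n**2
    if p_e == 1.0:
        return float("nan")  # both raters used one identical label: kappa is undefined
    return (p_o - p_e) / (1 - p_e)

# Invented toy sample: model labels vs hand-coded labels for 10 documents.
model = ["hawkish", "hawkish", "dovish", "dovish", "neutral",
         "hawkish", "dovish", "neutral", "hawkish", "dovish"]
human = ["hawkish", "dovish", "dovish", "dovish", "neutral",
         "hawkish", "neutral", "neutral", "hawkish", "dovish"]

kappa = cohen_kappa(model, human)
print(f"Cohen's kappa: {kappa:.3f}")
```

In practice `sklearn.metrics.cohen_kappa_score` computes the same statistic. Whichever implementation you use, report kappa together with the disagreement cases, since those show where the measure fails.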

Where this breaks down in practice: the ML output is treated as if it had no measurement error, and dropped directly into a regression. The reported standard errors then ignore the uncertainty in the measure itself, so they are misleadingly small. With classical measurement error in a regressor, the coefficient is also biased toward zero (attenuation).
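The cost of dropping a noisy measure straight into a regression can be simulated in a few lines. This is a sketch under textbook assumptions (classical, independent measurement error in the regressor). The parameter values and the names `x_true`, `x_ml`, `sigma_u` are invented for illustration. Under these assumptions the OLS slope shrinks by the reliability ratio $\mathrm{Var}(x) / (\mathrm{Var}(x) + \mathrm{Var}(u))$.

```python
import random

random.seed(1)

def ols_slope(x, y):
    """Slope from a one-regressor OLS fit of y on x (with intercept)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var = sum((a - mx) ** 2 for a in x)
    return cov / var

n = 20000
beta = 1.0      # true effect of the latent construct on y
sigma_u = 1.0   # std. dev. of the ML measure's error (assumed classical)

# Latent construct (e.g. true stance) and the outcome it drives.
x_true = [random.gauss(0, 1) for _ in range(n)]
y = [beta * xi + random.gauss(0, 1) for xi in x_true]

# What the researcher observes: the construct plus independent noise.
x_ml = [xi + random.gauss(0, sigma_u) for xi in x_true]

slope_true = ols_slope(x_true, y)
slope_ml = ols_slope(x_ml, y)
reliability = 1.0 / (1.0 + sigma_u ** 2)   # Var(x) / (Var(x) + Var(u)) with Var(x) = 1

print(f"slope using the true construct: {slope_true:.2f}")
print(f"slope using the noisy measure : {slope_ml:.2f}")
print(f"predicted attenuated slope    : {beta * reliability:.2f}")
```

The error of a real text classifier is unlikely to be classical, because it can correlate with period, source, or the outcome itself, in which case the bias can go in either direction. The simulation shows only the most benign case.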

Two papers that do this well — read them as templates:

  • Hansen, McMahon & Prat (QJE 2018) measure FOMC member "deliberation" through LDA topic shares. The contribution is that they exploit a clean institutional shift (the 1993 Greenspan transparency reform) for identification after the textual measure is constructed. The measure is the input, the natural experiment is the design.
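  • Baker, Bloom & Davis (QJE 2016) construct the EPU index from newspaper text before they ever run a VAR. The index construction is documented in obsessive detail and validated against hand-coded subsamples, and the causal language is kept cautious: the firm-level results exploit differences in firms' exposure to government purchases, and the VAR results are read as innovations in policy uncertainty that foreshadow declines in activity, not as proof that the index itself is causal.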
In [ ]:
# A schematic of the right workflow vs the wrong workflow.

print("WRONG WORKFLOW (will not survive refereeing):")
print("-" * 60)
print('  predicted_stance = bart_mnli(ecb_text)')
print('  reg(bond_yield ~ predicted_stance, data=panel)')
print('  -> claim: "hawkish ECB communication raises yields"')
print()
print("RIGHT WORKFLOW:")
print("-" * 60)
print('  # Step 1 - Measurement')
print('  predicted_stance = bart_mnli(ecb_text)')
print()
print('  # Step 2 - Validation')
print('  hand_coded = read_and_score(sample_of_50_docs)')
print('  kappa = cohen_kappa(predicted_stance[sample], hand_coded)')
print('  # report kappa, discuss disagreement cases')
print()
print('  # Step 3 - Identification')
print('  # Either:')
print('  #   (a) event study around exogenous communication events')
print('  #   (b) IV: use BART score interacted with pre-period institutional features')
print('  #   (c) honest descriptive: "we document that..." not "we estimate that..."')
print()
print('  reg(bond_yield ~ predicted_stance + controls,')
print('      data=panel,')
print('      design=event_study,        # or IV or honest descriptive')
print('      report=robust_to_measurement_error)')

📊 Why this matters for your thesis. Many strong empirical projects fail at the interpretation step, not at the measurement step. A referee will accept:

  • "We construct a measure of $X$ using method $M$. Validation against a hand-coded sample gives $\kappa = 0.71$. We document descriptive patterns of $X$ over time."

A referee will reject:

  • "We use method $M$ to identify the causal effect of $X$ on $Y$."

unless $M$ produces variation that is plausibly exogenous to $Y$. ML methods almost never produce such variation by themselves — they extract a correlation-loaded signal from text. The exogenous variation has to come from elsewhere in your design (event study, IV, RD, DiD on an institutional change). ML provides the dependent variable or the regressor of interest, never the identification strategy.


5. A validity framework for ML-on-text measures

Borrowing from psychometrics (Cronbach & Meehl 1955; Messick 1989), three forms of validity to defend explicitly:

| Validity type | Question | How to defend |
| --- | --- | --- |
| Construct | Does the score measure the latent concept I claim? | Hand-coded gold standard; multiple-model triangulation |
| Internal | Does variation in the score reflect variation in the construct, not in incidental features? | Subsample by document length, source, period; check stability |
| External | Does the measure generalise beyond the validation sample? | Out-of-domain test (e.g. a measure validated on Fed text is re-checked on a hand-coded ECB sample before being applied there) |

These are the questions a thoughtful reader will ask of any text-based measure. Anticipate them in your thesis.
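The internal-validity row lends itself to a quick diagnostic: check whether the score moves with an incidental feature such as document length. The sketch below uses invented toy numbers and a hypothetical helper `pearson`. The 0.5 threshold is an arbitrary assumption for illustration, not a recommended cut-off.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Invented toy data: model score and token count for six documents.
hawkish_score = [0.20, 0.35, 0.50, 0.65, 0.80, 0.90]
n_tokens      = [120, 180, 260, 300, 410, 520]

r_length = pearson(hawkish_score, n_tokens)
print(f"correlation(score, length) = {r_length:.2f}")

# Arbitrary illustrative threshold: a strong correlation is a prompt to look
# closer, not proof that the score measures length instead of stance.
if abs(r_length) > 0.5:
    print("Score co-moves with document length: check stability within length bins.")
```

A high correlation is not damning on its own, because longer statements may genuinely be more hawkish. The follow-up is to check whether the relationship between the score and the hand codes holds within length bins, sources, and periods.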

Additional reading on validity in computational social science:

  • Grimmer, J., Roberts, M. E. & Stewart, B. M. (2022). Text as Data: A New Framework for Machine Learning and the Social Sciences. Princeton University Press. — Chapter 15 specifically on validation.
  • Egami, N., Hauser, J. M., Imai, K., Stewart, B. M., Sun, R. & Wang, T. Y. (2023). Using Large Language Models for Survey Research Coding. — Discusses LLMs as measurement instruments and the bias they introduce.
  • DiMaggio, P. (2015). Adapting Computational Text Analysis to Social Science (and Vice Versa). Big Data & Society 2(2).

6. Exercise: "What would a referee ask?"

For each of the following research statements, write 3-5 sentences identifying:

  1. The implicit causal claim.
  2. The most plausible threat to its validity (omitted variable / leakage / drift / lack of construct validity).
  3. The minimum robustness check needed to weaken that threat.

Statement A

"We train a logistic regression to classify ECB press releases as hawkish or dovish using a hand-labelled sample of 50 documents. Applying the classifier to the full 2000-2024 archive, we find that hawkishness Granger-causes 10-year bund yields."

Statement B

"We use BART-MNLI to score the sentiment of newspaper coverage of climate policy. We document that countries with more negative coverage have lower carbon tax adoption rates."

Statement C

"We use Claude (claude-haiku-4-5) to extract structured features from earnings call transcripts. Earnings calls that the LLM flags as 'confident' predict positive abnormal returns three days later, controlling for industry and size."

Write your answers in the cells below. (In class we discuss; at home, this is the rehearsal for your own defence.)

Your answer — Statement A:

[write 3-5 sentences here]

Your answer — Statement B:

[write 3-5 sentences here]

Your answer — Statement C:

[write 3-5 sentences here]


7. Optional extension

If you have time and an interest:

  1. Replicate the leakage diagnosis on your own corpus. Train a classifier on whatever labels you have (period dummies count). Inspect the top 20 features. Are any of them dates, source identifiers, formatting artefacts? If yes, document them honestly in your thesis appendix.

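  2. Run an anchoring vignette experiment. For each quarter (or era) of your sample, pick 5 sentences that are unambiguously hawkish, 5 unambiguously dovish, 5 neutral, written in that period's own language. Pass them through BART-MNLI alongside your real data. Plot the BART confidence on the anchors over time: if it drifts, the rest of your measures inherit that drift. (With a frozen checkpoint, re-scoring the very same sentences always returns the same numbers, so fixed anchors are informative only when the model changes.)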

  3. Compare two LLMs on the same task. Run Claude Haiku and GPT-4o-mini on 50 documents from your corpus with the same prompt. Cohen's kappa between their outputs is a cheap robustness check, although agreement between two models does not replace validation against human codes.

  4. Read and present: pick one of the references above (Mullainathan-Spiess JEP 2017 is the easiest entry point), summarise it in two slides, present in the next lecture.
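For the anchoring vignette experiment in item 2, the bookkeeping can be sketched as follows. This is a simple illustration under stated assumptions: the helpers `anchor_drift` and `detrend` are hypothetical, the scores are invented, and the additive correction (subtracting the shift in mean anchor score from the real scores) is only one possible choice.

```python
from statistics import mean

def anchor_drift(anchor_scores, baseline_period):
    """Mean anchor score in each period minus the mean in the baseline period.

    anchor_scores: {period: [model score for each anchor text of that period]}
    """
    base = mean(anchor_scores[baseline_period])
    return {p: mean(s) - base for p, s in sorted(anchor_scores.items())}

def detrend(raw_scores, drift):
    """Subtract each period's anchor drift from that period's raw scores
    (assumes the drift is a pure additive shift)."""
    return {p: [s - drift[p] for s in scores] for p, scores in raw_scores.items()}

# Invented toy numbers: hawkish-score on "obviously hawkish" anchors per period.
anchors = {
    "2010Q1": [0.80, 0.82, 0.78],
    "2023Q1": [0.90, 0.92, 0.88],
}
# Invented toy numbers: hawkish-score on the real documents of each period.
raw = {"2010Q1": [0.60], "2023Q1": [0.85]}

drift = anchor_drift(anchors, baseline_period="2010Q1")
adjusted = detrend(raw, drift)

print("anchor drift   :", {p: round(d, 3) for p, d in drift.items()})
print("adjusted scores:", {p: [round(s, 3) for s in v] for p, v in adjusted.items()})
```

In this toy case the raw gap of 0.25 between periods shrinks to 0.15 once the anchors' own drift of 0.10 is removed. Whether an additive shift is the right model of drift is itself an assumption to defend, for instance by checking that hawkish and dovish anchors drift by similar amounts.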


Wrap-up

The honest summary: ML for text is a measurement technology, not an inference engine. Treat it as you would treat a survey instrument — validate it, document it, defend it. The causal claims still have to come from your research design, not from the model.

The discipline of separating the three concepts (prediction, explanation, causality) is the most reliable way to write a thesis that survives refereeing.