Extra L8 — Sentiment, Topic Models, and Measurement Error¶

This notebook complements Lecture 8: Sentiment Analysis & Supervised ML for Text.

Goal¶

Treat sentiment scores and topic shares as noisy measurements rather than ground truth.

Why this matters for research¶

The main issue is not whether a model returns a label. The main issue is whether the label measures the concept you think it measures.

Lecture 8 § 6.B used confusion_matrix and classification_report to evaluate a Naive Bayes classifier — predicted labels vs true labels from a held-out test set. This notebook uses the same machinery for a different comparison: model output vs human coding. The shift in framing is small but consequential. When you trust the model output as your dependent variable, the confusion matrix is no longer just a diagnostic — it becomes the measurement-error matrix that determines whether your econometrics is biased.
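
To make the bias claim concrete, here is a minimal simulation sketch. All numbers (group shares, sensitivity, specificity) are invented for illustration, and the classifier's errors are assumed to be unrelated to the treatment. Under those assumptions, the measured gap in positive-label shares between two groups shrinks by the factor sensitivity + specificity − 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical set-up: a binary "treatment" shifts the share of positive texts.
treated = rng.integers(0, 2, size=n).astype(bool)
p_pos = np.where(treated, 0.60, 0.40)      # true P(positive) in each group
true_pos = rng.random(n) < p_pos           # latent human-coded label

# Assumed classifier error rates (illustrative only, not estimated from data).
sensitivity = 0.80   # P(model says pos | truly pos)
specificity = 0.70   # P(model says neg | truly neg)
u = rng.random(n)
model_pos = np.where(true_pos, u < sensitivity, u > specificity)

def gap(y):
    """Difference in the share of positive labels, treated minus control."""
    return y[treated].mean() - y[~treated].mean()

true_gap = gap(true_pos)
measured_gap = gap(model_pos)
predicted_gap = (sensitivity + specificity - 1) * true_gap

print(f"gap using human labels : {true_gap:.3f}")
print(f"gap using model labels : {measured_gap:.3f}")
print(f"attenuation prediction : {predicted_gap:.3f}")
```

The gap here is the same quantity as the OLS slope from regressing the label on a treatment dummy, so the shrinkage carries over directly to a regression with the model label as dependent variable.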

In [ ]:

import pandas as pd
from sklearn.metrics import confusion_matrix, classification_report

1. A toy sentiment evaluation dataset¶

Any measurement-error analysis starts from two parallel labellings of the same items. One is treated as the benchmark (true_label, here human-coded); the other is the candidate measure to evaluate (model_label, here a simple automated classifier).

The six sentences below are chosen to expose specific failure modes, chiefly negation and sarcasm. Try to spot which ones the model gets wrong before running the diagnostic cells.

In [ ]:

df = pd.DataFrame({
    "text": [
        "The reform is excellent and overdue",
        "The reform is not good for workers",
        "I am pleasantly surprised by the policy",
        "The proposal is weak and confusing",
        "This is not bad at all",
        "The minister loves publicity more than substance"
    ],
    "true_label": ["pos", "neg", "pos", "neg", "pos", "neg"],
    "model_label": ["pos", "pos", "pos", "neg", "neg", "pos"]
})
df

2. Confusion matrix¶

This is exactly the tool used in L8 § 6.B for the supervised classifier — but read differently. Here, rows are the human-coded "truth" and columns are the model's labels. Off-diagonal cells are measurement errors: cases where the model assigns a label that disagrees with the human coder.

In [ ]:

labels = ["pos", "neg"]
cm = confusion_matrix(df["true_label"], df["model_label"], labels=labels)
cm_df = pd.DataFrame(cm, index=[f"true_{x}" for x in labels], columns=[f"pred_{x}" for x in labels])
cm_df

In [ ]:

print(classification_report(df["true_label"], df["model_label"], labels=labels))
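
Reading the confusion matrix as a measurement-error matrix is easier once the rows are turned into rates. The sketch below repeats the toy labels so that it runs on its own and uses the normalize="true" option of confusion_matrix; the summary sensitivity + specificity − 1 (Youden's J) is an addition for illustration, not something computed in Lecture 8.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Same toy labels as in the DataFrame above, repeated for self-containment.
true_label = ["pos", "neg", "pos", "neg", "pos", "neg"]
model_label = ["pos", "pos", "pos", "neg", "neg", "pos"]
labels = ["pos", "neg"]

# Each row now sums to 1: P(model label | human label).
rates = confusion_matrix(true_label, model_label, labels=labels, normalize="true")
rates_df = pd.DataFrame(
    rates,
    index=[f"true_{x}" for x in labels],
    columns=[f"pred_{x}" for x in labels],
)
print(rates_df.round(3))

sensitivity = rates[0, 0]   # P(model says pos | human says pos)
specificity = rates[1, 1]   # P(model says neg | human says neg)
youden_j = sensitivity + specificity - 1
print(f"sensitivity={sensitivity:.3f}  specificity={specificity:.3f}  J={youden_j:.3f}")
```

On this six-item toy set J comes out at zero: the model says pos two times out of three whatever the human label is, so its output carries no information about the human coding. With so few items this is only illustrative.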

3. Diagnose the mistakes¶

Look closely at the misclassified rows. Each one tells you something about which concept the model is actually measuring — and where it diverges from the construct you wanted.

In [ ]:

errors = df[df["true_label"] != df["model_label"]].copy()
errors

Interpretation¶

Common failure modes you should recognise:

Negation (not good, not bad) — the dictionary or BoW model sees good/bad and assigns the wrong polarity. This is exactly the issue Lecture 7 Extra showed for cosine similarity, and it's the same mechanism here.
Sarcasm (loves publicity more than substance) — the literal valence is positive (loves), the meaning is negative. No dictionary catches this; even most supervised classifiers fail without context.
Domain mismatch — words like liability, tax, obligation are negative in everyday speech but neutral in finance. VADER's general-English calibration is the canonical example (L8 § 3 reading-the-output).
Context-dependent sentiment — unemployment is low reads negative if you don't understand it economically. L8 § 3 showed VADER scoring this sentence at compound −0.296 for exactly this reason.
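
The negation and sarcasm mechanisms can be sketched with a tiny word-counting scorer. The lexicon and negator list below are hypothetical, invented for this illustration (they are not VADER, Loughran-McDonald, or the classifier that produced model_label), and the negation rule only looks one token back.

```python
# Hypothetical mini-lexicon: +1 for positive words, -1 for negative words.
LEXICON = {
    "excellent": 1, "good": 1, "pleasantly": 1, "loves": 1,
    "weak": -1, "confusing": -1, "bad": -1,
}
NEGATORS = {"not", "no", "never"}

def score(text, handle_negation=False):
    """Sum word polarities; optionally flip a word preceded by a negator."""
    tokens = text.lower().split()
    total = 0
    for i, tok in enumerate(tokens):
        value = LEXICON.get(tok, 0)
        if handle_negation and i > 0 and tokens[i - 1] in NEGATORS:
            value = -value
        total += value
    return total

def label(text, handle_negation=False):
    return "pos" if score(text, handle_negation) > 0 else "neg"

texts = [
    "The reform is excellent and overdue",
    "The reform is not good for workers",
    "I am pleasantly surprised by the policy",
    "The proposal is weak and confusing",
    "This is not bad at all",
    "The minister loves publicity more than substance",
]

naive = [label(t) for t in texts]
negation_aware = [label(t, handle_negation=True) for t in texts]
print("naive         :", naive)
print("negation-aware:", negation_aware)
```

With this lexicon the naive scorer mislabels both negated sentences and the sarcastic one. The one-token negation rule repairs "not good" and "not bad", but the sarcastic sentence is still scored positive because nothing in the words themselves signals the reversal.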

Inter-annotator agreement: the bound on what any model can achieve¶

A model trained and evaluated on human labels cannot credibly be shown to do better than the humans who provided them. If two qualified humans coding the same 100 statements disagree on 30 of them, then the "true" label is itself noisy, and roughly 70% is the practical ceiling on the accuracy any classifier can demonstrate on this task: the remaining 30% is irreducible label noise, not modelling failure.

Standard measures of inter-annotator agreement:

Cohen's κ: agreement corrected for chance, on a single pair of annotators. On the widely used Landis and Koch (1977) benchmarks, κ above 0.8 is "almost perfect", 0.6–0.8 "substantial" and 0.4–0.6 "moderate"; below 0.4 the labelling protocol probably needs revision.
Krippendorff's α — generalises κ to ≥2 annotators and missing data. The standard reference is Krippendorff (2018), Content Analysis: An Introduction to Its Methodology.
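
As a worked example of the chance correction, the sketch below computes Cohen's κ by hand as (observed agreement − chance agreement) / (1 − chance agreement) and checks it against sklearn.metrics.cohen_kappa_score. The two coders and their ten labels are made up for illustration.

```python
from collections import Counter
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels from two annotators on the same ten items.
coder_a = ["pos", "pos", "pos", "pos", "neg", "neg", "neg", "neg", "pos", "neg"]
coder_b = ["pos", "pos", "pos", "neg", "neg", "neg", "neg", "pos", "pos", "neg"]
n = len(coder_a)

# Observed agreement: share of items on which the coders give the same label.
p_observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n

# Chance agreement: probability of agreeing if each coder labelled at random
# according to their own marginal label frequencies.
freq_a, freq_b = Counter(coder_a), Counter(coder_b)
categories = set(coder_a) | set(coder_b)
p_chance = sum((freq_a[c] / n) * (freq_b[c] / n) for c in categories)

kappa_manual = (p_observed - p_chance) / (1 - p_chance)
kappa_sklearn = cohen_kappa_score(coder_a, coder_b)

print(f"observed agreement = {p_observed:.2f}")
print(f"chance agreement   = {p_chance:.2f}")
print(f"kappa (by hand)    = {kappa_manual:.2f}")
print(f"kappa (sklearn)    = {kappa_sklearn:.2f}")
```

Here the coders agree on 8 of 10 items, but with balanced labels half of that agreement would be expected by chance, so κ is 0.6 rather than 0.8.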

The Lecture 8 mid-lecture challenge (Human vs machine on 5 peer reviews) made this point concrete: the class typically splits 50/50 on items #1 and #4, meaning any classifier evaluated against a single human coder on those items is being judged against a coin flip. Always report inter-annotator agreement when you build a labelled dataset; a κ of 0.5 means your "ground truth" is shaky and any reported model accuracy needs to be interpreted relative to that ceiling.

4. Topic models: another measurement layer¶

The same measurement-error logic applies to topic shares. Lecture 8 § 9 fitted LDA on Hansen's State of the Union dataset and used the topic shares as time series, i.e., as economic variables. Hansen, McMahon and Prat (QJE 2018) do exactly this with FOMC transcripts and run a difference-in-differences regression on the shares. Once a topic share enters a regression, whether as the outcome or as a regressor, every measurement-error consideration that applies to sentiment applies here too, plus several that are specific to topic modelling.

Topic labels often look objective, but they are model-based summaries of word co-occurrence — a topic is not "discovered truth"; it is a statistical grouping whose interpretation is imposed by the researcher. Three sources of measurement error specific to LDA:

K is a researcher choice. Lecture 8 § 9 used K=15. Different K gives different topics. Coherence-based selection helps but does not pin K down uniquely; sensitivity analysis to K is mandatory.
Granularity of the unit. Hansen-style pipelines fit LDA at the paragraph level (one paragraph ≈ one topic) and aggregate up. Fitting at the document level conflates topics within a document and produces less interpretable themes. The choice of unit is part of the measurement.
Naming is interpretation. When you read "Topic 6: defense, military, nuclear, weapons, security" and call it the Defence topic, you are imposing a label that the model never produced. Two researchers can read the same top-words and assign different labels. Validate via top-words and representative paragraphs, and report both.

The toy table below illustrates how readable top-words can be — and how easy it is to slide from "these terms cluster together" (a statistical fact) to "this is the migration topic" (an interpretive act).

In [ ]:

topic_terms = {
    "Topic 1": ["inflation", "prices", "central bank", "interest rates"],
    "Topic 2": ["migration", "border", "asylum", "citizenship"],
    "Topic 3": ["schools", "teachers", "curriculum", "education"]
}
# Wrap each list in a Series so topics with different numbers of terms still align.
pd.DataFrame({k: pd.Series(v) for k, v in topic_terms.items()})

Short exercise¶

Answer in words (a paragraph each):

Why should a sentiment score be validated against human coding before being used as an economic variable in a regression?
Why are sarcasm and negation especially dangerous failure modes — what makes them harder to handle than, say, domain mismatch?
Why is naming a topic already an interpretive act? Use the LDA output from L8 § 9 as a concrete example: would two researchers necessarily agree on the label for "Topic 6: defense, military, nuclear, weapons, security"?

Optional extension¶

Hand-code 20 texts from your own corpus (or the ECB corpus from L6) and compare them to a dictionary-based method (VADER or Loughran-McDonald, both covered in L8 §§ 3–4). Report precision, recall and F1 — not only accuracy. Compute the confusion matrix and inspect the disagreements one by one.
Compute Cohen's κ between your labels and a colleague's (or between yourself today and yourself a week from now). sklearn.metrics.cohen_kappa_score does this in one line. Compare your κ against the benchmarks given in the inter-annotator agreement discussion in § 3.
Add an "uncertain / mixed" category for cases where binary sentiment is too crude (The reform is good for some workers and bad for others). Discuss in 2–3 sentences when forcing a binary label distorts the underlying construct.
Topic-share sensitivity. Re-fit the L8 § 9 LDA on SOTU paragraphs with K = 10 and K = 20. Plot how the share of the "defence" topic over time changes across the three K values (10, 15 and 20). How sensitive is the time series, and would your conclusions change?