Lecture 8: Sentiment Analysis & Supervised ML for Text
Python for Economists · University of Bologna · 2025/2026
What we cover today
- Warm-up: a spam classifier in 10 lines, judged by you
- Sentiment analysis: dictionary-based approaches
- VADER for general text
- Loughran-McDonald for financial/economic text
- Sentiment as an economic variable
- Supervised classification of text: Naive Bayes, train/test, cross-validation, evaluation
- ⏱ Human-vs-machine challenge
- From sparse to dense: Latent Semantic Analysis
- Pre-trained word embeddings (GloVe via gensim)
- Hansen-style replication: topic shares over time on Hansen's SOTU dataset
- Exercise: sentiment + supervised + Hansen-style on your corpus
Key insight: today we move from unsupervised methods (BoW, TF-IDF, LDA; covered in L7) to two new approaches:
- Dictionary methods use external labels: a pre-built list of positive/negative words. Fast, transparent, no training data, but rigid and domain-bound.
- Supervised methods learn labels from data: you give the model examples, it learns to generalise. Flexible and adaptable, but you need labelled training data, and validation matters.
We close with two methodological themes that were promised at the end of L7: word embeddings (the bridge from sparse to dense representations) and a Hansen et al. (QJE 2018) style pipeline (topic shares over time, replicated on Hansen's own SOTU teaching dataset). The next lecture (L9) replaces the supervised classifier with API calls to a large language model.
0. Setup: packages required for this lecture
Before running the imports below, make sure your_environment has the supervised ML and sentiment stack installed. Run this once in your terminal (not in the notebook):
conda activate your_environment
conda install -c conda-forge scikit-learn pandas matplotlib gensim
pip install vaderSentiment
scikit-learn, pandas, and matplotlib are already installed from L5–L7. New dependencies today: vaderSentiment (a stand-alone package not in conda-forge, so install it with pip) and gensim (for word embeddings in §8; the conda command above installs it from conda-forge).
Loughran-McDonald is not a Python library: it's a CSV file you download from the authors' website. For today's lecture I'll use a small built-in subset of the negative word list to keep the notebook self-contained. For research, always use the full master dictionary from the official source.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import re
# Sentiment
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
# Supervised ML
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.metrics import (
    accuracy_score, precision_recall_fscore_support,
    confusion_matrix, classification_report,
)
import warnings
warnings.filterwarnings("ignore")
print("Libraries loaded.")
1. Warm-up: a spam classifier in 10 lines, judged by you
Today's topic is supervised ML on text. The classic example is spam detection: not glamorous, but it's where the field started (Sahami et al., 1998, A Bayesian Approach to Filtering Junk E-Mail), and it's unbeatable for showing how the whole pipeline works in one page.
I'll train a spam filter live. Then we'll judge it together: you write the test cases.
# Toy training set. Real spam datasets are on UCI / Kaggle; this is for pedagogy.
train = pd.DataFrame({
    "text": [
        "Hi professor, could you confirm the deadline for the proposal?",
        "Dear student, the next lecture is at 9am in room B.",
        "Let me know if you have any questions about the exam.",
        "Office hours moved to Thursday this week.",
        "Your paper draft looks good - small comments attached.",
        "URGENT! You won $1,000,000! Click here to claim NOW!",
        "Get rich quick with this amazing new crypto opportunity.",
        "Hot singles in your area are waiting! Click to meet them.",
        "Make $5000 per week working from home, no experience needed.",
        "Congratulations!! You have been selected for a free iPhone!!",
    ],
    "label": ["ham"]*5 + ["spam"]*5,
})
spam_model = make_pipeline(CountVectorizer(), MultinomialNB())
spam_model.fit(train["text"], train["label"])
print(f"Model trained on {len(train)} messages ({(train['label']=='ham').sum()} ham, {(train['label']=='spam').sum()} spam).")
Now let's test it
Below is a curated set of 5 short messages: a mix of obvious spam, obvious ham, and tricky borderline cases. We'll run the model on all of them and see what happens. Try to predict the model's output before looking at the result, then we'll discuss together where it gets things right and where it fails.
# Pre-built test set, chosen to expose specific failure modes of a small-corpus
# Naive Bayes model. Discuss each prediction with the class.
test_messages = [
    "URGENT! Free iPhone! Click here NOW to claim your prize!!",
    "Free cruise to the Bahamas - respond today!",
    "Could you confirm the deadline for my paper draft?",
    "Are you free for coffee on Thursday?",
    "URGENT: please send me the slides before the seminar tomorrow.",
]
predictions = spam_model.predict(test_messages)
probs = spam_model.predict_proba(test_messages)
classes = spam_model.classes_
print(f"{'Pred':<6} {'P(spam)':<10} Message")
print("-" * 90)
for msg, pred, p in zip(test_messages, predictions, probs):
    spam_p = p[list(classes).index("spam")]
    print(f"[{pred}] {spam_p:>6.2f} {msg}")
Reading the output. This 10-line pipeline is everything that supervised ML needs in one cell:
- Feature extraction: CountVectorizer() turns each message into a bag-of-words count vector. The vocabulary is whatever appeared in the training set: {deadline, professor, free, click, urgent, ...}. No stop words are removed by default, so function words such as the and to are features too.
- Classifier: MultinomialNB() (Multinomial Naive Bayes) computes P(spam | words) and P(ham | words) using Bayes' rule, with the strong "naive" assumption that words occur independently given the class. The assumption is wrong but the classifier often works anyway, which is one of the famous regularities in ML.
- make_pipeline chains them so that calling model.fit(X, y) fits both the vectoriser and the classifier, and model.predict(new_text) re-uses the same vocabulary.
The 5 test messages, walked through one by one. Each is chosen to expose a specific behaviour of the model:
1. URGENT! Free iPhone! Click here NOW to claim your prize!! → spam, P(spam) = 1.00. Almost every content word comes from the spam side of the training set (URGENT, Free, iPhone, Click, NOW, claim); only prize is new. The model is correctly and confidently right. This is what success looks like, and it becomes rare once you leave the training distribution.
2. Free cruise to the Bahamas - respond today! → ham, P(spam) = 0.36. Obvious spam to any human, but cruise, Bahamas, respond, today never appeared in the training set: they are out-of-vocabulary (OOV) and are silently dropped. Of the content words only Free is recognised, and with a single training occurrence its spam evidence is weak. What remains are the function words to and the, and the occurs only in ham training messages, which is enough to tip the prediction to ham. OOV failure: the model isn't wrong because the message is hard; it's wrong because the informative vocabulary is unfamiliar and an uninformative word fills the gap.
3. Could you confirm the deadline for my paper draft? → ham, P(spam) = 0.01. Almost the same lexical shape as a training-set ham example. The model is correctly and confidently right. Familiar territory.
4. Are you free for coffee on Thursday? → spam, P(spam) = 0.65. This is the keyword hijacking failure: free appeared in a spam training context (free iPhone), so any sentence containing free is pulled toward spam, even when the rest of the message is unambiguously ham to a human reader. A bag-of-words model has no way to learn that free Thursday and free iPhone are different uses of the word. This is exactly what dictionary methods do too, and §2–§5 will show how to mitigate it.
5. URGENT: please send me the slides before the seminar tomorrow. → ham, P(spam) = 0.04. Surprising! URGENT is associated with spam in training. But Naive Bayes multiplies per-word likelihoods across all recognised words, and here the ham evidence outweighs the one spam-correlated token. Be careful about where that evidence comes from: slides, seminar, tomorrow and send are all OOV, so the ham signal is carried by me and, above all, by the two occurrences of the, which appears four times in the ham training messages and never in spam. This is the difference between a classifier and a regex: Naive Bayes doesn't fire on a single trigger word; it weights all evidence, including evidence a human would call irrelevant. (Compare with #4: short messages with one anomalous word are vulnerable; messages with several ham-correlated tokens are robust.)
The pattern across the five. The model is excellent inside its training distribution (#1, #3) and poor outside it (#2, #4). #5 shows that the number of features matters: several ham-correlated tokens can outvote one spam-correlated word. This is the bridge to the rest of the lecture: simple word-count features carry surface-level patterns, not meaning. The next 60 minutes are about (a) doing this properly with sentiment-specific features (dictionaries, §2–§5), and (b) learning the supervised-ML workflow well enough to know when the model is lying to you (§6).
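To see which tokens actually drive a prediction, it helps to recompute the Naive Bayes evidence by hand. The sketch below is a hand-rolled approximation, not the scikit-learn internals: it assumes add-one smoothing, equal class priors, and a tokeniser that mimics CountVectorizer's defaults (lowercase, words of two or more characters). The helper names tokens, log_odds and explain are mine.

```python
import math
import re
from collections import Counter

# Same ten toy messages as the training cell (dash in message 5 replaced by a comma).
ham = [
    "Hi professor, could you confirm the deadline for the proposal?",
    "Dear student, the next lecture is at 9am in room B.",
    "Let me know if you have any questions about the exam.",
    "Office hours moved to Thursday this week.",
    "Your paper draft looks good, small comments attached.",
]
spam = [
    "URGENT! You won $1,000,000! Click here to claim NOW!",
    "Get rich quick with this amazing new crypto opportunity.",
    "Hot singles in your area are waiting! Click to meet them.",
    "Make $5000 per week working from home, no experience needed.",
    "Congratulations!! You have been selected for a free iPhone!!",
]

def tokens(text):
    # Assumption: approximates CountVectorizer defaults (lowercase, 2+ character words).
    return re.findall(r"\b\w\w+\b", text.lower())

ham_counts = Counter(t for m in ham for t in tokens(m))
spam_counts = Counter(t for m in spam for t in tokens(m))
vocab = set(ham_counts) | set(spam_counts)

def log_odds(token, alpha=1.0):
    """log P(token | spam) - log P(token | ham) with add-alpha smoothing."""
    p_spam = (spam_counts[token] + alpha) / (sum(spam_counts.values()) + alpha * len(vocab))
    p_ham = (ham_counts[token] + alpha) / (sum(ham_counts.values()) + alpha * len(vocab))
    return math.log(p_spam / p_ham)

def explain(message):
    """Return per-token evidence for in-vocabulary tokens and the implied P(spam)."""
    evidence = [(t, log_odds(t)) for t in tokens(message) if t in vocab]
    total = sum(w for _, w in evidence)  # equal priors (5 ham, 5 spam): no prior term
    return evidence, 1 / (1 + math.exp(-total))

evidence, p_spam = explain("URGENT: please send me the slides before the seminar tomorrow.")
for token, weight in evidence:
    print(f"{token:<8} {weight:+.2f}")  # positive = pushes toward spam
print(f"P(spam) = {p_spam:.2f}")
```

On this toy set the only recognised tokens in message #5 are urgent, me and the (twice); everything else is out-of-vocabulary and contributes nothing. Printing the evidence like this is a cheap sanity check before trusting any explanation of why a bag-of-words model decided what it did.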
2. Sentiment analysis: dictionary-based approaches
The idea: assign a score to each word from a pre-built dictionary. Sum or average scores across the document.
Advantages: transparent, fast, no training data required, interpretable.
Limitations: ignores context (negation, sarcasm), domain mismatch, fixed vocabulary.
Two main dictionaries for economics:
- VADER (Valence Aware Dictionary and sEntiment Reasoner): tuned for social media / general text; handles negation and intensifiers.
- Loughran-McDonald: built specifically for financial / corporate text. The standard tool in finance and accounting research.
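Before using the real dictionaries, the core idea (look up each word, then average) fits in a few lines. This is a minimal sketch: TOY_LEXICON and its weights are invented for illustration and belong to neither VADER nor Loughran-McDonald.

```python
import re

# Hypothetical mini-lexicon: words and weights are made up for this example.
TOY_LEXICON = {"strong": 1.0, "growth": 1.0, "weak": -1.0, "recession": -2.0}

def dictionary_score(text, lexicon=TOY_LEXICON):
    """Average lexicon weight per word; words missing from the lexicon count as 0."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0
    return sum(lexicon.get(w, 0.0) for w in words) / len(words)

print(dictionary_score("Strong growth"))          # both words are positive
print(dictionary_score("The recession is over"))  # good news, but scored negative
```

The second call shows the context problem listed under Limitations: the scorer sees recession and nothing else, so a sentence announcing the end of a recession comes out negative.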
# Load the corpus we built in L5/L6 (and used throughout L7)
try:
    corpus = pd.read_csv("corpus_clean.csv")
    corpus["date_parsed"] = pd.to_datetime(corpus["date_parsed"], errors="coerce")
    corpus["year"] = corpus["date_parsed"].dt.year
    print(f"Corpus loaded: {len(corpus)} documents")
except FileNotFoundError:
    corpus = pd.DataFrame({
        "date_parsed": pd.to_datetime(["2024-01-25","2023-10-26","2023-09-14",
                                       "2022-12-15","2022-09-08"]),
        "title": ["Monetary policy decisions"] * 5,
        "body_clean": [
            "The Governing Council today decided to keep the three key ECB interest rates unchanged. "
            "Inflation is projected to decline gradually but will remain above the 2% target.",
            "The Governing Council kept rates unchanged. Underlying inflation eased and price pressures remain strong.",
            "The Governing Council raised the three key ECB interest rates by 25 basis points to ensure timely return of inflation to target.",
            "The Governing Council raised rates by 50 basis points. Interest rates will rise significantly to reach restrictive levels.",
            "The Governing Council raised the deposit facility rate by 75 basis points to ensure inflation returns to the 2% medium-term target.",
        ],
    })
    corpus["year"] = corpus["date_parsed"].dt.year
    print(f"Using synthetic corpus ({len(corpus)} documents).")
corpus.head(3)
3. VADER
VADER returns four scores per text: pos, neu, neg, and a compound score in [-1, +1]. The compound score is the standard summary measure: -1 = most negative, +1 = most positive.
analyzer = SentimentIntensityAnalyzer()
# Try VADER on a few hand-crafted sentences first to build intuition
test_sentences = [
    "The economy is growing strongly and unemployment is low.",
    "Inflation is unacceptably high and risks remain to the upside.",
    "Growth is not as weak as feared, but uncertainty remains.",
    "The Governing Council decided to keep interest rates unchanged.",
]
print(f"{'compound':>10} {'pos':>6} {'neu':>6} {'neg':>6} sentence")
print("-" * 90)
for s in test_sentences:
    sc = analyzer.polarity_scores(s)
    print(f"{sc['compound']:>10.3f} {sc['pos']:>6.3f} {sc['neu']:>6.3f} {sc['neg']:>6.3f} {s}")
Reading the output. Four sentences chosen to expose VADER's strengths and its surprising failure modes:
- "The economy is growing strongly and unemployment is low" → compound = -0.296. Surprising? VADER reads unemployment as strongly negative and low as mildly negative; it does not know that low unemployment is good. The general-English semantics fight the economics. This is the kind of failure that costs you a paper if you don't catch it.
- "Inflation is unacceptably high and risks remain to the upside" → compound = -0.273. Correct sign. Unacceptably and risks register as negative.
- "Growth is not as weak as feared, but uncertainty remains" → compound = -0.401. Despite the meaning being "things are better than feared", VADER catches weak, feared, uncertainty and ignores the negation context built around them.
- "The Governing Council decided to keep interest rates unchanged" → compound = +0.459. Another surprise: there are no negative tokens, and one word scores positive. The likeliest culprit is interest, which is a positive word in everyday English ("to show interest") and which VADER cannot tell apart from the interest in interest rates; scoring the words one at a time with analyzer.polarity_scores is a quick way to confirm which token is responsible. From an economics perspective this is wrong: "rates unchanged" is neutral, possibly hawkish or dovish depending on what the market priced in. VADER cannot capture this; it has no model of policy expectations.
Three things to flag in class.
- VADER's vocabulary is general English, tuned originally on social-media text. Many financial terms are mis-coded: unemployment, liability, tax, obligation are negative in everyday speech but neutral or positive in context in finance. This is exactly why we move to Loughran-McDonald in §4.
- The negation handling has hard-coded rules for ~20 negation words plus intensifiers (very, extremely). These rules catch the easy cases but miss the contextual reversals like "low unemployment" or "easing inflation".
- The four scores are not independent: pos + neu + neg = 1 (they are a partition). The compound is a separate, normalised composite computed from the raw valence scores.
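The last bullet can be made concrete. To the best of my knowledge the reference implementation maps the summed, rule-adjusted valence x to x / sqrt(x² + α) with α = 15; treat the constant as an assumption and check the package source before relying on it. The function name below is mine, not part of vaderSentiment.

```python
import math

def compound_from_valence_sum(valence_sum, alpha=15.0):
    """Squash an unbounded valence sum into (-1, +1).

    Assumption: alpha = 15 is the constant used by the vaderSentiment package.
    """
    return valence_sum / math.sqrt(valence_sum ** 2 + alpha)

for x in [0.5, 2.0, 5.0, 20.0, 50.0]:
    print(f"valence sum {x:>5.1f} -> compound {compound_from_valence_sum(x):.3f}")
```

Two things to notice. A valence sum of 2.0 maps to about 0.459, which matches the score of the fourth test sentence if one moderately positive word is all VADER found there. And the curve flattens quickly: a text whose positive words add up to a valence of 20 is already above 0.98, so long documents with a mild positive tilt end up pinned near +1.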
# Apply VADER to the entire corpus
def vader_scores(text):
    sc = analyzer.polarity_scores(str(text))
    return pd.Series({
        "vader_compound": sc["compound"],
        "vader_pos": sc["pos"],
        "vader_neg": sc["neg"],
        "vader_neu": sc["neu"],
    })
vader_df = corpus["body_clean"].apply(vader_scores)
corpus = pd.concat([corpus, vader_df], axis=1)
print(corpus[["date_parsed","vader_compound","vader_pos","vader_neg"]].to_string(index=False))
# Plot VADER compound sentiment over time
fig, ax = plt.subplots(figsize=(11, 4))
ax.plot(corpus["date_parsed"], corpus["vader_compound"],
        marker="o", color="steelblue", linewidth=2, markersize=6)
ax.fill_between(corpus["date_parsed"], corpus["vader_compound"], 0,
                where=corpus["vader_compound"] >= 0, color="steelblue", alpha=0.2)
ax.fill_between(corpus["date_parsed"], corpus["vader_compound"], 0,
                where=corpus["vader_compound"] < 0, color="tomato", alpha=0.2)
ax.axhline(0, color="black", linewidth=0.6)
ax.set_title("ECB sentiment over time (VADER compound)", loc="left", fontsize=12)
ax.set_ylabel("VADER compound score (-1 to +1)")
ax.set_xlabel("")
ax.grid(alpha=0.3)
fig.autofmt_xdate()
fig.tight_layout()
plt.show()
Reading the output. With the L6 corpus (5 docs over 2023–2025), the VADER compound score is saturated at +0.99 for all 5 documents. (If corpus_clean.csv is missing, the notebook falls back to the short synthetic corpus defined in §2 and your numbers will differ.) This is suspicious, and the lesson is in the suspicion itself: a corpus of monetary-policy texts about high inflation, restrictive rates, and recession risks should not score as overwhelmingly positive, and certainly not at the maximum value across every single document.
What's going on. The press conferences contain lots of words that VADER reads as positive (growth, strong, confidence, support, welcome, good) regardless of their context (growth is weakening, confidence has declined, support measures are being phased out). VADER's compound score is computed as a normalised sum of valence scores, and once you have a long-enough document with even moderate positive valence, the normalisation saturates near +1.0. Combined with VADER's general-English calibration, the result is a near-uniformly-saturated series.
The takeaway is methodological, not numerical. When you apply an off-the-shelf sentiment model to a domain it was not built for, expect systematic miscalibration. Two corrective approaches: (i) use a domain-specific dictionary like Loughran-McDonald (§4); (ii) train a supervised classifier on a labelled subset of your own corpus (§6). Both are available; the choice depends on whether you have labels.
A practical fix for VADER on long documents: apply it sentence-by-sentence and average, instead of feeding the whole document at once. This avoids the saturation effect, because the normalisation is per-sentence. Picault & Renault (2017, Journal of International Money and Finance) also work at the sentence level on ECB press conferences, with a lexicon built for central-bank language rather than VADER, and show that the resulting series tracks changes in policy stance.
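A minimal sketch of the sentence-level fix. The helper takes any scoring function, so it can be tried without VADER installed; the sentence splitter is a deliberately naive regex (it will mis-split abbreviations such as "e.g. "), and the name sentence_level_score is mine.

```python
import re
from statistics import mean

def sentence_level_score(text, score_fn):
    """Average score_fn over sentences instead of scoring the whole text at once."""
    # Naive splitter: break after '.', '!' or '?' when followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", str(text)) if s.strip()]
    if not sentences:
        return 0.0
    return mean(score_fn(s) for s in sentences)

# Stand-in scorer for demonstration only: +1 if the sentence mentions "good", else -1.
def toy_scorer(sentence):
    return 1.0 if "good" in sentence.lower() else -1.0

print(sentence_level_score("This is good. This is bad. Also good!", toy_scorer))
```

With the notebook's objects, the scoring function would be lambda s: analyzer.polarity_scores(s)["compound"], applied via corpus["body_clean"].apply(...). Averaging keeps each sentence's contribution bounded, so one long document can no longer accumulate valence until the compound score hits the ceiling. For real work a proper sentence tokeniser is worth the extra dependency.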
4. Loughran-McDonald dictionary
Loughran and McDonald (2011, Journal of Finance) showed that general-purpose sentiment dictionaries misclassify many words in financial text. For example, "liability" and "tax" are negative in general language but neutral in finance.
The LM dictionary has six word lists: Negative, Positive, Uncertainty, Litigious, Strong Modal, Weak Modal.
The Negative list is the most used in economics research: it correlates with stock returns, credit spreads, and analyst forecast revisions (Loughran & McDonald 2011, 2016). It's the default choice when you want a single domain-tuned tone measure.
# Lightweight LM Negative word list (selected subset for demonstration).
# For research, always use the full dictionary from the official source:
# https://sraf.nd.edu/loughranmcdonald-master-dictionary/
LM_NEGATIVE = {
    "abandoned","adverse","affect","against","allegations","bankruptcy",
    "breach","burden","cease","challenges","claims","concerns","conflicts",
    "constraints","costs","damages","decline","default","deficiency",
    "delay","difficulty","dispute","doubt","downturn","exposure",
    "fail","failure","falling","fraud","harmful","impair","impose","instability",
    "lack","lawsuit","litigation","loss","losses","negative",
    "objection","obstacle","plaintiff","problem","recession","reject",
    "restraint","restructuring","risk","risks","slowdown","stagnation",
    "termination","threat","unfavorable","unstable","volatility",
    "vulnerability","weakness","worsen","worsening",
}
def lm_negative_share(text):
    """Return the share of LM-negative words in text."""
    words = re.findall(r"\b[a-z]+\b", str(text).lower())
    if not words:
        return 0.0
    neg_count = sum(1 for w in words if w in LM_NEGATIVE)
    return neg_count / len(words)
corpus["lm_neg_score"] = corpus["body_clean"].apply(lm_negative_share)
print(corpus[["date_parsed","lm_neg_score","vader_compound"]].round(4).to_string(index=False))
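For research use, the built-in subset should be replaced by the full master dictionary. The sketch below shows one way to read a word list from such a CSV. The layout is an assumption on my part: one row per word, a Word column in upper case, and one column per category holding a positive number (the year the word was added) when the word belongs to that list. The three sample rows are invented stand-ins, so check the column names against the file you actually download.

```python
import csv
import io

# Hypothetical excerpt imitating the assumed layout; not copied from the real file.
SAMPLE_CSV = """Word,Negative,Positive,Uncertainty
LOSS,2009,0,0
IMPROVE,0,2009,0
APPROXIMATELY,0,0,2009
"""

def is_member(value):
    """True when the category cell holds a positive number."""
    try:
        return int(value) > 0
    except ValueError:
        return False

def load_word_list(handle, column):
    """Return the lowercase words flagged in `column` of an LM-style CSV."""
    reader = csv.DictReader(handle)
    return {row["Word"].lower() for row in reader if is_member(row[column])}

negative_words = load_word_list(io.StringIO(SAMPLE_CSV), "Negative")
print(negative_words)
```

With a downloaded file, open it with open(path, newline="", encoding="utf-8") and pass the handle in place of the StringIO object; the resulting set can then be used instead of LM_NEGATIVE in lm_negative_share.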
# Compare VADER vs LM β do they agree?
fig, axes = plt.subplots(1, 2, figsize=(13, 4))
# VADER compound over time
axes[0].plot(corpus["date_parsed"], corpus["vader_compound"],
             marker="o", color="steelblue", linewidth=2)
axes[0].axhline(0, color="black", linewidth=0.6)
axes[0].set_title("VADER compound score", fontsize=11, loc="left")
axes[0].set_xlabel("Date"); axes[0].set_ylabel("Score (-1 to +1)")
axes[0].grid(alpha=0.3)
# LM negative share over time
axes[1].plot(corpus["date_parsed"], corpus["lm_neg_score"],
             marker="o", color="tomato", linewidth=2)
axes[1].set_title("Loughran-McDonald negative share", fontsize=11, loc="left")
axes[1].set_xlabel("Date"); axes[1].set_ylabel("Share of LM-negative words")
axes[1].grid(alpha=0.3)
for ax in axes:
    ax.tick_params(axis="x", rotation=30)
fig.tight_layout()
plt.show()
Reading the output. The two series tell different stories on this corpus.
- VADER compound is at the ceiling (~ +0.99) for all 5 documents: same problem as before, the long Q&A texts overwhelm VADER with general-English positive valence. The signal is essentially saturated.
- LM negative share is in the 0.4–1.2% range: small but meaningfully variable. The 2025-10-30 press conference has the highest LM-negative share (≈1.2%), reflecting more risks, weakness, concerns vocabulary; the 2024-01-25 short release has the lowest (≈0.4%) because press releases are formulaic and don't enumerate risks the way conferences do.
The two measures correlate at +0.69 on this corpus (compute it explicitly: corpus[['vader_compound','lm_neg_score']].corr()). The positive sign is not what economic intuition would predict, but it's an artifact of (i) VADER being saturated at the ceiling, (ii) document length being correlated with both measures, and (iii) the small sample. On a larger ECB archive (50+ documents) the expected pattern should reassert itself, with the correlation turning mildly negative (roughly -0.2 to -0.4); treat that range as a rough expectation to verify on your own data, not as a result.
The methodological point for research. Always report at least two dictionary measures and check robustness: single-dictionary findings are fragile. Loughran & McDonald (2016, Journal of Accounting Research) make this point explicitly: stock-return predictability of sentiment in 10-K filings holds with LM but disappears with general dictionaries like Harvard-IV.
5. Sentiment as an economic variable
Sentiment scores become economic data. Examples from the literature:
- Monetary policy: Apel & Blix Grimaldi (2012); Hansen, McMahon & Prat (QJE 2018): the tone of central bank communication predicts asset-price reactions and forecast revisions.
- Financial markets: Loughran & McDonald (2011, J. Finance): negative tone in 10-K filings predicts negative abnormal returns and trading volume.
- Business cycles: Shapiro, Sudhof & Wilson (2022, J. Econometrics): daily news sentiment leads GDP and consumption.
- Uncertainty: Baker, Bloom & Davis (2016, QJE): Economic Policy Uncertainty (EPU) index, a counts-of-keywords sentiment measure that became canonical macro data.
Key methodological warnings:
- Endogeneity: communication responds to the same fundamentals that drive outcomes. Use exogenous variation (e.g., FOMC meeting timing) where possible.
- Look-ahead bias: dictionaries built post-2010 may include words whose meaning changed. Out-of-sample validation matters.
- Aggregation: going from per-document scores to per-period averages requires choices (volume-weighted, simple average, only confident scores). Report robustness.
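The aggregation warning is easy to illustrate. The sketch below compares a simple average with a length-weighted average per quarter. The four documents, their scores and word counts are invented, and using word count as the weight is just one possible reading of "volume-weighted".

```python
import pandas as pd

# Hypothetical per-document scores; all values are made up for illustration.
docs = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-10", "2023-02-20", "2023-05-05", "2023-06-15"]),
    "score": [0.2, 0.4, -0.1, 0.5],
    "n_words": [100, 300, 200, 200],
})
docs["quarter"] = docs["date"].dt.to_period("Q")

# Choice 1: every document counts equally.
simple = docs.groupby("quarter")["score"].mean()

# Choice 2: longer documents count more (weight = word count).
docs["weighted"] = docs["score"] * docs["n_words"]
sums = docs.groupby("quarter")[["weighted", "n_words"]].sum()
length_weighted = sums["weighted"] / sums["n_words"]

summary = pd.DataFrame({"simple": simple, "length_weighted": length_weighted})
print(summary.round(3))
```

In the first quarter the two choices disagree (0.30 vs 0.35) because the longer document is also the more positive one; in the second they coincide because the documents have equal length. Reporting both side by side is the cheapest robustness check available.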
# Correlation between sentiment and year: does tone shift across the ECB tightening / easing cycle?
corr_df = corpus[["year","vader_compound","lm_neg_score"]].dropna()
print("Correlation matrix:")
print(corr_df.corr().round(3))
print(f"\nLM-negative vs VADER compound correlation: {corpus[['vader_compound','lm_neg_score']].corr().iloc[0,1]:.3f}")
# Simple scatter coloured by LM negative share
fig, ax = plt.subplots(figsize=(7, 4))
sc = ax.scatter(corpus["year"], corpus["vader_compound"],
c=corpus["lm_neg_score"], cmap="RdYlGn_r", s=80, zorder=3)
plt.colorbar(sc, ax=ax, label="LM negative share")
ax.axhline(0, color="black", linewidth=0.6)
ax.set_xlabel("Year"); ax.set_ylabel("VADER compound")
ax.set_title("Tone over the policy cycle (point colour = LM negative share)", loc="left", fontsize=11)
ax.grid(alpha=0.3)
fig.tight_layout()
plt.show()
Reading the output. Two diagnostic outputs here.
The correlation matrix. With only 5 documents, any correlation is dominated by individual observations; treat the numbers as illustrative, not statistical evidence. On this corpus: vader_compound vs lm_neg_score = +0.69, lm_neg_score vs year = +0.66, vader_compound vs year = +0.16. The VADER-LM correlation is positive, not negative, meaning that documents with more LM-negative words also score higher on VADER compound. This is the VADER ceiling effect at work: all docs sit at about +0.99, so VADER is essentially flat, and the small residual variation correlates with whatever else moves across documents (here, doc length and the secular rise in risk language: notice that lm_neg_score vs year is also +0.66, picking up the build-up of geopolitical and recession-risk references through 2023–2025). On a larger ECB archive of 50+ docs I would expect the correlation to flip negative once VADER is no longer pinned to the ceiling.
The scatter. Each point is one document, x = year, y = VADER, colour = LM-negative. With only 5 docs the pattern is unstable; the chart is a template for what your exercise output should look like with a larger corpus. Two things to point out: (i) all VADER values are clustered at the top of the plot (the ceiling), (ii) LM-negative does show variation across years, with a modest upward trend.
For research practice, the scatter form is the right starting point: it lets you spot outliers (a document with very different tone from its neighbours often turns out to be a special meeting: emergency cut, surprise hike, or external shock). Once you have ≥30–50 observations, replace this with a proper time-series plot with rolling averages, and report bootstrap CIs on the means by period.
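The bootstrap confidence intervals mentioned above can be sketched in a few lines of NumPy. This is a plain percentile bootstrap that treats documents within a period as independent draws, which is an assumption: serially correlated series would call for a block bootstrap instead. The function name and the example scores are mine.

```python
import numpy as np

def bootstrap_mean_ci(values, n_boot=2000, level=0.95, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `values`."""
    values = np.asarray(values, dtype=float)
    rng = np.random.default_rng(seed)
    # Resample with replacement: each row of idx is one bootstrap sample.
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    boot_means = values[idx].mean(axis=1)
    lower, upper = np.quantile(boot_means, [(1 - level) / 2, (1 + level) / 2])
    return values.mean(), lower, upper

# Hypothetical LM-negative shares for the documents of one period.
scores = [0.004, 0.007, 0.009, 0.012, 0.006, 0.010]
point, lower, upper = bootstrap_mean_ci(scores)
print(f"mean = {point:.4f}, 95% CI = [{lower:.4f}, {upper:.4f}]")
```

In practice you would call this once per period (for example inside a groupby on year or quarter) and plot the interval as an error band around the period mean. With only a handful of documents per period the interval will be wide, which is exactly the information a bare point estimate hides.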
6. Supervised classification of text
Dictionary methods are rule-based: someone hand-picked the words. They generalise badly across domains. Supervised methods reverse the workflow: you provide labelled examples, the model learns which features (words) predict the label.
The pipeline.
labelled data → train/test split → vectoriser + classifier → cross-validation → final test
We will train a classifier to distinguish positive vs negative monetary-policy statements, using a small hand-labelled training set. Then we'll evaluate it properly.
# Hand-labelled training set: 30 short policy-style statements, balanced.
# In a real project this would be 500-5000 documents, often crowdsourced.
training_data = pd.DataFrame({
"text": [
# POSITIVE / dovish-supportive (15)
"Inflation is converging to target and growth has resumed.",
"Economic activity is expanding broadly across sectors.",
"Labour market conditions remain robust and wage growth is moderating.",
"Financial conditions have eased and credit growth is recovering.",
"Inflation expectations are well anchored at the medium-term target.",
"The euro area economy is showing increasing resilience.",
"Underlying inflation has continued to ease, supporting the disinflationary path.",
"Confidence indicators improved and consumer spending strengthened.",
"The transmission of monetary policy is working effectively.",
"Risks to the outlook are broadly balanced.",
"Activity has picked up and the recovery is broadening.",
"Wages are growing in line with productivity gains.",
"Trade flows have recovered after recent disruptions.",
"Investment indicators are turning positive across major economies.",
"Bank lending conditions have normalised.",
# NEGATIVE / hawkish-concerned (15)
"Inflation remains too high and risks are to the upside.",
"Geopolitical tensions are weighing on the outlook.",
"Financial conditions have tightened sharply.",
"Consumer confidence has deteriorated and demand is weakening.",
"The risk of a recession has materially increased.",
"Wage pressures are persistent and could de-anchor expectations.",
"Inflation expectations show signs of becoming unanchored.",
"Trade tensions are creating significant downside risks.",
"Bank lending standards have tightened materially.",
"Energy prices are exerting renewed upward pressure on inflation.",
"Manufacturing output has contracted for several months.",
"External demand has weakened substantially.",
"Credit risk premia have risen across the euro area.",
"Core inflation remains elevated and sticky.",
"Productivity growth has slowed, raising concerns about potential output.",
],
"label": ["positive"] * 15 + ["negative"] * 15,
})
print(f"Training set: {len(training_data)} statements ({(training_data['label']=='positive').sum()} positive, {(training_data['label']=='negative').sum()} negative)")
training_data.head()
Train/test split: the most important step
You never evaluate a model on the same data you trained it on: the model has seen those examples and can memorise them, so in-sample accuracy is meaningless.
Instead: hold out 20–30% of the data as a test set. Train on the rest. Evaluate on the held-out test set. The test set must be touched once, at the end. Tuning anything (model choice, hyperparameters, preprocessing) using test-set performance defeats the purpose.
# Train/test split: 70/30, stratified so each set keeps class balance
X_train, X_test, y_train, y_test = train_test_split(
    training_data["text"],
    training_data["label"],
    test_size=0.30,
    random_state=42,
    stratify=training_data["label"],
)
print(f"Training set: {len(X_train)} ({(y_train=='positive').sum()} positive, {(y_train=='negative').sum()} negative)")
print(f"Test set: {len(X_test)} ({(y_test=='positive').sum()} positive, {(y_test=='negative').sum()} negative)")
# Build a pipeline: TF-IDF + Multinomial Naive Bayes
classifier = make_pipeline(
    TfidfVectorizer(stop_words="english", ngram_range=(1, 2), min_df=1),
    MultinomialNB(alpha=0.5),
)
# Train (fit) on the training set only
classifier.fit(X_train, y_train)
# Predict on the test set
y_pred = classifier.predict(X_test)
# Report basic accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {accuracy:.3f} ({(y_pred == y_test).sum()}/{len(y_test)})")
Reading the output. Test accuracy is 0.556 (5/9) with random_state=42, which is no better than chance: a coin flip gives 0.50 in expectation, and because a stratified 9-item test set splits 5/4, always predicting the larger test class gives 5/9 ≈ 0.56. This is the first reason test-set evaluation needs care: with small datasets, point estimates are noisy (one test item is worth 11 percentage points here), and the model is genuinely struggling. We will diagnose this with cross-validation in the next cell, where we'll see that the single test accuracy is hiding even more variability.
The pipeline choices.
- TfidfVectorizer rather than CountVectorizer: down-weights words that appear in many documents, which helps with the boilerplate vocabulary of monetary-policy text.
- ngram_range=(1, 2): include both unigrams (inflation) and bigrams (inflation expectations). Bigrams capture two-word patterns that single words miss, such as not robust vs robust. One caveat: stop_words="english" removes not before the n-grams are built, so this particular pipeline never sees the bigram not robust; drop or customise the stop list if negation matters.
- MultinomialNB(alpha=0.5): Naive Bayes with additive (Lidstone) smoothing; alpha=1 would be classic Laplace smoothing. The smoothing (alpha) avoids zero probabilities for words unseen in training. With small training sets, this matters a lot.
Why is the model so weak? With 21 training statements (each averaging ~10 words), the vocabulary is sparse and most words appear only 1–2 times. The Naive Bayes likelihoods are basically uninformative: a single shared word can flip the prediction. This is the core problem of supervised text classification on small datasets, and it's exactly the motivation for the L9 lecture: large language models accessed via API don't need your training data because they were pre-trained on internet-scale text.
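A minimal cross-validation sketch, placed here to show how fold-to-fold variation and a trivial baseline put a single test accuracy in context. To stay self-contained it uses only 12 of the 30 statements and 3 folds; both choices are mine, and the fold count in particular is dictated by the tiny sample.

```python
from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Twelve statements taken from the hand-labelled set above (6 positive, 6 negative).
texts = [
    "Inflation is converging to target and growth has resumed.",
    "Economic activity is expanding broadly across sectors.",
    "Financial conditions have eased and credit growth is recovering.",
    "The euro area economy is showing increasing resilience.",
    "Activity has picked up and the recovery is broadening.",
    "Bank lending conditions have normalised.",
    "Inflation remains too high and risks are to the upside.",
    "Geopolitical tensions are weighing on the outlook.",
    "Financial conditions have tightened sharply.",
    "The risk of a recession has materially increased.",
    "External demand has weakened substantially.",
    "Core inflation remains elevated and sticky.",
]
labels = ["positive"] * 6 + ["negative"] * 6

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

model = make_pipeline(
    TfidfVectorizer(stop_words="english", ngram_range=(1, 2)),
    MultinomialNB(alpha=0.5),
)
# Baseline that ignores the text and always predicts the most frequent training class.
baseline = make_pipeline(TfidfVectorizer(), DummyClassifier(strategy="most_frequent"))

model_scores = cross_val_score(model, texts, labels, cv=cv, scoring="accuracy")
baseline_scores = cross_val_score(baseline, texts, labels, cv=cv, scoring="accuracy")

print("Model accuracy by fold:   ", model_scores.round(2))
print("Baseline accuracy by fold:", baseline_scores.round(2))
print(f"Model mean = {model_scores.mean():.2f}, sd = {model_scores.std():.2f}")
```

The number to report is the mean across folds together with its spread, always next to the baseline. If the model's fold scores straddle the baseline, as they easily can with this little data, the honest conclusion is that the classifier has not learned anything you can rely on.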
⏱ Ten-minute challenge: Human vs machine
Now a serious test. I have 5 short reviews of economics papers. Each is either positive or negative, but the language is hedged in the academic style we all use.
You'll vote individually on each one. Then we run the model. Then we compare. The point of running this now, before learning the proper evaluation tools, is that you'll see the model fail in ways that motivate the precision/recall/F1/confusion-matrix machinery we cover next.
challenge_reviews = [
    "The identification strategy is clever but the empirical implementation is weak.",
    "This is a paper I will cite - the instrument is novel and the results robust.",
    "I found the theoretical framework compelling, though the data section could be more detailed.",
    "While the authors address an important question, the paper feels underpowered.",
    "A well-executed study with policy-relevant findings and careful robustness checks.",
]
# Step 1: class vote (no model output yet)
print("REVIEW BY REVIEW. Vote: P (positive) or N (negative). No conferring.\n")
for i, r in enumerate(challenge_reviews, 1):
    print(f"{i}. {r}")
    print(" Class tally: P = ___ , N = ___\n")
# Step 2: run the model trained earlier in §6
preds = classifier.predict(challenge_reviews)
probs = classifier.predict_proba(challenge_reviews)
classes = classifier.classes_
print(f"{'#':<3} {'Pred':<10} {'P(positive)':<13} Review")
print("-" * 100)
for i, (r, p, pr) in enumerate(zip(challenge_reviews, preds, probs), 1):
    p_pos = pr[list(classes).index("positive")]
    print(f"{i:<3} {p:<10} {p_pos:>10.2f} {r[:70]}")
Reading the output. Five reviews chosen to expose the limits of supervised classification on out-of-distribution language.
What the class typically does.
- #1 (clever but weak): splits roughly 50/50. The "but" is the signal; readers anchor on whichever side they read last.
- #2 (I will cite; novel, robust): near-unanimous positive.
- #3 (compelling, though could be more detailed): splits, with most leaning positive.
- #4 (important question, underpowered): splits, with most leaning negative.
- #5 (well-executed, policy-relevant, careful): near-unanimous positive.
What the model does on this corpus. All five are predicted positive, with probabilities clustered tightly around 0.52–0.64. The model is essentially doing a coin flip: its confidence is barely above 50% even on the unambiguous cases (#2, #5). Trained on 21 monetary-policy statements with a vocabulary of inflation, rates, target, restrictive, growth, the model has never seen cite, instrument, identification, underpowered. Most of the test review words are out-of-vocabulary, so the model falls back towards the class prior, which is close to 50/50 here.
The pedagogical gold. This is not a bug; it's the canonical demonstration that a supervised classifier is only as good as its training distribution. Two distinct lessons:
- Domain shift kills supervised models. A classifier trained on monetary-policy statements cannot classify peer-review comments, even if both are "sentiment". You'd need 500+ labelled academic reviews to do this properly.
- Confidence calibration matters. A 0.52 prediction is not "barely positive"; it's "I have no idea". Always look at the shape of the probability distribution, not just the argmax. A well-calibrated classifier on a hard task should output probabilities near 0.5, not high-confidence wrong answers. The fact that this model says 0.52 is, in a sense, correct behaviour given how ill-equipped it is.
And the meta-lesson for the labelling problem. The class will disagree on #1 and #4. That's the point. The labelling problem is harder than the modelling problem: even trained economists disagree on whether "clever but weak" is a positive or negative review. Always report inter-annotator agreement when you build a labelled dataset (Cohen's κ; Krippendorff 2018 is the canonical reference). If two well-qualified humans disagree on 30% of items, a model cannot be expected to agree with either of them much more than about 70% of the time. The class disagreement is the irreducible label noise that bounds any classifier's measurable accuracy on this task.
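As a concrete illustration of inter-annotator agreement, here is a minimal sketch of Cohen's kappa for two annotators. The votes are hypothetical (not real class results), and the helper name cohen_kappa is ours; the formula itself is the standard one: observed agreement corrected for the agreement expected by chance.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    n = len(labels_a)
    # Observed agreement: share of items where both annotators gave the same label.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two annotators' marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    categories = set(labels_a) | set(labels_b)
    p_chance = sum((freq_a[c] / n) * (freq_b[c] / n) for c in categories)
    if p_chance == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_observed - p_chance) / (1 - p_chance)

# Hypothetical votes from two annotators on the five challenge reviews.
annotator_a = ["P", "P", "P", "N", "P"]
annotator_b = ["N", "P", "P", "N", "P"]

raw_agreement = sum(a == b for a, b in zip(annotator_a, annotator_b)) / len(annotator_a)
kappa = cohen_kappa(annotator_a, annotator_b)
print(f"Raw agreement: {raw_agreement:.2f}, Cohen's kappa: {kappa:.3f}")
```

Kappa is lower than raw agreement because two annotators who both vote "P" most of the time would agree fairly often by chance alone.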
The L9 connection. Next lecture we'll send these same five reviews to an LLM via API and see what happens. Spoiler: probably better than this Naive Bayes, because the LLM has seen millions of academic reviews during pre-training. The trade-off (cost, reproducibility, dependency on a third party) is the topic of L9.
Back to §6: now that you've seen the model fail, let's evaluate it properly
The train/test split in §6 produced a single accuracy number on a single test set, and the result was nearly random. But is that the whole story? Maybe random_state=42 happened to land on an unusually hard split. The honest answer requires more rigorous evaluation: cross-validation and per-class metrics.
Cross-validation: accuracy is noisy on a single split
A single train/test split gives one accuracy estimate. With small datasets that estimate is high-variance. k-fold cross-validation repeats the split k times: it splits data into k folds, trains on k-1 of them, tests on the remaining fold, and rotates. The reported accuracy is the average across the k folds, plus a standard deviation.
For classification, use StratifiedKFold so each fold preserves the class balance. The standard k is 5 or 10.
# 5-fold stratified cross-validation on the FULL training data
# (no separate test set this time: CV does the splitting internally)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(classifier, training_data["text"], training_data["label"],
                            cv=cv, scoring="accuracy")
print(f"Per-fold accuracy: {[f'{s:.3f}' for s in cv_scores]}")
print(f"Mean accuracy: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")
Reading the output. Per-fold accuracies span the 0.17–0.67 range with a mean of 0.40 ± 0.23. The model performs worse than on the single test split (0.556), and the standard deviation shows what the single-split test accuracy was hiding: with this small a training set, one standard deviation around the mean already covers 0.17 to 0.63. The single-split number was a happy accident.
What this teaches. Don't report a single accuracy number: report mean ± SD across folds (or, ideally, a 95% bootstrap CI). Single-number metrics on small data are noise dressed as precision.
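One simple way to get such an interval is a percentile bootstrap over the held-out documents. The sketch below assumes a 9-document test set with 5 correct predictions (which matches the 0.556 single-split accuracy); the number of resamples and the seed are arbitrary choices, and this is only one of several bootstrap variants.

```python
import random

# Per-document correctness: 1 = classified correctly, 0 = misclassified.
# Assumed here: 5 correct out of 9, i.e. accuracy 0.556.
correct = [1] * 5 + [0] * 4

rng = random.Random(42)   # fixed seed so the interval is reproducible
n_boot = 5000             # number of bootstrap resamples (arbitrary choice)

boot_acc = []
for _ in range(n_boot):
    # Resample the test documents with replacement and recompute accuracy.
    sample = rng.choices(correct, k=len(correct))
    boot_acc.append(sum(sample) / len(sample))

# Percentile interval: take the 2.5th and 97.5th percentiles of the resampled accuracies.
boot_acc.sort()
lower = boot_acc[int(0.025 * n_boot)]
upper = boot_acc[int(0.975 * n_boot) - 1]
accuracy = sum(correct) / len(correct)
print(f"Accuracy: {accuracy:.3f}, 95% percentile bootstrap CI: [{lower:.3f}, {upper:.3f}]")
```

With only nine documents the interval is very wide, which is the same message the cross-validation standard deviation sends.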
And don't be embarrassed when CV is below the single split. That happens when the single split happens to land on an "easy" fold. Cross-validation is the honest summary; the single split was misleading. The right reaction is to update your beliefs about the model downwards (it's worse than you thought), not to drop CV and report the rosier number.
A common mistake. Running CV multiple times with different random_states and reporting the best run. This is overfitting on the random seed. Pick one seed at the start, fix it, and live with the result. If you really want robustness, average across seeds (and report the average across seeds, not the best).
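Averaging across seeds can be sketched as follows. The twelve statements are hypothetical toy data (not the lecture's training set), and the choice of five seeds and three folds is arbitrary; the point is only the reporting pattern.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: 6 statements per class.
texts = [
    "inflation is falling and growth is solid",
    "the outlook has improved markedly",
    "labour markets remain strong",
    "confidence is recovering across sectors",
    "price pressures are easing steadily",
    "activity is expanding at a healthy pace",
    "inflation remains too high for too long",
    "growth is weakening and risks are rising",
    "financing conditions have tightened sharply",
    "the outlook has deteriorated",
    "demand is slowing across sectors",
    "uncertainty is weighing on investment",
]
labels = ["positive"] * 6 + ["negative"] * 6

model = make_pipeline(TfidfVectorizer(), MultinomialNB(alpha=0.5))

seed_means = []
for seed in range(5):
    # Same model and data; only the fold assignment changes with the seed.
    cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=seed)
    scores = cross_val_score(model, texts, labels, cv=cv, scoring="accuracy")
    seed_means.append(scores.mean())

# Report the average over seeds and its spread, never the best seed.
print(f"Mean CV accuracy across seeds: {np.mean(seed_means):.3f} ± {np.std(seed_means):.3f}")
```

The spread across seeds is itself informative: if it is large, the dataset is too small for any single CV run to be trusted.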
Beyond accuracy: precision, recall, F1, confusion matrix
Accuracy is "what fraction of predictions were correct". It hides which kind of errors the model makes. For example: a fire alarm that never rings has 99.9% accuracy and is useless.
Three additional metrics:
- Precision: of the documents the model called positive, what fraction were actually positive? Asks "when the model says yes, is it right?"
- Recall: of the actually-positive documents, what fraction did the model catch? Asks "do we miss real positives?"
- F1: harmonic mean of precision and recall, useful when classes are imbalanced.
The confusion matrix shows all four cells: true positives, false positives, true negatives, false negatives.
# Per-class metrics on the held-out test set
print(classification_report(y_test, y_pred, digits=3))
# Confusion matrix
cm = confusion_matrix(y_test, y_pred, labels=["negative", "positive"])
print("Confusion matrix:")
print(f" pred_neg pred_pos")
print(f" true_neg {cm[0,0]:>8} {cm[0,1]:>8}")
print(f" true_pos {cm[1,0]:>8} {cm[1,1]:>8}")
Reading the output. With random_state=42 and the 30-statement training set, the test set has 9 documents (5 negative, 4 positive).
- Negative class: precision 0.60, recall 0.60, F1 0.60. The model predicted 5 docs as negative, 3 of them were actually negative (treating negative as the class of interest: 3 TP, 2 FP), and it missed 2 of the 5 actual negatives (2 FN).
- Positive class: precision 0.50, recall 0.50, F1 0.50. Same structure, slightly worse because the positive class has fewer examples in this test set.
- Confusion matrix: 3 TN, 2 FP, 2 FN, 2 TP. Errors are evenly split: the model isn't biased toward either class, it's just genuinely confused.
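The per-class figures can be recomputed by hand from the four confusion-matrix counts reported above. A minimal sketch (the variable names are ours; the counts treat positive as the class of interest):

```python
# Counts from the confusion matrix: 3 TN, 2 FP, 2 FN, 2 TP.
tn, fp, fn, tp = 3, 2, 2, 2

accuracy = (tp + tn) / (tp + tn + fp + fn)

# Positive class.
precision_pos = tp / (tp + fp)   # when the model says "positive", how often is it right?
recall_pos = tp / (tp + fn)      # how many actual positives did the model catch?
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

# Negative class: swap the roles of the two classes.
precision_neg = tn / (tn + fn)
recall_neg = tn / (tn + fp)
f1_neg = 2 * precision_neg * recall_neg / (precision_neg + recall_neg)

print(f"accuracy = {accuracy:.3f}")
print(f"positive: precision={precision_pos:.2f} recall={recall_pos:.2f} F1={f1_pos:.2f}")
print(f"negative: precision={precision_neg:.2f} recall={recall_neg:.2f} F1={f1_neg:.2f}")
```

The same four counts give every number in the classification report, which is why the confusion matrix is the most informative single object to show.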
The errors are likely on hedged statements. Look at the misclassified sentences when you re-run this: they almost always contain hedging vocabulary (broadly, largely, somewhat) or mixed signals (important question, but underpowered). With 30 training statements the model has not seen enough examples of hedged language to learn that broadly balanced is neutral-positive.
What metric should you report?
- Accuracy when classes are balanced AND both error types cost the same. (Our case here.)
- Precision when false positives are costly. (Spam filter: don't flag legitimate email as spam.)
- Recall when false negatives are costly. (Cancer screening: don't miss real positives.)
- F1 as a single-number summary when you must pick one.
- Always show the confusion matrix in your write-up: it makes errors concrete and lets readers judge for themselves.
7. From sparse to dense: Latent Semantic Analysis
So far we've represented documents as sparse high-dimensional vectors: TF-IDF on the L7/L8 corpus produces a (5 × 500) matrix where most entries are zero. Modern NLP works with dense low-dimensional vectors instead: every entry is non-zero, and the dimensionality is 50–768 instead of thousands. Dense vectors capture latent semantic structure: words that never co-occur but mean similar things end up close in vector space.
Latent Semantic Analysis (LSA) is the historical bridge: take a TF-IDF matrix, run a Singular Value Decomposition (SVD), keep the top-K singular vectors. The result is a dense K-dimensional representation per document. It was the dominant document-representation method from the early 90s until word2vec (2013); today it survives mainly as an interpretive baseline and a fast preprocessing step.
The intuition: SVD finds the K orthogonal axes that maximally explain the variance in the term-document matrix. Each axis is a latent topic, that is, a linear combination of terms. Documents are then projected onto these axes.
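The projection can be written out directly with NumPy. This is a sketch on a hypothetical 4 × 5 document-term matrix, using a full SVD and then truncating by hand; TruncatedSVD computes only the leading components but its fit_transform output corresponds to doc_coords here, up to the sign of each axis.

```python
import numpy as np

# Hypothetical toy document-term matrix: 4 documents x 5 terms.
X = np.array([
    [2.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 2.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 2.0, 1.0],
])

# Thin SVD: X = U @ diag(s) @ Vt, singular values sorted in decreasing order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

K = 2
doc_coords = U[:, :K] * s[:K]   # dense K-dimensional representation of each document
term_axes = Vt[:K]              # each row is a latent axis: a linear combination of terms

# Rank-K reconstruction and the share of the matrix it fails to capture.
X_approx = doc_coords @ term_axes
rel_error = np.linalg.norm(X - X_approx) / np.linalg.norm(X)
print(f"Document coordinates shape: {doc_coords.shape}")
print(f"Relative reconstruction error with K={K}: {rel_error:.3f}")
```

Increasing K lowers the reconstruction error; the squared relative error equals the share of squared singular values that were discarded.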
from sklearn.decomposition import TruncatedSVD
# Re-use the TF-IDF matrix logic from L7. Build it fresh here so the cell is self-contained.
tfidf = TfidfVectorizer(max_features=500, min_df=2, stop_words="english",
                        token_pattern=r"\b[a-zA-Z]{3,}\b", sublinear_tf=True)
X_tfidf = tfidf.fit_transform(corpus["body_clean"])
# LSA: keep the top 4 singular components (K must be < min(n_docs, n_features), and we have 5 docs).
# In practice with 50+ docs you'd use K=50-100.
lsa = TruncatedSVD(n_components=4, random_state=42)
X_lsa = lsa.fit_transform(X_tfidf)
print(f"TF-IDF matrix: {X_tfidf.shape} (sparse, {X_tfidf.nnz} non-zero / {X_tfidf.shape[0]*X_tfidf.shape[1]} cells)")
print(f"LSA matrix: {X_lsa.shape} (dense)")
print(f"\nVariance explained by 4 components: {lsa.explained_variance_ratio_.sum():.1%}")
print(f"Per-component: {[f'{v:.1%}' for v in lsa.explained_variance_ratio_]}")
# Inspect the top words for each latent component (largest absolute loadings)
print("\nTop 8 terms per latent component:")
vocab = tfidf.get_feature_names_out()
for k, comp in enumerate(lsa.components_):
    top_idx = np.argsort(np.abs(comp))[-8:][::-1]
    top = [(vocab[i], comp[i]) for i in top_idx]
    print(f" Component {k}: " + ", ".join(f"{w}({s:+.2f})" for w, s in top))
Reading the output. With 5 docs and 500 TF-IDF features, the LSA reduction to 4 dimensions explains 91.0% of variance: most of the signal sits in 4 latent dimensions instead of 500. The first component dominates at 63.4%, and the remaining three components add another ~28 percentage points. On a real ECB archive of 200 documents you'd typically choose K = 50–100 and explain 60–80% of variance.
The components are interpretable as latent themes.
- Component 0 (inflation, council, governing, policy, monetary, transmission, term, rates): the dominant latent dimension is the boilerplate of monetary-policy text. Every document loads positively on it because every document contains this vocabulary.
- Component 1 (portfolio, principal, governing, council, intends, refinancing, targeted, securities): captures balance-sheet operations (PEPP/APP runoff, principal reinvestment, refinancing operations, securities portfolio). This is the substantive policy-instrument axis that distinguishes documents discussing QT from those discussing rate decisions.
- Component 2 mixes geographic and temporal terms (june, states, global, contribution, commercial): partly substantive, partly noise.
- Component 3 picks up specific date references (october, august) and procedural language (unchanged, public, place): almost certainly noise that more data would push to higher components.
The first two components (0 and 1) alone explain 77% of variance and are clearly meaningful; components 2 and 3 are mostly idiosyncratic. This is the typical pattern: the leading components are interpretable themes, the trailing ones are noise.
Why LSA is the conceptual bridge to embeddings. Three reasons:
- Dense vectors. Each document is now a 4-dimensional point with all entries non-zero. Comparing two documents = computing cosine similarity in 4D, not in 500D.
- Latent semantic structure. Words that don't co-occur (inflation and prices) but appear in similar contexts get loaded onto the same component. That's what "semantic" means here.
- Linear and lossy. The reduction is a single matrix factorisation. Word2vec (2013) and modern transformers do the same thing with a non-linear neural model and a much bigger training corpus. The fundamental idea is identical.
Deerwester et al. (1990, JASIS) introduced LSA; the technique dominated information retrieval for two decades until word2vec made dense embeddings cheap to learn from raw text. The transformer-based models we'll use in L9 are LSA's distant descendants: same goal (dense semantic vectors), better implementation.
8. Pre-trained word embeddings
LSA learns a representation from your corpus. Modern NLP usually flips this: download a representation that was already trained on billions of words (Wikipedia, news, books, web crawls), and use it on your data.
GloVe (Pennington, Socher, Manning 2014, EMNLP) is the canonical example. Each word is mapped to a 50- to 300-dimensional vector. Words used in similar contexts get similar vectors. Gensim makes the pre-trained vectors available for download in a single command:
import gensim.downloader as api
glove = api.load('glove-wiki-gigaword-100') # 134 MB, downloads on first use
Once loaded, two operations:
- Word similarity: how close are two words in vector space?
- Document similarity: average the word vectors in each document, compare the document vectors.
The first reveals semantic structure. The second is the simplest possible "neural" document representation. It can outperform TF-IDF on tasks that depend on meaning rather than vocabulary overlap, although the comparison later in this section shows that it is not uniformly better.
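Both operations reduce to cosine similarity between vectors. The sketch below uses hypothetical 3-dimensional vectors in place of real GloVe embeddings, so the numbers mean nothing in themselves; it only shows the mechanics of comparing words and of averaging word vectors into a document vector.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional "embeddings" (real GloVe vectors have 50-300 dimensions).
toy_vectors = {
    "inflation": [0.9, 0.1, 0.3],
    "prices":    [0.8, 0.2, 0.4],
    "river":     [0.1, 0.9, 0.0],
}

# Word similarity: compare two word vectors directly.
sim_close = cosine(toy_vectors["inflation"], toy_vectors["prices"])
sim_far = cosine(toy_vectors["inflation"], toy_vectors["river"])
print(f"inflation vs prices: {sim_close:.3f}")
print(f"inflation vs river:  {sim_far:.3f}")

def mean_vector(words):
    """Document vector = element-wise average of its word vectors."""
    vecs = [toy_vectors[w] for w in words]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

# Document similarity: average the word vectors of each document, then compare.
doc_a = mean_vector(["inflation", "prices"])
doc_b = mean_vector(["prices", "river"])
print(f"doc_a vs doc_b: {cosine(doc_a, doc_b):.3f}")
```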
import urllib.request
import gzip
import shutil
from pathlib import Path
from gensim.models import KeyedVectors
# Target directory next to the notebook
EMB_DIR = Path.cwd() / "embeddings"
EMB_DIR.mkdir(exist_ok=True)
# Models to download: name -> URL on the gensim-data GitHub release mirror
MODELS = {
    "glove-wiki-gigaword-50": "https://github.com/piskvorky/gensim-data/releases/download/glove-wiki-gigaword-50/glove-wiki-gigaword-50.gz",
    "glove-wiki-gigaword-100": "https://github.com/piskvorky/gensim-data/releases/download/glove-wiki-gigaword-100/glove-wiki-gigaword-100.gz",
}
def get_glove(name: str, url: str) -> KeyedVectors:
    """Download (if needed), decompress, and load a GloVe model from the gensim-data mirror.
    Note: files on gensim-data are already in word2vec format (with header on the first line)."""
    gz_path = EMB_DIR / f"{name}.gz"
    txt_path = EMB_DIR / f"{name}.txt"
    if not gz_path.exists() and not txt_path.exists():
        print(f"Downloading {name} to {gz_path} (may take 1-2 minutes)...")
        urllib.request.urlretrieve(url, gz_path)
        print("Download complete.")
    if not txt_path.exists():
        print(f"Decompressing {name}...")
        with gzip.open(gz_path, "rb") as f_in, open(txt_path, "wb") as f_out:
            shutil.copyfileobj(f_in, f_out)
        gz_path.unlink()
    print(f"Loading {name}...")
    return KeyedVectors.load_word2vec_format(txt_path, binary=False)
glove_50 = get_glove("glove-wiki-gigaword-50", MODELS["glove-wiki-gigaword-50"])
glove = get_glove("glove-wiki-gigaword-100", MODELS["glove-wiki-gigaword-100"])
print(f"\nglove-50 loaded: {len(glove_50.key_to_index):,} words × {glove_50.vector_size}-dim")
print(f"glove-100 loaded: {len(glove.key_to_index):,} words × {glove.vector_size}-dim")
print("\n--- Most similar to 'inflation' (top 5) ---")
for w, s in glove.most_similar("inflation", topn=5):
    print(f" {w:<20} {s:.3f}")
print("\n--- Word analogy: king - man + woman ≈ ? ---")
result = glove.most_similar(positive=["king", "woman"], negative=["man"], topn=3)
for w, s in result:
    print(f" {w:<20} {s:.3f}")
print("\n--- Word analogy: euro - europe + japan ≈ ? ---")
result = glove.most_similar(positive=["euro", "japan"], negative=["europe"], topn=3)
for w, s in result:
    print(f" {w:<20} {s:.3f}")
Reading the output. GloVe-Wiki-Gigaword-100 has a vocabulary of 400,000 words, each represented as a 100-dimensional vector trained on 6 billion tokens of Wikipedia + Gigaword news.
Most similar to inflation (top 5 with cosine similarity): rate (0.78), rates (0.78), unemployment (0.75), inflationary (0.74), growth (0.74).
Look at this list carefully: it's a textbook example of distributional semantics at work. rate(s) are the closest neighbours because the phrases "interest rate" and "inflation rate" co-occur constantly in financial news. unemployment is third because of the Phillips curve: economists, journalists and policymakers discuss inflation and unemployment together throughout the training corpus, so GloVe learned they belong in the same semantic neighbourhood. inflationary is morphological. growth reflects the macro-vocabulary cluster (growth, inflation, unemployment, output all live near each other in vector space). The model has no notion of economics, only word co-occurrences, yet it has internalised the macro vocabulary cluster.
The classic gender analogy king - man + woman ≈ ? returns queen at cosine 0.77, then monarch (0.68), throne (0.68). This is the canonical Mikolov et al. (2013) demonstration: vector arithmetic captures semantic relations. The "gender axis" exists as a learnable direction in the 100-dimensional space.
A finance-specific analogy: euro - europe + japan ≈ ? returns yen at cosine 0.67, then dollar (0.66), drachma (0.59). The currency-country axis works too. (Drachma being third is endearing: the model learned that drachma was a country-specific currency, just from before-2002 Wikipedia.)
Where this comes from. GloVe builds a vocab × vocab co-occurrence matrix from 6 billion tokens of Wikipedia + Gigaword news, then factorises it: word vectors are fitted by weighted least squares so that their dot products approximate the log co-occurrence counts (more sophisticated than LSA, but the same family of method). Each word is a row of the resulting low-rank matrix. Words appearing in similar contexts (same neighbours within a 10-word window) end up close in vector space.
The economic-research applications.
- Out-of-vocabulary handling: a TF-IDF model trained on your corpus has zero information about a word that didn't appear in it. GloVe supplies a vector for every word in its 400,000-word vocabulary, including words that appeared zero times in your training data.
- Semantic clustering: cluster documents by their averaged word vector to discover topics without LDA.
- Domain-shift robustness: train a classifier on top of GloVe vectors and it can generalise better than a TF-IDF classifier when the test set uses different vocabulary from the training set.
Where GloVe falls short. Each word has one vector regardless of context. bank (financial) and bank (river) collapse into one point. This is the limitation that contextual embeddings (BERT, sentence-transformers, modern LLMs) solve, and the reason L9 will use API-based LLMs rather than static embeddings.
# Document embeddings via mean-pooling: average word vectors of a document.
# Words not in GloVe vocabulary are skipped.
def document_vector(text, model):
    words = re.findall(r"\b[a-z]+\b", str(text).lower())
    vectors = [model[w] for w in words if w in model.key_to_index]
    if not vectors:
        return np.zeros(model.vector_size)
    return np.mean(vectors, axis=0)
# Embed each ECB document
doc_vectors = np.vstack([document_vector(t, glove) for t in corpus["body_clean"]])
print(f"Document embeddings shape: {doc_vectors.shape}")
# Cosine similarity matrix using dense embeddings
from sklearn.metrics.pairwise import cosine_similarity
sim_glove = cosine_similarity(doc_vectors)
print("\nCosine similarity (GloVe-averaged document vectors):")
labels = corpus["date_parsed"].dt.strftime("%Y-%m-%d").values
print(f" {' '.join(l for l in labels)}")
for i, l in enumerate(labels):
    row = " ".join(f"{sim_glove[i,j]:.3f} " for j in range(len(labels)))
    print(f" {l} {row}")
Reading the output. Pairwise document similarities using mean-pooled GloVe vectors are all in the 0.98–1.00 range on this corpus, much higher than the TF-IDF cosine similarities we saw in L7 (which ranged 0.31–0.84 across the same five documents). The press releases (2023-05-04, 2024-01-25) sit at cosine 1.000 to each other, the press conferences (2023-09-14, 2024-12-12, 2025-10-30) at 0.999–1.000 to each other, and across the two groups the similarities drop only to ~0.984. The variation that TF-IDF picked up has been almost entirely averaged away.
Why so high? Mean-pooling 500–6000 word vectors averages out almost everything. Two long documents from the same domain (monetary policy) end up with nearly identical mean vectors, even when they differ substantially in which monetary-policy topics they emphasise. The mean of inflation, rates, target, restrictive is essentially the same as the mean of inflation, rates, transmission, gradual: both center around the same "macroeconomic policy" region of vector space. This is the document-length problem of mean-pooling, and it's exactly what sentence-transformers (and the modern API-based embeddings in L9) solve: they produce a single document-level vector via a transformer rather than averaging.
What this teaches. Naive averaging of static embeddings is a worse document representation than TF-IDF for most short-corpus tasks. The right comparison is:
- TF-IDF: best when documents are differentiated by distinctive vocabulary (good for L7 corpus). Range 0.31–0.84 on our 5 docs: meaningful spread.
- GloVe mean-pool: best when you need to handle out-of-vocabulary words and don't care about document-level discrimination. Range 0.98–1.00 on our 5 docs: collapsed.
- Sentence embeddings (sentence-transformers, OpenAI text-embedding-3, etc.): best in general; proper transformer-based document encoders that don't suffer from mean-pooling collapse.
The next lecture (L9) will use API-based sentence embeddings to redo this comparison and see the gap close. For now, the takeaway is conceptual: dense ≠ better. The right representation depends on the task, and naive averaging is a worse "first dense embedding" than the LSA we built in §7.
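The averaging effect can be reproduced with synthetic vectors. The sketch below assumes, purely for illustration, that every word vector in a domain is a shared direction plus word-specific noise; the dimensions, document lengths and seed are arbitrary, and no real GloVe vectors are involved.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 50

# ASSUMPTION for illustration: word vector = shared domain direction + word-specific noise.
shared = np.ones(dim)

def random_document(n_words):
    """Mean-pool n_words synthetic word vectors into one document vector."""
    word_vectors = shared + rng.normal(size=(n_words, dim))
    return word_vectors.mean(axis=0)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Short documents keep some word-specific noise; long documents average it away.
sim_short = cosine(random_document(5), random_document(5))
sim_long = cosine(random_document(2000), random_document(2000))
print(f"Two 5-word documents:    cosine = {sim_short:.3f}")
print(f"Two 2000-word documents: cosine = {sim_long:.3f}")
```

Under this assumption the noise in a mean of n vectors shrinks like 1/sqrt(n), so long documents from the same domain converge to the shared direction and their cosine similarity approaches 1.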
9. Hansen-style replication: topic shares over time
Hansen, McMahon and Prat (QJE 2018) propose a workflow that has become the standard for empirical work with central-bank communication:
- Take a corpus of policy texts.
- Preprocess and split into paragraphs (the unit of "one topic = one paragraph" assumption).
- Fit LDA at the paragraph level with K topics chosen by coherence.
- Aggregate paragraph-level topic shares to document level (e.g., per FOMC meeting).
- Track topic shares over time and link them to economic events.
The QJE paper applies this to FOMC transcripts (1987–2008) and uses a difference-in-differences design around the 1993 transparency reform. The dataset is FOMC transcripts that the authors scraped and cleaned over years; it is not redistributed publicly.
What we can do. Hansen distributes a different dataset on his GitHub teaching repo: every paragraph of every U.S. State of the Union address since 1790, ~23,000 paragraphs. We replicate his methodological workflow on the SOTU corpus: paragraph-level LDA, topic shares per speech, plot over time. This is the same pipeline applied to a public corpus, and it demonstrates the technique end-to-end, which is what L7 promised.
The SOTU dataset is downloaded automatically from Hansen's GitHub on first run (~12 MB):
import os, urllib.request
SOTU_URL = "https://raw.githubusercontent.com/sekhansen/text-mining-tutorial/master/speech_data_extend.txt"
SOTU_PATH = "speech_data_extend.txt"
if not os.path.exists(SOTU_PATH):
    print("Downloading SOTU dataset (~12 MB)...")
    urllib.request.urlretrieve(SOTU_URL, SOTU_PATH)
    print("Done.")
# Load: tab-separated, columns are [president, speech, year], with header row
sotu = pd.read_csv(SOTU_PATH, sep="\t", quoting=3, on_bad_lines="skip")
sotu = sotu.dropna(subset=["speech", "year"])
sotu["year"] = sotu["year"].astype(int)
# Restrict to a manageable post-WWII window for speed (LDA on 23k paragraphs is slow)
sotu_modern = sotu[sotu["year"] >= 1947].reset_index(drop=True)
print(f"SOTU paragraphs (1947+): {len(sotu_modern):,}")
print(f"Speeches (year × president): {sotu_modern.groupby(['year','president']).ngroups}")
print(f"Year range: {sotu_modern['year'].min()}-{sotu_modern['year'].max()}")
sotu_modern.head(3)
# Fit LDA at the PARAGRAPH level: Hansen's key methodological choice.
# Each paragraph is treated as one document. Topics are estimated over paragraphs.
# Filter very short paragraphs (procedural noise)
sotu_paragraphs = sotu_modern[sotu_modern["speech"].str.split().str.len() >= 30].reset_index(drop=True)
print(f"Paragraphs after filtering (≥30 words): {len(sotu_paragraphs):,}")
# Vectorise: BoW counts, basic English stopwords, drop very rare and very common words
vec_sotu = CountVectorizer(stop_words="english", min_df=10, max_df=0.5,
                           token_pattern=r"\b[a-zA-Z]{4,}\b", max_features=3000)
X_sotu = vec_sotu.fit_transform(sotu_paragraphs["speech"])
print(f"Doc-term matrix: {X_sotu.shape}")
# Fit LDA with K=15 topics. (Hansen uses K=40 for FOMC; we use 15 for speed and clarity.)
from sklearn.decomposition import LatentDirichletAllocation
K = 15
print(f"\nFitting LDA with K={K}...")
lda_sotu = LatentDirichletAllocation(n_components=K, random_state=42, max_iter=15,
                                     learning_method="online", n_jobs=-1)
lda_sotu.fit(X_sotu)
print("Done.")
# Display top words per topic; let economic interpretation come from the loadings
vocab_sotu = vec_sotu.get_feature_names_out()
print("\nTop 10 words per topic:")
for k in range(K):
    top = lda_sotu.components_[k].argsort()[-10:][::-1]
    print(f" Topic {k:2d}: " + ", ".join(vocab_sotu[i] for i in top))
# Aggregate paragraph-level topic shares to speech (year-president) level.
# This is the core Hansen et al. step: we want a topic distribution per speech, not per paragraph.
paragraph_topics = lda_sotu.transform(X_sotu)  # shape (n_paragraphs, K)
print(f"Paragraph topic matrix: {paragraph_topics.shape}")
# Assign each paragraph back to its speech
sotu_paragraphs = sotu_paragraphs.copy()
for k in range(K):
    sotu_paragraphs[f"topic_{k}"] = paragraph_topics[:, k]
# Average topic shares within each speech
speech_topics = (sotu_paragraphs
                 .groupby(["year", "president"])[[f"topic_{k}" for k in range(K)]]
                 .mean()
                 .reset_index()
                 .sort_values("year"))
print(f"\nSpeech-level topic shares: {speech_topics.shape}")
speech_topics.head()
# Plot the time series of selected topics: pick those that are most interpretable
# from the top-words above. The indices here match the random_state=42 run; if you
# re-run with a different seed or sklearn version, inspect the topics and update.
# Topic 6: defense/military; 11: foreign policy; 4: healthcare; 9: budget/fiscal
selected_topics = [6, 11, 4, 9]
fig, ax = plt.subplots(figsize=(12, 5))
for k in selected_topics:
    label_words = ", ".join(vocab_sotu[i] for i in lda_sotu.components_[k].argsort()[-3:][::-1])
    ax.plot(speech_topics["year"], speech_topics[f"topic_{k}"],
            marker="o", markersize=3, linewidth=1.5,
            label=f"Topic {k}: {label_words}")
ax.set_xlabel("Year"); ax.set_ylabel("Topic share (mean over paragraphs)")
ax.set_title("U.S. State of the Union: topic shares over time (Hansen-style)",
             loc="left", fontsize=12)
ax.legend(loc="best", fontsize=9)
ax.grid(alpha=0.3)
fig.tight_layout()
plt.show()
Reading the output. With K=15 LDA on 7,031 post-WWII SOTU paragraphs (≥30 words each), the model produces 15 topics, several of which are immediately interpretable. Examples from the actual fit (topic indices below match the random_state=42 run; if you re-run with a different seed or sklearn version, the indices will change but similar themes will emerge):
- Topic 6: Defence / Cold War (defense, want, nuclear, military, control, weapons, forces, seek, security, commitment). Peaks during the Cold War decades and the Reagan years.
- Topic 11: International / Cold War foreign policy (world, nations, peace, united, states, trade, free, international, soviet, security). Captures the literal Cold War vocabulary; share drops markedly after 1991.
- Topic 4: Healthcare and social insurance (health, care, security, social, insurance, house, americans, just, training, medical). Rises sharply in the Clinton era (1993–94 health reform) and again under Obama (2009–2014, ACA).
- Topic 9: Fiscal / budget (budget, year, billion, spending, percent, million, deficit, taxes, increase, years). Spikes in the early Reagan years (Reaganomics deficits) and during the Clinton-Gingrich 1995 budget showdown.
- Topic 2: Macroeconomy (business, energy, economy, small, research, production, prices, price, farm, inflation). Captures the pre-1990 macro vocabulary.
- Topic 3: Federal programmes (federal, programs, year, program, congress, administration, legislation). The "what the government is doing" topic, fairly stable through time.
The model wasn't told about Reagan, the Cold War, the ACA, or any historical event. It found these themes by pure word co-occurrence at the paragraph level, which is exactly what LDA is designed to do.
Why this is the QJE 2018 pipeline applied to SOTU. The methodological choices we made here closely follow Hansen's:
- Paragraph as the unit of analysis (not the full speech). Each paragraph is more likely to focus on one topic, which is what LDA needs.
- K chosen by judgment after inspecting top words (Hansen uses coherence; we used 15 for speed). Both are defensible.
- Topic shares aggregated up to document level (here = speech), making them the unit of regression.
- Time series of topic shares as the primary descriptive output, with the economic narrative anchored to events.
The QJE paper goes further: it uses the 1993 FOMC transparency reform as a shock and runs DiD on topic-share changes pre/post. With our data we cannot replicate that identification (no comparable shock for SOTUs), but we have replicated the upstream pipeline, which is the technical core of the paper. The downstream econometrics is standard DiD that you've seen in metrics class.
For your own research project (if you want to take this further): pick a corpus with a natural shock. Examples that have been done in published papers: Hansen-McMahon 2016 (FOMC + 1994 transparency), Casas et al. 2024 (ECB introductory statements + SMP/OMT announcements), Apel & Blix Grimaldi 2014 (Riksbank minutes + 2004 publication change). The pipeline is the same as what we just executed.
10. Exercise: to do at home
This exercise is for home practice. Most of it builds on what we covered in §6 (supervised classification) and §9 (Hansen-style LDA pipeline). Submit your solutions on Virtuale by next week.
Task 1. Apply VADER and the Loughran-McDonald dictionary to your corpus. Plot both sentiment series over time on the same figure (use a twin axis if scales differ). Compute the correlation between the two measures. Identify the most positive and most negative documents β do they correspond to events you'd expect (rate hikes, energy shocks, dovish pivots)?
Task 2. Train a supervised classifier on the 30-statement training set from §6:
- Use train_test_split with random_state=42, test_size=0.30, stratified.
- Pipeline: TfidfVectorizer + LogisticRegression(max_iter=1000).
- Report: test accuracy, 5-fold CV mean ± SD, confusion matrix.
Task 3. Compare the supervised classifier against VADER on your corpus:
- Apply the trained classifier from Task 2 to each document in corpus.
- Compare its predictions to: (i) vader_compound > 0 → positive, else negative.
- Build a 2×2 table of agreement. Discuss in 2–3 sentences which method seems more reliable on this corpus.
Task 4 (optional). Replace MultinomialNB with LogisticRegression in §6's pipeline. Compare CV accuracy. Which performs better on this small training set, and why? (Hint: think about how each handles correlated features.)
Task 5 (optional, longer). Apply the §9 Hansen-style pipeline to your ECB corpus instead of the SOTU dataset:
- Split each document into paragraphs (split on double newline \n\n or similar).
- Fit LDA at the paragraph level with K=5–8.
- Aggregate paragraph topic shares to document level.
- Plot topic shares over time. Comment on whether the topic mix tracks the rate cycle (hike phase vs cuts phase).
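For the paragraph-splitting step in Task 5, a small helper along these lines may be a useful starting point. The function name split_paragraphs and the min_words filter are our own illustrative choices, and the mini-document is invented; your ECB texts may need a different separator.

```python
import re

def split_paragraphs(text, min_words=1):
    """Split a document into paragraphs on blank lines, dropping very short ones."""
    # One or more blank lines (possibly containing spaces or tabs) mark a paragraph break.
    chunks = re.split(r"\n\s*\n", text)
    # Collapse internal line breaks and repeated spaces inside each paragraph.
    paragraphs = [" ".join(chunk.split()) for chunk in chunks]
    return [p for p in paragraphs if len(p.split()) >= min_words]

# Hypothetical mini-document with single, double and triple newlines.
doc = "Inflation remains high.\nRates were raised.\n\nGrowth is weak.\n\n\n  Risks are balanced."
paragraphs = split_paragraphs(doc)
for i, p in enumerate(paragraphs):
    print(i, p)
```

Setting min_words to something like 30 mirrors the short-paragraph filter used on the SOTU data in §9.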
# Task 1: VADER and LM sentiment
# YOUR CODE HERE
# Task 2: Supervised classifier
# YOUR CODE HERE
# Task 3: Compare classifier vs VADER
# YOUR CODE HERE
# Task 4 (optional): Logistic Regression vs Naive Bayes
# YOUR CODE HERE
# ── SOLUTION ──────────────────────────────────────────────────────────────────
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
# ── Task 1 ────────────────────────────────────────────────────────────────────
analyzer = SentimentIntensityAnalyzer()
LM_NEGATIVE = {  # same set as §4
    "abandoned","adverse","affect","against","allegations","bankruptcy","breach","burden",
    "cease","challenges","claims","concerns","conflicts","constraints","costs","damages",
    "decline","default","deficiency","delay","difficulty","dispute","doubt","downturn",
    "exposure","fail","failure","falling","fraud","harmful","impair","impose","instability",
    "lack","lawsuit","litigation","loss","losses","negative","objection","obstacle",
    "plaintiff","problem","recession","reject","restraint","restructuring","risk","risks",
    "slowdown","stagnation","termination","threat","unfavorable","unstable","volatility",
    "vulnerability","weakness","worsen","worsening",
}
def lm_neg(text):
    # Share of tokens that appear in the LM negative word list
    words = re.findall(r"\b[a-z]+\b", str(text).lower())
    return sum(1 for w in words if w in LM_NEGATIVE) / max(len(words), 1)
corpus["vader"] = corpus["body_clean"].apply(lambda x: analyzer.polarity_scores(str(x))["compound"])
corpus["lm_neg"] = corpus["body_clean"].apply(lm_neg)
print(f"Correlation between VADER and LM-neg: {corpus[['vader','lm_neg']].corr().iloc[0,1]:.3f}")
print("\nMost positive (VADER):"); print(corpus.nlargest(2, "vader")[["date_parsed","vader"]].to_string(index=False))
print("\nMost negative (LM):"); print(corpus.nlargest(2, "lm_neg")[["date_parsed","lm_neg"]].to_string(index=False))
fig, ax1 = plt.subplots(figsize=(11, 4))
ax2 = ax1.twinx()
ax1.plot(corpus["date_parsed"], corpus["vader"], "o-", color="steelblue", label="VADER")
ax2.plot(corpus["date_parsed"], corpus["lm_neg"], "s-", color="tomato", label="LM-neg")
ax1.set_ylabel("VADER", color="steelblue"); ax2.set_ylabel("LM-neg", color="tomato")
ax1.set_title("Sentiment over time", loc="left")
fig.tight_layout(); plt.show()
# ── Task 2 ────────────────────────────────────────────────────────────────────
X_train, X_test, y_train, y_test = train_test_split(
    training_data["text"], training_data["label"],
    test_size=0.30, random_state=42, stratify=training_data["label"])
clf_lr = make_pipeline(
    TfidfVectorizer(stop_words="english", ngram_range=(1,2), min_df=1),
    LogisticRegression(max_iter=1000),
)
clf_lr.fit(X_train, y_train)
y_pred_lr = clf_lr.predict(X_test)
print(f"\nLogistic Regression, test accuracy: {accuracy_score(y_test, y_pred_lr):.3f}")
print(classification_report(y_test, y_pred_lr, digits=3))
# Confusion matrix requested in Task 2 (rows = true labels, columns = predicted labels)
print(confusion_matrix(y_test, y_pred_lr))
cv_scores = cross_val_score(clf_lr, training_data["text"], training_data["label"],
                            cv=StratifiedKFold(5, shuffle=True, random_state=42))
print(f"5-fold CV: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")
# ── Task 3 ────────────────────────────────────────────────────────────────────
corpus["clf_pred"] = clf_lr.predict(corpus["body_clean"])
corpus["vader_pred"] = np.where(corpus["vader"] > 0, "positive", "negative")
agreement = (corpus["clf_pred"] == corpus["vader_pred"]).mean()
print(f"\nClassifier vs VADER agreement on corpus: {agreement:.2%}")
print(pd.crosstab(corpus["clf_pred"], corpus["vader_pred"], rownames=["classifier"], colnames=["VADER"]))
# ── Task 4 (optional) ─────────────────────────────────────────────────────────
clf_nb = make_pipeline(TfidfVectorizer(stop_words="english", ngram_range=(1,2), min_df=1),
                       MultinomialNB(alpha=0.5))
cv_nb = cross_val_score(clf_nb, training_data["text"], training_data["label"],
                        cv=StratifiedKFold(5, shuffle=True, random_state=42))
print(f"\nNaive Bayes 5-fold CV: {cv_nb.mean():.3f} ± {cv_nb.std():.3f}")
print(f"Logistic 5-fold CV: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")
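The solution cell covers Tasks 1 to 4 only. For Task 5, the following is a minimal sketch of the steps on a tiny made-up corpus, not a reference solution. The texts, dates, and the name toy_corpus are hypothetical; K=2 is used only because the toy data is so small (the task suggests K=5-8); and the unweighted mean across paragraphs is one possible aggregation choice, assumed here for simplicity.

```python
import re
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy stand-in for the ECB corpus (hypothetical texts and dates, for illustration only)
toy_corpus = pd.DataFrame({
    "date_parsed": pd.to_datetime(["2022-07-21", "2023-09-14", "2024-06-06"]),
    "body_clean": [
        "Inflation remains high and energy prices keep rising.\n\nWe decided to raise interest rates.",
        "Inflation is still too high.\n\nGrowth is weak and lending to firms is falling.\n\nWe raise rates again.",
        "Inflation has fallen markedly.\n\nWe decided to lower interest rates.\n\nGrowth is recovering slowly.",
    ],
})

# 1. Split each document into paragraphs (a blank line marks a paragraph break)
rows = []
for doc_id, row in toy_corpus.iterrows():
    for para in re.split(r"\n\s*\n", row["body_clean"]):
        if para.strip():
            rows.append({"doc_id": doc_id,
                         "date_parsed": row["date_parsed"],
                         "paragraph": para.strip()})
paragraphs = pd.DataFrame(rows)

# 2. Fit LDA at the paragraph level
K = 2  # toy value; use 5-8 on a real corpus
vec = CountVectorizer(stop_words="english")
dtm = vec.fit_transform(paragraphs["paragraph"])
lda = LatentDirichletAllocation(n_components=K, random_state=42)
theta = lda.fit_transform(dtm)  # rows = paragraphs, columns = topic shares (each row sums to 1)

# 3. Aggregate paragraph topic shares to document level (unweighted mean)
topic_cols = [f"topic_{k}" for k in range(K)]
para_shares = pd.concat(
    [paragraphs[["doc_id", "date_parsed"]], pd.DataFrame(theta, columns=topic_cols)],
    axis=1,
)
doc_shares = para_shares.groupby(["doc_id", "date_parsed"])[topic_cols].mean().reset_index()
print(doc_shares)
```

For the last step, doc_shares.plot(x="date_parsed", y=topic_cols) gives a quick line chart of topic shares over time. On a real corpus, a length-weighted mean is a reasonable alternative if paragraph lengths vary a lot.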
Summary
| Method | Library | Best for | Limits |
|---|---|---|---|
| VADER | vaderSentiment | General text, quick exploration | Domain mismatch, ceiling on long text |
| Loughran-McDonald | Manual / CSV | Financial / economic text | Word-list rigidity, no negation |
| Naive Bayes | sklearn.naive_bayes | Baseline supervised classifier | Independence assumption, smoothing-sensitive |
| Logistic Regression | sklearn.linear_model | Default supervised classifier | Needs labelled data |
| LSA (TruncatedSVD) | sklearn.decomposition | Dense doc representation, fast | Linear, lossy, requires re-fit on new corpora |
| GloVe (pre-trained) | gensim.downloader | Word similarity, OOV handling | Static: one vector per word regardless of context |
| LDA pipeline (Hansen-style) | sklearn.decomposition | Topic shares over time, replicates QJE 2018 | Needs many paragraphs, K is a judgment call |
Workflow that matters more than the model
- Get labelled data (the hard part).
- Train/test split (stratified, fixed random_state).
- Fit on train. Cross-validate on train.
- Predict on test once. Report accuracy + per-class precision/recall/F1 + confusion matrix.
- Compare against a dictionary baseline. Be honest about disagreement.
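The five steps above can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the lecture's pipeline: the twelve labelled sentences and the NEG_WORDS mini word list are made up (it is not the Loughran-McDonald dictionary), and 4-fold CV is used only because the toy training set has eight rows.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.metrics import classification_report, confusion_matrix

# 1. Labelled data (hypothetical toy set, for illustration only)
texts = [
    "growth is strong and employment is rising",
    "the recovery is robust and confidence improved",
    "profits increased and demand is solid",
    "investment is expanding and the outlook improved",
    "exports are strong and wages are rising",
    "the expansion continues and credit conditions improved",
    "recession risks are rising and output is falling",
    "losses mounted and defaults increased",
    "the downturn deepened and unemployment is rising",
    "weak demand and falling investment hurt growth",
    "instability and volatility threaten the outlook",
    "bankruptcies increased and lending is falling",
]
labels = np.array(["positive"] * 6 + ["negative"] * 6)

# 2. Stratified train/test split with a fixed random_state
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=4, random_state=42, stratify=labels)

# 3. Cross-validate on the training set only, then fit on all of it
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
cv_scores = cross_val_score(clf, X_train, y_train, cv=cv)
print(f"CV accuracy on train: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")
clf.fit(X_train, y_train)

# 4. Predict on the held-out test set once and report
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred, zero_division=0))
cm = confusion_matrix(y_test, y_pred)
print(cm)

# 5. Dictionary baseline: any hit in a small made-up negative word list -> "negative"
NEG_WORDS = {"recession", "losses", "downturn", "weak", "falling",
             "instability", "volatility", "bankruptcies", "defaults"}
baseline = np.array(["negative" if any(w in NEG_WORDS for w in t.split()) else "positive"
                     for t in X_test])
agreement = (baseline == y_pred).mean()
print(f"Classifier vs dictionary agreement on test: {agreement:.2%}")
```

The point of the structure is that the test rows are never seen during cross-validation or fitting, so the test report is an honest out-of-sample check.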
Next lecture
Lecture 9, LLMs via API: replacing the supervised pipeline (and the static embeddings of §8) with a single API call to a large language model. We will compare cost, accuracy, and reproducibility against today's classifier on the same task. The peer-review challenge (§7) and the static embedding comparison (§8) both have direct LLM analogues that we'll run side-by-side.