Lecture 7 - NLP: Text Representation
Python for Economists · University of Bologna · 2025/2026
What we cover today
- Warm-up: text → numbers → insight in 2 minutes
- The NLP pipeline: from raw text to features
- Tokenisation, stopwords, stemming
- Lemmatisation and POS tagging with spaCy
- Bag-of-words and TF-IDF with scikit-learn
- Topic modelling with LDA: what is the corpus about?
- Keyword analysis: most discriminating terms
- Wordclouds: visualisation, not analysis
- Exercise: represent your corpus
Key insight: a machine learning model cannot read prose. Every NLP method first converts text into a numerical representation (a feature matrix) and then runs on those features. Today we learn the standard ways to build that matrix, plus one unsupervised method (LDA) for discovering themes.
0. Setup: packages required for this lecture
Before running the imports below, make sure your_environment has the NLP stack installed. Run this once in your terminal (not in the notebook):
conda activate your_environment
conda install -c conda-forge nltk spacy scikit-learn wordcloud
python -m spacy download en_core_web_sm
nltk, spacy, and wordcloud are not installed by default in the base Anaconda distribution, so you need to add them explicitly. The en_core_web_sm model is a small English language model (~12 MB) that powers spaCy's tokenisation, lemmatisation, POS tagging, and named entity recognition; without it, spacy.load("en_core_web_sm") will fail.
If conda install is too slow or cannot resolve the environment, fall back to pip install nltk spacy scikit-learn wordcloud inside the activated env.
The first import cell below also downloads three NLTK data packages (punkt_tab, stopwords, wordnet). The code saves them to an nltk_data folder inside the current working directory (not the env), so the download only needs to happen once per project folder; later runs find the files and skip it. The quiet=True flag suppresses the download log.
# Section 0 in the markdown above explains the one-time terminal setup.
# The nltk.download() calls below populate ./nltk_data and are idempotent.
import pandas as pd
import numpy as np
import re, string
import matplotlib.pyplot as plt
from collections import Counter
import nltk
from pathlib import Path # object-oriented path handling: `Path.cwd()` returns the current working directory, and `/` joins path components in a cross-platform way
NLTK_DIR = str(Path.cwd() / "nltk_data")
nltk.data.path.insert(0, NLTK_DIR)
for pkg in ("punkt_tab", "stopwords", "wordnet"):
    nltk.download(pkg, download_dir=NLTK_DIR, quiet=True)
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
import spacy
nlp = spacy.load("en_core_web_sm")
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity
1. Warm-up: text → numbers → insight in 2 minutes
We've spent two lectures building text corpora. Now we turn text into something we can analyse with the same tools we'd use for numbers.
Let me show you the whole pipeline in the simplest possible form. I have a corpus of ECB press releases. I'll run dictionary-based sentiment analysis across 25 years and plot it.
# Try the L6 corpus first; if it's too small for a multi-year sentiment chart,
# fall back to a 25-year synthetic series (random templates) so the plot is illustrative.
try:
    _l6 = pd.read_csv("corpus_clean.csv")
    _l6["date"] = pd.to_datetime(_l6["date_parsed"])
    _l6 = _l6.rename(columns={"body_clean": "text"})[["date", "text"]].dropna()
    if len(_l6) >= 30 and _l6["date"].dt.year.nunique() >= 5:
        warmup_corpus = _l6
        warmup_source = "L6 corpus (real)"
    else:
        raise ValueError("L6 corpus too small for time-series demo")
except Exception:
    np.random.seed(0)
    dates = pd.date_range("2000-01-01", "2024-12-31", freq="MS")
    templates = [
        "The outlook is strong, with inflation expectations well anchored and growth robust.",
        "Economic conditions remain fragile, with significant downside risks to growth and financial stability.",
        "Prices are stable, labour markets are healthy, and the recovery is broadening.",
        "Uncertainty is elevated; inflation is persistent and confidence in the outlook has deteriorated.",
    ]
    warmup_corpus = pd.DataFrame({
        "date": dates,
        "text": np.random.choice(templates, size=len(dates)),
    })
    warmup_source = "synthetic 25-year fallback (random templates)"
print(f"Warm-up corpus: {len(warmup_corpus):,} documents | source: {warmup_source}")
warmup_corpus.head()
# Dictionary-based sentiment, simplified Loughran-McDonald style
POSITIVE = {"strong", "robust", "healthy", "stable", "recovery", "growth", "confident",
            "improved", "sound", "resilient", "expansion", "anchored", "broadening"}
NEGATIVE = {"fragile", "risks", "downturn", "deteriorated", "uncertainty", "weak",
            "persistent", "tightening", "contraction", "subdued", "slowdown", "elevated"}
def sentiment(text):
    # Strip all surrounding punctuation (not just "." and ",") so that
    # tokens such as "elevated;" still match the dictionary
    words = [w.strip(string.punctuation) for w in str(text).lower().split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    total = pos + neg
    return (pos - neg) / total if total else 0.0
warmup_corpus["sentiment"] = warmup_corpus["text"].apply(sentiment)
monthly = warmup_corpus.set_index("date")["sentiment"].resample("ME").mean()
fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(monthly.index, monthly.rolling(6).mean(), linewidth=2, color="#1f77b4")
ax.axhline(0, color="black", linewidth=0.5)
for year, label in [(2008, "GFC"), (2012, "EZ crisis"), (2020, "COVID"), (2022, "Energy")]:
    ax.axvline(pd.Timestamp(f"{year}-01-01"), color="grey", alpha=0.3, linestyle="--")
    ax.text(pd.Timestamp(f"{year}-06-01"), ax.get_ylim()[1] * 0.85, label,
            fontsize=8, color="grey")
ax.set_title("ECB communication sentiment (6-month MA)", fontsize=12, loc="left")
ax.set_ylabel("Sentiment score")
ax.set_xlabel("")
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)
plt.tight_layout()
plt.show()
What we did in 30 lines:
- Loaded a corpus of ECB press releases.
- Defined two word lists (positive / negative).
- Scored each document on a [-1, +1] scale.
- Aggregated monthly and plotted.
Every step of this pipeline (tokenisation, word counting, aggregation) is what we cover today, in depth. By the end, you will understand not just how to do this but when a simple dictionary is good enough and when you need bag-of-words, TF-IDF, or topic models.
Discussion: is the sentiment around 2008–2009 what you'd expect? What about 2020?
Reading the output. The plot shows a 6-month moving average of dictionary sentiment, with vertical markers at the four major shocks of the period (GFC 2008, euro-area crisis 2012, COVID 2020, energy shock 2022).
The L6 corpus is too small for a multi-year sentiment chart (only 5 documents spanning 2023–2025), so the loader falls back to the synthetic 25-year series. The fluctuating curve does not track the crisis markers, because the templates are assigned to dates at random (seeded with np.random.seed(0)). The warm-up demonstrates the pipeline, not a real signal; that's the right thing to flag in class. With a real long-running ECB archive (e.g. 200+ docs over 20+ years, scraped from the press archive), the curve would dip visibly around the crisis dates: that is the empirical pattern that makes Apel & Grimaldi (Sveriges Riksbank, 2014), Picault & Renault (J. International Money and Finance, 2017), and Hansen et al. (QJE, 2018) the natural references for this lecture.
The 4-line dictionary is deliberately too crude. It misses negation ("not robust"), domain-specific framing ("hawkish" / "dovish" are absent), and intensity (a "severe slowdown" scores the same as a "slowdown"). The point of the warm-up is to motivate why we spend the next 90 minutes on tokenisation, lemmatisation, BoW, TF-IDF, and topic models: each adds a piece that this trivial pipeline is missing.
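To make the negation blind spot concrete, here is a minimal sketch of a dictionary scorer that flips the polarity of a hit when the word directly before it is a negator. The NEGATORS set, the one-token look-back window, and the function name sentiment_with_negation are assumptions made for this illustration; they are not part of the lecture's pipeline, and the word lists are small subsets of the ones used in the warm-up.

```python
import string

# Small illustrative subsets of the warm-up word lists
POSITIVE = {"strong", "robust", "healthy", "stable"}
NEGATIVE = {"fragile", "weak", "persistent", "elevated"}
# Hypothetical negator list: an assumption for this sketch
NEGATORS = {"not", "no", "never", "hardly"}

def sentiment_with_negation(text):
    """Dictionary score in [-1, +1]; a negator directly before a hit flips its sign."""
    words = [w.strip(string.punctuation) for w in str(text).lower().split()]
    score, hits = 0, 0
    for i, w in enumerate(words):
        polarity = 1 if w in POSITIVE else -1 if w in NEGATIVE else 0
        if polarity == 0:
            continue  # not a dictionary word
        if i > 0 and words[i - 1] in NEGATORS:
            polarity = -polarity  # "not robust" counts as negative
        score += polarity
        hits += 1
    return score / hits if hits else 0.0

print(sentiment_with_negation("Growth is robust."))      # 1.0
print(sentiment_with_negation("Growth is not robust."))  # -1.0
```

A one-token window is the crudest possible rule: it would still miss "not particularly robust", which is one reason sentence-level methods are often preferred.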
2. The NLP pipeline
Raw text
↓ Sentence splitting
↓ Tokenisation (split into words/subwords)
↓ Lowercasing
↓ Punctuation removal
↓ Stopword removal
↓ Stemming / Lemmatisation
↓ Feature representation (BoW, TF-IDF, embeddings)
↓ Model input
Not all steps are always needed: the right pipeline depends on your task. For classification and topic modelling, steps 1–7 are standard. For modern transformer models (L9), most of these steps are handled internally.
# Load the clean corpus from L6
try:
    corpus = pd.read_csv("corpus_clean.csv")
    corpus["date_parsed"] = pd.to_datetime(corpus["date_parsed"], errors="coerce")
    corpus["year"] = corpus["date_parsed"].dt.year
    print(f"Corpus loaded: {len(corpus)} documents")
except FileNotFoundError:
    # Synthetic fallback: 8 short ECB-style monetary-policy statements (2022–2024)
    corpus = pd.DataFrame({
        "date_parsed": pd.to_datetime(["2024-01-25","2023-10-26","2023-09-14",
                                       "2022-12-15","2022-10-27","2022-09-08",
                                       "2024-04-11","2024-09-12"]),
        "title": ["Monetary policy decisions"] * 8,
        "body_clean": [
            "The Governing Council today decided to keep the three key ECB interest rates unchanged. "
            "Inflation is projected to decline gradually but will remain above the 2% target.",
            "The Governing Council kept rates unchanged. Underlying inflation eased and price pressures remain strong.",
            "The Governing Council raised the three key ECB interest rates by 25 basis points to ensure timely return of inflation to target.",
            "The Governing Council raised rates by 50 basis points. Interest rates will rise significantly to reach restrictive levels.",
            "The Governing Council raised the deposit facility rate by 75 basis points to tackle persistent inflation pressures.",
            "The Governing Council raised the deposit facility rate by 75 basis points to ensure inflation returns to the 2% medium-term target.",
            "Incoming data confirm the inflation outlook. Wage growth is moderating and financing conditions remain restrictive.",
            "The Governing Council lowered the deposit facility rate by 25 basis points based on the updated inflation outlook.",
        ],
    })
    corpus["year"] = corpus["date_parsed"].dt.year
    print(f"Using synthetic corpus ({len(corpus)} documents).")
corpus.head(3)
Reading the output. With the L5/L6 scrape successful, you should see "Corpus loaded: 5 documents": the press releases and press conference statements that survived L6's is_valid filter (≥50 words). With the synthetic fallback (no internet), you'd see "Using synthetic corpus (8 documents)".
The corpus is heterogeneous in length and type. The 5 docs split into two short press releases (~500–700 words: just the rate-decision text) and three press conferences (~5,500–6,500 words: the rate decision + introductory statement + Q&A with journalists). This matters: the Q&A sections introduce vocabulary that's not policy language proper (question, think, said, agree) and that you'll see surfacing in TF-IDF and topic-modelling outputs below. In a serious replication you'd typically separate the prepared statement from the Q&A and analyse them as two distinct corpora; Hansen, McMahon, Tong (J. Int. Eco., 2019) is the cleanest example.
The two-track loading pattern (try real file, except synthetic fallback) is the same as in L6 and is worth keeping in any data-pipeline notebook: a deterministic fallback lets you debug downstream cells without depending on the upstream pipeline working.
3. Tokenisation, stopwords, stemming
# Sample document
doc = corpus["body_clean"].iloc[0]
print("Original:")
print(doc[:200])
print()
# Step 1: Sentence splitting
sentences = sent_tokenize(doc)
print(f"Sentences ({len(sentences)}):")
for s in sentences[:3]:
    print(f" {s}")
Reading the output. sent_tokenize from NLTK splits on sentence boundaries, primarily ".", "!" and "?" followed by whitespace and an uppercase letter. The first L6 document (a 2023-05-04 press release, 628 words) yields 26 sentences. The longer press-conference docs in this corpus (5,000–6,500 words each) typically yield 200–300 sentences.
Why split into sentences at all? For most BoW/TF-IDF work, you don't; you can tokenise the whole document at once. But sentence splitting matters for: (i) sentiment models that score per-sentence and aggregate (Picault & Renault 2017 do this), (ii) extractive summarisation, (iii) any model that needs context windows shorter than the document. Get into the habit of doing it now; you'll need it in L8.
# Step 2: Word tokenisation
tokens = word_tokenize(doc.lower())
print(f"Total tokens: {len(tokens)}")
print("First 20:", tokens[:20])
Reading the output. word_tokenize produces 687 tokens for the first doc (628 words → 687 tokens; the small expansion comes from punctuation and contractions being split into their own tokens). Notice three things:
- Punctuation becomes tokens. '.' and ',' appear in the list. We strip them in the next step.
- Numbers stay intact, but symbols are split off. '4.00%' becomes two tokens ('4.00', '%'): the decimal number is kept whole and the percent sign is separated. For monetary-policy text where percentage points matter, this is fine; for other domains you may want to keep them joined.
- Lowercasing is not automatic: we did doc.lower() before tokenising. If you skip that step, "ECB" and "ecb" become two distinct vocabulary entries, inflating your feature space for no information gain.
NLTK's tokeniser is a regex-based Penn Treebank tokeniser. It's fast and adequate for most tasks but not as robust as spaCy's tokeniser (rule-based, with language-specific exception lists), which we'll see in §4.
# Step 3: Remove punctuation and non-alphabetic tokens
tokens_clean = [t for t in tokens if t.isalpha()]
print(f"After punctuation removal: {len(tokens_clean)} tokens")
# Step 4: Remove stopwords
stop_en = set(stopwords.words("english"))
# Add domain-specific stopwords: words that appear everywhere and carry no signal
domain_stop = {"per", "cent", "ecb", "governing", "council", "decided", "december", "march"}
stop_en.update(domain_stop)
tokens_no_stop = [t for t in tokens_clean if t not in stop_en]
print(f"After stopword removal: {len(tokens_no_stop)} tokens")
print("First 20:", tokens_no_stop[:20])
Reading the output. Two filtering steps cut the token count substantially: 687 → 613 (after punctuation) → 320 (after stopwords) for the first doc. Those two passes typically remove 50–60% of tokens in English prose. Half your "data" was function words.
Domain stopwords matter more than the standard list. "ECB", "Governing", "Council", and "decided" appear in every single monetary-policy document in this corpus; keeping them in dilutes everything else. The general rule: after a first BoW pass, look at the top-20 most frequent terms; any that appear in >80% of documents are candidates for the domain stopword list. Build this list iteratively, document by document.
Don't over-prune. A common mistake in policy-text NLP is dropping "rate" or "inflation" because they're frequent, but those are the substantive nouns. Frequency is not the same as informativeness. TF-IDF (§5) handles this more gracefully than hard stopword filtering.
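The ">80% of documents" rule of thumb can be sketched in a few lines of plain Python. The function name domain_stopword_candidates and the toy token lists are hypothetical, introduced only for this illustration; the output is a list of candidates to review by hand, not a list to drop blindly.

```python
from collections import Counter

def domain_stopword_candidates(token_lists, threshold=0.8):
    """Return terms that occur in more than `threshold` of the documents."""
    n_docs = len(token_lists)
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))  # count each term once per document
    return sorted(t for t, df in doc_freq.items() if df / n_docs > threshold)

# Toy corpus: each inner list stands for one tokenised document
toy_docs = [
    ["council", "raised", "rates", "inflation"],
    ["council", "kept", "rates", "unchanged"],
    ["council", "lowered", "rates", "outlook"],
    ["council", "inflation", "outlook", "wage"],
    ["council", "rates", "restrictive"],
]
print(domain_stopword_candidates(toy_docs))  # ['council']
```

Here "rates" appears in 4 of 5 documents (exactly 80%) and is not flagged, which matches the warning above: a substantive noun close to the threshold deserves a human decision.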
# Step 5a: Stemming - aggressive, rule-based suffix stripping (the Porter rules are designed for English)
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in tokens_no_stop]
print("Stemmed:", stems[:20])
print()
# Notice: 'inflation' → 'inflat', 'rates' → 'rate', 'economy' → 'economi'
# Stemming is fast but produces non-words: fine for BoW, bad for display
Reading the output. Porter's stemmer chops suffixes mechanically: 'inflation' → 'inflat', 'continues' → 'continu', 'pressures' → 'pressur', 'raise' → 'rais', 'basis' → 'basi', 'incoming' → 'incom'. Some outputs are real words ('high', 'long'), others are not.
When to use stemming. Two situations: (i) you have a very small corpus and you need to collapse inflected variants such as inflation, inflations, inflating into a single feature to get any signal at all; (ii) you don't care that the output is human-readable (pure feature engineering for a downstream classifier).
When not to use stemming. Almost everything else, including topic models, where you want interpretable topic labels, and any case where you'll show top terms to a reader. Lemmatisation (next cell) gives you almost all the merging benefit without the disfigured words.
# Step 5b: Lemmatisation - linguistically principled, produces real words
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(t) for t in tokens_no_stop]
print("Lemmatised:", lemmas[:20])
print()
# 'rates' → 'rate', 'projected' → 'projected' (needs POS for best results)
Reading the output. WordNet's lemmatiser keeps everything as real English words: 'pressures' → 'pressure', 'rates' → 'rate', 'points' → 'point'. But notice what it leaves unchanged: 'continues', 'incoming', 'ongoing', all verb forms or participles that should arguably collapse to their base verbs (continue, come, go). Why? WordNet defaults to noun lemmatisation. Without a POS tag, it doesn't know that 'incoming' is an adjective and 'continues' is a verb form, so it leaves them alone.
To get accurate lemmas you need to pass POS info: lemmatizer.lemmatize("running", pos="v") → "run". In NLTK this means running a separate POS tagger (nltk.pos_tag) and mapping its tags to WordNet's categories, which is one of several reasons we switch to spaCy in the next section: it does tokenisation + lemmatisation + POS in a single pass and uses the POS internally to produce better lemmas.
4. Lemmatisation and POS tagging with spaCy
spaCy is a production-grade NLP library that performs tokenisation, lemmatisation, POS tagging, and named entity recognition in a single pass. It is generally more accurate than NLTK for English.
# spaCy processes a document in one call
doc_spacy = nlp(corpus["body_clean"].iloc[0])
print(f"{'Token':<20} {'Lemma':<20} {'POS':<8} {'Stop?'}")
print("-" * 60)
for token in list(doc_spacy)[:20]:
    print(f"{token.text:<20} {token.lemma_:<20} {token.pos_:<8} {token.is_stop}")
Reading the output. Each row gives you four pieces of information per token: surface form, lemma, part-of-speech tag, and a boolean stopword flag. For tokens like "continues" you get lemma "continue", POS VERB, stop False. Note that spaCy correctly identifies the verb and lemmatises it (NLTK's WordNet left it as "continues" because of the noun-default issue). For "the" you get lemma "the", POS DET, stop True. For "is" you get lemma "be", POS AUX, stop True.
The POS-aware lemmas are the key advantage over NLTK. Where WordNet left "continues", "incoming", and "ongoing" alone (defaulting to noun), spaCy tags them as VERB/ADJ and lemmatises accordingly: continues → continue, while incoming stays incoming, which is the correct lemma for an adjective. For inflectionally rich languages the gap with NLTK widens further.
The is_stop flag is also generally better curated than NLTK's English stopword list: it includes contractions like "n't" and "'s" that NLTK misses. But it does not include domain stopwords; you still need to extend it for ECB-specific noise ("ecb", "governing", etc.).
# Named entity recognition: useful for extracting actors, locations, dates
print("Named entities:")
for ent in doc_spacy.ents[:15]:
    print(f" {ent.text!r:<30} [{ent.label_}]")
Reading the output. spaCy's NER on this corpus typically picks up: "the Governing Council" as ORG, percentages like "3.75%" and "25 basis points" as PERCENT/QUANTITY, dates like "4 May 2023" as DATE, "the medium-term" as DATE, sometimes "ECB" as ORG, and on the press-conference transcripts you'll also see PERSON entities like "Lagarde" and "Lane" (the journalists' references to the President and Chief Economist). Quality varies: "the three key ECB interest rates" may be parsed as a single ORG or split arbitrarily.
Why NER matters for economists. Three concrete uses: (i) build a panel of company / institution mentions over time (event studies, attention-as-data); (ii) extract dates and money amounts from contracts or judgments (Hansen et al. on FOMC minutes is the classic); (iii) anonymise text by replacing PERSON entities with [PERSON] before sharing.
Quality caveat. The en_core_web_sm model is small (~12 MB). For high-stakes NER use en_core_web_lg (~750 MB, more accurate) or en_core_web_trf (transformer-based, slower but state-of-the-art). For non-English text use the corresponding language-specific *_core_news_* models (e.g. it_core_news_sm for Italian).
# Build a cleaned token list using spaCy
def spacy_tokens(text, min_len=3):
    """Tokenise with spaCy, remove stopwords, punctuation, and short tokens.
    Returns a list of lowercase lemmas."""
    doc = nlp(text)
    return [
        token.lemma_.lower()
        for token in doc
        if not token.is_stop
        and not token.is_punct
        and not token.is_space
        and token.is_alpha
        and len(token.text) >= min_len
    ]
sample_tokens = spacy_tokens(corpus["body_clean"].iloc[0])
print(f"Tokens: {len(sample_tokens)}")
print("First 20:", sample_tokens[:20])
Reading the output. The spacy_tokens helper collapses tokenisation + lemmatisation + filtering into one function. For the first L6 doc (the 2023-05-04 press release, 628 words) you get ~250–300 lemmas, with cleaner forms than NLTK: "continue" instead of "continues", "pressure" instead of "pressures", etc.
The is_alpha filter drops anything containing digits or punctuation: "3.75%" is gone, "25" is gone. For most NLP tasks this is what you want; if percentages matter for your analysis (e.g., extracting policy-rate values), handle them with a regex before calling spacy_tokens.
The min_len=3 filter drops "is", "to", "by". Most of these are stopwords already, but it also removes short alphabetic tokens that are not on the stopword list (abbreviations such as "eu" or "qe"). Tokens like "3" (digit) or "q1" (quarter) never reach this filter, because is_alpha has already removed them. Tune this based on your corpus.
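As a sketch of the "handle them with a regex first" advice, the snippet below pulls percentages and basis-point moves out of the raw text before any alphabetic filtering is applied. The two patterns and the helper name extract_rates are assumptions made for this illustration; real press releases may phrase rate changes in ways these patterns do not cover.

```python
import re

# Assumed patterns for this sketch: "3.75%", "2 per cent", "25 basis points"
PCT_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(?:%|per\s?cent)", flags=re.IGNORECASE)
BP_RE = re.compile(r"(\d+)\s+basis\s+points?", flags=re.IGNORECASE)

def extract_rates(text):
    """Return (percentages, basis_point_moves) found in text as lists of numbers."""
    pcts = [float(m) for m in PCT_RE.findall(text)]
    bps = [int(m) for m in BP_RE.findall(text)]
    return pcts, bps

sample = ("The Governing Council raised the deposit facility rate by 25 basis points "
          "to 3.75%, to ensure inflation returns to the 2% target.")
print(extract_rates(sample))  # ([3.75, 2.0], [25])
```

Storing these numbers in separate DataFrame columns keeps the quantitative information available even after the token list has been reduced to alphabetic lemmas.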
# Apply to entire corpus (can take a moment for large corpora).
# This cell calls nlp() once per document; for large corpora, spaCy's nlp.pipe()
# processes texts in batches and is much faster.
corpus["tokens"] = [
    spacy_tokens(text)
    for text in corpus["body_clean"]
]
corpus["n_tokens"] = corpus["tokens"].str.len()
print(f"Avg tokens per document: {corpus['n_tokens'].mean():.0f}")
corpus[["date_parsed","n_tokens","tokens"]].head(3)
Reading the output. The cell reports the average token count per document plus the head of the DataFrame. With the L6 corpus (5 docs, 19,131 raw words total, 512 to 6,501 per doc), expect roughly 250–3,000 lemmas per doc after filtering, with a corpus mean around 1,800 lemmas per doc. The press conferences (with their long Q&A sections) dominate the token budget.
Performance note. The current implementation calls nlp(text) once per document inside a list comprehension. This is fine for ≤500 documents; for larger corpora switch to nlp.pipe(corpus["body_clean"], batch_size=50), which batches the calls and runs ~3–5× faster. We'll need this in L8 when we work with multi-thousand-document corpora.
5. Bag-of-Words and TF-IDF
Bag-of-Words (BoW)
Represent each document as a vector of word counts. The "bag" metaphor: we throw all words in a bag and count them; order and grammar are ignored.
TF-IDF
Term Frequency–Inverse Document Frequency weights each word by:
- how often it appears in this document (TF)
- scaled down by how many documents in the corpus contain it (IDF)
Words that appear everywhere (like "the", "is") get low weight. Words that are specific to a few documents get high weight. TF-IDF is the classical baseline for text classification, retrieval, and similarity tasks (it has been the dominant feature representation for ~30 years and is still hard to beat on small corpora).
# scikit-learn vectorisers work on raw strings β they handle tokenisation internally
# We pass our cleaned text (body_clean) and let the vectoriser do the rest
# Bag-of-Words
bow_vec = CountVectorizer(
    max_features=500,   # keep only the 500 most frequent terms
    min_df=2,           # ignore terms that appear in fewer than 2 docs
    stop_words="english",
    token_pattern=r"\b[a-zA-Z]{3,}\b",  # only alphabetic tokens >= 3 chars
)
bow_matrix = bow_vec.fit_transform(corpus["body_clean"])
print(f"BoW matrix shape: {bow_matrix.shape}")
print(f"Vocabulary size: {len(bow_vec.get_feature_names_out())}")
print(f"Sample vocab (first 20): {list(bow_vec.get_feature_names_out())[:20]}")
# Top 10 most frequent terms across the corpus
freq = bow_matrix.sum(axis=0).A1
vocab = bow_vec.get_feature_names_out()
top_freq = sorted(zip(vocab, freq), key=lambda x: -x[1])[:10]
print(f"Top 10 by raw frequency: {top_freq}")
Reading the output. A (5, 500) matrix on the L6 corpus: 5 documents × 500 distinct terms (we hit the max_features=500 cap; without it the vocab would be ~1,200 terms, since L6 has ~19k word tokens). Every cell [i, j] is the count of term j in document i. Many entries are zero; sparse matrices like this are why scikit-learn returns a scipy.sparse object instead of a dense numpy array (memory: 500 × 5 = 2,500 cells dense vs. ~1,000 non-zero cells sparse, and the gap explodes for real corpora).
Top-frequency words on the L6 corpus: inflation (177), cent (92), policy (81), monetary (75), growth (64), rate (60), question (58), think (57), euro (53), council (51). Notice two things: (i) inflation dominates by a factor of ~2× over the next term, which is what you expect in a hawkish-cycle press archive; (ii) question and think rank in the top 10. These are Q&A artifacts from the press-conference docs ("the next question is…", "I think that…"). Raw frequency tells you what is in the bag, not what is informative. That's why we move to TF-IDF.
The two filters that matter: min_df=2 (a term must appear in ≥2 documents to be kept, which drops typos and one-off names) and max_features=500 (cap vocabulary size at the 500 most frequent terms, which keeps the matrix manageable). Tune both based on corpus size: with a 5,000-document corpus you'd typically use min_df=5 and max_features=10000 or higher.
# TF-IDF
tfidf_vec = TfidfVectorizer(
    max_features=500,
    min_df=2,
    stop_words="english",
    token_pattern=r"\b[a-zA-Z]{3,}\b",
    sublinear_tf=True,  # use 1 + log(tf): dampens effect of very frequent terms
)
tfidf_matrix = tfidf_vec.fit_transform(corpus["body_clean"])
print(f"TF-IDF matrix: {tfidf_matrix.shape}")
# Convert to DataFrame for inspection
tfidf_df = pd.DataFrame(
    tfidf_matrix.toarray(),
    columns=tfidf_vec.get_feature_names_out(),
    index=corpus["date_parsed"],
)
tfidf_df.iloc[:3, :8].round(3)
Reading the output. Same shape as the BoW matrix, (5, 500), because we used identical vocabulary filters, but the values are now floats in [0, 1] instead of integer counts. Each row is L2-normalised by default, so different-length documents become comparable (this matters a lot for this corpus, where doc length ranges from 512 to 6,501 words).
sublinear_tf=True is almost always the right choice. It replaces raw TF with 1 + log(TF), so a term appearing 10 times doesn't get 10× the weight of a term appearing once. Without it, one very long document dominates similarity computations; with three press conferences vastly outweighing the two short releases here, this would be a real problem. Manning et al. (Introduction to Information Retrieval, 2008, ch. 6) is the canonical reference for why this matters.
Sanity check on the head of the DataFrame: the rows are dated documents, the columns are terms, the cells are weights. Notice that some cells are 0 (term absent from that document) and the non-zero cells are typically in the 0.02–0.15 range; TF-IDF values are bounded but not concentrated.
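A quick numerical check of the sublinear scaling, as a minimal sketch using only the standard library. The helper name sublinear_tf is chosen here for illustration, and the natural logarithm is assumed, which is what scikit-learn applies.

```python
import math

def sublinear_tf(count):
    """Sublinear term-frequency scaling: 1 + ln(count) for count > 0, else 0."""
    return 1 + math.log(count) if count > 0 else 0.0

for count in (1, 10, 100):
    print(count, round(sublinear_tf(count), 2))
# 1 1.0
# 10 3.3
# 100 5.61
```

A term that occurs 100 times ends up with under six times the weight of a term that occurs once, before the IDF factor and row normalisation are applied.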
5.bis Document similarity with cosine
Now that we have a TF-IDF matrix, a natural question is: how similar are two documents? The matrix tfidf_matrix has one row per document and one column per term. Each row is a high-dimensional vector, the document represented as a point in term-space. Comparing documents reduces to comparing their vectors.
Why not Euclidean distance?
The first instinct is to use Euclidean distance, the straight-line distance between two points. The problem: Euclidean distance is sensitive to vector magnitude, and in raw-count or unnormalised TF-IDF space vector magnitude grows with document length. A long press conference and a short press release that talk about exactly the same topics would still be far apart in Euclidean space, because the long document has more non-zero entries with larger values. That is not what we want when we ask "do these two documents discuss similar things". (TfidfVectorizer's default L2 row normalisation removes this length effect; on unit-length rows Euclidean distance becomes a monotone function of cosine similarity.)
Cosine similarity: the angle between two vectors
Cosine similarity solves this by measuring the angle between two vectors, ignoring their magnitude:
$$ \cos(\mathbf{d}_i, \mathbf{d}_j) \;=\; \frac{\mathbf{d}_i \cdot \mathbf{d}_j}{\|\mathbf{d}_i\| \, \|\mathbf{d}_j\|} $$
The numerator is the dot product (it grows when the two documents share many terms with similar weights); the denominator divides out the lengths of the two vectors. The result is a number in $[0, 1]$ for non-negative TF-IDF vectors:
- 1.0: identical direction (the two documents have proportional term distributions: same topics in the same proportions, regardless of length).
- 0.0: orthogonal (no terms in common at all).
- Values in between: partial vocabulary overlap.
This is the standard similarity measure for text retrieval and document clustering (Manning, Raghavan & Schütze, Introduction to Information Retrieval, 2008, ch. 6).
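The formula translates directly into a few lines of NumPy. The sketch below uses three hypothetical term-weight vectors (invented for this illustration) to show the two boundary cases from the list above and the length invariance that motivates the measure.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two vectors: dot product over product of norms."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

short_doc = np.array([2.0, 1.0, 0.0, 1.0])  # hypothetical term weights
long_doc = 5 * short_doc                     # same proportions, five times "longer"
other_doc = np.array([0.0, 0.0, 3.0, 0.0])  # shares no terms with short_doc

print(round(cosine(short_doc, long_doc), 6))  # 1.0
print(cosine(short_doc, other_doc))           # 0.0
# Euclidean distance between the two proportional documents is far from zero
print(round(float(np.linalg.norm(short_doc - long_doc)), 2))  # 9.8
```

In the lecture's notebook the same computation is done for all pairs at once by sklearn.metrics.pairwise.cosine_similarity, which accepts the sparse TF-IDF matrix directly.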
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances
# Take the first document as a reference and compare it to every doc in the corpus
ref_idx = 0
ref_date = corpus["date_parsed"].iloc[ref_idx].strftime("%Y-%m-%d")
cos_to_ref = cosine_similarity(tfidf_matrix[ref_idx], tfidf_matrix).flatten()
eucl_to_ref = euclidean_distances(tfidf_matrix[ref_idx], tfidf_matrix).flatten()
similarity_table = pd.DataFrame({
    "date": corpus["date_parsed"].dt.strftime("%Y-%m-%d").values,
    # Word counts are computed here so the cell also runs on the synthetic
    # fallback corpus, which has no n_words column
    "n_words": corpus["body_clean"].str.split().str.len().values,
    "cosine": cos_to_ref.round(3),
    "euclidean": eucl_to_ref.round(3),
})
print(f"Similarity to reference document ({ref_date}):\n")
print(similarity_table.to_string(index=False))
Reading the output. The reference is the 2023-05-04 press release (628 words). Cosine similarity to the other four documents:
- 2024-01-25 (press release, 512 words): cos = 0.84. Two short releases of similar length and structure: very high similarity.
- 2023-09-14 (press conference, 6,051 words): cos = 0.44. Same speaker, same period, but the conference adds a long Q&A section with vocabulary that the release does not have.
- 2024-12-12 (press conference, 5,439 words): cos = 0.39. Different document type and different policy regime (cuts era).
- 2025-10-30 (press conference, 6,501 words): cos = 0.33. Furthest apart: long Q&A plus a two-and-a-half-year gap.
Cosine vs Euclidean: why we report cosine. Compare the two columns. The 2024-01-25 release is closest to the reference under both measures (Euclidean 0.57, against 1.06, 1.11 and 1.16 for the press conferences), and the ordering is identical. That is no coincidence: TfidfVectorizer L2-normalises every row, and for unit-length vectors Euclidean distance depends on cosine alone, d = sqrt(2 − 2·cos) (for example sqrt(2 − 2 × 0.84) ≈ 0.57). On normalised TF-IDF the two columns therefore carry the same information. The length problem appears when vectors are not normalised (raw counts in bow_matrix, or TfidfVectorizer(norm=None)): there a long document sits far from a short one even when their term proportions match. We report cosine because it is bounded and directly interpretable: 0.84 means "very similar topic distribution", 0.33 means "weakly overlapping vocabulary", regardless of document length.
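One way to check how the two columns relate is to work with unit-length vectors, which is what TfidfVectorizer produces under its default norm="l2". A minimal sketch, using two random non-negative vectors as stand-ins for document rows (an assumption for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two hypothetical non-negative "document" vectors
a, b = rng.random(50), rng.random(50)

# Unit-length versions, as produced by L2 row normalisation
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

cos = float(a_unit @ b_unit)
eucl = float(np.linalg.norm(a_unit - b_unit))

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
print(np.isclose(eucl, np.sqrt(2 - 2 * cos)))  # True

# Plug in the cosine values reported above and compare with the euclidean column
for c in (0.84, 0.44, 0.39, 0.33):
    print(c, round(float(np.sqrt(2 - 2 * c)), 2))
```

The loop reproduces the Euclidean column up to rounding of the inputs, so on L2-normalised TF-IDF rows the two measures rank documents identically; they only diverge on unnormalised vectors such as raw counts.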
What cosine similarity does not tell you. Two important caveats:
- It measures vocabulary overlap, not meaning. A hawkish press release ("inflation remains too high, rates must stay restrictive") and a dovish one ("inflation is converging, rates can start to come down") share most of their vocabulary (inflation, rates, target, restrictive) and will have high cosine similarity, even though their policy stance is opposite. To capture polarity you need sentiment analysis (L8) or word embeddings (L9), not BoW/TF-IDF.
- It depends entirely on the vocabulary you kept. If you removed not, no, never as stopwords, you erased the very tokens that flip a sentence's meaning. This is the standard reason economists working on policy text either keep negation tokens explicitly or use sentence-level methods that handle negation (Loughran & McDonald 2011 build their dictionary precisely to avoid this trap).
Cosine similarity is the right tool for "are these documents about similar things". It is the wrong tool for "do these documents agree on those things". Keep that distinction in mind throughout the rest of the lecture and in L8.
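The second caveat is easy to reproduce. The sketch below compares two toy sentences (made up for this example) with opposite meaning, once with scikit-learn's built-in English stopword list, which contains "not", and once without any stopword filtering.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Two toy sentences with opposite meaning
docs = ["Inflation is high.", "Inflation is not high."]

# With the built-in English stopword list, "not" is removed before weighting
vec_stop = TfidfVectorizer(stop_words="english")
sim_stop = cosine_similarity(vec_stop.fit_transform(docs))[0, 1]
print(sorted(vec_stop.vocabulary_))   # surviving vocabulary
print(round(float(sim_stop), 3))      # 1.0: the two sentences look identical

# Without stopword filtering, "not" stays and the two vectors differ
vec_all = TfidfVectorizer()
sim_all = cosine_similarity(vec_all.fit_transform(docs))[0, 1]
print(round(float(sim_all), 3))       # below 1.0
```

Even in the second case the similarity stays high, because a single extra token separates the sentences: keeping negation words is necessary for polarity but not sufficient.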
β± Ten-minute challenge β your first text-feature analysisΒΆ
Your task. Pick one of the two options below (both genuinely require the BoW/TF-IDF machinery you just saw) and produce a short result.
- Option A: Distinctive terms per document. For each document in corpus, find the top 5 terms with the highest TF-IDF score (i.e., the words that most distinguish that document from the rest of the corpus). Return a pandas.DataFrame indexed by date_parsed, with one column per rank (rank_1 ... rank_5).
- Option B: Document similarity matrix. Compute the pairwise cosine similarity between all documents using the TF-IDF matrix. Display the resulting (N × N) matrix as a heatmap with matplotlib.pyplot.imshow, axes labelled by date_parsed. Which two documents are most similar? Most different?
You have 10 minutes. Work in pairs if you want. Tips:
- Use the existing tfidf_matrix and tfidf_vec from §5; don't re-fit the vectoriser.
- For Option A: tfidf_matrix.toarray()[i].argsort()[-5:][::-1] gives the top-5 column indices for document i. Map them back to terms via tfidf_vec.get_feature_names_out().
- For Option B: from sklearn.metrics.pairwise import cosine_similarity then cosine_similarity(tfidf_matrix). Use ax.imshow(..., cmap="viridis", vmin=0, vmax=1) and plt.colorbar().
Teaching note: the two options stress different mechanics. Option A is about reading the TF-IDF matrix row-by-row (per-document view); Option B is about pairwise comparisons (corpus-geometry view). Whichever students pick first, the other is a useful follow-up exercise. Both lead naturally into the LDA section that follows: Option A motivates "what is each document about?", Option B motivates "are there clusters of similar documents, i.e., latent topics?"
# YOUR CODE HERE (10 minutes)
# Skeleton (Option A):
# from sklearn.feature_extraction.text import TfidfVectorizer
# vocab = tfidf_vec.get_feature_names_out()
# rows = []
# for i in range(tfidf_matrix.shape[0]):
#     top_idx = tfidf_matrix.toarray()[i].argsort()[-5:][::-1]
#     rows.append([vocab[j] for j in top_idx])
# top_terms = pd.DataFrame(rows, columns=[f"rank_{k+1}" for k in range(5)],
#                          index=corpus["date_parsed"])
# top_terms
# Skeleton (Option B):
# from sklearn.metrics.pairwise import cosine_similarity
# sim = cosine_similarity(tfidf_matrix)
# fig, ax = plt.subplots(figsize=(6, 5))
# im = ax.imshow(sim, cmap="viridis", vmin=0, vmax=1)
# labels = [d.strftime("%Y-%m-%d") for d in corpus["date_parsed"]]
# ax.set_xticks(range(len(labels))); ax.set_xticklabels(labels, rotation=45, ha="right")
# ax.set_yticks(range(len(labels))); ax.set_yticklabels(labels)
# plt.colorbar(im, ax=ax, label="cosine similarity")
# fig.tight_layout(); plt.show()
# ── SOLUTION (Option A) ──────────────────────────────────────────────────────
vocab = tfidf_vec.get_feature_names_out()
dense = tfidf_matrix.toarray()  # densify once instead of once per document
rows = []
for i in range(dense.shape[0]):
    top_idx = dense[i].argsort()[-5:][::-1]  # column indices of the 5 largest weights
    rows.append([vocab[j] for j in top_idx])
top_terms = pd.DataFrame(rows,
                         columns=[f"rank_{k+1}" for k in range(5)],
                         index=corpus["date_parsed"])
print("Top 5 distinctive terms per document (TF-IDF):")
print(top_terms.to_string())
# ── SOLUTION (Option B) ──────────────────────────────────────────────────────
from sklearn.metrics.pairwise import cosine_similarity
sim = cosine_similarity(tfidf_matrix)
fig, ax = plt.subplots(figsize=(6, 5))
im = ax.imshow(sim, cmap="viridis", vmin=0, vmax=1)
labels = [d.strftime("%Y-%m-%d") for d in corpus["date_parsed"]]
ax.set_xticks(range(len(labels))); ax.set_xticklabels(labels, rotation=45, ha="right")
ax.set_yticks(range(len(labels))); ax.set_yticklabels(labels)
ax.set_title("Pairwise cosine similarity (TF-IDF)", loc="left")
plt.colorbar(im, ax=ax, label="cosine similarity")
fig.tight_layout(); plt.show()
# Identify most-similar and most-different pairs (excluding diagonal)
sim_no_diag = sim.copy()
np.fill_diagonal(sim_no_diag, np.nan)
i_max, j_max = np.unravel_index(np.nanargmax(sim_no_diag), sim_no_diag.shape)
i_min, j_min = np.unravel_index(np.nanargmin(sim_no_diag), sim_no_diag.shape)
print(f"\nMost similar: {labels[i_max]} ↔ {labels[j_max]} (cos = {sim[i_max, j_max]:.3f})")
print(f"Most different: {labels[i_min]} ↔ {labels[j_min]} (cos = {sim[i_min, j_min]:.3f})")
Reading the output. On the L6 corpus, Option A produces a per-document table whose top-5 terms reveal the heterogeneity I flagged earlier: the two short press releases (2023-05-04, 2024-01-25) have top-5 terms dominated by policy-decision boilerplate plus balance-sheet vocabulary (council, governing, policy, app, portfolio, intends), while the three press conferences pick up Q&A vocabulary alongside policy terms (cent, growth, question, think, contribution, good, risk). This is exactly the diagnostic that motivates separating the two document types before serious analysis.
Option B produces a 5×5 cosine similarity heatmap with three clear blocks. The two press releases sit very close to each other (cos ≈ 0.84) and far from the press conferences (cos ≈ 0.31–0.44). The three press conferences are similar to each other (cos ≈ 0.74–0.76) and only modestly similar to the releases. The most-similar pair is 2023-05-04 ↔ 2024-01-25 (cos = 0.84; both are short, formulaic press releases); the most-different pair is 2024-01-25 ↔ 2025-10-30 (cos = 0.31; short release vs. longest conference, two-year gap, regime change).
The dominant axis of variation in this corpus is document type, not policy regime. That is a striking and important finding: even with a regime change (hike cycle 2023 → cuts/holds 2025), TF-IDF cosine similarity groups documents by their format and length (release vs. conference) before grouping them by their content (hawkish vs. dovish). This is a general lesson: bag-of-words representations are very sensitive to vocabulary overlap, which scales with document length, so without explicit length-normalisation or stratification by document type your "topic clusters" may end up reflecting format more than substance. Hansen et al. (QJE 2018) handle this explicitly by working only on FOMC transcripts of comparable length and structure.
Two takeaways for class. First, cosine similarity on TF-IDF is a coarse document-level measure: it tells you which docs share vocabulary, not which docs share meaning. Two documents that disagree about inflation (one hawkish, one dovish) will still have high cosine similarity because they use the same words. Second, both Option A and Option B set up LDA in the next section: Option A asks "what is each document about?" and Option B asks "are there clusters of similar documents?" LDA answers both jointly by positing latent themes that explain both per-document term usage and document-document similarity.
6. Topic modelling with LDA: what is the corpus about?
TF-IDF tells us which words are distinctive for each document. But as a researcher, you often want the reverse: which themes recur across documents? That's topic modelling.
Latent Dirichlet Allocation (LDA, Blei, Ng, Jordan 2003, JMLR) is the workhorse method. The intuition:
- Each document is a mixture of topics (Document 1 is 60% monetary policy, 30% banking, 10% other).
- Each topic is a distribution over words (the "monetary policy" topic puts high probability on inflation, rates, target, low probability on bank, regulation).
- The model finds the topics that best explain the observed word counts, given a small number of settings you choose up front (the number of topics K and the Dirichlet priors).
LDA is unsupervised: you don't tell it what the topics are. You tell it how many (K), and it returns word distributions that you then label by inspection.
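The two bullet points above are really a recipe for generating a document, and it can help to run that recipe forwards before fitting it backwards. The sketch below is a toy illustration only: the vocabulary, the two topic-word distributions, and the prior alpha are all invented, and sample_document is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = np.array(["inflation", "rates", "target", "bank", "regulation", "capital"])
K = 2  # number of topics, fixed by hand for this toy example

# Each topic is a probability distribution over the vocabulary (rows sum to 1).
# Invented values: topic 0 resembles "monetary policy", topic 1 "banking".
topic_word = np.array([
    [0.40, 0.30, 0.20, 0.04, 0.03, 0.03],
    [0.05, 0.05, 0.05, 0.35, 0.25, 0.25],
])

def sample_document(n_words, alpha=0.5):
    """Draw one document following the LDA generative story."""
    theta = rng.dirichlet(alpha * np.ones(K))     # this document's topic mixture
    z = rng.choice(K, size=n_words, p=theta)      # one topic per word slot
    words = [str(rng.choice(vocab, p=topic_word[k])) for k in z]
    return theta, words

theta, words = sample_document(n_words=12)
print("topic mixture:", np.round(theta, 2))
print("document:", " ".join(words))
```

Fitting LDA means going the other way: given only the word counts, infer a plausible topic_word matrix and a theta for each document.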
I'll fit an LDA model on the corpus live. Then, before I tell you what the topics are, I'll show you the top 10 words of each, and you tell me what you think the topic is.
# Vectorise (LDA works on raw counts, NOT TF-IDF: its generative model assumes counts)
vec_lda = CountVectorizer(
stop_words="english", max_df=0.9, min_df=2, max_features=2000,
token_pattern=r"\b[a-zA-Z]{3,}\b",
)
X_lda = vec_lda.fit_transform(corpus["body_clean"])
# Number of topics: rule of thumb is K ~ sqrt(n_docs), so about 2 for our 5 docs, 3 for 8 docs, 7-10 for 50-100, etc.
# K = 5 below is well above that for this toy corpus. With a real ECB corpus you'd want to try K in {5, 8, 10, 15} and pick by coherence.
K = 5
lda = LatentDirichletAllocation(n_components=K, random_state=0, max_iter=50)
lda.fit(X_lda)
vocab_lda = vec_lda.get_feature_names_out()
print(f"Doc-term matrix: {X_lda.shape}, fitting K={K} topics\n")
for k, topic in enumerate(lda.components_):
    top = topic.argsort()[-10:][::-1]  # indices of the 10 highest-weight terms
    print(f"Topic {k}: " + ", ".join(vocab_lda[i] for i in top))
👥 Mini-activity (1 minute). Don't read the master interpretation below yet.
- Look at the top-10 words for each of the 5 topics above.
- Write down on paper a 2-3 word label for each topic: what theme do you see?
- Compare with the person next to you. Where do you agree? Where do you disagree?
- Then scroll down to see how I would label them, and where the model collapses two topics into one.
The point of this exercise: LDA gives you word distributions, but you give them meaning. Two trained economists looking at the same output may produce different labels, and that subjectivity is a feature of unsupervised learning you should be aware of when you use it in research.
Reading the output. On the L6 corpus (5 docs, 727-term vocab) with K=5 and random_state=0, the topics come out roughly as:
- Topic 0: portfolio, pepp, operations, app, securities, facility, stance, payments, targeted, refinancing. Label: "unconventional monetary policy / balance sheet" (PEPP = Pandemic Emergency Purchase Programme; APP = Asset Purchase Programme; TLTRO refinancing operations). This is a real, interpretable theme.
- Topic 1: tight, push, base, dampening, upward, reduce, manner, determined, substantial, year. Label: "transmission and disinflation mechanics" (rate hikes pushing demand down, base effects on inflation).
- Topic 2: near-identical to Topic 1 (tight, push, base, dampening, upward, reduce, manner, determined, substantial, year).
- Topic 3: cent, think, growth, question, risk, good, said, time, economy, know. Label: "Q&A discussion register". This captures the press-conference Q&A vocabulary, not policy substance.
- Topic 4: cent, long, question, growth, contribution, sufficiently, look, second, real, current. Label: "forward guidance language in Q&A" (the verbal commitments Lagarde repeats in answers).
Two things worth noticing. First, the duplicate topic (1 ≈ 2) is the classic LDA pathology with too few documents: asking for 5 topics on 5 docs is asking too much, and the optimiser collapses two topics into the same local optimum. Second, and more interesting, Topic 3 is a register topic, not a substantive one: it captures the way press conferences talk (informal, first-person, dialogic), distinct from how press releases talk. This is a real finding for a researcher: it means that for substantive topic analysis you should separate press releases from press-conference transcripts before running LDA, or strip Q&A sections out.
What to do in practice. Three diagnostics: (i) re-run with a different random_state and check whether the topic set is stable; (ii) compute topic coherence (e.g., gensim.models.CoherenceModel with the c_v measure) for K ∈ {3, 5, 8, 10, 15} and pick the K that maximises it; (iii) inspect topics manually: if two look identical, K is too high.
The deeper point about unsupervised learning. LDA outputs are only useful if a human names them. The model gives you word distributions; you give them meaning. This is fundamentally different from supervised learning, and it's why "topic modelling" is sometimes more accurately called "topic finding": the topics aren't given by the data, they emerge from the choice of K, the preprocessing, the random seed, and your interpretation. Hansen, McMahon & Prat (QJE, 2018, Transparency and Deliberation) is the textbook example of doing this rigorously on FOMC transcripts, including stability checks.
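Diagnostic (iii) can be partly automated. One possible approach, sketched below, is to compare the topic-word rows pairwise and flag pairs whose cosine similarity is close to 1. The helper duplicate_topic_pairs, the 0.95 threshold, and the toy weights are all assumptions made for illustration; on a fitted scikit-learn model you would pass lda.components_ instead of the toy array.

```python
import numpy as np

def duplicate_topic_pairs(components, threshold=0.95):
    """Return (i, j, cosine) for topic pairs whose word distributions nearly coincide."""
    # Turn each row of topic-word weights into a distribution, then to unit length.
    probs = components / components.sum(axis=1, keepdims=True)
    unit = probs / np.linalg.norm(probs, axis=1, keepdims=True)
    sim = unit @ unit.T
    pairs = []
    for i in range(sim.shape[0]):
        for j in range(i + 1, sim.shape[0]):
            if sim[i, j] >= threshold:
                pairs.append((i, j, float(sim[i, j])))
    return pairs

# Invented topic-word weights (3 topics x 4 terms); rows 1 and 2 are near-copies.
toy_components = np.array([
    [9.0, 1.0, 0.5, 0.5],
    [0.5, 0.5, 8.0, 6.0],
    [0.6, 0.4, 8.1, 5.9],
])
pairs = duplicate_topic_pairs(toy_components)
print(pairs)
```

A non-empty result is a hint that K is too high for the corpus, not proof; the top-10 word lists should still be read by eye.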
7. Keyword analysis: most discriminating terms
# Top terms by mean TF-IDF across all documents
mean_tfidf = tfidf_df.mean().sort_values(ascending=False)
top20 = mean_tfidf.head(20)
fig, ax = plt.subplots(figsize=(9, 5))
ax.barh(top20.index[::-1], top20.values[::-1], color="steelblue", edgecolor="white")
ax.set_title("Top 20 terms by mean TF-IDF (ECB corpus)", fontsize=12)
ax.set_xlabel("Mean TF-IDF score")
fig.tight_layout()
plt.show()
Reading the output. On the L6 corpus the bar chart leads with inflation (mean TF-IDF ≈ 0.110), then governing (0.101), council (0.101), policy (0.099), monetary (0.095), transmission (0.085), term (0.082), rates (0.082), ecb (0.080), rate (0.079) etc. The drop from rank 1 to rank 10 is gentle (~0.03), which is what you expect when documents are thematically very similar (all monetary-policy material).
Three substantive terms in the top-20 are worth flagging: transmission (the mechanism by which rate changes feed into the real economy, a recurring concern in 2022–2024 ECB communication), pepp (Pandemic Emergency Purchase Programme), and portfolio (the ECB's bond portfolio, central to QT discussion). These are the substantive technical terms that distinguish ECB material from generic financial news, and they're the terms a researcher would actually want to track over time.
Diagnostic use. This chart is also how you find candidates for your domain stopword list. If governing and council appear in every doc with a high TF-IDF score, they're not adding signal: drop them and re-fit. (We added them to domain_stop in §3, but they still rank second and third here, which suggests the vectoriser behind this chart was not given that list.) The same logic applies to cent here, which surfaces because of "per cent" being repeated constantly; if you're studying what the ECB says rather than how often it quantifies, drop it.
# Compare term weights across two time periods
# Hike era (2023, hike cycle still ongoing) vs cuts/holds era (2024-2025, post-pivot)
# The L6 corpus has 2 docs in 2023 and 3 docs in 2024-2025 (a 2/3 split).
corpus["period"] = corpus["year"].apply(lambda y: "≥2024 (cuts/holds)" if y >= 2024 else "≤2023 (hikes)")
groups = {}
for period, grp in corpus.groupby("period"):
    vec = TfidfVectorizer(max_features=300, min_df=1, stop_words="english",
                          token_pattern=r"\b[a-zA-Z]{3,}\b")
    mat = vec.fit_transform(grp["body_clean"])
    means = pd.Series(mat.mean(axis=0).A1, index=vec.get_feature_names_out())
    groups[period] = means.sort_values(ascending=False).head(15)
# Display side-by-side
print("Top 15 terms by mean TF-IDF, per period:\n")
comparison = pd.DataFrame({p: g.reset_index(drop=False).apply(lambda r: f"{r['index']} ({r[0]:.3f})", axis=1)
                           for p, g in groups.items()})
print(comparison.to_string(index=False))
Reading the output. With the L6 corpus (2 docs in 2023, 3 docs in 2024-2025):
- ≤2023 hikes rank transmission, ecb, key, long distinctively high: the forward-guidance language of the late hike cycle ("rates need to reach restrictive levels for as long as necessary; the transmission of past hikes is starting to bite").
- ≥2024 cuts/holds rank rate (singular), risk/risks, think, euro distinctively high: the language of individual rate decisions and post-pivot communication, with more discussion of risks/uncertainty (typical of central banks once they stop committing to a direction). think is again a Q&A artifact.
The shared top terms (inflation, council, governing, policy, monetary, rates, target) are corpus-wide boilerplate. The economic story shows up in the distinctive terms: singular rate (one decision at a time, post-pivot) vs. plural rates + levels + long (the forward-guidance trajectory of the hike cycle).
A word of caution about small-corpus splits. With 2 vs. 3 documents and min_df=1, you're picking up idiosyncrasies of individual press conferences, not a stable difference between regimes, especially since the press-conference docs are 10× longer than the press releases and dominate the per-period averages. With a real corpus of 50+ docs per period, the comparison becomes more meaningful. The right way to tighten this in research code: bootstrap the per-period means and only report terms whose 95% bootstrap CI on the difference does not include zero.
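The bootstrap suggestion can be sketched as follows for a single term. This is one simple variant (a percentile bootstrap that resamples documents within each period), and everything in it is illustrative: bootstrap_diff_ci is a hypothetical helper, and the weights are synthetic draws for 50 documents per period, not values from the ECB corpus.

```python
import numpy as np

def bootstrap_diff_ci(a, b, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap CI for mean(a) - mean(b), resampling documents."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    diffs = np.empty(n_boot)
    for r in range(n_boot):
        # Resample documents with replacement within each period.
        diffs[r] = rng.choice(a, size=a.size).mean() - rng.choice(b, size=b.size).mean()
    tail = (1 - level) / 2
    lo, hi = np.quantile(diffs, [tail, 1 - tail])
    return float(lo), float(hi)

# Synthetic per-document TF-IDF weights for one term in two periods.
rng = np.random.default_rng(1)
period_a = rng.normal(0.10, 0.02, size=50)
period_b = rng.normal(0.06, 0.02, size=50)

lo, hi = bootstrap_diff_ci(period_a, period_b)
print(f"95% CI for the difference: [{lo:.3f}, {hi:.3f}]")
print("report term:", lo > 0 or hi < 0)
```

For the weights to be comparable across periods you would presumably fit one vectoriser on the whole corpus rather than one per period, and with only 2 or 3 documents per group the interval is not informative at all.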
# Word co-occurrence β which words tend to appear in the same documents?
# This is an approximation of semantic similarity using BoW
bow_small = CountVectorizer(max_features=30, min_df=2, stop_words="english",
token_pattern=r"\b[a-zA-Z]{4,}\b")
bow_mat = bow_small.fit_transform(corpus["body_clean"]).toarray()
terms = bow_small.get_feature_names_out()
# Co-occurrence matrix: (term × term), raw dot products of count vectors (not normalised)
cooc = (bow_mat.T @ bow_mat).astype(float)
np.fill_diagonal(cooc, 0)
# Top co-occurring pairs
pairs = []
for i in range(len(terms)):
    for j in range(i+1, len(terms)):
        if cooc[i, j] > 0:
            pairs.append((terms[i], terms[j], cooc[i, j]))
pairs.sort(key=lambda x: -x[2])
print("Top 10 co-occurring term pairs (raw counts):")
for t1, t2, c in pairs[:10]:
    print(f" {t1:<15} ↔ {t2:<15} count={c:.0f}")
Reading the output. On the L6 corpus the top pairs are dominated by cent ↔ inflation (≈4,957), inflation ↔ policy (3,740), inflation ↔ monetary (3,562), growth ↔ inflation (3,438), inflation ↔ question (3,097). The huge counts come from the press-conference docs, where these terms are repeated dozens of times per document; raw co-occurrence counts scale multiplicatively with term frequency.
Most pairs are trivial (inflation co-occurs with everything because every doc is about inflation), but two are interesting: growth ↔ inflation flags the trade-off vocabulary, and inflation ↔ question is again the Q&A signature: journalists ask about inflation, Lagarde answers about inflation.
The construction matters. cooc = bow_mat.T @ bow_mat is the matrix of dot products between term vectors: cell (i, j) is Σ_d count(term_i, doc_d) × count(term_j, doc_d), which is not the same as the number of documents in which both terms appear. With raw counts, a term appearing 50 times in one document and another appearing 40 times in the same document contributes 2,000 to the cell, not 1. To get a "documents in which both appear" count, binarise first and cast to integers: B = (bow_mat > 0).astype(int), then B.T @ B. Without the cast, NumPy treats @ on boolean arrays as a logical operation and returns True/False instead of counts. To get an actual association measure you have to normalise. Dividing each cell by the norms of the two term vectors gives the cosine similarity between terms; comparing the joint frequency with the product of the marginal frequencies, log[p(i, j) / (p(i) p(j))], gives PMI (pointwise mutual information).
Why this matters as a representation. Co-occurrence matrices are the historical ancestor of word embeddings (L9): the SVD of a PMI-weighted co-occurrence matrix gives you a low-dimensional dense vector per word, which is close in spirit to what GloVe (Pennington, Socher, Manning 2014) and earlier LSA/LSI methods compute. Modern embeddings just use a different (neural) optimisation target on top of the same co-occurrence intuition.
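The difference between the three quantities (raw dot products, document-level co-occurrence, PMI) is easiest to see on a tiny matrix. The counts below are invented: four documents, three terms, with one long document that dominates the raw products. PMI is computed here on document-level events, which is one of several possible definitions.

```python
import numpy as np

# Hypothetical document-term count matrix: 4 documents x 3 terms.
terms = ["inflation", "growth", "pepp"]
counts = np.array([
    [50, 40, 0],
    [ 3,  0, 2],
    [ 5,  4, 0],
    [ 2,  0, 3],
])

# Raw dot-product co-occurrence: dominated by the long first document.
raw = counts.T @ counts

# Document-level co-occurrence: in how many documents do both terms appear?
present = (counts > 0).astype(int)   # cast to int so that @ sums instead of OR-ing
doc_cooc = present.T @ present

# PMI on document-level events: log2( p(i, j) / (p(i) * p(j)) ).
n_docs = counts.shape[0]
p_joint = doc_cooc / n_docs
p_term = present.sum(axis=0) / n_docs
with np.errstate(divide="ignore"):   # log2(0) = -inf for pairs that never co-occur
    pmi = np.log2(p_joint / np.outer(p_term, p_term))

print(raw[0, 1], doc_cooc[0, 1])
print(np.round(pmi, 2))
```

Because inflation appears in every document, its PMI with any other term is zero: PMI discounts exactly the "co-occurs with everything" effect that inflates the raw counts.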
8. Wordclouds: visualisation, not analysis
Before we move to the exercise, a brief detour on wordclouds. They are the most recognisable NLP visualisation, and you will be expected to produce one if you ever present text-data work to a non-technical audience. Three things to keep in mind before we draw any:
Wordclouds are a display tool, not an analysis tool. They convey one piece of information (term frequency) through three visual channels (font size, position, colour), two of which are arbitrary. Hearst and Rosner (HICSS 2008) and Schuermans (2011) have argued that wordclouds make rigorous comparison difficult: word length confounds size, orientation confounds reading order, and the same frequency distribution can yield very different-looking clouds. For comparing documents or periods, the bar chart of §7 is strictly more informative.
Always feed them frequencies you computed yourself. The default WordCloud().generate(text) re-tokenises the raw string with its own (English-only, no lemmatisation) logic. Use generate_from_frequencies(dict) instead, passing a {lemma: count} dictionary built from your preprocessing pipeline; otherwise the wordcloud will silently disagree with all the BoW/TF-IDF analysis above.
They are great for slides, blog posts, and one-pagers: for capturing attention and giving an at-a-glance sense of what a corpus is "about" before the technical analysis begins. Used as the first visual in a presentation, they work well; used as the only visual, they oversimplify.
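On the frequency-dictionary point: if your pipeline already produces a list of lemmas per document, a {lemma: count} dictionary is a couple of lines with collections.Counter. The token lists and the domain stopword set below are invented placeholders, not the lecture's corpus.

```python
from collections import Counter

# Hypothetical output of a preprocessing pipeline: one list of lemmas per document.
token_lists = [
    ["inflation", "rate", "target", "inflation"],
    ["rate", "growth", "inflation", "council"],
]
domain_stop = {"council"}   # illustrative domain stopwords

# Pool all documents and count each lemma, skipping domain stopwords.
lemma_freq = dict(Counter(tok for doc in token_lists
                          for tok in doc if tok not in domain_stop))
print(lemma_freq)

# With the wordcloud package installed, the dictionary can be passed directly:
# WordCloud().generate_from_frequencies(lemma_freq)
```

The cells that follow build the same kind of dictionary from a CountVectorizer instead, which keeps the cloud aligned with the BoW analysis of the earlier sections.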
We will produce three variants on our ECB corpus, all from the same frequency dictionary built on body_clean.
# Build the frequency dictionary from BoW counts on the corpus we've been using.
# This uses the same preprocessing logic as §5 (alphabetic, ≥4 chars, English stopwords)
# plus a domain stopword list to suppress boilerplate.
import os, urllib.request
from wordcloud import WordCloud, ImageColorGenerator
from PIL import Image, ImageDraw, ImageFont
from scipy.ndimage import gaussian_gradient_magnitude
DOMAIN_STOP = {"per", "cent", "ecb", "governing", "council", "decided",
"december", "march", "today", "question", "think", "said"}
bow_for_wc = CountVectorizer(max_features=300, min_df=1, stop_words="english",
token_pattern=r"\b[a-zA-Z]{4,}\b")
bow_wc_matrix = bow_for_wc.fit_transform(corpus["body_clean"])
freq_array = bow_wc_matrix.sum(axis=0).A1
vocab_wc = bow_for_wc.get_feature_names_out()
freq = {w: int(c) for w, c in zip(vocab_wc, freq_array) if w not in DOMAIN_STOP}
print(f"Frequency dictionary: {len(freq)} terms")
print(f"Top 10: {sorted(freq.items(), key=lambda x: -x[1])[:10]}")
Variant 1: Classic rectangular wordcloud
The simplest form: a rectangular canvas, a colormap, and no shape constraints.
wc1 = WordCloud(width=800, height=400, background_color="white",
max_words=150, colormap="viridis", random_state=42)
wc1.generate_from_frequencies(freq)
fig, ax = plt.subplots(figsize=(10, 5))
ax.imshow(wc1, interpolation="bilinear")
ax.axis("off")
ax.set_title("ECB corpus: classic wordcloud (top 150 BoW lemmas)", loc="left", fontsize=11)
fig.tight_layout()
plt.show()
Reading the output. The largest words are inflation, policy, monetary, growth, rate, euro, target, term, rates, risk. With domain stopwords (council, governing, etc.) removed, the cloud foregrounds the substantive vocabulary of monetary-policy communication: macro variables (inflation, growth), instruments (rate(s), policy, monetary), goals (target, term/medium-term), and risk language. Secondary terms (transmission, outlook, assessment, pepp) are the technical vocabulary an economist would actually want to track.
What the cloud cannot tell you. It cannot tell you (i) which document a term belongs to, (ii) whether the term is used positively or negatively, (iii) whether it is rising or falling over time, (iv) which terms are distinctive (TF-IDF) vs. merely frequent (BoW). All of those questions need a different visualisation.
Variant 2: Mask shaped like €
We can constrain the cloud to fill an arbitrary shape by passing a mask argument: a 2D array where pure-white pixels (255) are masked out (no words placed) and every other pixel is available for words; here we draw the shape in black (0). Building the mask procedurally, by drawing a "€" with PIL, avoids any external image asset and keeps the notebook fully self-contained.
def find_bold_font():
    # Cross-platform bold-font lookup via matplotlib's font manager.
    # matplotlib is already a dependency and bundles DejaVuSans-Bold as a
    # guaranteed last-resort fallback, so this never returns None.
    import matplotlib.font_manager as fm
    return fm.findfont(fm.FontProperties(family="sans-serif", weight="bold"))
def make_euro_mask(size=600):
    # Build a binary mask shaped like a € symbol via PIL.
    # Convention: white pixels (255) = masked out; dark pixels (0) = words go here.
    mask = Image.new("L", (size, size), color=255)
    draw = ImageDraw.Draw(mask)
    font = ImageFont.truetype(find_bold_font(), int(size * 0.85))
    bbox = draw.textbbox((0, 0), "\u20ac", font=font)
    w, h = bbox[2] - bbox[0], bbox[3] - bbox[1]
    pos = ((size - w) // 2 - bbox[0], (size - h) // 2 - bbox[1])
    draw.text(pos, "\u20ac", fill=0, font=font)
    return np.array(mask)
euro_mask = make_euro_mask(size=600)
wc2 = WordCloud(mask=euro_mask, background_color="white",
max_words=300, colormap="Blues", random_state=42,
contour_width=2, contour_color="#003299") # ECB blue
wc2.generate_from_frequencies(freq)
fig, ax = plt.subplots(figsize=(8, 8))
ax.imshow(wc2, interpolation="bilinear")
ax.axis("off")
ax.set_title("ECB corpus: € mask", loc="left", fontsize=11)
fig.tight_layout()
plt.show()
Reading the output. Same frequency dictionary as Variant 1, but the cloud now fits inside the € symbol. Visually striking; informationally identical (and arguably less readable, since words have to shrink or rotate to fit the narrow strokes of the symbol).
The mask convention is counter-intuitive at first. White pixels (value 255) = excluded; any other value = allowed. So we draw the € in black on a white background. To check your mask is correct: (euro_mask != 255).sum() should give the number of pixels available for words. If it is near zero, nothing was drawn on the mask and the cloud will be empty; if it covers most of the canvas, the convention is inverted and the words will fill the background instead of the symbol.
contour_width and contour_color add the visible outline that makes the symbol recognisable even when the words don't fully fill the shape. Without them the cloud just looks like a blob of text.
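The mask check can be tried without PIL or wordcloud at all. The sketch below builds a toy mask with NumPy (a filled circle, purely for illustration) and measures the share of pixels that are not pure white, which is the area available for words under the convention described above; allowed_fraction is a hypothetical helper.

```python
import numpy as np

def allowed_fraction(mask):
    """Share of pixels on which words may be drawn (anything that is not pure white)."""
    return float((mask != 255).mean())

# Toy 100x100 mask: white background with a dark filled circle in the middle.
size = 100
yy, xx = np.mgrid[:size, :size]
inside = (xx - size / 2) ** 2 + (yy - size / 2) ** 2 <= (size * 0.4) ** 2
mask = np.full((size, size), 255, dtype=np.uint8)
mask[inside] = 0

correct = allowed_fraction(mask)          # roughly the area of the circle
inverted = allowed_fraction(255 - mask)   # inverted mask: words would go outside
print(round(correct, 2), round(inverted, 2))
```

Running the same helper on euro_mask before building the cloud is a cheap sanity check: the value should be clearly above zero and well below one.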
Variant 3: Image-coloured wordcloud (the parrot example)
The most elaborate technique from the wordcloud library examples: take any real-world image (here, a tropical parrot, the canonical example from the library's documentation), use it both as a shape mask and as a colour source, with edge detection to keep the colours from bleeding across regions.
The point of using a parrot for ECB text is precisely that it's absurd. Wordclouds let you put your words inside any silhouette: a parrot, a country map, a company logo, a portrait. The technique is general; the choice of image is a presentation decision. We download the original parrot photo from the wordcloud library's GitHub examples (MIT-licensed), recolour the ECB lemmas using the parrot's plumage palette, and let the silhouette do the work.
# Download the parrot image from the wordcloud examples (MIT-licensed).
# After the first run it's cached locally, so re-execution is offline-friendly.
import urllib.request
PARROT_URL = ("https://raw.githubusercontent.com/amueller/word_cloud/"
"main/examples/parrot-by-jose-mari-gimenez2.jpg")
PARROT_PATH = "parrot.jpg"
if not os.path.exists(PARROT_PATH):
    print(f"Downloading parrot image from {PARROT_URL}...")
    urllib.request.urlretrieve(PARROT_URL, PARROT_PATH)
    print("Done.")
# Load and subsample (the original is 4017x3235, too large to render quickly)
parrot_color = np.array(Image.open(PARROT_PATH))
parrot_color = parrot_color[::3, ::3]
print(f"Parrot image after 3x subsampling: {parrot_color.shape}")
# Build the placement mask: same convention (white = excluded).
# Where the original image is pure black (sum == 0), treat it as background.
parrot_mask = parrot_color.copy()
parrot_mask[parrot_mask.sum(axis=2) == 0] = 255
# Edge detection on the colour image to enforce sharp colour boundaries:
# words near the edges of distinct colour regions are pushed inwards,
# which prevents the colours from looking washed out in the final cloud.
edges = np.mean([gaussian_gradient_magnitude(parrot_color[:, :, i] / 255., 2)
for i in range(3)], axis=0)
parrot_mask[edges > .08] = 255
# Build the wordcloud; note `max_words=2000` to encourage dense fill,
# and `relative_scaling=0` for the "less accurate but better-looking" trade-off.
wc3 = WordCloud(max_words=2000, mask=parrot_mask, max_font_size=40,
random_state=42, relative_scaling=0, background_color="white")
wc3.generate_from_frequencies(freq)
# Recolour: each word takes the mean colour of the image region it covers
image_colors = ImageColorGenerator(parrot_color)
wc3.recolor(color_func=image_colors)
fig, ax = plt.subplots(figsize=(10, 10))
ax.imshow(wc3, interpolation="bilinear")
ax.axis("off")
ax.set_title("ECB corpus: parrot-shaped image-coloured wordcloud", loc="left", fontsize=11)
fig.tight_layout()
plt.show()
Reading the output. The ECB lemmas now fill the silhouette of a tropical parrot, with each word painted in the average colour of the image region beneath it: red on the head, blue and green across the body and wings, yellow on the underside. The silhouette is recognisable; the words remain readable; the dissonance between "ECB monetary-policy text" and "tropical bird" is the entire point.
Why the fill is sparser than the canonical parrot example. The original wordcloud library demo uses a Wikipedia article on rainbows (~1,000+ unique tokens) precisely because that vocabulary fits the high-detail parrot silhouette. Our ECB corpus has only ~290 lemmas after filtering, so narrow regions (beak, feet, fine plumage detail) end up partly empty. With a real ECB archive of 200+ documents you'd easily hit 2,000+ unique lemmas and get a denser fill. For now, accept the sparser look; the technique is what matters.
Three implementation details worth flagging.
- gaussian_gradient_magnitude computes the gradient of pixel intensity: high values where colours change rapidly. Setting parrot_mask[edges > 0.08] = 255 forces the wordcloud to avoid placing words across colour boundaries, which keeps the colours from bleeding and produces sharper-looking output. Without this step the parrot looks washed out.
- The double-mask trick. We use the colour image both as the source of colours (ImageColorGenerator) and (after thresholding) as the placement mask. This is the parrot-example technique: it ensures words go where colour exists and inherit the colour of where they land.
- relative_scaling=0 makes font size depend only on a word's frequency rank, not on how much more frequent it is than the next word. The library's parrot example describes the setting as reflecting the frequencies less accurately but making a better picture, which is a perfect summary of what wordclouds are, in general.
The take-away. Once you understand the mask + image-colouring pattern, you can put any text inside any silhouette. The Loughran-McDonald 10-K corpus inside a Wall Street bull, FOMC transcripts inside a portrait of Powell, parliamentary speeches inside the country's coat of arms: all the same five lines of code with a different image. Whether any of those is useful is a separate question (almost always: no, use a bar chart). Whether they look striking on a slide: yes.
9. Exercise
Time: ~25 minutes. Work individually.
Task 1. Apply the full preprocessing pipeline to your corpus:
- Clean text → spaCy tokens (lemmatised, no stopwords)
- Add a tokens column
- Extend the stopword list with at least 5 domain-specific terms that are uninformative in your corpus
Task 2. Fit a TF-IDF vectoriser on body_clean. Report:
- Vocabulary size
- The 20 terms with highest mean TF-IDF
- A horizontal bar chart of those 20 terms
Task 3. Split your corpus into two periods (you choose the cutoff). Compute the top 15 terms by mean TF-IDF in each period. Comment briefly on what differs.
Task 4 (optional). Compute pairwise cosine similarity between all documents using the TF-IDF matrix. Plot the similarity of each document to the first document in the corpus, ordered by date. Does similarity decay over time?
# Task 1: preprocessing pipeline
# YOUR CODE HERE
# Task 2: TF-IDF analysis
# YOUR CODE HERE
# Task 3: group comparison
# YOUR CODE HERE
# Task 4 (optional): cosine similarity over time
# YOUR CODE HERE
# ── SOLUTION ──────────────────────────────────────────────────────────────────
import pandas as pd, numpy as np, matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import spacy
nlp = spacy.load("en_core_web_sm")
corpus = pd.read_csv("corpus_clean.csv")
corpus["date_parsed"] = pd.to_datetime(corpus["date_parsed"], errors="coerce")
corpus["year"] = corpus["date_parsed"].dt.year
# ── Task 1 ────────────────────────────────────────────────────────────────────
EXTRA_STOP = {"governing", "council", "ecb", "today", "decided"}
def spacy_tokens(text):
    doc = nlp(text)
    # Keep lowercase lemmas of alphabetic, non-stopword tokens of length >= 3
    return [t.lemma_.lower() for t in doc
            if not t.is_stop and not t.is_punct and not t.is_space
            and t.is_alpha and len(t.text) >= 3
            and t.lemma_.lower() not in EXTRA_STOP]
corpus["tokens"] = corpus["body_clean"].apply(spacy_tokens)
print(f"Mean tokens/doc: {corpus['tokens'].str.len().mean():.0f}")
# ── Task 2 ────────────────────────────────────────────────────────────────────
tfidf = TfidfVectorizer(max_features=500, min_df=2, stop_words="english",
token_pattern=r"\b[a-zA-Z]{3,}\b", sublinear_tf=True)
X = tfidf.fit_transform(corpus["body_clean"])
print(f"Vocabulary: {len(tfidf.get_feature_names_out())}")
mean_tfidf = pd.Series(X.toarray().mean(axis=0), index=tfidf.get_feature_names_out())
top20 = mean_tfidf.sort_values(ascending=False).head(20)
fig, ax = plt.subplots(figsize=(9, 5))
ax.barh(top20.index[::-1], top20.values[::-1], color="steelblue")
ax.set_title("Top 20 terms by mean TF-IDF")
fig.tight_layout(); plt.show()
# ── Task 3 ────────────────────────────────────────────────────────────────────
# The cutoff is a choice; rows with an unparseable date (NaN year) fall into "≤2023".
corpus["period"] = corpus["year"].apply(lambda y: "≥2024" if y >= 2024 else "≤2023")
for period, grp in corpus.groupby("period"):
    # A separate vectoriser per period: IDF weights are estimated within each period.
    vec = TfidfVectorizer(max_features=300, min_df=1, stop_words="english",
                          token_pattern=r"\b[a-zA-Z]{3,}\b")
    mat = vec.fit_transform(grp["body_clean"])
    means = pd.Series(np.asarray(mat.mean(axis=0)).ravel(), index=vec.get_feature_names_out())
    print(f"\n{period}:", means.nlargest(15).index.tolist())
# ── Task 4 ────────────────────────────────────────────────────────────────────
sim = cosine_similarity(X)
# sim[0] is the similarity to the first ROW of the corpus, which need not be the
# earliest document, so take the title date from that row rather than the sorted index.
first_date = corpus["date_parsed"].iloc[0]
sim_to_first = pd.Series(sim[0], index=corpus["date_parsed"]).sort_index()
fig, ax = plt.subplots(figsize=(9, 4))
ax.plot(sim_to_first.index, sim_to_first.values, marker="o", color="tomato")
ax.set_title(f"Cosine similarity to first document ({first_date.date()})")
ax.set_ylabel("Cosine similarity"); ax.grid(alpha=0.3)
fig.tight_layout(); plt.show()
Summary
| Step | Tool | Notes |
|---|---|---|
| Tokenise | nltk.word_tokenize / spaCy | spaCy is more accurate |
| Remove stopwords | nltk.corpus.stopwords / spaCy token.is_stop | Add domain-specific stops |
| Stem | nltk.stem.PorterStemmer | Fast; output may be non-words |
| Lemmatise | spaCy token.lemma_ | Accurate, real words |
| Bag-of-words | sklearn.feature_extraction.text.CountVectorizer | Sparse matrix |
| TF-IDF | sklearn.feature_extraction.text.TfidfVectorizer | Use sublinear_tf=True to dampen high counts |
| Topic modelling | sklearn.decomposition.LatentDirichletAllocation | Counts (not TF-IDF); choose K by coherence |
| Visualisation | wordcloud.WordCloud | Display tool, not analysis tool |
Next lecture
Lecture 8 - Sentiment analysis & advanced text models: dictionary methods (Loughran-McDonald), supervised classifiers, word embeddings, and an end-to-end pipeline replicating Hansen et al. (QJE 2018) on a small corpus.