Extra L7 — NLP Representation, Similarity, and What Bag-of-Words Misses¶

This notebook complements Lecture 7: NLP Representation.

Goal¶

Understand what text representations capture — and what they discard.

Why this matters for research¶

Text features are measurements. You should know their failure modes before using them in substantive research. Lecture 7 introduced cosine similarity (§5.bis) with a caveat that the measure captures vocabulary overlap, not meaning. This notebook shows that caveat in action with a minimal worked example: three sentences chosen so that BoW/TF-IDF give a misleading picture of which two are most similar.

In [ ]:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd

1. A tiny corpus¶

Three sentences chosen to expose the limits of bag-of-words representation:

The first two sentences share most of their vocabulary (minister, immigration, reform, plus the same verb in two forms, support and supported) but differ in meaning: one negates the other.
The third is topically related to the second (same content) but uses different words (cabinet, backed, migration policy instead of minister, supported, immigration reform).

A good representation should make the second and third look similar, and the first and second look different. Let's see what BoW and TF-IDF do.

In [ ]:

docs = [
    "The minister did not support immigration reform",
    "The minister supported immigration reform",
    "The cabinet backed a reform of migration policy"
]

doc_names = ["negated", "affirmative", "paraphrase"]
pd.DataFrame({"doc": doc_names, "text": docs})

2. Bag-of-words representation¶

In [ ]:

cv = CountVectorizer(stop_words="english")
X_count = cv.fit_transform(docs)
count_df = pd.DataFrame(X_count.toarray(), columns=cv.get_feature_names_out(), index=doc_names)
count_df

3. TF-IDF representation¶

In [ ]:

tfidf = TfidfVectorizer(stop_words="english")
X_tfidf = tfidf.fit_transform(docs)
tfidf_df = pd.DataFrame(X_tfidf.toarray(), columns=tfidf.get_feature_names_out(), index=doc_names)
tfidf_df.round(3)

4. Similarity comparison¶

In [ ]:

sim = cosine_similarity(X_tfidf)
sim_df = pd.DataFrame(sim, index=doc_names, columns=doc_names)
sim_df.round(3)

Interpretation¶

Look at the similarity matrix. The negated and affirmative sentences come back at cosine 0.51: moderately similar despite saying opposite things. They share minister, immigration, reform; once stop words are removed, the vectors differ only in the auxiliary did and the verb form (support vs supported). The stop_words="english" setting passed to the vectorizers (it is not the scikit-learn default, which is stop_words=None) drops not outright, removing the very token that flips the meaning. The paraphrase sits at cosine 0.11 against the affirmative version: much less similar, even though it says essentially the same thing in different words (cabinet backed a reform of migration policy). The only term the two share is reform.
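To see which tokens actually reach the vectorizer, the sketch below runs the same analyzer settings over the three sentences. It repeats docs and doc_names from section 1 so that it runs on its own; surviving_tokens is an illustrative name, not part of the original notebook.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

docs = [
    "The minister did not support immigration reform",
    "The minister supported immigration reform",
    "The cabinet backed a reform of migration policy",
]
doc_names = ["negated", "affirmative", "paraphrase"]

# build_analyzer() returns the callable that lowercases, tokenizes,
# and filters stop words, i.e. everything that happens before counting.
analyze = TfidfVectorizer(stop_words="english").build_analyzer()

surviving_tokens = {name: analyze(text) for name, text in zip(doc_names, docs)}
for name, tokens in surviving_tokens.items():
    print(f"{name:12s} {tokens}")

# Membership checks against scikit-learn's built-in English stop list.
print("'not' is a stop word:", "not" in ENGLISH_STOP_WORDS)
print("'did' is a stop word:", "did" in ENGLISH_STOP_WORDS)
```

Comparing the printed token lists for the negated and affirmative sentences shows exactly what the cosine score is (and is not) based on.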

Bag-of-words and TF-IDF largely ignore:

word order: dog bites man and man bites dog get identical vectors
syntax: subject-verb-object structure is not represented
scope of negation: not support and support look almost identical, especially when stop_words="english" removes not before vectorization
semantic equivalence: cabinet backed and minister supported describe the same thing in different words, and the model has no way to know

That is why the negated and affirmative texts can appear misleadingly similar, while the paraphrase appears misleadingly distant. This is the caveat from Lecture 7 § 5.bis made concrete: cosine similarity on TF-IDF measures shared vocabulary, not shared meaning. When the two diverge (under negation, paraphrase, or sarcasm), the similarity score misleads.

5. Why this matters in economics¶

If you classify political speeches, parliamentary statements, manifesto sentences, central-bank communication, or survey responses, the choice of representation directly affects:

construct validity — does the feature measure the concept you want?
measurement error — how often does the representation flip a sentence's meaning?
interpretability — can you defend the coding to a referee?

A research example. Suppose you classify ECB statements as "hawkish" or "dovish" using TF-IDF plus a dictionary of policy-stance words. The sentence "rates will not remain restrictive for as long as previously expected" is structurally dovish: the negation is the entire point. A BoW classifier with stop-word removal reads rates, remain, restrictive, long, previously, expected, sees the same vocabulary as a hawkish statement, and codes it wrong. The error is not random: it is concentrated in exactly the meeting-to-meeting changes that matter for identification. Measurement error correlated with the variation you're using to identify your effect is the worst kind, because it biases the estimate rather than just adding noise.

This is the reason supervised classifiers (Lecture 8 § 6) and contextual embeddings (Lecture 8 § 8) exist: to encode word order and context into the representation rather than throwing them away upfront.

Short exercise¶

Answer in words (a paragraph each):

Why does negation create a problem for simple vector models? Refer to the similarity matrix you computed above.
Why can two semantically similar texts (here: affirmative and paraphrase) appear less similar than two semantically opposite texts (here: negated and affirmative)?
When is a simple BoW/TF-IDF model still preferable despite these limitations? Think about transparency, replicability, and computational cost.

Optional extension¶

Add bigrams with ngram_range=(1, 2) and rebuild the TF-IDF matrix. Recompute the similarity. Does the bigram "not support" become a feature? Does the negated sentence now look meaningfully different from the affirmative one?
Build the document-term matrix with and without stopword removal. The stop_words="english" setting used above drops not; with stop_words=None (the scikit-learn default) it stays in the vocabulary. Keep it as a feature and see whether the negation becomes detectable.
Try the same exercise with TfidfVectorizer(sublinear_tf=True) (the setting we used in L7). Does it change the qualitative picture?
For a real-data version: take two ECB press releases that differ in policy stance (e.g., one hike statement, one cut statement) and compute their TF-IDF cosine similarity. Then read them. Does the score reflect what you read?