Lecture 9 — Machine Learning for Text & LLMs via API
Python for Economists · University of Bologna · 2025/2026
What we cover today
- Warm-up: zero-shot classification in three lines with a pre-trained transformer
- When NOT to use these tools — three-way comparison (manual pipeline / transformer / LLM API)
- Supervised text classification: logistic regression and Naive Bayes on TF-IDF
- Evaluation: precision, recall, F1, confusion matrix
- Cross-validation and regularisation
- Word embeddings: intuition and use
- HuggingFace transformers: zero-shot with BART-MNLI
- Domain-specific BERT: FinBERT for financial sentiment
- ⏱ Ten-minute challenge: prompt engineering with zero-shot labels
- LLMs via API (Claude + OpenAI): setup, pricing, when it's worth paying (optional, paid)
- Exercise: classify your corpus
Key insight: today we close the L7-L8-L9 arc on machine learning for text. L7 introduced unsupervised methods (BoW, TF-IDF, LDA); L8 added dictionary-based and supervised approaches. L9 brings two more options to the toolkit:
- Pre-trained transformers (BART-MNLI, FinBERT): free and local. BART-MNLI is zero-shot: it classifies with arbitrary labels without any training data. FinBERT ships with three fixed sentiment labels. Both are black boxes and slower than TF-IDF at scale.
- LLMs via API (Claude, GPT) — flexible and powerful for structured extraction and complex tasks, but paid and harder to replicate. We'll see when paying is worth it and when it isn't.
Across all approaches, the same trade-off recurs: cost, reproducibility, interpretability, scale. By the end of today you should be able to choose — and defend the choice in your projects.
0. Setup — packages required for this lecture
Before running the imports below, make sure your_environment has the ML-for-text and transformer stack installed. Run this once in your terminal (not in the notebook):
conda activate your_environment
conda install -c conda-forge scikit-learn pandas matplotlib spacy
python -m spacy download en_core_web_md # ~50 MB English model with word vectors
pip install transformers torch # HuggingFace stack, installed with pip here
scikit-learn, pandas, and matplotlib are already installed from L5-L8. New dependencies today: spacy with the en_core_web_md model (for word embeddings in §6), and transformers + torch (for BART-MNLI and FinBERT in §1, §7, §8; in this course we install them with pip).
Section 10 (LLM APIs) is optional and not required to run this notebook — the API code there is shown in markdown for reference only. If you decide to experiment at home, you'll find install instructions in that section.
Heads-up on first run: the HuggingFace pipelines download model weights (~1.5 GB for BART-MNLI, ~440 MB for FinBERT) and cache them in ~/.cache/huggingface/. The download happens once; subsequent runs are fast and offline. Recommendation: run the warm-up cell at home before class to pre-warm the cache.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os, re, warnings
warnings.filterwarnings("ignore")
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.metrics import (classification_report, confusion_matrix,
ConfusionMatrixDisplay, roc_auc_score)
from sklearn.pipeline import Pipeline
print("sklearn loaded.")
1. Warm-up — the same task, then and now
We spent four lectures building NLP pipelines — scraping, tokenising, TF-IDF, supervised classification. All of it real, all of it worth knowing.
Today, I'll show you how most of what we did can now be replicated in three lines of code using a pre-trained transformer. That's not a reason to throw away what we learned — it's a reason to know exactly when each approach is the right one.
The thing itself — zero-shot classification, free and local
We use HuggingFace transformers with the BART-MNLI model. No API, no payment, no internet after the first download: the model runs entirely on your laptop. (We'll cover what's actually happening under the hood in Section 7.)
# Open-source transformer, runs locally, completely free.
# First run downloads ~1.5 GB and caches it in ~/.cache/huggingface/
# (Recommended: warm the cache by running this cell at home before class.)
from transformers import pipeline as hf_pipeline
zs_classifier = hf_pipeline(
"zero-shot-classification",
model="facebook/bart-large-mnli",
device=-1, # -1 = CPU; 0 = first GPU if available
)
print("Zero-shot classifier loaded.")
# Classify headlines into policy areas - no labelled data, no training, no payment.
headlines = [
"ECB raises rates by 25bp to anchor inflation expectations",
"Commission proposes new fiscal rules with flexibility for investment",
"Eurozone unemployment falls to record low in Q3",
"Germany announces 100bn special fund for defence",
"ECB signals patience on further hikes as core inflation moderates",
]
policy_areas = ["monetary policy", "fiscal policy", "labour", "defence", "other"]
print(f"{'Predicted label':<20} {'Confidence':<11} Headline")
print("-" * 110)
for h in headlines:
    result = zs_classifier(h, candidate_labels=policy_areas)
    label = result["labels"][0]
    score = result["scores"][0]
    print(f"{label:<20} {score:.2f} {h}")
📊 Reading the output. Five headlines, five sensible labels, and you can read the model's uncertainty off the confidence column. A few things worth noticing:
- Defence (0.93) and monetary policy (0.88, 0.85) are high-confidence: the headlines contain unambiguous topical anchors (`100bn special fund for defence`, `raises rates`, `core inflation`).
- Labour (0.51) is the weakest call. The headline is about unemployment falling, which BART-MNLI recognises as labour-adjacent but not strongly so, possibly because "labour" as a label is broader than the headline's specific topic. A confidence of 0.51 is the kind of number that should make you slow down: in a real pipeline, you'd flag everything below ~0.7 for manual review.
- Fiscal policy (0.70) is plausible but not high. The headline mentions both `fiscal rules` (fiscal policy) and `investment` (could be many things), so the model splits its probability mass across categories.
Three lines of actual logic: load the model, pick your candidate labels, classify. No training data, no TF-IDF, no pipeline.
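The manual-review rule mentioned above can be sketched in a few lines. This is only an illustration: the `results` list is hypothetical data shaped like the zero-shot pipeline's output (labels sorted best-first, with matching scores), and the 0.70 cut-off is the rule of thumb from the text, not a validated threshold.

```python
# Hypothetical results shaped like the zero-shot pipeline's output:
# a dict with "sequence", "labels" (sorted best-first) and "scores".
results = [
    {"sequence": "Germany announces 100bn special fund for defence",
     "labels": ["defence", "fiscal policy"], "scores": [0.93, 0.07]},
    {"sequence": "Eurozone unemployment falls to record low in Q3",
     "labels": ["labour", "other"], "scores": [0.51, 0.49]},
]

REVIEW_THRESHOLD = 0.70  # assumption: tune this on a hand-coded sample


def needs_review(result, threshold=REVIEW_THRESHOLD):
    """Return True when the top label's score is below the threshold."""
    return result["scores"][0] < threshold


# Collect the texts a human should look at before the label is trusted.
flagged = [r["sequence"] for r in results if needs_review(r)]
print(flagged)
```

In a real run you would build `results` by calling `zs_classifier` on each document and send the flagged texts to a human coder.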
So why did we spend four lectures on the manual pipeline?
2. When NOT to reach for these tools — a case for keeping your pipeline
Pre-trained transformers and LLMs are the right tool sometimes. They are the wrong tool other times. Here's the honest comparison across the three families we'll see today:
| Dimension | Manual pipeline (TF-IDF + LogReg) | Transformer locally (BART, FinBERT) | LLM via API (Claude, GPT) |
|---|---|---|---|
| Cost | Zero marginal | Compute time (free), 1-2 sec/doc on CPU | Per-call payment, scales linearly with corpus |
| Reproducibility | Deterministic; replicable forever | Deterministic; pinned by model hash | Non-deterministic (temperature=0 reduces variation but does not guarantee identical outputs); model can be retired |
| Interpretability | Every word has a weight; you can audit it | Black box; some attention-based explainability | Black box; no "why" |
| Scale | Very fast (~10k docs/sec) | Moderate (~1 doc/sec CPU; ~50/sec GPU) | Slow and paid (~1-3 docs/sec, bottlenecked by API) |
| Flexibility | Fixed: one task per trained model | Zero-shot: arbitrary label sets at inference time | Maximum: arbitrary tasks, free-form output |
Concrete examples:
- Score sentiment on 50,000 ECB press releases. → Dictionary or TF-IDF. 30 seconds, free, replicable in 10 years. The LLM bill would be in the range of a few dollars, but the pipeline is what your replication package will want.
- Classify 300 policy documents into 12 nuanced categories where context matters. → Transformer (BART-MNLI) or LLM. 300 docs is too few to train a supervised model well.
- You need the method in a published paper's replication package, 10 years from now. → Manual pipeline + (optionally) pinned local transformer. APIs deprecate; open-source weights don't.
- You want to explore a corpus you don't know yet. → LLM or zero-shot transformer first, manual pipeline once you know what to measure.
The punchline: each tool fits a different research moment. Use whichever clears the task at the lowest replication risk and, crucially for your projects, be able to defend the choice.
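To make the "Scale" row concrete, here is a back-of-the-envelope calculation of wall-clock time for the 50,000-document example. The throughput figures are the rough orders of magnitude from the comparison table (the API rate is taken as 2 docs/sec, the middle of the quoted range); they are assumptions for illustration, not benchmarks.

```python
# Rough throughput in documents per second, copied from the comparison table.
DOCS_PER_SECOND = {
    "tfidf_logreg": 10_000,
    "transformer_cpu": 1,
    "transformer_gpu": 50,
    "llm_api": 2,  # assumption: midpoint of the ~1-3 docs/sec range
}


def hours_needed(n_docs, docs_per_second):
    """Wall-clock hours to process n_docs at a constant throughput."""
    return n_docs / docs_per_second / 3600


n_docs = 50_000
for name, rate in DOCS_PER_SECOND.items():
    print(f"{name:<16} {hours_needed(n_docs, rate):8.2f} h")
```

At these rates the CPU transformer needs roughly 14 hours for the same corpus that TF-IDF handles in seconds, which is the practical meaning of "moderate" in the table.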
# The Manifesto Project dataset is public (available at manifesto-project.wzb.eu).
# For this lecture we build a representative synthetic dataset that mirrors
# the structure of the real one. In the exercise you will work with the real data.
np.random.seed(42)
n = 300
# Phrases that lean left or right but are NOT perfectly diagnostic
left_phrases = [
"workers should have stronger protections",
"public investment in healthcare is needed",
"inequality has grown too much",
"social safety nets need strengthening",
"the wealthy should contribute more in taxes",
"collective bargaining helps wage growth",
"income support programs reduce poverty",
"climate action requires public spending",
"housing affordability is a serious problem",
"public services should be expanded",
"regulation can prevent market failures",
"the welfare state needs reinforcement",
]
right_phrases = [
"markets generally allocate resources well",
"government spending should be controlled",
"lower taxes can stimulate growth",
"individual choice matters in economic outcomes",
"regulation can impose unnecessary costs",
"the private sector drives most innovation",
"fiscal discipline matters for credibility",
"competition policy benefits consumers",
"property rights underpin investment",
"trade openness has aggregate benefits",
"government programs should be evaluated for efficiency",
"labor market flexibility supports employment",
]
# Neutral vocabulary - phrases either side might use
shared_phrases = [
"the economy faces multiple challenges",
"policy tradeoffs need to be considered",
"data should inform decision making",
"long term outcomes matter",
"the evidence is mixed",
"economic growth has slowed recently",
"inflation concerns are widespread",
"labor markets are evolving",
"households face rising costs",
"investment decisions depend on conditions",
]
def make_doc(primary_phrases, primary_share=0.55, shared_share=0.30):
    """Generate a noisy document: mostly primary, some neutral, a bit of opponent."""
    n_total = np.random.randint(6, 12)
    n_primary = max(1, int(n_total * primary_share))
    n_shared = max(1, int(n_total * shared_share))
    n_other = max(0, n_total - n_primary - n_shared)
    other = left_phrases if primary_phrases is right_phrases else right_phrases
    parts = list(np.random.choice(primary_phrases, n_primary, replace=True))
    parts += list(np.random.choice(shared_phrases, n_shared, replace=True))
    if n_other > 0:
        parts += list(np.random.choice(other, n_other, replace=True))
    np.random.shuffle(parts)
    return " ".join(parts)
texts = ([make_doc(left_phrases) for _ in range(n//2)] +
[make_doc(right_phrases) for _ in range(n//2)])
labels = ["left"] * (n//2) + ["right"] * (n//2)
manifesto = pd.DataFrame({"text": texts, "orientation": labels})
manifesto = manifesto.sample(frac=1, random_state=42).reset_index(drop=True)
print(f"Dataset: {len(manifesto)} documents")
print(manifesto["orientation"].value_counts())
manifesto.head(3)
3. Supervised text classification
The pipeline:
- Represent text as TF-IDF vectors
- Train a classifier on labelled examples
- Evaluate on held-out test set
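sklearn.pipeline.Pipeline chains these steps and prevents data leakage: the TF-IDF vocabulary and weights are fitted on the training data only, then applied unchanged to the test set.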
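The leakage point can be checked directly on a toy example. The sketch below uses a tiny invented corpus in which the word "subsidy" appears only in the held-out document; after fitting, that word is absent from the vectoriser's vocabulary, because the pipeline never saw the test text.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy corpus (invented): "subsidy" occurs only in the held-out document.
train_docs = [
    "lower taxes stimulate growth",
    "public services should be expanded",
    "lower taxes and fiscal discipline",
    "expand public healthcare services",
]
train_y = ["right", "left", "right", "left"]
test_docs = ["a subsidy for public services"]

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])
# Vocabulary and IDF weights are learned from train_docs only.
pipe.fit(train_docs, train_y)

vocab = pipe.named_steps["tfidf"].vocabulary_
print("subsidy" in vocab)      # the held-out word was never seen during fitting
print(pipe.predict(test_docs))
```

Fitting the vectoriser on the full corpus before splitting would let test-set vocabulary and document frequencies influence the features, which is the leak the pipeline avoids.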
# Train/test split - stratified to preserve class balance
X_train, X_test, y_train, y_test = train_test_split(
manifesto["text"],
manifesto["orientation"],
test_size=0.2,
random_state=42,
stratify=manifesto["orientation"]
)
print(f"Train: {len(X_train)} | Test: {len(X_test)}")
# Pipeline: TF-IDF -> Logistic Regression
pipeline_lr = Pipeline([
("tfidf", TfidfVectorizer(
max_features=2000,
ngram_range=(1, 2), # unigrams and bigrams
min_df=2,
stop_words="english",
sublinear_tf=True,
)),
("clf", LogisticRegression(
C=1.0, # inverse regularisation strength
max_iter=1000,
random_state=42,
)),
])
pipeline_lr.fit(X_train, y_train)
y_pred = pipeline_lr.predict(X_test)
print("Logistic Regression:")
print(classification_report(y_test, y_pred, digits=3))
# Naive Bayes - fast baseline, often competitive on text
pipeline_nb = Pipeline([
("tfidf", TfidfVectorizer(max_features=2000, stop_words="english",
min_df=2, sublinear_tf=False)),
("clf", MultinomialNB(alpha=1.0)),
])
pipeline_nb.fit(X_train, y_train)
y_pred_nb = pipeline_nb.predict(X_test)
print("Naive Bayes:")
print(classification_report(y_test, y_pred_nb, digits=3))
📊 Reading the output. With the noisier synthetic dataset, both classifiers do well but not perfectly.
- Logistic Regression typically reaches ~0.92 F1 with bigrams enabled. The model picks up `lower taxes`, `private sector`, `government spending` on the right, and `stronger protections`, `public investment`, `social safety` on the left.
- Naive Bayes typically lands ~0.85-0.88 F1, a few points behind. This pipeline uses unigrams only, and the "naive" assumption (feature independence) costs it precisely on the cases where bigrams matter, since it can't exploit correlations between lower and taxes.
A warning for your own data. If you ever see all-1.000 metrics on a real corpus, that's a red flag—not a celebration. Two diagnostics for "too good to be true":
- Look at the most predictive features (next section) — do they make economic sense, or is the model exploiting an artefact (a date, a header, a metadata leakage)?
- Check whether train/test split is truly independent — near-duplicates in the corpus, or stratification on something correlated with the label, will inflate performance.
4. Evaluation metrics
| Metric | Definition | Use when |
|---|---|---|
| Accuracy | % correctly classified | Classes are balanced |
| Precision | True positives / predicted positives | Cost of false positives is high |
| Recall | True positives / actual positives | Cost of false negatives is high |
| F1 | Harmonic mean of precision and recall | Imbalanced classes |
| AUC-ROC | Area under ROC curve | Probabilistic ranking |
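A worked example of the definitions in the table, using invented confusion-matrix counts for a 60-document test set with "right" as the positive class. The counts are hypothetical and chosen only to make the arithmetic easy to follow.

```python
# Hypothetical counts: tp = right predicted right, fp = left predicted right,
# fn = right predicted left, tn = left predicted left.
tp, fp, fn, tn = 27, 3, 6, 24

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)          # of the predicted positives, how many were right
recall = tp / (tp + fn)             # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

Precision (0.900) is higher than recall (about 0.818) here because the classifier makes more false negatives than false positives; F1 sits between the two.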
# Confusion matrix
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, (model_name, y_p) in zip(axes, [("Logistic Regression", y_pred), ("Naive Bayes", y_pred_nb)]):
    cm = confusion_matrix(y_test, y_p, labels=["left","right"])
    disp = ConfusionMatrixDisplay(cm, display_labels=["left","right"])
    disp.plot(ax=ax, colorbar=False, cmap="Blues")
    ax.set_title(f"Confusion matrix - {model_name}", fontsize=11)
fig.tight_layout()
plt.show()
# Most predictive features - what does the model actually learn?
feature_names = pipeline_lr.named_steps["tfidf"].get_feature_names_out()
coefs = pipeline_lr.named_steps["clf"].coef_[0] # positive = "right", negative = "left"
coef_df = pd.DataFrame({"feature": feature_names, "coef": coefs})
coef_df = coef_df.sort_values("coef")
top_left = coef_df.head(15) # most negative = most "left"
top_right = coef_df.tail(15) # most positive = most "right"
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(13, 5))
ax1.barh(top_left["feature"], top_left["coef"], color="steelblue", edgecolor="white")
ax1.set_title("Top 'left' features", fontsize=11)
ax1.set_xlabel("Coefficient")
ax2.barh(top_right["feature"], top_right["coef"], color="tomato", edgecolor="white")
ax2.set_title("Top 'right' features", fontsize=11)
ax2.set_xlabel("Coefficient")
fig.suptitle("Most predictive terms - Logistic Regression", fontsize=12, y=1.02)
fig.tight_layout()
plt.show()
📊 Reading the output. This is the auditability that LLMs do not give you. Each term (unigram or bigram) has a signed weight: positive pushes the prediction toward "right", negative toward "left".
On this synthetic data, expect bigrams like `private sector`, `lower taxes`, `government spending`, `fiscal discipline` to appear among the top right-leaning terms; `stronger protections`, `public investment`, `social safety`, `welfare state` on the left. These are substantively plausible: they correspond to the actual semantic split between the two phrase pools.
The diagnostic value. If your model surfaced terms like `january`, `2023`, `table_3` as top predictors, that's the signal to stop: the model is learning a metadata artefact, not language. When you write up text-based measures in a paper, including a feature-importance table is standard practice: referees expect to see it, and it's what distinguishes a measure you can defend from a black box.
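A first-pass artefact check can be automated. The sketch below scans a list of top features for terms that look like metadata; the feature list is hypothetical, and the heuristics (digits, underscores, month names) are assumptions you would adapt to your own corpus.

```python
import re

# Hypothetical top features, as might come from coef_df["feature"].
top_features = ["lower taxes", "private sector", "january", "2023",
                "table_3", "welfare state"]

# Assumed heuristics: digits, underscores and month names suggest metadata.
# "may" is left out on purpose because it is also an ordinary word.
MONTHS = {"january", "february", "march", "april", "june", "july",
          "august", "september", "october", "november", "december"}
SUSPICIOUS = re.compile(r"\d|_")


def looks_like_artefact(term):
    """Flag terms containing digits/underscores or a month name."""
    has_pattern = bool(SUSPICIOUS.search(term))
    has_month = any(token in MONTHS for token in term.split())
    return has_pattern or has_month


suspects = [t for t in top_features if looks_like_artefact(t)]
print(suspects)
```

A non-empty `suspects` list is a prompt to inspect the raw documents, not proof of leakage: a year can be a legitimate signal in some research designs.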
5. Cross-validation and overfitting on text data
Text classification is particularly prone to overfitting because:
- The feature space (vocabulary) is very high-dimensional
- Near-duplicate documents and shared boilerplate create overlap between train and test sets, which hides overfitting
- Small datasets have high variance
Always report cross-validated metrics, not just test set performance.
# 5-fold stratified cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores_lr = cross_val_score(pipeline_lr, manifesto["text"], manifesto["orientation"],
cv=cv, scoring="f1_macro")
scores_nb = cross_val_score(pipeline_nb, manifesto["text"], manifesto["orientation"],
cv=cv, scoring="f1_macro")
print(f"Logistic Regression - CV F1: {scores_lr.mean():.3f} +/- {scores_lr.std():.3f}")
print(f"Naive Bayes - CV F1: {scores_nb.mean():.3f} +/- {scores_nb.std():.3f}")
# Regularisation: effect of C on performance
C_values = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
cv_means, cv_stds = [], []
for C in C_values:
    p = Pipeline([
        ("tfidf", TfidfVectorizer(max_features=2000, stop_words="english", sublinear_tf=True)),
        ("clf", LogisticRegression(C=C, max_iter=1000, random_state=42)),
    ])
    s = cross_val_score(p, manifesto["text"], manifesto["orientation"],
                        cv=cv, scoring="f1_macro")
    cv_means.append(s.mean()); cv_stds.append(s.std())
fig, ax = plt.subplots(figsize=(7, 4))
ax.errorbar(range(len(C_values)), cv_means, yerr=cv_stds,
fmt="o-", color="steelblue", capsize=4, linewidth=2)
ax.set_xticks(range(len(C_values)))
ax.set_xticklabels([str(c) for c in C_values])
ax.set_title("CV F1 vs regularisation strength C", fontsize=12)
ax.set_xlabel("C (inverse regularisation)"); ax.set_ylabel("F1 macro (5-fold CV)")
fig.tight_layout(); plt.show()
📊 Reading the output. Two things to read off this plot:
- CV F1 mean ± std: on this noisier dataset, expect LogReg around 0.92 ± 0.05 and NB around 0.86 ± 0.05. The non-zero standard deviation is good news: it means the folds disagree, which gives you an honest signal of model variance. Zero variance on a small dataset is usually a sign that the data is too clean to teach you anything about robustness.
- C-curve shape. Very small C (strong regularisation) underfits: the model can't distinguish the two classes. Very large C (weak regularisation) overfits: performance plateaus or drops because the model memorises training noise. The sweet spot is typically around C ≈ 1, but always validate this for your own data. The error bars (std across folds) tell you how stable the estimate is; wide bars indicate sensitivity to which documents end up in which fold, often a sign you need more data.
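Why C is an inverse regularisation strength is easiest to see from the objective. As far as I understand scikit-learn's L2-penalised binary logistic regression, it minimises (up to a constant rescaling) the expression below, with labels coded as $y_i \in \{-1, +1\}$, feature vectors $x_i$, weights $w$ and intercept $b$; the notation is mine, not the lecture's.

```latex
\min_{w,\,b}\;\; \frac{1}{2}\,\lVert w \rVert_2^2
\;+\; C \sum_{i=1}^{n} \log\!\Bigl(1 + \exp\bigl(-y_i\,(x_i^\top w + b)\bigr)\Bigr)
```

C multiplies the data-fit term, so a small C lets the penalty $\tfrac{1}{2}\lVert w \rVert_2^2$ dominate (weights shrink, underfitting), while a large C lets the fit term dominate (weights grow to match training noise).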
6. Word embeddings: intuition
TF-IDF has a fundamental limitation: it treats words as independent symbols. "Interest rate" and "borrowing cost" are completely different features, even though they mean similar things.
Word embeddings (Word2Vec, GloVe, fastText) map each word to a dense vector in a high-dimensional space, where semantically similar words are nearby.
king - man + woman = queen
ECB - Europe + United States = Fed
We use pre-trained embeddings here - training your own requires a large corpus.
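Before loading real vectors, the analogy arithmetic can be illustrated with toy numbers. The 3-dimensional vectors below are invented for this sketch (real embeddings have hundreds of dimensions), so only the mechanics carry over: subtract, add, then pick the nearest remaining word by cosine similarity.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.2, 0.9],
    "bank":  [0.5, 0.5, 0.5],
}


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# king - man + woman, computed component by component.
target = [k - m + w for k, m, w in
          zip(vectors["king"], vectors["man"], vectors["woman"])]

# Rank the words that were not part of the query.
candidates = {word: cosine(target, vec) for word, vec in vectors.items()
              if word not in {"king", "man", "woman"}}
best = max(candidates, key=candidates.get)
print(best, round(candidates[best], 3))
```

With real embeddings the nearest neighbour is only approximately the expected word, and the result depends on the training corpus.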
# Using pre-trained embeddings with spaCy (medium or large model needed for vectors)
# python -m spacy download en_core_web_md
try:
    import spacy
    nlp_md = spacy.load("en_core_web_md")
    has_vectors = True
    print("spaCy medium model loaded - word vectors available.")
except (ImportError, OSError):
    # ImportError: spaCy itself is missing; OSError: the model is not installed
    has_vectors = False
    print("en_core_web_md not available. Run: python -m spacy download en_core_web_md")
    print("Demonstrating the concept only.")
if has_vectors:
    # Word similarity via cosine similarity in embedding space
    pairs = [("inflation","price"), ("inflation","unemployment"),
             ("rate","interest"), ("rate","percentage"), ("bank","money")]
    for w1, w2 in pairs:
        t1, t2 = nlp_md(w1), nlp_md(w2)
        sim = t1.similarity(t2)
        print(f" {w1!r} - {w2!r}: {sim:.3f}")
# Document embeddings: average of word vectors
if has_vectors:
    def doc_embedding(text):
        doc = nlp_md(str(text))
        vectors = [token.vector for token in doc
                   if not token.is_stop and token.has_vector]
        if not vectors:
            return np.zeros(nlp_md.vocab.vectors_length)
        return np.mean(vectors, axis=0)
    # Apply to manifesto sample (first 50 docs for speed)
    sample = manifesto.head(50).copy()
    embeddings = np.array([doc_embedding(t) for t in sample["text"]])
    print(f"Embedding matrix: {embeddings.shape} (documents x dimensions)")
    # Quick classifier on top of embeddings
    from sklearn.svm import LinearSVC
    X_e_tr, X_e_te, y_e_tr, y_e_te = train_test_split(
        embeddings, sample["orientation"], test_size=0.2, random_state=42)
    svm = LinearSVC(max_iter=5000).fit(X_e_tr, y_e_tr)
    print(f"SVM on embeddings - accuracy: {svm.score(X_e_te, y_e_te):.3f}")
else:
    print("Skipping embedding classifier - spaCy medium model not available.")
📊 Reading the output. Two parts here.
Word similarities. `inflation`-`unemployment` (0.633) is higher than `inflation`-`price` (0.397), which surprises most economists at first sight. A plausible explanation: the static word vectors shipped with `en_core_web_md` are learned from co-occurrence patterns in general English text, where unemployment and inflation are mentioned together far more often than inflation and price (the Phillips curve is a journalistic staple). Word vector similarity ≠ semantic identity. It measures contextual similarity in general text, which is a noisy proxy for the meaning you care about.
Document classifier on embeddings. The SVM on mean-pooled embeddings often achieves accuracy close to the TF-IDF pipelines, sometimes higher. But mean pooling discards word order entirely: "government should cut spending" and "spending should cut government" produce identical vectors. For documents where order matters (most), transformer-based embeddings (next section) are dramatically better. Mean pooling is fine for cheap dense representations or quick clustering.
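The word-order point is easy to verify with toy vectors. The 2-dimensional vectors below are invented for this sketch; the property holds for any vectors because an average does not depend on the order of its terms.

```python
# Toy 2-dimensional word vectors, invented for illustration only.
toy_vectors = {
    "government": [1.0, 0.0],
    "should":     [0.25, 0.25],
    "cut":        [0.0, 1.0],
    "spending":   [0.5, 0.25],
}


def mean_pool(sentence):
    """Average the word vectors of a whitespace-tokenised sentence."""
    rows = [toy_vectors[word] for word in sentence.split()]
    return [sum(column) / len(rows) for column in zip(*rows)]


a = mean_pool("government should cut spending")
b = mean_pool("spending should cut government")
print(a, b, a == b)
```

Two sentences with opposite meanings end up with the same document vector, so any classifier built on top of it cannot tell them apart.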
7. HuggingFace Transformers - zero-shot classification
Transformer models (BERT, RoBERTa, DistilBERT, BART) encode text using attention mechanisms that capture context. They outperform TF-IDF on many NLP tasks, especially where word order and context matter.
Zero-shot classification — classify text into arbitrary categories without any labelled training data. We already loaded facebook/bart-large-mnli in the warm-up. Under the hood it's trained on natural-language inference (NLI): given a premise (your document) and a hypothesis ("this text is about monetary policy"), it predicts whether the hypothesis is entailed. We exploit that to do classification with arbitrary labels.
Use this when:
- You have no labelled training set
- Your categories are well-described in plain English
- Your corpus is small enough that a few seconds per document is acceptable
- You want free, local, replicable classification with no API dependency
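The NLI trick described above can be sketched without loading the model. Each candidate label is turned into a hypothesis (the template below is, to my knowledge, the pipeline's default and can be changed via `hypothesis_template`), the NLI model scores entailment for each (premise, hypothesis) pair, and in single-label mode the entailment scores are normalised with a softmax. The logits here are hypothetical placeholders for what the model would return.

```python
import math

# Default template of the HuggingFace zero-shot pipeline, as far as I know.
HYPOTHESIS_TEMPLATE = "This example is {}."

premise = "ECB raises rates by 25bp to anchor inflation expectations"
candidate_labels = ["monetary policy", "fiscal policy", "labour"]

# One (premise, hypothesis) pair per candidate label.
pairs = [(premise, HYPOTHESIS_TEMPLATE.format(label))
         for label in candidate_labels]

# Hypothetical entailment logits, one per pair (a real run gets these
# from the NLI model's forward pass).
entailment_logits = [3.1, 0.4, -1.2]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


scores = softmax(entailment_logits)
for (_, hypothesis), score in sorted(zip(pairs, scores), key=lambda t: -t[1]):
    print(f"{score:.2f}  {hypothesis}")
```

This is why label wording matters: changing a label changes the hypothesis sentence the model is asked to judge.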
# We already loaded zs_classifier in the warm-up; reuse it here.
# This is the same call we used on headlines, now applied to a more nuanced task:
# distinguishing hawkish vs dovish vs neutral central bank statements.
candidate_labels = ["hawkish monetary policy", "dovish monetary policy", "neutral statement"]
texts_to_classify = [
"The Governing Council decided to raise rates by 75 basis points. Inflation is far too high.",
"The Governing Council decided to keep rates unchanged. The economy is recovering gradually.",
"The Governing Council decided to cut rates. The inflation outlook has improved significantly.",
]
print("Zero-shot classification (BART-MNLI):")
for text in texts_to_classify:
    result = zs_classifier(text, candidate_labels=candidate_labels)
    top_label = result["labels"][0]
    top_score = result["scores"][0]
    print(f"\n [{top_label} {top_score:.2f}]")
    print(f" {text[:90]}...")
📊 Reading the output. Look at the second statement: `keep rates unchanged. The economy is recovering gradually` → classified as dovish (0.65), not neutral. Why? The phrase "rates unchanged" is a hold, but BART-MNLI reads the positive framing of "recovering gradually" as dovish-leaning. This is exactly the kind of failure mode you want to know about before you run this on 10,000 central-bank statements.
The first and third statements are confidently and correctly classified (0.84 hawkish, 0.93 dovish) because they contain unambiguous policy verbs (`raise rates`, `cut rates`).
Real central-bank communications rarely contain "raise rates by 75bp" so explicitly; they bury the signal in hedged language. Zero-shot models handle this better than dictionaries (which need exact lexical matches) because they encode semantic similarity. However, validate carefully on your domain: BART-MNLI was trained on general web text, not central-bank speeches, so it can systematically miss the nuances of policy-speak. For ECB and Fed specifically, FinBERT (next section) or a small fine-tuned model usually outperforms it.
8. Domain-specific BERT: FinBERT for financial sentiment
Zero-shot models like BART-MNLI are general-purpose. When your domain is well-defined (finance, biomedicine, legal text), a fine-tuned BERT trained on in-domain data typically outperforms zero-shot — at a fraction of the inference cost.
FinBERT (ProsusAI/finbert) is BERT-base fine-tuned on financial news. It outputs three labels: positive, negative, neutral. Standard reference: Araci, D. (2019). "FinBERT: Financial Sentiment Analysis with Pre-trained Language Models." arXiv:1908.10063.
Common research uses:
- Score earnings call transcripts
- Build a sentiment index from financial news headlines
- Compare central-bank statement tone over time (alongside dictionary methods like Loughran-McDonald)
When to prefer FinBERT over BART-MNLI:
- You specifically want financial sentiment with high domain accuracy
- Your labels match FinBERT's three-class output
- You want a smaller, faster model (~440 MB vs ~1.5 GB)
When to prefer BART-MNLI over FinBERT:
- Your labels are non-standard or domain-specific in some other way (legal, biomedical, political)
- You need labels beyond positive/negative/neutral
# FinBERT - domain-specific sentiment for financial text
# Downloads ~440 MB on first run; cached afterwards in ~/.cache/huggingface/
finbert = hf_pipeline(
"text-classification",
model="ProsusAI/finbert",
device=-1,
)
financial_texts = [
"The ECB raised rates by 75bp to combat persistent inflation.",
"Eurozone unemployment fell to a record low, supporting consumer spending.",
"The economy contracted by 0.4% in Q3 amid the ongoing energy shock.",
"Quarterly earnings beat analyst expectations by a wide margin.",
"The central bank signalled prolonged restrictive policy.",
]
print("FinBERT sentiment classification:")
print("-" * 80)
for text in financial_texts:
    result = finbert(text)[0]
    print(f" [{result['label']:<8s} {result['score']:.2f}] {text}")
📊 Reading the output. Read this carefully — it's a teachable failure mode.
| Statement | FinBERT label | What it should be |
|---|---|---|
| "ECB raised rates by 75bp to combat persistent inflation" | positive (0.92) | hawkish, restrictive for the economy, not "positive" |
| "Eurozone unemployment fell to a record low" | negative (0.88) | clearly positive economic news |
| "The economy contracted by 0.4%" | negative (0.98) | ✓ correct |
| "Quarterly earnings beat expectations" | positive (0.96) | ✓ correct |
| "Central bank signalled prolonged restrictive policy" | negative (0.62) | reasonable: bad for risk assets |
The "unemployment fell" failure is exactly the VADER problem from L8 — the model has learned that the word unemployment signals negativity, and the rest of the sentence cannot save it. Domain pre-training does not eliminate the issue, it just shifts it: FinBERT now confuses good news for the economy with good news for asset prices.
Rule of thumb for any project. Even a domain-specific BERT needs validation on a hand-coded sample of 50-100 documents from your specific corpus before you trust it at scale. This matters most when the polarity of a phenomenon (unemployment, inflation, rates) is reversed between general English ("low = good") and your research question ("low unemployment = labour market strength, but possibly inflationary pressure").
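The validation step can be as simple as comparing model labels with your hand-coded labels and tabulating the disagreements. The two lists below are hypothetical; in practice `hand_coded` comes from your own coding of 50-100 documents and `predicted` from running the model on the same documents.

```python
from collections import Counter

# Hypothetical hand-coded labels and model predictions for the same documents.
hand_coded = ["positive", "negative", "neutral", "positive", "negative", "neutral"]
predicted = ["positive", "negative", "neutral", "negative", "negative", "positive"]

# Share of documents where the model agrees with the human coder.
agreement = sum(h == p for h, p in zip(hand_coded, predicted)) / len(hand_coded)

# Count each (hand-coded, predicted) disagreement to see where the model fails.
errors = Counter((h, p) for h, p in zip(hand_coded, predicted) if h != p)

print(f"Agreement: {agreement:.2f}")
for (h, p), count in errors.most_common():
    print(f"  hand-coded {h!r} -> model {p!r}: {count}")
```

The error table is often more informative than the headline agreement rate: a model that systematically flips one class (for example, good labour-market news read as negative) needs a different fix than one that is noisy everywhere.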
# Compare FinBERT vs BART-MNLI on the same text
test_text = "The Governing Council maintains a data-dependent approach. Risks to the inflation outlook are broadly balanced."
# FinBERT - returns one of three fixed labels
fb_result = finbert(test_text)[0]
print("FinBERT (fine-tuned on financial news):")
print(f" {fb_result['label']:<10s} {fb_result['score']:.3f}")
# BART-MNLI - returns scores for arbitrary labels you provide
bart_result = zs_classifier(test_text,
candidate_labels=["hawkish", "dovish", "neutral"])
print("\nBART-MNLI (zero-shot, your own labels):")
for lab, sc in zip(bart_result["labels"], bart_result["scores"]):
    print(f" {lab:<10s} {sc:.3f}")
📊 Reading the output. On the hedged statement "data-dependent approach, risks broadly balanced":
- FinBERT correctly labels it `neutral (0.88)`. That's the wording central banks use to signal they haven't decided yet, and FinBERT, having seen plenty of similar phrasing in its training corpus, picks up the hedging.
- BART-MNLI with explicit `hawkish`/`dovish`/`neutral` labels gives `dovish (0.71)` as the top label. The model has no specific training on policy-speak and over-weights "balanced" as a positive/dovish signal.
When to prefer FinBERT for your projects: financial news, earnings calls, market commentary, or anything else within FinBERT's training distribution.
When to prefer BART-MNLI: you need finer distinctions than positive/negative/neutral (e.g. hawkish vs dovish, escalation vs de-escalation, expansionary vs contractionary), but accept that you'll need to validate carefully against a hand-coded sample.
9. ⏱ Ten-minute challenge — labels matter
You've now seen two pre-trained tools: BART-MNLI (zero-shot, with arbitrary labels) and FinBERT (fine-tuned, with three fixed labels). For the zero-shot model, the choice of labels is the modern equivalent of feature engineering: it's the lever you have over the model's output.
Your task (10 minutes):
- Pick a recent news article — any newspaper, any language (English works best with these models).
- Choose three different label sets that could plausibly describe the same article. For example, for an article about an ECB rate decision:
  - Set A (broad): `["economy", "politics", "society", "technology"]`
  - Set B (policy stance): `["hawkish", "dovish", "neutral"]`
  - Set C (sentiment): `["positive outlook", "negative outlook", "mixed signals"]`
- Run BART-MNLI on all three label sets. Record the top label and confidence each time.
This is the zero-shot analogue of prompt engineering. There is no objectively "right" label set — only labels that are well-aligned with the question you're asking.
Then we compare in class: same articles, different students, different label choices → very different "structured measurements" of the same text. Discuss: which label set tells you something useful for your research question?
Fill the cell below with your article and your three label sets.
# Your turn: paste your article, define three label sets, run.
#
# 1. Replace the article placeholder with the text of a real news article.
# 2. Replace each label set with three labels relevant to YOUR research question.
# 3. Run the cell.
article = """
PASTE YOUR ARTICLE HERE
"""
label_sets = {
"Set A — describe it": ["label_1", "label_2", "label_3"],
"Set B — describe it": ["label_1", "label_2", "label_3"],
"Set C — describe it": ["label_1", "label_2", "label_3"],
}
# YOUR CODE BELOW — call zs_classifier on `article` for each label set,
# then print the top label and confidence for each.
# -- SOLUTION ------------------------------------------------------------------
# A worked example using a real ECB rate-hike article and three label sets
# that operationalise three different research questions.
article = """\
FRANKFURT - The European Central Bank raised its three key interest rates by 25 basis points
on Thursday, taking the deposit facility rate to 4.00%. President Lagarde said inflation is
expected to remain too high for too long, citing core inflation of 5.3% in August. The
Governing Council remains committed to ensuring inflation returns to its 2% medium-term target
in a timely manner. Markets had priced in a 65% probability of the hike.
"""
label_sets = {
"Set A — Broad topic": ["economy", "politics", "society", "technology"],
"Set B — Policy stance": ["hawkish", "dovish", "neutral"],
"Set C — Article tone": ["positive outlook", "negative outlook", "mixed signals"],
}
for name, labels in label_sets.items():
    result = zs_classifier(article, candidate_labels=labels)
    print(f"{name}:")
    for lab, sc in zip(result['labels'], result['scores']):
        print(f"  {lab:<22s} {sc:.3f}")
    print()
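To compare results across students in class, it can help to collect one row per label set instead of printing free-form output. The sketch below is one way to do that. It assumes a classifier with the same call signature and return keys (`labels`, `scores`) as the `zs_classifier` used above. The names `compare_label_sets` and `fake_classifier` are hypothetical and introduced here for illustration; the stub returns made-up scores so the snippet runs without downloading a model.

```python
def compare_label_sets(classifier, text, label_sets):
    """Return one row per label set: top label, its score, and the margin to the runner-up."""
    rows = []
    for name, labels in label_sets.items():
        result = classifier(text, candidate_labels=labels)
        # Sort explicitly by score rather than relying on the order of the output
        ranked = sorted(zip(result["labels"], result["scores"]),
                        key=lambda pair: pair[1], reverse=True)
        top_label, top_score = ranked[0]
        runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
        rows.append({
            "label_set": name,
            "top_label": top_label,
            "confidence": round(top_score, 3),
            "margin": round(top_score - runner_up, 3),
        })
    return rows


def fake_classifier(text, candidate_labels):
    """Stand-in for zs_classifier: deterministic made-up scores, NOT a real model."""
    weights = [len(label) for label in candidate_labels]
    total = sum(weights)
    return {"labels": list(candidate_labels),
            "scores": [w / total for w in weights]}


demo_sets = {
    "Set A": ["economy", "politics"],
    "Set B": ["hawkish", "dovish", "neutral"],
}
demo_rows = compare_label_sets(fake_classifier, "ECB raises rates.", demo_sets)
for row in demo_rows:
    print(row)
```

In the notebook you would pass `zs_classifier`, `article` and `label_sets` instead of the stub and demo inputs. A small margin suggests the model is close to indifferent between its top two labels, which is worth noting before reading much into the top label.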
10. LLMs via API: Claude and OpenAI (optional, paid)
Note. This section uses commercial APIs (Anthropic Claude, OpenAI GPT). API calls are billed per token. New accounts may receive small trial credits (around $5; check each provider's current terms), which would comfortably cover everything in this section. This material is not required to complete the course or the final exercise: the exercise can be done entirely with the free tools from Sections 1-9.
Read this section if you want to: (a) understand how API-based LLMs differ from local transformers, (b) use them for your projects if your research budget allows, (c) be able to discuss the trade-off in your defence.
Large language models (Claude, GPT-5, Llama, Mistral) can perform complex NLP tasks in a few lines of code via API — tasks that would require significant engineering effort with traditional ML, or that local transformers can't easily do (free-form extraction, multi-step reasoning, summarisation).
What LLMs do that local zero-shot transformers can't:
- Free-form structured output (e.g. JSON with arbitrary fields)
- Few-shot learning via in-context examples
- Cross-lingual reasoning without per-language models
- Following complex multi-step instructions
Important caveats:
- API calls are paid - budget accordingly for large corpora
- Reproducibility: model outputs can change between API versions; always pin model IDs in your code
- Hallucination: LLMs can generate plausible-sounding but incorrect information
- Always validate a sample of LLM outputs manually before trusting them in a paper
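The last caveat can be made concrete with a small agreement check between your hand-coded sample and the model's labels. The sketch below reports raw agreement and Cohen's kappa, which corrects for agreement expected by chance. The function name `agreement_stats` and the example labels are hypothetical, made up for illustration; they are not results from any model.

```python
from collections import Counter


def agreement_stats(hand_labels, model_labels):
    """Raw agreement and Cohen's kappa between hand-coded and model labels."""
    if len(hand_labels) != len(model_labels) or not hand_labels:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(hand_labels)
    observed = sum(h == m for h, m in zip(hand_labels, model_labels)) / n
    hand_freq = Counter(hand_labels)
    model_freq = Counter(model_labels)
    # Agreement expected by chance if both coders labelled independently
    expected = sum(hand_freq[c] * model_freq[c] for c in hand_freq) / n ** 2
    # If both coders used a single identical label, kappa is undefined; report 1.0
    kappa = (observed - expected) / (1 - expected) if expected < 1 else 1.0
    return {"n": n, "raw_agreement": observed, "cohen_kappa": kappa}


# Hypothetical hand-coded sample vs. model output for eight documents
hand = ["hawkish", "hawkish", "dovish", "neutral",
        "dovish", "hawkish", "neutral", "dovish"]
model = ["hawkish", "neutral", "dovish", "neutral",
         "dovish", "hawkish", "hawkish", "dovish"]
stats = agreement_stats(hand, model)
print(f"n={stats['n']}  agreement={stats['raw_agreement']:.2f}  "
      f"kappa={stats['cohen_kappa']:.2f}")
```

Raw agreement alone can look flattering when one label dominates the corpus, which is why kappa is reported alongside it.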
10a. Setting up API access
To use the LLM APIs you need to: (1) install the client libraries, (2) create an account, (3) generate an API key, (4) store it as an environment variable.
Install the client libraries (only needed if you actually plan to call the APIs — not required to run this notebook):
pip install anthropic openai
Account and API key. Never hardcode API keys in your notebook or commit them to Git: they grant full access to your billing account, and keys leaked in a public repo can be picked up by automated scrapers within minutes.
Anthropic (Claude):
- Sign up at https://console.anthropic.com
- Generate a key at Settings → API Keys (format: sk-ant-...)
- Buy a small amount of credit (or use the free trial credit)
OpenAI (GPT):
- Sign up at https://platform.openai.com
- Generate a key at API Keys (format: sk-...)
- Buy a small amount of credit
Storing keys as environment variables (Mac/Linux):
# Temporary (current shell only):
export ANTHROPIC_API_KEY="sk-ant-..."
export OPENAI_API_KEY="sk-..."
# Persistent (zsh, default on macOS):
echo 'export ANTHROPIC_API_KEY="sk-ant-..."' >> ~/.zshrc
echo 'export OPENAI_API_KEY="sk-..."' >> ~/.zshrc
source ~/.zshrc
Windows PowerShell:
[Environment]::SetEnvironmentVariable("ANTHROPIC_API_KEY", "sk-ant-...", "User")
[Environment]::SetEnvironmentVariable("OPENAI_API_KEY", "sk-...", "User")
Recommended for this course — scoped to the conda environment:
conda env config vars set ANTHROPIC_API_KEY="sk-ant-..." -n Python_for_Economists
conda env config vars set OPENAI_API_KEY="sk-..." -n Python_for_Economists
conda deactivate && conda activate Python_for_Economists
This keeps your keys tied to one environment instead of polluting your global shell. The Python clients (Anthropic(), OpenAI()) read these variables automatically — you never pass the key in code.
Verify your setup in Python:
import os
print(f"ANTHROPIC_API_KEY set: {bool(os.environ.get('ANTHROPIC_API_KEY'))}")
print(f"OPENAI_API_KEY set : {bool(os.environ.get('OPENAI_API_KEY'))}")
If both show False, the env vars are not visible to the kernel — restart Jupyter from a shell where they ARE set, or use the conda approach above.
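If you want later cells to skip the paid calls gracefully rather than fail with an authentication error, a small guard like the one below can help. It is a sketch only: `missing_api_keys` is a hypothetical helper name, and the optional `env` argument exists so the function can be checked against a plain dictionary instead of the real environment.

```python
import os


def missing_api_keys(required=("ANTHROPIC_API_KEY", "OPENAI_API_KEY"), env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


missing = missing_api_keys()
if missing:
    print("Skipping paid cells; not visible to the kernel:", ", ".join(missing))
else:
    print("Both keys are visible to the kernel.")
```

The guard only checks that the variables exist. It does not verify that the keys are valid or that the account has credit.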
10b. Pricing (as of May 2026)
Both providers charge per million tokens (MTok), with separate rates for input (your prompt + context) and output (the model's response). A token is roughly 0.75 English words, so 1,000 words is about 1,333 tokens. Neither provider offers a permanent free tier; where new accounts receive trial credits (around $5), those comfortably cover experimentation.
| Provider | Model | Input ($/MTok) | Output ($/MTok) | Notes |
|---|---|---|---|---|
| Anthropic | Claude Haiku 4.5 | $1.00 | $5.00 | Default for classification / extraction |
| Anthropic | Claude Sonnet 4.6 | $3.00 | $15.00 | Complex reasoning, nuanced text |
| Anthropic | Claude Opus 4.7 | $5.00 | $25.00 | Hardest tasks, longest context |
| OpenAI | gpt-4o-mini | $0.15 | $0.60 | Cheap, well-documented legacy default |
| OpenAI | gpt-5-mini | $0.25 | $2.00 | Cheap with stronger reasoning |
| OpenAI | gpt-5 | $1.25 | $10.00 | OpenAI flagship-lite |
| OpenAI | gpt-5.4 | $2.50 | $15.00 | OpenAI current flagship |
Reality check for academic workloads:
| Workload | Claude Haiku | GPT-4o-mini |
|---|---|---|
| Quick demo (10-20 short calls) | < $0.01 | < $0.001 |
| 1,000 ECB press releases (~500 in + 50 out tokens each) | ~$0.75 | ~$0.10 |
| 50,000 news headlines (~30 in + 10 out) | ~$4 | ~$0.50 |
| 5,000 long earnings transcripts (~5,000 in + 200 out) | ~$30 | ~$4 |
Cost-saving features worth knowing for project-scale corpora:
- Batch API (both providers): 50% discount, results within 24h. Always use this for anything that does not need real-time output.
- Prompt caching (Anthropic): up to 90% off repeated system prompts / instructions. Useful when you re-use the same long classification rubric across thousands of calls.
- Pin model IDs with date suffixes (e.g. claude-haiku-4-5-20251001) so your replication package keeps working when the provider rolls a new default.
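To see roughly what the two discounts above are worth, the arithmetic can be sketched as follows. This is a simplified illustration, not either provider's billing formula. It assumes the batch discount applies to the whole bill, that the shared rubric is billed at 10% of the input price once cached (the "up to 90% off" figure), that there is no extra charge for writing to the cache, and that the two discounts can be combined. `discounted_cost` and its parameters are hypothetical names, and the 2,000-token rubric is a made-up workload.

```python
def discounted_cost(n_docs, shared_prompt_tokens, doc_tokens, output_tokens,
                    input_price_per_mtok, output_price_per_mtok,
                    batch=False, cache_shared_prompt=False,
                    batch_discount=0.50, cache_read_discount=0.90):
    """Rough cost in dollars under the simplifying assumptions stated above."""
    shared_rate = input_price_per_mtok
    if cache_shared_prompt:
        # ASSUMPTION: the repeated rubric is always served from cache
        shared_rate *= (1 - cache_read_discount)
    input_cost = n_docs * (shared_prompt_tokens * shared_rate
                           + doc_tokens * input_price_per_mtok) / 1e6
    output_cost = n_docs * output_tokens * output_price_per_mtok / 1e6
    total = input_cost + output_cost
    if batch:
        # ASSUMPTION: the batch discount applies to input and output alike
        total *= (1 - batch_discount)
    return total


# Hypothetical workload: 5,000 docs, a 2,000-token rubric, 500-token documents,
# 50 output tokens, Claude Haiku 4.5 prices from the table above
args = (5_000, 2_000, 500, 50, 1.00, 5.00)
base = discounted_cost(*args)
batched = discounted_cost(*args, batch=True)
cached = discounted_cost(*args, cache_shared_prompt=True)
print(f"No discount    : ${base:.2f}")
print(f"Batch only     : ${batched:.2f}")
print(f"Caching only   : ${cached:.2f}")
```

Caching matters most when the shared instructions are long relative to each document; with a short prompt the saving is negligible.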
10c. Same task, two providers: a reusable pattern
Both APIs follow the same logic: a client object, a messages.create() (Anthropic) or chat.completions.create() (OpenAI) call, and a parsed response. The function below abstracts over the difference.
Initialise both clients:
from anthropic import Anthropic
from openai import OpenAI
claude = Anthropic() # reads ANTHROPIC_API_KEY from env
openai_client = OpenAI() # reads OPENAI_API_KEY from env
A reusable classification function for either provider:
def classify_with_llm(text, categories, provider="claude", model=None):
    """
    Classify a text document into one of `categories` using an LLM.
    provider: 'claude' or 'openai'.
    """
    cats = ", ".join(f'"{c}"' for c in categories)
    prompt = (
        f"Classify the following central bank statement as exactly one of: {cats}.\n\n"
        f"Statement:\n{text}\n\n"
        "Respond with ONLY the category label and nothing else."
    )
    if provider == "claude":
        model = model or "claude-haiku-4-5-20251001"
        resp = claude.messages.create(
            model=model,
            max_tokens=20,
            temperature=0,  # same setting as the OpenAI call, for repeatability
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.content[0].text.strip()
    elif provider == "openai":
        model = model or "gpt-4o-mini"
        resp = openai_client.chat.completions.create(
            model=model,
            max_tokens=20,
            temperature=0,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content.strip()
    else:
        raise ValueError("provider must be 'claude' or 'openai'")
Use it on the same statements with both providers:
statements = [
"The Governing Council decided to raise rates by 75 basis points. Inflation is far too high.",
"The Governing Council decided to keep rates unchanged. The economy is recovering.",
"The Governing Council decided to cut rates by 25 basis points to support growth.",
]
categories = ["hawkish", "dovish", "neutral"]
for s in statements:
    c = classify_with_llm(s, categories, provider="claude")
    o = classify_with_llm(s, categories, provider="openai")
    print(f"{c:<12} | {o:<12} | {s[:70]}...")
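Even with the instruction to respond with only the label, a model may return variants such as `"Hawkish."` or a full sentence. Before comparing providers or merging labels into a dataset, it is worth mapping each raw reply back onto the allowed categories. The sketch below shows one possible approach. `normalise_label` is a hypothetical helper, not part of either client library, and the replies in the example are made up.

```python
import re


def normalise_label(raw, categories, fallback=None):
    """Map a raw LLM reply onto one of `categories`, or return `fallback`."""
    cleaned = raw.strip().strip("\"'`.,:;!").lower()
    lookup = {c.lower(): c for c in categories}
    if cleaned in lookup:
        return lookup[cleaned]
    # Accept a longer reply only if it names exactly one category as a whole word
    hits = [c for key, c in lookup.items()
            if re.search(rf"\b{re.escape(key)}\b", cleaned)]
    return hits[0] if len(hits) == 1 else fallback


cats = ["hawkish", "dovish", "neutral"]
# Made-up replies illustrating typical deviations from the requested format
for reply in [' "Hawkish". ', "The statement is dovish", "hawkish or neutral"]:
    print(repr(reply), "->", normalise_label(reply, cats, fallback="UNPARSED"))
```

Replies that cannot be mapped are better kept as an explicit fallback value and inspected by hand than silently forced into a category.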
Why this pattern matters. Switching providers (or model tiers within the same provider) is a one-argument change. This matters because:
- Robustness checks. In a paper, re-running your classification with a second provider is the closest thing to "out-of-sample validation" for LLM-based measures. Disagreement rates are publishable as a robustness table.
- Replication-friendliness. If one provider deprecates the model you used, the other becomes a fallback.
- Cost. If validation shows the cheap model is right about 95% of the time, run it on the whole corpus and escalate to the expensive model only for the documents where the two cheap models disagree.
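The robustness and cost points above both rest on the same bookkeeping: which documents the two providers labelled differently. A minimal sketch of that step follows. `disagreement_report` is a hypothetical helper, and the two label lists are invented for illustration; they are not real API output.

```python
from collections import Counter


def disagreement_report(labels_a, labels_b):
    """Compare two label lists position by position."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both label lists must cover the same documents")
    pairs = list(zip(labels_a, labels_b))
    # Indices of documents to re-run with a stronger model or check by hand
    escalate = [i for i, (a, b) in enumerate(pairs) if a != b]
    rate = len(escalate) / len(pairs) if pairs else 0.0
    # Counts of (provider A label, provider B label) for a robustness table
    crosstab = Counter(pairs)
    return {"disagreement_rate": rate, "escalate": escalate, "crosstab": crosstab}


# Hypothetical labels for four documents, NOT real API output
labels_claude = ["hawkish", "neutral", "dovish", "hawkish"]
labels_openai = ["hawkish", "dovish", "dovish", "neutral"]
report = disagreement_report(labels_claude, labels_openai)
print(f"Disagreement rate: {report['disagreement_rate']:.0%}")
print("Documents to escalate:", report["escalate"])
for (a, b), count in sorted(report["crosstab"].items()):
    print(f"  claude={a:<8} openai={b:<8} n={count}")
```

Agreement between two models is not the same as correctness: both can be wrong in the same way, so the hand-coded validation sample is still needed.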
10d. Structured extraction: where LLMs really pay off
One thing that's genuinely hard with local transformers and easy with LLMs: getting structured JSON output from unstructured text. Here's the pattern for free-form information extraction.
import json
def extract_structured(article_text, provider="claude"):
    """Extract structured fields from a news article as JSON."""
    prompt = (
        "Extract the following structured information from this news article. "
        "Respond with ONLY valid JSON, no markdown fences, no commentary.\n\n"
        "Required fields:\n"
        "  main_actor (str): the primary actor in the story\n"
        "  action_taken (str): one short sentence\n"
        "  policy_area (str): one of [monetary, fiscal, labour, defence, trade, other]\n"
        "  numbers_mentioned (list of str): all numeric quantities with their units\n"
        "  sentiment (str): one of [positive, neutral, negative]\n\n"
        f"Article:\n{article_text}"
    )
    if provider == "claude":
        resp = claude.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=400,
            temperature=0,  # same setting as the OpenAI call, for repeatability
            messages=[{"role": "user", "content": prompt}],
        )
        # json.loads raises an error if the reply is not valid JSON
        return json.loads(resp.content[0].text)
    else:
        resp = openai_client.chat.completions.create(
            model="gpt-4o-mini",
            max_tokens=400,
            temperature=0,
            response_format={"type": "json_object"},
            messages=[{"role": "user", "content": prompt}],
        )
        return json.loads(resp.choices[0].message.content)
# Example: extract structured data from the ECB article used in the challenge
result = extract_structured(article, provider="claude")
print(json.dumps(result, indent=2))
The output would look something like:
{
"main_actor": "European Central Bank",
"action_taken": "Raised interest rates by 25 basis points to 4.00%",
"policy_area": "monetary",
"numbers_mentioned": ["25 basis points", "4.00%", "5.3%", "2%", "65%"],
"sentiment": "negative"
}
This kind of structured extraction — going from messy prose to a clean row of a panel dataset — is what makes LLMs genuinely useful in empirical economics. Doing the same task with regex or rule-based parsing is possible but enormously more time-consuming, and brittle to small wording changes across documents.
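Before such output becomes a row in a dataset, it is prudent to check that every reply actually follows the requested schema. The sketch below parses a raw reply, tolerates markdown fences the model may add despite the instruction, and lists any schema problems. The field names and allowed values are copied from the prompt in `extract_structured`. `parse_extraction` and the two example replies are hypothetical and constructed by hand.

```python
import json

# Field names and allowed values copied from the prompt in extract_structured
REQUIRED_FIELDS = {
    "main_actor": str,
    "action_taken": str,
    "policy_area": str,
    "numbers_mentioned": list,
    "sentiment": str,
}
ALLOWED_VALUES = {
    "policy_area": {"monetary", "fiscal", "labour", "defence", "trade", "other"},
    "sentiment": {"positive", "neutral", "negative"},
}
FENCE = "`" * 3  # a markdown code fence, built this way to keep the snippet fence-safe


def parse_extraction(raw_text):
    """Parse an LLM reply into a dict and list any schema problems found."""
    text = raw_text.strip()
    if text.startswith(FENCE):
        # Drop surrounding backticks and an optional leading "json" language tag
        text = text.strip("`").strip()
        if text.lower().startswith("json"):
            text = text[4:]
    record = json.loads(text)  # raises json.JSONDecodeError on malformed JSON
    if not isinstance(record, dict):
        return record, ["top-level JSON value is not an object"]
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"wrong type for {field}")
    for field, allowed in ALLOWED_VALUES.items():
        value = record.get(field)
        if isinstance(value, str) and value not in allowed:
            problems.append(f"unexpected value for {field}: {value}")
    return record, problems


# Hand-written replies for illustration, NOT real API output
good_reply = json.dumps({
    "main_actor": "European Central Bank",
    "action_taken": "Raised interest rates by 25 basis points to 4.00%",
    "policy_area": "monetary",
    "numbers_mentioned": ["25 basis points", "4.00%"],
    "sentiment": "negative",
})
bad_reply = (FENCE + 'json\n{"main_actor": "ECB", "action_taken": "Raised rates", '
             '"policy_area": "monetary", "sentiment": "hawkish"}\n' + FENCE)
_, good_problems = parse_extraction(good_reply)
_, bad_problems = parse_extraction(bad_reply)
print("good reply:", good_problems)
print("bad reply :", bad_problems)
```

Checks like these catch formatting failures only. Whether the extracted values are factually right still needs a manual look at a sample.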
10e. Cost estimation: run this BEFORE launching on a large corpus
The cell below is the one executable piece of Section 10 — pure arithmetic, no API calls. Use it to budget before you launch any LLM job on a large corpus.
def estimate_llm_cost(n_docs,
                      avg_input_tokens=500,
                      avg_output_tokens=50,
                      input_price_per_mtok=1.00,
                      output_price_per_mtok=5.00,
                      model_name="Claude Haiku 4.5"):
    """Rough cost estimate for an LLM classification/extraction run."""
    in_tok = n_docs * avg_input_tokens
    out_tok = n_docs * avg_output_tokens
    cost = (in_tok / 1e6) * input_price_per_mtok + (out_tok / 1e6) * output_price_per_mtok
    print(f"Model : {model_name}")
    print(f"  Documents     : {n_docs:,}")
    print(f"  Input tokens  : {in_tok:>12,} @ ${input_price_per_mtok}/MTok")
    print(f"  Output tokens : {out_tok:>12,} @ ${output_price_per_mtok}/MTok")
    print(f"  Est. cost     : ${cost:.4f}")
    print()
# Compare across providers on a hypothetical 5,000-document corpus
N_DOCS = 5_000
estimate_llm_cost(N_DOCS, 500, 50, 1.00, 5.00, "Claude Haiku 4.5")
estimate_llm_cost(N_DOCS, 500, 50, 3.00, 15.00, "Claude Sonnet 4.6")
estimate_llm_cost(N_DOCS, 500, 50, 0.15, 0.60, "GPT-4o-mini")
estimate_llm_cost(N_DOCS, 500, 50, 0.25, 2.00, "GPT-5-mini")
📊 Reading the output. The order of magnitude is what matters here. For a typical project-scale corpus (a few thousand documents, short prompts), even the most expensive Claude tier costs under $30 — less than a textbook. For 50k+ documents, the difference between Haiku and Sonnet is the difference between negligible and meaningful; pick the cheaper model first and only upgrade if validation tells you it's necessary.
GPT-4o-mini is the price floor: at $0.15/$0.60 per MTok it can process the same 5,000-document workload for ~$0.50 vs Claude Haiku's ~$3.75. If you don't need Claude's specific strengths, the OpenAI cheap tier is hard to beat on cost alone.
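The estimates above depend on `avg_input_tokens`, which you rarely know in advance. One rough way to get a number is to apply the 0.75 words-per-token rule of thumb from Section 10b to a sample of your own documents, as sketched below. This is a heuristic, not a real tokenizer: actual counts differ by model and are usually higher for non-English text. `estimate_tokens` and `average_input_tokens` are hypothetical helper names, and the instruction text is a made-up example.

```python
import math


def estimate_tokens(text, words_per_token=0.75):
    """Heuristic token count from whitespace-separated words (not a real tokenizer)."""
    return math.ceil(len(text.split()) / words_per_token)


def average_input_tokens(sample_docs, instruction_text=""):
    """Average estimated input tokens per call: instructions plus one document."""
    overhead = estimate_tokens(instruction_text)
    per_doc = [estimate_tokens(doc) + overhead for doc in sample_docs]
    return sum(per_doc) / len(per_doc)


sample_docs = [
    "The Governing Council decided to keep rates unchanged.",
    "Inflation is far too high.",
]
instruction = "Classify the statement as hawkish, dovish or neutral."
avg_in = average_input_tokens(sample_docs, instruction)
print(f"Estimated average input tokens per call: {avg_in:.1f}")
```

The result can be passed as `avg_input_tokens` to `estimate_llm_cost`. For a budget, it is safer to round up generously than to trust the heuristic to the last token.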
11. Exercise
Time: ~20 minutes. Work individually. Remainder of time: proposal preparation.
Task 1. Train a logistic regression classifier on your corpus. You will need a labelled variable - this can be:
- Time period (pre/post a specific event)
- Document category (if your corpus has multiple types)
- Any binary metadata column you created in L6
Report precision, recall, F1. Plot the confusion matrix.
Task 2. Run zero-shot classification on a sample of 10-20 documents from your corpus, using one of:
- BART-MNLI (general-purpose zero-shot, free, local) — default choice
- FinBERT (if your corpus is financial, free, local)
- Optional: Claude or GPT-4o-mini via API (only if you have keys set up; see Section 10)
Choose 2-3 labels that are relevant to your research question. Compare results to your logistic regression classifier (if applicable).
Task 3. Identify the 5 most predictive positive and 5 most predictive negative features from your logistic regression. Do they make economic sense?
# Task 1 - supervised classifier
# YOUR CODE HERE
# Task 2 - zero-shot / LLM classification
# YOUR CODE HERE
# Task 3 - most predictive features
# YOUR CODE HERE
# -- SOLUTION ------------------------------------------------------------------
# This solution uses the synthetic `manifesto` dataset built earlier in the lecture.
# For your own corpus, replace `manifesto` with your DataFrame and adapt
# the label column accordingly.
import numpy as np, pandas as pd, matplotlib.pyplot as plt, warnings
warnings.filterwarnings("ignore")
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, ConfusionMatrixDisplay, confusion_matrix
from sklearn.pipeline import Pipeline
# ── Task 1: supervised classifier on TF-IDF ──────────────────────────────────
X_tr, X_te, y_tr, y_te = train_test_split(
    manifesto["text"], manifesto["orientation"],
    test_size=0.25, random_state=42, stratify=manifesto["orientation"])
pipeline_sol = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=2000, ngram_range=(1, 2),
                              min_df=2, stop_words="english", sublinear_tf=True)),
    ("clf", LogisticRegression(C=1.0, max_iter=1000, random_state=42)),
])
pipeline_sol.fit(X_tr, y_tr)
y_pred_sol = pipeline_sol.predict(X_te)
print("Task 1 - Logistic Regression classification report:")
print(classification_report(y_te, y_pred_sol, digits=3))
fig, ax = plt.subplots(figsize=(5, 4))
ConfusionMatrixDisplay(confusion_matrix(y_te, y_pred_sol, labels=["left", "right"]),
                       display_labels=["left", "right"]
                       ).plot(ax=ax, colorbar=False, cmap="Blues")
ax.set_title("Confusion matrix - LogReg on manifesto"); fig.tight_layout(); plt.show()
# ── Task 2: zero-shot classification on a sample ─────────────────────────────
sample = manifesto.sample(15, random_state=42).reset_index(drop=True)
candidate_labels = ["left-wing political position", "right-wing political position"]
print("\nTask 2 - Zero-shot (BART-MNLI) on 15 sample documents:")
print("-" * 95)
n_agree = 0
for _, row in sample.iterrows():
    zs = zs_classifier(row["text"][:300], candidate_labels=candidate_labels)
    zs_label = "left" if "left" in zs["labels"][0] else "right"
    score = zs["scores"][0]
    agree = "✓" if zs_label == row["orientation"] else "✗"
    if zs_label == row["orientation"]:
        n_agree += 1
    print(f"  true={row['orientation']:<6} zs={zs_label:<6} ({score:.2f}) {agree} {row['text'][:55]}...")
print(f"\nAgreement with synthetic labels: {n_agree}/{len(sample)} = {n_agree/len(sample):.1%}")
# ── Task 3: most predictive features from the supervised classifier ──────────
fnames = pipeline_sol.named_steps["tfidf"].get_feature_names_out()
coefs = pipeline_sol.named_steps["clf"].coef_[0]
coef_df = pd.DataFrame({"feature": fnames, "coef": coefs}).sort_values("coef")
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
ax1.barh(coef_df.head(5)["feature"], coef_df.head(5)["coef"],
color="steelblue", edgecolor="white")
ax1.set_title("Top 5 'left' features"); ax1.set_xlabel("Coefficient")
ax2.barh(coef_df.tail(5)["feature"], coef_df.tail(5)["coef"],
color="tomato", edgecolor="white")
ax2.set_title("Top 5 'right' features"); ax2.set_xlabel("Coefficient")
fig.suptitle("Task 3 - Most predictive n-grams (unigrams and bigrams)", y=1.02)
fig.tight_layout(); plt.show()
Wrap-up: what you now know how to do
Nine lectures ago, most of you had never written a Python script. Stop for a moment and look at where you are:
- You can pull data from a public API and build a clean panel (L3).
- You can estimate OLS and panel FE models with clustered SE and export publication-ready tables (L3).
- You can solve a dynamic optimisation problem numerically (L4).
- You can scrape a live website, respecting its ToS and your own ethics (L5-L6).
- You can turn a corpus of text into measures of sentiment, topics, and themes (L7).
- You can train and evaluate a supervised classifier on text, and use pre-trained transformers (BART-MNLI, FinBERT) out of the box (L8-L9).
- You understand how LLM APIs fit into the toolkit, when they're worth the cost, and when they're not (L9).
This is not a skillset typical of a master's student in economics. It is the skillset of an empirical researcher in 2026. Use it carefully.
Summary
| Method | Use case | Library | Cost |
|---|---|---|---|
| Logistic regression on TF-IDF | Binary/multi-class classification | sklearn | Free |
| Naive Bayes | Fast baseline, sparse data | sklearn | Free |
| Cross-validation | Robust performance estimation | sklearn | Free |
| Word embeddings | Semantic similarity | spacy (en_core_web_md) | Free |
| BART-MNLI zero-shot | No labelled data, custom labels | transformers | Free, local |
| FinBERT | Financial sentiment, fine-tuned | transformers (ProsusAI/finbert) | Free, local |
| LLM API (Claude / OpenAI) | Structured extraction, complex tasks | anthropic / openai | Paid (~$1-30 for typical corpora) |