Lecture 5 — Web Scraping I: requests & BeautifulSoup¶
Python for Economists · University of Bologna · 2025/2026¶
What we cover today¶
- Warm-up: scraping live, before any theory
- How the web works: HTTP, HTML, and the DOM
- Making requests with requests
- Parsing HTML with BeautifulSoup
- Building a scraper: loops, pagination, delays
- Ethics and legality of web scraping (revisited in depth)
- Case study: ECB press release archive
- Diagnosing scraping problems (and when to switch to L6)
- Exercise: build your first economic corpus
Why scraping? Most economically relevant text — parliamentary speeches, central bank communications, court rulings, news articles — is not available as a clean CSV. Scraping is how you build the dataset that does not yet exist.
# Install if needed
# pip install requests beautifulsoup4 lxml
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
import re
from datetime import datetime
print("Libraries loaded.")
0. Warm-up — scraping live, before any theory¶
I'm going to scrape a real website now, with no preparation, in front of you. The site is the Federal Reserve speeches archive for 2024. By the end of this cell we'll have the title and URL of the Fed Board speeches delivered in 2024 as a DataFrame, with the first ten rows displayed. Date and speaker are added in Section 4.
We'll then spend the rest of the lecture understanding what each line does.
# Live demo — minimal scraper, ~10 lines.
URL = "https://www.federalreserve.gov/newsevents/speech/2024-speeches.htm"
HEADERS = {"User-Agent": "Mozilla/5.0 (teaching; contact: marco.rosso4@unibo.it)"}
resp = requests.get(URL, headers=HEADERS, timeout=10)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "lxml")
# Each Fed speech URL follows a recognisable pattern:
# /newsevents/speech/{lastname}{YYYYMMDD}{letter}.htm e.g. powell20240930a.htm
SPEECH_URL_RE = re.compile(r"/newsevents/speech/[a-z]+\d{8}[a-z]\.htm$")
items = []
for a in soup.find_all("a", href=SPEECH_URL_RE):
    title = a.get_text(strip=True)
    if not title:
        continue
    href = a.get("href", "")
    items.append({
        "title": title,
        "url": "https://www.federalreserve.gov" + href if href.startswith("/") else href,
    })
df_warmup = pd.DataFrame(items).head(10)
df_warmup
That was it. Three libraries (requests, BeautifulSoup, pandas), ten lines, and we have a structured dataset from a live website.
Before we go any deeper — a non-negotiable preliminary.
The ethics moment (5 minutes, non-negotiable)¶
Scraping is powerful and, badly used, harmful. Three rules to apply every time:
- Read robots.txt. Most sites publish one at /robots.txt; it tells you what you can and cannot scrape. Check it.
- Identify yourself. Set a User-Agent with a contact email. If the site admin has a problem, they can write to you instead of blocking you.
- Rate-limit. A human reads one page every ~10 seconds. A scraper can read 100 per second. Don't. Use time.sleep(1) between requests at minimum.
Legal note. Scraping public information for research is legal in the EU under specific conditions: the Text and Data Mining (TDM) exception for scientific research, Directive (EU) 2019/790, art. 3, which covers research organisations mining content they have lawful access to. Personal data (governed separately by the GDPR) and content behind a paywall or login do not fall under that shortcut. For your research proposals, stick to public institutional sources (ECB, parliaments, government press releases) and you'll be on safe ground.
We'll come back to this in Section 5 with the practical handbook.
# Always check robots.txt before scraping
robots = requests.get("https://www.federalreserve.gov/robots.txt", headers=HEADERS, timeout=5)
print(robots.text[:500])
📊 Reading the robots.txt. What you see is plain-text instructions for web crawlers. Lines starting with User-agent: * apply to all bots; Disallow: paths are off-limits. Our /newsevents/speech/ path is not listed → free to scrape. Good citizenship: always check this before starting a scraping project.
1. How the web works¶
1.1 The HTTP request-response cycle¶
When you type a URL in your browser:
- Your browser sends an HTTP GET request to the server at that URL
- The server returns an HTTP response with a status code and a body
- The body is usually HTML — the browser renders it visually
Web scraping replicates steps 1–2 in Python, then reads the HTML programmatically instead of rendering it.
1.2 Status codes¶
| Code | Meaning |
|---|---|
| 200 | OK — request succeeded |
| 301/302 | Redirect |
| 403 | Forbidden — server refuses the request |
| 404 | Not found |
| 429 | Too many requests — you are being rate-limited |
| 500 | Server error |
1.3 HTML structure¶
HTML is a tree of elements (tags). Every piece of content sits inside a tag:
<html>
<head>
<title>Page title</title>
</head>
<body>
<h1 class="headline">Main heading</h1>
<div id="content">
<p>First paragraph.</p>
<p class="abstract">Second paragraph.</p>
<a href="https://example.com">A link</a>
</div>
</body>
</html>
How to find the right tags: open any webpage in Chrome/Firefox, right-click on the element you want, and select Inspect. The browser's developer tools show the HTML behind any element.
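To make the "tree of elements" idea concrete, here is a minimal sketch that walks the example page above and prints each tag indented by its depth. It uses only the standard library's html.parser; the class name TreeOutline is made up for this illustration, and the depth counter assumes well-formed markup in which every tag is closed (no void elements such as <br>).

```python
from html.parser import HTMLParser


class TreeOutline(HTMLParser):
    """Record every opening tag together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []  # list of (depth, tag, attributes)

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag, dict(attrs)))
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1


page = """
<html>
  <head><title>Page title</title></head>
  <body>
    <h1 class="headline">Main heading</h1>
    <div id="content">
      <p>First paragraph.</p>
      <p class="abstract">Second paragraph.</p>
      <a href="https://example.com">A link</a>
    </div>
  </body>
</html>
"""

parser = TreeOutline()
parser.feed(page)

# One line per tag, indented by depth: children sit one level below their parent
for depth, tag, attrs in parser.outline:
    print("  " * depth + tag, attrs if attrs else "")
```

The indentation you get is the same nesting that the browser's Inspect panel shows, and it is the structure BeautifulSoup lets us query in Section 3.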
2. Making requests with requests¶
# A minimal GET request
url = "https://www.federalreserve.gov/newsevents/speech/2024-speeches.htm"
response = requests.get(url)
print(f"Status code : {response.status_code}")
print(f"Content-Type: {response.headers.get('Content-Type', 'n/a')}")
print(f"Body length : {len(response.text):,} characters")
print()
# Show the first 500 characters of raw HTML
print(response.text[:500])
📊 Reading the response. Status 200 = success. The server returns ~180 KB of HTML: the entire 2024 speeches index page as the browser would receive it. Note the <!doctype html> opening: this is the raw markup as sent by the server, identical to what you'd see in Chrome's "View Page Source".
# Always check the status code before proceeding
def safe_get(url, pause=1.0):
    """
    Perform a GET request with a polite pause.
    Returns the response if successful, None otherwise.
    """
    time.sleep(pause)  # be polite: do not hammer the server
    try:
        r = requests.get(url, timeout=10)
        r.raise_for_status()  # raises HTTPError for 4xx/5xx
        return r
    except requests.HTTPError as e:
        print(f"HTTP error {e.response.status_code} for {url}")
        return None
    except requests.RequestException as e:
        print(f"Request failed for {url}: {e}")
        return None
r = safe_get(url)
print(r.status_code if r else "Failed")
📊 Reading the output. 200 means OK. In production scraping you'll meet 404 (page gone), 403 (blocked; try a different User-Agent), 429 (rate-limited; slow down), and 500 (server error; retry later). Wrapping requests.get() in a function with status-code handling is the difference between a one-shot script and a robust pipeline.
📋 HTTP status codes — a scraper's field guide¶
Every response from a server includes a 3-digit status code. The first digit tells you the class of response. As a scraper, you'll meet a small subset over and over:
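As a quick illustration of "the first digit tells you the class", the sketch below maps a code to its class by integer division and looks up the official reason phrase in the standard library's http.HTTPStatus. The helper name status_class and its labels are made up for this example.

```python
from http import HTTPStatus


def status_class(code):
    """Return the response class implied by the first digit of a status code."""
    classes = {2: "success", 3: "redirect", 4: "client error", 5: "server error"}
    return classes.get(code // 100, "other")


for code in (200, 301, 403, 429, 503):
    # HTTPStatus(code).phrase is the standard reason phrase, e.g. "Not Found"
    print(code, HTTPStatus(code).phrase, "->", status_class(code))
```

In a scraper this kind of helper is mostly useful for logging: a tally of response classes at the end of a run tells you at a glance whether the problems were on your side (4xx) or the server's (5xx).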
✅ Success (2xx)¶
| Code | Name | Meaning |
|---|---|---|
| 200 | OK | The request succeeded; the body contains what you asked for. |
| 204 | No Content | Request succeeded but the response body is empty (common for DELETE or some APIs). |
🔀 Redirects (3xx)¶
A redirect is when the server says "the resource you asked for now lives at a different URL — go there instead." Instead of returning the page content, the server returns a 3xx status code together with a Location: header pointing to the new URL. Browsers (and requests) typically follow the redirect automatically and show you the final page, which is why redirects are usually invisible.
Why do servers redirect?
- Site restructuring: /old-page → /new-section/page after a redesign. The Fed and the ECB do this regularly to their archives.
- Adding https: http://... → https://... to enforce encrypted connections.
- Adding www: example.com → www.example.com (or vice versa) for canonical-URL consistency.
- Authentication walls: protected page → login page, then back to the original.
- Locale routing: /page → /en/page based on the visitor's Accept-Language header.
- Trailing slashes: /page → /page/ for URL normalisation.
| Code | Name | Meaning |
|---|---|---|
| 301 | Moved Permanently | The new URL is the canonical one. Search engines and crawlers should update their records. Use this to update your scraper code to the new URL. |
| 302 | Found (Temporary) | The redirect is temporary: keep using the original URL in future requests. |
| 307 | Temporary Redirect | Like 302, but the HTTP method is preserved (a POST stays a POST, which matters for APIs). |
| 308 | Permanent Redirect | Like 301, with method preserved. |
requests handles them for you — but you should still inspect:
r = requests.get("http://www.federalreserve.gov/newsevents/speech/2024-speeches.htm")
print(f"Final URL : {r.url}") # the URL you actually ended up on
print(f"Status : {r.status_code}") # 200 (the *final* response)
print(f"Hops taken : {len(r.history)}") # number of redirects followed
for hop in r.history:
    print(f"  {hop.status_code} {hop.url} → {hop.headers.get('Location')}")
For the URL above you'll typically see one hop (http → https). Some sites have chains of 3–4 redirects; if you see more than that, something is misconfigured (or you're in a redirect loop).
When to disable automatic redirects. Pass allow_redirects=False if you want to inspect the redirect itself rather than follow it — useful for debugging "why does this URL behave strangely?" or for scrapers that need to capture intermediate URLs.
🎯 Practical implications for your scraper. If you see a 301 permanently redirecting away from a URL you've hard-coded, update your code to use the new URL: extra hops mean extra latency on every single request. If you see 302 Temporary, keep the original. And if r.url differs from the URL you requested, your "scrape this URL" log entries should reflect what actually got scraped, not what was asked for.
❌ Client errors (4xx) — the request is the problem¶
| Code | Name | Most common cause when scraping | Fix |
|---|---|---|---|
| 400 | Bad Request | Malformed URL, invalid query parameters | Check the URL and params= dict |
| 401 | Unauthorized | Endpoint requires authentication | Out of scope for public scraping |
| 403 | Forbidden | Server detected and blocked your python-requests/x.y User-Agent, or geo-block, or a more aggressive bot detector (Cloudflare) | Set a browser-style User-Agent; if that fails, the site is actively defending: consider Selenium (L6) or back off |
| 404 | Not Found | The page no longer exists or the URL pattern changed | Verify the URL in a browser; the site may have restructured |
| 405 | Method Not Allowed | You used GET on an endpoint expecting POST | Check the API docs |
| 429 | Too Many Requests | Rate limiting: you're sending requests faster than the server allows. Often combined with a Retry-After header telling you how long to wait. | Slow down: increase time.sleep(); consider exponential backoff; in extreme cases use requests.Session() with retries (L6) |
🔥 Server errors (5xx) — the server is the problem¶
| Code | Name | What you do |
|---|---|---|
| 500 | Internal Server Error | The server crashed on your request. Retry once or twice with a delay; if persistent, the URL itself may be problematic. |
| 502 / 503 / 504 | Bad Gateway / Service Unavailable / Gateway Timeout | Transient infrastructure issues. Retry with exponential backoff (1s → 2s → 4s → 8s). |
🛠 The defensive scraping pattern¶
Whenever you write a scraper, your inner loop should look like this:
import time
import requests
def fetch_with_retry(url, max_retries=3, base_delay=1.0):
    """GET a URL, retrying on 429, 5xx and network errors. Returns None on failure."""
    for attempt in range(max_retries):
        backoff = base_delay * (2 ** attempt)  # 1s, 2s, 4s, ...
        try:
            r = requests.get(url, headers=HEADERS, timeout=10)
        except requests.RequestException as e:
            print(f"  Network error: {e}; attempt {attempt+1}/{max_retries}")
            time.sleep(backoff)
            continue
        if r.status_code == 429:
            # Rate limited: honour Retry-After when it is a number of seconds
            # (it can also be an HTTP date); otherwise use exponential backoff
            retry_after = r.headers.get("Retry-After", "")
            wait = int(retry_after) if retry_after.isdigit() else backoff
            print(f"  Rate limited; waiting {wait}s before retry {attempt+1}")
            time.sleep(wait)
            continue
        if r.status_code >= 500:
            # Server error: exponential backoff
            print(f"  Server error {r.status_code}; waiting {backoff}s")
            time.sleep(backoff)
            continue
        if r.status_code >= 400:
            # Any other 4xx: retrying will not help, the request itself must change
            print(f"  HTTP error {r.status_code} for {url}")
            return None
        return r
    return None
🎯 Key takeaway. Codes 4xx mean you must change something (URL, headers, rate). Codes 5xx mean the server failed: usually transient, retry with backoff. The single most common one in academic scraping is 429: it's the server politely telling you to slow down. Always honour it.
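Honouring a 429 means reading its Retry-After header, and that header comes in two shapes: a number of seconds, or an HTTP date. The sketch below handles both using only the standard library. The function name retry_after_seconds and the 1-second default are choices made for this illustration, not part of requests.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(value, now=None, default=1.0):
    """Convert a Retry-After header value into a non-negative wait in seconds."""
    if value is None:
        return default
    value = value.strip()
    if value.isdigit():
        # Form 1: delay in seconds, e.g. "120"
        return float(value)
    try:
        # Form 2: HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return default  # unparseable header: fall back to the default wait
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


print(retry_after_seconds("120"))
print(retry_after_seconds(None))
```

In a retry loop you would pass r.headers.get("Retry-After") to this helper and sleep for the result, capping it if the server asks for an unreasonably long wait.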
# Headers — simulate a browser to avoid 403 errors on some sites
HEADERS = {
"User-Agent": (
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/124.0.0.0 Safari/537.36"
),
"Accept-Language": "en-US,en;q=0.9",
}
r = requests.get(url, headers=HEADERS, timeout=10)
print(r.status_code)
📊 Reading the output. Same 200 as before: the Fed accepts requests both with and without a custom User-Agent. On sites with stricter filters (Cloudflare, Akamai), the default python-requests/2.31.0 UA is often blocked outright, and a browser-style UA gets past many of those filters. Note that the UA above carries no contact email, so wherever a site accepts it, prefer the identifying UA from the warm-up: that remains the polite default.
2.1 Query parameters and the response object¶
In the warm-up we passed the full URL with all parameters baked in. Often you want to construct URLs programmatically — requests lets you pass parameters as a dict via params=. This avoids manual string concatenation and handles URL-encoding for you.
# Query parameters via params= (cleaner than f-strings inside URLs)
demo_params = {"q": "monetary policy", "fromDate": "2024-01-01", "toDate": "2024-12-31"}
prepared = requests.Request("GET", "https://example.com/search", params=demo_params).prepare()
print("Prepared URL:", prepared.url)
# Inspect the response object more thoroughly
r = requests.get(url, headers=HEADERS, timeout=10)
print(f"Final URL (after redirects): {r.url}")
print(f"Status : {r.status_code} {r.reason}")
print(f"Encoding : {r.encoding}")
print(f"Cookies set by server : {dict(r.cookies)}")
print(f"Redirect history : {[h.status_code for h in r.history]}")
print(f"Response time (s) : {r.elapsed.total_seconds():.3f}")
📊 Reading the response object. requests returns a rich object: final URL after redirects, encoding, cookies set by the server, redirect history, and elapsed time. The Encoding: ISO-8859-1 value is interesting: r.encoding is derived from the HTTP headers, and requests falls back to ISO-8859-1 when a text response declares no charset. Either way, the actual content is UTF-8. We'll fix that mismatch in Section 7 (encoding diagnostics).
3. Parsing HTML with BeautifulSoup¶
# Parse the HTML response
soup = BeautifulSoup(r.text, "lxml")
# The soup object is a tree you can navigate
print(type(soup))
print(soup.title.text) # page title
📊 Reading the output. BeautifulSoup exposes the parsed DOM as a Python object. soup.title.text extracts the <title> tag content, which is useful as a sanity check that we got the page we expected (Federal Reserve Board - 2024 Speeches, not an error page).
3.1 Core methods¶
| Method | Returns | Use when |
|---|---|---|
| soup.find(tag, attrs) | First matching element | You need exactly one element |
| soup.find_all(tag, attrs) | List of all matching elements | You need all elements of a type |
| element.text | Plain text content | Extracting readable text |
| element.get("attr") | Attribute value | Extracting URLs, IDs, classes |
| element.select("css_selector") | List by CSS selector | Precise targeting of nested elements |
# Working with a simple HTML string first — so results are predictable
html_example = """
<html>
<body>
<div class="press-list">
<dl>
<dt class="date">7 November 2024</dt>
<dd>
<a href="/press/pr/date/2024/html/ecb.pr241107~abc.en.html">
Monetary policy decisions
</a>
</dd>
<dt class="date">17 October 2024</dt>
<dd>
<a href="/press/pr/date/2024/html/ecb.pr241017~xyz.en.html">
Monetary policy decisions
</a>
</dd>
<dt class="date">12 September 2024</dt>
<dd>
<a href="/press/pr/date/2024/html/ecb.pr240912~def.en.html">
Monetary policy decisions
</a>
</dd>
</dl>
</div>
</body>
</html>
"""
example_soup = BeautifulSoup(html_example, "lxml")
# find_all — get all date elements
dates = example_soup.find_all("dt", class_="date")
for d in dates:
    print(d.text.strip())
📊 Reading the output. Three dates extracted from a synthetic HTML snippet. Working with controlled fixtures first is good practice: when you debug a parser, you want predictable inputs. Once it works on the fixture, you swap in the real page.
# Extract titles and URLs from the links
links = example_soup.find_all("a")
for link in links:
    title = link.text.strip()
    href = link.get("href")
    print(f"{title!r:35} → {href}")
📊 Reading the output. Every <a> tag yields title (the visible text) and href (the URL). Note the URLs are relative (/press/...): to make them clickable later we'll need to prepend the domain (https://www.ecb.europa.eu). This is the standard pattern: scrape, then resolve.
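Prepending the domain by hand works when every href starts with "/". A slightly more general option, shown here as a sketch rather than as the method used in the rest of this lecture, is urllib.parse.urljoin from the standard library, which resolves any href against the URL of the page it came from. The PAGE_URL below is a hypothetical index page chosen for the illustration; the first href is taken from the fixture above.

```python
from urllib.parse import urljoin

# Hypothetical URL of the page the links were scraped from
PAGE_URL = "https://www.ecb.europa.eu/press/pr/html/index.en.html"

hrefs = [
    "/press/pr/date/2024/html/ecb.pr241107~abc.en.html",  # root-relative
    "ecb.pr241017~xyz.en.html",                            # relative to the page
    "https://example.com/elsewhere.html",                  # already absolute
]

resolved = [urljoin(PAGE_URL, h) for h in hrefs]
for href, full in zip(hrefs, resolved):
    print(f"{href}\n  -> {full}")
```

The design point is that urljoin leaves absolute URLs untouched and handles both kinds of relative link, so the same line of code keeps working if a site mixes link styles.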
# CSS selectors — more precise targeting
# "div.press-list a" means: <a> elements inside a <div class="press-list">
for link in example_soup.select("div.press-list a"):
    print(link.text.strip(), "|", link.get("href"))
📊 Reading the output. Same three records, but obtained with a single .select(...) call using CSS syntax. CSS selectors are more powerful than find_all for nested queries: div.press-list a says "find an <a> inside any <div> with class press-list". Once you know CSS, you can target almost anything.
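A few more selector shapes are worth seeing side by side. The sketch below runs on a small made-up fixture (not the real ECB page) and uses the built-in "html.parser" backend so that it runs without lxml; the selectors themselves behave the same with either parser.

```python
from bs4 import BeautifulSoup

# Made-up fixture: two press items plus a footer link we do NOT want
snippet = """
<div class="press-list">
  <dl>
    <dt class="date">7 November 2024</dt>
    <dd><a href="/press/pr/2024/a.en.html">Monetary policy decisions</a></dd>
    <dt class="date">17 October 2024</dt>
    <dd><a href="/press/pr/2024/b.en.pdf">Monetary policy statement</a></dd>
  </dl>
</div>
<div class="footer"><a href="/contact.html">Contact</a></div>
"""
css_soup = BeautifulSoup(snippet, "html.parser")

# tag.class: every <dt> with class "date"
dates = [dt.get_text(strip=True) for dt in css_soup.select("dt.date")]

# attribute selector: links inside the press list whose href ends in ".html"
html_links = [a["href"] for a in css_soup.select('div.press-list a[href$=".html"]')]

# child combinator: <a> elements that are direct children of a <dd>
dd_titles = [a.get_text(strip=True) for a in css_soup.select("dd > a")]

# select_one returns the first match (or None) instead of a list
first_link = css_soup.select_one("div.press-list a")

print(dates, html_links, dd_titles, first_link["href"], sep="\n")
```

Note how scoping the selector to div.press-list keeps the footer link out of the results, which is the usual reason to prefer a nested selector over a bare find_all("a").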
# Pairing dates with links — navigate sibling elements
records = []
dts = example_soup.find_all("dt", class_="date")
for dt in dts:
    date_str = dt.text.strip()
    dd = dt.find_next_sibling("dd")
    link = dd.find("a")
    records.append({
        "date": date_str,
        "title": link.text.strip(),
        "url": "https://www.ecb.europa.eu" + link.get("href"),
    })
pd.DataFrame(records)
📊 Reading the output. A clean DataFrame: each row pairs a date with its corresponding press release. The trick was to walk between siblings (<dt> and <dd>) rather than treating them in isolation. This is one of the most common scraping tasks: pairing related elements that aren't nested inside each other.
3.2 Tree navigation: parents, siblings, descendants¶
find and select get you to an element (or a list of elements). From there, you can move around the tree:
# Build a tiny tree we can navigate
nav_html = '''
<article>
<header>
<h2>Press release</h2>
<time>7 November 2024</time>
</header>
<section class="body">
<p class="lede">First paragraph.</p>
<p>Second paragraph.</p>
<p>Third paragraph.</p>
</section>
</article>
'''
nav_soup = BeautifulSoup(nav_html, "lxml")
lede = nav_soup.find("p", class_="lede")
print("Element :", lede.name, "→", lede.text.strip())
print("Parent :", lede.parent.name)
print("All ancestors :", [a.name for a in lede.parents if a.name])
print("Next sibling :", lede.find_next_sibling().text.strip())
print("All siblings :", [s.text.strip() for s in lede.find_next_siblings()])
section = nav_soup.find("section")
print("Direct children:", [c.name for c in section.children if c.name])
print("All descendants:", [d.name for d in section.descendants if d.name])
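📊 Reading the output. This shows the classic ways to navigate the DOM tree: .parent, .parents, find_next_sibling() / find_next_siblings(), and .children / .descendants. We call find_next_sibling() rather than the bare .next_sibling attribute because the latter would return the whitespace text node that sits between two tags. The distinction between children (direct only: ['p', 'p', 'p']) and descendants (all nested: same in this case because the structure is flat) becomes important on real pages with deep nesting.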
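To see children and descendants actually diverge, here is a small sketch on a made-up snippet with one extra level of nesting. It uses the built-in "html.parser" backend so it runs without lxml; the variable names are chosen for this example only.

```python
from bs4 import BeautifulSoup

# Made-up snippet: <em> is nested inside <p>, <li> inside <ul>
nested_html = (
    "<section>"
    "<p>One <em>nested</em> word.</p>"
    "<ul><li>Item</li></ul>"
    "</section>"
)
nested_soup = BeautifulSoup(nested_html, "html.parser")
sec = nested_soup.find("section")

# Text nodes have name None, so "if x.name" keeps tags only
direct_children = [c.name for c in sec.children if c.name]
all_descendants = [d.name for d in sec.descendants if d.name]

print("children   :", direct_children)
print("descendants:", all_descendants)
```

children stops at the first level below section, while descendants keeps going into every nested tag in document order. On real pages, iterating over descendants when you meant children is a common source of duplicated text.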
Interlude — notebook vs. script: when to switch¶
We've been living in Jupyter since L1. That's great for exploration — but a serious scraper is not exploration. It's a pipeline that must run unattended, maybe overnight, maybe weekly, with error handling and logging.
Notebooks are bad at this. Scripts are good at this.
Let me show you what the same scraping code looks like as a proper Python script, opened in Spyder (which you already have — it comes with Anaconda) and in VS Code (the industry standard).
[Live demo: instructor opens ecb_scraper.py in Spyder, shows: variable explorer, debugger, run-selection, integrated terminal. Then opens the same file in VS Code, shows: Python extension, interpreter selector for the econ environment, integrated debugger with breakpoints.]
Bottom line for this course:
- Lectures and exercises → Jupyter (interactive, immediate feedback).
- Your scraping pipeline for the exam → Jupyter is fine for prototyping, but run the final version as a script if your corpus takes more than ~10 minutes to build.
- Your projects → VS Code will be your friend. Start now.
4. Building a scraper: loops, pagination, delays¶
Our case study: build a corpus of Federal Reserve Board speeches across multiple years. The Fed publishes a year-by-year archive at:
https://www.federalreserve.gov/newsevents/speech/{YEAR}-speeches.htm
Each entry has a date, a hyperlinked title, the speaker's name and role, and the event description. The individual speech pages contain the full text in HTML — that's our text corpus.
Plan:
- Scrape one year's index page → list of speeches with metadata
- Scale to multiple years
- Follow each link to fetch the full speech text
- Save as CSV
4.1 Inspect first — never scrape blind¶
Before writing the loop, look at one page. Confirm the URL pattern, count the items, peek at the structure. This is the single most valuable habit in scraping.
# Step 1 — fetch one year and inspect what we got
ROOT = "https://www.federalreserve.gov"
YEAR_URL = "https://www.federalreserve.gov/newsevents/speech/{year}-speeches.htm"
SPEECH_URL_RE = re.compile(r"/newsevents/speech/[a-z]+\d{8}[a-z]\.htm$")
DATE_RE = re.compile(r"(\d{1,2})/(\d{1,2})/(\d{4})") # MM/DD/YYYY
resp = requests.get(YEAR_URL.format(year=2024), headers=HEADERS, timeout=10)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "lxml")
speech_links = soup.find_all("a", href=SPEECH_URL_RE)
print(f"Speech links found in 2024 archive: {len(speech_links)}\n")
# Inspect the first one and its surroundings
a = speech_links[0]
print("First link:")
print(f" Title: {a.get_text(strip=True)}")
print(f" URL : {a.get('href')}")
# The date is in the page text RIGHT BEFORE each title link.
# Walk backwards to find the nearest MM/DD/YYYY string.
def find_preceding_date(link, max_chars=300):
    """Return the most recent MM/DD/YYYY found *before* this <a> in the document."""
    text_before = ""
    for prev in link.previous_elements:
        if hasattr(prev, "get_text"):
            text_before = prev.get_text(" ", strip=True) + " " + text_before
        elif isinstance(prev, str):
            text_before = prev + " " + text_before
        if len(text_before) > max_chars:
            break
    matches = DATE_RE.findall(text_before)
    return matches[-1] if matches else None  # the LAST match is the closest to the link
print(f" Date (closest preceding MM/DD/YYYY): {find_preceding_date(a)}")
📊 Reading the output. 107 speech links found on the 2024 page. The first one is Adriana Kugler's December 3 speech, with date 12/3/2024 correctly extracted by walking backwards from the <a> tag. The "preceding date" trick is robust because Fed pages always put the date before the title link in document order, so we don't need a CSS class that might change.
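Before trusting a regex on a live page, it helps to try it on a handful of hand-written strings. The sketch below re-declares the two patterns from step 1 and checks them offline. The first candidate href follows the example given in the warm-up; the other candidates and the sample text are made up to probe the edges of each pattern.

```python
import re

SPEECH_URL_RE = re.compile(r"/newsevents/speech/[a-z]+\d{8}[a-z]\.htm$")
DATE_RE = re.compile(r"(\d{1,2})/(\d{1,2})/(\d{4})")  # MM/DD/YYYY

candidates = [
    "/newsevents/speech/powell20240930a.htm",           # a speech page
    "/newsevents/speech/2024-speeches.htm",             # the year index itself
    "/newsevents/speech/powell20240930a.htm#fn1",       # same page, footnote anchor
    "/newsevents/pressreleases/monetary20240731a.htm",  # different section
]
for href in candidates:
    print(f"{href:50} match={bool(SPEECH_URL_RE.search(href))}")

# findall returns one (month, day, year) tuple per date, in document order,
# so the last tuple is the date closest to the end of the text
sample_text = "12/3/2024 Some title Governor X 11/20/2024"
found = DATE_RE.findall(sample_text)
print(found)
print("/".join(found[-1]))
```

Only the first candidate matches: the trailing $ rejects the footnote anchor, and the index page fails because the pattern requires letters before the eight digits. That last point is worth remembering when a "speech" count looks too high or too low.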
4.2 A reusable parser¶
Now wrap the parsing logic into a function. For each speech link we extract: date, title, URL, speaker, event description.
def parse_fed_year(soup):
    """Extract speech metadata from one Fed year-archive page."""
    records = []
    links = soup.find_all("a", href=SPEECH_URL_RE)
    for a in links:
        title = a.get_text(strip=True)
        if not title:
            continue
        href = a.get("href", "")
        # Date: walk backwards from the link
        date_match = find_preceding_date(a)
        date_str = "/".join(date_match) if date_match else ""
        # Speaker + event: walk forwards collecting text, stop at next speech link
        text_after = []
        for nxt in a.next_elements:
            if hasattr(nxt, "name") and nxt.name == "a" and nxt is not a:
                if SPEECH_URL_RE.search(nxt.get("href", "")):
                    break
            if isinstance(nxt, str):
                text_after.append(nxt)
            if sum(len(x) for x in text_after) > 500:
                break
        block = " ".join(text_after)
        # Clean up: collapse whitespace
        block = re.sub(r"\s+", " ", block).strip()
        # Strip the title if it leaked into the block (happens when <a> is nested)
        if title in block:
            block = block.split(title, 1)[-1].strip()
        # Strip the next item's date from the tail
        block = re.sub(r"\d{1,2}/\d{1,2}/\d{4}.*$", "", block).strip()
        # Strip trailing "Watch Live" / "Video" boilerplate
        block = re.sub(r"\b(Watch Live|Video)\b\s*", "", block).strip()
        # Split speaker / event on " At "
        if " At " in block:
            speaker, _, event = block.partition(" At ")
            event = "At " + event
        else:
            speaker, event = block, ""
        records.append({
            "date": date_str,
            "title": title,
            "url": ROOT + href if href.startswith("/") else href,
            "speaker": speaker.strip()[:200],
            "event": event.strip()[:300],
        })
    return records
# Test on the page we already loaded
records = parse_fed_year(soup)
print(f"Parsed {len(records)} records from 2024\n")
pd.DataFrame(records).head()
📊 Reading the output. All 107 records parsed into a DataFrame with five columns (date, title, url, speaker, event). The cleanup we added (strip the title, strip the next item's date, strip "Watch Live"/"Video") gives clean speaker names and event descriptions. Compare row 0: speaker = "Governor Adriana D. Kugler", event = "At the Detroit Economic Club, Detroit, Michigan". Production-quality.
4.3 Scale to multiple years¶
The function works on any year — we just change the URL. Loop with a polite time.sleep between requests.
def scrape_year(year, pause=1.0):
    """Scrape one full year of Fed speeches."""
    time.sleep(pause)
    url = YEAR_URL.format(year=year)
    try:
        r = requests.get(url, headers=HEADERS, timeout=15)
        r.raise_for_status()
    except requests.RequestException as e:
        print(f"  {year}: error → {e}")
        return []
    s = BeautifulSoup(r.text, "lxml")
    recs = parse_fed_year(s)
    for rec in recs:
        rec["year"] = year
    return recs

all_records = []
for yr in range(2020, 2025):
    recs = scrape_year(yr)
    all_records.extend(recs)
    print(f"  {yr}: {len(recs)} speeches")
index_df = pd.DataFrame(all_records)
print(f"\nTotal: {len(index_df)} speeches across {len(set(index_df['year']))} years")
index_df.head()
📊 Reading the output. Five years aggregated: 53 + 69 + 49 + 97 + 107 = 375 records in 2020–2024. These are raw link counts (Section 4.4 removes some noise), but the yearly pattern is already suggestive: 2024 stands out (107), likely reflecting the Fed's increased communication around the rate-cut cycle. With time.sleep(1) between requests, this took ~5 seconds: 5 × 1s pause + page download.
4.4 Parse dates and clean up¶
index_df["date_parsed"] = pd.to_datetime(
index_df["date"], format="%m/%d/%Y", errors="coerce"
)
# Drop rows where date couldn't be parsed (a few stray sidebar/breadcrumb links
# may have matched our URL regex without a preceding date — those are noise).
n_before = len(index_df)
index_df = index_df.dropna(subset=["date_parsed"]).reset_index(drop=True)
n_after = len(index_df)
print(f"Dropped {n_before - n_after} rows with unparseable dates ({n_after} remaining).\n")
index_df = index_df.sort_values("date_parsed").reset_index(drop=True)
print(index_df.dtypes)
print()
print(index_df[["date_parsed","title","speaker"]].tail(5))
📊 Reading the output. Important diagnostic: 82 of 375 rows had unparseable dates. Those are typically links to "speeches archive", index pages, or sidebar promos that match our URL regex but lack a preceding date in the document flow. Dropping them leaves 293 real speeches, a healthier base for analysis. Always check dropna counts in scraping pipelines: silent data loss is the #1 source of bad results.
4.5 First descriptive: speeches per year¶
import matplotlib.pyplot as plt
counts = index_df.groupby("year")["title"].count()
fig, ax = plt.subplots(figsize=(8, 4))
counts.plot(kind="bar", ax=ax, color="steelblue", edgecolor="white")
ax.set_title("Federal Reserve Board speeches per year (2020–2024)")
ax.set_xlabel("Year"); ax.set_ylabel("Count")
fig.tight_layout()
plt.show()
4.6 Scraping the full text of individual speeches¶
The index gives us titles and URLs. To build a text corpus, we follow each URL and extract the body of the speech. Fed speech pages put the actual text inside the main content area — we'll grab the <div id="article"> (or fall back to a generic <article> / <main> if that ID changes in future).
def scrape_speech_text(url, pause=1.0):
    """Fetch a single Fed speech page and return its full text."""
    time.sleep(pause)
    try:
        r = requests.get(url, headers=HEADERS, timeout=15)
        r.raise_for_status()
    except requests.RequestException:
        return ""
    s = BeautifulSoup(r.text, "lxml")
    # Try the most specific container first, fall back gracefully
    article = (s.find("div", id="article") or
               s.find("article") or
               s.find("main") or
               s.find("body"))
    paragraphs = article.find_all("p") if article else []
    return " ".join(p.get_text(" ", strip=True) for p in paragraphs
                    if len(p.get_text(strip=True)) > 20)
# Test on the most recent speech in our index
test_url = index_df.iloc[-1]["url"]
test_body = scrape_speech_text(test_url)
print(f"URL : {test_url}")
print(f"Length : {len(test_body):,} characters")
print(f"Preview: {test_body[:400]}...")
📊 Reading the output. Following one URL into the speech page yields 17,206 characters of clean text (the Kugler speech). The fallback chain <div id="article"> → <article> → <main> → <body> makes the function survive minor site updates: even if the Fed changes the container, we'll still find paragraphs.
4.7 Build the text corpus (sample)¶
Iterating over hundreds of speeches takes time (rate-limited at 1 second per request, ~7 minutes for 5 years). For the lecture demo we sample 10 — for the exercise you scale up.
sample = index_df.tail(10).reset_index(drop=True).copy()
corpus = []
for i, row in sample.iterrows():
    print(f"  [{i+1:2d}/{len(sample)}] {row['date_parsed'].date()} - {row['speaker'][:40]}")
    body = scrape_speech_text(row["url"])
    rec = row.to_dict()
    rec["body"] = body
    rec["n_chars"] = len(body)
    corpus.append(rec)
corpus_df = pd.DataFrame(corpus)
print(f"\nCorpus: {len(corpus_df)} documents, avg {corpus_df['n_chars'].mean():.0f} chars each")
corpus_df[["date_parsed","title","speaker","n_chars"]]
📊 Reading the output. Building a 10-document corpus took ~10 seconds (with rate limiting). The progress prints (one line per document, with its counter, date and speaker) are essential when scaling: they tell you the loop is alive and where it is. For the 653-document full corpus in the exercise, you'll want even more frequent feedback (every 50 docs).
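The timings quoted in this section follow from simple arithmetic: number of documents × (pause + download time). Here is a back-of-the-envelope sketch; the 0.2-second average download time is an assumption made for illustration, not a measurement, and the function name is made up.

```python
def estimate_minutes(n_docs, pause=1.0, avg_download=0.2):
    """Rough run time in minutes: each document costs one pause plus one download."""
    return n_docs * (pause + avg_download) / 60


# 10 = lecture sample, 293 = cleaned index, 653 = exercise corpus (sizes from the text)
for n in (10, 293, 653):
    print(f"{n:4d} documents: about {estimate_minutes(n):.1f} minutes")
```

Running this kind of estimate before launching a long scrape tells you whether to stay in the notebook or move to a script, as discussed in the interlude.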
4.8 Save the corpus¶
corpus_df.to_csv("fed_speeches_corpus.csv", index=False)
print(f"Saved: fed_speeches_corpus.csv ({len(corpus_df)} rows)")
📊 Reading the output. Saved to disk: this is the actual input to L7's NLP analysis. From now on, you don't need to re-scrape; just pd.read_csv("fed_speeches_corpus.csv"). Always save scraped data to disk immediately: scraping is slow and brittle, but reading CSV is fast and reliable.
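"Save immediately" can be taken one step further: write each document to disk as soon as it is fetched, and skip URLs that are already saved when the script restarts. The sketch below shows the pattern with the standard library's csv module and a fake fetch function standing in for scrape_speech_text. All names here (build_corpus, fake_fetch, the example.org URLs, the checkpoint file) are hypothetical and exist only for this illustration.

```python
import csv
import os
import tempfile

FIELDS = ["url", "body"]


def load_done_urls(path):
    """Return the set of URLs already stored in the checkpoint file."""
    if not os.path.exists(path):
        return set()
    with open(path, newline="", encoding="utf-8") as f:
        return {row["url"] for row in csv.DictReader(f)}


def append_record(path, record):
    """Append one row, writing the header first if the file is new."""
    is_new = not os.path.exists(path)
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow(record)


def build_corpus(urls, fetch, path):
    """Fetch only the URLs not yet on disk; return how many were fetched."""
    done = load_done_urls(path)
    fetched = 0
    for url in urls:
        if url in done:
            continue  # already saved in a previous run
        append_record(path, {"url": url, "body": fetch(url)})
        fetched += 1
    return fetched


def fake_fetch(url):
    """Stand-in for a real scraper call, so the sketch runs offline."""
    return f"text of {url}"


demo_path = os.path.join(tempfile.mkdtemp(), "corpus_checkpoint.csv")
demo_urls = ["https://example.org/a.htm", "https://example.org/b.htm"]

first_run = build_corpus(demo_urls, fake_fetch, demo_path)
second_run = build_corpus(demo_urls + ["https://example.org/c.htm"], fake_fetch, demo_path)
print(first_run, second_run)
```

The second run fetches only the one new URL. If a long scrape crashes at document 400, restarting costs you nothing for the first 399.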
⏱ Five-minute challenge — scrape your first titles¶
Your task: scrape the 5 most recent press releases from the Federal Reserve 2024 archive.
URL: https://www.federalreserve.gov/newsevents/pressreleases/2024-press.htm
Build a DataFrame with columns title and url. Five minutes.
Hint. Press release URLs follow a pattern similar to (but not identical to) the speech URLs we used in Section 4:
/newsevents/pressreleases/{type}{YYYYMMDD}{letter}.htm
For example: monetary20240731a.htm, enforcement20241115b.htm, bcreg20241023a.htm. The {type} part identifies the kind of release (monetary policy, enforcement, bank regulation, etc.) — your regex needs to allow any letters there.
The list on the page is in reverse chronological order, so the 5 most recent are simply the first 5 matches.
Why this exercise matters: you've already seen this technique work on the Federal Reserve's speeches archive. Now you're applying it to a different content type on the same site. Real research scraping is mostly about pattern recognition and adaptation, not memorising selectors.
# YOUR CODE HERE — five minutes!
# Hint: build a regex similar to SPEECH_URL_RE but matching:
# /newsevents/pressreleases/{type}{YYYYMMDD}{letter}.htm
# where {type} is one or more lowercase letters and {letter} is exactly one letter.
URL = "https://www.federalreserve.gov/newsevents/pressreleases/2024-press.htm"
# resp = requests.get(URL, headers=HEADERS, timeout=10)
# soup = BeautifulSoup(resp.text, "lxml")
# PRESS_URL_RE = re.compile(r"...")
# items = []
# for a in soup.find_all("a", href=PRESS_URL_RE):
# ...
# pd.DataFrame(items[:5])
URL = "https://www.federalreserve.gov/newsevents/pressreleases/2024-press.htm"
resp = requests.get(URL, headers=HEADERS, timeout=10)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "lxml")
# Press release URLs: /newsevents/pressreleases/{type}{YYYYMMDD}{letter}.htm
# Same shape as speeches but with a {type} prefix (monetary, enforcement, bcreg, ...)
PRESS_URL_RE = re.compile(r"/newsevents/pressreleases/[a-z]+\d{8}[a-z]\.htm$")
items = []
for a in soup.find_all("a", href=PRESS_URL_RE):
    title = a.get_text(strip=True)
    if not title:
        continue
    href = a.get("href", "")
    items.append({
        "title": title,
        "url": "https://www.federalreserve.gov" + href if href.startswith("/") else href,
    })
# The Fed lists releases in reverse-chronological order on this page,
# so the 5 most recent are simply the first 5 items.
df_press = pd.DataFrame(items).head(5)
print(f"Total 2024 press releases found: {len(items)}")
df_press
📊 Reading the output. 120 press releases in 2024, more than the speeches (107), because "press releases" covers monetary policy decisions, enforcement actions, bank regulation, etc. Note how little the regex changes: only the directory differs (pressreleases instead of speech). The [a-z]+ that matched the speaker's last name now matches the release type (monetary, enforcement, bcreg, etc.), and \d{8}[a-z] still matches the date and the letter suffix. Same skill, different prefix: that's the pattern recognition we wanted to teach.
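Before pointing a pattern at a live page, it can help to test it offline on a handful of strings. The sketch below checks PRESS_URL_RE against the three file names quoted in the hint plus three negatives; the full paths are assembled by hand for illustration and are not scraped from the site.

```python
import re

PRESS_URL_RE = re.compile(r"/newsevents/pressreleases/[a-z]+\d{8}[a-z]\.htm$")

# First three: file names from the hint. Last three: negatives for contrast.
samples = [
    "/newsevents/pressreleases/monetary20240731a.htm",
    "/newsevents/pressreleases/enforcement20241115b.htm",
    "/newsevents/pressreleases/bcreg20241023a.htm",
    "/newsevents/pressreleases/2024-press.htm",          # index page: no type + date
    "/newsevents/speech/powell20240930a.htm",            # a speech: wrong directory
    "/newsevents/pressreleases/monetary20240731a.pdf",   # wrong extension
]

# BeautifulSoup applies a compiled regex with .search(), so we do the same here.
matches = [href for href in samples if PRESS_URL_RE.search(href)]
rejected = [href for href in samples if not PRESS_URL_RE.search(href)]

print("matched: ", matches)
print("rejected:", rejected)
```

If a link you expected is missing from the scrape, adding its href to the sample list is usually the quickest way to see which part of the pattern rejects it.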
5. Ethics and legality of web scraping — the practical handbook¶
We covered the three triage rules at the start of the lecture. Here is the deeper version you'll consult before each scraping project.
5.1 robots.txt — what to look for¶
Most websites publish a robots.txt file at domain.com/robots.txt. Read it like this:
- User-agent: * followed by Disallow: /admin/ → no crawler may access /admin/
- Crawl-delay: 5 → wait at least 5 seconds between requests
- Sitemap: https://... → a structured list of pages, often gold for scraping
If your target path is in a Disallow: rule, do not scrape it.
# Always check robots.txt before scraping
robots = requests.get("https://www.federalreserve.gov/robots.txt", headers=HEADERS, timeout=5)
print(robots.text[:500])
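Reading robots.txt by eye works for one site; for several, the standard library's urllib.robotparser can answer the question programmatically. The sketch below parses an invented robots.txt string (not the Fed's real file) so that it runs offline; the user-agent name and the example.org URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt content, for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 5
Sitemap: https://example.org/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS.splitlines())   # parse() takes an iterable of lines

agent = "teaching-bot"                     # hypothetical user-agent name
allowed_speech = parser.can_fetch(agent, "https://example.org/speeches/2024.htm")
allowed_admin = parser.can_fetch(agent, "https://example.org/admin/users")
delay = parser.crawl_delay(agent)          # None if no Crawl-delay applies
sitemaps = parser.site_maps()              # Python 3.8+; None if no Sitemap line

print(f"speeches allowed: {allowed_speech}")
print(f"admin allowed:    {allowed_admin}")
print(f"crawl delay:      {delay}")
print(f"sitemaps:         {sitemaps}")
```

Against a live site you could instead pass the text you already downloaded, for example parser.parse(robots.text.splitlines()), and feed the reported crawl delay into your time.sleep() call.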
5.2 Practical rules¶
- Respect robots.txt: do not scrape paths marked Disallow
- Rate limiting: add time.sleep(1.0) or more between requests; never hammer a server
- Terms of service: some sites explicitly prohibit scraping; research use is often an exception
- Data protection: do not scrape and publish personal data (names, emails) without legal basis
- Public data preference: if an API or bulk download is available, use that instead
- Identify yourself: some researchers add a contact email to the User-Agent header as a courtesy
For academic research, scraping publicly accessible pages for non-commercial analysis is generally accepted practice in most jurisdictions. When in doubt, check your institution's guidelines or contact the site owner.
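The rate-limiting rule can be enforced in one place rather than scattered as time.sleep() calls. The sketch below is one possible helper, not part of the lecture's code: the Throttle class and its wait() method are names invented here. Unlike a fixed sleep, it only waits for whatever is left of the interval, so time spent parsing counts toward the pause.

```python
import time

class Throttle:
    """Ensure at least `min_interval` seconds pass between successive wait() calls."""

    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval
        self._last = None               # time of the previous call; None before the first

    def wait(self):
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)   # sleep only for the part of the interval still owed
        self._last = time.monotonic()

# Demo with a short interval so it runs quickly; use 1.0 or more when scraping.
throttle = Throttle(min_interval=0.2)
start = time.monotonic()
for _ in range(3):
    throttle.wait()                     # in a scraper: call this right before requests.get(...)
elapsed = time.monotonic() - start
print(f"3 calls took {elapsed:.2f}s")
```

The first call returns immediately and each later call waits out the interval, so three calls take roughly two intervals.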
6. Basic text cleaning¶
Raw scraped text needs cleaning before any NLP analysis. We will go much deeper in L7, but here are the basics.
def clean_text(text):
    """
    Basic text cleaning pipeline.
    - Remove HTML entities and tags
    - Normalise whitespace
    - Strip leading/trailing whitespace
    """
    if not isinstance(text, str):
        return ""
    # Remove HTML tags (safety net — BeautifulSoup should have handled these)
    text = re.sub(r"<[^>]+>", " ", text)
    # Replace common HTML entities
    text = text.replace("&amp;", "&").replace("&nbsp;", " ").replace("&euro;", "€")
    # Normalise whitespace: collapse multiple spaces/newlines into one space
    text = re.sub(r"\s+", " ", text)
    return text.strip()

# Apply to corpus
corpus_df["body_clean"] = corpus_df["body"].apply(clean_text)

# Quick check
for _, row in corpus_df.head(3).iterrows():
    print(f"--- {row['date_parsed'].date()} ---")
    print(row["body_clean"][:200])
    print()
📊 Reading the output. Notice the artifact: Accessible Keys for Video [Space Bar] toggles play/pau.... That's video-player UI text that the speech page includes for accessibility; it leaked into our scrape because it lives in <p> tags. Realistic scraping always hits this kind of noise: the cleaner you write, the better your downstream analysis. We'll address this systematically in L7.
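Replacing entities one by one only covers the few you thought of in advance. The standard library's html.unescape decodes all named and numeric entities in one call. The sketch below is a variant of the cleaning function built on it; clean_text_stdlib is a name introduced here, and the sample string is invented.

```python
import html
import re

def clean_text_stdlib(text):
    """Variant of clean_text that decodes every HTML entity via html.unescape."""
    if not isinstance(text, str):
        return ""
    text = re.sub(r"<[^>]+>", " ", text)   # drop any leftover tags
    text = html.unescape(text)             # &amp; -> &, &nbsp; -> U+00A0, &euro; -> €, ...
    text = re.sub(r"\s+", " ", text)       # \s also matches the non-breaking space
    return text.strip()

raw = "<p>Rates &amp; inflation:&nbsp;target&nbsp;2%, cost &euro;5</p>"
cleaned = clean_text_stdlib(raw)
print(cleaned)
```

Tags are removed before unescaping on purpose: text such as &lt;b&gt; then stays in the output as literal characters instead of being stripped as if it were markup.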
7. Diagnosing scraping problems¶
When your scraper returns nothing, it's almost always one of these. Work through them in order:
| Symptom | Likely cause | Fix |
|---|---|---|
| soup.find(...) returns None but element is visible in browser | Page renders content via JavaScript after load | L6: use Selenium, or look for the underlying JSON API |
| r.status_code == 403 | Server blocks default python-requests/x.y user agent | Set a realistic User-Agent header (cell 6) |
| r.status_code == 429 | Rate-limited: too many requests too fast | Increase time.sleep(); use requests.Session() (L6) |
| Garbled accents (Ã© instead of é) | Encoding mismatch | Set r.encoding = "utf-8" before calling r.text, or use r.content and decode manually |
| r.text ends mid-page | Server closes the connection | Add timeout= and retries |
| Elements present but .text is empty | Content is in data-* attributes or nested <noscript> | Use el.get("data-content"), or move to Selenium |
The decisive diagnostic. Open the page in Chrome, hit Ctrl+U (View Page Source). If the content you want is in the source, requests + BeautifulSoup is enough. If the source is a mostly-empty shell with a giant <script> tag, you need Selenium — that's L6.
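For the "retries" fix in the table, a common approach is to retry with an increasing pause (exponential backoff). The sketch below is a generic helper written for this note, not a requests feature: with_retries and its parameters are names invented here, and the demo uses a fake fetcher that fails twice so that it runs without a network.

```python
import time

def with_retries(fetch, attempts=3, base_delay=1.0, retry_on=(Exception,)):
    """Call fetch(); on failure wait base_delay, 2*base_delay, 4*base_delay, ... and retry."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except retry_on as exc:
            if attempt == attempts:
                raise                               # out of attempts: let the error surface
            delay = base_delay * 2 ** (attempt - 1)
            print(f"  attempt {attempt} failed ({exc}); retrying in {delay:.2f}s")
            time.sleep(delay)

# Demo: a fake fetcher that fails twice, then succeeds.
calls = {"n": 0}

def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated dropped connection")
    return "<html>ok</html>"

result = with_retries(flaky_fetch, attempts=4, base_delay=0.05,
                      retry_on=(ConnectionError,))
print(f"succeeded after {calls['n']} calls")
```

In a scraper, the wrapped call would be something like lambda: requests.get(url, headers=HEADERS, timeout=10) with retry_on=(requests.RequestException,). Backing off also helps with 429 responses, since each retry waits longer than the last.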
# Quick encoding-detection helper — useful when accents look broken
def detect_and_fix_encoding(response):
    declared = response.encoding or ""            # None when the server sends no charset
    apparent = response.apparent_encoding or ""   # guessed from the bytes (chardet or charset_normalizer)
    if apparent and declared.lower() != apparent.lower():
        print(f" Encoding mismatch: declared={declared}, apparent={apparent}")
        response.encoding = apparent
    return response.text
# Test the page-source heuristic
r = requests.get(url, headers=HEADERS, timeout=10)
text = detect_and_fix_encoding(r)
has_meaningful_html = text.lower().count("<p>") > 5 or text.lower().count("<div") > 20
print(f"Static HTML appears usable: {has_meaningful_html}")
📊 Reading the output. The Fed declares ISO-8859-1 (Latin-1) but the file is actually UTF-8-SIG (UTF-8 with BOM). If we trusted the declared encoding, accented characters would render as garbage (Ã© instead of é). The detection behind apparent_encoding (chardet or charset_normalizer, depending on your requests install) catches this: a small but critical defensive measure for international scraping.
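To see where the garbage characters come from, you can reproduce the mismatch by hand. The sketch below encodes an invented French sentence as UTF-8 and then decodes it with the wrong charset, with no network involved.

```python
original = "Communiqué de la Banque centrale européenne"

raw_bytes = original.encode("utf-8")        # what the server actually sends
wrong = raw_bytes.decode("iso-8859-1")      # decoded with the wrong (declared) charset
right = raw_bytes.decode("utf-8")           # decoded with the correct charset

# Mojibake produced this way is reversible: re-encode with the wrong
# charset to recover the original bytes, then decode them properly.
repaired = wrong.encode("iso-8859-1").decode("utf-8")

print("wrong:   ", wrong)
print("right:   ", right)
print("repaired:", repaired)
```

Each é is two bytes in UTF-8, and Latin-1 reads those two bytes as two separate characters, which is why every accented letter turns into a pair. The repair step is handy when the damage is already in a saved CSV and re-scraping is expensive.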
8. Exercise¶
Time: ~25 minutes. Work individually.
The goal is to build a small corpus from a source relevant to your own research proposal.
Option A — Fed speeches (recommended if your topic is monetary policy / central bank communication)¶
Extend the scraper from the lecture to collect all Federal Reserve Board speeches from 2015 to 2024. For each speech extract: date, title, URL, speaker, event description, and full body text. Save as CSV.
Then answer these questions with code:
- How many speeches per year?
- What is the average and median speech length (in characters)?
- Who are the 5 most prolific speakers in the period?
Option B — Banca d'Italia governor's speeches (recommended if you prefer an Italian source)¶
Scrape https://www.bancaditalia.it/pubblicazioni/interventi-governatore/integov{year}/index.html across 2020–2025. Build a metadata DataFrame (date, title, URL, event description). The full speeches are in PDF — out of scope for requests + BeautifulSoup, but you can list the PDF URLs.
Option C — your own source¶
If you have already identified a data source for your research proposal, build a scraper for it. Minimum deliverable: a CSV with at least 50 rows and at least 3 columns (date/id, title/headline, URL or body text).
# YOUR CODE HERE
# Suggested structure:
# Step 1: define scraper function(s)
# Step 2: run scraper across target pages
# Step 3: build DataFrame, clean, save to CSV
# Step 4: descriptive analysis (document counts, lengths, top entities)
# ── SOLUTION — Option A (Fed speeches, full corpus 2015–2024) ───────────────────
#
# This cell is INTERRUPT-FRIENDLY and RESUMABLE:
# • Progress is checkpointed to `fed_corpus_partial.csv` every 50 documents.
# • Press the stop button (■) any time → progress is saved, no work lost.
# • Re-run the cell → it skips already-scraped URLs and continues from there.
# • Final run produces `fed_corpus.csv` with all collected documents.
#
import requests
from bs4 import BeautifulSoup
import pandas as pd
import matplotlib.pyplot as plt
import time, re, os
ROOT = "https://www.federalreserve.gov"
YEAR_URL = "https://www.federalreserve.gov/newsevents/speech/{year}-speeches.htm"
SPEECH_URL_RE = re.compile(r"/newsevents/speech/[a-z]+\d{8}[a-z]\.htm$")
DATE_RE = re.compile(r"(\d{1,2})/(\d{1,2})/(\d{4})")
HEADERS = {"User-Agent": "Mozilla/5.0 (academic research, contact: yourname@unibo.it)"}
PARTIAL_CSV = "fed_corpus_partial.csv"
FINAL_CSV = "fed_corpus.csv"
def safe_get(url, pause=1.0):
    """Sleep for `pause` seconds, then GET the URL; return None on any request error."""
    time.sleep(pause)
    try:
        r = requests.get(url, headers=HEADERS, timeout=15)
        r.raise_for_status()
        return r
    except requests.RequestException as e:
        print(f" Error: {e}")
        return None

def find_preceding_date(link, max_chars=300):
    """Collect text before the link and return the last m/d/yyyy date in it, or ""."""
    text_before = ""
    for prev in link.previous_elements:
        if hasattr(prev, "get_text"):
            text_before = prev.get_text(" ", strip=True) + " " + text_before
        elif isinstance(prev, str):
            text_before = prev + " " + text_before
        if len(text_before) > max_chars:
            break
    m = DATE_RE.findall(text_before)
    return "/".join(m[-1]) if m else ""

def parse_fed_year(soup):
    """Return one record (date, title, url, speaker, event) per speech link on a year page."""
    records = []
    for a in soup.find_all("a", href=SPEECH_URL_RE):
        title = a.get_text(strip=True)
        if not title:
            continue
        href = a.get("href", "")
        date_str = find_preceding_date(a)
        # Collect the text that follows the link, up to the next speech link
        text_after = []
        for nxt in a.next_elements:
            if hasattr(nxt, "name") and nxt.name == "a" and nxt is not a and SPEECH_URL_RE.search(nxt.get("href", "")):
                break
            if isinstance(nxt, str):
                text_after.append(nxt)
            if sum(len(x) for x in text_after) > 500:
                break
        block = re.sub(r"\s+", " ", " ".join(text_after)).strip()
        if title in block:
            block = block.split(title, 1)[-1].strip()
        block = re.sub(r"\d{1,2}/\d{1,2}/\d{4}.*$", "", block).strip()
        block = re.sub(r"\b(Watch Live|Video)\b\s*", "", block).strip()
        # Text before " At " is the speaker; the rest describes the event
        speaker, _, event = block.partition(" At ")
        event = ("At " + event) if event else ""
        records.append({
            "date": date_str,
            "title": title,
            "url": ROOT + href if href.startswith("/") else href,
            "speaker": speaker.strip()[:200],
            "event": event.strip()[:300],
        })
    return records

def scrape_speech_text(url):
    """Download one speech page and return its paragraphs joined into a single string."""
    r = safe_get(url)
    if not r:
        return ""
    s = BeautifulSoup(r.text, "lxml")
    article = (s.find("div", id="article") or s.find("article") or
               s.find("main") or s.find("body"))
    paras = article.find_all("p") if article else []
    return " ".join(p.get_text(" ", strip=True) for p in paras
                    if len(p.get_text(strip=True)) > 20)
# ── Step 1: build the index ─────────────────────────────────────────────────
all_records = []
for yr in range(2015, 2025):
    r = safe_get(YEAR_URL.format(year=yr))
    if not r:
        continue
    recs = parse_fed_year(BeautifulSoup(r.text, "lxml"))
    for rec in recs:
        rec["year"] = yr
    all_records.extend(recs)
    print(f" {yr}: {len(recs)} speeches")
index_df = pd.DataFrame(all_records)
index_df["date_parsed"] = pd.to_datetime(index_df["date"], format="%m/%d/%Y", errors="coerce")
index_df = index_df.dropna(subset=["date_parsed"]).sort_values("date_parsed").reset_index(drop=True)
print(f"\nIndex: {len(index_df)} speeches\n")
# ── Step 2: scrape full text — RESUMABLE ────────────────────────────────────
# Load any prior partial work
done_urls = set()
prior_records = []
if os.path.exists(PARTIAL_CSV):
    prior = pd.read_csv(PARTIAL_CSV)
    prior_records = prior.to_dict("records")
    done_urls = set(prior["url"].tolist())
    print(f"Resuming: {len(done_urls)} documents already scraped, "
          f"{len(index_df) - len(done_urls)} remaining.\n")
corpus = list(prior_records)
new_count = 0
checkpoint = 50
try:
    for i, row in index_df.iterrows():
        if row["url"] in done_urls:
            continue
        body = scrape_speech_text(row["url"])
        rec = {**row.to_dict(), "body": body, "n_chars": len(body)}
        corpus.append(rec)
        new_count += 1
        # Checkpoint every `checkpoint` new documents
        if new_count % checkpoint == 0:
            pd.DataFrame(corpus).to_csv(PARTIAL_CSV, index=False)
            print(f" Checkpoint: {len(corpus)}/{len(index_df)} documents saved to {PARTIAL_CSV}")
except KeyboardInterrupt:
    print(f"\n ⏸ Interrupted at {len(corpus)}/{len(index_df)} documents.")
    pd.DataFrame(corpus).to_csv(PARTIAL_CSV, index=False)
    print(f" Progress saved to {PARTIAL_CSV}. Re-run the cell to resume.")
    raise  # so you see the traceback and know it stopped
# ── Step 3: finalise ────────────────────────────────────────────────────────
corpus_df = pd.DataFrame(corpus)
corpus_df.to_csv(FINAL_CSV, index=False)
print(f"\n✓ Done: {len(corpus_df)} documents saved to {FINAL_CSV}")
# ── Step 4: descriptive analysis ────────────────────────────────────────────
print("\nSpeeches per year:")
print(corpus_df.groupby("year")["title"].count())
print(f"\nAvg length : {corpus_df['n_chars'].mean():.0f} chars")
print(f"Median length : {corpus_df['n_chars'].median():.0f} chars")
print("\nTop 5 speakers (by speech count):")
print(corpus_df["speaker"].value_counts().head())
# Plot: speeches per year
fig, ax = plt.subplots(figsize=(10, 4))
corpus_df.groupby("year")["title"].count().plot(
kind="bar", color="steelblue", edgecolor="white", ax=ax)
ax.set_title("Federal Reserve speeches per year, 2015–2024")
ax.set_xlabel("Year"); ax.set_ylabel("Count")
fig.tight_layout()
fig.savefig("fed_speeches_by_year.png", dpi=300, bbox_inches="tight")
plt.show()
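The checkpoint-and-resume logic is easier to trust once you have watched it work on something small. The sketch below strips the pattern down to the standard library and a fake download function, so it runs offline in well under a second. The names scrape_all and fake_fetch and the example.org URLs are invented for this illustration, and it saves in a finally block, a slight variation on the except KeyboardInterrupt used in the solution.

```python
import csv
import os
import tempfile

def scrape_all(urls, fetch, partial_path, checkpoint=2):
    """Fetch each URL once, saving progress to partial_path every `checkpoint` new docs."""
    done = {}
    if os.path.exists(partial_path):                # resume from an earlier run
        with open(partial_path, newline="", encoding="utf-8") as f:
            done = {row["url"]: row["body"] for row in csv.DictReader(f)}

    def save():
        with open(partial_path, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=["url", "body"])
            writer.writeheader()
            writer.writerows({"url": u, "body": b} for u, b in done.items())

    new_count = 0
    try:
        for url in urls:
            if url in done:
                continue                            # already scraped: skip
            done[url] = fetch(url)
            new_count += 1
            if new_count % checkpoint == 0:
                save()
    finally:
        save()                                      # runs on normal exit, errors and Ctrl+C
    return done

urls = [f"https://example.org/doc{i}.htm" for i in range(5)]   # invented URLs
fetched = []                                        # log of every simulated download
interrupt_at = {"url": urls[3]}

def fake_fetch(url):
    if url == interrupt_at["url"]:
        raise KeyboardInterrupt                     # simulate pressing the stop button
    fetched.append(url)
    return f"body of {url}"

partial = os.path.join(tempfile.mkdtemp(), "partial.csv")

try:
    scrape_all(urls, fake_fetch, partial)           # first run: stops at the fourth URL
except KeyboardInterrupt:
    print(f"interrupted after {len(fetched)} downloads")

interrupt_at["url"] = None                          # second run: no interruption
corpus = scrape_all(urls, fake_fetch, partial)
print(f"{len(corpus)} documents in corpus, {len(fetched)} downloads in total")
```

The second run downloads only the two missing documents, which is the property that makes a long scrape safe to interrupt.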
Summary¶
| Step | Tool | Key concept |
|---|---|---|
| Make a request | requests.get(url) | Check .status_code before using .text |
| Parse HTML | BeautifulSoup(html, "lxml") | find(), find_all(), select() |
| Extract text | .text.strip() | Always strip whitespace |
| Extract attributes | .get("href") | Links, IDs, classes |
| Navigate siblings | .find_next_sibling() | Pair dates with content |
| Be polite | time.sleep(1.5) | Avoid 429 errors and server load |
| Handle errors | raise_for_status() + try/except | Never assume 200 |
Next lecture¶
Lecture 6 — Web scraping II: cookies, session headers, and selenium for JavaScript-heavy sites. We will also start pre-processing the text corpora you have begun to build.