Lecture 5 — Web Scraping I: requests & BeautifulSoup¶

Python for Economists · University of Bologna · 2025/2026¶


What we cover today¶

  0. Warm-up: scraping live, before any theory
  1. How the web works: HTTP, HTML, and the DOM
  2. Making requests with requests
  3. Parsing HTML with BeautifulSoup
  4. Building a scraper: loops, pagination, delays
  5. Ethics and legality of web scraping (revisited in depth)
  6. Case study: ECB press release archive
  7. Diagnosing scraping problems (and when to switch to L6)
  8. Exercise: build your first economic corpus

Why scraping? Most economically relevant text — parliamentary speeches, central bank communications, court rulings, news articles — is not available as a clean CSV. Scraping is how you build the dataset that does not yet exist.

In [ ]:
# Install if needed
# pip install requests beautifulsoup4 lxml

import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
import re
from datetime import datetime

print("Libraries loaded.")

0. Warm-up — scraping live, before any theory¶

I'm going to scrape a real website now, with no preparation, in front of you. The site is the Federal Reserve speeches archive for 2024. By the end of this cell we'll have the title and URL of the ten most recent Fed Board speeches of 2024 as a DataFrame. Date and speaker follow in Section 4, once we have looked at the page more carefully.

We'll then spend the rest of the lecture understanding what each line does.

In [ ]:
# Live demo — minimal scraper, ~10 lines.
URL = "https://www.federalreserve.gov/newsevents/speech/2024-speeches.htm"
HEADERS = {"User-Agent": "Mozilla/5.0 (teaching; contact: marco.rosso4@unibo.it)"}

resp = requests.get(URL, headers=HEADERS, timeout=10)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "lxml")

# Each Fed speech URL follows a recognisable pattern:
# /newsevents/speech/{lastname}{YYYYMMDD}{letter}.htm   e.g. powell20240930a.htm
SPEECH_URL_RE = re.compile(r"/newsevents/speech/[a-z]+\d{8}[a-z]\.htm$")

items = []
for a in soup.find_all("a", href=SPEECH_URL_RE):
    title = a.get_text(strip=True)
    if not title:
        continue
    href = a.get("href", "")
    items.append({
        "title": title,
        "url":   "https://www.federalreserve.gov" + href if href.startswith("/") else href,
    })

df_warmup = pd.DataFrame(items).head(10)
df_warmup

📊 Reading the output. The DataFrame shows the 10 most recent Fed speeches: title and full URL. We've gone from a live website to structured tabular data in ~10 lines, with no preparation. Note the 2-column structure (title, url) — date and speaker will come in Section 4 once we look at the page more carefully.

That was it. Three libraries (requests, BeautifulSoup, pandas), ten lines, and we have a structured dataset from a live website.

Before we go any deeper — a non-negotiable preliminary.

The ethics moment (5 minutes, non-negotiable)¶

Scraping is powerful and, badly used, harmful. Three rules to apply every time:

  1. Read robots.txt. Most sites publish one at /robots.txt. It lists the paths the site operator asks crawlers to stay out of. Check it before you start.
  2. Identify yourself. Set a User-Agent with a contact email. If the site admin has a problem, they can write to you instead of blocking you.
  3. Rate-limit. A human reads one page every ~10 seconds. A scraper can read 100 per second. Don't. Use time.sleep(1) between requests at minimum.
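
Rule 3 can be enforced mechanically instead of relying on remembering to call time.sleep. The following is a minimal sketch, assuming a hypothetical helper class Throttle (it is not part of requests): it remembers when the previous request went out and sleeps only for whatever is left of the minimum interval. The clock and sleep functions are injectable so the behaviour can be checked without actually waiting.

```python
import time

class Throttle:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None          # time of the previous request, if any

    def wait(self):
        """Sleep just long enough to respect min_interval; return seconds slept."""
        now = self._clock()
        slept = 0.0
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                slept = remaining
        self._last = self._clock()
        return slept
```

In a scraping loop you would create one throttle = Throttle(1.0) and call throttle.wait() immediately before each requests.get(...). Time spent parsing the previous page then counts towards the pause, so the loop is never slower than it needs to be.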

Legal note. Scraping public, non-copyrighted information for research is legal in the EU under specific conditions — the Text and Data Mining (TDM) exception for scientific research, Directive 2019/790, art. 3. Scraping copyrighted content, personal data, or content behind a paywall is not. For your research proposals, stick to public institutional sources (ECB, parliaments, government press releases) and you'll be safe.

We'll come back to this in Section 5 with the practical handbook.

In [ ]:
# Always check robots.txt before scraping
robots = requests.get("https://www.federalreserve.gov/robots.txt", headers=HEADERS, timeout=5)
print(robots.text[:500])

📊 Reading the robots.txt. What you see is plain text instructions for web crawlers. Lines starting with User-agent: * apply to all bots; Disallow: paths are off-limits. Our /newsevents/speech/ path is not listed → free to scrape. Good citizenship: always check this before starting a scraping project.
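
Reading robots.txt by eye works for one site; for a pipeline it is safer to let the standard library interpret the rules. Here is a small sketch using urllib.robotparser. The robots.txt content and the bot name teaching-bot are made up for illustration (this is not the Fed's actual file), and the rules are parsed from a string so the example runs offline.

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt content, for illustration only
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

def build_parser(robots_text):
    """Parse robots.txt text into a RobotFileParser."""
    rp = RobotFileParser()
    rp.parse(robots_text.splitlines())
    return rp

rp = build_parser(SAMPLE_ROBOTS)

# can_fetch(user_agent, url) answers: may this bot request this URL?
allowed_speech = rp.can_fetch("teaching-bot", "https://example.org/newsevents/speech/page.htm")
blocked_private = rp.can_fetch("teaching-bot", "https://example.org/private/data.htm")
delay = rp.crawl_delay("teaching-bot")   # seconds requested by the site, or None

print(allowed_speech, blocked_private, delay)
```

Against a live site you would call rp.set_url("https://.../robots.txt") followed by rp.read() instead of parse(), then use can_fetch in the same way.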


1. How the web works¶

1.1 The HTTP request-response cycle¶

When you type a URL in your browser:

  1. Your browser sends an HTTP GET request to the server at that URL
  2. The server returns an HTTP response with a status code and a body
  3. The body is usually HTML — the browser renders it visually

Web scraping replicates steps 1–2 in Python, then reads the HTML programmatically instead of rendering it.

1.2 Status codes¶

| Code    | Meaning                                       |
|---------|-----------------------------------------------|
| 200     | OK: request succeeded                         |
| 301/302 | Redirect                                      |
| 403     | Forbidden: server refuses the request         |
| 404     | Not found                                     |
| 429     | Too many requests: you are being rate-limited |
| 500     | Server error                                  |
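
You do not need to memorise these codes: Python's standard library ships the official names in http.HTTPStatus. A short sketch, assuming a hypothetical helper describe_status that returns the class of the response (from the first digit) together with the official phrase:

```python
from http import HTTPStatus

def describe_status(code):
    """Return (class, official phrase) for an HTTP status code."""
    classes = {2: "success", 3: "redirect", 4: "client error", 5: "server error"}
    family = classes.get(code // 100, "other")
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:          # code not registered in the stdlib enum
        phrase = "unknown"
    return family, phrase

for code in (200, 301, 403, 404, 429, 500):
    family, phrase = describe_status(code)
    print(f"{code}  {family:<12}  {phrase}")
```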

1.3 HTML structure¶

HTML is a tree of elements (tags). Every piece of content sits inside a tag:

<html>
  <head>
    <title>Page title</title>
  </head>
  <body>
    <h1 class="headline">Main heading</h1>
    <div id="content">
      <p>First paragraph.</p>
      <p class="abstract">Second paragraph.</p>
      <a href="https://example.com">A link</a>
    </div>
  </body>
</html>

How to find the right tags: open any webpage in Chrome/Firefox, right-click on the element you want, and select Inspect. The browser's developer tools show the HTML behind any element.
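
To make the "tree" idea concrete, the sketch below walks the example page and prints each tag indented by its nesting depth. It uses the standard library's html.parser purely so that it runs with no installs; the lecture itself switches to BeautifulSoup in Section 3. The class name OutlineParser is an illustrative choice, not an established API.

```python
from html.parser import HTMLParser

SAMPLE_HTML = """<html><head><title>Page title</title></head>
<body><h1 class="headline">Main heading</h1>
<div id="content"><p>First paragraph.</p>
<p class="abstract">Second paragraph.</p>
<a href="https://example.com">A link</a></div></body></html>"""

class OutlineParser(HTMLParser):
    """Record each opening tag together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []       # list of (depth, tag, attributes-dict)

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag, dict(attrs)))
        self.depth += 1         # everything until the closing tag is one level deeper

    def handle_endtag(self, tag):
        self.depth -= 1

parser = OutlineParser()
parser.feed(SAMPLE_HTML)
for depth, tag, attrs in parser.outline:
    print("  " * depth + tag, attrs or "")
```

The three elements inside the div (two p tags and the a tag) come out at the same depth: they are siblings, and the div is their parent. That vocabulary returns in Section 3.2.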


2. Making requests with requests¶

In [ ]:
# A minimal GET request
url = "https://www.federalreserve.gov/newsevents/speech/2024-speeches.htm"

response = requests.get(url, timeout=10)   # a timeout stops the cell hanging on a dead server

print(f"Status code : {response.status_code}")
print(f"Content-Type: {response.headers.get('Content-Type', 'n/a')}")
print(f"Body length : {len(response.text):,} characters")
print()
# Show the first 500 characters of raw HTML
print(response.text[:500])

📊 Reading the response. Status 200 = success. The server returns ~180 KB of HTML: the entire 2024 speeches index page as the browser would receive it. Note the <!doctype html> opening: this is the raw markup exactly as the server sent it, identical to what Chrome's "View Page Source" shows.

In [ ]:
# Always check the status code before proceeding
def safe_get(url, pause=1.0):
    """
    Perform a GET request with a polite pause.
    Returns the response if successful, None otherwise.
    """
    time.sleep(pause)   # be polite: do not hammer the server
    try:
        r = requests.get(url, headers=HEADERS, timeout=10)   # identify yourself (rule 2)
        r.raise_for_status()   # raises HTTPError for 4xx/5xx
        return r
    except requests.HTTPError as e:
        print(f"HTTP error {e.response.status_code} for {url}")
        return None
    except requests.RequestException as e:
        print(f"Request failed for {url}: {e}")
        return None

r = safe_get(url)
print(r.status_code if r else "Failed")

📊 Reading the output. 200 means OK. In production scraping you'll meet 404 (page gone), 403 (blocked — try a different User-Agent), 429 (rate-limited — slow down), and 500 (server error — retry later). Wrapping requests.get() in a function with status-code handling is the difference between a one-shot script and a robust pipeline.

📋 HTTP status codes — a scraper's field guide¶

Every response from a server includes a 3-digit status code. The first digit tells you the class of response. As a scraper, you'll meet a small subset over and over:

✅ Success (2xx)¶

| Code | Name       | Meaning |
|------|------------|---------|
| 200  | OK         | The request succeeded; the body contains what you asked for. |
| 204  | No Content | Request succeeded but the response body is empty (common for DELETE or some APIs). |

🔀 Redirects (3xx)¶

A redirect is when the server says "the resource you asked for now lives at a different URL — go there instead." Instead of returning the page content, the server returns a 3xx status code together with a Location: header pointing to the new URL. Browsers (and requests) typically follow the redirect automatically and show you the final page, which is why redirects are usually invisible.

Why do servers redirect?

  • Site restructuring — /old-page → /new-section/page after a redesign. The Fed and the ECB do this regularly to their archives.
  • Adding https — http://... → https://... to enforce encrypted connections.
  • Adding www — example.com → www.example.com (or vice versa) for canonical-URL consistency.
  • Authentication walls — protected page → login page, then back to the original.
  • Locale routing — /page → /en/page based on the visitor's Accept-Language header.
  • Trailing slashes — /page → /page/ for URL normalisation.

| Code | Name               | Meaning |
|------|--------------------|---------|
| 301  | Moved Permanently  | The new URL is the canonical one. Search engines and crawlers should update their records. Use this to update your scraper code to the new URL. |
| 302  | Found (Temporary)  | The redirect is temporary: keep using the original URL in future requests. |
| 307  | Temporary Redirect | Like 302, but the HTTP method is preserved (a POST stays a POST; relevant for APIs). |
| 308  | Permanent Redirect | Like 301, with method preserved. |

requests handles them for you — but you should still inspect:

r = requests.get("http://www.federalreserve.gov/newsevents/speech/2024-speeches.htm")
print(f"Final URL  : {r.url}")              # the URL you actually ended up on
print(f"Status     : {r.status_code}")       # 200 (the *final* response)
print(f"Hops taken : {len(r.history)}")      # number of redirects followed
for hop in r.history:
    print(f"  {hop.status_code} {hop.url}  →  {hop.headers.get('Location')}")

For the URL above you'll typically see one hop (http → https). Some sites have chains of 3–4 redirects; if you see more than that, something is misconfigured (or you're in a redirect loop).

When to disable automatic redirects. Pass allow_redirects=False if you want to inspect the redirect itself rather than follow it — useful for debugging "why does this URL behave strangely?" or for scrapers that need to capture intermediate URLs.

🎯 Practical implications for your scraper. If you see a 301 permanently redirecting away from a URL you've hard-coded, update your code to use the new URL: extra hops mean extra latency on every single request. If you see 302 Temporary, keep the original. And if r.url differs from the URL you requested, your "scrape this URL" log entries should reflect what actually got scraped, not what was asked for.

❌ Client errors (4xx) — the request is the problem¶

| Code | Name               | Most common cause when scraping | Fix |
|------|--------------------|---------------------------------|-----|
| 400  | Bad Request        | Malformed URL, invalid query parameters | Check the URL and params= dict |
| 401  | Unauthorized       | Endpoint requires authentication | Out of scope for public scraping |
| 403  | Forbidden          | Server detected and blocked your python-requests/x.y User-Agent, or geo-block, or a more aggressive bot detector (Cloudflare) | Set a browser-style User-Agent; if that fails, the site is actively defending: consider Selenium (L6) or back off |
| 404  | Not Found          | The page no longer exists or the URL pattern changed | Verify the URL in a browser; the site may have restructured |
| 405  | Method Not Allowed | You used GET on an endpoint expecting POST | Check the API docs |
| 429  | Too Many Requests  | Rate limiting: you're sending requests faster than the server allows. Often combined with a Retry-After header telling you how long to wait. | Slow down: increase time.sleep(); consider exponential backoff; in extreme cases use requests.Session() with retries (L6) |

🔥 Server errors (5xx) — the server is the problem¶

| Code            | Name                                                | What you do |
|-----------------|-----------------------------------------------------|-------------|
| 500             | Internal Server Error                               | The server crashed on your request. Retry once or twice with a delay; if persistent, the URL itself may be problematic. |
| 502 / 503 / 504 | Bad Gateway / Service Unavailable / Gateway Timeout | Transient infrastructure issues. Retry with exponential backoff (1s → 2s → 4s → 8s). |

🛠 The defensive scraping pattern¶

Whenever you write a scraper, your inner loop should look like this:

import time
import requests

def fetch_with_retry(url, max_retries=3, base_delay=1.0):
    """GET url with retries. Returns the Response, or None if every attempt fails."""
    for attempt in range(max_retries):
        backoff = base_delay * (2 ** attempt)   # 1s → 2s → 4s
        try:
            r = requests.get(url, headers=HEADERS, timeout=10)
        except requests.RequestException as e:
            # Connection problems and timeouts are usually transient: retry
            print(f"  Network error: {e}; attempt {attempt+1}/{max_retries}")
            time.sleep(backoff)
            continue
        if r.status_code == 429:
            # Rate limited: honour Retry-After when it is a number of seconds
            # (it may also be an HTTP date), else fall back to exponential backoff
            retry_after = r.headers.get("Retry-After", "")
            wait = int(retry_after) if retry_after.isdigit() else backoff
            print(f"  Rate limited; waiting {wait}s before retry {attempt+1}")
            time.sleep(wait)
            continue
        if r.status_code >= 500:
            # Server error: exponential backoff
            print(f"  Server error {r.status_code}; waiting {backoff}s")
            time.sleep(backoff)
            continue
        if r.status_code >= 400:
            # Any other 4xx: the request itself is wrong, retrying will not help
            print(f"  HTTP error {r.status_code} for {url}; not retrying")
            return None
        return r
    return None

🎯 Key takeaway. Codes 4xx mean you must change something (URL, headers, rate). Codes 5xx mean the server failed — usually transient, retry with backoff. The single most common one in academic scraping is 429: it's the server politely telling you to slow down. Always honour it.
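
Honouring a 429 properly means reading the Retry-After header, which comes in two forms: a number of seconds ("120") or an HTTP date ("Wed, 21 Oct 2015 07:28:00 GMT"). The sketch below, built around a hypothetical helper retry_after_seconds, handles both and falls back to a caller-supplied delay when the header is missing or unreadable. It relies only on the standard library.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def retry_after_seconds(value, fallback, now=None):
    """Translate a Retry-After header value into seconds to wait."""
    if not value:
        return fallback
    value = value.strip()
    if value.isdigit():                       # form 1: delay in seconds
        return float(value)
    try:                                      # form 2: HTTP date
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):           # neither form: use the fallback
        return fallback
    if when.tzinfo is None:                   # treat naive dates as UTC
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())
```

Inside a retry loop the call would look like wait = retry_after_seconds(r.headers.get("Retry-After"), fallback=base_delay * 2 ** attempt), so the server's own instruction wins whenever it gives one.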

In [ ]:
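# Headers: a browser-style User-Agent avoids 403 errors on some sites.
# Keep the contact email from the warm-up so site admins can still reach you.
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/124.0.0.0 Safari/537.36 "
        "(teaching; contact: marco.rosso4@unibo.it)"
    ),
    "Accept-Language": "en-US,en;q=0.9",
}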

r = requests.get(url, headers=HEADERS, timeout=10)
print(r.status_code)

📊 Reading the output. Same 200 as before: the Fed accepts requests both with and without a custom User-Agent. On sites with stricter filters (Cloudflare, Akamai), however, the default python-requests/x.y UA is often blocked outright. A browser-style UA that also carries a contact email is the polite default.

2.1 Query parameters and the response object¶

So far every URL has been hard-coded. Often you want to construct URLs programmatically: requests lets you pass parameters as a dict via params=. This avoids manual string concatenation and handles URL-encoding for you.

In [ ]:
# Query parameters via params= (cleaner than f-strings inside URLs)
demo_params = {"q": "monetary policy", "fromDate": "2024-01-01", "toDate": "2024-12-31"}
prepared = requests.Request("GET", "https://example.com/search", params=demo_params).prepare()
print("Prepared URL:", prepared.url)

# Inspect the response object more thoroughly
r = requests.get(url, headers=HEADERS, timeout=10)
print(f"Final URL (after redirects): {r.url}")
print(f"Status                     : {r.status_code} {r.reason}")
print(f"Encoding                   : {r.encoding}")
print(f"Cookies set by server      : {dict(r.cookies)}")
print(f"Redirect history           : {[h.status_code for h in r.history]}")
print(f"Response time (s)          : {r.elapsed.total_seconds():.3f}")

📊 Reading the response object. requests returns a rich object: final URL after redirects, encoding, cookies set by the server, redirect history, and elapsed time. The Encoding: ISO-8859-1 value is interesting. Either the Fed's Content-Type header declares Latin-1 or, more likely, it declares no charset at all, in which case requests falls back to ISO-8859-1 for text/* responses. Either way, the actual content is UTF-8. We'll fix that mismatch in Section 7 (encoding diagnostics).
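
Why an encoding mismatch matters is easiest to see offline. The sketch below takes a made-up string with non-ASCII characters, encodes it as UTF-8 (what the server sends), decodes it as Latin-1 (what a wrong charset guess does), and then reverses the mistake. The sample text is invented for illustration.

```python
original = "Zinssätze – 2 % target, €STR"     # invented sample text
raw_bytes = original.encode("utf-8")          # bytes as the server would send them

wrong = raw_bytes.decode("iso-8859-1")        # decoded with the wrong charset
repaired = wrong.encode("iso-8859-1").decode("utf-8")   # undo the wrong decode

print("wrong   :", wrong)       # accented letters turn into two-character garbage
print("repaired:", repaired)
```

With requests you normally avoid the problem up front: set r.encoding = "utf-8" (or r.encoding = r.apparent_encoding) before reading r.text, so the body is decoded correctly the first time.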


3. Parsing HTML with BeautifulSoup¶

In [ ]:
# Parse the HTML response
soup = BeautifulSoup(r.text, "lxml")

# The soup object is a tree you can navigate
print(type(soup))
print(soup.title.text)   # page title

📊 Reading the output. BeautifulSoup exposes the parsed DOM as a Python object. soup.title.text extracts the <title> tag content, which is useful as a sanity check that we got the page we expected (Federal Reserve Board - 2024 Speeches, not an error page).

3.1 Core methods¶

| Method                         | Returns                       | Use when                             |
|--------------------------------|-------------------------------|--------------------------------------|
| soup.find(tag, attrs)          | First matching element        | You need exactly one element         |
| soup.find_all(tag, attrs)      | List of all matching elements | You need all elements of a type      |
| element.text                   | Plain text content            | Extracting readable text             |
| element.get("attr")            | Attribute value               | Extracting URLs, IDs, classes        |
| element.select("css_selector") | List by CSS selector          | Precise targeting of nested elements |

In [ ]:
# Working with a simple HTML string first — so results are predictable
html_example = """
<html>
<body>
  <div class="press-list">
    <dl>
      <dt class="date">7 November 2024</dt>
      <dd>
        <a href="/press/pr/date/2024/html/ecb.pr241107~abc.en.html">
          Monetary policy decisions
        </a>
      </dd>
      <dt class="date">17 October 2024</dt>
      <dd>
        <a href="/press/pr/date/2024/html/ecb.pr241017~xyz.en.html">
          Monetary policy decisions
        </a>
      </dd>
      <dt class="date">12 September 2024</dt>
      <dd>
        <a href="/press/pr/date/2024/html/ecb.pr240912~def.en.html">
          Monetary policy decisions
        </a>
      </dd>
    </dl>
  </div>
</body>
</html>
"""

example_soup = BeautifulSoup(html_example, "lxml")

# find_all — get all date elements
dates = example_soup.find_all("dt", class_="date")
for d in dates:
    print(d.text.strip())

📊 Reading the output. Three dates extracted from a synthetic HTML snippet. Working with controlled fixtures first is good practice: when you debug a parser, you want predictable inputs. Once it works on the fixture, you swap in the real page.

In [ ]:
# Extract titles and URLs from the links
links = example_soup.find_all("a")
for link in links:
    title = link.text.strip()
    href  = link.get("href")
    print(f"{title!r:35}  →  {href}")

📊 Reading the output. Every <a> tag yields title (the visible text) and href (the URL). Note the URLs are relative (/press/...) — to make them clickable later we'll need to prepend the domain (https://www.ecb.europa.eu). This is the standard pattern: scrape, then resolve.
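
Prepending the domain by hand works while every href starts with "/". A more general way to resolve links is urllib.parse.urljoin from the standard library, which also copes with links relative to the current directory and with links that are already absolute. In the sketch below the BASE page URL and the second and third hrefs are assumptions made up for illustration.

```python
from urllib.parse import urljoin

# Assumed URL of the page the links were scraped from (illustrative)
BASE = "https://www.ecb.europa.eu/press/pr/html/index.en.html"

hrefs = [
    "/press/pr/date/2024/html/ecb.pr241107~abc.en.html",  # root-relative
    "ecb.pr241017~xyz.en.html",                            # relative to the current directory
    "https://example.com/other.html",                      # already absolute
]

resolved = [urljoin(BASE, h) for h in hrefs]
for h, full in zip(hrefs, resolved):
    print(f"{h}\n    → {full}")
```

The design point: string concatenation silently produces a broken URL for the second and third cases, while urljoin applies the same resolution rules a browser would.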

In [ ]:
# CSS selectors — more precise targeting
# "div.press-list a" means: <a> elements inside a <div class="press-list">
for link in example_soup.select("div.press-list a"):
    print(link.text.strip(), "|", link.get("href"))

📊 Reading the output. Same three records, but obtained with a single .select(...) call using CSS syntax. CSS selectors are more powerful than find_all for nested queries: div.press-list a says "find an <a> inside any <div> with class press-list". Once you know CSS, you can target almost anything.

In [ ]:
# Pairing dates with links — navigate sibling elements
records = []
dts = example_soup.find_all("dt", class_="date")
for dt in dts:
    date_str = dt.text.strip()
    dd = dt.find_next_sibling("dd")
    link = dd.find("a")
    records.append({
        "date":  date_str,
        "title": link.text.strip(),
        "url":   "https://www.ecb.europa.eu" + link.get("href"),
    })

pd.DataFrame(records)

📊 Reading the output. A clean DataFrame: each row pairs a date with its corresponding press release. The trick was to walk between siblings (<dt> and <dd>) rather than treating them in isolation. This is one of the most common scraping tasks: pairing related elements that aren't nested inside each other.

3.2 Tree navigation: parents, siblings, descendants¶

find (or select_one) gets you to one element. From there, you can move around the tree:

In [ ]:
# Build a tiny tree we can navigate
nav_html = '''
<article>
  <header>
    <h2>Press release</h2>
    <time>7 November 2024</time>
  </header>
  <section class="body">
    <p class="lede">First paragraph.</p>
    <p>Second paragraph.</p>
    <p>Third paragraph.</p>
  </section>
</article>
'''
nav_soup = BeautifulSoup(nav_html, "lxml")

lede = nav_soup.find("p", class_="lede")
print("Element        :", lede.name, "→", lede.text.strip())
print("Parent         :", lede.parent.name)
print("All ancestors  :", [a.name for a in lede.parents if a.name])
print("Next sibling   :", lede.find_next_sibling().text.strip())
print("Next siblings  :", [s.text.strip() for s in lede.find_next_siblings()])

section = nav_soup.find("section")
print("Direct children:", [c.name for c in section.children if c.name])
print("All descendants:", [d.name for d in section.descendants if d.name])

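📊 Reading the output. This shows the four classic ways to navigate the DOM tree: .parent, .parents, find_next_sibling() / find_next_siblings(), and .children / .descendants. We call the find_next_sibling() method rather than the bare .next_sibling attribute because the latter returns the very next node, which here is the whitespace between two tags, not the next <p>. The distinction between children (direct only: ['p', 'p', 'p']) and descendants (all nested: same in this case because the structure is flat) becomes important on real pages with deep nesting.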


Interlude — notebook vs. script: when to switch¶

We've been living in Jupyter since L1. That's great for exploration — but a serious scraper is not exploration. It's a pipeline that must run unattended, maybe overnight, maybe weekly, with error handling and logging.

Notebooks are bad at this. Scripts are good at this.

Let me show you what the same scraping code looks like as a proper Python script, opened in Spyder (which you already have — it comes with Anaconda) and in VS Code (the industry standard).

[Live demo: instructor opens ecb_scraper.py in Spyder, shows: variable explorer, debugger, run-selection, integrated terminal. Then opens the same file in VS Code, shows: Python extension, interpreter selector for the econ environment, integrated debugger with breakpoints.]

Bottom line for this course:

  • Lectures and exercises → Jupyter (interactive, immediate feedback).
  • Your scraping pipeline for the exam → Jupyter is fine for prototyping, but run the final version as a script if your corpus takes more than ~10 minutes to build.
  • Your projects → VS Code will be your friend. Start now.
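
To give an idea of what "script form" adds, here is a minimal skeleton with command-line arguments, logging, and a top-level error handler. It is a hypothetical sketch, not the contents of ecb_scraper.py: the function names, the argument names, and the run placeholder are all assumptions, and the actual scraping calls are left out.

```python
import argparse
import logging

log = logging.getLogger("scraper")

def parse_args(argv=None):
    """Read settings from the command line instead of editing the code."""
    p = argparse.ArgumentParser(description="Scrape an archive, one year at a time.")
    p.add_argument("--start", type=int, default=2020, help="first year (inclusive)")
    p.add_argument("--end", type=int, default=2024, help="last year (inclusive)")
    p.add_argument("--pause", type=float, default=1.0, help="seconds between requests")
    p.add_argument("--out", default="corpus.csv", help="output CSV path")
    return p.parse_args(argv)

def run(args):
    """Placeholder for the real pipeline; returns the years it would scrape."""
    years = list(range(args.start, args.end + 1))
    for year in years:
        log.info("would scrape %s (pause=%.1fs) -> %s", year, args.pause, args.out)
    return years

def main(argv=None):
    logging.basicConfig(level=logging.INFO,
                        format="%(asctime)s %(levelname)s %(message)s")
    args = parse_args(argv)
    try:
        run(args)
    except Exception:
        log.exception("scraper crashed")   # full traceback goes to the log
        return 1
    return 0
```

In an actual script file you would end with the usual guard, if __name__ == "__main__": raise SystemExit(main()), and launch it from a terminal, for example python my_scraper.py --start 2022 --end 2024 (the file name is a placeholder). Timestamps in the log are what let you reconstruct an overnight run the next morning.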

4. Building a scraper: loops, pagination, delays¶

Our case study: build a corpus of Federal Reserve Board speeches across multiple years. The Fed publishes a year-by-year archive at:

https://www.federalreserve.gov/newsevents/speech/{YEAR}-speeches.htm

Each entry has a date, a hyperlinked title, the speaker's name and role, and the event description. The individual speech pages contain the full text in HTML — that's our text corpus.

Plan:

  1. Scrape one year's index page → list of speeches with metadata
  2. Scale to multiple years
  3. Follow each link to fetch the full speech text
  4. Save as CSV

4.1 Inspect first — never scrape blind¶

Before writing the loop, look at one page. Confirm the URL pattern, count the items, peek at the structure. This is the single most valuable habit in scraping.

In [ ]:
# Step 1 — fetch one year and inspect what we got
ROOT          = "https://www.federalreserve.gov"
YEAR_URL      = "https://www.federalreserve.gov/newsevents/speech/{year}-speeches.htm"
SPEECH_URL_RE = re.compile(r"/newsevents/speech/[a-z]+\d{8}[a-z]\.htm$")
DATE_RE       = re.compile(r"(\d{1,2})/(\d{1,2})/(\d{4})")   # MM/DD/YYYY

resp = requests.get(YEAR_URL.format(year=2024), headers=HEADERS, timeout=10)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "lxml")

speech_links = soup.find_all("a", href=SPEECH_URL_RE)
print(f"Speech links found in 2024 archive: {len(speech_links)}\n")

# Inspect the first one and its surroundings
a = speech_links[0]
print("First link:")
print(f"  Title: {a.get_text(strip=True)}")
print(f"  URL  : {a.get('href')}")

# The date is in the page text RIGHT BEFORE each title link.
# Walk backwards to find the nearest MM/DD/YYYY string.
def find_preceding_date(link, max_chars=300):
    """Return the closest MM/DD/YYYY found *before* this <a> in the document."""
    text_before = ""
    for prev in link.previous_elements:
        # Collect text nodes only (NavigableString is a str subclass).
        # Tags are skipped: an ancestor's get_text() would also pull in
        # text that comes *after* the link, and would count strings twice.
        if isinstance(prev, str):
            text_before = str(prev) + " " + text_before
        if len(text_before) > max_chars:
            break
    matches = DATE_RE.findall(text_before)
    return matches[-1] if matches else None   # the LAST match is the closest to the link

print(f"  Date (closest preceding MM/DD/YYYY): {find_preceding_date(a)}")

📊 Reading the output. 107 speech links found on the 2024 page. The first one is Adriana Kugler's December 3 speech, with date 12/3/2024 correctly extracted by walking backwards from the <a> tag. The "preceding date" trick is robust because Fed pages always put the date before the title link in document order — we don't need a CSS class that might change.
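
The URL pattern itself carries a date ({lastname}{YYYYMMDD}{letter}.htm), which offers an independent cross-check on the date read from the page. The sketch below uses a hypothetical helper, parse_speech_url, to split a speech URL into its parts. One assumption to flag loudly: that the eight digits in the file name are the delivery date. The source does not confirm this, so treat a disagreement as a prompt to look at the page, not as an error.

```python
import re
from datetime import datetime

# Same pattern as SPEECH_URL_RE, with capture groups added
URL_PARTS_RE = re.compile(r"/newsevents/speech/([a-z]+)(\d{8})([a-z])\.htm$")

def parse_speech_url(href):
    """Split a speech URL into (lastname, date, letter); None if it does not match."""
    m = URL_PARTS_RE.search(href)
    if not m:
        return None
    lastname, yyyymmdd, letter = m.groups()
    try:
        date = datetime.strptime(yyyymmdd, "%Y%m%d").date()
    except ValueError:            # eight digits that are not a real calendar date
        return None
    return lastname, date, letter

print(parse_speech_url("/newsevents/speech/powell20240930a.htm"))
print(parse_speech_url("/newsevents/speech/2024-speeches.htm"))
```

The second call returns None: the year index page lives under the same path but does not fit the speech pattern, which is exactly the kind of link a scraper should ignore.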

4.2 A reusable parser¶

Now wrap the parsing logic into a function. For each speech link we extract: date, title, URL, speaker, event description.

In [ ]:
def parse_fed_year(soup):
    """Extract speech metadata from one Fed year-archive page."""
    records = []
    links   = soup.find_all("a", href=SPEECH_URL_RE)
    for a in links:
        title = a.get_text(strip=True)
        if not title:
            continue
        href = a.get("href", "")
        # Date: walk backwards from the link
        date_match = find_preceding_date(a)
        date_str   = "/".join(date_match) if date_match else ""
        # Speaker + event: walk forwards collecting text, stop at next speech link
        text_after = []
        for nxt in a.next_elements:
            if hasattr(nxt, "name") and nxt.name == "a" and nxt is not a:
                if SPEECH_URL_RE.search(nxt.get("href", "")):
                    break
            if isinstance(nxt, str):
                text_after.append(nxt)
            if sum(len(x) for x in text_after) > 500:
                break
        block = " ".join(text_after)
        # Clean up: collapse whitespace
        block = re.sub(r"\s+", " ", block).strip()
        # Strip the title if it leaked into the block (happens when <a> is nested)
        if title in block:
            block = block.split(title, 1)[-1].strip()
        # Strip the next item's date from the tail
        block = re.sub(r"\d{1,2}/\d{1,2}/\d{4}.*$", "", block).strip()
        # Strip trailing "Watch Live" / "Video" boilerplate
        block = re.sub(r"\b(Watch Live|Video)\b\s*", "", block).strip()
        # Split speaker / event on " At "
        if " At " in block:
            speaker, _, event = block.partition(" At ")
            event = "At " + event
        else:
            speaker, event = block, ""
        records.append({
            "date":    date_str,
            "title":   title,
            "url":     ROOT + href if href.startswith("/") else href,
            "speaker": speaker.strip()[:200],
            "event":   event.strip()[:300],
        })
    return records

# Test on the page we already loaded
records = parse_fed_year(soup)
print(f"Parsed {len(records)} records from 2024\n")
pd.DataFrame(records).head()

📊 Reading the output. All 107 records parsed into a DataFrame with five columns (date, title, url, speaker, event). The cleanup we added — strip the title, strip the next item's date, strip "Watch Live"/"Video" — gives clean speaker names and event descriptions. Compare row 0: speaker = "Governor Adriana D. Kugler", event = "At the Detroit Economic Club, Detroit, Michigan". Production-quality.

4.3 Scale to multiple years¶

The function works on any year — we just change the URL. Loop with a polite time.sleep between requests.

In [ ]:
def scrape_year(year, pause=1.0):
    """Scrape one full year of Fed speeches."""
    time.sleep(pause)
    url = YEAR_URL.format(year=year)
    try:
        r = requests.get(url, headers=HEADERS, timeout=15)
        r.raise_for_status()
    except requests.RequestException as e:
        print(f"  {year}: error → {e}")
        return []
    s = BeautifulSoup(r.text, "lxml")
    recs = parse_fed_year(s)
    for rec in recs:
        rec["year"] = year
    return recs

all_records = []
for yr in range(2020, 2025):
    recs = scrape_year(yr)
    all_records.extend(recs)
    print(f"  {yr}: {len(recs)} speeches")

index_df = pd.DataFrame(all_records)
print(f"\nTotal: {len(index_df)} speeches across {len(set(index_df['year']))} years")
index_df.head()

📊 Reading the output. Five years aggregated: 53 + 69 + 49 + 97 + 107 = 375 matching links for 2020–2024. 2024 stands out (107), possibly reflecting the Fed's increased communication around the rate-cut cycle, but these are raw link counts, so hold the interpretation until the cleaning step in 4.4. With time.sleep(1) between requests, this took a little over 5 seconds: 5 × 1 s pause plus page downloads.

4.4 Parse dates and clean up¶

In [ ]:
index_df["date_parsed"] = pd.to_datetime(
    index_df["date"], format="%m/%d/%Y", errors="coerce"
)

# Drop rows where date couldn't be parsed (a few stray sidebar/breadcrumb links
# may have matched our URL regex without a preceding date — those are noise).
n_before = len(index_df)
index_df = index_df.dropna(subset=["date_parsed"]).reset_index(drop=True)
n_after  = len(index_df)
print(f"Dropped {n_before - n_after} rows with unparseable dates ({n_after} remaining).\n")

index_df = index_df.sort_values("date_parsed").reset_index(drop=True)
print(index_df.dtypes)
print()
print(index_df[["date_parsed","title","speaker"]].tail(5))

📊 Reading the output. Important diagnostic: 82 of 375 rows had unparseable dates. Those are typically links to "speeches archive", index pages, or sidebar promos that match our URL regex but lack a preceding date in the document flow. Dropping them leaves 293 real speeches — a healthier base for analysis. Always check dropna counts in scraping pipelines: silent data loss is the #1 source of bad results.

4.5 First descriptive: speeches per year¶

In [ ]:
import matplotlib.pyplot as plt

counts = index_df.groupby("year")["title"].count()
fig, ax = plt.subplots(figsize=(8, 4))
counts.plot(kind="bar", ax=ax, color="steelblue", edgecolor="white")
ax.set_title("Federal Reserve Board speeches per year (2020–2024)")
ax.set_xlabel("Year"); ax.set_ylabel("Count")
fig.tight_layout()
plt.show()

4.6 Scraping the full text of individual speeches¶

The index gives us titles and URLs. To build a text corpus, we follow each URL and extract the body of the speech. Fed speech pages put the actual text inside the main content area — we'll grab the <div id="article"> (or fall back to a generic <article> / <main> if that ID changes in future).

In [ ]:
def scrape_speech_text(url, pause=1.0):
    """Fetch a single Fed speech page and return its full text."""
    time.sleep(pause)
    try:
        r = requests.get(url, headers=HEADERS, timeout=15)
        r.raise_for_status()
    except requests.RequestException:
        return ""
    s = BeautifulSoup(r.text, "lxml")
    # Try the most specific container first, fall back gracefully
    article = (s.find("div", id="article") or
               s.find("article") or
               s.find("main") or
               s.find("body"))
    paragraphs = article.find_all("p") if article else []
    return " ".join(p.get_text(" ", strip=True) for p in paragraphs
                    if len(p.get_text(strip=True)) > 20)

# Test on the most recent speech in our index
test_url  = index_df.iloc[-1]["url"]
test_body = scrape_speech_text(test_url)
print(f"URL    : {test_url}")
print(f"Length : {len(test_body):,} characters")
print(f"Preview: {test_body[:400]}...")

📊 Reading the output. Following one URL into the speech page yields 17,206 characters of clean text (the Kugler speech). The fallback chain <div id="article"> → <article> → <main> → <body> makes the function survive minor site updates: even if the Fed changes the container, we'll still find paragraphs.

4.7 Build the text corpus (sample)¶

Iterating over hundreds of speeches takes time (rate-limited at 1 second per request, ~7 minutes for 5 years). For the lecture demo we sample 10 — for the exercise you scale up.
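
A quick back-of-the-envelope calculation tells you before you start whether a scrape is a coffee break or an overnight job. The sketch below assumes requests run one after another and that each page takes about 0.4 s to download. That download figure is an assumption chosen for illustration, not a measurement; with it, the 293 speeches kept in Section 4.4 come out at roughly 7 minutes.

```python
def estimate_minutes(n_docs, pause=1.0, avg_download=0.4):
    """Rough wall-clock time for a sequential scrape, in minutes."""
    return n_docs * (pause + avg_download) / 60

for n in (10, 293, 1000):
    print(f"{n:>5} documents ≈ {estimate_minutes(n):.1f} min")
```

Measure avg_download on your own connection (r.elapsed.total_seconds() from Section 2.1 gives it per request) and plug it in before trusting the estimate.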

In [ ]:
sample = index_df.tail(10).reset_index(drop=True).copy()
corpus = []

for i, row in sample.iterrows():
    print(f"  [{i+1:2d}/{len(sample)}] {row['date_parsed'].date()} — {row['speaker'][:40]}")
    body = scrape_speech_text(row["url"])
    rec = row.to_dict()
    rec["body"]    = body
    rec["n_chars"] = len(body)
    corpus.append(rec)

corpus_df = pd.DataFrame(corpus)
print(f"\nCorpus: {len(corpus_df)} documents, avg {corpus_df['n_chars'].mean():.0f} chars each")
corpus_df[["date_parsed","title","speaker","n_chars"]]

📊 Reading the output. Building a 10-document corpus took ~10 seconds (with rate limiting). The progress prints ([ N/10] date — speaker) are essential when scaling: they tell you the loop is alive and where it is. For the 653-document full corpus in the exercise, you'll want even more frequent feedback (every 50 docs).
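
One way to get "every 50 docs" feedback is a small predicate you call inside the loop. This is only a sketch: `should_report` is a hypothetical helper, and 653 is the corpus size quoted above.

```python
def should_report(done, total, every=50):
    """True on every `every`-th document and on the very last one."""
    return done % every == 0 or done == total

total = 653  # size of the full corpus mentioned above
reports = [n for n in range(1, total + 1) if should_report(n, total)]
print(reports[:3], "...", reports[-2:])
```

Inside a real scraping loop you would write `if should_report(i + 1, len(index_df)): print(...)` instead of printing on every iteration.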

4.8 Save the corpus¶

In [ ]:
corpus_df.to_csv("fed_speeches_corpus.csv", index=False)
print(f"Saved: fed_speeches_corpus.csv ({len(corpus_df)} rows)")

📊 Reading the output. Saved to disk — this is the actual input to L7's NLP analysis. From now on, you don't need to re-scrape; just pd.read_csv("fed_speeches_corpus.csv"). Always save scraped data to disk immediately: scraping is slow and brittle, but reading CSV is fast and reliable.
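
One caveat when reloading: CSV stores everything as text, so a plain `pd.read_csv` gives back dates as strings and empty bodies as NaN. The sketch below uses two made-up rows and an in-memory buffer in place of the real file to show the difference that `parse_dates` and `keep_default_na=False` make.

```python
import io
import pandas as pd

# Two made-up rows; the second has an empty body, as after a failed request
df = pd.DataFrame({
    "date_parsed": pd.to_datetime(["2024-01-15", "2024-02-20"]),
    "title": ["Speech A", "Speech B"],
    "body": ["Some text.", ""],
})

buf = io.StringIO()          # stands in for fed_speeches_corpus.csv
df.to_csv(buf, index=False)

buf.seek(0)
naive = pd.read_csv(buf)     # dates come back as plain strings, "" as NaN

buf.seek(0)
careful = pd.read_csv(buf, parse_dates=["date_parsed"], keep_default_na=False)

print(naive.dtypes, careful.dtypes, sep="\n\n")
```

With the `careful` version, code such as `row["date_parsed"].date()` and `len(row["body"])` keeps working after a reload.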


⏱ Five-minute challenge — scrape your first titles¶

Your task: scrape the 5 most recent press releases from the Federal Reserve 2024 archive.

URL: https://www.federalreserve.gov/newsevents/pressreleases/2024-press.htm

Build a DataFrame with columns title and url. Five minutes.

Hint. Press release URLs follow a pattern similar to (but not identical to) the speech URLs we used in Section 4:

/newsevents/pressreleases/{type}{YYYYMMDD}{letter}.htm

For example: monetary20240731a.htm, enforcement20241115b.htm, bcreg20241023a.htm. The {type} part identifies the kind of release (monetary policy, enforcement, bank regulation, etc.) — your regex needs to allow any letters there.

The list on the page is in reverse chronological order, so the 5 most recent are simply the first 5 matches.

Why this exercise matters: you've already seen the same domain (Federal Reserve) work on the speeches archive. Now you're applying the same technique to a different content type on the same site. Real research scraping is mostly about pattern recognition and adaptation, not memorising selectors.

In [ ]:
# YOUR CODE HERE — five minutes!
# Hint: build a regex similar to SPEECH_URL_RE but matching:
#   /newsevents/pressreleases/{type}{YYYYMMDD}{letter}.htm
# where {type} is one or more lowercase letters and {letter} is exactly one letter.

URL = "https://www.federalreserve.gov/newsevents/pressreleases/2024-press.htm"

# resp = requests.get(URL, headers=HEADERS, timeout=10)
# soup = BeautifulSoup(resp.text, "lxml")
# PRESS_URL_RE = re.compile(r"...")
# items = []
# for a in soup.find_all("a", href=PRESS_URL_RE):
#     ...
# pd.DataFrame(items[:5])
In [ ]:
URL = "https://www.federalreserve.gov/newsevents/pressreleases/2024-press.htm"

resp = requests.get(URL, headers=HEADERS, timeout=10)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "lxml")

# Press release URLs: /newsevents/pressreleases/{type}{YYYYMMDD}{letter}.htm
# Same shape as speeches but with a {type} prefix (monetary, enforcement, bcreg, ...)
PRESS_URL_RE = re.compile(r"/newsevents/pressreleases/[a-z]+\d{8}[a-z]\.htm$")

items = []
for a in soup.find_all("a", href=PRESS_URL_RE):
    title = a.get_text(strip=True)
    if not title:
        continue
    href = a.get("href", "")
    items.append({
        "title": title,
        "url":   "https://www.federalreserve.gov" + href if href.startswith("/") else href,
    })

# The Fed lists releases in reverse-chronological order on this page,
# so the 5 most recent are simply the first 5 items.
df_press = pd.DataFrame(items).head(5)
print(f"Total 2024 press releases found: {len(items)}")
df_press

📊 Reading the output. 120 press releases in 2024, more than the speeches (107), because "press releases" covers monetary policy decisions, enforcement actions, bank regulation, etc. Note what actually changed in the regex: only the path (/newsevents/pressreleases/ instead of /newsevents/speech/). The [a-z]+ prefix that matched speaker surnames before now matches release types such as monetary, enforcement, and bcreg. Same skill, different prefix: that's the pattern recognition we wanted to teach.
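
Before pointing a regex at a live page, it is cheap to test it on a handful of strings. The sketch below checks the pattern against the three example filenames from the hint, plus two paths that should not match (the index page itself and a speech URL).

```python
import re

PRESS_URL_RE = re.compile(r"/newsevents/pressreleases/[a-z]+\d{8}[a-z]\.htm$")

hrefs = [
    "/newsevents/pressreleases/monetary20240731a.htm",     # examples from the hint
    "/newsevents/pressreleases/enforcement20241115b.htm",
    "/newsevents/pressreleases/bcreg20241023a.htm",
    "/newsevents/pressreleases/2024-press.htm",            # the index page: no {type}{date}{letter}
    "/newsevents/speech/powell20240930a.htm",              # a speech, wrong path
]

matches = [h for h in hrefs if PRESS_URL_RE.search(h)]
for h in hrefs:
    print(f"{'match' if h in matches else 'skip ':5s}  {h}")
```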


5. Ethics and legality of web scraping — the practical handbook¶

We covered the three triage rules at the start of the lecture. Here is the deeper version you'll consult before each scraping project.

5.1 robots.txt — what to look for¶

Most websites publish a robots.txt file at domain.com/robots.txt. Read it like this:

  • User-agent: * followed by Disallow: /admin/ → no crawler may access /admin/
  • Crawl-delay: 5 → wait at least 5 seconds between requests
  • Sitemap: https://... → a structured list of pages, often gold for scraping

If your target path is in a Disallow: rule, do not scrape it.

In [ ]:
# Always check robots.txt before scraping
robots = requests.get("https://www.federalreserve.gov/robots.txt", headers=HEADERS, timeout=5)
print(robots.text[:500])
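
Reading robots.txt by eye works for one site; for a scraper that touches several, the standard library can do the check for you. The sketch below uses `urllib.robotparser` on made-up rules (not the Fed's actual file), and `teaching-bot` is a hypothetical user-agent name.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration; NOT the Fed's real robots.txt
rules = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 5
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)             # parse lines we already have; no network request

agent = "teaching-bot"      # hypothetical user-agent name
blocked = rp.can_fetch(agent, "https://example.org/admin/users")
allowed = rp.can_fetch(agent, "https://example.org/speeches/2024")
delay   = rp.crawl_delay(agent)
print(blocked, allowed, delay)
```

On a live site you would feed it `robots.text.splitlines()` from the request above and call `can_fetch` with your real target URL before starting the loop.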

5.2 Practical rules¶

  1. Respect robots.txt: do not scrape paths marked Disallow
  2. Rate limiting: add time.sleep(1.0) or more between requests — never hammer a server
  3. Terms of service: some sites explicitly prohibit scraping; research use is often an exception
  4. Data protection: do not scrape and publish personal data (names, emails) without legal basis
  5. Public data preference: if an API or bulk download is available, use that instead
  6. Identify yourself: some researchers add a contact email to the User-Agent header as a courtesy

For academic research, scraping publicly accessible pages for non-commercial analysis is generally accepted practice in most jurisdictions. When in doubt, check your institution's guidelines or contact the site owner.


6. Basic text cleaning¶

Raw scraped text needs cleaning before any NLP analysis. We will go much deeper in L7, but here are the basics.

In [ ]:
def clean_text(text):
    """
    Basic text cleaning pipeline.
    - Remove HTML entities and tags
    - Normalise whitespace
    - Strip leading/trailing whitespace
    """
    if not isinstance(text, str):
        return ""
    # Remove HTML tags (safety net — BeautifulSoup should have handled these)
    text = re.sub(r"<[^>]+>", " ", text)
    # Replace common HTML entities
    text = text.replace("&amp;", "&").replace("&nbsp;", " ").replace("&euro;", "€")
    # Normalise whitespace: collapse multiple spaces/newlines into one space
    text = re.sub(r"\s+", " ", text)
    return text.strip()

# Apply to corpus
corpus_df["body_clean"] = corpus_df["body"].apply(clean_text)

# Quick check
for _, row in corpus_df.head(3).iterrows():
    print(f"--- {row['date_parsed'].date()} ---")
    print(row["body_clean"][:200])
    print()

📊 Reading the output. Notice the artifact: Accessible Keys for Video [Space Bar] toggles play/pau.... That's video-player UI text that the speech page includes for accessibility — it leaked into our scrape because it lives in <p> tags. Realistic scraping always hits this kind of noise: the cleaner you write, the better your downstream analysis. We'll address this systematically in L7.
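
Until then, a crude stop-gap is to drop paragraphs that contain a known noise phrase. This is only a sketch: `NOISE_MARKERS` and `drop_noise` are hypothetical names, the sample paragraphs are made up apart from the marker phrase seen above, and the filter has to run on the list of paragraphs, before they are joined into one string.

```python
# Phrases that flag UI text rather than speech content; extend as you find more
NOISE_MARKERS = ("Accessible Keys for Video",)

def drop_noise(paragraphs, markers=NOISE_MARKERS):
    """Keep only paragraphs that contain none of the marker phrases."""
    return [p for p in paragraphs
            if not any(m.lower() in p.lower() for m in markers)]

paragraphs = [
    "Accessible Keys for Video [Space Bar] ...",       # video-player residue
    "Thank you for the invitation to speak today.",    # made-up speech text
]

kept = drop_noise(paragraphs)
print(kept)
```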


7. Diagnosing scraping problems¶

When your scraper returns nothing, it's almost always one of these. Work through them in order:

| Symptom | Likely cause | Fix |
|---|---|---|
| soup.find(...) returns None but element is visible in browser | Page renders content via JavaScript after load | L6: use Selenium, or look for the underlying JSON API |
| r.status_code == 403 | Server blocks default python-requests/x.y user agent | Set a realistic User-Agent header (cell 6) |
| r.status_code == 429 | Rate-limited: too many requests too fast | Increase time.sleep(); use requests.Session() (L6) |
| Garbled accents (Ã© instead of é) | Encoding mismatch | Set r.encoding = "utf-8" before calling r.text, or use r.content and decode manually |
| r.text ends mid-page | Server closes the connection | Add timeout= and retries |
| Elements present but .text is empty | Content is in data-* attributes or nested <noscript> | Use el.get("data-content"), or move to Selenium |
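
The "retries" fix can be sketched as a small wrapper with exponential backoff. `fetch_with_retries` is a hypothetical helper, not part of requests; it takes any callable so the demo runs without a network, using a fake fetcher that fails twice before succeeding.

```python
import time

def fetch_with_retries(fetch, url, max_tries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url); after a failure wait base_delay * 2**attempt, then retry."""
    for attempt in range(max_tries):
        try:
            return fetch(url)
        except Exception:   # in a real scraper, catch requests.RequestException instead
            if attempt == max_tries - 1:
                raise       # out of attempts: let the caller see the error
            sleep(base_delay * 2 ** attempt)

# Demo: a fake fetcher that fails twice, then succeeds
calls = {"n": 0}

def flaky_fetch(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated dropped connection")
    return f"<html>ok: {url}</html>"

waits = []   # record the waits instead of actually sleeping
page = fetch_with_retries(flaky_fetch, "https://example.org/page", sleep=waits.append)
print(page, waits)
```

In a scraper you would pass something like `lambda u: requests.get(u, headers=HEADERS, timeout=15)` as `fetch` and leave `sleep` at its default.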

The decisive diagnostic. Open the page in Chrome, hit Ctrl+U (View Page Source). If the content you want is in the source, requests + BeautifulSoup is enough. If the source is a mostly-empty shell with a giant <script> tag, you need Selenium — that's L6.

In [ ]:
# Quick encoding-detection helper: useful when accents look broken
def detect_and_fix_encoding(response):
    declared = response.encoding            # from the Content-Type header; may be None
    apparent = response.apparent_encoding   # guessed from the bytes (chardet or charset_normalizer)
    if apparent and (declared or "").lower() != apparent.lower():
        print(f"  Encoding mismatch: declared={declared}, apparent={apparent}")
        response.encoding = apparent
    return response.text

# Test the page-source heuristic on the speech page from Section 4 (test_url)
r = requests.get(test_url, headers=HEADERS, timeout=10)
text = detect_and_fix_encoding(r)
has_meaningful_html = text.lower().count("<p>") > 5 or text.lower().count("<div") > 20
print(f"Static HTML appears usable: {has_meaningful_html}")

📊 Reading the output. requests reports the declared encoding as ISO-8859-1 (Latin-1), which is also its fallback for text responses whose Content-Type header carries no charset, but the file is actually UTF-8-SIG (UTF-8 with BOM). If we trusted the declared encoding, accented characters would render as garbage (Ã© instead of é). The byte-level detection behind apparent_encoding catches this: a small but critical defensive measure for international scraping.
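
The garbage characters are not random: they are exactly what you get when UTF-8 bytes are decoded as Latin-1. A minimal sketch, using an arbitrary French word as the example:

```python
original = "décision"

# UTF-8 encodes é as two bytes (0xC3 0xA9); Latin-1 reads each byte as one character
mojibake = original.encode("utf-8").decode("latin-1")

# The damage is reversible as long as no bytes were lost along the way
repaired = mojibake.encode("latin-1").decode("utf-8")

print(original, "->", mojibake, "->", repaired)
```

This is why setting the right encoding before reading `r.text` matters: once mojibake is saved to CSV and mixed with correctly decoded rows, repairing it row by row becomes much harder.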


8. Exercise¶

Time: ~25 minutes. Work individually.

The goal is to build a small corpus from a source relevant to your own research proposal.

Option A — Fed speeches (recommended if your topic is monetary policy / central bank communication)¶

Extend the scraper from the lecture to collect all Federal Reserve Board speeches from 2015 to 2024. For each speech extract: date, title, URL, speaker, event description, and full body text. Save as CSV.

Then answer these questions with code:

  1. How many speeches per year?
  2. What is the average and median speech length (in characters)?
  3. Who are the 5 most prolific speakers in the period?

Option B — Banca d'Italia governor's speeches (recommended if you prefer an Italian source)¶

Scrape https://www.bancaditalia.it/pubblicazioni/interventi-governatore/integov{year}/index.html across 2020–2025. Build a metadata DataFrame (date, title, URL, event description). The full speeches are in PDF — out of scope for requests + BeautifulSoup, but you can list the PDF URLs.

Option C — your own source¶

If you have already identified a data source for your research proposal, build a scraper for it. Minimum deliverable: a CSV with at least 50 rows and at least 3 columns (date/id, title/headline, URL or body text).

In [ ]:
# YOUR CODE HERE
# Suggested structure:

# Step 1: define scraper function(s)

# Step 2: run scraper across target pages

# Step 3: build DataFrame, clean, save to CSV

# Step 4: descriptive analysis (document counts, lengths, top entities)
In [ ]:
# ── SOLUTION — Option A (Fed speeches, full corpus 2015–2024) ───────────────────
#
# This cell is INTERRUPT-FRIENDLY and RESUMABLE:
#   • Progress is checkpointed to `fed_corpus_partial.csv` every 50 documents.
#   • Press the stop button (■) any time → progress is saved, no work lost.
#   • Re-run the cell → it skips already-scraped URLs and continues from there.
#   • Final run produces `fed_corpus.csv` with all collected documents.
#
import requests
from bs4 import BeautifulSoup
import pandas as pd
import matplotlib.pyplot as plt
import time, re, os

ROOT          = "https://www.federalreserve.gov"
YEAR_URL      = "https://www.federalreserve.gov/newsevents/speech/{year}-speeches.htm"
SPEECH_URL_RE = re.compile(r"/newsevents/speech/[a-z]+\d{8}[a-z]\.htm$")
DATE_RE       = re.compile(r"(\d{1,2})/(\d{1,2})/(\d{4})")
HEADERS       = {"User-Agent": "Mozilla/5.0 (academic research, contact: yourname@unibo.it)"}

PARTIAL_CSV = "fed_corpus_partial.csv"
FINAL_CSV   = "fed_corpus.csv"

def safe_get(url, pause=1.0):
    time.sleep(pause)
    try:
        r = requests.get(url, headers=HEADERS, timeout=15)
        r.raise_for_status()
        return r
    except requests.RequestException as e:
        print(f"  Error: {e}")
        return None

def find_preceding_date(link, max_chars=300):
    text_before = ""
    for prev in link.previous_elements:
        if hasattr(prev, "get_text"):
            text_before = prev.get_text(" ", strip=True) + " " + text_before
        elif isinstance(prev, str):
            text_before = prev + " " + text_before
        if len(text_before) > max_chars:
            break
    m = DATE_RE.findall(text_before)
    return "/".join(m[-1]) if m else ""

def parse_fed_year(soup):
    records = []
    for a in soup.find_all("a", href=SPEECH_URL_RE):
        title = a.get_text(strip=True)
        if not title:
            continue
        href = a.get("href", "")
        date_str = find_preceding_date(a)
        text_after = []
        for nxt in a.next_elements:
            if hasattr(nxt, "name") and nxt.name == "a" and nxt is not a and SPEECH_URL_RE.search(nxt.get("href", "")):
                break
            if isinstance(nxt, str):
                text_after.append(nxt)
            if sum(len(x) for x in text_after) > 500:
                break
        block = re.sub(r"\s+", " ", " ".join(text_after)).strip()
        if title in block:
            block = block.split(title, 1)[-1].strip()
        block = re.sub(r"\d{1,2}/\d{1,2}/\d{4}.*$", "", block).strip()
        block = re.sub(r"\b(Watch Live|Video)\b\s*", "", block).strip()
        speaker, _, event = block.partition(" At ")
        event = ("At " + event) if event else ""
        records.append({
            "date":  date_str,
            "title": title,
            "url":   ROOT + href if href.startswith("/") else href,
            "speaker": speaker.strip()[:200],
            "event":   event.strip()[:300],
        })
    return records

def scrape_speech_text(url):
    r = safe_get(url)
    if not r: return ""
    s = BeautifulSoup(r.text, "lxml")
    article = (s.find("div", id="article") or s.find("article") or
               s.find("main") or s.find("body"))
    paras = article.find_all("p") if article else []
    return " ".join(p.get_text(" ", strip=True) for p in paras
                    if len(p.get_text(strip=True)) > 20)

# ── Step 1: build the index ─────────────────────────────────────────────────
all_records = []
for yr in range(2015, 2025):
    r = safe_get(YEAR_URL.format(year=yr))
    if not r: continue
    recs = parse_fed_year(BeautifulSoup(r.text, "lxml"))
    for rec in recs: rec["year"] = yr
    all_records.extend(recs)
    print(f"  {yr}: {len(recs)} speeches")

index_df = pd.DataFrame(all_records)
index_df["date_parsed"] = pd.to_datetime(index_df["date"], format="%m/%d/%Y", errors="coerce")
index_df = index_df.dropna(subset=["date_parsed"]).sort_values("date_parsed").reset_index(drop=True)
print(f"\nIndex: {len(index_df)} speeches\n")

# ── Step 2: scrape full text — RESUMABLE ────────────────────────────────────
# Load any prior partial work
done_urls = set()
prior_records = []
if os.path.exists(PARTIAL_CSV):
    prior = pd.read_csv(PARTIAL_CSV)
    prior_records = prior.to_dict("records")
    done_urls = set(prior["url"].tolist())
    print(f"Resuming: {len(done_urls)} documents already scraped, "
          f"{len(index_df) - len(done_urls)} remaining.\n")

corpus     = list(prior_records)
new_count  = 0
checkpoint = 50

try:
    for i, row in index_df.iterrows():
        if row["url"] in done_urls:
            continue
        body = scrape_speech_text(row["url"])
        rec  = {**row.to_dict(), "body": body, "n_chars": len(body)}
        corpus.append(rec)
        new_count += 1

        # Checkpoint every `checkpoint` new documents
        if new_count % checkpoint == 0:
            pd.DataFrame(corpus).to_csv(PARTIAL_CSV, index=False)
            print(f"  Checkpoint: {len(corpus)}/{len(index_df)} documents saved to {PARTIAL_CSV}")

except KeyboardInterrupt:
    print(f"\n  ⏸  Interrupted at {len(corpus)}/{len(index_df)} documents.")
    pd.DataFrame(corpus).to_csv(PARTIAL_CSV, index=False)
    print(f"  Progress saved to {PARTIAL_CSV}. Re-run the cell to resume.")
    raise   # so you see the traceback and know it stopped

# ── Step 3: finalise ────────────────────────────────────────────────────────
corpus_df = pd.DataFrame(corpus)
corpus_df.to_csv(PARTIAL_CSV, index=False)   # final checkpoint, so a re-run skips every URL
corpus_df.to_csv(FINAL_CSV, index=False)
print(f"\n✓ Done: {len(corpus_df)} documents saved to {FINAL_CSV}")

# ── Step 4: descriptive analysis ────────────────────────────────────────────
print("\nSpeeches per year:")
print(corpus_df.groupby("year")["title"].count())

print(f"\nAvg length    : {corpus_df['n_chars'].mean():.0f} chars")
print(f"Median length : {corpus_df['n_chars'].median():.0f} chars")

print("\nTop 5 speakers (by speech count):")
print(corpus_df["speaker"].value_counts().head())

# Plot: speeches per year
fig, ax = plt.subplots(figsize=(10, 4))
corpus_df.groupby("year")["title"].count().plot(
    kind="bar", color="steelblue", edgecolor="white", ax=ax)
ax.set_title("Federal Reserve speeches per year, 2015–2024")
ax.set_xlabel("Year"); ax.set_ylabel("Count")
fig.tight_layout()
fig.savefig("fed_speeches_by_year.png", dpi=300, bbox_inches="tight")
plt.show()

Summary¶

| Step | Tool | Key concept |
|---|---|---|
| Make a request | requests.get(url) | Check .status_code before using .text |
| Parse HTML | BeautifulSoup(html, "lxml") | find(), find_all(), select() |
| Extract text | .text.strip() | Always strip whitespace |
| Extract attributes | .get("href") | Links, IDs, classes |
| Navigate siblings | .find_next_sibling() | Pair dates with content |
| Be polite | time.sleep(1.5) | Avoid 429 errors and server load |
| Handle errors | raise_for_status() + try/except | Never assume 200 |

Next lecture¶

Lecture 6 — Web scraping II: cookies, session headers, and selenium for JavaScript-heavy sites. We will also start pre-processing the text corpora you have begun to build.