Lecture 6 - Web Scraping II: Sessions, Headers & Selenium

Python for Economists · University of Bologna · 2025/2026


What we cover today

  1. Warm-up: when requests isn't enough (Selenium in action)
  2. Sessions and cookies
  3. Handling pagination systematically
  4. Selenium: scraping JavaScript-rendered pages (the in-depth tour)
  5. Building and cleaning a text corpus
  6. Milestone: your corpus is ready
  7. Exercise: extend your corpus with metadata
  8. (optional) From script to production: the Page Object Model

Prerequisites

Before running this notebook, install Selenium in your course environment. From a terminal:

conda activate econ
conda install -c conda-forge selenium

After installation, restart the Jupyter kernel (Kernel → Restart) so the new package is picked up.

Note: You also need Google Chrome installed on your machine. The matching ChromeDriver is downloaded automatically by Selenium the first time you launch the browser, so no manual setup is needed.

Troubleshooting

If webdriver.Chrome() raises WebDriverException: Unable to obtain working Selenium Manager binary:

This means the platform-specific binary that Selenium uses to download ChromeDriver wasn't installed correctly, typically because pip fell back to a source distribution that doesn't include it (common on bleeding-edge Python versions). Reinstall, forcing a pre-built wheel:

conda activate econ
pip uninstall -y selenium
pip install --no-cache-dir --force-reinstall --only-binary=:all: selenium

On macOS only, if the binary is present but blocked by Gatekeeper, fix permissions and clear the quarantine flag:

SM=$(python -c "import selenium, os; print(os.path.join(os.path.dirname(selenium.__file__), 'webdriver/common/macos/selenium-manager'))")
chmod +x "$SM"
xattr -d com.apple.quarantine "$SM" 2>/dev/null || true

After any of the above, restart the Jupyter kernel before retrying the demo.

Why this happens. Anaconda installations from late 2025 onwards may default to Python 3.14, which is too new for some pre-built wheels. If the trouble persists, recreate the econ environment pinning Python 3.12 (conda create -n econ python=3.12 -c conda-forge); that's the version this course is tested against.
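
Before reinstalling anything, it is worth confirming which interpreter the notebook kernel is actually running and whether Selenium is visible from it. The cell below is a minimal sketch using only the standard library; the helper name environment_report is made up for this illustration.

```python
import sys
from importlib import metadata

def environment_report(package="selenium"):
    """Summarise which interpreter is running and whether `package` is installed."""
    try:
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        version = None          # not installed in THIS environment
    return {
        "python": "{}.{}.{}".format(*sys.version_info[:3]),
        "executable": sys.executable,
        package: version,
    }

report = environment_report()
for key, value in report.items():
    print(f"{key:>10}: {value}")
```

If the executable path does not point inside the econ environment, the kernel is using a different Python than the one you installed Selenium into.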

In [ ]:
# pip install requests beautifulsoup4 lxml selenium
# Note: since Selenium 4.6 (Nov 2022), the driver binary is fetched automatically
# by Selenium Manager, so webdriver-manager is no longer needed.
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time, re, json
from pathlib import Path

print("Libraries loaded.")

0. Warm-up: when requests isn't enough

Last lecture we scraped with requests + BeautifulSoup. That works when the server sends you the HTML you need. But many modern sites don't: they send a mostly-empty HTML shell, and JavaScript fills in the content after the page loads. If you requests.get() such a site, you get the empty shell and nothing useful.

The fix: drive a real browser from Python. That's Selenium: it controls a real Chrome window programmatically, seeing the page exactly as a human would (after all JavaScript has executed).

Demo: same page, two approaches

We use quotes.toscrape.com/js/, a teaching site explicitly built for scraping practice. It's the JavaScript-rendered variant of the static page you saw last lecture. Watch what happens with requests vs. with Selenium on the same URL.

⚠️ Important framing. Selenium gives you a programmable browser, which means you can automate logins, hammer commercial endpoints, click through forms at scale. Don't. The same techniques applied to social media, e-commerce, paid platforms, or sites with bot-detection (e.g. Cloudflare) typically violate their Terms of Service and can have legal consequences.

The point of this demo is the technique: rendering JS, locating elements, simulating events. These are the exact skills you'll use for legitimate research scraping on public sources (ECB, parliaments, news archives, government portals).

Why we use toscrape.com: it's a sandbox site explicitly built to be scraped. No ToS issue, no bot detection, stable HTML: perfect for teaching.

In [ ]:
# Demo: when `requests` isn't enough. The SAME URL, two approaches.
#
# Run HEADED in class so students see the browser open.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

URL = "https://quotes.toscrape.com/js/"

# --- Attempt 1: plain `requests` -------------------------------------------
print(">>> Attempt 1: requests + BeautifulSoup")
html = requests.get(URL, timeout=10).text
soup = BeautifulSoup(html, "lxml")
quotes_via_requests = soup.select("div.quote span.text")
print(f"   Quotes found: {len(quotes_via_requests)}")
print(f"   Length of HTML returned: {len(html):,} characters")
print()
# Spoiler: 0 quotes. The HTML is a JavaScript shell; the quotes will be
# generated *in the browser* once the JS runs. `requests` doesn't run JS.

# --- Attempt 2: Selenium ----------------------------------------------------
print(">>> Attempt 2: Selenium (real browser)")
options = Options()
# options.add_argument("--headless=new")   # un-comment for quiet runs
options.add_argument("--window-size=1280,900")
driver = webdriver.Chrome(options=options)

try:
    driver.get(URL)
    # Wait until the JS has populated the DOM with quote elements.
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "div.quote"))
    )
    quotes_via_selenium = driver.find_elements(By.CSS_SELECTOR, "div.quote span.text")
    authors             = driver.find_elements(By.CSS_SELECTOR, "div.quote small.author")
    print(f"   Quotes found: {len(quotes_via_selenium)}")
    print()
    print("   First three quotes:")
    for q, a in zip(quotes_via_selenium[:3], authors[:3]):
        print(f"     • {q.text[:70]}… ({a.text})")
finally:
    driver.quit()

📊 Reading the output. The first block prints Quotes found: 0 even though the same URL renders 10 quotes in any browser. The HTML returned by requests is ~5–6 KB: a JavaScript shell with no quote elements yet. Selenium then opens a real Chrome window, lets the JS execute, waits for div.quote to appear, and finds the same 10 quotes a human would see.

This is the entire reason Selenium exists in scraping work. When requests returns HTML that doesn't contain the data you can see in the browser, it's because that data is generated client-side by JavaScript, and only a real browser will run that JavaScript for you.
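
Before launching a browser, it can help to confirm that the raw HTML really lacks the elements you want, and to check whether the data is sitting inside a script tag instead. The sketch below runs on a made-up HTML shell: SHELL_HTML, its class names and the `var data = [...]` layout are assumptions for illustration, not a copy of any real page. It uses only the standard library.

```python
import json
import re
from html.parser import HTMLParser

# Hypothetical JS shell: no quote elements in the markup, data embedded in a script.
SHELL_HTML = """
<html><body>
  <div class="container"></div>
  <script>
    var data = [{"text": "Hello", "author": "A. Author"},
                {"text": "World", "author": "B. Writer"}];
    // client-side code would build the quote elements from the data variable
  </script>
</body></html>
"""

class ClassCounter(HTMLParser):
    """Count start tags of a given name that carry a given CSS class."""
    def __init__(self, tag, css_class):
        super().__init__()
        self.tag, self.css_class, self.count = tag, css_class, 0

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.css_class in classes:
            self.count += 1

def count_elements(html, tag, css_class):
    parser = ClassCounter(tag, css_class)
    parser.feed(html)
    return parser.count

def embedded_json(html, var_name="data"):
    """Return the JSON array assigned to `var <var_name> = [...]`, or None."""
    pattern = rf"var\s+{re.escape(var_name)}\s*=\s*(\[.*?\])\s*;"
    match = re.search(pattern, html, re.S)
    return json.loads(match.group(1)) if match else None

n_quotes = count_elements(SHELL_HTML, "div", "quote")
records = embedded_json(SHELL_HTML)
print("div.quote elements in raw HTML:", n_quotes)
print("records found inside <script>:", len(records))
```

When the second number is non-zero, the data can sometimes be read straight from the script tag with requests alone; when both are zero, a real browser (or the page's API) is the remaining route.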

A quick word on what Selenium can do, and what you should do

Selenium effectively gives you a programmable human inside a browser. That's a lot of power. With great power, etc.

  • Terms of Service. Automation of commercial sites (games, social media, e-commerce) typically violates their ToS. Don't scrape behind login walls, don't automate user accounts on platforms, don't bypass CAPTCHAs.
  • Academic scraping. For research on public texts (ECB, parliaments, news wires, government sites), Selenium is fine when requests isn't enough: stick to public content, rate-limit yourself, identify your traffic.
  • Personal data. GDPR applies. Don't scrape personal data (names, profiles, addresses) unless you have a lawful basis and have thought about ethics review.

In short: Selenium is a research instrument, not a cheat code.
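
One concrete way to act on "rate-limit yourself, identify your traffic" is to consult a site's robots.txt before scraping it. The sketch below parses a hand-written robots.txt with the standard library's urllib.robotparser; the file contents, the example.org URLs and the bot name are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hand-written robots.txt used as a stand-in; a real scraper would fetch the
# site's own file (RobotFileParser.set_url(...) followed by .read()).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

BOT_NAME = "econ-course-bot"   # hypothetical User-Agent token

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

def allowed(url):
    """True if the parsed rules let BOT_NAME fetch this URL."""
    return rp.can_fetch(BOT_NAME, url)

print(allowed("https://example.org/press/2024/index.html"))
print(allowed("https://example.org/private/members.html"))
print("Requested delay (s):", rp.crawl_delay(BOT_NAME))
```

robots.txt is a convention, not a legal document: passing this check does not replace reading the Terms of Service.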


1. Sessions and cookies

In L5 we sent individual requests. Many websites require session state: a cookie that identifies you as a returning visitor. Without it, the server may return a login page or a 403.

A requests.Session object persists cookies and headers across multiple requests automatically.

In [ ]:
session = requests.Session()
session.headers.update({
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/124.0.0.0 Safari/537.36"
    ),
    "Accept":          "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    # Advertise "br" only if the brotli package is installed; otherwise a
    # Brotli-compressed response cannot be decoded and r.text is unreadable.
    "Accept-Encoding": "gzip, deflate",
    "Connection":      "keep-alive",
})

def safe_get(url, pause=1.5, sess=session):
    """GET request with error handling and polite delay."""
    time.sleep(pause)
    try:
        r = sess.get(url, timeout=15)
        r.raise_for_status()
        return r
    except requests.HTTPError as e:
        print(f"  HTTP {e.response.status_code}: {url}")
        return None
    except requests.RequestException as e:
        print(f"  Error: {e}")
        return None

r = safe_get("https://www.ecb.europa.eu/press/pr/date/2024/html/index_en.html")
print(r.status_code if r else "Failed")

📊 Reading the output. We built a requests.Session() and gave it browser-like headers (User-Agent, Accept-Language, etc.). The session then made one request and printed 200, meaning the server accepted us.

Why this matters. A bare requests.get(url) sends the default User-Agent python-requests/X.Y.Z, which many sites refuse outright. A Session also persists cookies across requests, so if the first request sets a session cookie, the second one will send it back automatically. That is essential for sites that gate content behind a click-through (cookie banners, age gates, login).
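
To make the cookie mechanics concrete, here is a toy model that runs without any network access. It is not how requests implements Session; fake_server, ToySession and the consent cookie are invented for this illustration and only mimic the behaviour described above.

```python
def fake_server(path, cookies):
    """Stand-in for a website: sets a cookie on /accept, gates /archive on it."""
    if path == "/accept":
        return {"status": 200, "set_cookie": {"consent": "yes"}, "body": "thanks"}
    if path == "/archive":
        if cookies.get("consent") == "yes":
            return {"status": 200, "set_cookie": {}, "body": "archive listing"}
        return {"status": 403, "set_cookie": {}, "body": "please accept first"}
    return {"status": 404, "set_cookie": {}, "body": "not found"}

class ToySession:
    """Keeps a cookie dict between calls, the way a Session keeps a cookie jar."""
    def __init__(self):
        self.cookies = {}

    def get(self, path):
        response = fake_server(path, dict(self.cookies))   # send stored cookies
        self.cookies.update(response["set_cookie"])         # remember new ones
        return response

def stateless_get(path):
    """Like a bare requests.get(): every call starts with no cookies."""
    return fake_server(path, {})

# Without a session the consent cookie is lost between the two calls.
stateless_get("/accept")
status_without = stateless_get("/archive")["status"]

# With a session the cookie set by /accept is sent back on /archive.
toy = ToySession()
toy.get("/accept")
status_with = toy.get("/archive")["status"]

print("stateless:", status_without, "| session:", status_with)
```

The same two-step sequence with a real requests.Session() works because the session stores the Set-Cookie values from the first response and attaches them to the second request.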

1.1 When a site requires login

Some archives sit behind a login form. The pattern is:

  1. GET the login page to collect any hidden CSRF tokens
  2. POST your credentials along with the token
  3. The server sets a session cookie β€” subsequent requests are authenticated

We show the pattern here without running it.

In [ ]:
# Pattern only: do not run without real credentials
def login(session, login_url, username, password):
    r = session.get(login_url)
    soup = BeautifulSoup(r.text, "lxml")
    token_tag = soup.find("input", {"name": "csrf_token"})
    csrf = token_tag["value"] if token_tag else ""
    payload = {"username": username, "password": password, "csrf_token": csrf}
    r_post = session.post(login_url, data=payload)
    return "logout" in r_post.text.lower() or r_post.url != login_url

print("Login function defined (not executed).")

2. Handling pagination systematically

Most archives spread results across multiple pages. Three common patterns:

| Pattern | Example URL | Strategy |
|---|---|---|
| Page number in URL | ?page=1, /page/2 | Loop over integers |
| Offset in URL | ?start=0, ?offset=20 | Loop with step |
| "Next" link in HTML | <a class="next">Next</a> | Follow links until absent |

In [ ]:
# Pattern 1: page number in URL
def scrape_all_pages_numbered(base_url, max_pages=10):
    all_items = []
    for page in range(1, max_pages + 1):
        url = base_url.format(page=page)
        r = safe_get(url)
        if r is None:
            break
        soup = BeautifulSoup(r.text, "lxml")
        items = soup.find_all("div", class_="result-item")
        if not items:
            print(f"  No results on page {page}, stopping.")
            break
        all_items.extend(items)
        print(f"  Page {page}: {len(items)} items")
    return all_items

# Pattern 2: follow "next" link
from urllib.parse import urljoin

def scrape_follow_next(start_url):
    all_items = []
    url  = start_url
    page = 1
    while url:
        r = safe_get(url)
        if r is None:
            break
        soup  = BeautifulSoup(r.text, "lxml")
        items = soup.find_all("article")
        all_items.extend(items)
        print(f"  Page {page}: {len(items)} items")
        next_tag = (
            soup.find("a", class_="next") or
            soup.find("a", rel="next") or
            soup.find("a", string=re.compile(r"next", re.I))
        )
        if next_tag and next_tag.get("href"):
            # Resolve the href against the page we are on, so relative
            # ("page/2/"), root-relative ("/page/2/") and absolute links all work.
            url = urljoin(url, next_tag["href"])
        else:
            url = None
        page += 1
    return all_items

print("Pagination helpers defined.")
In [ ]:
# Demo: pagination by page number, applied to a stable teaching site.
#
# In your research the same pattern adapts to: ECB speeches, Fed press releases,
# parliamentary records, news archives. Anywhere a list is paginated as
# /page/1, /page/2, … just change the URL template and the CSS selectors.
#
# (We use quotes.toscrape.com instead of a real central-bank archive because
# central-bank portals get redesigned every few years, and a teaching example
# should not be tied to selectors that go stale. For your research scraper,
# always inspect the live page first with DevTools.)

QUOTES_URL = "https://quotes.toscrape.com/page/{page}/"

def scrape_quotes_paginated(max_pages=10):
    rows = []
    for page in range(1, max_pages + 1):
        r = safe_get(QUOTES_URL.format(page=page))
        if r is None:
            break
        soup   = BeautifulSoup(r.text, "lxml")
        quotes = soup.select("div.quote")
        if not quotes:
            print(f"  Page {page}: empty, stopping.")
            break
        for q in quotes:
            rows.append({
                "page":   page,
                "text":   q.select_one("span.text").get_text(strip=True),
                "author": q.select_one("small.author").get_text(strip=True),
                "tags":   ", ".join(t.get_text() for t in q.select("a.tag")),
            })
        print(f"  Page {page}: {len(quotes)} quotes")
    return rows

quote_records = scrape_quotes_paginated(10)
quotes_df     = pd.DataFrame(quote_records)
print(f"\nTotal: {len(quotes_df)} quotes across {quotes_df['page'].max()} pages")
quotes_df.head()

📊 Reading the output. The loop hit pages 1–10, found exactly 10 quotes on each, and produced a 100-row DataFrame. The total prints as 100 quotes across 10 pages.

Two patterns to internalise. First, the URL template "...page/{page}/" plus an integer counter is the simplest pagination strategy and works for most archives: ECB speeches by year, Fed press releases, parliamentary records all follow variants of this. Second, the early-exit if not quotes: break makes the loop self-limiting: when you don't know how many pages exist, stop as soon as a page comes back empty.
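
The "offset in URL" row of the table has no helper in the cells above, so here is a minimal sketch of it. The fetch function is injected so the example runs offline; scrape_by_offset, fake_fetch and ARCHIVE are hypothetical names. In a real scraper, fetch_page would call safe_get on something like "?start={offset}" and return the parsed items.

```python
def scrape_by_offset(fetch_page, step=20, max_items=200):
    """Collect items from an offset-paginated source.

    fetch_page(offset) must return a list of items (empty when exhausted).
    """
    collected = []
    for offset in range(0, max_items, step):
        items = fetch_page(offset)
        if not items:              # same early exit as the page-number loop
            break
        collected.extend(items)
        if len(items) < step:      # a short page means this was the last one
            break
    return collected

# Hypothetical in-memory "archive" of 45 items, served 20 at a time.
ARCHIVE = [f"item-{i}" for i in range(45)]
requested_offsets = []

def fake_fetch(offset, step=20):
    requested_offsets.append(offset)
    return ARCHIVE[offset:offset + step]

results = scrape_by_offset(fake_fetch, step=20, max_items=200)
print(len(results), "items via offsets", requested_offsets)
```

The short-page check saves one request per archive compared with waiting for an empty page; drop it if the site sometimes returns fewer items than the step size mid-archive.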


3. Selenium: the in-depth tour

requests + BeautifulSoup only works on static HTML. Many modern websites render content dynamically via JavaScript. Selenium controls a real browser programmatically, seeing the page exactly as a human would, after all JavaScript has executed.

When do you need Selenium?

  • The element you want is absent from response.text even though it is visible in the browser
  • The page requires clicking, scrolling, or form interaction to load data
  • The URL does not change between pages (infinite scroll)

When you do NOT need Selenium

  • The content is in the raw HTML (verify with Ctrl+U in Chrome)
  • The page loads data via a JSON API: call the API directly (faster, cleaner, more stable; see Section 3.7)

3.1 Driver setup

Since Selenium 4.6 (November 2022), Selenium Manager ships with the library and handles the ChromeDriver binary automatically. You no longer need webdriver-manager or to download chromedriver by hand. Just configure the options and instantiate webdriver.Chrome(options=...).

In [ ]:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def make_driver(headless=True):
    """Create a Chrome WebDriver. headless=False opens a visible window (useful for debugging)."""
    options = Options()
    if headless:
        options.add_argument("--headless=new")
    options.add_argument("--no-sandbox")
    options.add_argument("--disable-dev-shm-usage")
    options.add_argument("--window-size=1280,900")
    options.add_argument(
        "user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
    )
    return webdriver.Chrome(options=options)

print("Driver factory defined.")

3.2 Your first browser session: open, navigate, inspect

The bare-minimum Selenium workflow is three lines: open a driver, get a URL, do something with the page. Always close the driver at the end (driver.quit()).

In [ ]:
driver = make_driver(headless=True)

try:
    driver.get("https://www.ecb.europa.eu")
    print("Page title :", driver.title)
    print("Current URL:", driver.current_url)

    # The page source is the HTML *as the browser sees it*, after JS has run.
    # That's what makes Selenium worth the overhead.
    print("HTML length:", len(driver.page_source))
finally:
    driver.quit()

📊 Reading the output. Three lines of meaningful work: open a Chrome window in the background, navigate to the ECB home page, and print three things about it. The HTML length (~220 KB here) is the post-JavaScript size, considerably larger than what requests.get() would return for the same URL, because by the time we read driver.page_source, all client-side scripts have already run.

Always pair make_driver() with a try/finally: driver.quit(). Without quit(), the Chrome and chromedriver processes keep running in the background after the cell finishes, holding memory and a local port. The finally block also runs when you interrupt the cell with Ctrl+C; code that skips it leaks one browser per failed run, and enough leaked browsers will eventually slow down or break later cells.
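
If you find yourself writing the same try/finally in every cell, one option is to wrap it in a context manager. This is a sketch rather than a course convention: managed_driver is a hypothetical helper, and FakeDriver stands in for Chrome so the example runs without a browser.

```python
from contextlib import contextmanager

@contextmanager
def managed_driver(factory, **kwargs):
    """Create a driver with factory(**kwargs) and always quit it afterwards."""
    driver = factory(**kwargs)
    try:
        yield driver
    finally:
        driver.quit()

class FakeDriver:
    """Stand-in with the one method we need, so the sketch runs without Chrome."""
    def __init__(self, headless=True):
        self.headless = headless
        self.closed = False

    def quit(self):
        self.closed = True

# Normal exit: quit() runs when the with-block ends.
with managed_driver(FakeDriver, headless=True) as d1:
    pass

# Error inside the block: quit() still runs before the exception propagates.
try:
    with managed_driver(FakeDriver) as d2:
        raise RuntimeError("scrape failed")
except RuntimeError:
    pass

print(d1.closed, d2.closed)
```

With the real factory this would read `with managed_driver(make_driver, headless=True) as driver:` followed by the usual driver.get(...) calls.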

3.3 Locating elements: By.* and find_element[s]

Selenium 4 uses one consistent API: driver.find_element(By.X, value) for the first match, driver.find_elements(By.X, value) for all matches. The old find_element_by_id, find_element_by_class_name, etc. were removed in Selenium 4.3, so don't use them.

| By.* | What it matches | Example |
|---|---|---|
| ID | element with id="..." | find_element(By.ID, "search") |
| NAME | element with name="..." (typical for form inputs) | find_element(By.NAME, "q") |
| CLASS_NAME | element with a single class name | find_element(By.CLASS_NAME, "result") |
| TAG_NAME | tag name | find_elements(By.TAG_NAME, "article") |
| LINK_TEXT | <a> with exact visible text | find_element(By.LINK_TEXT, "Press releases") |
| PARTIAL_LINK_TEXT | <a> containing the text | find_element(By.PARTIAL_LINK_TEXT, "press") |
| CSS_SELECTOR | any CSS selector | find_elements(By.CSS_SELECTOR, "dl.ecb-basicList dd a") |
| XPATH | any XPath expression | find_element(By.XPATH, "//h1[@class='title']") |

In practice, CSS_SELECTOR covers ~90% of cases and is the most readable. XPATH is the escape hatch for the other 10% (e.g. selecting elements by their text content).

In [ ]:
driver = make_driver(headless=True)

try:
    driver.get("https://quotes.toscrape.com/js/")
    # Give the JS a moment to render (in 3.4 we'll do this properly with waits)
    time.sleep(2)

    # Multiple matches: find_elements returns a list (empty if none found)
    quote_texts = driver.find_elements(By.CSS_SELECTOR, "div.quote span.text")
    print(f"Found {len(quote_texts)} quotes.\n")
    for q in quote_texts[:5]:
        print(" -", q.text.strip()[:90])

    # Single match: find_element raises NoSuchElementException if absent
    h1 = driver.find_element(By.TAG_NAME, "h1")
    print("\nPage H1:", h1.text.strip())
finally:
    driver.quit()

📊 Reading the output. find_elements returned a list of 10 quote-text spans; we printed five. Then a single find_element(By.TAG_NAME, "h1") returned the page heading.

The two methods differ in failure mode. find_elements returning [] is not an error: it just means nothing matched, and your code has to decide what to do with that. find_element (singular) raises NoSuchElementException if there's no match. Use the singular when the element is required for the next step; use the plural when you're collecting whatever's there.
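
The plural form gives a tidy pattern for optional fields: take the first match if there is one, otherwise record a missing value instead of crashing mid-scrape. The helper name text_or_none and the FakeElement stand-in are assumptions made so the sketch runs without a browser; the only Selenium behaviour relied on is that find_elements returns a list.

```python
def text_or_none(parent, by, value):
    """Text of the first match under `parent`, or None when nothing matches."""
    matches = parent.find_elements(by, value)
    return matches[0].text.strip() if matches else None

# Minimal stand-in so the sketch runs without a browser.
class FakeElement:
    def __init__(self, text="", children=None):
        self.text = text
        self._children = children or {}

    def find_elements(self, by, value):
        return self._children.get((by, value), [])

CSS = "css selector"   # the string value behind Selenium's By.CSS_SELECTOR

card = FakeElement(children={
    (CSS, "span.text"): [FakeElement("  A quote.  ")],
    # no "small.author" entry: this card has no author element
})

quote_text = text_or_none(card, CSS, "span.text")
author = text_or_none(card, CSS, "small.author")
print(repr(quote_text), repr(author))
```

In a real cell the call would be text_or_none(q, By.CSS_SELECTOR, "small.author") with q a WebElement; the None values then show up as missing data in the DataFrame, where you can count them.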

3.4 Explicit waits: WebDriverWait + expected_conditions

time.sleep(2) is a brittle hack: too short and you race the page, too long and you waste time. The proper tool is an explicit wait: wait up to N seconds for a specific condition to be true, then proceed as soon as it is.

The two pieces:

  • WebDriverWait(driver, timeout): the timer
  • expected_conditions (typically aliased as EC): the condition to wait for

Common conditions:

| Condition | Meaning |
|---|---|
| EC.presence_of_element_located((By.X, val)) | Element exists in the DOM (may not yet be visible) |
| EC.visibility_of_element_located((By.X, val)) | Element is in the DOM and visible |
| EC.element_to_be_clickable((By.X, val)) | Element is visible and enabled |
| EC.text_to_be_present_in_element((By.X, val), "text") | The element contains the given text |
| EC.url_contains("path") | The current URL contains the substring |

In [ ]:
driver = make_driver(headless=True)

try:
    driver.get("https://quotes.toscrape.com/js/")

    # Wait up to 10 s for the JS to populate the page with quote elements,
    # then extract the first 5. Without the wait, the page source would
    # contain only the JavaScript shell.
    wait      = WebDriverWait(driver, 10)
    quotes    = wait.until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, "div.quote"))
    )
    print(f"Quotes appeared. Got {len(quotes)} elements; showing first 5:\n")
    for q in quotes[:5]:
        text   = q.find_element(By.CSS_SELECTOR, "span.text").text.strip()
        author = q.find_element(By.CSS_SELECTOR, "small.author").text.strip()
        print(f" - {text[:70]}… ({author})")

except Exception as e:
    print(f"Failed: {type(e).__name__}: {e}")

finally:
    driver.quit()

📊 Reading the output. The WebDriverWait(driver, 10).until(…) call blocked for as long as it needed (in practice well under a second on a fast connection) until at least one div.quote element existed in the DOM. Then it returned the full list. We printed five.

Why this is better than time.sleep(2). A sleep always burns the full duration even when the page is ready in 200 ms, and fails when the page happens to be slow on a particular run. An explicit wait succeeds as soon as the condition holds and fails loudly if it never does, which is exactly what you want a scraper to do.
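
Under the hood an explicit wait is just a polling loop with a deadline. The toy version below is a conceptual model, not Selenium's code: wait_until and ToyTimeout are made-up names, and the real WebDriverWait additionally passes the driver to the condition, polls every 0.5 s by default, and ignores NoSuchElementException while polling.

```python
import time

class ToyTimeout(Exception):
    """Raised when the condition never becomes truthy (stand-in for TimeoutException)."""

def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() every `poll` seconds until it returns something truthy."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result                      # proceed as soon as it holds
        if time.monotonic() >= deadline:
            raise ToyTimeout(f"condition not met within {timeout} s")
        time.sleep(poll)

# A fake "page" that becomes ready on the third check.
checks = {"n": 0}

def quotes_loaded():
    checks["n"] += 1
    return ["quote-1", "quote-2"] if checks["n"] >= 3 else []

found = wait_until(quotes_loaded, timeout=2.0, poll=0.01)
print(f"ready after {checks['n']} checks:", found)

# A condition that never holds fails loudly instead of returning nothing.
try:
    wait_until(lambda: False, timeout=0.05, poll=0.01)
    timed_out = False
except ToyTimeout:
    timed_out = True
print("timed out:", timed_out)
```

Note that the return value of the wait is whatever the condition returned, which is why the cell above can use the result of wait.until(...) directly as the list of quote elements.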

3.5 Interactions: clicks, keyboard input, navigation

Selenium can fill forms, click buttons, navigate back and forward: anything a human can do with mouse and keyboard. The methods you'll use most:

  • element.click(): click on an element
  • element.send_keys("text"): type into a text field
  • element.send_keys(Keys.RETURN): press a special key (Enter, Tab, arrows...)
  • driver.back(), driver.forward(), driver.refresh(): navigation history

For more complex sequences (drag, hover, double-click, multi-key shortcuts), use ActionChains; that's what the Cookie Clicker demo uses.

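In [ ]:
# Run HEADED so the class can watch. Build the driver once: an extra
# make_driver() call here would start a second browser that is never quit.
# The time.sleep() calls below are viewing pauses, not synchronisation.
options = Options()
# options.add_argument("--headless=new")   # un-comment for quiet runs
options.add_argument("--window-size=1280,900")
driver = webdriver.Chrome(options=options)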

try:
    time.sleep(4)
    # 1) Open Wikipedia and find the search box
    start_url = "https://en.wikipedia.org/wiki/Main_Page"
    driver.get(start_url)
    box = WebDriverWait(driver, 10).until(
        EC.element_to_be_clickable((By.NAME, "search"))
    )
    time.sleep(4)
    # 2) Type the query AND press RETURN in a SINGLE send_keys call.
    #    Why not two separate calls? Wikipedia shows an autocomplete
    #    dropdown the moment we start typing, which re-paints the DOM
    #    around the search box. A second send_keys for Keys.RETURN
    #    arriving immediately afterwards finds the element momentarily
    #    not interactable and raises ElementNotInteractableException.
    #    Sending the whole sequence at once avoids that race.
    box.send_keys("European Central Bank" + Keys.RETURN)
    time.sleep(4)
    # 3) Wait for the search to actually navigate. We must NOT wait for
    #    `#firstHeading` here: that element also exists on Main_Page,
    #    so the wait would pass instantly without the search ever resolving.
    #    Instead, wait for the URL to change.
    WebDriverWait(driver, 10).until(EC.url_changes(start_url))
    time.sleep(4)
    # 4) Now read the article-specific elements
    h1 = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "firstHeading"))
    )
    print("Article :", h1.text.strip())
    print("URL     :", driver.current_url)

    first_p = driver.find_element(By.CSS_SELECTOR, "#mw-content-text p")
    print("\nLead paragraph:")
    print(" ", first_p.text.strip()[:300], "…")

    time.sleep(4)
    # 5) Navigate back to Main_Page, then forward again
    driver.back()
    print("\nAfter back   :", driver.current_url)
    time.sleep(4)
    driver.forward()
    print("After forward:", driver.current_url[:80])

finally:
    time.sleep(10)
    driver.quit()

📊 Reading the output. We typed "European Central Bank" + RETURN into the search box on Main_Page in one keystroke sequence. Wikipedia's search resolved that query directly to the ECB article (because it's an exact title match), so by the time we read #firstHeading, the URL had already changed to /wiki/European_Central_Bank and we got the article's lead paragraph. The back()/forward() calls then exercise the browser's history just like a human clicking those buttons.

Two subtleties worth flagging.

  1. Atomic key sequences. The query text and Keys.RETURN go in one send_keys call, not two. Wikipedia's search box pops up an autocomplete dropdown the moment you start typing, which re-paints the surrounding DOM and briefly makes the cached element non-interactable. A second separate send_keys for the RETURN key would race that re-paint and raise ElementNotInteractableException. As a rule: when a form element has live autocomplete, validation, or any JS that fires on each keystroke, send the whole sequence at once.
  2. Picking a wait condition that's unique to the destination. EC.url_changes(start_url) is essential before reading #firstHeading, because that element exists on every Wikipedia page including Main_Page itself. A wait that only checks for the element would resolve before any navigation occurred. Always pick a wait condition that's uniquely true on the destination, not on both pages.
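
When none of the built-in conditions is specific enough, you can write your own: WebDriverWait.until accepts any callable that takes the driver and returns something truthy once the condition holds. The sketch below builds such a condition and exercises it against a stub object; landed_on and StubDriver are hypothetical names, and current_url is the only driver attribute it relies on.

```python
def landed_on(path_fragment, start_url):
    """Condition factory: true once the URL has changed AND contains the fragment."""
    def condition(driver):
        url = driver.current_url
        return url != start_url and path_fragment in url
    return condition

class StubDriver:
    """Stand-in exposing only the attribute the condition reads."""
    def __init__(self, current_url):
        self.current_url = current_url

START = "https://en.wikipedia.org/wiki/Main_Page"
check = landed_on("/wiki/European_Central_Bank", START)

still_on_main = check(StubDriver(START))
arrived = check(StubDriver("https://en.wikipedia.org/wiki/European_Central_Bank"))
print(still_on_main, arrived)
```

With a real browser the call would be WebDriverWait(driver, 10).until(landed_on("/wiki/European_Central_Bank", start_url)), which is stricter than url_changes because it also says where you expect to land.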

3.6 Selenium + BeautifulSoup: best of both worlds

Selenium is great at getting the rendered page; BeautifulSoup is faster and more pleasant for parsing it. The standard pattern: use Selenium to load and interact, then hand the rendered HTML to BS for extraction.

In [ ]:
# Render with Selenium, parse with BeautifulSoup: best of both worlds.
# Target: quotes.toscrape.com/js/ (JS-rendered; `requests` would see only
# the JavaScript shell; Selenium sees the populated DOM).
#
# In your research scraper, just swap the URL and the CSS selectors;
# the structure of the cell stays identical.

driver = make_driver(headless=True)

try:
    driver.get("https://quotes.toscrape.com/js/")
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "div.quote"))
    )

    # Hand the rendered HTML to BeautifulSoup for clean extraction
    soup = BeautifulSoup(driver.page_source, "lxml")
    rows = []
    for q in soup.select("div.quote"):
        rows.append({
            "text":   q.select_one("span.text").get_text(strip=True),
            "author": q.select_one("small.author").get_text(strip=True),
            "tags":   [t.get_text() for t in q.select("a.tag")],
        })
    df_quotes = pd.DataFrame(rows)
    print(f"Parsed {len(df_quotes)} quotes\n")
    print(df_quotes.head())
finally:
    driver.quit()

📊 Reading the output. Selenium loaded the JS-rendered page; we then handed driver.page_source (the full HTML as the browser sees it) to BeautifulSoup, which built a structured 10-row DataFrame in a few lines.

Why this hybrid is the standard pattern. Selenium is great at acquiring a page (running JS, clicking, waiting); BeautifulSoup with lxml is much faster than Selenium's element-finding for extracting data once the page is settled. In production scrapers it's common to use Selenium only to land on each page, then use BS for the parsing pass. The same division of labour scales from a few documents to thousands.
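
Separating acquisition from parsing also means the rendered HTML can be saved once and re-parsed as often as you like, without reopening the browser. The sketch below shows one possible on-disk cache keyed by a hash of the URL; get_rendered_html, cache_path and fake_render are hypothetical, and in a real scraper the render function would call driver.get(url) and return driver.page_source.

```python
import hashlib
import tempfile
from pathlib import Path

def cache_path(cache_dir, url):
    """Deterministic file name for a URL (SHA-256 of the URL, truncated)."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    return Path(cache_dir) / f"{digest}.html"

def get_rendered_html(url, render, cache_dir):
    """Return cached HTML if present; otherwise call render(url) and store it."""
    path = cache_path(cache_dir, url)
    if path.exists():
        return path.read_text(encoding="utf-8")
    html = render(url)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(html, encoding="utf-8")
    return html

# Stand-in for the Selenium step; counts how often the "browser" is used.
render_calls = []

def fake_render(url):
    render_calls.append(url)
    return f"<html><body><div class='quote'>from {url}</div></body></html>"

with tempfile.TemporaryDirectory() as tmp:
    first = get_rendered_html("https://example.org/page/1", fake_render, tmp)
    second = get_rendered_html("https://example.org/page/1", fake_render, tmp)

print("browser used", len(render_calls), "time(s); same HTML:", first == second)
```

A cache like this makes selector debugging cheap and polite: you can change the BeautifulSoup code ten times while hitting the site only once.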

3.7 The faster route: intercepting JSON APIs

Before writing a Selenium scraper, always check whether the page has a JSON API behind it. If it does, calling that API directly is faster (no browser overhead), more stable (APIs change more slowly than DOM), and easier to maintain.

How to find one: Chrome DevTools → Network tab → filter XHR/Fetch → reload the page. Look for requests that return JSON; those are usually the real data endpoints.

In [ ]:
# Intercepting JSON APIs: often faster, cleaner, and more stable than
# scraping the rendered HTML. How to find them:
#   DevTools → Network tab → filter XHR/Fetch → reload the page
#   Look for requests that return JSON; those are your real data endpoints.
#
# Below is a working example against the ECB Data Portal SDMX REST endpoint.
# Pattern: data-api.ecb.europa.eu/service/data/{flowRef}/{key}?format=jsondata
# This is the canonical example from the ECB's own API documentation:
# the daily EUR/USD spot exchange rate.

API_URL = (
    "https://data-api.ecb.europa.eu/service/data/"
    "EXR/D.USD.EUR.SP00.A?format=jsondata&lastNObservations=5"
)

r = safe_get(API_URL, pause=1.0)
if r and r.headers.get("Content-Type", "").lower().startswith("application/"):
    try:
        data = r.json()
        # Top-level keys describe the SDMX message envelope
        print("JSON top-level keys:", list(data.keys()))
        # Drill in just enough to show the shape
        if "dataSets" in data:
            print("Number of datasets returned:", len(data["dataSets"]))
            # Pull out the most recent observations
            obs = data["dataSets"][0].get("series", {}).get("0:0:0:0:0", {}).get("observations", {})
            print("Observations (index → [value]):")
            for k, v in list(obs.items())[:5]:
                print(f"  {k}: {v}")
        print("\nKey insight: a single GET returns structured data. No DOM, no waits.")
    except ValueError:
        print("Response was not valid JSON. Content-Type:", r.headers.get("Content-Type"))
else:
    print("No JSON response. Content-Type:",
          r.headers.get("Content-Type") if r else "no response")
    print("\nFallback message: ALWAYS check for an API before writing a Selenium scraper.")
    print("DevTools Network panel will show you whether the page calls one.")

📊 Reading the output. We hit the ECB Data Portal's SDMX REST endpoint with a plain requests call and got JSON back. The top-level keys describe the SDMX message envelope (header, dataSets, structure); the actual time-series values are nested inside dataSets. Total work: one HTTP request, no browser, no JavaScript engine, no waits.

Why this is the lesson, not the URL. Before writing any Selenium scraper, open Chrome DevTools (Cmd+Opt+I / Ctrl+Shift+I) → Network tab → filter XHR/Fetch → reload the target page. The requests that return JSON are the page's real data endpoints. Calling those directly is faster (no rendering overhead), more stable (APIs change far less than DOM), and gives you data already structured for pd.DataFrame() or pd.json_normalize(). Selenium becomes a fallback for sites that genuinely don't expose an API, not the default.
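
The observations in the response are keyed by position ("0", "1", ...) rather than by date, so one more step is needed before the data is tabular. The sketch below flattens a hand-written sample; the numbers are made up, and the assumption that the dates live under structure → dimensions → observation is mine, so check the keys against the actual response before relying on it.

```python
# Hand-written sample in the shape the cell above drills into; values are made up.
sample = {
    "dataSets": [{
        "series": {
            "0:0:0:0:0": {
                "observations": {
                    "0": [1.0812],
                    "1": [1.0795],
                    "2": [1.0840],
                }
            }
        }
    }],
    "structure": {
        "dimensions": {
            "observation": [{
                "id": "TIME_PERIOD",
                "values": [{"id": "2024-05-01"}, {"id": "2024-05-02"}, {"id": "2024-05-03"}],
            }]
        }
    },
}

def flatten_observations(message, series_key="0:0:0:0:0"):
    """Turn index-keyed observations into [{'date': ..., 'value': ...}, ...]."""
    periods = message["structure"]["dimensions"]["observation"][0]["values"]
    obs = message["dataSets"][0]["series"][series_key]["observations"]
    rows = []
    for index, values in sorted(obs.items(), key=lambda kv: int(kv[0])):
        rows.append({"date": periods[int(index)]["id"], "value": values[0]})
    return rows

rows = flatten_observations(sample)
for row in rows:
    print(row)
```

The resulting list of dicts can be passed straight to pd.DataFrame(rows).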

Practical detail. Old ECB tutorials reference the legacy sdw-wsrest.ecb.europa.eu URL, which now 404s; the live host is data-api.ecb.europa.eu. This is the second-most-common reason scraping code rots: even when the code is fine, the endpoints in the code drift. Document the date you wrote a scraper alongside its URLs.


⏱ Ten-minute challenge: your first Selenium scraper

Your task. Pick one of the two targets below (both genuinely require Selenium because they're JS-rendered) and extract items into a pandas.DataFrame.

  • Option A (infinite scroll): quotes.toscrape.com/scroll. The page loads 10 quotes, then loads 10 more each time you scroll near the bottom. Collect at least 30 quotes (3 scrolls), each with text, author, tags.
  • Option B (paginated JS site): quotes.toscrape.com/js/page/N/. Same content as Option A but split across pages. Collect quotes from the first 5 pages with text, author, tags. Stop early if a page has no quotes.

You have 10 minutes. Work in pairs if you want. Tips:

  • Use WebDriverWait rather than time.sleep. The only legitimate use of sleep here is between scrolls (let new content load).
  • Inspect elements in Chrome DevTools first (Ctrl+Shift+I / Cmd+Opt+I) to find stable CSS selectors.
  • Run with make_driver(headless=False) for the first attempt: seeing what the bot does is half the debugging.

Teaching note: the two options stress different mechanics. Option A is about deciding when to stop scrolling; Option B is about looping over URL patterns and detecting an empty page. Whichever students pick first, the other is a useful follow-up exercise.

In [ ]:
# YOUR CODE HERE (10 minutes)
# Skeleton:
# driver = make_driver(headless=False)
# try:
#     driver.get(...)
#     WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, ...)))
#     items = driver.find_elements(By.CSS_SELECTOR, ...)
#     records = [{"...": it.text, ...} for it in items[:10]]
#     pd.DataFrame(records)
# finally:
#     driver.quit()
In [ ]:
# Solution (Option A): infinite-scroll scraper
#
# Strategy:
#   1. Scroll to the bottom.
#   2. Wait briefly for new quotes to load (network round-trip).
#   3. Re-count quotes; if the count didn't grow, we've hit the end.
#   4. Stop after `target` quotes or when the page stops producing new ones.

driver = make_driver(headless=False)
QUOTES_CSS = "div.quote"
target     = 30

try:
    driver.get("https://quotes.toscrape.com/scroll")

    # Wait for the first batch of quotes
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, QUOTES_CSS))
    )

    seen = 0
    while True:
        quotes = driver.find_elements(By.CSS_SELECTOR, QUOTES_CSS)
        if len(quotes) >= target:
            break
        if len(quotes) == seen:
            # Scrolled but no new content arrived: we've hit the bottom.
            print(f"  No new quotes after scrolling; stopping at {seen}.")
            break
        seen = len(quotes)
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(1.5)   # give the JS time to fetch the next batch

    # Extract structured data
    records = []
    for q in driver.find_elements(By.CSS_SELECTOR, QUOTES_CSS):
        records.append({
            "text":   q.find_element(By.CSS_SELECTOR, "span.text").text,
            "author": q.find_element(By.CSS_SELECTOR, "small.author").text,
            "tags":   [t.text for t in q.find_elements(By.CSS_SELECTOR, "a.tag")],
        })
    df_quotes = pd.DataFrame(records)
    print(f"\nCollected {len(df_quotes)} quotes")
    print(df_quotes.head())

finally:
    driver.quit()

📊 Reading the output. The while loop scrolled to the bottom, slept 1.5 s for the JS to load the next batch, and re-counted. As long as the count grew, it kept scrolling. After 30 quotes were in hand, it exited and built the DataFrame.

Two robustness details to highlight. (1) The loop has two termination conditions: enough data (len(quotes) >= target) or no progress (len(quotes) == seen). The second guards against the infinite-scroll page that genuinely runs out of content. (2) The 1.5-second sleep is the only place time.sleep is appropriate in this file, because we're waiting for an out-of-band network round-trip rather than for an element to appear.
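
The two termination conditions can be exercised without a browser. The sketch below is a simulation under stated assumptions: scroll_until and FakeScrollPage are hypothetical names, the fake page stands in for the Selenium driver, and the max_scrolls cap is an extra safety net that the solution cell does not have.

```python
def scroll_until(count_items, scroll_once, target, max_scrolls=50):
    """Scroll until `target` items are loaded or the page stops growing.

    count_items: callable returning how many items are currently loaded
    scroll_once: callable that triggers loading of the next batch
    Returns (final_count, reason); reason is "target", "exhausted" or "max_scrolls".
    """
    seen = 0
    for _ in range(max_scrolls):
        n = count_items()
        if n >= target:          # condition 1: enough data
            return n, "target"
        if n == seen:            # condition 2: last scroll added nothing
            return n, "exhausted"
        seen = n
        scroll_once()
    return count_items(), "max_scrolls"


class FakeScrollPage:
    """Stand-in for a browser tab: starts with one batch, grows until `total`."""
    def __init__(self, total, batch=10):
        self.total, self.batch = total, batch
        self.loaded = min(batch, total)

    def count(self):
        return self.loaded

    def scroll(self):
        self.loaded = min(self.loaded + self.batch, self.total)


long_page = FakeScrollPage(total=100)   # has more content than we need
short_page = FakeScrollPage(total=20)   # runs out before the target
result_long = scroll_until(long_page.count, long_page.scroll, target=30)
result_short = scroll_until(short_page.count, short_page.scroll, target=30)
print(result_long)
print(result_short)
```

Returning the reason alongside the count makes it obvious in the output whether a short result means "the site has no more" or "we stopped on purpose".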

In [ ]:
# Solution (Option B): paginated JS-rendered site

from selenium.common.exceptions import TimeoutException

driver = make_driver(headless=False)

try:
    rows = []
    for page in range(1, 6):                      # pages 1..5
        driver.get(f"https://quotes.toscrape.com/js/page/{page}/")

        # Wait briefly for at least one quote to render. If none appears,
        # we've gone past the last page, so stop cleanly.
        try:
            WebDriverWait(driver, 5).until(
                EC.presence_of_element_located((By.CSS_SELECTOR, "div.quote"))
            )
        except TimeoutException:
            print(f"  Page {page}: no quotes, stopping.")
            break

        for q in driver.find_elements(By.CSS_SELECTOR, "div.quote"):
            rows.append({
                "page":   page,
                "text":   q.find_element(By.CSS_SELECTOR, "span.text").text,
                "author": q.find_element(By.CSS_SELECTOR, "small.author").text,
                "tags":   [t.text for t in q.find_elements(By.CSS_SELECTOR, "a.tag")],
            })
        print(f"  Page {page}: cumulative total = {len(rows)}")

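    df_quotes = pd.DataFrame(rows)
    # Guard the empty case: with no rows there is no "page" column to inspect.
    n_pages = df_quotes["page"].nunique() if rows else 0
    print(f"\nCollected {len(df_quotes)} quotes across {n_pages} pages")
    print(df_quotes.head())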

finally:
    driver.quit()

📊 Reading the output. We looped over pages 1–5, navigated to each, and waited up to 5 seconds for at least one quote to appear. If the wait timed out, we treated that as "past the last page" and broke out of the loop. The five pages each contributed 10 quotes, for 50 in total.

Compare with Solution A. Both use Selenium because the content is JS-rendered, but they differ in control structure: Option A is one URL with progressive content ("keep scrolling until enough"); Option B is many URLs with a fixed unit per URL ("loop until you hit empty"). In practice you'll meet both. The question to ask before writing either is: which one is this?
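
The "loop until you hit empty" structure is independent of how a page is fetched, so it can be factored out and tested on its own. A minimal sketch, where collect_pages and FAKE_SITE are hypothetical names and the dictionary stands in for the browser:

```python
def collect_pages(fetch_page, max_pages):
    """Call fetch_page(1), fetch_page(2), ... and stop at the first empty page.

    fetch_page: callable taking a 1-based page number and returning a list of
                dicts (an empty list means "past the last page").
    """
    rows = []
    for page in range(1, max_pages + 1):
        items = fetch_page(page)
        if not items:
            break                                   # empty page: stop cleanly
        rows.extend({"page": page, **item} for item in items)
    return rows


# Fake three-page site standing in for the browser; page 4 onwards is empty.
FAKE_SITE = {
    1: [{"text": "a"}, {"text": "b"}],
    2: [{"text": "c"}, {"text": "d"}],
    3: [{"text": "e"}],
}

rows = collect_pages(lambda p: FAKE_SITE.get(p, []), max_pages=5)
print(len(rows), "records from pages", sorted({r["page"] for r in rows}))
```

In the Selenium version, fetch_page would wrap driver.get, the WebDriverWait call, and the extraction loop, returning an empty list on TimeoutException.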


3.8 Putting it together: build a real corpus

Now we combine everything from this lecture: a list of source URLs, polite requests fetching, BeautifulSoup parsing, and a Pandas DataFrame written to disk. The output of this section is the input to Section 4.

Why a hardcoded list of URLs? Two reasons. First, the ECB press portal is now JavaScript-rendered (you saw earlier that dl.ecb-basicList no longer exists), so a clean static index is no longer available. Second, in real research the workflow is often: manually identify a starting list of documents (from an archive page, an API, or a colleague's spreadsheet), then iterate over them. Once you have the URLs, the rest of the pipeline is mechanical. We're showing that second half here.

The five URLs below are individual press releases (ecb.mp…) and press conference statements (ecb.is…). The hash in each filename is permanent once a document is published, so these URLs don't rot the way index pages do.
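
Because the filenames are stable, they can also be used to cross-check the hand-typed dates. The sketch below rests on an assumption inferred from the URLs in this notebook, not on any ECB specification: the six digits after ecb.mp / ecb.is look like the publication date as YYMMDD. The names DOC_ID_RE and parse_ecb_url are hypothetical.

```python
import re
from datetime import date

# Assumption inferred from the URLs in this notebook, not an ECB specification:
# the six digits after "ecb.mp" / "ecb.is" are the publication date as YYMMDD.
DOC_ID_RE = re.compile(
    r"ecb\.(?P<kind>mp|is)(?P<yymmdd>\d{6})~(?P<hash>[0-9a-z]+)\.en\.html$"
)

def parse_ecb_url(url):
    """Return (kind, date, hash) parsed from an ECB document URL, or None."""
    m = DOC_ID_RE.search(url)
    if m is None:
        return None
    yy, mm, dd = (int(m["yymmdd"][i:i + 2]) for i in (0, 2, 4))
    return m["kind"], date(2000 + yy, mm, dd), m["hash"]

url = "https://www.ecb.europa.eu/press/pr/date/2024/html/ecb.mp240125~f738889bde.en.html"
parsed = parse_ecb_url(url)
print(parsed)
```

Comparing the parsed date with the date string typed next to each URL is a cheap way to catch copy-paste mistakes before they reach the corpus.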

In [ ]:
# Build ecb_corpus.csv from a hardcoded list of stable ECB document URLs.

ECB_DOCS = [
    # (date, url): press releases and press-conference statements, mixed.
    ("2024-01-25", "https://www.ecb.europa.eu/press/pr/date/2024/html/ecb.mp240125~f738889bde.en.html"),
    ("2024-12-12", "https://www.ecb.europa.eu/press/press_conference/monetary-policy-statement/2024/html/ecb.is241212~ce143b3bc8.en.html"),
    ("2025-10-30", "https://www.ecb.europa.eu/press/press_conference/monetary-policy-statement/2025/html/ecb.is251030~4f74dde15e.en.html"),
    ("2023-09-14", "https://www.ecb.europa.eu/press/press_conference/monetary-policy-statement/2023/html/ecb.is230914~686786984a.en.html"),
    ("2023-05-04", "https://www.ecb.europa.eu/press/pr/date/2023/html/ecb.mp230504~cdfd11a697.en.html"),
]

def parse_ecb_document(url, date_str):
    """Fetch one ECB document and extract title + body.

    Returns a dict (or None if the fetch fails). Selectors are written
    defensively because the ECB site has multiple page templates:
      - <main id="main-content"> is consistent across templates
      - <h1> is the document title
      - body paragraphs are the <p> tags inside <main>, filtered for length
    """
    r = safe_get(url, pause=1.5)
    if r is None:
        return None

    soup = BeautifulSoup(r.text, "lxml")
    main = soup.find("main", id="main-content") or soup.find("main") or soup

    h1 = main.find("h1")
    title = h1.get_text(strip=True) if h1 else "(no title)"

    # Collect substantive paragraphs only; skip very short ones (often
    # navigation labels, image captions, or 'Related topics' snippets).
    paragraphs = [
        p.get_text(" ", strip=True)
        for p in main.find_all("p")
        if len(p.get_text(strip=True)) >= 40
    ]
    body = "\n\n".join(paragraphs)

    return {
        "date_parsed": date_str,
        "title":       title,
        "url":          url,
        "body":         body,
    }

records = []
for date_str, url in ECB_DOCS:
    rec = parse_ecb_document(url, date_str)
    if rec is None:
        print(f"  SKIP {date_str}: fetch failed")
        continue
    records.append(rec)
    print(f"  OK   {date_str}: {len(rec['body'].split())} words | {rec['title'][:50]}…")

if not records:
    raise RuntimeError("No documents could be fetched. Check your network connection.")

corpus_df = pd.DataFrame(records)
corpus_df["date_parsed"] = pd.to_datetime(corpus_df["date_parsed"])
corpus_df = corpus_df.sort_values("date_parsed").reset_index(drop=True)

# Persist to disk; Section 4 will pick this up on the next cell run.
corpus_df.to_csv("ecb_corpus.csv", index=False)
print(f"\nSaved ecb_corpus.csv with {len(corpus_df)} documents.")
print(f"Date range: {corpus_df['date_parsed'].min().date()} → {corpus_df['date_parsed'].max().date()}")
corpus_df[["date_parsed", "title"]].head()

📊 Reading the output. The cell looped over a list of ECB document URLs, fetched each one with safe_get (so we get polite delays and graceful 404 handling), parsed the HTML with BeautifulSoup, and pulled out a title plus a concatenated body. Successfully parsed documents went into a DataFrame, which was sorted by date and written to ecb_corpus.csv.

Two robustness patterns to internalise. (1) The function parse_ecb_document returns None on fetch failure rather than raising, so a single broken URL doesn't kill the whole run: you get a partial corpus instead of nothing. (2) The body extraction uses defensive selectors: it tries <main id="main-content"> first, falls back to any <main>, and ultimately to the whole page. This makes the same parser work across slightly different ECB page templates (the press-release template differs from the press-conference statement template). When you write your own extractor, ask yourself "what's the next thing I'd try if my preferred selector returns nothing?" and write that path explicitly.
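
One way to write that fallback path explicitly is as an ordered list of selectors that also reports which one matched. The sketch below uses the standard library's xml.etree.ElementTree on tiny well-formed snippets purely to stay dependency-free; real HTML is rarely well-formed XML, so in the notebook the same idea stays with BeautifulSoup. SELECTORS and find_container are hypothetical names.

```python
import xml.etree.ElementTree as ET

# Ordered from most to least specific; a path of None means "whole document".
SELECTORS = [
    ("main#main-content", ".//main[@id='main-content']"),
    ("any <main>",        ".//main"),
    ("whole document",    None),
]

def find_container(root):
    """Return (label, element) for the first selector that matches."""
    for label, path in SELECTORS:
        node = root if path is None else root.find(path)
        if node is not None:      # explicit None check: childless elements are falsy
            return label, node
    raise LookupError("no selector matched")

page_a = ET.fromstring("<html><body><main id='main-content'><p>Body A</p></main></body></html>")
page_b = ET.fromstring("<html><body><main><p>Body B</p></main></body></html>")
page_c = ET.fromstring("<html><body><div><p>Body C</p></div></body></html>")

for page in (page_a, page_b, page_c):
    label, node = find_container(page)
    print(f"{label:20s} -> {[p.text for p in node.iter('p')]}")
```

Logging the label for every document shows how often the parser is running on a fallback, which is an early warning that a template has changed.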

Why this is the bridge to Section 4. Section 4 starts with pd.read_csv("ecb_corpus.csv"). Until now that file didn't exist, so the cell fell back to a synthetic 8-document corpus. Now the file does exist with real ECB content, so Section 4's load branch succeeds and the cleaning pipeline runs on actual press releases.


4. Building and cleaning a text corpus

By now you have a DataFrame with document text. This cleaning pipeline bridges scraping (L5–L6) and NLP (L7).

In [ ]:
try:
    corpus_df = pd.read_csv("ecb_corpus.csv")
    print(f"Loaded corpus: {len(corpus_df)} documents")
except FileNotFoundError:
    # Synthetic fallback: 8 ECB-style monetary-policy press releases spanning
    # three years. Bodies are kept around 80–120 words: long enough
    # to pass the MIN_WORDS filter applied later, short enough to read.
    corpus_df = pd.DataFrame({
        "date_parsed": [
            "2024-01-25", "2024-04-11", "2024-09-12",
            "2023-10-26", "2023-09-14", "2023-06-15",
            "2022-12-15", "2022-09-08",
        ],
        "title": ["Monetary policy decisions"] * 8,
        "body": [
            # 2024-01-25
            "The Governing Council today decided to keep the three key ECB interest rates unchanged. "
            "The deposit facility rate remains at 4.00%, the main refinancing operations rate at 4.50% "
            "and the marginal lending facility rate at 4.75%. Inflation is projected to decline gradually "
            "further over 2024 but will remain above the 2% target for most of the year. The Council "
            "reiterated that future decisions will ensure that policy rates are set at sufficiently "
            "restrictive levels for as long as necessary to achieve a timely return of inflation to the "
            "medium-term target. Future decisions will continue to follow a data-dependent approach.",
            # 2024-04-11
            "The Governing Council today decided to keep the three key ECB interest rates unchanged. "
            "Incoming information has broadly confirmed the Governing Council's previous assessment of "
            "the medium-term inflation outlook. Inflation has continued to fall, led by lower food and "
            "goods inflation. Most measures of underlying inflation are easing, wage growth is gradually "
            "moderating and firms are absorbing part of the rise in labour costs in their profits. "
            "Financing conditions remain restrictive and past interest rate increases continue to weigh "
            "on demand, helping to push down inflation toward the target.",
            # 2024-09-12
            "The Governing Council today decided to lower the deposit facility rate by 25 basis points. "
            "Based on the Governing Council's updated assessment of the inflation outlook, the dynamics "
            "of underlying inflation, and the strength of monetary policy transmission, it is now "
            "appropriate to take another step in moderating the degree of monetary policy restriction. "
            "Recent inflation data have come in broadly as expected, and the latest staff projections "
            "confirm the previous inflation outlook. Domestic inflation remains high as wages are still "
            "rising at an elevated pace.",
            # 2023-10-26
            "The Governing Council today decided to keep the three key ECB interest rates unchanged. "
            "The incoming information has broadly confirmed the previous assessment of the medium-term "
            "inflation outlook. Inflation is still expected to stay too high for too long, and domestic "
            "price pressures remain strong. At the same time, inflation dropped markedly in September, "
            "including due to strong base effects, and most measures of underlying inflation have "
            "continued to ease. The Governing Council's past interest rate increases continue to be "
            "transmitted forcefully into financing conditions.",
            # 2023-09-14
            "The Governing Council today decided to raise the three key ECB interest rates by 25 basis "
            "points. The deposit facility rate will be increased to 4.00%. Inflation continues to decline "
            "but is still expected to remain too high for too long. The Governing Council is determined "
            "to ensure that inflation returns to its 2% medium-term target in a timely manner. Based on "
            "the current assessment, the Governing Council considers that the key ECB interest rates "
            "have reached levels that, maintained for a sufficiently long duration, will make a "
            "substantial contribution to the timely return of inflation to the target.",
            # 2023-06-15
            "The Governing Council today decided to raise the three key ECB interest rates by 25 basis "
            "points. The deposit facility rate will be increased to 3.50%. Inflation has been coming "
            "down but is projected to remain too high for too long. The Governing Council is determined "
            "to ensure the timely return of inflation to the 2% medium-term target. The Council will "
            "continue to follow a data-dependent approach to determining the appropriate level and "
            "duration of restriction.",
            # 2022-12-15
            "The Governing Council today decided to raise the three key ECB interest rates by 50 basis "
            "points. Based on the substantial upward revision to the inflation outlook, the deposit "
            "facility rate will be increased to 2.00%. The Governing Council judges that interest rates "
            "will still have to rise significantly at a steady pace to reach levels that are sufficiently "
            "restrictive to ensure a timely return of inflation to the 2% medium-term target. Keeping "
            "interest rates at restrictive levels will over time reduce inflation by dampening demand.",
            # 2022-09-08
            "The Governing Council today decided to raise the three key ECB interest rates by 75 basis "
            "points. This major step frontloads the transition from the prevailing highly accommodative "
            "level of policy rates towards levels that will ensure the timely return of inflation to "
            "the 2% medium-term target. Inflation remains far too high and is likely to stay above target "
            "for an extended period. In August, euro area inflation reached 9.1%. Soaring energy and food "
            "prices, demand pressures in some sectors owing to the reopening of the economy, and supply "
            "bottlenecks are still driving up inflation.",
        ],
    })
    corpus_df["date_parsed"] = pd.to_datetime(corpus_df["date_parsed"])
    print(f"Using synthetic corpus ({len(corpus_df)} documents).")
corpus_df.head()

📊 Reading the output. The cell loaded ecb_corpus.csv (the file Section 3.8 just wrote) and printed the row count. With the live scrape working, you should see something like "Loaded corpus: 5 documents" rather than the synthetic-fallback message.

Why keep the except FileNotFoundError branch at all? Because not every run will have network access. If you run the notebook offline (on a train, in a classroom with bad Wi-Fi, or while debugging Section 4 alone), Section 3.8 won't have produced the file, and the fallback synthetic corpus keeps the rest of the lecture working. This is a small but useful pattern for any data-pipeline notebook: if your preferred input source fails, have a deterministic fallback so the downstream cells still run and you can debug them.
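
The same pattern can be packaged as a small loader that also reports which branch ran. This is a sketch using only the standard library rather than pandas; load_rows and FALLBACK_ROWS are hypothetical names, and the rows are placeholder data.

```python
import csv
import tempfile
from pathlib import Path

# Deterministic stand-in data, used only when the preferred file is missing.
FALLBACK_ROWS = [
    {"date_parsed": "2024-01-25", "title": "Monetary policy decisions", "body": "Synthetic text."},
]

def load_rows(path, fallback=FALLBACK_ROWS):
    """Return (rows, source): rows from the CSV at `path` if present, else the fallback."""
    try:
        with open(path, newline="", encoding="utf-8") as fh:
            return list(csv.DictReader(fh)), "file"
    except FileNotFoundError:
        return [dict(r) for r in fallback], "fallback"

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "ecb_corpus.csv"
    rows_offline, source_offline = load_rows(target)        # file not there yet

    with open(target, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=["date_parsed", "title", "body"])
        writer.writeheader()
        writer.writerow({"date_parsed": "2025-10-30", "title": "Real doc", "body": "Scraped text."})
    rows_online, source_online = load_rows(target)          # now the file wins

print(source_offline, len(rows_offline))
print(source_online, rows_online[0]["title"])
```

Catching only FileNotFoundError is deliberate: a file that exists but is malformed should still fail loudly rather than silently switch to synthetic data.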

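In [ ]:
def clean_text(text):
    """Strip leftover HTML tags and entities, then normalise whitespace."""
    if not isinstance(text, str):
        return ""                                   # NaN / None become empty text
    text = re.sub(r"<[^>]+>", " ", text)            # drop any residual tags
    text = text.replace("&amp;", "&").replace("&nbsp;", " ")  # two common entities
    text = re.sub(r"\s+", " ", text)                # collapse runs of whitespace
    return text.strip()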

def extract_metadata(df):
    df = df.copy()
    df["body_clean"] = df["body"].apply(clean_text)
    df["n_chars"]    = df["body_clean"].str.len()
    df["n_words"]    = df["body_clean"].str.split().str.len()
    if "date_parsed" in df.columns:
        df["date_parsed"] = pd.to_datetime(df["date_parsed"], errors="coerce")
        df["year"]  = df["date_parsed"].dt.year
        df["month"] = df["date_parsed"].dt.month
    return df

corpus_df = extract_metadata(corpus_df)
corpus_df[["date_parsed","n_chars","n_words"]].describe().round(1)
In [ ]:
MIN_WORDS = 50

corpus_df["is_valid"] = corpus_df["n_words"] >= MIN_WORDS
print(f"Valid documents (>= {MIN_WORDS} words): {corpus_df['is_valid'].sum()} / {len(corpus_df)}")

invalid = corpus_df[~corpus_df["is_valid"]][["date_parsed","title","n_words"]]
if len(invalid) > 0:
    print("\nDocuments flagged for review:")
    print(invalid.to_string(index=False))

📊 Reading the output. We added a boolean column is_valid flagging documents with at least 50 words. With the expanded synthetic corpus, all 8 documents pass the filter, so the "flagged for review" block is empty.

The 50-word threshold is arbitrary but defensible. Below ~50 words you typically have a navigation snippet, a redirect placeholder, or a corrupted scrape rather than a real document. Tightening the threshold to 100–200 words (and inspecting what gets cut) is a useful sanity-check step before any serious NLP. For your own corpus, set this based on the typical length of the shortest legitimate document in your source.
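
"Inspecting what gets cut" can be as simple as listing the casualties at a few candidate thresholds. A minimal sketch; the document names and word counts are hypothetical and purely illustrative, and flagged is not a function from the notebook.

```python
# Hypothetical word counts for a scraped corpus (illustrative, not real data).
word_counts = {
    "cookie banner stub": 12,
    "redirect placeholder": 31,
    "short press release": 74,
    "press release": 180,
    "press conference statement": 1450,
}

def flagged(counts, min_words):
    """Names of documents that fall below the threshold, shortest first."""
    ordered = sorted(counts.items(), key=lambda kv: kv[1])
    return [name for name, n in ordered if n < min_words]

for threshold in (50, 100, 200):
    cut = flagged(word_counts, threshold)
    print(f"MIN_WORDS={threshold:>3}: drops {len(cut)} -> {cut}")
```

The threshold is too tight as soon as a legitimate document shows up in the dropped list.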

In [ ]:
clean_corpus = corpus_df[corpus_df["is_valid"]].reset_index(drop=True)
clean_corpus.to_csv("corpus_clean.csv", index=False)
print(f"Saved: corpus_clean.csv ({len(clean_corpus)} documents)")
print(f"  Avg words: {clean_corpus['n_words'].mean():.0f}")
print(f"  Total words: {clean_corpus['n_words'].sum():,}")

📊 Reading the output. The valid documents were written to corpus_clean.csv, and the cell printed how many made it through, plus the average and total word counts. With the synthetic data this is 8 documents averaging ~95 words.

corpus_clean.csv is now your handoff to L7 (NLP). Everything that follows in this notebook, from the corpus self-check and the year-over-year plots to next lecture's tokenisation/TF-IDF, reads from this file. If you need to tweak the cleaning pipeline, change extract_metadata or MIN_WORDS and re-run from the corpus-load cell; the file will be overwritten.


5. Milestone: Your corpus is ready

By the end of this lecture you should have a clean CSV with at minimum:

| Column | Description |
|---|---|
| date_parsed | Document date as datetime |
| title | Document title or headline |
| body_clean | Cleaned body text |
| url | Source URL |
| n_words | Word count |

In [ ]:
# Corpus self-check
my_corpus = pd.read_csv("corpus_clean.csv")
print("=== CORPUS SUMMARY ===")
print(f"  Documents : {len(my_corpus)}")
print(f"  Columns   : {list(my_corpus.columns)}")
if "date_parsed" in my_corpus.columns:
    dates = pd.to_datetime(my_corpus["date_parsed"], errors="coerce")
    print(f"  Date range: {dates.min().date()} – {dates.max().date()}")
if "n_words" in my_corpus.columns:
    print(f"  Avg words : {my_corpus['n_words'].mean():.0f}")
my_corpus.head(3)

📊 Reading the output. The self-check re-reads corpus_clean.csv from disk (rather than reusing the in-memory DataFrame) and prints a one-screen sanity report: row count, column list, date span, average word count.

Why re-read from disk? Two reasons. First, it confirms the file you just wrote is loadable: a surprising number of corpus bugs come from CSVs that look fine in memory but break on read because of unescaped quotes or line breaks in the body text. Second, it forces you to use the exact data structure that the next lecture will start from, with no hidden state from the cleaning pipeline.
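
The quoting problem is easy to reproduce in isolation. The sketch below uses the standard library's csv module and an in-memory buffer instead of a file; the sample text is made up. It shows that a body with quotes, commas and blank lines survives a write/read round trip when a real CSV writer is used, and that counting physical lines gives a misleading row count.

```python
import csv
import io

# A body with the three characters that most often break naive CSV handling.
tricky_body = 'The President said: "rates stay on hold".\n\nInflation, however, remains high.'
original = [{"title": "Statement", "body": tricky_body}]

# Write with the csv module (it quotes fields and doubles embedded quotes)...
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["title", "body"])
writer.writeheader()
writer.writerows(original)

# ...then read it back and compare field by field.
buffer.seek(0)
restored = list(csv.DictReader(buffer))
print("round-trip identical:", restored == original)

# Counting physical lines instead of parsing overstates the number of rows.
physical_lines = buffer.getvalue().splitlines()
print("rows parsed:", len(restored), "| physical lines:", len(physical_lines))
```

Trouble usually starts when a CSV is assembled by string concatenation or inspected with line-based tools, which is exactly what the re-read check is meant to catch.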


6. Exercise

Time: ~25 minutes. Work individually.

Task 1. If your target site is JavaScript-rendered, rewrite your L5 scraper using Selenium. If requests + BeautifulSoup already works, skip to Task 2.

Task 2. Add at least two metadata columns to your corpus:

  • n_sentences: count sentences with re.split(r'[.!?]', text)
  • mentions_rates: boolean, does the text mention "interest rate"?
  • speaker or author: extracted from page header
  • has_table: boolean, does the page contain a <table> element?

Task 3. Produce a two-panel figure: documents per year (bar) and average word count per year (line). Save as corpus_overview.png.

Task 4. Write a short corpus description (markdown cell): source, scraping method, time period, n_docs, avg length, known limitations.

In [ ]:
# Task 1: Selenium rewrite (if needed)
# YOUR CODE HERE
In [ ]:
# Task 2: additional metadata
# YOUR CODE HERE
In [ ]:
# Task 3: corpus overview figure
# YOUR CODE HERE

Task 4: Corpus description

(Write your paragraph here.)

In [ ]:
# ── SOLUTION ──────────────────────────────────────────────────────────────────
import pandas as pd, matplotlib.pyplot as plt, re

corpus = pd.read_csv("corpus_clean.csv")
corpus["date_parsed"] = pd.to_datetime(corpus["date_parsed"], errors="coerce")
corpus["year"] = corpus["date_parsed"].dt.year

# Task 2
def count_sentences(text):
    if not isinstance(text, str): return 0
    return len([s for s in re.split(r"[.!?]+", text) if s.strip()])

corpus["n_sentences"]  = corpus["body_clean"].apply(count_sentences)
corpus["avg_sent_len"] = (corpus["n_words"] / corpus["n_sentences"].replace(0, 1)).round(1)
corpus["mentions_rates"] = corpus["body_clean"].str.contains(
    r"interest rate|deposit facility", case=False, regex=True, na=False
)
print(corpus[["date_parsed","n_words","n_sentences","avg_sent_len","mentions_rates"]].head())

# Task 3
by_year = corpus.groupby("year").agg(
    n_docs    = ("title",   "count"),
    avg_words = ("n_words", "mean"),
).reset_index()

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
ax1.bar(by_year["year"], by_year["n_docs"], color="steelblue", edgecolor="white")
ax1.set_title("Documents per year"); ax1.set_xlabel("Year"); ax1.set_ylabel("Count")

ax2.plot(by_year["year"], by_year["avg_words"], marker="o", color="tomato", linewidth=2)
ax2.set_title("Avg word count per year"); ax2.set_xlabel("Year"); ax2.set_ylabel("Words")

fig.suptitle("ECB Corpus Overview", fontsize=13, y=1.02)
fig.tight_layout()
fig.savefig("corpus_overview.png", dpi=300, bbox_inches="tight")
plt.show()
print("Saved: corpus_overview.png")

📊 Reading the output. We added three metadata columns (n_sentences, avg_sent_len, mentions_rates), grouped the corpus by year, and produced a two-panel figure: bar chart of documents per year and line chart of average word count per year. The figure was saved to corpus_overview.png and displayed inline.

What to look for in the plots. With the synthetic corpus the per-year counts are 2 (2022), 3 (2023), 3 (2024): not very informative on their own, but a real scraped corpus will show realistic publication-rate variation across the rate-hike cycle. The average word count line is the more interesting series: it can show how communication style changes with policy stance, and it's a cheap proxy you can correlate against macro outcomes.
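
One caveat on n_sentences: splitting on every full stop also splits decimals such as "4.00%", which are common in monetary-policy text. The sketch below compares the solution's regex with a slightly stricter variant that splits only where the punctuation is followed by whitespace. Both function names are hypothetical, and the stricter version is still a heuristic: abbreviations such as "e.g." followed by a space would still be split.

```python
import re

def count_sentences_naive(text):
    """Split on any run of . ! ? (the approach used in the solution cell)."""
    return len([s for s in re.split(r"[.!?]+", text) if s.strip()])

def count_sentences_ws(text):
    """Split only where . ! ? is followed by whitespace, so '4.00%' stays whole."""
    return len([s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s])

sample = (
    "The deposit facility rate remains at 4.00%, the main refinancing "
    "operations rate at 4.50%. Inflation is projected to decline. "
    "Will it reach 2%? The Council expects so."
)
naive, stricter = count_sentences_naive(sample), count_sentences_ws(sample)
print("naive:", naive, "| whitespace-aware:", stricter)
```

For serious sentence segmentation a dedicated tokenizer is the usual choice; for a quick metadata column, the whitespace-aware split is often close enough.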


7. (optional) From script to production: the Page Object Model

If you're running short on time, skip this section: it's an advanced bonus, not required for the exam. If you have ~20 minutes left, this is one of the most useful patterns you'll see in software engineering, and it directly leverages the OOP we covered in L2.

When a scraper grows beyond a single notebook (multiple pages, multiple workflows, repeated maintenance), flat scripts become hard to read and brittle to change. Every time the site tweaks a CSS class, you're hunting through cells to find the references.

The Page Object Model (POM) is the standard fix. The idea:

  • One class per page of the site, exposing methods that describe what a user can do on that page (search(...), go_to_results(), etc.).
  • Locators isolated in a separate module, so when the site's HTML changes, you fix one file.
  • The test/scraper script reads almost like prose: home.search("ECB"); results.collect_titles().

This is exactly the architecture used in production test suites (Selenium's own docs recommend it). It's also a clean callback to L2: descriptors, inheritance, separation of concerns.

In [ ]:
# locators.py: all CSS/XPath selectors live here
from selenium.webdriver.common.by import By

class HomePageLocators:
    SEARCH_BOX = (By.NAME, "q")
    GO_BUTTON  = (By.ID, "submit")

class ResultsPageLocators:
    NO_RESULTS = (By.XPATH, "//*[contains(text(), 'No results found')]")
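In [ ]:
# pages.py: one class per page; methods describe user actions
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# As separate files, this module would also need:
#   from locators import HomePageLocators
# (in the notebook the class is already defined by the previous cell)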

class BasePage:
    def __init__(self, driver, timeout=10):
        self.driver = driver
        self.wait   = WebDriverWait(driver, timeout)

class HomePage(BasePage):
    def title_matches(self, expected):
        return expected in self.driver.title

    def search(self, query):
        box = self.wait.until(EC.element_to_be_clickable(HomePageLocators.SEARCH_BOX))
        box.clear()
        box.send_keys(query)
        self.driver.find_element(*HomePageLocators.GO_BUTTON).click()
        return ResultsPage(self.driver)

class ResultsPage(BasePage):
    def has_results(self):
        return "No results found" not in self.driver.page_source
In [ ]:
# test_search.py: reads like prose, no Selenium plumbing in sight
import unittest
from selenium import webdriver
# As a separate file this would also need:  from pages import HomePage

class PythonOrgSearchTest(unittest.TestCase):
    def setUp(self):
        self.driver = webdriver.Chrome()
        self.driver.get("https://www.python.org")

    def test_search_pycon(self):
        home = HomePage(self.driver)
        self.assertTrue(home.title_matches("Python"))
        results = home.search("pycon")
        self.assertTrue(results.has_results())

    def tearDown(self):
        self.driver.quit()

# To run from a terminal:  python -m unittest test_search.py
# (In a notebook, you can use unittest.main(argv=[''], exit=False))

Why this matters for your research code:

  1. When a website changes, you fix one file (locators.py) instead of dozens of cells.
  2. Your scraper script becomes documentation: anyone reading it can see what you're doing without wading through CSS selectors.
  3. You can write unit tests against your scraper, so a single broken page doesn't silently corrupt your dataset.
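
Point 3 does not even require a browser: because a page object only talks to the driver through a few attributes, a stand-in object is enough to test its logic. A minimal sketch, where FakeDriver is a hypothetical helper and ResultsPage is repeated (without the WebDriverWait setup) so the block runs on its own:

```python
import unittest

class ResultsPage:
    """Same logic as the ResultsPage above, repeated so the sketch is self-contained."""
    def __init__(self, driver):
        self.driver = driver

    def has_results(self):
        return "No results found" not in self.driver.page_source


class FakeDriver:
    """Hypothetical stand-in exposing only the attribute ResultsPage reads."""
    def __init__(self, page_source):
        self.page_source = page_source


class ResultsPageTest(unittest.TestCase):
    def test_detects_hits(self):
        page = ResultsPage(FakeDriver("<ul><li>PyCon 2025</li></ul>"))
        self.assertTrue(page.has_results())

    def test_detects_empty_result_page(self):
        page = ResultsPage(FakeDriver("<p>No results found.</p>"))
        self.assertFalse(page.has_results())


# Run the tests programmatically so the cell works in a notebook or a script.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ResultsPageTest)
outcome = unittest.TextTestRunner(verbosity=0).run(suite)
print("tests run:", outcome.testsRun, "| all passed:", outcome.wasSuccessful())
```

Tests like these check the parsing logic only; a periodic run against the live site is still needed to notice when the real HTML changes.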

For a master's research scraper this is overkill. For a thesis, a published replication package, or anything you'll maintain for more than a year, it's the difference between code that ages well and code that doesn't.


Summary

| Situation | Tool |
|---|---|
| Static HTML | requests + BeautifulSoup |
| Need cookies / session | requests.Session() |
| JavaScript-rendered content | selenium (4.6+, no webdriver-manager needed) |
| Site loads data via JSON API | requests.get(api_url).json() (try this first!) |
| Pagination by page number | Loop over integers in URL |
| Pagination by "Next" link | Follow <a rel="next"> |
| Element location | driver.find_element(By.CSS_SELECTOR, "...") |
| Avoid race conditions | WebDriverWait + EC.* (never time.sleep for waits) |
| Maintainable scraper | Page Object Model (Section 7) |

Next lecture

Lecture 7 (NLP, text representation): tokenisation, stopwords, stemming, TF-IDF. Your corpus from this lecture is the input.