Extra L5: Web Scraping Robustness and HTML Fragility

This notebook complements Lecture 5: Web Scraping I.

Goal

Understand why scraping code breaks and how to make parsers more robust.

Important note

This notebook uses mock HTML strings, so everything runs offline and no real website is scraped.

In [ ]:
from bs4 import BeautifulSoup
import pandas as pd

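1. Two HTML versions of the same webpage

Version 1 (html_v1) is the page your scraper was built against.
Version 2 (html_v2) reflects a small redesign by the website: the content is the same, but the tags, classes, and date markup have changed.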

In [ ]:
html_v1 = '''
<html>
  <body>
    <div class="article">
      <h2 class="title">Inflation slows in March</h2>
      <span class="date">2026-03-12</span>
    </div>
    <div class="article">
      <h2 class="title">Unemployment stable</h2>
      <span class="date">2026-03-13</span>
    </div>
  </body>
</html>
'''

html_v2 = '''
<html>
  <body>
    <article class="news-item">
      <h2 data-role="headline">Inflation slows in March</h2>
      <time datetime="2026-03-12">12 March 2026</time>
    </article>
    <article class="news-item">
      <h2 data-role="headline">Unemployment stable</h2>
      <time datetime="2026-03-13">13 March 2026</time>
    </article>
  </body>
</html>
'''

2. A brittle parser

This parser hard-codes the tags and classes of version 1, so it only extracts data from html_v1.

In [ ]:
def parse_brittle(html):
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for block in soup.select("div.article"):
        title = block.select_one("h2.title").get_text(strip=True)
        date = block.select_one("span.date").get_text(strip=True)
        rows.append({"title": title, "date": date})
    return pd.DataFrame(rows)

print(parse_brittle(html_v1))
In [ ]:
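# Note: no exception is raised here. "div.article" matches nothing in
# html_v2, so the loop never runs and the result is an empty DataFrame.
try:
    df_v2 = parse_brittle(html_v2)
    print("Rows parsed:", len(df_v2))
except Exception as e:
    print("Parser failed:", repr(e))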
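
A fixed selector can break in two different ways, and only one of them is loud. The sketch below separates the two cases. The fragments html_redesigned and html_partial are hypothetical and not part of the notebook's mock pages; they are reduced to the minimum needed to show each case.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the container was redesigned (as in version 2).
html_redesigned = '''
<article class="news-item">
  <h2 data-role="headline">Inflation slows in March</h2>
</article>
'''

# Hypothetical fragment: the container is unchanged, but the headline
# lost its class.
html_partial = '''
<div class="article">
  <h2>Inflation slows in March</h2>
  <span class="date">2026-03-12</span>
</div>
'''

# Failure mode 1 (silent): the container selector matches nothing, so a
# loop over the result would never run. No error, no data.
soup = BeautifulSoup(html_redesigned, "html.parser")
silent_matches = soup.select("div.article")
print("containers found:", len(silent_matches))

# Failure mode 2 (loud): the container matches, but an inner selector does
# not. select_one returns None, and calling .get_text() on None raises.
soup = BeautifulSoup(html_partial, "html.parser")
block = soup.select_one("div.article")
try:
    block.select_one("h2.title").get_text(strip=True)
    error_name = None
except AttributeError as e:
    error_name = type(e).__name__
print("inner selector error:", error_name)
```

The silent case is arguably the more dangerous one, because a scheduled scraper can keep running for weeks while collecting nothing.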

3. A more robust parser

A better parser:

  • tries multiple selectors,
  • checks whether a node exists,
  • fails gracefully.
In [ ]:
def first_text(parent, selectors):
    for sel in selectors:
        node = parent.select_one(sel)
        if node is not None:
            if node.has_attr("datetime"):
                return node["datetime"]
            return node.get_text(strip=True)
    return None

def parse_robust(html):
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    blocks = soup.select("div.article, article.news-item")
    for block in blocks:
        title = first_text(block, ["h2.title", "h2[data-role='headline']", "h2"])
        date = first_text(block, ["span.date", "time[datetime]", "time"])
        rows.append({"title": title, "date": date})
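    # Fix the columns so that an empty result still has "title" and "date";
    # otherwise df["title"] raises a KeyError when no blocks are found.
    return pd.DataFrame(rows, columns=["title", "date"])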

print(parse_robust(html_v1))
print(parse_robust(html_v2))
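
To see what "fails gracefully" means in practice, here is a minimal sketch with an item that has no date element at all. The fragment html_no_date and its headline are hypothetical; first_text is repeated from the cell above only so that this cell runs on its own.

```python
from bs4 import BeautifulSoup

def first_text(parent, selectors):
    # Same fallback helper as in the notebook, repeated so this cell runs alone.
    for sel in selectors:
        node = parent.select_one(sel)
        if node is not None:
            if node.has_attr("datetime"):
                return node["datetime"]
            return node.get_text(strip=True)
    return None

# Hypothetical item with a headline but no date element.
html_no_date = '''
<article class="news-item">
  <h2 data-role="headline">Wages rise</h2>
</article>
'''

block = BeautifulSoup(html_no_date, "html.parser").select_one("article.news-item")
title = first_text(block, ["h2.title", "h2[data-role='headline']", "h2"])
date = first_text(block, ["span.date", "time[datetime]", "time"])

# The missing field becomes None instead of raising, so the row is kept
# and the gap can be counted later.
print({"title": title, "date": date})
```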

4. Logging parse quality

In real projects, you should count:

  • parsed items,
  • missing titles,
  • missing dates,
  • duplicate URLs or IDs.
In [ ]:
def parse_with_quality_report(html):
    df = parse_robust(html)
    report = {
        "n_rows": len(df),
        "missing_title": int(df["title"].isna().sum()),
        "missing_date": int(df["date"].isna().sum())
    }
    return df, report

for label, html in {"v1": html_v1, "v2": html_v2}.items():
    df_parsed, report = parse_with_quality_report(html)
    print(label, report)
    display(df_parsed)
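
The report above covers row counts and missing fields but not duplicates, because the mock HTML carries no URLs or IDs. A minimal sketch of the duplicate count, assuming a hypothetical url column with made-up values:

```python
import pandas as pd

# Hypothetical scraped rows; the "url" column and its values are
# assumptions, since the notebook's mock HTML contains no links.
df = pd.DataFrame({
    "title": [
        "Inflation slows in March",
        "Unemployment stable",
        "Inflation slows in March",
    ],
    "url": ["/news/1", "/news/2", "/news/1"],
})

# duplicated() marks every repeat of an earlier value, so the sum counts
# the extra copies, not the first occurrence.
n_duplicate_urls = int(df["url"].duplicated().sum())
report = {"n_rows": len(df), "duplicate_url": n_duplicate_urls}
print(report)
```

A non-zero count often points to pagination overlap or to the same article being listed in several sections.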

Short exercise

Answer in words:

  1. Why is selecting by a long CSS path often fragile?
  2. Why should a scraper produce a quality report after each run?
  3. What would you save to disk after a scraping session besides the final dataset?
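
One way to explore question 1 before answering it is to compare a long and a short selector on two hypothetical variants of the same article, where the second variant only adds a wrapper element. All names and fragments here are illustrative.

```python
from bs4 import BeautifulSoup

# Hypothetical variants of one article: the second adds a wrapper <div>.
html_before = (
    '<body><div class="article">'
    '<h2 class="title">Inflation slows in March</h2>'
    '</div></body>'
)
html_after = (
    '<body><div class="wrapper"><div class="article">'
    '<h2 class="title">Inflation slows in March</h2>'
    '</div></div></body>'
)

long_path = "body > div.article > h2.title"  # depends on the exact nesting
short_path = "h2.title"                      # depends only on the target

results = {}
for name, html in {"before": html_before, "after": html_after}.items():
    soup = BeautifulSoup(html, "html.parser")
    results[name] = {
        "long": soup.select_one(long_path) is not None,
        "short": soup.select_one(short_path) is not None,
    }
print(results)
```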

Optional extension

  • Add a third HTML version with a missing date.
  • Modify the parser so it stores a parse_status field.
  • Save the raw HTML before parsing so the scraping step and parsing step are separable.
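
As a starting point for the last item, here is a minimal sketch of saving raw HTML before parsing. The helper save_raw_html, its file-naming scheme, and the stand-in page are assumptions made for illustration; a temporary directory is used so the sketch leaves nothing behind.

```python
import hashlib
import tempfile
from datetime import datetime, timezone
from pathlib import Path

def save_raw_html(html, out_dir, label):
    """Write raw HTML to disk and return the path (hypothetical helper)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # A UTC timestamp plus a short content hash keeps file names unique
    # and makes unchanged pages easy to spot.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    digest = hashlib.sha256(html.encode("utf-8")).hexdigest()[:12]
    path = out_dir / f"{label}_{stamp}_{digest}.html"
    path.write_text(html, encoding="utf-8")
    return path

# Stand-in for a downloaded page.
raw_html = "<html><body><h2>Example</h2></body></html>"

with tempfile.TemporaryDirectory() as tmp:
    saved = save_raw_html(raw_html, tmp, "v1")
    # The parsing step can later start from the file, not from the network.
    reloaded = saved.read_text(encoding="utf-8")
    saved_name = saved.name

print(saved_name)
```

With the raw files on disk, a parser that turns out to be wrong can be fixed and rerun without requesting the pages again.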