Lecture 3 — Pandas, APIs & Econometrics¶

Python for Economists · University of Bologna · 2025/2026¶


What we cover today¶

  1. Warm-up: what pandas + an API can do in 30 seconds
  2. Why pandas? From dicts to DataFrames
  3. Selection, filtering, groupby, merge, reshape
  4. Fetching real data from public APIs
  5. OLS regression with statsmodels
  6. Panel data with fixed effects using linearmodels
  7. Exercise + Milestone: identify your research dataset

Installation (in your course environment):

conda activate <your_env>
conda install -c conda-forge wbdata
pip install linearmodels

requests ships with Anaconda — nothing to install for §6.1.

In [ ]:
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import statsmodels.api as sm
import warnings
warnings.filterwarnings("ignore")

print(f"pandas      {pd.__version__}")
print(f"statsmodels {sm.__version__}")

# pip install linearmodels
try:
    from linearmodels.panel import PanelOLS, PooledOLS, RandomEffects
    import linearmodels
    print(f"linearmodels {linearmodels.__version__}")
except ImportError:
    print("linearmodels not installed — run: pip install linearmodels")

0. Warm-up — from API to chart in 30 seconds¶

Before we look at a single DataFrame method, let's see what today's tools can do. I'll pull live inflation data for five eurozone countries, straight from the World Bank, and plot twenty years of price dynamics.

Watch closely. By the end of the lecture you will be able to write this yourself.

In [ ]:
# Warm-up demo: live inflation data from the World Bank + annotated plot
import matplotlib.pyplot as plt
import pandas as pd

# Eurozone subset — ISO2 codes; the full English names come back from wbdata
countries = {"IT": "Italy", "DE": "Germany", "FR": "France",
             "ES": "Spain", "NL": "Netherlands"}

try:
    import wbdata
    infl = wbdata.get_dataframe(
        {"FP.CPI.TOTL.ZG": "inflation"},
        country=list(countries.keys()),
        date=("2000", "2023"),
    ).reset_index()
    # wbdata returns the country's full name — map back to the ISO2 code
    name_to_code = {v: k for k, v in countries.items()}
    infl["country"] = infl["country"].map(name_to_code)
    infl["year"] = infl["date"].astype(int)
    wide = infl.pivot(index="year", columns="country", values="inflation").sort_index()
except Exception as e:
    # Fallback to cached illustrative values if the API is unreachable in class
    msg = f"{type(e).__name__}: {e}"
    if isinstance(e, ModuleNotFoundError):
        msg += f"  [missing module: {e.name}]"
    print(f"World Bank fetch failed ({msg}), using cached values.")
    years = list(range(2000, 2024))
    cached = {
        "IT": [2.6,2.3,2.6,2.8,2.3,2.2,2.2,2.0,3.5,0.8,1.5,2.7,3.0,1.2,0.2,0.0,-0.1,1.2,1.1,0.6,-0.1,1.9,8.2,5.6],
        "DE": [1.4,2.0,1.4,1.0,1.7,1.5,1.6,2.3,2.6,0.3,1.1,2.1,2.0,1.5,0.9,0.5,0.5,1.5,1.7,1.4,0.4,3.1,6.9,5.9],
        "FR": [1.7,1.6,1.9,2.1,2.1,1.7,1.7,1.5,2.8,0.1,1.5,2.1,2.0,0.9,0.5,0.0,0.2,1.0,1.9,1.1,0.5,1.6,5.2,4.9],
        "ES": [3.4,3.6,3.1,3.0,3.0,3.4,3.5,2.8,4.1,-0.3,1.8,3.2,2.4,1.4,-0.2,-0.5,-0.2,2.0,1.7,0.7,-0.3,3.1,8.4,3.5],
        "NL": [2.3,5.1,3.9,2.2,1.4,1.7,1.2,1.6,2.2,1.0,0.9,2.5,2.8,2.6,0.3,0.2,0.1,1.3,1.6,2.6,1.3,2.8,11.6,4.1],
    }
    wide = pd.DataFrame(cached, index=years)

fig, ax = plt.subplots(figsize=(10, 5))
wide.plot(ax=ax, linewidth=2)
ax.axhline(2, color="grey", linestyle=":", linewidth=0.8, label="ECB target (2%)")
ax.axvspan(2021, 2023, color="red", alpha=0.08, label="Energy shock")
ax.set_title("Eurozone inflation, 2000–2023 (% YoY CPI)", fontsize=13, loc="left")
ax.set_xlabel("")
ax.set_ylabel("Inflation, %")
ax.legend(loc="upper left", frameon=False, ncol=4)
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)
plt.tight_layout()
plt.show()

What just happened?

Three conceptual steps:

  1. An API call retrieved data directly from the World Bank — no manual download, no CSV to clean.
  2. pandas reshaped the data from long to wide format.
  3. matplotlib drew the figure, with a target line and a highlighted shock region.

All in under 20 lines of core code (the fallback block only guards against a flaky classroom connection).

The 2022 inflation spike is clearly visible, and so is how closely inflation tracked the ECB's 2% target for most of the earlier years. By the end of today's lecture, you'll have built every step of this pipeline yourself.
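The long-to-wide reshape in the pipeline is a single pivot call. A minimal sketch with illustrative numbers, independent of the live fetch above:

```python
import pandas as pd

# Long format: one row per (country, year) observation
long_df = pd.DataFrame({
    "country":   ["IT", "IT", "DE", "DE"],
    "year":      [2021, 2022, 2021, 2022],
    "inflation": [1.9, 8.2, 3.1, 6.9],
})

# Wide format: one row per year, one column per country,
# which is the shape matplotlib wants for one line per country
wide = long_df.pivot(index="year", columns="country", values="inflation")
print(wide)
```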

Discussion prompts:

  • Which country looks most vulnerable to the 2022 shock? Which most stable?
  • Why did Spain and Italy have negative inflation in 2014–2016?

Answer (second prompt):

  • Spain and Italy saw negative inflation in 2014–2016: the aftermath of the sovereign debt crisis depressed demand, and the ECB was actively fighting deflation risk.

Why this matters: primes the motivation for pandas + API access as research tools, not just programming toys.


1–5. Pandas essentials¶

We cover the core pandas operations quickly — building on L2 where we worked with dicts and lists manually.

In [ ]:
# From a dict of lists
df = pd.DataFrame({
    "country":   ["Italy","Germany","France","Spain","Netherlands"],
    "code":      ["ITA","DEU","FRA","ESP","NLD"],
    "gdp":       [2010.0,4082.0,2794.0,1419.0,1118.0],
    "growth":    [0.73,-0.30,0.90,2.50,0.10],
    "inflation": [5.8,6.1,5.7,3.5,3.8],
})
df = df.set_index("code")
df["real_growth"] = df["growth"] - df["inflation"]
print(df.shape)
df.describe().round(2)
In [ ]:
# Boolean filtering
print(df[(df["growth"] > 0) & (df["inflation"] < 5)])

# .query() syntax
print(df.query("growth > 0 and inflation < 5"))
In [ ]:
panel = pd.DataFrame({
    "country":    ["Italy","Italy","Italy","Germany","Germany","Germany",
                   "Spain","Spain","Spain"],
    "year":       [2021,2022,2023,2021,2022,2023,2021,2022,2023],
    "gdp_growth": [6.58,7.31,5.29,2.88,7.47,5.49,5.55,10.02,6.85],
    "inflation":  [1.9,8.7,5.8,3.2,8.7,6.1,3.1,8.4,3.5],
    "unemployment":[9.5,8.1,6.7,3.7,3.0,2.9,14.8,12.9,11.8],
})

panel.groupby("country").agg(
    avg_growth   =("gdp_growth",   "mean"),
    avg_inflation=("inflation",    "mean"),
    avg_unemp    =("unemployment", "mean"),
).round(2)
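The lecture outline also lists merge. A minimal sketch joining a growth panel to a second country-level table; the `euro_since` column is made-up metadata, purely for illustration:

```python
import pandas as pd

growth = pd.DataFrame({
    "country":    ["Italy", "Italy", "Germany", "Germany"],
    "year":       [2022, 2023, 2022, 2023],
    "gdp_growth": [7.31, 5.29, 7.47, 5.49],
})

# A second table keyed on country (euro_since is illustrative metadata)
meta = pd.DataFrame({
    "country":    ["Italy", "Germany"],
    "euro_since": [1999, 1999],
})

# Left join: every row of `growth` picks up its country's metadata
merged = growth.merge(meta, on="country", how="left")
print(merged.shape)  # (4, 4)
```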

6. Fetching real data from public APIs¶

Everything we've done so far used toy DataFrames. In research you usually start from an API — a URL you query that returns structured data.

We'll use the World Bank API (free, no key required, data for 200+ countries) and do the same fetch twice:

  • §6.1 — by hand, with requests. This shows what an API actually is: a URL, some query parameters, a JSON response you parse. Every REST API you'll meet works this way.
  • §6.2 — with wbdata, a Python wrapper. Once you've seen what happens underneath, the one-liner version is a legitimate shortcut.

The point of doing both: wrappers exist only for the most popular data sources. The next API you need (a central bank, a ministry, a private provider) probably won't have one — and then requests is all you have.

6.1 — Talking to the API by hand with requests¶

The World Bank exposes its data at URLs of the form:

https://api.worldbank.org/v2/country/{codes}/indicator/{indicator_code}?format=json&date=YYYY:YYYY

Let's construct one, send a GET request, and look at what comes back.

In [ ]:
import requests

countries = ["IT","DE","FR","ES","NL","PT","GR","BE","AT","FI"]
indicator = "NY.GDP.MKTP.KD.ZG"   # real GDP growth, annual %

url = f"https://api.worldbank.org/v2/country/{';'.join(countries)}/indicator/{indicator}"
params = {"format": "json", "date": "2000:2023", "per_page": 1000}

resp = requests.get(url, params=params, timeout=30)
resp.raise_for_status()           # raise an error if the HTTP status is not 2xx

payload = resp.json()

# The World Bank returns a two-element JSON array:
#   payload[0] = pagination metadata (page, pages, per_page, total, ...)
#   payload[1] = list of observations, one per (country, year)
print("Full request URL:", resp.url)
print("Metadata        :", payload[0])
print("\nFirst raw observation:")
print(payload[1][0])
In [ ]:
# Build a clean DataFrame from the raw JSON observations.
# Each observation is a dict with keys: country, countryiso3code, date, value, ...
g = pd.DataFrame([
    {"country":    obs["country"]["id"],   # ISO2 code
     "year":       int(obs["date"]),
     "gdp_growth": obs["value"]}
    for obs in payload[1]
    if obs["value"] is not None           # drop years with no reported value
])

print(g.shape)
g.head()

6.2 — The same thing with a wrapper: wbdata¶

Now that you've seen what the HTTP/JSON layer looks like, we can accept a library that hides it. wbdata is the actively maintained Python wrapper for the World Bank API. Bonus: it can fetch multiple indicators at once and hand back a ready-to-use DataFrame.

In [ ]:
import wbdata

countries = ["IT","DE","FR","ES","NL","PT","GR","BE","AT","FI"]

# A dict mapping World Bank indicator codes to the column names we want
indicators = {
    "NY.GDP.MKTP.KD.ZG": "gdp_growth",
    "SL.UEM.TOTL.ZS":    "unemployment",
}

df = wbdata.get_dataframe(
    indicators,
    country=countries,
    date=("2000", "2023"),
).reset_index()

# wbdata returns full country names ("Italy", "Germany", ...) — map back to ISO2
# so the country column matches what we saw in §6.1 and stays compact downstream.
name_to_code = {
    "Italy":"IT","Germany":"DE","France":"FR","Spain":"ES","Netherlands":"NL",
    "Portugal":"PT","Greece":"GR","Belgium":"BE","Austria":"AT","Finland":"FI",
}
df["country"] = df["country"].map(name_to_code)
df = df.rename(columns={"date": "year"})
df["year"] = df["year"].astype(int)
df = df.dropna().reset_index(drop=True)

df.to_csv("eurozone_panel.csv", index=False)
print(df.shape)
df.head()

World Bank indicators — mini dictionary¶

A curated set of the indicator codes most commonly used in applied macro / political-economy work. Pass any of these as the key in wbdata.get_dataframe({code: name}, ...), or substitute the indicator variable in the raw-requests version of §6.1.

Full catalogue: https://data.worldbank.org/indicator — or use wbdata.search_indicators("keyword") from the notebook.

Growth & Output

Code | Indicator | Units
NY.GDP.MKTP.KD.ZG | GDP growth, real | annual %
NY.GDP.MKTP.CD | GDP, nominal | current USD
NY.GDP.MKTP.KD | GDP, real | constant 2015 USD
NY.GDP.PCAP.KD | GDP per capita, real | constant 2015 USD
NY.GDP.PCAP.KD.ZG | GDP per capita growth | annual %
NY.GDP.PCAP.PP.KD | GDP per capita, PPP | constant 2017 international $

Labour market

Code | Indicator | Units
SL.UEM.TOTL.ZS | Unemployment rate, total | % of labour force
SL.UEM.1524.ZS | Youth unemployment (15–24) | % of youth labour force
SL.TLF.CACT.ZS | Labour force participation | % of population 15+
SL.TLF.CACT.FE.ZS | Female labour force participation | % of female pop. 15+
SL.EMP.TOTL.SP.ZS | Employment-to-population ratio | % of pop. 15+

Prices & exchange rates

Code | Indicator | Units
FP.CPI.TOTL.ZG | CPI inflation | annual %
FP.CPI.TOTL | Consumer price index | 2010 = 100
NY.GDP.DEFL.KD.ZG | GDP deflator, inflation | annual %
PA.NUS.FCRF | Official exchange rate | LCU per USD, period avg

Fiscal & public finance

Code | Indicator | Units
GC.DOD.TOTL.GD.ZS | Central government debt | % of GDP
GC.XPN.TOTL.GD.ZS | General government expenditure | % of GDP
GC.REV.XGRT.GD.ZS | General government revenue (ex. grants) | % of GDP
GC.BAL.CASH.GD.ZS | Cash surplus / deficit | % of GDP
GC.TAX.TOTL.GD.ZS | Tax revenue | % of GDP

Trade & external sector

Code | Indicator | Units
NE.EXP.GNFS.ZS | Exports of goods and services | % of GDP
NE.IMP.GNFS.ZS | Imports of goods and services | % of GDP
BN.CAB.XOKA.GD.ZS | Current account balance | % of GDP
BX.KLT.DINV.WD.GD.ZS | FDI, net inflows | % of GDP

Money & finance

Code | Indicator | Units
FM.LBL.BMNY.GD.ZS | Broad money (M2) | % of GDP
FR.INR.RINR | Real interest rate | %
FS.AST.DOMS.GD.ZS | Domestic credit provided by financial sector | % of GDP

Demography & human capital

Code | Indicator | Units
SP.POP.TOTL | Population, total | persons
SP.POP.GROW | Population growth | annual %
SP.DYN.LE00.IN | Life expectancy at birth | years
SP.URB.TOTL.IN.ZS | Urban population | % of total
SE.ADT.LITR.ZS | Adult literacy rate | % of pop. 15+
SE.XPD.TOTL.GD.ZS | Government expenditure on education | % of GDP

Inequality & poverty

Code | Indicator | Units
SI.POV.GINI | Gini index | 0–100
SI.POV.DDAY | Poverty headcount at $2.15/day (2017 PPP) | % of pop.
SI.DST.10TH.10 | Income share held by highest 10% | %

Governance & institutions (WGI indicators; coverage varies by country and year)

Code | Indicator | Units
CC.EST | Control of corruption, estimate | –2.5 to 2.5
RQ.EST | Regulatory quality, estimate | –2.5 to 2.5
RL.EST | Rule of law, estimate | –2.5 to 2.5
VA.EST | Voice and accountability, estimate | –2.5 to 2.5

Practical tips.

  • Always check coverage before using — wbdata.get_dataframe({code: name}, country=..., date=...) then .isna().sum() per column.
  • Indicator units are not always what the name suggests: GC.* series are central government in some countries and general government in others. Read the metadata at https://data.worldbank.org/indicator/{code}.
  • For EU/eurozone work, Eurostat has finer geography and higher frequency; use World Bank for cross-region comparability.
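The first tip in code: after any fetch, count missing values per column before trusting a series. A network-free sketch with a toy DataFrame standing in for the wbdata result (the values are illustrative):

```python
import numpy as np
import pandas as pd

# Stand-in for a wbdata.get_dataframe(...) result
check = pd.DataFrame({
    "country":    ["IT", "IT", "DE", "DE"],
    "year":       [2022, 2023, 2022, 2023],
    "gini":       [35.2, np.nan, 31.7, np.nan],  # sparse indicator
    "gdp_growth": [3.7, 0.9, 1.8, -0.3],         # dense indicator
})

# Missing-value count per column: sparse series stand out immediately
print(check.isna().sum())
```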

⏱ Five-minute challenge¶

Using what we've just seen (World Bank API + pandas), answer the following question in 5 minutes:

Which EU country had the lowest youth unemployment rate (ages 15–24) in the most recent available year? And which had the highest?

Rules

  • Use the tools we've seen above — either requests or wbdata is fine.
  • You may discuss with a neighbour but must type the code yourself.
  • When you have an answer, raise your hand.

Hint. The World Bank indicator code for youth unemployment (ages 15–24) is SL.UEM.1524.ZS.

In [ ]:
# Reference solution — reveal only after the timer
import wbdata

eu = ["AT","BE","BG","HR","CY","CZ","DK","EE","FI","FR","DE","GR","HU",
      "IE","IT","LV","LT","LU","MT","NL","PL","PT","RO","SK","SI","ES","SE"]

yu = wbdata.get_dataframe(
    {"SL.UEM.1524.ZS": "youth_unemp"},
    country=eu,
    date=("2020", "2024"),
).reset_index().rename(columns={"date": "year"})
yu["year"] = yu["year"].astype(int)
yu = yu.dropna()

# Most recent year available per country
latest = (yu.sort_values("year").groupby("country").tail(1)
            .sort_values("youth_unemp"))

print("LOWEST youth unemployment:")
print(latest.head(3).to_string(index=False))
print("\nHIGHEST youth unemployment:")
print(latest.tail(3).to_string(index=False))

Notes:

  • First student to find the answer earns a participation point.
  • Reveal the reference code only after the timer.
  • Typical answers: Germany/Netherlands/Czechia at the low end, Spain/Greece/Italy at the high end.
  • Useful segue into labour market heterogeneity, which sets up the Okun's law regressions coming next.

7. OLS Regression with statsmodels¶

statsmodels is the standard Python econometrics library. Its formula interface mirrors R and is close to Stata's reg.

7.1 Cross-sectional OLS: Okun's law¶

Does higher unemployment predict lower GDP growth? We start with country averages.

In [ ]:
# Cross-sectional OLS: Okun's law
avg = df.groupby("country").agg(
    gdp_growth   =("gdp_growth",   "mean"),
    unemployment =("unemployment", "mean"),
).reset_index()

model_cs = smf.ols("gdp_growth ~ unemployment", data=avg).fit()
print(model_cs.summary())
In [ ]:
# Robust standard errors (HC3)
model_robust = smf.ols("gdp_growth ~ unemployment", data=avg).fit(cov_type="HC3")

print(f"Coeff  : {model_robust.params['unemployment']:.3f}")
print(f"SE(HC3): {model_robust.bse['unemployment']:.3f}")
print(f"p-value: {model_robust.pvalues['unemployment']:.4f}")
print(f"95% CI : {model_robust.conf_int().loc['unemployment'].tolist()}")
In [ ]:
from statsmodels.iolib.summary2 import summary_col
import re

m1 = smf.ols("gdp_growth ~ unemployment", data=df).fit()
m2 = smf.ols("gdp_growth ~ unemployment", data=df).fit(cov_type="HC3")
m3 = smf.ols("gdp_growth ~ unemployment + C(year)", data=df).fit(cov_type="HC3")

table = summary_col(
    [m1, m2, m3],
    model_names=["(1) OLS","(2) HC3","(3) Year FE"],
    stars=True,
    info_dict={"N":  lambda m: f"{int(m.nobs)}",
               "R2": lambda m: f"{m.rsquared:.3f}"}
)
print(table)

8. Panel Data with Fixed Effects using linearmodels¶

Pooled OLS ignores the panel structure. Any time-invariant country characteristic (institutions, geography) that correlates with both unemployment and growth will confound our estimates.

Fixed effects (FE) absorb all time-invariant unobserved heterogeneity by demeaning within each unit — equivalent to a dummy per country.

The FE estimator identifies the coefficient purely from within-country variation over time. Cross-country level differences are absorbed.

The Bellman equation of econometrics¶

$$V_t(W_t) = \max_{c_t} \left[ u(c_t) + \beta \mathbb{E}[V_{t+1}(W_{t+1})] \right]$$

...actually that comes in L4. Here the key equation is the within estimator:

$$y_{it} - \bar{y}_i = \beta(x_{it} - \bar{x}_i) + (\varepsilon_{it} - \bar{\varepsilon}_i)$$
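The within transformation can be checked numerically: demeaning y and x inside each unit and running OLS on the demeaned data gives exactly the same slope as regressing y on x plus a full set of unit dummies (the LSDV estimator). A sketch on a small simulated panel; all numbers are synthetic:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_units, n_periods = 4, 6
unit = np.repeat(np.arange(n_units), n_periods)

alpha = np.array([1.0, -2.0, 0.5, 3.0])[unit]          # unit fixed effects
x = rng.normal(size=unit.size) + alpha                 # x correlated with the FE
y = 2.0 * x + alpha + rng.normal(scale=0.1, size=unit.size)

d = pd.DataFrame({"unit": unit, "x": x, "y": y})

# (a) Within estimator: demean inside each unit, then take the OLS slope
xd = d["x"] - d.groupby("unit")["x"].transform("mean")
yd = d["y"] - d.groupby("unit")["y"].transform("mean")
beta_within = (xd @ yd) / (xd @ xd)

# (b) LSDV: regress y on x plus one dummy per unit, no constant
X = np.column_stack([d["x"], pd.get_dummies(d["unit"], dtype=float)])
beta_lsdv = np.linalg.lstsq(X, d["y"], rcond=None)[0][0]

print(beta_within, beta_lsdv)  # the two slopes coincide, close to the true 2.0
```

Because x is built to correlate with the fixed effects, a pooled regression of y on x would be biased; both FE routes above recover the true slope.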

In [ ]:
# Panel setup for linearmodels — requires MultiIndex (entity, time)
from linearmodels.panel import PanelOLS, PooledOLS, RandomEffects

panel_df = df.set_index(["country","year"])
print(panel_df.head(8))
print(f"\nDimensions: {panel_df.index.get_level_values(0).nunique()} countries x "
      f"{panel_df.index.get_level_values(1).nunique()} years")
In [ ]:
# Pooled OLS — baseline (ignores panel structure)
res_pool = PooledOLS.from_formula(
    "gdp_growth ~ unemployment",
    data=panel_df
).fit(cov_type="clustered", cluster_entity=True)
print("Pooled OLS (clustered SE):")
print(res_pool.summary.tables[1])
In [ ]:
# Two-way FE: country FE + year FE — manual year dummies
#
# Why manual dummies? Two bugs in linearmodels 7.0 × pandas 3.0 ruled out the
# shorter options:
#   1. `EntityEffects + TimeEffects` crashes in the two-way fast path
#      (`expand_categoricals` lacks an empty-frame guard in pandas 3.0).
#   2. `C(year_cat)` via the formula interface expands to a cartesian product
#      on MultiIndex panel data (rows blow up to n_obs × n_years).
# Safe path: build the year dummies ourselves and pass them as plain numeric
# regressors. Coefficients on `unemployment` are numerically identical to TWFE;
# we just see the year dummies printed explicitly rather than absorbed silently.

import pandas as pd

year_idx = panel_df.index.get_level_values("year")
year_dummies = pd.get_dummies(
    pd.Series(year_idx, index=panel_df.index, name="year"),
    prefix="yr", drop_first=True, dtype=float,
)

exog = pd.concat([panel_df[["unemployment"]], year_dummies], axis=1)

res_fe = PanelOLS(
    dependent=panel_df["gdp_growth"],
    exog=exog,
    entity_effects=True,
).fit(cov_type="clustered", cluster_entity=True)

print("Two-way FE (country + year), clustered SE:")
print(res_fe.summary.tables[1])
In [ ]:
# Side-by-side comparison
cp = float(res_pool.params["unemployment"]); sp = float(res_pool.std_errors["unemployment"])
cf = res_fe.params["unemployment"];          sf = res_fe.std_errors["unemployment"]

print("=" * 52)
print(f"{'':28} {'Pooled':>10}  {'TWFE':>8}")
print("-" * 52)
print(f"{'unemployment':<28} {cp:>10.3f}   {cf:>8.3f}")
print(f"{'':28} ({sp:.3f})      ({sf:.3f})")
print(f"{'R2 within':<28} {'':>10}   {res_fe.rsquared_within:>8.3f}")
print(f"{'N':<28} {int(res_pool.nobs):>10}   {int(res_fe.nobs):>8}")
print("=" * 52)
print("SEs clustered at country level.")
In [ ]:
# Random Effects — compare with FE (Hausman intuition)
try:
    res_re = RandomEffects.from_formula(
        "gdp_growth ~ unemployment", data=panel_df
    ).fit(cov_type="robust")
    cf = res_fe.params["unemployment"]
    cr = res_re.params["unemployment"]
    print(f"FE coefficient : {cf:.4f}")
    print(f"RE coefficient : {cr:.4f}")
    print(f"Difference     : {abs(cf-cr):.4f}")
    print()
    print("A large difference suggests the RE orthogonality assumption is violated")
    print("=> FE is the consistent estimator.")
except Exception as e:  # small panels can trip RE estimation numerically
    print(f"RandomEffects failed ({type(e).__name__}): numerical issue with this small panel.")
    print("Concept: RE assumes the entity effect is uncorrelated with regressors.")
    print("When violated, FE is the consistent estimator — which is what we used above.")
In [ ]:
# Export to LaTeX (like esttab/outreg2 in Stata)
latex_out = res_fe.summary.as_latex()
with open("twfe_table.tex","w") as f:
    f.write(latex_out)
print("Saved: twfe_table.tex")
print()
for line in latex_out.split("\n")[:15]:
    print(line)

9. Exercise¶

Time: ~25 minutes. Work individually.

Task 1. Download inflation (FP.CPI.TOTL.ZG) for the 10 eurozone countries, 2000–2023. Merge with the df DataFrame in memory. Add a post_crisis dummy (1 if year >= 2008) and an interaction unemp_x_post = unemployment × post_crisis.

Task 2. Run and compare four specifications:

  • (1) Pooled OLS
  • (2) Pooled OLS + inflation, HC3 SE
  • (3) Country FE only, clustered SE
  • (4) Two-way FE (country + year), clustered SE

Report in a comparison table. What happens to the unemployment coefficient?

Task 3. Add the interaction unemp_x_post to the TWFE specification. Did the Okun relationship change after the GFC?

Task 4 (optional). Export your preferred specification to a LaTeX table.

In [ ]:
# Task 1 — download inflation (FP.CPI.TOTL.ZG) and merge with df
# YOUR CODE HERE

# Task 2 — four specifications: Pooled OLS, Pooled+inflation, Country FE, TWFE
# YOUR CODE HERE

# Task 3 — interaction: unemployment x post_crisis dummy
# YOUR CODE HERE

# Task 4 (optional) — LaTeX table of preferred specification
# YOUR CODE HERE

Solution¶

In [ ]:
# SOLUTION ---------------------------------------------------------------
import pandas as pd, warnings
warnings.filterwarnings("ignore")
import wbdata
import statsmodels.formula.api as smf
from statsmodels.iolib.summary2 import summary_col
from linearmodels.panel import PanelOLS, PooledOLS

countries = ["IT","DE","FR","ES","NL","PT","GR","BE","AT","FI"]
name_to_code = {
    "Italy":"IT","Germany":"DE","France":"FR","Spain":"ES","Netherlands":"NL",
    "Portugal":"PT","Greece":"GR","Belgium":"BE","Austria":"AT","Finland":"FI",
}

# Task 1
infl = wbdata.get_dataframe(
    {"FP.CPI.TOTL.ZG": "inflation"},
    country=countries,
    date=("2000", "2023"),
).reset_index().rename(columns={"date": "year"})
infl["country"] = infl["country"].map(name_to_code)
infl["year"] = infl["year"].astype(int)

df = pd.read_csv("eurozone_panel.csv"); df["year"] = df["year"].astype(int)
df = pd.merge(df, infl, on=["country","year"], how="left").dropna()
df["post_crisis"]  = (df["year"] >= 2008).astype(int)
df["unemp_x_post"] = df["unemployment"] * df["post_crisis"]
print("Merged:", df.shape)

# Task 2
panel = df.set_index(["country","year"])

# Build year dummies once — reused for TWFE specs (see lecture note in §8)
year_idx = panel.index.get_level_values("year")
year_dummies = pd.get_dummies(
    pd.Series(year_idx, index=panel.index, name="year"),
    prefix="yr", drop_first=True, dtype=float,
)

m1 = smf.ols("gdp_growth ~ unemployment", data=df).fit()
m2 = smf.ols("gdp_growth ~ unemployment + inflation", data=df).fit(cov_type="HC3")
m3 = PanelOLS.from_formula("gdp_growth ~ unemployment + EntityEffects",
     data=panel).fit(cov_type="clustered", cluster_entity=True)
exog_twfe = pd.concat([panel[["unemployment"]], year_dummies], axis=1)
m4 = PanelOLS(
    dependent=panel["gdp_growth"], exog=exog_twfe, entity_effects=True,
).fit(cov_type="clustered", cluster_entity=True)

print("\nCoefficient comparison (unemployment):")
print(f"  (1) Pooled OLS : {m1.params['unemployment']:.3f} ({m1.bse['unemployment']:.3f})")
print(f"  (2) + inflation: {m2.params['unemployment']:.3f} ({m2.bse['unemployment']:.3f})")
print(f"  (3) Country FE : {m3.params['unemployment']:.3f} ({m3.std_errors['unemployment']:.3f})")
print(f"  (4) TWFE       : {m4.params['unemployment']:.3f} ({m4.std_errors['unemployment']:.3f})")

# Task 3: interaction with post-crisis, on top of TWFE
exog_int = pd.concat(
    [panel[["unemployment","unemp_x_post"]], year_dummies],
    axis=1,
)
m5 = PanelOLS(
    dependent=panel["gdp_growth"], exog=exog_int, entity_effects=True,
    drop_absorbed=True,
).fit(cov_type="clustered", cluster_entity=True)
print("\nTask 3 - interaction results:")
print(m5.summary.tables[1])

# Task 4
with open("twfe_results.tex","w") as f:
    f.write(m4.summary.as_latex())
print("\nSaved: twfe_results.tex")

Summary¶

Task | Tool | Stata equivalent
OLS | smf.ols().fit() | reg y x
Robust SE | .fit(cov_type="HC3") | reg y x, robust
Year dummies | C(year) in formula | xi: reg y x i.year
Pooled OLS (panel) | PooledOLS.from_formula() | reg y x
Country FE | EntityEffects | xtreg y x, fe
Two-way FE | EntityEffects + TimeEffects | reghdfe y x, absorb(c y)
Clustered SE | cov_type="clustered" | , cluster(country)
LaTeX export | .summary.as_latex() | esttab

Next lecture¶

Lecture 4 — Visualization, EDA & Dynamic Programming: publication-ready figures, systematic EDA, cake eating problem.