Extra — Research Design & Replication¶
Companion notebook for building a dataset¶
Reading¶
- King, G. (1995). Replication, Replication. PS: Political Science and Politics 28(3), 444-452. — The classic article on replication standards, still the right starting point.
- Christensen, G. & Miguel, E. (2018). Transparency, Reproducibility, and the Credibility of Economics Research. Journal of Economic Literature 56(3), 920-980.
- Imbens, G. W. (2021). Statistical Significance, p-Values, and the Reporting of Uncertainty. Journal of Economic Perspectives 35(3), 157-174. — On how to report quantitative findings honestly.
- Manski, C. F. (2019). Communicating Uncertainty in Policy Analysis. PNAS 116(16), 7634-7641.
1. A replication checklist for dataset projects¶
A pre-publication audit. Read each item and mark done, partial, or todo. If anything is todo or partial, fix it before circulating the paper — these are the things a reader will check first.
The checklist is specialised to projects that build original datasets through web scraping or text-as-data pipelines. Generic replication advice (Stata do-files, fixed seeds, version control) is in the references above. This list captures items specific to the workflow used when the data are constructed by the researcher rather than downloaded ready-made.
import pandas as pd
# Replication checklist for dataset projects: 3 priorities x 5 items each
dataset_checklist = pd.DataFrame({
    "category": [
        # Index quality (Priority 1)
        "Index", "Index", "Index", "Index", "Index",
        # Scraping reproducibility (Priority 2)
        "Scraping", "Scraping", "Scraping", "Scraping", "Scraping",
        # Honest interpretation (Priority 3)
        "Interpretation", "Interpretation", "Interpretation", "Interpretation", "Interpretation",
    ],
    "item": [
        # Index quality
        "Construct definition stated in one sentence (what the index measures, and what it does NOT)",
        "Index methodology described step-by-step (tokenisation, dictionary or model, aggregation rule)",
        "Validation sample: at least 20 documents (ideally 50) hand-coded and compared to the index",
        "Sample anchor texts included in the appendix (3-5 documents with high/low/mid index values)",
        "Sensitivity check: index recomputed with one alternative choice (different dictionary, different model)",
        # Scraping reproducibility
        "Scraping script terminates cleanly on its own (no manual restarts hidden in comments)",
        "Checkpoint logic implemented: re-running picks up where the previous run stopped",
        "URL regex matches structural patterns, not CSS classes or fragile DOM positions",
        "Date parsing handles edge cases (`dropna` after `to_datetime`, document rejection rule explicit)",
        "Final corpus stats reported: N documents, date range, source, missing-value rate",
        # Honest interpretation
        "Claim type explicit ('we document', 'we estimate', 'we suggest') and matches the design",
        "If causal language used: identification strategy stated and threats discussed",
        "If descriptive: stated as such, no implicit causal verbs (`leads to`, `causes`, `drives`)",
        "Time-series plots accompanied by event annotations (so the reader can audit visually)",
        "Limitations section names at least 2 specific weaknesses (not generic 'more data needed')",
    ],
    # Every item starts as "todo"; change to "partial" or "done" as you audit
    "status": ["todo"] * 15,
})
print(dataset_checklist.to_string(index=False))
📊 How to use the checklist. Print it, fill it manually, attach it as an appendix to the paper if you want — referees appreciate a self-assessment that names specific weaknesses.
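Once a few items are marked, a quick tally shows where the open work sits. The sketch below is a plain-Python version that works on a list of row dictionaries (the shape that `dataset_checklist.to_dict("records")` returns). The helper name `open_items_by_category` and the three sample rows are illustrative assumptions, not part of the notebook.

```python
from collections import Counter

def open_items_by_category(rows):
    """Count checklist items per category whose status is not 'done'."""
    return Counter(row["category"] for row in rows if row["status"] != "done")

# Illustrative rows only; in the notebook, pass dataset_checklist.to_dict("records").
sample_rows = [
    {"category": "Index", "item": "Construct definition stated", "status": "done"},
    {"category": "Scraping", "item": "Checkpoint logic implemented", "status": "partial"},
    {"category": "Interpretation", "item": "Claim type explicit", "status": "todo"},
]

open_counts = open_items_by_category(sample_rows)
for category, n_open in open_counts.items():
    print(f"{category}: {n_open} item(s) still open")
```

Both `partial` and `todo` count as open here, which matches the rule that anything short of done should be fixed before circulating the paper.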
The most useful item is the claim-type explicit check (Interpretation #1). Read your own draft and underline every causal verb (`affect`, `cause`, `drive`, `lead to`, `result in`, `because`). For each one, ask: does my design actually support this claim? If not, replace it with `is associated with`, `coincides with`, or `co-moves with`. Your conclusion section should not contain the word causal unless you have a clean identification strategy.
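The underlining pass can be partly automated. The following is a rough sketch, assuming the draft is available as plain text; the regular expression covers the verbs listed above plus a few common inflections, and both the pattern and the function name `flag_causal_language` are illustrative assumptions. It will miss some phrasings and flag some harmless ones, so it complements a manual read and does not replace it.

```python
import re

# Causal verbs from the list above; the inflection handling is a rough assumption.
CAUSAL_PATTERN = re.compile(
    r"\b(affect(?:s|ed|ing)?|caus(?:e|es|ed|ing)|driv(?:e|es|en|ing)|drove|"
    r"lead(?:s|ing)? to|led to|result(?:s|ed|ing)? in|because)\b",
    flags=re.IGNORECASE,
)

def flag_causal_language(text):
    """Return (line_number, matched_phrase, line) for each causal verb found."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for match in CAUSAL_PATTERN.finditer(line):
            hits.append((line_no, match.group(0), line.strip()))
    return hits

# Made-up draft sentences for illustration.
draft = """Hawkish statements lead to higher yields.
The index is associated with yield changes.
Coverage rose because of the 2015 shock."""

hits = flag_causal_language(draft)
for line_no, phrase, line in hits:
    print(f"line {line_no}: '{phrase}' in: {line}")
```

Each flagged line still needs the human question: does the design support this claim?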
2. Project map — what to fill in¶
For each stage of your pipeline, fill in:
- File or script — the actual filename
- Main risk — the single thing most likely to break
- How validated — what evidence you have that the step worked correctly
The point is not to produce a perfect document but to force yourself to identify the weak link. Most dataset projects have one stage that is significantly more fragile than the others — usually scraping or index construction. Knowing which one helps you allocate time to fix it before circulating the draft.
# Empty template - fill in for YOUR project
import pandas as pd
project_map_template = pd.DataFrame({
    "stage": [
        "raw data acquisition",
        "cleaning & deduplication",
        "feature / index construction",
        "validation",
        "analysis",
        "figures & tables",
        "write-up",
    ],
    # One empty cell per stage (7 stages)
    "file_or_script": ["", "", "", "", "", "", ""],
    "main_risk": ["", "", "", "", "", "", ""],
    "how_validated": ["", "", "", "", "", "", ""],
})
project_map_template
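A small helper can list which cells of the map are still blank, which is a quick way to see where the weak link has not yet been named. This is a plain-Python sketch over row dictionaries (the shape `project_map_template.to_dict("records")` would give); the function name `blank_cells` and the partly filled rows are hypothetical.

```python
def blank_cells(rows, columns=("file_or_script", "main_risk", "how_validated")):
    """List (stage, column) pairs whose entry is empty or whitespace only."""
    return [
        (row["stage"], col)
        for row in rows
        for col in columns
        if not str(row.get(col, "")).strip()
    ]

# Illustrative, partly filled map; the entries are placeholders.
partial_map = [
    {"stage": "raw data acquisition", "file_or_script": "01_scrape.py",
     "main_risk": "site restructure breaks URL regex", "how_validated": ""},
    {"stage": "validation", "file_or_script": "", "main_risk": "", "how_validated": ""},
]

missing = blank_cells(partial_map)
for stage, column in missing:
    print(f"{stage}: '{column}' not filled in")
```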
3. Worked example 1 — ECB hawkish index, 2015-2025¶
A realistic dataset-construction project: scrape all ECB monetary policy statements between 2015 and 2025, construct a hawkishness index per statement, and document its evolution. The example below is a fully-filled project map for this hypothetical project, showing what a strong dataset paper looks like.
ecb_project_map = pd.DataFrame({
    "stage": [
        "raw data acquisition",
        "cleaning & deduplication",
        "feature / index construction",
        "validation",
        "analysis",
        "figures & tables",
        "write-up",
    ],
    "file_or_script": [
        "01_scrape_ecb_speeches.py",
        "02_clean_corpus.py",
        "03_build_hawkish_index.py",
        "04_validate_index.py",
        "05_analysis.py",
        "06_make_figures.py",
        "L9_paper_draft.tex",
    ],
    "main_risk": [
        "ECB site restructure breaks URL regex; new HTML tags introduce parsing errors",
        "Non-English speeches included by mistake; near-duplicates from translations inflate N",
        "Dictionary too narrow misses hedged hawkish language; too broad picks up unrelated terms",
        "Hand-coded sample too small (N<20) to estimate kappa with useful confidence",
        "Spurious time trend driven by composition shift (more press conferences, fewer formal speeches)",
        "Figure axes default to misleading scales; event annotations missing key dates",
        "Causal language slips into conclusions where only descriptive evidence exists",
    ],
    "how_validated": [
        "Re-run on different machine produces identical corpus (N=487, date range matches)",
        "Manual audit of 30 random documents: 100% English, 0 obvious duplicates",
        "Index correlates 0.74 with hand-coded scores on 50-doc validation sample (kappa=0.62)",
        "Disagreement cases documented in appendix; 3 of 7 are genuinely ambiguous",
        "Robustness check with subsample of formal speeches only; qualitative result holds",
        "Each figure has source note, date range, N, and 3+ vertical event lines (Lagarde, COVID, ZIRP exit)",
        "Conclusion explicitly states 'descriptive evidence' three times; no IV or RD claim made",
    ],
})
ecb_project_map
📊 What makes this map good.
- The main risks are specific and known, not generic ("data quality"). Each one names a concrete failure mode the author has thought about.
- The validations are quantitative where they need to be (kappa=0.62, correlation=0.74) and qualitative where appropriate (manual audit of 30 documents).
- The write-up row is honest: the project does not claim causal effects of ECB language on bond yields. It documents the evolution of language and shows the pattern is robust to one alternative specification. That is a defensible research project, because it claims only what the design supports.
Contrast this with the typical failure mode: a project map where "main_risk" is empty or generic, "how_validated" is "checked by author" for every stage, and the write-up contains sentences like "hawkish ECB statements lead to higher yields" without any identification strategy. That project has the same pipeline but loses substantial credibility on the interpretation front.
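For readers who have not computed an agreement statistic before, here is a minimal sketch of Cohen's kappa between hand-coded labels and index-based labels. It assumes the continuous index has already been binned into the same categories the coder used (an assumption; the map does not say how the reported kappa was obtained), and the ten labels below are made up for illustration.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(labels_a)
    # Share of documents on which the two codings agree
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each coding's marginal frequencies
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both codings use one identical category throughout
    return (observed - expected) / (1 - expected)

# Made-up labels: H = hawkish, D = dovish, N = neutral
hand_coded = ["H", "H", "H", "H", "D", "D", "D", "D", "N", "N"]
index_bins = ["H", "H", "H", "D", "D", "D", "D", "N", "N", "N"]

kappa = cohen_kappa(hand_coded, index_bins)
print(f"kappa = {kappa:.2f}")
```

With a sample this small the estimate is very noisy, which is the risk the validation row of the map names.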
4. Worked example 2 — Newspaper coverage of immigration, 2010-2024¶
A different dataset archetype: scrape headlines from one or more newspapers, build a topic-share index, and document how attention to one issue evolved. The risks are different: newspaper sites are messier, the construct is harder to operationalise, and the comparison to existing measures is more contested.
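To make "topic-share index" concrete, here is one possible construction: the share of headlines per year that contain at least one keyword. It is a sketch under assumptions. The three keywords are the synonyms named in the project map for this example, whereas the hypothetical project's final list has 17 terms; the sample headlines are invented; and the function name is illustrative.

```python
import re
from collections import defaultdict

# Three of the terms mentioned in the project map; a real list would be longer.
KEYWORDS = ("migranti", "rifugiati", "sbarchi")
KEYWORD_RE = re.compile(
    r"\b(" + "|".join(map(re.escape, KEYWORDS)) + r")\b", re.IGNORECASE
)

def immigration_share_by_year(headlines):
    """headlines: iterable of (year, text). Returns {year: share of matching headlines}."""
    totals, matches = defaultdict(int), defaultdict(int)
    for year, text in headlines:
        totals[year] += 1
        if KEYWORD_RE.search(text):
            matches[year] += 1
    return {year: matches[year] / totals[year] for year in sorted(totals)}

# Invented headlines for illustration only.
sample = [
    (2014, "Borsa in rialzo dopo i dati sul lavoro"),
    (2014, "Nuovi sbarchi a Lampedusa"),
    (2015, "Migranti, vertice europeo a Bruxelles"),
    (2015, "Rifugiati: il piano del governo"),
]

shares = immigration_share_by_year(sample)
for year, share in shares.items():
    print(year, f"{share:.0%}")
```

Every choice in this sketch (whole-word matching, case-insensitivity, one match per headline) is a construction decision that belongs in the method appendix.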
newspaper_project_map = pd.DataFrame({
    "stage": [
        "raw data acquisition",
        "cleaning & deduplication",
        "feature / index construction",
        "validation",
        "analysis",
        "figures & tables",
        "write-up",
    ],
    "file_or_script": [
        "01_scrape_corriere.py + 01b_scrape_repubblica.py",
        "02_clean_headlines.py",
        "03_build_immigration_share.py",
        "04_validate_against_gallagher.py",
        "05_event_study_2015_shock.py",
        "06_make_figures.py",
        "L9_paper_draft.tex",
    ],
    "main_risk": [
        "Paywall blocks recent articles for one paper; sample becomes unbalanced over time",
        "Headlines change over the day (live editing); my scrape captures one snapshot per day",
        "Keyword list misses synonyms ('migranti' vs 'rifugiati' vs 'sbarchi') and shifts in usage",
        "No external benchmark of newspaper attention; my measure cannot be triangulated",
        "Coverage was already rising before the 2015 shock; the event study may anchor at the wrong date",
        "With two newspapers on the same axis, one may dominate visually; need facet panels",
        "Tempting to claim 'immigration coverage drove vote shift'; the design supports only a descriptive claim",
    ],
    "how_validated": [
        "Scraping log records HTTP status for each request; coverage rate documented per source/year",
        "Headline text re-scraped 3 days later for 5% sample; 12% differ (audit in appendix)",
        "Keyword list iterated with 5 native speakers; final list of 17 terms documented",
        "Compared against ITANES survey question 'most important issue' — correlation 0.41 (weak but positive)",
        "Pre-period robustness: trend estimated separately for 2010-2014; post-2015 increase attenuated but persists",
        "Figure uses small multiples (one panel per newspaper); event annotation = April 2015 shipwreck",
        "Conclusion: 'we document a sharp and sustained increase in coverage after 2015' (descriptive, NOT causal)",
    ],
})
newspaper_project_map
📊 What this example shows that the ECB one doesn't.
- The construct is harder to defend. "Immigration coverage" is a vaguer concept than "hawkishness". The keyword-iteration step (5 native speakers, 17 terms) is the author's response — they cannot prove the keyword list is right, but they document that it was developed systematically rather than arbitrarily.
- External validation is weaker (correlation 0.41 with ITANES). Honest authors report this anyway. A reader who sees correlation 0.41 will trust the rest of the paper more than one who sees no validation at all, because they know the author is not hiding anything.
- The pre-trend matters. Acknowledging that immigration coverage was already rising before April 2015 is what separates a defensible paper from one that gets demolished in seminar. The robustness check (estimating the trend separately on 2010-2014) is the right way to engage with the threat.
The general lesson for both worked examples: the project map is most useful when filled in before you start writing, because it surfaces the weakest link. The weakest link is what you need to spend extra time defending in the paper. That is where credibility is won and lost.
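The first validation in the newspaper map, a coverage rate per source and year computed from the scraping log, can be sketched as follows. The log layout (one `(source, year, http_status)` tuple per request) and the function name are assumptions made for illustration, and the sample rows are invented.

```python
from collections import defaultdict

def coverage_by_source_year(log_rows):
    """Share of requests that returned HTTP 200, per (source, year)."""
    ok, total = defaultdict(int), defaultdict(int)
    for source, year, status in log_rows:
        key = (source, year)
        total[key] += 1
        if status == 200:
            ok[key] += 1
    return {key: ok[key] / total[key] for key in sorted(total)}

# Invented log rows; a 403 stands in for a paywall block.
log_rows = [
    ("corriere", 2023, 200),
    ("corriere", 2023, 200),
    ("repubblica", 2023, 200),
    ("repubblica", 2023, 403),
]

coverage = coverage_by_source_year(log_rows)
for (source, year), rate in coverage.items():
    print(f"{source} {year}: {rate:.0%} of requests succeeded")
```

A table like this makes an unbalanced sample visible before it distorts the time series.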
5. From pipeline to write-up¶
A strong write-up answers these questions, in this order, on the first reading:
- What is the question? (one sentence, no jargon)
- Why should we care? (one paragraph, with at least one piece of evidence that the question is real)
- What data do we have? (source, period, unit of observation, N)
- How is the key variable constructed? (the index, validated as in §1)
- What is the empirical design? (descriptive / event study / panel regression / IV — whichever, stated explicitly)
- What are the main results? (one figure, one number with confidence interval)
- How fragile are they? (at least one robustness check explicitly aimed at the weakest link)
- What remains unresolved? (limitations — what your design cannot tell us)
A reader who can answer all eight questions after reading your abstract and main figure has read a strong paper. A reader who has to hunt for any of them has not.
Diagnostic exercise: write the eight one-sentence answers for your own project on a single piece of paper. If any answer is longer than two lines, that section of your project needs work.
6. Common failure modes in dataset construction¶
The mistakes that undermine a dataset paper are not coding bugs. At this level they are almost always conceptual or rhetorical.
| Failure mode | What it looks like | How to fix |
|---|---|---|
| Construct ambiguity | "We measure economic uncertainty" without defining what uncertainty is | Define operationally: "the share of headlines containing one of {recession, crisis, downturn, uncertainty}" |
| Hidden construction choices | "We use an LDA topic model" without listing K, alpha, beta, preprocessing | Method appendix with every hyperparameter; sensitivity to one alternative |
| Descriptive graph as causal evidence | A time-series plot with the headline "$X$ leads to $Y$" | Headline should describe the pattern ("$X$ and $Y$ co-move from 2015"); causal language only with a design |
| No discussion of missing data | Sample N reported once at the start, then forgotten | One sentence on missingness rate per source/period; one sensitivity check on restricted sample |
| Generic limitations | "Future research should use more data" | Specific: "our index does not capture sarcasm; 4 of 50 validation documents were misclassified for this reason" |
| Robustness theatre | 12 robustness tables that all confirm the main result, exactly | One serious robustness check aimed at the most plausible threat, even if it weakens the result |
The last row is worth dwelling on. A robustness check that the author was clearly afraid of is more persuasive than ten that they were not. If your most plausible threat is composition shift over time, run the analysis on a fixed subsample (only formal speeches, only one newspaper) and report whether the result survives — and what it means if it does not.
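The fixed-subsample check described above can be sketched in a few lines: compute the yearly mean of the index on the full corpus and again on formal speeches only, then compare. The record layout, the `type` labels, and all index values below are made up for illustration.

```python
from statistics import mean

def yearly_mean(records, doc_type=None):
    """Mean index per year, optionally restricted to one document type."""
    by_year = {}
    for rec in records:
        if doc_type is None or rec["type"] == doc_type:
            by_year.setdefault(rec["year"], []).append(rec["index"])
    return {year: mean(values) for year, values in sorted(by_year.items())}

# Invented records: the later year contains more press conferences.
records = [
    {"year": 2019, "type": "speech", "index": 0.25},
    {"year": 2019, "type": "speech", "index": 0.25},
    {"year": 2023, "type": "speech", "index": 0.5},
    {"year": 2023, "type": "press_conference", "index": 1.0},
    {"year": 2023, "type": "press_conference", "index": 1.0},
]

full = yearly_mean(records)
fixed = yearly_mean(records, doc_type="speech")
print("full sample rise:    ", round(full[2023] - full[2019], 3))
print("speeches-only rise:  ", round(fixed[2023] - fixed[2019], 3))
```

In this toy data the rise shrinks on the fixed subsample, so part of the full-sample trend reflects composition. That is the kind of result worth reporting even though it weakens the headline number.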
7. Writing prompt — your own pre-submission audit¶
Write 8-10 lines, total, answering these four questions about your own project. Be specific — generic answers are not useful here, because the point is to surface the specific weak link in your project, not in projects in general.
- What is the single strongest part of your project? (Be honest about what you do well.)
- What is the single most fragile step? (The one that, if challenged, would be hardest to defend.)
- What would a skeptical referee attack first? (Read your own abstract as if you wanted to demolish the paper. Where do you go first?)
- What robustness check should you implement before circulating the draft? (Aim it at your answer to the third question, not at a comfortable extension.)
Write the answers in the cell below. This is the most useful single exercise in this notebook; do not skip it.
Your answer:
[8-10 lines, total, specific to your dataset project]
8. Optional — convert the checklist into a peer rubric¶
If you have a co-author or collaborator, exchange completed checklists and project_map tables before the paper is finalised. Each of you scores the other's project on the three priorities:
| Priority | Score 1 (weak) | Score 2 (adequate) | Score 3 (strong) |
|---|---|---|---|
| Index quality | Construct not defined or no validation | Construct defined; informal validation | Defined, hand-coded sample, kappa reported |
| Scraping reproducibility | Manual fixes during scraping; not re-runnable | Script terminates cleanly; corpus stats reported | Checkpoint logic; documented re-run on different machine |
| Honest interpretation | Causal language without identification | Mostly descriptive; some causal slips | Claims match design; limitations specific |
Total: 3-9. A 9/9 is publication-quality work; an 8/9 with one weak score that the author acknowledges in the limitations is also very strong.
Most dataset projects come in at 6-7 on a first internal review. The single most efficient way to move from 6 to 8 is to fix the honest-interpretation row, because it requires only re-reading and editing, with no new data and no new code.
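The top score for scraping reproducibility asks for checkpoint logic. One minimal way to get it, sketched under assumptions, is to keep a JSON file of already-fetched documents and skip them on re-run. Here `fetch` is a hypothetical callable you supply (for example a wrapper around your HTTP client), and the checkpoint file name is arbitrary; the demonstration uses a fake fetcher so it runs without network access.

```python
import json
import tempfile
from pathlib import Path

def scrape_with_checkpoint(urls, fetch, checkpoint_path):
    """Fetch each URL at most once across runs; save progress after every document."""
    path = Path(checkpoint_path)
    done = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    for url in urls:
        if url in done:
            continue  # already fetched in a previous run
        done[url] = fetch(url)
        path.write_text(json.dumps(done), encoding="utf-8")
    return done

# Demonstration with a fake fetcher that records which URLs it was asked for.
calls = []

def fake_fetch(url):
    calls.append(url)
    return f"text of {url}"

with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / "checkpoint.json"
    first = scrape_with_checkpoint(["a", "b"], fake_fetch, ckpt)
    second = scrape_with_checkpoint(["a", "b", "c"], fake_fetch, ckpt)

print(calls)
```

Rewriting the whole file after each document is simple but slow for large corpora; an append-only log or a small database would be a reasonable alternative.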
Wrap-up¶
The discipline of replication is not bureaucratic. It is the form that intellectual honesty takes in empirical research: making it possible for someone else (or your future self) to check what you did and to challenge what you concluded.
Strong empirical research projects defend three things explicitly: the construct (what is being measured), the data (where it comes from and what is missing), and the claim (what the design supports and what it does not). Everything else is detail.