Extra L2 — Data Structures, Complexity, and Vectorization¶

This notebook complements Lecture 2: Data Structures & OOP.

Goal¶

See how data structures affect computational cost, readability, and reliability.

Why this matters¶

You should not only know how to store information, but also why a certain representation scales better.

In [ ]:
import math
import time
import statistics
import numpy as np

1. A simple problem¶

Suppose you observe wages for many workers and want to compute a transformed score:

[ s_i = \log(1 + wage_i^2) ]

We compare:

  1. a Python loop over a list,
  2. a list comprehension,
  3. a NumPy vectorized solution.
In [ ]:
n = 300_000
wages_list = [10 + 20 * math.sin(i / 3000) + (i % 7) for i in range(n)]
wages_array = np.array(wages_list)
len(wages_list), wages_array.shape
In [ ]:
def score_loop(x):
    out = []
    for value in x:
        out.append(math.log(1 + value**2))
    return out

def score_list_comp(x):
    return [math.log(1 + value**2) for value in x]

def score_numpy(x):
    return np.log(1 + x**2)
In [ ]:
def benchmark(func, obj, repeats=3):
    timings = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        func(obj)
        timings.append(time.perf_counter() - t0)
    return {
        "mean_seconds": round(statistics.mean(timings), 4),
        "min_seconds": round(min(timings), 4),
        "max_seconds": round(max(timings), 4)
    }

results = {
    "loop": benchmark(score_loop, wages_list),
    "list_comprehension": benchmark(score_list_comp, wages_list),
    "numpy_vectorized": benchmark(score_numpy, wages_array)
}
results

Interpretation¶

The main point is not just that NumPy is faster. The deeper point is that:

  • Python loops execute element by element in the interpreter,
  • NumPy delegates array operations to optimized lower-level code.

2. Membership tests: list vs set vs dictionary¶

Now consider a very common research task: matching observations against a known set of IDs.

In [ ]:
ids = list(range(200_000))
id_set = set(ids)
id_dict = {i: True for i in ids}
queries = list(range(150_000, 250_000))
In [ ]:
def count_membership(container, queries):
    c = 0
    for q in queries:
        if q in container:
            c += 1
    return c

membership_results = {
    "list": benchmark(count_membership, ids),
    "set": benchmark(count_membership, id_set),
    "dict": benchmark(count_membership, id_dict)
}
membership_results

Why this matters¶

For lookups:

  • list is typically linear in the number of elements,
  • set and dict are typically near-constant average time.

If your code repeatedly checks membership, choosing the wrong data structure can dominate runtime.

3. Reliability and structure¶

A second issue is semantic clarity. Compare these two ways to store researcher records.

In [ ]:
researchers_bad = [
    ["Ana", "macro", 3],
    ["Luca", "labor", 1],
    ["Mina", "text", 4]
]

researchers_good = [
    {"name": "Ana", "field": "macro", "projects": 3},
    {"name": "Luca", "field": "labor", "projects": 1},
    {"name": "Mina", "field": "text", "projects": 4},
]

print("Opaque access:", researchers_bad[0][2])
print("Readable access:", researchers_good[0]["projects"])

4. Short exercise¶

Answer in words:

  1. When would a list still be preferable to a set?
  2. Why is a dictionary often a better intermediate format before moving to pandas?
  3. Why is vectorization not just a speed improvement, but also a reduction in error risk?

Optional extension¶

  • Compare memory usage approximately with sys.getsizeof.
  • Rewrite one example from your main course notebook using a more appropriate data structure.
  • Create a function that automatically converts a list of tuples into a dictionary keyed by ID.