Extra L2 — Data Structures, Complexity, and Vectorization¶
This notebook complements Lecture 2: Data Structures & OOP.
Goal¶
See how data structures affect computational cost, readability, and reliability.
Why this matters¶
You should not only know how to store information, but also why a certain representation scales better.
In [ ]:
import math
import time
import statistics
import numpy as np
1. A simple problem¶
Suppose you observe wages for many workers and want to compute a transformed score:
[ s_i = \log(1 + wage_i^2) ]
We compare:
- a Python loop over a list,
- a list comprehension,
- a NumPy vectorized solution.
In [ ]:
n = 300_000
wages_list = [10 + 20 * math.sin(i / 3000) + (i % 7) for i in range(n)]
wages_array = np.array(wages_list)
len(wages_list), wages_array.shape
In [ ]:
def score_loop(x):
out = []
for value in x:
out.append(math.log(1 + value**2))
return out
def score_list_comp(x):
return [math.log(1 + value**2) for value in x]
def score_numpy(x):
return np.log(1 + x**2)
In [ ]:
def benchmark(func, obj, repeats=3):
timings = []
for _ in range(repeats):
t0 = time.perf_counter()
func(obj)
timings.append(time.perf_counter() - t0)
return {
"mean_seconds": round(statistics.mean(timings), 4),
"min_seconds": round(min(timings), 4),
"max_seconds": round(max(timings), 4)
}
results = {
"loop": benchmark(score_loop, wages_list),
"list_comprehension": benchmark(score_list_comp, wages_list),
"numpy_vectorized": benchmark(score_numpy, wages_array)
}
results
Interpretation¶
The main point is not just that NumPy is faster. The deeper point is that:
- Python loops execute element by element in the interpreter,
- NumPy delegates array operations to optimized lower-level code.
2. Membership tests: list vs set vs dictionary¶
Now consider a very common research task: matching observations against a known set of IDs.
In [ ]:
ids = list(range(200_000))
id_set = set(ids)
id_dict = {i: True for i in ids}
queries = list(range(150_000, 250_000))
In [ ]:
def count_membership(container, queries):
c = 0
for q in queries:
if q in container:
c += 1
return c
membership_results = {
"list": benchmark(count_membership, ids),
"set": benchmark(count_membership, id_set),
"dict": benchmark(count_membership, id_dict)
}
membership_results
Why this matters¶
For lookups:
listis typically linear in the number of elements,setanddictare typically near-constant average time.
If your code repeatedly checks membership, choosing the wrong data structure can dominate runtime.
3. Reliability and structure¶
A second issue is semantic clarity. Compare these two ways to store researcher records.
In [ ]:
researchers_bad = [
["Ana", "macro", 3],
["Luca", "labor", 1],
["Mina", "text", 4]
]
researchers_good = [
{"name": "Ana", "field": "macro", "projects": 3},
{"name": "Luca", "field": "labor", "projects": 1},
{"name": "Mina", "field": "text", "projects": 4},
]
print("Opaque access:", researchers_bad[0][2])
print("Readable access:", researchers_good[0]["projects"])
4. Short exercise¶
Answer in words:
- When would a list still be preferable to a set?
- Why is a dictionary often a better intermediate format before moving to pandas?
- Why is vectorization not just a speed improvement, but also a reduction in error risk?
Optional extension¶
- Compare memory usage approximately with
sys.getsizeof. - Rewrite one example from your main course notebook using a more appropriate data structure.
- Create a function that automatically converts a list of tuples into a dictionary keyed by ID.