Python Runtime Deep Dive
Target audience: candidates interviewing in Python at Big Tech, ML, infra, or any role where the interviewer is allowed to ask “how does it actually work?”
Scope: CPython. PyPy and other implementations are noted only when they materially change interview answers.
Python’s reputation for being “easy” is exactly why senior interviewers grill it hardest. The candidate who can write a clean two-pointer solution in Python and explain why their dict lookup is O(1) amortized but worst-case O(N), why their threading.Thread doesn’t help CPU-bound code, and why [[]] * 3 is a foot-gun, is rare. Be that candidate.
1. CPython Interpreter, Bytecode, Frame Objects
CPython is a stack-based bytecode interpreter. Source code → AST → bytecode → executed by an evaluation loop in C (ceval.c).
What runs your code
- Lexer/Parser → AST.
- Compiler → bytecode (
.pyccached in__pycache__/). - Interpreter loop (
PyEval_EvalFrameEx) → fetches one bytecode opcode at a time, dispatches.
import dis
def add(a, b):
return a + b
dis.dis(add)
# 2 0 RESUME 0
# 3 2 LOAD_FAST 0 (a)
# 4 LOAD_FAST 1 (b)
# 6 BINARY_OP 0 (+)
# 10 RETURN_VALUE
Frame objects
Every function call allocates a frame object on the Python call stack. A frame holds: locals, the value stack, the bytecode instruction pointer, the parent frame.
import sys
def f():
frame = sys._getframe()
print(frame.f_code.co_name, frame.f_lineno)
f() # f, <line>
Frames are heap-allocated objects, not C stack frames. This is why Python’s recursion limit is a Python-level integer (sys.setrecursionlimit), not a kernel limit.
Interview framing
“When you call a Python function, what’s the cost?”
Allocate a frame object, push it, populate locals from the argument tuple, execute bytecode, decref the frame. Function calls in Python are expensive — typically 100ns–1µs — which is why for over a list is faster than map(lambda…) for trivial bodies. Knowing this lets you defend choices like inline arithmetic vs operator.add.
2. The GIL — What It Is, What It Protects, When It Releases
The Global Interpreter Lock is a mutex inside the CPython interpreter. Only one thread can execute Python bytecode at a time per process.
What it protects
The GIL exists because CPython’s memory management (refcounts, GC structures, dict internals, etc.) is not thread-safe. Without the GIL, every refcount increment would need an atomic, killing single-threaded performance.
It protects interpreter state, not your data structures. list.append is atomic by accident (it’s a single bytecode), but counter += 1 is not (it’s LOAD, ADD, STORE).
When it releases
- I/O operations (file read/write, socket,
time.sleep) — the C extension drops the GIL while blocking. - Some C extensions explicitly drop it (NumPy heavy ops, hashlib).
- Every ~5ms (
sys.setswitchinterval) — the interpreter voluntarily releases for scheduling.
import threading, time
counter = 0
def bump():
global counter
for _ in range(1_000_000):
counter += 1 # NOT atomic
threads = [threading.Thread(target=bump) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(counter) # NOT 4_000_000
Implications
- CPU-bound parallelism via
threadingis impossible in standard CPython. Usemultiprocessingor release the GIL via C extensions. - I/O-bound parallelism via
threadingworks. Each thread releases the GIL while waiting on the network. asynciois an alternative I/O model; it does not bypass the GIL — it doesn’t need to, because there’s only one thread anyway.
Free-threaded Python (3.13+)
PEP 703 introduced an optional no-GIL build (python3.13t). Refcounts become atomic, dict/list grow per-thread fast paths, GC adopts new locking. It is not the default and many C extensions break under it. For interview purposes:
“Python 3.13 ships an experimental free-threaded build that removes the GIL. It’s opt-in, slower for single-threaded code today, and not yet ABI-stable for the ecosystem. Default CPython still has the GIL.”
3. Memory Model — Refcounts + Generational GC
Every Python object has a reference count. When it hits zero, the object is freed immediately.
import sys
a = [1, 2, 3]
sys.getrefcount(a) # 2 — one for `a`, one for the argument to getrefcount
b = a
sys.getrefcount(a) # 3
del b
sys.getrefcount(a) # 2
Why we need a GC on top
Refcounts cannot collect cycles:
a = []
b = []
a.append(b)
b.append(a)
del a, b # Refcount of each is still 1 — they reference each other.
The generational tracing GC in the gc module sweeps for cycles. Three generations (0, 1, 2). Newly created containers go in gen 0. Survivors are promoted. Older generations are collected less often.
import gc
gc.collect() # force a full collection
gc.get_threshold() # (700, 10, 10)
__del__ pitfalls
__del__ is a finalizer, not a destructor. Two traps:
- Cycles with
__del__used to be uncollectable before Python 3.4. Now they are collected, but the order is unspecified. __del__may run during interpreter shutdown when module globals are alreadyNone.
class Bad:
def __del__(self):
print(open) # may be None during shutdown
Use weakref.finalize or context managers (with) instead.
weakref
A weakref does not increment the refcount. Useful for caches and observer patterns.
import weakref
class Node: pass
n = Node()
r = weakref.ref(n)
print(r()) # <Node>
del n
print(r()) # None
Interview framing
“How does Python free memory?”
Refcounting frees most things eagerly; a generational tracing collector cleans up cycles. Compared to Java, allocations are cheaper to free on the common path (no pauses on most exits) but every operation has a per-pointer atomic increment cost — which is part of why Python is slow.
4. Object Model — __slots__, Descriptors, MRO
Every Python object is, by default, a dict-backed thing: instance attributes live in __dict__. This is why Python objects are 5–10x larger than equivalent C structs.
__slots__
Declare attributes statically and the interpreter skips __dict__:
class Point:
__slots__ = ('x', 'y')
def __init__(self, x, y):
self.x, self.y = x, y
# ~56 bytes per Point with __slots__, ~328 bytes without (roughly).
__slots__ cost: no dynamic attribute addition. Subclasses that don’t redeclare __slots__ lose the optimization. Use them for value classes with millions of instances.
Descriptors
Properties, classmethods, staticmethods are all built on the descriptor protocol: an attribute access triggers __get__ / __set__ / __delete__ on the class attribute.
class Lazy:
def __init__(self, fn): self.fn = fn
def __get__(self, obj, cls):
v = self.fn(obj)
setattr(obj, self.fn.__name__, v)
return v
class C:
@Lazy
def expensive(self):
return sum(range(10**6))
MRO and C3
Multiple inheritance resolution uses the C3 linearization algorithm. It guarantees a deterministic order that respects: a class precedes its parents; left-to-right inheritance order.
class A: pass
class B(A): pass
class C(A): pass
class D(B, C): pass
print(D.__mro__)
# (D, B, C, A, object)
Interview framing
“Why is Python OO so slow?”
Every attribute access is a dict lookup on the instance, then a walk up the MRO if not found. __slots__, caching, and attrs/dataclass(slots=True) mitigate. JITs (PyPy) inline these.
5. Iterators, Generators, yield from
An iterator is any object with __iter__ and __next__. A generator is a function with yield; calling it returns an iterator without running the body.
def count_up(n):
for i in range(n):
yield i
g = count_up(3)
next(g) # 0
next(g) # 1
Generators suspend frame state on yield. The frame is heap-allocated, kept alive by the generator object, and resumed on next().
yield from
Delegates iteration to a sub-iterator and forwards send, throw, close.
def chain(a, b):
yield from a
yield from b
Why this matters
Generators are the foundation of asyncio (coroutines were generators before async def), pipelines, and lazy I/O. They allow processing infinite or huge sequences without materializing them.
def lines(path):
with open(path) as f:
for line in f: # iterator protocol over file
yield line.rstrip() # constant memory
# Process 100GB log: O(line) memory.
6. List Internals — Over-allocation, Amortized Append
list is a dynamic array of PyObject*. Capacity grows geometrically.
CPython’s growth pattern (in listobject.c):
new_size = (new_size + (new_size >> 3) + 6) & ~3
That’s roughly 1.125x growth (smaller than C++ vector’s ~1.5–2x).
| Operation | Complexity |
|---|---|
lst[i] | O(1) |
lst.append(x) | Amortized O(1), worst O(N) on resize |
lst.insert(0, x) | O(N) |
lst.pop() | O(1) |
lst.pop(0) | O(N) — use collections.deque |
x in lst | O(N) |
lst.sort() | O(N log N), Timsort, stable |
lst[a:b] | O(b-a), creates a copy |
lst = []
for i in range(1_000_000):
lst.append(i) # Amortized O(1) total O(N)
Pitfalls
grid = [[0] * 3] * 3 # WRONG — three references to one row
grid[0][0] = 1
print(grid) # [[1, 0, 0], [1, 0, 0], [1, 0, 0]]
grid = [[0] * 3 for _ in range(3)] # right
Interview framing
“Why does
list.appendaverage O(1)?”
Geometric growth: O(N) total work across N appends → O(1) amortized. The classic amortization proof.
7. Dict Internals — Open Addressing, Probing, Ordering
dict is a hash table with open addressing. Compact since 3.6 (split into a sparse index array + dense entries array). Insertion-ordered since 3.7 (language guarantee, not just CPython).
Lookup algorithm (simplified)
- Compute
hash(key) & (table_size - 1)→ slot. - If slot empty → not found. If key matches → hit.
- Else probe:
i = (5*i + 1 + perturb) % size; perturb >>= 5. The “perturb” trick mixes high bits of the hash into early probes, reducing clustering.
Hash randomization
hash(str) and hash(bytes) use a per-process random seed (since 3.3) to mitigate algorithmic-complexity DoS. PYTHONHASHSEED=0 disables it, useful for reproducibility but unsafe in production.
import os
os.environ['PYTHONHASHSEED'] # not set in interactive: random per process
hash("foo") # different across processes
hash(int) is the integer itself (mod a prime). hash(-1) is special-cased to -2 because -1 signals errors in C.
Worst case
Adversarial keys with colliding hashes degrade to O(N) per operation. Hash randomization defeats the basic attack but custom __hash__ returning a constant still breaks it.
class Bad:
def __hash__(self): return 0
def __eq__(self, other): return False # never equal — every insert collides
d = {}
for i in range(1000):
d[Bad()] = i # O(N) per insert → O(N²) total
Complexity table
| Operation | Avg | Worst |
|---|---|---|
d[k] | O(1) | O(N) |
d[k] = v | O(1) amortized | O(N) |
del d[k] | O(1) | O(N) |
k in d | O(1) | O(N) |
| iter | O(N) | O(N) |
Interview framing
“Why are Python dicts ordered?”
Compact dict (3.6 CPython) stored entries in insertion order in a dense array, with a sparse index. The ordering was an implementation detail, then promoted to a language guarantee in 3.7.
8. Set Internals
set and frozenset are open-addressed hash tables, mechanically the same as dict minus the value column. Same complexity table, same adversarial caveats.
s = {1, 2, 3}
s | {4} # union, O(len(self) + len(other))
s & {2, 3, 5} # intersection, O(min(...))
s - {2} # difference, O(len(self))
Sets are not insertion-ordered. Do not rely on iteration order.
9. String Internals — Interning, Encoding, bytes vs str
Python 3 strings (str) are immutable Unicode code-point sequences. CPython stores them as one of:
- Latin-1 (1 byte/char) when all code points fit.
- UCS-2 (2 bytes/char) up to U+FFFF.
- UCS-4 (4 bytes/char) for the full range.
This is PEP 393 (“flexible string representation”). A string with a single emoji is 4× the bytes per char of a pure-ASCII string of the same length.
Interning
Short strings that look like identifiers are auto-interned. Equal interned strings share the same object → is works (by accident).
"hello" is "hello" # True (CPython, syntactic literals)
a = "hello world"
b = "hello world"
a is b # CPython 3.x: often True, but DO NOT RELY ON THIS
sys.intern(s) forces interning for runtime-built strings; speeds up dict lookups when the same key is used many times.
Concat in loop is O(N²)
s = ""
for c in chars:
s += c # creates a new string each time
CPython has a special-case optimization that sometimes makes this O(N) (when the refcount of s is 1 and the allocator can extend in place), but it is not guaranteed and disappears under any other reference. Use "".join(chars).
bytes vs str
bytes is an immutable byte sequence. str is a Unicode sequence. They do not implicitly convert in Python 3.
b"abc" + "xyz" # TypeError
b"abc".decode("utf-8") # → "abc"
"abc".encode("utf-8") # → b"abc"
Network/file boundaries are bytes. Application logic is strings. Convert at the boundary, never in the middle.
Interview framing
“Why does
s += cin a loop blow up?”
Strings are immutable, so each append allocates a new string and copies. CPython has an opportunistic in-place-extend hack that hides this in toy examples, but it’s fragile. Always "".join.
10. Hashing Protocol and __hash__ Contract
Two objects that compare equal must have the same hash. The reverse is not required.
class Point:
def __init__(self, x, y): self.x, self.y = x, y
def __eq__(self, other):
return isinstance(other, Point) and (self.x, self.y) == (other.x, other.y)
def __hash__(self):
return hash((self.x, self.y))
If you define __eq__ and not __hash__, your class becomes unhashable (Python sets __hash__ to None). This is by design — overriding equality without hash is almost always a bug.
Mutable types (list, dict, set) are unhashable by default — their hash would have to change as they mutate, breaking the dict/set invariant.
11. Mutable Default Arguments — The Most Famous Trap
def append_to(x, lst=[]):
lst.append(x)
return lst
append_to(1) # [1]
append_to(2) # [1, 2] ← !
Default values are evaluated once, when the def statement runs, and shared across calls. Idiomatic fix:
def append_to(x, lst=None):
if lst is None: lst = []
lst.append(x)
return lst
Every Python interviewer asks this once a year. Get it right and move on.
12. Closures and Late Binding
Free variables in closures are looked up by name at call time, not captured by value at definition.
funcs = [lambda: i for i in range(3)]
[f() for f in funcs] # [2, 2, 2] — not [0, 1, 2]
Fix with default arg (evaluated at def):
funcs = [lambda i=i: i for i in range(3)]
[f() for f in funcs] # [0, 1, 2]
This is the same bug as JavaScript’s var i in a loop. Both languages punish late binding.
13. Concurrency — Threading vs Multiprocessing vs Asyncio
| Model | Parallelism | Use For | Cost |
|---|---|---|---|
threading | Concurrent (GIL) | I/O-bound | Cheap threads (~MB stack), context switches |
multiprocessing | Parallel (separate processes) | CPU-bound | Process startup, IPC pickling |
asyncio | Concurrent (single thread) | High-fanout I/O | No OS threads; cooperative |
concurrent.futures | Wraps either | Convenient API | — |
Picking
- 10K simultaneous network connections?
asyncio. - 10 simultaneous network calls?
threading(or asyncio). - Heavy NumPy computation?
threading— NumPy releases the GIL. - Pure-Python CPU work?
multiprocessingor write the hot loop in C/Cython/Numba.
14. AsyncIO Model — Event Loop, Coroutines, Don’t Block The Loop
asyncio runs an event loop in one thread. Coroutines (async def) yield control on await, the loop schedules another coroutine that’s ready.
import asyncio
async def fetch(url):
print(f"start {url}")
await asyncio.sleep(1) # yields control
print(f"done {url}")
async def main():
await asyncio.gather(*[fetch(u) for u in ["a", "b", "c"]])
asyncio.run(main())
# All three start, all three finish ~1s later — concurrent on one thread.
Blocking the loop
If you do CPU work or call a sync blocking I/O function inside a coroutine, the entire loop stalls. Symptom: latency spikes for everyone.
async def bad():
time.sleep(1) # blocking — stalls the loop
requests.get("http://x") # blocking — same problem
async def good():
await asyncio.sleep(1)
async with aiohttp.ClientSession() as s:
await s.get("http://x")
For unavoidable blocking work: await loop.run_in_executor(None, blocking_fn).
Interview framing
“What’s the difference between
asyncioandthreading?”
Threading is preemptive multi-tasking by the OS, threads share memory, GIL serializes. Asyncio is cooperative single-thread; coroutines must await to yield. Both win on I/O. Asyncio scales to more in-flight ops because there’s no per-task OS thread.
15. Multiprocessing — Fork vs Spawn, Pickling
multiprocessing creates separate Python processes. Each has its own GIL → real parallelism.
Start methods
| Method | Default On | Cost | Caveat |
|---|---|---|---|
fork | Linux (was default until 3.14) | Cheap | Copy-on-write; not safe with threads, locks, or libraries that aren’t fork-safe (e.g. CUDA). |
spawn | macOS, Windows; default on Linux 3.14+ | Slow (re-imports) | All args must be picklable. |
forkserver | Linux | Mid | Compromise |
Pickling
Args and return values cross process boundaries via pickle. Lambdas, local functions, and many file/socket objects are not picklable.
from multiprocessing import Pool
def square(x): return x * x # top-level — picklable
with Pool(4) as p:
print(p.map(square, range(10)))
Shared memory
multiprocessing.shared_memory.SharedMemory (Python 3.8+) for zero-copy NumPy/byte sharing. Avoids the pickling round-trip for big arrays.
16. NumPy / Vectorization
NumPy stores numbers in contiguous C arrays of native types, not as PyObject*. Operations dispatch to optimized C/SIMD that releases the GIL.
import numpy as np
a = np.arange(1_000_000)
b = a * 2 + 1 # vectorized, ~ms; pure Python equivalent ~100ms
A for loop over a NumPy array is the worst of both worlds: Python overhead per iteration, no SIMD. If the operation has no NumPy expression, fall back to Numba, Cython, or a C extension.
This is a sneaky interview line: “Implement vector dot product without NumPy” → straightforward Python, then “now optimize” → vectorize, then “now scale” → talk about BLAS underneath NumPy.
17. Common Interview Gotchas
Integer caching
a = 256; b = 256
a is b # True — CPython caches -5..256
a = 257; b = 257
a is b # False (not guaranteed True; do NOT use `is` for value compare)
is vs ==
is is identity (same object). == is equality (__eq__). Use == for values. Use is for None, True, False, and singletons.
Sort stability
sorted() and list.sort() use Timsort — stable. You can sort by multiple keys via successive stable sorts (least significant first).
data = [("a", 2), ("b", 1), ("a", 1)]
data.sort(key=lambda x: x[1])
data.sort(key=lambda x: x[0]) # stable preserves the previous order on ties
dict.get default mutation
d = {}
d.setdefault("k", []).append(1) # one allocation
# vs
d["k"] = d.get("k", []) + [1] # quadratic for many appends
Use collections.defaultdict(list) if you append a lot.
Truthy surprises
bool([]) # False
bool([0]) # True (non-empty list)
bool(0.0) # False
bool("False") # True (non-empty string)
18. Recursion Limits
import sys
sys.getrecursionlimit() # default 1000
sys.setrecursionlimit(10000)
Python frames are heap-allocated but each is non-trivial (~500 bytes). Setting the limit too high crashes the interpreter on stack overflow of the C stack.
Convert deep recursion to iteration with an explicit stack (Phase 1, Lab 8). This isn’t optional in interviews — recursion depth = N in tree problems with skewed inputs is real.
19. Performance Hot Tips
-
Avoid attribute lookup in hot loops: bind to a local first.
append = result.append # local — fastest opcode for x in data: append(transform(x)) -
Built-ins are C.
sum,min,max,any,all,map,sorted— written in C, beat hand-rolled loops. -
Comprehensions beat
for+append. They skip theLOAD_ATTRforappend. -
functools.lru_cachefor memoization — drop-in, fast. -
String formatting: f-strings >
%>.format()> concatenation. -
__slots__for value classes with millions of instances. -
Profile before optimizing.
cProfile,pyinstrument,py-spy(sampling, no code changes).
import cProfile
cProfile.run("expensive()", sort="cumulative")
20. Standard Library Essentials
collections
Counter— multiset;most_common(k).deque— O(1) appends/pops at both ends. Use for queues and sliding windows.defaultdict— auto-vivifying dict.OrderedDict— historically ordered; todaydictis too. Use OrderedDict only formove_to_endand reverse iteration.namedtuple— lightweight value class.dataclass(frozen=True, slots=True)is the modern alternative.
heapq
Min-heap on a list. No max-heap — negate values.
import heapq
h = []
heapq.heappush(h, 3)
heapq.heappush(h, 1)
heapq.heappop(h) # 1
heapq.nlargest(3, data) # k-largest
bisect
Binary search on a sorted list.
import bisect
i = bisect.bisect_left(sorted_arr, x) # insertion point
bisect.insort(sorted_arr, x) # O(N) insert
itertools
chain,cycle,repeatcombinations,permutations,productaccumulate(prefix sums!),groupby,pairwise(3.10+)
from itertools import accumulate, pairwise
list(accumulate([1, 2, 3, 4])) # [1, 3, 6, 10]
list(pairwise([1, 2, 3, 4])) # [(1,2), (2,3), (3,4)]
functools
lru_cache,cache(3.9+).reduce.partial.cached_property.
What To Memorize Cold
- GIL releases on I/O and at switch interval; doesn’t release for pure-Python CPU.
dictis open-addressed, ordered since 3.7, hash-randomized.- List growth ~1.125×. Append amortized O(1).
strimmutable, three width tiers (1/2/4 byte). Concat in loop ⇒"".join.isvs==. Integer cache -5..256.- Mutable default args evaluated once.
- Closure late binding fix:
lambda x=x: …. - Threading for I/O, multiprocessing for CPU, asyncio for fanout.
__hash__must agree with__eq__.heapqis min-only.
If any of those is fuzzy, re-read this document. Then code something that breaks because of it, on purpose. That’s the lesson that sticks.