Python Runtime Deep Dive

Target audience: candidates interviewing in Python at Big Tech, ML, infra, or any role where the interviewer is allowed to ask “how does it actually work?”

Scope: CPython. PyPy and other implementations are noted only when they materially change interview answers.

Python’s reputation for being “easy” is exactly why senior interviewers grill it hardest. The candidate who can write a clean two-pointer solution in Python and explain why their dict lookup is O(1) amortized but worst-case O(N), why their threading.Thread doesn’t help CPU-bound code, and why [[]] * 3 is a foot-gun, is rare. Be that candidate.


1. CPython Interpreter, Bytecode, Frame Objects

CPython is a stack-based bytecode interpreter. Source code → AST → bytecode → executed by an evaluation loop in C (ceval.c).

What runs your code

  1. Lexer/Parser → AST.
  2. Compiler → bytecode (.pyc cached in __pycache__/).
  3. Interpreter loop (PyEval_EvalFrameEx) → fetches one bytecode opcode at a time, dispatches.
import dis

def add(a, b):
    return a + b

dis.dis(add)
#  2           0 RESUME                   0
#  3           2 LOAD_FAST                0 (a)
#              4 LOAD_FAST                1 (b)
#              6 BINARY_OP                0 (+)
#             10 RETURN_VALUE

Frame objects

Every function call allocates a frame object on the Python call stack. A frame holds: locals, the value stack, the bytecode instruction pointer, the parent frame.

import sys

def f():
    frame = sys._getframe()
    print(frame.f_code.co_name, frame.f_lineno)

f()  # f, <line>

Frames are heap-allocated objects, not C stack frames. This is why Python’s recursion limit is a Python-level integer (sys.setrecursionlimit), not a kernel limit.

Interview framing

“When you call a Python function, what’s the cost?”

Allocate a frame object, push it, populate locals from the argument tuple, execute bytecode, decref the frame. Function calls in Python are expensive — typically 100ns–1µs — which is why for over a list is faster than map(lambda…) for trivial bodies. Knowing this lets you defend choices like inline arithmetic vs operator.add.


2. The GIL — What It Is, What It Protects, When It Releases

The Global Interpreter Lock is a mutex inside the CPython interpreter. Only one thread can execute Python bytecode at a time per process.

What it protects

The GIL exists because CPython’s memory management (refcounts, GC structures, dict internals, etc.) is not thread-safe. Without the GIL, every refcount increment would need an atomic, killing single-threaded performance.

It protects interpreter state, not your data structures. list.append is atomic by accident (it’s a single bytecode), but counter += 1 is not (it’s LOAD, ADD, STORE).

When it releases

  • I/O operations (file read/write, socket, time.sleep) — the C extension drops the GIL while blocking.
  • Some C extensions explicitly drop it (NumPy heavy ops, hashlib).
  • Every ~5ms (sys.setswitchinterval) — the interpreter voluntarily releases for scheduling.
import threading, time

counter = 0
def bump():
    global counter
    for _ in range(1_000_000):
        counter += 1  # NOT atomic

threads = [threading.Thread(target=bump) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(counter)  # NOT 4_000_000

Implications

  • CPU-bound parallelism via threading is impossible in standard CPython. Use multiprocessing or release the GIL via C extensions.
  • I/O-bound parallelism via threading works. Each thread releases the GIL while waiting on the network.
  • asyncio is an alternative I/O model; it does not bypass the GIL — it doesn’t need to, because there’s only one thread anyway.

Free-threaded Python (3.13+)

PEP 703 introduced an optional no-GIL build (python3.13t). Refcounts become atomic, dict/list grow per-thread fast paths, GC adopts new locking. It is not the default and many C extensions break under it. For interview purposes:

“Python 3.13 ships an experimental free-threaded build that removes the GIL. It’s opt-in, slower for single-threaded code today, and not yet ABI-stable for the ecosystem. Default CPython still has the GIL.”


3. Memory Model — Refcounts + Generational GC

Every Python object has a reference count. When it hits zero, the object is freed immediately.

import sys
a = [1, 2, 3]
sys.getrefcount(a)  # 2 — one for `a`, one for the argument to getrefcount
b = a
sys.getrefcount(a)  # 3
del b
sys.getrefcount(a)  # 2

Why we need a GC on top

Refcounts cannot collect cycles:

a = []
b = []
a.append(b)
b.append(a)
del a, b  # Refcount of each is still 1 — they reference each other.

The generational tracing GC in the gc module sweeps for cycles. Three generations (0, 1, 2). Newly created containers go in gen 0. Survivors are promoted. Older generations are collected less often.

import gc
gc.collect()  # force a full collection
gc.get_threshold()  # (700, 10, 10)

__del__ pitfalls

__del__ is a finalizer, not a destructor. Two traps:

  1. Cycles with __del__ used to be uncollectable before Python 3.4. Now they are collected, but the order is unspecified.
  2. __del__ may run during interpreter shutdown when module globals are already None.
class Bad:
    def __del__(self):
        print(open)  # may be None during shutdown

Use weakref.finalize or context managers (with) instead.

weakref

A weakref does not increment the refcount. Useful for caches and observer patterns.

import weakref

class Node: pass
n = Node()
r = weakref.ref(n)
print(r())  # <Node>
del n
print(r())  # None

Interview framing

“How does Python free memory?”

Refcounting frees most things eagerly; a generational tracing collector cleans up cycles. Compared to Java, allocations are cheaper to free on the common path (no pauses on most exits) but every operation has a per-pointer atomic increment cost — which is part of why Python is slow.


4. Object Model — __slots__, Descriptors, MRO

Every Python object is, by default, a dict-backed thing: instance attributes live in __dict__. This is why Python objects are 5–10x larger than equivalent C structs.

__slots__

Declare attributes statically and the interpreter skips __dict__:

class Point:
    __slots__ = ('x', 'y')
    def __init__(self, x, y):
        self.x, self.y = x, y

# ~56 bytes per Point with __slots__, ~328 bytes without (roughly).

__slots__ cost: no dynamic attribute addition. Subclasses that don’t redeclare __slots__ lose the optimization. Use them for value classes with millions of instances.

Descriptors

Properties, classmethods, staticmethods are all built on the descriptor protocol: an attribute access triggers __get__ / __set__ / __delete__ on the class attribute.

class Lazy:
    def __init__(self, fn): self.fn = fn
    def __get__(self, obj, cls):
        v = self.fn(obj)
        setattr(obj, self.fn.__name__, v)
        return v

class C:
    @Lazy
    def expensive(self):
        return sum(range(10**6))

MRO and C3

Multiple inheritance resolution uses the C3 linearization algorithm. It guarantees a deterministic order that respects: a class precedes its parents; left-to-right inheritance order.

class A: pass
class B(A): pass
class C(A): pass
class D(B, C): pass

print(D.__mro__)
# (D, B, C, A, object)

Interview framing

“Why is Python OO so slow?”

Every attribute access is a dict lookup on the instance, then a walk up the MRO if not found. __slots__, caching, and attrs/dataclass(slots=True) mitigate. JITs (PyPy) inline these.


5. Iterators, Generators, yield from

An iterator is any object with __iter__ and __next__. A generator is a function with yield; calling it returns an iterator without running the body.

def count_up(n):
    for i in range(n):
        yield i

g = count_up(3)
next(g)  # 0
next(g)  # 1

Generators suspend frame state on yield. The frame is heap-allocated, kept alive by the generator object, and resumed on next().

yield from

Delegates iteration to a sub-iterator and forwards send, throw, close.

def chain(a, b):
    yield from a
    yield from b

Why this matters

Generators are the foundation of asyncio (coroutines were generators before async def), pipelines, and lazy I/O. They allow processing infinite or huge sequences without materializing them.

def lines(path):
    with open(path) as f:
        for line in f:           # iterator protocol over file
            yield line.rstrip()  # constant memory

# Process 100GB log: O(line) memory.

6. List Internals — Over-allocation, Amortized Append

list is a dynamic array of PyObject*. Capacity grows geometrically.

CPython’s growth pattern (in listobject.c):

new_size = (new_size + (new_size >> 3) + 6) & ~3

That’s roughly 1.125x growth (smaller than C++ vector’s ~1.5–2x).

OperationComplexity
lst[i]O(1)
lst.append(x)Amortized O(1), worst O(N) on resize
lst.insert(0, x)O(N)
lst.pop()O(1)
lst.pop(0)O(N) — use collections.deque
x in lstO(N)
lst.sort()O(N log N), Timsort, stable
lst[a:b]O(b-a), creates a copy
lst = []
for i in range(1_000_000):
    lst.append(i)  # Amortized O(1) total O(N)

Pitfalls

grid = [[0] * 3] * 3   # WRONG — three references to one row
grid[0][0] = 1
print(grid)            # [[1, 0, 0], [1, 0, 0], [1, 0, 0]]

grid = [[0] * 3 for _ in range(3)]  # right

Interview framing

“Why does list.append average O(1)?”

Geometric growth: O(N) total work across N appends → O(1) amortized. The classic amortization proof.


7. Dict Internals — Open Addressing, Probing, Ordering

dict is a hash table with open addressing. Compact since 3.6 (split into a sparse index array + dense entries array). Insertion-ordered since 3.7 (language guarantee, not just CPython).

Lookup algorithm (simplified)

  1. Compute hash(key) & (table_size - 1) → slot.
  2. If slot empty → not found. If key matches → hit.
  3. Else probe: i = (5*i + 1 + perturb) % size; perturb >>= 5. The “perturb” trick mixes high bits of the hash into early probes, reducing clustering.

Hash randomization

hash(str) and hash(bytes) use a per-process random seed (since 3.3) to mitigate algorithmic-complexity DoS. PYTHONHASHSEED=0 disables it, useful for reproducibility but unsafe in production.

import os
os.environ['PYTHONHASHSEED']  # not set in interactive: random per process
hash("foo")  # different across processes

hash(int) is the integer itself (mod a prime). hash(-1) is special-cased to -2 because -1 signals errors in C.

Worst case

Adversarial keys with colliding hashes degrade to O(N) per operation. Hash randomization defeats the basic attack but custom __hash__ returning a constant still breaks it.

class Bad:
    def __hash__(self): return 0
    def __eq__(self, other): return False  # never equal — every insert collides

d = {}
for i in range(1000):
    d[Bad()] = i  # O(N) per insert → O(N²) total

Complexity table

OperationAvgWorst
d[k]O(1)O(N)
d[k] = vO(1) amortizedO(N)
del d[k]O(1)O(N)
k in dO(1)O(N)
iterO(N)O(N)

Interview framing

“Why are Python dicts ordered?”

Compact dict (3.6 CPython) stored entries in insertion order in a dense array, with a sparse index. The ordering was an implementation detail, then promoted to a language guarantee in 3.7.


8. Set Internals

set and frozenset are open-addressed hash tables, mechanically the same as dict minus the value column. Same complexity table, same adversarial caveats.

s = {1, 2, 3}
s | {4}        # union, O(len(self) + len(other))
s & {2, 3, 5}  # intersection, O(min(...))
s - {2}        # difference, O(len(self))

Sets are not insertion-ordered. Do not rely on iteration order.


9. String Internals — Interning, Encoding, bytes vs str

Python 3 strings (str) are immutable Unicode code-point sequences. CPython stores them as one of:

  • Latin-1 (1 byte/char) when all code points fit.
  • UCS-2 (2 bytes/char) up to U+FFFF.
  • UCS-4 (4 bytes/char) for the full range.

This is PEP 393 (“flexible string representation”). A string with a single emoji is 4× the bytes per char of a pure-ASCII string of the same length.

Interning

Short strings that look like identifiers are auto-interned. Equal interned strings share the same object → is works (by accident).

"hello" is "hello"  # True (CPython, syntactic literals)
a = "hello world"
b = "hello world"
a is b  # CPython 3.x: often True, but DO NOT RELY ON THIS

sys.intern(s) forces interning for runtime-built strings; speeds up dict lookups when the same key is used many times.

Concat in loop is O(N²)

s = ""
for c in chars:
    s += c   # creates a new string each time

CPython has a special-case optimization that sometimes makes this O(N) (when the refcount of s is 1 and the allocator can extend in place), but it is not guaranteed and disappears under any other reference. Use "".join(chars).

bytes vs str

bytes is an immutable byte sequence. str is a Unicode sequence. They do not implicitly convert in Python 3.

b"abc" + "xyz"  # TypeError
b"abc".decode("utf-8")  # → "abc"
"abc".encode("utf-8")   # → b"abc"

Network/file boundaries are bytes. Application logic is strings. Convert at the boundary, never in the middle.

Interview framing

“Why does s += c in a loop blow up?”

Strings are immutable, so each append allocates a new string and copies. CPython has an opportunistic in-place-extend hack that hides this in toy examples, but it’s fragile. Always "".join.


10. Hashing Protocol and __hash__ Contract

Two objects that compare equal must have the same hash. The reverse is not required.

class Point:
    def __init__(self, x, y): self.x, self.y = x, y
    def __eq__(self, other):
        return isinstance(other, Point) and (self.x, self.y) == (other.x, other.y)
    def __hash__(self):
        return hash((self.x, self.y))

If you define __eq__ and not __hash__, your class becomes unhashable (Python sets __hash__ to None). This is by design — overriding equality without hash is almost always a bug.

Mutable types (list, dict, set) are unhashable by default — their hash would have to change as they mutate, breaking the dict/set invariant.


11. Mutable Default Arguments — The Most Famous Trap

def append_to(x, lst=[]):
    lst.append(x)
    return lst

append_to(1)  # [1]
append_to(2)  # [1, 2]   ← !

Default values are evaluated once, when the def statement runs, and shared across calls. Idiomatic fix:

def append_to(x, lst=None):
    if lst is None: lst = []
    lst.append(x)
    return lst

Every Python interviewer asks this once a year. Get it right and move on.


12. Closures and Late Binding

Free variables in closures are looked up by name at call time, not captured by value at definition.

funcs = [lambda: i for i in range(3)]
[f() for f in funcs]  # [2, 2, 2] — not [0, 1, 2]

Fix with default arg (evaluated at def):

funcs = [lambda i=i: i for i in range(3)]
[f() for f in funcs]  # [0, 1, 2]

This is the same bug as JavaScript’s var i in a loop. Both languages punish late binding.


13. Concurrency — Threading vs Multiprocessing vs Asyncio

ModelParallelismUse ForCost
threadingConcurrent (GIL)I/O-boundCheap threads (~MB stack), context switches
multiprocessingParallel (separate processes)CPU-boundProcess startup, IPC pickling
asyncioConcurrent (single thread)High-fanout I/ONo OS threads; cooperative
concurrent.futuresWraps eitherConvenient API

Picking

  • 10K simultaneous network connections? asyncio.
  • 10 simultaneous network calls? threading (or asyncio).
  • Heavy NumPy computation? threading — NumPy releases the GIL.
  • Pure-Python CPU work? multiprocessing or write the hot loop in C/Cython/Numba.

14. AsyncIO Model — Event Loop, Coroutines, Don’t Block The Loop

asyncio runs an event loop in one thread. Coroutines (async def) yield control on await, the loop schedules another coroutine that’s ready.

import asyncio

async def fetch(url):
    print(f"start {url}")
    await asyncio.sleep(1)        # yields control
    print(f"done {url}")

async def main():
    await asyncio.gather(*[fetch(u) for u in ["a", "b", "c"]])

asyncio.run(main())
# All three start, all three finish ~1s later — concurrent on one thread.

Blocking the loop

If you do CPU work or call a sync blocking I/O function inside a coroutine, the entire loop stalls. Symptom: latency spikes for everyone.

async def bad():
    time.sleep(1)         # blocking — stalls the loop
    requests.get("http://x")  # blocking — same problem

async def good():
    await asyncio.sleep(1)
    async with aiohttp.ClientSession() as s:
        await s.get("http://x")

For unavoidable blocking work: await loop.run_in_executor(None, blocking_fn).

Interview framing

“What’s the difference between asyncio and threading?”

Threading is preemptive multi-tasking by the OS, threads share memory, GIL serializes. Asyncio is cooperative single-thread; coroutines must await to yield. Both win on I/O. Asyncio scales to more in-flight ops because there’s no per-task OS thread.


15. Multiprocessing — Fork vs Spawn, Pickling

multiprocessing creates separate Python processes. Each has its own GIL → real parallelism.

Start methods

MethodDefault OnCostCaveat
forkLinux (was default until 3.14)CheapCopy-on-write; not safe with threads, locks, or libraries that aren’t fork-safe (e.g. CUDA).
spawnmacOS, Windows; default on Linux 3.14+Slow (re-imports)All args must be picklable.
forkserverLinuxMidCompromise

Pickling

Args and return values cross process boundaries via pickle. Lambdas, local functions, and many file/socket objects are not picklable.

from multiprocessing import Pool

def square(x): return x * x   # top-level — picklable

with Pool(4) as p:
    print(p.map(square, range(10)))

Shared memory

multiprocessing.shared_memory.SharedMemory (Python 3.8+) for zero-copy NumPy/byte sharing. Avoids the pickling round-trip for big arrays.


16. NumPy / Vectorization

NumPy stores numbers in contiguous C arrays of native types, not as PyObject*. Operations dispatch to optimized C/SIMD that releases the GIL.

import numpy as np
a = np.arange(1_000_000)
b = a * 2 + 1     # vectorized, ~ms; pure Python equivalent ~100ms

A for loop over a NumPy array is the worst of both worlds: Python overhead per iteration, no SIMD. If the operation has no NumPy expression, fall back to Numba, Cython, or a C extension.

This is a sneaky interview line: “Implement vector dot product without NumPy” → straightforward Python, then “now optimize” → vectorize, then “now scale” → talk about BLAS underneath NumPy.


17. Common Interview Gotchas

Integer caching

a = 256; b = 256
a is b   # True — CPython caches -5..256
a = 257; b = 257
a is b   # False (not guaranteed True; do NOT use `is` for value compare)

is vs ==

is is identity (same object). == is equality (__eq__). Use == for values. Use is for None, True, False, and singletons.

Sort stability

sorted() and list.sort() use Timsort — stable. You can sort by multiple keys via successive stable sorts (least significant first).

data = [("a", 2), ("b", 1), ("a", 1)]
data.sort(key=lambda x: x[1])
data.sort(key=lambda x: x[0])  # stable preserves the previous order on ties

dict.get default mutation

d = {}
d.setdefault("k", []).append(1)  # one allocation
# vs
d["k"] = d.get("k", []) + [1]    # quadratic for many appends

Use collections.defaultdict(list) if you append a lot.

Truthy surprises

bool([])      # False
bool([0])     # True  (non-empty list)
bool(0.0)     # False
bool("False") # True  (non-empty string)

18. Recursion Limits

import sys
sys.getrecursionlimit()  # default 1000
sys.setrecursionlimit(10000)

Python frames are heap-allocated but each is non-trivial (~500 bytes). Setting the limit too high crashes the interpreter on stack overflow of the C stack.

Convert deep recursion to iteration with an explicit stack (Phase 1, Lab 8). This isn’t optional in interviews — recursion depth = N in tree problems with skewed inputs is real.


19. Performance Hot Tips

  • Avoid attribute lookup in hot loops: bind to a local first.

    append = result.append   # local — fastest opcode
    for x in data:
        append(transform(x))
    
  • Built-ins are C. sum, min, max, any, all, map, sorted — written in C, beat hand-rolled loops.

  • Comprehensions beat for + append. They skip the LOAD_ATTR for append.

  • functools.lru_cache for memoization — drop-in, fast.

  • String formatting: f-strings > % > .format() > concatenation.

  • __slots__ for value classes with millions of instances.

  • Profile before optimizing. cProfile, pyinstrument, py-spy (sampling, no code changes).

import cProfile
cProfile.run("expensive()", sort="cumulative")

20. Standard Library Essentials

collections

  • Counter — multiset; most_common(k).
  • deque — O(1) appends/pops at both ends. Use for queues and sliding windows.
  • defaultdict — auto-vivifying dict.
  • OrderedDict — historically ordered; today dict is too. Use OrderedDict only for move_to_end and reverse iteration.
  • namedtuple — lightweight value class. dataclass(frozen=True, slots=True) is the modern alternative.

heapq

Min-heap on a list. No max-heap — negate values.

import heapq
h = []
heapq.heappush(h, 3)
heapq.heappush(h, 1)
heapq.heappop(h)  # 1
heapq.nlargest(3, data)  # k-largest

bisect

Binary search on a sorted list.

import bisect
i = bisect.bisect_left(sorted_arr, x)  # insertion point
bisect.insort(sorted_arr, x)            # O(N) insert

itertools

  • chain, cycle, repeat
  • combinations, permutations, product
  • accumulate (prefix sums!), groupby, pairwise (3.10+)
from itertools import accumulate, pairwise
list(accumulate([1, 2, 3, 4]))   # [1, 3, 6, 10]
list(pairwise([1, 2, 3, 4]))     # [(1,2), (2,3), (3,4)]

functools

  • lru_cache, cache (3.9+).
  • reduce.
  • partial.
  • cached_property.

What To Memorize Cold

  • GIL releases on I/O and at switch interval; doesn’t release for pure-Python CPU.
  • dict is open-addressed, ordered since 3.7, hash-randomized.
  • List growth ~1.125×. Append amortized O(1).
  • str immutable, three width tiers (1/2/4 byte). Concat in loop ⇒ "".join.
  • is vs ==. Integer cache -5..256.
  • Mutable default args evaluated once.
  • Closure late binding fix: lambda x=x: ….
  • Threading for I/O, multiprocessing for CPU, asyncio for fanout.
  • __hash__ must agree with __eq__.
  • heapq is min-only.

If any of those is fuzzy, re-read this document. Then code something that breaks because of it, on purpose. That’s the lesson that sticks.