p22 — Clone Graph

Source: LeetCode 133 · Medium · Topics: DFS, BFS, Hash Table, Graph Companies (2024–2025 frequency): Meta (very high), Google (high), Amazon (high), Microsoft (medium), Bloomberg (medium), Uber (medium), Apple (medium) Loop position: mid-onsite at Meta/Google. Heavy “data structure literacy” signal.

1. Quick Context

Given a reference to a node in a connected undirected graph, return a deep copy (clone). Each node has a value and a list of neighbors.

The structural challenge is the cycle: a naive recursive copy enters an infinite loop the moment it hits its first back-edge. The solution is a memoization map old_node → new_node that doubles as “have I cloned this yet?” tracking. This pattern is identical to LC 138 (Copy List with Random Pointer) and identical to how you’d serialize/deserialize any graph data structure.

What it looks like it tests: can you do BFS/DFS on a graph. What it actually tests: (a) do you recognize the cycle problem instantly, (b) do you reach for memoization as the canonical fix, (c) do you produce a truly independent clone (mutating the clone must NOT affect the original — a very common bug), (d) do you handle the None input case.


🔗 https://leetcode.com/problems/clone-graph/

STOP. 20-min timer. Try DFS first, then BFS.


3. Prerequisite Concepts

  • p21 — grid-as-graph (you’ve internalized “visited set or you loop”)
  • phase-04-graphs/labs/lab-09-graph-modeling.md
  • Hash maps with object identity (Python dict keys: instances hash by id by default)
  • LC 138 (Copy List with Random Pointer) — same pattern in linked-list form

4. How to Approach (FRAMEWORK Steps 1–9)

Step 1 — Restate: “Given the head node of a connected undirected graph (each node has int val and list[Node] neighbors), return a brand-new graph that is structurally identical but shares no node objects with the original.”

Step 2 — Clarify:

  • “Is the graph guaranteed connected?” (LC: yes. Starting from node, BFS/DFS reaches everything.)
  • “May node be None?” (LC: yes — return None.)
  • “Are node values unique?” (LC: yes, 1..N, used as labels. Don’t rely on this — the algorithm should work even with duplicate values. Identity, not value, is the dedup key.)
  • “Self-loops or multi-edges?” (LC says no, but state that your algorithm handles both naturally.)

Step 3 — Constraints: Up to 100 nodes; small. Time O(V+E). Space O(V) for the memo.

Step 4 — Examples: 4-node cycle: 1-2-3-4-1. The clone must be a separate 4-node cycle with the same shape, and mutating clone’s neighbors lists must not touch the original.

Step 5 — Brute Force: A “deepcopy via Python’s copy.deepcopy” works but is not the point of the question; explicitly call this out and then implement the algorithm by hand.

Step 6 — Complexity: O(V+E) time (each node + each edge processed once). O(V) auxiliary space for the memo + BFS queue / DFS stack.

Step 7 — Pattern Recognition: “Deep copy a graph” → memoized traversal. The memo is a dict[OldNode, NewNode]. Same pattern in LC 138 (random pointer linked list), LC 116 (populate next pointers in tree), serialization of any cyclic structure.

Step 8 — Develop (DFS):

def cloneGraph(node):
    if node is None: return None
    memo = {}  # old_node -> new_node
    def dfs(old):
        if old in memo: return memo[old]
        new = Node(old.val)
        memo[old] = new          # CRITICAL: register BEFORE recursing into neighbors
        new.neighbors = [dfs(n) for n in old.neighbors]
        return new
    return dfs(node)

The line memo[old] = new must come BEFORE the recursive call into neighbors, otherwise a back-edge will re-enter dfs(old) and infinite-loop. This is the entire trick.

Step 9 — Test: None input, single isolated node, single node with self-loop, 2-node cycle, complete K4, line graph.


5. Progressive Hints

If stuck for 5 min, open hints.md.


6. Deeper Insight — Why Memo Order Matters

The bug-magnet pattern is this:

def dfs(old):
    if old in memo: return memo[old]
    new = Node(old.val)
    new.neighbors = [dfs(n) for n in old.neighbors]  # WRONG
    memo[old] = new                                   # too late
    return new

Now dfs(A) recurses into dfs(B) which recurses into dfs(A) (back-edge) — and since A isn’t in memo yet, it starts a fresh clone. Infinite loop, then RecursionError.

The fix is one line moved: register the new node in memo before descending into neighbors. The memo simultaneously serves three roles: (1) dedup (“already cloned”), (2) cycle break (“don’t recurse forever”), (3) cross-reference (“how do I attach an edge to the right new node?”).

Identity vs equality: memo keys are old nodes. We hash them by id() (Python’s default for user classes). DO NOT switch to keying by old.val — that breaks the moment two nodes share a value. Identity is the correct dedup criterion for graph clones.


7. Anti-Pattern Analysis

  1. Memo write AFTER neighbor recursion. As shown above — infinite loop on the first back-edge.
  2. Keying memo by old.val. Works on LC’s sample inputs (unique values) but is semantically wrong and breaks on graphs with duplicate labels.
  3. Aliasing neighbor lists. new.neighbors = old.neighbors shares the SAME list object between old and new graphs; mutating one mutates the other. Must construct a fresh list of cloned nodes.
  4. Forgetting the None input. Crashes on the first attribute access. One-line guard.
  5. Returning node (the original) when you ran out of time. Some candidates panic and submit this. LC has a hidden test that mutates the clone and checks the original — instant fail.
  6. Using set of visited instead of dict of old→new. A set tells you “I’ve seen this,” but not “what’s the new copy I need to wire neighbors to.” You need the dict.

8. Skills & Takeaways

  • Memo-as-tri-purpose-tool — dedup, cycle break, cross-reference. One dict does all three.
  • Register-then-recurse discipline — the single most important line in this algorithm.
  • Identity vs equality as the dedup key — applies to LC 138, graph serialization, any cyclic copy.
  • Both DFS and BFS work — pick by stack-depth comfort.
  • Independence verification — your stress test must mutate the clone and assert the original is unchanged.

9. When to Move On

Move on when:

  • You can write DFS clone and BFS clone, each under 5 minutes, without referencing notes.
  • You can articulate why memo registration must precede neighbor recursion.
  • Your stress test runs 1000 random graphs, checks structural isomorphism, AND verifies that mutating the clone leaves the original intact.
  • You can pivot to LC 138 (Copy List with Random Pointer) in 3 minutes — same algorithm.
  • You can sketch graph serialization (write to JSON / read back) as a 5-minute extension.

10. Company Context

Meta:

  • One of the top-3 most-asked Meta questions, ever. The mock-interview signal is whether you write memo[old] = new BEFORE the recursive call. They watch for this specific line ordering.
  • Common follow-up: “now serialize the clone to a string and deserialize back.” Forces you to articulate the algorithm in two more ways.
  • Scorecard phrase: “Algorithm — recognized cycle problem instantly, correct memo ordering.”

Google:

  • Often asks the iterative BFS variant first (claims it’s because their interns favor it over DFS for production safety).
  • Will demand a proof: “argue that every node is cloned exactly once.”
  • Scorecard phrase: “Strong — clear invariants, defensive iterative approach.”

Amazon:

  • Pairs with a “what if the graph is disconnected — give us a list of components, return clones of all of them” follow-up. Trivial wrapper: iterate roots and call clone per root, sharing the memo dict.
  • Scorecard phrase: “Coding Excellence — clean DFS, handled disconnected variant on demand.”

Microsoft:

  • May ask the immutable variant: “design this so the original cannot be mutated even by accident.” Hint: freeze the dict (use MappingProxyType) or use tuples for neighbor lists.
  • Scorecard phrase: “Solid — recognized memo pattern, handled edge cases.”

Bloomberg:

  • Asks in a domain frame: “deep-copy this dependency graph of pricing models.” Same algorithm, different vocabulary.
  • Scorecard phrase: “Pragmatic — translated business domain into graph clone problem.”

Uber:

  • Pairs with “now the graph is too big for memory; clone it in streaming fashion.” Hint: the question is poorly posed — you NEED memo. Surface this, then propose chunking strategies (clone subgraph by subgraph, externalize the memo to disk/Redis).
  • Scorecard phrase: “Excellent — pushed back on impossible framing, proposed practical compromise.”

Apple:

  • Often combines with LC 138 in the same loop: “you did Clone Graph last round; now do Copy List with Random Pointer.” Test of pattern transfer.
  • Scorecard phrase: “Strong — transferred memo pattern across two structures.”

11. Interviewer’s Lens

SignalJuniorMidSeniorStaff/Principal
Cycle awarenessDoesn’t noticeNotices after first runIdentifies in step 1Identifies + names “back-edge problem”
Memo orderingWrong (memo after recursion)Right after one debugRight first tryRight + explains the invariant aloud
Dedup keyval (buggy)Object id (correct)Object id (correct) + states whyObject id + cites LC 138 as same pattern
IndependenceDoesn’t verifyVerifies by inspectionStress-tests with mutationDiscusses immutability options
Disconnected variantStuckAsks if neededHandles on demandVolunteers it as a graceful extension

12. Level Delta

Mid (L4):

  • DFS with memo. Correct memo ordering after one round of debugging.
  • Handles None input.
  • Returns the right structure.

Senior (L5):

  • DFS and BFS variants, both clean.
  • Memo ordering correct on first try; can articulate why.
  • Sketches LC 138 as the same algorithm.
  • Stress-tests for independence (mutate clone, check original).

Staff (L6):

  • Discusses serialization as the natural extension; sketches a to_json / from_json pair using the same memo.
  • Handles disconnected variant cleanly.
  • Discusses Python-specific concerns: id() as the implicit dedup key for non-hashable custom classes (if __hash__ overridden).
  • Sketches the immutable-clone version with MappingProxyType.

Principal (L7+):

  • Connects to the canonical computer-science pattern of graph isomorphism and proves clone is isomorphic.
  • Discusses how distributed graph databases (Neo4j, TigerGraph) implement graph subset extraction — same memo pattern at scale.
  • Discusses cycle detection vs cycle preservation: this problem requires preservation; cycle detection (LC 207) is the dual.
  • Discusses memory pressure: for very large graphs, externalize the memo to LMDB / RocksDB; the algorithm structure is unchanged.

13. Follow-up Q&A

Q1: What if the graph might be disconnected? A: The function as given starts from one node, so it only clones that component. For disconnected graphs, the API should accept a list of roots (one per component). Iterate roots, calling clone on each, sharing the memo dict across calls so that if two “roots” happen to live in the same component, you don’t double-clone. Time still O(V+E).

Q2: How would you serialize this graph to disk and reload it? A: Use the same memo idea. Assign each node a unique integer id during a traversal: id_of = {old: i for i, old in enumerate(bfs_order)}. Serialize as a JSON { "nodes": [{"id": i, "val": v, "neighbors": [j, k, ...]}, ...] }. Deserialize in two passes: pass 1 create all Node(val) objects keyed by id; pass 2 wire neighbors via the id→Node map. The id→Node map is the deserialization memo — same structure as our clone memo.

Q3: What if Node.__hash__ is overridden so two different nodes with the same val hash equal? A: dict keying breaks. Switch to keying by id(old) explicitly: memo[id(old)] = new. Slightly uglier but bulletproof.

Q4: Can you make the clone provably immutable? A: After cloning, walk every node and do new.neighbors = tuple(new.neighbors). Override __setattr__ to raise. Or wrap each node’s neighbor list in MappingProxyType (for dict-style) or use Python’s dataclass(frozen=True). Note that “immutable” in Python is a discipline, not a hard guarantee — discuss this honestly.

Q5: How would you parallelize? A: Tricky because of shared memo writes. Partition the graph (e.g., METIS or simple BFS-partition into K pieces). Clone each piece independently with its own local memo. Then merge: cross-partition edges need a final reconciliation pass mapping local IDs to global new-node references. Overhead of partitioning + reconciliation usually only pays off for very large graphs.


14. Solution Walkthrough

See solution.py:

  • Node class with __slots__ = ("val", "neighbors").
  • clone_graph_dfs(node) — recursive memoized DFS. Canonical.
  • clone_graph_bfs(node) — iterative deque-based BFS. Same memo, different traversal order.
  • serialize(node) / deserialize(adj) — used as test infrastructure (build random graphs from adjacency lists and dump back).
  • graphs_isomorphic_and_independent(old, new) — structural isomorphism check + mutation test (mutate clone, confirm original unchanged).
  • stress_test() — 1000 random connected undirected graphs N=1..15, both DFS and BFS clones pass isomorphism + independence.
  • edge_cases() — None, single node, self-loop, 2-cycle, K5, line graph.

Run: python3 solution.py → “ALL TESTS PASSED”.


15. Beyond the Problem

Principal-engineer code review comment:

“The memo dict is doing three jobs at once — dedup, cycle break, and reference resolution. That’s elegant in 20 lines but becomes a maintenance hazard at 2000. If you ever generalize this to our internal graph framework, please split it: one set for ‘visited’, one dict for ‘cloned’. Conflating them is fine for an interview but invites bugs when someone adds a fourth concern (e.g., ‘edges-already-copied’) and overloads the same dict. Also: the algorithm here is identical to graph deserialization — please name it _traverse_with_memo and reuse it. Don’t copy-paste this loop into the JSON loader.”

Connections forward:

  • p23 (Course Schedule) — same DFS scaffold but the question is cycle detection (return bool) not cycle preservation (clone). The dual problem.
  • LC 138 — copy list with random pointer; literally the same memo pattern in linked-list form.
  • LC 1485 (Clone Binary Tree With Random Pointer) — same pattern on trees.
  • Production: graph-database subgraph extraction, model-graph deep-copy in PyTorch (parameter dependency graphs), serialization of cyclic Python objects (pickle uses this internally via its memo table).