Lab 07 — Union-Find Applications (Accounts Merge)

Goal

Implement a disjoint-set union (DSU) with path compression and union by rank, then apply it to a real merge problem where the “elements” are emails and the “groups” are accounts. After this lab you should be able to write DSU from a blank screen in <6 minutes, recognize the merge-by-shared-attribute signal in <60 seconds, and articulate when DSU beats DFS for connectivity (online updates, no spatial structure, simple connectivity-only queries).

Background Concepts

A disjoint-set union (DSU, aka union-find) maintains a partition of N elements under two operations:

find(x): return the representative (“root”) of x’s set.
union(x, y): merge the sets containing x and y.

With path compression (find rewrites every visited node to point directly at the root) and union by rank/size (always attach the smaller tree under the larger), both operations run in O(α(N)) amortized, where α is the inverse Ackermann function — effectively constant for any practical N.

DSU is the natural choice when you receive a stream of “merge x and y” operations and need to answer “are x and y in the same group” — and you don’t care about paths between them, only connectivity.

Interview Context

Accounts Merge (LC 721) is a top-tier Hard at Amazon and Google. Number of Provinces (LC 547) is the easier sibling at Meta. The trap: candidates default to BFS/DFS on the implicit graph (emails as nodes, “shared email between two accounts” as edges), which works but is messier code than DSU. The senior signal is recognizing the partition structure and reaching for DSU within 90 seconds.

Problem Statement

Given a list of accounts, where accounts[i] = [name, email1, email2, ...], two accounts belong to the same person if they share any common email (names alone are not enough — multiple people can share a name). Merge accounts: return a list where each element is [name, ...sorted unique emails], accounts in any order.

Constraints

1 ≤ |accounts| ≤ 1000
2 ≤ |accounts[i]| ≤ 10 (one name + 1..9 emails)
1 ≤ |email| ≤ 30
Emails are lowercase, contain @.

Clarifying Questions

Are emails case-sensitive? (Per LC: lowercase already.)
Two accounts with the same name but no shared email — are they merged? (No — names don’t merge.)
Should the output emails be sorted within each account? (Yes — alphabetically.)
Order of accounts in output? (Any order is accepted.)
Total emails: ≤ 1000 × 9 = 9000 — DSU on emails is fine.

Examples

accounts = [
  ["John","[email protected]","[email protected]"],
  ["John","[email protected]","[email protected]"],
  ["Mary","[email protected]"],
  ["John","[email protected]"]
]
→ [["John","[email protected]","[email protected]","[email protected]"],
   ["Mary","[email protected]"],
   ["John","[email protected]"]]

Initial Brute Force

Build implicit graph: each email is a node; for each account, connect all its emails to the first email of that account; run DFS/BFS to enumerate connected components; emit each component with the corresponding name. Works, but DSU is cleaner.

Brute Force Complexity

O(Σ |emails| · α) with DSU; O(Σ |emails|) with DFS — both linear in total email count. The DFS version requires building an adjacency list explicitly, which DSU skips.

Optimization Path

DSU directly:

Treat each unique email as a DSU element.
For each account, union all its emails to the first email.
After all unions, group emails by find(email) root.
For each group, attach the name (looked up via any email in the group → its account → the account’s name).
Sort emails within each group; output.

This is the cleanest expression. No explicit graph construction needed.

Final Expected Approach

parent = {}
def find(x): if parent[x] != x: parent[x] = find(parent[x]); return parent[x]
def union(x, y): parent[find(x)] = find(y)

email_to_name = {}
for account in accounts:
    name = account[0]
    for email in account[1:]:
        if email not in parent: parent[email] = email
        email_to_name[email] = name
        union(account[1], email)

groups = defaultdict(list)
for email in parent: groups[find(email)].append(email)

return [[email_to_name[group[0]]] + sorted(group) for group in groups.values()]

Data Structures Used

dict[str, str] for parent (DSU).
dict[str, str] for email_to_name.
defaultdict(list) for grouping by root.

Correctness Argument

DSU correctness: Initially every element is its own set. Each union merges two sets. find returns a canonical representative. After path compression, find(x) == find(y) iff they were ever transitively unioned. Path compression and union by rank/size preserve this invariant and amortize each op to α(N).

Reduction correctness: Two emails belong to the same person iff there is a chain of accounts where consecutive accounts share an email. The unions on each account’s emails form precisely these chains; the resulting partition matches the equivalence-class definition.

Output correctness: Each component’s name is unambiguous because (a) every account contributing to the component has the same name as the others in that component (otherwise they’d be different people, and the input is well-formed by problem statement), and (b) any email in the group recovers the name via email_to_name.

Complexity

Operation	Time	Space
Building DSU	O(Σ E · α) where E = total emails	O(Σ E)
Grouping + sort	O(Σ E log E) for sorting within each group	O(Σ E)
Total	O(Σ E log E)	O(Σ E)

Implementation Requirements

from collections import defaultdict

class DSU:
    def __init__(self):
        self.parent = {}
        self.rank = {}
    def find(self, x):
        if self.parent[x] != x:
            self.parent[x] = self.find(self.parent[x])
        return self.parent[x]
    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry: return
        if self.rank[rx] < self.rank[ry]: rx, ry = ry, rx
        self.parent[ry] = rx
        if self.rank[rx] == self.rank[ry]: self.rank[rx] += 1
    def add(self, x):
        if x not in self.parent:
            self.parent[x] = x
            self.rank[x] = 0

def accountsMerge(accounts):
    dsu = DSU()
    email_to_name = {}
    for account in accounts:
        name = account[0]
        first = account[1]
        for email in account[1:]:
            dsu.add(email)
            email_to_name[email] = name
            dsu.union(first, email)
    groups = defaultdict(list)
    for email in dsu.parent:
        groups[dsu.find(email)].append(email)
    return [[email_to_name[g[0]]] + sorted(g) for g in groups.values()]

Tests

Standard: 3-account merge → 1 merged + 2 separate.
All accounts disjoint → each emerges separately.
All accounts share one email → all merge into one.
Single account → unchanged.
Same name, different emails → separate accounts.
Empty emails (problem disallows but defend): account[1:] is empty → no unions, no groups; account is dropped if no emails. Verify behavior.

Follow-up Questions

“Number of Provinces (LC 547): given an N × N adjacency matrix, count groups.” → DSU over N nodes; union if M[i][j] == 1; count distinct roots.
“Online: accounts arrive in a stream.” → DSU handles this natively; just keep adding and unioning.
“What if names matter (same name + shared email merges, different name doesn’t)?” → Keep DSU but check name-compatibility before union; conflict means error or skip.
“What if you need to remove an account?” → DSU doesn’t support remove. Use Link-Cut Trees or rebuild from scratch.
“What if path compression isn’t allowed (read-only find)?” → Use union by rank only; O(log N) per op instead of α.

Product Extension

Identity-resolution at LinkedIn, Salesforce, and ad networks merges user records by shared email/phone using DSU. Image segmentation libraries (OpenCV’s connectedComponents) use DSU under the hood. Distributed-system membership-protocols use DSU-like merges to track partition healing. Kruskal’s MST (Lab 08) uses DSU as its core data structure.

Language/Runtime Follow-ups

Python: recursion in find may exceed limit at N > 10^4; use iterative two-pass (find root, then compress).
Java: int[] parent for integer keys is significantly faster than HashMap<Integer, Integer>. Use iterative find.
Go: parent := make(map[string]string) for string keys, or []int for integer indices.
C++: vector<int> parent(N); iota(parent.begin(), parent.end(), 0); is the clean pattern. Iterative find.
JS/TS: Map<string, string> is fine; use iterative find to avoid call-stack issues at large N.

Common Bugs

Recursion limit in Python’s find — at N = 10^4 with worst-case chains, blows the stack. Use iterative or increase limit.
Forgetting path compression — find becomes O(N), not α. Functionality correct but TLE.
Union without rank — same TLE risk on adversarial inputs.
Comparing parent[x] == x vs find(x) == x for “is root” — only parent[x] == x is correct; find(x) == x is always true after find rewrites.
Forgetting to add an email to parent before union (union calls find which dereferences parent[email]) — KeyError.
Mapping email_to_name per-account but overwriting — last write wins; usually fine here, but be deliberate.
Not deduplicating emails within an account (LC inputs may not, but the algorithm is robust either way).

Debugging Strategy

Print parent and rank after each union. For wrong groupings, trace which email failed to union with which other and which account broke the chain. For TLE, verify both path compression and union by rank are present; profile to confirm find dominates.

Mastery Criteria

Recognized “merge by shared attribute” as DSU in <60 seconds.
Wrote DSU with path compression and union by rank from blank screen in <6 minutes.
Stated O(α) amortized complexity unprompted.
Articulated when DFS is a valid alternative (offline, no online updates) and when DSU is mandatory (online stream of merges).
Solved LC 547 (Number of Provinces) in <8 minutes.
Solved LC 721 (Accounts Merge) in <20 minutes from cold start.
Solved LC 305 (Number of Islands II) in <20 minutes — the canonical online-DSU problem.
Articulated path compression’s effect on amortization in <60 seconds.

LeetCode Interview — Extreme Coding