Lab 06 — Topological Sort (Alien Dictionary)

Goal

Build a topological sort over an inferred constraint graph — a problem whose graph is not given but must be extracted from the input. After this lab you should be able to recognize “ordering with constraints” problems in <60 seconds, write Kahn’s algorithm from a blank screen in <8 minutes, and identify and handle the three degenerate cases of alien-dictionary parsing (prefix violation, no constraint differs, cyclic dependency).

Background Concepts

A topological sort orders the vertices of a DAG such that every directed edge u → v has u before v. Kahn’s algorithm repeatedly removes a node of in-degree 0, decrementing its neighbors’ in-degrees. If the final order has length V, the graph is a DAG and the order is valid. Otherwise, a cycle exists.

For alien dictionary, the input is a list of words known to be sorted lexicographically in some unknown alien alphabet. Each adjacent pair of words gives at most one ordering constraint between two characters: the first position where they differ tells you a < b for those characters. Build the constraint graph, run topological sort, output an order. Handle three degenerate inputs:

No constraint differs between adjacent words but the second is a prefix of the first (e.g., ["abc", "ab"]) — invalid; return "".
No constraint differs but it’s because the words are equal up to a common length and the second is the longer one (["ab", "abc"]) — fine; no constraint added.
Cycle in the constraint graph — invalid; return "".

Interview Context

Alien Dictionary (LC 269) is one of the most-asked Hard graph problems at Meta, Google, and Airbnb. Premium-only on LeetCode but widely leaked. It tests: (1) recognizing topological sort, (2) constraint extraction from non-graph input, (3) handling all three edge cases. Candidates who solve only the happy path lose major points. The senior signal: enumerate the three failure modes upfront, before writing code.

Problem Statement

There is a new alien language using English letters. The order of letters is unknown. Given a list of words sorted lexicographically by the alien language’s rules, return any valid letter ordering. If no valid ordering exists, return "". If multiple are valid, return any.

Constraints

1 ≤ |words| ≤ 100
1 ≤ |words[i]| ≤ 100
All words consist of lowercase English letters.

Clarifying Questions

Is the answer unique? (No — any valid topological order.)
Does the answer include letters not appearing in any word? (No — only letters that appear.)
Are duplicate words possible? (Possible; treat normally — they yield no constraint.)
What does “lexicographically sorted” mean for words of different length? (Standard prefix rule; if A is a prefix of B, then A < B; if B is a prefix of A, the input is invalid.)
What if words is a single word? (Output any permutation of its unique letters.)

Examples

words = ["wrt","wrf","er","ett","rftt"]
→ "wertf"  (one valid ordering)

words = ["z","x"]
→ "zx"

words = ["z","x","z"]
→ ""  (z and x must precede each other — cycle)

words = ["abc","ab"]
→ ""  (prefix violation — "abc" can't come before "ab")

Initial Brute Force

Enumerate all permutations of the alphabet of size ≤ 26; for each, verify that the input words are in that lexicographic order. O(26!) — infeasible.

Brute Force Complexity

O(26! · ΣL) — astronomical.

Optimization Path

The answer requires O(V + E) where V = number of distinct letters (≤ 26) and E = number of pairwise constraints (≤ |words| - 1). Kahn’s algorithm runs in O(V + E). Constraint extraction is O(Σ |words[i]|).

The total is O(V + E + Σ L) — bounded by Σ L since V ≤ 26. Trivially fast.

Final Expected Approach

Initialize in_degree[c] = 0 for every letter that appears anywhere.
Build adjacency: for each adjacent pair (w1, w2) in words:
- Walk both words in parallel; at the first index i where they differ, add edge w1[i] → w2[i] (if not already present); break.
- If no differing index found and len(w1) > len(w2): prefix violation — return "".
Run Kahn’s: queue all letters with in_degree[c] == 0; pop, append to order, decrement neighbors.
If the order length equals the number of distinct letters, return the order; else cycle → return "".

Data Structures Used

defaultdict(set) for adjacency (set prevents duplicate edges).
dict[char, int] for in_degree.
collections.deque for the Kahn queue.
list[char] for the result.

Correctness Argument

Each adjacent pair contributes at most one constraint — the first differing character. This is sound: if the words are correctly sorted, the relative order of w1[i] and w2[i] (at the first differing index) must be w1[i] < w2[i] in the alien alphabet. No constraint can be inferred from differences after the first; those are consistent with but not implied by the sortedness.

Topological sort over these constraints produces an ordering where every constraint a < b is satisfied (a precedes b in the output). If a cycle exists, the constraints are unsatisfiable and the input is impossible. The prefix-violation case is the only constraint-extraction-time invalid input.

Complexity

Operation	Time	Space
Constraint extraction	O(Σ L)	O(unique edges) ≤ O(26²)
Kahn’s	O(V + E)	O(V + E)
Total	O(Σ L)	O(unique letters + edges)

Implementation Requirements

from collections import defaultdict, deque

def alienOrder(words):
    adj = defaultdict(set)
    in_deg = {c: 0 for w in words for c in w}
    for i in range(len(words) - 1):
        w1, w2 = words[i], words[i+1]
        found = False
        for j in range(min(len(w1), len(w2))):
            if w1[j] != w2[j]:
                if w2[j] not in adj[w1[j]]:
                    adj[w1[j]].add(w2[j])
                    in_deg[w2[j]] += 1
                found = True
                break
        if not found and len(w1) > len(w2):
            return ""
    queue = deque([c for c, d in in_deg.items() if d == 0])
    order = []
    while queue:
        c = queue.popleft()
        order.append(c)
        for nb in adj[c]:
            in_deg[nb] -= 1
            if in_deg[nb] == 0:
                queue.append(nb)
    return "".join(order) if len(order) == len(in_deg) else ""

Tests

Standard: ["wrt","wrf","er","ett","rftt"] → some topo of w<e, r<t, t<f.
Single word: ["abc"] → some permutation of {a,b,c}.
Prefix violation: ["abc","ab"] → "".
Tie/equal words: ["a","a"] → "a".
Cycle: ["z","x","z"] → "".
All same length, all letters used: ["aa","ab","cb"].
Single letter: ["z"] → "z".
Long words, tiny alphabet: stress for in-degree correctness.

Follow-up Questions

“Find the lexicographically smallest valid order (in standard a-z order).” → Kahn’s with a min-heap instead of a queue.
“Find all valid orders.” → Backtracking over Kahn’s choices; exponential.
“Verify a given ordering.” → For each adjacent word pair, scan for the first differing char and check ordering. O(Σ L).
“Online: words arrive one by one.” → Maintain the adjacency incrementally; rerun Kahn’s lazily on query.
“What if the input has typos (wrongly-ordered pairs)?” → Return any consistent ordering, or report the conflict edge.

Product Extension

Build systems (Bazel, Make, Gradle) compute build orders via topological sort over the dependency DAG; cycle detection is a critical correctness property. Database query planners use topo sort over join-graph dependencies. Distributed task schedulers (Airflow, Argo) execute DAGs of jobs in topological order.

Language/Runtime Follow-ups

Python: defaultdict(set) and collections.deque are essential. dict.items() iteration is fine.
Java: Map<Character, Set<Character>> and Map<Character, Integer> for in-degree; ArrayDeque<Character> for queue. Use int[26] for in-degree if alphabet is fixed.
Go: map[byte]map[byte]bool for adjacency; map[byte]int for in-degree; slice as queue. Or [26]int for in-degree as alphabet is fixed.
C++: unordered_map<char, unordered_set<char>>; array<int, 26> for in-degree; queue<char>.
JS/TS: Map<string, Set<string>> and Map<string, number>; array-as-queue with care (shift is O(N)).

Common Bugs

Adding duplicate edges to in-degree — use a set for adjacency, check membership before incrementing.
Missing the prefix violation check — ["abc","ab"] returns "abc" if you don’t handle this.
Building in-degree only for letters that have outgoing edges, missing letters that only appear as targets.
Initializing in_degree only for the first word’s letters — letters appearing only in later words get missed.
Comparing len(order) == 26 instead of == len(in_deg) (only used letters count).
Using a list instead of a set for adjacency, then double-incrementing in-degree.
Returning the order in the wrong direction (Kahn’s gives the right direction; DFS post-order needs to be reversed).

Debugging Strategy

Print the adjacency and in-degree maps after constraint extraction. Verify each constraint is justified by tracing back to the input pair. For cycles, print the in-degree map at the point Kahn’s stalls — the remaining-positive in-degrees identify nodes in the cycle. For prefix-violation false negatives, print the pair (w1, w2) at each iteration to confirm the check fires.

Mastery Criteria

Recognized “ordering with constraints” as topological sort in <60 seconds.
Wrote constraint extraction from word pairs from cold start in <5 minutes.
Wrote Kahn’s algorithm from blank screen in <6 minutes.
Enumerated the three degenerate cases (cycle, prefix violation, equal-prefix-shorter-second) before coding.
Solved LC 269 in <25 minutes from cold start.
Solved LC 207 (Course Schedule) in <8 minutes by extracting the constraint structure from cold.
Articulated the white-path lemma / DFS-post-order alternative in <60 seconds.

LeetCode Interview — Extreme Coding