p69 — Alien Dictionary
Source: LeetCode 269 · Hard (LC Premium) · Topics: Graph, Topo Sort, Strings Companies (2024–2025): Google (very high), Meta (high), Amazon (high), Microsoft (medium-high) Loop position: The hardest topological sort variant on LC. You don’t get the graph — you must INFER edges from adjacent word comparisons, handle the prefix-violation edge case correctly, and detect cycles. Three independent failure modes; missing any = wrong answer.
1. Quick Context
A list of words from an alien language, sorted lexicographically in some unknown alphabet order. Derive that alphabet order. Return empty string if impossible.
Strategy:
- Build a directed graph over CHARACTERS where edge
a → bmeans “a comes before b.” - Add all distinct characters as nodes (even those with no edges).
- For each adjacent pair
words[i], words[i+1]: find first differing character; that gives ONE edge. - Prefix-violation check: if
words[i]is longer thanwords[i+1]ANDwords[i+1]is a prefix ofwords[i](e.g.,["abc", "ab"]), it’s impossible → return “”. - Topo sort (Kahn or DFS); cycle → return “”.
O(V + E + N·L) where N = number of words, L = max word length.
2. LeetCode Link + Attempt Gate
🔗 https://leetcode.com/problems/alien-dictionary/ (LC Premium)
STOP. 40-min timer. Three independent edge cases must all be handled.
3. Prerequisite Concepts
- Topological sort (Kahn’s BFS — see p62).
- Cycle detection in directed graph.
- Lexicographic order: comparison stops at first differing char.
- The “prefix” subtlety: in dictionary order, shorter prefix comes before longer extension.
4. How to Approach (FRAMEWORK 1–9)
1. Restate: Reverse-engineer an alphabet from sorted-word evidence.
2. Clarify: All chars lowercase? (Typically yes, but sometimes any chars.) Are all chars unique across words? No — they may repeat. Tie order ambiguous? (Multiple valid orderings; any one is acceptable.)
3. Examples:
["wrt","wrf","er","ett","rftt"]→"wertf".["abc","ab"]→""(prefix violation).["z","x","z"]→""(cycle: z→x, x→z).
5. Brute force: Try all permutations of distinct chars, check each against the word list. 26! too large but conceptually correct.
6. Target: O(V + E + N·L). V = distinct chars (≤26), E ≤ N-1.
7. Pattern: Topo sort over inferred graph.
8. Develop: see solution.py.
9. Test: prefix violation; cycle; single word (any permutation of its chars valid); duplicate adjacent words; chars-with-no-constraints (must still appear in result).
5. Progressive Hints
See hints.md.
6. Deeper Insight
6.1 Where do edges come from?
Between two adjacent words w1, w2, walk char-by-char:
- If
w1[i] != w2[i]: add edgew1[i] → w2[i]. STOP walking — subsequent characters give no info. - If
w1[i] == w2[i]: continue. - If we exhaust
w2(sow2is a prefix ofw1) ANDlen(w1) > len(w2): invalid input — a prefix must come BEFORE its extension in dictionary order, so["abc","ab"]is impossible.
The “first differing character” rule is critical: characters AFTER the first difference give NO ordering information. (abxxx vs acyyy only tells us b < c, not x vs y.)
6.2 Why include ALL characters as nodes
The result must include every character that appears in any word. A character with no incoming OR outgoing constraints (e.g., only appears in one word with no neighbors) must still be in the output.
Common bug: build the graph only from inferred edges, miss isolated nodes. The final string has the wrong character count.
6.3 Three failure modes
- Prefix violation: shorter word that’s a prefix appears AFTER its extension.
- Cycle: e.g., a→b and b→a inferred. Detected via Kahn (final order length < V) or DFS (back edge).
- No solution despite valid input: shouldn’t happen if input is well-formed; impossible only via 1 or 2.
Each is independent. Tests target all three.
6.4 Kahn vs DFS for topo sort
Kahn (BFS): Build indegrees; queue all nodes with indeg=0; pop, append to result, decrement neighbors. Cycle iff len(result) < V.
DFS with three-color: WHITE=0, GRAY=1, BLACK=2. Cycle iff we hit a GRAY (back edge during DFS).
Both O(V+E). Kahn is slightly cleaner here; gives an explicit order. DFS gives the order in reverse-post-order.
6.5 Multiple valid orderings
When the partial order is incomplete (e.g., only a→b is known, c is unconstrained), multiple total orders are valid. LC accepts any. Kahn with a deque (FIFO) gives one consistent order; using a min-heap gives the lex-smallest.
6.6 Connection to scheduling
Alien Dictionary is just LC 207 Course Schedule with the graph hidden in word-comparison data. Once edges are inferred, it’s a textbook topo sort. The “hardness” is the graph-construction phase + edge cases, not the sorting itself.
6.7 Generalization: order from preferences
Any “I have evidence of pairwise orderings; derive a global order” problem is alien-dictionary-shaped. Examples: tournament rankings from match outcomes, package dependency resolution from build logs, sentence reconstruction from word-adjacency frequencies. Topo sort + cycle detection is the universal hammer.
7. Anti-Pattern Analysis
- Compare entire words, not first-difference char — wrong edges (extra info from later chars).
- Forget the prefix-violation check — silently returns valid order on impossible input.
- Don’t add all chars as nodes — output missing chars; wrong length.
- Use indegree dict without initializing isolated chars to 0 — those chars never get into the Kahn queue.
- Cycle detection by “result not equal to input chars” — only checks length; misses cycles where some chars happen to have indeg 0.
- Compare adjacent words once and stop on first edge but then process EXTRA edges from later chars — silently adds wrong edges.
8. Skills & Takeaways
- Topo sort on inferred graph.
- “First differing character” rule for lex order.
- Prefix-violation invalidity.
- Cycle detection via Kahn’s len-check.
- Universal pattern: pairwise ordering evidence → topo sort.
9. When to Move On
- Solve LC 269 cold in 30 min handling all 3 edge cases.
- State the prefix-violation check in 30 seconds.
- Distinguish “no edges between two chars yet” from “cycle.”
- Solve LC 207, LC 210 (Course Schedule I/II), LC 444 (Sequence Reconstruction) cold.
10. Company Context
- Google: L5/L6; the most-asked Google graph problem alongside LC 207.
- Meta: E5; tests both topo sort AND attention to edge cases.
- Amazon: L6; comes with prefix-violation as the “did you really test it” gotcha.
- Microsoft: L65; similar.
Scorecard for strong: “Built graph correctly from first-differing char, handled prefix violation, added all chars as nodes, used Kahn with cycle detection, mentioned multiple valid orderings.”
11. Interviewer’s Lens
| Signal | Junior | Mid | Senior | Staff+ |
|---|---|---|---|---|
| Graph construction | Compares full words | First diff char | + stops processing after first diff | + handles all 3 edge cases unprompted |
| Prefix violation | Misses | After prompt | Cold | + articulates why (prefix must precede extension) |
| Isolated nodes | Misses | After fail | Adds preemptively | + explains node set ≠ edge endpoint set |
| Cycle detection | None | Kahn len check | + DFS three-color alt | + LC 444 unique-order extension |
12. Level Delta
- Mid (L4): Handles standard case; misses one of the three edge cases.
- Senior (L5): All three edge cases right; cleanly built graph; Kahn or DFS; cites pattern.
- Staff (L6): + LC 444 (Sequence Reconstruction — uniqueness check) + lex-smallest topo order via heap + multiple valid orderings discussion.
- Principal (L7+): + general “pairwise evidence → partial order → topo sort” framing + connection to learning-to-rank + production dependency-resolution systems (Cargo, Nix, NPM).
13. Follow-up Q&A
Q1: Multiple valid orderings — return the lex-smallest. A1: Replace deque with min-heap in Kahn’s algorithm. O((V+E) log V).
Q2: Is the order UNIQUE? (LC 444 Sequence Reconstruction) A2: Yes iff at every Kahn step the queue has exactly one element. If ever ≥ 2 simultaneously, order is ambiguous.
Q3: What if input has only one word? A3: No edges can be inferred. Return that word’s chars in any order (typically their original order in the word, deduplicated).
Q4: What if we’re allowed to be wrong on a few inputs (statistical)? A4: Use feedback-arc-set heuristics; treat as approximate ranking aggregation.
Q5: What if there’s a typo in the input — can you detect? A5: A “typo” might create a spurious cycle or prefix violation. You can identify the minimum-edge edit to make it consistent; that’s an NP-hard variant (minimum feedback arc set).
14. Solution Walkthrough
See solution.py:
alien_order_kahn— Kahn’s BFS topo sort (canonical).alien_order_dfs— iterative DFS three-color variant.alien_order_brute— enumerate permutations of distinct chars; verify each (only for stress with ≤6 chars).
15. Beyond the Problem
Principal code review: “Alien Dictionary is the prototypical ‘graph is hidden in the data’ problem. Half the work is reading the input correctly: extracting evidence from pairwise word comparisons under the strict first-differing-character rule, detecting the prefix-violation that immediately invalidates the input. The other half — topo sort with cycle detection — is the LC 207 you’ve already solved. This pattern recurs across systems: package managers reconstruct build order from declared dependencies; tournament systems infer rankings from match outcomes; learning-to-rank systems aggregate pairwise human preferences. The universal lesson is that ‘partial order + cycle detection + total ordering’ is one tool. When evidence is consistent, topo sort gives you the answer; when it’s not, you need feedback-arc-set heuristics. Knowing which case you’re in is the senior judgment.”