p69 — Alien Dictionary

Source: LeetCode 269 · Hard (LC Premium) · Topics: Graph, Topo Sort, Strings Companies (2024–2025): Google (very high), Meta (high), Amazon (high), Microsoft (medium-high) Loop position: The hardest topological sort variant on LC. You don’t get the graph — you must INFER edges from adjacent word comparisons, handle the prefix-violation edge case correctly, and detect cycles. Three independent failure modes; missing any = wrong answer.

1. Quick Context

A list of words from an alien language, sorted lexicographically in some unknown alphabet order. Derive that alphabet order. Return empty string if impossible.

Strategy:

  1. Build a directed graph over CHARACTERS where edge a → b means “a comes before b.”
  2. Add all distinct characters as nodes (even those with no edges).
  3. For each adjacent pair words[i], words[i+1]: find first differing character; that gives ONE edge.
  4. Prefix-violation check: if words[i] is longer than words[i+1] AND words[i+1] is a prefix of words[i] (e.g., ["abc", "ab"]), it’s impossible → return “”.
  5. Topo sort (Kahn or DFS); cycle → return “”.

O(V + E + N·L) where N = number of words, L = max word length.

🔗 https://leetcode.com/problems/alien-dictionary/ (LC Premium)

STOP. 40-min timer. Three independent edge cases must all be handled.

3. Prerequisite Concepts

  • Topological sort (Kahn’s BFS — see p62).
  • Cycle detection in directed graph.
  • Lexicographic order: comparison stops at first differing char.
  • The “prefix” subtlety: in dictionary order, shorter prefix comes before longer extension.

4. How to Approach (FRAMEWORK 1–9)

1. Restate: Reverse-engineer an alphabet from sorted-word evidence.

2. Clarify: All chars lowercase? (Typically yes, but sometimes any chars.) Are all chars unique across words? No — they may repeat. Tie order ambiguous? (Multiple valid orderings; any one is acceptable.)

3. Examples:

  • ["wrt","wrf","er","ett","rftt"]"wertf".
  • ["abc","ab"]"" (prefix violation).
  • ["z","x","z"]"" (cycle: z→x, x→z).

5. Brute force: Try all permutations of distinct chars, check each against the word list. 26! too large but conceptually correct.

6. Target: O(V + E + N·L). V = distinct chars (≤26), E ≤ N-1.

7. Pattern: Topo sort over inferred graph.

8. Develop: see solution.py.

9. Test: prefix violation; cycle; single word (any permutation of its chars valid); duplicate adjacent words; chars-with-no-constraints (must still appear in result).

5. Progressive Hints

See hints.md.

6. Deeper Insight

6.1 Where do edges come from?

Between two adjacent words w1, w2, walk char-by-char:

  • If w1[i] != w2[i]: add edge w1[i] → w2[i]. STOP walking — subsequent characters give no info.
  • If w1[i] == w2[i]: continue.
  • If we exhaust w2 (so w2 is a prefix of w1) AND len(w1) > len(w2): invalid input — a prefix must come BEFORE its extension in dictionary order, so ["abc","ab"] is impossible.

The “first differing character” rule is critical: characters AFTER the first difference give NO ordering information. (abxxx vs acyyy only tells us b < c, not x vs y.)

6.2 Why include ALL characters as nodes

The result must include every character that appears in any word. A character with no incoming OR outgoing constraints (e.g., only appears in one word with no neighbors) must still be in the output.

Common bug: build the graph only from inferred edges, miss isolated nodes. The final string has the wrong character count.

6.3 Three failure modes

  1. Prefix violation: shorter word that’s a prefix appears AFTER its extension.
  2. Cycle: e.g., a→b and b→a inferred. Detected via Kahn (final order length < V) or DFS (back edge).
  3. No solution despite valid input: shouldn’t happen if input is well-formed; impossible only via 1 or 2.

Each is independent. Tests target all three.

6.4 Kahn vs DFS for topo sort

Kahn (BFS): Build indegrees; queue all nodes with indeg=0; pop, append to result, decrement neighbors. Cycle iff len(result) < V.

DFS with three-color: WHITE=0, GRAY=1, BLACK=2. Cycle iff we hit a GRAY (back edge during DFS).

Both O(V+E). Kahn is slightly cleaner here; gives an explicit order. DFS gives the order in reverse-post-order.

6.5 Multiple valid orderings

When the partial order is incomplete (e.g., only a→b is known, c is unconstrained), multiple total orders are valid. LC accepts any. Kahn with a deque (FIFO) gives one consistent order; using a min-heap gives the lex-smallest.

6.6 Connection to scheduling

Alien Dictionary is just LC 207 Course Schedule with the graph hidden in word-comparison data. Once edges are inferred, it’s a textbook topo sort. The “hardness” is the graph-construction phase + edge cases, not the sorting itself.

6.7 Generalization: order from preferences

Any “I have evidence of pairwise orderings; derive a global order” problem is alien-dictionary-shaped. Examples: tournament rankings from match outcomes, package dependency resolution from build logs, sentence reconstruction from word-adjacency frequencies. Topo sort + cycle detection is the universal hammer.

7. Anti-Pattern Analysis

  1. Compare entire words, not first-difference char — wrong edges (extra info from later chars).
  2. Forget the prefix-violation check — silently returns valid order on impossible input.
  3. Don’t add all chars as nodes — output missing chars; wrong length.
  4. Use indegree dict without initializing isolated chars to 0 — those chars never get into the Kahn queue.
  5. Cycle detection by “result not equal to input chars” — only checks length; misses cycles where some chars happen to have indeg 0.
  6. Compare adjacent words once and stop on first edge but then process EXTRA edges from later chars — silently adds wrong edges.

8. Skills & Takeaways

  • Topo sort on inferred graph.
  • “First differing character” rule for lex order.
  • Prefix-violation invalidity.
  • Cycle detection via Kahn’s len-check.
  • Universal pattern: pairwise ordering evidence → topo sort.

9. When to Move On

  • Solve LC 269 cold in 30 min handling all 3 edge cases.
  • State the prefix-violation check in 30 seconds.
  • Distinguish “no edges between two chars yet” from “cycle.”
  • Solve LC 207, LC 210 (Course Schedule I/II), LC 444 (Sequence Reconstruction) cold.

10. Company Context

  • Google: L5/L6; the most-asked Google graph problem alongside LC 207.
  • Meta: E5; tests both topo sort AND attention to edge cases.
  • Amazon: L6; comes with prefix-violation as the “did you really test it” gotcha.
  • Microsoft: L65; similar.

Scorecard for strong: “Built graph correctly from first-differing char, handled prefix violation, added all chars as nodes, used Kahn with cycle detection, mentioned multiple valid orderings.”

11. Interviewer’s Lens

SignalJuniorMidSeniorStaff+
Graph constructionCompares full wordsFirst diff char+ stops processing after first diff+ handles all 3 edge cases unprompted
Prefix violationMissesAfter promptCold+ articulates why (prefix must precede extension)
Isolated nodesMissesAfter failAdds preemptively+ explains node set ≠ edge endpoint set
Cycle detectionNoneKahn len check+ DFS three-color alt+ LC 444 unique-order extension

12. Level Delta

  • Mid (L4): Handles standard case; misses one of the three edge cases.
  • Senior (L5): All three edge cases right; cleanly built graph; Kahn or DFS; cites pattern.
  • Staff (L6): + LC 444 (Sequence Reconstruction — uniqueness check) + lex-smallest topo order via heap + multiple valid orderings discussion.
  • Principal (L7+): + general “pairwise evidence → partial order → topo sort” framing + connection to learning-to-rank + production dependency-resolution systems (Cargo, Nix, NPM).

13. Follow-up Q&A

Q1: Multiple valid orderings — return the lex-smallest. A1: Replace deque with min-heap in Kahn’s algorithm. O((V+E) log V).

Q2: Is the order UNIQUE? (LC 444 Sequence Reconstruction) A2: Yes iff at every Kahn step the queue has exactly one element. If ever ≥ 2 simultaneously, order is ambiguous.

Q3: What if input has only one word? A3: No edges can be inferred. Return that word’s chars in any order (typically their original order in the word, deduplicated).

Q4: What if we’re allowed to be wrong on a few inputs (statistical)? A4: Use feedback-arc-set heuristics; treat as approximate ranking aggregation.

Q5: What if there’s a typo in the input — can you detect? A5: A “typo” might create a spurious cycle or prefix violation. You can identify the minimum-edge edit to make it consistent; that’s an NP-hard variant (minimum feedback arc set).

14. Solution Walkthrough

See solution.py:

  • alien_order_kahn — Kahn’s BFS topo sort (canonical).
  • alien_order_dfs — iterative DFS three-color variant.
  • alien_order_brute — enumerate permutations of distinct chars; verify each (only for stress with ≤6 chars).

15. Beyond the Problem

Principal code review: “Alien Dictionary is the prototypical ‘graph is hidden in the data’ problem. Half the work is reading the input correctly: extracting evidence from pairwise word comparisons under the strict first-differing-character rule, detecting the prefix-violation that immediately invalidates the input. The other half — topo sort with cycle detection — is the LC 207 you’ve already solved. This pattern recurs across systems: package managers reconstruct build order from declared dependencies; tournament systems infer rankings from match outcomes; learning-to-rank systems aggregate pairwise human preferences. The universal lesson is that ‘partial order + cycle detection + total ordering’ is one tool. When evidence is consistent, topo sort gives you the answer; when it’s not, you need feedback-arc-set heuristics. Knowing which case you’re in is the senior judgment.”