Lab 05 — Suffix Automaton
Goal
Build a Suffix Automaton (SAM) for a string, and use it to count the number of distinct substrings in O(N).
Background
A suffix automaton is the smallest DFA that accepts every suffix of a given string. Discovered by Blumer et al. (1985).
Key facts:
- O(N) states and O(N) transitions for a string of length N (over an alphabet)
- Each state corresponds to an equivalence class of right-extensions of substrings
- Distinct substring count = sum of
len[state] - len[link[state]]over all states (excluding the initial state)
It’s the most powerful string data structure in competitive programming. Sometimes compared with suffix arrays + LCP arrays (which solve many of the same problems with different constants).
Interview Context
- Heavy ICPC / Codeforces presence
- Bioinformatics / genome alignment research
- Rarely in industry; Bloomberg may ask, but usually accepts suffix array
- Cryptography (some sequence-counting problems)
When to Skip This Topic
Skip if any of these are true:
- You’re not targeting ICPC or string-research roles
- You haven’t learned suffix arrays yet (lower-hanging fruit; more interview-relevant)
- You don’t have 2+ weeks to internalize this
SAM is conceptually the deepest topic in this phase. The implementation is short; the understanding is hard. Don’t fake it.
Problem Statement
Count Distinct Substrings.
Given a string s, count the number of distinct non-empty substrings.
Example: s = "abc" → substrings {“a”, “b”, “c”, “ab”, “bc”, “abc”} → 6.
s = "aaa" → substrings {“a”, “aa”, “aaa”} → 3.
Constraints
- 1 ≤ |s| ≤ 10^6
- Lowercase English (or larger alphabet — affects transition storage)
- Wall-clock: < 1 sec
Clarifying Questions
- Empty substring counts? (Usually no.)
- Substrings or distinct substrings? (Distinct; non-distinct is trivial: N(N+1)/2.)
- Alphabet size? (26 for English; affects map-vs-array tradeoff.)
Examples
"abc" → 6 (a, b, c, ab, bc, abc)
"aaa" → 3 (a, aa, aaa)
"abab" → 7 (a, b, ab, ba, aba, bab, abab)
"" → 0
Brute Force
Generate all substrings, insert into a hash set, return size. O(N²) substrings, each of average length N/2 to hash → O(N³) worst case, O(N² log N) average. For N = 10^6: dead.
Brute Force Complexity
- Time: O(N²) to O(N³)
- Space: O(N²) for the set
Optimization Path
Build the suffix automaton:
- Each state represents an equivalence class of substring occurrences
- For each state
s(except the initial), the number of distinct substrings ending at this state’s set of right-positions islen[s] - len[link[s]] - Sum these for all states → distinct substring count
Final Expected Approach
class SuffixAutomaton:
def __init__(self):
self.size = 1
self.last = 0
self.len = [0]
self.link = [-1]
self.next = [{}]
def extend(self, c):
cur = self.size
self.size += 1
self.len.append(self.len[self.last] + 1)
self.link.append(-1)
self.next.append({})
p = self.last
while p != -1 and c not in self.next[p]:
self.next[p][c] = cur
p = self.link[p]
if p == -1:
self.link[cur] = 0
else:
q = self.next[p][c]
if self.len[p] + 1 == self.len[q]:
self.link[cur] = q
else:
clone = self.size
self.size += 1
self.len.append(self.len[p] + 1)
self.link.append(self.link[q])
self.next.append(dict(self.next[q]))
while p != -1 and self.next[p].get(c) == q:
self.next[p][c] = clone
p = self.link[p]
self.link[q] = clone
self.link[cur] = clone
self.last = cur
def count_distinct_substrings(self):
return sum(self.len[i] - self.len[self.link[i]] for i in range(1, self.size))
# Usage
sam = SuffixAutomaton()
for c in "abab":
sam.extend(c)
print(sam.count_distinct_substrings()) # 7
Data Structures
len[]: longest substring represented by each statelink[]: suffix link (analogous to failure link in Aho-Corasick)next[]: transition map per state (dict or array of 26)
Correctness Argument
The SAM construction is non-trivial. The key invariants:
- After processing prefix of length k, the SAM recognizes exactly the suffixes of that prefix
- Each state’s
len[s] - len[link[s]]counts the number of distinct substrings whose right-extension class is exactly this state - Summing across all non-initial states gives total distinct substrings
The cloning step (when len[p] + 1 != len[q]) splits a state to maintain the equivalence class property — without it, the automaton wouldn’t be canonical.
A rigorous proof is in Maxime Crochemore’s textbook. Accept the construction; verify on small cases.
Complexity
- Construction: O(N · |Σ|) with dict transitions, or O(N) amortized with array transitions
- Distinct substring count: O(N) after construction
- Space: O(N · |Σ|) worst case
Implementation Requirements
- Use
dictper state for arbitrary alphabets, or[None] * 26for English - Allocate state arrays incrementally (or pre-allocate 2N for safety)
- Cloning step is the most error-prone — verify on small cases
- Avoid recursion; SAM is naturally iterative
Tests
"a"→ 1"aa"→ 2"ab"→ 3"abc"→ 6"aaa"→ 3"abab"→ 7"abcabc"→ 15- Stress: random strings of N=100 against brute force
- Performance: N = 10^6 single character → finishes in < 1 sec
Follow-up Questions
- Find longest common substring of two strings. → Build SAM of one; walk through the other tracking match length.
- Number of occurrences of each substring. → Count terminal nodes (via topological sort of suffix link tree, then propagate).
- K-th lexicographically smallest substring. → DFS on SAM with character ordering + count of substrings reachable.
- Substring matching count for many queries. → Walk pattern in SAM; if it ends, answer is the size of its “endpos” set.
Product Extension
- Genome assembly: distinct k-mers, longest common substrings across reads
- Plagiarism detection
- Compression (LZ-family algorithms use suffix structures)
- Search engine n-gram indexing (less common; usually suffix array)
Language/Runtime Follow-ups
- C++: use
int next[][26]for the alphabet — fast. - Python: dict transitions; constant factor allows N ≤ 10^5 comfortably.
- Java: TreeMap or HashMap; arrays preferred for fixed alphabet.
Common Bugs
- Forgot to clone: when len[p]+1 != len[q], failing to clone breaks the automaton.
- Wrong update of
last: must always becur, notclone. - Suffix link of
curset incorrectly: subtle; verify against reference. - Used same dict reference for clone: must
dict(self.next[q])(copy). - Off-by-one in distinct substring sum: start from state 1, not 0 (state 0 is the initial state and represents the empty substring).
Debugging Strategy
- Print states with (len, link, transitions) after each
extendcall - Visualize the suffix-link tree (parent = link[state])
- Verify against brute force for N ≤ 20
- For the cloning step: log when it triggers and which state is split
Mastery Criteria
- Implement SAM construction in ≤ 45 min from memory (it’s short but error-prone)
- Explain the role of suffix links and the cloning step
- Apply SAM to: distinct substring count, occurrence count, LCS of two strings
- State complexity precisely (O(N) states, O(N|Σ|) transitions, O(N) construction with arrays)
- Distinguish SAM from suffix arrays in problem-applicability