Lab 05 — Suffix Automaton

Goal

Build a Suffix Automaton (SAM) for a string, and use it to count the number of distinct substrings in O(N).

Background

A suffix automaton is the smallest DFA that accepts every suffix of a given string. Discovered by Blumer et al. (1985).

Key facts:

O(N) states and O(N) transitions for a string of length N (over an alphabet)
Each state corresponds to an equivalence class of right-extensions of substrings
Distinct substring count = sum of len[state] - len[link[state]] over all states (excluding the initial state)

It’s the most powerful string data structure in competitive programming. Sometimes compared with suffix arrays + LCP arrays (which solve many of the same problems with different constants).

Interview Context

Heavy ICPC / Codeforces presence
Bioinformatics / genome alignment research
Rarely in industry; Bloomberg may ask, but usually accepts suffix array
Cryptography (some sequence-counting problems)

When to Skip This Topic

Skip if any of these are true:

You’re not targeting ICPC or string-research roles
You haven’t learned suffix arrays yet (lower-hanging fruit; more interview-relevant)
You don’t have 2+ weeks to internalize this

SAM is conceptually the deepest topic in this phase. The implementation is short; the understanding is hard. Don’t fake it.

Problem Statement

Count Distinct Substrings.

Given a string s, count the number of distinct non-empty substrings.

Example: s = "abc" → substrings {“a”, “b”, “c”, “ab”, “bc”, “abc”} → 6.

s = "aaa" → substrings {“a”, “aa”, “aaa”} → 3.

Constraints

1 ≤ |s| ≤ 10^6
Lowercase English (or larger alphabet — affects transition storage)
Wall-clock: < 1 sec

Clarifying Questions

Empty substring counts? (Usually no.)
Substrings or distinct substrings? (Distinct; non-distinct is trivial: N(N+1)/2.)
Alphabet size? (26 for English; affects map-vs-array tradeoff.)

Examples

"abc"   → 6   (a, b, c, ab, bc, abc)
"aaa"   → 3   (a, aa, aaa)
"abab"  → 7   (a, b, ab, ba, aba, bab, abab)
""      → 0

Brute Force

Generate all substrings, insert into a hash set, return size. O(N²) substrings, each of average length N/2 to hash → O(N³) worst case, O(N² log N) average. For N = 10^6: dead.

Brute Force Complexity

Time: O(N²) to O(N³)
Space: O(N²) for the set

Optimization Path

Build the suffix automaton:

Each state represents an equivalence class of substring occurrences
For each state s (except the initial), the number of distinct substrings ending at this state’s set of right-positions is len[s] - len[link[s]]
Sum these for all states → distinct substring count

Final Expected Approach

class SuffixAutomaton:
    def __init__(self):
        self.size = 1
        self.last = 0
        self.len = [0]
        self.link = [-1]
        self.next = [{}]
    
    def extend(self, c):
        cur = self.size
        self.size += 1
        self.len.append(self.len[self.last] + 1)
        self.link.append(-1)
        self.next.append({})
        p = self.last
        while p != -1 and c not in self.next[p]:
            self.next[p][c] = cur
            p = self.link[p]
        if p == -1:
            self.link[cur] = 0
        else:
            q = self.next[p][c]
            if self.len[p] + 1 == self.len[q]:
                self.link[cur] = q
            else:
                clone = self.size
                self.size += 1
                self.len.append(self.len[p] + 1)
                self.link.append(self.link[q])
                self.next.append(dict(self.next[q]))
                while p != -1 and self.next[p].get(c) == q:
                    self.next[p][c] = clone
                    p = self.link[p]
                self.link[q] = clone
                self.link[cur] = clone
        self.last = cur
    
    def count_distinct_substrings(self):
        return sum(self.len[i] - self.len[self.link[i]] for i in range(1, self.size))

# Usage
sam = SuffixAutomaton()
for c in "abab":
    sam.extend(c)
print(sam.count_distinct_substrings())   # 7

Data Structures

len[]: longest substring represented by each state
link[]: suffix link (analogous to failure link in Aho-Corasick)
next[]: transition map per state (dict or array of 26)

Correctness Argument

The SAM construction is non-trivial. The key invariants:

After processing prefix of length k, the SAM recognizes exactly the suffixes of that prefix
Each state’s len[s] - len[link[s]] counts the number of distinct substrings whose right-extension class is exactly this state
Summing across all non-initial states gives total distinct substrings

The cloning step (when len[p] + 1 != len[q]) splits a state to maintain the equivalence class property — without it, the automaton wouldn’t be canonical.

A rigorous proof is in Maxime Crochemore’s textbook. Accept the construction; verify on small cases.

Complexity

Construction: O(N · |Σ|) with dict transitions, or O(N) amortized with array transitions
Distinct substring count: O(N) after construction
Space: O(N · |Σ|) worst case

Implementation Requirements

Use dict per state for arbitrary alphabets, or [None] * 26 for English
Allocate state arrays incrementally (or pre-allocate 2N for safety)
Cloning step is the most error-prone — verify on small cases
Avoid recursion; SAM is naturally iterative

Tests

"a" → 1
"aa" → 2
"ab" → 3
"abc" → 6
"aaa" → 3
"abab" → 7
"abcabc" → 15
Stress: random strings of N=100 against brute force
Performance: N = 10^6 single character → finishes in < 1 sec

Follow-up Questions

Find longest common substring of two strings. → Build SAM of one; walk through the other tracking match length.
Number of occurrences of each substring. → Count terminal nodes (via topological sort of suffix link tree, then propagate).
K-th lexicographically smallest substring. → DFS on SAM with character ordering + count of substrings reachable.
Substring matching count for many queries. → Walk pattern in SAM; if it ends, answer is the size of its “endpos” set.

Product Extension

Genome assembly: distinct k-mers, longest common substrings across reads
Plagiarism detection
Compression (LZ-family algorithms use suffix structures)
Search engine n-gram indexing (less common; usually suffix array)

Language/Runtime Follow-ups

C++: use int next[][26] for the alphabet — fast.
Python: dict transitions; constant factor allows N ≤ 10^5 comfortably.
Java: TreeMap or HashMap; arrays preferred for fixed alphabet.

Common Bugs

Forgot to clone: when len[p]+1 != len[q], failing to clone breaks the automaton.
Wrong update of last: must always be cur, not clone.
Suffix link of cur set incorrectly: subtle; verify against reference.
Used same dict reference for clone: must dict(self.next[q]) (copy).
Off-by-one in distinct substring sum: start from state 1, not 0 (state 0 is the initial state and represents the empty substring).

Debugging Strategy

Print states with (len, link, transitions) after each extend call
Visualize the suffix-link tree (parent = link[state])
Verify against brute force for N ≤ 20
For the cloning step: log when it triggers and which state is split

Mastery Criteria

Implement SAM construction in ≤ 45 min from memory (it’s short but error-prone)
Explain the role of suffix links and the cloning step
Apply SAM to: distinct substring count, occurrence count, LCS of two strings
State complexity precisely (O(N) states, O(N|Σ|) transitions, O(N) construction with arrays)
Distinguish SAM from suffix arrays in problem-applicability

LeetCode Interview — Extreme Coding