Lab 05 — Suffix Automaton

Goal

Build a Suffix Automaton (SAM) for a string, and use it to count the number of distinct substrings in O(N).

Background

A suffix automaton is the smallest DFA that accepts every suffix of a given string. Discovered by Blumer et al. (1985).

Key facts:

  • O(N) states and O(N) transitions for a string of length N (over an alphabet)
  • Each state corresponds to an equivalence class of right-extensions of substrings
  • Distinct substring count = sum of len[state] - len[link[state]] over all states (excluding the initial state)

It’s the most powerful string data structure in competitive programming. Sometimes compared with suffix arrays + LCP arrays (which solve many of the same problems with different constants).

Interview Context

  • Heavy ICPC / Codeforces presence
  • Bioinformatics / genome alignment research
  • Rarely in industry; Bloomberg may ask, but usually accepts suffix array
  • Cryptography (some sequence-counting problems)

When to Skip This Topic

Skip if any of these are true:

  • You’re not targeting ICPC or string-research roles
  • You haven’t learned suffix arrays yet (lower-hanging fruit; more interview-relevant)
  • You don’t have 2+ weeks to internalize this

SAM is conceptually the deepest topic in this phase. The implementation is short; the understanding is hard. Don’t fake it.

Problem Statement

Count Distinct Substrings.

Given a string s, count the number of distinct non-empty substrings.

Example: s = "abc" → substrings {“a”, “b”, “c”, “ab”, “bc”, “abc”} → 6.

s = "aaa" → substrings {“a”, “aa”, “aaa”} → 3.

Constraints

  • 1 ≤ |s| ≤ 10^6
  • Lowercase English (or larger alphabet — affects transition storage)
  • Wall-clock: < 1 sec

Clarifying Questions

  1. Empty substring counts? (Usually no.)
  2. Substrings or distinct substrings? (Distinct; non-distinct is trivial: N(N+1)/2.)
  3. Alphabet size? (26 for English; affects map-vs-array tradeoff.)

Examples

"abc"   → 6   (a, b, c, ab, bc, abc)
"aaa"   → 3   (a, aa, aaa)
"abab"  → 7   (a, b, ab, ba, aba, bab, abab)
""      → 0

Brute Force

Generate all substrings, insert into a hash set, return size. O(N²) substrings, each of average length N/2 to hash → O(N³) worst case, O(N² log N) average. For N = 10^6: dead.

Brute Force Complexity

  • Time: O(N²) to O(N³)
  • Space: O(N²) for the set

Optimization Path

Build the suffix automaton:

  • Each state represents an equivalence class of substring occurrences
  • For each state s (except the initial), the number of distinct substrings ending at this state’s set of right-positions is len[s] - len[link[s]]
  • Sum these for all states → distinct substring count

Final Expected Approach

class SuffixAutomaton:
    def __init__(self):
        self.size = 1
        self.last = 0
        self.len = [0]
        self.link = [-1]
        self.next = [{}]
    
    def extend(self, c):
        cur = self.size
        self.size += 1
        self.len.append(self.len[self.last] + 1)
        self.link.append(-1)
        self.next.append({})
        p = self.last
        while p != -1 and c not in self.next[p]:
            self.next[p][c] = cur
            p = self.link[p]
        if p == -1:
            self.link[cur] = 0
        else:
            q = self.next[p][c]
            if self.len[p] + 1 == self.len[q]:
                self.link[cur] = q
            else:
                clone = self.size
                self.size += 1
                self.len.append(self.len[p] + 1)
                self.link.append(self.link[q])
                self.next.append(dict(self.next[q]))
                while p != -1 and self.next[p].get(c) == q:
                    self.next[p][c] = clone
                    p = self.link[p]
                self.link[q] = clone
                self.link[cur] = clone
        self.last = cur
    
    def count_distinct_substrings(self):
        return sum(self.len[i] - self.len[self.link[i]] for i in range(1, self.size))

# Usage
sam = SuffixAutomaton()
for c in "abab":
    sam.extend(c)
print(sam.count_distinct_substrings())   # 7

Data Structures

  • len[]: longest substring represented by each state
  • link[]: suffix link (analogous to failure link in Aho-Corasick)
  • next[]: transition map per state (dict or array of 26)

Correctness Argument

The SAM construction is non-trivial. The key invariants:

  • After processing prefix of length k, the SAM recognizes exactly the suffixes of that prefix
  • Each state’s len[s] - len[link[s]] counts the number of distinct substrings whose right-extension class is exactly this state
  • Summing across all non-initial states gives total distinct substrings

The cloning step (when len[p] + 1 != len[q]) splits a state to maintain the equivalence class property — without it, the automaton wouldn’t be canonical.

A rigorous proof is in Maxime Crochemore’s textbook. Accept the construction; verify on small cases.

Complexity

  • Construction: O(N · |Σ|) with dict transitions, or O(N) amortized with array transitions
  • Distinct substring count: O(N) after construction
  • Space: O(N · |Σ|) worst case

Implementation Requirements

  • Use dict per state for arbitrary alphabets, or [None] * 26 for English
  • Allocate state arrays incrementally (or pre-allocate 2N for safety)
  • Cloning step is the most error-prone — verify on small cases
  • Avoid recursion; SAM is naturally iterative

Tests

  • "a" → 1
  • "aa" → 2
  • "ab" → 3
  • "abc" → 6
  • "aaa" → 3
  • "abab" → 7
  • "abcabc" → 15
  • Stress: random strings of N=100 against brute force
  • Performance: N = 10^6 single character → finishes in < 1 sec

Follow-up Questions

  • Find longest common substring of two strings. → Build SAM of one; walk through the other tracking match length.
  • Number of occurrences of each substring. → Count terminal nodes (via topological sort of suffix link tree, then propagate).
  • K-th lexicographically smallest substring. → DFS on SAM with character ordering + count of substrings reachable.
  • Substring matching count for many queries. → Walk pattern in SAM; if it ends, answer is the size of its “endpos” set.

Product Extension

  • Genome assembly: distinct k-mers, longest common substrings across reads
  • Plagiarism detection
  • Compression (LZ-family algorithms use suffix structures)
  • Search engine n-gram indexing (less common; usually suffix array)

Language/Runtime Follow-ups

  • C++: use int next[][26] for the alphabet — fast.
  • Python: dict transitions; constant factor allows N ≤ 10^5 comfortably.
  • Java: TreeMap or HashMap; arrays preferred for fixed alphabet.

Common Bugs

  1. Forgot to clone: when len[p]+1 != len[q], failing to clone breaks the automaton.
  2. Wrong update of last: must always be cur, not clone.
  3. Suffix link of cur set incorrectly: subtle; verify against reference.
  4. Used same dict reference for clone: must dict(self.next[q]) (copy).
  5. Off-by-one in distinct substring sum: start from state 1, not 0 (state 0 is the initial state and represents the empty substring).

Debugging Strategy

  • Print states with (len, link, transitions) after each extend call
  • Visualize the suffix-link tree (parent = link[state])
  • Verify against brute force for N ≤ 20
  • For the cloning step: log when it triggers and which state is split

Mastery Criteria

  • Implement SAM construction in ≤ 45 min from memory (it’s short but error-prone)
  • Explain the role of suffix links and the cloning step
  • Apply SAM to: distinct substring count, occurrence count, LCS of two strings
  • State complexity precisely (O(N) states, O(N|Σ|) transitions, O(N) construction with arrays)
  • Distinguish SAM from suffix arrays in problem-applicability