Lab 03 — Hashmap Mastery: Group Anagrams

Goal

Master hash-map design with composite keys, adversarial input awareness, and the equality/hashcode contract. The deliverable groups N strings into anagram buckets in O(N · L) time and articulates exactly why your hash key works and what an adversary could do to break it.

Background Concepts

Hash function design; key equality contract; adversarial inputs and load factor; ordered-vs-unordered map choice; counter pattern. Review the Hash Maps and Hash Sets sections of the Phase 1 README, plus runtime concept 5 (hash collisions).

Interview Context

Group Anagrams is interview-evergreen: appears at Meta, Google, Amazon, Microsoft. The interview signal is whether you reach for the right key. Naive candidates compare every pair (O(N² · L)). Decent candidates sort each string into a key (O(N · L log L)). Strong candidates use a counter tuple key (O(N · L)). Elite candidates discuss adversarial hash flooding and language-specific custom-hash mechanics.

Problem Statement

Given an array strs of N lowercase-ASCII strings, group the strings that are anagrams of each other. Return the groups as a list of lists. Within and across groups, any order is acceptable.

Constraints

  • 1 ≤ N ≤ 10^4
  • 0 ≤ |s_i| ≤ 100
  • s_i consists of lowercase English letters.
  • Total characters: Σ |s_i| ≤ 10^6.

Clarifying Questions

  1. Lowercase only? (Per constraints — confirm.)
  2. Do empty strings group together? (Yes — the empty string is an anagram of itself.)
  3. Is output order significant within or across groups? (Standard problem: no.)
  4. Are duplicates in the input allowed? (Yes — ["aa","aa"] is one group of size 2.)
  5. Memory constraints? (Should fit comfortably; mention you’ll discuss tradeoffs.)

Examples

InputOutput (any order)
["eat","tea","tan","ate","nat","bat"][["eat","tea","ate"], ["tan","nat"], ["bat"]]
[""][[""]]
["a"][["a"]]
["abc","cab","bca","xyz"][["abc","cab","bca"], ["xyz"]]

Initial Brute Force

For each pair (i, j), check if strs[j] is an anagram of strs[i] (e.g., by sorting both). Use a seen[] array. O(N² · L log L).

def group_anagrams_brute(strs):
    seen = [False] * len(strs)
    groups = []
    for i, s in enumerate(strs):
        if seen[i]: continue
        g = [s]
        seen[i] = True
        ks = sorted(s)
        for j in range(i + 1, len(strs)):
            if not seen[j] and sorted(strs[j]) == ks:
                g.append(strs[j])
                seen[j] = True
        groups.append(g)
    return groups

Brute Force Complexity

Time: O(N² · L log L) due to repeated sorting. Space: O(N) for the seen array plus O(L) per sort. Fails for N = 10^4 (10⁸ operations × log).

Optimization Path

The key insight: anagrams have the same multiset of characters. We need a hashable key derived from this multiset. Two canonical forms:

  1. Sorted string as key: sorted(s) → "aet" for "eat", "tea", "ate". Cost per key: O(L log L). Total: O(N · L log L).
  2. Count tuple as key: a length-26 tuple of counts. Cost per key: O(L). Total: O(N · L). Optimal for large L.

Pick the count tuple unless L is tiny.

Final Expected Approach

Bucket strings by count tuple in a hash map.

from collections import defaultdict

def group_anagrams(strs):
    buckets = defaultdict(list)
    for s in strs:
        counts = [0] * 26
        for c in s:
            counts[ord(c) - ord('a')] += 1
        buckets[tuple(counts)].append(s)
    return list(buckets.values())

Data Structures Used

  • A hash map keyed by a 26-tuple of int counts; values are lists of strings.
  • A constant-size 26-int array for tallying (per word).

Correctness Argument

Two strings are anagrams iff they have identical character multisets iff their count vectors are equal. Equal count vectors hash to the same bucket and compare equal under tuple equality, so anagrams land in the same bucket. Different vectors compare unequal under tuple equality, so non-anagrams don’t share a bucket (modulo accidental hash collisions, which the equality check resolves correctly — that’s the equality/hashcode contract at work).

Complexity

Time: O(Σ |s_i|) = O(N · L) for tallying plus O(N · 26) for tuple hashing = O(N · L). Space: O(N · L) for the buckets and keys.

Implementation Requirements

  • Use a hashable key — tuples in Python, String (built from the count array) in Java, struct or stringified key in Go, std::array<int, 26> in C++.
  • Don’t use the sorted string for very large L (suboptimal but acceptable for interview presentation).
  • Use defaultdict(list) or equivalent (computeIfAbsent in Java) to avoid manual “if not in map” branching.
  • Return values as a list-of-lists, not the dict itself.

Tests

  • Smoke: the canonical 6-string example above.
  • Unit: singletons, all-anagrams (["abc","bca","cab"]), no-anagrams (["a","b","c"]).
  • Edge: empty strings, single-char strings, duplicates (["aa","aa"]).
  • Large: N = 10⁴, L = 100, mix of group sizes; assert sub-second.
  • Random: generate random words; verify bucketing matches a reference (e.g., the sorted-string variant).
  • Invalid: uppercase or non-ASCII (per constraints disallowed; if extended, normalize first).

Follow-up Questions

  • “What if strings can be Unicode?” → switch to a Counter / HashMap<Char, Int> as the key (more expensive hashing). Or, use the sorted string with a Unicode-aware sort.
  • “What if the input is streamed?” → emit groups lazily as you find duplicates, but you can’t finalize a group until input ends.
  • “What if memory is tight (you can’t store N count arrays)?” → use the sorted-string key (only 1 allocation per word, free after bucketing) or a rolling hash with secondary check.
  • “Adversarial input — can the interviewer construct N strings whose count tuples all hash to the same bucket?” → yes for predictable-hash languages; mitigation is randomized hash seeds or a TreeMap fallback.
  • “Implement without a hash map.” → sort all strings by their sort-key, then group consecutive equal keys. O(N · L log L) due to sorting strings.

Product Extension

A duplicate-document detector. Each document is hashed by a content fingerprint (e.g., sorted shingles); documents with the same fingerprint are grouped. The same data-structure pattern (hash by canonical form) underlies large-scale dedup at file-storage and email systems. Discuss false positives (two different docs with the same fingerprint), the role of secondary equality check, and the tradeoff between fingerprint cost and accuracy.

Language/Runtime Follow-ups

  • Python: tuple(counts) is hashable; Counter is hashable only via frozenset(c.items()). defaultdict(list) is the idiom.
  • Java: must build a String (or use a hash of the int[] array combined with equals on the array — which means a custom class with proper hashCode/equals). The classic interview shortcut is to convert the count array to a string like "1#0#0#…#1". Beware boxing in HashMap<int[], List<String>>int[] does not override hashCode/equals, defaults to object identity. This is a top-3 Java bug on this problem.
  • Go: map keys must be comparable. [26]int is comparable; []int is not. Use the array.
  • C++: std::array<int, 26> is hashable with boost::hash or a custom std::hash specialization. Or stringify.
  • JS/TS: Map keys can be any value but use reference equality for arrays/objects. Use a string key like counts.join(',') or a Map<string, string[]>.
  • Adversarial keys: Java’s String.hashCode is well-known and allows hash flooding. Java HashMap mitigates with tree-to-bucket conversion past 8 collisions.

Common Bugs

  1. Java int[] as map key — uses object identity, not value equality. Every entry creates a new bucket. Fix: stringify or use Arrays.hashCode + custom wrapper.
  2. Mutating the count array between map ops — if you reuse one buffer and mutate, your inserted keys all alias the same buffer. Allocate fresh per word.
  3. Off-by-one in ord(c) - ord('a') — non-lowercase input goes negative or out of range.
  4. Empty-string handling — count array is all zeros; should still bucket correctly. Verify.
  5. Returning dict.values() directly in Python — works but the type is dict_values, not list. Wrap with list(...).

Debugging Strategy

  • Print the keys of the resulting map. If you see N keys for N strings, your key derivation is wrong (likely identity-based).
  • For Java: assert myKey.hashCode() == myOtherKey.hashCode() for a hand-crafted anagram pair.
  • Time on N=10⁴: should run in well under a second.

Mastery Criteria

  • Selected the count-tuple approach within 60 seconds, explaining why over the sorted-string approach.
  • Stated the equality/hashcode contract and how it affects key choice in Java.
  • Identified the int[] reference-equality trap (or its language equivalent) before coding.
  • Articulated the adversarial-input concern and the language’s defense.
  • Wrote a clean implementation in under 8 minutes.
  • Tested with the empty-string and all-duplicates edge cases.