Lab 03 — Hashmap Mastery: Group Anagrams
Goal
Master hash-map design with composite keys, adversarial input awareness, and the equality/hashcode contract. The deliverable groups N strings into anagram buckets in O(N · L) time and articulates exactly why your hash key works and what an adversary could do to break it.
Background Concepts
Hash function design; key equality contract; adversarial inputs and load factor; ordered-vs-unordered map choice; counter pattern. Review the Hash Maps and Hash Sets sections of the Phase 1 README, plus runtime concept 5 (hash collisions).
Interview Context
Group Anagrams is interview-evergreen: appears at Meta, Google, Amazon, Microsoft. The interview signal is whether you reach for the right key. Naive candidates compare every pair (O(N² · L)). Decent candidates sort each string into a key (O(N · L log L)). Strong candidates use a counter tuple key (O(N · L)). Elite candidates discuss adversarial hash flooding and language-specific custom-hash mechanics.
Problem Statement
Given an array strs of N lowercase-ASCII strings, group the strings that are anagrams of each other. Return the groups as a list of lists. Within and across groups, any order is acceptable.
Constraints
1 ≤ N ≤ 10^40 ≤ |s_i| ≤ 100s_iconsists of lowercase English letters.- Total characters:
Σ |s_i| ≤ 10^6.
Clarifying Questions
- Lowercase only? (Per constraints — confirm.)
- Do empty strings group together? (Yes — the empty string is an anagram of itself.)
- Is output order significant within or across groups? (Standard problem: no.)
- Are duplicates in the input allowed? (Yes —
["aa","aa"]is one group of size 2.) - Memory constraints? (Should fit comfortably; mention you’ll discuss tradeoffs.)
Examples
| Input | Output (any order) |
|---|---|
["eat","tea","tan","ate","nat","bat"] | [["eat","tea","ate"], ["tan","nat"], ["bat"]] |
[""] | [[""]] |
["a"] | [["a"]] |
["abc","cab","bca","xyz"] | [["abc","cab","bca"], ["xyz"]] |
Initial Brute Force
For each pair (i, j), check if strs[j] is an anagram of strs[i] (e.g., by sorting both). Use a seen[] array. O(N² · L log L).
def group_anagrams_brute(strs):
seen = [False] * len(strs)
groups = []
for i, s in enumerate(strs):
if seen[i]: continue
g = [s]
seen[i] = True
ks = sorted(s)
for j in range(i + 1, len(strs)):
if not seen[j] and sorted(strs[j]) == ks:
g.append(strs[j])
seen[j] = True
groups.append(g)
return groups
Brute Force Complexity
Time: O(N² · L log L) due to repeated sorting. Space: O(N) for the seen array plus O(L) per sort. Fails for N = 10^4 (10⁸ operations × log).
Optimization Path
The key insight: anagrams have the same multiset of characters. We need a hashable key derived from this multiset. Two canonical forms:
- Sorted string as key:
sorted(s) → "aet"for"eat","tea","ate". Cost per key: O(L log L). Total: O(N · L log L). - Count tuple as key: a length-26 tuple of counts. Cost per key: O(L). Total: O(N · L). Optimal for large L.
Pick the count tuple unless L is tiny.
Final Expected Approach
Bucket strings by count tuple in a hash map.
from collections import defaultdict
def group_anagrams(strs):
buckets = defaultdict(list)
for s in strs:
counts = [0] * 26
for c in s:
counts[ord(c) - ord('a')] += 1
buckets[tuple(counts)].append(s)
return list(buckets.values())
Data Structures Used
- A hash map keyed by a 26-tuple of int counts; values are lists of strings.
- A constant-size 26-int array for tallying (per word).
Correctness Argument
Two strings are anagrams iff they have identical character multisets iff their count vectors are equal. Equal count vectors hash to the same bucket and compare equal under tuple equality, so anagrams land in the same bucket. Different vectors compare unequal under tuple equality, so non-anagrams don’t share a bucket (modulo accidental hash collisions, which the equality check resolves correctly — that’s the equality/hashcode contract at work).
Complexity
Time: O(Σ |s_i|) = O(N · L) for tallying plus O(N · 26) for tuple hashing = O(N · L). Space: O(N · L) for the buckets and keys.
Implementation Requirements
- Use a hashable key — tuples in Python,
String(built from the count array) in Java, struct or stringified key in Go,std::array<int, 26>in C++. - Don’t use the sorted string for very large L (suboptimal but acceptable for interview presentation).
- Use
defaultdict(list)or equivalent (computeIfAbsentin Java) to avoid manual “if not in map” branching. - Return values as a list-of-lists, not the dict itself.
Tests
- Smoke: the canonical 6-string example above.
- Unit: singletons, all-anagrams (
["abc","bca","cab"]), no-anagrams (["a","b","c"]). - Edge: empty strings, single-char strings, duplicates (
["aa","aa"]). - Large: N = 10⁴, L = 100, mix of group sizes; assert sub-second.
- Random: generate random words; verify bucketing matches a reference (e.g., the sorted-string variant).
- Invalid: uppercase or non-ASCII (per constraints disallowed; if extended, normalize first).
Follow-up Questions
- “What if strings can be Unicode?” → switch to a
Counter/HashMap<Char, Int>as the key (more expensive hashing). Or, use the sorted string with a Unicode-aware sort. - “What if the input is streamed?” → emit groups lazily as you find duplicates, but you can’t finalize a group until input ends.
- “What if memory is tight (you can’t store N count arrays)?” → use the sorted-string key (only 1 allocation per word, free after bucketing) or a rolling hash with secondary check.
- “Adversarial input — can the interviewer construct N strings whose count tuples all hash to the same bucket?” → yes for predictable-hash languages; mitigation is randomized hash seeds or a TreeMap fallback.
- “Implement without a hash map.” → sort all strings by their sort-key, then group consecutive equal keys. O(N · L log L) due to sorting strings.
Product Extension
A duplicate-document detector. Each document is hashed by a content fingerprint (e.g., sorted shingles); documents with the same fingerprint are grouped. The same data-structure pattern (hash by canonical form) underlies large-scale dedup at file-storage and email systems. Discuss false positives (two different docs with the same fingerprint), the role of secondary equality check, and the tradeoff between fingerprint cost and accuracy.
Language/Runtime Follow-ups
- Python:
tuple(counts)is hashable;Counteris hashable only viafrozenset(c.items()).defaultdict(list)is the idiom. - Java: must build a
String(or use a hash of the int[] array combined withequalson the array — which means a custom class with properhashCode/equals). The classic interview shortcut is to convert the count array to a string like"1#0#0#…#1". Beware boxing inHashMap<int[], List<String>>—int[]does not overridehashCode/equals, defaults to object identity. This is a top-3 Java bug on this problem. - Go: map keys must be comparable.
[26]intis comparable;[]intis not. Use the array. - C++:
std::array<int, 26>is hashable withboost::hashor a customstd::hashspecialization. Or stringify. - JS/TS:
Mapkeys can be any value but use reference equality for arrays/objects. Use a string key likecounts.join(',')or aMap<string, string[]>. - Adversarial keys: Java’s
String.hashCodeis well-known and allows hash flooding. Java HashMap mitigates with tree-to-bucket conversion past 8 collisions.
Common Bugs
- Java
int[]as map key — uses object identity, not value equality. Every entry creates a new bucket. Fix: stringify or useArrays.hashCode+ custom wrapper. - Mutating the count array between map ops — if you reuse one buffer and mutate, your inserted keys all alias the same buffer. Allocate fresh per word.
- Off-by-one in
ord(c) - ord('a')— non-lowercase input goes negative or out of range. - Empty-string handling — count array is all zeros; should still bucket correctly. Verify.
- Returning
dict.values()directly in Python — works but the type isdict_values, notlist. Wrap withlist(...).
Debugging Strategy
- Print the keys of the resulting map. If you see N keys for N strings, your key derivation is wrong (likely identity-based).
- For Java: assert
myKey.hashCode() == myOtherKey.hashCode()for a hand-crafted anagram pair. - Time on N=10⁴: should run in well under a second.
Mastery Criteria
- Selected the count-tuple approach within 60 seconds, explaining why over the sorted-string approach.
- Stated the equality/hashcode contract and how it affects key choice in Java.
- Identified the
int[]reference-equality trap (or its language equivalent) before coding. - Articulated the adversarial-input concern and the language’s defense.
- Wrote a clean implementation in under 8 minutes.
- Tested with the empty-string and all-duplicates edge cases.