p10 — Contains Duplicate

Source: LeetCode 217 · Easy · Topics: Array, Hash Table, Sorting Companies (2024–2025 frequency): Amazon (high), Microsoft (high), Apple (medium), Bloomberg (medium), Meta (medium), Google (medium) Loop position: phone screen warmup; the lead-in to LC 219 (Contains Duplicate II — sliding window) and LC 220 (Contains Nearby Almost Duplicate — bucket sort)

1. Quick Context

The interviewer’s bar is not “can you solve it” — even a freshman can. The bar is: how many distinct approaches can you enumerate and rank by tradeoff? Strong candidates list 3–4 (sort, hash set, brute force, statistical/Bloom for huge N), state the time/space cost of each, then pick. Mid candidates pick HashSet first without naming alternatives. This is the cheapest problem in the curriculum at which to demonstrate the “I think about tradeoffs, not just solutions” senior signal.

It’s also the gateway to a family of problems: LC 219 (within-K distance — sliding window), LC 220 (within-K distance AND within-T value — bucket / treemap), LC 287 (find THE duplicate in O(1) space with constraints — Floyd’s), LC 442 (find ALL duplicates with in-place marking).

What it looks like it tests: detect a duplicate. What it actually tests: can you enumerate approaches and pick the right one based on memory / time / sortability constraints? Do you know when each wins?

2. LeetCode Link + Attempt Gate

🔗 https://leetcode.com/problems/contains-duplicate/

STOP. Set a 10-minute timer. Code it cold. Do not read past this section until solved or expired.

If repeat: write THREE distinct solutions in the same scratch file. Brute, sort, HashSet. Time yourself for ALL THREE in 8 min. The point isn’t to pick one — it’s to demonstrate fluency across approaches.

3. Prerequisite Concepts

HashSet / HashMap O(1) average insertion and lookup (phase-01 §3)
Sorting cost trade: O(N log N) time, O(1) extra space (if in-place sort allowed) vs HashSet’s O(N) time AND space
Adjacent-equal detection after sort: a O(N) linear scan post-sort
For massive N (billions, doesn’t fit memory): probabilistic structures (Bloom filter for “definitely-not-seen”, HyperLogLog for cardinality estimation)

4. How to Approach (FRAMEWORK Steps 1–9 applied)

Step 1 — Restate: “Given an integer array, return True if any value appears at least twice; else False.”

Step 2 — Clarify:

“What’s N’s upper bound?” (LC says 10^5; ask anyway, the production answer changes for 10^9.)
“Is the input mutable? Can I sort in place?” (Mutation may be off-limits in production; HashSet is the non-mutating choice.)
“Memory budget?” (If tight, sort + scan is O(1) extra; HashSet is O(N).)
“What are valid values? Negatives? Floats?” (Per LC, integers; in production ask.)
“Repeats considered duplicates if they are at the same index?” (No — same index is the same value, not a duplicate. Trivial, but ask.)

Step 3 — Constraints: N up to 10^5 per LC. Both O(N log N) and O(N) fit comfortably. Brute O(N²) = 10^10 will TLE. Solution memory: O(N) for HashSet, O(1) for sort-based (if in-place).

Step 4 — Examples:

[1, 2, 3, 1] → True
[1, 2, 3, 4] → False
[1, 1, 1, 1] → True (early exit on iteration 2)
[] → False (no values, no duplicate — confirm with interviewer)
[7] → False (single element)
[-1, -1, 0] → True (negatives)

Step 5 — Brute Force: Nested loop, compare every pair. O(N²) time, O(1) space. Acceptable as a stated baseline; never the final answer.

Step 6 — Brute Force Complexity: O(N²), O(1). TLE for LC’s constraint of 10^5.

Step 7 — Pattern Recognition: “Detect repetition in a sequence” → HashSet membership check (canonical) OR sort + adjacent compare (when memory matters). Family: LC 219 (with distance constraint → sliding window of size K), LC 220 (with distance AND value constraint → bucket-sort by value range), LC 287 (one specific duplicate in O(1) extra space → Floyd’s).

Step 8 — Optimize: Single-pass HashSet. Walk the array; for each x, if x in seen return True; else add x to seen. Return False at end. O(N) avg time, O(N) space. Early exit on first duplicate.

Step 9 — Prove correctness: Loop invariant: at top of iteration i, seen = {nums[0..i-1]} as a set. If nums[i] ∈ seen, then nums[i] equals some nums[j] with j < i — a true duplicate. If not, add and continue. After the loop, no duplicate was found, so all N values are distinct.

5. Progressive Hints

If stuck for 5 minutes, open hints.md. One hint, 5-min timer, retry.

6. Deeper Insight — Why It Works

Why HashSet is the textbook answer: “find a repeat” is structurally “have I seen this before?” — a membership question. The data structure built for that question in O(1) is the hash set. Memorize this connection: membership question → hash set.

Why sort-then-adjacent-compare is sometimes better: if N is huge but you can mutate, sort gives O(N log N) time and O(1) extra space (in-place sort). HashSet gives O(N) time but O(N) extra space. Memory often dominates the real cost — in Python, the overhead per dict entry is ~64 bytes, so a HashSet of 10^7 integers is ~640MB, which can push containers into OOM. Sort-based is the call for memory-constrained environments.

Why early exit matters: the HashSet check returns True the moment a duplicate is found. For arrays with frequent duplicates, expected runtime is O(distinct values) not O(N). The brute force can also early-exit, but its constant factor is worse and worst case is still O(N²).

Bloom filters for streaming / massive N: if N is 10^11 and you can’t fit a HashSet, a Bloom filter gives “definitely-not-seen” (no false negatives) with false-positive rate tunable by space. A pass with Bloom + a confirmation pass for flagged values gives expected O(N) at sub-N space. This is how anti-spam, anti-fraud, and CDN cache-membership work in production.

Why “Counter” overkill: Counter(nums).most_common(1)[0][1] > 1 works but allocates a full counter when you only need to know “any repeat?” — wastes O(distinct) memory growing to O(N) instead of early-exiting at first repeat. Idiomatic but suboptimal.

Why len(nums) != len(set(nums)) is cute but slow: materializes the full set even if a duplicate exists at index 1. Loses the early-exit advantage. For an array with a duplicate near the start, it’s O(N) where it could be O(2). The one-line elegance is a tradeoff against worst-case performance.

7. Anti-Pattern Analysis

Wrong-but-tempting #1 — Brute force:

for i in range(n):
    for j in range(i + 1, n):
        if nums[i] == nums[j]: return True
return False

O(N²). For N=10^5, this is 10^10 operations. TLE on LC, OOM-irrelevant but compute-fatal. State as baseline, then optimize. Never the final answer.

Wrong-but-tempting #2 — len(set(nums)) != len(nums):

return len(set(nums)) != len(nums)

Pythonic one-liner, O(N) time, O(N) space, but no early exit. Materializes the entire set even when the second element is a duplicate. For interview, mention as a fallback but explicitly note: “I’d write the explicit loop for early exit; this one-liner pays O(N) even on best-case inputs.”

Wrong-but-tempting #3 — Sort then check adjacent, mutating input without asking:

nums.sort()
return any(nums[i] == nums[i+1] for i in range(len(nums) - 1))

Works, O(N log N) time, O(1) extra space if sort is in-place. But mutates input. In production this can break callers. Always ask “may I mutate?” — if yes, this is the memory-efficient answer.

Wrong-but-tempting #4 — Counter / dict to count:

from collections import Counter
return any(c > 1 for c in Counter(nums).values())

O(N) time, O(distinct) space, no early exit (Counter is built fully before iterating values). HashSet with explicit loop is strictly better.

Wrong-but-tempting #5 — Treating it like LC 219 / 220 without being asked: Some candidates over-engineer and start using a sliding window or a sorted set. The vanilla problem doesn’t care about position or value distance. Solve what’s asked. Save the sliding-window for the follow-up.

8. Skills & Takeaways

Generalizable patterns:

Membership question → HashSet: any “have I seen this before?” question solves in O(1) avg per check.
Memory vs time tradeoff articulation: sort + scan (O(N log N), O(1)) vs HashSet (O(N), O(N)). Knowing both is Senior; knowing when to pick which is Staff.
Early-exit discipline: prefer explicit loops that can return on first success over one-liners that always materialize the full structure.

Analogous problems:

LC 219 — Contains Duplicate II (within K indices → sliding window of HashSet of size K)
LC 220 — Contains Nearby Almost Duplicate (within K indices AND |val_i − val_j| ≤ T → bucket sort by val ranges, OR a sorted multiset with binary search)
LC 287 — Find the Duplicate Number (Hard; O(1) space → Floyd’s on nums[nums[i]])
LC 442 — Find All Duplicates in an Array (in-place marking by negating nums[abs(x) - 1])
LC 268 — Missing Number (the inverse: XOR or sum trick)
LC 136 — Single Number (XOR every element: pairs cancel, the loner remains)
LC 645 — Set Mismatch (one missing + one duplicate; combine XOR and counting)

When NOT to use HashSet: when memory is constrained AND you may sort. When N is so large it doesn’t fit RAM (use Bloom filter + external sort). When values have special structure (small integer range → array-as-hashmap; e.g., LC 442’s in-place marking trick).

9. When to Move On (binary; must all be YES)

I wrote the HashSet solution with explicit early exit in <3 min
I named at least 3 distinct approaches (brute, sort+scan, HashSet) with time/space for each
I can articulate when sort+scan beats HashSet (memory-constrained, mutation OK)
I can explain why len(set(nums)) != len(nums) is suboptimal
My stress_test() includes empty, single, all-equal, all-distinct, and large random arrays
I solved LC 219 (Contains Duplicate II) and recognized the sliding-window-of-HashSet extension
I solved LC 136 (Single Number) and saw the XOR trick as an O(1)-space alternative for a constrained variant
I know what a Bloom filter is and when it would be the right answer here

If any unchecked: redo before moving to Week 2 review.

10. Company Context

Amazon (often part of a larger story problem)

The framing: “Detect any duplicate order ID in a day’s worth of transactions.”
Misleading example: Empty input. Many candidates’ code returns True or crashes; correct is False.
Adversarial extension: “Now N is 10 billion across 100 machines.” → distributed hash with sharding by ID mod 100; or Bloom filter per shard. Then: “Now we can’t use 640MB of RAM per machine — N is 10^9 per machine and we have 100MB.” → Bloom + confirmation pass.
What they want to hear: “Let me list the approaches first” with explicit memory/time table. Amazon values articulated tradeoffs.

Microsoft (phone screen warmup)

The framing: Straight problem.
Misleading example: [1, 1, 1, 1] — to see whether you exit early or keep iterating.
Adversarial extension: “Now within K indices.” (LC 219.) Then “now within K indices AND values within T of each other.” (LC 220.)
What they want to hear: HashSet with explicit early-exit loop, mention of sort-based alternative for memory.

Apple / Bloomberg (clean code)

The framing: Straight problem.
What they want to hear: Idiomatic code, named the two main approaches, articulated the memory tradeoff.

Meta (rapid follow-ups)

The framing: This → LC 219 → LC 220 → LC 287 in rapid succession.
What they want to hear: Naming each extension correctly and switching algorithms as the constraints evolve (HashSet → sliding window → bucket sort → Floyd’s).

Google (medium frequency, focus on tradeoffs)

The framing: “Solve, then optimize for memory.”
What they want to hear: All three approaches articulated with complexity, then a discussion of when each wins. Bonus: mention Bloom filter for massive N.

11. Interviewer’s Lens

Phase	Strong signal	Weak signal	Scorecard phrase (strong)
Reading problem	Asks “may I mutate? memory budget? N’s upper bound?”	Dives straight into HashSet	“Clarifies constraints before choosing data structure”
Pre-coding	Lists 3+ approaches with complexities	Picks HashSet without naming alternatives	“Demonstrates breadth of approaches — Senior signal”
Coding	Explicit loop with early exit, idiomatic but not gimmicky	One-liner `len(set(nums)) != len(nums)`	“Prefers explicit control flow for early-exit performance”
Follow-up (LC 219)	Pivots to sliding-window-of-HashSet	Re-uses plain HashSet, gets wrong answer	“Adapts data structure as constraints change”
Massive-N discussion	Mentions Bloom filter	Stops at HashSet	“Thinks about memory at scale — Principal signal”

The scorecard line that gets you the offer: “Articulated multiple approaches with tradeoffs; chose HashSet with early exit; adapted cleanly to follow-up variants.”

The scorecard line that loses you: “Used the one-line len(set(nums)) without acknowledging it forfeits early exit; couldn’t articulate when sort+scan would be preferred.”

12. Level Delta

Level	What their answer looks like
Mid	HashSet solution. Works. Doesn’t articulate alternatives unless asked. ~5 min.
Senior	HashSet with early exit + named alternatives (sort+scan, brute) with complexities + handled empty/single without prompting + ~3 min.
Staff	All of Senior + connected to LC 219 (sliding window) and LC 220 (bucket sort) and LC 287 (Floyd’s) as a family + mentioned Bloom filter for massive N + named when sort+scan beats HashSet. ~3 min.
Principal	All of Staff + discussed real production framing (deduplication in event pipelines, billing systems, ad impression tracking) + offered HyperLogLog as an O(1)-space cardinality-estimator alternative if you only need APPROXIMATE duplicate detection + named the sharding strategy for distributed implementation. ~3 min with full discussion.

13. Follow-up Questions & Full Answers

Follow-up 1: “Now duplicates only count if they’re within K indices.” (LC 219)

Signal sought: Adapt data structure to constraint.

Full answer: “Sliding window of HashSet, size at most K+1. Walk the array; for each nums[i], check if it’s in the window. If yes, return True. Add nums[i]; if window size exceeds K, evict nums[i-K]. O(N) time, O(K) space. The key insight: only elements within K positions can collide, so we maintain exactly those in the set.”

Follow-up 2: “Within K indices AND |values differ| ≤ T.” (LC 220)

Signal sought: Bucket sort or sorted multiset.

Full answer: “Two approaches. (a) Bucket sort: assign each value to bucket val // (T + 1). Two values within T of each other land in the same bucket OR adjacent buckets. For each nums[i], check the current bucket and ±1 neighbors. Maintain at most K+1 active buckets (a dict of bucket → value), evicting when out of window. O(N) time, O(min(N, K)) space. (b) Sorted multiset with binary search: maintain a sorted multiset of the last K+1 elements; binary search for any element in [nums[i]−T, nums[i]+T]. O(N log K). Bucket is asymptotically better; sorted set is simpler to code.”

Follow-up 3: “Massive N — 10^11 values streaming, can’t fit in memory.”

Signal sought: Probabilistic structures.

Full answer: “Bloom filter. For each value: query the filter; if ‘might-have-seen’, confirm with a secondary check (e.g., on-disk index or HyperLogLog). If ‘definitely-not-seen’ (Bloom guarantees no false negatives), insert. Bloom uses ~1-2 bytes per element with 1% false positive rate, so 10^11 entries needs ~100-200GB across a cluster — distributable. Sharding: hash the value, route to the shard that owns its hash range. Each shard runs local Bloom. Caveat: Bloom can’t delete — for streams with eviction, use a Counting Bloom Filter (more space per cell) or a Cuckoo filter (supports delete cleanly).”

Follow-up 4: “Find THE duplicate in `[1..N]` with N+1 ints, O(1) extra space, no mutation.” (LC 287)

Signal sought: Algorithmic creativity.

Full answer: “Treat array as iterated function f(i) = nums[i]. Since N+1 values map to N possible outputs, some output is duplicated — corresponds to a cycle in the iteration graph. Apply Floyd’s cycle detection: slow advances 1, fast advances 2, find meet. Then reset one pointer to start (index 0), advance both by 1, they meet at the cycle start, which IS the duplicate. O(N) time, O(1) extra space. This is the same algorithm as Linked List Cycle (your p08), just on an array reinterpreted as a graph.”

Follow-up 5 (Senior signal): “How would you test this for correctness AND performance?”

Signal sought: Testing discipline + perf awareness.

Full answer: “Three layers. (1) Functional: empty, single, all-equal, all-distinct, duplicate at start (early-exit check), duplicate at end. (2) Property: for any randomly-generated array, HashSet result equals len(set(nums)) != len(nums) — strong cross-check. (3) Perf: benchmark early-exit on [1, 1, 0, 0, ...] (duplicate at index 1) — HashSet should take ~2 hash ops; the one-liner takes N hash ops. Confirm via timing. My solution.py includes the random property test.”

14. Full Solution Walkthrough

See solution.py.

The file has six sections:

contains_duplicate_brute(nums) — O(N²) nested loops; correctness oracle for small N.
contains_duplicate(nums) — HashSet with explicit early-exit loop. The production answer.
contains_duplicate_sort(nums) — sort + adjacent compare; O(N log N) time, O(1) extra (mutates input).
contains_duplicate_oneliner(nums) — len(nums) != len(set(nums)); for comparison only.
stress_test() — 1000 random arrays + skewed (mostly-duplicates, mostly-distinct) cases; all four implementations must agree.
edge_cases() — empty, single, two-distinct, two-equal, all-equal, large random.

Decisions justified:

Why set not dict: we only need existence, not value-to-index.
Why explicit for loop not the one-liner: early-exit when duplicate found near start.
Why sort-based variant copies before sorting in stress test: don’t mutate caller’s data unexpectedly.

15. Beyond the Problem — Production Reality

Real applications:

Deduplication in event pipelines: Kafka consumers, Kinesis stream processors — dedupe events by ID. Bloom filter per consumer + periodic full-set rotation.
Billing: detect duplicate charge attempts (idempotency key check). HashSet in Redis, sized by time window.
Ad impression tracking: prevent counting the same impression twice. Bloom or HyperLogLog if approximate is OK; exact HashSet (sharded) if money depends on it.
Anti-fraud: detect repeat fraud attempts from the same device fingerprint. Distributed HashSet, time-windowed.
Cache deduplication: CDNs avoid re-fetching by URL HashSet keyed by content hash.

Real systems that are exactly this kernel:

Postgres UNIQUE constraint: under the hood, a B-tree-backed sorted set check. Sort-based duplicate detection.
Kafka exactly-once semantics: producer ID + sequence number HashSet at the broker, per partition.
Redis SADD: returns 1 if newly added, 0 if duplicate — same operation as our HashSet check.

Principal-engineer code review comment: “For our event dedup at 10M events/second, plain HashSet won’t cut it — we use Cuckoo filter for the hot path (supports delete unlike Bloom) and a sharded Redis SET for the source of truth. Also: please add a comment that this function is O(N) avg but O(N²) worst-case under hash collision attacks; for adversarial inputs (user-supplied integers in a public API), we use a keyed hash (SipHash) to prevent algorithmic DoS. We had this incident in 2019.”

LeetCode Interview — Extreme Coding