p38 — Subsets

Source: LeetCode 78 · Medium · Topics: Array, Backtracking, Bit Manipulation Companies (2024–2025 frequency): Meta (high), Amazon (high), Google (high), Microsoft (medium), Apple (medium) Loop position: algorithm round; often paired with Combinations (p37) and Permutations (p36) to test family recognition.

1. Quick Context

Given an array of distinct integers, return the powerset — all 2^n subsets. The follow-up (LC 90 — Subsets II, with duplicates) is the staff signal.

Three equally valid approaches — interviewer-level signal comes from knowing all three and the trade-offs:

  1. Backtracking with start index (variant of p37’s template — yield at EVERY node, not just leaves).
  2. Include/exclude DFS (binary choice tree, depth = n).
  3. Iterative powerset growth — for each new element, double the result by including it in every existing subset.
  4. Bitmask enumeration (n ≤ 32) — iterate masks 0..2^n - 1, decode bits.

Time: O(n · 2^n) — 2^n subsets, O(n) to copy each. Space: O(n) recursion (or O(1) for the iterative/bitmask variants, ignoring output).

The discriminator: writing the include/exclude template cleanly, AND knowing why bitmask is fastest for n ≤ 20.


🔗 https://leetcode.com/problems/subsets/

STOP. 15-min timer.


3. Prerequisite Concepts

  • p37 (Combinations) — subsets are “all sizes” instead of fixed k.
  • Pascal’s identity: 2^n = sum over k of C(n, k).
  • Bit manipulation basics: mask >> i & 1.

4. How to Approach (FRAMEWORK Steps 1–9)

Step 1 — Restate: “All 2^n subsets of the input array. Any order.”

Step 2 — Clarify:

  • “Distinct elements?” (Yes — that’s the LC 78 spec. LC 90 is the duplicate variant.)
  • “Include the empty subset?” (Yes.)
  • “Order of subsets in output?” (Free.)
  • “Order within each subset?” (Free, but typically input order is convention.)

Step 3 — Constraints: n ≤ 10 → 1024 subsets. n ≤ 20 → ~1M. Tight enough that the algorithm matters but not lifesaving.

Step 4 — Examples:

  • [1,2,3][[],[1],[2],[3],[1,2],[1,3],[2,3],[1,2,3]] (8 subsets).
  • [][[]].

Step 5 — Brute Force: Bitmask enumeration is already O(n · 2^n). The “backtracking” framing is for pedagogy.

Step 6 — Complexity: O(n · 2^n). Space O(n) recursion or O(1) bitmask.

Step 7 — Pattern: “Powerset / all-subset” → 2^n enumeration. Three equivalent algorithms.

Step 8 — Develop: see solution.py — start-index, include/exclude, iterative, bitmask, and LC 90 (duplicates).

Step 9 — Test: edge cases (empty, single, full overlap), stress vs itertools.chain over combinations.


5. Progressive Hints

If stuck for 5 min, open hints.md.


6. Deeper Insight — Four algorithms, one shape

6.1 Start-index backtracking — “snapshot at every node”

def backtrack(start, path):
    result.append(path[:])           # SNAPSHOT here, not at a leaf
    for i in range(start, n):
        path.append(nums[i])
        backtrack(i + 1, path)
        path.pop()

The crucial difference from p37: result.append is at the TOP of the function, not gated by len(path) == k. Every node of the recursion tree is itself a valid subset.

The recursion tree has 2^n nodes (each node = one subset).

6.2 Include/exclude DFS

def backtrack(i, path):
    if i == n:
        result.append(path[:]); return
    # exclude nums[i]
    backtrack(i + 1, path)
    # include nums[i]
    path.append(nums[i])
    backtrack(i + 1, path)
    path.pop()

Depth = n, branching = 2. Generates 2^n leaves; each leaf is a subset. Mirrors the bit-by-bit decision: “is element i in the subset?”

This is the cleanest framing for connecting to DP (e.g., knapsack: include/exclude each item) and to LC 698 (Partition to K Equal Sum Subsets).

6.3 Iterative powerset growth

result = [[]]
for x in nums:
    result += [sub + [x] for sub in result]

Starts with [[]]. For each new element, doubles the result by including the element in every existing subset. O(n · 2^n) time, no recursion. The most elegant code.

6.4 Bitmask enumeration

for mask in range(1 << n):
    subset = [nums[i] for i in range(n) if mask >> i & 1]
    result.append(subset)

O(n · 2^n). Trivially parallelizable (split masks across workers). Fastest in practice for n ≤ 20 (no recursion overhead, cache-friendly).

6.5 LC 90 — Subsets II (duplicates in input)

Same as LC 78 backtracking with the same dedup line as LC 40:

for i in range(start, n):
    if i > start and nums[i] == nums[i-1]: continue
    ...

Note: i > start, NOT i > 0. At the same recursion level (same start), skip equal siblings; the first time at a level, take it. This is the canonical-form pruning principle from p36/p37/p40 — internalize it.

6.6 Why so many algorithms?

Each one teaches a different concept:

  • Start-index: the “yield at every node” pattern (also used in tree traversals).
  • Include/exclude: bridges backtracking → DP. Knapsack is exactly this.
  • Iterative: the “build up” pattern. Cleanest code; hardest to extend.
  • Bitmask: bridges to bitmask DP (LC 526, LC 691).

A staff candidate uses include/exclude when extending to DP, bitmask when n is small and speed matters, iterative when readability matters, and start-index when extending to size-restricted variants (LC 77, LC 39).


7. Anti-Pattern Analysis

  1. Snapshot inside the for-loop instead of at function entry — misses the empty subset and produces duplicates.
  2. Using set of frozenset to dedup for LC 90. Exponentially wasteful; the sort+skip line dedupes at the source.
  3. for i in range(n) instead of for i in range(start, n) — generates duplicates {1,2} and {2,1}.
  4. result.append(path) instead of path[:] — same bug as p36.
  5. Bitmask without int.bit_length() or pre-sorting when needed — order issues.
  6. For n=30+ using bitmask — 2^30 = 1B subsets. Output is huge regardless of algorithm; the issue is whether you NEED all of them. For “count” or “sum over subsets,” switch to DP.
  7. For LC 90, dedup line i > 0 instead of i > start — over-prunes; loses valid subsets.

8. Skills & Takeaways

  • Four algorithms, all O(n · 2^n) but with different code/extension properties.
  • “Snapshot at every node” vs “snapshot at leaves” — the structural choice.
  • LC 90 dedup: i > start and nums[i] == nums[i-1].
  • Bitmask = the bridge to bitmask DP (Phase 03).
  • Include/exclude = the bridge to knapsack DP (Phase 05).

9. When to Move On

Move on when:

  • You can write all four variants in under 10 min total.
  • You can write LC 90 (duplicates) in under 5 min.
  • You can articulate which algorithm you’d choose in each context (small n, large n, DP extension, parallel).
  • Stress test of 200 random inputs (n=0..8) matches a reference oracle.

10. Company Context

Meta:

  • L4/L5 standard. Often paired with LC 90 immediately, then LC 39, then “now count subsets with sum = S” (DP).
  • Scorecard phrase: “Algorithm — articulated 3 algorithms, picked the right one for the follow-up.”

Amazon:

  • L5 staple. The “iterative powerset growth” version is a frequent expected answer.

Google:

  • L5+ uses this as a launching pad to bitmask DP (LC 526, LC 691, LC 1349) — be ready to discuss.

Microsoft:

  • L63. Often combined with “how many subsets sum to S” → bridge to DP.

Apple:

  • Common in algo loops; data and ML teams use the “all feature combinations” framing.

11. Interviewer’s Lens

SignalJuniorMidSeniorStaff/Principal
ApproachOne algorithmTwo (backtracking + iterative)All four, picks the right one+ extends to bitmask DP and knapsack
LC 90 dedupSet-basedi > 0 rule (wrong)i > start rule (correct)+ proves canonical-form correctness
ConnectionsNone“Like combinations”+ Pascal’s identity+ bridge to subset-sum DP, partition DP, bitmask DP

12. Level Delta

Mid (L4):

  • One correct algorithm (backtracking or iterative).
  • Handles empty input.

Senior (L5):

  • Two algorithms minimum (start-index + iterative OR include/exclude + bitmask).
  • Solves LC 90 with correct dedup.
  • Discusses O(n · 2^n) and why output is the limit.

Staff (L6):

  • All four algorithms.
  • Connects include/exclude to knapsack DP and bitmask to bitmask DP.
  • Discusses streaming / generator variant.
  • Handles “subset with property X” (count, sum, etc.) — knows when to switch to DP.

Principal (L7+):

  • Discusses production: feature-flag combinations (test all on/off combos), formal verification (state-space enumeration for tiny systems), policy enumeration in RL (small action spaces), regulatory compliance (audit all combinations of attributes).
  • Discusses subset-sum is NP-hard; powerset enumeration is exponential — when n ≥ 30, switch to DP, branch-and-bound, or approximation.
  • Discusses sketching: for “approximately count subsets with property P,” use hashing/sampling.

13. Follow-up Q&A

Q1: Subsets II (LC 90) — duplicates in input. A: Sort input. In the loop: if i > start and nums[i] == nums[i-1]: continue. Canonical-form pruning — at each recursion level, take the first equal sibling only.

Q2: Count subsets with sum = target. A: Don’t enumerate. DP: dp[i][s] = dp[i-1][s] + dp[i-1][s - nums[i]]. O(n · S) time, O(S) space with rolling array. Subset-sum DP; LC 416 family.

Q3: K-th subset in lex order. A: Combinatorial number system, or interpret k’s binary representation as a subset-mask (with care about ordering).

Q4: Subsets summing to target with each element usable once. A: LC 39 / LC 40 / LC 416 family — DP or backtracking + pruning.

Q5: Subsets when n = 50. A: 2^50 ≈ 10^15. Can’t enumerate. Use meet-in-the-middle: split into halves of 25, enumerate each (2^25 ≈ 33M each), join. Or DP if there’s structure. Or sampling if approximate is OK. See Phase 03 Lab 09 — Meet-in-the-Middle.


14. Solution Walkthrough

See solution.py:

  • subsets_backtrack(nums) — start-index, snapshot at every node.
  • subsets_include_exclude(nums) — binary choice tree.
  • subsets_iterative(nums) — powerset growth, no recursion.
  • subsets_bitmask(nums) — enumerate 0..2^n - 1.
  • subsets_with_dup(nums) — LC 90, sort + skip-dedup.
  • subsets_brute(nums) — oracle via itertools chain over combinations.
  • edge_cases() + stress_test() — verify all five vs oracle.

Run: python3 solution.py.


15. Beyond the Problem

Principal-engineer code review comment:

“Three things. (1) Pick ONE algorithm as the ‘canonical’ for the codebase — most teams should standardize on iterative (result = [[]]; for x in nums: result += [s+[x] for s in result]) because it’s 5 lines and obviously correct. (2) For LC 90, the dedup line is THE thing that goes wrong; add a unit test specifically for [1,2,2,3] that asserts no duplicate subset appears. (3) If subsets are an INPUT to downstream code (not the final answer), don’t materialize them — write a generator (yield from ...) and let the caller consume on the fly. For n=20 that’s a 100× memory win.”

Connections forward:

  • p37 — Combinations: size-restricted subsets.
  • p36 — Permutations: ordered variant.
  • LC 90 — Subsets II: dedup variant.
  • LC 416 — Partition Equal Subset Sum: counting via subset-sum DP.
  • LC 698 — Partition to K Equal Sum Subsets: backtracking + pruning.
  • Phase 03 Lab 08 — Bitmask DP.
  • Phase 03 Lab 09 — Meet-in-the-Middle (for n=40+).
  • Phase 05 — DP labs: knapsack, subset-sum, partition.
  • Production: feature flag combinations, regulatory audit, formal verification of small systems.