p103 — Design Twitter

Source: LeetCode 355 · Medium · Topics: Design, Hash Table, Heap, Linked List Companies (2024–2025): Meta (high), Twitter/X (very high — namesake), Snap (medium), LinkedIn (high), Bloomberg (medium) Loop position: The K-way merge feed problem. Every social feed, news aggregator, and activity timeline reduces to “merge K sorted streams, take top N”.

1. Quick Context

Build a tiny social network with:

  • postTweet(userId, tweetId)
  • getNewsFeed(userId)10 most recent tweets from user + everyone they follow.
  • follow(followerId, followeeId)
  • unfollow(followerId, followeeId)

State:

  • tweets: dict[userId, list[(time, tweetId)]] — append-only, in post order.
  • follows: dict[userId, set[userId]].
  • time: int — monotonic counter.

getNewsFeed = k-way merge of at most K = (1 + #followed) sorted streams, take top 10. Heap-based: O(K log K + 10 log K) = O(K log K).

🔗 https://leetcode.com/problems/design-twitter/

STOP. 25-min timer.

3. Prerequisite Concepts

  • Heap merge of K sorted streams (LC 23 Merge K Sorted Lists, p9-heap-top-k pattern).
  • Reverse iteration of per-user tweet lists (newest first).
  • Following = adjacency set (self-implicitly followed).

4. How to Approach (FRAMEWORK 1–9)

1. Restate: Mini-Twitter; news feed = top-10 most recent across user + followees.

2. Clarify: Self-follow allowed? No (LC). But user always sees own tweets. unfollow someone you don’t follow? No-op. Re-follow? Idempotent.

3. Examples: user1 posts t5; user2 posts t6; user1 follows user2; user1.feed = [t6, t5].

5. Brute: On getNewsFeed, gather all relevant tweets, sort desc by time, take 10. O(N total tweets log N).

6. Target: O(K log K + 10 log K) where K = followee count + 1.

7. Pattern: Heap-merge K sorted streams.

8. Develop: see solution.py.

9. Test: user follows self (must be no-op or implicit); follow then unfollow then re-follow; getNewsFeed on user with 0 tweets and 0 followees; user with only 3 tweets across feed.

5. Progressive Hints

See hints.md.

6. Deeper Insight

6.1 Why heap-merge, not sort-everything

If a user follows 1M accounts and each has 10K tweets, sort-everything is 10B-element sort. Heap-merge of 1M streams pulling top 10 is (1M heap init) + (10 × log(1M)) = ~30M ops. The interviewer’s “scale” probe is satisfied only by this.

In practice, “heap init” is also dominated by only ever peeking the newest tweet per followee (lazy: push only the newest per stream initially, pop-then-push-next as you consume).

6.2 Fan-out-on-write vs fan-out-on-read

This solution is fan-out-on-read: feed is computed on getNewsFeed. Production tradeoff:

ModelPost costRead costWins when
Fan-out-on-readO(1)O(K log K)Few reads, many follows, viral users
Fan-out-on-writeO(F) F=#followersO(1)Many reads, average users
Hybrid (Twitter prod)O(F) for normals, O(1) for celebsO(K_celebs log K_celebs) + 1Mix

Twitter famously uses hybrid: for users with > ~1M followers (celebs), fan-out-on-write would mean writing 1M timelines per tweet — too expensive. Their tweets are pulled at read time. Normal users fan-out-on-write.

6.3 The “self-tweets included” trap

getNewsFeed(uid) returns tweets from uid AND followees. Half of LC submissions forget the user’s own tweets. The clean fix: when building the candidate set, always include uid itself.

6.4 Time ordering — monotonic counter, not wall clock

Wall-clock time.time() can collide for tweets posted in the same microsecond. The monotonic counter (a single self.time int incremented on each post) is the canonical answer. Production: hybrid (wall_clock_ms, sequence) tuple — Twitter’s Snowflake IDs embed this exactly: 41 bits ms-timestamp + 10 bits machine + 12 bits sequence.

6.5 Family: feed merging

  • LC 355 Design Twitter (this).
  • LC 23 Merge K Sorted Lists — pure heap merge primitive.
  • LC 218 Skyline — heap-merge variant.
  • LC 692 Top K Frequent Words — heap top-K.

7. Anti-Pattern Analysis

  1. Sort all relevant tweets — O(N total log N); fails the scale probe.
  2. Forget user’s own tweets — most common LC submission bug.
  3. Use wall-clock time.time() — collisions; nondeterministic across runs.
  4. follows as a list — O(F) for unfollow; should be set.
  5. unfollow of self — must be no-op (not an error).
  6. Push entire tweet lists onto heap upfront — wasted work if most are dropped; lazy “push next on pop” is the production pattern.

8. Skills & Takeaways

  • K-way heap merge of pre-sorted streams.
  • Monotonic counter for deterministic ordering.
  • Fan-out-on-read vs fan-out-on-write tradeoff vocabulary.

9. When to Move On

  • Implement cold in 25 min, heap-based.
  • Articulate the celebrity / fan-out-on-write tradeoff.
  • Extend to getRecentMentions(userId) follow-up.

10. Company Context

  • Meta / X / Snap / LinkedIn: Feed team interview staple.
  • Bloomberg: Real-time feed / news ticker.
  • Reddit: Front-page ranking variant.

Strong: “Per-user tweet lists in post order. getNewsFeed is a 10-element heap merge of K = (followees + 1) streams, lazy push-on-pop. Monotonic time counter.”

Weak: “Concatenate all tweets and sort.” → 1M-follow follow-up kills you.

11. Interviewer’s Lens

SignalJuniorMidSeniorStaff+
Heap mergeAfter hintColdColdCold
Includes self-tweetsForgetsFixesColdCold
Fan-out vocabularyMentionsArticulates tradeoff
Celebrity hybridCites Twitter prod

12. Level Delta

  • Mid (L4): Sort-everything works. Survives. Doesn’t pass scale probe.
  • Senior (L5): Heap merge cold + correct self-tweets + monotonic counter.
  • Staff (L6): + lazy “push-next-on-pop” + articulates fan-out-on-read vs write + describes celebrity-user hybrid.
  • Principal (L7+): + discusses timeline materialization (Twitter’s “home timeline” stored in Redis per user, capped at ~800 entries); + read-after-write consistency (your own posts must appear instantly even if Redis hasn’t caught up — query merges materialized + recent self-tweets); + multi-DC issues (your follower fan-out crosses regions); + Snowflake IDs as the canonical monotonic time + machine-shard scheme.

13. Follow-up Q&A

Q1: Scale to 1M followees. A1: Lazy heap-merge already handles it: K=1M but we only pop 10. Initial heap-init is O(K), still dominant. To improve further: persist the timeline (fan-out-on-write).

Q2: Hot user (celebrity) with 100M followers. A2: Don’t fan-out-on-write for celebs. At read, merge celebrity tweets in at request time (hybrid model).

Q3: Delete a tweet. A3: Add deleted: set[tweet_id]; filter in getNewsFeed. Or remove from per-user list (O(N_user)).

Q4: Trending topics. A4: Count-min sketch + heavy-hitters algorithm (Misra-Gries) over a sliding time window. Different problem; same “streaming top-K” family.

Q5: Read-after-write for own posts. A5: On getNewsFeed(uid), merge user’s own recent tweets at request time even if the materialized timeline is stale. Production: this is exactly what Twitter does for self-views.

14. Solution Walkthrough

See solution.py:

  • Twitter — heap-merge feed using heapq with lazy push-on-pop.
  • TwitterBrute — gather-all + sort oracle.
  • Stress test: random sequences of post/follow/unfollow/feed.

15. Beyond the Problem

Principal code review: “The textbook answer is fan-out-on-read with heap merge — that’s what your code does, and it’s the right starting point. But Twitter’s actual production architecture (well documented in their engineering blog) is a hybrid based on followee fan-out distribution: for the median user (< ~100 followers), they fan-out-on-write into per-follower Redis timelines (capped at ~800 entries each, ~600 GB RAM total for the active user set). For celebrities (> ~1M followers), they do fan-out-on-read at request time — writing 100M timelines per tweet from @ladygaga would crash the cluster. The hybrid ‘merge celebrity in at read’ adds latency to celebrity-follower reads but bounds write amplification. The Staff/Principal signal in this round is naming this pull vs push asymmetry and identifying the viral fan-in event (e.g., a tweet by a new account that suddenly goes viral) as the hardest case — you can’t pre-compute timelines for an unknown future celebrity. Bonus signal: knowing that the timeline is itself capped (you don’t get tweets from 2014 in your feed), which means the merge is bounded by min(800, K × per_user_recent), not unbounded history. Snowflake IDs (their open-sourced monotonic ID scheme: 41 bits ms-time + 10 bits machine + 12 bits sequence) replace your self.time counter and solve cross-shard ordering without coordination.”