p99 — Design Log Storage System

Source: LeetCode 635 · Medium · Topics: Design, String, Range Query Companies (2024–2025): Amazon (medium), Bloomberg (medium), Snowflake (low) Loop position: Range-query design problem. Tests whether you recognize that a fixed-format timestamp can be range-queried via string prefix comparison.

1. Quick Context

Two operations:

put(id: int, timestamp: str) — store a log with timestamp formatted "Year:Month:Day:Hour:Minute:Second" (zero-padded: "2017:01:01:23:59:59").
retrieve(start: str, end: str, granularity: str) -> List[int] — return IDs whose timestamp falls in [start, end] at the given granularity (one of "Year", "Month", "Day", "Hour", "Minute", "Second").

Key insight: because the format is fixed-width and zero-padded, lexicographic string comparison agrees with chronological comparison. So filtering reduces to comparing fixed-length prefixes.

2. LeetCode Link + Attempt Gate

🔗 https://leetcode.com/problems/design-log-storage-system/

STOP. 20-min timer.

3. Prerequisite Concepts

Fixed-width zero-padded format → lex order = numeric order.
Prefix truncation by granularity.

4. How to Approach (FRAMEWORK 1–9)

1. Restate: Store (id, timestamp) pairs; retrieve IDs whose timestamps fall in [start, end] at variable granularity.

2. Clarify: Duplicate IDs? Per LC, ids are unique per put. Result order? Any order. Are start/end always inclusive? Yes, inclusive on both ends.

3. Examples:

retrieve("2016:01:01:01:01:01", "2017:01:01:23:00:00", "Year") matches anything with year 2016 or 2017.
Granularity “Hour” → prefix length 13.

5. Brute: Linear scan with substring comparison. This IS the canonical solution; the optimization is the prefix length lookup table.

6. Target: put O(1), retrieve O(n) (acceptable for LC).

7. Pattern: Prefix-length table maps granularity → number of chars to compare.

8. Develop: see solution.py.

9. Test: all 6 granularities; boundary timestamps (exact match start, exact match end); empty store.

6. Deeper Insight

6.1 Why prefix comparison works

The format "YYYY:MM:DD:HH:MM:SS" is:

Fixed-width — every component is the same number of chars.
Zero-padded — "01" < "10" lexicographically AND numerically.
Most-significant-first — Year before Month before Day, etc.

Result: lexicographic comparison agrees with chronological. Truncating to the granularity’s prefix gives correct range queries.

6.2 Granularity → prefix length

Granularity	Example prefix	Length
Year	`"2017"`	4
Month	`"2017:01"`	7
Day	`"2017:01:01"`	10
Hour	`"2017:01:01:23"`	13
Minute	`"2017:01:01:23:59"`	16
Second	`"2017:01:01:23:59:59"`	19

6.3 Optimization — sorted + bisect

If put is mostly upfront and retrieve is frequent, maintain a sorted list of (timestamp_prefix, id) keyed by truncated timestamp; use bisect_left / bisect_right to find the range in O(log n).

For LC’s constraints (n ≤ 500), the linear scan is fine.

6.4 Production reality — time-series databases

This is a toy version of a time-series database. Real systems (InfluxDB, Prometheus, TimescaleDB) use:

Inverted index on timestamp for range queries.
Hierarchical chunking by time bucket (e.g., one chunk per hour).
Column-oriented storage for compression.
Write-ahead log + LSM for ingestion.

The Staff-level recognition is: “we’re approximating Prometheus’ tsdb in 30 lines.”

6.5 Family: range query / time series

LC 635 Log Storage System (this).
LC 729 My Calendar I.
LC 731 My Calendar II.
LC 732 My Calendar III.
LC 1146 Snapshot Array.

7. Anti-Pattern Analysis

Parse timestamp into int components — works but unnecessarily complex; string prefix comparison is simpler and equally correct.
Compare full timestamp at “Year” granularity — fails: start="2017:..." would not match "2017:05:..." because the full strings differ.
< instead of <= — LC range is inclusive.
Forget granularity case sensitivity — LC sends "Year" exactly (capital Y); use a dict not lower().

8. Skills & Takeaways

Fixed-width zero-padded format → lex order matches numeric order.
Prefix truncation is range-query primitive when the format is hierarchical.

9. When to Move On

Solve cold in 12 min.
Recognize “lex order = numeric order” as the unlocking insight.
Articulate sorted + bisect optimization.

10. Company Context

Amazon: L5; “design simple” warmup.
Bloomberg: systems round.

Strong: “Fixed-width zero-padded format means lex comparison agrees with chronological. Store as list; on retrieve, truncate start/end to the granularity’s prefix length and string-compare. O(n) per retrieve; sorted + bisect → O(log n + k) if needed.”

11. Interviewer’s Lens

Signal	Junior	Mid	Senior	Staff+
Insight	None	Hint	Cold	Cold
Prefix length	Hardcodes	Table	Clean dict	Clean dict
Bisect opt	None	None	Mentions	+ production tsdb

12. Level Delta

Mid (L4): Works after the “string compare” hint.
Senior (L5): Cold solution + clean prefix-length dict.
Staff (L6): + sorted + bisect optimization + cites production time-series DB design.
Principal (L7+): + chunking strategies, column-oriented storage, retention policies, downsampling.

13. Follow-up Q&A

Q1: Optimize retrieve to sub-linear. A1: Maintain sorted list of (timestamp, id); on retrieve, bisect_left(start_prefix) and bisect_right(end_prefix + '~') (the ~ is the high sentinel above ‘:’ and digits in ASCII… actually use something like a generous high char or compute the prefix’s successor).

Q2: Support non-ISO formats. A2: Either canonicalize on put, or parse to integer epoch on put and store both.

Q3: Billions of logs. A3: Real time-series DB: chunked storage per time bucket, secondary index on tags, compression (delta + Gorilla for floats), retention policies, downsampling.

Q4: Concurrent put and retrieve. A4: RWLock on the store; or COW snapshot for readers.

14. Solution Walkthrough

See solution.py:

LogSystem — linear scan with prefix dict.
LogSystemSorted — bisect-based variant.
Stress test driving random puts and retrieves against both.

15. Beyond the Problem

Principal code review: “Log Storage System is a time-series database in 30 lines. The pedagogical magic is the zero-padded fixed-width format that lets lexicographic comparison stand in for chronological — this is the same trick used in Cassandra’s clustering columns, DynamoDB sort keys, Bigtable row keys, and HBase rowkeys: encode time as a sortable string prefix, and range queries become byte-level range scans. The Staff+ recognition is that this is the foundation of every column-family / wide-column NoSQL store: choose your row key carefully (often time-prefixed) and your reads become sequential disk scans, your inserts become sequential writes, your hot-partition problem is solved by salting the prefix. The same insight powers Prometheus tsdb (chunks indexed by time bucket), InfluxDB (TSI index on tags + timestamp), and TimescaleDB (hypertables auto-partitioned by time). When designing data layouts, the question is always: what’s the natural sort order, and can I encode it as a byte-comparable prefix? If yes, you get O(log n) range queries for free from any sorted store (B-tree, LSM, skiplist). This problem is the smallest non-trivial instance of that principle.”

LeetCode Interview — Extreme Coding