p99 — Design Log Storage System
Source: LeetCode 635 · Medium · Topics: Design, String, Range Query Companies (2024–2025): Amazon (medium), Bloomberg (medium), Snowflake (low) Loop position: Range-query design problem. Tests whether you recognize that a fixed-format timestamp can be range-queried via string prefix comparison.
1. Quick Context
Two operations:
put(id: int, timestamp: str)— store a log with timestamp formatted"Year:Month:Day:Hour:Minute:Second"(zero-padded:"2017:01:01:23:59:59").retrieve(start: str, end: str, granularity: str) -> List[int]— return IDs whose timestamp falls in[start, end]at the given granularity (one of"Year", "Month", "Day", "Hour", "Minute", "Second").
Key insight: because the format is fixed-width and zero-padded, lexicographic string comparison agrees with chronological comparison. So filtering reduces to comparing fixed-length prefixes.
2. LeetCode Link + Attempt Gate
🔗 https://leetcode.com/problems/design-log-storage-system/
STOP. 20-min timer.
3. Prerequisite Concepts
- Fixed-width zero-padded format → lex order = numeric order.
- Prefix truncation by granularity.
4. How to Approach (FRAMEWORK 1–9)
1. Restate: Store (id, timestamp) pairs; retrieve IDs whose timestamps fall in [start, end] at variable granularity.
2. Clarify: Duplicate IDs? Per LC, ids are unique per put. Result order? Any order. Are start/end always inclusive? Yes, inclusive on both ends.
3. Examples:
retrieve("2016:01:01:01:01:01", "2017:01:01:23:00:00", "Year")matches anything with year 2016 or 2017.- Granularity “Hour” → prefix length 13.
5. Brute: Linear scan with substring comparison. This IS the canonical solution; the optimization is the prefix length lookup table.
6. Target: put O(1), retrieve O(n) (acceptable for LC).
7. Pattern: Prefix-length table maps granularity → number of chars to compare.
8. Develop: see solution.py.
9. Test: all 6 granularities; boundary timestamps (exact match start, exact match end); empty store.
5. Progressive Hints
See hints.md.
6. Deeper Insight
6.1 Why prefix comparison works
The format "YYYY:MM:DD:HH:MM:SS" is:
- Fixed-width — every component is the same number of chars.
- Zero-padded —
"01" < "10"lexicographically AND numerically. - Most-significant-first — Year before Month before Day, etc.
Result: lexicographic comparison agrees with chronological. Truncating to the granularity’s prefix gives correct range queries.
6.2 Granularity → prefix length
| Granularity | Example prefix | Length |
|---|---|---|
| Year | "2017" | 4 |
| Month | "2017:01" | 7 |
| Day | "2017:01:01" | 10 |
| Hour | "2017:01:01:23" | 13 |
| Minute | "2017:01:01:23:59" | 16 |
| Second | "2017:01:01:23:59:59" | 19 |
6.3 Optimization — sorted + bisect
If put is mostly upfront and retrieve is frequent, maintain a sorted list of (timestamp_prefix, id) keyed by truncated timestamp; use bisect_left / bisect_right to find the range in O(log n).
For LC’s constraints (n ≤ 500), the linear scan is fine.
6.4 Production reality — time-series databases
This is a toy version of a time-series database. Real systems (InfluxDB, Prometheus, TimescaleDB) use:
- Inverted index on timestamp for range queries.
- Hierarchical chunking by time bucket (e.g., one chunk per hour).
- Column-oriented storage for compression.
- Write-ahead log + LSM for ingestion.
The Staff-level recognition is: “we’re approximating Prometheus’ tsdb in 30 lines.”
6.5 Family: range query / time series
- LC 635 Log Storage System (this).
- LC 729 My Calendar I.
- LC 731 My Calendar II.
- LC 732 My Calendar III.
- LC 1146 Snapshot Array.
7. Anti-Pattern Analysis
- Parse timestamp into int components — works but unnecessarily complex; string prefix comparison is simpler and equally correct.
- Compare full timestamp at “Year” granularity — fails:
start="2017:..."would not match"2017:05:..."because the full strings differ. <instead of<=— LC range is inclusive.- Forget granularity case sensitivity — LC sends
"Year"exactly (capital Y); use a dict notlower().
8. Skills & Takeaways
- Fixed-width zero-padded format → lex order matches numeric order.
- Prefix truncation is range-query primitive when the format is hierarchical.
9. When to Move On
- Solve cold in 12 min.
- Recognize “lex order = numeric order” as the unlocking insight.
- Articulate sorted + bisect optimization.
10. Company Context
- Amazon: L5; “design simple” warmup.
- Bloomberg: systems round.
Strong: “Fixed-width zero-padded format means lex comparison agrees with chronological. Store as list; on retrieve, truncate start/end to the granularity’s prefix length and string-compare. O(n) per retrieve; sorted + bisect → O(log n + k) if needed.”
11. Interviewer’s Lens
| Signal | Junior | Mid | Senior | Staff+ |
|---|---|---|---|---|
| Insight | None | Hint | Cold | Cold |
| Prefix length | Hardcodes | Table | Clean dict | Clean dict |
| Bisect opt | None | None | Mentions | + production tsdb |
12. Level Delta
- Mid (L4): Works after the “string compare” hint.
- Senior (L5): Cold solution + clean prefix-length dict.
- Staff (L6): + sorted + bisect optimization + cites production time-series DB design.
- Principal (L7+): + chunking strategies, column-oriented storage, retention policies, downsampling.
13. Follow-up Q&A
Q1: Optimize retrieve to sub-linear.
A1: Maintain sorted list of (timestamp, id); on retrieve, bisect_left(start_prefix) and bisect_right(end_prefix + '~') (the ~ is the high sentinel above ‘:’ and digits in ASCII… actually use something like a generous high char or compute the prefix’s successor).
Q2: Support non-ISO formats.
A2: Either canonicalize on put, or parse to integer epoch on put and store both.
Q3: Billions of logs. A3: Real time-series DB: chunked storage per time bucket, secondary index on tags, compression (delta + Gorilla for floats), retention policies, downsampling.
Q4: Concurrent put and retrieve.
A4: RWLock on the store; or COW snapshot for readers.
14. Solution Walkthrough
See solution.py:
LogSystem— linear scan with prefix dict.LogSystemSorted— bisect-based variant.- Stress test driving random puts and retrieves against both.
15. Beyond the Problem
Principal code review: “Log Storage System is a time-series database in 30 lines. The pedagogical magic is the zero-padded fixed-width format that lets lexicographic comparison stand in for chronological — this is the same trick used in Cassandra’s clustering columns, DynamoDB sort keys, Bigtable row keys, and HBase rowkeys: encode time as a sortable string prefix, and range queries become byte-level range scans. The Staff+ recognition is that this is the foundation of every column-family / wide-column NoSQL store: choose your row key carefully (often time-prefixed) and your reads become sequential disk scans, your inserts become sequential writes, your hot-partition problem is solved by salting the prefix. The same insight powers Prometheus tsdb (chunks indexed by time bucket), InfluxDB (TSI index on tags + timestamp), and TimescaleDB (hypertables auto-partitioned by time). When designing data layouts, the question is always: what’s the natural sort order, and can I encode it as a byte-comparable prefix? If yes, you get O(log n) range queries for free from any sorted store (B-tree, LSM, skiplist). This problem is the smallest non-trivial instance of that principle.”