Lab 23 — Toy SQL-Like Engine
Goal
Implement a tiny SQL-like engine that can parse and execute SELECT col1, col2 FROM t WHERE expr [JOIN u ON expr] [ORDER BY col [DESC]] [LIMIT n] over in-memory tables. The engine has three layers: tokenizer, parser (produces an AST), executor (interprets the AST). After this lab you should be able to scope a 60-minute version of this in <5 minutes and produce a working subset (no joins, no order-by) in <40 minutes.
Background Concepts
A SQL engine — even a toy — is the cleanest interview-friendly example of the frontend / backend / interpreter trilogy that runs every real query engine, compiler, and DSL:
- Lexer / tokenizer: converts a string into a stream of typed tokens (
SELECT, identifiername,=, integer42, etc.). Skips whitespace, recognizes keywords, classifies punctuation. - Parser: consumes the token stream and produces an AST:
Select(columns, from_table, where_expr, joins, order_by, limit). Recursive descent is the right tool for this problem class — top-down, predictable, fits on a whiteboard. - Executor: walks the AST and produces rows. For
WHERE, evaluate the expression against each row. ForJOIN, nested-loop the two tables and concatenate matching rows. ForORDER BY, sort by the named column withDESCflag. ForLIMIT, slice the result.
The interviewer is not testing whether you can build a real query optimizer (you can’t, in 60 minutes). They are testing whether you can decompose the problem into the three layers, write each cleanly, and connect them through a typed AST. Candidates who try to do everything inline in one function fail; candidates who name Token, Expr, Select types and split functions per layer pass.
Interview Context
A 60-minute round at Snowflake, Databricks, MongoDB, Neon, PlanetScale, and any database / data-platform company. Often paired with a smaller warmup. The supported subset varies by interviewer — at minimum SELECT cols FROM t WHERE expr is expected; joins and order-by are stretch goals; aggregates (COUNT, SUM) are extras for strong candidates.
Problem Statement
Implement Engine with:
register_table(name: str, columns: list[str], rows: list[list]): store an in-memory table.query(sql: str) -> list[list]: parse and execute the SQL, return rows.
Supported grammar:
SELECT col_list FROM table_name
[ JOIN table_name ON cond_expr ]
[ WHERE cond_expr ]
[ ORDER BY column_ref [ASC | DESC] ]
[ LIMIT integer ]
col_list is * or comma-separated column references (qualified table.col or bare col). cond_expr is a small expression language: literals (int, string), column refs, and the operators = != < <= > >= AND OR NOT.
Constraints
- Identifiers are alphanumeric (and underscore). Keywords are case-insensitive (
SELECT==select). - String literals use single quotes (
'foo'). - Tables fit in memory; nested-loop join is acceptable.
- Up to 10^3 rows per table; query must finish in well under a second.
Clarifying Questions
- Are aggregates (
COUNT,SUM) required? (No for the base; stretch goal.) - Are subqueries supported? (No.)
- Are NULLs supported? (No — undefined column = error; missing field = treat as None and
NULLpropagation rules elided.) - Are types coerced? (No — comparing
'5'and5returnsFalseor raises; document.) - Is column resolution case-sensitive? (Yes — keywords case-insensitive, identifiers case-sensitive. Document.)
- Are joins inner-only? (Yes —
INNER JOINsemantics; noLEFT/RIGHT/FULLfor the base.)
Examples
-- users(id, name, age); orders(id, user_id, total)
SELECT name FROM users WHERE age >= 18
SELECT u.name, o.total FROM users u JOIN orders o ON u.id = o.user_id WHERE o.total > 100
SELECT name FROM users ORDER BY age DESC LIMIT 3
Initial Brute Force
Skip the parser and tokenize-execute in a single big regex-soup function. This is what most candidates produce when panicked. It works for two or three test cases and breaks instantly on any extension.
Brute Force Complexity
Roughly O(rows × cols × query-length) and bug-prone.
Optimization Path
Properly separate lexer / parser / executor. Tokenizer scans the string once: O(N). Parser is recursive descent: O(tokens). Executor: O(rows × predicate cost) for WHERE; O(left × right) for nested-loop joins; O(rows log rows) for ORDER BY. Each layer is independently testable.
Final Expected Approach
Three layers connected by typed values:
tokenize(sql) -> list[Token]—Token = (kind, value)where kind isKEYWORD,IDENT,INT,STRING,OP,PUNC,EOF.Parser(tokens).parse_select() -> Select— recursive-descent. Each non-terminal is a method.Selectis a dataclass withcolumns,from_table,joins,where,order_by,limit.Engine.execute(select) -> rows— fetch base rows, apply joins, applywhere, project columns, order, limit.
Expr is a small algebraic datatype: Literal(value), Column(table_or_None, name), BinOp(op, left, right), UnaryOp(op, operand). Evaluation: eval_expr(expr, row, schema) -> value.
Data Structures Used
list[tuple]per table (rows).dict[str, int]per table (column → index).- AST: small dataclasses or named tuples.
- Token list:
list[Token]. dict[str, callable]for operator dispatch.
Correctness Argument
Each layer’s correctness is independent of the others. Tokenizer correctness: every input character is consumed exactly once and emitted as exactly one token (or skipped if whitespace). Parser correctness: a recursive-descent parser for an LL(1) grammar accepts the language exactly when the grammar is LL(1) and the parser’s lookahead matches. The grammar above is trivially LL(1). Executor correctness: each clause is a transformation on a row stream. WHERE filters; JOIN cross-products and filters; projection picks columns; ORDER BY sorts; LIMIT truncates. Each transformation preserves the well-typed-row invariant.
Complexity
| Stage | Time |
|---|---|
| Tokenize | O(N) |
| Parse | O(T) where T = tokens |
| WHERE filter | O(R · |
| Inner JOIN (nested loop) | O(R₁ · R₂ · |
| ORDER BY | O(R log R) |
| LIMIT | O(L) |
For larger data: indices, hash joins, query optimizers — out of scope.
Implementation Requirements
import re
from dataclasses import dataclass
from typing import Any, Optional
# ---------------- Tokenizer ----------------
KEYWORDS = {"SELECT", "FROM", "WHERE", "JOIN", "ON",
"ORDER", "BY", "ASC", "DESC", "LIMIT",
"AND", "OR", "NOT"}
@dataclass
class Token:
kind: str
value: Any
_TOKEN_RE = re.compile(r"""
\s+ | # whitespace
'([^']*)' | # string literal
(\d+) | # int
(==|!=|<=|>=|=|<|>) | # ops
([A-Za-z_][A-Za-z0-9_]*) | # identifier or keyword
(,|\(|\)|\.|\*) # punctuation
""", re.VERBOSE)
def tokenize(sql: str) -> list[Token]:
tokens: list[Token] = []
i = 0
while i < len(sql):
m = _TOKEN_RE.match(sql, i)
if not m:
raise SyntaxError(f"unexpected char at {i}: {sql[i]!r}")
s, ival, op, ident, punc = m.groups()
if m.group(0).isspace():
pass
elif s is not None:
tokens.append(Token("STRING", s))
elif ival is not None:
tokens.append(Token("INT", int(ival)))
elif op is not None:
tokens.append(Token("OP", "=" if op == "==" else op))
elif ident is not None:
up = ident.upper()
if up in KEYWORDS:
tokens.append(Token("KEYWORD", up))
else:
tokens.append(Token("IDENT", ident))
elif punc is not None:
tokens.append(Token("PUNC", punc))
i = m.end()
tokens.append(Token("EOF", None))
return tokens
# ---------------- AST ----------------
@dataclass
class Column:
table: Optional[str]
name: str
@dataclass
class Literal:
value: Any
@dataclass
class BinOp:
op: str
left: Any
right: Any
@dataclass
class UnaryOp:
op: str
operand: Any
@dataclass
class Join:
table: str
alias: Optional[str]
on: Any
@dataclass
class Select:
columns: list # list[Column] or ["*"]
from_table: str
from_alias: Optional[str]
joins: list # list[Join]
where: Optional[Any]
order_by: Optional[tuple] # (Column, "ASC" | "DESC")
limit: Optional[int]
# ---------------- Parser (recursive descent) ----------------
class Parser:
def __init__(self, tokens: list[Token]):
self._t = tokens
self._i = 0
def _peek(self) -> Token: return self._t[self._i]
def _eat(self, kind, value=None) -> Token:
tok = self._t[self._i]
if tok.kind != kind or (value is not None and tok.value != value):
raise SyntaxError(f"expected {kind} {value}, got {tok}")
self._i += 1
return tok
def _accept(self, kind, value=None) -> bool:
tok = self._t[self._i]
if tok.kind == kind and (value is None or tok.value == value):
self._i += 1
return True
return False
def parse_select(self) -> Select:
self._eat("KEYWORD", "SELECT")
cols = self._parse_columns()
self._eat("KEYWORD", "FROM")
ftable, falias = self._parse_table_alias()
joins = []
while self._accept("KEYWORD", "JOIN"):
jt, ja = self._parse_table_alias()
self._eat("KEYWORD", "ON")
joins.append(Join(jt, ja, self._parse_expr()))
where = self._parse_expr() if self._accept("KEYWORD", "WHERE") else None
order_by = None
if self._accept("KEYWORD", "ORDER"):
self._eat("KEYWORD", "BY")
ob_col = self._parse_column_ref()
direction = "ASC"
if self._accept("KEYWORD", "DESC"): direction = "DESC"
elif self._accept("KEYWORD", "ASC"): direction = "ASC"
order_by = (ob_col, direction)
limit = None
if self._accept("KEYWORD", "LIMIT"):
limit = self._eat("INT").value
self._eat("EOF")
return Select(cols, ftable, falias, joins, where, order_by, limit)
def _parse_columns(self):
if self._accept("PUNC", "*"):
return ["*"]
cols = [self._parse_column_ref()]
while self._accept("PUNC", ","):
cols.append(self._parse_column_ref())
return cols
def _parse_column_ref(self) -> Column:
ident = self._eat("IDENT").value
if self._accept("PUNC", "."):
name = self._eat("IDENT").value
return Column(ident, name)
return Column(None, ident)
def _parse_table_alias(self) -> tuple[str, Optional[str]]:
name = self._eat("IDENT").value
alias = None
if self._peek().kind == "IDENT":
alias = self._eat("IDENT").value
return name, alias
def _parse_expr(self):
return self._parse_or()
def _parse_or(self):
left = self._parse_and()
while self._accept("KEYWORD", "OR"):
left = BinOp("OR", left, self._parse_and())
return left
def _parse_and(self):
left = self._parse_not()
while self._accept("KEYWORD", "AND"):
left = BinOp("AND", left, self._parse_not())
return left
def _parse_not(self):
if self._accept("KEYWORD", "NOT"):
return UnaryOp("NOT", self._parse_not())
return self._parse_cmp()
def _parse_cmp(self):
left = self._parse_atom()
if self._peek().kind == "OP":
op = self._eat("OP").value
return BinOp(op, left, self._parse_atom())
return left
def _parse_atom(self):
tok = self._peek()
if tok.kind == "INT": self._i += 1; return Literal(tok.value)
if tok.kind == "STRING": self._i += 1; return Literal(tok.value)
if tok.kind == "PUNC" and tok.value == "(":
self._i += 1
e = self._parse_expr()
self._eat("PUNC", ")")
return e
return self._parse_column_ref()
# ---------------- Executor ----------------
class Engine:
def __init__(self):
self._tables: dict[str, tuple[list[str], list[list]]] = {}
def register_table(self, name: str, columns: list[str], rows: list[list]):
self._tables[name] = (columns, [list(r) for r in rows])
def query(self, sql: str) -> list[list]:
ast = Parser(tokenize(sql)).parse_select()
return self._execute(ast)
def _execute(self, sel: Select) -> list[list]:
# 1. base
base_cols, base_rows = self._fetch(sel.from_table)
alias = sel.from_alias or sel.from_table
rows = [(r, {alias: (base_cols, r)}) for r in base_rows]
# 2. joins (nested loop)
for j in sel.joins:
jcols, jrows = self._fetch(j.table)
j_alias = j.alias or j.table
new_rows = []
for left_row, env in rows:
for jr in jrows:
new_env = dict(env)
new_env[j_alias] = (jcols, jr)
if self._eval(j.on, new_env):
new_rows.append((left_row + jr, new_env))
rows = new_rows
# 3. where
if sel.where is not None:
rows = [(r, env) for r, env in rows if self._eval(sel.where, env)]
# 4. project
if sel.columns == ["*"]:
projected = [r for r, _ in rows]
else:
projected = [[self._eval(c, env) for c in sel.columns] for _, env in rows]
# 5. order by
if sel.order_by is not None:
col, direction = sel.order_by
# we sort over the *original* rows (with env), then re-project. Simpler: sort
# the projected rows along with their key.
rows_with_keys = [(self._eval(col, env), p) for (_, env), p in zip(rows, projected)]
rows_with_keys.sort(key=lambda kp: kp[0], reverse=(direction == "DESC"))
projected = [p for _, p in rows_with_keys]
# 6. limit
if sel.limit is not None:
projected = projected[: sel.limit]
return projected
def _fetch(self, table: str):
if table not in self._tables:
raise ValueError(f"unknown table: {table}")
return self._tables[table]
def _eval(self, expr, env: dict[str, tuple[list[str], list]]):
if isinstance(expr, Literal):
return expr.value
if isinstance(expr, Column):
if expr.table is not None:
cols, row = env[expr.table]
return row[cols.index(expr.name)]
for cols, row in env.values():
if expr.name in cols:
return row[cols.index(expr.name)]
raise NameError(f"unknown column: {expr.name}")
if isinstance(expr, UnaryOp) and expr.op == "NOT":
return not self._eval(expr.operand, env)
if isinstance(expr, BinOp):
l = self._eval(expr.left, env)
r = self._eval(expr.right, env)
return {
"=": l == r, "!=": l != r,
"<": l < r, "<=": l <= r,
">": l > r, ">=": l >= r,
"AND": bool(l) and bool(r),
"OR": bool(l) or bool(r),
}[expr.op]
raise TypeError(f"bad expr: {expr}")
Tests
def setup_engine():
e = Engine()
e.register_table("users", ["id", "name", "age"], [
[1, "alice", 30], [2, "bob", 17], [3, "carol", 22], [4, "dave", 45],
])
e.register_table("orders", ["id", "user_id", "total"], [
[10, 1, 250], [11, 1, 50], [12, 3, 800], [13, 4, 75],
])
return e
def test_basic_select_star():
e = setup_engine()
assert e.query("SELECT * FROM users") == [
[1, "alice", 30], [2, "bob", 17], [3, "carol", 22], [4, "dave", 45]]
def test_where_int_compare():
e = setup_engine()
out = e.query("SELECT name FROM users WHERE age >= 18")
assert sorted(out) == [["alice"], ["carol"], ["dave"]]
def test_string_compare():
e = setup_engine()
out = e.query("SELECT name FROM users WHERE name = 'alice'")
assert out == [["alice"]]
def test_and_or_not():
e = setup_engine()
out = e.query("SELECT name FROM users WHERE age > 20 AND NOT name = 'dave'")
assert sorted(out) == [["alice"], ["carol"]]
def test_join():
e = setup_engine()
out = e.query(
"SELECT u.name, o.total FROM users u JOIN orders o ON u.id = o.user_id "
"WHERE o.total > 100"
)
assert sorted(out) == [["alice", 250], ["carol", 800]]
def test_order_by_desc_limit():
e = setup_engine()
out = e.query("SELECT name FROM users ORDER BY age DESC LIMIT 2")
assert out == [["dave"], ["alice"]]
def test_unknown_column_errors():
e = setup_engine()
try: e.query("SELECT bogus FROM users")
except NameError: pass
else: assert False
Follow-up Questions
- How would you test it? Layer-by-layer: tokenizer tests for every keyword/operator/punctuation; parser tests that pretty-print the AST and compare strings; executor tests against fixture tables. Property test: random valid queries with predictable outputs from a Python list-comprehension oracle.
- What is the consistency model? Single-threaded; reads see the snapshot at query start. For concurrent writes, copy-on-write tables or per-table read-write locks.
- What configuration knobs would you expose? Maximum query length, maximum result rows, query timeout. Don’t expose internals (tokenizer regex, parser lookahead).
- How would you handle a poison-pill input? Catastrophic regex (rare with the lexer above), deeply nested expressions (limit recursion depth), enormous joins (
R₁ × R₂row cap before executing). Bound everything. - How would you scale to N nodes? Beyond toy: shard tables by primary key range or hash; route queries to the owning node; for joins across shards, use distributed hash join. Real systems (Spanner, CockroachDB) layer query planning, distributed execution, and consensus over this same skeleton.
- What metrics would you emit? Per-query: parse latency, execution latency, rows scanned, rows returned. Per-table: row count gauge. Aggregate: queries-per-second counter, error rate counter.
Product Extension
This is the same skeleton DuckDB, SQLite, Postgres, and every database engine starts with: lex / parse / plan / execute. Real engines add a planner/optimizer between parse and execute that rewrites the AST (push down predicates, choose join order, pick indexes), and a storage layer beneath execute. Aggregate functions, group-by, subqueries, and CTEs are all extensions of the AST + executor pair.
Language/Runtime Follow-ups
- Python: dataclasses are the right shape for AST.
re.VERBOSElexer is concise. - Java: Use sealed interfaces (Java 17+) for AST nodes. ANTLR for the parser if available; hand-rolled recursive descent if not.
- Go: Use a
type Node interface { node() }and individual struct types implementing it. The parser is a struct with the token list and an index. - C++:
std::variantfor AST nodes is clean; visitor pattern viastd::visit. - JS/TS: Discriminated unions for AST. The runtime cost of dynamic dispatch is acceptable for an interview-grade engine.
Common Bugs
- Lexer that consumes whitespace as a token — pollutes the parser. Skip whitespace in the lexer.
- Parser that allows the same column twice in projection but then fails at execution — better to validate at parse time.
- JOIN executor that builds a Cartesian product before filtering — works but quadratic memory before predicate evaluation. Filter as you go (the implementation above does this).
- ORDER BY on a column not in projection — must evaluate against the row environment, not the projected output.
- Operator precedence wrong —
NOT a AND bparsed asNOT (a AND b)instead of(NOT a) AND b. The recursive-descent ladder (OR < AND < NOT < CMP) handles this. - Case-folding identifiers — many SQL engines do (Postgres folds to lowercase); this toy engine doesn’t. Document the choice.
Debugging Strategy
When parse fails: print the token stream up to the failure point and the parser’s _i index — almost always a missing keyword in the parse method (WHERE vs WHRE). When execution returns wrong rows: log the where evaluation per row with the values it sees. When joins explode: cap row count and emit an error rather than running unbounded.
Mastery Criteria
- Decomposed lex / parse / execute in <2 minutes.
- Wrote the recursive-descent expression parser with correct precedence.
-
Implemented
WHEREandSELECT colscorrectly in <30 minutes. - Added inner JOIN nested-loop in <10 minutes from the WHERE-only baseline.
-
Added
ORDER BYandLIMITin <10 minutes more. - Articulated where the optimizer would slot in (between parse and execute).
- Listed three real systems (DuckDB, SQLite, Postgres) using this skeleton.
- Wrote tokenizer + parser + executor tests independently per layer.