Lab 23 — Toy SQL-Like Engine

Goal

Implement a tiny SQL-like engine that can parse and execute SELECT col1, col2 FROM t WHERE expr [JOIN u ON expr] [ORDER BY col [DESC]] [LIMIT n] over in-memory tables. The engine has three layers: tokenizer, parser (produces an AST), executor (interprets the AST). After this lab you should be able to scope a 60-minute version of this in <5 minutes and produce a working subset (no joins, no order-by) in <40 minutes.

Background Concepts

A SQL engine — even a toy — is the cleanest interview-friendly example of the frontend / backend / interpreter trilogy that runs every real query engine, compiler, and DSL:

  1. Lexer / tokenizer: converts a string into a stream of typed tokens (SELECT, identifier name, =, integer 42, etc.). Skips whitespace, recognizes keywords, classifies punctuation.
  2. Parser: consumes the token stream and produces an AST: Select(columns, from_table, where_expr, joins, order_by, limit). Recursive descent is the right tool for this problem class — top-down, predictable, fits on a whiteboard.
  3. Executor: walks the AST and produces rows. For WHERE, evaluate the expression against each row. For JOIN, nested-loop the two tables and concatenate matching rows. For ORDER BY, sort by the named column with DESC flag. For LIMIT, slice the result.

The interviewer is not testing whether you can build a real query optimizer (you can’t, in 60 minutes). They are testing whether you can decompose the problem into the three layers, write each cleanly, and connect them through a typed AST. Candidates who try to do everything inline in one function fail; candidates who name Token, Expr, Select types and split functions per layer pass.

Interview Context

A 60-minute round at Snowflake, Databricks, MongoDB, Neon, PlanetScale, and any database / data-platform company. Often paired with a smaller warmup. The supported subset varies by interviewer — at minimum SELECT cols FROM t WHERE expr is expected; joins and order-by are stretch goals; aggregates (COUNT, SUM) are extras for strong candidates.

Problem Statement

Implement Engine with:

  • register_table(name: str, columns: list[str], rows: list[list]): store an in-memory table.
  • query(sql: str) -> list[list]: parse and execute the SQL, return rows.

Supported grammar:

SELECT  col_list  FROM  table_name
        [ JOIN  table_name  ON  cond_expr ]
        [ WHERE  cond_expr ]
        [ ORDER BY  column_ref  [ASC | DESC] ]
        [ LIMIT  integer ]

col_list is * or comma-separated column references (qualified table.col or bare col). cond_expr is a small expression language: literals (int, string), column refs, and the operators = != < <= > >= AND OR NOT.

Constraints

  • Identifiers are alphanumeric (and underscore). Keywords are case-insensitive (SELECT == select).
  • String literals use single quotes ('foo').
  • Tables fit in memory; nested-loop join is acceptable.
  • Up to 10^3 rows per table; query must finish in well under a second.

Clarifying Questions

  1. Are aggregates (COUNT, SUM) required? (No for the base; stretch goal.)
  2. Are subqueries supported? (No.)
  3. Are NULLs supported? (No — undefined column = error; missing field = treat as None and NULL propagation rules elided.)
  4. Are types coerced? (No — comparing '5' and 5 returns False or raises; document.)
  5. Is column resolution case-sensitive? (Yes — keywords case-insensitive, identifiers case-sensitive. Document.)
  6. Are joins inner-only? (Yes — INNER JOIN semantics; no LEFT/RIGHT/FULL for the base.)

Examples

-- users(id, name, age); orders(id, user_id, total)
SELECT name FROM users WHERE age >= 18
SELECT u.name, o.total FROM users u JOIN orders o ON u.id = o.user_id WHERE o.total > 100
SELECT name FROM users ORDER BY age DESC LIMIT 3

Initial Brute Force

Skip the parser and tokenize-execute in a single big regex-soup function. This is what most candidates produce when panicked. It works for two or three test cases and breaks instantly on any extension.

Brute Force Complexity

Roughly O(rows × cols × query-length) and bug-prone.

Optimization Path

Properly separate lexer / parser / executor. Tokenizer scans the string once: O(N). Parser is recursive descent: O(tokens). Executor: O(rows × predicate cost) for WHERE; O(left × right) for nested-loop joins; O(rows log rows) for ORDER BY. Each layer is independently testable.

Final Expected Approach

Three layers connected by typed values:

  1. tokenize(sql) -> list[Token]Token = (kind, value) where kind is KEYWORD, IDENT, INT, STRING, OP, PUNC, EOF.
  2. Parser(tokens).parse_select() -> Select — recursive-descent. Each non-terminal is a method. Select is a dataclass with columns, from_table, joins, where, order_by, limit.
  3. Engine.execute(select) -> rows — fetch base rows, apply joins, apply where, project columns, order, limit.

Expr is a small algebraic datatype: Literal(value), Column(table_or_None, name), BinOp(op, left, right), UnaryOp(op, operand). Evaluation: eval_expr(expr, row, schema) -> value.

Data Structures Used

  • list[tuple] per table (rows).
  • dict[str, int] per table (column → index).
  • AST: small dataclasses or named tuples.
  • Token list: list[Token].
  • dict[str, callable] for operator dispatch.

Correctness Argument

Each layer’s correctness is independent of the others. Tokenizer correctness: every input character is consumed exactly once and emitted as exactly one token (or skipped if whitespace). Parser correctness: a recursive-descent parser for an LL(1) grammar accepts the language exactly when the grammar is LL(1) and the parser’s lookahead matches. The grammar above is trivially LL(1). Executor correctness: each clause is a transformation on a row stream. WHERE filters; JOIN cross-products and filters; projection picks columns; ORDER BY sorts; LIMIT truncates. Each transformation preserves the well-typed-row invariant.

Complexity

StageTime
TokenizeO(N)
ParseO(T) where T = tokens
WHERE filterO(R ·
Inner JOIN (nested loop)O(R₁ · R₂ ·
ORDER BYO(R log R)
LIMITO(L)

For larger data: indices, hash joins, query optimizers — out of scope.

Implementation Requirements

import re
from dataclasses import dataclass
from typing import Any, Optional


# ---------------- Tokenizer ----------------

KEYWORDS = {"SELECT", "FROM", "WHERE", "JOIN", "ON",
            "ORDER", "BY", "ASC", "DESC", "LIMIT",
            "AND", "OR", "NOT"}

@dataclass
class Token:
    kind: str
    value: Any

_TOKEN_RE = re.compile(r"""
    \s+ |                       # whitespace
    '([^']*)' |                 # string literal
    (\d+) |                     # int
    (==|!=|<=|>=|=|<|>) |       # ops
    ([A-Za-z_][A-Za-z0-9_]*) |  # identifier or keyword
    (,|\(|\)|\.|\*)             # punctuation
""", re.VERBOSE)


def tokenize(sql: str) -> list[Token]:
    tokens: list[Token] = []
    i = 0
    while i < len(sql):
        m = _TOKEN_RE.match(sql, i)
        if not m:
            raise SyntaxError(f"unexpected char at {i}: {sql[i]!r}")
        s, ival, op, ident, punc = m.groups()
        if m.group(0).isspace():
            pass
        elif s is not None:
            tokens.append(Token("STRING", s))
        elif ival is not None:
            tokens.append(Token("INT", int(ival)))
        elif op is not None:
            tokens.append(Token("OP", "=" if op == "==" else op))
        elif ident is not None:
            up = ident.upper()
            if up in KEYWORDS:
                tokens.append(Token("KEYWORD", up))
            else:
                tokens.append(Token("IDENT", ident))
        elif punc is not None:
            tokens.append(Token("PUNC", punc))
        i = m.end()
    tokens.append(Token("EOF", None))
    return tokens


# ---------------- AST ----------------

@dataclass
class Column:
    table: Optional[str]
    name: str

@dataclass
class Literal:
    value: Any

@dataclass
class BinOp:
    op: str
    left: Any
    right: Any

@dataclass
class UnaryOp:
    op: str
    operand: Any

@dataclass
class Join:
    table: str
    alias: Optional[str]
    on: Any

@dataclass
class Select:
    columns: list                # list[Column] or ["*"]
    from_table: str
    from_alias: Optional[str]
    joins: list                  # list[Join]
    where: Optional[Any]
    order_by: Optional[tuple]    # (Column, "ASC" | "DESC")
    limit: Optional[int]


# ---------------- Parser (recursive descent) ----------------

class Parser:
    def __init__(self, tokens: list[Token]):
        self._t = tokens
        self._i = 0

    def _peek(self) -> Token: return self._t[self._i]
    def _eat(self, kind, value=None) -> Token:
        tok = self._t[self._i]
        if tok.kind != kind or (value is not None and tok.value != value):
            raise SyntaxError(f"expected {kind} {value}, got {tok}")
        self._i += 1
        return tok
    def _accept(self, kind, value=None) -> bool:
        tok = self._t[self._i]
        if tok.kind == kind and (value is None or tok.value == value):
            self._i += 1
            return True
        return False

    def parse_select(self) -> Select:
        self._eat("KEYWORD", "SELECT")
        cols = self._parse_columns()
        self._eat("KEYWORD", "FROM")
        ftable, falias = self._parse_table_alias()
        joins = []
        while self._accept("KEYWORD", "JOIN"):
            jt, ja = self._parse_table_alias()
            self._eat("KEYWORD", "ON")
            joins.append(Join(jt, ja, self._parse_expr()))
        where = self._parse_expr() if self._accept("KEYWORD", "WHERE") else None
        order_by = None
        if self._accept("KEYWORD", "ORDER"):
            self._eat("KEYWORD", "BY")
            ob_col = self._parse_column_ref()
            direction = "ASC"
            if self._accept("KEYWORD", "DESC"): direction = "DESC"
            elif self._accept("KEYWORD", "ASC"): direction = "ASC"
            order_by = (ob_col, direction)
        limit = None
        if self._accept("KEYWORD", "LIMIT"):
            limit = self._eat("INT").value
        self._eat("EOF")
        return Select(cols, ftable, falias, joins, where, order_by, limit)

    def _parse_columns(self):
        if self._accept("PUNC", "*"):
            return ["*"]
        cols = [self._parse_column_ref()]
        while self._accept("PUNC", ","):
            cols.append(self._parse_column_ref())
        return cols

    def _parse_column_ref(self) -> Column:
        ident = self._eat("IDENT").value
        if self._accept("PUNC", "."):
            name = self._eat("IDENT").value
            return Column(ident, name)
        return Column(None, ident)

    def _parse_table_alias(self) -> tuple[str, Optional[str]]:
        name = self._eat("IDENT").value
        alias = None
        if self._peek().kind == "IDENT":
            alias = self._eat("IDENT").value
        return name, alias

    def _parse_expr(self):
        return self._parse_or()
    def _parse_or(self):
        left = self._parse_and()
        while self._accept("KEYWORD", "OR"):
            left = BinOp("OR", left, self._parse_and())
        return left
    def _parse_and(self):
        left = self._parse_not()
        while self._accept("KEYWORD", "AND"):
            left = BinOp("AND", left, self._parse_not())
        return left
    def _parse_not(self):
        if self._accept("KEYWORD", "NOT"):
            return UnaryOp("NOT", self._parse_not())
        return self._parse_cmp()
    def _parse_cmp(self):
        left = self._parse_atom()
        if self._peek().kind == "OP":
            op = self._eat("OP").value
            return BinOp(op, left, self._parse_atom())
        return left
    def _parse_atom(self):
        tok = self._peek()
        if tok.kind == "INT": self._i += 1; return Literal(tok.value)
        if tok.kind == "STRING": self._i += 1; return Literal(tok.value)
        if tok.kind == "PUNC" and tok.value == "(":
            self._i += 1
            e = self._parse_expr()
            self._eat("PUNC", ")")
            return e
        return self._parse_column_ref()


# ---------------- Executor ----------------

class Engine:
    def __init__(self):
        self._tables: dict[str, tuple[list[str], list[list]]] = {}

    def register_table(self, name: str, columns: list[str], rows: list[list]):
        self._tables[name] = (columns, [list(r) for r in rows])

    def query(self, sql: str) -> list[list]:
        ast = Parser(tokenize(sql)).parse_select()
        return self._execute(ast)

    def _execute(self, sel: Select) -> list[list]:
        # 1. base
        base_cols, base_rows = self._fetch(sel.from_table)
        alias = sel.from_alias or sel.from_table
        rows = [(r, {alias: (base_cols, r)}) for r in base_rows]

        # 2. joins (nested loop)
        for j in sel.joins:
            jcols, jrows = self._fetch(j.table)
            j_alias = j.alias or j.table
            new_rows = []
            for left_row, env in rows:
                for jr in jrows:
                    new_env = dict(env)
                    new_env[j_alias] = (jcols, jr)
                    if self._eval(j.on, new_env):
                        new_rows.append((left_row + jr, new_env))
            rows = new_rows

        # 3. where
        if sel.where is not None:
            rows = [(r, env) for r, env in rows if self._eval(sel.where, env)]

        # 4. project
        if sel.columns == ["*"]:
            projected = [r for r, _ in rows]
        else:
            projected = [[self._eval(c, env) for c in sel.columns] for _, env in rows]

        # 5. order by
        if sel.order_by is not None:
            col, direction = sel.order_by
            # we sort over the *original* rows (with env), then re-project. Simpler: sort
            # the projected rows along with their key.
            rows_with_keys = [(self._eval(col, env), p) for (_, env), p in zip(rows, projected)]
            rows_with_keys.sort(key=lambda kp: kp[0], reverse=(direction == "DESC"))
            projected = [p for _, p in rows_with_keys]

        # 6. limit
        if sel.limit is not None:
            projected = projected[: sel.limit]
        return projected

    def _fetch(self, table: str):
        if table not in self._tables:
            raise ValueError(f"unknown table: {table}")
        return self._tables[table]

    def _eval(self, expr, env: dict[str, tuple[list[str], list]]):
        if isinstance(expr, Literal):
            return expr.value
        if isinstance(expr, Column):
            if expr.table is not None:
                cols, row = env[expr.table]
                return row[cols.index(expr.name)]
            for cols, row in env.values():
                if expr.name in cols:
                    return row[cols.index(expr.name)]
            raise NameError(f"unknown column: {expr.name}")
        if isinstance(expr, UnaryOp) and expr.op == "NOT":
            return not self._eval(expr.operand, env)
        if isinstance(expr, BinOp):
            l = self._eval(expr.left, env)
            r = self._eval(expr.right, env)
            return {
                "=": l == r, "!=": l != r,
                "<": l < r, "<=": l <= r,
                ">": l > r, ">=": l >= r,
                "AND": bool(l) and bool(r),
                "OR": bool(l) or bool(r),
            }[expr.op]
        raise TypeError(f"bad expr: {expr}")

Tests

def setup_engine():
    e = Engine()
    e.register_table("users", ["id", "name", "age"], [
        [1, "alice", 30], [2, "bob", 17], [3, "carol", 22], [4, "dave", 45],
    ])
    e.register_table("orders", ["id", "user_id", "total"], [
        [10, 1, 250], [11, 1, 50], [12, 3, 800], [13, 4, 75],
    ])
    return e

def test_basic_select_star():
    e = setup_engine()
    assert e.query("SELECT * FROM users") == [
        [1, "alice", 30], [2, "bob", 17], [3, "carol", 22], [4, "dave", 45]]

def test_where_int_compare():
    e = setup_engine()
    out = e.query("SELECT name FROM users WHERE age >= 18")
    assert sorted(out) == [["alice"], ["carol"], ["dave"]]

def test_string_compare():
    e = setup_engine()
    out = e.query("SELECT name FROM users WHERE name = 'alice'")
    assert out == [["alice"]]

def test_and_or_not():
    e = setup_engine()
    out = e.query("SELECT name FROM users WHERE age > 20 AND NOT name = 'dave'")
    assert sorted(out) == [["alice"], ["carol"]]

def test_join():
    e = setup_engine()
    out = e.query(
        "SELECT u.name, o.total FROM users u JOIN orders o ON u.id = o.user_id "
        "WHERE o.total > 100"
    )
    assert sorted(out) == [["alice", 250], ["carol", 800]]

def test_order_by_desc_limit():
    e = setup_engine()
    out = e.query("SELECT name FROM users ORDER BY age DESC LIMIT 2")
    assert out == [["dave"], ["alice"]]

def test_unknown_column_errors():
    e = setup_engine()
    try: e.query("SELECT bogus FROM users")
    except NameError: pass
    else: assert False

Follow-up Questions

  1. How would you test it? Layer-by-layer: tokenizer tests for every keyword/operator/punctuation; parser tests that pretty-print the AST and compare strings; executor tests against fixture tables. Property test: random valid queries with predictable outputs from a Python list-comprehension oracle.
  2. What is the consistency model? Single-threaded; reads see the snapshot at query start. For concurrent writes, copy-on-write tables or per-table read-write locks.
  3. What configuration knobs would you expose? Maximum query length, maximum result rows, query timeout. Don’t expose internals (tokenizer regex, parser lookahead).
  4. How would you handle a poison-pill input? Catastrophic regex (rare with the lexer above), deeply nested expressions (limit recursion depth), enormous joins (R₁ × R₂ row cap before executing). Bound everything.
  5. How would you scale to N nodes? Beyond toy: shard tables by primary key range or hash; route queries to the owning node; for joins across shards, use distributed hash join. Real systems (Spanner, CockroachDB) layer query planning, distributed execution, and consensus over this same skeleton.
  6. What metrics would you emit? Per-query: parse latency, execution latency, rows scanned, rows returned. Per-table: row count gauge. Aggregate: queries-per-second counter, error rate counter.

Product Extension

This is the same skeleton DuckDB, SQLite, Postgres, and every database engine starts with: lex / parse / plan / execute. Real engines add a planner/optimizer between parse and execute that rewrites the AST (push down predicates, choose join order, pick indexes), and a storage layer beneath execute. Aggregate functions, group-by, subqueries, and CTEs are all extensions of the AST + executor pair.

Language/Runtime Follow-ups

  • Python: dataclasses are the right shape for AST. re.VERBOSE lexer is concise.
  • Java: Use sealed interfaces (Java 17+) for AST nodes. ANTLR for the parser if available; hand-rolled recursive descent if not.
  • Go: Use a type Node interface { node() } and individual struct types implementing it. The parser is a struct with the token list and an index.
  • C++: std::variant for AST nodes is clean; visitor pattern via std::visit.
  • JS/TS: Discriminated unions for AST. The runtime cost of dynamic dispatch is acceptable for an interview-grade engine.

Common Bugs

  1. Lexer that consumes whitespace as a token — pollutes the parser. Skip whitespace in the lexer.
  2. Parser that allows the same column twice in projection but then fails at execution — better to validate at parse time.
  3. JOIN executor that builds a Cartesian product before filtering — works but quadratic memory before predicate evaluation. Filter as you go (the implementation above does this).
  4. ORDER BY on a column not in projection — must evaluate against the row environment, not the projected output.
  5. Operator precedence wrong — NOT a AND b parsed as NOT (a AND b) instead of (NOT a) AND b. The recursive-descent ladder (OR < AND < NOT < CMP) handles this.
  6. Case-folding identifiers — many SQL engines do (Postgres folds to lowercase); this toy engine doesn’t. Document the choice.

Debugging Strategy

When parse fails: print the token stream up to the failure point and the parser’s _i index — almost always a missing keyword in the parse method (WHERE vs WHRE). When execution returns wrong rows: log the where evaluation per row with the values it sees. When joins explode: cap row count and emit an error rather than running unbounded.

Mastery Criteria

  • Decomposed lex / parse / execute in <2 minutes.
  • Wrote the recursive-descent expression parser with correct precedence.
  • Implemented WHERE and SELECT cols correctly in <30 minutes.
  • Added inner JOIN nested-loop in <10 minutes from the WHERE-only baseline.
  • Added ORDER BY and LIMIT in <10 minutes more.
  • Articulated where the optimizer would slot in (between parse and execute).
  • Listed three real systems (DuckDB, SQLite, Postgres) using this skeleton.
  • Wrote tokenizer + parser + executor tests independently per layer.