Create, Insert and Select - pdb part 1

In the previous post, I outlined my frustrations with current relational
databases and promised to describe my adventures trying to create my own
database. This post will be the first in a series of posts, with the goal of
writing a usable database, albeit minimalistic, database. The database will be
written in Rust and you're encouraged to follow along, but I will assume that
you are moderately familiar with Rust.

You can always see the progress in the repository here. This document is based
on the code at commit version c47e2b0e0c8bdf9ef9b5e6980f43bc1ffea6bb62.

The pdb query language

Since we're creating a database from scratch we get to decide what the query
language will look like. For the pdb query language I'm going to lean upon SQL
for the general feel, but I won't be afraid to depart from that style when
I feel like it.

The first instance of the language (and the database) will support basic types
consisting of integers, booleans, tuples and records. I might demote tuples to
syntactic sugar for records at some point, but for now it doesn't hurt to have
them as separate entities. Here's what a typical session might look like:

>> create table foo Int
Created
>> insert 42 into foo
Inserted 1
>> insert 43 into foo
Inserted 1
>> select from foo
[Int(42), Int(43)]
>> create table bar (Int, Bool)
Created
>> insert (100, true) into bar
Inserted 1
>> select from bar
[Tuple([Int(100), Bool(true)])]
>> create table baz ()
Created
>> insert () into baz
Inserted 1
>> insert () into baz
Inserted 1
>> select from baz
[Unit, Unit]
>> create table qux { x : Int, y : Int }
Created
>> insert { y = 2, x = 1 } into qux
Inserted 1
>> select from qux
[Record([("x", Int(1)), ("y", Int(2))])]

Right off the bat you'll notice some differences from traditional SQL. Most
glaringly is probably the fact that in pdb, we don't have any columns by
default. A table can just as easily consist of a number of (), or /unit/s, or
a simple list of integers. To mimic traditional tables with named columns, we
can use records, which allow us to name each field.

Getting started

We're going to create a new project using cargo new pdb. Our first
dependency is going to be pest, a parser expression grammar library for
rust. Add the following to your Cargo.toml:

...

[dependencies]
pest = "2.1.1"n
pest_derive = "2.1.0"

Parsing is (not) a pest

With the pest library, we can express the grammar of our language using a
parsing expression grammar. We'll quickly go over each line to see how it works:

WHITESPACE = _{ " " | "\n" }

First, we declare what constitutes whitespace in our language. pest uses the
special WHITESPACE rule to control what kind of whitespace is allowed between
terms. To start, we allow spaces and newlines.

identifier = @{ 'a'..'z' ~ ASCII_ALPHANUMERIC* }

tyident = @{ 'A'..'Z' ~ ASCII_ALPHANUMERIC* }

Then we make an identifier rule and a tyident rule for the various
identifiers we're going to use. We specify that regular expression identifiers
should start with a lowercase letter, and type identifiers should start with an
uppercase letter, similar to how identifiers are used in Haskell and similar
ML-family languages. It also means that we can easily distinguish between the
two kinds of identifiers.

unit = { "()" }

tytuple = { "(" ~ ty ~ ("," ~ ty)+ ~ ","? ~ ")" }

tyrecord = { "{" ~ identifier ~ ":" ~ ty ~ ("," ~ identifier ~ ":" ~ ty)* ~ ","? ~ "}" }

ty = { unit | tytuple | tyrecord | tyident }

Next, the unit rule is used for both the unit type and the unit value. Then
follows the rules for tuple and record types, and finally the ty rule for
general types. You'll notice that I'm allowing trailing commans in both the
tuple and record types, which I like because it means adding or removing fields
result in single-line diffs.

int = @{
    "-"?
    ~ ("0" | ASCII_NONZERO_DIGIT ~ ASCII_DIGIT*)
}

bool = { "true" | "false" }

tuple = { "(" ~ expr ~ ("," ~ expr)+ ~ ","? ~ ")" }

record = { "{" ~ identifier ~ "=" ~ expr ~ ("," ~ identifier ~ "=" ~ expr)* ~ ","? ~ "}" }

expr = { unit | tuple | record | int | bool }

Then we have the expressions allowed, which consists of integeres, booleans,
tuples, and records. You'll note that I also allow trailing commas in the
relevant expressions here.

insert = { "insert" ~ expr ~ "into" ~ identifier }

select = { "select" ~ "from" ~ identifier }

create = { "create" ~ "table" ~ identifier ~ ty }

statement = _{ SOI ~ (create | insert | select) ~ EOI }

Finally, the statements we allow are quite simple: insert, select, and
create. You'll notice that the select statement is very primitive. For
instance, there's no filtering, by either row or column, but we'll add those
later.

That's it for the grammar of the language, now we need to turn the parsed
statements into an AST.

Turning the Grammar into an AST

Here's the AST structures:

pub type Ident = String;

#[derive(Debug, PartialEq)]
pub enum Ty {
    Int,
    Bool,
    Tuple(Vec<Ty>),
    Unit,
    Record(Vec<(Ident, Ty)>),
}

#[derive(Debug, PartialEq)]
pub struct TableDefinition {
    pub ty: Ty,
}

#[derive(Debug, PartialEq)]
pub enum Expr {
    Int(i64),
    Bool(bool),
    Tuple(Vec<Expr>),
    Unit,
    Record(Vec<(Ident, Expr)>),
}

#[derive(Debug, PartialEq)]
pub enum Statement {
    Create(Ident, TableDefinition),
    Insert(Ident, Expr),
    Select(Ident),
}

We'll note that we've collapsed the identifier and ident from pest into an
Ident struct, which is simply a string. Apart from that, these declarations
correspond more or less directly to their pest counterparts.

Parsing using the pest parser library takes some getting used to, but once you
get the hang of it, you can quickly turn a grammar into a robust parser with
builtin error messages.

use pest::error::Error;
use pest::iterators::{Pair, Pairs};
use pest::Parser as _;

#[derive(Parser)]
#[grammar = "pdb.pest"]
pub struct Parser;

First, we import the necessary structs from pest and then we use the grammar
annotation to create a Parser struct from our grammar definition.

fn parse_tyrecord(mut pairs: Pairs<Rule>) -> Result<Ty, Error<Rule>> {
    let mut xs = Vec::new();

    while let Some(ident) = pairs.next() {
        let ty = parse_ty(pairs.next().unwrap())?;
        xs.push((ident.as_str().to_owned(), ty));
    }

    xs.sort_by(|(x, _), (y, _)| x.cmp(y));

    Ok(Ty::Record(xs))
}

Pairs<Rule> is an iterator over Pair<Rule>, which is a matching rule and
everything inside. For instance, the string true will match a pair with the
rule bool, span from the start of the string to the end, and no inner pairs.
The first parsing function parse_tyrecord parses the contents of a record type
(ie. { x : Int, y : Bool }). So, the values inside pairs will alternate
between an identifier and a type, with a guaranteed even number of values in
total, because that is the only thing our grammar allows. Hence, it is safe to
use unwrap to get the type Pair when we have found an identifier.

You'll note that we sort the vector by the identifier names, because we want to
allow { x : Int, y : Int } and { y : Int, x : Int } to represent the same
thing. You'll also note that we're not checking for multiple declarations of the
same identifier, but that is something that we could reasonably add in the
future.

fn parse_ty(pair: Pair<Rule>) -> Result<Ty, Error<Rule>> {
    match pair.as_rule() {
        Rule::tyident => match pair.as_str() {
            "Int" => Ok(Ty::Int),
            "Bool" => Ok(Ty::Bool),
            x => Err(Error::new_from_span(
                pest::error::ErrorVariant::CustomError {
                    message: format!("Invalid type {}", x),
                },
                pair.as_span(),
            )),
        },
        Rule::tytuple => Ok(Ty::Tuple(
            pair.into_inner()
                .map(|x| parse_ty(x.into_inner().next().unwrap()))
                .collect::<Result<Vec<_>, _>>()?,
        )),
        Rule::unit => Ok(Ty::Unit),
        Rule::tyrecord => parse_tyrecord(pair.into_inner()),
        r => Err(Error::new_from_span(
            pest::error::ErrorVariant::CustomError {
                message: format!(
                    "Unexpected rule {:?}, expected tyindent, tyrecord, unit or tytuple",
                    r
                ),
            },
            pair.as_span(),
        )),
    }
}

parse_ty takes a Pair<Rule> instead of a Pairs, simply because we know
that a type matches a single rule. We then match on the rule inside the type
andproduce the corresponding Ty. Parsing expressions is similar, and after
than we just need to parse the different kinds of statements. Here is the rest
of the parsing code:

fn parse_record(mut pairs: Pairs<Rule>) -> Result<Expr, Error<Rule>> {
    let mut xs = Vec::new();

    while let Some(ident) = pairs.next() {
        let expr = parse_expr(pairs.next().unwrap().into_inner().next().unwrap())?;
        xs.push((ident.as_str().to_owned(), expr));
    }

    xs.sort_by(|(x, _), (y, _)| x.cmp(y));

    Ok(Expr::Record(xs))
}

fn parse_expr(expr: Pair<Rule>) -> Result<Expr, Error<Rule>> {
    match expr.as_rule() {
        Rule::int => Ok(Expr::Int(expr.as_str().parse().unwrap())),
        Rule::bool => Ok(Expr::Bool(expr.as_str().parse().unwrap())),
        Rule::tuple => Ok(Expr::Tuple(
            expr.into_inner()
                .map(|x| parse_expr(x.into_inner().next().unwrap()))
                .collect::<Result<Vec<_>, _>>()?,
        )),
        Rule::unit => Ok(Expr::Unit),
        Rule::record => parse_record(expr.into_inner()),
        r => Err(Error::new_from_span(
            pest::error::ErrorVariant::CustomError {
                message: format!("Unexpected rule {:?}, expected expr", r),
            },
            expr.as_span(),
        )),
    }
}

pub fn parse_select(mut pairs: Pairs<Rule>) -> Result<Statement, Error<Rule>> {
    let ident = pairs.next().unwrap().as_str();

    Ok(Statement::Select(ident.to_string()))
}

pub fn parse_insert(mut pairs: Pairs<Rule>) -> Result<Statement, Error<Rule>> {
    let expr = parse_expr(pairs.next().unwrap().into_inner().next().unwrap())?;
    let ident = pairs.next().unwrap().as_str();

    Ok(Statement::Insert(ident.to_string(), expr))
}

pub fn parse_create(mut pairs: Pairs<Rule>) -> Result<Statement, Error<Rule>> {
    let ident = pairs.next().unwrap().as_str();
    let ty = parse_ty(pairs.next().unwrap().into_inner().next().unwrap())?;

    Ok(Statement::Create(
        ident.to_string(),
        TableDefinition { ty: ty },
    ))
}

fn parse_statement(pair: Pair<Rule>) -> Result<Statement, Error<Rule>> {
    match pair.as_rule() {
        Rule::create => Ok(parse_create(pair.into_inner())?),
        Rule::select => Ok(parse_select(pair.into_inner())?),
        Rule::insert => Ok(parse_insert(pair.into_inner())?),
        _ => Err(Error::new_from_span(
            pest::error::ErrorVariant::CustomError {
                message: format!("Unexpected rule {:?}, expected statement", pair),
            },
            pair.as_span(),
        )),
    }
}

pub fn parse(input: &str) -> Result<Statement, Error<Rule>> {
    let statement = Parser::parse(Rule::statement, input)?.next().unwrap();

    parse_statement(statement)
}

With that, our parser is done.

Objects (not the Java kind)

Eventually, we'd like to add other kinds of expressions (addition, sums, and so
on), but we don't want store symbols like that in our tables, so we need to be
able to evaluate expressions into something that we can store on disk. We'll
call the evaluated expressions ~Object~s, and this is the definition we'll use:

#[derive(Debug, PartialEq)]
pub enum Object {
    Int(i64),
    Bool(bool),
    Tuple(Vec<Object>),
    Unit,
    Record(Vec<(Ident, Object)>),
}

impl fmt::Display for Object {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        match self {
            Object::Int(i) => write!(f, "{}", i),
            Object::Bool(b) => write!(f, "{}", b),
            Object::Tuple(objs) => {
                let mut objs = objs.iter();
                write!(f, "(")?;

                if let Some(obj) = objs.next() {
                    write!(f, "{}", obj)?;
                }

                for obj in objs {
                    write!(f, ", {}", obj)?;
                }

                write!(f, ")")
            }
            Object::Unit => write!(f, "()"),
            Object::Record(pairs) => {
                let mut pairs = pairs.iter();
                write!(f, "{{")?;

                if let Some((ident, obj)) = pairs.next() {
                    write!(f, "{} = {}", ident, obj)?;
                }

                for (ident, obj) in pairs {
                    write!(f, ", {} = {}", ident, obj)?;
                }

                write!(f, "}}")
            }
        }
    }
}

We have also added an implementation of Display, so that our REPL can show the
result of queries in a nice manner.

Since we don't have any complicated expressions yet, evaluating is a simple
manner of translating an Expr into the corresponding Object.

pub fn eval(expr: Expr) -> Object {
    match expr {
        Expr::Int(i) => Object::Int(i),
        Expr::Bool(b) => Object::Bool(b),
        Expr::Tuple(exprs) => Object::Tuple(exprs.into_iter().map(eval).collect()),
        Expr::Unit => Object::Unit,
        Expr::Record(xs) => Object::Record(
            xs.into_iter()
                .map(|(ident, obj)| (ident, eval(obj)))
                .collect(),
        ),
    }
}

Unification and a working REPL

Before we can write our REPL, we also need a way to ensure that an expression
matches the type of the table it's being inserted in. Thankfully, this isn't
very complicated, since our language is so simple:

pub fn matches_type(expr: &Expr, ty: &Ty) -> bool {
    match (expr, ty) {
        (Expr::Int(_), Ty::Int) => true,
        (Expr::Bool(_), Ty::Bool) => true,
        (Expr::Tuple(exprs), Ty::Tuple(tys)) => exprs
            .iter()
            .zip(tys.iter())
            .all(|(x, y)| matches_type(x, y)),
        (Expr::Unit, Ty::Unit) => true,
        (Expr::Record(expr_pairs), Ty::Record(ty_pairs)) => expr_pairs
            .iter()
            .zip(ty_pairs.iter())
            .all(|((exprident, expr), (tyident, ty))| {
                exprident == tyident && matches_type(expr, ty)
            }),
        _ => false,
    }
}

We'll need to expand this using proper Hindley-Milner unification at some point,
but this will do for now.

That was the last piece we needed in order to be able to create a REPL! For now,
tables only exist in memory, and consist of a list of Object values. We also
have no filtering or anything, but we support arbitrarily complex table types,
including records and tuples. Here is the code:

const PROMPT: &[u8; 3] = b">> ";

pub fn start<R, W>(reader: &mut R, writer: &mut W) -> Result<(), Box<dyn std::error::Error>>
where
    R: BufRead,
    W: Write,
{
    let mut tables: Vec<(Ident, TableDefinition, Vec<Object>)> = Vec::new();

    loop {
        writer.write_all(PROMPT)?;
        writer.flush()?;

        let mut line = String::new();
        reader.read_line(&mut line)?;

        if &line == "" {
            writer.write_all(b"\n")?;
            writer.flush()?;
            return Ok(());
        }

        match parse(&line) {
            Ok(ast) => match ast {
                Statement::Create(ident, def) => {
                    tables.push((ident, def, Vec::new()));
                    writer.write_all(b"Created\n")?;
                    writer.flush()?;
                }
                Statement::Insert(ident, expr) => {
                    if let Some((_, def, objs)) =
                        tables.iter_mut().find(|(ident2, _, _)| ident2 == &ident)
                    {
                        if matches_type(&expr, &def.ty) {
                            let result = eval(expr);
                            objs.push(result);
                            writer.write_all(b"Inserted 1\n")?;
                        } else {
                            writer.write_all(
                                format!(
                                    "Could not insert {:?} into table {:?} with definition {:?}\n",
                                    expr, ident, &def.ty
                                )
                                .as_bytes(),
                            )?;
                        }
                    } else {
                        writer.write_all(b"No such table\n")?;
                    }
                    writer.flush()?;
                }
                Statement::Select(ident) => {
                    if let Some((_, _, objs)) =
                        tables.iter().find(|(ident2, _, _)| ident2 == &ident)
                    {
                        writer.write_all(format!("{:?}\n", objs).as_bytes())?;
                    } else {
                        writer.write_all(b"No such table\n")?;
                    }
                    writer.flush()?;
                }
            },
            Err(e) => {
                writer.write_all(format!("No parse: {}\n", e).as_bytes())?;
                writer.flush()?;
            }
        }
    }
}

That's it for this installment. There are still many things we'd like to do with
our database, including persisting to disk, filtering and more expressions, but
I think of this as the minimum viable product, and it is a good standpoint from
which to expand our database.

If you've got any comments or suggestions, feel free to contact me by email.
Bugs and pull requests are welcome in the repository at
https://github.com/Munksgaard/pdb.

Created: <2020-02-05 Thu>

Last modified: 2020-02-05 Wed 15:24