Create, Insert and Select - pdb part 1

In the previous post, I outlined my frustrations with current relational databases and promised to describe my adventures trying to create my own database. This post will be the first in a series of posts, with the goal of writing a usable database, albeit minimalistic, database. The database will be written in Rust and you're encouraged to follow along, but I will assume that you are moderately familiar with Rust.

You can always see the progress in the repository here. This document is based on the code at commit version c47e2b0e0c8bdf9ef9b5e6980f43bc1ffea6bb62.

The pdb query language

Since we're creating a database from scratch we get to decide what the query language will look like. For the pdb query language I'm going to lean upon SQL for the general feel, but I won't be afraid to depart from that style when I feel like it.

The first instance of the language (and the database) will support basic types consisting of integers, booleans, tuples and records. I might demote tuples to syntactic sugar for records at some point, but for now it doesn't hurt to have them as separate entities. Here's what a typical session might look like:

>> create table foo Int
Created
>> insert 42 into foo
Inserted 1
>> insert 43 into foo
Inserted 1
>> select from foo
[Int(42), Int(43)]
>> create table bar (Int, Bool)
Created
>> insert (100, true) into bar
Inserted 1
>> select from bar
[Tuple([Int(100), Bool(true)])]
>> create table baz ()
Created
>> insert () into baz
Inserted 1
>> insert () into baz
Inserted 1
>> select from baz
[Unit, Unit]
>> create table qux { x : Int, y : Int }
Created
>> insert { y = 2, x = 1 } into qux
Inserted 1
>> select from qux
[Record([("x", Int(1)), ("y", Int(2))])]

Right off the bat you'll notice some differences from traditional SQL. Most glaringly is probably the fact that in pdb, we don't have any columns by default. A table can just as easily consist of a number of (), or /unit/s, or a simple list of integers. To mimic traditional tables with named columns, we can use records, which allow us to name each field.

Getting started

We're going to create a new project using cargo new pdb. Our first dependency is going to be pest, a parser expression grammar library for rust. Add the following to your Cargo.toml:

...

[dependencies]
pest = "2.1.1"n
pest_derive = "2.1.0"

Parsing is (not) a pest

With the pest library, we can express the grammar of our language using a parsing expression grammar. We'll quickly go over each line to see how it works:

WHITESPACE = _{ " " | "\n" }

First, we declare what constitutes whitespace in our language. pest uses the special WHITESPACE rule to control what kind of whitespace is allowed between terms. To start, we allow spaces and newlines.

identifier = @{ 'a'..'z' ~ ASCII_ALPHANUMERIC* }

tyident = @{ 'A'..'Z' ~ ASCII_ALPHANUMERIC* }

Then we make an identifier rule and a tyident rule for the various identifiers we're going to use. We specify that regular expression identifiers should start with a lowercase letter, and type identifiers should start with an uppercase letter, similar to how identifiers are used in Haskell and similar ML-family languages. It also means that we can easily distinguish between the two kinds of identifiers.

unit = { "()" }

tytuple = { "(" ~ ty ~ ("," ~ ty)+ ~ ","? ~ ")" }

tyrecord = { "{" ~ identifier ~ ":" ~ ty ~ ("," ~ identifier ~ ":" ~ ty)* ~ ","? ~ "}" }

ty = { unit | tytuple | tyrecord | tyident }

Next, the unit rule is used for both the unit type and the unit value. Then follows the rules for tuple and record types, and finally the ty rule for general types. You'll notice that I'm allowing trailing commans in both the tuple and record types, which I like because it means adding or removing fields result in single-line diffs.

int = @{
    "-"?
    ~ ("0" | ASCII_NONZERO_DIGIT ~ ASCII_DIGIT*)
}

bool = { "true" | "false" }

tuple = { "(" ~ expr ~ ("," ~ expr)+ ~ ","? ~ ")" }

record = { "{" ~ identifier ~ "=" ~ expr ~ ("," ~ identifier ~ "=" ~ expr)* ~ ","? ~ "}" }

expr = { unit | tuple | record | int | bool }

Then we have the expressions allowed, which consists of integeres, booleans, tuples, and records. You'll note that I also allow trailing commas in the relevant expressions here.

insert = { "insert" ~ expr ~ "into" ~ identifier }

select = { "select" ~ "from" ~ identifier }

create = { "create" ~ "table" ~ identifier ~ ty }

statement = _{ SOI ~ (create | insert | select) ~ EOI }

Finally, the statements we allow are quite simple: insert, select, and create. You'll notice that the select statement is very primitive. For instance, there's no filtering, by either row or column, but we'll add those later.

That's it for the grammar of the language, now we need to turn the parsed statements into an AST.

Turning the Grammar into an AST

Here's the AST structures:

pub type Ident = String;

#[derive(Debug, PartialEq)]
pub enum Ty {
    Int,
    Bool,
    Tuple(Vec<Ty>),
    Unit,
    Record(Vec<(Ident, Ty)>),
}

#[derive(Debug, PartialEq)]
pub struct TableDefinition {
    pub ty: Ty,
}

#[derive(Debug, PartialEq)]
pub enum Expr {
    Int(i64),
    Bool(bool),
    Tuple(Vec<Expr>),
    Unit,
    Record(Vec<(Ident, Expr)>),
}

#[derive(Debug, PartialEq)]
pub enum Statement {
    Create(Ident, TableDefinition),
    Insert(Ident, Expr),
    Select(Ident),
}

We'll note that we've collapsed the identifier and ident from pest into an Ident struct, which is simply a string. Apart from that, these declarations correspond more or less directly to their pest counterparts.

Parsing using the pest parser library takes some getting used to, but once you get the hang of it, you can quickly turn a grammar into a robust parser with builtin error messages.

use pest::error::Error;
use pest::iterators::{Pair, Pairs};
use pest::Parser as _;

#[derive(Parser)]
#[grammar = "pdb.pest"]
pub struct Parser;

First, we import the necessary structs from pest and then we use the grammar annotation to create a Parser struct from our grammar definition.

fn parse_tyrecord(mut pairs: Pairs<Rule>) -> Result<Ty, Error<Rule>> {
    let mut xs = Vec::new();

    while let Some(ident) = pairs.next() {
        let ty = parse_ty(pairs.next().unwrap())?;
        xs.push((ident.as_str().to_owned(), ty));
    }

    xs.sort_by(|(x, _), (y, _)| x.cmp(y));

    Ok(Ty::Record(xs))
}

Pairs<Rule> is an iterator over Pair<Rule>, which is a matching rule and everything inside. For instance, the string true will match a pair with the rule bool, span from the start of the string to the end, and no inner pairs. The first parsing function parse_tyrecord parses the contents of a record type (ie. { x : Int, y : Bool }). So, the values inside pairs will alternate between an identifier and a type, with a guaranteed even number of values in total, because that is the only thing our grammar allows. Hence, it is safe to use unwrap to get the type Pair when we have found an identifier.

You'll note that we sort the vector by the identifier names, because we want to allow { x : Int, y : Int } and { y : Int, x : Int } to represent the same thing. You'll also note that we're not checking for multiple declarations of the same identifier, but that is something that we could reasonably add in the future.

fn parse_ty(pair: Pair<Rule>) -> Result<Ty, Error<Rule>> {
    match pair.as_rule() {
        Rule::tyident => match pair.as_str() {
            "Int" => Ok(Ty::Int),
            "Bool" => Ok(Ty::Bool),
            x => Err(Error::new_from_span(
                pest::error::ErrorVariant::CustomError {
                    message: format!("Invalid type {}", x),
                },
                pair.as_span(),
            )),
        },
        Rule::tytuple => Ok(Ty::Tuple(
            pair.into_inner()
                .map(|x| parse_ty(x.into_inner().next().unwrap()))
                .collect::<Result<Vec<_>, _>>()?,
        )),
        Rule::unit => Ok(Ty::Unit),
        Rule::tyrecord => parse_tyrecord(pair.into_inner()),
        r => Err(Error::new_from_span(
            pest::error::ErrorVariant::CustomError {
                message: format!(
                    "Unexpected rule {:?}, expected tyindent, tyrecord, unit or tytuple",
                    r
                ),
            },
            pair.as_span(),
        )),
    }
}

parse_ty takes a Pair<Rule> instead of a Pairs, simply because we know that a type matches a single rule. We then match on the rule inside the type andproduce the corresponding Ty. Parsing expressions is similar, and after than we just need to parse the different kinds of statements. Here is the rest of the parsing code:

fn parse_record(mut pairs: Pairs<Rule>) -> Result<Expr, Error<Rule>> {
    let mut xs = Vec::new();

    while let Some(ident) = pairs.next() {
        let expr = parse_expr(pairs.next().unwrap().into_inner().next().unwrap())?;
        xs.push((ident.as_str().to_owned(), expr));
    }

    xs.sort_by(|(x, _), (y, _)| x.cmp(y));

    Ok(Expr::Record(xs))
}

fn parse_expr(expr: Pair<Rule>) -> Result<Expr, Error<Rule>> {
    match expr.as_rule() {
        Rule::int => Ok(Expr::Int(expr.as_str().parse().unwrap())),
        Rule::bool => Ok(Expr::Bool(expr.as_str().parse().unwrap())),
        Rule::tuple => Ok(Expr::Tuple(
            expr.into_inner()
                .map(|x| parse_expr(x.into_inner().next().unwrap()))
                .collect::<Result<Vec<_>, _>>()?,
        )),
        Rule::unit => Ok(Expr::Unit),
        Rule::record => parse_record(expr.into_inner()),
        r => Err(Error::new_from_span(
            pest::error::ErrorVariant::CustomError {
                message: format!("Unexpected rule {:?}, expected expr", r),
            },
            expr.as_span(),
        )),
    }
}

pub fn parse_select(mut pairs: Pairs<Rule>) -> Result<Statement, Error<Rule>> {
    let ident = pairs.next().unwrap().as_str();

    Ok(Statement::Select(ident.to_string()))
}

pub fn parse_insert(mut pairs: Pairs<Rule>) -> Result<Statement, Error<Rule>> {
    let expr = parse_expr(pairs.next().unwrap().into_inner().next().unwrap())?;
    let ident = pairs.next().unwrap().as_str();

    Ok(Statement::Insert(ident.to_string(), expr))
}

pub fn parse_create(mut pairs: Pairs<Rule>) -> Result<Statement, Error<Rule>> {
    let ident = pairs.next().unwrap().as_str();
    let ty = parse_ty(pairs.next().unwrap().into_inner().next().unwrap())?;

    Ok(Statement::Create(
        ident.to_string(),
        TableDefinition { ty: ty },
    ))
}

fn parse_statement(pair: Pair<Rule>) -> Result<Statement, Error<Rule>> {
    match pair.as_rule() {
        Rule::create => Ok(parse_create(pair.into_inner())?),
        Rule::select => Ok(parse_select(pair.into_inner())?),
        Rule::insert => Ok(parse_insert(pair.into_inner())?),
        _ => Err(Error::new_from_span(
            pest::error::ErrorVariant::CustomError {
                message: format!("Unexpected rule {:?}, expected statement", pair),
            },
            pair.as_span(),
        )),
    }
}

pub fn parse(input: &str) -> Result<Statement, Error<Rule>> {
    let statement = Parser::parse(Rule::statement, input)?.next().unwrap();

    parse_statement(statement)
}

With that, our parser is done.

Objects (not the Java kind)

Eventually, we'd like to add other kinds of expressions (addition, sums, and so on), but we don't want store symbols like that in our tables, so we need to be able to evaluate expressions into something that we can store on disk. We'll call the evaluated expressions ~Object~s, and this is the definition we'll use:

#[derive(Debug, PartialEq)]
pub enum Object {
    Int(i64),
    Bool(bool),
    Tuple(Vec<Object>),
    Unit,
    Record(Vec<(Ident, Object)>),
}

impl fmt::Display for Object {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        match self {
            Object::Int(i) => write!(f, "{}", i),
            Object::Bool(b) => write!(f, "{}", b),
            Object::Tuple(objs) => {
                let mut objs = objs.iter();
                write!(f, "(")?;

                if let Some(obj) = objs.next() {
                    write!(f, "{}", obj)?;
                }

                for obj in objs {
                    write!(f, ", {}", obj)?;
                }

                write!(f, ")")
            }
            Object::Unit => write!(f, "()"),
            Object::Record(pairs) => {
                let mut pairs = pairs.iter();
                write!(f, "{{")?;

                if let Some((ident, obj)) = pairs.next() {
                    write!(f, "{} = {}", ident, obj)?;
                }

                for (ident, obj) in pairs {
                    write!(f, ", {} = {}", ident, obj)?;
                }

                write!(f, "}}")
            }
        }
    }
}

We have also added an implementation of Display, so that our REPL can show the result of queries in a nice manner.

Since we don't have any complicated expressions yet, evaluating is a simple manner of translating an Expr into the corresponding Object.

pub fn eval(expr: Expr) -> Object {
    match expr {
        Expr::Int(i) => Object::Int(i),
        Expr::Bool(b) => Object::Bool(b),
        Expr::Tuple(exprs) => Object::Tuple(exprs.into_iter().map(eval).collect()),
        Expr::Unit => Object::Unit,
        Expr::Record(xs) => Object::Record(
            xs.into_iter()
                .map(|(ident, obj)| (ident, eval(obj)))
                .collect(),
        ),
    }
}

Unification and a working REPL

Before we can write our REPL, we also need a way to ensure that an expression matches the type of the table it's being inserted in. Thankfully, this isn't very complicated, since our language is so simple:

pub fn matches_type(expr: &Expr, ty: &Ty) -> bool {
    match (expr, ty) {
        (Expr::Int(_), Ty::Int) => true,
        (Expr::Bool(_), Ty::Bool) => true,
        (Expr::Tuple(exprs), Ty::Tuple(tys)) => exprs
            .iter()
            .zip(tys.iter())
            .all(|(x, y)| matches_type(x, y)),
        (Expr::Unit, Ty::Unit) => true,
        (Expr::Record(expr_pairs), Ty::Record(ty_pairs)) => expr_pairs
            .iter()
            .zip(ty_pairs.iter())
            .all(|((exprident, expr), (tyident, ty))| {
                exprident == tyident && matches_type(expr, ty)
            }),
        _ => false,
    }
}

We'll need to expand this using proper Hindley-Milner unification at some point, but this will do for now.

That was the last piece we needed in order to be able to create a REPL! For now, tables only exist in memory, and consist of a list of Object values. We also have no filtering or anything, but we support arbitrarily complex table types, including records and tuples. Here is the code:

const PROMPT: &[u8; 3] = b">> ";

pub fn start<R, W>(reader: &mut R, writer: &mut W) -> Result<(), Box<dyn std::error::Error>>
where
    R: BufRead,
    W: Write,
{
    let mut tables: Vec<(Ident, TableDefinition, Vec<Object>)> = Vec::new();

    loop {
        writer.write_all(PROMPT)?;
        writer.flush()?;

        let mut line = String::new();
        reader.read_line(&mut line)?;

        if &line == "" {
            writer.write_all(b"\n")?;
            writer.flush()?;
            return Ok(());
        }

        match parse(&line) {
            Ok(ast) => match ast {
                Statement::Create(ident, def) => {
                    tables.push((ident, def, Vec::new()));
                    writer.write_all(b"Created\n")?;
                    writer.flush()?;
                }
                Statement::Insert(ident, expr) => {
                    if let Some((_, def, objs)) =
                        tables.iter_mut().find(|(ident2, _, _)| ident2 == &ident)
                    {
                        if matches_type(&expr, &def.ty) {
                            let result = eval(expr);
                            objs.push(result);
                            writer.write_all(b"Inserted 1\n")?;
                        } else {
                            writer.write_all(
                                format!(
                                    "Could not insert {:?} into table {:?} with definition {:?}\n",
                                    expr, ident, &def.ty
                                )
                                .as_bytes(),
                            )?;
                        }
                    } else {
                        writer.write_all(b"No such table\n")?;
                    }
                    writer.flush()?;
                }
                Statement::Select(ident) => {
                    if let Some((_, _, objs)) =
                        tables.iter().find(|(ident2, _, _)| ident2 == &ident)
                    {
                        writer.write_all(format!("{:?}\n", objs).as_bytes())?;
                    } else {
                        writer.write_all(b"No such table\n")?;
                    }
                    writer.flush()?;
                }
            },
            Err(e) => {
                writer.write_all(format!("No parse: {}\n", e).as_bytes())?;
                writer.flush()?;
            }
        }
    }
}

That's it for this installment. There are still many things we'd like to do with our database, including persisting to disk, filtering and more expressions, but I think of this as the minimum viable product, and it is a good standpoint from which to expand our database.

If you've got any comments or suggestions, feel free to contact me by email. Bugs and pull requests are welcome in the repository at https://github.com/Munksgaard/pdb.