Typechecking toyml

A Simple Typechecker in Standard ML

I've been wanting to write a compiler for a simple functional language for a long time, and I finally took the first steps in that direction some time ago, when I wrote a simple Hindley-Milner typechecker for a very simple functional language. It has no practical applications whatsoever – I didn't even write a parser – but it was a fun experience to learn more about Algorithm W and I've been meaning to document that process, and so this is it.

toyml - A minimalistic language

I'm not interested in a full fledged or even practically usable language for the moment, so toyml is basically just the typed lambda calculus. It supports integers, bools, lambda functions, function application and let-bindings.

type ident = string

datatype expr = Int of int
              | Bool of bool
              | Ident of ident
              | Let of string * expr * expr
              | Apply of expr * expr
              | Lambda of string * expr

The types follow naturally.

datatype ty = Int
            | Bool
            | Var of ident
            | Fun of ty * ty

Algorithm W

Algorithm W is a simple type checking algorithm, variants of which are used in Standard ML, Haskell and similar languages. It was first presented by Luis Damas and Robin Milner in 1982 and has since been extended in a multitude of ways.

In addition to the types defined above, we also define the following types for substitutions, type schemes and type environments.

type substitution = (ident * ty)

type tyScheme = ident list * ty

type tyEnv = (ident * tyScheme) list

Substitutions are the result of unification, which we'll describe in more detail later. For now, it's enough to know that a substitution is a mapping from a type variable to some type. Types will be familiar to most people, but type schemes and type environments deserve an elaboration.

In order to understand type schemes, let's use a motivating example that I have borrowed from Wikipedia1.

let foo = fn y => x
in foo

In this example, we cannot assign a meaningful type to foo without knowing the type of x. If we know that the type of x is Int, then foo's type is clearly α → Int, but what if foo is part of a larger definition, like the following example:

let bar = fn x =>
  let foo = fn y => x
  in foo
in bar

It might be tempting to say that foo has the type β → α, but that is only because we expect the type of bar to be α → β → α. However, there's clearly a difference between the α in foo's type and the α that appears in bar's type: The α in bar can be literally any type, but the α in foo can only be the same α that's in bar. We might say that the α in foo is bound by it's context, whereas the α in bar is free, even though both are type variables. The purpose of type schemes is simply to be able to distinguish between bound and free variables when describing the type of an expression. Therefore, a type scheme consists of a type and a list of the bound type variables in that type. We say that foo has the type ∀β . α → β 2.

Type environments are a bit easier to understand. Suppose you have an if statement where both branches use a variable x. Whatever we have inferred about the type of x from the first branch needs to be carried over to the type inference of the second branch. So a type environment is a list of mappings from identifiers to type schemes.

Free Variables

The first thing we need are functions to find the free variables in types, type schemes and type environments. A free variable in a type t is simply the list of all type variables that appear in t.

(* freevars : ty -> ident list
 * Lists the free variables in ty. *)
fun freevars Int = []
  | freevars Bool = []
  | freevars (Var id) = [id]
  | freevars (Fun (t1, t2)) = freevars t1 @ freevars t2

Because a type scheme are defined by a type t and a list of bound variables in that type, finding the free variables in a type scheme is simply a matter of subtracting the bound variables from the free variables in t.

(* freevarsScheme : tyScheme -> ident list
 * Find the free variables in the given type scheme. *)
fun freevarsScheme (ids, t) =
  freevars t except ids

Similarly, the free variables in a type environment is simply the union of the free variables in it's type schemes.

(* freevarsEnv : tyEnv -> ident list
 * Find the free variables in the given type environment. *)
fun freevarsEnv env =
  List.concat (map (fn (_, ts) => freevarsScheme ts) env)

Unifiers, Unification and Substitutions

A unifier tries to find substitutions that unify two types. For instance, the result of unifying a type int with a type α is the substitution (α, int). This is used to unify the type of a given function f with the type of a given argument α in a function application f a. The unification function mgu (most general unifier) achieves this with the help of varBind, which attempts to bind a variable to a type.

(* varBind : ident -> ty -> substitution list
 * Tries to bind ty to the type variable given by id.
 * If ty is a type variable, simply perform the binding, otherwise perform
 * the occurs check: If id is a free variable in ty, then we'll get a
 * recursive substitution, which we consider an error. *)
fun varBind id ty =
  case ty of
      Var id' => if id = id' then []
                else [(id, ty)]
    | _ => if (freevars ty) contains id then
             raise Fail "occurs check"
             [(id, ty)]

(* mgu : ty -> ty -> substitution list
 * Find the most general substitution that unifies t1 and t2. *)
fun mgu (Fun (l, r)) (Fun (l', r')) =
  let val s1 = mgu l l'
      val s2 = mgu (apply s1 r) (apply s1 r')
  in compose s1 s2
  | mgu (Var u) t = varBind u t
  | mgu t (Var u) = varBind u t
  | mgu Int Int = []
  | mgu Bool Bool = []
  | mgu _ _ = raise Fail "no unification found"

The unification itself is quite simple, but note that we do not allow recursive type definitions in varBind. That is, when unifying a type variable α with another type t, if α is a free variable in t, we raise an exception.

Type Inference

Before we move on to infer, let's quickly discuss instantiate and generalize. In short, generalize takes a type and binds all the free type variables except those that are also free in the environment, while instantiate takes a type scheme and returns a type where all the bound variables have been isntantiated to fresh type variables. We can consider them to be each others dual.

(* instantiate : tyScheme -> ty
 * Create a fresh type from a given type scheme. *)
fun instantiate (ids, ty) =
  let val subs = map (fn id => (id, getFresh ())) ids
  in apply subs ty end

(* generalize : tyEnv -> ty -> tyScheme
 * Generalizes a type into a type scheme in the given type environment.
   For instance, the concrete type `'a -> 'a`, generalizes to
   the type scheme `forAll 'a : 'a -> 'a`, but only if if `'a` is not
   also free in the type environment.
   Generalizing is the opposite of instantiating. *)
fun generalize env ty =
  (freevars ty except freevarsEnv env, ty)

Finally, we have infer, the actual type inference function.

(* infer : tyEnv -> expr -> substitution list * ty
 * Infer the type of an expression in the given type environment. *)
fun infer env (Expr.Lambda (x, e)) : (substitution list * ty) =
  let val tv = getFresh ()
      val env' = List.filter (fn (y, _) => x <> y) env
      val env'' = (x, ([], tv)) :: env'
      val (s1, t1) = infer env'' e
  in (s1, Fun (apply s1 tv, t1)) end
  | infer env (Expr.Apply (e1, e2)) =
    let val (s1, t1) = infer env e1
        val (s2, t2) = infer (applyEnv s1 env) e2
        val fresh = getFresh ()
        val s3 = mgu (apply s2 t1) (Fun (t2, fresh))
    in (compose s3 (compose s2 s1), apply s3 fresh)
  | infer env (Expr.Let (x, e1, e2)) =
    let val (s1, t1) = infer env e1
        val env' = List.filter (fn (y, _) => x <> y) env
        val t' = generalize (applyEnv s1 env) t1
        val env'' = (x, t') :: env'
        val (s2, t2) = infer (applyEnv s1 env'') e2
    in (compose s2 s1, t2) end
  | infer _ (Expr.Int _) = ([], Int)
  | infer _ (Expr.Bool _) = ([], Bool)
  | infer env (Expr.Ident x) =
    case lookup env x of
        SOME ts => ([], instantiate ts)
      | NONE => raise Fail "Unbound var"

Now what?

My plan is to slowly extend this language and its implementation in whatever way I find interesting. My first step might be to write a parser, but I'd like to experiment with subtyping, á la MLsub. At some point I'd also like to actually generate machine code, probably via LLVM, but that's probably a bit further off.

The whole project is available on GitHub. Comments, suggestions, pull requests and bug reports are more than welcome.



You'll notice the correspondence with the universal quantifier, which might make you wonder if there's a similar correspondence with existential types. Unfortunately, that's a bit beyond the scope of this post, but I hope to get back to it eventually.