Reusing Allocations, redux

The problems with the current approach

Yesterday, I presented my work on the linear scan algorithm to reuse memory allocations, and Troels and Cosmin rightly pointed out some shortcomings in my approach.

The overall goal is to be able to reuse memory allocations. For that, we want a pass which can be inserted whenever we want to in the compiler pipeline (after kernels and memory has been introduced, of course), and we want to be able to run it multiple times. As such, the approach that I had was too brittle. It made (at least) two assumptions that are too broad:

  1. Memory allocations are always immediately succeeded1 by the first use of the allocated memory block, as part of the instantiation of an array.
  2. Memory allocations are only used for the creation of one array.

The first assumption is invalid in the face of hoisting. It turns out that some programs, OptionPricing.fut in particular, hoists all allocations to outside the kernel in such a way that my algorithm didn't work. Consider the following example, written by Troels:

 1: let xs_mem = alloc ...
 2: let ys_mem = alloc ...
 3: ...
 4: let xs@xs_mem = ...
 5: ...
 6: ... last use of xs ...
 7: ...
 8: let ys@ys_mem = ...
 9: ...
10: in ys -- or whatever

My algorithm would assume that the allocation point of a memory block is the same as the first use of said memory block, meaning that these allocations would never be reused. We need a more sophisticated algorithm, where the allocation points are distinct from the first-use points of a memory block.

The other assumption fails whenever we've already performed our optimisation pass once, some code is moved around by the compiler (the Futhark compiler can do that a lot) and now we want to perform another optimisation pass. For instance, consider the following IR-like code representing what an optimised and transformed program could look like:

 1: let xs_mem = alloc ...
 2: ...
 3: let xs@xs_mem = ...
 4: ...
 5: ... last use of xs ...
 6: ...
 7: let zs_mem = alloc ...
 8: let zs@zs_mem = ...
 9: ...
10: let ys@xs_mem = ...
11: ...
12: ... last use of zs
13: ...
14: in ys -- or whatever

The algorithm as I'd written it, would readily put zs in xs_mem, because after the last use of xs, xs_mem would be put in the free list. In principle, we could probably fix our analysis to correctly state that the last use of xs_mem is actually much later (at the last use of ys), but that introduces a new problem:

 1: let xs_mem = alloc ...
 2: ...
 3: let xs@xs_mem = ...
 4: ...
 5: ... last use of xs ...
 6: ...
 7: let zs_mem = alloc ...
 8: let zs@zs_mem = ...
 9: ...
10: ... last use of zs
11: ...
12: let ys@xs_mem = ...
13: ...
14: in ys -- or whatever

Here, zs can readily be put in xs_mem, because there's no overlap of the two use-intervals of xs_mem and zs_mem. But out hypothetical analysis would indicate that the first use of xs_mem is at the allocation point, or perhaps when xs is being instantiated, and the last use is whenever ys is last used, which means that zs will not be put in xs_mem even though we can clearly see that it could be.

So, we end up with some new assumptions for the algorithm:

  1. Allocations of memory blocks are distinct from instantiation of arrays in those memory blocks.
  2. A memory block can have multiple use-intervals.

Which requires us to rethink our algorithm substantially.

What do we need?

The first thing we need to do, is decouple allocations from first-uses. This isn't necessarily hard to do, but it does require us to carry some extra state around. The solution probably is to still keep track of allocations, but in addition, we need to check each statement to see if it introduces an array. If that's the case, we look up the memory allocation and add it to the in-use list.

The next thing we need to do, is to decouple memory block merging from the analysis pass. In the first version of the algorithm, memory block merging was done in the same pass as the analysis. But when allocations can happen way in advance of their actual use, there's no way to go back and change the earlier memory blocks without a pass. Therefore, I think we need to do to passes: One to analyse memory block usage and find their use-intervals, and one to actually "merge memory blocks" (eg. inserting the statement let ys_mem = xs_mem instead of let ys_mem = alloc ...).

The final piece of the puzzle is, how do we represent the multiple use-intervals of memory blocks? I think line/statement number integer intervals are brittle and don't translate well to Futhark (how are the different branches of an if statement translated into line numbers?). When doing the initial algorithm, we instead used first variable name in the statement pattern as a unique identifier for each list. However, the variable names do not have an obvious ordering, so simply representing intervals as \((firstStm, lastStm)\) won't allow us to intersect intervals. If we're to use variable names as identifiers for statements, we have to think of some other way to represent intervals.

Thankfully, it can be done! Because each statement has a unique identifier (the first variable name in the pattern), a statement interval can be represented as the set of statements in that interval, represented by their unique identifier. Two intervals are just represented as a bigger set.

 1: let xs_mem = alloc ...
 2: let xs@xs_mem = ...
 3: let xs_1 = ...
 4: let xs_2 = ...
 5: let xs_res = ... -- last use of xs
 6: let zs_mem = alloc ...
 7: let zs@zs_mem = ...
 8: let zs_1 = ...
 9: let zs_2 = ...
10: let zs_res = ... -- last use of zs
11: let ys@xs_mem = ...
12: let ys_1 = ...
13: let ys_2 = ...
14: in ys -- or whatever

Here, the use-intervals of xs_mem would be represented as the set {xs, xs_1, xs_2, xs_res, ys, ys_1, ys_2}, while the use-intervals of zs_mem would be represented as the set {zs, zs_1, zs_2, zs_res}. Since those two sets are disjoint, we can merge their allocations by replacing line 6 with the statement let zs_mem = xs_mem.

In fact, perhaps we don't even need to keep track of memory block sizes (Allocs) until the second pass, where we merge the memory blocks?

Psuedo-code, part deux

Let's start with some types.

1: type Allocs = Map VName SubExp
2: 
3: type LastUseMap = Map VName (Set VName)
4: 
5: type InUse = Set VName
6: 
7: type UseMap = Map VName (Set VName)

Allocs is still a map from a memory block to a size, while LastUseMap is a map from a statement to a set of variables that are no longer used after that statement. InUse is used while walking through the program collecting live-intervals, and while processing a given statement it is the set of memory blocks currently instantiated to an array that is still in use. Finally, UseMap is a map from a memory block name to the set of statements in which that memory block is in use. The first algorithm, analyseStms, which has the purpose of populating the UseMap looks like this:

analyseStm :: LastUseMap -> InUse -> UseMap -> Stm -> (InUse, UseMap, Stm)
analyseStm lu_map inuse usemap (let p = exp) =
  if exp contains a body of stms (ie. introduces a scope) b then
    let (inuse', usemap', b') = analyseBody lu_map inuse usemap b
    return (inuse', usemap', let p = exp with b')
  else
    let mems = memory blocks referenced in p
    let inuse' = inusemems
    let lus = lookup p in lu_map
    let lus_mems = memory blocks referenced in lus
    let inuse'' = inuse'lus_mems
    let usemap' = usemap where for every name in inuse'', p is added to the set in the map identified by name

At the end, we'll have a map of memory blocks to the set statements in which they are in use. Now, to perform the optimisation, we call optimiseStm:

optimiseStm :: UseMap -> Allocs -> Stm -> (UseMap, Allocs, Stm)
optimiseStm usemap allocs (let p = exp) =
  if exp is an allocation then
    if there is a memory block x in allocs with the right size, and the use set for x in usemap does not overlap with the use set for p then
      let usemap' = usemap with the use set for x and the use set for p merged under the name x
      return (usemap' , allocs, let p = x)
    else
      return (usemap, allocs with (p, size of allocation), let p = exp)
  else if exp contains a body of stms (ie. introduces a scope) b then
    let (usemap', allocs', b') = optimiseBody usemap allocs b
    return (usemap', allocs', let p = exp with b')
  else
    return (usemap, allocs, let p = exp)

The important bit here is the check for size and use overlap, as well as the merge of use sets. I think it should work even for the programs and situations I described above. I'll have to think about it some more, and perhaps try my hand at implementing it, but now I think I'll stop for today, look at ICFP and enjoy my weekend.

See you on Monday!

Footnotes:

1

At least in relation to other allocations and array instantiations.