Loops and Ifs
Reusing Allocations
Yesterday, I spent a long time wrapping my head around how to do the optimization for if-expressions that I wanted to. Today, I think I may finally have a solution that works for both if-expressions and for-expressions, while keeping everything nice and tight.
Let's start with my data types:
type LastUseMap = Map VName (Set VName)

type InUse = Set VName

type LastUsed = Set VName

type MemGraph = Graph VName
In addition to the LastUseMap, InUse, and MemGraph I introduced
yesterday, I now also want to keep track of LastUsed. This is the set of
memory blocks that have reached their last use in a statement of a body.
Now, unlike with earlier versions of this algorithm, I'll actually have to
describe in more detail how analyseBody works. Here is the pseudocode:
analyseBody :: LastUseMap -> Stms -> (InUse, LastUsed, MemGraph, Stms)
analyseBody lu_map stms =
  let inuse = ∅
  let graph = ∅
  let lus = ∅
  let stms' = []
  for each stm in stms:
    let (inuse, lus', graph, stm') = analyseStm lu_map inuse graph stm
    let lus = lus ∪ lus'
    push stm' to stms'
  return (inuse, lus, graph, stms')
In particular, any time we start analysing a new body, we'll start with a fresh
graph and a fresh inuse set. That means we have to unify the resulting graph with the
previous graph after the call to analyseBody, but it will allow us to distinguish
between what is used in different branches of an if-expression.
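As a rough Haskell sketch of that folding structure, this is roughly what I have in mind. The types here are hypothetical stand-ins (VName as a string, MemGraph as a plain set of edges, Stm and analyseStm as placeholders); the real compiler types will differ, and the real analyseStm is described in pseudocode further down.

import qualified Data.Map as M
import qualified Data.Set as S

-- Hypothetical stand-ins for the real compiler types, only for illustration.
type VName      = String
type LastUseMap = M.Map VName (S.Set VName)
type InUse      = S.Set VName
type LastUsed   = S.Set VName
type MemGraph   = S.Set (VName, VName)
data Stm        = Stm            -- placeholder for real statements

-- Placeholder with the intended type; the real analyseStm is sketched below.
analyseStm :: LastUseMap -> InUse -> MemGraph -> Stm
           -> (InUse, LastUsed, MemGraph, Stm)
analyseStm _ inuse graph stm = (inuse, S.empty, graph, stm)

-- Start every body with a fresh InUse set and a fresh graph, thread them
-- through the statements, and accumulate the union of all LastUsed sets.
analyseBody :: LastUseMap -> [Stm] -> (InUse, LastUsed, MemGraph, [Stm])
analyseBody lu_map = foldl step (S.empty, S.empty, S.empty, [])
  where
    step (inuse, lus, graph, stms') stm =
      let (inuse', lus', graph', stm') = analyseStm lu_map inuse graph stm
      in (inuse', lus `S.union` lus', graph', stms' ++ [stm'])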
Now, let's take a look at analyseStm¹:
analyseStm :: LastUseMap -> InUse -> MemGraph -> Stm -> (InUse, LastUsed, MemGraph, Stm)
analyseStm lu_map inuse graph (let p = exp) =
  let (inuse_new, lus_new, graph_new, stm_new) =
    if exp is a loop with body b then
      let (inuse', lus, graph', b') = analyseBody lu_map b
      let graph'' = graph ∪ graph' ∪ (inuse ↔ (inuse' ∪ lus))
      let inuse'' = inuse ∖ lus ∪ inuse'
      in (inuse'', lus, graph'', let p = exp with b')
    else if exp is an if with bodies b1, b2 then
      let (inuse1, lus1, g1, b1') = analyseBody lu_map b1
      let (inuse2, lus2, g2, b2') = analyseBody lu_map b2
      let lus = lus1 ∪ lus2
      let inuse' = (inuse ∪ inuse1 ∪ inuse2) ∖ lus
      let g = graph ∪ g1 ∪ g2
              ∪ ((inuse1 ∪ lus1) ↔ (inuse ∖ (lus2 ∖ lus1)))
              ∪ ((inuse2 ∪ lus2) ↔ (inuse ∖ (lus1 ∖ lus2)))
      in (inuse', lus, g, let p = exp with b1' and b2')
    else if exp is a kernel call with a body b then
      same as loop†
    else
      let lus = lookup p in lu_map
      let lus_mems = memory blocks referenced in lus
      let inuse' = inuse ∖ lus_mems
      in (inuse', lus_mems, graph, stm)
  let mems = memory blocks referenced in p
  let inuse_end = inuse_new ∪ mems
  let graph_end = graph_new ∪ (inuse_end ↔ inuse_end)
  in (inuse_end, lus_new, graph_end, stm_new)
†: Just as we can get into trouble if we reuse an external memory block inside a loop, I suspect we can get into similar trouble if we do the same inside a kernel. If not, we can simplify a bit here.
There's a lot going on here. First of all, because we're now looking at memory
initializations instead of memory allocations, we need to look at every
expression to see if a new memory block is introduced in the pattern. After
handling the specific type of expression, we add any new memory blocks to the
InUse set and update the graph accordingly.
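As a tiny illustration of that end-of-statement bookkeeping (made-up block names, graph as a plain set of edges): if a statement's pattern introduces a block c while a and b are already in use, the statement leaves all three mutually connected.

import qualified Data.Set as S

-- Edges from every element of us to every element of vs; the meaningless
-- self-pairs are dropped.
edges :: S.Set String -> S.Set String -> S.Set (String, String)
edges us vs = S.fromList [(u, v) | u <- S.toList us, v <- S.toList vs, u /= v]

endOfStmExample :: S.Set (String, String)
endOfStmExample =
  let inuse_new = S.fromList ["a", "b"]     -- in use after handling exp
      mems      = S.fromList ["c"]          -- block introduced in the pattern
      inuse_end = inuse_new `S.union` mems  -- {a, b, c}
  in edges inuse_end inuse_end              -- a↔b, a↔c, b↔c (both directions)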
Secondly, when handling loops, we add edges between any outside in-use memory
blocks and all the memory blocks used within the loop body, after processing the
loop body itself. Additionally, the resulting InUse is computed by removing
any LastUsed blocks from the previous InUse set and adding the blocks from the loop
body's InUse set instead.
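To make the loop case concrete, here is a small made-up example of those two computations (invented block names, graph again as a plain set of edges):

import qualified Data.Set as S

edges :: S.Set String -> S.Set String -> S.Set (String, String)
edges us vs = S.fromList [(u, v) | u <- S.toList us, v <- S.toList vs, u /= v]

-- Before the loop, blocks a and b are in use; the loop body ends with c in
-- use and marks b as last-used.
loopExample :: (S.Set (String, String), S.Set String)
loopExample =
  let inuse    = S.fromList ["a", "b"]                      -- outside InUse
      inuse'   = S.fromList ["c"]                           -- InUse after the loop body
      lus      = S.fromList ["b"]                           -- LastUsed in the loop body
      newEdges = edges inuse (inuse' `S.union` lus)         -- a↔b, a↔c, b↔c
      inuse''  = (inuse `S.difference` lus) `S.union` inuse'  -- {a, c}
  in (newEdges, inuse'')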
When handling if-expressions, we analyse each branch body separately, with fresh
graphs and InUse sets. The resulting LastUsed set is the union of the
LastUsed from each branch. The resulting InUse is the union of the previous
InUse and the InUse from each branch, minus any memory blocks in the updated
LastUsed. Remember that if a memory block is last-used in one branch of an
if, it is either last-used or not used at all in the other branch, and as long
as one branch contains a last use of a memory block, it is out of use after the
if-expression.

The updated graph is a little more complicated to compute. First, remember that
in this case we can think of a graph simply as a set of edges. The resulting
graph is at least the union of the previous graph and the graphs from each
branch. But we also want to add edges between all memory blocks that were in use
before the if-expression and all memory blocks that were used inside the
expression, except for memory blocks that are last-used in one branch but not
the other. In those cases, we want to add edges in the branch that has the
last use, but not in the other. For instance, to compute the edges between
outside memory blocks and memory blocks in branch 1, we take the set difference
between the last-uses of branch 2 and branch 1: if there is a memory block that
is last-used in branch 2 but not in branch 1, it will be part of that set. Then
that set is subtracted from the previous InUse to get all the outside memory
blocks that we want to add edges to for branch 1. Finally, edges between those
memory blocks and all memory blocks used in branch 1 are created.

That was a mouthful, but I think it should work.
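A small made-up example may help illustrate the asymmetry in those two edge sets (again treating the graph as a plain set of edges):

import qualified Data.Set as S

edges :: S.Set String -> S.Set String -> S.Set (String, String)
edges us vs = S.fromList [(u, v) | u <- S.toList us, v <- S.toList vs, u /= v]

-- Before the if, blocks a and b are in use.  Branch 1 last-uses b and ends
-- with x in use; branch 2 does not touch b at all and ends with y in use.
ifExample :: (S.Set (String, String), S.Set String)
ifExample =
  let inuse  = S.fromList ["a", "b"]
      inuse1 = S.fromList ["x"]
      lus1   = S.fromList ["b"]
      inuse2 = S.fromList ["y"]
      lus2   = S.empty
      lus    = lus1 `S.union` lus2                                  -- {b}
      inuse' = S.unions [inuse, inuse1, inuse2] `S.difference` lus  -- {a, x, y}
      -- Branch 1: nothing is last-used only in branch 2, so branch 1
      -- interferes with all of {a, b}: edges x↔a, x↔b, b↔a.
      g1 = edges (inuse1 `S.union` lus1)
                 (inuse `S.difference` (lus2 `S.difference` lus1))
      -- Branch 2: b is last-used only in branch 1, so branch 2 only
      -- interferes with {a}: edge y↔a.  Reusing b's memory in branch 2 is
      -- fine, because that branch never reads b.
      g2 = edges (inuse2 `S.union` lus2)
                 (inuse `S.difference` (lus1 `S.difference` lus2))
  in (g1 `S.union` g2, inuse')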
Let's also try to write an updated optimiseStm. First, our Allocs type:
type Allocs = Set (VName, SubExp, Space)
optimiseStm :: MemGraph -> Allocs -> Stm -> (MemGraph, Allocs, Stm)
optimiseStm graph allocs (let p = exp) =
  if exp is an allocation then
    if there is a memory block x in allocs with the right size and space,
       and there is no edge between x and p in graph then
      let graph' = graph with the nodes for x and p merged under the name x
      return (graph', allocs, let p = x)
    else
      return (graph, allocs with (p, size of allocation, space), let p = exp)
  else if exp is an if with branches b1, b2 then
    let (graph', allocs', b1') = optimiseBody graph allocs b1
    let (graph'', allocs'', b2') = optimiseBody graph' allocs' b2
    return (graph'', allocs'', let p = exp with b1', b2')
  else if exp contains a body of stms b (i.e. introduces a scope) then
    let (graph', allocs', b') = optimiseBody graph allocs b
    return (graph', allocs', let p = exp with b')
  else
    return (graph, allocs, let p = exp)
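The "merge the nodes for x and p" step could be as simple as renaming one node into the other, assuming the graph really is just a set of edges. This is a hypothetical helper, not the compiler's actual graph representation, and p here stands for the memory block name bound by the pattern:

import qualified Data.Set as S

type VName    = String
type MemGraph = S.Set (VName, VName)

-- Redirect every edge touching p to x instead, so later statements see the
-- combined interference of both blocks under the single name x.
mergeNodes :: VName -> VName -> MemGraph -> MemGraph
mergeNodes x p = S.filter (uncurry (/=)) . S.map rename
  where
    rename (a, b) = (sub a, sub b)
    sub v | v == p    = x
          | otherwise = v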
More complications
After a short talk with Troels, some additional information has come to light:
First of all, inside kernels we can make all memory blocks the same size, if we
want to, simply by allocating the max of the sizes in question. Therefore, the
task in optimiseStm isn't so much about finding other blocks of the right
size; it's more about finding out which memory blocks we want to merge to get
the minimal number of allocations.
Second of all, because uses of memory blocks inside kernels do not necessarily have the same size, we have to be careful when merging them. Therefore, Troels suggested using two passes: One that only looks at kernels, which is not re-entrant and which runs before allocations are hoisted, and one that runs on the rest of the program and which can run afterwards.
Wait, I don't understand this at all…
Expand Allocations
Okay, here's the problem: The expand allocations pass creates weird index functions inside kernels that are hard to reason about.
ExpandAllocations is a pass that unifies memory allocations across
kernels. Consider the following kernel:
SegMap_n {
  let xs_mem = alloc x
  let xs@xs_mem = ...
  ...
  let ys_mem = alloc y
  let ys@ys_mem = ...
}
The allocations inside the SegMap (which maps across n threads)
are completely independent of each other, and my optimisation would possibly be
able to merge them. However, if our pass happens after the ExpandAllocations
pass, the code will look like this:
let xs_mem = alloc (x * n)
let ys_mem = alloc (y * n)

SegMap_n {
  let xs@xs_mem = ...
  ...
  let ys@ys_mem = ...
}
where xs and ys have weird index functions that offset them within their
respective memory allocations. Now, the threads are not independent of each
other, and in particular, if ys has a different stride than xs, merging
xs_mem and ys_mem could cause writes to ys to overwrite different parts of
xs than just those that have gone out of scope in this thread.
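To illustrate the danger with some made-up numbers, assume (purely for illustration) a blocked layout where thread t's part of an expanded block with per-thread size s occupies bytes [t·s, (t+1)·s). If xs_mem and ys_mem were merged after expansion, a ys write in one thread could land inside another thread's still-live xs data whenever x ≠ y:

-- Does thread t's ys slice overlap thread u's xs slice, if the two expanded
-- blocks shared one allocation?  (x and y are the per-thread sizes.)
overlaps :: Int -> Int -> Int -> Int -> Bool
overlaps x y t u =
  let (ysLo, ysHi) = (t * y, (t + 1) * y)   -- thread t's ys slice
      (xsLo, xsHi) = (u * x, (u + 1) * x)   -- thread u's xs slice
  in ysLo < xsHi && xsLo < ysHi

-- For example, with x = 4 and y = 8, thread 1's ys slice [8, 16) overlaps
-- thread 2's xs slice [8, 12):
--   overlaps 4 8 1 2 == True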
By doing our optimisation before the ExpandAllocations pass, and only on
kernels, we can merge those allocations beforehand, and let the
ExpandAllocations pass worry about generalising the index functions.
Another problem is kernels with access to global memory or manual indexing. If any of that happens, we probably need to avoid merging those blocks, at least for now, because it's hard to reason about safety. Therefore, our pass should only concern itself with memory allocations and array initializations that take place within the kernel in question.
Meeting
Meetings change everything, every time! I have to be conservative in loops and maps and such. Because there can be all sorts of manual indexing and global array accesses inside loops, maps, and so on, we probably have to make any memory blocks inside the body interfere with the result of the body.
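A sketch of what that conservative rule could mean in graph terms (made-up helper, graph as a set of edges): every block touched in the body gets an edge to every block backing the body's result.

import qualified Data.Set as S

type VName    = String
type MemGraph = S.Set (VName, VName)

-- Conservatively connect every memory block used in a body to every memory
-- block backing that body's result.
conservativeBody :: MemGraph -> S.Set VName -> S.Set VName -> MemGraph
conservativeBody graph body_mems result_mems =
  graph `S.union`
    S.fromList [ (b, r) | b <- S.toList body_mems
                        , r <- S.toList result_mems
                        , b /= r ]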
Cosmin pointed out that OptionPricing.fut is the benchmark I'm really trying to get at, and for that we may need some heuristics for handling indexing inside loops/kernels, but for now I think it's better to keep it simple.
The other point that came out is that we all agree that kernels are the most
interesting part, and so I should try to focus on optimising them first. It was
also accepted that my kernel pass will happen before ExpandAllocations, which
will help immensely.
So, what's the plan now? Write a new algorithm, which only looks at kernels, and which tries to build the interference graph for it.
Footnotes:
¹: U ↔ V here is the graph with edges from each element in U to all elements in V. Think Cartesian product.
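In Haskell terms, with the graph as a set of edges, the operator could be written roughly like this (using <-> as an ASCII stand-in for ↔; the self-pairs carry no information for an interference graph, so they are dropped):

import qualified Data.Set as S

(<->) :: Ord a => S.Set a -> S.Set a -> S.Set (a, a)
u <-> v = S.fromList [ (a, b) | a <- S.toList u, b <- S.toList v, a /= b ]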