# Loops and Ifs

## Reusing Allocations

Yesterday, I spent a long time wrapping my head around how to do the optimization for if-expressions that I wanted to. Today, I think I may finally have a solution that works for both if-expressions and for-expressions, while keeping everything nice and tight.

Let's start with my data types:

1: type LastUseMap = Map VName (Set VName) 2: 3: type InUse = Set VName 4: 5: type LastUsed = Set VName 6: 7: type MemGraph = Graph VName

In addition to the `LastUseMap`

, `InUse`

and the `MemGraph`

I introduced
yesterday, I now also want to keep track of `LastUsed`

. This is the set of
memory blocks that have reached their last use in a statement of a body.

Now, contrary to earlier versions of this algorithm, I'll actually have to
describe in more detail how `analyseBody`

will work. Here is the pseudocode:

`analyseBody :: LastUseMap -> Stms -> (InUse, LastUsed, MemGraph, Stms)`

`analyseBody lumap stms`

=

let `inuse`

= ∅

let `graph`

= ∅

let `lus`

= ∅

let `stms'`

= `[]`

for each `stm`

in `stms`

:

let (`inuse`

, `lus'`

, `graph`

, `stm'`

) = `analyseStm lumap inuse graph stm`

let `lus`

= `lus`

∪ `lus'`

push `stm'`

to `stms'`

return (`inuse`

`lus`

, `graph`

, `stms'`

)

In particular, any time we start analysing a new body, we'll start with a fresh
`graph`

and `inuse`

. That means we have to unify the resulting graph with the
previous graph after the call to `analyseBody`

, but it will allow us to discern
between what is used in different branches of an if-expression.

Now, let's take a look at `analyseStm`

^{1}:

`analyseStm :: LastUseMap -> InUse -> MemGraph -> Stm -> (InUse, LastUsed, MemGraph, Stm)`

`analyseStm`

`lu_map`

`inuse`

`graph`

(let `p`

= `exp`

) =

let (`inuse_new`

, `lus_new`

, `graph_new`

, `stm_new`

) =

if `exp`

is a loop with body `b`

then

let (`inuse'`

, `lus`

, `graph'`

, `b'`

) = `analyseBody`

`lu_map`

`b`

let `graph''`

= `graph`

∪ `graph'`

∪ (`inuse`

↔ (`inuse'`

∪ `lus`

))

let `inuse''`

= `inuse`

∖ `lus`

∪ `inuse'`

in (`inuse''`

, `lus`

, `graph''`

, `graph'`

, let `p`

= `exp`

with `b'`

)

else if `exp`

is a if with bodies `b1`

, `b2`

then

let (`inuse1`

, `lus1`

, `g1`

, `b1'`

) = `analyseBody`

`lu_map`

`b1`

let (`inuse2`

, `lus2`

, `g2`

, `b2'`

) = `analyseBody`

`lu_map`

`b2`

let `lus`

= `lus1`

∪ `lus2`

let `inuse'`

= (`inuse`

∪ `inuse1`

∪ `inuse2`

) ∖ `lus`

let `g`

= `graph`

∪ `g1`

∪ `g2`

∪ ((`inuse1`

∪ `lus1`

) ↔ (`inuse`

∖ (`lus2`

∖ `lus1`

)))

∪ ((`inuse2`

∪ `lus2`

) ↔ (`inuse`

∖ (`lus1`

∖ `lus2`

)))

in (`inuse'`

, `lus`

, `g`

, let `p`

= `exp`

with `b1'`

and `b2'`

)

else if `exp`

is a kernel call with a body `b`

then

same as loop†

else

let `lus`

= lookup `p`

in `lu_map`

let `lus_mems`

= memory blocks referenced in `lus`

let `inuse'`

= `inuse`

∖ `lus_mems`

in (`inuse''`

, `lus_mems`

, `graph`

, `stm`

)

let `mems`

= memory blocks referenced in `p`

let `inuse_end`

= `inuse_new`

∪ `mems`

let `graph_end`

= `graph_new`

∪ (`inuse_end`

↔ `inuse_end`

)

in (`inuse_end`

, `lus_new`

, `graph_end`

, `stm_new`

)

†: Just as we can get in trouble if we reuse an external memory block inside a loop, I suspect I can get in similar trouble if I do the same inside a kernel. If not, we can simplify a bit here.

There's a lot going on here. First of all, because we're now looking at memory
initalizations instead of memory allocations, we need to look at every
expression to see if a new memory block is introduced in the pattern. After
handling the specific type of expression, we add any new memory blocks to the
`InUse`

set and update the graph accordingly.

Secondly, when handling loops, we add edges between any outside in-use memory
blocks and all the memory blocks used within the loop body after procesing the
loop body itself. Additionally, the resulting `InUse`

is computed by removing
any `LastUsed`

from the previous `InUse`

set and adding the blocks from the loop
body `InUse`

set instead.

When handling if-expressions, we analyse each branch body separately, with fresh
graphs and `InUse`

sets. The resulting `LastUsed`

set is the union of the
`LastUsed`

from each branch. The resulting `InUse`

is the union of the previous
`InUse`

and the `InUse`

from each branch, minus any memory blocks in the updated
`LastUsed`

. Remember, that if a memory block is last-used in one branch of an
if, it is either last-used or not used at all in the other branch, and as long
as one branch contains a last-use of a kernel block, it is out of use after the
if-expression. The updated graph is a little bit more complicated to
compute. First, remember that in this case we can think of a graph simply as a
set of edges. The resulting graph is at least the union of the previous graph
and the graphs from each branch. But we also want to add edges between all
memory block that was in use before the if-expression and all memory blocks that
were used inside the expression, *except* for memory blocks that are last-used
in one branch but not another. In those cases, we want to add edges in the
branch that has the last-use, but not the other. For instance, to compute the
edges between outside memory blocks and memory blocks in branch 1, we take the
set difference between the last-uses of branch 2 and branch 1: if there is a
memory block that is last-used in branch 2 but not in branch 1 it will be part
of that set. Then that set is subtracted from the previous `InUse`

to get all
the outside memory blocks that we want to add edges to for branch 1. Finally
edges those memory blocks and all memory blocks used in branch 1 are created.

That was a mouthful, but I think it should work.

Let's also try to write an updated `optimiseStm`

. First, our `Allocs`

1: type Allocs = Set (VName, SubExp, Space)

`optimiseStm :: MemGraph -> Allocs -> Stm -> (MemGraph, Allocs, Stm)`

`optimiseStm`

`graph`

`allocs`

(let `p`

= `exp`

) =

if `exp`

is an allocation then

if there is a memory block `x`

in `allocs`

with the right size and space, there is no edge between `x`

and `p`

in `graph`

then

let `graph'`

= `graph`

with nodes for `x`

and `p`

merged under the name `x`

return (`graph'`

, `allocs`

, let `p`

= `x`

)

else

return (`graph`

, `allocs`

with (`p`

, size of allocation, space), let `p`

= `exp`

)

else if `exp`

is an if with branches `b1`

, `b2`

then

let (`graph'`

, `allocs'`

, `b1'`

) = `optimiseBody`

`graph`

`allocs`

`b1`

let (`graph''`

, `allocs''`

, `b2'`

) = `optimiseBody`

`graph'`

`allocs'`

`b2`

return (`graph''`

, `allocs''`

, let `p`

= `exp`

with `b1'`

, `b2'`

)

else if `exp`

contains a body of stms (ie. introduces a scope) `b`

then

let (`graph'`

, `allocs'`

, `b'`

) = `optimiseBody`

`graph`

`allocs`

`b`

return (`graph'`

, `allocs'`

, let `p`

= `exp`

with `b'`

)

else

return (`graph`

, `allocs`

, let `p`

= `exp`

)

## More complications

After a short talk with Troels, some additional information has come to light:
First of all, inside kernels, we can make all memory blocks the same size, if we
want to, simply by allocating `max`

of the sizes in question. Therefore, the
task in `optimiseStm`

isn't so much about finding other blocks of the right
size, it's more about finding out which memory blocks we want to merge to get
the minimal number of allocations.

Second of all, because uses of memory blocks inside kernels do not necessarily have the same size, we have to be careful when merging them. Therefore, Troels suggested using two passes: One that only looks at kernels, which is not re-entrant and which runs before allocations are hoisted, and one that runs on the rest of the program and which can run afterwards.

Wait, I don't understand this at all…

## Expand Allocations

Okay, here's the problem: The expand allocations pass creates weird index functions inside kernels that are hard to reason about.

`ExpandAllocations`

is a pass that unifies memory allocations across
kernels. Consider the following kernel:

1: SegMap_n { 2: let xs_mem = alloc x 3: let xs@xs_mem = ... 4: ... 5: let ys_mem = alloc y 6: let ys@ys_mem = ... 7: }

The allocations inside the SegMap (which maps across `n`

threads of size `n`

),
are completely independent of each other, and my optimisation would possibly be
able to merge them. However, if our pass happens after the `ExpandAllocations`

pass, the code will look like this:

1: let xs_mem = alloc (x * n) 2: let ys_mem = alloc (y * n) 3: 4: SegMap_n { 5: let xs@xs_mem = ... 6: ... 7: let ys@ys_mem = ... 8: }

where `xs`

and `ys`

have weird index functions that offsets them within their
respective memory allocations. Now, the threads are not independent of each
other, and in particular if `ys`

has different stride than `xs`

, merging
`xs_mem`

and `ys_mem`

could cause writes to `ys`

to overwrite different parts of
`xs`

than just those that have gone out of scope in this thread.

By doing our optimisation before the `ExpandAllocations`

pass, and only on
kernels, we can merge those allocations before-hand, and let the
`ExpandAllocations`

pass worry about generalising the index functions.

Another problem is kernels with access to global memory or manual indexing. If any of that happens, we probably need to avoid merging those blocks, at least for now, because it's hard to reason about safety. Therefore, our pass should only concern itself with memory allocations and array initializations that take place within the kernel in question.

## Meeting

Meetings change everything, every time! I have to be conservative in loops and maps and such. Because there can be all sorts of manual indexing and global array accesses inside loops, maps, and so on, we probably have to have any memory blocks inside the body interfere with the result of the body.

Cosmin pointed out that OptionPricing.fut is the benchmark I'm really trying to get at, and for that we may need some heuristics for handling indexing inside loops/kernels, but for now I think it's better to keep it simple.

The other point that came out is that we all that kernels are the most
interesting part, and so I should try to focus on optimising them first. It was
also accepted that my kernel pass will happen before `ExpandAllocations`

, which
will help immensely.

So, what's the plan now? Write a new algorithm, which only looks at kernels, and which tries to build the interference graph for it.

## Footnotes:

^{1}

U ↔ V here is the graph with edges from each element in U to all elements in V. Think cartesian product.