Loops and Ifs
Reusing Allocations
Yesterday, I spent a long time wrapping my head around how to do the optimisation for if-expressions that I wanted. Today, I think I may finally have a solution that works for both if-expressions and for-expressions, while keeping everything nice and tight.
Let's start with my data types:
    type LastUseMap = Map VName (Set VName)

    type InUse = Set VName

    type LastUsed = Set VName

    type MemGraph = Graph VName
In addition to the LastUseMap, InUse and the MemGraph I introduced yesterday, I now also want to keep track of LastUsed. This is the set of memory blocks that have reached their last use in a statement of a body.
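To keep the notation concrete, here is a minimal Haskell sketch of these types, with a tiny example value. I'm assuming for illustration that VName is just a string and that the graph is an adjacency map; this is not the actual compiler representation.

    import qualified Data.Map.Strict as M
    import qualified Data.Set as S

    -- Illustrative stand-ins; the compiler has its own VName and graph types.
    type VName      = String
    type LastUseMap = M.Map VName (S.Set VName)  -- pattern name -> names last used in that statement
    type InUse      = S.Set VName                -- memory blocks currently in use
    type LastUsed   = S.Set VName                -- memory blocks that have reached their last use
    type MemGraph   = M.Map VName (S.Set VName)  -- interference edges, stored in both directions

    -- Example: the statement binding "res" is the last place "xs" and "ys" are used.
    exampleLastUses :: LastUseMap
    exampleLastUses = M.fromList [("res", S.fromList ["xs", "ys"])]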
Now, contrary to earlier versions of this algorithm, I'll actually have to describe in more detail how analyseBody will work. Here is the pseudocode:
    analyseBody :: LastUseMap -> Stms -> (InUse, LastUsed, MemGraph, Stms)
    analyseBody lumap stms =
      let inuse = ∅
      let graph = ∅
      let lus = ∅
      let stms' = []
      for each stm in stms:
        let (inuse, lus', graph, stm') = analyseStm lumap inuse graph stm
        let lus = lus ∪ lus'
        push stm' to stms'
      return (inuse, lus, graph, stms')
In particular, any time we start analysing a new body, we'll start with a fresh graph and inuse. That means we have to unify the resulting graph with the previous graph after the call to analyseBody, but it will allow us to distinguish between what is used in the different branches of an if-expression.
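Since a graph here is really just a set of edges, the only operations the pseudocode needs are graph union and the ↔ product from the footnote. A sketch of both, reusing the illustrative types (and the Data.Map/Data.Set imports) from above:

    -- The fresh, empty graph that every call to analyseBody starts from.
    emptyGraph :: MemGraph
    emptyGraph = M.empty

    -- Union of two graphs: a block interferes with everything it interferes
    -- with in either graph.
    unionGraphs :: MemGraph -> MemGraph -> MemGraph
    unionGraphs = M.unionWith S.union

    -- u ↔ v: an edge from every element of u to every element of v (and back),
    -- i.e. the cartesian product from the footnote.  A real implementation
    -- would probably want to filter out self-edges.
    cross :: S.Set VName -> S.Set VName -> MemGraph
    cross us vs = unionGraphs (M.fromSet (const vs) us) (M.fromSet (const us) vs)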
Now, let's take a look at analyseStm:
    analyseStm :: LastUseMap -> InUse -> MemGraph -> Stm -> (InUse, LastUsed, MemGraph, Stm)
    analyseStm lu_map inuse graph (let p = exp) =
      let (inuse_new, lus_new, graph_new, stm_new) =
        if exp is a loop with body b then
          let (inuse', lus, graph', b') = analyseBody lu_map b
          let graph'' = graph ∪ graph' ∪ (inuse ↔ (inuse' ∪ lus))
          let inuse'' = (inuse ∖ lus) ∪ inuse'
          in (inuse'', lus, graph'', let p = exp with b')
        else if exp is an if with bodies b1, b2 then
          let (inuse1, lus1, g1, b1') = analyseBody lu_map b1
          let (inuse2, lus2, g2, b2') = analyseBody lu_map b2
          let lus = lus1 ∪ lus2
          let inuse' = (inuse ∪ inuse1 ∪ inuse2) ∖ lus
          let g = graph ∪ g1 ∪ g2
                    ∪ ((inuse1 ∪ lus1) ↔ (inuse ∖ (lus2 ∖ lus1)))
                    ∪ ((inuse2 ∪ lus2) ↔ (inuse ∖ (lus1 ∖ lus2)))
          in (inuse', lus, g, let p = exp with b1' and b2')
        else if exp is a kernel call with a body b then
          same as loop†
        else
          let lus = lookup p in lu_map
          let lus_mems = memory blocks referenced in lus
          let inuse' = inuse ∖ lus_mems
          in (inuse', lus_mems, graph, stm)
      let mems = memory blocks referenced in p
      let inuse_end = inuse_new ∪ mems
      let graph_end = graph_new ∪ (inuse_end ↔ inuse_end)
      in (inuse_end, lus_new, graph_end, stm_new)
†: Just as we can get into trouble if we reuse an external memory block inside a loop, I suspect we can get into similar trouble if we do the same inside a kernel. If not, we can simplify a bit here.
There's a lot going on here. First of all, because we're now looking at memory initializations instead of memory allocations, we need to look at every expression to see if a new memory block is introduced in the pattern. After handling the specific type of expression, we add any new memory blocks to the InUse set and update the graph accordingly.
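As a sketch of that common tail, reusing the helpers above and assuming a hypothetical memBlocksOf that maps a pattern's names to the memory blocks they reference:

    -- Hypothetical helper: the memory blocks referenced by a pattern's names.
    -- The real compiler would consult the pattern's memory annotations; this is
    -- just a stand-in.
    memBlocksOf :: S.Set VName -> S.Set VName
    memBlocksOf = id

    -- Register the memory blocks a statement introduces: they become in-use and
    -- interfere with everything in use at this point (including each other).
    registerPattern :: S.Set VName -> InUse -> MemGraph -> (InUse, MemGraph)
    registerPattern pat_names inuse graph =
      let mems      = memBlocksOf pat_names
          inuse_end = inuse `S.union` mems
          graph_end = graph `unionGraphs` cross inuse_end inuse_end
      in (inuse_end, graph_end)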
Secondly, when handling loops, we add edges between any outside in-use memory blocks and all the memory blocks used within the loop body, after processing the loop body itself. Additionally, the resulting InUse is computed by removing any LastUsed from the previous InUse set and adding the blocks from the loop body's InUse set instead.
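In the same sketch notation, the loop case is just this bit of set arithmetic, where inuse', lus and graph' are the results of analysing the loop body:

    -- Loop case: everything in use outside the loop interferes with everything
    -- used inside it, and the body's last-uses drop out of the in-use set.
    loopCase :: InUse -> MemGraph -> (InUse, LastUsed, MemGraph) -> (InUse, MemGraph)
    loopCase inuse graph (inuse', lus, graph') =
      let graph'' = graph `unionGraphs` graph' `unionGraphs`
                      cross inuse (inuse' `S.union` lus)
          inuse'' = (inuse `S.difference` lus) `S.union` inuse'
      in (inuse'', graph'')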
When handling if-expressions, we analyse each branch body separately, with fresh graphs and InUse sets. The resulting LastUsed set is the union of the LastUsed from each branch. The resulting InUse is the union of the previous InUse and the InUse from each branch, minus any memory blocks in the updated LastUsed. Remember that if a memory block is last-used in one branch of an if, it is either last-used or not used at all in the other branch, and as long as one branch contains a last-use of a memory block, it is out of use after the if-expression.

The updated graph is a little more complicated to compute. First, remember that in this case we can think of a graph simply as a set of edges. The resulting graph is at least the union of the previous graph and the graphs from each branch. But we also want to add edges between all memory blocks that were in use before the if-expression and all memory blocks that were used inside the expression, except for memory blocks that are last-used in one branch but not the other. In those cases, we want to add edges in the branch that has the last-use, but not in the other. For instance, to compute the edges between outside memory blocks and memory blocks in branch 1, we take the set difference between the last-uses of branch 2 and branch 1: if there is a memory block that is last-used in branch 2 but not in branch 1, it will be part of that set. That set is then subtracted from the previous InUse to get all the outside memory blocks that we want to add edges to for branch 1. Finally, edges between those memory blocks and all the memory blocks used in branch 1 are created.
That was a mouthful, but I think it should work.
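To make the branch bookkeeping concrete, here is a tiny worked example in the same sketch. Suppose a and b are in use before the if, branch 1 last-uses a and uses a new block m1, and branch 2 never touches a but uses a new block m2 (all names here are made up for illustration):

    ifExample :: (InUse, MemGraph)
    ifExample =
      let inuse          = S.fromList ["a", "b"]
          (inuse1, lus1) = (S.fromList ["m1"], S.fromList ["a"])
          (inuse2, lus2) = (S.fromList ["m2"], S.empty)
          lus            = lus1 `S.union` lus2
          inuse'         = (inuse `S.union` inuse1 `S.union` inuse2) `S.difference` lus
          g = cross (inuse1 `S.union` lus1) (inuse `S.difference` (lus2 `S.difference` lus1))
                `unionGraphs`
                cross (inuse2 `S.union` lus2) (inuse `S.difference` (lus1 `S.difference` lus2))
      in (inuse', g)

Here inuse' comes out as {b, m1, m2}, and the new edges make m1 interfere with both a and b, while m2 only interferes with b, so m2 is still free to be merged with a later on.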
Let's also try to write an updated optimiseStm. First, our Allocs:

    type Allocs = Set (VName, SubExp, Space)
    optimiseStm :: MemGraph -> Allocs -> Stm -> (MemGraph, Allocs, Stm)
    optimiseStm graph allocs (let p = exp) =
      if exp is an allocation then
        if there is a memory block x in allocs with the right size and space,
           and there is no edge between x and p in graph then
          let graph' = graph with the nodes for x and p merged under the name x
          return (graph', allocs, let p = x)
        else
          return (graph, allocs with (p, size of allocation, space), let p = exp)
      else if exp is an if with branches b1, b2 then
        let (graph', allocs', b1') = optimiseBody graph allocs b1
        let (graph'', allocs'', b2') = optimiseBody graph' allocs' b2
        return (graph'', allocs'', let p = exp with b1', b2')
      else if exp contains a body of stms (i.e. introduces a scope) b then
        let (graph', allocs', b') = optimiseBody graph allocs b
        return (graph', allocs', let p = exp with b')
      else
        return (graph, allocs, let p = exp)
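The interesting branch is the allocation case, so here is a hedged sketch of just that decision, reusing the earlier illustrative types, with sizes and spaces kept as plain strings (SubExp and Space are stand-ins, not the real types):

    type SubExp = String  -- stand-in for the allocation size
    type Space  = String  -- stand-in for the memory space
    type Allocs = S.Set (VName, SubExp, Space)

    -- Find an existing allocation that the new block p could reuse: same size,
    -- same space, and no interference edge between the two in the graph.
    reuseCandidate :: MemGraph -> Allocs -> VName -> SubExp -> Space -> Maybe VName
    reuseCandidate graph allocs p size space =
      let interferes x   = S.member x (M.findWithDefault S.empty p graph)
          ok (x, sz, sp) = sz == size && sp == space && not (interferes x)
      in case filter ok (S.toList allocs) of
           (x, _, _) : _ -> Just x
           []            -> Nothing

    -- Merge the graph nodes for p and x under the name x, as in the pseudocode.
    mergeNodes :: VName -> VName -> MemGraph -> MemGraph
    mergeNodes x p graph =
      let p_edges    = M.findWithDefault S.empty p graph
          redirect s = if S.member p s then S.insert x (S.delete p s) else s
      in M.map redirect (M.insertWith S.union x p_edges (M.delete p graph))

If reuseCandidate returns Just x, the statement becomes let p = x and the nodes are merged; otherwise (p, size, space) is simply added to allocs.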
More complications
After a short talk with Troels, some additional information has come to light:
First of all, inside kernels we can make all memory blocks the same size if we want to, simply by allocating the max of the sizes in question. Therefore, the task in optimiseStm isn't so much about finding other blocks of the right size; it's more about finding out which memory blocks we want to merge to get the minimal number of allocations.
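In other words, inside a kernel the problem looks more like graph colouring than size matching: group the memory blocks so that no two blocks in a group interfere, and give each group one allocation of the group's maximum size. A rough greedy sketch, reusing the interference graph from before and treating sizes as plain integers for illustration:

    -- Greedily group memory blocks so that no two blocks in a group interfere;
    -- each group can then share a single allocation of the group's maximum size.
    greedyMerge :: M.Map VName Int -> MemGraph -> [(S.Set VName, Int)]
    greedyMerge sizes graph = foldl place [] (M.keys sizes)
      where
        neighbours v    = M.findWithDefault S.empty v graph
        conflicts v grp = not (S.null (S.intersection grp (neighbours v)))
        place groups v  =
          case break (\(grp, _) -> not (conflicts v grp)) groups of
            -- Put v in the first group it does not conflict with...
            (before, (grp, sz) : after) ->
              before ++ (S.insert v grp, max sz (sizes M.! v)) : after
            -- ...or start a new group if it conflicts with all of them.
            (_, []) -> groups ++ [(S.singleton v, sizes M.! v)]

This is just the obvious greedy heuristic; the real pass may well want something smarter.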
Second of all, because uses of memory blocks inside kernels do not necessarily have the same size, we have to be careful when merging them. Therefore, Troels suggested using two passes: one that only looks at kernels, which is not re-entrant and which runs before allocations are hoisted, and one that runs on the rest of the program and which can run afterwards.
Wait, I don't understand this at all…
Expand Allocations
Okay, here's the problem: The expand allocations pass creates weird index functions inside kernels that are hard to reason about.
ExpandAllocations is a pass that unifies memory allocations across kernels. Consider the following kernel:
    SegMap_n {
      let xs_mem = alloc x
      let xs@xs_mem = ...
      ...
      let ys_mem = alloc y
      let ys@ys_mem = ...
    }
The allocations inside the SegMap (which maps across n threads) are completely independent of each other, and my optimisation would possibly be able to merge them. However, if our pass happens after the ExpandAllocations pass, the code will look like this:
    let xs_mem = alloc (x * n)
    let ys_mem = alloc (y * n)

    SegMap_n {
      let xs@xs_mem = ...
      ...
      let ys@ys_mem = ...
    }
where xs and ys have weird index functions that offset them within their respective memory allocations. Now the threads are not independent of each other, and in particular, if ys has a different stride than xs, merging xs_mem and ys_mem could cause writes to ys to overwrite parts of xs other than just those that have gone out of scope in this thread.
By doing our optimisation before the ExpandAllocations pass, and only on kernels, we can merge those allocations beforehand and let the ExpandAllocations pass worry about generalising the index functions.
Another problem is kernels with access to global memory or manual indexing. If any of that happens, we probably need to avoid merging those blocks, at least for now, because it's hard to reason about safety. Therefore, our pass should only concern itself with memory allocations and array initializations that take place within the kernel in question.
Meeting
Meetings change everything, every time! I have to be conservative in loops and maps and such. Because there can be all sorts of manual indexing and global array accesses inside loops, maps, and so on, we probably have to have any memory blocks inside the body interfere with the result of the body.
Cosmin pointed out that OptionPricing.fut is the benchmark I'm really trying to get at, and for that we may need some heuristics for handling indexing inside loops/kernels, but for now I think it's better to keep it simple.
The other point that came out is that we all agree that kernels are the most interesting part, and so I should try to focus on optimising them first. It was also accepted that my kernel pass will happen before ExpandAllocations, which will help immensely.
So, what's the plan now? Write a new algorithm, which only looks at kernels, and which tries to build the interference graph for it.
Footnotes:
U ↔ V here is the graph with edges from each element in U to all elements in V. Think cartesian product.