2020-07-30
Status
I've been slacking off a bit on the technical diary. Yesterday I didn't write anything at all, and the day before that only a few paragraphs. Let's try to improve that a bit.
Last Use
So, what's the status? I've worked on improving my last-use analysis, to better handle aliasing, and altogether be more robust. This turned out, not surprisingly, to be necessary for linear scan/reusing allocations. My problems over the last couple of days have all stemmed from the fact that the return values of a `Body` of code, say from an `if`-expression, cannot be unambiguously associated with a statement. Consider the following piece of code:
```
let xs: [n]i32 = ...
let ys =
  if b then
    xs
  else
    iota n
let y = .. ys .. -- last use of ys, xs
```
The last syntactic use of `xs` is in the return from the then-branch of the `if` expression, but there are no statements inside that branch. Instead, we can associate the last use of `xs` with the `let ys` statement, but then it's not clear whether it's last used in the then- or else-branch, or both. In the algorithm that I'd previously made, I assumed that I would always be able to determine which branch something was last used in, which led me to try to create an optimisation that would reuse the memory block of `xs` for the `iota n` in the above code. However, I think I have to abandon that idea for now, and instead focus on getting the simple cases to work.
Besides, the real last use of `xs` is not inside the `if` expression, but actually at the same place as the last use of `ys`. In order to correctly determine the last use of `xs`, I need aliasing information. Fortunately, there already exists an `Alias` lore that attaches aliasing information in various places. In particular:
- Pattern elements in statements have a `VarAliases` attached, which indicates which existing variables the pattern element aliases.
- Statements have a `ConsumedInExp` attached to their auxiliary information, which indicates which values were consumed in the expression. As far as I can tell, this is only ever used for in-place updates.
- Bodies have a `BodyAliasing` attached. I'm not sure what this is for; it seems to contain the same thing as `VarAliases`. Obviously, however, different branches in an `if` expression can have different `BodyAliasing` information.
- Ops implement `CanBeAliased` and `AliasedOp`, where `AliasedOp` contains information about what is aliased inside, and what is consumed inside.
Upon looking into `BodyAliasing`, I have confirmed that it is actually the case that different branches in `if` expressions have different `BodyAliasing` information. It may be possible to use this knowledge to do the optimisation I was talking about above, but let's get the other bits working first.
From a last-use perspective, what we really want is this: whenever something is last-used (the first occurrence of a variable when traversing the program statements in reverse order), it and all of the variables it aliases should be marked as last-used. However, given the alias lore described above, we don't have the necessary aliasing information at that point. This led me to create a sort of inverted last-use map, which is only used during the analysis, and which is then inverted at the end to produce the real last-use map. The inverse last-use map records, for each variable, the statement in which it was last used. In the example above, when we come upon the `let y` statement, we record that `ys` is last used in that statement. Later, in the `let ys` statement, we see that `ys` aliases `xs`, so we look up the last use of `ys`, which is `y`, and record that the last use of `xs` is also `y`.
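The traversal can be sketched roughly like this. This is plain Python, not the actual Haskell implementation; the statement/variable tuple representation and the function name are made up for illustration:

```python
def last_uses(stmts, aliases):
    """stmts: list of (pattern_name, free_variables) in program order.
    aliases: maps a pattern name to the set of variables it aliases.
    Returns a map from statement (by its pattern name) to the set of
    variables whose last use is at that statement."""
    inverse = {}  # variable -> name of the statement where it is last used

    for pat, free in reversed(stmts):
        # At the statement binding `pat` we finally know what `pat`
        # aliases: anything `pat` aliases is live as long as `pat` is,
        # so it shares `pat`'s recorded last use.
        if pat in inverse:
            for a in aliases.get(pat, set()):
                inverse.setdefault(a, inverse[pat])
        # Otherwise, the first free occurrence seen in reverse order is
        # a variable's last syntactic use.
        for v in free:
            inverse.setdefault(v, pat)

    # Invert at the end to produce the real last-use map.
    result = {}
    for v, stmt in inverse.items():
        result.setdefault(stmt, set()).add(v)
    return result
```

On the example program above (with `ys` aliasing `xs`), this records that both `ys` and `xs` are last used at the `let y` statement, as described.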
Now, given that, let's revisit `ReuseAllocations`, and see how we can improve it.
Interference
First of all, let's try to get this interference graph right. I've started by factoring the interference analysis out into its own module, called `Futhark.Analysis.Interference`.
I think a good first step is to set up some tests of some sort. I currently have a bunch of different files in my `tests` directory, but no way of knowing what the correct output for each file is, and why. Now, I know `psum.fut` is going to be one of my tests. Here's what the generated code with `futhark-0.16.2 dev --kernels -a -e --cse` looks like:
```
-- xss_247 : [impl₀_245][impl₁_246]i32@@xss_mem_760->
--     {base: [impl₀_245, impl₁_246]; contiguous: True; LMADs: [{offset: 0i32;
--      strides: [impl₁_246, 1i32];
--      rotates: [0i32, 0i32];
--      shape: [impl₀_245, impl₁_246];
--      permutation: [0, 1];
--      monotonicity: [Inc, Inc]}]}
entry {*[?0][?1]i32@?2->
       {base: [?0, ?1]; contiguous: True; LMADs: [{offset: 0i32;
        strides: [?1, 1i32];
        rotates: [0i32, 0i32];
        shape: [?0, ?1];
        permutation: [0, 1];
        monotonicity: [Inc, Inc]}]}}
main (mem xss_mem_760, i32 impl₀_245, i32 impl₁_246,
      [impl₀_245][impl₁_246]i32 xss_247) = {
  #[incremental_flattening(only_intra)]
  let {i64 binop_x_775} = sext i32 impl₀_245 to i64
  #[incremental_flattening(only_intra)]
  let {i64 binop_y_776} = sext i32 impl₁_246 to i64
  #[incremental_flattening(only_intra)]
  let {i64 binop_x_777} = mul_nw64(binop_x_775, binop_y_776)
  #[incremental_flattening(only_intra)]
  let {i64 bytes_774} = mul_nw64(4i64, binop_x_777)
  #[incremental_flattening(only_intra)]
  let {mem mem_778} =
    alloc(bytes_774)
  let {i64 binop_x_764} = binop_y_776
  let {i64 bytes_763} = mul_nw64(4i64, binop_y_776)
  let {i64 binop_x_768} = binop_y_776
  let {i64 bytes_767} = bytes_763
  let {i64 binop_x_772} = binop_y_776
  let {i64 bytes_771} = bytes_763
  #[incremental_flattening(only_intra)]
  -- res_409 : [impl₀_245][impl₁_246]i32@@mem_778->
  --     {base: [impl₀_245, impl₁_246]; contiguous: True; LMADs: [{offset: 0i32;
  --      strides: [impl₁_246, 1i32];
  --      rotates: [0i32, 0i32];
  --      shape: [impl₀_245, impl₁_246];
  --      permutation: [0, 1];
  --      monotonicity: [Inc, Inc]}]}
  let {[impl₀_245][impl₁_246]i32 res_409} =
    segmap_group
    (#groups=impl₀_245; groupsize=impl₁_246)
    (gtid_292 < impl₀_245) (~phys_tid_305) : {[impl₁_246]i32} {
      let {mem@local mem_765} =
        alloc(bytes_763, @local)
      -- resarr0_416 : [impl₁_246]i32@@mem_765->
      --     {base: [impl₁_246]; contiguous: True; LMADs: [{offset: 0i32; strides: [1i32];
      --      rotates: [0i32];
      --      shape: [impl₁_246];
      --      permutation: [0];
      --      monotonicity: [Inc]}]}
      let {[impl₁_246]i32 resarr0_416} =
        segscan_thread
        (#groups=impl₀_245; groupsize=impl₁_246)
        ({{0i32},
          [],
          fn {i32} (i32 x_417, i32 x_418) =>
            let {i32 res_419} = add32(x_417, x_418)
            in {res_419}})
        (gtid_295 < impl₁_246) (~phys_tid_296) : {i32} {
          let {i32 x_420} = xss_247[gtid_292, gtid_295]
          return {returns x_420}
        }
      let {mem@local mem_769} =
        alloc(bytes_763, @local)
      -- resarr0_426 : [impl₁_246]i32@@mem_769->
      --     {base: [impl₁_246]; contiguous: True; LMADs: [{offset: 0i32; strides: [1i32];
      --      rotates: [0i32];
      --      shape: [impl₁_246];
      --      permutation: [0];
      --      monotonicity: [Inc]}]}
      let {[impl₁_246]i32 resarr0_426} =
        segscan_thread
        (#groups=impl₀_245; groupsize=impl₁_246)
        ({{0i32},
          [],
          fn {i32} (i32 x_427, i32 x_428) =>
            let {i32 res_429} = add32(x_427, x_428)
            in {res_429}})
        (gtid_297 < impl₁_246) (~phys_tid_298) : {i32} {
          let {i32 x_430} = resarr0_416[gtid_297]
          return {returns x_430}
        }
      let {mem@local mem_773} =
        alloc(bytes_763, @local)
      -- resarr0_435 : [impl₁_246]i32@@mem_773->
      --     {base: [impl₁_246]; contiguous: True; LMADs: [{offset: 0i32; strides: [1i32];
      --      rotates: [0i32];
      --      shape: [impl₁_246];
      --      permutation: [0];
      --      monotonicity: [Inc]}]}
      let {[impl₁_246]i32 resarr0_435} =
        segscan_thread
        (#groups=impl₀_245; groupsize=impl₁_246)
        ({{0i32},
          [],
          fn {i32} (i32 x_436, i32 x_437) =>
            let {i32 res_438} = add32(x_436, x_437)
            in {res_438}})
        (gtid_299 < impl₁_246) (~phys_tid_300) : {i32} {
          let {i32 x_439} = resarr0_426[gtid_299]
          return {returns x_439}
        }
      return {returns resarr0_435}
    }
  in {impl₀_245, impl₁_246, mem_778, res_409}
}
```
Now, what I'm mostly interested in is that `mem_765` should interfere with `mem_769`, but not with `mem_773`. On the other hand, `mem_769` and `mem_773` should interfere with each other. So, we want an interference graph with exactly two edges: one between `mem_765` and `mem_769`, and one between `mem_769` and `mem_773`.
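As a sanity check on what that graph should contain, here is a hedged sketch (plain Python, not the compiler's actual Haskell) of the interference rule at play: a newly allocated block interferes with every block still in use, and a block leaves the in-use set at its last use. The event-list representation is a hand-made abstraction of the kernel body above, not anything the compiler produces:

```python
def interference(events):
    """events: ("alloc", block) or ("lastuse", block), in program order.
    Returns the set of interfering pairs, each as a frozenset of two blocks."""
    in_use = set()
    graph = set()
    for kind, block in events:
        if kind == "alloc":
            # A fresh allocation interferes with everything currently live.
            graph |= {frozenset((block, other)) for other in in_use}
            in_use.add(block)
        else:  # "lastuse"
            in_use.discard(block)
    return graph
```

In `psum.fut`'s kernel body, `mem_765` is last used inside the second scan (the read of `resarr0_416`), after `mem_769` is allocated but before `mem_773` is, which gives exactly the two edges described above.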
To keep things simple, I'll only concern myself with kernels for now. Therefore, I've written an `analyseKernels` function, whose sole purpose is to walk through the program, find the kernels, and call the actual analysis functions on those.
Ah, but there are other memory blocks in that code! `xss_mem_760` is the block that the argument to `main`, `xss_247`, resides in. There is also `mem_778`, which is where the result of the outer map is put.
In principle, if `xss_247` were consumed by the function, we could reuse its memory for other global memory blocks. However, since we're focusing on just kernels, perhaps we should actually disregard them entirely when building the interference graph? A heuristic could be to only add blocks to the interference graph that are actually `InUse`.
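That heuristic could be sketched like this (again plain Python with a hypothetical event representation, where `"use"` stands for any read or write of a block): only blocks that are allocated inside the kernel, and hence show up as in use there, get vertices, so pre-existing global blocks such as `xss_mem_760` never enter the graph:

```python
def kernel_interference(events):
    """events: ("alloc", b), ("use", b) or ("lastuse", b) inside one kernel.
    Only blocks allocated inside the kernel participate in the graph."""
    local = {b for kind, b in events if kind == "alloc"}
    in_use = set()
    graph = set()
    for kind, block in events:
        if block not in local:
            continue  # a global block: disregard it entirely
        if kind == "alloc":
            graph |= {frozenset((block, other)) for other in in_use}
            in_use.add(block)
        elif kind == "lastuse":
            in_use.discard(block)
    return graph
```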
So, we still want the same interference graph.
For some reason `<&> namesIntersection inuse_inside` removes all names from `lastusemems`?
That's all sorted out, and now I actually get the correct interference graph, for `psum.fut` at least.