
2020-07-30

Status

I've been slacking off a bit on the technical diary. Yesterday I didn't write anything at all, and the day before that only a few paragraphs. Let's try to improve that a bit.

Last Use

So, what's the status? I've been working on improving my last-use analysis to better handle aliasing and to be more robust altogether. Not surprisingly, this turned out to be necessary for linear scan/reusing allocations. My problems over the last couple of days have all stemmed from the fact that the return values of a Body of code, say from an if-expression, cannot be unambiguously associated with a statement. Consider the following piece of code:

let xs: [n]i32 = ...
let ys =
  if b then
    xs
  else
    iota n
let y = .. ys .. -- last use of ys, xs

The last syntactic use of xs is in the return from the then-branch of the if expression, but there are no statements inside that branch. Instead, we can associate the last use of xs with the let ys statement, but then it's not clear whether it is last used in the then-branch, the else-branch, or both. In the algorithm I'd previously made, I assumed that I would always be able to determine which branch something was last used in, which led me to try to create an optimisation that would reuse the memory block of xs for the iota n in the above code. However, I think I have to abandon that idea for now and instead focus on getting the simple cases to work.

Besides, the real last use of xs is not inside the if expression, but actually at the same place as the last use of ys. In order to correctly determine the last use of xs, I need aliasing information. Fortunately, there already exists an Alias lore that attaches aliasing information in various places. In particular:

  • Pattern elements in statements have a VarAliases attached, which indicates which existing variables the pattern element aliases.
  • Statements have a ConsumedInExp attached to their auxiliary information, which indicates which values were consumed in the expression. As far as I can tell, this is only ever used for in-place updates.
  • Bodies have a BodyAliasing attached. I'm not sure what this is for; it seems to contain the same thing as VarAliases. Obviously, however, different branches of an if expression can have different BodyAliasing information.
  • Ops implement CanBeAliased and AliasedOp, where AliasedOp contains information about what is aliased and what is consumed inside the op.

Upon looking into BodyAliasing, I have confirmed that different branches of an if expression do indeed carry different BodyAliasing information. It may be possible to use this knowledge to do the optimisation I was talking about above, but let's get the other bits working first.

From a last-use perspective, what we really want is that whenever something is last used (the first occurrence of a variable when traversing the program statements in reverse order), it and all of the variables it aliases are marked as last used. However, given the alias lore described above, we don't have the necessary aliasing information at that point. This led me to create a sort of inverted last-use map, which is only used during the analysis and is then inverted at the end to produce the real last-use map. The inverse last-use map records, for each variable, the statement in which it was last used. In the example above, when we come upon the let y statement, we record that ys is last used in that statement. Later in the reverse traversal, when we reach the let ys statement, we see that ys aliases xs, so we look up the last use of ys, which is y, and record that the last use of xs is also y.
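
To make the idea concrete, here is a minimal sketch of the inverted-map trick, using toy stand-ins for the compiler's types (the Stm, VName and StmId definitions and the use/alias fields below are simplified placeholders, not the real IR):

  import qualified Data.Map.Strict as M
  import qualified Data.Set as S

  type VName = String
  type StmId = String          -- here: simply the name bound by the statement

  -- Toy statement: the name it binds, the variables it uses, and the
  -- variables its pattern aliases.
  data Stm = Stm
    { stmName    :: StmId
    , stmUses    :: S.Set VName
    , stmAliases :: S.Set VName
    }

  -- Inverse last-use map: for each variable, the statement in which it
  -- was last used.
  type InvLastUse = M.Map VName StmId

  -- Traverse the statements in reverse order; the first use we see of a
  -- variable is its last use.  When we reach the statement binding a
  -- variable, everything that variable aliases is last used wherever the
  -- variable itself was last used.
  invLastUse :: [Stm] -> InvLastUse
  invLastUse stms = foldl step M.empty (reverse stms)
    where
      step acc stm =
        let bound = stmName stm
            -- Aliased variables inherit the last use of the bound name,
            -- if the bound name is used at all.
            fromAliases = case M.lookup bound acc of
              Just lu -> M.fromSet (const lu) (stmAliases stm)
              Nothing -> M.empty
            -- A direct use here counts only if no later use was seen.
            fromUses = M.fromSet (const bound) (stmUses stm)
        in acc `M.union` fromAliases `M.union` fromUses  -- union is left-biased

  -- Invert at the end: for each statement, the names last used there.
  lastUseMap :: InvLastUse -> M.Map StmId (S.Set VName)
  lastUseMap =
    M.foldrWithKey (\v s -> M.insertWith S.union s (S.singleton v)) M.empty

On the xs/ys/y example above, this maps both xs and ys to y (and b to ys), so inverting gives a last-use set of {xs, ys} at the let y statement, which is exactly what we wanted.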

Now, given that, let's revisit ReuseAllocations, and see how we can improve it.

Interference

First of all, let's try to get this interference graph right. I've started by factoring the interference analysis out into its own module, called Futhark.Analysis.Interference.
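
The graph representation itself can be very simple. As a rough sketch (not the actual contents of Futhark.Analysis.Interference), an interference graph is just a set of unordered pairs of memory block names:

  import qualified Data.Set as S

  type MemName = String        -- stand-in for the compiler's VName

  -- An interference graph as a set of canonically ordered pairs, so
  -- that (a, b) and (b, a) denote the same edge.
  type Graph = S.Set (MemName, MemName)

  edge :: MemName -> MemName -> (MemName, MemName)
  edge a b = (min a b, max a b)

  interferes :: Graph -> MemName -> MemName -> Bool
  interferes g a b = edge a b `S.member` g

  -- Memory blocks that are live at the same time interfere pairwise,
  -- i.e. they form a clique.
  cliqueOf :: S.Set MemName -> Graph
  cliqueOf live =
    S.fromList [ edge a b | a <- S.toList live, b <- S.toList live, a < b ]

With that kind of representation, the analysis mostly boils down to figuring out which blocks are live at the same time and taking the union of the resulting edges.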

I think a good first step is to set up some tests of some sort. I currently have a bunch of different files in my tests directory, but no way of knowing what the correct output for each file is and why. Now, I know psum.fut is going to be one of my tests. Here's what the generated code with futhark-0.16.2 dev --kernels -a -e --cse looks like:

  1: -- xss_247 : [impl₀_245][impl₁_246]i32@@xss_mem_760->
  2: -- {base: [impl₀_245, impl₁_246]; contiguous: True; LMADs: [{offset: 0i32;
  3: --                                                           strides: [impl₁_246, 1i32];
  4: --                                                           rotates: [0i32, 0i32];
  5: --                                                           shape: [impl₀_245, impl₁_246];
  6: --                                                           permutation: [0, 1];
  7: --                                                           monotonicity: [Inc, Inc]}]}
  8: entry {*[?0][?1]i32@?2->
  9:        {base: [?0, ?1]; contiguous: True; LMADs: [{offset: 0i32;
 10:                                                    strides: [?1, 1i32];
 11:                                                    rotates: [0i32, 0i32];
 12:                                                    shape: [?0, ?1];
 13:                                                    permutation: [0, 1];
 14:                                                    monotonicity: [Inc, Inc]}]}}
 15: main (mem xss_mem_760, i32 impl₀_245, i32 impl₁_246,
 16:       [impl₀_245][impl₁_246]i32 xss_247) = {
 17:   #[incremental_flattening(only_intra)]
 18:   let {i64 binop_x_775} = sext i32 impl₀_245 to i64
 19:   #[incremental_flattening(only_intra)]
 20:   let {i64 binop_y_776} = sext i32 impl₁_246 to i64
 21:   #[incremental_flattening(only_intra)]
 22:   let {i64 binop_x_777} = mul_nw64(binop_x_775, binop_y_776)
 23:   #[incremental_flattening(only_intra)]
 24:   let {i64 bytes_774} = mul_nw64(4i64, binop_x_777)
 25:   #[incremental_flattening(only_intra)]
 26:   let {mem mem_778} =
 27:     alloc(bytes_774)
 28:   let {i64 binop_x_764} = binop_y_776
 29:   let {i64 bytes_763} = mul_nw64(4i64, binop_y_776)
 30:   let {i64 binop_x_768} = binop_y_776
 31:   let {i64 bytes_767} = bytes_763
 32:   let {i64 binop_x_772} = binop_y_776
 33:   let {i64 bytes_771} = bytes_763
 34:   #[incremental_flattening(only_intra)]
 35:   -- res_409 : [impl₀_245][impl₁_246]i32@@mem_778->
 36:   -- {base: [impl₀_245, impl₁_246]; contiguous: True; LMADs: [{offset: 0i32;
 37:   --                                                           strides: [impl₁_246, 1i32];
 38:   --                                                           rotates: [0i32, 0i32];
 39:   --                                                           shape: [impl₀_245, impl₁_246];
 40:   --                                                           permutation: [0, 1];
 41:   --                                                           monotonicity: [Inc, Inc]}]}
 42:   let {[impl₀_245][impl₁_246]i32 res_409} =
 43:     segmap_group
 44:     (#groups=impl₀_245; groupsize=impl₁_246)
 45:     (gtid_292 < impl₀_245) (~phys_tid_305) : {[impl₁_246]i32} {
 46:       let {mem@local mem_765} =
 47:         alloc(bytes_763, @local)
 48:       -- resarr0_416 : [impl₁_246]i32@@mem_765->
 49:       -- {base: [impl₁_246]; contiguous: True; LMADs: [{offset: 0i32; strides: [1i32];
 50:       --                                                rotates: [0i32];
 51:       --                                                shape: [impl₁_246];
 52:       --                                                permutation: [0];
 53:       --                                                monotonicity: [Inc]}]}
 54:       let {[impl₁_246]i32 resarr0_416} =
 55:         segscan_thread
 56:         (#groups=impl₀_245; groupsize=impl₁_246)
 57:         ({{0i32},
 58:           [],
 59:           fn {i32} (i32 x_417, i32 x_418) =>
 60:             let {i32 res_419} = add32(x_417, x_418)
 61:             in {res_419}})
 62:         (gtid_295 < impl₁_246) (~phys_tid_296) : {i32} {
 63:           let {i32 x_420} = xss_247[gtid_292, gtid_295]
 64:           return {returns x_420}
 65:         }
 66:       let {mem@local mem_769} =
 67:         alloc(bytes_763, @local)
 68:       -- resarr0_426 : [impl₁_246]i32@@mem_769->
 69:       -- {base: [impl₁_246]; contiguous: True; LMADs: [{offset: 0i32; strides: [1i32];
 70:       --                                                rotates: [0i32];
 71:       --                                                shape: [impl₁_246];
 72:       --                                                permutation: [0];
 73:       --                                                monotonicity: [Inc]}]}
 74:       let {[impl₁_246]i32 resarr0_426} =
 75:         segscan_thread
 76:         (#groups=impl₀_245; groupsize=impl₁_246)
 77:         ({{0i32},
 78:           [],
 79:           fn {i32} (i32 x_427, i32 x_428) =>
 80:             let {i32 res_429} = add32(x_427, x_428)
 81:             in {res_429}})
 82:         (gtid_297 < impl₁_246) (~phys_tid_298) : {i32} {
 83:           let {i32 x_430} = resarr0_416[gtid_297]
 84:           return {returns x_430}
 85:         }
 86:       let {mem@local mem_773} =
 87:         alloc(bytes_763, @local)
 88:       -- resarr0_435 : [impl₁_246]i32@@mem_773->
 89:       -- {base: [impl₁_246]; contiguous: True; LMADs: [{offset: 0i32; strides: [1i32];
 90:       --                                                rotates: [0i32];
 91:       --                                                shape: [impl₁_246];
 92:       --                                                permutation: [0];
 93:       --                                                monotonicity: [Inc]}]}
 94:       let {[impl₁_246]i32 resarr0_435} =
 95:         segscan_thread
 96:         (#groups=impl₀_245; groupsize=impl₁_246)
 97:         ({{0i32},
 98:           [],
 99:           fn {i32} (i32 x_436, i32 x_437) =>
100:             let {i32 res_438} = add32(x_436, x_437)
101:             in {res_438}})
102:         (gtid_299 < impl₁_246) (~phys_tid_300) : {i32} {
103:           let {i32 x_439} = resarr0_426[gtid_299]
104:           return {returns x_439}
105:         }
106:       return {returns resarr0_435}
107:     }
108:   in {impl₀_245, impl₁_246, mem_778, res_409}
109: }

Now, what I'm mostly interested in is that mem_765 should interfere with mem_769, but not with mem_773. On the other hand, mem_769 and mem_773 should interfere with each other. So, we want an interference graph that looks like this:

  mem_765 <-> mem_769
  mem_769 <-> mem_773

with no edge between mem_765 and mem_773.

To keep things simple, I'll only concern myself with kernels for now. Therefore, I've written an analyseKernels function, whose sole purpose is to walk through the program, find the kernels, and call the actual analysis functions on those.
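
Roughly, and with a toy program shape rather than the real IR types (KStm below is purely illustrative), the walk looks something like this:

  -- Toy program shape: a statement is either a kernel with a body we
  -- want to analyse, or some other statement with nested sub-bodies we
  -- merely need to recurse into.
  data KStm
    = Kernel [KStm]
    | Other  [[KStm]]

  -- Walk the program, run the given analysis on every kernel body, and
  -- combine the per-kernel results (e.g. a union of interference
  -- graphs).  Nested kernels, like the segscans inside the segmap
  -- above, are assumed to be handled by the analysis itself.
  analyseKernels :: Monoid r => ([KStm] -> r) -> [KStm] -> r
  analyseKernels analyse = foldMap go
    where
      go (Kernel body)  = analyse body
      go (Other bodies) = foldMap (analyseKernels analyse) bodies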

Ah, but there are other memory blocks in that code! xss_mem_760 is the block that the argument to main, xss_247, resides in. There is also mem_778, which is where the result of the outer map is put.

In principle, if xss_247 were consumed by the function, we could reuse its memory for other global memory blocks. However, since we're focusing on just kernels, perhaps we should disregard them entirely when building the interference graph? A heuristic could be to only add blocks to the interference graph that are actually in InUse.
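
A sketch of what that heuristic could mean when recording interference at a single statement. The rule that a block being last used interferes with everything currently in use is my reading of the analysis, and inUse/lastUses are hypothetical inputs standing in for whatever the real analysis tracks:

  import qualified Data.Set as S

  type MemName = String

  -- Only blocks that are currently in use contribute edges: a block
  -- whose last use is at this statement interferes with everything
  -- still in use, while blocks like xss_mem_760 that never show up in
  -- InUse inside the kernel are simply ignored.
  edgesAt :: S.Set MemName -> S.Set MemName -> S.Set (MemName, MemName)
  edgesAt inUse lastUses =
    S.fromList [ (min a b, max a b)
               | a <- S.toList (lastUses `S.intersection` inUse)
               , b <- S.toList inUse
               , a /= b ]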

So, we still want the same interference graph.

For some reason, <&> namesIntersection inuse_inside removes all names from lastusemems?

That's all sorted out, and now I actually get the correct interference graph, for psum.fut at least.
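
For the record, and tying back to the wish for tests above, the expected result for psum.fut in the edge-set representation sketched earlier would be something like the following (a hypothetical check, not an actual test in the repository):

  import qualified Data.Set as S

  expectedPsum :: S.Set (String, String)
  expectedPsum = S.fromList [ ("mem_765", "mem_769")
                            , ("mem_769", "mem_773") ]
  -- Crucially, there is no ("mem_765", "mem_773") edge, so those two
  -- blocks are candidates for sharing memory.

  -- A test could then simply compare the analysis output to this set.
  checkPsum :: S.Set (String, String) -> Bool
  checkPsum got = got == expectedPsum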