2020-12-21

Reordering statements

Carrying on from Friday, I need to find an algorithm for reordering statements that both handles memory blocks correctly and respects the original ordering of consumes/aliases.

In essence, there are two problems we need to solve:

1. The copy to the double buffer in OptionPricing should not be moved up before the last use of inpacc. Troels' suggestion of "consuming" all arrays in the same memory block (except those being aliased) should help with this problem.
2. The ordering of these consumes and the destructive uses of memory blocks should not change.

As mentioned, the first point should be handled by Troels' suggestion, but how do we combine that with the second point? In essence, the second point is about making sure that:

let xs@mem_1 = ...
let x = xs[...]
let ys@mem_1 = ...
let y = ys[...]
let zs@mem_1 = ...
let z = zs[...]

doesn't turn into

let ys@mem_1 = ...
let y = ys[...]
let xs@mem_1 = ...
let x = xs[...]
let zs@mem_1 = ...
let z = zs[...]

By the heuristic Troels suggested, when processing let zs@mem_1 = ..., we will consume both ys and xs. But perhaps the definition of ys can depend on mem_1 containing xs, without there being a direct dependency? Perhaps not?
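To convince myself that the consume heuristic keeps the xs/ys/zs order, here is a minimal Python sketch of how I read it. Everything in it (Stm, dependencies, is_valid_order) is a hypothetical name for illustration and not the compiler's actual representation. It also ignores aliasing, so the "except those being aliased" part of the suggestion is not modelled.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Stm:
    name: str                    # variable bound by the statement
    uses: Tuple[str, ...] = ()   # variables the statement reads
    mem: Optional[str] = None    # memory block of the result, if it is an array


def dependencies(stms):
    """Map each statement to the earlier statements it must stay after."""
    block_of = {}  # array name -> memory block, for arrays seen so far
    deps = {}
    for i, stm in enumerate(stms):
        before = set(stm.uses)  # ordinary data dependencies
        if stm.mem is not None:
            # "Consume" every array already living in the same block:
            # wait for its definition and for every statement reading it.
            consumed = {a for a, m in block_of.items() if m == stm.mem}
            for earlier in stms[:i]:
                if earlier.name in consumed or consumed & set(earlier.uses):
                    before.add(earlier.name)
            block_of[stm.name] = stm.mem
        deps[stm.name] = before
    return deps


def is_valid_order(order, deps):
    """Check that every statement comes after everything it depends on."""
    position = {name: i for i, name in enumerate(order)}
    return all(position[d] < position[n]
               for n in order for d in deps[n] if d in position)


stms = [
    Stm("xs", mem="mem_1"), Stm("x", uses=("xs",)),
    Stm("ys", mem="mem_1"), Stm("y", uses=("ys",)),
    Stm("zs", mem="mem_1"), Stm("z", uses=("zs",)),
]
deps = dependencies(stms)
original = ["xs", "x", "ys", "y", "zs", "z"]
swapped = ["ys", "y", "xs", "x", "zs", "z"]
print(is_valid_order(original, deps))  # the original order is accepted
print(is_valid_order(swapped, deps))   # the swapped order is rejected
```

In this model the definition of ys picks up edges to both xs and x, which is what rules out the swapped order. Whether the real pass needs anything beyond these edges is still the open question above.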

The problem arises in the OptionPricing example because the result of the loop body is written to the same memory block as the loop variable. The computation of whatever is written into the double buffer depends on inpacc, but because the memory block isn't part of the type, there is no direct dependency between inpacc and the double buffer. Can this ever happen for more than two arrays?

So it needs to be the case that there is a loop variable and a double buffer that is written to the same memory block. Then, inside the loop body, there is a third array xs that resides in the same memory block, without xs depending (in the type system) on inpacc and without the double buffer array depending on xs.

When can it actually happen that some arrays reside in the same memory block without there being any dependency between them? In loops, because the loop body result has to be in the same memory location as the loop variable. In functions? No. In a map? Yes, but then they are either…

Okay, here's the problem from OptionPricing:

1: loop (inpacc@mem_1, x) for i < N do
2: let mem_2 = alloc(...)
3: let res@mem_2 = scan (+) 0 inpacc
4: let mem_3 = alloc(...)
5: let tmp@mem_3 = map2 (+) res inpacc
6: let x' = reduce (+) x tmp
7: let double_buffer@mem_1 = copy(res)
8: in (double_buffer, x')

Without looking at the memory blocks, we might be tempted to move the copy into double_buffer up right after the scan. That would allow us to reuse mem_2 for the computation of tmp instead of allocating a new memory block mem_3. However, that's not allowed, because the copy overwrites mem_1 while line 5 still needs to read inpacc from it. So when we're processing line 7, we notice that inpacc resides in the same memory block as double_buffer, namely mem_1, and we insert a consume of inpacc, which forces the computation of tmp and x' before we can perform the copy.
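The same reasoning can be sketched in Python for the loop body above. This is only a toy model under my own assumptions: BODY, PARAMS and predecessors are made-up names, statements are plain tuples, and aliasing is ignored. The point is to compare what has to run before the copy with and without looking at memory blocks.

```python
# (name, variables read, memory block of the result or None)
BODY = [
    ("res", ("inpacc",), "mem_2"),
    ("tmp", ("res", "inpacc"), "mem_3"),
    ("x'", ("x", "tmp"), None),
    ("double_buffer", ("res",), "mem_1"),
]
PARAMS = {"inpacc": "mem_1"}  # loop parameter and the block it lives in


def predecessors(body, params, target, use_memory):
    """Statements that have to run before `target`, found transitively."""
    index = {name: (uses, mem) for name, uses, mem in body}
    order = [name for name, _, _ in body]

    def direct(name):
        uses, mem = index[name]
        deps = {u for u in uses if u in index}  # data dependencies
        if use_memory and mem is not None:
            pos = order.index(name)
            # Arrays already living in the block this statement overwrites.
            consumed = {a for a, m in params.items() if m == mem}
            consumed |= {n for n in order[:pos] if index[n][1] == mem}
            # Every earlier reader of a consumed array has to run first.
            deps |= {n for n in order[:pos] if consumed & set(index[n][0])}
        return deps

    seen, todo = set(), [target]
    while todo:
        for dep in direct(todo.pop()):
            if dep not in seen:
                seen.add(dep)
                todo.append(dep)
    return seen


print(predecessors(BODY, PARAMS, "double_buffer", use_memory=False))
print(predecessors(BODY, PARAMS, "double_buffer", use_memory=True))
```

Without the memory information only res has to come before the copy, so hoisting it to right after the scan looks legal. With the consume of inpacc, tmp is pinned before the copy as well. Note that in this simplified model x' is not pinned, since it reads tmp rather than inpacc; a real implementation may well be stricter.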

As an aside, can we easily model this in actual Futhark?

Yes, the following code has that pattern:

let main [n] (xss: [][n]i64, x: i64) =
  #[incremental_flattening(only_intra)]
  map (\xs -> loop (xs, x) for i < n do
              let res = scan (*) 1 (map (* 42) xs)
              let tmp = scan (*) 1 (map (+ 1) xs)
              let tmp' = scan (+) 0 (map2 (+) tmp xs)
              in (res, tmp'[0]))
  xss

Turns into (inside the loop):

loop {*[n_5379]i64 xs_5588,
      i64 x_5589} = {x_linear_double_buffer_copy_6223, x_5382}
for i_5587:i64 < n_5379 do {
  -- resarr0_5595 : [n_5379]i64@@mem_6191->
  -- {base: [n_5379]; contiguous: True; LMADs: [{offset: 0i64; strides: [1i64];
  --                                             rotates: [0i64]; shape: [n_5379];
  --                                             permutation: [0];
  --                                             monotonicity: [Inc]}]}
  let {[n_5379]i64 resarr0_5595} =
    segscan_thread
    (#groups=impl₀_5380; groupsize=n_5379)
    ({{1i64},
      [],
      fn {i64} (i64 x_5596, i64 x_5597) =>
        let {i64 defunc_1_op_res_5598} = mul64(x_5596, x_5597)
        in {defunc_1_op_res_5598}})
    (gtid_5444 < n_5379) (~phys_tid_5445) : {i64} {
      let {i64 x_5599} = xs_5588[gtid_5444]
      let {i64 defunc_0_f_res_5600} = mul64(42i64, x_5599)
      return {returns defunc_0_f_res_5600}
    }
  -- resarr0_5606 : [n_5379]i64@@mem_6194->
  -- {base: [n_5379]; contiguous: True; LMADs: [{offset: 0i64; strides: [1i64];
  --                                             rotates: [0i64]; shape: [n_5379];
  --                                             permutation: [0];
  --                                             monotonicity: [Inc]}]}
  -- res_5607 : [n_5379]i64@@mem_6196->
  -- {base: [n_5379]; contiguous: True; LMADs: [{offset: 0i64; strides: [1i64];
  --                                             rotates: [0i64]; shape: [n_5379];
  --                                             permutation: [0];
  --                                             monotonicity: [Inc]}]}
  let {[n_5379]i64 resarr0_5606, [n_5379]i64 res_5607} =
    segscan_thread
    (#groups=impl₀_5380; groupsize=n_5379)
    ({{1i64},
      [],
      fn {i64} (i64 x_5608, i64 x_5609) =>
        let {i64 defunc_1_op_res_5610} = mul64(x_5608, x_5609)
        in {defunc_1_op_res_5610}})
    (gtid_5446 < n_5379) (~phys_tid_5447) : {i64, i64} {
      let {i64 x_5611} = resarr0_5595[gtid_5446]
      let {i64 x_5612} = xs_5588[gtid_5446]
      let {i64 defunc_0_f_res_5614} = add64(1i64, x_5612)
      return {returns defunc_0_f_res_5614, returns x_5611}
    }
  -- resarr0_5618 : [n_5379]i64@@mem_6199->
  -- {base: [n_5379]; contiguous: True; LMADs: [{offset: 0i64; strides: [1i64];
  --                                             rotates: [0i64]; shape: [n_5379];
  --                                             permutation: [0];
  --                                             monotonicity: [Inc]}]}
  let {[n_5379]i64 resarr0_5618} =
    segscan_thread
    (#groups=impl₀_5380; groupsize=n_5379)
    ({{0i64},
      [],
      fn {i64} (i64 x_5619, i64 x_5620) =>
        let {i64 defunc_1_op_res_5621} = add64(x_5619, x_5620)
        in {defunc_1_op_res_5621}})
    (gtid_5448 < n_5379) (~phys_tid_5449) : {i64} {
      let {i64 x_5622} = resarr0_5606[gtid_5448]
      let {i64 x_5623} = xs_5588[gtid_5448]
      let {i64 defunc_1_f_res_5625} = add64(x_5622, x_5623)
      return {returns defunc_1_f_res_5625}
    }
  let {i64 loopres_5637} = resarr0_5618[0i64]
  -- double_buffer_array_6221 : [n_5379]i64@@double_buffer_mem_6220->
  -- {base: [n_5379]; contiguous: True;
  --  LMADs: [{offset: mul_nw64 (phys_tid_5454) (n_5379); strides: [1i64];
  --           rotates: [0i64]; shape: [n_5379]; permutation: [0];
  --           monotonicity: [Inc]}]}
  let {[n_5379]i64 double_buffer_array_6221} = copy(res_5607)
  in {double_buffer_array_6221, loopres_5637}
}

Notice that it looks like we should be able to move the copy of res_5607 up before the last scan, but if we did so, we would overwrite the contents of xs_5588, which resarr0_5618 depends on.

Okay, I think that's enough for now. The next step is to implement Athas' suggestion and see if there are any programs that actually have more than two overlapping arrays, without any clear interdependencies.
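For the search for such programs, a first filter could look roughly like the Python sketch below. It is only an idea of how I might go about it: crowded_blocks and reads_transitively are hypothetical names, the inputs are plain dictionaries rather than anything from the compiler, and I am assuming that "no clear interdependency" means that neither array transitively reads the other.

```python
from collections import defaultdict
from itertools import combinations


def reads_transitively(uses, start):
    """All variables that `start` depends on, directly or transitively."""
    seen, todo = set(), [start]
    while todo:
        for u in uses.get(todo.pop(), ()):
            if u not in seen:
                seen.add(u)
                todo.append(u)
    return seen


def crowded_blocks(uses, block_of, min_arrays=3):
    """Blocks with at least `min_arrays` arrays, plus their unrelated pairs."""
    by_block = defaultdict(list)
    for array, block in block_of.items():
        by_block[block].append(array)
    report = {}
    for block, arrays in by_block.items():
        if len(arrays) < min_arrays:
            continue
        closure = {a: reads_transitively(uses, a) for a in arrays}
        # Keep the pairs where neither array depends on the other.
        report[block] = [
            (a, b) for a, b in combinations(sorted(arrays), 2)
            if a not in closure[b] and b not in closure[a]
        ]
    return report


# The xs/ys/zs example from earlier: three arrays in mem_1, no data edges.
uses = {"x": ("xs",), "y": ("ys",), "z": ("zs",)}
block_of = {"xs": "mem_1", "ys": "mem_1", "zs": "mem_1"}
print(crowded_blocks(uses, block_of))
```

Blocks that show up with a non-empty list of pairs would be the candidates worth inspecting by hand.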

Future work suggestion by Cosmin

I'll write this down here, before I lose my notes:

We want to optimize NW (Needleman-Wunsch).
Read Cosmin's paper "Logical Inference Techniques for Loop Parallelization". Only section 2 is relevant.
The purpose is to get an idea of what the equations and abstractions are for safe reuse of memory blocks. Cosmin's paper is about something else, but might serve as inspiration.
The end product is a set of rules, equations, abstractions and/or types that can help us reuse even more memory allocations.