2020-12-21
Reordering statements
Carrying on from Friday, I need to find an algorithm for reordering statements that both handles memory blocks correctly and respects the original ordering of consumes/aliases.
In essence, there are two problems we need to solve:
- The copy to the double buffer in OptionPricing should not be moved up before the last use of inpacc. Troels' suggestion for "consuming" all arrays in the same memory block (except those being aliased) should help with this problem.
- The ordering of these consumes and the destructive uses of memory blocks should not change.
As mentioned, the first point should be handled by Troels' suggestion, but how do we combine that with the second point? The second point is about making sure that:
    let xs@mem_1 = ...
    let x = xs[...]
    let ys@mem_1 = ...
    let y = ys[...]
    let zs@mem_1 = ...
    let z = zs[...]
doesn't turn into
    let ys@mem_1 = ...
    let y = ys[...]
    let xs@mem_1 = ...
    let x = xs[...]
    let zs@mem_1 = ...
    let z = zs[...]
By the heuristic Troels suggested, when processing let zs@mem_1 = ..., we will consume both ys and xs, but perhaps the definition of ys can depend on mem_1 containing xs, without there being a direct dependency? Perhaps not?
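To make the question concrete, here is a minimal Python sketch of the consume heuristic applied to the xs/ys/zs example. It is a toy model, not the compiler pass: the statement tuples, the function names (`consumes`, `dependency_edges`, `valid_orders`) and the brute-force enumeration are all assumptions made for illustration, and aliasing is ignored entirely. Under these assumptions, treating each definition in mem_1 as a consume of the earlier arrays in mem_1 leaves only the original order.

```python
from itertools import permutations

# Each statement: (name it defines, memory block or None, names it reads).
# Hypothetical encoding of the xs/ys/zs example; aliasing is not modelled.
STMTS = [
    ("xs", "mem_1", []),
    ("x",  None,    ["xs"]),
    ("ys", "mem_1", []),
    ("y",  None,    ["ys"]),
    ("zs", "mem_1", []),
    ("z",  None,    ["zs"]),
]

def consumes(stmts):
    """For each array definition, list the earlier arrays in the same memory block."""
    seen = {}    # memory block -> arrays defined in it so far
    result = {}
    for name, mem, _ in stmts:
        if mem is not None:
            result[name] = list(seen.get(mem, []))
            seen.setdefault(mem, []).append(name)
    return result

def dependency_edges(stmts, with_consumes):
    """Return (before, after) pairs that any reordering must respect."""
    index = {name: i for i, (name, _, _) in enumerate(stmts)}
    # Ordinary data dependencies: a statement runs after what it reads.
    edges = {(r, name) for name, _, reads in stmts for r in reads}
    if with_consumes:
        for name, victims in consumes(stmts).items():
            for victim in victims:
                # The consumed array must exist before it is consumed.
                edges.add((victim, name))
                # Earlier readers of the consumed array must run before
                # the statement that overwrites its memory block.
                for reader, _, reads in stmts:
                    if victim in reads and index[reader] < index[name]:
                        edges.add((reader, name))
    return edges

def valid_orders(stmts, with_consumes):
    """Enumerate every statement order that satisfies the dependency edges."""
    names = [name for name, _, _ in stmts]
    edges = dependency_edges(stmts, with_consumes)
    orders = []
    for perm in permutations(names):
        pos = {n: i for i, n in enumerate(perm)}
        if all(pos[a] < pos[b] for a, b in edges):
            orders.append(list(perm))
    return orders

if __name__ == "__main__":
    print(consumes(STMTS))
    print(len(valid_orders(STMTS, with_consumes=False)))
    print(valid_orders(STMTS, with_consumes=True))
```

With only the data dependencies, many orders are legal, including the bad one shown above. With the consume edges added, the model admits a single order. This says nothing about cases where aliasing matters, which the sketch leaves out.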
The problem arises in the OptionPricing example because the result of the loop body is written to the same memory block as the loop variable. The computation of whatever is written into the double buffer depends on inpacc, but because the memory block isn't part of the type, there is no direct dependency between inpacc and the double buffer. Can this ever happen for more than two arrays?
So it needs to be the case that there is a loop variable, and a double buffer that is written to the same result. Then, inside, there's a third array xs, which resides in the same memory block without depending (in the type system) on inpacc or the double buffer array depending on it.
When can it actually happen that some arrays reside in the same memory block without there being any dependency between them? In loops, because the loop body result has to be in the same memory location as the loop variable. In functions? No. In a map? Yes, but then they are either…
Okay, here's the problem from OptionPricing:
    1: loop (inpacc@mem_1, x) for i < N do
    2:   let mem_2 = alloc(...)
    3:   let res@mem_2 = scan (+) 0 inpacc
    4:   let mem_3 = alloc(...)
    5:   let tmp = map2 (+) res inpacc
    6:   let x' = reduce (+) x tmp
    7:   let double_buffer@mem_1 = copy(res)
    8:   in (double_buffer, x')
Without looking at the memory blocks, we might be tempted to move the copy into double_buffer up right after the scan. That would allow us to reuse mem_2 for the computation of tmp instead of allocating a new memory block mem_3. However, that's not allowed, so when we're processing line 7, we notice that inpacc resides in the same memory block as double_buffer, namely mem_1, and so we insert a consume of inpacc, which forces the computation of tmp and x' before we can perform the copy.
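The hazard can be illustrated with a small Python simulation of one iteration of the pseudocode loop body. This is only a sketch under stated assumptions: memory blocks are modelled as mutable Python lists, an array "residing" in a block is a reference to that list, `reduce (+) x tmp` is modelled as `x + sum(tmp)`, and the function name `loop_body` and the `hoist_copy` flag are made up for the example.

```python
from itertools import accumulate

def loop_body(mem_1, x, hoist_copy):
    """Run one iteration of the loop body; mem_1 holds inpacc on entry.

    If hoist_copy is True, the copy into the double buffer is performed
    right after the scan, i.e. before the last use of inpacc.
    """
    inpacc = mem_1                        # inpacc@mem_1 (same list object)
    mem_2 = list(accumulate(inpacc))      # res@mem_2 = scan (+) 0 inpacc
    res = mem_2
    if hoist_copy:
        mem_1[:] = res                    # copy moved up: overwrites inpacc
    tmp = [a + b for a, b in zip(res, inpacc)]   # tmp = map2 (+) res inpacc
    x_new = x + sum(tmp)                  # x' = reduce (+) x tmp
    if not hoist_copy:
        mem_1[:] = res                    # double_buffer@mem_1 = copy(res)
    return mem_1, x_new

if __name__ == "__main__":
    print(loop_body([1, 2, 3], 0, hoist_copy=False))
    print(loop_body([1, 2, 3], 0, hoist_copy=True))
```

Both variants leave the same contents in mem_1, but the hoisted variant computes tmp from the overwritten inpacc and so returns a different x'. That difference is what the inserted consume of inpacc is meant to rule out.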
As an aside, can we easily model this in actual Futhark?
Yes, the following code has that pattern:
    let main [n] (xss: [][n]i64, x: i64) =
      #[incremental_flattening(only_intra)]
      map (\xs -> loop (xs, x) for i < n do
                    let res = scan (*) 1 (map (* 42) xs)
                    let tmp = scan (*) 1 (map (+ 1) xs)
                    let tmp' = scan (+) 0 (map2 (+) tmp xs)
                    in (res, tmp'[0]))
          xss
Turns into (inside the loop):
    loop {*[n_5379]i64 xs_5588,
          i64 x_5589} = {x_linear_double_buffer_copy_6223, x_5382}
    for i_5587:i64 < n_5379 do {
      -- resarr0_5595 : [n_5379]i64@@mem_6191->
      -- {base: [n_5379]; contiguous: True; LMADs: [{offset: 0i64; strides: [1i64];
      --                                             rotates: [0i64]; shape: [n_5379];
      --                                             permutation: [0];
      --                                             monotonicity: [Inc]}]}
      let {[n_5379]i64 resarr0_5595} =
        segscan_thread
        (#groups=impl₀_5380; groupsize=n_5379)
        ({{1i64},
          [],
          fn {i64} (i64 x_5596, i64 x_5597) =>
            let {i64 defunc_1_op_res_5598} = mul64(x_5596, x_5597)
            in {defunc_1_op_res_5598}})
        (gtid_5444 < n_5379) (~phys_tid_5445) : {i64} {
          let {i64 x_5599} = xs_5588[gtid_5444]
          let {i64 defunc_0_f_res_5600} = mul64(42i64, x_5599)
          return {returns defunc_0_f_res_5600}
        }
      -- resarr0_5606 : [n_5379]i64@@mem_6194->
      -- {base: [n_5379]; contiguous: True; LMADs: [{offset: 0i64; strides: [1i64];
      --                                             rotates: [0i64]; shape: [n_5379];
      --                                             permutation: [0];
      --                                             monotonicity: [Inc]}]}
      -- res_5607 : [n_5379]i64@@mem_6196->
      -- {base: [n_5379]; contiguous: True; LMADs: [{offset: 0i64; strides: [1i64];
      --                                             rotates: [0i64]; shape: [n_5379];
      --                                             permutation: [0];
      --                                             monotonicity: [Inc]}]}
      let {[n_5379]i64 resarr0_5606, [n_5379]i64 res_5607} =
        segscan_thread
        (#groups=impl₀_5380; groupsize=n_5379)
        ({{1i64},
          [],
          fn {i64} (i64 x_5608, i64 x_5609) =>
            let {i64 defunc_1_op_res_5610} = mul64(x_5608, x_5609)
            in {defunc_1_op_res_5610}})
        (gtid_5446 < n_5379) (~phys_tid_5447) : {i64, i64} {
          let {i64 x_5611} = resarr0_5595[gtid_5446]
          let {i64 x_5612} = xs_5588[gtid_5446]
          let {i64 defunc_0_f_res_5614} = add64(1i64, x_5612)
          return {returns defunc_0_f_res_5614, returns x_5611}
        }
      -- resarr0_5618 : [n_5379]i64@@mem_6199->
      -- {base: [n_5379]; contiguous: True; LMADs: [{offset: 0i64; strides: [1i64];
      --                                             rotates: [0i64]; shape: [n_5379];
      --                                             permutation: [0];
      --                                             monotonicity: [Inc]}]}
      let {[n_5379]i64 resarr0_5618} =
        segscan_thread
        (#groups=impl₀_5380; groupsize=n_5379)
        ({{0i64},
          [],
          fn {i64} (i64 x_5619, i64 x_5620) =>
            let {i64 defunc_1_op_res_5621} = add64(x_5619, x_5620)
            in {defunc_1_op_res_5621}})
        (gtid_5448 < n_5379) (~phys_tid_5449) : {i64} {
          let {i64 x_5622} = resarr0_5606[gtid_5448]
          let {i64 x_5623} = xs_5588[gtid_5448]
          let {i64 defunc_1_f_res_5625} = add64(x_5622, x_5623)
          return {returns defunc_1_f_res_5625}
        }
      let {i64 loopres_5637} = resarr0_5618[0i64]
      -- double_buffer_array_6221 : [n_5379]i64@@double_buffer_mem_6220->
      -- {base: [n_5379]; contiguous: True;
      --  LMADs: [{offset: mul_nw64 (phys_tid_5454) (n_5379); strides: [1i64];
      --           rotates: [0i64]; shape: [n_5379]; permutation: [0];
      --           monotonicity: [Inc]}]}
      let {[n_5379]i64 double_buffer_array_6221} = copy(res_5607)
      in {double_buffer_array_6221, loopres_5637}
    }
Notice that it looks like we should be able to move the copy of res_5607 up before the last scan, but if we did so, we would overwrite the contents of xs_5588, which resarr0_5618 depends on.
Okay, I think that's enough for now. The next step is to implement Athas' suggestion and see if there are any programs that actually have more than two overlapping arrays, without any clear interdependencies.
Future work suggestion by Cosmin
I'll write this down here, before I lose my notes:
- We want to optimize NW (Needleman-Wunsch).
- Read Cosmin's paper Logical Inference Techniques for Loop Parallelization. Only section 2 is relevant.
- The purpose is to get an idea about what the equations and abstractions are for safe reuse of memory blocks. Cosmin's paper is about something else, but might serve as inspiration.
- The end product is a set of rules, equations, abstractions and/or types that can help us reuse even more memory allocations.