2020-07-09
Yesterday, and the plan for today
Yesterday I spent some more time trying to think about how best to do the linear scan. I ended up wanting to rewrite Cosmins last-use analysis and use that instead of my own last-use analysis, because having last-use information attached to each expression/block seems more useful than just having a list of line-numbers. The plan is to continute that implementation today.
However, Cosmin has also written an email stating that driver-knn.fut
is not
being tuned correctly with my new autotuner. I'll have to look in to that first.
driver-knn.fut
Let's try to run the benchmark using a few different version of the compiler.
$ futhark-0.15.8 bench --backend=opencl driver-knn.fut Compiling driver-knn.fut... Reporting average runtime of 10 runs for each dataset. Results for driver-knn.fut: 256i32 [2097152][7]f32 [10000000][7]f32: 2817820μs (RSD: 0.003; min: -0%; max: +1%) [jxk588@a00333 knn-by-kdtree]$ futhark-0.16.1 bench --backend=opencl driver-knn.fut Compiling driver-knn.fut... Reporting average runtime of 10 runs for each dataset. Results for driver-knn.fut: 256i32 [2097152][7]f32 [10000000][7]f32: 4680997μs (RSD: 0.026; min: -3%; max: +5%) Results for driver-knn.fut: 256i32 [2097152][7]f32 [10000000][7]f32: 4680997μs (RSD: 0.026; min: -3%; max: +5%) [jxk588@a00333 knn-by-kdtree]$ futhark-0.16.1 autotune --backend=opencl driver-knn.fut Compiling driver-knn.fut... Tuning main.suff_outer_par_0 on entry point main and dataset 256i32 [2097152][7]f32 [10000000][7]f32 Tuning main.suff_outer_par_1 on entry point main and dataset 256i32 [2097152][7]f32 [10000000][7]f32 Tuning main.suff_outer_par_2 on entry point main and dataset 256i32 [2097152][7]f32 [10000000][7]f32 Tuning main.suff_intra_par_7 on entry point main and dataset 256i32 [2097152][7]f32 [10000000][7]f32 Tuning main.suff_intra_par_4 on entry point main and dataset 256i32 [2097152][7]f32 [10000000][7]f32 Tuning main.suff_outer_par_3 on entry point main and dataset 256i32 [2097152][7]f32 [10000000][7]f32 Tuning main.suff_outer_redomap_15 on entry point main and dataset 256i32 [2097152][7]f32 [10000000][7]f32 Wrote driver-knn.fut.tuning Result of autotuning: main.suff_intra_par_4=2097152 main.suff_intra_par_7=2000000000 main.suff_outer_par_0=2000000000 main.suff_outer_par_1=7 main.suff_outer_par_2=2000000000 main.suff_outer_par_3=2000000000 main.suff_outer_redomap_15=10248 [jxk588@a00333 knn-by-kdtree]$ futhark-0.16.1 bench --backend=opencl driver-knn.fut Compiling driver-knn.fut... Reporting average runtime of 10 runs for each dataset. Results for driver-knn.fut (using driver-knn.fut.tuning): 256i32 [2097152][7]f32 [10000000][7]f32: 4688262μs (RSD: 0.027; min: -4%; max: +6%)
That took a looooong time… But the tuning results seem to at least match the performance of the untuned program. However, it's significantly slower than the 0.15.8 version.
Troels suggests I not worry about the regression from 0.15.8, so I've sent Cosmin an e-mail asking for his tuning parameters. He got
main.suff_intra_par_4=2000000000 main.suff_intra_par_7=2097152 main.suff_outer_par_0=2000000000 main.suff_outer_par_1=2000000000 main.suff_outer_par_2=2000000000 main.suff_outer_par_3=2000000000 main.suff_outer_redomap_15=5615445
I need to test his numbers on gpu04, but someone else is using the machine right now. It would be nice to figure out which of the two parameters are causing the problem.
Last-use
Do I really need to use the LastUses
lore? Why do I think I need it? I think I
need it because it would make it easier to handle blocks. Here's the example
from the other day:
1: let mem1 = alloc(42) 2: -- last-use: [] 3: let arr1@mem1 = ... 4: 5: let x = reduce (+) 0 arr1 6: 7: let (mem2, arr2@mem2) = 8: if ... then 9: let mem_then = alloc(42) 10: let res_then@mem_then = map (* 2) (iota 42) 11: in (mem_then, res_then) 12: else 13: let mem_else = alloc(x) 14: let res_else = map (* 2) (iota x) 15: in (mem_then, res_else)
And this is what it is transformed to:
1: let mem1 = alloc(42) 2: -- freelist: [] 3: let arr1@mem1 = ... 4: -- freelist: [] 5: 6: let x = reduce (+) 0 arr1 7: -- last-use is [arr1, mem1], freelist: [(mem1, 42)] 8: 9: let (mem2, arr2@mem2) = 10: if ... then 11: -- mem1 is free and of size 42, so we use that instead of allocing 12: let mem_then = mem1 13: -- freelist is now [] 14: let res_then@mem_then = map (* 2) (iota 42) 15: in (mem_then, res_then) 16: -- also return freelist = [] 17: else 18: let mem_else = alloc(x) 19: let res_else = map (* 2) (iota x) 20: in (mem_then, res_else) 21: -- also return freelist = [(mem1, 42)] 22: -- freelist after if = [] 23: ... 24: 25: let y = reduce (+) 0 arr2 26: -- last-use is [arr2, mem2], freelist = [(mem2, ext)], cannot be reused
We could do something similar using the line counts, I guess? But this is so simple and straightforward, I think it's worth writing a more complicated LastUse lore for it.
There's one problem however: the last-use of mem1
isn't actually on line
6, but on line 3.
Wait, we can do the same thing rather simply by using the line count, right? Just have the last-uses sorted by line, pop things off the front, populate the freelist as required? Then, inside each block, we call last-uses before proceeding with the linear scan?
Also, we should rewrite last-uses to work on Body
instead, so we can remove
last-use entries for things that are returned by the body.
Some time later
I think I have a better skeleton now. However, when adding new elements to my free-list, I need to find the size of the thing I'm adding. I should probably also filter it so I only add mems (but only through their associated array). Afterwards, I can add to lookup to the alloc handling part of the code.
A long meeting later
I did get a bit further, and now the code is available here.
The problem that I have right now, is that I cannot perform the memory lookup on line 89 without introducing some scope. Troels has the following advise:
15:20:50 munksgaard | Athas: In a Pass, I'm walking through statements using foldM, collecting information along the way. At some point, I come along a VName, and now I want | to look up the memory information of that VName, can I use lookupMemInfo somehow? That requires me to handle scope, right? Is there a better way? 15:22:45 Athas | munksgaard: you almost always need to keep a Scope around somehow. 15:29:31 munksgaard | Sure, I'm just confused as to how to apply it. Here's the code I have, and the line that's giving me problems: | https://github.com/Munksgaard/futhark-linear-scan/blob/master/ReuseAllocations.hs#L89 15:30:23 munksgaard | I guess the scope information needs to be available somehow, but I don't know how to introduce it/keep track of it 15:30:42 Athas | You either do it manually by adding a Scope parameter to all your functions, or you use a Reader monad that contains a Scope. 15:31:45 Athas | Like how I do it here: https://github.com/diku-dk/futhark/blob/bee7823271bea5b407dd7eea69ebaf9b1b6b28c3/src/Futhark/Optimise/Unstream.hs#L43 15:33:50 munksgaard | I see, that's very helpful. Thank you.
I'll have to try that tomorrow.