2020-07-02
Yesterday, and the plan for today
I've been on vacation for the last week and a half, so today is going to be mostly about getting back up to speed. To take it easy, I'll investigate why the bfast benchmark is not getting autotuned correctly. I also want to dedicate some time to writing a retrospective on the process of writing these technical diary entries.
Retrospective
Let's start with some statistics.
$ wc -l *.org
 412 2020-06-08.org
 291 2020-06-09.org
 449 2020-06-10.org
  35 2020-06-11.org
  21 2020-06-12.org
  40 2020-06-15.org
  94 2020-06-16.org
 122 2020-06-17.org
  92 2020-06-18.org
  30 2020-06-19.org
  14 2020-07-02.org
I've written 10 entries so far, one for each work-day since I started this experiment. The first three days are by far the longest entries, mostly because of some long source listings, but also because the entries themselves were written more in-depth. Some of the later entries are very short, just 14 lines for the last one.
Why are some entries so short? Part of it is certainly that, on some days, less actual work gets done than on others. There are also kinds of work that are harder to write useful reflections about: applying aliasing to my last-use analysis as described in the entry from 2020-06-19, for instance, is mostly about fiddling with the existing code, looking up functions in the Futhark library, and so forth. However, I think the biggest reason some entries are shorter than others is that I don't have, or don't take, the time to actually write in the diary as I work. Writing continually throughout the day, using the writing to concretize what I'm doing and to help steer my process, was the goal all along, but it is difficult to keep up and requires discipline. Sometimes it's just easier to hack away at some code or try to fix a bug without having to document everything as I go.
I think it's important to find a compromise that works for me. Ideally, I'd like to document everything meticulously as I did in some of those first days: doing so resulted in some productive days, and it's great to be able to go back and read what I was doing. But I probably also have to accept that that's not going to work every day, at least not initially.
One final thing I'll note is that I sometimes "forget" to finish my entry for the day, either because I have to hurry out of the office to go somewhere, or because I simply forget. Sometimes I finish up the entry later in the evening, and sometimes I only get around to it the day after. I don't think there's any shame in that; after all, no one is sitting around waiting for my daily entry. But I have noticed that when I don't finish my entry in a timely manner, I miss that end-of-the-day reflective thinking. By being a bit more diligent about finishing the entry as part of the actual work-day, I think I could improve both the re-readability and the reflective quality of the entries.
bfast
As mentioned previously, Troels has pointed out that bfast is not being tuned correctly by the new autotuner I created, or at least that it is not as fast as it should be. The problem could also be that bfast is simply slower with incremental flattening than without. Let's try to find out what's wrong.
First, we'll build and install the latest version of Futhark on gpu04. When that is done, we'll run the bfast benchmark with and without tuning and compare the results. Perhaps we also need to compare against an older version of Futhark, maybe the one from just before incremental flattening was made the default? That would be 0.15.8. Thankfully, I already have that installed on gpu04.
Here are the untuned benchmark results using current master:
$ futhark bench --backend=opencl --json untuned.json --no-tuning --runs 100 bfast.fut
Compiling bfast.fut...
Reporting average runtime of 100 runs for each dataset.
Results for bfast.fut:
data/sahara.in: 14883μs (RSD: 0.004; min: -0%; max: +3%)
Here are the tuning results:
$ futhark autotune --backend=opencl bfast.fut
Compiling bfast.fut...
Tuning main.suff_intra_par_12 on entry point main and dataset data/sahara.in
Tuning main.suff_outer_par_10 on entry point main and dataset data/sahara.in
Tuning main.suff_outer_par_16 on entry point main and dataset data/sahara.in
Tuning main.suff_outer_par_15 on entry point main and dataset data/sahara.in
Tuning main.suff_outer_par_8 on entry point main and dataset data/sahara.in
Tuning main.suff_outer_par_9 on entry point main and dataset data/sahara.in
Tuning main.suff_intra_par_6 on entry point main and dataset data/sahara.in
Tuning main.suff_outer_par_5 on entry point main and dataset data/sahara.in
Wrote bfast.fut.tuning

Result of autotuning:
main.suff_intra_par_12=2000000000
main.suff_intra_par_6=2000000000
main.suff_outer_par_10=2000000000
main.suff_outer_par_15=543744
main.suff_outer_par_16=4349952
main.suff_outer_par_5=2000000000
main.suff_outer_par_8=28138752
main.suff_outer_par_9=543744
The tuned benchmark results:
$ futhark bench --backend=opencl --json tuned.json --runs 100 bfast.fut
Compiling bfast.fut...
Reporting average runtime of 100 runs for each dataset.
Results for bfast.fut (using bfast.fut.tuning):
data/sahara.in: 11976μs (RSD: 0.141; min: -14%; max: +22%)
And the comparison:
$ ../../../tools/cmp-bench-json.py untuned.json tuned.json
bfast.fut
  data/sahara.in: 1.24x
So, tuning the program definitely gives us an improvement over the untuned version. Now there are two questions: Are the tuning parameters actually optimal, and is the performance of the tuned program a regression from earlier results without incremental flattening?
Let's investigate the latter first:
$ futhark-0.15.8 bench --backend=opencl --json 0.15.8.json --no-tuning --runs 100 bfast.fut
Compiling bfast.fut...
Reporting average runtime of 100 runs for each dataset.
Results for bfast.fut:
data/sahara.in: 7873μs (RSD: 0.064; min: -2%; max: +24%)
Aha! Our tuned program with incremental flattening is about 1.5x slower than the untuned version compiled with Futhark 0.15.8 (11976μs versus 7873μs). Let's see which kernels are being run (filtered to just the kernels with at least one run):
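As a quick sanity check on those numbers, the ratios can be computed directly from the average runtimes reported above. A trivial Python sketch, with the values copied from the benchmark output:

# Average runtimes copied from the benchmark runs above.
untuned_master_us = 14883  # current master, no tuning
tuned_master_us = 11976    # current master, tuned
old_untuned_us = 7873      # futhark-0.15.8, no tuning

# Speedup from tuning on current master (matches the 1.24x reported above).
print(f"tuning speedup: {untuned_master_us / tuned_master_us:.2f}x")

# Slowdown of tuned current master relative to untuned 0.15.8.
print(f"slowdown vs 0.15.8: {tuned_master_us / old_untuned_us:.2f}x")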
$ futhark opencl bfast.fut
$ gunzip -c data/sahara.in.gz | ./bfast -e main -P --tuning bfast.fut.tuning > /dev/null
Peak memory usage for space 'device': 734489448 bytes.
copy_dev_to_dev ran 4 times; avg: 6us; total: 26us
copy_5781 ran 1 times; avg: 97us; total: 97us
main.scan_stage1_1973 ran 1 times; avg: 589us; total: 589us
main.scan_stage2_1973 ran 1 times; avg: 7us; total: 7us
main.scan_stage3_1973 ran 1 times; avg: 98us; total: 98us
main.segmap_1050 ran 1 times; avg: 22us; total: 22us
main.segmap_1171 ran 1 times; avg: 7us; total: 7us
main.segmap_1229 ran 1 times; avg: 6us; total: 6us
main.segmap_1899 ran 1 times; avg: 196us; total: 196us
main.segmap_2002 ran 1 times; avg: 4us; total: 4us
main.segmap_2187 ran 1 times; avg: 48us; total: 48us
main.segmap_2460 ran 8 times; avg: 132us; total: 1062us
main.segmap_2510 ran 8 times; avg: 154us; total: 1234us
main.segmap_2664 ran 1 times; avg: 148us; total: 148us
main.segmap_2705 ran 1 times; avg: 3238us; total: 3238us
main.segmap_2902 ran 1 times; avg: 354us; total: 354us
main.segmap_intragroup_4151 ran 1 times; avg: 1431us; total: 1431us
main.segred_large_2030 ran 1 times; avg: 464us; total: 464us
main.segred_large_2297 ran 1 times; avg: 4392us; total: 4392us
main.segred_small_2060 ran 1 times; avg: 149us; total: 149us
map_transpose_f32_low_height ran 2 times; avg: 43us; total: 86us
replicate_5435 ran 1 times; avg: 11us; total: 11us
40 operations with cumulative runtime: 13669us
The same for the old version:
$ futhark-0.15.8 opencl bfast.fut
$ gunzip -c data/sahara.in.gz | ./bfast -e main -P > /dev/null
Peak memory usage for space 'device': 338209512 bytes.
copy_dev_to_dev ran 4 times; avg: 9us; total: 36us
copy_2612 ran 1 times; avg: 98us; total: 98us
map_transpose_f32 ran 1 times; avg: 468us; total: 468us
map_transpose_f32_low_height ran 1 times; avg: 8us; total: 8us
replicate_2557 ran 1 times; avg: 6us; total: 6us
scan_stage1_1117 ran 1 times; avg: 630us; total: 630us
scan_stage2_1117 ran 1 times; avg: 6us; total: 6us
scan_stage3_1117 ran 1 times; avg: 103us; total: 103us
segmap_1001 ran 1 times; avg: 6us; total: 6us
segmap_1016 ran 1 times; avg: 6us; total: 6us
segmap_1057 ran 1 times; avg: 200us; total: 200us
segmap_1145 ran 1 times; avg: 130us; total: 130us
segmap_1233 ran 1 times; avg: 51us; total: 51us
segmap_1302 ran 8 times; avg: 132us; total: 1061us
segmap_1322 ran 8 times; avg: 153us; total: 1227us
segmap_1369 ran 1 times; avg: 147us; total: 147us
segmap_1395 ran 1 times; avg: 2332us; total: 2332us
segmap_1435 ran 1 times; avg: 354us; total: 354us
segmap_960 ran 1 times; avg: 25us; total: 25us
segmap_intragroup_1561 ran 1 times; avg: 1271us; total: 1271us
segmap_intragroup_1899 ran 1 times; avg: 1430us; total: 1430us
segred_small_1182 ran 1 times; avg: 164us; total: 164us
39 operations with cumulative runtime: 9759us
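Rather than comparing the two profiles by eye, a small throwaway script can summarize where the time goes in each. This is just a sketch: it assumes the two -P reports have been saved to text files and relies only on the line format shown above.

import re
import sys

# Matches profiling lines like:
#   main.segmap_2705 ran 1 times; avg: 3238us; total: 3238us
LINE = re.compile(r"^(\S+)\s+ran\s+(\d+) times; avg:\s*(\d+)us; total:\s*(\d+)us")

def kernels(path):
    # Yield (kernel name, total runtime in microseconds) per profiling line.
    with open(path) as f:
        for line in f:
            m = LINE.match(line.strip())
            if m:
                yield m.group(1), int(m.group(4))

# Usage: python profile-summary.py new-profile.txt old-profile.txt
for path in sys.argv[1:]:
    print(f"== {path}")
    for name, total in sorted(kernels(path), key=lambda kv: -kv[1]):
        print(f"{total:8d}us  {name}")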
So, there are some intragroup kernels that are not being run in the new version. Let's figure out what the tuning tree looks like:
Threshold forest:
("main.suff_outer_par_5",False)
|
`- ("main.suff_intra_par_6",False)
   |
   +- ("main.suff_intra_par_12",False)
   |
   +- ("main.suff_outer_par_10",False)
   |
   +- ("main.suff_outer_par_15",False)
   |  |
   |  `- ("main.suff_outer_par_16",False)
   |
   +- ("main.suff_outer_par_8",False)
   |
   `- ("main.suff_outer_par_9",False)
That's strange: it's not actually a list, but a tree. Shouldn't incremental flattening always produce a list?
Anyway, there's something wrong with the autotuner. Here are the first few lines of debugging output:
Tuning main.suff_intra_par_12 on entry point main and dataset data/sahara.in
Running with options: -L --size=main.suff_intra_par_12=2000000000
Running executable "./bfast" with arguments ["-L","--size=main.suff_intra_par_12=2000000000","-e","main","-t","/tmp/futhark-bench13235-0","-r","10","-b"]
Got ePars: 8699904
Trying e_pars [8699904]
Running with options: -L --size=main.suff_intra_par_12=8699904
Running executable "./bfast" with arguments ["-L","--size=main.suff_intra_par_12=8699904","-e","main","-t","/tmp/futhark-bench13235-1","-r","10","-b"]
Tuning main.suff_outer_par_10 on entry point main and dataset data/sahara.in
Running with options: -L --size=main.suff_outer_par_10=2000000000 --size=main.suff_intra_par_12=2000000000
Running executable "./bfast" with arguments ["-L","--size=main.suff_outer_par_10=2000000000","--size=main.suff_intra_par_12=2000000000","-e","main","-t","/tmp/futhark-bench13235-2","-r","10","-b"]
Got ePars: 543744
Trying e_pars [543744]
Running with options: -L --size=main.suff_outer_par_10=543744 --size=main.suff_intra_par_12=2000000000
Running executable "./bfast" with arguments ["-L","--size=main.suff_outer_par_10=543744","--size=main.suff_intra_par_12=2000000000","-e","main","-t","/tmp/futhark-bench13235-3","-r","10","-b"]
Tuning main.suff_outer_par_16 on entry point main and dataset data/sahara.in
Running with options: -L --size=main.suff_outer_par_16=2000000000 --size=main.suff_outer_par_10=543744 --size=main.suff_intra_par_12=2000000000
Running executable "./bfast" with arguments ["-L","--size=main.suff_outer_par_16=2000000000","--size=main.suff_outer_par_10=543744","--size=main.suff_intra_par_12=2000000000","-e","main","-t","/tmp/futhark-bench13235-4","-r","10","-b"]
Got ePars: 4349952
Trying e_pars [4349952]
Running with options: -L --size=main.suff_outer_par_16=4349952 --size=main.suff_outer_par_10=543744 --size=main.suff_intra_par_12=2000000000
Running executable "./bfast" with arguments ["-L","--size=main.suff_outer_par_16=4349952","--size=main.suff_outer_par_10=543744","--size=main.suff_intra_par_12=2000000000","-e","main","-t","/tmp/futhark-bench13235-5","-r","10","-b"]
...
To tune correctly, we want to tune from the bottom of the tree upwards, but instead we start with suff_intra_par_12, which is somewhere in the middle? Ah, I guess that just stems from the fact that we're not actually tuning a list, but a tree.
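To make the intended order concrete, here is a small sketch of a post-order traversal of the threshold forest (hard-coded from the dump above), which visits leaves first and the root last. It only illustrates the order we would like, not how the autotuner currently chooses it.

# Threshold parameter -> its children, copied from the threshold forest above.
forest = {
    "main.suff_outer_par_5": ["main.suff_intra_par_6"],
    "main.suff_intra_par_6": ["main.suff_intra_par_12",
                              "main.suff_outer_par_10",
                              "main.suff_outer_par_15",
                              "main.suff_outer_par_8",
                              "main.suff_outer_par_9"],
    "main.suff_outer_par_15": ["main.suff_outer_par_16"],
}

def bottom_up(node):
    # Post-order: visit all children before the node itself.
    for child in forest.get(node, []):
        yield from bottom_up(child)
    yield node

for name in bottom_up("main.suff_outer_par_5"):
    print(name)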
Here are the optimal tuning parameters that we'd like to see:
main.suff_outer_par_5=2000000000
main.suff_intra_par_6=20000000000
main.suff_intra_par_12=20000000000
main.suff_outer_par_10=2
main.suff_outer_par_15=20000000000
main.suff_outer_par_16=2
main.suff_outer_par_8=2
main.suff_outer_par_9=2000000000
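Comparing the two parameter sets by eye is error-prone, so here is a small sketch that parses both name=value lists (copied verbatim from above) and prints the parameters where the autotuner disagrees with the hand-derived values:

def parse(s):
    # Split whitespace-separated "name=value" pairs into a dict of ints.
    return {k: int(v) for k, v in (kv.split("=") for kv in s.split())}

# What the autotuner wrote to bfast.fut.tuning.
autotuned = parse("""
main.suff_intra_par_12=2000000000 main.suff_intra_par_6=2000000000
main.suff_outer_par_10=2000000000 main.suff_outer_par_15=543744
main.suff_outer_par_16=4349952 main.suff_outer_par_5=2000000000
main.suff_outer_par_8=28138752 main.suff_outer_par_9=543744
""")

# The parameters we would like to see.
desired = parse("""
main.suff_outer_par_5=2000000000 main.suff_intra_par_6=20000000000
main.suff_intra_par_12=20000000000 main.suff_outer_par_10=2
main.suff_outer_par_15=20000000000 main.suff_outer_par_16=2
main.suff_outer_par_8=2 main.suff_outer_par_9=2000000000
""")

for name in sorted(desired):
    if autotuned[name] != desired[name]:
        print(f"{name}: autotuned {autotuned[name]}, wanted {desired[name]}")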
Right off the bat, we can see that suff_outer_par_10 is being tuned incorrectly. Instead of being set low (to 543744), it's being maxed out. Oh, perhaps the default tuning parameters are not high enough? It might also be that the default threshold is too small!
Well, that's a task for tomorrow.
Tomorrow
Continue with bfast.