HOME

2020-07-03

Yesterday, and the plan for today

Yesterday, most of the time was spent on writing a retrospective on these technical diaary entries and on figuring out why bfast is slow. I'll continue with bfast today.

bfast, continued

So, it turns out that the default threshold was indeed too low. Before my changes the default threshold was 32768, but all the thresholds in bfast using the safari dataset exhibit degrees of parallelism higher than that. This means that when we're trying to tune eg. suff_outer_par_10, the thresholds above (suff_outer_par_5) in the tuning tree evaluate to true, and the code versions guarded by suff_outer_par_10 are never executed. That meant that the tuning results were completely random.

Thankfully, we can alleviate this problem simply by setting the default threshold higher. Now, bfast executes as fast as before incremental flattening:

$ futhark autotune --backend=opencl bfast.fut
Compiling bfast.fut...
Tuning main.suff_intra_par_12 on entry point main and dataset data/sahara.in
Tuning main.suff_outer_par_10 on entry point main and dataset data/sahara.in
Tuning main.suff_outer_par_16 on entry point main and dataset data/sahara.in
Tuning main.suff_outer_par_15 on entry point main and dataset data/sahara.in
Tuning main.suff_outer_par_8 on entry point main and dataset data/sahara.in
Tuning main.suff_outer_par_9 on entry point main and dataset data/sahara.in
Tuning main.suff_intra_par_6 on entry point main and dataset data/sahara.in
Tuning main.suff_outer_par_5 on entry point main and dataset data/sahara.in
Wrote bfast.fut.tuning
Result of autotuning:
main.suff_intra_par_12=2000000000
main.suff_intra_par_6=3874176
main.suff_outer_par_10=543744
main.suff_outer_par_15=2000000000
main.suff_outer_par_16=4349952
main.suff_outer_par_5=2000000000
main.suff_outer_par_8=28138752
main.suff_outer_par_9=543744

$ futhark bench --backend=opencl bfast.fut
Compiling bfast.fut...
Reporting average runtime of 10 runs for each dataset.

Results for bfast.fut (using bfast.fut.tuning):
data/sahara.in:       9308μs (RSD: 0.053; min:  -8%; max:  +5%)

$ futhark-0.15.8 bench --backend=opencl --no-tuning bfast.fut
Compiling bfast.fut...
Reporting average runtime of 10 runs for each dataset.

Results for bfast.fut:
data/sahara.in:       9797μs (RSD: 0.005; min:  -0%; max:  +2%)

Last-use

Let's go back to last-use and see what we were up to before I went on vacation.

The current status is that I have a very simple last-use analysis, which supports basic aliasing. Troels and Cosmin mentioned that the aliasing information provided by the aliasing library probably isn't sufficient, but I'm not quite sure why.

I've also spent a bit of time investigating Cosmins implementation of last-use. The biggest difference from mine is that he introduces a new LastUse lore, such that all expressions are annotated with last-use information. He also implements last-use analysis himself, without using FreeIn. As a result, he correctly finds last-use information within nested blocks even on the first run. However, I'm not sure that we need either the new lore or the nested information. My hope is that we can treat each block "separately".

The next step is probably to start implementing the linear scan analysis?

To remind myself, here's what we need to do:

  • Walk through the program from top to bottom.
  • Each time there is an allocation, do the following:
    • Add any memory blocks/arrays past their last use since the last allocation was performed to the free list. Also add size information.
    • Look through the free list for a suitably sized block for the current allocation. If one exists, use it instead (perhaps by inserting a simple let new = previous statement), otherwise leave the allocation in place

Sounds simple enough, really…

Let's start by creating a new repository, copy over the last-use stuff and make our program do a Pass instead of an Action.

Tomorrow

Well, tomorrow is on Monday. But the plan is to continue with the linear-scan pass. I've started building the skeleton, and on Monday it'll need to be expanded to be a proper pass.