Existentialized index functions in loops
The Purpose
Over the course of a few weeks, I've been working on an optimization in the Futhark compiler which will allow us to eliminate extra copies in a few specific cases. For instance, the following piece of code would result in a new copy at the end of each loop iteration, in addition to a copy at the loop initialization:
let main [n] (a: [n]i32): i32 =
  let b = loop xs = a[1:] for i < n / 2 - 2 do
            xs[i:] -- This will result in a copy
  in reduce (+) 0 b
If, instead of performing a copy at the end of each loop iteration, we could existentialize the index function of the array returned by the loop, returning the offset in addition to the memory location and result, we could avoid those copies.
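Conceptually, the optimization amounts to the following hand-rewritten version of the program above (a source-level sketch of the idea, not literally what the compiler produces): the loop carries only an offset into the original array, and no intermediate array is ever materialized.

let main [n] (a: [n]i32): i32 =
  -- a[1:] corresponds to an initial offset of 1, and each iteration's
  -- xs[i:] advances the offset by i.
  let off = loop off = 1 for i < n / 2 - 2 do off + i
  in reduce (+) 0 a[off:] -- no copies needed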
How does it work?
Here, existentializing the index function means replacing parts of it with free variables that can then be bound in the surrounding context.
For instance, the array type annotation (including the index function) could look like this:
[size]f32@@mem->{ base: [size]; contiguous: True; LMADs: [{offset: orig + i; strides: [0i32]; rotates: [0i32]; shape: [size_new]; permutation: [0]; monotonicity: [Inc]}]}
where size, mem, orig, i, and size_new are variables from the code. Existentializing the type would yield something like this:
[?0]f32@@?5->{ base: [?1]; contiguous: True; LMADs: [{offset: ?2; strides: [?3]; rotates: [0i32]; shape: [?4]; permutation: [0]; monotonicity: [Inc]}]}
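For reference, an LMAD describes how an index into the array maps to a flat position in the underlying memory. Ignoring the rotation and permutation fields, the mapping boils down to the following (a sketch written in Futhark source for illustration; the actual machinery lives inside the compiler):

-- Flat memory position of an r-dimensional index under an LMAD:
-- the offset plus the index coordinates scaled by their strides.
let lmad_position [r] (offset: i32) (strides: [r]i32) (index: [r]i32): i32 =
  offset + reduce (+) 0 (map2 (*) strides index)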
Now, in addition to returning the array itself (and perhaps the memory allocation), we return the values that were existentialized out of the type/index function. That way, by existentializing the index functions of different arrays (like the one returned in the loop initialization and the one returned at the end of the loop), we can make their index functions match, such that their types match. I've previously done the same thing in if expressions, existentializing the array index functions in the different branches to avoid copies.
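For example, in this hypothetical program the two branches return slices of xs with different offsets. With fully concrete index functions, one branch would need a copy to make the branch types match; an existentialized offset lets both branches return their slices in place:

let choose [n] (b: bool) (xs: [n]i32): []i32 =
  if b then xs[1:] -- offset 1 into the underlying memory
  else xs[2:]      -- offset 2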
There's one caveat here, though: the DoLoop construct in Futhark does not support existentialized index functions. We therefore have to generate new variables and insert them into the index function/type annotation, so that we can instantiate the array before returning it.
Benchmarks
To benchmark our changes, we compare the performance of the futhark-benchmarks benchmarks compiled with the compiler at commit 3fc8988b5dbcb53e1722ebc2429db56c693b6782 (the commit just before our changes) and at 54a30308bf015cd455b5bd77fa5a0957c9897f79, using futhark-benchmarks at 2babb70aa4987383635686af8c7e30a26ed45250.
Here are the commands and results:
(cd futhark-benchmarks; git checkout 2babb70aa4987383635686af8c7e30a26ed45250)
git checkout 3fc8988b5dbcb53e1722ebc2429db56c693b6782
stack exec futhark -- bench \
  --no-tuning \
  --backend=opencl \
  --json bench-master.json \
  --exclude=no_opencl \
  --ignore-files micro \
  --ignore-files /lib/ \
  futhark-benchmarks
git checkout 54a30308bf015cd455b5bd77fa5a0957c9897f79
stack exec futhark -- bench \
  --no-tuning \
  --backend=opencl \
  --json bench-existential-loop-6.json \
  --exclude=no_opencl \
  --ignore-files micro \
  --ignore-files /lib/ \
  futhark-benchmarks
tools/cmp-bench-json.py bench-master.json bench-existential-loop-6.json
futhark-benchmarks/accelerate/canny/canny.fut
  data/lena512.in: 0.99x
  data/lena256.in: 1.00x
futhark-benchmarks/accelerate/crystal/crystal.fut
  #0 ("200i32 30.0f32 5i32 1i32 1.0f32"): 1.01x
  #4 ("2000i32 30.0f32 50i32 1i32 1.0f32"): 1.00x
  #5 ("4000i32 30.0f32 50i32 1i32 1.0f32"): 1.00x
futhark-benchmarks/accelerate/fft/fft.fut
  data/64x256.in: 0.95x
  data/128x512.in: 0.97x
  data/1024x1024.in: 0.97x
  data/512x512.in: 1.03x
  data/256x256.in: 1.01x
  data/128x128.in: 0.95x
futhark-benchmarks/accelerate/fluid/fluid.fut
  benchmarking/medium.in: 1.05x
futhark-benchmarks/accelerate/hashcat/hashcat.fut
  rockyou.dataset: 1.00x
futhark-benchmarks/accelerate/kmeans/kmeans.fut
  data/k5_n50000.in: 1.11x
  data/trivial.in: 0.99x
  data/k5_n200000.in: 1.00x
futhark-benchmarks/accelerate/mandelbrot/mandelbrot.fut
  #1 ("1000i32 1000i32 -0.7f32 0.0f32 3.067f32 100i32 16...."): 1.02x
  #3 ("4000i32 4000i32 -0.7f32 0.0f32 3.067f32 100i32 16...."): 1.00x
  #2 ("2000i32 2000i32 -0.7f32 0.0f32 3.067f32 100i32 16...."): 0.99x
  #0 ("800i32 600i32 -0.7f32 0.0f32 3.067f32 100i32 16.0f..."): 1.01x
  #4 ("8000i32 8000i32 -0.7f32 0.0f32 3.067f32 100i32 16...."): 1.00x
futhark-benchmarks/accelerate/nbody/nbody-bh.fut
  data/10000-bodies.in: 1.06x
  data/100000-bodies.in: 0.96x
  data/1000-bodies.in: 0.94x
futhark-benchmarks/accelerate/nbody/nbody.fut
  data/10000-bodies.in: 0.83x
  data/100000-bodies.in: 0.95x
  data/1000-bodies.in: 0.47x
futhark-benchmarks/accelerate/pagerank/pagerank.fut
  data/small.in: 1.00x
  data/random_medium.in: 1.00x
futhark-benchmarks/accelerate/ray/trace.fut
  #0 ("800i32 600i32 100i32 50.0f32 -100.0f32 -700.0f32 1..."): 1.00x
futhark-benchmarks/accelerate/smoothlife/smoothlife.fut
  #1 ("256i32"): 0.94x
  #2 ("512i32"): 1.00x
  #3 ("1024i32"): 0.95x
  #0 ("128i32"): 0.93x
futhark-benchmarks/accelerate/tunnel/tunnel.fut
  #1 ("10.0f32 1000i32 1000i32"): 1.01x
  #4 ("10.0f32 8000i32 8000i32"): 1.00x
  #0 ("10.0f32 800i32 600i32"): 0.98x
  #2 ("10.0f32 2000i32 2000i32"): 1.00x
  #3 ("10.0f32 4000i32 4000i32"): 1.00x
futhark-benchmarks/finpar/LocVolCalib.fut
  LocVolCalib-data/small.in: 1.01x
  LocVolCalib-data/medium.in: 1.01x
  LocVolCalib-data/large.in: 1.00x
futhark-benchmarks/finpar/OptionPricing.fut
  OptionPricing-data/medium.in: 1.00x
  OptionPricing-data/small.in: 0.99x
  OptionPricing-data/large.in: 1.00x
futhark-benchmarks/jgf/crypt/crypt.fut
  crypt-data/medium.in: 0.97x
futhark-benchmarks/jgf/crypt/keys.fut
  crypt-data/userkey0.txt: 0.98x
futhark-benchmarks/jgf/series/series.fut
  data/1000000.in: 1.00x
  data/10000.in: 1.00x
  data/100000.in: 1.00x
futhark-benchmarks/misc/bfast/bfast-cloudy.fut
  data/peru.in: 1.02x
  data/sahara-cloudy.in: 1.00x
futhark-benchmarks/misc/bfast/bfast.fut
  data/sahara.in: 1.00x
futhark-benchmarks/misc/heston/heston32.fut
  data/1062_quotes.in: 0.98x
  data/10000_quotes.in: 1.00x
  data/100000_quotes.in: 1.00x
futhark-benchmarks/misc/heston/heston64.fut
  data/1062_quotes.in: 1.00x
  data/10000_quotes.in: 1.00x
  data/100000_quotes.in: 1.00x
futhark-benchmarks/misc/knn-by-kdtree/buildKDtree.fut
  valid-data/kdtree-ppl-32-m-2097152.in: 1.03x
futhark-benchmarks/misc/radix_sort/radix_sort_blelloch_benchmark.fut
  data/radix_sort_100K.in: 1.07x
  data/radix_sort_10K.in: 1.10x
  data/radix_sort_1M.in: 0.97x
futhark-benchmarks/misc/radix_sort/radix_sort_large.fut
  data/radix_sort_100K.in: 1.39x
  data/radix_sort_10K.in: 1.01x
  data/radix_sort_1M.in: 1.01x
futhark-benchmarks/parboil/histo/histo.fut
  data/default.in: 1.03x
  data/large.in: 1.00x
futhark-benchmarks/parboil/mri-q/mri-q.fut
  data/large.in: 1.00x
  data/small.in: 1.01x
futhark-benchmarks/parboil/sgemm/sgemm.fut
  data/tiny.in: 0.96x
  data/small.in: 1.06x
  data/medium.in: 1.01x
futhark-benchmarks/parboil/stencil/stencil.fut
  data/default.in: 1.00x
  data/small.in: 1.03x
futhark-benchmarks/parboil/tpacf/tpacf.fut
  data/large.in: 1.00x
  data/small.in: 1.00x
  data/medium.in: 1.00x
futhark-benchmarks/pbbs/ray/ray.fut
  data/angel.in: 1.00x
  data/dragon.in: 1.00x
  data/happy.in: 1.00x
futhark-benchmarks/rodinia/backprop/backprop.fut
  data/small.in: 0.99x
  data/medium.in: 1.00x
futhark-benchmarks/rodinia/bfs/bfs_asympt_ok_but_slow.fut
  data/64kn_32e-var-1-256-skew.in: 0.99x
  data/512nodes_high_edge_variance.in: 1.08x
  data/graph1MW_6.in: 1.05x
  data/4096nodes.in: 1.12x
futhark-benchmarks/rodinia/bfs/bfs_filt_padded_fused.fut
  data/64kn_32e-var-1-256-skew.in: 0.93x
  data/512nodes_high_edge_variance.in: 0.97x
  data/graph1MW_6.in: 0.99x
  data/4096nodes.in: 0.93x
futhark-benchmarks/rodinia/bfs/bfs_heuristic.fut
  data/64kn_32e-var-1-256-skew.in: 0.93x
  data/512nodes_high_edge_variance.in: 0.93x
  data/graph1MW_6.in: 1.03x
  data/4096nodes.in: 0.98x
futhark-benchmarks/rodinia/bfs/bfs_iter_work_ok.fut
  data/64kn_32e-var-1-256-skew.in: 1.07x
  data/512nodes_high_edge_variance.in: 1.29x
  data/graph1MW_6.in: 1.21x
  data/4096nodes.in: 1.50x
futhark-benchmarks/rodinia/cfd/cfd.fut
  data/fvcorr.domn.193K.toa: 1.01x
  data/fvcorr.domn.097K.toa: 0.91x
futhark-benchmarks/rodinia/hotspot/hotspot.fut
  data/512.in: 0.50x
  data/1024.in: 1.00x
  data/64.in: 1.06x
futhark-benchmarks/rodinia/kmeans/kmeans.fut
  data/kdd_cup.in: 1.05x
  data/100.in: 0.99x
  data/204800.in: 1.00x
futhark-benchmarks/rodinia/lavaMD/lavaMD.fut
  data/3_boxes.in: 1.00x
  data/10_boxes.in: 1.02x
futhark-benchmarks/rodinia/lud/lud.fut
  data/512.in: 1.00x
  data/64.in: 0.99x
  data/256.in: 0.97x
  data/16by16.in: 1.01x
  data/2048.in: 0.98x
futhark-benchmarks/rodinia/myocyte/myocyte.fut
  data/small.in: 1.02x
  data/medium.in: 1.00x
futhark-benchmarks/rodinia/nn/nn.fut
  data/medium.in: 0.99x
futhark-benchmarks/rodinia/nw/nw.fut
  data/large.in: 1.00x
futhark-benchmarks/rodinia/particlefilter/particlefilter.fut
  data/128_128_10_image_400000_particles.in: 1.01x
  data/128_128_10_image_10000_particles.in: 1.00x
futhark-benchmarks/rodinia/pathfinder/pathfinder.fut
  data/medium.in: 0.40x
futhark-benchmarks/rodinia/srad/srad.fut
  data/image.in: 0.97x
The good: bfs_iter_work_ok
The bfs_iter_work_ok benchmark from Rodinia is faster with my changes. By investigating the code using futhark dev --gpu, we can see that we're hitting exactly the kinds of places we set out to optimize: loops whose results are unnecessarily linearized. Instead of having to linearize the resulting arrays at the end of each loop, we existentialize the index functions of the existing arrays and return those instead.
The bad: nbody
Some of the benchmarks are suspiciously slow. Let's take a look at nbody from the Accelerate benchmark suite.
By compiling nbody with the two different versions of the compiler and running it with profiling enabled, we get some idea of what's going on:
[jxk588@a00333 nbody]$ futhark-master opencl ./nbody.fut && cat data/1000-bodies.in | ./nbody -P --runs 10 >/dev/null
Peak memory usage for space 'device': 52000 bytes.
copy_dev_to_dev      ran 0 times; avg: 0us; total: 0us
copy_dev_to_host     ran 0 times; avg: 0us; total: 0us
copy_host_to_dev     ran 0 times; avg: 0us; total: 0us
copy_scalar_to_dev   ran 0 times; avg: 0us; total: 0us
copy_scalar_from_dev ran 0 times; avg: 0us; total: 0us
segmap_intragroup_536 ran 1 times; avg: 72us; total: 72us
1 operations with cumulative runtime: 72us

[jxk588@a00333 nbody]$ futhark-existential-loop-6 opencl ./nbody.fut && cat data/1000-bodies.in | ./nbody -P --runs 10 >/dev/null
Peak memory usage for space 'device': 76000 bytes.
copy_dev_to_dev      ran 6 times; avg: 5us; total: 32us
copy_dev_to_host     ran 0 times; avg: 0us; total: 0us
copy_host_to_dev     ran 0 times; avg: 0us; total: 0us
copy_scalar_to_dev   ran 0 times; avg: 0us; total: 0us
copy_scalar_from_dev ran 0 times; avg: 0us; total: 0us
segmap_intragroup_536 ran 1 times; avg: 69us; total: 69us
7 operations with cumulative runtime: 101us
We see that, using the existential-loop-6 version, we get six additional runs of copy_dev_to_dev. Indeed, by examining the output of the explicit-allocations pass, we see the problem:
-- xps_364 : [n_354]f32@@xps_mem_951-> {base: [n_354]; contiguous: True; LMADs: [{offset: 0i32; strides: [1i32]; rotates: [0i32]; shape: [n_354]; permutation: [0]; monotonicity: [Inc]}]}
-- yps_365 : [n_355]f32@@yps_mem_952-> {base: [n_355]; contiguous: True; LMADs: [{offset: 0i32; strides: [1i32]; rotates: [0i32]; shape: [n_355]; permutation: [0]; monotonicity: [Inc]}]}
...
entry {[?0]f32@?7-> {base: [?0]; contiguous: True; LMADs: [{offset: 0i32; strides: [1i32]; rotates: [0i32]; shape: [?0]; permutation: [0]; monotonicity: [Inc]}]},
       [?1]f32@?8-> {base: [?1]; contiguous: True; LMADs: [{offset: 0i32; strides: [1i32]; rotates: [0i32]; shape: [?1]; permutation: [0]; monotonicity: [Inc]}]},
       ...}
main (mem xps_mem_951, mem yps_mem_952, ..., i32 n_354, i32 n_355, ...,
      [n_354]f32 xps_364, [n_355]f32 yps_365, ...) = {
  ...
  -- res_390 : [n_354]f32@@res_mem_1177-> {base: [res_ixfn_1173]; contiguous: True; LMADs: [{offset: res_ixfn_1174; strides: [res_ixfn_1175]; rotates: [0i32]; shape: [res_ixfn_1176]; permutation: [0]; monotonicity: [Inc]}]}
  -- res_391 : [n_354]f32@@res_mem_1182-> {base: [res_ixfn_1178]; contiguous: True; LMADs: [{offset: res_ixfn_1179; strides: [res_ixfn_1180]; rotates: [0i32]; shape: [res_ixfn_1181]; permutation: [0]; monotonicity: [Inc]}]}
  ...
  let {i32 res_ixfn_1173, i32 res_ixfn_1174, i32 res_ixfn_1175, ...,
       mem res_mem_1177, ...;
       [n_354]f32 res_390, [n_354]f32 res_391, ...} =
    -- bodies_396 : [n_354]f32@@mem_param_962-> {base: [ctx_param_ext_958]; contiguous: True; LMADs: [{offset: ctx_param_ext_959; strides: [ctx_param_ext_960]; rotates: [0i32]; shape: [ctx_param_ext_961]; permutation: [0]; monotonicity: [Inc]}]}
    -- bodies_397 : [n_354]f32@@mem_param_967-> {base: [ctx_param_ext_963]; contiguous: True; LMADs: [{offset: ctx_param_ext_964; strides: [ctx_param_ext_965]; rotates: [0i32]; shape: [ctx_param_ext_966]; permutation: [0]; monotonicity: [Inc]}]}
    loop {i32 ctx_param_ext_958, i32 ctx_param_ext_959, i32 ctx_param_ext_960, ...,
          mem mem_param_962, mem mem_param_967, mem mem_param_972, ...;
          [n_354]f32 bodies_396, [n_354]f32 bodies_397, ...} = {
      ...
    } for _i_402:i32 < n_steps_361 do {
      ...
    }
  ...
  let {mem mem_1205} = alloc(bytes_1203)
  -- res_linear_1206 : [n_354]f32@@mem_1205-> {base: [n_354]; contiguous: True; LMADs: [{offset: 0i32; strides: [1i32]; rotates: [0i32]; shape: [n_354]; permutation: [0]; monotonicity: [Inc]}]}
  let {[n_354]f32 res_linear_1206} = copy(res_390)
  let {i64 binop_x_1208} = sext i32 n_354 to i64
  let {i64 bytes_1207} = mul_nw64(binop_x_1208, 4i64)
  let {mem mem_1209} = alloc(bytes_1207)
  -- res_linear_1210 : [n_354]f32@@mem_1209-> {base: [n_354]; contiguous: True; LMADs: [{offset: 0i32; strides: [1i32]; rotates: [0i32]; shape: [n_354]; permutation: [0]; monotonicity: [Inc]}]}
  let {[n_354]f32 res_linear_1210} = copy(res_391)
  let {i64 binop_x_1212} = sext i32 n_354 to i64
  let {i64 bytes_1211} = mul_nw64(binop_x_1212, 4i64)
  let {mem mem_1213} = alloc(bytes_1211)
  -- res_linear_1214 : [n_354]f32@@mem_1213-> {base: [n_354]; contiguous: True; LMADs: [{offset: 0i32; strides: [1i32]; rotates: [0i32]; shape: [n_354]; permutation: [0]; monotonicity: [Inc]}]}
  let {[n_354]f32 res_linear_1214} = copy(res_392)
  ...
  in {n_354, n_354, ..., mem_1205, mem_1209, ..., res_linear_1206, res_linear_1210, ...}
}
The function has to return direct arrays (notice the type annotations at the top), and the results from the loop are not direct arrays in their current form, because we aggressively existentialize the entire index function (with the exception of the rotation). Therefore, copies are inserted.
Now, after the explicit-allocations pass, the simplifier determines that the offset and stride are actually loop-invariant (they are always 0 and 1, respectively), so the index functions resulting from the loop become direct. However, the simplifier doesn't know to remove the extra copies at the end. Hopefully, we can implement a pass in the future that removes these unnecessary copies.
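Schematically, the pattern such a pass would look for is the one left over in the dump above (names taken from that dump):

-- res_390 is direct after simplification, so this copy is redundant:
let {[n_354]f32 res_linear_1206} = copy(res_390)
-- A copy-removal pass could substitute res_390 for res_linear_1206 at
-- its uses and drop the corresponding alloc of mem_1205.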
The ugly: hotspot, pathfinder and radix_sort_large
The hotspot and pathfinder benchmarks from Rodinia also perform badly with my modifications, but not necessarily for the same reasons. Both do suffer from the same kind of unnecessary-copy problems as nbody, but there's something else at work as well:
$ futhark-existential-loop-6 opencl ./hotspot.fut && cat data/512.in | ./hotspot --runs 100 -t /dev/stderr >/dev/null
32573
11688
11613
31875
31992
31962
11844
32083
11275
32311
32100
32159
11583
...
$ futhark-existential-loop-6 opencl ./pathfinder.fut && cat data/medium.in | ./pathfinder --runs 100 -t /dev/stderr >/dev/null
10915
10787
3533
10733
10790
3511
10786
10974
3643
10877
10977
3613
...
In both cases, the same basic pattern seems to repeat itself. When running the pathfinder benchmark, we get two slow runs followed by a fast one. For hotspot, the pattern is not quite as pronounced, but there are still some incredible jumps in performance.
Interestingly, the radix_sort_large benchmark exhibits some of the same behavior, but here it is without my changes that the strange behavior manifests itself:
$ futhark-master opencl radix_sort_large.fut && cat data/radix_sort_100K.in | ./radix_sort_large --runs 10 -t /dev/stderr >/dev/null
6181
3607
5987
3380
5966
3460
6004
3449
5975
3477
I'm not sure what's going on here, but it warrants more investigation.
Next up
First up, it would be nice to figure out why the numbers for hotspot, pathfinder and radix_sort_large vary so wildly. I should figure out whether the performance swings happen on the host or on the GPU, and then take it from there.
Next week, I have to get started on an adaptation of the linear-scan register-allocation algorithm for array allocations in Futhark. More on that later.