Existentialized index functions in loops
The Purpose
Over the course of a few weeks, I've been working on an optimization in the Futhark compiler which will allow us to eliminate extra copies in a few specific cases. For instance, the following piece of code would result in a new copy at the end of each loop iteration, in addition to a copy at the loop initialization:
let main [n] (a: [n]i32): i32 =
  let b = loop xs = a[1:] for i < n / 2 - 2 do
            xs[i:] -- This will result in a copy
  in reduce (+) 0 b
If, instead of performing a copy at the end of each loop iteration, we existentialized the index function of the array returned by the loop, returning the offset in addition to the memory location and the result itself, we could avoid those copies.
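To make the difference concrete, here is a minimal model in Haskell (the language the Futhark compiler is written in). It is only a sketch under invented names: arrays are modeled as unboxed vectors, V.force stands in for the materializing copy, and V.drop for the O(1) slice.

import qualified Data.Vector.Unboxed as V

-- The unoptimized strategy: force a fresh copy of the slice at the loop
-- initialization and at the end of every iteration, like the generated
-- code does today.
loopWithCopies :: V.Vector Int -> Int -> V.Vector Int
loopWithCopies a n =
  foldl (\xs i -> V.force (V.drop i xs))
        (V.force (V.drop 1 a))
        [0 .. n `div` 2 - 3]

-- The optimized strategy: carry only an offset through the loop and
-- slice the original memory once at the end; no per-iteration copies.
loopWithOffset :: V.Vector Int -> Int -> V.Vector Int
loopWithOffset a n =
  let off = foldl (+) 1 [0 .. n `div` 2 - 3]
  in V.drop off a

Both functions compute the same array; the only difference is how many intermediate buffers get materialized along the way.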
How does it work?
Here, existentializing the index function means replacing parts of it with free variables that can then be bound in the surrounding context.
For instance, the array type annotation (including the index function) could look like this:
[size]f32@mem->{
base: [size];
contiguous: True;
LMADs: [{offset: orig + i;
strides: [0i32];
rotates: [0i32];
shape: [size_new];
permutation: [0];
monotonicity: [Inc]}]}
where size, mem, orig, i, and size_new are variables from the
code. Existentializing the type would yield something like this:
[?0]f32@?5->{
base: [?1];
contiguous: True;
LMADs: [{offset: ?2;
strides: [?3];
rotates: [0i32];
shape: [?4];
permutation: [0];
monotonicity: [Inc]}]}
Now, in addition to returning the array itself (and perhaps the memory
allocation), we return the values that were existentialized out of the
type/index function. That way, by existentializing the index functions of
different arrays (like the one given in the loop initialization and the one
returned at the end of the loop), we can make their index functions match,
such that their types match. I've previously done the same thing for if
expressions, existentializing the array index functions in the different
branches to avoid copies.
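Concretely, one can picture existentialization as a traversal that swaps every leaf of the index function for a numbered existential while recording what was abstracted away. The following Haskell sketch uses toy stand-in types (Value and IxFun are invented for illustration; the compiler's real LMAD-based representation is more involved):

import Control.Monad.State

data Value = Var String  -- a variable from the program
           | Ext Int     -- an existential ?n
  deriving Show

data IxFun = IxFun { base    :: Value
                   , offset  :: Value
                   , strides :: [Value]
                   , shape   :: [Value] }
  deriving Show

-- Replace every leaf of the index function with a fresh existential,
-- remembering what each one stands for, so those values can be returned
-- alongside the array itself and bound in the surrounding context.
existentialize :: IxFun -> (IxFun, [(Int, Value)])
existentialize (IxFun b o ss sh) = runState go []
  where
    go = IxFun <$> ext b <*> ext o <*> traverse ext ss <*> traverse ext sh
    ext v = do
      bound <- get
      put (bound ++ [(length bound, v)])
      pure (Ext (length bound))

Applied to the index functions of the loop-initial array and the loop-result array, this produces the same existential skeleton for both, which is exactly what makes their types match.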
There is one caveat, though: the DoLoop construct in Futhark does not support
existentialized index functions. We therefore have to generate new variables
and insert them into the index function/type annotation, so that we can
instantiate the array before returning it.
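In terms of the toy types above, that instantiation step just maps each existential back to a freshly generated program variable; again a sketch, not the compiler's actual code:

-- Substitute a fresh variable for each existential so the DoLoop can
-- carry ordinary variables; the caller binds them to the context values
-- returned by the loop.
instantiate :: (Int -> String) -> IxFun -> IxFun
instantiate fresh (IxFun b o ss sh) =
  IxFun (inst b) (inst o) (map inst ss) (map inst sh)
  where
    inst (Ext i) = Var (fresh i)
    inst v       = v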
Benchmarks
To benchmark our changes, we compare the performance of the futhark-benchmarks
suite compiled with the compiler at commit
3fc8988b5dbcb53e1722ebc2429db56c693b6782 (the commit just before our changes)
and at commit 54a30308bf015cd455b5bd77fa5a0957c9897f79, using
futhark-benchmarks at 2babb70aa4987383635686af8c7e30a26ed45250.
Here are the commands and results:
(cd futhark-benchmarks; git checkout 2babb70aa4987383635686af8c7e30a26ed45250)
git checkout 3fc8988b5dbcb53e1722ebc2429db56c693b6782
stack exec futhark -- bench \
  --no-tuning \
  --backend=opencl \
  --json bench-master.json \
  --exclude=no_opencl \
  --ignore-files micro \
  --ignore-files /lib/ \
  futhark-benchmarks
git checkout 54a30308bf015cd455b5bd77fa5a0957c9897f79
stack exec futhark -- bench \
  --no-tuning \
  --backend=opencl \
  --json bench-existential-loop-6.json \
  --exclude=no_opencl \
  --ignore-files micro \
  --ignore-files /lib/ \
  futhark-benchmarks
tools/cmp-bench-json.py bench-master.json bench-existential-loop-6.json
futhark-benchmarks/accelerate/canny/canny.fut
data/lena512.in: 0.99x
data/lena256.in: 1.00x
futhark-benchmarks/accelerate/crystal/crystal.fut
#0 ("200i32 30.0f32 5i32 1i32 1.0f32"): 1.01x
#4 ("2000i32 30.0f32 50i32 1i32 1.0f32"): 1.00x
#5 ("4000i32 30.0f32 50i32 1i32 1.0f32"): 1.00x
futhark-benchmarks/accelerate/fft/fft.fut
data/64x256.in: 0.95x
data/128x512.in: 0.97x
data/1024x1024.in: 0.97x
data/512x512.in: 1.03x
data/256x256.in: 1.01x
data/128x128.in: 0.95x
futhark-benchmarks/accelerate/fluid/fluid.fut
benchmarking/medium.in: 1.05x
futhark-benchmarks/accelerate/hashcat/hashcat.fut
rockyou.dataset: 1.00x
futhark-benchmarks/accelerate/kmeans/kmeans.fut
data/k5_n50000.in: 1.11x
data/trivial.in: 0.99x
data/k5_n200000.in: 1.00x
futhark-benchmarks/accelerate/mandelbrot/mandelbrot.fut
#1 ("1000i32 1000i32 -0.7f32 0.0f32 3.067f32 100i32 16...."): 1.02x
#3 ("4000i32 4000i32 -0.7f32 0.0f32 3.067f32 100i32 16...."): 1.00x
#2 ("2000i32 2000i32 -0.7f32 0.0f32 3.067f32 100i32 16...."): 0.99x
#0 ("800i32 600i32 -0.7f32 0.0f32 3.067f32 100i32 16.0f..."): 1.01x
#4 ("8000i32 8000i32 -0.7f32 0.0f32 3.067f32 100i32 16...."): 1.00x
futhark-benchmarks/accelerate/nbody/nbody-bh.fut
data/10000-bodies.in: 1.06x
data/100000-bodies.in: 0.96x
data/1000-bodies.in: 0.94x
futhark-benchmarks/accelerate/nbody/nbody.fut
data/10000-bodies.in: 0.83x
data/100000-bodies.in: 0.95x
data/1000-bodies.in: 0.47x
futhark-benchmarks/accelerate/pagerank/pagerank.fut
data/small.in: 1.00x
data/random_medium.in: 1.00x
futhark-benchmarks/accelerate/ray/trace.fut
#0 ("800i32 600i32 100i32 50.0f32 -100.0f32 -700.0f32 1..."): 1.00x
futhark-benchmarks/accelerate/smoothlife/smoothlife.fut
#1 ("256i32"): 0.94x
#2 ("512i32"): 1.00x
#3 ("1024i32"): 0.95x
#0 ("128i32"): 0.93x
futhark-benchmarks/accelerate/tunnel/tunnel.fut
#1 ("10.0f32 1000i32 1000i32"): 1.01x
#4 ("10.0f32 8000i32 8000i32"): 1.00x
#0 ("10.0f32 800i32 600i32"): 0.98x
#2 ("10.0f32 2000i32 2000i32"): 1.00x
#3 ("10.0f32 4000i32 4000i32"): 1.00x
futhark-benchmarks/finpar/LocVolCalib.fut
LocVolCalib-data/small.in: 1.01x
LocVolCalib-data/medium.in: 1.01x
LocVolCalib-data/large.in: 1.00x
futhark-benchmarks/finpar/OptionPricing.fut
OptionPricing-data/medium.in: 1.00x
OptionPricing-data/small.in: 0.99x
OptionPricing-data/large.in: 1.00x
futhark-benchmarks/jgf/crypt/crypt.fut
crypt-data/medium.in: 0.97x
futhark-benchmarks/jgf/crypt/keys.fut
crypt-data/userkey0.txt: 0.98x
futhark-benchmarks/jgf/series/series.fut
data/1000000.in: 1.00x
data/10000.in: 1.00x
data/100000.in: 1.00x
futhark-benchmarks/misc/bfast/bfast-cloudy.fut
data/peru.in: 1.02x
data/sahara-cloudy.in: 1.00x
futhark-benchmarks/misc/bfast/bfast.fut
data/sahara.in: 1.00x
futhark-benchmarks/misc/heston/heston32.fut
data/1062_quotes.in: 0.98x
data/10000_quotes.in: 1.00x
data/100000_quotes.in: 1.00x
futhark-benchmarks/misc/heston/heston64.fut
data/1062_quotes.in: 1.00x
data/10000_quotes.in: 1.00x
data/100000_quotes.in: 1.00x
futhark-benchmarks/misc/knn-by-kdtree/buildKDtree.fut
valid-data/kdtree-ppl-32-m-2097152.in: 1.03x
futhark-benchmarks/misc/radix_sort/radix_sort_blelloch_benchmark.fut
data/radix_sort_100K.in: 1.07x
data/radix_sort_10K.in: 1.10x
data/radix_sort_1M.in: 0.97x
futhark-benchmarks/misc/radix_sort/radix_sort_large.fut
data/radix_sort_100K.in: 1.39x
data/radix_sort_10K.in: 1.01x
data/radix_sort_1M.in: 1.01x
futhark-benchmarks/parboil/histo/histo.fut
data/default.in: 1.03x
data/large.in: 1.00x
futhark-benchmarks/parboil/mri-q/mri-q.fut
data/large.in: 1.00x
data/small.in: 1.01x
futhark-benchmarks/parboil/sgemm/sgemm.fut
data/tiny.in: 0.96x
data/small.in: 1.06x
data/medium.in: 1.01x
futhark-benchmarks/parboil/stencil/stencil.fut
data/default.in: 1.00x
data/small.in: 1.03x
futhark-benchmarks/parboil/tpacf/tpacf.fut
data/large.in: 1.00x
data/small.in: 1.00x
data/medium.in: 1.00x
futhark-benchmarks/pbbs/ray/ray.fut
data/angel.in: 1.00x
data/dragon.in: 1.00x
data/happy.in: 1.00x
futhark-benchmarks/rodinia/backprop/backprop.fut
data/small.in: 0.99x
data/medium.in: 1.00x
futhark-benchmarks/rodinia/bfs/bfs_asympt_ok_but_slow.fut
data/64kn_32e-var-1-256-skew.in: 0.99x
data/512nodes_high_edge_variance.in: 1.08x
data/graph1MW_6.in: 1.05x
data/4096nodes.in: 1.12x
futhark-benchmarks/rodinia/bfs/bfs_filt_padded_fused.fut
data/64kn_32e-var-1-256-skew.in: 0.93x
data/512nodes_high_edge_variance.in: 0.97x
data/graph1MW_6.in: 0.99x
data/4096nodes.in: 0.93x
futhark-benchmarks/rodinia/bfs/bfs_heuristic.fut
data/64kn_32e-var-1-256-skew.in: 0.93x
data/512nodes_high_edge_variance.in: 0.93x
data/graph1MW_6.in: 1.03x
data/4096nodes.in: 0.98x
futhark-benchmarks/rodinia/bfs/bfs_iter_work_ok.fut
data/64kn_32e-var-1-256-skew.in: 1.07x
data/512nodes_high_edge_variance.in: 1.29x
data/graph1MW_6.in: 1.21x
data/4096nodes.in: 1.50x
futhark-benchmarks/rodinia/cfd/cfd.fut
data/fvcorr.domn.193K.toa: 1.01x
data/fvcorr.domn.097K.toa: 0.91x
futhark-benchmarks/rodinia/hotspot/hotspot.fut
data/512.in: 0.50x
data/1024.in: 1.00x
data/64.in: 1.06x
futhark-benchmarks/rodinia/kmeans/kmeans.fut
data/kdd_cup.in: 1.05x
data/100.in: 0.99x
data/204800.in: 1.00x
futhark-benchmarks/rodinia/lavaMD/lavaMD.fut
data/3_boxes.in: 1.00x
data/10_boxes.in: 1.02x
futhark-benchmarks/rodinia/lud/lud.fut
data/512.in: 1.00x
data/64.in: 0.99x
data/256.in: 0.97x
data/16by16.in: 1.01x
data/2048.in: 0.98x
futhark-benchmarks/rodinia/myocyte/myocyte.fut
data/small.in: 1.02x
data/medium.in: 1.00x
futhark-benchmarks/rodinia/nn/nn.fut
data/medium.in: 0.99x
futhark-benchmarks/rodinia/nw/nw.fut
data/large.in: 1.00x
futhark-benchmarks/rodinia/particlefilter/particlefilter.fut
data/128_128_10_image_400000_particles.in: 1.01x
data/128_128_10_image_10000_particles.in: 1.00x
futhark-benchmarks/rodinia/pathfinder/pathfinder.fut
data/medium.in: 0.40x
futhark-benchmarks/rodinia/srad/srad.fut
data/image.in: 0.97x
The good: bfs_iter_work_ok
The bfs_iter_work_ok benchmark from Rodinia is faster with my changes. By
investigating the code using futhark dev --gpu, we can see that we're hitting
exactly the kinds of places we set out to optimize: loops where the result is
unnecessarily linearized. Instead of having to linearize the resulting arrays
at the end of each loop, we existentialize the index functions of the existing
arrays and return those instead.
The bad: nbody
Some of the benchmarks are suspiciously slow. Let's take a look at nbody from
the Accelerate benchmark suite.
By compiling nbody with the two different versions of the compiler and running
it with profiling (-P) enabled, we get some idea of what's going on:
[jxk588@a00333 nbody]$ futhark-master opencl ./nbody.fut && cat data/1000-bodies.in | ./nbody -P --runs 10 >/dev/null
Peak memory usage for space 'device': 52000 bytes.
copy_dev_to_dev       ran 0 times; avg: 0us; total: 0us
copy_dev_to_host      ran 0 times; avg: 0us; total: 0us
copy_host_to_dev      ran 0 times; avg: 0us; total: 0us
copy_scalar_to_dev    ran 0 times; avg: 0us; total: 0us
copy_scalar_from_dev  ran 0 times; avg: 0us; total: 0us
segmap_intragroup_536 ran 1 times; avg: 72us; total: 72us
1 operations with cumulative runtime: 72us

[jxk588@a00333 nbody]$ futhark-existential-loop-6 opencl ./nbody.fut && cat data/1000-bodies.in | ./nbody -P --runs 10 >/dev/null
Peak memory usage for space 'device': 76000 bytes.
copy_dev_to_dev       ran 6 times; avg: 5us; total: 32us
copy_dev_to_host      ran 0 times; avg: 0us; total: 0us
copy_host_to_dev      ran 0 times; avg: 0us; total: 0us
copy_scalar_to_dev    ran 0 times; avg: 0us; total: 0us
copy_scalar_from_dev  ran 0 times; avg: 0us; total: 0us
segmap_intragroup_536 ran 1 times; avg: 69us; total: 69us
7 operations with cumulative runtime: 101us
We see that, using the existential-loop-6 version, we get six additional runs
of copy_dev_to_dev, and peak memory usage grows from 52000 to 76000 bytes.
Indeed, by examining the IR after the explicit-allocations pass, we see the
problem:
-- xps_364 : [n_354]f32@xps_mem_951->
{base: [n_354]; contiguous: True; LMADs: [{offset: 0i32; strides: [1i32]; rotates: [0i32]; shape: [n_354]; permutation: [0]; monotonicity: [Inc]}]}
-- yps_365 : [n_355]f32@yps_mem_952->
{base: [n_355]; contiguous: True; LMADs: [{offset: 0i32; strides: [1i32]; rotates: [0i32]; shape: [n_355]; permutation: [0]; monotonicity: [Inc]}]}
...
entry {[?0]f32@?7->
{base: [?0]; contiguous: True; LMADs: [{offset: 0i32; strides: [1i32];
rotates: [0i32]; shape: [?0];
permutation: [0];
monotonicity: [Inc]}]},
[?1]f32@?8->
{base: [?1]; contiguous: True; LMADs: [{offset: 0i32; strides: [1i32];
rotates: [0i32]; shape: [?1];
permutation: [0];
monotonicity: [Inc]}]},
...
}
main (mem xps_mem_951, mem yps_mem_952, ...,
i32 n_354, i32 n_355, ...,
[n_354]f32 xps_364, [n_355]f32 yps_365, ...) = {
...
-- res_390 : [n_354]f32@res_mem_1177->
{base: [res_ixfn_1173]; contiguous: True; LMADs: [{offset: res_ixfn_1174; strides: [res_ixfn_1175]; rotates: [0i32]; shape: [res_ixfn_1176]; permutation: [0]; monotonicity: [Inc]}]}
-- res_391 : [n_354]f32@res_mem_1182->
{base: [res_ixfn_1178]; contiguous: True; LMADs: [{offset: res_ixfn_1179; strides: [res_ixfn_1180]; rotates: [0i32]; shape: [res_ixfn_1181]; permutation: [0]; monotonicity: [Inc]}]}
...
let {i32 res_ixfn_1173, i32 res_ixfn_1174, i32 res_ixfn_1175,
...,
mem res_mem_1177, ...;
[n_354]f32 res_390, [n_354]f32 res_391, ...} =
-- bodies_396 : [n_354]f32@mem_param_962->
{base: [ctx_param_ext_958]; contiguous: True; LMADs: [{offset: ctx_param_ext_959; strides: [ctx_param_ext_960]; rotates: [0i32]; shape: [ctx_param_ext_961]; permutation: [0]; monotonicity: [Inc]}]}
-- bodies_397 : [n_354]f32@mem_param_967->
{base: [ctx_param_ext_963]; contiguous: True; LMADs: [{offset: ctx_param_ext_964; strides: [ctx_param_ext_965]; rotates: [0i32]; shape: [ctx_param_ext_966]; permutation: [0]; monotonicity: [Inc]}]}
loop {i32 ctx_param_ext_958, i32 ctx_param_ext_959, i32 ctx_param_ext_960,
...,
mem mem_param_962, mem mem_param_967, mem mem_param_972, ...;
[n_354]f32 bodies_396, [n_354]f32 bodies_397, ...} = { ... }
for _i_402:i32 < n_steps_361 do {
...
}
...
let {mem mem_1205} =
alloc(bytes_1203)
-- res_linear_1206 : [n_354]f32@mem_1205->
{base: [n_354]; contiguous: True; LMADs: [{offset: 0i32; strides: [1i32]; rotates: [0i32]; shape: [n_354]; permutation: [0]; monotonicity: [Inc]}]}
let {[n_354]f32 res_linear_1206} = copy(res_390)
let {i64 binop_x_1208} = sext i32 n_354 to i64
let {i64 bytes_1207} = mul_nw64(binop_x_1208, 4i64)
let {mem mem_1209} =
alloc(bytes_1207)
-- res_linear_1210 : [n_354]f32@mem_1209->
{base: [n_354]; contiguous: True; LMADs: [{offset: 0i32; strides: [1i32]; rotates: [0i32]; shape: [n_354]; permutation: [0]; monotonicity: [Inc]}]}
let {[n_354]f32 res_linear_1210} = copy(res_391)
let {i64 binop_x_1212} = sext i32 n_354 to i64
let {i64 bytes_1211} = mul_nw64(binop_x_1212, 4i64)
let {mem mem_1213} =
alloc(bytes_1211)
-- res_linear_1214 : [n_354]f32@mem_1213->
{base: [n_354]; contiguous: True; LMADs: [{offset: 0i32; strides: [1i32]; rotates: [0i32]; shape: [n_354]; permutation: [0]; monotonicity: [Inc]}]}
let {[n_354]f32 res_linear_1214} = copy(res_392)
...
in {n_354, n_354, ... , mem_1205, mem_1209, ... ,
res_linear_1206, res_linear_1210, ...}
}
The function has to return direct arrays (notice the type annotations at the top), but the results of the loop are not direct arrays in their current form, because we aggressively existentialize the entire index function (with the exception of the rotation). Therefore, copies are inserted.
Now, after the explicit-allocations pass, the simplifier determines that the offset and stride are actually loop-invariant (they're always 0 and 1, respectively), so the index functions resulting from the loop become direct, but the simplifier doesn't know to remove the extra copies at the end. Hopefully, we can implement a pass in the future that removes these unnecessary copies.
The ugly: hotspot, pathfinder and radix_sort_large
The hotspot and pathfinder benchmarks from Rodinia also perform badly with my
modifications, but not necessarily for the same reasons. Both suffer from the
same kind of unnecessary copies as nbody, but there's something else at work
as well:
$ futhark-existential-loop-6 opencl ./hotspot.fut && cat data/512.in | ./hotspot --runs 100 -t /dev/stderr >/dev/null
32573
11688
11613
31875
31992
31962
11844
32083
11275
32311
32100
32159
11583
...
$ futhark-existential-loop-6 opencl ./pathfinder.fut && cat data/medium.in | ./pathfinder --runs 100 -t /dev/stderr >/dev/null
10915
10787
3533
10733
10790
3511
10786
10974
3643
10877
10977
3613
...
In both cases, the same basic pattern repeats itself. When running the
pathfinder benchmark, we get two slow runs followed by one fast one. For
hotspot, the pattern is not quite as pronounced, but there are still some
incredible jumps in performance.
Interestingly, the radix_sort_large benchmark exhibits some of the same
behavior, but here the strange behavior manifests without my changes:
$ futhark-master opencl radix_sort_large.fut && cat data/radix_sort_100K.in | ./radix_sort_large --runs 10 -t /dev/stderr >/dev/null
6181
3607
5987
3380
5966
3460
6004
3449
5975
3477
I'm not sure what's going on here, but it warrants more investigation.
Next up
First, it would be nice to figure out why the numbers for hotspot,
pathfinder and radix_sort_large vary so wildly. I should determine whether
the performance swings happen on the host or on the GPU, and take it from
there.
Next week, I have to get started on an adaptation of the linear-scan register-allocation algorithm for array allocations in Futhark. More on that later.