Technical Diary, Day 2

Yesterday, and the plan for today

Well, I started trying to write about my progress yesterday, and I think it went well. So far so good. I'll try to keep up writing this technical diary for the next two weeks (before I go on vacation). I'll do a retrospective at the end of each week to see if there's anything related to the process I could improve.

I had some problems with my org-mode project which I use to publish these posts, but I figured it out today. Hopefully, publishing these diary entries will be straightforward now.

Work-wise, I figured out how the allocator works and how Troels' fix to the allocator have improved the benchmarks of my existential loops. However, the nbody benchmark is still very slow, even though it doesn't have that many copies, so today I'll try to figure out why. First I'll try to investigate the debugging output from running nbody on our test data set, in order to get an idea of how many allocation attempts there are. Then, I'll add some simple profiling to the allocator to see how many cache misses there are, guess at why they happen, and try to figure out how to avoid them.

What's wrong with nbody?

I thought I had fixed nbody with a small change in free_list_find, but that turned out to be false.

A bigger change seemed to do the trick though. As described yesterday, the allocator looks primarily for existing allocations matching the given tag, or variable name. Troels' fix extends that strategy by also matching on allocations with the exact same size, which allows the allocator to reuse more allocations. My idea was to completely disregard tags, and instead rely only on the size information. My algorithm runs through the free list and finds the free allocation with the best matching size and returns it.

There is a downside to this strategy: A program which continuously allocates larger and larger memory blocks using the same tag will never be able to reuse memory blocks, but it would keep all the previous allocations in the free list nonetheless. The previous allocator would remove and free existing allocations using the same tag before actually allocating a new block.

However, the upside is that many of our benchmarks run substantially faster.

/home/jxk588/src/futhark $ ./tools/cmp-bench-json.py bench-master.json bench-just-size.json

futhark-benchmarks/accelerate/canny/canny.fut
  data/lena512.in:                                                      1.16x
  data/lena256.in:                                                      1.13x

futhark-benchmarks/accelerate/crystal/crystal.fut
  #0 ("200i32 30.0f32 5i32 1i32 1.0f32"):                               0.97x
  #4 ("2000i32 30.0f32 50i32 1i32 1.0f32"):                             1.00x
  #5 ("4000i32 30.0f32 50i32 1i32 1.0f32"):                             1.00x

futhark-benchmarks/accelerate/fft/fft.fut
  data/64x256.in:                                                       1.01x
  data/128x512.in:                                                      1.15x
  data/1024x1024.in:                                                    1.01x
  data/512x512.in:                                                      1.05x
  data/256x256.in:                                                      1.03x
  data/128x128.in:                                                      1.04x

futhark-benchmarks/accelerate/fluid/fluid.fut
  benchmarking/medium.in:                                               1.07x

futhark-benchmarks/accelerate/hashcat/hashcat.fut
  rockyou.dataset:                                                      0.98x

futhark-benchmarks/accelerate/kmeans/kmeans.fut
  data/k5_n50000.in:                                                    1.00x
  data/trivial.in:                                                      1.08x
  data/k5_n200000.in:                                                   1.10x

futhark-benchmarks/accelerate/mandelbrot/mandelbrot.fut
  #1 ("1000i32 1000i32 -0.7f32 0.0f32 3.067f32 100i32 16...."):         1.01x
  #3 ("4000i32 4000i32 -0.7f32 0.0f32 3.067f32 100i32 16...."):         1.00x
  #2 ("2000i32 2000i32 -0.7f32 0.0f32 3.067f32 100i32 16...."):         1.00x
  #0 ("800i32 600i32 -0.7f32 0.0f32 3.067f32 100i32 16.0f..."):         0.99x
  #4 ("8000i32 8000i32 -0.7f32 0.0f32 3.067f32 100i32 16...."):         1.00x

futhark-benchmarks/accelerate/nbody/nbody-bh.fut
  data/10000-bodies.in:                                                 1.13x
  data/100000-bodies.in:                                                1.06x
  data/1000-bodies.in:                                                  1.07x

futhark-benchmarks/accelerate/nbody/nbody.fut
  data/10000-bodies.in:                                                 1.32x
  data/100000-bodies.in:                                                1.00x
  data/1000-bodies.in:                                                  3.15x

futhark-benchmarks/accelerate/pagerank/pagerank.fut
  data/small.in:                                                        1.04x
  data/random_medium.in:                                                1.13x

futhark-benchmarks/accelerate/ray/trace.fut
  #0 ("800i32 600i32 100i32 50.0f32 -100.0f32 -700.0f32 1..."):         1.00x

futhark-benchmarks/accelerate/smoothlife/smoothlife.fut
  #1 ("256i32"):                                                        1.00x
  #2 ("512i32"):                                                        0.99x
  #3 ("1024i32"):                                                       1.00x
  #0 ("128i32"):                                                        1.01x

futhark-benchmarks/accelerate/tunnel/tunnel.fut
  #1 ("10.0f32 1000i32 1000i32"):                                       1.00x
  #4 ("10.0f32 8000i32 8000i32"):                                       1.00x
  #0 ("10.0f32 800i32 600i32"):                                         1.00x
  #2 ("10.0f32 2000i32 2000i32"):                                       1.00x
  #3 ("10.0f32 4000i32 4000i32"):                                       1.00x

futhark-benchmarks/finpar/LocVolCalib.fut
  LocVolCalib-data/small.in:                                            0.99x
  LocVolCalib-data/medium.in:                                           1.00x
  LocVolCalib-data/large.in:                                            1.00x

futhark-benchmarks/finpar/OptionPricing.fut
  OptionPricing-data/medium.in:                                         1.04x
  OptionPricing-data/small.in:                                          1.05x
  OptionPricing-data/large.in:                                          1.00x

futhark-benchmarks/jgf/crypt/crypt.fut
  crypt-data/medium.in:                                                 2.51x

futhark-benchmarks/jgf/crypt/keys.fut
  crypt-data/userkey0.txt:                                              1.05x

futhark-benchmarks/jgf/series/series.fut
  data/1000000.in:                                                      1.00x
  data/10000.in:                                                        0.99x
  data/100000.in:                                                       1.00x

futhark-benchmarks/misc/bfast/bfast-cloudy.fut
  data/peru.in:                                                         0.94x
  data/sahara-cloudy.in:                                                0.94x

futhark-benchmarks/misc/bfast/bfast.fut
  data/sahara.in:                                                       1.03x

futhark-benchmarks/misc/heston/heston32.fut
  data/1062_quotes.in:                                                  0.97x
  data/10000_quotes.in:                                                 1.03x
  data/100000_quotes.in:                                                0.99x

futhark-benchmarks/misc/heston/heston64.fut
  data/1062_quotes.in:                                                  1.00x
  data/10000_quotes.in:                                                 1.00x
  data/100000_quotes.in:                                                1.00x

futhark-benchmarks/misc/knn-by-kdtree/buildKDtree.fut
  valid-data/kdtree-ppl-32-m-2097152.in:                                1.01x

futhark-benchmarks/misc/radix_sort/radix_sort_blelloch_benchmark.fut
  data/radix_sort_100K.in:                                              0.84x
  data/radix_sort_10K.in:                                               1.11x
  data/radix_sort_1M.in:                                                1.09x

futhark-benchmarks/misc/radix_sort/radix_sort_large.fut
  data/radix_sort_100K.in:                                              1.01x
  data/radix_sort_10K.in:                                               0.86x
  data/radix_sort_1M.in:                                                1.02x

futhark-benchmarks/parboil/histo/histo.fut
  data/default.in:                                                      1.35x
  data/large.in:                                                        1.33x

futhark-benchmarks/parboil/mri-q/mri-q.fut
  data/large.in:                                                        1.04x
  data/small.in:                                                        1.03x

futhark-benchmarks/parboil/sgemm/sgemm.fut
  data/tiny.in:                                                         1.94x
  data/small.in:                                                        1.54x
  data/medium.in:                                                       1.11x

futhark-benchmarks/parboil/stencil/stencil.fut
  data/default.in:                                                      1.00x
  data/small.in:                                                        1.00x

futhark-benchmarks/parboil/tpacf/tpacf.fut
  data/large.in:                                                        1.00x
  data/small.in:                                                        1.00x
  data/medium.in:                                                       1.00x

futhark-benchmarks/pbbs/ray/ray.fut
  data/angel.in:                                                        0.99x
  data/dragon.in:                                                       1.05x
  data/happy.in:                                                        1.02x

futhark-benchmarks/rodinia/backprop/backprop.fut
  data/small.in:                                                        1.22x
  data/medium.in:                                                       1.16x

futhark-benchmarks/rodinia/bfs/bfs_asympt_ok_but_slow.fut
  data/64kn_32e-var-1-256-skew.in:                                      1.05x
  data/512nodes_high_edge_variance.in:                                  1.03x
  data/graph1MW_6.in:                                                   1.02x
  data/4096nodes.in:                                                    0.99x

futhark-benchmarks/rodinia/bfs/bfs_filt_padded_fused.fut
  data/64kn_32e-var-1-256-skew.in:                                      0.97x
  data/512nodes_high_edge_variance.in:                                  1.08x
  data/graph1MW_6.in:                                                   1.12x
  data/4096nodes.in:                                                    0.93x

futhark-benchmarks/rodinia/bfs/bfs_heuristic.fut
  data/64kn_32e-var-1-256-skew.in:                                      1.03x
  data/512nodes_high_edge_variance.in:                                  1.05x
  data/graph1MW_6.in:                                                   1.07x
  data/4096nodes.in:                                                    1.21x

futhark-benchmarks/rodinia/bfs/bfs_iter_work_ok.fut
  data/64kn_32e-var-1-256-skew.in:                                      1.06x
  data/512nodes_high_edge_variance.in:                                  1.24x
  data/graph1MW_6.in:                                                   1.05x
  data/4096nodes.in:                                                    1.15x

futhark-benchmarks/rodinia/cfd/cfd.fut
  data/fvcorr.domn.193K.toa:                                            1.00x
  data/fvcorr.domn.097K.toa:                                            1.00x

futhark-benchmarks/rodinia/hotspot/hotspot.fut
  data/512.in:                                                          1.04x
  data/1024.in:                                                         1.02x
  data/64.in:                                                           1.04x

futhark-benchmarks/rodinia/kmeans/kmeans.fut
  data/kdd_cup.in:                                                      0.98x
  data/100.in:                                                          1.02x
  data/204800.in:                                                       1.00x

futhark-benchmarks/rodinia/lavaMD/lavaMD.fut
  data/3_boxes.in:                                                      1.40x
  data/10_boxes.in:                                                     1.13x

futhark-benchmarks/rodinia/lud/lud.fut
  data/512.in:                                                          1.01x
  data/64.in:                                                           0.99x
  data/256.in:                                                          1.11x
  data/16by16.in:                                                       1.01x
  data/2048.in:                                                         1.01x

futhark-benchmarks/rodinia/myocyte/myocyte.fut
  data/small.in:                                                        0.99x
  data/medium.in:                                                       1.00x

futhark-benchmarks/rodinia/nn/nn.fut
  data/medium.in:                                                       1.03x

futhark-benchmarks/rodinia/nw/nw.fut
  data/large.in:                                                        1.00x

futhark-benchmarks/rodinia/particlefilter/particlefilter.fut
  data/128_128_10_image_400000_particles.in:                            0.99x
  data/128_128_10_image_10000_particles.in:                             1.01x

futhark-benchmarks/rodinia/pathfinder/pathfinder.fut
  data/medium.in:                                                       0.96x

futhark-benchmarks/rodinia/srad/srad.fut
  data/image.in:                                                        1.02x

It sounded like the others like the tradeoff, so hopefully the PR will land soon.

Better debugging information in opencl.h

As part of the process of figuring out why nbody was so slow, I've also added a bit of debugging output to the allocator. More might be warranted in the future, but for now, we're mostly interested in when allocations actually happen.

Issues

  • Apparently, specifying that a source block contains futhark code causes the block to not scroll correctly when rendered as HTML.

Tomorrow

  • Let's try and get started on the linear scan algorithm. First, we need a way to perform liveness analysis on the explicit memory allocation IR in Futhark.
  • I should also start looking at the ICFP paper and artifact I'm supposed to review. I should at least print the article tomorrow, so I can read it tomorrow or the day after.

Longer term

  • A more complete rewrite of the dynamic allocator might be warranted.