Technical Diary, Day 1

The Purpose

This is an experiment. I might remove these entries if I don't like how it turns out. The goal is to document my technical work in more detail, in order to help me be more conscious about what I spend my time on, more deliberate about how I do it, and more mindful about the results.

The main idea is that articulating what I've learned throughout the day helps crystallize those learnings in my head. Just like teaching others helps you understand a subject much more deeply yourself[1], forcing myself to write for an audience (however imaginary it may be) about what I have (or haven't) achieved throughout the day should embed those achievements much deeper in my mind. At the same time, documenting my work in more detail like this will make it easier for me to go back to something I've worked on in the past and pick up where I left off. I hope.

Therefore, I'll try to write clearly about what I'm doing, hopefully without too many half-finished notes and TODOs. That said, the primary audience of these entries is myself, and I don't necessarily expect what I'm writing here to be of interest to anyone else. I also won't be overly pedagogical, so expect something in between full-blown blog posts for broad audiences and a haphazardly maintained text file full of personal notes.

Yesterday, and the plan for today

I've been spending a lot of time over the last few weeks on existentializing for-loops in Futhark in order to avoid unnecessary copies. Late last week, I believed I was mostly finished, but my changes turned out to reveal some very inconsistent benchmark times, seemingly related to the allocator. Troels merged a PR with some fixes for the allocator, so today I have to investigate whether those changes fixed my inconsistent benchmarks. Before I do so, though, I'd like to understand what he did, and why. I should also try to understand in general what the allocator does and how. Hopefully, with those changes, we can get my existentialized loops merged today or tomorrow.

We also finally had our changes to linguist merged, which means GitHub should hopefully get syntax highlighting and file-format recognition for Futhark.

Finally, I had an idea for improving the examples in Erk's genfut library, but after some initial investigation, I decided that it would be too much work for now.


The allocator

First of all, the allocator resides within the rts directory, which I haven't really looked at before. It seems to consist of common pieces of code that are included when Futhark writes its outputs, making up the common backend "runtimes", as it were.

For instance, when compiling a simple test program using futhark opencl, we see that the resulting foo.c file contains the following bits and pieces from rts:

// Start of util.h.
// Start of timing.h.
// Start of values.h.
// Start of tuning.h.
// Start of lock.h.
// Start of free_list.h.
// Start of opencl.h.

The OpenCL kernel also includes atomics.h from rts/c.

Let's take a quick look at what each of these does, leaving free_list.h out for now. But first, lunch… Good stuff.

util.h contains some generally useful functions, mostly to do with error handling, string handling, and file I/O. timing.h has just get_wall_time, which is pretty self-explanatory. values.h is a larger file, which contains functionality to read and write textual and binary Futhark values. It is probably used for reading array inputs and the like, when running something like echo [1i32] | ./foo. The high-level functions are write_array, read_array, write_scalar and read_scalar.

tuning.h contains a function to load tuning files, which relies on a function argument called set_size, which looks to be generated dynamically. Using futhark opencl, it looks like the function is called futhark_context_config_set_size.

lock.h contains a cross-platform implementation of a mutex lock. It seems to be used in the generated code to make sure that two instances of the Futhark program don't try to access the GPU device at the same time.

atomics.h is used by the generated GPU code (CUDA or OpenCL) to handle atomic arithmetic. opencl.h and cuda.h implement common helper functions for their respective platforms, like opencl_device_info, build_opencl_program, and setup_opencl_with_command_queue, but also, crucially for my current investigation, opencl_alloc and opencl_free. The allocation/deallocation functions, in turn, reference the functions in free_list.h, which contains the actual implementation of the free list. Let's take a closer look at the allocation functions and the free list.

opencl_alloc and friends

The opencl_alloc function is used to allocate GPU memory objects. The current implementation tries to reuse the same allocation for the same allocation point in the code. Meaning that if there is an allocation at the start of a loop, it will try to reuse that allocation each time the loop runs. It does so by taking the tag, or variable name, of the allocation variable and passing that as an argument to opencl_alloc. That tag is then passed on to free_list_find inside free_list.h, which loops through the list of free allocations and tries to find one with the same name. The assumption is that there is always at most one block with a given tag in the free list. This means that the allocator is not primarily focused on the size of the allocation, but on the name. If there is no free allocation with the same name, free_list_find will not return any allocation, even if there are allocations of the right size.

Upon looking at free_list_find and opencl_alloc initially, I thought we were primarily concerned with the size of the allocation, but that turned out to be false.

In any case, when free_list_find returns a free block, opencl_alloc then checks whether it is sufficiently large. If not, it frees the returned block and allocates a new one.

opencl_free is similarly simple. First it releases any allocations from the free list with the same tag as the block it's trying to free, and then it inserts the current block into the free list by calling free_list_insert.

In some cases, my existentialization changes cause memory blocks to change names over the course of the program. This means that a block can be allocated under one name and freed under another. The next time the same memory is allocated, there is no free element in the list with the expected tag, so we have to perform a new allocation. Troels' fix solves this by also allowing free_list_find to return blocks of identical size to the one we're trying to allocate, which will find our earlier allocation of the same size.

Still, perhaps there's a better way to do this. The dynamic allocator could probably use a rework, now that we cannot rely on tags as much as was previously the case. If we decide to rework it, the primary concern is the total size of allocations in the list. According to Troels, GPUs do not handle running out of memory well, so we'll need to make sure we're relatively conservative with our memory usage.

Impact on benchmarks with existentialized loops

Now, we should take a look at the impact on the benchmarks, with the goal of merging the PR.

So, we're comparing the compiler at commit 931bd15749e8e025c5223be5411ae424f3e59ca0, the tip of the existential-loop branch, to the compiler at commit 4fedd7191c32bf364790578b235d20068cb35c61, the master commit it is based on.

[jxk588@a00333 futhark]$ ~/src/futhark/tools/cmp-bench-json.py bench-master-new.json bench-existential-loop-6-new.json

  data/lena512.in:                                                      1.03x
  data/lena256.in:                                                      0.96x

  #0 ("200i32 30.0f32 5i32 1i32 1.0f32"):                               1.01x
  #4 ("2000i32 30.0f32 50i32 1i32 1.0f32"):                             1.00x
  #5 ("4000i32 30.0f32 50i32 1i32 1.0f32"):                             1.00x

  data/64x256.in:                                                       0.95x
  data/128x512.in:                                                      0.94x
  data/1024x1024.in:                                                    0.89x
  data/512x512.in:                                                      0.91x
  data/256x256.in:                                                      0.92x
  data/128x128.in:                                                      0.92x

  benchmarking/medium.in:                                               0.96x

  rockyou.dataset:                                                      1.01x

  data/k5_n50000.in:                                                    1.13x
  data/trivial.in:                                                      1.01x
  data/k5_n200000.in:                                                   0.91x

  #1 ("1000i32 1000i32 -0.7f32 0.0f32 3.067f32 100i32 16...."):         1.02x
  #3 ("4000i32 4000i32 -0.7f32 0.0f32 3.067f32 100i32 16...."):         1.00x
  #2 ("2000i32 2000i32 -0.7f32 0.0f32 3.067f32 100i32 16...."):         1.00x
  #0 ("800i32 600i32 -0.7f32 0.0f32 3.067f32 100i32 16.0f..."):         0.99x
  #4 ("8000i32 8000i32 -0.7f32 0.0f32 3.067f32 100i32 16...."):         1.00x

  data/10000-bodies.in:                                                 0.99x
  data/100000-bodies.in:                                                0.98x
  data/1000-bodies.in:                                                  0.98x

  data/10000-bodies.in:                                                 0.84x
  data/100000-bodies.in:                                                1.00x
  data/1000-bodies.in:                                                  0.52x

  data/small.in:                                                        1.00x
  data/random_medium.in:                                                1.00x

  #0 ("800i32 600i32 100i32 50.0f32 -100.0f32 -700.0f32 1..."):         0.97x

  #1 ("256i32"):                                                        0.97x
  #2 ("512i32"):                                                        0.91x
  #3 ("1024i32"):                                                       0.95x
  #0 ("128i32"):                                                        0.98x

  #1 ("10.0f32 1000i32 1000i32"):                                       1.00x
  #4 ("10.0f32 8000i32 8000i32"):                                       1.00x
  #0 ("10.0f32 800i32 600i32"):                                         1.00x
  #2 ("10.0f32 2000i32 2000i32"):                                       1.00x
  #3 ("10.0f32 4000i32 4000i32"):                                       1.00x

  LocVolCalib-data/small.in:                                            1.00x
  LocVolCalib-data/medium.in:                                           1.00x
  LocVolCalib-data/large.in:                                            1.00x

  OptionPricing-data/medium.in:                                         0.98x
  OptionPricing-data/small.in:                                          1.01x
  OptionPricing-data/large.in:                                          1.00x

  crypt-data/medium.in:                                                 0.98x

  crypt-data/userkey0.txt:                                              0.97x

  data/1000000.in:                                                      1.00x
  data/10000.in:                                                        1.00x
  data/100000.in:                                                       1.00x

  data/peru.in:                                                         1.00x
  data/sahara-cloudy.in:                                                0.96x

  data/sahara.in:                                                       1.00x

  data/1062_quotes.in:                                                  0.99x
  data/10000_quotes.in:                                                 1.02x
  data/100000_quotes.in:                                                1.00x

  data/1062_quotes.in:                                                  1.00x
  data/10000_quotes.in:                                                 1.00x
  data/100000_quotes.in:                                                1.00x

  valid-data/kdtree-ppl-32-m-2097152.in:                                1.01x

  data/radix_sort_100K.in:                                              1.10x
  data/radix_sort_10K.in:                                               1.10x
  data/radix_sort_1M.in:                                                1.00x

  data/radix_sort_100K.in:                                              1.01x
  data/radix_sort_10K.in:                                               1.11x
  data/radix_sort_1M.in:                                                1.01x

  data/default.in:                                                      0.99x
  data/large.in:                                                        1.03x

  data/large.in:                                                        1.00x
  data/small.in:                                                        0.94x

  data/tiny.in:                                                         0.98x
  data/small.in:                                                        1.05x
  data/medium.in:                                                       1.00x

  data/default.in:                                                      0.99x
  data/small.in:                                                        0.99x

  data/large.in:                                                        1.00x
  data/small.in:                                                        1.00x
  data/medium.in:                                                       1.00x

  data/angel.in:                                                        1.00x
  data/dragon.in:                                                       1.00x
  data/happy.in:                                                        1.00x

  data/small.in:                                                        1.05x
  data/medium.in:                                                       0.99x

  data/64kn_32e-var-1-256-skew.in:                                      1.04x
  data/512nodes_high_edge_variance.in:                                  0.98x
  data/graph1MW_6.in:                                                   0.95x
  data/4096nodes.in:                                                    0.99x

  data/64kn_32e-var-1-256-skew.in:                                      0.99x
  data/512nodes_high_edge_variance.in:                                  0.98x
  data/graph1MW_6.in:                                                   1.04x
  data/4096nodes.in:                                                    1.03x

  data/64kn_32e-var-1-256-skew.in:                                      0.99x
  data/512nodes_high_edge_variance.in:                                  1.08x
  data/graph1MW_6.in:                                                   0.99x
  data/4096nodes.in:                                                    1.01x

  data/64kn_32e-var-1-256-skew.in:                                      1.10x
  data/512nodes_high_edge_variance.in:                                  1.34x
  data/graph1MW_6.in:                                                   1.16x
  data/4096nodes.in:                                                    1.27x

  data/fvcorr.domn.193K.toa:                                            1.00x
  data/fvcorr.domn.097K.toa:                                            1.00x

  data/512.in:                                                          1.05x
  data/1024.in:                                                         1.00x
  data/64.in:                                                           1.00x

  data/kdd_cup.in:                                                      1.00x
  data/100.in:                                                          1.01x
  data/204800.in:                                                       1.00x

  data/3_boxes.in:                                                      0.99x
  data/10_boxes.in:                                                     1.01x

  data/512.in:                                                          0.99x
  data/64.in:                                                           1.00x
  data/256.in:                                                          1.03x
  data/16by16.in:                                                       0.99x
  data/2048.in:                                                         1.00x

  data/small.in:                                                        1.02x
  data/medium.in:                                                       0.99x

  data/medium.in:                                                       1.01x

  data/large.in:                                                        1.00x

  data/128_128_10_image_400000_particles.in:                            0.99x
  data/128_128_10_image_10000_particles.in:                             0.97x

  data/medium.in:                                                       0.99x

  data/image.in:                                                        1.02x

Mostly, the results are the same as before, though pathfinder and hotspot are no longer as slow. Unfortunately, nbody is still really slow for small datasets, so we'll need to do more investigation there.


  • My current publishing setup in org-mode seems to not always pick up new files automatically.
  • It also doesn't handle links to sections in other files very well.


For tomorrow

  • Let's try to get some profiling and instrumentation going for the dynamic allocator. It would be nice to see why nbody is so slow.

Longer term

  • After successfully adding Futhark highlighting to GitHub (although I still don't know when it will actually show up in the system), I just noticed that bat, which I use instead of cat, doesn't support Futhark. Adding support seems to require writing another kind of syntax file, but now that we have a few different ones to work from, perhaps it won't be a big deal.
  • We should add some profiling information to the dynamic allocator. Troels suggests just having some fprintf calls in the right places.
  • Consider rewriting the dynamic allocator. Perhaps we can key on size alone. After instrumenting with some timing information, we'll need to do some tests to see what impact different strategies will have.



[1] That has been my experience, and I believe it's well supported by scientific evidence.