Technical Diary, Day 1

The Purpose

This is an experiment. I might remove these entries if I don't like how it turns out. The goal is to document my technical work in more detail, in order to help me be more conscious about what I spend my time on, more deliberate about how I do it, and more mindful about the results.

The main idea is that articulating what I've learned throughout the day helps crystallize those learnings in my head. Just like teaching others helps you understand a subject much more deeply yourself[1], forcing myself to write for an audience (however imaginary it may be) about what I have (or haven't) achieved throughout the day should embed those achievements much deeper in my mind. At the same time, documenting my work in more detail like this will make it easier for me to go back to something I've worked on in the past and pick up where I left off. I hope.

Therefore, I'll try to write clearly about what I'm doing, hopefully without too many half-finished notes and TODOs. That said, the primary audience of these entries is myself, and I don't necessarily expect what I'm writing here to be of interest to anyone else. I also won't be overly pedagogical, so expect something in between full-blown blog posts for broad audiences and a haphazardly maintained text file full of personal notes.

Yesterday, and the plan for today

I've been spending a lot of time over the last few weeks on existentializing for-loops in Futhark in order to avoid unnecessary copies. Late last week, I believed I was mostly finished, but my changes turned out to reveal some very inconsistent benchmark times, seemingly related to the allocator. Troels merged a PR with some fixes for the allocator, so today I have to investigate whether those changes fixed my inconsistent benchmarks. Before I do so, though, I'd like to understand what he did, and why. I should also try to understand in general what the allocator does and how. Hopefully, with those changes, we can get my existentialized loops merged today or tomorrow.

We also finally had our changes to linguist merged, which means GitHub should hopefully get syntax highlighting and file-format recognition for Futhark.

Finally, I had an idea for improving the examples in Erk's genfut library, but after some initial investigation, I decided that it would be too much work for now.


The allocator

First of all, the allocator resides within the rts directory, which I haven't really looked at before. It seems to consist of common pieces of code that are included when Futhark writes its outputs, making up the common backend "runtimes", as it were.

For instance, when compiling a simple test program using futhark opencl, we see that the resulting foo.c file contains the following bits and pieces from rts:

// Start of util.h.
// Start of timing.h.
// Start of values.h.
// Start of tuning.h.
// Start of lock.h.
// Start of free_list.h.
// Start of opencl.h.

The OpenCL kernel also includes atomics.h from rts/c.

Let's take a quick look at what each of these does, leaving free_list.h out for now. But first, lunch… Good stuff.

util.h contains some generally useful functions, mostly to do with error handling, string handling, and file I/O. timing.h has just get_wall_time, which is pretty self-explanatory. values.h is a larger file, which contains functionality to read and write textual and binary Futhark values. It is probably used for reading array inputs and the like, when running something like echo [1i32] | ./foo. The high-level functions are write_array, read_array, write_scalar and read_scalar.

tuning.h contains a function to load tuning files, which relies on a function argument called set_size, which looks to be generated dynamically. Using futhark opencl, it looks like the function is called futhark_context_config_set_size.

lock.h contains a cross-platform implementation of a mutex lock. It seems to be used in the generated code to make sure that two instances of the Futhark program don't try to access the GPU device at the same time.

atomics.h is used by the generated GPU code (CUDA or OpenCL) to handle atomic arithmetic. opencl.h and cuda.h implement common helper functions for their respective platforms, like opencl_device_info, build_opencl_program, and setup_opencl_with_command_queue, but also, crucially for my current investigation, opencl_alloc and opencl_free. The allocation/deallocation functions, in turn, reference the functions in free_list.h, which contains the actual implementation of the free list. Let's take a closer look at the allocation functions and the free list.

opencl_alloc and friends

The opencl_alloc function is used to allocate GPU memory objects. The current implementation tries to reuse the same allocation for the same allocation point in the code. Meaning that if there is an allocation at the start of a loop, it will try to reuse that allocation each time the loop runs. It does so by taking the tag, or variable name, of the allocation variable and passing that as an argument to opencl_alloc. That tag is then passed on to free_list_find inside free_list.h, which loops through the list of free allocations and tries to find one with the same name. The assumption is that there is always at most one block with a given tag in the free list. This means that the allocator is not primarily focused on the size of the allocation, but on the name. If there is no free allocation with the same name, free_list_find will not return any allocation, even if there are allocations of the right size.

Upon looking at free_list_find and opencl_alloc initially, I thought we were primarily concerned with the size of the allocation, but that turned out to be false.

In any case, when free_list_find returns a free block, opencl_alloc then checks whether it is sufficiently large. If not, it frees the returned block and allocates a new one.

opencl_free is similarly simple. First it releases any allocations from the free list with the same tag as the block it's trying to free, and then it inserts the current block into the free list by calling free_list_insert.

In some cases, my existentialization changes cause memory blocks to change names over the course of the program. This means that a block can be allocated under one name and freed under another. The next time the same memory is allocated, there is no free element in the list with the expected tag, so we have to perform a new allocation. Troels' fix solves this by also allowing free_list_find to return blocks of identical size to the one we're trying to allocate, which will find our earlier allocation of the same size.

Still, perhaps there's a better way to do this. The dynamic allocator could probably use a rework, now that we cannot rely on tags as much as was previously the case. If we decide to rework it, the primary concern is the total size of allocations in the list. According to Troels, GPUs do not handle running out of memory well, so we'll need to make sure we're relatively conservative with our memory usage.

Impact on benchmarks with existentialized loops

Now, we should take a look at the impact on the benchmarks, with the goal of merging the PR.

So, we're comparing the compiler at commit 931bd15749e8e025c5223be5411ae424f3e59ca0, the tip of the existential-loop branch, to the compiler at commit 4fedd7191c32bf364790578b235d20068cb35c61, the master commit it is based on.

[jxk588@a00333 futhark]$ ~/src/futhark/tools/cmp-bench-json.py bench-master-new.json bench-existential-loop-6-new.json

  data/lena512.in:                                                      1.03x
  data/lena256.in:                                                      0.96x

  #0 ("200i32 30.0f32 5i32 1i32 1.0f32"):                               1.01x
  #4 ("2000i32 30.0f32 50i32 1i32 1.0f32"):                             1.00x
  #5 ("4000i32 30.0f32 50i32 1i32 1.0f32"):                             1.00x

  data/64x256.in:                                                       0.95x
  data/128x512.in:                                                      0.94x
  data/1024x1024.in:                                                    0.89x
  data/512x512.in:                                                      0.91x
  data/256x256.in:                                                      0.92x
  data/128x128.in:                                                      0.92x

  benchmarking/medium.in:                                               0.96x

  rockyou.dataset:                                                      1.01x

  data/k5_n50000.in:                                                    1.13x
  data/trivial.in:                                                      1.01x
  data/k5_n200000.in:                                                   0.91x

  #1 ("1000i32 1000i32 -0.7f32 0.0f32 3.067f32 100i32 16...."):         1.02x
  #3 ("4000i32 4000i32 -0.7f32 0.0f32 3.067f32 100i32 16...."):         1.00x
  #2 ("2000i32 2000i32 -0.7f32 0.0f32 3.067f32 100i32 16...."):         1.00x
  #0 ("800i32 600i32 -0.7f32 0.0f32 3.067f32 100i32 16.0f..."):         0.99x
  #4 ("8000i32 8000i32 -0.7f32 0.0f32 3.067f32 100i32 16...."):         1.00x

  data/10000-bodies.in:                                                 0.99x
  data/100000-bodies.in:                                                0.98x
  data/1000-bodies.in:                                                  0.98x

  data/10000-bodies.in:                                                 0.84x
  data/100000-bodies.in:                                                1.00x
  data/1000-bodies.in:                                                  0.52x

  data/small.in:                                                        1.00x
  data/random_medium.in:                                                1.00x

  #0 ("800i32 600i32 100i32 50.0f32 -100.0f32 -700.0f32 1..."):         0.97x

  #1 ("256i32"):                                                        0.97x
  #2 ("512i32"):                                                        0.91x
  #3 ("1024i32"):                                                       0.95x
  #0 ("128i32"):                                                        0.98x

  #1 ("10.0f32 1000i32 1000i32"):                                       1.00x
  #4 ("10.0f32 8000i32 8000i32"):                                       1.00x
  #0 ("10.0f32 800i32 600i32"):                                         1.00x
  #2 ("10.0f32 2000i32 2000i32"):                                       1.00x
  #3 ("10.0f32 4000i32 4000i32"):                                       1.00x

  LocVolCalib-data/small.in:                                            1.00x
  LocVolCalib-data/medium.in:                                           1.00x
  LocVolCalib-data/large.in:                                            1.00x

  OptionPricing-data/medium.in:                                         0.98x
  OptionPricing-data/small.in:                                          1.01x
  OptionPricing-data/large.in:                                          1.00x

  crypt-data/medium.in:                                                 0.98x

  crypt-data/userkey0.txt:                                              0.97x

  data/1000000.in:                                                      1.00x
  data/10000.in:                                                        1.00x
  data/100000.in:                                                       1.00x

  data/peru.in:                                                         1.00x
  data/sahara-cloudy.in:                                                0.96x

  data/sahara.in:                                                       1.00x

  data/1062_quotes.in:                                                  0.99x
  data/10000_quotes.in:                                                 1.02x
  data/100000_quotes.in:                                                1.00x

  data/1062_quotes.in:                                                  1.00x
  data/10000_quotes.in:                                                 1.00x
  data/100000_quotes.in:                                                1.00x

  valid-data/kdtree-ppl-32-m-2097152.in:                                1.01x

  data/radix_sort_100K.in:                                              1.10x
  data/radix_sort_10K.in:                                               1.10x
  data/radix_sort_1M.in:                                                1.00x

  data/radix_sort_100K.in:                                              1.01x
  data/radix_sort_10K.in:                                               1.11x
  data/radix_sort_1M.in:                                                1.01x

  data/default.in:                                                      0.99x
  data/large.in:                                                        1.03x

  data/large.in:                                                        1.00x
  data/small.in:                                                        0.94x

  data/tiny.in:                                                         0.98x
  data/small.in:                                                        1.05x
  data/medium.in:                                                       1.00x

  data/default.in:                                                      0.99x
  data/small.in:                                                        0.99x

  data/large.in:                                                        1.00x
  data/small.in:                                                        1.00x
  data/medium.in:                                                       1.00x

  data/angel.in:                                                        1.00x
  data/dragon.in:                                                       1.00x
  data/happy.in:                                                        1.00x

  data/small.in:                                                        1.05x
  data/medium.in:                                                       0.99x

  data/64kn_32e-var-1-256-skew.in:                                      1.04x
  data/512nodes_high_edge_variance.in:                                  0.98x
  data/graph1MW_6.in:                                                   0.95x
  data/4096nodes.in:                                                    0.99x

  data/64kn_32e-var-1-256-skew.in:                                      0.99x
  data/512nodes_high_edge_variance.in:                                  0.98x
  data/graph1MW_6.in:                                                   1.04x
  data/4096nodes.in:                                                    1.03x

  data/64kn_32e-var-1-256-skew.in:                                      0.99x
  data/512nodes_high_edge_variance.in:                                  1.08x
  data/graph1MW_6.in:                                                   0.99x
  data/4096nodes.in:                                                    1.01x

  data/64kn_32e-var-1-256-skew.in:                                      1.10x
  data/512nodes_high_edge_variance.in:                                  1.34x
  data/graph1MW_6.in:                                                   1.16x
  data/4096nodes.in:                                                    1.27x

  data/fvcorr.domn.193K.toa:                                            1.00x
  data/fvcorr.domn.097K.toa:                                            1.00x

  data/512.in:                                                          1.05x
  data/1024.in:                                                         1.00x
  data/64.in:                                                           1.00x

  data/kdd_cup.in:                                                      1.00x
  data/100.in:                                                          1.01x
  data/204800.in:                                                       1.00x

  data/3_boxes.in:                                                      0.99x
  data/10_boxes.in:                                                     1.01x

  data/512.in:                                                          0.99x
  data/64.in:                                                           1.00x
  data/256.in:                                                          1.03x
  data/16by16.in:                                                       0.99x
  data/2048.in:                                                         1.00x

  data/small.in:                                                        1.02x
  data/medium.in:                                                       0.99x

  data/medium.in:                                                       1.01x

  data/large.in:                                                        1.00x

  data/128_128_10_image_400000_particles.in:                            0.99x
  data/128_128_10_image_10000_particles.in:                             0.97x

  data/medium.in:                                                       0.99x

  data/image.in:                                                        1.02x

Mostly, the results are the same as before, though pathfinder and hotspot are no longer as slow. Unfortunately, nbody is still really slow for small datasets, so we'll need to do more investigation there.


  • My current publishing setup in org-mode seems to not always pick up new files automatically.
  • It also doesn't handle links to sections in other files very well.


For tomorrow

  • Let's try to get some profiling and instrumentation going for the dynamic allocator. It would be nice to see why nbody is so slow.

Longer term

  • After successfully adding Futhark highlighting to GitHub (although I still don't know when it will actually show up in the system), I just noticed that bat, which I use instead of cat, doesn't support Futhark. Adding support seems to require writing another kind of syntax file, but now that we have a few different ones to work from, perhaps it won't be a big deal.
  • We should add some profiling information to the dynamic allocator. Troels suggests just having some fprintf calls in the right places.
  • Consider rewriting the dynamic allocator. Perhaps we can key on size alone. After instrumenting with some timing information, we'll need to do some tests to see what impact different strategies will have.



[1] That has been my experience, and I believe it's well supported by scientific evidence.