# Improving a benchmark

## Futhark

I've recently begun work on a PhD at the University of Copenhagen, working on a parallel programming language called Futhark. The goal is to help address some memory allocation issues in the language, specifically by employing static analysis at compile time to avoid things like unnecessary buffer allocations. I'm still learning about the language and how to think about the generated GPU code, and to that end I have been spending some time trying to improve a benchmark program written in Futhark.

## The Benchmark

The benchmark is called `lud`; it computes an LU decomposition of a given matrix in parallel, and it comes from the parallel programming benchmark suite Rodinia. Troels Henriksen and Cosmin Oancea have written the initial translations of the `lud` benchmark, which are available here. Specifically, I'm measuring myself against the `lud-clean.fut` implementation.
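As a refresher, an LU decomposition factors a square matrix A into a unit lower-triangular L and an upper-triangular U such that A = LU. Here is a minimal sequential sketch in Python (Doolittle's method, no pivoting; the function name is mine, not from the benchmark — the GPU implementations compute the same factorization block-wise and in parallel):

```python
def lu_decompose(a):
    """Factor a into (l, u) with a = l @ u, l unit lower-triangular."""
    n = len(a)
    l = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    u = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Row i of U: remove contributions from already-computed rows.
        for j in range(i, n):
            u[i][j] = a[i][j] - sum(l[i][k] * u[k][j] for k in range(i))
        # Column i of L: same, scaled by the pivot U[i][i].
        for j in range(i + 1, n):
            l[j][i] = (a[j][i] - sum(l[j][k] * u[k][i] for k in range(i))) / u[i][i]
    return l, u
```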

Let's first examine how the initial Futhark implementation is doing compared to the original OpenCL implementation from Rodinia, both using block sizes of 16.

All benchmarks are performed on an Intel Xeon E5-2650 v2 2.60GHz computer with a
GeForce RTX 2080 Ti GPU. I am using version 3.1 of the Rodinia benchmark^{1} and
Futhark at commit 73b44b1 (from 2020-01-14). In order to run the Rodinia
benchmark, first compile it using `make` and then run the program to get 11
samples, discarding the first:

```
make
./lud -i ../../../data/lud/2048.dat
```

To run the Futhark benchmark, first auto-tune `lud-clean.fut`, and then run the
benchmark using `futhark bench`:

```
FUTHARK_INCREMENTAL_FLATTENING=1 futhark autotune --backend=opencl lud-clean.fut
FUTHARK_INCREMENTAL_FLATTENING=1 futhark bench --backend=opencl lud-clean.fut
```

Note that I have enabled incremental flattening, as we'll depend on it for some of our later optimizations. To learn more about incremental flattening in Futhark, see *Data-Parallel Flattening by Expansion* by Elsman et al.

| Matrix size | Rodinia | Futhark | Difference |
|---|---|---|---|
| 64x64 | 74.4 μs (RSD: 0.15) | 692.0 μs (RSD: 0.10) | +830.1% |
| 256x256 | 285.2 μs (RSD: 0.07) | 2816.0 μs (RSD: 0.13) | +887.3% |
| 512x512 | 566.6 μs (RSD: 0.09) | 5427.1 μs (RSD: 0.12) | +857.8% |
| 2048x2048 | 2189.1 μs (RSD: 0.01) | 99729.1 μs (RSD: 0.04) | +4455.7% |

As we can see, the Futhark implementation performs significantly worse.

## `lud_perimeter_upper`

The bulk of the work in the current Futhark implementation of `lud` is done by
the four helper functions `lud_diagonal`, `lud_perimeter_upper`,
`lud_perimeter_lower` and `lud_internal`. In Rodinia, `lud_perimeter_upper` and
`lud_perimeter_lower` are handled by the same function, `lud_perimeter`, but in
the Futhark benchmark they've been split into two separate functions. Let us
first try to understand the implementation of `lud_perimeter_upper` and see how
it compares with the implementation in Rodinia.

### Inspecting the implementation

```futhark
let lud_perimeter_upper [m][b] (diag: [b][b]f32) (a0s: [m][b][b]f32): *[m][b][b]f32 =
  let a1s = map (\(x: [b][b]f32): [b][b]f32 -> transpose(x)) a0s in
  let a2s =
    map (\a1: [b][b]f32 ->
           map (\row0: [b]f32 -> -- Upper
                  loop row=replicate b 0.0f32 for j < b do
                    let sum = (loop sum=0.0f32 for k < j do
                                 sum + diag[j,k] * row[k])
                    let row[j] = row0[j] - sum
                    in row
               ) a1
        ) a1s
  in map (\x: [b][b]f32 -> transpose(x)) a2s
```

`lud_perimeter_upper` takes a b×b matrix called `diag`, which is used to
multiply with, and an array of b×b matrices, or blocks, called `a0s`. First, we
transpose each block in `a0s`, and then we do a parallel map over each block
and another one over each row in the block. For each cell in the row, we
compute the dot product of the previously updated values in the row and a row
in `diag`, and subtract it from the original value. In Futhark notation, the
sum subtracted from each updated cell A[i,j] is `dotprod row[:j] diag[j, :j]`,
where `row[:j]` has already been updated. Although we are mapping over each row
in the matrices, because the original matrix has been transposed, it is helpful
to think of the parallelism as happening on a per-column basis. Each parallel
map then has an inner loop that computes the actual sum.
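Sequentially, this per-column update amounts to a forward substitution with `diag` treated as unit lower-triangular (only `diag[j, :j]` is ever read, so the diagonal is implicitly 1). A small Python sketch of my reading of the update, with hypothetical names:

```python
def update_column(diag, col):
    # row[j] = col[j] - dotprod(row[:j], diag[j][:j]), where row[:j]
    # has already been updated -- i.e. forward substitution solving
    # diag @ row = col for a unit lower-triangular diag.  Each entry
    # depends on all earlier entries, so the loop is sequential per
    # column, while all columns can run in parallel.
    row = [0.0] * len(col)
    for j in range(len(col)):
        s = sum(diag[j][k] * row[k] for k in range(j))
        row[j] = col[j] - s
    return row
```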

If we inspect the hand-written kernel from the Rodinia implementation of `lud`,
we can see that the two functions from `lud-clean.fut` have been merged into
one, and that Rodinia uses `BLOCK_SIZE*2` threads, where the threads with
`tx < BLOCK_SIZE` handle the upper perimeter and the others handle the lower
perimeter. The loop itself, however, looks quite similar to what we'd expect
from `lud-clean.fut`. However, if we look a bit further up in the OpenCL
implementation, we see that it is using local memory, and spends quite a bit of
time moving the matrices to and from those local stores. That makes a lot of
sense, since we'll have `BLOCK_SIZE` reads from each entry `peri_row[j, tx]`,
and reading from global memory is much more expensive than reading from local
memory. The same goes for writing to `peri_row[i, tx]`. It's reasonable to
suspect that a lot of our performance loss is due to using global memory
instead of local memory. Let's see if we can confirm our suspicions.

To verify that `lud_perimeter_upper` uses global memory but is otherwise mostly
identical to the implementation from Rodinia, we first extract the function into
its own file, just to make the generated code easier to understand. We'll need
to change the function name to `main` or turn the function into an entry point,
but then we can simply compile the code using

```
FUTHARK_INCREMENTAL_FLATTENING=1 futhark opencl lud_perimeter_upper.fut
```

and then dump the OpenCL kernel using

```
./lud_perimeter_upper --dump-opencl lud_perimeter_upper.cl
```

The main bit of code that we're interested in looks like this:

```c
for (int32_t j_5811 = 0; j_5811 < b_5471; j_5811++) {
    float sum_5812;
    float sum_5813 = 0.0F;

    for (int32_t k_5814 = 0; k_5814 < j_5811; k_5814++) {
        float x_5815 = ((__global float *) diag_mem_5827)[j_5811 * b_5472 + k_5814];
        float y_5816 = ((__global float *) mem_5831)[phys_tid_5617 +
                                                     k_5814 * (num_groups_5804 * segmap_group_sizze_5803)];
        float y_5817 = x_5815 * y_5816;
        float res_5818 = sum_5813 + y_5817;
        float sum_tmp_5862 = res_5818;

        sum_5813 = sum_tmp_5862;
    }
    sum_5812 = sum_5813;

    float x_5819 = ((__global float *) a0s_mem_5828)[gtid_5615 * (b_5475 * b_5474) +
                                                     j_5811 * b_5475 + gtid_5616];
    float lw_val_5820 = x_5819 - sum_5812;

    ((__global float *) mem_5831)[phys_tid_5617 +
                                  j_5811 * (num_groups_5804 * segmap_group_sizze_5803)] = lw_val_5820;
}
```

We can see that the loop is essentially the same as the hand-written one: each
thread computes the values for its column in a loop. However, we also see that
this version is using global memory, and of course there are also the two
transposes before and after the main loop. Futhark doesn't know that copying
the entire matrix to local memory would speed things up. Inserting a single
`let a1 = copy a1 in` between the two `map`s would not help, as Futhark would
recognize it as an unnecessary copy and optimize it away. We need to trick
Futhark into copying the matrix into local memory.

### Improving the implementation

We want to use local memory to process the matrices `lud_perimeter_upper` gets
as input; in other words, we want to exploit intra-group parallelism. Each
group should work on one matrix in the array of matrices. Each thread in a
group should copy a part of the necessary matrix to local memory, and then,
when the whole matrix has been copied, the threads should start actually
computing the dot products.

Our initial implementation looks like this:

```futhark
let main [m][b] (diag: [b][b]f32) (mats: *[m][b][b]f32): *[m][b][b]f32 =
  map (\mat: [b][b]f32 ->
         let mat = copy mat in
         loop mat for im1 < b-1 do
           let i = im1 + 1
           let row_sums =
             loop row_sums = replicate b 0 for j < i do
               map2 (\sum el -> sum + diag[i,j] * el) row_sums mat[j]
           let row = map2 (-) mat[i] row_sums
           let mat[i] = row
           in mat
      ) mats
```

We'll note some key differences from the previous function:

- There is no transpose at the beginning and end of the function.
- After mapping over the matrices in `mats`, we copy `mat` to try to force it
  into local memory (and to be able to update it directly).
- Instead of mapping over each row (or rather, transposed column) in the
  function, we loop over the rows and let each thread compute the sum of the
  cells above each cell in that row, multiplied with the appropriate diagonal
  value. In essence, we've interchanged the loop and the innermost map.
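The interchanged computation can be sketched sequentially in Python (my own sketch of one block's update, with hypothetical names; the per-row list comprehensions correspond to the parallel `map2`s):

```python
def update_block(diag, mat):
    # Loop-interchanged form: the outer loop over rows i is sequential,
    # while each inner list comprehension works element-wise across a
    # whole row and is therefore parallelizable across columns.
    b = len(mat)
    for i in range(1, b):
        row_sums = [0.0] * b
        for j in range(i):
            row_sums = [s + diag[i][j] * el for s, el in zip(row_sums, mat[j])]
        # Subtract the accumulated sums from row i, element-wise.
        mat[i] = [x - s for x, s in zip(mat[i], row_sums)]
    return mat
```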

In all, the resulting function is semantically equivalent to the previous one, but we're hoping that this new function is able to take advantage of intra-group parallelism. So, what does the generated kernel look like?

```c
for (int32_t im1_5761 = 0; im1_5761 < upper_bound_5640; im1_5761++) {
    int32_t i_5762 = 1 + im1_5761;

    ((__local float *) double_buffer_mem_6291)[local_tid_6341] =
        ((__global float *) mem_6187)[local_tid_6341];
    barrier(CLK_LOCAL_MEM_FENCE);
    for (int32_t j_5765 = 0; j_5765 < i_5762; j_5765++) {
        float x_5767 = ((__global float *) diag_mem_6183)[i_5762 * b_5613 + j_5765];
        int32_t gtid_5680 = local_tid_6341;
        int32_t phys_tid_5681;

        phys_tid_5681 = local_tid_6341;
        if (slt32(gtid_5680, b_5612)) {
            float x_5769 = ((__local float *) double_buffer_mem_6291)[gtid_5680];
            float x_5770 = ((__local float *) mem_6197)[j_5765 * b_5612 + gtid_5680];
            float y_5771 = x_5767 * x_5770;
            float res_5772 = x_5769 + y_5771;

            ((__local float *) mem_6206)[gtid_5680] = res_5772;
        }
        barrier(CLK_LOCAL_MEM_FENCE);
        ((__local float *) double_buffer_mem_6291)[local_tid_6341] =
            ((__local float *) mem_6206)[local_tid_6341];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    int32_t gtid_5686 = local_tid_6341;
    int32_t phys_tid_5687;

    phys_tid_5687 = local_tid_6341;
    if (slt32(gtid_5686, b_5612)) {
        float x_5775 = ((__local float *) mem_6197)[i_5762 * b_5612 + gtid_5686];
        float x_5776 = ((__local float *) double_buffer_mem_6291)[gtid_5686];
        float res_5777 = x_5775 - x_5776;

        if ((sle32(0, i_5762) && slt32(i_5762, b_5612)) &&
            (sle32(0, gtid_5686) && slt32(gtid_5686, b_5612))) {
            ((__local float *) mem_6197)[i_5762 * b_5612 + gtid_5686] = res_5777;
        }
    }
    barrier(CLK_LOCAL_MEM_FENCE);
}
```

That's a big chunk of code, but essentially what is happening is what we want:
the matrix has been copied into local memory, and the computations inside the
nested for loop touch only local memory. There are still some global memory
accesses left, for `diag` and for `mem_6187`, which I'm guessing is `row_sums`.
How does the performance compare to our old implementation?

| Input size of mats | old | local_mat |
|---|---|---|
| 128x16x16 | 30.8 μs (RSD: 0.05) | 53.0 μs (RSD: 0.08) |
| 128x8x8 | 23.1 μs (RSD: 0.14) | 36.3 μs (RSD: 0.09) |
| 32x16x16 | 28.6 μs (RSD: 0.04) | 50.1 μs (RSD: 0.07) |

As we can see, the code is unfortunately not any faster than the old version.
Our initial guess is that this is due to `diag` and `mem_6187` still being
global. Unfortunately, simply copying `diag` with a `let diag = copy diag` will
get optimized away, and the resulting kernel will look the same as before.

## `lud_diagonal`

`lud_perimeter_lower` looks a lot like `lud_perimeter_upper`, so we weren't
able to get any speed improvements there, but we were able to get some speedup
in the `lud_diagonal` function. Here's the old version:

```futhark
let main [b] (a: [b][b]f32): *[b][b]f32 =
  let a_cols = copy(transpose(a)) in
  let b2 = 2*b in
  let a_rc = map (\(i: i32): [b2]f32 ->
                    map (\(j: i32): f32 ->
                           if j < b
                           then unsafe a[i,j]
                           else unsafe a_cols[i,j-b]
                        ) (iota(b2))
                 ) (iota(b))
  let a_rc = loop a_rc for i < b do
    let row_col =
      map (\(j: i32): f32 ->
             if j < b
             then if j < i
                  then 0.0f32
                  else let sum = loop sum=0.0f32 for k < i do
                                   sum + a_rc[k,i+b]*a_rc[k,j]
                       in a_rc[i,j]-sum
             else let j = j - b in
                  if j < (i+1)
                  then 0.0f32
                  else let aii = loop aii=a_rc[i,i] for k < i do
                                   aii - (a_rc[k,i+b]*a_rc[k,i])
                       let sum = loop sum=0.0f32 for k < i do
                                   sum + a_rc[k,j+b]*a_rc[k,i]
                       in (a_rc[i,j+b]-sum) / aii
          ) (iota(b2))
    let a_rc[i] = row_col
    in a_rc
  in map (\(i: i32): [b]f32 ->
            map (\(j: i32): f32 ->
                   if (i <= j)
                   then a_rc[i,j]
                   else a_rc[j,i+b]
                ) (iota(b))
         ) (iota(b))
```

and here is the updated (and much simpler) version:

```futhark
let dotprod [n] (a: [n]f32) (b: [n]f32): f32 =
  map2 (*) a b |> reduce (+) 0

let lud_diagonal [b] (a: *[b][b]f32): *[b][b]f32 =
  map2 (\x mat ->
          let mat = copy mat in
          loop mat for i < b-1 do
            let col = map (\j -> if j > i
                                 then unsafe (mat[j,i] - (dotprod mat[j,:i] mat[:i,i])) / mat[i,i]
                                 else mat[j,i])
                          (iota b)
            let mat[:,i] = col
            let row = map (\j -> if j > i
                                 then mat[i+1, j] - (dotprod mat[:i+1, j] mat[i+1, :i+1])
                                 else mat[i+1, j])
                          (iota b)
            let mat[i+1] = row
            in mat
       ) (iota (opaque 1)) [a] |> head
```

You'll notice that we have to use the outer-map trick to force Futhark to use
incremental flattening and to copy `mat` to local memory. We have also
introduced a new helper function, `dotprod`, which allows us to be a bit
clearer about what we're doing (taking the dot product) without sacrificing any
performance.
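For reference, here is my reading of the new `lud_diagonal`'s update rules as a sequential Python sketch (names are mine): after the loop, the strict lower triangle of the block holds L (with an implicit unit diagonal) and the upper triangle, including the diagonal, holds U.

```python
def dotprod(a, b):
    return sum(x * y for x, y in zip(a, b))

def lud_diagonal(mat):
    # In-place LU of a single square block, mirroring the control flow
    # of the Futhark version: alternate between updating column i below
    # the diagonal and row i+1 to the right of it.
    b = len(mat)
    for i in range(b - 1):
        # Column i below the diagonal: L entries, scaled by the pivot.
        for j in range(i + 1, b):
            lcol = [mat[k][i] for k in range(i)]
            mat[j][i] = (mat[j][i] - dotprod(mat[j][:i], lcol)) / mat[i][i]
        # Row i+1 from the diagonal rightwards: U entries.
        for j in range(i + 1, b):
            ucol = [mat[k][j] for k in range(i + 1)]
            mat[i + 1][j] = mat[i + 1][j] - dotprod(ucol, mat[i + 1][:i + 1])
    return mat
```

Running it on a small block packs L and U into the same matrix, matching the result of an ordinary Doolittle factorization.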

| Input size of mats | old | new | Difference |
|---|---|---|---|
| 16x16 | 292.6 μs (RSD: 0.12) | 142.4 μs (RSD: 0.10) | -51.33% |
| 64x64 | 517.9 μs (RSD: 0.11) | 215.8 μs (RSD: 0.03) | -58.38% |
| 256x256 | 2167.6 μs (RSD: 0.14) | 1260.8 μs (RSD: 0.03) | -41.83% |
| 512x512 | 4181.1 μs (RSD: 0.12) | 2692.2 μs (RSD: 0.04) | -35.61% |
| 2048x2048 | 20156.6 μs (RSD: 0.12) | 19300.5 μs (RSD: 0.09) | -4.24% |

As we can see, there is a great speedup for the smaller input sizes, but it seems we run out of local memory when trying to handle the big inputs.

I have submitted the changed `lud_diagonal` as a PR to futhark-benchmarks here.
Unfortunately, it seems like `lud` is still an order of magnitude slower than
the Rodinia implementation. This seems a bit extreme, so at some point I'd like
to investigate this benchmark further.

## Footnotes:

^{1} I've written a simple patch to make the timing behavior more consistent with Futhark's.