2020-07-10
Yesterday, and the plan for today
I got a better start on the linear scan pass yesterday. I decided to use my old
last-use implementation for now, since it should work alright. Later, I might
want to have a LastUse
lore anyway, but I'll try to get it to work without
first.
Today, I have to continue writing my pass. In particular, I need to get my
lookupMemInfo
to work. That requires adding scoping.
Chess/perft
I ran across an interesting chess problem that we might be able to use as a
benchmark. perft is a performance test and move path enumeration that is used to
determine the number of legal moves in a given chess position. It seems like
GPUs have successfully been used for estimating this number up to perft(15)
,
possibly using Monte Carlo situation. It would be interesting to look into and
understand the test, and see if we can write a Futhark program that computes it
for us.
ReuseAllocs
So, I got my basic pass to work today! Here's the result of a run (with debugging output):
1: $ cabal exec futhark-linear-scan -- tests/arrays2.fut 2>&1 2: Warning at tests/arrays2.fut:1:11-11: 3: Unused variable "n". 4: stms: 5: 6: last_uses?: [] 7: allocs: 8: [] 9: frees: 10: [] 11: last_uses: 12: [] 13: 14: stms: 15: (0, let {mem mem_260} = alloc(40i64)) 16: (1, -- res_187 : [10i32]i32@@mem_260-> 17: -- {base: [10i32]; contiguous: True; LMADs: [{offset: 0i32; strides: [1i32]; 18: -- rotates: [0i32]; shape: [10i32]; 19: -- permutation: [0]; 20: -- monotonicity: [Inc]}]} 21: let {[10i32]i32 res_187} = iota32(10i32, 0i32, 1i32)) 22: (2, -- xs_188 : [10i32]i32@@mem_260-> 23: -- {base: [10i32]; contiguous: True; LMADs: [{offset: 0i32; strides: [1i32]; 24: -- rotates: [0i32]; shape: [10i32]; 25: -- permutation: [0]; 26: -- monotonicity: [Inc]}]} 27: let {[10i32]i32 xs_188} = res_187 with [1i32] <- 0i32) 28: (3, let {mem mem_263} = alloc(40i64)) 29: (4, -- res_189 : [10i32]i32@@mem_263-> 30: -- {base: [10i32]; contiguous: True; LMADs: [{offset: 0i32; strides: [1i32]; 31: -- rotates: [0i32]; shape: [10i32]; 32: -- permutation: [0]; 33: -- monotonicity: [Inc]}]} 34: let {[10i32]i32 res_189} = iota32(10i32, 0i32, 1i32)) 35: (5, -- ys_190 : [10i32]i32@@mem_263-> 36: -- {base: [10i32]; contiguous: True; LMADs: [{offset: 0i32; strides: [1i32]; 37: -- rotates: [0i32]; shape: [10i32]; 38: -- permutation: [0]; 39: -- monotonicity: [Inc]}]} 40: let {[10i32]i32 ys_190} = res_189 with [1i32] <- 0i32) 41: (6, let {i32 segred_group_size_227} = 42: get_size(segred_group_size_226, group_size)) 43: (7, let {i32 num_groups_229} = 44: calc_num_groups(10i64, segred_num_groups_228, segred_group_size_227)) 45: (8, let {mem mem_267} = alloc(4i64)) 46: (9, -- acc0_231 : [1i32]i32@@mem_267-> 47: -- {base: [1i32]; contiguous: True; LMADs: [{offset: 0i32; strides: [1i32]; 48: -- rotates: [0i32]; shape: [1i32]; 49: -- permutation: [0]; 50: -- monotonicity: [Inc]}]} 51: let {[1i32]i32 acc0_231} = 52: segred_thread 53: (#groups=num_groups_229; groupsize=segred_group_size_227) 54: ({{0i32}, 55: [], 56: commutative fn {i32} (i32 x_201, i32 x_202) => 57: let {i32 res_203} = add32(x_201, x_202) 58: in {res_203}}) 59: (dummy_232 < 1i32, gtid_233 < 10i32) (~phys_tid_234) : {i32} { 60: let {i32 x_204} = xs_188[gtid_233] 61: return {returns x_204} 62: }) 63: (10, let {i32 acc0_200} = acc0_231[0i32]) 64: (11, let {i32 segscan_group_size_238} = 65: get_size(segscan_group_size_237, group_size)) 66: (12, let {i32 num_groups_240} = 67: calc_num_groups(10i64, segscan_num_groups_239, segscan_group_size_238)) 68: (13, let {mem mem_271} = alloc(40i64)) 69: (14, -- resarr0_206 : [10i32]i32@@mem_271-> 70: -- {base: [10i32]; contiguous: True; LMADs: [{offset: 0i32; strides: [1i32]; 71: -- rotates: [0i32]; shape: [10i32]; 72: -- permutation: [0]; 73: -- monotonicity: [Inc]}]} 74: let {[10i32]i32 resarr0_206} = 75: segscan_thread 76: (#groups=num_groups_240; groupsize=segscan_group_size_238) 77: ({{0i32}, 78: [], 79: fn {i32} (i32 x_207, i32 x_208) => 80: let {i32 res_209} = add32(x_207, x_208) 81: in {res_209}}) 82: (gtid_242 < 10i32) (~phys_tid_243) : {i32} { 83: let {i32 x_210} = ys_190[gtid_242] 84: return {returns x_210} 85: }) 86: (15, let {i32 segred_group_size_247} = 87: get_size(segred_group_size_246, group_size)) 88: (16, let {i32 num_groups_249} = 89: calc_num_groups(10i64, segred_num_groups_248, segred_group_size_247)) 90: (17, let {mem mem_275} = alloc(4i64)) 91: (18, -- acc0_251 : [1i32]i32@@mem_275-> 92: -- {base: [1i32]; contiguous: True; LMADs: [{offset: 0i32; strides: [1i32]; 93: -- rotates: [0i32]; shape: [1i32]; 94: -- permutation: [0]; 95: -- monotonicity: [Inc]}]} 96: let {[1i32]i32 acc0_251} = 97: segred_thread 98: (#groups=num_groups_249; groupsize=segred_group_size_247) 99: ({{0i32}, 100: [], 101: commutative fn {i32} (i32 x_217, i32 x_218) => 102: let {i32 res_219} = add32(x_217, x_218) 103: in {res_219}}) 104: (dummy_252 < 1i32, gtid_253 < 10i32) (~phys_tid_254) : {i32} { 105: let {i32 x_220} = resarr0_206[gtid_253] 106: return {returns x_220} 107: }) 108: (19, let {i32 acc0_216} = acc0_251[0i32]) 109: (20, let {i32 res_223} = add32(acc0_200, acc0_216)) 110: 111: last_uses?: [(2, res_187), (2, mem_260), (5, res_189), (5, mem_263), (9, xs_188), (9, 112: segred_group_size_227), 113: (9, num_groups_229), (9, gtid_233), (9, mem_267), (10, acc0_231), (14, ys_190), 114: (14, segscan_group_size_238), (14, num_groups_240), (14, gtid_242), (14, 115: mem_271), 116: (18, resarr0_206), (18, segred_group_size_247), (18, num_groups_249), (18, 117: gtid_253), 118: (18, mem_275), (19, acc0_251), (20, acc0_200), (20, acc0_216)] 119: 1 adding new frees: [] 120: 2 adding new frees: [(VName (Name "mem") 260,Constant (IntValue (Int64Value 40)))] 121: 3 lookup: 40i64 122: 3 found a result: VName (Name "mem") 260 123: 4 adding new frees: [] 124: 5 adding new frees: [] 125: 6 adding new frees: [] 126: 7 adding new frees: [] 127: 8 lookup: 4i64 128: 9 adding new frees: [(VName (Name "mem") 260,Constant (IntValue (Int64Value 40)))] 129: 10 adding new frees: [(VName (Name "mem") 267,Constant (IntValue (Int64Value 4)))] 130: 11 adding new frees: [] 131: 12 adding new frees: [] 132: 13 lookup: 40i64 133: 13 found a result: VName (Name "mem") 260 134: 14 adding new frees: [] 135: 15 adding new frees: [] 136: 16 adding new frees: [] 137: 17 lookup: 4i64 138: 17 found a result: VName (Name "mem") 267 139: 18 adding new frees: [] 140: 19 adding new frees: [] 141: 20 adding new frees: [] 142: allocs: 143: [(VName (Name "mem") 267,Constant (IntValue (Int64Value 4))),(VName (Name "mem") 260,Constant (IntValue (Int64Value 40)))] 144: frees: 145: [(VName (Name "mem") 267,Constant (IntValue (Int64Value 4))),(VName (Name "mem") 260,Constant (IntValue (Int64Value 40))),(VName (Name "mem") 260,Constant (IntValue (Int64Value 40)))] 146: last_uses: 147: [] 148: 149: entry {i32} main (i32 n_186) = { 150: let {mem mem_260} = 151: alloc(40i64) 152: -- res_187 : [10i32]i32@@mem_260-> 153: -- {base: [10i32]; contiguous: True; LMADs: [{offset: 0i32; strides: [1i32]; 154: -- rotates: [0i32]; shape: [10i32]; 155: -- permutation: [0]; 156: -- monotonicity: [Inc]}]} 157: let {[10i32]i32 res_187} = iota32(10i32, 0i32, 1i32) 158: -- xs_188 : [10i32]i32@@mem_260-> 159: -- {base: [10i32]; contiguous: True; LMADs: [{offset: 0i32; strides: [1i32]; 160: -- rotates: [0i32]; shape: [10i32]; 161: -- permutation: [0]; 162: -- monotonicity: [Inc]}]} 163: let {[10i32]i32 xs_188} = 164: -- Consumes res_187 165: res_187 with [1i32] <- 0i32 166: -- mem_263 aliases mem_260 167: let {mem mem_263} = mem_260 168: -- res_189 : [10i32]i32@@mem_263-> 169: -- {base: [10i32]; contiguous: True; LMADs: [{offset: 0i32; strides: [1i32]; 170: -- rotates: [0i32]; shape: [10i32]; 171: -- permutation: [0]; 172: -- monotonicity: [Inc]}]} 173: let {[10i32]i32 res_189} = iota32(10i32, 0i32, 1i32) 174: -- ys_190 : [10i32]i32@@mem_263-> 175: -- {base: [10i32]; contiguous: True; LMADs: [{offset: 0i32; strides: [1i32]; 176: -- rotates: [0i32]; shape: [10i32]; 177: -- permutation: [0]; 178: -- monotonicity: [Inc]}]} 179: let {[10i32]i32 ys_190} = 180: -- Consumes res_189 181: res_189 with [1i32] <- 0i32 182: let {i32 segred_group_size_227} = 183: get_size(segred_group_size_226, group_size) 184: let {i32 num_groups_229} = 185: calc_num_groups(10i64, segred_num_groups_228, segred_group_size_227) 186: let {mem mem_267} = 187: alloc(4i64) 188: -- acc0_231 : [1i32]i32@@mem_267-> 189: -- {base: [1i32]; contiguous: True; LMADs: [{offset: 0i32; strides: [1i32]; 190: -- rotates: [0i32]; shape: [1i32]; 191: -- permutation: [0]; 192: -- monotonicity: [Inc]}]} 193: let {[1i32]i32 acc0_231} = 194: segred_thread 195: (#groups=num_groups_229; groupsize=segred_group_size_227) 196: ({{0i32}, 197: [], 198: commutative fn {i32} (i32 x_201, i32 x_202) => 199: let {i32 res_203} = add32(x_201, x_202) 200: in {res_203}}) 201: (dummy_232 < 1i32, gtid_233 < 10i32) (~phys_tid_234) : {i32} { 202: let {i32 x_204} = xs_188[gtid_233] 203: return {returns x_204} 204: } 205: let {i32 acc0_200} = acc0_231[0i32] 206: let {i32 segscan_group_size_238} = 207: get_size(segscan_group_size_237, group_size) 208: let {i32 num_groups_240} = 209: calc_num_groups(10i64, segscan_num_groups_239, segscan_group_size_238) 210: -- mem_271 aliases mem_260 211: let {mem mem_271} = mem_260 212: -- resarr0_206 : [10i32]i32@@mem_271-> 213: -- {base: [10i32]; contiguous: True; LMADs: [{offset: 0i32; strides: [1i32]; 214: -- rotates: [0i32]; shape: [10i32]; 215: -- permutation: [0]; 216: -- monotonicity: [Inc]}]} 217: let {[10i32]i32 resarr0_206} = 218: segscan_thread 219: (#groups=num_groups_240; groupsize=segscan_group_size_238) 220: ({{0i32}, 221: [], 222: fn {i32} (i32 x_207, i32 x_208) => 223: let {i32 res_209} = add32(x_207, x_208) 224: in {res_209}}) 225: (gtid_242 < 10i32) (~phys_tid_243) : {i32} { 226: let {i32 x_210} = ys_190[gtid_242] 227: return {returns x_210} 228: } 229: let {i32 segred_group_size_247} = 230: get_size(segred_group_size_246, group_size) 231: let {i32 num_groups_249} = 232: calc_num_groups(10i64, segred_num_groups_248, segred_group_size_247) 233: -- mem_275 aliases mem_267 234: let {mem mem_275} = mem_267 235: -- acc0_251 : [1i32]i32@@mem_275-> 236: -- {base: [1i32]; contiguous: True; LMADs: [{offset: 0i32; strides: [1i32]; 237: -- rotates: [0i32]; shape: [1i32]; 238: -- permutation: [0]; 239: -- monotonicity: [Inc]}]} 240: let {[1i32]i32 acc0_251} = 241: segred_thread 242: (#groups=num_groups_249; groupsize=segred_group_size_247) 243: ({{0i32}, 244: [], 245: commutative fn {i32} (i32 x_217, i32 x_218) => 246: let {i32 res_219} = add32(x_217, x_218) 247: in {res_219}}) 248: (dummy_252 < 1i32, gtid_253 < 10i32) (~phys_tid_254) : {i32} { 249: let {i32 x_220} = resarr0_206[gtid_253] 250: return {returns x_220} 251: } 252: let {i32 acc0_216} = acc0_251[0i32] 253: let {i32 res_223} = add32(acc0_200, acc0_216) 254: in {res_223} 255: } 256:
As you can see on line 167, it succesfully changes the allocation to
reuse an existing allocation! Unfortunately, my last-use analysis is far too
simple, even with aliasing. From the debugging output, we can see that mem_260
is reported as being last-used on line 157, when in reality it is used
by all references to xs_187
. Furthermore, even if we fixed that so the
last-use of mem_260
was equal to the last-use of xs_187
, that happens
immediately after, on line 165, but that's still not the actual
last-use of mem_260
: res_187
is consumed in the update which creates
res_188
, which still resides in mem_260
.
I knew all of this would be problems with my last-use analysis, but it still feels nice to see that the pass works.
There are many more issues with the pass: It needs to handle nested blocks, and in reality it probably should only concern itself with code inside kernels. But now that I have a working foundation, I feel like the time is right to go back and implement Cosmins more advanced last-use analysis. I clearly need it.