# 2020-07-10

## Yesterday, and the plan for today

I got a better start on the linear scan pass yesterday. I decided to use my old last-use implementation for now, since it should work alright. Later, I might want to have a `LastUse` lore anyway, but I'll try to get it to work without first.

Today, I have to continue writing my pass. In particular, I need to get my `lookupMemInfo` to work. That requires adding scoping.

## Chess/perft

I ran across an interesting chess problem that we might be able to use as a benchmark. perft is a performance test and move path enumeration that is used to determine the number of legal moves in a given chess position. It seems like GPUs have successfully been used for estimating this number up to `perft(15)`, possibly using Monte Carlo situation. It would be interesting to look into and understand the test, and see if we can write a Futhark program that computes it for us.

## ReuseAllocs

So, I got my basic pass to work today! Here's the result of a run (with debugging output):

```  1: \$ cabal exec futhark-linear-scan -- tests/arrays2.fut  2>&1
2: Warning at tests/arrays2.fut:1:11-11:
3:   Unused variable "n".
4: stms:
5:
6: last_uses?: []
7: allocs:
8: []
9: frees:
10: []
11: last_uses:
12: []
13:
14: stms:
15: (0, let {mem mem_260} = alloc(40i64))
16: (1, -- res_187 : [10i32]i32@@mem_260->
17:     -- {base: [10i32]; contiguous: True; LMADs: [{offset: 0i32; strides: [1i32];
18:     --                                            rotates: [0i32]; shape: [10i32];
19:     --                                            permutation: [0];
20:     --                                            monotonicity: [Inc]}]}
21:  let {[10i32]i32 res_187} = iota32(10i32, 0i32, 1i32))
22: (2, -- xs_188 : [10i32]i32@@mem_260->
23:     -- {base: [10i32]; contiguous: True; LMADs: [{offset: 0i32; strides: [1i32];
24:     --                                            rotates: [0i32]; shape: [10i32];
25:     --                                            permutation: [0];
26:     --                                            monotonicity: [Inc]}]}
27:  let {[10i32]i32 xs_188} = res_187 with [1i32] <- 0i32)
28: (3, let {mem mem_263} = alloc(40i64))
29: (4, -- res_189 : [10i32]i32@@mem_263->
30:     -- {base: [10i32]; contiguous: True; LMADs: [{offset: 0i32; strides: [1i32];
31:     --                                            rotates: [0i32]; shape: [10i32];
32:     --                                            permutation: [0];
33:     --                                            monotonicity: [Inc]}]}
34:  let {[10i32]i32 res_189} = iota32(10i32, 0i32, 1i32))
35: (5, -- ys_190 : [10i32]i32@@mem_263->
36:     -- {base: [10i32]; contiguous: True; LMADs: [{offset: 0i32; strides: [1i32];
37:     --                                            rotates: [0i32]; shape: [10i32];
38:     --                                            permutation: [0];
39:     --                                            monotonicity: [Inc]}]}
40:  let {[10i32]i32 ys_190} = res_189 with [1i32] <- 0i32)
41: (6, let {i32 segred_group_size_227} =
42:       get_size(segred_group_size_226, group_size))
43: (7, let {i32 num_groups_229} =
44:       calc_num_groups(10i64, segred_num_groups_228, segred_group_size_227))
45: (8, let {mem mem_267} = alloc(4i64))
46: (9, -- acc0_231 : [1i32]i32@@mem_267->
47:     -- {base: [1i32]; contiguous: True; LMADs: [{offset: 0i32; strides: [1i32];
48:     --                                           rotates: [0i32]; shape: [1i32];
49:     --                                           permutation: [0];
50:     --                                           monotonicity: [Inc]}]}
51:  let {[1i32]i32 acc0_231} =
53:    (#groups=num_groups_229; groupsize=segred_group_size_227)
54:    ({{0i32},
55:      [],
56:      commutative fn {i32} (i32 x_201, i32 x_202) =>
57:        let {i32 res_203} = add32(x_201, x_202)
58:        in {res_203}})
59:    (dummy_232 < 1i32, gtid_233 < 10i32) (~phys_tid_234) : {i32} {
60:      let {i32 x_204} = xs_188[gtid_233]
61:      return {returns x_204}
62:    })
63: (10, let {i32 acc0_200} = acc0_231[0i32])
64: (11, let {i32 segscan_group_size_238} =
65:        get_size(segscan_group_size_237, group_size))
66: (12, let {i32 num_groups_240} =
67:        calc_num_groups(10i64, segscan_num_groups_239, segscan_group_size_238))
68: (13, let {mem mem_271} = alloc(40i64))
69: (14, -- resarr0_206 : [10i32]i32@@mem_271->
70:      -- {base: [10i32]; contiguous: True; LMADs: [{offset: 0i32; strides: [1i32];
71:      --                                            rotates: [0i32]; shape: [10i32];
72:      --                                            permutation: [0];
73:      --                                            monotonicity: [Inc]}]}
74:  let {[10i32]i32 resarr0_206} =
76:    (#groups=num_groups_240; groupsize=segscan_group_size_238)
77:    ({{0i32},
78:      [],
79:      fn {i32} (i32 x_207, i32 x_208) =>
80:        let {i32 res_209} = add32(x_207, x_208)
81:        in {res_209}})
82:    (gtid_242 < 10i32) (~phys_tid_243) : {i32} {
83:      let {i32 x_210} = ys_190[gtid_242]
84:      return {returns x_210}
85:    })
86: (15, let {i32 segred_group_size_247} =
87:        get_size(segred_group_size_246, group_size))
88: (16, let {i32 num_groups_249} =
89:        calc_num_groups(10i64, segred_num_groups_248, segred_group_size_247))
90: (17, let {mem mem_275} = alloc(4i64))
91: (18, -- acc0_251 : [1i32]i32@@mem_275->
92:      -- {base: [1i32]; contiguous: True; LMADs: [{offset: 0i32; strides: [1i32];
93:      --                                           rotates: [0i32]; shape: [1i32];
94:      --                                           permutation: [0];
95:      --                                           monotonicity: [Inc]}]}
96:  let {[1i32]i32 acc0_251} =
98:    (#groups=num_groups_249; groupsize=segred_group_size_247)
99:    ({{0i32},
100:      [],
101:      commutative fn {i32} (i32 x_217, i32 x_218) =>
102:        let {i32 res_219} = add32(x_217, x_218)
103:        in {res_219}})
104:    (dummy_252 < 1i32, gtid_253 < 10i32) (~phys_tid_254) : {i32} {
105:      let {i32 x_220} = resarr0_206[gtid_253]
106:      return {returns x_220}
107:    })
108: (19, let {i32 acc0_216} = acc0_251[0i32])
109: (20, let {i32 res_223} = add32(acc0_200, acc0_216))
110:
111: last_uses?: [(2, res_187), (2, mem_260), (5, res_189), (5, mem_263), (9, xs_188), (9,
112:                                                                        segred_group_size_227),
113:  (9, num_groups_229), (9, gtid_233), (9, mem_267), (10, acc0_231), (14, ys_190),
114:  (14, segscan_group_size_238), (14, num_groups_240), (14, gtid_242), (14,
115:                                                                       mem_271),
116:  (18, resarr0_206), (18, segred_group_size_247), (18, num_groups_249), (18,
117:                                                                         gtid_253),
118:  (18, mem_275), (19, acc0_251), (20, acc0_200), (20, acc0_216)]
119: 1 adding new frees: []
120: 2 adding new frees: [(VName (Name "mem") 260,Constant (IntValue (Int64Value 40)))]
121: 3 lookup: 40i64
122: 3 found a result: VName (Name "mem") 260
123: 4 adding new frees: []
124: 5 adding new frees: []
125: 6 adding new frees: []
126: 7 adding new frees: []
127: 8 lookup: 4i64
128: 9 adding new frees: [(VName (Name "mem") 260,Constant (IntValue (Int64Value 40)))]
129: 10 adding new frees: [(VName (Name "mem") 267,Constant (IntValue (Int64Value 4)))]
130: 11 adding new frees: []
131: 12 adding new frees: []
132: 13 lookup: 40i64
133: 13 found a result: VName (Name "mem") 260
134: 14 adding new frees: []
135: 15 adding new frees: []
136: 16 adding new frees: []
137: 17 lookup: 4i64
138: 17 found a result: VName (Name "mem") 267
139: 18 adding new frees: []
140: 19 adding new frees: []
141: 20 adding new frees: []
142: allocs:
143: [(VName (Name "mem") 267,Constant (IntValue (Int64Value 4))),(VName (Name "mem") 260,Constant (IntValue (Int64Value 40)))]
144: frees:
145: [(VName (Name "mem") 267,Constant (IntValue (Int64Value 4))),(VName (Name "mem") 260,Constant (IntValue (Int64Value 40))),(VName (Name "mem") 260,Constant (IntValue (Int64Value 40)))]
146: last_uses:
147: []
148:
149: entry {i32} main (i32 n_186) = {
150:   let {mem mem_260} =
151:     alloc(40i64)
152:   -- res_187 : [10i32]i32@@mem_260->
153:   -- {base: [10i32]; contiguous: True; LMADs: [{offset: 0i32; strides: [1i32];
154:   --                                            rotates: [0i32]; shape: [10i32];
155:   --                                            permutation: [0];
156:   --                                            monotonicity: [Inc]}]}
157:   let {[10i32]i32 res_187} = iota32(10i32, 0i32, 1i32)
158:   -- xs_188 : [10i32]i32@@mem_260->
159:   -- {base: [10i32]; contiguous: True; LMADs: [{offset: 0i32; strides: [1i32];
160:   --                                            rotates: [0i32]; shape: [10i32];
161:   --                                            permutation: [0];
162:   --                                            monotonicity: [Inc]}]}
163:   let {[10i32]i32 xs_188} =
164:     -- Consumes res_187
165:     res_187 with [1i32] <- 0i32
166:   -- mem_263 aliases mem_260
167:   let {mem mem_263} = mem_260
168:   -- res_189 : [10i32]i32@@mem_263->
169:   -- {base: [10i32]; contiguous: True; LMADs: [{offset: 0i32; strides: [1i32];
170:   --                                            rotates: [0i32]; shape: [10i32];
171:   --                                            permutation: [0];
172:   --                                            monotonicity: [Inc]}]}
173:   let {[10i32]i32 res_189} = iota32(10i32, 0i32, 1i32)
174:   -- ys_190 : [10i32]i32@@mem_263->
175:   -- {base: [10i32]; contiguous: True; LMADs: [{offset: 0i32; strides: [1i32];
176:   --                                            rotates: [0i32]; shape: [10i32];
177:   --                                            permutation: [0];
178:   --                                            monotonicity: [Inc]}]}
179:   let {[10i32]i32 ys_190} =
180:     -- Consumes res_189
181:     res_189 with [1i32] <- 0i32
182:   let {i32 segred_group_size_227} =
183:     get_size(segred_group_size_226, group_size)
184:   let {i32 num_groups_229} =
185:     calc_num_groups(10i64, segred_num_groups_228, segred_group_size_227)
186:   let {mem mem_267} =
187:     alloc(4i64)
188:   -- acc0_231 : [1i32]i32@@mem_267->
189:   -- {base: [1i32]; contiguous: True; LMADs: [{offset: 0i32; strides: [1i32];
190:   --                                           rotates: [0i32]; shape: [1i32];
191:   --                                           permutation: [0];
192:   --                                           monotonicity: [Inc]}]}
193:   let {[1i32]i32 acc0_231} =
195:     (#groups=num_groups_229; groupsize=segred_group_size_227)
196:     ({{0i32},
197:       [],
198:       commutative fn {i32} (i32 x_201, i32 x_202) =>
199:         let {i32 res_203} = add32(x_201, x_202)
200:         in {res_203}})
201:     (dummy_232 < 1i32, gtid_233 < 10i32) (~phys_tid_234) : {i32} {
202:       let {i32 x_204} = xs_188[gtid_233]
203:       return {returns x_204}
204:     }
205:   let {i32 acc0_200} = acc0_231[0i32]
206:   let {i32 segscan_group_size_238} =
207:     get_size(segscan_group_size_237, group_size)
208:   let {i32 num_groups_240} =
209:     calc_num_groups(10i64, segscan_num_groups_239, segscan_group_size_238)
210:   -- mem_271 aliases mem_260
211:   let {mem mem_271} = mem_260
212:   -- resarr0_206 : [10i32]i32@@mem_271->
213:   -- {base: [10i32]; contiguous: True; LMADs: [{offset: 0i32; strides: [1i32];
214:   --                                            rotates: [0i32]; shape: [10i32];
215:   --                                            permutation: [0];
216:   --                                            monotonicity: [Inc]}]}
217:   let {[10i32]i32 resarr0_206} =
219:     (#groups=num_groups_240; groupsize=segscan_group_size_238)
220:     ({{0i32},
221:       [],
222:       fn {i32} (i32 x_207, i32 x_208) =>
223:         let {i32 res_209} = add32(x_207, x_208)
224:         in {res_209}})
225:     (gtid_242 < 10i32) (~phys_tid_243) : {i32} {
226:       let {i32 x_210} = ys_190[gtid_242]
227:       return {returns x_210}
228:     }
229:   let {i32 segred_group_size_247} =
230:     get_size(segred_group_size_246, group_size)
231:   let {i32 num_groups_249} =
232:     calc_num_groups(10i64, segred_num_groups_248, segred_group_size_247)
233:   -- mem_275 aliases mem_267
234:   let {mem mem_275} = mem_267
235:   -- acc0_251 : [1i32]i32@@mem_275->
236:   -- {base: [1i32]; contiguous: True; LMADs: [{offset: 0i32; strides: [1i32];
237:   --                                           rotates: [0i32]; shape: [1i32];
238:   --                                           permutation: [0];
239:   --                                           monotonicity: [Inc]}]}
240:   let {[1i32]i32 acc0_251} =
242:     (#groups=num_groups_249; groupsize=segred_group_size_247)
243:     ({{0i32},
244:       [],
245:       commutative fn {i32} (i32 x_217, i32 x_218) =>
246:         let {i32 res_219} = add32(x_217, x_218)
247:         in {res_219}})
248:     (dummy_252 < 1i32, gtid_253 < 10i32) (~phys_tid_254) : {i32} {
249:       let {i32 x_220} = resarr0_206[gtid_253]
250:       return {returns x_220}
251:     }
252:   let {i32 acc0_216} = acc0_251[0i32]
253:   let {i32 res_223} = add32(acc0_200, acc0_216)
254:   in {res_223}
255: }
256:
```

As you can see on line 167, it succesfully changes the allocation to reuse an existing allocation! Unfortunately, my last-use analysis is far too simple, even with aliasing. From the debugging output, we can see that `mem_260` is reported as being last-used on line 157, when in reality it is used by all references to `xs_187`. Furthermore, even if we fixed that so the last-use of `mem_260` was equal to the last-use of `xs_187`, that happens immediately after, on line 165, but that's still not the actual last-use of `mem_260`: `res_187` is consumed in the update which creates `res_188`, which still resides in `mem_260`.

I knew all of this would be problems with my last-use analysis, but it still feels nice to see that the pass works.

There are many more issues with the pass: It needs to handle nested blocks, and in reality it probably should only concern itself with code inside kernels. But now that I have a working foundation, I feel like the time is right to go back and implement Cosmins more advanced last-use analysis. I clearly need it.