## § Efficient tree transformations on GPUs (TODO)

All material lifted straight from Aaron Hsu's PhD thesis. I'll be converting APL notation to C++-like notation.

#### § Tree representation as multi-dimensional ragged nested arrays

We're interested in this tree:
      ∘
┌──┬──┴────┐
a  b       c
│ ┌┴┐  ┌───┼───┐
p q r  s   t   u
  │   ┌┴┐ ┌┴┐
  v   w x y z

I'll be writing APL commands after a $ prompt to mimic bash, and I'll write some arrays as multi-line; to run them, collapse them into a single line. The ast object is represented in memory as:

$ ast ← ('∘'
('a' ('p'))
('b'
('q' ('v'))
('r'))
('c'
('s' ('w' 'x'))
('t' ('y' 'z'))
('u')))
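As a Python sketch (mine, not from the thesis), here is the same tree with each subtree written as a tuple (label, *children), plus a preorder walk that recovers each node's depth:

```python
# The recurring example tree, one tuple (label, *children) per subtree.
ast = ('∘',
       ('a', ('p',)),
       ('b', ('q', ('v',)), ('r',)),
       ('c', ('s', ('w',), ('x',)), ('t', ('y',), ('z',)), ('u',)))

def preorder_depths(node, depth=0):
    """Yield (label, depth) pairs in preorder-traversal order."""
    yield node[0], depth
    for child in node[1:]:
        yield from preorder_depths(child, depth + 1)

labels, depths = zip(*preorder_depths(ast))
print(list(depths))  # [0, 1, 2, 1, 2, 3, 2, 1, 2, 3, 3, 2, 3, 3, 2]
```

This depth vector d is the object the rest of the section manipulates.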
$ ]disp data
┌→┬─┬──┐
↓0│T│n.│
│1│T│n.│
│2│T│n.│
│1│T│n.│
│2│T│n.│
│3│T│n.│
│2│T│n.│
│1│T│n.│
│2│T│n.│
│3│T│n.│
│3│T│n.│
│2│T│n.│
│3│T│n.│
│3│T│n.│
│2│T│n.│
└→┴→┴─→┘

This is the same thing as a structure-of-arrays (SOA) representation: each column of information (eg, the depth of every node, the T information of every node) is a flat array, which can be consumed efficiently with SIMD instructions.

#### § AST representation

TODO

#### § Path matrices

We want to be able to go up and down the tree, ideally in constant time. We store this information in what is known as a path matrix. For our recurring example, the path matrix is:

∘ a p b q v r c s w x t y z u | preorder traversal
──────────────────────────────
∘ ∘ ∘ ∘ ∘ ∘ ∘ ∘ ∘ ∘ ∘ ∘ ∘ ∘ ∘ | depth=0
- a a b b b b c c c c c c c c | depth=1
- - p - q q r - s s s t t t u | depth=2
- - - - - v - - - w x - y z - | depth=3

      ∘              0
┌──┬──┴────┐
a  b       c         1
│ ┌┴┐  ┌───┼───┐
p q r  s   t   u     2
  │   ┌┴┐ ┌┴┐
  v   w x y z        3

To compute this efficiently, we first replace every value in our tree with its preorder traversal visit time. This changes the tree to:

∘:0                                             0
 ┌──────────┬──┴─────────────────┐
a:1        b:3                  c:7             1
 │      ┌───┴───┐     ┌──────────┼───────┐
p:2    q:4     r:6   s:8        t:11    u:14    2
        │          ┌──┴──┐     ┌─┴───┐
       v:5        w:9   x:10  y:12  z:13        3

The values we store in the tree are the integers; the old labels are retained for clarity. The path matrix for this tree is:

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 | preorder traversal
───────────────────────────────────
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | depth=0
- 1 1 3 3 3 3 7 7 7 7 7 7 7 7 | depth=1
- - 2 - 4 4 6 - 8 8 8 11 11 11 14 | depth=2
- - - - - 5 - - - 9 10 - 12 13 - | depth=3

#### § Rendering the depth information in 2D

We use the incantation:

$ ⎕IO ← 0 ⋄ d ← (0 1 2 1 2 3 2 1 2 3 3 2 3 3 2)
$ ((⍳≢d)@(d,¨⍳≢d)) ((1+⌈/d) (≢d))⍴'-'
0 - - - - - - - - - - - - - -
- 1 - 3 - - - 7 - - - - - - -
- - 2 - 4 - 6 - 8 - - 11 - - 14
- - - - - 5 - - - 9 10 - 12 13 -

Let's break this down (the symbol ⍝ is a lamp, used for commenting/illumination):

$ ⍳ 3 ⍝ iota: the first 3 indexes (with 0-indexing, ⎕IO←0)
0 1 2

$ d
0 1 2 1 2 3 2 1 2 3 3 2 3 3 2

$ ≢d ⍝ tally: count the no. of elements in d
15

$ ⍳≢d ⍝ list of indexes of length (no. of elements in d)
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14

$ ]disp (1 2 3),(4 5 6) ⍝ ,: concatenate
┌→────┬─────┐
│1 2 3│4 5 6│
└~───→┴~───→┘

$ ]disp (1 2 3) ,¨ (4 5 6)
┌→──┬───┬───┐
│1 4│2 5│3 6│
└~─→┴~─→┴~─→┘

The use of ¨ needs some explanation. ¨ is a higher order function which takes a function and returns a mapped version of it. So ,¨ is a function which maps the concatenation operator. Given two arrays (1 2 3) and (4 5 6), (1 2 3) ,¨ (4 5 6) attempts to run , on each pair: 1 and 4, 2 and 5, 3 and 6. This gives us the tuples ((1 4) (2 5) (3 6)). So, for our purposes, zip ← ,¨.

$ ]disp (d,¨⍳≢d) ⍝ zip d with its indexes
┌→──┬───┬───┬───┬───┬───┬───┬───┬───┬───┬────┬────┬────┬────┬────┐
│0 0│1 1│2 2│1 3│2 4│3 5│2 6│1 7│2 8│3 9│3 10│2 11│3 12│3 13│2 14│
└~─→┴~─→┴~─→┴~─→┴~─→┴~─→┴~─→┴~─→┴~─→┴~─→┴~──→┴~──→┴~──→┴~──→┴~──→┘

$ ((1+⌈/d) (≢d))⍴'-' ⍝ array of dim (1 + max val in d) × (no. of elems in d)
---------------
---------------
---------------
---------------

• ⌈ is the maximum operator and / is the fold operator, so ⌈/d finds the maximum in d. Recall that (≢d) finds the no. of elements in d. ⍴ reshapes an array to the desired size. We pass it a 1x1 array containing only -, which gets reshaped into a (1+⌈/d) × (≢d) sized array of - symbols.
• @ is the "at" operator: (vals@ixs) arr returns arr with the elements at the positions ixs overwritten by vals. Here, ((⍳≢d)@(d,¨⍳≢d)) writes the value i at position (d[i], i) of the dash matrix, so each node's preorder index is rendered at its depth.
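A Python sketch (mine, mirroring the incantation rather than any thesis code) of the whole rendering:

```python
# Render each node's preorder index at row d[i], column i of a '-' grid.
d = [0, 1, 2, 1, 2, 3, 2, 1, 2, 3, 3, 2, 3, 3, 2]
grid = [['-'] * len(d) for _ in range(1 + max(d))]  # ((1+⌈/d) (≢d))⍴'-'
for i, depth in enumerate(d):                       # (⍳≢d)@(d,¨⍳≢d)
    grid[depth][i] = str(i)
for row in grid:
    print(' '.join(row))
```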

#### § Creating the path matrix

$ ⎕IO ← 0 ⍝ (inform APL that we wish to use 0-indexing.)
$ d ← (0 1 2 1 2 3 2 1 2 3 3 2 3 3 2)
$ a ← 0 0 1 0 1 0 1 0 1 ⍝ an example boolean vector; its last 1 is at index 8
$ +/ (∨\ ⌽a) ⍝ reverse, or-scan, then sum
9

$ ¯1 + (+/ (∨\ ⌽a)) ⍝ subtract 1: the index of the last 1
8

Why does this work? Here's the proof:

• On running ⌽a, we reverse a. The last 1 of a, at index $i$, becomes the first 1 of ⌽a, at index $i' = n-1-i$ (0-indexed, where $n$ is the length of a).
• On running ∨\ ⌽a, every position from the first 1 onwards becomes 1. That is, all indexes $j \geq i'$ hold a 1.
• On running +/ (∨\ ⌽a), we sum up all the 1s. This gives us $n-i'$ ones, and $n-i' = n-(n-1-i) = i+1$.
• We subtract $1$ to recover $i$ from $i+1$.

This technique works for every row of a matrix. This is paramount, since we can now repeat it on each row at once, for the depth vector we were previously interested in, and thereby compute the parent index of every node!

#### § Converting from depth vector to parent vector, Take 3 (full matrix)

We want to extend the previous method to compute the parents of all nodes in parallel. To do this, we need to run the moral equivalent of the following:

$ ⎕IO ← 0 ⍝ 0-indexing
$ d ← (0 1 2 1 2 3 2 1 2 3 3 2 3 3 2) ⍝ depth vector
$ t ← 11 ⍝ node we are interested in
$ a ← d[⍳t]=d[t]-1 ⍝ boolean vector of nodes whose depth is that of t's parent
0 1 0 1 0 0 0 1 0 0 0
$ ¯1 + (+/ (∨\ ⌽a)) ⍝ index of the last 1 of the boolean vector
7

for every single choice of t. To perform this, we can build a 2D matrix of d[⍳t]=d[t]-1, where t ranges over [0..len(d)-1] (ie, over all the nodes in the tree). We begin by using:
$ ⎕IO ← 0 ⋄ d ← (0 1 2 1 2 3 2 1 2 3 3 2 3 3 2) ⍝ depths
$ ]display ltdepth ← d ∘.> d ⍝ find d[i] > d[j] for all i, j.
┌→────────────────────────────┐
↓0 0 0 0 0 0 0 0 0 0 0 0 0 0 0│
│1 0 0 0 0 0 0 0 0 0 0 0 0 0 0│
│1 1 0 1 0 0 0 1 0 0 0 0 0 0 0│
│1 0 0 0 0 0 0 0 0 0 0 0 0 0 0│
│1 1 0 1 0 0 0 1 0 0 0 0 0 0 0│
│1 1 1 1 1 0 1 1 1 0 0 1 0 0 1│
│1 1 0 1 0 0 0 1 0 0 0 0 0 0 0│
│1 0 0 0 0 0 0 0 0 0 0 0 0 0 0│
│1 1 0 1 0 0 0 1 0 0 0 0 0 0 0│
│1 1 1 1 1 0 1 1 1 0 0 1 0 0 1│
│1 1 1 1 1 0 1 1 1 0 0 1 0 0 1│
│1 1 0 1 0 0 0 1 0 0 0 0 0 0 0│
│1 1 1 1 1 0 1 1 1 0 0 1 0 0 1│
│1 1 1 1 1 0 1 1 1 0 0 1 0 0 1│
│1 1 0 1 0 0 0 1 0 0 0 0 0 0 0│
└~────────────────────────────┘

• Note that ltdepth[i][j] = 1 iff d[j] < d[i]. So, for a given row (i fixed), the 1s mark nodes that are at lower depth (ie, potential parents).
• If we mask this to only keep those indices where j < i, then the last 1 in each row will be such that d[last 1] = d[i] - 1. Why? Because the closest preceding node with a depth less than ours must be our parent in the preorder traversal.
$ ⎕IO ← 0 ⋄ d ← (0 1 2 1 2 3 2 1 2 3 3 2 3 3 2) ⍝ depths
$ ]display left ← (⍳3) ∘.> (⍳3) ⍝ find i > j for all i, j.
┌→────┐
↓0 0 0│
│1 0 0│
│1 1 0│
└~────┘

Combining the three techniques, we can arrive at:
$ ⎕IO ← 0 ⋄ d ← (0 1 2 1 2 3 2 1 2 3 3 2 3 3 2) ⍝ depths
$ ltdepth ← d ∘.> d ⍝ find d[i] > d[j] for all i, j.
$ left ← (⍳≢d) ∘.> (⍳≢d) ⍝ predecessors: find i > j for all i, j.
$ pred_higher ← ltdepth ∧ left ⍝ predecessors that are higher in the tree
$ parents_take_3 ← ¯1 + +/∨\⌽pred_higher ⍝ previous idiom for finding the last 1
¯1 0 1 0 3 4 3 0 7 8 8 7 11 11 7

For comparison, the actual value is:

    (0 1 2 3 4 5 6 7 8 9 10 11 12 13 14) | indexes
d ← (0 1 2 1 2 3 2 1 2 3 3 2 3 3 2)      │ depths
P ← (0 0 1 0 3 4 3 0 7 8 8 7 11 11 7)    │ parent indices
    (¯1 0 1 0 3 4 3 0 7 8 8 7 11 11 7)   | parents, take 3

∘:0                                             0
 ┌──────────┬──┴─────────────────┐
a:1        b:3                  c:7             1
 │      ┌───┴───┐     ┌──────────┼───────┐
p:2    q:4     r:6   s:8        t:11    u:14    2
        │          ┌──┴──┐     ┌─┴───┐
       v:5        w:9   x:10  y:12  z:13        3

We have an off-by-one error for the 0 node! That's easily fixed: we simply take a maximum with 0 to move ¯1 → 0:

$ parents_take_3 ← 0⌈ ¯1 + +/∨\⌽pred_higher
0 0 1 0 3 4 3 0 7 8 8 7 11 11 7

So, that's our function:
$ parents_take_3 ← 0⌈ ¯1 + +/∨\⌽ ((d∘.>d) ∧ (⍳≢d)∘.>(⍳≢d))
0 0 1 0 3 4 3 0 7 8 8 7 11 11 7

Note that the cost of this method is dominated by materialising the two n×n boolean outer products: O(n²) work and space, no matter how much parallelism we have. We will slowly chip away at this, to do far better.
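Here is the Take-3 computation spelled out in Python (a sketch of mine, not thesis code), as a cross-check against the tree above:

```python
# parents_take_3: build pred_higher[i][j] = (j < i) and (d[j] < d[i]),
# then take the index of the last 1 in each row (the ¯1 + +/∨\⌽ idiom),
# clamped below by 0 (the 0⌈ fix for the root's ¯1).
d = [0, 1, 2, 1, 2, 3, 2, 1, 2, 3, 3, 2, 3, 3, 2]
n = len(d)
parents = []
for i in range(n):
    row = [int(j < i and d[j] < d[i]) for j in range(n)]
    last_one = max((j for j in range(n) if row[j]), default=-1)
    parents.append(max(0, last_one))
print(parents)  # [0, 0, 1, 0, 3, 4, 3, 0, 7, 8, 8, 7, 11, 11, 7]
```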

#### § Converting from depth vector to parent vector, Take 4 (log critical depth)

We will use the Key (⌸) operator, which allows us to create key-value pairs.
$ d ← 0 1 2 1 2 3 2 1 2 3 3 2 3 3 2
$ ]disp (⍳≢d) ,¨ d ⍝ zip indexes with depths
┌→──┬───┬───┬───┬───┬───┬───┬───┬───┬───┬────┬────┬────┬────┬────┐
│0 0│1 1│2 2│3 1│4 2│5 3│6 2│7 1│8 2│9 3│10 3│11 2│12 3│13 3│14 2│
└~─→┴~─→┴~─→┴~─→┴~─→┴~─→┴~─→┴~─→┴~─→┴~─→┴~──→┴~──→┴~──→┴~──→┴~──→┘

$ d ← 0 1 2 1 2 3 2 1 2 3 3 2 3 3 2
$ ]display b ← {⍺ ⍵}⌸d ⍝ each row has a tuple (i, js) such that d[js] = i
┌→──────────────────┐
↓   ┌→┐             │
│ 0 │0│             │
│   └~┘             │
│   ┌→────┐         │
│ 1 │1 3 7│         │
│   └~────┘         │
│   ┌→────────────┐ │
│ 2 │2 4 6 8 11 14│ │
│   └~────────────┘ │
│   ┌→───────────┐  │
│ 3 │5 9 10 12 13│  │
│   └~───────────┘  │
└∊──────────────────┘

In fact, it allows us to apply an arbitrary function to combine keys and values. We will use a function that simply returns all the values for each key.
$ d ← 0 1 2 1 2 3 2 1 2 3 3 2 3 3 2
$ ]display b ← {⍵}⌸d ⍝ each row i contains values j such that d[j] = i.
┌→──────────────┐
↓0 0  0  0  0  0│
│1 3  7  0  0  0│
│2 4  6  8 11 14│
│5 9 10 12 13  0│
└~──────────────┘

Our first try doesn't quite work: it winds up creating a numeric matrix, which means that we can't have rows of different sizes; short rows get padded with 0s. So the information that only index 0 has d = 0 is lost. What we can do is enclose (box) each row of values, to arrive at:
$ d ← 0 1 2 1 2 3 2 1 2 3 3 2 3 3 2
$ ]display b ← {⊂⍵}⌸d ⍝ d[b[i]] = i
┌→───────────────────────────────────────────┐
│ ┌→┐ ┌→────┐ ┌→────────────┐ ┌→───────────┐ │
│ │0│ │1 3 7│ │2 4 6 8 11 14│ │5 9 10 12 13│ │
│ └~┘ └~────┘ └~────────────┘ └~───────────┘ │
└∊───────────────────────────────────────────┘
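A Python sketch (mine) of this grouping, one bucket per depth:

```python
# Group preorder indexes by their depth: groups[k] lists the nodes at depth k.
d = [0, 1, 2, 1, 2, 3, 2, 1, 2, 3, 3, 2, 3, 3, 2]
groups = [[] for _ in range(1 + max(d))]
for i, depth in enumerate(d):
    groups[depth].append(i)
print(groups)  # [[0], [1, 3, 7], [2, 4, 6, 8, 11, 14], [5, 9, 10, 12, 13]]
```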

Consider the groups b[2] = (2 4 6 8 11 14) and b[3] = (5 9 10 12 13). All of the depth-3 nodes' parents are present in the depth-2 group. Every element of b[3] fits at some location in b[2]. Here is what the fit would look like:
b[2]    2 4 _ 6 8 _  _  11 __ __ 14   (nodes of depth 2)
b[3]        5     9  10    12 13      (leaf nodes, depth 3)
parents     4     8  8     11 11      (parents: predecessor of b[3] in b[2])

∘:0                               0
┌──────────┬──┴─────────────────┐
a:1        b:3                 c:7              1
│      ┌───┴───┐     ┌──────────┼───────┐
p:2    q:4     r:6   s:8        t:11    u:14    2
        │          ┌──┴──┐     ┌─┴───┐
       v:5        w:9   x:10  y:12  z:13        3

We use the Interval Index (⍸) operator to solve the problem of finding the parent, ie, where we should squeeze a node from b[3] into b[2]. (This is formally known as the predecessor problem.)
⍝ a ← left ⍸ right: left[a[i]] is the closest number ≤ right[i]
⍝ ie, left[a[i]] is the predecessor of right[i] in left.
$ a ← (1 10 100 1000) ⍸ (1 2000 300 50 2)
0 3 2 1 0

Now, we can use this predecessor technology to find the parents of the depth-3 nodes among the depth-2 nodes:

$ depth2 ← 2 4 6 8 11 14
$ depth3 ← 5 9 10 12 13 ⍝ parents (from chart): 4 8 8 11 11
$ depth3parentixs ← depth2 ⍸ depth3
$ depth3parents ← depth2[depth3parentixs]
4 8 8 11 11

∘:0                                             0
 ┌──────────┬──┴─────────────────┐
a:1        b:3                  c:7             1
 │      ┌───┴───┐     ┌──────────┼───────┐
p:2    q:4     r:6   s:8        t:11    u:14    2
        │          ┌──┴──┐     ┌─┴───┐
       v:5        w:9   x:10  y:12  z:13        3

We need to know one more APL-ism: the 2-scan. When we write a usual scan operation, we have:

$ ⍳5
1 2 3 4 5

$ +/⍳5 ⍝ reduce
15

$ 2+/⍳5 ⍝ apply + to _pairs_ (2 = pairs)
3 5 7 9 ⍝ (1+2) (2+3) (3+4) (4+5)

$ 3+/⍳5 ⍝ apply + to 3-tuples
6 9 12 ⍝ (1+2+3) (2+3+4) (3+4+5)

We begin by assuming the parent of i is i, by using p←⍳≢d.

$ ⎕IO ← 0 ⋄ d ← (0 1 2 1 2 3 2 1 2 3 3 2 3 3 2)
$ d2nodes ← {⊂⍵}⌸d
┌→┬─────┬─────────────┬────────────┐
│0│1 3 7│2 4 6 8 11 14│5 9 10 12 13│
└→┴~───→┴~───────────→┴~──────────→┘
$ p←⍳≢d
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14

Now comes the biggie:
$ findparent ← {parentixs ← ⍺⍸⍵ ⋄ p[⍵]←⍺[parentixs]}

• ⍺ is the list of parent nodes.
• ⍵ is the list of current child nodes.
• We first find the indexes of our parent nodes by using the pix ← parent ⍸ child idiom.
• Then, we find the actual parents by indexing into the parent list: ⍺[pix].
• We write these into the parents of the children using: p[children] ← parent[parent ⍸ child].

This finally culminates in:

$ d←0 1 2 1 2 3 2 1 2 3 3 2 3 3 2
$ p←⍳≢d ⋄ d2nodes←{⊂⍵}⌸d ⋄ findp←{pix ← ⍺⍸⍵ ⋄ p[⍵]←⍺[pix]} ⋄ 2findp/d2nodes ⋄ p
0 0 1 0 3 4 3 0 7 8 8 7 11 11 7

    (0 1 2 3 4 5 6 7 8 9 10 11 12 13 14) | indexes
d ← (0 1 2 1 2 3 2 1 2 3 3 2 3 3 2)      │ depths
P ← (0 0 1 0 3 4 3 0 7 8 8 7 11 11 7)    │ parent indices

∘:0                                             0
 ┌──────────┬──┴─────────────────┐
a:1        b:3                  c:7             1
 │      ┌───┴───┐     ┌──────────┼───────┐
p:2    q:4     r:6   s:8        t:11    u:14    2
        │          ┌──┴──┐     ┌─┴───┐
       v:5        w:9   x:10  y:12  z:13        3

Which can be further golfed to:

$ p⊣2{p[⍵]←⍺[⍺⍸⍵]}⌿⊢∘⊂⌸d⊣p←⍳≢d
0 0 1 0 3 4 3 0 7 8 8 7 11 11 7
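The whole Take-4 pipeline can be sketched in Python (my own rendering, not thesis code; `bisect_right` plays the role of Interval Index ⍸):

```python
import bisect

def parents_from_depths(d):
    """Parent vector from depth vector: group nodes by depth, then for each
    node of depth k+1 take its predecessor among the (sorted) depth-k nodes."""
    p = list(range(len(d)))                  # p ← ⍳≢d
    groups = [[] for _ in range(1 + max(d))]
    for i, depth in enumerate(d):            # d2nodes ← {⊂⍵}⌸d
        groups[depth].append(i)
    for upper, lower in zip(groups, groups[1:]):  # 2 findp / d2nodes
        for node in lower:
            # predecessor of node in upper is its parent (pix ← ⍺⍸⍵)
            p[node] = upper[bisect.bisect_right(upper, node) - 1]
    return p

d = [0, 1, 2, 1, 2, 3, 2, 1, 2, 3, 3, 2, 3, 3, 2]
print(parents_from_depths(d))  # [0, 0, 1, 0, 3, 4, 3, 0, 7, 8, 8, 7, 11, 11, 7]
```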

The total time complexity of this method assuming infinite parallelism is as follows:
$ p←⍳≢d ⋄ d2nodes←{⊂⍵}⌸d ⋄ findp←{pix ← ⍺⍸⍵ ⋄ p[⍵]←⍺[pix]} ⋄ 2findp/d2nodes ⋄ p

• (p←⍳≢d) can be filled in O(1) time.
• (d2nodes←{⊂⍵}⌸d) is grouping keys drawn from a small integer domain, so this is O(#nodes) using a radix sort as far as I know. However, the thesis mentions that this can be done in O(log(#nodes)). I'm not sure how; I need to learn this.
• For each call of findp, the call (pix ← ⍺⍸⍵) can be implemented using binary search, leading to logarithmic complexity in the size of ⍺ (since we are looking up predecessors of ⍵ in ⍺).
• The fold 2findp/d2nodes can be performed entirely in parallel, since all the writes into the p vector are independent: we only write the parent of the current node we are looking at.

#### § 3.4: Computing nearest parent by predicate

I'm going to simplify the original presentation by quite a bit.

     a b c d e f g h i  | names
     0 1 2 3 4 5 6 7 8  | indexes
P ← (0 0 1 2 0 4 5 6 7) | parents
X ← (0 1 0 0 1 1 0 0 0) | marked nodes

      a:0
  ┌────┴────┐
b:1(X)    e:4(X)
  │         │
 c:2      f:5(X)
  │         │
 d:3       g:6
            │
           h:7
            │
           i:8

We want to find, for each node, the closest parent that is marked X. The X vector is a boolean vector that has a 1 at the index of each X node: (b, e, f). So, the indexes (1, 4, 5) are 1 in the X vector. The output we want is the vector:

      0 1 2 3 4 5 6 7 8  | indexes
      a b c d e f g h i  | names
PX ← (0 0 1 1 0 4 5 5 5) | closest X parent index
      a a b b a e f f f  | closest X parent name

The incantation is:

$ I←{(⊂⍵)⌷⍺} ⍝ index LHS by RHS | (100 101 102 103)[(3 1 2)] := 103 101 102
\$ PX ← P I@{X[⍵]≠1} ⍣ ≡ P
0 0 1 1 0 4 5 5 5
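The ⍣≡ fixpoint reads naturally in Python (my own sketch, not thesis code): wherever the current candidate is not an X node, step up to its parent, and repeat until nothing changes:

```python
# Nearest marked ancestor: start from the parent vector and, wherever the
# current candidate is not an X node, replace it by its parent; iterate to
# a fixpoint. The root points to itself, so the iteration terminates.
P = [0, 0, 1, 2, 0, 4, 5, 6, 7]
X = [0, 1, 0, 0, 1, 1, 0, 0, 0]

PX = P[:]
while True:
    stepped = [v if X[v] == 1 else P[v] for v in PX]
    if stepped == PX:
        break
    PX = stepped
print(PX)  # [0, 0, 1, 1, 0, 4, 5, 5, 5]
```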

TODO. At any rate: since this requires only reads and no writes, and no synchronization, it is fairly straightforward to implement on the GPU.

#### § 3.5: Lifting subtrees to the root

Once we have marked our X nodes, we now wish to lift entire subtrees of X up to the root.
• This pass shows how to lift subtrees and add new nodes to replace the subtree's original nodes.
• Luckily, there are no sibling relationships that need to be maintained, since we are uprooting an entire subtree.
• There are no ordering constraints on how the subtrees should be arranged at the top.
• Hence, we can simply add new nodes to the end of the tree (in terms of the preorder traversal). Adding to the middle of the tree will be discussed later.
There is some good advice in the thesis:
When using APL primitives this way, it may be good to map their names and definitions to the domain of trees. For example, the primitive ⍸ Predicate is read as "the nodes where Predicate holds" and not as "the indexes where Predicate is 1".
For example, given the tree:
     0 1 2 3 4 5   | indexes
     a b c d e f   | names
P  ← (0 0 1 0 3 4) | parents
X  ← (0 1 0 1 1 0) | X nodes
PX ← (0 0 1 0 3 4) | closest X parent index

a:0
┌────┴───┐
b:1(X)   d:3(X)
|        |
c:2      e:4(X)
|
f:5

we want the transformed tree to be:
    a:0
┌────┴───┐
bp:1(X)   ep:4(X)
---------
b:1(X)
|
c:2
---------
e:4
|
fp:5
---------
f:5(X)
|
g:6

We first look for nodes that need to be lifted. These are:
• Non-root nodes (ie, nodes whose parents are not themselves: p≠(⍳≢p))
• Which have the property X.
nodes←⍸(X ∧ p≠(⍳≢p))  ⍝ ⍸: pick the indexes where the predicate holds.
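As a Python sketch (mine) of this selection, on the six-node example above:

```python
# Nodes to lift: marked X and not the root (a node is the root iff P[i] == i).
P = [0, 0, 1, 0, 3, 4]
X = [0, 1, 0, 1, 1, 0]
nodes = [i for i in range(len(P)) if X[i] == 1 and P[i] != i]
print(nodes)  # [1, 3, 4]
```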