A Universe of Sorts

§ Siddharth Bhat

§ Table of contents:

  1. Sanskrit and Sumerian
  2. Writing Cuneiform
  3. The code of Hammurabi
  4. The implicit and inverse function theorem
  5. Whalesong hyperbolic space in detail
  6. Motivating Dijkstra's
  7. Intuitions for hyperbolic space
  8. Product of compact spaces is compact
  9. Hyperbolic groups have solvable word problem
  10. Elementary uses of Sheaves in complex analysis
  11. Snake lemma
  12. Kernel,cokernel,image
  13. The commutator subgroup
  14. Simplicity of A5 using PSL(2, 5)
  15. A5 is not solvable
  16. Complex orthogonality in terms of projective geometry
  17. Arithmetic sequences, number of integers in a closed interval
  18. The arg function, continuity, orientation
  19. Odd partitions, unique partitions
  20. Continued fractions, mobius transformations
  21. Permutations and Lyndon factorization
  22. Graphs are preorders
  23. Crash course on domain theory
  24. Parallelisable version of maximum sum subarray
  25. Thoughts on implicit heaps
  26. Discriminant and Resultant
  27. Polynomial root finding using QR decomposition
  28. A hacker's guide to numerical analysis (WIP)
  29. Mobius inversion on Incidence Algebras
  30. Finite differences and Umbral calculus
  31. Permutahedron
  32. Lyndon + Christoffel = Convex Hull
  33. Geometric proof of e^x >= 1+x, e^(-x) >= 1-x
  34. Ranking and Sorting
  35. Proof of minkowski convex body theorem
  36. Burrows Wheeler (WIP)
  37. Intuitionistic logic as a Heyting algebra
  38. Edit distance
  39. Evolution of bee colonies
  40. Best practices for array indexing
  41. Algebraic structure for vector clocks
  42. Networks are now faster than disks
  43. Einstein-de Haas effect
  44. Rank-select as adjunction
  45. Bounding chains: uniformly sample colorings
  46. Coupling from the past (WIP)
  47. Word problems in Russia and America
  48. Encoding mathematical hierarchies
  49. Learning code by hearing it
  50. Your arm can be a spinor
  51. Self modifying code for function calls
  52. Adjunctions as advice
  53. Reversible computation as groups on programs
  54. Blazing fast math rendering on the web
  55. VC dimension
  56. Symplectic version of classical mechanics (WIP)
  57. Theorems for free
  58. How to reason with half-open intervals
  59. how does one build a fusion bomb?
  60. Christoffel symbols, geometrically
  61. A natural vector space without an explicit basis
  62. Cache oblivious B trees
  63. Krohn-Rhodes decomposition (WIP)
  64. Proving block matmul using program analysis (WIP)
  65. Why I like algebra over analysis
  66. using for cleaner function type typedefs
  67. A walkway of lanterns (WIP)
  68. Natural transformations
  69. The hilarious commentary by dinosaure in OCaml git
  70. How to link against MLIR with CMake
  71. Energy as triangularizing state space
  72. The cutest way to write semidirect products
  73. My Favourite APLisms
  74. Proof of chinese remainder theorem on rings
  75. monic and epic arrows
  76. The geometry of Lagrange multipliers
  77. Efficient tree transformations on GPUs (WIP)
  78. Things I wish I knew when I was learning APL
  79. Every ideal that is maximal wrt. being disjoint from a multiplicative subset is prime
  80. Getting started with APL
  81. SpaceChem was the best compiler I ever used
  82. Mnemonic for Kruskal and Prim
  83. Legendre transform
  84. Cartesian Trees
  85. DFS numbers as a monotone map
  86. Self attention? not really
  87. Coarse structures
  88. Matroids for greedy algorithms (WIP)
  89. Grokking Zariski (WIP)
  90. My preferred version of quicksort
  91. Geometric proof of Cauchy Schwarz inequality
  92. Dataflow analysis using Grobner basis
  93. Fenwick trees and orbits
  94. Dirichlet inversion (WIP)
  95. Incunabulum for the 21st century: Making the J interpreter compile in 2020
  96. An example of a sequence whose successive terms get closer together but isn't Cauchy (does not converge)
  97. Krylov subspace method
  98. Good reference to the Rete pattern matching algorithm
  99. Leapfrog Integration
  100. Comparison of forward and reverse mode automatic differentiation
  101. An invitation to homology and cohomology: Part 1 --- Homology
  102. An invitation to homology and cohomology: Part 2 --- Cohomology
  103. Stuff I learnt in 2019
  104. A motivation for p-adic analysis
  105. Line of investigation to build physical intuition for semidirect products
  106. Topology is really about computation --- part 2
  107. Topology is really about computation --- part 1
  108. PSLQ algorithm: finding integer relations between reals
  109. Geometric characterization of normal subgroups
  110. Radical ideals, nilpotents, and reduced rings
  111. My disenchantment with abstract interpretation
  112. Computing equivalent gate sets using grobner bases
  113. The janus programming language --- Time reversible computation
  114. A = B --- A book about proofs of combinatorial closed forms (TODO link)
  115. Generating k bitsets of a given length n
  116. Bondi k-calculus
  117. Vivado toolchain craziness
  118. What the hell is a Grobner basis? Ideals as rewrite systems
  119. Lie bracket versus torsion
  120. Spatial partitioning data structures in molecular dynamics
  121. Vector: Arthur Whitney and text editors
  122. Discrete random distributions with conditioning in 20 lines of haskell
  123. Everything you know about word2vec is wrong
  124. Small Haskell MCMC implementation
  125. Debugging debug info in GHC
  126. GHC LLVM code generator: Switch to unreachable
  127. Concurrency in Haskell
  128. Handy list of differential geometry definitions
  129. Lazy programs have space leaks, Strict programs have time leaks
  130. Presburger arithmetic can represent the Collatz Conjecture
  131. Using compactness to argue about covers
  132. Stephen wolfram's live stream
  133. Japanese Financial Counting system
  134. Cleave as a word has some of the most irregular inflections
  135. McCune's single axiom for group theory
  136. Arthur Whitney: dense code
  137. How does one work with arrays in a linear language?
  138. Linear optimisation is the same as linear feasibility checking
  139. Quantum computation without complex numbers
  140. Linguistic fun fact: Comparative Illusion
  141. Stuff I learnt in 2018
  142. Stuff I learnt in 2017
  143. Reading the structs library
  144. Reading the machines library (WIP)
  145. Explaining laziness (WIP)
  146. Explaining STG(WIP)
  147. Simplexhc: proc points suck / making GHC an order of magnitude faster
  148. Simplexhc: dec 2017
  149. Simplexhc: oct 29 2017
  150. Simplexhc: july 2017
  151. Simplexhc: july 6th 2017
  152. Simplexhc: announcement
  153. GSoC 2015 proposal
  154. GSoC 2015 week 1
  155. GSoC 2015 week 3 and 4
  156. GSoC 2015 week 5
  157. GSoC 2015 week 6
  158. GSoC 2015 week 7
  159. GSoC 2015 final report
  160. Link Dump
  161. Big list of emacs gripes
  162. Big list of Coq
  163. Big list of writing
  164. Big list of Latex
  165. Big list of Architecture
  166. Big list of Recipes
  167. Big list of words
  168. Big list of Haiku

§ Sanskrit and Sumerian

§ Sanskrit's precursors

The oldest known Sanskrit text that has been found is the Rig Veda. Everything after it is the study of Sanskrit proper. This is quite problematic, because the Rig Veda is a complete and consistent text with respect to language: it's a black box in terms of language evolution. The question "what is the precursor?" asks for a method of study to determine the precursor. We just don't know, because we have no earlier writings, texts, etc.

§ Archaeological data

Archaeological data is problematic as well. We don't know where the people who spoke Sanskrit came from. Sanskrit was spoken in the northern part of Hindusthan [what they originally called Bhrahma-nagar (?)]. While we can try to understand where it comes from, it's hard. The script is Brahmi / Devanagari, which means used by Brahma, or god. The name "Sanskrit" is a compound that stands for "well-formed". It's really clean-slate in that sense. The study of Indo-Aryan languages in the Indian subcontinent has only one possible known history, which stops at the Rig Veda. We don't know the relationship between 2400 BC, when the Rig Veda was written, and anything before it.

§ Non Vedic sanskrit

The study of non-Vedic Sanskrit ties into the search for the "true" proto-Indo-European (PIE) language. The conjecture is that this Indo-European language ought to be able to cover middle Greek, Hittite, and Sanskrit.

§ Prakrit and its relationship to this story

Prakrit evolved as a vernacular of Sanskrit in the north pahadi region. Hindi as we know it today evolved from Hindusthani, which came from languages in northern India. Languages like Marathi, Marvadi, Gujarati, etc. came a lot before Hindi did.

§ TLDR on Sanskrit v/s Hindi

There is a huge gap of time between Sanskrit, Prakrit, Pali, and Hindi. Hindi evolved around the 1600s due to the Mughals, who used to speak a vernacular of Hindusthani. Kabir also wrote in Hindusthani. There was also some Farsi and Urdu mixed in.
Hindi is more of a political exploit than an actual language ~ Alok 2020

§ The relationship to Sumerian

We don't know what the relationship to Sumerian is. The social expectations that were set up in Sumerian culture seem to have become more stringent in Sanskrit culture.

§ Writing Cuneiform

I've been reading about the Sumerian people, and I've gotten fascinated with the question of how to write in Cuneiform, their script. It appears that it was originally written by pressing reed styluses into clay. The script is syllabic in nature. We have three components:
  1. Vertical wedge 𐏑
  2. Horizontal wedge 𐎣
  3. Diagonal 𐏓

§ References

§ The code of Hammurabi

I've wanted to read the Code of Hammurabi ever since it was name-dropped in Snow Crash by Neal Stephenson. I finally got around to it. Here are some excerpts I found fascinating.

§ References

§ The implicit and inverse function theorem

I keep forgetting the precise conditions of these two theorems. So here I'm writing it down as a reference for myself.

§ Implicit function: Relation to function

If we have a function y = g(p, q), we can write this as y − g(p, q) = 0. This can be taken as an implicit function h(y; p, q) ≡ y − g(p, q). We then want to recover the explicit version y = g′(p, q) such that h(g′(p, q); p, q) = 0. That is, we recover the original explicit formulation of y = g′(p, q) in a way that satisfies h.

§ The 1D linear equation case

In the simplest possible case, assume the relationship between y and p is a linear one, given implicitly. So we have h(y; p) = α y + β p + γ = 0. Solving for h(y,p) = 0, one arrives at: y = −1/α (β p + γ).

§ The circle case

In the circle case, we have h(y; p) = p² + y² − 1. We can write y = ± √(1 − p²). These are two solutions, not one, and hence a relation, not a function.

§ Assuming that a solution for h(y, p) exists

Let us say we wish to solve h(y; p) = y³ + p² − 3yp − 7 = 0. Let's assume that we have a solution y = sol(p) around the point (y=3, p=4). Then we must have: sol(p)³ + p² − 3 sol(p) p − 7 = 0. Differentiating by p, we get: 3 sol(p)² sol′(p) + 2p − 3 sol′(p) p − 3 sol(p) = 0. This gives us the condition on the derivative:
3 sol(p)² sol′(p) + 2p − 3 sol′(p) p − 3 sol(p) = 0
sol′(p) (3 sol(p)² − 3p) = 3 sol(p) − 2p
sol′(p) = [3 sol(p) − 2p] / [3 sol(p)² − 3p]
The above solution exists if 3 sol(p)² − 3p ≠ 0. This quantity is again ∂h / ∂y.
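As a numerical sanity check, here is a small Python sketch (my own code; note that the point (y, p) = (3, 4) satisfies h exactly): solve for sol(p) with Newton's method and compare a finite-difference derivative against the formula above.

```python
def h(y, p):
    return y**3 + p**2 - 3*y*p - 7

def dh_dy(y, p):
    return 3*y**2 - 3*p  # the quantity whose nonvanishing the theorem requires

def sol(p, y0=3.0):
    """Solve h(y; p) = 0 for y near y0 by Newton's method."""
    y = y0
    for _ in range(50):
        y -= h(y, p) / dh_dy(y, p)
    return y

eps = 1e-5
numeric = (sol(4 + eps) - sol(4 - eps)) / (2 * eps)
formula = (3*sol(4) - 2*4) / (3*sol(4)**2 - 3*4)
print(numeric, formula)  # both are approximately 1/15
```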

§ Application to economics

§ Inverse function: Function to Bijection

One perspective we can adopt is that of Newton's method. Recall that Newton's method allows us to find x* for a fixed y* such that y* = f(x*). Inverting the function follows the exact same process!
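A quick Python sketch of this perspective (my own example function f(x) = x³ + x, chosen so f′ > 0 everywhere and the inverse exists globally): inverting f at y* is just Newton's method on x ↦ f(x) − y*.

```python
def f(x):
    return x**3 + x  # strictly increasing: f'(x) = 3x^2 + 1 > 0

def f_prime(x):
    return 3*x**2 + 1

def f_inverse(y_star, x0=1.0):
    """Find x* with f(x*) = y* by Newton iteration on f(x) - y*."""
    x = x0
    for _ in range(100):
        x -= (f(x) - y_star) / f_prime(x)
    return x

x_star = f_inverse(10.0)
print(x_star, f(x_star))  # x_star converges to 2, and f(2) = 10
```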

§ References

§ Whalesong hyperbolic space in detail

We can build a toy model of a space where velocity increases with depth. Let the x-y axis be: left-to-right (→) is positive x, top-to-bottom (↓) is positive y. Now let the velocity at a given location (x, y) be (y+1, 1). That is, velocity along y is constant; the velocity along x is an increasing function of the current depth. The velocity along x increases linearly with depth. Under such a model, our shortest paths will be 'curved' paths.
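To make the "curved shortest path" claim concrete, here is a Python sketch with my own toy numbers, reading the x-component (y + 1) as the local horizontal speed and the y-component 1 as the vertical speed: a "down, across, up" path beats the straight surface path, because going deeper buys a faster horizontal leg.

```python
# Toy whalesong model: horizontal speed at depth y is (y + 1); vertical speed is 1.
def dip_time(d, width=10.0):
    """Travel time of a path from (0, 0) to (width, 0) that dips to depth d."""
    down = d                     # vertical speed is 1, so reaching depth d costs d
    across = width / (d + 1.0)   # horizontal speed at depth d is d + 1
    up = d
    return down + across + up

straight = dip_time(0.0)  # stay at the surface: time = 10
best = min(dip_time(d / 100.0) for d in range(0, 500))
print(straight, best)  # dipping below the surface is strictly faster
```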

§ References

§ Motivating Dijkstra's

Usually I've seen Dijkstra's algorithm presented as a greedy algorithm, with an analogy to "fire" given for the greediness. We can reverse this presentation: start with the "fire", and then discretize the "fire" solution to arrive at Dijkstra's.
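For reference, a minimal Python sketch of the algorithm itself (standard heapq formulation, on a toy graph of my own invention): the priority queue plays the role of the expanding fire front, always burning the closest unburnt vertex next.

```python
import heapq

def dijkstra(graph, source):
    """graph: {vertex: [(neighbor, weight), ...]}. Returns shortest distances."""
    dist = {source: 0}
    frontier = [(0, source)]  # the "fire front": (distance, vertex)
    while frontier:
        d, u = heapq.heappop(frontier)
        if d > dist.get(u, float("inf")):
            continue  # stale entry; u was already burnt at a smaller distance
        for v, w in graph[u]:
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(frontier, (d + w, v))
    return dist

g = {"a": [("b", 1), ("c", 4)], "b": [("c", 2)], "c": []}
print(dijkstra(g, "a"))  # {'a': 0, 'b': 1, 'c': 3}
```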

§ The fire analogy

§ Intuitions for hyperbolic space

§ Product of compact spaces is compact

§ Hyperbolic groups have solvable word problem

I have never seen an elementary account of this in the style of 'trust these facts; now here is why hyperbolic groups have a solvable word problem'. I am writing such an account for myself. It's an account for building intuition, so no proofs will be provided except for the final theorem. All facts will be backed by intuition. Since most of it is geometric, the intuition is easy to convey.

§ Graphs of groups, quasi-isometry.

We can convert any group into a graph by using the Cayley graph of the group. We characterize hyperbolic space as a space where we can build 'thin triangles'. We also think of hyperbolic space as one where geodesics from a given point diverge (in terms of angle) exponentially fast. The choice of generators for the Cayley graph gives different graphs. We can assign a unique geometric object by considering Cayley graphs up to quasi-isometry: the Cayley graphs of a group with respect to different generating sets are quasi-isometric. We can now try to study properties that are invariant under quasi-isometry, since these are somehow 'represented faithfully by the geometry'.

§ Hyperbolicity enters the picture

We now say that a group is hyperbolic if its Cayley graph is hyperbolic. We can show that hyperbolicity is preserved by quasi-isometry, so this property does not depend on the generating set.

§ Isoperimetric Inequality

If we have a finitely presented group G = ⟨ S | R ⟩, and w is a word in the free group Free(S) with [[w]] = 1, we will have
w = ∏ᵢ uᵢ rᵢ^±1 uᵢ⁻¹
This follows almost by definition. Since we have quotiented by R, we can have elements of R sandwiched between uᵢ and uᵢ⁻¹. We need the uᵢ⁻¹ since there's nothing else to cancel off the uᵢ.

§ Area of a word

Let G = ⟨ S | R ⟩. Let w be a word in S such that [[w]] = e. The area of the word w is the minimal number of such uᵢ rᵢ^±1 uᵢ⁻¹ components we need to write it down. Formally:
Area(w) ≡ min { n | w = ∏ᵢ₌₁ⁿ uᵢ rᵢ^±1 uᵢ⁻¹ }
I don't understand the geometric content of this definition. I asked on mathoverflow.

§ Isoperimetric function for a group

f: ℕ → ℕ is a Dehn function or isoperimetric function if the area of a word is upper bounded by f(|word|). In some sense, the length of the word is the perimeter to the area, and this gives us a form of the isoperimetric inequality. Formally, f is a Dehn function if for all words w ∈ F(S) such that [[w]] = e, we have Area(w) ≤ f(|w|). Depending on the growth of f, we say that G has a linear, quadratic, exponential, etc. Dehn function.

§ Geometric content of Area

We define a map to be a finite, planar, oriented, connected and simply connected simplicial 2-complex (!). A map D is a diagram over an alphabet S iff every edge e ∈ D has a label lbl(e) ∈ S such that lbl(e⁻¹) = (lbl(e))⁻¹. Hang on: what does it mean to invert an edge? I presume it means to go backwards along an edge. So we assume the graph is directed, and we have edges in both directions. A Van Kampen diagram over a group G = ⟨ S | R ⟩ is a diagram D over S such that every face f of D has its boundary labelled by some r^±1 : r ∈ R. The area of such a diagram is the number of faces.

§ Hyperbolic iff Linear Isopermetric Inequality is satisfied

A finitely presented group is hyperbolic if and only if its Cayley graph satisfies the linear isoperimetric inequality.

§ Deciding if elements are conjugate to each other

§ Proof that conjugacy is solvable for hyperbolic groups

Assume such an element x exists, such that g2 = x g1 x⁻¹. Consider the geodesic quadrilateral with vertices e, x, x·g1, and x·g1·x⁻¹ = g2. Cutting along a diagonal gives two triangles: (e, x, x·g1) and (e, x·g1, g2). Let us focus on (e, x, x·g1). Assume the triangles in our space are δ-thin.
Since our triangles are δ thin, if the distance of the point p ∈ [e xg1] to

§ References

§ Elementary uses of Sheaves in complex analysis

I always wanted to see sheaves in the wild in a setting that was both elementary and 'correct': that is, not some perverse example created just to show off sheaves (DaTaBaSeS arE ShEAvEs). Ahlfors has a great example of this, which I'm condensing here for future reference.

§ Sheaves: Trial 1

§ Sheaves: Trial 2

§ Sheaf: Trial 3

A sheaf over D is a topological space Sh and a mapping π: Sh → D with certain properties. We will pick D to be an open set in the complex plane; really, D can be arbitrary.

§ Germs of analytic functions satisfy (Sheaf: Trial 3)

§ Snake lemma

§ Why homomorphisms for chain maps?

First of all, to define a mapping between chain complexes { Gi } and { Hi }, one might naively assume that we can ask for functions { fi: Gi → Hi }:
       ∂    ∂
    G3 → G2 → G1 → 0
    |    |    |
    f    g    h
    ↓    ↓    ↓  
0 → H3 → H2 → H1
      ∂    ∂
Unfortunately, to be able to use the machinery of homology, we need the { fi } to be abelian group homomorphisms. However, this is no great loss. Intuitively, when we want to map complexes, we first say where the generators of the abelian group (ℤ-module) map to; everything else is determined by the generators. This aligns nicely with our intuition of what a map between complexes should look like: we tell where the geometry goes ("this edge goes there"), and the algebra is "dragged along for the ride". This gives us the diagram:
    |      |      |
    f3     f2     f1
    ↓      ↓      ↓
0 →H3--∂-→H2--∂-→H1
where the fi are homomorphisms. So, this means we can talk about kernels and images!
       G3 --∂--> G2 --∂--> G1 --> 0
        |         |         |
        f3        f2        f1
        ↓         ↓         ↓
  0 --> H3 --∂--> H2 --∂--> H1
(For instance, the complex of faces, edges, and vertices of a surface: F → E → V → {0}.)
The Snake Lemma gives us a mapping d: Ker(f1) → Coker(f3) such that this long exact sequence is satisfied:
Ker(f3) → Ker(f2) → Ker(f1) --d--> Coker(f3) → Coker(f2) → Coker(f1)

§ What do we wish to compute?

§ Kernel, cokernel, image

Consider a linear map T: X → Y. We want to solve for { x : T(x) = y0 }.

§ The commutator subgroup

Define the commutator of g, h as [g, h] ≡ ghg⁻¹h⁻¹. The subgroup generated by all commutators in a group is called the commutator subgroup, sometimes denoted [G, G].

§ Simplicity of A5 using PSL(2, 5)

§ Presentation of A5

We take on faith that A5 has the presentation:
<a, b | a^2 = b^3 = (ab)^5 = 1>
If I find a nice proof of this isomorphism, or some other way to derive the fact that PSL(2, 5) is isomorphic to A5, I will fill this up.

§ Step 1: PSL(2, 5) is isomorphic to A5

PSL(2, 5) consists of projective Mobius transformations with function composition as the group operation. Here, we freely use the linear algebraic relationship between transformations of the form (az + b)/(cz + d) and matrices [a b; c d].
a, b, c, d ∈ ℤ5, ad − bc = 1
f: ℤ5 ⋃ { ∞ } → ℤ5 ⋃ { ∞ }
f(z) ≡ (az + b)/(cz + d)
Consider q(z) ≡ 1/(1−z):
q(q(z)) = 1/(1 − 1/(1−z)) = (1−z)/((1−z) − 1) = (1−z)/(−z) = 1 − 1/z
q(q(q(z))) = 1/(1 − (1 − 1/z)) = 1/(1/z) = z
To recap, we have achieved a set of functions:
p(z) = -1/z [order 2]
q(z) = 1/(1-z) [order 3]
r(z) = (z - 1) [order 5]
r = -1/[1/(1-z)] = p . q
That is, we have found elements p, q in PSL(2, 5) such that p² = q³ = (pq)⁵ = 1. This gives us a [surjective] map from A5 onto PSL(2, 5).
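We can sanity-check these orders mechanically. A Python sketch (my own encoding: ℤ5 ∪ {∞} as 0..4 plus the string "inf", Mobius maps as (a, b, c, d) tuples):

```python
INF = "inf"
DOMAIN = [0, 1, 2, 3, 4, INF]

def inv5(x):
    return pow(x, 3, 5)  # x^-1 mod 5, since x^4 = 1 for x nonzero

def apply(m, z):
    """Apply z -> (az + b)/(cz + d) over Z5 extended with a point at infinity."""
    a, b, c, d = m
    if z == INF:
        return INF if c % 5 == 0 else (a * inv5(c)) % 5
    num, den = (a * z + b) % 5, (c * z + d) % 5
    return INF if den == 0 else (num * inv5(den)) % 5

def order(m):
    """Order of the transformation as a permutation of the 6-point domain."""
    ident = tuple(DOMAIN)
    perm, n = tuple(apply(m, z) for z in ident), 1
    while perm != ident:
        perm, n = tuple(apply(m, z) for z in perm), n + 1
    return n

p = (0, -1, 1, 0)  # p(z) = -1/z
q = (0, 1, -1, 1)  # q(z) = 1/(1 - z)
r = (1, -1, 0, 1)  # r(z) = z - 1
assert all(apply(r, z) == apply(p, apply(q, z)) for z in DOMAIN)  # r = p . q
print(order(p), order(q), order(r))  # 2 3 5
```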

§ Step 2: PSL(2, 5) is simple

TODO! I'm still reading Keith Conrad's notes.

§ References

§ A5 is not solvable

There are many accounts of why A5 is not solvable on the internet. I'm recording my version here, because the proof involves certain ad-hoc choices which I want to make sure I can find off-hand in the future. We'll show that [A5, A5] = A5, thereby proving that A5 is not solvable. This is useful for Galois theory, where we want to show that A5 cannot be built as extensions of smaller cyclic groups.

§ Notation

I'll be using non-standard notation: (12);(34) means 'perform (12) then perform (34)'. I find this notation makes permutation composition intuitive for me. The ; is evocative of C-style languages, where we are ending a statement. I will be consistently using [g, h] ≡ ghg−1h−1 to denote the commutator.

§ permutations in A5

First, recall that A5 only has the even permutations in S5: each element is a product of zero, two, or four transpositions. There can't be more after simplification, since S5 only has 5 letters --- the longest sequence of transpositions we can do is (12)(23)(34)(45). So, in A5, we have, for example:
(123) = (23);(12)
      = (32);(12) [(23) = (32)]
      = (32);(21) [(12) = (21)]
[a b c] -(32)->
[a c b] -(21)->
[c a b]

§ A5 is generated by 3-cycles.

We claim that we can write any element of A5 in terms of 3-cycles. So, if we figure out how to write 3-cycles in terms of commutators, we win, because the commutator subgroup of An is generated by elements that can be written as [g, h]. If we can show that 3-cycles can be written as [g, h], then every other element has a representation in terms of these 3-cycles, and is therefore an element of the commutator subgroup.

§ 3-cycles can be generated as commutators of 2-cycles:

   g       h       g⁻¹     h⁻¹
 (32)    (21)     (32)    (21)
So we are left with (32);(21);(32);(21) [transpositions are their own inverses]. This is the square of what we really wanted, C = (32);(21). However, since C is a 3-cycle, we know that C = C⁻². So, we can start with C⁻¹, use our trick to generate C⁻², which is equal to C. Since this works for any C, we have shown that we can generate 3-cycles from commutators of A5.

§ Alternate viewpoint on above proof

We have a 3-cycle s = (a b c). We first find a square root t such that t*t = s. To do this, we make t have the cycles of s spread out in gaps of 2:
t = (a _ _)
t = (a _ b) [+2]
t = (a c b) [+2, modulo]
It is hopefully clear that t*t = s:
t = (a c b)
t*t: apply the cycle twice.
t*t = a -(skip c) -> b
      b -(skip a) -> c
      c -(skip b) -> a
    = (a b c) = s
Now, we will write s = t*t and then find the commutator decomposition from it:
s = t*t
  = (abc)(abc)
  = (cb)(ba)(cb)(ba)
  = (cb)|(ba)|(cb)|(ba)
     g     h    g-1   h-1
But there's a problem: this g and h do not belong to A5; they belong to S5. This is fixed by padding each with a transposition (pq) on the two letters untouched by our 3-cycle: replace g = (cb) by (cb)(pq) and h = (ba) by (ba)(pq). Since (pq) is disjoint from (cb) and (ba), it commutes with them, and the four copies of (pq) in the commutator cancel in pairs, so the commutator is unchanged --- but g and h are now even.

§ Recap: How have we shown that A5 is not solvable?

What have we shown? In my mind, I think of it as:
arbitrary g
= (3-cycle-1)(3-cycle-2)....(3-cycle-n)
= [g, h][g2, h2]....[gn, hn]
= member of [A5, A5]
Recall that [A5, A5] is generated by commutators: it not only contains elements of the form [g, h], but also all products of the form [g, h][g′, h′]. So we don't need to exhibit how to write a 5-cycle as some single [g, h]. We just need to exhibit how to write it as a product of commutators, which we have now shown.

§ Simple implies not solvable

We can consider the chain definition of solvability: there is a chain of normal subgroups G = N[0] ⊵ N[1] ⊵ … ⊵ N[m] = e, such that each quotient N[i] / N[i+1] is abelian. Then, if G is simple, such a chain can only be G = N[0] ⊵ N[1] = e.

§ Nagging doubt: Did we depend on our numbering of cycles?

In all my proofs, I had used one 3-cycle, or 5-cycle, or 2-cycle to argue that it all works out. Is this really legal? Perhaps the argument written for the 3-cycle C = (123) will break down for D = (321). Fear not!

§ All 3-cycles are conjugate to each other in A5.

Let C = (a b c) and D = (p q r) be 3-cycles; laid out on the 5 letters, they can overlap in some positions:
a   b   c
1 2 3 4 5
  p   q r
So, we will perform our proof assuming there is 1 overlap, 2 overlap, 3 overlap. Recall that if C = (a b c) is a cycle and s is a permutation, then conjugating C with s produces the permutation (s(a) s(b) s(c)). We will prove our results by finding such an s, and then making s even. This is the difficult part of the proof, since we need to show that all 3-cycles are conjugate in A5, not just in S5. We will write s as two distinct transpositions, which will guarantee that it is even and belongs to A5.
C = (abx)
D = (pqx)

s: send a to p, b to q
s = (ap)(bq)
C = (abx) -conj s-> (pqx) = D
C = (axy)
D = (pxy) [shares the letters x, y with C]

s: send a to p, fixing x and y. To keep s even, route through the unused fifth letter w:
s = (pw);(ap) [= the 3-cycle (a p w)]

C = (axy) -conj s-> (pxy) = D

§ Why do we care about solvable?

§ SAGE code to play around with commutators of A5:

from collections import defaultdict
m = defaultdict(set)
A5 = AlternatingGroup(5)
S5 = SymmetricGroup(5) # if necessary
for g in A5:
    for h in A5:
        m[g * h * g^(-1) * h^(-1)] |= { (g, h) }

# all 60 elem can be written in terms of commutators
print("number of elem generated as commutator: " + str(len(m.keys())))

# Show how to access elements of A5 and their commutator representation
cyc5 = A5("(1, 2, 3, 4, 5)")
cyc3 = A5("(1, 2, 3)")
cyc2disj = A5("(1, 2) (3, 4)")


§ Writing each element in A5 directly as a commutator

We have shown how to write 3-cycles as commutators of 2-cycles. We will now show how to do this for disjoint 2-cycles and 5-cycles, as a matter of enlightenment.

§ Writing disjoint 2-cycles as commutator

First, we will write the two disjoint 2-cycles as the square of a 4-cycle t. We will then show how to write this square t*t as a commutator.
s = (12)(34) 
Note that if we choose t = (abcd), then t*t will exchange the first and third elements a <-> c, and the second and fourth elements b <-> d. So, if we choose:
t = (1324)
t*t = (12) (34)
Next, we need to write this t*t as [g, h] for g, h from A5.
t*t = (1324)(1324)
    = (42)(23)(31);(42)(23)(31)
    = (42)(23)(31);(23)(23);(42)(23)(31)
                   ^^^^^^^^ inserted
    = (42)(23)|(31)(23)|(23)(42)|(23)(31)
          g   |    h   |   g'   |   h'

    = [(42)(23), (31)(23)]
Where both (42)(23), and (31)(23) are members of A5.

§ Writing 3-cycle as commutator

In the description of showing how to generate 3-cycles, we do this explicitly.

§ Writing 5-cycle as commutator

Let s = (1 2 3 4 5). we once again find a square root of s. To build this, we will build an element with the elements of s written with gaps of 2:
t = (1 _ _ _ _) 
  = (1 _ 2 _ _)  [+2 index]
  = (1 _ 2 _ 3)  [+2 index, wrap]
  = (1 4 2 _ 3)  [+2 index, wrap]
  = (1 4 2 5 3)  [+2 index, wrap]
It should be clear how t*t = s: When we take s = t*t, the resulting permutation s will move an element j = t[i] to k = t[i+2]. But we have built t such that t[i+2] = s[i+1]. So we will move the element according to how s pleases:
t = (1 4 2 5 3)
t*t = 1 -> (4 skip) -> 2
      2 -> (5 skip) -> 3
      3 -> (1 skip) -> 4
      4 -> (2 skip) -> 5
      5 -> (3 skip) -> 1
t*t = (1 2 3 4 5) = s
We will now use t*t to write the commutator:
s = t*t
  = (35)(52)(24)(41);(35)(52)(24)(41)  
  = (1, 2)(3, 5)|(1, 5)(2, 4)|(3, 5)(1, 2)|(2, 4)(1, 5)

  = (1, 2)(3, 5)|(1, 5)(2, 4)|(3, 5)(1, 2)|(2, 4)(1, 5)
         g             h          g^{-1}       h^{-1}

§ To think: relationship between square roots and commutators?

§ Complex orthogonality in terms of projective geometry

If we think of complex vectors p = [p1, p2], q = [q1, q2] as belonging to projective space: that is, p ≃ p1/p2, and q ≃ q1/q2, we can interpret orthogonality as:
⟨p, q⟩ = 0
p1 q̄1 + p2 q̄2 = 0
(p1/p2) · (q̄1/q̄2) + 1 = 0
p q̄ + 1 = 0
p = −1/q̄ = −q/|q|²
If we imagine these as points on the Riemann sphere, TODO

§ References

§ Arithmetic sequences, number of integers in a closed interval

This is a cute reason for why when we count the number of integers in the closed interval [a, b], it's going to be b - a + 1. We setup an arithmetic sequence with initial term a, common difference 1. Now we know that the nth term is a + (n-1)d. So, we get
a + (n-1)d = b
(n-1).1 = b - a
n = b - a + 1

§ The arg function, continuity, orientation

Let us think of the function arg: ℂ → ℝ as a multi valued function, which maps each complex number to the set of possible valid angles that generate it:
arg(z) ≡ { t ∈ ℝ : |z|·e^(i (π/2) t) = z }
We plot the function here:

§ Recovering single valued-ness

Now, the question is: can we somehow automatically recover single-valuedness? Kind of: by stipulating that for any given curve c: [0, 1] → ℂ, the composite arg ∘ c: [0, 1] → ℝ is continuous. Let's try to investigate what happens if we move from right towards bot, arbitrarily stipulating ("picking a branch") that arg(right) = 0 as a sort of basepoint. This prompts the natural question:
what happens if we move in the opposite direction?

§ Counter-clockwise movement

§ Multiple winding

The true power of this multi-valued approach comes from being able to handle multiple windings. Here the real meaning of being a multi-valued function shows through. If we decide to go through the loop twice, as:
bot -> right -> top -> left -> bot -> right -> top -> left
-1  -> 0     -> 1   -> 2    -> 3   -> 4     -> 5   -> 6
That is, we end up with the value 6, which can only be reached by winding around the loop twice.
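Here is a small Python sketch of this "continuity picks the branch" rule (my own discretization: sample the loop, and at each step pick the representative of arg nearest the previous value):

```python
import math

def continuous_arg(points):
    """Track arg continuously along a sampled curve: at each point, shift the
    principal angle by a multiple of 2*pi to land nearest the previous value."""
    args = [math.atan2(points[0][1], points[0][0])]
    for x, y in points[1:]:
        raw = math.atan2(y, x)
        k = round((args[-1] - raw) / (2 * math.pi))  # branch nearest to previous
        args.append(raw + 2 * math.pi * k)
    return args

# Walk the unit circle counter-clockwise twice.
n = 200
loop = [(math.cos(4 * math.pi * t / n), math.sin(4 * math.pi * t / n))
        for t in range(n + 1)]
angles = continuous_arg(loop)
winding = (angles[-1] - angles[0]) / (2 * math.pi)
print(round(winding, 6))  # 2.0: the final value remembers both windings
```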

§ Orientation from continuity

There's something really elegant about being able to recover a notion of "orientation" by simply:
  1. Allowing multi-valued functions.
  2. Forcing continuity constraints.
  3. Interpreting increase/decrease in the value of the function.

§ Discretizing, gaining more insight

I was personally dissatisfied with the above explanation, because it seemed weird that we would need to depend on the history to define this function. We can formalize this notion of history. Let's first discretize the situation, giving us:

§ Making the theory path-dependent

We originally had:
path: Time -> Spoke
f: Spoke -> 2^Val -- multi-valued

-- | morally, not mathematically.
out: Time -> Val
out t = choose_best_value (f(path[t]))
But there was a vagueness in this choose_best_value. So we redefine it:
path: Time -> Spoke
f': (Time -> Spoke) -> Time -> Val
f'(path, tcur) = 
  argmin (\v -> |v - path[tcur-1]| + |v - path[tcur+1]|)

out: Time -> Val
out = f'(path)

§ Odd partitions, unique partitions

A well known identity in combinatorics is that the partitions n = l1 + l2 + ... + lk where each li is odd are in bijection with the partitions where each li is unique. I really liked this bijection.
(15, 9, 9, 5, 5, 5, 3, 3, 3, 1, 1, 1, 1) [all odd]
(15, 9x2, 5x3, 3x3, 1x4) [group]
(15x1, 9x2, 5x(2+1), 3x(2+1), 1x4) [expand base-2]
(15, 18, [10, 5], [6, 3], 4) [all unique]
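A Python sketch of the odd-to-distinct direction (my own code): group equal parts, write each multiplicity in binary, and emit part·2^k for each set bit. The results are distinct because each number factors uniquely as (odd part)·(power of two).

```python
from collections import Counter

def odd_to_distinct(parts):
    """Map a partition with odd parts to one with distinct parts, same sum."""
    out = []
    for part, mult in Counter(parts).items():
        k = 0
        while mult:  # binary-expand the multiplicity
            if mult & 1:
                out.append(part << k)  # contribute part * 2^k
            mult >>= 1
            k += 1
    return sorted(out, reverse=True)

odd = [15, 9, 9, 5, 5, 5, 3, 3, 3, 1, 1, 1, 1]
print(odd_to_distinct(odd))  # [18, 15, 10, 6, 5, 4, 3]: distinct, same sum (61)
```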

§ Continued fractions, mobius transformations

read ekmett/fractions properly and write detailed log about it, and the related math.

§ Permutations and Lyndon factorization

For a string s, the Lyndon factorization writes s as the concatenation of substrings t1, t2, ..., tn, such that each ti is a Lyndon word and t1 ≥ t2 ≥ ... ≥ tn. For example, given the word banana, the Lyndon factorization is:
b; an; an; a; 
We can define a notation for writing a permutation as:
(7) (2 3) (1 4 5)
If we treat it as a string 723145, the Duval algorithm provides the decomposition:
7; 23; 145; 
So, we can treat the Duval algorithm as a way to recover the permutation given the raw string. It's a nice way to remember the definition of Lyndon decomposition if nothing else.
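A Python sketch of Duval's algorithm (the standard linear-time two-pointer formulation):

```python
def duval(s):
    """Duval's algorithm: factor s into Lyndon words t1 >= t2 >= ... >= tn."""
    factors, i = [], 0
    while i < len(s):
        j, k = i + 1, i
        # Extend while s[i:j] is a prefix of a (possibly repeated) Lyndon word.
        while j < len(s) and s[k] <= s[j]:
            k = i if s[k] < s[j] else k + 1
            j += 1
        while i <= k:  # emit the Lyndon word, once per repetition
            factors.append(s[i:i + j - k])
            i += j - k
    return factors

print(duval("banana"))  # ['b', 'an', 'an', 'a']
print(duval("723145"))  # ['7', '23', '145']
```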

§ Graphs are preorders

I wanted to record this fact for myself, so that I have reason to come back to it as I learn more preorder theory. Perhaps there are certain graph theoretic phenomena that make more sense when looked at from a preorder point of view.

§ Crash course on domain theory

In lambda calculus, we often see functions of the form λ xx(x). We would like a way to associate a "natural" mathematical object to such a function. The most obvious choice for lambda calculus is to try to create a set V of values which contains its own function space: (VV) ⊆ V. Unfortunately, the only solution for this in sets is the trivial set { * }.

§ Computation as fixpoints of continuous functions

§ CPOs


§ Monotone map

f: 2^ℕ → 2^ℕ
f(S) ≡ S            if S is finite
f(S) ≡ S ∪ { 0 }    if S is infinite
This map is monotone, but does not preserve least-upper-bounds. Consider the chain of elements:
A1 = {1}, A2 = {1, 2}, A3 = {1, 2, 3}, …, An = {1, 2, 3, …, n}
The union of all Ai is ℕ. Each of these sets is finite. Hence f({1}) = {1}, f({1, 2}) = {1, 2} and so on. Therefore:
f(⊔ Ai) = f(ℕ) = ℕ ∪ { 0 }    whereas    ⊔ f(Ai) = ⊔ Ai = ℕ

§ Continuous function

§ Fixpoints of continuouts functions

The least fixed point of a continous function f: DD is:
FIX(f) ≡ lub({ fn() : n ≥ 0 })

§ ⊑ as implication

We can think of a ⊑ b as b ⇒ a. That is, b has more information than a, and hence implies a.


§ Parallelisable version of maximum sum subarray

I learnt of a "prefix sum/min" based formulation from the solution to question D, codeforces educational round 88. The idea is to write a subarray sum as a difference of prefix sums, right minus left:
∑_{L ≤ i ≤ R} a[i] = ∑_{0 ≤ i ≤ R} a[i] − ∑_{0 ≤ i < L} a[i]
so the maximum over all subarrays is:
max_R ( ∑_{0 ≤ i ≤ R} a[i] − min_{L ≤ R} ∑_{0 ≤ i < L} a[i] )
Which is then expressed as:
asum[n] ≡ ∑_{0 ≤ i ≤ n} a[i]
maxsum = max_R (asum[R] − min_{L ≤ R} asum[L])
aminsum[n] ≡ min_{0 ≤ i ≤ n} asum[i]
maxsum = max_R (asum[R] − aminsum[R])
Since asum is a prefix-sum of a, and aminsum is a prefix-min of asum, the whole thing is O(n) serial, O(log n) parallel. In haskell, this translates to:
sums = scanl (+) 0
sums_mins = scanl1 min . sums
rise xs = zipWith (-) (sums xs) (sums_mins xs)
main = print (rise [1, 2, 3, -2, -1, -4, 4, 6])
> [0,1,3,6,4,3,0,4,10]
sums_mins keeps track of the sea level, while the zipWith (-) of sums against sums_mins computes the elevation from the sea level.

§ Thoughts on implicit heaps

Some musings I had on the ability to represent heaps as arrays, and in general, the benefits of knowing the total number of elements.

§ Discriminant and Resultant

I had always seen the definition of the discriminant of a polynomial p(x), with leading coefficient a_n and roots r_i, as:
Disc(p(x)) ≡ a_n^{2n−2} ∏_{i < j} (r_i − r_j)²
While it is clear why this tracks whether a polynomial has repeated roots or not, I could never motivate to myself or remember this definition. I learnt that it in fact comes from a more general object, the resultant of two polynomials P(x), Q(x), which provides a quantity Res(P(x), Q(x)) that is zero iff P, Q share a common root. Then, the discriminant is (up to a constant) the resultant of a polynomial and its derivative: p has a repeated root iff p and p′ share a root. This makes far more sense, and cleared up a lot of the mystery for me.

§ How did I run into this? Elimination theory.

I was trying to learn how elimination theory works: Given a variety V = { (x, y) : Z(x, y) = 0 }, how does one find a rational parametrization (p(t), q(t)) such that Z(p(t), q(t)) = 0, and p(t), q(t) are rational functions? That is, how do we find a rational parametrization of the locus of a polynomial Z(x, y)? The answer is: use resultants!

§ 5 minute intro to elimination theory.

Recall that when we have a linear system Ax = 0, the system has a non-trivial solution iff |A| = 0. Formally: ∃ x ≠ 0 with Ax = 0 ⟺ |A| = 0. Also, the ratio of solutions is given by:
x_i / x_j = (−1)^{i+j} |A_i| / |A_j|
If we have two polynomials p(a; x) = a0 + a1 x + a2 x², and q(b; x) = b0 + b1 x + b2 x², then the system p(a; x), q(b; x) has a simultaneous zero iff:

| a2 a1 a0 0  |
| 0  a2 a1 a0 |  = 0  ⟺  Q x = 0 for some x ≠ 0
| b2 b1 b0 0  |
| 0  b2 b1 b0 |

where Q is the 4×4 matrix above (the Sylvester matrix of p and q).

§ Big idea

The matrix is set up in such a way that any solution vector v such that Qv = 0 will be of the form v = (α³, α², α, 1). That is, the solution vector is a vector of powers of some α. Since Qv = 0, we have that a2 α² + a1 α + a0 = 0, and b2 α² + b1 α + b0 = 0: that is, α is a common root.

§ Proof

a2 w + a1 x + a0 y = 0        b2 w + b1 x + b0 y = 0 
a2 x + a1 y + a0 z = 0        b2 x + b1 y + b0 z = 0 
Since the coefficients of the two systems are the same, we must have that (w, x, y) and (x, y, z) are linearly dependent. That is:
(w, x, y) = α (x, y, z)  ⟹  w = α x = α² y = α³ z
We can take z = 1 arbitrarily, giving us a vector of the form (w, x, y, z) = (α3, α2, α, 1), which is the structure of the solution we are looking for!


§ Polynomial root finding using QR decomposition

  1. For a polynomial p(x), build the companion matrix P(x).
  2. Show that the characteristic polynomial cp(P) of the companion matrix P(x) is indeed p(x).
  3. Find eigenvalues of P(x), which will be roots of p(x), since the eigenvalues of a matrix M are the roots of its characteristic polynomial cp(M).
  4. We use QR since it is numerically stable. The Q matrix discovered by QR is orthogonal, and hence does not disturb the covariance of the noise on matrix multiplication.

§ A hacker's guide to numerical analysis

Life may toss us ill-conditioned problems, but it is too short to settle for unstable algorithms. - D.P. O'Leary

§ Measures of error

If x is a number and x̂ is its approximation, then there are two notions of error:
  1. absolute error: |x − x̂|.
  2. relative error: |x − x̂|/|x|.
Since the relative error is invariant under scaling (x ↦ α x), we will mostly be interested in relative error.

§ Significant digits

The significant digits in a number are the first nonzero digit and all succeeding digits. Thus 1.7320 has five significant digits. 0.0491 has only three significant digits. It is not transparent to me why this definition is sensible.

§ Correct Significant digits --- a first stab

We can naively define that x̂ agrees with x to p significant digits if x̂ and x round to the same number upto p significant digits. This definition is seriously problematic. Consider numbers such as x = 0.9949 and y = 0.9951 (an example of mine that exhibits the problem). Here, y agrees with x to one and to three significant digits, but not to two, since the roundings to two significant digits (0.99 and 1.0) do not agree even in the first significant digit.

§ Correct Significant digits --- the correct definition

We say that x̂ agrees with x to p significant digits if |x − x̂| is less than half a unit in the pth significant digit of x.

§ Accuracy v/s precision

Accuracy is not limited by precision: By using fixed precision arithmetic, we can emulate arbitrary precision arithmetic. The problem is that often this emulation is too expensive to be useful.

§ Backward, Forward errors

Let y = f(x), where f: ℝ → ℝ. Let us compute ŷ as an approximation to y, in an arithmetic of precision u. How do we measure the quality of ŷ?
  1. In many cases, we may be happy with an ŷ such that the relative error between y and ŷ is equal to u: we did the best we can with the precision that was given. This is the forward error.
  2. An alternative question we can ask is: for what δx do we have that ŷ = f(x + δx)? That is, how far away from the input do we need to stray to get a matching output? There may be many such δx, so we ask for min |δx|. We can divide this error by x as a normalization factor. This is the backward error.
There are two reasons we prefer backward error.
  1. It converts error analysis into "data analysis". The data itself tends to be uncertain. If the error given by the backward analysis is smaller than the data uncertainty, then we can write off our error as being too small. Since for all we know, we have 'fixed' the uncertain data with our small error.
  2. It reduces the question of error analysis into perturbation theory, which is very well understood for a large class of problems.

§ Backward stable

A method for computing y = f(x) is called backward stable if it produces a ŷ with small backward error. That is, we need a small δ x such that ŷ = f(x + δ x).

§ Mixed forward-backward error

We assume that addition and subtraction are backward stable, where u is the precision to which our arithmetic operations can be performed:
fl(x ± y) = x(1 + Δ₁) ± y(1 + Δ₂)    with |Δ₁|, |Δ₂| ≤ u
Another type of error we can consider is of the form:
ŷ + Δy = f(x + Δx)
That is, allowing a small perturbation Δy of the output as well, we get a backward error of Δx. This is called mixed forward-backward error. We say that an algorithm with mixed forward-backward error is stable iff:
ŷ + Δy = f(x + Δx),    |Δy|/|ŷ| ≤ ε,    |Δx|/|x| ≤ η,    with ε, η small.
This definition of stability is useful when rounding errors are the dominant form of errors.

§ Conditioning

The relationship between forward and backward error is governed by conditioning: the sensitivity of solutions to perturbations of data. Let us have an approximate solution ŷ = f(x + δx). Then:
ŷ − y = f(x + δx) − f(x) = f′(x) δx + O((δx)²)
(ŷ − y)/y = (x f′(x)/f(x)) (δx/x) + O((δx)²)
(ŷ − y)/y = c(x) (δx/x) + O((δx)²)
c(x) ≡ |x f′(x)/f(x)|
The quantity c(x) measures the scaling factor to go from the relative change in output to the relative change in input. Note that this is a property of the function f, not any particular algorithm.

§ Example: log x

If f(x) = log x, then c(x) = |(x (log x)′) / log x| = |1/log x|. This quantity is very large for x ≃ 1. So, a small change in x can produce a drastic change in log x around 1.

§ Rule of thumb

We now gain access to the useful rule:
forward error ≲ condition number × backward error

§ Forward stable

If a method produces answers with forward errors of similar magnitude to those produced by a backward stable method, then it is called forward stable. Backward stability implies forward stability, but not vice-versa (TODO: why?)

§ Cancellation

Consider the following program:
#include <cmath>
#include <stdio.h>

int main() {
    double x = 12e-9;
    double c = cos(x);
    double one_sub_c = 1 - c;
    double denom = x*x;
    double yhat = one_sub_c / denom;

    printf("x:         %20.16f\n"
           "cx:        %20.16f\n"
           "one_sub_c: %20.16f\n"
           "denom:     %20.16f\n"
           "yhat:      %20.16f\n",
            x, c, one_sub_c, denom, yhat); 
    return 0;
}
which produces the output:
x:           0.0000000120000000
cx:          0.9999999999999999
one_sub_c:   0.0000000000000001
denom:       0.0000000000000001
yhat:        0.7709882115452477
This is clearly wrong, because we know that (1 − cos x)/x² ≤ 1/2. The reason for this terrible result is the subtraction 1 - c, which cancels almost all the digits. In general:
x̂ ≡ 1 + ε    (error of order ε)
ŷ ≡ 1 − x̂ = −ε    (value of order ε)
That is, subtracting values close to each other (in this case, 1 and c) converts an error order of magnitude into a value order of magnitude. Alternatively, it brings earlier errors into prominence as values.

§ Analysis of subtraction

We can consider the subtraction:
x = a − b;    x̂ = â − b̂
â = a(1 + Δa)
b̂ = b(1 + Δb)
The relative error is:
|x − x̂| / |x| = |−aΔa + bΔb| / |a − b|
             ≤ max(|Δa|, |Δb|) (|a| + |b|) / |a − b|
This quantity will be large when |a − b| ≪ |a| + |b|: that is, when there is heavy cancellation in the subtraction to compute x.

§ Underflow

#include <cmath>
#include <stdio.h>

int main() {
    double x = 1000;
    for(int i = 0; i < 60; ++i) {
        x = sqrt(x);
    }
    for(int i = 0; i < 60; ++i) {
        x = x*x;
    }
    printf("x: %10.20f\n", x);
    return 0;
}
This produces the output:
x: 1.00000000000000000000
That is, even though the function is an identity function, the answer collapses to 1. What is happening? After 60 square roots, the true value is 1000^(1/2^60) = 1 + δ with δ ≈ ln(1000)/2^60 ≈ 6 × 10⁻¹⁸, which is below the double-precision spacing around 1; so x rounds to exactly 1, and repeated squaring of 1 faithfully returns 1.

§ Computing (ex − 1)/x

One way to evaluate this function is as follows:
double f(double x) { return x == 0 ? 1 : (pow(M_E, x) - 1) / x; }
This can suffer from catastrophic cancellation in the numerator. When x is close to 0, eˣ is close to 1, and eˣ − 1 will magnify the error in eˣ. An alternative is:
double f(double x) { 
   const double y = pow(M_E, x);
   return y == 1 ? 1 : (y - 1) / log(y);
}
This algorithm seems crazy, but there's insight in it: since log y = x exactly when y = eˣ, we have (eˣ − 1)/x = (y − 1)/log y. We can show that the errors cancel! The idea is that while neither (y − 1) nor log y is particularly accurate, the errors accumulated in them almost completely cancel out, leaving a good value:
Assume ŷ = 1. Then:
1 = ŷ ≡ eˣ(1 + δ)
log 1 = log(eˣ) + log(1 + δ)
0 = x + log(1 + δ)
x = −log(1 + δ) = −δ + O(δ²)
so x is itself of order δ, and returning 1 is accurate to within O(δ). If ŷ ≠ 1, a rounding analysis gives:
f̂ = fl((ŷ − 1)/log ŷ) = (1 + ε₃)(ŷ − 1)(1 + ε₁) / (log ŷ (1 + ε₂))

§ IEEE floating point fun: +0 and -0 for complex analysis

Rather than think of +0 and -0 as distinct numerical values, think of their sign bit as an auxiliary variable that conveys one bit of information (or misinformation) about any numerical variable that takes on 0 as its value.
We have two types of zeroes, +0 and -0 in IEEE-754. These are used in some cases. The most famous is that 1/+0 = +∞, while 1/−0 = −∞. Here, we proceed to discuss some complex-analytic considerations.
Therefore, implementers of compilers and run-time libraries bear a heavy burden of attention to detail if applications programmers are to realize the full benefit of the IEEE style of complex arithmetic. That benefit deserves some discussion here, if only to reassure implementers that their assiduity will be appreciated.
√(−1 + 0 i) = +0 + i
√(−1 − 0 i) = +0 − i
These will ensure that √(z*) = (√z)*:
copysign(1, +0) = +1 copysign(1, −0) = −1 
These will ensure that 1/(1/x) = x when x = ±∞. An example is provided where the two limits differ:
f(x + 0 i) = lim_{y → 0+} f(x + i y)
f(x − 0 i) = lim_{y → 0−} f(x + i y)

§ Complex-analytic considerations

The principal branch of a complex function is a way to select one branch of a complex function, which tends to be multi-valued. A classical example is the argument function, where arg(r e^{iθ}) = θ. However, this is ambiguous: we can map θ ↦ θ + 2π and still have the same complex number. So, we need to fix some standard. We usually pick the "branch" where 0 ≤ θ < 2π. In general, we need to carefully handle what happens to the function at the discontinuity.
What deserves to be undermined is blind faith in the power of Algebra. We should not believe that the equivalence class of expressions that all describe the same complex analytic function can be recognized by algebraic means alone, not even if relatively uncomplicated expressions are the only ones considered.


§ Mobius inversion on Incidence Algebras

Most of these functions are really defined on the incidence algebra of the poset P with ground field K. An incidence algebra I(P) is a set of functions which map intervals of P to the ground field K. An interval is a tuple (x, y) ∈ P × P such that x ≤ y (where the ≤ comes from the partial order on P). We have a product structure which I denote ⋆, given by:
(α ⋆ β)([x, z]) = ∑_{x ≤ y ≤ z} α([x, y]) β([y, z])
A linear algebra way to look at this is to consider |P| × |P| matrices over K where the rows and columns are indexed by P. A function α: P × P → K can be written as the elements of such a matrix (upper-triangular, if we order rows and columns by a linear extension of P). Then this convolution-like operator is simply matrix multiplication. We have three natural functions: (1) The characteristic function δ, which is the identity for ⋆:
δ([xz]) ≡ 1  if  x = z ;  0  otherwise
(2) the zeta function, which plays the role of the constant 1:
ζ([xz]) ≡  1  if  x ≤ z ;  0  otherwise
(3) The inverse of the zeta function, the mobius function, a tool for mobius inversion:
µ([x, x]) = 1
µ([x, z]) = − ∑_{x ≤ y < z} µ([x, y])    for x < z
The mobius inversion theorem for posets states that ζ and µ as defined above are convolutional inverses, that is, ζ ⋆ µ = δ. This allows us to prove:
g([x, z]) = ∑_{x ≤ y ≤ z} f([x, y])
          = ∑_{x ≤ y ≤ z} f([x, y]) · 1
          = ∑_{x ≤ y ≤ z} f([x, y]) · ζ([y, z])
g = f ⋆ ζ
g ⋆ µ = f ⋆ ζ ⋆ µ
g ⋆ µ = f ⋆ δ
g ⋆ µ = f
We have managed to find f in terms of g, when previously we had g in terms of f. TODO: we are usually interested in a fixed [x, z]. What happens if we make this implicit? We may get nice notation for all of this!

§ Sums as mobius inversion

We can derive the formula we had for integrals, that:
F(n) = ∑_{0 ≤ i ≤ n} f(i)  ⟺  f(k) = F(k) − F(k−1)
by setting up mobius inversion on the usual partial order for the natural numbers. For simplicity, I'll show the example on [0, 1, 2, 3, 4]. The example immediately generalizes.
F([x, z]) ≡ ∑_{x ≤ y ≤ z} fi([x, y]) = ∑_{x ≤ y ≤ z} fi([x, y]) · ζ([y, z]) = (fi ⋆ ζ)([x, z])
Hence F = fi ⋆ ζ, and so by mobius inversion:
f(n) = fi([0, n]) ≡ (F ⋆ µ)([0, n]) = ∑_{0 ≤ x ≤ n} F([0, x]) µ([x, n])
Concretely, on the chain 0 ≤ 1 ≤ 2 ≤ 3 ≤ 4:
µ([4, 4]) = 1    (by definition)
µ([3, 4]) = − ∑_{3 ≤ x < 4} µ([3, x]) = −µ([3, 3]) = −1    (by definition)
so f(4) = F([0, 4]) µ([4, 4]) + F([0, 3]) µ([3, 4]) = F(4) − F(3), as expected.

§ Finite differences and Umbral calculus

Umbral calculus lays out a large collection of "umbral" / "shadowy" coincidences across combinatorics and calculus. Here, I'll lay out some of these that I learnt from Concrete Mathematics. I hope to use this for myself to motivate a bunch of combinatorics. I'll provide an interesting proof of why ∑_{0 ≤ i < n} i² = n(n−1)(2n−1)/6 using this umbral calculus.

§ Discrete Derivative: Forward difference

We begin by trying to build a discrete version of the derivative operator. The derivative of f will be denoted using f'. The forward difference of f: ℝ → ℝ is defined as:
δ f: ℝ → ℝ; (δ f)(x) = f(x + 1) − f(x)
Immediately, we can see that this operator is linear:
δ(f + g)(x) = (f+g)(x+1) − (f+g)(x)
            = (f(x+1) − f(x)) + (g(x+1) − g(x))
            = (δ f)(x) + (δ g)(x)
δ(α f)(x) = (α f)(x+1) − (α f)(x)
          = α · (f(x+1) − f(x))
          = α (δ f)(x)
It obeys a slightly corrupted version of the product rule (fg)′ = f′g + g′f:
δ(fg)(x) = (fg)(x+1) − (fg)(x)
         = f(x+1)g(x+1) − f(x)g(x)
         = f(x+1)g(x+1) − f(x)g(x+1) + [f(x)g(x+1) − f(x)g(x)]
         = g(x+1)[f(x+1) − f(x)] + f(x)[g(x+1) − g(x)]
         = g(x+1)(δ f)(x) + f(x)(δ g)(x)
         = (Sg · δf)(x) + (f · δg)(x)    [(Sh)(x) ≡ h(x+1)]
         = (Sg · δf + f · δg)(x)
We need this new S operator to shift the function's input from x to x+1.

§ Falling factorial as polynomials

cool, now that we seem to have a linear derivative operator that behaves roughly sanely, let's test it out! The first reasonable target is a polynomial, so let's try x²:
δ(x²) = (x+1)² − x² = 2x + 1
This is disappointing, it does not behave very well :( However, there is an analogue that does well. This is the falling factorial, defined as:
x^(n) ≡ x(x−1)(x−2)⋯(x−n+1)
For example:
x^(0) = 1;  x^(1) = x;  x^(2) = x(x−1);  x^(3) = x(x−1)(x−2)
Let's try and apply our discrete difference δ:
δ(x^(2)) = (x+1)^(2) − x^(2)
         = (x+1)(x) − x(x−1)
         = x·[(x+1) − (x−1)] = 2x = 2x^(1)
δ(x^(3)) = (x+1)^(3) − x^(3)
         = (x+1)(x)(x−1) − x(x−1)(x−2)
         = x(x−1)·[(x+1) − (x−2)] = 3x(x−1) = 3x^(2)
These falling factorials look pretty unnatural though, why do we care? Well, once we build some integral calculus, we can handle our problem of ∑_{0 ≤ i < n} i² using this calculus, by rewriting i² in terms of these falling factorials.

§ Sums as discrete integrals

We want to think of ∑_{0 ≤ i < n} f(i) as the correct variant of ∫₀ⁿ f(i) di. The most important property of an integral is the fundamental theorem of calculus:
∫_a^b f′(i) di = f(b) − f(a)    ↦    ∑_{a ≤ i < b} (δ f)(i) =?= f(b) − f(a)
we can check if the assertion is true:
∑_{a ≤ i < b} (δ f)(i)
 = [f(a+1) − f(a)] + [f(a+2) − f(a+1)] + [f(a+3) − f(a+2)] + ⋯ + [f(b) − f(b−1)]
 = f(b) − f(a)    (the sum telescopes)
Sweet, so we just kicked a theory of calculus off the ground. Let's put this to some use:

§ Gauss' famous formula from discrete calculus

Let's begin by deriving the closed form for the sum of the first (n−1) naturals:
∑_{0 ≤ i < n} i = ∑_{0 ≤ i < n} i^(1) = i^(2)/2 |_{i=0}^{i=n} = n(n−1)/2
Let's now derive the closed form for the sum of the squares of the first (n−1) naturals:
∑_{0 ≤ i < n} i² = ∑_{0 ≤ i < n} i·(i−1) + i
                = ∑_{0 ≤ i < n} i^(2) + i^(1)
                = n^(3)/3 + n^(2)/2
                = n(n−1)(n−2)/3 + n(n−1)/2
                = n(n−1)·((n−2)/3 + 1/2)
                = n(n−1)(2n−1)/6
Trying to perform this process in general does beg a question: how do we convert from xn into some combination of rising and falling factorials? It turns out that Stirling Numbers will show up to aid this conversion. But before that, let's see some more connections.

§ 2x as the combinatorial version of ex

We want to find the analogue of the exponential function ex, which satisfies the equation f′(x) = f(x). Setting this up in the discrete case:
δf(x) = f(x),  f(0) = 1
f(x+1) − f(x) = f(x),  f(0) = 1
f(x+1) = 2 f(x),  f(0) = 1
f(n) = 2ⁿ
What does this buy us? It gives us a nice proof that ∑_k C(n, k) = 2ⁿ. It proceeds by taking the taylor series of eˣ, "combinatorializing" it, and then simplifying:
eˣ = ∑_{k ≥ 0} xᵏ/k!    ↦    2ˣ = ∑_{k ≥ 0} x^(k)/k! = ∑_{k ≥ 0} C(x, k)
Evaluating at an integer x = n, the sum is finite and gives ∑_k C(n, k) = 2ⁿ.

§ Harmonic series as the combinatorial version of logarithm

In the continuous case, we have that ∫ 1/x dx = log x. In the discrete case, we get ∑_{i=1}^{n} 1/i = H_n, the nth harmonic number. This explains why the harmonic numbers keep showing up in different places across discrete math --- we're often trying to integrate some kind of 1/x quantity. This also may tell us that H_n and 2ⁿ are somewhat "weak inverses" of each other. I haven't thought about this weak inverse thing too much.

§ Stirling numbers of the first kind for converting between polynomials and falling factorials

We can express the rising factorial as x(x+1)⋯(x+n−1) = ∑_{k=0}^{n} [n, k] xᵏ, where [n, k] are the (yet to be defined) unsigned stirling numbers of the first kind. (aside: I wonder if this can be derived from some kind of mobius inversion). We begin by in fact defining the stirling numbers of the first kind, [n, k], as the coefficients of the rising factorial:
x(x+1)(x+2) ⋯ (x+n−1) = ∑_{k=0}^{n} [n, k] xᵏ
The question as a combinatorialist is:
what do the unsigned stirling numbers of the first kind, [n, i], count?
The answer is:
[n, i] counts the number of permutations of n elements with i disjoint cycles.
For example, in the case of the permutations of the set {1, 2, 3}, we have the permutations:
(1)(2)(3) 3 cycles           
(12)(3) 2 cycles           
(1)(23) 2 cycles           
(2)(13) 2 cycles           
(132) 1 cycle           
(123) 1 cycle           
So, this gives the counts:
[3, 3] = 1 [3, 2] = 3 [3, 1] = 2
These stirling numbers satisfy a recurrence:
[n+1, k] = n[nk] + [nk−1]
This can be seen combinatorially. If we want the number permutations of n+1 objects with k cycles, we can either:
  1. Take the permutations of n objects with k−1 cycles, [n, k−1], and then insert the new (n+1)th object into a new cycle all by itself. That is, we create a new permutation where p(n+1) = (n+1).
  2. Take the permutation of n objects with k cycles, and insert this new (n+1)th object into any of the k cycles. If the original permutation had x -p-> y, we can insert this new object as x -p-> * -p-> y for any x. Thus, there are n choices of locations to insert * --- as the image of each possible x. This gives us the n[n, k] term.
Another cute fact of the unsigned stirling numbers of the first kind [n, k] is that since the permutations of n elements are partitioned by their number of cycles, we have:
∑_k [n, k] = n!


§ Permutahedron

The permutahedron over n letters is a polytope which is defined as the convex hull of all permutations of the point (1, 2, …, n). For example, the permutahedron over 3 letters is the convex hull of the points (1, 2, 3), (1, 3, 2), (2, 1, 3), (2, 3, 1), (3, 1, 2), (3, 2, 1). Here, we show that it can be embedded in (n−1)-dimensional space, and that each point perm((1, 2, …, n)) is indeed a vertex of the convex hull. An example of a situation where a point is not a vertex of a convex hull is convexhull(1, 2, 3) = [1, 3]. Note that 2 was used to generate the convex hull, but is not a vertex since it is a convex combination of 1 and 3. We wish to prove that such a situation does not happen in the case of the permutahedron. I have not seen an elementary proof that each permutation of the original point is a vertex, so I'm recording the proof out of interest.

§ Showing that all permutations of (1, 2, …, n) are vertices

§ Core idea

Start by looking at the index i of the largest value N in the point P. Since N is the largest possible coordinate, if P is a convex combination, all other points in the combination must also have the value N at that index i. So we can now ignore the index i and get permutations of [1..N-1]. Repeat this process to prove that all coordinates of the points in the convex sum must be equal to those of P. But this is absurd, because all vertices are distinct. Hence, the only way for this to work out is if we have only one point that we are convex-summing over. That is, P cannot be written as the convex sum of two or more vertices.

§ Formal proof

Let us be working with permutations of [1, 2, ..., N]. Pick point P. Assume P is in the convex sum of some { Xi }. Let index(N, P) be the index of number N in P. For example, if P = [2, 3, 1], then
index(2, P) = 0 (assuming 0-based indexing)
index(3, P) = 1
index(1, P) = 2
If P is the convex sum of vertices { Xi }, all of them must have that:
Xi[index(N, P)]  = N forall Xi in the convex sum
since N is the largest value that X_I can have at any index. The only way to get the maximum value by a convex sum is for all values to be that maximum value. So, we can now ignore dimension index(N, P), since all the vertices X_i involved in the convex combination and P have index(N, P) = N. If we project out this dimension, we are left with permutation of (1..N-1). Repeat the same process. We will need to have all X_i to have their value at index(N-1, P) to be N-1. Keep doing this till we get that the dimensions of all points in Xi and P are equal. But all the points in { Xi } are distinct since they are permutation of [1..n]. Hence, { Xi } can contain only a single point, and this point is P. Hence, P cannot be written as the convex combination of other points. Hence, P is a vertex.

§ Proof that convex sum of maximum must have maximum

Assume we have values {v1, v2, ... vn} where without loss of generality, v1 <= v2 <= ... <= vn (we can always relabel otherwise). Now suppose we write vn as a convex sum vn = Σ λi vi with λi >= 0 and Σ λi = 1. Since each vi <= vn, we get Σ λi vi <= Σ λi vn = vn, with equality iff λi = 0 whenever vi < vn. That is, a convex combination attains the maximum only by placing all of its weight on values equal to the maximum.

§ Lyndon + Christoffel = Convex Hull

Actual "real world" use-case of lyndon factorization, cribbed from here: I wanted to learn a real-world use-case of lyndon factorization so I could remember it. I found a whole bridge between combinatorics/strings and discrete geometry (they call it "digital geometry") that I didn't know existed before.

§ Bonus: slick characterization of line drawing

If we want to draw a line with slope p/q = 4/7 using the lower approximation the idea is that we keep taking 4 steps along x, and every time we "exceed" 7 steps along x, we take a step along y. This is the same as:
  1. working in Z/(4+7)Z, starting from 0, writing down multiples of 4 till we cycle back to 0
  2. marking an "increase" in a step with an x, and a "decrease" in a step with y.
The intuition is that once we walk k·p steps where k·p >= q, we want to increment y. So, at first glance, we may believe we should consider Z/qZ. However, this is misguided. Examples to enlighten:
  1. Consider x=1, y=0. We should use Z/1Z = { 0 }: that is, we must keep moving along 0 -x-> 0 -x-> 0 -x-> .... This is unlike what happens if we choose Z/0Z (which is not a well-defined idea).
  2. Consider x=1, y=1. We should use Z/2Z, so we keep going 0 -x-> 1 -y-> 0 -> ... which will cause us to flip x -> y -> x -> y -> ....
In some sense, we are making sure that we can "start" with an x and see where that takes us. In the Z/1Z case, we realise that we can keep taking xs. In the Z/2Z case, we realise we need to flip between x and y. Let's try to show this formally, where k is the smallest number such that kp >= q. We'll also have concrete examples where p=2, q=7. For the example, k=4 since kp = 4*2 = 8 > 7. If we work in Z/qZ = Z/7Z, we will get the numbers:
Z:    0, p, 2p, ... (k-1)p, [q] kp 
Z/qZ: 0, p, 2p, ... (k-1)p, [q] (kp % q)
Z/7Z: 0, 2,  4, ...      6, [7] (8 % 7 = 1)
But notice that we are inserting x between numbers. So we will get:
Z:    0 -x-> p -x-> 2p, ... (k-1)p -y-> kp
Z/qZ: 0 -x-> p -x-> 2p, ... (k-1)p -y-> (kp % q)
Z/7Z: 0 -x-> 2 -x->  4, ...      6 -y-> (8      % 7 = 1) 
         ^                            ^
      x since [0 < 2]              y since [6 > 1]
which gives us the moves:
Z/7Z: 0 -x-> 2 -x-> 4 -x-> 6 -y-> 8 % 7 = 1
We only get 3 occurrences of x, after which the next accumulation of p becomes an 8, which wraps around to a 1. This is the age-old tension that exists between points and gaps. For k points, there are k-1 gaps. In our case, we have 4 points [0, 2, 4, 6], but this leaves us room for only three gaps for three x moves, while in reality we need 4. We remedy this situation by giving ourselves space enough for one more x, by changing from Z/qZ to Z/(p+q)Z. We should look at this as creating space for another gap.
Z:        0, p, 2p, ... (k-1)p, [q] kp,  [q+p  ], (k+1)p
Z/(p+q)Z: 0, p, 2p, ... (k-1)p, [q] kp,  [q+p  ], (k+1)p % (p+q)    
Z/9Z: 0, 2,  4, ...          6, [7]  8,  [7+2=9], (10    %    9=1)
which gives us the moves:
Z:        0 -x-> p -x-> ... (k-1)p -x-> kp -y-> (k+1)p
Z/(p+q)Z: 0 -x-> p -x-> ... (k-1)p -x-> kp -y-> (k+1)p % (p+q)
Z/9Z:     0 -x-> 2 -x-> ...      6 -x->  8 -y-> (   10 %     9 = 1)
             ^                             ^
           x since [0 < 2]               y since [8 > 1]

That is, we are able to get k occurrences of x between 0, p, .., kp, which has (k+1) points. Concretely, we have:
Z/9Z: 0 -x-> 2 -x-> 4 -x-> 6 -x-> 8 -y-> 1
where we have 4 occurrences of x, after which we have an occurrence of y, which is what we want when p=2, q=7: we need to reach at least q=7 before we have exceeded our denominator and need to make a move along y.

§ Geometric proof of e^x >= 1+x, e^(-x) >= 1-x

Let's concentrate on the e^x >= 1 + x part.
  1. The tangent of e^x at x = 0 is 1 + x, since the taylor series of e^x truncated upto x is 1 + x.
  2. e^x is a strictly convex function, since (e^x)'' = e^x, which is positive everywhere. Hence, e^x always lies above its tangent.
Similarly for e^(-x), working through the math:
  1. 1 - x is tangent at x=0 to e^(-x).
  2. (e^(-x))' = -e^(-x), so (e^(-x))'' = e^(-x), which is again positive everywhere; hence, e^(-x) is strictly convex and lies above its tangent 1 - x.

§ Ranking and Sorting

We want to sort an array xs and write the results into an array ys. In both cases, the invariant to be satisfied is that ys is xs in ascending order. I'll be considering two broad ways to implement such a thing:
// (RANK): scatter xs[i] to its sorted position rs[i]
for(int i = 0; i < N; ++i) {
   ys[rs[i]] = xs[i];
}
// (SELECT): gather the element that belongs at position i
for(int i = 0; i < N; ++i) {
   ys[i] = xs[ss[i]];
}

§ Definition of rank

In the case of (RANK), the specification is that the rank of an element e is the number of elements that are:
  1. less than e.
  2. equal to e but occur before e.
This ensures that rank is a permutation: that is, every element e is given a unique index.
int rank(const int *xs, int i) { 
    int cnt = 0;
    for (int j = 0; j < N; ++j) {
        if (xs[j] < xs[i] || (xs[j] == xs[i] && j < i)) cnt += 1;
    }
    return cnt;
}

for(int i = 0; i < N; ++i) {
  rs[i] = rank(xs, i);
}

§ Rank: Alternative 1

An alternative way to look at our definition of rank is that we are sorting the tuples (xs[i], i) using lex ordering. So if two indices i, j have the same value, then we sort on the index.

§ Rank: Alternative 2

We could have also defined rank as:
int rank(int *xs, int i) { 
    int cnt = 0;
    for (int j = 0; j < i; ++j) {
        if (xs[j] <= xs[i]) cnt += 1;
    }
    return cnt;
}
I declined to do so because I wanted to show the lex-ordering interpretation of rank, which will be useful later.

§ Definition of select

For (SELECT), we'll need to think a little harder. Let's try to rewrite our way to glory:
ys[rs[i]] = xs[i]          // (RANK)
ys[rs[ss[i]]] = xs[ss[i]]  // substitute i -> ss[i]
ys[i] = xs[ss[i]]          // (SELECT), provided rs[ss[i]] = i
This gives us necessary and sufficient conditions on how to find an ss: ss must be a permutation that is the inverse permutation of rs.

§ How to invert a permutation?

How does one invert a permutation? We have a permutation rs[i] that maps i to rs[i]. We want to find a new permutation ss[i] such that
rs[ss[i]] = i
// ss[i] is an index 'k'...
ss[i] = k 
// 'k' is the location of 'i' in rs.
rs[k] = i
So if we first tag each location of rs with its index, and then sort with the values of rs being the keys, then ss[i] will be the values. In pictures:
rs[i]          [3 4 0 1 2]
rs_tagged[i]   [(3, 0) (4, 1) (0, 2) (1, 3) (2, 4)]
rs_sorted[i]   [(0, 2) (1, 3) (2, 4) (3, 0) (4, 1)]
ss[i]          [    2      3      4      0      1 ]
rs[ss[i]]      [ rs[2]  rs[3]  rs[4]  rs[0]  rs[1]]
rs[ss[i]]      [    0      1      2      3      4 ]

§ Rank and Select are inverses

It's nice, there are a couple of miracles that lined up here:

§ Generalized rank and select are adjoint

We had to "fix" our definition of rank to avoid equal elements in the array. Hence, we had the rule that we also sort by indexes if the elements are equal. However, if we now decide to ignore this rule, we will then recreate the classical adjunction between rank and select.

§ References

§ Proof of the Minkowski convex body theorem

We can derive a proof of the Minkowski convex body theorem starting from Blichfeldt's theorem.

§ Blichfeldt's theorem

This theorem allows us to prove that any large-enough set in the ambient space of a lattice contains two points whose difference lies in the lattice. Formally, we have:
  1. A lattice L(B) ≡ { Bx : x ∈ ℤ^n } for some basis B ∈ ℝ^(n×n). The lattice L is spanned by integer linear combinations of the columns of B.
  2. A body S ⊆ ℝ^n (which need not be convex!), which has volume greater than det(B). Recall that for a lattice L(B), the volume of a fundamental unit / fundamental parallelepiped is det(B).
Blichfeldt's theorem tells us that there exist two points x1, x2 ∈ S such that x1 − x2 ∈ L.

§ Proof

The idea is to:
  1. Chop S into pieces along all translates of the fundamental parallelepiped that have non-empty intersection with S, and translate each piece back to the origin. This makes all the pieces overlap the fundamental parallelepiped at the origin.
  2. Since S has volume greater than det(B), but the fundamental parallelepiped only has volume det(B), pieces from two different parallelepipeds must overlap.
  3. "Undo" the translation to find two points of the form x1 = l1 + δ, x2 = l2 + δ. They must have the same δ since they overlapped when laid on the fundamental parallelepiped. Also notice that l1 ≠ l2, since they came from two different parallelepipeds!
  4. Notice that x1 − x2 = l1 − l2 ∈ L, which is nonzero since we already argued that l1 ≠ l2. This gives us what we want.

§ Minkowski's convex body theorem from Blichfeldt's theorem

Consider a convex set S ⊆ ℝ^n that is symmetric about the origin with volume greater than 2^n det(B). Create a new set T by scaling S by 1/2. Formally:
T ≡ S/2 = { (x1/2, x2/2, …, xn/2) : (x1, x2, …, xn) ∈ S }
We now check that Vol(T) > det(B), so that we may invoke Blichfeldt's theorem. Formally:
Vol(T) = 1/2^n Vol(S) > 1/2^n (2^n det(B)) = det(B)
We can apply Blichfeldt's theorem to get our hands on two points x1, x2 ∈ T such that x1 − x2 ∈ L.
x1 ∈ T ⇒ 2x1 ∈ S           (S = 2T)
x2 ∈ T ⇒ 2x2 ∈ S           (S = 2T)
2x2 ∈ S ⇒ −2x2 ∈ S         (S is symmetric about origin)
(2x1)/2 + (−2x2)/2 ∈ S     (S is convex: midpoint of two points of S)
x1 − x2 ∈ S                (simplification)
Hence x1 − x2 is a nonzero lattice point in S.
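A quick numeric illustration (a made-up example, not from the post): take L = ℤ² (so det(B) = 1) and S the open square (−1.1, 1.1)², which is convex, symmetric about the origin, and has area 4.84 > 2² · det(B). Brute force finds the promised nonzero lattice points:

```python
# Minkowski's theorem demo on Z^2: an origin-symmetric convex set of
# area > 4 must contain a nonzero lattice point.
side = 1.1
lattice_pts_in_S = [(x, y) for x in range(-2, 3) for y in range(-2, 3)
                    if abs(x) < side and abs(y) < side and (x, y) != (0, 0)]
print(lattice_pts_in_S)  # the 8 nonzero points with coordinates in {-1, 0, 1}
```

Shrinking the square below area 4 (e.g. side 0.9) makes the nonzero lattice points disappear, so the 2^n det(B) bound is tight.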

§ References

§ Burrows Wheeler

We aim to get the O(n) algorithm for Burrows-Wheeler, by starting from the naive O(n^2) implementation and then slowly chipping away to get to the fastest algorithm.

§ String rotations

Given a string s of length n, we can index it as s[0], s[1], up to s[n−1]. We can now build a table consisting of rotations of the string. We'll define:
lrot :: [a] -> [a]
lrot [] = []; lrot (x:xs) = xs ++ [x]
We can build a table of left-rotations:
We can immediately notice that it's a symmetric matrix. We can prove this as follows. We write lrot(r, s) for the string s rotated left by r, viewed as an array such that:
lrot(r, s)[i] = s[(r + i) mod n]
Now, we note that on row r of our array we have the string lrot(r, s). So the value at row r, column c is lrot(r, s)[c] = s[(r + c) mod n]. But this is symmetric in c, r, so it can be written as s[(c + r) mod n], which is equal to lrot(c, s)[r]. Formally:
lrot(r, s)[c] = s[(r + c) mod n] = lrot(c, s)[r]
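We can check the symmetry claim directly; a Python sketch (the helper `lrot_k` is an assumed name for the indexed rotation):

```python
# lrot_k(k, s)[i] = s[(k + i) % len(s)]: rotate s left by k places.
def lrot_k(k, s):
    return s[k:] + s[:k]

s = "here-there"
n = len(s)
table = [lrot_k(r, s) for r in range(n)]
# symmetry of the rotation table: entry (r, c) equals entry (c, r)
print(all(table[r][c] == table[c][r] for r in range(n) for c in range(n)))  # True
```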

§ Sorts of string rotations

We're next interested in considering sorted order of the string rotations. For example, in the case of the string "here, there", all the rotations are:
here-there  0
ere-thereh  1
re-therehe  2
e-thereher  3
-therehere  4
therehere-  5
herehere-t  6
erehere-th  7
rehere-the  8
ehere-ther  9
which is calculated with the help of the following haskell code:
lrot :: [a] -> [a]
lrot [] = []; lrot (a:as) = as ++ [a]

lrots :: [a] -> [[a]]; lrots as = take (length as) (iterate lrot as)

main =  putStrLn $ intercalate "\n"  
  (zipWith (\s i -> s <> "  " <> show i)
           (lrots "here-there") [0,1..])
If we now sort these rotations, we get:
-therehere  0
e-thereher  1
ehere-ther  2
ere-thereh  3
erehere-th  4
here-there  5
herehere-t  6
re-therehe  7
rehere-the  8
therehere-  9
We produce this by changing the above definition of main to:
main =  putStrLn $ intercalate "\n"  
  (zipWith (\s i -> s <> "  " <> show i)
           -- | extra `sort` here
           (sort $ lrots "here-there") [0,1..])
Let's look at the final column of that table. We have:
-thereher|e|  0
e-therehe|r|  1
ehere-the|r|  2
ere-there|h|  3
erehere-t|h|  4
here-ther|e|  5 <- ORIGINAL STRING
herehere-|t|  6
re-thereh|e|  7
rehere-th|e|  8
therehere|-|  9

Now, we'll show how to write a really cool function called bwtinv such that:
bwtinv ("errhhetee-",5) = "here-there"
The 5 is the index of the original string in the sorted list. That is, given the jumbled-up last column and the index of the original string, we're able to reconstruct the original string. This is useful because the jumbled string "errhhetee-" has long runs of r, h, and e, which makes it easier to compress. The process we performed (rotating, sorting the rotations, and taking the final column) is the Burrows-Wheeler transform. In code:
import Data.List
lrot :: [a] -> [a]
lrot [] = []; lrot (a:as) = as ++ [a]

lrots :: [a] -> [[a]]
lrots as = take (length as) (iterate lrot as)

findix :: Eq a => a -> [a] -> Int
findix a as = length (takeWhile (/= a) as)

lastcol :: [[a]] -> [a]; lastcol = map last

bwt :: Eq a => Ord a => [a] -> ([a], Int)
bwt as = let as' = (sort . lrots) as in (lastcol as', findix as as')
Now we need to understand how bwtinv is defined and what it does. The definition is:
import Control.Arrow ((&&&))

bwtinv :: Eq a => Ord a => ([a], Int) -> [a]
bwtinv (as, k) = recreate as !! k

-- recreate · lastcol · sort · rots = sort · rots
recreate :: (Eq a, Ord a) => [a] -> [[a]]
recreate as = recreate' (length as) as

recreate' :: (Eq a, Ord a) => Int -> [a] -> [[a]]
recreate' 0 = map (const [])
recreate' n = hdsort . consCol . (id &&& recreate' (n-1))

hdsort :: Ord a => [[a]] -> [[a]]
hdsort = let cmp (x:xs) (y:ys) = compare x y
         in sortBy cmp 

consCol :: ([a], [[a]]) -> [[a]]
consCol = uncurry (zipWith (:))
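For readers who prefer an executable summary, here is a compact Python sketch of the same pipeline (a restatement of the Haskell, not the author's code):

```python
# bwt: sort all rotations, return (last column, index of original string).
def bwt(s):
    n = len(s)
    rots = sorted(s[i:] + s[:i] for i in range(n))
    return "".join(r[-1] for r in rots), rots.index(s)

# bwtinv: the recreate' trick: repeatedly cons the last column onto the
# rows and re-sort; after n rounds the rows are the sorted rotations.
def bwtinv(last, k):
    rows = [""] * len(last)
    for _ in range(len(last)):
        rows = sorted(c + row for c, row in zip(last, rows))
    return rows[k]

print(bwt("here-there"))        # ('errhhetee-', 5)
print(bwtinv("errhhetee-", 5))  # 'here-there'
```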
OK, so much for the code. What does it do? The idea is that:

§ References

§ Intuitionistic logic as a Heyting algebra

Open sets form a Heyting algebra, which is a lattice plus an implication operator. So it's stronger than a lattice, but weaker than a boolean algebra. Formally, a Heyting algebra over a set X is a collection (L, ∨, ∧, ⇒) where (L, ∨, ∧) forms a lattice, and ⇒ obeys the axiom
⇒ : L × L → L; (c ∧ a) ≤ b  ⇐⇒ c ≤ (a ⇒ b)
In any topological space (U, τ) (U for universe set), the open sets of that space form a Heyting algebra. Here, set-union and set-intersection become the lattice join and lattice meet. We define a "weak complement", denoted ¬, as:
¬ A ≡ interior(AC)
We need the interior to make sure that we're left with an open set. Also, this ¬ is not really a complement, since we won't have that ¬ ¬ A = A; but more on that later! We draw open intervals with round parentheses. Below, ¬ A of the set A becomes the open interval in the interior of A's complement:
--(===)-- A
==]---[== A complement
==)---(== Not A
Now, we can ask: when is ¬ ¬ A = U (where U is the universe set, or the full space)? If we consider the set A that omits a single point, then we will have:
==)(== A 
------------ Not A  
============ Not (Not A)
in words:
A = U ∖ { 1 }
¬ A = interior(A^c) = interior({ 1 }) = ∅
¬ ¬ A = interior((¬ A)^c) = interior(∅^c) = interior(U) = U
So in some sense, the law of excluded middle is "almost true": it holds that A ≃ ¬ ¬ A, where A excludes a set of points of measure zero. This is really interesting, since it gives an oddball probabilistic flavour to things: if we blind ourselves to measure-0 sets, then ¬ ¬ A = A. Now, we look at implication. The implication A ⇒ B is the largest open set C such that C ∧ A ≤ B. In pictures:
---(========)----- A
---------(=====)-- B
---------(==)----- A && B
===)-----(======== A -> B
The reason this is true is that from the definition:
⇒ : L × L → L;  c ≤ (a ⇒ b)  ⇐⇒ (c ∧ a) ≤ b
replacing ≤ with = for insight:
⇒ : L × L → L;  c = (a ⇒ b)  ⇐⇒ (c ∧ a) = b
Alternatively, we can use the fact that in regular boolean algebra:
a ⇒ b = ¬ a ∨ b
to derive A −> B:
---(========)----- A
===)--------(===== NOT A
---------(=====)-- B
[Extend B to everything that doesn't include
  more of A than already included]
===)-----(======== NOT A or B = A -> B

§ a -> b as a contained in b:

We'll show that a -> b is true/top/the full real line if and only if a is contained in b. We can show this using the ¬ a ∨ b definition:
  1. a ≤ b (given)
  2. Since a ≤ b, ¬ a ≥ ¬ b, since ¬ reverses the inclusion order.
  3. x ≥ y implies that x ∨ p ≥ y ∨ p for all p, since p is on both sides.
  4. Specializing to x = ¬ a, y = ¬ b, p = b gives us ¬ a ∨ b ≥ ¬ b ∨ b.
  5. ¬ b ∨ b is the universe U.
  6. ¬ a ∨ b ≥ U, which means it's equal to U (nothing can be greater than U).
We can also show this geometrically:
----------(===)---- A
------(==========) B
----------(===)---- A && B
[Extend B to everything that doesn't include
  more of A than already included.
  But all of A is
  already included!]
================== A -> B
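We can machine-check the adjunction on a toy Heyting algebra. This Python sketch uses an assumed example topology (a chain of opens on {0, 1, 2}, not from the post), and computes a ⇒ b as the union of all opens c with c ∧ a ≤ b:

```python
# Opens of a tiny chain topology on {0, 1, 2}; join = union, meet = intersection.
opens = [frozenset(), frozenset({0}), frozenset({0, 1}), frozenset({0, 1, 2})]

def implies(a, b):
    # a => b: the largest open C with C & a <= b (union of all such opens;
    # the union qualifies since (U Ci) & a = U (Ci & a) <= b)
    cs = [c for c in opens if c & a <= b]
    return frozenset().union(*cs)

# the Heyting adjunction: c & a <= b  <=>  c <= (a => b), for all opens
ok = all((c & a <= b) == (c <= implies(a, b))
         for a in opens for b in opens for c in opens)
print(ok)  # True
```

Swapping in a non-chain topology still works, as long as `implies` only unions sets that are actually open.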

§ reading curry, uncurry using a -> b as containment:

curry, uncurry are the adjunction/isomorphism:
((c,a) -> b) ~ (c -> (a -> b))
Read this as:
( c     ,         a)     ->         b
(c `intersection` a) `contained in` b
if and only if
c      ->        (a       ->      b)
c `contained in` (a `implication` b)
That is, we have "read" curry/uncurry as:
c ∧ a ≤ b  ⇐⇒ c ≤ a ⇒ b
which was the definition of ⇒ we wanted!

§ References

§ Edit distance

This implementation of edit distance crystallizes the fact that when computing edit distance, we only ever move forwards while solving the problem. We do not store the results of the overlapping computations, though we could. Rather, the goal of this implementation is to capture the traversal pattern necessary for edit distance in a Cursor, and to view the edit distance problem from the point of view of this Cursor. The problem setting is that we have a source string and a destination string, and we wish to perform operations that convert the source string into the destination string. The operations allowed (mirroring the Move type below) are to insert a character of the destination, remove a character of the source, or replace a character of the source with a character of the destination. We want to minimise the number of operations used. Let's see how we model this.
{-# LANGUAGE ViewPatterns #-}
type Ix = Int; type DestIx = Ix; type SrcIx = Ix;
data Move = InsertFromDest DestIx |
            RemoveFromSrc SrcIx |
            ReplaceSrcWithDest SrcIx DestIx
Our cost model says that each move costs 1. We are charged for every move we make. We are to minimize the number of operations.
movescost :: [Move] -> Int; movescost = length
We model this as us having a Cursor which contains list [a] and information about where we are in the list as an Ix. This is the same as a Zipper for a list, except that in this case, we only allow ourselves to walk forward.
data Cursor a = Cursor Ix [a]
cdone :: Cursor a -> Bool; cdone (Cursor ix vs) = ix >= length vs
cix :: Cursor a -> Ix; cix (Cursor ix _) = ix
cval :: Cursor a -> a; cval c@(Cursor ix vs) = vs !! ix
incr :: Cursor a -> Cursor a; incr (Cursor ix vs) = Cursor (ix+1) vs
cursor :: [a] -> Cursor a; cursor = Cursor 0
We implement edit, which tells us how to edit the source string into the destination string.
-- | decide how to get ixth character of bs from our as.
edit :: Eq a => Cursor a -> Cursor a -> [Move]
edit (cdone -> True) (cdone -> True) = []
edit a@(cdone -> False) b@(cdone -> True) = 
  (RemoveFromSrc (cix a)):edit (incr a) b
edit a@(cdone -> True) b@(cdone -> False) = 
  (InsertFromDest (cix b)):edit a (incr b)
edit a b =  
  let nomatch = argmin movescost 
                (ReplaceSrcWithDest (cix a) (cix b):edit (incr a) (incr b))
                (RemoveFromSrc (cix a):edit (incr a) b) 
  in case cval a == cval b of
      True -> argmin movescost nomatch (edit (incr a) (incr b))
      False -> nomatch 
The helper used for finding minimum according to a cost model.
argmin :: (a -> Int) -> a -> a -> a
argmin f a a' = if (f a) < (f a') then a else a'
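For contrast with the cursor-based walk above, here is a standard memoized Levenshtein sketch in Python (the name `edit_distance` is assumed; note this version also tries insertion at every step, which the cursor version only does once the source is exhausted):

```python
from functools import lru_cache

def edit_distance(src, dst):
    # go(i, j): cheapest way to turn src[i:] into dst[j:]; the indices i, j
    # play the role of the two cursors.
    @lru_cache(maxsize=None)
    def go(i, j):
        if i == len(src): return len(dst) - j   # insert the rest of dst
        if j == len(dst): return len(src) - i   # remove the rest of src
        best = min(1 + go(i + 1, j + 1),        # replace src[i] with dst[j]
                   1 + go(i + 1, j),            # remove src[i]
                   1 + go(i, j + 1))            # insert dst[j]
        if src[i] == dst[j]:
            best = min(best, go(i + 1, j + 1))  # free match
        return best
    return go(0, 0)

print(edit_distance("kitten", "sitting"))  # 3
```

The lru_cache is exactly the "store the overlapping computations" step the post deliberately skips.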

§ Evolution of bee colonies

This kind of culture that beehives have is called as 'eusociality'. I'm interested in this because I want to understand what alien societies might look like, and what selection pressures are required to have bee-like societies evolve. Many sci-fi books (Ender's game for example) depict aliens with such a society, but tend to be hazy on how this state of affairs came to be.

§ Best practices for array indexing

These are rules I'm going to follow when I solve problems on codeforces. I've arrived at these rules by repeatedly noticing the kinds of mistakes I make and attempting to adopt conventions to eliminate these.

§ Rule 1: Half-open intervals

Intervals are only represented as [begin, past-the-end). That is, we start at begin, and include the numbers begin+1, begin+2, ..., up to, but excluding, past-the-end. So, for example, [0, 3) = [0, 1, 2], while [0, 1) = [0], and [0, 0) = []. I am not allowed to use [begin, end], (before-begin, past-the-end), or any such variation to represent and think about intervals.

§ Rule 2: No relational operators when using for/while/do while loops.

When I write my loop conditions, I am not allowed to use relational operators. I must only write i != n or i == m. I am not allowed to write i < n, i <= n, i > n, i >= n for my for loops.

§ Rule 3: The loop iterator lives in "getting to" space:

The loop iterator i in for(int i = a; i != b; ++i) is to be thought of as "getting to"/"living right before" the values [a, a+1, a+2, ... b-1]. In ASCII-art:
|| a-1 || a   || a+1 || ... || b-1 ||  b  || b+1 ||
       ^^     ^^     ^^            ^^
     (i=a) (i=a+1)  (i=a+2) ...   (i=b)

§ Rule 4: One always reads loops according to the above rules

for(int i = begin; i != past-the-end; ++i) {
  // NO: i from begin to past-the-end.
  // YES: [i _getting to_ the beginning] till [i _getting_ past-the-end]
}

§ The rationale for banning relational operators

There is a strong difference in qualia between i < n and i != n. The former makes one aware of when the loop runs; the latter, of when the loop quits. I wish to be cognizant of precisely when a loop quits. On writing i != past-the-end, I know that we quit as soon as we get past the end. This feels much clearer than being aware that the loop runs as long as i < n.

§ half-open indexing: length <-> index

The first advantage of these conventions is that [begin, past-the-end) is the same as [begin, begin+length). I've found this to be of great help to flit between length-based-thoughts and indexing-based-thoughts.

§ half-open indexing: 0 length

The second almost trivial point is that [begin, begin+length) holds when length is zero. That is,
for(int i = begin; i != begin + 0; ++ i) { ... }
does the right thing and iterates over no elements.

§ half-open indexing: -ve length

The third neat point is that [begin, begin+length) holds even when length is negative. Hence, if we want to index the last two elements of an array, we recall that we start from index (n-1), and we want a segment of length -2 from there, so our loop is:
for(int i = n-1; i != (n-1) + -2; i--) {
 // loops over the correct segment.
}

§ half-open indexing: splitting an array

Finally, this convention helps when attempting to take the beginning part of a list and the rest of the list. If we want to split an array into two segments at some index len, the first segment is [0, len) and the second is [len, n): the past-the-end of the first segment is the begin of the second. Notice how we use both the "length view" and the "past the end" view to quickly derive the bounds of the second array from the first array.

§ half-open indexing: uniformly generate power-of-2 intervals.

If we want intervals of length 2^n: 1, 2, 4, 8, ..., which is common if one is building data structures such as segment trees and fenwick trees, then in a half-open representation this literally becomes [a, a+1), [a, a+2), [a, a+4), and so on. On the other hand, if one wants to use closed-interval based indexing, one needs to generate the series 2^n − 1, which is [a, a+0], [a, a+1], [a, a+3], which is slightly more finicky.

§ How to deal with strides

If there's a loop
for(int i = 0; i < 5; i += 2) { ... }
the naive i != 5 translation will not work since i only takes on the even values 0, 2, 4, ... and never equals 5. For this, we can perform a simple transformation and always make sure our loop variables increment by 1. In a compiler, this is often called "canonicalizing the loop induction variable". In LLVM, the canonicalization is performed by the -indvars pass. The above example, canonicalized, becomes:
// btc = backedge taken count: how many times we have
// gone past the "back edge" of the loop to return to
// the beginning of the loop.
// trip count = ceil(5 / 2) = (5 + 2 - 1) / 2 = 3.
for(int btc = 0; btc != (5 + 2 - 1) / 2; btc += 1) { const int i = btc * 2; }
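A quick Python check (illustrative only) that the canonicalized loop visits the same values as the strided one; note the trip count must be the ceiling ⌈5/2⌉ = 3, not the floor:

```python
# the strided loop, as written
naive = []
i = 0
while i < 5:
    naive.append(i)
    i += 2

# the canonicalized loop: btc (backedge-taken count) steps by 1
canonical = []
trip_count = (5 + 2 - 1) // 2   # ceil(5 / 2) = 3 iterations
for btc in range(trip_count):
    canonical.append(btc * 2)

print(naive, canonical)  # [0, 2, 4] [0, 2, 4]
```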

§ In conclusion

These are rules that I dreamed up after noticing my idiosyncrasies in loop-writing.

§ Algebraic structure for vector clocks

When I update my own time, i.e., take union(time(me), time(me)), I get an element one step up the lattice. When I union with someone else, I get the max. So we have an algebraic structure (L, ≤, next : L → L) where next is monotone with respect to (L, ≤). The induced union operator ∪ : L × L → L is:
x ∪ y ≡ next(x)    if x = y
        max(x, y)  if x ≠ y
This also shows up in the case of the "Gallager-Humblet-Spira Algorithm" and the fragment-name-union rule.
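A minimal Python sketch of this structure (the names `tick` and `union` are assumed): each process bumps its own entry for next, and merging with another clock takes the pointwise max:

```python
# vector clocks as dicts from process name to counter
def tick(clock, me):
    # `next`: one step up the lattice in my own coordinate
    c = dict(clock)
    c[me] = c.get(me, 0) + 1
    return c

def union(a, b):
    # merging two clocks: pointwise max
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in set(a) | set(b)}

a = tick({}, "p")       # {'p': 1}
a = tick(a, "p")        # {'p': 2}
b = {"p": 1, "q": 5}
print(union(a, b))      # {'p': 2, 'q': 5}
```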

§ Networks are now faster than disks

I learnt of this counter-intuitive fact first from this usenix article on NoSQL. On checking up, it seems to actually be true. Jeff Dean's keynote at LADIS 2009 reports these numbers: On the other hand, a commenter on this discussion at serverfault mentioned that SSDs might be much faster, and the numbers bear out:

§ Einstein-de Haas effect

I learnt of this from hacker news. This is some crazy experiment that shows that the 'quantum angular momentum' (spin) and the 'classical angular momentum' need to be conserved together for physics to work out:
There's an experiment that transfers that angular momentum all the way up to macroscopic levels. By magnetizing a cylinder of iron, all the spins start pointing in the same direction. By conservation of angular momentum, the cylinder itself has to start spinning in the opposite direction. I'm very fond of this experiment, because it magnifies a strange quantum phenomenon to the classical level.
So, my understanding of the experiment is:

§ References

§ Rank-select as adjunction

We will introduce two operations, rank and select; these are used to build memory-efficient data structures that can still be queried quickly. We will show how rank and select are adjoint. This will also let us consider coselect, a variation of select that is not commonly discussed.
{-# LANGUAGE GeneralizedNewtypeDeriving #-}

import Data.List
newtype Count = Count Int deriving(Eq, Show, Ord, Num)
newtype Ix = Ix Int deriving(Eq, Show, Ord, Num)

eqscan :: Eq a => a -> [a] -> [Count]
eqscan a0 as = map Count $ scanl (+) 0 [if a0 == a then 1 else 0 | a <- as]

-- | finds the index `Ix` in `as` at which `a` has occured
-- | for the `Count`th time.
select :: Eq a => a -> [a] -> Count -> Maybe Ix
select a0 as c = Ix <$> findIndex (== c) (eqscan a0 as)

-- | finds the number of times `a` has occured in `as` till `Ix`.
rank :: Eq a => a -> [a] -> Ix -> Count
rank a as (Ix ix) = (eqscan a as) !! ix
Given this, we can prove that select and rank form an adjunction. There are different ways to think about this. My favourite is to notice that an Ix (index) is much "finer" information than a Count. Hence, the concrete domain must be Ix, the abstract domain must be Count, and rank-select is an abstract interpretation!

§ Co-select

We can build another version of select called co-select that scans from the right. Alternatively, it finds on the reverse list:
coselect :: Eq a => a -> [a] -> Count -> Maybe Ix
coselect a0 as c = Ix <$> findIndex (== c) (reverse (eqscan a0 as))
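The same three functions translate directly to Python; a sketch (a restatement of the Haskell above, not part of the original):

```python
def eqscan(a0, xs):
    # running count of occurrences of a0, with a leading 0 (like scanl)
    counts, c = [0], 0
    for x in xs:
        c += (x == a0)
        counts.append(c)
    return counts

def select(a0, xs, c):
    # first index at which a0 has occurred c times, or None
    scan = eqscan(a0, xs)
    return next((i for i, k in enumerate(scan) if k == c), None)

def rank(a0, xs, ix):
    # number of times a0 occurs in xs before index ix
    return eqscan(a0, xs)[ix]

xs = "abracadabra"
print(rank("a", xs, 4))    # 2: 'a' occurs twice in "abra"
print(select("a", xs, 2))  # 4
```

Note the adjunction-like round trip: select(a, xs, rank(a, xs, ix)) lands at or before ix, and rank of a select result gives back the count.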
Thanks to Edward Kmett for teaching me this.

§ Bounding chains: uniformly sample colorings

We wish to uniformly sample k-colorings of a graph G with maximum degree Δ. Hence, we require k ≥ Δ + 1. To perform this sampling, we use MCMC to sample from a markov chain whose states are k-colorings of G, and whose stationary distribution is the uniform distribution over valid colorings. The issue with MCMC techniques is that we never know when our chain has reached the stationary state. To ensure we receive uniformly distributed samples, we build a "bounding chain" W which has the following properties: We will in fact run the W chain, and prove that running the W chain is equivalent to running a 'shadow' of the X chain, and that stationary states of the W chain correspond to stationary states of the X chain. Let X be the chain whose states are valid k-colorings of G. In one step of the chain X, we choose a vertex v uniformly at random; we then choose a color cv uniformly at random for v that keeps the coloring proper. The vertex v is changed to this new color cv. This is a symmetric proposal distribution; hence this chain has the uniform distribution over k-colorings as its stationary distribution. Sampling exactly from this chain is hard: constructing an initial state X0 amounts to finding some valid k-coloring, which in itself might be challenging. Worse, we do not know whether the chain X has reached a stationary state or not.

§ Bounding Chain

We construct a new chain W (the bounding chain of X), whose states are sets of colors for vertices in G. Formally, the states of W are functions Vert(G) → 2C where Vert(G) denotes the vertices of G; C the set of colors. The transition will be to pick a vertex v uniformly at random. Then, pick a new set of legal colors Cv for v, such that: We describe the transition function next. But first, we need an alternate lens on the transitions of X that is amenable to massaging.

§ Equivalent description of the transitions of X:

  1. Choosing a color uniformly at random from the set of valid colors for a vertex.
  2. Choosing colors from C without replacement until we get a color that is a valid color.
We claim that (1) and (2) have the same probability distribution. Abstracting slightly, we state:
  1. The probability of choosing an element t ∈ T uniformly from T ⊆ S. This has probability 1/|T|.
  2. The probability of choosing a particular element t ∈ T, by picking elements from S without replacement until we get some element in T.
(1) and (2) have the same probability distribution.

§ Proof by induction:

Process (2) in code is the following:
def process(S, T):
  assert(issubset(T, S)) # precondition
  s = np.random.choice(S) # uniform choice over S.
  if s in T: return s # |T|/|S| prob. to enter `if`.
  else: # (1 - |T|/|S|) prob. to enter `else`.
    Snext = S.remove(s); return process(Snext, T)
We claim that the probability that process(S, T) = t0 for a fixed t0 in T is 1/|T|. We create a new function indicator to express this:
def indicator(t0, S, T):
  assert(t0 in T) # precondition
  assert(issubset(T, S)) # precondition

  return t0 == process(S, T)
Let's push t0 == into the definition of process after inlining process. This gives us:
def indicator(t0, S, T):
  assert(t0 in T) # precondition
  assert(issubset(T, S)) # precondition

  s = np.random.choice(S)
  if s in T:
    # |T|/|S| prob. for s in T
    return t0 == s # 1/|T| prob. for t0 == s
  else:
    # (1 - |T|/|S|) prob. to reach the else branch.
    Snext = S.remove(s); return process(Snext, T)

Now, we write down the recurrence for the probability that we are trying to compute: P(t0, S, T) is the probability that indicator(t0, S, T) returns True. Alternatively, it's the probability that process(S, T) returns t0.
P(t0, S, T) = prob. that process(S, T) = t0
P(t0, T, T) = |T|/|T| * 1/|T| = 1/|T| [base case]
P(t0, S, T) = 1/|S| + (1 - |T|/|S|) * P(t0, |S|-1, T) [induction]
We assume for induction that P(t0, |S|-1, T) = 1/|T|. On substitution into [induction], we get:
P(t0, S, T) 
= 1/|S| + (1 - |T|/|S|) * P(t0, |S|-1, T) [induction]
= 1/|S| + (1 - |T|/|S|) * 1/|T| 
= 1/|S| + 1/|T| - 1/|S|
= 1/|T|
Which is indeed the same probability as (1):
1. Choosing an element uniformly from a subset T = 1/|T|.
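We can also check the claim empirically; a Python Monte Carlo sketch (illustrative, with an assumed concrete S and T):

```python
import random

def process(S, T):
    # pick elements of S without replacement until we land in T
    S = list(S)
    while True:
        s = random.choice(S)
        if s in T:
            return s
        S.remove(s)

random.seed(0)
counts = {1: 0, 2: 0}
for _ in range(20000):
    counts[process({1, 2, 3, 4, 5}, {1, 2})] += 1
print(counts)  # both counts near 10000, i.e. uniform over T
```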

§ Proof by program analysis, Version 1

def process(S, T):
  s = np.random.choice(S)
  if s in T: return s
  else: return process(S.remove(s), T)
Notice that the last return is tail-call. This program can be rewritten as:
def process(S, T):
  while True:
    s = np.random.choice(S)
    if s in T: return s
    S = S.remove(s)
As previously, we create an indicator function and study its behaviour:
def indicator(t0, S, T):
  assert(t0 in T) # precondition
  assert(issubset(T, S)) # precondition

  while True:
    s = np.random.choice(S)
    if s in T: return t0 == s
    S = S.remove(s)
We know that this program only returns a value from the line if s in T: return t0 == s. We now compute P(process(S, T) == t0). Whatever the return value of indicator, we know that it occurred within the if condition. We can use this to compute:
P(indicator(t0, S, T) = True)
 = P(t0 == s | s in T) [program only returns in this case]
 = 1/|T|

§ Proof by program analysis, Version 2

Alternatively, we can also analyze this as we did in the first proof, using the rule:
def f():
  return if cond1 then cond2 else cond3
will have probability:
P(cond1) * P(cond2|cond1) + P(not cond1) * P(cond3|not cond1)
Using this, we can analyze indicator as:
P(indicator(t0, S, T) = True)
 = P(s in T) * P(t0 == s |s in T) + 
    P(s not in T) * P(indicator(t0, S.remove(s), T) | s not in T)
 = |T|/|S| * 1/|T| + 
    (|S|-|T|)/|S| * P(indicator(t0, S.remove(s), T))
 = 1/|S| + (|S|-|T|)/|S| * 1/|T| [by induction]
 = 1/|T|

§ Sampling using the above definition

Recall that State(X) ≡ V → C, State(W) ≡ V → 2^C. Let x[·], w[·] be the states of some run of the markov chains. We start by taking x[0] to be any valid k-coloring of G, and w[0] to be the state where all vertices have all possible colors: w[0] ≡ _ ↦ C. Clearly, x[0] ∈ w[0]. By induction we assume that x[n−1] ∈ w[n−1]. We must now calculate w[n], x[n] such that (1) x[n] ∈ w[n], (2) x[n]'s proposal is symmetric, (3) w[n]'s proposal is symmetric.

§ Occluded set O

Define O ⊆ C (for occluded) to be the set of colors that might possibly be blocked for v from our view of w. Note that this is an over-approximation: there may be colors that are not blocked for v which we believe to be blocked from w's point of view.
O ≡ { c ∈ C : ∃α. (v, α) ∈ E ∧ c ∈ w[n−1](α) }

§ Allowed set A

Define A ⊆ C (for allowed) to be C ∖ O. Note that A is an under-approximation, since O was an over-approximation. That is:

§ S: the sequence of states for transition

Now, we pick elements of C in sequence till we get an element of A. Call this sequence S. We will pick at most Δ + 1 elements for S, since the max-degree of the graph is Δ.

§ Transition

Let i be the first index in S where we get a color that is truly legal for v in x[n]. Note that such an index will always exist: We pick elements into S till we get an element in A, and elements of A are always legal. However, there can be elements which are not in A that are still legal for v in x[n], since A is an under-approximation. By the lemma proven, we know that this process of picking colors C in a sequence till we get a color that is legal for v at index i is the same as picking uniformly at random from the set of colors that are legal for v.

§ An example

For example, we could have:
X | p:2 --- q:4
W | p:{1, 2} --- q:{4, 5}

we are sampling q:

O = {1, 2}
A = {3, 4, 5}
S = [2, 1, 3]

X | p:2 -- q:1 
W | p:{1, 2} -- q:{1, 2, 3}
If we analyze S = [2, 1, 3] we notice that:
2: invalid for W(p:[1, 2]), invalid for X(p:2)    
1: invalid for W, valid for X  
3: valid for W, valid for X
So, what we are really sampling is: to transition X, we can safely stop as soon as we see a color that is legal for X. However, to transition W correctly, we generate more elements in the sequence, till we hit a "safe" element (one in A).

§ An optimisation on picking colors: why we need a blocking set

Define the set BC (for blocked) which governs which values x[n] surely cannot take from our view of w[n−1]. Note that B is an under-approximation. v might have more colors that are blocked than what w sees.
B ≡ { c ∈ C : ∃α. (v, α) ∈ E ∧ w[n−1](α) = {c} }
Rather than sampling colors from C till we get an element of A, we can sample colors from C ∖ B. We know that the colors in B can never be used by X, since the colors in B are those that we know are blocked for sure. This is used in the theoretical analysis in the paper.
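The occluded, allowed, and blocked sets for the running example can be computed with a small Python sketch (function names assumed, mirroring the pseudocode below):

```python
# the running example: edge (p, q), colors {1..5}, w = {p: {1,2}, q: {4,5}}
colors = {1, 2, 3, 4, 5}
w = {"p": {1, 2}, "q": {4, 5}}
edges = [("p", "q")]

def neighbours(v):
    return [b if a == v else a for a, b in edges if v in (a, b)]

def occluded(v):
    # colors some neighbour of v *might* be using (over-approximation)
    return set().union(*[w[u] for u in neighbours(v)], set())

def blocked(v):
    # colors a neighbour is *surely* using, i.e. pinned to one color
    return set().union(*[w[u] for u in neighbours(v) if len(w[u]) == 1], set())

print(occluded("q"))           # {1, 2}
print(colors - occluded("q"))  # allowed A = {3, 4, 5}
print(blocked("q"))            # set(): p is not yet pinned to one color
```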

§ Termination

We terminate when W has "trapped" X; that is, |w[n](v)| = 1 for all v ∈ V. In such a case, the state of W determines the state of X. This is a coalescence (as it occurs in coupling from the past). From coupling-from-the-past theory, we know that we have reached a stationary state of X when this happens.

§ Pseudocode

-- | colors, vertices, edges
cs :: [C]; cs = ...
vs :: [V]; vs = ...
es :: [(V, V)]; es = ...

-- | definitely not allowed
blocked :: (V -> [C]) -> (V -> [C])
blocked v2cs v0 = concat [v2cs w | (v, w) <- es, v == v0, length (v2cs w) == 1]

-- | maybe not allowed
occluded :: (V -> [C]) -> (V -> [C])
occluded v2cs v0 = concat [v2cs w | (v, w) <- es, v == v0]

-- | definitely allowed from Ws point of view
allowed :: (V -> [C]) -> (V -> [C])
allowed v2cs v = cs \\ occluded v2cs v

-- | perturb the color function to another function
perturb :: (V -> [C]) -> Rand (V -> [C])
perturb v2cs = do
  randv <- uniform_rand vs
  randc_for_v <- uniform_rand (allowed v2cs randv)
  return $ \v -> if v == randv then [randc_for_v] else v2cs v

-- | check if we are done
terminated :: (V -> [C]) -> Bool
terminated v2cs = and [length (v2cs v) == 1 | v <- vs]

-- | generate chain
chain :: (V -> [C]) -> Rand [V -> [C]]
chain f = do
  if terminated f 
  then return [f] else do 
    f' <- perturb f; fs <- chain f'; return (f:fs)

-- | return a sample
sample :: (V -> [C]) -> Rand (V -> [C])
sample = fmap last . chain

§ References

§ Coupling from the past

§ Relationship between CFTP and reset transitions

§ References

§ Word problems in Russia and America

A scathing critique of how American math education is screwed; also, an interesting anecdote about how looking for 'reality' in mathematical problems may in fact break students' ability to think in the abstract! This is a deep insight.

§ Encoding mathematical hierarchies

I've wanted to learn how the SageMATH system is organized when it comes to math hierarchies. I also wish to learn how lean4 encodes its hierarchies. I know how mathematical components does it. This might help narrow in on what the "goldilocks zone" is for typeclass design.

§ References

§ Learning code by hearing it

I learnt of this from an amazing discussion on HackerNews, where a sighted programmer, who is going blind, asked the community if he could remain a software engineer. An answer by kolanos read:
You can definitely continue as a software engineer. I'm living proof. ... For example, as you get better with a screen reader, you'll be bumping the speech rate up to 1.75-2X normal speech. ... Typos will be easily spotted as they just won't "sound right". It will be like listening to a familiar song and then hitting an off note in the melody. And this includes code. Also, because code is no longer represented visually as blocks, you'll find you're building an increasingly detailed memory model of your code. Sighted people do this, too, but they tend to visualize in their mind. When you abandon this two dimensional representation, your non-visual mental map suffers no spatial limits. You'll be amazed how good your memory will get without the crutch of sight.
I find this incredibly fascinating. I think I'm going to try this: I'll listen to lectures on 1.5-2x anyway, so this may work just as well. I'm planning on trying this with the MLIR codebase, which I want to get to know intimately. I'll write updates on how this goes. Drew DeVault also posted links to tools/scripts for programming without needing visual feedback: I also learnt about emacspeak, which supposedly works well for an audio-based-workflow.

§ Your arm can be a spinor

I tend to forget the name of this trick. It exhibits spinors in real life: a system that needs to rotate by 720 degrees to return to its original state, versus the usual 360 that we are generally used to. We need to consider our entire arm + the cup we are holding as a single system for this to work.

§ Self modifying code for function calls: Look ma, I don't need a stack!

If one does not have recursive calls, one can eliminate the need to push return addresses on a call stack by writing self-modifying code --- I learnt of this from TAOCP, volume 1. Knuth shows this off once he introduces MIXAL, his fantasy assembly language in which TAOCP programs are written. I'll explain the usual way one performs call-return, then explain the nifty self-modifying-code way. I think this is the cleanest, most accessible example of self-modifying code that I know.

§ The traditional solution for call/ret

We wish to have function f call function g. For g to be able to return control to f, f pushes a return address onto the call stack, which g pops and jmps to:
* f's body
L0:   push addr(L1) ; location of instr to be run after the call
      jmp START_G
L1:   <code after call>
* g's body
      retloc = pop ; pop location to jump to
RETG: jmp retloc
Rather than pushing and popping, we can rewrite the code of g to change retloc before a call to g. In made-up pseudocode, here's what that would look like:

§ The jump based solution

* f's body
L0    store loc(RETG) assemble(jmp addr(L1))
      jmp START_G
L1:   <code after call>
* g's body
RETG: <###-to-be-filled-dummy-###>
Instead of having a call stack, before f calls g, f modifies g's code at location RETG into a jmp instruction, by storing the instruction jmp addr(L1). This effectively creates a 'custom' g that knows how to return control flow to f.
* g's body (after execution of L0)
RETG: jump addr(L1)
This way, we have obviated the need for a push/pop sequence, by directly modifying g's code. This is really neat --- we replace the overhead of a push/pop with a single store.

§ Why recursion breaks.

We don't actually need a call stack, as long as we don't want to write recursive functions. We can't have recursion, or more generally "re-entrance": consider a call chain in which g is entered a second time before its first invocation has returned. The second entry overwrites the return address stored at RETG, destroying the location the first invocation needed to return to.

§ Adjunctions as advice

An adjunction F ⊣ U allows us to go from F a -> x to a -> U x. We can look at this as shifting the "before-advice" F on the input to an "after-advice" U on the output. To remember this, we can write it as:
F a <--F--- a
|           |
v           v
x ----U--> U x
Where F is for Free and U is for forgetfUl. Recall that if F is free and U is forgetful, then U(F(x)) = x, since adding more structure through the free functor and then stripping it away gives us the object back. We can prove that if U(F(x)) = x and U, F are functors, then we have a function fwd: (f a -> x) -> (a -> u x) as:
f      :: (f a -> x) 
fmap   :: (p   -> q) -> (u p     -> u q)
fmap   :: (p   -> q) -> (u ( p ) -> u q)
fmap f :: (f a -> x) -> (u (f a) -> u x) 
fmap f :: (f a -> x) -> (a       -> u x) [using u (f a) = a]

§ References

§ Reversible computation as groups on programs

If we consider a language like Janus, where every program is reversible, we then get a group structure on programs: the identity is the program that computes nothing at all, and the inverse performs the reverse operation. Alternatively, one can use the trick from quantum mechanics of using ancillary qubits to build reversible classical gates. The question is: do either of these approaches allow for better-than-STOKE exploration of the program space? I.e., can we somehow exploit the discrete group structure (in the case of Janus) or the Lie group structure of the unitary group (as in the QM case) to find programs in far quicker ways?

§ Blazing fast math rendering on the web

So, I've shifted the blog to be static-site-generated, using a static-site-generator written by yours truly. The code clocks in at around a thousand lines of C++. What did I gain?
  1. My generator is a real compiler, so I get errors on math and markdown malformation.
  2. I can write math that loads instantly on your browser, using no MathJax, KaTeX or any client side processing, nor the need to fetch images, which looks like this:
h(x) ≡ f(x) + g(x)   when x > 0

§ Why?

My blog is a single 9000 line markdown file, rendered as a single HTML page, so I need it to compile fast, render fast, render beautiful. Existing tools compromise on one or the other.

§ No seriously, why a single markdown file?

I need a single file to edit, so I can rapidly jot down new ideas. This is the essence of why I'm able to log most of what I study: because it's seamless. Far more importantly, it provides spatio-temporal locality. I add things to the blog in chronological order, as I learn them. If I need to recall something I had studied, I go to that location in the blog based on a sense of when. When I do get to a location I want, the scrollbar gives me a sense of where I am in the file. This is important to me, since it helps me reason spatially about what I know and what I've learnt. It's something I love about books, and deeply miss when navigating the web. I'm determined to keep this spatio-temporal locality on my little slice of the internet.

§ Why is this awful?

As elegant as this model is to edit, it's awful for browsers to render. The file used to take on the order of minutes for all the math to finish rendering. MathJax (and KaTeX) painfully attempt to render each math block. As they do, the page jumps around until everything has settled. As this is happening, your CPU throttles, your lap or hand gets warm, and the page is stuck. Clearly not great UX. I still want math. What do I do? The solution is easy: Approximate the math rendering using ASCII/UTF-8 characters! There are tools that do this --- hevea is one of them. Unfortunately, there is no markdown-based-blogging-platform that uses this, so I had to write my own.

§ The cure

The solution is easy. I wrote the tool. The page you're reading is rendered using the tool. All the math renders in under a second, because it's nothing crazy: it's just text and tables, which browsers know how to render. No JavaScript necessary. Snappy performance. Whoo!

§ The details: Writing my own Markdown to HTML transpiler.

The final transpiler clocks in at 1300 LoC of C++, which is very small for a feature-complete markdown-to-HTML piece of code that's blazing fast, renders math correctly, and provides error messages.

§ Quirks fixed, features gained.

I got quite a bit "for free" as I wrote this, fixing mild annoyances and larger pain points around using github + markdown for publishing on the web:

§ Choice of language

I chose to write this in C-style C++, primarily because I wanted the tool to be fast, and I'd missed writing C++ for a while. I really enjoy how stupid-simple C-style C++ turns out to be: the C++ papers over some of C's annoyances (like formatted output for custom types), while still preserving the KISS feeling of writing C. Why not Rust? I freely admit that Rust might have been a sane choice as well. Unfortunately, asking Rust to treat a UTF-8 string as a "ball of bytes" is hard, when it's stupidly easy with C. Plus, I wanted to use arena-style allocation, where I make huge allocations in one go and then don't think about memory, something that I don't have control over in Rust. I don't have any segfaults (yet, perhaps), thanks to UBSAN and ASAN. I find Rust to have more impedance than C on small applications, and this was indeed small.

§ Performance

Everything except the LaTeX-to-HTML conversion is blazing fast. Unfortunately, calling hevea is slow, so I implemented a caching mechanism to make using hevea not-slow. hevea does not have an API, so I need to fork and talk to its process, which is understandably slow. I built a "key-value store" (read: serialize data into a file) with the stupidly simple approach of writing an append-only log into a file. hevea is conceptually a pure function, since on providing the same LaTeX input it's going to produce the same HTML output, so it's perfectly safe to cache it:
// (ll and KEEP are defined earlier in the full source.)
const char DB_PATH[]="./blogcache.txt";
unordered_map<ll, const char *> G_DB;
void loadDB() {
    G_DB = {};
    FILE *f = fopen(DB_PATH, "rb");
    if (f == nullptr) { return; } // no cache file yet.
    while (!feof(f)) {
        ll k, len;
        fread(&k, sizeof(ll), 1, f); if (feof(f)) break;
        fread(&len, sizeof(ll), 1, f);
        char *buf = (char *)calloc(sizeof(char), len + 2);
        fread(buf, sizeof(char), len, f);
        G_DB.insert(make_pair(k, buf));
    }
    fclose(f);
}

const char *lookup_key(ll k) {
    unordered_map<ll, const char *>::iterator it = G_DB.find(k);
    if (it == G_DB.end()) { return nullptr; }
    return it->second;
}

void store_key_value(const ll k, KEEP const char *v, const ll len) {
    assert(G_DB.count(k) == 0);
    G_DB.insert(make_pair(k, strdup(v)));

    FILE *f = fopen(DB_PATH, "ab");
    assert(f != nullptr && "unable to open DB file");
    fwrite(&k, sizeof(ll), 1, f);
    fwrite(&len, sizeof(ll), 1, f);
    fwrite(v, sizeof(char), len, f);
    fclose(f);
}

§ For the future

I plan to rip out hevea and write my own latex -> HTML converter for the subset of LaTeX I actually use. hevea's strength is its downfall: It can handle all of LaTeX, which means it's really slow. If I can concentrate on a small subset, I don't need to play caching tricks, and I can likely optimise the layout further for my use-cases. I also want colored error messages, because who doesn't? I'll probably gradually improve my static site generator over time. Once it's at a level of polish where I'm happy with it, I'll spin it out as a separate project.

§ Conclusions

Am I glad I did it? Yes, purely because my chunk of the internet aligns with how I want it to be, and that makes me ε more happy. I think of it as an investment in future me, since I can extend the markdown and the transpiler in the way I want.

§ VC dimension

Consider a ground set X. Let the space of all possible binary classifications be the function space C ≡ { f : X → {±1} }. Now, a hypothesis class H is a subset of C. For example, some model such as "return +1 if a point is inside a region, −1 otherwise" gives a subset of the full class C. The VC dimension of H measures how good H is at generating different classifications. We first need the notion of shattering to define this. A subset S ⊆ X of the ground set, with elements S = {s1, s2, …, sn}, is shattered by a hypothesis class H if the function actS has full range, where actS is defined as:
actS: H → {±1}^|S|;  actS(h) ≡ (h(s1), h(s2), …, h(sn))
That is, the hypothesis class H can produce every possible labelling of S. Now, the VC dimension of the hypothesis class H of a ground set X is the size of the largest possible S ⊆ X such that S is shattered by H.

§ Correct interpretation

§ Subtlety 1:

We can have the case where some set of size 3 is not shattered by H, while some larger set of size 5 is shattered by H. In this case, the VC dimension of H is 5, not 3: the VC dimension asks for the largest set that is shattered, not that every smaller set be shattered.

§ Subtlety 2:

We cannot have the case where no set of size 4 is shattered by H while some set of size 5 is shattered by H. For contradiction, let S be the set of size 5 that is shattered by H. Let T ⊆ S, |T| = 4. Now, H shatters T since H shatters S. Hence, some set of size 4 has been shattered. Contradiction, since we assumed that no set of size 4 is shattered by H. So, to prove that sets of size (≥ n) cannot be shattered, it suffices to prove that sets of size exactly n cannot be shattered.

§ Growth of number of sets shattered in |S| for S ⊆ X for a fixed H.

If we fix a hypothesis class H for X, and we want to understand how the number of labellings produced by H varies over subsets S ⊆ X, the idea is this: let S be a maximum-sized set that is shattered by H, i.e., |S| = VCdim(H) and H shatters S. On sets of size at most |S|, the number of labellings can grow exponentially (2^|S| on S itself); on larger sets, it grows only polynomially. One can show this exponential/polynomial behaviour happens in general for S ⊆ X (this is the content of the Sauer–Shelah lemma).

§ Symplectic version of classical mechanics

§ Basics, symplectic mechanics as inverting ω:

I've never seen this kind of "inverting ω" perspective written down anywhere. Most treatments start by using the interior product iX ω without ever showing where the thing came from. This is my personal interpretation of how the symplectic version of classical mechanics comes to be. We start with a non-degenerate, closed two-form ω: TpM × TpM → ℝ. Now, given a hamiltonian H: M → ℝ, we can construct a vector field XH: M → TM under the definition:
ω2: TpM → Tp*M;  ω2 ≡ λ (v: TpM). λ (w: TpM). ω(v, w)   [partially apply ω to see it as a map from TpM to Tp*M]
ω2⁻¹: Tp*M → TpM
dH: M → Tp*M, so dH(p): Tp*M and ω2⁻¹(dH(p)): TpM
XH ≡ λ (p: M). ω2⁻¹(dH(p));  that is, XH = ω2⁻¹ ∘ dH
This way, given a hamiltonian H: M → ℝ, we can construct an associated vector field XH in a pretty natural way. We can also go the other way: given the vector field XH, we can recover dH (and hence H) under the equivalence:
ω2⁻¹ ∘ dH = XH
dH = ω2(XH)
H = ∫ ω2(XH)
This needs some demands, like the one-form dH being integrable. But this works, and gives us a bijection between XH and H as we wanted. We can also analyze the definition we got from the previous manipulation:
ω2(XH) = dH
λ (w: TpM). ω(XH, w) = dH
ω(XH, ·) = dH
We can take this as a relationship between XH and dH. Exploiting this, we can notice that dH(XH) = 0. That is, moving along XH does not modify H:
dH(XH) = ω(XH, XH) = 0   [ω is anti-symmetric]

§ Preservation of ω

We wish to show that the flow of XH preserves ω; that is, pushing ω forward along the flow of XH gives back ω. TODO.

§ Moment Map

Now that we have a method of going from a vector field XH to a Hamiltonian H, we can go crazier with this. We can generate vector fields using Lie group actions on the manifold, and then look for hamiltonians corresponding to this Lie group. This lets us perform an "inverse Noether": for a given choice of symmetry, we can find the Hamiltonian that possesses this symmetry. Writing φ for the action of G on M, we can create, from each Lie algebra element g of G, a vector field X_g, by performing:
t : ℝ ↦ e^{tg} : G
t : ℝ ↦ φ(e^{tg}) : M
X_g ≡ d/dt (φ(e^{tg}))|_{t=0} : TM
We can then attempt to recover a hamiltonian H_g from X_g. If we get a hamiltonian from this process, then it will have the right symmetries.

§ Theorems for free

These are personal notes I made of a custom notation for denoting the relations from the theorems-for-free paper. I developed the notation since I wanted to keep track of what types are floating around and what the relations are doing. We interpret types as sets. If elements belong to the relation, i.e., if (a, a') ∈ R ⊆ AxA', we will denote this as a[A -R- A']a'. We will now write down some inference rules:
  1. We define ReflB as a[Bool ReflB Bool]a
  2. We define ReflI as i[Int ReflI Int]i
  3. The product of two relations [A R B], [X S Y] is called RxS, and is defined as: (a,x)[AxX RxS BxY](b,y) iff a[A R B]b ∧ x[X S Y]y.
  4. The list space of [A R B] is called [A* [A R B] B*], and is defined as: la[A* [A R B] B*]lb iff |la| = |lb| ∧ (∀ i, la[i][A R B]lb[i]).
  5. The function space of two relations [A R B], [X S Y] is called [A->X R->S B->Y], and is defined as: f[A->X R->S B->Y]g iff ∀ a b, a[A R B]b => f(a)[X S Y]g(b).
  6. The type family space of two relations is a function that takes a relation [A R B] and produces a new relation: g[FA | [A R B] | FB]h. The relation takes as parameter a relation [A R B] for each choice of A and B.
  7. The space of relations of ∀X.F(X) is a relation defined by: g[A->FA | ∀X.F(X) | B->FB]h iff ∀ A B R, (g A)[FA | [A R B] | FB](h B).

§ Parametricity theorem

The parametricity theorem states that for every term (r: R), we can deduce r[R rel(R) R]r, where rel(R) is the relation that fits the type, derived from the above rules.

§ Parametricity for lists when the relation is a function:

Recall that the list space of [A R B] is called [A* [A R B] B*], and is defined as: la[A* [A R B] B*]lb iff |la| = |lb| ∧ (∀ i, la[i][A R B]lb[i]). Now, let us take the special case where the relation [A R B] is (the graph of) a function δ: A -> B; that is, a[A R B]b iff δ(a) = b. If this is the case, then the condition simplifies to: map δ la = lb.

§ Parametricity to prove rearrangements

§ References

§ How to reason with half-open intervals

I've always found code that uses half-open intervals far harder to write than code using closed intervals. For example, when performing string processing, I prefer to write the closed form over the half-open form, since I find it easier to think about:
void closed(int begin, int end, char *c) { // easier
  for(int i = begin; i <= end; ++i) { ... }
}

void open(int begin, int len, char *c) { // harder
  for(int i = begin; i < begin + len; ++i) { ... }
}
However, I realised that by changing how I think about this to:
void open(int begin, int len, char *c) { // suddenly easier
  for(int i = begin; i != begin + len; ++i) { ... }
}
It somehow made it way easier to grok. Of course, this is old hat to anyone who performs loop optimisation: LLVM internally converts most comparisons into the a != b form, because it's easier to analyse. It took me this long to realize that it's easier for me to think in this viewpoint as well.

§ How does one build a fusion bomb?

I haven't found anything on the internet that describes how to build a fusion bomb; it's almost as if this information has been suppressed by governments. However, I'm curious --- would a physics grad student studying nuclear physics or explosives have the theoretical know-how to precisely build one, given the raw materials? Or is there some "secret sauce" that's necessary? I read on Wikipedia that most countries classify the details:
Detailed knowledge of fission and fusion weapons is classified to some degree in virtually every industrialized nation. In the United States, such knowledge can by default be classified as "Restricted Data", even if it is created by persons who are not government employees or associated with weapons programs, in a legal doctrine known as "born secret".

§ Christoffel symbols, geometrically

Suppose we have a manifold M of dimension d that has been embedded isometrically into ℝn. So we have a function e: ℝd → ℝn which is the embedding. We will identify M with the subspace Im(e). Recall that ∂xi e : ℝd → ℝn is defined as:
 xie : ℝd → ℝn          
[∂ xie](p) ≡
δ x → 0
e(p + (0:0, 1:0…, ix, …, n:0)) − e(p)
δ x
Note that it is a function of type ℝd → ℝn. Given a vector field V ≡ vj ∂xj e on M, we can calculate the derivative of this vector field as follows:
∂xi V = ∂xi ( vj(p) ∂xj e ) = vj · ∂xi ∂xj e + ∂xj e · ∂xi vj
We choose to rewrite the second-degree term in terms of the tangent space, plus some component nij that is normal to the manifold, which we have no control over:
(∂xi ∂xj e)(p) ≡ Γijk ∂xk e + nij
This gives us the Christoffel symbols as the "variation of the second derivative along the manifold".

§ Relationship to the Levi-Civita connection

The covariant derivative defined by the Levi-Civita connection is the derivative that contains the projection of the full derivative in ℝn onto the tangent space TpM. Writing Π for this projection, it is defined by the equations:
∇ei V ≡ Π(∂xi V)
= Π( vj · ∂xi ∂xj e + ∂xj e · ∂xi vj )
= Π( vj · (Γijk ∂xk e + nij) + ∂xj e · ∂xi vj )
= vj · Γijk ∂xk e + ∂xk e · ∂xi vk   [Π kills the normal component nij; relabel j to k in the last term]
= ∂xk e ( vj · Γijk + ∂xi vk )

§ References

§ A natural vector space without an explicit basis

On learning about infinite-dimensional vector spaces, one learns that we need the axiom of choice to assert that every such vector space has a basis; indeed, asserting this is equivalent to the AoC. However, I had not known any "natural" examples of such a vector space till I studied the proof of the Barvinok algorithm. I reproduce the example here. Consider a space such as S ≡ ℝ3. Now, consider the vector space spanned by the indicator functions of polyhedra in S. That's a mouthful, so let's break it down. A polyhedron is defined as a set of points cut out by linear inequalities: P ≡ { x ∈ S : ai · x ≤ bi, i ∈ [1… n] }, for ai ∈ S, bi ∈ ℝ. The indicator functions are of the form:
[poly]: S → ℝ;  [poly](x) ≡ 1 if x ∈ poly, 0 otherwise
We can define a vector space of these functions over ℝ, using pointwise scaling as the action of ℝ on these functions. The vector space V is defined as the span of the indicator functions of all polyhedra. It's clearly a vector space, and a hopefully intuitive one. However, note that the set we generated this from (indicators of polyhedra) doesn't form a basis, since there are many linear dependencies between them. For example, one can write the equation:
*---*   *-*   *-*   *
|###|   |#|   |#|   |
|###| = |#| + |#| - |
|###|   |#|   |#|   |
*---*   *-*   *-*   *

§ Cache oblivious B trees

Central idea: assume a memory model where computation is free; the only cost is pulling data from main memory into cache. The cache has total size M and holds blocks of size B, so it can hold M/B blocks of main memory. Main memory has infinite size. Cost is the number of block transfers. We assume that the algorithm does not know M or B. We assume that the cache replacement strategy is optimal (kick out the block that is going to be used farthest in the future). This is an OK assumption to make, since an LRU cache using twice the memory of an "oracular" cache performs equally well (citation?). These data structures are cool since they essentially adapt to varying cache hierarchies, and even to multi-level cache hierarchies. We study how to build cache-oblivious B-trees.

§ Building optimal cache-oblivious B trees to solve search

§ Analysis Claim: we need to pull O(log_B N) blocks for any B, for any search query

N is the number of nodes in the BST. Note that in the analysis, we know what B is, even though the algorithm does not.
1 2 3 4 5 6 7 8 <- index
|     |       | <- boundary
|-xxxxxxx-----| <-  data

§ Black box: ordered file maintenance

We need a black box: ordered file maintenance (linked list for arrays)

§ Next: dynamic BST (inserts and deletes): layout

We take a VEB-layout static tree on top of an ordered file. The tree is a segtree whose internal nodes store the max of their subtrees; the leaves are the members of the ordered file.

§ Updates

§ Updates: analysis.

§ References

§ Krohn-Rhodes decomposition

We denote partial functions with X ⇀ Y and total functions with X → Y. A set X equipped with a closed, associative binary operator ⋆: X × X → X is a semigroup.

§ Partial function semigroup

For a ground set X, the set of partial functions Pf(X) ≡ { f: X ⇀ X }, along with function composition, forms a semigroup. This is in fact stronger than a semigroup: there exists an identity element (the identity function 1X), so Pf(X) is a monoid.

§ Transformation semigroup(TS)

Let Q be a set. Let S ⊆ Pf(Q) be a sub-semigroup of Pf(Q). Then the semigroup X ≡ (Q, S) is called the transformation semigroup of states Q. We call X ≡ (Q, S) a transformation monoid if S contains 1Q(q) = q.

§ Subtlety of being a transformation monoid.

There is some subtlety here. Just because S is a monoid does not mean that (Q, S) is a transformation monoid: S must contain the identity element of Pf(Q) to be called a transformation monoid. For example, consider the set Q ≡ {a, b} and the transformation semigroup S ≡ {f}, where f ≡ q ↦ b is the constant function. The set S is indeed a monoid, with identity element f: Q → Q (since f ∘ f = f). However, f ≠ 1Q, and hence (Q, S) is not a transformation monoid.

§ Examples of transformation semigroups

  1. (X, { θ(x) = undef }). The semigroup with the empty transformation.
  2. (X, ∅), the semigroup with no transformations.

§ Semigroup action

We sometimes wish to represent a semigroup using an action/transformation semigroup on a ground set X. So, given some semigroup (T, ×) that needs to be represented, we look for a morphism r: T → Pf(X) (r for representation) that is faithful. Put more simply, t1 ≠ t2 ⟹ r(t1) ≠ r(t2), where we define function equality extensionally: f = g ≡ ∀ x, f(x) = g(x).

§ Embedding actions into the transformation semigroup

We often wish to represent some semigroup S as the transformation semigroup of some set of states Q. We can achieve this by providing a morphism r: S → Pf(Q). Then, we can treat elements of S as elements of Pf(Q).

§ Completion of a transformation semigroup

Given a transformation semigroup X ≡ (Q, S), we can complete it by adding a new sink state ⊥, and then converting all partial functions in S to total functions that transition to ⊥ wherever they were undefined. Every completed function maps ⊥ to ⊥. We denote the completion as Xc ≡ (Qc, Sc).

§ Coverings

Let X ≡ (QX, SX) and Y ≡ (QY, SY) be transformation semigroups. Let φ ⊆ QY × QX be a relation, and let sx ∈ SX and ty ∈ SY. If the following diagram commutes, that is, if sx(φ(qy)) = φ(ty(qy)) for all qy ∈ QY, then we say that ty covers sx relative to φ. We imagine ty lying above sx, being projected down by φ. If, for a fixed φ, for every sx ∈ SX there exists a ty ∈ SY such that ty covers sx relative to φ, then we say that φ is a relation of automata. We write:
X ◁φ Y
X ≺φY
If X ≺φ Y, we say that Y dominates X, or Y covers X, or X divides Y.

§ Checking coverings and generating subsets

We note that for a given covering φ, if sx is covered by ty and px is covered by qy, then the composition sx px is covered by ty qy. Thus, to check if X is covered by Y, we simply need to check that some generating subset of SX is covered by Y.

§ Checking coverings of representations

Let us assume we have a representation of a transformation semigroup given with a semigroup Σ, a transformation semigroup X ≡ (QX, SX), and a representation r: Σ → SX that is faithful. Now, to check that X is covered by another Y, it suffices to check that for each σ ∈ Σ there exists a ty ∈ SY such that r(σ) is covered by this ty.

§ Companion relation

Given a relation φ ⊆ QY × QX, we define:
Σ ≡ { (s, t) : t ∈ SY covers s ∈ SX }
Recall that compositions of elements are covered by the composition of their coverings. Hence, if (s, t), (s′, t′) ∈ Σ, then (ss′, tt′) ∈ Σ. Thus, Σ is a subsemigroup of SX × SY. We can regard Σ as the graph of a relation φ′ ⊆ SY × SX. This will be called the companion relation of φ.

§ Wreath products

Let X ≡ (QX, SX) and Y ≡ (QY, SY). We're going to define a large product X ≀ Y. We begin with the set W ≡ SX^QY × SY, where SX^QY ≡ { f : QY → SX }. The wreath product then becomes:
X ≀ Y ≡ (QX × QY, W)
with the action of W on an element of QX × QY being defined as:
(f : QY → SX, sy : SY) (qx : QX, qy : QY) ≡ ( f(qy)(qx), sy(qy) )
It's a "follow the types" sort of definition, where we edit the right component as qy ↦ sy(qy), since that's all we can do. In the case of the left component, we have a qx, and we need to produce another element in QX, so we "must use f". The only way to use f is to feed it a qy. This forces us into the above definition.

§ Composition of wreath products

To show that W is closed under composition, let's consider (f, sy), (g, ty) ∈ W, with f, g: QY → SX and sy, ty ∈ SY. Following the action, the result is:
(f, sy) ⋆ (g, ty) = (λ qy. f(ty(qy)) ∘ g(qy), sy ∘ ty)

§ Equivalences of subsets of states

Let X = (Q, S) be a transition system. Given subsets a, b ⊆ Q, we shall write b ≤ a if either b ⊆ a or there exists some s ∈ S such that b ⊆ s(a), where s(a) ≡ { s(ai) : ai ∈ a }. We can define an equivalence relation a ≡ b ⟺ a ≤ b ∧ b ≤ a. Note: b ≤ a ⟹ |b| ≤ |a|, since b ≤ a means that b ⊆ s(a). Note that s is actually a function s: Q ⇀ Q, and a function mapped over a set can only ever decrease the number of elements in the set, since a function can only clump elements together; it can never break an element apart into two. Hence, b ⊆ s(a) implies |b| ≤ |s(a)| ≤ |a|. Similarly, a ≤ b ⟹ |a| ≤ |b|. Therefore, b ≡ a means that |b| = |a|.

Theorem: for all a, b ⊆ QX such that a ≡ b and b ⊆ s(a), we have b = s(a), and there exists a t ∈ SX such that a = t(b). Proof: since b ⊆ s(a) and |b| = |a| ≥ |s(a)|, we get b = s(a). Therefore s restricted to a is a permutation. Hence s is invertible there, and there exists an inverse permutation t such that a = t(b). We now need to show that t ∈ SX. To do this, first note that if the order of the permutation s is n, then t = s^(n−1), since t ∘ s = s^(n−1) ∘ s = s^n = 1. Since the semigroup S is closed under composition, t = s^(n−1) is in S, since it is s composed with itself (n−1) times.

§ Subset families of interest

We will be interested in a family of subsets of QX, called A, of the form:
A ≡ {∅} ∪ { {q} : q ∈ Q } ∪ { s(Q) : s ∈ SX } ∪ { Q }
In the above set, ≤ and ≡ are as defined above. We note that the set A is closed under the action of every s ∈ SX: the empty set is taken to the empty set, singleton sets are taken to singleton sets (or to ∅, for partial s), and for the full set Q we have added the sets s(Q) for all s ∈ SX.

§ Height function

A height function for a transition system X ≡ (QX, SX) is a function h: A → ℤ such that:
  1. h(∅) = −1.
  2. h({ q }) = 0 for all q ∈ Q.
  3. a ≡ b ⟹ h(a) = h(b) for all a, b ∈ A.
  4. b < a ⟹ h(b) < h(a) for all a, b ∈ A.
The notation b < a means (b ≤ a) ∧ ¬(a ≤ b). (3) + (4) imply that two elements of the same height are either equivalent or incomparable.

§ Pavings and bricks

For a ∈ A such that |a| > 1, we denote by Ba the set of all b ∈ A that are maximal proper subsets of a. That is, if b ∈ Ba, then b ⊊ a, and there is no c ∈ A with b ⊊ c ⊊ a. Note that we can assert that a = ∪_{b ∈ Ba} b. This is because A contains all the singletons of QX: we can begin by writing a as a union of singletons, and then merge elements into larger elements of A, terminating when we cannot merge any more.

§ Group of permutations for aA

Let us assume that there exists an s ∈ S such that s(a) = a. Let Aa be the set of all elements of A contained in a: Aa ≡ { Ai : Ai ∈ A, Ai ⊆ a }. Recall that the set A is closed under the action of all s, and hence, since s is a permutation of a, this naturally extends into a permutation of Aa: s(Aa) = Aa. Now note that this induces a permutation of the set Ba. This creates a transition system:
Ga ≡ { s ∈ S : s(a) = a }
Ha ≡ (Ba, Ga)
We have already shown that if s ∈ S defines a permutation of some set by its action, then its inverse also exists in S. So this means that Ga is in fact a transition group that acts on Ba. It might turn out that Ga = ∅. However, if Ga ≠ ∅, then as stated above, Ga is a group. We will call such a transition group a generalized transition group, since either Ga = ∅ or Ga is a group. Now, the generalized transition group Ha is called the holonomy transition system of a, and the group Ga is called the holonomy group of a. We have that Ga ≺ S, since Ga is a quotient of the sub-semigroup { s : s ∈ S, s(a) = a }. (TODO: so what? why does this mean that it's ≺?) Theorem: if a ≡ b, then Ha ≅ Hb (equivalent subsets have isomorphic holonomy transition systems). Proof: let us assume that a ≡ b. Since a ≡ b, we have elements s, s⁻¹ ∈ S such that b = s(a) and a = s⁻¹(b). Recall that each g ∈ Ga maps elements of Ba to elements of Ba. Bb must then contain the element s(ba). [TODO!]

§ Holonomy decomposition

Let X ≡ (Q, S) be a transition system, and let h be a height function for X such that h(Q) > 0. For a fixed i, let a1, a2, …, ak be the representatives of the equivalence classes of elements of A of height equal to i. We define:
Hi ≡ Ha1 ∨ Ha2 ∨ … ∨ Hak

§ Inductive hypothesis for coverings

We will say a relational covering X ≺φ Y is of rank i with respect to a given height function h if φ relates states in Y to subsets of states in X that are members of A and have height at most i. Formally, for each p ∈ QY, we have that φ(p) ∈ A and h(φ(p)) ≤ i. We prove that if X ≺φ Y is a relational covering of rank i, then there is a relational covering X ≺φ′ Hi ≀ Y of rank i − 1. The proof is by induction.

§ Base case:

Start with the relational covering with QY = { 0 }, SY = { id }, and the cover φ(0) = QX. Clearly, this has rank n, since the height of QX is n, and φ is indeed a covering, since the only transition that Y can make (stay at the same state) is simulated by any transition in SX [TODO: is this really the argument?]. For the induction, assume X ≺φ Y is a relational covering of rank i with respect to some height function h, with X ≡ (QX, SX) and Y ≡ (QY, SY). We know that A contains elements of height exactly i; let a1, a2, …, ak be representatives of the equivalence classes of height i in A. We will show how to establish a relational covering of rank i − 1.

§ References

§ Proving block matmul using program analysis

It's a somewhat well-known fact that given matrix multiplication O = AB, where O ∈ ℝ^{2n × 2m} (O for output), A ∈ ℝ^{2n × r}, B ∈ ℝ^{r × 2m}, we can also write this blockwise as follows:
[O11 O12]   [a11 a12] [b11 b12]   [a11 b11 + a12 b21    a11 b12 + a12 b22]
[O21 O22] = [a21 a22] [b21 b22] = [a21 b11 + a22 b21    a21 b12 + a22 b22]
When written as code, the original matrix multiplication is:
// a:[2N][2Z] b:[2Z][2M] -> out:[2N][2M]
void matmul(int N, int Z, int M, int a[2*N][2*Z], int b[2*Z][2*M], int out[2*N][2*M]) {
  for(int i = 0; i < 2*N; ++i) {
    for(int j = 0; j < 2*M; ++j) {
      for(int k = 0; k < 2*Z; ++k) { out[i][j] += a[i][k] * b[k][j]; }
    }
  }
}
and the block-based matrix multiplication is:
// a:[2N][2Z] b:[2Z][2M] -> out:[2N][2M]
void matmulBlock(int N, int Z, int M, int a[2*N][2*Z], int b[2*Z][2*M], int out[2*N][2*M]) {
  for (int BI = 0; BI < 2; ++BI) {
    for (int BJ = 0; BJ < 2; ++BJ) {
      for(int i = BI*N; i < BI*N+N; ++i) {
        for(int j = BJ*M; j < BJ*M+M; ++j) {
          for(int k = 0; k < 2*Z; ++k) { out[i][j] += a[i][k] * b[k][j]; }
        }
      }
    }
  }
}
We wish to show that both of these programs have the same semantics. We will do this by appealing to ideas from program analysis.

§ The key idea

We will consider the statement:
out[i][j] += a[i][k] * b[k][j]
as occurring at an abstract "point in time" (i, j, k) in the matmul function. It also occurs at an abstract "point in time" (BI, BJ, i′, j′, k′) in the matmulBlock function. We will then show that the loops for(i...) for(j...) for(k...) are fully parallel, and hence we can reorder the loops any way we want. Then, we will show that the ordering imposed by (BI, BJ, i′, j′, k′) is a reordering of the original (i, j, k) ordering. We do this by showing that there is a bijection:
(i=i0, j=j0, k=k0) → (BI=⌊i0/N⌋, BJ=⌊j0/M⌋, i′=i0%N, j′=j0%M, k′=k0)
Thus the blocked schedule executes exactly the same set of statement instances, and (by full parallelism) does so without affecting the program semantics.

§ Schedules

We'll zoom out a little, to consider some simple programs and understand how to represent parallelism.
void eg1(int N, int M, int out[N][M]) {
  for(int i = 0; i < N; ++i) {
    for(int j = 1; j < M; ++j) {
      out[i][j] = out[i][j-1];
    }
  }
}
Notice that this program is equivalent to the program with the i loop reversed:
void eg1rev(int N, int M, int out[N][M]) {
  for(int i = N-1; i >= 0; --i) {
    for(int j = 1; j < M; ++j) {
      out[i][j] = out[i][j-1];
    }
  }
}
What's actually stopping us from reversing the loop for(j...)? It's the fact that the value of, say, out[i=0][j=1] depends on out[i=0][j=0]. We can see that in general, out[i=i0][j=j0] depends on out[i=i0][j=j0-1]. We can represent this by considering a dependence set:
{ write:(i0, j0−1) → write:(i0, j0) }
In general, we can reorder statements as long as we do not change the directions of the arrows in the dependence set.

§ Dependence structure of matmul.

§ Fully parallel, reordering

§ References

§ Why I like algebra over analysis

Midnight discussions with my room-mate Arjun P. This tries to explore what it is about algebra that I find appealing. I think the fundamental difference to me comes down to flavour --- analysis and combinatorial objects feel very "algorithm", while algebra feels "data structure". To expand on the analogy, a proof technique is like an algorithm, while an algebraic object is like a data structure. The existence of an algebraic object allows us to "meditate" on the proof technique as a separate object that does not move through time. This allows us to "get to know" the algebraic object, independent of how it's used. So, at least for me, I have a richness of feeling when it comes to algebra that just doesn't shine through with analysis. The one exception may be reading something like "by compactness", which has been hammered into me by exercises from Munkres :) Meditating on a proof technique is much harder, since the proof technique is necessarily intertwined with the problem, unlike a data structure, which to some degree has an independent existence. This reminds me of the quote: "Art is how we decorate space; music is how we decorate time". I'm not sure how to draw out the tenuous connection I feel, but it's there. Arjun comes from a background of combinatorics, and my understanding of his perspective is that each proof is a technique unto itself. Or, perhaps, instantiating the technique for each proof is difficult enough that abstracting it out is not useful in the first place. A good example of a proof technique that got studied in its own right in combinatorics is the probabilistic method. A more modest example is the pigeonhole principle, which still requires insight to instantiate in practice. Not that this does not occur in algebra: the Whitney trick, for example, got pulled out of the proof of the Whitney embedding theorem. But there is something special in algebra about just meditating on the definitions.
To draw an analogy for the haskellers, it's the same joy of being able to write down the type of a haskell function and know exactly what it does, enough that a program can automatically derive the function (djinn). The fact that we know the object well enough that just writing the type down allows us to infer the program, makes it beautiful. There's something very elegant about the minimality that algebra demands. Indeed, this calls back to another quote: "perfection is achieved not when there is nothing more to add, but when there is nothing left to take away". I'm really glad that this 2 AM discussion allowed me to finally pin down why I like algebra.

§ using for cleaner function type typedefs

I've always struggled with remembering the syntax for function type typedefs:
we can now use using for a far more pleasant syntax:
which is also intuitive: you write down the "type" on the right hand side, and give it a name on the left. The two are not strictly the same, since the typedef binds FNTYNAME to a function pointer type, while the C++ using version names the function type itself. I prefer the latter at any rate, since I dislike that the usual typedef tends to hide the fact that a function pointer is a pointer-like thing.

§ A walkway of lanterns

§ Semidirect products





If we write an element (a, X) of the semidirect product as the matrix ( X a ; 0 1 ), then multiplication becomes:

( X a ) ( Y b )   ( XY  a + X · b )
( 0 1 ) ( 0 1 ) = ( 0   1         )

that is, (a, X)(b, Y) = (a + X · b, XY).



§ Krohn-Rhodes, AKA how to model Freudian psychoanalysis using Lagrangians over semigroups.

§ Natural transformations

I haven't found people who draw "all three parts" of the natural transformation (the categories C, FC, and GC) and then show the relationship between them, so I made this for my own reference.

§ The hilarious commentary by dinosaure in OCaml git

The ocaml-git project is a re-implementation of git in OCaml. It's well-written, and I was walking through the codebase when I found absolutely amazing, hilarious, and deep comments from dinosaure. I really enjoyed reading through the codebase, and the existence of these comments made it more humane to read. I don't know who dinosaure is, but I'm really glad they wrote the comments they did; it really made my day.

§ The one that takes a stab at Haskell for fun

(* XXX(dinosaure): ...we can fix this detail but
I'm lazy like haskell. TODO! *)

§ The academic one that broken-links to a paper

(* XXX(dinosaure): see this paper
https://github.com/ocamllabs/papers/blob/master/irmin/2014.08.matthieu/rapport.pdf *)

§ The one about the frustrations of bug-hunting

(* XXX(dinosaure): [~chunked:false] is mandatory, I don't want to explain
why (I lost one day to find this bug) but believe me. *)

§ The one about a potential heisenbug

(* XXX(dinosaure): if, one day, we find a bug about the serialization of the
IDX file, may be it's about this function (stable/unstable sort). *)

§ The requisite comment in french for an OCaml project

(* XXX(dinosaure): okay, this is a complicated note; it took me 2 days
to fix the bug, so I'm explaining it in French, it's simpler.
Roughly, [Helper.MakeDecoder] uses this buffer as its internal
buffer to handle alterations. What happens is that, in the
[refill] function, the internal buffer gets topped up from an
[input] (typically [zl]). It is ultimately a __bad__ interplay
between a [Cstruct.t] and a [Bigarray].
The point is to know the true size of the [Bigarray], and to
append, with [blit], the contents of the [input] if the size of the
[Bigarray] (and not of the [Cstruct]) is sufficient.
Before this change, [zl], [de] and [io] shared the same [Bigarray],
cut up (with [Cstruct]) into 3. So, in the [MakeDecoder], [refill]
considered (for big files of more than 0x8000 bytes) that after
[de] we still had room - and in that case, we had [io].
Thus, we would [blit] [zl] into [de+sizeof(de) == io], and we would
end up trying to decompress what we had already decompressed.
(YOLO).
So we now treat [de] as a [Cstruct.t] and a physically different
[Bigarray] to avoid this problem.
However, we should keep introspecting, because I have the
intuition that for a file bigger than [2 * 0x8000], we should have
a problem. So TODO. *)

§ The deep one

(* XXX(dinosaure): at the end, we don't care if we lost something. *)

§ How to link against MLIR with CMake

Since MLIR hasn't set up the nice tooling that LLVM has around CMake, as far as I can tell, one needs to actually know CMake to link against MLIR. However, as is well known, CMake incantations are handed down by priests who spend the better part of their lives studying the tome that is the CMake manual. I, an unlucky soul, had to go on this adventure, and I hope to spare you the trouble. I wished to link against a static library build of MLIR. The secret lies in the find_library call:
# If the variable has been set by -DMLIR_INCLUDE_PATH, then keep it.
# Otherwise fall back to the environment variable $MLIR_INCLUDE_PATH.
# If neither, then *shrug*.

# Resolve for:
# - a library target called `MLIRAnalysis`
# - asking to link against `libMLIRAnalysis.a`
# - using the variable MLIR_INCLUDE_PATH which, as we saw before,
#   is either an environment variable or a cmake option
target_include_directories(languagemodels PRIVATE ${MLIR_INCLUDE_PATH})
I cribbed the actual things to link against from the path mlir/examples/Toy/Ch2/CMakeLists.txt which helpfully lists MLIR things it needs to link against. The full CMakeLists is here:
cmake_minimum_required(VERSION 3.5)


## I don't want to use find_package since I want proper control over where my LLVM comes from.
## find_package(LLVM REQUIRED)

        rnn.cpp codegenc.h lang.h codegenmlir.h)

## Attempt to take these as command line arguments. IF that fails,
## lookup environment.


target_include_directories(languagemodels PRIVATE ${MLIR_INCLUDE_PATH})
find_library(MLIRAnalysis MLIRAnalysis ${MLIR_LIBRARY_PATH})
find_library(MLIRParser MLIRParser ${MLIR_LIBRARY_PATH})
find_library(MLIRSideEffects MLIRSideEffects ${MLIR_LIBRARY_PATH})
find_library(MLIRTransforms MLIRTransforms ${MLIR_LIBRARY_PATH})
find_library(LLVMCore LLVMCore ${MLIR_LIBRARY_PATH})
find_library(LLVMSupport LLVMSupport ${MLIR_LIBRARY_PATH})

## debugging to check if it's been set properly
message(MLIRAnalysis ${MLIRAnalysis})


§ Energy as triangularizing state space

This comes from The wild book by John Rhodes, which I anticipate I'll be posting more of in the coming weeks.

§ Experiments

Let an experiment be a tuple of the phase space X, the action space A, and an action ⋆: A × X → X of the actions on the phase space. We will write x′ = a ⋆ x to denote the new state of the system x. So the experiment E is the data E ≡ (X, A, ⋆: A × X → X).

§ Coordinate systems.

The existence of the action allows us to write the evolution of the system recursively: xt+1 = a ⋆ xt. However, to understand the final state xt+1, we need to essentially "run the recursion", which does not permit us to understand the experiment. What we really need is the ability to "unroll" the loop. To quote:
Informally, understanding an experiment E means introducing coordinates into phase space of E which are in triangular form under the action of the inputs of E.

§ Conservation laws as triangular form

We identify certain interesting invariants of a system by two criteria:
  1. The parameter Q(t) determines some obviously important aspects of the system. That is, there is a deterministic function M(Q(t)) which maps Q(t) to "measure" some internal state of the system.
  2. If the value of such a parameter Q is known at time t0 (denoted Q(t0)), and it is also known what inputs are presented to the system from time t0 to time t0 + є (denoted I[t0, t0 + є]), then the new value of Q is a deterministic function of Q(t0) and I[t0, t0 + є].
Such parameters allow us to understand a system, since they are deterministic parameters of the evolution of the system, while also providing a way to measure some internal state of the system using M. For example, consider a system x with an energy function e(x). If we perform an action a on the system x, then we can predict the energy e(x′ = a ⋆ x) given just e(x) and a; here x′ = a ⋆ x is the result of the action a on the state x.
In general, conservation principles give a first coordinate of a triangularization. In the main, a large part of physics can be viewed as discovering and introducing functions e of the states q of the system such that under an action a, e(a ⋆ q) depends only on e(q) and a, and not on q.

§ Theory: semidirect and wreath products

§ Symmetries as triangular form

We first heuristically indicate the construction involved in going from the group of symmetries to the triangularization, and then precisely write it out in all pedantic detail.
Let an experiment be E ≡ (X, A, ⋆). We say that Π is a symmetry of E iff:
  1. Π: X → X is a permutation of X.
  2. Π commutes with the action of each a: Π(a ⋆ x) = a ⋆ Π(x).
We say that the theory E is transitive (in the action sense) if for all x1, x2 ∈ X with x1 ≠ x2, there exist a1, a2, …, an such that x2 = an ⋆ (… ⋆ (a1 ⋆ x1)). Facts about the symmetries of a system:
  1. We know that the symmetries of a theory E form a group.
  2. If E is transitive, then each symmetry Π is a regular permutation --- If there exists an x such that Π(xf) = xf (a fixed point), then this implies that Π(x) = x for all x.
  3. Let the action split X into disjoint orbits O1, O2, …, Ok, from which we choose representatives x1 ∈ O1, x2 ∈ O2, …, xk ∈ Ok. Then, if E is transitive, there is exactly one action that sends a particular xi to a particular xj. So, on fixing one component of an action, we fix all components.
To show that this gives rise to a triangularization, we first construct the semigroup of actions of the experiment: S(E) ≡ { a1 ⋯ an : n ≥ 1 and ai ∈ A }. Now, let G = Sym(E), the full symmetry group of E. One can apparently express the symmetry group in terms of:
(X, S) ≤ (G, G) ≀ ({ O1, O2, …, Ok }, T)

§ The cutest way to write semidirect products

Given two monoids (M, +, 0M) and (N, ×, 1N), and a homomorphism φ: N → End(M), where End(M) is the endomorphism monoid of M, we will notate φ(n)(m) as n · m ∈ M. Now the semidirect product M ⋊φ N is the set M × N equipped with the multiplication rule (m, n)(m′, n′) ≡ (m + n · m′, n × n′). This can also be written down as:




(m, n) ↔ ( n m )
         ( 0 1 )

( n m ) ( n′ m′ )   ( n × n′  n · m′ + m )
( 0 1 ) ( 0  1  ) = ( 0       1          )

This way of writing down semidirect products as matrices makes many things immediately clear. For example, to invert (m, n) (when n is invertible and M is a group), we simply invert the matrix:

( n m )⁻¹   ( n⁻¹  −n⁻¹ · m )
( 0 1 )   = ( 0    1        )

Hence (m, n)⁻¹ = (−n⁻¹ · m, n⁻¹), which is indeed the right expression for the inverse.

§ My Favourite APLisms

§ identity matrix

n←3 ⋄ id ← n n ⍴(1,n⍴0) ⋄ id
This relies heavily on ⍴ (reshape) recycling its right argument.

§ histogram

xs←(1 1 3 3 3 6) ⋄ n←(⌈/xs)⍴0 ⋄ n[xs]+←1 ⋄ n
The use of n[x] +←1 will stably write +1 as many times as there are repeated indexes in xs.

§ String matching / parity as fold :

]disp str ← (1 0 0 1 0 0 0 0 1 0 1 0 0 0) ⋄ 2 1 ⍴ ((⊂ str) ⍪ ⊂((≠\str)))
↓1 0 0 1 0 0 0 0 1 0 1 0 0 0│
│1 1 1 0 0 0 0 0 1 1 0 0 0 0│

§ Proof of the Chinese remainder theorem on rings

§ General operations on ideals

We have at our hands a commutative ring R, and we wish to study the ideal structure on the ring. In particular, we can combine ideals in the following ways:
  1. I + J ≡ { i + j : i ∈ I, j ∈ J }
  2. I ⋂ J ≡ { x : x ∈ I ∧ x ∈ J }
  3. I ⊕ J ≡ { (i, j) : i ∈ I, j ∈ J }
  4. IJ ≡ { ∑k ik jk : ik ∈ I, jk ∈ J } (finite sums of products)
We have the containment:
IJ ⊆ I ⋂ J ⊆ I, J ⊆ I + J ⊆ R
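A concrete instance may help: in R = ℤ with I = (4) and J = (6), all four constructions are principal ideals.

```latex
IJ = (24), \qquad
I \cap J = (\operatorname{lcm}(4, 6)) = (12), \qquad
I + J = (\gcd(4, 6)) = (2)
```

so the containment chain reads (24) ⊆ (12) ⊆ (4), (6) ⊆ (2) ⊆ ℤ.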

§ IJ is an ideal, IJ ⊆ I ⋂ J

It's not immediate from the definition that IJ ⊆ I ⋂ J. The idea is that given a sum ∑k ik jk ∈ IJ, each term ik jk lies in I, since the ideal I is closed under multiplication with R; hence ∑k ik jk ∈ I. Similarly, each term ik jk lies in J, so ∑k ik jk ∈ J. Hence, we get the containment IJ ⊆ I ⋂ J.

§ I ⋂ J ⊆ I, I ⋂ J ⊆ J

Immediate from the inclusion function.

§ I, JI + J

Immediate from inclusion

§ CRT from an exact sequence

There exists an exact sequence:
0 → I ⋂ J →f I ⊕ J →g I + J → 0
    f(r) = (r, r)
    g((i, j)) = i − j
(The sign in g makes g ∘ f = 0, so the image of f is exactly the kernel of g.)
We are forced into this formula by considerations of dimension. We know:
  dim(I ⊕ J) = dim(I) + dim(J)
  dim(I + J) = dim(I) + dim(J) − dim(I ⋂ J)   [inclusion-exclusion]
  dim(I + J) = dim(I ⊕ J) − dim(I ⋂ J)
  dim(I + J) − dim(I ⊕ J) + dim(I ⋂ J) = 0
  V − E + F = 2
By analogy to euler characteristic which arises from homology, we need to have IJ in the middle of our exact sequence. So we must have:
0 → ? → I ⊕ J → ?→ 0
Now we need to decide on the relative ordering of I ⋂ J and I + J. Since I ⋂ J maps into I ⊕ J, while I + J is a quotient of I ⊕ J, the exact sequence must have I + J as the image of I ⊕ J. This forces us to arrive at:
0 → I ⋂ J → I ⊕ J → I + J → 0
The product ideal IJ plays no role, since it's not possible to define a product of modules in general (just as it is not possible to define a product of vector spaces). Thus, the exact sequence better involve module related operations. We can now recover CRT:
0 → I ⋂ J → I ⊕ J → I + J → 0
0 →   R   → R ⊕ R →   R   → 0
Quotienting the second row by the first termwise gives:
0 → R/(I ⋂ J) → R/I ⊕ R/J → R/(I + J) → 0
In particular, when I + J = R, the last term vanishes, and R/(I ⋂ J) ≅ R/I ⊕ R/J: the Chinese remainder theorem.

§ References

§ monic and epic arrows

This is trivial; I'm surprised it took me this long to internalize this fact. When we convert a poset (X, ≤) into a category, we stipulate that x ≤ y ⟺ x → y. Now consider the category Set of sets and functions between sets, where an arrow A →f B is a function from A to B. If f is monic, then we know that |A| = |Im(f)| ≤ |B|. That is, a monic arrow behaves a lot like a poset arrow! Similarly, an epic arrow behaves a lot like an arrow in the opposite poset. I wonder if quite a lot of category-theoretic diagrams are clarified by thinking of monic and epic directly in terms of controlling sizes.

§ The geometry of Lagrange multipliers

If we want to minimise a function f(x) subject to the constraint g(x) = c, one uses the method of Lagrange multipliers. The idea is to consider a new function L(x, λ) = f(x) + λ (c − g(x)). Now, at a local optimum (x, λ), the conditions:
  1. L/∂ x = 0: f′(x) − λ g′(x) = 0.
  2. L/∂ λ = 0: g(x) = c.
Equation (2) is sensible: we want our optima to satisfy the constraint that we had originally imposed. What is Equation (1) trying to say? Geometrically, it's asking us to keep f′(x) parallel to g′(x). Why is this a good ask? Let us say that we are at a feasible point x0 (g(x0) = c), and we wiggle it: x0 → x0 + є ≡ x1. To remain feasible, the wiggle must not change g; that is, є must be perpendicular to g′(x0). If f′(x0) were not parallel to g′(x0), then f′(x0) would have a nonzero component along some such feasible є, and we could improve f(x0 + є) without violating the constraint g(x0 + є) = c, so x0 would not be optimal. Hence, at an optimum, f′(x0) is parallel to g′(x0).
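A worked example may help: maximise f(x, y) = x + y on the circle g(x, y) = x² + y² = 1.

```latex
\nabla f = (1, 1), \qquad \nabla g = (2x, 2y) \\
\text{(1): } \nabla f = \lambda \nabla g
  \implies 1 = 2\lambda x,\; 1 = 2\lambda y \implies x = y \\
\text{(2): } x^2 + y^2 = 1 \implies x = y = \pm \tfrac{1}{\sqrt{2}}
```

The maximum sits at (1/√2, 1/√2), where the gradient of f is indeed parallel to the gradient of g (both point along the diagonal).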

§ Efficient tree transformations on GPUs

All material lifted straight from Aaron Hsu's PhD thesis. I'll be converting APL notation to C++-like notation.

§ Tree representation as multi-dimensional ragged nested arrays

We're interested in this tree:
a  b       c
│ ┌┴┐  ┌───┼───┐
p q r  s   t   u
  │    │   |
  │   ┌┴┐ ┌┴┐
  v   w x y z
I'll be writing APL commands in front of a $ to mimic bash, and I'll write some arrays as multi-line. To run them, collapse them into a single line. The ast object is represented in memory as:
$ ast ← ('∘'
           ('a' ('p'))
           ('b' ('q' ('v')) ('r'))
           ('c' ('s' ('w' 'x'))
                ('t' ('y' 'z'))
                ('u')))
$ ]disp ast
│ │  ││b│qv│r│││c│┌→┬──┐│┌→┬──┐│u││
│ │  │└─┴─→┴─┘││ ││s│wx│││t│yz││ ││
│ │  │        ││ │└─┴─→┘│└─┴─→┘│ ││
│ │  │        │└─┴─────→┴─────→┴─┘│
Here's how to read the array representation. Look at the top level of the tree: we have a root node with three children:
a  b       c

│∘│  │        │             │
│ │ a│   b    │     c       │
│ │  │        │             │
With the first cell being the root node, and the three adjacent cells being the children. Next, we look at how a is represented. This is predictably recursive. Let's see the subtree under a:
a  b       c

│∘│ap│        │             │
│ │  │  b     │   c         │
│ │  │        │             │

Similarly for b:
a  b       c
│ ┌┴┐
p q r

│∘│ap│┌→┬──┬─┐│             │
│ │  ││b│q │r││   c         │
│ │  │└─┴─→┴─┘│             │
And so on, leading to the final representation:
a  b       c
│ ┌┴┐  ┌───┼───┐
p q r  s   t   u
  │    │   |
  │   ┌┴┐ ┌┴┐
  v   w x y z
│ │  ││b│qv│r│││c│┌→┬──┐│┌→┬──┐│u││
│ │  │└─┴─→┴─┘││ ││s│wx│││t│yz││ ││
│ │  │        ││ │└─┴─→┘│└─┴─→┘│ ││
│ │  │        │└─┴─────→┴─────→┴─┘│
Note that for this representation to work, we need to be able to box arrays within arrays. I don't understand the memory layout of this, to be honest. I feel like representing this in memory would still rely on pointer-chasing, since we need to box all the arrays. This is possibly optimised by APL to not be too bad.

§ The depth vector representation

      ∘             0
a  b       c        1
│ ┌┴┐  ┌───┼───┐
p q r  s   t   u    2
  │    │   |
  │   ┌┴┐ ┌┴┐
  v   w x y z       3
If we visit this tree and record depths in pre-order (node, left, right), we arrive at the list:
  (∘:0 (a:1 (p:2)) (b:1 (q:2 (v:3)) (r:2))
       (c:1 (s:2 (w:3 x:3)) (t:2 (y:3 z:3)) (u:2)))
formatted as:
  (∘:0 (a:1 (p:2))
       (b:1 (q:2 (v:3))
            (r:2))
       (c:1 (s:2 (w:3 x:3))
            (t:2 (y:3 z:3))
            (u:2)))
Linearized, this is the list:
    (∘ a p b q v r c s w x t y z u)
d ← (0 1 2 1 2 3 2 1 2 3 3 2 3 3 2)

      ∘             0
a  b       c        1
│ ┌┴┐  ┌───┼───┐
p q r  s   t   u    2
  │    │   |
  │   ┌┴┐ ┌┴┐
  v   w x y z       3
To convert the ast object into a depth vector representation, we can use the following call:
$ ast ← ('∘' ('a' ('p')) ('b' ('q' ('v')) ('r')) ('c' ('s' ('w' 'x')) ('t' ('y' 'z')) ('u')))
$ d ← ∊0{(⊢,(⍺+1)∇⊣)/⌽⍺,1↓⍵}ast
0 1 2 1 2 3 2 1 2 3 3 2 3 3 2
Let's break this down: TODO

§ Inverted tables

We represent data associated with our nodes as follows:
$ data ← ⍪ ¨d(15⍴'T')(↑15⍴⊂'n.')
$ ]disp data
This is the same thing as a structure-of-arrays (SOA) representation, where each piece of information (eg, the depth at data[1], the T information at data[2]) is a flat array that can be accessed efficiently with SIMD instructions.

§ AST representation


§ Path matrices

We want information of how to go up and down the tree in ideally constant time. We store this information in what is known as a path matrix. For our recurring example, the path matrix is:
∘ a p b q v r c s w x t y z u | preorder traversal
∘ ∘ ∘ ∘ ∘ ∘ ∘ ∘ ∘ ∘ ∘ ∘ ∘ ∘ ∘ | depth=0
- a a b b b b c c c c c c c c | depth=1
- - p - q q r - s s s t t t u | depth=2
- - - - - v - - - w x - y z - | depth=3

      ∘             0
a  b       c        1
│ ┌┴┐  ┌───┼───┐
p q r  s   t   u    2
  │    │   |
  │   ┌┴┐ ┌┴┐
  v   w x y z       3
To efficiently compute this, we first replace every value in our tree with its preorder traversal visit time. This changes the tree to:
              ∘:0                               0
a:1        b:3                 c:7              1
│      ┌───┴───┐     ┌──────────┼───────┐
p:2    q:4     r:6   s:8        t:11    u:14    2
       │             │          │
       │          ┌──┴──┐     ┌─┴───┐
       v:5        w:9   x:10  y:12  z:13        3
The values we store in the tree are the integers; the old labels are shown for clarity. The path matrix for this tree is:
0  1  2  3  4  5  6  7  8  9 10 11 12 13 14  | preorder traversal
0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  | depth=0
-  1  1  3  3  3  3  7  7  7  7  7  7  7  7  | depth=1
-  -  2  -  4  4  6  -  8  8  8 11 11 11 14  | depth=2
-  -  -  -  -  5  -  -  -  9 10  - 12 13  -  | depth=3

              ∘:0                               0
a:1        b:3                 c:7              1
│      ┌───┴───┐     ┌──────────┼───────┐
p:2    q:4     r:6   s:8        t:11    u:14    2
       │             │          │
       │          ┌──┴──┐     ┌─┴───┐
       v:5        w:9   x:10  y:12  z:13        3

§ Rendering the depth information in 2D

We use the incantation:
$ d ← (0 1 2 1 2 3 2 1 2 3 3 2 3 3 2)
$ ((⍳≢d)@(d,¨⍳≢d)) (((⌈/d)+1) (≢d))⍴'-'
0 - - - - - - - - -  -  -  -  -  -
- 1 - 3 - - - 7 - -  -  -  -  -  -
- - 2 - 4 - 6 - 8 -  - 11  -  - 14
- - - - - 5 - - - 9 10  - 12 13  -
Let's break this down (the symbol ⍝ is a lamp, for commenting/illumination):
$ ⍳ 3 ⍝ iota: make a list of n elements.
1 2 3
$ d
0 1 2 1 2 3 2 1 2 3 3 2 3 3 2

$ ≢d ⍝ tally: count the no. of elements in d.
15
$ ⍳≢d  ⍝ list of length (no. of elements in d).
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
$ ]disp (1 2 3),(4 5 6) ⍝ ,:concatenate
│1 2 3│4 5 6│
$ ]disp (1 2 3) ,¨ (4 5 6)
│1 4│2 5│3 6│
The use of ¨ needs some explanation. ¨ is a higher order function which takes a function and makes a mapped version of it. So ,¨ is a function which attempts to map the concatenation operator. Given two arrays (1 2 3) and (4 5 6), (1 2 3) ,¨ (4 5 6) attempts to run , on each pair: 1 and 4, 2 and 5, 3 and 6. This gives us the tuples ((1 4) (2 5) (3 6)). So, for our purposes, zip ← ,¨.
$ ]disp (d,¨⍳≢d) ⍝ zip d with [1..len d].
│0 0│1 1│2 2│1 3│2 4│3 5│2 6│1 7│2 8│3 9│4 10│2 11│3 12│4 13│2 14│
$ (((⌈/d)+1) (≢d))⍴'-' ⍝ array of dim (1 + max val in d) × (no. of elem in d)
TODO: explain @ and its use

§ Creating the path matrix

$ ⎕IO ← 0 ⍝ (inform APL that we wish to use 0-indexing.)
$ d ← (0 1 2 1 2 3 2 1 2 3 3 2 3 3 2)
$ PM ← ⌈\((⍳≢d)@(d,¨⍳≢d))(((⌈/d+1)(≢d))⍴0)

0 0 0 0 0 0 0 0 0 0  0  0  0  0  0
0 1 1 3 3 3 3 7 7 7  7  7  7  7  7
0 0 2 2 4 4 6 6 8 8  8 11 11 11 14
0 0 0 0 0 5 5 5 5 9 10 10 12 13 13

      0               0
1  3          7       1
│ ┌┴┐  ┌──────┼───┐
2 4 6  8     11   14  2
  │    │      |
  │   ┌┴─┐   ┌┴──┐
  5   9  10  12  13   3
The incantation can be broken down into:
$ grid←(⌈/d+1) (≢d) ⍴ 0
$ grid
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
$ writeixs ← (d,¨⍳≢d)
$ ]disp writeixs
│0 0│1 1│2 2│1 3│2 4│3 5│2 6│1 7│2 8│3 9│3 10│2 11│3 12│3 13│2 14│
$ ixgrid ← ((⍳≢d)@writeixs) grid
$ ixgrid
0 0 0 0 0 0 0 0 0 0  0  0  0  0  0
0 1 0 3 0 0 0 7 0 0  0  0  0  0  0
0 0 2 0 4 0 6 0 8 0  0 11  0  0 14
0 0 0 0 0 5 0 0 0 9 10  0 12 13  0
$ PM ← ⌈\ixgrid
$ PM
0 0 0 0 0 0 0 0 0 0  0  0  0  0  0
0 1 1 3 3 3 3 7 7 7  7  7  7  7  7
0 0 2 2 4 4 6 6 8 8  8 11 11 11 14
0 0 0 0 0 5 5 5 5 9 10 10 12 13 13

§ Using the path matrix: distance of a node from every other node.

Note that the maximum distance between two nodes is achieved by climbing all the way to the root, and then climbing down:
dmax ← depth(a) + depth(b)
If we know the lowest common ancestor of two nodes, then the distance of one node to another is:
dcorrect ← dist(a, lca(a, b)) + dist(b, lca(a, b))
So, we can compute the distance as:
dcorrect ← dist(a, lca(a, b)) + dist(lca(a, b), b)
         = dist(a, lca(a, b)) + depth(lca(a, b)) +
           dist(b, lca(a, b)) + depth(lca(a, b)) +
           -2 * depth(lca(a, b))
         = depth(a) +
           depth(b) +
           -2 * depth(lca(a, b))
since dist(x, lca(a, b)) + depth(lca(a, b)) = depth(x) for x = a, b.
[TODO: picture] [TODO: finish writing this]

§ Parent vector representation

A parent vector is a vector of length n where Parent[i] is the index of the parent of node i (itself an index into Parent). Hence, the following condition will return 1 iff V is a parent vector. For example, here is the parent vector for our running example:
d ← (0  1  2  1  2  3  2  1  2  3  3  2  3  3  2) │ depths
    (∘  a  p  b  q  v  r  c  s  w  x  t  y  z  u) │ values
p ← (∘  ∘  a  ∘  b  q  b  ∘  c  s  s  c  t  t  c) │ parents
    (0  1  2  3  4  5  6  7  8  9 10 11 12 13 14) | indexes
P ← (0  0  1  0  3  4  3  0  7  8  8  7 11 11  7) │ parent indices

              ∘:0                               0
a:1        b:3                 c:7              1
│      ┌───┴───┐     ┌──────────┼───────┐
p:2    q:4     r:6   s:8        t:11    u:14    2
       │             │          │
       │          ┌──┴──┐     ┌─┴───┐
       v:5        w:9   x:10  y:12  z:13        3
The condition a parent vector must satisfy is:
∧/V ∊(⍳≢V) ⍝ [All elements of V belong in the list [0..len(V)-1] ]
The root node (parent of all elements) can be found using the fixpoint operator (⍣≡):
I←{(⊂⍵)⌷⍺} ⍝ index into the left hand side param using right hand side param
I⍣≡⍨p ⍝ compute the fixpoint of the I operator using ⍨ and apply it to p

§ Converting from depth vector to parent vector, Take 1

As usual, let's consider our example:
d ← (0  1  2  1  2  3  2  1  2  3  3  2  3  3  2) │ depths
    (∘  a  p  b  q  v  r  c  s  w  x  t  y  z  u) │ values
p ← (∘  ∘  a  ∘  b  q  b  ∘  c  s  s  c  t  t  c) │ parents
    (0  1  2  3  4  5  6  7  8  9 10 11 12 13 14) | indexes
P ← (0  0  1  0  3  4  3  0  7  8  8  7 11 11  7) │ parent indices

              ∘:0                               0
a:1        b:3                 c:7              1
│      ┌───┴───┐     ┌──────────┼───────┐
p:2    q:4     r:6   s:8        t:11    u:14    2
       │             │          │
       │          ┌──┴──┐     ┌─┴───┐
       v:5        w:9   x:10  y:12  z:13        3
Note that the depth vector already encodes parent-child information. For example, to compute the parent of t:11, notice that t is at depth 2, so its parent is at depth 1. We should find all the nodes in d[0..t) which have depth equal to 1, and then pick the rightmost one. This translates to the expression:
$ d ← (0  1  2  1  2  3  2  1  2  3  3  2  3  3  2)
$ t ← 11   ⍝ target node
$ ixs ← ⍳t   ⍝ array indexes upto this node
  0 1 2 3 4 5 6 7 8 9 10
$ d[ixs]   ⍝ depths of nodes to the left of the given node t
  0 1 2 1 2 3 2 1 2 3 3
$ d[ixs]  = d[t]-1  ⍝ boolean vector of nodes whose depth is that of t's parent
  0 1 0 1 0 0 0 1 0 0 0
$ eqds ← ⍸ (d[ixs] = d[t]-1)  ⍝ array indexes of nodes whose depth is that of t's parent
  1 3 7
$ ⌽ eqds ⍝ reverse of array indexes to extract `7`
  7 3 1
$ ⊃ ⌽ eqds ⍝ first of the reverse of the array indexes to extract `7`
  7
$ (⌽⍸(d[⍳t] = d[t]-1))[0]  ⍝ APL style one-liner of the above
  7
While this is intuitive, this does not scale: It does not permit us to find the parent of all the nodes at once --- ie, it is not parallelisable over choices of t.

§ Converting from depth vector to parent vector, Take 2 (Or scan idiom)

Imagine we have a list of 0s and 1s, and we want to find the index of the rightmost 1 value. For example, given:
       0 1 2 3 4 5 6 7 8 9 10 11 12
$ a ← (0 0 1 0 0 0 1 0 1 0  0  0  0)
we want the answer to be f a = 8. We saw an implementation in terms of f←{(⌽⍸⍵)[0]} in Take 1 (recall that ⍵ is the symbol for the right-hand-side argument of a function). We're going to perform the same operation slightly differently. Let's consider the series of transformations:
⍝      0 1 2 3 4 5 6 7 8 9 10 11 12
$ a ← (0 0 1 0 0 0 1 0 1 0  0  0  0) ⍝ original array

$ ⌽a  ⍝ reverse of a
  0 0 0 0 1 0 1 0 0 0 1 0 0

$ ∨\ ⌽a ⍝ prefix scan(\) using the OR(∨) operator. Turn all
        ⍝ entries after the first 1 into a 1
  0 0 0 0 1 1 1 1 1 1 1 1 1

$ +/ (∨\ ⌽a)  ⍝ sum over the previous list, counting number of 1s
  9

$ ¯1 +  (+/ (∨\ ⌽a))  ⍝ subtract 1 from the previous number
  8
Why the hell does this work? Well, here's the proof: if the rightmost 1 is at index i in an array of length n, then the reversed array has its first 1 at position n-1-i, so the or-scan contains exactly n-(n-1-i) = i+1 ones; summing them and subtracting 1 recovers i. This technique will work for every row of a matrix. This is paramount, since we can now repeat it on the depth vector we were previously interested in, row by row, and thereby compute the parent index!

§ Converting from depth vector to parent vector, Take 3 (full matrix)

We want to extend the previous method we hit upon to compute the parents of all nodes in parallel. To perform this, we need to run the moral equivalent of the following:
$ ⎕IO ← 0 ⍝ 0 indexing
$ d ← (0  1  2  1  2  3  2  1  2  3  3  2  3  3  2) ⍝ depth vector
$ t ← 11 ⍝ node we are interested in
$ a←d[⍳t]=d[t]-1  ⍝ boolean vector of nodes whose depth is that of t's parent
  0 1 0 1 0 0 0 1 0 0 0
$ ¯1 +  (+/ (∨\ ⌽a)) ⍝ index of last 1 of the boolean vector
  7
for every single choice of t. To perform this, we can build a 2D matrix of d[⍳t]=d[t]-1 where t ranges over [0..len(d)-1] (ie, it ranges over all the nodes in the graph). We begin by using:
$ ⎕IO ← 0 ⋄ d ← (0  1  2  1  2  3  2  1  2  3  3  2  3  3  2) ⍝ depths
$ ]display ltdepth ← d ∘.> d ⍝ find `d[i] > d[j]` for all i, j.
↓0 0 0 0 0 0 0 0 0 0 0 0 0 0 0│
│1 0 0 0 0 0 0 0 0 0 0 0 0 0 0│
│1 1 0 1 0 0 0 1 0 0 0 0 0 0 0│
│1 0 0 0 0 0 0 0 0 0 0 0 0 0 0│
│1 1 0 1 0 0 0 1 0 0 0 0 0 0 0│
│1 1 1 1 1 0 1 1 1 0 0 1 0 0 1│
│1 1 0 1 0 0 0 1 0 0 0 0 0 0 0│
│1 0 0 0 0 0 0 0 0 0 0 0 0 0 0│
│1 1 0 1 0 0 0 1 0 0 0 0 0 0 0│
│1 1 1 1 1 0 1 1 1 0 0 1 0 0 1│
│1 1 1 1 1 0 1 1 1 0 0 1 0 0 1│
│1 1 0 1 0 0 0 1 0 0 0 0 0 0 0│
│1 1 1 1 1 0 1 1 1 0 0 1 0 0 1│
│1 1 1 1 1 0 1 1 1 0 0 1 0 0 1│
│1 1 0 1 0 0 0 1 0 0 0 0 0 0 0│
$ ⎕IO ← 0 ⋄ d ← (0  1  2  1  2  3  2  1  2  3  3  2  3  3  2) ⍝ depths
$ ]display left ←  (⍳3) ∘.> (⍳3) ⍝ find `i > j` for all i, j.
↓0 0 0│
│1 0 0│
│1 1 0│
Combining the three techniques, we can arrive at:
$ ⎕IO ← 0 ⋄ d ← (0  1  2  1  2  3  2  1  2  3  3  2  3  3  2) ⍝ depths
$ ltdepth ← d ∘.> d ⍝ find `d[i] > d[j]` for all i, j.
$ preds ←  (⍳≢d) ∘.> (⍳≢d) ⍝ predecessors: find `i > j` for all i, j.
$ pred_higher ←  ltdepth ∧ preds   ⍝ predecessors that are higher in the tree
$  parents_take_3 ← ¯1 +  +/∨\⌽pred_higher  ⍝ previous idiom for finding last 1.
¯1 0 1 0 3 4 3 0 7 8 8 7 11 11 7
For comparison, the actual value is:
    (0   1  2  3  4  5  6  7  8  9 10 11 12 13 14)  | indexes
d ← (0   1  2  1  2  3  2  1  2  3  3  2  3  3  2)  │ depths
P ← (0   0  1  0  3  4  3  0  7  8  8  7 11 11  7)  │ parent indices
    (¯1  0  1  0  3  4  3  0  7  8  8  7 11 11  7) | parents, take 3

              ∘:0                               0
a:1        b:3                 c:7              1
│      ┌───┴───┐     ┌──────────┼───────┐
p:2    q:4     r:6   s:8        t:11    u:14    2
       │             │          │
       │          ┌──┴──┐     ┌─┴───┐
       v:5        w:9   x:10  y:12  z:13        3
We have an off-by-one error for the 0 node! That's easily fixed, we simply perform a maximum with 0 to move ¯1 -> 0:
$  parents_take_3 ← 0⌈  ¯1 +  +/∨\⌽pred_higher
0 0 1 0 3 4 3 0 7 8 8 7 11 11 7
So, that's our function:
parents_take_3 ← 0⌈  ¯1 +  +/∨\⌽ ((d∘.>d) ∧ (⍳≢d)∘.>(⍳≢d))
0 0 1 0 3 4 3 0 7 8 8 7 11 11 7
Note that the time complexity for this is dominated by having to calculate the outer products, which, even given infinite parallelism, take O(n) time. We will slowly chip away at this to do far better.
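The whole of Take 3, transcribed into deliberately naive Python (the outer products become explicit double loops; this is O(n²) work, matching the APL):

```python
def parents_take_3(d):
    n = len(d)
    parents = []
    for t in range(n):
        # row t of (d∘.>d) ∧ (⍳n)∘.>(⍳n): predecessors strictly higher in the tree
        row = [1 if (t > j and d[t] > d[j]) else 0 for j in range(n)]
        # the ¯1 + +/∨\⌽ idiom: index of the last 1, or ¯1 if there is none
        acc, s = 0, 0
        for x in reversed(row):
            acc |= x
            s += acc
        parents.append(max(0, s - 1))  # 0⌈ fixes up the root
    return parents

d = [0, 1, 2, 1, 2, 3, 2, 1, 2, 3, 3, 2, 3, 3, 2]
print(parents_take_3(d))  # [0, 0, 1, 0, 3, 4, 3, 0, 7, 8, 8, 7, 11, 11, 7]
```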

§ Converting from depth vector to parent vector, Take 4 (log critical depth)

We will use the Key() operator which allows us to create key value pairs.
$ d ← 0 1 2 1 2 3 2 1 2 3 3 2 3 3 2
$ ]disp (⍳≢d) ,¨ d ⍝ zip d with indexes
│0 0│1 1│2 2│3 1│4 2│5 3│6 2│7 1│8 2│9 3│10 3│11 2│12 3│13 3│14 2│
$ d ← 0 1 2 1 2 3 2 1 2 3 3 2 3 3 2
$ ]display b ← {⍺ ⍵}⌸d  ⍝ each row i has tuple (i, js): d[js] = i
↓   ┌→┐             │
│ 0 │0│             │
│   └~┘             │
│   ┌→────┐         │
│ 1 │1 3 7│         │
│   └~────┘         │
│   ┌→────────────┐ │
│ 2 │2 4 6 8 11 14│ │
│   └~────────────┘ │
│   ┌→───────────┐  │
│ 3 │5 9 10 12 13│  │
│   └~───────────┘  │
In fact, it allows us to apply an arbitrary function to combine keys and values. We will use a function that simply returns all the values for each key.
$ d ← 0 1 2 1 2 3 2 1 2 3 3 2 3 3 2
$ ]display b ← {⍵}⌸d ⍝ each row i contains values j such that d[j] = i.
↓0 0  0  0  0  0│
│1 3  7  0  0  0│
│2 4  6  8 11 14│
│5 9 10 12 13  0│
Our first try doesn't quite work: it winds up trying to create a numeric matrix, which means that we can't have rows of different sizes. So, the information that only index 0 is such that d[0] = 0 is lost. What we can do is to enclose (⊂) the values, arriving at:
$ d ← 0 1 2 1 2 3 2 1 2 3 3 2 3 3 2
$ ]display b ← {⊂⍵}⌸d ⍝ d[b[i]] = i
│ ┌→┐ ┌→────┐ ┌→────────────┐ ┌→───────────┐ │
│ │0│ │1 3 7│ │2 4 6 8 11 14│ │5 9 10 12 13│ │
│ └~┘ └~────┘ └~────────────┘ └~───────────┘ │
Consider the groups b[2] = (2 4 6 8 11 14) and b[3] = (5 9 10 12 13). All of 3's parents are present in 2. Every element in 3 fits at some location in 2. Here is what the fit would look like:
b[2]  2 4 _ 6 8 _  _ 11 __ __ 14   (nodes of depth 2)
b[3]      5     9  10   12 13      (leaf nodes)
          4     8   8   11 11      (parents: predecessor of b[3] in b[2])

              ∘:0                               0
a:1        b:3                 c:7              1
│      ┌───┴───┐     ┌──────────┼───────┐
p:2    q:4     r:6   s:8        t:11    u:14    2
       │             │          │
       │          ┌──┴──┐     ┌─┴───┐
       v:5        w:9   x:10  y:12  z:13        3
We use the Interval Index (⍸) operator to solve the problem of finding the parent, i.e. where we should squeeze a node from b[3] into b[2] (this is formally known as the predecessor problem):
⍝ a[i] is the index of the closest number in left that is ≤ right[i]:
⍝ left[a[i]] is the predecessor of right[i] in left.
$ a ← (1 10 100 1000) ⍸ (1 2000 300 50 2 )
0 3 2 1 0
Now, we can use the technology of predecessor to find parents of depth 3 nodes among the depth 2 nodes:
$ depth2 ← 2 4 6 8 11 14
$ depth3 ← 5 9 10 12 13 ⍝ parents (from chart): 4 8 8 11 11
$ depth3parentixs ← depth2 ⍸ depth3
$ depth3parents  ← depth2[depth3parentixs]
4 8 8 11 11
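In Python terms, Interval Index over a sorted left argument is a predecessor search; a sketch using the standard bisect module:

```python
import bisect

def interval_index(keys, xs):
    # for each x, the index of the rightmost element of `keys` that is <= x
    # (-1 if there is none), mirroring keys ⍸ xs
    return [bisect.bisect_right(keys, x) - 1 for x in xs]

print(interval_index([1, 10, 100, 1000], [1, 2000, 300, 50, 2]))  # [0, 3, 2, 1, 0]

depth2 = [2, 4, 6, 8, 11, 14]
depth3 = [5, 9, 10, 12, 13]
ixs = interval_index(depth2, depth3)
print([depth2[i] for i in ixs])  # [4, 8, 8, 11, 11]
```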

              ∘:0                               0
a:1        b:3                 c:7              1
│      ┌───┴───┐     ┌──────────┼───────┐
p:2    q:4     r:6   s:8        t:11    u:14    2
       │             │          │
       │          ┌──┴──┐     ┌─┴───┐
       v:5        w:9   x:10  y:12  z:13        3
We need to know one more APL-ism: the 2-reduce (windowed reduce). When we write a usual reduce operation, we have:
$ ⍳5
1 2 3 4 5
$ +/⍳5 ⍝ reduce
15
$ 2+/⍳5 ⍝ apply + to _pairs_ (2 = pairs)
3 5 7 9 ⍝ (1+2) (2+3) (3+4) (4+5)
$ 3+/⍳5 ⍝  apply + to 3-tuples
6 9 12 ⍝ (1+2+3) (2+3+4) (3+4+5)
We begin by assuming the parent of i is i by using p←⍳≢d.
$ d ← (0 1 2 1 2 3 2 1 2 3 3 2 3 3 2)
$ d2nodes ← {⊂⍵}⌸d
│1│2 4 8│3 5 7 9 12 15│6 10 11 13 14│
$ p←⍳≢d
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Now comes the biggie:
$ findparent ← {parentixs ← ⍺⍸⍵ ⋄ p[⍵]←⍺[parentixs]}
This finally culminates in:
$ d←0 1 2 1 2 3 2 1 2 3 3 2 3 3 2
$ p←⍳≢d ⋄ d2nodes←{⊂⍵}⌸d ⋄ findp←{pix ← ⍺⍸⍵ ⋄ p[⍵]←⍺[pix]} ⋄ 2findp/d2nodes ⋄ p
0 0 1 0 3 4 3 0 7 8 8 7 11 11 7

    (0   1  2  3  4  5  6  7  8  9 10 11 12 13 14)  | indexes
d ← (0   1  2  1  2  3  2  1  2  3  3  2  3  3  2)  │ depths
P ← (0   0  1  0  3  4  3  0  7  8  8  7 11 11  7)  │ parent indices
              ∘:0                               0
a:1        b:3                 c:7              1
│      ┌───┴───┐     ┌──────────┼───────┐
p:2    q:4     r:6   s:8        t:11    u:14    2
       │             │          │
       │          ┌──┴──┐     ┌─┴───┐
       v:5        w:9   x:10  y:12  z:13        3
Which can be further golfed to:
$ p⊣2{p[⍵]←⍺[⍺⍸⍵]}⌿⊢∘⊂⌸d⊣p←⍳≢d
0 0 1 0 3 4 3 0 7 8 8 7 11 11 7
The total time complexity of this method, assuming infinite parallelism: the 2findp/ pass runs once per adjacent pair of depth levels, and each step is a batch of parallel binary searches (⍸) of depth O(log n), giving a critical path proportional to the tree depth times log n:
$ p←⍳≢d ⋄ d2nodes←{⊂⍵}⌸d ⋄ findp←{pix ← ⍺⍸⍵ ⋄ p[⍵]←⍺[pix]} ⋄ 2findp/d2nodes ⋄ p
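The same algorithm in Python, with a dict-of-lists standing in for Key (⌸) and bisect standing in for Interval Index (⍸); the pass over consecutive depth levels plays the role of 2findp/:

```python
import bisect
from collections import defaultdict

def parents_take_4(d):
    levels = defaultdict(list)  # ⌸ : group node indices by depth
    for i, depth in enumerate(d):
        levels[depth].append(i)
    p = list(range(len(d)))     # start by assuming the parent of i is i
    depths = sorted(levels)
    # one findp step per adjacent pair of levels:
    # parents of level k+1 live in level k
    for a, b in zip(depths, depths[1:]):
        prev, cur = levels[a], levels[b]
        for node in cur:
            p[node] = prev[bisect.bisect_right(prev, node) - 1]
    return p

d = [0, 1, 2, 1, 2, 3, 2, 1, 2, 3, 3, 2, 3, 3, 2]
print(parents_take_4(d))  # [0, 0, 1, 0, 3, 4, 3, 0, 7, 8, 8, 7, 11, 11, 7]
```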

§ 3.4: Computing nearest Parent by predicate

I'm going to simplify the original presentation by quite a bit.
     a b c d e f g h i  | names
     0 1 2 3 4 5 6 7 8  | indexes
P ← (0 0 1 2 0 4 5 6 7) | parents
X ← (0 1 0 0 1 1 0 0 0) | marked nodes

b:1(X)   e:4(X)
|        |
c:2      f:5(X)
|        |
d:3      g:6
We want to find, for each node, the closest ancestor marked as X. The X vector is a boolean vector that has a 1 at the index of each X node: (b, e, f). So, the indexes (1, 4, 5) are 1 in the X vector. The output we want is the vector:
      0 1 2 3 4 5 6 7 8  | indexes
      a b c d e f g h i  | names
PX ← (0 0 1 1 0 4 5 5 5) | closest X parent index
      a a b b a e f f f  | closest X parent name

b:1(X)   e:4(X)
|        |
c:2      f:5(X)
|        |
d:3      g:6
The incantation is:
$ I←{(⊂⍵)⌷⍺} ⍝ index LHS by RHS | (100 101 102 103)[(3 1 2)] := 103 101 102
$ PX ← P I@{X[⍵]≠1} ⍣ ≡ P
0 0 1 1 0 4 5 5 5
TODO. At any rate, since this only reads and performs no writes, and needs no synchronization, it is fairly straightforward to implement on the GPU.
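A Python sketch of the fixpoint iteration (⍣≡): start from the parent vector, and at every position whose current target is unmarked, hop one level up; repeat until nothing changes (the root being its own parent guarantees termination):

```python
def closest_marked_parent(P, X):
    PX = list(P)  # start from the plain parent vector
    while True:
        # where the current target is not an X node, follow one more parent link
        nxt = [P[t] if X[t] != 1 else t for t in PX]
        if nxt == PX:
            return PX
        PX = nxt

P = [0, 0, 1, 2, 0, 4, 5, 6, 7]
X = [0, 1, 0, 0, 1, 1, 0, 0, 0]
print(closest_marked_parent(P, X))  # [0, 0, 1, 1, 0, 4, 5, 5, 5]
```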

§ 3.5: Lifting subtrees to the root

Once we have marked our X nodes, we now wish to lift entire subtrees of X up to the root. There is some good advice in the thesis:
When using APL primitives this way, it may be good to map their names and definitions to the domain of trees. For example, the primitive ⍸Predicate is read as "the nodes where Predicate holds" and not as "the indexes where Predicate is 1".
For example, given the tree:
      0 1 2 3 4 5  | indexes
      a b c d e f  | names
P  ← (0 0 1 0 3 4) | parents
X  ← (0 1 0 1 1 0) | X nodes
PX ← (0 0 1 0 3 4) | closest X parent index

b:1(X)   d:3(X)
|        |
c:2      e:4(X)
we want the transformed tree to be:
bp:1(X)   ep:4(X)
We first look for nodes that need to be lifted. These are:
nodes←⍸(X ∧ p≠(⍳≢p))  ⍝ ⍸:pick indexes.

§ 3.6: Wrapping Expressions

§ 3.7: Lifting Guard Test Expressions

§ 3.8: Counting rank of index operators

§ 3.9: Flattening Expressions

§ 3.10: Associating Frame slots and variables

§ 3.11: Placing frames into a lexical stack

§ 3.12: Recording Exported names

§ 3.13: Lexical Resolution

§ 5.2.1 Traversal Idioms

§ 5.2.2 Edge Mutation Idioms

§ 5.2.3 Node Mutation Idioms

§ Things I wish I knew when I was learning APL

§ Every ideal that is maximal wrt. being disjoint from a multiplicative subset is prime

I ran across this when reading another question on math.se, so I posted this proof for verification just to be sure I wasn't missing something. We wish to characterise prime ideals as precisely those that are maximal with respect to being disjoint from a multiplicative subset S ⊆ R. I'll be using the definition of prime: an ideal P is prime iff ab ∈ P implies a ∈ P or b ∈ P.

§ Prime ideal implies complement is maximal multiplicative subset:

Let S ≡ R ∖ P be the complement of the prime ideal P ⊆ R in question.

§ Ideal whose complement is maximal multiplicative subset implies ideal is prime.

§ Getting started with APL

§ SpaceChem was the best compiler I ever used

It's kind of sad that this is the case but, on thinking about it, I realised that the game SpaceChem was essentially a compiler, and it was such a pleasure to learn to use and debug --- its visual nature made it amazing for finding out what was going wrong.

§ Mnemonic for Kruskal and Prim

I often forget which is which, so I came up with this:

§ Legendre transform

§ Cartesian Trees

Cartesian trees construct a tree T = C(A) given an array A, such that range minimum query (RMQ) on the array A is equivalent to the lowest common ancestor (LCA) of the nodes of the tree T. Note that the tree looks like a min-heap. To see the connection to LCA, if we want to find the range minimum in the range containing the elements [12, 10, 20, 15, 18] of the array, the minimum is 10, which is indeed the lowest common ancestor of the nodes of 12 and 18 in the tree.

§ Building a Cartesian tree in linear time:
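A minimal Python sketch of the standard stack-based construction (not the lecture's presentation): sweep left to right maintaining the right spine of the tree built so far; each index is pushed and popped at most once, so the whole build is O(n). Here parent[i] gives the parent index, with -1 for the root:

```python
def cartesian_tree(A):
    parent = [-1] * len(A)
    spine = []  # indices on the right spine, values increasing (min-heap order)
    for i, x in enumerate(A):
        last = -1
        # pop the part of the spine whose values are larger than x ...
        while spine and A[spine[-1]] > x:
            last = spine.pop()
        if last != -1:
            parent[last] = i  # ... it hangs below x as a left subtree
        if spine:
            parent[i] = spine[-1]  # x becomes the right child of the spine top
        spine.append(i)
    return parent

A = [12, 10, 20, 15, 18]
print(cartesian_tree(A))  # [1, -1, 3, 1, 3] : the root holds 10, as in a min-heap
```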

§ Converting LCA to RMQ

We can go the other way, and convert an LCA problem into a RMQ problem. We perform an inorder traversal of the nodes, scribbling down the depth of the node (Link to lecture at 15:30). We ask for the argmin version of RMQ, that gives us the index of the node with the lowest depth. This gives us the index of where the node lives.

§ Universe reduction in RMQ

We can have an arbitrary ordered universe, on which we want to perform RMQ. We can convert this to LCA by using a cartesian tree, and then convert to a "clean" RMQ (by using the LCA -> RMQ using depth conversion). This now will give us way faster operations (since we now have integers).

§ +-1 RMQ:

We want the differences between adjacent entries to be only +-1; earlier we had a much wider gap. Here, we perform an Euler tour (walk the tree in DFS order), and scribble down every vertex we visit. To find the LCA, we perform the RMQ between the locations of the first occurrence of each node. (I think we don't actually need the first occurrence; any occurrence will do.)

§ References

§ DFS numbers as a monotone map

Really, we want a partial order that is defined with the tree as the Hasse diagram. However, performing operations on this is hard. Hence, the DFS numbering is a good monotone map from this partial order to the naturals, which creates a total order. I want to think about this more deeply; I feel that this might be a good way to think about the low numbers that show up in Tarjan's algorithm for strongly connected components. This also begs the question: can we use other partial orders, that discard some information, but don't lose as much as going all the way to a total order (the naturals) does?

§ Self attention? not really

The code is taken from The Annotated Transformer, which explains the "Attention is All You Need" paper. On skimming the code, one sees the delightful line of code:
class EncoderLayer(nn.Module):
  "Encoder is made up of self-attn and feed forward (defined below)"
  def __init__(self, size, self_attn, feed_forward, dropout):
     super(EncoderLayer, self).__init__()
     self.self_attn = self_attn
     self.feed_forward = feed_forward
     self.sublayer = clones(SublayerConnection(size, dropout), 2)
     self.size = size
  def forward(self, x, mask):
    "Follow Figure 1 (left) for connections."
    x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, mask))
    return self.sublayer[1](x, self.feed_forward)
where the line:
x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, mask))
seems to imply that we are, indeed, performing a self attention with the same value x as the query, key, and value. However, reading the code of the self-attention (or the paper) leads one to realise:
class MultiHeadedAttention(nn.Module):
  def __init__(self, h, d_model, dropout=0.1):
    "Take in model size and number of heads."
    super(MultiHeadedAttention, self).__init__()
    assert d_model % h == 0
    # We assume d_v always equals d_k
    self.d_k = d_model // h
    self.h = h
    self.linears = clones(nn.Linear(d_model, d_model), 4)
    self.attn = None
    self.dropout = nn.Dropout(p=dropout)

  def forward(self, query, key, value, mask=None):
    "Implements Figure 2"
    if mask is not None:
        # Same mask applied to all h heads.
        mask = mask.unsqueeze(1)
    nbatches = query.size(0)

    # 1) Do all the linear projections in batch from d_model => h x d_k
    query, key, value = \
        [l(x).view(nbatches, -1, self.h, self.d_k).transpose(1, 2)
         for l, x in zip(self.linears, (query, key, value))]

    # 2) Apply attention on all the projected vectors in batch.
    x, self.attn = attention(query, key, value, mask=mask,
                             dropout=self.dropout)

    # 3) "Concat" using a view and apply a final linear.
    x = x.transpose(1, 2).contiguous() \
         .view(nbatches, -1, self.h * self.d_k)
    return self.linears[-1](x)
where we notice:
## 1) Do all the linear projections in batch from d_model => h x d_k
query, key, value = \
  [l(x).view(nbatches, -1, self.h, self.d_k).transpose(1, 2)
   for l, x in zip(self.linears, (query, key, value))]

## 2) Apply attention on all the projected vectors in batch.
x, self.attn = attention(query, key, value, mask=mask,
                         dropout=self.dropout)
where we see that query, key, value are being linearly transformed before being used. Hence, an input of (x, x, x) is transformed to (q′, k′, v′) = (Qx, Kx, Vx) where Q, K, V are arbitrary matrices. Next, when we pass these into attention, the output we get is:
softmax(q′ k′ᵀ) v′ = softmax((Q x) (K x)ᵀ) (V x) = softmax(Q x xᵀ Kᵀ) V x
the code below is the same thing, spelled out:
def attention(query, key, value, mask=None, dropout=None):
    "Compute 'Scaled Dot Product Attention'"
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) \
             / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    p_attn = F.softmax(scores, dim = -1)
    if dropout is not None:
        p_attn = dropout(p_attn)
    return torch.matmul(p_attn, value), p_attn
So it's not really self-attention: it's more like modulated attention to self :)

§ Coarse structures

A coarse structure on the set X is a collection E ⊆ 2^(X × X) of relations on X (called controlled sets, or entourages) satisfying certain axioms. The controlled sets are the "small" sets. The bounded coarse structure on a metric space (X, d) consists of all relations whose related elements lie within some uniform bounded distance:
(e ⊆ X × X) ∈ E  ⟺  ∃ δ ∈ ℝ, ∀ (x, y) ∈ e, d(x, y) < δ
We can check that the functions: are coarse inverses to each other. I am interested in this because if topology is related to semidecidability, then coarse structures (which are their dual) are related to..?

§ References

§ Matroids for greedy algorithms

§ Definitions of matroids

A matroid M is a set X equipped with an independence set I ⊆ 2^X.

§ Example 1: Linearly independent sets

Let V be a vector space. The independent sets I are of the form:
I ≡ { S ⊆ V  :  vectors in S are linearly independent }
This is an independence system because the empty set is linearly independent, and subsets of a linearly independent collection of vectors will be linearly independent. The exchange property is satisfied because of linear algebraic reasons.

§ Example 2: The graphic/cyclic Matroid: Matroid of Forests

Let G = (V, E) be a graph. Then collections of edges of the form:
I ≡ { F ⊆ E : F contains no cycles } 
is an independence system because the empty forest is a forest, and a subset of edges of a forest continues to be a forest. To check the exchange property, TODO

§ Example 3: The partition matroid

Consider the partition matroid M ≡ (E, I), where we have a partitioning of E into sets Ei with numbers ki; the independence set consists of subsets F which have at most ki elements in common with each Ei:
I ≡ { F ⊆ E  : ∀ i = 1, …, N,  | F ⋂ Ei | ≤ ki }
The independence axioms are intuitively satisfied, since our constraints on picking edges are of the form | F ⋂ Ei | ≤ ki, which continue to hold as F becomes smaller. For the exchange axiom, let | Y | > | X |. Then for some index i, it must be the case that | Y ⋂ Ei | > | X ⋂ Ei |. Hence, we can add an element of Ei ∩ (Y ∖ X) into X whilst still maintaining independence.

§ Bases and Circuits

A matroid can be completely categorized by knowing either the bases or the circuits of that matroid.

§ Unique Circuit property

Let S be an independent set and let e be an element such that S ∪ { e } is dependent. Then there exists a unique circuit C ⊆ S ∪ { e }. That is, when we go from independent to dependent by adding an element, we will have a single, unique circuit. For example, when we add an edge into a forest to create a cycle, this cycle will be unique!

§ Proof

Let C1, C2 be circuits contained in S ∪ { e }; we wish to show C1 = C2. Notice that C1, C2 must both contain e: if one of them did not, it would be a circuit inside S, contradicting the assumption that S is independent. Now suppose C1 ≠ C2. Applying circuit elimination to C1, C2 and their common element e yields a circuit C ⊆ (C1 ∪ C2) ∖ { e } ⊆ S, which again contradicts the independence of S. Hence C1 = C2, and the circuit is unique.

§ Rank functions

A rank function of a matroid M ≡ ⟨ X, I ⟩ is a function:
r: 2^X → ℕ : r(S) ≡ max{ | T | : T ⊆ S ∧ T ∈ I }
That is, for any subset S ⊆ X, r(S) is the cardinality of the largest independent subset of S. TODO: picture

§ Bipartite matching as matroid intersection

Matchings in a bipartite graph G = (V, E) with partition (A, B) arise as the intersection of the independent sets of two matroids. We will denote by δ: V → 2^E the function which takes a vertex to the set of edges incident on that vertex. Let MA be a partition matroid: MA ≡ (E, IA) where IA is:
IA ≡ { F ⊆ E : | F ⋂ δ(a) | ≤ 1 ∀ a ∈ A }
That is, in IA, every independent set has at most one edge incident on each vertex of A. We need to check that this is an independence system. The empty set of no edges is independent. If some collection of edges has at most one incident edge per vertex, then removing edges can only decrease incidence; hence it's also downward closed. TODO: add picture. Similarly, we define MB ≡ (E, IB):
IB ≡ { F ⊆ E : | F ⋂ δ(b) | ≤ 1 ∀ b ∈ B }
Now, notice that any collection of edges F ∈ IA ∩ IB is a legal matching, since the edges cover each vertex of A and of B at most once. The largest element of IA ∩ IB is the maximum matching we are looking for.

§ Largest common independent set

Given two matroids M1 ≡ (E, I1), M2 ≡ (E, I2) with rank functions r1 and r2, let S ∈ I1 ∩ I2 and let F ⊆ E.

§ References:

§ Grokking Zariski

There's a lot written on the Zariski topology on the internet, but most of them lack explicit examples and pictures. This is my attempt to communicate what the Zariski topology looks like, from the perspectives that tickle my fancy (a wealth of concrete examples, topology-as-semi-decidability, and pictures).

§ The Zariski topology

Recall that the Zariski topology is defined by talking about what its closed sets are. The common zero sets of a family of polynomials F ⊆ ℝ[X1, X2, … Xn] are the closed sets of the Zariski topology. Formally, the topology (ℝn, τ) has as closed sets:
{ x ∈ ℝn : ∀ fi ∈ F,  fi(x) = 0  }
Open sets (the complements of closed sets) are of the form:
{ x ∈ ℝn : ∃ fi ∈ F,  fi(x) ≠ 0  } ∈ τ
The empty set is generated as { x ∈ ℝn : 0 ≠ 0 } and the full set is generated as { x ∈ ℝn : 1 ≠ 0 }.

§ Semi-decidability

Recall that in this view of topology, for a space (X, τ) and every open set O ∈ τ, we associate a Turing machine TO which semi-decides inclusion. That is, if a point is in O then TO halts with the output IN-SET; if the point is not in O, TO loops forever. Formally:
x ∈ O  ⟺  TO halts on input x
x ∉ O  ⟺  TO does not halt on input x
Alternatively, for a closed set C, we associate a Turing machine TC which semi-decides exclusion. That is, if a point is not in C, TC halts with the output NOT-IN-SET. If the point is in C, the Turing machine TC loops forever. Now in the case of polynomials, we can write a function that semidecides exclusion from closed sets:
# return NOT-IN-SET if x0 is not a zero of the polynomial
def is_not_in_zeros(poly, x0):
  precision = 0 # start with zero precision
  while True:
    if poly(x0[:precision]) != 0: return "NOT-IN-SET"
    precision += 1 # up the precision
Since we can only evaluate a polynomial up to some finite precision, we start with zero precision and then gradually get more precise. If some x0 is not a root of poly, then at some point we will have poly(x0) ≠ 0. If x0 is indeed a root, then this process never halts; we will keep getting poly(x0[:precision]) = 0 for every level of precision.

§ Spec(R)

§ Spec(ℤ) from this perspective

Since ℤ is a PID, we can think of numbers instead of ideals. The above picture asks us to think of a number n as a function from a prime to a residue: n(p) ≡ n mod p. We then have that the closed sets are those sets of primes which zero out some number. That is:
V(I) = { x ∈ Spec(ℤ) : I(x) = 0 }
V((n)) = { (p) ∈ Spec(ℤ) : (n) mod (p) = 0 }
So in our mind's eye, we need to imagine a space of prime ideals (points), which we test against all ideals (polynomials). Given a set of prime ideals (a tentative locus, say a circle), the set of prime ideals is closed if it occurs as the zero set of some collection of ideals (just as a locus is closed if it occurs as the zero set of some collection of polynomials).

§ My preferred version of quicksort

Wikipedia lists the implementation of quicksort as:
algorithm quicksort(A, lo, hi) is
    if lo < hi then
        p := partition(A, lo, hi)
        quicksort(A, lo, p - 1)
        quicksort(A, p + 1, hi)

algorithm partition(A, lo, hi) is
    pivot := A[hi]
    i := lo
    for j := lo to hi do
        if A[j] < pivot then
            swap A[i] with A[j]
            i := i + 1
    swap A[i] with A[hi]
    return i
Here, the indices [lo..i-1] have values less than the pivot, while [i..j] are greater than or equal to the pivot.

§ The version I prefer

#define SWAP(ix, ix2) { int t = a[ix]; a[ix] = a[ix2]; a[ix2] = t; }
// sorts the interval [l, r]
void qs(int l, int r) {
    if (r - l < 0) return;
    int part = a[r]; // the partition

    // a[getill..r] >= part (getill = greater or equal till)
    // starts at r since we know that a[r] >= (partition=a[r])
    int getill = r;
    // a[l..lttill] < part (lttill = less than till)
    // starts at (l-1) since we do not know about any value < partition
    int lttill = l-1;

    // loop until they start probing into the other set
    while(!(lttill+1 >= getill || getill-1 <= lttill)) {
        // if the speculated element is < partition
        if (a[getill-1] < part) {
            // swap the value at getill-1 with the slot at lttill+1
            SWAP(getill-1, lttill+1);
            // increment lttill, since we KNOW that the value now
            // at lttill+1 (the old a[getill-1]) is < part
            lttill++;
        } else {
            // all we know is that a[getill-1] >= part, so we can engulf
            // it into the >= region
            getill--;
        }
    }
    // the partitions must be next to each other, since we have engulfed everything
    assert(getill - lttill == 1);
    // move the partition value to the center.
    SWAP(getill, r);

    // recurse: solve [l..lttill] (leave getill=part alone), solve [getill+1..r]
    qs(l, lttill);
    qs(getill+1, r);
}
This implementation makes very clear to me what information is "known", and what is being "probed"/"tentative". The termination condition is clear: when one partition starts reaching into the other partition's resources, it's done. Due to using closed intervals everywhere, it's very easy to see precisely what data starts and ends where. What version of quicksort do you prefer? Drop me an email!

§ Geometric proof of Cauchy Schwarz inequality

Here's one fun application of Cauchy-Schwarz: we can apply it to the two vectors x=(√a, √b) and y=(√b, √a) to derive the AM-GM inequality.
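Spelled out: Cauchy-Schwarz states ⟨x, y⟩ ≤ ‖x‖‖y‖, and with this choice of x and y both sides simplify pleasantly:

```latex
\langle x, y \rangle = \sqrt{a}\sqrt{b} + \sqrt{b}\sqrt{a} = 2\sqrt{ab},
\qquad
\|x\|\,\|y\| = \sqrt{a+b}\,\sqrt{b+a} = a + b
```

so ⟨x, y⟩ ≤ ‖x‖‖y‖ becomes 2√(ab) ≤ a + b, i.e. √(ab) ≤ (a+b)/2.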

§ Dataflow analysis using Grobner basis

This was a quick experiment in using Grobner bases to model situations. We can represent our dataflow analysis constraints in terms of polynomial rewrites over F2. Given the program:
p = { 0: ["=", "x", 'y'],
      1: ['br', 2, 100],
      2: ['=', 'z', 'x'],
      3: ['br', 2],
      100: ['ret', 'z'] }
whose semantics I hope are fairly straightforward --- the dictionary represents instruction locations, instructions proceed sequentially, and br moves control flow around. Note that br can branch to multiple locations, since we are not control-flow sensitive. The idea is that since in a dataflow analysis we need information about each variable at each program point, we can create a ring of polynomials over F2 with one indeterminate per (variable, program point) pair. So in this case, we would need:
R = F_2[x0, y0, z0, x1, y1, z1, x2, y2, z2, x3, y3, z3, x100, y100, z100]
We then add elements into the ideal that represent our constraints. For example, to perform liveness analysis, we need constraints saying that if a variable z is live at a point, all variables used to compute z are live as well. This sets up equations that may have cycles (in the case of loops). These are usually resolved using Kildall's algorithm. However, we can also ask SAGE to kindly compute the Grobner basis. I hypothesize that the "easy" dataflow problems ought to be toric ideals, which admit much faster solutions.

§ Fenwick trees and orbits

I learnt of a nice, formal way to prove the correctness of Fenwick trees in terms of orbits that I wish to reproduce here. One can use a Fenwick tree to perform cumulative sums Sum(n) ≡ ∑_{i ≤ n} A[i], and updates Upd(i, v) ≡ A[i] += v. Naively, cumulative sums take O(n) time and updates take O(1) time. A Fenwick tree performs both in O(log n). In general, we can perform any monoid-based catenation and update in O(log n).

§ organization

We allow indexes [1, 2, …, n]. The node with factorization i ≡ 2^k × l, 2 ∤ l (that is, 2^k is the largest power of 2 dividing i) is responsible for the interval [i−2^k+1, i] = (i−2^k, i]. I'm going to state all the manipulations in terms of prime factorizations, since I find it far more intuitive than bit-fiddling. In general, I want to find a new framework to discover and analyze bit-fiddling-heavy algorithms. Some examples of the range of responsibility of an index are:

§ query

To perform a cumulative sum, we need to read from the correct overlap regions that cover the full array. For example, to read from 15, we would want to read: So we need to read the indices: At each location, we strip off the value 2^r. We can discover this value with bit-fiddling: we claim that a & (−a) = 2^r. Let a = x10^r (a bitstring x, followed by a 1 and then r zeros). Then ¬a = ¬x01^r, so −a = ¬a + 1 = ¬x01^r + 1 = ¬x10^r. Hence a & (−a) = (x10^r) & (¬x10^r) = 0^|x|10^r = 2^r. So the full implementation of query is:
#define LSB(x) x&(-x)
int a[N];
int q(int i) {
    int s = 0;
    while (i > 0) { s += a[i]; i -= LSB(i); }
    return s;
}

§ update

To perform an update at i, we need to update all locations which on querying overlap with i. For example, to update the location 9, we would want to update: So we need to update the indices: We use the same bit-fiddling technique as above to strip off the value 2r
#define LSB(x) x&(-x)
int tree[N];
int u(int i, int v) {
    while (i < N) { tree[i] += v; i += LSB(i); }
}
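The two loops above, transcribed into Python and sanity-checked against a naive prefix sum (a sketch; the tree is 1-indexed, as in the text):

```python
import random

def lsb(i): return i & (-i)

class Fenwick:
    def __init__(self, n):
        self.n = n
        self.tree = [0] * (n + 1)  # 1-indexed

    def update(self, i, v):
        # add v at index i: walk up through every node responsible for i
        while i <= self.n:
            self.tree[i] += v
            i += lsb(i)

    def query(self, i):
        # cumulative sum a[1..i]: accumulate disjoint covering intervals
        s = 0
        while i > 0:
            s += self.tree[i]
            i -= lsb(i)
        return s

random.seed(0)
f, naive = Fenwick(16), [0] * 17
for _ in range(100):
    i, v = random.randint(1, 16), random.randint(-5, 5)
    f.update(i, v)
    naive[i] += v
assert all(f.query(q) == sum(naive[1 : q + 1]) for q in range(1, 17))
```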

§ correctness

We wish to analyze the operations Query(q) ≡ ∑_{i=1}^{q} a[i] and Update(i, val) ≡ a[i] += val. To do this, we are allowed to maintain an auxiliary array d which we will manipulate. We will stipulate conditions on the operations on d such that they reflect the values of Query and Update, albeit much faster. We will analyze the algorithm in terms of orbits. We have two operators, one for update called U, and one for query called Q. Given an index i, repeatedly applying the query operator gives us the indices we need to read and accumulate from the underlying array to get the total sum a[1..i]. Given an index u, repeatedly applying the update operator U gives us all the indices to which we need to add the change. For query and update to work, we need the condition that the orbits intersect if and only if the query index q includes the update location u. The intuition is that we want updates at an index u to only affect queries that occur at indices q ≥ u. Hence, we axiomatise that for an update to be legal, it must meet exactly the orbits of queries at indices greater than or equal to it. We will show that our operators do satisfy these conditions. For a quick numerical check, we can use the code below to ensure that the orbits intersect appropriately:
## calculate orbits of query and update in fenwick tree

def lsb(i): return i & (-i)
def U(i): return i + lsb(i)
def Q(i): return i - lsb(i)
def orbit(f, i):
    s = set()
    while i not in s and i > 0 and i < 64:
        s.add(i); i = f(i)
    return s

if __name__ == "__main__":
    for q in range(1, 16):
        for u in range(1, 16):
            qo = orbit(Q, q); uo = orbit(U, u)
            c = qo.intersection(uo)
            print("q:%4s | u:%4s | qo: %20s | uo: %20s | qu: %4s" %
                  (q, u, qo, uo, c))


§ Case 1: q = u

We note that Q always decreases the value of q, and U always increases u. Hence, if q = u, they meet at this point, and for all i, j ≥ 1, Q^i(q) ≠ U^j(u). Hence, they meet exactly once, as required.

§ Case 2: q < u

As noted above, q always decreases and u always increases; hence in this case they will never meet, as required.

§ Case 3: q > u

Let the entire array have size 2^N. Write the bitstrings q = e1f_q and u = e0f_u, where e is the common prefix and f_q, f_u may be empty. Notice that Q always strips away the rightmost ones of f_q, leading to q = e10…0 at some point. Similarly, U keeps adding new rightmost ones and eventually carries over: u = e01…10…0 →_U e10…0. Hence, at some point q = u.

§ References

§ Dirichlet inversion

We call all functions f: ℤ → ℝ arithmetic functions, since they operate on the integers. We introduce an operator f ⋆ g: ℤ → ℝ, defined by:
(f ⋆ g)(n) ≡ ∑_{d | n} f(d) g(n/d)
We will show that the set of arithmetic functions forms a group under the operator ⋆, with identity:
id(n) ≡ ⌊1/n⌋, which is 1 when n = 1 and 0 otherwise.
The reason all of this is interesting is that the inverse of the constant function 1(n) ≡ 1 is going to be the function called the Mobius function µ. Writing n = p_1^{α_1} p_2^{α_2} … p_r^{α_r}:

µ(n) ≡ 0 if any α_i > 1
µ(n) ≡ (−1)^{α_1 + α_2 + … + α_r} if all α_i ∈ { 0, 1 }
The Mobius function will allow us to perform Mobius inversion. Suppose f(n) ≡ ∑_{d | n} g(d). Then:

f(n) = ∑_{d | n} g(d) 1(n/d) = (g ⋆ 1)(n)
f ⋆ 1⁻¹ = g ⋆ 1 ⋆ 1⁻¹
f ⋆ µ = g

That is, we originally had f defined in terms of g. We can recover an expression for g in terms of f.

§ The algebra of multiplicative functions

We claim that the set of functions { f: ℤ⁺ → ℂ : f(1) ≠ 0 } is a commutative group, with the group operation (f ⋆ g)(n) ≡ ∑_{d | n} f(d) g(n/d), and with the identity element being id(n) ≡ ⌊1/n⌋. The idea is that if n = 1, then ⌊1/1⌋ = 1, and for any other number n > 1, 0 < 1/n < 1, hence ⌊1/n⌋ = 0.

§ Verification of id being the identity

(f ⋆ id)(n) ≡ ∑_{d | n} f(n/d) id(d) = f(n) id(1) + ∑_{d | n, d > 1} f(n/d) id(d) = f(n) · 1 + ∑_{d | n, d > 1} f(n/d) · 0 = f(n)

§ Associativity, commutativity of ⋆

To prove associativity, it's better to write the formula as:

(f ⋆ g)(n) ≡ ∑_{d | n} f(d) g(n/d) = ∑_{xy = n} f(x) g(y)

From this rewrite, it's clear that (f ⋆ g ⋆ h)(n) will unambiguously sum over triples (x, y, z) such that xyz = n. I leave working this out to you. This should also make the commutativity immediate: summing over pairs of the form f(x) g(y) : xy = n is the same as summing over f(y) g(x) : yx = n.
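A quick numerical spot-check of associativity and commutativity, as a sketch (the helper conv and the sample functions are mine):

```python
def conv(f, g):
    # Dirichlet convolution (f * g)(n) = sum over d | n of f(d) g(n/d)
    return lambda n: sum(f(d) * g(n // d) for d in range(1, n + 1) if n % d == 0)

f = lambda n: n          # example arithmetic functions
g = lambda n: n * n
h = lambda n: 1

for n in range(1, 40):
    assert conv(conv(f, g), h)(n) == conv(f, conv(g, h))(n)  # associativity
    assert conv(f, g)(n) == conv(g, f)(n)                    # commutativity
```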

§ Existence of inverse

We can show that an inverse exists by exhibiting a formula for it; the idea is to construct one by induction. Clearly, for a given function f, we need the inverse f⁻¹ to be such that f ⋆ f⁻¹ = id. Hence:

(f ⋆ f⁻¹)(1) = id(1) = 1
f(1) f⁻¹(1) = 1
f⁻¹(1) ≡ 1 / f(1)

Great, we have a base case (this is where we need f(1) ≠ 0). We can now compute f⁻¹(n) inductively, assuming we know the value of f⁻¹(d) for all d | n, d < n. For n > 1:

(f ⋆ f⁻¹)(n) = id(n) = 0
∑_{d | n} f(d) f⁻¹(n/d) = 0
f(1) f⁻¹(n) + ∑_{d | n, d > 1} f(d) f⁻¹(n/d) = 0
f⁻¹(n) = −(1 / f(1)) ∑_{d | n, d > 1} f(d) f⁻¹(n/d)

§ µ as the inverse of the one function
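We can check numerically that the inductive inverse formula, applied to the constant function 1, reproduces µ. A sketch (dirichlet_inverse and mu are my helper names, computed straight from the definitions above):

```python
def divisors(n):
    return [d for d in range(1, n + 1) if n % d == 0]

def dirichlet_inverse(f, N):
    # inductively compute f^{-1}(n) for 1 <= n <= N, assuming f(1) != 0
    inv = {1: 1 / f(1)}
    for n in range(2, N + 1):
        s = sum(f(d) * inv[n // d] for d in divisors(n) if d > 1)
        inv[n] = -s / f(1)
    return inv

def mu(n):
    # Mobius function from the definition: 0 if a square divides n,
    # else (-1)^(number of distinct prime factors)
    res, p = 1, 2
    while p * p <= n:
        if n % p == 0:
            n //= p
            if n % p == 0:
                return 0
            res = -res
        p += 1
    if n > 1:
        res = -res
    return res

if __name__ == "__main__":
    inv = dirichlet_inverse(lambda n: 1, 30)
    assert all(round(inv[n]) == mu(n) for n in range(1, 31))
```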

§ Mobius inversion

Now that we know that µ = const 1⁻¹, we can use this fact to perform Mobius inversion. Suppose:

f(n) ≡ ∑_{d | n} g(n/d) = (const 1 ⋆ g)(n)

We have f written in terms of g. We can now write g in terms of f:

f = const 1 ⋆ g
f ⋆ const 1⁻¹ = g
g = f ⋆ µ
g(n) = ∑_{d | n} f(d) µ(n/d)

§ n = ∑d | n φ(d)

Let us count the numbers 1 ≤ x ≤ 12, partitioned by the value of the gcd (x, 12):

d    { 1 ≤ x ≤ 12 : (x, 12) = d }    equivalent condition    size
1    { 1, 5, 7, 11 }                 (x, 12) = 1             4
2    { 2, 10 }                       (x/2, 6) = 1            2
3    { 3, 9 }                        (x/3, 4) = 1            2
4    { 4, 8 }                        (x/4, 3) = 1            2
6    { 6 }                           (x/6, 2) = 1            1
12   { 12 }                          (x/12, 1) = 1           1

Notice the sizes of the sets that we are calculating: for example, |{ 1 ≤ x ≤ 12 : (x/2, 6) = 1 }| = φ(6). Summing over all of what we have, we've counted the numbers in [1, 2, …, 12] in two ways --- one directly, and the other by partitioning into equivalence classes:

12 = φ(1) + φ(2) + φ(3) + φ(4) + φ(6) + φ(12) = ∑_{d | 12} φ(d)

In general, the same argument allows us to prove that:

n = ∑_{d | n} φ(d)
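The double-counting argument is easy to check numerically; a small sketch with a brute-force totient:

```python
from math import gcd

def phi(n):
    # Euler totient, straight from the definition
    return sum(1 for x in range(1, n + 1) if gcd(x, n) == 1)

# check n = sum over d | n of phi(d), for small n
for n in range(1, 100):
    assert n == sum(phi(d) for d in range(1, n + 1) if n % d == 0)
```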

§ Using mobius inversion on the euler totient function

§ Other arithmetical functions and their relations

§ Incunabulum for the 21st century: Making the J interpreter compile in 2020

This is me trying to get the fabled one-page interpreter of the J language working, so I could absorb Arthur Whitney's style of writing C: it's cryptic, short, and fits in a page. I learnt of this from the J language page, which comes with the quote:
One summer weekend in 1989, Arthur Whitney visited Ken Iverson at Kiln Farm and produced—on one page and in one afternoon—an interpreter fragment on the AT&T 3B1 computer. I studied this interpreter for about a week for its organization and programming style; and on Sunday, August 27, 1989, at about four o'clock in the afternoon, wrote the first line of code that became the implementation described in this document. Arthur's one-page interpreter fragment is as follows:

§ The original source

typedef char C;typedef long I;
typedef struct a{I t,r,d[3],p[2];}*A;
#define P printf
#define R return
#define V1(f) A f(w)A w;
#define V2(f) A f(a,w)A a,w;
#define DO(n,x) {I i=0,_n=(n);for(;i<_n;++i){x;}}
I *ma(n){R(I*)malloc(n*4);}mv(d,s,n)I *d,*s;{DO(n,d[i]=s[i]);}
tr(r,d)I *d;{I z=1;DO(r,z=z*d[i]);R z;}
A ga(t,r,d)I *d;{A z=(A)ma(5+tr(r,d));z->t=t,z->r=r,mv(z->d,d,r);
 R z;}
V1(iota){I n=*w->p;A z=ga(0,1,&n);DO(n,z->p[i]=i);R z;}
V2(plus){I r=w->r,*d=w->d,n=tr(r,d);A z=ga(0,r,d);
 DO(n,z->p[i]=a->p[i]+w->p[i]);R z;}
V2(from){I r=w->r-1,*d=w->d+1,n=tr(r,d);
 A z=ga(w->t,r,d);mv(z->p,w->p+(n**a->p),n);R z;}
V1(box){A z=ga(1,0,0);*z->p=(I)w;R z;}
V2(cat){I an=tr(a->r,a->d),wn=tr(w->r,w->d),n=an+wn;
 A z=ga(w->t,1,&n);mv(z->p,a->p,an);mv(z->p+an,w->p,wn);R z;}
V2(rsh){I r=a->r?*a->d:1,n=tr(r,a->p),wn=tr(w->r,w->d);
 A z=ga(w->t,r,a->p);mv(z->p,w->p,wn=n>wn?wn:n);
 if(n-=wn)mv(z->p+wn,z->p,n);R z;}
V1(sha){A z=ga(0,1,&w->r);mv(z->p,w->d,w->r);R z;}
V1(id){R w;}V1(size){A z=ga(0,0,0);*z->p=w->r?*w->d:1;R z;}
pi(i){P("%d ",i);}nl(){P("\n");}
pr(w)A w;{I r=w->r,*d=w->d,n=tr(r,d);DO(r,pi(d[i]));nl();
 if(w->t)DO(n,P("< ");pr(w->p[i]))else DO(n,pi(w->p[i]));nl();}

C vt[]="+{~<#,";
I st[26]; qp(a){R  a>='a'&&a<='z';}qv(a){R a<'a';}
A ex(e)I *e;{I a=*e;
 if(qp(a)){if(e[1]=='=')R st[a-'a']=ex(e+2);a= st[ a-'a'];}
 R qv(a)?(*vm[a])(ex(e+1)):e[1]?(*vd[e[1]])(a,ex(e+2)):(A)a;}
noun(c){A z;if(c<'0'||c>'9')R 0;z=ga(0,0,0);*z->p=c-'0';R z;}
verb(c){I i=0;for(;vt[i];)if(vt[i++]==c)R i;R 0;}
I *wd(s)C *s;{I a,n=strlen(s),*e=ma(n+1);C c;
 DO(n,e[i]=(a=noun(c=s[i]))?a:(a=verb(c))?a:c);e[n]=0;R e;}

main(){C s[99];while(gets(s))pr(ex(wd(s)));}
It's a lot to take in --- it's quite breathtaking really, the way it all hangs together in one page.

§ The attempt to get it run

Unfortunately, this does not work if we try to get it to run in 2020. I decided to read the code and understand what would happen. I managed to read enough to understand that the code a=~3 ought to create an array with values [0 1 2]. On attempting to run this however, we get:
$ gcc -O0 -g -std=c89 -fsanitize=address -fsanitize=undefined incunabulum.c -o bin/incunabulum && ./bin/incunabulum
(many many GCC warnings elided)
==23726==ERROR: AddressSanitizer: heap-buffer-overflow on address
  0x60300000eff0 at pc 0x000000402be3 bp 0x7ffe6dde6b70 sp 0x7ffe6dde6b60
WRITE of size 8 at 0x60300000eff0 thread T0
    #0 0x402be2 in wd /home/bollu/work/w/incunabulum.c:40
    #1 0x402d28 in main /home/bollu/work/w/incunabulum.c:42
    #2 0x7f7ae901082f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)
    #3 0x400ca8 in _start (/home/bollu/w/bin/incunabulum+0x400ca8)
SUMMARY: AddressSanitizer: heap-buffer-overflow /home/bollu/work/w/incunabulum.c:40 wd
oops! The code puns freely between integers and pointers, and assumes both are 4 bytes wide (note the malloc(n*4) in ma). These assumptions break now that we're on 64-bit machines. The patch to get this to work is:

§ the patch

diff --git a/incunabulum.c b/incunabulum.c
index 2cae744..778e35a 100644
--- a/incunabulum.c
+++ b/incunabulum.c
@@ -1,11 +1,11 @@
-typedef char C;typedef long I;
+typedef char C;typedef long long I;
 typedef struct a{I t,r,d[3],p[2];}*A;
 #define P printf
 #define R return
 #define V1(f) A f(w)A w;
 #define V2(f) A f(a,w)A a,w;
 #define DO(n,x) {I i=0,_n=(n);for(;i<_n;++i){x;}}
-I *ma(n){R(I*)malloc(n*4);}mv(d,s,n)I *d,*s;{DO(n,d[i]=s[i]);}
+I *ma(n){R(I*)malloc(n*8);}mv(d,s,n)I *d,*s;{DO(n,d[i]=s[i]);}
 tr(r,d)I *d;{I z=1;DO(r,z=z*d[i]);R z;}
 A ga(t,r,d)I *d;{A z=(A)ma(5+tr(r,d));z->t=t,z->r=r,mv(z->d,d,r);
  R z;}
@@ -34,9 +34,10 @@ I st[26]; qp(a){R  a>='a'&&a<='z';}qv(a){R a<'a';}
 A ex(e)I *e;{I a=*e;
  if(qp(a)){if(e[1]=='=')R st[a-'a']=ex(e+2);a= st[ a-'a'];}
  R qv(a)?(*vm[a])(ex(e+1)):e[1]?(*vd[e[1]])(a,ex(e+2)):(A)a;}
-noun(c){A z;if(c<'0'||c>'9')R 0;z=ga(0,0,0);*z->p=c-'0';R z;}
-verb(c){I i=0;for(;vt[i];)if(vt[i++]==c)R i;R 0;}
+I noun(c){A z;if(c<'0'||c>'9')R 0;z=ga(0,0,0);*z->p=c-'0';R z;}
+I verb(c){I i=0;for(;vt[i];)if(vt[i++]==c)R i;R 0;}
 I *wd(s)C *s;{I a,n=strlen(s),*e=ma(n+1);C c;
  DO(n,e[i]=(a=noun(c=s[i]))?a:(a=verb(c))?a:c);e[n]=0;R e;}

 main(){C s[99];while(gets(s))pr(ex(wd(s)));}

§ It runs!

After applying the patch, we manage to get the interpreter to run!
0 1 2

§ The lock screen

I liked it so much that I took a screenshot and made it my lock screen.

§ Thoughts

I'm really fascinated by the code. I loved that I could simply stare at the screenshot to absorb the code: there was no scrolling involved. The variables are well-named (to the extent I understand the code), and it's clearly extremely well thought out. If there's someone who understands some of the thorny aspects of the code, I'd be really glad to know the details: please leave an issue or a pull request against the repo. I'm going to write a dissection of the code once I fully understand it, since I couldn't find anything explaining the code on the internet. Until then, enjoy the monolith of code!

§ An example of a sequence whose successive terms get closer together but isn't Cauchy (does not converge)

§ The problem

Provide an example of a sequence a: ℕ → ℝ such that lim_{n → ∞} |a_{n+1} − a_n| = 0, but lim_{n, m → ∞, m > n} |a_n − a_m| ≠ 0. That is, provide a sequence where the distances between successive terms converge to zero, but where distances between terms that are "farther apart than 1" do not converge to 0. That is, the sequence is not Cauchy.

§ Regular solution: Harmonic numbers

The usual solution is to take the harmonic numbers, H_n ≡ ∑_{i=1}^{n} 1/i. Then, we show that:

H_{n+1} − H_n = 1/(n+1) → 0 as n → ∞

while on the other hand:

H_{2n} − H_n = 1/(n+1) + … + 1/(2n) ≥ n · (1/(2n)) = 1/2 ≠ 0

§ Memorable solution: logarithm

We can much more simply choose a_n = log(n). This yields the simple calculation:

a_{n+1} − a_n = log(n+1) − log(n) = log((n+1)/n) = log(1 + 1/n) → log(1) = 0 as n → ∞

while on the other hand,

a_{2n} − a_n = log(2n) − log(n) = log(2) + log(n) − log(n) = log 2 ≠ 0
I find this far cleaner conceptually, since it's "obvious" to everyone that a_n = log(n) diverges, while the corresponding fact for H_n is hardly convincing. We also get straightforward equalities everywhere, instead of inequalities. I still feel that I don't grok what precisely fails here, in that my intuition still feels that the local condition ought to imply the Cauchy condition: if a_n tells a_{n+1} to not be too far, and a_{n+1} tells a_{n+2}, surely this must be transitive? I have taught my instincts to not trust my instincts on analysis, which is a shitty solution :) I hope to internalize this someday. EDIT: I feel I now understand what's precisely happening after ruminating a bit. The Cauchy convergence criterion allows us to drop a finite number of terms, and then capture everything after that point in a ball of radius ε. As ε shrinks, all the terms in the sequence are "squeezed together". In the |a_{n+1} − a_n| case, only successive terms must maintain an ε distance. But as the log example shows, you can steadily plod along, placing ε ball next to ε ball, to reach a total distance of the form lim_{n → ∞, ε → 0} f(n) · ε, whose behaviour can do unexpected things depending on the choice of f.
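A quick numerical illustration of both facts about a_n = log(n):

```python
import math

# successive differences log(n+1) - log(n) shrink to zero...
assert math.log(10**6 + 1) - math.log(10**6) < 1e-5

# ...but the gap a_{2n} - a_n stays pinned at log 2 for every n,
# so the sequence cannot be Cauchy
for n in (10, 1000, 10**6):
    assert abs((math.log(2 * n) - math.log(n)) - math.log(2)) < 1e-12
```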

§ Krylov subspace method

This is a class of methods used to solve Ax = b, where A is sparse. Krylov subspace methods all use the idea of a Krylov subspace; examples are conjugate gradient (CG) and GMRES (generalized minimal residual method). The Krylov subspace generated by A and v is:

K_m(A, v) ≡ span { v, Av, A²v, …, A^m v }

Clearly, K_m ⊆ K_{m+1}, and there is a maximum K_N that we can span (at most the full vector space). We are interested in the smallest index M such that K_M = K_{M+1}. We notice that K_M is invariant under the action of A. Now, let's consider the Krylov subspace generated by the true solution x = A⁻¹b:

K_m(A, x) ≡ span { x, Ax, A²x, …, A^m x }
= span { A⁻¹b, b, Ab, …, A^{m−1} b }    (substituting x = A⁻¹b)

Applying A and using the invariance of the Krylov subspace at m = M:

K_M(A, x) = A K_M(A, x) = span { b, Ab, …, A^M b } = K_M(A, b)

In particular, x = A⁻¹b ∈ K_M(A, x) = K_M(A, b): we learn that Ax = b has a solution in K_M(A, b). Using this, we can build solvers that search the Krylov subspace. We will describe GMRES and CG.

§ Generalized minimal residual --- GMRES

We wish to solve Ax = b where A is sparse and b is normalized. The nth Krylov subspace is Kn(A, b) ≡ span {b, Ab, A2b, …, Anb }. We approximate the actual solution with a vector xnKn(A, b). We define the residual as rnA xnb.
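A minimal numpy sketch of the GMRES idea, assuming a dense A for simplicity: Arnoldi builds an orthonormal basis Q of the Krylov subspace with A Q_m = Q_{m+1} H, so minimizing the residual over the subspace collapses to a small least-squares problem in H. The function name and structure are mine; this is a sketch, not a production implementation (no restarts, naive breakdown handling):

```python
import numpy as np

def gmres_sketch(A, b, m):
    # Arnoldi: orthonormal basis Q for K_m(A, b), with A Q_m = Q_{m+1} H
    n = len(b)
    Q = np.zeros((n, m + 1))
    H = np.zeros((m + 1, m))
    beta = np.linalg.norm(b)
    Q[:, 0] = b / beta
    for j in range(m):
        v = A @ Q[:, j]
        for i in range(j + 1):          # Gram-Schmidt against previous basis
            H[i, j] = Q[:, i] @ v
            v = v - H[i, j] * Q[:, i]
        H[j + 1, j] = np.linalg.norm(v)
        if H[j + 1, j] > 1e-12:
            Q[:, j + 1] = v / H[j + 1, j]
    # x_m = Q_m y, where y minimizes ||A x - b|| = ||H y - beta e1||
    e1 = np.zeros(m + 1)
    e1[0] = beta
    y, *_ = np.linalg.lstsq(H, e1, rcond=None)
    return Q[:, :m] @ y

# diagonally dominant test system: the residual vanishes once m reaches n
rng = np.random.default_rng(0)
A = rng.standard_normal((8, 8)) + 8 * np.eye(8)
b = rng.standard_normal(8)
x = gmres_sketch(A, b, 8)
assert np.linalg.norm(A @ x - b) < 1e-8
```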

§ Conjugate gradient descent

§ Good reference to the Rete pattern matching algorithm

The Rete pattern matching algorithm allows matching a huge number of rules against a huge database of "facts". MLIR ("multi-level intermediate representation") is a new technology that hopes to centralize much of the research and development of various compilers into a single authoritative source. Its major claim to fame is that it allows one to mix various "dialects" (much as Racket does). So, to a first-order approximation, MLIR is "JSON for compiler intermediate representations". What MLIR gets right is tooling: it takes the experience that the LLVM project has mined for the last decade and a half, and bakes much of the good stuff that came with LLVM right in. For example, MLIR comes built in with a pretty printer, a notion of types, a notion of "attributes", SSA, and enforced provenance tracking of code (so one can always know what the original source code was that led to some assembly). Sound engineering might see MLIR succeed where many others failed. I was reminded of Rete since the MLIR folks are trying to solve the pattern matching problem in general for their Generic DAG Rewriter; they currently just use a worklist-based algorithm, and I'm trying to understand if Rete can be used instead. Rete is famous for being hard to understand, so I began a quest to look for good sources from which to implement it. I found a great PhD thesis by Robert B. Doorenbos, which quips:
Since the Rete match algorithm provides the starting point for much of the work in this thesis, this chapter describes Rete. Unfortunately, most of the descriptions of Rete in the literature are not particularly lucid, which is perhaps why Rete has acquired "a reputation for extreme difficulty" (Perlin, 1990b). To remedy this situation, this chapter describes Rete in a tutorial style, rather than just briefly reviewing it and referring the reader to the literature for a full description. We will first give an overview of Rete, and then discuss the principal data structures and procedures commonly used to implement it. High-level pseudocode will be given for many of the structures and procedures, so that this chapter may serve as a guide to readers who want to implement Rete (or some variant) in their own systems.
I now have a reference to an accessible description of this stuff. I might implement Rete to understand it, so that it's part of my toolkit if I ever need it.

§ Leapfrog Integration

We have a system we wish to simulate using Hamilton's equations:

∂q/∂t = ∂H/∂p
∂p/∂t = −∂H/∂q

We want to simulate a system using these differential equations. We will begin with some initial position and momentum (q0, p0), evaluate ∂q/∂t |_(q0, p0) and ∂p/∂t |_(q0, p0), and use these to find (qnext, pnext). An integrator is a general algorithm that produces the next position and momentum using current information:

(qnext, pnext) = I(q0, p0, ∂q/∂t, ∂p/∂t)
The design of I is crucial: different choices of I will have different trade-offs between accuracy and performance. Another interesting property we might want is for I to be a symplectic integrator --- one that preserves the symplectic structure of the system, and therefore keeps the total energy (approximately) conserved over long timescales. For example, here's a plot of the orbits of planets using two integrators, one that's symplectic (leapfrog) and one that isn't (Euler). Notice that since leapfrog keeps the energy conserved, the orbits stay orbits! On the other hand, the Euler integrator quickly spirals out, since we lose energy during the integration. Note that this is not an issue of numerical precision: the Euler integrator is inherently such that over long timescales, it will lose energy, while the leapfrog integrator remains stable even with very large timesteps and low precision. I present the equations of the leapfrog integrator, a proof sketch that it is symplectic, and the code listing that was used to generate the above plot. Often, code makes most ideas very clear!

§ The integrator

§ Code listing

§ Incantations

## Run HMC with a particular choice of potential
import numpy as np
from matplotlib.animation import FuncAnimation
import matplotlib.pyplot as plt
import numpy.linalg
## dq/dt = dH/dp|_{p0, q0}
## dp/dt = -dH/dq|_{p0, q0}
def leapfrog(dhdp, dhdq, q0, p0, dt):
    # half step momentum: p += -dH/dq * dt/2
    p0 += -dhdq(q0, p0) * 0.5 * dt

    # full step position: q += dH/dp * dt
    q0 += dhdp(q0, p0) * dt

    # half step momentum: p += -dH/dq * dt/2
    p0 += -dhdq(q0, p0) * 0.5 * dt
    return (q0, p0)
For reference, we also implement an euler integrator, that uses the derivative to compute the position and momentum of the next timestep independently.
def euler(dhdp, dhdq, q, p, dt):
    pnew = p + -dhdq(q, p) * dt
    qnew = q + dhdp(q, p) * dt
    return (qnew, pnew)
Finally, we implement planet(integrator, n, dt) which simulates gravitational potential and usual kinetic energy, using the integrator given by integrator for n steps, with each timestep taking dt.
def planet(integrator, n, dt):
    STRENGTH = 0.5

    # initial position and momentum
    q = np.array([0.0, 1.0])
    p = np.array([-1.0, 0.0])

    # H = STRENGTH * |q| (potential) + p^2/2 (kinetic)
    def H(qcur, pcur): return STRENGTH * np.linalg.norm(qcur) + np.dot(pcur, pcur) / 2
    def dhdp(qcur, pcur): return pcur
    def dhdq(qcur, pcur): return STRENGTH * qcur / np.linalg.norm(qcur)

    qs = []
    for i in range(n):
        print("q: %10s | p: %10s | H: %6.4f" % (q, p, H(q, p)))
        qs.append(q.copy())  # record the trajectory for plotting
        (q, p) = integrator(dhdp, dhdq, q, p, dt)
    return np.asarray(qs)
We plot the simulations using matplotlib and save them.

NITERS = 1000    # example values for the number of steps and the timestep
TIMESTEP = 0.01

print("planet simulation with leapfrog")
planet_leapfrog = planet(leapfrog, NITERS, TIMESTEP)

plt.rcParams.update({'font.size': 12, 'font.family':'monospace'})
fig, ax = plt.subplots()
ax.plot(planet_leapfrog[:, 0], planet_leapfrog[:, 1], label='leapfrog',
        linewidth=3, color='#00ACC1')
print("planet simulation with euler")
planet_euler = planet(euler, NITERS, TIMESTEP)
ax.plot(planet_euler[:, 0], planet_euler[:, 1], label='euler',
        linewidth=3, color='#D81B60')

legend = plt.legend(frameon=False)
ax.set_title("leapfrog v/s euler: NITERS=%s dt=%s" % (NITERS, TIMESTEP))
plt.savefig("leapfrog-vs-euler.png")

§ Comparison of forward and reverse mode AD

Quite a lot of ink has been spilt on this topic. My favourite reference is the one by Rufflewind. However, none of these references have a good stock of examples of the difference. So here, I catalogue the explicit computations for both forward mode AD and reverse mode AD. In general, in forward mode AD, we fix how much the inputs wiggle with respect to a parameter t, and figure out how much the output wiggles with respect to t. If output = f(input_1, input_2, …, input_n), then ∂output/∂t = ∑_i ∂f/∂input_i · ∂input_i/∂t. In reverse mode AD, we fix how much the parameter t wiggles with respect to the outputs, and figure out how much t wiggles with respect to the inputs. If output_i = f_i(input, …), then ∂t/∂input = ∑_i ∂t/∂output_i · ∂f_i/∂input. This is a much messier expression, since we need to accumulate the data over all outputs. Essentially, deriving output from input is easy, since how to compute an output from an input is documented in one place; deriving input from output is annoying, since a single input can feed many outputs. The upshot is that if we have few "root outputs" (like a loss function), we need to run reverse mode AD only once with respect to this root, and we get the wiggles of all inputs at the same time, since we compute the wiggles from output to input. The first example, z = max(x, y), captures the essential difference between the two approaches succinctly. Study this, and everything else will make sense.

§ Maximum: z = max(x, y)

Forward mode: we are given the input sensitivities

∂x/∂t = ?, ∂y/∂t = ?

and compute the output sensitivity

∂z/∂t = ∂x/∂t if x > y; ∂z/∂t = ∂y/∂t if y > x

We can compute ∂z/∂x by setting t = x, that is, ∂x/∂t = 1, ∂y/∂t = 0. Similarly, we can compute ∂z/∂y by setting t = y, that is, ∂x/∂t = 0, ∂y/∂t = 1. If we want both gradients ∂z/∂x, ∂z/∂y, we will have to rerun the above equations twice with the two initializations. In our equations, we are saying that we know how sensitive the inputs x, y are to a given parameter t. We are deriving how sensitive the output z is to the parameter t through x, y. If x > y, then we know that z is exactly as sensitive to t as x is.
Reverse mode: we are given the output sensitivity

∂t/∂z = ?

and compute the input sensitivities

∂t/∂x = ∂t/∂z if x > y; 0 otherwise
∂t/∂y = ∂t/∂z if y > x; 0 otherwise

We can compute ∂z/∂x, ∂z/∂y in one shot by setting t = z, that is, ∂t/∂z = 1. In our equations, we are saying that we know how sensitive the parameter t is to a given output z. We are trying to see how sensitive t is to the inputs x, y. If x is active (i.e., x > y), then t is indeed sensitive to x and ∂t/∂x = 1. Otherwise, it is not sensitive, and ∂t/∂x = 0.
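The two modes for z = max(x, y) can be written out directly as code; a minimal sketch with hand-written derivative rules (the function names are mine; no AD library is assumed):

```python
def max_forward(x, y, dx_dt, dy_dt):
    # forward mode: given input wiggles dx/dt and dy/dt, return (z, dz/dt)
    z = max(x, y)
    dz_dt = dx_dt if x > y else dy_dt
    return z, dz_dt

def max_reverse(x, y, dt_dz):
    # reverse mode: given dt/dz, return (dt/dx, dt/dy) in one pass
    return (dt_dz, 0.0) if x > y else (0.0, dt_dz)

# forward mode needs two runs to get both gradients of z = max(x, y):
_, dz_dx = max_forward(3.0, 2.0, 1.0, 0.0)   # seed t = x
_, dz_dy = max_forward(3.0, 2.0, 0.0, 1.0)   # seed t = y
# reverse mode gets both in one run by seeding t = z:
dt_dx, dt_dy = max_reverse(3.0, 2.0, 1.0)
assert (dz_dx, dz_dy) == (dt_dx, dt_dy) == (1.0, 0.0)
```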

§ sin: z = sin(x)

Forward mode: given ∂x/∂t = ?, we compute

∂z/∂t = ∂z/∂x · ∂x/∂t = cos(x) · ∂x/∂t

We can compute ∂z/∂x by setting t = x, that is, setting ∂x/∂t = 1.

Reverse mode: given ∂t/∂z = ?, we compute

∂t/∂x = ∂t/∂z · ∂z/∂x = ∂t/∂z · cos(x)

We can compute ∂z/∂x by setting t = z, that is, setting ∂t/∂z = 1.

§ addition: z = x + y:

Forward mode:

z ≡ x + y
∂x/∂t = ?, ∂y/∂t = ?
∂z/∂t = ∂z/∂x · ∂x/∂t + ∂z/∂y · ∂y/∂t = 1 · ∂x/∂t + 1 · ∂y/∂t = ∂x/∂t + ∂y/∂t

Reverse mode:

z ≡ x + y
∂t/∂z = ?
∂t/∂x = ∂t/∂z · ∂z/∂x = ∂t/∂z · 1 = ∂t/∂z
∂t/∂y = ∂t/∂z · ∂z/∂y = ∂t/∂z · 1 = ∂t/∂z

§ multiplication: z = xy

Forward mode:

z ≡ x y
∂x/∂t = ?, ∂y/∂t = ?
∂z/∂t = ∂z/∂x · ∂x/∂t + ∂z/∂y · ∂y/∂t = y · ∂x/∂t + x · ∂y/∂t