## § Sanskrit and Sumerian

#### § Sanskrit's precursors

The oldest known Sanskrit text that has been found is the Rig Veda. Everything after it is the study of Sanskrit proper. This is quite problematic because the Rig Veda is a complete and consistent textbook with respect to language. It's a black box in terms of language evolution. The question "what is the precursor?" asks for a method of study to determine the precursor. We just don't know, because we have no writings, texts, etc.

#### § Archaeological data

Archaeological data is problematic as well. We don't know where the people who knew Sanskrit came from. Sanskrit was spoken in the northern part of Hindusthan [what they originally called Bhrahma-nagar (?)]. While we can try to understand where it comes from, it's hard. The script is Brahmi / Devanagari, which means used by Brahma or god. The name "sanskrit" is a compound that stands for "well-formed". It's really clean-slate in that sense. The study of Indo-Aryan languages in the Indian subcontinent has only one possible known history, which stops at the Rig Veda. We don't know the relationship between 2400BC, when the Rig Veda was written, and anything before it.

#### § Non Vedic sanskrit

Non-Vedic Sanskrit is studied in relation to the "true" proto-Indo-European (PIE) language. The conjecture is that this Indo-European language ought to be able to cover Middle Greek, Hittite, and Sanskrit.

#### § Prakrit and its relationship to this story

Prakrit evolved as a vernacular of Sanskrit in the north pahadi region. Hindi as we know it today evolved from Hindusthani, which came from languages in northern India. Languages like Marathi, Marvadi, Gujurathi, etc. came long before Hindi did.

#### § TLDR on Sanskrit v/s Hindi

There is a huge gap of time between Sanskrit, Prakrit, Pali, and Hindi. Hindi evolved around the 1600s due to the Mughals, who used to speak a vernacular of Hindusthani. Kabir also wrote in Hindusthani. There was also some Farsi and Urdu mixed in.
Hindi is more of a political exploit than an actual language ~ Alok 2020

#### § The relationship to Sumerian

We don't know what the relationship to Sumerian is. Social expectations that were set up in Sumerian have become more stringent in Sanskrit.

## § Writing Cuneiform

I've been reading about the Sumerian people, and I've gotten fascinated with the question of how to write in Cuneiform, which is their script. I wanted to learn how to write cuneiform. It appears that it was originally written down by pressing reed styluses into clay. The script is syllabic in nature. We have three components:
1. Vertical wedge 𐏑
2. Horizontal wedge 𐎣
3. Diagonal 𐏓

## § The code of Hammurabi

I've wanted to read the code of Hammurabi since it was name-dropped in Snow Crash by Neal Stephenson. I finally got around to it. Here are some excerpts I found fascinating.

## § The implicit and inverse function theorem

I keep forgetting the precise conditions of these two theorems. So here I'm writing it down as a reference for myself.

#### § Implicit function: Relation to function

• Example 1: x² + y² = 1 to y = √(1 − x²).
If we have a function y = g(p, q), we can write this as y − g(p, q) = 0. This can be taken as an implicit function h(y; p, q) = y − g(p, q). We then want to recover the explicit version y = g′(p, q) such that h(g′(p, q); p, q) = 0. That is, we recover the original explicit formulation y = g′(p, q) in a way that satisfies h.

#### § The 1D linear equation case

In the simplest possible case, assume the relationship between y and p is a linear one, given implicitly. So we have h(y; p) = α y + β p + γ = 0. Solving for h(y; p) = 0, one arrives at y = −(1/α)(β p + γ).
• Note that the solution exists iff α ≠ 0.
• Also note that the existence of the solution is equivalent to asking that ∂h/∂y = α ≠ 0.

#### § The circle case

In the circle case, we have h(y; p) = p² + y² − 1. We can write y = ±√(1 − p²). These are two solutions, not one, and hence a relation, not a function.
• We can however build two functions by taking the two branches: y₊ = +√(1 − p²); y₋ = −√(1 − p²).
• In this case, we have ∂h/∂y = 2y, which changes sign for the two solutions. If y > 0, then ∂h/∂y > 0; similarly, ∂h/∂y < 0 in the negative case. The condition only fails at y = 0, where the two branches meet.
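A quick plain-Python check of the branch picture: both branches satisfy h = 0, and the sign of ∂h/∂y = 2y matches the branch.

```python
# Check the circle h(y; p) = p^2 + y^2 - 1 at p = 0.5: on each branch
# y± = ±sqrt(1 - p^2), the sign of dh/dy = 2y matches the branch.
import math

p = 0.5
y_plus = +math.sqrt(1 - p**2)
y_minus = -math.sqrt(1 - p**2)

def dh_dy(y):
    return 2 * y

print(dh_dy(y_plus) > 0, dh_dy(y_minus) < 0)  # True True
```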

#### § Assuming that a solution for h(y, p) exists

Let us say we wish to solve h(y; p) = y³ + p² − 3yp − 7 = 0. Let's assume that we have a solution y = sol(p) around the point (y=3, p=4). Then we must have: sol(p)³ + p² − 3 sol(p) p − 7 = 0. Differentiating by p, we get: 3 sol(p)² sol′(p) + 2p − 3 sol′(p) p − 3 sol(p) = 0. This gives us the condition on the derivative:

3 sol(p)² sol′(p) + 2p − 3 sol′(p) p − 3 sol(p) = 0

sol′(p) [3 sol(p)² − 3p] = 3 sol(p) − 2p

sol′(p) = [3 sol(p) − 2p] / [3 sol(p)² − 3p]

The above solution exists if 3 sol(p)² − 3p ≠ 0. This quantity is again ∂h/∂y.
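A plain-Python sanity check of this formula: note that h(3; 4) = 27 + 16 − 36 − 7 = 0, so (y=3, p=4) really lies on the curve. We solve for sol(p) numerically via Newton's method (a choice of mine, not forced by the theorem) and compare a finite-difference derivative against the formula.

```python
# Numerical check of sol'(p) = (3*sol(p) - 2p) / (3*sol(p)^2 - 3p)
# for h(y; p) = y^3 + p^2 - 3*y*p - 7 around (y=3, p=4).
def h(y, p):
    return y**3 + p**2 - 3*y*p - 7

def sol(p, y0=3.0):
    # Solve h(y; p) = 0 for y by Newton's method, starting near y = 3.
    y = y0
    for _ in range(50):
        y -= h(y, p) / (3*y**2 - 3*p)   # dh/dy = 3y^2 - 3p
    return y

p = 4.0
y = sol(p)
eps = 1e-6
numeric = (sol(p + eps) - sol(p - eps)) / (2 * eps)
formula = (3*y - 2*p) / (3*y**2 - 3*p)
print(round(numeric, 6), round(formula, 6))  # both ≈ 1/15 ≈ 0.066667
```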

#### § Application to economics

• We have two inputs, which are purchased as x1 units of input 1 and x2 units of input 2.
• The price of the first input is w1 BTC/unit. That of the second input is w2 BTC/unit.
• We produce an output which is sold at price w BTC/unit.
• For a given (x1, x2) units of input, we can produce x1^a x2^b units of output, where a + b < 1. This is the Cobb-Douglas production function.
• The profit is going to be profit(x1, x2, w1, w2, w) = w (x1^a x2^b) − w1 x1 − w2 x2.
• We want to select x1, x2 to maximize profits.
• Assume we are at break-even: profit(x1, x2, w1, w2, w) = 0.
• The implicit function theorem allows us to understand how any variable changes with respect to any other variable. It tells us, for example, that locally, the number of units of the first input we buy (x1) is a function of the price w1. Moreover, we can show that it's a decreasing function of the price.
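As a crude numeric check of the final claim, a grid search (with parameter values a = 0.3, b = 0.4, w = 1, w2 = 1 assumed purely for illustration) shows the profit-maximizing x1 falling when w1 rises:

```python
# Grid-search the profit-maximizing x1 for two input prices w1
# and check that x1 decreases as w1 rises. Parameter values assumed.
def best_x1(w1, a=0.3, b=0.4, w=1.0, w2=1.0):
    best_val, best_profit = None, float("-inf")
    xs1 = [i / 100 for i in range(1, 500)]   # x1 grid: 0.01 .. 4.99
    xs2 = [i / 100 for i in range(1, 200)]   # x2 grid: 0.01 .. 1.99
    for x1 in xs1:
        for x2 in xs2:
            profit = w * (x1**a) * (x2**b) - w1*x1 - w2*x2
            if profit > best_profit:
                best_profit, best_val = profit, x1
    return best_val

cheap, expensive = best_x1(0.1), best_x1(0.2)
print(cheap > expensive)  # True: we buy less of input 1 when it costs more
```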

#### § Inverse function: Function to Bijection

• Given a differentiable function f, at a point p, we will have a continuous inverse f⁻¹ near f(p) if the derivative f′(p) is locally invertible.
• The intuition is that we can approximate the original function with a linear function: y = f(p + δ) ≈ f(p) + f′(p) δ. Now since f′(p) is invertible, we can solve for δ: y = f(p) + f′(p) δ implies that δ = (1/f′(p)) [y − f(p)]. This gives us the pre-image (p + δ) ↦ y.
• The fact that f′(p) is non-zero is the key property. This generalizes in multiple dimensions to saying that f′(p) is invertible.
One perspective we can adopt is that of Newton's method. Recall that Newton's method allows us to find x* for a fixed y* such that y* = f(x*). It follows the exact same process!
• We start at a guess x[1] with value y[1] = f(x[1]), and find the derivative f′(x[1]).
• We draw the tangent at the point x[1] as (y[2] − y[1]) = f′(x[1])(x[2] − x[1]).
• To find the y* we set y[2] = y*.
• This gives us x[2] = x[1] + (y* − y[1])/f′(x[1]).
• Immediately generalizing, we get x[n+1] = x[n] + (y* − y[n])/f′(x[n]).
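A quick sketch of this iteration in plain Python, inverting the example f(x) = x³ + x (a choice of mine: it is monotone, so f′ > 0 everywhere and the inverse is global):

```python
# Invert f(x) = x^3 + x by the Newton iteration
# x[n+1] = x[n] + (y* - y[n]) / f'(x[n]).
def f(x):
    return x**3 + x

def fprime(x):
    return 3*x**2 + 1

def f_inverse(y_star, x=0.0, iters=50):
    for _ in range(iters):
        x = x + (y_star - f(x)) / fprime(x)
    return x

x_star = f_inverse(10.0)
print(round(x_star, 6))  # 2.0, since f(2) = 8 + 2 = 10
```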

## § Whalesong hyperbolic space in detail

We can build a toy model of a space where velocity increases with depth. Let the x-y axis be: left-to-right (→) is positive x, top-to-bottom (↓) is positive y. Now let the velocity at a given location (x, y) be (y+1, 1). That is, velocity along y is constant; the velocity along x is an increasing function of the current depth. The velocity along x increases linearly with depth. Under such a model, our shortest paths will be 'curved' paths.

## § Motivating Dijkstra's

Usually I've seen Dijkstra's algorithm presented as a greedy algorithm, and then an analogy given to "fire" for the greediness. We can reverse this presentation: start with the "fire", and then discretize the "fire" solution to arrive at Dijkstra's.

#### § The fire analogy

• Assume we want to find the shortest path from point A to point B. We should find the path that fire travels. How do we do this? Well, we simply simulate a fire going through all paths from our initial point, and then pick the shortest path (ala Fermat). We interpret the graph edges as distances between the different locations in space.
• So we write a function simulate :: V × T → 2V × T, which when given a starting vertex v : V and a time t : T, returns the set of vertices reachable in at most that time, and the time it took to reach them.
• Notice that we are likely repeating a bunch of work. Say the fire from x reaches p, and we know the trajectory of the fire from p. The next time a fire from y reaches p, we don't need to recalculate the trajectory. We can simply cache this.
• Notice that this time parameter being a real number is pointless; really the only thing that matters are the events (as in computational geometry), where the time crosses some threshold.
• By combining the two pieces of information, we are led to the usual implementation: Order the events by time using a queue, and then add new explorations as we get to them.
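The discretization above can be sketched in a few lines of Python: a heap holds the pending "fire front" events ordered by time, and the `dist` cache plays the role of the memoized trajectories.

```python
# Dijkstra as a discretized fire simulation: pop fire-front events in
# time order; the first time the fire reaches a vertex, cache it.
import heapq

def dijkstra(graph, source):
    # graph: {vertex: [(neighbour, edge_length), ...]}
    dist = {}                 # time the fire first reached each vertex
    heap = [(0, source)]      # pending events, ordered by arrival time
    while heap:
        t, v = heapq.heappop(heap)
        if v in dist:         # fire already reached v: skip recomputation
            continue
        dist[v] = t
        for u, length in graph[v]:
            if u not in dist:
                heapq.heappush(heap, (t + length, u))
    return dist

g = {'A': [('B', 1), ('C', 4)], 'B': [('C', 2), ('A', 1)], 'C': []}
print(dijkstra(g, 'A'))  # {'A': 0, 'B': 1, 'C': 3}
```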

## § Hyperbolic groups have solvable word problem

I have never seen an elementary account of this in a 'trust these facts; now here is why hyperbolic groups have a solvable word problem' style. I am writing such an account for myself. It's an account for building intuition, so no proofs will be provided except for the final theorem. All facts will be backed by intuition. Since most of it is geometric, it's easy to convey intuition.

#### § Graphs of groups, quasi-isometry.

• NOTE: I will consistently denote the inverse of g by g′.
We can convert any group into a graph by using the Cayley graph of the group. We characterize hyperbolic space as a space where all triangles are 'thin'. We also think of hyperbolic space as one where geodesics from a given point diverge (in terms of angle) exponentially fast. The choice of generators for the Cayley graph gives different graphs. We can assign a unique geometric object by considering Cayley graphs up to quasi-isometry: the Cayley graphs of a group with respect to different generating sets are quasi-isometric. We can now try to study properties that are invariant under quasi-isometry, since these are somehow 'represented faithfully by the geometry'.

#### § Hyperbolicity enters the picture

We now say that a group is hyperbolic if the Cayley graph of the group is hyperbolic. We can show that hyperbolicity is preserved by quasi-isometry, so this property does not depend on the generating set.

#### § Isoperimetric Inequality

If we have a finitely presented group G = ⟨ S | R ⟩, and w is a word in the free group Free(S) such that [[w]] = 1, then we will have:

w = ∏_{i=1}^{n} u_i r_i^{±1} u_i′

This follows almost by definition. Since we have quotiented by R, we can sandwich elements of R between u_i and u_i′. We need the u_i′ since there's nothing else to cancel off the u_i.

#### § Area of a word

Let G = ⟨ S | R ⟩. Let w be a word in S such that [[w]] = e. The area of the word w is the minimal number of such u r u′ components we need to write it down. Formally:

Area(w) = min { n | w = ∏_{i=1}^{n} u_i r_i^{±1} u_i′ }
I don't understand the geometric content of this definition. I asked on mathoverflow.

#### § Isoperimetric function for a group

f: ℕ → ℕ is a Dehn function or isoperimetric function if the area of any word is upper bounded by f(|word|). In some sense, the length of the word is the perimeter to the area, and this gives us a form of the isoperimetric inequality. Formally, f is a Dehn function if for all words w ∈ Free(S) such that [[w]] = e, we have Area(w) ≤ f(|w|). Depending on the growth of f, we say that G has linear, quadratic, exponential, etc. Dehn function.

#### § Geometric content of Area

We define a map to be a finite, planar, oriented, connected and simply connected simplicial 2-complex (!). A map D is a diagram over an alphabet S iff every edge e ∈ D has a label lbl(e) ∈ S such that lbl(e⁻¹) = (lbl(e))⁻¹. Hang on: what does it mean to invert an edge? I presume it means to go backwards along an edge. So we assume the graph is directed, and we have edges in both directions. A Van Kampen diagram over a group G = ⟨ S | R ⟩ is a diagram D over S such that every face f of D has its boundary labelled by some r^{±1} : r ∈ R. The area of such a diagram is the number of faces.

#### § Hyperbolic iff Linear Isoperimetric Inequality is satisfied

A finitely presented group is hyperbolic if and only if its Cayley graph satisfies the linear isoperimetric inequality.

#### § Deciding if elements are conjugate to each other

• If we can answer the question of whether two elements are conjugate to each other (does there exist a g such that ghg′ = k?), we can solve whether an element is equal to the identity:
• Pick k = e. Then if we have ghg′ = k = e, then gh = g, hence h = e.
• If we can check that an element is equal to the identity, we can check for equality of elements: two elements k, l are equal iff kl′ = e.
• So solving conjugacy automatically allows us to check for equality.

#### § Proof that conjugacy is solvable for hyperbolic groups

e------------------x----------------------->x
|                                           |
g2                                          g1
|                                           |
v                                           v
g1 =x g1 x'<-------x'--------------------- x g1

Assume such an element x exists such that g2 = x g1 x′. Now we have two triangles, e-x-xg1 and xg1-x′-g2. Let us focus on e-x-xg1. Assume the triangles in our space are δ-thin.
[Garbled figure: the δ-thin triangle e-x-xg1, with a point p marked on the side [e, xg1] and a distance greater than δ indicated to the other sides.]

Since our triangles are δ thin, if the distance of the point p ∈ [e xg1] to

## § Elementary uses of Sheaves in complex analysis

I always wanted to see sheaves in the wild in a setting that was both elementary and 'correct': that is, not some perverse example created to show sheaves (DaTaBaSeS arE ShEAvEs). Ahlfors has a great example of this, which I'm condensing here for future reference.

#### § Sheaves: Trial 1

• We have function elements (f: Ω → ℂ, Ω ⊆ ℂ), where f is complex analytic and Ω is an open subset of ℂ.
• Two function elements (f1, Ω1), (f2, Ω2) are said to be analytic continuations of each other iff Ω1 ∩ Ω2 ≠ ∅, and f1 = f2 on the set Ω1 ∩ Ω2.
• (f2, Ω2) can be called as the continuation of (f1, Ω1) to region Ω2.
• We will have that the analytic continuation of f1 to Ω2 is unique. If there exists a function element (g2, Ω2), (h2, Ω2) such that g2 = f1 = h2 in the region Ω1 ∩ Ω2, then by analyticity, this agreement will extend to all of Ω2.
• Analytic continuation is therefore an equivalence relation (prove this!)
• A chain of analytic continuations is a sequence of (fi, Ωi) such that the adjacent elements of this sequence are analytic continuations of each other. (fi, Ωi) analytically continues (fi+1, Ωi+1).
• Every equivalence class of this equivalence relation is called as a global analytic function. Put differently, it's a family of function elements (f, U) and (g, V) such that we can start from (f, U) and build analytic continuations to get to (g, V).

#### § Sheaves: Trial 2

• We can take a different view, with (f, z ∈ ℂ) such that f is analytic at some open set Ω which contains z. So we should picture an f sitting analytically on some open set Ω which contains z.
• Two pairs (f, z), (f′, z′) are considered equivalent if z = z′ and f = f′ in some neighbourhood of z (= z′).
• This is clearly an equivalence relation. The equivalence classes are called as germs.
• Each germ (f, z) has a unique projection z. We denote a germ of f with projection z as fz.
• A function element (f, Ω) gives rise to germs (f, z) for each z ∈ Ω.
• Conversely, every germ (f, z) is determined by some function element (f, Ω) since we needed f to be analytic around some open neighbourhood of z: Call this neighbourhood Ω.
• Let D ⊆ ℂ be an open set. The set of all germs { fz : zD } is called as a sheaf over D. If we are considering analytic f then this will be known as the sheaf of germs of analytic functions over D. This sheaf will be denoted as Sh(D).
• There is a projection π: Sh(D) → D; (f, z) ↦ z. For a fixed z0 ∈ D, the inverse image π⁻¹(z0) is called as the stalk over z0. It is denoted by Sh(z0).
• Sh carries both topological and algebraic structure. We can give the sheaf a topology to talk about continuous mappings in and out of Sh. It also carries a pointwise algebraic structure at each stalk: we can add and subtract functions at each stalk; this makes each stalk an abelian group.

#### § Sheaf: Trial 3

A sheaf over D is a topological space Sh and a mapping π: Sh → D with the properties:
• π is a local homeomorphism: each s ∈ Sh has an open neighbourhood U such that π(U) is open, and the restriction of π to U is a homeomorphism.
• For each point z ∈ D, the stalk π⁻¹(z) ≡ Sz has the structure of an abelian group.
• The group operations are continuous with respect to the topology of Sh.
We will pick D to be an open set in the complex plane; Really, D can be arbitrary.

## § Snake lemma

### § Why homomorphisms for chain maps?

First of all, to define a mapping between simplicial complexes { Gi } and { Hi }, one might naively assume that we can ask for functions { fi: Gi → Hi }:
     ∂    ∂
G3 → G2 → G1 → 0
|    |    |
f    g    h
↓    ↓    ↓
0 → H3 → H2 → H1
     ∂    ∂

Unfortunately, to be able to use the machinery of homology, we need the { fi } to be abelian group homomorphisms. However, this is no great loss. Intuitively, when we want to map complexes, we first say where the generators of the abelian group (ℤ-module) map to; everything else is determined by the generators. This aligns nicely with our intuition of what a map between complexes should look like: we tell where the geometry goes ("this edge goes there"), and the algebra is "dragged along for the ride". This gives us the diagram:
    G3--∂-→G2--∂-→G1
|      |      |
f3     f2     f1
↓      ↓      ↓
0 →H3--∂-→H2--∂-→H1

where the fi are homomorphisms. So, this means we can talk about kernels and images!
    Ker(f3)----→Ker(f2)--→Ker(f1)
|         |          |
↓         ↓          ↓
G3--∂----→G2----∂---→G1--→ 0
|         |          |
f3        f2         f1
↓         ↓          ↓
Im(f3)--∂--→Im(f2)--∂-→Im(f1)

F → E → V → {0}
{0} → {e} → {v} → {0}

The Snake Lemma gives us a mapping d: Ker(f1) → Im(f3) fitting into a long exact sequence.

### § What do we wish to compute?

• Now that we've agreed that this family of maps { fi : GiHi } ought to be structured maps, the next question is "OK, now what? What does one want to determine"? Ideally, we would get a new chain complex which I tacitly denote as { f(Gi) }, consisting of the image of Gi inside Hi and the ability to determine its structure.
• However, this is the boring bit. We don't really care about the chain complex { f(Gi) } per se. What we actually care about are the homology groups! So we would really like a tool that allows us to compute Hi(f(G)) in some convenient fashion.

## § Kernel, cokernel, image

Consider a linear map T: X → Y. We want to solve for { x : T(x) = y0 }.
• If we have an x0 such that T(x0) = y0, then see that T(x0 + Ker(T)) = T(x0) + T(Ker(T)) = y0 + 0 = y0. So the kernel gives us our degrees of freedom: how much can we change around without changing the solution.
• Now consider the cokernel: Coker(T) = Y / Im(T). Suppose we want to find a solution { x : T(x) = y0 }. If y0 ≠ 0 in Coker(T), then the solution set is empty. The cokernel tells us the obstruction to a solution.
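The two bullets can be seen in a tiny two-dimensional example (the map T and the vectors here are assumed purely for illustration):

```python
# Concrete instance: T(x1, x2) = (x1 + x2, 2*x1 + 2*x2),
# with Ker(T) = span{(1, -1)} and Im(T) = span{(1, 2)}.
def T(x1, x2):
    return (x1 + x2, 2*x1 + 2*x2)

# Degrees of freedom: shifting a solution by a kernel element keeps T fixed.
x0 = (1.0, 0.0)                    # T(x0) = (1.0, 2.0)
shifted = (x0[0] + 5, x0[1] - 5)   # x0 + 5*(1, -1), a kernel shift
print(T(*x0) == T(*shifted))       # True

# Obstruction: y0 = (1, 3) is not a multiple of (1, 2), so its class in
# Coker(T) = Y/Im(T) is nonzero and T(x) = y0 has no solution.
y0 = (1, 3)
print(y0[1] - 2 * y0[0] != 0)      # True
```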

## § The commutator subgroup

Define the commutator of g, h as [g, h] ≡ ghg−1h−1. The subgroup generated by all commutators in a group is called as the commutator subgroup, sometimes denoted [G, G].
• We need to consider generation. Consider the free group on 4 letters G = ⟨ a, b, c, d ⟩. Now [a, b] · [c, d] has no expression as a single commutator [α, β].
• In general, the elements of the commutator subgroup will be products of commutators.
• It measures the degree of non-abelian-ness of the group. G/[G, G] is the largest quotient of G that is abelian. Alternatively, [G, G] is the smallest normal subgroup we need to quotient by to get an abelian quotient. This quotienting is called abelianization.

## § Simplicity of A5 using PSL(2, 5)

#### § Presentation of A5

We take on faith that A5 has the presentation:
<a, b | a^2 = b^3 = (ab)^5 = 1>

If I find a nice proof of this isomorphism, or some other way to derive the fact that PSL(2, 5) is isomorphic to A5, I will fill this up.

#### § Step 1: PSL(2, 5) is isomorphic to A5

PSL(2, 5) consists of projective Mobius transformations with function composition as the group operation. Here, we freely use the linear algebraic relationship between transformations of the form (az + b)/(cz + d) and matrices [a b; c d].

a, b, c, d ∈ ℤ5, ad − bc = 1
f: ℤ5 ∪ { ∞ } → ℤ5 ∪ { ∞ }
f(z) ≡ (az + b)/(cz + d)
• We allow coefficients for the Mobius transform to be from ℤ5, and we allow the domain and codomain of the function to be projectivized: so we add a point at infinity to ℤ5.
• We construct a map from PSL(2, 5) to A5 and then show that this map is an isomorphism. We exploit the presentation of A5 to find elements p, q ∈ PSL(2, 5) such that p² = q³ = (pq)⁵ = I. We can link this to the presentation of A5, which requires precisely those relations.
• For an element of order 3, we pick q(z) = 1/(1-z).

q(z) = 1/(1 − z)

q(q(z)) = 1/(1 − 1/(1 − z)) = (1 − z)/((1 − z) − 1) = (1 − z)/(−z) = (z − 1)/z = 1 − 1/z

q(q(q(z))) = 1 − 1/q(z) = 1 − (1 − z) = z

• I don't know of a principled way to arrive at this choice of q(z), except by noticing that az + b does not work, and neither does 1/z. The next simplest choice is things of the form 1/(1-z). If there is a nicer way, I'd love to know.
• For a function of order 5, we have to use the structure of the finite field somehow. We can consider the function r(z) = 1 + z. On repeating this 5 times, we will get 5 + z = z. However, it is hard to connect r(z) = 1 + z to the previous choice of q(z) = 1/(1-z).
• We use the same idea for r(z), and pick r(z) = z - 1. This will allow us to accumulate -1s till we hit -5 = 0.
• To get r(z) = (z - 1), we need to compose q(z) = 1/(1-z) with p(z) = -1/z. This p(z) is of order 2.
To recap, we have achieved a set of functions:
p(z) = -1/z [order 2]
q(z) = 1/(1-z) [order 3]
r(z) = (z - 1) [order 5]
r = -1/[1/(1-z)] = p . q

That is, we have found elements in PSL(2, 5) such that p^2 = q^3 = (pq)^5 = 1. This gives us the [surjective] map from PSL(2, 5) onto A5.
• By a cardinality argument, we know that the size of PSL(2, 5) is 60. Hence, since PSL(2, 5) and A5 have the same number of elements, this map must be a bijection.
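The orders of p, q, and r = p∘q can be checked mechanically in plain Python (the encoding of ∞ and the helper names are mine):

```python
# Verify that p(z) = -1/z, q(z) = 1/(1-z), and r = p∘q have orders
# 2, 3, 5 as maps on Z/5 plus a point at infinity.
P = 5
INF = "inf"  # the point at infinity

def inv(a):  # multiplicative inverse mod 5 (a nonzero)
    return pow(a, P - 2, P)

def p(z):
    if z == INF: return 0
    if z == 0: return INF
    return (-inv(z)) % P

def q(z):
    if z == INF: return 0
    if z == 1: return INF
    return inv((1 - z) % P)

def r(z):  # r = p∘q, which simplifies to z - 1
    return p(q(z))

def order(f):
    # smallest n with f^n = identity on all points
    pts = list(range(P)) + [INF]
    n, cur = 1, {x: f(x) for x in pts}
    while any(cur[x] != x for x in pts):
        cur = {x: f(cur[x]) for x in pts}
        n += 1
    return n

ords = (order(p), order(q), order(r))
print(ords)  # (2, 3, 5)
```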

## § A5 is not solvable

There are many accounts of why A5 is not solvable on the internet. I'm recording my version here, because the proof involves certain ad-hoc choices which I want to make sure I can find off-hand in the future. We'll show that [A5, A5] = A5, thereby proving that A5 is not solvable. This is useful for Galois theory, where we want to show that A5 cannot be built as extensions of smaller cyclic groups.

#### § Notation

I'll be using non-standard notation: (12);(34) means 'perform (12) then perform (34)'. I find this notation makes permutation composition intuitive for me. The ; is evocative of C-style languages, where we are ending a statement. I will be consistently using [g, h] ≡ ghg−1h−1 to denote the commutator.

#### § permutations in A5

First, recall that A5 only has the even permutations in S5. So each element is built from zero, two, or four transpositions. There can't be more after simplification, since we only have 5 elements --- the longest sequence of transpositions we can do is (12)(23)(34)(45). So, in A5, we have:
• The identity permutation ().
• The transpositions (ij)(kl) where {i, j} and {k, l} do not overlap. From these, we get the products of two disjoint 2-cycles.
• The transpositions (ij)(kl) where {i, j} and {k, l} overlap. Here we cannot have {i, j} = {k, l} since then we will just have a single transposition. So, let us assume that we have j = k. If we have any other equality, we can always flip the transpositions around to get to the normal form j = k:
(23);(12)
= (32);(12) [(23) = (32)]
= (32);(21) [(12) = (21)]

• In this case, we can show that such a transposition must be a cycle:
[a b c] -(32)->
[a c b] -(21)->
[c a b]

• Intuitively, we are pushing the element c backward, and allowing the other elements to take its place using the permutation (23);(12).
• So, from the transpositions of the form (ij)(kl) where {i, j} and {k, l} intersect, we get the 3-cycles.
• Finally, we can have the transpositions of the form (12)(23)(34)(45). It must be of this form, or some permutation of this form. Otherwise, we would have repeated elements, since these transpositions are packed "as close as possible". These generate the 5-cycles.

#### § A5 is generated by 3-cycles.

We claim that we can write any element of A5 in terms of 3-cycles.
• The disjoint transpositions of the type (34)(12) can be written as (34)(23)(23)(12), because (23)(23) = e. This can be further broken down into ((34)(23)) ((23)(12)), which is two 3-cycles: (234); (123).
• The non-disjoint transpositions of the type (32)(21) are 3-cycles: (32)(21) = (123).
• 3-cycles are 3-cycles.
• Any 5-cycle can be written as two 3-cycles: (45)(34)(23)(12) can be written as ((45)(34))((23)(12)), which is two 3-cycles: (345); (123).
So, if we figure out how to write 3-cycles in terms of commutators, we win. Because the commutator subgroup of An is generated by elements that can be written as [g, h]. If we can show that 3-cycles can be written as [g, h], then every other element has a representation in terms of these 3-cycles, and are therefore elements of the commutator subgroup.
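The first decomposition above can be checked pointwise in plain Python, reading ';' as "apply the left factor first" (the helper names are mine):

```python
# Check that (34);(12) equals (234);(123), where each 3-cycle is built
# from its two transpositions exactly as in the text: (234) = (34);(23).
def swap(a, b):
    return lambda x: b if x == a else (a if x == b else x)

def then(*fs):  # left-to-right composition, matching the ';' notation
    def run(x):
        for f in fs:
            x = f(x)
        return x
    return run

lhs = then(swap(3, 4), swap(1, 2))       # (34);(12)
cyc234 = then(swap(3, 4), swap(2, 3))    # ((34)(23)) = (234)
cyc123 = then(swap(2, 3), swap(1, 2))    # ((23)(12)) = (123)
rhs = then(cyc234, cyc123)

lhs_vals = [lhs(x) for x in range(1, 6)]
rhs_vals = [rhs(x) for x in range(1, 6)]
print(lhs_vals == rhs_vals)  # True
```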

#### § 3-cycles can be generated as commutators of 2-cycles:

• We saw how we can write a 3-cycle of the form C = (123) as (32)(21). We wish to write this as the commutator of two elements g, h: C = [g, h].
• The idea is that we have the leftover elements 4, 5 that are unused by C in A5 [here is where 5 is important: 3 + 2 = 5, and we need two leftover elements].
• We can use these two leftover elements 4, 5 to build elements g, h which cancel off, leaving us with (32)(21). We start with g = (32)___, h = (21)___ where the ___ is to be determined:
(32)___||(21)___||___(32)||___(21)
g        h        g^-1    h^-1

• It is important that g and h contain another transposition, because they are members of A5! We need them to be products of an even number of transpositions.
• We insert (4 5) everywhere. These (4 5) can slide over the (2 1) and thereby harmlessly cancel:
(32)(45)||(21)(45)||(45)(32)||(45)(21)
g        h           g^-1       h^-1

• Simplify the above expression by moving the (45) over (21), (32):
(32)||(21)(45)(45)||(32)||(45)(45)(21)
g      h          g^-1    h^-1

• cancel the (45)(45) = e:
(32)||(21)||(32)||(21)
g   h    g^-1   h^-1

So we are left with (32);(21);(32);(21). This is the square of what we really wanted, C = (32);(21). However, since C is a 3-cycle, we know that C = C−2. So, we can start with C−1, use our trick to generate C−2 which is equal to C. Since this works for any C, we have shown that we can generate 3-cycles from commutators of A5.
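The cancellation argument can be checked pointwise in plain Python (helper names are mine): with g = (32);(45) and h = (21);(45), the commutator g;h;g^-1;h^-1 comes out to (123)² = (132), exactly as claimed.

```python
# Verify the commutator trick: [g, h] = C^2 where C = (32);(21) = (123).
def swap(a, b):
    return lambda x: b if x == a else (a if x == b else x)

def then(*fs):  # left-to-right composition, matching the ';' notation
    def run(x):
        for f in fs:
            x = f(x)
        return x
    return run

g = then(swap(3, 2), swap(4, 5))
h = then(swap(2, 1), swap(4, 5))
g_inv = then(swap(4, 5), swap(3, 2))   # reversing the order inverts
h_inv = then(swap(4, 5), swap(2, 1))
comm = then(g, h, g_inv, h_inv)

comm_vals = [comm(x) for x in [1, 2, 3, 4, 5]]
print(comm_vals)  # [3, 1, 2, 4, 5], i.e. (132) = (123)^2, fixing 4 and 5
```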

#### § Alternate viewpoint on above proof

We have a 3-cycle s = (a b c). We first find a square root t such that t*t = s. To do this, we make t have the cycle entries of s spread out in gaps of 2:
t = (a _ _)
t = (a _ b) [+2]
t = (a c b) [+2, modulo]

It is hopefully clear that t*t = s:
t = (a c b)
t*t: apply the cycle twice.
t*t = a -(skip c)-> b
      b -(skip a)-> c
      c -(skip b)-> a
= (a b c) = s

Now, we will write s = t*t and then find the commutator decomposition from it:
s = t*t
= (abc)(abc)
= (cb)(ba)(cb)(ba)
= (cb)|(ba)|(cb)|(ba)
   g    h    g-1   h-1

But there's a problem: this g and h do not belong to A5; they belong to S5. This is fixed by inserting a transposition (pq) of unused letters, which we know will exist.

#### § Recap: How have we shown that A5 is not solvable?

What have we shown?
• 3-cycles can be written as [g, h] for g, hA5. Alternatively, we can say that 3-cycles belong to the commutator subgroup of A5, since they can be written as commutators.
• any element in A5 can be written as the composition of 3-cycles.
• Hence, any element in A5 can be written as the composition of commutators.
In my mind, I think of it as:
arbitrary g
= (3-cycle-1)(3-cycle-2)....(3-cycle-n)
= [g1, h1][g2, h2]....[gn, hn]
= member of [A5, A5]

Recall that [A5, A5] is generated by commutators. It not only contains elements of the form [g, h], but also all products of the form [g, h][g′, h′]. So we don't need to exhibit how to write a 5-cycle as some [g, h]. We just need to exhibit how to write as the product of commutators, which we have now shown.

#### § Solvable implies simple

We can consider the chain definition of solvability. Let there be a chain of normal subgroups G = N[0] ⊵ N[1] ⊵ … ⊵ N[m] = e, such that each quotient N[i] / N[i+1] is abelian. Then, if G is simple, this chain can only be G = N[0] ⊵ N[1] = e.
• If we want the quotient G/N to be abelian, then we need the commutator subgroup [G, G] to be a subset of N.
• In our case, [A5, A5] = A5. So if we want to remove the non-abelian-ness of A5, we need to quotient by the whole of A5.
• This means that any such chain will immediately collapse to e.
• So, it's impossible to build A5 using 'cyclic components' starting from {e}. Viewed from the field-theoretic perspective, this means that it's impossible to reach a polynomial whose splitting field has Galois group A5 by simply appending cycles.

#### § Nagging doubt: Did we depend on our numbering of cycles?

In all my proofs, I had used one 3-cycle, or 5-cycle, or 2-cycle to argue that it all works out. Is this really legal? Perhaps the argument written for the 3-cycle C = (123) will break down for D = (321). Fear not!
• We will show that all 3-cycles are conjugate to each other. So, we can always relabel a 3-cycle within A5.
• It is easy to note that g[k, l]g−1 = [gkg−1, glg−1]. This shows that the commutator subgroup is closed under conjugation. It had better be, because it ought to be normal for us to take quotients by it.
• Combining these facts, if we show that (123) is in [A5, A5], then some other cycle (ijk) can be conjugated to (123). Since the commutator subgroup is closed under conjugation, we have that (ijk) is a member of [A5, A5].

#### § All 3-cycles are conjugate to each other in A5.

• Given two 3-cycles C=(abc) and D=(pqr), at least one of a, b, c must be equal to one of p, q, r. Since each a, b, c is unique, and each p, q, r is unique, for them to not overlap, we would need 6 elements. But we only have 5, so there must be some overlap:
a   b   c
1 2 3 4 5
p   q r

So, we will perform our proof assuming there is 1 overlap, 2 overlap, 3 overlap. Recall that if C = (a b c) is a cycle and s is a permutation, then the action of conjugating C with s produces a permutation (s(a) s(b) s(c)). We will prove our results by finding an s, and then making s even. This is the difficult part of the proof, since we need to show that all 3-cycles are conjugate in A5. We will write s as two distinct transpositions, which will guarantee that it belongs to A5.
• Case 1: (abx) and (pqx) have a single element x in common:
C = (abx)
D = (pqx)

s: send a to p, b to q
s = (ap)(bq)
C = (abx) -conj s-> (pqx) = D

• Case 2: (axy) and (pxy) have two elements in common, x and y. Naively, we would pick s: send x to y. But this is odd, so this isn't a member of A5. To make it even, we rearrange D = (pxy) as D = (yxp). This lets us go from C to D by relabelling a to y, y to p. This permutation is even since it has two distinct transpositions.
C = (axy)
D = (pxy) = (yxp) [cyclic property]

s: send a to y, y to p
s = (ay)(yp)

C = (axy) -conj s-> (yxp) = D

• Case 3: (xyz) and (xyz) have all three elements in common, x, y, z. Here we can conjugate by identity and we are done.

#### § Why do we care about solvable?

• Roughly, we can look at the solvability criterion as giving us a way to build our group G from a series of extensions N[1], N[2], …. This extension is special, because at each step, we are adding a cyclic group.
• When we want to write a solution using nth roots, we can only adjoin nth roots, a "cyclic" component. So any element we can reach by using nth roots ought to be expressible through a tower of cyclic extensions.

#### § SAGE code to play around with commutators of A5:

• Create a dictionary m which maps each element of A5 to the commutators that create it.
from collections import defaultdict
m = defaultdict(set)
A5 = AlternatingGroup(5)
S5 = SymmetricGroup(5) # if necessary
for g in A5:
    for h in A5:
        m[g * h * g^(-1) * h^(-1)] |= { (g, h) }

# all 60 elem can be written in terms of commutators
print("number of elem generated as commutator: " + str(len(m.keys())))

# Show how to access elements of A5 and their commutator representation
cyc5 = A5("(1, 2, 3, 4, 5)")
cyc3 = A5("(1, 2, 3)")
cyc2disj = A5("(1, 2) (3, 4)")

print(m[cyc5])
print(m[cyc3])
print(m[cyc2disj])


#### § Writing each element in A5 directly as a commutator

We have shown how to write 3-cycles as the commutator of 2-cycles. We will now show how to do this for disjoint 2-cycles and 5-cycles as a matter of enlightenment.

#### § Writing disjoint 2-cycles as commutator

First, we will write the product of two disjoint 2-cycles as the square of a 4-cycle. We will then show how to write this square as a commutator of elements of A5.
s = (12)(34)

Note that if we choose t = (abcd), then t*t will exchange the first and third elements a <-> c, and the second and fourth elements b <-> d. So, if we choose:
t = (1324)
t*t = (12) (34)

Next, we need to write this t*t as [g, h] for g, h from A5.
t*t = (1324)(1324)
    = (42)(23)(31);(42)(23)(31)
    = (42)(23)(31);(23)(23);(42)(23)(31)
                   ^^^^^^^^ inserted identity
    = (42)(23)|(31)(23)|(23)(42)|(23)(31)
         g        h       g^{-1}    h^{-1}

= [(42)(23), (31)(23)]

Where both (42)(23), and (31)(23) are members of A5.
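A quick plain-Python check of this identity (the assertion holds in either composition convention, since reversing a word of transpositions inverts the product, and (1 2)(3 4) is its own inverse):

```python
from functools import reduce

def tr(a, b):  # the transposition (a b) on {1..5}, as a dict
    p = {i: i for i in range(1, 6)}
    p[a], p[b] = b, a
    return p

def compose(f, g):  # right-to-left: (f . g)(x) = f(g(x))
    return {x: f[g[x]] for x in g}

def inv(f):
    return {v: k for k, v in f.items()}

g = compose(tr(4, 2), tr(2, 3))
h = compose(tr(3, 1), tr(2, 3))
comm = reduce(compose, [g, h, inv(g), inv(h)])   # [g, h] = g h g^-1 h^-1
assert comm == compose(tr(1, 2), tr(3, 4))       # equals (1 2)(3 4)
```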

#### § Writing 3-cycle as commutator

We already did this explicitly, in the description of how to generate 3-cycles.

#### § Writing 5-cycle as commutator

Let s = (1 2 3 4 5). We once again find a square root of s: an element t such that t*t = s. To build it, we write down the elements of s with gaps of 2:
t = (1 _ _ _ _)
= (1 _ 2 _ _)  [+2 index]
= (1 _ 2 _ 3)  [+2 index, wrap]
= (1 4 2 _ 3)  [+2 index, wrap]
= (1 4 2 5 3)  [+2 index, wrap]

It should be clear why t*t = s: the permutation t*t moves the element j = t[i] to k = t[i+2], skipping one entry of t, and we have built t so that two steps along t is exactly one step along s. So t*t moves every element exactly as s pleases:
t = (1 4 2 5 3)
t*t = 1 -> (4 skip) -> 2
      2 -> (5 skip) -> 3
      3 -> (1 skip) -> 4
      4 -> (2 skip) -> 5
      5 -> (3 skip) -> 1
t*t = (1 2 3 4 5) = s

We will now use t*t to write the commutator. Inserting identity pairs of transpositions and regrouping, exactly as in the disjoint 2-cycle case, we arrive at:
s = t*t
  = (35)(52)(24)(41);(35)(52)(24)(41)
  = (1, 2)(3, 5)|(1, 5)(2, 4)|(3, 5)(1, 2)|(2, 4)(1, 5)
        g             h          g^{-1}       h^{-1}
  = [(1, 2)(3, 5), (1, 5)(2, 4)]

where both (1, 2)(3, 5) and (1, 5)(2, 4) are members of A5.
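As before, a quick plain-Python check, reading the composition left to right as in the notation above; since g and h are involutions, the commutator is just g h g h:

```python
from functools import reduce

def tr(a, b):  # the transposition (a b) on {1..5}
    p = {i: i for i in range(1, 6)}
    p[a], p[b] = b, a
    return p

def ltr(f, g):  # left-to-right composition: (f ; g)(x) = g(f(x))
    return {x: g[f[x]] for x in f}

g = ltr(tr(1, 2), tr(3, 5))    # (1 2)(3 5), an involution in A5
h = ltr(tr(1, 5), tr(2, 4))    # (1 5)(2 4), an involution in A5
s = reduce(ltr, [g, h, g, h])  # = g h g^{-1} h^{-1}, since g, h are involutions
assert s == {1: 2, 2: 3, 3: 4, 4: 5, 5: 1}   # the 5-cycle (1 2 3 4 5)
```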


## § Complex orthogonality in terms of projective geometry

If we think of complex vectors p = [p1, p2], q = [q1, q2] as belonging to projective space, that is, identified with the ratios p ≡ p1/p2 and q ≡ q1/q2, we can interpret orthogonality as:

 p ⋅ q = 0
 p1 q̄1 + p2 q̄2 = 0
 p1/p2 = −q̄2/q̄1
 p = −1/q̄ = −q/|q|²

If we imagine these as points on the Riemann sphere, TODO

#### § References

• Visual Complex Analysis by Tristan Needham

## § Arithmetic sequences, number of integers in a closed interval

This is a cute reason for why, when we count the number of integers in the closed interval [a, b], the answer is b - a + 1. We set up an arithmetic sequence with initial term a and common difference d = 1. We know that the nth term is a + (n-1)d. So we get:
a + (n-1)d = b
(n-1)·1 = b - a
n = b - a + 1
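The count matches what Python's range (whose upper end is exclusive) reports:

```python
# range(a, b + 1) enumerates the closed interval [a, b].
a, b = 3, 10
assert len(range(a, b + 1)) == b - a + 1
```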


## § The arg function, continuity, orientation

Let us think of the function arg: ℂ → ℝ as a multi-valued function, which maps each complex number to the set of valid angles (in units of quarter turns) that generate it:
 arg(z) ≡ { t ∈ ℝ : |z| e^{i (π/2) t} = z }
We plot the function here:
• Note that for every value z ∈ ℂ, we get a set of values associated to it.

#### § Recovering single valued-ness

Now, the question is: can we somehow automatically recover single-valuedness? Kind of: by stipulating that for any given curve c: [0, 1] → ℂ, the function arg∘c: [0, 1] → ℝ must be continuous. Let's investigate what happens if we move from right towards bot, arbitrarily stipulating ("picking a branch") that arg(right) = 0 as a sort of basepoint.
• Note that we were forced to pick the value arg(bot) = -1 from our considerations of continuity: no other value extends continuously from the right to the bottom.
• Also note that we got a smaller value as we moved from 0 to -1: the value decreases as we move clockwise.
This prompts the natural question:
what happens if we move in the opposite direction?

#### § Counter-clockwise movement

• Let's move counter-clockwise from right, arbitrarily picking the branch arg(right) = 0 as before. This gives us:
• Note that once again, we were forced to pick arg(top) = 1 by continuity considerations.
• Also note that this time, we got a larger value: we move from 0 -> 1: we increase our value as we move counter-clockwise

#### § Multiple winding

The true power of this multi-valued approach comes from being able to handle multiple windings; here the real meaning of being a multi-valued function shows through. If we decide to go through the loop twice, as:
bot -> right -> top -> left -> bot -> right -> top -> left
-1  -> 0     -> 1   -> 2    -> 3   -> 4     -> 5   -> 6

That is, we end up with the value 6, which can only be reached by going around the loop twice.

#### § Orientation from continuity

There's something really elegant about being able to recover a notion of "orientation" by simply:
1. Allowing multi-valued functions.
2. Forcing continuity constraints.
3. Interpreting increase/decrease in the value of the function.

#### § Discretizing, gaining more insight

I was personally dissatisfied with the above explanation, because it seemed weird that we would need to depend on the history to define this function. We can formalize this notion of history. Let's first discretize the situation, giving us:
• We are on the space of the spokes, given by a, b, c, d, e, f, g, h.
• We have a function f: Spoke -> Val whose values are given on the spokes.
• We are interested in the path p: Time -> Spoke, p = [a, b, c, d, e, f, g, h, a].
• If we evaluate the function f on the path p, we get out: Time -> Val, out = [0, 1, 2, 3, 4, 5, 6, 7, 0].
• We have a "jump" from 7 to 0 in out as we cross from h to a. This is a discontinuity in out at time 7.
• We want to fix this, so we make the function f multi-valued.
• We assign both values 8 and 0 to the spoke a. We wish to define the evaluation of f: Spoke -> 2^N relative to path p. At time t, point p[t], we pick any value in f(p[t]) that makes out[t] continuous.
• So in this case, when we start, we have two choices for out[0] = f(p[0]) = f(a): 0 and 8. But we know that out[1] = f(p[1]) = f(b) = 1. Hence, for out[0] to be continuous, we must pick out[0] = 0.
• Similarly, at out[8] we have two choices: 0 and 8. But we have that out[7] = 7, hence we pick out[8] = 8.
• Note that we say 'we pick the value' that makes out continuous. This is not really rigorous. We can fix this by re-defining f in such a way that f is not Spoke -> Val, but rather it knows the full path: f': (Time -> Spoke) -> Val.

#### § Making the theory path-dependent

path: Time -> Spoke
f: Spoke -> 2^Val -- multi-valued

-- | morally, not mathematically.
out: Time -> Val
out t = choose_best_value (f(path[t]))

But there was a vagueness in this choose_best_value. So we redefine it:
path: Time -> Spoke
f': (Time -> Spoke) -> Time -> Val
f'(path, tcur) =
    argmin (\v -> |v - out[tcur-1]| + |v - out[tcur+1]|)
           (f(path[tcur]))

out: Time -> Val
out = f'(path)

• The function f' that defines the value of the path has full access to the path itself!
• At time tcur, it attempts to pick the value in f(path[tcur]) which makes the discontinuity as small as possible: it picks the value v from the possible values of f(path[tcur]) which minimises the sum of the distance to the value chosen at the previous time point (|v − out[tcur−1]|) and the distance to the value at the next time point (|v − out[tcur+1]|).
• This provides a rigorous definition of what it means to "pick a value in the branch". This can clearly be extended to the continuous domain.
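Here is a minimal Python sketch of the discrete story, with one simplification: the branch at each step is chosen greedily by distance to the previously chosen value only, rather than to both neighbours:

```python
# 8 spokes a..h; f is multi-valued at spoke 'a' (both 0 and 8).
SPOKES = "abcdefgh"
f = {s: {i} for i, s in enumerate(SPOKES)}
f["a"] = {0, 8}

def unwrap(path):
    out = [min(f[path[0]])]   # arbitrary branch choice at the basepoint
    for spoke in path[1:]:
        # pick the branch value closest to the previous choice: "continuity"
        out.append(min(f[spoke], key=lambda v: abs(v - out[-1])))
    return out

# going all the way around forces the other branch (8, not 0) at the end:
assert unwrap(list("abcdefgh") + ["a"]) == [0, 1, 2, 3, 4, 5, 6, 7, 8]
```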

## § Odd partitions, unique partitions

A well known identity in combinatorics is that the partitions n = l1 + l2 + ... + lk where each li is odd are in bijection with the partitions where each li is unique. I really liked this bijection.
(15, 9, 9, 5, 5, 5, 3, 3, 3, 1, 1, 1, 1) [all odd]
=
(15, 9x2, 5x3, 3x3, 1x4) [group]
=
(15x1, 9x2, 5x(2+1), 3x(2+1), 1x4) [expand base-2]
=
(15, 18, [10, 5], [6, 3], 4) [all unique]
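The grouping and expand-base-2 steps translate directly into a short Python function (the name odd_to_distinct is ad-hoc):

```python
from collections import Counter

def odd_to_distinct(parts):
    """Map a partition with odd parts to one with distinct parts:
    group equal parts, write each multiplicity in binary, and emit
    part * 2^bit for each set bit."""
    out = []
    for part, mult in Counter(parts).items():
        bit = 0
        while mult:
            if mult & 1:
                out.append(part << bit)   # part * 2^bit
            mult >>= 1
            bit += 1
    return sorted(out, reverse=True)

odd = [15, 9, 9, 5, 5, 5, 3, 3, 3, 1, 1, 1, 1]
distinct = odd_to_distinct(odd)
assert distinct == [18, 15, 10, 6, 5, 4, 3]
assert sum(distinct) == sum(odd) and len(set(distinct)) == len(distinct)
```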


## § Continued fractions, mobius transformations

read ekmett/fractions properly and write detailed log about it, and the related math.

## § Permutations-and-lyndon-factorization

For a string s, the Lyndon factorization writes s as the concatenation of substrings t1, t2, ..., tn, such that:
• each ti is a simple word; that is, it is strictly lexicographically smaller than all of its nontrivial cyclic shifts.
• the words are in non-increasing order: t1 >= t2 >= t3 >= ... >= tn.
For example, given the word banana, the Lyndon factorization is:
b; an; an; a;

We can define a notation for writing a permutation as:
• Each cycle is rotated so that its smallest element comes first (since the symbols in a cycle are distinct, this makes each cycle a simple word).
• Cycles are written in descending order of their first elements.
• Single-element cycles are written as-is.
(7) (2 3) (1 4 5)

If we treat it as the string 723145, the Duval algorithm provides the decomposition:
7; 23; 145;

So, we can treat the Duval algorithm as a way to recover the permutation given the raw string. It's a nice way to remember the definition of the Lyndon decomposition, if nothing else.
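A sketch of Duval's algorithm, which produces the factorization in linear time:

```python
def duval(s):
    """Lyndon factorization of s."""
    n, i, out = len(s), 0, []
    while i < n:
        j, k = i + 1, i
        while j < n and s[k] <= s[j]:
            # equal: keep extending the current period; smaller: restart it
            k = k + 1 if s[k] == s[j] else i
            j += 1
        while i <= k:               # emit copies of the simple word, length j - k
            out.append(s[i:i + j - k])
            i += j - k
    return out

assert duval("banana") == ["b", "an", "an", "a"]
assert duval("723145") == ["7", "23", "145"]
```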

## § Graphs are preorders

I wanted to record this fact for myself, so that I have reason to come back to it as I learn more preorder theory. Perhaps there are certain graph theoretic phenomena that make more sense when looked at from a preorder point of view.

## § Crash course on domain theory

In lambda calculus, we often see functions of the form λx. x(x). We would like a way to associate a "natural" mathematical object to such a function. The most obvious choice for lambda calculus is to try to create a set V of values which contains its own function space: (V → V) ⊆ V. Unfortunately, the only solution to this among sets is the trivial set { * }.

#### § CPOs

• Given a partial order (P, ≤), assume we have a subset Q ⊆ P. A least upper bound u of Q is the smallest element of P which is larger than every element of Q.
• A subset Q of P is called a chain if its elements can be put into order: that is, there is a labelling of the elements of Q as q1, q2, …, qn such that q1 ≤ q2 ≤ … ≤ qn.

#### § CCPOs

• A partially ordered set is called a chain complete partial order (ccpo) if each chain has a least upper bound.
• This is different from a lattice, where each subset has a least upper bound.
• Every ccpo has a least element ⊥, given by the least upper bound of the empty chain: ⊥ ≡ lub(∅).
• TODO: example of a ccpo that is not a lattice

#### § Monotone map

• A function f from P to Q is said to be monotone if p ≤ p′ implies f(p) ≤ f(p′).
• Composition of monotone functions is monotone.
• The image of a chain wrt a monotone function is a chain.
• A monotone function need not preserve least upper bounds. Consider f: 2^ℕ → 2^ℕ (with ℕ = {1, 2, …}):

 f(S) ≡ S          if S is finite
        S ∪ {0}    if S is infinite

This does not preserve least upper bounds. Consider the chain of elements:
 A1 = {1}, A2 = {1, 2}, A3 = {1, 2, 3}, …, An = {1, 2, 3, …, n}
The least upper bound (union) of all the Ai is ℕ. Each Ai is finite, so f(A1) = A1, f(A2) = A2, and so on. Therefore:
 f(⊔ Ai) = f(ℕ) = ℕ ∪ {0}  ≠  ⊔ f(Ai) = ⊔ Ai = ℕ

#### § Continuous function

• A function is continuous if it is monotone and preserves all LUBs. This is only sensible as a definition on ccpos, because the equation defining it is lub ∘ f = f ∘ lub, where lub: chain(P) → P. For lub to always exist, we need P to be a CCPO; so the definition of continuity only works for CCPOs.
• The composition of continuous functions of chain-complete partially ordered sets is continuous.

#### § Fixpoints of continuous functions

The least fixed point of a continuous function f: D → D is:
 FIX(f) ≡ lub({ fⁿ(⊥) : n ≥ 0 })
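On a finite ccpo the Kleene chain ⊥ ≤ f(⊥) ≤ f²(⊥) ≤ … stabilizes after finitely many steps, so FIX can be computed by iteration. A sketch on the powerset of {0..4} ordered by inclusion (the function f here is made up for illustration):

```python
def f(s):
    """Monotone (indeed continuous) on (2^{0..4}, ⊆): add 0 and close under
    x -> min(x + 1, 4)."""
    return {0} | set(s) | {min(x + 1, 4) for x in s}

def fix(f, bot=frozenset()):
    x = bot
    while True:
        y = frozenset(f(x))
        if y == x:          # chain stabilized: x = f(x) is the least fixpoint
            return x
        x = y

assert sorted(fix(f)) == [0, 1, 2, 3, 4]
```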

#### § ≤ as implication

We can think of b ≤ a as b ⇒ a: b has more information than a, and hence implies a.

#### § References

• Semantics with Applications: Hanne Riis Nielson, Flemming Nielson.

## § Parallelisable version of maximum sum subarray

I learnt of a "prefix sum / prefix min" based formulation from the solution to question D of Codeforces Educational Round 88. The idea is to start by writing the max subarray sum as a difference of prefix sums, right minus left:

 max_{(L, R)} ∑_{L ≤ i ≤ R} a[i]
 = max_R ( ∑_{0 ≤ i ≤ R} a[i] − min_{L ≤ R} ∑_{0 ≤ i < L} a[i] )

Which is then expressed as:
 asum[n] ≡ ∑_{0 ≤ i ≤ n} a[i]
 aminsum[n] ≡ min_{0 ≤ i ≤ n} asum[i]
 answer = max_R (asum[R] − aminsum[R])
Since asum is a prefix sum of a, and aminsum is a prefix min of asum, the whole thing is O(n) serial and O(log n) parallel. In Haskell, this translates to:
sums = scanl (+) 0
sums_mins = scanl1 min . sums
rise xs = zipWith (-) (sums xs) (sums_mins xs)
main = print (rise [1, 2, 3, -2, -1, -4, 4, 6])
> [0,1,3,6,4,3,0,4,10]

sums_mins keeps track of the sea level, while the zipWith (-) of sums and sums_mins computes the elevation above sea level; the maximum elevation is the answer.

## § Thoughts on implicit heaps

Some musings I had on the ability to represent heaps as arrays, and in general, the benefits of knowing the total number of elements.
• Knowing the total number of elements allows us to pre-ordain a memory layout. We can decide that for a node at index i, the left child is at 2*i and the right child at 2*i+1. This puts the parent at i//2.
• This immediately gives us O(1) access to the parent (i//2) and the sibling (i^1) with no extra memory usage.
• This cannot be done for data structures where we need to splice into the middle: For example, an implicit treap where we wish to splice sub-arrays together.
• On coding a heap, we can decide whether to use the left or right sibling by using next = 2*i + predicate. If predicate = false = 0, we will pick the left child, otherwise the right child. This allows us to compress some annoying if/then/elses into one-liners.
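The index arithmetic above, written out as a few 1-based helpers:

```python
def left(i):    return 2 * i
def right(i):   return 2 * i + 1
def parent(i):  return i // 2
def sibling(i): return i ^ 1        # flips the low bit: left <-> right child

def child(i, go_right):             # the predicate trick: no if/then/else
    return 2 * i + int(go_right)

assert parent(left(7)) == parent(right(7)) == 7
assert sibling(left(7)) == right(7)
assert (child(7, False), child(7, True)) == (left(7), right(7))
```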

## § Discriminant and Resultant

I had always seen the definition of a discriminant of a polynomial p(x) as:
 Disc(p(x)) ≡ an^{2n − 2} ∏_{i < j} (ri − rj)²
While it is clear why this tracks whether a polynomial has repeated roots, I could never motivate this definition to myself, or remember it. I learnt that in fact it comes from a more general object, the resultant of two polynomials P(x), Q(x): a quantity Res(P(x), Q(x)) which is zero iff P and Q share a common root. The discriminant is then (up to a constant factor) the resultant of a polynomial and its derivative. This makes far more sense:
• If a polynomial has a repeated root r, then its factorization will be of the form p(x) = (xr)2 q(x). The derivative of the polynomial will have an (xr) term that can be factored out.
• On the contrary, if a polynomial only has a root of degree 1, then the factorization will be p(x) = (xr) q(x), where q(x) is not divisible by (xr). Then, the derivative will be p′(x) = 1 · q(x) + (xr) q′(x). We cannot take (xr) common from this, since q(x) is not divisible by (xr).
This cleared up a lot of the mystery for me.
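A numerical sketch of the root-product form of the discriminant, using numpy.roots (the helper disc is ad-hoc, and the comparisons are up to floating point noise):

```python
import numpy as np

def disc(coeffs):
    """a_n^(2n-2) * prod_{i<j} (r_i - r_j)^2, coeffs highest-degree first."""
    r = np.roots(coeffs)
    n = len(r)
    prod = 1.0
    for i in range(n):
        for j in range(i + 1, n):
            prod *= (r[i] - r[j]) ** 2
    return coeffs[0] ** (2 * n - 2) * prod

assert abs(disc([1, -2, 1])) < 1e-9       # (x - 1)^2: repeated root
assert abs(disc([1, 0, -2, 1])) > 1e-9    # x^3 - 2x + 1: distinct roots
```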

#### § How did I run into this? Elimination theory.

I was trying to learn how elimination theory works: Given a variety V = { (x, y) : Z(x, y) = 0 }, how does one find a rational parametrization (p(t), q(t)) such that Z(p(t), q(t)) = 0, and p(t), q(t) are rational functions? That is, how do we find a rational parametrization of the locus of a polynomial Z(x, y)? The answer is: use resultants!
• We have two univariate polynomials p(a; x), q(b; x), where the notation p(a; x) means the polynomial p(a; x) ≡ ∑i a[i] xⁱ. The resultant is a polynomial Res(a; b) in the coefficients which is equal to 0 iff p(a; x) and q(b; x) share a common root.
• We can use this to eliminate variables. We can treat a bivariate polynomial p(x, y) as a univariate polynomial p′(y) over the ring R[x]. This way, given two bivariate polynomials p(a; x, y), q(b; x, y), we can compute their resultant with respect to y, giving us conditions for which values of a, b, x there exists a common y such that p(a; x, y) and q(b; x, y) share a root. If (a, b) are constants, then we get a polynomial Res(x) that tracks whether p(a; x, y) and q(b; x, y) share a root.
• We can treat the parametrization above as two equations, x − p(t) = 0 and y − q(t) = 0, and apply the method of resultants to project out t from the equations.

#### § 5 minute intro to elimination theory.

Recall that when we have a linear system Ax = 0, the system has a non-trivial solution x ≠ 0 iff |A| = 0. Also, the ratio of solutions is given by:
 xi / xj = (−1)^{i+j} |Ai| / |Aj|
If we have two polynomials p(a; x) = a0 + a1 x + a2 x² and q(b; x) = b0 + b1 x + b2 x², then the system p(a; x) = q(b; x) = 0 has a simultaneous zero iff |Q| = 0, where Q v = 0 for:

 Q = [ a2 a1 a0 0  ]      v = [ x³ ]
     [ 0  a2 a1 a0 ]          [ x² ]
     [ b2 b1 b0 0  ]          [ x  ]
     [ 0  b2 b1 b0 ]          [ 1  ]

#### § Big idea

The matrix is set up in such a way that any solution vector v with Qv = 0 will be of the form v = (α³, α², α, 1): the solution vector is the vector of powers of some α. Since Qv = 0, we have that a2 α² + a1 α + a0 = 0 and b2 α² + b1 α + b0 = 0; that is, α is a common root.

#### § Proof

• Necessity is clear. If we have some non trivial vector v ≠ 0 such that Qv = 0, then we need |Q| = 0.
• Sufficiency: Since |Q| = 0, there is some vector v = (w, x, y, z) ≠ 0 such that Qv = 0. We need to show that this v has the right structure. If the polynomials p(a; x), q(b; x) are not proportional, then the rows with coefficients from p and from q are linearly independent. Rows 1 and 3 give the system:
 a2 w + a1 x + a0 y = 0
 b2 w + b1 x + b0 y = 0
Similarly, rows 2 and 4 give:
 a2 x + a1 y + a0 z = 0
 b2 x + b1 y + b0 z = 0
Since the coefficients of the two systems are the same, the solution triples (w, x, y) and (x, y, z) must be proportional. That is:
 (w, x, y) = α (x, y, z);  w = α x = α² y = α³ z
We can take z = 1 arbitrarily, giving a vector of the form (w, x, y, z) = (α³, α², α, 1), which is the structure of the solution we are looking for!
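The 4×4 case can be checked numerically (sylvester2 is an ad-hoc helper building the matrix Q above):

```python
import numpy as np

def sylvester2(a2, a1, a0, b2, b1, b0):
    return np.array([[a2, a1, a0, 0],
                     [0, a2, a1, a0],
                     [b2, b1, b0, 0],
                     [0, b2, b1, b0]], dtype=float)

# p = (x - 1)(x - 2), q = (x - 1)(x - 3): common root 1, so |Q| = 0
assert abs(np.linalg.det(sylvester2(1, -3, 2, 1, -4, 3))) < 1e-9
# p = (x - 1)(x - 2), q = (x - 3)(x - 4): no common root, |Q| != 0
assert abs(np.linalg.det(sylvester2(1, -3, 2, 1, -7, 12))) > 1
```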

## § Polynomial root finding using QR decomposition

1. For a polynomial p(x), build its companion matrix P.
2. Show that the characteristic polynomial cp(P) of the companion matrix P is indeed p(x).
3. Find the eigenvalues of P, which will be roots of p(x), since the eigenvalues of a matrix M are the roots of its characteristic polynomial cp(M).
4. We use QR iteration since it is numerically stable: the Q matrices it produces are orthogonal, and multiplying by an orthogonal matrix does not amplify error (its condition number is 1).
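A sketch of steps 1 and 3 with numpy (whose nonsymmetric eigensolver is QR-iteration based):

```python
import numpy as np

# p(x) = x^3 - 6x^2 + 11x - 6 = (x - 1)(x - 2)(x - 3), written monic as
# x^3 + c2 x^2 + c1 x + c0 with c = (c0, c1, c2):
c = [-6.0, 11.0, -6.0]
n = len(c)
P = np.zeros((n, n))
P[1:, :-1] = np.eye(n - 1)       # ones on the subdiagonal
P[:, -1] = [-ci for ci in c]     # last column: -c0, -c1, -c2
roots = np.sort(np.linalg.eigvals(P).real)
assert np.allclose(roots, [1, 2, 3], atol=1e-8)
```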

## § A hacker's guide to numerical analysis

Life may toss us ill-conditioned problems, but it is too short to settle for unstable algorithms. - D.P. O'Leary

#### § Measures of error

If x is a number and x̂ is its approximation, then there are two notions of error:
1. absolute error: |x − x̂|.
2. relative error: |x − x̂| / |x|.
Since the relative error is invariant under scaling (x ↦ αx, x̂ ↦ αx̂), we will mostly be interested in relative error.

#### § Significant digits

The significant digits in a number are the first nonzero digit and all succeeding digits. Thus 1.7320 has five significant digits. 0.0491 has only three significant digits. It is not transparent to me why this definition is sensible.

#### § Correct Significant digits --- a first stab

We can naively define: x̂ agrees with x to p significant digits if x̂ and x round to the same number to p significant digits. This definition is seriously problematic. Consider the numbers:
• x = 0.9949, rounded to 1, 2, 3 significant digits: x1 = 1.0, x2 = 0.99, x3 = 0.9950
• y = 0.9951, rounded to 1, 2, 3 significant digits: y1 = 1.0, y2 = 1.0, y3 = 0.9950
Here, y agrees with x to one and three significant digits, but not to two, since the roundings x2 and y2 do not agree even in the first significant digit.

#### § Correct Significant digits --- the correct definition

We say that x̂ agrees with x to p significant digits if |x − x̂| is less than half a unit in the pth significant digit of x.

#### § Accuracy v/s precision

• Accuracy: absolute or relative error of a quantity.
• Precision: the accuracy with which the basic arithmetic operations +, -, *, / are performed. For floating point, this is measured by the unit round-off (which we have not met yet).
Accuracy is not limited by precision: By using fixed precision arithmetic, we can emulate arbitrary precision arithmetic. The problem is that often this emulation is too expensive to be useful.

#### § Backward, Forward errors

Let y = f(x), where f: ℝ → ℝ. Let us compute ŷ as an approximation to y, in an arithmetic of precision u. How do we measure the quality of ŷ?
1. In many cases, we may be happy with an ŷ such that the relative error between y and ŷ is of order u: we did the best we could with the precision we were given. This is the forward error.
2. An alternative question we can ask is: for what δx do we have ŷ = f(x + δx)? That is, how far away from the input do we need to stray to get a matching output? There may be many such δx, so we ask for min |δx|, and we can divide this error by x as a normalization factor. This is the backward error.
There are two reasons we prefer backward error.
1. It converts error analysis into "data analysis". The data itself tends to be uncertain. If the error given by the backward analysis is smaller than the data uncertainty, then we can write off our error as being too small. Since for all we know, we have 'fixed' the uncertain data with our small error.
2. It reduces the question of error analysis into perturbation theory, which is very well understood for a large class of problems.

#### § Backward stable

A method for computing y = f(x) is called backward stable if it produces a ŷ with small backward error. That is, we need a small δ x such that ŷ = f(x + δ x).

#### § Mixed forward-backward error

We assume that addition and subtraction are backward stable, where u is the unit round-off of our arithmetic:
 fl(x ± y) = x(1 + Δ) ± y(1 + Δ′),  |Δ|, |Δ′| ≤ u
Another type of error we can consider is of the form:
 ŷ + δy = f(x + δx)
That is, allowing a small perturbation δy of the output as well, we ask for a backward error δx. This is called mixed forward-backward error. We say that an algorithm with mixed forward-backward error is stable iff:
 ŷ + δy = f(x + δx),  |δy|/|ŷ| < ε,  |δx|/|x| < η,  with ε, η small
This definition of stability is useful when rounding errors are the dominant form of errors.

#### § Conditioning

The relationship between forward and backward error is governed by conditioning: the sensitivity of solutions to perturbations of the data. Let us have an approximate solution ŷ = f(x + δx). Then:

 ŷ − y = f(x + δx) − f(x) = f′(x) δx + O((δx)²)
 (ŷ − y)/y = (x f′(x)/f(x)) · (δx/x) + O((δx)²)
 (ŷ − y)/y = c(x) · (δx/x) + O((δx)²),  where c(x) ≡ |x f′(x)/f(x)|
The quantity c(x) measures the scaling factor to go from the relative change in output to the relative change in input. Note that this is a property of the function f, not any particular algorithm.

#### § Example: log x

If f(x) = log x, then c(x) = |x (log x)′ / log x| = |1/log x|. This quantity is very large for x ≃ 1. So a small relative change in x can produce a drastic relative change in log x around 1.
• Note that the absolute change is quite small: log(x + δx) ≃ log x + δx/x. However, relative to log x, this change of δx/x is quite large.
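This amplification can be observed directly: perturb x relatively by 1e-8 near 1 and compare the relative change in log x against c(x):

```python
import math

x, rel = 1.0001, 1e-8
y0 = math.log(x)
y1 = math.log(x * (1 + rel))
amplification = (abs(y1 - y0) / abs(y0)) / rel
cond = abs(1 / math.log(x))      # c(x) = |1 / log x|, about 1e4 here
assert abs(amplification - cond) / cond < 0.01
```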

#### § Rule of thumb

 forward error ≲ condition number × backward error
• Glass half empty: Ill-conditioned problems can have large forward error.
• Glass half full: Well-conditioned problems do not amplify error in data.

#### § Forward stable

If a method produces answers with forward errors of similar magnitude to those produced by a backward stable method, then it is called forward stable. Backward stability implies forward stability, but not vice-versa (TODO: why?)

#### § Cancellation

Consider the following program:
#include <cmath>
#include <stdio.h>

int main() {
    double x = 12e-9;
    double c = cos(x);
    double one_sub_c = 1 - c;
    double denom = x*x;
    double yhat = one_sub_c / denom;

    printf("x:         %20.16f\n"
           "cx:        %20.16f\n"
           "one_sub_c: %20.16f\n"
           "denom:     %20.16f\n"
           "yhat:      %20.16f\n",
           x, c, one_sub_c, denom, yhat);
}

which produces the output:
x:           0.0000000120000000
cx:          0.9999999999999999
one_sub_c:   0.0000000000000001
denom:       0.0000000000000001
yhat:        0.7709882115452477

This is clearly wrong, because we know that (1 − cos x)/x² ≤ 1/2. The reason for this terrible result:
• we know cos x to high accuracy, since x was some fixed quantity.
• 1 − cos x converted the rounding error in cos x into its leading digits.
• 1 − cos x has only one significant figure.
• This makes it practically useless for anything else we are interested in doing.
In general:
 x̂ ≡ 1 + ε   (error of order ε)
 y ≡ 1 − x̂ = −ε   (value of order ε)
That is, subtracting values close to each other (in this case, 1 and x̂) converts an error order of magnitude into a value order of magnitude. Alternatively, it brings earlier errors into prominence as values.
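The cure for this particular example is to avoid the subtraction altogether: 1 − cos x = 2 sin²(x/2) holds exactly, and the right-hand side does not cancel. A quick check in Python:

```python
import math

x = 12e-9
naive = (1 - math.cos(x)) / (x * x)       # cancellation: garbage leading digits
stable = 2 * math.sin(x / 2) ** 2 / (x * x)
assert abs(stable - 0.5) < 1e-6           # the true value is ~1/2
assert abs(naive - 0.5) > 0.05            # the naive version is far off
```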

#### § Analysis of subtraction

We can consider the subtraction:

 x = a − b;  x̂ = â − b̂
 â = a(1 + Δa)
 b̂ = b(1 + Δb)

 |x − x̂| / |x| = |aΔa − bΔb| / |a − b|
              ≤ max(|Δa|, |Δb|) · (|a| + |b|) / |a − b|

This quantity will be large when |ab| ≪ |a| + |b|: that is, when there is heavy cancellation in the subtraction to compute x.

#### § Underflow

#include <cmath>
#include <stdio.h>

int main() {
    double x = 1000;
    for(int i = 0; i < 60; ++i) {
        x = sqrt(x);
    }
    for(int i = 0; i < 60; ++i) {
        x = x*x;
    }
    printf("x: %10.20f\n", x);
}

This produces the output:
./sqrt-pow-1-12
...
x: 1.00000000000000000000

That is, even though the function is the identity, the answer collapses to 1. What is happening? After 60 square roots, x = 1000^(1/2^60) ≈ 1 + ln(1000)/2^60, which is closer to 1 than the machine epsilon; it therefore rounds to exactly 1.0, and squaring 1.0 sixty times stays at 1.0.

#### § Computing (e^x − 1)/x

One way to evaluate this function is as follows:
double f(double x) { return x == 0 ? 1 : (pow(M_E, x) - 1) / x; }

This can suffer from catastrophic cancellation in the numerator. When x is close to 0, ex is close to 1, and ex − 1 will magnify the error in ex.
double f(double x) {
const double y = pow(M_E, x);
return y == 1 ? 1 : (y - 1) / log(y);
}

This algorithm seems crazy, but there's insight in it. We can show that the errors cancel: neither (y − 1) nor log y is particularly accurate, but the errors accumulated in them almost completely cancel, leaving a good value. First, if ŷ = 1:
 1 = ŷ ≡ e^x (1 + δ)
 log 1 = log(e^x) + log(1 + δ)
 x = −log(1 + δ) = −δ + O(δ²)
so x itself is of the order of the rounding error, and returning 1 is accurate. If ŷ ≠ 1, the computed value is:
 f̂ = (1 + ε₃) · (ŷ − 1)(1 + ε₁) / (log ŷ · (1 + ε₂))
with the ε's of order the unit round-off.
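The effect is easy to observe (using pow(e, x) for e^x as in the snippets above; math.expm1 is the standard library's dedicated primitive for this):

```python
import math

def naive(x):
    return 1.0 if x == 0 else (math.e ** x - 1) / x

def log_trick(x):
    y = math.e ** x
    return 1.0 if y == 1.0 else (y - 1) / math.log(y)

x = 1e-15
assert abs(naive(x) - 1.0) > 0.01          # off already in the leading digits
assert abs(log_trick(x) - 1.0) < 1e-10     # the errors cancel
assert abs(math.expm1(x) / x - 1.0) < 1e-10
```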

#### § IEEE floating point fun: +0 and -0 for complex analysis

Rather than think of +0 and -0 as distinct numerical values, think of their sign bit as an auxiliary variable that conveys one bit of information (or misinformation) about any numerical variable that takes on 0 as its value.
We have two types of zeroes, +0 and -0 in IEEE-754. These are used in some cases. The most famous is that 1/+0 = +∞, while 1/−0 = −∞. Here, we proceed to discuss some complex-analytic considerations.
Therefore, implementers of compilers and run-time libraries bear a heavy burden of attention to detail if applications programmers are to realize the full benefit of the IEEE style of complex arithmetic. That benefit deserves some discussion here, if only to reassure implementers that their assiduity will be appreciated.
 √(−1 + 0 i) = +0 + i
 √(−1 − 0 i) = +0 − i
These will ensure that √(z̄) = (√z)̄. Similarly:
 copysign(1, +0) = +1
 copysign(1, −0) = −1
These will ensure that 1/(1/x) = x when x = ±∞. An example is provided where the two one-sided limits:

 f(x + 0 i) = lim_{y → 0⁺} f(x + i y)
 f(x − 0 i) = lim_{y → 0⁻} f(x + i y)

differ, and the sign of the zero records which of the two is meant.

#### § Complex-analytic considerations

The principal branch of a complex function is a way to select one branch of a complex function, which tends to be multi-valued. A classical example is the argument function, where arg(r e^{iθ}) = θ. However, this is ambiguous: we can map θ ↦ θ + 2π and still have the same complex number. So we need to fix some standard: we usually pick the "branch" where 0 ≤ θ < 2π. In general, we need to carefully handle what happens to the function at the discontinuity.
What deserves to be undermined is blind faith in the power of Algebra. We should not believe that the equivalence class of expressions that all describe the same complex analytic function can be recognized by algebraic means alone, not even if relatively uncomplicated expressions are the only ones considered.

## § Mobius inversion on Incidence Algebras

Most of these functions are really defined on the incidence algebra of the poset P with ground field K. An incidence algebra I(P) is a set of functions which map intervals of P to the ground field K; an interval is a tuple (x, y) ∈ P × P such that x ≤ y (where the ≤ comes from the partial order on P). We have a product structure, which I denote ⋆, given by:
 (α ⋆ β)([x, z]) = ∑_{x ≤ y ≤ z} α([x, y]) · β([y, z])
A linear-algebra way to look at this is to consider |P| × |P| matrices over K whose rows and columns are indexed by P. A function α: P × P → K can then be written as a P × P matrix, and this convolution-like operator is simply matrix multiplication. We have three natural functions: (1) the characteristic function δ, which is the identity for ⋆:
 δ([x, z]) ≡ 1 if x = z;  0 otherwise
(2) the zeta function, which plays the role of the constant 1:
 ζ([x, z]) ≡ 1 if x ≤ z;  0 otherwise
(3) the inverse of the zeta function, the mobius function, a tool for mobius inversion:

 µ([x, x]) = 1
 µ([x, z]) = −∑_{x ≤ y < z} µ([x, y])   for x < z

The mobius inversion theorem for posets states that ζ and µ as defined above are convolutional inverses; that is, ζ ⋆ µ = δ. This allows us to prove:

 g([x, z]) = ∑_{x ≤ y ≤ z} f([x, y])
 g([x, z]) = ∑_{x ≤ y ≤ z} f([x, y]) · 1
 g([x, z]) = ∑_{x ≤ y ≤ z} f([x, y]) · ζ([y, z])
 g = f ⋆ ζ
 g ⋆ µ = f ⋆ ζ ⋆ µ
 g ⋆ µ = f ⋆ δ
 g ⋆ µ = f
We have managed to find f in terms of g, when previously we had g in terms of f. TODO: we are usually interested in a fixed [x, z]. What happens if we make this implicit? We may get nice notation for all of this!

#### § Sums as mobius inversion

We can derive the formula we had for integrals, that:
 F(n) = ∑_{0 ≤ i ≤ n} f(i)  ⟺  f(k) = F(k) − F(k−1)
by setting up mobius inversion on the usual partial order for the natural numbers. For simplicity, I'll show the example on [0, 1, 2, 3, 4]. The example immediately generalizes.
• We have the partial (well, total) order P: 0 < 1 < 2 < 3 < 4.
• We are given a function f(·) we wish to integrate. We define an auxiliary function fi([x, y]) = f(y) which evaluates f on the right endpoint.
• We can now define F([x, z]) as the sum of f from x to z:

 F([x, z]) ≡ ∑_{x ≤ y ≤ z} f(y)
           = ∑_{x ≤ y ≤ z} fi([x, y])
           = ∑_{x ≤ y ≤ z} fi([x, y]) · ζ([y, z])
           = (fi ⋆ ζ)([x, z])
• This tells us that fi = F ⋆ µ; in particular:

 f(n) = fi([0, n]) = (F ⋆ µ)([0, n]) = ∑_{0 ≤ x ≤ n} F([0, x]) · µ([x, n])

• We note that we need to know the values of µ([x, n]) for a fixed n, for varying x. Let us attempt to calculate µ([4, 4]), µ([3, 4]), µ([2, 4]), µ([1, 4]), µ([0, 4]) and see if this can be generalized:

 µ([4, 4]) = 1                                            [by definition]
 µ([3, 4]) = −∑_{3 ≤ y < 4} µ([3, y]) = −µ([3, 3]) = −1   [by definition]
 µ([2, 4]) = −(µ([2, 2]) + µ([2, 3])) = −(1 − 1) = 0
 µ([1, 4]) = −(1 − 1 + 0) = 0;  µ([0, 4]) = 0 similarly

So µ([x, n]) is 1 at x = n, −1 at x = n−1, and 0 below, and the inversion collapses to f(n) = F([0, n]) − F([0, n−1]), the familiar difference formula.
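In the matrix picture, µ is literally the inverse of the ζ matrix. For the chain 0 < 1 < 2 < 3 < 4 this inverse is bidiagonal, which is exactly the f(k) = F(k) − F(k−1) pattern:

```python
import numpy as np

n = 5
zeta = np.array([[1.0 if x <= z else 0.0 for z in range(n)] for x in range(n)])
mu = np.linalg.inv(zeta)
# 1 on the diagonal, -1 just above it, 0 elsewhere
assert np.allclose(mu, np.eye(n) - np.eye(n, k=1))
```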

## § Finite differences and Umbral calculus

Umbral calculus lays out a large collection of "umbral" / "shadowy" coincidences between combinatorics and calculus. Here I'll lay out some of these that I learnt from Concrete Mathematics; I hope to use this to motivate a bunch of combinatorics for myself. I'll also show how to compute sums such as ∑_{1 ≤ i < n} i² using this umbral calculus.

#### § Discrete Derivative: Forward difference

We begin by trying to build a discrete version of the derivative operator. The derivative of f will be denoted using f'. The forward difference of f: ℝ → ℝ is defined as:
 δ f: ℝ → ℝ; (δ f)(x) = f(x + 1) − f(x)
Immediately, we can see that this operator is linear:

δ(f + g)(x) = (f+g)(x+1) − (f+g)(x) = (f(x+1) − f(x)) + (g(x+1) − g(x)) = (δf)(x) + (δg)(x)
δ(αf)(x) = (αf)(x+1) − (αf)(x) = α · (f(x+1) − f(x)) = α (δf)(x)
it obeys a slightly corrupted version of the product rule, (fg)′ = f′g + fg′:

δ(fg)(x) = (fg)(x+1) − (fg)(x)
         = f(x+1)g(x+1) − f(x)g(x)
         = f(x+1)g(x+1) − f(x)g(x+1) + f(x)g(x+1) − f(x)g(x)
         = g(x+1)[f(x+1) − f(x)] + f(x)[g(x+1) − g(x)]
         = g(x+1)(δf)(x) + f(x)(δg)(x)
         = ((Sg) · δf)(x) + (f · δg)(x)     [(Sh)(x) ≡ h(x+1)]
         = ((Sg) · δf + f · δg)(x)

We need this new S operator to shift the function's input from x to x+1.
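A quick executable check of the discrete product rule (delta and shiftOp are my names for δ and S, restricted to integer-valued functions for simplicity):

```haskell
-- Forward difference: (delta f) x = f(x+1) - f(x).
delta :: (Int -> Int) -> (Int -> Int)
delta f x = f (x + 1) - f x

-- Shift operator: (shiftOp h) x = h(x+1).
shiftOp :: (Int -> Int) -> (Int -> Int)
shiftOp h x = h (x + 1)

-- Discrete product rule: delta(fg) = (S g) * delta f + f * delta g.
productRuleHolds :: (Int -> Int) -> (Int -> Int) -> Int -> Bool
productRuleHolds f g x =
  delta (\y -> f y * g y) x
    == shiftOp g x * delta f x + f x * delta g x
```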

#### § Falling factorial as polynomials

cool, now that we seem to have a linear derivative operator that behaves roughly sanely, let's test it out! The first reasonable target is a polynomial, so let's try x²:

δ(x²) = (x+1)² − x² = 2x + 1
This is disappointing, it does not behave very well :( However, there is an analogue that does well. This is the falling factorial, defined as:
x^(n) ≡ x(x−1)(x−2)⋯(x−n+1)

For example:

x^(0) = 1
x^(1) = x
x^(2) = x(x−1)
x^(3) = x(x−1)(x−2)
Let's try and apply our discrete difference δ:

δ(x^(2)) = (x+1)^(2) − x^(2) = (x+1)x − x(x−1) = 2x = 2x^(1)
δ(x^(3)) = (x+1)^(3) − x^(3) = (x+1)x(x−1) − x(x−1)(x−2) = x(x−1)((x+1) − (x−2)) = 3x(x−1) = 3x^(2)
These falling factorials look pretty unnatural though, why do we care? Well, once we build some integral calculus, we can handle our problem of ∑_{1 ≤ i < k} i² using this calculus, by rewriting i² in terms of these falling factorials.
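The computations above can be checked mechanically; ffact x n below is my name for the falling factorial x^(n):

```haskell
-- Falling factorial: x^(n) = x (x-1) ... (x-n+1).
ffact :: Int -> Int -> Int
ffact x n = product [x - i | i <- [0 .. n - 1]]

-- Forward difference, as before.
delta :: (Int -> Int) -> (Int -> Int)
delta f x = f (x + 1) - f x

-- delta (x^(n)) should equal n * x^(n-1), mirroring (x^n)' = n x^(n-1).
fallingRuleHolds :: Int -> Int -> Bool
fallingRuleHolds n x = delta (\y -> ffact y n) x == n * ffact x (n - 1)
```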

#### § Sums as discrete integrals

We want to think of ∑_{0 ≤ i < n} f(i) as the correct variant of ∫_0^n f(i) di. The most important property of an integral is the fundamental theorem of calculus:

∫_a^b f′(i) di = f(b) − f(a)  ↦  ∑_{a ≤ i < b} (δf)(i) =?= f(b) − f(a)
we can check if the assertion is true:

∑_{a ≤ i < b} (δf)(i)
= [f(a+1) − f(a)] + [f(a+2) − f(a+1)] + [f(a+3) − f(a+2)] + ⋯ + [f(b) − f(b−1)]
= f(b) − f(a)    (the sum telescopes)

Sweet, so we just kicked a theory of calculus off the ground. Let's put this to some use:
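The telescoping argument, as a check (with delta the forward difference as before):

```haskell
-- Forward difference.
delta :: (Int -> Int) -> (Int -> Int)
delta f x = f (x + 1) - f x

-- Discrete fundamental theorem of calculus:
-- sum_{a <= i < b} (delta f)(i) = f(b) - f(a).
discreteFTCHolds :: (Int -> Int) -> Int -> Int -> Bool
discreteFTCHolds f a b = sum [delta f i | i <- [a .. b - 1]] == f b - f a
```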

#### § Gauss' famous formula from discrete calculus

Let's begin by deriving the closed form for the sum of the first (n−1) naturals:

∑_{0 ≤ i < n} i = ∑_{0 ≤ i < n} i^(1) = [i^(2)/2]_0^n = n(n−1)/2
Let's now derive the closed form for the sum of squares of the first (n−1) naturals:

∑_{0 ≤ i < n} i² = ∑_{0 ≤ i < n} i(i−1) + i
                = ∑_{0 ≤ i < n} i^(2) + i^(1)
                = [i^(3)/3 + i^(2)/2]_0^n
                = n(n−1)(n−2)/3 + n(n−1)/2
                = n(n−1)((n−2)/3 + 1/2)
                = n(n−1)(2n−1)/6

Trying to perform this process in general does beg a question: how do we convert from xⁿ into some combination of rising and falling factorials? It turns out that Stirling numbers will show up to aid this conversion. But before that, let's see some more connections.
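Both closed forms above can be sanity-checked numerically (integer division is exact here, since n(n−1) is always even and n(n−1)(2n−1) is always divisible by 6):

```haskell
-- Gauss: sum_{0 <= i < n} i = n(n-1)/2.
gaussHolds :: Int -> Bool
gaussHolds n = sum [0 .. n - 1] == n * (n - 1) `div` 2

-- Sum of squares: sum_{0 <= i < n} i^2 = n(n-1)(2n-1)/6.
squaresHolds :: Int -> Bool
squaresHolds n =
  sum [i * i | i <- [0 .. n - 1]] == n * (n - 1) * (2 * n - 1) `div` 6
```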

#### § 2^x as the combinatorial version of e^x

We want to find the analogue of the exponential function e^x, which satisfies the equation f′(x) = f(x). Setting this up in the discrete case:

(δf)(x) = f(x),  f(0) = 1
f(x+1) − f(x) = f(x),  f(0) = 1
f(x+1) = 2f(x),  f(0) = 1
f(n) = 2ⁿ
What does this buy us? It gives us a nice proof that ∑ₖ C(n, k) = 2ⁿ. It proceeds by taking the Taylor series of e^x, "combinatorializing" it, and then simplifying:

e^x = ∑_{n=0}^∞ xⁿ/n!

2^x = ∑_{n=0}^∞ x^(n)/n!
    = ∑_{n=0}^∞ x(x−1)(x−2)⋯(x−n+1)/n!
    = ∑_{n=0}^∞ C(x, n)

For a natural number x, the terms with n > x vanish (x^(n) then contains the factor x − x = 0), so this is the finite sum ∑_{n=0}^x C(x, n) = 2^x.
#### § Harmonic series as the combinatorial version of logarithm

In the continuous case, we have ∫ 1/x dx = log x. In the discrete case, the analogous sum ∑_{i=1}^n 1/i gives Hₙ, the nth harmonic number. This explains why the harmonic numbers keep showing up in different places across discrete math --- we're often trying to integrate some kind of 1/x quantity. This also may tell us that Hₙ and 2ⁿ are somewhat "weak inverses" of each other. I haven't thought about this weak inverse thing too much.

#### § Stirling numbers of the first kind for converting between polynomials and falling factorials

We can express the rising factorial x^(n̄) ≡ x(x+1)⋯(x+n−1) as ∑_{k=0}^n [n, k] xᵏ, where [n, k] are the (yet to be defined) unsigned Stirling numbers of the first kind; the falling factorial x^(n) expands the same way with alternating signs. (Aside: I wonder if this can be derived from some kind of Möbius inversion.) We begin by in fact defining the Stirling numbers of the first kind, [n, k], as the coefficients of the rising factorial:

x^(n̄) = x(x+1)(x+2)⋯(x+n−1) = ∑_{i=0}^n [n, i] xⁱ
The question as a combinatorialist is:
what do the unsigned Stirling numbers of the first kind, [n, i], count?
[n, i] counts the number of permutations of n elements with i disjoint cycles.
For example, in the case of the permutations of the set {1, 2, 3}, we have the permutations:

(1)(2)(3)   3 cycles
(12)(3)     2 cycles
(1)(23)     2 cycles
(2)(13)     2 cycles
(132)       1 cycle
(123)       1 cycle

So, this gives the counts:

[3, 3] = 1
[3, 2] = 3
[3, 1] = 2
These stirling numbers satisfy a recurrence:
 [n+1, k] = n[n, k] + [n, k−1]
This can be seen combinatorially. If we want the number of permutations of n+1 objects with k cycles, we can either:
1. Take the permutations of n objects with k−1 cycles, [n, k−1], and then insert the new (n+1)th object into a new cycle all by itself. That is, we create a new permutation where p(n+1) = n+1.
2. Take a permutation of n objects with k cycles, and insert this new (n+1)th object into any of the k cycles. If the original permutation had x -p-> y, we can insert the new object * as x -p-> * -p-> y, for any x. Thus, there are n choices of locations to insert * --- as the image of each possible x. This gives us the n[n, k] term.
Another cute fact of the unsigned stirling numbers of the first kind [n, k] is that since permutations are partitioned by their number of cycles, we have:
∑_{k=1}^n [n, k] = n!
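The recurrence gives a direct (if inefficient) implementation; stirling1 is my name for [n, k]:

```haskell
-- Unsigned Stirling numbers of the first kind, via
-- [n+1, k] = n[n, k] + [n, k-1], i.e. [n, k] = (n-1)[n-1, k] + [n-1, k-1].
stirling1 :: Int -> Int -> Integer
stirling1 0 0 = 1
stirling1 n k
  | n <= 0 || k <= 0 || k > n = 0
  | otherwise =
      fromIntegral (n - 1) * stirling1 (n - 1) k + stirling1 (n - 1) (k - 1)
```

As a check, [3, 1] = 2, [3, 2] = 3, [3, 3] = 1, matching the enumeration above, and each row sums to n!.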

## § Permutahedron

The permutahedron over n letters is a polytope which is defined as the convex hull of all permutations of the point (1, 2, …, n). For example, the permutahedron over 3 letters is the convex hull of the points (1, 2, 3), (1, 3, 2), (2, 1, 3), (2, 3, 1), (3, 1, 2), (3, 2, 1). Here, we show that it can be embedded in (n−1)-dimensional space, and that each point perm((1, 2, …, n)) is indeed a vertex of the convex hull. An example of a situation where a generating point is not a vertex of the convex hull is convexhull(1, 2, 3) = [1, 3]. Note that 2 was used to generate the convex hull, but is not a vertex, since it is a convex combination of 1 and 3. We wish to prove that such a situation does not happen in the case of the permutahedron. I have not seen an elementary proof that each permutation of the original point is a vertex, so I'm recording that proof out of interest.

#### § Core idea

Start by looking at the index i of the largest value N in the point P. Since N is the largest value, if P is a convex combination, all other points in the convex combination must have that same largest value at index i. So we can now ignore the index i and get permutations of [1..N-1]. Repeat this process to prove that, at every index, all points in the convex sum must agree with P. But this is absurd, because all vertices are distinct. Hence, the only way for this to work out is if we have only one point that we are convex-summing over. That is, P cannot be written as the convex sum of two or more vertices.

#### § Formal proof

Let us work with permutations of [1, 2, ..., N]. Pick a point P. Assume P is a convex combination of some vertices { Xi }. Let index(N, P) be the index of the number N in P. For example, if P = [2, 3, 1], then
index(2, P) = 0 (assuming 0-based indexing)
index(3, P) = 1
index(1, P) = 2

If P is the convex sum of vertices { Xi }, all of them must have that:
Xi[index(N, P)]  = N forall Xi in the convex sum

since N is the largest value that any Xi can have at any index, and the only way to get the maximum value from a convex sum is for all values to be that maximum value. So we can now ignore dimension index(N, P), since all the vertices Xi involved in the convex combination, and P, have the value N at index(N, P). If we project out this dimension, we are left with permutations of (1..N-1). Repeat the same process: all Xi must have the value N-1 at index(N-1, P). Keep doing this till we get that, at every index, all points in { Xi } agree with P. But all the points in { Xi } are distinct, since they are permutations of [1..N]. Hence { Xi } can contain only a single point, and this point is P. Hence P cannot be written as the convex combination of other points. Hence, P is a vertex.

#### § Proof that convex sum of maximum must have maximum

Assume we have values {v1, v2, ... vn} where without loss of generality, v1 <= v2 <= ... <= vn. We can always relabel otherwise. Now we want to write vn as the convex sum of {v1, v2, ... vn}. We can draw this on the number line as:
---[v1--v2--v3--...--vn]---

• A convex sum must be inside this line segment between [v1--...--vn] (vn is to the rightmost since it's known to be the largest value in {v1...vn}).
• So, if we try to write vn as the convex sum, we must have the coefficient of vn=1 and the coefficient of all other points =0, since any other combination will "pull the convex sum away from the right (where vn sits), towards the left".

## § Lyndon + Christoffel = Convex Hull

Actual "real world" use-case of Lyndon factorization, cribbed from here:
I wanted to learn a real-world use-case of Lyndon factorization so I could remember it. I found a whole bridge between combinatorics on strings and discrete geometry (they call it "digital geometry") that I didn't know existed before.
• If you have a line with rational slope p/q and you want to draw a "discretized line" by connecting integer points in ℤ×ℤ, you can describe this discretized line as starting from (0, 0), making moves x (move 1 unit along the x-axis) and y (move 1 unit along the y-axis), finally reaching the point (q, p). For example, to reach the point (3, 2), you can make the moves [x, x, x, y, y].
• A Christoffel word is a word w ∈ {x, y}* that hugs a line of rational slope p/q as closely as possible. Formally, there are no integer points between the line with slope p/q starting from the origin and the discretized line described by w.
• It turns out that all primitive Christoffel words are Lyndon words. I'm not 100% sure what primitive is, but the take-away is that these primitive Christoffel words represent discrete lines that are good approximations of lines.
• Now, we are given a discrete sequence of adjacent line segments going upwards, where the line segments are described by x, y moves. We want to check if the discrete curve defined by them is well-approximating a convex polygon.
• We compute the Lyndon factorization of the word. This splits the original line segment into a series of line segments, where each successive line segment has lower slope than the previous (since the Lyndon factorization splits a word into factors in non-increasing lex order).
• We can then check that each word in the lyndon decomposition is a Christoffel word. If it is, then your sequence of moves describes a "good discrete convex hull", since as described above, a christoffel word "hugs the line" well.

#### § Bonus: slick characterization of line drawing

If we want to draw a line with slope p/q = 4/7 using the lower approximation, the idea is that we keep accumulating steps of 4 along x, and every time we "exceed" the denominator 7, we take a step along y. This is the same as:
1. working in Z/(4+7)Z, starting from 0, writing down multiples of 4 till we cycle back to 0,
2. marking an "increase" in a step with an x, and a "decrease" in a step with a y.
The intuition is that once we walk k steps where k*p >= q, we want to increment y. So, at first glance, we may believe we should consider Z/qZ. However, this is misguided. Examples to enlighten:
1. Consider x=1, y=0 (one x move, no y moves). We should use Z/1Z = { 0 }: that is, we keep moving along 0 -x-> 0 -x-> 0 -x-> .... This is unlike what happens if we choose Z/0Z (which is not a well-defined idea).
2. Consider x=1, y=1. We should use Z/2Z, so we keep going 0 -x-> 1 -y-> 0 -> ..., which causes us to flip x -> y -> x -> y -> ....
In some sense, we are making sure that we can "start" with an x and see where that takes us. In the Z/1Z case, we realise that we can keep taking xs. In the Z/2Z case, we realise we need to flip between x and y. Let's try to show this formally, where k is the smallest number such that kp >= q. We'll also have a concrete example where p=2, q=7: there, k=4, since kp = 4*2 = 8 > 7. If we work in Z/qZ = Z/7Z, we will get the numbers:
Z:    0, p, 2p, ... (k-1)p, [q] kp
Z/qZ: 0, p, 2p, ... (k-1)p, [q] (kp % q)
Z/7Z: 0, 2,  4, ...      6, [7] (8 % 7 = 1)

But notice that we are inserting x between numbers. So we will get:

Z:    0 -x-> p -x-> 2p, ... (k-1)p -y-> kp
Z/qZ: 0 -x-> p -x-> 2p, ... (k-1)p -y-> (kp % q)
Z/7Z: 0 -x-> 2 -x->  4, ...      6 -y-> (8 % 7 = 1)
      ^                          ^
      x since [0 < 2]            y since [6 > 1]

which gives us the moves:
Z/7Z: 0 -x-> 2 -x-> 4 -x-> 6 -y-> 8 % 7 = 1

We only get 3 occurrences of x, after which the next accumulation of p becomes an 8, which wraps around to a 1. This is the age-old tension that exists between points and gaps: for k points, there are k-1 gaps. In our case, we have 4 points [0, 2, 4, 6], but this leaves us room for only three gaps, for three x moves, while in reality we need 4. We remedy this situation by giving ourselves space for one more x, by changing from Z/qZ to Z/(p+q)Z. We should look at this as creating space for another gap.
Z:        0, p, 2p, ... (k-1)p, [q] kp, [q+p],   (k+1)p
Z/(p+q)Z: 0, p, 2p, ... (k-1)p, [q] kp, [q+p],   ((k+1)p % (p+q))
Z/9Z:     0, 2,  4, ...      6, [7]  8, [7+2=9], (10 % 9 = 1)

which gives us the moves:
Z:        0 -x-> p -x-> ... (k-1)p -x-> kp -y-> (k+1)p
Z/(p+q)Z: 0 -x-> p -x-> ... (k-1)p -x-> kp -y-> ((k+1)p % (p+q))
Z/9Z:     0 -x-> 2 -x-> ...      6 -x->  8 -y-> (10 % 9 = 1)
          ^                              ^
          x since [0 < 2]                y since [8 > 1]


That is, we are able to get k occurrences of x between 0, p, ..., kp, which has (k+1) points. Concretely, we have:
Z/9Z: 0 -x-> 2 -x-> 4 -x-> 6 -x-> 8 -y-> 1

where we have 4 occurrences of x, after which we have an occurrence of y, which is what we want when p=2, q=7: we need to accumulate at least q=7 before we have exceeded our denominator and need to make a move along y.
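The whole construction fits in a few lines of Haskell (lineWord is my name for it; it assumes p, q ≥ 1):

```haskell
-- Walk multiples of p in Z/(p+q)Z; an increase between consecutive
-- values is an x move, a decrease (a wrap-around past the denominator)
-- is a y move.
lineWord :: Int -> Int -> String
lineWord p q =
  let m = p + q
      vals = [(k * p) `mod` m | k <- [0 .. m]]  -- ends back at 0
      step a b = if b > a then 'x' else 'y'
  in zipWith step vals (tail vals)
```

For p=2, q=7 this yields "xxxxyxxxy": four x moves, then a y, then the remaining moves of the discretized slope-2/7 line.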

## § Geometric proof of e^x >= 1+x, e^(-x) >= 1-x

Let's concentrate on the e^x >= 1 + x part.
1. The tangent of e^x at x = 0 is 1 + x, since the Taylor series of e^x truncated after the linear term is 1 + x.
2. e^x is a strictly convex function, since (e^x)'' = e^x, which is positive everywhere. Hence e^x always lies above its tangent lines.
Similarly for e^(-x), working through the math:
1. 1 - x is tangent at x=0 to e^(-x).
2. (e^(-x))' = -e^(-x), so (e^(-x))'' = e^(-x), which is again positive everywhere. Hence e^(-x) is strictly convex, and lies above its tangent 1 - x.
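A quick numeric spot-check of both inequalities on a grid of points:

```haskell
-- e^x lies above its tangent 1+x at 0; symmetrically, e^(-x) lies above 1-x.
bothBoundsHold :: Double -> Bool
bothBoundsHold x = exp x >= 1 + x && exp (negate x) >= 1 - x
```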

## § Ranking and Sorting

We want to sort an array xs and write the results into an array ys. In both cases, the invariant to be satisfied is that ys is xs in ascending order. I'll be considering two broad ways to implement such a thing:
• 1. (RANK) Index straight into ys, reindexed into xs. We name the reindexing array as rs (rs for ranks).
for(int i = 0; i < N; ++i) {
  ys[rs[i]] = xs[i];
}

• 2. (SELECT) Reindex into ys, index straight into xs. We name the reindexing array ss (ss for selects).
for(int i = 0; i < N; ++i) {
  ys[i] = xs[ss[i]];
}


#### § Definition of rank

In the case of (RANK), the specification is that the rank of an element e is the number of elements that are:
1. less than e.
2. equal to e but occur before e.
This ensures that rank is a permutation: that is, every element e is given a unique index.
int rank(const int *xs, int i) {
  int cnt = 0;
  for (int j = 0; j < N; ++j) {
    if (xs[j] < xs[i] || (xs[j] == xs[i] && j < i)) cnt += 1;
  }
  return cnt;
}

for(int i = 0; i < N; ++i) {
  rs[i] = rank(xs, i);
}


#### § Rank: Alternative 1

An alternative way to look at our definition of rank is that we are sorting the tuples (xs[i], i) using lex ordering. So if two indices i, j have the same value, then we sort on the index.

#### § Rank: Alternative 2

We could have also defined rank by counting the elements before i that are at most xs[i], plus the elements after i that are strictly less than xs[i] (the second loop is needed so that equal elements still receive distinct ranks):

int rank(int *xs, int i) {
  int cnt = 0;
  for (int j = 0; j < i; ++j) {
    if (xs[j] <= xs[i]) cnt += 1;
  }
  for (int j = i + 1; j < N; ++j) {
    if (xs[j] < xs[i]) cnt += 1;
  }
  return cnt;
}

I declined to do so because I wanted to show that the lex-ordering interpretation of rank will be useful later.

#### § Definition of select

For (SELECT), we'll need to think a little harder. Let's try to rewrite our way to glory:
• 1. From definition of rank:
ys[rs[i]] = xs[i]

• 2. move i -> ss[i] where ss[i] is guaranteed to be a permutation, so we will write each index eventually:
ys[rs[ss[i]]] = xs[ss[i]]

• 3. Stipulate that rs[ss[i]] = i, since we want a ys[i]:
ys[i] = xs[ss[i]]

This gives us necessary and sufficient conditions on how to find an ss: ss must be a permutation that is an inverse permutation to rs.

#### § How to invert a permutation?

How does one invert a permutation? We have a permutation rs[i] that maps i to rs[i]. We want to find a new permutation ss[i] such that
rs[ss[i]] = i

Equivalently:
// ss[i] is an index 'k'...
ss[i] = k
// 'k' is the location of 'i' in rs.
rs[k] = i

So if we first tag each location of rs with its index, and then sort with the values of rs as the keys, then reading off the tags gives us ss. In pictures:
rs[i]          [3 4 0 1 2]
rs_tagged[i]   [(3, 0) (4, 1) (0, 2) (1, 3) (2, 4)]
rs_sorted[i]   [(0, 2) (1, 3) (2, 4) (3, 0) (4, 1)]
ss[i]          [    2      3      4      0      1 ]
rs[ss[i]]      [ rs[2]  rs[3]  rs[4]  rs[0]  rs[1]]
rs[ss[i]]      [    0      1      2      3      4 ]

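The picture above, in code (invertPerm is my name for the inversion):

```haskell
import Data.List (sort)

-- Invert a permutation: tag each value of rs with its index,
-- sort by value, and read off the tags.
invertPerm :: [Int] -> [Int]
invertPerm rs = map snd (sort (zip rs [0 ..]))
```

For rs = [3,4,0,1,2] this gives ss = [2,3,4,0,1], matching the table above, and composing the two permutations gives the identity.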

#### § Rank and Select are inverses

It's nice, there are a couple of miracles that lined up here:
• Rank starts with r and select starts with s so we get nice naming.
• Rank and select are true inverse permutations of each other.

#### § Generalized rank and select are adjoint

We had to "fix" our definition of rank to avoid equal elements in the array. Hence, we had the rule that we also sort by indexes if the elements are equal. However, if we now decide to ignore this rule, we will then recreate the classical adjunction between rank and select.

#### § References

• Richard Bird: Pearls of functional algorithm design

## § Proof of Minkowski's convex body theorem

We can derive a proof of Minkowski's convex body theorem starting from Blichfeldt's theorem.

#### § Blichfeldt's theorem

This theorem allows us to prove that a set of large-enough-size in any lattice will have two points such that their difference lies in the lattice. Formally, we have:
1. A lattice L(B) ≡ { Bx : x ∈ ℤⁿ } for some basis matrix B ∈ ℝ^{n×n}. The lattice L is spanned by integer linear combinations of the columns of B.
2. A body S ⊆ ℝⁿ (which need not be convex!) whose volume is greater than det(B). Recall that for a lattice L(B), the volume of a fundamental unit / fundamental parallelepiped is det(B).
Blichfeldt's theorem tells us that there exist two points x1, x2 ∈ S such that x1 − x2 ∈ L.

#### § Proof

The idea is to:
1. Chop S up along all the translates of the fundamental parallelepiped that have non-empty intersection with S, and translate each piece back to the origin. This makes all of the pieces overlap with the fundamental parallelepiped at the origin.
2. Since S has volume greater than det(B), but the fundamental parallelepiped only has volume det(B), points from two different translates must overlap.
3. "Undo" the translation to find two points of the form x1 = l1 + δ, x2 = l2 + δ. They must have the same δ, since they overlapped when laid on the fundamental parallelepiped. Also notice that l1 ≠ l2, since they came from two different translates!
4. Notice that x1 − x2 = l1 − l2 ∈ L, and it is nonzero since we already argued that l1 ≠ l2. This gives us what we want.

#### § Minkowskis' Convex body Theorem from Blichfeldt's theorem

Consider a convex set S ⊆ ℝⁿ that is symmetric about the origin, with volume greater than 2ⁿ det(B). Create a new set T which is S scaled by 1/2. Formally:

T ≡ S/2 = { (x1/2, x2/2, …, xn/2) : (x1, x2, …, xn) ∈ S }

We now check that Vol(T) > det(B), so that we may invoke Blichfeldt's theorem:

Vol(T) = (1/2ⁿ) Vol(S) > (1/2ⁿ)(2ⁿ det(B)) = det(B)

We can apply Blichfeldt's theorem to get our hands on two points x1, x2 ∈ T such that x1 − x2 ∈ L.

x1 ∈ T ⇒ 2x1 ∈ S              (S = 2T)
x2 ∈ T ⇒ 2x2 ∈ S              (S = 2T)
2x2 ∈ S ⇒ −2x2 ∈ S            (S is symmetric about the origin)
(1/2)(2x1) + (1/2)(−2x2) ∈ S  (S is convex)
x1 − x2 ∈ S                   (simplification)

So x1 − x2 is a nonzero lattice point in S, which is what we set out to find.

## § Burrows Wheeler

We aim to get the O(n) algorithm for the Burrows-Wheeler transform, by starting from the naive O(n²) implementation and then slowly chipping away to get to the fastest algorithm.

#### § String rotations

Given a string s of length n, we can index it as s[0], s[1], upto s[n−1]. We can now build a table consisting of rotations of the string. We'll define:
lrot :: [a] -> [a]
lrot [] = []; lrot (x:xs) = xs ++ [x]

We can build a table of left-rotations:
foobar
oobarf
obarfo
barfoo
arfoob
rfooba

We can immediately notice that it's a symmetric matrix. We can prove this as follows. We write lrot(rot, s) as an array such that:

lrot(rot, s)[i] = s[(rot + i) mod n]

Now, we note that on row r of our table we have the string lrot(r, s). So the value at row r, column c is s[(r + c) mod n]. But this is symmetric in c, r, so it is equal to the value at row c, column r, which is lrot(c, s)[r]. Formally:

lrot(r, s)[c] = s[(r + c) mod n] = s[(c + r) mod n] = lrot(c, s)[r]
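We can spot-check the symmetry claim (lrotAt is my name for lrot(rot, s) in array form):

```haskell
-- lrot(rot, s) as an array: lrotAt rot s !! i = s !! ((rot + i) `mod` n).
lrotAt :: Int -> String -> String
lrotAt rot s =
  let n = length s
  in [s !! ((rot + i) `mod` n) | i <- [0 .. n - 1]]

-- The table of rotations is symmetric: entry (r, c) == entry (c, r).
symmetricTable :: String -> Bool
symmetricTable s =
  let n = length s
  in and [ lrotAt r s !! c == lrotAt c s !! r
         | r <- [0 .. n - 1], c <- [0 .. n - 1] ]
```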

#### § Sorts of string rotations

We're next interested in considering sorted order of the string rotations. For example, in the case of the string "here, there", all the rotations are:
here-there  0
ere-thereh  1
re-therehe  2
e-thereher  3
-therehere  4
therehere-  5
herehere-t  6
erehere-th  7
rehere-the  8
ehere-ther  9

which is calculated with the help of the following haskell code:
lrot :: [a] -> [a]
lrot [] = []; lrot (a:as) = as ++ [a]

lrots :: [a] -> [[a]]; lrots as = take (length as) (iterate lrot as)

main = putStrLn $ intercalate "\n"
  (zipWith (\s i -> s <> "  " <> show i)
           (lrots "here-there") [0,1..])

If we now sort these rotations, we get:

-therehere  0
e-thereher  1
ehere-ther  2
ere-thereh  3
erehere-th  4
here-there  5
herehere-t  6
re-therehe  7
rehere-the  8
therehere-  9

We produce this by changing the above definition of main to:

main = putStrLn $ intercalate "\n"
  (zipWith (\s i -> s <> "  " <> show i)
           -- | extra sort here
           (sort $ lrots "here-there") [0,1..])

Let's look at the final column of that table. We have:

-thereher|e|  0
e-therehe|r|  1
ehere-the|r|  2
ere-there|h|  3
erehere-t|h|  4
here-ther|e|  5 <- ORIGINAL STRING
herehere-|t|  6
re-thereh|e|  7
rehere-th|e|  8
therehere|-|  9

Reading that final column top to bottom gives the string "errhhetee-".

Now, we'll show how to write a really cool function bwtinv such that:

bwtinv ("errhhetee-", 5) = "here-there"

The 5 is the index of the original string in the sorted list. That is, given the jumbled-up last column and the index of the original string, we're able to reconstruct the original string. The reason this is useful is that the jumbled string "errhhetee-" is easier to compress: it has long runs of r, h, and e, which makes it easier to compress. The process we performed of rotating, sorting the rotations, and taking the final column is the Burrows-Wheeler transform. In code:

import Data.List

lrot :: [a] -> [a]
lrot [] = []; lrot (a:as) = as ++ [a]

lrots :: [a] -> [[a]]
lrots as = take (length as) (iterate lrot as)

findix :: Eq a => a -> [a] -> Int
findix a as = length (takeWhile (/= a) as)

lastcol :: [[a]] -> [a]; lastcol = map last

bwt :: Eq a => Ord a => [a] -> ([a], Int)
bwt as = let as' = (sort . lrots) as in (lastcol as', findix as as')

Now we need to understand how bwtinv is defined and what it does. The definition is:

import Control.Arrow (&&&)

bwtinv :: Eq a => Ord a => ([a], Int) -> [a]
bwtinv (as, k) = recreate as !! k

-- recreate . lastcol . sort . rots = sort . rots
recreate :: (Eq a, Ord a) => [a] -> [[a]]
recreate as = recreate' (length as) as

recreate' :: (Eq a, Ord a) => Int -> [a] -> [[a]]
recreate' 0 = map (const [])
recreate' n = hdsort . consCol . (id &&& recreate' (n-1))

hdsort :: Ord a => [[a]] -> [[a]]
hdsort = let cmp (x:xs) (y:ys) = compare x y in sortBy cmp

consCol :: ([a], [[a]]) -> [[a]]
consCol = uncurry (zipWith (:))

OK, so much for the code. What does it do?
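As a sanity check, here is a compact, self-contained version of the transform and the naive inverse (repeatedly cons the last column onto the matrix and re-sort), which round-trips on the running example:

```haskell
import Data.List (sort)

lrot :: [a] -> [a]
lrot [] = []
lrot (x : xs) = xs ++ [x]

-- Transform: last column of the sorted rotations, plus the index
-- of the original string among them.
bwt :: Ord a => [a] -> ([a], Int)
bwt as =
  let srt = sort (take (length as) (iterate lrot as))
  in (map last srt, length (takeWhile (/= as) srt))

-- Naive inverse: rebuild the sorted-rotations matrix column by column.
bwtinv :: Ord a => ([a], Int) -> [a]
bwtinv (col, k) =
  let step m = sort (zipWith (:) col m)
  in iterate step (map (const []) col) !! length col !! k
```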
The idea is that:
• We recreate the entire matrix of sorted rotations from the last column using recreate, and take the kth row.
• recreate conses a copy of the initial last column onto the matrix of columns recreated so far, and then sorts the rows.

#### § References

• Richard Bird: Pearls of functional algorithm design

## § Intuitionistic logic as a Heyting algebra

Open sets form a Heyting algebra, which is a lattice plus an implication operator. So it's stronger than a lattice, but weaker than a boolean algebra. Formally, a Heyting algebra over a set X is a collection (L, ∨, ∧, ⇒) where (L, ∨, ∧) form a lattice, and ⇒ obeys the axiom:

 ⇒: L × L → L; (c ∧ a) ≤ b ⇐⇒ c ≤ (a ⇒ b)

In any topological space (U, τ) (U for universe set), the open sets of that space form a Heyting algebra. Here, set-union and set-intersection become the lattice join and lattice meet. We define a "weak complement" denoted by ¬, which is defined as:

 ¬A ≡ interior(Aᶜ)

We need the interior to make sure that we're left with an open set. Also, this ¬ is not really a complement, since we won't have that ¬¬A = A. But more on that later! We write open intervals with round parentheses:

--(===)--  A

¬A of the above set A becomes the open interval that is in the interior of A's complement:

--(===)--  A
==]---[==  A complement
==)---(==  Not A

Now, we can try to ask: when is ¬¬A = U (where U is the universe set, the full space)? If we consider a set that excludes a single point, then we will have:

==)(==        A
------------  Not A
============  Not (Not A)

in words:

 A = U ∖ { 1 }
 ¬A = interior(Aᶜ) = interior({ 1 }) = ∅
 ¬¬A = interior((¬A)ᶜ) = interior(∅ᶜ) = interior(U) = U

So in some sense, the law of the excluded middle is "almost true": it holds that A ≃ ¬¬A, where A excludes a set of points of measure zero. This is really interesting, since it gives some sort of oddball probabilistic flavour to things, where if we blind ourselves to measure-0 sets, then ¬¬A = A. Now, we look at implication.
The implication A ⇒ B is the largest open set C such that C ∧ A ≤ B. In pictures:

---(========)----- A
---------(=====)-- B
---------(==)----- A && B
===)-----(======== A -> B

The reason this is true is the definition:

 ⇒: L × L → L; c ≤ (a ⇒ b) ⇐⇒ (c ∧ a) ≤ b

replacing ≤ with = for insight:

 c = (a ⇒ b) ⇐⇒ (c ∧ a) = b

Alternatively, we can use the fact that in regular boolean algebra:

 a ⇒ b = ¬a ∨ b

to derive A -> B:

---(========)----- A
===)--------(===== NOT A
---------(=====)-- B
===)-----(======== NOT A or B = A -> B
[Extend B to everything that doesn't include more of A than already included]

#### § a -> b as a contained in b

We'll show that a -> b is true/top/the full real line if and only if a is contained in b. We can show this using the ¬a ∨ b definition:
1. a ≤ b (given).
2. Since a ≤ b, ¬a ≥ ¬b, since ¬ reverses inclusion order.
3. x ≥ y implies that x ∨ p ≥ y ∨ p for all p, since p is on both sides.
4. Specializing to x = ¬a, y = ¬b, p = b gives us ¬a ∨ b ≥ ¬b ∨ b.
5. ¬b ∨ b is the universe U.
6. ¬a ∨ b ≥ U, which means it's equal to U (nothing can be greater than U).

We can also show this geometrically:

----------(===)---- A
------(==========)  B
----------(===)---- A && B
==================  A -> B
[Extend B to everything that doesn't include more of A than already included. But all of A is already included!]

#### § Reading curry, uncurry using a -> b as containment

curry and uncurry are the adjunction/isomorphism:

((c,a) -> b) ~ (c -> (a -> b))

Read this as:

(c, a) -> b        (c intersection a) contained in b

if and only if

c -> (a -> b)      c contained in (a implication b)

That is, we have "read" curry/uncurry as:

 c ∧ a ≤ b ⇐⇒ c ≤ (a ⇒ b)

which was the definition of ⇒ we wanted!

## § Edit distance

This implementation of edit distance crystallizes the fact that when computing edit distance, we only ever move forwards when solving the problem. We do not store the results of the overlapping computations, though we could.
Rather, the goal of this implementation is to capture the traversal pattern necessary for edit distance into a Cursor, and to view the edit distance problem from the point of view of this Cursor. The problem setting is that we have a source string and a destination string, and we wish to perform operations that convert the source string into the destination string. The operations allowed are to:
• Insert a character from the destination into the source.
• Remove a character from the source.
• Replace a character in the source with a character from the destination.
We want to minimise the number of operations used. Let's see how we model this.

{-# LANGUAGE ViewPatterns #-}
type Ix = Int; type DestIx = Ix; type SrcIx = Ix
data Move = InsertFromDest DestIx
          | RemoveFromSrc SrcIx
          | ReplaceSrcWithDest SrcIx DestIx
          deriving(Show)

Our cost model says that each move costs 1. We are charged for every move we make. We are to minimize the number of operations.

movescost :: [Move] -> Int; movescost = length

We model this as us having a Cursor which contains a list [a] and information about where we are in the list as an Ix. This is the same as a Zipper for a list, except that in this case, we only allow ourselves to walk forward.

data Cursor a = Cursor Ix [a]

• cdone tells us if we have completely consumed a cursor.
• cix tells us the index of the cursor.
• cval lets us dereference a cursor.
• incr lets us move a cursor to the next array entry.
• cursor converts a list into a Cursor at the first index.

cdone :: Cursor a -> Bool; cdone (Cursor ix vs) = ix >= length vs
cix :: Cursor a -> Ix; cix (Cursor ix _) = ix
cval :: Cursor a -> a; cval (Cursor ix vs) = vs !! ix
incr :: Cursor a -> Cursor a; incr (Cursor ix vs) = Cursor (ix+1) vs
cursor :: [a] -> Cursor a; cursor = Cursor 0

We implement edit, which tells us how to edit the source string into the destination string. The convention is edit as bs --- decide how to get the ixth character of bs from our as.
edit :: Eq a => Cursor a -> Cursor a -> [Move]

• 1. If both strings have been consumed, then no moves are to be made.

edit (cdone -> True) (cdone -> True) = []

• 2. If the destination string has been fully matched while the source string has not, then remove characters from the source string.

edit a@(cdone -> False) b@(cdone -> True) =
  (RemoveFromSrc (cix a)):edit (incr a) b

• 3. If the source string has run out of characters while the destination string still has characters, insert characters from the destination string.

edit a@(cdone -> True) b@(cdone -> False) =
  (InsertFromDest (cix b)):edit a (incr b)

• 4. Otherwise, we have characters remaining in both strings. Try the options of (1) replacing a source character with a destination character, (2) removing a character from the source and continuing, and (3) inserting the current destination character and continuing; without the insert option, the recursion could never insert into the middle of the source. Additionally, if the current characters match, we may keep the match and combine characters that come later in the string. We pick the best out of these using the argmin combinator.

edit a b =
  let nomatch = argmin movescost
        (ReplaceSrcWithDest (cix a) (cix b):edit (incr a) (incr b))
        (argmin movescost
          (RemoveFromSrc (cix a):edit (incr a) b)
          (InsertFromDest (cix b):edit a (incr b)))
  in case cval a == cval b of
       True -> argmin movescost nomatch (edit (incr a) (incr b))
       False -> nomatch

The helper used for finding the minimum according to a cost model:

argmin :: (a -> Int) -> a -> a -> a
argmin f a a' = if (f a) < (f a') then a else a'

## § Evolution of bee colonies

This kind of culture that beehives have is called 'eusociality'. I'm interested in this because I want to understand what alien societies might look like, and what selection pressures are required for bee-like societies to evolve. Many sci-fi books (Ender's Game, for example) depict aliens with such a society, but tend to be hazy on how this state of affairs came to be.

## § Best practices for array indexing

These are rules I'm going to follow when I solve problems on codeforces.
I've arrived at these rules by repeatedly noticing the kinds of mistakes I make, and attempting to adopt conventions to eliminate them.

#### § Rule 1: Half-open intervals

Intervals are only represented as [begin, past-the-end). That is, we start at begin, and include begin+1, begin+2, ..., up to, but excluding, past-the-end. So, for example, [0, 3) = [0, 1, 2], while [0, 1) = [0], and [0, 0) = []. I am not allowed to use [begin, end], or (before-begin, past-the-end), or any such variation to represent and think of intervals.

#### § Rule 2: No relational operators in for/while/do-while loops

When I write my loop conditions, I am not allowed to use relational operators. I must only write i != n or i == m. I am not allowed to write i < n, i <= n, i > n, i >= n in my for loops.

#### § Rule 3: The loop iterator lives in "getting to" space

The loop iterator i in for(int i = a; i != b; ++i) is to be thought of as "getting to/living right before" the values [a, a+1, a+2, ..., b-1]. In ASCII-art:

```
|| a-1 || a || a+1 || ... || b-1 || b || b+1 ||
      ^^    ^^      ^^    ...   ^^
     (i=a) (i=a+1) (i=a+2)     (i=b)
```

#### § Rule 4: One always reads loops according to the above rules

```cpp
for(int i = begin; i != past-the-end; ++i) {
  // NO: i from begin to past-the-end.
  // YES: [i _getting to_ the beginning] till [i _getting_ past-the-end]
}
```

#### § The rationale for banning relational operators

There is a strong difference in qualia between i < n and i != n. The former makes one aware of when the loop runs; the latter, of when the loop quits. I wish to be cognizant of precisely when a loop quits. On writing i != past-the-end, I know that we quit as soon as we get past the end. This feels much clearer than being aware that the loop runs as long as i < n.

#### § half-open indexing: length <-> index

The first advantage of these conventions is that [begin, past-the-end) is the same as [begin, begin+length).
I've found this to be of great help in flitting between length-based thoughts and indexing-based thoughts.

#### § half-open indexing: 0 length

The second, almost trivial point is that [begin, begin+length) holds when length is zero. That is,

```cpp
for(int i = begin; i != begin + 0; ++i) { ... }
```

does the right thing and iterates over no elements.

#### § half-open indexing: -ve length

The third neat point is that [begin, begin+length) holds even when length is negative. Hence, if we want to index the last two elements of an array, we recall that we start from index (n-1), and we want a segment of length -2 from there, so our loop is:

```cpp
for(int i = n-1; i != (n-1) + -2; i--) {
  // loops over the correct segment.
}
```

#### § half-open indexing: splitting an array

Finally, this convention helps when attempting to take the beginning part of a list and the rest of the list. If we want to split an array into two segments at some index len:
• The first sub-array is [0, len), since it starts at location 0 and has length len.
• Since we have taken len elements, the second sub-array must have length n-len. It must start at index len, since len is past-the-end for the first sub-array. So, the second sub-array has indices [len, len + (n-len)) = [len, n).
Notice how we used both the "length view" and the "past-the-end view" to quickly derive the bounds of the second array from the first.

#### § half-open indexing: uniformly generate power-of-2 intervals

If we want intervals of length 2^n: 1, 2, 4, 8, ..., which is common if one is building data structures such as segment trees and fenwick trees, then in a half-open representation this literally becomes [a, a+1), [a, a+2), [a, a+4), and so on. On the other hand, if one wants to use closed-interval-based indexing, one needs to generate the series 2^n − 1, which is [a, a+0], [a, a+1], [a, a+3], [a, a+7]: slightly more finicky.

#### § How to deal with strides

If there's a loop

```cpp
for(int i = 0; i < 5; i += 2) { ... }
```

the naive i != 5 translation will not work, since i only takes on the even values [0, 2, 4, ...]. For this, we can perform a simple transformation and always make sure our loop variables increment by 1. In a compiler, this is often called "canonicalizing the loop induction variable". In LLVM, the canonicalization is performed by the -indvars pass. The above example, canonicalized, becomes:

```cpp
// btc = backedge taken count: how many times we have
// gone past the "back edge" of the loop to return to the
// beginning of the loop.
for(int btc = 0; btc != (5 + 2 - 1) / 2; btc += 1) { // ceil(5/2) = 3 iterations
  const int i = btc * 2;
}
```

#### § In conclusion

These are rules that I dreamed up after noticing my idiosyncrasies in loop-writing.

## § Algebraic structure for vector clocks

When I update my own time, ie, union(time me, time me), I get an element that's one step up the lattice. When I union with someone else, I get the max. So we have an algebraic structure (L, ≤, next: L → L) where next is monotone with respect to (L, ≤). The induced union operator ∪: L × L → L is:

```
x ∪ y ≡ next(x)    if x = y
        max(x, y)  if x ≠ y
```

• Is the total ordering on vector clocks not isomorphic to the total ordering on ℕ?
This also shows up in the case of the "Gallager-Humblet-Spira Algorithm" and the fragment-name-union rule.

## § Networks are now faster than disks

I first learnt of this counter-intuitive fact from this usenix article on NoSQL. On checking up, it seems to actually be true. Jeff Dean's keynote at LADIS 2009 reports these numbers:
• Round trip within same datacenter: 500,000 ns
• Disk seek: 10,000,000 ns [regular disk]
On the other hand, a commenter on this discussion at serverfault mentioned that SSDs might be much faster, and the numbers bear this out:
• Read 4K randomly from SSD: 150,000 ns
• Round trip within same datacenter: 500,000 ns
• Disk seek: 10,000,000 ns [regular disk]

## § Einstein-de Haas effect

I learnt of this from hacker news.
This is a crazy experiment which shows that 'quantum angular momentum' (spin) and 'classical angular momentum' must be conserved together for physics to work out. There's an experiment that transfers that angular momentum all the way up to macroscopic levels: by magnetizing a cylinder of iron, all the spins start pointing in the same direction. By conservation of angular momentum, the cylinder itself has to start spinning in the opposite direction. I'm very fond of this experiment, because it magnifies a strange quantum phenomenon to the classical level. So, my understanding of the experiment is:
• classical angular momentum and quantum angular momentum are related.
• quantum angular momentum is decomposed into spin and orbital angular momentum.
• for something like iron, spin is 96% of the magnetization.
• angular momentum is proportional to magnetization.
• So, the experiment measures the spin (mostly) in terms of the classical spinning of the cylinder.

#### § References

## § Rank-select as adjunction

We will introduce two operations, rank and select; these are used to build memory-efficient data structures that can still be queried quickly. We will show how rank and select are adjoint. This will also let us consider coselect, a variation of select that is not commonly discussed.

```hs
{-# LANGUAGE GeneralizedNewtypeDeriving #-}
import Data.List
newtype Count = Count Int deriving(Eq, Show, Ord, Num)
newtype Ix = Ix Int deriving(Eq, Show, Ord, Num)

eqscan :: Eq a => a -> [a] -> [Count]
eqscan a0 as = map Count $ scanl (+) 0 [if a0 == a then 1 else 0 | a <- as]

-- | finds the index Ix in as at which a has occurred
-- | for the Countth time.
select :: Eq a => a -> [a] -> Count -> Maybe Ix
select a0 as c = Ix <$> findIndex (== c) (eqscan a0 as)

-- | finds the number of times a has occurred in as till Ix.
rank :: Eq a => a -> [a] -> Ix -> Count
rank a as (Ix ix) = (eqscan a as) !! ix
```

Given this, we can prove that select and rank are adjunctions. There are different ways to think about this. My favourite is to notice that an Ix (index) is much "finer" information than a Count. Hence, the concrete domain must be Ix, the abstract domain must be Count, and rank-select is an abstract interpretation!

#### § Co-select

We can build another version of select called coselect that scans from the right. Alternatively, it finds on the reversed list:

```hs
coselect :: Eq a => a -> [a] -> Count -> Maybe Ix
coselect a0 as c = Ix <$> findIndex (== c) (reverse (eqscan a0 as))
```

Thanks to Edward Kmett for teaching me this.
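The same definitions are easy to play with in Python. Here is a small sketch mirroring the Haskell above (function names reused; everything else is mine). The adjunction shows up as the round-trip law rank(a, xs, select(a, xs, c)) == c whenever select succeeds:

```python
from typing import List, Optional

def eqscan(a0, xs: List) -> List[int]:
    # prefix counts: out[i] = number of occurrences of a0 in xs[:i]
    out = [0]
    for x in xs:
        out.append(out[-1] + (1 if x == a0 else 0))
    return out

def select(a0, xs: List, c: int) -> Optional[int]:
    # first index at which a0 has occurred c times (mirrors findIndex)
    for ix, n in enumerate(eqscan(a0, xs)):
        if n == c:
            return ix
    return None

def rank(a0, xs: List, ix: int) -> int:
    # number of times a0 occurs in xs[:ix]
    return eqscan(a0, xs)[ix]

xs = list("abracadabra")
# 'a' has occurred twice exactly once we've consumed "abra" (4 characters)
ix = select('a', xs, 2)
```

Here ix is 4, and rank('a', xs, 4) recovers the count 2, which is the adjunction round-trip.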

## § Bounding chains: uniformly sample colorings

We wish to uniformly sample k-colorings of a graph G with maximum degree Δ. Hence, we require k ≥ Δ + 1. To perform this sampling, we use MCMC to sample from a markov chain whose states are k-colorings of G, and whose stationary distribution is the uniform distribution over valid colorings. The issue with MCMC techniques is that we never know when our chain has reached the stationary state. To ensure we receive uniformly distributed samples, we build a "bounding chain" W which has the following properties:
• States of W cover states of X [ie, each state of W represents a set of states of X].
• W's convergence to a stationary state can be checked.
• Upon convergence of W, state of W = state of X.
We will in fact run the W chain, and prove that running the W chain is equivalent to running a 'shadow' of the X chain, and that stationary states of the W chain correspond to stationary states of the X chain. Let X be the chain whose states are valid k-colorings of G. In one step of the chain X, we choose a vertex v uniformly at random; we then choose a color cv uniformly at random for v from among the colors that keep the coloring proper. The vertex v is changed to this new color cv. This is a symmetric proposal distribution; hence this chain has the uniform distribution over k-colorings as its stationary state. Sampling exactly from this chain is hard: constructing an initial state X0 amounts to finding some valid k-coloring, which in itself might be challenging. Worse, we do not know whether the chain X has reached a stationary state or not.

#### § Bounding Chain

We construct a new chain W (the bounding chain of X), whose states are sets of colors for the vertices of G. Formally, the states of W are functions Vert(G) → 2^C, where Vert(G) denotes the vertices of G and C the set of colors. The transition is to pick a vertex v uniformly at random, and then pick a new set of legal colors Cv for v, such that:
• It is guaranteed that if X were transitioning on v, the color cv that would be picked by X for v is a member of Cv. [state is covered]
• The set Cv attempts to be smaller than the current set of colors for v. [convergence]
We describe the transition function next. But first, we need an alternate lens on the transitions of X that is amenable to massaging.

#### § Equivalent description of the transitions of X:

1. Choosing a color uniformly at random from the set of valid colors for a vertex.
2. Choosing colors from C without replacement until we get a color that is a valid color.
We claim that (1) and (2) have the same probability distribution. Abstracting slightly, we state:
1. The probability of choosing an element t ∈ T uniformly from T ⊆ S. This has probability 1/|T|.
2. The probability of choosing a particular element t ∈ T by picking elements from S without replacement until we get some element of T.
(1) and (2) have the same probability distribution.

#### § Proof by induction:

Process (2) in code is the following:

```python
def process(S, T):
    assert(issubset(T, S))       # precondition
    s = np.random.choice(S)      # uniform choice over S.
    if s in T: return s          # |T|/|S| prob. to enter the if.
    else:
        # (1 - |T|/|S|) prob. to enter the else.
        Snext = S.remove(s); return process(Snext, T)
```

We claim that the probability that process(S, T) == t0 for a fixed t0 in T is 1/|T|. We create a new function indicator to express this:

```python
def indicator(t0, S, T):
    assert(t0 in T)              # precondition
    assert(issubset(T, S))       # precondition
    return t0 == process(S, T)
```

Let's push the t0 == into the definition of process after inlining process. This gives us:

```python
def indicator(t0, S, T):
    assert(t0 in T)              # precondition
    assert(issubset(T, S))       # precondition
    s = np.random.choice(S)
    if s in T:
        # |T|/|S| prob. for s in T
        return t0 == s           # 1/|T| prob. for t0 == s
    else:
        # (1 - |T|/|S|) prob. to reach the else branch.
        Snext = S.remove(s); return t0 == process(Snext, T)
```


Now, we write down the recurrence for the probability we are trying to compute. P(t0, S, T) is the probability that indicator(t0, S, T) returns True; alternatively, it is the probability that process(S, T) returns t0.

```
P(t0, S, T) = prob. that process(S, T) = t0
P(t0, T, T) = |T|/|T| * 1/|T| = 1/|T|                   [base case]
P(t0, S, T) = 1/|S| + (1 - |T|/|S|) * P(t0, |S|-1, T)   [induction]
```

We assume for induction that P(t0, |S|-1, T) = 1/|T|. On substitution into [induction], we get:

```
P(t0, S, T)
  = 1/|S| + (1 - |T|/|S|) * P(t0, |S|-1, T)   [induction]
  = 1/|S| + (1 - |T|/|S|) * 1/|T|
  = 1/|S| + (1/|T| - 1/|S|)
  = 1/|T|
```

Which is indeed the same probability as (1):
1. Choosing an element uniformly from a subset T = 1/|T|.
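The lemma is also easy to check empirically. A quick Monte Carlo sketch (my own code: random.choice stands in for np.random.choice, and remove genuinely mutates):

```python
import random

def process(S, T):
    # draw from S without replacement until we land in T
    S = list(S)
    while True:
        s = random.choice(S)
        if s in T:
            return s
        S.remove(s)

random.seed(0)
T = {0, 1}
N = 20000
counts = {t: 0 for t in T}
for _ in range(N):
    counts[process(range(6), T)] += 1
# each t in T should appear with probability 1/|T| = 1/2, not 1/|S| = 1/6
assert all(abs(counts[t] / N - 1 / len(T)) < 0.02 for t in T)
```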

#### § Proof by program analysis, Version 1

```python
def process(S, T):
    s = np.random.choice(S)
    if s in T: return s
    else: return process(S.remove(s), T)
```

Notice that the last return is a tail call. This program can be rewritten as:

```python
def process(S, T):
    while True:
        s = np.random.choice(S)
        if s in T: return s
        S = S.remove(s)
```

As previously, we create an indicator function and study its behaviour:

```python
def indicator(t0, S, T):
    assert(t0 in T)              # precondition
    assert(issubset(T, S))       # precondition
    while True:
        s = np.random.choice(S)
        if s in T: return t0 == s
        S = S.remove(s)
```

We know that this program only returns a value from the line:
• if s in T: return t0 == s
We now compute P(process(S, T) == t0). Whatever the return value of indicator, we can assume that it occurred within the if condition. We can use this to compute:

```
P(indicator(t0, S, T) = True)
  = P(t0 == s | s in T)    [program only returns in this case]
  = 1/|T|
```


#### § Proof by program analysis, Version 2

Alternatively, we can analyze this as we did in the first proof, using the rule that

```python
def f():
    return cond2 if cond1 else cond3
```

has probability:

```
P(cond1) * P(cond2|cond1) + P(not cond1) * P(cond3|not cond1)
```

Using this, we can analyze indicator as:

```
P(indicator(t0, S, T) = True)
  = P(s in T) * P(t0 == s | s in T)
    + P(s not in T) * P(indicator(t0, S.remove(s), T) | s not in T)
  = |T|/|S| * 1/|T|
    + (|S|-|T|)/|S| * P(indicator(t0, S.remove(s), T))
  = 1/|S| + (|S|-|T|)/|S| * 1/|T|    [by induction]
  = 1/|T|
```
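The recurrence can also be verified symbolically: the solved probability depends only on the sizes n = |S| and t = |T|, so we can iterate the recurrence with exact rationals and confirm that it collapses to 1/t. This is a sanity check of the algebra, not a proof:

```python
from fractions import Fraction

def P(n: int, t: int) -> Fraction:
    # P(t0, S, T) as a function of n = |S| and t = |T|
    if n == t:
        return Fraction(1, t)    # base case: T = S
    # 1/|S| + (1 - |T|/|S|) * P(t0, |S|-1, T)
    return Fraction(1, n) + (1 - Fraction(t, n)) * P(n - 1, t)

# the recurrence collapses to 1/t for every n >= t
assert all(P(n, t) == Fraction(1, t)
           for t in range(1, 7) for n in range(t, 12))
```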


#### § Sampling using the above definition

Recall that State(X) ≡ V → C and State(W) ≡ V → 2^C. Let x[ ], w[ ] be the states of some run of the markov chains. We start by having x[0] be any valid k-coloring of G, and w[0] be the state where all vertices have all possible colors: w[0] ≡ _ ↦ C. Clearly, x[0] ∈ w[0]. By induction, we assume that x[n−1] ∈ w[n−1]. We must now calculate a w[n], x[n] such that (1) x[n] ∈ w[n], (2) x[n]'s proposal is symmetric, and (3) w[n]'s proposal is symmetric.

#### § Occluded set O

Define O ⊆ C (for occluded) to be the set of colors that might possibly be blocked for v from our view of w. Note that this is an over-approximation: that is, there may be colors that are not blocked for v which we believe to be blocked from w's point of view.

O ≡ { c ∈ C : (v, α) ∈ E, c ∈ w[n−1](α) }

#### § Allowed set A

Define A ⊆ C (for allowed) to be C ∖ O. Note that A is an under-approximation, since O was an over-approximation. That is:
• Any color in A is definitely a legal color for v in x[n].
• There may be colors that are legal for v in x[n] that are not in A.

#### § S: the sequence of states for transition

Now, we pick elements of C in sequence till we get an element of A. Call this sequence S. We will at worst pick Δ + 1 elements for S, since the max-degree of the graph is Δ.

#### § Transition

Let i be the first index in S where we get a color that is truly legal for v in x[n]. Note that such an index will always exist: we pick elements into S till we get an element of A, and elements of A are always legal. However, there can be elements which are not in A that are still legal for v in x[n], since A is an under-approximation.
• We assign x[n](v) = S[i]. So, X only cares about the prefix S[:i].
• We assign w[n](v) = the set of colors in S. So, W cares about the entire sequence.
By the lemma proven above, we know that this process of picking colors from C in sequence till we get a color that is legal for v is the same as picking uniformly at random from the set of colors that are legal for v.

#### § An example

For example, we could have:

```
X | p:2 --- q:4
W | p:{1, 2} --- q:{4, 5}
```

Suppose we are sampling q:

```
O = {1, 2}
A = {3, 4, 5}
S = [2, 1, 3]
```

```
X | p:2 -- q:1
W | p:{1, 2} -- q:{1, 2, 3}
```

If we analyze S = [2, 1, 3], we notice that:

```
2: invalid for W (p:{1, 2}), invalid for X (p:2)
1: invalid for W, valid for X
3: valid for W, valid for X
```

So, what we are really sampling is:
• A prefix sequence SX = [1] (for X's transition)
• A leftover sequence SW = [2, 3] (for W's transition)
To transition X, we can safely drop SW. However, to transition W correctly, we generate more elements of the sequence, till we hit a "safe" element.
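We can replay this example in code. Below is a sketch of a single transition at one vertex, with the randomly-drawn color sequence passed in explicitly so the example is reproducible (all names are mine; x maps vertices to colors, w maps vertices to color sets):

```python
def transition_at(v, draws, C, nbrs, x, w):
    O = set().union(*(w[u] for u in nbrs[v]))    # occluded: maybe blocked
    A = set(C) - O                               # allowed: surely legal
    S = []
    for c in draws:                              # draw until surely legal
        S.append(c)
        if c in A:
            break
    # X takes the first drawn color that is truly legal w.r.t. x
    xv = next(c for c in S if all(x[u] != c for u in nbrs[v]))
    x2, w2 = dict(x), dict(w)
    x2[v] = xv                                   # X uses only the prefix
    w2[v] = set(S)                               # W keeps the whole sequence
    return x2, w2

C = {1, 2, 3, 4, 5}
nbrs = {'p': ['q'], 'q': ['p']}
x2, w2 = transition_at('q', [2, 1, 3], C, nbrs,
                       {'p': 2, 'q': 4}, {'p': {1, 2}, 'q': {4, 5}})
# x becomes p:2, q:1 and w becomes p:{1,2}, q:{1,2,3}, as in the example
```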

#### § An optimisation on picking colors: why we need a blocking set

Define the set B ⊆ C (for blocked), which governs which values x[n] surely cannot take, from our view of w[n−1]. Note that B is an under-approximation: v might have more colors blocked than what w sees.

B ≡ { c ∈ C : (v, α) ∈ E, w[n−1](α) = {c} }

Rather than sampling colors from C till we get an element of A, we can sample colors from C ∖ B. We know that the colors in B can never be used by X, since they are blocked for sure. This is used in the theoretical analysis of the paper.

#### § Termination

We terminate when W has "trapped" X: that is, |w[n](v)| = 1 for all v ∈ V. In such a case, the state of W is equal to the state of X. This is a coalescence (as it occurs in coupling from the past). From the theory of coupling from the past, we know that we have reached a stationary state of X when this happens.

#### § Pseudocode

```hs
-- | colors, vertices, edges
cs :: [C]; cs = ...
vs :: [V]; vs = ...
es :: [(V, V)]; es = ...

-- | definitely not allowed
blocked :: (V -> [C]) -> (V -> [C])
blocked v2cs v0 = concat [v2cs w | (v, w) <- es, v == v0, length (v2cs w) == 1]

-- | maybe not allowed
occluded :: (V -> [C]) -> (V -> [C])
occluded v2cs v0 = concat [v2cs w | (v, w) <- es, v == v0]

-- | definitely allowed from W's point of view
allowed :: (V -> [C]) -> (V -> [C])
allowed v2cs v0 = cs \\ occluded v2cs v0

-- | perturb the color function to another function
perturb :: (V -> [C]) -> Rand (V -> [C])
perturb v2cs = do
  randv <- uniform_rand vs
  randc_for_v <- uniform_rand (allowed v2cs randv)
  return $ \v -> if v == randv then [randc_for_v] else v2cs v

-- | check if we are done
terminated :: (V -> [C]) -> Bool
terminated v2cs = and [length (v2cs v) == 1 | v <- vs]

-- | generate the chain
chain :: (V -> [C]) -> Rand [V -> [C]]
chain f =
  if terminated f
  then return [f]
  else do
    f' <- perturb f; fs <- chain f'; return (f:fs)

-- | return a sample
sample :: (V -> [C]) -> Rand (V -> [C])
sample f = last <$> chain f
```



## § Word problems in Russia and America

A scathing critique of how American math education is broken. Also, an interesting anecdote about how looking for 'reality' in mathematical problems may in fact break students' ability to think in the abstract! This is a deep insight.

## § Encoding mathematical hierarchies

I've wanted to learn how the SageMath system is organized when it comes to math hierarchies. I also wish to learn how lean4 encodes its hierarchies. I know how Mathematical Components does it. This might help narrow in on what the "goldilocks zone" is for typeclass design.

## § Learning code by hearing it

I learnt of this from an amazing discussion on HackerNews, where a sighted programmer, who is going blind, asked the community if he could remain a software engineer. An answer by kolanos read:
You can definitely continue as a software engineer. I'm living proof. ... For example, as you get better with a screen reader, you'll be bumping the speech rate up to 1.75-2X normal speech. ... Typos will be easily spotted as they just won't "sound right". It will be like listening to a familiar song and then hitting an off note in the melody. And this includes code. Also, because code is no longer represented visually as blocks, you'll find you're building an increasingly detailed memory model of your code. Sighted people do this, too, but they tend to visualize in their mind. When you abandon this two dimensional representation, your non-visual mental map suffers no spatial limits. You'll be amazed how good your memory will get without the crutch of sight.
I find this incredibly fascinating. I think I'm going to try this: I'll listen to lectures on 1.5-2x anyway, so this may work just as well. I'm planning on trying this with the MLIR codebase, which I want to get to know intimately. I'll write updates on how this goes. Drew DeVault also posted links to tools/scripts for programming without needing visual feedback:
I also learnt about emacspeak, which supposedly works well for an audio-based-workflow.
• [https://news.ycombinator.com/item?id=22919455]

## § Your arm can be a spinor

I tend to forget the name of this trick. It exhibits spinors in real life: a system that needs to rotate by 720 degrees to return to its original state, versus the usual 360 that we are generally used to. We need to consider our entire arm plus the cup we are holding as a single system for this to work.

### § Self modifying code for function calls: Look ma, I don't need a stack!

If one does not have recursive calls, one can eliminate the need to push return addresses on a call stack by writing self-modifying code. I learnt of this from TAOCP, volume 1. Knuth shows this off once he introduces MIXAL, his fantasy assembly language in which TAOCP programs are written. I'll explain the usual way one performs call/return, then explain the nifty self-modifying-code way. I think this is the cleanest, most accessible example of self-modifying code that I know.

#### § The traditional solution for call/ret

We wish to have function f call function g. For g to be able to return control to f, f pushes a return address onto the call stack, which g pops and jmps to.

```
* f's body
START_F:
      ...
L0:   push addr(L1)   ; location of the instruction to be run after the call.
      jmp START_G
L1:   <code after call>

* g's body
START_G:
      ...
RETG: jmp retloc      ; retloc is popped off the call stack
```

Rather than pushing and popping, we can rewrite the code of g to change retloc before a call to g. In made-up pseudocode, here's what that would look like:

#### § The jump based solution

```
* f's body
START_F:
      ...
L0:   store addr(RETG) <- (jmp addr(L1))
      jmp START_G
L1:   <code after call>

* g's body
START_G:
      ...
RETG: <###-to-be-filled-dummy-###>
```

Instead of having a call stack, before f calls g, f modifies g's code at location RETG into a jmp instruction, by storing the instruction jmp addr(L1). This effectively creates a 'custom' g that knows how to return control flow to f.

```
* g's body (after execution of L0)
START_G:
      ...
RETG: jmp addr(L1)
```

This way, we have obviated the need for a push/pop sequence by directly modifying g's code. This is really neat: we replace the overhead of a push/pop with a single store.
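The scheme is easy to simulate. Here is a toy interpreter (entirely my own construction, not MIXAL) where memory is a mutable list of instructions, and f patches g's return slot with a store before jumping to g:

```python
def run(mem, pc):
    trace = []
    while pc is not None:
        op = mem[pc]
        if op[0] == 'emit':                  # observable side effect
            trace.append(op[1]); pc += 1
        elif op[0] == 'store':               # self-modification: write code
            mem[op[1]] = op[2]; pc += 1
        elif op[0] == 'jmp':
            pc = op[1]
        elif op[0] == 'halt':
            pc = None
    return trace

START_G = 5
mem = [
    ('emit', 'f:before-call'),               # 0: f's body
    ('store', START_G + 1, ('jmp', 3)),      # 1: fill g's RETG slot: "jmp L1"
    ('jmp', START_G),                        # 2: call g
    ('emit', 'f:after-call'),                # 3: L1, code after the call
    ('halt',),                               # 4
    ('emit', 'g:body'),                      # 5: START_G
    ('halt',),                               # 6: RETG: to-be-filled dummy
]
trace = run(mem, 0)
# trace is ['f:before-call', 'g:body', 'f:after-call']: g returned into f
# without any call stack, because f rewrote mem[6] before the jump.
```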

#### § Why recursion breaks.

We don't actually need a call stack, as long as we don't want to write recursive functions. We can't have recursion, or more generally "re-entrance". Consider a call chain of the form:
• A -> B -> C -> B.
• during A -> B, A will scribble a return address into B.
• during B -> C, B will scribble a return address into C.
• during C -> B, C will scribble a return address into B, destroying the previous one.
• This creates a cycle, where C will attempt to return to B and vice versa.

### § Adjunctions as advice

An adjunction F ⊣ U allows us to go from F a -> x to a -> U x. We can look at this as shifting the "before-advice" F on the input to an "after-advice" U on the output, borrowing the terminology of advice from aspect-oriented programming. To remember this, we can write it as:

```
F a <--F--- a
|           |
v           v
x ----U--> U x
```

Where F is for Free and U is for forgetfUl. Recall that if F is free and U is forgetful, then U(F(x)) = x, since adding more structure through the free functor and then stripping it away gives us the object back. We can prove that if U(F(x)) = x and U, F are functors, then we have a function fwd: (F a -> x) -> (a -> U x) as:

```
f      :: f a -> x
fmap   :: (p -> q) -> (u p -> u q)
fmap f :: u (f a) -> u x
fmap f :: a -> u x              [using u (f a) = a]
```


### § Reversible computation as groups on programs

If we consider a language like Janus, where every program is reversible, we can get a group structure on programs: the identity program computes nothing at all, and the inverse performs the reverse operation. Alternatively, one can use the trick from quantum mechanics of using ancillary qubits to build reversible classical gates. The question is: do either of these approaches allow for better-than-STOKE exploration of the program space? That is, can we somehow exploit the discrete group structure (in the case of Janus) or the Lie group structure of the unitary group (in the QM case) to find programs far more quickly?

### § Blazing fast math rendering on the web

So, I've shifted the blog to be static-site-generated, using a static-site-generator written by yours truly. The code clocks in at around a thousand lines of C++.
What did I gain?
1. My generator is a real compiler, so I get errors on math and markdown malformation.
2. I can write math that loads instantly in your browser, using no MathJax, KaTeX, or any client-side processing, nor the need to fetch images. It looks like this:

```
h(x) ≡ ⎧ ∑_{i=0}^∞ f(x)·g(x)·dx    if x > 0
       ⎩ ∑_{i=0}^∞ f(x) + g(x)     otherwise
```

#### § Why?

My blog is a single 9000 line markdown file, rendered as a single HTML page, so I need it to compile fast, render fast, render beautiful. Existing tools compromise on one or the other.

#### § No seriously, why a single markdown file?

I need a single file to edit, so I can rapidly jot down new ideas. This is the essence of why I'm able to log most of what I study: because it's seamless. Far more importantly, it provides spatio-temporal locality. I add things to the blog in chronological order, as I learn them. If I need to recall something I studied, I go to that location in the blog based on a sense of when. When I do get to a location I want, the scrollbar gives me a sense of where I am in the file. This is important to me, since it helps me reason spatially about what I know and what I've learnt. It's something I love about books, and deeply miss when navigating the web. I'm determined to keep this spatio-temporal locality on my little slice of the internet.

#### § Why is this awful?

As elegant as this model is to edit, it's awful for browsers to render. The file used to take on the order of minutes for all the math to finish rendering. MathJax (and KaTeX) painfully attempt to render each math block. As they do, the page jumps around until everything has settled. As this is happening, your CPU throttles, your lap or hand gets warm, and the page is stuck. Clearly not great UX. I still want math. What do I do? The solution is easy: Approximate the math rendering using ASCII/UTF-8 characters! There are tools that do this --- hevea is one of them. Unfortunately, there is no markdown-based-blogging-platform that uses this, so I had to write my own.

#### § The cure

The solution is easy: I wrote the tool. The page you're reading is rendered using the tool. All the math renders in under a second, because it's nothing crazy: it's just text and tables, which browsers know how to render. No JavaScript necessary. Snappy performance. Whoo!

#### § The details: Writing my own Markdown to HTML transpiler.

The final transpiler clocks in at ~1300 LoC of C++, which is very small for a feature-complete markdown-to-HTML piece of code that's blazing fast, renders math correctly, and provides error messages.

#### § Quirks fixed, features gained.

I got quite a bit "for free" as I wrote this, fixing mild annoyances and larger pain points around using github + markdown for publishing on the web:
• I really don't want tables, but I do want the ability to write vertical bars | freely in my text. Unfortunately, github insists that those are tables, and completely wrecks rendering.
• I get line numbers in code blocks now, which Github Flavoured Markdown did not have.
• I get error messages on incorrectly closed bold/italic/code blocks, using heuristics that prevent them from spanning across too many lines.
• I get error messages on broken latex, since all my latex passes through hevea. This is awesome, since I no longer need to refresh my browser, wait for mathjax to load, go make myself tea (remember that mathjax was slow?), and then come back to see the errors.
• I can get error messages if my internal document links are broken. To be fair, my tool doesn't currently give me these errors, but it can (and soon will).
• In general, I get control, which was something I did not have with rendering directly using Github, or using someone else's tool.

#### § Choice of language

I chose to write this in C-style C++, primarily because I wanted the tool to be fast, and I'd missed writing C++ for a while. I really enjoy how stupid-simple C-style C++ turns out to be: the C++ papers over some of C's annoyances (like formatted output for custom types), while still preserving the KISS feeling of writing C. Why not Rust? I freely admit that Rust might have been a sane choice as well. Unfortunately, asking Rust to treat a UTF-8 string as a "ball of bytes" is hard, when it's stupidly easy with C. Plus, I wanted to use arena-style allocation, where I make huge allocations in one go and then don't think about memory, something that I don't have control over in Rust. I don't have any segfaults (yet, perhaps), thanks to UBSAN and ASAN. I find Rust to have more impedance than C on small applications, and this was indeed small.

#### § Performance

Everything except the latex-to-HTML conversion is blazing fast. Unfortunately, calling hevea is slow, so I implemented a caching mechanism to make using hevea not-slow. hevea does not have an API, so I need to fork and talk to its process, which is understandably slow. I built a "key-value store" (read: serialize data into a file) with the stupidly simple approach of writing an append-only log into a file. hevea is conceptually a pure function, since on the same latex input it produces the same HTML output, so it's perfectly safe to cache it:
```cpp
const char DB_PATH[] = "./blogcache.txt";
unordered_map<ll, const char *> G_DB;

// on startup: replay the append-only log into the in-memory map.
G_DB = {};
FILE *f = fopen(DB_PATH, "rb");
...
while (!feof(f)) {
  ll k, len;
  fread(&k, sizeof(ll), 1, f); if (feof(f)) break;
  ...
  char *buf = (char *)calloc(sizeof(char), len + 2);
  ...
}
fclose(f);

const char *lookup_key(ll k) {
  unordered_map<ll, const char *>::iterator it = G_DB.find(k);
  if (it == G_DB.end()) { return nullptr; }
  return it->second;
}

void store_key_value(const ll k, KEEP const char *v, const ll len) {
  assert(G_DB.count(k) == 0);
  G_DB.insert(make_pair(k, strdup(v)));

  FILE *f = fopen(DB_PATH, "ab");
  assert(f != nullptr && "unable to open DB file");
  fwrite(&k, sizeof(ll), 1, f);
  fwrite(&len, sizeof(ll), 1, f);
  fwrite(v, sizeof(char), len, f);
  fclose(f);
}
```


#### § For the future

I plan to rip out hevea and write my own latex -> HTML converter for the subset of LaTeX I actually use. hevea's strength is its downfall: It can handle all of LaTeX, which means it's really slow. If I can concentrate on a small subset, I don't need to play caching tricks, and I can likely optimise the layout further for my use-cases. I also want colored error messages, because who doesn't? I'll probably gradually improve my static site generator over time. Once it's at a level of polish where I'm happy with it, I'll spin it out as a separate project.

#### § Conclusions

Am I glad I did it? Yes, purely because my chunk of the internet aligns with how I want it to be, and that makes me ε more happy. I think of it as an investment in future me, since I can extend the markdown and the transpiler in the way I want.

### § VC dimension

Consider a ground set X. Let the space of all possible binary classifications be the function space C ≡ { f : X → ±1 }. Now, a hypothesis class H is a subset of C. For example, some model such as "return +1 if a point is inside a region, −1 otherwise" carves out a subset of the full class C. The VC dimension of H measures how good H is at generating different classifications. We first need the notion of shattering to define this. A hypothesis class H shatters a subset S ≡ { s0, s1, …, sn } ⊆ X of the ground set if the function actS has full range, where actS is defined as:
 actS: H → {±1}|S|;   actS(h) ≡ (h(s0), h(s1), h(s2), …, h(sn))
That is, the hypothesis class H can realize every possible labelling of S. Now, the VC dimension of the hypothesis class H of a ground set X is the size of the largest possible S ⊆ X such that S is shattered by H.

#### § Correct interpretation

• We need just one set S of size n to be shattered by H. We get to pick the set S.

#### § Subtlety 1:

• We do not need all sets of size n to be shattered by H.
We can have the case where:
• All sets of size 3 are shattered by H
• Only one set of size 4 is shattered by H. All other sets of size 4 are not.
• Only one set of size 5 is shattered by H. All other sets of size 5 are not.
• No set of size 6 is shattered by H.
In this case, the VC dimension of H is 5, not 3.

#### § Subtlety 2:

We cannot have the case where:
• All sets of size 3 are shattered by H
• No set of size 4 is shattered by H
• Some set of size 5 is shattered by H
For contradiction, let S be the set of size 5 that is shattered by H. Let T ⊆ S, |T| = 4. Now, H shatters T since H shatters S. Hence, some set of size 4 has been shattered. Contradiction, since we assumed that no set of size 4 is shattered by H. So, to prove that sets of size (≥ n) cannot be shattered, it suffices to prove that sets of size equal to n cannot be shattered.

#### § Growth of number of sets shattered in |S| for S ⊆ X for a fixed H.

If we fix a hypothesis class H for X, and we want to understand how the number of realizable classifications varies over subsets of X, the idea is this: let S be a maximum-sized set that is shattered by H. I.e., |S| = VCdim(H) and H shatters S. Now, the idea is this:
• For subsets T ⊆ S, |actT(H)| = 2|T| -- exponential.
• For supersets Sup ⊇ S, |actSup(H)| grows like Comb(|Sup|, |S|) -- polynomial.
We can show that this exponential/polynomial behaviour happens in general for S ⊆ X (this is the content of the Sauer-Shelah lemma).

## § Symplectic version of classical mechanics

#### § Basics, symplectic mechanics as inverting ω:

I've never seen this kind of "inverting ω" perspective written down anywhere. Most presentations start by using the interior product iXω without ever showing where the thing came from. This is my personal interpretation of how the symplectic version of classical mechanics comes to be. Suppose we have a non-degenerate, closed two-form ω: TpM × TpM → ℝ. Now, given a hamiltonian H: M → ℝ, we can construct a vector field XH: M → TM under the definition:

We partially apply ω to see it as a mapping from TpM to Tp*M:
 ω2: TpM → Tp*M;   ω2 ≡ λ (v: TpM). λ (w: TpM). ω(v, w)
Since ω is non-degenerate, ω2 is invertible:
 ω2−1: Tp*M → TpM
The differential dH assigns to each point a covector, dH(p): Tp*M, so we can define:
 XH ≡ λ (p: M). ω2−1(dH(p))
Following the types: (p ∈ M) ↦ dH(p): Tp*M ↦ ω2−1(dH(p)): TpM. In short:
 XH = ω2−1 ∘ dH
This way, given a hamiltonian H: M → ℝ, we can construct an associated vector field XH in a pretty natural way. We can also go the other way: given the vector field XH, we can build the dH under the equivalence:

 ω2−1 ∘ dH = XH
 dH = ω2(XH)
 ∫ dH = ∫ ω2(XH)
 H = ∫ ω2(XH)

This needs some demands, like the one-form ω2(XH) being integrable. But this works, and gives us a bijection between XH and H as we wanted. We can also analyze the definition we got from the previous manipulation:

 ω2(XH) = dH
 λ (w: TpM). ω(XH, w) = dH
 ω(XH, ·) = dH
We can take this as a relationship between XH and dH. Exploiting this, we can notice that dH(XH) = 0. That is, H does not change as we move along XH:
 dH(XH) = ω(XH, XH) = 0   [ω is anti-symmetric]
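As a concrete sanity check (my own, not in the original notes): take M = ℝ² with coordinates (q, p), ω = dq ∧ dp, and the harmonic oscillator hamiltonian H = (q² + p²)/2. Inverting ω recovers Hamilton's equations:

```latex
\omega = dq \wedge dp, \qquad H = \tfrac{1}{2}(q^2 + p^2), \qquad dH = q\,dq + p\,dp
% write X_H = a\,\partial_q + b\,\partial_p; then \omega(X_H, \cdot) = a\,dp - b\,dq.
% matching \omega(X_H, \cdot) = dH forces a = p and b = -q:
X_H = p\,\partial_q - q\,\partial_p
% i.e. \dot q = p, \dot p = -q. sanity check of the last identity above:
% dH(X_H) = q \cdot p + p \cdot (-q) = 0.
```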

#### § Preservation of ω

We wish to show that XH*(ω) = ω. That is, transporting ω along the flow of the vector field XH preserves ω. TODO.

#### § Moment Map

Now that we have a method of going from a vector field XH to a hamiltonian H, we can go crazier with this. We can generate vector fields using Lie group actions on the manifold, and then look for hamiltonians corresponding to this Lie group. This lets us perform "inverse Noether": for a given choice of symmetry, we can find the Hamiltonian that possesses this symmetry. Given an action φ of G on the manifold, we can create a map from an element g of the Lie algebra of G to a vector field Xg, by performing:

 t : ℝ ↦ etg : G
 t : ℝ ↦ φ(etg) : M

 Xg ≡ (d/dt) φ(etg) |t=0 ∈ TM

We can then attempt to recover a hamiltonian Hg from Xg. If we get a hamiltonian from this process, then it will have the right symmetries.

### § Notation for "theorems for free"

These are personal notes I made of a custom notation for denoting the relations from the "Theorems for free" paper. I developed the notation since I wanted to keep track of what types are floating around and what the relations are doing. We interpret types as sets. If elements belong to the relation, i.e., if (a, a′) ∈ R ⊆ A × A′, we will denote this as a[A -R- A′]a′. We will now write down some inference rules:
1. We define ReflB as a[Bool ReflB Bool]a
2. We define ReflI as i[Int ReflI Int]i
3. The product of two relations [A R B], [X S Y] is called RxS, and is defined as: (a,x)[AxX RxS BxY](b,y) iff a[A R B]b ∧ x[X S Y]y.
4. The list space of a relation [A R B] is called [A* [A R B] B*], and is defined as: la[A* [A R B] B*]lb iff |la| = |lb| ∧ (∀ i, la[i][A R B]lb[i]).
5. The function space of two relations [A R B], [X S Y] is called [A->X R->S B->Y], and is defined as: f[A->X R->S B->Y]g iff ∀ a b, a[A R B]b => f(a)[X S Y]g(b).
6. The type family space of two relations is a function that takes a relation [A R B] and produces a new relation: g[FA | [A R B] | FB]h. The relation takes as a parameter a relation [A R B] for each choice of A, B.
7. The space of relations of ∀X.F(X) is a relation defined by: g[A->FA | ∀X.F(X) | B->FB]h iff ∀ A B R, (g A)[FA | [A R B] | FB](h B).

#### § Parametricity theorem

The parametricity theorem states that for all terms (r: R), we can deduce r[R rel(R) R]r, where rel(R) is the relation that fits the type, and is derived from the above rules.

#### § Parametricity for lists when the relation is a function:

The list space of a [A R B] is called [A* [A R B] B*], and is defined as: la[A* [A R B] B*]lb iff:
• ∀ la lb, |la| = |lb| ∧ (∀ i, la[i][A R B]lb[i])
Now, let us take a special case where [A R B] is a function δ: A -> B. That is:
• a[A R B]b iff δ(a) = b.
If this is the case, then we can simplify the math to be:
• la[A* [A R B] B*]lb <=> |la| = |lb| ∧ (∀ i, la[i][A R B]lb[i])
• la[A* [A R B] B*]lb <=> |la| = |lb| ∧ (∀ i, δ(la[i]) = lb[i])
• la[A* [A R B] B*]lb <=> map δ la = lb

#### § Parametricity to prove rearrangements

• r[∀ X. X* -> X*]r
• (r A)[A*->A* | [A*->A* [A R B] B*->B*] | B*->B*](r B)
• as[A* [A R B] B*]bs => (r A as)[A* [A R B] B*](r B bs)
• Pick [A R B] to be a function δ: A -> B. Ie, a[A R B]b iff δ(a) = b.
• This lets us convert all occurrences of α[A R B]ω into ω = δ(α).
• Hence, as[A* [A R B] B*]bs becomes map δ as = bs.
• Hence, (r A as)[A* [A R B] B*](r B bs) becomes map δ (r A as) = (r B bs)
• In toto, this lets us replace bs with map δ as. We derive:
• map δ (r A as) = (r B bs)
• map δ (r A as) = r B (map δ as)
• map δ . (r A) = (r B) . map δ
• Replace bs[i] with δ(as[i]) to get the pointwise result: δ((r A as)[i]) = (r B (map δ as))[i], which was indeed what we were looking for.

### § How to reason with half-open intervals

I've always found code that uses half-open intervals far harder to write than code using closed intervals. For example, when performing string processing, I prefer to write closed over half-open since I find it easier to think about:
void closed(int begin, int end, char *c) { //easier
for(int i = begin; i <= end; ++i) {... }
}

void open(int begin, int len, char *c) { //harder
for(int i = begin; i < begin + len; ++i) { ... }
}

void open2(int begin, int len, char *c) { //harder
for(int i = begin; i != begin + len; ++i) { ... }
}

Switching to i != begin + len somehow made it way easier to grok.
• I had problems with < since I would mentally shift from i < begin + len to i <= begin + len - 1. Making this move would then make all other reasoning harder, since I had to keep switching between the < and <= points of view.
• On the other hand, when using i != begin + len, there was a single location to focus on: the point begin + len, and what happens when i reaches it.
Of course, this is old hat to anyone who performs loop optimisation: LLVM internally converts most comparisons into the a != b form, because it's easier to analyse. It took me this long to realize that it's easier for me to think in this viewpoint as well.

### § How does one build a fusion bomb?

I haven't found anything on the internet that describes how to build a fusion bomb; it's almost as if this information has been suppressed by governments. However, I'm curious --- would a physics grad student studying nuclear physics or explosives have the theoretical know-how to precisely build one, given the raw materials? Or is there some "secret sauce" that's necessary? I read on Wikipedia that most countries classify the details:
Detailed knowledge of fission and fusion weapons is classified to some degree in virtually every industrialized nation. In the United States, such knowledge can by default be classified as "Restricted Data", even if it is created by persons who are not government employees or associated with weapons programs, in a legal doctrine known as "born secret".

### § Christoffel symbols, geometrically

Suppose we have a manifold M of dimension d that has been embedded isometrically into ℝn. So we have a function e: ℝd → ℝn which is the embedding, and we will identify M with the subspace Im(e). Recall that ∂xi e : ℝd → ℝn is defined as:

 [∂xi e](p) ≡ lim δx → 0 [ e(p + (0, …, δx in slot i, …, 0)) − e(p) ] / δx

Note that ∂xi e is again a function of type ℝd → ℝn.
• The tangent space at a point p ∈ Im(e) is going to be spanned by the basis { ∂xi e|p : ℝn }.
• The metric tensor of M is gij ≡ ⟨ ∂e/∂xi | ∂e/∂xj ⟩. That is, the metric tensor "agrees" with the dot product of the ambient space ℝn.
• A vector field V on the manifold M is by definition a combination of the tangent vector fields: V(p0) ≡ vj(p0) ∂xj e(p0).
We can calculate the derivative of this vector field as follows:

 ∂xi V = ∂xi [ vj ∂xj e ] = vj · ∂xi ∂xj e + ∂xj e · ∂xi vj

We choose to rewrite the second-derivative term as a component along the tangent space, plus some component n⃗ normal to the manifold that we have no control over:

 (∂xi ∂xj e)(p) ≡ Γijk ∂xk e + n⃗

This gives us the Christoffel symbols Γijk as the "variation of the second derivative along the manifold".

#### § Relationship to the Levi-Civita connection

The covariant derivative defined by the Levi-Civita connection is the projection of the full derivative in ℝn onto the tangent space TpM. Writing Π for the projection that kills the normal component n⃗, it is defined by the equations:

 ∇ei V ≡ Π(∂xi V)
  = Π( vj · ∂xi ∂xj e + ∂xj e · ∂xi vj )
  = Π( vj · (Γijk ∂xk e + n⃗) + ∂xj e · ∂xi vj )
  = vj · Γijk ∂xk e + ∂xj e · ∂xi vj   [Π kills n⃗]
  = vj · Γijk ∂xk e + ∂xk e · ∂xi vk   [relabel j to k in the second term]
  = ∂xk e ( vj · Γijk + ∂xi vk )
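A worked example (mine, not from the notes): polar coordinates on the plane, e(r, θ) = (r cos θ, r sin θ). Since the ambient space is ℝ² itself, the normal component n⃗ vanishes, and expanding the second derivatives in the basis {∂r e, ∂θ e} reads off the Christoffel symbols directly:

```latex
\partial_r e = (\cos\theta, \sin\theta) \qquad
\partial_\theta e = (-r\sin\theta, r\cos\theta)
% expand each second derivative in the basis \{\partial_r e, \partial_\theta e\}:
\partial_\theta \partial_\theta e = (-r\cos\theta, -r\sin\theta) = -r\,\partial_r e
  \;\Rightarrow\; \Gamma^r_{\theta\theta} = -r
\partial_r \partial_\theta e = (-\sin\theta, \cos\theta) = \tfrac{1}{r}\,\partial_\theta e
  \;\Rightarrow\; \Gamma^\theta_{r\theta} = \tfrac{1}{r}
\partial_r \partial_r e = 0 \;\Rightarrow\; \Gamma^k_{rr} = 0
```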

### § A natural vector space without an explicit basis

On learning about infinite dimensional vector spaces, one learns that we need the axiom of choice to assert that every such vector space has a basis; indeed, asserting this is equivalent to the AoC. However, I had not known any "natural" examples of such a vector space till I studied the proof of the Barvinok algorithm. I produce the example here. Consider a space such as S ≡ ℝ3. Now, consider the vector space spanned by the indicator functions of polyhedra in S. That's a mouthful, so let's break it down. A polyhedron is defined as a set of points cut out by linear inequalities: P ≡ { x ∈ S : ai · x ≤ bi, i ∈ [1… n] }, for ai ∈ S, bi ∈ ℝ. The indicator functions are of the form:

 [poly]: S → ℝ;   [poly](x) ≡ 1 if x ∈ poly, 0 otherwise

We can define a vector space of these functions over ℝ, using scaling as the action of ℝ on the functions: the vector space V is defined as the span of the indicator functions of all polyhedra. It's clearly a vector space, and hopefully an intuitive one. However, note that the set we generated this from (the indicators of polyhedra) does not form a basis, since there are many linear dependencies between the indicators. For example, one can write the equation:
*---*   *-*   *-*   *
|###|   |#|   |#|   |
|###| = |#| + |#| - |
|###|   |#|   |#|   |
*---*   *-*   *-*   *


### § Cache oblivious B trees

Central idea: assume a memory model where computation is free; the only cost is pulling data from main memory into the cache. The cache has total size M and holds blocks of size B, so it can hold M/B blocks of main memory. Main memory has infinite size. Cost is the number of block transfers. We assume that the algorithm does not know M or B. We assume that the cache replacement strategy is optimal (kick out the block that is going to be used farthest in the future). This is an OK assumption to make since an LRU cache using twice the memory of an "oracular" cache performs equally well (citation?). These data structures are cool since they essentially "adapt" to varying cache hierarchies, and even to multi-level cache hierarchies. We study how to build cache-oblivious B-trees.

#### § Building optimal cache-oblivious B trees to solve search

• We use a balanced BST. We want to find an order to store nodes in memory such that when we search for an element, we minimize number of blocks we need to pull in.
• All standard orders such as level order, pre-order, post-order fail.
• Correct order is the "vEB (van Emde Boas) order": carve the tree at the middle level of its edges. Lay out the top "triangle" (the smaller collection of nodes) linearly; then recursively lay out the bottom trees, linearly in memory.
• Supposedly, if the number of nodes is N, we will have roughly √N nodes in the top triangle, and then roughly √N triangles at the bottom.

#### § Analysis. Claim: we need to pull O(logB N) blocks for any B for any search query

N is the number of nodes in the BST. Note that in the analysis, we know what B is, even though the algorithm does not.
• We look at a particular level of recursion. We will call it a "level of detail" straddling B.
• We will have large triangles of more than B nodes, inside which there are smaller triangles of at most B nodes (reminds me of Sierpinski).
• We know that the algorithm recursively lays the tree out, and a triangle stores everything "inside" it in a contiguous region. So we stop at the requisite size, where we know that the triangles at this level fit into the block size.
• A little triangle of size less than B can live in at most two memory blocks, by straddling a block boundary: eg. by having (B−1) elements in one block and a single element in another block.
1 2 3 4 5 6 7 8 <- index
|     |       | <- boundary
|-xxxxxxx-----| <-  data

• The question is: on a root-to-leaf path, how many such triangles do we need to visit? Since we repeatedly divide the tree in half with respect to height until the little triangles have fewer than B nodes, the height of each little triangle is going to be O(log B), since it's still a binary tree.
• The total height is O(log N).
• So the height of the "chunked tree", where we view each triangle as a single node, is log N / log B = logB N.
• Insight: our data structure construction in some sense permits us to "binary search on B", since we divide the data structure into levels based on B. If B = N, then the full data structure fits into memory and we're good.

#### § Black box: ordered file maintenance

We need a black box: ordered file maintenance (linked lists for arrays).
• Store n elements in a specified order in an array of linear size O(N). The array permits gaps.
• updates: delete an element; insert an element between 2 elements.
• We cannot do this in O(1) moves, but we can do it by moving elements in an interval of size O(log2 N) amortized.
• We need O(1) scans for the data structure.

#### § Next: dynamic BST (inserts and deletes): layout

We take a vEB static tree on top of an ordered file. The tree is a segment tree whose internal nodes store the max of their subtrees. The leaves are the members of the ordered file.

• search for node.
• update ordered file.
• propagate updates into the tree. This will have to be done in post-order, because we need the leaves to be fixed before we can update the parent max.

• look at level of detail that straddles B.
• Let us look at the bottom 2 levels.
• Note that when we perform the post-order inside a triangle that has 3 triangles of size at most B, we need to alternate between the parent triangle and a child triangle. Since the parent triangle is of size at most B, it can take at most 2 blocks of memory; similarly, the child can take at most 2 blocks of memory.
• So if our cache can hold 4 blocks of memory, we're done. We won't need to kick anything out when performing the post-order traversal.
• For levels that are above the bottom 2 levels, we're still OK: there are not many triangles / not many nodes! (1:16:00 in the video)

### § Krohn-Rhodes decomposition

We denote partial functions with X ⇀ Y and total functions with X → Y. A set X equipped with a binary operator ⋆: X × X → X which is closed and associative is a semigroup.

#### § Partial function semigroup

For a ground set X, the set of partial functions Pf(X) ≡ { f: X ⇀ X }, along with function composition, forms a semigroup. This is in fact stronger than a semigroup. There exist:
• An identity function eX: X ⇀ X; eX(x) = x
• A zero function θX: X ⇀ X; θX(x) = undef, where by undef we mean that it is undefined.

#### § Transformation semigroup(TS)

Let Q be a set. Let S ⊆ Pf(Q) be a sub-semigroup of Pf(Q). Then the pair X ≡ (Q, S) is called the transformation semigroup X of the states Q.
• The elements of Q are called states of X
• while the elements of S are called actions of X.
• The set Q itself is called the underlying set of X.
• For a fixed transformation semigroup X, we will write QX and SX to refer to its states and actions.
We call X ≡ (Q, S) a transformation monoid if S contains the identity 1Q(q) = q.

#### § Subtlety of being a transformation monoid

There is some subtlety here. Just because S is a monoid does not mean that it is a transformation monoid. It must have the identity element of Pf(Q) to be called a transformation monoid. For example, consider the set Q ≡ { a, b } and the transformation semigroup S ≡ { f }, where f is the constant function f(x) ≡ b. The set S is indeed a monoid, with identity element f: Q ⇀ Q (since f ∘ f = f). However, f ≠ 1Q, and hence S is not a transformation monoid.

#### § Examples of transformation semigroups

1. (X, { θ(x) = undef }). The semigroup with the empty transformation.
2. (X, ∅), the semigroup with no transformations.

#### § Semigroup action

We sometimes wish to represent a semigroup using an action/transformation semigroup on a ground set X. So, given some semigroup (T, ×) that needs to be represented, we look for a morphism r: T → Pf(X) (r for representation) such that:
• r(t1 × t2) = r(t1) ∘ r(t2). [r is a semigroup morphism].
• t1 ≠ t2 ⇒ ∃ x ∈ X such that r(t1)(x) ≠ r(t2)(x). [Faithfulness].
Put more simply, t1 ≠ t2 ⇒ r(t1) ≠ r(t2), where we define function equality extensionally: f = g ≡ ∀ x, f(x) = g(x).

#### § Embedding actions into the transformation semigroup

We often wish to represent some semigroup S as the transformation semigroup of some set of states Q. We can achieve this by providing a morphism:
• r: SPf(Q) that is faithful.
Then, we can treat elements of S as elements of Pf(Q).

#### § Completion of a transformation semigroup

Given a transformation semigroup X ≡ (Q, S), we can complete it by adding a new sink state ⊥, and then converting all partial functions in S to total functions that send every undefined transition (and ⊥ itself) to ⊥. We have that ⊥ · s = s · ⊥ = ⊥ ∀ s ∈ S, where ⊥ also denotes the everywhere-⊥ transformation. We denote the completion as Xc ≡ (Qc, Sc).

#### § Coverings

Let X ≡ (QX, SX) and Y ≡ (QY, SY) be transformation semigroups. Let φ ⊆ QY × QX be a relation. Let sx ∈ SX and ty ∈ SY. Then, if the following diagram commutes:

qy ---ty--> ty(qy)
 |           |
 φ           φ
 v           v
φ(qy) -sx-> sx(φ(qy)) = φ(ty(qy))

that is, if sx(φ(qy)) = φ(ty(qy)), then we say that ty covers sx relative to φ. We imagine ty lying above sx, being projected down by φ. If, for a fixed φ, for every sx ∈ SX there exists a ty ∈ SY such that ty covers sx relative to φ, then we say that φ is a relation of automata.
• If φ: QY → QX is surjective, then we say that φ is a relational covering and write:
 X ◁φ Y
• If φ ⊆ QY × QX is both surjective and a partial function, then we say that φ is a covering and write:
 X ≺φ Y
If X ≺φ Y, we say that Y dominates X, or Y covers X, or X divides Y.

#### § Checking coverings and generating subsets

We note that for a given covering φ, if sx is covered by ty and px is covered by qy, then the composition sx px is covered by ty qy. Thus, to check if X is covered by Y, we simply need to check if some generating subset of SX is covered by Y.

#### § Checking coverings of representations

Let us assume we have a representation of a transformation semigroup, given by a semigroup Σ, a transformation semigroup X ≡ (QX, SX), and a representation r: Σ → SX that is faithful. Now, to check that X is covered by another Y, it suffices to check that for each σ ∈ Σ there exists a ty ∈ SY such that r(σ) is covered by this ty.

#### § Companion relation

Given a relation φ ⊆ QY × QX, we define:
 Σ ≡ { (s, t) : t ∈ SY covers s ∈ SX }
Recall that compositions of elements are covered by the composition of their coverings. Hence, if (s, t), (s′, t′) ∈ Σ, then (ss′, tt′) ∈ Σ. Thus, Σ is a subsemigroup of SX × SY. We can regard Σ as the graph of a relation φ′ ⊆ SY × SX. This is called the companion relation of φ.

#### § Wreath products

Let X ≡ (QX, SX) and Y ≡ (QY, SY). We're going to define a large product X ≀ Y. We begin with the set W ≡ SXQY × SY, where SXQY ≡ { f : QY → SX }. The wreath product then becomes:
 X ≀ Y ≡ (QX × QY, W)
with the action of W on an element of QX × QY being defined as:
 (f : QY → SX, sy : SY) (qx : QX, qy : QY) ≡ ( f(qy)(qx), sy(qy) )
It's a "follow the types" sort of definition, where we edit the right component as qy ↦ sy(qy), since that's all we can do. In the case of the left component, we have a qx, and we need to produce another element of QX, so we "must use f". The only way to use f is to feed it a qy. This forces us into the above definition.

#### § Composition of wreath products

To show that W is closed under composition, consider (f, sy), (g, ty) ∈ W with f, g: QY → SX, and sy, ty ∈ SY. Following the action through twice (first (g, ty), then (f, sy)), the composite is:
 (f, sy) ∘ (g, ty) = ( λ qy. f(ty(qy)) ∘ g(qy), sy ∘ ty )

#### § Equivalences of subsets of states

Let X = (Q, S) be a transition system. Given subsets a, b ⊆ Q, we shall write b ≼ a if either b ⊆ a or there exists some s ∈ S such that b ⊆ s(a), where s(a) ≡ { s(ai) : ai ∈ a }. We can define an equivalence relation a ≈ b ⇐⇒ a ≼ b ∧ b ≼ a.
Note: b ≼ a ⇒ |b| ≤ |a|, since b ≼ a means that b ⊆ s(a). Note that s is a function s: Q ⇀ Q, and a function mapped over a set can only ever decrease the number of elements in the set, since a function can only glom elements together; it can never break an element apart into two. Hence, b ⊆ s(a) ⇒ |b| ≤ |s(a)| ≤ |a|. Similarly, a ≼ b ⇒ |a| ≤ |b|. Therefore, b ≈ a means that |b| = |a|.
Theorem: for all a, b ⊆ QX such that a ≈ b and b ⊆ s(a), we have b = s(a), and there exists a t ∈ SX such that a = t(b).
Proof: Since b ⊆ s(a) and |b| = |a| ≥ |s(a)| ≥ |b|, we get b = s(a). Therefore s acts as a permutation of a. Hence, s is invertible on a, and there exists an inverse permutation t such that a = t(b). We now need to show that t ∈ SX. To do this, first note that if the order of the permutation s is n, then t = sn−1, since t ∘ s = sn−1 ∘ s = sn = 1. Since the semigroup S is closed under composition, t = sn−1 is in S, since it is s composed with itself (n−1) times.

#### § Subset families of interest

We will be interested in a family A of subsets of QX, of the form:
• all sets of the form s(Q) for all s ∈ SX
• the set Q
• the empty set
• all the singleton sets { q } for all q ∈ Q.
On this family we have ≼ and ≈ as defined above. We note that the set A is closed under the action of every s ∈ SX. For example, the empty set is taken to the empty set, and all singleton sets are taken to other singleton sets (or to the empty set, where s is undefined). For the full set Q, we have added the sets s(Q) for all s ∈ SX.

#### § Height function

A height function for a transition system X ≡ (QX, SX) is a function h: A → ℤ such that:
1. h(∅) = −1.
2. h({ q }) = 0 ∀ qQ.
3. a ≈ b ⇒ h(a) = h(b) for all a, b ∈ A.
4. b < a ⇒ h(b) < h(a) for all a, b ∈ A.
The notation b < a means (b ≼ a) ∧ ¬(a ≼ b). (3) + (4) imply that two elements of the same height are either equivalent or incomparable.

#### § Pavings and bricks

For a ∈ A such that |a| > 1, we denote by Ba the set of all b ∈ A that are maximal proper subsets of a. That is, if b ∈ Ba, then b ⊊ a, and there is no c ∈ A with b ⊊ c ⊊ a. Note that we can assert that a = ∪b ∈ Ba b. This is because A contains all the singletons of QX: we can begin by writing a as a union of singletons, and then merge them into larger and larger elements of A contained in a, terminating when we cannot merge any more; the resulting maximal pieces are exactly elements of Ba.
• The set Ba is called the paving of a.
• The elements of Ba are called the bricks of a.

#### § Group of permutations for a ∈ A

Let us assume that there exists an s ∈ S such that s(a) = a. Let Aa be the set of all elements of A contained in a: Aa ≡ { a′ ∈ A : a′ ⊆ a }. Recall that the set A is closed under the action of every s; hence, since s is a permutation of a, its action naturally extends to a permutation of Aa: s(Aa) = Aa. Now note that this induces a permutation of the set Ba. This creates a transition system:

 Ga ≡ { s ∈ S : s(a) = a };   Ha ≡ (Ba, Ga)
We have already shown that if s ∈ S defines a permutation of some set by its action, then its inverse also exists in S. So, this means that Ga is in fact a transition group that acts on Ba. It might turn out that Ga = ∅. However, if Ga ≠ ∅, then as stated above, Ga is a group. We will call such a transition group a generalized transition group, since either Ga = ∅ or Ga is a group. Now, the generalized transition group Ha is called the holonomy transition system of a, and the group Ga is called the holonomy group of a. We have that Ga ≺ S, since Ga is a quotient of the sub-semigroup { s : s ∈ S, s(a) = a }. (TODO: so what? why does this mean that it's ≺?) Theorem: if a ≈ b, then Ha ≅ Hb (similar subsets have isomorphic holonomy transition systems). Proof: Let us assume that a ≈ b. Since a ≈ b, we have elements s, s−1 ∈ S such that b = s(a) and a = s−1(b). Let ba ∈ Ba be a brick of a; then Bb must contain the element s(ba). [TODO!]

#### § Holonomy decomposition

Let X ≡ (Q, S) be a transition system and let h be a height function for X such that h(Q) > 0. For a fixed i, let a1, a2, …, ak be the representatives of the equivalence classes of elements of A of height equal to i. We define:
 Hi∨ ≡ Ha1 ∨ Ha2 ∨ … ∨ Hak

#### § Inductive hypothesis for coverings

We will say a relational covering X ◁φ Y is of rank i with respect to a given height function h if φ relates states in Y to subsets of states in X that are members of A and have height at most i. Formally, for each p ∈ QY, we have that φ(p) ∈ A and h(φ(p)) ≤ i. We prove that if X ◁φ Y is a relational covering of rank i, then X ◁ Hi∨ ≀ Y is a relational covering of rank i − 1. The proof is by induction.

#### § Base case:

Start with the relational covering with QY = { 0 }, SY = { id }, and the cover φ(0) = QX. Clearly, this has rank n, since the height of QX is n, and φ is indeed a covering, since the only transition that Y can make (stay at the same state) is simulated by any transition in SX [TODO: is this really the argument?]. For the induction, assume X ◁φ Y is a relational covering of rank i with respect to some height function h, where X ≡ (QX, SX) and Y ≡ (QY, SY). We define:
• QYi ≡ { qy : qyQY, h(φ(qy)) = i }
• QY< ≡ { qy : qyQY, h(φ(qy)) < i }
We know that A contains elements of height exactly i. Let a1, a2, …, ak be representatives of the equivalence classes of height i in A. Thus, for each qyi ∈ QYi, we have that:
• φ(qyi) ≈ aj for a unique 1 ≤ j ≤ k.
• We select elements u, u′ ∈ S such that u(φ(qyi)) = aj and u′(aj) = φ(qyi).
We will show how to establish a relational covering:
• X ◁φ′ Hi∨ ≀ Y, using a relation:
• φ′ ⊆ [ (Ba1 ∪ Ba2 ∪ … ∪ Bak) × QY ] × QX

### § Proving block matmul using program analysis

It's a somewhat well-known fact that matrix multiplication can be blocked. Given O = AB, where O ∈ ℝ2n × 2m (O for output), A ∈ ℝ2n × r, B ∈ ℝr × 2m are matrices, we can also write the product blockwise:

 ⎡o11 o12⎤   ⎡a11 a12⎤ ⎡b11 b12⎤   ⎡a11·b11 + a12·b21   a11·b12 + a12·b22⎤
 ⎣o21 o22⎦ = ⎣a21 a22⎦ ⎣b21 b22⎦ = ⎣a21·b11 + a22·b21   a21·b12 + a22·b22⎦

When written as code, the original matrix multiplication is:
// a:[2N][2Z] b:[2Z][2M] -> out:[2N][2M]
void matmul(int N, int Z, int M, int a[2*N][2*Z], int b[2*Z][2*M], int out[2*N][2*M]) {
for(int i = 0; i < 2*N; ++i) {
for(int j = 0; j < 2*M; ++j) {
for(int k = 0; k < 2*Z; ++k) { out[i][j] += a[i][k] * b[k][j]; }
}
}
}

and the block-based matrix multiplication is:
// a:[2N][2Z] b:[2Z][2M] -> out:[2N][2M]
void matmulBlock(int N, int Z, int M, int a[2*N][2*Z], int b[2*Z][2*M], int out[2*N][2*M]) {
for (int BI = 0; BI < 2; ++BI) {
for (int BJ = 0; BJ < 2; ++BJ) {
for(int i = BI*N; i < BI*N+N; ++i) {
for(int j = BJ*M; j < BJ*M+M; ++j) {
for(int k = 0; k < 2*Z; ++k) { out[i][j] += a[i][k] * b[k][j]; }
}
}
}
}
}

We wish to show that both of these programs have the same semantics. We will do this by appealing to ideas from program analysis.

#### § The key idea

We will consider the statement:
out[i][j] += a[i][k] * b[k][j]

as occurring at an abstract "point in time" (i, j, k) in the matmul function. It also occurs at an abstract "point in time" (BI, BJ, i′, j′, k′) in the matmulBlock function. We will then show that the loops for(i...) for(j...) for(k...) are fully parallel, and hence we can reorder the loops any way we want. Then, we will show that the ordering imposed by (BI, BJ, i′, j′, k′) is a reordering of the original (i, j, k) ordering. We do this by showing that there is a bijection:
 (i=i0, j=j0, k=k0) → (BI=i0/N, BJ=j0/M, i′=i0%N, j′=j0%M, k′=k0)
Thus, the blocked program executes exactly the same statement instances in a different order, without affecting the program semantics.

#### § Schedules

We'll zoom out a little, to consider some simple programs and understand how to represent parallelism.
void eg1(int N, int M, int out[N][M]) {
for(int i = 0; i < N; ++i) {
for(int j = 1; j < M; ++j) {
out[i][j] = out[i][j-1];
}
}
}

Notice that this program is equivalent to the program with the i loop reversed:
void eg1rev(int N, int M, int out[N][M]) {
for(int i = N-1; i >= 0; --i) {
for(int j = 1; j < M; ++j) {
out[i][j] = out[i][j-1];
}
}
}

What's actually stopping us from reversing the loop for(j...)? It's the fact that the value of, say, out[i=0][j=1] depends on out[i=0][j=0]. We can see that in general, out[i=i_0][j=j_0] depends on out[i=i_0][j=j_0-1]. We can represent this by considering a dependence set:
 { write:(i0, j0−1) → write:(i0, j0) }
In general, we can reorder statements as long as we do not change the directions of the arrows in the dependence set.

### § Why I like algebra over analysis

Midnight discussions with my room-mate Arjun P. This tries to explore what it is about algebra that I find appealing. I think the fundamental difference to me comes down to flavour --- analysis and combinatorial objects feel very "algorithm", while Algebra feels "data structure". To expand on the analogy, a proof technique is like an algorithm, while an algebraic object is like a data structure. The existence of an algebraic object allows us to "meditate" on the proof technique as a separate object that does not move through time. This allows us to "get to know" the algebraic object, independent of how it's used. So, at least for me, I have a richness of feeling when it comes to algebra that just doesn't shine through with analysis. The one exception maybe reading something like "by compactness", which has been hammered into me by exercises from Munkres :) Meditating on a proof technique is much harder, since the proof technique is necessarily intertwined with the problem, unlike a data structure which to some degree has an independent existence. This reminds me of the quote: "“Art is how we decorate space; Music is how we decorate time.”. I'm not sure how to draw out the tenuous connection I feel, but it's there. Arjun comes from a background of combinatorics, and my understanding of his perspective is that each proof is a technique unto itself. Or, perhaps instantiating the technique for each proof is difficult enough that abstracting it out is not useful enough in the first place. A good example of a proof technique that got studied on its own right in combinatorics is the probabilistic method. A more reasonable example is that of the Pigeonhole principle, which still requires insight to instantiate in practise. Not that this does not occur in algebra either, but there is something in algebra about how just meditating on the definitions. For example, Whitney trick that got pulled out of the proof of the Whitney embedding theorem. 
To draw an analogy for the Haskellers: it's the same joy of being able to write down the type of a Haskell function and know exactly what it does, enough that a program can automatically derive the function (djinn). The fact that we know the object well enough that just writing the type down allows us to infer the program makes it beautiful. There's something very elegant about the minimality that algebra demands. Indeed, this calls back to another quote: "perfection is achieved not when there is nothing more to add, but when there is nothing left to take away". I'm really glad that this 2 AM discussion allowed me to finally pin down why I like algebra.

### § using for cleaner function type typedefs

I've always struggled with remembering the syntax for function type typedefs:
typedef RETTY (*FNTYNAME)(ARGTY1, ARGTY2, ..., ARGTYn);

we can now use using for a far more pleasant syntax:
using FNTYNAME = RETTY(ARGTY1, ARGTY2, ..., ARGTYn);

which is also intuitive. You write down the "type" on the right hand side, and give it a name on the left. This is not strictly the same, since the typedef typedefs FNTYNAME to a function pointer type, while the C++ version typedefs the function type. I prefer the latter at any rate, since I dislike the fact that the usual typedef tends to hide the fact that a function pointer is some pointer-like-thing.

### § A walkway of lanterns

#### § Semidirect products

• (α ≡ { a, b, … }, +, 0): the part written additively
• (ω ≡ { X, Y, … }, ×, 1): the part that acts
• · : ω → Automorphisms(α)
• rotations: ℤ5
• reflections: ℤ2
• D5 = ℤ5 ⋊ ℤ2

The multiplication rule, written as 2×2 matrices:

|1  0|   |1  0|   |1          0 |
|a  X| · |b  Y| = |a + X·b    XY|

• (Y b) →act (X a): the ω-component X acts on the α-component b via the automorphism action ·.
• The product is (a + X · b, XY).

#### § A walkway of lanterns

• Imagine a long walkway. You start at 0. You are but a poor lamplighter.
• Where are the lamps? At each i ∈ ℤ, you have a lamp that is either on or off. So at each position you have a ℤ2.
• L ≡ ℤ → ℤ2 is our space of lantern configurations. You can act on this space by either moving along the walkway using ℤ, or toggling a lamp using ℤ2.
• g = (lights: ⟨−1, 0, 1⟩, loc: 10)
• move3 = (lights: ⟨⟩, loc: 3)
• g′ ≡ move3 · g = (lights: ⟨−1, 0, 1⟩, loc: 13)
• togglex = (lights: ⟨0, 2⟩, loc: 0)
• togglex · g′ = (lights: ⟨−1, 0, 1, 13, 15⟩, loc: 13) --- the toggles ⟨0, 2⟩ are shifted by g′'s location 13 before being XORed into the lights.
• toggley = (lights: ⟨−13, −12⟩, loc: 0)
• toggley · g′ = (lights: ⟨−1⟩, loc: 13) --- the shifted toggles land on positions 0 and 1, switching those lamps off.

### § Natural transformations

I couldn't find people who draw "all three parts" of a natural transformation --- the categories C, FC, and GC --- and then show the relationship between them, so I made this for my own reference.

### § The hilarious commentary by dinosaure in OCaml git

The ocaml-git project is a re-implementation of git in OCaml. It's well written, and I was walking through the codebase when I found absolutely amazing, hilarious, and deep comments from dinosaure. I really enjoyed reading through the codebase, and the existence of these comments made it more humane to read. I don't know who dinosaure is, but I'm really glad they wrote the comments they did; it really made my day.

#### § The one that takes a stab at Haskell for fun

(* XXX(dinosaure): ...we can fix this detail but
I'm lazy like haskell. TODO! *)


(* XXX(dinosaure): see this paper
https://github.com/ocamllabs/papers/blob/master/irmin/2014.08.matthieu/rapport.pdf *)


#### § The one about the frustrations of bug-hunting

(* XXX(dinosaure): [~chunked:false] is mandatory, I don't want to explain
why (I lost one day to find this bug) but believe me. *)


#### § The one about a potential heisenbug

(* XXX(dinosaure): if, one day, we find a bug about the serialization of the


#### § The requisite comment in french for an OCaml project

(* XXX(dinosaure): bon ici, c'est une note compliqué, j'ai mis 2 jours
à fixer le bug. Donc je l'explique en français, c'est plus simple.
En gros, [Helper.MakeDecoder] utilise ce buffer comme buffer interne
pour gérer les alterations. Ce qui ce passe, c'est que dans la
fonction [refill], il s'agit de compléter à partir d'un [input]
(typiquement [zl]) le buffer interne. C'est finalement un
__mauvais__ jeu entre un [Cstruct.t] et un [Bigarray].
Il s'agit de connaître la véritable taille du [Bigarray] et de
rajouter avec [blit] le contenu de l'[input] si la taille du
[Bigarray] (et pas du [Cstruct]) est suffisante.
Avant, cette modification, [zl], [de] et [io] partagaient le même
[Bigarray] découpé (avec [Cstruct]) en 3. Donc, dans le
[MakeDecoder], [refill] considérait (pour des gros fichiers faisant
plus de 0x8000 bytes) que après [de], nous avions encore de la
place - et dans ce cas, nous avions [io].
Ainsi, on [blit]ait [zl] dans [de+sizeof(de) == io], et finalement,
on se retrouvait à essayer de décompresser ce que nous avions
décompressé. (YOLO).
Donc, on considère maintenant [de] comme un [Cstruct.t] et un
[Bigarray] physiquement différent pour éviter ce problème.
Cependant, il faudrait continuer à introspecter car j'ai
l'intuition que pour un fichier plus gros que [2 * 0x8000], on
devrait avoir un problème. Donc TODO. *)

Roughly translated: "OK, this is a complicated note; it took me 2 days to fix the bug, so I'm explaining it in French, it's simpler. Basically, [Helper.MakeDecoder] uses this buffer as an internal buffer to handle alterations. What happens is that in the [refill] function, the internal buffer is filled up from an [input] (typically [zl]). It's ultimately a __bad__ interplay between a [Cstruct.t] and a [Bigarray]. The point is to know the true size of the [Bigarray], and to append the contents of the [input] with [blit] if the size of the [Bigarray] (and not of the [Cstruct]) is sufficient. Before this change, [zl], [de], and [io] shared the same [Bigarray], cut (with [Cstruct]) into 3. So, in the [MakeDecoder], [refill] considered (for big files of more than 0x8000 bytes) that after [de] we still had room, and in that case, we had [io]. Thus, we [blit]ed [zl] into [de+sizeof(de) == io], and we ended up trying to decompress what we had already decompressed. (YOLO). So we now treat [de] as a [Cstruct.t] and a physically distinct [Bigarray] to avoid this problem. However, we should keep introspecting, because I have the intuition that for a file bigger than [2 * 0x8000], we should have a problem. So TODO."


#### § The deep one

(* XXX(dinosaure): at the end, we don't care if we lost something. *)


### § How to link against MLIR with CMake

Since MLIR hasn't set up the nice tooling that LLVM has around CMake, as far as I can tell, one needs to actually know CMake to link against MLIR. However, as is well known, CMake incantations are handed down by priests who spend the better part of their lives studying the tome that is the CMake manual. I, an unlucky soul, had to go on this adventure, and I hope to spare you the trouble. I wished to link against a static library build of MLIR. The secret lies in the find_library call:
# If the variable has been set by -DMLIR_INCLUDE_PATH, then keep it.
# Otherwise, fall back to the environment variable $MLIR_INCLUDE_PATH.
# If neither, then *shrug*.
if(NOT MLIR_INCLUDE_PATH)
  set(MLIR_INCLUDE_PATH $ENV{MLIR_INCLUDE_PATH})
endif()

# Resolve for:
# - a library target called MLIRAnalysis
# - asking to link against libMLIRAnalysis.a
# - using the variable MLIR_LIBRARY_PATH which, as we saw before,
#   is either an environment variable or a cmake option
find_library(MLIRAnalysis MLIRAnalysis ${MLIR_LIBRARY_PATH})

I cribbed the actual things to link against from the path mlir/examples/Toy/Ch2/CMakeLists.txt, which helpfully lists the MLIR things it needs to link against. The full CMakeLists is here:

cmake_minimum_required(VERSION 3.5)
project(languagemodels)
set(CMAKE_CXX_STANDARD 14)

# I don't want to use find_package since I want proper control over
# where my LLVM comes from.
# find_package(LLVM REQUIRED)
add_executable(languagemodels rnn.cpp codegenc.h lang.h codegenmlir.h)

# Attempt to take these as command line arguments. If that fails,
# look up the environment.
if(NOT MLIR_INCLUDE_PATH)
  set(MLIR_INCLUDE_PATH $ENV{MLIR_INCLUDE_PATH})
endif()
if(NOT MLIR_LIBRARY_PATH)
  set(MLIR_LIBRARY_PATH $ENV{MLIR_LIBRARY_PATH})
endif()

target_include_directories(languagemodels PRIVATE ${MLIR_INCLUDE_PATH})

find_library(MLIRAnalysis MLIRAnalysis ${MLIR_LIBRARY_PATH})
find_library(MLIRIR MLIRIR ${MLIR_LIBRARY_PATH})
find_library(MLIRParser MLIRParser ${MLIR_LIBRARY_PATH})
find_library(MLIRSideEffects MLIRSideEffects ${MLIR_LIBRARY_PATH})
find_library(MLIRTransforms MLIRTransforms ${MLIR_LIBRARY_PATH})
find_library(LLVMCore LLVMCore ${MLIR_LIBRARY_PATH})
find_library(LLVMSupport LLVMSupport ${MLIR_LIBRARY_PATH})

# debugging to check that the paths have been set properly
message(MLIR_INCLUDE_PATH ${MLIR_INCLUDE_PATH})
message(MLIR_LIBRARY_PATH ${MLIR_LIBRARY_PATH})
message(MLIRAnalysis ${MLIRAnalysis})

target_link_libraries(languagemodels
  ${MLIRAnalysis} ${MLIRIR}
  ${MLIRParser} ${MLIRSideEffects}
  ${MLIRTransforms} ${LLVMCore}
  ${LLVMSupport})

### § Energy as triangularizing state space

This comes from the Wild Book by John Rhodes, which I anticipate I'll be posting more of in the coming weeks.

#### § Experiments

Let an experiment be a tuple of the phase space X, the action space A, and an action of the actions onto the phase space: · : A × X → X. We will write x′ = a x to denote the new state of the system x. So the experiment E is the data E ≡ (X, A, · : A × X → X).

#### § Coordinate systems

The existence of the action allows us to write the evolution of the system recursively: x(t+1) = a x(t). However, to understand the final state x(t+1), we need to essentially "run the recursion", which does not permit us to understand the experiment. What we really need is the ability to "unroll" the loop. To quote:

Informally, understanding an experiment E means introducing coordinates into the phase space of E which are in triangular form under the action of the inputs of E.

#### § Conservation laws as triangular form

We identify certain interesting invariants of a system by two criteria:
1. The parameter Q(t) determines some obviously important aspects of the system. That is, there is a deterministic function M(Q(t)) which maps Q(t) to "measure" some internal state of the system.
2. If the value of such a parameter Q is known at time t0 (denoted Q(t0)), and it is also known what inputs are presented to the system from time t0 to time t0 + ε (denoted I[t0, t0 + ε]), then the new value of Q is a deterministic function of Q(t0) and I[t0, t0 + ε].

Such parameters allow us to understand a system, since they are deterministic parameters of the evolution of the system, while also providing a way to measure some internal state of the system using M. For example, consider a system x with an energy function e(x). If we perform an action a on the system x, then we can predict the energy e(x′ = a x) given just e(x) and a --- here, x′ = a x is the action of a on x.
In general, conservation principles give a first coordinate of a triangularization. In the main, a large part of physics can be viewed as discovering and introducing functions e of the states q of the system such that under an action a, e(a q) depends only on e(q) and a, and not on q.

#### § Theory: semidirect and wreath products

#### § Symmetries as triangular form

We first heuristically indicate the construction involved in going from the group of symmetries to the triangularization, and then precisely write it out in all pedantic detail. Let an experiment be E ≡ (X, A, ·). Then we define: Π is a symmetry of E iff:
1. Π : X → X is a permutation of X.
2. Π commutes with the action of each a: Π(a x) = a Π(x).

We say that the experiment E is transitive (in the action sense) if for all x1, x2 ∈ X with x1 ≠ x2, there exist a1, a2, … an such that x2 = an … (a1 x1).

Facts about the symmetries of a system:
1. We know that the symmetries of an experiment E form a group.
2. If E is transitive, then each symmetry Π is a regular permutation --- if there exists an xf such that Π(xf) = xf (a fixed point), then this implies that Π(x) = x for all x.
3. Let the action split X into disjoint orbits O1, O2, … Ok, from which we choose representatives x1 ∈ O1, x2 ∈ O2, … xk ∈ Ok. Then, if E is transitive, there is exactly one action that sends a particular xi to a particular xj. So, on fixing one component of an action, we fix all components.

To show that this gives rise to a triangularization, we first construct the semigroup of actions of the experiment: S(E) ≡ { a1 … an : n ≥ 1 and ai ∈ A }. Now, let G = Sym(E), the full symmetry group of E. One can apparently express the symmetry group in terms of:

(X, S) ≤ (G, G) ≀ ({ O1, O2, … Ok }, T)

### § The cutest way to write semidirect products

Given two monoids (M, +, 0M) and (N, ×, 1N), and a homomorphism φ: N → End(M), where End(M) is the endomorphism monoid of M, we will notate φ(n)(m) as n · m ∈ M.
Now the semidirect product M ⋊φ N is the set M × N equipped with the multiplication rule:
• (m, n) (m′, n′) = (m + n · m′, n × n′)

This can also be written down as:

|1  0|   |1   0 |   |1           0     |
|m  n| · |m′  n′| = |m + n·m′    n × n′|

This way of writing down semidirect products as matrices makes many things immediately clear:
• The semidirect product is some kind of "shear" transform, since that's what a shear transformation looks like, matrix-wise.
• The resulting monoid M ⋊φ N has identity (0M, 1N), since for the matrix to be the identity, we need the 2nd row to be (0, 1).
• The inverse operation, if (M, N) were groups, would have to be such that

|1         0     |   |1  0|
|m + n·m′  n × n′| = |0  1|

Hence:
• n × n′ = 1 implies that n′ = 1/n.
• m + n · m′ = 0 implies that m′ = −m/n.
which is indeed the right expression for the inverse.

### § My Favourite APLisms

#### § identity matrix

$ n←3 ⋄ id ← n n ⍴(1,n⍴0) ⋄ id

This relies heavily on ⍴ replicating its arguments.

#### § histogram

$ xs←(1 1 3 3 3 6) ⋄ n←(⌈/xs)⍴0 ⋄ n[xs]+←1 ⋄ n

The use of n[xs]+←1 will stably write +1 as many times as there are repeated indexes in xs.

#### § String matching / parity as fold with ≠

$ ]disp str ← (1 0 0 1 0 0 0 0 1 0 1 0 0 0) ⋄ 2 1 ⍴ ((⊂ str) ⍪ ⊂((≠\str)))
┌→──────────────────────────┐
↓1 0 0 1 0 0 0 0 1 0 1 0 0 0│
├~─────────────────────────→┤
│1 1 1 0 0 0 0 0 1 1 0 0 0 0│
└~─────────────────────────→┘

### § Proof of chinese remainder theorem on rings

#### § General operations on ideals

We have at our hands a commutative ring R, and we wish to study the ideal structure of the ring. In particular, we can combine ideals in the following ways:
1. I + J ≡ { i + j : i ∈ I, j ∈ J }
2. I ⋂ J ≡ { x : x ∈ I ∧ x ∈ J }
3. I ⊕ J ≡ { (i, j) : i ∈ I, j ∈ J }
4. IJ ≡ { ∑ₖ iₖ jₖ : iₖ ∈ I, jₖ ∈ J }

We have the containments:

IJ ⊆ I ⋂ J ⊆ I, J ⊆ I + J ⊆ R

#### § IJ is an ideal, IJ ⊆ I ∩ J

It's not immediate from the definition that IJ is an ideal. The idea is that given a sum ∑ₖ iₖ jₖ ∈ IJ, we can write each iₖ jₖ = iₖ″ ∈ I, since the ideal I is closed under multiplication with R. Hence the whole sum lies in I.
Similarly, we can interpret ∑ₖ iₖ jₖ as a sum of elements jₖ″ ∈ J, so the sum also lies in J. Hence, we get the containment IJ ⊆ I ⋂ J.

#### § I ∩ J ⊆ I, I ∩ J ⊆ J

Immediate from the inclusion function.

#### § I, J ⊆ I + J

Immediate from inclusion.

#### § CRT from an exact sequence

There exists an exact sequence:

0 → I ⋂ J →f I ⊕ J →g I + J → 0
f(r) = (r, r)
g((i, j)) = i + j

We are forced into this formula by considerations of dimension. We know:

dim(I ⊕ J) = dim(I) + dim(J)
dim(I + J) = dim(I) + dim(J) − dim(I ⋂ J)   [inclusion-exclusion]
dim(I + J) = dim(I ⊕ J) − dim(I ⋂ J)
dim(I + J) − dim(I ⊕ J) + dim(I ⋂ J) = 0
V − E + F = 2

By analogy to the Euler characteristic, which arises from homology, we need to have I ⊕ J in the middle of our exact sequence. So we must have:

0 → ? → I ⊕ J → ? → 0

Now we need to decide on the relative ordering between I ⋂ J and I + J.
• There is no universal way to send I ⊕ J → I ⋂ J. It's an unnatural operation to restrict the direct sum into the intersection.
• There is a universal way to send I ⊕ J → I + J: sum the two components. This can be seen as currying the addition operation.

Thus, the exact sequence must have I + J in the image of I ⊕ J. This forces us to arrive at:

0 → I ⋂ J → I ⊕ J → I + J → 0

The product ideal IJ plays no role, since it's not possible to define a product of modules in general (just as it is not possible to define a product of vector spaces). Thus, the exact sequence had better involve module-related operations. We can now recover CRT:

0 → I ⋂ J →f I ⊕ J →g I + J → 0
0 → R/(I ⋂ J) → R/I ⊕ R/J → R/(I + J) → 0

#### § References

### § monic and epic arrows

This is trivial; I'm surprised it took me this long to internalize this fact. When we convert a poset (X, ≤) into a category, we stipulate that x ≤ y ⟺ x → y. If we now consider the category Set of sets and functions between sets, an arrow f : A → B is a function from A to B. If f is monic, then we know that |A| = |Im(f)| ≤ |B|. That is, a monic arrow behaves a lot like a poset arrow!
Similarly, an epic arrow behaves a lot like the arrow in the opposite poset. I wonder if quite a lot of category-theoretic diagrams are clarified by thinking of monic and epic directly in terms of controlling sizes.

### § The geometry of Lagrange multipliers

If we want to minimise a function f(x) subject to the constraint g(x) = c, one uses the method of Lagrange multipliers. The idea is to consider a new function L(x, λ) = f(x) + λ (c − g(x)). Now, at a local optimum (x, λ), the conditions are:
1. ∂L/∂x = 0: f′(x) − λ g′(x) = 0.
2. ∂L/∂λ = 0: g(x) = c.

Equation (2) is sensible: we want our optimum to satisfy the constraint that we had originally imposed. What is Equation (1) trying to say? Geometrically, it's asking us to keep f′(x) parallel to g′(x). Why is this a good ask? Let us say that we are at a feasible point x0 (g(x0) = c). We are interested in wiggling x0 →wiggle x0 + ε ≡ x1 such that:
• x1 is still feasible: g(x1) = c = g(x0).
• x1 is an improvement: f(x1) > f(x0).
• If we want g(x1) to not change, then we need g′(x0) · ε = 0.
• If we want f(x1) to be larger, we need f′(x0) · ε > 0.
If f′(x0) and g′(x0) are parallel, then any wiggle ε that improves f (f′(x0) · ε > 0) must also change g (g′(x0) · ε ≠ 0), thereby violating the constraint g(x0 + ε) = c. So no improving feasible wiggle exists, which is exactly what we want at an optimum.

### § Efficient tree transformations on GPUs

All material lifted straight from Aaron Hsu's PhD thesis. I'll be converting APL notation to C++-like notation.

#### § Tree representation as multi-dimensional ragged nested arrays

We're interested in this tree:

      ∘
┌──┬──┴────┐
a  b       c
│ ┌┴┐  ┌───┼───┐
p q r  s   t   u
│    │   |
│   ┌┴┐ ┌┴┐
v   w x y z

I'll be writing APL commands in front of a $ to mimic bash, and I'll write some arrays as multi-line. To run them, collapse them into a single line. The ast object is represented in memory as:
$ ast ← ('∘' ('a' ('p')) ('b' ('q' ('v')) ('r')) ('c' ('s' ('w' 'x')) ('t' ('y' 'z')) ('u')))
$ ]disp ast
┌→┬──┬────────┬───────────────────┐
│∘│ap│┌→┬──┬─┐│┌→┬──────┬──────┬─┐│
│ │  ││b│qv│r│││c│┌→┬──┐│┌→┬──┐│u││
│ │  │└─┴─→┴─┘││ ││s│wx│││t│yz││ ││
│ │  │        ││ │└─┴─→┘│└─┴─→┘│ ││
│ │  │        │└─┴─────→┴─────→┴─┘│
└─┴─→┴───────→┴──────────────────→┘

Here's how to read the array representation. Look at the top level of the tree: we have a root node with three children:
      ∘
┌──┬──┴────┐
a  b       c

┌→┬──┬────────┬─────────────┐
│∘│  │        │             │
│ │ a│   b    │     c       │
│ │  │        │             │
└─┴─→┴───────→┴────────────→┘

With the first ∘ being the root node, and the three adjacent cells being the children. Next, we look at how a is represented. This is predictably recursive. Let's see the subtree under a:
      ∘
┌──┬──┴────┐
a  b       c
│
p

┌→┬──┬────────┬─────────────┐
│∘│ap│        │             │
│ │  │  b     │   c         │
│ │  │        │             │
└─┴─→┴───────→┴────────────→┘


Similarly for b:
      ∘
┌──┬──┴────┐
a  b       c
│ ┌┴┐
p q r

┌→┬──┬────────┬─────────────┐
│∘│ap│┌→┬──┬─┐│             │
│ │  ││b│q │r││   c         │
│ │  │└─┴─→┴─┘│             │
└─┴─→┴───────→┴────────────→┘

And so on, leading to the final representation:
      ∘
┌──┬──┴────┐
a  b       c
│ ┌┴┐  ┌───┼───┐
p q r  s   t   u
│    │   |
│   ┌┴┐ ┌┴┐
v   w x y z
┌→┬──┬────────┬───────────────────┐
│∘│ap│┌→┬──┬─┐│┌→┬──────┬──────┬─┐│
│ │  ││b│qv│r│││c│┌→┬──┐│┌→┬──┐│u││
│ │  │└─┴─→┴─┘││ ││s│wx│││t│yz││ ││
│ │  │        ││ │└─┴─→┘│└─┴─→┘│ ││
│ │  │        │└─┴─────→┴─────→┴─┘│
└─┴─→┴───────→┴──────────────────→┘

Note that for this representation to work, we need to be able to:
• nest arrays inside arrays.
• have subarrays of different sizes (ragged arrays)
• of different nesting depths --- so it's really not even an array?
I don't understand the memory layout of this, to be honest. I feel like to represent this in memory would still rely on pointer-chasing, since we need to box all the arrays. This is possibly optimised by APL to not be too bad.

#### § The depth vector representation

      ∘             0
┌──┬──┴────┐
a  b       c        1
│ ┌┴┐  ┌───┼───┐
p q r  s   t   u    2
│    │   |
│   ┌┴┐ ┌┴┐
v   w x y z       3

If we visit this tree and record depths in pre-order (node left right), we arrive at the list:
(∘:0
(a:1 (p:2)) (b:1 (q:2 (v:3)) (r:2))
(c:1 (s:2 (w:3 x:3)) (t:2 (y:3 z:3)) (u:2)))

formatted as:
(∘:0
(a:1
(p:2))
(b:1
(q:2 (v:3))
(r:2)
)
(c:1 (s:2 (w:3 x:3))
(t:2 (y:3 z:3))
(u:2))
)

Linearized, this is the list:
    (∘ a p b q v r c s w x t y z u)
d ← (0 1 2 1 2 3 2 1 2 3 3 2 3 3 2)

∘             0
┌──┬──┴────┐
a  b       c        1
│ ┌┴┐  ┌───┼───┐
p q r  s   t   u    2
│    │   |
│   ┌┴┐ ┌┴┐
v   w x y z       3

To convert the ast object into a depth vector representation, we can use the following call:
$ ast ← ('∘' ('a' ('p')) ('b' ('q' ('v')) ('r')) ('c' ('s' ('w' 'x')) ('t' ('y' 'z')) ('u')))
$ d ← ∊0{(⊢,(⍺+1)∇⊣)/⌽⍺,1↓⍵}ast
0 1 2 1 2 3 2 1 2 3 3 2 3 3 2

Let's break this down: TODO

#### § Inverted tables

We represent data associated with our nodes as follows:
$ data ← ⍪¨ d (15⍴'T') (↑15⍴⊂'n.')
$ ]disp data
┌→┬─┬──┐
│0│T│n.│
│1│T│n.│
│2│T│n.│
│1│T│n.│
│2│T│n.│
│3│T│n.│
│2│T│n.│
│1│T│n.│
│2│T│n.│
│3│T│n.│
│3│T│n.│
│2│T│n.│
│3│T│n.│
│3│T│n.│
│2↓T↓n.↓
└→┴→┴─→┘

This is the same thing as a structure of arrays (SoA) representation, where each array of information (eg, the depths in data[1], the T information in data[2]) is an array that can be accessed efficiently with SIMD instructions.

TODO

#### § Path matrices

We want information of how to go up and down the tree in ideally constant time. We store this information in what is known as a path matrix. For our recurring example, the path matrix is:
∘ a p b q v r c s w x t y z u | preorder traversal
──────────────────────────────────────────────────
∘ ∘ ∘ ∘ ∘ ∘ ∘ ∘ ∘ ∘ ∘ ∘ ∘ ∘ ∘ | depth=0
- a a b b b b c c c c c c c c | depth=1
- - p - q q r - s s s t t t u | depth=2
- - - - - v - - - w x - y z - | depth=3

∘             0
┌──┬──┴────┐
a  b       c        1
│ ┌┴┐  ┌───┼───┐
p q r  s   t   u    2
│    │   |
│   ┌┴┐ ┌┴┐
v   w x y z       3

To efficiently compute this, we first replace every value in our tree with its preorder traversal visit time. This changes the tree to:
              ∘:0                               0
┌──────────┬──┴─────────────────┐
a:1        b:3                 c:7              1
│      ┌───┴───┐     ┌──────────┼───────┐
p:2    q:4     r:6   s:8        t:11    u:14    2
│             │          │
│          ┌──┴──┐     ┌─┴───┐
v:5        w:9   x:10  y:12  z:13        3

The values we store in the tree are the integers. The old labels are represented for clarity. The path matrix for this tree is:
0  1  2  3  4  5  6  7  8  9 10 11 12 13 14  | preorder traversal
────────────────────────────────────────────────────────────
0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  | depth=0
-  1  1  3  3  3  3  7  7  7  7  7  7  7  7  | depth=1
-  -  2  -  4  4  6  -  8  8  8 11 11 11 14  | depth=2
-  -  -  -  -  5  -  -  -  9 10  - 12 13  -  | depth=3

∘:0                               0
┌──────────┬──┴─────────────────┐
a:1        b:3                 c:7              1
│      ┌───┴───┐     ┌──────────┼───────┐
p:2    q:4     r:6   s:8        t:11    u:14    2
│             │          │
│          ┌──┴──┐     ┌─┴───┐
v:5        w:9   x:10  y:12  z:13        3


#### § Rendering the depth information in 2D

We use the incantation:
$ d ← (0 1 2 1 2 3 2 1 2 3 3 2 3 3 2)
$ ((⍳≢d)@(d,¨⍳≢d)) (((1+⌈/d) (≢d))⍴'-')
0 - - - - - - - - -  -  -  -  -  -
- 1 - 3 - - - 7 - -  -  -  -  -  -
- - 2 - 4 - 6 - 8 -  - 11  -  - 14
- - - - - 5 - - - 9 10  - 12 13  -

Let's break this down (the symbol ⍝ means a lamp, and is used for commenting/illumination):
$ ⍳ 3 ⍝ iota: make a list of n elements
1 2 3
$ d
0 1 2 1 2 3 2 1 2 3 3 2 3 3 2

$ ≢d ⍝ tally: count the no. of elements in d
15
$ ⍳≢d ⍝ list of length (no. of elements in d)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
$ ]disp (1 2 3),(4 5 6) ⍝ , : concatenate
┌→────┬─────┐
│1 2 3│4 5 6│
└~───→┴~───→┘

$ ]disp (1 2 3) ,¨ (4 5 6)
┌→──┬───┬───┐
│1 4│2 5│3 6│
└~─→┴~─→┴~─→┘

The use of ¨ needs some explanation. ¨ is a higher order function which takes a function and makes it a mapped version of the original function. So, ,¨ is a function which attempts to map the concatenation operator. Now, given two arrays (1 2 3) and (4 5 6), (1 2 3) ,¨ (4 5 6) attempts to run , on each pair: 1 and 4, 2 and 5, 3 and 6. This gives us the tuples ((1 4) (2 5) (3 6)). So, for our purposes, zip ← ,¨.
$ ]disp (d,¨⍳≢d) ⍝ zip d with [1..len d].
┌→──┬───┬───┬───┬───┬───┬───┬───┬───┬───┬────┬────┬────┬────┬────┐
│0 0│1 1│2 2│1 3│2 4│3 5│2 6│1 7│2 8│3 9│4 10│2 11│3 12│4 13│2 14│
└~─→┴~─→┴~─→┴~─→┴~─→┴~─→┴~─→┴~─→┴~─→┴~─→┴~──→┴~──→┴~──→┴~──→┴~──→┘

$ d ← (0 1 2 1 2 3 2 1 2 3 3 2 3 3 2)
$ PM ← ⌈\ ((⍳≢d)@(d,¨⍳≢d)) (((⌈/d+1) (≢d))⍴0)
0 0 0 0 0 0 0 0 0 0  0  0  0  0  0
0 1 1 3 3 3 3 7 7 7  7  7  7  7  7
0 0 2 2 4 4 6 6 8 8  8 11 11 11 14
0 0 0 0 0 5 5 5 5 9 10 10 12 13 13

      0             0
┌──┬──┴───────┐
1  3          7     1
│ ┌┴┐  ┌──────┼───┐
2 4 6  8      11  14    2
│    │      │
│   ┌┴─┐  ┌┴──┐
5   9  10 12  13    3

The incantation can be broken down into:
• (((⌈/d+1) (≢d))⍴0) is used to create a max(d+1) × |d| dimension array of zeros. Here, the rows correspond to depths, and the columns correspond to tree nodes, which for us are their preorder indexes.

$ grid←(⌈/d+1) (≢d) ⍴ 0
$ grid
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

• (d ,¨ ⍳≢d) creates an array of pairs (depth, preorder index). We will use this to fill index (d, pi) with the value pi.

$ writeixs ← (d,¨⍳≢d)
$ ]disp writeixs
┌→──┬───┬───┬───┬───┬───┬───┬───┬───┬───┬────┬────┬────┬────┬────┐
│0 0│1 1│2 2│1 3│2 4│3 5│2 6│1 7│2 8│3 9│3 10│2 11│3 12│3 13│2 14│
└~─→┴~─→┴~─→┴~─→┴~─→┴~─→┴~─→┴~─→┴~─→┴~─→┴~──→┴~──→┴~──→┴~──→┴~──→┘

• ixgrid ← ((⍳≢d)@writeixs) grid writes, at each index writeixs[i], the value (⍳≢d)[i].

$ ixgrid ← ((⍳≢d)@writeixs) grid
$ ixgrid
0 0 0 0 0 0 0 0 0 0  0  0  0  0  0
0 1 0 3 0 0 0 7 0 0  0  0  0  0  0
0 0 2 0 4 0 6 0 8 0  0 11  0  0 14
0 0 0 0 0 5 0 0 0 9 10  0 12 13  0

• Finally, ⌈ is the maximum operator and \ is the prefix scan operator, so ⌈\ixgrid creates a row-wise prefix max-scan of the above grid to give us our final path matrix:

$ PM ← ⌈\ixgrid
$ PM
0 0 0 0 0 0 0 0 0 0  0  0  0  0  0
0 1 1 3 3 3 3 7 7 7  7  7  7  7  7
0 0 2 2 4 4 6 6 8 8  8 11 11 11 14
0 0 0 0 0 5 5 5 5 9 10 10 12 13 13

#### § Using the path matrix: distance of a node from every other node.
Note that the maximum distance between two nodes is achieved by climbing all the way up to the top node, and then climbing down:

dmax ← depth(a) + depth(b)

If we know the lowest common ancestor of two nodes, then the distance of one node to another is:

dcorrect ← dist(a, lca(a, b)) + dist(b, lca(a, b))

So, since depth(x) = dist(x, lca(a, b)) + depth(lca(a, b)), we can compute the distance as:

dcorrect ← dist(a, lca(a, b)) + dist(lca(a, b), b)
         = [dist(a, lca(a, b)) + depth(lca(a, b))]
         + [dist(b, lca(a, b)) + depth(lca(a, b))]
         - 2 * depth(lca(a, b))
         = depth(a) + depth(b) - 2 * depth(lca(a, b))

[TODO: picture] [TODO: finish writing this]

#### § Parent vector representation

A parent vector is a vector of length n where Parent[i] denotes an index into Parent. Hence, the condition below will return 1 iff V is a parent vector. For our given example, here is the parent vector:

d ← (0 1 2 1 2 3 2 1 2 3 3 2 3 3 2) │ depths
    (∘ a p b q v r c s w x t y z u) │ values
p ← (∘ ∘ a ∘ b q b ∘ c s s c t t c) │ parents
    (0 1 2 3 4 5 6 7 8 9 10 11 12 13 14) │ indexes
P ← (0 0 1 0 3 4 3 0 7 8 8 7 11 11 7) │ parent indices

              ∘:0                               0
┌──────────┬──┴─────────────────┐
a:1        b:3                 c:7              1
│      ┌───┴───┐     ┌──────────┼───────┐
p:2    q:4     r:6   s:8        t:11    u:14    2
│             │          │
│          ┌──┴──┐     ┌─┴───┐
v:5        w:9   x:10  y:12  z:13        3

The condition a parent vector must satisfy is:

∧/V ∊(⍳≢V) ⍝ [All elements of V belong in the list [1..len(V)] ]

• V ∊ (⍳≢V) will be a list of whether each element in V belongs (∊) to the list (⍳≢V) = [1..len(V)].
• Recall that / is for reduction, and ∧/ is a boolean AND reduction. Hence, we compute whether each element of the vector V is in the range [1..len(V)].
• We add the constraint that root nodes that don't have a parent simply point to themselves. This allows us to free ourselves from requiring some kind of nullptr check.
The root node (the parent of all elements) can be found by iterating indexing to a fixpoint, using the power operator (⍣≡) and commute (⍨):

I←{(⊂⍵)⌷⍺} ⍝ index into the left-hand-side param using the right-hand-side param
I⍣≡⍨p ⍝ compute the fixpoint of the I operator and apply it to p

#### § Converting from depth vector to parent vector, Take 1

As usual, let's consider our example:

d ← (0 1 2 1 2 3 2 1 2 3 3 2 3 3 2) │ depths
    (∘ a p b q v r c s w x t y z u) │ values
p ← (∘ ∘ a ∘ b q b ∘ c s s c t t c) │ parents
    (0 1 2 3 4 5 6 7 8 9 10 11 12 13 14) │ indexes
P ← (0 0 1 0 3 4 3 0 7 8 8 7 11 11 7) │ parent indices

              ∘:0                               0
┌──────────┬──┴─────────────────┐
a:1        b:3                 c:7              1
│      ┌───┴───┐     ┌──────────┼───────┐
p:2    q:4     r:6   s:8        t:11    u:14    2
│             │          │
│          ┌──┴──┐     ┌─┴───┐
v:5        w:9   x:10  y:12  z:13        3

Note that the depth vector already encodes parent-child information.
• The parent of node i is a node j such that d[j] = d[i] - 1, and j is the closest index to the left of i for which this happens.

For example, to compute the parent of t:11, notice that it's at depth 2. So we should find all the nodes in d[0..11] whose depth equals 1 (the depth of t's parent), and then pick the rightmost one. This translates to the expression:

$ d ← (0 1 2 1 2 3 2 1 2 3 3 2 3 3 2)
$ t ← 11 ⍝ target node
$ ixs ← ⍳t ⍝ array indexes upto this node
0 1 2 3 4 5 6 7 8 9 10
$ d[ixs] ⍝ depths of nodes to the left of the given node t
0 1 2 1 2 3 2 1 2 3 3
$ d[ixs] = d[t]-1 ⍝ boolean vector of nodes whose depth is that of t's parent
0 1 0 1 0 0 0 1 0 0 0
$ eqds ← ⍸ (d[ixs] = d[t]-1) ⍝ array indexes of nodes whose depth is that of t's parent
1 3 7
$ ⌽ eqds ⍝ reverse of array indexes to extract
7 3 1
$ ⊃ ⌽ eqds ⍝ first of the reverse of the array indexes to extract
7
$ (⌽⍸(d[⍳t] = d[t]-1))[0] ⍝ APL style one-liner of the above

While this is intuitive, it does not scale: it does not permit us to find the parents of all the nodes at once --- ie, it is not parallelisable over choices of t.
#### § Converting from depth vector to parent vector, Take 2 (Or scan idiom)

Imagine we have a list of 0s and 1s, and we want to find the index of the rightmost 1 value. For example, given:

⍝    0 1 2 3 4 5 6 7 8 9 10 11 12
a ← (0 0 1 0 0 0 1 0 1 0 0 0 0)

we want the answer to be f a = 8. We saw an implementation in terms of f←{(⌽⍸⍵)[0]} in Take 1 (recall that ⍵ is the symbol for the right-hand-side argument of a function). We're going to perform the same operation slightly differently. Let's consider the series of transformations:

$ ⍝    0 1 2 3 4 5 6 7 8 9 10 11 12
$ a ← (0 0 1 0 0 0 1 0 1 0 0 0 0) ⍝ original array
$ ⌽a ⍝ reverse of a
0 0 0 0 1 0 1 0 0 0 1 0 0
$ ∨\ ⌽a ⍝ prefix scan (\) using the OR (∨) operator. Turn all
⍝ entries after the first 1 into a 1
0 0 0 0 1 1 1 1 1 1 1 1 1

$ +/ (∨\ ⌽a) ⍝ sum over the previous list, counting the number of 1s
9
$ ¯1 + (+/ (∨\ ⌽a)) ⍝ subtract 1 from the previous number
8

Why the hell does this work? Well, here's the proof:
• On running ⌽a, we reverse a. The last 1 of a, at index i, becomes the first 1 of ⌽a, at index i′ ≡ n − i (where n is the last index of a).
• On running ∨\ ⌽a, entries at and after the first 1 become 1. That is, all indexes j ≥ i′ have 1 in them.
• On running +/ (∨\ ⌽a), we sum up all the 1s. This gives us n − i′ + 1 ones. That is, n − i′ + 1 = n − (n − i) + 1 = i + 1.
• We subtract 1 to correctly recover i from i + 1.
This technique works on every row of a matrix at once. This is paramount, since we can now build one row per node out of the depth vector we were previously interested in, apply the idiom to every row simultaneously, and thereby compute all the parent indices!

#### § Converting from depth vector to parent vector, Take 3 (full matrix)

We want to extend the previous method we hit upon to compute the parents of all nodes in parallel. To perform this, we need to run the moral equivalent of the following:
$ ⎕IO ← 0 ⍝ 0 indexing
$ d ← (0 1 2 1 2 3 2 1 2 3 3 2 3 3 2) ⍝ depth vector
$ t ← 11 ⍝ node we are interested in
$ a ← d[⍳t]=d[t]-1 ⍝ boolean vector of nodes whose depth is that of t's parent
0 1 0 1 0 0 0 1 0 0 0
$ ¯1 + (+/ (∨\ ⌽a)) ⍝ index of the last 1 of the boolean vector
7

for every single choice of t. To perform this, we can build a 2D matrix of d[⍳t]=d[t]-1 where t ranges over [0..len(d)-1] (ie, it ranges over all the nodes in the graph). We begin by using:

$ ⎕IO ← 0 ⋄ d ← (0 1 2 1 2 3 2 1 2 3 3 2 3 3 2) ⍝ depths
$ ]display ltdepth ← d ∘.> d ⍝ find d[i] > d[j] for all i, j.
┌→────────────────────────────┐
↓0 0 0 0 0 0 0 0 0 0 0 0 0 0 0│
│1 0 0 0 0 0 0 0 0 0 0 0 0 0 0│
│1 1 0 1 0 0 0 1 0 0 0 0 0 0 0│
│1 0 0 0 0 0 0 0 0 0 0 0 0 0 0│
│1 1 0 1 0 0 0 1 0 0 0 0 0 0 0│
│1 1 1 1 1 0 1 1 1 0 0 1 0 0 1│
│1 1 0 1 0 0 0 1 0 0 0 0 0 0 0│
│1 0 0 0 0 0 0 0 0 0 0 0 0 0 0│
│1 1 0 1 0 0 0 1 0 0 0 0 0 0 0│
│1 1 1 1 1 0 1 1 1 0 0 1 0 0 1│
│1 1 1 1 1 0 1 1 1 0 0 1 0 0 1│
│1 1 0 1 0 0 0 1 0 0 0 0 0 0 0│
│1 1 1 1 1 0 1 1 1 0 0 1 0 0 1│
│1 1 1 1 1 0 1 1 1 0 0 1 0 0 1│
│1 1 0 1 0 0 0 1 0 0 0 0 0 0 0│
└~────────────────────────────┘

• Note that ltdepth[i][j] = 1 iff d[j] < d[i]. So, for a given row (i fixed), the 1s mark nodes that are at lower depth (ie, potential parents).
• If we mask this to only keep those indices where j < i, then the last 1 in each row will be such that d[last 1] = d[i] - 1. Why? Because in the preorder traversal, the node closest to us on the left with a depth less than ours must be our parent.

For the mask, we use the same outer-product trick on plain indices:

$ ]display left ← (⍳3) ∘.> (⍳3) ⍝ find i > j for all i, j.
┌→────┐
↓0 0 0│
│1 0 0│
│1 1 0│
└~────┘

Combining the three techniques, we arrive at:

$ ⎕IO ← 0 ⋄ d ← (0 1 2 1 2 3 2 1 2 3 3 2 3 3 2) ⍝ depths
$ ltdepth ← d ∘.> d ⍝ find d[i] > d[j] for all i, j.
$ preds ← (⍳≢d) ∘.> (⍳≢d) ⍝ predecessors: find i > j for all i, j.
$ pred_higher ← ltdepth ∧ preds ⍝ predecessors that are higher in the tree
$ parents_take_3 ← ¯1 + +/∨\⌽pred_higher ⍝ previous idiom for finding the last 1.
¯1 0 1 0 3 4 3 0 7 8 8 7 11 11 7

For comparison, the actual value is:
    (0   1  2  3  4  5  6  7  8  9 10 11 12 13 14)  | indexes
d ← (0   1  2  1  2  3  2  1  2  3  3  2  3  3  2)  │ depths
P ← (0   0  1  0  3  4  3  0  7  8  8  7 11 11  7)  │ parent indices
(¯1  0  1  0  3  4  3  0  7  8  8  7 11 11  7) | parents, take 3
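For cross-checking, here is the take-3 computation as a plain Python sketch (an O(n²) translation of the outer-product idea; the function name is mine):

```python
def parents_take_3(d):
    # parent of i = last j < i with d[j] < d[i], ie, the last 1 in row i
    # of the masked outer product; -1 for the root, as in the APL version
    out = []
    for i in range(len(d)):
        p = -1
        for j in range(i):
            if d[j] < d[i]:
                p = j
        out.append(p)
    return out

d = [0, 1, 2, 1, 2, 3, 2, 1, 2, 3, 3, 2, 3, 3, 2]
print(parents_take_3(d))
# [-1, 0, 1, 0, 3, 4, 3, 0, 7, 8, 8, 7, 11, 11, 7]
```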

∘:0                               0
┌──────────┬──┴─────────────────┐
a:1        b:3                 c:7              1
│      ┌───┴───┐     ┌──────────┼───────┐
p:2    q:4     r:6   s:8        t:11    u:14    2
│             │          │
│          ┌──┴──┐     ┌─┴───┐
v:5        w:9   x:10  y:12  z:13        3

We have an off-by-one error for the 0 node! That's easily fixed: we simply perform a maximum with 0 to move ¯1 → 0:
$ parents_take_3 ← 0⌈ ¯1 + +/∨\⌽pred_higher
0 0 1 0 3 4 3 0 7 8 8 7 11 11 7

So, that's our function:

$ parents_take_3 ← 0⌈ ¯1 + +/∨\⌽ ((d∘.>d) ∧ (⍳≢d)∘.>(⍳≢d))
0 0 1 0 3 4 3 0 7 8 8 7 11 11 7

Note that the time complexity for this is dominated by having to calculate the outer products, which, even given infinite parallelism, take O(n) time. We will slowly chip away at this, to do far better.

#### § Converting from depth vector to parent vector, Take 4 (log critical depth)

We will use the Key (⌸) operator, which allows us to create key-value pairs.

$ d ← 0 1 2 1 2 3 2 1 2 3 3 2 3 3 2
$ ]disp (⍳≢d) ,¨ d ⍝ zip d with indexes
┌→──┬───┬───┬───┬───┬───┬───┬───┬───┬───┬────┬────┬────┬────┬────┐
│0 0│1 1│2 2│3 1│4 2│5 3│6 2│7 1│8 2│9 3│10 3│11 2│12 3│13 3│14 2│
└~─→┴~─→┴~─→┴~─→┴~─→┴~─→┴~─→┴~─→┴~─→┴~─→┴~──→┴~──→┴~──→┴~──→┴~──→┘

$ d ← 0 1 2 1 2 3 2 1 2 3 3 2 3 3 2
$ ]display b ← {⍺ ⍵}⌸d ⍝ each row i has tuple (i, js): d[js] = i
┌→──────────────────┐
↓   ┌→┐             │
│ 0 │0│             │
│   └~┘             │
│   ┌→────┐         │
│ 1 │1 3 7│         │
│   └~────┘         │
│   ┌→────────────┐ │
│ 2 │2 4 6 8 11 14│ │
│   └~────────────┘ │
│   ┌→───────────┐  │
│ 3 │5 9 10 12 13│  │
│   └~───────────┘  │
└∊──────────────────┘

In fact, it allows us to apply an arbitrary function to combine keys and values. We will use a function that simply returns all the values for each key.

$ d ← 0 1 2 1 2 3 2 1 2 3 3 2 3 3 2
$ ]display b ← {⍵}⌸d ⍝ each row i contains values j such that d[j] = i.
┌→─────────────┐
↓0 0 0  0  0  0│
│1 3 7  0  0  0│
│2 4 6  8 11 14│
│5 9 10 12 13 0│
└~─────────────┘

Our first try doesn't quite work: it winds up trying to create a numeric matrix, which means that we can't have rows of different sizes. So, the information that only index 0 is such that d[0] = 0 is lost. What we can do is to wrap the values to arrive at:

$ d ← 0 1 2 1 2 3 2 1 2 3 3 2 3 3 2
$ ]display b ← {⊂⍵}⌸d ⍝ d[b[i]] = i
┌→───────────────────────────────────────────┐
│ ┌→┐ ┌→────┐ ┌→────────────┐ ┌→───────────┐ │
│ │0│ │1 3 7│ │2 4 6 8 11 14│ │5 9 10 12 13│ │
│ └~┘ └~────┘ └~────────────┘ └~───────────┘ │
└∊───────────────────────────────────────────┘

Consider the groups b[2] = (2 4 6 8 11 14) and b[3] = (5 9 10 12 13). All of 3's parents are present in 2: every element in b[3] fits at some location in b[2]. Here is what the fit would look like:

b[2]  2  4  _  6  8  _  _  11  __  __  14   (nodes of depth 2)
b[3]        5        9 10      12  13       (leaf nodes)
            4        8  8      11  11       (parents: predecessor of b[3] in b[2])

∘:0                               0
┌──────────┬──┴─────────────────┐
a:1        b:3                 c:7              1
│      ┌───┴───┐     ┌──────────┼───────┐
p:2    q:4     r:6   s:8        t:11    u:14    2
│             │          │
│          ┌──┴──┐     ┌─┴───┐
v:5        w:9   x:10  y:12  z:13        3

We use the Interval Index (⍸) operator to solve the problem of finding the parent / where we should squeeze a node from b[3] into b[2]. (This is formally known as the predecessor problem.)

⍝ left[a[i]] is the closest number ≤ right[i]
⍝ ie, left[a[i]] is the predecessor of right[i] in left
$ a ← (1 10 100 1000) ⍸ (1 2000 300 50 2)
0 3 2 1 0

Now, we can use the technology of predecessor to find parents of depth 3 nodes among the depth 2 nodes:
$ depth2 ← 2 4 6 8 11 14
$ depth3 ← 5 9 10 12 13 ⍝ parents (from chart): 4 8 8 11 11
$ depth3parentixs ← depth2 ⍸ depth3
$ depth3parents  ← depth2[depth3parentixs]
4 8 8 11 11
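The same predecessor lookup can be sketched in Python with bisect (my names; bisect_right minus one plays the role of ⍸ here):

```python
from bisect import bisect_right

def predecessor_ixs(left, right):
    # for each r in right: the index into `left` of the closest element <= r
    return [bisect_right(left, r) - 1 for r in right]

depth2 = [2, 4, 6, 8, 11, 14]
depth3 = [5, 9, 10, 12, 13]
ixs = predecessor_ixs(depth2, depth3)
print([depth2[i] for i in ixs])  # [4, 8, 8, 11, 11]
```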

∘:0                               0
┌──────────┬──┴─────────────────┐
a:1        b:3                 c:7              1
│      ┌───┴───┐     ┌──────────┼───────┐
p:2    q:4     r:6   s:8        t:11    u:14    2
│             │          │
│          ┌──┴──┐     ┌─┴───┐
v:5        w:9   x:10  y:12  z:13        3

We need to know one more APL-ism: the 2-scan. When we write a usual reduction, we have:

$ ⍳5
1 2 3 4 5
$ +/⍳5 ⍝ reduce
15

$ 2+/⍳5 ⍝ apply + to _pairs_ (2 = pairs)
3 5 7 9 ⍝ (1+2) (2+3) (3+4) (4+5)
$ 3+/⍳5 ⍝ apply + to 3-tuples
6 9 12 ⍝ (1+2+3) (2+3+4) (3+4+5)

We begin by assuming the parent of i is i by using p←⍳≢d.
$ d ← (0 1 2 1 2 3 2 1 2 3 3 2 3 3 2)
$ d2nodes ← {⊂⍵}⌸d
┌→┬─────┬─────────────┬────────────┐
│0│1 3 7│2 4 6 8 11 14│5 9 10 12 13│
└→┴~───→┴~───────────→┴~──────────→┘
$ p←⍳≢d
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14

Now comes the biggie:

$ findparent ← {parentixs ← ⍺⍸⍵ ⋄ p[⍵]←⍺[parentixs]}

• ⍺ is the list of parent nodes.
• ⍵ is the list of current child nodes.
• We first find the indexes of our parent nodes by using the pix ← parent ⍸ child idiom.
• Then, we find the actual parents by indexing into the parent list: ⍺[parentixs].
• We write these into the parents of the child using: p[children] ← parent[parent ⍸ child]
This finally culminates in:
$ d←0 1 2 1 2 3 2 1 2 3 3 2 3 3 2
$ p←⍳≢d ⋄ d2nodes←{⊂⍵}⌸d ⋄ findp←{pix ← ⍺⍸⍵ ⋄ p[⍵]←⍺[pix]} ⋄ 2findp/d2nodes ⋄ p
0 0 1 0 3 4 3 0 7 8 8 7 11 11 7
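Here is the whole take-4 algorithm as a Python sketch (names mine, not the thesis's): group the nodes by depth, then find each node's predecessor among the nodes one level up. It assumes d is a valid preorder depth vector, so each depth group is already sorted.

```python
from bisect import bisect_right
from collections import defaultdict

def parents_from_depths(d):
    # group node indices by depth (preorder keeps each group sorted)
    by_depth = defaultdict(list)
    for i, di in enumerate(d):
        by_depth[di].append(i)
    p = list(range(len(d)))  # start by assuming the parent of i is i
    for depth in range(1, max(d) + 1):
        parents, children = by_depth[depth - 1], by_depth[depth]
        for c in children:
            # predecessor of c among the nodes one level shallower
            p[c] = parents[bisect_right(parents, c) - 1]
    return p

print(parents_from_depths([0, 1, 2, 1, 2, 3, 2, 1, 2, 3, 3, 2, 3, 3, 2]))
# [0, 0, 1, 0, 3, 4, 3, 0, 7, 8, 8, 7, 11, 11, 7]
```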

(0   1  2  3  4  5  6  7  8  9 10 11 12 13 14)  | indexes
d ← (0   1  2  1  2  3  2  1  2  3  3  2  3  3  2)  │ depths
P ← (0   0  1  0  3  4  3  0  7  8  8  7 11 11  7)  │ parent indices
∘:0                               0
┌──────────┬──┴─────────────────┐
a:1        b:3                 c:7              1
│      ┌───┴───┐     ┌──────────┼───────┐
p:2    q:4     r:6   s:8        t:11    u:14    2
│             │          │
│          ┌──┴──┐     ┌─┴───┐
v:5        w:9   x:10  y:12  z:13        3

Which can be further golfed to:
$ p⊣2{p[⍵]←⍺[⍺⍸⍵]}⌿⊢∘⊂⌸d⊣p←⍳≢d
0 0 1 0 3 4 3 0 7 8 8 7 11 11 7

The total time complexity of this method assuming infinite parallelism is as follows:

$ p←⍳≢d ⋄ d2nodes←{⊂⍵}⌸d ⋄ findp←{pix ← ⍺⍸⍵ ⋄ p[⍵]←⍺[pix]} ⋄ 2findp/d2nodes ⋄ p

• (p←⍳≢d) can be filled in O(1) time.
• (d2nodes←{⊂⍵}⌸d) is searching for keys in a small integer domain, so this is O(#nodes) using radix sort as far as I know. However, the thesis mentions that this can be done in O(log(#nodes)). I'm not sure how; I need to learn this.
• For each call of findp, the call (pix ← ⍺⍸⍵) can be implemented using binary search, leading to a logarithmic complexity in the size of ⍺ (since we are looking up predecessors of ⍵ in ⍺).
• The fold 2findp/d2nodes can be performed entirely in parallel, since all the writes into the p vector are independent: we only write the parent of the current node we are looking at.

#### § 3.4: Computing nearest Parent by predicate

I'm going to simplify the original presentation by quite a bit.
      a b c d e f g h i  | names
      0 1 2 3 4 5 6 7 8  | indexes
P ← (0 0 1 2 0 4 5 6 7) | parents
X ← (0 1 0 0 1 1 0 0 0) | marked nodes

a:0
┌────┴───┐
b:1(X)   e:4(X)
|        |
c:2      f:5(X)
|        |
d:3      g:6
│
h:7
|
i:8

We want to find nodes marked as X that are the closest parents to a given node. The X vector is a boolean vector that has a 1 at the index of each X node: (b, e, f). So, the indexes (1, 4, 5) are 1 in the X vector. The output we want is the vector:
      0 1 2 3 4 5 6 7 8  | indexes
      a b c d e f g h i  | names
PX ← (0 0 1 1 0 4 5 5 5) | closest X parent index
      a a b b a e f f f  | closest X parent name

a:0
┌────┴───┐
b:1(X)   e:4(X)
|        |
c:2      f:5(X)
|        |
d:3      g:6
│
h:7
|
i:8

The incantation is:
$ I←{(⊂⍵)⌷⍺} ⍝ index LHS by RHS | (100 101 102 103)[(3 1 2)] := 103 101 102
$ PX ← P I@{X[⍵]≠1} ⍣ ≡ P
0 0 1 1 0 4 5 5 5
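A Python sketch of what the fixpoint is doing (my names): repeatedly replace each entry by its parent while it points at an unmarked node, until nothing changes.

```python
def closest_marked_parent(P, X):
    # PX[i] starts at parent(i); while PX[i] is unmarked, jump to its parent
    PX = list(P)
    while True:
        nxt = [v if X[v] == 1 else P[v] for v in PX]
        if nxt == PX:
            return PX
        PX = nxt

P = [0, 0, 1, 2, 0, 4, 5, 6, 7]
X = [0, 1, 0, 0, 1, 1, 0, 0, 0]
print(closest_marked_parent(P, X))  # [0, 0, 1, 1, 0, 4, 5, 5, 5]
```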

TODO. At any rate, since this requires only reads and no writes, and no synchronization, it is fairly straightforward to implement on the GPU.

#### § 3.5: Lifting subtrees to the root

Once we have marked our X nodes, we now wish to lift entire subtrees of X up to the root.
• This pass displays how to lift subtrees and add new nodes to replace the subtree's original nodes.
• Luckily, there are no sibling relationships that need to be maintained since we are uprooting an entire subtree.
• There are no ordering constraints on how the subtrees should be arranged at the top.
• Hence, we can simply add new nodes to the end of the tree (in terms of the preorder traversal). Adding to the middle of the tree will be discussed later.
There is some good advice in the thesis:
When using APL primitives this way, it may be good to map their names and definitions to the domain of trees. For example, the primitive ⍸Predicate is read as "the nodes where Predicate holds" and not as "the indexes where Predicate is 1".
For example, given the tree:
      0 1 2 3 4 5  | indexes
a b c d e f  | names
P  ← (0 0 1 0 3 4) | parents
X  ← (0 1 0 1 1 0) | X nodes
PX ← (0 0 1 0 3 4) | closest X parent index

a:0
┌────┴───┐
b:1(X)   d:3(X)
|        |
c:2      e:4(X)
         |
         f:5

we want the transformed tree to be:
    a:0
┌────┴───┐
bp:1(X)   ep:4(X)
---------
b:1(X)
|
c:2
---------
e:4
|
fp:5
---------
f:5(X)
|
g:6

We first look for nodes that need to be lifted. These are:
• Non-root nodes (ie, nodes whose parents are not themselves: p≠(⍳≢p))
• Which have the property X.
nodes←⍸(X ∧ p≠(⍳≢p))  ⍝ ⍸:pick indexes.
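In Python terms (a sketch with the example tree above; names mine), the same selection reads:

```python
P = [0, 0, 1, 0, 3, 4]  # parents
X = [0, 1, 0, 1, 1, 0]  # X nodes
# non-root nodes (P[i] != i) which have the property X
nodes = [i for i in range(len(P)) if X[i] == 1 and P[i] != i]
print(nodes)  # [1, 3, 4]
```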


### § Things I wish I knew when I was learning APL

• Operators in APL terminology (such as ¨) are higher order functions. Thus, an operator allows one to modify known functions.
• Use ]disp and ]display to understand the structure of APL arrays.
• Set ]box on -style=max to always enable drawing arrays with ]display. This is supremely useful as a newbie to understand array structure.
• Set ]box on -trains=parens to render trains as trees. Super helpful when attempting to grok train code.
• Set ]boxing on to enable boxing for trains, arguments, everything.

### § Every ideal that is maximal wrt. being disjoint from a multiplicative subset is prime

I ran across this when reading another question on math.se, so I posted this proof for verification just to be sure I wasn't missing something. We wish to characterise prime ideals as precisely those whose complement is a maximal multiplicative subset S ⊆ R. That is:
• An ideal P is prime iff P = R∖S, where S is a multiplicative subset that cannot be made larger (ie, is maximal wrt the ⊆ ordering).
I'll be using the definition of prime as:
• An ideal P is prime if for all x, y ∈ R, xy ∈ P ⟹ x ∈ P ∨ y ∈ P.

#### § Prime ideal implies complement is maximal multiplicative subset:

Let S ≡ R∖P be the complement of the prime ideal P ⊆ R in question.
• Since P ≠ R, 1 ∉ P. (If 1 ∈ P, then every element x·1 ∈ P, since P is an ideal and must be closed under multiplication with the entire ring.) Hence, 1 ∈ S.
• For any x, y ∈ S, we need xy ∈ S for S to be multiplicative.
• For contradiction, let us say that ∃ x, y ∈ S such that xy ∉ S. Translating to P, this means x, y ∉ P but xy ∈ P. This contradicts the definition of P being prime.
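As a finite sanity check (my own, not part of the proof): for the prime ideal (5) ⊆ ℤ, the complement is closed under multiplication on any window of integers we test.

```python
# S = complement of the prime ideal (5) in a window of ℤ
S = [x for x in range(-50, 51) if x % 5 != 0]
# multiplicativity: products of elements of S stay out of (5)
assert all((a * b) % 5 != 0 for a in S for b in S)
print("S is closed under multiplication on this window")
```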

#### § Ideal whose complement is maximal multiplicative subset implies ideal is prime.

• Let I be an ideal of the ring R such that its complement S ≡ R∖I is a maximal multiplicative subset.
• Let i1·i2 ∈ I. For I to be prime, we need to show that i1 ∈ I or i2 ∈ I.
• For contradiction, let i1, i2 ∉ I. Thus, i1, i2 ∈ S. Since S is multiplicative, i1·i2 ∈ S. That is, i1·i2 ∉ I (since I is disjoint from S).
• But this violates our assumption that i1·i2 ∈ I. Hence, contradiction.

### § Getting started with APL

• Install Dyalog APL.
• Setup RIDE, the IDE for Dyalog APL. This IDE comes with auto-complete, good key bindings, and a top bar chock-full of information on all the APL symbols. It's really well designed and a pleasure to use.
• Follow the Dyalog tutorial, solving it chapter by chapter.
• Bookmark APLCart, a collection of APL idioms, and refer to it when in need.

### § SpaceChem was the best compiler I ever used

It's kind of sad that this is the case, but on thinking about it, I realised that the game SpaceChem was essentially a compiler, and it was such a pleasure to learn how to use and to debug: its visual nature made it easy to see what was going on.

### § Mnemonic for Kruskal and Prim

I often forget which is which, so I came up with this:
• Prim is very prim and proper, and therefore doesn't spread herself out. She picks out the minimum spanning tree one vertex at a time.

### § Cartesian Trees

Cartesian trees construct a tree T = C(A) given an array A, such that range minimum query (RMQ) on the array A is equivalent to the lowest common ancestor (LCA) of the nodes of the tree T. Note that the tree looks like a min-heap. To see the connection to LCA, if we want to find the range minimum in the range containing the elements [12, 10, 20, 15, 18] of the array, the minimum is 10, which is indeed the lowest common ancestor of the nodes of 12 and 18 in the tree.
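A standard stack-based O(n) construction of the (min-heap) Cartesian tree, as a Python sketch (names mine):

```python
def cartesian_parent(A):
    # par[i] = parent index of node i in the min-heap cartesian tree of A
    # (the root points to itself); built left-to-right with a stack that
    # holds the rightmost spine, values increasing from bottom to top
    n = len(A)
    par = list(range(n))
    stack = []
    for i in range(n):
        last = None
        while stack and A[stack[-1]] > A[i]:
            last = stack.pop()
        if last is not None:
            par[last] = i       # last popped node becomes i's left child
        if stack:
            par[i] = stack[-1]  # i becomes the right child of the top
        stack.append(i)
    return par

print(cartesian_parent([12, 10, 20, 15, 18]))  # [1, 1, 3, 1, 3]
```

Here 10 (index 1) is the root, matching the example: the range minimum of [12, 10, 20, 15, 18] is the LCA of the nodes for 12 and 18.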

#### § Converting LCA to RMQ

We can go the other way, and convert an LCA problem into an RMQ problem. We perform an inorder traversal of the nodes, scribbling down the depth of each node (link to lecture at 15:30). We ask for the argmin version of RMQ, which gives us the index of the node with the lowest depth. This gives us the index of where the node lives.

#### § Universe reduction in RMQ

We can have an arbitrary ordered universe, on which we want to perform RMQ. We can convert this to LCA by using a cartesian tree, and then convert to a "clean" RMQ (by using the LCA -> RMQ using depth conversion). This now will give us way faster operations (since we now have integers).

#### § ±1 RMQ

We want the differences between adjacent entries to be only ±1; previously, we had a much wider gap. Here, we perform an Euler tour (walk the tree in DFS order), and scribble down every vertex we visit. To find the LCA, we perform the RMQ on the locations of the first occurrence of each node. (I think we don't actually need the first occurrence; any occurrence will do.)
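An Euler-tour sketch in Python (my names): walking the tree in DFS order and scribbling down (vertex, depth) at every visit gives a depth array whose adjacent entries differ by exactly ±1.

```python
def euler_tour(children, root):
    # children: dict node -> list of child nodes
    order, depth = [], []
    def dfs(u, d):
        order.append(u); depth.append(d)
        for v in children.get(u, []):
            dfs(v, d + 1)
            order.append(u); depth.append(d)  # scribble u again on the way back
    dfs(root, 0)
    return order, depth

order, depth = euler_tour({0: [1, 2], 2: [3]}, 0)
print(order)  # [0, 1, 0, 2, 3, 2, 0]
print(depth)  # [0, 1, 0, 1, 2, 1, 0]
```

The LCA of u and v is then order[argmin of depth between occurrences of u and v].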

### § DFS numbers as a monotone map

Really, we want a partial order that is defined with the tree as its Hasse diagram. However, performing operations on this is hard. Hence, the DFS numbering is a good monotone map from this partial order to the naturals, which creates a total order. I want to think about this more deeply; I feel that this might be a good way to think about the low numbers that show up in Tarjan's algorithm for strongly connected components. This also begs the question: can we use other partial orders that discard some information, but don't lose as much as going all the way to a total order (the naturals) does?

### § Self attention? not really

The code is taken from The annotated transformer which explains the "attention is all you need paper". On skimming the code, one sees the delightful line of code:
class EncoderLayer(nn.Module):
    "Encoder is made up of self-attn and feed forward (defined below)"
    def __init__(self, size, self_attn, feed_forward, dropout):
        super(EncoderLayer, self).__init__()
        self.self_attn = self_attn
        self.feed_forward = feed_forward
        self.sublayer = clones(SublayerConnection(size, dropout), 2)
        self.size = size

    def forward(self, x, mask):
        "Follow Figure 1 (left) for connections."
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, mask))
        return self.sublayer[1](x, self.feed_forward)

where the line:
x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, mask))

seems to imply that we are, indeed, performing a self attention with the same value x as the query, key, and value. However, reading the code of the self-attention (or the paper) leads one to realise:
class MultiHeadedAttention(nn.Module):
    def __init__(self, h, d_model, dropout=0.1):
        "Take in model size and number of heads."
        super(MultiHeadedAttention, self).__init__()
        assert d_model % h == 0
        # We assume d_v always equals d_k
        self.d_k = d_model // h
        self.h = h
        self.linears = clones(nn.Linear(d_model, d_model), 4)
        self.attn = None
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, query, key, value, mask=None):
        "Implements Figure 2"
        nbatches = query.size(0)

        # 1) Do all the linear projections in batch from d_model => h x d_k
        query, key, value = \
            [l(x).view(nbatches, -1, self.h, self.d_k).transpose(1, 2)
             for l, x in zip(self.linears, (query, key, value))]

        # 2) Apply attention on all the projected vectors in batch.
        x, self.attn = attention(query, key, value, mask=mask,
                                 dropout=self.dropout)

        # 3) "Concat" using a view and apply a final linear.
        x = x.transpose(1, 2).contiguous() \
             .view(nbatches, -1, self.h * self.d_k)
        return self.linears[-1](x)

where we notice:
# 1) Do all the linear projections in batch from d_model => h x d_k
query, key, value = \
    [l(x).view(nbatches, -1, self.h, self.d_k).transpose(1, 2)
     for l, x in zip(self.linears, (query, key, value))]

# 2) Apply attention on all the projected vectors in batch.
x, self.attn = attention(query, key, value, mask=mask,
                         dropout=self.dropout)

where we see that query, key, value are being linearly transformed before being used. Hence, an input of (x, x, x) is transformed to (q′, k′, v′) = (Qx, Kx, Vx) where Q, K, V are arbitrary matrices. Next, when we pass these into attention, the output we get is:
 softmax(q′ k′ᵀ) v′ = softmax((Q x) (K x)ᵀ) (V x) = softmax(Q x xᵀ Kᵀ) V x
the code below is the same thing, spelled out:
def attention(query, key, value, mask=None, dropout=None):
    "Compute 'Scaled Dot Product Attention'"
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) \
             / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    p_attn = F.softmax(scores, dim = -1)
    if dropout is not None:
        p_attn = dropout(p_attn)
    return torch.matmul(p_attn, value), p_attn

So it's not really self attention: it's more like modulated attention to self :)

### § Coarse structures

A coarse structure on the set X is a collection of relations on X: E ⊆ 2^(X×X) (called controlled sets / entourages) such that:
• (Δ ≡ { (x, x) : x ∈ X }) ∈ E.
• Closed under subsets: ∀ e ∈ E, ∀ f ⊆ e, f ∈ E.
• Closed under transpose: if e ∈ E then (eᵀ ≡ { (y, x) : (x, y) ∈ e }) ∈ E.
• Closed under finite unions.
• Closed under composition: ∀ e, f ∈ E, e∘f ∈ E, where ∘ is composition of relations.
The controlled sets are the "small" sets. The bounded coarse structure on a metric space (X, d) is the set of all relations for which there exists a uniform bound such that all related elements are within that bounded distance:
 (e ⊆ X × X) ∈ E ⟺ ∃ δ ∈ ℝ, ∀ (x, y) ∈ e, d(x, y) < δ
We can check that the functions:
• f: ℤ → ℝ, f(x) ≡ x and
• g: ℝ → ℤ, g(x) ≡ ⌊x⌋
are coarse inverses to each other. I am interested in this because if topology is related to semidecidability, then coarse structures (which are their dual) are related to..?
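A finite sanity check of the coarse-inverse claim (mine, on sampled points): g∘f is exactly the identity on ℤ, and f∘g moves every real by less than 1, ie, it stays within a bounded distance of the identity.

```python
import math

f = lambda n: float(n)       # inclusion ℤ → ℝ
g = lambda x: math.floor(x)  # floor ℝ → ℤ

# g ∘ f = id on ℤ
assert all(g(f(n)) == n for n in range(-100, 100))
# f ∘ g is within distance 1 of id on ℝ
assert all(abs(f(g(x)) - x) < 1 for x in [0.0, 0.5, -2.3, 3.99, -0.01])
print("coarse inverses on the sampled points")
```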

### § Matroids for greedy algorithms

#### § Definitions of matroids

A matroid M is a set X equipped with an independence set I ⊆ 2^X such that:
• The empty set is independent: ∅ ∈ I.
• The independence set is downward-closed/closed under subsets: ∀ i ∈ I, ∀ i′ ⊆ i, i′ ∈ I.
• For any independent sets A, B ∈ I, if |A| is larger than |B|, then we must be able to add some element a ∈ A∖B into B such that B′ ≡ B ∪ {a} is both independent and larger than B: B′ ∈ I ∧ |B′| > |B|. (The exchange property)

#### § Example 1: Linearly independent sets

Let V be a vector space. The independent sets I are of the form:
 I ≡ { S ⊆ V  :  vectors in S are linearly independent }
This is an independence system because the empty set is linearly independent, and subsets of a linearly independent collection of vectors will be linearly independent. The exchange property is satisfied because of linear algebraic reasons.

#### § Example 2: The graphic/cyclic Matroid: Matroid of Forests

Let G = (V, E) be a graph. Then the collection of edge sets of the form:
 I ≡ { F ⊆ E : F contains no cycles }
is an independence system, because the empty forest is a forest, and a subset of the edges of a forest continues to be a forest. To check the exchange property, TODO
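The greedy algorithm on this matroid is exactly Kruskal: scan the edges in decreasing weight order and keep an edge whenever it leaves the set independent (acyclic). A Python sketch with a union-find independence check (names and example mine):

```python
def max_weight_forest(n, edges):
    # edges: list of (weight, u, v); returns total weight of the best forest
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    total = 0
    for w, u, v in sorted(edges, reverse=True):
        ru, rv = find(u), find(v)
        if ru != rv:        # adding (u, v) keeps the edge set acyclic
            parent[ru] = rv
            total += w
    return total

print(max_weight_forest(4, [(3, 0, 1), (2, 1, 2), (1, 0, 2), (5, 2, 3)]))  # 10
```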

#### § Example 3: The partition matroid

Consider the partition matroid M ≡ (E, I), where we have a partitioning of E into blocks Ei, and numbers ki; the independence set consists of subsets F which have at most ki elements in common with each Ei.
 I ≡ { F ⊆ E  : ∀ i = 1, … N, | F ⋂ Ei | ≤ ki }
The independence axioms are intuitively satisfied, since our constraints on picking edges are of the form |F ⋂ Ei| ≤ ki, which will continue to hold as F becomes smaller. For the exchange axiom, let |Y| > |X|. Then, for some index i, it must be the case that |Y ⋂ Ei| > |X ⋂ Ei|. Hence, we can add an element of Ei ⋂ (Y∖X) into X whilst still maintaining independence.
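A tiny Python sketch of the partition-matroid independence oracle (names and example mine):

```python
def independent(F, blocks, caps):
    # F ⊆ E is independent iff |F ∩ E_i| <= k_i for every block E_i
    return all(len(F & Ei) <= ki for Ei, ki in zip(blocks, caps))

blocks = [{0, 1, 2}, {3, 4}]  # a partition of E = {0,...,4}
caps = [2, 1]                 # k_1 = 2, k_2 = 1
assert independent({0, 1, 3}, blocks, caps)
assert not independent({0, 1, 2}, blocks, caps)  # 3 elements of E_1 > k_1
print("oracle ok")
```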

#### § Bases and Circuits

• Bases are the maximal independent sets of I (ordered by inclusion). On adding any element to a basis, it becomes dependent. They are called bases by analogy with linear algebra.
• Circuits are minimal dependent sets of I. This comes from analogy with graphs: if we remove an element from any circuit (loop) in a graph, what we are left with is a tree.
A matroid can be completely characterized by knowing either the bases or the circuits of that matroid.

#### § Unique Circuit property

• Theorem: Let M ≡ (E, I) be a matroid, and let S ∈ I, e ∈ E such that S ∪ {e} ∉ I.
Then, there exists a unique circuit C ⊆ S ∪ {e}. That is, when we go from independent to dependent by adding an element, we will have a single, unique circuit. For example, when we add an edge into a forest to create a cycle, this cycle will be unique!

#### § Proof

Let C1, C2 be circuits created when e was added into S, where C1 is the largest such circuit and C2 the smallest. Notice that C1, C2 must contain e: if they did not, then C1, C2 would be circuits in S, contradicting the assumption that S is independent. Recall that C1, C2 are both circuits, which means that removing even a single element from them will cause them to become independent sets. Let us contemplate C ≡ C1 ∪ C2. Either C = C1, in which case we are done. Otherwise, |C| > |C1| and |C| > |C2|. Now consider C′ ≡ C∖{e} = (C1 ∪ C2)∖{e} = (C1∖{e}) ∪ (C2∖{e}).
• C′ ⊆ S, since C1∖{e}, C2∖{e} ⊆ S.
• S is an independent set, all of whose subsets are independent by definition. So C′ is an independent set.
• |C′| ≥ |C1∖{e}|, |C′| ≥ |C2∖{e}|.
Now, we consider C. Clearly, this is a dependent set, since C1 ⊆ C and C1 is a dependent set. Since C = C′ ∪ {e}, this means that C′ is a maximal independent subset of C. Since C′ does not contain e, C′ ⊆ S.

#### § Rank functions

A rank function of a matroid M ≡ ⟨X, I⟩ is a function:
 r: 2^X → ℕ,  r(S) = max{ |T| : T ⊆ S ∧ T ∈ I }
That is, for any subset S ⊆ X, r(S) is the cardinality of the largest independent subset of S.
• In the matroid of linearly independent sets of vectors, the rank of a set of vectors is the dimension of their spanning set.
In this matroid, the TODO: picture

#### § Bipartite matching as matroid intersection

Matchings in a bipartite graph G = (V, E) with partition (A, B) arise as the intersection of the independent sets of two matroids. We will denote by δ: V → 2^E the function which takes a vertex to the set of edges incident on that vertex. Let MA be a partition matroid: MA ≡ (E, IA) where IA is:
 IA ≡ { F ⊆ E : | F ⋂ δ(a) | ≤ 1 ∀ a ∈ A }
That is, in IA, every independent set has at most one edge incident on each vertex of A. We need to check that this is an independence system. The empty set of no edges is independent. If some collection of edges has at most one edge incident on each vertex, then removing edges can only decrease incidence; hence, it's also downward closed. TODO: add picture. Similarly, we define MB ≡ (E, IB):
 IB ≡ { F ⊆ E : | F ⋂ δ(b) | ≤ 1 ∀ b ∈ B }
Now, notice that any collection of edges F ∈ IA ⋂ IB is a legal matching, since the edges cover each vertex of A and of B at most once. The largest element of IA ⋂ IB is the maximum matching that we are looking for.

#### § Largest common independent set

Given two matroids M1 ≡ (E, I1), M2 ≡ (E, I2), with rank functions r1 and r2. Let S ∈ I1 ⋂ I2 and let F ⊆ E.
• |S| = |S ⋂ F| + |S ⋂ (E∖F)|.

### § Grokking Zariski

There's a lot written on the Zariski topology on the internet, but most of them lack explicit examples and pictures. This is my attempt to communicate what the Zariski topology looks like, from the perspectives that tickle my fancy (a wealth of concrete examples, topology-as-semi-decidability, and pictures).

#### § The Zariski topology

Recall that the Zariski topology is defined by talking about what its closed sets are. The common zero sets of a family of polynomials are the closed sets of the Zariski topology. Formally, the topology (ℝⁿ, τ) has as closed sets:
 { x ∈ ℝⁿ : ∀ fi ∈ F, fi(x) = 0 }  for a family of polynomials F ⊆ ℝ[X1, X2, … Xn]
Open sets (the complements of closed sets) are of the form:
 { x ∈ ℝⁿ : ∃ fi ∈ F, fi(x) ≠ 0 } ∈ τ
The empty set is generated as { x ∈ ℝⁿ : 0 ≠ 0 } and the full set is generated as { x ∈ ℝⁿ : 1 ≠ 0 }.

#### § Semi-decidability

Recall that in this view of topology, for a space (X, τ), for every open set O ∈ τ, we associate a Turing machine T_O which semi-decides inclusion. That is, if a point is in O, then T_O halts with the output IN-SET; if the point is not in O, T_O loops forever. Formally:

 x ∈ O ⟺ T_O halts on input x
 x ∉ O ⟺ T_O does not halt on input x

Alternatively, for a closed set C, we associate a Turing machine T_C which semi-decides exclusion. That is, if a point is not in C, T_C halts with the output NOT-IN-SET. If the point is in C, the Turing machine T_C loops forever. Now in the case of polynomials, we can write a function that semidecides exclusion from closed sets:

# return NOT-IN-SET if x0 is not a zero of the polynomial
def is_not_in_zeros(poly, x0):
    precision = 0
    while True:
        if poly(x0[:precision]) != 0: return NOT-IN-SET
        precision += 1 # up the precision

Since we can only evaluate a polynomial up to some finite precision, we start with zero precision and then gradually get more precise. If some x0 is not a root of poly, then at some point we will have poly(x0[:precision]) ≠ 0. If x0 is indeed a root, then we will never halt this process; we will keep getting poly(x0[:precision]) = 0 for all levels of precision.

#### § Spec(R)

• To setup a topology for the prime spectrum of a ring, we define the topological space as Spec(R), the set of all prime ideals of R.
• The closed sets of the topology are { V(I) : I is an ideal of R }, where the function V: Ideals of R → 2^Spec(R) maps each ideal to the set of prime ideals that contain it. Formally, V(I) = { p ∈ Spec(R) : I ⊆ p }.
• We can think of this differently, by seeing that we can rewrite the condition as V(I) = { P ∈ Spec(R) : I/P = 0 }: on rewriting modulo the prime ideal P, we send the ideal I to 0.
• Thus, the closed sets of Spec(R) are precisely the 'zeroes of polynomials' / 'zeroes of ideals'.
• To make the analogy precise, note that in the polynomial case, we imposed a topology on ℝ by saying that the closed sets were V(p) = { x ∈ ℝ : p(x) = 0 } for some polynomial p ∈ ℝ[x].
• Here, we are saying that the closed sets are V(I) = { x ∈ Spec(R) : I(x) = 0 } for some ideal I ⊆ R. So we are looking at ideals as functions from prime ideals to the reduction of the ideal. That is, I(P) = I/P.

#### § Spec(ℤ) from this perspective

Since ℤ is a PID, we can think of numbers instead of ideals. The above picture asks us to think of a number as a function from a prime to a reduced number: n(p) = n mod p. We then have that the closed sets are those sets of primes which zero out some number. That is:

 V(I) = { x ∈ Spec(ℤ) : I(x) = 0 }
 V((n)) = { (p) ∈ Spec(ℤ) : (n)/(p) = 0 }
 V((n)) = { (p) ∈ Spec(ℤ) : n mod p = 0 }

So in our mind's eye, we need to imagine a space of prime ideals (points), which we are testing with all ideals (polynomials). Given a set of prime ideals (a tentative locus, say a circle), the set of prime ideals is closed if it occurs as the zero of some collection of ideals (if the locus occurs as the zero of some collection of polynomials).
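Concretely, in Python (my sketch, over a small window of primes): V((n)) picks out exactly the primes dividing n.

```python
def V(n, primes=(2, 3, 5, 7, 11, 13)):
    # the prime ideals (p) with n mod p == 0, ie, the primes dividing n
    return [p for p in primes if n % p == 0]

print(V(12))  # [2, 3]
print(V(35))  # [5, 7]
```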

### § My preferred version of quicksort

Wikipedia lists the implementation of quicksort as:
algorithm quicksort(A, lo, hi) is
    if lo < hi then
        p := partition(A, lo, hi)
        quicksort(A, lo, p - 1)
        quicksort(A, p + 1, hi)

algorithm partition(A, lo, hi) is
    pivot := A[hi]
    i := lo
    for j := lo to hi do
        if A[j] < pivot then
            swap A[i] with A[j]
            i := i + 1
    swap A[i] with A[hi]
    return i

Here, the indices [lo..i-1] have values less than the pivot, while [i..j] are greater than or equal to the pivot.

#### § The version I prefer

#define SWAP(ix, ix2) { int t = a[ix]; a[ix] = a[ix2]; a[ix2] = t; }
// sorts the interval [l, r]
void qs(int l, int r) {
    if (r - l < 0) return;
    int part = a[r]; // the partition

    // a[getill...n] >= part (getill = greater-or-equal till)
    // starts at r since we know that a[r] >= (partition=a[r])
    int getill = r;
    // a[l..lttill] < part (lttill = less-than till)
    // starts at (l-1) since we do not know about any value < partition
    int lttill = l-1;

    // loop until they start probing into the other set
    while(!(lttill+1 >= getill || getill-1 <= lttill)) {
        // if the speculated element is < partition
        if (a[getill-1] < part) {
            // swap the value at getill-1 with the slot at lttill+1
            SWAP(getill-1, lttill+1);
            // increment lttill, since we KNOW that the
            // value at lttill+1 = a[getill-1] is < part
            lttill++;
        } else {
            // all we know is that a[getill-1] >= part, so we can engulf
            // it into the >= region
            getill--;
        }
    }
    // the partitions must be next to each other, since we have engulfed everything
    assert(getill - lttill == 1);
    // move the partition value to the center.
    SWAP(getill, r);

    // recurse: solve [l..lttill] (leave getill=part alone), solve [getill+1..r]
    qs(l, lttill);
    qs(getill+1, r);
}

This implementation makes very clear to me what information is "known":
• The segment that is strictly less than the partition.
• The segment that is greater than or equal to the partition.
It also makes clear what is being "probed"/"tentative":
• Anything we access at ±1 is not known yet; we are feeling out the boundaries of our partitions.
The termination condition is clear: when one partition starts reaching into the other partition's resources, it's done. Due to using closed intervals everywhere, it's very easy to see precisely what data starts and ends where. What version of quicksort do you prefer? Drop me an email!

### § Geometric proof of Cauchy Schwarz inequality

• All credit goes to p0a on ##math on freenode for teaching me this proof!
Here's one fun application of Cauchy-Schwarz: we can apply it to the two vectors x = (√a, √b) and y = (√b, √a) to derive the AM-GM inequality:
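Spelled out (my reconstruction of the step), the derivation runs:

```latex
\langle x, y \rangle \le \|x\| \, \|y\|
\quad\text{with } x = (\sqrt{a}, \sqrt{b}),\; y = (\sqrt{b}, \sqrt{a}):
\qquad
\sqrt{ab} + \sqrt{ab} \;\le\; \sqrt{a + b}\,\sqrt{b + a}
\;\iff\; 2\sqrt{ab} \;\le\; a + b.
```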

### § Dataflow analysis using Grobner basis

This was a quick experiment in using Grobner basis to model situations. We can represent our dataflow analysis constraints in terms of polynomial rewrites over F2. Given the program:
p = { 0: ["=", "x", 'y'],
1: ['br', 2, 100],
2: ['=', 'z', 'x'],
3: ['br', 2],
100: ['ret', 'z'] }

whose semantics I hope are fairly straightforward: the dictionary maps instruction locations to instructions. Instructions proceed sequentially; br moves control flow around. Note that br can branch to multiple locations, since we are not control-flow sensitive. The idea is that since in a dataflow analysis we need information for each variable at each program point, we can create a ring of polynomials over F2 with one generator per variable per program point. So in this case, we would need:
R = F_2[x0, y0, z0, x1, y1, z1, x2, y2, z2, x3, y3, z3, x100, y100, z100]

We then add elements into the ideal that represent our constraints. For example, to perform liveness analysis, we need to add constraints stating that if a variable z is alive at 100, then all variables that are used to compute z are alive. This sets up equations that may have cycles (in the case of loops). These are usually resolved using the Kildall algorithm. However, we can also ask SAGE to kindly solve for the Grobner basis. I hypothesize that the "easy" dataflow problems ought to be toric ideals, which admit much faster solutions.
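For contrast, here is the Kildall-style fixpoint mentioned above as a tiny Python liveness solver for the example program (the succ/uses/defs tables are my reading of p, not part of the original sketch):

```python
p = { 0: ["=", "x", "y"],
      1: ["br", 2, 100],
      2: ["=", "z", "x"],
      3: ["br", 2],
      100: ["ret", "z"] }

succ = {0: [1], 1: [2, 100], 2: [3], 3: [2], 100: []}
uses = {0: {"y"}, 1: set(), 2: {"x"}, 3: set(), 100: {"z"}}
defs = {0: {"x"}, 1: set(), 2: {"z"}, 3: set(), 100: set()}

# iterate live[i] = uses[i] ∪ (live[succ] − defs[i]) to a fixpoint
live = {i: set() for i in p}
changed = True
while changed:
    changed = False
    for i in p:
        out = set().union(*[live[s] for s in succ[i]])
        new = uses[i] | (out - defs[i])
        if new != live[i]:
            live[i], changed = new, True

print(sorted(live[0]))  # ['y', 'z']
```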

### § Fenwick trees and orbits

I learnt of a nice, formal way to prove the correctness of Fenwick trees in terms of orbits, which I wish to reproduce here. One can use a Fenwick tree to perform cumulative sums Sum(n) ≡ ∑_{i=1}^{n} A[i], and updates Upd(i, v) ≡ (A[i] += v). Naively, cumulative sums take O(n) time and updates take O(1) time. A Fenwick tree performs both in O(log n). In general, we can perform any monoid-based accumulation and update in O(log n).

#### § organization

We allow indexes [1, 2, …, n]. The node with factorization i = 2^k × l, 2 ∤ l (that is, 2^k is the largest power of 2 dividing i) is responsible for the interval [i−2^k+1, i] = (i−2^k, i]. I'm going to state all the manipulations in terms of prime factorizations, since I find that far more intuitive than bit-fiddling. In general, I want to find a new framework to discover and analyze bit-fiddling-heavy algorithms. Some examples of the range of responsibility of an index are:
• 1 = 2^0 × 1 = (0, 1] (subtract 2^0 = 1)
• 2 = 2^1 × 1 = (0, 2] (subtract 2^1 = 2)
• 3 = 2^0 × 3 = (2, 3]
• 4 = 2^2 × 1 = (0, 4]
• 5 = 2^0 × 5 = (4, 5]
• 6 = 2^1 × 3 = (4, 6]
• 7 = 2^0 × 7 = (6, 7]
• 8 = 2^3 × 1 = (0, 8]
• 9 = 2^0 × 9 = (8, 9]
• 10 = 2^1 × 5 = (8, 10]
• 11 = 2^0 × 11 = (10, 11]
• 12 = 2^2 × 3 = (8, 12]
• 13 = 2^0 × 13 = (12, 13]
• 14 = 2^1 × 7 = (12, 14]
• 15 = 2^0 × 15 = (14, 15]
• 16 = 2^4 × 1 = (0, 16]

#### § query

To perform a cumulative sum, we need to read from the correct overlap regions that cover the full array. For example, to read from 15, we would want to read:
• a[15] = (14, 15], a[14] = (12, 14], a[12] = (8, 12], a[8] = (0, 8].
So we need to read the indices:
• 15 = 2^0 · 15 →(−2^0) 14 = 2^1 · 7 →(−2^1) 12 = 2^2 · 3 →(−2^2) 8 = 2^3 · 1 →(−2^3) 0
At each location, we strip off the value 2^r. We can discover this value with bit-fiddling: we claim that a & (−a) = 2^r. Let a = x10^r in binary (an arbitrary prefix x, then the lowest set bit, then r zeroes). Then −a = ¬a + 1 = (¬x)01^r + 1 = (¬x)10^r. Hence, a & (−a) = (x10^r) & ((¬x)10^r) = 0^|x|10^r = 2^r. So the full implementation of query is:
#define LSB(x) ((x) & (-(x)))
int a[N];
int q(int i) {
  int s = 0;
  while (i > 0) { s += a[i]; i -= LSB(i); }
  return s;
}



#### § update

To perform an update at i, we need to update all locations which on querying overlap with i. For example, to update the location 9, we would want to update:
• a[9] = (8, 9], a[10] = (8, 10], a[12] = (8, 12], a[16] = (0, 16].
So we need to update the indices:
• 9 = 2^0 · 9 →(+2^0) 10 = 2^1 · 5 →(+2^1) 12 = 2^2 · 3 →(+2^2) 16 = 2^4 · 1 →(+2^4)
We use the same bit-fiddling technique as above to isolate the value 2^r:
#define LSB(x) ((x) & (-(x)))
int tree[N];
void u(int i, int v) {
  while (i < N) { tree[i] += v; i += LSB(i); }
}
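As a quick sanity check, here is the same pair of operations transcribed to Python (the names mirror the C above) and compared against a naive prefix sum:

```python
N = 64
tree = [0] * N          # 1-indexed Fenwick array; tree[0] is unused
naive = [0] * N         # plain array for cross-checking

def lsb(x): return x & (-x)

def u(i, v):            # update: add v at index i
    while i < N:
        tree[i] += v
        i += lsb(i)

def q(i):               # query: cumulative sum over indices 1..i
    s = 0
    while i > 0:
        s += tree[i]
        i -= lsb(i)
    return s

for (i, v) in [(1, 5), (7, 2), (7, 3), (33, 10)]:
    u(i, v)
    naive[i] += v

assert all(q(i) == sum(naive[1:i+1]) for i in range(1, N))
```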


#### § correctness

We wish to analyze the operations Query(q) ≡ ∑_{i=1}^{q} a[i] and Update(i, val) ≡ (a[i] += val). To do this, we are allowed to maintain an auxiliary array d which we will manipulate. We will stipulate conditions on the operations on d such that they reflect the values of Query and Update, albeit much faster. We will analyze the algorithm in terms of orbits. We have two operators, one for update called U, and one for query called Q. Given an index q, repeatedly applying the query operator gives us the indices we need to read and accumulate from the auxiliary array d to get the total sum a[1..q]:
• Query(q) = ∑_i d[Q^i(q)]
Given an index u, repeatedly applying the update operator U gives us all the indices at which we need to add the change:
• Update(u, val) = ∀ j, d[U^j(u)] += val
For query and update to work, we need the condition that:
• q ≥ u  ⇐⇒  |{ Q^i(q) : i ∈ ℕ } ∩ { U^j(u) : j ∈ ℕ }| = 1
That is, if and only if the query index q includes the update location u will the orbits intersect (and then exactly once). The intuition is that we want updates at an index u to affect only queries that occur at indices q ≥ u. Hence, we axiomatise that for an update to be legal, it must intersect precisely the orbits of queries at indices greater than or equal to it. We will show that our operators:
• Q(i = 2^r · a) = i − 2^r = 2^r(a−1)
• U(j = 2^s · b) = j + 2^s = 2^s(b+1)
do satisfy the conditions above. For a quick numerical check, we can use the code below to ensure that the orbits intersect exactly as claimed:
## calculate orbits of query and update in fenwick tree

def lsb(i): return i & (-i)
def U(i): return i + lsb(i)
def Q(i): return i - lsb(i)
def orbit(f, i):
    s = set()
    while i not in s and i > 0 and i < 64:
        s.add(i)
        i = f(i)
    return s

if __name__ == "__main__":
    for q in range(1, 16):
        for u in range(1, 16):
            qo = orbit(Q, q); uo = orbit(U, u)
            c = qo.intersection(uo)
            print("q:%4s | u:%4s | qo: %20s | uo: %20s | qu: %4s" %
                  (q, u, qo, uo, c))
        print("--")


#### § Case 1: q = u

We note that Q always decreases the value of its argument, and U always increases it. Hence, if q = u, the orbits meet at this point, and for all i, j ≥ 1, Q^i(q) < q = u < U^j(u). Hence, they meet exactly once, as required.

#### § Case 2: q < u

As noted above, the query orbit always decreases from q and the update orbit always increases from u; since q < u, the orbits will never meet, as required.

#### § Case 3: q > u

Let the entire array have size 2^N. Write the indices in binary as q = e1f_q and u = e0f_u, where e is the common prefix and f_q, f_u (and possibly e) may be empty strings. Notice that Q strips away the rightmost ones of q, so the query orbit eventually reaches e10…0. Similarly, U keeps adding the lowest set bit, so the run of rightmost ones in u = e01…10…0 eventually carries over into e10…0. Hence, at some point q = u; and since Q thereafter decreases below this point while U increases above it, they meet exactly once.

### § Dirichlet inversion

We call functions f: ℤ⁺ → ℝ arithmetic functions, since they operate on the positive integers. We introduce an operator ⋆, defined by:
• (f ⋆ g)(n) ≡ ∑_{d | n} f(d) g(n/d)
We will show that the set of arithmetic functions (with f(1) ≠ 0) forms a group under the operator ⋆, with identity:
• id(n) ≡ ⌊1/n⌋ = 1 if n = 1, and 0 otherwise.
The reason all of this is interesting is that the inverse of the constant function 1(n) ≡ 1 is going to be the function called the mobius function µ:
• µ(n = p1^α1 p2^α2 … pr^αr) ≡ 0 if any αi > 1; (−1)^(α1 + α2 + … + αr) if all αi ∈ { 0, 1 }.
The mobius function will allow us to perform mobius inversion:

f(n) ≡ ∑_{d | n} g(d) = ∑_{d | n} g(d) 1(n/d) = (g ⋆ 1)(n)
f ⋆ 1⁻¹ = g ⋆ 1 ⋆ 1⁻¹
f ⋆ µ = g

That is, we originally had f defined in terms of g. We can recover an expression for g in terms of f.

#### § The algebra of multiplicative functions

We claim that the set of arithmetic functions f with f(1) ≠ 0 forms a commutative group, with the group operation ⋆ such that:
• (f ⋆ g)(n) ≡ ∑_{d | n} f(d) g(n/d).
with the identity element being id(n) ≡ ⌊1/n⌋. The idea is that if n = 1, then ⌊1/1⌋ = 1, and for any other n > 1, 1/n < 1, hence ⌊1/n⌋ = 0.

#### § verification of id being the identity

(f ⋆ id)(n) ≡ ∑_{d | n} f(d) id(n/d)
= f(n) id(1) + ∑_{d | n, d > 1} f(n/d) id(d)
= f(n) · 1 + ∑_{d | n, d > 1} f(n/d) · 0
= f(n)

#### § associativity, commutativity of ⋆

To prove associativity, it's better to write the formula as:
(f ⋆ g)(n) ≡ ∑_{d | n} f(d) g(n/d) = ∑_{xy = n} f(x) g(y)
From this rewrite, it's clear that (f ⋆ g ⋆ h)(n) unambiguously sums over triples (x, y, z) such that xyz = n. I leave working this out to you. This should also make commutativity immediate: summing over pairs of the form f(x) g(y) : xy = n is the same as summing over f(y) g(x) : yx = n.

#### § Existence of inverse

We can show that an inverse exists by showing that a formula for it exists; the idea is to construct one by induction. Clearly, for a given function f, we need the inverse f⁻¹ to be such that f ⋆ f⁻¹ = id. Hence:

(f ⋆ f⁻¹)(1) = id(1) = 1
f(1) f⁻¹(1) = 1
f⁻¹(1) ≡ 1 / f(1)

Great, we have a base case (this is where we need f(1) ≠ 0). We can now compute f⁻¹(n) inductively, assuming we know the value of f⁻¹(m) for all m < n:

(f ⋆ f⁻¹)(n) = id(n) = 0      (for n > 1)
∑_{d | n} f(d) f⁻¹(n/d) = 0
f(1) f⁻¹(n) + ∑_{d | n, d > 1} f(d) f⁻¹(n/d) = 0
f⁻¹(n) = − ( ∑_{d | n, d > 1} f(d) f⁻¹(n/d) ) / f(1)
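The inductive formula translates directly into code. A small sketch (the function names are mine) that computes f⁻¹ on an initial segment, and recovers the first values of µ as the inverse of the constant-1 function:

```python
def divisors(n):
    return [d for d in range(1, n + 1) if n % d == 0]

def dirichlet_inverse(f, N):
    # build f^{-1}(1), ..., f^{-1}(N) by the inductive formula above
    inv = {1: 1 / f(1)}
    for n in range(2, N + 1):
        inv[n] = -sum(f(d) * inv[n // d] for d in divisors(n) if d > 1) / f(1)
    return inv

one = lambda n: 1
mu = dirichlet_inverse(one, 12)
print([mu[n] for n in range(1, 13)])
# first values of the mobius function: 1, -1, -1, 0, -1, 1, -1, 0, 0, 1, -1, 0
```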

#### § Mobius inversion

Now that we know that µ = 1⁻¹, we can use this fact to perform mobius inversion. Suppose:

f(n) ≡ ∑_{d | n} g(n/d) = (1 ⋆ g)(n)

We have f written in terms of g. We can now write g in terms of f:

f = 1 ⋆ g
f ⋆ 1⁻¹ = g
g = f ⋆ µ
g(n) = ∑_{d | n} f(d) µ(n/d)

#### § n = ∑_{d | n} φ(d)

d  | { 1 ≤ x ≤ 12 : (x, 12) = d } | { 1 ≤ x ≤ 12 : (x/d, 12/d) = 1 } | size of set
1  | { 1, 5, 7, 11 }              | (x, 12) = 1                      | 4
2  | { 2, 10 }                    | (x/2, 6) = 1                     | 2
3  | { 3, 9 }                     | (x/3, 4) = 1                     | 2
4  | { 4, 8 }                     | (x/4, 3) = 1                     | 2
6  | { 6 }                        | (x/6, 2) = 1                     | 1
12 | { 12 }                       | (x/12, 1) = 1                    | 1
Notice that the sizes of sets that we are calculating, for example, |{ 1 ≤ x ≤ 12 : (x/2, 6) = 1 }| = φ(6). Summing over all of what we have, we've counted the numbers in [1, 2, …, 12] in two ways --- one directly, and the other by partitioning into equivalence classes:
12 = φ(1) + φ(2) + φ(3) + φ(4) + φ(6) + φ(12) = ∑_{d | 12} φ(12/d)
In general, the same argument allows us to prove that:
n = ∑_{d | n} φ(n/d) = ∑_{d | n} φ(d)
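A one-line check of the identity for small n, with a brute-force totient (helper names mine):

```python
from math import gcd

def phi(n):
    # Euler's totient, by brute force
    return sum(1 for x in range(1, n + 1) if gcd(x, n) == 1)

def divisor_sum_of_phi(n):
    return sum(phi(d) for d in range(1, n + 1) if n % d == 0)

assert all(divisor_sum_of_phi(n) == n for n in range(1, 100))
print(divisor_sum_of_phi(12))  # = phi(1)+phi(2)+phi(3)+phi(4)+phi(6)+phi(12) = 12
```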

### § Incunabulum for the 21st century: Making the J interpreter compile in 2020

This is me trying to get the fabled interpreter of the J language working, so I could absorb Arthur Whitney's style of writing C: it's cryptic, short, and fits in a page. I learnt of this from the J language page, which comes with the quote:
One summer weekend in 1989, Arthur Whitney visited Ken Iverson at Kiln Farm and produced—on one page and in one afternoon—an interpreter fragment on the AT&T 3B1 computer. I studied this interpreter for about a week for its organization and programming style; and on Sunday, August 27, 1989, at about four o'clock in the afternoon, wrote the first line of code that became the implementation described in this document. Arthur's one-page interpreter fragment is as follows:

#### § The original source

typedef char C;typedef long I;
typedef struct a{I t,r,d[3],p[2];}*A;
#define P printf
#define R return
#define V1(f) A f(w)A w;
#define V2(f) A f(a,w)A a,w;
#define DO(n,x) {I i=0,_n=(n);for(;i<_n;++i){x;}}
I *ma(n){R(I*)malloc(n*4);}mv(d,s,n)I *d,*s;{DO(n,d[i]=s[i]);}
tr(r,d)I *d;{I z=1;DO(r,z=z*d[i]);R z;}
A ga(t,r,d)I *d;{A z=(A)ma(5+tr(r,d));z->t=t,z->r=r,mv(z->d,d,r);
R z;}
V1(iota){I n=*w->p;A z=ga(0,1,&n);DO(n,z->p[i]=i);R z;}
V2(plus){I r=w->r,*d=w->d,n=tr(r,d);A z=ga(0,r,d);
DO(n,z->p[i]=a->p[i]+w->p[i]);R z;}
V2(from){I r=w->r-1,*d=w->d+1,n=tr(r,d);
A z=ga(w->t,r,d);mv(z->p,w->p+(n**a->p),n);R z;}
V1(box){A z=ga(1,0,0);*z->p=(I)w;R z;}
V2(cat){I an=tr(a->r,a->d),wn=tr(w->r,w->d),n=an+wn;
A z=ga(w->t,1,&n);mv(z->p,a->p,an);mv(z->p+an,w->p,wn);R z;}
V2(find){}
V2(rsh){I r=a->r?*a->d:1,n=tr(r,a->p),wn=tr(w->r,w->d);
A z=ga(w->t,r,a->p);mv(z->p,w->p,wn=n>wn?wn:n);
if(n-=wn)mv(z->p+wn,z->p,n);R z;}
V1(sha){A z=ga(0,1,&w->r);mv(z->p,w->d,w->r);R z;}
V1(id){R w;}V1(size){A z=ga(0,0,0);*z->p=w->r?*w->d:1;R z;}
pi(i){P("%d ",i);}nl(){P("\n");}
pr(w)A w;{I r=w->r,*d=w->d,n=tr(r,d);DO(r,pi(d[i]));nl();
if(w->t)DO(n,P("< ");pr(w->p[i]))else DO(n,pi(w->p[i]));nl();}

C vt[]="+{~<#,";
A(*vd[])()={0,plus,from,find,0,rsh,cat},
(*vm[])()={0,id,size,iota,box,sha,0};
I st[26]; qp(a){R  a>='a'&&a<='z';}qv(a){R a<'a';}
A ex(e)I *e;{I a=*e;
if(qp(a)){if(e[1]=='=')R st[a-'a']=ex(e+2);a= st[ a-'a'];}
R qv(a)?(*vm[a])(ex(e+1)):e[1]?(*vd[e[1]])(a,ex(e+2)):(A)a;}
noun(c){A z;if(c<'0'||c>'9')R 0;z=ga(0,0,0);*z->p=c-'0';R z;}
verb(c){I i=0;for(;vt[i];)if(vt[i++]==c)R i;R 0;}
I *wd(s)C *s;{I a,n=strlen(s),*e=ma(n+1);C c;
DO(n,e[i]=(a=noun(c=s[i]))?a:(a=verb(c))?a:c);e[n]=0;R e;}

main(){C s[99];while(gets(s))pr(ex(wd(s)));}

It's a lot to take in --- it's quite breathtaking really, the way it all hangs together in one page.

#### § The attempt to get it to run

Unfortunately, this does not work if we try to run it in 2020. I decided to read the code to understand what would happen, and managed to read enough to understand that the code a=~3 ought to create an array with the values [0 1 2]. On attempting to run this, however, we get:
\$ gcc -O0 -g -std=c89 -fsanitize=address -fsanitize=undefined incunabulum.c -o bin/incunabulum && ./bin/incunabulum
...
(many many GCC warnings elided)
...
a=~3
=================================================================
0x60300000eff0 at pc 0x000000402be3 bp 0x7ffe6dde6b70 sp 0x7ffe6dde6b60
WRITE of size 8 at 0x60300000eff0 thread T0
#0 0x402be2 in wd /home/bollu/work/w/incunabulum.c:40
#1 0x402d28 in main /home/bollu/work/w/incunabulum.c:42
#2 0x7f7ae901082f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)
#3 0x400ca8 in _start (/home/bollu/w/bin/incunabulum+0x400ca8)
...
...
==23726==ABORTING

oops! The code puns freely between int, long, and pointers, and assumes everything is 4 bytes wide. These assumptions break now that we're on 64-bit. The patch to get this to work is:

#### § the patch

diff --git a/incunabulum.c b/incunabulum.c
index 2cae744..778e35a 100644
--- a/incunabulum.c
+++ b/incunabulum.c
@@ -1,11 +1,11 @@
-typedef char C;typedef long I;
+typedef char C;typedef long long I;
typedef struct a{I t,r,d[3],p[2];}*A;
#define P printf
#define R return
#define V1(f) A f(w)A w;
#define V2(f) A f(a,w)A a,w;
#define DO(n,x) {I i=0,_n=(n);for(;i<_n;++i){x;}}
-I *ma(n){R(I*)malloc(n*4);}mv(d,s,n)I *d,*s;{DO(n,d[i]=s[i]);}
+I *ma(n){R(I*)malloc(n*8);}mv(d,s,n)I *d,*s;{DO(n,d[i]=s[i]);}
tr(r,d)I *d;{I z=1;DO(r,z=z*d[i]);R z;}
A ga(t,r,d)I *d;{A z=(A)ma(5+tr(r,d));z->t=t,z->r=r,mv(z->d,d,r);
R z;}
@@ -34,9 +34,10 @@ I st[26]; qp(a){R  a>='a'&&a<='z';}qv(a){R a<'a';}
A ex(e)I *e;{I a=*e;
if(qp(a)){if(e[1]=='=')R st[a-'a']=ex(e+2);a= st[ a-'a'];}
R qv(a)?(*vm[a])(ex(e+1)):e[1]?(*vd[e[1]])(a,ex(e+2)):(A)a;}
-noun(c){A z;if(c<'0'||c>'9')R 0;z=ga(0,0,0);*z->p=c-'0';R z;}
-verb(c){I i=0;for(;vt[i];)if(vt[i++]==c)R i;R 0;}
+I noun(c){A z;if(c<'0'||c>'9')R 0;z=ga(0,0,0);*z->p=c-'0';R z;}
+I verb(c){I i=0;for(;vt[i];)if(vt[i++]==c)R i;R 0;}
I *wd(s)C *s;{I a,n=strlen(s),*e=ma(n+1);C c;
DO(n,e[i]=(a=noun(c=s[i]))?a:(a=verb(c))?a:c);e[n]=0;R e;}

main(){C s[99];while(gets(s))pr(ex(wd(s)));}
+


#### § It runs!

After applying the patch, we manage to get the interpreter to run!
./bin/incunabulum
a=~3
3
0 1 2


#### § The lock screen

I liked it so much that I took a screenshot and made it my lock screen.

#### § Thoughts

I'm really fascinated by the code. I loved that I could simply stare at the screenshot to absorb the code; there was no scrolling involved. The variables are well-named (to the extent that I understand the code), and it's clearly extremely well thought out. If there's someone who understands some of the thorny aspects of the code:
• What is the t variable really tracking?
• How does one create a multi-dimensional array easily?
• What are some interesting programs one can run with this mini-interpreter?
I'd be really glad to know the details. Please leave an issue or a pull request against the repo. I'm going to write a dissection of the code once I fully understand it, since I couldn't find anything explaining the code on the internet. Until then, enjoy the monolith of code!

### § An example of a sequence whose successive terms get closer together but isn't Cauchy (does not converge)

#### § The problem

Provide an example of a sequence a: ℕ → ℝ such that lim_{n → ∞} |a_{n+1} − a_n| = 0, but lim_{n, m → ∞, m > n} |a_n − a_m| ≠ 0. That is, provide a sequence where the distances between successive terms converge to zero, but where distances between terms that are "farther apart than 1" do not converge to 0. That is, the sequence is not Cauchy.

#### § Regular solution: Harmonic numbers

The usual solution is to take the harmonic numbers, H_n ≡ ∑_{i=1}^{n} 1/i. Then successive differences vanish:

lim_{n → ∞} (H_{n+1} − H_n) = lim_{n → ∞} 1/(n+1) = 0

while terms "twice as far out" stay separated:

H_{2n} − H_n = ∑_{i=n+1}^{2n} 1/i
= 1/(n+1) + 1/(n+2) + … + 1/(2n)
≥ 1/(2n) + 1/(2n) + … + 1/(2n)
= n/(2n) = 1/2 ≠ 0 as n → ∞

#### § Memorable solution: logarithm

We can much more simply choose a_n = log(n). This yields the simple calculation:

a_{n+1} − a_n = log(n+1) − log(n)
= log((n+1)/n)
= log(1 + 1/n) → log(1) = 0 as n → ∞

while on the other hand,

a_{2n} − a_n = log(2n) − log(n) = log 2 + log(n) − log(n) = log 2 ≠ 0

I find this far cleaner conceptually, since it's "obvious" to everyone that a_n = log(n) diverges, while the corresponding fact for H_n is hardly convincing. We also get straightforward equalities everywhere, instead of inequalities. I still feel that I don't grok what precisely fails here: my intuition still insists that the local condition ought to imply the Cauchy condition. If a_n tells a_{n+1} not to be too far, and a_{n+1} tells a_{n+2}, surely this must be transitive? I have taught my instincts to not trust my instincts on analysis, which is a shitty solution :) I hope to internalize this someday. EDIT: I feel I now understand what's precisely happening, after ruminating a bit. The Cauchy convergence criterion allows us to drop a finite number of terms, and then capture everything after that point in a ball of radius є. As є shrinks, all the terms in the sequence are "squeezed together". In the a_{n+1} − a_n case, only successive terms must maintain an є distance. But as the log example shows, you can steadily plod along, keeping є ball next to є ball, to reach something like:

lim_{n → ∞} lim_{є → 0} f(n) · є

whose behaviour can do unexpected things depending on the choice of f.

### § Krylov subspace method

This is a class of methods used to solve Ax = b, where A is sparse. Krylov subspace methods all build on the idea of a Krylov subspace; examples include conjugate gradient (CG) and GMRES (the generalized minimal residual method).

K_m(A, v) ≡ span { v, Av, A^2 v, …, A^m v }

Clearly, K_m ⊆ K_{m+1}, and there is a maximal subspace K_N that we can reach (at most the full vector space). We are interested in the smallest index M such that K_M = K_{M+1}. We notice that K_M is invariant under the action of A. Now, let's consider:

K_M(A, x) ≡ span { x, Ax, A^2 x, …, A^M x }
= span { A⁻¹b, b, Ab, …, A^{M−1} b }      (substitute x = A⁻¹b)
= A⁻¹ span { b, Ab, …, A^M b }
= span { b, Ab, …, A^M b } = K_M(A, b)      (K_M is invariant under A, hence under A⁻¹)
We learnt that Ax = b has a solution in Km(A, b). Using this, we can build solvers that exploit the Krylov subspace. We will describe GMRES and CG.

### § Generalized minimal residual --- GMRES

We wish to solve Ax = b where A is sparse and b is normalized. The nth Krylov subspace is K_n(A, b) ≡ span { b, Ab, A^2 b, …, A^n b }. We approximate the actual solution with a vector x_n ∈ K_n(A, b). We define the residual as r_n ≡ A x_n − b.
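A minimal numerical sketch of searching for the solution inside a Krylov subspace (the helper name krylov_lstsq is mine; real GMRES builds an orthonormal basis with the Arnoldi iteration and minimizes the residual stably, while this toy version forms the basis explicitly and calls a dense least-squares solve):

```python
import numpy as np

def krylov_lstsq(A, b, m):
    # columns are the Krylov vectors b, Ab, ..., A^{m-1} b
    K = np.stack([np.linalg.matrix_power(A, i) @ b for i in range(m)], axis=1)
    # pick x in K_m(A, b) minimizing the residual |A x - b|
    y, *_ = np.linalg.lstsq(A @ K, b, rcond=None)
    return K @ y

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = krylov_lstsq(A, b, 2)
print(np.linalg.norm(A @ x - b))  # ~0 here, since K_2 already spans the solution
```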

### § Good reference to the Rete pattern matching algorithm

The Rete pattern matching algorithm is an algorithm that allows matching a huge number of rules against a huge database of "facts". MLIR ("multi-level intermediate representation") is a new technology that hopes to centralize much of the research and development of various compilers into a single authoritative source. The major claim-to-fame is that it allows one to mix various "dialects" (much as Racket does). So, to a first order approximation, MLIR is "JSON for compiler intermediate representations". What MLIR gets right is tooling. They take the experience that the LLVM project has mined for the last decade and a half, and bake many of the good things that came with LLVM right in. For example, MLIR comes in-built with a pretty printer, a notion of types, a notion of "attributes", SSA, and enforced provenance tracking of code (so one can always know what the original source code was that led to some assembly). Sound engineering might see MLIR succeed where many others failed. I was reminded of Rete since the MLIR folks are trying to solve the pattern matching problem in general for their Generic DAG Rewriter. They currently just use a worklist based algorithm. I'm trying to understand if Rete can be used instead. Rete is famous for being hard to understand, so I began a quest to look for good sources to implement it. I found a great PhD thesis written by Robert B. Doorenbos, which quips:
Since the Rete match algorithm provides the starting point for much of the work in this thesis, this chapter describes Rete. Unfortunately, most of the descriptions of Rete in the literature are not particularly lucid, which is perhaps why Rete has acquired "a reputation for extreme difficulty" (Perlin, 1990b). To remedy this situation, this chapter describes Rete in a tutorial style, rather than just briefly reviewing it and referring the reader to the literature for a full description. We will first give an overview of Rete, and then discuss the principal data structures and procedures commonly used to implement it. High-level pseudocode will be given for many of the structures and procedures, so that this chapter may serve as a guide to readers who want to implement Rete (or some variant) in their own systems.
I now have a reference to an accessible description of this stuff. I might implement Rete to understand it, so that it's part of my toolkit if I ever need it.

### § Leapfrog Integration

We have a system we wish to simulate using Hamilton's equations:

∂q/∂t = ∂H/∂p |_(p0, q0)
∂p/∂t = −∂H/∂q |_(p0, q0)

We want to simulate the system using these differential equations. We will begin with some initial position and momentum (q0, p0), evaluate ∂q/∂t |_(q0, p0) and ∂p/∂t |_(q0, p0), and use these to find (q_next, p_next). An integrator is a general algorithm that produces the next position and momentum using current information:

(q_next, p_next) = I(q0, p0, ∂q/∂t |_(q0, p0), ∂p/∂t |_(q0, p0))
The design of I is crucial: different choices of I will have different trade-offs for accuracy and performance. Another interesting property we might want is for I to be a symplectic integrator --- that is, one that (approximately) preserves the total energy of the system. For example, here's a plot of the orbits of planets using two integrators, one that's symplectic (leapfrog) and one that isn't (Euler). Notice that since leapfrog attempts to keep energy conserved, the orbits stay orbits! On the other hand, the Euler integrator quickly spirals out, since we lose energy during the integration. Note that this is not an issue of numerical precision: the Euler integrator is inherently such that, over long timescales, it loses energy. On the other hand, the leapfrog integrator remains stable, even with very large timesteps and low precision. I present the equations of the leapfrog integrator, a proof sketch that it is symplectic, and the code listing that was used to generate the above plot. Often, code makes ideas very clear!

#### § Incantations

## Run HMC with a particular choice of potential
import numpy as np
from matplotlib.animation import FuncAnimation
import matplotlib.pyplot as plt
import numpy.linalg

## dq/dt = dH/dp|_{p0, q0}
## dp/dt = -dH/dq|_{p0, q0}
def leapfrog(dhdp, dhdq, q0, p0, dt):
    # half step momentum
    p0 += -dhdq(q0, p0) * 0.5 * dt

    # full step position: q += dt * dH/dp
    q0 += dhdp(q0, p0) * dt

    # half step momentum
    p0 += -dhdq(q0, p0) * 0.5 * dt
    return (q0, p0)

For reference, we also implement an euler integrator, that uses the derivative to compute the position and momentum of the next timestep independently.
def euler(dhdp, dhdq, q, p, dt):
    pnew = p + -dhdq(q, p) * dt
    qnew = q + dhdp(q, p) * dt
    return (qnew, pnew)

Finally, we implement planet(integrator, n, dt) which simulates gravitational potential and usual kinetic energy, using the integrator given by integrator for n steps, with each timestep taking dt.
def planet(integrator, n, dt):
    STRENGTH = 0.5

    # initial position and momentum
    q = np.array([0.0, 1.0])
    p = np.array([-1.0, 0.0])

    # H = STRENGTH * |q| (potential) + p^2/2 (kinetic)
    def H(qcur, pcur): return STRENGTH * np.linalg.norm(qcur) + np.dot(pcur, pcur) / 2
    def dhdp(qcur, pcur): return pcur
    def dhdq(qcur, pcur): return STRENGTH * 2 * qcur / np.linalg.norm(qcur)

    qs = []
    for i in range(n):
        print("q: %10s | p: %10s | H: %6.4f" % (q, p, H(q, p)))
        (q, p) = integrator(dhdp, dhdq, q, p, dt)
        qs.append(q.copy())
    return np.asarray(qs)

We plot the simulations using matplotlib and save them.
NITERS = 15
TIMESTEP = 1

print("planet simulation with leapfrog")
planet_leapfrog = planet(leapfrog, NITERS, TIMESTEP)

plt.rcParams.update({'font.size': 12, 'font.family':'monospace'})
fig, ax = plt.subplots()
print(planet_leapfrog)
ax.plot(planet_leapfrog[:, 0], planet_leapfrog[:, 1], label='leapfrog',
linewidth=3, color='#00ACC1')
print("planet simulation with euler")
planet_euler = planet(euler, NITERS, TIMESTEP)
ax.plot(planet_euler[:, 0], planet_euler[:, 1], label='euler',
linewidth=3, color='#D81B60')

legend = plt.legend(frameon=False)
ax.set_title("leapfrog v/s euler: NITERS=%s dt=%s" % (NITERS, TIMESTEP))
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['bottom'].set_visible(False)
ax.spines['left'].set_visible(False)
plt.savefig("leapfrog-vs-euler.png")
plt.show()


### § Comparison of forward and reverse mode AD

Quite a lot of ink has been spilt on this topic. My favourite reference is the one by Rufflewind. However, none of these references have a good stock of examples of the difference. So here, I catalogue explicit computations comparing forward mode AD and reverse mode AD.

In forward mode AD, we fix how much the inputs wiggle with respect to a parameter t, and figure out how much the output wiggles with respect to t. If output = f(input1, input2, …, inputn), then ∂output/∂t = ∑_i ∂f/∂input_i · ∂input_i/∂t.

In reverse mode AD, we fix how much the parameter t wiggles with respect to the outputs, and figure out how much t wiggles with respect to the inputs. If output_i = f_i(input, …), then ∂t/∂input = ∑_i ∂t/∂output_i · ∂f_i/∂input. This is a much messier expression, since we need to accumulate the data over all outputs. Essentially, deriving output from input is easy, since how to compute an output from an input is documented in one place. Deriving input from output is annoying, since many outputs can depend on a single input.

The upshot is that if we have few "root outputs" (like a loss function), we need to run reverse mode AD only once with respect to that output, and we get the wiggles of all the inputs at the same time, since we compute the wiggles from output to input. The first example, z = max(x, y), captures the essential difference between the two approaches succinctly. Study this, and everything else will make sense.

#### § Maximum: z = max(x, y)

• Forward mode equations:

z ≡ max(x, y)
∂x/∂t = ?,    ∂y/∂t = ?
∂z/∂t = ∂x/∂t if x > y, else ∂y/∂t

We can compute ∂z/∂x by setting t = x, that is, ∂x/∂t = 1, ∂y/∂t = 0. Similarly, we can compute ∂z/∂y by setting t = y, that is, ∂x/∂t = 0, ∂y/∂t = 1. If we want both gradients ∂z/∂x and ∂z/∂y, we have to rerun the above equations twice with the two initializations. In our equations, we are saying that we know how sensitive the inputs x, y are to a given parameter t. We are deriving how sensitive the output z is to t, as a composition of x, y. If x > y, then we know that z is as sensitive to t as x is.
• Reverse mode equations:

z ≡ max(x, y)
∂t/∂z = ?
∂t/∂x = ∂t/∂z if x > y, else 0
∂t/∂y = ∂t/∂z if y > x, else 0

We can compute ∂z/∂x and ∂z/∂y in one shot by setting t = z, that is, ∂t/∂z = 1. In our equations, we are saying that we know how sensitive the parameter t is to a given output z. We are trying to see how sensitive t is to the inputs x, y. If x is active (ie, x > y), then t is indeed sensitive to x and ∂t/∂x = 1. Otherwise, it is not sensitive, and ∂t/∂x = 0.
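The two modes for z = max(x, y) transcribe directly into code. A small sketch (the names max_fwd and max_rev are mine): forward mode pushes tangents from the inputs to the output, while reverse mode pulls one sensitivity from the output back to both inputs at once:

```python
def max_fwd(x, y, dxdt, dydt):
    # forward: given dx/dt and dy/dt, produce z and dz/dt
    z = max(x, y)
    dzdt = dxdt if x > y else dydt
    return z, dzdt

def max_rev(x, y, dtdz):
    # reverse: given dt/dz, produce (dt/dx, dt/dy) in one pass
    dtdx = dtdz if x > y else 0.0
    dtdy = dtdz if y > x else 0.0
    return dtdx, dtdy

# dz/dx needs one forward run with (dx/dt, dy/dt) = (1, 0); dz/dy needs another
print(max_fwd(3.0, 2.0, 1.0, 0.0))  # (3.0, 1.0): z = 3, dz/dx = 1
print(max_fwd(3.0, 2.0, 0.0, 1.0))  # (3.0, 0.0): z = 3, dz/dy = 0
# reverse mode recovers both gradients from a single run with dt/dz = 1
print(max_rev(3.0, 2.0, 1.0))       # (1.0, 0.0)
```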

#### § sin: z = sin(x)

• Forward mode equations:

z ≡ sin(x)
∂x/∂t = ?
∂z/∂t = ∂z/∂x · ∂x/∂t = cos(x) · ∂x/∂t

We can compute ∂z/∂x by setting t = x, that is, setting ∂x/∂t = 1.
• Reverse mode equations:

z ≡ sin(x)
∂t/∂z = ?
∂t/∂x = ∂t/∂z · ∂z/∂x = ∂t/∂z · cos(x)

We can compute ∂z/∂x by setting t = z, that is, setting ∂t/∂z = 1.

#### § addition: z = x + y:

• Forward mode equations:

z ≡ x + y
∂x/∂t = ?,    ∂y/∂t = ?
∂z/∂t = ∂z/∂x · ∂x/∂t + ∂z/∂y · ∂y/∂t
= 1 · ∂x/∂t + 1 · ∂y/∂t = ∂x/∂t + ∂y/∂t

• Reverse mode equations:

z ≡ x + y
∂t/∂z = ?
∂t/∂x = ∂t/∂z · ∂z/∂x = ∂t/∂z · 1 = ∂t/∂z
∂t/∂y = ∂t/∂z · ∂z/∂y = ∂t/∂z · 1 = ∂t/∂z

#### § multiplication: z = xy

• Forward mode equations:

z ≡ x y
∂x/∂t = ?,    ∂y/∂t = ?
∂z/∂t = ∂z/∂x · ∂x/∂t + ∂z/∂y · ∂y/∂t
= y · ∂x/∂t + x · ∂y/∂t
• Reverse mode equations:

z ≡ x y
∂t/∂z = ?
∂t/∂x = ∂t/∂z · ∂z/∂x = ∂t/∂z · y
∂t/∂y = ∂t/∂z · ∂z/∂y = ∂t/∂z · x

As before, we can compute both ∂z/∂x and ∂z/∂y in one shot by setting ∂t/∂z = 1.