A Universe of Sorts
§ Siddharth Bhat
§ Table of contents:
 Sanskrit and Sumerian
 Writing Cuneiform
 The code of hammurabi
 The implicit and inverse function theorem
 Whalesong hyperbolic space in detail
 Motivating Dijkstra's
 Intuitions for hyperbolic space
 Product of compact spaces is compact
 Hyperbolic groups have solvable word problem
 Elementary uses of Sheaves in complex analysis
 Snake lemma
 Kernel, cokernel, image
 The commutator subgroup
 Simplicity of A5 using PSL(2, 5)
 A5 is not solvable
 Complex orthogonality in terms of projective geometry
 Arithmetic sequences, number of integers in a closed interval
 The arg function, continuity, orientation
 Odd partitions, unique partitions
 Continued fractions, mobius transformations
 Permutations and Lyndon factorization
 Graphs are preorders
 Crash course on domain theory
 Parallelisable version of maximum sum subarray
 Thoughts on implicit heaps
 Discriminant and Resultant
 Polynomial root finding using QR decomposition
 A hacker's guide to numerical analysis (WIP)
 Mobius inversion on Incidence Algebras
 Finite differences and Umbral calculus
 Permutahedron
 Lyndon + Christoffel = Convex Hull
 Geometric proof of e^x ≥ 1+x, e^(−x) ≥ 1−x
 Ranking and Sorting
 Proof of minkowski convex body theorem
 Burrows Wheeler (WIP)
 Intuitionistic logic as a Heyting algebra
 Edit distance
 Evolution of bee colonies
 Best practices for array indexing
 Algebraic structure for vector clocks
 Networks are now faster than disks
 Einstein-de Haas effect
 Rank-select as adjunction
 Bounding chains: uniformly sample colorings
 Coupling from the past (WIP)
 Word problems in Russia and America
 Encoding mathematical hierarchies
 Learning code by hearing it
 Your arm can be a spinor
 Self modifying code for function calls
 Adjunctions as advice
 Reversible computation as groups on programs
 Blazing fast math rendering on the web
 VC dimension
 Symplectic version of classical mechanics (WIP)
 Theorems for free
 How to reason with half-open intervals
 how does one build a fusion bomb?
 Christoffel symbols, geometrically
 A natural vector space without an explicit basis
 Cache oblivious B trees
 Krohn-Rhodes decomposition (WIP)
 Proving block matmul using program analysis (WIP)
 Why I like algebra over analysis
 'using' for cleaner function type typedefs
 A walkway of lanterns (WIP)
 Natural transformations
 The hilarious commentary by dinosaure in OCaml git
 How to link against MLIR with CMake
 Energy as triangularizing state space
 The cutest way to write semidirect products
 My Favourite APLisms
 Proof of chinese remainder theorem on rings
 monic and epic arrows
 The geometry of Lagrange multipliers
 Efficient tree transformations on GPUs (WIP)
 Things I wish I knew when I was learning APL
 Every ideal that is maximal wrt. being disjoint from a multiplicative subset is prime
 Getting started with APL
 SpaceChem was the best compiler I ever used
 Mnemonic for Kruskal and Prim
 Legendre transform
 Cartesian Trees
 DFS numbers as a monotone map
 Self attention? not really
 Coarse structures
 Matroids for greedy algorithms (WIP)
 Grokking Zariski (WIP)
 My preferred version of quicksort
 Geometric proof of Cauchy Schwarz inequality
 Dataflow analysis using Grobner basis
 Fenwick trees and orbits
 Dirichlet inversion (WIP)
 Incunabulum for the 21st century: Making the J interpreter compile in 2020
 An example of a sequence whose successive terms get closer together but isn't Cauchy (does not converge)
 Krylov subspace method
 Good reference to the Rete pattern matching algorithm
 Leapfrog Integration
 Comparison of forward and reverse mode automatic differentiation
 An invitation to homology and cohomology: Part 1  Homology
 An invitation to homology and cohomology: Part 2  Cohomology
 Stuff I learnt in 2019
 A motivation for p-adic analysis
 Line of investigation to build physical intuition for semidirect products
 Topology is really about computation  part 2
 Topology is really about computation  part 1
 PSLQ algorithm: finding integer relations between reals
 Geometric characterization of normal subgroups
 Radical ideals, nilpotents, and reduced rings
 My disenchantment with abstract interpretation
 Computing equivalent gate sets using grobner bases
 The janus programming language  Time reversible computation
 A = B, a book about proofs of combinatorial closed forms (TODO link)
 Generating k-bitsets of a given length n
 Bondi k-calculus
 Vivado toolchain craziness
 What the hell is a Grobner basis? Ideals as rewrite systems
 Lie bracket versus torsion
 Spatial partitioning data structures in molecular dynamics
 Vector: Arthur Whitney and text editors
 Discrete random distributions with conditioning in 20 lines of haskell
 Everything you know about word2vec is wrong
 Small Haskell MCMC implementation
 Debugging debug info in GHC
 GHC LLVM code generator: Switch to unreachable
 Concurrency in Haskell
 Handy list of differential geometry definitions
 Lazy programs have space leaks, Strict programs have time leaks
 Presburger arithmetic can represent the Collatz Conjecture
 Using compactness to argue about covers
 Stephen wolfram's live stream
 Japanese Financial Counting system
 Cleave as a word has some of the most irregular inflections
 McCune's single axiom for group theory
 Arthur Whitney: dense code
 How does one work with arrays in a linear language?
 Linear optimisation is the same as linear feasibility checking
 Quantum computation without complex numbers
 Linguistic fun fact: Comparative Illusion
 Stuff I learnt in 2018
 Stuff I learnt in 2017
 Reading the structs library
 Reading the machines library (WIP)
 Explaining laziness (WIP)
 Explaining STG (WIP)
 Simplexhc: proc points suck / making GHC an order of magnitude faster
 Simplexhc: dec 2017
 Simplexhc: oct 29 2017
 Simplexhc: july 2017
 Simplexhc: july 6th 2017
 Simplexhc: announcement
 GSoC 2015 proposal
 GSoC 2015 week 1
 GSoC 2015 week 3 and 4
 GSoC 2015 week 5
 GSoC 2015 week 6
 GSoC 2015 week 7
 GSoC 2015 final report
 Link Dump
 Big list of emacs gripes
 Big list of Coq
 Big list of writing
 Big list of Latex
 Big list of Architecture
 Big list of Recipes
 Big list of words
 Big list of Haiku
§ Sanskrit and Sumerian
§ Sanskrit's precursors
The oldest known Sanskrit text that has been found is the Rig Veda. Everything
after it is the study of Sanskrit proper. This is quite problematic, because the
Rig Veda is a complete and consistent textbook with respect to language. It's
a black box in terms of language evolution.
The question "what is the precursor" asks for a method of study to determine
the precursor. We just don't know because we have no writings, text, etc.
§ Archaeological data
Archaeological data is problematic as well. We don't know where the people who
knew Sanskrit came from. Sanskrit was spoken in the northern part of
Hindusthan [what they originally called Bhrahmanagar (?)]. While we can
try to understand where it comes from, it's hard. The script is Brahmi / Devanagiri,
which means used by Brahma or god. The name "sanskrit" is a compound that stands
for "well-formed". It's really clean-slate in that sense. The study of Indo-Aryan
languages in the Indian subcontinent has only one possible known history, which
stops at the Rig Veda. We don't know the relationship between 2400 BC, when the
Rig Veda was written, and anything before it.
§ Non Vedic sanskrit
Non-Vedic Sanskrit is studied in relation to the "true" proto-Indo-European (PIE)
language. The conjecture is that this Indo-European language ought to be
able to cover Middle Greek, Hittite, and Sanskrit.
§ Prakrit and its relationship to this story
Prakrit evolved as a vernacular of Sanskrit in the north pahadi region. Hindi
as we know it today evolved from Hindusthani, which came from languages in
northern India. Languages like Marathi, Marvadi, Gujurathi, etc. came a lot
before Hindi did.
§ TLDR on Sanskrit v/s Hindi
There is a huge gap of time between Sanskrit, Prakrit, Pali, and Hindi.
Hindi evolved around the 1600s due to the Mughals, who used to speak a
vernacular of Hindusthani. Kabir also wrote in Hindusthani. There was also
some Farsi and Urdu mixed in.
Hindi is more of a political exploit than an actual language ~ Alok 2020
§ The relationship to Sumerian
We don't know what the relationship to Sumerian is. Social expectations that
were set up in Sumerian have become more stringent in Sanskrit.
I've been reading about the Sumerian people, and I've gotten fascinated with
the question of how to write in Cuneiform, which is their script. I wanted
to learn how to write cuneiform. It appears that it was originally written
down by pressing reed styluses into clay. The script is syllabic in nature.
We have three components:
 Vertical wedge
𐏑
 Horizontal wedge
𐎣
 Diagonal
𐏓
§ References
I've wanted to read the Code of Hammurabi since it was name-dropped in
Snow Crash by Neal Stephenson.
I finally got around to it. Here are some excerpts I found fascinating.
§ References
I keep forgetting the precise conditions of these two theorems. So here
I'm writing it down as a reference for myself.
§ Implicit function: Relation to function
 Example 1: from x^{2} + y^{2} = 1 to y = √(1 − x^{2}).
If we have a function
y = g(p, q), we can write this as
y − g(p, q) = 0.
This can be taken as an implicit function
h(y; p, q) = y − g(p, q). We then
want to recover the explicit version of
y = g′(p, q) such that
h(g′(p, q); p, q) = 0. That is, we recover the original explicit formulation
of
y = g′(p, q) in a way that satisfies
h.
§ The 1D linear equation case
In the simplest possible case, assume the relationship between
y and
p
is a linear one, given implicitly. So we have
h(y; p) = α y + β p + γ = 0.
Solving for
h(y,p) = 0, one arrives at:
y = −1/α (β p + γ).
 Note that the solution exists iff
α ≠ 0.
 Also note that the existence of the solution is equivalent to asking that
∂ h / ∂ y = α ≠ 0.
§ The circle case
In the circle case, we have h(y; p) = p^{2} + y^{2} − 1. We can write
y = ± √(1 − p^{2}). These are two solutions, not one, and hence
a relation, not a function.
 We can however build two functions by taking the two branches:
y+ = +√(1 − p^{2}); y− = −√(1 − p^{2}).
 In this case, we have ∂ h / ∂ y = 2y, which changes sign
for the two solutions. If y^{⋆} > 0, then (∂ h / ∂ y)(y^{⋆}) > 0.
Similarly for the negative case.
§ Assuming that a solution for
h(y, p) exists
Let us say we wish to solve
h(y; p) = y^{3} + p^{2} − 3 yp − 7 = 0. Let's assume
that we have a solution
y = sol(p) around the point
(y=3, p=4). Then we
must have:
sol(p)^{3} + p^{2} − 3 sol(p) p − 7 = 0. Differentiating by
p,
we get:
3 sol(p)^{2} sol′(p) + 2p − 3 sol′(p) p − 3 sol(p) = 0. This gives
us the condition on the derivative:
3 sol(p)^{2} sol′(p) + 2p − 3 sol′(p) p − 3 sol(p) = 0
sol′(p) [3 sol(p)^{2} − 3p] = 3 sol(p) − 2p
sol′(p) = [3 sol(p) − 2p] / [3 sol(p)^{2} − 3p]

The above solution exists iff 3 sol(p)^{2} − 3p ≠ 0. This quantity is again
∂ h / ∂ y.
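As a sanity check, we can solve h(y; p) = 0 numerically with Newton's method in y (holding p fixed) and compare a finite-difference estimate of sol′(4) against the formula above. Note that h(3; 4) = 27 + 16 − 36 − 7 = 0, so (y=3, p=4) really is on the curve. The helper names here are mine:

```python
# h(y; p) = y^3 + p^2 - 3 y p - 7
def h(y, p):
    return y**3 + p**2 - 3 * y * p - 7

def dh_dy(y, p):  # the crucial quantity dh/dy
    return 3 * y**2 - 3 * p

def sol(p, y0=3.0):
    # Newton's method in y, holding p fixed
    y = y0
    for _ in range(50):
        y -= h(y, p) / dh_dy(y, p)
    return y

eps = 1e-6
numeric = (sol(4 + eps) - sol(4 - eps)) / (2 * eps)
y_star = sol(4.0)
formula = (3 * y_star - 2 * 4) / (3 * y_star**2 - 3 * 4)
print(abs(numeric - formula) < 1e-6)  # True: both are 1/15
```

At (y=3, p=4) the formula gives (9 − 8)/(27 − 12) = 1/15, and the finite difference agrees.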
§ Application to economics
 We have two inputs, which are purchased as x_{1} units of input 1 and
x_{2} units of input 2.
 The price of the first input is
w_{1} BTC/unit. That of the second input is
w_{2} BTC/unit.
 We produce an output which is sold at price
w BTC/unit.
 For a given
(x_{1}, x_{2}) units of input, we can produce
x_{1}^{a} x_{2}^{b} units of output where
a + b < 1.
This is the Cobb-Douglas production function.
 The profit is going to be
profit(x_{1}, x_{2}, w_{1}, w_{2}, w) = w(x_{1}^{a} x_{2}^{b}) − w_{1} x_{1} − w_{2} x_{2}.
 We want to select
x_{1}, x_{2} to maximize profits.
 Assume we are at breakeven:
profit(x_{1}, x_{2}, w_{1}, w_{2}, w) = 0.
 The implicit function theorem allows us to understand how any variable changes
with respect to any other variable. It tells us that locally, for example,
that the number of units of the first input we buy (
x_{1}) is a function
of the price
w_{1}. Moreover, we can show that it's a decreasing function
of the price.
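A crude way to see the last claim is a grid search over input bundles. This is a sketch with made-up parameter values (a = b = 0.3, w = 10, w2 = 1; the function names are mine): the profit-maximizing x_{1} falls as its price w_{1} rises.

```python
# profit(x1, x2) = w * x1^a * x2^b - w1*x1 - w2*x2, Cobb-Douglas with a+b < 1
def best_x1(w1, w2=1.0, w=10.0, a=0.3, b=0.3):
    grid = [i / 5 for i in range(1, 150)]  # coarse grid of input quantities
    def profit(x1, x2):
        return w * x1**a * x2**b - w1 * x1 - w2 * x2
    best = max(((x1, x2) for x1 in grid for x2 in grid),
               key=lambda pair: profit(*pair))
    return best[0]

# the optimal amount of input 1 decreases as its price w1 increases
print(best_x1(1.0) > best_x1(2.0) > best_x1(4.0))  # True
```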
§ Inverse function: Function to Bijection
 Given a differentiable function
f, at a point
p, we will have a continuous inverse
f^{−1}(p) if the derivative
f′(p) is locally invertible.
 The intuition is that we can approximate the original function with a linear
function.
y = f(p + δ) = f(p) + f′(p) δ. Now since
f′(p) is
locally invertible, we can solve for
δ.
y = f(p) + f′(p) δ implies that δ = [y − f(p)] / f′(p).
This gives us the preimage (p + δ) ↦ y.
 The fact that
1/f′(p) is nonzero is the key property. This generalizes
in multiple dimensions to saying that
f′(p) is invertible.
One perspective we can adopt is that of Newton's method. Recall that Newton's
method allows us to find
x^{*} for a fixed
y^{*} such that
y^{*} = f(x^{*}). It follows
the exact same process!
 We start with some
x[1].
 We then find the tangent
f′(x[1]).
 We draw the tangent at the point
x[1] as
(y[2] − y[1]) = f′(x[1])(x[2] − x[1]).
 To find the
y^{*} we set
y[2] = y^{*}.
 This gives us
x[2] = x[1] + (y^{*} − y[1])/f′(x[1]).
 Immediately generalizing, we get
x[n+1] = x[n] + (y^{*} − y[n]) / f′(x[n]).
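The iteration above fits in a few lines; here is a sketch (the names invert and df are mine) that inverts f(x) = x^3 at y* = 2, i.e. computes a cube root:

```python
def invert(f, df, y_star, x0, iters=50):
    # Newton: slide along the tangent to push f(x) toward y_star
    x = x0
    for _ in range(iters):
        x = x + (y_star - f(x)) / df(x)
    return x

# recover the preimage of y* = 2 under f(x) = x^3, i.e. the cube root of 2
x = invert(lambda t: t**3, lambda t: 3 * t**2, 2.0, 1.0)
print(abs(x**3 - 2.0) < 1e-12)  # True
```

The iteration only makes sense because df(x) stays invertible (nonzero) along the way, which is exactly the hypothesis of the inverse function theorem.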
§ References
We can build a toy model of a space where velocity increases with depth.
Let the xy axes be: left-to-right (→) is positive x, top-to-bottom (↓) is positive y.
Now let the velocity at a given location
(x^{⋆}, y^{⋆}) be
(y^{⋆}+1, 1).
That is, the velocity along y is constant, while the velocity along x
increases linearly with the current depth.
Under such a model, our shortest paths will be 'curved' paths.
§ References
Usually I've seen Dijkstra's algorithm presented as a greedy algorithm,
with an analogy to "fire" given for the greediness. We can reverse
this presentation: start with the "fire", and then discretize the "fire" solution
to arrive at Dijkstra's algorithm.
§ The fire analogy
 Assume we want to find the shortest path from point
A to point
B. We
should find the path that fire travels. How do we do this? Well, we simply
simulate a fire going through all paths from our initial point,
and then pick the shortest path (ala Fermat). We interpret the graph
edges as distances between the different locations in space.
 So we write a function
simulate :: V × T → 2^{V × T}, which
when given a starting vertex
v : V and a time
t : T, returns the set
of vertices reachable in at most that time, and the time it took to reach them.
 Notice that we are likely repeating a bunch of work. Say the fire
from
x reaches
p, and we know the trajectory of the fire from
p. The
next time where a fire from
y reaches
p, we don't need to recalculate
the trajectory. We can simply cache this.
 Notice that this time parameter being a real number is pointless;
really the only thing that matters are the events (as in computational geometry),
where the time crosses some threshold.
 By combining the two pieces of information, we are led to the
usual implementation: Order the events by time using a queue, and then
add new explorations as we get to them.
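The discretized "fire" is precisely the textbook heap-based implementation. A minimal sketch (the graph encoding is my own choice): the heap holds the pending events ordered by arrival time, and the stale-event check plays the role of the cache.

```python
import heapq

def dijkstra(graph, source):
    # graph: {vertex: [(neighbour, edge_length), ...]}
    dist = {source: 0.0}
    frontier = [(0.0, source)]  # events: (arrival time of the fire, vertex)
    while frontier:
        t, v = heapq.heappop(frontier)
        if t > dist.get(v, float("inf")):
            continue  # stale event: the fire already reached v earlier
        for w, length in graph.get(v, []):
            if t + length < dist.get(w, float("inf")):
                dist[w] = t + length
                heapq.heappush(frontier, (t + length, w))
    return dist

g = {"A": [("B", 1), ("C", 4)], "B": [("C", 2)], "C": []}
print(dijkstra(g, "A"))  # {'A': 0.0, 'B': 1.0, 'C': 3.0}
```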
I have never seen an elementary account of this in a 'trust these facts, now
here is why hyperbolic groups have a solvable word problem'. I am writing
such an account for myself. It's an account for building intuition, so no
proofs will be provided except for the final theorem. All facts will be backed
by intuition. Since most of it is geometric, it's easy to convey intuition.
§ Graphs of groups, quasi-isometry.
 NOTE: I will consistently denote the inverse of
g by
g′.
We can convert any group into a graph by using the cayley graph of the group.
We characterize hyperbolic space as a space where we can build 'thin triangles'.
We also think of hyperbolic space as where geodesics from a given point
diverge (in terms of angle) exponentially fast.
The choice of generators for the cayley graph gives different graphs. We can
assign a unique geometric object by considering cayley graphs up to
quasi-isometry: the cayley graphs of a group with respect to different generating
sets are quasi-isometric. We can now try to study properties that are invariant
under quasi-isometry, since these are somehow 'represented faithfully by the geometry'.
§ Hyperbolicity enters the picture
We now say that a group is hyperbolic if the cayley graph of the group is
hyperbolic. We can show that hyperbolicity is preserved by quasi-isometry,
so this property does not depend on the generating set.
§ Isoperimetric Inequality
If we have a finitely presented group G = ⟨ S | R ⟩, and w is a word in the
free group Free(S) with [[w]] = 1, we will have:
w = ∏_{i} u_{i} r_{i}^{± 1} u_{i}′
This follows almost by definition. Since we have quotiented by the relators R,
we can have elements r ∈ R sandwiched between u and u′. We need the u′
since there's nothing else to cancel off the u.
§ Area of a word
Let G = ⟨ S | R ⟩, and let w be a word in S such that [[w]] = e.
The area of the word
w is the minimal number of such
u r u′ components
we need to write it down. Formally:
Area(w) ≡ min { n : w = ∏_{i=1}^{n} u_{i} r_{i}^{± 1} u_{i}′ }
I don't understand the geometric content of this definition. I asked
on mathoverflow.
§ Isoperimetric function for a group
f: ℕ → ℕ is a Dehn function or isoperimetric function
if the area of any such word is upper bounded by f applied to the word's
length. In some sense, the length of the word is the perimeter enclosing the
area, and this gives us a form of the isoperimetric inequality. Formally,
f is a Dehn function if for all words w ∈ F(S) such that [[w]] = e, we have
Area(w) ≤ f(|w|). Depending on the growth of f, we say that G has linear,
quadratic, exponential, etc. Dehn function.
§ Geometric content of Area
We define a map to be a finite, planar, oriented, connected and simply connected simplicial 2-complex (!).
A map
D is a diagram over an alphabet
S iff every edge
e ∈ D has a label
lbl(e) ∈ S such that
lbl(e^{−1}) = (lbl(e))^{−1}. Hang on: what does it mean to invert an edge?
I presume it means to go backwards along an edge. So we assume the graph is
directed, and we have edges in both directions.
A Van Kampen diagram over a group G = ⟨ S | R ⟩ is a diagram D over S
such that every face f of D has its boundary labelled by some
r^{± 1} : r ∈ R. The area of such a diagram is the number of faces.
§ Hyperbolic iff Linear Isoperimetric Inequality is satisfied
A finitely presented group is hyperbolic if and only if its cayley graph
satisfies the linear isoperimetric inequality.
§ Deciding if elements are conjugate to each other
 If we can answer the question of whether two elements are conjugate to each
other (does there exist a g such that ghg′ = k?), we can check whether
an element is equal to the identity:
 Pick
k = e. Then if we have
ghg′ = k = e, then
gh = g hence
h = e.
 If we can check that an element is equal to the identity, we can check for
equality of elements: two elements k, l are equal iff kl′ = e.
 So solving conjugacy automatically allows us to check for equality.
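This chain of reductions can be sanity-checked on a small concrete group. Below, conjugate_oracle is a brute-force stand-in (all names are mine) for a conjugacy decision procedure, running over S4:

```python
from itertools import permutations

# Work in S4: a permutation is a tuple p with p[i] the image of i.
n = 4
G = list(permutations(range(n)))
e = tuple(range(n))

def compose(p, q):  # apply q, then p
    return tuple(p[q[i]] for i in range(n))

def inverse(p):
    r = [0] * n
    for i, v in enumerate(p):
        r[v] = i
    return tuple(r)

def conjugate_oracle(h, k):
    # "does there exist g with g h g' = k?" -- brute force stand-in
    return any(compose(compose(g, h), inverse(g)) == k for g in G)

def is_identity(h):  # pick k = e: g h g' = e forces h = e
    return conjugate_oracle(h, e)

def equal(k, l):     # k = l iff k l' = e
    return is_identity(compose(k, inverse(l)))

a = (1, 0, 3, 2)
print(equal(a, a), equal(a, e))  # True False
```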
§ Proof that conjugacy is solvable for hyperbolic groups
e ----x----> x
 |           |
 g2          g1
 v           v
g2 <---x'--- x g1
Assume such an element x exists such that g2 = x g1 x′.
Now we have two triangles, (e, x, x g1) and (x g1, x′, g2). Let us focus on
(e, x, x g1). Assume the triangles in our space are δ-thin.
[Figure: the δ-thin triangle with vertices e, x, and x g1. A point p on the
side [e, x g1] lies within distance δ of one of the other two sides.]
Since our triangles are
δ thin, if the distance of the point
p ∈ [e xg1]
to
§ References
I always wanted to see sheaves in the wild in a setting that was both
elementary but 'correct': In that, it's not some perverse example
created to show sheaves (DaTaBaSeS arE ShEAvEs). Ahlfors has a great example
of this which I'm condensing here for future reference.
§ Sheaves: Trial 1
 We have function elements
(f: Ω → ℂ, Ω ⊆ ℂ).
f is complex analytic,
Ω is an open subset of
ℂ.
 Two function elements (f_{1}, Ω_{1}), (f_{2}, Ω_{2}) are said to be analytic
continuations of each other iff Ω_{1} ∩ Ω_{2} ≠ ∅, and f_{1} = f_{2} on the
set Ω_{1} ∩ Ω_{2}.

(f_{2}, Ω_{2}) can be called the continuation of (f_{1}, Ω_{1}) to the
region Ω_{2}.
 We will have that the analytic continuation of
f_{1} to
Ω_{2} is unique.
If there exists a function element
(g_{2}, Ω_{2}),
(h_{2}, Ω_{2}) such that
g_{2} = f_{1} = h_{2} in the region
Ω_{1} ∩ Ω_{2}, then by analyticity,
this agreement will extend to all of
Ω_{2}.
 Analytic continuation is therefore an equivalence relation (prove this!)
 A chain of analytic continuations is a sequence of
(f_{i}, Ω_{i}) such that
the adjacent elements of this sequence are analytic continuations of each other.
(f_{i}, Ω_{i}) analytically continues
(f_{i+1}, Ω_{i+1}).
 Every equivalence class of this equivalence relation is called a global
analytic function. Put differently, it's a family of function elements
(f, U) and
(g, V) such that we can start from
(f, U) and build
analytic continuations to get to
(g, V).
§ Sheaves: Trial 2
 We can take a different view, with
(f, z ∈ ℂ) such that
f
is analytic at some open set
Ω which contains
z. So we should
picture an
f sitting analytically on some open set
Ω which contains
z.
 Two pairs
(f, z),
(f′, z′) are considered equivalent if z = z′ and f = f′ in some
neighbourhood of z (= z′).
 This is clearly an equivalence relation. The equivalence classes are called germs.
 Each germ
(f, z) has a unique projection
z. We denote a germ of
f with projection
z
as
f_{z}.
 A function element
(f, Ω) gives rise to germs
(f, z) for each
z ∈ Ω.
 Conversely, every germ
(f, z) is determined by some function element
(f, Ω)
since we needed
f to be analytic around some open neighbourhood of
z: Call
this neighbourhood
Ω.
 Let
D ⊆ ℂ be an open set. The set of all germs
{ f_{z} : z ∈ D }
is called a sheaf over D. If we are considering analytic f, then this will
be known as the sheaf of germs of analytic functions over D. This sheaf
will be denoted as Sh(D).
 There is a projection
π: Sh(D) → D; (f, z) ↦ z. For a fixed
z0 ∈ D,
the inverse image π^{−1}(z0) is called the stalk over z0. It is
denoted by Sh(z0).

Sh carries both topological and algebraic structure. We can give the sheaf
a topology to talk about continuous mappings in and out of Sh.
It also carries a pointwise algebraic structure at each stalk: we can
add and subtract functions at each stalk; this makes it an abelian group.
§ Sheaf: Trial 3
A sheaf over
D is a topological space
Sh and a mapping
π: Sh → D
with the properties:

π is a local homeomorphism: each s ∈ Sh has an open neighbourhood U
such that π(U) is open, and the restriction of π to U is a homeomorphism.
 For each point
z ∈ D, the stalk
π^{−1}(z) ≡ S_{z} has the structure of an abelian
group.
 The group operations are continuous with respect to the topology of
Sh.
We will pick
D to be an open set in the complex plane; Really,
D can
be arbitrary.
§ Germs of analytic functions satisfy (Sheaf: Trial 3)
§ Why homomorphisms for chain maps?
First of all, to define a mapping between simplicial complexes
{ G_{i} }
and
{ H_{i} }, one might naively assume that we can ask for
functions
{ f_{i}: G_{i} → H_{i} }:
     ∂         ∂
G3 -----> G2 -----> G1 -----> 0
 |         |         |
 f         g         h
 v         v         v
 0 -----> H3 -----> H2 -----> H1
               ∂         ∂
Unfortunately, to be able to use the
machinery of Homology, we need the
{ f_{i} } to be abelian group homomorphisms.
However, this is no great loss. Intuitively, when we want to map complexes,
we first say where the
generators of the abelian group (
ℤmodule)
maps to; Everything else is determined by the generators. This aligns
nicely with our intuition of what a map between complexes should look like:
we tell where the geometry goes ("this edge goes there"), and the algebra
is "dragged along for the ride". This gives us the diagram:
G3 --∂--> G2 --∂--> G1
 |         |         |
 f3        f2        f1
 v         v         v
 0 --> H3 --∂--> H2 --∂--> H1
where the
fi
are
homomorphisms. So, this means we can talk about kernels and
images!
Ker(f3) --> Ker(f2) --> Ker(f1)
   |           |           |
   v           v           v
  G3 --∂-->   G2 --∂-->   G1 --> 0
   |           |           |
   f3          f2          f1
   v           v           v
Im(f3) --∂--> Im(f2) --∂--> Im(f1)
F → E → V → {0}
{0} → {e} → {v} → {0}
The Snake Lemma gives us a mapping d: Ker(f1) → Coker(f3) such that
the following long exact sequence is satisfied:
Ker(f3) → Ker(f2) → Ker(f1) →d→ Coker(f3) → Coker(f2) → Coker(f1)
§ What do we wish to compute?
 Now that we've agreed that this family of maps
{ f_{i} : G_{i} → H_{i} }
ought to be structured maps, the next question is "OK, now what? What does
one want to determine"? Ideally, we would get a new chain complex which
I tacitly denote as
{ f(G_{i}) }, consisting of the image of
G_{i} inside
H_{i} and the ability to determine its structure.
 However, this is the boring bit. We don't really care about the chain complex
{ f(G_{i}) } per se.
What we actually care about are the homology groups! So we would really like a tool
that allows us to compute
H_{i}(f(G)) in some convenient fashion.
Consider a linear map T: X → Y. We want to solve for { x : T(x) = y0 }.
 If we have an
x0 such that
T(x0) = y0, then see that
T(x0 + Ker(T)) = T(x0) + T(Ker(T)) = y0 + 0 = y0.
So the kernel gives us our degrees of freedom: how much can we change around without
changing the solution.
 Now consider the cokernel: Coker(T) = Y / Im(T). Suppose we want to find
a solution { x : T(x) = y0 }. If y0 ≠ 0 in Coker(T) (that is, y0 is not in
the image of T), then the solution set is empty. The cokernel tells us the
obstruction to a solution.
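A concrete sketch with a singular 2×2 matrix (the example and names are mine): the kernel parametrizes the solution set, and a target with a nonzero class in the cokernel has no solution.

```python
# T: R^2 -> R^2 given by the singular matrix [[1, 2], [2, 4]]
def T(x):
    return (x[0] + 2 * x[1], 2 * x[0] + 4 * x[1])

x0 = (1.0, 0.0)   # a particular solution of T(x) = (1, 2)
k = (2.0, -1.0)   # spans Ker(T)
assert T(k) == (0.0, 0.0)

# the whole solution set is x0 + t*k: the kernel is the degrees of freedom
for t in (0.0, 1.0, -3.5):
    x = (x0[0] + t * k[0], x0[1] + t * k[1])
    assert T(x) == (1.0, 2.0)

# Im(T) = span{(1, 2)}; the class of (1, 0) in Coker(T) = Y/Im(T) is nonzero,
# so T(x) = (1, 0) has no solution: the cokernel is the obstruction.
print("kernel = freedom, cokernel = obstruction")
```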
Define the commutator of
g, h as
[g, h] ≡ ghg^{−1}h^{−1}.
The subgroup generated by all commutators in a group is called the
commutator subgroup, sometimes denoted [G, G].
 We need to consider generation: consider the free group on 4 letters
G = ⟨ a, b, c, d ⟩. Now [a, b] · [c, d] has no expression in terms of a
single commutator [α, β].
 In general, the elements of the commutator subgroup will be products
of commutators.
 It measures the degree of nonabelianness of the group.
G/[G, G] is
the largest quotient of
G that is abelian. Alternatively,
[G, G]
is the smallest normal subgroup we need to quotient by to get an abelian
quotient. This quotienting is called abelianization.
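We can compute the commutator subgroup by brute force for a small group. For S3, the commutators generate A3, so the abelianization S3/[S3, S3] is ℤ2 (helper names are mine):

```python
from itertools import permutations

n = 3
G = list(permutations(range(n)))

def compose(p, q):  # apply q, then p
    return tuple(p[q[i]] for i in range(n))

def inverse(p):
    r = [0] * n
    for i, v in enumerate(p):
        r[v] = i
    return tuple(r)

def sign(p):  # parity via inversion count
    s = 1
    for i in range(n):
        for j in range(i + 1, n):
            if p[i] > p[j]:
                s = -s
    return s

commutators = {compose(compose(g, h), compose(inverse(g), inverse(h)))
               for g in G for h in G}
# close under products: the subgroup generated by all commutators
subgroup = set(commutators)
while True:
    bigger = {compose(a, b) for a in subgroup for b in subgroup}
    if bigger <= subgroup:
        break
    subgroup |= bigger

A3 = {p for p in G if sign(p) == 1}
print(subgroup == A3)  # True: [S3, S3] = A3
```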
§ Presentation of A5
We take on faith that A5 has the presentation:
⟨ a, b | a^{2} = b^{3} = (ab)^{5} = 1 ⟩
If I find a nice proof of this isomorphism, or some other way to derive
the fact that PSL(2, 5) is isomorphic to A5, I will fill this up.
§ Step 1: PSL(2, 5) is isomorphic to A5
PSL(2, 5) consists of projective Mobius transformations with function composition
as the group operation. Here, we freely use the linear algebraic relationship
between transformations of the form (az + b)/(cz + d) and matrices
[a b; c d].
a, b, c, d ∈ ℤ5, ad − bc = 1
f: ℤ5 ∪ { ∞ } → ℤ5 ∪ { ∞ }
f(z) ≡ (az + b)/(cz + d)
 We allow coefficients for the Mobius transform to be from
ℤ5,
and we allow the domain and codomain of the function to be projectivized: so we
add a point at infinity to
ℤ5.
 We construct a map from
PSL(2, 5) to
A5 and then show that this map is
an isomorphism. We exploit the presentation of A5 to find elements
p, q ∈ PSL(2, 5) such that p^{2} = q^{3} = (pq)^{5} = I. We can link this
to the presentation of A5, which requires precisely those relations.
 For an element of order 3, we pick q(z) = 1/(1 − z).
q(z) = 1/(1 − z)
q(q(z)) = 1/(1 − 1/(1 − z)) = (1 − z)/((1 − z) − 1) = (1 − z)/(−z) = 1 − 1/z
q(q(q(z))) = 1/(1 − (1 − 1/z)) = 1/(1/z) = z
 I don't know of a principled way to arrive at this choice of q(z), except
by noticing that az + b does not work, and neither does 1/z. The next
simplest choice is something of the form 1/(1 − z). If there is a nicer way,
I'd love to know.
 For a function of order 5, we have to use the structure of the finite field
somehow. We can consider the function r(z) = 1 + z. On repeating this 5
times, we will get 5 + z = z. However, it is hard to connect r(z) = 1 + z
to the previous choice of q(z) = 1/(1 − z).
 We use the same idea for r(z), and pick r(z) = z − 1. This will allow
us to accumulate −1s till we hit 5 = 0.
 To get r(z) = z − 1, we need to compose q(z) = 1/(1 − z) with p(z) = −1/z.
This p(z) is of order 2.
To recap, we have achieved a set of functions:
p(z) = −1/z [order 2]
q(z) = 1/(1 − z) [order 3]
r(z) = z − 1 [order 5]
r = p ∘ q, since p(q(z)) = −1/[1/(1 − z)] = −(1 − z) = z − 1
That is, we have found elements in PSL(2, 5) such that
p^{2} = q^{3} = (pq)^{5} = 1.
This gives us the [surjective] map from PSL(2, 5) onto A5.
 By a cardinality argument, we know that the size of PSL(2, 5) is 60. Hence,
since PSL(2, 5) and A5 have the same number of elements, this map
must be a bijection.
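The orders of p, q, and pq can be verified by brute force over ℤ5 ∪ {∞}. A sketch of the arithmetic (the encoding is mine; −1 is written as 4 mod 5, and three-argument pow, Python 3.8+, computes modular inverses):

```python
P = 5
INF = "inf"  # the projective point at infinity

def mobius(a, b, c, d):
    # z -> (a z + b) / (c z + d) over Z5 with a point at infinity
    def f(z):
        if z == INF:
            return INF if c % P == 0 else (a * pow(c, -1, P)) % P
        num, den = (a * z + b) % P, (c * z + d) % P
        return INF if den == 0 else (num * pow(den, -1, P)) % P
    return f

def order(f):
    # smallest n with f^n = identity on the whole domain
    domain = list(range(P)) + [INF]
    g, n = {z: f(z) for z in domain}, 1
    while any(g[z] != z for z in domain):
        g, n = {z: f(g[z]) for z in domain}, n + 1
    return n

p = mobius(0, 4, 1, 0)   # p(z) = -1/z      (-1 = 4 mod 5)
q = mobius(0, 1, 4, 1)   # q(z) = 1/(1 - z)
pq = lambda z: p(q(z))   # (p . q)(z) = z - 1
print(order(p), order(q), order(pq))  # 2 3 5
```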
§ Step 2: PSL(2, 5) is simple
TODO! I'm still reading Keith Conrad's notes.
§ References
There are many accounts of why A5 is not solvable on the internet. I'm recording my
version here, because the proof involves certain ad-hoc choices which I want
to make sure I can find offhand in the future.
We'll show that [A5, A5] = A5, thereby proving that A5 is not solvable.
This is useful for Galois theory, where we want to show that A5 cannot be
built as extensions of smaller cyclic groups.
§ Notation
I'll be using nonstandard notation: (12);(34) means 'perform (12), then
perform (34)'. I find this notation makes permutation composition intuitive
for me. The ; is evocative of C-style languages, where we are ending a
statement. I will be consistently using [g, h] ≡ ghg^{−1}h^{−1} to denote
the commutator.
§ Permutations in A5
First, recall that A5 only has the even permutations in S5. So each element
is built from zero, two, or four transpositions. There can't be more after
simplification, since S5 only permutes 5 elements: the longest reduced
sequence of transpositions we can use is (12)(23)(34)(45). So, in A5, we have:
 The identity permutation ().
 The products (ij)(kl) where {i, j} and {k, l} do not overlap. From these,
we get the (2,2)-cycles.
 The products (ij)(kl) where {i, j} and {k, l} overlap. Here we cannot have
{i, j} = {k, l}, since then we would just get the identity. So, let us assume
that we have j = k. If we have any other equality, we can always flip the
transpositions around to get to the normal form j = k:
(23);(12)
= (32);(12) [(23) = (32)]
= (32);(21) [(12) = (21)]
 In this case, we can show that such a product must be a 3-cycle:
[a b c] --(32)--> [a c b] --(21)--> [c a b]
 Intuitively, we are pushing the element c backward, and allowing the
other elements to take its place, using the permutation (23);(12).
 So, from the products of the form (ij)(kl) where {i, j} and {k, l}
intersect, we get the 3-cycles.
 Finally, we can have products of the form (12)(23)(34)(45). They must be
of this form, or some permutation of this form; otherwise, we would have
repeated elements, since these transpositions are packed "as close as
possible". These generate the 5-cycles.
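This census can be verified by enumerating A5 and counting cycle types (helper names are mine): 1 identity, 15 double transpositions, 20 3-cycles, and 24 5-cycles, totalling 60 elements.

```python
from itertools import permutations

n = 5

def sign(p):  # parity via inversion count
    s = 1
    for i in range(n):
        for j in range(i + 1, n):
            if p[i] > p[j]:
                s = -s
    return s

def cycle_type(p):  # multiset of cycle lengths, largest first
    seen, lengths = set(), []
    for i in range(n):
        if i not in seen:
            length, j = 0, i
            while j not in seen:
                seen.add(j)
                j, length = p[j], length + 1
            lengths.append(length)
    return tuple(sorted(lengths, reverse=True))

A5 = [p for p in permutations(range(n)) if sign(p) == 1]
census = {}
for p in A5:
    census[cycle_type(p)] = census.get(cycle_type(p), 0) + 1
print(sorted(census.items()))
# [((1, 1, 1, 1, 1), 1), ((2, 2, 1), 15), ((3, 1, 1), 20), ((5,), 24)]
```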
§ A5 is generated by 3-cycles.
We claim that we can write any element of A5 in terms of 3-cycles.
 The disjoint transpositions of the type (34)(12) can be written as
(34)(23)(23)(12), because (23)(23) = e. This can be further broken down
into ((34)(23))((23)(12)), which is two 3-cycles: (234); (123).
 The non-disjoint transpositions of the type (32)(21) are 3-cycles:
(32)(21) = (123).
 Any 5-cycle can be written as two 3-cycles: (45)(34)(23)(12) can be written
as ((45)(34))((23)(12)), which is two 3-cycles: (345); (123).
So, if we figure out how to write 3-cycles in terms of commutators, we win:
the commutator subgroup of A_{n} is generated by elements that can be written
as [g, h]. If we can show that 3-cycles can be written as [g, h], then every
other element has a representation in terms of these 3-cycles, and is
therefore an element of the commutator subgroup.
§ 3-cycles can be generated as commutators of 2-cycles:
 We saw how we can write a 3-cycle of the form C = (123) as (32)(21).
We wish to write this as the commutator of two elements g, h: C = [g, h].
 The idea is that we have the leftover elements 4, 5 that are unused by C
in A5 [here is where 5 is important: 3 + 2 = 5, and we need two leftover
elements].
 We can use these two leftover elements 4, 5 to build elements g, h
which cancel off, leaving us with (32)(21). We start with g = (32)___,
h = (21)___ where the ___ is to be determined:
(32)___ ; (21)___ ; ___(32) ; ___(21)
  [g]      [h]      [g']     [h']
 It is important that g and h contain another transposition, because they
must be members of A5! We need them to be products of an even number of
transpositions.
 We insert (45) everywhere. These (45) can slide over the (21) and (32) and
thereby harmlessly cancel:
(32)(45) ; (21)(45) ; (45)(32) ; (45)(21)
  [g]       [h]       [g']      [h']
 Simplify the above expression by moving the
(45)
over (21), (32)
:
(32)(21)(45)(45)(32)(45)(45)(21)
g h g^1 h^1
(32)(21)(32)(21)
g h g^1 h^1
So we are left with (32)(21)(32)(21). This is the square of what we really
wanted, C = (32)(21). However, since C is a 3-cycle, we know that
C = C^{-2}. So, we can start with C^{-1} and use our trick to generate
(C^{-1})^2 = C^{-2}, which is equal to C. Since this works for any C, we
have shown that we can generate 3-cycles from commutators in A5.
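A small numerical check of the trick (helpers are mine, composing
left-to-right as the post does):

```python
def cycle(*xs):
    return {xs[i]: xs[(i + 1) % len(xs)] for i in range(len(xs))}

def compose(*ps):
    # left-to-right composition; fixed points are dropped from the dict
    out = {}
    for x in set().union(*ps):
        y = x
        for p in ps:
            y = p.get(y, y)
        if y != x:
            out[x] = y
    return out

g = compose(cycle(3, 2), cycle(4, 5))    # (32)(45)
h = compose(cycle(2, 1), cycle(4, 5))    # (21)(45)
# g and h are involutions, so [g, h] = g h g^-1 h^-1 = g h g h
comm = compose(g, h, g, h)
C = cycle(1, 2, 3)                        # C = (32)(21) = (123)
assert comm == compose(C, C)              # the trick yields C^2, not C
assert compose(C, C, C) == {}             # C^3 = e, hence C = C^{-2}
```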
§ Alternate viewpoint on above proof
We have a 3-cycle s = (a b c). We first find a square root t such that
t*t = s. To do this, we make t have the entries of s spread out in gaps
of 2:
t = (a _ _)
t = (a _ b) [+2]
t = (a c b) [+2, modulo]
It is hopefully clear that t*t = s:
t = (a c b)
t*t: apply the cycle twice.
t*t = a -> (skip c) -> b
      b -> (skip a) -> c
      c -> (skip b) -> a
    = (a b c) = s
Now, we will write s = t*t and then find the commutator decomposition from
it:
s = t*t
  = (abc)(abc)
  = (cb)(ba) (cb)(ba)
      g   h   g^-1 h^-1
(transpositions are their own inverses). But there's a problem: this g and
this h do not belong to A5; being single transpositions, they belong to
S5. This is fixed, as before, by multiplying in a transposition (pq) of
two unused elements, which we know will exist.
§ Recap: how have we shown that A5 is not solvable?
What have we shown?
 3-cycles can be written as [g, h] for g, h ∈ A5. Alternatively, we can
say that 3-cycles belong to the commutator subgroup of A5, since they can
be written as commutators.
 Any element in A5 can be written as a composition of 3-cycles.
 Hence, any element in A5 can be written as a composition of commutators.
In my mind, I think of it as:
arbitrary g
  = (3cycle-1)(3cycle-2)...(3cycle-n)
  = [g1, h1][g2, h2]...[gn, hn]
  = member of [A5, A5]
Recall that [A5, A5] is generated by commutators. It not only contains
elements of the form [g, h], but also all products of the form
[g, h][g', h']. So we don't need to exhibit how to write a 5-cycle as a
single [g, h]; we just need to exhibit how to write it as a product of
commutators, which we have now shown.
§ Simple implies not solvable
Let us consider the chain definition of solvable: there is a chain of
normal subgroups
G = N[0] ⊵ N[1] ⊵ … ⊵ N[m] = {e},
such that each quotient N[i] / N[i+1] is abelian. Now, if G is simple,
such a chain can only be G = N[0] ⊵ N[1] = {e}.
 If we want the quotient G/N to be abelian, then we need the commutator
subgroup [G, G] to be a subset of N.
 In our case, [A5, A5] = A5. So if we want to remove the non-abelianness
of A5, we need to quotient by the whole of A5.
 This means that any such chain will immediately collapse to {e}.
 So, it's impossible to build A5 out of abelian ('cyclic') components
starting from {e}.
Viewed from the field-theoretic perspective, this means that it's
impossible to reach a polynomial whose splitting field has Galois group A5
by simply adjoining roots (which correspond to cyclic extensions).
§ Nagging doubt: did we depend on our numbering of cycles?
In all my proofs, I had used one particular 3-cycle, or 5-cycle, or
2-cycle to argue that it all works out. Is this really legal? Perhaps the
argument written for the 3-cycle C = (123) will break down for D = (321).
Fear not!
 We will show that all 3-cycles are conjugate to each other in A5. So, we
can always relabel a 3-cycle within A5.
 It is easy to note that g[k, l]g^{-1} = [gkg^{-1}, glg^{-1}]. This shows
that the commutator subgroup is closed under conjugation. It better be,
because it ought to be normal for us to take quotients by it.
 Combining these facts, if we show that (123) is in [A5, A5], then any
other 3-cycle (ijk) can be conjugated to (123). Since the commutator
subgroup is closed under conjugation, we have that (ijk) is a member of
[A5, A5].
§ All 3-cycles are conjugate to each other in A5.
 Given two 3-cycles C = (abc) and D = (pqr), at least one of a, b, c must
be equal to one of p, q, r. Since each of a, b, c is distinct, and each of
p, q, r is distinct, for them not to overlap we would need 6 elements. But
we only have 5, so there must be some overlap.
So, we will perform our proof assuming there is 1 overlap, 2 overlaps, or
3 overlaps.
Recall that if C = (a b c) is a cycle and s is a permutation, then
conjugating C by s produces the permutation (s(a) s(b) s(c)). We will
prove our results by finding an s, and then making s even. This is the
difficult part of the proof, since we need to show that all 3-cycles are
conjugate in A5. We will write s as two distinct transpositions, which
guarantees that it belongs to A5.
 Case 1: (abx) and (pqx) have a single element x in common:
C = (abx)
D = (pqx)
s: send a to p, b to q
s = (ap)(bq)
C = (abx) --conj s--> (pqx) = D
 Case 2: (axy) and (pxy) have two elements in common, x and y. Naively, we
would pick s: send a to p, a single transposition. But this is odd, so it
isn't a member of A5. To make it even, we rearrange D = (pxy) as
D = (yxp). This lets us go from C to D by relabelling a to y, and y to p.
This permutation is even since it is a product of two distinct
transpositions.
C = (axy)
D = (pxy) = (yxp) [cyclic property]
s: send a to y, y to p
s = (ay)(yp)
C = (axy) --conj s--> (yxp) = D
 Case 3: (xyz) and (xyz) have all three elements in common: x, y, z. If
the two cycles are literally identical, we can conjugate by the identity
and we are done. If they use the same elements in a different cyclic
order, say (xyz) and (xzy), we can conjugate by s = (yz)(uv), where u, v
are the two elements unused by the cycle; this s is even.
§ Why do we care about solvable?
 Roughly, we can look at the solvability criterion as giving us a way to
build our group G from a series of extensions N[1], N[2], …. This
extension is special because at each step, we are adding a cyclic group.
 When we want to write a solution using nth roots, we can only adjoin the
nth roots of unity, a "cyclic" component. So, any element we can reach by
using nth roots ought to be expressible as an extension by cyclic
components.
§ SAGE code to play around with commutators of A5
 Create a dictionary m which maps each element of A5 to the commutators
that create it.
from collections import defaultdict
m = defaultdict(set)
A5 = AlternatingGroup(5)
S5 = SymmetricGroup(5) # if necessary
for g in A5:
    for h in A5:
        m[g * h * g^(-1) * h^(-1)].add((g, h))
# all 60 elem can be written in terms of commutators
print("number of elem generated as commutator: " + str(len(m.keys())))
# Show how to access elements of A5 and their commutator representation
cyc5 = A5("(1, 2, 3, 4, 5)")
cyc3 = A5("(1, 2, 3)")
cyc2disj = A5("(1, 2)(3, 4)")
print(m[cyc5])
print(m[cyc3])
print(m[cyc2disj])
§ Writing each element in A5 directly as a commutator
We have shown how to write 3-cycles as commutators of 2-cycles. We will
now show how to do this for disjoint 2-cycles and 5-cycles, as a matter of
enlightenment.
§ Writing disjoint 2-cycles as a commutator
First, we will write a pair of disjoint 2-cycles as the square of a
4-cycle. We will then show how to write this square as a commutator.
Note that if we choose t = (abcd), then t*t will exchange the first and
third elements (a <-> c), and the second and fourth elements (b <-> d).
So, if we choose:
t = (1324)
t*t = (12)(34)
Next, we need to write this t*t as [g, h] for g, h from A5.
t*t = (1324)(1324)
    = (42)(23)(31);(42)(23)(31)
    = (42)(23)(31);(23)(23);(42)(23)(31)
                   ^^^^^^^^ inserted
    = (42)(23) (31)(23) (23)(42) (23)(31)
         g        h       g^-1     h^-1
    = [(42)(23), (31)(23)]
where both (42)(23) and (31)(23) are members of A5.
§ Writing a 3-cycle as a commutator
In the description of how to generate 3-cycles, we did this explicitly.
§ Writing a 5-cycle as a commutator
Let s = (1 2 3 4 5). We once again find a square root of s. To build
this, we will build an element with the entries of s written with gaps of
2:
t = (1 _ _ _ _)
  = (1 _ 2 _ _) [+2 index]
  = (1 _ 2 _ 3) [+2 index, wrap]
  = (1 4 2 _ 3) [+2 index, wrap]
  = (1 4 2 5 3) [+2 index, wrap]
It should be clear why t*t = s: when we take s = t*t, the resulting
permutation will move an element j = t[i] to k = t[i+2]. But we have built
t such that t[i+2] = s[i+1]. So we will move the element according to how
s pleases:
t = (1 4 2 5 3)
t*t = 1 -> (4 skip) -> 2
      2 -> (5 skip) -> 3
      3 -> (1 skip) -> 4
      4 -> (2 skip) -> 5
      5 -> (3 skip) -> 1
t*t = (1 2 3 4 5) = s
We will now use t*t to write the commutator:
s = t*t
  = (35)(52)(24)(41);(35)(52)(24)(41)
  = (1,2)(3,5) (1,5)(2,4) (3,5)(1,2) (2,4)(1,5)
       g           h         g^-1       h^-1
  = [(1,2)(3,5), (1,5)(2,4)]
(One can check by direct computation that this commutator is indeed
(1 2 3 4 5).)
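We can verify both the square root and the commutator numerically. A
minimal sketch (the `cycle`/`compose` helpers are mine, composing
left-to-right as the post does):

```python
def cycle(*xs):
    return {xs[i]: xs[(i + 1) % len(xs)] for i in range(len(xs))}

def compose(*ps):
    # left-to-right composition; fixed points are dropped from the dict
    out = {}
    for x in set().union(*ps):
        y = x
        for p in ps:
            y = p.get(y, y)
        if y != x:
            out[x] = y
    return out

s = cycle(1, 2, 3, 4, 5)
t = cycle(1, 4, 2, 5, 3)
assert compose(t, t) == s            # t is a square root of s

g = compose(cycle(1, 2), cycle(3, 5))
h = compose(cycle(1, 5), cycle(2, 4))
# g, h are products of disjoint transpositions, hence involutions: [g,h] = ghgh
assert compose(g, h, g, h) == s      # the 5-cycle is a single commutator
```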
§ To think: relationship between square roots and commutators?
If we think of complex vectors p = [p_{1}, p_{2}], q = [q_{1}, q_{2}] as
belonging to projective space, that is, p ≃ p_{1}/p_{2} and
q ≃ q_{1}/q_{2}, we can interpret orthogonality as:
p · q = 0
⟺ p_{1} q̄_{1} + p_{2} q̄_{2} = 0
⟺ p_{1}/p_{2} = − q̄_{2}/q̄_{1}
⟺ p ≃ −1/q̄
If we imagine these as points on the Riemann sphere, TODO
§ References
 Visual Complex Analysis by Tristan Needham
This is a cute reason for why, when we count the number of integers in the
closed interval [a, b], it's going to be b − a + 1. We set up an
arithmetic sequence with initial term a and common difference 1. Now we
know that the nth term is a + (n−1)d. So, we get:
a + (n−1)d = b
(n−1)·1 = b − a
n = b − a + 1
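As a quick sanity check in Python (`range` is half-open, so the closed
interval [a, b] is `range(a, b + 1)`):

```python
a, b = 3, 10
# the closed interval [a, b], written as a half-open range
assert len(range(a, b + 1)) == b - a + 1   # 8 integers in [3, 10]
```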
Let us think of the function arg: ℂ → ℝ as a multi-valued function, which
maps each complex number to the set of valid angles (measured here in
quarter turns) that generate it:
arg(z) ≡ { t ∈ ℝ : |z| e^{i (π/2) t} = z }
We plot the function here:
 Note that for every value z ∈ ℂ, we get a set of values associated to it.
§ Recovering single-valuedness
Now, the question is: can we somehow automatically recover
single-valuedness? Kind of: by stipulating that for any given curve
c: [0, 1] → ℂ, the function arg ∘ c: [0, 1] → ℝ is continuous.
Let's try to investigate what happens if we move from right towards bot,
arbitrarily stipulating ("picking a branch") that arg(right) = 0 as a sort
of basepoint.
 Note that we were forced to pick the value arg(bot) = −1 from our
considerations of continuity. No other value extends continuously from the
right to the bottom.
 Also note that we got a smaller value: we move from 0 → −1: we decrease
our value as we move clockwise.
This prompts the natural question:
what happens if we move in the opposite direction?
§ Counterclockwise movement
 Let's move counterclockwise from right, arbitrarily picking the branch
arg(right) = 0 as before. This gives us:
 Note that once again, we were forced to pick arg(top) = 1 by continuity
considerations.
 Also note that this time, we got a larger value: we move from 0 → 1: we
increase our value as we move counterclockwise.
§ Multiple winding
The true power of this multi-valued approach comes from being able to
handle multiple windings. Here the real meaning of being a multi-valued
function shows through. If we decide to go through the loop twice, as:
bot -> right -> top -> left -> bot -> right -> top -> left
 −1 ->   0   ->  1  ->  2   ->  3  ->   4   ->  5  ->  6
That is, we end up with the value 6, which can only be produced by winding
around the loop twice.
§ Orientation from continuity
There's something really elegant about being able to recover a notion of
"orientation" by simply:
 Allowing multivalued functions.
 Forcing continuity constraints.
 Interpreting increase/decrease in the value of the function.
§ Discretizing, gaining more insight
I was personally dissatisfied with the above explanation, because it
seemed weird that we would need to depend on the history to define this
function. We can formalize this notion of history. Let's first discretize
the situation, giving us:
 We are on the space of the spokes, given by a, b, c, d, e, f, g, h.
 We have a function f: Spoke -> Val whose values are given on the spokes.
 We are interested in the path p: Time -> Spoke,
p = [a, b, c, d, e, f, g, h, a].
 If we evaluate the function f on the path p, we get out: Time -> Val,
out = [0, 1, 2, 3, 4, 5, 6, 7, 0].
 We have a "jump" from 7 to 0 in out as we cross from h to a. This is a
discontinuity in out between times 7 and 8.
 We want to fix this, so we make the function f multi-valued.
 We assign both values 8 and 0 to the spoke a. We wish to define the
evaluation of f: Spoke -> 2^Val relative to the path p. At time t, at the
point p[t], we pick whichever value in f(p[t]) makes out[t] continuous.
 So in this case, when we start, we have two choices for
out[0] = f(p[0]) = f(a): 0 and 8. But we know that
out[1] = f(p[1]) = f(b) = 1. Hence, for out to be continuous at 0, we must
pick out[0] = 0.
 Similarly, at out[8] we have two choices: 0 and 8. But we have that
out[7] = 7; hence we pick out[8] = 8.
 Note that we say 'we pick the value' that makes out continuous. This is
not really rigorous. We can fix this by redefining f in such a way that f
is not Spoke -> Val, but rather knows the full path:
f': (Time -> Spoke) -> Val.
§ Making the theory path-dependent
We originally had:
path: Time -> Spoke
f: Spoke -> 2^Val  -- multivalued, morally if not mathematically
out: Time -> Val
out t = choose_best_value (f(path[t]))
But there was a vagueness in this choose_best_value. So we redefine it:
path: Time -> Spoke
f': (Time -> Spoke) -> Time -> Val
f'(path, tcur) =
  argmin (\v -> |v - out[tcur-1]| + |v - out[tcur+1]|)
         (f(path[tcur]))
out: Time -> Val
out = f'(path)
 The function f' that defines the value of the path has full access to the
path itself!
 At time tcur, it attempts to pick the value in f(path[tcur]) which makes
the discontinuity as small as possible. It picks a value v from the
possible values of f(path[tcur]). This v minimises the sum of the distance
from the value at the previous time point (|v - out[tcur-1]|) and the
distance from the value at the next time point (|v - out[tcur+1]|).
 This provides a rigorous definition of what it means to "pick a value in
the branch". This can clearly be extended to the continuous domain.
A well known identity in combinatorics is that the partitions
n = l1 + l2 + ... + lk where each li is odd are in bijection with the
partitions where each li is unique. I really liked this bijection:
(15, 9, 9, 5, 5, 5, 3, 3, 3, 1, 1, 1, 1) [all odd]
= (15x1, 9x2, 5x3, 3x3, 1x4) [group]
= (15x1, 9x2, 5x(2+1), 3x(2+1), 1x4) [expand multiplicities in base 2]
= (15, 18, [10, 5], [6, 3], 4) [all unique]
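The grouping/base-2 trick above is easy to mechanise. A sketch in Python
(function name is mine): write each multiplicity in binary, and replace m
copies of the odd part o by the parts o·2^k. Each output part o·2^k
determines (o, k) uniquely, so the parts are distinct.

```python
from collections import Counter

def odd_to_distinct(parts):
    # map a partition with (possibly repeated) odd parts to a partition
    # with distinct parts, via binary expansion of each multiplicity
    out = []
    for odd, mult in Counter(parts).items():
        k = 0
        while mult:
            if mult & 1:
                out.append(odd << k)   # odd * 2^k
            mult >>= 1
            k += 1
    return sorted(out, reverse=True)

parts = [15, 9, 9, 5, 5, 5, 3, 3, 3, 1, 1, 1, 1]
image = odd_to_distinct(parts)
assert image == [18, 15, 10, 6, 5, 4, 3]   # the example above, all distinct
assert sum(image) == sum(parts)            # still a partition of the same n
```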
Read ekmett/fractions properly and write a detailed log about it, and the
related math.
For a string s, the Lyndon factorization writes s as the concatenation of
substrings t1, t2, ..., tn, such that:
 each ti is a simple (Lyndon) word: it is lexicographically smaller than
all of its proper cyclic shifts.
 the words are in non-increasing order: t1 >= t2 >= t3 ... >= tn.
For example, given the word banana, the Lyndon factorization is
b; an; an; a.
We can define a notation for writing a permutation as follows:
 Each term in a cycle is written in ascending order.
 Cycles are written in descending order of their first elements.
 Singleton cycles are omitted.
If we treat the result as a string 723145, the Duval algorithm provides
the decomposition back into the cycles. So, we can treat the Duval
algorithm as a way to recover the permutation given the raw string. It's a
nice way to remember the definition of the Lyndon decomposition, if
nothing else.
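Duval's algorithm computes this factorization in linear time. A sketch in
Python, based on the standard formulation of the algorithm:

```python
def duval(s):
    # Duval's algorithm: the Lyndon factorization of s in O(n)
    n, i = len(s), 0
    factors = []
    while i < n:
        # grow the current (pre-)Lyndon word s[i:j]
        j, k = i + 1, i
        while j < n and s[k] <= s[j]:
            k = i if s[k] < s[j] else k + 1
            j += 1
        # emit copies of the Lyndon word of length j - k
        while i <= k:
            factors.append(s[i:i + j - k])
            i += j - k
    return factors

print(duval("banana"))   # ['b', 'an', 'an', 'a']
```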
I wanted to record this fact for myself, so that I have reason to come back
to it as I learn more preorder theory. Perhaps there are certain graph theoretic
phenomena that make more sense when looked at from a preorder point of view.
In lambda calculus, we often see functions of the form
λ x → x(x). We would
like a way to associate a "natural" mathematical object to such a function. The
most obvious choice for lambda calculus is to try to create a set
V of values
which contains its own function space:
(V → V) ⊆ V. Unfortunately,
the only solution for this in sets is the trivial set
{ * }.
§ Computation as fixpoints of continuous functions
§ CPOs
 Given a partial order (P, ≤), assume we have a subset Q ⊆ P. A least
upper bound u of Q is the smallest element of P which is larger than every
element of Q.
 A subset Q of P is called a chain if its elements can be put into order.
That is, there is a labelling of the elements of Q as q1, q2, …, qn such
that q1 ≤ q2 ≤ … ≤ qn.
§ CCPOs
 A partially ordered set is called a chain complete partial order (ccpo)
if each chain has a least upper bound.
 This is different from a lattice, where each subset has a least upper
bound.
 Every ccpo has a least element, given by ⊔∅ = ⊥ (the empty set is
trivially a chain).
 TODO: example of ccpo that is not a lattice
§ Monotone map
 A function f from P to Q is said to be monotone if
p ≤ p′ ⟹ f(p) ≤ f(p′).
 The composition of monotone functions is monotone.
 The image of a chain under a monotone function is a chain.
 A monotone function need not preserve least upper bounds. Consider:
f: 2^ℕ → 2^ℕ
f(S) ≡ S          if S is finite
f(S) ≡ S ∪ {0}    if S is infinite
This does not preserve least upper bounds. Consider the chain of elements:
A_1 = {1}, A_2 = {1, 2}, A_3 = {1, 2, 3}, …, A_n = {1, 2, 3, …, n}
The least upper bound (union) of all the A_i is ℕ. Each A_i is finite,
hence f({1}) = {1}, f({1, 2}) = {1, 2}, and so on. Therefore:
f(⊔ A_i) = f(ℕ) = ℕ ∪ {0} ≠ ℕ = ⊔ A_i = ⊔ f(A_i)
§ Continuous function
 A function is continuous if it is monotone and preserves all LUBs. This
is only sensible as a definition on ccpos, because the equation defining
it is lub . f = f . lub, where lub: chain(P) → P. For lub to always exist,
we need P to be a CCPO. So, the definition of continuity only works for
CCPOs.
 The composition of continuous functions on chain-complete partial orders
is continuous.
§ Fixpoints of continuous functions
The least fixed point of a continuous function f: D → D is:
FIX(f) ≡ lub({ f^n(⊥) : n ≥ 0 })
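On a finite poset the Kleene chain ⊥ ≤ f(⊥) ≤ f²(⊥) ≤ … stabilises, so
FIX(f) can be computed by plain iteration. A sketch in Python (the graph,
and the names fix/F/edges, are mine, purely illustrative): reachability
from node 0 as the least fixed point of F(S) = {0} ∪ succ(S) on the
powerset lattice (2^V, ⊆).

```python
def fix(f, bottom):
    # Kleene iteration: lub of the chain bottom <= f(bottom) <= f^2(bottom) <= ...
    # terminates once the chain stabilises (guaranteed on a finite poset)
    x = bottom
    while True:
        y = f(x)
        if y == x:
            return x
        x = y

# example: reachability from node 0 as a least fixed point on (2^V, ⊆)
edges = {0: [1], 1: [2], 2: [0], 3: [4], 4: []}
F = lambda S: {0} | {w for v in S for w in edges[v]}
print(fix(F, set()))   # {0, 1, 2}
```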
§ ≤ as implication
We can think of b ≤ a as b ⟹ a. That is, b has more information than a,
and hence implies a.
§ References
 Semantics with Applications: Hanne Riis Nielson, Flemming Nielson.
I learnt of a "prefix sum/min" based formulation from the solution to
question D, Codeforces educational round 88. The idea is to write the
maximum prefix sum as a difference, right minus left:
maxsum = max_R ( Σ_{i ≤ R} a[i] − min_{L ≤ R} Σ_{i < L} a[i] )
Which is then expressed as:
asum[n]    ≡ Σ_{i ≤ n} a[i]
maxsum     = max_R (asum[R] − min_{L ≤ R} asum[L])
aminsum[n] ≡ min_{i ≤ n} asum[i]
maxsum     = max_R (asum[R] − aminsum[R])
Since asum is a prefix sum of a, and aminsum is a prefix min of asum, the
whole thing is O(n) serial, O(log n) parallel.
In Haskell, this translates to:
sums = scanl (+) 0
sums_mins = scanl1 min . sums
rise xs = zipWith (-) (sums xs) (sums_mins xs)
main = print (rise [1, 2, 3, -2, -1, -4, 4, 6])
-- [0,1,3,6,4,3,0,4,10]
sums_mins keeps track of the sea level, while the zipWith (-) computes the
elevation from the sea level.
Some musings I had on the ability to represent heaps as arrays, and in
general, the benefits of knowing the total number of elements.
 Knowing the total number of elements allows us to preordain a memory
layout. We can decide that for a node at index i, the left child is at 2*i
and the right child is at 2*i+1. This puts the parent at i//2.
 This immediately gives us O(1) access to the parent (i//2) and the
sibling (i^1) with no extra memory usage.
 This cannot be done for data structures where we need to splice into the
middle: for example, an implicit treap where we wish to splice subarrays
together.
 On coding a heap, we can decide whether to use the left or right child by
using next = 2*i + predicate. If predicate = false = 0, we pick the left
child; otherwise the right child. This allows us to compress some annoying
if/then/elses into one-liners.
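The index gymnastics above can be written down directly. A small sketch
(1-indexed, array slot 0 unused; the helper names are mine):

```python
# navigation in a 1-indexed implicit binary heap
def left(i):     return 2 * i
def right(i):    return 2 * i + 1
def parent(i):   return i // 2
def sibling(i):  return i ^ 1          # flips the last bit: left <-> right
def child(i, go_right):
    # branchless select: next = 2*i + predicate
    return 2 * i + int(go_right)

assert parent(left(5)) == parent(right(5)) == 5
assert sibling(left(5)) == right(5)
assert (child(5, False), child(5, True)) == (left(5), right(5))
```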
I had always seen the definition of the discriminant of a polynomial p(x)
as:
Disc(p(x)) ≡ a_n^{2n − 2} ∏_{i<j} (r_i − r_j)^2
While it is clear why this tracks whether a polynomial has repeated roots
or not, I could never motivate it to myself or remember this definition. I
learnt that in fact, this comes from a more general object, the resultant
of two polynomials P(x), Q(x), which provides a quantity Res(P(x), Q(x)),
polynomial in their coefficients, which is zero iff P, Q share a common
root. The discriminant is then (up to a scale factor) the resultant of a
polynomial and its derivative. This makes far more sense:
 If a polynomial has a repeated root r, then its factorization will be of
the form p(x) = (x − r)^2 q(x). The derivative of the polynomial will then
also have an (x − r) factor that can be taken out, so p and p′ share the
root r.
 On the contrary, if a polynomial only has roots of multiplicity 1, then
the factorization around a root is p(x) = (x − r) q(x), where q(x) is not
divisible by (x − r). The derivative is p′(x) = 1 · q(x) + (x − r) q′(x).
We cannot take (x − r) common from this, since q(x) is not divisible by
(x − r).
This cleared up a lot of the mystery for me.
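We can check the repeated-root story with sympy's resultant and
discriminant (the example polynomials are mine):

```python
from sympy import symbols, resultant, discriminant, diff

x = symbols('x')
p = (x - 1)**2 * (x + 2)          # repeated root at x = 1
q = (x - 1) * (x + 2)             # all roots simple

assert discriminant(p, x) == 0    # repeated root <=> zero discriminant
assert discriminant(q, x) != 0
# the discriminant vanishes exactly when p and p' share a root:
assert resultant(p, diff(p, x), x) == 0
assert resultant(q, diff(q, x), x) != 0
```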
§ How did I run into this? Elimination theory.
I was trying to learn how elimination theory works: given a variety
V = { (x, y) : Z(x, y) = 0 }, how does one find a rational parametrization
(p(t), q(t)) such that Z(p(t), q(t)) = 0, where p(t), q(t) are rational
functions? That is, how do we find a rational parametrization of the locus
of a polynomial Z(x, y)? The answer is: use resultants!
 We have two univariate polynomials p(a; x), p(b; x), where the notation
p(a; x) means that we have a polynomial p(a; x) ≡ Σ_i a[i] x^i. The
resultant is a polynomial Res(a; b) which is equal to 0 when p(a; x) and
p(b; x) share a common root.
 We can use this to eliminate variables. We can treat a bivariate
polynomial p(x, y) as a univariate polynomial p′(y) over the ring R[x].
This way, given two bivariate polynomials p(a; x, y), q(b; x, y), we can
compute their resultant, giving us conditions to detect for which values
of a, b, x there exists a common y such that p(a; x, y) and q(b; x, y)
share a root. If (a, b) are constants, then we get a polynomial Res(x)
that tracks whether p(a; x, y) and q(b; x, y) share a root.
 We can treat the implicit equation above as two equations, x − p(t) = 0,
y − q(t) = 0. We can apply the method of resultants to project out t from
the equations.
§ 5 minute intro to elimination theory.
Recall that when we have a linear system Ax = 0, the system has a
nontrivial solution iff det A = 0. Formally: ∃ x ≠ 0 ⟺ det A = 0. Also,
the ratio of solution components is given by ratios of cofactors:
x_i / x_j = (−1)^{i+j} det A_i / det A_j
(where A_i denotes the minor associated with column i).
If we have two polynomials p(a; x) = a_0 + a_1 x + a_2 x^2 and
q(b; x) = b_0 + b_1 x + b_2 x^2, then the system p(a; x) = q(b; x) = 0 has
a simultaneous zero iff:
det ⎡ a_2  a_1  a_0  0   ⎤
    ⎢ 0    a_2  a_1  a_0 ⎥  =  0
    ⎢ b_2  b_1  b_0  0   ⎥
    ⎣ 0    b_2  b_1  b_0 ⎦
§ Big idea
The matrix Q is set up in such a way that any solution vector v with
Qv = 0 will be of the form v = (α^3, α^2, α, 1). That is, the solution
vector is the vector of powers of some α. Since Qv = 0, we have that
a_2 α^2 + a_1 α + a_0 = 0 and b_2 α^2 + b_1 α + b_0 = 0: α is a common
root.
§ Proof
 Necessity is clear: if we have some nontrivial vector v ≠ 0 such that
Qv = 0, then we need det Q = 0.
 Sufficiency: since det Q = 0, there is some vector v = (w, x, y, z) such
that Qv = 0. We need to show that this v has the right structure. If the
polynomials p(a; x), q(b; x) are not scalar multiples of each other, then
the rows carrying the coefficients of p and of q are linearly independent.
Rows 1 and 3 give the linear system:
a_2 w + a_1 x + a_0 y = 0
b_2 w + b_1 x + b_0 y = 0
Similarly, rows 2 and 4 give:
a_2 x + a_1 y + a_0 z = 0
b_2 x + b_1 y + b_0 z = 0
Since the two systems have the same rank-2 coefficient matrix, whose
kernel is one-dimensional, the vectors (w, x, y) and (x, y, z) must be
proportional. That is:
(w, x, y) = α (x, y, z), i.e. w = αx, x = αy, y = αz.
We can take z = 1 arbitrarily, giving us a vector of the form
(w, x, y, z) = (α^3, α^2, α, 1), which is the structure of the solution we
are looking for!
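We can sanity-check the determinant condition numerically. A sketch with
numpy (the helper name `sylvester2` and the example polynomials are mine):

```python
import numpy as np

def sylvester2(a, b):
    # Sylvester matrix of two quadratics a2 x^2 + a1 x + a0, b2 x^2 + b1 x + b0
    a2, a1, a0 = a
    b2, b1, b0 = b
    return np.array([
        [a2, a1, a0, 0],
        [0,  a2, a1, a0],
        [b2, b1, b0, 0],
        [0,  b2, b1, b0],
    ], dtype=float)

# p = (x-1)(x-2), q = (x-2)(x-3): common root x = 2, determinant vanishes
print(np.linalg.det(sylvester2((1, -3, 2), (1, -5, 6))))    # ~0.0
# p = (x-1)(x-2), q = (x-3)(x-4): no common root, nonzero determinant
print(np.linalg.det(sylvester2((1, -3, 2), (1, -7, 12))))
```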
 For a polynomial p(x), build its companion matrix P.
 Show that the characteristic polynomial cp(P) of the companion matrix P
is indeed p(x).
 Find the eigenvalues of P, which will be the roots of p(x), since the
eigenvalues of a matrix M are the roots of its characteristic polynomial
cp(M).
 We use QR since it is numerically stable. The Q matrix discovered by QR
is orthogonal, and hence does not disturb the covariance of the noise on
matrix multiplication.
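A sketch of the recipe with numpy (the helper name is mine;
`np.linalg.eigvals` runs a QR-type eigensolver via LAPACK under the hood):

```python
import numpy as np

def roots_via_companion(coeffs):
    # coeffs = [a_n, ..., a_1, a_0]
    a = np.array(coeffs, dtype=float)
    c = a[1:] / a[0]                  # make the polynomial monic
    n = len(c)
    C = np.zeros((n, n))
    C[1:, :-1] = np.eye(n - 1)        # ones on the subdiagonal
    C[:, -1] = -c[::-1]               # last column: -c_0, -c_1, ..., -c_{n-1}
    # eigenvalues of the companion matrix are the roots of p
    return np.linalg.eigvals(C)

r = np.sort(roots_via_companion([1, -3, 2]))   # x^2 - 3x + 2 = (x-1)(x-2)
print(r)   # approximately [1. 2.]
```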
Life may toss us ill-conditioned problems, but it is too short to settle
for unstable algorithms. (D. P. O'Leary)
§ Measures of error
If x is a number and x̂ is its approximation, then there are two notions
of error:
 absolute error: |x − x̂|.
 relative error: |x − x̂| / |x|.
Since the relative error is invariant under scaling (x ↦ αx), we will
mostly be interested in relative error.
§ Significant digits
The significant digits in a number are the first nonzero digit and all
succeeding digits. Thus 1.7320 has five significant digits, while 0.0491
has only three significant digits. It is not transparent to me why this
definition is sensible.
§ Correct significant digits: a first stab
We can naively define "x̂ agrees with x to p significant digits" to mean
that x̂ and x round to the same number to p significant digits. This
definition is seriously problematic. Consider the numbers:
x = 0.9949: x_1 = 1.0, x_2 = 0.99, x_3 = 0.9950
y = 0.9951: y_1 = 1.0, y_2 = 1.0,  y_3 = 0.9950
Here, y agrees with x to one and to three significant digits, but not to
two significant digits, since the roundings x_2 and y_2 do not agree even
in the first significant digit.
§ Correct significant digits: the correct definition
We say that x̂ agrees with x to p significant digits if |x − x̂| is less
than half a unit in the pth significant digit of x.
§ Accuracy vs precision
 Accuracy: absolute or relative error of a quantity.
 Precision: the accuracy with which the basic arithmetic operations
+, -, *, / are performed. For floating point, this is measured by the unit
roundoff (we have not met this yet).
Accuracy is not limited by precision: using fixed precision arithmetic, we
can emulate arbitrary precision arithmetic. The problem is that often this
emulation is too expensive to be useful.
§ Backward, forward errors
Let y = f(x), where f: ℝ → ℝ. Let us compute ŷ as an approximation to y,
in an arithmetic of precision u. How do we measure the quality of ŷ?
 In many cases, we may be happy with an ŷ such that the relative error
between y and ŷ is of order u: we did the best we could with the precision
that was given. This is the forward error.
 An alternative question we can ask is: for what δx do we have that
ŷ = f(x + δx)? That is, how far away from the input do we need to stray to
get a matching output? There may be many such δx, so we ask for the
minimum |δx|. We can divide this error by |x| as a normalization factor.
This is the backward error.
There are two reasons we prefer backward error.
 It converts error analysis into "data analysis". The data itself tends
to be uncertain. If the error given by the backward analysis is smaller
than the data uncertainty, then we can write off our error as being
too small. Since for all we know, we have 'fixed' the uncertain data
with our small error.
 It reduces the question of error analysis into perturbation theory,
which is very well understood for a large class of problems.
§ Backward stable
A method for computing y = f(x) is called backward stable if it produces a
ŷ with small backward error. That is, we need a small δx such that
ŷ = f(x + δx).
§ Mixed forward-backward error
We assume that addition and subtraction are backward stable, where u
measures the precision to which our arithmetic operations can be
performed:
fl(x ± y) = x(1 + Δ_1) ± y(1 + Δ_2),  |Δ_i| ≤ u

Another type of error we can consider is of the form:
ŷ + Δy = f(x + Δx)
That is, for a small perturbation Δy of the output, we can get a backward
error of Δx. This is called mixed forward-backward error. We say that an
algorithm with mixed forward-backward error is stable iff:
ŷ + Δy = f(x + Δx)
|Δy| / |ŷ| < ε
|Δx| / |x| < η
with ε, η small.
This definition of stability is useful when rounding errors are the
dominant form of errors.
§ Conditioning
The relationship between forward and backward error is governed by
conditioning: the sensitivity of solutions to perturbations of data. Let
us have an approximate solution ŷ = f(x + δx). Then:
ŷ − y = f(x + δx) − f(x) = f′(x) δx + O((δx)^2)
(ŷ − y)/y = (x f′(x)/f(x)) (δx/x) + O((δx)^2)
(ŷ − y)/y = c(x) (δx/x) + O((δx)^2)
c(x) ≡ |x f′(x) / f(x)|
The quantity c(x) measures the scaling factor to go from the relative
change in input to the relative change in output. Note that this is a
property of the function f, not of any particular algorithm.
§ Example: log x
If f(x) = log x, then c(x) = |x (log x)′ / log x| = |1 / log x|. This
quantity is very large for x ≃ 1. So, a small relative change in x can
produce a drastic relative change in log x around 1.
 Note that the absolute change is quite small:
log(x + δx) ≃ log x + δx/x. However, relative to log x, this change of
δx/x is quite large.
§ Rule of thumb
We now gain access to the useful rule:
forward error ≲ condition number × backward error
 Glass half empty: ill-conditioned problems can have large forward error.
 Glass half full: well-conditioned problems do not amplify error in data.
§ Forward stable
If a method produces answers with forward errors of similar magnitude to
those produced by a backward stable method, then it is called forward
stable. Backward stability implies forward stability, but not vice versa
(TODO: why?)
§ Cancellation
Consider the following program:
#include <cmath>
#include <stdio.h>
int main() {
    double x = 12e-9;
    double c = cos(x);
    double one_sub_c = 1 - c;
    double denom = x*x;
    double yhat = one_sub_c / denom;
    printf("x: %20.16f\n"
           "c: %20.16f\n"
           "one_sub_c: %20.16f\n"
           "denom: %20.16f\n"
           "yhat: %20.16f\n",
           x, c, one_sub_c, denom, yhat);
}
which produces the output:
x: 0.0000000120000000
c: 0.9999999999999999
one_sub_c: 0.0000000000000001
denom: 0.0000000000000001
yhat: 0.7709882115452477
This is clearly wrong, because we know that (1 − cos x)/x^2 ≤ 1/2.
The reason for this terrible result is that:
 we know cos x to high accuracy, since x was some fixed quantity.
 1 − cos x converted the error in cos x into its value.
 1 − cos x has only one significant figure.
 This makes it practically useless for anything else we are interested in
doing.
In general:
x̂ ≡ 1 + ε   (error of order ε)
ŷ ≡ 1 − x̂ = −ε   (value of order ε)
That is, subtracting values close to each other (in this case, 1 and x̂)
converts an error order of magnitude into a value order of magnitude.
Alternatively, it brings earlier errors into prominence as values.
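The same phenomenon, and a fix, in Python: rewriting
1 − cos x = 2 sin²(x/2) avoids subtracting nearly equal quantities (this
rewrite is mine, not from the C++ program above):

```python
import math

x = 1.2e-8
# naive: subtracts two numbers that agree to ~16 digits
naive = (1 - math.cos(x)) / (x * x)
# stable: 1 - cos x = 2 sin^2(x/2), no catastrophic cancellation
stable = 0.5 * (math.sin(x / 2) / (x / 2)) ** 2
print(naive, stable)   # naive is ~0.77 (wrong), stable is ~0.5
```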
§ Analysis of subtraction
We can consider the subtraction:
x = a − b;  x̂ = â − b̂
â = a(1 + Δa)
b̂ = b(1 + Δb)
|x − x̂| / |x| = |a Δa − b Δb| / |a − b|
              ≤ max(|Δa|, |Δb|) (|a| + |b|) / |a − b|
This quantity will be large when |a − b| ≪ |a| + |b|: that is, when there
is heavy cancellation in the subtraction used to compute x.
§ Underflow
#include <cmath>
#include <stdio.h>
int main() {
    double x = 1000;
    for(int i = 0; i < 60; ++i) { x = sqrt(x); }
    for(int i = 0; i < 60; ++i) { x = x*x; }
    printf("x: %10.20f\n", x);
}
This produces the output:
./sqrtpow112
...
x: 1.00000000000000000000
That is, even though the function is the identity function, the answer
collapses to 1. What is happening? After 60 square roots, the true value
1000^(1/2^60) ≈ 1 + 6·10^{−18} is closer to 1 than to the next double
above 1 (which is 1 + 2.2·10^{−16}), so x rounds to exactly 1.0; squaring
1.0 sixty times then stays at 1.0.
§ Computing (e^x − 1)/x
One way to evaluate this function is as follows:
double f(double x) { return x == 0 ? 1 : (pow(M_E, x) - 1) / x; }
This can suffer from catastrophic cancellation in the numerator. When x is
close to 0, e^x is close to 1, and e^x − 1 will magnify the error in e^x.
A better algorithm is:
double f(double x) {
    const double y = pow(M_E, x);
    return y == 1 ? 1 : (y - 1) / log(y);
}
This algorithm seems crazy, but there's insight in it. We can show that
the errors cancel! The idea is that while neither (y − 1) nor log y is
particularly good, the errors accumulated in them almost completely cancel
out, leaving a good value.
Assume ŷ = 1. Then 1 = ŷ ≡ e^x (1 + δ), so
0 = log 1 = x + log(1 + δ), giving x = −log(1 + δ) = −δ + O(δ^2).
If ŷ ≠ 1:
f̂ = (ŷ − 1)/log ŷ = (ŷ − 1)(1 + ε_1)(1 + ε_3) / (log ŷ (1 + ε_2))
§ IEEE floating point fun: +0
and 0
for complex analysis
Rather than think of +0
and 0
as distinct numerical values, think of their sign bit as an auxiliary variable that conveys one bit of information (or misinformation) about any numerical variable that takes on 0 as its value.
We have two types of zeroes,
+0
and
−0,
in IEEE754. These are used in some
cases. The most famous is that
1/+0 = +∞, while
1/−0 = −∞. Here,
we proceed to discuss some complex-analytic considerations.
Therefore, implementers of compilers and runtime libraries bear a heavy burden of attention to detail if applications programmers are to realize the full benefit of the IEEE style of complex arithmetic. That benefit deserves some discussion here if only to reassure implementers that their assiduity will be appreciated.
These will ensure that
√z* = (√z)*:
copysign(1, +0) = +1
copysign(1, −0) = −1
These will ensure that
1/(1/x) = x when
x = ± ∞.
The distinction matters wherever the two one-sided limits of a function disagree across a branch cut:
§ Complex-analytic considerations
The principal branch of a complex function is a way to select one branch
of a complex function, which tends to be multivalued. A classical example
is the argument function, where
arg(r e^{i θ}) = θ.
However, this is ambiguous: we can map
θ ↦ θ + 2 π
and still have the same complex number. So, we need to fix some standard.
We usually pick the "branch" where
0 ≤ θ < 2 π.
In general, we need to carefully handle what happens to the function at
the discontinuity.
What deserves to be undermined is blind faith in the power of Algebra. We should not believe that the equivalence class of expressions that all describe the same complex analytic function can be recognized by algebraic means alone, not even if relatively uncomplicated expressions are the only ones considered.
§ References
Most of these functions are really defined on the
incidence algebra of
the poset
P with ground field
K. An
incidence algebra
I(P) is a
set of functions which maps intervals of
P to the ground field
K. An
interval is a tuple
(x, y) ∈ P × P such that
x ≤ y
(where the
≤ comes from the partial order on
P). We have a product
structure which I denote
⋆, given by:
(α ⋆ β)([x, z]) = ∑_{x ≤ y ≤ z} α([x, y]) · β([y, z])

A linear algebra way to look at this is to consider
P × P matrices over
K,
where the rows and columns are indexed by
P.
Then a function
α: P × P → K
can be written as the elements of the
P × P matrix.
Then this convolution-like operator
⋆ is simply matrix multiplication.
We have three natural functions:
(1) The characteristic function, which is the identity for
⋆:
δ([x, z]) ≡ 1 if x = z ; 0 otherwise

(2) the zeta function, which plays the role of the constant
1:
ζ([x, z]) ≡ 1 if x ≤ z ; 0 otherwise

(3) The inverse of the zeta function, the mobius function, a tool for mobius inversion:
µ([x, x]) ≡ 1; µ([x, z]) ≡ − ∑_{x < y ≤ z} µ([y, z]) for x < z
The mobius inversion theorem for posets states that
ζ and
µ as
defined above are convolutional inverses. That is,
ζ ⋆ µ = δ.
This allows us to prove:
           
g([x, z]) = ∑_{x ≤ y ≤ z} f([x, y]) · 1
g([x, z]) = ∑_{x ≤ y ≤ z} f([x, y]) · ζ([y, z])
g = f ⋆ ζ
g ⋆ µ = f ⋆ ζ ⋆ µ
g ⋆ µ = f ⋆ δ
g ⋆ µ = f
         

We have managed to find
f in terms of
g, when previously we had
g
in terms of
f.
TODO: we are usually interested in a
fixed
[x, z]. What happens if we
make this implicit? We may get nice notation for all of this!
§ Sums as mobius inversion
We can derive the formula we had for integrals, that:
F(n) = ∑_{0 ≤ k ≤ n} f(k) ⇐⇒ f(k) = F(k) − F(k−1)

by setting up mobius inversion on the usual partial order for the natural
numbers. For simplicity, I'll show the example on
[0, 1, 2, 3, 4]. The example
immediately generalizes.
 We have the partial (well, total) order
P:
0 < 1 < 2 < 3 < 4.
 We are given a function
f(·) we wish to integrate. We define an
auxiliary function
fi([x, y]) = f(y) which evaluates
f on the right
endpoint.
 We can now define
F([x, z]) as the sum of
f from
x to
z:
F([x, z]) ≡ ∑_{x ≤ y ≤ z} f(y) = (fi ⋆ ζ)([x, z])
 This tells us that
f(n) = fi([0, n]) = (F ⋆ µ)([0, n]):
f(n) = fi([0, n]) ≡ (F ⋆ µ)([0, n])
          

 We note that we need to know the values of
µ([x, n]) for a fixed n,
for varying x. Let us attempt to calculate
µ([0, 4]), µ([1, 4]), µ([2, 4]), µ([3, 4]), µ([4, 4])
and see if this can be generalized:
µ([4, 4]) = 1 (by definition)
µ([3, 4]) = −µ([4, 4]) = −1
µ([2, 4]) = −(µ([3, 4]) + µ([4, 4])) = −(−1 + 1) = 0
µ([1, 4]) = µ([0, 4]) = 0, similarly.

So µ([x, n]) is 1 at x = n, −1 at x = n−1, and 0 everywhere else:
exactly the coefficients in f(n) = F([0, n]) − F([0, n−1]).

          

Umbral calculus lays out a large collection of "umbral" / "shadowy"
coincidences across combinatorics and calculus. Here, I'll lay out some of
these that I learnt from
Concrete Mathematics.
I hope to use this for myself to motivate a bunch of combinatorics. I'll
provide an interesting proof of why
∑_{1 ≤ k < n} k^{2} = n(n−1)(2n−1)/6
using this umbral calculus.
§ Discrete Derivative: Forward difference
We begin by trying to build a discrete version of the derivative operator. The
derivative of
f
will be denoted using
f'
. The
forward difference
of
f: ℝ → ℝ is defined as:
δ f: ℝ → ℝ; (δ f)(x) = f(x + 1) − f(x)

Immediately, we can see that this operator is linear:
  δ(f + g)  = (f+g)(x+1) − (f+g)(x)         
 = (f(x+1) − f(x)) + (g(x+1)−g(x))          
 = (δ f) + (δ g)          
 δ(α f)(x)  = (α f)(x+1) − (α f)(x)         
 = α · (f(x+1) − f(x))          
 = α (δ f)
         

it obeys a slightly corrupted version of the product rule,
(fg)′ = f′ g + g′ f:
  δ(fg)          
 = (fg)(x+1) − (fg)(x)          
 = f(x+1)g(x+1) − f(x)g(x) + 0          
 = f(x+1)g(x+1) − f(x)g(x) + [f(x)g(x+1) − f(x)g(x+1)]          
 = g(x+1)[f(x+1) − f(x)] + f(x)[g(x+1) − g(x)]          
 = g(x+1)(δ f)(x) + f(x) (δ g)(x)          
 = (S δ f)(x) + (f δ g)(x) [(Sh)(x) ≡ h(x+1)]          
 = (S δ f + f δ g)(x)          
          

We need this new
S operator to shift the function's input from
x to
x+1.
§ Falling factorial as polynomials
Cool, now that we seem to have a linear derivative operator that behaves
roughly sanely, let's test it out! The first reasonable target is a polynomial,
so let's try
x^{2}:
δ(x^{2}) = (x+1)^{2} − x^{2} = 2x + 1

This is disappointing, it does not behave very well
:(
However, there
is
an analogue that does well. This is the
falling factorial, defined as:
x^{(n)} ≡ x(x−1)(x−2)⋯(x−n+1)

For example:
x^{(0)} = 1 x^{(1)} = x x^{(2)} = x(x−1) x^{(3)} = x(x−1)(x−2) 
Let's try and apply our discrete difference
δ:
δ(x^{(2)})
 = (x+1)(x−1+1) − x(x−1)
 = (x+1)(x) − x(x−1)
 = 2x = 2x^{(1)}
δ(x^{(3)})
 = (x+1)(x−1+1)(x−2+1) − x(x−1)(x−2)
 = (x+1)(x)(x−1) − x(x−1)(x−2)
 = x(x−1)((x+1) − (x−2)) = 3x(x−1) = 3x^{(2)}
          

These falling factorials look pretty unnatural though, why do we care?
Well, once we build some integral calculus, we can handle our problem
of
∑_{1 ≤ i < k} i^{2} using this calculus, by rewriting
i^{2}
in terms of these falling factorials.
§ Sums as discrete integrals
We want to think of
∑_{0 ≤ i < n} f(i) as the correct variant of
∫_{0}^{n} f(i) di. The most important property of an integral is
the fundamental theorem of calculus:
∫_{a}^{b} f′(i) di = f(b) − f(a) ↦ ∑_{a ≤ i < b} (δ f)(i) =?= f(b) − f(a)

we can check if the assertion is true:
           
∑_{a ≤ i < b} (δ f)(i)
 = [f(a+1) − f(a)] + [f(a+2) − f(a+1)] + [f(a+3) − f(a+2)] + ⋯ + [f(b) − f(b−1)]
 = f(b) − f(a) (the sum telescopes)
         

Sweet, so we just kicked a theory of calculus off the ground. Let's put
this to some use:
§ Gauss' famous formula from discrete calculus
Let's begin by deriving the closed form for the sum of the first (n−1) naturals:

∑_{0 ≤ i < n} i = ∑_{0 ≤ i < n} i^{(1)} = [i^{(2)}/2]_{0}^{n} = n(n−1)/2

Let's now derive the closed form for the sum of squares of the first (n−1) naturals:

∑_{0 ≤ i < n} i^{2}
 = ∑_{0 ≤ i < n} i^{(2)} + i^{(1)} (since i^{2} = i^{(2)} + i^{(1)})
 = [i^{(3)}/3 + i^{(2)}/2]_{0}^{n}
 = n^{(3)}/3 + n^{(2)}/2
 = n(n−1)(n−2)/3 + n(n−1)/2
 = n(n−1)(n/3 − 2/3 + 1/2)
 = n(n−1)(2n − 1)/6
          

Trying to perform this process in general does beg a question: how do we
convert from
x^{n} into some combination of rising and falling factorials?
It turns out that
Stirling Numbers will show up to aid this conversion.
But before that, let's see some more connections.
§
2^{x} as the combinatorial version of
e^{x}
We want to find the analogue of the exponential function
e^{x}, which
satisfies the equation
f′(x) = f(x). Setting this up in the discrete case:
(δ f)(x) = f(x),  f(0) = 1
f(x+1) − f(x) = f(x),  f(0) = 1
f(x+1) = 2f(x),  f(0) = 1
f(n) = 2^{n}
         

What does this buy us? It gives us a nice proof that
∑_{k} nCk = 2^{n}.
It proceeds by taking the Taylor series of
e^{x}, "combinatorializing" it, and then
simplifying:
           
          
2^{x} = ∑_{n ≥ 0} x^{(n)}/n!
      = ∑_{n ≥ 0} x(x−1)(x−2)⋯(x−n+1)/n!
      = ∑_{n ≥ 0} xCn

Evaluating at a natural number x = n recovers ∑_{k} nCk = 2^{n}.

§ Harmonic series as the combinatorial version of logarithm
In the continuous case, we have that
∫ 1/x dx = log x. In the
discrete case, we get
∑_{i=1}^{n} 1/i ≡ H_{n},
the n-th harmonic number. This explains
why the harmonic numbers
keep showing up in different places across discrete math: we're often
trying to integrate some kind of
1/x quantity.
This also may tell us that
H_{n} and
2^{n} are somewhat "weak inverses" of each other.
I haven't thought about this weak inverse thing too much.
§ Stirling numbers of the first kind for converting between polynomials and falling factorials
We can express the rising factorial as
∑_{k=0}^{n} [n, k] x^{k}, where
[n, k]
are the (yet to be defined) unsigned stirling numbers of the first kind.
(aside: I wonder if this can be derived from some kind of mobius inversion).
We begin by in fact
defining the stirling numbers of the first kind,
[n, k],
as the coefficients of the rising factorial:
x(x+1)(x+2) ⋯ (x+n−1) = ∑_{i=0}^{n} [n, i] x^{i}

The question as a combinatorialist is:
what do the unsigned stirling numbers of the first kind,
[n, i], count?
The answer is:
[n, i] counts the number of permutations of
n elements with
i disjoint cycles.
For example, in the case of the permutations of the set
{1, 2, 3},
we have the permutations:
 (1)(2)(3) 3 cycles           
(12)(3) 2 cycles           
(1)(23) 2 cycles           
(2)(13) 2 cycles           
(132) 1 cycle           
(123) 1 cycle           
          

So, this gives the counts:
[3, 3] = 1
[3, 2] = 3
[3, 1] = 2

These stirling numbers satisfy a recurrence:
[n+1, k] = n[n, k] + [n, k−1]

This can be seen combinatorially. If we want the number of permutations of
n+1 objects
with
k cycles, we can either:
 Take the permutations of
n objects with
k−1 cycles,
[n, k−1], and then
insert the new
(n+1)th object into a new cycle all by itself. That is,
we create a new permutation where
p(n+1) = (n+1).
 Take the permutation of
n objects with
k cycles, and insert this new
(n+1)th
object into any of the
k cycles. If the original permutation maps
x ↦ y,
we can insert the new object ∗ between them, so that
x ↦ ∗ ↦ y,
for any of the
n possible choices of
x. This gives us the
n[n, k] term.
Another cute fact about the unsigned stirling numbers of the first kind
[n, k]
is that, since permutations are partitioned by their number of cycles, we have:
∑_{k} [n, k] = n!
§ References
The permutahedron over
n letters is a polytope which is defined as the convex
hull of all permutations of the point
(1, 2, …, n).
For example, the permutahedron over 3 letters is the convex hull of the
points
(1, 2, 3), (1, 3, 2), (2, 1, 3), (2, 3, 1), (3, 1, 2), (3, 2, 1).
Here, we show that it can be embedded in
(n−1) dimensional space, and that each point
perm((1, 2, … n)) is indeed a vertex of the convex hull.
An example of a situation where a point is
not in the vertex of a convex hull
is
convexhull(1, 2, 3) = [1, 3]
. Note that
2
was used to generate the
convex hull, but is not a vertex since it is the convex combination of
1, 3
.
We wish to prove that such a situation does not happen in the case of the
permutahedron.
I have not seen an elementary proof that each point that is a permutation of
the original is a vertex, so I'm recording that proof out of interest.
§ Showing that all permutations of
(1, 2, …, n) are vertices
§ Core idea
Start by looking at the index
i
of the "largest value"
N
in the point
P.
Since it's the largest value, if
P
is a convex combination, all other points in
the convex combination must have the same "largest value" at that index
i. So
we can now ignore the index
i
and get permutations of
[1..N-1].
Repeat this process to prove that
all indexes of things in the convex sum
must be equal to
P
. But this is absurd, because all vertices are distinct.
Hence, the only way for this to work out is if we have only one point that we
are convex-summing over. That is,
P
cannot be written as the convex sum of
two or more vertices.
§ Formal proof
Let us be working with permutations of
[1, 2, ..., N]
. Pick point
P
. Assume
P
is in the convex sum of some
{ Xi }
.
Let
index(N, P)
be the index of number
N
in
P
. For example, if
P = [2, 3, 1]
, then
index(2, P) = 0 (assuming 0-based indexing)
index(3, P) = 1
index(1, P) = 2
If
P
is the convex sum of vertices
{ Xi }
, all of them
must have that:
Xi[index(N, P)] = N forall Xi in the convex sum
since
N
is the largest value that any
X_i
can have at any index. The only way
to get the maximum value by a convex sum is for all values to be that maximum
value.
So, we can now ignore dimension
index(N, P)
, since
all the vertices X_i
involved in the convex combination and
P
have
index(N, P) = N
. If we
project out this dimension, we are left with permutation of
(1..N1)
.
Repeat the same process. We will need to have all
X_i
to have their value at
index(N1, P)
to be
N1
.
Keep doing this till we get that the dimensions of all points in
Xi
and
P
are equal. But all the points in
{ Xi }
are distinct since they are
permutations of
[1..N]
. Hence,
{ Xi }
can contain only a single point, and
this point is
P
.
Hence,
P
cannot be written as the convex combination of other points. Hence,
P
is a vertex.
§ Proof that convex sum of maximum must have maximum
Assume we have values
{v1, v2, ... vn}
where without loss of generality,
v1 <= v2 <= ... <= vn
. We can always relabel otherwise.
Now we want to write
vn
as the convex sum of
{v1, v2, ... vn}
. We can draw this on
the number line as:
[v1v2v3...vn]
 A convex sum must be inside this line segment between
[v1...vn]
(vn is the rightmost point, since it's known to be the largest value in {v1...vn}).
 So, if we try to write
vn
as the convex sum, we must have the coefficient
of vn = 1
and the coefficient of all other points = 0, since any other
combination will "pull the convex sum away from the right (where vn sits),
towards the left".
Actual "real world" use-case of lyndon factorization, cribbed from here:
I wanted to learn a real-world use-case of lyndon factorization so I could
remember it. I found a whole bridge between combinatorics/strings and
discrete geometry (they call it "digital geometry") that I didn't know
existed before.
 If you have a line with rational slope
p/q and you want to draw a
"discretized line" by connecting integer points in ZxZ, you can describe this
discretized line as starting from
(0, 0), making moves
x (move 1 unit
along the
x-axis) and
y (move 1 unit along the
y-axis), finally reaching the point
(p, q). For example, to reach the point
(3, 2), you can make the moves
[x, x, x, y, y].
 A christoffel word is a word
w ∈ {x, y }^{⋆} such that it hugs a line of
rational slope
p/q as close as possible. Formally, there are no integer
points between the line with slope
p/q starting from the origin, and the
discretized line as described by
w. An example picture:
 It turns out that all primitive christoffel words are Lyndon words. I'm not
100% sure what primitive is, but the takeaway is that these primitive
christoffel words represent discrete lines that are good approximations
of lines.
 Now, we are given a discrete sequence of adjacent line segments going
upwards, where the line segments are described by
x, y moves. We want to
check if the discrete curve defined by them is well-approximating a convex
polygon.
 We compute the lyndon factorization of the word. This splits the
original line segment into a series of line segments, where each
successive line segment has lower slope than the previous (since the
lyndon decomposition splits words into non-increasing lex order).
 We can then check that each word in the lyndon decomposition is a Christoffel
word. If it is, then your sequence of moves describes a "good discrete
convex hull", since as described above, a christoffel word "hugs the line"
well.
§ Bonus: slick characterization of line drawing
If we want to draw a line with slope
p/q = 4/7
using the lower approximation, the idea
is that we keep taking steps along
x, accumulating
4
each time, and every time we "exceed"
7,
we take a step along
y.
This is the same as:
 working in
Z/(4+7)Z, starting from
0, writing down multiples of
4
till we cycle back to
0
 marking an "increase" in a step with an
x, and a "decrease" in a step
with
y.
The intuition is that once we walk
k*p
steps where
k*p >= q
, we want
to increment
y
. So, at first glance, we may believe we should consider
Z/qZ
. However, this is misguided. Examples to enlighten:
 Consider
x=1, y=0. We should use
Z/1Z = { 0 }: that is, we must keep
moving along
0 -x-> 0 -x-> 0 -x-> ....
This is unlike what happens if we choose
Z/0Z
(which is not a well-defined idea).
 Consider
x=1, y=1. We should use
Z/2Z, so we keep going
0 -x-> 1 -y-> 0 -> ...
which will cause us to flip
x -> y -> x -> y -> ....
In some sense, we are making sure that we can "start" with an
x
and see where that takes us.
In the
Z/1Z
case, we realise that we can keep taking
x
s. In the
Z/2Z
case, we realise we need to flip between
x
and
y
.
Let's try to show this formally, where
k
is the smallest number
such that
kp >= q
. We'll also have concrete examples where
p=2, q=7
. For the example,
k=4
since
kp = 4*2 = 8 > 7
.
If we work in
Z/qZ = Z/7Z, we will get the numbers:
Z:    0, p, 2p, ... (k-1)p, [q] kp
Z/qZ: 0, p, 2p, ... (k-1)p, [q] (kp % q)
Z/7Z: 0, 2, 4, ... 6, [7] (8 % 7 = 1)
But notice that we are inserting
x
between numbers. So we will get:
Z:    0 -x-> p -x-> 2p, ... (k-1)p -y-> (k+1)p
Z/qZ: 0 -x-> p -x-> 2p, ... (k-1)p -y-> ((k+1)p % q)
Z/7Z: 0 -x-> 2 -x-> 4, ... 6 -y-> (8 % 7 = 1)
      ^                    ^
      x since [0 < 2]      y since [6 > 1]
which gives us the moves:
Z/7Z: 0 -x-> 2 -x-> 4 -x-> 6 -y-> 8 % 7 = 1
We only get
3
occurrences of
x: on the next accumulation of
p, the value
becomes an
8, which wraps around to a
1. This is the age-old tension that exists
between
points and
gaps. For
k
points, there are
k−1
gaps. In our
case, we have
4
points
[0, 2, 4, 6]
, but this leaves us room for only
three gaps for three
x
moves, while in reality we need
4
.
We remedy this situation by giving ourselves space enough for
one more
x
, by changing from
Z/qZ
to
Z/(p+q)Z
. We should look at this as creating space for another gap.
Z:        0, p, 2p, ... (k-1)p, [q] kp, [q+p], (k+1)p
Z/(p+q)Z: 0, p, 2p, ... (k-1)p, [q] kp, [q+p], (k+1)p % (p+q)
Z/9Z:     0, 2, 4, ... 6, [7] 8, [7+2=9], (10 % 9 = 1)
which gives us the moves:
Z:        0 -x-> p -x-> ... (k-1)p -x-> kp -y-> (k+1)p
Z/(p+q)Z: 0 -x-> p -x-> ... (k-1)p -x-> kp -y-> (k+1)p % (p+q)
Z/9Z:     0 -x-> 2 -x-> ... 6 -x-> 8 -y-> (10 % 9 = 1)
          ^                         ^
          x since [0 < 2]           y since [8 > 1]
That is, we are able to get
k
occurrences of
x
between
0, p, ..., kp,
which
has
(k+1)
points. Concretely, we have:
Z/9Z: 0 -x-> 2 -x-> 4 -x-> 6 -x-> 8 -y-> 1
where we have 4 occurrences of
x, after which we have an occurrence
of
y, which is what we want when
p=2, q=7. We need to reach at least
q=7
before we have exceeded our denominator and need to make a move
along
y.
Let's concentrate on the
e^x >= 1 + x
part.
 The tangent of
e^x
at x = 0
is 1 + x
, since the Taylor series
of e^x
truncated up to x
is 1 + x
. 
e^x
is a strongly convex function, since (e^x)'' = e^x
which is positive
everywhere. Hence, e^x
will always lie above its tangent.
Similarly for
e^(-x), working through the math:

1 - x
is tangent at x=0
to e^(-x)

(e^(-x))'' = e^(-x),
which is again positive everywhere, and
hence, e^(-x)
is strongly convex.
We want to sort an array
xs
and write the results into an array
ys
.
In both cases, the invariant to be satisfied is that
ys
is
xs
in ascending
order. I'll be considering two broad ways to implement such a thing:
 1. (RANK) Index straight into
xs, reindex into
ys. We name the reindexing
array as rs
(rs
for ranks).
for(int i = 0; i < N; ++i) {
    ys[rs[i]] = xs[i];
}
 2. (SELECT) Index straight into
ys, reindex into
xs.
We name the reindexing array ss
(ss
for selects).
for(int i = 0; i < N; ++i) {
    ys[i] = xs[ss[i]];
}
§ Definition of rank
In the case of
(RANK), the specification is that the rank of an element
e
is the number of elements that are:
 less than
e
.  equal to
e
but occur before e
.
This ensures that rank is a
permutation: that is, every element
e
is given
a
unique index.
int rank(const int *xs, int i) {
    int cnt = 0;
    for (int j = 0; j < N; ++j) {
        if (xs[j] < xs[i] || (xs[j] == xs[i] && j < i)) cnt += 1;
    }
    return cnt;
}
for(int i = 0; i < N; ++i) {
    rs[i] = rank(xs, i);
}
§ Rank: Alternative 1
An alternative way to look at our definition of rank is that we are
sorting the tuples
(xs[i], i)
using lex ordering. So if two indices
i, j
have the same value, then we sort on the index.
§ Rank: Alternative 2
We could have also defined rank as:
int rank(int *xs, int i) {
    int cnt = 0;
    for (int j = 0; j < i; ++j) {
        if (xs[j] <= xs[i]) cnt += 1;
    }
    return cnt;
}
I declined to do so because I wanted to show that the lex-ordering
interpretation of
rank
will be useful later.
§ Definition of select
For
(SELECT), we'll need to think a little harder. Let's try to rewrite
our way to glory:
 1. From the definition of rank:
ys[rs[i]] = xs[i]
 2. move
i -> ss[i]
where ss
is guaranteed to be a permutation,
so we will write each index eventually:
ys[rs[ss[i]]] = xs[ss[i]]
 3. Stipulate that
rs[ss[i]] = i, since we want a
ys[i]:
ys[i] = xs[ss[i]]
This gives us necessary and sufficient conditions on how to find an
ss
:
ss
must be a permutation that is an inverse permutation to
rs
.
§ How to invert a permutation?
How does one invert a permutation? We have a permutation
rs[i]
that maps
i
to
rs[i]
. We want to find a new permutation
ss[i]
such that
rs[ss[i]] = i. Equivalently:
// ss[i] is an index 'k'...
ss[i] = k
// 'k' is the location of 'i' in rs.
rs[k] = i
So if we first tag each location of
rs
with its index, and then sort
with the values of
rs
being the keys, then
ss[i]
will be the values.
In pictures:
rs[i] [3 4 0 1 2]
rs_tagged[i] [(3, 0) (4, 1) (0, 2) (1, 3) (2, 4)]
rs_sorted[i] [(0, 2) (1, 3) (2, 4) (3, 0) (4, 1)]
ss[i] [ 2 3 4 0 1 ]
rs[ss[i]] [ rs[2] rs[3] rs[4] rs[0] rs[1]]
rs[ss[i]] [ 0 1 2 3 4 ]
§ Rank and Select are inverses
It's nice, there are a couple of miracles that lined up here:
 Rank starts with
r
and select starts with
s, so we get nice naming.
 Rank and select are true inverse permutations of each other.
§ Generalized rank and select are adjoint
We had to "fix" our definition of rank to avoid equal elements in the array.
Hence, we had the rule that we also sort by indexes if the elements are
equal. However, if we now decide to ignore this rule, we will then recreate
the classical
adjunction between rank and select.
§ References
 Richard Bird: Pearls of functional algorithm design
We can derive a proof of the minkowski convex body theorem starting from
Blichfeldt’s theorem.
§ Blichfeldt's theorem
This theorem allows us to prove that a set
of largeenoughsize in any lattice will have two points such that their
difference lies in the lattice. Formally, we have:
 A lattice
L(B) ≡ { Bx : x ∈ ℤ^{n} } for some basis
B ∈ ℝ^{n × n}. The lattice
L is spanned by integer linear
combinations of rows of
B.
 A body
S ⊆ ℝ^{n} (which need not be convex!), which
has volume greater than
det(B). Recall that for a lattice
L(B),
the volume of a fundamental unit / fundamental parallelopiped is
det(B).
Blichfeldt's theorem tells us that there exist two points
x_{1}, x_{2} ∈ S
such that
x_{1} − x_{2} ∈ L.
§ Proof
The idea is to:
 Chop up sections of
S across all translates of the fundamental parallelopiped
that have nonempty intersections with
S back to the origin. This makes
all of them overlap with the fundamental parallelopiped with the origin.
 Since
S has volume greater than
det(B), but the fundamental parallelopiped
only has volume
det(B), points from two different parallelograms must
overlap.
 "Undo" the translation to find two points which are of the form
x_{1} = l_{1} + δ,
x_{2} = l_{2} + δ. They must have the same
δ since they overlapped
when they were laid on the fundamental parallelopiped. Also notice that
l_{1} ≠ l_{2}
since they came from two different parallelograms on the plane!
 Notice that
x_{1} − x_{2} = l_{1} − l_{2} ∈ L, and
l_{1} − l_{2} ≠ 0, since we already argued
that l_{1} ≠ l_{2}. This gives us what we want.
§ Minkowskis' Convex body Theorem from Blichfeldt's theorem
Consider a convex set
S ⊆ ℝ^{n}
that is symmetric about the origin with volume greater than
2^{n} det(B).
Create a new set
T which is
S scaled by 0.5. Formally:
T ≡ S/2 = { (x_{1}/2, x_{2}/2, …, x_{n}/2) : (x_{1}, x_{2}, …, x_{n}) ∈ S }
We now see that
Vol(T) > det(B) to invoke Blichfeldt's theorem.
Formally:
Vol(T) = 1/2^{n} Vol(S) > 1/2^{n} (2^{n} det(B)) = det(B) 
We can apply Blichfeldt's theorem to get our hands on two points
x_{1}, x_{2} ∈ T
such that
x_{1} − x_{2} ∈ L.
x_{1} ∈ T ⇒ 2x_{1} ∈ S (S = 2T)
x_{2} ∈ T ⇒ 2x_{2} ∈ S (S = 2T)
2x_{2} ∈ S ⇒ −2x_{2} ∈ S (S is symmetric about origin)
(1/2)(2x_{1}) + (1/2)(−2x_{2}) ∈ S (S is convex)
x_{1} − x_{2} ∈ S (simplification)
a nonzero lattice point ∈ S
          

§ References
We aim to get the
O(n) algorithm for the Burrows-Wheeler transform, by starting from the
naive
O(n^{2}) implementation and then slowly chipping away to get to the
fastest algorithm.
§ String rotations
Given a string
s of length
n, we can index it as
s[0],
s[1], upto
s[n−1]. We can now build a table consisting of
rotations of the string.
We'll define:
lrot :: [a] -> [a]
lrot [] = []; lrot (x:xs) = xs ++ [x]
We can build a table of leftrotations:
foobar
oobarf
obarfo
barfoo
arfoob
rfooba
We can immediately notice that it's a
symmetric matrix. We can prove
this as follows. We write
lrot(rot, s) as an array such that:
lrot(rot, s)[i] = s[(rot + i) mod n]
Now, we note that on row
r of our array we have the string
lrot(r, s).
So the value at row
r, column
c is
s[(r + c) mod n].
But this is symmetric in
c, r, so can be written as
lrot(c, s)[r]. Formally:
lrot(r, s)[c] = s[(r + c) mod n] = lrot(c, s)[r]
§ Sorts of string rotations
We're next interested in considering
sorted order of the string
rotations. For example, in the case of the string "here, there",
all the rotations are:
herethere 0
erethereh 1
retherehe 2
ethereher 3
therehere 4
therehere 5
hereheret 6
erehereth 7
reherethe 8
eherether 9
which is calculated with the help of the following haskell code:
import Data.List (intercalate)
lrot :: [a] -> [a]
lrot [] = []; lrot (a:as) = as ++ [a]
lrots :: [a] -> [[a]]; lrots as = take (length as) (iterate lrot as)
main = putStrLn $ intercalate "\n"
  (zipWith (\s i -> s <> " " <> show i)
           (lrots "herethere") [0,1..])
If we now
sort these rotations, we get:
therehere 0
ethereher 1
eherether 2
erethereh 3
erehereth 4
herethere 5
hereheret 6
retherehe 7
reherethe 8
therehere 9
we produce this by changing the above definition of
main
to:
main = putStrLn $ intercalate "\n"
  (zipWith (\s i -> s <> " " <> show i)
           (sort $ lrots "herethere") [0,1..]) -- extra `sort` here
Let's look at the
final column of that table. We have:
therehere 0
ethereher 1
eherether 2
erethereh 3
erehereth 4
herethere 5 < ORIGINAL STRING
hereheret 6
retherehe 7
reherethe 8
therehere 9
0123456789
errhhetee
Now, we'll show how to write a
really cool function called
bwtinv
such that:
bwtinv ("errhhetee",5) = "herethere"
The
5
is the index of the original string in the list. That is, given
the jumbled up lastcolumn and the index of the original string, we're
able to reconstruct the original string. The reason this is useful is
that the jumbled string
"errhhetee"
is easier to compress: it has
long runs of
r,
h, and
e.
The process we performed of rotating, sorting the rotations and taking
the final column is the Burrows-Wheeler transform. In code:
import Data.List
lrot :: [a] -> [a]
lrot [] = []; lrot (a:as) = as ++ [a]
lrots :: [a] -> [[a]]
lrots as = take (length as) (iterate lrot as)
findix :: Eq a => a -> [a] -> Int
findix a as = length (takeWhile (/= a) as)
lastcol :: [[a]] -> [a]; lastcol = map last
bwt :: Eq a => Ord a => [a] -> ([a], Int)
bwt as = let as' = (sort . lrots) as in (lastcol as', findix as as')
Now we need to understand how
bwtinv
is defined and what it does.
The definition is:
import Control.Arrow ((&&&))
bwtinv :: Eq a => Ord a => ([a], Int) -> [a]
bwtinv (as, k) = recreate as !! k
-- recreate . lastcol . sort . rots = sort . rots
recreate :: (Eq a, Ord a) => [a] -> [[a]]
recreate as = recreate' (length as) as
recreate' :: (Eq a, Ord a) => Int -> [a] -> [[a]]
recreate' 0 = map (const [])
recreate' n = hdsort . consCol . (id &&& recreate' (n-1))
hdsort :: Ord a => [[a]] -> [[a]]
hdsort = let cmp (x:xs) (y:ys) = compare x y
         in sortBy cmp
consCol :: ([a], [[a]]) -> [[a]]
consCol = uncurry (zipWith (:))
OK, so much for the code. What does it
do?
The idea is that:
 we recreate the entire matrix from the last column using
recreate
and take the
k-th element.
 recreate
prepends a copy of the initial last column onto the partially
recreated columns, and then sorts the rows.
§ References
 Richard Bird: Pearls of functional algorithm design
Open sets form a Heyting Algebra, which is a lattice plus an implication
operator. So it's stronger than a lattice, but weaker than a boolean algebra.
Formally, a Heyting algebra is a collection
(L, ∨, ∧, ⇒)
where
(L, ∨, ∧) forms a lattice, and
⇒ obeys the axiom:
⇒: L × L → L; (c ∧ a) ≤ b ⇐⇒ c ≤ (a ⇒ b)

In any topological space
(U, τ) (
U for universe set)
the open sets of that space form a Heyting algebra.
Here, set-union and set-intersection become the lattice join and lattice meet.
We define a "weak complement" denoted by
¬, which is defined as:
¬ A ≡ interior(A^{c})
We need the
interior to make sure that we're left with an open
set. Also, this
¬ is not really a complement, since we won't have that
¬ ¬ A = A. but, more on that later!
We write open intervals with round parenthesis:
¬ A of the above set
A becomes the open interval that is in the
interior of
A complement.
(===) A
==][== A complement
==)(== Not A
Now, we can try to ask: when is
¬ ¬ A = U (where
U is the universe
set or the full space)? If we consider a set missing a single point,
then we will have:
==)(== A
 Not A
============ Not (Not A)
in words:
  A = U ∖ { 1 }          
 ¬ A = interior(A^{c}) = interior({ 1 }) = ∅          
 ¬ ¬ A = interior((¬ A)^{c}) = interior(∅^{c}) = interior(U) = U          
          

So in some sense, the law of excluded middle is "almost true": it holds that
A ≃ ¬ ¬ A,
where
¬ ¬ A differs from A by a set of points of measure zero.
This is really interesting, since it gives some sort of oddball probabilistic
flavour to things where if we blind ourselves to measure0 sets, then
¬ ¬ A = A.
Now, we look at implication. The implication
A ⇒ B is the largest
open set
C such that
C ∧ A ≤ B. In pictures:
(========) A
(=====) B
(==) A && B
===)(======== A > B
The reason this is true is that from the definition:
⇒: L × L → L; c ≤ (a ⇒ b) ⇐⇒ (c ∧ a) ≤ b
replacing ≤ with = for insight:
⇒: L × L → L; c = (a ⇒ b) ⇐⇒ (c ∧ a) = b
          

Alternatively, we can use the fact that in regular boolean algebra
a ⇒ b = ¬ a ∨ b, to derive
A → B:
(========) A
===)(===== NOT A
(=====) B
[Extend B to everything that doesn't include
more of A than already included]
===)(======== NOT A or B = A > B
§ a → b as a contained in b
We'll show that
a → b
is true/top/the full real line if and only if
a
is contained in
b. We can show this using the
¬ a ∨ b definition:

 a ≤ b (given)
 Since
a ≤ b, we have
¬ a ≥ ¬ b, since
¬ reverses inclusion order.
 x ≥ y implies that
x ∨ p ≥ y ∨ p for all p, since
p is
on both sides.
 Specializing to
x = ¬ a,
y = ¬ b,
p = b gives us
¬ a ∨ b ≥ ¬ b ∨ b.
 ¬ b ∨ b is the universe
U.
 ¬ a ∨ b ≥ U, which means it's equal to
U (nothing can be greater than
U).
We can also show this geometrically:
(===) A
(==========) B
(===) A && B
[Extend B to everything that doesn't include
more of A than already included.
But all of A is
already included!]
================== A > B
§ reading curry, uncurry using a → b as containment
curry,
uncurry
are the adjunction/isomorphism:
((c,a) -> b) ~ (c -> (a -> b))
Read this as:
( c , a) > b
(c `intersection` a) `contained in` b
if and only if
c > (a > b)
c `contained in` (a `implication` b)
That is, we have "read"
curry/uncurry
as:
which was the definition of
⇒ we wanted!
§ References
This implementation of edit distance crystallizes the fact that when computing
edit distance, we only ever move forwards on solving the problem. We do not
store the results of the overlapping computations, though we could. Rather,
the goal of this implementation is to capture the traversal pattern necessary
for edit distance into a Cursor, and to view the edit distance problem from
the point of view of this Cursor.
The problem setting is that we have a source string, a destination
string, and we wish to perform operations that convert the source
string into the destination string. The operations allowed are to:
 Insert a character from the destination to the source.
 Remove a character from the source.
 Replace a character in the source with a character from the destination.
We want to minimise the number of operations to be used.
Let's see how we model this.
{-# LANGUAGE ViewPatterns #-}
type Ix = Int; type DestIx = Ix; type SrcIx = Ix;
data Move = InsertFromDest DestIx
          | RemoveFromSrc SrcIx
          | ReplaceSrcWithDest SrcIx DestIx
          deriving(Show)
Our cost model says that each move costs
1
. We are charged for every
move we make. We are to minimize the number of operations.
movescost :: [Move] -> Int; movescost = length
We model this as us having a
Cursor
which contains a list
[a]
and information
about where we are in the list as an
Ix
.
This is the same as a
Zipper
for a list, except that in this case, we only
allow ourselves to walk forward.
data Cursor a = Cursor Ix [a]

cdone
tells us if we have completely consumed a cursor. 
cix
tells us the index of the cursor. 
cval
lets us dereference a cursor. 
incr
lets us move a cursor to the next array entry. 
cursor
converts a list into a Cursor
at the first index.
cdone :: Cursor a -> Bool; cdone (Cursor ix vs) = ix >= length vs
cix :: Cursor a -> Ix; cix (Cursor ix _) = ix
cval :: Cursor a -> a; cval c@(Cursor ix vs) = vs !! ix
incr :: Cursor a -> Cursor a; incr (Cursor ix vs) = Cursor (ix+1) vs
cursor :: [a] -> Cursor a; cursor = Cursor 0
We implement edit, which tells us how to edit the source string into
the destination string. The convention is edit as bs:
-- decide how to get the ixth character of bs from our as.
edit :: Eq a => Cursor a -> Cursor a -> [Move]
 1. If both strings have been consumed, then no moves are to be made.
edit (cdone -> True) (cdone -> True) = []
 2. If the destination string has been fully matched while the source string
has not, then remove characters from the source string.
edit a@(cdone -> False) b@(cdone -> True) =
  (RemoveFromSrc (cix a)):edit (incr a) b
 3. If the source string has run out of characters while the destination string
still has characters, insert characters from the destination string.
edit a@(cdone -> True) b@(cdone -> False) =
  (InsertFromDest (cix b)):edit a (incr b)
 4. Otherwise, we have characters remaining in both strings. Try the
options of (1) replacing a source character with a destination
character, (2) removing a character from the source and continuing,
(3) inserting the destination character and continuing,
and (4) if the current characters match, keeping the match and trying
to combine characters that come later in the string. We pick the
best out of these using the argmin combinator.
edit a b =
  let nomatch = argmin movescost
        (ReplaceSrcWithDest (cix a) (cix b):edit (incr a) (incr b))
        (argmin movescost
          (RemoveFromSrc (cix a):edit (incr a) b)
          (InsertFromDest (cix b):edit a (incr b)))
  in case cval a == cval b of
       True -> argmin movescost nomatch (edit (incr a) (incr b))
       False -> nomatch
The helper used for finding minimum according to a cost model.
argmin :: (a -> Int) -> a -> a -> a
argmin f a a' = if (f a) < (f a') then a else a'
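For comparison, the classic memoized formulation of the same edit-distance problem, as a hedged Python sketch (names mine): it considers insert, remove, and replace at every step, and `lru_cache` stores the overlapping subproblems that the Cursor version deliberately recomputes.

```python
from functools import lru_cache

def edit_distance(src, dst):
    @lru_cache(maxsize=None)
    def go(i, j):
        # i, j: cursor indices into src and dst.
        if i == len(src) and j == len(dst): return 0   # both consumed
        if j == len(dst): return 1 + go(i + 1, j)      # remove from src
        if i == len(src): return 1 + go(i, j + 1)      # insert from dst
        cands = [1 + go(i + 1, j + 1),                 # replace
                 1 + go(i + 1, j),                     # remove
                 1 + go(i, j + 1)]                     # insert
        if src[i] == dst[j]:
            cands.append(go(i + 1, j + 1))             # free match
        return min(cands)
    return go(0, 0)

assert edit_distance("kitten", "sitting") == 3
assert edit_distance("", "abc") == 3
assert edit_distance("abc", "abc") == 0
```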
This kind of culture that beehives have is called 'eusociality'.
I'm interested in this because I want to understand what alien societies might
look like, and what selection pressures are required to have bee-like societies
evolve. Many sci-fi books (Ender's Game, for example) depict aliens with such
a society, but tend to be hazy on how this state of affairs came to be.
These are rules I'm going to follow when I solve problems on
codeforces. I've arrived at these rules by repeatedly
noticing the kinds of mistakes I make and attempting to adopt conventions
to eliminate these.
§ Rule 1: Half-open intervals
Intervals are only represented as [begin, past-the-end).
That is, we start at begin, include numbers begin+1, begin+2, ...,
up to, but excluding, past-the-end.
So, for example, [0, 3) = [0, 1, 2], while [0, 1) = [0], and [0, 0) = [].
I am not allowed to use [begin, end], or (before-begin, past-the-end), or
any such variation when I represent and think of intervals.
§ Rule 2: No relational operators when using for/while/do while
loops.
When I write my loop conditions, I am not allowed to use relational operators.
I must only write
i != n
or
i == m
.
I am not allowed to write
i < n
,
i <= n
,
i > n
,
i >= n
for my
for
loops.
§ Rule 3: The loop iterator lives in "getting to" space:
The loop iterator
i
in
for(int i = a; i != b; ++i)
is to be thought of as "getting to / living right before" the values
[a, a+1, a+2, ..., b-1]. In ASCII-art:
| a-1 | a | a+1 | ... | b-1 | b | b+1 |
     ^^   ^^      ^^         ^^
    (i=a) (i=a+1) (i=a+2) ... (i=b)
§ Rule 4: One always reads loops according to the above rules
for(int i = begin; i != past_the_end; ++i) {
  // NO: i from begin to past_the_end.
  // YES: [i _getting to_ begin] till [i _getting to_ past_the_end]
}
§ The rationale for banning relational operators
There is a strong difference in qualia between i < n and i != n. The former
makes one aware of when the loop runs; the latter, of when the loop quits.
I wish to be cognizant of precisely when a loop quits.
On writing i != past_the_end, I know that we quit as soon as we
get past the end. This feels much clearer than being aware that the loop
runs as long as i < n.
§ half-open indexing: length <-> index
The first advantage of these conventions is that [begin, past-the-end)
is the same as [begin, begin+length). I've found this
to be of great help to flit between length-based thoughts and
indexing-based thoughts.
§ half-open indexing: 0 length
The second almost trivial point is that
[begin, begin+length)
holds when
length
is
zero. That is,
for(int i = begin; i != begin + 0; ++ i) { ... }
does the right thing and iterates over no elements.
§ half-open indexing: -ve length
The third neat point is that [begin, begin+length) holds even when
length is negative. Hence, if we want to index the last two elements of an
array, we recall that we start from index (n-1), and we want a segment
of length -2 from there, so our loop is:
for(int i = n-1; i != (n-1) + -2; i--) {
  // loops over the correct segment.
}
§ half-open indexing: splitting an array
Finally, this convention helps when attempting to take the beginning part
of a list and the rest of the list. If we want to split an array into
two segments at some index len:
 The first subarray is [0, len), since it starts at location 0
and has length len.
 Since we have taken len elements, the second subarray must have length
n-len. It must start at index len, since len is past-the-end for the
first subarray. So, the second subarray has indices [len, len + (n-len)) = [len, n).
Notice how we used
both the "length view" and the "past the end" view to
quickly derive the bounds of the second array from the first array.
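A quick sketch of the derivation above in Python, whose `range` is itself half-open (the function name `split_at` is mine):

```python
def split_at(xs, length):
    # First subarray: [0, length); second: [length, n).
    n = len(xs)
    first = [xs[i] for i in range(0, length)]    # has length `length`
    second = [xs[i] for i in range(length, n)]   # has length n - length
    return first, second

assert split_at([10, 20, 30, 40, 50], 2) == ([10, 20], [30, 40, 50])
assert split_at([1, 2, 3], 0) == ([], [1, 2, 3])  # zero length works too
```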
§ half-open indexing: uniformly generate power-of-2 intervals.
If we want intervals of length 2^{n}: 1, 2, 4, 8, ..., which is common if one
is building data structures such as segment trees and fenwick trees,
in a half-open representation, this literally becomes
[a, a+1), [a, a+2), [a, a+4), and so on. On the other hand, if one wants
to use closed-interval-based indexing, one needs to generate the
series 2^{n} − 1, which is [a, a+0], [a, a+1], [a, a+3], [a, a+7], which is
slightly more finicky.
§ How to deal with strides
If there's a loop
for(int i = 0; i < 5; i += 2) { ... }
the naive
i != 5
translation will not work since
i
only takes
on even numbers
[0, 2, 4, 6, ...]
. For this, we can perform a
simple transformation and always make sure our loop variables increment
by 1. In a compiler, this is often called "canonicalizing the loop
induction variable". In LLVM the canonicalization is
performed by the indvars pass.
The above example canonicalized becomes:
// btc = backedge taken count. How many times we have
// gone past the "back edge" of the loop to go back to the
// beginning of the loop.
for(int btc = 0; btc != 5 / 2; btc += 1) { const int i = btc * 2; }
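A quick check of the canonicalization in Python (a hedged sketch; note the trip count of the original loop is the ceiling ceil(5/2) = 3):

```python
# Original strided loop: for(int i = 0; i < 5; i += 2)
naive = []
i = 0
while i < 5:
    naive.append(i)
    i += 2

# Canonicalized: btc (backedge-taken count) increments by 1,
# and the original induction variable is recovered as btc * 2.
canonical = []
for btc in range(0, (5 + 2 - 1) // 2):   # ceil(5/2) = 3 iterations
    canonical.append(btc * 2)

assert naive == canonical == [0, 2, 4]
```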
§ In conclusion
These are rules that I dreamed up after noticing my idiosyncrasies in
loop-writing.
§ Algebraic structure for vector clocks
When I update my time, i.e., union(time me, time me), I get an element that's one up the lattice.
When I union with someone else, I get the max. So we have an algebraic structure
which is
(L, ≤, next: L → L) where
next
is monotone for
(L, <=)
.
The induced union operator ∪: L × L → L is:
x ∪ y ≡ next(x)    if x = y
        max(x, y)  if x ≠ y
 Is the total ordering on vector clocks not isomorphic to the total ordering on
ℝ?
This also shows up in the case of the "Gallager-Humblet-Spira Algorithm" and
the fragment-name union rule.
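A minimal sketch of this algebraic structure for the scalar (Lamport-clock) case, taking L to be the integers with next = (+1), which is monotone as required:

```python
def union(x, y):
    # union(x, x) bumps my own clock one step up the chain;
    # union with a different clock takes the max.
    return x + 1 if x == y else max(x, y)

assert union(3, 3) == 4   # updating my own time: one up the lattice
assert union(3, 5) == 5   # merging with someone else: the max
```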
I learnt of this counterintuitive fact first from this
USENIX article on NoSQL.
On checking up, it seems to actually be true.
Jeff Dean's keynote at LADIS 2009 reports these numbers:
 Round trip within same datacenter: 500,000 ns
 Disk seek: 10,000,000 ns [regular disk]
On the other hand, a commenter on
this discussion at serverfault
mentioned that SSDs might be much faster, and the numbers bear out:
 Read 4K randomly from SSD: 150,000 ns
 Round trip within same datacenter: 500,000 ns
 Disk seek: 10,000,000 ns [regular disk]
I learnt of this from hacker news. This is some crazy experiment that shows
that the 'quantum angular momentum' (spin) and the 'classical angular momentum'
need to be conserved
together for physics to work out:
There's an experiment that transfers that angular momentum all the way up to macroscopic levels. By magnetizing a cylinder of iron, all the spins start pointing in the same direction. By conservation of angular momentum, the cylinder itself has to start spinning in the opposite direction. I'm very fond of this experiment, because it magnifies a strange quantum phenomenon to the classical level.
So, my understanding of the experiment is:
 classical angular momentum and quantum angular momentum are related.
 quantum angular momentum is decomposed into spin and orbital angular momentum.
 for something like iron, spin is 96% of magnetization
 angular momentum is proportional to magnetization
 So, the experiment measures the spin (mostly) in terms of the classical
spinning of the cylinder.
§ References
We will introduce two operations, rank and select; these are used to
build memory-efficient data structures that can still be queried quickly.
We will show how rank and select are adjoint. This will also allow us to
consider coselect, which is a variation of select that is not commonly
discussed.
{-# LANGUAGE GeneralizedNewtypeDeriving #-}
import Data.List
newtype Count = Count Int deriving(Eq, Show, Ord, Num)
newtype Ix = Ix Int deriving(Eq, Show, Ord, Num)
eqscan :: Eq a => a -> [a] -> [Count]
eqscan a0 as = map Count $ scanl (+) 0 [if a0 == a then 1 else 0 | a <- as]
-- | finds the index `Ix` in `as` at which `a` has occurred
-- | for the `Count`th time.
select :: Eq a => a -> [a] -> Count -> Maybe Ix
select a0 as c = Ix <$> findIndex (== c) (eqscan a0 as)
-- | finds the number of times `a` has occurred in `as` till `Ix`.
rank :: Eq a => a -> [a] -> Ix -> Count
rank a as (Ix ix) = (eqscan a as) !! ix
Given this, we can prove that select and rank are adjoint. There
are different ways to think about this. My favourite way is to notice
that an Ix (index) is much "finer" information than a Count. Hence,
the concrete domain must be Ix, the abstract domain must be Count,
and rank-select is an abstract interpretation!
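A list-based Python rendition of rank/select (O(n²) and entirely my own naming), making the adjunction concrete as a Galois connection between prefix lengths and counts:

```python
def rank(a, xs, ix):
    # number of occurrences of a in the prefix xs[:ix]
    return xs[:ix].count(a)

def select(a, xs, c):
    # least prefix length ix at which a has occurred c times
    for ix in range(len(xs) + 1):
        if rank(a, xs, ix) == c:
            return ix
    return None

xs = list("abraca")
# Galois connection: select(c) <= ix  iff  c <= rank(ix)
for c in range(0, xs.count("a") + 1):
    for ix in range(len(xs) + 1):
        assert (select("a", xs, c) <= ix) == (c <= rank("a", xs, ix))
```

The loop checks the adjunction exhaustively on one example: rank is monotone in the prefix length, and select picks the least prefix achieving a given count.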
§ Coselect
We can build another version of
select
called
coselect
that scans
from the right. Alternatively, it finds on the reverse list:
coselect :: Eq a => a -> [a] -> Count -> Maybe Ix
coselect a0 as c = Ix <$> findIndex (== c) (reverse (eqscan a0 as))
Thanks to Edward Kmett for teaching me this.
We wish to uniformly sample k-colorings of a graph G with maximum degree
Δ. Hence, we require k ≥ Δ + 1. To perform this sampling,
we use MCMC to sample from a markov chain whose states are k-colorings of G,
whose stationary distribution is the uniform distribution over valid colorings.
The issue with MCMC techniques is that we never know when our chain has reached
the stationary state. To ensure we receive uniformly distributed samples,
we build a "bounding chain" W which has the following properties:
 States of
W cover states of
X [ie, state space of
W are subsets of the state space of
X].

W's convergence to a stationary state can be checked.
 Upon convergence of
W, state of
W = state of
X.
We will in fact run the
W chain, and prove that running the
W chain
is equivalent to running a 'shadow' of the
X chain, and that stationary
states of the
W chain correspond to stationary states of the
X chain.
Let
X be the chain whose states are valid
k colorings of
G. In one step
of the chain
X, we choose a vertex
v uniformly at random; we then choose a
color
c′_{v} uniformly at random for
v that makes it a proper coloring. The vertex
v is changed to this new color
c′. This is a symmetric proposal distribution,
Hence this chain has the uniform distribution over
kcolorings to be
the stationary state.
Sampling exactly from this chain is hard: constructing an initial state X_{0}
amounts to finding some valid k-coloring, which in itself might be
challenging. Worse, we do not know whether the chain X has reached a
stationary state or not.
§ Bounding Chain
We construct a new chain
W (the bounding chain of
X), whose states are
sets of colors for
vertices in
G. Formally, the states of
W are functions
Vert(G) → 2^{C} where
Vert(G) denotes the vertices of
G;
C the set of colors. The transition
will be to pick a vertex
v uniformly at random. Then, pick a new set of
legal colors
C′_{v} for
v, such that:
 It is guaranteed that if
X were transitioning on
v, the
color
c′_{v} that would be picked by
X for
v is a member of
C′_{v}. [state is covered]
 The size of the set
C′_{v} attempts to be smaller than the current set of colorings
C_{v}. [convergence]
We describe the transition function next. But first, we need an alternate
lens on the transitions of
X that is amenable to massaging.
§ Equivalent description of the transitions of
X:
 Choosing a color uniformly at random from the set of valid colors
for a vertex.
 Choosing colors from
C without replacement until we get a color
that is a valid color.
We claim that (1) and (2) have the same probability distribution.
Abstracting slightly, we state:
 Probability of choosing an element
t ∈ T uniformly from
T ⊆ S.
This has probability
1/T.
 Probability of choosing a particular element
t ∈ T, by picking elements
from
S without replacement until we get some element in
T.
(1) and (2) have the same probability distribution.
§ Proof by induction:
Process (2) in code is the following:
def process(S, T):
    assert(issubset(T, S)) # precondition
    s = np.random.choice(S) # uniform choice over S.
    if s in T: return s # T/S prob. to enter `if`.
    else:
        # (1 - T/S) to enter `else`
        Snext = S.remove(s); return process(Snext, T)
We claim that the probability that
process(S, T) = t0
for a fixed
t0
in
T
is
1/T. We create a new function
indicator
to express this:
def indicator(t0, S, T):
assert(t0 in T) # precondition
assert(issubset(T, S)) # precondition
return t0 == process(S, T)
Let's push in t0 == into the definition of process, after inlining process.
This gives us:
def indicator(t0, S, T):
    assert(t0 in T) # precondition
    assert(issubset(T, S)) # precondition
    s = np.random.choice(S)
    if s in T:
        # T/S prob. for s in T
        return t0 == s # 1/T prob. for t0 == s
    else:
        # (1 - T/S) to reach else branch.
        Snext = S.remove(s); return indicator(t0, Snext, T)
Now, we write down the recurrence for the probability that we are trying
to compute:
P(t0, S, T) is the probability that
indicator(t0, S, T)
returns
True
. Alternatively, it's the probability that
process(S, T)
returns
t0
.
P(t0, S, T) = prob. that process(S, T) = t0
P(t0, T, T) = T/T * 1/T = 1/T                      [base case]
P(t0, S, T) = 1/S + (1 - T/S) * P(t0, S-1, T)      [induction]

We assume for induction that P(t0, S-1, T) = 1/T. On substitution into
[induction], we get:
P(t0, S, T)
= 1/S + (1 - T/S) * P(t0, S-1, T)   [induction]
= 1/S + (1 - T/S) * 1/T
= 1/S + (1/T - 1/S)
= 1/T
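The recurrence can be checked exactly (no sampling needed) with rational arithmetic; a small sketch, writing nS and nT for the sizes |S| and |T|:

```python
from fractions import Fraction

def P(nS, nT):
    # Probability that process(S, T) returns one fixed t0 in T,
    # following the recurrence above with |S| = nS, |T| = nT.
    if nS == nT:
        return Fraction(1, nT)                            # base case
    return Fraction(1, nS) + (1 - Fraction(nT, nS)) * P(nS - 1, nT)

# The recurrence collapses to 1/|T|, independent of |S|:
assert P(6, 3) == Fraction(1, 3)
assert P(10, 4) == Fraction(1, 4)
```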
Which is indeed the same probability as (1):
1. Choosing an element uniformly from a subset
T = 1/T
.
§ Proof by program analysis, Version 1
def process(S, T):
    s = np.random.choice(S)
    if s in T: return s
    else: return process(S.remove(s), T)
Notice that the last return is a tail call. This program can be rewritten as:
def process(S, T):
    while True:
        s = np.random.choice(S)
        if s in T: return s
        S = S.remove(s)
As previously, we create an
indicator
function and study its behaviour:
def indicator(t0, S, T):
    assert(t0 in T) # precondition
    assert(issubset(T, S)) # precondition
    while True:
        s = np.random.choice(S)
        if s in T: return t0 == s
        S = S.remove(s)
We know that this program only returns a value from the line:
if s in T: return t0 == s
We now compute P(process(S, T) == t0).
Whatever the return value of indicator, we can assume that it occurred within
the if condition. We can use this to compute:
P(indicator(t0, S, T) = True)
= P(t0 == s | s in T) [program only returns in this case]
= 1/T
§ Proof by program analysis, Version 2
Alternatively, we can also analyze this as we did in the
first proof,
using the rule:
def f():
    return cond2 if cond1 else cond3
will have probability of returning True:
P(cond1) * P(cond2 | cond1) + P(not cond1) * P(cond3 | not cond1)
Using this, we can analyze
indicator
as:
P(indicator(t0, S, T) = True)
= P(s in T) * P(t0 == s | s in T) +
  P(s not in T) * P(indicator(t0, S.remove(s), T) | s not in T)
= T/S * 1/T + (S-T)/S * P(indicator(t0, S.remove(s), T))
= 1/S + (S-T)/S * 1/T [by induction]
= 1/T
§ Sampling using the above definition
Recall that
State(X) ≡ V → C,
State(W) ≡ V → 2^{C}.
Let
x[ ], w[ ] be the states of some run in the markov chains.
We start by having x[0] be any valid k-coloring of G, and w[0]
be the state where all vertices have all possible colors;
w[0] ≡ _ ↦ C.
Clearly,
x[0] ∈ w[0].
By induction we assume that
x[n−1] ∈ w[n−1]. We must now calculate a
w[n], x[n] such that (1)
x[n] ∈ w[n], (2)
x[n]'s proposal is symmetric.
(3)
w[n]'s proposal is symmetric.
§ Occluded set
O
Define O ⊆ C (for occluded) to be the set of colors that might possibly
be blocked for v from our view of w. Note that this is an
over-approximation: that is, there may be colors that are not blocked for
v, which we believe to be blocked from w's point of view.
O ≡ { c ∈ C : (v, α) ∈ E, c ∈ w[n−1](α) } 
§ Allowed set A
Define A ⊆ C (for allowed) to be C − O. Note that A is
an under-approximation, since O was an over-approximation. That is:
 Any color in A is definitely a legal color for v in x[n].
 There are colors which are legal for v in x[n] that are not in A.
§ S: the sequence of states for transition
Now, we pick elements of C in sequence till we get an element of A.
Call this sequence S. We will at worst pick Δ + 1 elements for S,
since the max-degree of the graph is Δ.
§ Transition
Let
i be the first index in
S where we get a color that is
truly legal
for
v in
x[n]. Note that such an index will always exist: We pick
elements into
S till we get an element in
A, and elements of
A are
always legal. However, there can be elements which are not in
A that
are still legal for
v in
x[n], since
A is an underapproximation.
 We assign x[n](v) = S[i]. So, x only cares about S[:i].
 We assign w[n](v) = A. So, W cares about the entire sequence.
By the lemma proven, we know that this process of picking colors
C
in a sequence till we get a color that is legal for
v at index
i
is the same as picking uniformly at random from the set of colors that are legal for
v.
§ An example
For example, we could have:
X | p:2      | q:4
W | p:{1, 2} | q:{4, 5}
we are sampling q:
O = {1, 2}
A = {3, 4, 5}
S = [2, 1, 3]
X | p:1      | q:2
W | p:{1, 2} | q:{1, 2, 3}
If we analyze
S = [2, 1, 3]
we notice that:
2: invalid for W(p:[1, 2]), invalid for X(p:2)
1: invalid for W, valid for X
3: valid for W, valid for X
So, what we are really sampling is:
 A prefix sequence SX = [1] (for X's transition)
 A leftover sequence SW = [2, 3] (for W's transition)
To transition
X
, we can safely drop
SW
. However, to transition
W
correctly,
we generate more elements in the sequence, till we hit a "safe" element.
§ An optimisation on picking colors: why we need a blocking set
Define the set
B ⊆ C (for blocked) which governs which values
x[n] surely cannot take from our view of
w[n−1].
Note that B is an under-approximation: v might have
more colors that are blocked than what w sees.
B ≡ { c ∈ C : (v, α) ∈ E, w[n−1](α) = {c} } 
Rather than sampling colors from
C
till we get an element of
A
, we can
sample colors from
C ∖ B
. We know that the colors in
B
can
never be used
by
X, since the colors in
B
are those that we know are blocked
for sure.
This is used in the theoretical analysis of the paper.
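A worked example of the three sets for one vertex, as a hedged Python sketch (the graph, colors, and bounding-chain state here are made up for illustration):

```python
C = {1, 2, 3, 4, 5}
edges = [("v", "p"), ("v", "q")]
w = {"p": {1, 2}, "q": {4}}   # bounding-chain state of v's neighbours

def occluded(v):
    # colors that *might* be blocked for v (over-approximation)
    return set().union(*[w[u] for (x, u) in edges if x == v])

def blocked(v):
    # colors *surely* blocked for v: a neighbour pinned to one color
    return set().union(set(),
                       *[w[u] for (x, u) in edges
                         if x == v and len(w[u]) == 1])

O = occluded("v"); B = blocked("v"); A = C - O
assert O == {1, 2, 4}
assert B == {4}        # q is pinned to color 4
assert A == {3, 5}
assert B <= O          # B under-approximates what O over-approximates
```

Sampling from C ∖ B = {1, 2, 3, 5} instead of all of C skips colors that X can never use.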
§ Termination
We terminate when W has "trapped" X. That is, |w[n](v)| = 1 for all
v ∈ V. In such a case, the state of W is equal to the state of X. This is
a coalescence (as it occurs in coupling from the past). From the coupling
from the past theory, we know that we have reached a stationary state of
X when this happens.
§ Pseudocode
-- | colors, vertices, edges
cs :: [C]; cs = ...
vs :: [V]; vs = ...
es :: [(V, V)]; es = ...
-- | definitely not allowed
blocked :: (V -> [C]) -> (V -> [C])
blocked v2cs v0 = concat [v2cs w | (v, w) <- es, v == v0, length (v2cs w) == 1]
-- | maybe not allowed
occluded :: (V -> [C]) -> (V -> [C])
occluded v2cs v0 = concat [v2cs w | (v, w) <- es, v == v0]
-- | definitely allowed from W's point of view
allowed :: (V -> [C]) -> (V -> [C])
allowed v2cs v0 = cs \\ occluded v2cs v0
-- | perturb the color function to another function
perturb :: (V -> [C]) -> Rand (V -> [C])
perturb v2cs = do
  randv <- uniform_rand vs
  randc_for_v <- uniform_rand (allowed v2cs randv)
  return $ \v -> if v == randv then [randc_for_v] else v2cs v
-- | check if we are done
terminated :: (V -> [C]) -> Bool
terminated v2cs = and [length (v2cs v) == 1 | v <- vs]
-- | generate chain
chain :: (V -> [C]) -> Rand [V -> [C]]
chain f =
  if terminated f
  then return [f]
  else do { f' <- perturb f; fs <- chain f'; return (f:fs) }
-- | return a sample
sample :: (V -> [C]) -> Rand (V -> [C])
sample = fmap last . chain
§ References
§ Relationship between CFTP and reset transitions
§ References
A scathing critique of how American math education is screwed up. Also, an
interesting anecdote about how looking for 'reality' in mathematical problems
may in fact break students' ability to think in the abstract! This is a deep
insight.
I've wanted to learn how the SageMath system is organized when it comes to math
hierarchies. I also wish to learn how lean4 encodes their hierarchies. I know
how mathematical components does it. This might help narrow in on what the
"goldilocks zone" is for typeclass design.
§ References
I learnt of this from an amazing discussion on HackerNews, where a sighted
programmer, who is going blind, asked the community if he could remain
a software engineer. An answer by kolanos read:
You can definitely continue as a software engineer. I'm living proof. ... For example, as you get better with a screen reader, you'll be bumping the speech rate up to 1.752X normal speech. ... Typos will be easily spotted as they just won't "sound right". It will be like listening to a familiar song and then hitting an off note in the melody. And this includes code. Also, because code is no longer represented visually as blocks, you'll find you're building an increasingly detailed memory model of your code. Sighted people do this, too, but they tend to visualize in their mind. When you abandon this two dimensional representation, your nonvisual mental map suffers no spatial limits. You'll be amazed how good your memory will get without the crutch of sight.
I find this incredibly fascinating. I think I'm going to try this: I'll listen
to lectures at 1.5-2x anyway, so this may work just as well. I'm planning
on trying this with the MLIR codebase, which I want to get to know intimately.
I'll write updates on how this goes.
Drew DeVault also posted links to tools/scripts
for programming without needing visual feedback:
I also learnt about emacspeak, which
supposedly works well for an audio-based workflow.
 [https://news.ycombinator.com/item?id=22919455]
I tend to forget the name of this trick. It exhibits spinors in real life:
a system that needs to rotate by 720 degrees to return back to its
original state, versus the usual 360 that we are generally used to. We need
to consider our entire arm + cup we are holding as a system for this to work.
If one does not have recursive calls, one can eliminate the need to push
return addresses on a call stack by writing self-modifying code.
I learnt of this from TAOCP, volume 1.
Knuth shows this off once he introduces MIXAL, his fantasy
assembly language in which TAOCP programs are written.
I'll explain the usual way one performs call-return, then explain the nifty
self-modifying-code way. I think this is the cleanest, most accessible
example of self-modifying code that I know.
§ The traditional solution for call/ret
We wish to have function
f
call function
g
. For
g
to be able to
return control to
f
,
f
pushes a return address into the call stack,
that
g
pops and
jmp
s to
* f's body
START_F:
    ...
L0: push addr(L1) ; location of instr to be run after call.
    jmp START_G
L1: <code after call>
* g's body
START_G:
    ...
    retloc = pop ; pop location to jump to
RETG: jmp retloc
Rather than pushing and popping, we can rewrite the code of g, to
change the instruction at RETG before a call to g. In made-up pseudocode,
here's what that would look like:
§ The jump based solution
* f's body
START_F:
...
L0: store loc(RETG), assemble(jmp addr(L1))
jmp START_G
L1: <code after call>
* g's body
START_G:
...
RETG: <### to-be-filled dummy ###>
Instead of having a call stack, before f calls g, f modifies g's code at
location RETG into a jmp instruction, by storing the instruction
jmp addr(L1).
This effectively creates a 'custom' g that knows how to return
control flow into f.
* g's body (after execution of L0)
START_G:
...
RETG: jmp addr(L1)
This way, we have obviated the need for a push/pop sequence, by directly
modifying g's code. This is really neat: we replace the overhead of
a push/pop with a single store.
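The scheme can be simulated at a high level in Python, as a hedged toy (all names mine): "code memory" is a mutable dict of labels, and `f` patches the callee's RET slot instead of pushing onto a stack.

```python
trace = []

def g_body():
    trace.append("g ran")
    code["RETG"]()   # jump to whatever was scribbled into RETG

code = {"RETG": lambda: None}   # dummy, overwritten before each call

def f():
    trace.append("f: before call")
    # the "store loc(RETG), assemble(jmp addr(L1))" step:
    code["RETG"] = lambda: trace.append("f: after call")
    g_body()                    # the "jmp START_G" step

f()
assert trace == ["f: before call", "g ran", "f: after call"]
```

Calling `f` twice still works; the breakage only appears with re-entrant call chains, as the next section explains.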
§ Why recursion breaks.
We don't actually need a call stack, as long as we don't want to write recursive
functions. We can't have recursion, or more generally "re-entrance": consider a
call chain of the form A -> B -> C -> B.
 During A -> B, A will scribble a return jump into B.
 During B -> C, B will scribble a return jump into C.
 During C -> B, C will scribble a return jump into B,
destroying the previous return jump.
 This creates a cycle, where C will attempt to return to B
and vice versa.
An adjunction F ⊣ U allows us to go from F a -> x to a -> U x. We
can look at this as shifting the "before-advice" from the input to an
"after-advice" of the output, where I'm using "advice" in the
aspect-oriented-programming sense.
Also, to remember this, we can write it as:
F a <--F-- a
 |         |
 v         v
 x  --U-> U x
Where F is for Free and U is for forgetfUl.
Recall that if F is free and U is forgetful, then U(F(x)) = x, since
adding more structure through the free functor and then stripping it away gives
us the object back. We can prove that if U(F(x)) = x and U, F are functors,
then we have a function fwd: (f a -> x) -> (a -> u x) as:
f :: f a -> x
fmap :: (p -> q) -> (u p -> u q)
fmap f :: u (f a) -> u x
fmap f :: a -> u x [using u (f a) = a]
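A concrete instance of fwd, sketched in Python for the free-monoid adjunction (all names mine): F a is lists over a (the free monoid), U x is the underlying set of a monoid x, and a monoid homomorphism out of the free monoid is determined by where it sends singletons.

```python
def fwd(f):
    # (F a -> x)  becomes  (a -> U x): act on the singleton list.
    return lambda a: f([a])

# Example monoid x: (int, +, 0); sum is a monoid hom from lists to it.
unit = fwd(sum)
assert unit(5) == 5   # sum([5])

def bwd(u):
    # Rebuild the hom on any list using x's monoid operation (+ here).
    return lambda xs: sum(u(a) for a in xs)

assert bwd(unit)([1, 2, 3]) == sum([1, 2, 3])
```

`bwd` here is specific to the additive monoid; it illustrates the round trip of the adjunction rather than the general construction.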
§ References
If we consider a language like
Janus
where every program is reversible, we can then get a group structure on
programs with the identity program not computing anything at all, the inverse
performing the reverse operation.
Alternatively, one can use the trick from quantum mechanics of using ancillary
qubits to build reversible classical gates.
The question is, do either of these approaches allow for better-than-STOKE
exploration of the program space? Ie, can we somehow exploit the
discrete group structure (in the case of Janus) or the Lie group structure
of the unitary group (as in the QM case) to find programs in far quicker ways?
So, I've shifted the blog to be static-site-generated using a
static-site-generator written by yours truly. The code clocks in at around a
thousand lines of C++:
What did I gain?
 My generator is a real compiler, so I get errors on math and markdown
malformation.
 I can write math that loads instantly on your browser, using no MathJax,
KaTeX or any client side processing, nor the need to fetch images, which looks like this:
§ Why?
My blog is a
single 9000 line markdown file,
rendered as a
single HTML page, so I
need it to compile fast, render fast, render beautiful.
Existing tools compromise on one or the other.
§ No seriously, why a single markdown file?
I need a single file to edit, so I can rapidly jot down new ideas. This is
the essence of why I'm able to log most of what I study:
because it's seamless.
Far more importantly, it provides spatio-temporal locality. I add things
in chronological order to the blog, as I learn things. If I need to recall
something I had studied, I go to that location in the blog
based on a sense of when.
When I do get to a location I want, the scrollbar gives me a sense of
where I am in the file. This is important to me, since it helps me reason
spatially about what I know and what I've learnt. It's something I love about
books, and deeply miss when navigating the web. I'm determined to keep this
spatio-temporal locality on my little slice of the internet.
§ Why is this awful?
As elegant as this model is to
edit, it's awful for browsers to render. The
file used to take on the order of minutes for all the math to finish
rendering. MathJax (and KaTeX) painfully attempt to render each
math block. As they do, the page jumps around until everything has settled.
As this is happening, your CPU throttles, your lap or hand gets warm,
and the page is stuck. Clearly not great UX.
I still want math. What do I do? The solution is easy: Approximate the math
rendering using ASCII/UTF8 characters! There are tools that do this 
hevea
is one of them. Unfortunately, there is no
markdown-based blogging platform that uses this, so I
had to write my own.
§ The cure
The solution is easy. I wrote the tool. The page you're reading it
is rendered using the tool. All the math renders in under a second because
it's nothing crazy, it's just text and tables which browsers know how to
render. No JavaScript necessary. snappy performance. Whoo!
§ The details: Writing my own Markdown to HTML transpiler.
the final transpiler clocks in at 1300 LoC of C++,
which is very small for a feature-complete markdown-to-HTML piece of code
that's blazing fast, renders math correctly, and provides error messages.
§ Quirks fixed, features gained.
I got quite a bit "for free" as I wrote this, fixing mild annoyances
and larger pain points around using github + markdown for publishing on
the web:
 I really don't want tables, but I do want the ability to write vertical bars
(|) freely in my text. Unfortunately, github insists that those are tables,
and completely wrecks rendering.
 I get line numbers in code blocks now, which Github Flavoured Markdown
did not have.
 I get error messages on incorrectly closed bold/italic/code blocks, using
heuristics that prevent them from spanning across too many lines.
 I get error messages on broken latex, since all my latex passes through
hevea
. This is awesome, since I no longer need to refresh my browser,
wait for mathjax to load, go make myself tea (remember that mathjax was slow?),
and then come back to see the errors.
 I can get error messages if my internal document links are broken. To be
fair, my tool doesn't currently give me these errors, but it can (and soon
will).
 In general, I get control, which was something I did not have with
rendering directly using Github, or using someone else's tool.
§ Choice of language
I chose to write this in C-style C++, primarily because I wanted the tool to be fast, and I'd missed writing C++ for a while. I really enjoy how stupid-simple C-style C++ turns out to be: the C++ papers over some of C's annoyances (like formatted output for custom types), while still preserving the KISS feeling of writing C.
Why not Rust? I freely admit that Rust might have been a sane choice as well. Unfortunately, asking Rust to treat a UTF-8 string as a "ball of bytes" is hard, when it's stupidly easy with C. Plus, I wanted to use arena-style allocation, where I make huge allocations in one go and then don't think about memory, something that I don't have control over in Rust. I don't have any segfaults (yet, perhaps), thanks to UBSAN and ASAN. I find Rust to have more impedance than C on small applications, and this was indeed small.
§ Performance
Everything except the LaTeX-to-HTML step is blazing fast. Unfortunately, calling hevea is slow, so I implemented a caching mechanism to make using hevea not-slow. hevea does not have an API, so I need to fork and talk to its process, which is understandably slow. I built a "key-value store" (read: serialize data into a file) with the stupidly simple approach of writing an append-only log into a file. hevea is conceptually a pure function, since on providing the same LaTeX input it's going to produce the same HTML output, so it's perfectly safe to cache it:
const char DB_PATH[] = "./blogcache.txt";
unordered_map<ll, const char *> G_DB;

void loadDB() {
  G_DB = {};
  FILE *f = fopen(DB_PATH, "rb");
  ...
  while (!feof(f)) {
    ll k, len;
    fread(&k, sizeof(ll), 1, f); if (feof(f)) break;
    fread(&len, sizeof(ll), 1, f);
    ...
    char *buf = (char *)calloc(sizeof(char), len + 2);
    fread(buf, sizeof(char), len, f);
    ...
  }
  fclose(f);
};

const char *lookup_key(ll k) {
  unordered_map<ll, const char *>::iterator it = G_DB.find(k);
  if (it == G_DB.end()) { return nullptr; }
  return it->second;
};

void store_key_value(const ll k, KEEP const char *v, const ll len) {
  assert(G_DB.count(k) == 0);
  G_DB.insert(make_pair(k, strdup(v)));
  FILE *f = fopen(DB_PATH, "ab");
  assert(f != nullptr && "unable to open DB file");
  fwrite(&k, sizeof(ll), 1, f);
  fwrite(&len, sizeof(ll), 1, f);
  fwrite(v, sizeof(char), len, f);
  fclose(f);
}
§ For the future
I plan to rip out hevea and write my own LaTeX → HTML converter for the subset of LaTeX I actually use. hevea's strength is its downfall: it can handle all of LaTeX, which means it's really slow. If I can concentrate on a small subset, I don't need to play caching tricks, and I can likely optimise the layout further for my use cases.
I also want colored error messages, because who doesn't?
I'll probably gradually improve my static site generator over time. Once it's
at a level of polish where I'm happy with it, I'll spin it out as a separate
project.
§ Conclusions
Am I glad I did it? Yes, purely because my chunk of the internet aligns with how I want it to be, and that makes me ε more happy. I think of it as an investment in future me, since I can extend the markdown and the transpiler in the way I want them to be.
Consider a ground set X. Let the space of all possible binary classifications be the function space C ≡ { f ∣ f : X → ±1 }. Now, a hypothesis class H is a subset of C. For example, some model such as "return +1 if a point is inside a region, −1 otherwise" is a subset of the full class C.
The VC dimension of H measures how good H is at generating different classifications. We first need the notion of shattering to define this.
We first need the notion of shattering to define this.
A hypothesis class H shatters a subset S ⊆ X of the ground set if the function act_S has full range, where act_S is defined as:
act_S : H → {+1, −1}^S
act_S(h) ≡ (h(s_0), h(s_1), h(s_2), …, h(s_n))
That is, the hypothesis class H can classify all the subsets of S.
Now, the VC dimension of a hypothesis class H over a ground set X is the size of the largest possible S ⊆ X such that S is shattered by H.
§ Correct interpretation
- We need just one set S of size n to be shattered by H. We get to pick the set S.
§ Subtlety 1:
- We do not need all sets of size n to be shattered by H. We can have the case where:
- All sets of size 3 are shattered by H.
- Only one set of size 4 is shattered by H. All other sets of size 4 are not.
- Only one set of size 5 is shattered by H. All other sets of size 5 are not.
- No set of size 6 is shattered by H.
In this case, the VC dimension of H is 5, not 3.
§ Subtlety 2:
We cannot have the case where:
- All sets of size 3 are shattered by H.
- No set of size 4 is shattered by H.
- Some set of size 5 is shattered by H.
For contradiction, let S be the set of size 5 that is shattered by H. Let T ⊆ S with |T| = 4. Now, H shatters T since H shatters S. Hence, some set of size 4 has been shattered: contradiction, since we assumed that no set of size 4 is shattered by H.
So, to prove that sets of size ≥ n cannot be shattered, it suffices to prove that sets of size exactly n cannot be shattered.
§ Growth of the number of sets shattered, for a fixed H
If we fix a hypothesis class H for X, and we want to understand how |act_S(H)| varies over subsets S ⊆ X, the idea is this. Let S be a maximum-sized set that is shattered by H; i.e., |S| = VCdim(H) and H shatters S. Then:
- For subsets T ⊆ S, |act_T(H)| = 2^{|T|}: exponential.
- For supersets Sup ⊇ S, |act_{Sup}(H)| grows like Comb(|Sup|, |S|): polynomial.
We can show that this exponential/polynomial behaviour happens in general for S ⊆ X.
§ Basics: symplectic mechanics as inverting ω
I've never seen this kind of "inverting ω" perspective written down anywhere. Most treatments start by using the interior product i_X ω without ever showing where the thing came from. This is my personal interpretation of how the symplectic version of classical mechanics comes to be.
We start with a non-degenerate, closed two-form ω: T_p M × T_p M → ℝ.
Now, given a Hamiltonian H: M → ℝ, we can construct a vector field X_H: M → TM under the definition:
- Partially apply ω, to see ω as a mapping from T_p M to T_p^* M:
  ω2 : T_p M → T_p^* M;  ω2 ≡ λ (v: T_p M). λ (w: T_p M). ω(v, w)
- Invert it: ω2^{−1} : T_p^* M → T_p M. Also, dH : M → T_p^* M.
- X_H ≡ λ (p: M). ω2^{−1}(dH(p)); for p: M, we have dH(p) : T_p^* M, and hence ω2^{−1}(dH(p)) : T_p M.
- In point-free style: X_H = ω2^{−1} ∘ dH.
This way, given a hamiltonian
H: M → ℝ, we can construct
an associated vector field
X_{H}, in a pretty natural way.
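As a concrete check of this recipe (my worked example, in coordinates, not part of the derivation): take M ≡ ℝ² with coordinates (q, p), and ω ≡ dq ∧ dp. Write X_H = α ∂_q + β ∂_p. Then ω(X_H, ·) = α dp − β dq, while dH = (∂H/∂q) dq + (∂H/∂p) dp. Matching coefficients in ω2(X_H) = dH gives α = ∂H/∂p and β = −∂H/∂q, that is:
X_H = (∂H/∂p) ∂_q − (∂H/∂q) ∂_p
so the flow of X_H is exactly Hamilton's equations: q̇ = ∂H/∂p, ṗ = −∂H/∂q.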
We can also go the other way. Given the
X, we can build the
dH
under the equivalence:
ω2^{−1} ∘ dH = X_H
dH = ω2(X_H)
This needs some demands, like the one-form dH being integrable. But this works, and gives us a bijection between X_H and H, as we wanted.
We can also analyze the definition we got from the previous manipulation:
ω2(X_H) = dH
λ (w: T_p M). ω(X_H, w) = dH
ω(X_H, ·) = dH
We can take this as a relationship between X_H and dH. Exploiting this, we can notice that dH(X_H) = 0; that is, moving along X_H does not modify H:
ω2(X_H) = dH
λ (w: T_p M). ω(X_H, w) = dH
dH(X_H) = ω(X_H, X_H) = 0   [ω is antisymmetric]
§ Preservation of ω
We wish to show that X_H^*(ω) = ω. That is, transporting ω along the flow of X_H preserves ω.
TODO.
§ Moment Map
Now that we have a method of going from a vector field
X_{H} to a Hamiltonian
H, we can go crazier with this. We can
generate vector fields using
Lie group actions on the manifold, and then look for hamiltonians corresponding
to this lie group. This lets us perform "inverse Noether", where for a given
choice of symmetry, we can find the Hamiltonian that possesses this symmetry.
We can create a map from an element g of the Lie algebra of G to a vector field X_g, performing:
t : ℝ ↦ e^{t g} : G
t : ℝ ↦ φ(e^{t g}) : M
X_g ≡ d/dt (φ(e^{t g})) |_{t = 0} : TM
We can then attempt to recover a hamiltonian
H_{ g} from
X_{ g}. If we get a hamiltonian from this process, then it
will have the right symmetries.
These are personal notes I made on a custom notation for denoting the relations from the "Theorems for Free" paper. I developed the notation since I wanted to keep track of which types are floating around and what the relations are doing.
We interpret types as sets. If elements belong to the relation, i.e., if (a, a′) ∈ R ⊆ A×A′, we will denote this as a[A R A′]a′. We will now write down some inference rules:
- We define ReflB by a[Bool ReflB Bool]a.
- We define ReflI by i[Int ReflI Int]i.
- The product of two relations [A R B], [X S Y] is called R×S, and is defined by: (a,x)[A×X R×S B×Y](b,y) iff a[A R B]b ∧ x[X S Y]y.
- The list space of [A R B] is called [A* [A R B] B*], and is defined by: la[A* [A R B] B*]lb iff |la| = |lb| ∧ (∀ i, la[i][A R B]lb[i]).
- The function space of two relations [A R B], [X S Y] is called [A→X R→S B→Y], and is defined by: f[A→X R→S B→Y]g iff ∀ a b, a[A R B]b ⟹ f(a)[X S Y]g(b).
- The type family space of two relations is a function that takes a relation [A R B] and produces a new relation: g[FA [A R B] FB]h. The relation takes as parameter a relation [A R B] for each choice.
- The space of relations of ∀X.F(X) is a relation defined by: g[∀X.F(X)]h iff ∀ A B R, (g A)[FA [A R B] FB](h B).
§ Parametricity theorem
The parametricity theorem states that for all terms (r: R), we can deduce r[R rel(R) R]r, where rel(R) is the relation that fits the type, and is derived from the above rules.
§ Parametricity for lists when the relation is a function
The list space of [A R B] is called [A* [A R B] B*], and is defined by: la[A* [A R B] B*]lb iff |la| = |lb| ∧ (∀ i, la[i][A R B]lb[i]).
Now, let us take the special case where [A R B] is a function δ: A → B; that is, a[A R B]b iff δ(a) = b. If this is the case, then we can simplify the math to:
la[A* [A R B] B*]lb ⟺ |la| = |lb| ∧ (∀ i, la[i][A R B]lb[i])
la[A* [A R B] B*]lb ⟺ |la| = |lb| ∧ (∀ i, δ(la[i]) = lb[i])
la[A* [A R B] B*]lb ⟺ map δ la = lb
§ Parametricity to prove rearrangements
r[∀ X. X* → X*]r
(r A)[A*→A* [A*→A* [A R B] B*→B*] B*→B*](r B)
as[A* [A R B] B*]bs ⟹ (r A as)[A* [A R B] B*](r B bs)
- Pick [A R B] to be a function δ: A → B. I.e., a[A R B]b iff δ(a) = b.
- This lets us convert all occurrences of α[A R B]ω into ω = δ(α).
- Hence, as[A* [A R B] B*]bs becomes map δ as = bs.
- Hence, (r A as)[A* [A R B] B*](r B bs) becomes map δ (r A as) = (r B bs).
- In toto, this lets us replace bs with map δ as. We derive:
map δ (r A as) = (r B bs)
map δ (r A as) = (r B (map δ as))
map δ ∘ (r A) = (r B) ∘ map δ
- Replacing bs[i] with δ(as[i]) gives the elementwise form of the same result, which was indeed what we were looking for.
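The rearrangement theorem is easy to spot-check. Taking r ≡ reverse and δ ≡ toString (my choices for illustration; rev and mapv below are stand-ins for the r and map of the derivation), the free theorem map δ ∘ (r A) = (r B) ∘ map δ becomes:

```cpp
#include <algorithm>
#include <string>
#include <vector>

// r : ∀X. X* → X* must be parametric; reverse is one such r.
template <typename T>
std::vector<T> rev(std::vector<T> v) {
    std::reverse(v.begin(), v.end());
    return v;
}

// map δ over a list: the list action of the function δ : A → B.
template <typename A, typename B, typename F>
std::vector<B> mapv(F delta, const std::vector<A> &as) {
    std::vector<B> bs;
    for (const A &a : as) bs.push_back(delta(a));
    return bs;
}
```

Mapping then reversing agrees with reversing then mapping, for any δ, without inspecting how rev is written: that is exactly what parametricity buys.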
I've always found code that uses half-open intervals far harder to write than code using closed intervals. For example, when performing string processing, I prefer to write closed over open since I find it easier to think about:
void closed(int begin, int end, char *c) { // easier
  for(int i = begin; i <= end; ++i) { ... }
}
void open(int begin, int len, char *c) { // harder
  for(int i = begin; i < begin + len; ++i) { ... }
}
However, I realised that changing how I think about this to:
void open(int begin, int len, char *c) { // easier now
  for(int i = begin; i != begin + len; ++i) { ... }
}
somehow made it way easier to grok.
- I had problems with < since I would mentally shift from i < begin + len to i <= begin + len − 1. Making this move would then make all other reasoning harder, since I had to keep switching between the < and <= points of view.
- On the other hand, when using i != begin + len, there is a single location to focus on: the point begin + len, and what happens when i reaches it.
Of course, this is old hat to anyone who performs loop optimisations: LLVM internally converts most comparisons into the a != b form, because it's easier to analyse. It took me this long to realise that it's easier for me to think in this viewpoint as well.
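One more payoff of the half-open view (my aside): ranges tile. [begin, mid) and [mid, end) cover [begin, end) exactly, with no ±1 adjustments, and the empty range begin == end needs no special case:

```cpp
// Sum over the half-open range [begin, end) in the `!=` style:
// the single point to track is `end`, where iteration stops.
int sumRange(const int *xs, int begin, int end) {
    int total = 0;
    for (int i = begin; i != end; ++i) total += xs[i];
    return total;
}
```

Splitting at any mid gives sumRange(xs, b, m) + sumRange(xs, m, e) == sumRange(xs, b, e); the closed-interval version needs a −1 somewhere to say the same thing.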
I haven't found anything on the internet that describes how to build a fusion bomb; it's almost as if this information has been suppressed by governments. However, I'm curious: would a physics grad student studying nuclear physics or explosives have the theoretical know-how to precisely build one, given the raw materials? Or is there some "secret sauce" that's necessary?
I read on Wikipedia that most countries classify the details:
Detailed knowledge of fission and fusion weapons is classified to some degree in virtually every industrialized nation. In the United States, such knowledge can by default be classified as "Restricted Data", even if it is created by persons who are not government employees or associated with weapons programs, in a legal doctrine known as "born secret".
Suppose we have a manifold M of dimension d that has been embedded isometrically into ℝ^n. So we have a function e: ℝ^d → ℝ^n which is the embedding. We will identify M with the subspace Im(e).
Recall that ∂_{x_i} e : ℝ^d → ℝ^n is defined as:
[∂_{x_i} e](p) ≡ lim_{δx → 0} [ e(p + (0, …, δx in position i, …, 0)) − e(p) ] / δx
Note that it is a function of type ℝ^d → ℝ^n.
- The tangent space at a point p ∈ Im(e) is going to be spanned by the basis { ∂_{x_i} e |_p : ℝ^n }.
- The metric tensor of M is g_{ij} ≡ ⟨ ∂e/∂x_i ∣ ∂e/∂x_j ⟩. That is, the metric tensor "agrees" with the dot product of the ambient space ℝ^n.
- A vector field V on the manifold M is by definition a combination of the tangent vector fields: V(p_0) ≡ v^j(p_0) ∂_{x_j} e(p_0).
We can calculate the derivative of this vector field as follows:
∂_{x_i} V = ∂_{x_i} [ v^j ∂_{x_j} e ]
          = v^j · ∂_{x_i} ∂_{x_j} e + ∂_{x_j} e · ∂_{x_i} v^j
We choose to rewrite the second-degree term in terms of the tangent space, plus some component normal to the manifold that we have no control over:
(∂_{x_i} ∂_{x_j} e)(p) ≡ Γ_{ij}^k ∂_{x_k} e + n_{ij}
where n_{ij} is normal to the tangent space. This gives us the Christoffel symbols as the "variation of the second derivative along the manifold".
§ Relationship to the Levi-Civita connection
The covariant derivative defined by the Levi-Civita connection is the derivative that contains the projection of the full derivative in ℝ^n onto the tangent space T_p M. This is defined by the equations:
∇_{x_i} V = Π[ v^j · ∂_{x_i} ∂_{x_j} e + ∂_{x_j} e · ∂_{x_i} v^j ]
          = Π[ v^j · (Γ_{ij}^k ∂_{x_k} e + n_{ij}) + ∂_{x_j} e · ∂_{x_i} v^j ]
          = v^j · Γ_{ij}^k ∂_{x_k} e + ∂_{x_j} e · ∂_{x_i} v^j   [Π kills the normal component n_{ij}]
          = v^j · Γ_{ij}^k ∂_{x_k} e + ∂_{x_k} e · ∂_{x_i} v^k   [relabel j → k in the second term]
          = ∂_{x_k} e ( v^j · Γ_{ij}^k + ∂_{x_i} v^k )
On learning about infinite-dimensional vector spaces, one learns that we need the axiom of choice to assert that every such vector space has a basis; indeed, asserting this is equivalent to AoC. However, I had not known any "natural" examples of such a vector space till I studied the proof of the Barvinok algorithm. I produce the example here.
Consider a space such as S ≡ ℝ^3. Now, consider the vector space spanned by the indicator functions of polyhedra in S. That's a mouthful, so let's break it down.
A polyhedron is defined as a set of points cut out by linear inequalities:
P ≡ { x ∈ S : a_i · x ≤ b_i, i ∈ [1…n] },
for a_i ∈ S, b_i ∈ ℝ.
The indicator functions are of the form:
[poly]: S → ℝ;  [poly](x) ≡ 1 if x ∈ poly, 0 otherwise.
We can define a vector space of these functions over ℝ, using "scaling" as the action of ℝ on these functions. The vector space V is defined as the span of the indicator functions of all polyhedra. It's clearly a vector space, and a hopefully intuitive one. However, note that the set we generated this from (indicators of polyhedra) doesn't form a basis, since there are many linear dependencies between the indicators.
For example, one can write the equation:
[rectangle] = [left half] + [right half] − [shared edge]
since the two closed halves double-count the points on the common edge.
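The same inclusion-exclusion is easy to check pointwise in one dimension, where a closed interval is the simplest polyhedron: for a ≤ b ≤ c, the indicators satisfy [a, c] = [a, b] + [b, c] − [b, b] as functions. A quick sketch (mine; ind is the 1-D indicator):

```cpp
// Indicator of the closed interval [lo, hi]: the 1-D case of a polyhedron
// cut out by the inequalities lo <= x and x <= hi. Returns its value at x.
int ind(double lo, double hi, double x) {
    return (lo <= x && x <= hi) ? 1 : 0;
}
```

Checking the identity at points on both sides of, and exactly at, the cut point b = 1 confirms the linear dependence among these three indicators.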
Central idea: assume a memory model where computation is free; the only cost is pulling data from main memory into the cache. The cache has total size M and holds blocks of size B, so it can hold M/B blocks of main memory. Main memory has infinite size. Cost is the number of block transfers.
We assume that the algorithm does not know M or B. We assume that the cache replacement strategy is optimal (kick out the block that is going to be used farthest in the future). This is an OK assumption to make, since an LRU cache using twice the memory of an "oracular" cache performs equally well (citation?).
These data structures are cool since they essentially "adapt" to varying cache hierarchies, and even multi-level cache hierarchies.
We study how to build cache-oblivious B-trees.
§ Building optimal cache-oblivious B-trees to solve search
- We use a balanced BST. We want to find an order in which to store nodes in memory such that when we search for an element, we minimize the number of blocks we need to pull in.
- All standard orders such as level order, preorder, and postorder fail.
- The correct order is the "vEB (van Emde Boas) order": carve the tree at the middle level of its edges. Lay out the top "triangle" (the smaller collection of nodes) linearly. Then recursively lay out the bottom trees, linearly in memory.
- If the number of nodes is N, we will have roughly √N nodes in the top triangle, and then √N triangles at the bottom.
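The carving described above can be written down directly. A sketch (mine, with nodes named by heap index; the real structure stores node payloads in this order):

```cpp
#include <vector>
using std::vector;

// van Emde Boas layout of a perfect binary tree, nodes named by heap index
// (root = 1, children of i are 2i and 2i+1). `height` counts levels.
// Carve at the middle: lay out the top triangle, then each bottom triangle,
// each one recursively and contiguously in memory.
void vebOrder(long root, int height, vector<long> &out) {
    if (height == 1) { out.push_back(root); return; }
    int top = height / 2, bottom = height - top;
    vebOrder(root, top, out);                 // top triangle first
    long first = root << top;                 // leftmost bottom-triangle root
    for (long i = 0; i < (1L << top); ++i)    // one bottom triangle per child
        vebOrder(first + i, bottom, out);     //   of a top-triangle leaf
}
```

For height 3 this yields the order 1, 2, 4, 5, 3, 6, 7: the root, then each bottom subtree stored contiguously, which is exactly what makes a root-to-leaf path touch few blocks.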
§ Analysis. Claim: we need to pull O(log_B N) blocks, for any B, for any search query
N is the number of nodes in the BST. Note that in the analysis we know what B is, even though the algorithm does not.
- We look at a particular level of recursion. We will call it a "level of detail" straddling B.
- We will have large triangles of size ≥ B, inside which there are smaller triangles of size ≤ B (reminds me of Sierpinski).
- We know that the algorithm recursively lays the tree out, and a triangle stores everything "inside" it in a contiguous region. So we stop at the requisite size where we know that the tree's triangles themselves contain triangles which fit into the block size.
- A little triangle of size less than B can live in at most two memory blocks, by straddling a block boundary: e.g. by having (B−1) elements in one block and a single element in another block.
1 2 3 4 5 6 7 8   <- index
      |           <- block boundary
x x x x x x x     <- data
- The question is: on a root-to-leaf path, how many such triangles do we need to visit? Since we repeatedly halve the height until the little triangle has fewer than B nodes, each little triangle has height O(log B), since it's still a binary tree.
- The total height is O(log N).
- So the height of the "chunked tree", where we view each triangle as a single node, is log N / log B = log_B N.
- Insight: our data structure's construction in some sense permits us to "binary search on B", since we divide the data structure into levels based on B. If B = N, then the full data structure fits into memory and we're good.
§ Black box: ordered file maintenance
We need a black box: ordered file maintenance ("linked lists for arrays").
- Store n elements in a specified order in an array of linear size O(N). The array permits gaps.
- Updates: delete an element; insert an element between 2 elements.
- We cannot do updates by naive shifting (that's linear time), but we can move the elements in an interval of size log²(N), amortized.
- We need O(1) scans for the data structure.
§ Next: dynamic BSTs (inserts and deletes): layout
We take a static vEB tree on top of an ordered file. The tree is a segment tree that stores the max of its children. The leaves are the members of the ordered file.
§ Updates
- Search for the node.
- Update the ordered file.
- Propagate updates into the tree. This has to be done in postorder, because we need the leaves to be fixed before we can update a parent's max.
§ Updates: analysis
- Look at the level of detail that straddles B.
- Let us look at the bottom 2 levels.
- Note that when we perform the postorder traversal inside a triangle that has 3 triangles of size ≤ B, we need to alternate between the parent triangle and a child triangle. Since the parent triangle is of size ≤ B, it can occupy at most 2 blocks of memory; similarly, each child can occupy at most 2 blocks.
- So if our cache can hold 4 blocks of memory, we're done. We won't need to kick anything out when performing the postorder traversal.
- For levels above the bottom 2 levels, we're still OK: there are not many triangles / not many nodes (1:16:00 in the video).
We denote partial functions with X ⇀ Y and total functions with X → Y. A set X equipped with a closed, associative binary operator ⋆: X × X → X is a semigroup.
§ Partial function semigroup
For a ground set X, the set of partial functions Pf(X) ≡ { f: X ⇀ X }, along with function composition, forms a semigroup. It is in fact stronger than a semigroup. There exist:
- An identity function e_X: X → X; e_X(x) = x.
- A zero function θ_X: X ⇀ X; θ_X(x) = undef, where by undef we mean that it is undefined.
§ Transformation semigroup (TS)
Let Q be a set. Let S ⊆ Pf(Q) be a subsemigroup of Pf(Q). Then the pair X ≡ (Q, S) is called the transformation semigroup X of states Q.
- The elements of Q are called states of X,
- while the elements of S are called actions of X.
- The set Q itself is called the underlying set of X.
- For a fixed transformation semigroup X, we will write Q_X and S_X to refer to its states and actions.
We call X ≡ (Q, S) a transformation monoid if S contains 1_Q(q) = q.
§ Subtlety of being a transformation monoid
There is some subtlety here. Just because S is a monoid does not mean that it is a transformation monoid: it must contain the identity element of Pf(Q) to be called a transformation monoid. For example, consider the set Q ≡ {a, b} and the transformation semigroup S ≡ {f}, where f ≡ α ↦ b is the constant function at b. Now the set S is indeed a monoid, with identity element f: Q → Q (since f ∘ f = f). However, f ≠ 1_Q, and hence S is not a transformation monoid.
§ Examples of transformation semigroups
- (X, { θ }): the semigroup with only the empty transformation θ(x) = undef.
- (X, ∅): the semigroup with no transformations.
§ Semigroup action
We sometimes wish to represent a semigroup using an action/transformation semigroup on a ground set X. So, given some semigroup (T, ×) that needs to be represented, we look for a morphism r: T → Pf(X) (r for representation) such that:
- r(t_1 × t_2) = r(t_1) ∘ r(t_2). [r is a semigroup morphism]
- t_1 ≠ t_2 ⟹ ∃ x ∈ X such that r(t_1)(x) ≠ r(t_2)(x). [faithfulness]
Put more simply, t_1 ≠ t_2 ⟹ r(t_1) ≠ r(t_2), where we define function equality extensionally: f = g ≡ ∀ x, f(x) = g(x).
§ Embedding actions into the transformation semigroup
We often wish to represent some semigroup S as the transformation semigroup of some set of states Q. We can achieve this by providing a morphism r: S → Pf(Q) that is faithful. Then, we can treat elements of S as elements of Pf(Q).
§ Completion of a transformation semigroup
Given a transformation semigroup X ≡ (Q, S), we can complete it by adding a new sink state ⊥, and then converting all partial functions in S to total functions that transition to ⊥ wherever they were undefined. The sink absorbs: every s ∈ S sends ⊥ to ⊥. We denote the completion as X^c ≡ (Q^c, S^c).
§ Coverings
Let X ≡ (Q_X, S_X) and Y ≡ (Q_Y, S_Y) be transformation semigroups. Let φ ⊆ Q_Y × Q_X be a relation, and let s_x ∈ S_X and t_y ∈ S_Y. If s_x(φ(q_y)) = φ(t_y(q_y)) for every q_y (i.e., the evident square commutes), then we say that t_y covers s_x relative to φ. We imagine t_y lying above s_x, being projected down by φ.
If, for a fixed φ, for every s_x ∈ S_X there exists a t_y ∈ S_Y such that t_y covers s_x relative to φ, then we say that φ is a relation of automata.
- If φ ⊆ Q_Y × Q_X is surjective, then we say that φ is a relational covering, and write X ◁_φ Y.
- If φ ⊆ Q_Y × Q_X is both surjective and a partial function, then we say that φ is a covering, and write X ≺_φ Y.
If X ≺_φ Y, we say that Y dominates X, or Y covers X, or X divides Y.
§ Checking coverings and generating subsets
We note that for a given covering φ, if s_x is covered by t_y and p_x is covered by q_y, then s_x ∘ p_x is covered by t_y ∘ q_y. Thus, to check whether X is covered by Y, we simply need to check that some generating subset of X is covered by Y.
§ Checking coverings of representations
Let us assume we have a representation of a transformation semigroup
given with a semigroup
Σ, a transformation semigroup
X ≡ (Q_{X}, S_{X}), and a representation
r: Σ → S_{X} that is
faithful.
Now, to check that X is covered by another Y, it suffices to check that for each σ ∈ Σ there exists a t_y ∈ S_Y such that r(σ) is covered by this t_y.
§ Companion relation
Given a relation φ ⊆ Q_Y × Q_X, we define:
Σ ≡ { (s, t) : t ∈ S_Y covers s ∈ S_X }
Recall that compositions of elements are covered by the composition of their coverings. Hence, if (s, t), (s′, t′) ∈ Σ, then (ss′, tt′) ∈ Σ. Thus, Σ is a subsemigroup of S_X × S_Y. We can regard Σ as the graph of a relation φ′ ⊆ S_Y × S_X. This will be called the companion relation of φ.
§ Wreath products
Let X ≡ (Q_X, S_X) and Y ≡ (Q_Y, S_Y). We're going to define a large product X ≀ Y. We begin with the set W ≡ S_X^{Q_Y} × S_Y, where S_X^{Q_Y} ≡ { f : Q_Y → S_X }. The wreath product then becomes:
X ≀ Y ≡ (Q_X × Q_Y, W)
with the action of W on an element of Q_X × Q_Y being defined as:
(f : Q_Y → S_X, s_y : S_Y) (q_x : Q_X, q_y : Q_Y) ≡ (f(q_y)(q_x), s_y(q_y))
It's a "follow the types" sort of definition, where we edit the right component as q_y ↦ s_y(q_y), since that's all we can do. In the case of the left component, we have a q_x, and we need to produce another element of Q_X, so we "must use f". The only way to use f is to feed it a q_y. This forces us into the above definition.
§ Composition of wreath products
To show that W is closed under composition, consider (f, s_y), (g, t_y) ∈ W, with f, g: Q_Y → S_X and s_y, t_y ∈ S_Y. Applying (g, t_y) first and then (f, s_y), the composite is:
(f, s_y) ∘ (g, t_y) = (λ q_y. f(t_y(q_y)) ∘ g(q_y), s_y ∘ t_y)
since by the time f acts on the left component, the right state has already moved from q_y to t_y(q_y).
§ Equivalences of subsets of states
Let X = (Q, S) be a transition system. Given subsets a, b ⊆ Q, we shall write b ≤ a if either b ⊆ a or there exists some s ∈ S such that b ⊆ s(a), where s(a) ≡ { s(a_i) : a_i ∈ a }. We can define an equivalence relation a ∼ b ⇐⇒ a ≤ b ∧ b ≤ a.
Note: b ≤ a ⟹ |b| ≤ |a|, since b ≤ a means that b ⊆ s(a). Note that s is actually a function s: Q → Q, and a function mapped over a set can only ever decrease the number of elements in the set, since a function can only glom elements together; it can never break an element apart into two. Hence |b| ≤ |s(a)| ≤ |a|, and thus |b| ≤ |a|. Similarly, a ≤ b ⟹ |a| ≤ |b|. Therefore, b ∼ a means that |b| = |a|.
Theorem: for all a, b ⊆ Q_X such that a ∼ b and b ⊆ s(a), we show that b = s(a), and that there exists a t ∈ S_X such that a = t(b).
Proof: Since b ⊆ s(a), |s(a)| ≤ |a|, and |b| = |a| (from a ∼ b), we get b = s(a). Therefore s acts as a permutation. Hence s is invertible, and there exists an inverse permutation t such that a = t(b). We now need to show that t ∈ S_X. To do this, first note that if the order of the permutation s is n, then t = s^{n−1}, since t ∘ s = s^{n−1} ∘ s = s^n = identity. Since the semigroup S is closed under composition, t = s^{n−1} is in S, since it is s composed with itself (n−1) times.
§ Subset families of interest
We will be interested in a family A of subsets of Q_X, of the form:
- all sets of the form s(Q), for all s ∈ S_X;
- the set Q;
- the empty set ∅;
- all the singleton sets { q }, for all q ∈ Q.
On this family, we have ≤ and ∼ as defined above. We note that the set A is closed under the action of every s ∈ S_X. For example, the empty set is taken to the empty set; all singleton sets are taken to other singleton sets; and for the full set Q, we have added the sets s(Q) for all s ∈ S_X.
§ Height function
A height function for a transition system X ≡ (Q_X, S_X) is a function h: A → ℤ such that:
- h(∅) = −1.
- h({ q }) = 0 for all q ∈ Q.
- a ∼ b ⟹ h(a) = h(b) for all a, b ∈ A.
- b < a ⟹ h(b) < h(a) for all a, b ∈ A.
The notation b < a ≡ (b ≤ a) ∧ ¬(a ≤ b). Conditions (3) and (4) imply that two elements of the same height are either equivalent or incomparable.
§ Pavings and bricks
For a ∈ A such that |a| > 1, we denote by B_a the set of all b ∈ A that are maximal proper subsets of a. That is, if b ∈ B_a, then b ⊊ a, and there is no c ∈ A with b ⊊ c ⊊ a. Equivalently, if there exists a c ∈ A such that b ⊆ c ⊆ a, then b = c or c = a.
Note that we can assert that a = ∪_{b ∈ B_a} b. This is because A contains all the singletons of Q_X, so we can begin by writing a as a union of singletons, and then merge them into larger elements of A, terminating when we cannot merge any more elements below a.
- The set B_a is called the paving of a.
- The elements of B_a are called the bricks of a.
§ Group of permutations for a ∈ A
Let us assume that there exists an s ∈ S such that s(a) = a. Let A_a be the set of all elements of A contained in a: A_a ≡ { A_i : A_i ∈ A, A_i ⊆ a }. Recall that the set A is closed under the action of every s; hence, since such an s is a permutation of a, its action naturally extends to a permutation of A_a: s(A_a) = A_a. Now note that this induces a permutation of the set B_a. This creates a transition system:
G_a ≡ { s ∈ S : s(a) = a }
H_a ≡ (B_a, G_a)
We have already shown that if s ∈ S defines a permutation of some set X by its action, then its inverse also exists in S. So this means that G_a is in fact a transition group that acts on B_a.
It might turn out that G_a = ∅. However, if G_a ≠ ∅, then as stated above, G_a is a group. We will call such a transition group a generalized transition group: either G_a = ∅, or G_a is a group.
Now, the generalized transition group H_a is called the holonomy transition system of a, and the group G_a is called the holonomy group of a.
We have that G_a ≺ S, since G_a is a quotient of the subsemigroup { s : s ∈ S, s(a) = a }. (TODO: so what? why does this mean that it's ≺?)
Theorem: if a ∼ b, then H_a ≃ H_b (similar subsets have isomorphic holonomy transition systems).
Proof: Let us assume that a ≠ b. Since a ∼ b, we have elements s, s^{−1} ∈ S such that b = s(a) and a = s^{−1}(b). Recall that each g ∈ G_a permutes the bricks b_a ∈ B_a. B_b must then contain the element s(b_a). [TODO!]
§ Holonomy decomposition
Let X ≡ (Q, S) be a transition system, and let h be a height function for X such that h(Q) > 0. For a fixed i, let a_1, a_2, …, a_k be the representatives of the equivalence classes of elements of A of height equal to i. We define:
H_i^∨ ≡ H_{a_1} ∨ H_{a_2} ∨ … ∨ H_{a_k}
§ Inductive hypothesis for coverings
We will say a relational covering X ◁_φ Y is of rank i with respect to a given height function h if φ relates states in Y to subsets of states in X that are members of A and have height at most i. Formally, for each p ∈ Q_Y, we have that φ(p) ∈ A and h(φ(p)) ≤ i.
We prove that if X ◁_φ Y is a relational covering of rank i, then X ◁ H_i^∨ ≀ Y (with a suitable relation) is a relational covering of rank i − 1. The proof is a proof by induction.
The proof is a proof by induction.
§ Base case:
Start with the relational covering with Q_Y = { 0 }, S_Y = { id }, and the cover φ(0) = Q_X. Clearly, this has rank n, since the height of Q_X is n, and φ is indeed a covering, since the only transition that Y can make (stay at the same state) is simulated by any transition in S_X. [TODO: is this really the argument?]
For the induction, assume X ◁_φ Y is a relational covering of rank i with respect to some height function h, with X ≡ (Q_X, S_X) and Y ≡ (Q_Y, S_Y). We define:
QY_{i} ≡ { q_{y} : q_{y} ∈ Q_{Y}, h(φ(q_{y})) = i }

QY_{<} ≡ { q_{y} : q_{y} ∈ Q_{Y}, h(φ(q_{y})) < i }
We know that
A contains elements of height exactly
i. Let
a_{1}, a_{2}, … a_{k}
be representatives of the equivalence classes of height
i in
A. Thus, for each
qy_{i} ∈ QY_{i},
we have that:

φ(qy_{i}) = a_{j} for a unique
1 ≤ j ≤ k.
 We select elements
u, u′ ∈ S such that
u(φ(qy_{i})) = a_{j}
and
u′(a_{j}) = φ(qy_{i}).
We will show how to establish a relational covering:

X ◁_{φ} H_{i}^{∨} ≀ Y using a relation:

φ ⊆ [(B_{a1} ∪ B_{a2} ∪ … B_{ak})× Q_{Y} ] × Q_{X}
§ References
It's a somewhat well-known fact that given a matrix multiplication:
O = AB
where
O ∈ ℝ^{2n × 2m} (
O for output), and
A ∈ ℝ^{2n × r}, B ∈ ℝ^{r × 2m} are matrices,
We can also write this as follows:
⎡ o_{11}  o_{12} ⎤   ⎡ a_{11}  a_{12} ⎤ ⎡ b_{11}  b_{12} ⎤
⎣ o_{21}  o_{22} ⎦ = ⎣ a_{21}  a_{22} ⎦ ⎣ b_{21}  b_{22} ⎦

  ⎡ a_{11} b_{11} + a_{12} b_{21}   a_{11} b_{12} + a_{12} b_{22} ⎤
= ⎣ a_{21} b_{11} + a_{22} b_{21}   a_{21} b_{12} + a_{22} b_{22} ⎦

where the o_{ij}, a_{ij}, b_{ij} are blocks of the respective matrices.

When written as code, the original matrix multiplication is:
// a:[2N][2Z] b:[2Z][2M] -> out:[2N][2M]
void matmul(int N, int Z, int M, int a[2*N][2*Z], int b[2*Z][2*M], int out[2*N][2*M]) {
for(int i = 0; i < 2*N; ++i) {
for(int j = 0; j < 2*M; ++j) {
for(int k = 0; k < 2*Z; ++k) { out[i][j] += a[i][k] * b[k][j]; }
}
}
}
and the block-based matrix multiplication is:
// a:[2N][2Z] b:[2Z][2M] -> out:[2N][2M]
void matmulBlock(int N, int Z, int M, int a[2*N][2*Z], int b[2*Z][2*M], int out[2*N][2*M]) {
for (int BI = 0; BI < 2; ++BI) {
for (int BJ = 0; BJ < 2; ++BJ) {
for(int i = BI*N; i < BI*N+N; ++i) {
for(int j = BJ*M; j < BJ*M+M; ++j) {
for(int k = 0; k < 2*Z; ++k) { out[i][j] += a[i][k] * b[k][j]; }
}
}
}
}
}
we wish to show that both of these programs have the
same semantics.
We will do this by appealing to ideas from program analysis.
§ The key idea
We will consider the statement:
out[i][j] += a[i][k] * b[k][j]
as occurring at an abstract "point in time"
(i, j, k) in the
matmul
function.
It also occurs at an abstract "point in time"
(BI, BJ, i′, j′, k′) in
the
matmulBlock
function.
We will then show that the loops
for(i...) for(j...) for(k...)
are fully
parallel, and hence we can reorder the loops any way we want.
Then, we will show that the ordering imposed by
(BI, BJ, i′, j′, k′)
is a reordering of the original
(i, j, k) ordering. We do this by
showing that there is a bijection:
(i=i_{0}, j=j_{0}, k=k_{0}) → (BI=i_{0}/N, BJ=j_{0}/M, i′=i_{0}%N, j′=j_{0}%M, k′=k_{0})

Thus, the blocked program executes exactly the same statement instances,
and does so without affecting the program semantics.
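As a sanity check, here is a small Python sketch (my own, not code from the text) that enumerates both iteration orders for small sizes and confirms they visit the same multiset of statement instances:

```python
# Check that the blocked loop nest visits exactly the statement
# instances (i, j, k) that the original triple loop does.
N, M, Z = 2, 3, 4

original = [(i, j, k)
            for i in range(2 * N)
            for j in range(2 * M)
            for k in range(2 * Z)]

blocked = [(BI * N + i, BJ * M + j, k)
           for BI in range(2)
           for BJ in range(2)
           for i in range(N)
           for j in range(M)
           for k in range(2 * Z)]

# Same set of instances, each visited exactly once.
assert sorted(original) == sorted(blocked)
```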
§ Schedules
We'll zoom out a little, to consider some simple programs and understand
how to represent parallelism.
void eg1(int N, int M, int out[N][M]) {
for(int i = 0; i < N; ++i) {
for(int j = 1; j < M; ++j) {
out[i][j] = out[i][j-1];
}
}
}
Notice that this program is equivalent to the program with the
i loop
reversed:
void eg1rev(int N, int M, int out[N][M]) {
for(int i = N-1; i >= 0; --i) {
for(int j = 1; j < M; ++j) {
out[i][j] = out[i][j-1];
}
}
}
What's actually
stopping us from reversing the loop
for(j...)
? it's
the fact that the value of, say,
out[i=0][j=1]
depends on
out[i=0][j=0]
. We can see that in general,
out[i=i_0][j=j_0]
depends
on
out[i=i_0][j=j_0-1]
. We can represent this by considering a
dependence set:
{ write:(i_{0}, j_{0}−1) → write:(i_{0}, j_{0}) }

in general, we can reorder statements as long as we do not change
the
directions of the arrows in the dependence set.
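A small Python sketch (my own toy model of the C loops above) makes the asymmetry concrete: reversing the parallel i loop leaves the result unchanged, while reversing the dependence-carrying j loop does not:

```python
# Model out as a small integer grid; run eg1 with configurable
# iteration orders for i and j.
def run(N, M, i_order, j_order):
    out = [[i * 10 + j for j in range(M)] for i in range(N)]
    for i in i_order:
        for j in j_order:
            out[i][j] = out[i][j - 1]   # reads the left neighbour
    return out

N, M = 3, 4
fwd   = run(N, M, range(N), range(1, M))
i_rev = run(N, M, reversed(range(N)), range(1, M))
j_rev = run(N, M, range(N), reversed(range(1, M)))

assert fwd == i_rev    # reversing the parallel i loop is safe
assert fwd != j_rev    # reversing the dependent j loop is not
```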
§ Dependence structure of matmul
.
§ Fully parallel, reordering
§ References
Midnight discussions with my roommate
Arjun P.
This tries to explore what it is about algebra that I find appealing.
I think the fundamental difference to me comes down to flavour 
analysis and combinatorial objects feel very "algorithm", while Algebra feels
"data structure".
To expand on the analogy, a proof technique is like an algorithm, while an
algebraic object is like a data structure. The existence of an algebraic object
allows us to "meditate" on the proof technique as a separate object that does
not move through time. This allows us to "get to know" the algebraic object,
independent of how it's used. So, at least for me, I have a richness of
feeling when it comes to algebra that just doesn't shine through with analysis.
The one exception maybe reading something like "by compactness", which has
been hammered into me by exercises from Munkres :)
Meditating on a proof technique is much harder, since the proof technique
is necessarily intertwined with the problem, unlike a data structure which
to some degree has an independent existence.
This reminds me of the quote: "Art is how we decorate space;
music is how we decorate time." I'm not sure how to draw out the
tenuous connection I feel, but it's there.
Arjun comes from a background of combinatorics, and my understanding of his
perspective is that each proof is a technique unto itself. Or, perhaps
instantiating the technique for each proof is difficult enough that abstracting
it out is not useful enough in the first place.
A good example of a proof technique that got studied on its own right in
combinatorics is the probabilistic method. A more reasonable example is that of
the Pigeonhole principle, which still requires insight to instantiate in
practise.
Not that this does not occur in algebra either, but there is something in
algebra about how much just meditating on the definitions gives you. For example,
the Whitney trick got pulled out of the proof of the Whitney embedding
theorem.
To draw an analogy for the haskellers, it's the same joy of being able to write
down the type of a haskell function and know exactly what it does, enough that
a program can automatically derive the function (djinn). The fact that we know
the object well enough that just writing the type down allows us to infer the
program, makes it beautiful. There's something very elegant about the
minimality that algebra demands. Indeed, this calls back to another quote:
"perfection is achieved not when there is nothing more to add, but when there
is nothing left to take away".
I'm really glad that this 2 AM discussion allowed me to finally pin down
why I like algebra.
I've always struggled with remembering the syntax for function type typedefs:
typedef RETTY (*FNTYNAME)(ARGTY1, ARGTY2, ..., ARGTYn);
we can now use
using
for a far more pleasant syntax:
using FNTYNAME = RETTY(ARGTY1, ARGTY2, ..., ARGTYn);
which is also intuitive. You write down the "type"
on the right hand side, and give it a name on the left.
This is not strictly the same, since the
typedef
typedefs
FNTYNAME
to a
function pointer type, while
the C++ version typedefs the
function type. I prefer
the latter at any rate, since I dislike the fact
that the usual typedef tends to hide the fact that a
function pointer is some pointer-like thing.
§ Semidirect products

(α ≡ { a, b, …}, +, 0)

(ω ≡ { X, Y, …}, ×, 1)

· : ω → Automorphisms(α)
 rotations:
ℤ 5
 reflection:
ℤ 2
§ A walkway of lanterns
 Imagine
ℤ as a long walkway. You start at 0. You are but a poor lamplighter.
 Where are the lamps? At each
i ∈ ℤ, you have a lamp that is either on, or off. So you have
ℤ2.

L ≡ ℤ → ℤ2 is our space of lanterns. You can act on this space by either moving using
ℤ, or toggling a lamp using
ℤ2.
ℤ2^{(ℤ)} ⋊ ℤ ≡ ℤ2 ≀ ℤ

g = (lights:⟨−1, 0, 1⟩, loc:10)

move_{3} = (lights: ⟨ ⟩, loc: 3)

g′ ≡ move_{3} · g = (lights:⟨−1, 0, 1⟩, loc:13)

togglex = (lights:⟨ 0, 2 ⟩, loc: 0)

togglex · g′ = (lights: ⟨ −1, 0, 1, 13, 15 ⟩, loc:13) (the toggles apply relative to the current location, 13)

toggley = (lights: ⟨ −13, −12 ⟩, loc:0)

toggley · g′ = (lights:⟨ −1 ⟩, loc:13)
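The examples above can be checked with a small Python sketch. The convention here is my own reading of the worked examples (the product g · h means "do h, then g", with g's toggles applied relative to h's final location):

```python
# An element of the lamplighter group Z2 ≀ Z is (lights, loc), where
# lights is the frozenset of positions whose lamp is on.
def mul(g, h):
    g_lights, g_loc = g
    h_lights, h_loc = h
    shifted = {h_loc + x for x in g_lights}   # shift g's toggles to h's loc
    return (frozenset(h_lights) ^ shifted, g_loc + h_loc)

g       = (frozenset({-1, 0, 1}), 10)
move3   = (frozenset(),            3)
togglex = (frozenset({0, 2}),      0)
toggley = (frozenset({-13, -12}),  0)

g1 = mul(move3, g)                            # move, lights untouched
assert g1 == (frozenset({-1, 0, 1}), 13)
assert mul(togglex, g1) == (frozenset({-1, 0, 1, 13, 15}), 13)
assert mul(toggley, g1) == (frozenset({-1}), 13)
```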
§ Krohn-Rhodes, AKA how to model Freudian psychoanalysis using Lagrangians over semigroups.
I don't often find people who draw "all three parts" of a natural transformation:
the categories
C,
FC, and
GC, and then show the relationship between
them, so I made this for my own reference.
The
ocaml-git project is a
reimplementation of
git
in
OCaml
. It's well-written, and I was
walking through the codebase, when I found absolutely amazing, hilarious,
and deep comments from
dinosaure
. I really enjoyed reading through the
codebase, and the existence of these comments made it more humane to read.
I don't know who
dinosaure
is, but I'm really glad they wrote the comments
they did, it really made my day.
§ The one that takes a stab at Haskell for fun
(* XXX(dinosaure): ...we can fix this detail but
I'm lazy like haskell. TODO! *)
§ The academic one that links (now broken) to a paper
(* XXX(dinosaure): see this paper
https://github.com/ocamllabs/papers/blob/master/irmin/2014.08.matthieu/rapport.pdf *)
§ The one about the frustrations of bughunting
(* XXX(dinosaure): [~chunked:false] is mandatory, I don't want to explain
why (I lost one day to find this bug) but believe me. *)
§ The one about a potential heisenbug
(* XXX(dinosaure): if, one day, we find a bug about the serialization of the
IDX file, may be it's about this function (stable/unstable sort). *)
§ The requisite comment in French for an OCaml project (translated)
(* XXX(dinosaure): okay, this is a complicated note; it took me 2 days
to fix the bug, so I'm explaining it in French, it's simpler.
Roughly, [Helper.MakeDecoder] uses this buffer as its internal buffer
to handle alterations. What happens in the [refill] function is that
the internal buffer gets topped up from an [input] (typically [zl]).
It is, in the end, a __bad__ interplay between a [Cstruct.t] and a
[Bigarray].
The point is to know the true size of the [Bigarray], and to append
the contents of the [input] with [blit] if the size of the [Bigarray]
(and not of the [Cstruct]) is sufficient.
Before this change, [zl], [de] and [io] shared the same [Bigarray],
sliced (with [Cstruct]) into 3. So, in [MakeDecoder], [refill]
considered (for large files of more than 0x8000 bytes) that after
[de] we still had room, and in that case, we had [io].
Thus, we would [blit] [zl] into [de+sizeof(de) == io], and we would
end up trying to decompress what we had already decompressed. (YOLO).
So we now treat [de] as a [Cstruct.t] and a physically different
[Bigarray] to avoid this problem. However, we should keep
introspecting, because I have a hunch that for a file bigger than
[2 * 0x8000] we should hit a problem. So TODO. *)
§ The deep one
(* XXX(dinosaure): at the end, we don't care if we lost something. *)
Since
MLIR
hasn't set up the nice tooling that LLVM has around
CMake
as far as I can tell, one needs to actually
know CMake
to link against
MLIR
. However, as is well known,
CMake
incantations are handed down
by priests who spend the better part of their lives studying the tome
that is the CMake manual. I, an unlucky soul, had to go on this adventure,
and I hope to spare you the trouble.
I wished to link against a static library build of MLIR. The secret
lies in the
find_library
call:
#If the variable has been set by -DMLIR_INCLUDE_PATH, then keep it.
#Otherwise fall back to the environment variable $MLIR_INCLUDE_PATH.
#If neither, then *shrug*.
IF(NOT MLIR_INCLUDE_PATH)
set (MLIR_INCLUDE_PATH $ENV{MLIR_INCLUDE_PATH})
endif()
#Resolve:
# a library target called `MLIRAnalysis`
# asking to link against `libMLIRAnalysis.a`
# using the variable MLIR_LIBRARY_PATH which as we saw before
## is either an environment variable or a cmake option
find_library(MLIRAnalysis MLIRAnalysis ${MLIR_LIBRARY_PATH})
target_include_directories(languagemodels PRIVATE ${MLIR_INCLUDE_PATH})
I cribbed the actual things to link against from the path
mlir/examples/Toy/Ch2/CMakeLists.txt
which helpfully lists MLIR things it needs to link against.
The full
CMakeLists
is here:
cmake_minimum_required(VERSION 3.5)
project(languagemodels)
set(CMAKE_CXX_STANDARD 14)
## I don't want to use find_package since I want proper control over where my LLVM comes from.
## find_package(LLVM REQUIRED)
add_executable(languagemodels
rnn.cpp codegenc.h lang.h codegenmlir.h)
## Attempt to take these as command line arguments. IF that fails,
## lookup environment.
IF(NOT MLIR_INCLUDE_PATH)
set (MLIR_INCLUDE_PATH $ENV{MLIR_INCLUDE_PATH})
endif()
IF(NOT MLIR_LIBRARY_PATH)
set (MLIR_LIBRARY_PATH $ENV{MLIR_LIBRARY_PATH})
endif()
target_include_directories(languagemodels PRIVATE ${MLIR_INCLUDE_PATH})
find_library(MLIRAnalysis MLIRAnalysis ${MLIR_LIBRARY_PATH})
find_library(MLIRIR MLIRIR ${MLIR_LIBRARY_PATH})
find_library(MLIRParser MLIRParser ${MLIR_LIBRARY_PATH})
find_library(MLIRSideEffects MLIRSideEffects ${MLIR_LIBRARY_PATH})
find_library(MLIRTransforms MLIRTransforms ${MLIR_LIBRARY_PATH})
find_library(LLVMCore LLVMCore ${MLIR_LIBRARY_PATH})
find_library(LLVMSupport LLVMSupport ${MLIR_LIBRARY_PATH})
## debugging to check if it's been set properly
message(MLIR_INCLUDE_PATH ${MLIR_INCLUDE_PATH})
message(MLIR_LIBRARY_PATH ${MLIR_LIBRARY_PATH})
message(MLIRAnalysis ${MLIRAnalysis})
target_link_libraries(languagemodels
${MLIRAnalysis}
${MLIRIR}
${MLIRParser}
${MLIRSideEffects}
${MLIRTransforms}
${LLVMCore}
${LLVMSupport})
This comes from The wild book by John Rhodes, which I anticipate I'll be posting more of in the coming weeks.
§ Experiments
Let an experiment be a tuple of the phase space
X, action space
A,
and an action of the actions onto the phase space,
· : A × X → X. We will write
x′ = a · x to denote the new state of the system
x. So the experiment
E is the data
E ≡ (X, A, · : A × X → X).
§ Coordinate systems.
The existence of the action
· allows us to
write the evolution of the system recursively:
x_{t+1} = a · x_{t}.
However, to understand the final state
x_{t+1}, we need to essentially
"run the recursion", which does not permit us to
understand the experiment.
What we really need is the ability to "unroll" the loop. To quote:
Informally, understanding an experiment
E means introducing coordinates into phase space of
E which are in triangular form under the action of the inputs of
E.
§ Conservation laws as triangular form
We identify certain interesting invariants of a system by two criteria:
 The parameter
Q(t) determines some obviously important aspects of
the system. That is, there is a deterministic function
M(Q(t)) which
maps
Q(t) to "measure" some internal state of the system.
 If the values of such a parameter
Q is known at time
t_{0} (denoted
Q(t_{0}))
and it is also known what inputs are presented to the
system from time
t_{0} to time
t_{0} + є
(denoted
I[t_{0}, t_{0} + є]), then the new value of
Q is a
deterministic function of
Q(t_{0}) and
I[t_{0}, t_{0}+ є].
Such parameters allow us to understand a system, since they are deterministic
parameters of the evolution of the system, while also providing a way to
measure some internal state of the system using
M.
For example, consider a system
x with an energy function
e(x). If we
perform an action
a on the system
x, then we can predict the action
e(x′ = a x) given just
e(x) and
a  here,
(x′ = a x) is the action of the system
a on
x.
In general, conservation principles give a first coordinate of a triangularization. In the main a large part of physics can be viewed as discovering and introducing functions
e of the states
q of the system such that under action
a,
e(a q) depends only on
e(q) and
a, and not on
q.
§ Theory: semidirect and wreath products
§ Symmetries as triangular form
We first heuristically indicate the construction involved in going from the group of symmetries to the triangularization, and then precisely write it out in all pedantic detail.
Let an experiment be
E ≡ (X, A, ·). We say that
Π
is a
symmetry of
E iff:

Π: X → X is a permutation of
X.

Π commutes with the action of each
a:
Π(a x) = a Π(x) .
We say that the theory
E is
transitive (in the action sense) if for
all
x_{1}, x_{2} ∈ X, x_{1} ≠ x_{2}, there exists
a_{1}, a_{2}, … a_{n}
such that
x_{2} = a_{n} … (a_{1} x_{1}) .
Facts of the symmetries of a system:
 We know that the symmetries of a theory
E form a group.
 If
E is transitive, then each symmetry
Π is a regular permutation
 If there exists an
x_{f} such that
Π(x_{f}) = x_{f} (a fixed point), then
this implies that
Π(x) = x for all
x.
 Let the action split
X into disjoint orbits
O_{1}, O_{2}, … O_{k} from whom
we choose representatives
x_{1} ∈ O_{1}, x_{2} ∈ O_{2}, … x_{k} ∈ O_{k}.
Then, if
E is transitive, there is exactly one symmetry that sends a
particular
x_{i} to a particular
x_{j}. So, on fixing one component
of a symmetry, we fix all components.
To show that this gives rise to a triangularization, we first construct
a semigroup of the actions of the experiment:
S(E) ≡ { a_{1} … a_{n} : n ≥ 1 and a_{i} ∈ A }.
Now, let
G = Sym(E), the full symmetry group of
E. One can apparently
express the symmetry group in terms of:
(X, S) ≤ (G, G) ≀ ({ O_{1}, O_{2}, … O_{k}}, T) 
Given two monoids
(M, +, 0_{M}) and
(N, ×, 1_{N}), and a
homomorphism
φ: N → End(M), where
End(M)
is the endomorphism monoid of
M, we will notate
φ(n)(m) as
n · m ∈ M.
Now the semidirect product
M _{φ}N is the set
M × N equipped
with the multiplication rule:

(m, n) (m′, n′) = (m + n · m′, nn′)
This can also be written down as a matrix product, representing
(m, n) by the matrix with rows
(1, 0) and
(m, n):

⎡ 1  0 ⎤ ⎡ 1   0  ⎤   ⎡ 1          0   ⎤
⎣ m  n ⎦ ⎣ m′  n′ ⎦ = ⎣ m + n·m′   nn′ ⎦
This way of writing down semidirect products as matrices makes many things
immediately clear:
 The semidirect product is some kind of "shear" transform, since that's
what a shear transformation looks like, matrixwise.
 The resulting monoid
M _{φ} N has identity
(0_{M}, 1_{N}),
since for the matrix to be identity, we need the 2nd row to be
(0, 1).
 The inverse operation, if
(M, N) were groups, would have to make the product equal the
identity (0_{M}, 1_{N}). Hence:

nn′ = 1 implies that
n′ = 1/n.

m + n m′ = 0 implies that
m′ = −m/n.
which is indeed the right expression for the inverse.
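One concrete way to see the multiplication rule is as composition of affine maps x ↦ n·x + m. A minimal Python sketch of this (my own, with M = N = ℚ and n · m = n*m):

```python
# (m, n) represents the affine map x -> n*x + m; the semidirect-product
# rule (m, n)(m', n') = (m + n*m', n*n') is exactly map composition.
from fractions import Fraction as F

def mul(p, q):
    (m, n), (m2, n2) = p, q
    return (m + n * m2, n * n2)

def apply(p, x):
    m, n = p
    return n * x + m

p, q = (F(3), F(2)), (F(5), F(7))
x = F(11)
# composition law: (p * q)(x) == p(q(x))
assert apply(mul(p, q), x) == apply(p, apply(q, x))
# identity (0, 1) and inverse (-m/n, 1/n), as derived above
assert mul(p, (F(0), F(1))) == p
m, n = p
assert mul(p, (-m / n, F(1) / n)) == (F(0), F(1))
```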
§ identity matrix
n←3 ⋄ id ← n n ⍴(1,n⍴0) ⋄ id
This relies heavily on
⍴
cycling its right argument to fill the requested shape.
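A Python rendering of the same trick (my own sketch, not APL semantics verbatim): cyclically reshaping the (n+1)-length pattern (1 0 … 0) into an n×n grid lands the 1s on the diagonal:

```python
# Build an identity matrix by cyclically tiling (1, 0, ..., 0).
n = 3
pattern = [1] + [0] * n                               # 1,n⍴0
flat = [pattern[k % (n + 1)] for k in range(n * n)]   # cyclic reshape
ident = [flat[i * n:(i + 1) * n] for i in range(n)]
assert ident == [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
```

The 1s land on the diagonal because consecutive rows are offset by n, and the pattern repeats with period n+1.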
§ histogram
xs←(1 1 3 3 3 6) ⋄ n←(⌈/xs)⍴0 ⋄ n[xs]+←1 ⋄ n
The use of
n[x] +←1
will stably write
+1
as many times as there are repeated
indexes in
xs
.
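The same histogram, sketched in Python with the indexed writes made explicit (my own rendering; APL is 1-indexed here, so the Python version shifts by one):

```python
# Count occurrences of each index in xs, like n[xs] +← 1.
xs = [1, 1, 3, 3, 3, 6]
n = [0] * max(xs)            # (⌈/xs)⍴0
for x in xs:
    n[x - 1] += 1            # one stable +1 write per occurrence
assert n == [2, 0, 3, 0, 0, 1]
```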
§ String matching / parity as fold ≠
:
]disp str ← (1 0 0 1 0 0 0 0 1 0 1 0 0 0) ⋄ 2 1 ⍴ ((⊂ str) ⍪ ⊂((≠\str)))
┌→──────────────────────────┐
↓1 0 0 1 0 0 0 0 1 0 1 0 0 0│
├~─────────────────────────→┤
│1 1 1 0 0 0 0 0 1 1 0 0 0 0│
└~─────────────────────────→┘
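The ≠\ scan is a running XOR, which flips parity at every 1. A Python sketch of the same computation:

```python
# Running XOR over a bit string, the Python analogue of ≠\.
from itertools import accumulate
from operator import xor

s = [1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0]
assert list(accumulate(s, xor)) == \
    [1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0]
```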
§ General operations on ideals
We have at our hands a commutative ring
R, and we wish to study the ideal
structure on the ring. In particular, we can combine ideals in the following
ways:

I + J ≡ { i + j : i ∈ I, j ∈ J }

I ∩ J ≡ { x : x ∈ I ∧ x ∈ J }

I ⊕ J ≡ { (i, j) : i ∈ I, j ∈ J }

IJ ≡ { ∑_{k} i_{k} j_{k} : i_{k} ∈ I, j_{k} ∈ J } (finite sums of products)
We have the containment:
IJ ⊆ I ⋂ J ⊆ I, J ⊆ I + J ⊆ R
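For ideals of ℤ these operations are concrete: (a)(b) = (ab), (a) ∩ (b) = (lcm(a, b)), (a) + (b) = (gcd(a, b)), and (x) ⊆ (y) iff y divides x. A Python sketch (my own example) checking the containment chain for I = (4), J = (6):

```python
# Ideals of Z are (g) = multiples of g; containment is divisibility.
from math import gcd

a, b = 4, 6
prod_gen = a * b                  # IJ     = (24)
meet_gen = a * b // gcd(a, b)     # I ∩ J  = (lcm)  = (12)
join_gen = gcd(a, b)              # I + J  = (gcd)  = (2)

assert prod_gen % meet_gen == 0   # IJ ⊆ I ∩ J
assert meet_gen % a == 0          # I ∩ J ⊆ I
assert a % join_gen == 0          # I ⊆ I + J
```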

§
IJ is a ideal,
IJ ⊆ I ∩ J
it's not immediate from the definition that
IJ is an ideal. The idea is
that given a sum
∑_{k} i_{k} j_{k} ∈ IJ, we can write each
i_{k} j_{k} = i′_{k},
since the ideal
I is closed under multiplication with
R. This gives
us
∑i′_{k} = i″ ∈ I. Similarly, we can interpret
∑_{k} i_{k} j_{k} = ∑_{k} j′_{k} = j″ ∈ J.
Hence, we get the containment
IJ ⊆ I ∩ J.
§
I ∩ J ⊆ I,
I ∩ J ⊆ J
Immediate from the inclusion function.
§
I, J ⊆ I + J
Immediate from inclusion
§ CRT from an exact sequence
There exists an exact sequence:
0 → I ⋂ J →^{f} I ⊕ J →^{g} I + J → 0

f(r) = (r, −r)

g((i, j)) = i + j

(The sign in f ensures g ∘ f = 0, so that im f = ker g and the sequence is exact.)

We are forced into this formula by considerations of dimension. We know:
 dim(I ⊕ J) = dim(I) + dim(J)
 dim(I + J) = dim(I) + dim(J) − dim(I ⋂ J) [inclusion-exclusion]
 dim(I + J) = dim(I ⊕ J) − dim(I ⋂ J)
 dim(I + J) − dim(I ⊕ J) + dim(I ⋂ J) = 0
This alternating sum is reminiscent of the Euler characteristic
V − E + F = 2.
By analogy to the Euler characteristic, which arises from homology, we need to have
I ⊕ J in the middle of our exact sequence. So we must have
0 → ? → I ⊕ J → ? → 0.
Now we need to decide on the relative ordering between
I ∩ J and
I + J.
 There is no universal way to send
I ⊕ J → I ∩ J. It's
an unnatural operation to restrict the direct sum into the intersection.
 There is a universal way to send
I ⊕ J → I + J: sum
the two components. This can be seen as currying the addition operation.
Thus, the exact sequence
must have
I + J in the image of
I ⊕ J. This
forces us to arrive at:
0 → I ⋂ J → I ⊕ J → I + J → 0

The product ideal
IJ plays no role, since it's not possible to define a
product of modules
in general (just as it is not possible to define
a product of vector spaces). Thus, the exact sequence better involve
module related operations. We can now recover CRT:
0 → I ⋂ J → I ⊕ J → I + J → 0

0 → R/(I ⋂ J) → R/I ⊕ R/J → R/(I + J) → 0

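A Python sketch of the recovered CRT for R = ℤ, I = (4), J = (3) (my own example): since I + J = R, the map r ↦ (r mod 4, r mod 3) from ℤ/12 is a bijection:

```python
# CRT instance: Z/12 is isomorphic to Z/4 x Z/3 via r -> (r%4, r%3),
# because gcd(4, 3) = 1, i.e. I + J = R.
pairs = {(r % 4, r % 3) for r in range(12)}
assert len(pairs) == 12    # injective on 12 elements, hence bijective
```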
§ References
This is trivial, I'm surprised it took me
this long to internalize this fact.
When we convert a poset
(X, ≤) into a category, we stipulate that
x → y ⇐⇒ x ≤ y.
If we now consider the category
Set of sets and functions between sets,
an arrow
A →^{f} B is a function from
A to
B. If
f is
monic, then we know that
A ≅ Im(f) ≤ B. That is, a monic arrow
behaves a lot like a poset arrow!
Similarly, an epic arrow behaves a lot like the arrow in the inverse poset.
I wonder if quite a lot of category theoretic diagrams are clarified by thinking
of monic and epic directly in terms of controlling sizes.
If we want to minimise a function
f(x) subject to the constraints
g(x) = c,
one uses the method of lagrange multipliers. The idea is to consider a new
function
L(x, λ) = f(x) + λ (c − g(x)). Now, if one has a local optimum
(x^{⋆}, λ^{⋆}), then the conditions:

∂ L/∂ x = 0:
f′(x^{⋆}) − λ g′(x^{⋆}) = 0.

∂ L/∂ λ = 0:
g(x^{⋆}) = c.
Equation (2) is sensible: we want our optima to satisfy the constraint that
we had originally imposed. What is Equation (1) trying to say?
Geometrically, it's asking us to keep
f′(x^{⋆}) parallel to
g′(x^{⋆}).
Why is this a good ask?
Let us say that we are at a feasible point
x_{0} (that is,
g(x_{0}) = c).
We are interested in wiggling
x_{0} →^{wiggle} x_{0} + є^{→} ≡ x_{1}, such that:

x_{1} is still feasible:
g(x_{1}) = c = g(x_{0}).

x_{1} is an improvement:
f(x_{1}) > f(x_{0}).
 If we want
g(x_{1}) to not change, then we need
g′(x_{0}) · є^{→}= 0.
 If we want
f(x_{1}) to be larger, we need
f′(x_{0}) · є^{→}> 0.
If
f′(x_{0}) and
g′(x_{0}) are not parallel, such a wiggle
є^{→} exists. But if they are parallel, any attempt to improve
f(x_{0} + є^{→})
must also change
g(x_{0} + є^{→}), and thereby violate the constraint
g(x_{0} + є^{→}) = c. So the candidate optima are exactly the points
where the two gradients are parallel.
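A numeric sketch (my own toy instance): maximising f(x, y) = x + y on the circle g(x, y) = x² + y² = 1, the gradients are parallel at the optimum (1/√2, 1/√2):

```python
# Check the Lagrange condition grad f = lambda * grad g at the optimum.
import math

x = y = 1 / math.sqrt(2)
grad_f = (1.0, 1.0)           # f(x, y) = x + y
grad_g = (2 * x, 2 * y)       # g(x, y) = x^2 + y^2
lam = grad_f[0] / grad_g[0]   # lambda from the first component
# second component agrees, so the gradients are parallel
assert abs(grad_f[1] - lam * grad_g[1]) < 1e-12
# and the constraint g = 1 holds
assert abs(x * x + y * y - 1) < 1e-12
```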
All material lifted straight from
Aaron Hsu's PhD thesis. I'll be converting
APL notation to C++like notation.
§ Tree representation as multidimensional ragged nested arrays
We're interested in this tree:
∘
┌──┬──┴────┐
a b c
│ ┌┴┐ ┌───┼───┐
p q r s t u
│ │ 
│ ┌┴┐ ┌┴┐
v w x y z
I'll be writing APL commands in front of a
$
to mimic bash, and I'll
write some arrays as multiline. To run them, collapse them into a single
line. The
ast
object is represented in memory as:
$ ast ← ('∘'
('a' ('p'))
('b'
('q' ('v'))
('r'))
('c'
('s' ('w' 'x'))
('t' ('y' 'z'))
('u')))
$ ]disp ast
┌→┬──┬────────┬───────────────────┐
│∘│ap│┌→┬──┬─┐│┌→┬──────┬──────┬─┐│
│ │ ││b│qv│r│││c│┌→┬──┐│┌→┬──┐│u││
│ │ │└─┴─→┴─┘││ ││s│wx│││t│yz││ ││
│ │ │ ││ │└─┴─→┘│└─┴─→┘│ ││
│ │ │ │└─┴─────→┴─────→┴─┘│
└─┴─→┴───────→┴──────────────────→┘
Here's how to read the array representation. Look at the top level of the tree:
we have a root node with three children:
∘
┌──┬──┴────┐
a b c
┌→┬──┬────────┬─────────────┐
│∘│ │ │ │
│ │ a│ b │ c │
│ │ │ │ │
└─┴─→┴───────→┴────────────→┘
With the first
∘
being the root node, and the three adjacent cells
being the children.
Next, we look at how
a
is represented. This is predictably recursive. Let's
see the subtree under
a
:
∘
┌──┬──┴────┐
a b c
│
p
┌→┬──┬────────┬─────────────┐
│∘│ap│ │ │
│ │ │ b │ c │
│ │ │ │ │
└─┴─→┴───────→┴────────────→┘
Similarly for
b
:
∘
┌──┬──┴────┐
a b c
│ ┌┴┐
p q r
┌→┬──┬────────┬─────────────┐
│∘│ap│┌→┬──┬─┐│ │
│ │ ││b│q │r││ c │
│ │ │└─┴─→┴─┘│ │
└─┴─→┴───────→┴────────────→┘
And so on, leading to the final representation:
∘
┌──┬──┴────┐
a b c
│ ┌┴┐ ┌───┼───┐
p q r s t u
│ │ 
│ ┌┴┐ ┌┴┐
v w x y z
┌→┬──┬────────┬───────────────────┐
│∘│ap│┌→┬──┬─┐│┌→┬──────┬──────┬─┐│
│ │ ││b│qv│r│││c│┌→┬──┐│┌→┬──┐│u││
│ │ │└─┴─→┴─┘││ ││s│wx│││t│yz││ ││
│ │ │ ││ │└─┴─→┘│└─┴─→┘│ ││
│ │ │ │└─┴─────→┴─────→┴─┘│
└─┴─→┴───────→┴──────────────────→┘
Note that for this representation to work, we need to be able to:
 nest arrays inside arrays.
 have subarrays of different sizes (ragged arrays)
 of different nesting depths, so it's really not even an array?
I don't understand the memory layout of this, to be honest. I feel like to
represent this in memory would still rely on pointerchasing, since we need
to box all the arrays. This is possibly optimised by APL to not be too bad.
§ The depth vector representation
∘ 0
┌──┬──┴────┐
a b c 1
│ ┌┴┐ ┌───┼───┐
p q r s t u 2
│ │ 
│ ┌┴┐ ┌┴┐
v w x y z 3
If we visit this tree and record depths in preorder
(node left right)
, we
arrive at the list:
(∘:0
(a:1 (p:2)) (b:1 (q:2 (v:3)) (r:2))
(c:1 (s:2 (w:3 x:3)) (t:2 (y:3 z:3)) (u:2)))
formatted as:
(∘:0
(a:1
(p:2))
(b:1
(q:2 (v:3))
(r:2)
)
(c:1 (s:2 (w:3 x:3))
(t:2 (y:3 z:3))
(u:2))
)
Linearized, this is the list:
(∘ a p b q v r c s w x t y z u)
d ← (0 1 2 1 2 3 2 1 2 3 3 2 3 3 2)
∘ 0
┌──┬──┴────┐
a b c 1
│ ┌┴┐ ┌───┼───┐
p q r s t u 2
│ │ 
│ ┌┴┐ ┌┴┐
v w x y z 3
To convert the
ast
object into a depth vector representation, we can
use the following call:
$ ast ← ('∘' ('a' ('p')) ('b' ('q' ('v')) ('r')) ('c' ('s' ('w' 'x')) ('t' ('y' 'z')) ('u')))
$ d ← ∊0{(⊢,(⍺+1)∇⊣)/⌽⍺,1↓⍵}ast
0 1 2 1 2 3 2 1 2 3 3 2 3 3 2
Let's break this down:
TODO
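While the APL breakdown is a TODO, here is a Python sketch of what the train computes, on a hypothetical tuple encoding of the same tree (each node is (label, *children); the encoding and labels are my own):

```python
# Preorder walk of a nested tree, recording each node's depth.
def depths(node, d=0, out=None):
    out = [] if out is None else out
    out.append(d)                  # visit the node itself
    for child in node[1:]:         # then its subtrees, left to right
        depths(child, d + 1, out)
    return out

ast = ('o',
       ('a', ('p',)),
       ('b', ('q', ('v',)), ('r',)),
       ('c', ('s', ('w',), ('x',)),
             ('t', ('y',), ('z',)),
             ('u',)))
assert depths(ast) == [0, 1, 2, 1, 2, 3, 2, 1, 2, 3, 3, 2, 3, 3, 2]
```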
§ Inverted tables
We represent data associated with our nodes as follows:
$ data ← ⍪ ¨d(15⍴'T')(↑15⍴⊂'n.')
$ ]disp data
┌→┬─┬──┐
│0│T│n.│
│1│T│n.│
│2│T│n.│
│1│T│n.│
│2│T│n.│
│3│T│n.│
│2│T│n.│
│1│T│n.│
│2│T│n.│
│3│T│n.│
│3│T│n.│
│2│T│n.│
│3│T│n.│
│3│T│n.│
│2↓T↓n.↓
└→┴→┴─→┘
This is the same thing as a
structure of arrays (SOA) representation,
where each column of information (e.g., the depths at
data[1]
, the
T
information at
data[2]
) is an
array which can be accessed efficiently with SIMD
instructions.
§ AST representation
TODO
§ Path matrices
We want information of how to go up and down the tree in ideally constant time.
We store this information in what is known as a
path matrix.
For our recurring example, the path matrix is:
∘ a p b q v r c s w x t y z u  preorder traversal
──────────────────────────────────────────────────
∘ ∘ ∘ ∘ ∘ ∘ ∘ ∘ ∘ ∘ ∘ ∘ ∘ ∘ ∘  depth=0
 a a b b b b c c c c c c c c  depth=1
  p  q q r  s s s t t t u  depth=2
     v    w x  y z   depth=3
∘ 0
┌──┬──┴────┐
a b c 1
│ ┌┴┐ ┌───┼───┐
p q r s t u 2
│ │ 
│ ┌┴┐ ┌┴┐
v w x y z 3
To efficiently compute this, we first replace every value in
our tree with its preorder traversal visit time. This changes
the tree to:
∘:0 0
┌──────────┬──┴─────────────────┐
a:1 b:3 c:7 1
│ ┌───┴───┐ ┌──────────┼───────┐
p:2 q:4 r:6 s:8 t:11 u:14 2
│ │ │
│ ┌──┴──┐ ┌─┴───┐
v:5 w:9 x:10 y:12 z:13 3
The values we store in the tree are the integers. The old labels
are retained for clarity.
The path matrix for this tree is:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14  preorder traversal
────────────────────────────────────────────────────────────
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0  depth=0
 1 1 3 3 3 3 7 7 7 7 7 7 7 7  depth=1
  2  4 4 6  8 8 8 11 11 11 14  depth=2
     5    9 10  12 13   depth=3
∘:0 0
┌──────────┬──┴─────────────────┐
a:1 b:3 c:7 1
│ ┌───┴───┐ ┌──────────┼───────┐
p:2 q:4 r:6 s:8 t:11 u:14 2
│ │ │
│ ┌──┴──┐ ┌─┴───┐
v:5 w:9 x:10 y:12 z:13 3
§ Rendering the depth information in 2D
We use the incantation:
$ d ← (0 1 2 1 2 3 2 1 2 3 3 2 3 3 2)
$ ((⍳≢d)@(d,¨⍳≢d)) ((1+⌈/d) (≢d))⍴' '
0
  1     3             7
    2     4   6     8         11       14
              5         9 10      12 13
Let's break this down (the symbol
means a lamp, for commenting/illumination)
$ ⍳ 3 ⍝ iota: make a list of the first n integers
1 2 3
$ d
0 1 2 1 2 3 2 1 2 3 3 2 3 3 2
$ ≢d ⍝ tally: count the no. of elements in d
15
$ ⍳≢d ⍝ list of length (no. of elements in d)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
$ ]disp (1 2 3),(4 5 6) ⍝ ,:concatenate
┌→────┬─────┐
│1 2 3│4 5 6│
└~───→┴~───→┘
$ ]disp (1 2 3) ,¨ (4 5 6)
┌→──┬───┬───┐
│1 4│2 5│3 6│
└~─→┴~─→┴~─→┘
The use of
¨
needs some explanation.
¨
is a higher order function which
takes a function and makes it a mapped version of the original function.
So,
,¨
is a function which attempts to map the concatenation operator.
Now, given two arrays
(1 2 3)
and
(4 5 6)
,
(1 2 3) ,¨ 4 5 6
attempts to run
,
on each pair
1 and 4
,
2 and 5
,
3 and 6
. This gives us tuples
((1 4) (2 5) (3 6))
.
So, for our purposes,
zip ← ,¨
.
$ ]disp (d,¨⍳≢d) ⍝ zip d with [1..len d].
┌→──┬───┬───┬───┬───┬───┬───┬───┬───┬───┬────┬────┬────┬────┬────┐
│0 0│1 1│2 2│1 3│2 4│3 5│2 6│1 7│2 8│3 9│4 10│2 11│3 12│4 13│2 14│
└~─→┴~─→┴~─→┴~─→┴~─→┴~─→┴~─→┴~─→┴~─→┴~─→┴~──→┴~──→┴~──→┴~──→┴~──→┘
$ ((⌈/d) (≢d))⍴'' ⍝ array of dim (max val in d) x (no. of elem in d)





⌈
is the maximum operator and
/
is the fold operator, so
⌈/d
finds the maximum in
d
. Recall that
(≢d)
finds the no. of
elements in
d
, and
⍴
reshapes an array to the desired size. We pass it
a 1x1
array containing only a blank,
which gets reshaped into a
(⌈/d) x (≢d)
sized array of blank
symbols.
TODO: explain @ and its use
§ Creating the path matrix
$ ⎕IO ← 0 ⍝ (inform APL that we wish to use 0-indexing.)
$ d ← (0 1 2 1 2 3 2 1 2 3 3 2 3 3 2)
$ PM ← ⌈\((⍳≢d)@(d,¨⍳≢d))(((⌈/d+1)(≢d))⍴0)
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 1 1 3 3 3 3 7 7 7 7 7 7 7 7
0 0 2 2 4 4 6 6 8 8 8 11 11 11 14
0 0 0 0 0 5 5 5 5 9 10 10 12 13 13
0 0
┌──┬──┴───────┐
1 3 7 1
│ ┌┴┐ ┌──────┼───┐
2 4 6 8 11 14 2
│ │ 
│ ┌┴─┐ ┌┴──┐
5 9 10 12 13 3
The incantation can be broken down into:

(((⌈/d+1)(≢d))⍴0)
is used to create a (1 + max(d)) x (len d)
dimension array of zeros.
Here, the rows define depths, and the columns correspond to tree nodes
which for us are their preorder indexes.
$ grid←(⌈/d+1) (≢d) ⍴ 0
$ grid
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

((d ,¨ ⍳≢d))
creates an array of pairs (depth, preindex)
. We will use
this to fill index (d, pi)
with the value pi
.
$ writeixs ← (d,¨⍳≢d)
$ ]disp writeixs
┌→──┬───┬───┬───┬───┬───┬───┬───┬───┬───┬────┬────┬────┬────┬────┐
│0 0│1 1│2 2│1 3│2 4│3 5│2 6│1 7│2 8│3 9│3 10│2 11│3 12│3 13│2 14│
└~─→┴~─→┴~─→┴~─→┴~─→┴~─→┴~─→┴~─→┴~─→┴~─→┴~──→┴~──→┴~──→┴~──→┴~──→┘

ixgrid ← ((⍳≢d)@writeixs) grid
rewrites at index writeixs[i]
the value (⍳≢d)[i]
.
$ ixgrid ← ((⍳≢d)@writeixs) grid
$ ixgrid
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 1 0 3 0 0 0 7 0 0 0 0 0 0 0
0 0 2 0 4 0 6 0 8 0 0 11 0 0 14
0 0 0 0 0 5 0 0 0 9 10 0 12 13 0
 Finally,
⌈
is the maximum operator, and \
is the prefix scan operator,
so ⌈\ixgrid
creates a prefix scan of the above grid to give us our
final path matrix:
$ PM ← ⌈\ixgrid
$ PM
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 1 1 3 3 3 3 7 7 7 7 7 7 7 7
0 0 2 2 4 4 6 6 8 8 8 11 11 11 14
0 0 0 0 0 5 5 5 5 9 10 10 12 13 13
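The same construction sketched in Python (my own rendering): scatter each node's preorder index at (depth, index) into a zero grid, then take a running maximum along each row:

```python
# Path matrix from a depth vector: scatter, then prefix-max per row.
d = [0, 1, 2, 1, 2, 3, 2, 1, 2, 3, 3, 2, 3, 3, 2]
rows, cols = max(d) + 1, len(d)
grid = [[0] * cols for _ in range(rows)]
for i, depth in enumerate(d):
    grid[depth][i] = i             # (⍳≢d)@(d,¨⍳≢d)

pm = []
for row in grid:                   # ⌈\ : running maximum per row
    acc, out = 0, []
    for v in row:
        acc = max(acc, v)
        out.append(acc)
    pm.append(out)

assert pm[1] == [0, 1, 1, 3, 3, 3, 3, 7, 7, 7, 7, 7, 7, 7, 7]
assert pm[3] == [0, 0, 0, 0, 0, 5, 5, 5, 5, 9, 10, 10, 12, 13, 13]
```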
§ Using the path matrix: distance of a node from every other node.
Note that the maximum distance between two nodes is to climb
all the way to the top node, and then climb down:
dmax ← depth(a) + depth(b)
If we know the lowest common ancestor of two nodes,
then the distance of one node to another is:
dcorrect ← dist(a, lca(a, b)) + dist(b, lca(a, b))
So, we can compute the distance as:
dcorrect ← dist(a, lca(a, b)) + dist(lca(a, b), b)
= (depth(a) − depth(lca(a, b))) +
(depth(b) − depth(lca(a, b)))
= depth(a) +
depth(b) −
2 * depth(lca(a, b))
[TODO: picture]
[TODO: finish writing this]
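Until that is written up, here is a Python sketch of the distance computation (my own): using the path matrix from the previous section, the LCA of a and b sits at the deepest row where their columns agree:

```python
# dist(a, b) = depth(a) + depth(b) - 2 * depth(lca(a, b)),
# with the LCA read off the path matrix.
d = [0, 1, 2, 1, 2, 3, 2, 1, 2, 3, 3, 2, 3, 3, 2]
pm = [[0] * 15,
      [0, 1, 1, 3, 3, 3, 3, 7, 7, 7, 7, 7, 7, 7, 7],
      [0, 0, 2, 2, 4, 4, 6, 6, 8, 8, 8, 11, 11, 11, 14],
      [0, 0, 0, 0, 0, 5, 5, 5, 5, 9, 10, 10, 12, 13, 13]]

def lca_depth(a, b):
    # deepest row (up to the shallower node) where the columns agree
    return max(k for k in range(min(d[a], d[b]) + 1)
               if pm[k][a] == pm[k][b])

def dist(a, b):
    return d[a] + d[b] - 2 * lca_depth(a, b)

# v (index 5, depth 3) and r (index 6, depth 2) meet at b (depth 1):
# v -> q -> b -> r is 3 edges.
assert dist(5, 6) == 3
```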
§ Parent vector representation
A parent vector is a vector of length
n
where
Parent[i]
denotes an
index into
Parent
. Hence, the following condition will return 1
if V is a parent vector.
For example, for our given example, here is the parent vector:
d ← (0 1 2 1 2 3 2 1 2 3 3 2 3 3 2) │ depths
(∘ a p b q v r c s w x t y z u) │ values
p ← (∘ ∘ a ∘ b q b ∘ c s s c t t c) │ parents
(0 1 2 3 4 5 6 7 8 9 10 11 12 13 14)  indexes
P ← (0 0 1 0 3 4 3 0 7 8 8 7 11 11 7) │ parent indices
∘:0 0
┌──────────┬──┴─────────────────┐
a:1 b:3 c:7 1
│ ┌───┴───┐ ┌──────────┼───────┐
p:2 q:4 r:6 s:8 t:11 u:14 2
│ │ │
│ ┌──┴──┐ ┌─┴───┐
v:5 w:9 x:10 y:12 z:13 3
The condition a parent vector must satisfy is:
∧/V ∊(⍳≢V) ⍝ [All elements of V are valid indexes into V]
V ∊ (⍳≢V) is a boolean vector telling us whether each element of V belongs (∊)
to the list of valid indexes (⍳≢V). Recall that / is for reduction, and ∧/ is
a boolean AND reduction. Hence, we compute whether every element of the vector
V is a valid index into V. We add the convention that root nodes, which don't
have a parent, simply point to themselves. This allows us to free ourselves
from requiring some kind of nullptr check.
The root node (the ancestor of all elements) can be found by iterating
indexing to a fixpoint (⍣≡), using selfie (⍨) to pass p as both arguments:
I←{(⊂⍵)⌷⍺} ⍝ index into the left hand side param using right hand side param
I⍣≡⍨p ⍝ iterate I to its fixpoint, with p as both arguments
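In Python terms, the same fixpoint looks like this (a sketch; root_of is a hypothetical helper name):

```python
def root_of(p, i):
    # follow parent pointers to a fixpoint: the root points to itself
    while p[i] != i:
        i = p[i]
    return i

P = [0, 0, 1, 0, 3, 4, 3, 0, 7, 8, 8, 7, 11, 11, 7]
print(root_of(P, 13))  # every node reaches the root, node 0
```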
§ Converting from depth vector to parent vector, Take 1
As usual, let's consider our example:
d ← (0 1 2 1 2 3 2 1 2 3 3 2 3 3 2) │ depths
(∘ a p b q v r c s w x t y z u) │ values
p ← (∘ ∘ a ∘ b q b ∘ c s s c t t c) │ parents
(0 1 2 3 4 5 6 7 8 9 10 11 12 13 14)  indexes
P ← (0 0 1 0 3 4 3 0 7 8 8 7 11 11 7) │ parent indices
∘:0 0
┌──────────┬──┴─────────────────┐
a:1 b:3 c:7 1
│ ┌───┴───┐ ┌──────────┼───────┐
p:2 q:4 r:6 s:8 t:11 u:14 2
│ │ │
│ ┌──┴──┐ ┌─┴───┐
v:5 w:9 x:10 y:12 z:13 3
Note that the depth vector already encodes parent-child information: the
parent of node i is the node j closest to the left of i such that
d[j] = d[i] − 1.
For example, to compute the parent of t:11, notice that t is at depth 2, so
its parent is at depth 1. We should find all the nodes in d[0..11] which have
depth equal to d[t] − 1 = 1, and then pick the rightmost one. This translates
to the expression:
$ d ← (0 1 2 1 2 3 2 1 2 3 3 2 3 3 2)
$ t ← 11 ⍝ target node
$ ixs ← ⍳t ⍝ array indexes upto this node
0 1 2 3 4 5 6 7 8 9 10
$ d[ixs] ⍝ depths of nodes to the left of the given node t
0 1 2 1 2 3 2 1 2 3 3
$ d[ixs] = d[t]-1 ⍝ boolean vector of nodes whose depth is that of t's parent
0 1 0 1 0 0 0 1 0 0 0
$ eqds ← ⍸ (d[ixs] = d[t]-1) ⍝ array indexes of nodes whose depth is that of t's parent
1 3 7
$ ⌽ eqds ⍝ reverse of array indexes to extract `7`
7 3 1
$ ⊃ ⌽ eqds ⍝ first of the reverse of the array indexes to extract `7`
7
$ (⌽⍸(d[⍳t] = d[t]-1))[0] ⍝ APL style one-liner of the above
While this is intuitive, it does not scale: it does not permit us to find the
parents of all the nodes at once; that is, it is not parallelisable over
choices of t.
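The sequential idea, as a Python sketch (parent_of is a hypothetical helper name):

```python
def parent_of(d, t):
    # rightmost j < t whose depth is exactly one less than d[t];
    # the root has no such j and points to itself
    for j in range(t - 1, -1, -1):
        if d[j] == d[t] - 1:
            return j
    return t

d = [0, 1, 2, 1, 2, 3, 2, 1, 2, 3, 3, 2, 3, 3, 2]
print(parent_of(d, 11))  # t:11 has parent c:7
```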
§ Converting from depth vector to parent vector, Take 2 (Or scan idiom)
Imagine we have a list of 0s and 1s, and we want to find the index of the
rightmost 1. For example, given:
⍝ 0 1 2 3 4 5 6 7 8 9 10 11 12
$ a ← (0 0 1 0 0 0 1 0 1 0 0 0 0)
we want the answer to be f a = 8. We saw an implementation in terms of
f←{(⌽⍸⍵)[0]} in Take 1 (recall that ⍵ is the symbol for the right-hand-side
argument of a function).
We're going to perform the same operation slightly differently. Let's consider
the series of transformations:
⍝ 0 1 2 3 4 5 6 7 8 9 10 11 12
$ a ← (0 0 1 0 0 0 1 0 1 0 0 0 0) ⍝ original array
$ ⌽a ⍝ reverse of a
0 0 0 0 1 0 1 0 0 0 1 0 0
$ ∨\ ⌽a ⍝ prefix scan(\) using the OR(∨) operator. Turn all
⍝ entries after the first 1 into a 1
0 0 0 0 1 1 1 1 1 1 1 1 1
$ +/ (∨\ ⌽a) ⍝ sum over the previous list, counting number of 1s
9
$ ¯1 + (+/ (∨\ ⌽a)) ⍝ subtract 1 from the previous number
8
Why the hell does this work? Well, here's the proof:
- On running ⌽a, we reverse a. The last 1 of a, at index i, becomes the
first 1 of ⌽a, at index i′ ≡ n−1−i (where n is the length of a).
- On running ∨\ ⌽a, every entry from the first 1 onwards becomes a 1. That
is, all indexes j ≥ i′ hold a 1.
- On running +/ (∨\ ⌽a), we sum up the 1s. There are
n−i′ = n−(n−1−i) = i+1 of them.
- We subtract 1 to recover i from i+1.
This technique works on every row of a matrix at once. This is paramount,
since we can now repeat it for each row of a matrix built out of the depth
vector we were previously interested in, and thereby compute every parent
index!
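Here is the same or-scan idiom as a Python sketch (accumulate with OR standing in for ∨\):

```python
from itertools import accumulate
from operator import or_

def last_one_index(a):
    # reverse, OR-scan, count the 1s, subtract one
    ones = sum(accumulate(reversed(a), or_))
    return ones - 1

a = [0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0]
print(last_one_index(a))  # 8
```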
§ Converting from depth vector to parent vector, Take 3 (full matrix)
We want to extend the previous method we hit upon to compute the parents
of all nodes in parallel. To perform this, we need to run the moral
equivalent of the following:
$ ⎕IO ← 0 ⍝ 0 indexing
$ d ← (0 1 2 1 2 3 2 1 2 3 3 2 3 3 2) ⍝ depth vector
$ t ← 11 ⍝ node we are interested in
$ a←d[⍳t]=d[t]-1 ⍝ boolean vector of nodes whose depth is that of t's parent
0 1 0 1 0 0 0 1 0 0 0
$ ¯1 + (+/ (∨\ ⌽a)) ⍝ index of the last 1 of the boolean vector
7
for every single choice of t. To perform this, we can build a 2D matrix of
d[⍳t]=d[t]-1, where t ranges over [0..len(d)-1] (ie, over all the nodes in
the graph).
We begin by using:
$ ⎕IO ← 0 ⋄ d ← (0 1 2 1 2 3 2 1 2 3 3 2 3 3 2) ⍝ depths
$ ]display ltdepth ← d ∘.> d ⍝ find `d[i] > d[j]` for all i, j.
┌→────────────────────────────┐
↓0 0 0 0 0 0 0 0 0 0 0 0 0 0 0│
│1 0 0 0 0 0 0 0 0 0 0 0 0 0 0│
│1 1 0 1 0 0 0 1 0 0 0 0 0 0 0│
│1 0 0 0 0 0 0 0 0 0 0 0 0 0 0│
│1 1 0 1 0 0 0 1 0 0 0 0 0 0 0│
│1 1 1 1 1 0 1 1 1 0 0 1 0 0 1│
│1 1 0 1 0 0 0 1 0 0 0 0 0 0 0│
│1 0 0 0 0 0 0 0 0 0 0 0 0 0 0│
│1 1 0 1 0 0 0 1 0 0 0 0 0 0 0│
│1 1 1 1 1 0 1 1 1 0 0 1 0 0 1│
│1 1 1 1 1 0 1 1 1 0 0 1 0 0 1│
│1 1 0 1 0 0 0 1 0 0 0 0 0 0 0│
│1 1 1 1 1 0 1 1 1 0 0 1 0 0 1│
│1 1 1 1 1 0 1 1 1 0 0 1 0 0 1│
│1 1 0 1 0 0 0 1 0 0 0 0 0 0 0│
└~────────────────────────────┘
Note that ltdepth[i][j] = 1 iff d[j] < d[i]. So, for a given row (i fixed),
the 1s mark nodes that are at lower depth (ie, potential parents).
If we mask this to keep only those indexes where j < i, then the last 1 in
each row will be at a j such that d[j] = d[i] − 1. Why? Because in the
preorder traversal, the node closest to our left with a depth less than ours
must be our parent.
We build the mask i > j with another outer product (demonstrated here on 3
elements):
$ ⎕IO ← 0 ⋄ d ← (0 1 2 1 2 3 2 1 2 3 3 2 3 3 2) ⍝ depths
$ ]display left ← (⍳3) ∘.> (⍳3) ⍝ find `i > j` for all i, j.
┌→────┐
↓0 0 0│
│1 0 0│
│1 1 0│
└~────┘
Combining the two outer products with the last-1 idiom, we can arrive at:
$ ⎕IO ← 0 ⋄ d ← (0 1 2 1 2 3 2 1 2 3 3 2 3 3 2) ⍝ depths
$ ltdepth ← d ∘.> d ⍝ find `d[i] > d[j]` for all i, j.
$ preds ← (⍳≢d) ∘.> (⍳≢d) ⍝ predecessors: find `i > j` for all i, j.
$ pred_higher ← ltdepth ∧ preds ⍝ predecessors that are higher in the tree
$ parents_take_3 ← ¯1 + +/∨\⌽pred_higher ⍝ previous idiom for finding the last 1.
¯1 0 1 0 3 4 3 0 7 8 8 7 11 11 7
For comparison, the actual value is:
(0 1 2 3 4 5 6 7 8 9 10 11 12 13 14)  indexes
d ← (0 1 2 1 2 3 2 1 2 3 3 2 3 3 2) │ depths
P ← (0 0 1 0 3 4 3 0 7 8 8 7 11 11 7) │ parent indices
(¯1 0 1 0 3 4 3 0 7 8 8 7 11 11 7)  parents, take 3
∘:0 0
┌──────────┬──┴─────────────────┐
a:1 b:3 c:7 1
│ ┌───┴───┐ ┌──────────┼───────┐
p:2 q:4 r:6 s:8 t:11 u:14 2
│ │ │
│ ┌──┴──┐ ┌─┴───┐
v:5 w:9 x:10 y:12 z:13 3
We have an off-by-one error for the root node 0: its row has no 1s, so the
idiom yields ¯1. That's easily fixed: we simply take a maximum with 0 to turn
the ¯1 into a 0, making the root point to itself:
$ parents_take_3 ← 0⌈ ¯1 + +/∨\⌽pred_higher
0 0 1 0 3 4 3 0 7 8 8 7 11 11 7
So, that's our function:
parents_take_3 ← 0⌈ ¯1 + +/∨\⌽ ((d∘.>d) ∧ (⍳≢d)∘.>(⍳≢d))
0 0 1 0 3 4 3 0 7 8 8 7 11 11 7
Note that the time complexity for this is dominated by having to calculate
the outer products, which materialise O(n²) entries; even given abundant
parallelism, this is wasteful. We will slowly chip away at this, to do far
better.
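The whole of Take 3 can be written out as a plain-Python sketch of the array operations (not the thesis's APL):

```python
d = [0, 1, 2, 1, 2, 3, 2, 1, 2, 3, 3, 2, 3, 3, 2]
n = len(d)
# the two outer products, fused: pred_higher[i][j] = (d[i] > d[j]) and (i > j)
pred_higher = [[(d[i] > d[j]) and (i > j) for j in range(n)]
               for i in range(n)]

def last_one(row):
    # the ⌽, ∨\, +/, ¯1+ idiom: count the 1s of the reversed or-scan, minus one
    total, seen = 0, False
    for x in reversed(row):
        seen = seen or x
        total += 1 if seen else 0
    return total - 1

# clamp the root's ¯1 up to 0, as in the APL version
parents = [max(0, last_one(row)) for row in pred_higher]
print(parents)  # [0, 0, 1, 0, 3, 4, 3, 0, 7, 8, 8, 7, 11, 11, 7]
```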
§ Converting from depth vector to parent vector, Take 4 (log critical depth)
We will use the Key (⌸) operator, which allows us to create key-value pairs.
$ d ← 0 1 2 1 2 3 2 1 2 3 3 2 3 3 2
$ ]disp (⍳≢d) ,¨ d ⍝ zip d with indexes
┌→──┬───┬───┬───┬───┬───┬───┬───┬───┬───┬────┬────┬────┬────┬────┐
│0 0│1 1│2 2│3 1│4 2│5 3│6 2│7 1│8 2│9 3│10 3│11 2│12 3│13 3│14 2│
└~─→┴~─→┴~─→┴~─→┴~─→┴~─→┴~─→┴~─→┴~─→┴~─→┴~──→┴~──→┴~──→┴~──→┴~──→┘
$ d ← 0 1 2 1 2 3 2 1 2 3 3 2 3 3 2
$ ]display b ← {⍺ ⍵}⌸d ⍝ each row i has tuple (i, js): d[js] = i
┌→──────────────────┐
↓ ┌→┐ │
│ 0 │0│ │
│ └~┘ │
│ ┌→────┐ │
│ 1 │1 3 7│ │
│ └~────┘ │
│ ┌→────────────┐ │
│ 2 │2 4 6 8 11 14│ │
│ └~────────────┘ │
│ ┌→───────────┐ │
│ 3 │5 9 10 12 13│ │
│ └~───────────┘ │
└∊──────────────────┘
In fact, it allows us to apply an arbitrary function to combine keys and values.
We will use a function that simply returns all the values for each key.
$ d ← 0 1 2 1 2 3 2 1 2 3 3 2 3 3 2
$ ]display b ← {⍵}⌸d ⍝ each row i contains values j such that d[j] = i.
┌→──────────────┐
↓0 0 0 0 0 0│
│1 3 7 0 0 0│
│2 4 6 8 11 14│
│5 9 10 12 13 0│
└~──────────────┘
Our first try doesn't quite work: it winds up trying to create a numeric
matrix, which means that we can't have rows of different sizes; short rows get
zero-padded. So the information that only index 0 satisfies d[0] = 0 is lost.
What we can do is to box (⊂) the values, to arrive at:
$ d ← 0 1 2 1 2 3 2 1 2 3 3 2 3 3 2
$ ]display b ← {⊂⍵}⌸d ⍝ d[b[i]] = i
┌→───────────────────────────────────────────┐
│ ┌→┐ ┌→────┐ ┌→────────────┐ ┌→───────────┐ │
│ │0│ │1 3 7│ │2 4 6 8 11 14│ │5 9 10 12 13│ │
│ └~┘ └~────┘ └~────────────┘ └~───────────┘ │
└∊───────────────────────────────────────────┘
Consider the groups b[2] = (2 4 6 8 11 14) and b[3] = (5 9 10 12 13). All of
the depth-3 nodes' parents are present in b[2]: every element of b[3] fits at
some location in b[2]. Here is what the fit would look like:
b[2] 2 4 _ 6 8 _ _ 11 __ __ 14 (nodes of depth 2)
b[3] 5 9 10 12 13 (leaf nodes)
4 8 8 11 11 (parents: predecessor of b[3] in b[2])
∘:0 0
┌──────────┬──┴─────────────────┐
a:1 b:3 c:7 1
│ ┌───┴───┐ ┌──────────┼───────┐
p:2 q:4 r:6 s:8 t:11 u:14 2
│ │ │
│ ┌──┴──┐ ┌─┴───┐
v:5 w:9 x:10 y:12 z:13 3
We use the Interval Index (⍸) operator to solve the problem of finding the
parent, ie, where we should squeeze a node from b[3] into b[2]. (This is
formally known as the predecessor problem.)
⍝ a[i] indexes the predecessor of right[i] in left:
⍝ left[a[i]] is the largest entry of left that is ≤ right[i]
$ a ← (1 10 100 1000) ⍸ (1 2000 300 50 2)
0 3 2 1 0
Now, we can use the technology of predecessor to find parents
of depth 3 nodes among the depth 2 nodes:
$ depth2 ← 2 4 6 8 11 14
$ depth3 ← 5 9 10 12 13 ⍝ parents (from chart): 4 8 8 11 11
$ depth3parentixs ← depth2 ⍸ depth3
$ depth3parents ← depth2[depth3parentixs]
4 8 8 11 11
∘:0 0
┌──────────┬──┴─────────────────┐
a:1 b:3 c:7 1
│ ┌───┴───┐ ┌──────────┼───────┐
p:2 q:4 r:6 s:8 t:11 u:14 2
│ │ │
│ ┌──┴──┐ ┌─┴───┐
v:5 w:9 x:10 y:12 z:13 3
We need to know one more APLism: the 2-scan (windowed reduction). When we
write a windowed reduction, we have:
$ 2+/⍳5 ⍝ apply + to _pairs_ (2 = pairs)
3 5 7 9 ⍝ (1+2) (2+3) (3+4) (4+5)
$ 3+/⍳5 ⍝ apply + to 3-tuples
6 9 12 ⍝ (1+2+3) (2+3+4) (3+4+5)
We begin by assuming the parent of i is i, by using p←⍳≢d.
$ d ← (0 1 2 1 2 3 2 1 2 3 3 2 3 3 2)
$ d2nodes ← {⊂⍵}⌸d
┌→┬─────┬─────────────┬─────────────┐
│1│2 4 8│3 5 7 9 12 15│6 10 11 13 14│
└→┴~───→┴~───────────→┴~───────────→┘
$ p←⍳≢d
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Now comes the biggie:
$ findparent ← {parentixs ← ⍺⍸⍵ ⋄ p[⍵]←⍺[parentixs]}
- ⍺ is the list of parent nodes.
- ⍵ is the list of current child nodes.
- We first find the indexes of our parent nodes by using the
pix ← parent ⍸ child idiom.
- Then, we find the actual parents by indexing into the parent list:
⍺[parentixs].
- We write these into the parents of the children using:
p[children] ← parent[parent ⍸ child]
This finally culminates in:
$ d←0 1 2 1 2 3 2 1 2 3 3 2 3 3 2
$ p←⍳≢d ⋄ d2nodes←{⊂⍵}⌸d ⋄ findp←{pix ← ⍺⍸⍵ ⋄ p[⍵]←⍺[pix]} ⋄ 2findp/d2nodes ⋄ p
0 0 1 0 3 4 3 0 7 8 8 7 11 11 7
(0 1 2 3 4 5 6 7 8 9 10 11 12 13 14)  indexes
d ← (0 1 2 1 2 3 2 1 2 3 3 2 3 3 2) │ depths
P ← (0 0 1 0 3 4 3 0 7 8 8 7 11 11 7) │ parent indices
∘:0 0
┌──────────┬──┴─────────────────┐
a:1 b:3 c:7 1
│ ┌───┴───┐ ┌──────────┼───────┐
p:2 q:4 r:6 s:8 t:11 u:14 2
│ │ │
│ ┌──┴──┐ ┌─┴───┐
v:5 w:9 x:10 y:12 z:13 3
Which can be further golfed to:
$ p⊣2{p[⍵]←⍺[⍺⍸⍵]}⌿⊢∘⊂⌸d⊣p←⍳≢d
0 0 1 0 3 4 3 0 7 8 8 7 11 11 7
The total time complexity of this method, assuming infinite parallelism, is as
follows:
$ p←⍳≢d ⋄ d2nodes←{⊂⍵}⌸d ⋄ findp←{pix ← ⍺⍸⍵ ⋄ p[⍵]←⍺[pix]} ⋄ 2findp/d2nodes ⋄ p
- (p←⍳≢d) can be filled in O(1) time.
- (d2nodes←{⊂⍵}⌸d) is grouping keys from a small integer domain, so this is
O(#nodes) using radix sort as far as I know. However, the thesis mentions that
this can be done in O(log(#nodes)). I'm not sure how; I need to learn this.
- For each call of findp, the call (pix ← ⍺⍸⍵) can be implemented using binary
search, leading to logarithmic complexity in the size of ⍺ (since we are
looking up predecessors of ⍵ in ⍺).
- Each step of the fold 2findp/d2nodes can be done entirely in parallel, since
all the writes into the p vector are independent: we only write the parent of
the current node we are looking at.
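The same level-by-level algorithm can be sketched in Python, with binary search standing in for Interval Index (a sketch of the idea, not the thesis's implementation):

```python
from bisect import bisect_right
from collections import defaultdict

def parents_from_depths(d):
    # group node indexes by depth (the Key ⌸ step); each group is in
    # preorder, hence already sorted
    by_depth = defaultdict(list)
    for i, depth in enumerate(d):
        by_depth[depth].append(i)
    p = list(range(len(d)))  # start by assuming p[i] = i
    for depth in range(1, max(d) + 1):
        ups, children = by_depth[depth - 1], by_depth[depth]
        for c in children:
            # predecessor of c among the nodes one level up (the ⍸ step);
            # in a preorder depth vector a predecessor always exists
            p[c] = ups[bisect_right(ups, c) - 1]
    return p

d = [0, 1, 2, 1, 2, 3, 2, 1, 2, 3, 3, 2, 3, 3, 2]
print(parents_from_depths(d))  # [0, 0, 1, 0, 3, 4, 3, 0, 7, 8, 8, 7, 11, 11, 7]
```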
§ 3.4: Computing nearest Parent by predicate
I'm going to simplify the original presentation by quite a bit.
a b c d e f g h i  names
0 1 2 3 4 5 6 7 8  indexes
P ← (0 0 1 2 0 4 5 6 7)  parents
X ← (0 1 0 0 1 1 0 0 0)  marked nodes
a:0
┌────┴───┐
b:1(X) e:4(X)
 
c:2 f:5(X)
 
d:3 g:6
│
h:7

i:8
We want to find, for each node, the closest ancestor marked X. The X vector
is a boolean vector that has a 1 at the index of each X node: (b, e, f). So,
the indexes (1, 4, 5) are 1 in the X vector.
The output we want is the vector:
0 1 2 3 4 5 6 7 8  indexes
a b c d e f g h i  names
PX ← (0 0 1 1 0 4 5 5 5)  closest X parent index
a a b b a e f f f  closest X parent name
a:0
┌────┴───┐
b:1(X) e:4(X)
 
c:2 f:5(X)
 
d:3 g:6
│
h:7

i:8
The incantation is:
$ I←{(⊂⍵)⌷⍺} ⍝ index LHS by RHS: (100 101 102 103)[(3 1 2)] := 103 101 102
$ PX ← P I@{X[⍵]≠1} ⍣ ≡ P
0 0 1 1 0 4 5 5 5
TODO: explain this in detail. At any rate, since this requires only reads and
no writes, nor any synchronization, it is fairly straightforward to implement
on the GPU.
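One plausible reading of the incantation, as a Python sketch: starting from P, any pointer that lands on an unmarked node is redirected to that node's parent, repeatedly, until a fixpoint is reached:

```python
def nearest_marked_parent(P, X):
    # iterate: unmarked targets get replaced by their parent; the root,
    # which points to itself, is its own fixpoint even when unmarked
    PX = list(P)
    while True:
        nxt = [v if X[v] else P[v] for v in PX]
        if nxt == PX:
            return PX
        PX = nxt

P = [0, 0, 1, 2, 0, 4, 5, 6, 7]
X = [0, 1, 0, 0, 1, 1, 0, 0, 0]
print(nearest_marked_parent(P, X))  # [0, 0, 1, 1, 0, 4, 5, 5, 5]
```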
§ 3.5: Lifting subtrees to the root
Once we have marked our X nodes, we now wish to lift entire subtrees rooted
at X nodes up to the root.
- This pass displays how to lift subtrees and add new nodes to replace the
subtree's original nodes.
- Luckily, there are no sibling relationships that need to be maintained,
since we are uprooting an entire subtree.
- There are no ordering constraints on how the subtrees should be arranged at
the top.
- Hence, we can simply add new nodes to the end of the tree (in terms of the
preorder traversal). Adding to the middle of the tree will be discussed later.
There is some good advice in the thesis:
When using APL primitives this way, it may be good to map their names and definitions to the domain of trees. For example, the primitive ⍸Predicate
is read as "the nodes where Predicate
holds" and not as "the indexes where Predicate
is 1".
For example, given the tree:
0 1 2 3 4 5  indexes
a b c d e f  names
P ← (0 0 1 0 3 4)  parents
X ← (0 1 0 1 1 0)  X nodes
PX ← (0 0 1 0 3 4)  closest X parent index
a:0
┌────┴───┐
b:1(X) d:3(X)
 
c:2 e:4(X)

f:5
we want the transformed tree to be:
a:0
┌────┴───┐
bp:1(X) ep:4(X)

b:1(X)

c:2

e:4

fp:5

f:5(X)

g:6
We first look for nodes that need to be lifted. These are:
- non-root nodes (ie, nodes whose parents are not themselves: p≠(⍳≢p)),
- which have the property X.
nodes←⍸(X ∧ p≠(⍳≢p)) ⍝ ⍸: pick indexes where the predicate holds.
§ 3.6: Wrapping Expressions
§ 3.7: Lifting Guard Test Expressions
§ 3.8: Counting rank of index operators
§ 3.9: Flattening Expressions
§ 3.10: Associating Frame slots and variables
§ 3.11: Placing frames into a lexical stack
§ 3.12: Recording Exported names
§ 3.13: Lexical Resolution
§ 5.2.1 Traversal Idioms
§ 5.2.2 Edge Mutation Idioms
§ 5.2.3 Node Mutation Idioms
- Operators in APL terminology (such as ¨) are higher order functions. Thus,
an operator allows one to modify known functions.
- Use ]disp and ]display to understand the structure of APL arrays.
- Set ]box on style=max to always draw arrays with ]display. This is supremely
useful as a newbie to understand array structure.
- Set ]box on trains=parens to render trains as trees. Super helpful when
attempting to grok train code.
- Set ]boxing on to enable boxing for trains, arguments, everything.
I ran across this when reading another question on math.se, so I
posted this proof for verification just to be sure I wasn't missing
something.
We wish to characterise prime ideals as precisely those whose complement is a
maximal multiplicative subset S ⊆ R. That is:
- An ideal P is prime iff P = R ∖ S, where S is a multiplicative subset that
cannot be made larger (ie, is maximal with respect to the ⊆ ordering).
I'll be using the definition of prime as:
- An ideal P is prime if for all x, y ∈ R: xy ∈ P ⟹ x ∈ P ∨ y ∈ P.
§ Prime ideal implies complement is maximal multiplicative subset:
Let S ≡ R ∖ P be the complement of the prime ideal P ⊆ R in question.
- Since P ≠ R, 1 ∉ P (if 1 ∈ P, then every element x = x·1 ∈ P, since P is an
ideal and must be closed under multiplication with the entire ring). Hence,
1 ∈ S.
- For any x, y ∈ S, we need xy ∈ S for S to be multiplicative.
- For contradiction, suppose there are x, y ∈ S such that xy ∉ S. Translating
to P, this means x, y ∉ P but xy ∈ P. This contradicts the definition of P
being prime.
§ Ideal whose complement is maximal multiplicative subset implies ideal is prime.
- Let I be an ideal of the ring R such that its complement S ≡ R ∖ I is a
maximal multiplicative subset.
- Let i_{1} i_{2} ∈ I. For I to be prime, we need to show that i_{1} ∈ I or
i_{2} ∈ I.
- For contradiction, let i_{1}, i_{2} ∉ I. Thus, i_{1}, i_{2} ∈ S. Since S is
multiplicative, i_{1} i_{2} ∈ S. That is, i_{1} i_{2} ∉ I (since I is
disjoint from S).
- But this violates our assumption that i_{1} i_{2} ∈ I. Hence, contradiction.
- Install Dyalog APL.
- Setup RIDE, the IDE for Dyalog APL. This IDE comes with autocomplete, good
key bindings, and a top bar chock-full of information about all the APL
symbols. It's really well designed and a pleasure to use.
- Follow the Dyalog tutorial, solving it chapter by chapter.
- Bookmark APLCart, a collection of APL idioms, and refer to it when in need.
It's kind of sad that this is the case, but on thinking about it, I realised
that the SpaceChem game was essentially a compiler, and it was such a pleasure
to learn how to use and debug: the visual nature of it made it amazing to
play with.
I often forget which is which, so I came up with this:
 Prim is very prim and proper, and therefore doesn't spread herself out. She
picks out the minimum spanning tree one vertex at a time.
Cartesian trees construct a tree T = C(A) from an array A, such that range
minimum queries (RMQ) on the array A are equivalent to lowest common ancestor
(LCA) queries on the nodes of the tree T. Note that the tree looks like a
min-heap.
To see the connection to LCA: if we want the range minimum over the range
containing the elements [12, 10, 20, 15, 18] of the array, the minimum is 10,
which is indeed the lowest common ancestor of the nodes 12 and 18 in the tree.
§ Building a Cartesian tree in linear time:
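A sketch of the standard stack-based construction (not from the original text): keep a stack of indexes whose values increase; each index is pushed and popped at most once, giving linear time. The result is a parent vector with the root pointing to itself, matching the conventions used above.

```python
def cartesian_tree_parents(A):
    # one left-to-right pass; stack holds indexes with increasing values
    parent = list(range(len(A)))
    stack = []
    for i in range(len(A)):
        last = None
        while stack and A[stack[-1]] > A[i]:
            last = stack.pop()
        if last is not None:
            parent[last] = i       # the popped chain hangs off i as left child
        if stack:
            parent[i] = stack[-1]  # i hangs off the nearest smaller element
        stack.append(i)
    return parent

print(cartesian_tree_parents([12, 10, 20, 15, 18]))  # [1, 1, 3, 1, 3]
```

The root is node 1 (value 10, the minimum), as in the RMQ example above.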
§ Converting LCA to RMQ
We can go the other way, and convert an LCA problem into an RMQ problem. We
perform an inorder traversal of the nodes, scribbling down the depth of each
node (link to lecture at 15:30). We ask for the argmin version of RMQ, which
gives us the index of the node with the lowest depth. This gives us the index
of where the node lives.
§ Universe reduction in RMQ
We can have an arbitrary ordered universe, on which we want to perform RMQ.
We can convert this to LCA by using a cartesian tree, and then convert to
a "clean" RMQ (by using the LCA > RMQ using depth conversion). This now
will give us way faster operations (since we now have integers).
§ ±1 RMQ:
We want adjacent entries to differ by exactly ±1; before, we had much wider
gaps. Here, we perform an Euler tour (walking the tree in DFS order), and
scribble down every vertex we visit. To find the LCA, we perform the RMQ on
the locations of the first occurrence of each node. (I think we don't actually
need the first occurrence; any occurrence will do.)
§ References
Really, we want a partial order that is defined with the tree as the
Hasse diagram. However, performing operations on this is hard. Hence,
the DFS numbering is a good monotone map from this partial order
to the naturals, which creates a total order.
I want to think about this more deeply; I feel that this might be a good way
to think about the low numbers that show up in Tarjan's algorithm for strongly
connected components.
This also begs the question: can we use other partial orders that discard
some information, but don't lose all the information the way going to a total
order (the naturals) does?
The code is taken from The Annotated Transformer, which explains the
"Attention is All You Need" paper. On skimming the code, one sees the
delightful line of code:
class EncoderLayer(nn.Module):
    "Encoder is made up of self-attn and feed forward (defined below)"
    def __init__(self, size, self_attn, feed_forward, dropout):
        super(EncoderLayer, self).__init__()
        self.self_attn = self_attn
        self.feed_forward = feed_forward
        self.sublayer = clones(SublayerConnection(size, dropout), 2)
        self.size = size

    def forward(self, x, mask):
        "Follow Figure 1 (left) for connections."
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, mask))
        return self.sublayer[1](x, self.feed_forward)
where the line:
x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, mask))
seems to imply that we are, indeed, performing a self attention with the same
value
x
as the query, key, and value.
However, reading the code of the selfattention (or the paper) leads
one to realise:
class MultiHeadedAttention(nn.Module):
    def __init__(self, h, d_model, dropout=0.1):
        "Take in model size and number of heads."
        super(MultiHeadedAttention, self).__init__()
        assert d_model % h == 0
        # We assume d_v always equals d_k
        self.d_k = d_model // h
        self.h = h
        self.linears = clones(nn.Linear(d_model, d_model), 4)
        self.attn = None
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, query, key, value, mask=None):
        "Implements Figure 2"
        if mask is not None:
            # Same mask applied to all h heads.
            mask = mask.unsqueeze(1)
        nbatches = query.size(0)
        # 1) Do all the linear projections in batch from d_model => h x d_k
        query, key, value = \
            [l(x).view(nbatches, -1, self.h, self.d_k).transpose(1, 2)
             for l, x in zip(self.linears, (query, key, value))]
        # 2) Apply attention on all the projected vectors in batch.
        x, self.attn = attention(query, key, value, mask=mask,
                                 dropout=self.dropout)
        # 3) "Concat" using a view and apply a final linear.
        x = x.transpose(1, 2).contiguous() \
             .view(nbatches, -1, self.h * self.d_k)
        return self.linears[-1](x)
where we notice:
# 1) Do all the linear projections in batch from d_model => h x d_k
query, key, value = \
    [l(x).view(nbatches, -1, self.h, self.d_k).transpose(1, 2)
     for l, x in zip(self.linears, (query, key, value))]
# 2) Apply attention on all the projected vectors in batch.
x, self.attn = attention(query, key, value, mask=mask,
                         dropout=self.dropout)
where we see that query, key, value are being linearly transformed before
being used. Hence, an input of (x, x, x) is transformed to
(q′, k′, v′) = (Qx, Kx, Vx), where Q, K, V are arbitrary matrices. Next, when
we pass these into attention, the output we get is:
softmax(q′ k′^{T}) v′ = softmax((Q x) (K x)^{T}) (V x) = softmax(Q x x^{T} K^{T}) V x
The code below is the same thing, spelled out:
def attention(query, key, value, mask=None, dropout=None):
    "Compute 'Scaled Dot Product Attention'"
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) \
             / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    p_attn = F.softmax(scores, dim=-1)
    if dropout is not None:
        p_attn = dropout(p_attn)
    return torch.matmul(p_attn, value), p_attn
So it's not really self attention: it's more like modulated attention
to self :)
A coarse structure on a set X is a collection of relations on X:
E ⊆ 2^{X × X} (called controlled sets / entourages) such that:
- The diagonal is controlled: (Δ ≡ { (x, x) : x ∈ X }) ∈ E.
- Closed under subsets: ∀ e ∈ E, f ⊆ e ⟹ f ∈ E.
- Closed under transpose: if e ∈ E then
(e^{T} ≡ { (y, x) : (x, y) ∈ e }) ∈ E.
- Closed under finite unions.
- Closed under composition: ∀ e, f ∈ E, e ∘ f ∈ E, where ∘ is composition of
relations.
The controlled sets are the "small" sets.
The bounded coarse structure on a metric space (X, d) is the set of all
relations such that there exists a uniform bound within which all related
elements lie:
(e ⊆ X × X) ∈ E ⇐⇒ ∃ δ ∈ ℝ, ∀ (x, y) ∈ e, d(x, y) < δ

We can check that the functions f: ℤ → ℝ, f(x) ≡ x and
g: ℝ → ℤ, g(x) ≡ ⌊x⌋ are coarse inverses to each other.
I am interested in this because if topology is related to semidecidability,
then coarse structures (which are their dual) are related to..?
§ References
§ Definitions of matroids
A matroid M is a set X equipped with an independence set I ⊆ 2^{X} such that:
- The empty set is independent: ∅ ∈ I.
- The independence set is downward-closed / closed under subsets:
∀ i ∈ I, ∀ i′ ⊆ i, i′ ∈ I.
- For any independent sets A, B ∈ I such that |A| > |B|, we must be able to
add some element a ∈ A ∖ B into B such that B′ ≡ B ∪ {a} is both independent
and larger than B: B′ ∈ I ∧ |B′| > |B|. (The exchange property.)
§ Example 1: Linearly independent sets
Let V be a vector space. The independent sets I are of the form:
I ≡ { S ⊆ V : the vectors in S are linearly independent }
This is an independence system because the empty set is linearly independent,
and subsets of a linearly independent collection of vectors remain linearly
independent. The exchange property is satisfied for linear algebraic reasons
(the Steinitz exchange lemma).
§ Example 2: The graphic/cyclic Matroid: Matroid of Forests
Let
G = (V, E) be a graph. Then collections of edges of the form:
I ≡ { F ⊆ E : F contains no cycles } 
is an independence system because the empty forest is a forest, and
a subset of edges of a forest continues to be a forest.
To check the exchange property, TODO
§ Example 3: The partition matroid
Consider the partition matroid M ≡ (E, I), where we have a partitioning of E
into blocks E_{i} and numbers k_{i}. The independence set consists of subsets
F which have at most k_{i} elements in common with each E_{i}:
I ≡ { F ⊆ E : ∀ i = 1, …, N, |F ∩ E_{i}| ≤ k_{i} }
The independence axioms are intuitively satisfied, since our constraints on
picking edges are of the form |F ∩ E_{i}| ≤ k_{i}, which will continue to
hold as F becomes smaller.
For the exchange axiom, let |Y| > |X|. Then we can assert that for some index
i, it must be the case that |Y ∩ E_{i}| > |X ∩ E_{i}|. Hence, we can add an
element of E_{i} ∩ (Y ∖ X) into X whilst still maintaining independence.
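To make the exchange axiom concrete, here is a brute-force check on a tiny partition matroid (an illustrative Python sketch; exponential, only for toy sizes):

```python
from itertools import combinations

def independent(F, blocks, k):
    # partition matroid: |F ∩ E_i| <= k_i for every block E_i
    return all(len(F & E) <= ki for E, ki in zip(blocks, k))

def exchange_holds(ground, blocks, k):
    # enumerate all independent sets, then test the exchange property pairwise
    indep = [frozenset(s)
             for r in range(len(ground) + 1)
             for s in combinations(sorted(ground), r)
             if independent(frozenset(s), blocks, k)]
    return all(any(independent(B | {a}, blocks, k) for a in A - B)
               for A in indep for B in indep if len(A) > len(B))

blocks = [{0, 1}, {2, 3}]
print(exchange_holds({0, 1, 2, 3}, blocks, [1, 1]))  # True
```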
§ Bases and Circuits
- Bases are the maximal independent sets of I (ordered by inclusion). On
adding any element to a basis, it becomes dependent. They are called bases by
analogy with linear algebra.
- Circuits are the minimal dependent sets of I. This comes from the analogy
with graphs: if we remove an element from any circuit (cycle) in a graph,
what we are left with is a tree.
A matroid can be completely characterized by knowing either its bases or its
circuits.
§ Unique Circuit property
- Theorem: Let M ≡ (E, I) be a matroid, and let S ∈ I, e ∈ E be such that
S ∪ {e} ∉ I. Then there exists a unique circuit C ⊆ S ∪ {e}.
That is, when we go from independent to dependent by adding an element, we
create a single, unique circuit. For example, when we add an edge into a
forest to create a cycle, this cycle will be unique!
§ Proof
Let C_{1}, C_{2} be circuits created when e was added into S, where C_{1} is
the largest such circuit of S ∪ {e} and C_{2} is the smallest.
Notice that C_{1}, C_{2} must contain e: if they did not, then C_{1}, C_{2}
would be circuits in S, contradicting the assumption that S is independent.
Recall that C_{1}, C_{2} are both circuits, which means that removing even a
single element from them will cause them to become independent sets.
Let us contemplate C ≡ C_{1} ∪ C_{2}. Either C = C_{1}, in which case we are
done; otherwise, |C| > |C_{1}| and |C| > |C_{2}|.
In that case, consider
C′ ≡ C ∖ {e} = (C_{1} ∪ C_{2}) ∖ {e} = (C_{1} ∖ {e}) ∪ (C_{2} ∖ {e}).
- C′ ⊆ S, since C_{1} ∖ {e}, C_{2} ∖ {e} ⊆ S.
- S is an independent set, all of whose subsets are independent by
definition. So C′ is an independent set.
- |C′| ≥ |C_{1}|, |C′| ≥ |C_{2}|.
Now, we consider C. Clearly, this is a dependent set, since C_{1} ⊆ C and
C_{1} is a dependent set. Since C = C′ ∪ {e}, this means that C′ is a
maximally independent set. Since C′ does not contain e, C′ = S.
§ Rank functions
A rank function of a matroid M ≡ ⟨X, I⟩ is a function:
r: 2^{X} → ℕ, r(S) = max{ |T| : T ⊆ S ∧ T ∈ I }
That is, for any subset S ⊆ X, r(S) is the cardinality of the largest
independent subset of S.
- In the matroid of linearly independent sets of vectors, the rank of a set
of vectors is the dimension of their span.
TODO: picture
§ Bipartite matching as matroid intersection
Matchings in a bipartite graph G = (V, E) with partition (A, B) arise as the
intersection of the independent sets of two matroids. We will denote by
δ: V → 2^{E} the function which takes a vertex to the set of edges incident
on that vertex.
Let M_{A} be a partition matroid: M_{A} ≡ (E, I_{A}) where I_{A} is:
I_{A} ≡ { F ⊆ E : |F ∩ δ(a)| ≤ 1 ∀ a ∈ A }
That is, in I_{A}, every independent set has at most one edge incident on
each vertex of A. We need to check that this is an independence system. The
empty set of edges is independent. If some collection of edges has at most
one edge incident on each vertex, then removing edges can only decrease
incidence; hence, it's also downward closed.
TODO: add picture
Similarly, we define M_{B} ≡ (E, I_{B}):
I_{B} ≡ { F ⊆ E : |F ∩ δ(b)| ≤ 1 ∀ b ∈ B }
Now, notice that any collection of edges F ∈ I_{A} ∩ I_{B} is a legal
matching, since the edges cover each vertex of A and of B at most once. The
largest element of I_{A} ∩ I_{B} is the maximum matching that we are looking
for.
§ Largest common independent set
Given two matroids M_{1} ≡ (E, I_{1}) and M_{2} ≡ (E, I_{2}), with rank
functions r_{1} and r_{2}, let S ∈ I_{1} ∩ I_{2} and let F ⊆ E. Then:
|S| = |S ∩ F| + |S ∩ (E ∖ F)|
§ References:
There's a lot written on the Zariski topology on the internet, but most of it
lacks explicit examples and pictures. This is my attempt to communicate what
the Zariski topology looks like, from the perspectives that tickle my fancy
(a wealth of concrete examples, topology-as-semidecidability, and pictures).
§ The Zariski topology
Recall that the Zariski topology is defined by describing its closed sets:
the common zero sets of a family of polynomials. Formally, the topology
(ℝ^{n}, τ) has as closed sets:
{ x ∈ ℝ^{n} : ∀ f_{i} ∈ F, f_{i}(x) = 0 } for a family F ⊆ ℝ[X_{1}, X_{2}, …, X_{n}]
Open sets (the complements of closed sets) are of the form:
{ x ∈ ℝ^{n} : ∃ f_{i} ∈ F, f_{i}(x) ≠ 0 } ∈ τ
The empty set is generated as { x ∈ ℝ^{n} : 0 ≠ 0 } and the full set is
generated as { x ∈ ℝ^{n} : 1 ≠ 0 }.
§ Semidecidability
Recall that in this view of topology, for a space (X, τ) and every open set
O ∈ τ, we associate a turing machine T_{O} which semidecides inclusion. That
is, if a point is in O, then T_{O} halts with the output INSET; if the point
is not in O, T_{O} loops forever. Formally:
x ∈ O ⇐⇒ T_{O} halts on input x
x ∉ O ⇐⇒ T_{O} does not halt on input x
Alternatively, for a closed set C, we associate a turing machine T_{C} which
semidecides exclusion. That is, if a point is not in C, T_{C} halts with the
output NOTINSET; if the point is in C, the turing machine T_{C} loops forever.
Now, in the case of polynomials, we can write a function that semidecides
exclusion from closed sets:
# return NOTINSET if x0 is not a zero of the polynomial
def is_not_in_zeros(poly, x0):
    precision = 0 # start with zero precision
    while True:
        if poly(x0[:precision]) != 0: return NOTINSET
        precision += 1 # up the precision
Since we can only evaluate a polynomial up to some finite precision, we start
with zero precision and then gradually get more precise. If some x0 is not a
root of poly, then at some point we will find that poly(x0) ≠ 0. If x0 is
indeed a root, then this process never halts: we will keep getting
poly(x0[:precision]) = 0 at every level of precision.
§ Spec(R)
 To setup a topology for the prime spectrum of a ring, we define the topological
space as
Spec(R), the set of all prime ideals of
R.
 The closed sets of the topology are
{ V(I) : I is an ideal of R },
where the function
V: Ideals of R → 2^{Prime ideals of R}
maps each ideal to the set of prime ideals that contain it. Formally,
V(I) = { p ∈ Spec(R) : I ⊆ p }.
 We can think of this differently, by seeing that we can rewrite the condition
as
V(I) = { P ∈ Spec(R) : I →^{P} 0 }: on rewriting modulo the prime ideal
P, we send the ideal
I to
0.
 Thus, the closed sets of
Spec(R) are precisely the 'zeroes of polynomials' / 'zeroes of ideals'.
 To make the analogy precise, note that in the polynomial case, we imposed a
topology on
ℝ by saying that the closed sets were
V(p) = { x ∈ ℝ : p(x) = 0 }
for some polynomial
p ∈ ℝ[x].
 Here, we are saying that the closed sets are
V(I) = { x ∈ Spec(R) : I(x) = 0 }
for some ideal
I of R. So we are looking at ideals as functions
from prime ideals to reductions of the ideal. That is,
I(P) = I/P.
§
Spec(ℤ) from this perspective
Since
ℤ is a PID, we can think of numbers instead of ideals. The above
picture asks us to think of a number as a function from a prime to a reduced
number:
n(p) = n % p. We then have that the closed sets are those sets of primes
which zero out some number. That is:
V(I) = { x ∈ Spec(ℤ) : I(x) = 0 }
V((n)) = { (p) ∈ Spec(ℤ) : (n)/(p) = 0 }
V((n)) = { (p) ∈ Spec(ℤ) : n mod p = 0 }
So in our mind's eye, we need to imagine a space of prime ideals (points), which
we are testing with all ideals (polynomials). Given a set of prime ideals (a tentative locus, say a circle),
the set of prime ideals is closed if it occurs as the zero of some collection of ideals
(if the locus occurs as the zero of some collection of polynomials).
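A small computational sketch of this picture (the helper names are mine, not from the text): under the identification n(p) = n mod p, the closed set V((n)) in Spec(ℤ) consists of exactly the primes p with n mod p = 0, i.e. the primes dividing n.

```python
# V((n)) in Spec(Z), up to some bound: the primes (p) such that the
# ideal (n) reduces to 0 modulo (p), i.e. the primes dividing n.
def primes_upto(bound):
    return [p for p in range(2, bound + 1)
            if all(p % d != 0 for d in range(2, p))]

def V(n, bound=30):
    # "test the point (p) against the function (n)": is n mod p == 0?
    return [p for p in primes_upto(bound) if n % p == 0]
```

For example, V(12) is the locus {(2), (3)}: the two primes that 12 "vanishes" on.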
Wikipedia lists the implementation of quicksort as:
algorithm quicksort(A, lo, hi) is
    if lo < hi then
        p := partition(A, lo, hi)
        quicksort(A, lo, p - 1)
        quicksort(A, p + 1, hi)

algorithm partition(A, lo, hi) is
    pivot := A[hi]
    i := lo
    for j := lo to hi do
        if A[j] < pivot then
            swap A[i] with A[j]
            i := i + 1
    swap A[i] with A[hi]
    return i
Here, the indices
[lo..i-1]
have values less than the pivot, while
[i..j]
are greater than or equal to the pivot.
§ The version I prefer
// #define SWAP(ix, ix2) { int t = a[ix]; a[ix] = a[ix2]; a[ix2] = t; }
// sorts the interval [l, r]
void qs(int l, int r) {
    if (r - l < 0) return;
    int part = a[r]; // the partition
    // a[getill..r] >= part (getill = greater or equal till)
    // starts at r since we know that a[r] >= (partition=a[r])
    int getill = r;
    // a[l..lttill] < part (lttill = less than till)
    // starts at (l-1) since we do not know about any value < partition
    int lttill = l - 1;
    // loop until they start probing into the other set
    while (!(lttill + 1 >= getill || getill - 1 <= lttill)) {
        // if the speculated element is < partition
        if (a[getill - 1] < part) {
            // swap the value at getill-1 with the slot at lttill+1
            SWAP(getill - 1, lttill + 1);
            // increment lttill, since we KNOW that the
            // value at lttill+1 = a[getill-1] is < part
            lttill++;
        } else {
            // all we know is that a[getill-1] >= part, so we can engulf
            // the region into [getill..r]
            getill--;
        }
    }
    // the partitions must be next to each other, since we have engulfed everything
    assert(getill - lttill == 1);
    // move the partition value to the center.
    SWAP(getill, r);
    // recurse: solve [l..lttill] (leave getill=part alone), solve [getill+1..r]
    qs(l, lttill);
    qs(getill + 1, r);
}
This implementation makes very clear to me what information is "known":
 The segment that is strictly less than the partition.
 The segment that is greater than or equal to the partition.
It also makes clear what is being "probed"/"tentative":
 anything we are accessing as
getill-1 / lttill+1
is not known yet; we are feeling out
the boundaries of our partitions.
The termination condition is clear: when one partition starts reaching into
the other partition's resources, it's done.
Due to using closed intervals everywhere, it's very easy to see precisely
what data starts and ends where.
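The same closed-interval scheme ports directly to Python; a sketch (sorting a[l..r] in place, with the same variable names as the C version):

```python
def qs(a, l, r):
    """Sort a[l..r] (a closed interval) in place."""
    if r - l < 0:
        return
    part = a[r]      # the partition value
    getill = r       # invariant: a[getill..r] >= part
    lttill = l - 1   # invariant: a[l..lttill] <  part
    # loop until the unknown region [lttill+1, getill-1] is empty
    while not (lttill + 1 >= getill or getill - 1 <= lttill):
        if a[getill - 1] < part:
            # move the small value into the < region
            a[getill - 1], a[lttill + 1] = a[lttill + 1], a[getill - 1]
            lttill += 1
        else:
            # engulf a[getill-1] into the >= region
            getill -= 1
    assert getill - lttill == 1
    # move the partition value between the two regions
    a[getill], a[r] = a[r], a[getill]
    qs(a, l, lttill)
    qs(a, getill + 1, r)
```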
What version of quicksort do you prefer? Drop me an email!
 All credit goes to
p0a
on ##math
on freenode for teaching me this proof!
Here's one fun application of Cauchy-Schwarz. We can apply it to the two vectors
x = (√a, √b) and
y = (√b, √a) to derive the AM-GM
inequality:
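Written out, with ⟨x, y⟩ the usual dot product:

```latex
% Cauchy-Schwarz: \langle x,y\rangle^2 \le \langle x,x\rangle \langle y,y\rangle,
% applied to x = (\sqrt{a}, \sqrt{b}), \; y = (\sqrt{b}, \sqrt{a}):
\langle x, y \rangle = \sqrt{a}\sqrt{b} + \sqrt{b}\sqrt{a} = 2\sqrt{ab},
\qquad
\langle x, x \rangle = \langle y, y \rangle = a + b
```

so (2√(ab))² ≤ (a+b)², and hence √(ab) ≤ (a+b)/2: the AM-GM inequality for two nonnegative terms.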
This was a quick experiment in using Grobner basis to model situations. We
can represent our dataflow analysis constraints in terms of polynomial
rewrites over
F_{2}.
Given the program:
p = { 0: ['=', 'x', 'y'],
      1: ['br', 2, 100],
      2: ['=', 'z', 'x'],
      3: ['br', 2],
      100: ['ret', 'z'] }
whose semantics I hope are fairly straightforward: the dictionary maps
instruction locations to instructions, instructions proceed sequentially,
and branches move control flow around. Note that
br
can branch to multiple locations,
since we are not control-flow sensitive.
The idea is that since in a dataflow analysis, we need information for
each variable at each program point, we can create a ring of polynomials
over
F_{2}, with one variable per program variable per program point. So in this case,
we would need:
R = F_2[x0, y0, z0, x1, y1, z1, x2, y2, z2, x3, y3, z3, x100, y100, z100]
We then add elements into the ideal that represents our constraints.
For example, to perform liveness analysis, we need to add constraints
stating that if a variable
z
is alive at
100,
then all variables that are used to compute
z
at
100
are alive. This sets up equations that may
have cycles (in the case of loops).
These are usually resolved using the
Kildall algorithm.
However, we can also ask SAGE to kindly solve for the Grobner basis. I hypothesize
that the "easy" dataflow problems turn out to be
toric ideals,
which admit much faster solutions.
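For comparison, the Kildall-style fixpoint the text refers to can be sketched directly; a minimal worklist liveness pass over the example program (the encoding follows the snippet above; the helper names are mine):

```python
# Worklist/Kildall-style liveness: iterate
#   live_in(i) = use(i) ∪ (live_out(i) − def(i)),  live_out(i) = ∪ live_in(succ)
# to a fixpoint; cycles (loops) converge since the sets only ever grow.
def liveness(p):
    locs = sorted(p)
    succ, defs, uses = {}, {}, {}
    for i, ins in p.items():
        op = ins[0]
        if op == '=':      # dst = src
            defs[i], uses[i] = {ins[1]}, {ins[2]}
            succ[i] = [j for j in locs if j > i][:1]  # fall through
        elif op == 'br':   # branch to every listed location
            defs[i], uses[i] = set(), set()
            succ[i] = list(ins[1:])
        elif op == 'ret':  # return v; no successors
            defs[i], uses[i] = set(), {ins[1]}
            succ[i] = []
    live_in = {i: set() for i in p}
    changed = True
    while changed:
        changed = False
        for i in p:
            out = set().union(*[live_in[s] for s in succ[i]]) if succ[i] else set()
            new = uses[i] | (out - defs[i])
            if new != live_in[i]:
                live_in[i], changed = new, True
    return live_in

p = {0: ['=', 'x', 'y'], 1: ['br', 2, 100], 2: ['=', 'z', 'x'],
     3: ['br', 2], 100: ['ret', 'z']}
```

Running this on the example yields, e.g., that only `z` is live at `100`, and `y` and `z` at entry.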
I learnt of a nice, formal way to prove the correctness of Fenwick
trees in terms of orbits that I wish to reproduce here.
One can use a Fenwick tree to perform cumulative sums
Sum(n) ≡ ∑_{i=1}^{n} A[i], and updates
Upd(i, v) ≡ (A[i] += v). Naively,
cumulative sums take
O(n) time and updates take
O(1) time. A Fenwick tree can perform
both in
O(log n). In general, we can perform
any monoid-based catenation and update in
O(log n).
§ organization
We allow indices
[1, 2, …, n]. The node with factorization
i ≡ 2^{k} × l,
2 ∤ l (that is,
2^{k} is the largest power of
2 dividing
i)
is responsible for the interval
[i−2^{k}+1, i] = (i−2^{k}, i].
I'm going to state all the manipulations in terms of prime factorizations,
since I find it far more intuitive than bitfiddling. In general, I want
to find a new framework to discover and analyze bitfiddling heavy algorithms.
Some examples of the range of responsibility of an index are:

 1 = 2^{0} × 1 = (0, 1] (subtract 2^{0} = 1)
 2 = 2^{1} × 1 = (0, 2] (subtract 2^{1} = 2)
 3 = 3 = (2, 3]
 4 = 2^{2} = (0, 4]
 5 = 5 = (4, 5]
 6 = 2 × 3 = (4, 6]
 7 = 7 = (6, 7]
 8 = 2^{3} = (0, 8]
 9 = 9 = (8, 9]
 10 = 2 × 5 = (8, 10]
 11 = 11 = (10, 11]
 12 = 2^{2} × 3 = (8, 12]
 13 = 13 = (12, 13]
 14 = 2 × 7 = (12, 14]
 15 = 15 = (14, 15]
 16 = 2^{4} = (0, 16]
§ query
To perform a cumulative sum, we need to read from the correct overlap regions
that cover the full array. For example, to read the cumulative sum up to
15, we would want to read:

a[15] = (14, 15], a[14] = (12, 14], a[12] = (8, 12], a[8] = (0, 8].
So we need to read the indices:

15 = 2^{0}·15 →^{−2^{0}} 14 = 2^{1}·7 →^{−2^{1}} 12 = 2^{2}·3 →^{−2^{2}} 8 = 2^{3}·1 →^{−2^{3}} 0
At each location, we strip off the value
2^{r}. We can discover this value
with bitfiddling: we claim that
a & (−a) = 2^{r}.
Let
a = x10^{r}. Now,
−a = ¬a + 1 = x̄01^{r} + 1 = x̄10^{r}.
Hence,
a & (−a) = (x10^{r}) & (x̄10^{r}) = 0^{|x|}10^{r} = 2^{r}
So the full implementation of query is:
#define LSB(x) x&(-x)
int a[N];
int q(int i) {
    int s = 0;
    while (i > 0) { s += a[i]; i -= LSB(i); }
    return s;
}
§ update
To perform an update at
i, we need to update all locations which, on querying,
overlap with
i. For example, to update the location
9, we would want to update:

a[9] = (8, 9], a[10] = (8, 10], a[12] = (8, 12], a[16] = (0, 16].
So we need to update the indices:

9 = 2^{0}·9 →^{+2^{0}} 10 = 2^{1}·5 →^{+2^{1}} 12 = 2^{2}·3 →^{+2^{2}} 16 = 2^{4}·1 →^{+2^{4}} …
We use the same bitfiddling technique as above to strip off the value
2^{r}:
#define LSB(x) x&(-x)
int tree[N];
void u(int i, int v) {
    while (i < N) { tree[i] += v; i += LSB(i); }
}
§ correctness
We wish to analyze the operations
Query(q) ≡ ∑_{i=1}^{q} a[i] and
Update(i, val) ≡ (a[i] += val). To do this, we are allowed to maintain
an auxiliary array
d which we will manipulate. We will stipulate the
conditions of operations on
d such that they will reflect the values of
Query and
Update, albeit much faster.
We will analyze the algorithm in terms of orbits. We have two operators, one
for update called
U, and one for query called
Q. Given an index
q,
repeatedly applying the query operator gives us the indices we need to read and
accumulate (from the auxiliary array
d) to get the total sum of
a[1..q]:

Query(q) = ∑_{i} d[Q^{i}(q)]
Given an index
u, repeatedly applying the update operator
U gives us all
the indices to which we need to add the change:

Update(u, val) = ∀ j, d[U^{j}(u)] += val
For query and update to work, we need the condition that:

q ≥ u ⟺ | { Q^{i}(q) : i ∈ ℕ } ∩ { U^{j}(u) : j ∈ ℕ } | = 1
That is, if and only if the query index
q includes the update location
u,
will the orbits intersect.
The intuition is that we want updates at an index
u to only affect queries
that occur at indices
q ≥ u. Hence, we axiomatise that for an update
to be legal, it must intersect precisely the orbits of queries at indices
greater than or equal to it.
We will show that our operators:

Q(i=2^{r}· a) = i − 2^{r} = 2^{r}(a−1)

U(j=2^{s}· b) = j + 2^{s} = 2^{s}(b+1)
do satisfy the conditions above.
For a quick numerical check, we can use the code below to inspect
the intersection behaviour of the orbits:
## calculate orbits of query and update in fenwick tree
def lsb(i): return i & (-i)
def U(i): return i + lsb(i)
def Q(i): return i - lsb(i)

def orbit(f, i):
    s = set()
    while i not in s and i > 0 and i < 64:
        s.add(i); i = f(i)
    return s

if __name__ == "__main__":
    for q in range(1, 16):
        for u in range(1, 16):
            qo = orbit(Q, q); uo = orbit(U, u)
            c = qo.intersection(uo)
            print("q:%4s | u:%4s | qo: %20s | uo: %20s | q∩u: %4s" %
                  (q, u, qo, uo, c))
        print("")
§ Case 1:
q = u
We note that
Q always decreases the value of
q, and
U always increases the value of
u. Hence, if
q = u, they meet at this point, and
∀ i, j ≥ 1, Q^{i}(q) ≠ U^{j}(u).
Hence, they meet exactly once, as required.
§ Case 2:
q < u
As noted above,
q always decreases and
u always increases, hence in this
case they will never meet as required.
§ Case 3:
q > u
Let the entire array have size
2^{N}.
Let
q = e1f_{q} and u = e0f_{u}, where
e, f_{q}, f_{u} may be empty bitstrings.
Notice that
Q will always strip away the rightmost ones of
q, leading to
q = e10…0 at some point.
Similarly,
U will keep on adding new rightmost ones, causing the
state to evolve as
u = e01…10…0 →^{U} e10…0.
Hence, at some point
q = u.
We call all functions
f: ℤ → ℝ
arithmetic functions, since they operate on the integers.
We introduce an operator
f ⋆ g: ℤ → ℝ.
It is defined by:

(f ⋆ g)(n) ≡ ∑_{d | n} f(d) g(n/d)
We will show that the set of arithmetic functions forms a group
under the operator
⋆, with identity id_{⋆}(n) ≡ ⌊ 1/n ⌋ (that is, 1 at n = 1, and 0 everywhere else).
The reason all of this is interesting is that the inverse of the constant function
1(n) ≡ 1
is going to be this function called the mobius function
µ:
µ(n = p_{1}^{α_{1}} p_{2}^{α_{2}} … p_{r}^{α_{r}}) ≡
  0 if any α_{i} > 1
  (−1)^{α_{1} + α_{2} + … + α_{r}} if all α_{i} ∈ { 0, 1 }
The mobius function will allow us to perform
mobius inversion:
f(n) ≡ ∑_{d | n} g(d) = ∑_{d | n} g(d) 1(n/d) = (g ⋆ 1)(n)
f ⋆ 1^{−1} = g ⋆ 1 ⋆ 1^{−1}
f ⋆ µ = g
That is, we originally had
f defined in terms of
g. We can
recover an expression for
g in terms of
f.
§ The algebra of multiplicative functions
We claim that the set of functions
{ f: ℤ → ℂ : f(1) ≠ 0 }
is a commutative group, with the group operation
⋆ such that:

(f ⋆ g)(n) ≡ ∑_{d | n} f(d) g(n/d).
with the identity element being
id_{⋆}(n) ≡ ⌊ 1 / n ⌋. The idea
is that if
(n = 1), then
⌊ 1/1 ⌋ = 1, and for any other
number
n > 0,
1/n < 1, hence
⌊ 1/n ⌋ = 0.
§ verification of
id_{⋆} being the identity
(f ⋆ id_{⋆})(n) ≡ ∑_{d | n} f(d) id_{⋆}(n/d)
= f(n) id_{⋆}(1) + ∑_{d | n, d < n} f(d) id_{⋆}(n/d)
= f(n) · 1 + 0 = f(n)
§ associativity, commutativity of
⋆
To prove associativity, it's better to write the formula as:
(f ⋆ g)(n) ≡ ∑_{d | n} f(d) g(n/d) = ∑_{xy = n} f(x) g(y)

From this rewrite, it's clear that
(f ⋆ g ⋆ h)(n) will unambiguously
sum over triples
(x, y, z) such that
xyz = n. I leave working this out
to you. This should also make the commutativity immediate: summing over pairs
of the form
f(x) g(y) : xy = n is the same as summing over
f(y) g(x) : yx = n.
§ Existence of inverse
We can show that an inverse exists by showing that a formula for it exists; The
idea is to construct one by induction.
Clearly, for a given function
f, we need the inverse
f^{−1} to be such that
(f ⋆ f^{−1})(n) = id_{⋆}. Hence:
(f ⋆ f^{−1})(1) = id_{⋆}(1) = 1
f(1) f^{−1}(1) = 1
f^{−1}(1) ≡ 1 / f(1)

Great, we have a base case; We can now compute
f^{−1}(n) inductively, assuming
we know the value of
f^{−1}(d) for all
d | n, d < n.
For n > 1:
(f ⋆ f^{−1})(n) = id_{⋆}(n) = 0
f(1) f^{−1}(n) + ∑_{d | n, d > 1} f(d) f^{−1}(n/d) = 0
f^{−1}(n) = −(1/f(1)) ∑_{d | n, d > 1} f(d) f^{−1}(n/d)
§
µ as the inverse of the
one function
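The inductive formula above can be run directly; a Python sketch (function names mine) that recovers µ as the ⋆-inverse of the constant-1 function:

```python
# Dirichlet convolution and the inductive inverse:
#   f^{-1}(1) = 1/f(1)
#   f^{-1}(n) = -(1/f(1)) * sum_{d | n, d > 1} f(d) f^{-1}(n/d)
def divisors(n):
    return [d for d in range(1, n + 1) if n % d == 0]

def star(f, g):
    # (f ⋆ g)(n) = sum over d | n of f(d) g(n/d)
    return lambda n: sum(f(d) * g(n // d) for d in divisors(n))

def dirichlet_inverse(f, N):
    # tabulate f^{-1}(1..N); requires f(1) != 0
    inv = {1: 1 / f(1)}
    for n in range(2, N + 1):
        inv[n] = -sum(f(d) * inv[n // d] for d in divisors(n) if d > 1) / f(1)
    return inv

one = lambda n: 1
mu = dirichlet_inverse(one, 30)  # agrees with the mobius function
```

As a sanity check, (1 ⋆ µ)(n) is 1 at n = 1 and 0 for every larger n, exactly id_{⋆}.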
§ Mobius inversion
Now that we know that
µ = (const 1)^{−1}, we can use this fact
to perform
mobius inversion:
f(n) ≡ ∑_{d | n} g(n/d) = (const 1 ⋆ g)(n)

We have
f written in terms of
g. We can now write
g in terms of
f:
f(n) = (const 1 ⋆ g)(n)
f ⋆ (const 1)^{−1} = g
g = f ⋆ µ

§
n = ∑_{d | n} φ(d)
d | { 1 ≤ x ≤ 12 : (x, 12) = d } | { 1 ≤ x ≤ 12 : (x/d, 12/d) = 1 } | size of set
1 | { 1, 5, 7, 11 } | (x, 12) = 1 | 4
2 | { 2, 10 } | (x/2, 6) = 1 | 2
3 | { 3, 9 } | (x/3, 4) = 1 | 2
4 | { 4, 8 } | (x/4, 3) = 1 | 2
6 | { 6 } | (x/6, 2) = 1 | 1
12 | { 12 } | (x/12, 1) = 1 | 1
Notice that the sizes of the sets that we are calculating are totients; for example,
|{ 1 ≤ x ≤ 12 : (x/2, 6) = 1 }| = φ(6). Summing over all of
what we have, we've counted the numbers in
[1, 2, …, 12] in two ways:
once directly, and once by partitioning into equivalence classes:
12 = φ(1) + φ(2) + φ(3) + φ(4) + φ(6) + φ(12) = ∑_{d | 12} φ(12/d)
In general, the same argument allows us to prove that n = ∑_{d | n} φ(d).
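The double-counting argument is easy to check numerically; a brute-force sketch (function names mine):

```python
# Verify n = sum_{d | n} phi(d), with phi computed by brute force.
from math import gcd

def phi(n):
    # count the 1 <= x <= n coprime to n
    return sum(1 for x in range(1, n + 1) if gcd(x, n) == 1)

def totient_sum(n):
    # sum phi(d) over the divisors d of n
    return sum(phi(d) for d in range(1, n + 1) if n % d == 0)
```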
§ Using mobius inversion on the euler totient function
§ Other arithmetical functions and their relations
This is me trying to understand how the fabled interpreter of the
J
language
works, so I could absorb Arthur Whitney's style of writing C: it's
cryptic, short, and fits in a page.
I learnt of this from the J
language page,
which comes with the quote:
One summer weekend in 1989, Arthur Whitney visited Ken Iverson at Kiln Farm and produced, on one page and in one afternoon, an interpreter fragment on the AT&T 3B1 computer. I studied this interpreter for about a week for its organization and programming style; and on Sunday, August 27, 1989, at about four o'clock in the afternoon, wrote the first line of code that became the implementation described in this document. Arthur's one-page interpreter fragment is as follows:
§ The original source
typedef char C;typedef long I;
typedef struct a{I t,r,d[3],p[2];}*A;
#define P printf
#define R return
#define V1(f) A f(w)A w;
#define V2(f) A f(a,w)A a,w;
#define DO(n,x) {I i=0,_n=(n);for(;i<_n;++i){x;}}
I *ma(n){R(I*)malloc(n*4);}mv(d,s,n)I *d,*s;{DO(n,d[i]=s[i]);}
tr(r,d)I *d;{I z=1;DO(r,z=z*d[i]);R z;}
A ga(t,r,d)I *d;{A z=(A)ma(5+tr(r,d));z->t=t,z->r=r,mv(z->d,d,r);
 R z;}
V1(iota){I n=*w->p;A z=ga(0,1,&n);DO(n,z->p[i]=i);R z;}
V2(plus){I r=w->r,*d=w->d,n=tr(r,d);A z=ga(0,r,d);
 DO(n,z->p[i]=a->p[i]+w->p[i]);R z;}
V2(from){I r=w->r-1,*d=w->d+1,n=tr(r,d);
 A z=ga(w->t,r,d);mv(z->p,w->p+(n**a->p),n);R z;}
V1(box){A z=ga(1,0,0);*z->p=(I)w;R z;}
V2(cat){I an=tr(a->r,a->d),wn=tr(w->r,w->d),n=an+wn;
 A z=ga(w->t,1,&n);mv(z->p,a->p,an);mv(z->p+an,w->p,wn);R z;}
V2(find){}
V2(rsh){I r=a->r?*a->d:1,n=tr(r,a->p),wn=tr(w->r,w->d);
 A z=ga(w->t,r,a->p);mv(z->p,w->p,wn=n>wn?wn:n);
 if(n-=wn)mv(z->p+wn,z->p,n);R z;}
V1(sha){A z=ga(0,1,&w->r);mv(z->p,w->d,w->r);R z;}
V1(id){R w;}V1(size){A z=ga(0,0,0);*z->p=w->r?*w->d:1;R z;}
pi(i){P("%d ",i);}nl(){P("\n");}
pr(w)A w;{I r=w->r,*d=w->d,n=tr(r,d);DO(r,pi(d[i]));nl();
 if(w->t)DO(n,P("< ");pr(w->p[i]))else DO(n,pi(w->p[i]));nl();}
C vt[]="+{~<#,";
A(*vd[])()={0,plus,from,find,0,rsh,cat},
 (*vm[])()={0,id,size,iota,box,sha,0};
I st[26]; qp(a){R a>='a'&&a<='z';}qv(a){R a<'a';}
A ex(e)I *e;{I a=*e;
 if(qp(a)){if(e[1]=='=')R st[a-'a']=ex(e+2);a= st[ a-'a'];}
 R qv(a)?(*vm[a])(ex(e+1)):e[1]?(*vd[e[1]])(a,ex(e+2)):(A)a;}
noun(c){A z;if(c<'0'||c>'9')R 0;z=ga(0,0,0);*z->p=c-'0';R z;}
verb(c){I i=0;for(;vt[i];)if(vt[i++]==c)R i;R 0;}
I *wd(s)C *s;{I a,n=strlen(s),*e=ma(n+1);C c;
 DO(n,e[i]=(a=noun(c=s[i]))?a:(a=verb(c))?a:c);e[n]=0;R e;}
main(){C s[99];while(gets(s))pr(ex(wd(s)));}
It's a lot to take in; it's quite breathtaking really, the way it all
hangs together in one page.
§ The attempt to get it run
Unfortunately, this does not work if we try to get it to run in 2020. I decided
to read the code and understand what would happen. I managed to read enough
to understand that the code
a=~3
ought to create an array
a
with values
0 1 2.
On attempting to run this however, we get:
$ gcc -O0 -g -std=c89 -fsanitize=address -fsanitize=undefined incunabulum.c -o bin/incunabulum && ./bin/incunabulum
...
(many many GCC warnings elided)
...
a=~3
=================================================================
==23726==ERROR: AddressSanitizer: heap-buffer-overflow on address
0x60300000eff0 at pc 0x000000402be3 bp 0x7ffe6dde6b70 sp 0x7ffe6dde6b60
WRITE of size 8 at 0x60300000eff0 thread T0
    #0 0x402be2 in wd /home/bollu/work/w/incunabulum.c:40
    #1 0x402d28 in main /home/bollu/work/w/incunabulum.c:42
    #2 0x7f7ae901082f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)
    #3 0x400ca8 in _start (/home/bollu/w/bin/incunabulum+0x400ca8)
...
SUMMARY: AddressSanitizer: heap-buffer-overflow /home/bollu/work/w/incunabulum.c:40 wd
...
==23726==ABORTING
Oops! The code uses a lot of punning between
int
and
int*.
These assumptions break now that we're on 64-bit. The patch to
get this to work is:
§ the patch
diff --git a/incunabulum.c b/incunabulum.c
index 2cae744..778e35a 100644
--- a/incunabulum.c
+++ b/incunabulum.c
@@ -1,11 +1,11 @@
-typedef char C;typedef long I;
+typedef char C;typedef long long I;
 typedef struct a{I t,r,d[3],p[2];}*A;
 #define P printf
 #define R return
 #define V1(f) A f(w)A w;
 #define V2(f) A f(a,w)A a,w;
 #define DO(n,x) {I i=0,_n=(n);for(;i<_n;++i){x;}}
-I *ma(n){R(I*)malloc(n*4);}mv(d,s,n)I *d,*s;{DO(n,d[i]=s[i]);}
+I *ma(n){R(I*)malloc(n*8);}mv(d,s,n)I *d,*s;{DO(n,d[i]=s[i]);}
 tr(r,d)I *d;{I z=1;DO(r,z=z*d[i]);R z;}
 A ga(t,r,d)I *d;{A z=(A)ma(5+tr(r,d));z->t=t,z->r=r,mv(z->d,d,r);
  R z;}
@@ -34,9 +34,10 @@ I st[26]; qp(a){R a>='a'&&a<='z';}qv(a){R a<'a';}
 A ex(e)I *e;{I a=*e;
  if(qp(a)){if(e[1]=='=')R st[a-'a']=ex(e+2);a= st[ a-'a'];}
  R qv(a)?(*vm[a])(ex(e+1)):e[1]?(*vd[e[1]])(a,ex(e+2)):(A)a;}
-noun(c){A z;if(c<'0'||c>'9')R 0;z=ga(0,0,0);*z->p=c-'0';R z;}
-verb(c){I i=0;for(;vt[i];)if(vt[i++]==c)R i;R 0;}
+I noun(c){A z;if(c<'0'||c>'9')R 0;z=ga(0,0,0);*z->p=c-'0';R z;}
+I verb(c){I i=0;for(;vt[i];)if(vt[i++]==c)R i;R 0;}
 I *wd(s)C *s;{I a,n=strlen(s),*e=ma(n+1);C c;
  DO(n,e[i]=(a=noun(c=s[i]))?a:(a=verb(c))?a:c);e[n]=0;R e;}
 main(){C s[99];while(gets(s))pr(ex(wd(s)));}
+
§ It runs!
After applying the patch, we manage to get the interpreter to run!
./bin/incunabulum
a=~3
3
0 1 2
§ The lock screen
I liked it so much that I took a screenshot and made it my lock screen.
§ Thoughts
I'm really fascinated by the code. I loved that I could simply stare at the
screenshot to absorb the code. There was no scrolling involved. The variables are
well-named (to the extent I understand the code), and it's clearly extremely
well thought out. If there's someone who understands some of the thorny
aspects of the code:
 What is the
t
variable really tracking?
 How does one create a multidimensional array easily?
 What are some interesting programs one can run with this mini-interpreter?
I'd be really glad to know the details. Please leave
an issue or a pull request against the repo.
I'm going to write a dissection of the code once I fully understand it, since I
couldn't find an explanation of the code on the internet.
Until then, enjoy the monolith of code!
§ The problem
Provide an example of a sequence
a_{n}: ℕ → ℝ
such that
lim_{n → ∞} (a_{n+1} − a_{n}) = 0,
but
a_{m} − a_{n} (for m > n) does not converge to 0 as n, m → ∞. That is,
provide a sequence where the distances between successive terms converge to zero,
but where distances between terms that are "farther apart than 1" do
not converge to 0. That is, the sequence is not
Cauchy.
§ Regular solution: Harmonic numbers
The usual solution is to take the harmonic numbers,
H_{n} ≡ ∑_{i=1}^{n} 1/i. Then, we show that
H_{n+1} − H_{n} = 1/(n+1) → 0, while
H_{2n} − H_{n} ≥ n · 1/(2n) = 1/2, so the sequence is not Cauchy.
§ Memorable solution: logarithm
We can much more simply choose
a_{n} = log(n). This yields the simple
calculation:
a_{n+1} − a_{n} = log(n+1) − log(n)
= log((n+1)/n)
= log(1 + 1/n) → log(1) = 0
while on the other hand,
a_{2n} − a_{n} = log(2n) − log(n) = log 2 + log(n) − log(n) = log 2 ≠ 0
I find this far cleaner conceptually, since it's "obvious" to everyone
that
a_{n} = log(n) diverges, while the corresponding fact for
H_{n}
is hardly convincing. We also get straightforward equalities everywhere,
instead of inequalities.
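The two calculations above are easy to see numerically; a tiny sketch (function names mine):

```python
# With a_n = log n: successive gaps vanish, but the gap between
# a_{2n} and a_n stays pinned at log 2 for every n.
import math

def gap(n):
    return math.log(n + 1) - math.log(n)   # log(1 + 1/n), shrinks to 0

def far_gap(n):
    return math.log(2 * n) - math.log(n)   # always log 2
```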
I still feel that I don't grok what precisely fails here, in that, my intuition
still feels that the local condition
ought to imply the Cauchy condition:
if
a_{n} tells
a_{n+1} to not be too far, and
a_{n+1} tells
a_{n+2},
surely this
must be transitive?
I have taught my instincts to not trust my instincts on analysis, which is a
shitty solution :) I hope to internalize this someday.
EDIT: I feel I now understand what's precisely happening
after ruminating a bit.
The Cauchy convergence criterion allows us to drop a finite number
of terms, and then capture
everything after that point in a ball
of radius
є. As
є shrinks,
all the terms in the
sequence are "squeezed together".
In the
a_{n+1} − a_{n} case, only successive terms must maintain
an
є distance. But as the
log example shows, you can steadily
plod along, keeping
є ball next to
є ball, and travel arbitrarily far from where you started:
the local condition bounds each step, but not the total distance travelled.
This is a class of methods used to solve
Ax = b, where
A is sparse.
Krylov subspace methods are a class of methods which use the idea of a
Krylov subspace. Examples include conjugate gradient (CG) and GMRES (the
generalized minimal residual method). The m-th Krylov subspace of A and v is:
K_{m}(A, v) ≡ span { v, Av, A^{2}v, …, A^{m} v }

Clearly,
K_{m} ⊆ K_{m+1}, and there is a maximum
K_{N} that we can span
(the full vector space). We are interested in the smallest index
M
such that
K_{M} = K_{M+1}.
We notice that
K_{M} is invariant under the action of
A.
Now, let's consider:
K_{m}(A, x) ≡ span { x, Ax, A^{2}x, …, A^{m} x }
= span { A^{−1} b, b, Ab, …, A^{m−1} b }   (substitute x = A^{−1}b)
= span { b, Ab, …, A^{m} b }   (invariance of the maximal Krylov subspace under A gives A^{−1}b ∈ K_{m}(A, b))
= K_{m}(A, b)

We learnt that
Ax = b has a solution in
K_{m}(A, b). Using this, we can build
solvers that exploit the Krylov subspace. We will describe GMRES and CG.
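For a concrete feel of "the solution lives in a Krylov subspace", here is a tiny pure-Python check (2×2 only; names mine): by Cayley-Hamilton, A^{−1} = (tr(A) I − A)/det(A), so x = A^{−1}b = (tr(A) b − Ab)/det(A) is explicitly a combination of b and Ab, i.e. it lies in K_{1}(A, b).

```python
# Express x = A^{-1} b as a combination of b and Ab for a 2x2 matrix,
# using Cayley-Hamilton: A^2 = tr(A) A - det(A) I.
def matvec(A, v):
    return [A[0][0] * v[0] + A[0][1] * v[1],
            A[1][0] * v[0] + A[1][1] * v[1]]

A = [[2.0, 1.0], [1.0, 3.0]]
b = [1.0, 1.0]
tr = A[0][0] + A[1][1]
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
Ab = matvec(A, b)
# x lies in span{b, Ab}: x = (tr * b - Ab) / det
x = [(tr * b[i] - Ab[i]) / det for i in range(2)]
```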
§ Generalized minimal residual  GMRES
We wish to solve
Ax = b where
A is sparse and
b is normalized. The
nth
Krylov subspace is
K_{n}(A, b) ≡ span {b, Ab, A^{2}b, …, A^{n}b }.
We approximate the actual solution with a vector
x_{n} ∈ K_{n}(A, b). We
define the
residual as
r_{n} ≡ A x_{n} − b.
§ Conjugate gradient descent
The
Rete pattern matching algorithm
is an algorithm that allows matching a huge number of rules against a huge
database of "facts".
MLIR ("multi-level intermediate representation") is a new technology that
hopes to centralize much of the research and development of various compilers
into a single authoritative source. The major claim-to-fame is that it allows
one to mix various "dialects" (much as Racket does). So, to a first order
approximation, MLIR is "JSON for compiler intermediate representations".
What MLIR gets right is
tooling. It takes the experience that the LLVM project
has mined for the last decade and a half, and bakes much of the good stuff that
came with LLVM right in. For example, MLIR comes built in with a pretty printer,
a notion of types, a notion of "attributes", SSA, and enforced provenance
tracking of code (so one can
always know what the original source code was
that led to some assembly). Sound engineering might see MLIR succeed where
many others failed.
I was reminded of Rete since the MLIR folks are trying to solve the pattern
matching problem in general for their
Generic DAG Rewriter.
They currently just use a worklist based algorithm. I'm trying to understand
if Rete can be used instead. Rete is famous for being hard to understand,
so I began a quest to look for good sources to implement it. I found a great
PhD thesis written by Robert B. Doorenbos,
which quips:
Since the Rete match algorithm provides the starting point for much of the work in this thesis, this chapter describes Rete. Unfortunately, most of the descriptions of Rete in the literature are not particularly lucid, which is perhaps why Rete has acquired "a reputation for extreme difficulty." (Perlin, 1990b) To remedy this situation, this chapter describes Rete in a tutorial style, rather than just briefly reviewing it and referring the reader to the literature for a full description. We will first give an overview of Rete, and then discuss the principal data structures and procedures commonly used to implement it. High-level pseudocode will be given for many of the structures and procedures, so that this chapter may serve as a guide to readers who want to implement Rete (or some variant) in their own systems.
I now have a reference to an accessible description of this stuff. I might
implement Rete to understand it, so that it's part of my toolkit if I ever
need it.
We have a system we wish to simulate using Hamilton's equations:
∂q/∂t = ∂H/∂p, ∂p/∂t = −∂H/∂q.
We want to simulate a system using these differential equations. We will begin
with some initial position and momentum
(q_{0}, p_{0}), evaluate
∂q/∂t |_{(q0, p0)} and
∂p/∂t |_{(q0, p0)}, and use
these to find
(q_{next}, p_{next}). An integrator is a general algorithm
that produces the next position and momentum using current information:
(q_{next}, p_{next}) = I(q_{0}, p_{0}, ∂q/∂t |_{(q0, p0)}, ∂p/∂t |_{(q0, p0)})
The design of
I is crucial: different choices of
I will have different
tradeoffs for accuracy and performance. Another interesting property
we might want is for
I to be a
symplectic integrator: one that preserves the symplectic form, and thereby keeps
the total energy of the system approximately conserved. For example, here's a plot
of the orbits of planets using two integrators, one that's symplectic (leapfrog)
and one that isn't (Euler):
Notice that since leapfrog attempts to keep energy conserved, the orbits stay
as orbits! On the other hand, the Euler integrator quickly spirals out, since
we lose energy during the integration. Note that this is
not an issue of numerical precision: the Euler integrator is inherently
such that over long timescales, it will lose energy. On the other hand, the
leapfrog integrator will
always remain stable, even with very large timesteps
and low precision.
I present the equations of the leapfrog integrator, a proof sketch that it
is symplectic, and the code listing that was used to generate the above plot.
Often, code makes most ideas very clear!
§ The integrator
§ Code listing
§ Incantations
## Run HMC with a particular choice of potential
import numpy as np
from matplotlib.animation import FuncAnimation
import matplotlib.pyplot as plt
import numpy.linalg

## dq/dt = +dH/dp |_{(q0, p0)}
## dp/dt = -dH/dq |_{(q0, p0)}   (dhdq below computes dp/dt, minus sign included)
def leapfrog(dhdp, dhdq, q0, p0, dt):
    # half step momentum
    p0 += dhdq(q0, p0) * 0.5 * dt
    # full step position: q += dt * dq/dt
    q0 += dhdp(q0, p0) * dt
    # half step momentum
    p0 += dhdq(q0, p0) * 0.5 * dt
    return (q0, p0)
For reference, we also implement an euler integrator, that uses the derivative
to compute the position and momentum of the next timestep independently.
def euler(dhdp, dhdq, q, p, dt):
    pnew = p + dhdq(q, p) * dt
    qnew = q + dhdp(q, p) * dt
    return (qnew, pnew)
Finally, we implement
planet(integrator, n, dt),
which simulates a gravitational
potential and the usual kinetic energy, using the integrator given by
integrator
for
n
steps, with each timestep taking
dt.
def planet(integrator, n, dt):
    STRENGTH = 0.5
    # potential V(q) = STRENGTH * |q|; kinetic K(p) = p^2/2
    q = np.array([0.0, 1.0])
    p = np.array([1.0, 0.0])
    # H = STRENGTH * |q| (potential) + p^2/2 (kinetic)
    def H(qcur, pcur): return STRENGTH * np.linalg.norm(qcur) + np.dot(pcur, pcur) / 2
    def dhdp(qcur, pcur): return pcur
    # dp/dt = -dV/dq: the minus sign keeps the force attractive
    def dhdq(qcur, pcur): return -STRENGTH * 2 * qcur / np.linalg.norm(qcur)
    qs = []
    for i in range(n):
        print("q: %10s | p: %10s | H: %6.4f" % (q, p, H(q, p)))
        (q, p) = integrator(dhdp, dhdq, q, p, dt)
        qs.append(q.copy())
    return np.asarray(qs)
We plot the simulations using
matplotlib
and save them.
NITERS = 15
TIMESTEP = 1
print("planet simulation with leapfrog")
planet_leapfrog = planet(leapfrog, NITERS, TIMESTEP)
plt.rcParams.update({'font.size': 12, 'font.family':'monospace'})
fig, ax = plt.subplots()
print(planet_leapfrog)
ax.plot(planet_leapfrog[:, 0], planet_leapfrog[:, 1], label='leapfrog',
linewidth=3, color='#00ACC1')
print("planet simulation with euler")
planet_euler = planet(euler, NITERS, TIMESTEP)
ax.plot(planet_euler[:, 0], planet_euler[:, 1], label='euler',
linewidth=3, color='#D81B60')
legend = plt.legend(frameon=False)
ax.set_title("leapfrog v/s euler: NITERS=%s dt=%s" % (NITERS, TIMESTEP))
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['bottom'].set_visible(False)
ax.spines['left'].set_visible(False)
plt.savefig("leapfrogvseuler.png")
plt.show()
Quite a lot of ink has been spilt on this topic. My favourite reference
is the one by
Rufflewind.
However, none of these references has a good stock of examples contrasting the
two modes. So here, I catalogue the explicit computations for both forward
mode AD and reverse mode AD.
In general, in forward mode AD, we fix how much the inputs wiggle with
respect to a parameter
t. We figure out how much the output wiggles
with respect to
t. If
output = f(input_{1}, input_{2}, … input_{n}),
then
∂ output/∂ t = ∑_{i} (∂ f/∂ input_{i}) (∂ input_{i}/∂ t).
In reverse mode AD, we fix how much the parameter
t wiggles with
respect to the output. We figure out how much the parameter
t
wiggles with respect to the inputs.
If
output_{i} = f_{i}(input, …), then
∂ t/∂ input = ∑_{i} (∂ t/∂ output_{i}) (∂ f_{i}/∂ input).
This is a much messier expression, since we need to accumulate the data
over all outputs.
Essentially, deriving output from input is easy, since how to compute an output
from an input is documented in one place. Deriving input from output is
annoying, since a single input can feed into many outputs.
The upshot is that if we have few "root outputs" (like a loss function),
we need to run AD once with respect to this, and we will get the wiggles
of
all inputs at the same time with respect to this output, since we
compute the wiggles output to input.
The first example of
z = max(x, y)
captures the essential difference
between the two approached succinctly. Study this, and everything else will make
sense.
§ Maximum: z = max(x, y)
We can compute
∂ z/∂ x by setting
t = x.
That is,
∂ x/∂ t = 1, ∂ y/∂ t = 0.
Similarly, we can compute
∂ z/∂ y by setting
t = y.
That is,
∂ x/∂ t = 0, ∂ y/∂ t = 1.
If we want both gradients
∂ z/∂ x, ∂ z/∂ y,
we will have to
rerun the above equations twice with the two initializations.
In our equations, we are saying that we know how sensitive
the inputs
x, y are to a given parameter
t. We are deriving how sensitive
the output
z is to the parameter
t as a composition of
x, y. If
x > y, then we know that
z is as sensitive to
t as
x is.
We can compute
∂ z/∂ x, ∂ z/∂ y
in one shot by setting
t = z. That is,
∂ z/∂ t = 1.
In our equations, we are saying that we know how sensitive
the parameter
t is to a given output
z. We are trying to see
how sensitive
t is to the inputs
x, y. If
x is active (ie,
x > y),
then
t is indeed sensitive to
x and
∂ t/∂ x = 1.
Otherwise, it is not sensitive, and
∂ t/∂ x = 0.
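The two directions for z = max(x, y) can be written out as a small sketch (the function names are mine):

```python
def forward_max(x, y, dx_dt, dy_dt):
    # forward mode: given the input wiggles dx/dt and dy/dt,
    # produce the output wiggle dz/dt in the same pass.
    z = max(x, y)
    dz_dt = dx_dt if x > y else dy_dt  # the active input's wiggle passes through
    return z, dz_dt

def reverse_max(x, y, dt_dz):
    # reverse mode: given the output sensitivity dt/dz, produce BOTH
    # input sensitivities dt/dx and dt/dy in one pass.
    if x > y:
        return dt_dz, 0.0  # x is active: all sensitivity flows to x
    return 0.0, dt_dz      # y is active
```

Note the asymmetry: to get both ∂z/∂x and ∂z/∂y, `forward_max` must be called twice (once per seed), while `reverse_max` delivers both in a single call.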
§ sin: z = sin(x)
We can compute
∂ z/∂ x by setting
t = x.
That is, setting
∂ x/∂ t = 1.
We can compute
∂ z/∂ x by setting
t = z.
That is, setting
∂ z/∂ t = 1.
§ addition: z = x + y
§ multiplication: z = xy