On learning about infinite dimensional vector spaces, one learns that we need the axiom of choice to assert that every such vector space has a basis; indeed, asserting this is equivalent to the axiom of choice. However, I had not known any “natural” examples of such a vector space till I studied the proof of the Barvinok algorithm. I produce the example here.
Consider a space such as $S \equiv \mathbb R^3$. Now, consider the vector space spanned by the indicator functions of polyhedra in $S$. That’s a mouthful, so let’s break it down.
A polyhedron is defined as a set of points cut out by finitely many linear inequalities: $P \equiv \{ x \in S : a_i \cdot x \leq b_i, i \in [1\dots n] \}$, where $a_i \in S$, $b_i \in \mathbb R$.
The indicator functions are of the form:
\begin{align}
[P] : S \rightarrow \mathbb R, \qquad [P](x) \equiv \begin{cases} 1 & x \in P \\ 0 & \text{otherwise} \end{cases}
\end{align}
We can define a vector space of these functions over $\mathbb R$, using the “scaling” action as the action of $\mathbb R$ on these functions:
\begin{align}
&(f + g)(x) \equiv f(x) + g(x) \\
&(r \cdot f)(x) \equiv r \times f(x)
\end{align}
The vector space $V$ is defined as the span of the indicator functions of all polyhedra. It’s clearly a vector space, and a hopefully intuitive one. However, note that the set we generated this from (indicators of polyhedra) does not form a basis, since its elements have many linear dependencies between them. For example, one can write the equation:
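For instance, here is one concrete dependence among indicators of polyhedra in $\mathbb R^1$ (my example, not necessarily the one intended here): the two half-lines overcount their overlap,

\begin{align}
[x \geq 0] + [x \leq 0] = [\mathbb R] + [\{0\}]
\end{align}

and all four sets involved are polyhedra, so this is a genuine linear relation among the spanning vectors.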
Central idea: assume a memory model where computation is free; the only cost is pulling data from main memory into cache. The cache has total size $M$ and holds blocks of size $B$, so it can hold $M/B$ blocks of main memory. Main memory has infinite size. The cost is the number of block transfers.
We assume that the algorithm does not know $M$ or $B$. We assume that the cache replacement strategy is optimal (kick out the block that is going to be used farthest in the future). This is an OK assumption to make, since an LRU cache using twice the memory of an “oracular” cache performs equally well (citation?)
These data structures are cool since they essentially “adapt” to varying cache hierarchies, and even to multi-level cache hierarchies.
We study how to build cache-oblivious B-trees.
We use a balanced BST. We want to find an order to store nodes in memory such that when we search for an element, we minimize number of blocks we need to pull in.
All standard orders such as level order, preorder, postorder fail.
The correct order is the “vEB (van Emde Boas) order”: carve the tree at the middle level of its edges. Lay out a “triangle”, or smaller collection of nodes, linearly. Then recursively lay out the subtrees, linearly in memory.
Supposedly, if the number of nodes is $N$, we will have roughly $\sqrt{N}$ nodes in the top triangle, and then $\sqrt{N}$ triangles at the bottom.
$N$ is the number of nodes in the BST. Note that in the analysis, we know what B is, even though the algorithm does not.
We look at a particular level of recursion. We will call it a “level of detail” straddling $B$.
We will have large triangles of size $\geq B$, inside which there are smaller triangles of size $\leq B$ (reminds me of sierpinski).
We know that the algorithm recursively lays it out, and a triangle stores everything “inside” it in a contiguous region. So we stop at the requisite size, where we know that the tree’s triangles themselves contain triangles which fit into the block size.
index:    1 2 3 4 5 6 7 8
boundary:        |
data:        x x x x x x x
A triangle of size $\leq B$ stored contiguously straddles at most one block boundary, so it costs at most 2 block transfers to read.
The question is: on a root-to-leaf path, how many such triangles do we need to visit? Since we repeatedly divide the tree in half by height until each little triangle has fewer than $B$ nodes, each triangle’s height is going to be $O(\log B)$, since it’s still a binary tree. The total height is $O(\log N)$, so the height of the “chunked tree”, where we view each triangle as a single node, is $\log N / \log B = \log_B N$.
We need a black box: ordered file maintenance (linked lists for arrays).
We take a static vEB tree on top of an ordered file. The tree is a segment tree whose internal nodes store the max of their children; the leaves are the members of the ordered file. Look at the level of detail that straddles $B$.
Let us look at the bottom 2 levels.
Note that when we perform a postorder traversal inside a triangle that has 3 triangles of size $\leq B$, we need to alternate between the parent triangle and a child triangle. Since the parent triangle is of size $\leq B$, it can straddle at most 2 blocks of memory; similarly, each child can straddle at most 2 blocks of memory.
So if our cache can hold $4$ blocks of memory, we’re done. We won’t need to kick anything out when performing the postorder traversal.
For levels that are above the bottom 2 levels, we’re still OK: there are not many triangles / not many nodes! (1:16:00 in the video)
We denote partial functions with $X \rightharpoonup Y$ and total functions with $X \rightarrow Y$.
A set $X$ equipped with an associative binary operator $\star: X \times X \rightarrow X$ is a semigroup.
For a ground set $X$, the set of partial functions $Pf(X) \equiv \{ f: X \rightharpoonup X \}$ along with function composition forms a semigroup. This is in fact stronger than a semigroup. There exists:
Let $Q$ be a set. Let $S \subseteq Pf(Q)$ be a subsemigroup of $Pf(Q)$. Then the semigroup $X \equiv (Q, S)$ is called the transformation semigroup $X$ of states $Q$.
We call $X \equiv (Q, S)$ a transformation monoid if $S$ contains the identity $1_Q(q) = q$.
There is some subtlety here. Just because $S$ is a monoid does not mean that it is a transformation monoid. It must have the identity element of $Pf(Q)$ to be called a transformation monoid. For example, consider the set $Q \equiv \{ a, b \}$ and the transformation semigroup $S \equiv \{ f : \_ \mapsto b \}$. Now the set $S$ is indeed a monoid with identity element $f: Q \rightarrow Q$. However, $f \neq 1_Q$; hence, $S$ is not a transformation monoid.
$(X, \{ \theta : \theta(x) = \_ \})$: the semigroup with only the empty transformation.
$(X, \emptyset)$, the semigroup with no transformations.
We sometimes wish to represent a semigroup using an action/transformation semigroup on a ground set $X$. So, given some semigroup $(T, \times)$ that needs to be represented, if we can find a morphism $r: T \rightarrow Pf(X)$ ($r$ for representation) such that:
Put more simply, $t_1 \neq t_2 \implies r(t_1) \neq r(t_2)$ where we define function equality extensionally: $f = g \equiv \forall x, f(x) = g(x)$.
We often wish to represent some semigroup $S$ as the transformation semigroup of some set of states $Q$. We can achieve this by providing a morphism:
Then, we can treat elements of $S$ as elements of $Pf(Q)$.
Given a transformation semigroup $X \equiv (Q, S)$, we can complete it by adding a new sink state $\bot$, and then converting all partial functions in $S$ to total functions that transition to $\bot$ wherever they were undefined. We have that $\bot \cdot s = s \cdot \bot$ for all $s \in S$.
We denote the completion as $X^c \equiv (Q^c, S^c)$.
Let $X \equiv (Q_X, S_X)$ and $Y \equiv (Q_Y, S_Y)$ be transformation semigroups. Let $\phi: Q_Y \rightarrow Q_X$ be a relation. Let $s_x \in S_X$ and $s_y \in S_Y$. Then, if the following diagram commutes:
If $s_x(\phi(q_y)) = \phi(t_y(q_y))$, then we say that $t_y$ covers $s_x$ relative to $\phi$. We imagine the $t_y$ lying above $s_x$, being projected down by $\phi$.
If, for a fixed $\phi$, for every $s_x \in S_X$ there exists a $t_y \in S_Y$ such that $t_y$ covers $s_x$ relative to $\phi$, then we say that $\phi$ is a relation of automata.
If $X \prec_\phi Y$, we say that $Y$ dominates $X$, or $Y$ covers $X$, or $X$ divides $Y$.
We note that for a given covering $\phi$, if $s_x$ is covered by $t_y$ and $p_x$ is covered by $q_y$, then $s_x \circ p_x$ is covered by $t_y \circ q_y$.
Thus, to check if $X$ is covered by $Y$, we simply need to check if some generating subset of $X$ is covered by $Y$.
Let us assume we have a representation of a transformation semigroup given with a semigroup $\Sigma$, a transformation semigroup $X \equiv (Q_X, S_X)$, and a representation $r: \Sigma \rightarrow S_X$ that is faithful.
Now, to check that $X$ is covered by another $Y$, it suffices to check that for each $\sigma \in \Sigma$ there exists a $t_y \in S_Y$ such that $r(\sigma)$ is covered by this $t_y$.
Given a relation $\phi: Y \rightarrow X$, then we define:
Recall that compositions of elements are covered by the composition of their coverings. Hence, if $(s, t), (s', t') \in \Sigma$, then $(ss', tt') \in \Sigma$. Thus, $\Sigma$ is a subsemigroup of $S_X \times S_Y$.
We can regard $\Sigma$ as the graph of a relation $\phi' \subseteq Q_Y \times Q_X$. This will be called the companion relation of $\phi$.
Let $X \equiv (Q_X, S_X)$ and $Y \equiv (Q_Y, S_Y)$. We’re going to define a large product $X \wr Y$.
We begin with the set $W \equiv S_X^{Q_Y} \times S_Y$, where $S_X^{Q_Y} \equiv \{ f : Q_Y \rightarrow S_X \}$. The wreath product then becomes:
with the action of $W$ on an element of $Q_X \times Q_Y$ being defined as:
It’s a “follow the types” sort of definition, where we edit the right component as $r_y \mapsto t_y(r_y)$, since that’s all we can do. In the case of the left component, we have a $q_x$, and we need to produce another element of $Q_X$, so we “must use $f$”. The only way to use $f$ is to feed it the $r_y$. This forces us into the above definition.
To show that it’s closed under composition, let’s consider $(f, s_y), (g, t_y) \in W$ with $f, g: Q_Y \rightarrow S_X$ and $s_y, t_y \in S_Y$. The result is going to be:
Let $X = (Q, S)$ be a transition system. Given subsets $a, b \subseteq Q$, we shall write $b \leq a$ if either $b \subseteq a$ or there exists some $s \in S$ such that $b \subseteq s(a)$, where $s(a) \equiv \{ s(a_i) : a_i \in a \}$. We can define an equivalence relation $a \sim b \iff a \leq b \land b \leq a$.
Note that $b \leq a \implies |b| \leq |a|$, since:
Similarly, $a \leq b \implies |a| \leq |b|$. Therefore, $b \sim a$ means
that $|b| = |a|$.
Theorem: for all $a, b \subseteq Q_X$ such that $a \sim b$ and $b \subseteq s(a)$, we show that $b = s(a)$, and that there exists a $t \in S_X$ such that $a = t(b)$.
Proof: Since $b \subseteq s(a)$ and $|b| = |s(a)| = |a|$, we get $b = s(a)$. Therefore $s$ (restricted to $a$) is a permutation. Hence, $s$ is invertible, and there exists an inverse permutation $t$ such that $a = t(b)$. We now need to show that $t \in S_X$. To do this, first note that if the order of the permutation $s$ is $n$, then $t = s^{n-1}$, since $t \circ s = s^{n-1} \circ s = s^n = 1$. Since the semigroup $S$ is closed under composition, $t = s^{n-1}$ is in $S$, since it is $s$ composed with itself $(n-1)$ times.
We will be interested in a family of subsets of $Q_X$ called $A$, of the form:
In the above set, we have $\leq$ and $\sim$ as defined above.
We note that the set $A$ is closed under the action of all $s \in S_X$. For example, the empty set is taken to the empty set. All singleton sets are taken to other singleton sets. For the full set $Q$, we add the sets $s(Q)$ for all $s \in S_X$.
A height function for a transition system $X \equiv (Q_X, S_X)$ is a function $h: A \rightarrow \mathbb Z$ such that:
The notation $b < a \equiv (b \leq a) \land \lnot (a \leq b)$.
(3) + (4) imply that two elements of the same height are either equivalent or incomparable.
For $a \in A$ such that $|a| > 1$, we denote by $B_a$ the set of all $b \in A$ that are maximal proper subsets of $a$. That is, if $b \in B_a$, then $b \subsetneq a$ and $\not\exists c, b \subsetneq c \subsetneq a$. Equivalently, if there exists a $c$ such that $b \subseteq c \subseteq a$, then $c = b$ or $c = a$.
Note that we can assert that $a = \cup_{b \in B_a} b$. This is because $B_a$ contains all the singletons of $Q_X$. So we can begin by writing $a$ as a union of singletons, and then merging elements of $B_a$ into larger elements of $B_a$, terminating when we cannot merge any more.
Let us assume that there exists an $s \in S$ such that $s(a) = a$. Let $A_a$ be the set of all elements of $A$ contained in $a$: $A_a \equiv \{ A_i : A_i \in A, A_i \subseteq a \}$.
Recall that the set $A$ was closed under the action of all $s$, and hence, since $s$ is a permutation of $a$, this naturally extends into a permutation of $A_a$: $s A_a = A_a$. Now note that this induces a permutation of the set $B_a$. This creates a transition system:
\begin{align}
&G_a \equiv \{ s \in S : s(a) = a \} \\
&H_a \equiv (B_a, G_a)
\end{align}
We have already shown how if $s \in S$ defines a permutation of some set $X$ by its action, then its inverse also exists in $S$. So, this means that $G_a$ is in fact a transition group that acts on $B_a$.
It might turn out that $G_a = \emptyset$. However, if $G_a \neq \emptyset$, then as stated above, $G_a$ is a group.
We will call such a transition group a _generalized_ transition group, since either $G_a = \emptyset$ or $G_a$ is a group.
Now, the generalized transition group $H_a$ is called the holonomy transition system of $a$, and the group $G_a$ is called the holonomy group of $a$.
We have that $G_a \prec S$, since $G_a$ is a quotient of the subsemigroup $\{ s : s \in S, s(a) = a \}$. (TODO: so what? why does this mean that it’s $\prec$?)
Theorem: if $a \sim b$, then $H_a \simeq H_b$ (similar subsets have isomorphic holonomy transition systems). Proof: TODO.
It’s a somewhat well-known fact that given a matrix multiplication $O = AB$, where $O \in \mathbb R^{2n \times 2m}$ ($O$ for output) and $A \in \mathbb R^{2n \times r}$, $B \in \mathbb R^{r \times 2m}$ are matrices.
We can also write this as follows:
When written as code, the original matrix multiplication is:
// a:[2N][2Z] b:[2Z][2M] -> out:[2N][2M]
void matmul(int N, int Z, int M, int a[2*N][2*Z], int b[2*Z][2*M], int out[2*N][2*M]) {
  for(int i = 0; i < 2*N; ++i) {
    for(int j = 0; j < 2*M; ++j) {
      for(int k = 0; k < 2*Z; ++k) { out[i][j] += a[i][k] * b[k][j]; }
    }
  }
}
and the block-based matrix multiplication is:
// a:[2N][2Z] b:[2Z][2M] -> out:[2N][2M]
void matmulBlock(int N, int Z, int M, int a[2*N][2*Z], int b[2*Z][2*M], int out[2*N][2*M]) {
  for (int BI = 0; BI < 2; ++BI) {
    for (int BJ = 0; BJ < 2; ++BJ) {
      for(int i = BI*N; i < BI*N+N; ++i) {
        for(int j = BJ*M; j < BJ*M+M; ++j) {
          for(int k = 0; k < 2*Z; ++k) { out[i][j] += a[i][k] * b[k][j]; }
        }
      }
    }
  }
}
We wish to show that both of these programs have the same semantics. We will do this by appealing to ideas from program analysis.
We will consider the statement:
out[i][j] += a[i][k] * b[k][j]
as occurring at an abstract “point in time” $(i, j, k)$ in the matmul function. It also occurs at an abstract “point in time” $(BI, BJ, i', j', k')$ in the matmulBlock function.
We will then show that the loops for(i...) for(j...) for(k...) are fully parallel, and hence we can reorder the loops any way we want.
Then, we will show that the ordering imposed by $(BI, BJ, i’, j’, k’)$ is a reordering of the original $(i, j, k)$ ordering. We do this by showing that there is a bijection:
Thus, the blocked ordering executes all the same statement instances, and does so without affecting the program semantics.
We’ll zoom out a little, to consider some simple programs and understand how to represent parallelism.
void eg1(int N, int M, int out[N][M]) {
  for(int i = 0; i < N; ++i) {
    for(int j = 1; j < M; ++j) {
      out[i][j] = out[i][j-1];
    }
  }
}
Notice that this program is equivalent to the program with the $i$ loop reversed:
void eg1rev(int N, int M, int out[N][M]) {
  for(int i = N-1; i >= 0; --i) {
    for(int j = 1; j < M; ++j) {
      out[i][j] = out[i][j-1];
    }
  }
}
What’s actually stopping us from reversing the loop for(j...)? It’s the fact that the value of, say, out[i=0][j=1] depends on out[i=0][j=0]. We can see that in general, out[i=i_0][j=j_0] depends on out[i=i_0][j=j_0-1]. We can represent this by considering a dependence set:
in general, we can reorder statements as long as we do not change the directions of the arrows in the dependence set.
Midnight discussions with my roommate Arjun P.
This tries to explore what it is about algebra that I find appealing.
I think the fundamental difference to me comes down to flavour: analysis and combinatorial objects feel very “algorithm”, while algebra feels very “data structure”.
To expand on the analogy, a proof technique is like an algorithm, while an algebraic object is like a data structure. The existence of an algebraic object allows us to “meditate” on the proof technique as a separate object that does not move through time. This allows us to “get to know” the algebraic object, independent of how it’s used. So, at least for me, I have a richness of feeling when it comes to algebra that just doesn’t shine through with analysis. The one exception may be reading something like “by compactness”, which has been hammered into me by exercises from Munkres :)
Meditating on a proof technique is much harder, since the proof technique is necessarily intertwined with the problem, unlike a data structure which to some degree has an independent existence.
This reminds me of the quote: “Art is how we decorate space; music is how we decorate time.” I’m not sure how to draw out the tenuous connection I feel, but it’s there.
Arjun comes from a background of combinatorics, and my understanding of his perspective is that each proof is a technique unto itself. Or, perhaps instantiating the technique for each proof is difficult enough that abstracting it out is not useful enough in the first place.
A good example of a proof technique that got studied in its own right in combinatorics is the probabilistic method. A more modest example is that of the pigeonhole principle, which still requires insight to instantiate in practice.
Not that this does not occur in algebra, but there is something in algebra about how much falls out of just meditating on the definitions. For example, the Whitney trick, which got pulled out of the proof of the Whitney embedding theorem.
To draw an analogy for the Haskellers, it’s the same joy of being able to write down the type of a Haskell function and know exactly what it does, enough that a program can automatically derive the function (djinn). The fact that we know the object well enough that just writing the type down allows us to infer the program makes it beautiful. There’s something very elegant about the minimality that algebra demands. Indeed, this calls back to another quote: “perfection is achieved not when there is nothing more to add, but when there is nothing left to take away”.
I’m really glad that this 2 AM discussion allowed me to finally pin down why I like algebra.
Using `using` for cleaner function type typedefs
I’ve always struggled with remembering the syntax for function type typedefs:
typedef RETTY (*FNTYNAME)(ARGTY1, ARGTY2, ..., ARGTYn);
We can now use `using` for a far more pleasant syntax:
using FNTYNAME = RETTY(ARGTY1, ARGTY2, ..., ARGTYn);
which is also intuitive. You write down the “type” on the right hand side, and give it a name on the left.
This is not strictly the same, since the typedef typedefs FNTYNAME to a function pointer type, while the C++ version typedefs the function type. I prefer the latter at any rate, since I dislike the fact that the usual typedef tends to hide the fact that a function pointer is some pointer-like-thing.
reflection: $\mathbb Z_2$
\begin{align}
\begin{bmatrix}
1 & 0 \\
a & X
\end{bmatrix}
\begin{bmatrix}
1 & 0 \\
b & Y
\end{bmatrix}
= \begin{bmatrix}
1 & 0 \\
a + X \cdot b & XY
\end{bmatrix}
\end{align}
$(Y \mapsto b) \xrightarrow{act} (X \mapsto a)$
$XY \mapsto a + X \cdot b$
Where are the lamps? At each $i \in \mathbb Z$, you have a lamp that is either on or off. So you have $\mathbb Z_2$.
$L \equiv \mathbb Z \rightarrow \mathbb Z_2$ is our space of lamps. You can act on this space by either moving using $\mathbb Z$, or toggling a lamp using $\mathbb Z_2$: $\mathbb Z_2^{\mathbb Z} \rtimes \mathbb Z$.
I don’t find people who draw “all three parts” of a natural transformation: the categories $C$, $FC$, and $GC$, and then show the relationship between them, so I made this for my own reference.
The ocaml-git project is a reimplementation of git in OCaml. It’s well-written, and I was walking through the codebase when I found absolutely amazing, hilarious, and deep comments from dinosaure. I really enjoyed reading through the codebase, and the existence of these comments made it more humane to read. I don’t know who dinosaure is, but I’m really glad they wrote the comments they did; it really made my day.
(* XXX(dinosaure): ...we can fix this detail but
I'm lazy like haskell. TODO! *)
(* XXX(dinosaure): see this paper
https://github.com/ocamllabs/papers/blob/master/irmin/2014.08.matthieu/rapport.pdf *)
(* XXX(dinosaure): [~chunked:false] is mandatory, I don't want to explain
why (I lost one day to find this bug) but believe me. *)
(* XXX(dinosaure): if, one day, we find a bug about the serialization of the
IDX file, may be it's about this function (stable/unstable sort). *)
(* XXX(dinosaure): OK, this is a complicated note; I spent 2 days
fixing the bug. So I'm explaining it in French, it's simpler.
Roughly, [Helper.MakeDecoder] uses this buffer as its internal
buffer to handle the alterations. What happens is that, in the
[refill] function, the internal buffer is topped up from an [input]
(typically [zl]). It is ultimately a __bad__ interplay between a
[Cstruct.t] and a [Bigarray].
The point is to know the true size of the [Bigarray], and to append
the contents of the [input] with [blit] if the size of the
[Bigarray] (and not of the [Cstruct]) is sufficient.
Before this change, [zl], [de] and [io] shared the same [Bigarray],
sliced (with [Cstruct]) into 3. So, in the [MakeDecoder], [refill]
considered (for large files of more than 0x8000 bytes) that after
[de] we still had room - and in that case, we had [io].
Thus, we would [blit] [zl] into [de+sizeof(de) == io], and we would
end up trying to decompress what we had already decompressed. (YOLO).
So we now treat [de] as a [Cstruct.t] and a physically different
[Bigarray], to avoid this problem. However, we should keep
introspecting, because I have the intuition that for a file bigger
than [2 * 0x8000], we should have a problem. So TODO. *)
(* XXX(dinosaure): at the end, we don't care if we lost something. *)
Since MLIR hasn’t set up the nice tooling that LLVM has around CMake, as far as I can tell, one needs to actually know CMake to link against MLIR. However, as is well known, CMake incantations are handed down by priests who spend the better part of their lives studying the tome that is the CMake manual. I, an unlucky soul, had to go on this adventure, and I hope to spare you the trouble.
I wished to link against a static library build of MLIR. The secret lies in the find_library call:
#If the variable has been set by -DMLIR_INCLUDE_PATH, then keep it.
#Otherwise fall back to the environment variable $MLIR_INCLUDE_PATH.
#If neither, then *shrug*.
IF(NOT MLIR_INCLUDE_PATH)
set (MLIR_INCLUDE_PATH $ENV{MLIR_INCLUDE_PATH})
endif()
#Resolve:
# a library target called `MLIRAnalysis`
# asking to link against `libMLIRAnalysis.a`
# using the variable MLIR_LIBRARY_PATH which, as we saw before,
# is either an environment variable or a cmake option
find_library(MLIRAnalysis MLIRAnalysis ${MLIR_LIBRARY_PATH})
I cribbed the actual things to link against from the path
mlir/examples/Toy/Ch2/CMakeLists.txt
which helpfully lists MLIR things it needs to link against.
The full CMakeLists
is here:
cmake_minimum_required(VERSION 3.5)
project(languagemodels)
set(CMAKE_CXX_STANDARD 14)
# I don't want to use find_package since I want proper control over where my LLVM comes from.
# find_package(LLVM REQUIRED)
add_executable(languagemodels
rnn.cpp codegenc.h lang.h codegenmlir.h)
# Attempt to take these as command line arguments. IF that fails,
# lookup environment.
IF(NOT MLIR_INCLUDE_PATH)
set (MLIR_INCLUDE_PATH $ENV{MLIR_INCLUDE_PATH})
endif()
IF(NOT MLIR_LIBRARY_PATH)
set (MLIR_LIBRARY_PATH $ENV{MLIR_LIBRARY_PATH})
endif()
target_include_directories(languagemodels PRIVATE ${MLIR_INCLUDE_PATH})
find_library(MLIRAnalysis MLIRAnalysis ${MLIR_LIBRARY_PATH})
find_library(MLIRIR MLIRIR ${MLIR_LIBRARY_PATH})
find_library(MLIRParser MLIRParser ${MLIR_LIBRARY_PATH})
find_library(MLIRSideEffects MLIRSideEffects ${MLIR_LIBRARY_PATH})
find_library(MLIRTransforms MLIRTransforms ${MLIR_LIBRARY_PATH})
find_library(LLVMCore LLVMCore ${MLIR_LIBRARY_PATH})
find_library(LLVMSupport LLVMSupport ${MLIR_LIBRARY_PATH})
# debugging to check if it's been set properly
message(MLIR_INCLUDE_PATH ${MLIR_INCLUDE_PATH})
message(MLIR_LIBRARY_PATH ${MLIR_LIBRARY_PATH})
message(MLIRAnalysis ${MLIRAnalysis})
target_link_libraries(languagemodels
${MLIRAnalysis}
${MLIRIR}
${MLIRParser}
${MLIRSideEffects}
${MLIRTransforms}
${LLVMCore}
${LLVMSupport})
This comes from The wild book which I anticipate I’ll be posting more of in the coming weeks.
Let an experiment be a tuple of the phase space $X$, the action space $A$, and an action of the actions on the phase space, $\curvearrowright: A \times X \rightarrow X$. We will write $x' = a \curvearrowright x$ to denote the new state of the system $x$. So the experiment $E$ is the data $E \equiv (X, A, \curvearrowright : A \times X \rightarrow X)$.
The existence of the action $\curvearrowright$ allows us to write the evolution of the system recursively:
$x_{t+1} = a \curvearrowright x_t$.
However, to understand the final state $x_{t+1}$, we need to essentially “run the recursion”, which does not permit us to understand the experiment.
What we really need is the ability to “unroll” the loop. To quote:
Informally, understanding an experiment $E$ means introducing coordinates into phase space of $E$ which are in triangular form under the action of the inputs of $E$.
We identify certain interesting invariants of a system by two criteria:
The parameter $Q(t)$ determines some obviously important aspects of the system. That is, there is a deterministic function $M(Q(t))$ which maps $Q(t)$ to “measure” some internal state of the system.
If the values of such a parameter $Q$ is known at time $t_0$ (denoted $Q(t_0)$) and it is also known what inputs are presented to the system from time $t$ to time $t + \epsilon$ (denoted $I[t_0, t_0 + \epsilon]$), then the new value of $Q$ is a deterministic function of $Q(t_0)$ and $I[t_0, t_0+ \epsilon]$.
Such parameters allow us to understand a system, since they are deterministic parameters of the evolution of the system, while also providing a way to measure some internal state of the system using $M$.
For example, consider a system $x$ with an energy function $e(x)$. If we perform an action $a$ on the system $x$, then we can predict the energy $e(x' = a \curvearrowright x)$ given just $e(x)$ and $a$, where $x' = a \curvearrowright x$ is the result of the action $a$ on $x$.
In general, conservation principles give a first coordinate of a triangularization. In the main, a large part of physics can be viewed as discovering and introducing functions $e$ of the states $q$ of the system such that under an action $a$, $e(a \curvearrowright q)$ depends only on $e(q)$ and $a$, and not on $q$.
A wreath product
We first heuristically indicate the construction involved in going from the group of symmetries to the triangularization, and then precisely write it out in all pedantic detail.
Let an experiment be $E \equiv (X, A, \curvearrowright)$. Then we say $\Pi$ is a symmetry of $E$ iff:
$\Pi: X \rightarrow X$ is a permutation of $X$.
$\Pi$ commutes with the action of each $a$: $\Pi(a \curvearrowright x) = a \curvearrowright \Pi(x)$.
We say that the experiment $E$ is transitive (in the action sense) if for all $x_1, x_2 \in X$ with $x_1 \neq x_2$, there exist $a_1, a_2, \dots, a_n$ such that $x_2 = a_n \curvearrowright (\dots (a_1 \curvearrowright x_1))$.
Facts of the symmetries of a system:
We know that the symmetries of a theory $E$ form a group.
If $E$ is transitive, then each symmetry $\Pi$ is a regular permutation: if there exists an $x_f$ such that $\Pi(x_f) = x_f$ (a fixed point), then $\Pi(x) = x$ for all $x$.
Let the action split $X$ into disjoint orbits $O_1, O_2, \dots O_k$, from which we choose representatives $x_1 \in O_1, x_2 \in O_2, \dots x_k \in O_k$. Then, if $E$ is transitive, there is exactly one action that sends a particular $x_i$ to a particular $x_j$. So, on fixing one component of an action, we fix all components.
To show that this gives rise to a triangularization, we first construct the semigroup of actions of the experiment: $S(E) \equiv \{ a_1 \cdots a_n : n \geq 1 \text{~and~} a_i \in A \}$.
Now, let $G = Sym(E)$, the full symmetry group of $E$. One can apparently express the symmetry group in terms of:
!! (X, S) \leq (G, G) \wr ({ O_1, O_2, \dots O_k}, T) !!
Given two monoids $(M, +, 0_M)$ and $(N, \times, 1_N)$, and a homomorphism $\phi: N \rightarrow End(M)$, where $End(M)$ is the endomorphism monoid of $M$, we will notate $\phi(n)(m)$ as $n \cdot m \in M$.
Now the semidirect product $M \ltimes_\phi N$ is the set $M \times N$ equipped with the multiplication rule:
This can also be written down as:
!!
\begin{bmatrix}
1 & 0 \\
m & n
\end{bmatrix}
\begin{bmatrix}
1 & 0 \\
m' & n'
\end{bmatrix} =
\begin{bmatrix}
1 & 0 \\
m + n \cdot m' & n \times n'
\end{bmatrix}
!!
This way of writing down semidirect products as matrices makes many things immediately clear:
The semidirect product is some kind of “shear” transform, since that’s what a shear transformation looks like, matrix-wise.
The resulting monoid $M \ltimes_{\phi} N$ has identity $(0_M, 1_N)$, since for the matrix to be identity, we need the 2nd row to be $(0, 1)$.
The inverse operation, if $(M, N)$ were groups, would have to be such that:
!! \begin{bmatrix} 1 & 0 \\ m + n \cdot m' & n \times n' \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} !!
Hence: $(m, n)^{-1} = (-n^{-1} \cdot m, n^{-1})$, which is indeed the right expression for the inverse.
n←3 ⋄ id ← n n ⍴(1,n⍴0) ⋄ id
This relies heavily on ⍴ replicating its arguments.
xs←(1 1 3 3 3 6) ⋄ n←(⌈/xs)⍴0 ⋄ n[xs]+←1 ⋄ n
The use of n[xs]+←1 will stably write +1 as many times as there are repeated indexes in xs.
We have at our hands a commutative ring $R$, and we wish to study the ideal structure on the ring. In particular, we can combine ideals in the following ways:
Is the third one really right? How do we prove that $\forall i_1, i_2 \in I, j_1, j_2 \in J, \exists i_3 \in I, j_3 \in J$ such that $i_1 j_1 + i_2 j_2 = i_3 j_3$?
Indeed, we can’t do so in general! For a quick counterexample, consider the ring $\mathbb Z[X, Y]$ and the ideals $I \equiv \langle X \rangle$, $J \equiv \langle Y \rangle$. Now, note that $XY + X^2Y^2$ cannot be written as the product of a power of $X$ and a power of $Y$.
So, the correct definition is to in fact generate an ideal from all elements of the form $ij$. So #3 should be: $I J \equiv \langle \{ i j : i \in I, j \in J \} \rangle$.
Let $I \equiv \langle 12 \rangle, J \equiv \langle 20 \rangle$.
Great. Now, one can conjecture the relation:
by the following chain of inference:
This is trivial, I’m surprised it took me this long to internalize this fact.
When we convert a poset $(X, \leq)$ into a category, we stipulate that $x \rightarrow y \iff x \leq y$.
If we now consider the category $Set$ of sets and functions between sets, an arrow $A \xrightarrow{f} B$ is a function from $A$ to $B$. If $f$ is monic, then we know that $A \cong Im(f) \leq B$. That is, a monic arrow behaves a lot like a poset arrow!
Similarly, an epic arrow behaves a lot like the arrow in the inverse poset. I wonder if quite a lot of category theoretic diagrams are clarified by thinking of monic and epic directly in terms of controlling sizes.
If we want to minimize a function $f(x)$ subject to the constraint $g(x) = c$, one uses the method of Lagrange multipliers. The idea is to consider a new function $L(x, \lambda) = f(x) + \lambda (c - g(x))$. Now, if one has a local optimum $(x^\star, \lambda^\star)$, then the conditions:
Equation (2) is sensible: we want our optima to satisfy the constraint that we had originally imposed. What is Equation (1) trying to say?
Let us say that we are at an $x_0$ which is a feasible point ($g(x_0) = c$). We are interested in wiggling $x_0 \xrightarrow{wiggle} x_0 + \epsilon \equiv x_1$, such that:
- $x_1$ is an improvement: $f(x_1) > f(x_0)$.
- $x_1$ stays feasible: $g(x_1) = c$, ie, $\epsilon$ moves along $levelset(g(x_0))$.
We need $f$ to increase in the direction of $\epsilon$. So, we need $\epsilon$ to have a positive component along $f'(x)$. That is, $f'(x) \cdot \epsilon > 0$.
Now, if $f'(x_0) \parallel g'(x_0)$ (parallel to), then we have $f'(x_0) \perp levelset(g(x_0))$ (perpendicular to).
Clearly, if $f'(x_0) \parallel g'(x_0)$, these two conditions contradict each other, and so we are at a local maximum.
On the other hand, if it is not true that $f'(x_0) \parallel g'(x_0)$, then there is some component of $f'$ to be gained by moving along $levelset(g(x_0))$.
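A toy numeric check of the parallel-gradient condition (my own example, not from the text): maximize $f(x,y) = xy$ subject to $g(x,y) = x + y = 10$; at the optimum $(5,5)$ the gradients must be scalar multiples of each other.

```python
# At the constrained optimum, grad f = lambda * grad g for some scalar lambda.
def grad(h, p, eps=1e-6):
    """Central-difference gradient of h at the point p."""
    out = []
    for i in range(len(p)):
        hi, lo = list(p), list(p)
        hi[i] += eps
        lo[i] -= eps
        out.append((h(*hi) - h(*lo)) / (2 * eps))
    return out

f = lambda x, y: x * y
g = lambda x, y: x + y

p_star = (5.0, 5.0)              # known optimum (by symmetry / AM-GM)
gf, gg = grad(f, p_star), grad(g, p_star)

lam = gf[0] / gg[0]              # the multiplier: here lambda = 5
assert all(abs(a - lam * b) < 1e-4 for a, b in zip(gf, gg))
```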
All material lifted straight from Aaron Hsu’s PhD thesis. I’ll be converting APL notation to C++-like notation. Source code link to my implementation is here
We’re interested in this tree:
       ∘
 ┌──┬──┴────┐
 a  b       c
 │ ┌┴┐  ┌───┼───┐
 p q r  s   t   u
    │   │   │
    │  ┌┴┐ ┌┴┐
    v  w x y z
I’ll be writing APL commands in front of a $ to mimic bash, and I’ll write some arrays as multiline. To run them, collapse them into a single line. The ast object is represented in memory as:
$ ast ← ('∘'
('a' ('p'))
('b'
('q' ('v'))
('r'))
('c'
('s' ('w' 'x'))
('t' ('y' 'z'))
('u')))
$ ]disp ast
┌→┬──┬────────┬───────────────────┐
│∘│ap│┌→┬──┬─┐│┌→┬──────┬──────┬─┐│
│ │ ││b│qv│r│││c│┌→┬──┐│┌→┬──┐│u││
│ │ │└─┴─→┴─┘││ ││s│wx│││t│yz││ ││
│ │ │ ││ │└─┴─→┘│└─┴─→┘│ ││
│ │ │ │└─┴─────→┴─────→┴─┘│
└─┴─→┴───────→┴──────────────────→┘
Here’s how to read the array representation. Look at the top level of the tree: we have a root node with three children:
       ∘
 ┌──┬──┴────┐
 a  b       c
┌→┬──┬────────┬─────────────┐
│∘│ │ │ │
│ │ a│ b │ c │
│ │ │ │ │
└─┴─→┴───────→┴────────────→┘
With the first ∘ being the root node, and the three adjacent cells being the children.
Next, we look at how a is represented. This is predictably recursive. Let’s see the subtree under a:
       ∘
 ┌──┬──┴────┐
 a  b       c
 │
 p
┌→┬──┬────────┬─────────────┐
│∘│ap│ │ │
│ │ │ b │ c │
│ │ │ │ │
└─┴─→┴───────→┴────────────→┘
Similarly for b:
       ∘
 ┌──┬──┴────┐
 a  b       c
 │ ┌┴┐
 p q r
┌→┬──┬────────┬─────────────┐
│∘│ap│┌→┬──┬─┐│ │
│ │ ││b│q │r││ c │
│ │ │└─┴─→┴─┘│ │
└─┴─→┴───────→┴────────────→┘
And so on, leading to the final representation:
       ∘
 ┌──┬──┴────┐
 a  b       c
 │ ┌┴┐  ┌───┼───┐
 p q r  s   t   u
    │   │   │
    │  ┌┴┐ ┌┴┐
    v  w x y z
┌→┬──┬────────┬───────────────────┐
│∘│ap│┌→┬──┬─┐│┌→┬──────┬──────┬─┐│
│ │ ││b│qv│r│││c│┌→┬──┐│┌→┬──┐│u││
│ │ │└─┴─→┴─┘││ ││s│wx│││t│yz││ ││
│ │ │ ││ │└─┴─→┘│└─┴─→┘│ ││
│ │ │ │└─┴─────→┴─────→┴─┘│
└─┴─→┴───────→┴──────────────────→┘
Note that for this representation to work, we need to be able to:
I don’t understand the memory layout of this, to be honest. I feel like representing this in memory would still rely on pointer-chasing, since we need to box all the arrays. This is possibly optimised by APL to not be too bad.
       ∘            0
 ┌──┬──┴────┐
 a  b       c       1
 │ ┌┴┐  ┌───┼───┐
 p q r  s   t   u   2
    │   │   │
    │  ┌┴┐ ┌┴┐
    v  w x y z      3
If we visit this tree and record depths in preorder (node, left, right), we arrive at the list:
(∘:0
(a:1 (p:2)) (b:1 (q:2 (v:3)) (r:2))
(c:1 (s:2 (w:3 x:3)) (t:2 (y:3 z:3)) (u:2)))
formatted as:
(∘:0
(a:1
(p:2))
(b:1
(q:2 (v:3))
(r:2)
)
(c:1 (s:2 (w:3 x:3))
(t:2 (y:3 z:3))
(u:2))
)
Linearized, this is the list:
(∘ a p b q v r c s w x t y z u)
d ← (0 1 2 1 2 3 2 1 2 3 3 2 3 3 2)
       ∘            0
 ┌──┬──┴────┐
 a  b       c       1
 │ ┌┴┐  ┌───┼───┐
 p q r  s   t   u   2
    │   │   │
    │  ┌┴┐ ┌┴┐
    v  w x y z      3
To convert the ast object into a depth vector representation, we can use the following call:
$ ast ← ('∘' ('a' ('p')) ('b' ('q' ('v')) ('r')) ('c' ('s' ('w' 'x')) ('t' ('y' 'z')) ('u')))
$ d ← ∊0{(⊢,(⍺+1)∇⊣)/⌽⍺,1↓⍵}ast
0 1 2 1 2 3 2 1 2 3 3 2 3 3 2
Let’s break this down:
TODO
We represent data associated with our nodes as follows:
$ data ← ⍪ ¨d(15⍴'T')(↑15⍴⊂'n.')
$ ]disp data
┌→┬─┬──┐
│0│T│n.│
│1│T│n.│
│2│T│n.│
│1│T│n.│
│2│T│n.│
│3│T│n.│
│2│T│n.│
│1│T│n.│
│2│T│n.│
│3│T│n.│
│3│T│n.│
│2│T│n.│
│3│T│n.│
│3│T│n.│
│2↓T↓n.↓
└→┴→┴─→┘
This is the same thing as a structure-of-arrays (SOA) representation, where each column of information (eg, the depths at data[1], the T information at data[2]) is an array that can be accessed efficiently with SIMD instructions.
TODO
We want information of how to go up and down the tree in ideally constant time. We store this information in what is known as a path matrix.
For our recurring example, the path matrix is:
∘ a p b q v r c s w x t y z u   ⍝ preorder traversal
─────────────────────────────
∘ ∘ ∘ ∘ ∘ ∘ ∘ ∘ ∘ ∘ ∘ ∘ ∘ ∘ ∘   ⍝ depth=0
  a a b b b b c c c c c c c c   ⍝ depth=1
    p   q q r   s s s t t t u   ⍝ depth=2
          v       w x   y z     ⍝ depth=3
       ∘            0
 ┌──┬──┴────┐
 a  b       c       1
 │ ┌┴┐  ┌───┼───┐
 p q r  s   t   u   2
    │   │   │
    │  ┌┴┐ ┌┴┐
    v  w x y z      3
To efficiently compute this, we first replace every value in our tree with its preorder traversal visit time. This changes the tree to:
∘:0 0
┌──────────┬──┴─────────────────┐
a:1 b:3 c:7 1
│ ┌───┴───┐ ┌──────────┼───────┐
p:2 q:4 r:6 s:8 t:11 u:14 2
│ │ │
│ ┌──┴──┐ ┌─┴───┐
v:5 w:9 x:10 y:12 z:13 3
The values we store in the tree are the integers. The old labels are represented for clarity.
The path matrix for this tree is:
  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14   ⍝ preorder traversal
─────────────────────────────────────────────
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0   ⍝ depth=0
     1  1  3  3  3  3  7  7  7  7  7  7  7  7   ⍝ depth=1
        2     4  4  6     8  8  8 11 11 11 14   ⍝ depth=2
              5              9 10    12 13      ⍝ depth=3
∘:0 0
┌──────────┬──┴─────────────────┐
a:1 b:3 c:7 1
│ ┌───┴───┐ ┌──────────┼───────┐
p:2 q:4 r:6 s:8 t:11 u:14 2
│ │ │
│ ┌──┴──┐ ┌─┴───┐
v:5 w:9 x:10 y:12 z:13 3
We use the incantation:
$ d ← (0 1 2 1 2 3 2 1 2 3 3 2 3 3 2)
$ ((⍳≢d)@(d,¨⍳≢d)) ((⌈/d) (≢d))⍴''
  0
     1     3           7
        2     4     6     8       11       14
                 5           9 10    12 13
Let’s break this down (the symbol ⍝ is a lamp, used for commenting/illumination):
$ ⍳ 3 ⍝ iota: make a list of n elements
1 2 3
$ d
0 1 2 1 2 3 2 1 2 3 3 2 3 3 2
$ ≢d ⍝ tally: count no. of elements in d
15
$ ⍳≢d ⍝ list of elements of len (no. of elements in d)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
$ ]disp (1 2 3),(4 5 6) ⍝ ,:concatenate
┌→────┬─────┐
│1 2 3│4 5 6│
└~───→┴~───→┘
]disp (1 2 3) ,¨ (4 5 6)
┌→──┬───┬───┐
│1 4│2 5│3 6│
└~─→┴~─→┴~─→┘
The use of ¨ needs some explanation. ¨ is a higher order function which takes a function and makes it a mapped version of the original function. So, ,¨ is a function which attempts to map the concatenation operator. Now, given two arrays (1 2 3) and (4 5 6), the expression (1 2 3) ,¨ (4 5 6) attempts to run , on each pair: 1 and 4, 2 and 5, 3 and 6. This gives us the tuples ((1 4) (2 5) (3 6)).
So, for our purposes, zip ← ,¨.
]disp (d,¨⍳≢d) ⍝ zip d with [1..len d].
┌→──┬───┬───┬───┬───┬───┬───┬───┬───┬───┬────┬────┬────┬────┬────┐
│0 0│1 1│2 2│1 3│2 4│3 5│2 6│1 7│2 8│3 9│3 10│2 11│3 12│3 13│2 14│
└~─→┴~─→┴~─→┴~─→┴~─→┴~─→┴~─→┴~─→┴~─→┴~─→┴~──→┴~──→┴~──→┴~──→┴~──→┘
$ ((⌈/d) (≢d))⍴'' ⍝ array of dim (max val in d) x (no. of elem in d)




⌈ is the maximum operator and / is the fold operator, so ⌈/d finds the maximum of d. Recall that (≢d) finds the no. of elements in d. ⍴ reshapes an array to the desired size. We pass it the empty character vector '', which gets reshaped into a (⌈/d) x (≢d) sized array of blank characters.
TODO: explain @ and its use
$ ⎕IO ← 0 ⍝ (inform APL that we wish to use 0-indexing.)
$ d ← (0 1 2 1 2 3 2 1 2 3 3 2 3 3 2)
$ PM ← ⌈\((⍳≢d)@(d,¨⍳≢d))(((⌈/d+1)(≢d))⍴0)
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 1 1 3 3 3 3 7 7 7 7 7 7 7 7
0 0 2 2 4 4 6 6 8 8 8 11 11 11 14
0 0 0 0 0 5 5 5 5 9 10 10 12 13 13
0 0
┌──┬──┴───────┐
1 3 7 1
│ ┌┴┐ ┌──────┼───┐
2 4 6 8 11 14 2
│ │ 
│ ┌┴─┐ ┌┴──┐
5 9 10 12 13 3
The incantation can be broken down into:
(((⌈/d+1)(≢d))⍴0)
is used to create a max(d+1)xd
dimension array of zeros.
Here, the rows define depths, and the columns correspond to tree nodes
which for us are their preorder indexes.$ grid←(⌈/d+1) (≢d) ⍴ 0
$ grid
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
((d ,¨ ⍳≢d)) creates an array of pairs (depth, preindex). We will use this to fill the index (d, pi) with the value pi.
$ writeixs ← (d,¨⍳≢d)
$ ]disp writeixs
┌→──┬───┬───┬───┬───┬───┬───┬───┬───┬───┬────┬────┬────┬────┬────┐
│0 0│1 1│2 2│1 3│2 4│3 5│2 6│1 7│2 8│3 9│3 10│2 11│3 12│3 13│2 14│
└~─→┴~─→┴~─→┴~─→┴~─→┴~─→┴~─→┴~─→┴~─→┴~─→┴~──→┴~──→┴~──→┴~──→┴~──→┘
ixgrid ← ((⍳≢d)@writeixs) grid writes at each index writeixs[i] the value (⍳≢d)[i].
$ ixgrid ← ((⍳≢d)@writeixs) grid
$ ixgrid
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 1 0 3 0 0 0 7 0 0 0 0 0 0 0
0 0 2 0 4 0 6 0 8 0 0 11 0 0 14
0 0 0 0 0 5 0 0 0 9 10 0 12 13 0
⌈ is the maximum operator, and \ is the prefix scan operator, so ⌈\ixgrid creates a row-wise max-scan of the above grid to give us our final path matrix:
$ PM ← ⌈\ixgrid
$ PM
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 1 1 3 3 3 3 7 7 7 7 7 7 7 7
0 0 2 2 4 4 6 6 8 8 8 11 11 11 14
0 0 0 0 0 5 5 5 5 9 10 10 12 13 13
Note that the maximum distance between two nodes is to climb all the way to the top node, and then climb down:
dmax ← depth(a) + depth(b)
If we know the lowest common ancestor of two nodes, then the distance of one node to another is:
dcorrect ← dist(a, lca(a, b)) + dist(b, lca(a, b))
So, we can compute the distance as:
dcorrect ← dist(a, lca(a, b)) + dist(lca(a, b), b)
         = (depth(a) - depth(lca(a, b))) +
           (depth(b) - depth(lca(a, b)))
         = depth(a) + depth(b) -
           2 * depth(lca(a, b))
[TODO: picture] [TODO: finish writing this]
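A small sanity check of the distance formula on the running example, walking the parent vector from the text (the helper functions are mine):

```python
# depths and parent indices of the running example tree
d = [0, 1, 2, 1, 2, 3, 2, 1, 2, 3, 3, 2, 3, 3, 2]
P = [0, 0, 1, 0, 3, 4, 3, 0, 7, 8, 8, 7, 11, 11, 7]

def ancestors(i):
    """Chain of ancestors of i, up to the self-parenting root."""
    chain = [i]
    while P[chain[-1]] != chain[-1]:
        chain.append(P[chain[-1]])
    return chain

def lca(a, b):
    anc = set(ancestors(a))
    return next(x for x in ancestors(b) if x in anc)

# v (node 5, depth 3) and r (node 6, depth 2) meet at b (node 3, depth 1):
# dist = depth(a) + depth(b) - 2*depth(lca) = 3 + 2 - 2*1 = 3.
assert lca(5, 6) == 3
assert d[5] + d[6] - 2 * d[lca(5, 6)] == 3
```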
A parent vector is a vector of length n where Parent[i] denotes an index into Parent. Hence, the following condition will return 1 if V is a parent vector.
For example, for our given example, here is the parent vector:
d ← (0 1 2 1 2 3 2 1 2 3 3 2 3 3 2)       ⍝ depths
    (∘ a p b q v r c s w x t y z u)       ⍝ values
p ← (∘ ∘ a ∘ b q b ∘ c s s c t t c)       ⍝ parents
    (0 1 2 3 4 5 6 7 8 9 10 11 12 13 14)  ⍝ indexes
P ← (0 0 1 0 3 4 3 0 7 8 8 7 11 11 7)     ⍝ parent indices
∘:0 0
┌──────────┬──┴─────────────────┐
a:1 b:3 c:7 1
│ ┌───┴───┐ ┌──────────┼───────┐
p:2 q:4 r:6 s:8 t:11 u:14 2
│ │ │
│ ┌──┴──┐ ┌─┴───┐
v:5 w:9 x:10 y:12 z:13 3
The condition a parent vector must satisfy is:
∧/V ∊(⍳≢V) ⍝ [All elements of V belong in the list [1..len(V)] ]
V ∊ (⍳≢V) will be a list of booleans saying whether each element of V belongs (∊) to the list (⍳≢V) = [1..len(V)]. Recall that / is for reduction, and ∧/ is a boolean AND reduction. Hence, we compute whether every element of the vector V is in the range [1..len(V)].
We add the constraint that root nodes that don’t have a parent simply point to themselves. This frees us from requiring some kind of nullptr check.
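A Python transcription (mine) of the validity condition, together with the root-as-fixpoint convention:

```python
# ∧/V∊(⍳≢V): every entry must be a valid index back into the vector itself.
def is_parent_vector(P):
    return all(0 <= p < len(P) for p in P)

P = [0, 0, 1, 0, 3, 4, 3, 0, 7, 8, 8, 7, 11, 11, 7]  # from the running example
assert is_parent_vector(P)

# Roots are exactly the fixpoints, since they point to themselves.
roots = [i for i, p in enumerate(P) if p == i]
assert roots == [0]
```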
The root node (ancestor of all elements) can be found by iterating indexing to a fixpoint with the power operator (⍣≡), using commute (⍨) to apply it to p:
I←{(⊂⍵)⌷⍺} ⍝ index into the left hand side param using right hand side param
I⍣≡⍨p ⍝ compute the fixpoint of the I operator using ⍨ and apply it to p
As usual, let’s consider our example:
d ← (0 1 2 1 2 3 2 1 2 3 3 2 3 3 2)       ⍝ depths
    (∘ a p b q v r c s w x t y z u)       ⍝ values
p ← (∘ ∘ a ∘ b q b ∘ c s s c t t c)       ⍝ parents
    (0 1 2 3 4 5 6 7 8 9 10 11 12 13 14)  ⍝ indexes
P ← (0 0 1 0 3 4 3 0 7 8 8 7 11 11 7)     ⍝ parent indices
∘:0 0
┌──────────┬──┴─────────────────┐
a:1 b:3 c:7 1
│ ┌───┴───┐ ┌──────────┼───────┐
p:2 q:4 r:6 s:8 t:11 u:14 2
│ │ │
│ ┌──┴──┐ ┌─┴───┐
v:5 w:9 x:10 y:12 z:13 3
Note that the depth vector already encodes parent-child information: the parent of node i is a node j such that d[j] = d[i] - 1, where j is the closest such index to the left of i.
For example, to compute the parent of t:11, notice that it’s at depth 2. So we should find all the nodes among d[0..10] which have depth equal to d[11] - 1 = 1, and then pick the rightmost one. This translates to the expression:
$ d ← (0 1 2 1 2 3 2 1 2 3 3 2 3 3 2)
$ t ← 11 ⍝ target node
$ ixs ← ⍳t ⍝ array indexes upto this node
0 1 2 3 4 5 6 7 8 9 10
$ d[ixs] ⍝ depths of nodes to the left of the given node t
0 1 2 1 2 3 2 1 2 3 3
$ d[ixs] = d[t]-1 ⍝ boolean vector of nodes whose depth is that of t's parent
0 1 0 1 0 0 0 1 0 0 0
$ eqds ← ⍸ (d[ixs] = d[t]-1) ⍝ array indexes of nodes whose depth is that of t's parent
1 3 7
$ ⌽ eqds ⍝ reverse of array indexes to extract `7`
7 3 1
$ ⊃ ⌽ eqds ⍝ first of the reverse of the array indexes to extract `7`
7
$ (⌽⍸(d[⍳t] = d[t]-1))[0] ⍝ APL style one-liner of the above
While this is intuitive, it does not scale: it does not permit us to find the parents of all the nodes at once — ie, it is not parallelisable over choices of t.
Imagine we have a list of 0s and 1s, and we want to find the index of the rightmost 1 value. For example, given:
0 1 2 3 4 5 6 7 8 9 10 11 12
$ a ← (0 0 1 0 0 0 1 0 1 0 0 0 0)
we want the answer to be f a = 8. We saw an implementation in terms of f←{(⌽⍸⍵)[0]} in Take 1. (Recall that ⍵ is the symbol for the right-hand-side argument of a function.)
We’re going to perform the same operation slightly differently. Let’s consider the series of transformations:
$ ⌽a ⍝ reverse of a
0 0 0 0 1 0 1 0 0 0 1 0 0
$ ∨\ ⌽a ⍝ prefix scan(\) using the OR(∨) operator. Turn all
⍝ entries after the first 1 into a 1
0 0 0 0 1 1 1 1 1 1 1 1 1
$ +/ (∨\ ⌽a) ⍝ sum over the previous list, counting number of 1s
9
$ ¯1 + (+/ (∨\ ⌽a)) ⍝ subtract 1 from the previous number
8
Why the hell does this work? Well, here’s the proof:
- In ⌽a, we reverse a. If the last 1 of a is at (1-based) position $i$, it becomes the first 1 of ⌽a at position $i' \equiv n-i+1$.
- In ∨\⌽a, every entry from the first 1 onwards becomes a 1. That is, all positions $j \geq i'$ hold a 1.
- In +/(∨\⌽a), we count those 1s: there are $n-i'+1 = n-(n-i+1)+1 = i$ of them.
- Finally, ¯1 + converts the 1-based position $i$ into the 0-based index $i-1$, which is our answer.

This technique works for every row of a matrix at once. This is paramount, since we can now build one such row per node from the depth vector, and thereby compute all the parent indices in one go!
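Transcribed to Python (my sketch; itertools.accumulate plays the role of the APL scan ∨\):

```python
from itertools import accumulate

def rightmost_one(a):
    """Index of the rightmost 1, via reverse / or-scan / sum / decrement."""
    rev = a[::-1]                                        # ⌽a
    scanned = accumulate(rev, lambda x, y: x | y)        # ∨\⌽a
    return sum(scanned) - 1                              # ¯1 + +/

a = [0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0]
assert rightmost_one(a) == 8
```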
We want to extend the previous method we hit upon to compute the parents of all nodes in parallel. To perform this, we need to run the moral equivalent of the following:
$ ⎕IO ← 0 ⍝ 0 indexing
$ d ← (0 1 2 1 2 3 2 1 2 3 3 2 3 3 2) ⍝ depth vector
$ t ← 11 ⍝ node we are interested in
$ a←d[⍳t]=d[t]-1 ⍝ boolean vector of nodes whose depth is that of t's parent
0 1 0 1 0 0 0 1 0 0 0
$ ¯1 + (+/ (∨\ ⌽a)) ⍝ index of the last 1 of the boolean vector
7
for every single choice of t. To perform this, we can build a 2D matrix of d[⍳t]=d[t]-1 where t ranges over [0..len(d)-1] (ie, it ranges over all the nodes in the graph).
We begin by using:
$ ⎕IO ← 0 ⋄ d ← (0 1 2 1 2 3 2 1 2 3 3 2 3 3 2) ⍝ depths
$ ]display ltdepth ← d ∘.> d ⍝ find `d[i] > d[j]` for all i, j.
┌→────────────────────────────┐
↓0 0 0 0 0 0 0 0 0 0 0 0 0 0 0│
│1 0 0 0 0 0 0 0 0 0 0 0 0 0 0│
│1 1 0 1 0 0 0 1 0 0 0 0 0 0 0│
│1 0 0 0 0 0 0 0 0 0 0 0 0 0 0│
│1 1 0 1 0 0 0 1 0 0 0 0 0 0 0│
│1 1 1 1 1 0 1 1 1 0 0 1 0 0 1│
│1 1 0 1 0 0 0 1 0 0 0 0 0 0 0│
│1 0 0 0 0 0 0 0 0 0 0 0 0 0 0│
│1 1 0 1 0 0 0 1 0 0 0 0 0 0 0│
│1 1 1 1 1 0 1 1 1 0 0 1 0 0 1│
│1 1 1 1 1 0 1 1 1 0 0 1 0 0 1│
│1 1 0 1 0 0 0 1 0 0 0 0 0 0 0│
│1 1 1 1 1 0 1 1 1 0 0 1 0 0 1│
│1 1 1 1 1 0 1 1 1 0 0 1 0 0 1│
│1 1 0 1 0 0 0 1 0 0 0 0 0 0 0│
└~────────────────────────────┘
Note that ltdepth[i][j] = 1 iff d[j] < d[i]. So, for a given row (i fixed), the 1s mark nodes that are at a lower depth (ie, potential ancestors).
If we mask this to only keep those indices where j < i, then the last 1 in each row will be at a j such that d[j] = d[i] - 1. Why? Because the node closest to us on the left with a depth less than ours must be our parent, in the preorder traversal.
$ ⎕IO ← 0 ⋄ d ← (0 1 2 1 2 3 2 1 2 3 3 2 3 3 2) ⍝ depths
$ ]display left ← (⍳3) ∘.> (⍳3) ⍝ find `i > j` for all i, j.
┌→────┐
↓0 0 0│
│1 0 0│
│1 1 0│
└~────┘
Combining the three techniques, we can arrive at:
$ ⎕IO ← 0 ⋄ d ← (0 1 2 1 2 3 2 1 2 3 3 2 3 3 2) ⍝ depths
$ ltdepth ← d ∘.> d ⍝ find `d[i] > d[j]` for all i, j.
$ preds ← (⍳≢d) ∘.> (⍳≢d) ⍝ predecessors: find `i > j` for all i, j.
$ pred_higher ← ltdepth ∧ preds ⍝ predecessors that are higher in the tree
$ parents_take_3 ← ¯1 + +/∨\⌽pred_higher ⍝ previous idiom for finding last 1.
¯1 0 1 0 3 4 3 0 7 8 8 7 11 11 7
For comparison, the actual value is:
     (0 1 2 3 4 5 6 7 8 9 10 11 12 13 14)  ⍝ indexes
d ←  (0 1 2 1 2 3 2 1 2 3 3 2 3 3 2)       ⍝ depths
P ←  (0 0 1 0 3 4 3 0 7 8 8 7 11 11 7)     ⍝ parent indices
     (¯1 0 1 0 3 4 3 0 7 8 8 7 11 11 7)    ⍝ parents, take 3
∘:0 0
┌──────────┬──┴─────────────────┐
a:1 b:3 c:7 1
│ ┌───┴───┐ ┌──────────┼───────┐
p:2 q:4 r:6 s:8 t:11 u:14 2
│ │ │
│ ┌──┴──┐ ┌─┴───┐
v:5 w:9 x:10 y:12 z:13 3
We have an off-by-one error for the 0 node! That’s easily fixed: we simply perform a maximum with 0 to move ¯1 → 0:
$ parents_take_3 ← 0⌈ ¯1 + +/∨\⌽pred_higher
0 0 1 0 3 4 3 0 7 8 8 7 11 11 7
So, that’s our function:
parents_take_3 ← 0⌈ ¯1 + +/∨\⌽ ((d∘.>d) ∧ (⍳≢d)∘.>(⍳≢d))
0 0 1 0 3 4 3 0 7 8 8 7 11 11 7
Note that the time complexity for this is dominated by having to calculate the outer products, which even given infinite parallelism, take $O(n)$ time. We will slowly chip away at this, to be far better.
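Take 3 can be transcribed into Python almost mechanically (my quadratic sketch of the same idea: boolean outer products, then the rightmost-1 idiom per row):

```python
d = [0, 1, 2, 1, 2, 3, 2, 1, 2, 3, 3, 2, 3, 3, 2]
n = len(d)
parents = []
for i in range(n):
    # 1s at predecessors j < i that sit higher in the tree (d[j] < d[i]):
    # this is one row of (d∘.>d) ∧ (⍳≢d)∘.>(⍳≢d).
    row = [1 if (d[j] < d[i] and j < i) else 0 for j in range(n)]
    last1 = max((j for j in range(n) if row[j]), default=-1)
    parents.append(max(0, last1))   # 0⌈ fixes the root's ¯1

assert parents == [0, 0, 1, 0, 3, 4, 3, 0, 7, 8, 8, 7, 11, 11, 7]
```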
We will use the Key (⌸) operator, which allows us to create key-value pairs.
$ d ← 0 1 2 1 2 3 2 1 2 3 3 2 3 3 2
$ ]disp (⍳≢d) ,¨ d ⍝ zip d with indexes
┌→──┬───┬───┬───┬───┬───┬───┬───┬───┬───┬────┬────┬────┬────┬────┐
│0 0│1 1│2 2│3 1│4 2│5 3│6 2│7 1│8 2│9 3│10 3│11 2│12 3│13 3│14 2│
└~─→┴~─→┴~─→┴~─→┴~─→┴~─→┴~─→┴~─→┴~─→┴~─→┴~──→┴~──→┴~──→┴~──→┴~──→┘
$ d ← 0 1 2 1 2 3 2 1 2 3 3 2 3 3 2
$ ]display b ← {⍺ ⍵}⌸d ⍝ each row i has tuple (i, js): d[js] = i
┌→──────────────────┐
↓ ┌→┐ │
│ 0 │0│ │
│ └~┘ │
│ ┌→────┐ │
│ 1 │1 3 7│ │
│ └~────┘ │
│ ┌→────────────┐ │
│ 2 │2 4 6 8 11 14│ │
│ └~────────────┘ │
│ ┌→───────────┐ │
│ 3 │5 9 10 12 13│ │
│ └~───────────┘ │
└∊──────────────────┘
In fact, it allows us to apply an arbitrary function to combine keys and values. We will use a function that simply returns all the values for each key.
$ d ← 0 1 2 1 2 3 2 1 2 3 3 2 3 3 2
$ ]display b ← {⍵}⌸d ⍝ each row i contains values j such that d[j] = i.
┌→──────────────┐
↓0 0 0 0 0 0│
│1 3 7 0 0 0│
│2 4 6 8 11 14│
│5 9 10 12 13 0│
└~──────────────┘
Our first try doesn’t quite work: it winds up trying to create a numeric matrix, which means that we can’t have rows of different sizes (they get padded with 0s). So, the information that only index 0 is such that d[0] = 0 is lost. What we can do is to box each row, to arrive at:
$ d ← 0 1 2 1 2 3 2 1 2 3 3 2 3 3 2
$ ]display b ← {⊂⍵}⌸d ⍝ d[b[i]] = i
┌→───────────────────────────────────────────┐
│ ┌→┐ ┌→────┐ ┌→────────────┐ ┌→───────────┐ │
│ │0│ │1 3 7│ │2 4 6 8 11 14│ │5 9 10 12 13│ │
│ └~┘ └~────┘ └~────────────┘ └~───────────┘ │
└∊───────────────────────────────────────────┘
Consider the groups b[2] = (2 4 6 8 11 14) and b[3] = (5 9 10 12 13). All of the depth-3 nodes’ parents are present in b[2], and every element of b[3] fits at some location in b[2]. Here is what the fit would look like:
b[2]   2  4  _  6  8  _  _ 11  _  _ 14   ⍝ nodes of depth 2
b[3]         5        9 10     12 13     ⍝ leaf nodes
             4        8  8     11 11     ⍝ parents: predecessor of b[3] in b[2]
∘:0 0
┌──────────┬──┴─────────────────┐
a:1 b:3 c:7 1
│ ┌───┴───┐ ┌──────────┼───────┐
p:2 q:4 r:6 s:8 t:11 u:14 2
│ │ │
│ ┌──┴──┐ ┌─┴───┐
v:5 w:9 x:10 y:12 z:13 3
We use the Interval Index (⍸) operator to solve the problem of finding the parent, ie where we should squeeze a node from b[3] into b[2]. (This is formally known as the predecessor problem.)
⍝ left[a[i]] is the closest number ≤ right[i]
⍝ ie, left[a[i]] is the predecessor of right[i] in left.
$ a ← (1 10 100 1000) ⍸ (1 2000 300 50 2 )
0 3 2 1 0
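In Python terms (my transcription), ⍸ here is the standard predecessor query, ie a bisect:

```python
from bisect import bisect_right

def predecessor_index(sorted_xs, q):
    """Index of the closest element <= q in a sorted list (APL's ⍸, 0-indexed)."""
    return bisect_right(sorted_xs, q) - 1

left = [1, 10, 100, 1000]
assert [predecessor_index(left, q) for q in [1, 2000, 300, 50, 2]] == [0, 3, 2, 1, 0]
```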
Now, we can use the technology of predecessor to find parents of depth 3 nodes among the depth 2 nodes:
$ depth2 ← 2 4 6 8 11 14
$ depth3 ← 5 9 10 12 13 ⍝ parents (from chart): 4 8 8 11 11
$ depth3parentixs ← depth2 ⍸ depth3
$ depth3parents ← depth2[depth3parentixs]
4 8 8 11 11
∘:0 0
┌──────────┬──┴─────────────────┐
a:1 b:3 c:7 1
│ ┌───┴───┐ ┌──────────┼───────┐
p:2 q:4 r:6 s:8 t:11 u:14 2
│ │ │
│ ┌──┴──┐ ┌─┴───┐
v:5 w:9 x:10 y:12 z:13 3
We need to know one more APLism: the 2-reduce (pairwise windowed reduction). When we write a usual reduce operation, we have:
$ ⍳5
1 2 3 4 5
$ +/⍳5 ⍝ reduce
15
$ 2+/⍳5 ⍝ apply + to _pairs_ (2 = pairs)
3 5 7 9 ⍝ (1+2) (2+3) (3+4) (4+5)
$ 3+/⍳5 ⍝ apply + to 3tuples
6 9 12 ⍝ (1+2+3) (2+3+4) (3+4+5)
We begin by assuming the parent of i is i, by using p←⍳≢d.
$ d ← (0 1 2 1 2 3 2 1 2 3 3 2 3 3 2)
$ d2nodes ← {⊂⍵}⌸d
┌→┬─────┬─────────────┬─────────────┐
│1│2 4 8│3 5 7 9 12 15│6 10 11 13 14│
└→┴~───→┴~───────────→┴~───────────→┘
$ p←⍳≢d
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Now comes the biggie:
$ findparent ← {parentixs ← ⍺⍸⍵ ⋄ p[⍵]←⍺[parentixs]}
- ⍺ is the list of candidate parent nodes.
- ⍵ is the list of current child nodes.
- We find where each child fits among the parents using the pix ← parent ⍸ child idiom.
- We then index the parents with pix and write them in: p[children] ← parent[parent ⍸ child].
This finally culminates in:
$ d←0 1 2 1 2 3 2 1 2 3 3 2 3 3 2
$ p←⍳≢d ⋄ d2nodes←{⊂⍵}⌸d ⋄ findp←{pix ← ⍺⍸⍵ ⋄ p[⍵]←⍺[pix]} ⋄ 2findp/d2nodes ⋄ p
0 0 1 0 3 4 3 0 7 8 8 7 11 11 7
    (0 1 2 3 4 5 6 7 8 9 10 11 12 13 14)  ⍝ indexes
d ← (0 1 2 1 2 3 2 1 2 3 3 2 3 3 2)       ⍝ depths
P ← (0 0 1 0 3 4 3 0 7 8 8 7 11 11 7)     ⍝ parent indices
∘:0 0
┌──────────┬──┴─────────────────┐
a:1 b:3 c:7 1
│ ┌───┴───┐ ┌──────────┼───────┐
p:2 q:4 r:6 s:8 t:11 u:14 2
│ │ │
│ ┌──┴──┐ ┌─┴───┐
v:5 w:9 x:10 y:12 z:13 3
Which can be further golfed to:
$ p⊣2{p[⍵]←⍺[⍺⍸⍵]}⌿⊢∘⊂⌸d⊣p←⍳≢d
0 0 1 0 3 4 3 0 7 8 8 7 11 11 7
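The whole pipeline can be transcribed into Python (my sketch of the same idea, not the thesis code): group nodes by depth, then for each consecutive pair of levels find every child's parent as its predecessor in the level above.

```python
from bisect import bisect_right

def parents_from_depths(d):
    p = list(range(len(d)))                       # p ← ⍳≢d
    levels = [[] for _ in range(max(d) + 1)]
    for i, depth in enumerate(d):                 # d2nodes ← {⊂⍵}⌸d
        levels[depth].append(i)
    for above, below in zip(levels, levels[1:]):  # 2 findp / d2nodes
        for child in below:
            # the parent is the predecessor of the child in the level above
            p[child] = above[bisect_right(above, child) - 1]
    return p

d = [0, 1, 2, 1, 2, 3, 2, 1, 2, 3, 3, 2, 3, 3, 2]
assert parents_from_depths(d) == [0, 0, 1, 0, 3, 4, 3, 0, 7, 8, 8, 7, 11, 11, 7]
```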
The total time complexity of this method assuming infinite parallelism is as follows:
$ p←⍳≢d ⋄ d2nodes←{⊂⍵}⌸d ⋄ findp←{pix ← ⍺⍸⍵ ⋄ p[⍵]←⍺[pix]} ⋄ 2findp/d2nodes ⋄ p
- (p←⍳≢d) can be filled in O(1) time.
- (d2nodes←{⊂⍵}⌸d) is searching for keys in a small integer domain, so this is O(#nodes) using radix sort as far as I know. However, the thesis mentions that this can be done in O(log(#nodes)). I’m not sure how, I need to learn this.
- In findp, the call (pix ← ⍺⍸⍵) can be implemented using binary search, leading to a logarithmic complexity in the size of ⍺ (since we are looking up predecessors of ⍵ in ⍺).
- 2findp/d2nodes can be done entirely in parallel, since all the writes into the p vector are independent: we only write the parent of the current node we are looking at.

I’m going to simplify the original presentation by quite a bit.
    (a b c d e f g h i)  ⍝ names
    (0 1 2 3 4 5 6 7 8)  ⍝ indexes
P ← (0 0 1 2 0 4 5 6 7)  ⍝ parents
X ← (0 1 0 0 1 1 0 0 0)  ⍝ marked nodes
       a:0
   ┌────┴────┐
 b:1(X)    e:4(X)
   │         │
  c:2      f:5(X)
   │         │
  d:3       g:6
             │
            h:7
             │
            i:8
We want to find, for each node, the node marked X that is its closest ancestor. The X vector is a boolean vector that has a 1 at the index of each X node (b, e, f). So, the indexes (1, 4, 5) are 1 in the X vector.
The output we want is the vector:
     (0 1 2 3 4 5 6 7 8)  ⍝ indexes
     (a b c d e f g h i)  ⍝ names
PX ← (0 0 1 1 0 4 5 5 5)  ⍝ closest X parent index
     (a a b b a e f f f)  ⍝ closest X parent name
       a:0
   ┌────┴────┐
 b:1(X)    e:4(X)
   │         │
  c:2      f:5(X)
   │         │
  d:3       g:6
             │
            h:7
             │
            i:8
The incantation is:
$ I←{(⊂⍵)⌷⍺} ⍝ index LHS by RHS  (100 101 102 103)[(3 1 2)] := 103 101 102
$ PX ← P I@{X[⍵]≠1} ⍣ ≡ P
0 0 1 1 0 4 5 5 5
TODO. At any rate, since this performs only reads and no writes, and needs no synchronization, it is fairly straightforward to implement on the GPU.
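My reading of the incantation, transcribed to Python: iterate to a fixpoint, replacing each entry that currently names an unmarked node with that node's parent.

```python
def closest_marked_ancestor(P, X):
    """Fixpoint iteration of P I@{X[⍵]≠1} ⍣ ≡ P (my transcription)."""
    V = list(P)
    while True:
        # wherever the current value is an unmarked node, step up to its parent
        W = [P[v] if X[v] != 1 else v for v in V]
        if W == V:
            return V
        V = W

P = [0, 0, 1, 2, 0, 4, 5, 6, 7]
X = [0, 1, 0, 0, 1, 1, 0, 0, 0]
assert closest_marked_ancestor(P, X) == [0, 0, 1, 1, 0, 4, 5, 5, 5]
```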
Once we have marked our X
nodes, we now wish to lift entire subtrees of X
up to the root.
There is some good advice in the thesis:
When using APL primitives this way, it may be good to map their names and definitions to the domain of trees. For example, the primitive ⍸Predicate is read as “the nodes where Predicate holds” and not as “the indexes where Predicate is 1”.
For example, given the tree:
     (0 1 2 3 4 5)  ⍝ indexes
     (a b c d e f)  ⍝ names
P  ← (0 0 1 0 3 4)  ⍝ parents
X  ← (0 1 0 1 1 0)  ⍝ X nodes
PX ← (0 0 1 0 3 4)  ⍝ closest X parent index
       a:0
   ┌────┴────┐
 b:1(X)    d:3(X)
   │         │
  c:2      e:4(X)
             │
            f:5
we want the transformed tree to be:
a:0
┌────┴───┐
bp:1(X) ep:4(X)

b:1(X)

c:2

e:4

fp:5

f:5(X)

g:6
We first look for nodes that need to be lifted. These are the nodes that are not the root (ie, p≠(⍳≢p)) and that are marked as X.
nodes←⍸(X ∧ p≠(⍳≢p)) ⍝ ⍸:pick indexes.
For pasting multiline code, there is a bug in the bug tracker for RIDE. For multiline dfns, one can use ∇. For multiline values, I don’t know yet.
Operators in APL terminology (such as ¨) are higher order functions. Thus, an operator allows one to modify known functions.
Use ]disp and ]display to understand the structure of APL arrays.
Set ]box on style=max to always enable drawing arrays with ]display. This is supremely useful as a newbie to understand array structure.
Set ]box on trains=parens to render trains as trees. Super helpful when attempting to grok train code.
Set ]boxing on to enable boxing for trains, arguments, everything.
I ran across this when reading another question on math.se, so I posted this proof for verification just to be sure I wasn’t missing something.
We wish to characterise prime ideals as precisely those that are disjoint from a multiplicative subset $S \subseteq R$. That is:
I’ll be using the definition of prime as:
Let $S \equiv R \setminus P$ be the complement of the prime ideal $P \subsetneq R$ in question.
Since $P \neq R$, $1 \not\in P$ (if $1 \in P$, then $x = x \cdot 1 \in P$ for every $x \in R$, since $P$ is an ideal and must be closed under multiplication with the entire ring). Hence, $1 \in S$.
For any $x, y \in S$, we need $xy \in S$ for $S$ to be multiplicative.
For contradiction, let us say that $x, y \in S$ are such that $xy \not\in S$. Translating to $P$, this means that $x, y \not\in P$ while $xy \in P$. This contradicts the definition of $P$ being prime.
Let $I$ be an ideal of the ring $R$ such that its complement $S \equiv R \setminus I$ is a maximal multiplicative subset.
Let $i_1 i_2 \in I$. For $I$ to be prime, we need to show that $i_1 \in I$ or $i_2 \in I$.
For contradiction, let $i_1, i_2 \not \in I$. Thus, $i_1, i_2 \in S$. Since $S$ is multiplicative, $i_1 i_2 \in S$. That is, $i_1 i_2 \not \in I$ (since $I$ is disjoint from $S$).
But this violates our assumption that $i_1 i_2 \in I$. Hence, contradiction.
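A finite sanity check in Python (my example): the complement of the prime ideal $2\mathbb Z$ is closed under multiplication, while the complement of the non-prime $4\mathbb Z$ is not.

```python
# S = Z \ 2Z (the odds) is multiplicative: a product of odds is odd.
odds = [x for x in range(-20, 21) if x % 2 != 0]
assert all((x * y) % 2 != 0 for x in odds for y in odds)

# 2 and 6 both lie outside 4Z, but their product lands inside it,
# so the complement of 4Z is not multiplicative: 4Z is not prime.
assert 2 % 4 != 0 and 6 % 4 != 0
assert (2 * 6) % 4 == 0
```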
Install Dyalog APL.
Setup RIDE, the IDE for dyalog APL. This IDE comes with auto complete, good key bindings, a top bar chockfull of information of all the APL symbols. It’s really well designed and a pleasure to use.
Follow the Dyalog tutorial, solving it chapter by chapter.
Bookmark APLCart, a collection of APL idioms, and refer to it when in need.
It’s kind of sad that this is the case, but on thinking about this, I realised that the SpaceChem game was essentially a compiler, and it was such a pleasure to learn how to use and debug — the visual nature of it made it amazing to find out.
I often forget which is which, so I came up with this:
Cartesian trees construct a tree $T = C(A)$ given an array $A$, such that range minimum query (RMQ) on the array $A$ is equivalent to the lowest common ancestor (LCA) of the nodes of the tree $T$.
Note that the tree looks like a minheap.
To see the connection to LCA: if we want to find the range minimum over the elements [12, 10, 20, 15, 18] of the array, the minimum is 10, which is indeed the lowest common ancestor of the nodes 12 and 18 in the tree.
We can go the other way, and convert an LCA problem into a RMQ problem. We perform an inorder traversal of the nodes, scribbling down the depth of the node (Link to lecture at 15:30). We ask for the argmin version of RMQ, that gives us the index of the node with the lowest depth. This gives us the index of where the node lives.
We can have an arbitrary ordered universe, on which we want to perform RMQ. We can convert this to LCA by using a cartesian tree, and then convert to a “clean” RMQ (by using the LCA > RMQ using depth conversion). This now will give us way faster operations (since we now have integers).
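To make the correspondence concrete, here is a small Python sketch (the array is the example from the text; the stack construction is the classic O(n) Cartesian tree algorithm, code mine): the minimum of the range equals the value at the LCA of the endpoints in the min-heap-shaped tree.

```python
def cartesian_parent(A):
    """Min-heap Cartesian tree via the classic stack algorithm:
    returns parent[i] as an index into A, -1 for the root."""
    parent = [-1] * len(A)
    stack = []  # indices whose values are increasing
    for i in range(len(A)):
        last = -1
        while stack and A[stack[-1]] > A[i]:
            last = stack.pop()
        if last != -1:
            parent[last] = i          # last popped node becomes i's left child
        if stack:
            parent[i] = stack[-1]
        stack.append(i)
    return parent

A = [12, 10, 20, 15, 18]
par = cartesian_parent(A)

def ancestors(i):
    out = [i]
    while par[out[-1]] != -1:
        out.append(par[out[-1]])
    return out

# LCA of the endpoints (values 12 and 18) is the node holding the range min:
anc0 = ancestors(0)
lca = next(i for i in ancestors(4) if i in anc0)
assert A[lca] == min(A)  # == 10
```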
±1 RMQ: we want adjacent entries to differ by exactly ±1; previously we had much wider gaps. Here, we perform an Euler tour (walk the tree in DFS order), and scribble down every vertex we visit.
To find the LCA, we perform the RMQ on the locations of the first occurrence of the nodes. (I think we don’t actually need the first occurrence, any occurrence will do.)
Really, we want a partial order that is defined with the tree as the Hasse diagram. However, performing operations on this is hard. Hence, the DFS numbering is a good monotone map from this partial order to the naturals, which creates a total order.
I want to think about this more deeply; I feel this might be a good way to think about the low numbers that show up in Tarjan’s algorithm for strongly connected components.
This also begs the question: can we use other partial orders, that chunk some information, but don’t lose all the information as going to a total order (the naturals) does?
The code is taken from The Annotated Transformer, which explains the “Attention is All You Need” paper.
On skimming the code, one sees the delightful line of code:
class EncoderLayer(nn.Module):
    "Encoder is made up of self-attn and feed forward (defined below)"
    def __init__(self, size, self_attn, feed_forward, dropout):
        super(EncoderLayer, self).__init__()
        self.self_attn = self_attn
        self.feed_forward = feed_forward
        self.sublayer = clones(SublayerConnection(size, dropout), 2)
        self.size = size

    def forward(self, x, mask):
        "Follow Figure 1 (left) for connections."
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, mask))
        return self.sublayer[1](x, self.feed_forward)
where the line:
x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, mask))
seems to imply that we are, indeed, performing a self attention with the same
value x
as the query, key, and value.
However, reading the code of the selfattention (or the paper) leads one to realise:
class MultiHeadedAttention(nn.Module):
    def __init__(self, h, d_model, dropout=0.1):
        "Take in model size and number of heads."
        super(MultiHeadedAttention, self).__init__()
        assert d_model % h == 0
        # We assume d_v always equals d_k
        self.d_k = d_model // h
        self.h = h
        self.linears = clones(nn.Linear(d_model, d_model), 4)
        self.attn = None
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, query, key, value, mask=None):
        "Implements Figure 2"
        if mask is not None:
            # Same mask applied to all h heads.
            mask = mask.unsqueeze(1)
        nbatches = query.size(0)

        # 1) Do all the linear projections in batch from d_model => h x d_k
        query, key, value = \
            [l(x).view(nbatches, -1, self.h, self.d_k).transpose(1, 2)
             for l, x in zip(self.linears, (query, key, value))]

        # 2) Apply attention on all the projected vectors in batch.
        x, self.attn = attention(query, key, value, mask=mask,
                                 dropout=self.dropout)

        # 3) "Concat" using a view and apply a final linear.
        x = x.transpose(1, 2).contiguous() \
             .view(nbatches, -1, self.h * self.d_k)
        return self.linears[-1](x)
where we notice:
        # 1) Do all the linear projections in batch from d_model => h x d_k
        query, key, value = \
            [l(x).view(nbatches, -1, self.h, self.d_k).transpose(1, 2)
             for l, x in zip(self.linears, (query, key, value))]

        # 2) Apply attention on all the projected vectors in batch.
        x, self.attn = attention(query, key, value, mask=mask,
                                 dropout=self.dropout)
where we see that query, key, value
are being linearly transformed
before being used. Hence, an input of $(x, x, x)$ is transformed
to $(q’, k’, v’) = (Qx, Kx, Vx)$ where $Q, K, V$ are arbitrary matrices.
Next, when we pass these into attention, the output we get is:
the code below is the same thing, spelled out:
def attention(query, key, value, mask=None, dropout=None):
    "Compute 'Scaled Dot Product Attention'"
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) \
             / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    p_attn = F.softmax(scores, dim=-1)
    if dropout is not None:
        p_attn = dropout(p_attn)
    return torch.matmul(p_attn, value), p_attn
So it’s not really self-attention: it’s more like modulated attention to self :)
A coarse structure on the set $X$ is a collection of relations on $X$: $E \subseteq 2^{X \times X}$ (called as controlled sets / entourages) such that:
The sets that are controlled are “small” sets.
The bounded coarse structure on a metric space $(X, d)$ is the set of all relations such that there exists a uniform bound such that all related elements are within that bounded distance.
We can check that the functions:
are coarse inverses to each other.
I am interested in this because if topology is related to semidecidability, then coarse structures (which are their dual) are related to..?
A matroid $M$ is a set $X$ equipped with an independence set $I \subseteq 2^X$.
Let $V$ be a vector space. The independent sets $I$ are of the form:
!! I \equiv { S \subseteq V ~:~ \text{vectors in $S$ are linearly independent} } !!
This is an independence system because the empty set is linearly independent, and subsets of a linearly independent collection of vectors will be linearly independent.
The exchange property is satisfied because of linear algebraic reasons.
Let $G = (V, E)$ be a graph. Then collections of edges of the form:
is an independence system because the empty forest is a forest, and a subset of edges of a forest continues to be a forest.
To check the exchange property, TODO
Consider the partition matroid $M \equiv (E, I)$, where we have a partitioning of $E$ known as $E_i$, and numbers $k_i$ the independence set consists of subsets $F$ which have at most $k_i$ elements in common with each $E_i$.
The independence axioms are intuitively satisfied, since our constraints on picking edges are of the form $\vert F \cap E_i \vert \leq k_i$, which will continue to hold as $F$ becomes smaller.
For the exchange axiom, let $\vert Y \vert > \vert X \vert$. Then, we can assert that for some index $i$, it must be the case that $\vert Y \cap E_i \vert > \vert X \cap E_i \vert$. Hence, we can add an element in $E_i \cap (Y \setminus X)$ into $X$ whilst still maintaining independence.
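The two axioms can also be checked by brute force on a toy example. The following sketch (with a made-up ground set and capacities, not anything from the text above) verifies downward closure and the exchange property for a small partition matroid:

```python
from itertools import combinations

# Toy partition matroid: E split into two parts with caps k_1 = 1, k_2 = 2.
E = ["a", "b", "c", "d"]
parts = [({"a", "b"}, 1), ({"c", "d"}, 2)]

def independent(F):
    return all(len(set(F) & Ei) <= ki for Ei, ki in parts)

subsets = [frozenset(c) for r in range(len(E) + 1) for c in combinations(E, r)]
I = [S for S in subsets if independent(S)]

# downward closure: removing an element preserves independence
assert all(independent(S - {e}) for S in I for e in S)
# exchange: if |Y| > |X|, some element of Y \ X can be added to X
for X in I:
    for Y in I:
        if len(Y) > len(X):
            assert any(independent(X | {e}) for e in Y - X)
```

The nested loop is exponential in $\vert E \vert$, of course; it is only meant to make the axioms tangible.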
Bases are the maximal independent sets of $I$ (ordered by inclusion). On adding an element into a basis element, they will become dependent. They are called bases by analogy with linear algebra.
Circuits are minimal dependent sets of $I$. This comes from analogy with trees: if we remove an element from any circuit (loop) in a graph, what we are left with is a tree.
A matroid can be completely characterized by knowing either the bases or the circuits of that matroid.
That is, when we go from independent to dependent by adding an element, we will have a single, unique circuit. For example, when we add an edge into a forest to create a cycle, this cycle will be unique!
Let $C_1, C_2$ be circuits created when $e$ was added into $S$, where $C_1$ is a largest such circuit and $C_2$ a smallest one.
Notice that $C_1, C_2$ must contain $e$ — if they did not, then $C_1, C_2$ would be circuits in $S$, contradicting the assumption that $S$ is independent.
Recall that $C_1, C_2$ are both circuits, which means that removing even a single element from them will cause them to become independent sets.
Let us contemplate $C \equiv C_1 \cup C_2$. Either $C = C_1$, in which case $C_1 = C_2$ and we are done.
Otherwise, $\vert C \vert > \vert C_1 \vert$ and $\vert C \vert > \vert C_2 \vert$.
In that case, consider $C' \equiv C \setminus \{ e \} = (C_1 \cup C_2) \setminus \{ e \} = (C_1 \setminus \{ e \}) \cup (C_2 \setminus \{ e \})$.
Now, we consider $C$. Clearly, this is a dependent set, since $C_1 \subsetneq C$, and $C_1$ is a dependent set.
Since $C = C' \cup \{ e \}$ and $C'$ does not contain $e$, we have $C' \subseteq S$, so $C'$ is an independent set that becomes dependent on adding $e$.
A rank function of a matroid $M \equiv \langle X, I \rangle$ is a function:
That is, for any subset $S \subseteq X$, $r(S)$ is the cardinality of the largest independent subset of $S$.
In this matroid, the
TODO: picture
Matchings in a bipartite graph $G = (V, E)$ with partition $(A, B)$ arise as the intersection of the independent sets of two matroids. We will denote by $\delta: V \rightarrow 2^E$ the function which takes a vertex to the set of edges incident on that vertex.
Let $M_A$ be a partition matroid: $M_A \equiv (E, I_A)$ where $I_A$ is: That is, in $I_A$, every independent set has at most one edge incident on each vertex of $A$. We need to check that this is an independence system. The empty set of no edges is independent. If a collection of edges has at most one edge incident on each vertex, then removing edges can only decrease incidence. Hence, it’s also downward closed.
TODO: add picture
Similarly, we define $M_B \equiv (E, I_B)$:
Now, notice that any collection of edges $F \in I_A \cap I_B$ is a legal matching, since the edges cover all vertices of $A$ and $B$ at most once. The largest element of $I_A \cap I_B$ is the maximum matching that we are looking for.
Given two matroids $M_1 \equiv (E, I_1)$ and $M_2 \equiv (E, I_2)$ with rank functions $r_1$ and $r_2$, let $S \in I_1 \cap I_2$ and let $F \subseteq E$.
There’s a lot written on the Zariski topology on the internet, but most of it lacks explicit examples and pictures. This is my attempt to communicate what the Zariski topology looks like, from the perspectives that tickle my fancy (a wealth of concrete examples, topology-as-semidecidability, and pictures).
Recall that the Zariski topology is defined by talking about what its closed sets are. The common zero sets of a family of polynomials are the closed sets of the Zariski topology. Formally, the topology $(\mathbb R^n, \tau)$ has as closed sets:
Open sets (the complements of closed sets) are of the form:
The empty set is generated as ${ x \in \mathbb R^n : 0 \neq 0 }$ and the full set is generated as ${ x \in \mathbb R^n : 1 \neq 0 }$.
Recall that in this view of topology, for a space $(X, \tau)$, for every open set $O \in \tau$, we associate a Turing machine $T_O$, which obeys the relation:
Let’s consider functions of the form $(\mathbb R, Z) \xrightarrow{f} (\mathbb R, Z)$ where $Z$ is the Zariski topology. We are interested in discovering what sort of functions $f$ are continuous.
Let’s repeat the exercise for 2D. Here, we will manage to see much richer behaviour. Let’s consider functions of the form $(\mathbb R^2, Z) \xrightarrow{f} (\mathbb R^2, Z)$ where $Z$ is the Zariski topology.
Wikipedia lists the implementation of quicksort as:
algorithm quicksort(A, lo, hi) is
if lo < hi then
p := partition(A, lo, hi)
quicksort(A, lo, p - 1)
quicksort(A, p + 1, hi)
algorithm partition(A, lo, hi) is
pivot := A[hi]
i := lo
for j := lo to hi do
if A[j] < pivot then
swap A[i] with A[j]
i := i + 1
swap A[i] with A[hi]
return i
Here, the indices [lo..i-1]
have values less than the pivot, while
[i..j]
are greater than or equal to the pivot.
// #define SWAP(ix, ix2) { int t = a[ix]; a[ix] = a[ix2]; a[ix2] = t; }
// sorts the interval [l, r]
void qs(int l, int r) {
    if (r - l < 0) return;
    int part = a[r]; // the partition
    // a[getill...n] >= part (getill = greater or equal till)
    // starts at r since we know that a[r] >= (partition=a[r])
    int getill = r;
    // a[l..lttill] < part (lttill = less than till)
    // starts at (l-1) since we do not know about any value < partition
    int lttill = l-1;
    // loop until they start probing into the other set
    while(!(lttill+1 >= getill || getill-1 <= lttill)) {
        // if the speculated element is < partition
        if (a[getill-1] < part) {
            // swap the value at getill-1 with the slot at lttill+1
            SWAP(getill-1, lttill+1);
            // increment lttill, since we KNOW that the
            // value at lttill+1 = a[getill-1] is < part
            lttill++;
        } else {
            // all we know is that a[getill-1] >= part, so we can engulf
            // the region into [getill-1...n]
            getill--;
        }
    }
    // the partitions must be next to each other, since we have engulfed everything
    assert(getill - lttill == 1);
    // move the partition value to the center.
    SWAP(getill, r);
    // recurse: solve [l..lttill] (leave getill=part alone), solve [getill+1..r]
    qs(l, lttill);
    qs(getill+1, r);
}
This implementation makes very clear to me what information is “known”: `a[l..lttill] < part` and `a[getill..r] >= part`. It also makes clear what is being “probed”/“tentative”: `lttill+1` and `getill-1` are not known yet; we are feeling out the boundaries of our partitions. The termination condition is clear: when one partition starts reaching into the other partition’s resources, it’s done.
Due to using closed intervals everywhere, it’s very easy to see precisely what data starts and ends where.
What version of quicksort do you prefer? Drop me an email!
Thanks to `p0a` on `##math` on freenode for teaching me this proof! Here’s one fun application of Cauchy-Schwarz. We can apply it to the two vectors $x=(\sqrt a, \sqrt b)$ and $y=(\sqrt b, \sqrt a)$ to derive the AM-GM inequality:
This was a quick experiment in using Grobner basis to model situations. We can represent our dataflow analysis constraints in terms of polynomial rewrites over $F_2$.
Given the program:
p = { 0: ["=", "x", 'y'],
1: ['br', 2, 100],
2: ['=', 'z', 'x'],
3: ['br', 2],
100: ['ret', 'z'] }
whose semantics I hope are fairly straightforward — the dictionary maps
instruction locations to instructions. Instructions proceed sequentially;
`br` moves control flow around. Note that `br`
can branch to multiple locations,
since we are not control-flow sensitive.
The idea is that since in a dataflow analysis, we need information for each variable at each program point, we can create a ring of polynomials over $F_2$ for each variable at each program point. So in this case, we would need:
R = F_2[x0, y0, z0, x1, y1, z1, x2, y2, z2, x3, y3, z3, x100, y100, z100]
We then add elements into the ideal that represents our constraints.
For example, to perform dataflow analysis, we need to add constraints
saying that if a variable `z` is alive at `100`, then all variables that are used
to compute `z` are alive. This sets up equations that may
have cycles (in the case of loops).
These are usually resolved using the Kildall algorithm.
However, we can also ask SAGE to kindly solve the Grobner basis. I hypothesize that the “easy” dataflow problems ought to be toric ideals which admit much faster solutions.
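For contrast, here is a minimal sketch of the Kildall-style worklist approach on the example program above. The encoding of `use`/`def` sets and successors is my own, for illustration; it is not the Grobner-basis formulation itself:

```python
# Minimal worklist liveness analysis for the example program.
p = {0: ("=", "x", "y"),       # x := y
     1: ("br", 2, 100),
     2: ("=", "z", "x"),       # z := x
     3: ("br", 2),
     100: ("ret", "z")}

def succs(loc):
    ins = p[loc]
    if ins[0] == "br":
        return list(ins[1:])
    if ins[0] == "ret":
        return []
    keys = sorted(p)
    return [keys[keys.index(loc) + 1]]   # fall through to the next location

def use_def(ins):
    if ins[0] == "=":
        return {ins[2]}, {ins[1]}        # uses the rhs, defines the lhs
    if ins[0] == "ret":
        return {ins[1]}, set()
    return set(), set()

live_in = {loc: set() for loc in p}
worklist = list(p)
while worklist:
    loc = worklist.pop()
    use, defs = use_def(p[loc])
    out = set().union(*[live_in[s] for s in succs(loc)])
    new = use | (out - defs)
    if new != live_in[loc]:
        live_in[loc] = new
        # re-examine predecessors, since their live-out just changed
        worklist += [q for q in p if loc in succs(q)]
```

The loop between locations `2` and `3` is exactly the kind of cycle that forces iteration to a fixed point.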
I learnt of a nice, formal way to prove the correctness of Fenwick trees in terms of orbits that I wish to reproduce here.
One can use a Fenwick tree to perform cumulative sums $Sum(n) \equiv \sum_i^n A[i]$, and updates $Upd(i, v) \equiv A[i] += v$. Naively, cumulative sums can take $O(n)$ time and updates take $O(1)$ time.
A Fenwick tree can perform both in $\log(n)$. In general, we can perform any monoid-based catenation and update in $\log(n)$.
We allow indexes $[1, 2, \dots n]$. The node with factorization $i \equiv 2^k \times l$, $2 \not \vert l$ (that is, $k$ is the highest power of $2$ in $i$) is responsible for the interval $[i-2^k+1, i] = (i-2^k, i]$.
I’m going to state all the manipulations in terms of prime factorizations, since I find it far more intuitive than bit-fiddling. In general, I want to find a new framework to discover and analyze bit-fiddling-heavy algorithms.
Some examples of the range of responsibility of an index are:
To perform a cumulative sum, we need to read from the correct overlap regions that cover the full array. For example, to read from $15$, we would want to read:
So we need to read the indices:
At each location, we strip off the value $2^r$. We can discover this value with bit-fiddling: we claim that $a \,\&\, (-a) = 2^r$.
Let $a = x10^r$. Now, $-a = \lnot a + 1 = \overline{x}01^r + 1 = \overline{x}10^r$. Hence, $a \,\&\, (-a) = a \,\&\, (\lnot a + 1) = (x10^r) \,\&\, (\overline{x}10^r) = 0^{\alpha}10^r = 2^r$.
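A quick numerical sanity check of the claim $a \,\&\, (-a) = 2^r$; the helper below is just for illustration:

```python
def lsb(a):
    # a & (-a) isolates the lowest set bit of a
    return a & (-a)

# 2 ** r, where r is the index of the lowest set bit of a
assert all(lsb(a) == 2 ** bin(a)[::-1].index("1") for a in range(1, 1024))
```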
So the full implementation of query is:
#define LSB(x) ((x)&(-(x)))
int a[N];
int q(int i) {
    int s = 0;
    while (i > 0) { s += a[i]; i -= LSB(i); }
    return s;
}
To perform an update at $i$, we need to update all locations which on querying overlap with $i$. For example, to update the location $9$, we would want to update:
So we need to update the indices:
We use the same bit-fiddling technique as above to strip off the value $2^r$:
#define LSB(x) ((x)&(-(x)))
int tree[N];
void u(int i, int v) {
    while (i < N) { tree[i] += v; i += LSB(i); }
}
We wish to analyze the operations $Query(q) \equiv \sum_{i=1}^q a[i]$, and $Update(i, val) \equiv a[i]~\texttt{+=}~val$. To do this, we are allowed to maintain an auxiliary array $d$ which we will manipulate. We will stipulate the conditions of operations on $d$ such that they will reflect the values of $Query$ and $Update$, albeit much faster.
We will analyze the algorithm in terms of orbits. We have two operators, one for update called $U$, and one for query called $Q$. Given an index $i$, repeatedly applying the query operator gives us the indices we need to read and accumulate from the underlying array $a$ to get the total sum $a[0..i]$:
Given an index $u$, repeatedly applying the update operator $U$ gives us all the indices we need to add the change to update:
For query and update to work, we need the condition that:
That is, if and only if the query index $q$ includes the update location $u$, will the orbits intersect.
The intuition is that we want updates at an index $u$ to only affect queries that occur at indices $q \geq u$. Hence, we axiomatise that for an update to be legal, its orbit must intersect only the orbits of queries that are at indices greater than or equal to it.
We will show that our operators:
do satisfy the conditions above.
For a quick numerical check, we can use the code below to ensure that the orbits intersect exactly when they should:
# calculate orbits of query and update in fenwick tree
def lsb(i): return i & (-i)
def U(i): return i + lsb(i)
def Q(i): return i - lsb(i)
def orbit(f, i):
s = set()
while i not in s and i > 0 and i < 64:
s.add(i); i = f(i)
return s
if __name__ == "__main__":
for q in range(1, 16):
for u in range(1, 16):
qo = orbit(Q, q); uo = orbit(U, u)
c = qo.intersection(uo)
        print("q:%4s | u:%4s | qo: %20s | uo: %20s | qo∩uo: %4s" %
              (q, u, qo, uo, c))
print("")
We note that $Q$ always decreases the value of $q$, and $U$ always increases the value of $u$. Hence, if $q = u$, they meet at this starting point, and $\forall i, j \geq 1, \quad Q^i(q) \neq U^j(u)$. Hence, they meet exactly once as required.
As noted above, $q$ always decreases and $u$ always increases, hence in this case they will never meet as required.
Let the entire array have size $2^N$.
Let $q = \texttt{e1$f_q$}, u = \texttt{e0$f_u$}$, where
$\texttt{e},f_q, f_u$ may be empty strings.
Notice that $Q$ will always strip away rightmost ones in $f_q$, leading to $q = \texttt{e10…0}$ at some point.
Similarly, $U$ will keep on adding new rightmost ones, causing the state to be $u = \texttt{e01…10…0} \xrightarrow{U} \texttt{e100…}$.
Hence, at some point $q = u$.
We call functions $f: \mathbb Z \rightarrow \mathbb R$ arithmetic functions, since they operate on the integers.
We introduce an operator $f \star g: \mathbb Z \rightarrow \mathbb R$. It is defined by:
We will show that the set of arithmetic functions forms a group under the operator $\star$, with identity:
The reason all of this is interesting is that the inverse of the constant function $1(n) \equiv 1$ is going to be this function called the Möbius function $\mu$:
$$
\mu(n = p_1^{\alpha_1} p_2^{\alpha_2} \dots p_r^{\alpha_r}) \equiv
\begin{cases}
0 & \text{if any $\alpha_i > 1$} \\
(-1)^{\alpha_1 + \alpha_2 + \dots + \alpha_r} & \text{if all $\alpha_i \in \{ 0, 1 \}$}
\end{cases}
$$
The mobius function will allow us to perform mobius inversion:
$$
\begin{aligned}
f(n) &\equiv \sum_{d \vert n} g(d) = \sum_{d \vert n} g(d) 1(n/d) = (g \star 1)(n) \\
f \star 1^{-1} &= g \star 1 \star 1^{-1} \\
f \star \mu &= g
\end{aligned}
$$
That is, we originally had $f$ defined in terms of $g$. We can recover an expression for $g$ in terms of $f$.
We claim that the set of functions $\{ f : \mathbb Z \rightarrow \mathbb C ~:~ f(1) \neq 0 \}$ is a commutative group, with the group operation $\star$ such that:
with the identity element being $id_\star(n) \equiv \lfloor 1 / n \rfloor$. The idea is that if $(n = 1)$, then $\lfloor 1/1 \rfloor = 1$, and for any other number $n > 0$, $1/n < 1$, hence $\lfloor 1/n \rfloor = 0$.
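A brute-force sketch of the convolution and of $\mu$ (the function names below are mine), which lets us numerically confirm that $1 \star \mu = id_\star$:

```python
def divisors(n):
    return [d for d in range(1, n + 1) if n % d == 0]

def dirichlet(f, g):
    # (f * g)(n) = sum over d | n of f(d) g(n/d)
    return lambda n: sum(f(d) * g(n // d) for d in divisors(n))

def mu(n):
    # Moebius function via trial-division factorization
    res, d = 1, 2
    while d * d <= n:
        if n % d == 0:
            n //= d
            if n % d == 0:
                return 0          # square factor: mu = 0
            res = -res
        d += 1
    return -res if n > 1 else res

one = lambda n: 1
ident = lambda n: 1 if n == 1 else 0   # identity for the convolution
conv = dirichlet(one, mu)              # should equal ident pointwise
```

Running `conv(n)` for small `n` returns `1` at `n = 1` and `0` everywhere else, as the group identity demands.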
Notice that the sizes of the sets we are calculating are totients; for example, $\vert \{ 1 \leq x \leq 12 : \gcd(x/2, 6) = 1 \} \vert = \phi(6)$. Summing over all of what we have, we’ve counted the numbers in $[1, 2, \dots, 12]$ in two ways — one directly, and the other by partitioning into equivalence classes:
In general, the same argument allows us to prove that:
This is me trying to get the fabled interpreter of the J
language
working, so I could absorb Arthur Whitney’s style of writing C: it’s
cryptic, short, and fits in a page. I learnt of this from the J
language page,
which comes with the quote:
One summer weekend in 1989, Arthur Whitney visited Ken Iverson at Kiln Farm and produced—on one page and in one afternoon—an interpreter fragment on the AT&T 3B1 computer. I studied this interpreter for about a week for its organization and programming style; and on Sunday, August 27, 1989, at about four o’clock in the afternoon, wrote the first line of code that became the implementation described in this document. Arthur’s one-page interpreter fragment is as follows:
typedef char C;typedef long I;
typedef struct a{I t,r,d[3],p[2];}*A;
#define P printf
#define R return
#define V1(f) A f(w)A w;
#define V2(f) A f(a,w)A a,w;
#define DO(n,x) {I i=0,_n=(n);for(;i<_n;++i){x;}}
I *ma(n){R(I*)malloc(n*4);}mv(d,s,n)I *d,*s;{DO(n,d[i]=s[i]);}
tr(r,d)I *d;{I z=1;DO(r,z=z*d[i]);R z;}
A ga(t,r,d)I *d;{A z=(A)ma(5+tr(r,d));z->t=t,z->r=r,mv(z->d,d,r);
R z;}
V1(iota){I n=*w->p;A z=ga(0,1,&n);DO(n,z->p[i]=i);R z;}
V2(plus){I r=w->r,*d=w->d,n=tr(r,d);A z=ga(0,r,d);
DO(n,z->p[i]=a->p[i]+w->p[i]);R z;}
V2(from){I r=w->r-1,*d=w->d+1,n=tr(r,d);
A z=ga(w->t,r,d);mv(z->p,w->p+(n**a->p),n);R z;}
V1(box){A z=ga(1,0,0);*z->p=(I)w;R z;}
V2(cat){I an=tr(a->r,a->d),wn=tr(w->r,w->d),n=an+wn;
A z=ga(w->t,1,&n);mv(z->p,a->p,an);mv(z->p+an,w->p,wn);R z;}
V2(find){}
V2(rsh){I r=a->r?*a->d:1,n=tr(r,a->p),wn=tr(w->r,w->d);
A z=ga(w->t,r,a->p);mv(z->p,w->p,wn=n>wn?wn:n);
if(n-=wn)mv(z->p+wn,z->p,n);R z;}
V1(sha){A z=ga(0,1,&w->r);mv(z->p,w->d,w->r);R z;}
V1(id){R w;}V1(size){A z=ga(0,0,0);*z->p=w->r?*w->d:1;R z;}
pi(i){P("%d ",i);}nl(){P("\n");}
pr(w)A w;{I r=w->r,*d=w->d,n=tr(r,d);DO(r,pi(d[i]));nl();
if(w->t)DO(n,P("< ");pr(w->p[i]))else DO(n,pi(w->p[i]));nl();}
C vt[]="+{~<#,";
A(*vd[])()={0,plus,from,find,0,rsh,cat},
(*vm[])()={0,id,size,iota,box,sha,0};
I st[26]; qp(a){R a>='a'&&a<='z';}qv(a){R a<'a';}
A ex(e)I *e;{I a=*e;
if(qp(a)){if(e[1]=='=')R st[a-'a']=ex(e+2);a= st[ a-'a'];}
R qv(a)?(*vm[a])(ex(e+1)):e[1]?(*vd[e[1]])(a,ex(e+2)):(A)a;}
noun(c){A z;if(c<'0'||c>'9')R 0;z=ga(0,0,0);*z->p=c-'0';R z;}
verb(c){I i=0;for(;vt[i];)if(vt[i++]==c)R i;R 0;}
I *wd(s)C *s;{I a,n=strlen(s),*e=ma(n+1);C c;
DO(n,e[i]=(a=noun(c=s[i]))?a:(a=verb(c))?a:c);e[n]=0;R e;}
main(){C s[99];while(gets(s))pr(ex(wd(s)));}
It’s a lot to take in — it’s quite breathtaking really, the way it all hangs together in one page.
Unfortunately, this does not work if we try to get it to run in 2020. I decided
to read the code and understand what would happen. I managed to read enough
to understand that the code `a=~3`
ought to create an array with the values `[0 1 2]`.
On attempting to run this however, we get:
$ gcc -O0 -g -std=c89 -fsanitize=address -fsanitize=undefined incunabulum.c -o bin/incunabulum && ./bin/incunabulum
...
(many many GCC warnings elided)
...
a=~3
=================================================================
==23726==ERROR: AddressSanitizer: heap-buffer-overflow on address
0x60300000eff0 at pc 0x000000402be3 bp 0x7ffe6dde6b70 sp 0x7ffe6dde6b60
WRITE of size 8 at 0x60300000eff0 thread T0
    #0 0x402be2 in wd /home/bollu/work/w/incunabulum.c:40
    #1 0x402d28 in main /home/bollu/work/w/incunabulum.c:42
    #2 0x7f7ae901082f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)
    #3 0x400ca8 in _start (/home/bollu/w/bin/incunabulum+0x400ca8)
...
SUMMARY: AddressSanitizer: heap-buffer-overflow /home/bollu/work/w/incunabulum.c:40 wd
...
==23726==ABORTING
oops! The code uses a lot of punning between `int`
and `int*`
. These assumptions break now that we’re on 64-bit. The patch to
get this to work is:
diff --git a/incunabulum.c b/incunabulum.c
index 2cae744..778e35a 100644
--- a/incunabulum.c
+++ b/incunabulum.c
@@ -1,11 +1,11 @@
-typedef char C;typedef long I;
+typedef char C;typedef long long I;
 typedef struct a{I t,r,d[3],p[2];}*A;
 #define P printf
 #define R return
 #define V1(f) A f(w)A w;
 #define V2(f) A f(a,w)A a,w;
 #define DO(n,x) {I i=0,_n=(n);for(;i<_n;++i){x;}}
-I *ma(n){R(I*)malloc(n*4);}mv(d,s,n)I *d,*s;{DO(n,d[i]=s[i]);}
+I *ma(n){R(I*)malloc(n*8);}mv(d,s,n)I *d,*s;{DO(n,d[i]=s[i]);}
 tr(r,d)I *d;{I z=1;DO(r,z=z*d[i]);R z;}
 A ga(t,r,d)I *d;{A z=(A)ma(5+tr(r,d));z->t=t,z->r=r,mv(z->d,d,r);
 R z;}
@@ -34,9 +34,10 @@ I st[26]; qp(a){R a>='a'&&a<='z';}qv(a){R a<'a';}
 A ex(e)I *e;{I a=*e;
 if(qp(a)){if(e[1]=='=')R st[a-'a']=ex(e+2);a= st[ a-'a'];}
 R qv(a)?(*vm[a])(ex(e+1)):e[1]?(*vd[e[1]])(a,ex(e+2)):(A)a;}
-noun(c){A z;if(c<'0'||c>'9')R 0;z=ga(0,0,0);*z->p=c-'0';R z;}
-verb(c){I i=0;for(;vt[i];)if(vt[i++]==c)R i;R 0;}
+I noun(c){A z;if(c<'0'||c>'9')R 0;z=ga(0,0,0);*z->p=c-'0';R z;}
+I verb(c){I i=0;for(;vt[i];)if(vt[i++]==c)R i;R 0;}
 I *wd(s)C *s;{I a,n=strlen(s),*e=ma(n+1);C c;
 DO(n,e[i]=(a=noun(c=s[i]))?a:(a=verb(c))?a:c);e[n]=0;R e;}
 main(){C s[99];while(gets(s))pr(ex(wd(s)));}
+
After applying the patch, we manage to get the interpreter to run!
./bin/incunabulum
a=~3
3
0 1 2
I liked it so much that I took a screenshot and made it my lock screen.
I’m really fascinated by the code. I loved that I could simply stare at the screenshot to absorb the code. There was no scrolling involved. The variables are well-named (to the extent I understand the code), and it’s clearly extremely well thought out. If there’s someone who understands some of the thorny aspects of the code (for example, what is the `t` variable really tracking?), I’d be really glad to know the details. Please leave an issue or a pull request against the repo.
I’m going to write a dissection of the code once I fully understand it, since I couldn’t find an explanation of the code on the internet.
Until then, enjoy the monolith of code!
Provide an example of a sequence $a_n: \mathbb N \rightarrow \mathbb R$ such that $\lim_{n \rightarrow \infty} \vert a_{n+1} - a_n \vert = 0$, but $\lim_{n, m \rightarrow \infty, m > n} \vert a_n - a_m \vert \neq 0$. That is, provide a sequence where the distances between successive terms converge to zero, but where distances between terms that are “farther apart than 1” do not converge to 0. That is, the sequence is not Cauchy.
The usual solution is to take the harmonic numbers, $H_n \equiv \sum_{i=1}^n 1/i$. Then, we show that:
We can much more simply choose $a_n = \log(n)$. This yields the simple calculation:
while on the other hand,
I find this far cleaner conceptually, since it’s “obvious” to everyone that $a_n = \log(n)$ diverges, while the corresponding fact for $H_n$ is hardly convincing. We also get straightforward equalities everywhere, instead of inequalities.
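For completeness, the two limits for $a_n = \log(n)$ are:

$$
\begin{aligned}
\vert a_{n+1} - a_n \vert &= \log(n+1) - \log(n) = \log\left(1 + \frac{1}{n}\right) \xrightarrow{n \to \infty} 0 \\
\vert a_{2n} - a_n \vert &= \log(2n) - \log(n) = \log 2 \neq 0
\end{aligned}
$$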
I still feel that I don’t grok what precisely fails here, in that, my intuition still feels that the local condition ought to imply the Cauchy condition: if $a_n$ tells $a_{n+1}$ to not be too far, and $a_{n+1}$ tells $a_{n+2}$, surely this must be transitive?
I have taught my instincts to not trust my instincts on analysis, which is a shitty solution :) I hope to internalize this someday.
EDIT: I feel I now understand what’s precisely happening after ruminating a bit.
The Cauchy convergence criterion allows us to drop a finite number of terms, and then capture everything after that point in a ball of radius $\epsilon$. As $\epsilon$ shrinks, all the terms in the sequence are “squeezed togeher”.
In the $a_{n+1}  a_n$ case, only successive terms must maintain an $\epsilon$ distance. But as the $\log$ example shows, you can steadily plod along, keeping $\epsilon$ ball next to $\epsilon$ ball, to reach:
whose behaviour can do unexpected things depending on the choice of $n$.
Krylov subspace methods are a class of methods used to solve $Ax = b$ where $A$ is sparse; they all exploit the idea of a Krylov subspace. Examples include conjugate gradient (CG) and GMRES (the generalized minimal residual method).
Clearly, $K_m \subseteq K_{m+1}$, and there is a maximum $K_N$ that we can span (the full vector space). We are interested in the smallest index $M$ such that $K_M = K_{M+1}$.
We notice that $K_M$ is invariant under the action of $A$.
Now, let’s consider:
We learnt that $Ax = b$ has a solution in $K_m(A, b)$. Using this, we can build solvers that exploit the Krylov subspace. We will describe GMRES and CG.
We wish to solve $Ax = b$ where $A$ is sparse and $b$ is normalized. The $n$th Krylov subspace is $K_n(A, b) \equiv \text{span} \{ b, Ab, A^2b, \dots, A^n b \}$.
We approximate the actual solution with a vector $x_n \in K_n(A, b)$. We define the residual as $r_n \equiv A x_n  b$.
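A toy numpy sketch of the GMRES idea: build an orthonormal basis of $K_m(A, b)$, then minimize the residual over that subspace. The function names are mine, and real GMRES uses the Arnoldi recurrence with a Hessenberg least-squares solve rather than the dense solve below:

```python
import numpy as np

def krylov_basis(A, b, m):
    # orthonormal basis of K_m(A, b): Gram-Schmidt on b, Ab, A^2 b, ...
    Q = [b / np.linalg.norm(b)]
    for _ in range(m - 1):
        v = A @ Q[-1]
        for q in Q:
            v = v - (q @ v) * q
        nrm = np.linalg.norm(v)
        if nrm < 1e-12:
            break                       # the subspace stopped growing
        Q.append(v / nrm)
    return np.stack(Q, axis=1)          # columns span K_m(A, b)

def gmres_toy(A, b, m):
    # minimize ||A x_m - b|| over x_m in K_m(A, b)
    Q = krylov_basis(A, b, m)
    y, *_ = np.linalg.lstsq(A @ Q, b, rcond=None)
    return Q @ y
```

On a 2x2 system, two Krylov vectors already span the whole space, so the "approximation" is exact.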
The Rete pattern matching algorithm is an algorithm that allows matching a huge number of rules with a huge database of “facts”.
MLIR (“Multi-Level Intermediate Representation”) is a new technology that hopes to centralize much of the research and development of various compilers into a single authoritative source. The major claim-to-fame is that it allows one to mix various “dialects” (much as Racket does). So, to a first order approximation, MLIR is “JSON for compiler intermediate representations”.
What MLIR gets right is tooling. They take the experience that the LLVM project has mined for the last decade and a half, and bake many of the good stuff that came with LLVM right in. For example, MLIR comes inbuilt with a pretty printer, a notion of types, a notion of “attributes”, SSA, enforced provenance tracking of code (so one can always know what the original source code was that led to some assembly). Sound engineering might see MLIR succeed where many others failed.
I was reminded of Rete since the MLIR folks are trying to solve the pattern matching problem in general for their Generic DAG Rewriter.
They currently just use a worklist based algorithm. I’m trying to understand if Rete can be used instead. Rete is famous for being hard to understand, so I began a quest to look for good sources to implement it. I found a great PhD thesis written by Robert B. Doorenbos, which quips:
Since the Rete match algorithm provides the starting point for much of the work in this thesis, this chapter describes Rete. Unfortunately, most of the descriptions of Rete in the literature are not particularly lucid, which is perhaps why Rete has acquired “a reputation for extreme difficulty.” (Perlin, 1990b) To remedy this situation, this chapter describes Rete in a tutorial style, rather than just briefly reviewing it and referring the reader to the literature for a full description. We will first give an overview of Rete, and then discuss the principal data structures and procedures commonly used to implement it. High-level pseudocode will be given for many of the structures and procedures, so that this chapter may serve as a guide to readers who want to implement Rete (or some variant) in their own systems.
I now have a reference to an accessible description of this stuff. I might implement Rete to understand it, so that it’s part of my toolkit if I ever need it.
We have a system we wish to simulate using hamilton’s equations:
We want to simulate a system using these differential equations. We will begin with some initial position and momentum $(q_0, p_0)$, evaluate $\frac{\partial q}{\partial t} \rvert_{(q_0, p_0)}$, $\frac{\partial p}{\partial t} \rvert_{(q_0, p_0)}$, and use these to find $(q_{next}, p_{next})$. An integrator is a general algorithm that produces the next position and momentum using current information:
The design of $I$ is crucial: different choices of $I$ will have different tradeoffs for accuracy and performance. Another interesting property we might want is for $I$ to be a symplectic integrator — that is, it preserves the total energy of the system. For example, here’s a plot of the orbits of planets using two integrators, one that’s symplectic (leapfrog) and one that isn’t (Euler)
Notice that since leapfrog attempts to keep energy conserved, the orbits stay as orbits! On the other hand, the euler integrator quickly spirals out, since we lose energy during the integration. Note that this is not an issue of numerical precision: the euler integrator is inherently such that over long timescales, it will lose energy. On the other hand, the leapfrog integrator will always remain stable, even with very large timesteps and low precision.
I present the equations of the leapfrog integrator, a proof sketch that it is symplectic, and the code listing that was used to generate the above plot. Often, code makes most ideas very clear!
# Run HMC with a particular choice of potential
import numpy as np
from matplotlib.animation import FuncAnimation
import matplotlib.pyplot as plt
import numpy.linalg
# dq/dt =  dH/dp|_{(q0, p0)}
# dp/dt = -dH/dq|_{(q0, p0)}
def leapfrog(dhdp, dhdq, q0, p0, dt):
    # half step momentum
    p0 -= dhdq(q0, p0) * 0.5 * dt
    # full step position: q += dt * dH/dp
    q0 += dhdp(q0, p0) * dt
    # half step momentum
    p0 -= dhdq(q0, p0) * 0.5 * dt
    return (q0, p0)
For reference, we also implement an euler integrator, that uses the derivative to compute the position and momentum of the next timestep independently.
def euler(dhdp, dhdq, q, p, dt):
    pnew = p - dhdq(q, p) * dt
    qnew = q + dhdp(q, p) * dt
    return (qnew, pnew)
Finally, we implement `planet(integrator, n, dt)`,
which simulates a gravitational
potential and the usual kinetic energy, using the integrator given by `integrator`
for `n`
steps, with each timestep taking `dt`.
def planet(integrator, n, dt):
    STRENGTH = 0.5
    # potential V(q) = STRENGTH * |q|; kinetic K(p) = p^2/2
    q = np.array([0.0, 1.0])
    p = np.array([1.0, 0.0])
    # H = STRENGTH * |q| (potential) + p^2/2 (kinetic)
    def H(qcur, pcur): return STRENGTH * np.linalg.norm(qcur) + np.dot(pcur, pcur) / 2
    def dhdp(qcur, pcur): return pcur
    def dhdq(qcur, pcur): return STRENGTH * qcur / np.linalg.norm(qcur)
    qs = []
    for i in range(n):
        print("q: %10s | p: %10s | H: %6.4f" % (q, p, H(q, p)))
        (q, p) = integrator(dhdp, dhdq, q, p, dt)
        qs.append(q.copy())
    return np.asarray(qs)
We plot the simulations using matplotlib
and save them.
NITERS = 15
TIMESTEP = 1
print("planet simulation with leapfrog")
planet_leapfrog = planet(leapfrog, NITERS, TIMESTEP)
plt.rcParams.update({'font.size': 12, 'font.family':'monospace'})
fig, ax = plt.subplots()
print(planet_leapfrog)
ax.plot(planet_leapfrog[:, 0], planet_leapfrog[:, 1], label='leapfrog',
linewidth=3, color='#00ACC1')
print("planet simulation with euler")
planet_euler = planet(euler, NITERS, TIMESTEP)
ax.plot(planet_euler[:, 0], planet_euler[:, 1], label='euler',
linewidth=3, color='#D81B60')
legend = plt.legend(frameon=False)
ax.set_title("leapfrog v/s euler: NITERS=%s dt=%s" % (NITERS, TIMESTEP))
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['bottom'].set_visible(False)
ax.spines['left'].set_visible(False)
plt.savefig("leapfrogvseuler.png")
plt.show()
Quite a lot of ink has been spilt on this topic. My favourite reference is the one by Rufflewind.
However, none of these references has a good stock of examples for the difference. So here, I catalogue the explicit computations for both forward mode AD and reverse mode AD.
In general, in forward mode AD, we fix how much the inputs wiggle with respect to a parameter $t$. We figure out how much the output wiggles with respect to $t$. If $output = f(input_1, input_2, \dots input_n)$, then $\frac{\partial output}{\partial t} = \sum_i \frac{\partial f}{\partial input_i} \frac{\partial input_i}{\partial t}$.
In reverse mode AD, we fix how much the parameter $t$ wiggles with respect to the output. We figure out how much the parameter $t$ wiggles with respect to the inputs. If $output_i = f_i(input, \dots)$, then $\frac{\partial t}{\partial input} = \sum_i \frac{\partial t}{\partial output_i} \frac{\partial f_i}{\partial input}$. This is a much messier expression, since we need to accumulate the data over all outputs.
Essentially, deriving output from input is easy, since how to compute an output from an input is documented in one place. Deriving input from output is annoying, since many outputs can depend on a single input.
The upshot is that if we have few “root outputs” (like a loss function), we need to run AD once with respect to this, and we will get the wiggles of all inputs at the same time with respect to this output, since we compute the wiggles output to input.
The first example of z = max(x, y)
captures the essential difference
between the two approaches succinctly. Study this, and everything else will make
sense.
z = max(x, y)
We can compute $\frac{\partial z}{\partial x}$ by setting $t = x$. That is, $\frac{\partial x}{\partial t} = 1, \frac{\partial y}{\partial t} = 0$. Similarly, we can compute $\frac{\partial z}{\partial y}$ by setting $t = y$. That is, $\frac{\partial x}{\partial t} = 0, \frac{\partial y}{\partial t} = 1$. If we want both gradients $\frac{\partial z}{\partial x}, \frac{\partial z}{\partial y}$, we will have to rerun the above equations twice with the two initializations.
In our equations, we are saying that we know how sensitive the inputs $x, y$ are to a given parameter $t$. We are deriving how sensitive the output $z$ is to the parameter $t$ as a composition of $x, y$. If $x > y$, then we know that $z$ is as sensitive to $t$ as $x$ is.
We can compute $\frac{\partial z}{\partial x}, \frac{\partial z}{\partial y}$ in one shot by setting $t = z$. That is, $\frac{\partial z}{\partial t} = 1$.
In our equations, we are saying that we know how sensitive the parameter $t$ is to a given output $z$. We are trying to see how sensitive $t$ is to the inputs $x, y$. If $x$ is active (ie, $x > y$), then $t$ is indeed sensitive to $x$ and $\frac{\partial t}{\partial x} = 1$. Otherwise, it is not sensitive, and $\frac{\partial t}{\partial x} = 0$.
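The contrast above can be written down as a minimal sketch (plain Python, not any particular AD library's API):

```python
def forward_max(x, y, dx_dt, dy_dt):
    """Forward mode: given input sensitivities dx/dt, dy/dt, return (z, dz/dt)."""
    z = max(x, y)
    dz_dt = dx_dt if x > y else dy_dt  # z is exactly as sensitive to t as the active input
    return z, dz_dt

def reverse_max(x, y, dt_dz=1.0):
    """Reverse mode: given the output sensitivity dt/dz, return (z, dt/dx, dt/dy)."""
    z = max(x, y)
    dt_dx = dt_dz if x > y else 0.0  # only the active input receives the sensitivity
    dt_dy = dt_dz if y >= x else 0.0
    return z, dt_dx, dt_dy

# Forward mode needs two runs to get both gradients:
_, dz_dx = forward_max(3.0, 1.0, dx_dt=1.0, dy_dt=0.0)  # t = x
_, dz_dy = forward_max(3.0, 1.0, dx_dt=0.0, dy_dt=1.0)  # t = y

# Reverse mode gets both gradients in one run (t = z):
_, dt_dx, dt_dy = reverse_max(3.0, 1.0)
```

Note how `forward_max` must be called once per input of interest, while `reverse_max` produces sensitivities for all inputs in a single pass.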
z = sin(x)
We can compute $\frac{\partial z}{\partial x}$ by setting $t = x$. That is, setting $\frac{\partial x}{\partial t} = 1$.
We can compute $\frac{\partial z}{\partial x}$ by setting $t = z$. That is, setting $\frac{\partial z}{\partial t} = 1$.
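The same two settings for z = sin(x), again as a hedged sketch rather than any library's real API:

```python
import math

def forward_sin(x, dx_dt):
    # Forward mode: dz/dt = cos(x) * dx/dt
    return math.sin(x), math.cos(x) * dx_dt

def reverse_sin(x, dt_dz=1.0):
    # Reverse mode: dt/dx = dt/dz * cos(x)
    return math.sin(x), dt_dz * math.cos(x)

_, dz_dx = forward_sin(0.0, 1.0)  # t = x: dz/dx = cos(0) = 1
_, dt_dx = reverse_sin(0.0)       # t = z: dt/dx = cos(0) = 1
```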
z = x + y
z = xy
z = x  y
There are many introductions to homology on the internet, but none of them really met my criteria for being simple, picture filled, and getting the basic ideas across. I feel that Hatcher might come closest to what I want (and where I originally learnt the material), but their description of homology is surrounded by the context of Algebraic Topology, while really, simplicial homology is accessible to anyone who has seen some linear algebra and group theory. This is my attempt to get the ideas across.
Let’s first try to understand what we’re trying to do here. We want to detect holes in a space, broadly construed. We focus on simplicial complexes, which are collections of triangles and triangle-like objects in higher (and lower) dimensions. We define what holes are for these simplicial complexes, and we then try to find algebraic objects that allow us to “detect” these holes.
A 0-simplex is a point
A 1-simplex is a line
A 2-simplex is a filled triangle
A 3-simplex is a solid tetrahedron
A $k$-dimensional simplex is the convex hull of $k+1$ linearly independent points $(u_i \in \mathbb R^{k+1})$ in $k+1$-dimensional space: $ S_k \equiv \left \{ \sum_i \theta_i u_i ~\mid~ \theta_i \geq 0, ~ \sum_i \theta_i = 1 \right\} $
A simplicial complex $K$ is a collection of simplices where:
Examples of simplicial complexes:
Nonexamples of simplicial complexes are:
Let’s consider the simplest possible case of computing the homology, and as we do so, we will expand on what homology is and what we’re trying to do.
Look at the triangle above. We have the red, green, and blue vertices, which I’ll notate $r, g, b$. We also have the edges that are orange ($o$), cyan ($c$), and magenta ($m$).
What we are interested in doing is to be able to detect the “hole” in the triangle in between the edges $o$, $m$, $c$. That is, we want some algorithm which, when offered the representation of the triangle, can somehow detect the hole. Note that the hole doesn’t really depend on the length of the edges. We can “bend and stretch” the triangle, and the hole will still exist. The only way to destroy the hole is to either cut the triangle, or fill in the triangle.
We first need to agree on an abstract representation of the triangle, which ideally does not change if we were to stretch out the edges, before we can discuss how we can detect the existence of the hole.
We first describe the shape of the triangle in terms of two sets, $E$ and $V$, representing the edges and the vertices, and a function $\partial_{EV}$, called the boundary operator, which tells us how edges are glued to vertices.
We first have a ground set of edges $E \equiv \{o, m, c\}$ and a set of vertices $V \equiv \{r, g, b \}$.
What we now need is to know how the edges are connected to the vertices, since that’s what really makes a triangle. We would like to say something like “the boundary of the edge $o$ has points $r, g$”. In fact, we have slightly more information than that: the orientation. So what we really ought to be saying is “the edge $o$ points from $g$ to $r$”.
To do this, we create a map from $o$ to $r - g$, where we think of $o$ as a “vector”, pointing from $g$ to $r$. But hang on, what is $r - g$? We don’t have a mathematical structure on $V$ that lets us add and subtract vertices. So, we create a new set $\mathcal V$, which represents linear combinations of vertices in $V$. Similarly, anticipating some future development, we also create a new set $\mathcal E$ of linear combinations of edges in $E$.
We define $\mathcal E \equiv \mathbb Z \times \mathbb Z \times \mathbb Z$ that represents linear combinations of edges. For example, $(1, 2, 3) \in \mathcal E$ represents $o + 2m + 3c$ — that is, take 1 copy of the orange edge, 2 copies of the magenta edge, and 3 copies of the cyan edge.
We define $\mathcal V \equiv \mathbb Z \times \mathbb Z \times \mathbb Z$ which represents linear combinations of vertices. For example, $(1, -1, 2) \in \mathcal V$ represents $r - g + 2b$ — that is, take a copy of the red vertex, subtract the green vertex, and add two copies of the blue vertex.
The boundary operator $\partial_{EV}: \mathcal{E} \rightarrow \mathcal V$ is depicted in the picture. This operator sends edges to their boundary, and is therefore called the boundary operator. The boundary of an edge describes the edge in terms of vertices, just like we would describe a direction vector (to borrow physics parlance) by subtracting points.
The action of the operator on a linear combination of edges is:
Now, notice that to traverse the cycle, we should traverse the orange edge, then the magenta edge, then the cyan edge, in that direction. That is, the cycle can be thought of as $o + m + c$. However, how do we detect this cycle? The key idea is that if we look at the image of the cycle $o + m + c$ under the boundary operator $\partial_{EV}$, we will get $0$! For us to have completed a cycle, we must have both entered and exited each vertex, so the total sum must be $0$.
Formally:
This is very nice, since we have converted the topological invariant of a hole in the space into an algebraic invariant of “linear combination of edges that map to 0”. That is, we want to consider all those loops that belong to the kernel of $\partial_{EV}$. (Terminology: the kernel of a linear transformation is the set of all things in the domain which map to zero.)
So, we define (tentatively) the first homology group:
If we try to compute this, we will have to have:
So, we know that we have a $\mathbb Z$ worth of cycles in our triangle, which makes sense: We can go clockwise (positive numbers) and counterclockwise (negative numbers) around the triangle, and we can go as many times as we wish, so we have $\mathbb Z$ as the number of cycles.
that is, it’s the linear combination of edges that map to zero through the boundary map. Note that this also includes combinations such as two loops around the triangle, such as $o + m + c + o + m + c$.
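The boundary operator can be written down as an integer matrix, which makes the “cycle detection” computation concrete. In this sketch the edge orientations are an assumption, chosen consistently with $\partial o = r - g$ so that the loop closes: $o: g \to r$, $m: r \to b$, $c: b \to g$.

```python
import numpy as np

# Boundary matrix d_EV: rows index vertices (r, g, b), columns index edges (o, m, c).
# Column j holds the boundary of edge j as a vertex combination.
boundary_EV = np.array([
    [ 1, -1,  0],   # r
    [-1,  0,  1],   # g
    [ 0,  1, -1],   # b
])

cycle = np.array([1, 1, 1])        # the loop o + m + c
print(boundary_EV @ cycle)         # every vertex is entered and exited: boundary is zero

not_a_cycle = np.array([1, 0, 0])  # the edge o alone
print(boundary_EV @ not_a_cycle)   # nonzero boundary r - g: not a cycle
```

Note that each column of the matrix sums to zero: that is exactly the “entered and exited each vertex” bookkeeping from above.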
In this case, notice that the triangle is filled with a face $f$. Therefore, the “hole” that we had previously is now filled up, and does not count anymore. So, we now need to amend our previous definition of $H_1$ to kill the hole we had detected.
The idea is that the hole we had previously is now the boundary of the new face $f$. Since it is the boundary of a “filled in” region, it does not count anymore, since we can “shrink the hole” across the face to make it a nonloop. Hence, we need to quotient our $H_1$ with the boundary of the face.
Formally, what we do is we create another group $\mathcal F \equiv \mathbb Z$, which counts copies of our face $f$, and we define another boundary operator, such that the boundary of the face $f$ is $o + m + c$.
Now, we should notice that the image of $\partial_{FE}$ is a loop $(o + m + c)$, which lies in the kernel of $\partial_{EV}$. This is a general feature of homology, so it bears repeating:
Now, since the image of $\partial_{FE}$ lies entirely in the kernel of $\partial_{EV}$, we can construct $H_1$ as:
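We can check the quotient concretely by counting dimensions over the rationals (this gives the Betti number; the integer story with torsion needs Smith normal form, which I skip here). The edge orientations are the same assumption as before.

```python
import numpy as np

# rows: vertices (r, g, b); columns: edges (o, m, c); assumed orientations
# o: g->r, m: r->b, c: b->g.
boundary_EV = np.array([[1, -1, 0], [-1, 0, 1], [0, 1, -1]])
# rows: edges (o, m, c); the single column is the face f, with boundary o + m + c.
boundary_FE = np.array([[1], [1], [1]])

dim_cycles = 3 - np.linalg.matrix_rank(boundary_EV)  # dim ker d_EV: 1 (spanned by o+m+c)
dim_boundaries = np.linalg.matrix_rank(boundary_FE)  # dim im d_FE: 1 (also o+m+c)
dim_H1 = dim_cycles - dim_boundaries
print(dim_H1)  # the lone cycle is now a boundary, so H1 vanishes for the filled triangle
```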
Once again, we have our humble triangle with vertices $V = \{r, g, b\}$, edges $E = \{o, m, c \}$, faces $F = \{ f \}$ with boundary maps $\partial_{EV}$, $\partial_{FE}$:
We define a function $h_v: V \rightarrow \mathbb R$ on the vertices as:
We now learn how to extend this function to the higher dimensional objects, the edges and the faces of the triangle.
To extend this function to the edges, we define a new function:
Expanded out on the example, we evaluate $d h_v$ on each edge as:
More conceptually, we have created an operator called $d$ (the coboundary operator) which takes functions defined on vertices to functions defined on edges. This uses the boundary map on the edges to “lift” a function on the vertices to a function on the edges. It does so by assigning the “potential difference” of the vertices to the edges.
We can repeat the construction we performed above, to construct another operator $d : (E \rightarrow \mathbb R) \rightarrow (F \rightarrow \mathbb R)$, defined in exactly the same way as we did before. For example, we can evaluate:
What we have is a chain:
Where we notice that $d^2 = d \circ d = 0$, since the function $h_f$ that we have gotten evaluates to zero on the face $f$. We can prove this will happen in general, for any choice of $h_v$. (it’s a good exercise in definition chasing).
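Numerically, the coboundary $d$ is just the transpose of the boundary matrix, and $d^2 = 0$ falls out for any potential $h_v$. A sketch, with the same assumed edge orientations as earlier ($o: g \to r$, $m: r \to b$, $c: b \to g$):

```python
import numpy as np

boundary_EV = np.array([[1, -1, 0], [-1, 0, 1], [0, 1, -1]])  # vertices x edges
boundary_FE = np.array([[1], [1], [1]])                       # edges x faces

# The coboundary lifts a function on vertices to a function on edges by
# assigning the "potential difference" across each edge, and likewise edges -> faces.
d_VE = boundary_EV.T   # (V -> R)  to  (E -> R)
d_EF = boundary_FE.T   # (E -> R)  to  (F -> R)

h_v = np.array([3.0, 1.0, 2.0])   # an arbitrary potential on (r, g, b)
h_e = d_VE @ h_v                  # potential differences on (o, m, c)
h_f = d_EF @ h_e                  # value on the face f
print(h_f)                        # zero: d(d(h_v)) = 0 for any h_v
```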
Introducing some terminology: a differential form $f$ is said to be a closed differential form iff $df = 0$.
In our case, $h_e$ is closed, since $d h_e = h_f = 0$. On the other hand $h_v$ is not closed, since $d h_v = h_e \neq 0$.
The intuition for why this is called “closed” is that its coboundary vanishes.
Here, we try to understand what functions defined on the edges can look like, and their relationship with the $d$ operator. We discover that there are some functions $g_e: E \rightarrow \mathbb R$ which can be realised as the differential of another function $g_v: V \rightarrow \mathbb R$. Differential forms such as $g_e$, which can be generated from a $g_v$ through the $d$ operator, are called exact differential forms. That is, $g_e = d g_v$ exactly, such that there is no “remainder term” on applying the $d$ operator.
We take an example of a differential form that is not exact, which has been defined on the edges of the triangle above. Let’s call it $h_e$.
It is defined on the edges as:
We can calculate $h_f = d h_e$ the same way we had before:
Since $d h_e \neq 0$, this form is not exact.
Let’s also try to generate $h_e$ from a potential. We arbitrarily fix the potential of $b$ to $0$. That is, we fix $h_v(b) = 0$, and we then try to see what values of $h_v$ we are forced to assign across the rest of the triangle.
Hence, there can exist no such $h_v$ such that $h_e \equiv d h_v$. The interesting thing is, when we started out by assigning $h_v(b) = 0$, we could make local choices of potentials that seemed like they would fit together, but they failed to fit globally throughout the triangle. This failure of locally consistent choices to be globally consistent is the essence of cohomology.
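The failed potential-assignment walk can be scripted. The actual values of $h_e$ live in the blog's figure, so the values below are hypothetical, chosen with a nonzero sum around the loop; the orientations are the same assumption as before ($o: g \to r$, $m: r \to b$, $c: b \to g$), and exactness would demand $h_e(\text{edge}) = h_v(\text{head}) - h_v(\text{tail})$.

```python
# Hypothetical closed-loop data: sum around the loop is 3, not 0.
h_e = {"o": 1.0, "m": 1.0, "c": 1.0}

h_v = {"b": 0.0}                  # arbitrarily fix the potential at b
h_v["g"] = h_v["b"] + h_e["c"]    # edge c: b -> g forces h_v(g) = 1
h_v["r"] = h_v["g"] + h_e["o"]    # edge o: g -> r forces h_v(r) = 2

# Edge m: r -> b now demands h_v(b) - h_v(r) == h_e["m"], i.e. -2 == 1.
# Each local choice was forced and consistent, but the loop cannot close:
print(h_v["b"] - h_v["r"], "!=", h_e["m"])
```

Equivalently: the sum of $h_e$ around the loop is the total “potential drop”, which must be zero for any exact form, and here it is not.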
Here, we have vertices $V \equiv \{ r, g, b, b, p \}$, edges $E \equiv \{rb, gr, bg, m, o, c \}$ and faces $F \equiv \{ f \}$.
Here, we see a differential form $h_e$ that is defined on the edges, and also obeys the equation $dh_e = 0$ (Hence is closed). However, it does not have an associated potential energy to derive it from. That is, there cannot exist a certain $h_v$ such that $d h_v = h_e$.
So, while every exact form is closed, not every closed form is exact.
Hence, this $h_e$ that we have found is a nontrivial element of $Kernel(d_{FE}) / Image(d_{EV})$: since $dh_e = 0$, we have $h_e \in Kernel(d_{FE})$, while there does not exist an $h_v$ such that $d h_v = h_e$, so it is not quotiented away by the image of $d_{EV}$.
So the failure of the space to be fully filled in (ie, the space has a hole), is measured by the existence of a function $h_e$ that is closed but not exact!
This reveals a deep connection between homology and cohomology, which is made explicit by the Universal Coefficient Theorem
I write these retrospective blog posts every year since 2017. I tend to post a collection of papers, books, and ideas I’ve stumbled across that year. Unfortunately, this year the paper list will be sparser, since I lost some data along the way during the year, and hence I don’t have links to everything I read. So this is going to be a sparser list, consisting of things that I found memorable.
I also reorganised my website, letting the link die, since keeping it up was taking far too many cycles (in particular, CertBot was far too annoying to maintain, and the feedback of hugo was also very annoying). I now have a single file, the README.md of the bollu/bollu.github.io repo, to which I add notes on things I find interesting. I’ve bound the i alias (for idea) on all my shells everywhere, to open the README.md file, wait for me to write to it, run a git commit so I can type out a commit, and then push. This has been massive for what I manage to write down: I feel like I’ve managed to write down a lot of one-liners / small facts that I’ve picked up which I would not otherwise. I’m attempting to similarly pare down other friction-inducing parts of my workflow. Suggestions here would be very welcome!
If there’s a theme of the year (insofar as my scattered reading has a theme…), it’s “lattices and geometry”. Geometry in terms of differential geometry, topology, and topoi. Lattices in the sense of a bunch of abstract interpretation and semantics.
My course work was less interesting to me this time, due to the fact that I had chosen to study some wild stuff earlier on, and now have to take reasonable stuff to graduate. However, there were courses that filled in a lot of gaps in my self-taught knowledge for me, and the ones I listed were the top ones in that regard.
I wound up reading Boyd on optimisation theory, Nielsen and Chuang for quantum computation, where I also solved a bunch of exercises in Q# which was very fun and rewarding. I’m beginning to feel that learning quantum computation is the right route to grokking things like entanglement and superposition, unlike the physics which is first of all much harder due to infinite dimensionality, and less accessible since we can’t program it.
My research work is on the above topics, so I try to stay abreast of what’s going on in the field. What I’ve read over the past year on these topics is:
A^2I: meta-abstract interpretation. This paper extends the theory of abstract interpretation to perform abstract interpretation on program analyses themselves. I’m not sure how useful this is going to be, as I still hold on to the belief that AI as a framework is too general to allow one to prove complex results. But I am still interested in trying to adapt this to some problems I have at hand. Perhaps it’s going to work.
Cubical Agda. This paper introduces cubical type theory and its implementation in Agda. It appears to solve many problems that I had struggled with during my formalization of loop optimisations: in particular, dealing with coinductive types in Coq, and that of defining quotient types / setoids. Supposedly, cubical Agda makes dealing with coinduction far easier. It allows the creation of “real” quotient types that respect equality, without having to deal with setoid-style objects that make for large Gallina terms. I don’t fully understand how the theory works: in particular, as far as I can tell, the synthetic interval type I allows one to only access the start and end points (0 and 1), but not anything in between, so I don’t really see how it allows for interpolation. I also don’t understand how this allows us to make Univalence computable. I feel I need to practice with this new technology before I’m well versed, but it’s definitely a paper I’m going to read many, many times till I grok it.
Naive Cubical type theory. This paper promises a way to perform informal reasoning with cubical type theory, the way we are able to do so with, say, a polymorphic type theory for lambda calculus. The section names, such as “how do we think of paths” and “what can we do with paths”, inspire confidence.
Call by need is Clairvoyant call by value. The key insight is to notice that call by need is “just” call by value, when we evaluate only those values that are eventually forced, and throw away the rest. Thus, if we had an oracle that tells us which values are eventually forced, we could convert call by need into call by value, relative to this oracle. This cleans up many proofs in the literature, and might make it far more intuitive to teach call by need to people as well. Slick paper; I personally really enjoyed reading this.
Shift/Reset the Penultimate Backpropagator This paper describes how to implement backprop using delimited continuations. Also, supposedly, using staging / building a compiler out of this paradigm allows one to write high performance compilers for backprop without having to suffer, which is always nice.
Closed forms for numerical loops This paper introduces a new algebra of polynomials with exponentials. It then studies the eigenvalues of the matrix that describes the loop, and tries to find closed forms in terms of polynomials and exponentials. They choose to only work with rationals, but not extensions of rational numbers (in terms of field extensions of the rationals). Supposedly, this is easier to implement and reason about. Once again, this is a paper I’d like to reimplement to understand fully, but the paper is welldone!
Composable, sound transformations of nested recursion and loops. This paper attempts to bring ideas from polyhedral compilation into working with nested recursion. They create a representation using multi-tape finite automata, using which they provide a representation for nested recursion. I was somewhat disappointed that it does not handle mutual recursion, since my current understanding is that one can always convert nested recursion into a “reasonable” straight-line program by simply inlining calls and then reusing polyhedral techniques.
Reimplementation of STOKE at bollu/blaze. I reimplemented the STOKE: stochastic superoptimisation paper, and much to my delight, it was super effective at regenerating common compiler transformations. I want to use this to generate loop optimisations as well, by transforming a polyhedral model of the original program.
I really enjoyed my time at Tweag! It was fun, and Shao Cheng was a great mentor. I must admit that I was somewhat distracted by all the new and shiny things I was learning thanks to all the cool people there :) In particular, I wound up bugging Arnaud Spiwack, Simeon Carstens, and Matthias Meschede quite a bit, about type theory, MCMC sampling, and signal processing of storm clouds.
I wound up reading a decent chunk of GHC source code, and while I can’t link to specifics here, I understood a lot of the RTS much better than I did before. It was an enlightening experience, to say the least, and being paid to hack on a GHC backend was a really fun way to spend the summer.
It also led me to fun discoveries, such as how does one debug debug info?
I also really loved Paris as a city. My AirBnb host was a charming artist who suggested spots for me around the city, which I really appreciated. Getting around was disorienting for the first week or so, due to the fact that I could not (and still do not) really understand how to decide in which direction to walk inside the subways to find a particular line going in a particular direction.
The city has some great spots for quiet work, though! In particular, the Louvre Anticafe was a really nice place to hang out and grab coffee. The model is great: you pay for hours spent at the Anticafe, with coffee and snacks free. They also had a discount for students which I gratefully used. I bumped into interesting artists, programmers, and students who were open for conversation there. I highly recommend hanging out there.
This was the first talk I’d ever given, and it was on probabilistic programming in Haskell. In particular, I explained the monad-bayes approach of doing this, and why this was profitable. The slides are available here.
It was a fun experience giving a talk, and I’d like to do more of it, since I got a lot out of attempting to explain the ideas to people. I wish I had more time, and had a clearer idea of who the audience was. I got quite a bit of help from Michael Snoyman to whip the talk into shape, which I greatly appreciated.
The major ideas of probabilistic programming as I described it are from Adam Scibior’s thesis:
Along the way, I and others at Tweag read the other major papers in the space, including:
Church, a language for generative models, which is nice since it describes its semantics in terms of sampling. This is unlike Adam’s thesis, where they define the denotational semantics in terms of measure theory, which is then approximated by sampling.
Riemann Manifold Langevin and Hamiltonian Monte Carlo, which describes how to perform Hamiltonian Monte Carlo on the information geometry manifold. So, for example, if we are trying to sample from gaussians, we sample from a 2D Riemannian manifold with parameters mean and variance, and the Fisher information metric as the metric. This is philosophically the “correct” manifold to sample from, since it represents the intrinsic geometry of the space we want to sample from.
An elementary introduction to Information Geometry by Frank Nielsen: something I stumbled onto as I continued reading about sampling from distributions. The above description about the “correct” manifold for gaussians comes from this branch of math, but generalises it quite a bit further. I’ve tried to reread it several times as I gradually gained maturity in differential geometry. I can’t say I understand it just yet, but I hope to do so in a couple of months. I need more time for sure to meditate on the objects.
Reimplementation of monad-bayes. This repo holds the original implementation on which the talk is based. I read through the monad-bayes source code, and then reimplemented the bits I found interesting. It was a nice exercise, and you can see the git history tell a tale of my numerous misunderstandings of MCMC methods, till I finally got what the hell was going on.
Since we use a bunch of Presburger arithmetic for polyhedral compilation, which is a large research interest of mine, I’ve been trying to build a “complete” understanding of this space. So this time, I wanted to learn how to build good solvers:
bollu/gutenberger is a decision procedure for Presburger arithmetic that exploits their encoding as finite automata. One thing that I was experimenting with was that we only use numbers of finite bitwidth, so we can explore the entire state space of the automata and then perform NFA reduction using DFA minimisation. The reference I used for this was the excellent textbook Automata theory: An algorithmic approach, Chapter 10.
The taming of the semilinear set. This uses a different encoding of Presburger sets, which allows them to bound a different quantity (the norm) rather than the bitwidth of the descriptions. This allows them to compute exponentially better bounds for some operations than were known before, which is quite cool. This is a paper I keep trying to read and failing due to density. I should really find a week away from civilization to just plonk down and meditate upon this.
I want better references on regenerating the inequality description from a given automaton which accepts a Presburger set. This would allow one to smoothly switch between the geometric description and the algebraic description. There are some operations that only work well on the geometry (such as optimisation), and others that only work well on the algebraic description (such as state-space minimisation). I have not found any good results for this, only scattered fragments of partial results. If nothing else, I would like some kind of intuition for why this is hard.
Having taken my stab at it, the general impression that I have is that the space of automata is much larger than the things that can be encoded as Presburger sets. Indeed, it was shown that automata can accept sets of numbers which are not ultimately periodic, while Presburger-definable sets are.
But yes, it’s known that automata accept a language that’s broader than just first order logic + “arithmetic with +”, which means it’s hard to disentangle the Presburger bits from the non-Presburger bits of the automata.
I wanted to get a better understanding of how prolog works under the hood, so I began reimplementing the WAM: Warren abstract machine. It’s really weird; this is the only stable reference I can find to implementing high-performance prolog interpreters. I don’t really understand how to chase the paper trail in this space, and I’d greatly appreciate references. My implementation is at bollu/warrencpp. Unfortunately, I had to give up due to a really hard-to-debug bug.
It’s crazy to debug this abstract machine, since the internal representation gets super convoluted and hard to track, due to the kind of optimised encoding it uses on the heap.
If anyone has a better/cleaner design for implementing good prologs, I’d love to know.
Another fun paper I found in this space thanks to Edward Kmett was the Rete matching algorithm, which allows one to declare many many pattern matches, which are then “fused” together into an optimal matcher that tries to reuse work across failed matchers.
This was on my “list of things I want to understand before I die”, so I wound up taking up an Independent Study in university, which basically means that I study something on my own, and visit a professor once every couple weeks, and am graded at the end of the term. For GR, I wound up referencing a wide variety of sources, as well as a bunch of pure math diffgeo books. I’ve read everything referenced to various levels. I feel I did take away the core ideas of differential and Riemannian geometry. I’m much less sure I’ve grokked general relativity, but I can at least read the equations and I know all the terms, so that’s something.
The Theoretical Minimum by Leonard Susskind. The lectures are breezy in style, building up the minimal theory (and no proofs) for the math, with a bunch of lectures spent analysing the physics. While I wish it were a little more proof-heavy, it was a really great reference to learn the basic theory! I definitely recommend following this and then reading other books to fill in the gaps.
Gravitation by Misner, Thorne, and Wheeler. This is an imposing book. I first read through the entire thing (well, the parts I thought I needed), to be able to get a vague sense of what they’re going for. They’re rigorous in a very curious way: the book has a bunch of great physics perspectives of looking at things, and that was invaluable to me. Their view of forms as “slot machines” is also fun. In general, I found myself repeatedly consulting this book for the “true physical” meaning of a thing, such as curvature, parallel transport, the equation of a geodesic, and whatnot.
Differential Geometry of Curves and Surfaces by do Carmo. This is the best introduction to differential geometry that I found. It throws away all of the high-powered definitions that “modern” treatments offer, and starts from the ground up, building up the theory in 2D and 3D. This is amazing, since it gives you small, computable examples for things like “the Jacobian represents how tangents on a surface are transformed locally”.
Symplectic geometry & classical mechanics by Tobias Osborne. This lecture series was great, since it redid a lot of the math I’d seen in a more physicist style, especially around vector fields, flows, and Lie brackets. Unfortunately for me, I never even got to the classical mechanics part by the time the semester ended. I began taking down notes in my repo, which I plan to complete.
Introduction to Smooth Manifolds by John Lee. This was a very well-written mathematical introduction to differential geometry. However, it gets to the physically important bits (metrics, covariant derivatives) far later, so I mostly used it as a reference for problems and more rigour.
Einstein’s original paper introducing GR, translated, finally made it click as to why he wanted to use tensor equations: tensor equations of the form T = 0 are invariant in any coordinate system, since on change of coordinates, T changes by a multiplicative factor! It’s a small thing in hindsight, but it was nice to see it explicitly spelled out, since as I understand, no one among the physicists knew tensor calculus at the time, so he had to introduce all of it.
I can’t recall how I ran across this: I think it was because I was trying to get a better understanding of Cohomology, which led me to Google for “computational differential geometry”, that finally led me to Discrete differential geometry.
It’s a really nice collection of theories that show us how to discretize differential geometry in low dimensions, leading to rich intuitions and a myriad of applications for computer graphics.
The textbook by Keenan Crane on the topic, which I read over the summer when I was stuck (more often than I’d like) in the Paris metro. The book is very accessible, and requires just some imagination to grok. Discretizing differential geometry leads to most things being linear algebra, which means one can calculate things on paper easily. That’s such a blessing.
Geodesics in Heat explores a really nice way to discover geodesics by simulating the heat equation for a short time. The intuition is that we should think of the heat equation as describing the evolution of particles that are performing random walks. Now, if we simulate this system for a short while and then look at the distribution, particles that reach a particular location on the graph must have taken the shortest path, since any longer path would not have allowed particles to reach there. Thus, the distribution of particles at time dt does truly represent distances from a given point. The paper explores this analogy to find accurate geodesics on complex computational grids, aided by the use of differential geometry, appropriately discretized.
The vector heat method explores computing the parallel transport of a vector across a discrete manifold efficiently, borrowing techniques from the ‘Geodesics in Heat’ paper.
Another paper by Keenan Crane: Lie group integrators for animation and control of vehicles. This paper describes a general recipe to tailor-make integrators for a system of constraints, by directly integrating over the Lie group of the configuration space. This leads to much more stable integrators. I have some misguided hope that we can perhaps adapt these techniques to build better FRP (functional reactive programming) systems, but I need to meditate on this a lot more to say anything definitively.
It was Arnaud Spiwack who pointed me to this. It’s a nice axiomatic system of differential geometry, where we can use physicist-style proofs of “taking stuff up to order dx”, and have everything work up to mathematical rigour.
The TL;DR is that we want to add a new number called dx into the reals, such that dx^2 = 0. But if such a number did exist, then clearly dx = 0. However, the punchline is that to prove that dx^2 = 0 => dx = 0 requires the use of contradiction!
So, if we banish the law of excluded middle (and therefore no longer use proof by contradiction), we are able to postulate the existence of a new element dx, which obeys dx^2 = 0. Using this, we can build up the whole theory of differential geometry in a pleasing way, without having to go through the horror that is real analysis. (I am being hyperbolic, but really, real analytic proofs are not pleasant.)
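A classical, computable shadow of this nilpotent dx is the dual numbers $a + b\varepsilon$ with $\varepsilon^2 = 0$. (They are not a model of full smooth infinitesimal analysis, which needs the intuitionistic setting, but the arithmetic flavour is the same, and they mechanize the “work up to order dx” style of proof.)

```python
class Dual:
    """Dual numbers a + b*eps, where eps**2 = 0 by construction."""
    def __init__(self, a, b=0.0):
        self.a, self.b = a, b
    def __add__(self, other):
        return Dual(self.a + other.a, self.b + other.b)
    def __mul__(self, other):
        # (a1 + b1*eps)(a2 + b2*eps) = a1*a2 + (a1*b2 + b1*a2)*eps; the eps^2 term vanishes.
        return Dual(self.a * other.a, self.a * other.b + self.b * other.a)

eps = Dual(0.0, 1.0)
sq = eps * eps
print(sq.a, sq.b)       # eps squares to zero, yet eps itself is not zero

# f(x + eps) = f(x) + f'(x)*eps: e.g. f(x) = x^2 at x = 3 recovers f'(3) = 6.
x = Dual(3.0, 1.0)
print((x * x).b)
```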
I began formalizing this in Coq and got a formalism going: bollu/diffgeo
.
Once I was done with that, I realised I don’t know how to exhibit models of the damn thing! So, reading up on that made me realise that I need around 8 chapters worth of a grad level textbook (the aptly named Models of Smooth Infinitesimal Analysis).
I was disheartened, so I asked on MathOverflow
(also my first ever question there), where I learnt about tangent categories and
differential lambda calculus. Unfortunately, I don’t have the bandwidth to read
another 150-page tome, so this has languished.
I began reading Absil: Optimisation on Matrix Manifolds, which describes how to perform optimisation / gradient descent on arbitrary Riemannian manifolds, as well as closed forms for well-known manifolds. The exposition in this book is really good, since it picks a concrete manifold and churns out all its basic properties manually. The only problem I had with the book was that there were quite a few gaps (?) in the proofs – perhaps I missed a bunch.
This led me to learn Lie theory to some degree, since that was the natural setting for many of the proofs. I finally saw why anyone gives a shit about the tangent space at the identity: because it's easier to compute! For a flavour of this, consider this question on math.se by me that asks about computing tangent spaces of $O(n)$.
I attended the AI risk for computer scientists workshop hosted by MIRI (Machine intelligence research institute) in December. Here, a bunch of people were housed at a bed & breakfast for a week, and we discussed AI risk, why it’s potentially the most important thing to work on, and anything our hearts desired, really. I came away with new branches of math I wanted to read, a better appreciation of the AI risk community and a sense of what their “risk timelines” were, and some explanations about sheaves and number theory that I was sorely lacking. All in all, it was a great time, and I’d love to go back.
While I was on a particularly rough flight back from the USA to India when coming back from the AIRCS workshop, I began to read the textbook Introduction to p-adic Numbers by Fernando Gouvea, which fascinated me, so I then wrote up the cool parts introduced in the first two chapters as a blog post. I wish to learn more about the p-adics and p-adic analysis, since they seem to be deep objects in number theory.
In particular, a question that I thought might have a somewhat trivial answer (why do the p-adics use base p in defining the norm) turned out to have answers that were quite deep, which was something unexpected and joyful!
Both of these describe a general method to transfer all topological ideas as statements about computability in a way that’s far more natural (at least for me, as a computer scientist). The notion of what a continuous function “should be” (keeping inverse images of open sets open) arises naturally from this computational viewpoint, and many proofs of topology amount to finding functional programs that fit a certain type. It’s a great read, and I feel gave me a much better sense of what topology is trying to do.
I’ve wanted to understand philosophy as a whole for a while now, at least enough to get a general sense of what happened in each century. The last year, I meandered through some philosophy of science, which led me to some really wild ideas (such as that of Paul Feyerabend’s ‘science as an anarchic enterprise’ which I really enjoyed).
I also seem to get a lot more out of audio and video than text in general, so I’ve been looking for podcasts and video lectures. I’ve been following:
The history of philosophy without any gaps for a detailed exposition on, say, the Greeks, or the Arabic philosophers. Unfortunately, this podcast focuses on far too much detail for me to have been able to use it as a way to get a broad idea about philosophy in itself.
Philosophize This! by Stephen West is a good philosophy podcast for a broad overview of different areas of Philosophy. I got a lot out of this, since I was able to get a sense of the progression of ideas in (Western) Philosophy. So I now know what Phenomenology is, or what Foucault was reacting against.
I also attempted to read a bunch of philosophers, but the only ones I could make a good dent on were the ones listed below. I struggled in writing this section, since it’s much harder to sanity check my understanding of philosophy, versus mathematics, since there seems to be a range of interpretations of the same philosophical work, and the general imprecise nature of language doesn’t help here at all. So please take all the descriptions below with some salt to taste.
Discipline and Punish by Michel Foucault Here, Foucault traces the history of the criminal justice system of society, and how it began as something performed ‘on the body’ (punishment), which was then expanded to a control ‘of the mind’ (reform). As usual, the perspective is fun, and I’m still going through the book.
Madness and Civilization by Michel Foucault which attempts to chronicle how our view of madness evolved as society did. It describes how madmen, who were on the edges of society, but still "respected" (for example, considered as 'being touched by the gods') were pathologized by the Renaissance, and were seen as requiring treatment. I'm still reading it, but it's enjoyable so far, as a new perspective for me.
The value of science by Henri Poincare. Here, he defends the importance of experimentation, as well as the value of intuition to mathematics, along with the importance of what we consider formal logic. It’s a tough read sometimes, but I think I got something out of it, at least in terms of perspective about science and mathematics.
I've been on a quest to understand information theory far better than I currently do. In general, I feel like this might be a much better way to internalize probability theory, since it states probabilistic objects in terms of "counting" / "optimisation of encodings", which is a perspective I find far more natural.
Towards this aim, I wound up reading:
Information Theory, Inference, and Learning Algorithms This book attempts to provide the holistic view I was hoping for. It has great illustrations of the basic objects of information theory. However, I was hoping that the three topics would be more "unified" in the book, rather than being presented as three separate sections with some amount of back-and-forth referencing among them. Even so, it was a really fun read.
I don’t understand the trifecta of sheaves, topoi, geometry, and logic, and I’m trying to attack this knot from multiple sides at once.
All of these provide geometric viewpoints of what sheaves are, in low-dimensional examples of graphs which are easy to visualize. I'm also trudging through the tome:
which appears to follow the “correct path” of the algebraic geometers, but this requires a lot of bandwidth.
This is a hardcore algebraic geometry textbook, and is arguably great for studying sheaves because of it. Sheaves are Chapter 2, which lets one see them developed in their "true setting", as it were, in that Grothendieck first invented sheaves for algebraic geometry, so it's good to see them in the home they were born in. Once again, this is a book I lack the bandwidth for, except to breezily read it as I go to bed. I did get something out of doing this. I'm considering taking this book up as an independent study, say the first four chapters. I'll need someone who knows algebraic geometry to supervise me, though, which is hard to find in an institute geared purely towards computer science. (If anyone on the internet is kind enough to volunteer some of their time to answer questions, I'll be very glad! Please email me at rot13(fvqqh.qehvq@tznvy.pbz))
This section contains random assortments that I don't recall how I stumbled across, but which are too cool not to include on the list. I usually read these in bits and pieces, or skim them as bedtime reading. I find that skimming such things tells me about tools I would not have known about otherwise. I like knowing of the existence of things, even if I don't recall the exact details, since knowing that something like X exists has saved me from having to reinvent X from scratch.
Group Theory: Birdtracks, Lie's, and Exceptional Groups by Predrag Cvitanovic is an exposition of Lie theory using some notation called "birdtrack notation", which is supposedly a very clean way of computing invariants, inspired by Feynman notation. The writing style is informal and pleasant, and I decided to save the book purely because the first chapter begins with "Basic Concepts: A typical quantum theory is constructed from a few building blocks…". If a book considers building quantum theories as its starting point, I really want to see where it goes.
Elementary Applied Topology by Robert M. Ghrist I wouldn't classify the book as elementary, because it skims over too much to be useful as a reference, but it's great to gain an intuition for what, say, homology or cohomology is. I am currently reading the section on sheaf theory, and I'm getting a lot out of it, since it describes how to write down, say, max-flow min-cut or Nyquist-Shannon in terms of sheaves. I don't grok it yet, but even knowing this can be done is very nice. The book is a wonderful walkthrough in general.
On polysemous mathematical illustration by Robert M. Ghrist This is a talk on the wonderful illustrations by the above author, about the different types of mathematical illustrations one can have, and different "levels of abstraction".
Mathematical Impressions: The illustrations of A. T. Fomenko These are beautiful illustrated pictures of various concepts in math, which tend to evoke the feeling of the object, without being too direct about it. For example, consider "gradient descent" below. I highly recommend going through the full gallery.
Gradient Descent <img width=200 height=200 src="http://chronologia.org/art/math/123a176.jpg">
Topological Zoo <img width=200 height=200 src="http://chronologia.org/art/math/077a011.jpg">
Persistent Homology in Multivariate Data Visualization This is the PhD dissertation of Bastian Rieck, who's now a postdoc at ETH. I deeply enjoyed reading it, since it pays a lot of attention to the design of analyses, and how to interpret topological data. I really enjoyed getting a good sense of how one can use persistent homology to understand data, and the tradeoffs between the Vietoris-Rips complex and the Čech complex.
An introduction to Geometric algebra I fell in love with geometric algebra, since it provides a really clean way to talk about all possible subspaces of a given vector space. This provides super slick solutions to many geometry and linear algebra problems. The way I tend to look at it is that when one does linear algebra, there’s a strict separation between “vectors” (which are elements of the vector space), and, say, “hyperplanes” (which are subspaces of the vector space), as well as objects such as “rotations” (which are operators on the vector space). Geometric algebra provides a rich enough instruction set to throw all these three distinct things into a blender. This gives a really concise language to describe all phenomena that occurs in the vector space world — which, let’s be honest, is most tractable phenomena! I had a blast reading about GA and the kinds of operators it provides.
Circuits via Topoi. This paper attempts to provide an introduction to topos theory by providing a semantics for both combinational and sequential circuits under a unifying framework. I keep coming back to this article as I read more topos theory. Unfortunately, I’m not “there yet” in my understanding of topoi. I hope to be next year!
Fearless Symmetry This is definitely my favourite nonfiction book that I've read in 2019, hands down. The book gives a great account of the mathematical objects that went into Wiles' proof of Fermat's Last Theorem. It starts with things like "what is a permutation" and ends at questions like "what's a reciprocity law" or "what's the absolute Galois group". While at points, I do believe the book goes far too rapidly, all in all, it's a solid account of number theory that's distilled, but not in any way diluted. I really recommend reading this book if you have any interest in number theory (or, like me, a passing distaste due to a course on elementary number theory I took, with proofs that looked very unmotivated). This book made me decide that I should, indeed, definitely learn algebraic number theory, up to at least Artin reciprocity.
Remembrance of Earth's Past trilogy by Liu Cixin While I would not classify this as "mindblowing" (which I do classify Greg Egan books as), they were still a solidly fun read into how humanity would evolve and interact with alien races. It also poses some standard solutions to the Fermi paradox, but it's done well. I felt that the fact that it was translated was painfully obvious in certain parts of the translation, which I found quite unfortunate. However, I think book 3 makes up in grandeur for whatever was lost in translation.
Walkaway by Cory Doctorow The book is set in a dystopian nightmare, where people are attempting to "walk away" from society and set up communes, where they espouse having a post-scarcity style economy based on gifting. It was a really great description of what such a society could look like. I took issue with some weird love-triangle-like shenanigans in the second half of the book, but the story arc more than makes up for it. Plus, the people throw a party called a "communist party" in the first page of the book, which grabbed my attention immediately!
PURRS: Parma University Recurrence Relation Solver I wanted better tactics for solving recurrences in Coq, which led me into a rabbit hole of the technology of recurrence relation solving. This was the newest stable reference to a complete tool that I was able to find. Their references section is invaluable, since it’s been vetted by them actually implementing this tool!
Term rewriting and all that. I read this book purely for its description of Groebner bases and the Buchberger algorithm in a way that made sense for the first time. I've written about this more extensively before, so I'm not going to repeat myself here. In general, I think it's a great book that's worth reading, if nothing else, for at least the chapter on Groebner bases.
Lucid: The dataflow programming language This document is the user manual of Lucid. I didn't fully understand the book, but what I understood as their main argument is that full access to looping is unnecessary to perform most of the tasks that we do. Rather, one can provide a "rich enough" set of combinators to manipulate streams that allows one to write all programs worthy of our interest.
Bundle Adjustment — A Modern Synthesis I learnt about bundle adjustment from a friend taking a course on robotics. The general problem is to reconstruct the 3D coordinates of a point cloud given 2D projections of the points and the camera parameters, as the camera moves in time. I found the paper interesting since it winds up invoking a decent amount of differential geometric and gauge theoretic language to describe the problem at hand. I was unable to see why this vocabulary helped in this use case, but perhaps I missed the point of the paper. It was hard to tell.
I always feel a little wrong posting this at the end of every year, since I feel that among the things I cover under "read", I've internalized some things far better than others: for example, I feel I understand Riemannian geometry far better than I do General Relativity. I try to put up the caveats at the beginning of each section, but I'd really like a way to communicate my confidence without reducing readability.
The final thing that I wish for is some kind of reading group? It’s hard to maintain a group when my interests shift as rapidly as they do, which was one of the reason I really loved the AIRCS workshop: They were people who were working on formal methods, compilers, type theory, number theory, embedded systems, temporal logic… It was very cool to be in a group of people who had answers and intuitions to questions that had bugged me for some time now. I wonder if attending courses at a larger research university feels the same way. My uni is good, but we have quite small population, which almost by construction means reduced diversity.
I also wish that I could openly add more references to repos I’ve been working on for a while now, but I can’t due to the nature of academia and publishing. This one bums me out, since there’s a long story of a huge number of commits and trialbyfire that I think I’ll be too exhausted to write about once the thing is done.
Sometimes, I also wish that I could spend the time I spend reading disparate
topics on focused reading on one topic. Unfortunately, I feel like I’m not
wired this way, and the joy I get from sampling many things at the same time
and making connections is somehow much deeper than the joy I get by deeply
reading one topic (in exclusion of all else). I don’t know what this says
about my chances as a grad student in the future :)
I've seen the definitions of p-adic numbers scattered around on the internet, but this analogy, as motivated by the book p-adic Numbers by Fernando Gouvea, really made me understand why one would study the p-adics, and why the definitions are natural. So I'm going to recapitulate the material, with the aim of leaving someone who reads this post with a sense of why it's profitable to study the p-adics, and what sorts of analogies are fruitful when thinking about them.
We wish to draw an analogy between the ring $\mathbb C[X]$, where $(X - \alpha)$ are the prime ideals, and $\mathbb Z$, where $(p)$ are the prime ideals. We wish to take all operations one can perform with polynomials, such as generating functions ($1/(1 - X) = 1 + X + X^2 + \dots$), Taylor expansions (expanding around $(X - \alpha)$), and see what their analogous objects will look like in $\mathbb Z$ relative to a prime $p$.
Now, for example, given a prime $p$, we can write any positive integer $m$ in base $p$, as $m = \sum_{i=0}^n a_i p^i$ where $0 \leq a_i \leq p - 1$.
For example, consider $m = 72, p = 3$. The expansion of 72 is $72 = 0\times 1 + 0 \times 3 + 2 \times 3^2 + 2 \times 3^3$. This shows us that 72 is divisible by $3^2$.
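The expansion above can be computed mechanically; here is a small sketch (the function name `base_p_digits` is my own, not standard):

```python
def base_p_digits(m, p):
    # digits a_i of m in base p, least significant first: m = sum(a_i * p**i)
    digits = []
    while m > 0:
        digits.append(m % p)
        m //= p
    return digits

# 72 = 0*1 + 0*3 + 2*9 + 2*27, so the first two digits vanish:
print(base_p_digits(72, 3))  # [0, 0, 2, 2]
```

The two leading zero digits are exactly the statement that $3^2$ divides 72.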
The perspective to take is that this is the information local to the prime $p$: to what order the number $m$ is divisible by $p$, just as the Taylor expansion of a polynomial $P(X)$ around $(X - \alpha)$ tells us to what order $P(X)$ vanishes at the point $\alpha$.
Now, we investigate the behaviour of expressions such as
We know that the above formula is correct formally from the theory of generating functions. Hence, we take inspiration to define values for rational numbers.
Let’s take $p \equiv 3$, and we know that $4 = 1 + 3 = 1 + p$.
We now calculate $1/4$ as:
However, we don't really know how to interpret $(-1 \cdot p)$, since we assumed the coefficients are always non-negative. What we can do is rewrite $p^2 = 3p$ (since $p = 3$), and then use this to make the coefficient positive. Performing this transformation for every negative coefficient, we arrive at:
We can verify that this is indeed correct, by multiplying with $4 = (1 + p)$ and checking that the result is $1$:
What winds up happening is that all the numbers after $1$ end up being cleared due to the carrying of $(3p^i \mapsto p^{i+1})$.
This little calculation indicates that we can also take the $p$-adic expansion of rational numbers.
We next want to find a p-adic expansion of $-1$, since we can then expand our theory to work "in general". The core idea is to "borrow" $p$, so that we can write $-1$ as $(p - 1) - p$, and then fix up $-p$, just like we fixed $-1$. This eventually leads us to an infinite series expansion for $-1$. Written down formally, the calculation proceeds as:
This now gives us access to negative numbers, since we can formally multiply the series of two numbers, to write $-a = -1 \cdot a$.
Notice that this definition of $-1$ also curiously matches the 2s complement definition, where we have $-1 = 11\dots 1$. In this case, the expansion is infinite, while in the 2s complement case, it is finite. I would be very interested to explore this connection more fully.
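Both calculations (the expansion of $1/4$ and of $-1$) can be mechanized: repeatedly extract the digit $a \cdot b^{-1} \bmod p$, subtract it, and divide by $p$. A sketch (the function name is mine; `pow(b, -1, p)` for modular inverses needs Python 3.8+):

```python
from fractions import Fraction

def padic_digits(x, p, n):
    # first n digits of the p-adic expansion of a rational x
    # (assumes the denominator of x is not divisible by p)
    digits = []
    for _ in range(n):
        a, b = x.numerator, x.denominator
        d = (a * pow(b, -1, p)) % p  # next digit: a * b^{-1} mod p
        digits.append(d)
        x = (x - d) / p              # peel off the digit and shift
    return digits

print(padic_digits(Fraction(1, 4), 3, 6))   # [1, 2, 0, 2, 0, 2]
print(padic_digits(Fraction(-1, 1), 3, 5))  # [2, 2, 2, 2, 2]
```

The first output matches the hand computation of $1/4 = 1 + 2p + 0p^2 + 2p^3 + \dots$, and the second is $-1 = (p-1) + (p-1)p + (p-1)p^2 + \dots$.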
We’ve now managed to completely reinterpret all the numbers we care about in the rationals as power series in base $p$. This is pretty neat. We’re next going to try to complete this, just as we complete the rationals to get the reals. We’re going to show that we get a different number system on completion, called $\mathbb Q_p$.
To perform this, we first look at how the $p$-adic numbers help us solve congruences mod $p$, and how this gives rise to completions of equations such as $x^2 - 2 = 0$, which in the reals gives us $x = \sqrt 2$, and in $\mathbb Q_p$ gives us a different answer!
Let’s start by solving an equation we already know how to solve: $X^2 \equiv 25 \mod 3^n$.
We already know the solutions to $X^2 \equiv 25 \mod 3^n$ in $\mathbb Z$ are $X \equiv \pm 5 \mod 3^n$.
Explicitly, the solutions are:
This was somewhat predictable. We move to a slightly more interesting case.
The solution sets are:
This gives us the infinite 3-adic expansion:
Note that we can't really predict the digits in the 3-adic expansion of $-5$, but we can keep expanding and finding more digits.
Also see that the solutions are "coherent": if we look at the solution mod 9, which is $4$, and then consider it mod 3, we get $1$. So, given a sequence of integers $0 \leq \alpha_n \leq p^n - 1$, we say that $(\alpha_n)$ is a p-adically coherent sequence iff $\alpha_{n+1} \equiv \alpha_n \mod p^n$.
Since our solution sets are coherent, we can view the solutions as a tree, with the expansions of $X = 5$ and $X = -5$ (mod 3) continuing onwards from there. That is, the sequences are
We now construct a solution to the equation $X^2 = 2$ in the 7-adic system, thereby showing that $\mathbb Q_p$ is indeed strictly larger than $\mathbb Q$, since this equation does not have rational roots.
For $n=1$, we have the solutions $X \equiv 3 \mod 7$ and $X \equiv 4 \equiv -3 \mod 7$.
To find solutions for $n = 2$, we recall that we need our solutions to be consistent with those for $n = 1$. So, we solve for:
Solving the first of these:
This gives the solution $X \equiv 10 \mod 49$. The other branch ($X = 4 + 7k$) gives us $X \equiv 39 \equiv -10 \mod 49$.
We can continue this process indefinitely (exercise), giving us the sequences:
We can show that the sequences of solutions we get satisfy the equation $X^2 = 2$ in $\mathbb Q_7$. This is so by construction. Hence, $\mathbb Q_7$ contains a solution that $\mathbb Q$ does not, and is therefore strictly bigger, since we can already represent every rational in $\mathbb Q$ in $\mathbb Q_7$.
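The lifting process above can be checked by brute force; a small sketch (the function name is mine):

```python
def sqrt_mod_pn(c, p, n):
    # all solutions of X^2 = c (mod p^n), found by brute force
    m = p ** n
    return [x for x in range(m) if (x * x - c) % m == 0]

for n in [1, 2, 3]:
    print(n, sqrt_mod_pn(2, 7, n))
# 1 [3, 4]
# 2 [10, 39]
# 3 [108, 235]
```

Note the coherence: $108 \equiv 10 \mod 49$ and $10 \equiv 3 \mod 7$, so each solution refines the previous one, exactly as in the tree picture.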
Let’s use the tools we have built so far to solve the equation $X = 1 + 3X$. Instead of solving it using algebra, we look at it as a recurrence $X_{n+1} = 1 + 3X_n$. This gives us the terms:
In $\mathbb R$, this is a divergent sequence. However, we know that $1 + X + X^2 + \dots = 1/(1-X)$, at least as a generating function. Plugging this in, we get that the answer should be:
which is indeed the correct answer.
Now this required some really shady stuff in $\mathbb R$. However, with a change of viewpoint, we can explain what's going on. We can look at the above series as being a series in $\mathbb Q_3$. Now, this series really does converge, and by the same argument as above, it converges to $-1/2$.
The nice thing about this is that a dubious computation becomes a legal one by changing one’s perspective on where the above series lives.
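We can make "converges in $\mathbb Q_3$" concrete even before building the norm machinery: the partial sum $S_n = 1 + 3 + \dots + 3^{n-1}$ satisfies $2 S_n + 1 = 3^n$, so $S_n$ agrees with $-1/2$ modulo $3^n$, and the disagreement shrinks 3-adically. A quick check:

```python
# partial sums S_n = 1 + 3 + ... + 3^(n-1) of the geometric series
S = 0
for n in range(1, 7):
    S += 3 ** (n - 1)
    # S equals -1/2 up to a multiple of 3^n: 2*S + 1 == 3^n
    assert 2 * S + 1 == 3 ** n
print("partial sums agree with -1/2 mod 3^n")
```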
The last thing that we need to import from the theory of polynomials is the ability to evaluate them: Given a rational function $F(X) = P(X)/Q(X)$, where $P(X), Q(X)$ are polynomials, we can evaluate it at some arbitrary point $x_0$, as long as $x_0$ is not a zero of the polynomial $Q(X)$.
We would like a similar function, such that for a fixed prime $p$, we obtain a ring homomorphism from $\mathbb Q \rightarrow \mathbb F_p$, which we will denote as $p(x_0)$, where we are imagining that we are "evaluating" the prime $p$ against the rational $x_0$.
We define the value of $x_0 = a/b$ at the prime $p$ to be equal to $ab^{-1} \mod p$, where $b b^{-1} \equiv 1 \mod p$. That is, we compute the usual $ab^{-1}$ to evaluate $a/b$, except we do this $\pmod p$, to stay with the analogy.
Note that if $b \equiv 0 \mod p$, then we cannot evaluate the rational $a/b$, and we say that $a/b$ has a pole at $p$. The order of the pole is the number of times $p$ occurs in the prime factorization of $b$.
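As a sketch of this "evaluation" (the function name is mine; `pow(b, -1, p)` for the modular inverse needs Python 3.8+):

```python
def eval_at_prime(a, b, p):
    # "evaluate" the rational a/b at the prime p, i.e. a * b^{-1} mod p
    if b % p == 0:
        raise ZeroDivisionError("a/b has a pole at p")
    return (a * pow(b, -1, p)) % p

# evaluate 1/4 at p = 3: since 4 = 1 mod 3, the inverse of 4 is 1, so we get 1
print(eval_at_prime(1, 4, 3))  # 1
```

Note that this matches the first digit of the 3-adic expansion of $1/4$ computed earlier, just as evaluating a polynomial at $\alpha$ gives the constant term of its Taylor expansion around $(X - \alpha)$.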
I'm not sure how profitable this viewpoint is, so I asked on math.se, and I'll update this post when I receive a good answer.
So far, we have dealt with infinite series in base $p$, which have terms $p^i, i \geq 0$. Clearly, these sums are divergent as per the usual topology on $\mathbb Q$. However, we would enjoy assigning analytic meaning to these series. Hence, we wish to consider a new notion of the absolute value of a number, which makes it such that $p^i$ with large $i$ are considered small.
We define an absolute value for a field $K$ as a function $\lvert \cdot \rvert : K \rightarrow \mathbb R$. It obeys the axioms:
We want the triangle inequality so it’s metriclike, and the norm to be multiplicative so it measures the size of elements.
The usual absolute value, $\lvert x \rvert \equiv x$ if $x \geq 0$ and $-x$ otherwise, satisfies these axioms.
Now, we create a new absolute value that measures primeness. We first introduce a gadget known as a valuation, which measures the $p$-ness of a number. We use this to create a norm that makes numbers smaller as their $p$-ness increases. This will allow infinite series in $p^i$ to converge.
First, we introduce a valuation $v_p: \mathbb Z \setminus \{0\} \rightarrow \mathbb R$, where $v_p(n)$ is the power of the prime $p$ in the prime factorization of $n$. More formally, $v_p(n)$ is the unique number such that $n = p^{v_p(n)} m$, where $p \nmid m$.
The valuation gets larger as we have larger powers of $p$ in the prime factorization of a number. However, we want the norm to get smaller. Also, we need the norm to be multiplicative, while $v_p(nm) = v_p(n) + v_p(m)$, which is additive.
To fix both of these, we create a norm by exponentiating $v_p$. This converts the additive property into a multiplicative property. We exponentiate with a negative sign so that higher values of $v_p$ lead to smaller values of the norm.
Now, we define the p-adic absolute value of a number $n$ as $\lvert n \rvert_p \equiv p^{-v_p(n)}$.
So $\lvert n \rvert_p$ is indeed a norm: it measures $p$-ness, and gets smaller as the power $p^i$ in the factorization of $n$ gets larger, causing our infinite series to converge.
There is a question of why we chose the base $p$ for $\lvert n \rvert_p = p^{-v_p(n)}$. It would appear that any choice of $\lvert n \rvert_p = c^{-v_p(n)}$, $c > 1$, would be legal. I asked this on math.se, and the answer is that choosing the base $p$ gives us the nice product formula:

$$\lvert x \rvert_\infty \cdot \prod_p \lvert x \rvert_p = 1$$

That is, the product of all the $p$-norms and the usual norm (denoted by $\lvert x \rvert_\infty$) gives us the number 1. The reason is that each $\lvert x \rvert_p$ contributes a factor $p^{-v_p(x)}$, while the usual norm $\lvert x \rvert_\infty$ contains a factor $p^{v_p(x)}$ for each prime $p$, so everything cancels.
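A small sketch of the valuation, the norm, and the product formula in action (function names are mine):

```python
from fractions import Fraction

def v_p(n, p):
    # exponent of p in the prime factorization of a nonzero integer n
    k = 0
    while n % p == 0:
        n //= p
        k += 1
    return k

def norm_p(x, p):
    # |x|_p = p^(-v_p(x)) for a nonzero rational x
    return Fraction(p) ** (v_p(x.denominator, p) - v_p(x.numerator, p))

x = Fraction(72)  # 72 = 2^3 * 3^2
assert norm_p(x, 2) == Fraction(1, 8)
assert norm_p(x, 3) == Fraction(1, 9)
# product formula: |x|_2 * |x|_3 * |x|_inf = (1/8) * (1/9) * 72 = 1
assert norm_p(x, 2) * norm_p(x, 3) * abs(x) == 1
print("product formula holds for 72")
```

Only the primes dividing the numerator or denominator contribute a norm different from 1, so the product is really a finite product.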
What we've done in this whirlwind tour is to try and draw analogies between the ring of polynomials $\mathbb C[X]$ and the ring $\mathbb Z$, by trying to draw analogies between their prime ideals: $(X - \alpha)$ and $(p)$. So, we imported the notions of generating functions, polynomial evaluation, and completions (of $\mathbb Q$) to gain a picture of what $\mathbb Q_p$ is like.
We also tried out the theory we've built against some toy problems, which show us that this point of view may be profitable. If you found this interesting, I highly recommend the book p-adic Numbers by Fernando Gouvea.
To quote wikipedia:
In crystallography, the space group of a crystal splits as the semidirect product of the point group and the translation group if and only if the space group is symmorphic
The if and only if is interesting: the geometry of the crystal lattice truly appears to capture the structure of the semidirect product. It's a discrete object as well, which makes it way easier to visualize. I'm going to hunt down the definitions involved so I can finally feel like I truly understand semidirect products from the "action" perspective.
Here, we’re going to describe whatever I’ve picked up of sheaves in the past couple of weeks. I’m trying to understand the relationship between sheaves, topoi, geometry, and logic. I currently see how topoi allows us to model logic, and how sheaves allow us to model geometry, but I see nothing about the relationship! I’m hoping that writing this down will allow me to gain some perspective on this.
Let's consider two sets $P, A$, $P \subseteq A$. Now, given a function $f: A \rightarrow X$, we can restrict it to $f|_P : P \rightarrow X$. So, we get to invert the direction: the inclusion $P \subseteq A$ gives a map going the other way, from functions on $A$ to functions on $P$.
We should now try to discover some sort of structure to this “reversal” business. Perhaps we will discover a contravariant functor! (Spoiler: we will).
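As a toy model of this reversal, we can represent a function on a finite set as a dict; restriction then composes the way a contravariant functor should (a sketch, with my own naming):

```python
def restrict(f, P):
    # restriction f|_P of a function on A (given as a dict) to a subset P of A
    return {x: f[x] for x in P}

A = {1, 2, 3, 4}
Q = {1, 2, 3}
P = {1, 2}
f = {x: x * x for x in A}  # a function A -> X

# restricting in stages agrees with restricting in one go:
# for P <= Q <= A, (f|_Q)|_P == f|_P  -- the functoriality law
assert restrict(restrict(f, Q), P) == restrict(f, P)
print(restrict(f, P))  # {1: 1, 2: 4}
```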
Most people believe that topology is about some notion of “nearness” or “closeness”, which has been abstracted out from our usual notion of continuity that we have from a metric space. Here, I make the claim that topology is really about computation, and more specifically, decidability. These are not new ideas. I learnt of this from a monograph The topology of computation, but this does not seem very well known, so I decided to write about it.
The idea is this: we have Turing machines, which can compute things. We then also have a set $S$. Now, a topology $\tau \subset 2^S$ precisely encodes which subsets of $S$ can be separated from the rest of the space by a Turing machine. Thus, a discrete space is a very nice space, where every point can be separated from every other point. An indiscrete space is one where no point can be separated.
Something like the reals is somewhere in between, where we can separate stuff inside an open interval from stuff clearly outside, but there's some funny behaviour at the boundary due to things like $0.999\ldots = 1$, which we'll see in detail in a moment.
A subset $Q \subseteq S$ is semidecidable if there exists a Turing machine $\hat Q: S \rightarrow \{ \bot, \top \}$, such that:
Where $\top$ signifies stopping at a state and returning TRUE, and $\bot$ signifies never halting at all! So, the subset $Q$ is semidecidable in that we will halt and say TRUE if the element belongs to the set; but if an element does not belong to the set, we are supposed to never terminate.
Let's start with an example. We consider the interval $I = (1, 2)$, as a subset of $\mathbb{R}$. Let the Turing machine receive the real number as a function $f: \mathbb N \rightarrow \{0, 1, \dots, 9\}$, such that a real number $a = a_0.a_1a_2\dots$ is encoded as the function $f_a(i) = a_i$.
We now build a turing machine $\hat I$ which when given the input the function $f_a$, semidecides whether ${a \in I}$.
Let’s consider the numbers in $I$:
So, we can write a turing machine (ie, some code) that tries to decide whether a real number $a$’s encoding $f_a$ belongs to the interval $I = (1, 2)$ as follows:
def decide_number_in_open_1_2(f):
    # if the number is (1.abcd...)
    if f(0) == 1:
        # exclude 1.999... (which equals 2, outside the open interval):
        # scan for a digit that is not 9; if all digits are 9, never terminate
        i = 1
        while f(i) == 9: i += 1
        # exclude 1.000... (which equals 1, outside the open interval):
        # scan for a digit that is not 0; if all digits are 0, never terminate
        j = 1
        while f(j) == 0: j += 1
        return
    # if the number is not of the form 1.abcd..., do not terminate
    while True: pass
Hence, we say that the interval $I = (1, 2)$ is semidecidable, since we have a function $\hat I \equiv \texttt{decide\_number\_in\_open\_1\_2}$ such that $\hat I (f_a) \text{ terminates } \iff a \in I$. We don't make any claim about what happens if $a \notin I$. This is the essence of semidecidability: we can precisely state when elements belong to the set, but not when they don't.
To put this on more solid ground, we define a topology on a set $S$ by considering programs which receive as input elements of $S$, suitably encoded. For example, the way in which we encoded real numbers as functions from the index to the digit. Similarly, we encode other mathematical objects in some suitable way.
Now, we define:
For every program $P$ which takes as input elements of $S$, the set ${\text{halts}(P) \equiv \{ s \in S : P(s) \text{ halts} \}}$ is called a semidecidable set.
Alternatively, we can say for a subset ${T \subset S}$, if there exists a program ${\hat T}$, such that ${s \in T \iff \hat T(s) \text{ halts}}$, then $T$ is semidecidable.
These are just two viewpoints on the same object. In one, we define the set based on the program. In the other, we define the program based on the set.
The empty set is semidecidable, due to the existence of the program:
def semidecide_empty(x):
    while True: continue
The universe set is semidecidable, due to the existence of the program:
def semidecide_univ(x): return
Infinite unions of semidecidable sets are semidecidable, since we can "diagonalize" over the steps of all the programs. That way, if any program halts, we will reach the state where it halts in our diagonalized enumeration.
Let A00, A01, ..., A0n be the initial states of the machines. We are trying to semidecide whether any of them halts. We lay out the steps of the machines in an imaginary grid:
A00 A01 A02 ... A0n
A10 A11 A12 ... A1n
A20 A21 A22 ... A2n
...
Am0 Am1 Am2 ... Amn
For example, machine A0 has states:
A00 -> A10 -> ... -> Am0
We can walk through the combined statespace of the machines as:
A00
A01 A10
A02 A11 A20
A03 A12 A21 A30
...
Where on the $k$'th line, we collect all states $A_{ij}$ such that $(i + j = k)$.
Now, if any of the machines has a state that is HALT, we will reach that state as we enumerate the diagonals, and the machine that explores the combined state space can also return HALT.
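This dovetailed simulation is easy to sketch in code. Here is a minimal Python model; the generator-based machine encoding and the function names are my own illustrative choices, not from the original text:

```python
def machine(steps):
    # a toy machine modeled as a generator: it yields on every step and
    # returns (raising StopIteration) when it halts; steps=None never halts.
    n = 0
    while steps is None or n < steps:
        n += 1
        yield

def semidecide_union(machines):
    # walk the diagonals: on diagonal k, start machine k (if any exists)
    # and advance every started machine by one step.
    # Halts if and only if at least one machine halts.
    running, k = [], 0
    while True:
        if k < len(machines):
            running.append(machines[k])
        for m in running:
            try:
                next(m)
            except StopIteration:
                return k  # some machine halted on this diagonal
        k += 1
```

For example, `semidecide_union([machine(None), machine(3), machine(None)])` halts, because `machine(3)` halts; with only never-halting machines it loops forever, which is exactly the behaviour a semidecider is allowed to have.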
Infinite intersections of semidecidable sets are not semidecidable, since by running these programs in parallel, we cannot know whether an infinite number of programs halt in finite time. We can tell if one of them halts, but not if all of them halt.
For example, consider the sequence of machines produced by machine_creator:
# creates a machine that stops after n steps
def machine_creator(n):
    # f terminates after n steps
    def f(x):
        for _ in range(n):
            continue
        return
    return f
We wish to check whether all of the machines machine_creator(n), for $n \geq 0, n \in \mathbb N$, halt. Clearly, verifying this takes an infinite number of steps, even though every single machine created by machine_creator halts in a finite number of steps.
An algorithm to find integer relations between real numbers. It was apparently named one of the "algorithms of the century" by Computing in Science and Engineering.
$Stab(Orb(x)) = Stab(x) \iff Stab(x) \text{ is normal}$
$\forall x’ \in Orb(x), Stab(x’) = Stab(x) \iff Stab(x) \text{ is normal}$
Let a group $G$ act on a set $X$ with action $(\cdot) : G \times X \rightarrow X$.
Let $H \subseteq G$ be the stabilizer of a point $x \in X$. Now, let $K = kHk^{-1}$, a conjugate of $H$. Clearly, the element $(k \cdot x)$ in the orbit of $x$ is stabilized by $K$.
If the group $H$ is normal, then $K = H$. So every element in the orbit of $x$ is stabilized by $H$.
$Stab(g \cdot x) = g Stab(x) g^{-1}$
$g^{-1} Stab(g \cdot x) g = Stab(x)$
Proof of $s \in Stab(x) \implies gsg^{-1} \in Stab(g \cdot x)$: The action of $gsg^{-1}$ on $g \cdot x$ is: $(g \cdot x \rightarrow_{g^{-1}} x \rightarrow_s x \rightarrow_g g \cdot x)$.
Proof of $s' \in Stab(g \cdot x) \implies g^{-1}s'g \in Stab(x)$: The action of $g^{-1}s'g$ on $x$ is: $(x \rightarrow_{g} g \cdot x \rightarrow_{s'} g \cdot x \rightarrow_{g^{-1}} x)$.
Hence, both containments are proved.
From the above equation, $Stab(g \cdot x) = g Stab(x) g^{-1}$. If the entire orbit has the same stabilizer, $Stab(g \cdot x) = Stab(x)$. Hence, we get $Stab(x) = g Stab(x) g^{-1}$, proving that it's normal.
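As a concrete sanity check of $Stab(g \cdot x) = g Stab(x) g^{-1}$, here is a small Python verification over $S_3$ acting on $\{0, 1, 2\}$; the tuple encoding of permutations is just for illustration:

```python
from itertools import permutations

# permutations of {0, 1, 2} as tuples p, with p[i] the image of i
S3 = list(permutations(range(3)))

def compose(p, q):
    # (p o q)(i) = p[q[i]]
    return tuple(p[q[i]] for i in range(len(q)))

def inverse(p):
    inv = [0] * len(p)
    for i, pi in enumerate(p):
        inv[pi] = i
    return tuple(inv)

def stab(x):
    # stabilizer of x under the action g . x = g[x]
    return {p for p in S3 if p[x] == x}

for g in S3:
    for x in range(3):
        conjugated = {compose(compose(g, s), inverse(g)) for s in stab(x)}
        assert stab(g[x]) == conjugated
print("Stab(g.x) == g Stab(x) g^-1 verified over S3")
```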
Let $I$ be an ideal. The ideal generated by adding $(a \in R)$ to $I$ is defined as $A \equiv (I \cup \{ a \})$. We prove that $A = I + aR$.
An ideal $P$ is prime iff the quotient ring $R/P$ is an integral domain. An ideal $M$ is maximal iff $R/M$ is a field. Every field is an integral domain, hence:
$M \text{ is maximal } \implies R/M \text{ is a field } \implies R/M \text {is an integral domain} \implies M \text{ is prime}$.
I was dissatisfied with this proof, since it is not ideal theoretic: it argues about the behaviour of the quotients. I then found this proof that argues purely using ideals:
Let $I$ be a maximal ideal. Let $a, b \in R$ such that $ab \in I$. We need to prove that $a \in I \lor b \in I$. If $a \in I$, the problem is done. So, let $a \notin I$. Build the ideal $A = (I \cup \{ a \})$, so $I \subsetneq A$. Since $I$ is maximal, $A = R$. Hence $1_R \in A \implies 1_R \in I + aR \implies \exists i \in I, r \in R, 1_R = i + ar$. Now, $b = b \cdot 1_R = b(i + ar) = bi + (ba)r \in I + IR = I$ ($ba \in I$ by assumption). Hence, $b \in I$.
In more detail: let $I$ be a maximal ideal, and let $a, b \in R$ such that $ab \in I$. We need to prove that $a \in I \lor b \in I$.
If $a \in I$, then the problem is done. So, let $a \notin I$. Consider the ideal $A$ generated by adding $a$ into $I$: $A \equiv (I \cup \{ a \})$.
We have shown that $A = I + aR$. Hence, $I = I + a \cdot 0 \subseteq A$. Also, $0 + a \cdot 1 = a \in A$, and $a \notin I$, so $A \neq I$. Therefore, $I \subsetneq A$. Since $I$ is maximal, this means that $A = R$.
Therefore, $I + aR = R$. Hence, there exists some $i \in I, r \in R$ such that $i + ar = 1_R$.
Now, $b = b \cdot 1_R = b \cdot (i + ar) = bi + (ba) r \in I + IR = I$. Hence, $b \in I$.
A radical ideal of a ring $R$ is an ideal $I$ such that $\forall r \in R, (\exists n, r^n \in I) \implies r \in I$. That is, if any power of $r$ is in $I$, then the element $r$ also gets "sucked into" $I$.
A nilpotent element of a ring $R$ is any element $r$ such that there exists some power $n$ such that $r^n = 0$.
Note that every ideal of the ring contains $0$. Hence, if an ideal $I$ of a ring is known to be a radical ideal, then for any nilpotent $r$, since $\exists n, r^n = 0 \in I$, since $I$ is radical, $r \in I$.
That is, a radical ideal will always contain all nilpotents! It may contain other elements as well, but it contains the nilpotents for sure.
Given an ideal $I$, its radical $\sqrt I \equiv \{ r \in R : \exists n, r^n \in I \}$. That is, we add all the elements $I$ needs to have for it to become a radical ideal.
Notice that the radical of the zero ideal $(0)$ precisely contains all nilpotents: $\sqrt{(0)} \equiv \{ r \in R : \exists n, r^n = 0 \}$.
A ring $R$ is a reduced ring if the only nilpotent in the ring is $0$.
To remove the nilpotents of the ring $R$, we can create $R' \equiv R / \sqrt{(0)}$. Since $\sqrt{(0)}$ is the ideal which contains all nilpotents, the quotient ring $R'$ will contain no nilpotents other than $0$.
Similarly, quotienting by any larger radical ideal $I$ will remove all nilpotents (and then some), leaving a reduced ring.
A ring modulo a radical ideal is reduced
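A quick concrete check in the rings $\mathbb Z/n\mathbb Z$; this pure-Python sketch and the helper name are mine:

```python
def nilpotents(n):
    # x is nilpotent in Z/nZ iff x^k = 0 mod n for some k; k = n suffices.
    return {x for x in range(n) if pow(x, n, n) == 0}

# In Z/12, the nilpotents form sqrt((0)) = (6): e.g. 6^2 = 36 = 0 mod 12.
assert nilpotents(12) == {0, 6}

# Quotienting Z/12 by (6) gives Z/6, which is reduced (no nonzero nilpotents).
assert nilpotents(6) == {0}
print("Z/12 has nilpotents {0, 6}; Z/6 is reduced")
```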
A ring $R$ is an integral domain if $ab = 0 \implies a = 0 \lor b = 0$. That is, the ring $R$ has no zero divisors.
An ideal $I$ of a ring $R$ is a prime ideal if $\forall x, y \in R, xy \in I \implies x \in I \lor y \in I$. This generalizes the notion of a prime number dividing a composite: $p \mid xy \implies p \mid x \lor p \mid y$.
Recall that every ideal contains $0$. Now, if an ideal $I$ is prime, and $ab = 0 \in I$, then either $a \in I$ or $b \in I$ (by the definition of prime).
We create $R’ = R / I$. We denote $\overline{r} \in R’$ as the image of $r \in R$ in the quotient ring $R’$.
The intuition is that by quotienting by such an $I$, since $ab = 0 \implies a \in I \lor b \in I$, we are "forcing" that in the quotient ring $R'$, if $\overline{a} \overline{b} = 0$, then either $\overline{a} = 0$ or $\overline{b} = 0$, since $a \in I \implies \overline a = 0$ and $b \in I \implies \overline b = 0$.
A ring modulo a prime ideal is an integral domain.
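A quick concrete check in quotients of $\mathbb Z$; this pure-Python sketch and the helper name are mine:

```python
def zero_divisors(n):
    # nonzero a in Z/nZ with a*b = 0 for some nonzero b
    return {a for a in range(1, n)
            if any((a * b) % n == 0 for b in range(1, n))}

assert zero_divisors(5) == set()      # (5) is prime: Z/5 is an integral domain
assert zero_divisors(6) == {2, 3, 4}  # (6) is not prime: 2*3 = 0 in Z/6
print("Z/5 has no zero divisors; Z/6 does")
```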
I learnt of this explanation from this excellent blog post by Stefano Ottolenghi.
When I first ran across the theory of abstract interpretation, it seemed magical: Define two functions, check that they’re monotone maps, and boom, we have on our hands an analysis.
However, in reality it's not so simple. Here is a list of issues I've run across when trying to use abstract interpretation for a "real world" use case:
First of all, all interesting lattices have infinite height, requiring some choice of widening. Defining a good widening is a black art. Secondly, while there is a lot of theory on combining abstract domains (reduced products and the like), it seems hard to deploy the theory in the real world.
I read a fair bit into the theory of abstract acceleration, where the idea is that instead of widening indiscriminately, if our theory is powerful enough to compute an exact closed form, we choose to do so. However, the problem is that this regime does not “fit well” into abstract interpretation: We have the abstract interpreter on the one hand, and then the acceleration regime on the other, which is a separate algorithm. So the full analysis looks something like:
def analyze(program):
    analysis = []
    for loop in loops_inner_to_outer(program):
        loop_data = abstract_interpret(loop)
        analysis.append(accelerate(loop, loop_data))
    return analysis
That is, what used to be a nice theory of just “do things in any order and it will converge”, now becomes a new algorithm, that uses abstract interpretation as a subroutine. This was not the hope I had! I wanted to get away from having to do proofs by analyzing an algorithm, this was the entire promise: Define a lattice well enough and the proof is guaranteed. Rather, what I had imagined was:
def analyze(program):
    return abstract_interpret_using_acceleration_domain(program)
Now this acceleration_domain may be frightfully complicated, but I'm willing to pay that price, as long as it's an honest-to-god abstract interpretation. It was a huge bummer for me to find out that this is not the case.
Here’s a fun little problem, whose only solution I know involves a fair bit of math and computer algebra:
We are given the grammar for a language L:
E = T +_mod8 E | T -_mod8 E | T
T = V | V ^ V | V ^ V ^ V
V = 'a1' | 'a2' | ...
where +_mod8 is addition modulo 8, -_mod8 is subtraction modulo 8, and ^ is XOR. This language is equipped with the obvious evaluation rules, corresponding to those of arithmetic. We are guaranteed that during evaluation, the variables a_i will only have values 0 and 1.
Since we have addition, we can perform multiplication by a constant by repeated addition. So we can perform 3*a as a+a+a.
We are then given the input expression (a0 ^ a1 ^ a2 ^ a3). We wish to find an equivalent expression in terms of the above language L. We think of E as some set of logic gates we are allowed to use, and we are trying to express the operation (a0 ^ a1 ^ a2 ^ a3) in terms of these gates.
The first idea that I thought was that of employing a grobner basis, since they essentially embody rewrite rules modulo polynomial equalities, which is precisely our setting here.
In this blog post, I’m going to describe what a grobner basis is and why it’s natural to reach for them to solve this problem, the code, and the eventual solution.
As a spoiler, the solution is:
a^b^c^d =
    -a - b + c + 3*d - 3*axorb - axorc
    + axord - bxorc + bxord + 3*cxord
    - 3*axorbxorc - axorbxord
    + axorcxord + bxorcxord
Clearly, this contains only additions/subtractions and multiplications by a constant.
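We can sanity-check the closed form by brute force in plain Python, independently of the computer algebra below (with ^ as bitwise XOR, and the signs read as negative coefficients; the helper name is mine):

```python
from itertools import product

def xor4_via_sums(a, b, c, d):
    # the closed form, evaluated mod 8
    return (-a - b + c + 3*d
            - 3*(a ^ b) - (a ^ c) + (a ^ d)
            - (b ^ c) + (b ^ d) + 3*(c ^ d)
            - 3*(a ^ b ^ c) - (a ^ b ^ d)
            + (a ^ c ^ d) + (b ^ c ^ d)) % 8

assert all(xor4_via_sums(a, b, c, d) == a ^ b ^ c ^ d
           for a, b, c, d in product((0, 1), repeat=4))
print("closed form verified on all 16 inputs")
```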
If there’s some principled way to derive this (beyond throwing symbolic algebra machinery), I’d really love to know — Please raise an issue with the explanation!
The nutshell is that a grobner basis is a way to construct rewrite rules which also understand arithmetic (I learnt this viewpoint from the book “Term rewriting and all that”. Fantastic book in general). Expanding on the nutshell, assume we have a term rewriting system:
A -> -1*B    (1)
C -> B^2     (2)
over an alphabet {A, B, C}.
Now, given the string C + AB, we wish to find out if it can be rewritten to 0 or not. Let's try to substitute and see what happens:
C + AB -2-> B^2 + AB -1-> B^2 + (-1*B)B
At this point, we're stuck! We don't have rewrite rules to allow us to rewrite (-1*B)B into -B^2. Indeed, creating such a list would be infinitely long. But if we are willing to accept that we somehow have the rewrite rules that correspond to polynomial arithmetic, where we view A, B, C as variables, then we can rewrite the above string to 0:
B^2 + (-1*B)B -> B^2 - B^2 -> 0
A Grobner basis is the algorithmic / mathematical machine that allows us to perform this kind of substitution.
In this example, this might appear stupid: what is so special? We simply substituted variables and arrived at 0 by using arithmetic. What's so complicated about that? To understand why this is not always so easy, let's consider a pathological, specially constructed example:
A -> 1       (1)
AB -> -B^2   (2)
And we consider the string S = AB + B^2. If we blindly apply the first rule, we arrive at:
S = AB + B^2 -1-> 1B + B^2 = B + B^2 (STUCK)
However, if we apply (2) first:
S = AB + B^2 -2-> -B^2 + B^2 -> 0
This tells us that we can't just apply the rewrite rules willy-nilly: the result is sensitive to the order of the rewrites! That is, the rewrite system is not confluent.
The grobner basis is a function from rewrite systems to rewrite systems. When given a rewrite system R, it produces a new rewrite system R' that is confluent. So, we can apply the rewrite rules of R' in any order, and we are guaranteed that, for all strings, R' rewrites to 0 if and only if R could have rewritten to 0.
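To make the order sensitivity concrete, here is a tiny multivariate division in pure Python over the pathological system. This is an illustrative sketch, not a general Grobner engine; the dict-of-monomials encoding and the completed rule B^2 + B are my own working here:

```python
# Polynomials in A and B are dicts mapping exponent pairs (i, j) for
# A^i B^j to integer coefficients; lex order with A > B.
def clean(p):
    return {m: c for m, c in p.items() if c != 0}

def sub_scaled(p, q, mono, coeff):
    # returns p - coeff * (A^mono[0] B^mono[1]) * q
    r = dict(p)
    for (i, j), c in q.items():
        m = (i + mono[0], j + mono[1])
        r[m] = r.get(m, 0) - coeff * c
    return clean(r)

def divide(p, divisors):
    # multivariate division: cancel the leading term with the FIRST divisor
    # whose leading monomial fits; otherwise move it to the remainder.
    # (Assumes leading coefficients divide exactly, which is true here.)
    rem, p = {}, clean(dict(p))
    while p:
        m = max(p)  # lex leading monomial
        for d in divisors:
            dm = max(d)
            if m[0] >= dm[0] and m[1] >= dm[1]:
                p = sub_scaled(p, d, (m[0] - dm[0], m[1] - dm[1]), p[m] // d[dm])
                break
        else:
            rem[m] = p.pop(m)
    return rem

f1 = {(1, 0): 1, (0, 0): -1}  # A - 1      (the rule A -> 1)
f2 = {(1, 1): 1, (0, 2): 1}   # AB + B^2   (the rule AB -> -B^2)
p = {(1, 1): 1, (0, 2): 1}    # the string S = AB + B^2

print(divide(p, [f1, f2]))  # {(0, 2): 1, (0, 1): 1}: remainder B^2 + B (stuck)
print(divide(p, [f2, f1]))  # {}: remainder 0

# completing the system with B^2 + B (the Groebner step) removes the
# order sensitivity: the remainder is 0 either way.
g = {(0, 2): 1, (0, 1): 1}
print(divide(p, [f1, g]), divide(p, [g, f1]))  # {} {}
```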
We can then go on to phrase this whole rewriting setup in the language of ideals from ring theory, and that is the language in which it is most often described. I’ve gone into more depth on that perspective here: “What is a grobner basis? polynomial factorization as rewrite systems”.
Now that we have a handle on what a grobner basis is, let’s go on to solve the original problem:
I’ll first demonstrate the idea of how to solve the original problem by solving a slightly simpler problem:
Rewrite a^b^c in terms of a^b, b^c, c^a, and the same +_mod8 instruction set as the original problem. The only difference this time is that we do not have T -> V ^ V ^ V.
The idea is to construct the polynomial ring over Z/8Z (integers modulo 8) with variables a, b, c, axorb, bxorc, axorc. Now, we know that a^b = a + b - 2ab. So, we set up rewrite rules such that a + b - 2ab -> axorb, b + c - 2bc -> bxorc, and c + a - 2ca -> axorc.
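Before trusting the rewrite rules, it's worth checking the xor-as-polynomial identity on bits (a plain-Python check of my own):

```python
# On bits, x ^ y = x + y - 2*x*y; iterating it gives the 3-variable version.
for x in (0, 1):
    for y in (0, 1):
        assert x ^ y == x + y - 2*x*y
        for z in (0, 1):
            yz = y + z - 2*y*z                 # y ^ z as a polynomial
            assert x ^ y ^ z == x + yz - 2*x*yz
print("xor-as-polynomial identities hold on bits")
```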
We construct the polynomial f(a, b, c) = a^b^c, written in terms of addition and multiplication, defined as f_orig(a, b, c) = 4*a*b*c - 2*a*b - 2*a*c - 2*b*c + a + b + c. We then rewrite f_orig with respect to our rewrite rules. Hopefully, the rewrite rules should give us a clean expression in terms of one-variable and two-variable xors. There is the danger that we may have some term such as a * bxorc — and we do get such a term (-2*b*axorc) in this case — but it does not appear in the original problem.
# Create ring with variables a, b, c, axorb, bxorc, axorc
R = IntegerModRing(8)['a, b, c, axorb, bxorc, axorc']
(a, b, c, axorb, bxorc, axorc) = R.gens()

# xor of 2 numbers as a polynomial
def xor2(x, y): return x + y - 2*x*y

# xor of 3 numbers as a polynomial
def xor3(x, y, z): return xor2(x, xor2(y, z))

# define the ideal which contains the relations:
# xor2(a, b) -> axorb, xor2(b, c) -> bxorc, xor2(a, c) -> axorc
# we also add the relation (a^2 - a = 0 => a = 0 or a = 1)
# since we know that our variables are only {0, 1}
I = ideal((axorb - xor2(a, b), bxorc - xor2(b, c), axorc - xor2(a, c), a*a-a, b*b-b, c*c-c))

# the polynomial representing a^b^c we wish to reduce
f_orig = xor3(a, b, c)

# we take the groebner basis of the ideal to reduce the polynomial f.
IG = I.groebner_basis()

# we reduce a^b^c with respect to the groebner basis.
f_reduced = f_orig.reduce(IG)

print("value of a^b^c:\n\t%s\n\treduced: %s" % (f_orig, f_reduced))

# Code to evaluate the function `f` on all inputs to check correctness
def evalxor2(f):
    for (i, j, k) in [(i, j, k) for i in [0, 1] for j in [0, 1] for k in [0, 1]]:
        ref = i^^j^^k
        eval = f.substitute(a=i, b=j, c=k, axorb=i^^j, bxorc=j^^k, axorc=i^^k)
        print("%s^%s^%s: ref(%s) =?= f(%s): %s" %
              (i, j, k, ref, eval, ref == eval))

# check that the original formulation is correct
print("evaluating original f for sanity check:")
evalxor2(f_orig)

# check that the reduced formulation is correct
print("evaluating reduced f:")
evalxor2(f_reduced)
Running the code gives us the reduced polynomial -2*b*axorc + b + axorc, which unfortunately contains the term b * axorc. So, this approach does not work here, and I was informed by my friend that she is unaware of a solution to this problem (writing a^b^c in terms of smaller xors and sums).
The full code output is:
value of a^b^c:
    4*a*b*c - 2*a*b - 2*a*c - 2*b*c + a + b + c
    reduced: -2*b*axorc + b + axorc
evaluating original f for sanity check:
0^0^0: ref(0) =?= f(0): True
0^0^1: ref(1) =?= f(1): True
0^1^0: ref(1) =?= f(1): True
0^1^1: ref(0) =?= f(0): True
1^0^0: ref(1) =?= f(1): True
1^0^1: ref(0) =?= f(0): True
1^1^0: ref(0) =?= f(0): True
1^1^1: ref(1) =?= f(1): True
evaluating reduced f:
0^0^0: ref(0) =?= f(0): True
0^0^1: ref(1) =?= f(1): True
0^1^0: ref(1) =?= f(1): True
0^1^1: ref(0) =?= f(0): True
1^0^0: ref(1) =?= f(1): True
1^0^1: ref(0) =?= f(0): True
1^1^0: ref(0) =?= f(0): True
1^1^1: ref(1) =?= f(1): True
That is, both the original polynomial and the reduced polynomial match the expected results. But the reduced polynomial is not in our language L, since it has a term that is a product of b with axorc.
We try the exact same approach on the original problem of expressing a ^ b ^ c ^ d. We find that the reduced polynomial is
    -a - b + c + 3*d - 3*axorb - axorc
    + axord - bxorc + bxord + 3*cxord
    - 3*axorbxorc - axorbxord
    + axorcxord + bxorcxord
which happily has no products between terms! It also passes our sanity check, so we’ve now found the answer.
The full output is:
value of a^b^c^d:
    4*a*b*c + 4*a*b*d + 4*a*c*d + 4*b*c*d
    - 2*a*b - 2*a*c - 2*b*c - 2*a*d
    - 2*b*d - 2*c*d + a + b + c + d
    reduced: -a - b + c + 3*d - 3*axorb
    - axorc + axord - bxorc + bxord
    + 3*cxord - 3*axorbxorc
    - axorbxord + axorcxord + bxorcxord
evaluating original a^b^c^d
0^0^0^0: ref(0) =?= f(0): True
0^0^0^1: ref(1) =?= f(1): True
0^0^1^0: ref(1) =?= f(1): True
0^0^1^1: ref(0) =?= f(0): True
0^1^0^0: ref(1) =?= f(1): True
0^1^0^1: ref(0) =?= f(0): True
0^1^1^0: ref(0) =?= f(0): True
0^1^1^1: ref(1) =?= f(1): True
1^0^0^0: ref(1) =?= f(1): True
1^0^0^1: ref(0) =?= f(0): True
1^0^1^0: ref(0) =?= f(0): True
1^0^1^1: ref(1) =?= f(1): True
1^1^0^0: ref(0) =?= f(0): True
1^1^0^1: ref(1) =?= f(1): True
1^1^1^0: ref(1) =?= f(1): True
1^1^1^1: ref(0) =?= f(0): True
evaluating reduced a^b^c^d
0^0^0^0: ref(0) =?= f(0): True
0^0^0^1: ref(1) =?= f(1): True
0^0^1^0: ref(1) =?= f(1): True
0^0^1^1: ref(0) =?= f(0): True
0^1^0^0: ref(1) =?= f(1): True
0^1^0^1: ref(0) =?= f(0): True
0^1^1^0: ref(0) =?= f(0): True
0^1^1^1: ref(1) =?= f(1): True
1^0^0^0: ref(1) =?= f(1): True
1^0^0^1: ref(0) =?= f(0): True
1^0^1^0: ref(0) =?= f(0): True
1^0^1^1: ref(1) =?= f(1): True
1^1^0^0: ref(0) =?= f(0): True
1^1^0^1: ref(1) =?= f(1): True
1^1^1^0: ref(1) =?= f(1): True
1^1^1^1: ref(0) =?= f(0): True
The full code for the a^b^c^d reduction:
def xor2(x, y): return x + y - 2*x*y
def xor3(x, y, z): return xor2(x, xor2(y, z))
R = IntegerModRing(8)['a, b, c, d, axorb, axorc, axord, bxorc, bxord, cxord, axorbxorc, axorbxord, axorcxord, bxorcxord']
(a, b, c, d, axorb, axorc, axord, bxorc, bxord, cxord, axorbxorc, axorbxord, axorcxord, bxorcxord) = R.gens()
I = ideal((axorb - xor2(a, b),
           axorc - xor2(a, c),
           axord - xor2(a, d),
           bxorc - xor2(b, c),
           bxord - xor2(b, d),
           cxord - xor2(c, d),
           axorbxorc - xor3(a, b, c),
           axorbxord - xor3(a, b, d),
           axorcxord - xor3(a, c, d),
           bxorcxord - xor3(b, c, d),
           a*a-a,
           b*b-b,
           c*c-c,
           d*d-d))
IG = I.groebner_basis()
f_orig = (xor2(a, xor2(b, xor2(c, d))))
f_reduced = f_orig.reduce(IG)
print("value of a^b^c^d:\n\t%s\n\treduced: %s" % (f_orig, f_reduced))
def evalxor3(f):
    for (i, j, k, l) in [(i, j, k, l) for i in [0, 1] for j in [0, 1] for k in [0, 1] for l in [0, 1]]:
        ref = i^^j^^k^^l
        eval = f.substitute(a=i, b=j, c=k, d=l,
                            axorb=i^^j, axorc=i^^k,
                            axord=i^^l, bxorc=j^^k,
                            bxord=j^^l, cxord=k^^l,
                            axorbxorc=i^^j^^k, axorbxord=i^^j^^l,
                            axorcxord=i^^k^^l, bxorcxord=j^^k^^l)
        print("%s^%s^%s^%s: ref(%s) =?= f(%s): %s" %
              (i, j, k, l, ref, eval, ref == eval))
print("evaluating original a^b^c^d")
evalxor3(f_orig)
print("evaluating reduced a^b^c^d")
evalxor3(f_reduced)
This was a really fun exercise: around a hundred lines of code illuminates the use of machinery such as grobner bases for solving real-world problems! I really enjoyed hacking this up and getting nerd sniped.
I found out it's called Janus, since Janus is the god of doorways in Roman mythology. Hence, he is also the god of duality and transitions — he literally looks both into the future and into the past.
He is usually depicted as having two faces, since he looks to the future and to the past.
An apt name for the language!
A = B
— A book about proofs of combinatorial closed forms. The book explains algorithms for solving for closed forms of combinatorial recurrences, by means of Zeilberger's algorithm. The book is co-written by Zeilberger himself, and supposedly also teaches one Maple. I'd like to learn the algorithm, since it might be useful eventually for Groebner basis / loop analysis shenanigans I like to play as part of my work on compilers.
Generating k bitsets of a given length n: the problem is to generate all bitvectors of length n that have k bits set. For example, generate all bitvectors of length 5 that have 3 bits set.
I know that an algorithm exists in Hacker's Delight, but I've been too sick to crack open a book, so I decided to discover the algorithm myself. The one I came up with relies on looking at the numbers as moving at a certain velocity, and colliding with each other. For example, let us try to generate all 5C3 combinations of bits.
We start with:
#1 count of position
a b c d e positions
1 1 1 0 0 bitset
<     velocity
Where the < represents that the 1 at position a is moving leftwards. Our arena is circular, so the leftmost 1 can wrap around to the right. This leads to the next state:
#2
a b c d e
0 1 1 0 1
    <
We continue moving left peacefully.
#3
a b c d e
0 1 1 1 0
   < 
Whoops, we have now collided with a block of 1s. Not to worry, we simply transfer our velocity by way of collision, from the 1 at d to the 1 at b. I denote the transfer as follows:
#3
a b c d e
0 1 1 1 0 original state
   < 
 < < <  transfer of velocity
 <    final state after transfer of velocity
The 1 at b proceeds along its merry way with the given velocity:
#4
a b c d e
1 0 1 1 0
<    
Once again, it wraps around, and suffers a collision
#5
a b c d e
0 0 1 1 1
     < (collision, transfer)
  < < < transfer of velocity
  <   final state after transfer of velocity
This continues:
0 1 0 1 1 #6
 <   
1 0 0 1 1 #7
<     (collision, transfer velocity)
<   < <
   < 
1 0 1 0 1 #8
  <  
1 1 0 0 1 #9
 <    (collision, transfer velocity)
< <   <
    <
1 1 0 1 0 #10
   < 
1 1 1 0 0 #11: wrap around to initial state
I don’t have a proof of correctness, but I have an intuition that this should generate all states. Does anyone have a proof?
EDIT: this algorithm does not work, since it will keep clusters of $k-1$ bits next to each other, when a bit hits a cluster of $k - 1$ bits. For completeness, I'm going to draft out the usual algorithm in full:
Let's consider the same example of 5C3:
a b c d e
1 0 0 1 1 1 (LSB)
We start with all bits at their lowest position. Now, we try to go to the next smallest number which still has 3 bits toggled. Clearly, we need the bit at position b to be 1, since that's the next number. Then, we can keep the lower 2 bits d, e set to 1, so that it's still as small a number as possible.
a b c d e
2 0 1 0 1 1 (LSB)
Once again, we now move the digit at d to the digit at c, while keeping the final digit at e to make sure it's still the smallest possible.
a b c d e
3 0 1 1 0 1 (LSB)
Now, we can move the 1 at e to d, since that will lead to the smallest increase:
a b c d e
4 0 1 1 1 0 (LSB)
At this point, we are forced to move to location a, since we have exhausted all smaller locations. So we move the 1 at b to a, and then reset all the other bits to be as close to the LSB as possible:
a b c d e
5 1 0 0 1 1 (LSB)
Continuing this process gives us the rest of the sequence:
a b c d e
5  1 0 0 1 1
6  1 0 1 0 1
7  1 0 1 1 0
8  1 1 0 0 1 (note the reset of d!)
9  1 1 0 1 0
10 1 1 1 0 0
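This stepping procedure is what the standard bit-twiddling trick, Gosper's hack, implements directly; I believe this is the Hacker's Delight algorithm alluded to earlier. A Python sketch:

```python
# Gosper's hack: given x, compute the next larger integer with the same
# number of set bits; use it to enumerate all n-bit values with k bits set.
def next_same_popcount(x):
    c = x & -x                          # lowest set bit of x
    r = x + c                           # bump the lowest block of 1s
    return (((x ^ r) >> 2) // c) | r    # refill the leftover 1s at the LSB

def k_subsets(n, k):
    x = (1 << k) - 1                    # smallest number with k bits set
    while x < (1 << n):
        yield x
        x = next_same_popcount(x)

combos = list(k_subsets(5, 3))
print([format(x, '05b') for x in combos])
# ['00111', '01011', '01101', '01110', '10011',
#  '10101', '10110', '11001', '11010', '11100']
```

Reading each bitstring MSB-to-LSB as positions a..e, this is exactly the sequence tabulated above.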
An alternative formalism to derive special relativity geometrically, resting purely on hypotheses about the way light travels.
However, I’ve not been able to prove the correctness of the assumptions made, by using coordinate geometry. I suspect this is because I will need to use hyperbolic geometry for the “side lengths” to work out.
Indeed, I found another source, called The k-calculus fiddle, which attempts to discredit k-calculus. The author of the above blog writes at the end:
In asking Ray D'Inverno's permission to use his book as the example of k-calculus, he was kind enough to point out that the arguments I have given are invalid. Chapter 2 of his book should be read through to the end and then reread in the light of the fact that the geometry of space and time is Minkowskian. Euclidean geometry should not be used in interpreting the diagrams because their geometry is Minkowskian.
which seems to imply that we need to use hyperbolic geometry for this.
I found this file as I was cleaning up some old code, for a project to implement a fast K/V store on an FPGA, so I thought I should put this up for anyone else who stumbles on the same frustrations / errors. I'm not touching this particular toolchain again with a 10-foot pole till the tools stabilize by a lot.
Structs with fields of type bool and int16: the data buses will be 8/16 bits long, with error:
[BD 41241] Message from IP propagation TCL of /blk_mem_gen_7: set_property
error: Validation failed for parameter 'Write Width A(Write_Width_A)' for BD
Cell 'blk_mem_gen_7'. Value '8' is out of the range (32,1024) Customization
errors found on 'blk_mem_gen_7'. Restoring to previous valid configuration.
struct S {
    bool b;
    int16 x;
    int16 y;
};
This gets generated as 3 ports for memory, of widths 1, 16, and 16, which I cannot connect to BRAM. Ideally, I wanted one port of width 16+16+1=33 for each struct value. The data_pack pragma allows us to create one port of width 16+16+1=33.
Shared function names allocated on BRAM causes errors in synthesis:
struct Foo {...};

void f (Foo conflict) {
    #pragma HLS interface bram port=conflict
}

void g (Foo conflict) {
    #pragma HLS interface bram port=conflict
}
(commit 3c0d619039cff7a7abb61268e6c8bc6d250d8730)
ap_int causes compile failure in RTL generation (commit 3c0d619039cff7a7abb61268e6c8bc6d250d8730).
x % m where m != 2^k is very expensive – there must be faster encodings of modulus?
I ended up reading the memory as raw uint32 values: for whatever reason, having a #pragma pack did something to the representation of the struct as far as I can tell, and I couldn't treat the memory as just a raw struct * on the other side:
// HLS side
struct Vec2 { int x; int y; };
void f(Vec2 points[NUM_POINTS]) {
    #pragma HLS DATA_PACK variable=points
    #pragma HLS INTERFACE bram port=points
    points[0] = {2, 3};
}

// Host side
Vec2 *points = (Vec2 *)(0xPOINTER_LOCATION_FROM_VIVADO);
int main() {
    // points[0] will *not* be {2, 3}!
}
On generating a new bitstream from Vivado, Vivado SDK tries to reload the config, fails at the reloading (thinks xil_print.h doesn't exist), and then fails to compile code. Options are to either restart Vivado SDK, or refresh xil_print.h.
git add-ing everything, but this is a terrible strategy in so many ways. (link to tutorial we were following)
The executable is named with a .exe extension while it's actually an ELF executable. (The SDAccel tutorials say it is called .elf.)
We need to run bootd manually (for "boot default") so it boots into Linux. (The SDAccel tutorials say it automatically boots into it.)
We need to mount /dev/mmcblk0p1 ourselves. (The SDAccel tutorials say that it should be automatically mounted.)
We run hashing.elf. It dies with a truly bizarre error: hashing.elf: command not found. This is almost ungoogleable, due to the fact that the same problem occurs when people don't have the correct file name.
I replaced ls with hashing.elf to see what would happen, because I conjectured that the shell was able to run coreutils.
This gave ls: core not found. I'd luckily seen this during my android days, and knew this was from busybox.
readelf -l <exe name> dumps out [Requesting program interpreter: /lib/ld-linux-armhf.so.3]. So, I bravely copied: cp /lib/ld-linux.so.3 /lib/ld-linux-armhf.so.3.
.zynq> /hashing.elf
/hashing.elf: error while loading shared libraries:
libxilinxopencl.so: cannot open shared object file: No such file or directory
At this point, clearly we have some linker issues (why does xocc not correctly statically link? What's up with it? Why does it expect to be able to load a shared library? WTF is happening). Do note that this is not the way the process is supposed to go according to the tutorial!
I couldn't find libxilinxopencl.so, so that's a dead end. I'm completely unsure if the tutorial even makes sense. This entire chain of debugging is full of luck.
the BOOT file. At some point, I gave up on the entire enterprise.
The question a Grobner basis allows us to answer is this: can the polynomial $p(x, y) = xy^2 + y$ be factorized in terms of $a(x, y) = xy + 1$ and $b(x, y) = y^2 - 1$, such that $p(x, y) = f(x, y) a(x, y) + g(x, y) b(x, y)$ for some arbitrary polynomials $f(x, y), g(x, y) \in R[x, y]$?
One might imagine, "well, I'll divide and see what happens!" Now, there are two routes to go down: we could divide out by $a(x, y)$ first, or by $b(x, y)$ first, and the two orders can leave different remainders. So, clearly, the order in which we perform the division starts to matter! Ideally, we want an algorithm which is not sensitive to the order in which we choose to apply these rewrites.
An alternative viewpoint of asking "can this be factorized" is to ask "can we look at the factorization as a rewrite rule"? For this perspective, notice that "factorizing" in terms of $xy + 1$ is the same as being able to set $xy = -1$, and then have the polynomial collapse to zero. (For the more algebraic minded, this relates to the fact that $R[x] / p(x) \sim R(\text{roots of } p)$). The intuition behind this is that when we "divide by $xy + 1$", really what we are doing is setting $xy + 1 = 0$, and then seeing what remains. But $xy + 1 = 0 \iff xy = -1$. Thus, we can look at the original question as:
How can we apply the rewrite rules $xy \rightarrow -1$, $y^2 \rightarrow 1$, along with the regular rewrite rules of polynomial arithmetic, to the polynomial $p(x, y) = xy^2 + y$, such that we end with the value $0$?
Our two derivations above correspond to the application of the rules:
That is, our rewrite rules are not confluent.
The grobner basis is a mathematical object, which is a confluent set of rewrite rules for the above problem. That is, it's a set of polynomials which manage to find the rewrite $p(x, y) \xrightarrow{\star} 0$, regardless of the order in which we apply them. It's also correct, in that it only rewrites to $0$ if the original system had some way to rewrite to $0$.
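For the running example, the membership is easy to exhibit directly: $p = y \cdot a + 0 \cdot b$, i.e. $f(x, y) = y$ and $g(x, y) = 0$. A quick numeric sanity check of that identity on a grid of points (plain Python, my own check):

```python
# p = x*y**2 + y lies in the ideal: p = y*(x*y + 1) + 0*(y**2 - 1).
for x in range(-5, 6):
    for y in range(-5, 6):
        assert x*y**2 + y == y*(x*y + 1) + 0*(y**2 - 1)
print("p = y*a + 0*b holds on the sample grid")
```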
We need to identify critical pairs, which in this setting are called S-polynomials.
Let $f_i = H(f_i) + R(f_i)$ and $f_j = H(f_j) + R(f_j)$. Let $m = lcm(H(f_i), H(f_j))$, and let $m_i, m_j$ be monomials such that $m_i \cdot H(f_i) = m = m_j \cdot H(f_j)$. The S-polynomial induced by $f_i, f_j$ is defined as $S(f_i, f_j) = m_i f_i - m_j f_j$.
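For the running example $a = xy + 1$, $b = y^2 - 1$ (both monic, under lex order with $x > y$), the S-polynomial can be computed in a few lines of Python; the exponent-pair encoding is my own:

```python
# Monomials are exponent pairs (i, j) for x^i y^j; lex order with x > y.
# Both inputs are assumed monic (leading coefficient 1), as is the case here.
def s_polynomial(f, g):
    mf, mg = max(f), max(g)                     # leading monomials
    m = (max(mf[0], mg[0]), max(mf[1], mg[1]))  # lcm of the two monomials
    s = {}
    for (i, j), c in f.items():                 # add (m / LM(f)) * f
        key = (i + m[0] - mf[0], j + m[1] - mf[1])
        s[key] = s.get(key, 0) + c
    for (i, j), c in g.items():                 # subtract (m / LM(g)) * g
        key = (i + m[0] - mg[0], j + m[1] - mg[1])
        s[key] = s.get(key, 0) - c
    return {mono: c for mono, c in s.items() if c != 0}

a = {(1, 1): 1, (0, 0): 1}   # x*y + 1
b = {(0, 2): 1, (0, 0): -1}  # y^2 - 1

# lcm(x*y, y^2) = x*y^2, so S(a, b) = y*a - x*b = x + y:
print(s_polynomial(a, b))    # {(0, 1): 1, (1, 0): 1}
```

The leading terms cancel by construction, exposing the lower-order terms $x + y$ that the completion procedure must then account for.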
This picture finally made the difference between these two things clear. The lie bracket moves along the flow, while the torsion moves along parallel transport.
This is why the sides of the parallelogram that measure torsion form, well, a parallelogram: we set them up using parallel transport.
On the other hand, the lie bracket measures the actual failure of the parallelogram from being formed.
Click the title to go to the post. We replicate the STOKE paper in Haskell, to implement a superoptimiser based on MCMC methods.
BlockId, Label, Unique: we have this hierarchy of BlockId, Label, and Unique that can be collapsed.
These appear to be a version of spatial hierarchical data structures for fast interaction computation. Apparently, multipole expansions are not useful in this case, since multipole expansions take into account long range effects, but not short range effects.
@coro.*
intrinsicsThis is part of a larger thread — Adding CPS call support to LLVM where there is a large discussion on the correct design of how to teach LLVM about CPS.
Gor Nishanov provided the above example of encoding CPS using the llvm coro
instructions.
MO_Add2 and MO_AddWordC: Both of these are lowered the same way, but they should be different. In particular, GHC.Prim explains: AddWordC# returns (result, carry), while PlusWordC# returns (carry, result).
Honestly, this is confusing, but I guess there’s some story to having two separate primops for this?
newtype D a = D { unD :: [(a, Double)] } deriving(Eq, Show, Ord)

instance Functor D where
    -- fmap :: (a -> b) -> D a -> D b
    fmap f (D xs) = D $ fmap (\(a, d) -> (f a, d)) xs

instance Monad D where
    return x = D $ [(x, 1.0)]
    -- f :: a -> (D b)
    (D as) >>= f = D $ do -- list monad
        (a, p) <- as
        (b, p2) <- unD (f a)
        return $ (b, p * p2)

-- [(a, 0.5), (b, 0.5)]
-- [(a, 0.3), (a, 0.2), (b, 0.1), (b, 0.4)]

instance Applicative D where
    pure = return
    ff <*> fa = do
        f <- ff
        a <- fa
        return $ f a

condition :: Bool -> D ()
condition True = D [((), 1.0)]
condition False = D [((), 0.0)]

dice :: D Int
dice = let p = 1.0 / 6 in D $ [(x, p) | x <- [1..6]]

dice_hard :: D Int
dice_hard = do
    x <- dice
    condition $ x > 3
    return $ x

main :: IO ()
main = do
    print dice
    print dice_hard
This gives the output:
D {unD = [(1,0.16666666666666666),
(2,0.16666666666666666),
(3,0.16666666666666666),
(4,0.16666666666666666),
(5,0.16666666666666666),
(6,0.16666666666666666)]}
D {unD = [(1,0.0),
(2,0.0),
(3,0.0),
(4,0.16666666666666666),
(5,0.16666666666666666),
(6,0.16666666666666666)]}
Notice that D a ~= WriterT (Product Float) []!
The classic explanation of word2vec
, in skipgram, with negative sampling,
in the paper and countless blog posts on the internet is as follows:
while(1) {
1. vf = vector of focus word
2. vc = vector of context word
3. train such that (vc . vf = 1)
4. for(0 <= i < negative samples):
vneg = vector of word *not* in context
train such that (vf . vneg = 0)
}
Indeed, if I google “word2vec skipgram”, the results I get all repeat the explanation above.
The original word2vec C
implementation does not do what’s explained above,
and is drastically different. Most serious users of word embeddings, who use
embeddings generated from word2vec, use the gensim
implementation, which is transliterated from the
C source to the extent that the variable names are the same. Indeed, the gensim
implementation is the only one that I know of which
is faithful to the C implementation.
The C implementation in fact maintains two vectors for each word, one where
it appears as a focus word, and one where it appears as a context word.
(Is this sounding familiar? Indeed, it appears that GloVe actually took this
idea from word2vec
, which has never mentioned this fact!)
The setup is incredibly well done in the C code:
syn0 holds the vector embedding of a word when it occurs as a focus word. This is randomly initialized.
https://github.com/tmikolov/word2vec/blob/20c129af10659f7c50e86e3be406df663beff438/word2vec.c#L369
for (a = 0; a < vocab_size; a++) for (b = 0; b < layer1_size; b++) {
  next_random = next_random * (unsigned long long)25214903917 + 11;
  syn0[a * layer1_size + b] =
      (((next_random & 0xFFFF) / (real)65536) - 0.5) / layer1_size;
}
syn1neg holds the vector of a word when it occurs as a context word. This is zero initialized.
https://github.com/tmikolov/word2vec/blob/20c129af10659f7c50e86e3be406df663beff438/word2vec.c#L365
for (a = 0; a < vocab_size; a++) for (b = 0; b < layer1_size; b++)
  syn1neg[a * layer1_size + b] = 0;
if (negative > 0) for (d = 0; d < negative + 1; d++) {
  // if we are performing negative sampling, in the 1st iteration,
  // pick a word from the context and set the dot product target to 1
  if (d == 0) {
    target = word;
    label = 1;
  } else {
    // for all other iterations, pick a word randomly and set the dot
    // product target to 0
    next_random = next_random * (unsigned long long)25214903917 + 11;
    target = table[(next_random >> 16) % table_size];
    if (target == 0) target = next_random % (vocab_size - 1) + 1;
    if (target == word) continue;
    label = 0;
  }
  l2 = target * layer1_size;
  f = 0;
  // find dot product of original vector with negative sample vector
  // store in f
  for (c = 0; c < layer1_size; c++) f += syn0[c + l1] * syn1neg[c + l2];
  // set g = sigmoid(f) (roughly, the actual formula is slightly more complex)
  if (f > MAX_EXP) g = (label - 1) * alpha;
  else if (f < -MAX_EXP) g = (label - 0) * alpha;
  else g = (label - expTable[(int)((f + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2))]) * alpha;
  // 1. update the vector syn1neg,
  // 2. DO NOT UPDATE syn0
  // 3. STORE THE syn0 gradient in a temporary buffer neu1e
  for (c = 0; c < layer1_size; c++) neu1e[c] += g * syn1neg[c + l2];
  for (c = 0; c < layer1_size; c++) syn1neg[c + l2] += g * syn0[c + l1];
}
// Finally, after all samples, update syn1 from neu1e
https://github.com/tmikolov/word2vec/blob/20c129af10659f7c50e86e3be406df663beff438/word2vec.c#L541
// Learn weights input -> hidden
for (c = 0; c < layer1_size; c++) syn0[c + l1] += neu1e[c];
Once again, since none of this is actually explained in the original papers or on the web, I can only hypothesize.
My hypothesis is that since the negative samples come from all over the text and are not really weighed by frequency, you can wind up picking any word, and more often than not, a word whose vector has not been trained much at all. If this vector actually had a value, then it could move the actually important focus word randomly.
The solution is to set all negative samples to zero, so that only vectors that have occurred somewhat frequently will affect the representation of another vector.
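To make the scheme concrete, here is a minimal numpy sketch of the update rule as I read it from the C code (the dimensions, seed, and learning rate are made up, and `train_window` is my name for the inner loop, not word2vec’s):

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIM, ALPHA = 100, 20, 0.025

# focus vectors (syn0): random init; context vectors (syn1neg): zero init
syn0 = (rng.random((VOCAB, DIM)) - 0.5) / DIM
syn1neg = np.zeros((VOCAB, DIM))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_window(focus, context, negatives):
    # accumulate the syn0 gradient in a buffer; update syn1neg immediately
    neu1e = np.zeros(DIM)
    for target, label in [(context, 1)] + [(neg, 0) for neg in negatives]:
        f = syn0[focus] @ syn1neg[target]
        g = (label - sigmoid(f)) * ALPHA
        neu1e += g * syn1neg[target]
        syn1neg[target] += g * syn0[focus]
    # only now touch the focus vector
    syn0[focus] += neu1e
```

Because syn1neg starts at zero, the very first window leaves syn0 untouched: the buffered gradient g * syn1neg[target] is zero. This is exactly the “only vectors that have already been trained somewhat can affect other vectors” behaviour hypothesized above.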
It’s quite ingenious, really, and until this, I’d never really thought of how important initialization strategies really are.
I spent two months of my life trying to reproduce word2vec
, following
the paper exactly, reading countless articles, and simply not succeeding.
I was unable to reach the same scores that word2vec
did, and it was not
for lack of trying.
I could not have imagined that the paper would have literally fabricated an algorithm that doesn’t work, while the implementation does something completely different.
Eventually, I decided to read the sources, and spent three whole days convinced I was reading the code wrong since literally everything on the internet told me otherwise.
I don’t understand why the original paper and the internet contain zero
explanations of the actual mechanism behind word2vec
, so I decided to put
it up myself.
This also explains GloVe’s radical choice of having a separate vector
for the negative context — they were just doing what word2vec
does, but they told people about it :).
Is this academic dishonesty? I don’t know the answer, and that’s a heavy question. But I’m frankly incredibly pissed, and this is probably the last time I take a machine learning paper’s explanation of the algorithm seriously again — from next time, I read the source first.
This is a section that I’ll update as I learn more about the space. Since I’m studying differential geometry over the summer, I hope to know enough about “symplectic manifolds”. I’ll make this an append-only log to add to the section as I understand more.
To perform hamiltonian monte carlo, we use the hamiltonian and its derivatives to provide a momentum to our proposal distribution — That is, when we choose a new point from the current point, our probability distribution for the new point is influenced by our current momentum
For some integral necessary within this scheme, Euler integration doesn’t cut it since the error diverges to infinity
Hence, we need an integrator that guarantees that the energy of our system is conserved.
Enter the leapfrog integrator. This integrator is also time reversible – We can run it
forward for n
steps, and then run it backward for n
steps to arrive at the same state.
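A tiny self-contained check of this reversibility, on a harmonic oscillator H(q, p) = (q² + p²)/2 (my example, with a made-up step size, not from the original post):

```python
def leapfrog(q, p, dt, n):
    # kick-drift-kick leapfrog for H(q, p) = q^2/2 + p^2/2,
    # where dq/dt = dH/dp = p and dp/dt = -dH/dq = -q
    for _ in range(n):
        p -= 0.5 * dt * q  # half-step momentum
        q += dt * p        # full-step position
        p -= 0.5 * dt * q  # half-step momentum
    return q, p

q0, p0 = 1.0, 0.0
q, p = leapfrog(q0, p0, 0.1, 1000)   # run forward...
q, p = leapfrog(q, p, -0.1, 1000)    # ...then backward with -dt
print(q - q0, p - p0)                # both ~0, up to float roundoff
```

Negating dt exactly undoes each kick and drift, so the composition returns to the starting state up to floating-point roundoff.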
Now I finally know how Braid was implemented, something that bugged the hell out of 9th grade me
when I tried to implement Braid-like physics in my engine!
The actual derivation of the integrator uses Lie algebras, symplectic geometry, and other
diffgeo ideas, which is great, because it gives me motivation to study differential geometry :)
Original paper: Construction of higher order symplectic integrators
Clearly, the leapfrog integrator preserves energy and continues to move in an orbit, while the euler integrator goes batshit and causes orbits to spiral outwards. Full code is available below. More of the code is spent coaxing matplotlib to look nice, than doing the actual computation.
import numpy as np
import matplotlib.pyplot as plt
import numpy.linalg
# dq/dt = dH/dp ; dp/dt = -dH/dq
def leapfroge(dhdp, dhdq, q, p, dt):
    p += -dhdq(q, p) * 0.5 * dt # half-step momentum
    q += dhdp(q, p) * dt # full-step position
    p += -dhdq(q, p) * 0.5 * dt # half-step momentum
    return (q, p)

def euler(dhdp, dhdq, q, p, dt):
    pnew = p - dhdq(q, p) * dt
    qnew = q + dhdp(q, p) * dt
    return (qnew, pnew)

def planet(integrator, n, dt):
    STRENGTH = 0.5
    q = np.array([0.0, 1.0]); p = np.array([1.0, 0.0])
    # H = STRENGTH * |q| (potential) + p^2/2 (kinetic)
    def H(qcur, pcur): return STRENGTH * np.linalg.norm(qcur) + np.dot(pcur, pcur) / 2
    def dhdp(qcur, pcur): return pcur
    def dhdq(qcur, pcur): return STRENGTH * 2 * qcur / np.linalg.norm(qcur)
    qs = []
    for i in range(n):
        (q, p) = integrator(dhdp, dhdq, q, p, dt)
        qs.append(q.copy())
    return np.asarray(qs)
NITERS = 15
TIMESTEP = 1
plt.rcParams.update({'font.size': 22, 'font.family':'monospace'})
fig, ax = plt.subplots()
planet_leapfrog = planet(leapfroge, NITERS, TIMESTEP)
ax.plot(planet_leapfrog[:, 0], planet_leapfrog[:, 1], label='leapfrog',
linewidth=3, color='#00ACC1')
planet_euler = planet(euler, NITERS, TIMESTEP)
ax.plot(planet_euler[:, 0], planet_euler[:, 1], label='euler',
linewidth=3, color='#D81B60')
legend = plt.legend(frameon=False)
ax.set_title("leapfrog v/s euler: NITERS=%s dt=%s" % (NITERS, TIMESTEP))
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['bottom'].set_visible(False)
ax.spines['left'].set_visible(False)
plt.show()
plt.savefig("leapfrogvseuler.png")
We create a simple monad called PL
which allows for a single operation: sampling
from a uniform distribution. We then exploit this to implement MCMC using metropolis hastings,
which is used to sample from arbitrary distributions. Bonus is a small library to render sparklines
in the CLI.
For next time:
{-# LANGUAGE GeneralizedNewtypeDeriving #-}
{-# LANGUAGE GADTs #-}
{-# LANGUAGE StandaloneDeriving #-}
{-# LANGUAGE FlexibleContexts #-}
{-# LANGUAGE FlexibleInstances #-}
{-# LANGUAGE UndecidableInstances #-}
{-# LANGUAGE DeriveFunctor #-}
import System.Random
import Data.List(sort, nub)
import Data.Proxy
import Control.Monad (replicateM)
import qualified Data.Map as M
-- | Loop a monadic computation.
mLoop :: Monad m =>
      (a -> m a) -- ^ loop
      -> Int -- ^ number of times to run
      -> a -- ^ initial value
      -> m a -- ^ final value
mLoop _ 0 a = return a
mLoop f n a = f a >>= mLoop f (n - 1)
-- | Utility library for drawing sparklines

-- | List of characters that represent sparklines
sparkchars :: String
sparkchars = "_▁▂▃▄▅▆▇█"

-- | Convert an int to a sparkline character
num2spark :: RealFrac a => a -- ^ Max value
    -> a -- ^ Current value
    -> Char
num2spark maxv curv =
    sparkchars !!
        (floor $ (curv / maxv) * (fromIntegral (length sparkchars - 1)))

series2spark :: RealFrac a => [a] -> String
series2spark vs =
    let maxv = if null vs then 0 else maximum vs
    in map (num2spark maxv) vs

seriesPrintSpark :: RealFrac a => [a] -> IO ()
seriesPrintSpark = putStrLn . series2spark
-- Probabilities
-- =============
type F = Float

-- | probability density
newtype P = P { unP :: Float } deriving(Num)

-- | prob. distributions over space a
newtype D a = D { runD :: a -> P }

uniform :: Int -> D a
uniform n =
    D $ \_ -> P $ 1.0 / (fromIntegral $ n)

(>$<) :: Contravariant f => (b -> a) -> f a -> f b
(>$<) = cofmap

instance Contravariant D where
    cofmap f (D d) = D (d . f)

-- | Normal distribution with given mean
normalD :: Float -> D Float
normalD mu = D $ \f -> P $ exp (- ((f-mu)^2))

-- | Distribution that takes on value x^p for 1 <= x <= 2. Is normalized
polyD :: Float -> D Float
polyD p = D $ \f -> P $ if 1 <= f && f <= 2 then (f ** p) * (p + 1) / (2 ** (p+1) - 1) else 0

class Contravariant f where
    cofmap :: (b -> a) -> f a -> f b
data PL next where
    Ret :: next -> PL next -- ^ return a value
    Sample01 :: (Float -> PL next) -> PL next -- ^ sample uniformly from a [0, 1) distribution

instance Monad PL where
    return = Ret
    (Ret a) >>= f = f a
    (Sample01 float2plnext) >>= next2next' =
        Sample01 $ \f -> float2plnext f >>= next2next'

instance Applicative PL where
    pure = return
    ff <*> fx = do
        f <- ff
        x <- fx
        return $ f x

instance Functor PL where
    fmap f plx = do
        x <- plx
        return $ f x

-- | operation to sample from [0, 1)
sample01 :: PL Float
sample01 = Sample01 Ret
-- | Run one step of MH on a distribution to obtain a (correlated) sample
mhStep :: (a -> Float) -- ^ function to score sample with, proportional to distribution
    -> (a -> PL a) -- ^ Proposal program
    -> a -- ^ current sample
    -> PL a
mhStep f q a = do
    a' <- q a
    let alpha = f a' / f a -- acceptance ratio
    u <- sample01
    return $ if u <= alpha then a' else a

-- | Typeclass that can provide me with data to run MCMC on it
class MCMC a where
    arbitrary :: a
    uniform2val :: Float -> a

instance MCMC Float where
    arbitrary = 0
    -- map [0, 1) -> (-infty, infty)
    uniform2val v = tan (-pi/2 + pi * v)
{-
-- | Any enumerable object has a way to get me the starting point for MCMC
instance (Bounded a, Enum a) => MCMC a where
    arbitrary = toEnum 0
    uniform2val v = let
        maxf = fromIntegral . fromEnum $ maxBound
        minf = fromIntegral . fromEnum $ minBound
        in toEnum $ floor $ minf + v * (maxf - minf)
-}
-- | Run MH to sample from a distribution
mh :: (a -> Float) -- ^ function to score sample with
    -> (a -> PL a) -- ^ proposal program
    -> a -- ^ current sample
    -> PL a
mh f q a = mLoop (mhStep f q) 100 $ a

-- | Construct a program to sample from an arbitrary distribution using MCMC
mhD :: MCMC a => D a -> PL a
mhD (D d) =
    let
        scorer = (unP . d)
        proposal _ = do
            f <- sample01
            return $ uniform2val f
    in mh scorer proposal arbitrary
-- | Run the probabilistic value to get a sample
sample :: RandomGen g => g -> PL a -> (a, g)
sample g (Ret a) = (a, g)
sample g (Sample01 f2plnext) = let (f, g') = random g in sample g' (f2plnext f)

-- | Sample n values from the distribution
samples :: RandomGen g => Int -> g -> PL a -> ([a], g)
samples 0 g _ = ([], g)
samples n g pl = let (a, g') = sample g pl
                     (as, g'') = samples (n - 1) g' pl
                 in (a:as, g'')

-- | count fraction of times value occurs in list
occurFrac :: (Eq a) => [a] -> a -> Float
occurFrac as a =
    let noccur = length (filter (==a) as)
        n = length as
    in (fromIntegral noccur) / (fromIntegral n)

-- | Produce a distribution from a PL by using the sampler to sample N times
distribution :: (Eq a, Num a, RandomGen g) => Int -> g -> PL a -> (D a, g)
distribution n g pl =
    let (as, g') = samples n g pl in (D (\a -> P (occurFrac as a)), g')

-- | biased coin
coin :: Float -> PL Int -- 1 with prob. p1, 0 with prob. (1 - p1)
coin p1 = do
    Sample01 (\f -> Ret $ if f < p1 then 1 else 0)
-- | Create a histogram from values.
histogram :: Int -- ^ number of buckets
    -> [Float] -- ^ values
    -> [Int]
histogram nbuckets as =
    let
        minv :: Float
        minv = minimum as
        maxv :: Float
        maxv = maximum as
        -- value per bucket
        perbucket :: Float
        perbucket = (maxv - minv) / (fromIntegral nbuckets)
        bucket :: Float -> Int
        bucket v = floor (v / perbucket)
        bucketed :: M.Map Int Int
        bucketed = foldl (\m v -> M.insertWith (+) (bucket v) 1 m) mempty as
    in map snd . M.toList $ bucketed

printSamples :: (Real a, Eq a, Ord a, Show a) => String -> [a] -> IO ()
printSamples s as = do
    putStrLn $ "***" <> s
    putStrLn $ " samples: " <> series2spark (map toRational as)

printHistogram :: [Float] -> IO ()
printHistogram samples = putStrLn $ series2spark (map fromIntegral . histogram 10 $ samples)
-- | Given a coin bias, take samples and print bias
printCoin :: Float -> IO ()
printCoin bias = do
    let g = mkStdGen 1
    let (tosses, _) = samples 100 g (coin bias)
    printSamples ("bias: " <> show bias) tosses

-- | Create normal distribution as sum of uniform distributions.
normal :: PL Float
normal = fromIntegral . sum <$> (replicateM 5 (coin 0.5))

main :: IO ()
main = do
    printCoin 0.01
    printCoin 0.99
    printCoin 0.5
    printCoin 0.7
    putStrLn $ "normal distribution using central limit theorem: "
    let g = mkStdGen 1
    let (nsamples, _) = samples 1000 g normal
    -- printSamples "normal: " nsamples
    printHistogram nsamples
    putStrLn $ "normal distribution using MCMC: "
    let (mcmcsamples, _) = samples 1000 g (mhD $ normalD 0.5)
    printHistogram mcmcsamples
    putStrLn $ "sampling from x^4 with finite support"
    let (mcmcsamples, _) = samples 1000 g (mhD $ polyD 4)
    printHistogram mcmcsamples
***bias: 1.0e-2
samples: ________________________________________█_█_________________________________________________________
***bias: 0.99
samples: ████████████████████████████████████████████████████████████████████████████████████████████████████
***bias: 0.5
samples: __█____█__███_███_█__█_█___█_█_██___████████__█_████_████_████____██_█_██_____█__██__██_██____█__█__
***bias: 0.7
samples: __█__█_█__███_█████__███_█_█_█_██_█_████████__███████████_████_█_███_████_██__█_███__██_███_█_█__█_█
normal distribution using central limit theorem:
_▄▇█▄_
normal distribution using MCMC:
__▁▄█▅▂▁___
sampling from x^4 with finite support
▁▁▃▃▃▄▅▆▇█_
{-# LANGUAGE GeneralizedNewtypeDeriving #-}
import qualified Data.Map.Strict as M
-- | This file can be copy-pasted and will run!

-- | Symbols
type Sym = String

-- | Environments
type E a = M.Map Sym a

-- | Newtype to represent derivative values
type F = Float
newtype Der = Der { under :: F } deriving(Show, Num)

infixl 7 !#
-- | We are indexing the map at a "hash" (Sym)
(!#) :: E a -> Sym -> a
(!#) = (M.!)
-- | A node in the computation graph
data Node =
    Node { name :: Sym -- ^ Name of the node
         , ins :: [Node] -- ^ inputs to the node
         , out :: E F -> F -- ^ output of the node
         , der :: (E F, E (Sym -> Der))
                  -> Sym -> Der -- ^ derivative wrt to a name
         }

-- | @ looks like a "circle", which is a node. So we are indexing the map
-- at a node.
(!@) :: E a -> Node -> a
(!@) e node = e M.! (name node)
-- | Given the current environments of values and derivatives, compute
-- the new value and derivative for a node.
run_ :: (E F, E (Sym -> Der)) -> Node -> (E F, E (Sym -> Der))
run_ ein (Node name ins out der) =
    let (e', ed') = foldl run_ ein ins -- run all the inputs
        v = out e' -- compute the output
        dv = der (e', ed') -- and the derivative
    in (M.insert name v e', M.insert name dv ed') -- and insert them

-- | Run the program given a node
run :: E F -> Node -> (E F, E (Sym -> Der))
run e n = run_ (e, mempty) n
-- | Let's build nodes
nconst :: Sym -> F -> Node
nconst n f = Node n [] (\_ -> f) (\_ _ -> 0)

-- | Variable
nvar :: Sym -> Node
nvar n = Node n [] (!# n) (\_ n' -> if n == n' then 1 else 0)

-- | binary operation
nbinop :: (F -> F -> F) -- ^ output computation from inputs
    -> (F -> Der -> F -> Der -> Der) -- ^ derivative computation from outputs
    -> Sym -- ^ Name
    -> (Node, Node) -- ^ input nodes
    -> Node
nbinop f df n (in1, in2) =
    Node { name = n
         , ins = [in1, in2]
         , out = \e -> f (e !# name in1) (e !# name in2)
         , der = \(e, ed) n' ->
             let (name1, name2) = (name in1, name in2)
                 (v1, v2) = (e !# name1, e !# name2)
                 (dv1, dv2) = (ed !# name1 $ n', ed !# name2 $ n')
             in df v1 dv1 v2 dv2
         }

nadd :: Sym -> (Node, Node) -> Node
nadd = nbinop (+) (\v dv v' dv' -> dv + dv')

nmul :: Sym -> (Node, Node) -> Node
nmul = nbinop (*) (\v (Der dv) v' (Der dv') -> Der $ (v*dv') + (v'*dv))
main :: IO ()
main = do
    let x = nvar "x" :: Node
    let y = nvar "y"
    let xsq = nmul "xsq" (x, x)
    let ten = nconst "10" 10
    let xsq_plus_10 = nadd "xsq_plus_10" (xsq, ten)
    let xsq_plus_10_plus_y = nadd "xsq_plus_10_plus_y" (xsq_plus_10, y)
    let (e, de) = run (M.fromList $ [("x", 2.0), ("y", 3.0)]) xsq_plus_10_plus_y
    putStrLn $ show e
    putStrLn $ show $ de !@ xsq_plus_10_plus_y $ "x"
    putStrLn $ show $ de !@ xsq_plus_10_plus_y $ "y"
Yeah, in ~80 lines of code, you can basically build an autograd engine. Isn’t haskell so rad?
Use -v3 to get pass timings. AndreasK:
- Register allocation, common block elimination, block layout and pretty printing are the “slow” things in the backend as far as I remember.
- There are also a handful of TODO’s in the x86 codegen which still apply. So you can try to grep for these.
- Strength reduction for division by a constant
ghc/testsuite/tests/rts/T7160.hs
A comment from this test case tells us why the function debugBelch2
exists:
ghc/testsuite/tests/rts/T7160.hs
-- Don't use debugBelch() directly, because we cannot call varargs functions
-- using the FFI (doing so produces a segfault on 64-bit Linux, for example).
-- See Debug.Trace.traceIO, which also uses debugBelch2.
foreign import ccall "&debugBelch2" fun :: FunPtr (Ptr () -> Ptr () -> IO ())
The implementation is:
ghc/libraries/base/cbits/PrelIOUtils.c
void debugBelch2(const char*s, char *t)
{
debugBelch(s,t);
}
ghc/rts/RtsMessages.c
RtsMsgFunction *debugMsgFn = rtsDebugMsgFn;
...
void
debugBelch(const char*s, ...)
{
va_list ap;
va_start(ap,s);
(*debugMsgFn)(s,ap);
va_end(ap);
}
I wanted to use debug info to help build a better debugging experience
within tweag/asterius
. So, I was
reading through the sources of cmm/Debug.hs
.
I’d never considered how to debug debuginfo, and I found the information
tucked inside a cute note in GHC (Note [Debugging DWARF unwinding info]
):
This makes GDB produce a trace of its internal workings. Having gone this far, it’s just a tiny step to run GDB in GDB. Make sure you install debugging symbols for gdb if you obtain it through a package manager.
When the switch scrutinee is out of range, the code generator switches to the first label. It should be more profitable to switch to an unreachable block. That way, LLVM can take advantage of UB.
Great link to the GHC wiki that describes the concurrency primitives “bottom up”: https://gitlab.haskell.org/ghc/ghc/wikis/lightweightconcurrency
There are way too many objects in diffgeo, all of them subtly connected. Here I catalogue all of the ones I have run across:
A manifold $M$ of dimension $n$ is a topological space. So, there is a topological structure $T$ on $M$. There is also an Atlas, which is a family of _Chart_s that satisfy some properties.
A chart is a pair $(O \in T, cm: O \to \mathbb R^n)$. The $O$ is an open set of the manifold, and $cm$ (“chart for $m$”) is a continuous mapping from $O$ to $\mathbb R^n$, under the subspace topology for $O$ and the standard topology for $\mathbb R^n$.
An Atlas is a collection of _Chart_s such that the charts cover the manifold, and the charts are pairwise compatible. That is, $A = \{ (U_i, \phi_i) \}$, such that $\cup_i U_i = M$, and every $\phi_j \circ \phi_i^{-1}$ is smooth.
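As a concrete example (mine, not from the original notes): the circle $S^1$ cannot be covered by a single chart, but two angle charts suffice, and their transition map is smooth on each component of the overlap:

\begin{align}
&O_1 = S^1 \setminus \{(1,0)\}, \quad cm_1(\cos\theta, \sin\theta) = \theta \in (0, 2\pi) \\
&O_2 = S^1 \setminus \{(-1,0)\}, \quad cm_2(\cos\theta, \sin\theta) = \theta \in (-\pi, \pi) \\
&(cm_2 \circ cm_1^{-1})(\theta) =
\begin{cases} \theta & \theta \in (0, \pi) \\ \theta - 2\pi & \theta \in (\pi, 2\pi) \end{cases}
\end{align}

Each branch is a translation, hence smooth, so the two charts are compatible and form an atlas.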
Let $f: M \to N$ be a mapping from an $m$ dimensional manifold to an $n$ dimensional manifold. Let $frep = cn \circ f \circ cm^{-1}: \mathbb R^m \to \mathbb R^n$, where $cm: M \to \mathbb R^m$ is a chart for $M$ and $cn: N \to \mathbb R^n$ is a chart for $N$. $frep$ is $f$ represented in local coordinates. If $frep$ is smooth for all choices of $cm, cn$, then $f$ is a differentiable map from $M$ to $N$.
Let $I$ be an open interval of $\mathbb R$ which includes the point 0. A Curve is a differentiable map $C: (a, b) \to M$ where $a < 0 < b$.
A function is a differentiable mapping from $M$ to $\mathbb R$, i.e. f(m): M -> R.
The directional derivative of f(m): M -> R with respect to a curve c(t): I -> M is denoted as c[f].
Let g(t) = (f . c)(t) :: I -c-> M -f-> R = I -> R.
This is the value dg/dt(t0) = (d (f . c) / dt) (0).
Tangent vector at a point p: On an m dimensional manifold M, a tangent vector at a point p is an equivalence class of curves that have c(0) = p, such that c1(t) ~ c2(t) iff there is some chart (O, ch) with c1(0) ∈ O such that
d/dt (ch . c1: R -> R^m) = d/dt (ch . c2: R -> R^m).
That is, they have equal derivatives.
Tangent space (TpM): The set of all tangent vectors at a point p forms a vector space TpM.
We prove this by creating a bijection from every curve to a vector R^n.
Let (U, ch: U -> R^n) be a chart around the point p, where p ∈ U ⊆ M. Now,
the bijection is defined as:
forward: (I -> M) -> R^n
forward(c) = d/dt (ch . c)
reverse: R^n -> (I -> M)
reverse(v)(t) = ch^(-1) (tv)
Cotangent space (TpM*): dual space of the tangent space / space of all linear functions from TpM to R.
For each function f, there is a cotangent vector, colorfully called df. The definition is df: TpM -> R, df(c: I -> M) = c[f]. That is, given a curve c, we take the directional derivative of the function f along the curve c. We need to prove that this is constant for all vectors in the equivalence class and blah.
Pushforward push(f): TpM -> TpN: Given a curve c: I -> M, the pushforward is the curve f . c : I -> N. This extends to the equivalence classes and provides us a way to move curves in M to curves in N, and thus gives us a mapping from the tangent spaces.
This satisfies the identity:
push(f)(v)[g] === v[g . f]
Pullback pull(f): TpN* -> TpM*: Given a linear functional wn : TpN -> R, the pullback is defined as wn . push(f) : TpM -> R.
This satisfies the identity:
(pull wn)(v) === wn (push v)
(pull (wn : TpN -> R): TpM -> R) (v : TpM) : R = (wn: TpN -> R) (push (v: TpM): TpN) : R
TODO
Stumbled across this idea while reading some posts on a private discourse.
Continually adding new thunks without forcing them can lead to a space leak, aka the dreaded monadic parsing backtracking problem.
Continually running new thunks can lead to a “time leak”, where we spend far too much time running things that should not be run in the first place!
This is an interesting perspective that I’ve never seen articulated before, and somehow helps make space leaks feel more… palatable? Before, I had no analogue to a space leak in the strict world, so I saw them as a pathology. But with this new perspective, I can see that the strict world’s version of a space leak is a time leak.
An observation I had: the function
f(x) = x/2 if (x % 2 == 0)
f(x) = 3x + 1 otherwise
is a Presburger function, so by building better approximations to the transitive closure of a Presburger function, one could get better answers to the Collatz conjecture. Unfortunately, the ISL (integer set library) of today is not great against this formidable foe.
The code:
#include <stdio.h>
#include <isl/set.h>
#include <isl/version.h>
#include <isl/map.h>
#include <isl/aff.h>
#include <isl/local_space.h>
#include <isl/constraint.h>
#include <isl/space.h>
int main() {
    isl_ctx *ctx = isl_ctx_alloc();
    const char *s = "{ [x] -> [x / 2] : x % 2 = 0; [x] -> [3 * x + 1] : x % 2 = 1}";
    isl_map *m = isl_map_read_from_str(ctx, s);
    isl_map_dump(m);
    isl_bool b;
    isl_map *p = isl_map_transitive_closure(m, &b);
    printf("exact: %d\n", b);
    printf("map:\n");
    isl_map_dump(p);
}
Produces the somewhat disappointing, and yet expected output:
$ clang bug.c -lisl -Lisl-0.20/.libs -o bug -I/usr/local/include/
$ ./bug
{ [x] -> [o0] : 2o0 = x or (exists (e0 = floor((1 + x)/2): o0 = 1 + 3x and 2e0 = 1 + x)) }
exact: 0
map:
{ [x] -> [o0] }
I find it odd that it is unable to prove anything about the image, even that
it is nonnegative, for example. This is an interesting direction in which
to improve the functions isl_map_power
and isl_map_transitive_closure
though.
I’ve always seen compactness be used by starting with a possibly infinite cover, and then filtering it into a finite subcover. This finite subcover is then used for finiteness properties (like summing, min, max, etc.).
I recently ran across a use of compactness when one starts with the set of all possible subcovers, and then argues about why a cover cannot be built from these subcovers if the set is compact. I found it to be a very cool use of compactness, which I’ll record below:
If a countably infinite family of compact sets S_a has all finite intersections nonempty, then the intersection of the whole family S_a is nonempty.
Let S = intersection of S_a
. We know that S
must be compact since
all the S_a
are compact, and the intersection of a countably infinite
number of compact sets is compact.
Now, suppose S is empty. This means that for every point p, there is some index i such that p ∉ S_i.
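The standard argument can be sketched as follows (my completion, assuming the sets live in a Hausdorff space, so that compact sets are closed):

\begin{align}
&\text{Suppose } \bigcap_a S_a = \emptyset. \text{ Let } U_a = X \setminus S_a, \text{ which is open since each } S_a \text{ is closed.} \\
&\text{Then } \bigcup_a U_a = X \supseteq S_1, \text{ so } \{U_a\} \text{ is an open cover of the compact set } S_1. \\
&\text{Pick a finite subcover } U_{a_1}, \dots, U_{a_n} \text{ of } S_1. \text{ Then } S_1 \cap S_{a_1} \cap \dots \cap S_{a_n} = \emptyset, \\
&\text{contradicting the assumption that all finite intersections are nonempty.}
\end{align}

This is exactly the “set of all possible subcovers” flavour of compactness: no finite batch of the complements can cover S_1, so no cover can be built at all.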
We can see that the Cantor set is nonempty, since it is the intersection of a family
of closed and bounded sets S1 ⊇ S2 ⊇ S3 ⊇ ...,
where each S_i is one step of the cantorification. The theorem applies, since:
Each finite intersection is nonempty, and will be equal to the set that has the highest index in the finite intersection.
Each of the sets S_i is compact, since they are closed and bounded subsets of R.
Invoke theorem.
Japanese contains a separate kanji set called daiji
, to prevent people
from adding strokes to stuff previously written.
# Common Formal
1 一 壱
2 二 弐
3 三 参
I’ve taken to watching the live stream when I have some downtime and want some interesting content.
The discussions of Wolfram with his group are great, and they bring up really interesting ideas (like that of cleave being very irregular).
Cleave as a word has some of the most irregular inflections.
Single Axioms for Groups and Abelian Groups with Various Operations provides a single axiom for groups. This can be useful for some ideas I have for training groups, where we can use this axiom as the loss function!
Word2Vec C code implements gradient descent really weirdly. I’ll be posting snippets of the original source code, along with a link to the Github sources. We are interested in exploring the skipgram implementation of Word2Vec, with negative sampling, without hierarchical softmax. I assume basic familiarity with word embeddings and the skipgram model.
// https://github.com/tmikolov/word2vec/blob/master/word2vec.c#L708
expTable = (real *)malloc((EXP_TABLE_SIZE + 1) * sizeof(real));
for (i = 0; i < EXP_TABLE_SIZE; i++) {
  expTable[i] = exp((i / (real)EXP_TABLE_SIZE * 2 - 1) *
                    MAX_EXP); // Precompute the exp() table
  expTable[i] =
      expTable[i] / (expTable[i] + 1); // Precompute f(x) = x / (x + 1)
}
Here, the code constructs a lookup table which maps [0...EXP_TABLE_SIZE-1]
to [sigmoid(-MAX_EXP)...sigmoid(MAX_EXP)]. The index i first gets mapped
to (i / EXP_TABLE_SIZE) * 2 - 1, which sends 0 to -1 and EXP_TABLE_SIZE
to 1. This is then rescaled by MAX_EXP.
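In Python, the same table-plus-clamping trick looks like this (a sketch using the constants the C code defines; `fast_sigmoid` is my name for it):

```python
import math

EXP_TABLE_SIZE, MAX_EXP = 1000, 6

# expTable[i] holds sigmoid(x) where x sweeps [-MAX_EXP, MAX_EXP)
expTable = []
for i in range(EXP_TABLE_SIZE):
    e = math.exp((i / EXP_TABLE_SIZE * 2 - 1) * MAX_EXP)
    expTable.append(e / (e + 1))  # exp(x) / (exp(x) + 1) = sigmoid(x)

def fast_sigmoid(f):
    # clamp, then look up: this mirrors the three branches in the C training loop
    if f >= MAX_EXP: return 1.0
    if f <= -MAX_EXP: return 0.0
    return expTable[int((f + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2))]
```

The index expression inverts the table construction: (f + MAX_EXP) * EXP_TABLE_SIZE / (2 * MAX_EXP) maps f ∈ [-MAX_EXP, MAX_EXP) back onto [0, EXP_TABLE_SIZE), so the lookup approximates sigmoid(f) directly.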
syn0 is a global variable, initialized with random weights in the range of
[-0.5...0.5]. It has dimensions VOCAB x HIDDEN. This layer holds the
hidden representations of word vectors.
// https://github.com/imsky/word2vec/blob/master/word2vec.c#L341
a = posix_memalign((void **)&syn0, 128,
                   (long long)vocab_size * layer1_size * sizeof(real));
...
// https://github.com/imsky/word2vec/blob/master/word2vec.c#L355
for (a = 0; a < vocab_size; a++)
  for (b = 0; b < layer1_size; b++) {
    next_random = next_random * (unsigned long long)25214903917 + 11;
    syn0[a * layer1_size + b] =
        (((next_random & 0xFFFF) / (real)65536) - 0.5) / layer1_size;
  }
syn1neg is a global variable that is zero-initialized. It has dimensions
VOCAB x HIDDEN. This layer also holds hidden representations of word vectors,
when they are used as a negative sample.
// https://github.com/imsky/word2vec/blob/master/word2vec.c#L350
a = posix_memalign((void **)&syn1neg, 128,
                   (long long)vocab_size * layer1_size * sizeof(real));
...
for (a = 0; a < vocab_size; a++)
  for (b = 0; b < layer1_size; b++) syn1neg[a * layer1_size + b] = 0;
neu1e is a temporary per-thread buffer (remember that the word2vec C code
uses CPU threads for parallelism) which is zero initialized. It has dimensions
1 x HIDDEN.
// https://github.com/imsky/word2vec/blob/master/word2vec.c#L370
real *neu1e = (real *)calloc(layer1_size, sizeof(real));
Throughout word2vec, no 2D arrays are used. Indexing of the form
arr[word][ix] is manually written as arr[word * layer1_size + ix]. So, I
will call word * layer1_size the “base address”, and ix the “offset”
of the array index expression henceforth.
Here, l1 is the base address of the word at the center of the window (the focus word). l2 is the base address of either a word that is negative-sampled from the corpus, or a word that is a positive sample from within the context window.
label
tells us whether the sample is a positive or a negative sample.
label = 1
for positive samples, and label = 0
for negative samples.
// zero initialize neu1e
// https://github.com/imsky/word2vec/blob/master/word2vec.c#L419
for (c = 0; c < layer1_size; c++) neu1e[c] = 0;
...
// loop through each negative sample
// https://github.com/imsky/word2vec/blob/master/word2vec.c#L508
if (negative > 0) for (d = 0; d < negative + 1; d++) {
...
// https://github.com/imsky/word2vec/blob/master/word2vec.c#L521
// take the dot product: f = syn0[focus] . syn1neg[context]
for (c = 0; c < layer1_size; c++) f += syn0[c + l1] * syn1neg[c + l2];
// compute: g = (label - sigmoid(2f - 1)) * alpha
// g is computed using lookups into a lookup table and clamping for
// efficiency.
if (f > MAX_EXP) g = (label - 1) * alpha;
else if (f < -MAX_EXP) g = (label - 0) * alpha;
else
  g = (label - expTable[(int)((f + MAX_EXP) *
                              (EXP_TABLE_SIZE /
                               MAX_EXP / 2))]) * alpha;
// Now that we have computed the gradient:
// `g = (label - output) * learning rate`,
// we need to perform backprop. This is where the code gets weird.
for (c = 0; c < layer1_size; c++) neu1e[c] += g * syn1neg[c + l2];
for (c = 0; c < layer1_size; c++) syn1neg[c + l2] += g * syn0[c + l1];
} // end loop through negative samples
// Learn weights input -> hidden
for (c = 0; c < layer1_size; c++) syn0[c + l1] += neu1e[c];
We have two vectors for each word: one called syn0[l1 + _] and the other syn1neg[l2 + _]. The syn1neg word embedding is used whenever a word is used as a negative sample, and is not used anywhere else. Also, the syn1neg vectors are zero-initialized, while the syn0 vectors are randomly initialized.
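The update loop can be re-expressed as a NumPy sketch. This is a hedged paraphrase, not the author's code: it uses the exact sigmoid instead of the clamped table lookup, and the function name and argument layout are my own.

```python
import numpy as np

def sgns_update(syn0, syn1neg, focus, samples, alpha):
    """One focus word's skip-gram negative-sampling update.

    `samples` is a list of (word_index, label) pairs: the positive
    context word carries label 1, negative samples carry label 0.
    """
    neu1e = np.zeros_like(syn0[focus])         # accumulated gradient for syn0[focus]
    for (ctx, label) in samples:
        f = np.dot(syn0[focus], syn1neg[ctx])  # score
        g = (label - 1.0 / (1.0 + np.exp(-f))) * alpha
        neu1e += g * syn1neg[ctx]              # gradient wrt syn0[focus], deferred
        syn1neg[ctx] += g * syn0[focus]        # context vector updated immediately
    syn0[focus] += neu1e                       # syn0 held constant until the end
    return syn0, syn1neg
```

Note how the sketch reproduces the structure the C code has: syn1neg is updated inside the loop, while syn0[focus] only changes once, after all samples are processed.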
The values we backprop with, g * syn1neg[l2 + _] and g * syn0[l1 + _], are not the correct gradients of the error term! The derivative of a sigmoid is dsigmoid(x)/dx = sigmoid(x) [1 - sigmoid(x)]. The [1 - sigmoid(x)] is nowhere to be seen, let alone the fact that we are using sigmoid(2x - 1) and not regular sigmoid. Very weird.
We hold the value of syn0
constant throughout all the negative samples,
which was not mentioned in any tutorial I’ve read.
The paper does not mention these implementation details, and neither does any blog post that I've read. I don't understand what's going on, and I plan on updating this section when I understand this better.
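One candidate explanation for the missing $[1 - \sigma(x)]$ factor: if the loss being optimized is the logistic cross-entropy $L = -[y \log \sigma(f) + (1 - y)\log(1 - \sigma(f))]$ (with $y$ the label and $f$ the score), rather than squared error, then the factor cancels against the derivative of the log:

\begin{align}
&\frac{\partial L}{\partial f} = -\left[ y \, (1 - \sigma(f)) - (1 - y) \, \sigma(f) \right] = \sigma(f) - y
\end{align}

so a gradient descent step adds $\alpha \, (y - \sigma(f))$ times the other vector, which matches the shape of g above. This still leaves the $2x - 1$ rescaling unexplained.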
The B programming language. It's quite awesome to read the sources. For example, a.c
Given an array of qubits xs: Qubit[]
, I want to switch to little endian.
Due to no-cloning, I can't copy them! I suppose I can use recursion to build
up a new “list”. But this is not the efficient array version we know and love
and want.
The code that I want to work but does not:
function switchEndian(xs: Qubit[]): Unit {
    for(i in 0..Length(xs) - 1) {
        Qubit q = xs[i]; // boom, this does not work!
        xs[i] = xs[Length(xs) - 1 - i];
        xs[Length(xs) - 1 - i] = q;
    }
}
On the other hand, what does work is to set up a quantum circuit that performs this flipping, since it's a permutation matrix at the end of the day. But this is very slow, since it needs to simulate the "quantumness" of the solution, which takes 2^n basis vectors for n qubits.
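To see where the blowup comes from, here is a hypothetical NumPy sketch (not Q# and not the course code) of the bit-reversal permutation matrix acting on the 2^n-dimensional state space:

```python
import numpy as np

def endian_swap_matrix(n):
    """Permutation matrix that reverses the order of n qubits:
    basis state |b_{n-1} ... b_0> maps to |b_0 ... b_{n-1}>.
    The matrix is dense and 2^n x 2^n: this exponential size is
    exactly why simulating the flip as a circuit is slow."""
    N = 2 ** n
    P = np.zeros((N, N))
    for i in range(N):
        # reverse the n-bit binary representation of i
        j = int(format(i, f"0{n}b")[::-1], 2)
        P[j, i] = 1.0
    return P
```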
However, the usual recursion based solution works:
function switchEndian(xs: Qubit[]): Qubit[] {
    if(Length(xs) == 1) {
        return xs;
    } else {
        return switchEndian(xs[1..(Length(xs) - 1)]) + [xs[0]];
    }
}
This is of course, suboptimal.
I find it interesting that in the linear types world, often the “pure” solution is forced since mutation very often involves temporaries / copying!
(I’m solving assignments in qsharp for my course in college)
Core building block of effectively using the ellipsoid algorithm.
I recently learnt that the Toffoli and Hadamard gates are universal for quantum computation. The descriptions of these gates involve no complex numbers. So, we can write any quantum circuit in a "complex number free" form. The caveat is that we may very well have input qubits that require complex numbers.
Even so, a large number (all?) of the basic algorithms shown in Nielsen and Chuang can be encoded in an entirely complex-number free fashion.
I don't really understand the ramifications of this, since I had the intuition that the power of quantum computation comes from the ability to express complex phases along with superposition (tensoring). However, it now seems that in many cases the power does not come from going from R to C. This is definitely something to ponder.
I steal from wikipedia:
Comparative Illusion, which is a grammatical illusion where certain sentences seem grammatically correct when you read them, but upon further reflection actually make no sense.
For example: “More people have been to Berlin than I have.”
Napster connected to a central server that was an index of all users. Users are connected to each other to download MP3 files off of each other.
Each data segment is encrypted. Segments are downloaded individually from peers.
Finger table: nodes are keys, contain values. Node n's Chord finger table has one entry per index i, pointing to the first node whose id is >= (n + 2^i).
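A sketch of building such a finger table, under the assumption that successor(k) means the first live node at or after k on the identifier ring (illustrative code, not a full Chord node):

```python
def finger_table(n, node_ids, m):
    """Finger table for node n on a ring of 2^m identifiers.

    Entry i points to successor((n + 2^i) mod 2^m): the first live
    node whose id is >= that point, wrapping around the ring.
    """
    ring = sorted(node_ids)
    def successor(k):
        for node in ring:
            if node >= k:
                return node
        return ring[0]  # wrap around the ring
    return [successor((n + 2 ** i) % 2 ** m) for i in range(m)]
```

For example, with m = 3 and live nodes {0, 1, 3}, node 0's fingers point at the successors of 1, 2, and 4, i.e. nodes 1, 3, and 0.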
Each site maintains a local wait-for graph. A central site begins a check: all of the data is pushed to the central node, which performs the check.
When a packet is sent in a network, it should be forwarded using routing tables. The routing table must be computed when the network is initialized, and must be updated every time the topology of the network changes.
We assume that the network is undirected, and we have no negative weight cycles. Our weights are typically congestion.
Perform matrix multiplication on the (min, +)
semiring.
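A small sketch of the (min, +) product (illustrative, not optimized): + plays the role of multiplication and min plays the role of addition, so repeatedly "multiplying" the weight matrix by itself yields all-pairs shortest paths.

```python
import numpy as np

def min_plus(A, B):
    # C[i][j] = min_k (A[i][k] + B[k][j])
    n = A.shape[0]
    C = np.full((n, n), np.inf)
    for i in range(n):
        for j in range(n):
            C[i, j] = np.min(A[i, :] + B[:, j])
    return C

# W holds direct edge weights (inf if no edge, 0 on the diagonal).
# min_plus(W, W) gives shortest paths using at most 2 edges; iterating
# (or repeated squaring) converges to all-pairs shortest paths.
```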
def floyd_warshall(V, E):
    UnseenV = set(V)
    W = np.full((len(V), len(V)), np.inf)   # distances
    B = [[None for _ in V] for _ in V]      # intermediate paths (first hop)
    for (u, weight, v) in E: W[u][v] = weight; B[u][v] = v
    while UnseenV:
        w = UnseenV.pop()                   # pick an arbitrary unseen vertex
        for a in V:
            for b in V:
                if W[a][w] + W[w][b] < W[a][b]:
                    B[a][b] = B[a][w]       # a's first hop now goes towards w
                    W[a][b] = W[a][w] + W[w][b]
# nth processor is running the code floyd_warshall(n, V, E)
def floyd_warshall(n, V, E):
    UnseenV = set(V)
    W = np.full((len(V), len(V)), np.inf)   # distances
    B = [[None for _ in V] for _ in V]      # intermediate paths (first hop)
    for (u, weight, v) in E: W[u][v] = weight; B[u][v] = v
    while UnseenV:
        # the ordering by which w's will be selected is _deterministic_,
        # and is _shared across all nodes_
        w = min(UnseenV); UnseenV.remove(w)
        # assert: w is the same across all nodes
        if w == n:  # if we are the current node
            broadcast(W[w, :])              # broadcast the weights of w
        else:
            W[w, :] = receive()             # receive the updated weights of w
        for a in V:
            for b in V:
                if W[a][w] + W[w][b] < W[a][b]:
                    B[a][b] = B[a][w]       # a's first hop now goes towards w
                    W[a][b] = W[a][w] + W[w][b]
NB[][w] = u, then is NB[u][w] = a?
Here, we are trying to define how to broadcast the table W[w, :]. It is now w's turn to broadcast.
Define a graph $G_w \equiv (V_w, E_w)$.
We claim that $G_w$ is in fact a tree. This is because every edge $(u, x)$ in fact signals a path from $u \rightarrow x \xrightarrow{*} w$, so no cycles can form.
This graph is connected since we only add vertices that are at distance less than infinity.
Hence, $G_w$ is a connected directed acyclic graph, so a tree.
Since it's a tree, it has at most $|V| - 1$ nontrivial levels (level 0 is trivial).
Hence, the time complexity is $|V| - 1$.
Same as Chandy-Misra, but we allow edges to fail.
We assume that nodes are notified of failures and repairs of their adjacent channels.
If we get a packet that is addressed to someone else, and our distance to that node is not infinity, then we send it to our first hop.
If we get a packet that is addressed to someone else, and our distance to that node is infinity, discard.
Let the shortest path from $z \rightarrow w$ be through $u$, as $z \rightarrow u \rightarrow w$. Suppose channel $uw$ fails. $u$ recomputes its distance to $w$ as $d(w) = 1 + d(z, w) = 1 + 2 = 3$. Then it sets $nb(w) = z$. Is this incorrect?
ANSWER: No, it's not incorrect. Now $z$ will be invalidated and it'll attempt
to recompute its distance, causing its distance to go higher (say 5).
This invalidates $u$, leading to an increasing sequence of distances $3 \rightarrow 5 \rightarrow 7 \dots$.
If we have an upper bound on the distance (DIAMETER), then we know that this is wrong.
[ADHD: A lifelong struggle  Why organisation systems don’t work](https://gekk.info/articles/adhd.html) 