# Generating all bitvectors of length n with k bits set:

The problem is to generate all bitvectors of length n that have k bits set. For example, generate all bitvectors of length 5 that have 3 bits set.

I know that an algorithm exists in Hacker's Delight, but I've been too sick to crack open a book, so I decided to discover the algorithm myself. The one I came up with views the 1s as particles moving with a certain velocity and colliding with each other. For example, let us generate all 5C3 combinations of bits.

We start with:

#1           count of position
a b c d e    positions
1 1 1 0 0    bitset
< - - - -    velocity


Where the < represents that the 1 at position a is moving leftwards. Our arena is circular, so the leftmost 1 can wrap around to the right. This leads to the next state

#2
a b c d e
0 1 1 0 1
- - - - <


We continue moving left peacefully.

#3
a b c d e
0 1 1 1 0
- - - < -


Whoops, we have now collided with a block of 1s. Not to worry: we simply transfer our velocity by way of collision, from the 1 at d to the 1 at b.

I denote the transfer as follows:

#3
a b c d e
0 1 1 1 0  original state
- - - < -
- < < < -  transfer of velocity
- < - - -  final state after transfer of velocity


The 1 at b proceeds along its merry way with the given velocity

#4
a b c d e
1 0 1 1 0
< - - - -


Once again, it wraps around, and suffers a collision

#5
a b c d e
0 0 1 1 1  original state
- - - - <  (collision, transfer)
- - < < <  transfer of velocity
- - < - -  final state after transfer of velocity


This continues:

0 1 0 1 1  #6
- < - - -
1 0 0 1 1  #7
< - - - - (collision, transfer velocity)
< - - < <
- - - < -
1 0 1 0 1 #8
- - < - -
1 1 0 0 1 #9
- < - - - (collision, transfer velocity)
< < - - <
- - - - <
1 1 0 1 0 #10
- - - < -
1 1 1 0 0 #11: wrap around to initial state


I don’t have a proof of correctness, but I have an intuition that this should generate all states. Does anyone have a proof?
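Lacking a proof, here is a small Python sketch (mine, not from Hacker's Delight) of the collision rule above, usable for checking the conjecture empirically for small n and k:

```python
from itertools import combinations

def kbitsets(n, k):
    """Enumerate length-n bitvectors with k ones via the velocity/collision
    rule: one 1 carries a leftward velocity; after each (circular) move, if
    it lands adjacent to a block of 1s, the velocity transfers to the
    leftmost 1 of that block."""
    assert 0 < k < n
    bits = [1] * k + [0] * (n - k)
    mover = 0  # index of the 1 currently carrying the velocity
    start = (tuple(bits), mover)
    while True:
        yield tuple(bits)
        dst = (mover - 1) % n              # move one cell left, wrapping around
        bits[mover], bits[dst] = 0, 1
        mover = dst
        while bits[(mover - 1) % n] == 1:  # collision: walk to the block front
            mover = (mover - 1) % n
        if (tuple(bits), mover) == start:
            return

all_5_3 = {tuple(1 if i in c else 0 for i in range(5))
           for c in combinations(range(5), 3)}
```

In my quick tests, (5, 3) reproduces exactly the ten states traced above; for some sizes, e.g. (4, 2), states repeat before the cycle closes, but every combination still appears. So "visits every state" looks plausible, while "visits each state exactly once" is false in general.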

# Bondi k-calculus

An alternative formalism to derive special relativity geometrically, resting purely on hypotheses about the way light travels.

However, I’ve not been able to prove the correctness of the assumptions made, by using coordinate geometry. I suspect this is because I will need to use hyperbolic geometry for the “side lengths” to work out.

Indeed, I found another source, called The k-calculus fiddle, which attempts to discredit k-calculus. The author of that blog writes at the end:

In asking Ray D’Inverno’s permission to use his book as the example of k-calculus, he was kind enough to point out that the arguments I have given are invalid. Chapter 2 of his book should be read through to the end and then reread in the light of the fact that the geometry of space and time is Minkowskian. Euclidean geometry should not be used in interpreting the diagrams because their geometry is Minkowskian.

which seems to imply that we need to use hyperbolic geometry for this.
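For reference, the algebraic heart of the k-calculus is the relation between the k-factor (the ratio of radar pulse intervals between two observers) and their relative velocity. Assuming the usual radar definitions, as in d'Inverno's chapter 2, it is:

$$k = \sqrt{\frac{1 + v/c}{1 - v/c}}, \qquad v = c\,\frac{k^2 - 1}{k^2 + 1}.$$

That is, the k-factor is the relativistic Doppler factor, and composing velocities corresponds to multiplying k-factors.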

# Vivado HLS: frustrations from an FPGA K/V store project:

I found this file as I was cleaning up some old code from a project to implement a fast K/V store on an FPGA, so I thought I should put this up for anyone else who stumbles on the same frustrations / errors. I'm not touching this particular toolchain again with a ten-foot pole until the tools stabilize considerably.

• Unable to create BRAM for fields such as bool and int16. The data buses will be 8/16 bits wide, and it fails with:
[BD 41-241] Message from IP propagation TCL of /blk_mem_gen_7: set_property
error: Validation failed for parameter 'Write Width A(Write_Width_A)' for BD
Cell 'blk_mem_gen_7'. Value '8' is out of the range (32,1024) Customization
errors found on 'blk_mem_gen_7'. Restoring to previous valid configuration.

• I had an array of structs:

```cpp
struct S {
    bool b;
    int16 x;
    int16 y;
};
```


Ideally, I wanted a single memory port of width 16+16+1=33 for each struct value. However, what was generated was three ports of widths 1, 16, and 16, which I cannot connect to BRAM.

• data_pack allows us to create one port of width 16+16+1=33

• Sharing a parameter name across functions, when the parameter is allocated on BRAM, causes errors in synthesis:

```cpp
struct Foo {…};

void f(Foo conflict) {
    #pragma HLS interface bram port=conflict
}

void g(Foo conflict) {
    #pragma HLS interface bram port=conflict
}
```



• Enums cause compile failure in RTL generation (commit 3c0d619039cff7a7abb61268e6c8bc6d250d8730)
• ap_int causes compile failure in RTL generation (commit 3c0d619039cff7a7abb61268e6c8bc6d250d8730)
• x % m where m != 2^k is very expensive -- there must be faster encodings of modulus?
• How to share code between HLS and the Vivado SDK? I often wanted to share constant values between my HLS code and my Zynq code.
• I can't understand why an array of structs that was packed does not get deserialized correctly. I had to manually pack the struct into a uint32. For whatever reason, the #pragma did something to the representation of the struct as far as I can tell, and I couldn't treat the memory as a raw struct * on the other side:

```cpp
// HLS side
struct Vec2 { int x; int y; };
void f(Vec2 points[NUM_POINTS]) {
    #pragma HLS DATA_PACK variable=points
    #pragma HLS INTERFACE bram port=points

    points[0] = {2, 3};
}

// Host side
int main() {
    // points[0] will *not* be {2, 3}!
}
```
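As an aside on the x % m cost mentioned above: when the value being reduced advances by a fixed step each cycle, the general modulus can be strength-reduced to a compare-and-reset, which synthesizes to a comparator instead of a divider. A sketch of the transformation in Python (the HLS version would be the same idea in C):

```python
def cheap_mod_counter(n, m):
    """Generate n successive values of a counter reduced mod m, replacing
    the general `x % m` with a compare-and-reset."""
    out, x = [], 0
    for _ in range(n):
        out.append(x)
        x += 1
        if x == m:  # conditional reset instead of a modulus
            x = 0
    return out
```

This only works when the stride is known and smaller than m; a general `x % m` for arbitrary x still needs the expensive path.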

• If I change my IP, there is no way to preserve the current connections in the GUI while updating only the changed connections. I'm forced to remove the IP and add it again (no, the Refresh IP button does not work).
• On generating a new bitstream from Vivado, the Vivado SDK tries to reload the config, fails at the reloading (it thinks xil_print.h doesn't exist), and then fails to compile code. The options are to either restart the Vivado SDK or refresh xil_print.h.

• It is entirely unclear what to version control in a Vivado project, unless one has an omniscient view of the entire toolchain. I resorted to git adding everything, which is a terrible strategy in many ways.

#### SDAccel bugs

link to tutorial we were following

• The executable is named .exe though it's actually an ELF executable. (The SDAccel tutorials say it is called .elf.)
• The board is supposed to automatically boot into Linux, which it does not. One is expected to call bootd manually (for "boot default") so it boots into Linux. (The SDAccel tutorials say it boots automatically.)
• At this point, the SD card is unreadable. It took a bunch of time to figure out that the SD card needs to be mounted by us, at /dev/mmcblk0p1. (The SDAccel tutorials say that it should be automatically mounted.)
• At this point, we are unable to run hashing.elf. It dies with a truly bizarre error: hashing.elf: command not found. This is almost un-googleable, since the same message appears when people simply mistype the file name.
• I overwrote ls with hashing.elf to see what would happen, because I conjectured that the shell was able to run coreutils.
• This dies with a different error: ls: core not found. I'd luckily seen this during my Android days, and knew it comes from busybox.
• This led me to google "busybox unable to execute executable", which led me to a StackOverflow link that clued me into the fact that the ELF interpreter is missing.
• When I discovered this, I wound up trying to understand how to get the right ELF interpreter. readelf -l <exe name> dumps out [Requesting program interpreter: /lib/ld-linux-armhf.so.3]. So, I bravely copied: cp /lib/ld-linux.so.3 /lib/ld-linux-armhf.so.3.
• Stuff is still broken, but I now get useful error messages:
zynq> /hashing.elf
/hashing.elf: error while loading shared libraries: libxilinxopencl.so: cannot open shared object file: No such file or directory


At this point, we clearly have some linker issues (why does xocc not correctly statically link? Why does it expect to be able to load a shared library at all? What is happening?). Do note that this is not the way the process is supposed to go according to the tutorial!

• Of course, there’s no static library version of libxilinxopencl.so, so that’s a dead end. I’m completely unsure if the tutorial even makes sense.
• This entire chain of debugging is full of luck.

• Link talking about generating BOOT file

At some point, I gave up on the entire enterprise.

# What is a Gröbner basis - polynomial factorization as rewrite systems

##### A motivating example

The question a Gröbner basis allows us to answer is this: can the polynomial $p(x, y) = xy^2 + y$ be factorized in terms of $a(x, y) = xy + 1$ and $b(x, y) = y^2 - 1$, such that $p(x, y) = f(x, y) a(x, y) + g(x, y) b(x, y)$ for some polynomials $f(x, y), g(x, y) \in R[x, y]$?

One might imagine, “well, I’ll divide and see what happens!” Now, there are two routes to go down:

• $xy^2 + y = y(xy + 1) = y a(x, y) + 0 b(x, y)$. Well, problem solved?
• $xy^2 + y = xy^2 - x + x + y = x (y^2 - 1) + x + y = x b(x, y) + (x + y)$. Now what? We're stuck, and we can't apply $a(x, y)$!

So, clearly, the order in which we perform the factorization / division starts to matter! Ideally, we want an algorithm which is not sensitive to the order in which we choose to apply these rewrites.

##### The rewrite rule perspective

An alternative viewpoint on asking "can this be factorized" is to ask "can we look at the factorization as a rewrite rule?" For this perspective, notice that "factorizing" in terms of $xy + 1$ is the same as being able to set $xy = -1$ and have the polynomial collapse to zero. (For the more algebraically minded, this relates to the fact that $R[x] / p(x) \sim R(\text{roots of } p)$.) The intuition is that when we "divide by $xy + 1$", what we are really doing is setting $xy + 1 = 0$ and then seeing what remains. But $xy + 1 = 0 \iff xy = -1$. Thus, we can look at the original question as:

How can we apply the rewrite rules $xy \rightarrow -1$, $y^2 \rightarrow 1$, along with the regular rewrite rules of polynomial arithmetic, to the polynomial $p(x, y) = xy^2 + y$, such that we end with the value $0$?

Our two derivations above correspond to the application of the rules:

• $xy^2 + y \xrightarrow{xy = -1} -y + y = 0$
• $xy^2 + y \xrightarrow{y^2 = 1} x + y \nrightarrow \text{stuck!}$

That is, our rewrite rules are not confluent.
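This order-dependence is easy to reproduce mechanically. Below is a small Python sketch (my own toy representation: a polynomial is a dict mapping exponent vectors (deg_x, deg_y) to coefficients) that applies one rewrite step and exhibits both routes:

```python
def rewrite_once(f, lhs, rhs_poly):
    """Rewrite one monomial of f that is divisible by the monomial lhs,
    replacing lhs by rhs_poly. Returns None if no monomial matches."""
    for e, c in f.items():
        if all(a >= b for a, b in zip(e, lhs)):
            q = tuple(a - b for a, b in zip(e, lhs))  # leftover monomial
            g = dict(f)
            del g[e]
            for er, cr in rhs_poly.items():
                key = tuple(a + b for a, b in zip(q, er))
                g[key] = g.get(key, 0) + c * cr
                if g[key] == 0:
                    del g[key]
            return g
    return None

p = {(1, 2): 1, (0, 1): 1}       # xy^2 + y
rule_a = ((1, 1), {(0, 0): -1})  # xy  -> -1
rule_b = ((0, 2), {(0, 0): 1})   # y^2 ->  1

route1 = rewrite_once(p, *rule_a)  # applying xy -> -1 first reaches 0 ({})
route2 = rewrite_once(p, *rule_b)  # applying y^2 -> 1 first yields x + y: stuck
```

Running `rewrite_once(route2, *rule_a)` or `rewrite_once(route2, *rule_b)` both return None: no rule applies to x + y, so the second route really is stuck short of zero.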

A Gröbner basis is a mathematical object which provides a confluent set of rewrite rules for the above problem. That is, it's a set of polynomials which manages to find the rewrite $p(x, y) \xrightarrow{\star} 0$ regardless of the order in which we apply the rules. It's also correct, in that it rewrites to $0$ only if the original system had some way to rewrite to $0$.

###### Buchberger's algorithm

We need to identify critical pairs, which in this setting are called S-polynomials.

Let $f_i = H(f_i) + R(f_i)$ and $f_j = H(f_j) + R(f_j)$, where $H$ denotes the head (leading) term and $R$ the rest. Let $m = \text{lcm}(H(f_i), H(f_j))$, and let $m_i, m_j$ be monomials such that $m_i \cdot H(f_i) = m = m_j \cdot H(f_j)$. The S-polynomial induced by $f_i, f_j$ is defined as $S(f_i, f_j) = m_i f_i - m_j f_j$.
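Concretely, for the running example $a = xy + 1$, $b = y^2 - 1$ (both monic), the S-polynomial can be computed with a toy dict representation of polynomials (exponent vector (deg_x, deg_y) to coefficient); this sketch of mine assumes monic leading terms and lex order with x > y:

```python
def add(f, g):
    """Add two polynomials represented as exponent-dict -> coefficient."""
    h = dict(f)
    for e, c in g.items():
        h[e] = h.get(e, 0) + c
        if h[e] == 0:
            del h[e]
    return h

def shift(f, mono, c):
    """Multiply polynomial f by the monomial `mono` and the scalar c."""
    return {tuple(a + b for a, b in zip(e, mono)): c * co for e, co in f.items()}

def spoly(f, g):
    mf, mg = max(f), max(g)  # leading monomials: tuple order = lex order
    l = tuple(max(a, b) for a, b in zip(mf, mg))  # lcm of leading monomials
    mi = tuple(a - b for a, b in zip(l, mf))
    mj = tuple(a - b for a, b in zip(l, mg))
    # S(f, g) = m_i * f - m_j * g, assuming monic leading coefficients
    return add(shift(f, mi, 1), shift(g, mj, -1))

a = {(1, 1): 1, (0, 0): 1}   # xy + 1
b = {(0, 2): 1, (0, 0): -1}  # y^2 - 1
```

Here $S(a, b) = y \cdot (xy + 1) - x \cdot (y^2 - 1) = x + y$, which is exactly the leftover on which the two division routes above disagreed.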

# Lie bracket versus torsion

This picture finally made the difference between these two things clear: the Lie bracket moves along the flow, while the torsion moves along parallel transport.

This is why the sides of the parallelogram that measure torsion form, well, a parallelogram: we set them up using parallel transport.

On the other hand, the Lie bracket measures the actual failure of the parallelogram to close.

# Blog post: Weekend paper replication of STOKE, the stochastic superoptimizer

Click the title to go to the post. We replicate the STOKE paper in Haskell, implementing a superoptimizer based on MCMC methods.

# Collapsing BlockId, Label, Unique:

We have this hierarchy of BlockId, Label, and Unique that can be collapsed.

# Spatial partitioning data structures in molecular dynamics

These appear to be versions of hierarchical spatial data structures for fast interaction computation. Apparently, multipole expansions are not useful in this case, since they account for long-range effects but not short-range effects.

# Vector: Arthur Whitney and text editors

• http://archive.vector.org.uk/art10501320

# Representing CPS in LLVM using the @coro.* intrinsics

This is part of a larger thread, "Adding CPS call support to LLVM", where there is a large discussion on the correct design for teaching LLVM about CPS.

Gor Nishanov provided an example of encoding CPS using the LLVM coro intrinsics:

• https://gist.github.com/bollu/e0573dbc145028fb42f89e64c6dd6742

# Bug in the LLVM code generator: Lowering of MO_Add2 and MO_AddWordC

Both of these are lowered the same way, but they should be different.

In particular, GHC.Prim explains:

Honestly, this is confusing, but I guess there’s some story to having two separate primops for this?

# Discrete random distributions with conditioning in 20 lines of Haskell:

```haskell
newtype D a = D { unD :: [(a, Double)] } deriving(Eq, Show, Ord)

instance Functor D where
    -- fmap :: (a -> b) -> D a -> D b
    fmap f (D xs) = D $ fmap (\(a, d) -> (f a, d)) xs

instance Monad D where
    return x = D $ [(x, 1.0)]
    -- f :: a -> (D b)
    (D as) >>= f = D $ do -- list monad
        (a, p) <- as
        (b, p2) <- unD (f a)
        return $ (b, p * p2)

-- [(a, 0.5), (b, 0.5)]
-- [(a, 0.3), (a, 0.2), (b, 0.1), (b, 0.4)]
instance Applicative D where
    pure = return
    ff <*> fa = do
        f <- ff
        a <- fa
        return $ f a

condition :: Bool -> D ()
condition True = D [((), 1.0)]
condition False = D [((), 0.0)]

dice :: D Int
dice = let p = 1.0 / 6 in D $ [(x, p) | x <- [1..6]]

dice_hard :: D Int
dice_hard = do
    x <- dice
    condition $ x > 3
    return $ x

main :: IO ()
main = do
    print dice
    print dice_hard
```


This gives the output:

D {unD = [(1,0.16666666666666666),
(2,0.16666666666666666),
(3,0.16666666666666666),
(4,0.16666666666666666),
(5,0.16666666666666666),
(6,0.16666666666666666)]}

D {unD = [(1,0.0),
(2,0.0),
(3,0.0),
(4,0.16666666666666666),
(5,0.16666666666666666),
(6,0.16666666666666666)]}


Notice that D a ~= WriterT (Product Double) []!

# Everything you know about word2vec is wrong.

The classic explanation of word2vec, in skip-gram with negative sampling, from the paper and countless blog posts on the internet, is as follows:

while(1) {
1. vf = vector of focus word
2. vc = vector of context word
3. train such that (vc . vf = 1)
4. for(0 <= i < negative samples):
vneg = vector of word *not* in context
train such that (vf . vneg = 0)
}


Indeed, if I google "word2vec skipgram", the results I get all describe this same algorithm.

The original word2vec C implementation does not do what's explained above; it is drastically different. Most serious users of word embeddings generated by word2vec do one of the following things:

1. They invoke the original C implementation directly.
They invoke the gensim implementation, which is transliterated from the C source to the extent that the variable names are the same.

Indeed, the gensim implementation is the only one that I know of which is faithful to the C implementation.

### The C implementation

The C implementation in fact maintains two vectors for each word, one where it appears as a focus word, and one where it appears as a context word. (Is this sounding familiar? Indeed, it appears that GloVe actually took this idea from word2vec, which has never mentioned this fact!)

The setup is incredibly well done in the C code:

• An array called syn0 holds the vector embedding of a word when it occurs as a focus word. This is randomly initialized.
https://github.com/tmikolov/word2vec/blob/20c129af10659f7c50e86e3be406df663beff438/word2vec.c#L369

```c
for (a = 0; a < vocab_size; a++) for (b = 0; b < layer1_size; b++) {
    next_random = next_random * (unsigned long long)25214903917 + 11;
    syn0[a * layer1_size + b] =
        (((next_random & 0xFFFF) / (real)65536) - 0.5) / layer1_size;
}
```


• Another array called syn1neg holds the vector of a word when it occurs as a context word. This is zero-initialized.
https://github.com/tmikolov/word2vec/blob/20c129af10659f7c50e86e3be406df663beff438/word2vec.c#L365

```c
for (a = 0; a < vocab_size; a++) for (b = 0; b < layer1_size; b++)
    syn1neg[a * layer1_size + b] = 0;
```

• During training (skip-gram, negative sampling, though other cases are also similar), we first pick a focus word. This is held constant throughout the positive and negative sample training. The gradients of the focus vector are accumulated in a buffer, and are applied to the focus word after it has been affected by both positive and negative samples.
```c
if (negative > 0) for (d = 0; d < negative + 1; d++) {
  // if we are performing negative sampling, in the 1st iteration,
  // pick a word from the context and set the dot product target to 1
  if (d == 0) {
    target = word;
    label = 1;
  } else {
    // for all other iterations, pick a word randomly and set the dot
    // product target to 0
    next_random = next_random * (unsigned long long)25214903917 + 11;
    target = table[(next_random >> 16) % table_size];
    if (target == 0) target = next_random % (vocab_size - 1) + 1;
    if (target == word) continue;
    label = 0;
  }
  l2 = target * layer1_size;
  f = 0;

  // find dot product of focus vector with sample vector, store in f
  for (c = 0; c < layer1_size; c++) f += syn0[c + l1] * syn1neg[c + l2];

  // set g = (label - sigmoid(f)) * alpha, clamping f to [-MAX_EXP, MAX_EXP]
  if (f > MAX_EXP) g = (label - 1) * alpha;
  else if (f < -MAX_EXP) g = (label - 0) * alpha;
  else g = (label - expTable[(int)((f + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2))]) * alpha;

  // 1. update the vector syn1neg,
  // 2. DO NOT UPDATE syn0
  // 3. STORE THE syn0 gradient in a temporary buffer neu1e
  for (c = 0; c < layer1_size; c++) neu1e[c] += g * syn1neg[c + l2];
  for (c = 0; c < layer1_size; c++) syn1neg[c + l2] += g * syn0[c + l1];
}
// Finally, after all samples, update syn0 from neu1e
// https://github.com/tmikolov/word2vec/blob/20c129af10659f7c50e86e3be406df663beff438/word2vec.c#L541
// Learn weights input -> hidden
for (c = 0; c < layer1_size; c++) syn0[c + l1] += neu1e[c];
```


### Why random and zero initialization?

Once again, since none of this is actually explained in the original papers or on the web, I can only hypothesize.

My hypothesis is that since the negative samples come from all over the text and are not really weighted by frequency, you can wind up picking any word, and more often than not, a word whose vector has barely been trained. If this vector actually had a value, it would randomly perturb the genuinely important focus word.

The solution is to initialize all context (negative-sample) vectors to zero, so that only vectors which have occurred somewhat frequently will affect the representation of another vector.

It’s quite ingenious, really, and until this, I’d never really thought of how important initialization strategies really are.

### Why I’m writing this

I spent two months of my life trying to reproduce word2vec, following the paper exactly, reading countless articles, and simply not succeeding. I was unable to reach the same scores that word2vec did, and it was not for lack of trying.

I could not have imagined that the paper would have literally fabricated an algorithm that doesn’t work, while the implementation does something completely different.

Eventually, I decided to read the sources, and spent three whole days convinced I was reading the code wrong since literally everything on the internet told me otherwise.

I don’t understand why the original paper and the internet contain zero explanations of the actual mechanism behind word2vec, so I decided to put it up myself.

This also explains GloVe’s radical choice of having a separate vector for the negative context — they were just doing what word2vec does, but they told people about it :).

Is this academic dishonesty? I don’t know the answer, and that’s a heavy question. But I’m frankly incredibly pissed, and this is probably the last time I take a machine learning paper’s explanation of the algorithm seriously again — from next time, I read the source first.

# Hamiltonian monte carlo, leapfrog integrators, and symplectic geometry

This is a section that I'll update as I learn more about the space. Since I'm studying differential geometry over the summer, I hope to learn enough about "symplectic manifolds". I'll make this an append-only log, adding to the section as I understand more.

##### 31st May
• To perform Hamiltonian monte carlo, we use the Hamiltonian and its derivatives to provide a momentum to our proposal distribution. That is, when we choose a new point from the current point, our probability distribution for the new point is influenced by our current momentum.

• For some integral necessary within this scheme, Euler integration doesn’t cut it since the error diverges to infinity

• Hence, we need an integrator that guarantees that the energy of our system is conserved. Enter the leapfrog integrator. This integrator is also time-reversible: we can run it forward for n steps, and then run it backward for n steps, to arrive at the same state. Now I finally know how Braid was implemented, something that bugged the hell out of 9th-grade me when I tried to implement Braid-like physics in my engine!

• The actual derivation of the integrator uses Lie algebras, symplectic geometry, and other diffgeo ideas, which is great, because it gives me motivation to study differential geometry :)

• Original paper: Construction of higher order symplectic integrators
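To make the leapfrog bullet concrete, here is a minimal sketch of mine (a 1D harmonic oscillator, not the higher-order integrators of the paper) showing both properties claimed above: near-conservation of energy and time reversibility:

```python
def leapfrog(q, p, dUdq, dt, n):
    """Kick-drift-kick leapfrog for H(q, p) = p^2/2 + U(q)."""
    p = p - 0.5 * dt * dUdq(q)       # initial half step in momentum
    for _ in range(n - 1):
        q = q + dt * p               # full step in position
        p = p - dt * dUdq(q)         # full step in momentum
    q = q + dt * p
    p = p - 0.5 * dt * dUdq(q)       # final half step in momentum
    return q, p

# harmonic oscillator: U(q) = q^2 / 2, so dU/dq = q, and H = (q^2 + p^2) / 2
q0, p0 = 1.0, 0.0
q1, p1 = leapfrog(q0, p0, lambda q: q, dt=0.01, n=1000)
energy_drift = abs((q1 * q1 + p1 * p1) / 2 - (q0 * q0 + p0 * p0) / 2)
# time reversibility: negate the momentum and integrate forward again
q2, p2 = leapfrog(q1, -p1, lambda q: q, dt=0.01, n=1000)
```

Euler integration applied to the same system visibly spirals outward in energy; leapfrog's energy error stays bounded, and the reversed run lands back on the starting state up to floating-point rounding.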

# Sampling from arbitrary distributions with MCMC in Haskell

We create a simple monad called PL which allows for a single operation: sampling from a uniform distribution. We then exploit this to implement MCMC using Metropolis-Hastings, which is used to sample from arbitrary distributions. As a bonus, there is a small library to render sparklines in the CLI.

For next time:

• Using applicative to speed up computations by exploiting parallelism
• Conditioning of a distribution wrt a variable

### Source code

{-# LANGUAGE GeneralizedNewtypeDeriving #-}
{-# LANGUAGE StandaloneDeriving #-}
{-# LANGUAGE FlexibleContexts #-}
{-# LANGUAGE FlexibleInstances #-}
{-# LANGUAGE UndecidableInstances #-}
{-# LANGUAGE DeriveFunctor #-}
import System.Random
import Control.Monad(replicateM)
import Data.List(sort, nub)
import Data.Proxy
import qualified Data.Map as M

-- | Loop a monadic computation.
mLoop :: Monad m
    => (a -> m a) -- ^ loop body
    -> Int -- ^ number of times to run
    -> a -- ^ initial value
    -> m a -- ^ final value
mLoop _ 0 a = return a
mLoop f n a = f a >>= mLoop f (n - 1)

-- | Utility library for drawing sparklines

-- | List of characters that represent sparklines
sparkchars :: String
sparkchars = "_▁▂▃▄▅▆▇█"

-- Convert an int to a sparkline character
num2spark :: RealFrac a => a -- ^ Max value
-> a -- ^ Current value
-> Char
num2spark maxv curv =
    sparkchars !!
        (floor $ (curv / maxv) * (fromIntegral (length sparkchars - 1)))

series2spark :: RealFrac a => [a] -> String
series2spark vs =
    let maxv = if null vs then 0 else maximum vs
    in map (num2spark maxv) vs

seriesPrintSpark :: RealFrac a => [a] -> IO ()
seriesPrintSpark = putStrLn . series2spark

-- Probabilities
-- =============

type F = Float

-- | probability density
newtype P = P { unP :: Float } deriving(Num)

-- | prob. distributions over space a
newtype D a = D { runD :: a -> P }

uniform :: Int -> D a
uniform n = D $ \_ -> P $ 1.0 / (fromIntegral $ n)

(>$<) :: Contravariant f => (b -> a) -> f a -> f b
(>$<) = cofmap

instance Contravariant D where
cofmap f (D d) = D (d . f)

-- | Normal distribution with given mean
normalD :: Float ->  D Float
normalD mu = D $ \f -> P $ exp (- ((f-mu)^2))

-- | Distribution that takes on value x^p for 1 <= x <= 2.  Is normalized
polyD :: Float -> D Float
polyD p = D $ \f -> P $ if 1 <= f && f <= 2 then (f ** p) * (p + 1) / (2 ** (p+1) - 1) else 0

class Contravariant f where
cofmap :: (b -> a) -> f a -> f b

data PL next where
Ret :: next -> PL next -- ^ return  a value
Sample01 :: (Float -> PL next) -> PL next -- ^ sample uniformly from a [0, 1) distribution

instance Monad PL where
    return = Ret
    (Ret a) >>= f = f a
    (Sample01 float2plnext) >>= next2next' =
        Sample01 $ \f -> float2plnext f >>= next2next'

instance Applicative PL where
    pure = return
    ff <*> fx = do
        f <- ff
        x <- fx
        return $ f x

instance Functor PL where
    fmap f plx = do
        x <- plx
        return $ f x

-- | operation to sample from [0, 1)
sample01 :: PL Float
sample01 = Sample01 Ret

-- | Run one step of MH on a distribution to obtain a (correlated) sample
mhStep :: (a -> Float) -- ^ function to score sample with, proportional to distribution
    -> (a -> PL a) -- ^ proposal program
    -> a -- ^ current sample
    -> PL a
mhStep f q a = do
    a' <- q a
    let alpha = f a' / f a -- acceptance ratio
    u <- sample01
    return $ if u <= alpha then a' else a

-- Typeclass that can provide me with data to run MCMC on it
class MCMC a where
arbitrary :: a
uniform2val :: Float -> a

instance MCMC Float where
arbitrary = 0
-- map [0, 1) -> (-infty, infty)
uniform2val v = tan (-pi/2 + pi * v)

{-
-- | Any enumerable object has a way to get me the starting point for MCMC
instance (Bounded a, Enum a) => MCMC a where
arbitrary = toEnum 0
uniform2val v = let
        maxf = fromIntegral . fromEnum $ maxBound
        minf = fromIntegral . fromEnum $ minBound
    in toEnum $ floor $ minf + v * (maxf - minf)
-}

-- | Run MH to sample from a distribution
mh :: (a -> Float) -- ^ function to score sample with
-> (a -> PL a) -- ^ proposal program
-> a -- ^ current sample
-> PL a
mh f q a = mLoop (mhStep f q) 100 $ a

-- | Construct a program to sample from an arbitrary distribution using MCMC
mhD :: MCMC a => D a -> PL a
mhD (D d) =
    let scorer = (unP . d)
        proposal _ = do
            f <- sample01
            return $ uniform2val f
    in mh scorer proposal arbitrary

-- | Run the probabilistic value to get a sample
sample :: RandomGen g => g -> PL a -> (a, g)
sample g (Ret a) = (a, g)
sample g (Sample01 f2plnext) = let (f, g') = random g in sample g' (f2plnext f)

-- | Sample n values from the distribution
samples :: RandomGen g => Int -> g -> PL a -> ([a], g)
samples 0 g _ = ([], g)
samples n g pl = let (a, g') = sample g pl
(as, g'') = samples (n - 1) g' pl
in (a:as, g'')

-- | count fraction of times value occurs in list
occurFrac :: (Eq a) => [a] -> a -> Float
occurFrac as a =
let noccur = length (filter (==a) as)
n = length as
in (fromIntegral noccur) / (fromIntegral n)

-- | Produce a distribution from a PL by using the sampler to sample N times
distribution :: (Eq a, Num a, RandomGen g) => Int -> g -> PL a -> (D a, g)
distribution n g pl =
let (as, g') = samples n g pl in (D (\a -> P (occurFrac as a)), g')

-- | biased coin
coin :: Float -> PL Int -- 1 with prob. p1, 0 with prob. (1 - p1)
coin p1 = do
    Sample01 (\f -> Ret $ if f < p1 then 1 else 0)

-- | Create a histogram from values.
histogram :: Int -- ^ number of buckets
    -> [Float] -- ^ values
    -> [Int]
histogram nbuckets as =
    let minv :: Float
        minv = minimum as
        maxv :: Float
        maxv = maximum as
        -- value per bucket
        perbucket :: Float
        perbucket = (maxv - minv) / (fromIntegral nbuckets)
        bucket :: Float -> Int
        bucket v = floor (v / perbucket)
        bucketed :: M.Map Int Int
        bucketed = foldl (\m v -> M.insertWith (+) (bucket v) 1 m) mempty as
    in map snd . M.toList $ bucketed

printSamples :: (Real a, Eq a, Ord a, Show a) => String -> [a] -> IO ()
printSamples s as =  do
    putStrLn $ "***" <> s
    putStrLn $ "   samples: " <> series2spark (map toRational as)

printHistogram :: [Float] -> IO ()
printHistogram samples = putStrLn $ series2spark (map fromIntegral . histogram 10 $ samples)

-- | Given a coin bias, take samples and print bias
printCoin :: Float -> IO ()
printCoin bias = do
let g = mkStdGen 1
let (tosses, _) = samples 100 g (coin bias)
printSamples ("bias: " <> show bias) tosses

-- | Create normal distribution as sum of uniform distributions.
normal :: PL Float
normal = fromIntegral . sum <$> (replicateM 5 (coin 0.5))

main :: IO ()
main = do
    printCoin 0.01
    printCoin 0.99
    printCoin 0.5
    printCoin 0.7
    putStrLn $ "normal distribution using central limit theorem: "
let g = mkStdGen 1
let (nsamples, _) = samples 1000 g normal
-- printSamples "normal: " nsamples
printHistogram nsamples

    putStrLn $ "normal distribution using MCMC: "
    let (mcmcsamples, _) = samples 1000 g (mhD $ normalD 0.5)
printHistogram mcmcsamples

    putStrLn $ "sampling from x^4 with finite support"
    let (mcmcsamples, _) = samples 1000 g (mhD $ polyD 4)
printHistogram mcmcsamples


### Output

***bias: 1.0e-2
samples: ________________________________________█_█_________________________________________________________
***bias: 0.99
samples: ████████████████████████████████████████████████████████████████████████████████████████████████████
***bias: 0.5
samples: __█____█__███_███_█__█_█___█_█_██___████████__█_████_████_████____██_█_██_____█__██__██_██____█__█__
***bias: 0.7
samples: __█__█_█__███_█████__███_█_█_█_██_█_████████__███████████_████_█_███_████_██__█_███__██_███_█_█__█_█
normal distribution using central limit theorem:
_▄▇█▄_
normal distribution using MCMC:
__▁▄█▅▂▁___
sampling from x^4 with finite support
▁▁▃▃▃▄▅▆▇█_

# Forward mode automatic differentiation on a computation graph



{-# LANGUAGE GeneralizedNewtypeDeriving #-}
import qualified Data.Map.Strict as M

-- | This file can be copy-pasted and will run!

-- | Symbols
type Sym = String
-- | Environments
type E a = M.Map Sym a
-- | Newtype to represent derivative values
type F = Float
newtype Der = Der { under :: F } deriving(Show, Num)

infixl 7 !#
-- | We are indexing the map at a "hash" (Sym)
(!#) :: E a -> Sym -> a
(!#) = (M.!)

-- | A node in the computation graph
data Node =
Node { name :: Sym -- ^ Name of the node
, ins :: [Node] -- ^ inputs to the node
, out :: E F -> F -- ^ output of the node
, der :: (E F, E (Sym -> Der))
-> Sym -> Der -- ^ derivative wrt to a name
}

-- | @ looks like a "circle", which is a node. So we are indexing the map
-- at a node.
(!@) :: E a -> Node -> a
(!@) e node = e M.! (name node)

-- | Given the current environments of values and derivatives, compute
-- | The new value and derivative for a node.
run_ :: (E F, E (Sym -> Der)) -> Node -> (E F, E (Sym -> Der))
run_ ein (Node name ins out der) =
let (e', ed') = foldl run_ ein ins -- run all the inputs
v = out e' -- compute the output
dv = der (e', ed') -- and the derivative
in (M.insert name v e', M.insert name dv ed')  -- and insert them

-- | Run the program given a node
run :: E F -> Node -> (E F, E (Sym -> Der))
run e n = run_ (e, mempty) n

-- | Let's build nodes
nconst :: Sym -> F -> Node
nconst n f = Node n [] (\_ -> f) (\_ _ -> 0)

-- | Variable
nvar :: Sym -> Node
nvar n = Node n [] (!# n) (\_ n' -> if n == n' then 1 else 0)

-- | binary operation
nbinop :: (F -> F -> F)  -- ^ output computation from inputs
-> (F -> Der -> F -> Der -> Der) -- ^ derivative computation from outputs
-> Sym -- ^ Name
-> (Node, Node) -- ^ input nodes
-> Node
nbinop f df n (in1, in2) =
Node { name = n
, ins = [in1, in2]
, out = \e -> f (e !# name in1) (e !# name in2)
, der = \(e, ed) n' ->
let (name1, name2) = (name in1, name in2)
(v1, v2) = (e !# name1, e !# name2)
(dv1, dv2) = (ed !# name1 $ n', ed !# name2 $ n')
in df v1 dv1 v2 dv2
}

nadd :: Sym -> (Node, Node) -> Node
nadd = nbinop (+) (\v dv v' dv' -> dv + dv')

nmul :: Sym -> (Node, Node) -> Node
nmul = nbinop (*) (\v (Der dv) v' (Der dv') -> Der $ (v*dv') + (v'*dv))

main :: IO ()
main = do
    let x = nvar "x" :: Node
    let y = nvar "y"
    let xsq = nmul "xsq" (x, x)
    let ten = nconst "10" 10
    let xsq_plus_10 = nadd "xsq_plus_10" (xsq, ten)
    let xsq_plus_10_plus_y = nadd "xsq_plus_10_plus_y" (xsq_plus_10, y)
    let (e, de) = run (M.fromList $ [("x", 2.0), ("y", 3.0)]) xsq_plus_10_plus_y
    putStrLn $ show e
    putStrLn $ show $ de !@ xsq_plus_10_plus_y $ "x"
# isl: transitive closure example

./bug
{ [x] -> [o0] : 2o0 = x or (exists (e0 = floor((1 + x)/2): o0 = 1 + 3x and 2e0 = 1 + x)) }
exact: 0
map: { [x] -> [o0] }

I find it odd that it is unable to prove anything about the image, even that it is non-negative, for example. This is an interesting direction in which to improve the functions isl_map_power and isl_map_transitive_closure, though.

## Using compactness to argue about the cover in an argument

I've always seen compactness be used by starting with a possibly infinite cover, and then filtering it into a finite subcover. This finite subcover is then used for finiteness properties (like summing, min, max, etc.).

I recently ran across a use of compactness where one starts with the set of all possible subcovers, and then argues about why a cover cannot be built from these subcovers if the set is compact. I found it to be a very cool use of compactness, which I'll record below:

##### Theorem:

If a family of compact, countably infinite sets S_a have all finite intersections non-empty, then the intersection of the family S_a is non-empty.

##### Proof:

Let S = intersection of S_a. We know that S must be compact, since all the S_a are compact, and the intersection of a countably infinite number of compact sets is compact. Now, suppose S is empty. This means there must be a point p ∈ P such that p !∈ S_i for some arbitrary i.

##### Cool use of theorem:

We can see that the Cantor set is non-empty, since it contains a family of closed and bounded sets S1 ⊇ S2 ⊇ S3 ⊇ ..., where each S_i is one step of the Cantor-ification. We can now see that the Cantor set is non-empty, since:

1. Each finite intersection is non-empty, and will be equal to the set that has the highest index in the finite intersection.
2. Each of the sets Si is compact, since they are closed and bounded subsets of R.
3. Invoke the theorem.
## Japanese Financial Counting system

Japanese contains a separate kanji set called daiji, used in financial and legal documents to prevent people from adding strokes to numbers previously written.

| # | Common | Formal |
|---|--------|--------|
| 1 | 一 | 壱 |
| 2 | 二 | 弐 |
| 3 | 三 | 参 |

## Stephen Wolfram’s live stream

I’ve taken to watching the live stream when I have some downtime and want some interesting content. The discussions of Wolfram with his group are great, and they bring up really interesting ideas (like that of cleave being very irregular).

## Cleave as a word has some of the most irregular inflections

- cleave
- clove
- cleaved
- clave
- cleft

## McCune’s single axiom for group theory

Single Axioms for Groups and Abelian Groups with Various Operations provides a single axiom for groups. This can be useful for some ideas I have for training groups, where we can use this axiom as the loss function!

## Word2Vec C code implements gradient descent really weirdly

I’ll be posting snippets of the original source code, along with a link to the GitHub sources. We are interested in exploring the skip-gram implementation of Word2Vec, with negative sampling, without hierarchical softmax. I assume basic familiarity with word embeddings and the skip-gram model.

#### Construction of the sigmoid lookup table

```c
// https://github.com/tmikolov/word2vec/blob/master/word2vec.c#L708
expTable = (real *)malloc((EXP_TABLE_SIZE + 1) * sizeof(real));
for (i = 0; i < EXP_TABLE_SIZE; i++) {
  // Precompute the exp() table
  expTable[i] = exp((i / (real)EXP_TABLE_SIZE * 2 - 1) * MAX_EXP);
  // Precompute f(x) = x / (x + 1)
  expTable[i] = expTable[i] / (expTable[i] + 1);
}
```

Here, the code constructs a lookup table which maps [0...EXP_TABLE_SIZE-1] to [sigmoid(-MAX_EXP)...sigmoid(MAX_EXP)]. The index i first gets mapped to (i / EXP_TABLE_SIZE) * 2 - 1, which sends 0 to -1 and EXP_TABLE_SIZE to 1. This is then rescaled by MAX_EXP.

#### Layer initialization

- syn0 is a global variable, initialized with random weights in the range [-0.5...0.5], scaled down by layer1_size.
It has dimensions VOCAB x HIDDEN. This layer holds the hidden representations of word vectors.

```c
// https://github.com/imsky/word2vec/blob/master/word2vec.c#L341
a = posix_memalign((void **)&syn0, 128,
                   (long long)vocab_size * layer1_size * sizeof(real));
...
// https://github.com/imsky/word2vec/blob/master/word2vec.c#L355
for (a = 0; a < vocab_size; a++)
  for (b = 0; b < layer1_size; b++) {
    next_random = next_random * (unsigned long long)25214903917 + 11;
    syn0[a * layer1_size + b] =
        (((next_random & 0xFFFF) / (real)65536) - 0.5) / layer1_size;
  }
```

- syn1neg is a global variable that is zero-initialized. It has dimensions VOCAB x HIDDEN. This layer also holds hidden representations of word vectors, when they are used as a negative sample.

```c
// https://github.com/imsky/word2vec/blob/master/word2vec.c#L350
a = posix_memalign((void **)&syn1neg, 128,
                   (long long)vocab_size * layer1_size * sizeof(real));
...
for (a = 0; a < vocab_size; a++)
  for (b = 0; b < layer1_size; b++)
    syn1neg[a * layer1_size + b] = 0;
```

- neu1e is a temporary per-thread buffer (remember that the word2vec C code uses CPU threads for parallelism) which is zero-initialized. It has dimensions 1 x HIDDEN.

```c
// https://github.com/imsky/word2vec/blob/master/word2vec.c#L370
real *neu1e = (real *)calloc(layer1_size, sizeof(real));
```

#### Backpropagation

Throughout word2vec, no 2D arrays are used. Indexing of the form arr[word][ix] is manually written as arr[word * layer1_size + ix]. So, I will call word * layer1_size the “base address”, and ix the “offset”, of the array index expression henceforth.

Here, l1 is the base address of the word at the center of the window (the focus word). l2 is the base address of either a word that is negative-sampled from the corpus, or a word that is a positive sample from within the context window. label tells us whether the sample is a positive or a negative sample: label = 1 for positive samples, and label = 0 for negative samples.
```c
// zero initialize neu1e
// https://github.com/imsky/word2vec/blob/master/word2vec.c#L419
for (c = 0; c < layer1_size; c++) neu1e[c] = 0;
...
// loop through each negative sample
// https://github.com/imsky/word2vec/blob/master/word2vec.c#L508
if (negative > 0) for (d = 0; d < negative + 1; d++) {
  ...
  // https://github.com/imsky/word2vec/blob/master/word2vec.c#L521
  // take the dot product: f = syn0[focus] . syn1neg[context]
  for (c = 0; c < layer1_size; c++) f += syn0[c + l1] * syn1neg[c + l2];

  // compute: g = (label - sigmoid(f)) * alpha
  // g is computed using lookups into a lookup table and clamping for
  // efficiency.
  if (f > MAX_EXP) g = (label - 1) * alpha;
  else if (f < -MAX_EXP) g = (label - 0) * alpha;
  else
    g = (label - expTable[(int)((f + MAX_EXP) *
                                (EXP_TABLE_SIZE / MAX_EXP / 2))]) * alpha;

  // Now that we have computed the gradient:
  // g = (label - output) * learningrate,
  // we need to perform backprop. This is where the code gets weird.
  for (c = 0; c < layer1_size; c++) neu1e[c] += g * syn1neg[c + l2];
  for (c = 0; c < layer1_size; c++) syn1neg[c + l2] += g * syn0[c + l1];
} // end loop through negative samples
// Learn weights input -> hidden
for (c = 0; c < layer1_size; c++) syn0[c + l1] += neu1e[c];
```

- We have two vectors for each word, one called syn0[l1 + _] and the other syn1neg[l2 + _]. The syn1neg word embedding is used whenever a word is used as a negative sample, and is not used anywhere else. Also, the syn1neg vectors are zero-initialized, while the syn0 vectors are randomly initialized.
- The values we backprop with, g * syn1neg[l2 + _] and g * syn0[l1 + _], are not the correct gradients of the error term, at least not of a squared error! The derivative of a sigmoid is dsigmoid(x)/dx = sigmoid(x) [1 - sigmoid(x)], and the [1 - sigmoid(x)] factor is nowhere to be seen. Very weird.
- We hold the value of syn0 constant throughout all the negative samples (the correction is accumulated in neu1e and applied once at the end), which was not mentioned in any tutorial I’ve read.
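To check my reading of the inner loop, here is a Python transcription (mine, not from the repository) of the clamped table lookup and the accumulate-then-apply update. The vectors and sample values are made-up toy numbers; only the constants EXP_TABLE_SIZE and MAX_EXP come from word2vec.c:

```python
import math

EXP_TABLE_SIZE, MAX_EXP = 1000, 6  # constants from word2vec.c

# entry i holds sigmoid((i/EXP_TABLE_SIZE*2 - 1) * MAX_EXP)
exp_table = []
for i in range(EXP_TABLE_SIZE):
    e = math.exp((i / EXP_TABLE_SIZE * 2 - 1) * MAX_EXP)
    exp_table.append(e / (e + 1))

def g_of(f, label, alpha):
    # the clamped lookup from the C code: g = (label - sigmoid(f)) * alpha
    if f > MAX_EXP:
        return (label - 1) * alpha
    if f < -MAX_EXP:
        return (label - 0) * alpha
    ix = int((f + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2))
    return (label - exp_table[ix]) * alpha

# One focus word against a positive and a negative sample (toy numbers).
alpha, layer1_size = 0.025, 4
syn0_focus = [0.1, -0.2, 0.05, 0.3]        # a row of syn0
samples = [([0.01, 0.0, 0.02, 0.0], 1),    # positive (context) sample
           ([0.0, 0.03, 0.0, 0.01], 0)]    # negative sample

neu1e = [0.0] * layer1_size                # per-focus-word buffer
for ctx, label in samples:
    f = sum(a * b for a, b in zip(syn0_focus, ctx))
    g = g_of(f, label, alpha)
    for c in range(layer1_size):
        neu1e[c] += g * ctx[c]             # accumulate correction for syn0
    for c in range(layer1_size):
        ctx[c] += g * syn0_focus[c]        # syn1neg row updated immediately

# syn0 is held constant through all samples, then updated once:
for c in range(layer1_size):
    syn0_focus[c] += neu1e[c]
print(round(syn0_focus[0], 6))
```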
The paper does not mention these implementation details, and neither does any blog post that I’ve read. I don’t understand what’s going on, and I plan on updating this section when I understand this better.

## Arthur Whitney: dense code

- Guy who wrote a bunch of APL dialects; writes code in an eclectic style that has very little whitespace and single-letter variable names.
- Believes that this allows him to hold the entire program in his head.
- Seems legit from my limited experience with APL and Haskell one-liners.
- The b programming language. It’s quite awesome to read the sources. For example, a.c.

## How does one work with arrays in a linear language?

Given an array of qubits xs: Qubit[], I want to switch to little endian. Due to no-cloning, I can’t copy them! I suppose I can use recursion to build up a new “list”. But this is not the efficient array version we know and love and want.

The code that I want to work, but does not:

```
function switchEndian(xs : Qubit[]) : Unit {
    for (i in 0 .. Length(xs) - 1) {
        let q = xs[i];
        set xs[i] = xs[Length(xs) - 1 - i];      // boom, this does not work!
        set xs[Length(xs) - 1 - i] = q;
    }
}
```

On the other hand, what does work is to set up a quantum circuit that performs this flipping, since it’s a permutation matrix at the end of the day. But this is very slow, since it needs to simulate the “quantumness” of the solution, since it takes 2^n basis vectors for n qubits. However, the usual recursion-based solution works:

```
function switchEndian(xs : Qubit[]) : Qubit[] {
    if (Length(xs) == 1) {
        return xs;
    } else {
        return switchEndian(xs[1 .. Length(xs) - 1]) + [xs[0]];
    }
}
```

This is, of course, suboptimal. I find it interesting that in the linear types world, often the “pure” solution is forced, since mutation very often involves temporaries / copying! (I’m solving assignments in Q# for my course in college.)

## How Linear optimisation is the same as Linear feasibility checking

Core building block of effectively using the ellipsoid algorithm.
- If we possess a way to check whether a point $p \in P$, where $P$ is a polytope, we can use this to solve optimisation problems.
- Given the optimisation problem maximise $c^Tx$ subject to $Ax = b$, we can construct a new non-emptiness problem. This allows us to convert optimisation into feasibility.
- The new problem is $Ax = b, A^Ty = c, c^Tx = b^Ty$. Note that by duality, a point in this new polyhedron will be an optimal solution to the above linear program. We are forcing $c^Tx = b^Ty$, and a solution where the primal and dual objectives agree is optimal by strong duality.
- This way, we have converted a linear programming problem into an “is this polytope empty?” check!
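As a tiny sanity check of the construction (my own example, not from the source): pick $A$ invertible, so that $Ax = b$ and $A^Ty = c$ each pin down a unique point, and watch the third constraint $c^Tx = b^Ty$ come out satisfied. This is just the algebra behind the construction: $c^Tx = (A^Ty)^Tx = y^T(Ax) = y^Tb$.

```python
# Tiny 2x2 example (mine): maximise c.x subject to Ax = b.
A = [[1.0, 2.0],
     [3.0, 1.0]]
b = [4.0, 7.0]
c = [1.0, 1.0]

def solve2(M, r):
    # Cramer's rule for a 2x2 system M z = r
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [(r[0] * M[1][1] - r[1] * M[0][1]) / det,
            (M[0][0] * r[1] - M[1][0] * r[0]) / det]

x = solve2(A, b)                                  # primal point: Ax = b
At = [[A[0][0], A[1][0]], [A[0][1], A[1][1]]]
y = solve2(At, c)                                 # dual point: A^T y = c
primal = c[0] * x[0] + c[1] * x[1]                # c^T x
dual = b[0] * y[0] + b[1] * y[1]                  # b^T y
print(x, y, primal, dual)  # the constraint c^T x = b^T y holds
```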

## Quantum computation without complex numbers

I recently learnt that the Toffoli and Hadamard gates are universal for quantum computation. The description of these gates involves no complex numbers. So, we can write any quantum circuit in a “complex number free” form. The caveat is that we may very well have input qubits that require complex numbers.

Even so, a large number (all?) of the basic algorithms shown in Nielsen and Chuang can be encoded in an entirely complex-number free fashion.

I don’t really understand the ramifications of this, since I had the intuition that the power of quantum computation comes from the ability to express complex phases along with superposition (tensoring). However, it now appears that, in many cases, the passage from R to C is not where the power comes from. This is definitely something to ponder.
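One can check the "no complex numbers" claim directly: the Hadamard matrix is real and orthogonal, and the Toffoli gate is a 0/1 permutation matrix on three qubits. A quick sketch (mine) with plain Python lists:

```python
import math

# Hadamard: purely real entries, and unitary since H H^T = I.
s = 1 / math.sqrt(2)
H = [[s, s],
     [s, -s]]
HHt = [[sum(H[i][k] * H[j][k] for k in range(2)) for j in range(2)]
       for i in range(2)]
print(all(abs(HHt[i][j] - (1 if i == j else 0)) < 1e-12
          for i in range(2) for j in range(2)))  # True

# Toffoli: flips the target bit iff both control bits are 1.
def toffoli(state_index):
    c1, c2, t = (state_index >> 2) & 1, (state_index >> 1) & 1, state_index & 1
    if c1 and c2:
        t ^= 1
    return (c1 << 2) | (c2 << 1) | t

# It permutes the 8 basis states, hence its matrix has only 0/1
# entries: real, and unitary because it is a permutation.
print(sorted(toffoli(i) for i in range(8)) == list(range(8)))  # True
```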

## Linguistic fun fact: Comparative Illusion

I steal from Wikipedia:

A comparative illusion is a grammatical illusion where certain sentences seem grammatically correct when you read them, but upon further reflection actually make no sense.

For example: “More people have been to Berlin than I have.”