§ Burrows Wheeler

We aim to get the O(n)O(n) algorithm for burrows wheeler, by starting from the naive O(n2)O(n^2) implementation and then slowly chipping away to get to the fastest algorithm

§ String rotations

Given a string ss of length nn, we can index it as s[0]s[0], s[1]s[1], upto s[n1]s[n-1]. We can now build a table consisting of rotations of the string. We'll define:
lrot :: [a] -> [a]
lrot [] = []; lrot (x:xs) = xs ++ [x]
We can build a table of left-rotations:
foobar
oobarf
obarfo
barfoo
arfoob
rfooba
We can immediately notice that it's a symmetric matrix. We can prove this as follows. We write lrot(rot,s)lrot(rot, s) as an array such that:
lrot(rot,s)[i]=s[(rot+i) lrot(rot, s)[i] = s[(rot + i) % n] \text{where $n = |s|$}
Now, we note that on row rr of our array we have the string lrot(r,s)lrot(r, s). So the value at row rr, column cc is lrot(r,s)[c]=s[(r+c)lrot(r, s)[c] = s[(r + c)%n]. But this is symmetric in c,rc, r so can be written as s[(c+r)s[(c + r)%n], which is equal to lrot(c,s)[r]lrot(c, s)[r]. Formally:
lrot(r,s)[c]=s[(r+c) lrot(r, s)[c] = s[(r + c) %n] = s[(c + r)%n] = lrot(c, s)[r]

§ Sorts of string rotations

We're next interested in considering sorted order of the string rotations. For example, in the case of the string "here, there", all the rotations are:
here-there  0
ere-thereh  1
re-therehe  2
e-thereher  3
-therehere  4
therehere-  5
herehere-t  6
erehere-th  7
rehere-the  8
ehere-ther  9
which is calculated with the help of the following haskell code:
lrot :: [a] -> [a]
lrot [] = []; lrot (a:as) = as ++ [a]

lrots :: [a] -> [[a]]; lrots as = take (length as) (iterate lrot as)

main =  putStrLn $ intercalate "\n"  
  (zipWith (\s i -> s <> "  " <> show i)
           (lrots "here-there") [0,1..])
If we now sort these rotations, we get:
-therehere  0
e-thereher  1
ehere-ther  2
ere-thereh  3
erehere-th  4
here-there  5
herehere-t  6
re-therehe  7
rehere-the  8
therehere-  9
we produce this by chainging the above definition of main to:
main =  putStrLn $ intercalate "\n"  
  (zipWith (\s i -> s <> "  " <> show i)
           -- | extra `sort` here
           (sort $ lrots "here-there") [0,1..])
Let's look at the final column of that table. We have:
-thereher|e|  0
e-therehe|r|  1
ehere-the|r|  2
ere-there|h|  3
erehere-t|h|  4
here-ther|e|  5 <- ORIGINAL STRING
herehere-|t|  6
re-thereh|e|  7
rehere-th|e|  8
therehere|-|  9

0123456789
errhhetee-
Now, we'll show how to write a really cool function call bwt such that:
bwtinv ("errhhetee-",5) = "here-there"
The 5 is the index of the original string in the list. That is, given the jumbled up last-column and the index of the original string, we're able to reconstruct the original string. The reason this is useful is that the jumbled string "errhhetee-" is easier to compress: it has long runs of r, h, and e which makes it easier to compress. The process we performed of rotating, sorting the rotations and taking the final column is the Burrows-Wheeler transform. in code:
import Data.List
lrot :: [a] -> [a]
lrot [] = []; lrot (a:as) = as ++ [a]

lrots :: [a] -> [[a]]
lrots as = take (length as) (iterate lrot as)

findix :: Eq a => a -> [a] -> Int
findix a as = length (takeWhile (/= a) as)

lastcol :: [[a]] -> [a]; lastcol = map last

bwt :: Eq a => Ord a => [a] -> ([a], Int)
bwt as = let as' = (sort . lrots) as in (lastcol as', findix as as')
Now we need to understand how bwtinv is defined and what it does. The definition is:
import Control.Arrow (&&&)

bwtinv :: Eq a => Ord a => ([a], Int) -> [a]
bwtinv (as, k) = recreate as !! k

-- recreate · lastcol · sort · rots = sort · rots
recreate :: (Eq a, Ord a) => [a] -> [[a]]
recreate as = recreate' (length as) as

recreate' :: (Eq a, Ord a) => Int -> [a] -> [[a]]
recreate' 0 = map (const [])
recreate' n = hdsort . consCol . (id &&& recreate' (n-1))


hdsort :: Ord a => [[a]] -> [[a]]
hdsort = let cmp (x:xs) (y:ys) = compare x y
         in sortBy cmp 

consCol :: ([a], [[a]]) -> [[a]]
consCol = uncurry (zipWith (:))
OK, so much for the code. what does it do? The idea is that:
  • we recreate the entire matrix from the last column using recreateand take the kth element.
  • recreate apprents a copy of the initial last column to a matrix ofcolumns, and then sorts this.

§ References

  • Richard Bird: Pearls of functional algorithm design