Part IV: IntMap!?

Back in part 2 we showed how we can compare two keys in Morton order without having to actually do the interleaving.

I'm going to take some time today to try to help folks build intuition for what that means by taking a look at an old standby in the Haskell ecosystem, Data.IntMap, and use the techniques we developed in part 2 to generate a version of some of the core routines that uses the same xor trick rather than store the prefix and mask it stores today.

Nothing in here has to do with matrix multiplication, but it is a powerful application of the notion of a "most significant difference" and xor based comparison by it.

If you're just getting here, you might want to start with parts 1, 2 and 3, but there is no pressure. Like part 3, this post can largely stand alone.

IntMap?!

Changing to my "difference tree" approach permits a number of operations to terminate earlier, and may well turn out to be a viable way to improve the venerable IntMap in the containers package, but I'm using it here mostly to help us develop familiarity with the 'most significant most significant difference'.

In many ways this is a degenerate case, but it at least helps us develop some facility for using the tool!

In Data.IntMap.Base, based on some decade old code from Daan Leijen, the containers library defines:

type Prefix = Int
type Mask   = Int

data IntMap a
  = Nil
  | Tip {-# UNPACK #-} !Int a
  | Bin {-# UNPACK #-} !Prefix {-# UNPACK #-} !Mask !(IntMap a) !(IntMap a)

The Prefix and Mask contain information about the known common prefix of the PATRICIA trie up to that point, and the Mask of the position where they diverge.

Using what we now know, we can change this to

data IntMap a
  = Nil
  | Tip {-# UNPACK #-} !Int a
  | Bin {-# UNPACK #-} !Int {-# UNPACK #-} !Int !(IntMap a) !(IntMap a)

where the values we store in the Bin constructor are just the minimum and maximum Int key in the tree below.

Classifying Keys

To do so we need to be able to distinguish between roughly 6 cases for how a key can interact with the map, as if we had the Prefix and Mask in hand. From left to right:

data Class
  = FarLeft   -- differs on a higher msb, outside left branch
  | NearLeft  -- differs on the same msb, but outside current left branch
  | InLeft    -- within the left branch
  | InRight   -- within the right branch
  | NearRight -- differs on the same msb, but outside current right branch
  | FarRight  -- differs on a higher msb, outside right branch

The xor trick I mentioned at the end of part 2 can be bundled into a slightly unwieldy combinator, significant such that significant a b c d implies that the position of the most significant difference between c and d dominates the position of the most significant difference between a and b.

significant :: Int -> Int -> Int -> Int -> Bool
significant a b c d = ab < cd && ab < xor ab cd where
  cd = xor c d
  ab = xor a b

With that we can proceed to use trickery and slight of hand to classify our keys with regards to the bounds of our IntMap:

import Data.Bits

data Class
  = FarLeft   -- differs on a higher msb, outside left branch
  | NearLeft  -- differs on the same msb, but outside current left branch
  | InLeft    -- within the left branch
  | InRight   -- within the right branch
  | NearRight -- differs on the same msb, but outside current right branch
  | FarRight  -- differs on a higher msb, outside right branch
  deriving (Eq,Ord,Show,Read)

significant :: Int -> Int -> Int -> Int -> Bool
significant a b c d = ab < cd && ab < xor ab cd where
  cd = xor c d
  ab = xor a b

-- show
classify :: Int -> Int -> Int -> Class
classify k x y
  | k < x = if significant x y k y then FarLeft else NearLeft
  | k > y = if significant x y x k then FarRight else NearRight
  | significant k y x y = InRight
  | otherwise = InLeft
  
main = print $ classify 1 2 4
-- /show

We don't need to use the full power of classify, as often some subset of those 6 cases will be the same, so lets define a couple of additional combinators:

outside :: Int -> Int -> Int -> Bool
outside k x y = k < x || k > y

insideR :: Int -> Int -> Int -> Bool
insideR k x y = significant k y x y

outside serves as a more accurate version of nomatch from the Data.IntMap internals, and insideR assumes we're inside the range [x..y] and notes that if there is an extra bit of difference between x and y than between k and y, then we're in the right branch.

We simply use integer comparisons and 3 xors to classify how our key relates to the range of our IntMap.

We won't actually be using classify explicitly but you can play with it to see if you agree with its results! You'll be able to see it conceptually at work in the code below though.

Stock Definitions

Some of the stock combinators don't change at all:

null :: IntMap a -> Bool
null Nil = True
null _   = False

empty :: IntMap a
empty = Nil

Similarly the instances don't change:

instance Traversable IntMap where
  traverse f m0 = go m0 where
    go (Bin x y l r) = Bin x y <$> go l <*> go r
    go (Tip x a) = Tip x <$> f a
    go Nil = pure Nil
  {-# INLINE traverse #-}

instance Foldable IntMap where
  foldMap f m0 = go m0 where
    go Nil = mempty
    go (Tip _ a) = f a
    go (Bin _ _ l r) = mappend (go l) (go r)
  {-# INLINE foldMap #-}

instance Functor IntMap where
  fmap f m0 = go m0 where
    go Nil = Nil
    go (Tip x a) = Tip x (f a)
    go (Bin x y l r) = Bin x y (go l) (go r)
  {-# INLINE fmap #-}

but fast new friends become possible.

range :: IntMap a -> Maybe (Int,Int)
range Nil           = Nothing
range (Tip i a)     = Just (i,i)
range (Bin i j _ _) = Just (i,j)

Given the common usecase of finding the maximum key in an IntMap and inserting a new entry, that is a pretty nice side-effect!

Lookup

The next combinator to benefit from this change is lookup.

lookup :: Int -> IntMap a -> Maybe a
lookup k m0 = go m0 where
  go (Tip i a)
    | k == i    = Just a
    | otherwise = Nothing
  go (Bin x y  l r)
    | outside k x y = Nothing
    | insideR r x y = go r
    | otherwise     = go l
  go Nil = Nothing
{-# INLINE lookup #-}

lookup can now use the smarter outside check to fail faster than it can in stock containers.

Insert

Defining insert showcases the need for all 6 cases from classify. You can identify them in the reasoning below for how to handle the Bin case.

insert :: Int -> a -> IntMap a -> IntMap a
insert k a m0 = go m0 where
  go Nil = Tip k a
  go (Tip j b) = case compare k j of
    LT -> Bin k j (Tip k a) (Tip j b)
    EQ -> Tip k a
    GT -> Bin j k (Tip j b) (Tip k a)
  go n@(Bin x y l r)
    | k < x = if significant x y k y then Bin k y (Tip k a) n
                                     else Bin k y (go l) r
    | k > y = if significant x y x k then Bin x k n (Tip k a)
                                     else Bin x k l (go r)
    | significant k y x y = Bin x y l (go r)
    | otherwise           = Bin x y (go l) r

Delete

We can also define delete, benefiting similarly from the earlier exit in the unnecessary deletion case:

newx :: Int -> IntMap a -> IntMap a -> IntMap a
newx _ Nil r = r
newx y l@(Tip x _) r = Bin x y l r
newx y l@(Bin x _ _ _) r = Bin x y l r
{-# INLINE newx #-}

newy :: Int -> IntMap a -> IntMap a -> IntMap a
newy _ l Nil = l
newy x l r@(Tip y _)     = Bin x y l r
newy x l r@(Bin _ y _ _) = Bin x y l r
{-# INLINE newy #-}

delete :: Int -> IntMap a -> IntMap a
delete k m0 = go m0 where
  go n@(Bin x y l r)
    | outside k x y = n
    | insideR k x y = newy x l (go r)
    | otherwise     = newx y (go l) r
  go n@(Tip x a)
    | k == x    = Nil
    | otherwise = n
  go Nil = Nil
{-# INLINE delete #-}

Here we suffer ever so slightly. The Prefix and Mask are fixed when we call bin in the old code, but now we need to inspect the values we're given in newx and newy to find their bounds.

At

Finally, no post of mine would be complete without at least one reference to lens.

We can define the new alterF Lens that is being backported to containers for our modified IntMap directly. Here I'll call it at, due to its similarity to the lens combinator of the same name.

at :: Functor f => Int -> (Maybe a -> f (Maybe a)) -> IntMap a -> f (IntMap a)
at k f m0 = go m0 where
  go Nil       = maybe Nil (Tip k) <$> f Nothing
  go n@(Tip x a) = case compare k x of
    LT -> maybe n (\b -> Bin k x (Tip k b) n) <$> f Nothing
    EQ -> maybe Nil (Tip k) <$> f (Just a)
    GT -> maybe n (\b -> Bin x k n (Tip k b)) <$> f Nothing
  go n@(Bin x y l r)
    | k < x = if significant x y k y then maybe n (\b -> Bin k y (Tip k b) n) <$> f Nothing
                                     else maybe n (\b -> Bin (min k x) y (insert k b l) r) <$> f Nothing
    | k > y = if significant x y x k then maybe n (\b -> Bin x k n (Tip k b)) <$> f Nothing
                                     else maybe n (\b -> Bin x (max k y) l (insert k b r)) <$> f Nothing
    | insideR k x y = newy x l <$> go r
    | otherwise     = (\l' -> newx y l' r) <$> go l
{-# INLINE at #-}

We can view the scarier, but Haskell 98 type for at in the definition above as

at :: Int -> Lens' (IntMap a) (Maybe a)

This combinator is a bit hideous, but it should work! Feel free to test it. =)

Run It!

Putting it all together we get:

-- show
import Control.Applicative hiding (empty)
import Control.Lens hiding (at,outside)
import Data.Bits
import Data.Foldable
import Data.Traversable
import Data.Monoid
import Prelude hiding (lookup, null)

data IntMap a
  = Nil
  | Tip {-# UNPACK #-} !Int a
  | Bin {-# UNPACK #-} !Int {-# UNPACK #-} !Int !(IntMap a) !(IntMap a)
  deriving (Eq,Ord,Show,Read)

null :: IntMap a -> Bool
null Nil = True
null _   = False
{-# INLINE null #-}

empty :: IntMap a
empty = Nil
{-# INLINE empty #-}

range :: IntMap a -> Maybe (Int,Int)
range Nil           = Nothing
range (Tip i a)     = Just (i,i)
range (Bin i j _ _) = Just (i,j)
{-# INLINE range #-}

instance Traversable IntMap where
  traverse f m0 = go m0 where
    go (Bin x y l r) = Bin x y <$> go l <*> go r
    go (Tip x a) = Tip x <$> f a
    go Nil = pure Nil
  {-# INLINE traverse #-}

instance Foldable IntMap where
  foldMap f m0 = go m0 where
    go Nil = mempty
    go (Tip _ a) = f a
    go (Bin _ _ l r) = mappend (go l) (go r)
  {-# INLINE foldMap #-}

instance Functor IntMap where
  fmap f m0 = go m0 where
    go Nil = Nil
    go (Tip x a) = Tip x (f a)
    go (Bin x y l r) = Bin x y (go l) (go r)
  {-# INLINE fmap #-}

-- @significant a b c d@ implies that the position of the most significant difference between
-- @c@ and @d@ dominates the position of the difference between @a and b@.
significant :: Int -> Int -> Int -> Int -> Bool
significant a b c d = ab < cd && ab < xor ab cd where
  cd = xor c d
  ab = xor a b
{-# INLINE significant #-}

-- | for expository purposes only
data Class
  = FarLeft   -- differs on a higher msb, outside left branch
  | NearLeft  -- differs on the same msb, but outside current left branch
  | InLeft    -- within the left branch
  | InRight   -- within the right branch
  | NearRight -- differs on the same msb, but outside current right branch
  | FarRight  -- differs on a higher msb, outside right branch
  deriving (Eq,Ord,Show,Read)

-- | classify a key @k@ with regards to a binary tree split on the 2-fattest number within @(x..y]@
classify :: Int -> Int -> Int -> Class
classify k x y
  | k < x = if significant x y k y then FarLeft else NearLeft
  | k > y = if significant x y x k then FarRight else NearRight
  | significant k y x y = InRight
  | otherwise = InLeft
{-# INLINE classify #-}

outside :: Int -> Int -> Int -> Bool
outside k x y = k < x || k > y
{-# INLINE outside #-}

insideR :: Int -> Int -> Int -> Bool
insideR k x y = significant k y x y
{-# INLINE insideR #-}

lookup :: Int -> IntMap a -> Maybe a
lookup k m0 = go m0 where
  go (Tip i a)
    | k == i    = Just a
    | otherwise = Nothing
  go (Bin x y  l r)
    | outside k x y = Nothing -- short-circuit
    | insideR k x y = go r
    | otherwise     = go l
  go Nil = Nothing
{-# INLINE lookup #-}

insert :: Int -> a -> IntMap a -> IntMap a
insert k a m0 = go m0 where
  go Nil = Tip k a
  go (Tip j b) = case compare k j of
    LT -> Bin k j (Tip k a) (Tip j b)
    EQ -> Tip k a
    GT -> Bin j k (Tip j b) (Tip k a)
  go n@(Bin x y l r)
    | k < x = if significant x y k y then Bin k y (Tip k a) n
                                     else Bin k y (go l) r
    | k > y = if significant x y x k then Bin x k n (Tip k a)
                                     else Bin x k l (go r)
    | significant k y x y = Bin x y l (go r)
    | otherwise           = Bin x y (go l) r
{-# INLINE insert #-}

newx :: Int -> IntMap a -> IntMap a -> IntMap a
newx _ Nil r = r
newx y l@(Tip x _) r = Bin x y l r
newx y l@(Bin x _ _ _) r = Bin x y l r
{-# INLINE newx #-}

newy :: Int -> IntMap a -> IntMap a -> IntMap a
newy _ l Nil = l
newy x l r@(Tip y _)     = Bin x y l r
newy x l r@(Bin _ y _ _) = Bin x y l r
{-# INLINE newy #-}

delete :: Int -> IntMap a -> IntMap a
delete k m0 = go m0 where
  go n@(Bin x y l r)
    | outside k x y = n
    | insideR k x y = newy x l (go r)
    | otherwise     = newx y (go l) r
  go n@(Tip x a)
    | k == x    = Nil
    | otherwise = n
  go Nil = Nil
{-# INLINE delete #-}

at :: Functor f => Int -> (Maybe a -> f (Maybe a)) -> IntMap a -> f (IntMap a)
at k f m0 = go m0 where
  go Nil       = maybe Nil (Tip k) <$> f Nothing
  go n@(Tip x a) = case compare k x of    
    LT -> maybe n (\b -> Bin k x (Tip k b) n) <$> f Nothing
    EQ -> maybe Nil (Tip k) <$> f (Just a)
    GT -> maybe n (\b -> Bin x k n (Tip k b)) <$> f Nothing
  go n@(Bin x y l r)
    | k > y = if significant x y x k then maybe n (\b -> Bin x k n (Tip k b)) <$> f Nothing               -- far right
                                     else maybe n (\b -> Bin x (max k y) l (insert k b r)) <$> f Nothing  -- near right
    | k < x = if significant x y k y then maybe n (\b -> Bin k y (Tip k b) n) <$> f Nothing               -- far left
                                     else maybe n (\b -> Bin (min k x) y (insert k b l) r) <$> f Nothing  -- near left
    | significant k y x y = newy x l <$> go r -- in right
    | otherwise           = (\l' -> newx y l' r) <$> go l -- in left
{-# INLINE at #-}

bin :: IntMap a -> IntMap a -> IntMap a
bin l Nil = l
bin Nil r = r
bin l@(Tip x _)     r@(Tip y _)     = Bin x y l r
bin l@(Tip x _)     r@(Bin _ y _ _) = Bin x y l r
bin l@(Bin x _ _ _) r@(Bin _ y _ _) = Bin x y l r
bin l@(Bin x _ _ _) r@(Tip y _)     = Bin x y l r
{-# INLINE bin #-}
-- /show
-- show Run it!
main = print $ (empty & at 1 ?~ "hello" & at 2 ?~ "world") ^. at 2
-- /show

I have no idea if this is faster than the approach taken by Data.IntMap in practice on real data, but xor is your friend.

A great opportunity for participation would be to prove whether this code is faster or slower than the code in Data.IntMap in practice and if it proves to be faster, flesh it out!

I have one last diversion I need to post about before I can finally get to talking about the algorithm that started this discussion.

-Edward Kmett

August 23 2013