27 May 2016

While working for various clients that needed fast binary
serialization, we had discovered that the
`binary`

and
`cereal`

packages are both
inefficient and so we created the
`store`

package.

In the high-frequency trading sector, we had to decode and encode a
binary protocol into Haskell data structures for analysis. During this
process it was made apparent to us that while we had been attentive to
micro-benchmark with the venerable
`criterion`

package, we
hadn't put a lot of work into ensuring that memory usage was well
studied. Bringing down allocations (and thus work, and garbage
collection) was key to achieving reasonable speed.

## Let's measure space

In response, let's measure space more, in an automatic way.

The currently available way to do this is by compiling with profiling
enabled and adding call centers and then running our program with RTS
options. For example, we write a program with an `SCC`

call center,
like this:

main :: IO () main = do let !_ = {-# SCC myfunction_10 #-} myfunction 10 return ()

Then compile with profiling enabled with `-p`

and run with `+RTS -P`

and we get an output like this:

```
COST CENTRE MODULE no. entries ... bytes
MAIN MAIN 43 0 ... 760
CAF:main1 Main 85 0 ... 0
main Main 86 1 ... 0
myfunction_10 Main 87 1 ... 160
```

(Information omitted with `...`

to save space.)

That's great, exactly the kind of information we'd like to get. But we want it in a more concise, programmatic fashion. On a test suite level.

## Announcing `weigh`

To serve this purpose, I've written the
`weigh`

package, which seeks to
automate the measuring of memory usage of programs, in the same way
that `criterion`

does for timing of programs.

It doesn't promise perfect measurement and comes with a grain of salt,
but it's reproducible. Unlike timing, allocation is generally reliable
provided you use something like `stack`

to
pin the GHC version and packages, so you can also make a test suite
out of it.

## How it works

There is a simple DSL, like `hspec`

, for writing out your tests. It
looks like this:

import Weigh main = mainWith (do func "integers count 0" count 0 func "integers count 1" count 1 func "integers count 2" count 2 func "integers count 3" count 3 func "integers count 10" count 10 func "integers count 100" count 100) where count :: Integer -> () count 0 = () count a = count (a - 1)

This example weighs the function `count`

, which counts down to
zero. We want to measure the bytes allocated to perform the
action. The output is:

```
Case Bytes GCs Check
integers count 0 0 0 OK
integers count 1 32 0 OK
integers count 2 64 0 OK
integers count 3 96 0 OK
integers count 10 320 0 OK
integers count 100 3,200 0 OK
```

Weee! We can now go around weighing everything! I encourage you to do that. Even Haskell newbies can make use of this to get a vague idea of how costly their code (or libraries they're using) is.

## Real-world use-case: `store`

I wrote a few tests, while developing `weigh`

, for the `store`

package: encoding of lists, vectors and storable vectors. Here's the
`criterion`

result for encoding a regular `Vector`

type:

```
benchmarking encode/1kb normal (Vector Int32)/store
time 3.454 μs (3.418 μs .. 3.484 μs)
benchmarking encode/1kb normal (Vector Int32)/cereal
time 19.56 μs (19.34 μs .. 19.79 μs)
benchmarking encode/10kb normal (Vector Int32)/store
time 33.09 μs (32.73 μs .. 33.57 μs)
benchmarking encode/10kb normal (Vector Int32)/cereal
time 202.7 μs (201.2 μs .. 204.6 μs)
```

`store`

is **6x** faster than `cereal`

at encoding Int32 vectors. Great! Our
job is done, we've overcome previous limitations of binary encoding
speed. Let's take a look at how heavy this process is. Weighing the
program on 1 million and 10 million elements yields:

```
1,000,000 Boxed Vector Int Encode: Store 88,008,584 140 OK
1,000,000 Boxed Vector Int Encode: Cereal 600,238,200 1,118 OK
10,000,000 Boxed Vector Int Encode: Store 880,078,896 1,384 OK
10,000,000 Boxed Vector Int Encode: Cereal 6,002,099,680 11,168 OK
```

`store`

is 6.8x more memory efficient than `cereal`

. Excellent. But is our
job really finished? Take a look at those allocations. To simply
allocate a vector of that size, it's:

```
1,000,000 Boxed Vector Int Allocate 8,007,936 1 OK
10,000,000 Boxed Vector Int Allocate 80,078,248 1 OK
```

While `store`

is more efficient than `cereal`

, how are we allocating 11x
the amount of space necessary? We looked into this in the codebase, it
turned out more inlining was needed. After comprehensively applying
the `INLINE`

pragma to key methods and functions, the memory was
brought down to:

```
1,000,000 Boxed Vector Int Allocate 8,007,936 1 OK
1,000,000 Boxed Vector Int Encode: Store 16,008,568 2 OK
1,000,000 Boxed Vector Int Encode: Cereal 600,238,200 1,118 OK
10,000,000 Boxed Vector Int Allocate 80,078,248 1 OK
10,000,000 Boxed Vector Int Encode: Store 160,078,880 2 OK
10,000,000 Boxed Vector Int Encode: Cereal 6,002,099,680 11,168 OK
```

Now, `store`

takes an additional 8MB to encode an 8MB vector, 80MB for
an 80MB buffer. That's perfect 1:1 memory usage! Let's check out the
new speed without these allocations:

```
benchmarking encode/1kb normal (Vector Int32)/store
time 848.4 ns (831.6 ns .. 868.6 ns)
benchmarking encode/1kb normal (Vector Int32)/cereal
time 20.80 μs (20.33 μs .. 21.20 μs)
benchmarking encode/10kb normal (Vector Int32)/store
time 7.708 μs (7.606 μs .. 7.822 μs)
benchmarking encode/10kb normal (Vector Int32)/cereal
time 207.4 μs (204.9 μs .. 210.3 μs)
```

`store`

is 4x faster than previously! `store`

is also now 20x faster than
`cereal`

at encoding a vector of ints.

## Containers vs unordered-containers

Another quick example, the Map structures from the two containers
packages. Let's weigh how heavy `fromList`

is on 1 million
elements. For fun, the keys are randomly generated rather than
ordered. We force the list completely ahead of time, because we just
want to see the allocations by the library, not our input list.

fromlists :: Weigh () fromlists = do let !elems = force (zip (randoms (mkStdGen 0) :: [Int]) [1 :: Int .. 1000000]) func "Data.Map.Strict.fromList (1 million)" Data.Map.Strict.fromList elems func "Data.Map.Lazy.fromList (1 million)" Data.Map.Lazy.fromList elems func "Data.IntMap.Strict.fromList (1 million)" Data.IntMap.Strict.fromList elems func "Data.IntMap.Lazy.fromList (1 million)" Data.IntMap.Lazy.fromList elems func "Data.HashMap.Strict.fromList (1 million)" Data.HashMap.Strict.fromList elems func "Data.HashMap.Lazy.fromList (1 million)" Data.HashMap.Lazy.fromList elems

We clearly see that `IntMap`

from `containers`

is about 1.3x more
memory efficient than the generic `Ord`

-based `Map`

. However,
`HashMap`

wipes the floor with both of them (for `Int`

, at least),
using 6.3x less memory than `Map`

and 4.8x less memory than `IntMap`

:

```
Data.Map.Strict.fromList (1 million) 1,016,187,152 1,942 OK
Data.Map.Lazy.fromList (1 million) 1,016,187,152 1,942 OK
Data.IntMap.Strict.fromList (1 million) 776,852,648 1,489 OK
Data.IntMap.Lazy.fromList (1 million) 776,852,648 1,489 OK
Data.HashMap.Strict.fromList (1 million) 161,155,384 314 OK
Data.HashMap.Lazy.fromList (1 million) 161,155,384 314 OK
```

This is just a trivial few lines of code to generate this result, as you see above.

## Caveat

But beware: it's not going to be obvious exactly where allocations are coming from in the computation (if you need to know that, use the profiler). It's better to consider a computation holistically: this is how much was allocated to produce this result.

Analysis at finer granularity is likely to be guess-work (beyond even what's available in profiling). For the brave, let's study some examples of that.

## Interpreting the results: `Integer`

Notice that in the table we generated, there is a rather odd increase of allocations:

```
Case Bytes GCs Check
integers count 0 0 0 OK
integers count 1 32 0 OK
integers count 2 64 0 OK
integers count 3 96 0 OK
integers count 10 320 0 OK
integers count 100 3,200 0 OK
```

What's the explanation for those bytes in each iteration?

Refreshing our memory: The space taken up by a "small" Integer is
two machine words. On
64-bit that's 16 bytes. `Integer`

is defined like this:

data Integer = S# Int# -- small integers | J# Int# ByteArray# -- large integers

For the rest, we'd expect only 16 bytes per iteration, but we're
seeing more than that. **Why?** Let's look at
the Core
for `count`

:

Main.main48 = __integer 0 Main.main41 = __integer 1 Rec { Main.main_count [Occ=LoopBreaker] :: Integer -> () [GblId, Arity=1, Str=DmdType <S,U>] Main.main_count = \ (ds_d4Am :: Integer) -> case eqInteger# ds_d4Am Main.main48 of wild_a4Fq { __DEFAULT -> case ghc-prim-0.4.0.0:GHC.Prim.tagToEnum# @ Bool wild_a4Fq of _ [Occ=Dead] { False -> Main.main_count (minusInteger ds_d4Am Main.main41); True -> ghc-prim-0.4.0.0:GHC.Tuple.() } } end Rec }

The `eqInteger#`

function is
a pretend-primop,
which apparently combines with `tagToEnum#`

and is
optimized away at
the code generation phase.
This should lead to an unboxed comparison of something like `Int#`

,
which should not allocate. This leaves only the addition operation,
which should allocate one new 16-byte `Integer`

.

So where are those additional 16 bytes from?
The implementation of `minusInteger`

for `Integer`

types
is actually implemented as `x + -y`

:

-- TODO -- | Subtract two 'Integer's from each other. minusInteger :: Integer -> Integer -> Integer minusInteger x y = inline plusInteger x (inline negateInteger y)

This means we're allocating **one more Integer**. That explains the additional 16 bytes!

There's a `TODO`

there. I guess someone implemented `negateInteger`

and `plusInteger`

(which is
non-trivial)
and had enough.

If we implement a second function `count'`

that takes this into
account,

import Weigh main :: IO () main = mainWith (do func "integers count 0" count 0 func "integers count 1" count 1 func "integers count 2" count 2 func "integers count 3" count 3 func "integers count' 0" count' 0 func "integers count' 1" count' 1 func "integers count' 2" count' 2 func "integers count' 3" count' 3) where count :: Integer -> () count 0 = () count a = count (a - 1) count' :: Integer -> () count' 0 = () count' a = count' (a + (-1))

we get more reasonable allocations:

```
Case Bytes GCs Check
integers count 0 0 0 OK
integers count 1 32 0 OK
integers count 2 64 0 OK
integers count 3 96 0 OK
integers count' 0 0 0 OK
integers count' 1 16 0 OK
integers count' 2 32 0 OK
integers count' 3 48 0 OK
```

It turns out that `count'`

is 20% faster (from `criterion`

benchmarks), but realistically, if speed matters, we'd be using `Int`

,
which is practically 1000x faster.

What did we learn? Even something as simple as `Integer`

subtraction
doesn't behave as you would naively expect.

## Considering a different type: `Int`

Comparatively, let's look at `Int`

:

import Weigh main = mainWith (do func "int count 0" count 0 func "int count 1" count 1 func "int count 10" count 10 func "int count 100" count 100) where count :: Int -> () count 0 = () count a = count (a - 1)

The output is:

```
Case Bytes GCs Check
ints count 1 0 0 OK
ints count 10 0 0 OK
ints count 1000000 0 0 OK
```

It allocates zero bytes. Why? Let's take a look at the Core:

Rec { Main.$wcount1 [InlPrag=[0], Occ=LoopBreaker] :: ghc-prim-0.4.0.0:GHC.Prim.Int# -> () [GblId, Arity=1, Caf=NoCafRefs, Str=DmdType <S,1*U>] Main.$wcount1 = \ (ww_s57C :: ghc-prim-0.4.0.0:GHC.Prim.Int#) -> case ww_s57C of ds_X4Gu { __DEFAULT -> Main.$wcount1 (ghc-prim-0.4.0.0:GHC.Prim.-# ds_X4Gu 1); 0 -> ghc-prim-0.4.0.0:GHC.Tuple.() } end Rec }

It's clear that GHC is able to optimize this tight loop, and unbox the
`Int`

into an `Int#`

, which can be put into a register rather than
being allocated by the GHC runtime allocator to be freed later.

The lesson is not to take for granted that everything has a 1:1 memory mapping at runtime with your source, and to take each case in context.

## Data structures

Finally, from our contrived examples we can take a look at user-defined data types and observe some of the optimizations that GHC does for memory.

Let's demonstrate that unpacking a data structure yields less memory. Here is a data type that contains an `Int`

:

data HasInt = HasInt !Int deriving (Generic) instance NFData HasInt

Here are two identical data types which use `HasInt`

, but the first
simply uses `HasInt`

, and the latter unpacks it.

data HasPacked = HasPacked HasInt deriving (Generic) instance NFData HasPacked data HasUnpacked = HasUnpacked {-# UNPACK #-} !HasInt deriving (Generic) instance NFData HasUnpacked

We can measure the difference by weighing them like this:

-- | Weigh: packing vs no packing. packing :: Weigh () packing = do func "\\x -> HasInt x" (\x -> HasInt x) 5 func "\\x -> HasPacked (HasInt x)" (\x -> HasPacked (HasInt x)) 5 func "\\x -> HasUnpacked (HasInt x)" (\x -> HasUnpacked (HasInt x)) 5

The output is:

\x -> HasInt x 16 0 OK \x -> HasPacked (HasInt x) 32 0 OK \x -> HasUnpacked (HasInt x) 16 0 OK

Voilà! Here we've demonstrated that:

`HasInt x`

consists of the 8 byte header for the constructor, and 8 bytes for the Int.`HasPacked`

has 8 bytes for the constructor, 8 bytes for the first slot, then another 8 bytes for the`HasInt`

constructor, finally 8 bytes for the`Int`

itself.`HasUnpacked`

only allocates 8 bytes for the header, and 8 bytes for the`Int`

.

GHC did what we wanted!

## Summary

We've looked at:

- What lead to this package.
- Propose that we start measuring our functions more, especially libraries.
- How to use this package.
- Some of our use-cases at FP Complete.
- Caveats.
- Some contrived examples do not lead to obvious explanations.

Now I encourage you to try it out!