We do a lot of work at FP Complete in the realm of multi-machine computing for our customers. One of the things we need to do efficiently to get good speedups from adding more machines is optimizing serialization. Since our multi-machine work often revolves around scientific and mathematical computing, we primarily need to make sure that there is efficient serialization of primitive data types (like Int64 and Double), as well as of vectors containing those types.
Not too long ago, Francesco discovered and reported some issues with the binary package's serialization of Doubles, which got us thinking more carefully about serialization performance in general.
Both the binary and cereal packages are very featureful, and with those features comes some overhead. We actually have much simpler requirements for our standard serialization cases: all of the data is already in memory (so there's no need for streaming or incremental decoding), the full serialized size can be computed cheaply in advance (so the output buffer can be allocated up front), and the data only moves between machines we control (so we can use host byte order instead of converting to network byte order).
With these ideas in mind, I've spent the past few days putting together a variety of different implementations of both encoding and decoding, and believe that for the limited cases I just described, we can get significantly better serialization performance.
The benchmarks I've used for this experiment focus on a fairly simple encoding and decoding test for a boxed vector of SomeData values, where SomeData is a small record containing an Int64, a Word8, and a Double.
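For concreteness, here is a sketch of the shape of the benchmark data. The field types come from the decoding code shown later in this post; the field order, the derived instances, and the sample generator are my own illustration, not the benchmark code itself:

import Data.Int (Int64)
import Data.Word (Word8)
import qualified Data.Vector as V

-- A small record of primitive fields, mirroring the benchmark's element type.
data SomeData = SomeData !Int64 !Word8 !Double
    deriving (Eq, Show)

-- An illustrative boxed vector of such values (the real benchmark uses a
-- much larger input).
sampleData :: V.Vector SomeData
sampleData = V.generate 1000 $ \i ->
    SomeData (fromIntegral i) (fromIntegral (i `mod` 256)) (fromIntegral i / 2)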
There are three different binary representations of the data used in this benchmark suite:
Let's start with the benchmark results, and then do some more analysis:
Since the little-endian encoding wins as far as representation goes, we'll spend the rest of this post analyzing the different trade-offs in encoding and decoding, based on the 6 encoding and 3 decoding functions implemented and tested here.
Continuation vs mutable reference
When both encoding and decoding, we need to keep track of the current offset being written to/read from within the buffer. In the case of decoding, we also need some way to fail out if the data parsed is not what we expect, or if there are insufficient bytes to consume. I tested two approaches for this, sketched just below: a continuation-passing style, where each step receives the current offset and hands an updated offset to the rest of the computation (with failure expressed as an early exit), and a mutable-reference style, where the current offset lives in a mutable cell that each step reads and updates, with decoding failures signaled by exceptions.
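As a rough illustration of the two shapes on the decoding side, where both the offset and the failure handling matter, here are two newtypes of my own; they are not the types from the benchmark repository, and the real mutable-reference code uses an unboxed vector rather than an IORef, as shown in the next section:

{-# LANGUAGE RankNTypes #-}

import Data.IORef (IORef)
import Data.Word (Word8)
import Foreign.Ptr (Ptr)

type Offset = Int

-- Continuation style: the decoder gets the buffer and the current offset, and
-- hands the updated offset plus the decoded value to a continuation. Failing
-- out simply means never calling that continuation.
newtype PeekCPS a = PeekCPS
    (forall r. Ptr Word8 -> Offset -> (Offset -> a -> IO r) -> IO r)

-- Mutable-reference style: the current offset lives in a reference that each
-- read bumps; running out of bytes is reported by throwing an exception.
newtype PeekRef a = PeekRef (Ptr Word8 -> IORef Offset -> IO a)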
As you can see from the results, the continuation-based approach gave a slight performance advantage. That matches what I would have expected, though the gap between the two was smaller than anticipated. Additionally, in some earlier testing, before I'd added further optimizations, the reference-based implementation actually outperformed the continuation-based one. The details of what's going on at the core level would be an interesting follow-up analysis.
Optimizing the mutable reference
If you dive into the code, you may be a bit surprised at how the offset reference is represented. Instead of a more standard IORef, it's a one-element mutable unboxed vector:
newtype OffsetRef = OffsetRef (Data.Vector.Unboxed.Mutable.MVector RealWorld Offset)
The idea here is to take advantage of the more efficient unboxed representation and byte array operations provided by the vector package and GHC. This also avoids a lot of the garbage collection overhead that comes from the boxed nature of IORefs.
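For illustration, here is a small sketch of how such a reference can be created and bumped. The helper names (newOffsetRef, advance) and the Offset alias are mine, not taken from the benchmark code:

import Control.Monad.Primitive (RealWorld)
import qualified Data.Vector.Unboxed.Mutable as MU

type Offset = Int  -- assumed alias; the real code presumably defines this elsewhere

-- Repeated from above so the sketch is self-contained.
newtype OffsetRef = OffsetRef (MU.MVector RealWorld Offset)

-- Allocate the one-element unboxed vector holding the starting offset.
newOffsetRef :: Offset -> IO OffsetRef
newOffsetRef off = OffsetRef <$> MU.replicate 1 off

-- Read the current offset, advance it by n bytes, and return the old value.
advance :: OffsetRef -> Int -> IO Offset
advance (OffsetRef mv) n = do
    off <- MU.unsafeRead mv 0
    MU.unsafeWrite mv 0 (off + n)
    return off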
Storable, bit-shifting, and bytestring-builder
When it comes to reading and writing primitive datatypes through a pointer, the Storable type class is the obvious tool. Its peek and poke operations are implemented under the surface with GHC primops. We can get away with plain Storable-style access because, in our implementation, we always use host byte order. By contrast, the cereal, binary, and bytestring-builder packages need to do more direct bit-shifting, such as:
word64be :: ByteString -> Word64
word64be = \s ->
    (fromIntegral (s `SU.unsafeIndex` 0) `shiftL` 56) .|.
    (fromIntegral (s `SU.unsafeIndex` 1) `shiftL` 48) .|.
    (fromIntegral (s `SU.unsafeIndex` 2) `shiftL` 40) .|.
    (fromIntegral (s `SU.unsafeIndex` 3) `shiftL` 32) .|.
    (fromIntegral (s `SU.unsafeIndex` 4) `shiftL` 24) .|.
    (fromIntegral (s `SU.unsafeIndex` 5) `shiftL` 16) .|.
    (fromIntegral (s `SU.unsafeIndex` 6) `shiftL`  8) .|.
    (fromIntegral (s `SU.unsafeIndex` 7) )
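By contrast, a host-byte-order read and write via Storable looks roughly like this (a minimal sketch using base's Foreign.Storable, not the benchmark code itself):

import Data.Int (Int64)
import Data.Word (Word8)
import Foreign.Ptr (Ptr)
import Foreign.Storable (peekByteOff, pokeByteOff)

-- Write an Int64 at the given byte offset, in host byte order.
pokeInt64 :: Ptr Word8 -> Int -> Int64 -> IO ()
pokeInt64 = pokeByteOff

-- Read it back the same way; no shifting or masking is needed.
peekInt64 :: Ptr Word8 -> Int -> IO Int64
peekInt64 = peekByteOff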
The benchmark code defines a Simple typeclass for the types it knows how to serialize, with instances for the primitive types:

instance Simple Int64
instance Simple Word8
instance Simple Double
One of its members is the size computation:

simpleSize :: Either Int (a -> Int)
You may be wondering: if all we need is the number of bytes a value will occupy when serialized, why the Either in that signature?
A naive implementation of this would look like:
simpleSize :: a -> Int
However, this signature implies that every type's size may change at runtime. This would require that, when serializing a vector, we walk the entire vector and add up the size of each element:
simpleSize :: Vector a -> Int
simpleSize v = 8 + V.sum (V.map simpleSize v)
But in the case of statically-known sizes, we'd like to replace that per-element traversal with a single multiplication. That's what the actual type of simpleSize allows:
simpleSize :: Either Int (a -> Int)
If this function returns Left n, then every value of the type serializes to exactly n bytes and we never need to inspect individual values; if it returns Right f, the size must be computed per value by applying f. The Vector instance can then take advantage of statically-sized elements:
simpleSize :: Either Int (Vector a -> Int)
simpleSize = Right $ \v ->
    case simpleSize of
        Left s  -> s * V.length v + 8
        Right f -> V.sum (V.map f v) + 8
Notice how, for a vector whose elements have a statically known size, the total size is a single multiplication plus 8 bytes for the length prefix, with no traversal of the vector at all. That size is exactly what we use to allocate the output buffer when encoding:
encodeSimpleRaw :: Simple a => a -> ByteString
encodeSimpleRaw x = unsafeCreate
    (either id ($ x) simpleSize)
    (\p -> simpleRawPoke p 0 x)
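To make the Left case concrete, here is what the size computation looks like for the benchmark's record type. This is my own sketch of what a Simple SomeData instance's simpleSize would return; it is not code from the repository:

import Data.Int (Int64)
import Data.Word (Word8)

data SomeData = SomeData !Int64 !Word8 !Double

-- Every SomeData occupies the same number of bytes, so no value needs to be
-- inspected: 8 (Int64) + 1 (Word8) + 8 (Double) = 17 bytes.
someDataSize :: Either Int (SomeData -> Int)
someDataSize = Left (8 + 1 + 8)

-- A vector of n SomeData values then needs 8 + 17 * n bytes, computed in
-- constant time.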
Monadic vs Monoidal interface
bytestring-builder exposes a monoidal interface: you build up a Builder by combining smaller Builders with mappend/(<>). The monadic interface, though, has some advantages over a monoidal one: it supports do-notation, it is what we need on the decoding side anyway (where each step produces a result that later steps depend on), and, as discussed in the conclusions below, its performance is more predictable. A sketch of the two styles follows.
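Here is a rough side-by-side of the two styles. The monoidal example uses the real bytestring-builder API; the monadic one uses a toy PokeM built on StateT, which only stands in for the benchmark's actual poke monad:

import Control.Monad.Trans.Class (lift)
import Control.Monad.Trans.State.Strict (StateT, get, put)
import Data.ByteString.Builder (Builder, doubleLE, int64LE, word8)
import Data.Int (Int64)
import Data.Word (Word8)
import Foreign.Ptr (Ptr)
import Foreign.Storable (Storable, pokeByteOff, sizeOf)

data SomeData = SomeData !Int64 !Word8 !Double

-- Monoidal style: an encoder is a value, and encoders compose with (<>).
encodeMonoidal :: SomeData -> Builder
encodeMonoidal (SomeData x y z) = int64LE x <> word8 y <> doubleLE z

-- A toy monadic poke: IO plus the current offset into the output buffer.
type PokeM = StateT Int IO

pokeStorable :: Storable a => Ptr Word8 -> a -> PokeM ()
pokeStorable p x = do
    off <- get
    lift (pokeByteOff p off x)  -- write at the current offset, host byte order
    put (off + sizeOf x)        -- advance the offset

-- Monadic style: an encoder is an action, composed with do-notation.
encodeMonadic :: Ptr Word8 -> SomeData -> PokeM ()
encodeMonadic p (SomeData x y z) = do
    pokeStorable p x
    pokeStorable p y
    pokeStorable p z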
Overall abstraction overhead
In addition to the nice interface, we want to make sure that the abstractions themselves (the typeclass and the monadic wrappers) don't introduce a slowdown relative to hand-rolled code. In other words, how much do we pay for being able to write:
simplePeek = SomeData <$> simplePeek <*> simplePeek <*> simplePeek
instead of manually threading the buffer through continuations:

readInt64 bs $ \bsX x ->
    readWord8 bsX $ \bsY y ->
    readDouble bsY $ \bsZ z -> do
        MV.unsafeWrite mv idx (SomeData x y z)
        loop (idx + 1) bsZ
If you look at the implementation of the Get monad in binary or cereal, you'll see that it carries extra machinery, in particular support for incremental input, where a decoder can be suspended and resumed when more bytes arrive. That generality has a cost on every decoding step.
Since we're designing a new decoding interface from scratch, and have no need for streaming decodes, we can use a much simpler formulation:
newtype Peek s a
Also, there is a PrimMonad instance for Peek, which means we can perform mutable vector operations while decoding. That's what lets the Vector instance allocate and fill a mutable vector directly:
simplePeek :: Simple a => Peek (Vector a)
simplePeek = do
    len :: Int64 <- simplePeek
    let len' = fromIntegral len
    mv <- MV.new len'
    let loop i
            | i >= len' = V.unsafeFreeze mv
            | otherwise = do
                x <- simplePeek
                MV.unsafeWrite mv i x
                loop $! i + 1
    loop 0
All of this also applies to the encoding (poke) side of the code.
The criterion link I shared above has the results of all of these benchmarks, but for the lazy, here's the result graph again:
As you can see: the little-endian encoding regularly outperformed the other two formats by a significant margin. This isn't surprising: there's less CPU work to do, and we're able to leverage GHC primops to do most of it.
Similarly, while bytestring-builder is a very highly tuned library, for a narrow use case like ours, where the resulting buffer size is easily computed in advance, writing directly into a single pre-allocated buffer beats going through a Builder and its chunk machinery.
To my surprise, the mutable reference/exception-based approach to peeking and poking was fairly competitive with the continuation-based approach. Nonetheless, the continuation-based approach outperforms it overall, and is the more elegant of the two.
While the monadic and monoidal interfaces for poking/encoding give similar results, the performance of the monoidal interface can be unpredictable if used incorrectly. Since monadic combinators are more commonly optimized and more ubiquitous, it's safer to expose the monadic interface. (And, for what it's worth, we can always provide a monoidal wrapper on top of the monadic interface.)
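Such a wrapper is only a few lines. This is just a sketch of the idea, assuming a monadic poke type whose unit-returning actions we want to combine monoidally:

-- Wrap unit-returning actions of any Applicative so they can be combined
-- with (<>) and mempty; instantiating m at the poke monad gives back a
-- monoidal encoding interface for free.
newtype PokeBuilder m = PokeBuilder { runPokeBuilder :: m () }

instance Applicative m => Semigroup (PokeBuilder m) where
    PokeBuilder a <> PokeBuilder b = PokeBuilder (a *> b)

instance Applicative m => Monoid (PokeBuilder m) where
    mempty = PokeBuilder (pure ())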
And thankfully, the abstractions we use to make our encoding and decoding easy to work with and nicely composable do not introduce a slowdown versus the low-level implementation. So no need to force people to drop down to manual continuation passing.
Based on this analysis, it seems like we have a clear set of choices for a new serialization library. Hopefully this blog post clarifies the design goals of such a library, as well as its limitations (e.g., no streaming decoding). Such a library is something we'll almost certainly want to leverage for our multi-machine computation work at FP Complete.
I'm interested in feedback from others on this topic, including ways to better optimize the code, alternative strategies, and additional use cases you might have for such a library.
Posted by Michael Snoyman - 14 March, 2016