Isolate the problem and develop a solution that left the clients’ proven core code intact.
FP Complete quickly confirmed that the problem stemmed from Haskell’s practice of recycling unused memory and worked around this problem by developing a platform capable of easily distributing Haskell computation across distinct processes.
Since the inefficiencies we were experiencing were due to Haskell’s garbage collector having trouble managing a very big heap accessed by multiple threads, one way to mitigate the problem is to split up the work in many small heaps serving different threads or processes. Thus, garbage collection in one memory heap would not block users of the other heaps. We decided to distribute the workload across several separate single-threaded processes, each with its own heap, so that each would be very productive while still being able to distribute work across vast computing resources.
To distribute work efficiently, we developed a design using a master-slave architecture with the master orchestrating work to slaves. The slaves communicate with the master via messages over TCP sockets, and thus can be on the same machine or on separate machines -- a “two for one” win since the same code could support multi-core as well as multi-machine scaling. We distributed large chunks of work per message, to ensure that the latency in message-passing would not significantly affect performance.
The client’s predictive model uses a statistical approach called time-series Monte Carlo, which explores thousands of possible evolutions over time. Thus as a first attempt the master would orchestrate the work as follows: The master node is initialized with the initial states. When the master node needs to evolve a state, it transfers it to a slave node, which can be another process on the same machine, or on another machine across a network connection. The slave node executes the compute-intensive “evolution” function, and then returns the new state to the master. The master inspects the new states as they come from the slaves and continues evolving them. The problem with this first approach is that the states are relatively big -- in the hundreds of kilobytes per state, for a total of several gigabytes of states alive at any given time. This makes transferring them to a slave each time impractical.
We developed an algorithm capable of persisting the states on slave nodes, and then evolving them directly on the slaves without having to transfer them each time. Moreover, the master also take care to dynamically rebalance the work when a slave ends up having a backlog of states to evolve while others are idling. This keeps the communication traffic to a minimum while achieving good work balance across slave nodes. Crucially, all this management was handled outside the core regulated code, the mathematical analysis.
This allowed us not only to improve performance by 100% on a single 18-core machine, but also to distribute the computation across multiple machines, achieving latency reductions of up to 600% compared to the best performance we could get with the single-process version.
We choose this approach since it allowed us to achieve a dramatic speedup without any change to the core algorithm, and maintaining the output of the device byte-identical to the original version at all times, thus avoiding the burden of verifying the output twice, given the highly regulated environment the device operates in.
New Challenges for FP Complete
A remaining bottleneck was converting Haskell objects to streams of bytes that could be transferred as a message. This led to optimization work that culminated in the development of a new serialization library called “store” which improves the Haskell status quo by several orders of magnitude. With the client’s permission we released this library under a permissive open source license, following a string of FP Complete releases aiming to improve the Haskell ecosystem while protecting client confidentiality. This helps to reduce future maintenance work for the client, by bringing the open-source community to bear on maintaining the generic function over time.
We dramatically improved the program’s speed and scalability by building a framework to distribute the workload efficiently across multiple processes on multiple CPU cores, and then across multiple machines. We included the “middleware” infrastructure supporting these improvements as part of FP Complete's High Performance Computing framework, allowing for easy deployment of the scaled-up device on Amazon’s AWS cloud computing platform.
At the end of this optimization effort we doubled the performance of the device on the 18-core machines that we were using, and enabled us to gain a latency reduction of up to 500% by distributing the work across several machines.
As the resulting measurement showed, the distributed version of the software is already twice as fast on a single 18-core machine, and is then capable of scaling beyond one machine for even faster simulations. For example, the plot shows the time the simulation takes for 2 machines (36 cores) and 4 machines (72 cores), where the simulation time is slightly more than 3 minutes, while the best time we could get using the local version of the software is around 20 minutes. This is not only advantageous for the product, but also a boon to development: programmers testing changes or data scientists tuning parameters can iterate 6 times faster.
Moreover, we integrated this platform to distribute work into a broader high-performance computing (HPC) framework that allows the client to easily and reliably schedule requests and wait for responses. This framework allows multiple clients, even across a remote Web API, to schedule work that is automatically and flexibly distributed across a cluster. The cluster can expand and shrink with no downtime, creating and deleting cloud compute machines to quickly react to changing demands. This HPC framework is architected to be extremely easily to deploy and scale, taking advantage of the AWS computing platform (both in the form of the EC2 and ElastiCache) and Docker containers to maximize reliability and minimize maintenance costs.
Having a direct handle on how the work is distributed let us integrate the strategy to distribute work into the client’s broader strategy to deploy the medical device reliably and scalably. This means that we’re able to use the hardware we are paying for as efficiently as possible, and we’re able to scale seamlessly as the load increases.