Just by reading the blog post title you are likely to guess the problem at hand, but to be fair I will recap it anyway.
What's the problem?
It is increasingly common for software projects to have Continuous Integration (CI) set up, regardless of their size or complexity. As a matter of fact, it is so common that it is considered bad practice not to have your project at least compiled on every push to the repository. It's even better when a project is also tested, with the possibility of doing an automated release as an extra step, which means uploading a new version to the package index, pushing
This drawback of long builds can have a significant impact on the overall speed of the development process. Let's consider a large project that takes an hour to compile and run the test suite on decent hardware. It is not uncommon to have dozens of developers working daily on such a project, each of whom might introduce a change or two, or even more, in a single day, triggering a CI build each time. Moreover, there is usually a matrix of jobs for each build that does compilation and testing on different operating systems, with different compiler versions and build tools. For example, a common Haskell project will have a Travis CI build set up to compile and test the project on Linux as well as Mac OS, while using 4 to 6 different GHC versions and multiple build tools, such as
nix. A CI build with such a matrix easily results in dozens of jobs on its own. Naturally, all of those jobs can be executed in parallel on as many Virtual Machines (VMs) as you have at your disposal, so you don't have to wait hundreds of hours for a CI build to finish, but even with an unlimited number of VMs a developer will still end up waiting at least an hour for feedback on a successful build.
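To make the scale concrete, here is a back-of-the-envelope calculation. The counts are illustrative assumptions picked from the ranges mentioned above, not measurements from a real project.

```shell
#!/bin/sh
# Illustrative numbers only: every count below is an assumption.
os_count=2        # Linux and Mac OS
ghc_versions=5    # somewhere in the 4 to 6 range
build_tools=3     # assumed number of build tools in the matrix
hours_per_job=1   # the hour-long build from the example above

# Each combination of OS, compiler version and build tool is one job.
jobs=$((os_count * ghc_versions * build_tools))
# Total compute spent on one CI build if every job starts from scratch.
machine_hours=$((jobs * hours_per_job))

echo "jobs per build: $jobs"
echo "machine-hours per build: $machine_hours"
echo "minimum wait with unlimited VMs: $hours_per_job hour(s)"
```

Parallelism reduces the waiting time to the length of the slowest job, but it does nothing to reduce the total amount of work being repeated.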
If you think recompiling a project from scratch every time there is a slight change is absurd, you are absolutely right. What is the point of wasting all of that electricity and developers' time just to create practically the same set of files, especially since all decent build tools are fully capable of handling incremental changes? I am almost positive that all CI providers can cache files from one build so that they are available to the subsequent one. This simple idea solves the problem above in the majority of situations. An example solution for a Haskell project would be adding these few lines to the
.travis.yml file in the project repository:
    cache:
      directories:
      - $HOME/.ghc
      - $HOME/.cabal
      - $HOME/.stack
      - $TRAVIS_BUILD_DIR/.stack-work
In a perfect world this solution would be universal and complete and this blog post would end now, but, unfortunately, that is not the case. CI providers are not created equal, and their caching capabilities and limitations vary drastically, which can pose real problems for some projects. Just to list a few:
- AppVeyor limits the cache size to 1 GB for free accounts and 20 GB for paid ones,
- it also shares the same cache between builds for different repository branches, which can easily lead to cache corruption.
- Travis handles cache sharing between builds for different branches properly, namely it makes sure a build for one branch does not interfere with the cache for another branch's build, while also using the cache created for the master branch whenever there is an initial build for a fresh branch. The problem is that Travis makes the cache available even to Pull Requests (PRs) from forked repositories, consequently making it publicly readable.
- You can't set cache paths dynamically, e.g. when the locations of files depend on the output of the build script itself. One way to solve this is to move the files to known locations that will be cached, but then you have to restore them during the next build and remember to properly handle access/modification times, permissions and ownership.
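The move-and-restore workaround from the last item can be sketched as follows. This is a minimal illustration, assuming a POSIX shell with `tar` available; the directory names `build-output` and `ci-cache` are hypothetical stand-ins for a path that is only known at build time and a path the CI provider is configured to cache.

```shell
#!/bin/sh
set -eu

# Hypothetical paths: a build output whose location is only known at run time,
# and a fixed directory that the CI provider is configured to cache.
work=$(mktemp -d)
dynamic_dir="$work/build-output"
cached_dir="$work/ci-cache"
mkdir -p "$dynamic_dir" "$cached_dir"

# Create a sample artifact with a known mode and an old modification time.
echo "compiled object" > "$dynamic_dir/module.o"
chmod 640 "$dynamic_dir/module.o"
touch -t 201801010000 "$dynamic_dir/module.o"

# End of the build: pack the files into the cached location.
# tar records the mode and modification time alongside the content.
tar -cf "$cached_dir/build-output.tar" -C "$work" build-output

# Next build: simulate a fresh checkout by moving the original aside,
# then unpack with -p so the recorded permissions are restored as well.
mv "$dynamic_dir" "$work/original"
tar -xpf "$cached_dir/build-output.tar" -C "$work"

ls -l "$dynamic_dir/module.o"
```

Going through an archive, rather than copying files one by one, keeps the metadata handling in one place. Ownership is usually only restored when extracting as root, so that part still needs attention on CI machines that run builds as a different user.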
Regardless of the
- upon a successful build, select the files and folders you need to cache and create an archive
- check if there is already a cache on S3 for that branch
- if there is none or the content has changed, upload created
- compute the cryptographic hash and attach it to the uploaded object, so it can be used later for validating consistency and detecting a change in
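The save steps above can be sketched in a few lines of shell. This is not how cache-s3 is implemented; it is a simplified illustration in which a local directory stands in for the S3 bucket, the hash is stored in a sidecar file instead of object metadata, and the function names `hash_file` and `save_cache` are hypothetical.

```shell
#!/bin/sh
set -eu

# Print the SHA-256 of a file, using whichever common tool is installed.
hash_file() {
  if command -v sha256sum >/dev/null 2>&1; then
    sha256sum "$1" | cut -d' ' -f1
  else
    shasum -a 256 "$1" | cut -d' ' -f1
  fi
}

# save_cache BUCKET_DIR BRANCH ROOT PATH...
# BUCKET_DIR is a local directory standing in for an S3 bucket (assumption).
# Prints "uploaded" when a new archive was stored, "unchanged" otherwise.
save_cache() {
  bucket=$1; branch=$2; root=$3; shift 3
  archive=$(mktemp)
  # Create an archive of the selected files and folders.
  tar -cf "$archive" -C "$root" "$@"
  # Compute the hash that is stored next to the archive.
  new_hash=$(hash_file "$archive")
  # Check whether a cache already exists for this branch.
  old_hash=""
  if [ -f "$bucket/$branch.sha256" ]; then
    old_hash=$(cat "$bucket/$branch.sha256")
  fi
  if [ "$new_hash" = "$old_hash" ]; then
    # Same content as last time: skip the upload.
    rm -f "$archive"
    echo "unchanged"
  else
    # No cache yet, or the content has changed: store archive and hash.
    mkdir -p "$(dirname "$bucket/$branch.tar")"
    mv "$archive" "$bucket/$branch.tar"
    echo "$new_hash" > "$bucket/$branch.sha256"
    echo "uploaded"
  fi
}

# Usage: cache a .build directory twice for the master branch.
demo=$(mktemp -d)
mkdir -p "$demo/project/.build"
echo "object code" > "$demo/project/.build/main.o"
first=$(save_cache "$demo/bucket" master "$demo/project" .build)
second=$(save_cache "$demo/bucket" master "$demo/project" .build)
echo "first save: $first, second save: $second"
```

The sketch uses an uncompressed archive on purpose: some compressors embed a timestamp in their output, which would make the hash differ between runs even when the cached files are identical.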
During the restore step we do as follows:
- Check if there is an archive available on S3 from a previous build for the current branch; in case there is none, fall back onto the cache stored for a base branch such as
- Download the archive and verify that its content is consistent by checking the cryptographic hash
- Restore the files from the downloaded archive into their original locations.
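The restore steps can be sketched the same way. Again, this is a simplified illustration rather than the actual cache-s3 logic: a local directory stands in for the S3 bucket, the hash lives in a sidecar file, files are unpacked relative to a given root instead of their original absolute locations, and the names `hash_file` and `restore_cache` are hypothetical.

```shell
#!/bin/sh
set -eu

# Print the SHA-256 of a file, using whichever common tool is installed.
hash_file() {
  if command -v sha256sum >/dev/null 2>&1; then
    sha256sum "$1" | cut -d' ' -f1
  else
    shasum -a 256 "$1" | cut -d' ' -f1
  fi
}

# restore_cache BUCKET_DIR BRANCH BASE_BRANCH ROOT
# BUCKET_DIR is a local directory standing in for an S3 bucket (assumption).
restore_cache() {
  bucket=$1; branch=$2; base=$3; root=$4
  # Prefer the cache of the current branch, else fall back to the base branch.
  source_branch=""
  for candidate in "$branch" "$base"; do
    if [ -f "$bucket/$candidate.tar" ] && [ -f "$bucket/$candidate.sha256" ]; then
      source_branch=$candidate
      break
    fi
  done
  if [ -z "$source_branch" ]; then
    echo "no cache available"
    return 0
  fi
  # Validate the archive against the hash stored at save time.
  expected=$(cat "$bucket/$source_branch.sha256")
  actual=$(hash_file "$bucket/$source_branch.tar")
  if [ "$expected" != "$actual" ]; then
    echo "corrupt cache for $source_branch"
    return 0
  fi
  # Unpack the files, keeping the recorded permissions.
  mkdir -p "$root"
  tar -xpf "$bucket/$source_branch.tar" -C "$root"
  echo "restored from $source_branch"
}

# Usage: only master has a cache, so a build of another branch falls back to it.
demo=$(mktemp -d)
mkdir -p "$demo/src/.build" "$demo/bucket"
echo "object code" > "$demo/src/.build/main.o"
tar -cf "$demo/bucket/master.tar" -C "$demo/src" .build
hash_file "$demo/bucket/master.tar" > "$demo/bucket/master.sha256"

result=$(restore_cache "$demo/bucket" my-feature master "$demo/checkout")
echo "$result"

# A tampered archive no longer matches its hash and is rejected.
echo "garbage" >> "$demo/bucket/master.tar"
tampered=$(restore_cache "$demo/bucket" my-feature master "$demo/checkout2")
echo "$tampered"
```

A missing or corrupt cache is treated as a cache miss rather than a build failure, since the build can always proceed from scratch.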
I would not be surprised to see some of those steps implemented with bash and PowerShell scripts out there in the wild that use common tools
$ cache-s3 save -p $PROJECT_PATH/.build
And to restore files to their original place at the beginning of a CI build:
$ cache-s3 restore --base-branch=master
The AWS credentials and S3 bucket name are read from the environment, as is commonly done in CI. I encourage you to read the cache-s3/README for more details on how to get everything set up and on other available options before using the tool.
directory, where GHC will live along with all of the project's dependencies from a specified snapshot, e.g. ghc-8.2.2 and lts-10.3 respectively. This set of files doesn't change very often during the lifetime of a project, so it would be wasteful to cache them during builds for every branch. It makes more sense to reuse the global stack folder that is cached for a base branch, such as
, as a read-only cache for all other branches.
- local to a
directory. For a more complicated project with many nested packages, there might be more than one
, usually one per package.
In order for
$ cache-s3 save stack
Running the above command inside the stack project,
if [ "$TRAVIS_BRANCH" = master ]; then ...
$ cache-s3 save stack work
As you might suspect, the above will perform the caching of all of
Going in reverse is just as easy:
$ cache-s3 restore stack --base-branch=master
$ cache-s3 restore stack work --base-branch=master
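Putting the save and restore commands together, a CI script could look roughly like the sketch below. It only uses the cache-s3 invocations shown in this post. The `run` helper is a hypothetical dry-run wrapper that prints each command instead of executing it, so the snippet runs without cache-s3 or AWS credentials. The `stack build --test` step and the default of `master` when `TRAVIS_BRANCH` is unset are assumptions as well.

```shell
#!/bin/sh
set -eu

# Hypothetical dry-run helper: prints each command instead of executing it.
# Replace the body with "$@" to run the commands for real.
run() { echo "+ $*"; }

ci_build() {
  branch=$1
  # Beginning of the build: restore both caches, falling back to master.
  run cache-s3 restore stack --base-branch=master
  run cache-s3 restore stack work --base-branch=master

  # The actual build and test step (assumed here).
  run stack build --test

  # End of the build: the global stack folder is only saved for the base
  # branch, while the per-project work directories are saved for every branch.
  if [ "$branch" = master ]; then
    run cache-s3 save stack
  fi
  run cache-s3 save stack work
}

ci_build "${TRAVIS_BRANCH:-master}"
```

With this layout, builds for other branches read the global stack cache produced by master but never overwrite it, which matches the read-only reuse described earlier.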
If you are thinking about using S3 in
If you find any of those steps even a little overwhelming, feel free to get in touch with our representative and we will be happy to either set up the CI environment for you or schedule some training sessions with your engineers on how to use Terraform and the other tools necessary to get the job done.
cache-s3 will find other places it can be
cache-s3 is pretty customizable, so try running it
--git-branch for overriding inferred branch name,