Devops best practices: Multifaceted Testing.

Posted by Aaron Contorer - 29 November, 2016

Even among skilled enterprise IT departments, it is too rare that software is thoroughly tested before deployment. Failed deployments mean costly downtime, service failures, upset users, and even security breaches. How can we verify that a solution is actually ready to deploy, free of serious defects?

You probably need more kinds of tests

We’ve all seen “ready to deploy” applications that do not work as expected once deployed. Often it’s because the production system is not in fact identical to the staging system, so the testing wasn’t valid. This can be prevented with another devops best practice, automated deployments -- which we talked about in this recent post and will return to again.

But often the problem is that the software, even on a properly configured test system and staging system, was never fully tested. Before an app is approved for deployment, your QA system (mostly automated) should complete:

  • Success tests
  • Failure tests
  • Corner-case tests
  • Randomized tests (also called fuzz tests or mutation-based fuzzing)
  • Load and performance tests
  • Usability tests
  • Security tests

(If you were not using automated, reproducible deployments, you would also have to do explicit pre-tests of the deployment script itself -- but you are doing fully automated deployments, right? If you aren’t, consider moving to the sorts of tools we use in FP Deploy -- like Docker and Kubernetes, Puppet and Ansible.)

In the rest of this article we’ll see how each kind of testing adds something different and important.

Success testing: run with realistic inputs

Most operations teams won’t accept a deployment from the engineering group (dev or test or QA) unless the system has at least passed success testing. When presented with correct inputs, does the system generate correct outputs and not crash?

It’s the most basic testing. Yet it’s often left incomplete. To avoid serious omissions, use at least this checklist:

  • Did you play back a realistic workload, based on a real sample of typical inputs? If not, your test’s idea of what to try may be very different from what your users will actually do on the first day in production. This is a huge source of real-world failures.
  • Did you test through both the UI and the API? Automated testing only through the UI may mask problems that the UI prevents or corrects. Automated testing only through the API can’t find pure UI bugs.
  • Were your tests updated when the spec was updated? For that matter, was the spec even updated when the software’s intended function was updated? If “no” on either of these, you are not actually testing the software’s correct function.
  • Do you have coverage? Has someone gone back through the spec and identified all the promised functionality, and verified that some test actually tests each item -- and that it was run on the latest build, and passed?
  • Did you test in a system that’s configured the same as your production system? Especially if deployments are not automated, it is crucial to ensure that nothing was done in the staging system to make it easier to pass than the production system. Common examples:
    • omitting firewalls that would be present between layers in production
    • running test services under accounts with excessive permissions
    • having a manual security checklist for production deployments that is not used on staging deployments
    • sequencing inputs so that multiple simulated users cannot appear concurrently.

    Passing tests may not mean much if the environment is rigged to ensure success.
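The coverage question in the checklist above can be checked mechanically. Here is a minimal sketch, assuming each test run is recorded with the spec items it claims to cover; the `RunRecord` type, the `SPEC-n` identifiers, and the build names are all hypothetical, not part of any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    """One recorded test run (hypothetical record format)."""
    name: str          # test identifier
    covers: frozenset  # spec item IDs this test claims to exercise
    build: str         # build the test last ran against
    passed: bool       # whether that run passed


def uncovered_spec_items(spec_items, records, latest_build):
    """Return spec items that have no passing test on the latest build."""
    covered = set()
    for record in records:
        if record.passed and record.build == latest_build:
            covered |= record.covers
    return sorted(set(spec_items) - covered)


spec = ["SPEC-1", "SPEC-2", "SPEC-3"]
records = [
    RunRecord("test_login", frozenset({"SPEC-1"}), "build-42", True),
    RunRecord("test_export", frozenset({"SPEC-2"}), "build-41", True),   # stale build
    RunRecord("test_search", frozenset({"SPEC-3"}), "build-42", False),  # failed
]
missing = uncovered_spec_items(spec, records, "build-42")
print(missing)  # prints ['SPEC-2', 'SPEC-3']
```

A test that passed on an older build, or that ran and failed, does not count as coverage here, which matches the checklist's "run on the latest build, and passed" wording.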

Believe it or not, that was the easy part. For enterprise production quality, you still want to test your system six more ways. Jumping right in to number two...

Failure testing: when you break the law, do you go to jail?

Your testing is all under an automated set of continuous integration (CI) scripts, right? (If not, time to look into that.) But do you have tests that force all of the specified error conditions to occur? Do you pass in every identified kind of prohibited/invalid input? Do you also create all the realistic external error conditions, like a network link failure, a timeout, a full disk, low memory (with a tool like Chaos Monkey)?

A test suite that doesn’t check whether specified error conditions actually generate the right errors is not complete. We recommend that the QA team present a report showing a list of all existing tests, what conditions they purport to test, and when they were last run and passed.
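As an illustration of forcing error conditions, here is a small sketch in which the failures (a full disk, a timeout) are injected rather than waited for. The `save_report` function and the `StorageError` type are hypothetical stand-ins for your own code and its specified errors.

```python
import errno


class StorageError(Exception):
    """Hypothetical domain error that the spec says callers should see."""


def save_report(write, data):
    """Write data via the injected `write` callable, translating low-level failures."""
    if not isinstance(data, bytes) or not data:
        raise ValueError("data must be non-empty bytes")
    try:
        write(data)
    except TimeoutError as exc:  # must come first: TimeoutError is an OSError
        raise StorageError("storage timed out") from exc
    except OSError as exc:
        raise StorageError(f"storage failed: {exc.strerror}") from exc


def full_disk(_data):
    """Fake writer that simulates a full disk."""
    raise OSError(errno.ENOSPC, "No space left on device")


def timed_out(_data):
    """Fake writer that simulates a network timeout."""
    raise TimeoutError("simulated network timeout")


def expect_error(exc_type, func, *args):
    """Return the raised exception, or fail if nothing of that type is raised."""
    try:
        func(*args)
    except exc_type as exc:
        return exc
    raise AssertionError(f"expected {exc_type.__name__}, but nothing was raised")


# Each specified error condition is forced and the resulting error is checked.
disk_error = expect_error(StorageError, save_report, full_disk, b"report")
timeout_error = expect_error(StorageError, save_report, timed_out, b"report")
input_error = expect_error(ValueError, save_report, full_disk, b"")
```

Injecting the failing dependency keeps these tests fast and repeatable; a fault-injection tool can then exercise the same conditions against a whole staging environment.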

Corner-case testing: try something crazy

Maybe you’ve heard the joke: a QA engineer walks into a bar, and orders a beer, and 2 beers, and 20 beers, and 0 beers, and a million beers, and -1 beers. And a duck.

Corner-case testing means success testing using unrealistic but legal inputs. Often, developers write code that works correctly in typical cases, but fails in the extremes.

Before deploying, consider: are any of your users going to try anything crazy? What would be the oddest things still permitted, and what happens if you try? Who tested that and verified that the output was correct? Correct code works on all permitted inputs, not just average ones. This is a fast way to find bugs before deployment -- push the system right to the edge of what it should be able to do.

Corner cases vary by application, but here are some typical examples to spark your thinking. Where strings are permitted, what happens if they are in a very differently structured language, like Chinese or Arabic? What happens if they are extremely long? Where numbers are permitted, what happens if they are very large, very small, zero, negative? And why is the permitted range as big as it is; should it be reduced? Is it legal to request output of a billion records, and what happens if I do? Where options are permitted, what if someone chooses all of them, or a bizarre mixture? Can I order a pizza with 30 toppings? Can I prescribe 50 medicines, at 50 bottles each, for a sample patient? What happens if nested or structured inputs are extremely complex? Can I send an email with 100 embeddings, each of which is an email with 100 embeddings?
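A table of ridiculous-but-legal inputs is often enough to get started. The sketch below assumes a hypothetical `order_summary` function with made-up limits (`MAX_TOPPINGS`, `MAX_NAME_LENGTH`); substitute whatever limits your own spec actually permits.

```python
MAX_TOPPINGS = 30          # assumed legal limit, for illustration only
MAX_NAME_LENGTH = 10_000   # assumed legal limit, for illustration only


def order_summary(customer, toppings):
    """Hypothetical function under test: summarize a pizza order."""
    if not customer or len(customer) > MAX_NAME_LENGTH:
        raise ValueError("customer name length out of range")
    if len(toppings) > MAX_TOPPINGS:
        raise ValueError("too many toppings")
    unique = sorted(set(toppings))
    return f"{customer}: pizza with {len(unique)} topping(s)"


# Every entry is unusual but permitted, so every one must succeed.
CORNER_CASES = [
    ("王小明", []),                                           # Chinese name, zero toppings
    ("عائشة", ["olives"]),                                    # right-to-left script
    ("x" * MAX_NAME_LENGTH, ["ham"]),                         # longest legal name
    ("Ann", [f"topping-{i}" for i in range(MAX_TOPPINGS)]),   # every slot used
    ("Ann", ["ham"] * MAX_TOPPINGS),                          # same topping 30 times
]


def run_corner_cases():
    """Run every corner case and return how many were checked."""
    for customer, toppings in CORNER_CASES:
        summary = order_summary(customer, toppings)
        assert summary.startswith(customer + ": ")
    return len(CORNER_CASES)


checked = run_corner_cases()
```

Inputs just past the limits (31 toppings, an empty name) belong in the failure tests instead, since they are not legal.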

If an application hasn’t been tested with ridiculous-but-legal inputs, no one really knows if it’s going to hold up in production.

Randomized testing: never saw that before!

No human team can test every possible combination of cases and actions. And that may be okay, because many projects find more bugs per unit of effort through randomized generation of test cases than any other way.

This means writing scripts that start with well-understood inputs, and then letting them make random, arbitrary changes to these inputs and run again, then change and run again, many thousands of times. Even if it’s not realistic to test the outputs for correctness (because the script may be unable to tell what its crazy inputs were supposed to do), the outputs can be tested for structural validity -- and the system can be watched for not crashing, and not generating any admin alerts or unhandled errors or side effects.
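Here is a minimal sketch of such a script. It assumes a hypothetical `parse_record` function as the system under test; the mutation alphabet and iteration count are arbitrary choices. The script cannot know what each mutated input "should" produce, so it only checks that the parser either raises its specified error or returns structurally valid output.

```python
import random


def parse_record(text):
    """Hypothetical system under test: parse 'k=v;k=v' into a dict."""
    result = {}
    for field in text.split(";"):
        if not field:
            continue
        key, sep, value = field.partition("=")
        if not sep or not key:
            raise ValueError(f"malformed field: {field!r}")
        result[key] = value
    return result


def mutate(text, rng):
    """Apply one random edit: delete, insert, or replace a character."""
    alphabet = "abc=;\x00\n é"
    pos = rng.randrange(len(text) + 1)
    op = rng.choice(("delete", "insert", "replace"))
    if op == "delete":
        return text[:pos] + text[pos + 1:]
    if op == "insert":
        return text[:pos] + rng.choice(alphabet) + text[pos:]
    return text[:pos] + rng.choice(alphabet) + text[pos + 1:]


def fuzz(seed_input, iterations=5000, seed=0):
    """Mutate a known-good input many times; return a list of (input, problem)."""
    rng = random.Random(seed)  # fixed seed so any failure can be reproduced
    failures = []
    for _ in range(iterations):
        text = seed_input
        for _ in range(rng.randint(1, 5)):
            text = mutate(text, rng)
        try:
            out = parse_record(text)
        except ValueError:
            continue  # the specified, handled error: acceptable
        except Exception as exc:  # anything else is an unhandled error
            failures.append((text, repr(exc)))
            continue
        valid = all(
            isinstance(k, str) and k and isinstance(v, str)
            for k, v in out.items()
        )
        if not valid:
            failures.append((text, "structurally invalid output"))
    return failures


failures = fuzz("name=ann;age=30")
```

Seeding the random generator matters: when a run does turn up a failure, the same seed replays the exact input that caused it.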

It’s downright surprising how fast you can find bugs in code written in a typical unsafe language (like Python or C or Java) through this kind of simple input mutation, often called fuzzing. Extremely safe languages like Haskell tend to catch many of these bugs at compile time, but it may still be worth trying some randomized testing. Remember, machine time is cheap; holes in deployed code are very expensive.

Load and performance testing: now that's heavy

Companies with good devops say heavy load is one of the top remaining sources of failure. The app works for a while, but fails when peak user workload hits. Be on the lookout for conditions that could overload your servers, and make sure someone is forcing them to happen on the staging system -- before they happen in production.

Consider whether your test and staging systems are similar enough to your production system. If your production system accepts 5000 requests per second on 10 big machines, and your test system accepts 5 per second on one tiny VM, how will you know about database capacity issues or network problems?

A good practice is to throw enormous, concurrent, simulated load at your test system that (1) exceeds any observed real-world load and (2) includes a wide mix of realistic inputs, perhaps a stream of historic real captured inputs as well as random ones. This reduces the chance that you threw a softball at the system when real users are going to throw a hardball.
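The sketch below shows the shape of such a test at toy scale: many concurrent workers, a mix of replayed and random inputs, and an invariant checked afterwards. The `Inventory` class is a hypothetical in-process stand-in; a real load test would drive your staging system over the network, at far higher volume.

```python
import random
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class Inventory:
    """Hypothetical in-process stand-in for the service under test."""

    def __init__(self, stock):
        self._stock = stock
        self._lock = threading.Lock()

    def reserve(self, quantity):
        """Reserve stock if available; return True on success."""
        if quantity <= 0:
            raise ValueError("quantity must be positive")
        with self._lock:
            if quantity > self._stock:
                return False
            self._stock -= quantity
            return True

    @property
    def stock(self):
        with self._lock:
            return self._stock


def run_load(service, recorded, total_requests=2000, workers=32, seed=0):
    """Send a mixed workload concurrently; return (units reserved, worst latency)."""
    rng = random.Random(seed)
    # Alternate replayed real-world inputs with random ones.
    inputs = [
        rng.choice(recorded) if i % 2 == 0 else rng.randint(1, 5)
        for i in range(total_requests)
    ]

    def one_request(quantity):
        start = time.perf_counter()
        ok = service.reserve(quantity)
        return (quantity if ok else 0), time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(one_request, inputs))
    reserved = sum(units for units, _ in results)
    worst_latency = max(latency for _, latency in results)
    return reserved, worst_latency


inventory = Inventory(stock=1000)
reserved, worst_latency = run_load(inventory, recorded=[1, 2, 3])
```

The useful check after a run like this is an invariant rather than an exact output: here, units reserved plus units remaining must equal the starting stock, no matter how the requests interleaved.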

Performance testing can include sending faster and faster inputs until some hardware resource becomes saturated. (You may enjoy watching system monitor screens as this is happening!) Find the bottleneck -- what resource can you expect to fail first in production? How will you prevent it? Do your deployment scripts specify an abundance of this resource? Have you implemented cloud auto-scaling so that new parallel servers are fired up when the typically scarce resource (CPU, RAM, network link, …) gets too busy?
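The ramp itself is simple to script. This sketch raises the request rate geometrically until latency crosses a limit. The `simulated_latency` probe is only a stand-in (it uses the textbook M/M/1 queueing formula with an assumed capacity of 500 requests per second); in practice the probe would drive real load at the given rate and return a measured latency.

```python
import math


def simulated_latency(rate, capacity=500.0):
    """Stand-in probe: M/M/1 mean time in system, 1 / (capacity - rate) seconds."""
    if rate >= capacity:
        return math.inf  # queue grows without bound once the server is saturated
    return 1.0 / (capacity - rate)


def find_saturation(probe, start_rate=10.0, factor=1.5,
                    latency_limit=0.05, max_rate=1e6):
    """Ramp the rate until latency exceeds the limit.

    Returns (last acceptable rate, first unacceptable rate); the second
    value is None if the limit was never exceeded up to max_rate.
    """
    rate = start_rate
    last_ok = None
    while rate <= max_rate:
        if probe(rate) > latency_limit:
            return last_ok, rate
        last_ok = rate
        rate *= factor
    return last_ok, None


last_ok, first_bad = find_saturation(simulated_latency)
print(f"acceptable up to about {last_ok:.0f} req/s, degraded by {first_bad:.0f} req/s")
```

The interval between the two returned rates is where to look more closely, watching which hardware resource saturates first.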

Usability testing: it doesn't work if people can't use it

Most people consider this to be outside the realm of devops. Who cares if users find your system confusing and hard to use? Well, lots of people, but why should a devops person care?

What will happen to your production environment if a new feature is deployed and suddenly user confusion goes through the roof? Will they think there’s a bug? Will support calls double in an hour, and stay doubled? Will you be forced to do a rollback?

User interface design probably isn’t your job. But if you are deploying user-facing software that has not been usability tested, you’re going to hear about it. Encourage your colleagues to do real testing of their UI on realistic, uninitiated users (not just team members who know what the feature is supposed to do), or at least skeptical test staff who know how to try naive things on purpose, before declaring a new feature ready to deploy.

Security testing: before it's too late

One of the worst things you can do is to cause a major security breach, leading to a loss of trust and exposure of users’ private data.

Testing a major, public-facing, multi-server device (a distributed app) for security is a big topic, and I’d be doing a disservice by trying to summarize it in just a couple of paragraphs. We’ll return in future posts to both the verification and testing side, and the design and implementation side, of security. A best practice is to push quality requirements upstream, letting developers know that security is their concern too, and ensuring that integration-test systems use a secure automated deployment similar or identical to the production system. Don’t let developers say “I assume you’ll secure this later.”

Meanwhile, as a devops best practice, your deployment and operations team should have at least one identified security expert, a person whose job includes knowing about all the latest security test tools and ensuring that they are being used where appropriate. Are you checking for XSS attacks and SQL injection attacks? DDoS attacks? Port-scan attacks? Misconfigured default accounts? It’s easy to neglect security until it’s too late, so make someone responsible.
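To make one of those checks concrete, here is a small sketch of an automated SQL injection test using Python's built-in `sqlite3` module. The table, the two lookup functions, and the payload list are all illustrative assumptions; a real test would target your application's own query paths and use a much larger payload set.

```python
import sqlite3


def make_db():
    """Create a throwaway in-memory database with two users."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [("alice", "a-secret"), ("bob", "b-secret")],
    )
    return conn


def lookup_unsafe(conn, name):
    """Vulnerable: builds the SQL text directly from user input."""
    query = f"SELECT secret FROM users WHERE name = '{name}'"
    return conn.execute(query).fetchall()


def lookup_safe(conn, name):
    """Parameterized: the input is passed as data, never parsed as SQL."""
    return conn.execute(
        "SELECT secret FROM users WHERE name = ?", (name,)
    ).fetchall()


# No user has any of these names, so a correct lookup returns no rows.
INJECTION_PAYLOADS = [
    "' OR '1'='1",
    "alice' --",
    "x' UNION SELECT secret FROM users --",
]


def leaks(lookup, conn):
    """Return the payloads for which the lookup returned any rows."""
    leaked = []
    for payload in INJECTION_PAYLOADS:
        try:
            rows = lookup(conn, payload)
        except sqlite3.Error:
            continue  # a rejected query leaks nothing
        if rows:
            leaked.append(payload)
    return leaked


conn = make_db()
safe_leaks = leaks(lookup_safe, conn)
unsafe_leaks = leaks(lookup_unsafe, conn)
```

Running the same payloads against both versions shows what the test is for: the parameterized lookup returns nothing, while the string-built one hands back other users' secrets.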

Security holes can appear in application code, or in the platform itself (operating system, library packages, and middleware). At a minimum, security testing should include running standard off-the-shelf automated scanning software that looks for known ways to intrude, most often taking advantage of poor default configurations, or of platform components that have not been upgraded to the latest patch level. Run an automated security scan before moving a substantially changed system from staging into production. Automated testing is almost free, in stark contrast to the costly manual clean-up after a breach.

Conclusions

Wow, that’s a lot of testing! More than a lot of companies actually do. Yet it’s quite hard to look back through that list and find something that’s okay to omit. Clearly the devops pipeline doesn’t begin at the moment of deployment, but sooner, in the development team itself. That means developers and testers taking responsibility for delivering a quality product that’s actually ready to deploy.

As usual, systems thinking wins. A good operations team doesn’t say “give us what you’ve got, and we’ll somehow get it online.” A good operations team says “we offer deployment and operation services, and here’s how you can give us software that will deploy successfully into our production environment.”

We hope you’ll keep reading our blog and making use of what we have learned. Thanks for spending time with FP Complete.
