QA Environments v2

Justin Hnatow
Published in Making Dia · Apr 17, 2019


From Under the Ocean by Anouck Boisrobert and Louis Rigaud

Summary (tl;dr)

As described in How We Deploy Software 2.0, quality assurance testing is an essential part of deploying code at Dia. Early on, we took a simple idea to heart: shared environments do not scale, while individual environments do. We added the ability to build individual test environments for QA to our main application. While these test environments provided exceptional parity with our production environment, they took roughly 45 minutes to spin up. As Dia has grown, so has the urge to reduce this time, so we undertook a revision of the test environments with two main goals: maintain parity with production and make the test environments easier to use.

Here we’ll provide a technical overview of the second iteration of individual testing environments, drawing particular attention to the reduction of spin-up time from approximately 45 minutes to approximately 15 minutes.

Overview

Our primary goal with the second iteration of our QA environments was to speed up provisioning. Using instrumentation code to gather timing data on QA environment spin-up, it became patently obvious that the greatest gains to be had were in the data layer. The chart below depicts the output from an instrumented run, showing the cumulative time (in red) and the per-step times (in blue). Restoring the test database from a backup took approximately 18 minutes and re-indexing Elasticsearch took over 8 minutes; the total was over 45 minutes. We have a lot of test data!
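As a rough illustration, the instrumentation amounted to little more than timing each provisioning step and recording both per-step and cumulative durations. The sketch below is a minimal version of that idea; the step names and the provision_step helper are hypothetical stand-ins, not the actual provisioning code.

```ruby
# Minimal sketch of the timing instrumentation (step names are hypothetical).
require "benchmark"

def provision_step(name)
  # Placeholder for the real provisioning work.
  sleep(0.1)
end

STEPS = {
  "create_ec2_instance"   => -> { provision_step(:ec2) },
  "pull_docker_images"    => -> { provision_step(:docker) },
  "restore_database"      => -> { provision_step(:postgres) },
  "reindex_elasticsearch" => -> { provision_step(:elasticsearch) },
  "deploy_application"    => -> { provision_step(:app) },
}

cumulative = 0.0
STEPS.each do |name, step|
  elapsed = Benchmark.realtime { step.call }
  cumulative += elapsed
  puts format("%-24s %8.1fs (cumulative: %8.1fs)", name, elapsed, cumulative)
end
```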

Since we need test data to test, our only options were to run the data-layer operations in parallel or to pre-build the data layer. The major innovation in the second version was pre-building the data layer. This was achieved by periodically creating a machine image in AWS (an AMI) that contains the full data layer: a database with test data, a re-indexed Elasticsearch, and Redis. This AMI is then used as the base for the EC2 instance that runs the test environment. The chart below depicts the current cumulative and per-step times to spin up a QA environment using this process. The total time is just over 15 minutes.

With this single change, the time to create a full test environment went from 45 minutes to 15 minutes.

How Did We Achieve Such Time Savings?

Below is a high-level view of a QA environment. The characteristic that distinguishes a version 1 environment from a version 2 environment is that all of the elements outlined with a dashed line are preloaded into an AMI that is used as the base image for the EC2 instance.

AMI, Postgres, Elasticsearch

In the first version of the QA environment, each component was loaded serially. This involved creating an EC2 instance from an AMI, pulling images from Docker, unpacking the test data archive, indexing Elasticsearch, loading the application code, and adding all the necessary infrastructure components (DNS entries, target groups, entries in the load balancer, etc.).

With an AMI pre-populated with the test data and data-layer Docker images, there are fewer Docker images to pull, there is no data to unpack, and only a few relatively quick steps remain to bring the QA environment into an operational state. This saves a substantial amount of time. The AMI is created periodically using Packer.
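As a sketch of how that periodic build might be driven, the snippet below shells out to `packer build` and reads the resulting AMI ID from Packer's manifest post-processor output. The template name, and the idea of publishing the AMI ID to an SSM parameter for Terraform to read, are assumptions for illustration rather than the exact setup.

```ruby
# Sketch: periodically rebuild the data-layer AMI with Packer.
# The template name and SSM parameter are hypothetical.
require "json"

TEMPLATE = "qa_data_layer.json" # Packer template assumed to bake Postgres data,
                                # Elasticsearch indices, and Redis into an AMI.

# Run the Packer build; the template is assumed to include a `manifest`
# post-processor that writes packer-manifest.json.
system("packer", "build", TEMPLATE) or abort("packer build failed")

manifest = JSON.parse(File.read("packer-manifest.json"))
ami_id   = manifest.fetch("builds").last.fetch("artifact_id").split(":").last

# Publish the new AMI ID somewhere the provisioning code can look it up
# (assumption: an SSM parameter read at QA environment creation time).
system("aws", "ssm", "put-parameter",
       "--name", "/qa/data_layer_ami",
       "--value", ami_id,
       "--type", "String",
       "--overwrite") or abort("failed to publish AMI ID")

puts "Published data-layer AMI #{ami_id}"
```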

Jenkins, Terraform, ECS

Another improvement involved the way QA environments were created. The first version used a Makefile and several make commands to manage the lifecycle of QA environments. Much of the environment bootstrapping was built into these commands, which were simple in their interface but complex in their implementation. There was a desire to pull that logic out of the commands and place it somewhere more suitable.

A suitable location was found to be Jenkins.

A Jenkins script was created to manage the lifecycle of QA environments. The script runs a job that, given a set of parameters, manages a QA environment. During creation, the job runs the requisite steps to assemble the QA environment and deploy it. Building the Docker image used by Rails and Sidekiq requires installing Ruby, pulling the branch to be tested, bundling gems, and pre-compiling assets (among other steps). Deployment is a complex process handled by numerous Terraform scripts that provision AWS resources: a load balancer target group, DNS entries, the EC2 instance itself, an ECS task responsible for running the QA environment, and an ECS service to run that task.
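For illustration, one way to kick off a parameterized job like this from outside Jenkins is a POST to its `buildWithParameters` endpoint. The job name, parameter names, and credentials below are assumptions, not the actual configuration.

```ruby
# Sketch: trigger the QA-environment Jenkins job for a branch.
# Job and parameter names are hypothetical.
require "net/http"
require "uri"

JENKINS_URL   = ENV.fetch("JENKINS_URL") # e.g. https://jenkins.example.com
JENKINS_USER  = ENV.fetch("JENKINS_USER")
JENKINS_TOKEN = ENV.fetch("JENKINS_API_TOKEN")

def trigger_qa_job(branch:, action:)
  uri = URI("#{JENKINS_URL}/job/qa-environment/buildWithParameters")
  uri.query = URI.encode_www_form(BRANCH: branch, ACTION: action)

  request = Net::HTTP::Post.new(uri)
  request.basic_auth(JENKINS_USER, JENKINS_TOKEN)

  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    http.request(request)
  end
  unless response.is_a?(Net::HTTPSuccess) || response.is_a?(Net::HTTPRedirection)
    raise "Jenkins returned #{response.code}"
  end
  response
end

trigger_qa_job(branch: "feature/my-branch", action: "create")
```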

This highlights another substantial, though largely invisible, change: the new environments are managed by AWS Fargate. Version 1 QA environments had no application management; they were simply EC2 instances running Docker containers. If an instance failed to start, shut down unexpectedly, or started to act unpredictably, there was nothing to manage it. Now, QA environments are managed by Fargate, which means there is a layer of infrastructure that ensures the EC2 instance starts and continues to run. It also makes debugging a bit more challenging, but we've overcome that (more on this below).

Automation

The final update was to user interaction. As mentioned above, users previously needed to run make commands to build QA environments. Since we use Github to manage our version control and release process, it was a natural place to interact with QA environments. Now, users can place comments with commands on their pull requests to trigger lifecycle events for QA environments, and the environment's state is reflected via labels on the pull request. QA environments also update automatically whenever new code is pushed to a branch with an open pull request.

Working in Github to spin up a QA environment

These new commands work by using a Github webhook and a small Sinatra application to monitor the webhook events. The label updates are performed by monitoring the ECS service state as it transitions through its lifecycle.
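A minimal sketch of that glue is below, assuming an `issue_comment` webhook, a comment command along the lines of `/qa deploy`, and the octokit and aws-sdk-ecs gems; the repository, cluster, command syntax, and label names are illustrative rather than the actual values.

```ruby
# Sketch: Sinatra app that reacts to PR comment commands and mirrors the
# ECS service state onto the pull request as a label. Names are hypothetical.
require "sinatra"
require "json"
require "octokit"
require "aws-sdk-ecs"

REPO    = "dia/monolith" # hypothetical repository
CLUSTER = "qa"           # hypothetical ECS cluster
github  = Octokit::Client.new(access_token: ENV.fetch("GITHUB_TOKEN"))
ecs     = Aws::ECS::Client.new

post "/webhook" do
  event   = JSON.parse(request.body.read)
  comment = event.dig("comment", "body").to_s
  number  = event.dig("issue", "number")

  if comment.strip.start_with?("/qa deploy") && number
    branch = github.pull_request(REPO, number).head.ref
    # Here the real system kicks off the Jenkins job for this branch.
    puts "Would trigger QA environment build for #{branch}"
    github.add_labels_to_an_issue(REPO, number, ["qa: deploying"])
  end
  status 200
end

# Called periodically (e.g. from a small poller) to reflect ECS state on the PR.
def update_label(github, ecs, pr_number, service_name)
  service = ecs.describe_services(cluster: CLUSTER, services: [service_name]).services.first
  label   = service.running_count.positive? ? "qa: running" : "qa: pending"
  github.add_labels_to_an_issue(REPO, pr_number, [label])
end
```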

Tooling & Debugging

Because of the transition to AWS Fargate, the way we debug QA environments had to change. Two technical additions brought the new QA environments back to roughly the same level of debuggability as before.

First, we began to push all logs into Datadog. Dia recently adopted Datadog as its de facto monitoring solution. It only made sense to redirect all logs from QA environments into Datadog. Now, we have simple filters in Datadog which allow anyone to view any log from any container running in a QA environment. It’s very powerful.

Second, we developed a way to run a Rails console in the same application environment as the one running in the QA environment. This works by pulling the Docker image to the local machine and attaching it to all of the QA environment's external data-layer resources. It is managed by a make command and allows for running scripts, running rake tasks, and general data exploration.
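As a rough sketch of what a command like that might wrap, assuming conventional DATABASE_URL, ELASTICSEARCH_URL, and REDIS_URL environment variables and a hypothetical image name and host:

```ruby
# Sketch: run a local Rails console attached to a QA environment's data layer.
# Image name, host, and environment variable names are hypothetical.
QA_HOST = ENV.fetch("QA_HOST")                     # e.g. the QA instance's hostname
IMAGE   = ENV.fetch("QA_IMAGE", "dia/app:qa-branch") # image built for the branch

exec("docker", "run", "--rm", "-it",
     "-e", "DATABASE_URL=postgres://qa:qa@#{QA_HOST}:5432/app",
     "-e", "ELASTICSEARCH_URL=http://#{QA_HOST}:9200",
     "-e", "REDIS_URL=redis://#{QA_HOST}:6379/0",
     IMAGE,
     "bundle", "exec", "rails", "console")
```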

Conclusion

Ultimately, this update took about three months to complete. The time savings are huge, the interface simple, and the infrastructure resilient. Running a QA environment for a given branch is simple and powerful. The new developer workflow can be summarized in 2 steps:

  1. Open the PR
  2. Tell the bot to spin up a QA environment

Once the QA environment is running, the bot automatically updates it when new code is pushed to the branch and automatically terminates it when the PR is merged.

This simple workflow incentivizes developers to test as well as experiment. There is a slight loss of configurability and visibility, but the overall gains are well worth the losses.

What’s next? As we gradually decompose our monolith into more focused applications and micro-services, we’ll need to expand the QA environment offerings to support these. Additionally, this decomposition will mandate improved integration testing. We’ve got our work cut out for us.
