How We Deploy Software

Zachary Friedman
Making Dia

--

People have written amazing articles on deploying software to production. One of the best quotes I’ve heard about deploys in the last two years is from early GitHub engineer Zach Holman:

Your deploys should be as boring, straightforward, and stress-free as possible. Deploying major new features to production should be as easy as starting a flamewar on Hacker News about spaces versus tabs.

I couldn’t state our end goal better than that, so I won’t attempt to redefine what we are ultimately shooting for. It’s exactly this. What I will do is tell the complete story of how and why we recently changed our deployment process at Dia&Co., what the benefits were, and what still stands between our team and boring deploy nirvana.

By way of background, we have two major protected branches on GitHub: master, which corresponds to our staging environment, and production, which, you guessed it, maps to our production environment. We branch features off of master and open pull requests back into master; in those pull requests we perform code review, run our CircleCI build, and run our automated linting. When a pull request is accepted, we merge it into master, and CircleCI kicks off a build that ultimately results in a deployment to staging. The same workflow applies to deploying features to production via the production branch.
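For readers who haven’t set up something like this, here is a minimal sketch of how that branch-to-environment mapping could be expressed in a CircleCI 2.0 workflows config. The Docker image, job names, and deploy script below are placeholders of mine for illustration, not our actual setup:

    # .circleci/config.yml (illustrative sketch only)
    version: 2
    jobs:
      build:
        docker:
          - image: circleci/ruby:2.5     # placeholder image
        steps:
          - checkout
          - run: bundle install
          - run: bundle exec rubocop     # automated linting
          - run: bundle exec rspec       # test suite
      deploy_staging:
        docker:
          - image: circleci/ruby:2.5
        steps:
          - checkout
          - run: ./script/deploy staging      # hypothetical deploy script
      deploy_production:
        docker:
          - image: circleci/ruby:2.5
        steps:
          - checkout
          - run: ./script/deploy production   # hypothetical deploy script
    workflows:
      version: 2
      build_and_deploy:
        jobs:
          - build
          - deploy_staging:
              requires:
                - build
              filters:
                branches:
                  only: master        # merges to master deploy to staging
          - deploy_production:
              requires:
                - build
              filters:
                branches:
                  only: production    # merges to production deploy to production

The point is simply that each protected branch fans out to its own deploy job, so a merge is the only trigger a human has to pull.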

The Exact Moment We Realized We Had a Problem

Once our engineering team reached a certain size, the pipeline of new features to production got a little too hectic. We generally follow the rule of taking one story from our backlog at a time, so our GitHub pull request page would have roughly one open pull request per developer at any given point in time. This is totally manageable with a small set of engineers, and for a while it worked just fine for us.

We noticed that as we added more engineers, we were not deploying software any more frequently. In fact, we were probably deploying a little less often, which is really a business problem, one that we needed to correct in order to maximize the impact of our engineering team.

Refactoring Our Process

We started to think about what was causing our deployment frequency to decline or stagnate, since one would expect that with a product of increasing complexity and more engineers building features, the opposite would occur.

One of our consistent bottlenecks over the past year or so has been QA and UAT. Previously this responsibility fell to the product manager, who, in addition to the full-time job of being a product manager, was also a part-time QA analyst. As with most part-time jobs one attempts to perform alongside a full-time job, it tended to get done pretty poorly, and for understandable reasons. Finding time to perform QA was nearly impossible, so it tended to happen in chunks of time that had more to do with the product manager’s schedule than with our deployment needs. Our product managers did a fantastic job of juggling these priorities, but it was a problem we felt we could pretty easily remove from their day-to-day.

“Real” QA

After a lot of searching, we found a pretty badass QA analyst who wanted to work with our team. I personally had not had much experience working with QA before, so I was interested to see how the role would fit into our existing process, which was driven entirely by automated testing and by manual testing from people who had plenty of other things to do at any given time.

My biggest observation over the past couple of months has been that the job title of QA is a complete misnomer.

When performing at its highest level, a QA team doesn’t just assure quality; it drives quality, both in the product and in the overall software development lifecycle.

So what’s the point in mentioning all of this? Another major bottleneck in our old deployment process was a bit of a game theory problem, ultimately related to answering the question of who deploys, and when.

Who’s On First

Previously, an engineering team member was responsible for creating a deploy, and pull requests were merged into master more or less whenever they were accepted in code review. But not quite: you always had to take into account how much was already sitting on master (staging), because as the size of a deployment increases, so does the risk of breaking something in production.

This created the need for a concept referred to by two words I will never enjoy hearing said together: merge freeze ⛄️. Merge freeze is terrible for a couple of reasons. For starters, I’m not sure exactly how much deploying is the right amount of deploying, but I can assure you, we haven’t even begun to approach that limit. The entire point of the work our team does is to have an impact on our business. Leaving features that will make our customers’ or our employees’ lives better sitting in a GitHub pull request, because we cannot take on the additional risk of including them in the next release, just sounds silly to me. Especially when we can avoid it by dramatically decreasing the risk. And the way you de-risk software deployments is by making them smaller. Like much smaller.

Now that we have a QA owner, there is one primary owner of our staging environment, which means the only team that really needs to merge pull requests into master is the QA team itself. The merge freeze problem is eliminated by eliminating engineer merges to master. Engineers still merge stories we refer to as “engineering acceptance”, that is, non-customer-facing changes. QA decides what to merge and when, thereby controlling exactly what is on our staging environment at any time and tailoring the pace of deploys to match the pace of QA.
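If you want GitHub to enforce that ownership rather than relying on convention, branch protection on master can restrict who is able to merge. As an illustration only (this is not our exact configuration, and the Probot Settings app and the qa team name here are assumptions of mine), a settings-as-code version of that rule might look roughly like this:

    # .github/settings.yml (Probot Settings app): hypothetical sketch,
    # not our actual configuration
    branches:
      - name: master
        protection:
          required_pull_request_reviews:
            required_approving_review_count: 1   # code review still required
          required_status_checks:
            strict: true
            contexts:
              - "ci/circleci: build"             # CI must pass before merging
          enforce_admins: true
          restrictions:
            users: []
            teams:
              - qa                               # only the QA team can merge to master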

So What’s Left?

Many companies have written about how they deploy software to production 50 to 100 times per day. Intercom has a great article detailing how and why they ship over 100 times per day. We certainly want similar benefits in our own deployment process, namely the invaluable feedback loop of deploying smaller changes more frequently. Getting there would require fundamentally changing how we work on many levels, so we are also cautious about blowing up a system of delivering impact that has served us well, and continues to.

One of our core values at Dia&Co. is “take off with a denim parachute”. The story behind the name warrants an explanation that is out of scope for this post (feel free to ask me on Twitter if you are interested 😀), but what it means is basically: lead with a get-it-done attitude, drive exceptional outcomes with fewer resources, and treat creativity as your currency. Applied to deploying software, it means that while continuous deployment has a number of benefits, our current process represents an order-of-magnitude improvement over our previous one, and it is probably quite stage-appropriate for our current needs.

If you are interested in thinking about deploying software, and working with a team that is changing the fashion industry in a permanent way, please reach out to me or apply to join our team on our careers page.
