CI/CD

1. introduction

CI/CD stands for Continuous Integration / Continous (Delivery|Deployment). Roughly when we refer to this, we’re talking about an automated build/test/deployment suite of tools. Virtually all of your big programs you write are useless if they just run on your personal computer. You need to put them somewhere so they can run when your computer is off, or other people need to be able to use the software you made. So we need to deploy the software somewhere, and the software needs to be consumable in a way that doesn’t require your machine’s development setup for those who need to consume said software.

2. The Ultimate CI/CD Setup

2.1. general philosophy
- 2.1.1. small releases
- 2.1.2. feature flags
2.2. automation
2.3. more to come

2.1. general philosophy

You want to automate as much as possible. I’ve worked at software shops where there’s a 6 month delay between the business analyst putting together requirements and seeing those ideas arrive as a software solution in production. That doesn’t include any time spent actually developing the software. That’s just time spent going through various phases of the release process and work scheduling. Software shops will choose this approach as a means of deliberately slowing down the development process. They want to make change painful, and in doing so they hypothetically make it such that fewer errors can spring up. I love pointing at I Prefer This Over That for reading material on the topic.

The problem is that unless you have your engineers sitting on their hands through most of the release process, there’s going to be a swathe of changes that come earmarked with that one feature you wanted, all in the same release. Since the delta is huge, it’s hard to ascertain the root of a problem when it does happen. And it will. The trick to keeping software relatively easy is keeping it simple. Releasing 12 features and 8 bug fixes all at once is not simple. That’s complex.

Even if you did slow it down so you released one thing at a time, you still run into the problem of figuring out what the hell happened 6 months ago. Frequently codebases will look almost foreign when you’re talking about that kind of time delta. And the kicker is you might have found and fixed the bug 1 month after the release was cut. Now what? Do you backport the fix? Well that might have required a bunch of other changes to go with it. What happens most of the time is you just fix it again (which is a different fix since the codebase differs so greatly.

A much better solution is to make it so mistakes are cheap to fix. When you make a fix it’s always against the latest version of your software. You’re current. You don’t need to do archaeological digs in your codebase. You make the fix, you test it, you deploy it, you check that the deployed version works, you move on. You don’t want to just deploy anything though. Software has gotten really good about various levels of verification and checks we can perform to make sure our software does what we think it should do. Those checks can be run as part of that deployment process. Any failure in the process means no deployment. The general approach here is to automate virtually every part of this process.

2.1.1. small releases

Releases should be done frequently and not when people are about to go home for the day. Frequent releases mean small releases. That means if something goes wrong, it’s not painful to roll back. With orgs that do giant releases, they generally do a big release, and then hold a war room to catch the fallout. When you have several applications each with several features and fixes, you’ve introduced so many variables that it’s nearly impossible to narrow down the culprit in a timely fashion when your app is broken in the sacred production environment. Any integration point requires people from multiple teams - I’ve even seen where an integration between two apps pulls in people from 3-4 teams because two teams stand in the middle of the apps (usually operations and security). This is a colossal waste of resources and very damaging to your team’s morale. If someone deploys a new button and the button screws up your page layout, reverting is a no brainer.

2.1.2. feature flags

You can use flags (generally configuration variables) in your codebase to disable or enable certain features. Take the button out entirely and only enable it when you want to. This makes it easy to deploy changes all the way to production because your code when deployed shouldn’t do anything until you actually flip the switch. This is

2.2. automation

2.2.1. tests

Your app might have a suite of unit test, acceptance tests, type checks, etc that can be run. You want to run all of these. On my teams I put in checks for formatting as well.

2.2.2. deployment

Software engineer salaries are pretty high, and costs for hardware or hosting is very low. Having an environment where your software gets automatically deployed on any successful build is essential.

You need environments. An environment is a place where the software lives. This might be a server where your web app gets deployed. It also accounts for other apps that need to live alongside your primary app. Getting environments going needs to be cheap. You should have scripts to define your environment. If your environment setup is just “our server person hammered on it until it worked” then you’re going to have all kinds of variance between your environments that’s subtle and impossible to account for. You will invite disaster in your deployment process. Script it, and have the scripts take variables on what environment the setup occurs.

Let’s cover some environments and how you’d use them:

local: This is your local environment, or what you do to iterate quickly on your machine. I’ve seen some places that don’t actually have a local environment. You just edit in a develop server that might or might not be shared by other engineers. This is bad. Having your engineer be able to work from anywhere is valuable. Connection issues should not hamper your development process. That means you need to be able to run as much as possible locally and test as much as possible locally.
develop: (Or dev) is an environment with a singular purpose: It demonstrates that the software can run when it’s not in the local environment. I recommend develop continues to run with whatever mocks you have. Your CI/CD setup should automatically deploy to this environment once you successfully automatically test and build the software. You can use it as a “here’s what’s current” place as well. I would let QA know about this environment, but create the expectation that there’s a proverbial gun pointed at it. The data that exists there, and any running processes are forfeit no a whim.
integration: This like develop except it interfaces with real external tooling. If your app needs to talk to another app, this is the environment where it actually happens. If you’re in a large org with lots of apps, this is where all of them can meet. Lots of shops like to gate deployment here - meaning some human has to hit a button to make what’s in develop is deployed to integration. This is not a great idea. The problem gating an integration deployment is trying to solve is that if someone else relies on your software working a certain way, then your cutting-edge deployments will break their stuff. Breaking here is a good thing. We need to be able to enable or disable functionality using feature flags, or use versioned API calls to prevent backwards incompatible changes from breaking our consumers. Automated deploys to integration allow us to catch those. The whole point of this environment is that we find out what’s breaking when we connect everything together. If this hampers your consumers’ development process, then have a discussion about how to mock whatever it is they are consuming.
staging: This is QA’s test bed. It should be gated. The reason being is that QA needs to be able to test things without the test subject being ripped out from under them. That said, it’s not an excuse to allow staging to drift far behind of what develop is. The larger the delta in changes, the more painful it will be to fix any problems that got deployed.
production: AKA prod. This is where our end users consume our software. This place is sacred ground and changes made to this environment should be made with extreme caution. Deployment should be gated here, but realistically speaking you could have multiple production deployments per day. Any time QA approves a software change in staging we should do a deployment to production.

2.2.3. release notes

Lots of places require release notes to be present as part of a release. Release notes, or a change log is a list of what’s different about this particular version of the software. For some reason this is one of the easier things to automate that we generally fail to automate. Be lazy. Let the computer do the work for you. It’s better at it than you are anyways.

Here’s an example of what release notes might look like:

My Shiny App v1.1.2

1. Fixed issue #12344 :: Clicking buttons no longer mocks you for clicking them
                        out of order.
2. Fixed issue #15532 :: The app no longer crashes when you try to do useful
                        things in it.
3. New Feature :: You can now add stickers to your reviews even if it adds no
                  inherent value to your reviews. I just know this is what they
                  did in SnapChat and they are worth bazillions okay?

But if you’re a mobile developer you’ll get a free pass:

Bugfixes and performance enhancements.

See how far something like that will go on your résumé.

Let’s just assume that we want to write quality software for a moment and show some intention behind the changes we make. If we must add the ability to hold the thumb button to make the thumbs up icon bigger, at least we’ll know it got in there on purpose.

2.2.3.1. branch naming

In the good example above, those release notes could be built with some tooling. Assuming you’re using git or something similar for your code, you can enforce a naming convention on your branches. Something like <ticket number>/small-decription. Some places like to do bugfix and feature as prefixes to the branch names. I find this just gets in the way. Give me a ticket number and I can find the branch easily without needing to know if it’s a bug fix or feature (and sometime those lines can get really blurred anyways). A real example might be 5342/add-performance-metrics. Notice that there’s no capital letters nor spaces. It’s too easy to be inconsistent with casing and spaces generally have special meaning in just about all of our tools. When your system does its automatic checks against new work, it can verify that the branch name matches the format.

2.2.3.2. commit messages

As an added benefit, adding the ticket number to the commit messages also can help if you need to play CSI on your codebase later. Software engineers are rightfully lazy. Don’t ask them to do this unless you like talking to walls. Instead make a pre-commit hook that adds the ticket number by getting it from the branch name. This makes doing forensics easier because generally you don’t always have branch information when you poke around in the logs. Seeing individual commits with ticket numbers will be a boon, and some history viewers will connect your version control to the ticket system, so those ticket numbers become clickable links to the tickets. The ticket system can also link back to commits and branches because you referred to them. Let the computer work for you!

2.2.3.3. pull requests

The new hotness with merging work is forming a pull request. Your CI/CD software can do an additional check: When the pull request is formed, require that there’s something in there as a release message. The pull request might normally look like this:

Fixes issue [[112356]] by ceasing our blatant addiction to =null=. It's ok to
use an empty list. Seriously.

I also cleaned up the comments in that area because they were filthy lies.

That’s well and good, but that’s not something we really want to show as release notes to Powers That Be or our users. Let’s add a special bit of text we can easily search for:

Fixes issue [[112356]] by ceasing our blatant addiction to =null=. It's ok to
use an empty list. Seriously.

I also cleaned up the comments in that area because they were filthy lies.

RELEASE NOTE: Fix a bug where the app would crash if you forgot to add at least
one student to the test.

2.2.3.4. stringing together the whole process

Now here’s the kicker, you make some tooling that looks at the commit hash for the last release you did, and do a log all the way to the hash of the current release you’re doing. Collect all of the ticket numbers out of those commits, and then look for pull requests that also have those ticket numbers. Ask each pull request for its description and pull out the RELEASE NOTE text. That’s what you stuff into your release notes.

2.2.3.4.1. TODO add code to do the release note gathering

2.3. more to come

I’ll add some more about specific tools (Jenkins, Travis, etc) and some usable examples.