Whatever development method you use, eventually your software will need to be deployed to your production environment.
It’s a scenario that occurs in every company with a software development team: the software is declared to be finished and ready to be deployed from development into production. The deployment scripts and installers are ready (if you’re not using installers then that’s a totally different set of problems), and there is an air of tension around the team responsible for the deployment. That air of tension is actually the first serious warning sign and you should take notice of it.
At a time often chosen by someone who doesn’t have to actually be present, so usually a weekend, the deployment process begins.
Somewhere between the start of the process and the scheduled completion time, things will start to go wrong. Maybe the database schema won’t be what the developers thought it was; maybe someone deployed a change in the meantime and that was missed. Maybe the firewall is blocking communications between some of the servers. Maybe the configuration files contain the wrong details and don’t match the production environment.
Once this starts happening your plans for the rest of the day, and maybe the weekend, are gone. What happens next is fairly predictable: hours will be spent trying to make the deployment work, either because nobody wants to admit failure, or because the deployment stopped the existing production system from working.
The more people involved in the deployment, the longer it will take to discover the cause of the problems. People offering what seems like sensible advice will be ignored. Presenteeism, where nobody wants to be the first to leave, will develop rapidly, and despondency will set in along with even more confusion. Before long you’ll start hearing some or all of the following; the more of them you hear, the worse things are going:
- “X is / I am just making some changes to the configuration files.”
- “Can you reach server B from server A?”
- “Can you log in to the database with that account?”
- “Does that account have SELECT/EXECUTE privileges on that database?”
- “Does anyone know the database password?”
- “Where are the logs for this sent?”
- “Can we still roll this back?”
- Frustration, anger and profanity from a team who are usually calm and collected.
- Derogatory language used about the system/developers/test team/deployment team/IT team.
- “We need to call corporate IT.”
- “We need to call the engineering manager/director/VP.”
- “We need to call the supplier.”
By the time everyone goes home, the chances are 50:50 that they leave angry because they had to roll back the deployment, or angry because they had to force it into production; no decent team gets any satisfaction from either outcome.
That explains the tension in the air before the deployment.
In the aftermath of a failed deployment there will usually be a demand for a “root cause analysis” or RCA. Even if the deployment was a struggle but eventually succeeded, at least to some extent, there’s a good chance that you’ll still have to do one. This can be a bad thing if you work for a company with a “blame culture”, as they’ll be looking for a scapegoat. Alternatively it can be a good thing, because you’ll find out how to improve your deployment process and reduce the number of failures.
However, once you get past any hardware failures and any other factors outside the control of the development and deployment teams, you’ll find there is only one reason and it’s not a mystery: lack of preparation. That’s it, and if you were to read through most organizations’ root cause analyses you would find most of them just describe “Insufficient preparation” in a variety of ways.
It’s something of an anticlimax to discover this, because it’s such a dull reason. There are no complexities, no subtleties, nothing. By now you’re probably wondering whether, if the cause is simple, the cure is simple too. It is: most of your problems can be solved with a combination of replication and automation.
If the main cause of deployment struggles and failures is a lack of preparation, then practice is a solution. The best way to practise is to replicate the production environment, and use that for development, testing and deployment practice.
You should aim for a selection of replications of your production system. Ideally developers should develop against a small version or subsystem of production. As the code progresses through the test environment(s), each environment should be more like production than the one before it, until the final environment is a very close replication.
You don’t need, or even want, an exact copy; that would probably be much bigger than you need and would contain a lot of confidential customer and company data. You need to replicate the structure of the environment, and especially the accounts and permissions within it.
All environments change with time; development environments can change numerous times every day and test environments maybe less frequently. With every change they get further away from the production environment. That’s not a bad thing; it’s what they’re for. On the other hand, the more differences between them and the production environment, the less realistic your test deployments will be.
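One way to keep an eye on that drift is to compare a description of each environment against production. The sketch below is only an illustration: it assumes each environment can be summarised as a flat dictionary of settings, and the function name `environment_drift` and the sample keys are hypothetical.

```python
def environment_drift(production, other):
    """Return the settings that differ between production and another environment.

    Both arguments are flat dicts describing an environment (a hypothetical format).
    The result maps each differing key to a (production_value, other_value) pair;
    a key that is missing on one side is reported with None for that side.
    """
    drift = {}
    for key in sorted(set(production) | set(other)):
        prod_value = production.get(key)
        other_value = other.get(key)
        if prod_value != other_value:
            drift[key] = (prod_value, other_value)
    return drift


# Hypothetical descriptions of two environments.
production = {
    "db_schema_version": 42,
    "firewall_rules": "ruleset-7",
    "service_account": "svc_app",
}
test = {
    "db_schema_version": 43,
    "firewall_rules": "ruleset-7",
}

for key, (prod_value, test_value) in environment_drift(production, test).items():
    print(f"{key}: production={prod_value!r} test={test_value!r}")
```

A report like this will not tell you whether a difference matters, but it does make the gap between an environment and production visible before a test deployment relies on it.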
To keep the environments current, you need to keep re-creating them by replicating the production environment. With any reasonably complex system, this will take a lot of work and is prone to error, and so needs to be automated.
The structure of the environments probably won’t change much so most of the automation, at least to begin with, will involve recreating databases and possibly machine images. I’ve seen systems where this is scheduled so that production is backed up and then replicated to the test and development environments, with production data replaced by test data. This takes place overnight, so that when the development and QA staff arrive in the morning, the environments are completely new.
When you set up the replicated environments, you may have to rename some of the machines and paths, unless you can put them in totally separate domains. This is an inconvenience but there are a couple of ways around it.
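A refresh of that kind might be sketched as follows. This is a minimal illustration, not a description of any particular system: SQLite in-memory databases stand in for the real production and test servers, and the `customers` table, the `refresh_test_database` function and the scrub statement are all hypothetical.

```python
import sqlite3


def refresh_test_database(production, test, scrub_statements):
    """Copy production into the test database, then overwrite confidential data."""
    # Connection.backup copies the whole database, schema and rows alike.
    production.backup(test)
    for statement in scrub_statements:
        test.execute(statement)
    test.commit()


# In-memory databases stand in for the real production and test servers.
production = sqlite3.connect(":memory:")
production.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, email TEXT)"
)
production.execute(
    "INSERT INTO customers (name, email) VALUES ('Real Person', 'real@customer.example')"
)
production.commit()

test = sqlite3.connect(":memory:")
refresh_test_database(
    production,
    test,
    scrub_statements=[
        # Replace every customer's details with obviously fake test values.
        "UPDATE customers SET name = 'Test User ' || id, "
        "email = 'user' || id || '@test.invalid'",
    ],
)
print(test.execute("SELECT id, name, email FROM customers").fetchall())
```

The test copy keeps the structure of production, which is the part you need, while the scrub step makes sure no real customer data leaves it.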
If you can locate the machines in their own subnets, you can use separate DNS servers to create identically-named servers in every environment. For example, all of your file servers might be called “fileserver01.local”, even though their IP addresses are different.
This means that you don’t need to worry about changing configuration files as you deploy your code into different environments. On the other hand, you need to keep your environments updated with every hardware change you make.
If you use configuration transforms, you need to update the configurations for each environment. Firstly, use some sort of regular naming format so that it’s simple to work out the name of each server in each environment. Something like:
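A hypothetical scheme, offered only as an illustration of what a regular format could look like: the server's role, a two-digit index, the environment, then a domain. The `server_name` function and the `example.internal` domain are assumptions made for this sketch.

```python
def server_name(role, index, environment, domain="example.internal"):
    """Build a predictable host name such as fileserver01.test.example.internal."""
    return f"{role}{index:02d}.{environment}.{domain}"


# The same role and index give a predictable name in every environment.
for environment in ("dev", "test", "prod"):
    print(server_name("fileserver", 1, environment))
```

With a scheme like this, anyone who knows the production name of a server can work out its name in any other environment without looking it up.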
Use the configuration transform feature in Visual Studio (if you’re using Visual Studio for development and building) or a script or application to update the configuration files for each environment.
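If you go down the script route, the idea can be sketched as a base configuration plus a small set of overrides per environment. This is not how Visual Studio's transforms work internally; it is a plain Python stand-in, and the configuration keys, host names and the `apply_overrides` function are hypothetical.

```python
import copy
import json


def apply_overrides(base, overrides):
    """Return a copy of base with overrides merged in, recursing into nested dicts."""
    result = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = apply_overrides(result[key], value)
        else:
            result[key] = copy.deepcopy(value)
    return result


# Hypothetical base configuration, as used in development.
base_config = {
    "database": {"host": "db01.dev.example.internal", "port": 5432},
    "log_level": "DEBUG",
}

# Only the values that differ are listed for each environment.
environment_overrides = {
    "test": {"database": {"host": "db01.test.example.internal"}},
    "prod": {
        "database": {"host": "db01.prod.example.internal"},
        "log_level": "WARNING",
    },
}

prod_config = apply_overrides(base_config, environment_overrides["prod"])
print(json.dumps(prod_config, indent=2, sort_keys=True))
```

Keeping the overrides small makes the differences between environments easy to review, and the base configuration is never modified in place.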
A variation on this is to use a deployment tool like Ansible, or even MS WebDeploy packages for IIS, where the configuration file lacks any specific values. These are then set in the configuration files when the system is deployed, rather than being built into an installer or deployment package.
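The principle of filling in values at deployment time can be sketched without either tool. The example below uses Python's standard `string.Template` as a stand-in; it does not show Ansible's or Web Deploy's own syntax, and the template contents, setting names and `render_config` function are hypothetical.

```python
from string import Template

# Configuration template kept in the deployment package; it holds no real values.
CONFIG_TEMPLATE = Template(
    "database_host = $database_host\n"
    "file_server = $file_server\n"
)


def render_config(template, settings):
    """Fill every placeholder; raise KeyError if a setting is missing."""
    # substitute(), unlike safe_substitute(), refuses to leave a placeholder unfilled.
    return template.substitute(settings)


# Hypothetical settings for one environment, as they might be read from version control.
test_settings = {
    "database_host": "db01.test.example.internal",
    "file_server": "fileserver01.test.example.internal",
}
print(render_config(CONFIG_TEMPLATE, test_settings))
```

Failing loudly on a missing setting is a deliberate choice here: a deployment that stops because a value is absent is easier to deal with than one that carries on with a half-filled configuration file.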
In both situations the settings for each environment should be taken from a version control system.
Configuration transforms, scripts and so on are just ways of changing the contents of your configuration files. Some might see this and conclude that changing configurations manually is a good thing to do. It isn’t. If you don’t know the configuration of your production environment before you deploy to it, your deployment is not ready. If your configuration is not under version control, then where will you find it when the disk fails on the server?
Configurations should be changed automatically, using one of the methods above or something similar. They should never be changed manually.
Manually changing one configuration file is prone to error; manually changing all of the configuration files in an environment is doomed to failure. Knowing this, and yet still deciding to change the files manually, is irresponsible and unprofessional. It’s the difference between a ten-minute deployment and a totally avoidable four-hour debacle.
Deploying your code and updates to multiple systems will help you find the main causes of failure before you get anywhere near production. If you've replicated the firewall rules, permissions and databases fully then you’ll experience all of the obvious errors and most of the obscure ones in the test environments. If you’ve automated all of the deployments as part of moving towards DevOps, you can also add the post-deployment and QA tests to the process.
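Post-deployment tests added to an automated process can start out very simple. The sketch below assumes the checks are plain functions that return True or False; `can_connect`, `run_checks`, `main` and the host names and ports are all hypothetical and would need replacing with whatever matters in your system.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def run_checks(checks):
    """Run each (name, check) pair and return the names of the checks that failed."""
    failures = []
    for name, check in checks:
        ok = check()
        print(f"{'PASS' if ok else 'FAIL'}  {name}")
        if not ok:
            failures.append(name)
    return failures


def main():
    """Run the post-deployment checks; return 0 if all passed, 1 otherwise."""
    # Hypothetical hosts and ports; replace them with the servers in your environment.
    checks = [
        ("database reachable", lambda: can_connect("db01.test.example.internal", 5432)),
        ("file server reachable", lambda: can_connect("fileserver01.test.example.internal", 445)),
    ]
    return 1 if run_checks(checks) else 0
```

A deployment pipeline could call `main()` as its last step and treat a non-zero result as a failed deployment, which turns questions like “Can you reach server B from server A?” into something answered automatically every time.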
If you want to reduce the deployment failure rate even further, you can create pre-deployment tests. I'll cover these in a later article.