Overhauling our deployment: Part 1
Written by Peter Uittenbroek on 18th June 2020
As a backend developer for our telephony platform VoIPGRID, maintaining the platform and keeping everything up to date is an essential part of what I do. We use a lot of different elements within the platform, and from time to time these things need to be updated in order to keep everything running and secure. In this blog post I will describe our process to update different parts of our deployment process simultaneously and the tools we use for our deployment. This blog post is part of a series, and the second article will outline the implementation part which was the result of this process.
Getting started with the project
In September 2019 we started our ‘Supported software’ milestone which focussed on getting our production portal to Python 3 and our database to MySQL 5.6. With a multidisciplinary team of four colleagues, we started to look into what needed to be done to ensure it was upgraded by the end of the year.
Due to the Python 3 upgrade, we also decided to start using Docker for our deployments, which meant further automation was needed. Automation is something I have experience with from my prior job so I jumped at the chance to incorporate it in our regular development and deployment processes.
Using Fabric tasks for automation
When I started working at Spindle on the VoIPGRID platform at the beginning of 2018, we were deploying our webapp to three servers. By hand. Which meant our deployers logged into each server, executed the needed commands, and moved on to the next server. This process was time consuming, error prone, and no fun at all. On top of that, we were basically replacing parts in a running engine. Safe to say, it was time for some automation.
As a first step, I introduced Fabric to execute the needed commands for deployment on our dev machines. Fabric is a scripting tool that lets us automate the things we used to do by hand. Things really started gaining traction at the end of 2018 and the beginning of 2019, when my colleague Yi Ming started rewriting my Fabric tasks and getting them ready for use in deployments to our staging and live environments.
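To give an idea of what this kind of automation boils down to, here is a minimal sketch in plain Python. It is not our actual Fabric code: the host names and commands are made up, and the `run` callable is injected so the example works without any servers. In a real task, `run` could wrap something like Fabric's `Connection(host).run(command)`.

```python
# Sketch only: host names and commands are hypothetical.
HOSTS = ["web1.example.com", "web2.example.com", "web3.example.com"]
COMMANDS = ["git pull", "pip install -r requirements.txt", "touch reload.flag"]


def deploy(hosts, commands, run):
    """Run every command on every host, in order, stopping at the first failure.

    `run(host, command)` must return True on success. It is passed in so the
    same loop can be driven by a real remote executor or by a fake one.
    """
    done = []
    for host in hosts:
        for command in commands:
            if not run(host, command):
                raise RuntimeError("%s failed on %s" % (command, host))
            done.append((host, command))
    return done


def fake_run(host, command):
    # Stand-in for a real remote call: pretend every command succeeds.
    print("[%s] $ %s" % (host, command))
    return True


log = deploy(HOSTS, COMMANDS, fake_run)
```

The point is that the same steps run in the same order on every server, which is exactly what is hard to guarantee when deploying by hand.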
A step in the right direction
This made our lives a lot easier because we could now use our Fabric tasks on a very frequent basis which quickly led to improvements.
But we were still replacing parts in a running engine; that is to say, we put our ‘new code’ over the old code, quickly did what needed to be done and restarted the processes which needed restarting. This often led to Sentry warnings and possibly a few seconds of downtime. Because our platform gets up to 3500 requests per minute, even being down for a few seconds is simply way too much.
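A quick back-of-the-envelope calculation shows why. The 3500 requests per minute comes from the paragraph above; the three seconds of downtime is only an assumption standing in for "a few seconds".

```python
REQUESTS_PER_MINUTE = 3500  # peak figure mentioned in the text
DOWNTIME_SECONDS = 3        # assumption: "a few seconds" taken as 3

requests_per_second = REQUESTS_PER_MINUTE / 60.0
affected = requests_per_second * DOWNTIME_SECONDS

print("about %.0f requests per second" % requests_per_second)
print("about %.0f requests hit by the restart" % affected)
```

At peak load that is roughly 58 requests per second, so around 175 requests would run into a three-second restart.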
Getting ready for Python 3
Meanwhile, 2019 was well underway, which also meant the End-Of-Life date (1 January 2020) for Python 2 was slowly getting closer. Programming in Python 3 is not that different from programming in Python 2. Some syntax changes here, some changed import paths there, topped off by some output changes. BUT we have over two hundred thousand lines of code, which means there are a lot of potential places where we need to make minor changes to make it all ‘run’ in Python 3.
I say ‘run’, because there is a difference between ‘it works’ and ‘it runs’. With the latter I mean that Python 3 was able to ‘compile’ the code and is happy enough to start our application. That does not in any way imply it actually does what we were expecting it to do. On top of that, our code needed to be compatible with both Python 2 and Python 3 (backwards compatible), because while we were preparing our code base for Python 3, our production was still running on Python 2 and would continue to do so for some time. Luckily for us, there is a brilliant package called six, which helps overcome many differences between Python 2 and Python 3.
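To illustrate the kind of differences a compatibility layer smooths over, here is a small hand-rolled sketch that runs unchanged on both interpreters. It is not taken from our code base and it does not use six itself. The helpers only mimic what six offers as `six.PY2`, `six.text_type` and `six.iteritems`, and the example data is invented.

```python
import sys

# Roughly what six exposes as six.PY2 and six.text_type.
PY2 = sys.version_info[0] == 2
text_type = unicode if PY2 else str  # noqa: F821 (unicode only exists on Python 2)


def iteritems(mapping):
    """Iterate over (key, value) pairs without building a list on Python 2."""
    return mapping.iteritems() if PY2 else iter(mapping.items())


def to_text(value, encoding="utf-8"):
    """Return `value` as text, decoding byte strings on either interpreter."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return text_type(value)


limits = {"requests_per_minute": 3500}
for key, value in iteritems(limits):
    print("%s = %s" % (to_text(key), to_text(value)))
```

In practice you would import these helpers from six rather than write them yourself; the sketch only shows why such a layer makes a gradual migration possible.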
Another piece of the puzzle
Our servers were still running an older version of Debian which did not come with a ‘new enough’ Python 3 installation for us to work with. We were already using Docker for our local development since this allowed all of our developers to have the same working base regardless of the operating system on their computer. This also made it a logical choice to start using Docker on our servers.
But our Docker usage so far was limited: we only used it to ensure we had an image with the system packages installed that we needed to run our application.
We already had our ‘webbase’ Docker image, which we used for local development. This was mostly unchanged, but we made a copy of the Dockerfile using Python 3. So we ended up with this hierarchy for live:
Figuring out the setup
While working on the image, we were also making decisions about what to run, where, and how. This mattered because we had many things running in the background and at regular intervals as well:
- the portal – the actual application serving the website
- cron – commands running at regular intervals to do various jobs
- d(aemon)jobs – commands running continuously in the background, for example:
  - billing calls
  - reading email2fax emails from a mailbox
  - processing and sending email2fax emails
  - processing voicemails
This list is also how we split things up into Docker containers, each with a specific role.
The webdocker was built for release branches and tags, resulting in a ‘3.45.0-rc’ or ‘3.46.0’ image tag. The image contained:
- s6-overlay for process management
- nginx (only started for ‘portal’ role)
- uwsgi (only started for ‘portal’ role)
- cron (optionally started with ‘cron’ flag)
- s6 services for each djob (optionally started, depending on ‘djob’ env list passed to container)
- all our Python dependencies needed for production
- all our gulp/statics compiled
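The selection logic behind "one image, several roles" can be sketched as follows. In the real image this is handled by s6-overlay, so the Python below only illustrates the decision, and the variable names `ROLE`, `CRON` and `DJOBS` as well as the djob names are hypothetical.

```python
def services_for(env):
    """Return the services one container should start for the given environment."""
    services = []
    if env.get("ROLE") == "portal":
        # nginx and uwsgi are only started for the 'portal' role.
        services += ["nginx", "uwsgi"]
    if env.get("CRON") == "1":
        services.append("cron")
    # DJOBS is a comma-separated list; empty entries are ignored.
    djobs = [name.strip() for name in env.get("DJOBS", "").split(",") if name.strip()]
    services += ["djob-" + name for name in djobs]
    return services


print(services_for({"ROLE": "portal"}))
print(services_for({"CRON": "1"}))
print(services_for({"DJOBS": "billing,voicemail"}))
# All roles combined in a single container.
print(services_for({"ROLE": "portal", "CRON": "1", "DJOBS": "billing,voicemail"}))
```

Because every container starts from the same image, the only difference between a portal, cron or djob container is the environment it is started with.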
In theory we can still do everything in one container, if we pass the ‘portal’ role flag and the list of djobs we want to run, and tell it to also start cron. But since we wanted zero downtime and the ability to restart djobs or cron containers without any impact on the portal, we were not done yet. We needed something like blue/green deployment.
Now that we had everything we needed, a new question arose: how do we make the deployment stable, predictable and, most importantly, as easy as possible? The situation at the time was that we had nginx running on our server, serving the static files directly from disk, while all other requests were passed on to uwsgi. The benefit of this is that nginx is fast at serving those static files. But that meant we had to expose our static files from a container to the server via mounts. That is possible of course, but it also meant we would not know which static files were actually being served if we did another deployment. How could we avoid overwriting files or, even worse, serving the wrong files?
To solve this, we went for blue/green deployment. This basically means we have two containers with the incredible names ‘blue_portal’ and ‘green_portal’ that can run side by side without conflicts or interaction. As I mentioned in the Docker setup, we also have nginx in our Docker container, which takes care of serving the static files. Each Docker container exposes its nginx through a socket which is volume mounted in a ‘color specific’ directory on the server. So we have a /var/run/nginx/blue/nginx.sock and a /var/run/nginx/green/nginx.sock.
Now all we had to do was tell the nginx on the web server to route its traffic to either of those sockets and then Bob’s your uncle.
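A minimal sketch of that switch is shown below. The socket paths are the ones from the text; the upstream name `portal` and the idea of rendering the configuration from Python are assumptions made purely for illustration.

```python
COLORS = ("blue", "green")


def socket_path(color):
    """Socket that the container of this color exposes on the host."""
    if color not in COLORS:
        raise ValueError("unknown color: %r" % (color,))
    return "/var/run/nginx/%s/nginx.sock" % color


def idle_color(live_color):
    """The color that is not serving traffic and can take the new release."""
    if live_color not in COLORS:
        raise ValueError("unknown color: %r" % (live_color,))
    return "green" if live_color == "blue" else "blue"


def upstream_config(color):
    """nginx upstream block pointing the host's nginx at one color."""
    return "upstream portal {\n    server unix:%s;\n}\n" % socket_path(color)


live = "blue"
target = idle_color(live)        # the new image would be started as green_portal
print(upstream_config(target))   # then written out, followed by an nginx reload
```

The old container keeps serving until the host's nginx is pointed at the new socket, so there is no moment at which nothing answers requests.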
Finalizing our decisions
For our djobs and cron, we didn’t need blue/green deployment because they don’t need to be highly available: we can live with such a job not running for a few seconds. Moreover, some of our djobs would cause financial problems if more than one instance of them were running at the same time. These containers were therefore rebuilt with the new image in the final step of the deployment. For cron there is a small chance that we restart the container just when a scheduled job would have started, but the jobs that could be impacted are the ones that run hourly.
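The difference with blue/green is the order of operations: stop first, then start. The sketch below shows that with a fake container runtime. The class, the container name `djob_billing` and the way tags are tracked are hypothetical and only serve to make the ordering explicit.

```python
class FakeRuntime(object):
    """Stand-in for a container runtime that refuses duplicate instances."""

    def __init__(self):
        self.running = {}  # container name -> image tag

    def start(self, name, tag):
        if name in self.running:
            raise RuntimeError("%s is already running" % name)
        self.running[name] = tag

    def stop(self, name):
        del self.running[name]


def recreate(runtime, name, new_tag):
    """Stop the old container first, then start the new one.

    This accepts a short gap with nothing running in exchange for never
    having two instances of the same job alive at once.
    """
    if name in runtime.running:
        runtime.stop(name)
    runtime.start(name, new_tag)


runtime = FakeRuntime()
runtime.start("djob_billing", "3.45.0-rc")
recreate(runtime, "djob_billing", "3.46.0")
print(runtime.running)
```

With blue/green the old and new containers overlap on purpose; here the overlap is exactly what has to be avoided.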
Once all of these decisions had been made, it was time to put the process in motion. In this post I talked about the reasons, processes, decisions and the changes that were needed to get Python 3 on our production machines. In my next post, I will explain the final implementation steps that we took, and what the outcomes were.