Overhauling our deployment: Part 2
Written by Peter Uittenbroek on 7th August 2020
While preparing to upgrade to Python 3, we decided it would be a great moment to make our deployment more automated. In Part 1 of this series, I talked about the reasons, processes, decisions and the changes that were needed to get Python 3 on our production machines. In this second part, I will explain the final implementation steps that we took, and what the outcomes were.
Merging the foundations of Fabric and Docker
Now that all the foundations were laid, it was time to piece them together into one big block of awesomeness. To recap what we’re using:
- Fabric, to perform all the commands for us on the servers
- Docker and Docker-compose, the latter for easier management of the Docker containers, volumes and environment
Steps in our deployment: prepare and deploy
When doing a deployment, we do this in two stages: prepare and deploy.
Prepare (per server)
- Docker pulls the new Docker image on all the servers (parallel)
- Create a new date-stamped Docker-compose file with the current and new image tags incorporated
- Primary server only: Execute any pre-deploy steps (such as Django migrations) in a migrator service with the new Docker image
- Start up the blue or green portal container (if we configured it for this server)
- Wait for portal container to be healthy
- Change nginx passive upstream to our just deployed container (blue or green)
- Reload nginx
- Notify slack that container color was prepared on server X
At this point, we’re able to reach the newly deployed container because we have two upstreams in nginx, one for live and one for passive. Outside of deployments, these point to the same color, but after a prepare stage, the passive upstream points to the other one. Because each server maintains its own file recording which color is live, we don’t have to keep track of which container is which.
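The live/passive bookkeeping described above can be sketched roughly as follows. This is a minimal illustration, not our actual Fabric code: the state file, the upstream names `portal_live` and `portal_passive`, and the port numbers are all hypothetical.

```python
from pathlib import Path

COLORS = ("blue", "green")


def read_live_color(state_file: Path, default: str = "blue") -> str:
    """Return the color this server has recorded as live."""
    if not state_file.exists():
        return default  # assumption: a fresh server starts on the default color
    color = state_file.read_text().strip()
    if color not in COLORS:
        raise ValueError(f"unexpected color in {state_file}: {color!r}")
    return color


def other_color(color: str) -> str:
    """Return the color that is not currently live."""
    return COLORS[1 - COLORS.index(color)]


def render_upstreams(live: str, passive: str, ports: dict) -> str:
    """Render the two nginx upstream blocks (names and ports are hypothetical)."""
    template = (
        "upstream portal_live {{ server 127.0.0.1:{live_port}; }}\n"
        "upstream portal_passive {{ server 127.0.0.1:{passive_port}; }}\n"
    )
    return template.format(live_port=ports[live], passive_port=ports[passive])
```

During a prepare stage, the passive upstream would be rendered with `other_color(read_live_color(...))`, while the live upstream keeps pointing at the recorded color until the deploy stage.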
Deploy (per server)
Once we’re satisfied with the passive containers, we can finalize the deployment.
- Ensure (passive) portal container is healthy
- Point the Docker-compose symlink at today’s date-stamped file
- Change nginx live upstream to our just deployed container (blue or green)
- Reload nginx
- Update the live color to the color we just put live
- Start any other containers (djobs, cron) with the new image tag
- Notify slack that container color was put live on server X
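The filesystem part of the deploy stage (repointing the symlink and recording the new live color) could look something like the sketch below. The file names `docker-compose.yml` and `live_color` are assumptions made for illustration, and the nginx, Docker and Slack steps are left out.

```python
import os
from pathlib import Path


def point_symlink(link: Path, target: Path) -> None:
    """Repoint `link` at `target` via a temporary link and an atomic rename."""
    tmp = link.with_name(link.name + ".tmp")
    if tmp.is_symlink():
        tmp.unlink()  # clean up a leftover from an interrupted run
    tmp.symlink_to(target)
    os.replace(tmp, link)  # on POSIX the rename swaps the link in one step


def finalize_files(deploy_dir: Path, compose_file: Path, new_color: str) -> None:
    """Filesystem part of the deploy stage; nginx and Docker calls are omitted."""
    point_symlink(deploy_dir / "docker-compose.yml", compose_file)
    (deploy_dir / "live_color").write_text(new_color + "\n")
```

Creating the new link under a temporary name and renaming it over the old one means there is no moment at which the symlink is missing.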
The description above is how we do it now, but for us there was a little more to it. We already had servers running the portal directly, without Docker. For our staging environment we could just rebuild the Salt state with the new portalDocker state. All worked fine there, but for our live servers we couldn’t just flip a switch and be done.
We needed to be able to run our bare metal and Docker setups next to each other. Luckily our Salt states did not conflict too much and we were able to set this up quite easily. We then had to do some nginx and load balancing magic: Docker listened on port X, while bare metal stayed where it was. This ‘traffic splitting’ was crucial, because well, nothing we have comes close to our live environment in terms of traffic, data and (mis)usage.
Our normal traffic requires certain files that are only available on our primary server, whereas our API traffic only needs access to the database and to Redis for session management. Because of this, we were able to offload API traffic to other machines.
Bare metal / Docker (Python 2)
Our first step was testing that the portal running in Docker behaved exactly the same. We started by offloading 0.5% of our API traffic to our Docker container. With over 3000 API requests per minute, that is still 15 requests per minute. No errors? Up the percentage slowly to 5%, 20%, 50% and then 100%.
Now the tricky part was splitting the normal traffic, because this had to happen on the same server. But as explained, luckily this was a matter of ensuring the nginx port and load balancing were all set up correctly. Once that was done, we started the same routine again, offloading 0.5% of the traffic, then 5%, 20% etc.
This first step in getting the portal into Docker went smoothly enough. Subsequently we rebuilt our secondary server with only the portalDocker Salt state instead of also having the ‘classic’ state, after which we had to do a failover to that one and rebuild the other webserver. Of course reality is never that simple, but you catch my drift.
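To get a feel for the numbers behind this ramp-up, here is a small back-of-the-envelope calculation. It uses 3000 requests per minute as a round figure for our API traffic; the function name is just for illustration.

```python
def offloaded_per_minute(requests_per_minute: int, percent: float) -> float:
    """Number of requests per minute that end up on the new environment."""
    return requests_per_minute * percent / 100


# The percentages we stepped through, against roughly 3000 requests per minute.
for percent in (0.5, 5, 20, 50, 100):
    print(f"{percent}% -> {offloaded_per_minute(3000, percent):.0f} requests/minute")
```

Even the smallest step sends a steady trickle of real requests to the new container, which is enough to surface errors without exposing many users to them.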
Docker: Python 2 / Python 3
Now that we were running fully in Docker, our life suddenly became much easier. We could just deploy containers with whatever we needed, wherever we needed. We already had our Docker images with Python 3 in them ready, so we could start offloading 0.5% of our API traffic to that. We had enough servers available, so we just picked one for Python 3.
Quite quickly this resulted in the first real issues, because we had both Python 2 and 3 talking to the same database and, more importantly, the same Redis backend. The way Python 3 saved information to Redis was not backwards compatible, so Python 2 was unable to read that data. This was caused by the pickle protocol version: Python 3 pickled with a newer protocol than Python 2 can read. The same problem showed up in the session serializer of Django, so we had to introduce our own serializer that uses a compatible pickle protocol version.
```python
try:
    # Python 2: use the faster cPickle through Django's bundled six.
    from django.utils.six.moves import cPickle as pickle
except ImportError:
    import pickle


class VoipgridSessionPickleSerializer(object):
    """
    Override to not use highest pickle protocol but always use 2.

    Simple wrapper around pickle to be used in signing.dumps and
    signing.loads.
    """

    def dumps(self, obj):
        # Protocol 2 is the highest protocol that Python 2 can still read.
        return pickle.dumps(obj, 2)

    def loads(self, data):
        return pickle.loads(data)


# In the Django settings:
SESSION_SERIALIZER = 'voipgrid.utils.session.VoipgridSessionPickleSerializer'
```
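To illustrate why pinning the protocol matters, the following standalone snippet (run on Python 3, with a made-up session dictionary) compares the default protocol with protocol 2. A pickle written with protocol 2 or higher starts with a marker byte followed by the protocol number, so the second byte shows which protocol was used.

```python
import pickle

session = {"user_id": 42, "language": "nl"}  # made-up example data

default_payload = pickle.dumps(session)        # Python 3 default protocol
compatible_payload = pickle.dumps(session, 2)  # pinned to protocol 2

# Byte 0 is the PROTO opcode (0x80), byte 1 is the protocol number.
print("default protocol:", default_payload[1])
print("pinned protocol:", compatible_payload[1])

# Both payloads load fine on Python 3; only the pinned one is readable by Python 2.
assert pickle.loads(default_payload) == pickle.loads(compatible_payload)
```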
After these first hurdles, we were able to slowly raise our API offloading to 100%, which enabled us to move on to splitting our normal traffic.
But this, again, had to happen on one server. To make this possible, we added a red/black deployment next to our blue/green one. In a special Git branch we wrote code to deploy using red/black as colors, with its own directory on the server and slightly different upstream names. That enabled us to have two live containers on the same server, each using its own nginx port. On the load balancer we were then able to split normal traffic again, going from 0.5% to 100% on Python 3.
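The requirement here is that the two color pairs share nothing on the server: no colors, directory, upstream names or ports. A rough sketch of that idea is shown below. Every concrete value in it (directories, prefixes, port numbers) is invented for illustration and is not taken from our real configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColorScheme:
    colors: tuple
    directory: str
    upstream_prefix: str
    ports: dict


def conflicts(a: ColorScheme, b: ColorScheme) -> list:
    """List the resources two schemes would fight over on one server."""
    found = []
    if set(a.colors) & set(b.colors):
        found.append("colors")
    if a.directory == b.directory:
        found.append("directory")
    if a.upstream_prefix == b.upstream_prefix:
        found.append("upstream_prefix")
    if set(a.ports.values()) & set(b.ports.values()):
        found.append("ports")
    return found


# Hypothetical values for the two deployments running side by side.
BLUE_GREEN = ColorScheme(("blue", "green"), "/srv/portal", "portal",
                         {"blue": 8001, "green": 8002})
RED_BLACK = ColorScheme(("red", "black"), "/srv/portal-redblack", "portal_rb",
                        {"red": 8011, "black": 8012})

print(conflicts(BLUE_GREEN, RED_BLACK))  # → []
```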
The aftermath of our overhaul
In the end we had to fix up to 8 issues related to Python 3 during or shortly after deploying it to our live server. This was after very extensive testing by up to 5 different people and with 3000 tests giving us ~70% line coverage. Some of these issues were caused by our hybrid situation of Python 2 and 3 (e.g. pickle versions), some were in code where only the ‘happy’ flow had been tested, and some were in code that was simply not tested and only showed up when we tried running certain djobs of ours.
The final result
The end result we have now is that every Git tag and release branch automatically has a Docker image built via a GitLab pipeline. This image can easily be deployed to any of our servers using our Fabric script. The chance of errors during a deploy has dropped significantly, and the remaining errors are now mostly limited to migrations we perform on our database during the deploy process.
One of the most important things we learned is that being able to divert a very small percentage of our traffic to the new environment helped surface issues we were unable to find otherwise, and helped us fix them before we went fully live. It’s a huge step forward for our deploys compared to last year and we’ll keep optimizing our processes. If you have questions about our overhaul or ideas on how we can make things even better, let us know in the comments!