Moving from Beanstalks to Kubernetes
Sometimes things start with the best of intentions and then get out of control. This was the case for one of my clients, who had chosen to give each developer a dedicated Elastic Beanstalk environment. This was initially extremely effective for the developers, but over time multiple problems arose that resulted in slow deployments and difficult-to-troubleshoot issues.
My client also had a desire to ultimately move to Kubernetes.
So we came up with a plan that would first alleviate the existing pain points, and then move on to containerizing the services, before transitioning them to Kubernetes.
The high level plan looked like this:
- stabilize and reduce friction with existing Elastic Beanstalk environments
- containerize services and change Elastic Beanstalks to run containers
- migrate containers to Kubernetes, beginning with staging and QA
The Beanstalks
Elastic Beanstalk is an AWS service that provides a convenient way to stand up an environment that includes not just compute instances but also data services such as a database or a Redis cache.
My client was using beanstalks not only for developer environments, but also to power the production environment as well as the various staging, QA, and UAT platforms. Differences between production and developer environments often lead to problems, so it's wonderful for developers to be able to test in a highly similar environment. Here, the biggest difference between production and development was essentially scale.
Most of the developer beanstalks came from a couple of different sources, which were modified, cloned, then modified and cloned again over time. Some developers needed full redundancy to test failure scenarios; some beanstalks had much larger instance types, added during performance troubleshooting as a stop-gap measure. By the time I was brought on, the original beanstalks had evolved into many very different flavors.
The first thing I did was to assess the environments by bringing them under management as infrastructure as code (IaC). Terraform and OpenTofu work great for this purpose because they allow defining infrastructure in code and then importing the existing resources from Amazon (or another cloud). This makes it easy to see how the imported resources differ from the defaults we defined in code. I did not spend much time creating the initial "default" configuration; I simply used the default from the documentation. The goal was not to get it right the first time, but to see and understand everything that was out there. Once I had repeated this process for all environments, it became possible to assess all the deployments as a whole and suss out the useful commonalities, as well as the settings that were likely causing issues.
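As a rough sketch of that workflow (the resource names, stack string, and environment ID below are placeholders, not the client's actual configuration), a Terraform import block paired with a minimal "default" resource definition is enough to pull an existing Beanstalk environment under IaC management, after which `terraform plan` reveals every way the real environment diverges:

```hcl
# Minimal default definition; "terraform plan" then shows the drift
# between this and the imported environment's real settings.
resource "aws_elastic_beanstalk_application" "app" {
  name = "example-app" # placeholder application name
}

resource "aws_elastic_beanstalk_environment" "dev_alice" {
  name                = "dev-alice" # placeholder developer environment
  application         = aws_elastic_beanstalk_application.app.name
  solution_stack_name = "64bit Amazon Linux 2 v3.6.0 running Docker" # placeholder stack
}

# Terraform 1.5+/OpenTofu import block; the environment ID comes from the
# AWS console or "aws elasticbeanstalk describe-environments".
import {
  to = aws_elastic_beanstalk_environment.dev_alice
  id = "e-xxxxxxxxxx" # placeholder environment ID
}
```

Repeating this import-and-diff loop per environment is what made the flavor sprawl visible in one place.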
In many cases we identified that the reason for very slow deployments was the serial rolling deployment, which replaced only one instance at a time. That is very useful in a production roll-out, but in a developer environment it's frustratingly slow. Slowing down the process further was the fact that the client used an immutable-infrastructure approach: when a developer deployed code to their development environment, instances were replaced one at a time, and each replacement could take up to 10 minutes. Even with only two instances, this quickly led to a 20-minute wait each time a developer wanted to test their code.
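For illustration, this kind of change is expressed through Beanstalk option settings inside the environment resource. The sketch below (placed inside an `aws_elastic_beanstalk_environment` block; the exact settings the client needed may have differed) switches a developer environment from serial/immutable deployments to all-at-once:

```hcl
# Developer environments: deploy to all instances at once instead of the
# serial rolling/immutable deployments reserved for production.
setting {
  namespace = "aws:elasticbeanstalk:command"
  name      = "DeploymentPolicy"
  value     = "AllAtOnce" # production would keep "Rolling" or "Immutable"
}

setting {
  namespace = "aws:elasticbeanstalk:command"
  name      = "BatchSizeType"
  value     = "Percentage"
}

setting {
  namespace = "aws:elasticbeanstalk:command"
  name      = "BatchSize"
  value     = "100" # replace 100% of instances in a single batch
}
```

Because the settings now live in code, applying the same change to every developer environment is a loop over the same module rather than a pile of console clicks.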
Once we had the developer environments imported into Terraform, we were able to identify the best settings for an average developer environment and roll them out to each of those environments via our brand-new infrastructure as code.
By the time this part of the project was complete, deployments were down to less than 5 minutes on average, and we had shaved off around 40% of the spend on the development environments.
As we were working with the developer Beanstalks, we also monitored the production Beanstalks. They suffered from long, slow deployments, which often failed. These deployments took around an hour, with a considerable amount of time spent watching instances sometimes fail to become healthy.
This led us to identify some additional Beanstalk settings that made production deployments succeed far more consistently, cutting deployment time to less than half of what it had been, without increasing risk.
This was a lengthy process, but it was ultimately made possible by bringing the infrastructure under control as source code. Having the source makes it much easier to reason through the differences and to apply changes consistently across the board.
The containers
It had become clear that the way deployments were being done was the biggest cause of the slowness. Essentially, each instance ran the full build and deploy of the software on launch. Since the build process alone could take in excess of four or five minutes, it greatly extended each deployment: every (redundant) instance in the production beanstalks added that 4-5 minute delay.
So, we moved on to containerizing the services, which would allow us to build the container once and then deploy that instead. Since beanstalks support running containerized workloads, we pushed forward on that front and integrated container builds into the CI/CD pipeline. This was also how we introduced the developers to containers. Their workflow did not change in any meaningful way at this point, and since Beanstalks also use environment variables for configuration, the only change needed was to send logging to STDOUT.
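The build-once idea typically looks something like the multi-stage Dockerfile below. This is a hypothetical sketch assuming a Node.js service (the client's actual stack isn't described here); the point is that CI produces one image artifact, and every environment pulls that same image instead of rebuilding:

```dockerfile
# Hypothetical multi-stage build: compile once in CI, ship only the artifact.
FROM node:20 AS build            # assumed runtime; the real stack may differ
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

FROM node:20-slim
WORKDIR /app
COPY --from=build /app/dist ./dist
COPY --from=build /app/node_modules ./node_modules
# Log to STDOUT/STDERR so Beanstalk (and later Kubernetes) can collect logs.
CMD ["node", "dist/server.js"]
```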
Again, the developer environments were used to iron out any rough edges during the roll-out. It went smoothly overall, and brought deployment times for developers with multiple instances down to about 5 minutes as well.
Once we integrated the containers into the production deploys, each instance on average launched in 2 minutes and a roll-out to production no longer exceeded 15 minutes.
The lesson here is obvious: Build once! Deploy, deploy, deploy, over and over. Additionally, containers give you a tidy artifact to work with, which can be shared easily.
Kubernetes
My client was now enjoying some genuine stability and effective deployments for the first time in a while. Deployments to production were stable, predictable, and quick. Developers had the ability to run reasonably quick build-deploy-test cycles, and they could spin up other developers' container images to help test.
The final transition was toward a full Kubernetes cluster, or technically two. We started with a cluster we called staging, which would hold more than just the staging environment: it was destined to hold all the non-user-facing environments, including the developers'.
This step in the process was relatively straightforward once we had created the Kubernetes cluster in its own VPC and allowed the relevant access to existing resources (i.e. databases, caching, S3, etc.).
Since the containers were configured via environment variables, we only needed to create some YAML manifests to spin each developer's environment up in a dedicated namespace. This allowed us to further empower the developers and grant them control over their deployments, maintaining parity with their Beanstalk experience.
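A per-developer environment of this shape can be sketched as a namespace plus a deployment carrying the same environment variables the Beanstalk used. Everything below (namespace, service name, image URI, variable values) is a placeholder, not the client's actual manifest:

```yaml
# Hypothetical sketch: one namespace per developer, env vars carried over.
apiVersion: v1
kind: Namespace
metadata:
  name: dev-alice                # placeholder developer namespace
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-service          # placeholder service name
  namespace: dev-alice
spec:
  replicas: 1                    # developer environments rarely need redundancy
  selector:
    matchLabels:
      app: example-service
  template:
    metadata:
      labels:
        app: example-service
    spec:
      containers:
        - name: example-service
          image: registry.example.com/example-service:dev-alice  # placeholder image
          env:
            - name: DATABASE_URL          # same variable the Beanstalk used
              value: "postgres://placeholder"
```

Giving each developer ownership of their namespace keeps the self-service feel of the old per-developer Beanstalks.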
Once the organization was comfortable with the new setup, we continued the migration with the user-facing environments.
It is a bit of a dull ending to a decently sized technology overhaul, but that's how you want a big technology transition to go: the end result should be exciting, not the process.
In the end it was a great example of Kaizen, the gradual and continual improvement of a process.