I thought another tech-related blogpost would be nice for a change, so here it is.
At work we usually try to avoid a monolithic infrastructure and prefer to build a flexible, scalable system. As we host most of our customers' projects and websites on AWS, an important part of this infrastructure is a frontend load balancer and a dynamic autoscaling group.
The typical approach is to have your server infrastructure set up with a configuration management tool like Puppet, Chef or Ansible, so that every server is „built from scratch“ on boot and looks (and works) exactly like every other server with the same role (you should read up on the topics „infrastructure as code“ or „immutable infrastructure“ if you are interested in that approach).
In our case, we don’t manage really large infrastructures, a couple of servers at most, so we don’t actually have a Chef server or something similar to manage them. Instead, we have so far relied on „chef-solo“ and a custom „prepare“ script, which we pre-install on a vanilla Ubuntu image. We then create a bootable server image (aka AMI) on Amazon and use this AMI as the base image for the servers.
On bootup we define the role of the server with Amazon’s own „user data“, a set of information that you can define before startup and that is programmatically accessible from within the instance. With the information provided in the user data, the prepare script fetches the corresponding Chef cookbooks, roles and so on, and the server quite literally installs itself to the desired state.
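To make the idea a bit more concrete, here is a minimal sketch in Python of what such a prepare step could look like. It is not our actual script: the KEY=VALUE user data format, the „role“ key, the default role „base“ and the path /etc/chef/solo.rb are all assumptions for illustration. The only AWS-specific part is the instance metadata URL, which may additionally require a session token if the instance enforces IMDSv2.

```python
import urllib.request

# EC2 instance metadata endpoint for user data. Instances that enforce
# IMDSv2 need a session token first; that step is omitted in this sketch.
USER_DATA_URL = "http://169.254.169.254/latest/user-data"


def fetch_user_data(url: str = USER_DATA_URL) -> str:
    """Read the raw user data from inside a running instance."""
    with urllib.request.urlopen(url, timeout=2) as response:
        return response.read().decode("utf-8")


def parse_user_data(raw: str) -> dict:
    """Parse KEY=VALUE lines (an assumed format) into a dict with lowercase keys."""
    settings = {}
    for line in raw.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip().lower()] = value.strip()
    return settings


def chef_solo_command(settings: dict) -> list:
    """Build the chef-solo call for the role named in the user data."""
    role = settings.get("role", "base")  # "base" is a hypothetical default role
    # /etc/chef/solo.rb is an assumed config path, not taken from our setup.
    return ["chef-solo", "-c", "/etc/chef/solo.rb", "-o", f"role[{role}]"]


# Example with a hard-coded string instead of a real metadata request.
example = parse_user_data("role=webserver\nenv=production\n")
print(chef_solo_command(example))
```

On a real instance you would pass the result of fetch_user_data() to parse_user_data() and hand the resulting command to subprocess.run().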
This procedure works exactly the same way in autoscaling groups. You define your base image and the user data, and whenever the autoscaling group scales up (that is, adds a new server to handle increased traffic) the new server configures itself with chef-solo.
This is a well-tested and practical system with one small disadvantage: the duration of deployment.
On boot the server has to install chef-solo, fetch the corresponding cookbooks and then install the software packages, all of which can take, depending on the complexity of the setup, from a few seconds to a number of minutes in total.
Especially when dealing with heavy load spikes, a couple of minutes can be too long and your site may be slow or even unreachable before the new instance is in service.
To address this (possible) problem, I made use of a tool from the well-known HashiCorp toolset:
Packer is like a provisioning tool for machine images and supports a lot of different tools, builders and provisioners.
In other words, you can take your Chef cookbooks, Ansible playbooks or whatever tool you use to describe your infrastructure, hand them over to Packer, and it then uses this information to build a „ready to go“ machine image for the target service of your choice, be it AWS, GCE, Docker, DigitalOcean and so on (have a look at the Packer docs for the list of all available providers and builders).
As Packer also supports chef-solo as a provisioner, it was quite easy to make use of our already existing configuration management and build a pre-configured AMI. All I needed was a Packer configuration file, which uses the JSON format.
In this „packer.json“ I defined the variables for our AWS „build-target“ like the region, the source-ami (vanilla ubuntu), the ssh username (as Packer works over ssh this is a requirement) as well as a suitable security group and some other basic information.
As „provisioner“ I used a combination of simple shell commands (for updating the base image to the newest patchlevel and to ensure our prepare script is in the right location) and our existing chef-solo cookbooks.
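To give a rough idea of the structure of such a file, here is a small Python sketch that generates a template with one „amazon-ebs“ builder, a shell provisioner and a chef-solo provisioner. It is not our real packer.json: the AMI ID, security group ID, instance type, AMI name, directory names and role name are placeholders I made up for illustration, and the apt-get commands are just one way of bringing Ubuntu up to date.

```python
import json


def build_packer_template(source_ami, region, security_group_id, role):
    """Return a Packer JSON template as a Python dict (all values are examples)."""
    return {
        "builders": [
            {
                "type": "amazon-ebs",
                "region": region,
                "source_ami": source_ami,          # the vanilla Ubuntu image
                "instance_type": "t2.micro",       # assumed instance type
                "ssh_username": "ubuntu",          # Packer connects over SSH
                "security_group_id": security_group_id,
                "ami_name": "webserver-{{timestamp}}",  # hypothetical naming scheme
            }
        ],
        "provisioners": [
            {
                # Bring the base image to the latest patch level first.
                "type": "shell",
                "inline": [
                    "sudo apt-get update",
                    "sudo apt-get -y upgrade",
                ],
            },
            {
                # Reuse the existing cookbooks; directory names are assumptions.
                "type": "chef-solo",
                "cookbook_paths": ["cookbooks"],
                "roles_path": "roles",
                "run_list": [f"role[{role}]"],
            },
        ],
    }


template = build_packer_template(
    source_ami="ami-00000000",         # placeholder, not a real AMI ID
    region="eu-west-1",                # example region
    security_group_id="sg-00000000",   # placeholder, not a real group
    role="webserver",                  # hypothetical role name
)
print(json.dumps(template, indent=2))
```

Writing the printed JSON to a file named packer.json gives you something you can check with „packer validate packer.json“ before attempting a build.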
When everything was set up, all I had to do was run „packer build packer.json“, and it went off and started a new, temporary AWS instance, updated Ubuntu to the latest patch level, uploaded our Chef cookbooks, installed the software defined in the cookbooks, created an Amazon AMI from the installed instance and terminated the instance once the created AMI became available.
I must admit, it took me a few attempts until all variables were set correctly and Packer was able to finish successfully, but as Packer is configured to terminate and delete every artifact it creates after an unsuccessful run, I see no big problem in using the trial and error approach. You won’t end up with dozens of stopped instances.
So after successfully building a pre-installed server, I then reconfigured (after some initial testing, of course) the autoscaling group to use this new AMI instead of the vanilla ubuntu image and triggered an upscaling event to see how it performed.
All in all, I would consider it a success. Instead of over 3 minutes, the initial „chef run“ on bootup finished in 17 seconds and the instance was „in service“ in the load balancer in under a minute.
That may not sound like much, but when it comes to autoscaling I hold the view that „faster is better“ and „every second counts“. So a warmup time of under a minute is pretty good from my point of view.
BTW, you could argue for getting rid of chef-solo in the AMI completely and saving even the 17 seconds or so it takes on boot, but in our case I see advantages in keeping our prepare script and the initial chef run.
With the script in place we can quickly implement new features or change settings in our cookbooks without having to rebuild a new AMI and reconfigure the autoscaling group every time we make a change.
We can rely on Chef to reconfigure the existing AMI with the latest versions of our cookbooks on boot, at the cost of maybe a few seconds. And since we usually patch our base image on a monthly basis, the latest changes in the cookbooks are baked into the newest version of the AMI at that point anyway.
So this is our setup right now.
Using Packer to make use of our already existing and tested Chef cookbooks reduced our „scale up“ to „in service“ time from over 3 minutes to under a minute. And with this shorter warmup time, our chances of successfully handling sudden peaks in load or traffic without downtime for the customers have increased quite a bit.