Jon is a Principal Developer at Common Code, with nearly two decades of coding experience. He loves to experiment with technologies – working on soil monitoring systems for his vege patch, and a personal micro-PaaS to make deployment of his random projects easier.
Infrastructure (noun). The basic organizational structures needed for the operation of a society
Our way of life is supported by physical infrastructure – roads, buildings, power supplies. The same is true when creating great software. Most of the time, you’ll interact with software via a device, like playing Candy Crush on your phone. Behind the scenes, that software is supported by infrastructure – systems that manage how data (like your Candy Crush high scores) are stored and moved. It’s hidden, but it’s critical. And like any technology, it’s constantly evolving.
When I first started as a professional software developer, infrastructure was simple. It didn't seem that way at the time, of course, but there wasn't much to it for web applications. We had a server that someone else had created for us. We were given FTP (😱) access, and deployment was nothing more than copying a bunch of PHP files across.
Sure, there was more to it than that, but that is all I knew and all I needed to know. As time went on I was lucky enough to be able to dive deeper into the setup of the server at a manageable pace. I learnt not just how to deploy my PHP code - but how PHP itself was installed and maintained on the server, how the Apache web server was configured, how to manually apply changes to the database schema and so on. I was able to slowly gain more and more of an understanding of how the systems fitted together.
So when the time came to transition from PHP to Python, the infrastructure side of that change was less daunting. Python web projects have a few more moving pieces than PHP projects – all for pretty good reasons, but certainly more complex than the "just copy the files to the server" approach of PHP. I discovered Fabric – a tool that could be used to script the deployment process, providing structure, repeatability and process. It could even be used to automate the setup of a server to some extent.
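To give a sense of what that looked like, here is a minimal sketch of a fabfile in the style of the Fabric 1.x API of that era – the host name, paths and service name are hypothetical:

```python
# fabfile.py – a minimal deployment task; host, paths and service names are hypothetical
from fabric.api import cd, env, run, sudo

env.hosts = ["app.example.com"]


def deploy():
    """Pull the latest code, update dependencies, migrate, then restart."""
    with cd("/srv/myproject"):
        run("git pull origin master")
        run("venv/bin/pip install -r requirements.txt")
        run("venv/bin/python manage.py migrate")
    sudo("systemctl restart myproject")
```

Running `fab deploy` gave the same sequence of steps every time – structure and repeatability, without yet saying much about how the server itself was set up.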
It was around this time that tools to automate the process of server configuration started appearing on my radar. Puppet, SaltStack and Ansible all gained popularity during this period. These tools again focused on the configuration management of individual servers: ensuring the right packages were installed and updated, and that the configuration of the software on those servers was consistent and repeatable. The company I worked for at the time chose to adopt SaltStack, which had the knock-on effect of standardising our approach to server configuration. This allowed us to leverage efforts made on one project across all projects. Others used Ansible – but the lack of a centralised configuration management server in Ansible often led to a more copy-paste approach between projects, and to the stagnation of configuration management on projects that were not under heavy development.
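To give a flavour of what these tools look like, here is a minimal SaltStack state – roughly the kind of thing that gets standardised across projects; the package, file and project names are only illustrative:

```yaml
# /srv/salt/webserver.sls – declare that Apache is installed, configured and running
apache2:
  pkg.installed: []
  service.running:
    - enable: True
    - require:
      - pkg: apache2
    - watch:
      - file: /etc/apache2/sites-enabled/myproject.conf

/etc/apache2/sites-enabled/myproject.conf:
  file.managed:
    - source: salt://webserver/myproject.conf
```

You declare the desired state of the box and the Salt master applies it to every minion – the same description, enforced consistently, rather than a script of steps to run.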
SaltStack was better in some ways because its shared setup meant changes or upgrades made for one project could also affect others – helping to drag them along. There was of course a cost to this – updates in one place often had impacts elsewhere that had to be taken into account.
When projects were self-contained and ran on pet servers, it made a lot of sense to spend significant amounts of time tooling up those projects to configure the servers on which they ran. This brought consistency, and the effort involved was justifiable because the servers were long-term investments, used for the sole purpose of that project and designed to stick around for the foreseeable future, with the odd retirement and upgrade as the project progressed.
Then came IaaS – Infrastructure as a Service. At first, many simply treated it as another virtual machine hosting option, using manual clicks and configuration to spin up servers, combined with those same configuration management tools to manage the content of those servers. But it turns out that IaaS systems are really quite different.
IaaS systems emphasise treating server resources like cattle, not pets. Resources need to be treated as ephemeral. This does not gel so well with configuration management systems. A different approach was needed.
Enter IaC – Infrastructure as Code. IaC defines the infrastructure itself using code, or more commonly by declaring the desired state of the infrastructure and letting the IaC system do the actual provisioning. If infrastructure needs to be considered ephemeral, it follows that you need something to ensure it can always be maintained in the right shape or created fresh. These tools, however, work at a different scale – configuring individual servers was likely not a core capability considered when they were designed.
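As a rough illustration of that shift in scope, a tiny CloudFormation template might declare nothing more than "an instance and a bucket should exist" and leave the provisioning to AWS – the AMI id and names here are placeholders:

```yaml
AWSTemplateFormatVersion: "2010-09-09"
Description: Declare what should exist and let AWS work out how to get there.
Resources:
  AppServer:
    Type: AWS::EC2::Instance
    Properties:
      ImageId: ami-0123456789abcdef0   # placeholder AMI id
      InstanceType: t3.small
  UploadsBucket:
    Type: AWS::S3::Bucket
```

Notice that nothing in there describes how to get the inside of the server into shape – which is precisely the gap the older configuration management tools were built to fill.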
One could even argue that the tooling for IaaS encourages the minimum unit of deployment to be a baked virtual machine image. You don't, or at least shouldn't, actually jump onto a server to install and configure things by hand. Instead you should be building those server images and then using the IaC tooling to deploy them.
If we now throw in the meteoric rise in popularity of containerisation and hyper-scalable clusters, things go a little off the rails.
Managed services like EKS and ECS on AWS, and similar offerings from others, make creating and maintaining a cluster achievable for companies without huge IT departments. However, they come with a lot of complexity. Not only do you need to define your cluster, but also everything that feeds into it: permissions, policies, container registries, DNS, load balancers, storage, and databases (because running stateful workloads on a container orchestration system is still not a well supported option).
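To make that concrete, even a heavily trimmed CloudFormation sketch of an ECS setup ends up declaring several supporting resources alongside the cluster itself – and this illustrative example still leaves out networking, load balancing, storage and databases:

```yaml
Resources:
  Cluster:
    Type: AWS::ECS::Cluster
  Registry:
    Type: AWS::ECR::Repository            # where the container images will live
  TaskExecutionRole:
    Type: AWS::IAM::Role                  # permissions the tasks will run with
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service: ecs-tasks.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy
```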
Thank goodness for IaC, I hear you say! Well, quite – these tools certainly make it possible to keep the complexity under control, in the same way you kept your pet-server complexity under control with Ansible or SaltStack. But I believe IaC also does more than that. It tends to encourage people to blur the lines between infrastructure provisioning and project deployment – which I think is a mistake. IaC increases, by quite a lot, the amount of knowledge someone has to have in order to understand deployment, and in general how things fit together in production.
Take a look at python-docs-samples/polls.yaml for example. This is the deployment configuration for running a very simple Django application on a Google Kubernetes cluster. Even setting aside the need to build a Docker image and publish it to a registry, to understand this YAML file you need to understand containers versus deployments versus services versus load balancers, and how all those different parts fit together. You know what – a lot of devs would just give up at this point, copy-paste someone else's config, change a few names around and cross their fingers. It is certainly a long way from the copy-a-file-to-the-server approach I was able to cut my teeth on.
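For reference, the core shape of that kind of file is roughly the following – a Deployment describing the containers to run and a Service putting a load balancer in front of them; the image reference and names below are placeholders rather than the actual sample:

```yaml
apiVersion: apps/v1
kind: Deployment                 # keeps the desired number of pods running
metadata:
  name: polls
spec:
  replicas: 2
  selector:
    matchLabels:
      app: polls
  template:
    metadata:
      labels:
        app: polls
    spec:
      containers:
        - name: polls
          image: gcr.io/my-project/polls:latest   # placeholder image reference
          ports:
            - containerPort: 8080
---
apiVersion: v1
kind: Service                    # exposes those pods behind a load balancer
metadata:
  name: polls
spec:
  type: LoadBalancer
  selector:
    app: polls
  ports:
    - port: 80
      targetPort: 8080
```

That is four distinct concepts just to say "run this image and put a load balancer in front of it".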
Despite this, I think IaC should instead be seen as an enabler – an opportunity to simplify things again. In the same way that using Docker images allows us to (and to some extent nudges us toward) fully disentangling build (docker build) from release (docker push) and ultimately deployment, IaC helps us make that separation more concrete from a different direction. IaC allows us to provision the place where things are released to (a registry), but it doesn't do the actual release. It also allows us to provision the substrate on which the deployed application runs.
However, we also try to use IaC to perform the deployment itself. To be fair, if your deployment process is a full end-to-end deployment of not just the application code but also a new instance of the entire infrastructure that goes with it, this is fine – it is the approach taken by the immutable-infrastructure crowd.
The Immutable Infrastructure approach makes a lot of sense. Your build step becomes something like using Packer from HashiCorp to make an AMI with your product code baked in. Release could be as simple as pushing that to an S3 bucket. Deployment is running your Terraform or CloudFormation to spin up an entirely new set of infrastructure, where the Packer-created AMIs are used as the base for each server, and then repointing an existing load balancer to the new server group.
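Sketched in CloudFormation terms, the deploy step might amount to something like the following – stand up a fresh group of servers from the newly baked AMI, then swing the existing load balancer over to it; all the ids here are placeholders:

```yaml
Resources:
  AppLaunchTemplate:
    Type: AWS::EC2::LaunchTemplate
    Properties:
      LaunchTemplateData:
        ImageId: ami-0abc1234def567890       # the Packer-built AMI for this release
        InstanceType: t3.small
  AppServerGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      MinSize: "2"
      MaxSize: "6"
      LaunchTemplate:
        LaunchTemplateId: !Ref AppLaunchTemplate
        Version: !GetAtt AppLaunchTemplate.LatestVersionNumber
      VPCZoneIdentifier:
        - subnet-0123456789abcdef0           # placeholder subnet id
```

Each release declares a brand new server group; nothing in the old one is ever modified in place.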
This, in my opinion, is how IaaS is probably supposed to be used - it comes back to the idea that in an IaaS world the unit of deployment is a VM image - not just the code. Servers are just cattle to be culled as needed.
The Immutable Infrastructure approach is fully aligned with treating infrastructure as cattle. But surely that doesn't make sense when combined with Docker images instead of AMIs or other virtual machine images – they need an orchestration system. Right?
Well perhaps it actually does make sense - just at a different level of abstraction.
Virtual machines are, after all, virtual – they run on actual hardware at some point. All AWS is doing when you say you want an EC2 instance based on a particular AMI is allocating the required resources on physical hardware. There are a lot more smarts behind it than that, but the principle still holds: the IaaS provider is acting as an orchestration system.
What you are actually doing when you take the immutable infrastructure approach is declaring what your infrastructure should look like – and letting AWS, Azure or Google Cloud spin it up and keep it running. Is this really any different to asking Docker Swarm, Kubernetes or even just a Docker daemon on a single server to provision the resources needed to run your application? It is also declarative, and the actual execution is up to the orchestrator.
New deployments are similar – you provide a new declaration of what you want now instead. The original resources are not updated with new code and configuration – they are replaced. So we effectively have Immutable Infrastructure again.
We need to be careful not to repeat the mistakes of the past by clinging to tooling that made sense before and attempting to extend it to this new level of abstraction.
Ansible and its ilk are fine tools for the configuration of servers – for declaring what the inside of a server should look like. They are not good tools for provisioning those servers and other resources dynamically over time. For that you need CloudFormation or Terraform.
CloudFormation and Terraform are fine tools for defining infrastructure, and the rules used to scale it. They are great at telling the IaaS orchestrator what you want and in what situations it should change. They are not good at defining what the workload on that infrastructure should look like – for that you have HashiCorp Packer and similar tools.
If you go in the container direction, CloudFormation and Terraform are great at defining the cluster, which is really just an abstraction on top of infrastructure anyway. But they are not so good at defining the containers that run on it. For that we have declarative formats and tools like docker-compose and the object config YAML files used by Kubernetes, along with emerging tools like Helm.
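Those workload-level declarations stay pleasingly small when they only have to describe containers – for example, a minimal docker-compose file along these lines (the image names and settings are illustrative):

```yaml
# docker-compose.yml – declares the workload, not the machines it runs on
version: "3"
services:
  web:
    image: registry.example.com/myapp/web:1.4.2   # illustrative image tag
    ports:
      - "80:8000"
    environment:
      DATABASE_URL: postgres://app@db/app
    depends_on:
      - db
  db:
    image: postgres:15
    volumes:
      - db-data:/var/lib/postgresql/data
volumes:
  db-data:
```

The cluster definition and the workload definition can then evolve at their own pace.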
Each layer of the stack has a different cadence, and different tools have been designed to match the pace of those layers. When tools written for one layer attempt to extend into the next, things get more complex and can hold you back, preventing you from fully embracing those abstractions or approaches.
Each layer still needs to be exercised lest it fall into an unmaintainable state. So whilst I don't think you should be using IaC tooling to deploy application workloads, it should still sit in a CI/CD-type arrangement to ensure it is always kept in a working state.
Code that isn't run festers.
We need to trust that the IaaS providers are constantly exercising their layers; we should be doing the same for the layers we are responsible for.
Ultimately I see infrastructure as effectively just another product – separate from the workloads that run on it, but there to support them – in the same way those workloads are all individual products designed to be separate but complementary, supporting the business capabilities they enable.
When an agency like Common Code is developing something for a customer - we are invariably developing at least two parallel products.
The product to provide the capabilities the customer needs, and the product that supports that capability - the substrate it runs on - the infrastructure. The two are synergistic but should be loosely coupled. If that same customer returns for a second time to build a new product to support a different capability within their organisation - we should be able to re-use and evolve that substrate, not create an entirely separate substrate. Unless of course that is an actual desired feature.
p.s. Yes, I have completely ignored serverless. I felt that the quagmire around what the term actually means is likely to muddy the water a little too much. But in short I see things like AWS Lambda in the same way I see Kubernetes and ECS. It provides the substrate on which I run my workload just with a different set of features and constraints. I wouldn't use CloudFormation or Terraform to deploy Lambda functions either. However, they would have a place in getting all the supporting resources in place. PaaS (Platform as a Service) offerings like Vercel, Firebase and even Heroku probably fit in this category too.