On automating

Recently, I have been automating common tasks: things like deployments, code analysis, infrastructure declarations and so on. I’ve done this a few ways, listed here in roughly chronological order:

  • Bash scripts
  • Ansible
  • Kubernetes
  • Helm

Broadly, I’ve noticed these types of automation fall into a couple of different buckets. To understand the different buckets, I’m going to illustrate them through possibly the most boring analogy known to man: Heating your home.

There are a few different ways to heat your home:

  • Manually build a fire in a fireplace
  • Procedurally run a gas heater
  • Declaratively enable central heating

Manual tasks

This is perhaps the easiest type of automation to understand: none. Simply doing the tasks that must be completed.

Lighting a fire is a painstaking process. You must:

  • Cut, chop and store the wood
  • Build the fire such that it has a good air current
  • Stack kindling in the fire to generate the initial fire
  • Spark the fire
  • Carefully maintain the fire
  • Pray

In terms of time, lighting a fire is perhaps the most inefficient way to heat your home. Sure, you can get super good at lighting fires, but each one must still be built by hand.

In much the same way, managing infrastructure without automation is a painstaking process. You must:

  • Acquire the hardware
  • Install the correct base OS
  • Install the required software
  • Set up the network and discovery mechanisms
  • Pray
  • Carefully manage the software

Absent the satisfaction of a warm, comfortable fire, manual server management quickly turns into a chore more akin to cleaning the bathtub: a thankless task that you do only because it bothers you not to.

Procedural tasks

Procedural automation delegates the manual running of a task to another process, which executes the same (or similar) steps in the same order as before.

The simplest way of thinking about procedural automation in terms of your house is gas heating. It follows similar steps to before; gas is piped in and stored, a spark is created, and fuel is added and burned.

All of these steps are now automated through a single trigger; perhaps pressing and holding a button to spark the flame, and releasing it to ensure a steady gas flow. A dial regulates the amount of heat.

But this must still be managed. A gas heater has no understanding of its impact on the world; it simply continues following instructions until it is given new ones.

In terms of software management, common examples of this include:

  • CI/CD
  • Bash/Python/PHP scripts to complete certain tasks
  • Application / deployment package compilation
  • Ansible? This is a bit tricky.

It takes the complex work that was talked about earlier, and delegates its handling to a separate process. It can even be expanded to handle a range of failure conditions; test configuration before restarting servers, restart servers when updates change system processes, and so forth.

However, like the gas heater, it must still be managed. Failure conditions still appear and must be handled by processes that are out of band; servers must be manually recovered when they’re in a state automation does not expect, or when they fail in such a spectacular way they cannot be restored.

Reconciliation Tasks

Reconciliation tasks delegate the management of the application to an external process, usually called the “control plane”. To use a reconciliation task you only indicate what you think the state of the world should be; it is then that system’s job to make the changes required to bring that state about.

The system that keeps your house at that perfectly comfortable temperature of 23.5 degrees is central heating. It is a lovely example of a system that reconciles desired state with current state.

Broadly, it works as follows:

  • A user will indicate the house should be 23.5 degrees

Separately, the central heating is continually working to keep the correct temperature:

  • Temperature gauges around the house measure the current temperature and report it to the control plane
  • The control plane calculates how much heat to apply, waits a few seconds, and then asks the gauges to recheck the temperature
  • The cycle repeats

This cycle of check, adjust, recheck, readjust keeps the house at the perfect temperature with nearly no effort. In many cases, we forget to think about how the heating works at all — it simply works.
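The check-adjust cycle above can be sketched as a toy control loop. The numbers here (the 0.5 gain, the fixed 0.1-degree heat loss per cycle) are illustrative assumptions, not how any real thermostat is tuned.

```python
# A toy reconciliation loop for central heating: measure the current
# temperature, compare it to the desired temperature, apply some heat,
# and repeat until the house settles near the target.

DESIRED = 23.5

def reconcile(current, desired=DESIRED, gain=0.5):
    """One cycle: decide how much heat to apply this round (never negative)."""
    error = desired - current
    return max(0.0, error * gain)

def run_loop(current, cycles=20):
    """Repeat check-adjust-recheck; the house also loses a little heat."""
    for _ in range(cycles):
        current += reconcile(current)   # heater adds heat toward the target
        current -= 0.1                  # constant heat loss to the outside
    return current

print(round(run_loop(18.0), 1))
```

Notice that nobody tells the loop *how* to get from 18 to 23.5 degrees; each cycle only closes part of the gap between observed and desired state, which is exactly the reconciliation idea.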

In terms of software, control planes include:

  • Kubernetes
  • Mesos
  • Docker Swarm
  • AWS (Netflix Style)

Kubernetes is the control plane I know the best. It is fundamentally driven by the same reconciliation approach that powers central heating.

It works in the same way, as follows:

  • The user indicates they would like three copies of NGINX running
  • The control plane determines that there is only one copy of NGINX running
  • The control plane starts up two more copies of NGINX
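The three steps above reduce to a simple diff between desired and observed state. Here is a minimal sketch of that comparison; the action strings are placeholders for what a real control plane would do (launch or terminate containers).

```python
# A sketch of replica reconciliation: compare the observed number of
# copies of a workload against the desired count, and emit the actions
# needed to close the gap.

def reconcile_replicas(desired, observed):
    """Return the actions needed to move observed state to desired state."""
    actions = []
    if observed < desired:
        actions += ["start nginx"] * (desired - observed)
    elif observed > desired:
        actions += ["stop nginx"] * (observed - desired)
    return actions

print(reconcile_replicas(desired=3, observed=1))
```

Because the loop works from the gap rather than a script of steps, it handles over-provisioning, under-provisioning, and "nothing to do" with the same code.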

This has a number of super nice properties:

When failures occur, they’re automatically handled. Consider that a machine has died. The following will occur:

  1. The reconciliation loop “controller manager” will note that a machine has not checked in for a while. It will mark the machine as unhealthy, and go back to sleep.
  2. The reconciliation loop “scheduler” will note that workloads are on an unhealthy machine, and move them to other machines.
  3. The per-machine reconciliation loop will see that it has been assigned new work, and run it.

Total application failure at 4 in the morning; often, less than 10 seconds later it’s back up. Users don’t even notice.

Because Kubernetes can guarantee some level of operability across lots of different locations, cloud providers and so on, it becomes much easier to package and distribute complex software definitions.

Complex application specifications can be drafted that indicate how the application should be deployed in the abstract, and how it should find the other dependencies it needs — without being concrete about how that should be implemented; instead, that is deferred to Kubernetes.

The Helm project is the perfect example of this. It takes applications such as:

  • MySQL
  • Redis
  • Magento

And packages them. It can even create packages that contain packages, creating a single, simple resource that includes all its dependent applications and is (probably) going to work across a large set of different environments. For example, the Magento application could look as follows:

Magento
|
|----MySQL
|    |----MySQL Prometheus exporter
|
|----Redis
|
|----Varnish Operator
     |----Varnish

So, packages that depend on packages — just like apt or yum. These applications can even give hints as to their hardware requirements; MySQL requires a solid state disk, Redis requires 8GB of memory, and so on.
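The "packages that depend on packages" idea boils down to walking a dependency tree so that dependencies come before the things that need them. This sketch mirrors the hypothetical Magento tree above; the package names and `DEPS` table are illustrative, not real Helm metadata.

```python
# A sketch of dependency resolution for nested packages: walk the tree
# depth-first so each package's dependencies are installed before the
# package itself, and never install the same package twice.

DEPS = {
    "magento": ["mysql", "redis", "varnish-operator"],
    "mysql": ["mysql-prometheus-exporter"],
    "varnish-operator": ["varnish"],
}

def install_order(pkg, seen=None):
    """Return packages in an order where dependencies always come first."""
    seen = set() if seen is None else seen
    order = []
    for dep in DEPS.get(pkg, []):
        if dep not in seen:
            seen.add(dep)
            order += install_order(dep, seen)
    order.append(pkg)
    return order

print(install_order("magento"))
```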

Applications are managed in the context of other applications on the machine, and their own history. Consider an application with a particularly noisy neighbour on its machine that keeps exhausting its memory. The following will occur:

  1. The per-machine reconciliation loop will detect the application has died, and restart it. It notes each restart in the global state.
  2. The scheduler loop will note that the application has been restarted a lot on this machine, and decide that the application is unhealthy. It will reschedule it to another machine.
  3. The current machine will note that the workload is no longer assigned to it, and stop running it. Additionally, the new machine will note the workload has been assigned to it, and start running it.
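The interplay of those two loops can be sketched as follows. The function names, the restart threshold of three, and the flat dictionary standing in for "global state" are all illustrative assumptions, not the Kubernetes API.

```python
# A sketch of noisy-neighbour handling: the per-machine loop restarts a
# crashed app and records each restart in shared state; the scheduler
# loop moves the app elsewhere once restarts on one machine pass a
# threshold. Both loops coordinate only through the shared state.

RESTART_THRESHOLD = 3

def node_loop(state, app, machine):
    """Per-machine loop: restart the dead app and note it in global state."""
    key = (app, machine)
    state[key] = state.get(key, 0) + 1
    return f"restarted {app} on {machine}"

def scheduler_loop(state, app, machine, other_machines):
    """Scheduler loop: move the app if it keeps dying on this machine."""
    if state.get((app, machine), 0) >= RESTART_THRESHOLD:
        return f"moved {app} to {other_machines[0]}"
    return None

state = {}
for _ in range(3):          # the noisy neighbour kills the app three times
    node_loop(state, "shop", "node-a")
print(scheduler_loop(state, "shop", "node-a", ["node-b"]))
```

Neither loop knows about the other; each just reads and writes the shared state, which is what lets the overall system behave sensibly without any central script.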

A more recent trend built on top of these reconciliation systems is to delegate complex application management to another process that runs on top of these systems — essentially, applications managing applications. This is termed the Operator pattern.

Operators allow an infrastructure-as-code approach to managing complex applications. Engineering knowledge about:

  • Scaling
  • Failure
  • Management
  • Security

Can be delegated to a process that runs a cluster of applications. One such example is the Rook operator, which deploys a Ceph cluster. Historically, deploying Ceph was notoriously hard; Rook makes it a (relatively) simple operation.

The grand utopia of reconciliation isn’t one

Of course, there are always aspects of hosting that aren’t possible to make declarative. For example, building an application is an inherently stateful exercise, and that state must be managed on its way to production.

However, finding opportunities to delegate more processes to control loops has saved me a large amount of time and removed a chunk of my stress. I have rm’d all worker machines in a Kubernetes cluster to simulate the worst possible case for a stateful set managed by Kubernetes, and watched as the software came back up 90 seconds later. Machines work faster than people.

Lessons

Moving forward, I am confident that the trend towards declarative infrastructure and systems that manage themselves will make managing software much simpler, and allow packaging and sharing of complex, distributed applications. Kubernetes is my system of choice, given that it’s so widely supported and there is such a large community around it. We’re already seeing it implemented successfully in many organisations, and the packaging of software by systems like Helm is extremely valuable.

It is, however, a new way of thinking about managing our applications. It will take time to adjust.

Appendix

  • Jaime Hannaford
  • Willem de Groot

[1] - https://github.com/jamiehannaford/what-happens-when-k8s