David Chung

InfraKit and Docker Swarm Mode: A Fault-Tolerant and Self-Healing Cluster

Back in October 2016, Docker released InfraKit, an open source toolkit for creating and managing declarative, self-healing infrastructure. This is the second in a two-part series that dives more deeply into the internals of InfraKit.

Introduction

In the first installment of this two-part series about the internals of InfraKit, we presented InfraKit’s design, architecture, and approach to high availability. We also discussed how it can be combined with other systems to give distributed computing clusters self-healing and self-managing properties. In this installment, we present an example of leveraging Docker Engine in Swarm Mode to achieve high availability for InfraKit, which in turn enhances the Docker Swarm cluster by making it self-healing.

Docker Swarm Mode and InfraKit

One of the key architectural features of Docker in Swarm Mode is the manager quorum powered by SwarmKit. The manager quorum stores information about the cluster, and the consistency of that information is achieved through the Raft consensus algorithm, which is also at the heart of other systems like etcd. This guide gives an overview of the architecture of Docker Swarm Mode and how the manager quorum maintains the state of the cluster.

One aspect of the cluster state maintained by the quorum is node membership — what nodes are in the cluster, who are the managers and workers, and their statuses. The Raft consensus algorithm gives us guarantees about our cluster’s behavior in the face of failure, and fault tolerance of the cluster is related to the number of manager nodes in the quorum. For example, a Docker Swarm with three managers can tolerate one node outage, planned or unplanned, while a quorum of five managers can tolerate outages of up to two members, possibly one planned and one unplanned.
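
The numbers above follow from the Raft majority rule: a quorum of N managers needs more than half of its members to agree, so it can lose the remainder and still make progress. Here is a minimal Python sketch of that arithmetic; the function names are ours and exist only for illustration.

```python
def majority(managers: int) -> int:
    """Smallest number of managers that still forms a Raft majority."""
    if managers < 1:
        raise ValueError("a quorum needs at least one manager")
    return managers // 2 + 1


def tolerated_failures(managers: int) -> int:
    """How many managers can be lost while a majority remains."""
    return managers - majority(managers)


for n in (1, 3, 5, 7):
    print(f"{n} managers: majority {majority(n)}, "
          f"tolerates {tolerated_failures(n)} outage(s)")
```

Note that an even number of managers adds no extra tolerance over the odd number just below it: four managers tolerate one outage, the same as three.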

The Raft quorum makes the Docker Swarm cluster fault tolerant; however, it cannot fix itself. When the quorum experiences an outage of manager nodes, manual steps are needed to troubleshoot and restore the cluster. These procedures require the operator to update or restore the quorum’s topology by demoting and removing old nodes from the quorum and joining new manager nodes when replacements are brought online.

While these administration tasks are easy via the Docker command line interface, InfraKit can automate them and make the cluster self-healing. As described in our last post, InfraKit can be deployed in a highly available manner, with multiple replicas running and only one active master. In this configuration, the InfraKit replicas can accept external input to determine which replica is the active master. This makes it easy to integrate InfraKit with Docker in Swarm Mode: by running InfraKit on each manager node of the Swarm and detecting leadership changes in the Raft quorum via the standard Docker API, InfraKit achieves the same fault tolerance as the Swarm cluster. In turn, when there’s an outage, InfraKit’s monitoring and infrastructure orchestration capabilities can automatically restore the quorum, making the cluster self-healing.
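
To give a feel for the leadership check, the Docker Engine API lists Swarm nodes at GET /nodes, and each manager entry carries a ManagerStatus object whose Leader flag is set on the current Raft leader. The Python sketch below applies that check to a hand-written, heavily trimmed sample payload. The node IDs, the function name, and the assumption that a replica decides mastership exactly this way are ours for illustration; this is not InfraKit’s actual code.

```python
import json

# Hand-written sample shaped like a trimmed GET /nodes response.
SAMPLE_NODES = json.loads("""
[
  {"ID": "node-a", "Spec": {"Role": "manager"},
   "ManagerStatus": {"Leader": true, "Reachability": "reachable",
                     "Addr": "172.31.16.101:2377"}},
  {"ID": "node-b", "Spec": {"Role": "manager"},
   "ManagerStatus": {"Reachability": "reachable",
                     "Addr": "172.31.16.102:2377"}},
  {"ID": "node-c", "Spec": {"Role": "worker"}}
]
""")


def is_leader(nodes, local_node_id):
    """Return True if the node with local_node_id is the current Swarm leader."""
    for node in nodes:
        if node.get("ID") == local_node_id:
            # Workers have no ManagerStatus; non-leader managers lack the flag.
            return bool(node.get("ManagerStatus", {}).get("Leader", False))
    return False


print(is_leader(SAMPLE_NODES, "node-a"))
print(is_leader(SAMPLE_NODES, "node-b"))
```

In a real deployment the replica would also need its own node ID, which the Engine reports in the Swarm section of GET /info, and it would repeat the check periodically to notice leadership changes.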

Example: A Docker Swarm with InfraKit on AWS

To illustrate this idea, we created a CloudFormation template that will bootstrap and create a cluster of Docker in Swarm Mode managed by InfraKit on AWS. There are a couple of ways to run this: you can clone the InfraKit examples repo and upload the template, or you can use this URL to launch the stack in the CloudFormation console.

Please note that this CloudFormation script is for demonstrations only and may not represent best practices. However, technical users should experiment and customize it to suit their purposes. A few things about this CloudFormation template:

  1. As a demo, only a few regions are supported: us-west-1 (Northern California), us-west-2 (Oregon), us-east-1 (Northern Virginia), and eu-central-1 (Frankfurt).
  2. It takes the cluster size (number of nodes), SSH key, and instance sizes as the primary user input when launching the stack.
  3. There are options for installing the latest Docker Engine on a base Ubuntu 16.04 AMI or using images on which we have pre-installed Docker and which we published for this demonstration.
  4. It bootstraps the networking environment by creating a VPC, a gateway and routes, a subnet, and a security group.
  5. It creates an IAM role for InfraKit’s AWS instance plugin to describe and create EC2 instances.
  6. It creates a single bootstrap EC2 instance and three EBS volumes (more on this later).  The bootstrap instance is attached to one of the volumes and will be the first leader of the Swarm.  The entire Swarm cluster will grow from this seed, as driven by InfraKit.

With the elements above, this CloudFormation script has everything needed to boot up an InfraKit-managed Docker in Swarm Mode cluster of N nodes (with 3 managers and N-3 workers).

About EBS Volumes and Auto-Scaling Groups

The use of EBS volumes in our example demonstrates an alternative approach to managing Docker Swarm Mode managers. Instead of relying on manually updating the quorum topology by removing and then adding new manager nodes to replace crashed instances, we use EBS volumes attached to the manager instances and mounted at /var/lib/docker for durable state that survives past the life of an instance. As soon as the volume of a terminated manager node is attached to a new replacement EC2 instance, we can carry the cluster state forward quickly because there are far fewer state changes to catch up on. This approach is attractive for large clusters running many nodes and services, where the entirety of cluster state may take a long time to be replicated to a brand new manager that just joined the Swarm.

The use of persistent volumes in this example highlights InfraKit’s philosophy of running stateful services on immutable infrastructure:

  • Use compute instances for just the processing cores;  they can come and go.
  • Keep state on persistent volumes that can survive when compute instances don’t.
  • The orchestrator has the responsibility to maintain members in a group identified by fixed logical IDs. In this case these are the private IP addresses for the Swarm managers.
  • The pairing of logical ID (IP address) and state (on volume) needs to be maintained.

This brings up a related implementation detail: why not use the auto-scaling group implementations that are already there? First, auto-scaling group implementations vary from one cloud provider to the next, if they are available at all. Second, most auto-scalers are designed to manage cattle, where individual instances in a group are identical to one another. This is clearly not the case for the Swarm managers:

  • The managers have some kind of identity as resources (via IP addresses).
  • As infrastructure resources, members of a group know about each other via membership in this stable set of IDs.
  • The managers identified by these IP addresses have state that needs to be detached and reattached across instance lifetimes. The pairing must be maintained.

Current auto-scaling group implementations focus on managing identical instances in a group. New instances are launched with assigned IP addresses that don’t match the expectations of the group, and volumes from failed instances in an auto-scaling group don’t carry over to the new instance. It is possible to work around these limitations with sweat and conviction; InfraKit, through its support of allocation, logical IDs, and attachments, supports this use case natively.
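
To make the idea of a maintained pairing concrete, here is a small Python sketch. It assumes a fixed map from each logical ID (a manager’s private IP address) to the volume holding that manager’s /var/lib/docker state, and computes which pairs currently lack an instance. The volume names and the function are hypothetical and do not reflect InfraKit’s internal data model.

```python
# Hypothetical pairing of logical ID (private IP) to the volume that holds
# that manager's /var/lib/docker state. The volume names are made up.
MANAGER_VOLUMES = {
    "172.31.16.101": "vol-manager-1",
    "172.31.16.102": "vol-manager-2",
    "172.31.16.103": "vol-manager-3",
}


def replacements_needed(pairing, running_ips):
    """List (logical_id, volume) pairs whose instance is currently missing."""
    running = set(running_ips)
    return [(ip, vol) for ip, vol in sorted(pairing.items())
            if ip not in running]


# The manager at .101 has been terminated; the other two are still running.
print(replacements_needed(MANAGER_VOLUMES,
                          ["172.31.16.102", "172.31.16.103"]))
```

Because the replacement is described by the pair rather than by a count, the new instance gets both the same address and the same volume, which a generic "keep N identical instances" policy cannot express.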

Bootstrapping InfraKit and the Swarm

So far, the CloudFormation template implements what we called ‘bootstrapping’, or the process of creating the minimal set of resources to jumpstart an InfraKit-managed cluster. With the creation of the networking environment and the first “seed” EC2 instance, InfraKit has the requisite resources to take over and complete provisioning of the cluster to match the user’s specification of N nodes (with 3 managers and N-3 workers). Here is an outline of the process:

When the single “seed” EC2 instance boots up, a single line of code is executed in the UserData (aka cloud-init), written here in CloudFormation JSON:

 "docker run --rm ",{"Ref":"InfrakitCore"}," infrakit template --url ",
    {"Ref":"InfrakitConfigRoot"}, "/boot.sh",
    " --global /cluster/name=", {"Ref":"AWS::StackName"},
    " --global /cluster/swarm/size=", {"Ref":"ClusterSize"},
    " --global /provider/image/hasDocker=yes",
    " --global /infrakit/config/root=", {"Ref":"InfrakitConfigRoot"},
    " --global /infrakit/docker/image=", {"Ref":"InfrakitCore"},
    " --global /infrakit/instance/docker/image=", {"Ref":"InfrakitInstancePlugin"},
    " --global /infrakit/metadata/docker/image=", {"Ref":"InfrakitMetadataPlugin"},
    " --global /infrakit/metadata/configURL=", {"Ref":"MetadataExportTemplate"},
    " | tee /var/lib/infrakit.boot | sh \n"

Here, we are running InfraKit packaged in a Docker image, and most of this CloudFormation statement references the Parameters (e.g. “InfrakitCore” and “ClusterSize”) defined at the beginning of the template. Using the parameter values in the stack template, this translates to a single statement like this that will execute during bootup of the instance:

# Most of the --global flags are omitted here for brevity.
# tee just saves a copy of the rendered script on disk before sh runs it.
docker run --rm infrakit/devbundle:0.4.1 infrakit template \
  --url https://infrakit.github.io/examples/swarm/boot.sh \
  --global /cluster/name=mystack \
  --global /cluster/swarm/size=4 \
  | tee /var/lib/infrakit.boot | sh

This single statement marks the hand-off from CloudFormation to InfraKit. When the seed instance starts up (and installs Docker, if not already part of the AMI), the InfraKit container is run to execute the InfraKit template command. The template command takes a URL as the source of the template (e.g. https://infrakit.github.io/examples/swarm/boot.sh, or a local file with a URL like file://) and a set of pre-conditions (as the --global variables) and renders the template. Through the --global flags, we are able to pass a set of parameters entered by the user when launching the CloudFormation stack. This allows InfraKit to use CloudFormation as authentication and user interface for configuring the cluster.

InfraKit uses templates to simplify complex scripting and configuration tasks. The templates can be any text that uses {{ }} tags, aka “handlebars” syntax. Here InfraKit is given a set of input parameters from the CloudFormation template and a URL referencing the boot script. It then fetches the template and renders a script that is executed to perform the following during boot-up of the instance:
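
As a rough illustration of what rendering means here, the Python sketch below substitutes values passed as global variables into a one-line script template. It is a stand-in built on a regular expression, not InfraKit’s template engine, and the placeholder form it accepts, the sample template text, and the function name are assumptions made for this example.

```python
import re

# Matches placeholders of the (assumed) form {{ var "/some/path" }}.
PLACEHOLDER = re.compile(r'\{\{\s*var\s+"([^"]+)"\s*\}\}')


def render(template, global_vars):
    """Replace each placeholder with its value; fail loudly on unknown names."""
    def substitute(match):
        name = match.group(1)
        if name not in global_vars:
            raise KeyError(f"no value supplied for {name}")
        return str(global_vars[name])
    return PLACEHOLDER.sub(substitute, template)


template = ('echo "cluster {{ var "/cluster/name" }} has '
            '{{ var "/cluster/swarm/size" }} nodes"')
script = render(template, {"/cluster/name": "mystack",
                           "/cluster/swarm/size": 4})
print(script)
```

Failing on a missing variable, rather than rendering an empty string, matters when the output is piped straight into a shell as it is in the boot sequence.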

 

  1. Format the EBS volume if it’s not already formatted.
  2. Stop Docker if it is currently running and mount the volume at /var/lib/docker.
  3. Configure the Docker engine with proper labels and restart it.
  4. Start an InfraKit metadata plugin that can introspect its environment. In v0.4.1, the AWS instance plugin can introspect an environment formed by CloudFormation as well as use the instance metadata service available on AWS. InfraKit metadata plugins can export important parameters in a read-only namespace that can be referenced in templates as file-system paths.
  5. Start the InfraKit containers such as the manager, group, instance, and Swarm flavor plugins.
  6. Initialize the Swarm via docker swarm init.
  7. Generate a config JSON for InfraKit itself. This JSON is also rendered by a template (https://github.com/infrakit/examples/blob/v0.4.1/swarm/groups.json) that references environmental parameters like region, availability zone, subnet IDs, and security group IDs that are exported by the metadata plugins.
  8. Perform an infrakit manager commit to tell InfraKit to begin managing the cluster.

See https://github.com/infrakit/examples/blob/v0.4.1/swarm/boot.sh for details.

When the InfraKit replica begins running, it notices that the current infrastructure state (of only one node) does not match the user’s specification of 3 managers and N-3 worker nodes. InfraKit will then drive the infrastructure state toward the user’s specification by creating the rest of the managers and workers to complete the Swarm.
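
The gap between the specification and the observed state can be written down in a few lines. This Python sketch assumes the fixed count of three managers used in this example and simply reports how many instances of each kind are still missing; the function name and return shape are ours, and InfraKit’s group plugin does considerably more than this.

```python
MANAGER_COUNT = 3  # this example always runs three managers


def instances_to_create(cluster_size, running_managers, running_workers):
    """Return how many managers and workers are missing from the desired state."""
    if cluster_size <= MANAGER_COUNT:
        raise ValueError("cluster size must be greater than the manager count")
    desired_workers = cluster_size - MANAGER_COUNT
    return {
        "managers": max(0, MANAGER_COUNT - running_managers),
        "workers": max(0, desired_workers - running_workers),
    }


# Right after bootstrap, only the seed manager exists in a 4-node cluster.
print(instances_to_create(4, running_managers=1, running_workers=0))
```

Running the same comparison repeatedly is what turns a one-time provisioning step into self-healing: any instance that disappears later shows up as a non-zero gap again.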

The topic of metadata and templating in InfraKit will be the subject of future blog posts. In a nutshell, metadata is information exposed by compatible plugins, organized and accessible in a cluster-wide namespace. Metadata can be accessed in the InfraKit CLI or in templates with file-like path names. You can think of this as a cluster-wide read-only sysfs. The InfraKit template engine, on the other hand, can make use of this data to render complex configuration script files or JSON documents. The template engine supports fetching a collection of templates from a local directory or from a remote site, like the example GitHub repo that has been configured to serve up the templates like a static website or S3 bucket.
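
One way to picture a namespace addressed by file-like paths is a nested mapping that is walked one path segment at a time. The Python sketch below does exactly that; the tree contents, key names, and function are invented for illustration and do not correspond to what any particular InfraKit plugin exports.

```python
# Made-up metadata tree; real plugins decide what they export.
METADATA = {
    "aws": {
        "region": "eu-central-1",
        "subnet": {"id": "subnet-example"},
    },
}


def lookup(tree, path):
    """Resolve a file-like path such as '/aws/subnet/id' in a nested mapping."""
    node = tree
    for part in path.strip("/").split("/"):
        node = node[part]  # raises KeyError if the path does not exist
    return node


print(lookup(METADATA, "/aws/region"))
print(lookup(METADATA, "/aws/subnet/id"))
```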

 

Running the Example

You can either fork the examples repo or use this URL to launch the stack in the AWS console. Here we first bootstrap the Swarm with the CloudFormation template, then InfraKit takes over and provisions the rest of the cluster. Then, we will demonstrate fault tolerance and self-healing by terminating the leader manager node in the Swarm to induce a fault and force failover and recovery.

When you launch the stack, you have to answer a few questions:

    • The size of the cluster.  This script always starts a Swarm with 3 managers, so use a value greater than 3.
    • The SSH key.
    • There’s an option to install Docker or use an AMI with Docker pre-installed.  An AMI with Docker pre-installed gives shorter startup time when InfraKit needs to spin up a replacement instance.

[Image: InfraKit and AMI]

Once you agree and launch the stack, it takes a few minutes for the cluster to come up. In this case, we start a 4-node cluster. In the AWS console we can verify that the cluster is fully provisioned by InfraKit:

[Image: InfraKit and AWS]

Note the private IP addresses 172.31.16.101, 172.31.16.102, and 172.31.16.103 are assigned to the Swarm managers, and they are the values in our configuration. In this example the public IP addresses are dynamically assigned: 35.156.207.156 is bound to the manager instance at 172.31.16.101.  

Also, we see that InfraKit has attached the 3 EBS volumes to the manager nodes:

[Image: InfraKit and EBS volumes]

Because InfraKit is configured with the Swarm Flavor plugin, it also made sure that the manager and worker instances successfully joined the Swarm. To illustrate this, we can log in to the manager instances and run docker node ls. To visualize the Swarm membership in real time, we log in to all three manager instances and run

watch -d docker node ls  

The watch command will by default refresh docker node ls every 2 seconds. This allows us not only to watch the Swarm membership changes in real time but also to check the availability of the Swarm as a whole.

[Image: InfraKit and Docker Swarm Mode]

Note that at this point the leader of the Swarm is, as expected, the bootstrap instance, 172.31.16.101.

Let’s make a note of this instance’s public IP address (35.156.207.156), private IP address (172.31.16.101), and its Swarm Node cryptographic identity (qpglaj6egxvl20vuisdbq8klr).  Now, to test fault tolerance and self-healing, let’s terminate this very leader instance.  As soon as this instance is terminated, we would expect the quorum leadership to go to a new node, and consequently, the InfraKit replica running on that node will become the new master.

[Image: InfraKit replicas]

Immediately the screen shows there is an outage:  In the top terminal, the connection to the remote host (172.31.16.101) is lost.  In the second and third terminals below, the Swarm node lists are being updated in real time:

[Image: Docker Swarm Node]

When the 172.31.16.101 instance is terminated, the leadership of the quorum is transferred to another node at IP address 172.31.16.102. Docker Swarm Mode is able to tolerate this failure and continue to function (as seen by the continued functioning of docker node ls on the remaining managers). However, the Swarm has noticed that the 172.31.16.101 instance is now Down and Unreachable.

[Image: InfraKit]

As configured, a quorum of 3 managers can tolerate one instance outage. At this point, the cluster continues operation without interruption. All your apps running on the Swarm continue to work and you can deploy services as usual. However, without any automation, the operator needs to intervene at some point and perform some tasks to restore the cluster before another outage hits the remaining nodes.

Because this cluster is managed by InfraKit, the replica running on 172.31.16.102 becomes the master when that instance assumes leadership of the quorum. Because InfraKit is tasked to maintain the specification of 3 manager instances with IP addresses 172.31.16.101, 172.31.16.102, and 172.31.16.103, it will take action when it notices 172.31.16.101 is missing. In order to correct the situation, it will:

  1. Create a new instance with the private IP address 172.31.16.101.
  2. Attach the EBS volume that was previously associated with the downed instance.
  3. Restore the volume, so that Docker Engine and InfraKit start running on that new instance.
  4. Join the new instance to the Swarm.

[Image: InfraKit and Swarm Mode]

As seen above, the new instance at private IP 172.31.16.101 now has an ephemeral public IP address 35.157.163.34, when it was previously 35.156.207.156.  We also see that the EBS volume has been re-attached:

[Image: InfraKit and Swarm Mode]

Because the EBS volume is re-attached at /var/lib/docker on the new instance and the same IP address is used, the new instance appears exactly as though the downed instance had been resurrected and had rejoined the cluster. So as far as the Swarm is concerned, 172.31.16.101 may as well have been subjected to a temporary network partition and has since recovered and rejoined the cluster:

[Image: InfraKit and Swarm Mode]

At this point, the cluster has recovered without any manual intervention.  The managers are now showing as healthy, and the quorum lives on!

Conclusion

While this example is only a proof-of-concept, we hope it demonstrates the potential of InfraKit as an active infrastructure orchestrator which can make a distributed computing cluster both fault-tolerant and self-healing.  As these features and capabilities mature and harden, we will incorporate them into Docker products such as Docker Editions for AWS and Azure.

InfraKit is a young project and rapidly evolving, and we are actively testing and building ways to safeguard and automate the operations of large distributed computing clusters.   While this project is being developed in the open, your ideas and feedback can help guide us down the path toward making distributed computing resilient and easy to operate.

Check out the InfraKit repository README for more info, a quick tutorial, and ways to start experimenting, from plain files to Terraform integration to building a ZooKeeper ensemble. Have a look, explore, and join us on GitHub or online at the Docker Community Slack Channel (#infrakit). Send us a PR, open an issue, or just say hello. We look forward to hearing from you!
