This post reviews the various security implications of using Docker to run applications within containers, and how to address them.
There are three major areas to consider:
- the intrinsic security of containers, as implemented by namespaces and cgroups;
- the specific attack surface of the Docker daemon itself;
- the “hardening” security features of the kernel and how they interact with containers.
We will also discuss how Docker security features compare with other systems.
image source: US PATENT US6877440 B1
Docker containers are essentially LXC containers, and they come with the same security features. When you start a container with docker run, behind the scenes, it uses lxc-start to execute the Docker container. This creates a set of namespaces and control groups for the container. Those namespaces and control groups are not created by Docker itself, but by lxc-start. This means that as the LXC userland tools evolve (and provide additional namespaces and isolation features), Docker will automatically make use of them.
Namespaces provide the first, and most straightforward, form of isolation: processes running within a container cannot see, and even less affect, processes running in another container, or in the host system.
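To make this isolation concrete, here is a minimal sketch that lists the namespaces a process belongs to by reading the links under /proc/<pid>/ns. It assumes a Linux host with a kernel recent enough to expose those entries as symbolic links; the helper names list_namespaces and shared_namespaces are ours, for illustration only.

```python
import os


def list_namespaces(pid="self"):
    """Return {namespace_name: identifier} read from /proc/<pid>/ns.

    Returns an empty dict when the directory is missing (non-Linux host)
    or unreadable (insufficient permissions).
    """
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    try:
        names = os.listdir(ns_dir)
    except OSError:
        return result
    for name in sorted(names):
        try:
            # Each entry is a link whose target identifies the namespace.
            result[name] = os.readlink(os.path.join(ns_dir, name))
        except OSError:
            continue
    return result


def shared_namespaces(a, b):
    """Names of the namespaces that two processes have in common."""
    return sorted(name for name in a if name in b and a[name] == b[name])


if __name__ == "__main__":
    for name, ident in list_namespaces().items():
        print(name, ident)
```

Comparing the result for a process on the host with the result for a process inside a container should show different identifiers for every namespace the container was given; two processes that share an identifier are in the same namespace.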
Each container also gets its own network stack, meaning that a container doesn’t get privileged access to the sockets or interfaces of another container. Of course, if the host system is set up accordingly, containers can interact with each other through their respective network interfaces, just like they can interact with external hosts. By default, IP traffic is allowed between containers, so they can ping each other, send/receive UDP packets, and establish TCP connections; but that can be restricted if necessary. From a network architecture point of view, all containers on a given Docker host are sitting on a bridge interface. This means that they are just like physical machines connected through a common Ethernet switch; no more, no less.
We often get the question: “is this code mature?”, and the answer is “yes, pretty mature”. Kernel namespaces were introduced between kernel versions 2.6.15 and 2.6.26. This means that since July 2008 (date of the 2.6.26 release, now 5 years ago), namespace code has been exercised and scrutinized on a large number of production systems. And there is more: the design and inspiration for the namespaces code are even older. Namespaces are actually an effort to reimplement the features of OpenVZ in such a way that they could be merged within the mainstream kernel. And OpenVZ was initially released in 2005… So yes, both the design and the implementation are pretty mature.
Control Groups are the other key component of Linux Containers. They implement resource accounting and limiting. They provide a lot of very useful metrics, but they also help to ensure that each container gets its fair share of memory, CPU, disk I/O; and, more importantly, that a single container cannot bring the system down by exhausting one of those resources.
So while they do not play a role in preventing one container from accessing or affecting the data and processes of another container, they are essential to fend off some denial-of-service attacks. They are particularly important on multi-tenant platforms, like public and private PaaS, to guarantee a consistent uptime (and performance) even when some applications start to misbehave.
Control Groups have been around for a while as well: the code was started in 2006, and initially merged in kernel 2.6.24.
You can read the PaaS Under The Hood blog post about cgroups if you want to know more about those.
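As a small illustration of how control group membership can be inspected, the sketch below parses /proc/<pid>/cgroup, where each line has the form hierarchy-ID:controller-list:cgroup-path. The function names are ours, and the sample path in the usage note is made up.

```python
def parse_cgroup_line(line):
    """Parse one 'hierarchy-ID:controller-list:cgroup-path' line."""
    hierarchy_id, controllers, path = line.rstrip("\n").split(":", 2)
    return {
        "hierarchy": int(hierarchy_id),
        # The controller list is comma-separated and may be empty.
        "controllers": [c for c in controllers.split(",") if c],
        "path": path,
    }


def read_cgroups(pid="self"):
    """Return the parsed cgroup memberships of a process (empty if unavailable)."""
    try:
        with open(f"/proc/{pid}/cgroup") as f:
            return [parse_cgroup_line(line) for line in f if line.strip()]
    except OSError:
        return []


if __name__ == "__main__":
    for entry in read_cgroups():
        print(entry)
```

For instance, a hypothetical line such as 4:memory:/lxc/abc123 would mean that the process is accounted for by the memory controller under the /lxc/abc123 group; that is where its memory limit and usage counters would live.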
Running containers (and applications) with Docker implies running the Docker daemon. This daemon currently requires root privileges, and you should therefore be aware of some important details.
First of all, only trusted users should be allowed to control your Docker daemon. This is a direct consequence of some powerful Docker features. Specifically, Docker allows you to share a directory between the Docker host and a guest container; and it allows you to do so without limiting the access rights of the container. This means that you can start a container where the /host directory will be the / directory on your host; and the container will be able to alter your host filesystem without any restriction. This sounds crazy? Well, you have to know that all virtualization systems allowing filesystem resource sharing behave the same way. Nothing prevents you from sharing your root filesystem (or even your root block device) with a virtual machine.
This has a strong security implication: if you instrument Docker from e.g. a web server to provision containers through an API, you should be even more careful than usual with parameter checking, to make sure that a malicious user cannot pass crafted parameters causing Docker to create arbitrary containers.
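As an example of the kind of parameter checking we have in mind, here is a minimal sketch of an allowlist check that a provisioning front end could apply to a host directory requested by a user, before passing it on to Docker. The allowed root /srv/app-data and the function name are hypothetical. Note that this only normalizes the path lexically: it does not resolve symbolic links, so it is a first line of defense rather than a complete one.

```python
import posixpath

# Hypothetical allowlist: the only host directories users may share.
ALLOWED_HOST_ROOTS = ("/srv/app-data",)


def is_safe_host_path(host_path, allowed_roots=ALLOWED_HOST_ROOTS):
    """Return True only if host_path is absolute and inside an allowed root."""
    if not host_path.startswith("/"):
        return False
    # Collapse "." and ".." components so that traversal tricks are visible.
    normalized = posixpath.normpath(host_path)
    for root in allowed_roots:
        root = posixpath.normpath(root)
        # Compare on a path-component boundary, not on a raw string prefix.
        if normalized == root or normalized.startswith(root + "/"):
            return True
    return False


if __name__ == "__main__":
    for candidate in ("/srv/app-data/tenant1", "/", "/srv/app-data/../../etc"):
        print(candidate, is_safe_host_path(candidate))
```

The important design choice is to reject by default: anything that is not provably inside an allowed directory is refused, instead of trying to enumerate dangerous paths.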
For this reason, the REST API endpoint (used by the Docker CLI to communicate with the Docker daemon) changed in Docker 0.5.2, and now uses a UNIX socket instead of a TCP socket bound on 127.0.0.1 (the latter being prone to cross-site request forgery attacks if you happen to run Docker directly on your local machine, outside of a VM). You can then use traditional UNIX permission checks to limit access to the control socket.
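Those permission checks are easy to audit. The sketch below reports the owner, group, and mode of a UNIX socket, and flags the case where any local user could write to it (writing is what is needed to connect). The path /var/run/docker.sock is an assumption on our part; adjust it to wherever your daemon creates its socket.

```python
import os
import stat


def socket_access_report(path):
    """Summarize who may talk to the UNIX socket at the given path."""
    st = os.stat(path)
    return {
        "is_socket": stat.S_ISSOCK(st.st_mode),
        "owner_uid": st.st_uid,
        "group_gid": st.st_gid,
        "mode": oct(stat.S_IMODE(st.st_mode)),
        # Connecting requires write permission on the socket file.
        "world_writable": bool(st.st_mode & stat.S_IWOTH),
    }


if __name__ == "__main__":
    docker_socket = "/var/run/docker.sock"  # assumed location
    if os.path.exists(docker_socket):
        print(socket_access_report(docker_socket))
    else:
        print("no socket found at", docker_socket)
```

If world_writable is true, every local user can effectively control the daemon, which defeats the purpose of moving to a UNIX socket.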
You can also expose the REST API over HTTP if you explicitly decide so. However, if you do that, being aware of the above-mentioned security implication, you should make sure that it will be reachable only from a trusted network or VPN; or protected with e.g. stunnel and client SSL certificates.
Recent improvements in Linux namespaces will soon allow running full-featured containers without root privileges, thanks to the new user namespace. This is covered in detail here. Moreover, this will solve the problem caused by sharing filesystems between host and guest, since the user namespace allows users within containers (including the root user) to be mapped to other users in the host system.
The end goal for Docker is therefore to implement two additional security improvements:
- map the root user of a container to a non-root user of the Docker host, to mitigate the effects of a container-to-host privilege escalation;
- allow the Docker daemon to run without root privileges, and delegate operations requiring those privileges to well-audited sub-processes, each with its own (very limited) scope: virtual network setup, filesystem management, etc.
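To illustrate what mapping the root user of a container to a non-root user of the host means in practice, here is a small sketch of the translation described by a user namespace's uid_map file, where each line reads: first ID inside the namespace, first ID outside, length of the range. The sample mapping in the usage note is hypothetical, and so are the function names.

```python
def parse_uid_map(text):
    """Parse the contents of a uid_map file into (inside, outside, length) tuples."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        inside, outside, length = (int(x) for x in fields)
        entries.append((inside, outside, length))
    return entries


def to_host_uid(container_uid, entries):
    """Translate a UID seen inside the container to the UID seen on the host.

    Returns None when the UID is not covered by any mapping range.
    """
    for inside, outside, length in entries:
        if inside <= container_uid < inside + length:
            return outside + (container_uid - inside)
    return None


if __name__ == "__main__":
    mapping = parse_uid_map("0 100000 65536\n")  # hypothetical mapping
    print(to_host_uid(0, mapping))
```

With a mapping like the hypothetical one above, a process running as root (UID 0) inside the container is an ordinary, unprivileged UID on the host, so files shared from the host are protected by the usual permission checks.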
Finally, if you run Docker on a server, it is recommended to run Docker exclusively on that server, and move all other services within containers controlled by Docker. Of course, it is fine to keep your favorite admin tools (probably at least an SSH server), as well as existing monitoring/supervision processes (e.g. NRPE, collectd, etc).
By default, Docker starts containers with a very restricted set of capabilities. What does that mean?
Capabilities turn the binary “root/non-root” dichotomy into a fine-grained access control system. Processes (like web servers) that just need to bind on a port below 1024 do not have to run as root: they can just be granted the net_bind_service capability instead. And there are many other capabilities, for almost all the specific areas where root privileges are usually needed.
This means a lot for container security; let’s see why!
Your average server (bare metal or virtual machine) needs to run a bunch of processes as root. Those typically include SSH, cron, syslogd; hardware management tools (to e.g. load modules), network configuration tools (to handle e.g. DHCP, WPA, or VPNs), and much more. A container is very different, because almost all of those tasks are handled by the infrastructure around the container:
- SSH access will typically be managed by a single server running in the Docker host;
- cron, when necessary, should run as a user process, dedicated and tailored for the app that needs its scheduling service, rather than as a platform-wide facility;
- log management will also typically be handed to Docker, or to third-party services like Loggly or Splunk;
- hardware management is irrelevant, meaning that you never need to run udevd or equivalent daemons within containers;
- network management happens outside of the containers, enforcing separation of concerns as much as possible, meaning that a container should never need to perform ip commands (except when a container is specifically engineered to behave like a router or firewall, of course).
This means that in most cases, containers will not need “real” root privileges at all. And therefore, containers can run with a reduced capability set, meaning that “root” within a container has far fewer privileges than the real “root”. For instance, it is possible to:
- deny all “mount” operations;
- deny access to raw sockets (to prevent packet spoofing);
- deny access to some filesystem operations, like creating new device nodes, changing the owner of files, or altering attributes (including the immutable flag);
- deny module loading;
- and many others.
This means that even if an intruder manages to escalate to root within a container, it will be much harder to do serious damage, or to escalate to the host.
This won’t affect regular web apps; but malicious users will find that the arsenal at their disposal has shrunk considerably! You can see the list of dropped capabilities in the Docker code, and a full list of available capabilities in Linux manpages.
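If you want to check what a given process is actually allowed to do, the sketch below decodes the CapEff field of /proc/<pid>/status, which holds the effective capability set as a hexadecimal bit mask. The bit numbers come from the Linux capability definitions. The selection of capabilities is our own reading of which ones govern the operations listed above (for instance, mount operations fall under CAP_SYS_ADMIN); it is not taken from the Docker code.

```python
# Bit numbers as defined by the Linux kernel; only the capabilities
# relevant to the operations discussed above are listed here.
CAPABILITY_BITS = {
    "CAP_CHOWN": 0,             # changing the owner of files
    "CAP_LINUX_IMMUTABLE": 9,   # altering the immutable flag
    "CAP_NET_BIND_SERVICE": 10, # binding to ports below 1024
    "CAP_NET_RAW": 13,          # raw sockets
    "CAP_SYS_MODULE": 16,       # loading kernel modules
    "CAP_SYS_ADMIN": 21,        # mount operations (among many others)
    "CAP_MKNOD": 27,            # creating device nodes
}


def parse_cap_eff(status_text):
    """Extract the effective capability mask from /proc/<pid>/status content."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split(":", 1)[1].strip(), 16)
    return None


def held_capabilities(mask):
    """Names of the listed capabilities present in the mask, ordered by bit."""
    ordered = sorted(CAPABILITY_BITS.items(), key=lambda item: item[1])
    return [name for name, bit in ordered if mask & (1 << bit)]


if __name__ == "__main__":
    try:
        with open("/proc/self/status") as f:
            mask = parse_cap_eff(f.read())
    except OSError:
        mask = None
    if mask is not None:
        print(held_capabilities(mask))
```

Running this as root on the host and then as root inside a container is a quick way to see the difference between the two: the second list should be noticeably shorter.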
Of course, you can always enable extra capabilities if you really need them (for instance, if you want to use a FUSE-based filesystem), but by default, Docker containers will be locked down to ensure maximum safety.
Capabilities are just one of the many security features provided by modern Linux kernels. It is also possible to leverage existing, well-known systems like TOMOYO, AppArmor, SELinux, GRSEC, etc. with Docker.
While Docker currently only enables capabilities, it doesn’t interfere with the other systems. This means that there are many different ways to harden a Docker host. Here are a few examples.
- You can run a kernel with GRSEC and PAX. This will add many safety checks, both at compile-time and run-time; it will also defeat many exploits, thanks to techniques like address randomization. It doesn’t require Docker-specific configuration, since those security features apply system-wide, independently of containers.
- If your distribution comes with security model templates for LXC containers, you can use them out of the box. For instance, Ubuntu comes with AppArmor templates for LXC, and those templates provide an extra safety net (even though it overlaps greatly with capabilities).
- You can define your own policies using your favorite access control mechanism. Since Docker containers are standard LXC containers, there is nothing “magic” or specific to Docker.
Just like there are many third-party tools to augment Docker containers with e.g. special network topologies or shared filesystems, you can expect to see tools to harden existing Docker containers without affecting Docker’s core.
Traditional virtualization techniques (as implemented by Xen, VMware, KVM, etc.) are deemed to be more secure than containers, since they provide an extra level of isolation. A container can issue syscalls to the host kernel, while a full VM can only issue hypercalls to the host hypervisor, which will generally have a much smaller attack surface.
But the real reason why full VMs would be considered more secure than containers is that they have had more exposure in production, and more scrutiny. There are many providers out there selling virtual machines to the public, while those selling containers are only a handful (mainly public PaaS providers). Since containers are much more resource-efficient and easier to manage, you can expect this situation to reverse over the next few years; and you can rely on the responsiveness of the Linux kernel development community to patch security holes extremely quickly when they surface.
However, it has been pointed out that if a kernel vulnerability allows arbitrary code execution, it will probably allow breaking out of a container, but not out of a virtual machine. No exploit has been crafted yet to demonstrate this, but it will certainly happen in the future (especially with more and more containers in production: they will become a more “interesting” target for a malicious user). Does that mean that containers are really less secure? We think not. First, hypervisors are not exempt from vulnerabilities. And then, critical kernel issues tend to be fixed very quickly when they’re discovered (since they potentially affect not only container-based systems, but all Linux systems out there).
There is another side to the coin: when an exploit or security hole is found in the kernel, you have to upgrade the kernel and reboot. Sometimes, you can use a system like Ksplice, which allows “reboot-less” upgrades. However, you still need to deploy the new kernel, and update your VM images. Things are easier with containers: since the kernel is outside of the scope of the container image, you don’t have to change all your container images when you upgrade the kernel. Even if you do something quite drastic like moving from AppArmor to SELinux or vice versa, you will make changes outside of your containers, but you won’t have to update the containers themselves. This clear and clean separation of concerns is a major advantage over VMs. Also, the availability of systems like CRIU means that you can do container live migration, i.e. move a container from one machine to another without killing processes. This means that it will be possible to achieve uninterrupted operation during kernel upgrades.
Virtual Machines might be more secure today, but containers are definitely catching up; and containers are already easier to manage, and therefore it’s easier to make sure that they are up to date from a security standpoint.
We can sort other containerization systems in three categories.
- LXC-based systems: those systems will provide exactly the same level of security as Docker itself. They might claim extra security features, but those features will not be provided by the system itself; they are kernel features that the system merely enables. In other words, if another containerization software advertises “role-based authorization and enhanced security”, it means that it merely enables and configures some existing features like SELinux or SMACK, and that it should be fairly easy to add similar features to Docker, either in the core, or as a third-party add-on (as explained earlier).
- Linux systems not based on LXC: that would be OpenVZ. OpenVZ is great, and it has been around for longer than LXC, so some people consider it to be more stable and secure. However, one has to keep in mind that LXC and OpenVZ share many developers, and that LXC is nothing other than “OpenVZ redesigned to be able to be merged into the mainline kernel”. Therefore, OpenVZ will eventually sunset, to be fully replaced by LXC.
- Non-Linux containerization systems: some of those systems are plain awesome (e.g. Solaris Zones); however, to the best of our knowledge, none of them will let you run existing Linux processes as efficiently and as reliably as a “true” Linux system. This might or might not be a problem for you (after all, many people run e.g. Node and Mongo stacks on Solaris without any problem whatsoever). Note, however, that even if there are some big deployments of FreeBSD Jails and Solaris Zones out there, they are just a drop of water in the big ocean of Linux-based “VPS” offerings. This means that Linux (and that includes VServer, OpenVZ, LXC) has had much more exposure. That doesn’t make it intrinsically more secure, but it helps a lot.
Finally, it’s worth mentioning that Docker 1.0 won’t be LXC-specific. It will be able to support other runtimes through a plug-in mechanism.
Docker containers are, by default, quite secure; especially if you take care of running your processes inside the containers as non-privileged users (i.e. non root).
VMs are considered more secure than containers, but the difference blurs away if you abide by the previous advice, i.e. run processes as non-privileged users (which is sometimes impractical with VMs, but easy with containers).
Last but not least, if you see interesting security features in other containerization systems, you will be able to implement them as well with Docker, since everything is provided by the kernel anyway.
Note: the paragraphs about Containers/VM security, and about other isolation systems, have been updated following feedback on HackerNews. Thanks guys!