
What is a container?

A story about picking apart how and why containers work the way they do

  • Running a software development environment
  • Compiling software with its dependencies in a sandbox
  • Analysing the behaviour of software within a sandbox

History

Bryan Cantrill dates the “birth” of containers to March 18th, 1982[3], when the chroot syscall was added to BSD. From the FreeBSD website[4]:

# Get a shell
$ cd $(mktemp -d)
$ mkdir bin
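# chroot launches $SHELL by default (assumed here to be /bin/bash), so the shell is copied under that name
$ cp $(which sh) bin/bash
# Find shared libraries required for shell
$ ldd bin/bash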
linux-vdso.so.1 (0x00007ffe69784000)
/lib/x86_64-linux-gnu/libsnoopy.so (0x00007f6cc4c33000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f6cc4a42000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f6cc4a21000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f6cc4a1c000)
/lib64/ld-linux-x86-64.so.2 (0x00007f6cc4c66000)
# Duplicate libraries into root
$ mkdir -p lib64 lib/x86_64-linux-gnu
$ cp /lib/x86_64-linux-gnu/libsnoopy.so \
/lib/x86_64-linux-gnu/libc.so.6 \
/lib/x86_64-linux-gnu/libpthread.so.0 \
/lib/x86_64-linux-gnu/libdl.so.2 \
lib/x86_64-linux-gnu/
$ cp /lib64/ld-linux-x86-64.so.2 lib64/
# Change into that root
$ sudo chroot .
# Test the chroot
# ls
/bin/bash: 1: ls: not found
#
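
The manual library copy above can also be scripted. The following is a minimal sketch and not part of the original walkthrough: `ldd_paths` and `copy_libs` are hypothetical helper names, and the sketch assumes `ldd` prints one library per line in the format shown above.

```shell
# Extract the absolute library paths from `ldd` output read on stdin.
# Lines without an absolute path (such as linux-vdso) are skipped.
ldd_paths() {
    awk '{ for (i = 1; i <= NF; i++) if ($i ~ /^\//) { print $i; break } }'
}

# Copy every library a binary needs into the matching directory under a new root.
copy_libs() {
    binary="$1"
    root="$2"
    ldd "$binary" | ldd_paths | while read -r lib; do
        mkdir -p "$root$(dirname "$lib")"
        cp "$lib" "$root$lib"
    done
}
```

For example, running `copy_libs "$(which sh)" .` from the scratch directory would be expected to recreate the lib and lib64 directories that were populated by hand above.
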
  • Filesystem separation (similar to chroot)
  • A separate process space

Definition

It might be surprising to learn that a “container” is not a real thing; rather, it is a specification. At the time of writing, this specification has implementations on[11]:

  • Windows
  • Solaris
  • Virtual Machines
  1. Consistent regardless of what type of software is being run
  2. Agnostic to the underlying infrastructure the container is being run on
  3. Designed in a way that makes automation easy
  4. Of excellent quality

Implementation

While the standards give us some idea of what a container is and how it should work, it is useful to understand how a container implementation works. Not all container runtimes are implemented this way; notably, Kata Containers uses hardware virtualisation, as alluded to earlier with EC2.

  1. Distribution of that process(es)
  2. Connecting that process(es) to other machines

Kernel feature isolation: namespaces

The namespaces man page (man namespaces) defines namespaces as follows:

  • setns: Allows the calling process to join an existing namespace, specified under /proc/[pid]/ns
  • unshare: Moves the calling process into a new namespace
# Scratch space
$ cd $(mktemp -d)
# Fork is required to spawn new processes, and proc is mounted to give accurate process information
$ sudo unshare \
--fork \
--pid \
--mount-proc \
--net
# Here we see that we only have access to the loopback interface
root@sw-20160616-01:/tmp/tmp.XBESuNMJJS# ip addr
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
# Here we see that we can only see the first process (bash) and our `ps aux` invocation
root@sw-20160616-01:/tmp/tmp.XBESuNMJJS# ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.3 0.0 8304 5092 pts/7 S 05:48 0:00 -bash
root 5 0.0 0.0 10888 3248 pts/7 R+ 05:49 0:00 ps aux
  • The net namespace: Managing network interfaces (NET: Networking).
  • The ipc namespace: Managing access to IPC resources (IPC: InterProcess Communication).
  • The mnt namespace: Managing filesystem mount points (MNT: Mount).
  • The uts namespace: Isolating kernel and version identifiers (UTS: Unix Timesharing System).
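
The namespaces a process belongs to can be inspected without special privileges through the symlinks under /proc/[pid]/ns. The following is a minimal sketch, assuming a Linux system with /proc mounted; `list_namespaces` is a hypothetical helper name.

```shell
# Print "<type> <identifier>" for each namespace symlink in a directory.
# Defaults to the namespaces of the current shell.
list_namespaces() {
    ns_dir="${1:-/proc/$$/ns}"
    for entry in "$ns_dir"/*; do
        # Each entry is a symlink whose target looks like "net:[4026531992]"
        printf '%s %s\n' "$(basename "$entry")" "$(readlink "$entry")"
    done
}

# Show the namespaces of this shell, if /proc is available
if [ -d "/proc/$$/ns" ]; then
    list_namespaces
fi
```

Two processes that print the same identifier for a type share that namespace. Running the helper inside the unshare shell above and again on the host should show different identifiers for at least pid and net.
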

Resource isolation: control groups

The kernel documentation for cgroups defines the cgroup as follows:

# Create a cgroup called "me"
$ sudo mkdir /sys/fs/cgroup/memory/me
# Allocate the cgroup a max of 100 MB of memory
$ echo '100000000' | sudo tee /sys/fs/cgroup/memory/me/memory.limit_in_bytes
# Move this process into the cgroup
$ echo $$ | sudo tee /sys/fs/cgroup/memory/me/cgroup.procs
5924
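
To confirm that the move worked, the kernel reports each process's cgroup membership in /proc/[pid]/cgroup, one line per hierarchy in the form hierarchy-ID:controller-list:path. The following is a minimal sketch, assuming the cgroup v1 layout used above; `cgroup_path` is a hypothetical helper name.

```shell
# Read /proc/[pid]/cgroup content on stdin and print the cgroup path
# for the named controller (for example "memory").
cgroup_path() {
    awk -F: -v want="$1" '{
        # The second field is a comma-separated controller list, e.g. "cpu,cpuacct"
        n = split($2, controllers, ",")
        for (i = 1; i <= n; i++)
            if (controllers[i] == want) print $3
    }'
}

# Show the memory cgroup of this shell, if /proc is available
if [ -r "/proc/$$/cgroup" ]; then
    cgroup_path memory < "/proc/$$/cgroup"
fi
```

After the steps above this would be expected to print /me. On systems that use cgroup v2 the controller list is empty, so the helper prints nothing.
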

Userland isolation: seccomp

While both namespaces and cgroups go a significant way towards isolating processes into their own containers, Docker goes further and restricts what access the process has to the Linux kernel itself. This is enforced on supported operating systems via "SECure COMPuting with filters", also known as seccomp-bpf or simply seccomp.

$ grep CONFIG_SECCOMP= /boot/config-$(uname -r)
# Our system supports seccomp
CONFIG_SECCOMP=y
$ docker run --rm \
-it \
--security-opt seccomp=/path/to/seccomp/profile.json \
hello-world
  • bpf: The ability to load and run bpf programs
  • add_key: The ability to access the kernel keyring
  • kexec_load: The ability to load a new linux kernel
  • SELinux
  • AppArmor
  • auditd
  • Falco[24]
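
Whether a seccomp filter is actually attached to a process can be read back from the Seccomp field of /proc/[pid]/status (0 for disabled, 1 for strict mode, 2 for filter mode). The following is a minimal sketch; `seccomp_mode` is a hypothetical helper name.

```shell
# Read /proc/[pid]/status content on stdin and print the seccomp mode by name.
seccomp_mode() {
    awk '/^Seccomp:/ {
        if ($2 == 0) print "disabled"
        else if ($2 == 1) print "strict"
        else if ($2 == 2) print "filter"
        else print "unknown"
    }'
}

# Report the mode of this shell, if /proc is available
if [ -r "/proc/$$/status" ]; then
    seccomp_mode < "/proc/$$/status"
fi
```

Run inside a container started with Docker's default profile, this would be expected to print filter.
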

Distribution: the union file system

To generate a container, Docker requires a set of “build instructions”. A trivial image could be:

# Scratch space
$ cd $(mktemp -d)
# Create a docker file
$ cat <<EOF > Dockerfile
FROM debian:buster
# Create a test directory
RUN mkdir /test
# Create a bunch of spam files
RUN echo $(date) > /test/a
RUN echo $(date) > /test/b
RUN echo $(date) > /test/c
EOF
# Build the image
$ docker build .
Sending build context to Docker daemon 4.096kB
Step 1/5 : FROM debian:buster
---> ebdc13caae1e
Step 2/5 : RUN mkdir /test
---> Running in a9c0fa1a56c7
Removing intermediate container a9c0fa1a56c7
---> 6837541a46a5
Step 3/5 : RUN echo Sat 30 Mar 18:05:24 CET 2019 > /test/a
---> Running in 8b61ca022296
Removing intermediate container 8b61ca022296
---> 3ea076dcea98
Step 4/5 : RUN echo Sat 30 Mar 18:05:24 CET 2019 > /test/b
---> Running in 940d5bcaa715
Removing intermediate container 940d5bcaa715
---> 07b2f7a4dff8
Step 5/5 : RUN echo Sat 30 Mar 18:05:24 CET 2019 > /test/c
---> Running in 251f5d00b55f
Removing intermediate container 251f5d00b55f
---> 0122a70ad0a3
Successfully built 0122a70ad0a3
$ docker run \
--rm=true \
-it \
0122a70ad0a3 \
/bin/bash
$ cd /test
$ ls
a b c
$ cat *
Sat 30 Mar 18:05:24 CET 2019
Sat 30 Mar 18:05:24 CET 2019
Sat 30 Mar 18:05:24 CET 2019
$ docker run \
--rm=true \
-it \
07b2f7a4dff8 \
/bin/bash
$ ls test
a b
$ docker history 0122a70ad0a3
IMAGE CREATED CREATED BY SIZE COMMENT
0122a70ad0a3 5 minutes ago /bin/sh -c echo Sat 30 Mar 18:05:24 CET 2019… 29B
07b2f7a4dff8 5 minutes ago /bin/sh -c echo Sat 30 Mar 18:05:24 CET 2019… 29B
3ea076dcea98 5 minutes ago /bin/sh -c echo Sat 30 Mar 18:05:24 CET 2019… 29B
6837541a46a5 5 minutes ago /bin/sh -c mkdir /test 0B
ebdc13caae1e 12 months ago /bin/sh -c #(nop) CMD ["bash"] 0B
<missing> 12 months ago /bin/sh -c #(nop) ADD file:2219cecc89ed69975… 106MB
  1. A layer for a that renders the date to disk at 29B
  2. A layer for b that renders the date to disk at 29B
  • devicemapper
  • aufs
$ docker info | grep Storage
Storage Driver: overlay2
# Scratch space
$ cd $(mktemp -d)
# Create some layers
$ mkdir \
lower \
upper \
workdir \
overlay
# Create some files that represent the layers
$ touch lower/i-am-the-lower
$ touch upper/i-am-the-upper
# Create the layered filesystem at overlay with lower, upper and workdir
$ sudo mount -t overlay \
-o lowerdir=lower,upperdir=upper,workdir=workdir \
./overlay \
overlay
# List the directory
$ ls overlay/
i-am-the-lower i-am-the-upper
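
The lookup order that produces this merged view can be imitated without root privileges. The following is a simulation only, not the kernel's overlayfs: `union_lookup` is a hypothetical helper name, and features such as whiteouts for deleted files and copy-up on write are left out.

```shell
# Resolve a name the way a two-layer union presents it:
# the upper layer wins, falling back to the lower layer.
union_lookup() {
    upper="$1"
    lower="$2"
    name="$3"
    if [ -e "$upper/$name" ]; then
        printf '%s\n' "$upper/$name"
    elif [ -e "$lower/$name" ]; then
        printf '%s\n' "$lower/$name"
    else
        return 1
    fi
}

# Build two layers that both contain a file called "shared"
scratch="$(mktemp -d)"
mkdir "$scratch/lower" "$scratch/upper"
echo "from lower" > "$scratch/lower/shared"
echo "from upper" > "$scratch/upper/shared"
touch "$scratch/lower/i-am-the-lower"

# The upper copy of "shared" hides the lower one
cat "$(union_lookup "$scratch/upper" "$scratch/lower" shared)"
# Files that exist only in the lower layer remain visible
union_lookup "$scratch/upper" "$scratch/lower" i-am-the-lower
```

This mirrors why each image layer only needs to store what changed: anything not present in an upper layer is served from the layers beneath it.
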

Connectivity: networking

As mentioned earlier, containers make use of Linux namespaces. Of particular interest when understanding container networking is the network namespace. This namespace gives the process separate:

  • routing tables
  • iptables rules
# Create a new network namespace
$ sudo unshare --fork --net
# List the ethernet devices with associated ip addresses
$ ip addr
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
# List all iptables rules
root@sw-20160616-01:/home/andrewhowden# iptables -L
Chain INPUT (policy ACCEPT)
target prot opt source destination
Chain FORWARD (policy ACCEPT)
target prot opt source destination
Chain OUTPUT (policy ACCEPT)
target prot opt source destination
# List all network routes
$ ip route show
$ ping 127.0.0.1
PING 127.0.0.1 (127.0.0.1): 56 data bytes
ping: sending packet: Network is unreachable
$ ip link set lo up
root@sw-20160616-01:/home/andrewhowden# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
# Test the loopback adapter
$ ping 127.0.0.1
PING 127.0.0.1 (127.0.0.1): 56 data bytes
64 bytes from 127.0.0.1: icmp_seq=0 ttl=64 time=0.092 ms
64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.068 ms
$ echo $$
18171
$ sudo ip link add veth0 type veth peer name veth0 netns 18171
# Container
$ ip addr
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: veth0@if7: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether 16:34:52:54:a2:a1 brd ff:ff:ff:ff:ff:ff link-netnsid 0
$ ip route show
# No output
# On the host
$ ip addr add 192.168.24.1 dev veth0
# Within the container
$ ip address add 192.168.24.10 dev veth0
# Both host and container
$ ip link set veth0 up
# Both host and container
$ ip route add 192.168.24.0/24 dev veth0
# Within container
$ ping 192.168.24.1
PING 192.168.24.1 (192.168.24.1): 56 data bytes
64 bytes from 192.168.24.1: icmp_seq=0 ttl=64 time=0.149 ms
64 bytes from 192.168.24.1: icmp_seq=1 ttl=64 time=0.096 ms
64 bytes from 192.168.24.1: icmp_seq=2 ttl=64 time=0.104 ms
64 bytes from 192.168.24.1: icmp_seq=3 ttl=64 time=0.100 ms
# Within container
$ ping google.com
ping: unknown host
# On the host
# Allow the host to forward packets between its interfaces
$ echo 1 | sudo tee /proc/sys/net/ipv4/ip_forward
# Forward packets from the container to the host adapter
$ sudo iptables -A FORWARD -i veth0 -o wlp2s0 -j ACCEPT
# Forward return traffic for connections established via egress from the host adapter back to the container
$ sudo iptables -A FORWARD -i wlp2s0 -o veth0 -m state --state ESTABLISHED,RELATED -j ACCEPT
# Relabel the IPs for the container so return traffic will be routed correctly
$ sudo iptables -t nat -A POSTROUTING -o wlp2s0 -j MASQUERADE
# Within the container
$ ip route add default via 192.168.24.1 dev veth0
$ ping google.com
PING google.com (172.217.22.14): 56 data bytes
64 bytes from 172.217.22.14: icmp_seq=0 ttl=55 time=16.456 ms
64 bytes from 172.217.22.14: icmp_seq=1 ttl=55 time=15.102 ms
64 bytes from 172.217.22.14: icmp_seq=2 ttl=55 time=34.369 ms
64 bytes from 172.217.22.14: icmp_seq=3 ttl=55 time=15.319 ms
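
Why the container could reach 192.168.24.1 before the default route existed, but not an address such as 172.217.22.14, comes down to which destinations the 192.168.24.0/24 route covers. The following is a minimal sketch of that membership check in shell arithmetic; `ip_to_int` and `in_subnet` are hypothetical helper names, and the real kernel additionally prefers the most specific matching route.

```shell
# Convert a dotted-quad IPv4 address into a single integer.
ip_to_int() {
    old_ifs="$IFS"
    IFS=.
    # Split the address on "." into the four positional parameters
    set -- $1
    IFS="$old_ifs"
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Succeed if the address (first argument) falls inside the CIDR block (second argument).
in_subnet() {
    addr="$(ip_to_int "$1")"
    net="$(ip_to_int "${2%/*}")"
    bits="${2#*/}"
    mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
    [ $(( addr & mask )) -eq $(( net & mask )) ]
}

if in_subnet 192.168.24.1 192.168.24.0/24; then
    echo "192.168.24.1: covered by the veth0 route"
fi
if ! in_subnet 172.217.22.14 192.168.24.0/24; then
    echo "172.217.22.14: needs the default route"
fi
```

Any destination outside the subnet only becomes reachable once the default route via 192.168.24.1 is in place and the host forwards the traffic.
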
  1. A public facing eth0 (or similar) interface

Landscape review

Given our understanding of the implementation of containers, we can now take a look at some of the classic Docker discussions.

Systems Updates

One of the often overlooked aspects of containers is the need to keep both the containers and the host system up to date.

Init within container

Given our understanding of containers, it’s reasonable to conclude that the “1 process per container” advice is an oversimplification of how containers work, and that in some cases it makes sense to do service management within a container with a system like runit.

  • logrotate
  • cron

In Conclusion

Containers are an excellent way to ship software to production systems. They solve a swathe of interesting problems at very little cost. However, their rapid growth has caused some confusion in the industry as to exactly how they work, whether they’re stable and so forth. Containers are a combination of old and new Linux kernel technologies such as namespaces, cgroups, seccomp and Linux networking tooling, but they are as stable as any other kernel technology (so, very) and well suited to production systems.

References

  1. “Docker.” https://en.wikipedia.org/wiki/Docker_(software) .
  2. “Cloud Native Technologies in the Fortune 100.” https://redmonk.com/fryan/2017/09/10/cloud-native-technologies-in-the-fortune-100/ , Sep-2017.
  3. B. Cantrill, “The Container Revolution: Reflections After the First Decade.” https://www.youtube.com/watch?v=xXWaECk9XqM , Sep-2018.
  4. “Papers (Jail).” https://docs.freebsd.org/44doc/papers/jail/jail.html .
  5. “An absolutely minimal chroot.” https://sagar.se/an-absolutely-minimal-chroot.html , Jan-2011.
  6. J. Beck et al., “Virtualization and Namespace Isolation in the Solaris Operating System (PSARC/2002/174).” https://us-east.manta.joyent.com/jmc/public/opensolaris/ARChive/PSARC/2002/174/zones-design.spec.opensolaris.pdf , Sep-2006.
  7. M. Kerrisk, “Namespaces in operation, part 1: namespaces overview.” https://lwn.net/Articles/531114/ , Jan-2013.
  8. A. Polvi, “CoreOS is building a container runtime, rkt.” https://coreos.com/blog/rocket.html , Jan-2014.
  9. “Basics of the Unix Philosophy.” http://www.catb.org/~esr/writings/taoup/html/ch01s06.html .
  10. P. Estes and M. Brown, “OCI Image Support Comes to Open Source Docker Registry.” https://www.opencontainers.org/blog/2018/10/11/oci-image-support-comes-to-open-source-docker-registry , Oct-2018.
  11. “Open Container Initiative Runtime Specification.” https://github.com/opencontainers/runtime-spec/blob/74b670efb921f9008dcdfc96145133e5b66cca5c/spec.md , Mar-2018.
  12. “The 5 principles of Standard Containers.” https://github.com/opencontainers/runtime-spec/blob/74b670efb921f9008dcdfc96145133e5b66cca5c/principles.md , Dec-2016.
  13. “Open Container Initiative Image Specification.” https://github.com/opencontainers/image-spec/blob/db4d6de99a2adf83a672147d5f05a2e039e68ab6/spec.md , Jun-2017.
  14. “Open Container Initiative Distribution Specification.” https://github.com/opencontainers/distribution-spec/blob/d93cfa52800990932d24f86fd233070ad9adc5e0/spec.md , Mar-2019.
  15. “Docker Overview.” https://docs.docker.com/engine/docker-overview/ .
  16. J. Frazelle, “Containers aka crazy user space fun.” https://www.youtube.com/watch?v=7mzbIOtcIaQ , Jan-2018.
  17. “Use Host Networking.” https://docs.docker.com/network/host/ .
  18. Krallin, “Tini: A tini but valid init for containers.” https://github.com/krallin/tini , Nov-2018.
  19. https://chromium.googlesource.com/chromium/src.git/+/HEAD/docs/linux_sandboxing.md .
  20. L. Poettering, “systemd for Administrators, Part XVIII.” http://0pointer.de/blog/projects/resources.html , Oct-2012.
  21. A. Howden, “Coming to grips with eBPF.” https://www.littleman.co/articles/coming-to-grips-with-ebpf/ , Mar-2019.
  22. “Seccomp security profiles for docker.” https://docs.docker.com/engine/security/seccomp/ .
  23. “Linux kernel capabilities.” https://docs.docker.com/engine/security/security/#linux-kernel-capabilities .
  24. M. Stemm, “SELinux, Seccomp, Sysdig Falco, and you: A technical discussion.” https://sysdig.com/blog/selinux-seccomp-falco-technical-discussion/ , Dec-2016.
  25. “Pod Security Policies.” https://kubernetes.io/docs/concepts/policy/pod-security-policy/#seccomp .
  26. Programster, “Example OverlayFS Usage.” https://askubuntu.com/a/704358 , Nov-2015.
  27. “How do I connect a veth device inside an ’anonymous’ network namespace to one outside?” https://unix.stackexchange.com/a/396210 , Oct-2017.
  28. D. P. García, “Network namespaces.” https://blogs.igalia.com/dpino/2016/04/10/network-namespaces/ , Apr-2016.

See https://www.andrewhowden.com/