[Screenshot: tcpdump showing packets that were received and then … not.]

The curious case of Docker being a fuck head

A story about debugging an opaque network issue

So, in the course of our development work we operate across a number of different infrastructure environments, each with its own particular requirements. In this case the requirements were fairly normal: there was us, a server, and Docker running on that server.

$ ssh ${SERVER}
$ sudo docker-compose up -d

(the shell becomes unresponsive)

Things that could go wrong #1: Port conflict

It’s possible, in principle, to lose SSH access by running an application that itself wants SSH (GitLab, for example) and publishing that port with Docker on 0.0.0.0:22. Docker will stomp all over your iptables rules, and you’ll suddenly be completely unable to SSH into the machine.

  • All traffic was dropped; ICMP in particular, though we tested various other protocols to no avail
  • ICMP did, however, get through from outside our network
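
As a quick pre-flight check before publishing a container port, it’s worth asking whether anything on the host already owns it. A minimal Python sketch (not from the original incident; sshd normally owns 22):

```python
import socket

def port_in_use(port, host="127.0.0.1"):
    """Return True if something is already listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(1)
        return s.connect_ex((host, port)) == 0

# On a host where sshd is running, publishing 0.0.0.0:22 from a
# container fights it for the port:
# port_in_use(22)  # True whenever sshd (or anything else) holds 22
```

(`ss -tlnp` on the host tells the same story with less typing.)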

Things that could go wrong #2: Firewall conflicts

In this environment there are restrictive egress firewalls in place to prevent data being stolen off the machine.

Things that actually went wrong #3: Docker allocates a subnet already in use

We left it alone for some time. In the meantime we’d figured out that, somehow, we could reach the machine through another machine, which let us keep working even in the disaster case.

$ docker-compose stop
$ docker-compose up -d
$ ip addr
660: br-8221f7b9a761: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default 
    link/ether 02:42:b4:7a:15:82 brd ff:ff:ff:ff:ff:ff
    inet 192.168.0.1/16 brd 192.168.255.255 scope global br-8221f7b9a761
       valid_lft forever preferred_lft forever
    inet6 fe80::42:b4ff:fe7a:1582/64 scope link
       valid_lft forever preferred_lft forever
$ sudo docker network create --ip-range=192.168.0.1/32 --subnet=192.168.0.0/16 testing
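
What that bridge does to the host’s routing is the crux: 192.168.0.0/16 is now a directly connected network, so anything in that range gets routed into the bridge rather than wherever it used to go (in our case, over the VPN). A sketch with Python’s `ipaddress` module; the peer address below is a made-up example:

```python
import ipaddress

bridge = ipaddress.ip_network("192.168.0.0/16")  # what Docker allocated
vpn_peer = ipaddress.ip_address("192.168.0.10")  # hypothetical VPN-side host

# Once the bridge is up, the kernel's directly connected route for
# 192.168.0.0/16 wins, so replies to the peer go into the bridge
# and never back over the VPN:
print(vpn_peer in bridge)  # True -> traffic to the peer dies inside Docker
```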

Theorizing

Broadly, the problem is that Docker rotates through a set of private IP ranges when allocating networks:

  • 172.16.0.0–172.31.255.255
  • 192.168.0.0–192.168.255.255
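
Which means that whether a deployment survives `docker network create` is partly a dice roll on whether your office or VPN subnet lands inside one of those pools. A small sketch (the candidate subnets below are examples):

```python
import ipaddress

# Docker's default allocation pools, per the ranges above
DOCKER_POOLS = [
    ipaddress.ip_network("172.16.0.0/12"),   # 172.16.0.0-172.31.255.255
    ipaddress.ip_network("192.168.0.0/16"),  # 192.168.0.0-192.168.255.255
]

def collides_with_docker(subnet):
    """True if Docker might allocate a bridge on top of this subnet."""
    net = ipaddress.ip_network(subnet)
    return any(net.overlaps(pool) for pool in DOCKER_POOLS)

print(collides_with_docker("192.168.0.0/24"))  # True  -> risky
print(collides_with_docker("10.50.0.0/24"))    # False -> clear of these pools
```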

Resolution

The resolution at this point is straightforward: don’t use the 192.168.0.0/24 subnet. However, it does highlight some of the fragility of coordinating site-to-site VPNs, particularly when we don’t control both sides of the connection.
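
Another option (an assumption on our part, not something we had in place at the time) is to pin Docker’s default address pools in /etc/docker/daemon.json so it never reaches for 192.168.0.0/16 on its own. The 10.200.0.0/16 base below is an arbitrary example; restart the daemon after changing it:

```json
{
  "default-address-pools": [
    { "base": "10.200.0.0/16", "size": 24 }
  ]
}
```

With this in place, each new network gets a /24 carved out of 10.200.0.0/16 instead of Docker’s built-in pools.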

In Conclusion

This was a satisfying bug to chase down: it was not easy to debug, it stopped work in a dramatic way, and it wasn’t easily reproducible. Still, we learned a little about how this infrastructure is set up, and rebuilt some of our networking knowledge as a team.

Thanks

  • Behrouz Abbasi helped me debug this one.
  • Anton Boritskiy also helped, and had faith in us to deal with it … eventually.