(Screenshot: tcpdump showing packets that were received and then … not.)

The curious case of Docker being a fuck head

A story about debugging an opaque network issue

So, in the course of our development work we operate across a number of different infrastructure environments, each with its own particular requirements. In this case the requirements were fairly normal: there was us, a server, and Docker running on that server.

However, there was one small twist: access to this server was only granted over a VPN. So, like any practical nerds, we set up a permanent site → site VPN that we could access from within the office. Great! Everything works.

Usually.

Unfortunately, we ended up in a situation something like the following:

Occasionally, SSH to the server would simply hang. This was a truly odd bug, and it only happened sometimes; at first it wasn’t even clear that it was a Docker bug at all, and it was a little while before we determined it was ${SOMETHING_WITH_DOCKER}.

Things that could go wrong #1: Port conflict

It’s possible, in principle, to lose SSH access by setting up an application that also needs SSH (GitLab, for instance) and forwarding that port with Docker to 0.0.0.0:22. Docker’s iptables rules then grab inbound traffic on port 22 before it reaches the host’s own sshd, and you’ll suddenly be completely unable to SSH into the machine.
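For illustration only, this is the sort of configuration we suspected; the GitLab image and the exact mapping here are hypothetical, not what was actually running on the box:

    # Hypothetical example of the kind of mapping we feared: publishing a
    # container's SSH daemon onto the host's port 22. In the failure mode we
    # imagined, Docker's iptables rules grab inbound port-22 traffic before
    # the host's own sshd ever sees it.
    docker run -d --name gitlab \
      -p 0.0.0.0:22:22 \
      gitlab/gitlab-ce:latest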

At least, that’s what we thought. It turns out this wasn’t the case. Once we regained access to the machine, thanks to our gracious hosts, we determined:

  • We didn’t have a service forwarding port 22 at all
  • All traffic from our side was dropped: we noticed ICMP specifically, but the other protocols we tested fared no better
  • ICMP was available from outside our network (rough checks sketched below)
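A rough sketch of the checks behind those bullet points; the addresses are made up for illustration:

    # From the office, over the site → site VPN: no replies came back.
    ping -c 3 10.8.0.10        # hypothetical VPN-side address of the server

    # From a host outside our network (a throwaway cloud VM): replies were fine.
    ping -c 3 203.0.113.10     # hypothetical public address of the server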

So it was clearly “Something something VPN”

Things that could go wrong #2: Firewall conflicts

In this environment there are restrictive egress firewalls in place to prevent data being stolen off the machine.

These firewalls were presumably reloaded after each invocation of docker-compose ${SOMETHING}, as docker created and deleted virtual adapters.
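For the curious, this is roughly how you can watch that adapter churn happen around a compose run; the stack itself is whatever you happen to be running:

    # List bridge and veth interfaces before bringing the stack up.
    ip -br link | grep -E 'br-|veth'

    # Bring the stack up: Docker creates a br-<network id> bridge plus one
    # veth pair per container, and removes them again on 'down'.
    docker-compose up -d
    ip -br link | grep -E 'br-|veth'
    docker-compose down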

However, for this to be true, the problem would need to have been consistent. Unfortunately, it happened only intermittently.

Things that actually went wrong #3: Docker allocates a subnet in use

We left it alone for some time. In the meantime, we’d figured out that, somehow, we could get access to the machine through another machine, which allowed us to keep working even in the disaster case.

Notably, this also allowed us to snapshot machine state when it was broken. So, we waited.

Aaaand waited.

Aaaaaaaaaaand waited.

A day later: It breaks again.

So, we figured it was network-related. We ran ip addr, saved the output to /tmp/foo, and restarted Docker.
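Roughly the commands in question (the restart method is a sketch, assuming a systemd host):

    # Snapshot the current interface and address state while things are broken.
    ip addr > /tmp/foo

    # Restart the Docker daemon.
    sudo systemctl restart docker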

The connection came back up. Odd. Okay, let’s save the output of ip addr into /tmp/bar and diff the two. We fairly quickly noted the oddity:
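Something like this, where the extract shown is illustrative rather than the original capture:

    # Snapshot the state again now that things work, and compare.
    ip addr > /tmp/bar
    diff /tmp/foo /tmp/bar

    # Illustrative extract: a line present only in the broken snapshot,
    # showing a Docker bridge carrying an address in the office's own range.
    #   < inet 192.168.0.1/24 brd 192.168.0.255 scope global br-1a2b3c4d5e6f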

In one case, there was an uncomfortably familiar IP range — the office network range. Could it be … an IP address conflict?

We reproduced the command that would break the connection:
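The exact command isn’t reproduced here, but with hindsight the same failure can be triggered by something as small as creating a Docker network on the office’s subnet:

    # Hypothetical reproduction: force Docker onto the office's 192.168.0.0/24
    # range. As soon as this network exists, return traffic to office addresses
    # is routed into the bridge instead of back over the VPN.
    docker network create --subnet 192.168.0.0/24 conflict-test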

Boom! Network dead.

Theorizing

Broadly, the problem is that Docker rotates through a set of IP ranges that it’s allowed to allocate from. These ranges are all private:

  • 10.0.0.0–10.255.255.255
  • 172.16.0.0–172.31.255.255
  • 192.168.0.0–192.168.255.255

However, the site → site VPN meant that we were not connecting from one public network to another, as is usual, but between two private networks.

Once Docker allocated a range that conflicted with our office network, traffic would still reach the machine successfully, but the replies would be inadvertently routed into the docker-compose network rather than back to the office as intended.
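You can see the mechanics in the routing table. The addresses below are illustrative, assuming an office network of 192.168.0.0/24:

    # Illustrative routing table once Docker has claimed the office's range.
    # Replies to 192.168.0.x now match the connected bridge route, which is
    # more specific than the default route back toward the VPN, so they never
    # leave the machine.
    $ ip route
    default via 203.0.113.1 dev eth0
    192.168.0.0/24 dev br-1a2b3c4d5e6f proto kernel scope link src 192.168.0.1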

The issue happened only sporadically because Docker iterates through a limited set of the private IP ranges by default, so it was simply a matter of time before the conflicting range came up again.

Resolution

The resolution at this point is straightforward: keep Docker away from the 192.168.0.0/24 subnet (one way to do that is sketched below). However, the episode does highlight some of the fragility of coordinating site → site VPNs, particularly when we don’t control both sides of the connection.
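As a sketch of that fix, assuming a reasonably recent Docker daemon that supports default-address-pools, you can pin Docker’s automatic allocations well away from the office range in /etc/docker/daemon.json:

    # Sketch: restrict Docker's automatic subnet allocation to a slice of
    # 172.16.0.0/12 so it can never collide with the office's 192.168.0.0/24.
    # The resulting /etc/docker/daemon.json looks like:
    #   { "default-address-pools": [ { "base": "172.20.0.0/14", "size": 24 } ] }
    echo '{ "default-address-pools": [ { "base": "172.20.0.0/14", "size": 24 } ] }' \
      | sudo tee /etc/docker/daemon.json > /dev/null
    sudo systemctl restart docker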

In Conclusion

This was a satisfying bug to chase down, as it was not easy to debug, it stopped work in a dramatic way, and it wasn’t easily reproducible. However, we learned a little about how this infrastructure is set up, and rebuilt some of our network knowledge as a team.

Nerd success. Happy Friday, y’all.

Thanks

  • Behrouz Abbasi helped me debug this one.
  • Anton Boritskiy also helped, and had faith in us to deal with it … eventually.