Photo by Analise Benevides on Unsplash (Edited)

Coming to grips with eBPF

Understanding what (e)BPF is and what it's used for

I have a fairly long history of using Linux for a number of purposes. After being assigned Linux as a development machine while working with the team at Fontis, a combination of curiosity and the need to urgently repair that machine as a result of curiosity-driven stick pokery meant that I picked up a large amount of Linux trivia fairly quickly. I built further on this while helping set up Sitewards' infrastructure tooling: a much more heterogeneous set of machines and providers, but one with a standard approach emerging, built on Docker and Kubernetes.

What is BPF?

The original “Berkeley Packet Filter” was derived from a paper written by Steve McCanne and Van Jacobson in 1992 for the Berkeley Software Distribution. Its purpose was to allow efficient capture of packets from the kernel to userland, by compiling a program that filtered out packets that should not be copied across. It was subsequently employed in utilities such as tcpdump.

How does BPF work?

BPF is a sequence of 64-bit instructions. These instructions are generally generated by an intermediary such as tcpdump (via libpcap):

# See https://blog.cloudflare.com/bpf-the-forgotten-bytecode/
$ sudo tcpdump -i wlp2s0 'ip and tcp' -d
(000) ldh [12]              # Load a half-word (2 bytes) from the packet at offset 12:
                            # the EtherType field of the Ethernet frame.
(001) jeq #0x800 jt 2 jf 5  # If the value is 0x0800 (an IP packet on top of an
                            # Ethernet frame), continue at (002); otherwise jump to (005).
(002) ldb [23]              # Load a byte from the packet at offset 23: the
                            # "protocol" field, 9 bytes into the IP header.
(003) jeq #0x6 jt 4 jf 5    # If the value is 0x6 (the TCP protocol number),
                            # continue at (004); otherwise jump to (005).
(004) ret #262144           # Match: accept the packet, capturing up to 262144 bytes.
(005) ret #0                # No match: return 0, dropping the packet.
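Such a filter can also be attached to a socket by hand. The following is a rough C sketch (error handling abbreviated; requires root) that attaches the numeric form of this exact bytecode, as printed by tcpdump -dd, to a raw socket with SO_ATTACH_FILTER, so the kernel only ever delivers IPv4 TCP packets to it:

// Attach the "ip and tcp" filter above to a raw socket.
// A minimal sketch; requires CAP_NET_RAW (run as root).
#include <stdio.h>
#include <sys/socket.h>
#include <arpa/inet.h>
#include <linux/if_ether.h>
#include <linux/filter.h>

int main(void)
{
    // The numeric encoding of the bytecode above, as printed by `tcpdump -dd`.
    struct sock_filter code[] = {
        { 0x28, 0, 0, 0x0000000c }, /* (000) ldh [12]              */
        { 0x15, 0, 3, 0x00000800 }, /* (001) jeq #0x800  jt 2 jf 5 */
        { 0x30, 0, 0, 0x00000017 }, /* (002) ldb [23]              */
        { 0x15, 0, 1, 0x00000006 }, /* (003) jeq #0x6    jt 4 jf 5 */
        { 0x06, 0, 0, 0x00040000 }, /* (004) ret #262144           */
        { 0x06, 0, 0, 0x00000000 }, /* (005) ret #0                */
    };
    struct sock_fprog prog = {
        .len    = sizeof(code) / sizeof(code[0]),
        .filter = code,
    };

    int sock = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (sock < 0) { perror("socket"); return 1; }

    if (setsockopt(sock, SOL_SOCKET, SO_ATTACH_FILTER, &prog, sizeof(prog)) < 0) {
        perror("setsockopt");
        return 1;
    }

    // From here on, recv() only ever sees packets the filter accepted.
    char buf[2048];
    ssize_t n = recv(sock, buf, sizeof(buf), 0);
    printf("received an IPv4 TCP packet of %zd bytes\n", n);
    return 0;
}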
Originally these programs were limited to sockets, but BPF has since been extended (“eBPF”) and can now be attached at many points in the kernel. Before any program is executed, an in-kernel verifier checks it, guaranteeing that:

  • There are no unreachable instructions
  • Every register and stack state is valid
  • Registers with uninitialized content are not read
  • The program only accesses structures appropriate for its BPF program type
  • (Optionally) pointer arithmetic is prevented
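To make the last of those rules concrete, here is a minimal sketch in the restricted C that clang compiles to BPF (illustrative, not taken from any real project). The program only passes the verifier because every packet read is preceded by a bounds check:

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

SEC("xdp")
int verifier_demo(struct xdp_md *ctx)
{
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;
    struct ethhdr *eth = data;

    // Reading eth->h_proto *before* this check fails verification with
    // an "invalid access to packet" style error: the verifier cannot
    // prove the read stays inside the packet.
    if ((void *)(eth + 1) > data_end)
        return XDP_PASS;

    // After the check, the same read is provably safe and is accepted.
    if (eth->h_proto == bpf_htons(ETH_P_IP))
        bpf_printk("saw an IPv4 packet");

    return XDP_PASS;
}

char LICENSE[] SEC("license") = "GPL";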
One of the easier ways to experiment with eBPF is via the BPF Compiler Collection (BCC), which ships a number of example programs:

# Clone the repository
$ git clone https://github.com/iovisor/bcc.git
Cloning into 'bcc'...
Receiving objects: 100% (17648/17648), 8.42 MiB | 1.21 MiB/s, done.
Resolving deltas: 100% (11460/11460), done.
# Pick the DNS matching example
$ cd bcc/examples/networking/dns_matching
# Run it!
$ sudo ./dns_matching.py --domains fishfingers.io
>>>> Adding map entry: fishfingers.io
Try to lookup some domain names using nslookup from another terminal.
For example: nslookup foo.bar
BPF program will filter-in DNS packets which match with map entries.
Packets received by user space program will be printed here
Hit Ctrl+C to end...

# In another terminal
$ dig fishfingers.io

# Back in the first terminal, the matching DNS question is printed:
Hit Ctrl+C to end...[<DNS Question: 'fishfingers.io.' qtype=A qclass=IN>]
Under the hood, the BPF program generated by this example:

  1. Checks whether the packet is UDP
  2. Checks whether it is addressed to port 53
  3. Checks whether the supplied DNS name is within the payload

eBPF in the wild

To understand where eBPF sits in the infrastructure ecosystem, it's worth looking at where companies have chosen it over more conventional ways of solving a problem.

Firewall

The de facto implementation of a Linux firewall uses iptables as its underlying enforcement mechanism. iptables allows configuring a set of netfilter tables that manipulate packets in a number of ways. For example, the following rule drops all connections from the IP address 10.10.10.10:

iptables -A INPUT -s 10.10.10.10/32 -j DROP
This model has served well for a long time, but it has limitations: notably, iptables updates must be made by recreating and updating all rules in a single transaction, which becomes expensive as the rule set grows. A firewall built on BPF improves on this in several ways:

  • A packet is matched against the “closest” rule, rather than by iterating over the entire rule set.
  • The program can introspect specific packet data when deciding whether to drop.
  • It can be compiled and run in the Linux “eXpress Data Path” (XDP), the earliest possible point at which software can interact with network traffic.
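As a sketch of what this looks like in practice, the following illustrative XDP program (restricted C using libbpf conventions; not production code) implements the equivalent of the iptables rule above as a hash-map lookup, with userspace expected to insert 10.10.10.10 into the blocklist map:

// A map-driven XDP firewall sketch: drop IPv4 packets whose source
// address appears in a blocklist. Compile with e.g.
//   clang -O2 -g -target bpf -c xdp_firewall.c -o xdp_firewall.o
// and attach with e.g.
//   ip link set dev eth0 xdp obj xdp_firewall.o sec xdp
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

// Userspace inserts blocked addresses (e.g. 10.10.10.10) as keys here.
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 1024);
    __type(key, __u32);   // IPv4 source address, network byte order
    __type(value, __u8);  // presence marker; the value itself is unused
} blocklist SEC(".maps");

SEC("xdp")
int xdp_firewall(struct xdp_md *ctx)
{
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    // As in the verifier example earlier, every packet access must be
    // bounds-checked before the program will load.
    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end || eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end)
        return XDP_PASS;

    // Matching is a single hash lookup rather than a walk over an
    // ordered rule list, and it happens before the kernel allocates
    // any per-packet bookkeeping structures.
    if (bpf_map_lookup_elem(&blocklist, &ip->saddr))
        return XDP_DROP;

    return XDP_PASS;
}

char LICENSE[] SEC("license") = "GPL";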

Kernel tracing & instrumentation

After running Linux in production for some period of time, we invariably run into issues. In the past I've had to debug:

  • Workload CPU performance
  • Software not loading configuration
  • Software becoming stalled
  • Systems being “slow” for no apparent reason
There is a standard set of tools for investigating these sorts of issues, including:

  • top
  • sysdig
  • iotop
  • df
  • perf

strace is another well-known example; it shows the system calls a process makes, though at a significant performance cost:
$ strace -e file cat /tmp/foo
execve("/bin/cat", ["cat", "/tmp/foo"], 0x7fffc2c8c308 /* 56 vars */) = 0
access("/etc/ld.so.preload", R_OK) = 0
openat(AT_FDCWD, "/etc/ld.so.preload", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libsnoopy.so", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libpthread.so.0", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libdl.so.2", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/tmp/foo", O_RDONLY) = 3
hi
BPF can surface the same information by attaching a probe to the do_sys_open kernel function, here via bcc's trace.py, without stopping the traced process at every system call:

# In one terminal
$ sudo ./trace.py 'do_sys_open "%s", arg2' | grep 'cat'
# In another terminal
$ cat /tmp/foo
# Back in the first terminal, each file cat opens is printed:
13785   13785   cat   do_sys_open   /etc/ld.so.preload
13785   13785   cat   do_sys_open   /lib/x86_64-linux-gnu/libsnoopy.so
13785   13785   cat   do_sys_open   /etc/ld.so.cache
13785   13785   cat   do_sys_open   /lib/x86_64-linux-gnu/libc.so.6
13785   13785   cat   do_sys_open   /lib/x86_64-linux-gnu/libpthread.so.0
13785   13785   cat   do_sys_open   /lib/x86_64-linux-gnu/libdl.so.2
13785   13785   cat   do_sys_open   /usr/lib/locale/locale-archive
13785   13785   cat   do_sys_open   /tmp/foo
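Tools like trace.py are thin wrappers around kernel probes (“kprobes”) to which BPF programs are attached. With bpftrace installed, a comparable trace is a one-liner, assuming a kernel that still exposes do_sys_open (newer kernels route opens through do_sys_openat2 instead):

$ sudo bpftrace -e 'kprobe:do_sys_open { printf("%s %s\n", comm, str(arg1)); }'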

Network visibility

Given the history of BPF in packet filtering, a logical next step is collecting statistics from the network for later analysis.

Historically, such data has been collected by periodically reading and parsing a set of files in /proc and /sys, such as:

  • /proc/net/ip_vs
  • /proc/net/ip_vs_stats
  • /sys/class/net/
  • /proc/net/netstat
  • /proc/net/sockstat
  • /proc/net/tcp
  • /proc/net/tcp6

BPF instead allows hooking into the network stack and exporting events as they happen. For example, bcc's tcplife summarises TCP sessions:
# Trace remote port 443
$ sudo ./tcplife.py -D 443
$ curl https://www.andrewhowden.com/ > /dev/null
PID   COMM   LADDR        LPORT  RADDR          RPORT  TX_KB  RX_KB  MS
7362  curl   10.1.1.247   43074  34.76.108.124  443    0      16     3369.32

Using BPF

Given the above, BPF seems like a compelling technology that's worth investing time in learning. However, there are some difficulties in getting BPF to work properly:

BPF is only in “recent” kernels

BPF is an area of the Linux kernel that's undergoing rapid development. Accordingly, features may be incomplete or missing entirely, tools may not work as expected, and their failure conditions are not always well documented. If the kernels used in production are fairly modern, BPF may provide considerable utility. If not, it's perhaps worth waiting until development in this area slows down and an LTS kernel with good BPF compatibility is released.

It’s hard to debug

BPF is fairly opaque at the moment. While there are bits of documentation here and there, and one can always go and read the kernel source, it's not as easy to debug as (for example) iptables or other long-established system tools. It may be difficult to debug network issues created by improperly constructed BPF programs. The advice here is the same as for other new or bespoke technologies: ensure that multiple team members understand and can debug it, and if they can't, or those people are not available, pick another technology.

It’s an implementation detail

It's my suspicion that the vast majority of our interaction with BPF will not be interaction of our own design. BPF is useful in the design of analysis tools, but the burden of writing it is perhaps too large to place on the shoulders of systems administrators. Accordingly, to start reaping the benefits of BPF, it's worth investing in tools that use this technology. These include:

  • BCC Tools
  • bpftrace
  • Sysdig

Conclusion

BPF is an old technology that has had new life breathed into it with the extended instruction set, the implementation of a JIT, and the ability to execute BPF programs at various points in the Linux kernel. It provides a way to export information about, or modify, Linux kernel behaviour at runtime without needing to reboot or reload the kernel, including for transient systems introspection. BPF probably has its most immediate ramifications for network performance, as networks need to handle a truly bizarre level of both traffic and complexity, and BPF provides some concrete solutions to these problems. Accordingly, it's a good idea to start understanding BPF in the context of networks, particularly as an alternative to investing further in nftables or iptables. BPF additionally provides some compelling insights into both system and network visibility that are otherwise difficult or impossible to achieve, though this area is somewhat more nascent than the network implementations.
