That’s a nice service. Be a shame if it… went missing.

Debugging perfectly good pages returning a 404

Feb 1, 2019

So, as part of ongoing learning, I am developing what would have to be the most complex wedding website in the history of man. It is a Go-powered, multi-service, JSON → gRPC transcoded, Istio-mediated REST API consumed by an Angular frontend service, and I had a bug with CORS.

It actually started out as a bug with CORS. The API is transcoded from REST to gRPC by the excellent gateway written in Golang and forwarded on to the service. This worked fine with $ curl, but not from a browser: browsers enforce the same-origin policy, and a server has to opt in to cross-origin requests by returning Cross-Origin Resource Sharing (CORS) headers. I hadn’t implemented these. I did, pushed it to prod, and the problem went away.

Sometimes.

The problem was that the page would sometimes get a 404 on the OPTIONS (preflight) request.

Debuggery

Theory 1: I can’t develop well

The simplest theory is that when implementing the CORS component of the application I missed some request property in the routing stack that the browser fired off.

However, after looking for this for a while, I used the “copy as curl” feature of Chrome to reproduce the issue outside the browser so I could manipulate the request more easily.

And it worked:

curl 'https://api.tld.com/v1alpha2/check-in' \
    -X OPTIONS \
    -H 'Host: api.tld.com' \
    -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0' \
    -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' \
    -H 'Accept-Language: en-US,en;q=0.5' --compressed \
    -H 'Access-Control-Request-Method: GET' \
    -H 'Access-Control-Request-Headers: authorization' \
    -H 'Origin: http://tld.local' \
    -H 'Connection: keep-alive' \
    -H 'Cache-Control: max-age=0' \
    -I

HTTP/2 200
access-control-allow-origin: http://tld.local
access-control-allow-credentials: true
access-control-allow-methods: GET,POST,PATCH,PUT,DELETE,OPTIONS
access-control-allow-headers: Content-Type,Accept,Authorization
date: Fri, 01 Feb 2019 09:40:38 GMT
server: envoy

That was the bizarre part: the exact request that failed in the browser succeeded from curl.

Theory 2: Envoy is not propagating CORS requests

Another theory while debugging was that Envoy was somehow not propagating the CORS requests. To address this, I introduced the CORS configuration to Istio such that Istio should return CORS headers appropriately.

While this worked (the response above is generated by Istio and not the app), the 404 persisted.

Theory 3: There’s a bug in Chrome

The next theory was that this was perhaps a Chrome-specific bug. Indeed, after switching to Firefox the issue didn’t initially present.

However, after a few minutes it presented differently.

The OPTIONS request now worked! Hooray! However, the authentication server started to 404. Reloading the page caused the authentication server to respond correctly, but the API server to 404.

Live tweeting the fixing.

Theory 4: Envoy is not delimiting load balancing requests

The last theory was that the connection was being reused across multiple domains, and that sharing the connection was causing ${SOMETHING} to go wrong in the Istio/Envoy combination.
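A toy model makes this theory concrete. The sketch below is not Envoy or Istio code, and it is an assumption (not something established in this post) that the proxy picks its set of routes once per connection from the TLS server name and then applies it to every request on that connection. The conn type, the routes table, and serve are all hypothetical names.

```go
package main

import "fmt"

// conn models one TLS connection; sni is fixed at handshake time.
type conn struct{ sni string }

// routes maps the server name used at handshake to the Host values the
// toy proxy will serve on that connection (hypothetical configuration).
var routes = map[string]map[string]bool{
	"api.tld.com":   {"api.tld.com": true},
	"login.tld.com": {"login.tld.com": true},
}

// serve returns the status the toy proxy gives a request for host that
// arrives on connection c.
func serve(c conn, host string) int {
	if routes[c.sni][host] {
		return 200
	}
	return 404
}

func main() {
	// The browser opens one connection for the login host...
	shared := conn{sni: "login.tld.com"}
	fmt.Println(serve(shared, "login.tld.com")) // 200

	// ...and then reuses it for the API host.
	fmt.Println(serve(shared, "api.tld.com")) // 404

	// On its own connection the API host is fine.
	fmt.Println(serve(conn{sni: "api.tld.com"}, "api.tld.com")) // 200
}
```

Under that assumption, whichever host opened the connection first works and the other one 404s, which matches the way the failure moved between the API and the authentication server on reload.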

The necessary background is that api.tld.com and login.tld.com both resolve to the same IP:

$ dig api.tld.com +short
35.205.247.239
$ dig login.tld.com +short
35.205.247.239

They additionally share a single TLS certificate that lists both names as Subject Alternative Names (SANs).

$ openssl s_client -connect andrewhowden.com:443 -showcerts | openssl x509 -noout -text
            X509v3 Subject Alternative Name: 
                DNS:andrewhowden.com, DNS:api.tld.com, DNS:login.tld.com, tld.com, DNS:pgp.andrewhowden.com, DNS:www.tl.com, DNS:www.tld.com

That means that, in principle, the connection can be reused. It took a little while to reach this conclusion, and when I did reach it I looked for ways to verify it. It turns out to be surprisingly difficult to verify whether the connection is being shared: chrome://net-internals shows two connections, one prefixed with pm/:

Not the actual net internals (bug is fixed), but an example.

The “fix”

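I had previously done some research on how TLS works, and both hosts require TLS. TLS is negotiated per TCP connection, not per HTTP request, so the session keys belong to the connection rather than to a hostname. With HTTP/2, a browser may reuse an existing connection for a second hostname when that name resolves to the same IP and is covered by the certificate already presented.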

So, to force different connections I gave them different certificates, one per host, without the shared SAN entries. With separate certificates, the one presented for the first host no longer covers the second, so the browser has to open a new connection and negotiate its own keys for it.

No more connection reuse!

The issue immediately went away. At the time it was ~11:30PM, and I kind of could not believe it.

Learnings

Like many bugs, this one was me discovering that my conceptual model of how things worked (that is, separate connections per domain) did not match how they actually worked.

Having done the theoretical reading about TCP and TLS beforehand, I knew in principle what the problem could be. Despite not being able to “prove” that my theory was correct, pushing the “fix” was harmless if there was no bug, and fixed it if there was.

So, the process was just:

Investigate
Hypothesize
Alter
Repeat

Additionally, I was debugging this late at night and on a personal project. Late at night meant I was tired, which contributed to an inability to think creatively about the problem. A personal project meant there was no stress about getting it fixed: the worst person I could disappoint was me (and wifey).

Rediscovery

While writing this post, I found not only that I was not the first to discover this, but that there is a “better” fix. I will implement this … at some point.