DNS, eglibc and resolv-replace on Heroku

2015-03-01: Fixed versions of eglibc are available for Ubuntu Precise and Trusty. Time to update.

I work on the team that runs Heroku Postgres. As we have continued to grow, I have been using Rollbar to track an intermittent error that occurs about once every 50,000 HTTP requests. Because we make many hundreds of thousands of API calls a minute to various services, this error pops up fairly frequently and in very inconvenient places. The most common traceback indicates a failure to resolve DNS:

#<SocketError: getaddrinfo: Name or service not known>
/app/vendor/ruby-2.1.5/lib/ruby/2.1.0/net/http.rb:879:in 'initialize'
/app/vendor/ruby-2.1.5/lib/ruby/2.1.0/net/http.rb:879:in 'open'
/app/vendor/ruby-2.1.5/lib/ruby/2.1.0/net/http.rb:879:in 'block in connect'
...

Google led me to a pertinent blog post that recommended using Ruby's Resolv library for all DNS requests via a script called resolv-replace. Adding a single line to our initializers, require 'resolv-replace', caused errors while submitting Logplex messages to drop immediately:

[Graph: Logplex Errors]

As did errors from trying to interact with our monitoring service, Observatory:

[Graph: Observatory Errors]
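The one-line change described above can be sketched as follows. This is a minimal illustration rather than our exact initializer: the file path in the comment is a hypothetical Rails-style location, and the calls after the require are there only to show that the patch is active.

```ruby
# config/initializers/resolv_replace.rb (hypothetical path)
# Loading resolv-replace patches Ruby's socket classes so that hostname
# lookups go through the pure-Ruby Resolv library instead of the
# system's getaddrinfo().
require 'socket'
require 'resolv-replace'

# resolv-replace keeps the original lookup under an alias, so its
# presence is a simple way to confirm the patch has been loaded.
patched = IPSocket.respond_to?(:original_resolv_getaddress)
puts "resolv-replace active: #{patched}"

# IP literals are returned as-is by Resolv, so this needs no network.
puts IPSocket.getaddress("127.0.0.1")
```

Because the require changes socket behaviour for the whole process, it belongs as early in boot as possible, before any HTTP clients open connections.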

In an internal thread, Ed Muller pointed out a Go workaround for a bug in glibc that is very likely a factor in this error:

Under high load, getaddrinfo() starts sending DNS queries to random file descriptors, e.g. some unrelated socket connected to a remote service.

Heroku is a shared platform with multitenant runtime instances, so any runtime can experience high load, and the cedar-14 glibc binaries are known to be affected by this bug. Version 2.20 of glibc contains a fix, which was backported to Ubuntu Precise in 2.15-0ubuntu10.11 and to Ubuntu Trusty in 2.19-0ubuntu6.6. When this post was first written, Precise still shipped 2.15-0ubuntu10.10 and Trusty 2.19-0ubuntu6.5, so the bug may continue to affect systems that have not yet been updated.
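To check whether a given package version includes the fix, the version numbers above can be compared mechanically. The sketch below is an illustration only: the helper names are hypothetical, and the comparison simply compares the integer components of the version string, which is enough for the versions named here but is not a full implementation of Debian version ordering.

```ruby
# Fixed package versions named in the text, keyed by upstream glibc release.
FIXED_VERSIONS = {
  [2, 15] => "2.15-0ubuntu10.11", # Ubuntu Precise
  [2, 19] => "2.19-0ubuntu6.6"    # Ubuntu Trusty
}.freeze

# Simplification (an assumption of this sketch): reduce a version string
# to its integer components, e.g. "2.19-0ubuntu6.5" => [2, 19, 0, 6, 5].
def version_parts(version)
  version.scan(/\d+/).map(&:to_i)
end

# Returns true if the version contains the fix, false if it predates it,
# and nil if the upstream release is not one of the series listed above.
def glibc_fixed?(version)
  parts = version_parts(version)
  upstream = parts.first(2)
  return true if (upstream <=> [2, 20]) >= 0 # fixed upstream in 2.20

  fixed = FIXED_VERSIONS[upstream]
  return nil unless fixed

  (parts <=> version_parts(fixed)) >= 0
end

puts glibc_fixed?("2.19-0ubuntu6.5")   # Trusty before the backport
puts glibc_fixed?("2.15-0ubuntu10.11") # Precise with the backport
```

On Ubuntu, the installed version string can be read with dpkg-query -W -f='${Version}' libc6 and passed to the helper.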

My immediate recommendation is to use language-native DNS resolution such as resolv-replace whenever possible, on Heroku or on other systems. However, if you require IPv6 or run into problems with third-party gems attempting to resolve nil addresses, and are stuck with the system DNS, upgrade to the fixed eglibc packages yourself. Before the fixed packages were available, the advice here was to indicate that this bug affects you on the Launchpad bug report requesting a backport to supported versions of Ubuntu.
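Where the global patch conflicts with a gem, one possible middle ground is to call Resolv directly for the lookups you control. This is only a sketch of that idea, not something we deployed: native_addresses is a hypothetical helper, and the nil check is an assumption about how you might want to fail early on the nil-address case mentioned above.

```ruby
require 'resolv'

# Hypothetical helper: resolve a hostname with Ruby's Resolv library
# without patching the socket classes for the whole process.
def native_addresses(host)
  # Fail early and clearly rather than passing nil to the resolver.
  raise ArgumentError, "host must not be nil" if host.nil?

  # Returns an array of address strings, empty if nothing was found.
  Resolv.getaddresses(host)
end

# IP literals are returned unchanged, so this needs no network access.
p native_addresses("127.0.0.1")
```

Code paths that still go through the system resolver are unaffected by this helper, so upgrading the eglibc package remains the actual fix for them.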

Thanks to Ed Muller, Michael Hale, Keiko Oda, Steve Conklin, Terence Lee and Richard Schneeman for help in figuring this out.