We had been racking our brains on this one for a couple weeks. We have monit looking over our Mongrels, which usually keeps everything on the up and up. But every so often, our server would go bananas and the nginx error log would flood with the message:
939#0: accept() failed (24: Too many open files) while accepting new connection
Usually the problem resolved itself automatically, but last night it didn’t. Taking the error at face value, our server guy started looking at the number of open files on our system and the maximum that could be opened (it’s confusing… “ulimit -a” reports one limit while “cat /proc/sys/fs/file-max” reports another; I think the former is the per-process limit on open file descriptors, while the latter is the system-wide limit on file handles, which also counts open IP connections and such). But even after upping the limit and rebooting repeatedly, the problem persisted.
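For the curious, here’s a minimal sketch of the two limits in question, checked from Ruby; the names are standard Linux/POSIX, nothing site-specific assumed:

```ruby
# Per-process limit on open file descriptors -- this is what
# "ulimit -n" reports, and it covers sockets as well as files.
soft, hard = Process.getrlimit(:NOFILE)
puts "per-process fd limit: soft=#{soft}, hard=#{hard}"

# System-wide ceiling on file handles across all processes --
# this is what /proc/sys/fs/file-max holds (Linux only).
file_max = File.read("/proc/sys/fs/file-max").to_i
puts "system-wide file handle limit: #{file_max}"
```

Since accepted connections consume file descriptors, it’s the per-process limit on the nginx worker that the error message is actually complaining about.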
After server guy (literally) fell asleep on the keyboard around 2 AM, I figured out what had really been happening: any time a new visitor came to our site, we were geocoding their IP with a service that had gone AWOL. About a week earlier I’d noticed a similar 1–2 second slowdown on actions that created sessions, but I assumed the session creation itself was causing the lag, when in fact it was the geocoding that ran alongside it.
Long story short, when nginx gives this error, what it really seems to mean is that nginx is holding too many open connections. In our case that happened because we use round-robin dispatching (bad, I know, but we have our reasons), so when one or more of the Mongrels gets stuck, the Mongrel queue skyrockets and nginx’s open connection count climbs right along with it.
The other lesson here is an obvious one that I’ve read many times before but have been slow to actually act on: making remote API calls without timeouts is asking for trouble. Here is a fine article if you’re interested in solving that problem in your own site before it is your ruin.
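To make the lesson concrete, here’s a hedged sketch of the kind of fix we needed: a remote call with explicit timeouts so a dead service fails fast instead of hanging a Mongrel. The URL and method name are stand-ins, not the actual geocoding service from this post:

```ruby
require "net/http"
require "uri"

# Hypothetical helper: fetch a URL but give up quickly if the remote
# service is down or hanging, instead of blocking the whole request.
def fetch_with_timeout(url, open_timeout = 2, read_timeout = 3)
  uri = URI.parse(url)
  http = Net::HTTP.new(uri.host, uri.port)
  http.open_timeout = open_timeout  # seconds allowed to establish the TCP connection
  http.read_timeout = read_timeout  # seconds allowed to wait on each read
  http.get(uri.request_uri).body
rescue Timeout::Error, SystemCallError
  nil  # treat the remote service as unavailable rather than waiting forever
end
```

With something like this in place, a flaky geocoder degrades to a missing data point instead of a pile-up in the Mongrel queue.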