The Curious Case of the Uninterruptible Sleep

Tagged as lisp
Written on 2018-07-09 16:16:37

Let me tell you about one particularly annoying issue I came across when using a GC-ed environment. No, not Python, though I'm almost sure it will have the same issue. No, it's of course Common Lisp, in this case CCL and SBCL in particular, both of which have a stop-the-world GC (as far as I know there are no knobs to change that behaviour, apart from what I'll describe below).

Now, do you also remember these things called HDDs? Turns out that one of my external hard drives likes to spin itself down after a while. You can hear that very well. However, since the drive is still mounted, accessing (uncached) parts of the filesystem will trigger a wake event. But getting the platters up to speed again takes a lot of time, so what happens in between?

Exactly, uninterruptible sleep for the process in question. It's one of the few situations in which that process state can occur, and if it weren't for the specifics of the GC it would just block one thread while the rest, in particular the GUI thread, would keep moving.
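To see which thread is actually stuck, on Linux you can read the state letter from /proc, where "D" stands for uninterruptible sleep. The following is only a rough sketch in Python (picked purely for illustration, not something I used at the time); the helper names parse_proc_state and thread_states are made up here.

```python
import os


def parse_proc_state(stat_line):
    """Return the one-letter state from a /proc/<pid>/stat line."""
    # The command name sits in parentheses and may itself contain spaces
    # or parentheses, so only look at what follows the last ')'.
    rest = stat_line[stat_line.rindex(")") + 1:]
    return rest.split()[0]


def thread_states(pid):
    """Map each thread id of `pid` to its state letter (Linux only)."""
    states = {}
    for tid in os.listdir(f"/proc/{pid}/task"):
        try:
            with open(f"/proc/{pid}/task/{tid}/stat") as f:
                states[int(tid)] = parse_proc_state(f.read())
        except OSError:
            # The thread exited between listing and reading; skip it.
            pass
    return states
```

Calling thread_states on the Lisp process while the drive spins up should show one thread in state "D", with the others most likely sleeping ("S") while they wait for the GC.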

In this particular instance though, the GUI would work for a moment and then freeze until the drive finally responded with the requested data.

Now why is that?

Turns out the GC asks each (runtime) thread to block for the GC run. This is done via signals. Except one thread won't respond to signals since it's ... sleeping. Uninterruptibly.
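A toy model of that handshake, in Python rather than in the actual runtimes, might look like the sketch below. Everything in it is an assumption for illustration: the real implementations use signals and not event flags, the names World, Mutator and stop_the_world are invented, and time.sleep is only a stand-in for the read that hangs (a real uninterruptible sleep can't even be signalled).

```python
import threading
import time


class World:
    """Flags shared between the collector and the mutator threads."""

    def __init__(self):
        self.stop_requested = threading.Event()  # collector: "please park"
        self.resume = threading.Event()          # collector: "carry on"
        self.shutdown = threading.Event()        # end of the demo


class Mutator(threading.Thread):
    def __init__(self, world, blocking_seconds=0.0):
        super().__init__(daemon=True)
        self.world = world
        self.blocking_seconds = blocking_seconds
        self.parked = threading.Event()

    def run(self):
        # Stand-in for the hanging read: no stop request is seen in here.
        time.sleep(self.blocking_seconds)
        while not self.world.shutdown.is_set():
            if self.world.stop_requested.is_set():
                self.parked.set()          # acknowledge the request
                self.world.resume.wait()   # stay put until the GC is done
                self.parked.clear()
            time.sleep(0.001)              # pretend to do some work


def stop_the_world(world, mutators):
    """Ask every thread to park; return how long the collector waited."""
    start = time.monotonic()
    world.resume.clear()
    world.stop_requested.set()
    for m in mutators:
        m.parked.wait()   # hangs on the thread that is stuck in I/O
    waited = time.monotonic() - start
    # ... the actual collection would run here ...
    world.stop_requested.clear()
    world.resume.set()
    return waited


def demo(blocking_seconds=0.5):
    world = World()
    mutators = [Mutator(world), Mutator(world, blocking_seconds)]
    for m in mutators:
        m.start()
    waited = stop_the_world(world, mutators)
    world.shutdown.set()
    for m in mutators:
        m.join()
    return waited
```

In demo the first thread parks almost immediately, just like the GUI thread, and then sits there until the second one returns from its "read". The collector ends up waiting for roughly the whole blocking time.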

Great.

Suggested options include spawning a separate process to wait for the I/O, which I'd rather not do, since it'd mean doing that for every single library call that might touch e.g. files, and that's basically all of them. It just seems there's no good way to deal with this except by changing the runtime and dealing with the fact that the GC might not be able to reach all threads.

I looked a bit at the CCL runtime and it seems that if we promise not to do anything bad with regard to the GC, we might be able to work around it by setting a new per-thread flag that excludes the thread from the GC waiting loop. We'd only do this around potentially problematic FFI calls, but also around known internal system calls. When returning from foreign land we'd also need to "catch up" with the GC, potentially blocking until the current run is done (if there is one). But that's solvable, with a little bit more overhead for FFI calls. I believe that's worth it.
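To make the idea a bit more concrete, here is a rough sketch of such a flag, again as a Python toy and not as a patch against CCL. All names (World, foreign_call, collect and so on) are hypothetical, a condition variable stands in for whatever the runtime would really use, and the "promise" is that nothing inside foreign_call touches the managed heap.

```python
import threading
import time
from contextlib import contextmanager

RUNNING, PARKED, FOREIGN = "running", "parked", "foreign"


class World:
    def __init__(self):
        self.cond = threading.Condition()
        self.gc_running = False
        self.states = {}  # thread name -> RUNNING / PARKED / FOREIGN

    def register(self, name):
        with self.cond:
            self.states[name] = RUNNING

    def safepoint(self, name):
        """Called regularly by managed code; parks while a GC is running."""
        with self.cond:
            if self.gc_running:
                self.states[name] = PARKED
                self.cond.notify_all()
                self.cond.wait_for(lambda: not self.gc_running)
                self.states[name] = RUNNING

    @contextmanager
    def foreign_call(self, name):
        """Exclude the thread from the GC waiting loop for the duration."""
        with self.cond:
            self.states[name] = FOREIGN
            self.cond.notify_all()
        try:
            yield
        finally:
            with self.cond:
                # "Catch up": don't re-enter managed code in mid-collection.
                self.cond.wait_for(lambda: not self.gc_running)
                self.states[name] = RUNNING

    def collect(self, work=lambda: None):
        """Wait only for threads still in managed code, then run `work`."""
        start = time.monotonic()
        with self.cond:
            self.gc_running = True
            self.cond.wait_for(
                lambda: all(s != RUNNING for s in self.states.values()))
        waited = time.monotonic() - start
        try:
            work()
        finally:
            with self.cond:
                self.gc_running = False
                self.cond.notify_all()
        return waited


def demo(blocking_seconds=0.5):
    world = World()
    world.register("io")

    def io_thread():
        with world.foreign_call("io"):
            time.sleep(blocking_seconds)  # stand-in for the hanging read

    t = threading.Thread(target=io_thread, daemon=True)
    t.start()
    waited = world.collect()
    t.join()
    return waited
```

Here collect returns almost immediately even though the "io" thread is still blocked, and the extra cost is the locking on entering and leaving every foreign call, which is the overhead mentioned above.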

I'm not at all familiar with either the CCL or the SBCL code base, but I suspect that even generalising this to all FFI calls would be tenable, in which case the pesky annotations would cease to be necessary, fixing the issue without the developer having to adjust anything manually.

Lastly, I'm sure there are solutions for this in either GHC or Erlang, except that I don't know the right terms to search for.

Btw. this is all super iffy to debug too, since e.g. gdb will just hang if the target process is in this specific task state. gdb is based on ptrace and seemingly ends up blocked along with the process it's trying to attach to.


Unless otherwise credited all material Creative Commons License by Olof-Joachim Frahm