[Openmcl-devel] thinking out loud

Thu May 3 13:37:00 PDT 2007

[Unless you find ponderous deliberations about thread race conditions
and Mach exception/signal handling interesting, you probably won't
find this interesting ...  If you do find it interesting and have any
thoughts about whether this makes sense, please let me know.]

For a variety of reasons (most of which start with the phrase "Mach
sucks"), OpenMCL does exception handling on OSX at the Mach thread
level (as opposed to the "POSIX signal" level, as on Linux/FreeBSD.)
The general way in which this works is:

- a distinguished thread is created at startup and spends its
   lifetime listening to and responding to Mach IPC messages on
   a dedicated set of message ports.
- whenever a lisp thread is created (or whenever a foreign thread
   calls back into lisp code), its private exception port is added
   to the set of ports that the distinguished exception thread
   listens to.
- whenever a lisp thread gets an exception (memory fault, trap or
   illegal instruction, arithmetic error), the OS kernel suspends
   it and sends a message to the thread's exception port.  The
   distinguished exception thread - which has been listening for
   such messages - wakes up, obtains the suspended thread's register
   state,  marks the thread as being in a "waiting to handle exception"
   state, puts the register state where other threads (including any
   thread that triggers the GC) can find it, and then manipulates
   the suspended thread's register state so that when it wakes up
   it will call code to actually handle the exception and return
   to a special trap instruction at a fixed address.  The distinguished
   exception thread then replies to the Mach exception message, telling
   the OS kernel that the thread should be made runnable, then goes
   back to listening for exception messages.  The newly awakened
   thread (the one that had the exception) waits for a global lock
   (exception handling is serialized, for mostly-GC-related reasons),
   examines and modifies its register state (which reflects its state
   at the time of the exception) and, quite often, returns to the
   distinguished trap instruction "underneath" the faked call to
   the exception-handling code.  It gets another exception if it does
   so, and the exception-handling thread recognizes that it's trying
   to return from the original exception and arranges for it to do
   so.
- Even though POSIX signals aren't used to handle synchronous
   (hardware-related) exceptions on OSX, they are used for other
   purposes, notably PROCESS-INTERRUPT and various flavors of
   "suspend"; in particular, any thread which triggers the GC sends signals
   to all other threads telling them to suspend themselves and waits
   for acknowledgement of that suspend request before the fun starts.
   (There are a variety of reasons that make it better to use signals
   for "suspend" rather than using Mach-level suspend/resume calls.)
   When a thread receives a signal, the handler function receives
   the thread's register context as an argument.  If the thread was
   suspended or interrupted when running lisp code, that register
   context has to be kept somewhere where the GC can find it (so that
   it can reliably trace through references to lisp objects and update
   those references if the objects move.)
- About a year ago, I got a couple of mostly reproducible bug reports
   involving GC misbehavior when several dozen threads were consing
   hysterically.  There were several little problems and one big one
   (some code that claimed to be masking a signal at some critical
   juncture was actually going to great lengths to enable it), but
   along the way I convinced myself that the problems had to do with
   the fact that it was possible for the GC thread (which accesses
   and modifies the register context of all other lisp threads) and
   the distinguished exception thread to be modifying a thread's
   state at the same time.  I added some locking mechanisms to
   prevent that from happening; the idea was that if a thread got
   an exception "at the same time" that it got a suspend signal,
   the exception handler would reply to the exception message in
   a way that caused the OS to thaw it out; under Tiger and earlier,
   the thread would wake up and immediately invoke the handler for
   the pending suspend signal; when that handler returned, it'd
   presumably force the exception to occur again.
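
[Illustration only, not OpenMCL's code: a minimal sketch of the
signal-based "suspend" handshake from the list above, assuming plain
POSIX threads and SIGUSR1 as the suspend signal.  The names worker_t,
suspend_handler, suspend_demo and the two pipes are hypothetical.  The
handler publishes the register context it was handed, acknowledges the
request, and parks until it is told to resume; the requesting thread
(standing in for the GC) waits for every acknowledgement before it
looks at anyone's context.]

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <sched.h>
#include <signal.h>
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>

#define NWORKERS 4

/* Hypothetical per-thread record: where a collector would look for registers. */
typedef struct {
    pthread_t tid;
    atomic_int ready;                      /* set once `self` is valid */
    ucontext_t *_Atomic suspended_context; /* non-NULL only while parked */
} worker_t;

static worker_t workers[NWORKERS];
static _Thread_local worker_t *self;  /* set before the thread can be signalled */
static atomic_int stop_flag;
static int ack_pipe[2], resume_pipe[2];

/* Runs in the interrupted thread.  Uses only read() and write(), which are
   async-signal-safe.  SA_NODEFER is not set, so SIGUSR1 stays blocked while
   the handler runs and the handler cannot be reentered. */
static void suspend_handler(int sig, siginfo_t *info, void *uctx) {
    int saved_errno = errno;
    char byte = 'A';
    (void)sig; (void)info;
    /* Publish the register context before acknowledging the request. */
    atomic_store(&self->suspended_context, (ucontext_t *)uctx);
    while (write(ack_pipe[1], &byte, 1) < 0 && errno == EINTR) {}
    /* Park here until the requester writes a resume byte. */
    while (read(resume_pipe[0], &byte, 1) < 0 && errno == EINTR) {}
    atomic_store(&self->suspended_context, (ucontext_t *)NULL);
    errno = saved_errno;
}

static void *worker_main(void *arg) {
    self = (worker_t *)arg;
    atomic_store(&self->ready, 1);
    while (!atomic_load(&stop_flag))
        sched_yield();                     /* stand-in for real work */
    return NULL;
}

/* Suspends all workers, counts how many published a context, resumes them.
   Returns that count. */
int suspend_demo(void) {
    struct sigaction sa;
    char buf[NWORKERS] = {0};
    int i, rc, seen = 0;

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = suspend_handler;
    sa.sa_flags = SA_SIGINFO | SA_RESTART;
    sigemptyset(&sa.sa_mask);
    rc = sigaction(SIGUSR1, &sa, NULL);  assert(rc == 0);
    rc = pipe(ack_pipe);                 assert(rc == 0);
    rc = pipe(resume_pipe);              assert(rc == 0);

    atomic_store(&stop_flag, 0);
    for (i = 0; i < NWORKERS; i++) {
        atomic_store(&workers[i].ready, 0);
        rc = pthread_create(&workers[i].tid, NULL, worker_main, &workers[i]);
        assert(rc == 0);
    }
    for (i = 0; i < NWORKERS; i++)
        while (!atomic_load(&workers[i].ready))
            sched_yield();

    /* Ask every worker to suspend, then wait for every acknowledgement. */
    for (i = 0; i < NWORKERS; i++) {
        rc = pthread_kill(workers[i].tid, SIGUSR1);
        assert(rc == 0);
    }
    for (i = 0; i < NWORKERS; i++)
        while (read(ack_pipe[0], buf, 1) < 0 && errno == EINTR) {}

    /* All workers are parked; this is where a collector would scan contexts. */
    for (i = 0; i < NWORKERS; i++)
        if (atomic_load(&workers[i].suspended_context) != NULL)
            seen++;

    rc = (int)write(resume_pipe[1], buf, NWORKERS);  assert(rc == NWORKERS);
    atomic_store(&stop_flag, 1);
    for (i = 0; i < NWORKERS; i++)
        pthread_join(workers[i].tid, NULL);
    close(ack_pipe[0]);    close(ack_pipe[1]);
    close(resume_pipe[0]); close(resume_pipe[1]);
    return seen;
}
```

[Calling suspend_demo() from main is expected to return NWORKERS, since
each worker stores its context before it acknowledges and clears it only
after the resume byte arrives.  The sketch says nothing about Mach
exceptions; it only shows the signal half of the story.]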

That locking scheme (surprisingly enough) seemed to work fairly well under
Tiger; it seemed that if a thread had both a pending exception
and a pending signal when it was awakened, the signal handler
would run first.  (Of course, things aren't guaranteed to happen
in that order, but they seemed to reliably do so under Tiger.)

This morning, I was running some code that involved about
a dozen threads consing hysterically under Leopard.  Things locked
up, and when I looked in GDB I saw that a thread with a pending
exception and a pending signal was ignoring the pending signal
and repeatedly sending the exception message.  The GC thread was
waiting for acknowledgement of the suspend request, and the
dedicated exception thread wasn't doing anything since the GC
thread owned the lock.

My first reaction was to try to make things more complicated,
but the point that I'm trying to think out loud about is whether
the locking mechanism is unnecessary because things are "already
atomic enough."  The reasoning goes like:

  - a thread that's been suspended by Mach can't run, and in particular
    can't run a handler for a pending "suspend" signal.
  - the distinguished exception thread and any client thread it's
    running on behalf of can't run concurrently; the client is suspended
    before the exception message is sent and is not resumed until the
    distinguished exception thread replies to the kernel exception message.
  - therefore, the scenario that the locking mechanism was intended to
    prevent - a thread's GC-visible context being clobbered by an
    exception and a signal being processed at the same time - can't
    happen; a thread might invoke a signal handler if it receives
    a signal before taking an exception or might respond to the signal
    as soon as it wakes up in an exception handler, but it can't
    process the exception and signal at the same time.

Too bad, because that was a nice theory that seemed to explain the
symptom that I saw most often (a thread that had just consed something
seemed to forget that it had just consed something), but the bug that
I found later - where the "suspend" signal handler could get reentered -
also explained those symptoms.

Right at the moment, I think that the locking mechanism is at best 
unnecessary and at worst - since it depends on signals being processed
when exceptions are also pending - wrong.