[Openmcl-devel] process-run-function and mach ports usage

Thu Feb 24 14:02:50 PST 2011

On Wed, 23 Feb 2011, Wim Oudshoorn wrote:

>> I was going to write a longer reply, but I'd already spent a few hours
>> looking into this today (seeing some of the same things that you saw
>> and reaching some different conclusions), and this is all starting to
>> seem too much like the sort of thing where it's said that "if you don't
>> stop doing it, you'll go blind."
>
> Yes, the mach_port mechanism in CCL seems to induce blindness.
> I stared at it for quite a while and never really figured out where
> the increased right count comes from.

It (or at least a major source of it) doesn't come from CCL.

When a Mach thread gets an exception when running in user mode, it
enters the kernel and effectively suspends itself (refuses to reenter
user mode), then:

  1) sends a message to the thread's exception port.  If it gets a reply
     with a code of 0, it assumes that the exception's been handled and
     allows the thread to resume execution.  Some other reply codes have
     other meanings; most other code values mean that the exception couldn't
     be handled at this level.

  2) If the exception is still pending and wasn't handled at the
     thread level, a similar message is sent to the task's exception
     port and the process is repeated.

  3) If the exception still hasn't been handled, the Mach exception is
     mapped to a Unix signal number and a signal is raised.  (This step
     is essentially the same as what occurs on other Unix-like systems.)
     If the application defines a handler for that signal, that handler
     is called on the thread that got the exception (and might be able
     to fix things up and continue, or signal a higher-level error, or
     whatever); if no handler for the signal in question is defined, the
     signal is likely fatal.
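
The three steps above can be sketched as a toy model in portable C.
Nothing here is Mach API: the types, the reply codes other than 0, and
the function names are all hypothetical, and the "likely fatal" case is
simplified to plain FATAL.  It only shows the order in which the levels
are consulted.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: every name here is hypothetical, not Mach API. */
enum { REPLY_HANDLED = 0, REPLY_NOT_HANDLED = 1 };

/* A listener on an exception port: takes the exception, returns a reply code. */
typedef int (*port_listener)(int exception);

typedef enum {
    HANDLED_AT_THREAD_PORT,
    HANDLED_AT_TASK_PORT,
    HANDLED_BY_SIGNAL_HANDLER,
    FATAL
} outcome;

typedef struct {
    port_listener thread_port;   /* NULL means nobody is listening */
    port_listener task_port;     /* NULL means nobody is listening */
    int has_signal_handler;      /* nonzero if the application installed one */
} delivery_config;

int reply_handled(int exception)     { (void)exception; return REPLY_HANDLED; }
int reply_not_handled(int exception) { (void)exception; return REPLY_NOT_HANDLED; }

/* Walk the levels in order: thread port, task port, then Unix signal. */
outcome deliver_exception(const delivery_config *cfg, int exception)
{
    assert(cfg != NULL);

    /* 1) thread-level port; a reply code of 0 means "handled, resume" */
    if (cfg->thread_port && cfg->thread_port(exception) == REPLY_HANDLED)
        return HANDLED_AT_THREAD_PORT;

    /* 2) task-level port, same protocol */
    if (cfg->task_port && cfg->task_port(exception) == REPLY_HANDLED)
        return HANDLED_AT_TASK_PORT;

    /* 3) mapped to a signal; without a handler the model treats it as fatal */
    return cfg->has_signal_handler ? HANDLED_BY_SIGNAL_HANDLER : FATAL;
}
```

In this model, a listener at the task level that always answers
"handled" (or never answers usefully) keeps step 3 from ever being
reached, whatever signal handlers the application has installed.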

On most Unix-like platforms, CCL establishes handlers for a few signals
and these get fairly heavy use; a lot of situations are "exceptional"
(not having memory to CONS in and needing to do something - possibly
involving GCing - to get more) but not expected to be fatal.
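
For readers who haven't used the signal-level mechanism, here is a
minimal POSIX sketch of "establish a handler, take the signal, carry
on."  It is not CCL's handler: SIGUSR1 (passed in by the caller) stands
in for a real fault, the handler only counts invocations, and all of the
function names are made up for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Number of times the handler has run; sig_atomic_t is safe to touch
   from a signal handler. */
static volatile sig_atomic_t handled_count = 0;

/* Hypothetical handler.  A real one would inspect `context` (a
   ucontext_t *) to repair the interrupted thread's state before
   returning; this one just records that it ran. */
static void on_signal(int signo, siginfo_t *info, void *context)
{
    (void)signo; (void)info; (void)context;
    handled_count++;
}

/* Install on_signal for `signo`.  Returns 0 on success, -1 on failure. */
int install_recoverable_handler(int signo)
{
    struct sigaction sa;

    assert(signo > 0);
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_signal;
    sa.sa_flags = SA_SIGINFO;      /* ask for the three-argument handler */
    sigemptyset(&sa.sa_mask);
    return sigaction(signo, &sa, NULL);
}

/* Raise `signo` on the calling thread and report how many times the
   handler has run so far; returns -1 if raise() fails. */
int raise_and_count(int signo)
{
    if (raise(signo) != 0)
        return -1;
    return (int)handled_count;
}
```

Calling install_recoverable_handler(SIGUSR1) and then
raise_and_count(SIGUSR1) returns normally: the handler runs on the
raising thread and execution continues, which is the "exceptional but
not fatal" pattern described above.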

On OSX, GDB sits at the task level and doesn't seem to be able to get
out of the way: if an exception occurs (and the application intends to
handle it at the signal level), GDB will report the Mach-level
exception and doesn't seem to offer a way of continuing so that the
signal handler is invoked.  Even if GDB is not involved, Apple has (or
had) a "Crash Reporter" application which also listened to each task's
exception port and by default would pop up a dialog proudly announcing
that an exception occurred (though I think it allowed the exception
to be handled - perhaps routinely - by a signal handler.)

CCL has a dedicated thread which listens for kernel messages on the set
of all other lisp threads' exception ports.  When such a message is received,
the state of the thread on which the exception occurred is manipulated so
that it'll run a signal handler when it's next resumed, and the listening
thread tells the kernel that the thread can be resumed.

The message(s) sent to the various exception ports include the kernel
thread object (the port); naturally, the kernel conflates the ideas of
"referencing that object" with the idea of "retaining that object" in
this context just as it does in other contexts. It never releases that
reference (presumably since doing so would make at least some sense.)

One could argue that the thread that receives the message should
decrement the port's reference count.  That might work, unless and until
the kernel was somehow fixed.  I think that it's safe to conclude that
we can stop talking about aesthetically pleasing solutions to the problem.
>
>> It -does- seem to be practical to try to ensure that the last thing
>> that a thread does (this would be in
>> thread_manager.c:shutdown_thread_tcr(),
>> which does most of the deallocation of thread-private resources) is to
>> ensure that the kernel port's send right's reference count is exactly 1.
>> That seems to make it very likely that the eventual mach_port_deallocate()
>> in the kernel destroys the port.

I checked this in to the trunk about 10 hours ago.  It seems to fix
the port leakage and I haven't seen evidence of any new problems
having been introduced, but this stuff is complicated enough that it's
hard to say that for sure.
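
The bookkeeping involved can be illustrated with a small model in
portable C.  This is not the code that was checked in and it makes no
Mach calls; the struct and function names are hypothetical.  (On a real
system, mach_port_get_refs() and mach_port_mod_refs() are the Mach calls
for reading and adjusting a right's user-reference count; whether they
are the right tool here is an assumption of this sketch.)

```c
#include <assert.h>

/* Hypothetical model of the send right a task holds on a thread's
   kernel port. */
typedef struct {
    unsigned urefs;   /* user references currently held on the send right */
    int destroyed;    /* set once the last reference has been dropped */
} model_port;

/* A new thread's port starts with a single reference. */
void model_port_init(model_port *p)
{
    p->urefs = 1;
    p->destroyed = 0;
}

/* Each exception message that carries the thread port adds a reference
   which nothing in this model ever releases. */
void model_exception_message(model_port *p)
{
    assert(!p->destroyed);
    p->urefs++;
}

/* Drop one reference; the port goes away only when the count hits zero. */
void model_deallocate(model_port *p)
{
    assert(p->urefs > 0);
    if (--p->urefs == 0)
        p->destroyed = 1;
}

/* The idea at thread shutdown: discard any extra references so that
   exactly one remains for the final deallocate. */
void model_trim_to_one(model_port *p)
{
    assert(p->urefs >= 1);
    p->urefs = 1;
}
```

With three exception messages the count is 4, and a single
model_deallocate() leaves the port alive with 3 references (the leak).
Calling model_trim_to_one() first makes that same final deallocate
destroy the port.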

>
> I did compile CCL with a call to mach_port_destroy () in
> shutdown_thread_tcr ().
> (This is of course conceptually very wrong.  I don't even know
> what happens; is the thread immediately killed??)
> But this worked for my trivial test case and stopped the mach port leak.

A few MB of stack and the pthread object would stay mapped in this case.
Calling (or falling into a call to) pthread_exit() will arrange that
those and other resources are cleared up.
>