[Openmcl-devel] Random crashing

Gary Byers gb at clozure.com
Wed Jul 9 12:05:23 PDT 2008



--On July 9, 2008 2:26:56 PM -0400 Osei Poku <osei.poku at gmail.com> wrote:

> Hi,
>
> It crashed again for me.  This time I managed to grab the contents of
> /proc/pid/maps before I killed it.  Logs of the tty session and memory
> maps are attached.  I had also managed to update from the repository to
> r9890-RC1.
>
> Osei
>


It seems to have crashed in the threads library (libpthread.so).

There's a race condition in the code which suspends threads
on entry to the GC: the thread that's running the GC looks
at each thread that it wants to suspend to see if it's
still alive (the data structure that represents a thread
might still be around, even if the OS-level thread has
exited.)  The suspending thread looks at the tcr->osid
field of the target, notes that it's non-zero, then
calls a function to send the OS-level thread a signal.
That function accesses the tcr->osid field again (which,
when non-zero, represents a POSIX thread ID) and calls
pthread_kill().

When a thread dies, it clears its tcr->osid field, so
if the target thread dies between the point when the
suspending thread looks and the point where it leaps,
we wind up calling pthread_kill() with a first argument
of 0, and it crashes.  That's consistent with the
register information: we're somewhere in the threads
library (possibly in pthread_kill()), and the register
in which C functions receive their first argument (%rdi)
is 0.
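
The double read described above, and one way of avoiding it, might be
sketched in C roughly as follows.  This is only an illustration, not the
actual runtime source: the TCR struct is a simplified, hypothetical
stand-in (only the osid field comes from the description above), and the
function names suspend_racy and suspend_checked are invented for the
example.

```c
#include <pthread.h>
#include <signal.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the runtime's per-thread record.
   Only the osid field comes from the discussion above. */
typedef struct tcr {
    _Atomic uintptr_t osid;   /* POSIX thread ID; 0 once the thread has died */
} TCR;

/* Racy version: osid is read once for the check and again for the call. */
int suspend_racy(TCR *target, int signo)
{
    if (atomic_load(&target->osid) != 0) {
        /* If the target dies here, the second read below returns 0. */
        return pthread_kill((pthread_t)atomic_load(&target->osid), signo);
    }
    return -1;
}

/* Checked version: read osid once and use only the local copy. */
int suspend_checked(TCR *target, int signo)
{
    uintptr_t osid = atomic_load(&target->osid);
    if (osid == 0) {
        return -1;            /* thread already gone; nothing to signal */
    }
    return pthread_kill((pthread_t)osid, signo);
}
```

With the single read, pthread_kill() can no longer be handed a 0 that
appeared between the check and the call.  It presumably does not cover
every case: if the thread exits after the read, the saved ID may refer to
a thread whose lifetime has ended, so a complete fix may need more than
this.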

I'll try to check in a fix for that (look before leaping)
soon.  As I understand it, SLIME will sometimes (depending
on the setting of a "communication style" variable)
spawn a thread in which to run each form being evaluated
(via C-M-x or whatever); whether that's a good idea or
not, consing short-lived threads all the time is probably
a good way to trigger this bug.  I don't use SLIME, and
don't know what the consequences of changing the communication
style variable would be.





