[Openmcl-devel] Random crashing

Thu Jul 17 12:54:02 PDT 2008

On Thu, 17 Jul 2008, Osei Poku wrote:

> Hello,
>
> I updated today from svn but this thing happened again.  Again the PC was in 
> the pthread memory region and %rdi was 0.  I verified that the fix (r9997 I 
> think) was in my ccl working directory (somewhere in thread_manager.c 
> right?).

Yes; there are 3 calls to pthread_kill() in that file.  One of
them (in resume_tcr()) is conditionalized out; the other two
(in raise_thread_interrupt() and suspend_tcr()) should check
that the thread they'd pass as the first argument to
pthread_kill() is non-zero before making the call.
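
As a rough illustration of that "check before calling" idea, here is a
minimal sketch. It is not the code in thread_manager.c: the struct
tcr_sketch and the function signal_thread_checked() are hypothetical
names invented for this example, and the sketch assumes a platform
(such as Linux) where a pthread_t can be stored in an integer and 0 is
never a valid thread ID. The point is that the osid field is read
exactly once, and the value that was tested is the value that gets
passed to pthread_kill().

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <signal.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the kernel's thread record.
   Only the field discussed in this thread is modeled. */
typedef struct {
    _Atomic uintptr_t osid;   /* 0 means the OS-level thread has exited */
} tcr_sketch;

/* Send signo to the thread described by tcr, but only if its osid is
   non-zero.  The field is loaded once into a local, so the value that
   is checked and the value that is used cannot differ.
   Returns ESRCH without calling pthread_kill() when osid is 0;
   otherwise returns whatever pthread_kill() returns. */
int signal_thread_checked(tcr_sketch *tcr, int signo)
{
    assert(tcr != NULL);
    uintptr_t id = atomic_load(&tcr->osid);   /* single read */
    if (id == 0) {
        return ESRCH;                         /* target already gone */
    }
    /* Assumption: pthread_t round-trips through uintptr_t here. */
    return pthread_kill((pthread_t)id, signo);
}
```

Note that this only keeps a zero from reaching pthread_kill(); a thread
that exits just after the load could still leave a stale ID behind, so
a real fix may need more than this.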

>
> My current version is:
> Clozure Common Lisp Version 1.2-r10073M-RC1  (LinuxX8664)!
>
> Is there anything other than (rebuild-ccl :force t) that I need to do to 
> recompile the c source for the lisp kernel?

As Gail just pointed out, passing :full t (or :kernel t) to
rebuild-ccl is necessary in order to get the kernel updated.
(:force t will recompile FASLs even if they're newer than the
corresponding source; that's occasionally useful, but not really
what you want here.)

If the rebuild process changed the modification date of the
kernel that you're running, it likely incorporates those changes.
If those changes didn't fix the problem, then I don't have a good
guess as to what the problem is.  There aren't many places
where the lisp calls into the threads library: it creates threads
and sends them signals via pthread_kill().  (There's another
place where a thread will send itself a signal via pthread_kill(),
but that is pretty much guaranteed to be a valid thread ...)
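
For comparison, the self-signalling case looks roughly like the sketch
below. Again this is illustrative only: signal_self(), note_signal()
and the self_signal_seen flag are hypothetical names, not taken from
the lisp kernel. Because the target is obtained from pthread_self(),
the first argument is always the ID of a live thread (the caller), so
the zero-argument problem described above can't arise on this path.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <signal.h>
#include <string.h>

/* Set by the handler so a caller can see that the signal arrived. */
static volatile sig_atomic_t self_signal_seen = 0;

/* Trivial handler: just record that the signal was delivered. */
static void note_signal(int signo)
{
    (void)signo;
    self_signal_seen = 1;
}

/* Install note_signal() for signo, then send signo to the calling
   thread.  Returns -1 if the handler can't be installed, otherwise
   the result of pthread_kill() (0 on success). */
int signal_self(int signo)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = note_signal;
    sigemptyset(&sa.sa_mask);
    if (sigaction(signo, &sa, NULL) != 0) {
        return -1;
    }
    /* pthread_self() always names a valid, running thread. */
    return pthread_kill(pthread_self(), signo);
}
```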

>
> Thanks,
> Osei
>
> On Jul 9, 2008, at 3:05 PM, Gary Byers wrote:
>
>> 
>> 
>> --On July 9, 2008 2:26:56 PM -0400 Osei Poku <osei.poku at gmail.com> wrote:
>> 
>>> Hi,
>>> 
>>> It crashed again for me.  This time I managed to grab the contents of
>>> /proc/pid/maps before I killed it.  Logs of the tty session and memory
>>> maps are attached.  I had also managed to update from the repository to
>>> r9890-RC1.
>>> 
>>> Osei
>>> 
>> 
>> 
>> It seems to have crashed in the threads library (libpthread.so).
>> 
>> There's a race condition in the code which suspends threads
>> on entry to the GC: the thread that's running the GC looks
>> at each thread that it wants to suspend to see if it's
>> still alive (the data structure that represents a thread
>> might still be around, even if the OS-level thread has
>> exited.)  The suspending thread looks at the tcr->osid
>> field of the target, notes that it's non-zero, then
>> calls a function to send the os-level thread a signal.
>> That function accesses the tcr->osid field again (which,
>> when non-zero, represents a POSIX thread ID) and calls
>> pthread_kill().
>> 
>> When a thread dies, it clears its tcr->osid field, so
>> if the target thread dies between the point when the
>> suspending thread looks and the point where it leaps,
>> we wind up calling pthread_kill() with a first argument
>> of 0, and it crashes.  That's consistent with the
>> register information: we're somewhere in the threads
>> library (possibly in pthread_kill()), and the register
>> in which C functions receive their first argument (%rdi)
>> is 0.
>> 
>> I'll try to check in a fix for that (look before leaping)
>> soon.  As I understand it, SLIME will sometimes (depending
>> on the setting of a "communication style" variable)
>> spawn a thread in which to run each form being evaluated
>> (via C-M-x or whatever); whether that's a good idea or
>> not, consing short-lived threads all the time is probably
>> a good way to trigger this bug.  I don't use SLIME, and
>> don't know what the consequences of changing the communication
>> style variable would be.
>> 
>> 
>> 
>