[Openmcl-devel] Random crashing

Fri Jul 18 09:29:37 PDT 2008

The following info might also be useful..

[3268] OpenMCL kernel debugger: R
%rax = 0x0000000000000000      %r8  = 0x0000000000000000
%rcx = 0x0000000000000000      %r9  = 0x0000000040E577D0
%rdx = 0x0000000000000001      %r10 = 0x0000000000000008
%rbx = 0x00000000415837D0      %r11 = 0x0000000000000246
%rsp = 0x0000000040E56218      %r12 = 0x0000000040E577D0
%rbp = 0x0000000040E566F0      %r13 = 0x0000000040E56718
%rsi = 0x0000000000000001      %r14 = 0x0000000000000004
%rdi = 0x0000000000000000      %r15 = 0x0000000040E56AA0
%rip = 0x00002ADAFE2CA325   %rflags = 0x0000000000010246
[3268] OpenMCL kernel debugger: x
Unhandled exception 11 at 0x2adafe2ca325, context->regs at #x40e55d88
Exception occurred while executing foreign code
? for help
[3268] OpenMCL kernel debugger: x
exception in foreign context
Exception occurred while executing foreign code
? for help
[3268] OpenMCL kernel debugger: x
Unhandled exception 11 at 0x2adafe2ca325, context->regs at #x40e55d88
Exception occurred while executing foreign code
? for help
[3268] OpenMCL kernel debugger: t
Current Thread Context Record (tcr) = 0x40e577d0
Control (C) stack area:  low = 0x40c04000, high = 0x40e58000
Value (lisp) stack area: low = 0x2aaaacfa1000, high = 0x2aaaad1b2000
Exception stack pointer = 0x40e56218

On Jul 17, 2008, at 3:54 PM, Gary Byers wrote:

>
>
> On Thu, 17 Jul 2008, Osei Poku wrote:
>
>> Hello,
>>
>> I updated today from svn but this thing happened again.  Again the  
>> PC was in the pthread memory region and %rdi was 0.  I verified  
>> that the fix (r9997 i think) was in my ccl working directory  
>> (somewhere in thread_manager.c right?).
>
> Yes; there are 3 calls to pthread_kill() in that file.  One of them  
> (in resume_tcr()) is conditionlized out; the other two
> (in raise_thread_interrupt() and suspend_tcr()) should check
> to make sure that the thread that they'd pass as the first
> argument to pthread_kill is non-zero before doing the call.)
>
>>
>> My current version is:
>> Clozure Common Lisp Version 1.2-r10073M-RC1  (LinuxX8664)!
>>
>> Is there anything other than (rebuild-ccl :force t) that I need to  
>> do to recompile the c source for the lisp kernel?
>
> As Gail just pointed out, :full t (or :kernel t) is necessary
> in order to get the kernel updated. (:force t will recompile
> FASLs even if they're newer than the corresponding source;
> that's occasionally useful, but not really what you want here.)
>
> If the kernel that you're running had its modified date change
> by the rebuild process, it likely incorporates those changes.  If
> those changes didn't fix the problem, then I don't have a good
> guess as to what the problem is: there aren't too many places
> where the lisp calls into the threads library: it creates threads
> and sends them signals via pthread_kill().  (There's another place  
> where a thread will send itself a signal via pthread_kill(),
> but that is pretty much guaranteed to be a valid thread ...)
>
>
>>
>> Thanks,
>> Osei
>>
>> On Jul 9, 2008, at 3:05 PM, Gary Byers wrote:
>>
>>> --On July 9, 2008 2:26:56 PM -0400 Osei Poku <osei.poku at gmail.com>  
>>> wrote:
>>>> Hi,
>>>> It crashed again for me.  This time I managed to grab the  
>>>> contents of
>>>> /proc/pid/maps before I killed it.  Logs of the tty session and  
>>>> memory
>>>> maps are attached.  I had also managed to update from the  
>>>> repository to
>>>> r9890-RC1.
>>>> Osei
>>> It seems to be crashed in the threads library (libpthread.so).
>>> There's a race condition in the code which suspends threads
>>> on entry to the GC: the thread that's running the GC looks
>>> at each thread that it wants to suspend to see if it's
>>> still alive (the data structure that represents a thread
>>> might still be around, even if the OS-level thread has
>>> exited.)  The suspending thread looks at the tcr->osid
>>> field of the target, notes that it's non-zero, then
>>> calls a function to send the os-level thread a signal.
>>> That function accesses the tcr->osid field again (which,
>>> when non-zero, represents a POSIX thread ID) and calls
>>> pthread_kill()).
>>> When a thread dies, it clears its tcr->osid field, so
>>> if the target thread dies between the point when the
>>> suspending thread looks and the point where it leaps,
>>> we wind up calling pthread_kill() with a first argument
>>> of 0, and it crashes.  That's consistent with the
>>> register information: we're somewhere in the threads
>>> library (possibly in pthread_kill()), and the register
>>> in which C functions receive their first argument (%rdi)
>>> is  0.
>>> I'll try to check in a fix for that (look before leaping)
>>> soon.  As I understand it, SLIME will sometimes (depending
>>> on the setting of a "communication style" variable)
>>> spawn a thread in which to run each form being evaluated
>>> (via C-M-x or whatever); whether that's a good idea or
>>> not, consing short-lived threads all the time is probably
>>> a good way to trigger this bug.  I don't use SLIME, and
>>> don't know what the consequences of changing the communication
>>> style variable would be.
>>