[Openmcl-devel] Random crashing

Fri Jul 18 09:32:11 PDT 2008

More debug info. Sorry about the multiple emails; I'm figuring
things out as I go.

(gdb) info threads
   9 Thread 0x40263950 (LWP 3271)  0x00002adafe2ca2cb in sem_timedwait () from /lib64/libpthread.so.0
   8 Thread 0x404c7950 (LWP 3272)  0x00002adafe2ca2cb in sem_timedwait () from /lib64/libpthread.so.0
   7 Thread 0x4072b950 (LWP 3305)  0x00002adafe2ca2cb in sem_timedwait () from /lib64/libpthread.so.0
   6 Thread 0x4098f950 (LWP 3306)  0x00002adafe2ca2cb in sem_timedwait () from /lib64/libpthread.so.0
   5 Thread 0x40bf3950 (LWP 3307)  0x00002adafe2ca2cb in sem_timedwait () from /lib64/libpthread.so.0
   4 Thread 0x40e57950 (LWP 6093)  0x00002adafe591bfb in read () from /lib64/libc.so.6
   3 Thread 0x4131f950 (LWP 6094)  0x00002adafe2ca2cb in sem_timedwait () from /lib64/libpthread.so.0
   2 Thread 0x410bb950 (LWP 6095)  0x00002adafe2ca2cb in sem_timedwait () from /lib64/libpthread.so.0
   1 Thread 0x2adafe820880 (LWP 3268)  0x00002adafe2ca2cb in sem_timedwait () from /lib64/libpthread.so.0
(gdb) thread 4
[Switching to thread 4 (Thread 0x40e57950 (LWP 6093))]#0  0x00002adafe591bfb in read () from /lib64/libc.so.6
(gdb) bt
#0  0x00002adafe591bfb in read () from /lib64/libc.so.6
#1  0x00002adafe545553 in _IO_file_underflow () from /lib64/libc.so.6
#2  0x00002adafe545d0e in _IO_default_uflow () from /lib64/libc.so.6
#3  0x00002adafe541404 in getc () from /lib64/libc.so.6
#4  0x000000000041d43d in readc () at /usr/include/bits/stdio.h:43
#5  0x000000000041d590 in lisp_Debugger (xp=0x40e55d60, info=0x40e56110, why=11, in_foreign_code=1, message=0x40e55b10 "Unhandled exception 11 at 0x2adafe2ca325, context->regs at #x40e55d88") at ../lisp-debug.c:914
#6  0x000000000041a2c6 in signal_handler (signum=11, info=0x40e56110, context=0x40e55d60, tcr=0x40e577d0, old_valence=1) at ../x86-exceptions.c:1070
#7  <signal handler called>
#8  0x00002adafe2ca325 in sem_post () from /lib64/libpthread.so.0
#9  0x000000000041b3e2 in resume_tcr (tcr=0x415837d0) at ../thread_manager.c:1376
#10 0x000000000041c146 in lisp_resume_tcr (tcr=0x415837d0) at ../thread_manager.c:1418
#11 0x000000000041a0c8 in handle_exception (signum=<value optimized out>, info=0x40e56aa0, context=0x40e566f0, tcr=0x40e577d0, old_valence=0) at ../x86-exceptions.c:910
#12 0x000000000041a218 in signal_handler (signum=4, info=0x40e56aa0, context=0x40e566f0, tcr=0x40e577d0, old_valence=0) at ../x86-exceptions.c:1064
#13 <signal handler called>
#14 0x00003000400110ab in ?? ()
#15 0x000030004042660c in ?? ()
#16 0x000000000040e0ac in _SPnthrowvalues () at ../x86-spentry64.s:1404
#17 0x00002aaaad1b1110 in ?? ()
#18 0x0000000000000008 in ?? ()
#19 0x0000000000000000 in ?? ()
(gdb)

On Jul 17, 2008, at 3:54 PM, Gary Byers wrote:

>
>
> On Thu, 17 Jul 2008, Osei Poku wrote:
>
>> Hello,
>>
>> I updated today from svn, but this thing happened again.  Again the
>> PC was in the pthread memory region and %rdi was 0.  I verified
>> that the fix (r9997, I think) was in my ccl working directory
>> (somewhere in thread_manager.c, right?).
>
> Yes; there are 3 calls to pthread_kill() in that file.  One of them
> (in resume_tcr()) is conditionalized out; the other two
> (in raise_thread_interrupt() and suspend_tcr()) should check
> that the thread ID they'd pass as the first argument to
> pthread_kill() is non-zero before making the call.
>
>>
>> My current version is:
>> Clozure Common Lisp Version 1.2-r10073M-RC1  (LinuxX8664)!
>>
>> Is there anything other than (rebuild-ccl :force t) that I need to
>> do to recompile the C source for the lisp kernel?
>
> As Gail just pointed out, :full t (or :kernel t) is necessary
> in order to get the kernel updated. (:force t will recompile
> FASLs even if they're newer than the corresponding source;
> that's occasionally useful, but not really what you want here.)
>
> If the rebuild process changed the modification date of the kernel
> that you're running, it likely incorporates those changes.  If
> those changes didn't fix the problem, then I don't have a good
> guess as to what the problem is.  There aren't many places
> where the lisp calls into the threads library: it creates threads
> and sends them signals via pthread_kill().  (There's another place
> where a thread will send itself a signal via pthread_kill(),
> but that is pretty much guaranteed to be a valid thread ...)
>
>
>>
>> Thanks,
>> Osei
>>
>> On Jul 9, 2008, at 3:05 PM, Gary Byers wrote:
>>
>>> --On July 9, 2008 2:26:56 PM -0400 Osei Poku <osei.poku at gmail.com>  
>>> wrote:
>>>> Hi,
>>>> It crashed again for me.  This time I managed to grab the contents
>>>> of /proc/pid/maps before I killed it.  Logs of the tty session and
>>>> memory maps are attached.  I had also managed to update from the
>>>> repository to r9890-RC1.
>>>> Osei
>>> It seems to have crashed in the threads library (libpthread.so).
>>> There's a race condition in the code which suspends threads
>>> on entry to the GC: the thread that's running the GC looks
>>> at each thread that it wants to suspend to see if it's
>>> still alive (the data structure that represents a thread
>>> might still be around, even if the OS-level thread has
>>> exited.)  The suspending thread looks at the tcr->osid
>>> field of the target, notes that it's non-zero, then
>>> calls a function to send the OS-level thread a signal.
>>> That function accesses the tcr->osid field again (which,
>>> when non-zero, represents a POSIX thread ID) and calls
>>> pthread_kill().
>>>
>>> When a thread dies, it clears its tcr->osid field, so
>>> if the target thread dies between the point when the
>>> suspending thread looks and the point where it leaps,
>>> we wind up calling pthread_kill() with a first argument
>>> of 0, and it crashes.  That's consistent with the
>>> register information: we're somewhere in the threads
>>> library (possibly in pthread_kill()), and the register
>>> in which C functions receive their first argument (%rdi)
>>> is 0.
>>>
>>> I'll try to check in a fix for that (look before leaping)
>>> soon.  As I understand it, SLIME will sometimes (depending
>>> on the setting of a "communication style" variable)
>>> spawn a thread in which to run each form being evaluated
>>> (via C-M-x or whatever); whether that's a good idea or
>>> not, consing short-lived threads all the time is probably
>>> a good way to trigger this bug.  I don't use SLIME, and
>>> don't know what the consequences of changing the communication
>>> style variable would be.
>>
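
The "look before leaping" fix described in the quoted message (test
tcr->osid, then make sure the value passed to pthread_kill() is the
same one that was tested) can be sketched in C roughly as follows.
This is a minimal illustration under stated assumptions, not the code
in thread_manager.c: sketch_tcr and signal_if_alive are hypothetical
names, and holding the thread ID in an atomic uintptr_t is an
assumption made only to keep the example self-contained.

```c
#include <pthread.h>
#include <signal.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's thread context record.
   osid holds a POSIX thread ID, or 0 once the thread has exited. */
typedef struct {
    _Atomic uintptr_t osid;
} sketch_tcr;

/* Send signo to the thread described by tcr, if it still looks alive.
   The field is read exactly once; the value that is tested is the
   value that is passed on, so a concurrent clear of tcr->osid cannot
   turn the first argument of pthread_kill() into 0.
   Returns 1 if the signal was sent, 0 if it was skipped or failed. */
static int signal_if_alive(sketch_tcr *tcr, int signo)
{
    uintptr_t osid = atomic_load(&tcr->osid);  /* look once */

    if (osid == 0) {
        return 0;                              /* thread already gone */
    }
    /* Assumption: pthread_t round-trips through uintptr_t, as it
       does on Linux, where pthread_t is an unsigned long. */
    return pthread_kill((pthread_t)osid, signo) == 0;  /* then leap */
}
```

Reading the field once removes the window in which a zero can reach
pthread_kill().  It does not stop the target thread from exiting
between the load and the call; closing that window as well would need
some synchronization with thread exit, which this sketch does not
attempt.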