[Openmcl-devel] Random crashing

Osei Poku osei.poku at gmail.com
Fri Jul 18 16:25:50 UTC 2008

Ok... It happened again after recompiling the kernel.  I managed to  
attach a gdb session to the process and it is still running so I can  
possible provide more feedback if you need.  My current gdb session  
log is inserted below.

 > /usr/bin/gdb
GNU gdb
Copyright (C) 2007 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and  
you are
welcome to change it and/or distribute copies of it under certain  
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for  
This GDB was configured as "x86_64-suse-linux".
(gdb) attach 3268
Attaching to process 3268
Reading symbols from /home/opoku/local/share/ccl/lx86cl64...done.
Using host libthread_db library "/lib64/libthread_db.so.1".
Reading symbols from /lib64/libdl.so.2...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/libm.so.6...done.
Loaded symbols for /lib64/libm.so.6
Reading symbols from /lib64/libpthread.so.0...done.
[Thread debugging using libthread_db enabled]
[New Thread 0x2adafe820880 (LWP 3268)]
[New Thread 0x410bb950 (LWP 6095)]
[New Thread 0x4131f950 (LWP 6094)]
[New Thread 0x40e57950 (LWP 6093)]
[New Thread 0x40bf3950 (LWP 3307)]
[New Thread 0x4098f950 (LWP 3306)]
[New Thread 0x4072b950 (LWP 3305)]
[New Thread 0x404c7950 (LWP 3272)]
[New Thread 0x40263950 (LWP 3271)]
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/libc.so.6...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /usr/lib64/libssl.so...done.
Loaded symbols for /usr/lib64/libssl.so
Reading symbols from /usr/lib64/libcrypto.so.0.9.8...done.
Loaded symbols for /usr/lib64/libcrypto.so.0.9.8
Reading symbols from /lib64/libz.so.1...done.
Loaded symbols for /lib64/libz.so.1
Reading symbols from /home/opoku/work/code/diagnosis/library/ 
Loaded symbols for /home/opoku/work/code/diagnosis/library/clsql-4.0.3/ 
Reading symbols from /usr/lib64/libmysqlclient.so...done.
Loaded symbols for /usr/lib64/libmysqlclient.so
Reading symbols from /lib64/libcrypt.so.1...done.
Loaded symbols for /lib64/libcrypt.so.1
Reading symbols from /lib64/libnsl.so.1...done.
Loaded symbols for /lib64/libnsl.so.1
Reading symbols from /home/opoku/work/code/diagnosis/library/ 
Loaded symbols for /home/opoku/work/code/diagnosis/library/clsql-4.0.3/ 
0x00002adafe2ca2cb in sem_timedwait () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00002adafe2ca2cb in sem_timedwait () from /lib64/libpthread.so.0
#1  0x000000000041b89c in sem_wait_forever (s=0x6476b0) at ../ 
#2  0x000000000041bfef in suspend_resume_handler (signo=40,  
info=<value optimized out>, context=0x7ffface5c800) at ../ 
#3  <signal handler called>
#4  0x00002adafe2cb5c1 in nanosleep () from /lib64/libpthread.so.0
#5  0x00000000004105da in _SPffcall () at ../x86-spentry64.s:3983
#6  0x00007ffface5cf10 in ?? ()
#7  0x00002adafea64f88 in ?? ()
#8  0x000000000000031a in ?? ()
#9  0x00007ffface5cf00 in ?? ()
#10 0x0000000000000000 in ?? ()
(gdb) info threads
   9 Thread 0x40263950 (LWP 3271)  0x00002adafe2ca2cb in sem_timedwait  
() from /lib64/libpthread.so.0
   8 Thread 0x404c7950 (LWP 3272)  0x00002adafe2ca2cb in sem_timedwait  
() from /lib64/libpthread.so.0
   7 Thread 0x4072b950 (LWP 3305)  0x00002adafe2ca2cb in sem_timedwait  
() from /lib64/libpthread.so.0
   6 Thread 0x4098f950 (LWP 3306)  0x00002adafe2ca2cb in sem_timedwait  
() from /lib64/libpthread.so.0
   5 Thread 0x40bf3950 (LWP 3307)  0x00002adafe2ca2cb in sem_timedwait  
() from /lib64/libpthread.so.0
   4 Thread 0x40e57950 (LWP 6093)  0x00002adafe591bfb in read () from / 
   3 Thread 0x4131f950 (LWP 6094)  0x00002adafe2ca2cb in sem_timedwait  
() from /lib64/libpthread.so.0
   2 Thread 0x410bb950 (LWP 6095)  0x00002adafe2ca2cb in sem_timedwait  
() from /lib64/libpthread.so.0
   1 Thread 0x2adafe820880 (LWP 3268)  0x00002adafe2ca2cb in  
sem_timedwait () from /lib64/libpthread.so.0

On Jul 17, 2008, at 3:54 PM, Gary Byers wrote:

> On Thu, 17 Jul 2008, Osei Poku wrote:
>> Hello,
>> I updated today from svn but this thing happened again.  Again the  
>> PC was in the pthread memory region and %rdi was 0.  I verified  
>> that the fix (r9997 i think) was in my ccl working directory  
>> (somewhere in thread_manager.c right?).
> Yes; there are 3 calls to pthread_kill() in that file.  One of them  
> (in resume_tcr()) is conditionlized out; the other two
> (in raise_thread_interrupt() and suspend_tcr()) should check
> to make sure that the thread that they'd pass as the first
> argument to pthread_kill is non-zero before doing the call.)
>> My current version is:
>> Clozure Common Lisp Version 1.2-r10073M-RC1  (LinuxX8664)!
>> Is there anything other than (rebuild-ccl :force t) that I need to  
>> do to recompile the c source for the lisp kernel?
> As Gail just pointed out, :full t (or :kernel t) is necessary
> in order to get the kernel updated. (:force t will recompile
> FASLs even if they're newer than the corresponding source;
> that's occasionally useful, but not really what you want here.)
> If the kernel that you're running had its modified date change
> by the rebuild process, it likely incorporates those changes.  If
> those changes didn't fix the problem, then I don't have a good
> guess as to what the problem is: there aren't too many places
> where the lisp calls into the threads library: it creates threads
> and sends them signals via pthread_kill().  (There's another place  
> where a thread will send itself a signal via pthread_kill(),
> but that is pretty much guaranteed to be a valid thread ...)
>> Thanks,
>> Osei
>> On Jul 9, 2008, at 3:05 PM, Gary Byers wrote:
>>> --On July 9, 2008 2:26:56 PM -0400 Osei Poku <osei.poku at gmail.com>  
>>> wrote:
>>>> Hi,
>>>> It crashed again for me.  This time I managed to grab the  
>>>> contents of
>>>> /proc/pid/maps before I killed it.  Logs of the tty session and  
>>>> memory
>>>> maps are attached.  I had also managed to update from the  
>>>> repository to
>>>> r9890-RC1.
>>>> Osei
>>> It seems to be crashed in the threads library (libpthread.so).
>>> There's a race condition in the code which suspends threads
>>> on entry to the GC: the thread that's running the GC looks
>>> at each thread that it wants to suspend to see if it's
>>> still alive (the data structure that represents a thread
>>> might still be around, even if the OS-level thread has
>>> exited.)  The suspending thread looks at the tcr->osid
>>> field of the target, notes that it's non-zero, then
>>> calls a function to send the os-level thread a signal.
>>> That function accesses the tcr->osid field again (which,
>>> when non-zero, represents a POSIX thread ID) and calls
>>> pthread_kill()).
>>> When a thread dies, it clears its tcr->osid field, so
>>> if the target thread dies between the point when the
>>> suspending thread looks and the point where it leaps,
>>> we wind up calling pthread_kill() with a first argument
>>> of 0, and it crashes.  That's consistent with the
>>> register information: we're somewhere in the threads
>>> library (possibly in pthread_kill()), and the register
>>> in which C functions receive their first argument (%rdi)
>>> is  0.
>>> I'll try to check in a fix for that (look before leaping)
>>> soon.  As I understand it, SLIME will sometimes (depending
>>> on the setting of a "communication style" variable)
>>> spawn a thread in which to run each form being evaluated
>>> (via C-M-x or whatever); whether that's a good idea or
>>> not, consing short-lived threads all the time is probably
>>> a good way to trigger this bug.  I don't use SLIME, and
>>> don't know what the consequences of changing the communication
>>> style variable would be.

More information about the Openmcl-devel mailing list