[Openmcl-devel] Random crashing

Fri Jul 18 09:25:50 PDT 2008

Ok... It happened again after recompiling the kernel.  I managed to  
attach a gdb session to the process, and it is still running, so I can  
possibly provide more feedback if you need it.  My current gdb session  
log is inserted below.

 > /usr/bin/gdb
GNU gdb 6.6.50.20070726-cvs
Copyright (C) 2007 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "x86_64-suse-linux".
(gdb) attach 3268
Attaching to process 3268
Reading symbols from /home/opoku/local/share/ccl/lx86cl64...done.
Using host libthread_db library "/lib64/libthread_db.so.1".
Reading symbols from /lib64/libdl.so.2...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/libm.so.6...done.
Loaded symbols for /lib64/libm.so.6
Reading symbols from /lib64/libpthread.so.0...done.
[Thread debugging using libthread_db enabled]
[New Thread 0x2adafe820880 (LWP 3268)]
[New Thread 0x410bb950 (LWP 6095)]
[New Thread 0x4131f950 (LWP 6094)]
[New Thread 0x40e57950 (LWP 6093)]
[New Thread 0x40bf3950 (LWP 3307)]
[New Thread 0x4098f950 (LWP 3306)]
[New Thread 0x4072b950 (LWP 3305)]
[New Thread 0x404c7950 (LWP 3272)]
[New Thread 0x40263950 (LWP 3271)]
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/libc.so.6...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /usr/lib64/libssl.so...done.
Loaded symbols for /usr/lib64/libssl.so
Reading symbols from /usr/lib64/libcrypto.so.0.9.8...done.
Loaded symbols for /usr/lib64/libcrypto.so.0.9.8
Reading symbols from /lib64/libz.so.1...done.
Loaded symbols for /lib64/libz.so.1
Reading symbols from /home/opoku/work/code/diagnosis/library/clsql-4.0.3/uffi/clsql_uffi.so...done.
Loaded symbols for /home/opoku/work/code/diagnosis/library/clsql-4.0.3/uffi/clsql_uffi.so
Reading symbols from /usr/lib64/libmysqlclient.so...done.
Loaded symbols for /usr/lib64/libmysqlclient.so
Reading symbols from /lib64/libcrypt.so.1...done.
Loaded symbols for /lib64/libcrypt.so.1
Reading symbols from /lib64/libnsl.so.1...done.
Loaded symbols for /lib64/libnsl.so.1
Reading symbols from /home/opoku/work/code/diagnosis/library/clsql-4.0.3/db-mysql/clsql_mysql.so...done.
Loaded symbols for /home/opoku/work/code/diagnosis/library/clsql-4.0.3/db-mysql/clsql_mysql.so
0x00002adafe2ca2cb in sem_timedwait () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00002adafe2ca2cb in sem_timedwait () from /lib64/libpthread.so.0
#1  0x000000000041b89c in sem_wait_forever (s=0x6476b0) at ../thread_manager.c:338
#2  0x000000000041bfef in suspend_resume_handler (signo=40, info=<value optimized out>, context=0x7ffface5c800) at ../thread_manager.c:455
#3  <signal handler called>
#4  0x00002adafe2cb5c1 in nanosleep () from /lib64/libpthread.so.0
#5  0x00000000004105da in _SPffcall () at ../x86-spentry64.s:3983
#6  0x00007ffface5cf10 in ?? ()
#7  0x00002adafea64f88 in ?? ()
#8  0x000000000000031a in ?? ()
#9  0x00007ffface5cf00 in ?? ()
#10 0x0000000000000000 in ?? ()
(gdb) info threads
   9 Thread 0x40263950 (LWP 3271)  0x00002adafe2ca2cb in sem_timedwait () from /lib64/libpthread.so.0
   8 Thread 0x404c7950 (LWP 3272)  0x00002adafe2ca2cb in sem_timedwait () from /lib64/libpthread.so.0
   7 Thread 0x4072b950 (LWP 3305)  0x00002adafe2ca2cb in sem_timedwait () from /lib64/libpthread.so.0
   6 Thread 0x4098f950 (LWP 3306)  0x00002adafe2ca2cb in sem_timedwait () from /lib64/libpthread.so.0
   5 Thread 0x40bf3950 (LWP 3307)  0x00002adafe2ca2cb in sem_timedwait () from /lib64/libpthread.so.0
   4 Thread 0x40e57950 (LWP 6093)  0x00002adafe591bfb in read () from /lib64/libc.so.6
   3 Thread 0x4131f950 (LWP 6094)  0x00002adafe2ca2cb in sem_timedwait () from /lib64/libpthread.so.0
   2 Thread 0x410bb950 (LWP 6095)  0x00002adafe2ca2cb in sem_timedwait () from /lib64/libpthread.so.0
   1 Thread 0x2adafe820880 (LWP 3268)  0x00002adafe2ca2cb in sem_timedwait () from /lib64/libpthread.so.0

On Jul 17, 2008, at 3:54 PM, Gary Byers wrote:

>
>
> On Thu, 17 Jul 2008, Osei Poku wrote:
>
>> Hello,
>>
>> I updated today from svn, but this thing happened again.  Again the  
>> PC was in the pthread memory region and %rdi was 0.  I verified  
>> that the fix (r9997, I think) was in my ccl working directory  
>> (somewhere in thread_manager.c, right?).
>
> Yes; there are 3 calls to pthread_kill() in that file.  One of them  
> (in resume_tcr()) is conditionalized out; the other two
> (in raise_thread_interrupt() and suspend_tcr()) should check
> to make sure that the thread that they'd pass as the first
> argument to pthread_kill() is non-zero before doing the call.
>
>>
>> My current version is:
>> Clozure Common Lisp Version 1.2-r10073M-RC1  (LinuxX8664)!
>>
>> Is there anything other than (rebuild-ccl :force t) that I need to  
>> do to recompile the C source for the lisp kernel?
>
> As Gail just pointed out, :full t (or :kernel t) is necessary
> in order to get the kernel updated. (:force t will recompile
> FASLs even if they're newer than the corresponding source;
> that's occasionally useful, but not really what you want here.)
>
> If the kernel that you're running had its modified date change
> by the rebuild process, it likely incorporates those changes.  If
> those changes didn't fix the problem, then I don't have a good
> guess as to what the problem is: there aren't too many places
> where the lisp calls into the threads library: it creates threads
> and sends them signals via pthread_kill().  (There's another place  
> where a thread will send itself a signal via pthread_kill(),
> but that is pretty much guaranteed to be a valid thread ...)
>
>
>>
>> Thanks,
>> Osei
>>
>> On Jul 9, 2008, at 3:05 PM, Gary Byers wrote:
>>
>>> --On July 9, 2008 2:26:56 PM -0400 Osei Poku <osei.poku at gmail.com>  
>>> wrote:
>>>> Hi,
>>>> It crashed again for me.  This time I managed to grab the  
>>>> contents of
>>>> /proc/pid/maps before I killed it.  Logs of the tty session and  
>>>> memory
>>>> maps are attached.  I had also managed to update from the  
>>>> repository to
>>>> r9890-RC1.
>>>> Osei
>>> It seems to have crashed in the threads library (libpthread.so).
>>> There's a race condition in the code which suspends threads
>>> on entry to the GC: the thread that's running the GC looks
>>> at each thread that it wants to suspend to see if it's
>>> still alive (the data structure that represents a thread
>>> might still be around, even if the OS-level thread has
>>> exited.)  The suspending thread looks at the tcr->osid
>>> field of the target, notes that it's non-zero, then
>>> calls a function to send the os-level thread a signal.
>>> That function accesses the tcr->osid field again (which,
>>> when non-zero, represents a POSIX thread ID) and calls
>>> pthread_kill().
>>> When a thread dies, it clears its tcr->osid field, so
>>> if the target thread dies between the point when the
>>> suspending thread looks and the point where it leaps,
>>> we wind up calling pthread_kill() with a first argument
>>> of 0, and it crashes.  That's consistent with the
>>> register information: we're somewhere in the threads
>>> library (possibly in pthread_kill()), and the register
>>> in which C functions receive their first argument (%rdi)
>>> is  0.
>>> I'll try to check in a fix for that (look before leaping)
>>> soon.  As I understand it, SLIME will sometimes (depending
>>> on the setting of a "communication style" variable)
>>> spawn a thread in which to run each form being evaluated
>>> (via C-M-x or whatever); whether that's a good idea or
>>> not, consing short-lived threads all the time is probably
>>> a good way to trigger this bug.  I don't use SLIME, and
>>> don't know what the consequences of changing the communication
>>> style variable would be.
>>
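
The "look before leaping" fix described in the quoted thread can be sketched in C roughly as follows. This is only an illustration, not the actual thread_manager.c code: demo_tcr and demo_signal_thread are hypothetical names, the osid field is modeled as an atomic integer, and the cast to pthread_t assumes a platform such as Linux where pthread_t is an integer type. The idea is to read osid once into a local and test that copy, instead of testing the field and then reading it a second time just before the call.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <signal.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's thread context record.
   Only the osid field is modeled; 0 means "the OS thread has exited". */
typedef struct demo_tcr {
    _Atomic uintptr_t osid;
} demo_tcr;

/* Send signo to the thread described by tcr.
   Returns ESRCH if the thread has already cleared its osid,
   otherwise whatever pthread_kill() returns (0 on success). */
int demo_signal_thread(demo_tcr *tcr, int signo)
{
    assert(tcr != NULL);

    /* Read the field exactly once and use only the local copy, so the
       value that was checked is the value passed to pthread_kill(). */
    uintptr_t id = atomic_load(&tcr->osid);
    if (id == 0) {
        return ESRCH;   /* thread already gone: do not call pthread_kill() */
    }
    /* Assumption: pthread_t is an integer type (true on Linux/glibc). */
    return pthread_kill((pthread_t)id, signo);
}
```

Passing 0 as signo makes pthread_kill() perform only its error checking, which is a convenient way to try the function without delivering a signal. Reading the field once narrows the window but does not close it completely: the target could still exit between the load and the call, so real code would likely need further synchronization.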