[Openmcl-devel] Random crashing
Osei Poku
osei.poku at gmail.com
Fri Jul 18 09:32:11 PDT 2008
More debug info... Sorry about the multiple emails, I'm figuring
things out as I go.
(gdb) info threads
9 Thread 0x40263950 (LWP 3271) 0x00002adafe2ca2cb in sem_timedwait () from /lib64/libpthread.so.0
8 Thread 0x404c7950 (LWP 3272) 0x00002adafe2ca2cb in sem_timedwait () from /lib64/libpthread.so.0
7 Thread 0x4072b950 (LWP 3305) 0x00002adafe2ca2cb in sem_timedwait () from /lib64/libpthread.so.0
6 Thread 0x4098f950 (LWP 3306) 0x00002adafe2ca2cb in sem_timedwait () from /lib64/libpthread.so.0
5 Thread 0x40bf3950 (LWP 3307) 0x00002adafe2ca2cb in sem_timedwait () from /lib64/libpthread.so.0
4 Thread 0x40e57950 (LWP 6093) 0x00002adafe591bfb in read () from /lib64/libc.so.6
3 Thread 0x4131f950 (LWP 6094) 0x00002adafe2ca2cb in sem_timedwait () from /lib64/libpthread.so.0
2 Thread 0x410bb950 (LWP 6095) 0x00002adafe2ca2cb in sem_timedwait () from /lib64/libpthread.so.0
1 Thread 0x2adafe820880 (LWP 3268) 0x00002adafe2ca2cb in sem_timedwait () from /lib64/libpthread.so.0
(gdb) thread 4
[Switching to thread 4 (Thread 0x40e57950 (LWP 6093))]#0 0x00002adafe591bfb in read () from /lib64/libc.so.6
(gdb) bt
#0 0x00002adafe591bfb in read () from /lib64/libc.so.6
#1 0x00002adafe545553 in _IO_file_underflow () from /lib64/libc.so.6
#2 0x00002adafe545d0e in _IO_default_uflow () from /lib64/libc.so.6
#3 0x00002adafe541404 in getc () from /lib64/libc.so.6
#4 0x000000000041d43d in readc () at /usr/include/bits/stdio.h:43
#5 0x000000000041d590 in lisp_Debugger (xp=0x40e55d60, info=0x40e56110, why=11, in_foreign_code=1, message=0x40e55b10 "Unhandled exception 11 at 0x2adafe2ca325, context->regs at #x40e55d88") at ../lisp-debug.c:914
#6 0x000000000041a2c6 in signal_handler (signum=11, info=0x40e56110, context=0x40e55d60, tcr=0x40e577d0, old_valence=1) at ../x86-exceptions.c:1070
#7 <signal handler called>
#8 0x00002adafe2ca325 in sem_post () from /lib64/libpthread.so.0
#9 0x000000000041b3e2 in resume_tcr (tcr=0x415837d0) at ../thread_manager.c:1376
#10 0x000000000041c146 in lisp_resume_tcr (tcr=0x415837d0) at ../thread_manager.c:1418
#11 0x000000000041a0c8 in handle_exception (signum=<value optimized out>, info=0x40e56aa0, context=0x40e566f0, tcr=0x40e577d0, old_valence=0) at ../x86-exceptions.c:910
#12 0x000000000041a218 in signal_handler (signum=4, info=0x40e56aa0, context=0x40e566f0, tcr=0x40e577d0, old_valence=0) at ../x86-exceptions.c:1064
#13 <signal handler called>
#14 0x00003000400110ab in ?? ()
#15 0x000030004042660c in ?? ()
#16 0x000000000040e0ac in _SPnthrowvalues () at ../x86-spentry64.s:1404
#17 0x00002aaaad1b1110 in ?? ()
#18 0x0000000000000008 in ?? ()
#19 0x0000000000000000 in ?? ()
(gdb)
On Jul 17, 2008, at 3:54 PM, Gary Byers wrote:
>
>
> On Thu, 17 Jul 2008, Osei Poku wrote:
>
>> Hello,
>>
>> I updated today from svn but this thing happened again. Again the
>> PC was in the pthread memory region and %rdi was 0. I verified
>> that the fix (r9997, I think) was in my ccl working directory
>> (somewhere in thread_manager.c right?).
>
> Yes; there are 3 calls to pthread_kill() in that file. One of them
> (in resume_tcr()) is conditionalized out; the other two
> (in raise_thread_interrupt() and suspend_tcr()) should check
> to make sure that the thread that they'd pass as the first
> argument to pthread_kill() is non-zero before doing the call.
>
>>
>> My current version is:
>> Clozure Common Lisp Version 1.2-r10073M-RC1 (LinuxX8664)!
>>
>> Is there anything other than (rebuild-ccl :force t) that I need to
>> do to recompile the c source for the lisp kernel?
>
> As Gail just pointed out, :full t (or :kernel t) is necessary
> in order to get the kernel updated. (:force t will recompile
> FASLs even if they're newer than the corresponding source;
> that's occasionally useful, but not really what you want here.)
>
> If the kernel that you're running had its modified date change
> by the rebuild process, it likely incorporates those changes. If
> those changes didn't fix the problem, then I don't have a good
> guess as to what the problem is: there aren't too many places
> where the lisp calls into the threads library: it creates threads
> and sends them signals via pthread_kill(). (There's another place
> where a thread will send itself a signal via pthread_kill(),
> but that is pretty much guaranteed to be a valid thread ...)
>
>
>>
>> Thanks,
>> Osei
>>
>> On Jul 9, 2008, at 3:05 PM, Gary Byers wrote:
>>
>>> --On July 9, 2008 2:26:56 PM -0400 Osei Poku <osei.poku at gmail.com>
>>> wrote:
>>>> Hi,
>>>> It crashed again for me. This time I managed to grab the contents of
>>>> /proc/pid/maps before I killed it. Logs of the tty session and memory
>>>> maps are attached. I had also managed to update from the repository to
>>>> r9890-RC1.
>>>> Osei
>>> It seems to have crashed in the threads library (libpthread.so).
>>> There's a race condition in the code which suspends threads
>>> on entry to the GC: the thread that's running the GC looks
>>> at each thread that it wants to suspend to see if it's
>>> still alive (the data structure that represents a thread
>>> might still be around, even if the OS-level thread has
>>> exited.) The suspending thread looks at the tcr->osid
>>> field of the target, notes that it's non-zero, then
>>> calls a function to send the os-level thread a signal.
>>> That function accesses the tcr->osid field again (which,
>>> when non-zero, represents a POSIX thread ID) and calls
>>> pthread_kill().
>>> When a thread dies, it clears its tcr->osid field, so
>>> if the target thread dies between the point when the
>>> suspending thread looks and the point where it leaps,
>>> we wind up calling pthread_kill() with a first argument
>>> of 0, and it crashes. That's consistent with the
>>> register information: we're somewhere in the threads
>>> library (possibly in pthread_kill()), and the register
>>> in which C functions receive their first argument (%rdi)
>>> is 0.
>>> I'll try to check in a fix for that (look before leaping)
>>> soon. As I understand it, SLIME will sometimes (depending
>>> on the setting of a "communication style" variable)
>>> spawn a thread in which to run each form being evaluated
>>> (via C-M-x or whatever); whether that's a good idea or
>>> not, consing short-lived threads all the time is probably
>>> a good way to trigger this bug. I don't use SLIME, and
>>> don't know what the consequences of changing the communication
>>> style variable would be.
>>
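For anyone following the thread, here is a rough C sketch of the check-then-use race described in the quoted message above, and of the "look before leaping" idea. It is not the actual code from thread_manager.c or the change in the repository: the struct thread_record and both function names are made up for illustration, only the osid field mirrors the tcr->osid field mentioned above, and it assumes pthread_t is an integer type (as it is on Linux/glibc) so that it can be held in an atomic and compared with 0.

```c
#include <errno.h>
#include <pthread.h>
#include <signal.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the kernel's per-thread record. Only the
   field discussed in the thread is modelled here. */
struct thread_record {
    _Atomic pthread_t osid;   /* cleared to 0 when the OS-level thread dies */
};

/* Racy pattern: the field is read twice. If the target thread clears
   osid between the two reads, pthread_kill() receives 0. */
int signal_thread_racy(struct thread_record *t, int sig)
{
    if (atomic_load(&t->osid) != 0) {
        /* the target may exit and zero osid right here */
        return pthread_kill(atomic_load(&t->osid), sig);
    }
    return ESRCH;
}

/* Checked pattern: read the field once into a local, test the local,
   and pass that same local to pthread_kill(). */
int signal_thread_checked(struct thread_record *t, int sig)
{
    pthread_t id = atomic_load(&t->osid);
    if (id == 0) {
        return ESRCH;         /* thread already gone; nothing to signal */
    }
    return pthread_kill(id, sig);
}
```

The second version can no longer hand 0 to pthread_kill(), which is the specific failure described above. It does not by itself guarantee that the thread is still alive when the signal is sent, so treat it as an illustration of the pattern rather than a complete fix.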