[Openmcl-devel] Random crashing
Osei Poku
osei.poku at gmail.com
Wed Aug 6 09:59:49 PDT 2008
This thing is not going away....
lisp debugger and gdb session below...
====lisp debugger
session
=
=
=
=
=
=
=
========================================================================
? exception in foreign context
Exception occurred while executing foreign code
? for help
[17455] OpenMCL kernel debugger: ?
(G) Set specified GPR to new value
(R) Show raw GPR/SPR register values
(L) Show Lisp values of tagged registers
(F) Show FPU registers
(S) Find and describe symbol matching specified name
(B) Show backtrace
(T) Show info about current thread
(X) Exit from this debugger, asserting that any exception was handled
(K) Kill OpenMCL process
(?) Show this help
[17455] OpenMCL kernel debugger: R
%rax = 0x0000000000000000 %r8 = 0x0000000000000000
%rcx = 0x0000000000000000 %r9 = 0x000000004072B7D0
%rdx = 0x0000000000000001 %r10 = 0x0000000000000008
%rbx = 0x00000000410BB7D0 %r11 = 0x0000000000000246
%rsp = 0x000000004072A218 %r12 = 0x000000004072B7D0
%rbp = 0x000000004072A6F0 %r13 = 0x000000004072A718
%rsi = 0x0000000000000001 %r14 = 0x0000000000000004
%rdi = 0x0000000000000000 %r15 = 0x000000004072AAA0
%rip = 0x00002B37EEFB3325 %rflags = 0x0000000000010246
[17455] OpenMCL kernel debugger: B
Framepointer [#x4072A6F0] in unknown area.
[17455] OpenMCL kernel debugger: T
Current Thread Context Record (tcr) = 0x4072b7d0
Control (C) stack area: low = 0x404d8000, high = 0x4072c000
Value (lisp) stack area: low = 0x2aaaab0f1000, high = 0x2aaaab302000
Exception stack pointer = 0x4072a218
[17455] OpenMCL kernel debugger: X
====gdb
session
=
=
=
=
=
=
=
========================================================================
(gdb) source local/share/ccl/lisp-kernel/linuxx8664/.gdbinit
No symbol table is loaded. Use the "file" command.
(gdb) attach 17455
Attaching to process 17455
Reading symbols from /home/opoku/local/share/ccl/lx86cl64...done.
Using host libthread_db library "/lib64/libthread_db.so.1".
Reading symbols from /lib64/libdl.so.2...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/libm.so.6...done.
Loaded symbols for /lib64/libm.so.6
Reading symbols from /lib64/libpthread.so.0...done.
[Thread debugging using libthread_db enabled]
[New Thread 0x2b37ef509880 (LWP 17455)]
[New Thread 0x40e57950 (LWP 17605)]
[New Thread 0x40bf3950 (LWP 17462)]
[New Thread 0x4098f950 (LWP 17461)]
[New Thread 0x4072b950 (LWP 17460)]
[New Thread 0x404c7950 (LWP 17459)]
[New Thread 0x40263950 (LWP 17458)]
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/libc.so.6...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /usr/lib64/libssl.so...done.
Loaded symbols for /usr/lib64/libssl.so
Reading symbols from /usr/lib64/libcrypto.so.0.9.8...done.
Loaded symbols for /usr/lib64/libcrypto.so.0.9.8
Reading symbols from /lib64/libz.so.1...done.
Loaded symbols for /lib64/libz.so.1
0x00002b37eefb32cb in sem_timedwait () from /lib64/libpthread.so.0
(gdb) source local/share/ccl/lisp-kernel/linuxx8664/.gdbinit
Redefine command "x86_lisp_string"? (y or n) [answered Y; input not
from terminal]
Redefine command "gtra"? (y or n) [answered Y; input not from terminal]
Redefine command "x86pname"? (y or n) [answered Y; input not from
terminal]
Redefine command "pname"? (y or n) [answered Y; input not from terminal]
Redefine command "l"? (y or n) [answered Y; input not from terminal]
Redefine command "lw"? (y or n) [answered Y; input not from terminal]
Redefine command "clobber_breakpoint"? (y or n) [answered Y; input not
from terminal]
Redefine command "arg_z"? (y or n) [answered Y; input not from terminal]
Redefine command "arg_y"? (y or n) [answered Y; input not from terminal]
Redefine command "arg_x"? (y or n) [answered Y; input not from terminal]
Redefine command "bx"? (y or n) [answered Y; input not from terminal]
Redefine command "showlist"? (y or n) [answered Y; input not from
terminal]
Redefine command "lbt"? (y or n) [answered Y; input not from terminal]
Redefine command "ada"? (y or n) [answered Y; input not from terminal]
Redefine command "lregs"? (y or n) [answered Y; input not from terminal]
Breakpoint 1 at 0x41d780: file ../lisp-debug.c, line 934.
(gdb) br *0x00002B37EEFB3325
Breakpoint 2 at 0x2b37eefb3325
(gdb) continue
Continuing.
[Switching to Thread 0x4072b950 (LWP 17460)]
Breakpoint 2, 0x00002b37eefb3325 in sem_post () from /lib64/
libpthread.so.0
2: x/i $pc
0x2b37eefb3325 <sem_post+5>: lock xadd %edx,(%rdi)
(gdb) info thread
7 Thread 0x40263950 (LWP 17458) 0x00002b37eefb32cb in
sem_timedwait () from /lib64/libpthread.so.0
6 Thread 0x404c7950 (LWP 17459) 0x00002b37eefb32cb in
sem_timedwait () from /lib64/libpthread.so.0
* 5 Thread 0x4072b950 (LWP 17460) 0x00002b37eefb3325 in sem_post ()
from /lib64/libpthread.so.0
4 Thread 0x4098f950 (LWP 17461) 0x00002b37eefb32cb in
sem_timedwait () from /lib64/libpthread.so.0
3 Thread 0x40bf3950 (LWP 17462) 0x00002b37eefb32cb in
sem_timedwait () from /lib64/libpthread.so.0
2 Thread 0x40e57950 (LWP 17605) 0x00002b37eefb32cb in
sem_timedwait () from /lib64/libpthread.so.0
1 Thread 0x2b37ef509880 (LWP 17455) 0x00002b37eefb32cb in
sem_timedwait () from /lib64/libpthread.so.0
(gdb) bt
#0 0x00002b37eefb3325 in sem_post () from /lib64/libpthread.so.0
#1 0x000000000041b3e2 in resume_tcr (tcr=0x4098f7d0) at ../
thread_manager.c:1376
#2 0x000000000041bbea in resume_other_threads (for_gc=<value
optimized out>) at ../thread_manager.c:1546
#3 0x000000000041d66e in lisp_Debugger (xp=0x404d7e40, info=0x0,
why=-1, in_foreign_code=1, message=0x404d7b30 "exception in foreign
context") at ../lisp-debug.c:919
#4 0x000000000041d757 in FBug (xp=0x404d7e40, format=<value optimized
out>) at ../lisp-debug.c:954
#5 0x0000000000418a07 in altstack_signal_handler (signum=11,
info=0x404d7f70, context=0x404d7e40) at ../x86-exceptions.c:1287
#6 <signal handler called>
#7 0x00002b37eefb3325 in sem_post () from /lib64/libpthread.so.0
#8 0x000000000041b3e2 in resume_tcr (tcr=0x410bb7d0) at ../
thread_manager.c:1376
#9 0x000000000041bf56 in lisp_resume_tcr (tcr=0x410bb7d0) at ../
thread_manager.c:1418
#10 0x000000000041a0c8 in handle_exception (signum=<value optimized
out>, info=0x4072aaa0, context=0x4072a6f0, tcr=0x4072b7d0,
old_valence=0) at ../x86-exceptions.c:910
#11 0x000000000041a218 in signal_handler (signum=4, info=0x4072aaa0,
context=0x4072a6f0, tcr=0x4072b7d0, old_valence=0) at ../x86-
exceptions.c:1064
#12 <signal handler called>
#13 0x00003000400110ab in ?? ()
#14 0x0000300040431a5c in ?? ()
#15 0x000000000040e2b4 in _SPnthrow1value () at ../x86-spentry64.s:1516
#16 0x00002aaaab300ff0 in ?? ()
#17 0x0000000000000008 in ?? ()
#18 0xfffffffffffffff8 in ?? ()
#19 0x00002aaaab301110 in ?? ()
#20 0x0000000000000008 in ?? ()
#21 0x0000000000000000 in ?? ()
(gdb) p/x *(TCR *) 0x4098f7d0
$1 = {next = 0x40bf37d0, prev = 0x4072b7d0, single_float_convert =
{tag = 0x1, f = 0x0}, linear = 0x4098f7d0, save_rbp = 0x2aaaab624c40,
lisp_mxcsr = 0x1920, foreign_mxcsr = 0x1f80, db_link = 0x2aaaab624dc0,
catch_top = 0x2aaaab73599d, save_vsp = 0x2aaaab624be8, save_tsp =
0x2aaaab7358e0, foreign_sp = 0x4098ed60, cs_area = 0x6647e0, vs_area =
0x660610, ts_area = 0x6606f0,
cs_limit = 0x4075e000, bytes_allocated = 0x58ef60,
log2_allocation_quantum = 0x11, interrupt_pending = 0x0, xframe = 0x0,
errno_loc = 0x4098f770, ffi_exception = 0x1f80, osid = 0x4098f950,
valence = 0x1, foreign_exception_status = 0x0, native_thread_info =
0x0, native_thread_id = 0x442f, last_allocptr = 0x3000447c0000,
save_allocptr = 0x3000447bd1a0,
save_allocbase = 0x3000447a0000, reset_completion = 0x660550,
activate = 0x660580, suspend_count = 0x0, suspend_context =
0x4098e6f0, pending_exception_context = 0x0, suspend = 0x6604f0,
resume = 0x660520, flags = 0x0, gc_context = 0x0,
termination_semaphore = 0x664990, unwinding = 0x0, tlb_limit = 0x4000,
tlb_pointer = 0x6607a0, shutdown_count = 0x4,
next_tsp = 0x2aaaab7358e0, safe_ref_address = 0x0}
(gdb) p/x *(TCR *) 0x410bb7d0
$2 = {next = 0x0, prev = 0x0, single_float_convert = {tag = 0x1, f =
0x0}, linear = 0x0, save_rbp = 0x2aaaac7d0ab0, lisp_mxcsr = 0x1920,
foreign_mxcsr = 0x1f80, db_link = 0x0, catch_top = 0x0, save_vsp =
0x2aaaac7d0a58, save_tsp = 0x2aaaac8e2000, foreign_sp = 0x410bada0,
cs_area = 0x0, vs_area = 0x0, ts_area = 0x0, cs_limit = 0x40e8a000,
bytes_allocated = 0x0,
log2_allocation_quantum = 0x11, interrupt_pending = 0x0, xframe =
0x0, errno_loc = 0x410bb770, ffi_exception = 0x1f80, osid = 0x0,
valence = 0x1, foreign_exception_status = 0x0, native_thread_info =
0x0, native_thread_id = 0x442f, last_allocptr = 0x300044840000,
save_allocptr = 0x30004483c730, save_allocbase = 0x300044820000,
reset_completion = 0x0, activate = 0x0,
suspend_count = 0x0, suspend_context = 0x0,
pending_exception_context = 0x0, suspend = 0x0, resume = 0x0, flags =
0x0, gc_context = 0x0, termination_semaphore = 0x0, unwinding = 0x0,
tlb_limit = 0x0, tlb_pointer = 0x0, shutdown_count = 0x0, next_tsp =
0x2aaaac8e2000, safe_ref_address = 0x0}
(gdb) down
Bottom (innermost) frame selected; you cannot go down.
(gdb) up
#1 0x000000000041b3e2 in resume_tcr (tcr=0x4098f7d0) at ../
thread_manager.c:1376
1376 ../thread_manager.c: No such file or directory.
in ../thread_manager.c
(gdb) up
#2 0x000000000041bbea in resume_other_threads (for_gc=<value
optimized out>) at ../thread_manager.c:1546
1546 in ../thread_manager.c
(gdb) up
#3 0x000000000041d66e in lisp_Debugger (xp=0x404d7e40, info=0x0,
why=-1, in_foreign_code=1, message=0x404d7b30 "exception in foreign
context") at ../lisp-debug.c:919
919 ../lisp-debug.c: No such file or directory.
in ../lisp-debug.c
(gdb) up
#4 0x000000000041d757 in FBug (xp=0x404d7e40, format=<value optimized
out>) at ../lisp-debug.c:954
954 in ../lisp-debug.c
(gdb) up
#5 0x0000000000418a07 in altstack_signal_handler (signum=11,
info=0x404d7f70, context=0x404d7e40) at ../x86-exceptions.c:1287
1287 ../x86-exceptions.c: No such file or directory.
in ../x86-exceptions.c
(gdb) up
#6 <signal handler called>
(gdb) up
#7 0x00002b37eefb3325 in sem_post () from /lib64/libpthread.so.0
(gdb) up
#8 0x000000000041b3e2 in resume_tcr (tcr=0x410bb7d0) at ../
thread_manager.c:1376
1376 ../thread_manager.c: No such file or directory.
in ../thread_manager.c
(gdb) p/x *(TCR *) 0x410bb7d0
$3 = {next = 0x0, prev = 0x0, single_float_convert = {tag = 0x1, f =
0x0}, linear = 0x0, save_rbp = 0x2aaaac7d0ab0, lisp_mxcsr = 0x1920,
foreign_mxcsr = 0x1f80, db_link = 0x0, catch_top = 0x0, save_vsp =
0x2aaaac7d0a58, save_tsp = 0x2aaaac8e2000, foreign_sp = 0x410bada0,
cs_area = 0x0, vs_area = 0x0, ts_area = 0x0, cs_limit = 0x40e8a000,
bytes_allocated = 0x0,
log2_allocation_quantum = 0x11, interrupt_pending = 0x0, xframe =
0x0, errno_loc = 0x410bb770, ffi_exception = 0x1f80, osid = 0x0,
valence = 0x1, foreign_exception_status = 0x0, native_thread_info =
0x0, native_thread_id = 0x442f, last_allocptr = 0x300044840000,
save_allocptr = 0x30004483c730, save_allocbase = 0x300044820000,
reset_completion = 0x0, activate = 0x0,
suspend_count = 0x0, suspend_context = 0x0,
pending_exception_context = 0x0, suspend = 0x0, resume = 0x0, flags =
0x0, gc_context = 0x0, termination_semaphore = 0x0, unwinding = 0x0,
tlb_limit = 0x0, tlb_pointer = 0x0, shutdown_count = 0x0, next_tsp =
0x2aaaac8e2000, safe_ref_address = 0x0}
(gdb)
On Jul 21, 2008, at 6:42 PM, Gary Byers wrote:
> Thanks. Curiouser and curiouser, not only is the "resume" field 0,
> but many other fields are as well, including 'next' and 'prev'. (TCR
> structures are maintained in a circular, doubly-linked list; this guy
> seems to have died and spliced himself out of that list.) Enough
> fields are set that this looks like a dead thread rather than a
> newly-created one.
>
> The backtrace indicates that this was coming from
> 'lisp_resume_other_threads()", which is called as part of the
> expansion
> of WITH-OTHER-THREADS-SUSPENDED. And lisp_resume_other_threads()
> and lisp_suspend_other_threads() don't bother to grab and release
> the lock which allows modification of the tcr list.
>
> I'm not quite sure why what happened happened, but the code that
> walks this doubly-linked list suspending and resuming threads should
> be confident that other threads aren't splicing themselves on and off
> that list while it's being walked.
>
>
> On Mon, 21 Jul 2008, Osei Poku wrote:
>
>>
>> On Jul 21, 2008, at 5:53 PM, Gary Byers wrote:
>>
>>> If you still have the debugging session running, could you do:
>>> (gdb) p/x *(TCR *)0x417e77d0
>>
>> (gdb) p/x *(TCR *)0x417e77d0
>> $1 = {next = 0x0, prev = 0x0, single_float_convert = {tag = 0x1, f
>> = 0x0}, linear = 0x0, save_rbp = 0x2aaaadd49ab0, lisp_mxcsr =
>> 0x1920, foreign_mxcsr = 0x1f80, db_link = 0x0, catch_top = 0x0,
>> save_vsp = 0x2aaaadd49a58, save_tsp = 0x2aaaade5b000, foreign_sp =
>> 0x417e6da0, cs_area = 0x0, vs_area = 0x0, ts_area = 0x0, cs_limit =
>> 0x415b6000, bytes_allocated = 0x0,
>> log2_allocation_quantum = 0x11, interrupt_pending = 0x0, xframe =
>> 0x0, errno_loc = 0x417e7770, ffi_exception = 0x1f80, osid = 0x0,
>> valence = 0x1, foreign_exception_status = 0x0, native_thread_info =
>> 0x0, native_thread_id = 0x1847, last_allocptr = 0x3000455e0000,
>> save_allocptr = 0x3000455db200, save_allocbase = 0x3000455c0000,
>> reset_completion = 0x0, activate = 0x0,
>> suspend_count = 0x0, suspend_context = 0x0,
>> pending_exception_context = 0x0, suspend = 0x0, resume = 0x0, flags
>> = 0x0, gc_context = 0x0, termination_semaphore = 0x0, unwinding =
>> 0x0, tlb_limit = 0x0, tlb_pointer = 0x0, shutdown_count = 0x0,
>> next_tsp = 0x2aaaade5b000, safe_ref_address = 0x0}
>>
>>
>> To save your eyes scanning,
>>
>> resume = 0x0
>>
>>
>>> That address is the value of the "tcr" argument to "resume_tcr()" in
>>> frame #7 in the backtrace below, so if you don't still have the
>>> debugging session and reproduce the problem, we want to see what
>>> the value of the "tcr" argument to resume_tcr() at the point was
>>> at the point where resume_tcr() called sem_post() and crashed.
>>> The gdb command above means "print, in hex, this contents of
>>> what this address points to, interpreting that address as
>>> being of type "pointer to TCR" (where a TCR is a "Thread Context
>>> Record" that contains several interesting fields.)
>>> 'resume_tcr()' basically does 'sem_post(tcr->resume)', and a crash
>>> would make sense if tcr->resume was NULL. If it was, then one of
>>> the threads that's doing sem_timedwait() on its 'resume' semaphore
>>> would presumably be waiting on a NULL semahore, and that doesn't
>>> make sense.
>>
More information about the Openmcl-devel
mailing list