[Openmcl-devel] odd ccl hang during gc
Osei Poku
osei.poku at gmail.com
Tue Jul 7 11:06:25 PDT 2009
Hi,
This issue cropped up again on the same 16-core opteron machine. It
is a similar situation with multiple threads running and then, all at
once, all threads except one stop at sem_timedwait(). The other
thread is stuck in the same spot as before trying to free something.
A few backtraces attached. I am still running 1.3-RC1-r11885M so this
may have been fixed in newer releases.
I will wait until I get replies before killing this process.
Osei
(gdb) bt
#0 0x00002b9fab6e249b in sem_timedwait () from /lib/libpthread.so.0
#1 0x000000000041cc3c in sem_wait_forever (s=0x6496b0) at ../
thread_manager.c:443
#2 0x000000000041da7f in suspend_resume_handler (signo=<value
optimized out>, info=<value optimized out>, context=0x7fffffa75570)
at ../thread_manager.c:594
#3 <signal handler called>
#4 0x00002b9fab6e37b1 in ?? () from /lib/libpthread.so.0
#5 0x0000000000411402 in _SPffcall () at ../x86-spentry64.s:4041
#6 0x00007fffffa75c80 in ?? ()
#7 0x00002b9fabe90f88 in ?? ()
#8 0x000000000000031a in ?? ()
#9 0x00007fffffa75c70 in ?? ()
#10 0x0000000000000000 in ?? ()
(gdb) threads
Undefined command: "threads". Try "help".
(gdb) info threads
14 Thread 1076246864 (LWP 3695) 0x00002b9fab6e249b in
sem_timedwait () from /lib/libpthread.so.0
13 Thread 1078753616 (LWP 3697) 0x00002b9fab6e249b in
sem_timedwait () from /lib/libpthread.so.0
12 Thread 1081260368 (LWP 3701) 0x00002b9fab6e249b in
sem_timedwait () from /lib/libpthread.so.0
11 Thread 1083767120 (LWP 3702) 0x00002b9fab6e249b in
sem_timedwait () from /lib/libpthread.so.0
10 Thread 1086273872 (LWP 3703) 0x00002b9fab6e249b in
sem_timedwait () from /lib/libpthread.so.0
9 Thread 1088780624 (LWP 3704) 0x00002b9fab6e249b in sem_timedwait
() from /lib/libpthread.so.0
8 Thread 1091287376 (LWP 19572) 0x00002b9fab6e249b in
sem_timedwait () from /lib/libpthread.so.0
7 Thread 1116133712 (LWP 28035) 0x00002b9fab6e249b in
sem_timedwait () from /lib/libpthread.so.0
6 Thread 1111120208 (LWP 28036) 0x00002b9fab6e249b in
sem_timedwait () from /lib/libpthread.so.0
5 Thread 1099680080 (LWP 28037) 0x00002b9fab6e249b in
sem_timedwait () from /lib/libpthread.so.0
4 Thread 1108072784 (LWP 28038) 0x00002b9fab6e249b in
sem_timedwait () from /lib/libpthread.so.0
3 Thread 1113626960 (LWP 28625) 0x00002b9fab9d11d8 in ?? () from /
lib/libc.so.6
2 Thread 1121147216 (LWP 29139) 0x00002b9fab6e249b in
sem_timedwait () from /lib/libpthread.so.0
1 Thread 47964781596768 (LWP 3691) 0x00002b9fab6e249b in
sem_timedwait () from /lib/libpthread.so.0
(gdb) thread 3
[Switching to thread 3 (Thread 1113626960 (LWP 28625))]#0
0x00002b9fab9d11d8 in ?? () from /lib/libc.so.6
(gdb) bt
#0 0x00002b9fab9d11d8 in ?? () from /lib/libc.so.6
#1 0x00002b9fab967662 in ?? () from /lib/libc.so.6
#2 0x00002b9fab9666f1 in free () from /lib/libc.so.6
#3 0x000000000041ca69 in destroy_semaphore (s=0x3000436cc0c8) at ../
thread_manager.c:724
#4 0x0000000000414caa in reap_gcable_ptrs () at ../gc-common.c:746
#5 0x00000000004157bd in gc (tcr=0x426097b0, param=<value optimized
out>) at ../gc-common.c:1331
#6 0x00000000004196db in gc_from_tcr (tcr=0x2aaab0000020, param=0)
at ../x86-exceptions.c:2786
#7 0x000000000041abe1 in gc_like_from_xp (xp=<value optimized out>,
fun=0x4196b0 <gc_from_tcr>, param=0) at ../x86-exceptions.c:2743
#8 0x000000000041ac3e in gc_from_xp (xp=0x2aaab0000020,
param=46912586140688) at ../x86-exceptions.c:2798
#9 0x000000000041b14f in handle_gc_trap (xp=0x426086f0,
tcr=0x426097b0) at ../x86-exceptions.c:239
#10 0x000000000041b5eb in handle_exception (signum=<value optimized
out>, info=0x42608aa0, context=0x426086f0, tcr=0x8000000000000000,
old_valence=-1) at ../x86-exceptions.c:1066
#11 0x000000000041b698 in signal_handler (signum=11, info=0x42608aa0,
context=0x426086f0) at ../x86-exceptions.c:1330
#12 <signal handler called>
#13 0x00003000400135e7 in ?? ()
#14 0x00003000419096a4 in ?? ()
#15 0x000000000040ee7c in _SPnthrowvalues () at ../x86-spentry64.s:1405
#16 0x00002aaaae17eae0 in ?? ()
#17 0x0000000000000008 in ?? ()
#18 0x0000000000000000 in ?? ()
On Apr 15, 2009, at 2:50 AM, Gary Byers wrote:
> and the sem_wait_forever() is coming from suspend_resume_handler(),
> which
> means that those threads (and likely all others) are waiting on
> semaphores
> that should have been signaled a while ago.
>
> Just as a stab in the dark: what kind of hardware is this ? (e.g.,
> what
> does
>
> $ cat /proc/cpuinfo
>
> say ?)
>
>
> On Wed, 15 Apr 2009, Osei Poku wrote:
>
>> Unfortunately, I killed the process. Fortunately, I did run the
>> following backtraces on other a few other threads than #3.
>> According to the line numbers shown in the output below, it appears
>> that they (at least the 3 threads I looked at) are all stopped at
>> SEM_WAIT_FOREVER().
>>
>> I guess the mystery continues...
>>
>> Osei
>>
>> (gdb) info threads
>> 15 Thread 1076246864 (LWP 14275) 0x00002b9bfc77149b in
>> sem_timedwait () from /lib/libpthread.so.0
>> 14 Thread 1078753616 (LWP 14278) 0x00002b9bfc77149b in
>> sem_timedwait () from /lib/libpthread.so.0
>> 13 Thread 1081260368 (LWP 14280) 0x00002b9bfc77149b in
>> sem_timedwait () from /lib/libpthread.so.0
>> 12 Thread 1083767120 (LWP 14281) 0x00002b9bfc77149b in
>> sem_timedwait () from /lib/libpthread.so.0
>> 11 Thread 1086273872 (LWP 14282) 0x00002b9bfc77149b in
>> sem_timedwait () from /lib/libpthread.so.0
>> 10 Thread 1088780624 (LWP 14283) 0x00002b9bfc77149b in
>> sem_timedwait () from /lib/libpthread.so.0
>> 9 Thread 1091287376 (LWP 24522) 0x00002b9bfc77149b in
>> sem_timedwait () from /lib/libpthread.so.0
>> 8 Thread 1093794128 (LWP 24523) 0x00002b9bfc77149b in
>> sem_timedwait () from /lib/libpthread.so.0
>> 7 Thread 1096841552 (LWP 24524) 0x00002b9bfc77149b in
>> sem_timedwait () from /lib/libpthread.so.0
>> 6 Thread 1099348304 (LWP 24525) 0x00002b9bfc77149b in
>> sem_timedwait () from /lib/libpthread.so.0
>> 5 Thread 1101855056 (LWP 28424) 0x00002b9bfc77149b in
>> sem_timedwait () from /lib/libpthread.so.0
>> 4 Thread 1104361808 (LWP 28425) 0x00002b9bfc77149b in
>> sem_timedwait () from /lib/libpthread.so.0
>> * 3 Thread 1106868560 (LWP 28426) 0x00002b9bfca601d8 in ?? ()
>> from /lib/libc.so.6
>> 2 Thread 1109375312 (LWP 28427) 0x00002b9bfc77149b in
>> sem_timedwait () from /lib/libpthread.so.0
>> 1 Thread 47948961267808 (LWP 14271) 0x00002b9bfc77149b in
>> sem_timedwait () from /lib/libpthread.so.0
>> (gdb) thread 4
>> [Switching to thread 4 (Thread 1104361808 (LWP 28425))]#0
>> 0x00002b9bfc77149b in sem_timedwait () from /lib/libpthread.so.0
>> (gdb) bt
>> #0 0x00002b9bfc77149b in sem_timedwait () from /lib/libpthread.so.0
>> #1 0x000000000041cc3c in sem_wait_forever (s=0x1ce0cb0) at ../
>> thread_manager.c:443
>> #2 0x000000000041da7f in suspend_resume_handler (signo=<value
>> optimized out>, info=<value optimized out>, context=0x41d32670)
>> at ../thread_manager.c:594
>> #3 <signal handler called>
>> #4 0x00002b9bfc77149b in sem_timedwait () from /lib/libpthread.so.0
>> #5 0x000000000041cbdb in wait_on_semaphore (s=0x18078b0,
>> seconds=1, millis=0) at ../thread_manager.c:468
>> #6 0x0000000000411402 in _SPffcall () at ../x86-spentry64.s:4041
>> #7 0x0000000000000000 in ?? ()
>> (gdb) thread 5
>> [Switching to thread 5 (Thread 1101855056 (LWP 28424))]#0
>> 0x00002b9bfc77149b in sem_timedwait () from /lib/libpthread.so.0
>> (gdb) bt
>> #0 0x00002b9bfc77149b in sem_timedwait () from /lib/libpthread.so.0
>> #1 0x000000000041cc3c in sem_wait_forever (s=0xd3e4b0) at ../
>> thread_manager.c:443
>> #2 0x000000000041da7f in suspend_resume_handler (signo=<value
>> optimized out>, info=<value optimized out>, context=0x41ace650)
>> at ../thread_manager.c:594
>> #3 <signal handler called>
>> #4 0x00002b9bfc7727b1 in ?? () from /lib/libpthread.so.0
>> #5 0x0000000000411402 in _SPffcall () at ../x86-spentry64.s:4041
>> #6 0x0000000041aced60 in ?? ()
>> #7 0x00002aaaaf859e88 in ?? ()
>> #8 0x000000000000031a in ?? ()
>> #9 0x0000000041aced50 in ?? ()
>> #10 0x0000000000000000 in ?? ()
>> (gdb) thread 6
>> [Switching to thread 6 (Thread 1099348304 (LWP 24525))]#0
>> 0x00002b9bfc77149b in sem_timedwait () from /lib/libpthread.so.0
>> (gdb) bt
>> #0 0x00002b9bfc77149b in sem_timedwait () from /lib/libpthread.so.0
>> #1 0x000000000041cc3c in sem_wait_forever (s=0x23af950) at ../
>> thread_manager.c:443
>> #2 0x000000000041da7f in suspend_resume_handler (signo=<value
>> optimized out>, info=<value optimized out>, context=0x4186a6c0)
>> at ../thread_manager.c:594
>> #3 <signal handler called>
>> #4 0x00002b9bfc771efb in ?? () from /lib/libpthread.so.0
>> #5 0x0000000000411402 in _SPffcall () at ../x86-spentry64.s:4041
>> #6 0x0000000000000000 in ?? ()
>>
>>
>>
>> On Apr 15, 2009, at 1:34 AM, Gary Byers wrote:
>>
>>> A bunch of threads (all but the one that GDB identifies as #3) were
>>> suspended by thread 3 when it ran the GC; each such suspended thread
>>> is waiting on a semaphore that'd be signaled/raised when the GC
>>> exits.
>>> It looks like the GC exited and is in the process of trying to free
>>> some foreign objects that were referenced by "gcable pointers"; it
>>> has to signal the other threads' semaphores (to wake them up) before
>>> it actually frees anything, since one of those threads may own a
>>> lock
>>> on the malloc heap.
>>> It looks like the GC thread (#3) called free(), and is waiting for
>>> a lock that's likely held by one of the other threads that's likely
>>> to still be waiting (in sem_timedwait()) to be told that it's OK
>>> to run. The GC thread's already signaled the semaphores that the
>>> other threads are waiting on, but they don't seem to have gotten
>>> the message.
>>> It's not that unusual to see that some of the other threads are
>>> apparently
>>> still waiting for their "resume" semaphores; it's a little unusual
>>> to
>>> see that they're all waiting, and I'd guess that they have been
>>> waiting
>>> since you noticed the load average dropping to 0.
>>> If you still have the GDB attached, you might try something: make
>>> each
>>> other thread (other than #3) active in GDB and use 'up' to see how
>>> it
>>> got to sem_timedwait(). My guess is that you'd see the call to
>>> SEM_WAIT_FOREVER in the function 'suspend_resume_handler()' (in
>>> ccl/lisp-kernel/thread-manager.c) a few frames up the stack in each
>>> cases. If you see that any of these threads is waiting for anything
>>> other than its tcr->resume semaphore, that'd be interesting.
>>> Otherwise, this is a bit of a mystery, since the whole 'suspend
>>> other
>>> threads on GC entry, signal their resume semaphores on GC exit'
>>> routine
>>> happens on every GC, and has likely happened hundreds or thousands
>>> of
>>> times already in this session, and it's not clear what's different
>>> this
>>> time.
>>> On Wed, 15 Apr 2009, Osei Poku wrote:
>>>> Hi,
>>>> I was running three slime repls (9 threads total) and in each
>>>> repl I
>>>> was running the same long running function on different inputs.
>>>> After
>>>> a few hours of my load being about 2.8, it dropped down to 0.0.
>>>> Basically nothing was running, nothing responding etc. All threads
>>>> are frozen. As is confirmed by the backtrace in the main thread
>>>> (see
>>>> below), this happened during gc which makes sense since all
>>>> activity
>>>> stops during gc. Not sure what else I can do to revive the process
>>>> but I thought it might be useful to report the issue.
>>>> This is from CCL version 1.3-RC1-r11885M.
>>>> Osei
>>>> --- gdb output -----
>>>> (gdb) info threads
>>>> 15 Thread 1076246864 (LWP 14275) 0x00002b9bfc77149b in
>>>> sem_timedwait () from /lib/libpthread.so.0
>>>> 14 Thread 1078753616 (LWP 14278) 0x00002b9bfc77149b in
>>>> sem_timedwait () from /lib/libpthread.so.0
>>>> 13 Thread 1081260368 (LWP 14280) 0x00002b9bfc77149b in
>>>> sem_timedwait () from /lib/libpthread.so.0
>>>> 12 Thread 1083767120 (LWP 14281) 0x00002b9bfc77149b in
>>>> sem_timedwait () from /lib/libpthread.so.0
>>>> 11 Thread 1086273872 (LWP 14282) 0x00002b9bfc77149b in
>>>> sem_timedwait () from /lib/libpthread.so.0
>>>> 10 Thread 1088780624 (LWP 14283) 0x00002b9bfc77149b in
>>>> sem_timedwait () from /lib/libpthread.so.0
>>>> 9 Thread 1091287376 (LWP 24522) 0x00002b9bfc77149b in
>>>> sem_timedwait () from /lib/libpthread.so.0
>>>> 8 Thread 1093794128 (LWP 24523) 0x00002b9bfc77149b in
>>>> sem_timedwait () from /lib/libpthread.so.0
>>>> 7 Thread 1096841552 (LWP 24524) 0x00002b9bfc77149b in
>>>> sem_timedwait () from /lib/libpthread.so.0
>>>> 6 Thread 1099348304 (LWP 24525) 0x00002b9bfc77149b in
>>>> sem_timedwait () from /lib/libpthread.so.0
>>>> 5 Thread 1101855056 (LWP 28424) 0x00002b9bfc77149b in
>>>> sem_timedwait () from /lib/libpthread.so.0
>>>> 4 Thread 1104361808 (LWP 28425) 0x00002b9bfc77149b in
>>>> sem_timedwait () from /lib/libpthread.so.0
>>>> 3 Thread 1106868560 (LWP 28426) 0x00002b9bfca601d8 in ?? () from /
>>>> lib/libc.so.6
>>>> 2 Thread 1109375312 (LWP 28427) 0x00002b9bfc77149b in
>>>> sem_timedwait () from /lib/libpthread.so.0
>>>> 1 Thread 47948961267808 (LWP 14271) 0x00002b9bfc77149b in
>>>> sem_timedwait () from /lib/libpthread.so.0
>>>> (gdb) thread
>>>> [Current thread is 0 (LWP 14271)]
>>>> (gdb) thread 2
>>>> [Switching to thread 2 (Thread 1109375312 (LWP 28427))]#0
>>>> 0x00002b9bfc77149b in sem_timedwait () from /lib/libpthread.so.0
>>>> (gdb) thread 3
>>>> [Switching to thread 3 (Thread 1106868560 (LWP 28426))]#0
>>>> 0x00002b9bfca601d8 in ?? () from /lib/libc.so.6
>>>> (gdb) bt
>>>> #0 0x00002b9bfca601d8 in ?? () from /lib/libc.so.6
>>>> #1 0x00002b9bfc9f6662 in ?? () from /lib/libc.so.6
>>>> #2 0x00002b9bfc9f56f1 in free () from /lib/libc.so.6
>>>> #3 0x000000000041ca69 in destroy_semaphore (s=0x300055b0bc08)
>>>> at ../
>>>> thread_manager.c:724
>>>> #4 0x0000000000414caa in reap_gcable_ptrs () at ../gc-common.c:746
>>>> #5 0x00000000004157bd in gc (tcr=0x41f977b0, param=<value
>>>> optimized
>>>> out>) at ../gc-common.c:1331
>>>> #6 0x00000000004196db in gc_from_tcr (tcr=0x2b9bfccd6980, param=0)
>>>> at ../x86-exceptions.c:2786
>>>> #7 0x000000000041abe1 in gc_like_from_xp (xp=<value optimized
>>>> out>,
>>>> fun=0x4196b0 <gc_from_tcr>, param=0) at ../x86-exceptions.c:2743
>>>> #8 0x000000000041ac3e in gc_from_xp (xp=0x2b9bfccd6980,
>>>> param=25576416) at ../x86-exceptions.c:2798
>>>> #9 0x000000000041ad67 in allocate_object (xp=0x41f966f0,
>>>> bytes_needed=6352096, disp_from_allocptr=6352083, tcr=0x41f977b0)
>>>> at ../x86-exceptions.c:145
>>>> #10 0x000000000041adce in handle_alloc_trap (xp=0x41f966f0,
>>>> tcr=0x41f977b0) at ../x86-exceptions.c:572
>>>> #11 0x000000000041b5ce in handle_exception (signum=<value optimized
>>>> out>, info=0x41f96aa0, context=0x41f966f0, tcr=0x8000000000000000,
>>>> old_valence=-1) at ../x86-exceptions.c:1060
>>>> #12 0x000000000041b698 in signal_handler (signum=11,
>>>> info=0x41f96aa0,
>>>> context=0x41f966f0) at ../x86-exceptions.c:1330
>>>> #13 <signal handler called>
>>>> #14 0x000000000041054a in _SPmisc_alloc () at ../x86-spentry64.s:
>>>> 2840
>>>> #15 0x000030004036943c in ?? ()
>>>> #16 0x000000000060ecd0 in ?? ()
>>>> #17 0x0000000000000000 in ?? ()
>>>> (gdb) up
>>>> #1 0x00002b9bfc9f6662 in ?? () from /lib/libc.so.6
>>>> (gdb) up
>>>> #2 0x00002b9bfc9f56f1 in free () from /lib/libc.so.6
>>>> (gdb) up
>>>> #3 0x000000000041ca69 in destroy_semaphore (s=0x300055b0bc08)
>>>> at ../
>>>> thread_manager.c:724
>>>> 724 free(*s);
>>>> (gdb) list
>>>> 719 destroy_semaphore(void **s)
>>>> 720 {
>>>> 721 if (*s) {
>>>> 722 #ifdef USE_POSIX_SEMAPHORES
>>>> 723 sem_destroy((sem_t *)*s);
>>>> 724 free(*s);
>>>> 725 #endif
>>>> 726 #ifdef USE_MACH_SEMAPHORES
>>>> 727 semaphore_destroy(mach_task_self(),((semaphore_t)
>>>> (natural)
>>>> *s));
>>>> 728 #endif
>>>> (gdb) up
>>>> #4 0x0000000000414caa in reap_gcable_ptrs () at ../gc-common.c:746
>>>> 746 destroy_semaphore((void**)&(x->address));
>>>> (gdb) list
>>>> 741 case xmacptr_flag_rwlock:
>>>> 742 rwlock_destroy((rwlock *)ptr_from_lispobj(ptr));
>>>> 743 break;
>>>> 744
>>>> 745 case xmacptr_flag_semaphore:
>>>> 746 destroy_semaphore((void**)&(x->address));
>>>> 747 break;
>>>> 748
>>>> 749 default:
>>>> 750 if ((flag >= xmacptr_flag_user_first) &&
>>>> (gdb)
>>>> _______________________________________________
>>>> Openmcl-devel mailing list
>>>> Openmcl-devel at clozure.com
>>>> http://clozure.com/mailman/listinfo/openmcl-devel
>>
More information about the Openmcl-devel
mailing list