[Openmcl-devel] odd ccl hang during gc

Gary Byers gb at clozure.com
Tue Apr 14 23:50:37 PDT 2009


and the sem_wait_forever() is coming from suspend_resume_handler(), which
means that those threads (and likely all others) are waiting on semaphores
that should have been signaled a while ago.

Just as a stab in the dark: what kind of hardware is this ?  (e.g., what
does

$ cat /proc/cpuinfo

say ?)


On Wed, 15 Apr 2009, Osei Poku wrote:

> Unfortunately, I killed the process.  Fortunately, I did run the following 
> backtraces on other a few other threads than #3.  According to the line 
> numbers shown in the output below, it appears that they (at least the 3 
> threads I looked at) are all stopped at SEM_WAIT_FOREVER().
>
> I guess the mystery continues...
>
> Osei
>
> (gdb) info threads
> 15 Thread 1076246864 (LWP 14275)  0x00002b9bfc77149b in sem_timedwait () 
> from /lib/libpthread.so.0
> 14 Thread 1078753616 (LWP 14278)  0x00002b9bfc77149b in sem_timedwait () 
> from /lib/libpthread.so.0
> 13 Thread 1081260368 (LWP 14280)  0x00002b9bfc77149b in sem_timedwait () 
> from /lib/libpthread.so.0
> 12 Thread 1083767120 (LWP 14281)  0x00002b9bfc77149b in sem_timedwait () 
> from /lib/libpthread.so.0
> 11 Thread 1086273872 (LWP 14282)  0x00002b9bfc77149b in sem_timedwait () 
> from /lib/libpthread.so.0
> 10 Thread 1088780624 (LWP 14283)  0x00002b9bfc77149b in sem_timedwait () 
> from /lib/libpthread.so.0
> 9 Thread 1091287376 (LWP 24522)  0x00002b9bfc77149b in sem_timedwait () from 
> /lib/libpthread.so.0
> 8 Thread 1093794128 (LWP 24523)  0x00002b9bfc77149b in sem_timedwait () from 
> /lib/libpthread.so.0
> 7 Thread 1096841552 (LWP 24524)  0x00002b9bfc77149b in sem_timedwait () from 
> /lib/libpthread.so.0
> 6 Thread 1099348304 (LWP 24525)  0x00002b9bfc77149b in sem_timedwait () from 
> /lib/libpthread.so.0
> 5 Thread 1101855056 (LWP 28424)  0x00002b9bfc77149b in sem_timedwait () from 
> /lib/libpthread.so.0
> 4 Thread 1104361808 (LWP 28425)  0x00002b9bfc77149b in sem_timedwait () from 
> /lib/libpthread.so.0
> * 3 Thread 1106868560 (LWP 28426)  0x00002b9bfca601d8 in ?? () from 
> /lib/libc.so.6
> 2 Thread 1109375312 (LWP 28427)  0x00002b9bfc77149b in sem_timedwait () from 
> /lib/libpthread.so.0
> 1 Thread 47948961267808 (LWP 14271)  0x00002b9bfc77149b in sem_timedwait () 
> from /lib/libpthread.so.0
> (gdb) thread 4
> [Switching to thread 4 (Thread 1104361808 (LWP 28425))]#0  0x00002b9bfc77149b 
> in sem_timedwait () from /lib/libpthread.so.0
> (gdb) bt
> #0  0x00002b9bfc77149b in sem_timedwait () from /lib/libpthread.so.0
> #1  0x000000000041cc3c in sem_wait_forever (s=0x1ce0cb0) at 
> ../thread_manager.c:443
> #2  0x000000000041da7f in suspend_resume_handler (signo=<value optimized 
> out>, info=<value optimized out>, context=0x41d32670) at 
> ../thread_manager.c:594
> #3  <signal handler called>
> #4  0x00002b9bfc77149b in sem_timedwait () from /lib/libpthread.so.0
> #5  0x000000000041cbdb in wait_on_semaphore (s=0x18078b0, seconds=1, 
> millis=0) at ../thread_manager.c:468
> #6  0x0000000000411402 in _SPffcall () at ../x86-spentry64.s:4041
> #7  0x0000000000000000 in ?? ()
> (gdb) thread 5
> [Switching to thread 5 (Thread 1101855056 (LWP 28424))]#0  0x00002b9bfc77149b 
> in sem_timedwait () from /lib/libpthread.so.0
> (gdb) bt
> #0  0x00002b9bfc77149b in sem_timedwait () from /lib/libpthread.so.0
> #1  0x000000000041cc3c in sem_wait_forever (s=0xd3e4b0) at 
> ../thread_manager.c:443
> #2  0x000000000041da7f in suspend_resume_handler (signo=<value optimized 
> out>, info=<value optimized out>, context=0x41ace650) at 
> ../thread_manager.c:594
> #3  <signal handler called>
> #4  0x00002b9bfc7727b1 in ?? () from /lib/libpthread.so.0
> #5  0x0000000000411402 in _SPffcall () at ../x86-spentry64.s:4041
> #6  0x0000000041aced60 in ?? ()
> #7  0x00002aaaaf859e88 in ?? ()
> #8  0x000000000000031a in ?? ()
> #9  0x0000000041aced50 in ?? ()
> #10 0x0000000000000000 in ?? ()
> (gdb) thread 6
> [Switching to thread 6 (Thread 1099348304 (LWP 24525))]#0  0x00002b9bfc77149b 
> in sem_timedwait () from /lib/libpthread.so.0
> (gdb) bt
> #0  0x00002b9bfc77149b in sem_timedwait () from /lib/libpthread.so.0
> #1  0x000000000041cc3c in sem_wait_forever (s=0x23af950) at 
> ../thread_manager.c:443
> #2  0x000000000041da7f in suspend_resume_handler (signo=<value optimized 
> out>, info=<value optimized out>, context=0x4186a6c0) at 
> ../thread_manager.c:594
> #3  <signal handler called>
> #4  0x00002b9bfc771efb in ?? () from /lib/libpthread.so.0
> #5  0x0000000000411402 in _SPffcall () at ../x86-spentry64.s:4041
> #6  0x0000000000000000 in ?? ()
>
>
>
> On Apr 15, 2009, at 1:34 AM, Gary Byers wrote:
>
>> A bunch of threads (all but the one that GDB identifies as #3) were
>> suspended by thread 3 when it ran the GC; each such suspended thread
>> is waiting on a semaphore that'd be signaled/raised when the GC exits.
>> It looks like the GC exited and is in the process of trying to free
>> some foreign objects that were referenced by "gcable pointers"; it
>> has to signal the other threads' semaphores (to wake them up) before
>> it actually frees anything, since one of those threads may own a lock
>> on the malloc heap.
>> 
>> It looks like the GC thread (#3) called free(), and is waiting for
>> a lock that's likely held by one of the other threads that's likely
>> to still be waiting (in sem_timedwait()) to be told that it's OK
>> to run.  The GC thread's already signaled the semaphores that the
>> other threads are waiting on, but they don't seem to have gotten
>> the message.
>> 
>> 
>> It's not that unusual to see that some of the other threads are apparently
>> still waiting for their "resume" semaphores; it's a little unusual to
>> see that they're all waiting, and I'd guess that they have been waiting
>> since you noticed the load average dropping to 0.
>> 
>> If you still have the GDB attached, you might try something: make each
>> other thread (other than #3) active in GDB and use 'up' to see how it
>> got to sem_timedwait().  My guess is that you'd see the call to
>> SEM_WAIT_FOREVER in the function 'suspend_resume_handler()' (in
>> ccl/lisp-kernel/thread-manager.c) a few frames up the stack in each
>> cases.  If you see that any of these threads is waiting for anything
>> other than its tcr->resume semaphore, that'd be interesting.
>> 
>> Otherwise, this is a bit of a mystery, since the whole 'suspend other
>> threads on GC entry, signal their resume semaphores on GC exit' routine
>> happens on every GC, and has likely happened hundreds or thousands of
>> times already in this session, and it's not clear what's different this
>> time.
>> 
>> 
>> 
>> On Wed, 15 Apr 2009, Osei Poku wrote:
>> 
>>> Hi,
>>> 
>>> I was running three slime repls (9 threads total) and in each repl I
>>> was running the same long running function on different inputs.  After
>>> a few hours of my load being about 2.8, it dropped down to 0.0.
>>> Basically nothing was running, nothing responding etc.  All threads
>>> are frozen.  As is confirmed by the backtrace in the main thread (see
>>> below), this happened during gc which makes sense since all activity
>>> stops during gc.  Not sure what else I can do to revive the process
>>> but I thought it might be useful to report the issue.
>>> 
>>> This is from CCL version 1.3-RC1-r11885M.
>>> 
>>> Osei
>>> 
>>> --- gdb output -----
>>> 
>>> (gdb) info threads
>>> 15 Thread 1076246864 (LWP 14275)  0x00002b9bfc77149b in
>>> sem_timedwait () from /lib/libpthread.so.0
>>> 14 Thread 1078753616 (LWP 14278)  0x00002b9bfc77149b in
>>> sem_timedwait () from /lib/libpthread.so.0
>>> 13 Thread 1081260368 (LWP 14280)  0x00002b9bfc77149b in
>>> sem_timedwait () from /lib/libpthread.so.0
>>> 12 Thread 1083767120 (LWP 14281)  0x00002b9bfc77149b in
>>> sem_timedwait () from /lib/libpthread.so.0
>>> 11 Thread 1086273872 (LWP 14282)  0x00002b9bfc77149b in
>>> sem_timedwait () from /lib/libpthread.so.0
>>> 10 Thread 1088780624 (LWP 14283)  0x00002b9bfc77149b in
>>> sem_timedwait () from /lib/libpthread.so.0
>>> 9 Thread 1091287376 (LWP 24522)  0x00002b9bfc77149b in
>>> sem_timedwait () from /lib/libpthread.so.0
>>> 8 Thread 1093794128 (LWP 24523)  0x00002b9bfc77149b in
>>> sem_timedwait () from /lib/libpthread.so.0
>>> 7 Thread 1096841552 (LWP 24524)  0x00002b9bfc77149b in
>>> sem_timedwait () from /lib/libpthread.so.0
>>> 6 Thread 1099348304 (LWP 24525)  0x00002b9bfc77149b in
>>> sem_timedwait () from /lib/libpthread.so.0
>>> 5 Thread 1101855056 (LWP 28424)  0x00002b9bfc77149b in
>>> sem_timedwait () from /lib/libpthread.so.0
>>> 4 Thread 1104361808 (LWP 28425)  0x00002b9bfc77149b in
>>> sem_timedwait () from /lib/libpthread.so.0
>>> 3 Thread 1106868560 (LWP 28426)  0x00002b9bfca601d8 in ?? () from /
>>> lib/libc.so.6
>>> 2 Thread 1109375312 (LWP 28427)  0x00002b9bfc77149b in
>>> sem_timedwait () from /lib/libpthread.so.0
>>> 1 Thread 47948961267808 (LWP 14271)  0x00002b9bfc77149b in
>>> sem_timedwait () from /lib/libpthread.so.0
>>> (gdb) thread
>>> [Current thread is 0 (LWP 14271)]
>>> (gdb) thread 2
>>> [Switching to thread 2 (Thread 1109375312 (LWP 28427))]#0
>>> 0x00002b9bfc77149b in sem_timedwait () from /lib/libpthread.so.0
>>> (gdb) thread 3
>>> [Switching to thread 3 (Thread 1106868560 (LWP 28426))]#0
>>> 0x00002b9bfca601d8 in ?? () from /lib/libc.so.6
>>> (gdb) bt
>>> #0  0x00002b9bfca601d8 in ?? () from /lib/libc.so.6
>>> #1  0x00002b9bfc9f6662 in ?? () from /lib/libc.so.6
>>> #2  0x00002b9bfc9f56f1 in free () from /lib/libc.so.6
>>> #3  0x000000000041ca69 in destroy_semaphore (s=0x300055b0bc08) at ../
>>> thread_manager.c:724
>>> #4  0x0000000000414caa in reap_gcable_ptrs () at ../gc-common.c:746
>>> #5  0x00000000004157bd in gc (tcr=0x41f977b0, param=<value optimized
>>> out>) at ../gc-common.c:1331
>>> #6  0x00000000004196db in gc_from_tcr (tcr=0x2b9bfccd6980, param=0)
>>> at ../x86-exceptions.c:2786
>>> #7  0x000000000041abe1 in gc_like_from_xp (xp=<value optimized out>,
>>> fun=0x4196b0 <gc_from_tcr>, param=0) at ../x86-exceptions.c:2743
>>> #8  0x000000000041ac3e in gc_from_xp (xp=0x2b9bfccd6980,
>>> param=25576416) at ../x86-exceptions.c:2798
>>> #9  0x000000000041ad67 in allocate_object (xp=0x41f966f0,
>>> bytes_needed=6352096, disp_from_allocptr=6352083, tcr=0x41f977b0)
>>> at ../x86-exceptions.c:145
>>> #10 0x000000000041adce in handle_alloc_trap (xp=0x41f966f0,
>>> tcr=0x41f977b0) at ../x86-exceptions.c:572
>>> #11 0x000000000041b5ce in handle_exception (signum=<value optimized
>>> out>, info=0x41f96aa0, context=0x41f966f0, tcr=0x8000000000000000,
>>> old_valence=-1) at ../x86-exceptions.c:1060
>>> #12 0x000000000041b698 in signal_handler (signum=11, info=0x41f96aa0,
>>> context=0x41f966f0) at ../x86-exceptions.c:1330
>>> #13 <signal handler called>
>>> #14 0x000000000041054a in _SPmisc_alloc () at ../x86-spentry64.s:2840
>>> #15 0x000030004036943c in ?? ()
>>> #16 0x000000000060ecd0 in ?? ()
>>> #17 0x0000000000000000 in ?? ()
>>> (gdb) up
>>> #1  0x00002b9bfc9f6662 in ?? () from /lib/libc.so.6
>>> (gdb) up
>>> #2  0x00002b9bfc9f56f1 in free () from /lib/libc.so.6
>>> (gdb) up
>>> #3  0x000000000041ca69 in destroy_semaphore (s=0x300055b0bc08) at ../
>>> thread_manager.c:724
>>> 724         free(*s);
>>> (gdb) list
>>> 719     destroy_semaphore(void **s)
>>> 720     {
>>> 721       if (*s) {
>>> 722     #ifdef USE_POSIX_SEMAPHORES
>>> 723         sem_destroy((sem_t *)*s);
>>> 724         free(*s);
>>> 725     #endif
>>> 726     #ifdef USE_MACH_SEMAPHORES
>>> 727         semaphore_destroy(mach_task_self(),((semaphore_t)(natural)
>>> *s));
>>> 728     #endif
>>> (gdb) up
>>> #4  0x0000000000414caa in reap_gcable_ptrs () at ../gc-common.c:746
>>> 746               destroy_semaphore((void**)&(x->address));
>>> (gdb) list
>>> 741             case xmacptr_flag_rwlock:
>>> 742               rwlock_destroy((rwlock *)ptr_from_lispobj(ptr));
>>> 743               break;
>>> 744
>>> 745             case xmacptr_flag_semaphore:
>>> 746               destroy_semaphore((void**)&(x->address));
>>> 747               break;
>>> 748
>>> 749             default:
>>> 750               if ((flag >= xmacptr_flag_user_first) &&
>>> (gdb)
>>> 
>>> _______________________________________________
>>> Openmcl-devel mailing list
>>> Openmcl-devel at clozure.com
>>> http://clozure.com/mailman/listinfo/openmcl-devel
>>> 
>>> 
>



More information about the Openmcl-devel mailing list