[Openmcl-devel] odd ccl hang during gc

Gary Byers gb at clozure.com
Tue Apr 14 22:34:20 PDT 2009


A bunch of threads (all but the one that GDB identifies as #3) were
suspended by thread 3 when it ran the GC; each such suspended thread
is waiting on a semaphore that'd be signaled/raised when the GC exits.
It looks like the GC exited and is in the process of trying to free
some foreign objects that were referenced by "gcable pointers"; it
has to signal the other threads' semaphores (to wake them up) before
it actually frees anything, since one of those threads may own a lock
on the malloc heap.

It looks like the GC thread (#3) called free(), and is waiting for
a lock that's likely held by one of the other threads that's likely
to still be waiting (in sem_timedwait()) to be told that it's OK
to run.  The GC thread's already signaled the semaphores that the
other threads are waiting on, but they don't seem to have gotten
the message.


It's not that unusual to see that some of the other threads are apparently
still waiting for their "resume" semaphores; it's a little unusual to
see that they're all waiting, and I'd guess that they have been waiting
since you noticed the load average dropping to 0.

If you still have the GDB attached, you might try something: make each
other thread (other than #3) active in GDB and use 'up' to see how it
got to sem_timedwait().  My guess is that you'd see the call to
SEM_WAIT_FOREVER in the function 'suspend_resume_handler()' (in
ccl/lisp-kernel/thread-manager.c) a few frames up the stack in each
cases.  If you see that any of these threads is waiting for anything
other than its tcr->resume semaphore, that'd be interesting.

Otherwise, this is a bit of a mystery, since the whole 'suspend other
threads on GC entry, signal their resume semaphores on GC exit' routine
happens on every GC, and has likely happened hundreds or thousands of
times already in this session, and it's not clear what's different this
time.



On Wed, 15 Apr 2009, Osei Poku wrote:

> Hi,
>
> I was running three slime repls (9 threads total) and in each repl I
> was running the same long running function on different inputs.  After
> a few hours of my load being about 2.8, it dropped down to 0.0.
> Basically nothing was running, nothing responding etc.  All threads
> are frozen.  As is confirmed by the backtrace in the main thread (see
> below), this happened during gc which makes sense since all activity
> stops during gc.  Not sure what else I can do to revive the process
> but I thought it might be useful to report the issue.
>
> This is from CCL version 1.3-RC1-r11885M.
>
> Osei
>
> --- gdb output -----
>
> (gdb) info threads
>   15 Thread 1076246864 (LWP 14275)  0x00002b9bfc77149b in
> sem_timedwait () from /lib/libpthread.so.0
>   14 Thread 1078753616 (LWP 14278)  0x00002b9bfc77149b in
> sem_timedwait () from /lib/libpthread.so.0
>   13 Thread 1081260368 (LWP 14280)  0x00002b9bfc77149b in
> sem_timedwait () from /lib/libpthread.so.0
>   12 Thread 1083767120 (LWP 14281)  0x00002b9bfc77149b in
> sem_timedwait () from /lib/libpthread.so.0
>   11 Thread 1086273872 (LWP 14282)  0x00002b9bfc77149b in
> sem_timedwait () from /lib/libpthread.so.0
>   10 Thread 1088780624 (LWP 14283)  0x00002b9bfc77149b in
> sem_timedwait () from /lib/libpthread.so.0
>   9 Thread 1091287376 (LWP 24522)  0x00002b9bfc77149b in
> sem_timedwait () from /lib/libpthread.so.0
>   8 Thread 1093794128 (LWP 24523)  0x00002b9bfc77149b in
> sem_timedwait () from /lib/libpthread.so.0
>   7 Thread 1096841552 (LWP 24524)  0x00002b9bfc77149b in
> sem_timedwait () from /lib/libpthread.so.0
>   6 Thread 1099348304 (LWP 24525)  0x00002b9bfc77149b in
> sem_timedwait () from /lib/libpthread.so.0
>   5 Thread 1101855056 (LWP 28424)  0x00002b9bfc77149b in
> sem_timedwait () from /lib/libpthread.so.0
>   4 Thread 1104361808 (LWP 28425)  0x00002b9bfc77149b in
> sem_timedwait () from /lib/libpthread.so.0
>   3 Thread 1106868560 (LWP 28426)  0x00002b9bfca601d8 in ?? () from /
> lib/libc.so.6
>   2 Thread 1109375312 (LWP 28427)  0x00002b9bfc77149b in
> sem_timedwait () from /lib/libpthread.so.0
>   1 Thread 47948961267808 (LWP 14271)  0x00002b9bfc77149b in
> sem_timedwait () from /lib/libpthread.so.0
> (gdb) thread
> [Current thread is 0 (LWP 14271)]
> (gdb) thread 2
> [Switching to thread 2 (Thread 1109375312 (LWP 28427))]#0
> 0x00002b9bfc77149b in sem_timedwait () from /lib/libpthread.so.0
> (gdb) thread 3
> [Switching to thread 3 (Thread 1106868560 (LWP 28426))]#0
> 0x00002b9bfca601d8 in ?? () from /lib/libc.so.6
> (gdb) bt
> #0  0x00002b9bfca601d8 in ?? () from /lib/libc.so.6
> #1  0x00002b9bfc9f6662 in ?? () from /lib/libc.so.6
> #2  0x00002b9bfc9f56f1 in free () from /lib/libc.so.6
> #3  0x000000000041ca69 in destroy_semaphore (s=0x300055b0bc08) at ../
> thread_manager.c:724
> #4  0x0000000000414caa in reap_gcable_ptrs () at ../gc-common.c:746
> #5  0x00000000004157bd in gc (tcr=0x41f977b0, param=<value optimized
> out>) at ../gc-common.c:1331
> #6  0x00000000004196db in gc_from_tcr (tcr=0x2b9bfccd6980, param=0)
> at ../x86-exceptions.c:2786
> #7  0x000000000041abe1 in gc_like_from_xp (xp=<value optimized out>,
> fun=0x4196b0 <gc_from_tcr>, param=0) at ../x86-exceptions.c:2743
> #8  0x000000000041ac3e in gc_from_xp (xp=0x2b9bfccd6980,
> param=25576416) at ../x86-exceptions.c:2798
> #9  0x000000000041ad67 in allocate_object (xp=0x41f966f0,
> bytes_needed=6352096, disp_from_allocptr=6352083, tcr=0x41f977b0)
> at ../x86-exceptions.c:145
> #10 0x000000000041adce in handle_alloc_trap (xp=0x41f966f0,
> tcr=0x41f977b0) at ../x86-exceptions.c:572
> #11 0x000000000041b5ce in handle_exception (signum=<value optimized
> out>, info=0x41f96aa0, context=0x41f966f0, tcr=0x8000000000000000,
> old_valence=-1) at ../x86-exceptions.c:1060
> #12 0x000000000041b698 in signal_handler (signum=11, info=0x41f96aa0,
> context=0x41f966f0) at ../x86-exceptions.c:1330
> #13 <signal handler called>
> #14 0x000000000041054a in _SPmisc_alloc () at ../x86-spentry64.s:2840
> #15 0x000030004036943c in ?? ()
> #16 0x000000000060ecd0 in ?? ()
> #17 0x0000000000000000 in ?? ()
> (gdb) up
> #1  0x00002b9bfc9f6662 in ?? () from /lib/libc.so.6
> (gdb) up
> #2  0x00002b9bfc9f56f1 in free () from /lib/libc.so.6
> (gdb) up
> #3  0x000000000041ca69 in destroy_semaphore (s=0x300055b0bc08) at ../
> thread_manager.c:724
> 724         free(*s);
> (gdb) list
> 719     destroy_semaphore(void **s)
> 720     {
> 721       if (*s) {
> 722     #ifdef USE_POSIX_SEMAPHORES
> 723         sem_destroy((sem_t *)*s);
> 724         free(*s);
> 725     #endif
> 726     #ifdef USE_MACH_SEMAPHORES
> 727         semaphore_destroy(mach_task_self(),((semaphore_t)(natural)
> *s));
> 728     #endif
> (gdb) up
> #4  0x0000000000414caa in reap_gcable_ptrs () at ../gc-common.c:746
> 746               destroy_semaphore((void**)&(x->address));
> (gdb) list
> 741             case xmacptr_flag_rwlock:
> 742               rwlock_destroy((rwlock *)ptr_from_lispobj(ptr));
> 743               break;
> 744
> 745             case xmacptr_flag_semaphore:
> 746               destroy_semaphore((void**)&(x->address));
> 747               break;
> 748
> 749             default:
> 750               if ((flag >= xmacptr_flag_user_first) &&
> (gdb)
>
> _______________________________________________
> Openmcl-devel mailing list
> Openmcl-devel at clozure.com
> http://clozure.com/mailman/listinfo/openmcl-devel
>
>



More information about the Openmcl-devel mailing list