[Openmcl-devel] odd ccl hang during gc

Gary Byers gb at clozure.com
Tue Jul 7 20:31:32 UTC 2009



On Tue, 7 Jul 2009, Osei Poku wrote:

> Hi,
>
> This issue cropped up again on the same 16-core opteron machine.  It is a 
> similar situation with multiple threads running and then, all at once, all 
> threads except one stop at sem_timedwait().  The other thread is stuck in the 
> same spot as before trying to free something.  A few backtraces attached.  I 
> am still running 1.3-RC1-r11885M so this may have been fixed in newer 
> releases.


I had thought that this was more mysterious than it was.

When the GC runs (in some thread), it suspends all other threads (at least
those threads which might reference lisp objects.)

Some of what the GC does involves freeing up resources associated with
"gcable pointers" - malloc'ed memory regions or other foreign objects
that we want to reclaim when it becomes impossible to reference them
from lisp, e.g., when the pointer to them becomes garbage.

The GC thread has to be somewhat careful about trying to free memory
associated with a "gcable pointer": since some suspended thread might
own a lock used by malloc()/free() to serialize access to the heap
that they manage, the GC thread can'd directly free foreign memory
when it discovers that the gcable pointer associated with it has become
garbage; instead, it enqueues the foreign pointer for deletion at a later
time (after suspended threads are resumed.)

When I first looked at this, it seemed that the GC thread should have
resumed all other threads and then tried to free something, but for
some bizarre reason the other threads - including any other thread that
might have owned a malloc lock - were all waiting on semaphores that
the GC thread should have signaled.

If I'd looked a little harder, I'd have seen that the GC thread was
trying to call free() directly as a side-effect of destroying a semaphore:


>>>>> #2  0x00002b9bfc9f56f1 in free () from /lib/libc.so.6
>>>>> #3  0x000000000041ca69 in destroy_semaphore (s=0x300055b0bc08) at ../
>>>>> thread_manager.c:724
>>>>> #4  0x0000000000414caa in reap_gcable_ptrs () at ../gc-common.c:746
>>>>> #5  0x00000000004157bd in gc (tcr=0x41f977b0, param=<value optimized


If you still have the image running and the one thread that's not in
sem_timedwait() is actually trying to call free() from inside the gc()
- as is happening above - then this isn't the subtle, mysterious thing
that I thought it must be but is instead just a stupid bug.  (We know
that we can't call free() from inside the GC without risking deadlock,
but calling destroy_semaphore() in that context can cause free() to be
called.)



>
> I will wait until I get replies before killing this process.
>
> Osei
>
> (gdb) bt
> #0  0x00002b9fab6e249b in sem_timedwait () from /lib/libpthread.so.0
> #1  0x000000000041cc3c in sem_wait_forever (s=0x6496b0) at 
> ../thread_manager.c:443
> #2  0x000000000041da7f in suspend_resume_handler (signo=<value optimized 
> out>, info=<value optimized out>, context=0x7fffffa75570) at 
> ../thread_manager.c:594
> #3  <signal handler called>
> #4  0x00002b9fab6e37b1 in ?? () from /lib/libpthread.so.0
> #5  0x0000000000411402 in _SPffcall () at ../x86-spentry64.s:4041
> #6  0x00007fffffa75c80 in ?? ()
> #7  0x00002b9fabe90f88 in ?? ()
> #8  0x000000000000031a in ?? ()
> #9  0x00007fffffa75c70 in ?? ()
> #10 0x0000000000000000 in ?? ()
> (gdb) threads
> Undefined command: "threads".  Try "help".
> (gdb) info threads
> 14 Thread 1076246864 (LWP 3695)  0x00002b9fab6e249b in sem_timedwait () from 
> /lib/libpthread.so.0
> 13 Thread 1078753616 (LWP 3697)  0x00002b9fab6e249b in sem_timedwait () from 
> /lib/libpthread.so.0
> 12 Thread 1081260368 (LWP 3701)  0x00002b9fab6e249b in sem_timedwait () from 
> /lib/libpthread.so.0
> 11 Thread 1083767120 (LWP 3702)  0x00002b9fab6e249b in sem_timedwait () from 
> /lib/libpthread.so.0
> 10 Thread 1086273872 (LWP 3703)  0x00002b9fab6e249b in sem_timedwait () from 
> /lib/libpthread.so.0
> 9 Thread 1088780624 (LWP 3704)  0x00002b9fab6e249b in sem_timedwait () from 
> /lib/libpthread.so.0
> 8 Thread 1091287376 (LWP 19572)  0x00002b9fab6e249b in sem_timedwait () from 
> /lib/libpthread.so.0
> 7 Thread 1116133712 (LWP 28035)  0x00002b9fab6e249b in sem_timedwait () from 
> /lib/libpthread.so.0
> 6 Thread 1111120208 (LWP 28036)  0x00002b9fab6e249b in sem_timedwait () from 
> /lib/libpthread.so.0
> 5 Thread 1099680080 (LWP 28037)  0x00002b9fab6e249b in sem_timedwait () from 
> /lib/libpthread.so.0
> 4 Thread 1108072784 (LWP 28038)  0x00002b9fab6e249b in sem_timedwait () from 
> /lib/libpthread.so.0
> 3 Thread 1113626960 (LWP 28625)  0x00002b9fab9d11d8 in ?? () from 
> /lib/libc.so.6
> 2 Thread 1121147216 (LWP 29139)  0x00002b9fab6e249b in sem_timedwait () from 
> /lib/libpthread.so.0
> 1 Thread 47964781596768 (LWP 3691)  0x00002b9fab6e249b in sem_timedwait () 
> from /lib/libpthread.so.0
> (gdb) thread 3
> [Switching to thread 3 (Thread 1113626960 (LWP 28625))]#0  0x00002b9fab9d11d8 
> in ?? () from /lib/libc.so.6
> (gdb) bt
> #0  0x00002b9fab9d11d8 in ?? () from /lib/libc.so.6
> #1  0x00002b9fab967662 in ?? () from /lib/libc.so.6
> #2  0x00002b9fab9666f1 in free () from /lib/libc.so.6
> #3  0x000000000041ca69 in destroy_semaphore (s=0x3000436cc0c8) at 
> ../thread_manager.c:724
> #4  0x0000000000414caa in reap_gcable_ptrs () at ../gc-common.c:746
> #5  0x00000000004157bd in gc (tcr=0x426097b0, param=<value optimized out>) at 
> ../gc-common.c:1331
> #6  0x00000000004196db in gc_from_tcr (tcr=0x2aaab0000020, param=0) at 
> ../x86-exceptions.c:2786
> #7  0x000000000041abe1 in gc_like_from_xp (xp=<value optimized out>, 
> fun=0x4196b0 <gc_from_tcr>, param=0) at ../x86-exceptions.c:2743
> #8  0x000000000041ac3e in gc_from_xp (xp=0x2aaab0000020, 
> param=46912586140688) at ../x86-exceptions.c:2798
> #9  0x000000000041b14f in handle_gc_trap (xp=0x426086f0, tcr=0x426097b0) at 
> ../x86-exceptions.c:239
> #10 0x000000000041b5eb in handle_exception (signum=<value optimized out>, 
> info=0x42608aa0, context=0x426086f0, tcr=0x8000000000000000, old_valence=-1) 
> at ../x86-exceptions.c:1066
> #11 0x000000000041b698 in signal_handler (signum=11, info=0x42608aa0, 
> context=0x426086f0) at ../x86-exceptions.c:1330
> #12 <signal handler called>
> #13 0x00003000400135e7 in ?? ()
> #14 0x00003000419096a4 in ?? ()
> #15 0x000000000040ee7c in _SPnthrowvalues () at ../x86-spentry64.s:1405
> #16 0x00002aaaae17eae0 in ?? ()
> #17 0x0000000000000008 in ?? ()
> #18 0x0000000000000000 in ?? ()
>
>
> On Apr 15, 2009, at 2:50 AM, Gary Byers wrote:
>
>> and the sem_wait_forever() is coming from suspend_resume_handler(), which
>> means that those threads (and likely all others) are waiting on semaphores
>> that should have been signaled a while ago.
>> 
>> Just as a stab in the dark: what kind of hardware is this ?  (e.g., what
>> does
>> 
>> $ cat /proc/cpuinfo
>> 
>> say ?)
>> 
>> 
>> On Wed, 15 Apr 2009, Osei Poku wrote:
>> 
>>> Unfortunately, I killed the process.  Fortunately, I did run the following 
>>> backtraces on other a few other threads than #3.  According to the line 
>>> numbers shown in the output below, it appears that they (at least the 3 
>>> threads I looked at) are all stopped at SEM_WAIT_FOREVER().
>>> 
>>> I guess the mystery continues...
>>> 
>>> Osei
>>> 
>>> (gdb) info threads
>>> 15 Thread 1076246864 (LWP 14275)  0x00002b9bfc77149b in sem_timedwait () 
>>> from /lib/libpthread.so.0
>>> 14 Thread 1078753616 (LWP 14278)  0x00002b9bfc77149b in sem_timedwait () 
>>> from /lib/libpthread.so.0
>>> 13 Thread 1081260368 (LWP 14280)  0x00002b9bfc77149b in sem_timedwait () 
>>> from /lib/libpthread.so.0
>>> 12 Thread 1083767120 (LWP 14281)  0x00002b9bfc77149b in sem_timedwait () 
>>> from /lib/libpthread.so.0
>>> 11 Thread 1086273872 (LWP 14282)  0x00002b9bfc77149b in sem_timedwait () 
>>> from /lib/libpthread.so.0
>>> 10 Thread 1088780624 (LWP 14283)  0x00002b9bfc77149b in sem_timedwait () 
>>> from /lib/libpthread.so.0
>>> 9 Thread 1091287376 (LWP 24522)  0x00002b9bfc77149b in sem_timedwait () 
>>> from /lib/libpthread.so.0
>>> 8 Thread 1093794128 (LWP 24523)  0x00002b9bfc77149b in sem_timedwait () 
>>> from /lib/libpthread.so.0
>>> 7 Thread 1096841552 (LWP 24524)  0x00002b9bfc77149b in sem_timedwait () 
>>> from /lib/libpthread.so.0
>>> 6 Thread 1099348304 (LWP 24525)  0x00002b9bfc77149b in sem_timedwait () 
>>> from /lib/libpthread.so.0
>>> 5 Thread 1101855056 (LWP 28424)  0x00002b9bfc77149b in sem_timedwait () 
>>> from /lib/libpthread.so.0
>>> 4 Thread 1104361808 (LWP 28425)  0x00002b9bfc77149b in sem_timedwait () 
>>> from /lib/libpthread.so.0
>>> * 3 Thread 1106868560 (LWP 28426)  0x00002b9bfca601d8 in ?? () from 
>>> /lib/libc.so.6
>>> 2 Thread 1109375312 (LWP 28427)  0x00002b9bfc77149b in sem_timedwait () 
>>> from /lib/libpthread.so.0
>>> 1 Thread 47948961267808 (LWP 14271)  0x00002b9bfc77149b in sem_timedwait 
>>> () from /lib/libpthread.so.0
>>> (gdb) thread 4
>>> [Switching to thread 4 (Thread 1104361808 (LWP 28425))]#0 
>>> 0x00002b9bfc77149b in sem_timedwait () from /lib/libpthread.so.0
>>> (gdb) bt
>>> #0  0x00002b9bfc77149b in sem_timedwait () from /lib/libpthread.so.0
>>> #1  0x000000000041cc3c in sem_wait_forever (s=0x1ce0cb0) at 
>>> ../thread_manager.c:443
>>> #2  0x000000000041da7f in suspend_resume_handler (signo=<value optimized 
>>> out>, info=<value optimized out>, context=0x41d32670) at 
>>> ../thread_manager.c:594
>>> #3  <signal handler called>
>>> #4  0x00002b9bfc77149b in sem_timedwait () from /lib/libpthread.so.0
>>> #5  0x000000000041cbdb in wait_on_semaphore (s=0x18078b0, seconds=1, 
>>> millis=0) at ../thread_manager.c:468
>>> #6  0x0000000000411402 in _SPffcall () at ../x86-spentry64.s:4041
>>> #7  0x0000000000000000 in ?? ()
>>> (gdb) thread 5
>>> [Switching to thread 5 (Thread 1101855056 (LWP 28424))]#0 
>>> 0x00002b9bfc77149b in sem_timedwait () from /lib/libpthread.so.0
>>> (gdb) bt
>>> #0  0x00002b9bfc77149b in sem_timedwait () from /lib/libpthread.so.0
>>> #1  0x000000000041cc3c in sem_wait_forever (s=0xd3e4b0) at 
>>> ../thread_manager.c:443
>>> #2  0x000000000041da7f in suspend_resume_handler (signo=<value optimized 
>>> out>, info=<value optimized out>, context=0x41ace650) at 
>>> ../thread_manager.c:594
>>> #3  <signal handler called>
>>> #4  0x00002b9bfc7727b1 in ?? () from /lib/libpthread.so.0
>>> #5  0x0000000000411402 in _SPffcall () at ../x86-spentry64.s:4041
>>> #6  0x0000000041aced60 in ?? ()
>>> #7  0x00002aaaaf859e88 in ?? ()
>>> #8  0x000000000000031a in ?? ()
>>> #9  0x0000000041aced50 in ?? ()
>>> #10 0x0000000000000000 in ?? ()
>>> (gdb) thread 6
>>> [Switching to thread 6 (Thread 1099348304 (LWP 24525))]#0 
>>> 0x00002b9bfc77149b in sem_timedwait () from /lib/libpthread.so.0
>>> (gdb) bt
>>> #0  0x00002b9bfc77149b in sem_timedwait () from /lib/libpthread.so.0
>>> #1  0x000000000041cc3c in sem_wait_forever (s=0x23af950) at 
>>> ../thread_manager.c:443
>>> #2  0x000000000041da7f in suspend_resume_handler (signo=<value optimized 
>>> out>, info=<value optimized out>, context=0x4186a6c0) at 
>>> ../thread_manager.c:594
>>> #3  <signal handler called>
>>> #4  0x00002b9bfc771efb in ?? () from /lib/libpthread.so.0
>>> #5  0x0000000000411402 in _SPffcall () at ../x86-spentry64.s:4041
>>> #6  0x0000000000000000 in ?? ()
>>> 
>>> 
>>> 
>>> On Apr 15, 2009, at 1:34 AM, Gary Byers wrote:
>>> 
>>>> A bunch of threads (all but the one that GDB identifies as #3) were
>>>> suspended by thread 3 when it ran the GC; each such suspended thread
>>>> is waiting on a semaphore that'd be signaled/raised when the GC exits.
>>>> It looks like the GC exited and is in the process of trying to free
>>>> some foreign objects that were referenced by "gcable pointers"; it
>>>> has to signal the other threads' semaphores (to wake them up) before
>>>> it actually frees anything, since one of those threads may own a lock
>>>> on the malloc heap.
>>>> It looks like the GC thread (#3) called free(), and is waiting for
>>>> a lock that's likely held by one of the other threads that's likely
>>>> to still be waiting (in sem_timedwait()) to be told that it's OK
>>>> to run.  The GC thread's already signaled the semaphores that the
>>>> other threads are waiting on, but they don't seem to have gotten
>>>> the message.
>>>> It's not that unusual to see that some of the other threads are 
>>>> apparently
>>>> still waiting for their "resume" semaphores; it's a little unusual to
>>>> see that they're all waiting, and I'd guess that they have been waiting
>>>> since you noticed the load average dropping to 0.
>>>> If you still have the GDB attached, you might try something: make each
>>>> other thread (other than #3) active in GDB and use 'up' to see how it
>>>> got to sem_timedwait().  My guess is that you'd see the call to
>>>> SEM_WAIT_FOREVER in the function 'suspend_resume_handler()' (in
>>>> ccl/lisp-kernel/thread-manager.c) a few frames up the stack in each
>>>> cases.  If you see that any of these threads is waiting for anything
>>>> other than its tcr->resume semaphore, that'd be interesting.
>>>> Otherwise, this is a bit of a mystery, since the whole 'suspend other
>>>> threads on GC entry, signal their resume semaphores on GC exit' routine
>>>> happens on every GC, and has likely happened hundreds or thousands of
>>>> times already in this session, and it's not clear what's different this
>>>> time.
>>>> On Wed, 15 Apr 2009, Osei Poku wrote:
>>>>> Hi,
>>>>> I was running three slime repls (9 threads total) and in each repl I
>>>>> was running the same long running function on different inputs.  After
>>>>> a few hours of my load being about 2.8, it dropped down to 0.0.
>>>>> Basically nothing was running, nothing responding etc.  All threads
>>>>> are frozen.  As is confirmed by the backtrace in the main thread (see
>>>>> below), this happened during gc which makes sense since all activity
>>>>> stops during gc.  Not sure what else I can do to revive the process
>>>>> but I thought it might be useful to report the issue.
>>>>> This is from CCL version 1.3-RC1-r11885M.
>>>>> Osei
>>>>> --- gdb output -----
>>>>> (gdb) info threads
>>>>> 15 Thread 1076246864 (LWP 14275)  0x00002b9bfc77149b in
>>>>> sem_timedwait () from /lib/libpthread.so.0
>>>>> 14 Thread 1078753616 (LWP 14278)  0x00002b9bfc77149b in
>>>>> sem_timedwait () from /lib/libpthread.so.0
>>>>> 13 Thread 1081260368 (LWP 14280)  0x00002b9bfc77149b in
>>>>> sem_timedwait () from /lib/libpthread.so.0
>>>>> 12 Thread 1083767120 (LWP 14281)  0x00002b9bfc77149b in
>>>>> sem_timedwait () from /lib/libpthread.so.0
>>>>> 11 Thread 1086273872 (LWP 14282)  0x00002b9bfc77149b in
>>>>> sem_timedwait () from /lib/libpthread.so.0
>>>>> 10 Thread 1088780624 (LWP 14283)  0x00002b9bfc77149b in
>>>>> sem_timedwait () from /lib/libpthread.so.0
>>>>> 9 Thread 1091287376 (LWP 24522)  0x00002b9bfc77149b in
>>>>> sem_timedwait () from /lib/libpthread.so.0
>>>>> 8 Thread 1093794128 (LWP 24523)  0x00002b9bfc77149b in
>>>>> sem_timedwait () from /lib/libpthread.so.0
>>>>> 7 Thread 1096841552 (LWP 24524)  0x00002b9bfc77149b in
>>>>> sem_timedwait () from /lib/libpthread.so.0
>>>>> 6 Thread 1099348304 (LWP 24525)  0x00002b9bfc77149b in
>>>>> sem_timedwait () from /lib/libpthread.so.0
>>>>> 5 Thread 1101855056 (LWP 28424)  0x00002b9bfc77149b in
>>>>> sem_timedwait () from /lib/libpthread.so.0
>>>>> 4 Thread 1104361808 (LWP 28425)  0x00002b9bfc77149b in
>>>>> sem_timedwait () from /lib/libpthread.so.0
>>>>> 3 Thread 1106868560 (LWP 28426)  0x00002b9bfca601d8 in ?? () from /
>>>>> lib/libc.so.6
>>>>> 2 Thread 1109375312 (LWP 28427)  0x00002b9bfc77149b in
>>>>> sem_timedwait () from /lib/libpthread.so.0
>>>>> 1 Thread 47948961267808 (LWP 14271)  0x00002b9bfc77149b in
>>>>> sem_timedwait () from /lib/libpthread.so.0
>>>>> (gdb) thread
>>>>> [Current thread is 0 (LWP 14271)]
>>>>> (gdb) thread 2
>>>>> [Switching to thread 2 (Thread 1109375312 (LWP 28427))]#0
>>>>> 0x00002b9bfc77149b in sem_timedwait () from /lib/libpthread.so.0
>>>>> (gdb) thread 3
>>>>> [Switching to thread 3 (Thread 1106868560 (LWP 28426))]#0
>>>>> 0x00002b9bfca601d8 in ?? () from /lib/libc.so.6
>>>>> (gdb) bt
>>>>> #0  0x00002b9bfca601d8 in ?? () from /lib/libc.so.6
>>>>> #1  0x00002b9bfc9f6662 in ?? () from /lib/libc.so.6
>>>>> #2  0x00002b9bfc9f56f1 in free () from /lib/libc.so.6
>>>>> #3  0x000000000041ca69 in destroy_semaphore (s=0x300055b0bc08) at ../
>>>>> thread_manager.c:724
>>>>> #4  0x0000000000414caa in reap_gcable_ptrs () at ../gc-common.c:746
>>>>> #5  0x00000000004157bd in gc (tcr=0x41f977b0, param=<value optimized
>>>>> out>) at ../gc-common.c:1331
>>>>> #6  0x00000000004196db in gc_from_tcr (tcr=0x2b9bfccd6980, param=0)
>>>>> at ../x86-exceptions.c:2786
>>>>> #7  0x000000000041abe1 in gc_like_from_xp (xp=<value optimized out>,
>>>>> fun=0x4196b0 <gc_from_tcr>, param=0) at ../x86-exceptions.c:2743
>>>>> #8  0x000000000041ac3e in gc_from_xp (xp=0x2b9bfccd6980,
>>>>> param=25576416) at ../x86-exceptions.c:2798
>>>>> #9  0x000000000041ad67 in allocate_object (xp=0x41f966f0,
>>>>> bytes_needed=6352096, disp_from_allocptr=6352083, tcr=0x41f977b0)
>>>>> at ../x86-exceptions.c:145
>>>>> #10 0x000000000041adce in handle_alloc_trap (xp=0x41f966f0,
>>>>> tcr=0x41f977b0) at ../x86-exceptions.c:572
>>>>> #11 0x000000000041b5ce in handle_exception (signum=<value optimized
>>>>> out>, info=0x41f96aa0, context=0x41f966f0, tcr=0x8000000000000000,
>>>>> old_valence=-1) at ../x86-exceptions.c:1060
>>>>> #12 0x000000000041b698 in signal_handler (signum=11, info=0x41f96aa0,
>>>>> context=0x41f966f0) at ../x86-exceptions.c:1330
>>>>> #13 <signal handler called>
>>>>> #14 0x000000000041054a in _SPmisc_alloc () at ../x86-spentry64.s:2840
>>>>> #15 0x000030004036943c in ?? ()
>>>>> #16 0x000000000060ecd0 in ?? ()
>>>>> #17 0x0000000000000000 in ?? ()
>>>>> (gdb) up
>>>>> #1  0x00002b9bfc9f6662 in ?? () from /lib/libc.so.6
>>>>> (gdb) up
>>>>> #2  0x00002b9bfc9f56f1 in free () from /lib/libc.so.6
>>>>> (gdb) up
>>>>> #3  0x000000000041ca69 in destroy_semaphore (s=0x300055b0bc08) at ../
>>>>> thread_manager.c:724
>>>>> 724         free(*s);
>>>>> (gdb) list
>>>>> 719     destroy_semaphore(void **s)
>>>>> 720     {
>>>>> 721       if (*s) {
>>>>> 722     #ifdef USE_POSIX_SEMAPHORES
>>>>> 723         sem_destroy((sem_t *)*s);
>>>>> 724         free(*s);
>>>>> 725     #endif
>>>>> 726     #ifdef USE_MACH_SEMAPHORES
>>>>> 727         semaphore_destroy(mach_task_self(),((semaphore_t)(natural)
>>>>> *s));
>>>>> 728     #endif
>>>>> (gdb) up
>>>>> #4  0x0000000000414caa in reap_gcable_ptrs () at ../gc-common.c:746
>>>>> 746               destroy_semaphore((void**)&(x->address));
>>>>> (gdb) list
>>>>> 741             case xmacptr_flag_rwlock:
>>>>> 742               rwlock_destroy((rwlock *)ptr_from_lispobj(ptr));
>>>>> 743               break;
>>>>> 744
>>>>> 745             case xmacptr_flag_semaphore:
>>>>> 746               destroy_semaphore((void**)&(x->address));
>>>>> 747               break;
>>>>> 748
>>>>> 749             default:
>>>>> 750               if ((flag >= xmacptr_flag_user_first) &&
>>>>> (gdb)
>>>>> _______________________________________________
>>>>> Openmcl-devel mailing list
>>>>> Openmcl-devel at clozure.com
>>>>> http://clozure.com/mailman/listinfo/openmcl-devel
>>> 
>



More information about the Openmcl-devel mailing list