[Openmcl-devel] odd ccl hang during gc

Osei Poku osei.poku at gmail.com
Wed Jul 8 05:29:06 PDT 2009


My issue did not really have much to do the SAVE-APPLICATION.  It just  
happens sporadically during normal multiprocessing.

On Jul 7, 2009, at 2:31 PM, Bill St. Clair wrote:

> I ran into a similar problem, an infinite loop in
> reap_gcable_pointers, over the weekend. It only happens if you
> SAVE-APPLICATION with a :TOPLEVEL-FUNCTION. It was fixed by:
>
>  http://trac.clozure.com/openmcl/changeset/11983
>
> So if you are indeed running from a saved application with a
> :TOPLEVEL-FUNCTION, try updating CCL from trunk, and rebuilding.
>
> -Bill
>
> On Tue, Jul 7, 2009 at 2:06 PM, Osei Poku<osei.poku at gmail.com> wrote:
>> Hi,
>>
>> This issue cropped up again on the same 16-core opteron machine.  It
>> is a similar situation with multiple threads running and then, all at
>> once, all threads except one stop at sem_timedwait().  The other
>> thread is stuck in the same spot as before trying to free something.
>> A few backtraces attached.  I am still running 1.3-RC1-r11885M so  
>> this
>> may have been fixed in newer releases.
>>
>> I will wait until I get replies before killing this process.
>>
>> Osei
>>
>> (gdb) bt
>> #0  0x00002b9fab6e249b in sem_timedwait () from /lib/libpthread.so.0
>> #1  0x000000000041cc3c in sem_wait_forever (s=0x6496b0) at ../
>> thread_manager.c:443
>> #2  0x000000000041da7f in suspend_resume_handler (signo=<value
>> optimized out>, info=<value optimized out>, context=0x7fffffa75570)
>> at ../thread_manager.c:594
>> #3  <signal handler called>
>> #4  0x00002b9fab6e37b1 in ?? () from /lib/libpthread.so.0
>> #5  0x0000000000411402 in _SPffcall () at ../x86-spentry64.s:4041
>> #6  0x00007fffffa75c80 in ?? ()
>> #7  0x00002b9fabe90f88 in ?? ()
>> #8  0x000000000000031a in ?? ()
>> #9  0x00007fffffa75c70 in ?? ()
>> #10 0x0000000000000000 in ?? ()
>> (gdb) threads
>> Undefined command: "threads".  Try "help".
>> (gdb) info threads
>>   14 Thread 1076246864 (LWP 3695)  0x00002b9fab6e249b in
>> sem_timedwait () from /lib/libpthread.so.0
>>   13 Thread 1078753616 (LWP 3697)  0x00002b9fab6e249b in
>> sem_timedwait () from /lib/libpthread.so.0
>>   12 Thread 1081260368 (LWP 3701)  0x00002b9fab6e249b in
>> sem_timedwait () from /lib/libpthread.so.0
>>   11 Thread 1083767120 (LWP 3702)  0x00002b9fab6e249b in
>> sem_timedwait () from /lib/libpthread.so.0
>>   10 Thread 1086273872 (LWP 3703)  0x00002b9fab6e249b in
>> sem_timedwait () from /lib/libpthread.so.0
>>   9 Thread 1088780624 (LWP 3704)  0x00002b9fab6e249b in sem_timedwait
>> () from /lib/libpthread.so.0
>>   8 Thread 1091287376 (LWP 19572)  0x00002b9fab6e249b in
>> sem_timedwait () from /lib/libpthread.so.0
>>   7 Thread 1116133712 (LWP 28035)  0x00002b9fab6e249b in
>> sem_timedwait () from /lib/libpthread.so.0
>>   6 Thread 1111120208 (LWP 28036)  0x00002b9fab6e249b in
>> sem_timedwait () from /lib/libpthread.so.0
>>   5 Thread 1099680080 (LWP 28037)  0x00002b9fab6e249b in
>> sem_timedwait () from /lib/libpthread.so.0
>>   4 Thread 1108072784 (LWP 28038)  0x00002b9fab6e249b in
>> sem_timedwait () from /lib/libpthread.so.0
>>   3 Thread 1113626960 (LWP 28625)  0x00002b9fab9d11d8 in ?? () from /
>> lib/libc.so.6
>>   2 Thread 1121147216 (LWP 29139)  0x00002b9fab6e249b in
>> sem_timedwait () from /lib/libpthread.so.0
>>   1 Thread 47964781596768 (LWP 3691)  0x00002b9fab6e249b in
>> sem_timedwait () from /lib/libpthread.so.0
>> (gdb) thread 3
>> [Switching to thread 3 (Thread 1113626960 (LWP 28625))]#0
>> 0x00002b9fab9d11d8 in ?? () from /lib/libc.so.6
>> (gdb) bt
>> #0  0x00002b9fab9d11d8 in ?? () from /lib/libc.so.6
>> #1  0x00002b9fab967662 in ?? () from /lib/libc.so.6
>> #2  0x00002b9fab9666f1 in free () from /lib/libc.so.6
>> #3  0x000000000041ca69 in destroy_semaphore (s=0x3000436cc0c8) at ../
>> thread_manager.c:724
>> #4  0x0000000000414caa in reap_gcable_ptrs () at ../gc-common.c:746
>> #5  0x00000000004157bd in gc (tcr=0x426097b0, param=<value optimized
>> out>) at ../gc-common.c:1331
>> #6  0x00000000004196db in gc_from_tcr (tcr=0x2aaab0000020, param=0)
>> at ../x86-exceptions.c:2786
>> #7  0x000000000041abe1 in gc_like_from_xp (xp=<value optimized out>,
>> fun=0x4196b0 <gc_from_tcr>, param=0) at ../x86-exceptions.c:2743
>> #8  0x000000000041ac3e in gc_from_xp (xp=0x2aaab0000020,
>> param=46912586140688) at ../x86-exceptions.c:2798
>> #9  0x000000000041b14f in handle_gc_trap (xp=0x426086f0,
>> tcr=0x426097b0) at ../x86-exceptions.c:239
>> #10 0x000000000041b5eb in handle_exception (signum=<value optimized
>> out>, info=0x42608aa0, context=0x426086f0, tcr=0x8000000000000000,
>> old_valence=-1) at ../x86-exceptions.c:1066
>> #11 0x000000000041b698 in signal_handler (signum=11, info=0x42608aa0,
>> context=0x426086f0) at ../x86-exceptions.c:1330
>> #12 <signal handler called>
>> #13 0x00003000400135e7 in ?? ()
>> #14 0x00003000419096a4 in ?? ()
>> #15 0x000000000040ee7c in _SPnthrowvalues () at ../x86-spentry64.s: 
>> 1405
>> #16 0x00002aaaae17eae0 in ?? ()
>> #17 0x0000000000000008 in ?? ()
>> #18 0x0000000000000000 in ?? ()
>>
>>
>> On Apr 15, 2009, at 2:50 AM, Gary Byers wrote:
>>
>>> and the sem_wait_forever() is coming from suspend_resume_handler(),
>>> which
>>> means that those threads (and likely all others) are waiting on
>>> semaphores
>>> that should have been signaled a while ago.
>>>
>>> Just as a stab in the dark: what kind of hardware is this ?  (e.g.,
>>> what
>>> does
>>>
>>> $ cat /proc/cpuinfo
>>>
>>> say ?)
>>>
>>>
>>> On Wed, 15 Apr 2009, Osei Poku wrote:
>>>
>>>> Unfortunately, I killed the process.  Fortunately, I did run the
>>>> following backtraces on other a few other threads than #3.
>>>> According to the line numbers shown in the output below, it appears
>>>> that they (at least the 3 threads I looked at) are all stopped at
>>>> SEM_WAIT_FOREVER().
>>>>
>>>> I guess the mystery continues...
>>>>
>>>> Osei
>>>>
>>>> (gdb) info threads
>>>> 15 Thread 1076246864 (LWP 14275)  0x00002b9bfc77149b in
>>>> sem_timedwait () from /lib/libpthread.so.0
>>>> 14 Thread 1078753616 (LWP 14278)  0x00002b9bfc77149b in
>>>> sem_timedwait () from /lib/libpthread.so.0
>>>> 13 Thread 1081260368 (LWP 14280)  0x00002b9bfc77149b in
>>>> sem_timedwait () from /lib/libpthread.so.0
>>>> 12 Thread 1083767120 (LWP 14281)  0x00002b9bfc77149b in
>>>> sem_timedwait () from /lib/libpthread.so.0
>>>> 11 Thread 1086273872 (LWP 14282)  0x00002b9bfc77149b in
>>>> sem_timedwait () from /lib/libpthread.so.0
>>>> 10 Thread 1088780624 (LWP 14283)  0x00002b9bfc77149b in
>>>> sem_timedwait () from /lib/libpthread.so.0
>>>> 9 Thread 1091287376 (LWP 24522)  0x00002b9bfc77149b in
>>>> sem_timedwait () from /lib/libpthread.so.0
>>>> 8 Thread 1093794128 (LWP 24523)  0x00002b9bfc77149b in
>>>> sem_timedwait () from /lib/libpthread.so.0
>>>> 7 Thread 1096841552 (LWP 24524)  0x00002b9bfc77149b in
>>>> sem_timedwait () from /lib/libpthread.so.0
>>>> 6 Thread 1099348304 (LWP 24525)  0x00002b9bfc77149b in
>>>> sem_timedwait () from /lib/libpthread.so.0
>>>> 5 Thread 1101855056 (LWP 28424)  0x00002b9bfc77149b in
>>>> sem_timedwait () from /lib/libpthread.so.0
>>>> 4 Thread 1104361808 (LWP 28425)  0x00002b9bfc77149b in
>>>> sem_timedwait () from /lib/libpthread.so.0
>>>> * 3 Thread 1106868560 (LWP 28426)  0x00002b9bfca601d8 in ?? ()
>>>> from /lib/libc.so.6
>>>> 2 Thread 1109375312 (LWP 28427)  0x00002b9bfc77149b in
>>>> sem_timedwait () from /lib/libpthread.so.0
>>>> 1 Thread 47948961267808 (LWP 14271)  0x00002b9bfc77149b in
>>>> sem_timedwait () from /lib/libpthread.so.0
>>>> (gdb) thread 4
>>>> [Switching to thread 4 (Thread 1104361808 (LWP 28425))]#0
>>>> 0x00002b9bfc77149b in sem_timedwait () from /lib/libpthread.so.0
>>>> (gdb) bt
>>>> #0  0x00002b9bfc77149b in sem_timedwait () from /lib/ 
>>>> libpthread.so.0
>>>> #1  0x000000000041cc3c in sem_wait_forever (s=0x1ce0cb0) at ../
>>>> thread_manager.c:443
>>>> #2  0x000000000041da7f in suspend_resume_handler (signo=<value
>>>> optimized out>, info=<value optimized out>, context=0x41d32670)
>>>> at ../thread_manager.c:594
>>>> #3  <signal handler called>
>>>> #4  0x00002b9bfc77149b in sem_timedwait () from /lib/ 
>>>> libpthread.so.0
>>>> #5  0x000000000041cbdb in wait_on_semaphore (s=0x18078b0,
>>>> seconds=1, millis=0) at ../thread_manager.c:468
>>>> #6  0x0000000000411402 in _SPffcall () at ../x86-spentry64.s:4041
>>>> #7  0x0000000000000000 in ?? ()
>>>> (gdb) thread 5
>>>> [Switching to thread 5 (Thread 1101855056 (LWP 28424))]#0
>>>> 0x00002b9bfc77149b in sem_timedwait () from /lib/libpthread.so.0
>>>> (gdb) bt
>>>> #0  0x00002b9bfc77149b in sem_timedwait () from /lib/ 
>>>> libpthread.so.0
>>>> #1  0x000000000041cc3c in sem_wait_forever (s=0xd3e4b0) at ../
>>>> thread_manager.c:443
>>>> #2  0x000000000041da7f in suspend_resume_handler (signo=<value
>>>> optimized out>, info=<value optimized out>, context=0x41ace650)
>>>> at ../thread_manager.c:594
>>>> #3  <signal handler called>
>>>> #4  0x00002b9bfc7727b1 in ?? () from /lib/libpthread.so.0
>>>> #5  0x0000000000411402 in _SPffcall () at ../x86-spentry64.s:4041
>>>> #6  0x0000000041aced60 in ?? ()
>>>> #7  0x00002aaaaf859e88 in ?? ()
>>>> #8  0x000000000000031a in ?? ()
>>>> #9  0x0000000041aced50 in ?? ()
>>>> #10 0x0000000000000000 in ?? ()
>>>> (gdb) thread 6
>>>> [Switching to thread 6 (Thread 1099348304 (LWP 24525))]#0
>>>> 0x00002b9bfc77149b in sem_timedwait () from /lib/libpthread.so.0
>>>> (gdb) bt
>>>> #0  0x00002b9bfc77149b in sem_timedwait () from /lib/ 
>>>> libpthread.so.0
>>>> #1  0x000000000041cc3c in sem_wait_forever (s=0x23af950) at ../
>>>> thread_manager.c:443
>>>> #2  0x000000000041da7f in suspend_resume_handler (signo=<value
>>>> optimized out>, info=<value optimized out>, context=0x4186a6c0)
>>>> at ../thread_manager.c:594
>>>> #3  <signal handler called>
>>>> #4  0x00002b9bfc771efb in ?? () from /lib/libpthread.so.0
>>>> #5  0x0000000000411402 in _SPffcall () at ../x86-spentry64.s:4041
>>>> #6  0x0000000000000000 in ?? ()
>>>>
>>>>
>>>>
>>>> On Apr 15, 2009, at 1:34 AM, Gary Byers wrote:
>>>>
>>>>> A bunch of threads (all but the one that GDB identifies as #3)  
>>>>> were
>>>>> suspended by thread 3 when it ran the GC; each such suspended  
>>>>> thread
>>>>> is waiting on a semaphore that'd be signaled/raised when the GC
>>>>> exits.
>>>>> It looks like the GC exited and is in the process of trying to  
>>>>> free
>>>>> some foreign objects that were referenced by "gcable pointers"; it
>>>>> has to signal the other threads' semaphores (to wake them up)  
>>>>> before
>>>>> it actually frees anything, since one of those threads may own a
>>>>> lock
>>>>> on the malloc heap.
>>>>> It looks like the GC thread (#3) called free(), and is waiting for
>>>>> a lock that's likely held by one of the other threads that's  
>>>>> likely
>>>>> to still be waiting (in sem_timedwait()) to be told that it's OK
>>>>> to run.  The GC thread's already signaled the semaphores that the
>>>>> other threads are waiting on, but they don't seem to have gotten
>>>>> the message.
>>>>> It's not that unusual to see that some of the other threads are
>>>>> apparently
>>>>> still waiting for their "resume" semaphores; it's a little unusual
>>>>> to
>>>>> see that they're all waiting, and I'd guess that they have been
>>>>> waiting
>>>>> since you noticed the load average dropping to 0.
>>>>> If you still have the GDB attached, you might try something: make
>>>>> each
>>>>> other thread (other than #3) active in GDB and use 'up' to see how
>>>>> it
>>>>> got to sem_timedwait().  My guess is that you'd see the call to
>>>>> SEM_WAIT_FOREVER in the function 'suspend_resume_handler()' (in
>>>>> ccl/lisp-kernel/thread-manager.c) a few frames up the stack in  
>>>>> each
>>>>> cases.  If you see that any of these threads is waiting for  
>>>>> anything
>>>>> other than its tcr->resume semaphore, that'd be interesting.
>>>>> Otherwise, this is a bit of a mystery, since the whole 'suspend
>>>>> other
>>>>> threads on GC entry, signal their resume semaphores on GC exit'
>>>>> routine
>>>>> happens on every GC, and has likely happened hundreds or thousands
>>>>> of
>>>>> times already in this session, and it's not clear what's different
>>>>> this
>>>>> time.
>>>>> On Wed, 15 Apr 2009, Osei Poku wrote:
>>>>>> Hi,
>>>>>> I was running three slime repls (9 threads total) and in each
>>>>>> repl I
>>>>>> was running the same long running function on different inputs.
>>>>>> After
>>>>>> a few hours of my load being about 2.8, it dropped down to 0.0.
>>>>>> Basically nothing was running, nothing responding etc.  All  
>>>>>> threads
>>>>>> are frozen.  As is confirmed by the backtrace in the main thread
>>>>>> (see
>>>>>> below), this happened during gc which makes sense since all
>>>>>> activity
>>>>>> stops during gc.  Not sure what else I can do to revive the  
>>>>>> process
>>>>>> but I thought it might be useful to report the issue.
>>>>>> This is from CCL version 1.3-RC1-r11885M.
>>>>>> Osei
>>>>>> --- gdb output -----
>>>>>> (gdb) info threads
>>>>>> 15 Thread 1076246864 (LWP 14275)  0x00002b9bfc77149b in
>>>>>> sem_timedwait () from /lib/libpthread.so.0
>>>>>> 14 Thread 1078753616 (LWP 14278)  0x00002b9bfc77149b in
>>>>>> sem_timedwait () from /lib/libpthread.so.0
>>>>>> 13 Thread 1081260368 (LWP 14280)  0x00002b9bfc77149b in
>>>>>> sem_timedwait () from /lib/libpthread.so.0
>>>>>> 12 Thread 1083767120 (LWP 14281)  0x00002b9bfc77149b in
>>>>>> sem_timedwait () from /lib/libpthread.so.0
>>>>>> 11 Thread 1086273872 (LWP 14282)  0x00002b9bfc77149b in
>>>>>> sem_timedwait () from /lib/libpthread.so.0
>>>>>> 10 Thread 1088780624 (LWP 14283)  0x00002b9bfc77149b in
>>>>>> sem_timedwait () from /lib/libpthread.so.0
>>>>>> 9 Thread 1091287376 (LWP 24522)  0x00002b9bfc77149b in
>>>>>> sem_timedwait () from /lib/libpthread.so.0
>>>>>> 8 Thread 1093794128 (LWP 24523)  0x00002b9bfc77149b in
>>>>>> sem_timedwait () from /lib/libpthread.so.0
>>>>>> 7 Thread 1096841552 (LWP 24524)  0x00002b9bfc77149b in
>>>>>> sem_timedwait () from /lib/libpthread.so.0
>>>>>> 6 Thread 1099348304 (LWP 24525)  0x00002b9bfc77149b in
>>>>>> sem_timedwait () from /lib/libpthread.so.0
>>>>>> 5 Thread 1101855056 (LWP 28424)  0x00002b9bfc77149b in
>>>>>> sem_timedwait () from /lib/libpthread.so.0
>>>>>> 4 Thread 1104361808 (LWP 28425)  0x00002b9bfc77149b in
>>>>>> sem_timedwait () from /lib/libpthread.so.0
>>>>>> 3 Thread 1106868560 (LWP 28426)  0x00002b9bfca601d8 in ?? ()  
>>>>>> from /
>>>>>> lib/libc.so.6
>>>>>> 2 Thread 1109375312 (LWP 28427)  0x00002b9bfc77149b in
>>>>>> sem_timedwait () from /lib/libpthread.so.0
>>>>>> 1 Thread 47948961267808 (LWP 14271)  0x00002b9bfc77149b in
>>>>>> sem_timedwait () from /lib/libpthread.so.0
>>>>>> (gdb) thread
>>>>>> [Current thread is 0 (LWP 14271)]
>>>>>> (gdb) thread 2
>>>>>> [Switching to thread 2 (Thread 1109375312 (LWP 28427))]#0
>>>>>> 0x00002b9bfc77149b in sem_timedwait () from /lib/libpthread.so.0
>>>>>> (gdb) thread 3
>>>>>> [Switching to thread 3 (Thread 1106868560 (LWP 28426))]#0
>>>>>> 0x00002b9bfca601d8 in ?? () from /lib/libc.so.6
>>>>>> (gdb) bt
>>>>>> #0  0x00002b9bfca601d8 in ?? () from /lib/libc.so.6
>>>>>> #1  0x00002b9bfc9f6662 in ?? () from /lib/libc.so.6
>>>>>> #2  0x00002b9bfc9f56f1 in free () from /lib/libc.so.6
>>>>>> #3  0x000000000041ca69 in destroy_semaphore (s=0x300055b0bc08)
>>>>>> at ../
>>>>>> thread_manager.c:724
>>>>>> #4  0x0000000000414caa in reap_gcable_ptrs () at ../gc-common.c: 
>>>>>> 746
>>>>>> #5  0x00000000004157bd in gc (tcr=0x41f977b0, param=<value
>>>>>> optimized
>>>>>> out>) at ../gc-common.c:1331
>>>>>> #6  0x00000000004196db in gc_from_tcr (tcr=0x2b9bfccd6980,  
>>>>>> param=0)
>>>>>> at ../x86-exceptions.c:2786
>>>>>> #7  0x000000000041abe1 in gc_like_from_xp (xp=<value optimized
>>>>>> out>,
>>>>>> fun=0x4196b0 <gc_from_tcr>, param=0) at ../x86-exceptions.c:2743
>>>>>> #8  0x000000000041ac3e in gc_from_xp (xp=0x2b9bfccd6980,
>>>>>> param=25576416) at ../x86-exceptions.c:2798
>>>>>> #9  0x000000000041ad67 in allocate_object (xp=0x41f966f0,
>>>>>> bytes_needed=6352096, disp_from_allocptr=6352083, tcr=0x41f977b0)
>>>>>> at ../x86-exceptions.c:145
>>>>>> #10 0x000000000041adce in handle_alloc_trap (xp=0x41f966f0,
>>>>>> tcr=0x41f977b0) at ../x86-exceptions.c:572
>>>>>> #11 0x000000000041b5ce in handle_exception (signum=<value  
>>>>>> optimized
>>>>>> out>, info=0x41f96aa0, context=0x41f966f0,  
>>>>>> tcr=0x8000000000000000,
>>>>>> old_valence=-1) at ../x86-exceptions.c:1060
>>>>>> #12 0x000000000041b698 in signal_handler (signum=11,
>>>>>> info=0x41f96aa0,
>>>>>> context=0x41f966f0) at ../x86-exceptions.c:1330
>>>>>> #13 <signal handler called>
>>>>>> #14 0x000000000041054a in _SPmisc_alloc () at ../x86-spentry64.s:
>>>>>> 2840
>>>>>> #15 0x000030004036943c in ?? ()
>>>>>> #16 0x000000000060ecd0 in ?? ()
>>>>>> #17 0x0000000000000000 in ?? ()
>>>>>> (gdb) up
>>>>>> #1  0x00002b9bfc9f6662 in ?? () from /lib/libc.so.6
>>>>>> (gdb) up
>>>>>> #2  0x00002b9bfc9f56f1 in free () from /lib/libc.so.6
>>>>>> (gdb) up
>>>>>> #3  0x000000000041ca69 in destroy_semaphore (s=0x300055b0bc08)
>>>>>> at ../
>>>>>> thread_manager.c:724
>>>>>> 724         free(*s);
>>>>>> (gdb) list
>>>>>> 719     destroy_semaphore(void **s)
>>>>>> 720     {
>>>>>> 721       if (*s) {
>>>>>> 722     #ifdef USE_POSIX_SEMAPHORES
>>>>>> 723         sem_destroy((sem_t *)*s);
>>>>>> 724         free(*s);
>>>>>> 725     #endif
>>>>>> 726     #ifdef USE_MACH_SEMAPHORES
>>>>>> 727         semaphore_destroy(mach_task_self(),((semaphore_t)
>>>>>> (natural)
>>>>>> *s));
>>>>>> 728     #endif
>>>>>> (gdb) up
>>>>>> #4  0x0000000000414caa in reap_gcable_ptrs () at ../gc-common.c: 
>>>>>> 746
>>>>>> 746               destroy_semaphore((void**)&(x->address));
>>>>>> (gdb) list
>>>>>> 741             case xmacptr_flag_rwlock:
>>>>>> 742               rwlock_destroy((rwlock  
>>>>>> *)ptr_from_lispobj(ptr));
>>>>>> 743               break;
>>>>>> 744
>>>>>> 745             case xmacptr_flag_semaphore:
>>>>>> 746               destroy_semaphore((void**)&(x->address));
>>>>>> 747               break;
>>>>>> 748
>>>>>> 749             default:
>>>>>> 750               if ((flag >= xmacptr_flag_user_first) &&
>>>>>> (gdb)
>>>>>> _______________________________________________
>>>>>> Openmcl-devel mailing list
>>>>>> Openmcl-devel at clozure.com
>>>>>> http://clozure.com/mailman/listinfo/openmcl-devel
>>>>
>>
>> _______________________________________________
>> Openmcl-devel mailing list
>> Openmcl-devel at clozure.com
>> http://clozure.com/mailman/listinfo/openmcl-devel
>>




More information about the Openmcl-devel mailing list