[Openmcl-devel] Random crashing

Mon Jul 21 15:42:00 PDT 2008

Thanks.  Curiouser and curiouser, not only is the "resume" field 0,
but many other fields are as well, including 'next' and 'prev'.  (TCR
structures are maintained in a circular, doubly-linked list; this guy
seems to have died and spliced himself out of that list.)  Enough
fields are set that this looks like a dead thread rather than a
newly-created one.

The backtrace indicates that this was coming from
'lisp_resume_other_threads()', which is called as part of the expansion
of WITH-OTHER-THREADS-SUSPENDED.  And lisp_resume_other_threads()
and lisp_suspend_other_threads() don't bother to grab and release
the lock which allows modification of the tcr list.

I'm not quite sure why this happened, but the code that walks this
doubly-linked list suspending and resuming threads needs to be sure
that other threads aren't splicing themselves on and off that list
while it's being walked.
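
A minimal sketch of what holding the list lock during the walk might
look like.  The names here (sketch_tcr, tcr_list_lock,
sketch_dequeue_tcr, sketch_resume_other_threads) are made up for
illustration and are not the lisp kernel's actual definitions; the NULL
check on 'resume' is likewise an assumption about how a dead TCR could
be skipped, not a description of the real fix.

```c
/* Hypothetical sketch: none of these names are the real kernel's. */
#include <assert.h>
#include <pthread.h>
#include <semaphore.h>
#include <stddef.h>

typedef struct sketch_tcr {
    struct sketch_tcr *next, *prev;  /* circular, doubly-linked list */
    sem_t *resume;                   /* assumed NULL once the thread is dead */
} sketch_tcr;

/* Assumed: a single lock guards every splice onto or off of the list. */
static pthread_mutex_t tcr_list_lock = PTHREAD_MUTEX_INITIALIZER;

/* Splice a TCR off the list, as a dying thread would, under the lock. */
void sketch_dequeue_tcr(sketch_tcr *tcr)
{
    pthread_mutex_lock(&tcr_list_lock);
    tcr->prev->next = tcr->next;
    tcr->next->prev = tcr->prev;
    tcr->next = tcr->prev = NULL;
    pthread_mutex_unlock(&tcr_list_lock);
}

/* Resume every thread on the list except 'current', holding the same
   lock for the whole walk so nobody can splice in or out meanwhile.
   Returns the number of semaphores actually posted. */
int sketch_resume_other_threads(sketch_tcr *current)
{
    int posted = 0;

    assert(current != NULL);
    pthread_mutex_lock(&tcr_list_lock);
    for (sketch_tcr *t = current->next; t != current; t = t->next) {
        /* Skip a TCR with no 'resume' semaphore rather than crash. */
        if (t->resume != NULL && sem_post(t->resume) == 0)
            posted++;
    }
    pthread_mutex_unlock(&tcr_list_lock);
    return posted;
}
```

With the walk and the splice both taking tcr_list_lock, a thread can
only remove itself before the walk starts or after it finishes, so the
walker never follows a 'next' pointer that has just been zeroed.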

On Mon, 21 Jul 2008, Osei Poku wrote:

>
> On Jul 21, 2008, at 5:53 PM, Gary Byers wrote:
>
>> If you still have the debugging session running, could you do:
>> 
>> (gdb) p/x *(TCR *)0x417e77d0
>
> (gdb) p/x *(TCR *)0x417e77d0
> $1 = {next = 0x0, prev = 0x0, single_float_convert = {tag = 0x1, f = 0x0}, 
> linear = 0x0, save_rbp = 0x2aaaadd49ab0, lisp_mxcsr = 0x1920, foreign_mxcsr = 
> 0x1f80, db_link = 0x0, catch_top = 0x0, save_vsp = 0x2aaaadd49a58, save_tsp = 
> 0x2aaaade5b000, foreign_sp = 0x417e6da0, cs_area = 0x0, vs_area = 0x0, 
> ts_area = 0x0, cs_limit = 0x415b6000, bytes_allocated = 0x0,
> log2_allocation_quantum = 0x11, interrupt_pending = 0x0, xframe = 0x0, 
> errno_loc = 0x417e7770, ffi_exception = 0x1f80, osid = 0x0, valence = 0x1, 
> foreign_exception_status = 0x0, native_thread_info = 0x0, native_thread_id = 
> 0x1847, last_allocptr = 0x3000455e0000, save_allocptr = 0x3000455db200, 
> save_allocbase = 0x3000455c0000, reset_completion = 0x0, activate = 0x0,
> suspend_count = 0x0, suspend_context = 0x0, pending_exception_context = 0x0, 
> suspend = 0x0, resume = 0x0, flags = 0x0, gc_context = 0x0, 
> termination_semaphore = 0x0, unwinding = 0x0, tlb_limit = 0x0, tlb_pointer = 
> 0x0, shutdown_count = 0x0, next_tsp = 0x2aaaade5b000, safe_ref_address = 0x0}
>
>
> To save your eyes scanning,
>
> resume = 0x0
>
>
>> 
>> 
>> That address is the value of the "tcr" argument to "resume_tcr()" in
>> frame #7 in the backtrace below, so if you no longer have the
>> debugging session and have to reproduce the problem, we want to see
>> what the value of the "tcr" argument to resume_tcr() was at the
>> point where resume_tcr() called sem_post() and crashed.
>> 
>> The gdb command above means "print, in hex, the contents of
>> what this address points to, interpreting that address as
>> being of type 'pointer to TCR'" (where a TCR is a "Thread Context
>> Record" that contains several interesting fields.)
>> 
>> 'resume_tcr()' basically does 'sem_post(tcr->resume)', and a crash
>> would make sense if tcr->resume was NULL.  If it was, then one of
>> the threads that's doing sem_timedwait() on its 'resume' semaphore
>> would presumably be waiting on a NULL semaphore, and that doesn't
>> make sense.
>