[Openmcl-devel] Random crashing

Mon Aug 18 11:42:05 PDT 2008

Just a quick report...  I updated to r10465M-RC1 and have had no  
crashes yet.   So I'm keeping my fingers crossed :)

Something else strange happened the same day I updated: the lisp was
not in the debugger, but I could not evaluate any forms in either
emacs/slime or the plain tty repl.  It hasn't happened again since then,
so I think I probably screwed something up.

Anyhow, thanks for all the help tracking down this issue and improving  
the situation.  I was this ( || ) close to ponying up a few thousand  
bucks for LW64 :)

Osei

On Aug 10, 2008, at 4:14 PM, Gary Byers wrote:

> I've said this (and been wrong) a few times already, but I think that
> I (partly) fixed this in svn a few days ago.  (Or at least fixed the
> part that led to the crash.)
>
> Some things that try to examine the status of a process
> (PROCESS-WHOSTATE) do so by briefly suspending and resuming the
> process.  Unfortunately, the code that does this doesn't reliably
> ensure that the thread hasn't exited before we try to suspend it, and
> trying to (unconditionally) resume a thread that exited before it was
> suspended can wind up trying to signal a NULL semaphore (which is the
> symptom that Osei is seeing.)
>
> That's sort of a perfect storm of everything that could go wrong
> going wrong at the same time.  I'm not 100% sure that PROCESS-WHOSTATE
> is the culprit; there's at least one other thing
> (SYMBOL-VALUE-IN-PROCESS) that does similar things and has similar
> race conditions that it doesn't handle.
>
> Whatever the culprit(s) is or are, there are ways to reach the C
> function 'resume_tcr()' in the lisp kernel, and that function can
> afford to check to see if the semaphore that it's going to signal
> is NULL before blindly signaling it.  (Not checking - on Linux,
> at least - leads to the crash that Osei's seeing.)
>
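
A minimal sketch of the kind of guard described above, for anyone
following along.  This is not the actual kernel code: toy_semaphore,
toy_tcr and toy_resume_tcr are hypothetical stand-ins, and the only
assumption taken from the message is that the semaphore pointer can be
NULL once the thread has exited.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a counting semaphore.  A real kernel would
   use the platform's semaphore type instead. */
typedef struct {
    int count;
} toy_semaphore;

/* Hypothetical stand-in for a thread context record.  Assumption: the
   semaphore pointer is NULL once the underlying thread has exited. */
typedef struct {
    toy_semaphore *resume;
} toy_tcr;

/* Signaling dereferences the pointer, so a NULL argument would crash. */
static void toy_signal(toy_semaphore *s)
{
    s->count++;
}

/* Resume only if there is something to signal.  Returns true if the
   semaphore was signaled, false if the thread had nothing left to
   resume. */
bool toy_resume_tcr(toy_tcr *tcr)
{
    if (tcr == NULL || tcr->resume == NULL) {
        return false;   /* thread already exited: nothing to signal */
    }
    toy_signal(tcr->resume);
    return true;
}
```

A caller can treat a false return as "the thread is already gone"
instead of taking a fault in foreign code.
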
> If you do:
>
> ? (process-run-function "do nothing" (lambda ()))
>
> in the listener, you'll probably see the result print as something
> like:
>
> #<PROCESS do nothing(9) [Exhausted] #x1058B2ACC>
>
> which basically means that there's no underlying OS-level thread
> associated with the process anymore (the process's initial function
> exited by the time the PRINT-OBJECT method was called to print the
> result in the REPL.)
>
> Depending on the whims of the scheduler, there's a small chance
> that the process could print with a WHOSTATE of "Active" (if the
> function was a little less trivial or if the thread didn't get
> scheduled before the listener thread tried to determine its state.)
>
> I think that there's an even smaller chance that between the time
> that PROCESS-WHOSTATE checks for the "exhausted" case and the
> time that it does the suspend/resume the process could basically
> become "exhausted" (the underlying thread could exit), and resuming
> a thread that's exited can wind up signaling a NULL semaphore.
> Code that creates and prints a lot of short-lived threads could
> run into that timing screw, as could other things that suspend/
> resume threads sloppily (SYMBOL-VALUE-IN-PROCESS, :PROC, etc.)
>
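
The other half of the problem described above is the unconditional
resume.  A rough sketch of the alternative, where the suspend step
reports whether it actually suspended anything and the resume happens
only in that case.  All names here (toy_thread, toy_suspend,
toy_resume_thread, toy_whostate) are made up for illustration, and the
sketch is single-threaded: in a real kernel the "exited" check and the
suspend would have to happen atomically (e.g. under a lock), which this
does not show.

```c
#include <stdbool.h>

/* Hypothetical thread states, loosely modeled on the "Active" and
   "Exhausted" whostates mentioned in the message. */
enum toy_state { TOY_EXHAUSTED, TOY_ACTIVE };

/* Hypothetical model of a thread as seen by an observer. */
typedef struct {
    bool exited;          /* true once the underlying thread is gone */
    int suspend_count;    /* number of outstanding suspends */
} toy_thread;

/* Try to suspend.  Returns false (and changes nothing) if the thread
   has already exited. */
bool toy_suspend(toy_thread *t)
{
    if (t->exited) {
        return false;
    }
    t->suspend_count++;
    return true;
}

/* Undo one successful suspend. */
void toy_resume_thread(toy_thread *t)
{
    if (t->suspend_count > 0) {
        t->suspend_count--;
    }
}

/* Examine the thread's state.  The resume is conditional on the
   suspend having succeeded, so an exited thread is never resumed. */
enum toy_state toy_whostate(toy_thread *t)
{
    enum toy_state state = TOY_EXHAUSTED;
    if (toy_suspend(t)) {
        state = TOY_ACTIVE;      /* safe to inspect while suspended */
        toy_resume_thread(t);
    }
    return state;
}
```
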
> The NULL semaphore problem should be fixed in SVN; there are a few
> other bits of sloppiness there that need some more work.  I've never
> seen this happen (and the PROCESS-WHOSTATE/SYMBOL-VALUE-IN-PROCESS
> idea is partly a guess), but someone else reported the same crash
> (the NULL semaphore) a few days ago.  It might be a little sensitive
> to CPU speed/number of cores/scheduler details, but I believe that
> this could happen without a hardware problem being involved.
>
>
>
> On Sun, 10 Aug 2008, Wade Humeniuk wrote:
>
>> Maybe a hardware problem with your computer?  Could
>> be faulty RAM/Processor/Motherboard.....  You said this problem is
>> happening on a particular machine.  Perhaps running some diagnostics
>> might show up something (though I have no suggestions what that
>> diagnostic program might be.)
>>
>> Wade
>>
>> On Wed, Aug 6, 2008 at 10:59 AM, Osei Poku <osei.poku at gmail.com> wrote:
>>> This thing is not going away....
>>> lisp debugger and gdb session below...
>>>
>>> ====lisp debugger session===========================================
>>>
>>> ? exception in foreign context
>>> Exception occurred while executing foreign code
>>> ? for help
>>> [17455] OpenMCL kernel debugger: ?
>>> (G)  Set specified GPR to new value
>>> (R)  Show raw GPR/SPR register values
>>> (L)  Show Lisp values of tagged registers
>>> (F)  Show FPU registers
>>> (S)  Find and describe symbol matching specified name
>>> (B)  Show backtrace
>>> (T)  Show info about current thread
>>> (X)  Exit from this debugger, asserting that any exception was handled
>>> (K)  Kill OpenMCL process
>>> (?)  Show this help
>>> [17455] OpenMCL kernel debugger: R
>>> %rax = 0x0000000000000000      %r8  = 0x0000000000000000
>>> %rcx = 0x0000000000000000      %r9  = 0x000000004072B7D0
>>> %rdx = 0x0000000000000001      %r10 = 0x0000000000000008
>>> %rbx = 0x00000000410BB7D0      %r11 = 0x0000000000000246
>>> %rsp = 0x000000004072A218      %r12 = 0x000000004072B7D0
>>> %rbp = 0x000000004072A6F0      %r13 = 0x000000004072A718
>>> %rsi = 0x0000000000000001      %r14 = 0x0000000000000004
>>> %rdi = 0x0000000000000000      %r15 = 0x000000004072AAA0
>>> %rip = 0x00002B37EEFB3325   %rflags = 0x0000000000010246
>>> [17455] OpenMCL kernel debugger: B
>>>
>>> Framepointer [#x4072A6F0] in unknown area.
>>> [17455] OpenMCL kernel debugger: T
>>> Current Thread Context Record (tcr) = 0x4072b7d0
>>> Control (C) stack area:  low = 0x404d8000, high = 0x4072c000
>>> Value (lisp) stack area: low = 0x2aaaab0f1000, high = 0x2aaaab302000
>>> Exception stack pointer = 0x4072a218
>>> [17455] OpenMCL kernel debugger: X
>>>
>>
>> <Rest Deleted>
>>
>>