[Openmcl-devel] Random crashing

Sun Aug 10 13:14:43 PDT 2008

I've said this (and been wrong) a few times already, but I think that
I (partly) fixed this in svn a few days ago.  (Or at least fixed the
part that led to the crash.)

Some things that try to examine the status of a process (PROCESS-WHOSTATE)
do so by briefly suspending and resuming the process.  Unfortunately,
the code that does this doesn't reliably ensure that the thread
hasn't exited before we try to suspend it, and trying to (unconditionally)
resume a thread that exited before it was suspended can wind up trying
to signal a NULL semaphore (which is the symptom that Osei is seeing.)

That's sort of a perfect storm of everything that could go wrong
going wrong at the same time.  I'm not 100% sure that PROCESS-WHOSTATE
is the culprit; there's at least one other thing (SYMBOL-VALUE-IN-PROCESS)
that does similar things and has similar race conditions that it doesn't
handle.

Whatever the culprit(s) is or are, there are ways to reach the C
function 'resume_tcr()' in the lisp kernel, and that function can
afford to check to see if the semaphore that it's going to signal
is NULL before blindly signaling it.  (Not checking - on Linux,
at least - leads to the crash that Osei's seeing.)
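
To make the NULL check concrete, here's a minimal sketch in C.  It is
not the kernel's actual resume_tcr(): the struct, its field name, and
the function name are made up for illustration, and it assumes POSIX
semaphores and that the semaphore pointer is NULL once the thread has
exited.

```c
#include <assert.h>
#include <semaphore.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a thread context record.  The real TCR
   has many more fields; this name and layout are assumptions. */
typedef struct sketch_tcr {
    sem_t *resume;  /* assumed to be NULL once the thread has exited */
} sketch_tcr;

/* Signal the thread's resume semaphore only if there is one.
   Returns true if the semaphore was signaled, false if there was
   nothing to signal (the thread is already gone) or sem_post failed. */
bool sketch_resume(sketch_tcr *t)
{
    assert(t != NULL);
    if (t->resume == NULL) {
        return false;   /* thread exited; don't touch the semaphore */
    }
    return sem_post(t->resume) == 0;
}
```

A caller that raced with thread exit just gets 'false' back instead
of passing a NULL pointer to sem_post().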

If you do:

? (process-run-function "do nothing" (lambda ()))

in the listener, you'll probably see the result print as something
like:

#<PROCESS do nothing(9) [Exhausted] #x1058B2ACC>

which basically means that there's no underlying OS-level thread
associated with the process anymore (the process's initial function
exited by the time the PRINT-OBJECT method was called to print the
result in the REPL).

Depending on the whims of the scheduler, there's a small chance
that the process could print with a WHOSTATE of "Active" (if the
function was a little less trivial or if the thread didn't get
scheduled before the listener thread tried to determine its state.)

I think that there's an even smaller chance that between the time
that PROCESS-WHOSTATE checks for the "exhausted" case and the
time that it does the suspend/resume the process could basically
become "exhausted" (the underlying thread could exit), and resuming
a thread that's exited is what led to a NULL semaphore being signaled.
Code that creates and prints a lot of short-lived threads could
run into that timing screw, as could other things that suspend/
resume threads sloppily (SYMBOL-VALUE-IN-PROCESS, :PROC, etc.)
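
One general way to close that kind of check-then-act window is to do
the "has it exited?" check and the suspend under the same lock that
the exiting thread takes, and to resume only if the suspend actually
succeeded.  The sketch below shows that pattern in C with pthreads.
It is only an illustration, not what the lisp kernel or the fix in
svn does, and every name in it is hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical per-process bookkeeping; not the real data structure. */
typedef struct sketch_proc {
    pthread_mutex_t lock;   /* guards the two fields below */
    bool exhausted;         /* set by the thread as it exits */
    int suspend_count;      /* outstanding suspends */
} sketch_proc;

/* Called by the thread itself on its way out. */
void sketch_mark_exhausted(sketch_proc *p)
{
    pthread_mutex_lock(&p->lock);
    p->exhausted = true;
    pthread_mutex_unlock(&p->lock);
}

/* Check and suspend in one critical section, so the thread can't
   become exhausted between the check and the suspend.  Returns
   false (and suspends nothing) if the thread has already exited. */
bool sketch_try_suspend(sketch_proc *p)
{
    bool ok;
    pthread_mutex_lock(&p->lock);
    ok = !p->exhausted;
    if (ok) {
        p->suspend_count++;
    }
    pthread_mutex_unlock(&p->lock);
    return ok;
}

/* Undo one successful suspend.  Callers should only call this when
   sketch_try_suspend() returned true. */
void sketch_resume(sketch_proc *p)
{
    pthread_mutex_lock(&p->lock);
    assert(p->suspend_count > 0);
    p->suspend_count--;
    pthread_mutex_unlock(&p->lock);
}
```

Something that wants to peek at a process's state would then call
sketch_try_suspend(), look at whatever it needs only if that returned
true, and call sketch_resume() only in that case.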

The NULL semaphore problem should be fixed in SVN; there are a few
other bits of sloppiness there that need some more work.  I've never
seen this happen (and the PROCESS-WHOSTATE/SYMBOL-VALUE-IN-PROCESS
idea is partly a guess), but someone else reported the same crash
(the NULL semaphore) a few days ago.  It might be a little sensitive
to CPU speed/number of cores/scheduler details, but I believe that
this could happen without a hardware problem being involved.

On Sun, 10 Aug 2008, Wade Humeniuk wrote:

> Maybe a hardware problem with your computer?  Could
> be faulty RAM/Processor/Motherboard.....  You said this problem is
> happening on a
> particular machine.  Perhaps running some diagnostics might show up something
> (though I have no suggestions what that diagnostic program might be.)
>
> Wade
>
> On Wed, Aug 6, 2008 at 10:59 AM, Osei Poku <osei.poku at gmail.com> wrote:
>> This thing is not going away....
>> lisp debugger and gdb session below...
>>
>> ====lisp debugger session====
>>
>> ? exception in foreign context
>> Exception occurred while executing foreign code
>> ? for help
>> [17455] OpenMCL kernel debugger: ?
>> (G)  Set specified GPR to new value
>> (R)  Show raw GPR/SPR register values
>> (L)  Show Lisp values of tagged registers
>> (F)  Show FPU registers
>> (S)  Find and describe symbol matching specified name
>> (B)  Show backtrace
>> (T)  Show info about current thread
>> (X)  Exit from this debugger, asserting that any exception was handled
>> (K)  Kill OpenMCL process
>> (?)  Show this help
>> [17455] OpenMCL kernel debugger: R
>> %rax = 0x0000000000000000      %r8  = 0x0000000000000000
>> %rcx = 0x0000000000000000      %r9  = 0x000000004072B7D0
>> %rdx = 0x0000000000000001      %r10 = 0x0000000000000008
>> %rbx = 0x00000000410BB7D0      %r11 = 0x0000000000000246
>> %rsp = 0x000000004072A218      %r12 = 0x000000004072B7D0
>> %rbp = 0x000000004072A6F0      %r13 = 0x000000004072A718
>> %rsi = 0x0000000000000001      %r14 = 0x0000000000000004
>> %rdi = 0x0000000000000000      %r15 = 0x000000004072AAA0
>> %rip = 0x00002B37EEFB3325   %rflags = 0x0000000000010246
>> [17455] OpenMCL kernel debugger: B
>>
>> Framepointer [#x4072A6F0] in unknown area.
>> [17455] OpenMCL kernel debugger: T
>> Current Thread Context Record (tcr) = 0x4072b7d0
>> Control (C) stack area:  low = 0x404d8000, high = 0x4072c000
>> Value (lisp) stack area: low = 0x2aaaab0f1000, high = 0x2aaaab302000
>> Exception stack pointer = 0x4072a218
>> [17455] OpenMCL kernel debugger: X
>>
>
> <Rest Deleted>
>
>