[Openmcl-devel] Random crashing
Osei Poku
osei.poku at gmail.com
Mon Aug 18 11:42:05 PDT 2008
Just a quick report... I updated to r10465M-RC1 and have had no
crashes yet. So I'm keeping my fingers crossed :)
Something else strange happened (the same day I updated): it was
not in the debugger, but I could not evaluate any forms in either
emacs/slime or the plain tty repl. It hasn't happened again since then,
so I think I probably screwed something up.
Anyhow, thanks for all the help tracking down this issue and improving
the situation. I was this ( || ) close to ponying up a few thousand
bucks for LW64 :)
Osei
On Aug 10, 2008, at 4:14 PM, Gary Byers wrote:
> I've said this (and been wrong) a few times already, but I think that
> I (partly) fixed this in svn a few days ago. (Or at least fixed the
> part that led to the crash.)
>
> Some things that try to examine the status of a process
> (PROCESS-WHOSTATE) do so by briefly suspending and resuming the
> process. Unfortunately, the code that does this doesn't reliably
> ensure that the thread hasn't exited before we try to suspend it,
> and trying to (unconditionally) resume a thread that exited before
> it was suspended can wind up trying to signal a NULL semaphore
> (which is the symptom that Osei is seeing.)
>
> That's sort of a perfect storm of everything that could go wrong
> going wrong at the same time. I'm not 100% sure that PROCESS-WHOSTATE
> is the culprit; there's at least one other thing
> (SYMBOL-VALUE-IN-PROCESS) that does similar things and has similar
> race conditions that it doesn't handle.
>
> Whatever the culprit(s) is or are, there are ways to reach the C
> function 'resume_tcr()' in the lisp kernel, and that function can
> afford to check to see if the semaphore that it's going to signal
> is NULL before blindly signaling it. (Not checking - on Linux,
> at least - leads to the crash that Osei's seeing.)
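
A minimal sketch of the kind of check described above, not the actual kernel code. The `thread_record` type, its `resume_sem` field, and the `resume_checked()` function are hypothetical stand-ins, and it assumes the semaphore is a POSIX `sem_t` signaled with `sem_post()`:

```c
#include <assert.h>
#include <semaphore.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's per-thread record;
   this is NOT the real TCR layout. */
typedef struct {
    sem_t *resume_sem;   /* assumed to be NULL once the thread has exited */
} thread_record;

/* Signal the resume semaphore only if there is one.
   Returns 1 if the semaphore was signaled, 0 if there was nothing to signal
   (or if sem_post itself failed). */
static int resume_checked(thread_record *rec)
{
    assert(rec != NULL);

    sem_t *s = rec->resume_sem;
    if (s == NULL) {
        /* The thread is already gone: skip the signal instead of
           handing a NULL pointer to sem_post(). */
        return 0;
    }
    return sem_post(s) == 0 ? 1 : 0;
}
```

The check only avoids the crash; it does not close the window in which the thread can exit between being examined and being suspended.
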
>
> If you do:
>
> ? (process-run-function "do nothing" (lambda ()))
>
> in the listener, you'll probably see the result print as something
> like:
>
> #<PROCESS do nothing(9) [Exhausted] #x1058B2ACC>
>
> which basically means that there's no underlying OS-level thread
> associated with the process anymore (the process's initial function
> exited by the time the PRINT-OBJECT method was called to print the
> result in the REPL.)
>
> Depending on the whims of the scheduler, there's a small chance
> that the process could print with a WHOSTATE of "Active" (if the
> function was a little less trivial or if the thread didn't get
> scheduled before the listener thread tried to determine its state.)
>
> I think that there's an even smaller chance that between the time
> that PROCESS-WHOSTATE checks for the "exhausted" case and the
> time that it does the suspend/resume the process could basically
> become "exhausted" (the underlying thread could exit), and resuming
> a thread that's exited can cause a NULL semaphore to be raised.
> Code that creates and prints a lot of short-lived threads could
> run into that timing screw, as could other things that
> suspend/resume threads sloppily (SYMBOL-VALUE-IN-PROCESS, :PROC, etc.)
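
One way to picture closing that window, as a sketch under stated assumptions rather than what the svn fix does: do the "exhausted" check and the suspend inside a single critical section that the exiting thread also has to enter. The `process_record` type, its fields, and all three functions are hypothetical, and `suspend_count` merely stands in for the real suspend/resume machinery:

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical process record; field names are illustrative only. */
typedef struct {
    pthread_mutex_t lock;   /* guards 'exhausted' and 'suspend_count' */
    bool exhausted;         /* set by the thread itself just before it exits */
    int suspend_count;      /* stand-in for the real suspend/resume machinery */
} process_record;

/* Called by the exiting thread as its last bookkeeping step. */
static void mark_exhausted(process_record *p)
{
    pthread_mutex_lock(&p->lock);
    p->exhausted = true;
    pthread_mutex_unlock(&p->lock);
}

/* Check liveness and suspend in one critical section, so the thread
   cannot become exhausted between the check and the suspend.
   Returns true only if the process was actually suspended. */
static bool suspend_if_alive(process_record *p)
{
    assert(p != NULL);

    bool suspended = false;
    pthread_mutex_lock(&p->lock);
    if (!p->exhausted) {
        p->suspend_count++;
        suspended = true;
    }
    pthread_mutex_unlock(&p->lock);
    return suspended;
}

/* Resume conditionally: undo only a suspend that really happened. */
static void resume_if_suspended(process_record *p, bool was_suspended)
{
    if (!was_suspended) {
        return;
    }
    pthread_mutex_lock(&p->lock);
    p->suspend_count--;
    pthread_mutex_unlock(&p->lock);
}
```

A caller would keep the result of `suspend_if_alive()` and pass it to `resume_if_suspended()`, so an already-exited process is never resumed.
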
>
> The NULL semaphore problem should be fixed in SVN; there are a few
> other bits of sloppiness there that need some more work. I've never
> seen this happen (and the PROCESS-WHOSTATE/SYMBOL-VALUE-IN-PROCESS
> idea is partly a guess), but someone else reported the same crash
> (the NULL semaphore) a few days ago. It might be a little sensitive
> to CPU speed/number of cores/scheduler details, but I believe that
> this could happen without a hardware problem being involved.
>
>
>
> On Sun, 10 Aug 2008, Wade Humeniuk wrote:
>
>> Maybe a hardware problem with your computer? Could be faulty
>> RAM/Processor/Motherboard... You said this problem is happening on
>> a particular machine. Perhaps running some diagnostics might show
>> up something (though I have no suggestions what that diagnostic
>> program might be.)
>>
>> Wade
>>
>> On Wed, Aug 6, 2008 at 10:59 AM, Osei Poku <osei.poku at gmail.com> wrote:
>>> This thing is not going away....
>>> lisp debugger and gdb session below...
>>>
>>> ==== lisp debugger session ========================================
>>>
>>> ? exception in foreign context
>>> Exception occurred while executing foreign code
>>> ? for help
>>> [17455] OpenMCL kernel debugger: ?
>>> (G) Set specified GPR to new value
>>> (R) Show raw GPR/SPR register values
>>> (L) Show Lisp values of tagged registers
>>> (F) Show FPU registers
>>> (S) Find and describe symbol matching specified name
>>> (B) Show backtrace
>>> (T) Show info about current thread
>>> (X) Exit from this debugger, asserting that any exception was handled
>>> (K) Kill OpenMCL process
>>> (?) Show this help
>>> [17455] OpenMCL kernel debugger: R
>>> %rax = 0x0000000000000000 %r8 = 0x0000000000000000
>>> %rcx = 0x0000000000000000 %r9 = 0x000000004072B7D0
>>> %rdx = 0x0000000000000001 %r10 = 0x0000000000000008
>>> %rbx = 0x00000000410BB7D0 %r11 = 0x0000000000000246
>>> %rsp = 0x000000004072A218 %r12 = 0x000000004072B7D0
>>> %rbp = 0x000000004072A6F0 %r13 = 0x000000004072A718
>>> %rsi = 0x0000000000000001 %r14 = 0x0000000000000004
>>> %rdi = 0x0000000000000000 %r15 = 0x000000004072AAA0
>>> %rip = 0x00002B37EEFB3325 %rflags = 0x0000000000010246
>>> [17455] OpenMCL kernel debugger: B
>>>
>>> Framepointer [#x4072A6F0] in unknown area.
>>> [17455] OpenMCL kernel debugger: T
>>> Current Thread Context Record (tcr) = 0x4072b7d0
>>> Control (C) stack area: low = 0x404d8000, high = 0x4072c000
>>> Value (lisp) stack area: low = 0x2aaaab0f1000, high = 0x2aaaab302000
>>> Exception stack pointer = 0x4072a218
>>> [17455] OpenMCL kernel debugger: X
>>>
>>
>> <Rest Deleted>
>>
>>