[Openmcl-devel] crash without report

Mon Nov 10 12:57:52 PST 2008

If you're asking "what should have been logged somewhere but wasn't?",
I don't know.  (That's kind of like a Zen koan, only instead of
achieving enlightenment by contemplating it you wind up with a bad
headache.)

If lisp code does something that results in an illegal memory reference,
the lisp kernel catches the resulting exception and signals a lisp error.

? (%get-byte (%null-ptr))
> Error: Fault during read of memory address #x0
> While executing: %GET-BYTE, in process Listener(5).
> Type :POP to abort, :R for a list of available restarts.
> Type :? for other options.
1 >

For a simple case like this, we can slap ourselves in the forehead
(figuratively ...) and remind ourselves not to dereference obviously
null pointers.  Even in more realistic cases, it may be easy to figure
out what caused the memory fault and convince ourselves that the
damage was localized.  (In general, it's possible to scribble randomly
over memory for a while before we try to write to an address that'll
cause a fault, so if we don't understand what caused a memory fault
like this we should view the lisp session with suspicion: if something's
doing incorrect memory accesses, it might have overwritten something
important before writing to an address that caused a fault.)  From
the lisp kernel's point of view, trying to report this as a lisp error
is "worth a try", and it often works well in practice.

If foreign (C) code does an invalid memory access, it's much harder to
know how to recover from that: we don't know what state that foreign
code may have changed and we don't know what the consequences of
signaling a lisp error in the middle of some unknown foreign code
might be.  (E.g., if we get a fault in the middle of #_malloc or
something similar, trying to signal a lisp error at that point might
just lead to a lot of secondary problems and not get very far.)

When any kind of unhandled exception (memory fault or other) happens
in foreign code, the lisp enters its kernel debugger.  It's not much
of a debugger, and what there is of it is oriented towards printing
lisp objects (with varying degrees of success ...) and lisp
backtraces.  There's a little information in the Wiki about debugging
under GDB:

<http://trac.clozure.com/openmcl/wiki/CclUnderGdb>

but it's probably fair to say that trying to figure out how/why
some foreign code crashed can be a hard problem.  (Many great
minds have spent countless hours on this problem ...)

If we're running the lisp as a non-OSX-GUI application and we
do something like:

? (ff-call (%null-ptr) :void)

we get:

Unhandled exception 10 at 0x0, context->regs at #xb029b8f0
Exception occurred while executing foreign code
? for help
[50778] OpenMCL kernel debugger:

Well, yes: we did a foreign function call to an invalid address,
and now we're pretty much stuck.  In a more realistic example -
where we were in some real foreign code and that code caused a
fault - the kernel debugger will try to print the name of a
known foreign function whose address is near the PC at the time
of the exception.

We can ask the kernel debugger to show us the values of the machine
registers (x8664 in this case):

[50778] OpenMCL kernel debugger: r
%rax = 0x0000000000000000      %r8  = 0x000000000000031a
%rcx = 0x00000000006a5a30      %r9  = 0x00000000001047f0
%rdx = 0x00000000b029bde0      %r10 = 0x00003000400090f4
%rbx = 0x0000000000104be0      %r11 = 0x0000000000000000
%rsp = 0x00000000b029bdc8      %r12 = 0x0000000000000000
%rbp = 0x00000000b029bdd0      %r13 = 0x0000000000000000
%rsi = 0x0000000000000200      %r14 = 0x000000000001300b
%rdi = 0x00000000001047c0      %r15 = 0x0000000000000200
%rip = 0x0000000000000000   %rflags = 0x00010206

which shows us that %rip (the instruction pointer/program counter) is
at address 0, and if we try to get a lisp backtrace at this point
we can see how we got here (this may or may not work in 1.2):

(#x00000000006A5A58) #x000030004000821C : #<Function %DO-FF-CALL #x00003000400081CF> + 77
(#x00000000006A5A68) #x00003000400090F4 : #<Function %FF-CALL #x00003000400082CF> + 3621
(#x00000000006A5AE0) #x00003000404C5A84 : #<Function CALL-CHECK-REGS #x00003000404C599F> + 229
(#x00000000006A5B18) #x00003000404BCA9C : #<Function TOPLEVEL-EVAL #x00003000404BC7BF> + 733
(#x00000000006A5BB8) #x00003000404BEB0C : #<Function READ-LOOP #x00003000404BE3EF> + 1821
(#x00000000006A5DD8) #x00003000404C556C : #<Function TOPLEVEL-LOOP #x00003000404C54EF> + 125

from which we -might- be able to conclude that FF-CALLing a null pointer
is a bad idea.  (This example may not convince anyone who's skeptical
of my assertion that it's hard to reliably recover from an exception
in foreign code; I honestly do think that that's a hard problem.)

The kernel debugger just writes to the (Unix) process-level standard error
descriptor and reads from the process's standard input.

An OSX GUI application's standard I/O descriptors are ordinarily
redirected: input usually comes from /dev/null (the null device, which
always returns EOF on input) and output and error (supposedly) go to a
logfile somewhere.  (On Leopard, "somewhere" seems to be
/private/tmp.)  It's probably the case that we get the EOF (reading
from /dev/null) before anything's actually flushed to that logfile
when the kernel debugger's entered from the IDE.
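
The EOF behavior is easy to see from a shell.  This is only a sketch of
what any reader of standard input (the kernel debugger included) runs
into when that input is /dev/null; the function name stdin_state is made
up for illustration and has nothing to do with CCL itself.

```shell
#!/bin/sh
# Sketch: what a reader sees when standard input is /dev/null.
# `read` returns a nonzero status at end-of-file, so the very first
# read fails and there is never a command to act on.
stdin_state() {
    if IFS= read -r line; then
        echo "got input: $line"
    else
        echo "EOF"
    fi
}

stdin_state < /dev/null        # prints "EOF"
echo "r" | stdin_state         # prints "got input: r"
```

A program that treats EOF on its input as "time to exit" will therefore
exit almost as soon as it starts reading, if its input is /dev/null.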

While waiting for someone to figure out what to do about that ...
you can run a GUI application in Terminal (or equivalent); when
it's run this way, its standard I/O file descriptors remain unchanged
(and therefore the kernel debugger works).  The general idea is
to invoke the executable program inside the .app bundle:

shell> /path/to/Clozure\ CL.app/Contents/MacOS/dx86cl64

The good news is that that'll leave standard I/O attached to the
"terminal" (or Emacs shell buffer, or ...) and it's possible to
interact with the kernel debugger (and entering the kernel debugger
won't cause the lisp to exit unless/until it gets an EOF when
reading from standard input).  The bad news is that the standard
error of a GUI application often gets filled with diagnostic
messages that are probably more meaningful to whoever wrote them
than to anyone else, and the fact that the kernel debugger
is better than nothing doesn't mean that it's a whole lot better
than nothing ...
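
If it seems worth keeping whatever lands on standard error, one possible
approach (a sketch under assumptions, not something CCL provides) is to
copy standard error to a file while leaving standard input and output
attached to the terminal, so the kernel debugger can still be driven by
hand.  The helper name run_logged and the log file name are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not part of CCL): run a command with its standard
# error copied to a log file.  Standard input and standard output are left
# alone, so an interactive program can still be used from the terminal.
# usage: run_logged LOGFILE COMMAND [ARGS...]
run_logged() {
    log=$1
    shift
    # fd 3 holds the original stdout; only stderr goes down the pipe to tee,
    # which appends it to the log and echoes it back to stderr.
    # The function's exit status is tee's, not the command's.
    { "$@" 2>&1 1>&3 3>&- | tee -a "$log" >&2; } 3>&1
}

# Example (path as in the message above; the log name is an assumption):
# run_logged ccl-stderr.log /path/to/Clozure\ CL.app/Contents/MacOS/dx86cl64
```

Because the command is the first stage of the pipeline, it inherits the
shell's standard input unchanged, which is the property that matters here.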

There are a variety of reasons why Apple's Crash Reporter doesn't
get invoked in this case (they're related to the reasons why it
sometimes gets invoked whenever some lisps get exceptions that
they routinely handle).  If it were invoked, it wouldn't be
able to make a whole lot of sense out of the lisp-specific side
of things.  (If lisp crashes generated Crash Reporter logs, I
wouldn't often find them very useful and I doubt if other people
would, either.)  Generating something somewhat like a crash
reporter log would be useful (even if that's equivalent to
having the kernel debugger invoke as many of its options as
might be useful and save the output somewhere).  Just exiting
on EOF because the EOF comes from /dev/null is probably less
useful.

In the short term, running the IDE from the terminal might be enough
to let the kernel debugger point you in the general direction of the
problem.

On Mon, 10 Nov 2008, Alexander Repenning wrote:

> this may have been discussed in some other context but I cannot find any 
> trace. Anyway, while usually pretty stable CCL 1.2 (mac) works well with 
> Cocoa in general and even reports, without crashing on some memory management 
> issues. But once in a while CCL really does crash but unfortunately without 
> creating a crashlog file. What is missing? I have
>
> COREDUMPS=-YES-
>
> in etc/hostconfig
>
> but when getting a Nov 10 11:54:15 Ristretto-to-Go-7 com.apple.launchd[67] 
> ([0x0-0x15015].com.clozure.Clozure CL[119]): Exited: Killed
>
> there is no crash.log
>
> Am I missing something?
>
> all the best,  Alex
>
>
>
> Prof. Alexander Repenning
>
> University of Colorado
> Computer Science Department
> Boulder, CO 80309-430
>
> vCard: http://www.cs.colorado.edu/~ralex/AlexanderRepenning.vcf
>
>