[Openmcl-devel] find reason for crashes and prevent mcl from going into kernel debugger?

Thu Jan 12 09:24:33 PST 2006

On Thu, 12 Jan 2006, rs wrote:

> Hi,
> sometimes i see openmcl 1.0 crash during gc (osx 10.4.3, G4/2x500
> MHz, 1GB RAM 600GB RAID):
>
> ...Unhandled exception 11 at 0xfd30, context->regs at #xf0600a38
> Write operation to unmapped address 0x82087000
> In foreign code at address 0x0000fd30
>
> The code ran fine in previous releases (14.2 and 14.4) (example of
> stacktrace appended)
> The App is a simple backup-server which runs 24/7 serving up to 12
> clients. During backup it uses several hundreds of MB of RAM because the
> clients ask the server about the files already on the server and the
> server returns a list of filedescriptions.
>
> questions:
> 1.) how to find the problem (maybe from backtrace)
>       (does "In foreign code at address 0x0000fd30" mean that the
> error is in a lib using dangling ptrs?)

It seems to be dying in the GC (which is foreign code ...), and I'd
guess that address 0x0000fd30 is somewhere in the GC.  (Some nearby
addresses show up in the backtrace below as being in/near the function
"gc()".)

The backtrace seems to suggest that lisp code was doing some pathname-
related consing when the GC was triggered.  That doesn't tell us too
much (pathname operations do cons a lot.)

Debugging the GC tends to be hard.  It's -possible- that it's a GC
bug per se; it's also possible that something has caused some sort
of memory corruption.  A classic (if not entirely plausible) example
of something that could cause memory corruption is:

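(defvar *a* (make-array 10))   ; valid indices are 0 through 9

(defun foo (a i val)
   ;; (safety 0) lets the compiler omit the bounds check on AREF.
   (declare (optimize (speed 3) (safety 0)))
   (setf (aref a i) val))

(foo *a* 11 nil)               ; stores past the end of *A*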

(Depending on when and how *A* was allocated, this might clobber
some other object near it in memory.  That might or might not
be important.)
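
For contrast, here is a rough sketch of the same store compiled as safe
code, where an out-of-range index is expected to signal an error instead
of writing outside the array.  FOO-CHECKED and TRY-STORE are names made
up for this illustration, and the exact condition signalled depends on
the implementation, so the handler just catches ERROR.

```lisp
(defvar *a* (make-array 10))

(defun foo-checked (a i val)
  ;; Same store as FOO above, but compiled as safe code so that
  ;; AREF is expected to check the index against the array bounds.
  (declare (optimize (safety 3)))
  (setf (aref a i) val))

(defun try-store (a i val)
  ;; Hypothetical helper: returns :STORED if the store succeeded,
  ;; or :REJECTED if it signalled an error.
  (handler-case (progn (foo-checked a i val) :stored)
    (error () :rejected)))

(try-store *a* 3 nil)    ; index within bounds
(try-store *a* 11 nil)   ; index past the end of a 10-element array
```

If the crashes go away when suspect functions are recompiled without
(safety 0), an unchecked store like the one in FOO is a plausible culprit.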

There are other (possibly more common) ways of fouling things up.

The only way that I know of to debug crashes like this involves trying
to persuade the GC to do extra integrity/consistency checking before
and after it runs.  (The GC itself blindly trusts memory to be in a
sane, consistent state.)  If a consistency check discovers an
inconsistency, it'll break into the kernel debugger with a description
of the problem.  That can still be hard to debug, but it helps: an
inconsistency is sometimes relatively harmless when it's first
introduced and only causes a crash after the GC has moved things
around several times, so catching it early narrows down the culprit.

> 2.) Since i *have* to solve the problem: Is it possible to tell mcl
> to exit instead of going into the kernel debugger? I could make an
> appropriate entry for launchd to start it up again when it quits.

If the kernel debugger would be entered when the --batch command-line
arg is in effect, the lisp is supposed to try to kill itself.  I believe
that this works pretty reliably; there -could- be some issues that I'm
not thinking of that could make it hard for the lisp to die in the
middle of the GC (if some suspended thread holds a lock and if some code
invoked by #_exit needs that lock.)  It tries to kill itself by sending
its own PID a SIGKILL, and that -should- be fatal ...
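
The lisp kernel does this itself in foreign code, but the same idea can
be sketched at the lisp level with OpenMCL's #_ and #$ reader macros
(assuming the standard libc interface database is loaded).
KILL-THIS-LISP is a made-up name for this illustration, not something
the kernel or CCL package provides.

```lisp
;; OpenMCL-specific sketch: #_ calls a foreign function and #$ reads a
;; foreign constant from the interface database.
(defun kill-this-lisp ()
  ;; Hypothetical helper.  SIGKILL can't be caught, blocked or ignored,
  ;; so no handlers or cleanup code run and no lock is needed on the
  ;; way out, unlike a call through #_exit.
  (#_kill (#_getpid) #$SIGKILL))
```

A process killed this way ends on a signal rather than returning an
exit status, which is something to keep in mind for whatever restarts it.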