[Openmcl-devel] how do I debug this?

Thu Nov 11 18:32:06 PST 2004

On Thu, 11 Nov 2004, alex crain wrote:

>
> Ok, I'm in hemlock and I do something, say send an expression to the
> lisp world (M-C-x), and the cocoa interface locks up.
>
> I switch over to the tty listener, which is still responding and ask
> for a list of processes
>
> ? :proc
> 5 :    Hemlock window thread  [Active]
> 4 :    Listener     [Active]
> 3 :    Hemlock window thread  [Active]
> 2 :    housekeeping  [Active]
> 1 : -> listener     [Active]
> 0 :    Initial      [Active]
>
> I know that hemlock is doing something in the "CCL" package, so I try
> and switch over....
>
> ? (in-package "CCL")
>
> and now the tty listener is locked as well. At first I can't get
> anything to respond, but eventually (possibly because of a ^C) I get
>
> ^Cwhat's being allocated here ?
> ? for help
> [10981] OpenMCL kernel debugger:

That probably means that memory's corrupt.

The GC ran (possibly/probably to quickly collect recent garbage) and
started suspending other threads.  Some thread appeared to be running
lisp code and in the process of allocating something; the sequence of
instructions used to allocate a lisp object (a cons or uvector)
has to be treated atomically by the GC, so the GC tries to ensure
that that short critical instruction sequence will appear to have
completed.  (This practice is sometimes called "pc-lusering".)
The GC can tell reliably whether it's a cons or a uvector being allocated
(based on the value of the ppc::allocptr register's low bits); in your
case, it was neither and you dropped into the kernel debugger.

>
> I put in the stack trace below.
>
> Anyway - this is usually where I get stuck, but I'd really like to be
> able to do something with this stuff. It looks like hemlock tried to
> insert a character and cocoa either got confused and started calling
> attributesAtIndex:effectiveRange: over and over or cocoa is waiting
> forever inside of %OBJC-INSTANCE-CLASS-INDEX. I can understand the
> first scenario but why does the (IN-PACKAGE :CCL) call cause the tty
> listener to lock up?

It triggered a GC in an environment where a thread that seemed to be
running lisp code had a bogus value in its ALLOCPTR register (r9).

>
> Helpful suggestions, anyone?

I'd be interested to know which thread had the bogus ALLOCPTR.  You can
often get a strong hint of this by showing (L)isp registers and (R)egisters
in the kernel debugger.  (If there -is- some sort of memory corruption,
the kernel debugger is likely to choke trying to print lisp values, but
it's worth a try.)

The real mystery, of course, is how the bogus ALLOCPTR got that way.
It's hard to know exactly; it may be a symptom of some other type
of memory corruption.  For instance:

(defun foo (s i c)
  (declare (simple-string s) (fixnum i) (character c)
           (optimize (speed 3) (safety 0)))
  (setf (schar s i) c))

(foo "abcd" 4 #\e)

will store #\e one byte beyond the end of the string "abcd", silently
clobbering whatever happens to be in memory there.  Something bad
will likely happen eventually (possibly immediately, possibly hours
later).  A lot of bad things -could- happen before any of those bad
things caused a spurious error or crash, and figuring out what caused
what is often nearly impossible.  Often, the only consistent feature of
those symptoms is that things that can't happen start happening.

If you have unsafe code like the above (perhaps "like the above, but
more realistic") you might want to back out of the aggressive declarations
and see if the problem goes away.  (Even if it doesn't go away, we'd
at least know where the problem isn't.)
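
As a rough sketch of what "backing out" might look like for the FOO
example above: drop the (optimize (speed 3) (safety 0)) declaration and,
if you want the failure to be obvious regardless of how the surrounding
code is compiled, check the index explicitly.  FOO-CHECKED is a
hypothetical name used only for illustration, and the explicit check is
an assumption about what is useful while hunting the bug, not a
description of anything in Hemlock or the OpenMCL sources.

```lisp
;; Hypothetical checked variant of FOO: same type declarations, but no
;; (safety 0), plus an explicit bounds check so that an out-of-range
;; store signals an error instead of silently writing past the string.
(defun foo-checked (s i c)
  (declare (simple-string s) (fixnum i) (character c))
  (unless (< -1 i (length s))
    (error "Index ~D is out of bounds for a string of length ~D."
           i (length s)))
  (setf (schar s i) c))

;; Use a fresh copy rather than the literal "abcd", since modifying a
;; literal string has undefined consequences in its own right.
(handler-case (foo-checked (copy-seq "abcd") 4 #\e)
  (error (e) (format t "Caught: ~A~%" e)))
```

If the lockups stop once the unsafe code is replaced with something like
this, or an error like the one above starts showing up, that points
fairly directly at the offending call.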