[Openmcl-devel] Another linux86-32: signed doubleword parameters.

Mon Oct 13 09:59:38 PDT 2008

On Mon, 13 Oct 2008, David Brown wrote:

> On Mon, Oct 13, 2008 at 06:20:04AM -0600, Gary Byers wrote:
>
>> When sockets are involved, are threads also involved ?  (Threads
>> generally offer more ways for things to go wrong; sockets are mostly
>> just streams and are probably no more likely to scribble over memory
>> than other streams are.)
>
> Yes, I was about to post another message mentioning that this was also
> when threads got involved in the picture.  You're right, the sockets
> probably don't have much to do with it.
>
>> Whether or not the integrity checks are performed is controlled
>> by a bit in the fixnum which is the global value of the variable
>> CCL::*GC-EVENT-STATUS-BITS*; doing:
>
>> It's sometimes very hard to debug this kind of problem, and even harder to
>> explain to someone else how to do so.  It -might- be interesting to see
>> what gets reported if you run your code with integrity checking on, but
>> in practice it's probably necessary for us to either run that code or
>> run something similar that triggers the same.  If you can send us your
>> code, we can try to figure out what's going on.
>
> I've debugged plenty of GC/memory errors, so I have some idea of what
> is going on.
>
> Let me poke around a bit and see if anything is obvious.
>
> So, first crash is "Exception occurred while executing foreign code."
> 	Exception occurred while executing foreign code
> 	 at mark_root + 45

mark_root() is in the GC.  One thing that you might try is to note the
value of eip in the kernel debugger, attach gdb to the crashed lisp
process, set a breakpoint at that address, (cont)inue in gdb, and
eXit from the kernel debugger; you should wind up back in gdb at
the point where the exception occurred, and it can tell you more
about an exception in C code than the kernel debugger does.

<http://trac.clozure.com/openmcl/wiki/CclUnderGdb> has some tips
on debugging CCL with GDB.  (A more useful document would be a lot
longer.)  One thing that may or may not be necessary is to point
gdb at the kernel source directory:

(gdb) directory /path/to/ccl/lisp-kernel

mark_root() is called to (of all things ...) mark "root"s, where a
"root" is a register or stack location (or one of a few other things.)
It'd be interesting to see who called it; it's usually called from
mark_[cvt]stack_area() or from mark_xp() (though there may be an
intervening function or two.)  The various stack marking functions
obviously traverse a thread's stacks; mark_xp() looks at the registers
in an exception context.  If we have a lot of threads running and
those threads are doing lots of I/O (transitioning between running C
and lisp code), there are lots of things that have to be done exactly
right.  (We generally say that a GC can occur on any instruction
boundary; if anyone thinks that that's an exaggeration, they should
poke around in a crash like this in gdb.)

What this particular crash (in mark_root()) means is that we have
some sort of thing (in a lisp stack or register or other root-bearing
location) that looks superficially like a tagged lisp object but isn't.
mark_root+45 seems to actually be in this code, right near the start
of mark_root() (in x86-gc.c).

#ifdef X8632
   if (tag_n == fulltag_tra) {
     if (*(unsigned char *)n == RECOVER_FN_OPCODE) {
       n = *(LispObj *)(n + 1);
       tag_n = fulltag_misc;
     } else
       return;
   }
#endif

So, the low bits of the (alleged) pointer that we're looking at are =
to "fulltag_tra", which are the low bits that a "tagged return
address" would have.  We look to see if the tagged return address is
pointing at an instruction that references the containing function
(and if so, act as if the root was that function.)  If 'n' is just some
random value whose low 3 bits happen to be 5 (= fulltag_tra), we
segfault while trying to see if it's pointing at RECOVER_FN_OPCODE.
(It'd be safer but a little slower to check if N is pointing into the
lisp heap before doing this indirection, but if there are random bits
where a lisp object should be we're already doomed.)
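
As a rough illustration of that "safer but a little slower" variant, here
is a minimal sketch; it is not the kernel's code.  The names heap_start,
heap_end, points_into_heap(), resolve_tra() and demo_resolve_tra() are
hypothetical, and the values used for the opcode (0xbf) and for
fulltag_misc (6) are assumptions made for the example.  Only the 3-bit
tag and fulltag_tra = 5 come from the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uintptr_t LispObj;

enum { FULLTAG_MASK = 7, FULLTAG_TRA = 5, FULLTAG_MISC = 6 /* assumed */ };
#define RECOVER_FN_OPCODE_ASSUMED 0xbf /* assumed value, illustration only */

/* Hypothetical bounds of the lisp heap. */
static unsigned char *heap_start, *heap_end;

/* True if the len bytes starting at n lie entirely inside the heap. */
static int points_into_heap(LispObj n, size_t len) {
  uintptr_t lo = (uintptr_t)heap_start, hi = (uintptr_t)heap_end;
  return n >= lo && n < hi && (hi - n) >= len;
}

/* If n looks like a tagged return address AND points at a recover-fn
   instruction inside the heap, store the function in *fn and return 1.
   Otherwise return 0 without ever dereferencing n. */
static int resolve_tra(LispObj n, LispObj *fn) {
  if ((n & FULLTAG_MASK) != FULLTAG_TRA) return 0;
  if (!points_into_heap(n, 1 + sizeof(LispObj))) return 0;
  const unsigned char *p = (const unsigned char *)n;
  if (*p != RECOVER_FN_OPCODE_ASSUMED) return 0;
  memcpy(fn, p + 1, sizeof(LispObj)); /* immediate operand may be unaligned */
  return 1;
}

/* Small self-check: builds a fake heap, plants one valid return address,
   then tries a random value whose low 3 bits happen to be 5. */
int demo_resolve_tra(void) {
  static unsigned char fake_heap[64];
  LispObj fn_in = 0x1236, fn_out = 0;
  size_t i = 0;

  heap_start = fake_heap;
  heap_end = fake_heap + sizeof fake_heap;
  /* Find an address in the fake heap whose low 3 bits are 5. */
  while ((((uintptr_t)(fake_heap + i)) & FULLTAG_MASK) != FULLTAG_TRA) i++;
  fake_heap[i] = RECOVER_FN_OPCODE_ASSUMED;
  memcpy(fake_heap + i + 1, &fn_in, sizeof fn_in);

  int accepted = resolve_tra((LispObj)(fake_heap + i), &fn_out) && fn_out == fn_in;
  int rejected = !resolve_tra((LispObj)5, &fn_out); /* bogus root: no segfault */
  return accepted && rejected;
}
```

The bounds test costs a couple of comparisons per root.  As noted, it
would only turn the segfault into a silent skip; the bogus value would
still be sitting in a root location.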

If this is repeatable, it's really good news.

>
> My code does make fairly heavy usage of 'with-pointer-to-ivector'.
> Perhaps %log-gc-lock (without-gcing) isn't working, and another thread
> is starting a garbage collection while the other CPU is running the
> foreign code.

Maybe.  If foreign code somehow caches a pointer (that happens to
be pointing into the lisp heap) and writes to that pointer after
WITH-POINTER-TO-IVECTOR exits (and after the GC has moved the
vector that the pointer was pointing into ...), that's another way to
lose.
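
To make that failure mode concrete, here is a hedged C sketch contrasting
a foreign function that stashes the caller's pointer with one that copies
what it needs before returning.  Every name in it (bad_begin, good_begin,
good_checksum, demo_copy_survives_move) is hypothetical; none of this is
zlib's or the lisp kernel's API, and the "move" is simulated by simply
overwriting the original buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* RISKY: remembers the caller's buffer.  If the caller's vector is moved
   after the call returns, any later read or write through cached_buf
   touches memory that now belongs to something else. */
unsigned char *cached_buf;
void bad_begin(unsigned char *buf) { cached_buf = buf; }

/* SAFER: keep a private copy, so nothing points into the caller's
   memory once the call has returned. */
#define SCRATCH_MAX 256
static unsigned char scratch[SCRATCH_MAX];
static size_t scratch_len;

int good_begin(const unsigned char *buf, size_t len) {
  if (len > SCRATCH_MAX) return -1; /* refuse rather than truncate */
  memcpy(scratch, buf, len);
  scratch_len = len;
  return 0;
}

/* Later work uses only the private copy. */
unsigned good_checksum(void) {
  unsigned sum = 0;
  for (size_t i = 0; i < scratch_len; i++) sum += scratch[i];
  return sum;
}

/* Simulates the vector being moved and its old storage being reused. */
unsigned demo_copy_survives_move(void) {
  unsigned char vec[3] = {1, 2, 3};
  if (good_begin(vec, sizeof vec) != 0) return 0;
  memset(vec, 0xff, sizeof vec); /* old location now holds other data */
  return good_checksum();        /* still sums the original contents */
}
```

Whether copying is acceptable depends on how much data is involved; the
point is only that a pointer obtained inside the form should not outlive
it.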

>
> Basically, a good portion of the CPU time of the application is spent
> inside of zlib with pointers to lisp heap data.  At some point, I
> might change this to use foreign pointers, but then I have to worry
> about space leaks and such.
>
> I do also have multiple CPUs.
>
> Thanks,
> David
>
>