[Openmcl-devel] Another linux86-32: signed doubleword parameters.

Mon Oct 13 21:26:36 PDT 2008

On Mon, 13 Oct 2008, David Brown wrote:

> On Mon, Oct 13, 2008 at 10:59:38AM -0600, Gary Byers wrote:
>
>> mark_root() is called to (of all things ...) mark "root"s, where a
>> "root" is a register or stack location (or one of a few other things.)
>> It'd be interesting to see who called it; it's usually called from
>> mark_[cvt]stack_area() or from mark_xp() (though there may be an
>> intervening function or two.)  The various stack marking functions
>> obviously traverse a thread's stacks; mark_xp() looks at the registers
>> in an exception context.  If we have a lot of threads running and
>> those threads are doing lots of I/O (transitioning between running C
>> and lisp code), there are lots of things that have to be done exactly
>> right.  (We generally say that a GC can occur on any instruction
>> boundary; if anyone thinks that that's an exaggeration, they should
>> poke around in a crash like this in gdb.)
>
> (gdb) bt
> #0  0x0805af25 in mark_root (n=2941) at ../x86-gc.c:292

Well, that's bogus: 2941 (#xb7d) is tagged as a return address,
but isn't in mapped memory.  Especially on x8632 (where it can
be awkward to maintain GC constraints on register usage), it's
usually safe to keep a very small unboxed value in a register
that's advertised as containing boxed values.  That's generally
true, but mark_root() does its indirection to map from a return
address to the containing function before checking to see if
N is even interesting.  (The same bug is present on x86-64, but
it's easier to find an unboxed register there.)
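
A rough sketch of the "check before indirecting" idea, not the actual
mark_root() code: the heap_bounds type and the helper names are
hypothetical, and the tag value 5 for return addresses on x8632 is an
assumption (it is consistent with 2941, whose low three bits are 101).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uintptr_t LispObj;

/* Assumption: on x8632 the low three bits of a tagged return address are 5. */
#define FULLTAG_MASK 7u
#define FULLTAG_TRA  5u

/* Hypothetical stand-in for whatever the kernel uses to describe the heap. */
typedef struct {
  uintptr_t low;   /* lowest address in the heap (inclusive) */
  uintptr_t high;  /* one past the highest address (exclusive) */
} heap_bounds;

static bool in_heap(const heap_bounds *h, LispObj n) {
  return n >= h->low && n < h->high;
}

static bool looks_like_tra(LispObj n) {
  return (n & FULLTAG_MASK) == FULLTAG_TRA;
}

/* Only when this returns true would it be safe to dereference n in order
   to map from the return address to its containing function. */
static bool safe_to_map_tra(const heap_bounds *h, LispObj n) {
  return looks_like_tra(n) && in_heap(h, n);
}
```

With bounds like these, a value such as 2941 still looks like a return
address by its tag, but fails the range check and would simply be skipped.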

> #1  0x0805c2f4 in mark_xp (xp=0xf6369cf4, node_regs_mask=138) at ../x86-gc.c:1347
> #2  0x080595be in mark_tcr_xframes (tcr=0xf636aaa0) at ../gc-common.c:617
> #3  0x0805a296 in gc (tcr=0xf5d5caa0, param=0) at ../gc-common.c:1197
> #4  0x08061383 in gc_from_tcr (tcr=0xf5d5caa0, param=0) at ../x86-exceptions.c:2621
> #5  0x080612c2 in gc_like_from_xp (xp=0xf5d5bcf4, fun=0x806134d <gc_from_tcr>, param=0)
>    at ../x86-exceptions.c:2578
> #6  0x080613d4 in gc_from_xp (xp=0xf5d5bcf4, param=0) at ../x86-exceptions.c:2633
> #7  0x0805e7b4 in handle_gc_trap (xp=0xf5d5bcf4, tcr=0xf5d5caa0) at ../x86-exceptions.c:221
> #8  0x0805f9f4 in handle_exception (signum=11, info=0xf5d5bc74, context=0xf5d5bcf4, tcr=0xf5d5caa0, old_valence=0)
>    at ../x86-exceptions.c:995
> #9  0x0806001b in signal_handler (signum=11, info=0xf5d5bc74, context=0xf5d5bcf4, tcr=0xf5d5caa0, old_valence=0)
>    at ../x86-exceptions.c:1242
> #10 <signal handler called>
> Backtrace stopped: previous frame inner to this frame (corrupt stack?)

We bounce back and forth between "lisp only" and "C only" stacks, and
can't easily walk back past a transition point.  In this case, lisp code
invoked the GC (or something that invokes the GC) via an exception.

>
> The mark_xp looks like it was doing
>
> 	if (node_regs_mask & (1<<3)) mark_root(regs[REG_EDX]);
>
> but, with optimization, this might be a little misleading.

Yes; there's a bitmap (in node_regs_mask at this point) that indicates
which registers contain tagged lisp objects ("nodes") and which don't.
The bitmap is saying that the EDX register is a node at this point
in time, but it seems to be a small unboxed value.

(EDX is usually a node, but it's often "stolen" temporarily and used
to hold non-node values.  rme came up with a scheme that uses the
x86 direction flag to indicate when EDX has been stolen in this 
way, and it looks like mark_xp() checks for that before looking
at the bit in node_regs_mask.  So, this is either a bug, or we're
hoping that the small immediate value in EDX is "safe" because
it's clearly outside of the heap, and (if so) we should look before
leaping when trying to map from a return address to a function.)
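
To make the order of those two checks concrete, here is a small sketch
rather than the real mark_xp() code.  The function name is hypothetical;
bit 3 for EDX is taken from the line quoted above; the direction flag
really is bit 10 of EFLAGS, but "DF set means EDX has been stolen" is my
assumption about the polarity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define EFLAGS_DF    (1u << 10) /* x86 direction flag: bit 10 of EFLAGS */
#define EDX_NODE_BIT (1u << 3)  /* bit tested in the quoted mark_xp() line */

/* Returns true if EDX should be treated as a tagged lisp object (a node).
   Assumption: a set direction flag means EDX is temporarily holding a
   non-node value, so the bitmap is not consulted at all in that case. */
static bool edx_holds_node(uint32_t node_regs_mask, uint32_t eflags) {
  if (eflags & EFLAGS_DF) {
    return false;
  }
  return (node_regs_mask & EDX_NODE_BIT) != 0;
}
```

With node_regs_mask=138 (bits 1, 3 and 7 set) and the direction flag
clear, this reports EDX as a node, which matches what the backtrace
suggests the GC believed.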

We have a ticket in Trac that says "if we can guarantee that the
lowest address in the lisp heap or any stack is above some address,
we can be more casual about maintaining the node_regs_mask at runtime."
We should really fix the marker behavior before doing that; I'm
not sure if the code that uses EDX this way is doing so intentionally,
but I think that we want to be able to do that without setting/clearing
bits.

>
> regs[REG_EIP] is 0x1400a767, but I'm not quite sure how to figure out
> what that is inside of.  What's the easiest way of figuring out what
> code that is inside of?

regs[REG_EDI] should contain the function; subtracting the
value of regs[REG_EDI] from that of regs[REG_EIP] should show the relative
offset of the program counter within that function.  Doing:

(gdb) call print_lisp_object(<value_of_regs[REG_EDI]>)

will try to print it using the kernel debugger's imperfect (but
sometimes useful) lisp-object printer.  ("call" in gdb sometimes gets
confused but often works.)

If you can identify the function, the relative offsets that the
disassembler prints may be able to help us identify where within the
function in EDI the EIP is pointing.
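
The subtraction itself is trivial; as a sketch (pc_offset is a
hypothetical helper, and the EDI value used in the note below is made up
for illustration, since the real one isn't shown in this thread):

```c
#include <assert.h>
#include <stdint.h>

/* Offset of the program counter relative to the (tagged) function object
   in EDI.  Both arguments are raw register values from the saved context. */
static uint32_t pc_offset(uint32_t eip, uint32_t edi) {
  return eip - edi;
}
```

For example, with EIP = 0x1400a767 and a made-up EDI of 0x1400a700, the
offset to look for in the disassembly would be 0x67.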

>
> So, does it look like it caught a place where EDX contained a non-lisp
> object, but the regs mask indicated it did?

Exactly.

>
>> If this is repeatable, it's really good news.
>
> Reasonably repeatable.  It happens about one out of 10 times I run the
> test.  Other times, it fails with valid but incorrect objects getting
> caught by something (usually seems to be arrays turning into conses).

Well, two repeatable symptoms is a lot better than a large number of random
ones.  However it got there, it's also much better that the troublesome
EDX value seems to be a "small" immediate object (and not a stale return
address from a previous GC.)

>
> David
>
>