[Openmcl-devel] Another linux86-32: signed doubleword parameters.

Gary Byers gb at clozure.com
Wed Oct 15 00:08:20 PDT 2008

On Tue, 14 Oct 2008, David Brown wrote:

> On Tue, Oct 14, 2008 at 09:41:17AM -0600, Gary Byers wrote:
>> There are probably lots of scenarios that -can- lead to that sort of thing.
>> One canonical example involves unsafe code storing outside the bounds
>> of an object.  Suppose that memory contains
> Ok, a few things that make me think this isn't a problem with unsafe
> code:
>  - The code works fine on ccl64, as well as sbcl.

Note that it could very easily be a problem with unsafe code in CCL
itself.  (The x86-32 port is still pretty new.)  If it sounded like
I was saying "well, it must be your bug", I'm sorry: I don't know what
the bug (or bugs) is (or are.)

Suppose that there's another case similar to what we found yesterday,
where the GC doesn't follow a pointer to a lisp object correctly.
Suppose that memory looks like:

|object a | object b ...........| object c ............| object d|

and some register or stack location X is pointing at object b - a
tagged pointer points a few bytes into the object - but the GC doesn't
see that for some reason and concludes that object b is garbage;
object a is in fact garbage.  It then compacts the live objects (or
those that it thinks are live), and memory winds up looking like:

| object c ............| object d|

The register or stack location X is pointing at the same address, but
it's now pointing into the middle of some object (rather than at a header
which describes the object's type and size.)  The next time that the GC
runs, it might see X and (mis)interpret the data in the middle of that
object as a header (and conclude that it's looking at a vector of 
several million DOUBLE-FLOATS that for some strange reason extends into
unmapped memory, or ... well, the possibilities are endless.)  This
can create secondary problems, and if we haven't crashed after a few
GCs, memory can have gotten so scrambled by the time we do that it's
virtually impossible to find the original problem.
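
To make that concrete, here's a toy sketch in C.  It is not CCL's real
heap layout or GC: the "header" is just an element count, X is an array
index rather than a tagged pointer, and every name in it is made up for
illustration.  It builds objects a, b, c and d, "compacts" as if only c
and d were live, and then reads a header through the stale X.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy heap: each object is one header word (its element count)
   followed by that many data words.  NOT CCL's real object layout. */
#define HEAP_WORDS 16
static uint32_t heap[HEAP_WORDS];

/* Write an object at word index `at`; return the index just past it. */
static size_t put_object(size_t at, uint32_t nwords, uint32_t fill) {
    assert(at + 1 + nwords <= HEAP_WORDS);
    heap[at] = nwords;                      /* header */
    for (uint32_t i = 0; i < nwords; i++)
        heap[at + 1 + i] = fill;            /* data words */
    return at + 1 + nwords;
}

/* Slide the object at `from` (header + data) down to `to`. */
static size_t slide_down(size_t to, size_t from) {
    size_t total = 1 + (size_t)heap[from];
    memmove(&heap[to], &heap[from], total * sizeof heap[0]);
    return to + total;
}

/* Returns the word that a later GC would read as "object b's header"
   through the stale root X. */
uint32_t stale_root_header(void) {
    memset(heap, 0, sizeof heap);
    size_t a = 0;
    size_t b = put_object(a, 1, 0xAAAAAAAAu);   /* object a */
    size_t c = put_object(b, 2, 0xBBBBBBBBu);   /* object b */
    size_t d = put_object(c, 3, 0x00FFFFFFu);   /* object c */
    put_object(d, 1, 0xDDDDDDDDu);              /* object d */

    size_t x = b;   /* root X refers to object b; the "GC" misses it */

    /* Compact as if only c and d were live. */
    size_t next = slide_down(0, c);
    slide_down(next, d);

    /* X was never updated: it now indexes a data word inside c. */
    return heap[x];
}

/* Would an object at `at` with this element count run off the heap? */
int claims_to_leave_heap(size_t at, uint32_t nwords) {
    return at + 1 + (size_t)nwords > HEAP_WORDS;
}
```

Here the stale X lands on one of object c's data words, so the "header"
read through it claims millions of elements in a 16-word heap.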

The GC checks usually do a good job of finding bad pointers and of
finding early evidence of corruption.  I said yesterday that if your
program passes those checks, it probably means that the GC is no longer
a prime suspect.  There are at least two (related) problems with that:

1) the integrity-checking code (mostly check_all_areas() in x86-gc.c)
walks all pointer-bearing memory areas and checks all of the pointers
it finds there.  It doesn't walk the registers in the exception frames
of all threads; if any of those registers contain bad pointers, we
won't find out about that until it's done some damage.

2) on x8632, there are two senses in which a register pointer can be
"bad" - it can claim to be a node but not look enough like a valid
lisp object reference to pass the consistency checks, or it can claim
to be an immediate object but look so much like a valid, consistent
lisp object that it's pretty suspicious that it's claiming not to be.
(I think that we'd get a lot of false positives because of the
way that registers are marked as being immediate, but we might be
able to conditionally compile that a little differently to facilitate

So, at the moment I'd want to revise what I claimed yesterday and
say that passing the GC integrity checks would only divert suspicion
from the GC if those checks handled these two cases as well.  (1) is
fairly easy; (2) is a little heuristic and needs some sort of "build
everything for debugging" mode in order to be useful.  I need to think
about this a bit more; I think that that sort of mode might be generally
useful, but don't know how simple it is to implement.
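
As a rough sketch of what checking saved registers for cases (1) and (2)
might look like (this is not code from x86-gc.c; the register snapshot
structure, the node mask, the 2-bit tag test and all of the names are
assumptions made up for illustration, and CCL's real tagging is more
involved):

```c
#include <stddef.h>
#include <stdint.h>

#define NREGS    8
#define TAG_MASK 0x3u   /* hypothetical: nonzero low bits => heap reference */

/* Hypothetical snapshot of one thread's registers at an exception.
   Bit i of node_mask set => register i claims to hold a lisp node. */
typedef struct {
    uintptr_t reg[NREGS];
    uint8_t   node_mask;
} saved_regs;

typedef enum {
    REG_OK,
    REG_BAD_NODE,             /* claims to be a node, fails the check    */
    REG_SUSPICIOUS_IMMEDIATE  /* claims to be immediate, looks like a ref */
} reg_verdict;

/* Classify one register value against the heap bounds [lo, hi). */
reg_verdict check_reg(uintptr_t v, int is_node, uintptr_t lo, uintptr_t hi) {
    int tagged  = (v & TAG_MASK) != 0;
    int in_heap = (v >= lo && v < hi);
    if (is_node)
        return (tagged && !in_heap) ? REG_BAD_NODE : REG_OK;
    /* Heuristic only: an immediate can legitimately look like this. */
    return (tagged && in_heap) ? REG_SUSPICIOUS_IMMEDIATE : REG_OK;
}

/* Count the registers in one snapshot that are not REG_OK. */
int count_flagged(const saved_regs *r, uintptr_t lo, uintptr_t hi) {
    int flagged = 0;
    for (int i = 0; i < NREGS; i++) {
        int is_node = (r->node_mask >> i) & 1;
        if (check_reg(r->reg[i], is_node, lo, hi) != REG_OK)
            flagged++;
    }
    return flagged;
}
```

The second verdict is the heuristic one: it can only raise suspicion,
since plain integer data in an immediate register can happen to look
like a heap address.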

>  - I've wrapped every pointer in my code with some code that copies
>    the block, puts a pattern around it, evaluates the body, checks
>    the pattern, and it has never caught this.
>  - I've also wrapped every with-pointer-to-ivector with something
>    that takes the address as an integer, and compares it with the
>    pointer again afterward, just to make sure the GC didn't move
>    things around.  Never saw this, but wasn't really expecting to.
> The only thing I haven't done is try running this code on another
> 32-bit system.  I'll need to build sbcl on the 32-bit machine to see
> if I can figure out what this might show.

If the bug was in your code, that might show something.
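
For reference, the guard-pattern wrapper described in the quoted text
(copy the block, put a pattern around it, run the body, check the
pattern) could look roughly like this in C.  This is only a sketch of
the idea: the original wrapper is Lisp code that isn't shown in this
thread, and the names, guard length and guard byte here are made up.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define GUARD_LEN  8
#define GUARD_BYTE 0xA5

typedef void (*body_fn)(unsigned char *buf, size_t len);

/* Copy `src` into a fresh buffer with guard bytes on both sides, run
   `body` on the copy, then check the guards.
   Returns 0 if the guards are intact, 1 if clobbered, -1 on malloc failure. */
int with_guarded_copy(const unsigned char *src, size_t len, body_fn body) {
    unsigned char *raw = malloc(len + 2 * GUARD_LEN);
    if (raw == NULL)
        return -1;

    memset(raw, GUARD_BYTE, GUARD_LEN);                    /* leading guard  */
    memcpy(raw + GUARD_LEN, src, len);                     /* the block      */
    memset(raw + GUARD_LEN + len, GUARD_BYTE, GUARD_LEN);  /* trailing guard */

    body(raw + GUARD_LEN, len);

    int clobbered = 0;
    for (size_t i = 0; i < GUARD_LEN; i++) {
        if (raw[i] != GUARD_BYTE || raw[GUARD_LEN + len + i] != GUARD_BYTE)
            clobbered = 1;
    }
    free(raw);
    return clobbered;
}

/* Example bodies: one stays in bounds, one writes one byte too far. */
void fill_in_bounds(unsigned char *buf, size_t len) {
    memset(buf, 0, len);
}

void write_one_past_end(unsigned char *buf, size_t len) {
    buf[len] = 0;   /* lands in the trailing guard */
}
```

A check like this only catches writes that land in the guard bytes; it
says nothing about a stale pointer elsewhere in the heap.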
