[Openmcl-devel] Another linux86-32: signed doubleword parameters.
Gary Byers
gb at clozure.com
Wed Oct 15 00:08:20 PDT 2008
On Tue, 14 Oct 2008, David Brown wrote:
> On Tue, Oct 14, 2008 at 09:41:17AM -0600, Gary Byers wrote:
>
>> There are probably lots of scenarios that -can- lead to that sort of thing.
>> One canonical example involves unsafe code storing outside the bounds
>> of an object. Suppose that memory contains
>
> Ok, a few things that make me think this isn't a problem with unsafe
> code:
>
> - The code works fine on ccl64, as well as sbcl.
Note that it could very easily be a problem with unsafe code in CCL
itself. (The x86-32 port is still pretty new.) If it sounded like
I was saying "well, it must be your bug", I'm sorry: I don't know what
the bug (or bugs) is (or are.)
Suppose that there's another case similar to what we found yesterday,
where the GC doesn't follow a pointer to a lisp object correctly.
Suppose that memory looks like:
|object a | object b ...........| object c ............| object d|
            ^
            |
            X
and some register or stack location X is pointing at object b (a
tagged pointer points a few bytes into the object), but the GC doesn't
see that for some reason and concludes that object b is garbage.
Object a is in fact garbage. It then compacts the live objects (or
those that it thinks are live), and memory winds up looking like:
| object c ............| object d|
            ^
            |
            X
The register or stack location X is pointing at the same address, but
it's now pointing into the middle of some object (rather than at a header
which describes the object's type and size.) The next time that the GC
runs, it might see X and (mis)interpret the data in the middle of that
object as a header (and conclude that it's looking at a vector of
several million DOUBLE-FLOATS that for some strange reason extends into
unmapped memory, or ... well, the possibilities are endless.) This
can create secondary problems, and if we haven't crashed after a few
GCs, memory can have gotten so scrambled by the time we do that it's
virtually impossible to find the original problem.
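
As a rough illustration of the scenario above, here is a toy model in Python. It is not CCL's GC: the heap layout (a header word holding the payload length, followed by the payload), the function names, and the object contents are all assumptions made up for the sketch, and the tag bits of a real lisp pointer are ignored.

```python
# Toy heap: each object is one header word (its payload length in words)
# followed by that many payload words. Everything here is hypothetical.

def make_heap(objects):
    """Lay objects out contiguously; return (heap, {name: header address})."""
    heap, addr = [], {}
    for name, payload in objects:
        addr[name] = len(heap)
        heap.append(len(payload))      # header word
        heap.extend(payload)           # payload words
    return heap, addr

def compact(heap, addr, live):
    """Slide the objects believed to be live down to address 0, in order."""
    new_heap, new_addr = [], {}
    for name in sorted(live, key=addr.get):
        start = addr[name]
        size = heap[start]
        new_addr[name] = len(new_heap)
        new_heap.extend(heap[start:start + 1 + size])
    return new_heap, new_addr

def read_header(heap, pointer):
    """Interpret the word at `pointer` as a header, as a later GC would."""
    return heap[pointer]

heap, addr = make_heap([
    ("a", [1, 1]),
    ("b", [2, 2, 2]),
    ("c", [7, 7, 4_000_000, 7]),
    ("d", [9]),
])

x = addr["b"]                          # root the GC fails to notice
size_before = read_header(heap, x)     # a sane header: 3 payload words

# The GC misses X, so it treats b as garbage along with a.
new_heap, new_addr = compact(heap, addr, live={"c", "d"})

# X still holds the old address, which now lies inside object c's payload.
bogus_size = read_header(new_heap, x)
runs_off_heap = x + 1 + bogus_size > len(new_heap)
```

After the compaction, `x` is unchanged but the word it addresses is ordinary payload data from object c. Read as a header, that word describes an object far larger than the heap itself, which is the kind of misinterpretation described above.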
The GC checks usually do a good job of finding bad pointers and of
finding early evidence of corruption. I said yesterday that if your
program passes those checks, it'd probably mean that the GC is no longer
a prime suspect. There are at least two (related) problems with that
assertion:
1) the integrity-checking code (mostly check_all_areas() in x86-gc.c)
walks all pointer-bearing memory areas and checks all of the pointers
it finds there. It doesn't walk the registers in the exception frames
of all threads; if any of those registers contain bad pointers, we
won't find out about that until it's done some damage.
2) on x8632, there are two senses in which a register pointer can be
"bad" - it can claim to be a node but not look enough like a valid
lisp object reference to pass the consistency checks, or it can claim
to be an immediate object but look so much like a valid, consistent
lisp object that it's pretty suspicious that it's claiming not to be.
(I think that we'd get a lot of false positives because of the
way that registers are marked as being immediate, but we might be
able to conditionally compile that a little differently to facilitate
debugging.)
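
The two gaps can be sketched with a small Python model. This is only an illustration, not the code in x86-gc.c: `check_areas`, `check_registers`, the set of valid header addresses, and the (value, marked_as_node) register representation are all hypothetical.

```python
def check_areas(areas, valid_headers):
    """Return bad pointers found in pointer-bearing memory areas only."""
    return [p for area in areas for p in area if p not in valid_headers]

def check_registers(registers, valid_headers):
    """Classify saved registers; each entry is (value, marked_as_node).

    Returns (bad_nodes, suspicious_immediates):
      bad_nodes: marked as nodes but not valid object references
      suspicious_immediates: marked as immediates but look exactly
        like valid object references (heuristic, may be a coincidence)
    """
    bad_nodes, suspicious_immediates = [], []
    for value, marked_as_node in registers:
        if marked_as_node and value not in valid_headers:
            bad_nodes.append(value)
        elif not marked_as_node and value in valid_headers:
            suspicious_immediates.append(value)
    return bad_nodes, suspicious_immediates

valid_headers = {0, 5}                 # addresses where objects really start
areas = [[0, 5], [5]]                  # every pointer in memory is fine
registers = [(3, True),                # claims to be a node, points mid-object
             (5, False),               # claims to be immediate, looks like a node
             (42, False)]              # an ordinary immediate

area_errors = check_areas(areas, valid_headers)
bad_nodes, suspicious = check_registers(registers, valid_headers)
```

Here the area walk reports nothing wrong, while the register walk flags one node that fails the consistency check and one "immediate" that happens to match a live object's address. The second category is the heuristic one: an integer that coincidentally equals a valid address would be flagged too, which is where the false positives would come from.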
So, at the moment I'd want to revise what I claimed yesterday and
say that passing the GC integrity checks would only divert suspicion
from the GC if those checks handled these 2 cases as well. (1) is
fairly easy; (2) is a little heuristic and needs some sort of "build
everything for debugging" mode in order to be useful. I need to think
about this a bit more; I think that that sort of mode might be generally
useful, but don't know how simple it is to implement.
>
> - I've wrapped every pointer in my code with some code that copies
> the block, puts a pattern around it, evaluates the body, checks
> the pattern, and it has never caught this.
>
> - I've also wrapped every with-pointer-to-ivector with something
> that takes the address as an integer, and compares it with the
> pointer again afterward, just to make sure the GC didn't move
> things around. Never saw this, but wasn't really expecting to.
>
> The only thing I haven't done is try running this code on another
> 32-bit system. I'll need to build sbcl on the 32-bit machine to see
> if I can figure out what this might show.
>
If the bug was in your code, that might show something.